1. Introduction: The Death of the “Simple Search”
In Part 1, we defined the blueprint for a production-grade Agentic RAG system. We moved away from passive retrieval toward a “reasoning-first” architecture. But even the best reasoning engine fails if the data fed into it is garbage.
When a business user asks, “What’s our policy on remote work?”, they aren’t just looking for a document. They are asking a cluster of nested questions:
- Definition: What counts as remote?
- Eligibility: Does this apply to my role?
- Process: How do I apply?
- History: When was this last updated?
Traditional RAG takes that entire query, turns it into a single vector, and prays that the semantic math lands near the right paragraph. In production, “praying” isn’t a strategy. This article covers Agentic Query Expansion—the process of breaking down “messy” human intent into a structured execution plan.
2. Expansion Strategies: Beyond Vector Similarity
We use three primary strategies to ensure the retriever actually finds what the user is looking for.
2.1 Sub-Question Decomposition (The ReAct Pattern)
Instead of one broad search, the agent generates multiple precise sub-queries. This requires a Structured Output from your LLM to ensure the orchestrator can parse the plan.
User Query: “What’s our policy on remote work?”
Expansion Agent Logic (Python):
from typing import List, Literal
from pydantic import BaseModel

class SubQuery(BaseModel):
    query: str
    intent: Literal["definition", "eligibility", "process", "metadata"]

async def expand_query(user_input: str) -> List[SubQuery]:
    # Prompt the LLM to return a JSON array of SubQuery objects,
    # then validate each item against the SubQuery schema
    ...
2.2 Hypothetical Document Embeddings (HyDE)
Sometimes vocabulary doesn’t match. A user asks about “desk money,” but the manual calls it a “home office stipend.” HyDE has the LLM generate a fake answer first. We embed that hypothetical answer to search the database. Since the LLM uses professional terminology, the vector sits closer to the actual policy than the user’s original slang.
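As a minimal sketch of that flow: the `llm_complete` and `embed` functions below are stand-in stubs, not a real provider SDK; in production they would call your LLM and embedding clients.

```python
import asyncio
from typing import List

async def llm_complete(prompt: str) -> str:
    """Stub LLM call; swap in your provider's async client."""
    return "Eligible employees receive a monthly home office stipend via HR."

async def embed(text: str) -> List[float]:
    """Stub embedding; swap in a real embedding model."""
    return [float(len(word)) for word in text.split()]

async def hyde_query_vector(user_input: str) -> List[float]:
    # 1. Generate a hypothetical answer written in the corpus's vocabulary.
    hypothetical = await llm_complete(
        f"Write a short policy paragraph that answers: {user_input}"
    )
    # 2. Embed the hypothetical answer, NOT the raw user query,
    #    and use this vector for the similarity search.
    return await embed(hypothetical)

vector = asyncio.run(hyde_query_vector("What's the deal with desk money?"))
```

The key move is in step 2: the search vector comes from the hypothetical answer, so "desk money" lands near "home office stipend" in embedding space.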
2.3 Entity-Relationship Expansion
Many business queries pivot around entities: roles (“engineers”), locations (“NYC”), or benefits (“401k”).
By extracting these entities, we transform keyword search into relationship-based discovery:
Engineer -> (part_of) -> Engineering Dept -> (applies_to) -> Remote Policy.
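A dependency-free sketch of the extraction step. The hand-rolled `ENTITY_LEXICON` dictionary is an illustrative assumption; a production system would source these terms from the knowledge graph's node labels or an NER model.

```python
import re
from dataclasses import dataclass
from typing import List

# Illustrative lexicon; in production, derive this from your
# knowledge graph's node labels rather than hard-coding it.
ENTITY_LEXICON = {
    "role": ["engineer", "manager", "contractor"],
    "location": ["nyc", "london", "remote"],
    "benefit": ["401k", "stipend", "pto"],
}

@dataclass
class ExtractedEntity:
    text: str
    type: str

def extract_entities(query: str) -> List[ExtractedEntity]:
    """Dictionary-based extraction; swap in an NER model or LLM for recall."""
    tokens = re.findall(r"\w+", query.lower())
    found = []
    for entity_type, terms in ENTITY_LEXICON.items():
        for term in terms:
            # startswith() gives us a crude plural match ("engineers" -> "engineer")
            if any(tok.startswith(term) for tok in tokens):
                found.append(ExtractedEntity(text=term, type=entity_type))
    return found
```

Once extracted, these typed entities become graph traversal entry points (`Engineer -> Engineering Dept -> Remote Policy`) instead of mere keywords.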
3. Intelligent Routing: Choosing the Right Tool
Once we have sub-questions, the Router Agent decides where to send them. In production, every routing decision should include a fallback.
| Question Type | Target Data Source | Reasoning |
| --- | --- | --- |
| “What is…?” | Vector DB | Best for semantic definitions. |
| “When did…?” | SQL / Metadata | Dates and versions are structured. |
| “Who reports to…?” | Knowledge Graph | Relationships are explicit, not semantic. |
| “How many…?” | SQL / Analytic | Vector search cannot count or aggregate. |
The Engineering Reality: Use Few-Shot Prompting for your router. Providing 3-5 examples of “SQL vs. Vector” queries significantly reduces routing errors.
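One way to apply that advice is a prompt-plus-parser pair with a built-in fallback. The example Q/A pairs and the `vector | sql | graph` labels below are illustrative assumptions, not part of the series' codebase.

```python
from typing import Literal

Route = Literal["vector", "sql", "graph"]

# Few-shot examples for the router prompt; these pairs are
# illustrative, not drawn from a real deployment.
FEW_SHOT_EXAMPLES = """\
Q: What is the parental leave policy? -> vector
Q: When was the travel policy last revised? -> sql
Q: Who reports to the VP of Engineering? -> graph
Q: How many offices do we have in Europe? -> sql
"""

def build_router_prompt(sub_question: str) -> str:
    return (
        "Route each question to exactly one backend (vector | sql | graph):\n"
        f"{FEW_SHOT_EXAMPLES}"
        f"Q: {sub_question} -> "
    )

def parse_route(llm_output: str, fallback: Route = "vector") -> Route:
    """Every routing decision gets a fallback: unparseable LLM output
    degrades to semantic search instead of failing the whole query."""
    choice = llm_output.strip().lower()
    return choice if choice in ("vector", "sql", "graph") else fallback
```

The fallback in `parse_route` is the production safeguard: a hallucinated label costs you one suboptimal search, not a crashed request.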
4. The Orchestration Workflow (Parallel Execution)
The following diagram illustrates how the components interact in a parallel execution environment:
graph TD
A[User Query] --> B{Expansion Agent}
B -->|Decompose| C1[Sub-Q 1]
B -->|Decompose| C2[Sub-Q 2]
B -->|HyDE| C3[Sub-Q 3]
C1 --> D{Router}
C2 --> D
C3 --> D
D -->|Parallel| E1[(Vector DB)]
D -->|Parallel| E2[(SQL DB)]
D -->|Parallel| E3[(Knowledge Graph)]
E1 --> F[Result Merger]
E2 --> F
E3 --> F
F --> G[Synthesis Agent]
G --> H[Final Answer with Citations]

Performance Note: Sequential execution is a latency killer. By running these searches in parallel, our total retrieval time is only as long as the slowest single database query (typically ~300ms), rather than the sum of all three (~700ms+).
# The Agentic Parallel Pattern
results = await asyncio.gather(
    vector_search(sub_q1),   # ~300ms
    sql_query(sub_q2),       # ~100ms
    kg_traversal(sub_q3),    # ~200ms
    return_exceptions=True,  # a failing backend yields an Exception in results instead of sinking the batch
)
# Total Latency: ~300ms (the max of the three, not the sum)
5. The Production Stack: Hybrid Similarity Search
In Part 1, we touched on the importance of hybrid search. In production, you cannot rely on dense vectors alone—they often struggle with acronyms or specific product IDs. We implement a Weighted Hybrid Search that combines pgvector (Dense) with PostgreSQL ts_rank (Sparse/Full-Text).
Implementation: The Hybrid Retriever
async def hybrid_similarity_search(
    self, query_embedding: List[float], query_text: str,
    limit: int = 10, dense_weight: float = 0.6, sparse_weight: float = 0.4
) -> List[Dict[str, Any]]:
    """Combines Dense (pgvector) and Sparse (Full-Text) search results."""
    # 1. Dense search: semantic similarity via pgvector's cosine-distance operator
    dense_query = "SELECT ..., 1 - (embedding_dense <=> %s::vector) AS dense_similarity FROM documents..."
    dense_results = await self.execute_query(dense_query, (query_embedding, ...))

    # 2. Sparse search: keyword matching via ts_rank
    sparse_query = "SELECT ..., ts_rank(content_tsv, plainto_tsquery('english', %s)) AS sparse_similarity FROM documents..."
    sparse_results = await self.execute_query(sparse_query, (query_text, ...))

    # 3. Weighted merge: a linear combination of the two scores
    #    (Reciprocal Rank Fusion is a common alternative)
    merged = {}
    for r in dense_results:
        merged[r['id']] = {**r, 'hybrid_score': r['dense_similarity'] * dense_weight}
    for r in sparse_results:
        if r['id'] in merged:
            merged[r['id']]['hybrid_score'] += r['sparse_similarity'] * sparse_weight
        else:
            merged[r['id']] = {**r, 'hybrid_score': r['sparse_similarity'] * sparse_weight}
    return sorted(merged.values(), key=lambda x: x['hybrid_score'], reverse=True)[:limit]
By weighting these results (e.g., 60% Vector, 40% Keyword), the system becomes robust against both “vocabulary mismatch” (solved by vectors) and “exact term requirements” (solved by full-text).
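The merge above uses a linear combination of raw scores, which assumes the dense and sparse scores live on comparable scales. Reciprocal Rank Fusion, the alternative the code comment mentions, avoids that assumption by scoring on ranks alone. The `rrf_merge` helper below and its `k = 60` default are an illustrative sketch, not part of the series' codebase.

```python
from typing import Any, Dict, List

def rrf_merge(
    dense_results: List[Dict[str, Any]],
    sparse_results: List[Dict[str, Any]],
    k: int = 60,
    limit: int = 10,
) -> List[Dict[str, Any]]:
    """Reciprocal Rank Fusion: score = sum over lists of 1 / (k + rank).
    Rank-based, so no score normalization between backends is needed."""
    scores: Dict[Any, Dict[str, Any]] = {}
    for results in (dense_results, sparse_results):
        for rank, row in enumerate(results, start=1):
            entry = scores.setdefault(row["id"], {**row, "rrf_score": 0.0})
            entry["rrf_score"] += 1.0 / (k + rank)
    return sorted(scores.values(), key=lambda x: x["rrf_score"], reverse=True)[:limit]
```

Documents that appear in both result lists accumulate score from each, so agreement between the dense and sparse retrievers naturally floats to the top.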
6. Enrichment: Making Data “Searchable”
High-quality retrieval doesn’t happen at query time; it starts at Ingestion. As we established in Part 1 when discussing Multi-Vector Indexing, we don’t just “chunk and embed.” We enrich each document segment with high-fidelity metadata that our agents can later reason over.
- Summaries (Linked to Synthesis): We generate 50-token summaries for every chunk. As we discussed in our architecture overview, this allows the Synthesis Agent to “skim” the context of 20 documents to decide which 5 to read in depth. This turns a massive context window into a targeted one, making final response generation up to 5x faster.
- Semantic Roles (The Router’s Map): By tagging a chunk as a “Policy Definition,” “Procedure,” or “Exception,” we give the Router (introduced in Section 3) a map. If a user asks “How do I…?”, the system prioritizes chunks with a “Procedure” role.
- Entity Extraction (Powering Hybrid Search): This is where the Hybrid Similarity Search from Section 5 gets its power. By extracting roles, locations, and dates upfront during the ingestion phase we designed in Part 1, we enable the precise SQL filtering (`WHERE metadata->>'role' = 'Engineer'`) that makes the system deterministic where vector search is merely probabilistic.
The Production Impact: Enrichment is the bridge between the raw data storage we built in the first article and the agentic reasoning we are implementing now. It adds a few seconds to your ingestion pipeline, but it saves several seconds of “LLM thinking time” for every single user query.
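The three enrichment artifacts above can be captured in a single chunk record. The `EnrichedChunk` shape and the keyword heuristic inside `enrich` are illustrative stand-ins; a real pipeline would make LLM calls for the summary and the semantic role.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EnrichedChunk:
    """Illustrative shape of a chunk after ingestion-time enrichment."""
    content: str
    summary: str                 # ~50-token skim text for the Synthesis Agent
    semantic_role: str           # "policy_definition" | "procedure" | "exception"
    entities: Dict[str, List[str]] = field(default_factory=dict)  # roles, locations, dates

def enrich(chunk_text: str) -> EnrichedChunk:
    """Sketch of the enrichment step; the truncation and keyword check
    stand in for the LLM calls a real pipeline would make here."""
    return EnrichedChunk(
        content=chunk_text,
        summary=chunk_text[:120],  # stand-in for an LLM-written summary
        semantic_role=(
            "procedure" if "how to" in chunk_text.lower() else "policy_definition"
        ),
        entities={},
    )
```

Everything in this record is computed once at ingestion, which is exactly why the Router and Synthesis Agent can skip that work at query time.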
7. Conclusion: Retrieval is a Reasoning Task
In the “naive RAG” era, retrieval was a math problem. In the Agentic RAG era, retrieval is a reasoning problem. By breaking queries down and routing them intelligently, we create a system that understands intent rather than just matching keywords.
In Part 3, we will dive into the ‘Precision Filter’: How to use Cross-Encoders and Reranking to eliminate retrieval noise and ensure only the most relevant context reaches the LLM.
Implementation: Let’s Bridge Your RAG Gap
Building a prototype is easy; hardening a production-grade RAG system that handles 1M+ complex PDFs without “silent failures” is a multi-month engineering lift.
The Agentic RAG Blueprint described in this series isn’t just a conceptual framework—it is a proprietary, production-ready codebase developed to solve the most stubborn data extraction and retrieval challenges in regulated industries.
Why Partner With Us?
We don’t start from scratch. We deploy our audited reference architecture directly into your infrastructure, customized for your specific document types:
- Accelerated Deployment: Skip 6+ months of R&D with our pre-built Docling, Pydantic AI, and Langfuse integrations.
- Total Data Sovereignty: Our “Local-First” Docker stack ensures your sensitive data never leaves your firewall.
- Guaranteed Precision: We move beyond naive similarity search to hybrid, agent-enriched retrieval that matches human-level accuracy.
Schedule a Technical Strategy Session
If your current RAG implementation is struggling with complex layouts, losing context in chunks, or failing to scale on-premise, let’s talk.
We will walk you through a live demonstration of the blueprint using your own document samples and discuss how to integrate this architecture into your existing stack.
Book a RAG Strategy Consultation
Direct access to our lead architects. No sales fluff, just engineering.