1. Introduction: The Death of the “Simple Search”
In Part 1, we defined the blueprint for a production-grade Agentic RAG system. We moved away from passive retrieval toward a “reasoning-first” architecture. But even the best reasoning engine fails if the data fed into it is garbage.
When a business user asks, “What’s our policy on remote work?”, they aren’t just looking for a document. They are asking a cluster of nested questions:
- Definition: What counts as remote?
- Eligibility: Does this apply to my role?
- Process: How do I apply?
- History: When was this last updated?
Traditional RAG takes that entire query, turns it into a single vector, and prays that the semantic math lands near the right paragraph. In production, “praying” isn’t a strategy. This article covers Agentic Query Expansion—the process of breaking down “messy” human intent into a structured execution plan.
2. Expansion Strategies: Beyond Vector Similarity
We use three primary strategies to ensure the retriever actually finds what the user is looking for.
2.1 Sub-Question Decomposition (The ReAct Pattern)
Instead of one broad search, the agent generates multiple precise sub-queries. This requires a Structured Output from your LLM to ensure the orchestrator can parse the plan.
User Query: “What’s our policy on remote work?”
Expansion Agent Logic (Python):
from typing import List, Literal
from pydantic import BaseModel

class SubQuery(BaseModel):
    query: str
    intent: Literal["definition", "eligibility", "process", "metadata"]

async def expand_query(user_input: str) -> List[SubQuery]:
    # Prompt the LLM to return a JSON array of SubQuery objects,
    # then validate each item against the SubQuery schema
    ...
2.2 Hypothetical Document Embeddings (HyDE)
Sometimes vocabulary doesn’t match. A user asks about “desk money,” but the manual calls it a “home office stipend.” HyDE has the LLM generate a fake answer first. We embed that hypothetical answer to search the database. Since the LLM uses professional terminology, the vector sits closer to the actual policy than the user’s original slang.
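As a minimal sketch of that flow: the `llm_complete` and `embed` functions below are stand-in stubs, not a real provider SDK; in production they would call your LLM and embedding clients.

```python
import asyncio
from typing import List

async def llm_complete(prompt: str) -> str:
    """Stub LLM call; swap in your provider's async client."""
    return "Eligible employees receive a monthly home office stipend via HR."

async def embed(text: str) -> List[float]:
    """Stub embedding; swap in a real embedding model."""
    return [float(len(word)) for word in text.split()]

async def hyde_query_vector(user_input: str) -> List[float]:
    # 1. Generate a hypothetical answer written in the corpus's vocabulary.
    hypothetical = await llm_complete(
        f"Write a short policy paragraph that answers: {user_input}"
    )
    # 2. Embed the hypothetical answer, NOT the raw user query,
    #    and use this vector for the similarity search.
    return await embed(hypothetical)

vector = asyncio.run(hyde_query_vector("What's the deal with desk money?"))
```

The key move is in step 2: the search vector comes from the hypothetical answer, so "desk money" lands near "home office stipend" in embedding space.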
2.3 Entity-Relationship Expansion
Many business queries pivot around entities: roles (“engineers”), locations (“NYC”), or benefits (“401k”).
By extracting these entities, we transform keyword search into relationship-based discovery:
Engineer -> (part_of) -> Engineering Dept -> (applies_to) -> Remote Policy.
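A dependency-free sketch of the extraction step. The hand-rolled `ENTITY_LEXICON` dictionary is an illustrative assumption; a production system would source these terms from the knowledge graph's node labels or an NER model.

```python
import re
from dataclasses import dataclass
from typing import List

# Illustrative lexicon; in production, derive this from your
# knowledge graph's node labels rather than hard-coding it.
ENTITY_LEXICON = {
    "role": ["engineer", "manager", "contractor"],
    "location": ["nyc", "london", "remote"],
    "benefit": ["401k", "stipend", "pto"],
}

@dataclass
class ExtractedEntity:
    text: str
    type: str

def extract_entities(query: str) -> List[ExtractedEntity]:
    """Dictionary-based extraction; swap in an NER model or LLM for recall."""
    tokens = re.findall(r"\w+", query.lower())
    found = []
    for entity_type, terms in ENTITY_LEXICON.items():
        for term in terms:
            # startswith() gives us a crude plural match ("engineers" -> "engineer")
            if any(tok.startswith(term) for tok in tokens):
                found.append(ExtractedEntity(text=term, type=entity_type))
    return found
```

Once extracted, these typed entities become graph traversal entry points (`Engineer -> Engineering Dept -> Remote Policy`) instead of mere keywords.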
3. Intelligent Routing: Choosing the Right Tool
Once we have sub-questions, the Router Agent decides where to send them. In production, every routing decision should include a fallback.
| Question Type | Target Data Source | Reasoning |
| --- | --- | --- |
| “What is…?” | Vector DB | Best for semantic definitions. |
| “When did…?” | SQL / Metadata | Dates and versions are structured. |
| “Who reports to…?” | Knowledge Graph | Relationships are explicit, not semantic. |
| “How many…?” | SQL / Analytic | Vector search cannot count or aggregate. |
The Engineering Reality: Use Few-Shot Prompting for your router. Providing 3-5 examples of “SQL vs. Vector” queries significantly reduces routing errors.
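One way to apply that advice is a prompt-plus-parser pair with a built-in fallback. The example Q/A pairs and the `vector | sql | graph` labels below are illustrative assumptions, not part of the series' codebase.

```python
from typing import Literal

Route = Literal["vector", "sql", "graph"]

# Few-shot examples for the router prompt; these pairs are
# illustrative, not drawn from a real deployment.
FEW_SHOT_EXAMPLES = """\
Q: What is the parental leave policy? -> vector
Q: When was the travel policy last revised? -> sql
Q: Who reports to the VP of Engineering? -> graph
Q: How many offices do we have in Europe? -> sql
"""

def build_router_prompt(sub_question: str) -> str:
    return (
        "Route each question to exactly one backend (vector | sql | graph):\n"
        f"{FEW_SHOT_EXAMPLES}"
        f"Q: {sub_question} -> "
    )

def parse_route(llm_output: str, fallback: Route = "vector") -> Route:
    """Every routing decision gets a fallback: unparseable LLM output
    degrades to semantic search instead of failing the whole query."""
    choice = llm_output.strip().lower()
    return choice if choice in ("vector", "sql", "graph") else fallback
```

The fallback in `parse_route` is the production safeguard: a hallucinated label costs you one suboptimal search, not a crashed request.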
4. The Orchestration Workflow (Parallel Execution)
The following diagram illustrates how the components interact in a parallel execution environment:
graph TD
A[User Query] --> B{Expansion Agent}
B -->|Decompose| C1[Sub-Q 1]
B -->|Decompose| C2[Sub-Q 2]
B -->|HyDE| C3[Sub-Q 3]
C1 --> D{Router}
C2 --> D
C3 --> D
D -->|Parallel| E1[(Vector DB)]
D -->|Parallel| E2[(SQL DB)]
D -->|Parallel| E3[(Knowledge Graph)]
E1 --> F[Result Merger]
E2 --> F
E3 --> F
F --> G[Synthesis Agent]
G --> H[Final Answer with Citations]

Performance Note: Sequential execution is a latency killer. By running these searches in parallel, our total retrieval time is only as long as the slowest single database query (typically ~300ms), rather than the sum of all three (~700ms+).
# The Agentic Parallel Pattern
results = await asyncio.gather(
    vector_search(sub_q1),   # ~300ms
    sql_query(sub_q2),       # ~100ms
    kg_traversal(sub_q3),    # ~200ms
    return_exceptions=True,  # a failing backend yields an Exception in results instead of sinking the batch
)
# Total Latency: ~300ms (the max of the three, not the sum)
5. The Production Stack: Hybrid Similarity Search
In Part 1, we touched on the importance of hybrid search. In production, you cannot rely on dense vectors alone—they often struggle with acronyms or specific product IDs. We implement a Weighted Hybrid Search that combines pgvector (Dense) with PostgreSQL ts_rank (Sparse/Full-Text).
Implementation: The Hybrid Retriever
async def hybrid_similarity_search(
    self, query_embedding: List[float], query_text: str,
    limit: int = 10, dense_weight: float = 0.6, sparse_weight: float = 0.4
) -> List[Dict[str, Any]]:
    """Combines Dense (pgvector) and Sparse (Full-Text) search results."""
    # 1. Dense search: semantic similarity via pgvector's cosine-distance operator
    dense_query = "SELECT ..., 1 - (embedding_dense <=> %s::vector) AS dense_similarity FROM documents..."
    dense_results = await self.execute_query(dense_query, (query_embedding, ...))

    # 2. Sparse search: keyword matching via ts_rank
    sparse_query = "SELECT ..., ts_rank(content_tsv, plainto_tsquery('english', %s)) AS sparse_similarity FROM documents..."
    sparse_results = await self.execute_query(sparse_query, (query_text, ...))

    # 3. Weighted merge: a linear combination of the two scores
    #    (Reciprocal Rank Fusion is a common alternative)
    merged = {}
    for r in dense_results:
        merged[r['id']] = {**r, 'hybrid_score': r['dense_similarity'] * dense_weight}
    for r in sparse_results:
        if r['id'] in merged:
            merged[r['id']]['hybrid_score'] += r['sparse_similarity'] * sparse_weight
        else:
            merged[r['id']] = {**r, 'hybrid_score': r['sparse_similarity'] * sparse_weight}
    return sorted(merged.values(), key=lambda x: x['hybrid_score'], reverse=True)[:limit]
By weighting these results (e.g., 60% Vector, 40% Keyword), the system becomes robust against both “vocabulary mismatch” (solved by vectors) and “exact term requirements” (solved by full-text).
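The merge above uses a linear combination of raw scores, which assumes the dense and sparse scores live on comparable scales. Reciprocal Rank Fusion, the alternative the code comment mentions, avoids that assumption by scoring on ranks alone. The `rrf_merge` helper below and its `k = 60` default are an illustrative sketch, not part of the series' codebase.

```python
from typing import Any, Dict, List

def rrf_merge(
    dense_results: List[Dict[str, Any]],
    sparse_results: List[Dict[str, Any]],
    k: int = 60,
    limit: int = 10,
) -> List[Dict[str, Any]]:
    """Reciprocal Rank Fusion: score = sum over lists of 1 / (k + rank).
    Rank-based, so no score normalization between backends is needed."""
    scores: Dict[Any, Dict[str, Any]] = {}
    for results in (dense_results, sparse_results):
        for rank, row in enumerate(results, start=1):
            entry = scores.setdefault(row["id"], {**row, "rrf_score": 0.0})
            entry["rrf_score"] += 1.0 / (k + rank)
    return sorted(scores.values(), key=lambda x: x["rrf_score"], reverse=True)[:limit]
```

Documents that appear in both result lists accumulate score from each, so agreement between the dense and sparse retrievers naturally floats to the top.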
6. Enrichment: Making Data “Searchable”
High-quality retrieval doesn’t happen at query time; it starts at Ingestion. As we established in Part 1 when discussing Multi-Vector Indexing, we don’t just “chunk and embed.” We enrich each document segment with high-fidelity metadata that our agents can later reason over.
- Summaries (Linked to Synthesis): We generate 50-token summaries for every chunk. As we discussed in our architecture overview, this allows the Synthesis Agent to “skim” the context of 20 documents to decide which 5 to read in depth. This turns a massive context window into a targeted one, making final response generation up to 5x faster.
- Semantic Roles (The Router’s Map): By tagging a chunk as a “Policy Definition,” “Procedure,” or “Exception,” we give the Router (introduced in Section 3) a map. If a user asks “How do I…?”, the system prioritizes chunks with a “Procedure” role.
- Entity Extraction (Powering Hybrid Search): This is where the Hybrid Similarity Search from Section 5 gets its power. By extracting roles, locations, and dates upfront during the ingestion phase we designed in Part 1, we enable the precise SQL filtering (`WHERE metadata->>'role' = 'Engineer'`) that makes the system deterministic where vector search is merely probabilistic.
The Production Impact: Enrichment is the bridge between the raw data storage we built in the first article and the agentic reasoning we are implementing now. It adds a few seconds to your ingestion pipeline, but it saves several seconds of “LLM thinking time” for every single user query.
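The three enrichment artifacts above can be captured in a single chunk record. The `EnrichedChunk` shape and the keyword heuristic inside `enrich` are illustrative stand-ins; a real pipeline would make LLM calls for the summary and the semantic role.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EnrichedChunk:
    """Illustrative shape of a chunk after ingestion-time enrichment."""
    content: str
    summary: str                 # ~50-token skim text for the Synthesis Agent
    semantic_role: str           # "policy_definition" | "procedure" | "exception"
    entities: Dict[str, List[str]] = field(default_factory=dict)  # roles, locations, dates

def enrich(chunk_text: str) -> EnrichedChunk:
    """Sketch of the enrichment step; the truncation and keyword check
    stand in for the LLM calls a real pipeline would make here."""
    return EnrichedChunk(
        content=chunk_text,
        summary=chunk_text[:120],  # stand-in for an LLM-written summary
        semantic_role=(
            "procedure" if "how to" in chunk_text.lower() else "policy_definition"
        ),
        entities={},
    )
```

Everything in this record is computed once at ingestion, which is exactly why the Router and Synthesis Agent can skip that work at query time.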
7. Conclusion: Retrieval is a Reasoning Task
In the “naive RAG” era, retrieval was a math problem. In the Agentic RAG era, retrieval is a reasoning problem. By breaking queries down and routing them intelligently, we create a system that understands intent rather than just matching keywords.
In Part 3, we will dive into the ‘Precision Filter’: How to use Cross-Encoders and Reranking to eliminate retrieval noise and ensure only the most relevant context reaches the LLM.
Implementation: Let’s Bridge Your RAG Gap
Building a prototype is easy; hardening a production-grade RAG system that handles 1M+ complex PDFs without “silent failures” is a multi-month engineering lift.
The Agentic RAG Blueprint described in this series isn’t just a conceptual framework—it is a proprietary, production-ready codebase developed to solve the most stubborn data extraction and retrieval challenges in regulated industries.
Why Partner With Us?
We don’t start from scratch. We deploy our audited reference architecture directly into your infrastructure, customized for your specific document types:
- Accelerated Deployment: Skip 6+ months of R&D with our pre-built Docling, Pydantic AI, and Langfuse integrations.
- Total Data Sovereignty: Our “Local-First” Docker stack ensures your sensitive data never leaves your firewall.
- Guaranteed Precision: We move beyond naive similarity search to hybrid, agent-enriched retrieval that matches human-level accuracy.
Schedule a Technical Strategy Session
If your current RAG implementation is struggling with complex layouts, losing context in chunks, or failing to scale on-premise, let’s talk.
We will walk you through a live demonstration of the blueprint using your own document samples and discuss how to integrate this architecture into your existing stack.
Book a RAG Strategy Consultation
Direct access to our lead architects. No sales fluff, just engineering.