Part 3: The Validation Layer — Reranking, Cross-Encoders, and Automated Evaluation

1. Introduction: Why Vector Search Alone Isn’t Enough

In Part 2, we optimized our system for Recall—using expansion and routing to ensure the “needle” is somewhere in our top 50 results. However, in production, being “somewhere in the top 50” is a liability, not a feature.

Vector search is fast—it takes milliseconds to retrieve candidates. But it’s also mathematically “blunt.” It finds documents that look like the query in embedding space, not necessarily documents that answer the query.

Consider a legal discovery scenario:

You ask “What are the termination clauses in our vendor agreements?” Vector search might return 50 documents semantically similar to “termination” and “vendor.” But does Document #23 actually contain an enforceable termination clause, or is it just a footnote mentioning the word in passing? A vector score of 0.87 doesn’t tell you. This is where reranking and automated evaluation enter the picture to transform “probabilistic search” into “deterministic verification.”


2. The Precision Filter: Two-Stage Retrieval

To move beyond “good enough,” we implement a pragmatic split of labor known as the Two-Stage Retrieval Pattern.

2.1 Stage 1: Broad Retrieval (Recall)

  • Action: Query → Embedding → Vector DB.
  • Result: Top 50 candidates fast (< 100ms).
  • Goal: Catch all potentially relevant documents.

2.2 Stage 2: Precision Scoring (Reranking)

  • Action: Take 50 candidates → Cross-Encoder model.
  • Result: Score each (query, document) pair directly for relevance.
  • Goal: Eliminate noise and rank by true contextual intent.
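The split of labor above can be sketched end to end. Everything here is a toy stand-in: `vector_search` and `score_pair` are hypothetical stubs for a real vector DB client and a cross-encoder, shown only to make the control flow concrete.

```python
# Minimal sketch of the two-stage pattern. `vector_search` and
# `score_pair` are hypothetical stand-ins, stubbed with toy data.

def vector_search(query: str, k: int = 50) -> list[str]:
    # Stage 1: broad, fast candidate retrieval (stubbed corpus).
    corpus = [
        "Vendor agreement: termination clause, 30-day notice.",
        "Benefits overview: termination-related health insurance.",
        "Office supplies purchase order.",
    ]
    return corpus[:k]

def score_pair(query: str, doc: str) -> float:
    # Stage 2: joint (query, doc) relevance, stubbed as keyword overlap.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def two_stage_retrieve(query: str, top_n: int = 5) -> list[tuple[str, float]]:
    candidates = vector_search(query)  # recall stage
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    # precision stage: keep only the top-N after rescoring
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_n]

results = two_stage_retrieve("termination clause vendor")
print(results[0][0])
```

In production, `vector_search` would hit your vector DB and `score_pair` a real cross-encoder; the orchestration shape stays the same.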

2.3 Cross-Encoders vs. Bi-Encoders: The Fundamental Difference

A Bi-Encoder (Vector Search) sees the query and document as isolated points. A Cross-Encoder sees them as a relationship.

| Aspect | Bi-Encoder (Vector Search) | Cross-Encoder (Reranking) |
|---|---|---|
| Input Processing | Query and doc encoded separately | Query and doc encoded together |
| Scoring | Independent embeddings → Cosine distance | Joint transformer → Direct relevance score |
| Computational Cost | O(1) per doc (pre-computed) | O(n): must score each pair |
| Accuracy | ~85% discrimination | ~95%+ discrimination |

Example: For “termination clauses,” a bi-encoder gives a high score to a doc about “termination-related health insurance benefits” because the keywords match. A cross-encoder reads the query and document jointly (separated by a [SEP] token), recognizes that the context is “benefits,” not “contractual clauses,” and assigns a low relevance score (e.g., 0.58).
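To make the “encoded together” distinction concrete, here is a purely didactic sketch of the joint sequence a cross-encoder consumes. The `joint_input` helper is an illustration, not a real API; tokenizers in libraries like sentence-transformers build this sequence for you.

```python
# Illustrative only: a cross-encoder receives query and document as ONE
# sequence, conventionally separated by a [SEP] token, so attention can
# flow between them. Real tokenizers insert these special tokens for you.

def joint_input(query: str, doc: str) -> str:
    return f"[CLS] {query} [SEP] {doc} [SEP]"

pair = joint_input(
    "What are the termination clauses in our vendor agreements?",
    "Employees retain termination-related health insurance benefits.",
)
print(pair)
```

A bi-encoder never builds this pair: each side is embedded alone, which is exactly why it cannot tell “benefits” context from “contractual clause” context.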


3. Library Choices for On-Prem Reranking

Our architecture supports three production-ready options depending on your hardware:

3.1 BGE-Reranker-v2-m3 (Recommended)

BAAI’s multilingual reranker is the “sweet spot” for most enterprise use cases.

  • Model Size: 150MB | Device: CPU or GPU
  • Use Case: Legal docs, technical queries, mixed-language corpora.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('BAAI/bge-reranker-v2-m3', device='cpu')

# Score all pairs for surgical precision
scores = reranker.predict(
    [[query, doc.content] for doc in candidates],
    batch_size=32
)

# Rerank to top 5 results
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]

3.2 Jina-Reranker-v2

Best for international teams needing 30+ language support and high-stakes technical ranking.

3.3 FlashRank (Ultra-Lite)

For on-prem setups without GPU access and strict sub-200ms latency requirements.

  • Trade-off: Slightly lower accuracy (~91%) for 3x the speed.

4. The Evaluation Loop: Self-RAG and RAGAS

Retrieval is only half the battle. In a production environment, you don’t know if your RAG is working unless you measure it. We build a Critic Agent to grade the pipeline using the Self-RAG principle—a self-reflective loop that validates the answer before it ever reaches the user.

4.1 The Self-Reflective Critic: Implementation with Pydantic AI

To make this critique deterministic, we use Pydantic AI. Unlike a standard LLM call, this ensures the Critic returns a strictly typed EvaluationResult that our orchestrator can act upon.

from pydantic import BaseModel, Field
from pydantic_ai import Agent
from typing import Literal

# 1. Define the 'Quality Gate' Schema
class EvaluationResult(BaseModel):
    is_faithful: bool = Field(description="Is the answer grounded ONLY in the context?")
    is_relevant: bool = Field(description="Does the answer directly address the user query?")
    overall_score: float = Field(ge=0, le=1.0)
    suggested_action: Literal["proceed", "refine_query", "retry_retrieval"]

# 2. Configure the Critic Agent
critic_agent = Agent(
    'ollama:mistral', # On-prem execution for data sovereignty
    result_type=EvaluationResult,
    system_prompt="You are a strict QA auditor. Compare context vs. answer for hallucinations."
)
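Because `EvaluationResult` is ordinary Pydantic, the gate’s contract can be exercised without any model call. A standalone sketch (redefining the schema so it runs on its own):

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class EvaluationResult(BaseModel):
    is_faithful: bool = Field(description="Is the answer grounded ONLY in the context?")
    is_relevant: bool = Field(description="Does the answer directly address the user query?")
    overall_score: float = Field(ge=0, le=1.0)
    suggested_action: Literal["proceed", "refine_query", "retry_retrieval"]

# A well-formed critique validates cleanly...
ok = EvaluationResult(is_faithful=True, is_relevant=True,
                      overall_score=0.85, suggested_action="proceed")

# ...while an out-of-range score is rejected before it reaches the orchestrator.
try:
    EvaluationResult(is_faithful=True, is_relevant=True,
                     overall_score=1.4, suggested_action="proceed")
except ValidationError:
    print("rejected: overall_score must be <= 1.0")
```

This is the point of the structured approach: the orchestrator branches on typed fields, never on free-form LLM prose.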

By using this structured approach, the agent checks the RAG Triad at runtime:

  • Relevance: Is the retrieved context actually useful for this specific query?
  • Faithfulness: Is the generated answer grounded only in the provided context?
  • Utility: Does the final synthesis actually resolve the user’s question?

4.2 The Three RAGAS Metrics

While the Critic provides a “Go/No-Go” decision, we use the RAGAS framework to generate granular, objective scores for long-term observability. In our pipeline, we calculate a weighted Overall Quality Score:

| Metric | Weight | Engineering Goal |
|---|---|---|
| Faithfulness | 0.35 | Anti-Hallucination: ensure no claims exist outside the source docs. |
| Answer Relevance | 0.35 | Precision: ensure the LLM didn’t “drift” into generic knowledge. |
| Context Precision | 0.30 | Retrieval Quality: did the Reranker put the “needle” in the top 3? |
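Given these weights, the Overall Quality Score reduces to a one-line blend. A minimal sketch, assuming the three inputs are RAGAS scores in [0, 1]:

```python
# Weighted Overall Quality Score using the weights from the table above.
WEIGHTS = {"faithfulness": 0.35, "answer_relevance": 0.35, "context_precision": 0.30}

def overall_quality(faithfulness: float, answer_relevance: float,
                    context_precision: float) -> float:
    # Weighted average; rounded for stable logging/observability.
    return round(
        WEIGHTS["faithfulness"] * faithfulness
        + WEIGHTS["answer_relevance"] * answer_relevance
        + WEIGHTS["context_precision"] * context_precision,
        3,
    )

print(overall_quality(0.9, 0.8, 0.7))  # → 0.805
```

Since the weights sum to 1.0, a perfect run scores 1.0 and any single weak metric drags the blend down proportionally.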

4.3 The Quality Gate Decision

Before the user sees the answer, the system runs this final check:

async def quality_gate(query: str, context: str, answer: str) -> str:
    eval_run = await critic_agent.run(
        f"Question: {query}\nContext: {context}\nAnswer: {answer}"
    )
    report = eval_run.data

    if report.suggested_action == "proceed" and report.overall_score >= 0.7:
        return f"Verified Answer: {answer}"

    # If the gate fails, we trigger the 'Refine' loop introduced in Section 5
    return handle_failure(report.suggested_action)

5. Orchestration: The Quality Gate

Before an answer is returned, it must pass a “Go/No-Go” decision gate based on a weighted average of these metrics.

def _classify_quality(self, metrics: EvaluationResult) -> str:
    score = metrics.overall_score  # Weighted average of the three RAGAS metrics

    if score >= 0.8: return "excellent"
    elif score >= 0.6: return "good"
    elif score >= 0.4: return "fair"
    else: return "poor"

What happens if the Gate fails?

  • Refine: Ask the user for clarification.
  • Retry: Trigger a second retrieval pass with different parameters.
  • Reject: Do not show the answer if the hallucination risk is too high.
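These three fallback paths can be sketched as a dispatcher keyed on the Critic’s `suggested_action`. The handler bodies below are hypothetical placeholders; in production they would call back into the retrieval pipeline or the chat UI.

```python
# Minimal sketch of the failure-handling dispatch. The response strings
# are placeholders standing in for real refine/retry/reject behavior.

def handle_failure(suggested_action: str) -> str:
    if suggested_action == "refine_query":
        # Refine: ask the user for clarification.
        return "Could you clarify your question? I found only partial matches."
    if suggested_action == "retry_retrieval":
        # Retry: second retrieval pass with different parameters.
        return "Retrying retrieval with relaxed filters..."
    # Reject (default): hallucination risk too high, do not show the answer.
    return "I could not verify an answer from the available documents."

print(handle_failure("refine_query"))
```

This is the same `handle_failure` the quality gate in Section 4.3 falls back to when the Critic withholds a “proceed”.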

6. On-Premise Infrastructure: Data Sovereignty

To maintain data privacy, our evaluation node runs locally. No data leaves the firewall.

graph LR
    subgraph On-Premise Firewall
    A[RAG Pipeline] --> B[Eval Server]
    B --> C[Ollama / Mistral-7B]
    B --> D[RAGAS Metrics]
    end
    D --> E[Langfuse Observability]

7. Conclusion: Precision + Evaluation = Trust

By combining Reranking for surgical precision and Self-RAG/RAGAS for verification, we close the loop on reliability. We are no longer just “searching”; we are verifying.

  1. Part 1: Storage & Indexing
  2. Part 2: Expansion & Routing
  3. Part 3: Quality & Verification (The Validation Layer)

In the final part of this series, we will focus on “The Human Interface”: Handling Citations, Streaming UI, and the User Feedback Loop.



From Architecture to Implementation: Let’s Bridge Your RAG Gap

Building a prototype is easy; hardening a production-grade RAG system that handles 1M+ complex PDFs without “silent failures” is a multi-month engineering lift.

The Agentic RAG Blueprint described in this series isn’t just a conceptual framework—it is a proprietary, production-ready codebase developed to solve the most stubborn data extraction and retrieval challenges in regulated industries.

Why Partner With Us?

We don’t start from scratch. We deploy our audited reference architecture directly into your infrastructure, customized for your specific document types:

  • Accelerated Deployment: Skip 6+ months of R&D with our pre-built Docling, Pydantic AI, and Langfuse integrations.
  • Total Data Sovereignty: Our “Local-First” Docker stack ensures your sensitive data never leaves your firewall.
  • Guaranteed Precision: We move beyond naive similarity search to hybrid, agent-enriched retrieval that matches human-level accuracy.

Schedule a Technical Strategy Session

If your current RAG implementation is struggling with complex layouts, losing context in chunks, or failing to scale on-premise, let’s talk.

We will walk you through a live demonstration of the blueprint using your own document samples and discuss how to integrate this architecture into your existing stack.

Book a RAG Strategy Consultation

Direct access to our lead architects. No sales fluff, just engineering.

Authors

  • Marc Matt

    Senior Data Architect with 15+ years of experience helping Hamburg’s leading enterprises modernize their data infrastructure. I bridge the gap between legacy systems (SAP, Hadoop) and modern AI capabilities.

    I help clients:

    • Migrate & Modernize: Transitioning on-premise data warehouses to Google Cloud/AWS to reduce costs and increase agility.
    • Implement GenAI: Building secure RAG (Retrieval-Augmented Generation) pipelines to unlock value from internal knowledge bases using LangChain and Vector DBs.
    • Scale MLOps: Operationalizing machine learning models from PoC to production with Kubernetes and Airflow.

    Proven track record leading engineering teams.

  • Saidah Kafka
