Building Production-Grade Agentic RAG: A Technical Deep Dive – Part 1

Beyond Fixed Windows — Agentic & ML-Based Chunking

Introduction: The RAG Gap

The promise of Retrieval-Augmented Generation (RAG) is compelling: ground large language models in enterprise data, reduce hallucinations, enable real-time knowledge updates. But in practice, most RAG systems fail silently.

They fail not because embedding models are weak or vector databases are slow, but because the information extraction pipeline is brittle. Documents arrive as PDFs with mixed content (text, tables, images, scanned pages). Extraction produces chunks that violate semantic boundaries. Embeddings lose meaning through lossy summarization. Vector search returns technically similar but contextually irrelevant results.

The gap between “we have a RAG prototype” and “we have a production RAG system” is measured in engineering depth: intelligent document parsing, semantic-aware chunking, agentic enrichment, comprehensive observability.

This article presents Agentic RAG Blueprint—an on-premise reference architecture that closes this gap. Built with Docling (structural document parsing), Pydantic AI (agentic enrichment), and Langfuse (end-to-end observability), it demonstrates how to build RAG systems that scale to enterprise document volumes while maintaining semantic coherence.


The Production RAG Series

This article is the cornerstone of our series on building enterprise-grade retrieval systems. Explore the deep dives below:

  • Post 1: Beyond Fixed Windows – Agentic & ML-Based Chunking.
  • Post 2: The Multi-Step Retriever – Implementing Agentic Query Expansion.
  • Post 3: The Precision Filter – Cross-Encoders and Reranking.
  • Post 4: The Evaluation Loop – RAGAS and “Self-RAG.”
  • Post 5: Production & Deployment – Scaling On-Prem with Docker & vLLM.

The Problem: Why Most RAG Pipelines Degrade in Production

Silent Failure #1: Semantic Collapse Through Naive Chunking

Consider an HR benefits document:

"Employees are eligible for medical, dental, and vision coverage. 
Benefits include annual checkups (covered at 100%), deductible-based specialist visits ($50 copay for in-network), and emergency care (100% covered). Eligibility requires 30 days employment; part-time employees (<20 hrs/week) are ineligible for vision benefits but eligible for medical coverage."

A naive token-based chunker (e.g., “split every 512 tokens”) might produce:

Chunk A:

"Employees are eligible for medical, dental, and vision coverage. 
Benefits include annual checkups (covered at 100%), deductible-based specialist visits ($50 copay for in-network)..."

Chunk B:

"...emergency care (100% covered). Eligibility requires 30 days employment; part-time employees (<20 hrs/week) are ineligible for vision benefits but eligible for medical coverage."

Now a user asks: “Are part-time employees eligible for dental coverage?”

The query embedding matches Chunk B (contains “part-time employees, eligible”). But Chunk B doesn’t mention dental coverage—that’s in Chunk A. The system returns an incomplete answer because the natural semantic unit (eligibility rules) was fragmented across chunks.

This isn’t a retrieval problem. It’s a chunking problem.
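To make the failure reproducible, here is a minimal sketch of a fixed-window chunker; whitespace tokens stand in for model tokens and the window size is deliberately small so the split is visible (both are illustrative simplifications):

# Minimal sketch of naive fixed-window chunking.
# Whitespace "tokens" stand in for model tokens; the tiny window is
# illustrative so the semantic break is easy to see.

def fixed_window_chunks(text: str, window: int = 28) -> list[str]:
    tokens = text.split()
    return [" ".join(tokens[i:i + window]) for i in range(0, len(tokens), window)]

benefits_text = (
    "Employees are eligible for medical, dental, and vision coverage. "
    "Benefits include annual checkups (covered at 100%), deductible-based "
    "specialist visits ($50 copay for in-network), and emergency care "
    "(100% covered). Eligibility requires 30 days employment; part-time "
    "employees (<20 hrs/week) are ineligible for vision benefits but "
    "eligible for medical coverage."
)

for label, chunk in zip("ABC", fixed_window_chunks(benefits_text)):
    print(f"Chunk {label}: {chunk}")
# The coverage list and the eligibility rules land in different chunks,
# exactly the fragmentation shown above.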

Silent Failure #2: Content Extraction Brittleness

PDFs are a container format, not a semantic format. A single document might contain:

  • Born-digital text (modern PDFs with embedded fonts)
  • Scanned images (legacy documents requiring OCR)
  • Complex layouts (tables, side-by-side columns, footnotes)
  • Visual content (charts, graphs, diagrams)

Without intelligent content analysis, your extraction pipeline makes binary decisions:

  • “Does this PDF have extractable text? Yes → Extract it. No → Skip.”
  • “Does this page have images? Yes → Extract as images. No → Ignore.”

Result: You lose 30-50% of meaningful content because you don’t know what you’re looking at.

Silent Failure #3: Missing Semantic Context

Once chunks are extracted, they float as isolated semantic units. A chunk about “deductibles” has no explicit relationship to “cost-sharing” or “out-of-pocket maximums”—even though they’re conceptually intertwined.

When a user asks “What’s my out-of-pocket exposure?”, the system retrieves deductible chunks but lacks the reasoning context to understand they should also surface cost-sharing and coinsurance chunks.


The Solution Architecture

The Agentic RAG Blueprint addresses each failure point through a layered pipeline:

graph TD
    %% Define Nodes
    A([PDFs / Documents]) --> B(Intelligent Extraction)
    
    subgraph Extraction [Docling + Tesseract OCR]
    B --> B1[Detect structure & content type]
    B --> B2[Adaptive OCR processing]
    B --> B3[Preserve layout relationships]
    end
    
    B3 --> C(Semantic Chunking)
    
    subgraph Chunking [HybridChunker + BGE-M3]
    C --> C1[Respect structural boundaries]
    C --> C2[Detect semantic discontinuities]
    C --> C3[Preserve heading hierarchies]
    end
    
    C3 --> D(Dual Embeddings)
    
    subgraph Embeddings [Model: BGE-M3]
    D --> D1[Dense: Semantic similarity]
    D --> D2[Sparse: Learned term importance]
    end
    
    D2 --> E(Agentic Enrichment)
    
    subgraph Agents [Framework: Pydantic AI]
    E --> E1[Summary Generation]
    E --> E2[Semantic Role Classification]
    E --> E3[Entity Extraction]
    E --> E4[Cross-Reference Detection]
    end
    
    E4 --> F[(Vector Storage)]
    
    subgraph Storage [PostgreSQL + pgvector]
    F --> F1[First-class enrichment columns]
    F --> F2[Metadata Filtering Index]
    F --> F3[Hybrid Search Support]
    end
    
    F3 --> G(Observability & Debugging)
    
    subgraph Monitoring [Langfuse]
    G --> G1[Token usage tracking]
    G --> G2[Latency breakdown]
    G --> G3[Success/Failure metrics]
    end

    %% Styling
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#2d5a88,color:#fff,stroke:#333
    style G fill:#f96,stroke:#333

Core Components

1. Intelligent Document Analysis: Beyond Binary Extraction

The blueprint starts with Docling, IBM’s open-source document intelligence framework. Unlike naive PDF extractors, Docling treats documents as structured information:

Docling’s advantage: Adaptive pipeline selection

Instead of asking “does this have text?”, Docling analyzes content and makes intelligent decisions:

  • Text coverage analysis: Scans all pages, calculates percentage with substantial text
  • Table detection: Looks for structural indicators (pipes, tabs, aligned data)
  • Image presence: Identifies whether visual content exists
  • Scanned vs. digital: Determines if OCR is beneficial

Only then does it decide which processing modules to activate.

Practical example for HR documents:

A scanned benefits summary (100% images) and a born-digital contract (100% text) hit the same pipeline but receive completely different processing paths:

Scanned Benefits Summary
├─ OCR: Enabled (full page)
├─ Table analysis: Enabled (benefits often tabular)
├─ Image description: Enabled (extract charts/graphs)
└─ Formula enrichment: Disabled (no formulas in images)

Born-Digital Contract
├─ OCR: Disabled (native text available)
├─ Table analysis: Enabled (might have signature sections)
├─ Image description: Disabled (no images)
└─ Formula enrichment: Enabled (contracts often contain calculations)

This adaptive approach eliminates wasted computation while maintaining quality.
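A minimal sketch of what this selection can look like with Docling's Python API; the decide_options heuristic and its thresholds are illustrative additions, while do_ocr and do_table_structure are standard PdfPipelineOptions switches:

# Hedged sketch of adaptive pipeline selection with Docling.
# decide_options() and its thresholds are illustrative; do_ocr and
# do_table_structure are the usual PdfPipelineOptions switches.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def decide_options(text_coverage: float, has_tables: bool) -> PdfPipelineOptions:
    opts = PdfPipelineOptions()
    opts.do_ocr = text_coverage < 0.5      # scanned pages -> run OCR
    opts.do_table_structure = has_tables   # benefits docs are often tabular
    return opts


# Scanned benefits summary: almost no embedded text, likely tables.
opts = decide_options(text_coverage=0.05, has_tables=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
result = converter.convert("benefits_summary.pdf")
doc = result.document  # structured document: text items, tables, layout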

2. Semantic Chunking: Respecting Conceptual Boundaries

Once content is extracted, the blueprint uses HybridChunker to segment documents intelligently:

Three principles:

  1. Structural awareness: Never split a paragraph mid-sentence; respect heading hierarchies
  2. Semantic continuity: Use embedding-based similarity to detect topic shifts
  3. Token-aware formatting: Leverage model-specific tokenization to avoid boundary errors

Why this matters:

In the earlier HR example, semantic chunking detects the discontinuity between “benefits structure” and “eligibility rules” because sentence-level embeddings show a similarity drop. The boundary is placed exactly where human readers perceive a conceptual shift—not at arbitrary token counts.
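A minimal sketch, assuming Docling's HybridChunker driven by the embedding model's own tokenizer; the max_tokens budget and model id are illustrative choices, and doc is the structured document from the extraction step:

# Hedged sketch: token-aware, structure-respecting chunking with
# Docling's HybridChunker. max_tokens is an illustrative budget.
from docling.chunking import HybridChunker
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
chunker = HybridChunker(tokenizer=tokenizer, max_tokens=512, merge_peers=True)

# `doc` is the structured document produced by the extraction step above.
chunks = list(chunker.chunk(dl_doc=doc))
for chunk in chunks[:3]:
    # Each chunk keeps its structural context: heading trail and text.
    print(chunk.meta.headings, chunk.text[:80])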

3. Dual Embedding Strategy: Dense + Sparse

BGE-M3 (BAAI General Embedding) provides two complementary representations:

Dense embeddings (1024 dimensions):

  • Captures semantic similarity in high-dimensional space
  • “specialist visits” and “doctor appointments” are near each other
  • Enables approximate nearest neighbor search (fast recall)

Sparse embeddings (250k learned term weights):

  • Learned importance scores for each token (not TF-IDF)
  • “$50 copay” gets high weight (semantically important for cost queries)
  • “the” gets near-zero weight (semantically uninformative)
  • Enables efficient filtering and interpretable retrieval

Why both matter:

A query “specialist visit copayment” needs:

  • Dense search to find chunks about cost-sharing (semantic match)
  • Sparse search to confirm “$50” and “specialist” are present (lexical match)

Combining both scores yields precision + recall.
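A minimal sketch using the FlagEmbedding package's BGEM3FlagModel, which can return both representations in a single encode call; the fusion weights in hybrid_score are illustrative, not tuned values:

# Hedged sketch: dense + sparse encoding with BGE-M3 via FlagEmbedding.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

out = model.encode(
    ["specialist visit copayment"],
    return_dense=True,
    return_sparse=True,
)
dense_vec = out["dense_vecs"][0]            # 1024-dim semantic vector
sparse_weights = out["lexical_weights"][0]  # {token: learned importance}


def hybrid_score(dense_sim: float, sparse_sim: float,
                 w_dense: float = 0.7, w_sparse: float = 0.3) -> float:
    # Illustrative fusion: weight and sum the two similarity signals.
    return w_dense * dense_sim + w_sparse * sparse_sim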


The Agentic Enrichment Layer: Making Chunks Intelligent

Here’s where the blueprint diverges from conventional RAG. After extraction and embedding, chunks are processed by specialized LLM agents (using Pydantic AI) that add semantic metadata.

Five Concurrent Enrichment Agents

1. Summary Agent

  • Generates one-sentence semantic summary
  • Not for retrieval (would lose specificity)
  • For human context, reranking signals, and cross-reference detection

2. Semantic Role Agent

  • Classifies chunk purpose: “Eligibility Criteria”, “Benefit Description”, “Cost/Deductible”, “Exclusion”, “Procedure/Process”, etc.
  • Enables role-based filtering: “Show me only exclusions”
  • Improves retrieval precision for category-specific queries

3. Entity Extraction Agent

  • Identifies named entities with semantic types: “Medical Coverage” (Benefit), “30 days” (Timeline), “$50” (Monetary Amount), “in-network” (Status)
  • Builds entity graph for structured reasoning
  • Enables “find all references to X” queries

4. Key Concepts Agent

  • Identifies 3-5 core topics (e.g., [“Eligibility”, “Medical Coverage”, “Waiting Period”])
  • Enables semantic clustering and topic-based browsing
  • Supports faceted search interfaces

5. Cross-Reference Agent

  • Detects hints to related sections: “See Section 3.2”, “As described under Benefits”
  • Builds relationship graph between chunks
  • Enables reasoning systems to follow references automatically
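As a sketch of what these agents return, the enrichment can be modeled with Pydantic schemas along the following lines (field names and role labels are illustrative, not the blueprint's exact models):

# Illustrative Pydantic schemas for the five enrichment outputs.
from pydantic import BaseModel, Field


class Entity(BaseModel):
    name: str   # e.g. "Medical Coverage", "$50", "30 days"
    type: str   # e.g. "Benefit", "Monetary Amount", "Timeline"


class CrossReference(BaseModel):
    target_hint: str   # e.g. "Section 3.2", "as described under Benefits"


class ChunkEnrichment(BaseModel):
    summary: str = Field(description="One-sentence semantic summary")
    semantic_role: str                  # e.g. "Eligibility Criteria", "Exclusion"
    entities: list[Entity] = []
    key_concepts: list[str] = []        # 3-5 core topics
    cross_references: list[CrossReference] = []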

Why Pydantic AI?

Traditional LLM integration via raw HTTP calls requires:

  • Manual JSON parsing with error handling
  • Type coercion (strings → booleans, datetimes, etc.)
  • Exception handling for malformed responses

Pydantic AI eliminates this friction:

from pydantic_ai import Agent

agent = Agent(model=ollama_model, result_type=list[Entity])
# LLM output is automatically validated against the Entity schema
# Type-safe throughout the pipeline

All five agents run concurrently (with semaphore limiting), processing 1000 chunks in ~15-30 minutes instead of hours.
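A hedged sketch of that concurrency pattern follows; ollama_model is assumed to be a Pydantic AI model pointed at the local Ollama endpoint, Entity is the schema sketched above, the semaphore limit is illustrative, and result_type/.data become output_type/.output in newer Pydantic AI releases:

# Hedged sketch: semaphore-limited, concurrent chunk enrichment.
import asyncio

from pydantic_ai import Agent

entity_agent = Agent(
    model=ollama_model,          # assumed: configured for the local Ollama endpoint
    result_type=list[Entity],    # output_type in newer Pydantic AI releases
    system_prompt="Extract named entities with semantic types from the chunk.",
)

semaphore = asyncio.Semaphore(8)  # illustrative concurrency cap


async def enrich_chunk(text: str) -> list[Entity]:
    async with semaphore:
        result = await entity_agent.run(text)
        return result.data       # .output in newer releases


async def enrich_all(chunk_texts: list[str]) -> list[list[Entity]]:
    return await asyncio.gather(*(enrich_chunk(t) for t in chunk_texts))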


Storage & Querying: Enriched Chunks as First-Class Citizens

Unlike generic vector stores, the blueprint stores enriched metadata in first-class PostgreSQL columns:

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    
    -- Core content
    content TEXT,
    embedding_dense vector(1024),
    embedding_sparse sparsevec(250002),
    
    -- Agentic enrichment (indexed for efficient querying)
    summary TEXT,
    semantic_role VARCHAR(100),
    entities JSONB,
    key_concepts TEXT[],
    related_chunk_ids INTEGER[],
    
    -- Structural metadata
    page_no INTEGER,
    headings TEXT[],
    filename VARCHAR(255)
);

CREATE INDEX idx_semantic_role ON documents(semantic_role);
CREATE INDEX idx_entities ON documents USING GIN (entities);
CREATE INDEX idx_related_chunks ON documents USING GIN (related_chunk_ids);

This enables sophisticated retrieval:

-- Retrieve eligibility criteria related to part-time status
SELECT * FROM documents
WHERE semantic_role = 'Eligibility Criteria'
AND entities @> '[{"name": "part-time"}]'::jsonb
ORDER BY embedding_dense <=> query_embedding
LIMIT 10;

-- Follow cross-references
SELECT * FROM documents
WHERE id = ANY(
    (SELECT related_chunk_ids FROM documents WHERE id = 42)
);

-- Role-aware filtering
SELECT DISTINCT semantic_role FROM documents
WHERE entities @> '[{"type": "Benefit"}]'::jsonb;

Observability: Why Langfuse Matters

A RAG system in production faces invisible degradation:

  • Embeddings drift over time
  • OCR quality varies by document type
  • LLM enrichment becomes inconsistent

Without observability, you won’t know until users report problems.

The blueprint integrates Langfuse, which traces every operation:

  • Token counts per agent: Identify which agents consume most resources
  • Latency breakdown: Detect bottlenecks (OCR? Embedding generation? Database?)
  • Error tracking: Which document types fail enrichment?
  • Cost analysis: Compute inference cost per document
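A minimal sketch of the tracing hook, assuming the Langfuse Python SDK's @observe decorator (its import path differs between SDK versions, and the staged functions here are illustrative stand-ins):

# Hedged sketch: span-per-stage tracing with Langfuse's @observe decorator.
# Import path varies by SDK version (v2: langfuse.decorators, v3: langfuse).
from langfuse.decorators import observe


@observe()
def chunk_document(text: str) -> list[str]:
    # Stand-in for the real chunking stage; every decorated call becomes
    # a span with latency and inputs/outputs captured.
    return [text[i:i + 2000] for i in range(0, len(text), 2000)]


@observe()
def process_document(text: str) -> int:
    chunks = chunk_document(text)
    # Embedding, enrichment, and storage stages would be decorated the same
    # way, producing nested spans like the example trace below.
    return len(chunks)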

Example trace:

process_document(benefits_manual.pdf)
├─ extract [150ms, 3.2MB memory]
├─ chunking [45ms, 42 chunks]
├─ embedding_dense [2100ms, 42 chunks × 1024 dims]
├─ embedding_sparse [1800ms, 42 chunks × sparse]
├─ enrich_summary [8400ms, 5 timeouts, 1 retry]
├─ enrich_role [6200ms, 0 errors]
├─ enrich_entities [7100ms, 2 partial failures]
├─ enrich_concepts [5800ms]
├─ enrich_references [4200ms]
└─ store [320ms, 42 rows inserted]
Total: ~36 seconds, 847 tokens used

This visibility is critical for debugging and optimizing production RAG systems.


Deployment: Local-First Infrastructure

The blueprint ships with a complete Docker stack:

Services:
  - PostgreSQL (pgvector) - Vector storage + metadata
  - Redis - Caching & queue management
  - ClickHouse - Time-series trace storage
  - Langfuse - Observability dashboard
  - Ollama - Local LLM serving (with mistral for enrichment)
  - MinIO - S3-compatible document storage

Why on-premise?

  • Data privacy: Sensitive documents (HR, finance, legal) never leave your infrastructure
  • Cost control: No per-token API fees; fixed infrastructure cost
  • Latency: No network roundtrips to cloud providers
  • Customization: Full control over models, prompts, and processing pipelines

Real-World Impact: HR Manual Use Case

To demonstrate the difference between naive RAG and intelligent RAG, consider a benefits question:

User query: “I’m part-time and have a chronic illness. What coverage options do I have?”

Naive RAG (token-based chunking, no enrichment):

  1. Embed query
  2. Vector search retrieves top-5 chunks
  3. Results:
    • “Part-time employees are defined as…”
    • “Chronic illness exclusions include…”
    • “Coverage options available under Plan A…”
    • (Random text about retirement plans)
    • (Boilerplate legal language)

Problem: Results mix eligibility, exclusions, and coverage in random order. User must manually reason about what applies to them.

Agentic RAG (semantic chunking + enrichment):

  1. Embed query
  2. Dense search retrieves 20 candidates
  3. Filter by semantic_role IN ('Eligibility Criteria', 'Benefit Description', 'Exclusion')
  4. Rerank by entity match: prioritize chunks mentioning “part-time” + “chronic illness”
  5. Follow related_chunk_ids to surface connected policies (see the sketch after this example)
  6. Results:
    • “Part-time employees are eligible for medical & dental (Eligibility)”
    • “Chronic illness coverage: X condition covered, Y excluded (Benefit Description)”
    • “Part-time medical copays: specialist visits $50 (Cost)”
    • “Related: Out-of-pocket maximum policies (Cross-Reference)”
    • “Related: Appeals process for coverage denials (Related)”

Outcome: Structured, contextual, actionable results with clear reasoning.
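A hedged sketch of that retrieval flow against the schema shown earlier, using psycopg; the query shape, the entity filter, and the limits are illustrative:

# Hedged sketch of the multi-step retrieval flow (role filter, entity-aware
# ordering, cross-reference follow-up) against the documents table above.
import json

import psycopg
from psycopg.rows import dict_row


def agentic_retrieve(conn: psycopg.Connection, query_vec: list[float]) -> list[dict]:
    sql = """
        SELECT id, content, semantic_role, related_chunk_ids,
               (entities @> %(entity_filter)s::jsonb)::int AS entity_hit
        FROM documents
        WHERE semantic_role IN ('Eligibility Criteria', 'Benefit Description', 'Exclusion')
        ORDER BY entity_hit DESC,                        -- prefer entity matches
                 embedding_dense <=> %(qvec)s::vector    -- then dense similarity
        LIMIT 10
    """
    params = {
        "qvec": str(query_vec),  # pgvector accepts the '[...]' text form
        "entity_filter": json.dumps([{"name": "part-time"}]),
    }
    with conn.cursor(row_factory=dict_row) as cur:
        rows = cur.execute(sql, params).fetchall()
        # Follow cross-references from the top hits.
        related_ids = {rid for r in rows for rid in (r["related_chunk_ids"] or [])}
        if related_ids:
            rows += cur.execute(
                "SELECT id, content, semantic_role, related_chunk_ids, 0 AS entity_hit "
                "FROM documents WHERE id = ANY(%s)",
                [list(related_ids)],
            ).fetchall()
    return rows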


Technical Takeaways

1. Chunking is the Foundation

All downstream RAG quality depends on chunking. Semantic boundaries preserve meaning; semantic role enables filtering.

2. Embeddings Need Density

BGE-M3’s 1024 dimensions + learned sparse weights provide the signal necessary for domain-specific retrieval. 384-dim models lose nuance; summary-only embeddings lose specificity.

3. Enrichment is Worth the Latency

30 seconds to enrich 42 chunks pays dividends:

  • Better filtering (90% fewer irrelevant results)
  • Relationship graphs (discover connected documents)
  • Explainability (why was this result returned?)

4. On-Premise Wins on Privacy + Cost

For regulated industries (healthcare, finance, legal), on-premise is non-negotiable. For cost-sensitive deployments (1M+ documents), on-premise is economical.

5. Observability is Non-Negotiable

Production RAG systems fail silently. Langfuse-style tracing catches degradation before users notice.


Conclusion: From Prototype to Production

The jump from “RAG works” to “RAG is production-ready” requires engineering discipline:

  • Smart extraction (Docling + adaptive OCR)
  • Semantic chunking (HybridChunker + boundary detection)
  • Dual embeddings (dense + sparse via BGE-M3)
  • Agentic enrichment (Pydantic AI agents for semantic metadata)
  • Rich storage (PostgreSQL with first-class enrichment columns)
  • Full visibility (Langfuse end-to-end tracing)

The Agentic RAG Blueprint demonstrates that this is achievable with open-source tools, Python 3.13+, and thoughtful architecture.

The result: RAG systems that don’t just retrieve documents—they understand them.



From Architecture to Implementation: Let’s Bridge Your RAG Gap

Building a prototype is easy; hardening a production-grade RAG system that handles 1M+ complex PDFs without “silent failures” is a multi-month engineering lift.

The Agentic RAG Blueprint described in this series isn’t just a conceptual framework—it is a proprietary, production-ready codebase developed to solve the most stubborn data extraction and retrieval challenges in regulated industries.

Why Partner With Us?

We don’t start from scratch. We deploy our audited reference architecture directly into your infrastructure, customized for your specific document types:

  • Accelerated Deployment: Skip 6+ months of R&D with our pre-built Docling, Pydantic AI, and Langfuse integrations.
  • Total Data Sovereignty: Our “Local-First” Docker stack ensures your sensitive data never leaves your firewall.
  • Guaranteed Precision: We move beyond naive similarity search to hybrid, agent-enriched retrieval that matches human-level accuracy.

Schedule a Technical Strategy Session

If your current RAG implementation is struggling with complex layouts, losing context in chunks, or failing to scale on-premise, let’s talk.

We will walk you through a live demonstration of the blueprint using your own document samples and discuss how to integrate this architecture into your existing stack.

Book a RAG Strategy Consultation

Direct access to our lead architects. No sales fluff, just engineering.


References

  • Docling: IBM’s intelligent document understanding framework
  • BGE-M3: BAAI’s multilingual, dual-representation embedding model (1024 dense + 250k sparse)
  • Pydantic AI: Type-safe agentic LLM integration
  • Langfuse: Open-source observability for RAG and LLM applications
  • PostgreSQL pgvector: Native vector support with advanced filtering

Authors

  • Marc Matt

    Senior Data Architect with 15+ years of experience helping Hamburg’s leading enterprises modernize their data infrastructure. I bridge the gap between legacy systems (SAP, Hadoop) and modern AI capabilities.

    I help clients:

    • Migrate & Modernize: Transitioning on-premise data warehouses to Google Cloud/AWS to reduce costs and increase agility.
    • Implement GenAI: Building secure RAG (Retrieval-Augmented Generation) pipelines to unlock value from internal knowledge bases using LangChain and Vector DBs.
    • Scale MLOps: Operationalizing machine learning models from PoC to production with Kubernetes and Airflow.

    Proven track record leading engineering teams.

  • Saidah Kafka
