Beyond Fixed Windows — Agentic & ML-Based Chunking
Introduction: The RAG Gap
The promise of Retrieval-Augmented Generation (RAG) is compelling: ground large language models in enterprise data, reduce hallucinations, enable real-time knowledge updates. But in practice, most RAG systems fail silently.
They fail not because embedding models are weak or vector databases are slow, but because the information extraction pipeline is brittle. Documents arrive as PDFs with mixed content (text, tables, images, scanned pages). Extraction produces chunks that violate semantic boundaries. Embeddings lose meaning through lossy summarization. Vector search returns technically similar but contextually irrelevant results.
The gap between “we have a RAG prototype” and “we have a production RAG system” is measured in engineering depth: intelligent document parsing, semantic-aware chunking, agentic enrichment, comprehensive observability.
This article presents Agentic RAG Blueprint—an on-premise reference architecture that closes this gap. Built with Docling (structural document parsing), Pydantic AI (agentic enrichment), and Langfuse (end-to-end observability), it demonstrates how to build RAG systems that scale to enterprise document volumes while maintaining semantic coherence.
The Production RAG Series
This article is the cornerstone of our series on building enterprise-grade retrieval systems. Explore the deep dives below:
- Post 1: Beyond Fixed Windows – Agentic & ML-Based Chunking.
- Post 2: The Multi-Step Retriever – Implementing Agentic Query Expansion.
- Post 3: The Precision Filter – Cross-Encoders and Reranking.
- Post 4: The Evaluation Loop – RAGAS and “Self-RAG.”
- Post 5: Production & Deployment – Scaling On-Prem with Docker & vLLM.
The Problem: Why Most RAG Pipelines Degrade in Production
Silent Failure #1: Semantic Collapse Through Naive Chunking
Consider an HR benefits document:
"Employees are eligible for medical, dental, and vision coverage.
Benefits include annual checkups (covered at 100%), deductible-based specialist visits ($50 copay for in-network), and emergency care (100% covered). Eligibility requires 30 days employment; part-time employees (<20 hrs/week) are ineligible for vision benefits but eligible for medical coverage."
A naive token-based chunker (e.g., “split every 512 tokens”) might produce:
Chunk A:
"Employees are eligible for medical, dental, and vision coverage.
Benefits include annual checkups (covered at 100%), deductible-based specialist visits ($50 copay for in-network)..."
Chunk B:
"...emergency care (100% covered). Eligibility requires 30 days employment; part-time employees (<20 hrs/week) are ineligible for vision benefits but eligible for medical coverage."
Now a user asks: “Are part-time employees eligible for dental coverage?”
The query embedding matches Chunk B (it contains "part-time employees" and "eligible"). But Chunk B doesn't mention dental coverage—that's in Chunk A. The system returns an incomplete answer because the natural semantic unit (eligibility rules) was fragmented across chunks.
This isn’t a retrieval problem. It’s a chunking problem.
Silent Failure #2: Content Extraction Brittleness
PDFs are a container format, not a semantic format. A single document might contain:
- Born-digital text (modern PDFs with embedded fonts)
- Scanned images (legacy documents requiring OCR)
- Complex layouts (tables, side-by-side columns, footnotes)
- Visual content (charts, graphs, diagrams)
Without intelligent content analysis, your extraction pipeline makes binary decisions:
- “Does this PDF have extractable text? Yes → Extract it. No → Skip.”
- “Does this page have images? Yes → Extract as images. No → Ignore.”
Result: You lose 30-50% of meaningful content because you don’t know what you’re looking at.
Silent Failure #3: Missing Semantic Context
Once chunks are extracted, they float as isolated semantic units. A chunk about “deductibles” has no explicit relationship to “cost-sharing” or “out-of-pocket maximums”—even though they’re conceptually intertwined.
When a user asks “What’s my out-of-pocket exposure?”, the system retrieves deductible chunks but lacks the reasoning context to understand they should also surface cost-sharing and coinsurance chunks.
The Solution Architecture
The Agentic RAG Blueprint addresses each failure point through a layered pipeline:
graph TD
%% Define Nodes
A([PDFs / Documents]) --> B(Intelligent Extraction)
subgraph Extraction [Docling + Tesseract OCR]
B --> B1[Detect structure & content type]
B --> B2[Adaptive OCR processing]
B --> B3[Preserve layout relationships]
end
B3 --> C(Semantic Chunking)
subgraph Chunking [HybridChunker + BGE-M3]
C --> C1[Respect structural boundaries]
C --> C2[Detect semantic discontinuities]
C --> C3[Preserve heading hierarchies]
end
C3 --> D(Dual Embeddings)
subgraph Embeddings [Model: BGE-M3]
D --> D1[Dense: Semantic similarity]
D --> D2[Sparse: Learned term importance]
end
D2 --> E(Agentic Enrichment)
subgraph Agents [Framework: Pydantic AI]
E --> E1[Summary Generation]
E --> E2[Semantic Role Classification]
E --> E3[Entity Extraction]
E --> E4[Cross-Reference Detection]
end
E4 --> F[(Vector Storage)]
subgraph Storage [PostgreSQL + pgvector]
F --> F1[First-class enrichment columns]
F --> F2[Metadata Filtering Index]
F --> F3[Hybrid Search Support]
end
F3 --> G(Observability & Debugging)
subgraph Monitoring [Langfuse]
G --> G1[Token usage tracking]
G --> G2[Latency breakdown]
G --> G3[Success/Failure metrics]
end
%% Styling
style A fill:#f9f,stroke:#333,stroke-width:2px
style F fill:#2d5a88,color:#fff,stroke:#333
style G fill:#f96,stroke:#333
Core Components
1. Intelligent Document Analysis: Beyond Binary Extraction
The blueprint starts with Docling, IBM’s open-source document intelligence framework. Unlike naive PDF extractors, Docling treats documents as structured information:
Docling’s advantage: Adaptive pipeline selection
Instead of asking “does this have text?”, Docling analyzes content and makes intelligent decisions:
- Text coverage analysis: Scans all pages, calculates percentage with substantial text
- Table detection: Looks for structural indicators (pipes, tabs, aligned data)
- Image presence: Identifies whether visual content exists
- Scanned vs. digital: Determines if OCR is beneficial
Only then does it decide which processing modules to activate.
Practical example for HR documents:
A scanned benefits summary (100% images) and a born-digital contract (100% text) hit the same pipeline but receive completely different processing paths:
Scanned Benefits Summary
├─ OCR: Enabled (full page)
├─ Table analysis: Enabled (benefits often tabular)
├─ Image description: Enabled (extract charts/graphs)
└─ Formula enrichment: Disabled (no formulas in images)
Born-Digital Contract
├─ OCR: Disabled (native text available)
├─ Table analysis: Enabled (might have signature sections)
├─ Image description: Disabled (no images)
└─ Formula enrichment: Enabled (contracts often contain calculations)
This adaptive approach eliminates wasted computation while maintaining quality.
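As an illustration (a minimal sketch, not the blueprint's actual configuration code), these decisions map onto Docling's per-format pipeline options; the flags below mirror the scanned-benefits path above and would normally be set by the content analysis step:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Flags mirror the "Scanned Benefits Summary" path above; in the blueprint
# they would be chosen per document by the content analysis step.
pdf_options = PdfPipelineOptions()
pdf_options.do_ocr = True              # scanned pages need OCR
pdf_options.do_table_structure = True  # benefits documents are often tabular

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_options)}
)
result = converter.convert("benefits_summary.pdf")  # filename is illustrative
document = result.document  # structured DoclingDocument, not a flat text blob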
2. Semantic Chunking: Respecting Conceptual Boundaries
Once content is extracted, the blueprint uses HybridChunker to segment documents intelligently:
Three principles:
- Structural awareness: Never split a paragraph mid-sentence; respect heading hierarchies
- Semantic continuity: Use embedding-based similarity to detect topic shifts
- Token-aware formatting: Leverage model-specific tokenization to avoid boundary errors
Why this matters:
In the earlier HR example, semantic chunking detects the discontinuity between “benefits structure” and “eligibility rules” because sentence-level embeddings show a similarity drop. The boundary is placed exactly where human readers perceive a conceptual shift—not at arbitrary token counts.
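A minimal sketch of wiring the HybridChunker to the BGE-M3 tokenizer (following Docling's documented chunking examples; the 512-token budget is illustrative):

from transformers import AutoTokenizer
from docling.chunking import HybridChunker

# Align chunk boundaries with the embedding model's own tokenizer, so the
# token budget used for chunking matches the one used at embedding time.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
chunker = HybridChunker(tokenizer=tokenizer, max_tokens=512, merge_peers=True)

# `document` is the DoclingDocument produced by the extraction step above.
chunks = list(chunker.chunk(dl_doc=document))
for chunk in chunks[:3]:
    print(chunk.text)  # each chunk respects structural and token boundaries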
3. Dual Embedding Strategy: Dense + Sparse
BGE-M3 (BAAI General Embedding) provides two complementary representations:
Dense embeddings (1024 dimensions):
- Capture semantic similarity in high-dimensional space
- "specialist visits" and "doctor appointments" land near each other
- Enable approximate nearest neighbor search (fast recall)
Sparse embeddings (250k learned term weights):
- Learned importance scores for each token (not TF-IDF)
- “$50 copay” gets high weight (semantically important for cost queries)
- “the” gets near-zero weight (semantically uninformative)
- Enable efficient filtering and interpretable retrieval
Why both matter:
A query “specialist visit copayment” needs:
- Dense search to find chunks about cost-sharing (semantic match)
- Sparse search to confirm “$50” and “specialist” are present (lexical match)
Combining both scores yields precision + recall.
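At query time, the two signals can be fused into a single relevance score. A minimal sketch, assuming the dense vectors and per-token weight dictionaries have already been produced by BGE-M3 (the 0.7/0.3 weighting is an illustrative starting point, not a tuned value):

import numpy as np

def hybrid_score(query_dense: np.ndarray, chunk_dense: np.ndarray,
                 query_sparse: dict[str, float], chunk_sparse: dict[str, float],
                 alpha: float = 0.7) -> float:
    # Semantic signal: cosine similarity over the 1024-dim dense vectors
    dense_sim = float(query_dense @ chunk_dense /
                      (np.linalg.norm(query_dense) * np.linalg.norm(chunk_dense)))
    # Lexical signal: dot product over the overlapping learned term weights
    sparse_sim = sum(w * chunk_sparse.get(tok, 0.0) for tok, w in query_sparse.items())
    return alpha * dense_sim + (1 - alpha) * sparse_sim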
The Agentic Enrichment Layer: Making Chunks Intelligent
Here’s where the blueprint diverges from conventional RAG. After extraction and embedding, chunks are processed by specialized LLM agents (using Pydantic AI) that add semantic metadata.
Five Concurrent Enrichment Agents
1. Summary Agent
- Generates one-sentence semantic summary
- Not for retrieval (would lose specificity)
- For human context, reranking signals, and cross-reference detection
2. Semantic Role Agent
- Classifies chunk purpose: “Eligibility Criteria”, “Benefit Description”, “Cost/Deductible”, “Exclusion”, “Procedure/Process”, etc.
- Enables role-based filtering: “Show me only exclusions”
- Improves retrieval precision for category-specific queries
3. Entity Extraction Agent
- Identifies named entities with semantic types: “Medical Coverage” (Benefit), “30 days” (Timeline), “$50” (Monetary Amount), “in-network” (Status)
- Builds entity graph for structured reasoning
- Enables “find all references to X” queries
4. Key Concepts Agent
- Identifies 3-5 core topics (e.g., [“Eligibility”, “Medical Coverage”, “Waiting Period”])
- Enables semantic clustering and topic-based browsing
- Supports faceted search interfaces
5. Cross-Reference Agent
- Detects hints to related sections: “See Section 3.2”, “As described under Benefits”
- Builds relationship graph between chunks
- Enables reasoning systems to follow references automatically
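To make those outputs concrete, here is a minimal sketch of the kind of typed schemas such agents can return (class and field names are illustrative, not the blueprint's actual models):

from pydantic import BaseModel, Field

class Entity(BaseModel):
    # e.g. name="$50", type="Monetary Amount"
    name: str
    type: str

class ChunkEnrichment(BaseModel):
    # Combined view of the five agents' outputs for one chunk; in practice
    # each agent returns its own typed result.
    summary: str = Field(description="One-sentence semantic summary")
    semantic_role: str = Field(description="e.g. 'Eligibility Criteria', 'Exclusion'")
    entities: list[Entity] = Field(default_factory=list)
    key_concepts: list[str] = Field(default_factory=list, description="3-5 core topics")
    cross_references: list[str] = Field(default_factory=list, description="e.g. 'See Section 3.2'")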
Why Pydantic AI?
Traditional LLM integration via raw HTTP calls requires:
- Manual JSON parsing with error handling
- Type coercion (strings → booleans, datetimes, etc.)
- Exception handling for malformed responses
Pydantic AI eliminates this friction:
from pydantic_ai import Agent

agent = Agent(model=ollama_model, result_type=list[Entity])
# LLM output is automatically validated against the Entity schema
# Type-safe throughout the pipeline
All five agents run concurrently (with semaphore limiting), processing 1000 chunks in ~15-30 minutes instead of hours.
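A rough sketch of that concurrency pattern, assuming the Agent API shown above and an already-configured Ollama model (result-attribute names vary slightly across pydantic-ai versions):

import asyncio
from pydantic_ai import Agent

# Hypothetical entity-extraction agent; `ollama_model` and `Entity` are
# assumed to be defined as in the snippets above.
entity_agent = Agent(model=ollama_model, result_type=list[Entity])

async def enrich_all(chunks: list[str], max_concurrency: int = 8):
    # The semaphore bounds in-flight LLM calls so the local Ollama
    # instance is not overwhelmed.
    sem = asyncio.Semaphore(max_concurrency)

    async def enrich_one(text: str):
        async with sem:
            result = await entity_agent.run(text)
            return result.data  # `.output` in newer pydantic-ai releases

    return await asyncio.gather(*(enrich_one(c) for c in chunks))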
Storage & Querying: Enriched Chunks as First-Class Citizens
Unlike generic vector stores, the blueprint stores enriched metadata in first-class PostgreSQL columns:
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
-- Core content
content TEXT,
embedding_dense vector(1024),
embedding_sparse sparsevec(250002),
-- Agentic enrichment (indexed for efficient querying)
summary TEXT,
semantic_role VARCHAR(100),
entities JSONB,
key_concepts TEXT[],
related_chunk_ids INTEGER[],
-- Structural metadata
page_no INTEGER,
headings TEXT[],
filename VARCHAR(255)
);
CREATE INDEX idx_semantic_role ON documents(semantic_role);
CREATE INDEX idx_entities ON documents USING GIN (entities);
CREATE INDEX idx_related_chunks ON documents USING GIN (related_chunk_ids);
This enables sophisticated retrieval:
-- Retrieve eligibility criteria related to part-time status
SELECT * FROM documents
WHERE semantic_role = 'Eligibility Criteria'
AND entities @> '[{"name": "part-time"}]'::jsonb
ORDER BY embedding_dense <=> query_embedding
LIMIT 10;
-- Follow cross-references
SELECT * FROM documents
WHERE id = ANY(
(SELECT related_chunk_ids FROM documents WHERE id = 42)
);
-- Role-aware filtering
SELECT DISTINCT semantic_role FROM documents
WHERE entities @> '[{"type": "Benefit"}]'::jsonb;
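Note that query_embedding in the similarity query above is a bound parameter, not a column. A minimal sketch of executing that query from Python with psycopg and the pgvector adapter (the connection string and placeholder vector are illustrative):

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# Stand-in for the user query's 1024-dim BGE-M3 dense vector
query_embedding = np.zeros(1024, dtype=np.float32)

with psycopg.connect("dbname=rag user=rag") as conn:
    register_vector(conn)  # lets psycopg send numpy arrays as pgvector values
    rows = conn.execute(
        """
        SELECT id, summary, semantic_role
        FROM documents
        WHERE semantic_role = 'Eligibility Criteria'
          AND entities @> '[{"name": "part-time"}]'::jsonb
        ORDER BY embedding_dense <=> %s
        LIMIT 10
        """,
        (query_embedding,),
    ).fetchall()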
Observability: Why Langfuse Matters
A RAG system in production faces invisible degradation:
- Embeddings drift over time
- OCR quality varies by document type
- LLM enrichment becomes inconsistent
Without observability, you won’t know until users report problems.
The blueprint integrates Langfuse, which traces every operation:
- Token counts per agent: Identify which agents consume most resources
- Latency breakdown: Detect bottlenecks (OCR? Embedding generation? Database?)
- Error tracking: Which document types fail enrichment?
- Cost analysis: Compute inference cost per document
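Instrumentation is largely declarative. A minimal sketch using the Langfuse Python SDK's observe decorator (the import path shown is the v2 SDK; newer releases expose it as from langfuse import observe):

from langfuse.decorators import observe

@observe()
def enrich_entities(chunk_text: str) -> list[dict]:
    # Inputs, outputs and latency of this call are recorded as a span.
    return []  # placeholder for the actual agent call

@observe()
def process_document(chunks: list[str]):
    # Decorated child calls are nested under this trace, producing a
    # per-step breakdown like the example below.
    return [enrich_entities(c) for c in chunks]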
Example trace:
process_document(benefits_manual.pdf)
├─ extract [150ms, 3.2MB memory]
├─ chunking [45ms, 42 chunks]
├─ embedding_dense [2100ms, 42 chunks × 1024 dims]
├─ embedding_sparse [1800ms, 42 chunks × sparse]
├─ enrich_summary [8400ms, 5 timeouts, 1 retry]
├─ enrich_role [6200ms, 0 errors]
├─ enrich_entities [7100ms, 2 partial failures]
├─ enrich_concepts [5800ms]
├─ enrich_references [4200ms]
└─ store [320ms, 42 rows inserted]
Total: ~36 seconds, 847 tokens used
This visibility is critical for debugging and optimizing production RAG systems.
Deployment: Local-First Infrastructure
The blueprint ships with a complete Docker stack:
Services:
- PostgreSQL (pgvector) - Vector storage + metadata
- Redis - Caching & queue management
- ClickHouse - Time-series trace storage
- Langfuse - Observability dashboard
- Ollama - Local LLM serving (with mistral for enrichment)
- MinIO - S3-compatible document storage
Why on-premise?
- Data privacy: Sensitive documents (HR, finance, legal) never leave your infrastructure
- Cost control: No per-token API fees; fixed infrastructure cost
- Latency: No network roundtrips to cloud providers
- Customization: Full control over models, prompts, and processing pipelines
Real-World Impact: HR Manual Use Case
To demonstrate the difference between naive RAG and intelligent RAG, consider a benefits question:
User query: “I’m part-time and have a chronic illness. What coverage options do I have?”
Naive RAG (token-based chunking, no enrichment):
- Embed query
- Vector search retrieves top-5 chunks
- Results:
- “Part-time employees are defined as…”
- “Chronic illness exclusions include…”
- “Coverage options available under Plan A…”
- (Random text about retirement plans)
- (Boilerplate legal language)
Problem: Results mix eligibility, exclusions, and coverage in random order. User must manually reason about what applies to them.
Agentic RAG (semantic chunking + enrichment):
- Embed query
- Dense search retrieves 20 candidates
- Filter by semantic_role IN ('Eligibility Criteria', 'Benefit Description', 'Exclusion')
- Rerank by entity match: prioritize chunks mentioning "part-time" + "chronic illness"
- Follow related_chunk_ids to surface connected policies (see the retrieval sketch after this example)
- Results:
- “Part-time employees are eligible for medical & dental (Eligibility)”
- “Chronic illness coverage: X condition covered, Y excluded (Benefit Description)”
- “Part-time medical copays: specialist visits $50 (Cost)”
- “Related: Out-of-pocket maximum policies (Cross-Reference)”
- “Related: Appeals process for coverage denials (Related)”
Outcome: Structured, contextual, actionable results with clear reasoning.
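A compact sketch of that staged retrieval, assuming a psycopg connection to the schema above and a caller-supplied embed_dense function for the BGE-M3 query vector (helper names and the hard-coded query terms are illustrative):

def agentic_retrieve(query: str, conn, embed_dense, top_k: int = 5):
    qvec = embed_dense(query)

    # Steps 1-2: dense search over 20 candidates, restricted to relevant roles
    candidates = conn.execute(
        """
        SELECT id, content, semantic_role, entities, related_chunk_ids
        FROM documents
        WHERE semantic_role IN ('Eligibility Criteria', 'Benefit Description', 'Exclusion')
        ORDER BY embedding_dense <=> %s
        LIMIT 20
        """,
        (qvec,),
    ).fetchall()

    # Step 3: rerank by entity match against the query's key terms
    wanted = {"part-time", "chronic illness"}  # extracted from the query in practice
    def entity_overlap(row):
        names = {e["name"].lower() for e in (row[3] or [])}  # entities JSONB column
        return len(wanted & names)
    top = sorted(candidates, key=entity_overlap, reverse=True)[:top_k]

    # Step 4: follow cross-references to surface connected policies
    related_ids = [rid for row in top for rid in (row[4] or [])]
    related = conn.execute(
        "SELECT id, content, semantic_role FROM documents WHERE id = ANY(%s)",
        (related_ids,),
    ).fetchall() if related_ids else []

    return top, related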
Technical Takeaways
1. Chunking is the Foundation
All downstream RAG quality depends on chunking. Semantic boundaries preserve meaning; semantic role enables filtering.
2. Embeddings Need Density
BGE-M3’s 1024 dimensions + learned sparse weights provide the signal necessary for domain-specific retrieval. 384-dim models lose nuance; summary-only embeddings lose specificity.
3. Enrichment is Worth the Latency
30 seconds to enrich 42 chunks pays dividends:
- Better filtering (90% fewer irrelevant results)
- Relationship graphs (discover connected documents)
- Explainability (why was this result returned?)
4. On-Premise Wins on Privacy + Cost
For regulated industries (healthcare, finance, legal), on-premise is non-negotiable. For cost-sensitive deployments (1M+ documents), on-premise is economical.
5. Observability is Non-Negotiable
Production RAG systems fail silently. Langfuse-style tracing catches degradation before users notice.
Conclusion: From Prototype to Production
The jump from “RAG works” to “RAG is production-ready” requires engineering discipline:
- Smart extraction (Docling + adaptive OCR)
- Semantic chunking (HybridChunker + boundary detection)
- Dual embeddings (dense + sparse via BGE-M3)
- Agentic enrichment (Pydantic AI agents for semantic metadata)
- Rich storage (PostgreSQL with first-class enrichment columns)
- Full visibility (Langfuse end-to-end tracing)
The Agentic RAG Blueprint demonstrates that this is achievable with open-source tools, Python 3.13+, and thoughtful architecture.
The result: RAG systems that don’t just retrieve documents—they understand them.
From Architecture to Implementation: Let’s Bridge Your RAG Gap
Building a prototype is easy; hardening a production-grade RAG system that handles 1M+ complex PDFs without “silent failures” is a multi-month engineering lift.
The Agentic RAG Blueprint described in this series isn’t just a conceptual framework—it is a proprietary, production-ready codebase developed to solve the most stubborn data extraction and retrieval challenges in regulated industries.
Why Partner With Us?
We don’t start from scratch. We deploy our audited reference architecture directly into your infrastructure, customized for your specific document types:
- Accelerated Deployment: Skip 6+ months of R&D with our pre-built Docling, Pydantic AI, and Langfuse integrations.
- Total Data Sovereignty: Our “Local-First” Docker stack ensures your sensitive data never leaves your firewall.
- Guaranteed Precision: We move beyond naive similarity search to hybrid, agent-enriched retrieval that matches human-level accuracy.
Schedule a Technical Strategy Session
If your current RAG implementation is struggling with complex layouts, losing context in chunks, or failing to scale on-premise, let’s talk.
We will walk you through a live demonstration of the blueprint using your own document samples and discuss how to integrate this architecture into your existing stack.
Book a RAG Strategy Consultation
Direct access to our lead architects. No sales fluff, just engineering.
References
- Docling: IBM’s intelligent document understanding framework
- BGE-M3: BAAI’s multilingual, dual-representation embedding model (1024 dense + 250k sparse)
- Pydantic AI: Type-safe agentic LLM integration
- Langfuse: Open-source observability for RAG and LLM applications
- PostgreSQL pgvector: Native vector support with advanced filtering