1. Introduction: From Prototype to Enterprise
Building a Retrieval-Augmented Generation (RAG) system that works on a laptop is a common starting point, but it is rarely enough for a corporate environment. Deploying it to handle 100+ concurrent employees, each with unique access levels, real-time streaming requirements, and finite GPU resources, is an entirely different engineering challenge.
In this final installment of our agentic RAG series, we tackle The Human Interface. This is the orchestration layer that brings together everything we’ve built so far: intelligent retrieval, quality evaluation, and the infrastructure glue that makes it scale reliably. Ultimately, we are moving from “experimental AI” to a Production-Ready Answer Engine.
2. The FastAPI Orchestration Layer
In production, we use FastAPI as our backbone because it provides native async/await support for I/O-bound operations and built-in dependency injection for managing shared state, such as GPU resources.
Application State Management
To ensure a stable environment, we avoid global imports in favor of a Singleton AppState. This pattern ensures all services initialize in a strict dependency order: if PostgreSQL fails to connect during startup, the app fails immediately rather than throwing 503 errors mid-request.
class AppState:
    config: Optional[Config] = None
    db: Optional[AsyncConnection] = None
    rag_service: Optional[RAGQueryService] = None
    access_control: Optional[AccessControlManager] = None
    gpu_manager: Optional[GPUMemoryManager] = None
    inference_pool: Optional[InferencePool] = None
    rate_limiter: Optional[RateLimiter] = None
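The "fail at startup, not mid-request" behavior comes from initializing the state inside a lifespan context manager. Below is a minimal, runnable sketch of that sequence; the dict config and `connect_postgres` stub stand in for the real services, and in the actual app the context manager is wired in via `FastAPI(lifespan=lifespan)`:

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Optional

class AppState:
    """Singleton holding shared services; mirrors the class above."""
    config: Optional[dict] = None
    db: Optional[object] = None

app_state = AppState()

async def connect_postgres(cfg: dict) -> object:
    # Stub standing in for the real async connect; raises on bad config.
    if not cfg.get("dsn"):
        raise RuntimeError("PostgreSQL DSN missing -- aborting startup")
    return object()

@asynccontextmanager
async def lifespan(app=None):
    # Strict dependency order: config first, then DB. Any exception here
    # aborts startup before traffic is accepted, instead of 503-ing later.
    app_state.config = {"dsn": "postgresql://localhost/rag"}
    app_state.db = await connect_postgres(app_state.config)
    yield
    app_state.db = None  # teardown in reverse order
```

Because the exception escapes the lifespan, the process exits with a clear error instead of serving requests against half-initialized state.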
3. Handling Citations & Streaming Responses
3.1 Making Real-Time Feel Real
Users hate waiting 5–10 seconds for a complete response. To solve this, we use Server-Sent Events (SSE) to stream structured updates, so the user sees progress immediately, what we call "Zero-Latency Feedback."
async def stream_generator() -> AsyncGenerator[str, None]:
    # Stage 1: Pipeline status (Thinking UI)
    async for chunk in app_state.streaming_manager.stream_steps(steps):
        yield chunk

    # Stage 2: Retrieve and filter
    result = await app_state.rag_service.query(request.query)
    filtered_results = await app_state.access_control.filter_results_by_access(
        user_id, result.supporting_results
    )

    # Stage 3: Citations (Verification UI)
    async for chunk in app_state.streaming_manager.stream_citations(
        filtered_results, result.confidence
    ):
        yield chunk

    # Stage 4: Answer tokens (Content UI)
    async for chunk in app_state.streaming_manager.stream_tokens(token_gen()):
        yield chunk
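The streaming manager's internals are not shown above. As a sketch of what each stage emits, structured updates can be framed in the SSE wire format (`event:`/`data:` lines terminated by a blank line); `sse_event` and this standalone `stream_steps` are illustrative helpers, not the actual StreamingManager API:

```python
import asyncio
import json
from typing import AsyncGenerator

def sse_event(event: str, payload: dict) -> str:
    """Encode one structured update in Server-Sent Events wire format."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"

async def stream_steps(steps) -> AsyncGenerator[str, None]:
    # Stage 1 helper: emit one 'status' event per pipeline step so the
    # client can render a "thinking" UI before any answer tokens arrive.
    for step in steps:
        yield sse_event("status", {"step": step})
```

In FastAPI, the generator is served with `StreamingResponse(stream_generator(), media_type="text/event-stream")`, and the browser consumes it through a standard `EventSource`.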
3.2 Citation Traceability
Each citation includes a Document ID, a Page Number, and a Semantic Role. Because filter_results_by_access runs before anything is streamed, a junior analyst never sees sensitive metadata in their stream, even if the vector engine retrieved those chunks.
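A citation payload built around those three attributes might look like the following; the field names and the serialization helper are illustrative, since the article specifies the attributes but not the exact schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class Citation:
    # Illustrative schema: the three attributes each citation carries.
    document_id: str
    page_number: int
    semantic_role: str  # e.g. "definition", "evidence", "counterexample"

def to_sse_payload(citations: list) -> list[dict]:
    """Serialize filtered citations for the Stage 3 'Verification UI' frames."""
    return [asdict(c) for c in citations]
```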
4. GPU Memory Management: Solving the “OOM” Problem
A 7B-parameter model needs roughly 14 GB of VRAM for its FP16 weights alone, leaving very little room for the KV cache on a 16 GB card. Without careful management, concurrent requests will inevitably trigger Out-of-Memory (OOM) crashes.
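The arithmetic is worth making explicit. As a back-of-envelope estimate, assuming Llama-2-7B-like shapes (32 layers, 32 KV heads, head dimension 128, FP16 everywhere):

```python
# Back-of-envelope VRAM budget for a 7B model in FP16.
params = 7e9
weights_gb = params * 2 / 1e9                  # 2 bytes per FP16 weight -> ~14 GB

layers, kv_heads, head_dim = 32, 32, 128       # assumed Llama-2-7B-like shapes
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K and V, FP16
kv_gb_per_seq = kv_bytes_per_token * 4096 / 1e9             # full 4k context

print(f"weights: {weights_gb:.1f} GB")
print(f"KV cache @ 4k context: {kv_gb_per_seq:.1f} GB per sequence")
```

On a 16 GB card, that leaves room for roughly one full-context sequence after the weights are loaded, which is exactly why unmanaged concurrency OOMs.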
The Solution: Request Queuing
To mitigate this, we implement a priority queue where executives or high-priority API keys get priority=0. The GPUMemoryManager then ensures memory is only allocated when it is actually available.
class GPUMemoryManager:
    """Tracks and enforces GPU memory constraints."""

    def __init__(self, total_vram_gb: float = 16.0):
        self.total_vram_gb = total_vram_gb
        self.allocated_mb = 0.0
        self.lock = asyncio.Lock()

    async def reserve(self, required_mb: float) -> bool:
        async with self.lock:
            if self.allocated_mb + required_mb > self.total_vram_gb * 1024:
                return False  # would exceed capacity; caller should queue or retry
            self.allocated_mb += required_mb
            return True

    async def release(self, reserved_mb: float) -> None:
        # Without a matching release, every reservation leaks and the
        # manager eventually rejects all requests.
        async with self.lock:
            self.allocated_mb = max(0.0, self.allocated_mb - reserved_mb)
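The priority ordering itself can be sketched with asyncio's built-in PriorityQueue. The `QueuedRequest` wrapper and the payload strings are illustrative; a monotonic sequence number keeps equal-priority requests FIFO, and in production the worker loop would call the memory manager's `reserve` before dispatching each dequeued request:

```python
import asyncio
import itertools
from dataclasses import dataclass, field
from typing import Any

_seq = itertools.count()  # tiebreaker: equal priorities dequeue FIFO

@dataclass(order=True)
class QueuedRequest:
    priority: int                          # 0 = executive / high-priority API key
    seq: int = field(default_factory=lambda: next(_seq))
    payload: Any = field(default=None, compare=False)

async def drain(queue: asyncio.PriorityQueue) -> list:
    """Pop everything in priority order (stand-in for the worker loop)."""
    out = []
    while not queue.empty():
        out.append((await queue.get()).payload)
    return out

async def demo():
    q: asyncio.PriorityQueue = asyncio.PriorityQueue()
    await q.put(QueuedRequest(priority=2, payload="batch report"))
    await q.put(QueuedRequest(priority=0, payload="executive query"))
    await q.put(QueuedRequest(priority=1, payload="analyst query"))
    return await drain(q)
```

Running `asyncio.run(demo())` dequeues the executive query first regardless of arrival order.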
5. Role-Based Access Control (RBAC)
We enforce security at the Vector DB level to ensure data sovereignty. Documents are classified during ingestion (Part 1) with specific classification_level and department tags.
5.1 Pre-Retrieval Filtering (PostgreSQL)
CREATE TABLE embeddings (
id BIGSERIAL PRIMARY KEY,
content TEXT,
embedding vector(1024),
classification_level VARCHAR(50), -- public, internal, confidential, executive
department VARCHAR(50), -- hr, engineering, finance
allowed_roles TEXT[] -- Explicit whitelist
);
At query time, we use the PostgreSQL overlap operator (&&) to ensure the user’s roles match the document’s allowed roles.
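Putting that together, the pre-retrieval filter can be sketched as follows. The query text is illustrative (psycopg-style named parameters assumed); the essential pieces are the `&&` array-overlap predicate against `allowed_roles` and pgvector's `<=>` distance operator for the similarity ordering:

```python
# Hedged sketch of the pre-retrieval RBAC filter: candidates that the user
# is not allowed to see are excluded before ranking, not after.
RBAC_SEARCH_SQL = """
SELECT id, content, classification_level
FROM embeddings
WHERE allowed_roles && %(user_roles)s          -- user holds at least one allowed role
  AND classification_level = ANY(%(levels)s)   -- clearance ceiling
ORDER BY embedding <=> %(query_vec)s
LIMIT 150;
"""

def search_params(user_roles: list[str], clearance: list[str],
                  query_vec: list[float]) -> dict:
    """Bind parameters for the query above (named-parameter style)."""
    return {"user_roles": user_roles, "levels": clearance, "query_vec": query_vec}
```

Filtering at the SQL layer means a compromised or buggy application layer still cannot stream chunks the database never returned.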
6. Monitoring & Observability
We use Prometheus and Grafana to monitor system health. Custom metrics additionally give us a window into the agent's decision-making, surfacing when and why requests are rejected, throttled, or filtered.
Custom Metrics Implementation
from prometheus_client import Counter, Histogram, Gauge
# Track security violations and throughput
requests_total = Counter('rag_api_requests_total', 'Total requests', ['endpoint'])
access_violations = Counter('rag_api_access_violations_total', 'RBAC violations')
request_duration = Histogram('rag_api_request_duration_seconds', 'Latency', buckets=[0.5, 1.0, 5.0])
gpu_memory_allocated = Gauge('rag_api_gpu_memory_allocated_gb', 'VRAM utilization')
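Instrumenting a handler with these metrics is a one-liner per event. A self-contained sketch (using a dedicated CollectorRegistry so it does not collide with the module-level metrics above; `handle` is an illustrative stand-in for the real endpoint body):

```python
from prometheus_client import Counter, Histogram, CollectorRegistry, generate_latest

registry = CollectorRegistry()
requests_total = Counter('rag_api_requests_total', 'Total requests',
                         ['endpoint'], registry=registry)
request_duration = Histogram('rag_api_request_duration_seconds', 'Latency',
                             buckets=[0.5, 1.0, 5.0], registry=registry)

def handle(endpoint: str):
    # Count the call, then time the body; the Histogram observes on exit.
    requests_total.labels(endpoint=endpoint).inc()
    with request_duration.time():
        pass  # ... retrieval + generation would run here

handle('/query/stream')
exposition = generate_latest(registry).decode()  # Prometheus text format
print(exposition)
```

The exposition output is what Prometheus scrapes from the `/metrics` endpoint and what the Grafana panels query.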
7. Deployment Strategies
7.1 Docker Compose: The Quick Start
This approach is best for small teams (<200 users) and on-premise deployments.
services:
  postgres:
    image: pgvector/pgvector:pg17
  ollama:
    image: ollama/ollama
  rag-api:
    build: .
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
7.2 Kubernetes (Helm): The Enterprise Path
This path is best for 500+ employees and cloud-native scaling. We use Horizontal Pod Autoscalers (HPA) to scale pods based on utilization; the stock HPA below targets CPU, while scaling on GPU utilization requires exposing GPU metrics to the HPA as custom metrics.
# values.yaml snippet
rag-api:
  replicas: 3
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
  resources:
    limits:
      nvidia.com/gpu: "1"
8. Putting It All Together: The Request Lifecycle
The flow below demonstrates how these components interact:
sequenceDiagram
    participant U as User (Client)
    participant A as FastAPI Orchestrator
    participant Q as Priority Queue
    participant G as GPU Memory Manager
    participant R as RAG Pipeline (Parts 2 & 3)
    participant DB as Vector/SQL/KG
    participant E as Eval & RBAC
    U->>A: POST /query/stream
    A->>A: Validate RBAC & Rate Limits
    A->>Q: Enqueue Request (Priority)
    Q->>G: Check VRAM Capacity
    G-->>Q: Reserve VRAM
    Q->>R: Trigger Inference Worker
    R->>DB: Parallel Retrieval (Recall)
    DB-->>R: 150 Candidates
    R->>R: Cross-Encoder Reranking (Precision)
    R->>E: Filter by User Roles (RBAC)
    E-->>A: Streaming Citations & Status
    A-->>U: SSE Stream Starts
    R->>A: Generate Answer Tokens
    A-->>U: Final Answer + Metadata

9. Conclusion: Lab to Production
In summary, we have built an agentic RAG system that scales, secures, and streams information effectively. By moving filtering to query-time and verification to the quality gate, we transform a probabilistic search into an Enterprise Answer Engine that users can trust.
From Architecture to Implementation: Let’s Bridge Your RAG Gap
Building a prototype is easy; hardening a production-grade RAG system that handles 1M+ complex PDFs without “silent failures” is a multi-month engineering lift.
The Agentic RAG Blueprint described in this series isn't just a conceptual framework; it is a proprietary, production-ready codebase developed to solve the most stubborn data extraction and retrieval challenges in regulated industries.
Why Partner With Us?
We don’t start from scratch. We deploy our audited reference architecture directly into your infrastructure, customized for your specific document types:
- Accelerated Deployment: Skip 6+ months of R&D with our pre-built Docling, Pydantic AI, and Langfuse integrations.
- Total Data Sovereignty: Our “Local-First” Docker stack ensures your sensitive data never leaves your firewall.
- Guaranteed Precision: We move beyond naive similarity search to hybrid, agent-enriched retrieval that matches human-level accuracy.
Schedule a Technical Strategy Session
If your current RAG implementation is struggling with complex layouts, losing context in chunks, or failing to scale on-premise, let’s talk.
We will walk you through a live demonstration of the blueprint using your own document samples and discuss how to integrate this architecture into your existing stack.
Book a RAG Strategy Consultation
Direct access to our lead architects. No sales fluff, just engineering.