Part 4: The Human Interface — Enterprise RAG Deployment for 100+ Users

1. Introduction: From Prototype to Enterprise

Building a Retrieval-Augmented Generation (RAG) system that works on a laptop is a common starting point, but it is rarely enough for a corporate environment. Deploying it to handle 100+ concurrent employees, each with unique access levels, real-time streaming requirements, and finite GPU resources, is an entirely different engineering challenge.

In this final installment of our agentic RAG series, we tackle The Human Interface. This is the orchestration layer that brings together everything we’ve built so far: intelligent retrieval, quality evaluation, and the infrastructure glue that makes it scale reliably. Ultimately, we are moving from “experimental AI” to a Production-Ready Answer Engine.

2. The FastAPI Orchestration Layer

In production, we use FastAPI as our backbone because it provides native async/await support for I/O-bound operations and built-in dependency injection for managing shared state, such as GPU resources.

Application State Management

To ensure a stable environment, we avoid global imports in favor of a Singleton AppState. This pattern ensures all services initialize in a strict dependency order: if PostgreSQL fails to connect during startup, the app fails immediately rather than returning 503 errors mid-request.

from typing import Optional

class AppState:
    """Singleton container for all shared services, initialized in dependency order."""
    config: Optional[Config] = None
    db: Optional[AsyncConnection] = None                   # PostgreSQL + pgvector
    rag_service: Optional[RAGQueryService] = None          # retrieval pipeline (Parts 2 & 3)
    access_control: Optional[AccessControlManager] = None  # RBAC (Section 5)
    gpu_manager: Optional[GPUMemoryManager] = None         # VRAM accounting (Section 4)
    inference_pool: Optional[InferencePool] = None
    rate_limiter: Optional[RateLimiter] = None
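On startup, these services can be wired together with a fail-fast initializer. The sketch below is our own illustration, not code from the actual service (`init_app_state` and `StartupError` are assumed names): each service initializes in order, and any failure, such as PostgreSQL refusing connections, aborts startup immediately.

```python
import asyncio

class StartupError(RuntimeError):
    """Raised when a required service fails to initialize at startup."""

async def init_app_state(initializers):
    """Run (name, async_init_fn) pairs in strict dependency order, failing fast.

    If any step raises, startup aborts with a clear error instead of the app
    limping along and serving 503s mid-request.
    """
    state = {}
    for name, init in initializers:
        try:
            state[name] = await init()
        except Exception as exc:
            raise StartupError(f"startup aborted: {name} failed: {exc}") from exc
    return state
```

In FastAPI, this kind of initializer would typically be called from a lifespan handler, so a dead database kills the process before it ever accepts traffic.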

3. Handling Citations & Streaming Responses

3.1 Making Real-Time Feel Real

Users hate waiting 5–10 seconds for a complete response. To solve this, we use Server-Sent Events (SSE) to stream structured updates, providing "Zero-Latency Feedback": the user sees progress immediately while the answer is still being generated.

from typing import AsyncGenerator

async def stream_generator() -> AsyncGenerator[str, None]:
    # Stage 1: Pipeline status (Thinking UI)
    async for chunk in app_state.streaming_manager.stream_steps(steps):
        yield chunk
    
    # Stage 2: Retrieve and filter
    result = await app_state.rag_service.query(request.query)
    filtered_results = await app_state.access_control.filter_results_by_access(
        user_id, result.supporting_results
    )
    
    # Stage 3: Citations (Verification UI)
    async for chunk in app_state.streaming_manager.stream_citations(
        filtered_results, result.confidence
    ):
        yield chunk
    
    # Stage 4: Answer tokens (Content UI)
    async for chunk in app_state.streaming_manager.stream_tokens(token_gen()):
        yield chunk
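Each chunk yielded by the generator must follow the SSE wire format: `event:` and `data:` lines terminated by a blank line. A minimal framing helper, our own sketch of what a streaming manager does internally (`sse_event` is an illustrative name), looks like this:

```python
import json

def sse_event(event_type: str, payload: dict) -> str:
    """Frame a payload as a Server-Sent Events message.

    An SSE message is an `event:` line naming the event type, a `data:` line
    carrying the JSON payload, and a blank line as the terminator. The
    browser's EventSource API dispatches each message by its event type.
    """
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"
```

Distinct event types ("status", "citation", "token") let the frontend render the Thinking, Verification, and Content UIs from a single stream.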

3.2 Citation Traceability

Each citation includes a Document ID, a Page Number, and a Semantic Role. Because filter_results_by_access runs before streaming, a junior analyst never sees sensitive metadata in their stream, even if those chunks were retrieved by the vector engine.
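A citation payload can be modeled as a small dataclass. The field names below are illustrative, chosen to match the Document ID, Page Number, and Semantic Role described above, plus the classification level that access control checks before a citation is streamed:

```python
from dataclasses import dataclass, asdict

@dataclass
class Citation:
    """One traceable source reference attached to a streamed answer."""
    document_id: str
    page_number: int
    semantic_role: str         # e.g. "definition", "evidence", "example"
    classification_level: str  # checked against the user's clearance before streaming

# Example payload as it would be serialized into an SSE "citation" event
citation = Citation(
    document_id="doc-42",
    page_number=7,
    semantic_role="evidence",
    classification_level="internal",
)
```

Serializing with `asdict(citation)` yields a JSON-ready dict for the verification UI.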

4. GPU Memory Management: Solving the “OOM” Problem

A 7B-parameter model at FP16 precision already needs ~14GB of VRAM for its weights, leaving very little room on a 16GB card for the KV cache. Without careful management, concurrent requests will inevitably trigger "Out of Memory" (OOM) crashes.

The Solution: Request Queuing

To mitigate this, we implement a priority queue in which executives and high-priority API keys get priority=0. The GPUMemoryManager then ensures memory is only allocated when it is actually available.

import asyncio

class GPUMemoryManager:
    """Tracks and enforces GPU memory constraints."""
    def __init__(self, total_vram_gb: float = 16.0):
        self.total_vram_gb = total_vram_gb
        self.allocated_mb = 0.0
        self.lock = asyncio.Lock()
    
    async def reserve(self, required_mb: float) -> bool:
        """Atomically reserve VRAM; returns False if the request would overflow."""
        async with self.lock:
            if self.allocated_mb + required_mb > self.total_vram_gb * 1024:
                return False
            self.allocated_mb += required_mb
            return True
    
    async def release(self, reserved_mb: float) -> None:
        """Return VRAM to the pool once a request completes."""
        async with self.lock:
            self.allocated_mb = max(0.0, self.allocated_mb - reserved_mb)
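The queueing side can be sketched with `asyncio.PriorityQueue`, where a lower number dequeues first and a sequence counter breaks ties FIFO. The helper below is hypothetical, written only to illustrate the ordering behavior:

```python
import asyncio

async def drain_in_priority_order(items):
    """Enqueue (priority, payload) pairs and drain them lowest-number-first.

    asyncio.PriorityQueue orders entries by tuple comparison, so priority 0
    (executives, high-priority API keys) always dequeues ahead of priority 2
    batch traffic. A monotonically increasing sequence number breaks ties,
    keeping same-priority requests in FIFO order.
    """
    queue = asyncio.PriorityQueue()
    for seq, (priority, payload) in enumerate(items):
        await queue.put((priority, seq, payload))
    drained = []
    while not queue.empty():
        _, _, payload = await queue.get()
        drained.append(payload)
    return drained
```

In the real worker loop, each dequeued request would then call `gpu_manager.reserve(...)` and re-enqueue (or wait) if VRAM is exhausted.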

5. Role-Based Access Control (RBAC)

We enforce security at the vector-DB level to ensure data sovereignty: documents are classified during ingestion (Part 1) with specific classification_level and department tags.

5.1 Pre-Retrieval Filtering (PostgreSQL)

CREATE TABLE embeddings (
    id BIGSERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1024),
    classification_level VARCHAR(50), -- public, internal, confidential, executive
    department VARCHAR(50),           -- hr, engineering, finance
    allowed_roles TEXT[]              -- Explicit whitelist
);

At query time, we use the PostgreSQL overlap operator (&&) to ensure the user’s roles match the document’s allowed roles.
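As a sketch, the filter clause can be built as a parameterized SQL fragment in Python (`build_rbac_filter` is our illustrative helper, not part of the codebase); it relies only on PostgreSQL's standard `&&` array-overlap operator:

```python
def build_rbac_filter(user_roles: list[str]) -> tuple[str, dict]:
    """Return a parameterized WHERE fragment for RBAC pre-filtering.

    `allowed_roles && %(roles)s` evaluates to true when any of the user's
    roles appears in the document's whitelist, so unauthorized rows are
    excluded inside PostgreSQL, before similarity ranking ever sees them.
    Parameter binding (not string interpolation) keeps it injection-safe.
    """
    return "allowed_roles && %(roles)s", {"roles": user_roles}
```

The fragment would be appended to the vector-search query's WHERE clause and the params dict passed to the async database driver alongside the query embedding.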

6. Monitoring & Observability

We use Prometheus and Grafana to monitor system health. In addition, custom metrics give us a window into the agent's decision-making: request throughput, RBAC violations, latency, and VRAM utilization.

Custom Metrics Implementation

from prometheus_client import Counter, Histogram, Gauge

# Track security violations and throughput
requests_total = Counter('rag_api_requests_total', 'Total requests', ['endpoint'])
access_violations = Counter('rag_api_access_violations_total', 'RBAC violations')
request_duration = Histogram('rag_api_request_duration_seconds', 'Latency', buckets=[0.5, 1.0, 5.0])
gpu_memory_allocated = Gauge('rag_api_gpu_memory_allocated_gb', 'VRAM utilization')

7. Deployment Strategies

7.1 Docker Compose: The Quick Start

This approach is best for small teams (<200 users) and on-premise deployments.

services:
  postgres:
    image: pgvector/pgvector:pg17
  ollama:
    image: ollama/ollama
  rag-api:
    build: .
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

7.2 Kubernetes (Helm): The Enterprise Path

Conversely, this path is best for 500+ employees and cloud-native scaling. We use a Horizontal Pod Autoscaler (HPA) to scale API pods; the snippet below scales on CPU utilization, since autoscaling on GPU utilization requires exposing it as a custom metric (e.g., via NVIDIA's DCGM exporter and a Prometheus adapter).


# values.yaml snippet
rag-api:
  replicas: 3
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
  resources:
    limits:
      nvidia.com/gpu: "1"

8. Putting It All Together: The Request Lifecycle

The flow below demonstrates how these components interact:

sequenceDiagram
    participant U as User (Client)
    participant A as FastAPI Orchestrator
    participant Q as Priority Queue
    participant G as GPU Memory Manager
    participant R as RAG Pipeline (Parts 2 & 3)
    participant DB as Vector/SQL/KG
    participant E as Eval & RBAC

    U->>A: POST /query/stream
    A->>A: Validate RBAC & Rate Limits
    A->>Q: Enqueue Request (Priority)
    Q->>G: Check VRAM Capacity
    G-->>Q: Reserve VRAM
    Q->>R: Trigger Inference Worker
    R->>DB: Parallel Retrieval (Recall)
    DB-->>R: 150 Candidates
    R->>R: Cross-Encoder Reranking (Precision)
    R->>E: Filter by User Roles (RBAC)
    E-->>A: Streaming Citations & Status
    A-->>U: SSE Stream Starts
    R->>A: Generate Answer Tokens
    A-->>U: Final Answer + Metadata

9. Conclusion: Lab to Production

In summary, we have built an agentic RAG system that scales reliably, enforces access control, and streams answers in real time. By moving filtering to query time and verification to the quality gate, we transform a probabilistic search into an Enterprise Answer Engine that users can trust.



From Architecture to Implementation: Let’s Bridge Your RAG Gap

Building a prototype is easy; hardening a production-grade RAG system that handles 1M+ complex PDFs without “silent failures” is a multi-month engineering lift.

The Agentic RAG Blueprint described in this series isn't just a conceptual framework; it is a proprietary, production-ready codebase developed to solve the most stubborn data extraction and retrieval challenges in regulated industries.

Why Partner With Us?

We don’t start from scratch. We deploy our audited reference architecture directly into your infrastructure, customized for your specific document types:

  • Accelerated Deployment: Skip 6+ months of R&D with our pre-built Docling, Pydantic AI, and Langfuse integrations.
  • Total Data Sovereignty: Our “Local-First” Docker stack ensures your sensitive data never leaves your firewall.
  • Guaranteed Precision: We move beyond naive similarity search to hybrid, agent-enriched retrieval that matches human-level accuracy.

Schedule a Technical Strategy Session

If your current RAG implementation is struggling with complex layouts, losing context in chunks, or failing to scale on-premise, let’s talk.

We will walk you through a live demonstration of the blueprint using your own document samples and discuss how to integrate this architecture into your existing stack.

Book a RAG Strategy Consultation

Direct access to our lead architects. No sales fluff, just engineering.

Authors

  • Marc Matt

    Senior Data Architect with 15+ years of experience helping Hamburg’s leading enterprises modernize their data infrastructure. I bridge the gap between legacy systems (SAP, Hadoop) and modern AI capabilities.

    I help clients:

    • Migrate & Modernize: Transitioning on-premise data warehouses to Google Cloud/AWS to reduce costs and increase agility.
    • Implement GenAI: Building secure RAG (Retrieval-Augmented Generation) pipelines to unlock value from internal knowledge bases using LangChain and Vector DBs.
    • Scale MLOps: Operationalizing machine learning models from PoC to production with Kubernetes and Airflow.

    Proven track record leading engineering teams.

  • Saidah Kafka
