Cost-Aware Agentic Workflows with PydanticAI

Introduction: The Hidden Price of Autonomy

  • The Infinite Loop Problem: Why agentic workflows, especially those involving ReAct or recursive reasoning, are inherently more expensive than stateless RAG.
  • Economic Sustainability for SMBs: The necessity of moving away from "move fast and break things" towards structured governance for LLM spending.
  • The Solution: Implementing Cost Guardrails, a combination of pre-emptive token budgets and reactive human approval.

The Architecture of a Cost Guardrail

  • The Interceptor Pattern: The architecture relies on middleware that wraps the LLM call, checking consumption before and after every request.
  • Primary Roles:
    • PydanticAI: Enforces hard mathematical limits (total requests, total tokens) on the session.
    • LiteLLM: Provides semantic pricing intelligence, converting abstract token counts into actual currency (USD) in real-time.
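Before wiring in either library, the interceptor pattern can be sketched in plain Python. Everything below (`Budget`, `guarded_call`, `fake_llm`) is illustrative scaffolding, not PydanticAI or LiteLLM API; it simply shows the pre-call check and post-call accounting that the middleware performs:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    max_requests: int
    max_total_tokens: int
    requests_used: int = 0
    tokens_used: int = 0

class BudgetExceeded(Exception):
    pass

def guarded_call(budget: Budget, llm_call, *args, **kwargs):
    """Interceptor: check the budget before the call, record usage after it."""
    # Pre-call check: refuse to start a request we cannot afford.
    if budget.requests_used >= budget.max_requests:
        raise BudgetExceeded(f"request limit {budget.max_requests} reached")
    if budget.tokens_used >= budget.max_total_tokens:
        raise BudgetExceeded(f"token limit {budget.max_total_tokens} reached")

    response_text, tokens = llm_call(*args, **kwargs)  # the wrapped provider call

    # Post-call accounting: update consumption for the next check.
    budget.requests_used += 1
    budget.tokens_used += tokens
    return response_text

# Stand-in for a real provider call; returns (text, tokens consumed).
def fake_llm(prompt: str):
    return f"echo: {prompt}", len(prompt.split())

budget = Budget(max_requests=2, max_total_tokens=100)
print(guarded_call(budget, fake_llm, "hello world"))
```

In the real stack, PydanticAI's `UsageLimits` plays the role of `Budget` and raises its own exception, as shown in the next section.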

Implementing Usage Limits with PydanticAI

PydanticAI provides the primary library-level enforcement mechanism through its UsageLimits class.

from pydantic_ai import Agent
from pydantic_ai.usage import UsageLimits
from pydantic_ai.exceptions import UsageLimitExceeded

from loguru import logger

# 1. Define the agent
investor_agent = Agent(
    'openai:gpt-5',
    system_prompt="Analyze this stock portfolio. You may call tools multiple times."
)

# 2. Configure PydanticAI UsageLimits
budget_limits = UsageLimits(
    request_limit=10,           # Max 10 LLM calls per run
    request_tokens_limit=25000, # Max 25k prompt tokens
    response_tokens_limit=10000 # Max 10k completion tokens
)

try:
    # 3. Apply limits to the run
    result = investor_agent.run_sync(
        "Should I rebalance my tech portfolio?",
        usage_limits=budget_limits
    )
    logger.info(f"Success! Session Usage: {result.usage()}")

except UsageLimitExceeded as e:
    # 4. Catch the specific exception
    # Note: this triggers when ANY limit in UsageLimits is breached.
    logger.warning("--- GUARDRAIL TRIGGERED ---")
    logger.warning(f"Reason: {e}")
    # Proceed to HITL Handling Section

Real-Time Cost Tracking with LiteLLM

While PydanticAI manages counts, LiteLLM converts those counts to dollars.

import litellm
from pydantic_ai.usage import RunUsage

from loguru import logger

def calculate_dollar_cost(usage: RunUsage, model_name: str) -> float:
    """
    Utility function to map PydanticAI usage data to LiteLLM pricing.
    """
    # cost_per_token returns (prompt_cost_usd, completion_cost_usd) for the model
    prompt_cost, completion_cost = litellm.cost_per_token(
        model=model_name,
        prompt_tokens=usage.request_tokens,
        completion_tokens=usage.response_tokens
    )
    return float(prompt_cost + completion_cost)

# conceptual example after a run:
# total_usd = calculate_dollar_cost(result.usage(), "gpt-5")
# logger.info(f"Run Cost: ${total_usd:.4f}")

Detailed HITL Workflow: The Slack Intervention

For an SMB, a simple notification channel like Slack is often the most effective way to implement Human-in-the-Loop (HITL) control without building complex custom UIs.

graph TD
    A[User Request] -->|Initiates Run| B(PydanticAI Agent);
    B --> C{Pre-Call Check:\nUsage within limits?};
    C -- Yes --> D[LiteLLM Proxy];
    D -->|Calls LLM| E(LLM Provider);
    E -->|Returns Response| D;
    D --> F[PydanticAI Update Usage];
    F -->|Result| B;
    B -->|Final Answer| G[User];

    C -- No (Threshold Exceeded) --> H[Raise UsageLimitExceeded];
    H -->|Exception| I(HITL Intervention Handler);
    I -->|Serializes State| J[Database];
    I -->|Sends Approval Request| K[Slack API];
    K --> L(Slack Workflow App);
    L -->|Posts Message| M[Data Team Slack Channel];
    M -->|Approves/Denies| N[Human Reviewer];

    N -- Denies --> O[Agent Terminates,\nNotifies User];
    N -- Approves --> P[Slack App Triggers\nResume Endpoint];
    P --> Q(Resume Script);
    Q -->|Retrieves State| J;
    Q -->|Injects Message History| R(New PydanticAI Agent Instance);
    R -->|Continues Run| C;

1. Exception Handling and State Serialization

When UsageLimitExceeded is caught, the agent is paused. Its entire state must be preserved to resume later.

# (Continuing from the try/except block in the PydanticAI section;
#  requires `import uuid` at the top of the module)
except UsageLimitExceeded as e:
    # A. Access the partial result/history from the exception
    # Conceptual: the exception does not carry the message chain itself;
    # in practice, capture it from the agent's run or event stream.
    partial_history = e.partial_history

    # B. Serialize state under a fresh session ID
    session_id = str(uuid.uuid4())
    save_agent_state_to_db(session_id, partial_history, current_usage=e.usage)

    # C. Convert the tokens consumed so far into dollars via LiteLLM
    dollar_spent = calculate_dollar_cost(e.usage, "gpt-5")

    # D. Trigger the Slack workflow
    # trigger_slack_approval_request(session_id, dollar_spent, reason=str(e))
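The `save_agent_state_to_db` and `get_agent_history_from_db` helpers referenced above are left undefined. A minimal sketch backed by SQLite might look like the following; the schema, helper names, and hand-written history JSON are assumptions, and in a real deployment you would serialize PydanticAI's message objects to JSON rather than building the list by hand:

```python
import json
import sqlite3

def save_agent_state_to_db(conn, session_id: str, history_json: str, usage: dict):
    """Persist the paused run so a later approval can resume it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS agent_sessions "
        "(session_id TEXT PRIMARY KEY, history TEXT, usage TEXT, status TEXT)"
    )
    conn.execute(
        "INSERT INTO agent_sessions VALUES (?, ?, ?, 'PAUSED')",
        (session_id, history_json, json.dumps(usage)),
    )
    conn.commit()

def get_agent_history_from_db(conn, session_id: str) -> str:
    row = conn.execute(
        "SELECT history FROM agent_sessions WHERE session_id = ?", (session_id,)
    ).fetchone()
    return row[0]

# Example round-trip with an in-memory database
conn = sqlite3.connect(":memory:")
save_agent_state_to_db(
    conn, "abc-123",
    '[{"role": "user", "content": "rebalance?"}]',
    {"request_tokens": 25000, "response_tokens": 4000},
)
print(get_agent_history_from_db(conn, "abc-123"))
```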

2. The Slack Approval Workflow Structure

This process uses a configured Slack Workflow Builder app to handle the interaction.

  • A. The Trigger: The Python script makes a POST request to a specialized webhook managed by the Slack Workflow App, passing the session_id, dollar_spent, and a snippet of the context.
  • B. The Slack Message: The Workflow App formats this into an interactive message posted to a #data-ops or #ai-approvals channel:

    💰 Agent Budget Alert 💰
    Agent ID: investor_agent_982
    Status: PAUSED
    Reason: Token Limit Exceeded (Spent: $0.12)
    The agent needs another $0.20 budget to continue.
    [Approve & Add $0.20]  [Deny & Terminate]
  • C. The Human Action: A designated data team member reviews the message (or clicks a linked context log) and selects an option.
  • D. The Response Loop:
    • Slack sends the interaction payload back to your infrastructure (e.g., an AWS Lambda or FastAPI endpoint).
    • If Approved: The resumption script updates the DB, increases the budget context, and restarts the agent run (see below).
    • If Denied: The agent state in the DB is finalized, and a message is sent back to the user: “Agent run cancelled by supervisor.”
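A minimal trigger for step A might look like the sketch below. The webhook URL and payload fields are assumptions that must match whatever variables your Slack Workflow step expects, and the `send` parameter exists only so the function can be exercised without a network call:

```python
import json
from urllib import request

def trigger_slack_approval_request(session_id: str, dollar_spent: float,
                                   reason: str, webhook_url: str,
                                   send=None) -> dict:
    """Build and POST the approval payload for the Slack Workflow webhook."""
    payload = {
        "session_id": session_id,
        "status": "PAUSED",
        "reason": reason,
        "spent_usd": round(dollar_spent, 4),
    }
    body = json.dumps(payload).encode()
    if send is None:
        # Default transport: real HTTP POST to the workflow webhook.
        req = request.Request(webhook_url, data=body,
                              headers={"Content-Type": "application/json"})
        request.urlopen(req)
    else:
        # Injectable transport for dry-runs and tests.
        send(webhook_url, body)
    return payload

# Dry-run with a no-op transport (no network traffic)
sent = trigger_slack_approval_request(
    "abc-123", 0.1234, "Token limit exceeded",
    "https://hooks.slack.com/workflows/EXAMPLE",
    send=lambda url, body: None,
)
print(sent["spent_usd"])
```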

3. Resuming the Run

Resuming does not restart the task; it injects the previous history into a new agent instance.

# (FastAPI endpoint handling the 'Approve' webhook from Slack.
#  SlackApprovalPayload is a Pydantic model you define for Slack's payload.)
def resume_agent_endpoint(payload: SlackApprovalPayload):
    # 1. Verify the approval and extract the session_id
    session_id = payload.session_id

    # 2. Retrieve the serialized state from the DB
    original_history = get_agent_history_from_db(session_id)

    # 3. Create a NEW agent instance with higher limits
    extended_limits = UsageLimits(request_limit=15, total_tokens_limit=50000)

    # 4. RUN AGAIN, INJECTING THE HISTORY
    new_agent = Agent('openai:gpt-5')
    new_result = new_agent.run_sync(
        # We don't need the prompt again; the history already contains it.
        message_history=original_history,
        usage_limits=extended_limits
    )
    return {"status": "resumed", "final_result": new_result.data}

Best Practices for SMB Data Teams

  • Pre-Flight Estimation: Utilize litellm.token_counter on the initial prompt before starting the agent. If the starting prompt alone costs $0.10, perhaps GPT-4o is the wrong choice.
  • Model Routing by Cost: Implement PydanticAI logic that switches models dynamically. Use GPT-4o for strategic planning, but use GPT-4o-mini (via LiteLLM router) for repetitive data formatting or tool execution tasks to preserve budget.
  • Budget Tiers by Role: Define standard usage limit profiles based on context (e.g., DEV_BUDGET, RESEARCH_BUDGET, CLIENT_BUDGET).
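The pre-flight estimation from the first bullet can be sketched without any provider calls. The price table and the 4-characters-per-token heuristic below are illustrative stand-ins; in production, use litellm.token_counter for an exact count and LiteLLM's model cost map for current prices:

```python
# Rough per-1K-token input prices in USD -- illustrative numbers only;
# check your provider's current price sheet (LiteLLM's cost map automates this).
PRICES_PER_1K_INPUT = {"gpt-4o": 0.0025, "gpt-4o-mini": 0.00015}

def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token. In production, prefer
    # litellm.token_counter(model=..., text=...) for an exact count.
    return max(1, len(text) // 4)

def preflight_check(prompt: str, model: str, max_usd: float) -> tuple[bool, float]:
    """Return (within_budget, estimated_input_cost) before starting the run."""
    est_cost = estimate_tokens(prompt) * PRICES_PER_1K_INPUT[model] / 1000
    return est_cost <= max_usd, est_cost

ok, cost = preflight_check(
    "Should I rebalance my tech portfolio?" * 100, "gpt-4o", max_usd=0.10
)
print(ok, round(cost, 6))
```

If the check fails, route the request to a cheaper model (the second bullet) instead of starting an expensive run you will only have to pause.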

Conclusion: Governance as an Enabler

By combining PydanticAI’s native enforcement and LiteLLM’s pricing data, SMB data teams can deploy autonomous agents safely. This architecture moves beyond restriction and instead builds economically sustainable automation.

Authors

  • Marc Matt

    Senior Data Architect with 15+ years of experience helping Hamburg’s leading enterprises modernize their data infrastructure. I bridge the gap between legacy systems (SAP, Hadoop) and modern AI capabilities.

    I help clients:

    • Migrate & Modernize: Transitioning on-premise data warehouses to Google Cloud/AWS to reduce costs and increase agility.
    • Implement GenAI: Building secure RAG (Retrieval-Augmented Generation) pipelines to unlock value from internal knowledge bases using LangChain and Vector DBs.
    • Scale MLOps: Operationalizing machine learning models from PoC to production with Kubernetes and Airflow.

    Proven track record leading engineering teams.

  • Saidah Kafka
