Module 11: Observability

Building the agent is half the job. Knowing what it's doing in production is the other half.

Why Agents Are Hard to Debug

Traditional software is deterministic. Same input, same output. You can read the code and predict the behavior.

AI agents are non-deterministic. The same input can produce different outputs depending on model temperature, retrieved context, and the agent's reasoning path. When something goes wrong, you can't just read the code. You need to trace the agent's entire decision chain.

What to Track

| Layer | What to Log | Why |
| --- | --- | --- |
| Request | Input, user/system context | Reproduce the issue |
| Orchestrator | Which subagents were dispatched, in what order | Understand the workflow |
| Subagent | Prompt sent, model response, tools called | Debug reasoning |
| Tool calls | Input, output, latency, errors | Find integration failures |
| RAG | Query, retrieved chunks, relevance scores | Debug retrieval quality |
| Guardrails | Triggered rules, blocked content | Security audit trail |
| Output | Final result, confidence, citations | Quality monitoring |
| Human review | Decision, override reason, time to review | Feedback loop |
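One way to make these layers traceable is to log every step as a structured event that carries a shared trace ID. The sketch below is a minimal, hypothetical schema (the field names and layer labels are illustrative, not from any particular library):

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class TraceEvent:
    """One logged step in an agent run (schema is illustrative)."""
    trace_id: str        # shared across every event in one run
    layer: str           # "request", "orchestrator", "subagent", "tool", ...
    name: str            # e.g. the tool or subagent name
    payload: dict        # layer-specific details: inputs, outputs, scores
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

# Every layer emits events tagged with the same trace_id, so a later
# query can reassemble the full run in order.
trace_id = str(uuid.uuid4())
event = TraceEvent(trace_id, "tool", "extract_pdf",
                   {"input": "contract_vendor_acme_2026.pdf",
                    "clauses_extracted": 12, "latency_s": 1.2})
line = event.to_json()
```

Because each line is self-describing JSON, the same stream can feed a log store, a trace viewer, and the metric aggregations discussed below.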

Traces, Not Just Logs

A log tells you something happened. A trace tells you why.

A trace for a contract review might look like:

[Trace: contract-review-2026-05-05-001]
├── Orchestrator received: contract_vendor_acme_2026.pdf
├── Dispatched: extraction_agent (parallel)
│ ├── Tool: extract_pdf → 12 clauses extracted (1.2s)
│ └── Output: structured_clauses.json
├── Dispatched: vendor_lookup_agent (parallel)
│ ├── Tool: lookup_vendor("Acme Corp") → 3 past reviews found (0.8s)
│ └── Output: vendor_history.json
├── Dispatched: compliance_agent (sequential, depends on extraction)
│ ├── RAG: retrieved 4 policy sections (relevance: 0.89, 0.87, 0.82, 0.71)
│ ├── Analysis: 2 deviations found
│ ├── Guardrail: liability_minimum_check → TRIGGERED (below $100K threshold)
│ └── Output: compliance_report.json (risk: HIGH)
├── Dispatched: financial_agent
│ ├── Tool: calculate_exposure → $2.1M total exposure (0.3s)
│ └── Output: financial_analysis.json
├── Human review: routed to legal (high risk)
│ ├── Reviewer: attorney@company.com
│ ├── Decision: APPROVED with amendments
│ └── Time to review: 4h 22m
└── Report generated and distributed (total: 4h 24m, agent time: 8.2s)

When the legal team says "the agent missed the non-compete clause," you trace back to the extraction step. Did it extract the clause? Yes. Did the compliance agent see it? Check the RAG retrieval. Was the non-compete policy in the top results? No, relevance score was 0.45 and it was filtered out. Found the bug.
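The key to catching that bug is logging what the retrieval step dropped, not just what it kept. A minimal sketch, assuming a hypothetical relevance cutoff of 0.7 (the threshold and document names are illustrative):

```python
RELEVANCE_THRESHOLD = 0.7  # hypothetical cutoff for this example

def filter_chunks(scored_chunks, threshold=RELEVANCE_THRESHOLD):
    """Split retrieved chunks into kept and dropped. Logging the dropped
    list is what lets a trace show exactly what the agent never saw."""
    kept = [(doc, score) for doc, score in scored_chunks if score >= threshold]
    dropped = [(doc, score) for doc, score in scored_chunks if score < threshold]
    return kept, dropped

# Scores mirror the trace above; the non-compete policy falls below the bar.
chunks = [("liability_policy", 0.89), ("indemnity_policy", 0.87),
          ("payment_terms", 0.82), ("termination_policy", 0.71),
          ("non_compete_policy", 0.45)]
kept, dropped = filter_chunks(chunks)
# dropped now contains ("non_compete_policy", 0.45): visible in the trace,
# invisible to the compliance agent.
```

Without the `dropped` list in the trace, the retrieval failure above would look identical to an extraction failure.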

Key Metrics

Operational metrics:

  • Agent response time (p50, p95, p99)
  • Tool call success/failure rates
  • Cost per agent run (tokens consumed)
  • Error rates by subagent

Quality metrics:

  • Human override rate (how often the reviewer changes the agent's recommendation)
  • RAG retrieval relevance scores
  • Guardrail trigger rates
  • Grounding validation pass/fail rates

Business metrics:

  • Contracts processed per day
  • Average time from submission to decision
  • Cost per contract review (agent + human time)
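Most of these metrics are simple aggregations over the trace stream. A sketch of two of them, latency percentiles and the human override rate, using a nearest-rank percentile and made-up sample data:

```python
def percentile(values, p):
    """Nearest-rank percentile over a list of latencies (seconds)."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Hypothetical per-run latencies pulled from trace records.
latencies = [0.8, 1.2, 1.1, 6.4, 0.9, 1.0, 1.3, 0.7, 5.9, 1.1]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)  # the slow tail that p50 hides

# Override rate: how often the human reviewer changed the agent's call.
reviews = [{"agent": "APPROVE", "human": "APPROVE"},
           {"agent": "APPROVE", "human": "REJECT"},
           {"agent": "REJECT",  "human": "REJECT"}]
override_rate = sum(r["agent"] != r["human"] for r in reviews) / len(reviews)
```

The point of tracking p95 and p99 alongside p50 is visible even in this toy data: the median run is fast, but the tail is several times slower.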

Alerting

Not every trace needs human attention. Set alerts for:

  • Error rate above threshold (agent is failing)
  • Human override rate spikes (agent quality is degrading)
  • Latency above SLA (performance regression)
  • Guardrail triggers above normal (possible prompt injection or bad input pattern)
  • Cost per run spikes (model or retrieval changes burning money)
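The alerts above all reduce to comparing a windowed metric against a ceiling. A minimal sketch (every threshold value here is a placeholder, not a recommendation):

```python
def check_alerts(window, thresholds):
    """Compare a window of aggregated metrics against per-metric ceilings
    and return the names of metrics that should fire an alert."""
    return [name for name, value in window.items()
            if name in thresholds and value > thresholds[name]]

# Hypothetical ceilings; tune these against your own baseline.
thresholds = {"error_rate": 0.05, "override_rate": 0.20,
              "p95_latency_s": 30.0, "guardrail_trigger_rate": 0.10,
              "cost_per_run_usd": 0.50}

# Last hour's aggregates from the trace stream.
window = {"error_rate": 0.02, "override_rate": 0.34,
          "p95_latency_s": 12.1, "guardrail_trigger_rate": 0.04,
          "cost_per_run_usd": 0.61}

alerts = check_alerts(window, thresholds)
# Here both override_rate and cost_per_run_usd exceed their ceilings.
```

In practice you would hand this comparison to your monitoring system (the lab below uses CloudWatch alarms), but the logic is the same: fixed or baseline-relative ceilings over windowed aggregates.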

What's Next

Observability tells you what's happening. In Module 12: Evaluation, we cover how to systematically measure whether your agent is good at its job and getting better over time.

Premium

Observability Lab

Build a complete observability pipeline with structured traces, CloudWatch dashboards, anomaly detection alerts, and cost tracking for multi-agent systems.