Module 11: Observability
Building the agent is half the job. Knowing what it's doing in production is the other half.
Why Agents Are Hard to Debug
Traditional software is deterministic. Same input, same output. You can read the code and predict the behavior.
AI agents are non-deterministic. The same input can produce different outputs depending on model temperature, retrieved context, and the agent's reasoning path. When something goes wrong, you can't just read the code. You need to trace the agent's entire decision chain.
What to Track
| Layer | What to Log | Why |
|---|---|---|
| Request | Input, user/system context | Reproduce the issue |
| Orchestrator | Which subagents were dispatched, in what order | Understand the workflow |
| Subagent | Prompt sent, model response, tools called | Debug reasoning |
| Tool calls | Input, output, latency, errors | Find integration failures |
| RAG | Query, retrieved chunks, relevance scores | Debug retrieval quality |
| Guardrails | Triggered rules, blocked content | Security audit trail |
| Output | Final result, confidence, citations | Quality monitoring |
| Human review | Decision, override reason, time to review | Feedback loop |
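Every layer can share one structured record format keyed by a trace ID, so events from different layers can be stitched back together later. Here is a minimal sketch in Python, assuming a JSON-lines sink; the field names and the `emit()` helper are illustrative, not a fixed schema:

```python
# Minimal structured-logging sketch for the layers in the table above.
# Field names and emit() are illustrative, not a fixed schema.
import json
import time
import uuid

def emit(layer, trace_id, **fields):
    """Write one JSON-lines record; every record shares the trace_id."""
    record = {"ts": time.time(), "trace_id": trace_id, "layer": layer, **fields}
    print(json.dumps(record))  # swap print() for your real log sink

trace_id = str(uuid.uuid4())
emit("request", trace_id, input="contract_vendor_acme_2026.pdf", user="analyst@example.com")
emit("tool", trace_id, name="extract_pdf", latency_ms=1200, status="ok", clauses=12)
emit("rag", trace_id, query="liability cap policy", scores=[0.89, 0.87, 0.82, 0.71])
emit("guardrail", trace_id, rule="liability_minimum_check", triggered=True)
```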
Traces, Not Just Logs
A log tells you something happened. A trace tells you why.
A trace for a contract review might look like:
```
[Trace: contract-review-2026-05-05-001]
├── Orchestrator received: contract_vendor_acme_2026.pdf
├── Dispatched: extraction_agent (parallel)
│ ├── Tool: extract_pdf → 12 clauses extracted (1.2s)
│ └── Output: structured_clauses.json
├── Dispatched: vendor_lookup_agent (parallel)
│ ├── Tool: lookup_vendor("Acme Corp") → 3 past reviews found (0.8s)
│ └── Output: vendor_history.json
├── Dispatched: compliance_agent (sequential, depends on extraction)
│ ├── RAG: retrieved 4 policy sections (relevance: 0.89, 0.87, 0.82, 0.71)
│ ├── Analysis: 2 deviations found
│ ├── Guardrail: liability_minimum_check → TRIGGERED (below $100K threshold)
│ └── Output: compliance_report.json (risk: HIGH)
├── Dispatched: financial_agent
│ ├── Tool: calculate_exposure → $2.1M total exposure (0.3s)
│ └── Output: financial_analysis.json
├── Human review: routed to legal (high risk)
│ ├── Reviewer: attorney@company.com
│ ├── Decision: APPROVED with amendments
│ └── Time to review: 4h 22m
└── Report generated and distributed (total: 4h 24m, agent time: 8.2s)
```
When the legal team says "the agent missed the non-compete clause," you trace back to the extraction step. Did it extract the clause? Yes. Did the compliance agent see it? Check the RAG retrieval. Was the non-compete policy in the top results? No, relevance score was 0.45 and it was filtered out. Found the bug.
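How do you capture that chain in the first place? In production you would likely reach for a tracing library such as OpenTelemetry, but the core mechanism is small enough to sketch: every span records its parent, its timing, and arbitrary attributes under a shared trace ID. The `span()` helper below is a hypothetical illustration, not a real library API:

```python
# Bare-bones span recorder producing the parent/child structure shown
# in the trace above. The span() helper is illustrative only.
import time
import uuid
from contextlib import contextmanager

TRACE_ID = str(uuid.uuid4())  # one id shared by every span in the run
SPANS = []                    # flushed to your trace store in a real system
_STACK = []                   # ancestry of the currently open spans

@contextmanager
def span(name, **attrs):
    sid = str(uuid.uuid4())
    parent = _STACK[-1] if _STACK else None
    _STACK.append(sid)
    start = time.time()
    try:
        yield attrs  # callers can attach attributes mid-span
    finally:
        _STACK.pop()
        SPANS.append({
            "trace_id": TRACE_ID, "span_id": sid, "parent_id": parent,
            "name": name, "duration_s": round(time.time() - start, 3),
            **attrs,
        })

with span("orchestrator", input="contract_vendor_acme_2026.pdf"):
    with span("extraction_agent") as s:
        s["clauses_extracted"] = 12
    with span("compliance_agent") as s:
        s["rag_scores"] = [0.89, 0.87, 0.82, 0.71]
        s["guardrails_triggered"] = ["liability_minimum_check"]
```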
Key Metrics
Operational metrics:
- Agent response time (p50, p95, p99)
- Tool call success/failure rates
- Cost per agent run (tokens consumed)
- Error rates by subagent
Quality metrics:
- Human override rate (how often the reviewer changes the agent's recommendation; see the sketch after this list)
- RAG retrieval relevance scores
- Guardrail trigger rates
- Grounding validation pass/fail rates
Business metrics:
- Contracts processed per day
- Average time from submission to decision
- Cost per contract review (agent + human time)
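Most of these fall straight out of the trace data you are already collecting. As a rough sketch with illustrative input data, here is how the latency percentiles and the human override rate might be computed:

```python
# Sketch of computing two of the metrics above from recorded trace data.
# The latencies and reviews inputs are illustrative.
import statistics

def percentiles(latencies_ms):
    qs = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

def override_rate(reviews):
    """Share of human reviews that changed the agent's recommendation."""
    overridden = sum(1 for r in reviews if r["decision"] != r["agent_recommendation"])
    return overridden / len(reviews) if reviews else 0.0

latencies = [820, 910, 1200, 1450, 7800, 950, 1010]  # ms, per agent run
reviews = [
    {"agent_recommendation": "APPROVE", "decision": "APPROVE"},
    {"agent_recommendation": "APPROVE", "decision": "REJECT"},  # an override
]
print(percentiles(latencies))
print(override_rate(reviews))  # 0.5
```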
Alerting
Not every trace needs human attention. Set alerts for:
- Error rate above threshold (agent is failing)
- Human override rate spikes (agent quality is degrading; see the alarm sketch after this list)
- Latency above SLA (performance regression)
- Guardrail triggers above normal (possible prompt injection or bad input pattern)
- Cost per run spikes (model or retrieval changes burning money)
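Since the lab below uses CloudWatch, here is what one of these alerts might look like as a CloudWatch alarm via boto3. The namespace, metric name, threshold, and SNS topic are assumptions for illustration; the `HumanOverrideRate` metric would be emitted by your own pipeline (e.g., with `put_metric_data`):

```python
# Sketch of one alert from the list above: human override rate spiking.
# Namespace, metric name, threshold, and SNS topic are assumed values.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="agent-override-rate-high",
    Namespace="ContractAgent",           # custom namespace (assumed)
    MetricName="HumanOverrideRate",      # emitted by your own trace pipeline
    Statistic="Average",
    Period=3600,                         # evaluate hourly
    EvaluationPeriods=3,                 # sustained for 3 hours
    Threshold=0.25,                      # >25% overrides = quality drift
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:agent-alerts"],  # assumed SNS topic
)
```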
What's Next
Observability tells you what's happening. In Module 12: Evaluation, we cover how to systematically measure whether your agent is good at its job and getting better over time.
Observability Lab
Build a complete observability pipeline with structured traces, CloudWatch dashboards, anomaly detection alerts, and cost tracking for multi-agent systems.