Module 11: Observability
Building the agent is half the job. Knowing what it's doing in production is the other half.
Why Agents Are Hard to Debug
Traditional software is deterministic. Same input, same output. You can read the code and predict the behavior.
AI agents are non-deterministic. The same input can produce different outputs depending on model temperature, retrieved context, and the agent's reasoning path. When something goes wrong, you can't just read the code. You need to trace the agent's entire decision chain.
What to Track
| Layer | What to Log | Why |
|---|---|---|
| Request | Input, user/system context | Reproduce the issue |
| Orchestrator | Which subagents were dispatched, in what order | Understand the workflow |
| Subagent | Prompt sent, model response, tools called | Debug reasoning |
| Tool calls | Input, output, latency, errors | Find integration failures |
| RAG | Query, retrieved chunks, relevance scores | Debug retrieval quality |
| Guardrails | Triggered rules, blocked content | Security audit trail |
| Output | Final result, confidence, citations | Quality monitoring |
| Human review | Decision, override reason, time to review | Feedback loop |
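Every layer can share one structured record format keyed by a trace ID, so events from different layers can be stitched back together later. Here is a minimal sketch in Python, assuming a JSON-lines sink; the field names and the `emit()` helper are illustrative, not a fixed schema:

```python
# Minimal structured-logging sketch for the layers in the table above.
# Field names and emit() are illustrative, not a fixed schema.
import json
import time
import uuid

def emit(layer, trace_id, **fields):
    """Write one JSON-lines record; every record shares the trace_id."""
    record = {"ts": time.time(), "trace_id": trace_id, "layer": layer, **fields}
    print(json.dumps(record))  # swap print() for your real log sink

trace_id = str(uuid.uuid4())
emit("request", trace_id, input="contract_vendor_acme_2026.pdf", user="analyst@example.com")
emit("tool", trace_id, name="extract_pdf", latency_ms=1200, status="ok", clauses=12)
emit("rag", trace_id, query="liability cap policy", scores=[0.89, 0.87, 0.82, 0.71])
emit("guardrail", trace_id, rule="liability_minimum_check", triggered=True)
```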
Traces, Not Just Logs
A log tells you something happened. A trace tells you why.
A trace for a contract review might look like:
```
[Trace: contract-review-2026-05-05-001]
├── Orchestrator received: contract_vendor_acme_2026.pdf
├── Dispatched: extraction_agent (parallel)
│ ├── Tool: extract_pdf → 12 clauses extracted (1.2s)
│ └── Output: structured_clauses.json
├── Dispatched: vendor_lookup_agent (parallel)
│ ├── Tool: lookup_vendor("Acme Corp") → 3 past reviews found (0.8s)
│ └── Output: vendor_history.json
├── Dispatched: compliance_agent (sequential, depends on extraction)
│ ├── RAG: retrieved 4 policy sections (relevance: 0.89, 0.87, 0.82, 0.71)
│ ├── Analysis: 2 deviations found
│ ├── Guardrail: liability_minimum_check → TRIGGERED (below $100K threshold)
│ └── Output: compliance_report.json (risk: HIGH)
├── Dispatched: financial_agent
│ ├── Tool: calculate_exposure → $2.1M total exposure (0.3s)
│ └── Output: financial_analysis.json
├── Human review: routed to legal (high risk)
│ ├── Reviewer: attorney@company.com
│ ├── Decision: APPROVED with amendments
│ └── Time to review: 4h 22m
└── Report generated and distributed (total: 4h 24m, agent time: 8.2s)
```
When the legal team says "the agent missed the non-compete clause," you trace back to the extraction step. Did it extract the clause? Yes. Did the compliance agent see it? Check the RAG retrieval. Was the non-compete policy in the top results? No, relevance score was 0.45 and it was filtered out. Found the bug.
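How do you capture that chain in the first place? In production you would likely reach for a tracing library such as OpenTelemetry, but the core mechanism is small enough to sketch: every span records its parent, its timing, and arbitrary attributes under a shared trace ID. The `span()` helper below is a hypothetical illustration, not a real library API:

```python
# Bare-bones span recorder producing the parent/child structure shown
# in the trace above. The span() helper is illustrative only.
import time
import uuid
from contextlib import contextmanager

TRACE_ID = str(uuid.uuid4())  # one id shared by every span in the run
SPANS = []                    # flushed to your trace store in a real system
_STACK = []                   # ancestry of the currently open spans

@contextmanager
def span(name, **attrs):
    sid = str(uuid.uuid4())
    parent = _STACK[-1] if _STACK else None
    _STACK.append(sid)
    start = time.time()
    try:
        yield attrs  # callers can attach attributes mid-span
    finally:
        _STACK.pop()
        SPANS.append({
            "trace_id": TRACE_ID, "span_id": sid, "parent_id": parent,
            "name": name, "duration_s": round(time.time() - start, 3),
            **attrs,
        })

with span("orchestrator", input="contract_vendor_acme_2026.pdf"):
    with span("extraction_agent") as s:
        s["clauses_extracted"] = 12
    with span("compliance_agent") as s:
        s["rag_scores"] = [0.89, 0.87, 0.82, 0.71]
        s["guardrails_triggered"] = ["liability_minimum_check"]
```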
Key Metrics
Operational metrics:
- Agent response time (p50, p95, p99)
- Tool call success/failure rates
- Cost per agent run (tokens consumed)
- Error rates by subagent
Quality metrics:
- Human override rate (how often the reviewer changes the agent's recommendation; see the sketch after this list)
- RAG retrieval relevance scores
- Guardrail trigger rates
- Grounding validation pass/fail rates
Business metrics:
- Contracts processed per day
- Average time from submission to decision
- Cost per contract review (agent + human time)
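Most of these fall straight out of the trace data you are already collecting. As a rough sketch with illustrative input data, here is how the latency percentiles and the human override rate might be computed:

```python
# Sketch of computing two of the metrics above from recorded trace data.
# The latencies and reviews inputs are illustrative.
import statistics

def percentiles(latencies_ms):
    qs = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

def override_rate(reviews):
    """Share of human reviews that changed the agent's recommendation."""
    overridden = sum(1 for r in reviews if r["decision"] != r["agent_recommendation"])
    return overridden / len(reviews) if reviews else 0.0

latencies = [820, 910, 1200, 1450, 7800, 950, 1010]  # ms, per agent run
reviews = [
    {"agent_recommendation": "APPROVE", "decision": "APPROVE"},
    {"agent_recommendation": "APPROVE", "decision": "REJECT"},  # an override
]
print(percentiles(latencies))
print(override_rate(reviews))  # 0.5
```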
Alerting
Not every trace needs human attention. Set alerts for:
- Error rate above threshold (agent is failing)
- Human override rate spikes (agent quality is degrading; see the alarm sketch after this list)
- Latency above SLA (performance regression)
- Guardrail triggers above normal (possible prompt injection or bad input pattern)
- Cost per run spikes (model or retrieval changes burning money)
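Since the lab below uses CloudWatch, here is what one of these alerts might look like as a CloudWatch alarm via boto3. The namespace, metric name, threshold, and SNS topic are assumptions for illustration; the `HumanOverrideRate` metric would be emitted by your own pipeline (e.g., with `put_metric_data`):

```python
# Sketch of one alert from the list above: human override rate spiking.
# Namespace, metric name, threshold, and SNS topic are assumed values.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="agent-override-rate-high",
    Namespace="ContractAgent",           # custom namespace (assumed)
    MetricName="HumanOverrideRate",      # emitted by your own trace pipeline
    Statistic="Average",
    Period=3600,                         # evaluate hourly
    EvaluationPeriods=3,                 # sustained for 3 hours
    Threshold=0.25,                      # >25% overrides = quality drift
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:agent-alerts"],  # assumed SNS topic
)
```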
What's Next
Observability tells you what's happening. In Module 12: Evaluation, we cover how to systematically measure whether your agent is good at its job and getting better over time.
Observability Lab
Build a complete observability pipeline with structured traces, CloudWatch dashboards, anomaly detection alerts, and cost tracking for multi-agent systems.