Module 12: Evaluation
"It seems to work" is not a measurement. Evaluation is how you know whether your agent is good at its job, and whether changes you make are improvements or regressions.
Why Evaluation Is Different for Agents
Traditional software has unit tests. Expected input, expected output, pass or fail.
AI agents produce natural language. There are many correct ways to summarize a contract. There's no single "right answer" to match against. You need evaluation methods that handle this ambiguity.
Evaluation Dimensions
| Dimension | Question | How to Measure |
|---|---|---|
| Correctness | Did the agent get the facts right? | Compare against human-reviewed ground truth |
| Completeness | Did the agent catch everything? | Check against a checklist of known issues |
| Relevance | Is the output useful for the intended audience? | Human rating (1-5) |
| Grounding | Are claims backed by actual sources? | Automated source verification |
| Safety | Did the agent stay within guardrails? | Guardrail trigger logs |
| Efficiency | How fast and how expensive? | Latency and token counts |
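To make these dimensions trackable over time, it helps to record them in a fixed shape per run. Here's a minimal sketch in Python; the field names and score ranges are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    """Scores for one agent run on one contract. Field names are illustrative."""
    contract_id: str
    correctness: float       # fraction of extracted facts matching ground truth (0-1)
    completeness: float      # fraction of known issues the agent surfaced (0-1)
    relevance: int           # human rating, 1-5
    grounding: float         # fraction of claims traced back to a source passage (0-1)
    guardrail_triggers: int  # count of guardrail activations during the run
    latency_seconds: float
    total_tokens: int
```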
Building an Evaluation Dataset
Start with contracts your legal team has already reviewed. You know the correct answers because humans already provided them.
For each contract in your evaluation set, document:
- Key terms that should be extracted
- Compliance risks that should be flagged
- Financial exposure that should be calculated
- The final recommendation
Run the agent on these contracts. Compare its output to the human-provided answers.
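A ground-truth record can be as simple as one structured document per contract. Here's a hypothetical example; the identifier and field names are made up to show the shape, not a required format.

```python
# One ground-truth record, mirroring a contract your legal team already reviewed.
# Structure and field names are illustrative, not a fixed schema.
ground_truth_example = {
    "contract_id": "msa-2024-0417",      # hypothetical identifier
    "key_terms": {
        "term_length_months": 36,
        "auto_renewal": True,
        "governing_law": "Delaware",
    },
    "expected_flags": [
        "uncapped liability in section 8.2",
        "IP assignment broader than company policy",
    ],
    "financial_exposure_usd": 250_000,
    "recommendation": "negotiate",       # e.g. approve / negotiate / reject
}
```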
Metrics That Matter
Precision
Of the issues the agent flagged, how many were actually real issues?
High precision = Few false alarms. When the agent flags something, it's worth investigating.
Low precision = Too many false alarms. The legal team starts ignoring the agent's flags.
Recall
Of the real issues that exist, how many did the agent catch?
High recall = The agent doesn't miss things. Comprehensive coverage.
Low recall = The agent misses issues. Humans find problems the agent didn't.
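Here's a minimal sketch of both metrics for a single contract, assuming flagged and ground-truth issues can be matched exactly. Real issue descriptions usually need fuzzy or semantic matching, so treat this as the skeleton, not the whole solution.

```python
def precision_recall(flagged: list[str], actual: list[str]) -> tuple[float, float]:
    """Compute precision and recall for one contract.

    Matching here is exact string equality for simplicity; real issue
    text usually needs fuzzy or semantic matching.
    """
    flagged_set, actual_set = set(flagged), set(actual)
    true_positives = len(flagged_set & actual_set)
    precision = true_positives / len(flagged_set) if flagged_set else 1.0
    recall = true_positives / len(actual_set) if actual_set else 1.0
    return precision, recall
```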
The Tradeoff
You usually can't maximize both. An agent that flags anything remotely suspicious has high recall but low precision; an agent that only flags clear-cut violations has high precision but low recall.
For contract review, high recall is more important. Missing a real compliance issue is worse than flagging a few false positives. The human reviewer can dismiss false alarms quickly, but they can't catch issues the agent didn't surface.
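If you want a single tracked number that encodes this preference, one common choice (an assumption here, not something prescribed by this series) is a recall-weighted F-score such as F2:

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision.

    beta=2 (recall counts roughly twice as much as precision) is a common
    starting point for review tasks where misses are costlier than false alarms.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```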
Automated Evaluation
For some dimensions, you can automate the evaluation using another LLM call:
LLM-as-Judge: Give a second model the contract, the agent's analysis, and the human's analysis. Ask it to score the agent's output on correctness, completeness, and relevance.
This isn't perfect, but it scales. You can run automated evaluation on hundreds of contracts nightly and flag regressions before they reach production.
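Here's a rough sketch of the pattern. `call_llm` is a placeholder for whatever model client you actually use, and the rubric and JSON output format are illustrative, not a fixed contract.

```python
import json

JUDGE_PROMPT = """You are grading an AI contract-review agent.

Contract:
{contract}

Agent's analysis:
{agent_analysis}

Human reviewer's analysis (ground truth):
{human_analysis}

Score the agent's output from 1-5 on correctness, completeness, and relevance.
Respond with JSON: {{"correctness": n, "completeness": n, "relevance": n, "rationale": "..."}}"""

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your actual model client."""
    raise NotImplementedError

def judge(contract: str, agent_analysis: str, human_analysis: str) -> dict:
    """Score one agent output with a second model acting as judge."""
    prompt = JUDGE_PROMPT.format(
        contract=contract,
        agent_analysis=agent_analysis,
        human_analysis=human_analysis,
    )
    response = call_llm(prompt)   # placeholder for your model client of choice
    return json.loads(response)   # assumes the judge returns valid JSON
```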
Regression Testing
Every time you change something (update a prompt, switch a model, modify a subagent), run the full evaluation suite.
```
Change:  Updated compliance agent prompt to be more specific about IP clauses
Before:  Precision 0.87, Recall 0.91
After:   Precision 0.89, Recall 0.88
Result:  Precision improved but recall dropped. IP clause detection improved,
         but we're now missing some liability issues. Investigate before shipping.
```
Without evaluation, you'd ship the change and discover the recall drop when a human finds an issue the agent missed. With evaluation, you catch it before deployment.
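A simple regression gate can be a comparison of averaged metrics with a tolerance. The 0.02 threshold below is an arbitrary starting point for illustration; tune it to the noise level of your own dataset.

```python
def check_regression(before: dict, after: dict, tolerance: float = 0.02) -> list[str]:
    """Flag any metric that dropped by more than `tolerance` between runs.

    `before` and `after` map metric names (e.g. "precision", "recall") to
    scores averaged over the evaluation set.
    """
    regressions = []
    for metric, old_value in before.items():
        new_value = after.get(metric, 0.0)
        if old_value - new_value > tolerance:
            regressions.append(
                f"{metric} dropped from {old_value:.2f} to {new_value:.2f}"
            )
    return regressions

# The prompt-change scenario above:
# check_regression({"precision": 0.87, "recall": 0.91},
#                  {"precision": 0.89, "recall": 0.88})
# -> ["recall dropped from 0.91 to 0.88"]
```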
Continuous Improvement Loop
```
Agent runs in production
        ↓
Observability captures traces (Module 11)
        ↓
Human overrides logged (Module 9)
        ↓
Override cases added to evaluation dataset
        ↓
Evaluation suite grows over time
        ↓
Next change tested against larger, more representative dataset
        ↓
Agent quality improves with every deployment
```
This is the flywheel. The longer your agent runs in production with human review, the better your evaluation dataset gets, and the more confidently you can make changes.
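The step that closes the loop, turning a logged override into a new evaluation case, might look something like this. The override log fields are hypothetical; adapt them to what your observability and review pipeline actually records.

```python
def override_to_eval_case(override: dict) -> dict:
    """Turn a logged human override (Module 9) into a new ground-truth record.

    The override log format here is hypothetical; adapt it to whatever
    your observability pipeline (Module 11) actually captures.
    """
    return {
        "contract_id": override["contract_id"],
        "expected_flags": override["human_flags"],            # what the human actually flagged
        "recommendation": override["human_recommendation"],   # the decision that stood
        "source": "production_override",                      # useful for slicing results later
    }
```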
What's Next
You've completed all 12 building blocks. Go back to the series overview to review how they connect, or dive into the hands-on labs to build each component on AWS.
Evaluation Lab
Build an automated evaluation pipeline with ground truth datasets, LLM-as-Judge scoring, precision/recall tracking, and regression detection integrated into your CI/CD pipeline.