Module 12: Evaluation
"It seems to work" is not a measurement. Evaluation is how you know whether your agent is good at its job, and whether changes you make are improvements or regressions.
Why Evaluation Is Different for Agents
Traditional software has unit tests. Expected input, expected output, pass or fail.
AI agents produce natural language. There are many correct ways to summarize a contract. There's no single "right answer" to match against. You need evaluation methods that handle this ambiguity.
Evaluation Dimensions
| Dimension | Question | How to Measure |
|---|---|---|
| Correctness | Did the agent get the facts right? | Compare against human-reviewed ground truth |
| Completeness | Did the agent catch everything? | Check against a checklist of known issues |
| Relevance | Is the output useful for the intended audience? | Human rating (1-5) |
| Grounding | Are claims backed by actual sources? | Automated source verification |
| Safety | Did the agent stay within guardrails? | Guardrail trigger logs |
| Efficiency | How fast and how expensive? | Latency and token counts |
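To make these dimensions trackable over time, it helps to record them in a fixed shape per run. Here's a minimal sketch in Python; the field names and score ranges are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    """Scores for one agent run on one contract. Field names are illustrative."""
    contract_id: str
    correctness: float       # fraction of extracted facts matching ground truth (0-1)
    completeness: float      # fraction of known issues the agent surfaced (0-1)
    relevance: int           # human rating, 1-5
    grounding: float         # fraction of claims traced back to a source passage (0-1)
    guardrail_triggers: int  # count of guardrail activations during the run
    latency_seconds: float
    total_tokens: int
```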
Building an Evaluation Dataset
Start with contracts your legal team has already reviewed. You know the correct answers because humans already provided them.
For each contract in your evaluation set, document:
- Key terms that should be extracted
- Compliance risks that should be flagged
- Financial exposure that should be calculated
- The final recommendation
Run the agent on these contracts. Compare its output to the human-provided answers.
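A ground-truth record can be as simple as one structured document per contract. Here's a hypothetical example; the identifier and field names are made up to show the shape, not a required format.

```python
# One ground-truth record, mirroring a contract your legal team already reviewed.
# Structure and field names are illustrative, not a fixed schema.
ground_truth_example = {
    "contract_id": "msa-2024-0417",      # hypothetical identifier
    "key_terms": {
        "term_length_months": 36,
        "auto_renewal": True,
        "governing_law": "Delaware",
    },
    "expected_flags": [
        "uncapped liability in section 8.2",
        "IP assignment broader than company policy",
    ],
    "financial_exposure_usd": 250_000,
    "recommendation": "negotiate",       # e.g. approve / negotiate / reject
}
```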
Metrics That Matter
Precision
Of the issues the agent flagged, how many were actually real issues?
High precision = Few false alarms. When the agent flags something, it's worth investigating.
Low precision = Too many false alarms. The legal team starts ignoring the agent's flags.
Recall
Of the real issues that exist, how many did the agent catch?
High recall = The agent doesn't miss things. Comprehensive coverage.
Low recall = The agent misses issues. Humans find problems the agent didn't.
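Here's a minimal sketch of both metrics for a single contract, assuming flagged and ground-truth issues can be matched exactly. Real issue descriptions usually need fuzzy or semantic matching, so treat this as the skeleton, not the whole solution.

```python
def precision_recall(flagged: list[str], actual: list[str]) -> tuple[float, float]:
    """Compute precision and recall for one contract.

    Matching here is exact string equality for simplicity; real issue
    text usually needs fuzzy or semantic matching.
    """
    flagged_set, actual_set = set(flagged), set(actual)
    true_positives = len(flagged_set & actual_set)
    precision = true_positives / len(flagged_set) if flagged_set else 1.0
    recall = true_positives / len(actual_set) if actual_set else 1.0
    return precision, recall
```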
The Tradeoff
You usually can't maximize both. An agent that flags anything remotely suspicious has high recall but low precision; an agent that only flags clear-cut violations has high precision but low recall.
For contract review, high recall is more important. Missing a real compliance issue is worse than flagging a few false positives. The human reviewer can dismiss false alarms quickly, but they can't catch issues the agent didn't surface.
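If you want a single tracked number that encodes this preference, one common choice (an assumption here, not something prescribed by this series) is a recall-weighted F-score such as F2:

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision.

    beta=2 (recall counts roughly twice as much as precision) is a common
    starting point for review tasks where misses are costlier than false alarms.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```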
Automated Evaluation
For some dimensions, you can automate the evaluation using another LLM call:
LLM-as-Judge: Give a second model the contract, the agent's analysis, and the human's analysis. Ask it to score the agent's output on correctness, completeness, and relevance.
This isn't perfect, but it scales. You can run automated evaluation on hundreds of contracts nightly and flag regressions before they reach production.
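Here's a rough sketch of the pattern. `call_llm` is a placeholder for whatever model client you actually use, and the rubric and JSON output format are illustrative, not a fixed contract.

```python
import json

JUDGE_PROMPT = """You are grading an AI contract-review agent.

Contract:
{contract}

Agent's analysis:
{agent_analysis}

Human reviewer's analysis (ground truth):
{human_analysis}

Score the agent's output from 1-5 on correctness, completeness, and relevance.
Respond with JSON: {{"correctness": n, "completeness": n, "relevance": n, "rationale": "..."}}"""

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your actual model client."""
    raise NotImplementedError

def judge(contract: str, agent_analysis: str, human_analysis: str) -> dict:
    """Score one agent output with a second model acting as judge."""
    prompt = JUDGE_PROMPT.format(
        contract=contract,
        agent_analysis=agent_analysis,
        human_analysis=human_analysis,
    )
    response = call_llm(prompt)   # placeholder for your model client of choice
    return json.loads(response)   # assumes the judge returns valid JSON
```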
Regression Testing
Every time you change something (update a prompt, switch a model, modify a subagent), run the full evaluation suite.
```
Change:  Updated compliance agent prompt to be more specific about IP clauses
Before:  Precision 0.87, Recall 0.91
After:   Precision 0.89, Recall 0.88
Result:  Precision improved but recall dropped. IP clause detection improved,
         but we're now missing some liability issues. Investigate before shipping.
```
Without evaluation, you'd ship the change and discover the recall drop when a human finds an issue the agent missed. With evaluation, you catch it before deployment.
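A simple regression gate can be a comparison of averaged metrics with a tolerance. The 0.02 threshold below is an arbitrary starting point for illustration; tune it to the noise level of your own dataset.

```python
def check_regression(before: dict, after: dict, tolerance: float = 0.02) -> list[str]:
    """Flag any metric that dropped by more than `tolerance` between runs.

    `before` and `after` map metric names (e.g. "precision", "recall") to
    scores averaged over the evaluation set.
    """
    regressions = []
    for metric, old_value in before.items():
        new_value = after.get(metric, 0.0)
        if old_value - new_value > tolerance:
            regressions.append(
                f"{metric} dropped from {old_value:.2f} to {new_value:.2f}"
            )
    return regressions

# The prompt-change scenario above:
# check_regression({"precision": 0.87, "recall": 0.91},
#                  {"precision": 0.89, "recall": 0.88})
# -> ["recall dropped from 0.91 to 0.88"]
```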
Continuous Improvement Loop
```
Agent runs in production
        ↓
Observability captures traces (Module 11)
        ↓
Human overrides logged (Module 9)
        ↓
Override cases added to evaluation dataset
        ↓
Evaluation suite grows over time
        ↓
Next change tested against larger, more representative dataset
        ↓
Agent quality improves with every deployment
```
This is the flywheel. The longer your agent runs in production with human review, the better your evaluation dataset gets, and the more confidently you can make changes.
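The step that closes the loop, turning a logged override into a new evaluation case, might look something like this. The override log fields are hypothetical; adapt them to what your observability and review pipeline actually records.

```python
def override_to_eval_case(override: dict) -> dict:
    """Turn a logged human override (Module 9) into a new ground-truth record.

    The override log format here is hypothetical; adapt it to whatever
    your observability pipeline (Module 11) actually captures.
    """
    return {
        "contract_id": override["contract_id"],
        "expected_flags": override["human_flags"],            # what the human actually flagged
        "recommendation": override["human_recommendation"],   # the decision that stood
        "source": "production_override",                      # useful for slicing results later
    }
```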
What's Next
You've completed all 12 building blocks. Go back to the series overview to review how they connect, or dive into the hands-on labs to build each component on AWS.
Evaluation Lab
Build an automated evaluation pipeline with ground truth datasets, LLM-as-Judge scoring, precision/recall tracking, and regression detection integrated into your CI/CD pipeline.