AI Evals
AI products need measurement to reach production. Evals provide that measurement.
Without evals, teams fly blind. Changes may help or hurt, and there is no way to know which.
With evals, teams iterate with confidence. Every change gets measured against real problems.
💥 AI development means long failure periods, then sudden breakthroughs. Traditional metrics kill projects too early.
What Are Evals
Evals measure if AI outputs meet user outcome criteria. Not generic quality scores. Custom pass/fail judgements for your domain.
Investment reality: Expect 60–80% of development time to go to measurement. That investment delivers:
10x faster iteration
Confidence in changes
Clear path to production
Measurable outcomes
Building Your Eval System
Foundation Stage
Objective: Eliminate obvious failures before launch.
Who Owns Quality?
Your competitive advantage isn't AI technology. It's domain expertise.
The people with the deepest knowledge must shape the AI directly: lawyers, doctors, specialists. Not filtered through engineers.
Old approach: Experts explain. Engineers translate. Information lost at every step.
New approach: Experts write and iterate directly. Natural language makes this possible.
Appoint one domain expert as quality authority. One voice. No conflicting standards. No paralysis.
What to Measure
Generic metrics mislead. "Helpfulness scores" create false confidence. Teams celebrate better numbers while users leave.
Define pass/fail criteria for specific outcomes. Binary judgements force clarity.
Build custom tools. Off-the-shelf solutions fail. Every domain differs. Inadequate tools prevent analysis.
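As an illustration, a binary criterion can be as small as one function that returns pass/fail with a reason. The scenario here (a support assistant that must cite an approved document and must not promise refunds) and every name in it are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool
    reason: str

def eval_support_answer(answer: str, allowed_doc_ids: set[str]) -> EvalResult:
    """Hypothetical domain criterion: the answer must cite an approved
    source document and must not promise a refund."""
    cited = any(doc_id in answer for doc_id in allowed_doc_ids)
    if not cited:
        return EvalResult(False, "no approved source document cited")
    if "refund" in answer.lower():
        return EvalResult(False, "promises a refund, which policy forbids")
    return EvalResult(True, "cites an approved source, no forbidden promises")

# Example: fails because no approved doc id appears in the answer.
print(eval_support_answer("You will get a refund.", {"KB-101", "KB-204"}))
```

The shape is the point: one input, one output, a binary verdict with a reason an expert can audit.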
How to Start?
Start evaluating with zero users:
Generate synthetic test cases with AI (see the sketch after this list)
Review with domain experts
Build failure taxonomy
Create custom viewer
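A minimal sketch of the first step, generating synthetic test cases. `call_llm` is a hypothetical stand-in for whichever model client you use; the personas and prompt are illustrative only:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your model client (OpenAI, Anthropic, local, ...)."""
    raise NotImplementedError("wire this up to your LLM provider")

PERSONAS = ["first-time user", "frustrated power user", "non-native English speaker"]

def generate_test_cases(task_description: str, n_per_persona: int = 5) -> list[dict]:
    """Ask the model to invent realistic user inputs, one persona at a time,
    so the set covers different behaviours rather than one average user."""
    cases = []
    for persona in PERSONAS:
        prompt = (
            f"You are simulating a {persona} of this product: {task_description}\n"
            f"Write {n_per_persona} realistic requests they might send, "
            "as a JSON list of strings."
        )
        for text in json.loads(call_llm(prompt)):
            cases.append({"persona": persona, "input": text, "expert_reviewed": False})
    return cases
```

Every generated case still goes to the domain expert before it counts as a test.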
Build basic infrastructure:
Custom viewing tools for experts
Automated checks for predictable failures
Complete logging
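Automated checks for predictable failures can be plain assertions run over every output, no model involved. The specific checks below (empty output, leaked system prompt, malformed JSON) are illustrative, not a prescribed set:

```python
import json

def check_not_empty(output: str) -> bool:
    return bool(output.strip())

def check_no_prompt_leak(output: str) -> bool:
    # Flag outputs that echo internal instructions back to the user.
    return "system prompt" not in output.lower()

def check_valid_json(output: str) -> bool:
    # Only relevant if the product promises structured output.
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

DETERMINISTIC_CHECKS = {
    "empty_output": check_not_empty,
    "prompt_leak": check_no_prompt_leak,
    "invalid_json": check_valid_json,
}

def run_checks(output: str) -> list[str]:
    """Return the names of every deterministic check this output fails."""
    return [name for name, check in DETERMINISTIC_CHECKS.items() if not check(output)]
```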
Key Activities
Experts review 100+ interactions
Categorise every failure
Fix issues through prompt refinement
Success metric: No deterministic failures reach users.
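One way to capture what expert review produces: each reviewed interaction gets a verdict plus zero or more failure categories, and the tally becomes the failure taxonomy. The category names are placeholders; the real ones come from your expert:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Review:
    interaction_id: str
    passed: bool
    failure_categories: list[str] = field(default_factory=list)
    notes: str = ""  # the expert's reasoning, kept for the scaling stage

def failure_taxonomy(reviews: list[Review]) -> Counter:
    """Count how often each failure category appears across reviewed interactions."""
    return Counter(cat for r in reviews if not r.passed for cat in r.failure_categories)

reviews = [
    Review("t1", False, ["hallucinated_citation"], "cites a doc that does not exist"),
    Review("t2", True),
    Review("t3", False, ["wrong_tone", "hallucinated_citation"], "too casual, bad cite"),
]
print(failure_taxonomy(reviews).most_common())
# [('hallucinated_citation', 2), ('wrong_tone', 1)]
```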
Scaling Stage
Objective: Automate quality judgement. Maintain standards.
As usage grows, expert judgement becomes the template:
Label examples pass/fail
Document reasoning for each judgement
Train AI to replicate judgement at scale
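A common way to do this is an LLM-as-judge prompt built from the expert's labelled examples and their documented reasoning. Again, `call_llm` is a hypothetical stand-in and the labelled examples are placeholders:

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical stand-in for your model client")

# Labelled by the domain expert, reasoning included.
LABELLED_EXAMPLES = [
    {"output": "Per KB-101, reset the router...", "verdict": "PASS",
     "reason": "cites an approved doc and answers the question"},
    {"output": "I think you probably get a refund.", "verdict": "FAIL",
     "reason": "speculates about refunds, which policy forbids"},
]

def judge(output: str) -> bool:
    """Ask the model to replicate the expert's pass/fail judgement."""
    examples = "\n".join(
        f'Output: {ex["output"]}\nVerdict: {ex["verdict"]} ({ex["reason"]})'
        for ex in LABELLED_EXAMPLES
    )
    prompt = (
        "You are grading support answers against our quality standard.\n"
        f"Graded examples:\n{examples}\n\n"
        f"Output: {output}\nVerdict (PASS or FAIL only):"
    )
    return call_llm(prompt).strip().upper().startswith("PASS")
```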
Key Activities
Sample real interactions
Categorise failure modes
Automate repetitive judgements
Validate automated vs human standards
Aim for >90% agreement between automated and human judgement.
Success metric: Automated systems replicate expert judgement.
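Measuring that agreement is simple once both sets of labels exist. The sketch below reports raw agreement plus Cohen's kappa, which discounts agreement you would get by chance; the data is made up for illustration:

```python
def agreement(human: list[bool], auto: list[bool]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two binary label lists."""
    assert len(human) == len(auto) and human
    n = len(human)
    observed = sum(h == a for h, a in zip(human, auto)) / n
    # Chance agreement from the marginal pass rates of each labeller.
    p_human = sum(human) / n
    p_auto = sum(auto) / n
    expected = p_human * p_auto + (1 - p_human) * (1 - p_auto)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

human = [True, True, False, True, False, True, True, False, True, True]
auto  = [True, True, False, True, True,  True, True, False, True, False]
obs, kappa = agreement(human, auto)
print(f"agreement {obs:.0%}, kappa {kappa:.2f}")  # agreement 80%, kappa 0.52
```

Raw agreement above 90% is the target here; kappa guards against the case where most outputs pass and agreement is cheap.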
Maturity Stage
Objective: Connect quality to business outcomes.
Shift from fixing errors to validating impact:
Strategic sampling finds remaining issues
Production monitoring prevents failures
A/B testing validates business value
Track experiments run, not features shipped. Use capability funnels:
Error rates by category
Edge case coverage
Automated vs human alignment
User behaviour changes
Success metric: Quality improvements drive behaviour changes.
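Error rates by category can come straight from the logged judgements, grouped by week so trends stay visible per experiment. The record format below is an assumption, not a prescribed schema:

```python
from collections import defaultdict

# Each record: the week of the interaction and the failure categories found (empty = pass).
records = [
    {"week": "2024-W01", "failures": ["hallucinated_citation"]},
    {"week": "2024-W01", "failures": []},
    {"week": "2024-W02", "failures": []},
    {"week": "2024-W02", "failures": ["wrong_tone"]},
    {"week": "2024-W02", "failures": []},
]

def error_rates_by_category(records: list[dict]) -> dict[str, dict[str, float]]:
    """For each week, the share of interactions that hit each failure category."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for rec in records:
        totals[rec["week"]] += 1
        for cat in rec["failures"]:
            hits[rec["week"]][cat] += 1
    return {week: {cat: n / totals[week] for cat, n in cats.items()}
            for week, cats in hits.items()}

print(error_rates_by_category(records))
# {'2024-W01': {'hallucinated_citation': 0.5}, '2024-W02': {'wrong_tone': 0.33...}}
```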
Critical Decisions
Custom data viewers: Highest-ROI investment. Purpose-built interfaces remove friction. Generic tools waste time and discourage analysis. A minimal sketch follows this list.
Domain expert access: Give experts direct access to prompts and logic. Engineers aren't translators. Dedicate senior expert time to quality standards.
Measurement infrastructure: Complete logging and trace storage. Foundation for all improvement.
Normalise failure: AI learns from failures. View systematic failure analysis as progress.
Focus narrowly: Build evaluators for discovered failures, not imagined problems. Optimising multiple dimensions fragments attention and delays progress.
Remove jargon: Technical terms prevent domain experts from contributing.
Experiment mindset: Leadership must support sustained experimentation. No premature optimisation.
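The custom data viewer mentioned above can start as something as small as a terminal loop that shows one trace at a time and records the expert's verdict. File names and trace fields here are a hypothetical layout:

```python
import json
from pathlib import Path

TRACES = Path("traces.jsonl")   # one JSON object per interaction: {"id", "input", "output"}
LABELS = Path("labels.jsonl")   # expert verdicts appended here

def review_traces() -> None:
    """Show each unlabelled trace and record the expert's pass/fail plus a note."""
    done = {json.loads(line)["id"] for line in LABELS.read_text().splitlines()} if LABELS.exists() else set()
    with LABELS.open("a") as out:
        for line in TRACES.read_text().splitlines():
            trace = json.loads(line)
            if trace["id"] in done:
                continue
            print(f"\n--- {trace['id']} ---\nUSER:  {trace['input']}\nMODEL: {trace['output']}")
            verdict = input("pass/fail/quit? ").strip().lower()
            if verdict == "quit":
                break
            note = input("why? ")
            out.write(json.dumps({"id": trace["id"], "passed": verdict == "pass", "note": note}) + "\n")

if __name__ == "__main__":
    review_traces()
```

The terminal UI is not the point. The point is that the expert sees raw traces with zero setup, and every judgement lands in a file the scaling stage can reuse.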
Takeaways
AI Evals aren't quality assurance. They're product development.
Without measurement, AI products stay prototypes.
With rigorous evals, teams build confidence. Iterate rapidly. Deliver value.
The question isn't whether to invest. It's how quickly you start.
