AI Evals

Learning notes from Hamel's posts on evals.

AI products need measurement to reach production. Evals provide that measurement.

Without evals, teams fly blind. Changes may help or hurt. No way to know.

With evals, teams iterate with confidence. Every change gets measured against real problems.

💥 AI development means long failure periods, then sudden breakthroughs. Traditional metrics kill projects too early.

What Are Evals

Evals measure whether AI outputs meet user outcome criteria. Not generic quality scores. Custom pass/fail judgements for your domain.

Investment reality: expect to spend 60–80% of dev time on measurement. This delivers:

  • 10x faster iteration

  • Confidence in changes

  • Clear path to production

  • Measurable outcomes

Building Your Eval System

Foundation Stage ‼️

Objective: Eliminate obvious failures before launch.

Who Owns Quality?

Your competitive advantage isn't AI technology. It's domain expertise.

Those with the deepest knowledge must shape the AI directly. Lawyers, doctors, specialists. Not filtered through engineers.

Old approach: Experts explain. Engineers translate. Information lost at every step.

New approach: Experts write and iterate directly. Natural language makes this possible.

Appoint one domain expert as quality authority. One voice. No conflicting standards. No paralysis.

What to Measure

Generic metrics mislead. "Helpfulness scores" create false confidence. Teams celebrate better numbers while users leave.

Define pass/fail criteria for specific outcomes. Binary judgements force clarity.

Build custom tools. Off-the-shelf solutions fail. Every domain differs. Inadequate tools prevent analysis.
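
What a domain-specific binary check can look like, as a minimal sketch. The product (an email assistant), the `Trace` record, and the 30-day refund criterion are all illustrative assumptions, not anything from the posts:

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """One logged interaction: what the user asked and what the AI answered."""
    user_query: str
    ai_response: str

def check_refund_policy(trace: Trace) -> tuple[bool, str]:
    """Binary, domain-specific check: a refund reply must quote the (assumed) 30-day window.
    Returns (passed, reason) so failures can be categorised later."""
    if "refund" not in trace.user_query.lower():
        return True, "not applicable"
    if "30 days" in trace.ai_response:
        return True, "refund window stated"
    return False, "refund reply missing the 30-day window"
```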

How to Start?

Start evals with zero users:

  • Generate synthetic test cases with AI (see the sketch after this list)

  • Review with domain experts

  • Build failure taxonomy

  • Create custom viewer
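
One way to bootstrap synthetic cases is to enumerate the dimensions that matter in your domain and have an LLM draft one realistic query per combination. A rough sketch: the `call_llm(prompt) -> str` helper, the dimensions, and the support-assistant framing are all assumptions for illustration:

```python
from itertools import product

# Illustrative dimensions for a hypothetical support assistant; replace with your domain's.
personas = ["new customer", "long-time customer", "frustrated customer"]
scenarios = ["billing question", "refund request", "feature confusion"]
difficulties = ["straightforward", "ambiguous", "adversarial"]

def generate_synthetic_cases(call_llm) -> list[dict]:
    """Ask an LLM to draft one realistic user query per dimension combination.
    Domain experts then review, edit, or reject each draft."""
    cases = []
    for persona, scenario, difficulty in product(personas, scenarios, difficulties):
        prompt = (
            f"Write one realistic user message to a support assistant. "
            f"Persona: {persona}. Scenario: {scenario}. Difficulty: {difficulty}."
        )
        cases.append({
            "persona": persona,
            "scenario": scenario,
            "difficulty": difficulty,
            "query": call_llm(prompt),
        })
    return cases
```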

Build basic infrastructure:

  • Custom viewing tools for experts

  • Automated checks for predictable failures

  • Complete logging
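
For the logging piece, something as plain as an append-only SQLite table is often enough to start. A minimal sketch using only the standard library; the schema is an assumption, not a prescription:

```python
import json
import sqlite3
from datetime import datetime, timezone

def init_trace_store(path: str = "traces.db") -> sqlite3.Connection:
    """Create (or open) a local store that experts and scripts can query."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS traces (
               id INTEGER PRIMARY KEY,
               timestamp TEXT,
               user_query TEXT,
               retrieved_context TEXT,
               prompt TEXT,
               ai_response TEXT,
               metadata TEXT
           )"""
    )
    return conn

def log_trace(conn, user_query, retrieved_context, prompt, ai_response, **metadata):
    """Store everything needed to replay and judge the interaction later."""
    conn.execute(
        "INSERT INTO traces (timestamp, user_query, retrieved_context, prompt, ai_response, metadata) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            user_query,
            retrieved_context,
            prompt,
            ai_response,
            json.dumps(metadata),
        ),
    )
    conn.commit()
```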

Key Activities

  • Experts review 100+ interactions

  • Categorise every failure (tally sketch after this list)

  • Fix issues through prompt refinement
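
Keeping the failure taxonomy in code next to the review output makes the counts trivial. A small sketch; the category names and the (trace_id, category) review format are illustrative assumptions:

```python
from collections import Counter

# Illustrative failure taxonomy; yours comes out of expert review, not guesswork.
FAILURE_CATEGORIES = [
    "hallucinated_policy",
    "missed_user_intent",
    "wrong_tone",
    "formatting_error",
]

def tally_failures(reviews: list[tuple[int, str]]) -> Counter:
    """Count failures per category so fixes target the most common problems first."""
    counts = Counter(category for _, category in reviews)
    unknown = [c for c in counts if c not in FAILURE_CATEGORIES]
    if unknown:
        raise ValueError(f"Unrecognised categories (extend the taxonomy?): {unknown}")
    return counts
```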

Success metric: No deterministic failures reach users.

Scaling Stage

Objective: Automate quality judgement. Maintain standards.

As usage grows, expert judgement becomes the template:

  • Label examples pass/fail

  • Document reasoning for each judgement

  • Train AI to replicate judgement at scale
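
In practice this usually means an LLM-as-judge prompt assembled from the expert's labelled examples and documented reasoning. A rough sketch, again assuming a hypothetical `call_llm(prompt) -> str` helper; the prompt wording is illustrative:

```python
from dataclasses import dataclass

@dataclass
class LabelledExample:
    query: str
    response: str
    verdict: str   # "pass" or "fail", assigned by the domain expert
    reasoning: str # the expert's documented reasoning for the verdict

def judge(call_llm, examples: list[LabelledExample], query: str, response: str) -> str:
    """Ask an LLM to apply the expert's documented standard to a new interaction."""
    shots = "\n\n".join(
        f"Query: {e.query}\nResponse: {e.response}\nVerdict: {e.verdict}\nReasoning: {e.reasoning}"
        for e in examples
    )
    prompt = (
        "You are replicating a domain expert's pass/fail judgement.\n"
        "Labelled examples with the expert's reasoning:\n\n"
        f"{shots}\n\n"
        f"Now judge this interaction.\nQuery: {query}\nResponse: {response}\n"
        "Answer with exactly one word: pass or fail."
    )
    return call_llm(prompt).strip().lower()
```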

Key Activities

  • Sample real interactions

  • Categorise failure modes

  • Automate repetitive judgements

  • Validate automated vs human standards

Aim for >90% agreement between automated and human judgement.
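
Measuring that agreement is just a comparison against a held-out set of expert labels. A tiny sketch; note that raw agreement can look flattering on imbalanced data, so it is worth checking agreement on failures separately:

```python
def agreement(automated: list[str], human: list[str]) -> float:
    """Fraction of traces where the automated judge matches the expert's verdict."""
    assert len(automated) == len(human) and automated, "need matched, non-empty label lists"
    matches = sum(a == h for a, h in zip(automated, human))
    return matches / len(human)

# e.g. agreement(["pass", "fail", "pass"], ["pass", "fail", "fail"]) -> 0.67
```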

Success metric: Automated systems replicate expert judgement.

Maturity Stage

Objective: Connect quality to business outcomes.

Shift from fixing errors to validating impact:

  • Strategic sampling finds remaining issues

  • Production monitoring prevents failures

  • A/B testing validates business value
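
Validating business value with an A/B test ultimately reduces to comparing an outcome metric between the current and candidate systems. A bare-bones two-proportion z-test sketch using only the standard library; the binary outcome (e.g. task completed or not) is an assumption, and in practice an analytics stack usually does this for you:

```python
from math import erf, sqrt

def ab_significance(success_a: int, n_a: int, success_b: int, n_b: int) -> tuple[float, float]:
    """Two-proportion z-test: did variant B move the outcome metric vs variant A?
    Returns (observed lift, two-sided p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_b - p_a, p_value
```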

Track experiments run, not features shipped. Use capability funnels (a sketch follows this list):

  • Error rates by category

  • Edge case coverage

  • Automated vs human alignment

  • User behaviour changes
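
One way to keep the funnel honest is to recompute failure rates per category from judged traces after every experiment. A minimal sketch, assuming each judged trace is tagged with a scenario category and a pass/fail verdict:

```python
from collections import defaultdict

def error_rates_by_category(judged: list[dict]) -> dict[str, float]:
    """judged: [{"category": ..., "verdict": "pass" | "fail"}, ...]
    Returns the failure rate per category, for comparison across experiments."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for row in judged:
        totals[row["category"]] += 1
        if row["verdict"] == "fail":
            failures[row["category"]] += 1
    return {cat: failures[cat] / totals[cat] for cat in totals}
```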

Success metric: Quality improvements drive behaviour changes.

Critical Decisions

Custom data viewers: Highest-ROI investment. Purpose-built interfaces remove friction. Generic tools waste time and discourage analysis.
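
A "custom viewer" doesn't have to be a product. Even a script that renders traces into one scrollable HTML page for the expert removes most of the friction. A minimal sketch using only the standard library; the trace fields are assumed:

```python
import html

def render_review_page(traces: list[dict], path: str = "review.html") -> None:
    """Write traces to a single HTML page: query, response, and space for a verdict."""
    rows = []
    for i, t in enumerate(traces):
        rows.append(
            f"<div style='border:1px solid #ccc; margin:8px; padding:8px'>"
            f"<h3>Trace {i}</h3>"
            f"<p><b>Query:</b> {html.escape(t['user_query'])}</p>"
            f"<p><b>Response:</b> {html.escape(t['ai_response'])}</p>"
            f"<p><b>Verdict:</b> ______ &nbsp; <b>Notes:</b> __________</p>"
            f"</div>"
        )
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><body>" + "".join(rows) + "</body></html>")
```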

Domain expert access: Give experts direct access to prompts and logic. Engineers aren't translators. Dedicate senior expert time to quality standards.

Measurement infrastructure: Complete logging and trace storage. Foundation for all improvement.

Normalise failure: AI development advances through failure. Treat systematic failure analysis as progress.

Focus narrowly: Build evaluators for discovered failures, not imagined problems. Optimising multiple dimensions fragments attention and delays progress.

Remove jargon: Technical terms prevent domain experts from contributing.

Experiment mindset: Leadership must support sustained experimentation. No premature optimisation.

Takeaways

AI Evals aren't quality assurance. They're product development.

Without measurement, AI products stay prototypes.

With rigorous evals, teams build confidence. Iterate rapidly. Deliver value.

The question isn't whether to invest. It's how quickly you start.

Virtuous Cycle
