# AI Evals

> Learning notes from [Hamel](https://hamel.dev/)'s posts on [evals](https://hamel.dev/notes/llm/evals/).

AI products need measurement to reach production. Evals provide that measurement.

Without evals, teams fly blind. Changes may help or hurt. No way to know.

With evals, teams iterate with confidence. Every change gets measured against real problems.

> 💥 AI development means long failure periods, then sudden breakthroughs. Traditional metrics kill projects too early.

## What Are Evals

Evals measure whether AI outputs deliver the outcomes users need. Not generic quality scores. Custom pass/fail judgements for your domain.

**Investment reality:** Expect to spend 60–80% of dev time on measurement. This delivers:

* 10x faster iteration
* Confidence in changes
* Clear path to production
* Measurable outcomes

## Building Your Eval System

### Foundation Stage ‼️‼️‼️

**Objective:** Eliminate obvious failures before launch.

#### Who Owns Quality?

Your competitive advantage isn't AI technology. It's domain expertise.

Those with the deepest knowledge must shape the AI directly: lawyers, doctors, specialists. Not through engineers.

**Old approach:** Experts explain. Engineers translate. Information lost at every step.

**New approach:** Experts write and iterate directly. Natural language makes this possible.

Appoint one domain expert as quality authority. One voice. No conflicting standards. No paralysis.

#### What to Measure

Generic metrics mislead. "Helpfulness scores" create false confidence. Teams celebrate better numbers while users leave.

Define pass/fail criteria for specific outcomes. Binary judgements force clarity.
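A minimal sketch of what a binary judgement can look like as data. The `Judgement` class, its field names, and the sample critiques are illustrative assumptions, not something taken from the source posts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    trace_id: str
    passed: bool   # binary on purpose: no 1-5 scale to argue over
    critique: str  # the expert's reason for the pass or fail


def pass_rate(judgements):
    """Fraction of judgements that passed; 0.0 for an empty list."""
    if not judgements:
        return 0.0
    return sum(j.passed for j in judgements) / len(judgements)


# Hypothetical expert judgements for a support assistant.
judgements = [
    Judgement("t1", True, "Answered the refund question with the correct policy."),
    Judgement("t2", False, "Quoted a 60-day window; the policy says 30 days."),
    Judgement("t3", True, "Escalated to a human as required."),
    Judgement("t4", False, "Ignored the user's request to cancel."),
]

print(f"pass rate: {pass_rate(judgements):.0%}")
```

Keeping the critique next to the boolean means a failed case already explains what needs fixing.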

Build custom tools. Off-the-shelf solutions rarely fit because every domain differs. Inadequate tools prevent analysis.

#### How to Start?

Start evals with zero users:

* Generate synthetic test cases with AI
* Review with domain experts
* Build failure taxonomy
* Create custom viewer
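One possible way to get varied synthetic cases is to cross a few dimensions and turn each combination into a generation prompt for an LLM. The sketch below only builds the prompts; the dimension names, their values, and the prompt wording are all hypothetical, and the call to a model is left out.

```python
from itertools import product

# Hypothetical dimensions for a customer-support assistant.
DIMENSIONS = {
    "feature": ["order tracking", "refund request"],
    "scenario": ["happy path", "missing information"],
    "persona": ["first-time customer", "frustrated repeat customer"],
}


def build_generation_prompts(dimensions):
    """Return one prompt per combination of dimension values."""
    names = list(dimensions)
    prompts = []
    for combo in product(*(dimensions[name] for name in names)):
        spec = dict(zip(names, combo))
        description = ", ".join(f"{key}: {value}" for key, value in spec.items())
        prompts.append({
            "spec": spec,
            "prompt": f"Write one realistic user message for this case ({description}).",
        })
    return prompts


prompts = build_generation_prompts(DIMENSIONS)
print(len(prompts), "prompts to send to a model")
```

Domain experts can then review the generated messages and discard the unrealistic ones before anything is used as a test case.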

Build basic infrastructure:

* Custom viewing tools for experts
* Automated checks for predictable failures
* Complete logging
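Automated checks for predictable failures can be plain assertions over the output text. This is a sketch under assumptions: the three checks (empty output, leaked UUID-style internal IDs, unrendered `{{ }}` template placeholders) are hypothetical examples, and real checks should come from failures you have actually seen.

```python
import re

# Matches UUID-shaped strings, used here as a stand-in for internal IDs.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)


def check_not_empty(output: str) -> bool:
    return bool(output.strip())


def check_no_internal_ids(output: str) -> bool:
    return UUID_RE.search(output) is None


def check_no_placeholder(output: str) -> bool:
    return "{{" not in output and "}}" not in output


CHECKS = {
    "not_empty": check_not_empty,
    "no_internal_ids": check_no_internal_ids,
    "no_placeholder": check_no_placeholder,
}


def run_checks(output: str) -> list[str]:
    """Return the names of the checks that failed, in CHECKS order."""
    return [name for name, check in CHECKS.items() if not check(output)]


print(run_checks("Hello {{name}}, your order ships tomorrow."))
```

Checks like these are cheap enough to run on every output, so they can gate releases and flag production traces alike.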

#### Key Activities

* Experts review 100+ interactions
* Categorise every failure
* Fix issues through prompt refinement
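Once failures are categorised, a simple count shows where to focus first. A minimal sketch, assuming each review is a dict with a `passed` flag and a `failure_mode` label; the field names and category names are hypothetical.

```python
from collections import Counter

# Hypothetical expert reviews; failure_mode is None when the trace passed.
reviews = [
    {"trace_id": "t1", "passed": True, "failure_mode": None},
    {"trace_id": "t2", "passed": False, "failure_mode": "wrong_policy_cited"},
    {"trace_id": "t3", "passed": False, "failure_mode": "missed_escalation"},
    {"trace_id": "t4", "passed": False, "failure_mode": "wrong_policy_cited"},
    {"trace_id": "t5", "passed": True, "failure_mode": None},
    {"trace_id": "t6", "passed": False, "failure_mode": "wrong_policy_cited"},
]


def failure_counts(reviews):
    """Count failed reviews per failure mode."""
    return Counter(r["failure_mode"] for r in reviews if not r["passed"])


counts = failure_counts(reviews)
for mode, n in counts.most_common():
    print(f"{mode}: {n}")
```

The most common category is usually the first candidate for a prompt fix or a new automated check.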

**Success metric:** No deterministic failures reach users.

### Scaling Stage

**Objective:** Automate quality judgement. Maintain standards.

As usage grows, expert judgement becomes the template:

* Label examples pass/fail
* Document reasoning for each judgement
* Train AI to replicate judgement at scale

#### Key Activities

* Sample real interactions
* Categorise failure modes
* Automate repetitive judgements
* Validate automated judgements against human standards

Aim for >90% agreement between automated and human judgement.
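Agreement can be checked by comparing the automated pass/fail labels with the expert's labels on the same examples. A minimal sketch; the function names and sample labels are assumptions. It also reports agreement on the expert's failures alone, since overall agreement can look high when most examples pass.

```python
def agreement_rate(human, automated):
    """Share of examples where the automated label equals the human label."""
    if len(human) != len(automated):
        raise ValueError("label lists must be the same length")
    if not human:
        raise ValueError("need at least one labelled example")
    return sum(h == a for h, a in zip(human, automated)) / len(human)


def failure_recall(human, automated):
    """Of the examples the expert failed, the share the automated judge also failed."""
    on_human_fails = [a for h, a in zip(human, automated) if not h]
    if not on_human_fails:
        raise ValueError("no human-labelled failures to compare against")
    return sum(not a for a in on_human_fails) / len(on_human_fails)


# Hypothetical labels: True means pass, False means fail.
human = [True] * 12 + [False] * 8
automated = list(human)
automated[-1] = True  # the automated judge passes one case the expert failed

print(f"agreement: {agreement_rate(human, automated):.1%}")
print(f"failures caught: {failure_recall(human, automated):.1%}")
```

In this sample the overall agreement clears 90% while the automated judge still misses one in eight expert-flagged failures, which is why looking at both numbers helps.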

**Success metric:** Automated systems replicate expert judgement.

### Maturity Stage

**Objective:** Connect quality to business outcomes.

Shift from fixing errors to validating impact:

* Strategic sampling finds remaining issues
* Production monitoring prevents failures
* A/B testing validates business value
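One reading of strategic sampling is stratified sampling: take a few traces from every category so rare cases are not drowned out by common ones. This is an assumption about what the notes mean, and the `category` field and group sizes below are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(traces, key, per_group, seed=0):
    """Pick up to per_group traces from each distinct value of traces[key]."""
    rng = random.Random(seed)  # fixed seed keeps the review set reproducible
    groups = defaultdict(list)
    for trace in traces:
        groups[trace[key]].append(trace)

    sample = []
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Hypothetical production traces, heavily skewed towards one category.
traces = (
    [{"id": f"b{i}", "category": "billing"} for i in range(50)]
    + [{"id": f"r{i}", "category": "refund"} for i in range(3)]
    + [{"id": "l0", "category": "legal"}]
)

review_set = stratified_sample(traces, key="category", per_group=2)
print([t["id"] for t in review_set])
```

A uniform random sample of five traces here would most likely contain only billing cases; the stratified version always includes the single legal trace.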

Track experiments run, not features shipped. Use capability funnels:

* Error rates by category
* Edge case coverage
* Automated vs human alignment
* User behaviour changes

**Success metric:** Quality improvements drive behaviour changes.

## Critical Decisions

**Custom data viewers:** Highest-ROI investment. Purpose-built interfaces remove friction. Generic tools waste time and discourage analysis.

**Domain expert access:** Give experts direct access to prompts and logic. Engineers aren't translators. Dedicate senior expert time to quality standards.

**Measurement infrastructure:** Complete logging and trace storage. Foundation for all improvement.
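Trace storage does not need to be elaborate to start. A minimal sketch that appends one JSON record per interaction to a JSON Lines file; the function names, record fields, and file layout are assumptions, not a prescribed schema.

```python
import json
import time
import uuid
from pathlib import Path


def log_trace(path, user_input, output, metadata=None):
    """Append one interaction to a JSON Lines file and return its trace id."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "input": user_input,
        "output": output,
        "metadata": metadata or {},  # e.g. prompt version, model name
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record["trace_id"]


def read_traces(path):
    """Load every logged interaction back as a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `log_trace("traces.jsonl", question, answer, {"prompt_version": "v3"})` after each interaction gives later error analysis, sampling, and custom viewers a single place to read from.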

**Normalise failure:** Teams learn from failures. View systematic failure analysis as progress.

**Focus narrowly:** Build evaluators for discovered failures, not imagined problems. Optimising multiple dimensions fragments attention and delays progress.

**Remove jargon:** Technical terms prevent domain experts from contributing.

**Experiment mindset:** Leadership must support sustained experimentation. No premature optimisation.

## Takeaways

AI Evals aren't quality assurance. They're product development.

Without measurement, AI products stay prototypes.

With rigorous evals, teams build confidence. Iterate rapidly. Deliver value.

The question isn't whether to invest. It's how quickly you start.

<figure><img src="/files/lMZXPF0L9xp9ZFXNfomN" alt="Virtuous Cycle"><figcaption></figcaption></figure>

