You can't ship what you can't measure: agent evaluation as the foundation of production MLOps

The agent that was getting worse and nobody knew

A team I reviewed had deployed an LLM-based customer support agent six months earlier. Usage was growing. Stakeholders liked the engagement metrics. The team was busy shipping new features.

Then a customer support manager noticed something. Three months in, the agent's responses had become less helpful. Not dramatically wrong. Subtly less precise, more likely to hand back a generic answer instead of a specific one, occasionally missing context it used to catch.

I asked the team how they would detect this kind of slow drift. Long pause. They had A/B testing for new features. They had uptime monitoring. They had latency alerting.

They had no evaluation pipeline.

The degradation had accumulated over three months of prompt tweaks, model updates, and retrieval config changes, and not one of those changes had been measured against a consistent quality benchmark. The agent drifted, and nobody knew until a human noticed.

This is not an unusual story. It's the norm on teams that treat evaluation as a launch-day checkbox instead of an ongoing operational discipline.

Why agent evaluation is different from traditional software testing

In traditional software, testing is about correctness. Does the function return the expected output for a given input? Input and output are linked deterministically. A test suite that passes today passes tomorrow if the code hasn't changed.

LLM agent evaluation breaks that comfort in three ways.

The same input can produce different outputs across invocations. Evaluation has to measure distributions, not point values.

An LLM agent can lose quality without ever returning an error. A degraded agent still returns 200 OK. The damage shows up only in the output itself, which means you need human or AI-assisted judgment to catch it.

And quality has many dimensions at once. A good response is correct, relevant, faithful to its source material, safe, appropriately confident, and fast enough. These move independently. An optimization that improves faithfulness can quietly cost you relevance. A pass/fail test will never see that.
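
To make the first of those points concrete, here is a minimal sketch of scoring a distribution rather than a single output. It assumes you supply two callables of your own, `run_agent` and `score` (both hypothetical names, not part of any library), and the 0.5 pass threshold is an arbitrary placeholder.

```python
import statistics
from typing import Callable, Dict, List


def score_distribution(
    run_agent: Callable[[str], str],   # hypothetical: calls your agent once
    score: Callable[[str], float],     # hypothetical: maps a response to [0, 1]
    query: str,
    n_samples: int = 10,
    pass_threshold: float = 0.5,       # assumed cutoff, tune for your system
) -> Dict[str, float]:
    """Run the same query n_samples times and summarize the score spread."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    scores: List[float] = [score(run_agent(query)) for _ in range(n_samples)]
    return {
        "mean": statistics.mean(scores),
        # stdev needs at least two samples
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
    }
```

The point of returning `min` and `pass_rate` alongside the mean is that an agent can look fine on average while handing back a generic answer on a meaningful fraction of runs.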

The five evaluation dimensions

+-------------------+------------------------------------------+
| DIMENSION         | WHAT IT MEASURES                         |
+-------------------+------------------------------------------+
| Correctness       | Is the answer factually accurate?        |
|                   | (Requires ground truth answers)          |
+-------------------+------------------------------------------+
| Relevance         | Does the answer address the question?    |
|                   | (Can diverge from correctness)           |
+-------------------+------------------------------------------+
| Faithfulness      | Is the answer grounded in the source     |
|                   | context? (Anti-hallucination metric)     |
+-------------------+------------------------------------------+
| Safety            | Does the answer avoid harmful,           |
|                   | biased, or off-policy content?           |
+-------------------+------------------------------------------+
| Latency           | Is response time within SLA?             |
|                   | (P50, P90, P99 percentiles)              |
+-------------------+------------------------------------------+
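
Latency is the one dimension you can compute without any judgment at all. As a rough sketch, using only the standard library, P50, P90, and P99 can be read off a list of logged response times. The function names and the shape of the SLA dictionary are my own assumptions for illustration.

```python
import statistics
from typing import Dict, List


def latency_percentiles(latencies_ms: List[float]) -> Dict[str, float]:
    """Return P50, P90, and P99 for a list of response times in milliseconds."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # n=100 yields 99 cut points; the cut point at index i-1 is the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


def within_sla(latencies_ms: List[float], sla_ms: Dict[str, float]) -> bool:
    """True if every percentile named in sla_ms is at or under its limit."""
    observed = latency_percentiles(latencies_ms)
    return all(observed[name] <= limit for name, limit in sla_ms.items())
```

A call such as `within_sla(samples, {"p90": 2000, "p99": 5000})` would then check two limits at once; the numbers here are placeholders, not recommendations.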

Not every system needs all five weighted equally. A financial document extraction system lives or dies on correctness and faithfulness. A customer support agent cares most about relevance and safety. A code generation assistant cares about correctness and latency.

So the first step in building an evaluation framework is deciding which dimensions actually matter for your use case, and weighting them honestly.
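
One simple way to write that weighting down is a weighted average over per-dimension scores. The sketch below assumes each dimension has already been scored on a 0 to 1 scale; the weights shown in the usage note are illustrative, not a recommendation.

```python
from typing import Dict


def weighted_quality(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-dimension scores in [0, 1] into one weighted score."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Normalize by the weight total so weights do not have to sum to 1.
    return sum(scores[dim] * w for dim, w in weights.items()) / total
```

For a support agent you might start from something like `{"relevance": 0.35, "safety": 0.30, "correctness": 0.15, "faithfulness": 0.10, "latency": 0.10}`. Keep reporting the individual dimensions next to the composite, because a single number can hide exactly the kind of trade-off described earlier, where one dimension improves while another slips.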

The MLOps evaluation pipeline

Evaluation is not a phase you pass through once. It runs at every stage of the system lifecycle.

graph TD
    subgraph DEV["Development Phase"]
        CODE[Code / Prompt Change] --> UNIT_EVAL[Unit Evaluation - sample subset]
        UNIT_EVAL --> PR_GATE{PR Quality Gate}
        PR_GATE -->|Passes| STAGING[Staging Deployment]
        PR_GATE -->|Fails| BLOCK[Blocked - eval report]
    end

    subgraph STAGING_EVAL["Staging Evaluation"]
        STAGING --> FULL_EVAL[Full Evaluation - complete benchmark]
        FULL_EVAL --> COMPARE[Comparison vs. Production Baseline]
        COMPARE --> DEPLOY_GATE{Deployment Gate}
        DEPLOY_GATE -->|Passes| PROD[Production Deployment]
        DEPLOY_GATE -->|Fails| ROLLBACK[Block + Regression Report]
    end

    subgraph PROD_MONITORING["Production Monitoring"]
        PROD --> SAMPLE[5% Random Sample - flagged for eval]
        PROD --> TRIGGER[User Feedback Triggers]
        SAMPLE --> ASYNC_EVAL[Async Evaluation - LLM-as-Judge]
        TRIGGER --> ASYNC_EVAL
        ASYNC_EVAL --> DAILY_REPORT[Daily Quality Report]
        DAILY_REPORT --> ALERT3{Quality Alert}
        ALERT3 -->|Degradation| ENG_ALERT[Engineering Alert]
        ALERT3 -->|Within bounds| TREND[Trend Dashboard]
    end

    subgraph DATASET["Evaluation Dataset Management"]
        ASYNC_EVAL --> CANDIDATE[Candidate Addition - interesting cases]
        CANDIDATE --> ANNOTATE[Human Annotation Queue]
        ANNOTATE --> GOLDEN[Golden Dataset - updated monthly]
        GOLDEN --> FULL_EVAL
        GOLDEN --> UNIT_EVAL
    end
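
The deployment gate in the diagram boils down to a comparison between the candidate's benchmark scores and the production baseline. Here is a minimal sketch of that check, assuming both are dictionaries of per-dimension scores on a 0 to 1 scale. The `max_regression` tolerance of 0.02 is a made-up default that you would tune against the noise in your own benchmark.

```python
from typing import Dict, List, Tuple


def deployment_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_regression: float = 0.02,   # assumed tolerance, not a recommended value
) -> Tuple[bool, List[str]]:
    """Pass only if no dimension drops more than max_regression below baseline."""
    regressions: List[str] = []
    for dim, base in baseline.items():
        cand = candidate.get(dim)
        if cand is None:
            regressions.append(f"{dim}: missing from candidate results")
        elif base - cand > max_regression:
            regressions.append(f"{dim}: {base:.3f} -> {cand:.3f}")
    # The list of regressions doubles as the body of the regression report.
    return (not regressions, regressions)
```

Checking each dimension separately, instead of a single aggregate, is a deliberate choice here: it stops a gain in one dimension from masking a loss in another.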

Building the evaluation dataset

The evaluation dataset is the foundation of the whole framework. Without a curated, annotated set of representative queries and expected outputs, you can't measure anything consistently.

Building that initial dataset is the slowest part of standing up an evaluation framework, and the part teams skip most often. Don't skip it.

For a new system, start with 50 to 100 hand-curated query-response pairs. Roughly 60% should be common cases, 25% edge cases that map to known failure modes, and 15% adversarial cases built to stress-test safety and correctness.
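
As a quick sanity check on that mix, the sketch below turns the 60/25/15 split into per-category counts for a given dataset size and reports the actual mix of an existing dataset. The category labels `common`, `edge`, and `adversarial` are names I chose for illustration.

```python
from collections import Counter
from typing import Dict, Iterable

# Target mix from the text, as whole percentages.
TARGET_MIX = {"common": 60, "edge": 25, "adversarial": 15}


def target_counts(total: int) -> Dict[str, int]:
    """Split a dataset size into per-category counts that sum to total."""
    counts = {name: total * pct // 100 for name, pct in TARGET_MIX.items()}
    # Integer division can leave a remainder; give it to the adversarial cases.
    counts["adversarial"] += total - sum(counts.values())
    return counts


def actual_mix(categories: Iterable[str]) -> Dict[str, float]:
    """Fraction of cases per category, for comparing a dataset to the target."""
    tally = Counter(categories)
    n = sum(tally.values())
    return {name: tally.get(name, 0) / n for name in TARGET_MIX} if n else {}
```

Since the percentages are rough guidance, the rounding rule is arbitrary; the useful part is noticing when a dataset has quietly become 95% common cases.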

For an existing system with production history, mine the logs for the interesting cases: high-confidence responses that users later corrected, queries that triggered escalation or negative feedback, and queries that sit right at the edge of what the agent can handle.
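
A rough sketch of that mining step follows. Every field name in it (`confidence`, `user_corrected`, `escalated`, `feedback`) is hypothetical, so map them to whatever your logging actually records. It covers only the first two categories; finding queries at the edge of the agent's capability usually takes a human eye.

```python
from typing import Dict, Iterable, List


def mine_candidates(
    log_records: Iterable[Dict],
    confidence_floor: float = 0.8,   # assumed cutoff for "high-confidence"
) -> List[Dict]:
    """Pick logged interactions worth sending to annotation, tagged with reasons."""
    candidates: List[Dict] = []
    for rec in log_records:
        reasons: List[str] = []
        if rec.get("confidence", 0.0) >= confidence_floor and rec.get("user_corrected"):
            reasons.append("confident_but_corrected")
        if rec.get("escalated"):
            reasons.append("escalated")
        if rec.get("feedback") == "negative":
            reasons.append("negative_feedback")
        if reasons:
            # Copy the record so the original log entry is left untouched.
            candidates.append({**rec, "reasons": reasons})
    return candidates
```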

For annotation, each query needs at minimum three things: the query itself, the ideal response (or a description of what an ideal response contains), and the evaluation criteria that apply to it.
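
Those three fields map naturally onto a small record type. The sketch below is one possible shape, not a standard schema; the class name `EvalCase` and the one-JSON-object-per-line storage format are my assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class EvalCase:
    query: str
    ideal_response: str   # or a description of what an ideal response contains
    criteria: List[str] = field(default_factory=list)   # e.g. ["correctness"]

    def __post_init__(self) -> None:
        # Reject incomplete annotations early instead of at evaluation time.
        if not self.query.strip():
            raise ValueError("query must not be empty")
        if not self.ideal_response.strip():
            raise ValueError("ideal_response must not be empty")
        if not self.criteria:
            raise ValueError("at least one evaluation criterion is required")

    def to_jsonl(self) -> str:
        """Serialize to a single JSON line for storage in a dataset file."""
        return json.dumps(asdict(self), ensure_ascii=False)
```

Validating at construction time means a half-annotated case fails loudly when it enters the dataset, not silently when the benchmark runs.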

LLM-as-judge: scaling evaluation without scaling headcount

Human evaluation of LLM outputs is the gold standard. It is also expensive and slow. For a system handling thousands of interactions a day, having a person review every output is not feasible.

LLM-as-judge uses a separate, higher-capability model to score the production agent's outputs against your evaluation criteria. That scales to production volume.

The implementation needs care. Instruct the judge to evaluate against specific, well-defined criteria, not to hand back a general opinion. Calibrate it against human annotations so its scores actually line up with human scores on the evaluation set. And treat its output as a signal, not ground truth: very high and very low scores tend to be reliable, while the middle of the range is where you still want a human to look.
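
Two of those points lend themselves to a short sketch: a prompt that pins the judge to named criteria, and a triage rule that sends mid-range scores to a human. The template wording, the 0 to 1 scale, and the 0.3 and 0.7 cutoffs are all assumptions for illustration. The call to the judge model itself is left out because it depends on your provider.

```python
from typing import List

JUDGE_TEMPLATE = """You are grading a support agent's answer.
Score it from 0.0 to 1.0 on each criterion below and return JSON only.

Criteria:
{criteria}

Question: {query}
Reference answer: {ideal}
Agent answer: {answer}
"""


def build_judge_prompt(query: str, ideal: str, answer: str, criteria: List[str]) -> str:
    """Fill the template so the judge scores named criteria, not a general impression."""
    bullet_list = "\n".join(f"- {c}" for c in criteria)
    return JUDGE_TEMPLATE.format(criteria=bullet_list, query=query, ideal=ideal, answer=answer)


def triage(judge_score: float, low: float = 0.3, high: float = 0.7) -> str:
    """Trust confident scores automatically; route the uncertain middle to a person."""
    if not 0.0 <= judge_score <= 1.0:
        raise ValueError("judge_score must be between 0 and 1")
    if judge_score >= high:
        return "auto_pass"
    if judge_score <= low:
        return "auto_fail"
    return "human_review"
```

The width of the human-review band is the knob that trades annotation cost against how much you trust the judge, so it is worth revisiting after each calibration round.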

In my production systems, a properly calibrated LLM judge hits 85 to 90% agreement with human annotation on binary quality judgments. That's enough for production monitoring, where the job is spotting degradation trends, not nailing perfect precision on every single response.
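
Measuring that agreement is straightforward once you have paired labels. The sketch below computes the raw agreement rate on binary judgments, plus Cohen's kappa, which corrects for agreement that would happen by chance. Kappa is my addition here, not the metric quoted above, and the function names are illustrative.

```python
from typing import Sequence


def agreement_rate(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Fraction of cases where the judge's binary label matches the human's."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Agreement corrected for chance, for two raters and two labels."""
    p_observed = agreement_rate(human, judge)
    n = len(human)
    h_yes = sum(bool(h) for h in human) / n
    j_yes = sum(bool(j) for j in judge) / n
    # Chance agreement: both say yes, or both say no, independently.
    p_chance = h_yes * j_yes + (1 - h_yes) * (1 - j_yes)
    if p_chance == 1.0:
        return 1.0   # both raters used the same single label throughout
    return (p_observed - p_chance) / (1 - p_chance)
```

Raw agreement can look flattering when most responses are good, since a judge that always says "pass" would agree with the humans most of the time. Looking at kappa next to it guards against that.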

What good evaluation enables

An evaluation framework is not just a quality gate. It speeds the whole team up.

With evaluation in place, prompt changes and model updates can be tested against the benchmark in minutes instead of through days of manual review. The deployment gate lets the team ship without that low hum of anxiety about unknown quality impacts. And when several optimization approaches are on the table, the evaluation data tells you which one actually improves quality, not which one feels better in a demo.

Without evaluation, every change is a guess, every deployment is a roll of the dice, and your users find the quality problems before your engineers do.