Building a production LLM evaluation pipeline that caught 12 regressions before users did

Before the evaluation pipeline existed

In the first year of running production LLM systems at an AI-powered loyalty and gifting platform, our process for shipping changes felt adequate. Review the code, test the integration, deploy to staging, confirm the basic workflow, deploy to production.

The problem was the word "basic." We were confirming that the system worked. We were not measuring whether it worked well.

The production system was a RAG-based AI agent doing financial document extraction, loyalty recommendation, and AI-assisted forecasting. We updated it often: prompt adjustments, model version changes, retrieval configuration tweaks, new document types added to the knowledge base.

Any of those changes could move output quality. Most did not. Some did, and we usually found out when a business stakeholder noticed something wrong in a report, or a loyalty recommendation just looked off.

The drift added up. Looking back, we estimated that roughly 14 days of degraded output had reached users over a six-month period before we built the evaluation framework. In a financial context, 14 days of bad outputs feeds into reports and decisions in ways that are hard to quantify and unpleasant to think about.

Building the evaluation framework: the three-phase approach

Phase 1: the golden dataset

The first thing we built was not a pipeline. It was a dataset.

I set aside two weeks of engineering time to build a curated evaluation dataset of 200 query-response pairs. It was not a popular call. It felt slow, and the team wanted to build tooling. I held the line anyway. A pipeline without a calibrated dataset is not an evaluation system. It is a measurement system with an undefined measuring instrument.

The 200 items covered three categories.

Routine cases (120 items) were the most common query patterns from production logs, spanning financial extraction, loyalty recommendation, and forecasting. These set the baseline for what normal performance looked like.

Edge cases (50 items) were queries from production logs that had previously produced incorrect or low-confidence outputs, queries with unusual document formats, and queries sitting at the edge of what the system claimed to handle.

Adversarial cases (30 items) were built to probe specific failure modes: leading questions that nudged the model toward hallucination, queries with deliberately ambiguous intent, and queries with conflicting information in the context.

Each item in the dataset carried the query, the ideal response (or a description of what an ideal response should look like), the evaluation criteria that applied to it, and the failure mode we expected if the system degraded. Not every criterion applies equally to every item, and the expected failure mode tells the evaluator what to watch for.

Phase 2: the automated evaluation pipeline

With the dataset in place, we built the pipeline.

graph TD
    subgraph TRIGGER["Evaluation Triggers"]
        PR_MERGE[PR Merged to Main]
        MODEL_UPDATE[Model Version Update]
        PROMPT_CHANGE[Prompt Configuration Change]
        DAILY_CRON[Daily Production Sample - 5% of volume]
    end

    subgraph EVAL_RUNNER["Evaluation Runner"]
        TRIGGER_HANDLER[Trigger Handler]
        DATASET_LOADER[Golden Dataset Loader]
        BATCH_INF[Batch Inference - production system]
        JUDGE[LLM-as-Judge - GPT-4 with eval rubric]
        SCORER[Multi-Dimension Scorer]
    end

    subgraph SCORING["Scoring Dimensions"]
        CORR_S[Correctness Score - 0 to 1]
        FAITH_S[Faithfulness Score - 0 to 1]
        REL_S[Relevance Score - 0 to 1]
        LAT_S[Latency Percentiles - P50/P90/P99]
    end

    subgraph OUTPUT["Evaluation Output"]
        WEIGHTED[Weighted Composite Score]
        COMPARE2[Delta vs. Previous Baseline]
        REGRESS{Regression Check}
        REPORT[Eval Report - Slack + Dashboard]
    end

    PR_MERGE --> TRIGGER_HANDLER
    MODEL_UPDATE --> TRIGGER_HANDLER
    PROMPT_CHANGE --> TRIGGER_HANDLER
    DAILY_CRON --> TRIGGER_HANDLER
    TRIGGER_HANDLER --> DATASET_LOADER
    DATASET_LOADER --> BATCH_INF
    BATCH_INF --> JUDGE
    JUDGE --> SCORER
    SCORER --> CORR_S
    SCORER --> FAITH_S
    SCORER --> REL_S
    SCORER --> LAT_S
    CORR_S --> WEIGHTED
    FAITH_S --> WEIGHTED
    REL_S --> WEIGHTED
    LAT_S --> WEIGHTED
    WEIGHTED --> COMPARE2
    COMPARE2 --> REGRESS
    REGRESS -->|>3% decline| REPORT
    REGRESS -->|Within bounds| REPORT

The composite score weighted each dimension by business risk: correctness 40%, faithfulness 35%, relevance 20%, latency 5%. The weights came from the financial context of our system. A wrong answer in a financial report does more damage than a slow one.

We set the regression threshold at a 3% composite score decline. Any change that dropped the score 3% or more against the prior baseline got flagged as a regression and had to be reviewed before it could go to production.

Phase 3: the production monitoring loop

The pipeline ran at two cadences.

On-change evaluation fired on every PR merge, model update, or prompt configuration change. It ran against the full 200-item dataset and returned results within 20 minutes of the trigger.

Daily production sample evaluation took 5% of production traffic, anonymized it, and scored it against the same rubric as the golden dataset. This caught quality drift driven by input distribution changes, where production queries were diverging from the golden dataset in ways that hurt quality.

The daily sample turned out to be the more valuable of the two. Several of the regressions we caught were distribution drift issues that the golden dataset evaluation alone would never have surfaced.

The 12 regressions we caught

Over six months, the pipeline caught 12 regressions before they reached production users.

#	Trigger	Root Cause	Severity
1	Model version update (GPT-3.5 -> GPT-4o)	Output format changed - JSON schema non-compliance increased 18%	High
2	Prompt change - financial extraction	Chain-of-thought instruction removed - faithfulness declined 8%	High
3	RAG config change - chunk size increased	Longer chunks reduced retrieval precision - relevance declined 11%	High
4	New document type added to knowledge base	Embedding distribution shift - retrieval for existing doc types degraded	Medium
5-7	Three prompt variations tested A/B	Each evaluated lower than baseline on at least one dimension	Medium
8	Dependency library update	Tokenization change altered context assembly - subtle correctness decline	Low
9	Distribution drift (daily sample)	New partner's transaction data format changed - extraction accuracy fell 12% for partner A	High
10	Distribution drift (daily sample)	Seasonal vocabulary shift in loyalty queries - relevance declined for Q4 context	Low
11	Prompt change - recommendation engine	Instruction change inadvertently reduced diversity of recommendations	Medium
12	Model temperature config change	Higher temperature increased output variance - faithfulness score standard deviation doubled	Medium

Regressions 1 through 3 were the worst of the set. A GPT model version update that silently changes output format compliance is exactly the kind of change that produces financial reporting errors that sit undetected for weeks without an evaluation gate. We caught it in 20 minutes.

The organizational impact

The pipeline changed how the team worked, not just what we shipped.

Before it, every prompt change or model update felt like a gamble. Engineers were cautious because they could not predict the quality impact, and the team was moving slowly partly out of that caution.

After it, a change could be proposed, evaluated, and accepted or rejected in under an hour. The caution faded because data had replaced the uncertainty. Iteration sped up.

Mean time to detect quality issues fell from the pre-pipeline baseline of 14-plus days, found by a human noticing something, to under 6 hours for on-change regressions and under 24 hours for distribution drift.

What it cost to build

The initial dataset took roughly 80 person-hours over two weeks, split between an engineer and a domain analyst.

The pipeline itself took about three engineering weeks, including the LLM-as-Judge calibration work.

Running it costs around 2 hours a week of dashboard review and anomaly investigation.

On the return: in the six months after deployment, we estimated that the 12 caught regressions, had they reached production, would each have taken about 3 days to diagnose and remediate post-deployment. That is roughly 36 engineering days of prevention against three weeks of build cost.

And that math leaves out the downstream business impact of financial reports built on degraded outputs. That is harder to quantify, but it was the actual reason we made the investment.