Building a production LLM evaluation pipeline that caught 12 regressions before users did
How we built a three-phase evaluation framework (golden dataset, automated pipeline, production monitoring loop) that cut mean time to detect quality issues from 14 days to under 6 hours.
Before the evaluation pipeline existed
In the first year of running production LLM systems at an AI-powered loyalty and gifting platform, our process for shipping changes felt adequate. Review the code, test the integration, deploy to staging, confirm the basic workflow, deploy to production.
The problem was the word "basic." We were confirming that the system worked. We were not measuring whether it worked well.
The production system was a RAG-based AI agent doing financial document extraction, loyalty recommendation, and AI-assisted forecasting. We updated it often: prompt adjustments, model version changes, retrieval configuration tweaks, new document types added to the knowledge base.
Any of those changes could move output quality. Most did not. Some did, and we usually found out when a business stakeholder noticed something wrong in a report, or a loyalty recommendation just looked off.
The drift added up. Looking back, we estimated that roughly 14 days of degraded output had reached users over a six-month period before we built the evaluation framework. In a financial context, 14 days of bad outputs feeds into reports and decisions in ways that are hard to quantify and unpleasant to think about.
Building the evaluation framework: the three-phase approach
Phase 1: the golden dataset
The first thing we built was not a pipeline. It was a dataset.
I set aside two weeks of engineering time to build a curated evaluation dataset of 200 query-response pairs. It was not a popular call. It felt slow, and the team wanted to build tooling. I held the line anyway. A pipeline without a calibrated dataset is not an evaluation system. It is a measurement system with an undefined measuring instrument.
The 200 items covered three categories.
Routine cases (120 items) were the most common query patterns from production logs, spanning financial extraction, loyalty recommendation, and forecasting. These set the baseline for what normal performance looked like.
Edge cases (50 items) were queries from production logs that had previously produced incorrect or low-confidence outputs, queries with unusual document formats, and queries sitting at the edge of what the system claimed to handle.
Adversarial cases (30 items) were built to probe specific failure modes: leading questions that nudged the model toward hallucination, queries with deliberately ambiguous intent, and queries with conflicting information in the context.
Each item in the dataset carried the query, the ideal response (or a description of what an ideal response should look like), the evaluation criteria that applied to it, and the failure mode we expected if the system degraded. Not every criterion applies equally to every item, and the expected failure mode tells the evaluator what to watch for.
Phase 2: the automated evaluation pipeline
With the dataset in place, we built the pipeline.
graph TD
subgraph TRIGGER["Evaluation Triggers"]
PR_MERGE[PR Merged to Main]
MODEL_UPDATE[Model Version Update]
PROMPT_CHANGE[Prompt Configuration Change]
DAILY_CRON[Daily Production Sample - 5% of volume]
end
subgraph EVAL_RUNNER["Evaluation Runner"]
TRIGGER_HANDLER[Trigger Handler]
DATASET_LOADER[Golden Dataset Loader]
BATCH_INF[Batch Inference - production system]
JUDGE[LLM-as-Judge - GPT-4 with eval rubric]
SCORER[Multi-Dimension Scorer]
end
subgraph SCORING["Scoring Dimensions"]
CORR_S[Correctness Score - 0 to 1]
FAITH_S[Faithfulness Score - 0 to 1]
REL_S[Relevance Score - 0 to 1]
LAT_S[Latency Percentiles - P50/P90/P99]
end
subgraph OUTPUT["Evaluation Output"]
WEIGHTED[Weighted Composite Score]
COMPARE2[Delta vs. Previous Baseline]
REGRESS{Regression Check}
REPORT[Eval Report - Slack + Dashboard]
end
PR_MERGE --> TRIGGER_HANDLER
MODEL_UPDATE --> TRIGGER_HANDLER
PROMPT_CHANGE --> TRIGGER_HANDLER
DAILY_CRON --> TRIGGER_HANDLER
TRIGGER_HANDLER --> DATASET_LOADER
DATASET_LOADER --> BATCH_INF
BATCH_INF --> JUDGE
JUDGE --> SCORER
SCORER --> CORR_S
SCORER --> FAITH_S
SCORER --> REL_S
SCORER --> LAT_S
CORR_S --> WEIGHTED
FAITH_S --> WEIGHTED
REL_S --> WEIGHTED
LAT_S --> WEIGHTED
WEIGHTED --> COMPARE2
COMPARE2 --> REGRESS
REGRESS -->|>3% decline| REPORT
REGRESS -->|Within bounds| REPORT
The composite score weighted each dimension by business risk: correctness 40%, faithfulness 35%, relevance 20%, latency 5%. The weights came from the financial context of our system. A wrong answer in a financial report does more damage than a slow one.
We set the regression threshold at a 3% composite score decline. Any change that dropped the score 3% or more against the prior baseline got flagged as a regression and had to be reviewed before it could go to production.
Phase 3: the production monitoring loop
The pipeline ran at two cadences.
On-change evaluation fired on every PR merge, model update, or prompt configuration change. It ran against the full 200-item dataset and returned results within 20 minutes of the trigger.
Daily production sample evaluation took 5% of production traffic, anonymized it, and scored it against the same rubric as the golden dataset. This caught quality drift driven by input distribution changes, where production queries were diverging from the golden dataset in ways that hurt quality.
The daily sample turned out to be the more valuable of the two. Several of the regressions we caught were distribution drift issues that the golden dataset evaluation alone would never have surfaced.
The 12 regressions we caught
Over six months, the pipeline caught 12 regressions before they reached production users.
| # | Trigger | Root Cause | Severity |
|---|---|---|---|
| 1 | Model version update (GPT-3.5 -> GPT-4o) | Output format changed - JSON schema non-compliance increased 18% | High |
| 2 | Prompt change - financial extraction | Chain-of-thought instruction removed - faithfulness declined 8% | High |
| 3 | RAG config change - chunk size increased | Longer chunks reduced retrieval precision - relevance declined 11% | High |
| 4 | New document type added to knowledge base | Embedding distribution shift - retrieval for existing doc types degraded | Medium |
| 5-7 | Three prompt variations tested A/B | Each evaluated lower than baseline on at least one dimension | Medium |
| 8 | Dependency library update | Tokenization change altered context assembly - subtle correctness decline | Low |
| 9 | Distribution drift (daily sample) | New partner's transaction data format changed - extraction accuracy fell 12% for partner A | High |
| 10 | Distribution drift (daily sample) | Seasonal vocabulary shift in loyalty queries - relevance declined for Q4 context | Low |
| 11 | Prompt change - recommendation engine | Instruction change inadvertently reduced diversity of recommendations | Medium |
| 12 | Model temperature config change | Higher temperature increased output variance - faithfulness score standard deviation doubled | Medium |
Regressions 1 through 3 were the worst of the set. A GPT model version update that silently changes output format compliance is exactly the kind of change that produces financial reporting errors that sit undetected for weeks without an evaluation gate. We caught it in 20 minutes.
The organizational impact
The pipeline changed how the team worked, not just what we shipped.
Before it, every prompt change or model update felt like a gamble. Engineers were cautious because they could not predict the quality impact, and the team was moving slowly partly out of that caution.
After it, a change could be proposed, evaluated, and accepted or rejected in under an hour. The caution faded because data had replaced the uncertainty. Iteration sped up.
Mean time to detect quality issues fell from the pre-pipeline baseline of 14-plus days, found by a human noticing something, to under 6 hours for on-change regressions and under 24 hours for distribution drift.
What it cost to build
The initial dataset took roughly 80 person-hours over two weeks, split between an engineer and a domain analyst.
The pipeline itself took about three engineering weeks, including the LLM-as-Judge calibration work.
Running it costs around 2 hours a week of dashboard review and anomaly investigation.
On the return: in the six months after deployment, we estimated that the 12 caught regressions, had they reached production, would each have taken about 3 days to diagnose and remediate post-deployment. That is roughly 36 engineering days of prevention against three weeks of build cost.
And that math leaves out the downstream business impact of financial reports built on degraded outputs. That is harder to quantify, but it was the actual reason we made the investment.
Work with me
Ready to discuss your architecture?
I work with founders and engineering leaders as a Fractional CTO to translate business goals into technical strategy - and execute on them. Free 30-minute Technical Health Check to start.
Book a call