Why Traditional Testing Fails for AI
Standard software testing is deterministic: given input X, the correct output is always Y. You write assert output == Y and ship.
AI outputs are non-deterministic. The same input can produce different outputs on every run. Even at temperature 0, where outputs are mostly repeatable, they change when the model is updated - and providers update models without asking you.
This breaks every testing assumption you’re used to:
| Traditional Testing | AI Testing |
|---|---|
| Exact string match | Semantic similarity check |
| Binary pass/fail | Scored on a rubric |
| Deterministic output | Probabilistic output |
| Runs once per commit | Runs continuously in production |
| Tests written once | Test cases need ongoing curation |
You don’t abandon testing. You evolve it. The field calls this evals (evaluations).
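The shift from "exact match, run once" to "property check, run many times" can be sketched in a few lines. This is a minimal illustration, not a real harness: `fake_model` is a hypothetical stand-in that picks a canned answer at random, and `property_pass_rate` and `mentions_paris` are names invented for this example.

```python
import random
from typing import Callable


def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a non-deterministic model call."""
    return rng.choice([
        "The capital of France is Paris.",
        "Paris.",
        "It's Paris, on the Seine.",
        "I cannot answer that.",
    ])


def property_pass_rate(
    prompt: str,
    check: Callable[[str], bool],
    runs: int = 20,
    seed: int = 0,
) -> float:
    """Run the same prompt `runs` times and return the fraction passing `check`."""
    rng = random.Random(seed)
    passed = sum(check(fake_model(prompt, rng)) for _ in range(runs))
    return passed / runs


def mentions_paris(output: str) -> bool:
    """A property check: many different wordings can all pass."""
    return "paris" in output.lower()


rate = property_pass_rate("What is the capital of France?", mentions_paris)
print(f"Pass rate over 20 runs: {rate:.0%}")
```

An exact-match assertion would reject most of the correct answers here, while the property check reports a rate that you can track over time.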
The Three Eval Types
AI Eval Pyramid
flowchart TD
H[Human Evals Gold standard, expensive, slow]
R[Rubric Evals LLM-as-judge, scalable, requires calibration]
A[Assertion Evals Fast, deterministic, limited coverage]
A --> R --> H
style H fill:#f3e8ff,stroke:#7c3aed,color:#7c3aed
style R fill:#fef3c7,stroke:#d97706,color:#b45309
style A fill:#dcfce7,stroke:#16a34a,color:#15803d
Assertion-based evals - the fastest tier. Check exact matches, substring contains, JSON structure, or response format. These run in milliseconds and are fully deterministic: the same output always gets the same verdict. Example: does the response contain a valid JSON object? Does it start with “I cannot”? These catch clear failures cheaply.
Rubric-based evals (LLM-as-judge) - a second LLM grades the first LLM’s output against a defined rubric. “Is this response accurate, concise, and in the correct language? Score 1-5.” These can scale to thousands of examples but require calibration against human labels to be trusted.
Human evals - a human reviewer reads the output and judges it. The gold standard. Too expensive to run on every commit, but necessary for high-stakes decisions and for calibrating your automated evals.
In practice: Use all three. Assertion evals gate CI/CD. Rubric evals run on every deploy. Human evals run quarterly and before major model upgrades.
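The assertion tier is the simplest to sketch. The helpers below are illustrative only: the function names are invented for this example, and `is_json_object` makes the stricter assumption that the whole response, not just part of it, must parse as a JSON object.

```python
import json


def is_json_object(output: str) -> bool:
    """True if the entire response parses as a JSON object (a dict)."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def starts_with_refusal(output: str, prefix: str = "I cannot") -> bool:
    """True if the response opens with a refusal phrase (case-insensitive)."""
    return output.strip().lower().startswith(prefix.lower())


def contains_all(output: str, terms: list[str]) -> bool:
    """True if every required term appears somewhere in the response."""
    lowered = output.lower()
    return all(term.lower() in lowered for term in terms)
```

Each check is a pure function of the output string, which is what makes this tier cheap enough to run on every commit.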
The Eval Pipeline
Eval Pipeline: From Test Cases to Deployment Gate
flowchart LR
TC[Test Case Library
input + expected properties]
RUN[Run AI System
generate outputs]
ASSERT[Assertion Evals
fast checks]
JUDGE[LLM Judge
rubric scoring]
DB[(Eval Database
results + history)]
GATE{Pass threshold
>= 85%?}
DEPLOY[Deploy]
BLOCK[Block Deploy
alert team]
TC --> RUN --> ASSERT --> GATE
RUN --> JUDGE --> GATE
ASSERT --> DB
JUDGE --> DB
GATE -- yes --> DEPLOY
GATE -- no --> BLOCK
style DEPLOY fill:#dcfce7,stroke:#16a34a,color:#15803d
style BLOCK fill:#fef2f2,stroke:#ef4444,color:#dc2626
style GATE fill:#fef3c7,stroke:#d97706,color:#b45309
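The gate at the end of the pipeline can be sketched as a small function. This is one possible reading of the diagram, under stated assumptions: a case counts as passing only if its assertions pass and its judge score (when present) is at least a hypothetical cutoff of 4, and the 85% threshold is applied to the fraction of passing cases. `CaseResult`, `case_passed`, and `gate` are names invented here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    test_id: str
    assertion_passed: bool
    judge_score: Optional[int]  # None when the case has no rubric


def case_passed(result: CaseResult, min_judge_score: int = 4) -> bool:
    """Assumption: a case passes if assertions pass and the judge score clears the cutoff."""
    if not result.assertion_passed:
        return False
    return result.judge_score is None or result.judge_score >= min_judge_score


def gate(results: list[CaseResult], threshold: float = 0.85) -> str:
    """Return "DEPLOY" if the pass rate meets the threshold, else "BLOCK"."""
    if not results:
        return "BLOCK"  # no evidence, so do not deploy
    pass_rate = sum(case_passed(r) for r in results) / len(results)
    return "DEPLOY" if pass_rate >= threshold else "BLOCK"


# Nine passing cases and one assertion failure: 90% pass rate.
results = [CaseResult(f"tc-{i:03d}", True, 5) for i in range(9)]
results.append(CaseResult("tc-009", False, 5))
decision = gate(results)
```

In a real pipeline the "BLOCK" branch would also alert the team, and every `CaseResult` would be written to the eval database before the gate runs.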
The LLM-as-Judge Pattern
LLM-as-Judge: How It Works
flowchart TD
INPUT[User Input] --> SUT[System Under Test AI Application]
SUT --> OUTPUT[AI Output]
INPUT --> JUDGE[Judge LLM Different model]
OUTPUT --> JUDGE
RUBRIC[Rubric Scoring criteria] --> JUDGE
JUDGE --> SCORE[Score 1-5 + reasoning]
SCORE --> REPORT[Eval Report]
style JUDGE fill:#f3e8ff,stroke:#7c3aed,color:#7c3aed
style REPORT fill:#dcfce7,stroke:#16a34a,color:#15803d
style RUBRIC fill:#dbeafe,stroke:#2563eb,color:#1d4ed8
The judge LLM should be a different model from the system under test - ideally a stronger one. A GPT-4o judge evaluating GPT-4o-mini outputs works well. Using the same model to judge itself introduces bias, because models tend to rate their own outputs favorably.
Build It: Eval Suite with Assertions and LLM-as-Judge
Eval Suite: Assertions + LLM-as-Judge
Example code (static). Copy and run locally in your own environment.
import os
import json
from dataclasses import dataclass, field

from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


# --- SYSTEM UNDER TEST ---
# Simple Q&A system that we're evaluating
def run_qa_system(question: str) -> str:
    """The AI app we're testing."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Answer concisely in 1-2 sentences."
            },
            {"role": "user", "content": question}
        ],
        max_tokens=150,
        temperature=0,
    )
    # content can be None; fall back to an empty string so string checks don't crash
    return response.choices[0].message.content or ""


# --- TEST CASES ---
@dataclass
class TestCase:
    id: str
    input: str
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)
    rubric: str = ""


TEST_CASES = [
    TestCase(
        id="tc-001",
        input="What is the capital of France?",
        must_contain=["Paris"],
        must_not_contain=["Berlin", "London", "Madrid"],
        rubric="The answer must correctly identify Paris as the capital of France.",
    ),
    TestCase(
        id="tc-002",
        input="What are the primary colors in painting?",
        must_contain=["red", "blue", "yellow"],
        rubric="The answer must name red, blue, and yellow as the three primary colors.",
    ),
    TestCase(
        id="tc-003",
        input="Explain what an API is in one sentence.",
        must_not_contain=["I cannot", "I don't know"],
        rubric=(
            "Score 1-5 on: (1) accuracy - correctly describes what an API is, "
            "(2) conciseness - fits in 1-2 sentences, "
            "(3) clarity - understandable to a non-technical reader."
        ),
    ),
]


# --- ASSERTION EVAL ---
def run_assertion_eval(tc: TestCase, output: str) -> dict:
    """Case-insensitive substring checks for required and forbidden terms."""
    failures = []
    for required in tc.must_contain:
        if required.lower() not in output.lower():
            failures.append(f"Missing required term: '{required}'")
    for forbidden in tc.must_not_contain:
        if forbidden.lower() in output.lower():
            failures.append(f"Contains forbidden term: '{forbidden}'")
    return {
        "passed": len(failures) == 0,
        "failures": failures,
    }


# --- LLM-AS-JUDGE EVAL ---
JUDGE_PROMPT = """You are an AI evaluation judge. Score the following AI response.
Question: {question}
AI Response: {response}
Rubric: {rubric}
Respond with JSON only:
{{
  "score": <integer 1-5>,
  "reasoning": "<one sentence explanation>"
}}"""


def run_llm_judge(tc: TestCase, output: str) -> dict:
    if not tc.rubric:
        return {"score": None, "reasoning": "No rubric defined"}
    prompt = JUDGE_PROMPT.format(
        question=tc.input,
        response=output,
        rubric=tc.rubric,
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # Stronger judge model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        temperature=0,
        response_format={"type": "json_object"},
    )
    try:
        return json.loads(response.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        # TypeError covers the case where the judge returned no content at all
        return {"score": None, "reasoning": "Judge returned invalid JSON"}


# --- RUN THE EVAL SUITE ---
def run_eval_suite(test_cases: list[TestCase]) -> dict:
    results = []
    total_score = 0
    scored_cases = 0  # only cases the judge actually scored count toward the average
    total_assertion_passes = 0
    print(f"Running {len(test_cases)} test cases...\n{'='*50}")
    for tc in test_cases:
        output = run_qa_system(tc.input)
        assertion_result = run_assertion_eval(tc, output)
        judge_result = run_llm_judge(tc, output)
        score = judge_result.get("score")
        if assertion_result["passed"]:
            total_assertion_passes += 1
        if isinstance(score, (int, float)):
            total_score += score
            scored_cases += 1
        result = {
            "test_id": tc.id,
            "input": tc.input,
            "output": output,
            "assertion_passed": assertion_result["passed"],
            "assertion_failures": assertion_result["failures"],
            "judge_score": score,
            "judge_reasoning": judge_result.get("reasoning"),
        }
        results.append(result)
        status = "PASS" if assertion_result["passed"] else "FAIL"
        score_str = f"{score}/5" if score is not None else "N/A"
        print(f"[{tc.id}] Assertion: {status} | Judge: {score_str}")
        for f in assertion_result["failures"]:
            print(f"  ✗ {f}")
    n = len(test_cases)
    summary = {
        "total": n,
        "assertion_pass_rate": total_assertion_passes / n if n else 0.0,
        "avg_judge_score": total_score / scored_cases if scored_cases else None,
        "results": results,
    }
    print(f"\n{'='*50}")
    print(f"Assertion pass rate: {summary['assertion_pass_rate']*100:.0f}%")
    if summary["avg_judge_score"] is not None:
        print(f"Avg judge score: {summary['avg_judge_score']:.1f}/5")
    return summary


summary = run_eval_suite(TEST_CASES)
This runs three test cases against a live AI system, checks assertions, and gets LLM-as-judge scores. In production you’d persist summary to a database and compare against your baseline pass rate.
Your eval test cases are the acceptance criteria. Write them before the feature is built, just like unit tests. The format should be: input, required properties (must contain, must not contain), and a rubric. Every acceptance criterion on the ticket should map to at least one eval test case. If it can’t be expressed as a test case, it’s not a real acceptance criterion.
Store eval results in a database - never just in logs. You need to track pass rate over model versions. When GPT-4o-mini is replaced by a new version, you need to know immediately if your pass rate dropped from 94% to 78%. Without historical data, you’re flying blind. A simple table with columns (test_id, model, timestamp, assertion_passed, judge_score) is enough to start.
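A minimal sketch of that table, using Python's built-in `sqlite3` module with an in-memory database so it runs anywhere. The table name `eval_results`, the helper names, and the model labels `model-v1` and `model-v2` are all hypothetical; a real setup would point at a persistent database.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # swap for a file path or real database in practice
conn.execute(
    """
    CREATE TABLE eval_results (
        test_id TEXT NOT NULL,
        model TEXT NOT NULL,
        timestamp TEXT NOT NULL,
        assertion_passed INTEGER NOT NULL,
        judge_score INTEGER
    )
    """
)


def record_result(test_id, model, assertion_passed, judge_score):
    """Insert one eval result with a UTC timestamp."""
    conn.execute(
        "INSERT INTO eval_results VALUES (?, ?, ?, ?, ?)",
        (test_id, model, datetime.now(timezone.utc).isoformat(),
         int(assertion_passed), judge_score),
    )


def pass_rate_by_model():
    """Return {model: assertion pass rate} across all stored results."""
    rows = conn.execute(
        "SELECT model, AVG(assertion_passed) FROM eval_results GROUP BY model"
    ).fetchall()
    return dict(rows)


record_result("tc-001", "model-v1", True, 5)
record_result("tc-002", "model-v1", True, 4)
record_result("tc-001", "model-v2", True, 4)
record_result("tc-002", "model-v2", False, 2)
rates = pass_rate_by_model()
```

Because every row carries the model name and a timestamp, a drop in pass rate after a model swap shows up as a simple grouped query rather than a log search.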
Setting Your Pass Threshold
What constitutes a passing eval suite? There’s no universal answer, but these are reasonable starting points:
- Assertion pass rate - should be 100%. Assertion failures indicate clear factual errors or format violations.
- LLM judge average - 3.5/5 or above is a sensible minimum. Below 3 suggests systematic quality problems.
- Regression threshold - if today’s score is more than 10% lower than your baseline, block the deployment regardless of absolute score.
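The three starting points above can be combined into one check. This sketch assumes the 10% regression rule is relative to the baseline judge score (one reasonable reading; the text could also mean percentage points), and `passes_thresholds` is a name invented for this example.

```python
def passes_thresholds(
    assertion_pass_rate: float,
    avg_judge_score: float,
    baseline_judge_score: float,
    min_judge_score: float = 3.5,
    max_relative_drop: float = 0.10,
) -> tuple[bool, list[str]]:
    """Return (ok, reasons). Assumes baseline_judge_score is greater than zero."""
    reasons = []
    if assertion_pass_rate < 1.0:
        reasons.append("assertion pass rate below 100%")
    if avg_judge_score < min_judge_score:
        reasons.append(f"judge average below {min_judge_score}")
    # Relative drop versus baseline (assumption: relative, not percentage points)
    drop = (baseline_judge_score - avg_judge_score) / baseline_judge_score
    if drop > max_relative_drop:
        reasons.append(f"judge average dropped {drop:.0%} from baseline")
    return (not reasons, reasons)


ok, reasons = passes_thresholds(1.0, 4.2, 4.4)
blocked, why = passes_thresholds(1.0, 3.8, 4.5)
```

The second call is blocked even though 3.8 clears the absolute minimum, because it is more than 10% below the 4.5 baseline.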
LLM-as-judge is biased toward verbose, confident-sounding responses. A response that sounds authoritative but is factually wrong often scores higher than a brief, accurate, appropriately hedged response. Calibrate your judge by running it against 50-100 human-labeled examples and measuring how often it agrees with human raters. A judge with less than 80% agreement with human raters should not be used as a deployment gate.
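Measuring judge agreement can be as simple as comparing pass/fail verdicts on a labeled set. The sketch below is illustrative: it assumes both judge and human scores are on the same 1-5 scale, treats a score of 4 or higher as a pass (a hypothetical cutoff), and uses made-up scores rather than real data.

```python
def agreement_rate(
    judge_scores: list[int],
    human_scores: list[int],
    pass_cutoff: int = 4,
) -> float:
    """Fraction of examples where judge and human agree on pass/fail."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and the same length")
    matches = sum(
        (j >= pass_cutoff) == (h >= pass_cutoff)
        for j, h in zip(judge_scores, human_scores)
    )
    return matches / len(judge_scores)


# Made-up scores for ten examples; a real calibration set would have 50-100.
judge = [5, 4, 2, 5, 3, 4, 1, 5, 4, 2]
human = [5, 4, 2, 3, 4, 4, 1, 4, 2, 2]
rate = agreement_rate(judge, human)
usable_as_gate = rate >= 0.80
```

Here the judge and the human disagree on three of ten examples, so this judge would not yet qualify as a deployment gate under the 80% rule.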
What’s Next
Evals tell you if your AI is working. But when it’s working at scale, context management becomes your next challenge. In the next tutorial you’ll learn how to manage context windows so your AI stays fast and affordable as conversations grow.
Interview Notes: Eval Harness Design
A mature eval harness records dataset version, prompt version, model version, judge version, and failure tags. Use deterministic assertions for format, schema, and forbidden behavior; use LLM judges for fuzzy quality only when calibrated against human examples.
release_gate:
  min_pass_rate: 0.92
  max_cost_regression: 0.10
  critical_failures_allowed: 0
  required_suites:
    - regression
    - prompt_injection
    - pii_redaction
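One way to enforce a gate like this is a function that returns every violation instead of stopping at the first. The sketch below mirrors the config as a Python dict to avoid a YAML dependency. The shape of the `run` dict is hypothetical, and it assumes `min_pass_rate` applies to each suite and `max_cost_regression` is a relative increase in cost per request over the baseline.

```python
RELEASE_GATE = {
    "min_pass_rate": 0.92,
    "max_cost_regression": 0.10,
    "critical_failures_allowed": 0,
    "required_suites": ["regression", "prompt_injection", "pii_redaction"],
}


def check_release(gate: dict, run: dict) -> list[str]:
    """Return a list of gate violations; an empty list means the release may proceed."""
    problems = []
    missing = [s for s in gate["required_suites"] if s not in run["suites"]]
    if missing:
        problems.append(f"missing suites: {', '.join(missing)}")
    for name, suite in run["suites"].items():
        if suite["pass_rate"] < gate["min_pass_rate"]:
            problems.append(f"{name} pass rate {suite['pass_rate']:.2f} below minimum")
    if run["critical_failures"] > gate["critical_failures_allowed"]:
        problems.append(f"{run['critical_failures']} critical failures")
    # Assumption: cost regression is the relative increase over the baseline cost
    baseline = run["baseline_cost_per_request"]
    cost_regression = (run["cost_per_request"] - baseline) / baseline
    if cost_regression > gate["max_cost_regression"]:
        problems.append(f"cost regression {cost_regression:.0%} exceeds limit")
    return problems


# Hypothetical run: one required suite was never executed.
run = {
    "suites": {
        "regression": {"pass_rate": 0.95},
        "prompt_injection": {"pass_rate": 0.93},
    },
    "critical_failures": 0,
    "cost_per_request": 0.0105,
    "baseline_cost_per_request": 0.01,
}
problems = check_release(RELEASE_GATE, run)
```

Returning the full list of problems gives the team one report per blocked release rather than a fix-and-retry loop.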
Interview Practice
- What is the difference between unit tests and evals for AI apps?
- When should you use deterministic assertions instead of LLM-as-judge?
- How do you prevent eval overfitting?
- What should be included in an eval release gate?
- How do you measure regressions in cost and latency?
- Why should production incidents become eval cases?