Evaluating Your AI Application | Praveen Srinag Yellamaraju

Why Traditional Testing Fails for AI

Standard software testing is deterministic: given input X, the correct output is always Y. You write assert output == Y and ship.

AI outputs are non-deterministic. The same input can produce different outputs on every run. Even at temperature 0, where outputs are largely repeatable, behavior changes when the model is updated - and providers update models without asking you.

This breaks every testing assumption you’re used to:

Traditional Testing	AI Testing
Exact string match	Semantic similarity check
Binary pass/fail	Scored on a rubric
Deterministic output	Probabilistic output
Runs once per commit	Runs continuously in production
Tests written once	Test cases need ongoing curation

You don’t abandon testing. You evolve it. The field calls this evals (evaluations).

The Three Eval Types

AI Eval Pyramid

flowchart TD
  H["Human Evals<br/>Gold standard, expensive, slow"]
  R["Rubric Evals<br/>LLM-as-judge, scalable, requires calibration"]
  A["Assertion Evals<br/>Fast, deterministic, limited coverage"]

  A --> R --> H

  style H fill:#f3e8ff,stroke:#7c3aed,color:#7c3aed
  style R fill:#fef3c7,stroke:#d97706,color:#b45309
  style A fill:#dcfce7,stroke:#16a34a,color:#15803d


Assertion-based evals - the fastest tier. Check exact matches, substring matches, JSON structure, or response format. These run in milliseconds and return the same verdict on every run. Example: does the response contain a valid JSON object? Does it start with “I cannot”? These catch clear failures cheaply.

Rubric-based evals (LLM-as-judge) - a second LLM grades the first LLM’s output against a defined rubric. “Is this response accurate, concise, and in the correct language? Score 1-5.” These can scale to thousands of examples but require calibration against human labels to be trusted.

Human evals - a human reviewer reads the output and judges it. The gold standard. Too expensive to run on every commit, but necessary for high-stakes decisions and for calibrating your automated evals.

In practice: Use all three. Assertion evals gate CI/CD. Rubric evals run on every deploy. Human evals run quarterly and before major model upgrades.
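The two assertion checks named above (valid JSON object, response starting with “I cannot”) can be sketched in a few lines. This is a minimal illustration, not part of the eval suite built later; the function names and the failure labels are hypothetical.

```python
import json


def is_valid_json_object(text: str) -> bool:
    """Return True if the text parses as a JSON object (a dict)."""
    try:
        return isinstance(json.loads(text), dict)
    except (json.JSONDecodeError, TypeError):
        return False


def is_refusal(text: str, prefixes: tuple[str, ...] = ("I cannot", "I can't")) -> bool:
    """Return True if the response opens with one of the refusal phrases."""
    return text.strip().startswith(prefixes)


def check_format(text: str) -> list[str]:
    """Collect a label for every failed assertion on one response."""
    failures = []
    if not is_valid_json_object(text):
        failures.append("not_json_object")
    if is_refusal(text):
        failures.append("refusal")
    return failures
```

Because these checks need no model call, they are cheap enough to run on every response in CI.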

The Eval Pipeline

Eval Pipeline: From Test Cases to Deployment Gate

flowchart LR
  TC["Test Case Library<br/>input + expected properties"]
  RUN["Run AI System<br/>generate outputs"]
  ASSERT["Assertion Evals<br/>fast checks"]
  JUDGE["LLM Judge<br/>rubric scoring"]
  DB[("Eval Database<br/>results + history")]
  GATE{"Pass threshold<br/>>= 85%?"}
  DEPLOY[Deploy]
  BLOCK["Block Deploy<br/>alert team"]

  TC --> RUN --> ASSERT --> GATE
  RUN --> JUDGE --> GATE
  ASSERT --> DB
  JUDGE --> DB
  GATE -- yes --> DEPLOY
  GATE -- no --> BLOCK

  style DEPLOY fill:#dcfce7,stroke:#16a34a,color:#15803d
  style BLOCK fill:#fef2f2,stroke:#ef4444,color:#dc2626
  style GATE fill:#fef3c7,stroke:#d97706,color:#b45309


The LLM-as-Judge Pattern

LLM-as-Judge: How It Works

flowchart TD
  INPUT[User Input] --> SUT["System Under Test<br/>AI Application"]
  SUT --> OUTPUT[AI Output]
  INPUT --> JUDGE["Judge LLM<br/>Different model"]
  OUTPUT --> JUDGE
  RUBRIC["Rubric<br/>Scoring criteria"] --> JUDGE
  JUDGE --> SCORE["Score 1-5<br/>+ reasoning"]
  SCORE --> REPORT[Eval Report]

  style JUDGE fill:#f3e8ff,stroke:#7c3aed,color:#7c3aed
  style REPORT fill:#dcfce7,stroke:#16a34a,color:#15803d
  style RUBRIC fill:#dbeafe,stroke:#2563eb,color:#1d4ed8


The judge LLM should be a different model from the system under test - ideally a stronger one. A GPT-4o judge evaluating GPT-4o-mini outputs works well. Using the same model to judge itself introduces bias.

Build It: Eval Suite with Assertions and LLM-as-Judge

Eval Suite: Assertions + LLM-as-Judge

Example code (static). Copy and run locally in your own environment.

import os
import json
from dataclasses import dataclass, field
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# --- SYSTEM UNDER TEST ---
# Simple Q&A system that we're evaluating

def run_qa_system(question: str) -> str:
  """The AI app we're testing."""
  response = client.chat.completions.create(
      model="gpt-4o-mini",
      messages=[
          {
              "role": "system",
              "content": "You are a helpful assistant. Answer concisely in 1-2 sentences."
          },
          {"role": "user", "content": question}
      ],
      max_tokens=150,
      temperature=0,
  )
  # content can be None (e.g. on a filtered response); fall back to an empty string
  return response.choices[0].message.content or ""

# --- TEST CASES ---

@dataclass
class TestCase:
  id: str
  input: str
  must_contain: list[str] = field(default_factory=list)
  must_not_contain: list[str] = field(default_factory=list)
  rubric: str = ""

TEST_CASES = [
  TestCase(
      id="tc-001",
      input="What is the capital of France?",
      must_contain=["Paris"],
      must_not_contain=["Berlin", "London", "Madrid"],
      rubric="The answer must correctly identify Paris as the capital of France.",
  ),
  TestCase(
      id="tc-002",
      input="What are the primary colors in painting?",
      must_contain=["red", "blue", "yellow"],
      rubric="The answer must name red, blue, and yellow as the three primary colors.",
  ),
  TestCase(
      id="tc-003",
      input="Explain what an API is in one sentence.",
      must_not_contain=["I cannot", "I don't know"],
      rubric=(
          "Score 1-5 on: (1) accuracy - correctly describes what an API is, "
          "(2) conciseness - fits in 1-2 sentences, "
          "(3) clarity - understandable to a non-technical reader."
      ),
  ),
]

# --- ASSERTION EVAL ---

def run_assertion_eval(tc: TestCase, output: str) -> dict:
  failures = []

  for required in tc.must_contain:
      if required.lower() not in output.lower():
          failures.append(f"Missing required term: '{required}'")

  for forbidden in tc.must_not_contain:
      if forbidden.lower() in output.lower():
          failures.append(f"Contains forbidden term: '{forbidden}'")

  return {
      "passed": len(failures) == 0,
      "failures": failures,
  }

# --- LLM-AS-JUDGE EVAL ---

JUDGE_PROMPT = """You are an AI evaluation judge. Score the following AI response.

Question: {question}
AI Response: {response}

Rubric: {rubric}

Respond with JSON only:
{{
"score": <integer 1-5>,
"reasoning": "<one sentence explanation>"
}}"""

def run_llm_judge(tc: TestCase, output: str) -> dict:
  if not tc.rubric:
      return {"score": None, "reasoning": "No rubric defined"}

  prompt = JUDGE_PROMPT.format(
      question=tc.input,
      response=output,
      rubric=tc.rubric,
  )

  response = client.chat.completions.create(
      model="gpt-4o",   # Stronger judge model
      messages=[{"role": "user", "content": prompt}],
      max_tokens=200,
      temperature=0,
      response_format={"type": "json_object"},
  )

  try:
      return json.loads(response.choices[0].message.content)
  except (json.JSONDecodeError, TypeError):
      # TypeError covers the case where the judge returned no content at all
      return {"score": None, "reasoning": "Judge returned invalid JSON"}

# --- RUN THE EVAL SUITE ---

def run_eval_suite(test_cases: list[TestCase]) -> dict:
  results = []
  total_score = 0
  total_assertion_passes = 0

  print(f"Running {len(test_cases)} test cases...\n{'='*50}")

  for tc in test_cases:
      output = run_qa_system(tc.input)
      assertion_result = run_assertion_eval(tc, output)
      judge_result = run_llm_judge(tc, output)

      if assertion_result["passed"]:
          total_assertion_passes += 1

      if judge_result.get("score"):
          total_score += judge_result["score"]

      result = {
          "test_id": tc.id,
          "input": tc.input,
          "output": output,
          "assertion_passed": assertion_result["passed"],
          "assertion_failures": assertion_result["failures"],
          "judge_score": judge_result.get("score"),
          "judge_reasoning": judge_result.get("reasoning"),
      }
      results.append(result)

      status = "PASS" if assertion_result["passed"] else "FAIL"
      score = judge_result.get("score")
      score_str = f"{score}/5" if score is not None else "N/A"
      print(f"[{tc.id}] Assertion: {status} | Judge: {score_str}")
      if assertion_result["failures"]:
          for f in assertion_result["failures"]:
              print(f"  ✗ {f}")

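  n = len(test_cases)
  # Average only over cases the judge actually scored; cases with no rubric
  # or an unparseable judge reply would otherwise drag the mean down.
  scored = sum(1 for r in results if r["judge_score"])
  summary = {
      "total": n,
      "assertion_pass_rate": total_assertion_passes / n if n else 0.0,
      "avg_judge_score": total_score / scored if scored else None,
      "results": results,
  }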

  print(f"\n{'='*50}")
  print(f"Assertion pass rate: {summary['assertion_pass_rate']*100:.0f}%")
  if summary["avg_judge_score"]:
      print(f"Avg judge score: {summary['avg_judge_score']:.1f}/5")

  return summary

summary = run_eval_suite(TEST_CASES)

This runs three test cases against a live AI system, checks assertions, and gets LLM-as-judge scores. In production you’d persist summary to a database and compare against your baseline pass rate.

🧪 For QA Engineers

Your eval test cases are the acceptance criteria. Write them before the feature is built, just like unit tests. The format should be: input, required properties (must contain, must not contain), and a rubric. Every acceptance criterion on the ticket should map to at least one eval test case. If it can’t be expressed as a test case, it’s not a real acceptance criterion.

⚙️ For Developers

Store eval results in a database - never just in logs. You need to track pass rate over model versions. When GPT-4o-mini is replaced by a new version, you need to know immediately if your pass rate dropped from 94% to 78%. Without historical data, you’re flying blind. A simple table with columns (test_id, model, timestamp, assertion_passed, judge_score) is enough to start.
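A minimal sketch of that table, assuming SQLite from the Python standard library. The table name `eval_results` and the helper functions are hypothetical; the columns are the ones listed above, and each row in `results` is assumed to carry the `test_id`, `assertion_passed`, and `judge_score` keys that the eval suite produces.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_results (
    test_id          TEXT NOT NULL,
    model            TEXT NOT NULL,
    timestamp        TEXT NOT NULL,
    assertion_passed INTEGER NOT NULL,  -- 0 or 1
    judge_score      INTEGER            -- NULL when the judge gave no score
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the eval database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_results(conn: sqlite3.Connection, model: str, results: list[dict]) -> None:
    """Insert one row per test case, stamped with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    rows = [
        (r["test_id"], model, now, int(r["assertion_passed"]), r.get("judge_score"))
        for r in results
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO eval_results VALUES (?, ?, ?, ?, ?)", rows)


def pass_rate_by_model(conn: sqlite3.Connection) -> dict[str, float]:
    """Assertion pass rate per model, for spotting drops after a model change."""
    cur = conn.execute(
        "SELECT model, AVG(assertion_passed) FROM eval_results GROUP BY model"
    )
    return dict(cur.fetchall())
```

Passing a file path instead of `":memory:"` keeps the history across runs, which is what makes version-to-version comparison possible.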

Setting Your Pass Threshold

What constitutes a passing eval suite? There’s no universal answer, but these are reasonable starting points:

Assertion pass rate - should be 100%. Assertion failures indicate clear factual errors or format violations.
LLM judge average - 3.5/5 or above is a sensible minimum. Below 3 suggests systematic quality problems.
Regression threshold - if today’s score is more than 10% below your baseline (for example, a judge average that falls from 4.0 to under 3.6), block the deployment regardless of absolute score.
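The three starting points above can be combined into one gate check. This is a sketch under stated assumptions: the function name `check_gate` is hypothetical, the regression rule is applied to the judge average, and "10% lower" is read as a relative drop from the baseline.

```python
def check_gate(
    assertion_pass_rate: float,
    avg_judge_score: float,
    baseline_judge_score: float,
    min_judge_score: float = 3.5,
    max_relative_drop: float = 0.10,
) -> list[str]:
    """Return the reasons to block a deploy; an empty list means the gate passes."""
    reasons = []
    if assertion_pass_rate < 1.0:
        reasons.append("assertion pass rate below 100%")
    if avg_judge_score < min_judge_score:
        reasons.append("judge average below minimum")
    # Assumption: the regression is measured relative to the baseline score.
    if baseline_judge_score > 0:
        drop = (baseline_judge_score - avg_judge_score) / baseline_judge_score
        if drop > max_relative_drop:
            reasons.append("regression against baseline")
    return reasons
```

Returning reasons rather than a bare boolean lets the alert say why a deploy was blocked. Note that a run can clear the absolute minimum and still be blocked by the regression rule.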

Production Gotcha

LLM-as-judge is biased toward verbose, confident-sounding responses. A response that sounds authoritative but is factually wrong often scores higher than a brief, accurate, appropriately hedged response. Calibrate your judge by running it against 50-100 human-labeled examples and measuring how often it agrees with human raters. A judge with less than 80% agreement with human raters should not be used as a deployment gate.
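Measuring judge agreement can be as simple as comparing two lists of scores. The sketch below assumes "agreement" means the judge and the human gave the same 1-5 score on an example; the `tolerance` parameter is a hypothetical knob for counting near-misses as agreement, and both function names are illustrative.

```python
def agreement_rate(human: list[int], judge: list[int], tolerance: int = 0) -> float:
    """Fraction of examples where the judge score is within `tolerance` of the human score."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(abs(h - j) <= tolerance for h, j in zip(human, judge))
    return matches / len(human)


def judge_is_trustworthy(
    human: list[int], judge: list[int], min_agreement: float = 0.80
) -> bool:
    """Apply the 80% agreement bar before using the judge as a deployment gate."""
    return agreement_rate(human, judge) >= min_agreement
```

Reading the disagreements is often more useful than the rate itself, since they tend to show where the rubric is ambiguous or where the judge rewards verbosity.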

What’s Next

Evals tell you if your AI is working. But when it’s working at scale, context management becomes your next challenge. In the next tutorial you’ll learn how to manage context windows so your AI stays fast and affordable as conversations grow.

Interview Notes: Eval Harness Design

A mature eval harness records dataset version, prompt version, model version, judge version, and failure tags. Use deterministic assertions for format, schema, and forbidden behavior; use LLM judges for fuzzy quality only when calibrated against human examples.

release_gate:
  min_pass_rate: 0.92
  max_cost_regression: 0.10
  critical_failures_allowed: 0
  required_suites:
    - regression
    - prompt_injection
    - pii_redaction
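One way such a config might be enforced is sketched below. The config is mirrored as a Python dict to avoid a YAML dependency; the shape of the `run` dict (`suites`, `cost`, `baseline_cost`, `critical_failures`) is an assumption made for illustration, as is reading `max_cost_regression` as a relative increase over the baseline cost.

```python
RELEASE_GATE = {
    "min_pass_rate": 0.92,
    "max_cost_regression": 0.10,
    "critical_failures_allowed": 0,
    "required_suites": ["regression", "prompt_injection", "pii_redaction"],
}


def evaluate_release(gate: dict, run: dict) -> list[str]:
    """Return gate violations for one eval run; an empty list means release."""
    violations = []

    # Every required suite must have been run at all.
    missing = [s for s in gate["required_suites"] if s not in run["suites"]]
    if missing:
        violations.append(f"missing suites: {', '.join(missing)}")

    # Every suite that did run must meet the minimum pass rate.
    for name, pass_rate in run["suites"].items():
        if pass_rate < gate["min_pass_rate"]:
            violations.append(f"{name} pass rate {pass_rate:.2f} below minimum")

    # Assumption: cost regression is the relative increase over the baseline.
    cost_change = (run["cost"] - run["baseline_cost"]) / run["baseline_cost"]
    if cost_change > gate["max_cost_regression"]:
        violations.append(f"cost up {cost_change:.0%}")

    if run["critical_failures"] > gate["critical_failures_allowed"]:
        violations.append("critical failures present")
    return violations
```

Treating a missing suite as a violation matters: a gate that only checks the suites that happened to run will pass silently when a suite is skipped.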

Interview Practice

What is the difference between unit tests and evals for AI apps?
When should you use deterministic assertions instead of LLM-as-judge?
How do you prevent eval overfitting?
What should be included in an eval release gate?
How do you measure regressions in cost and latency?
Why should production incidents become eval cases?
