GenAI Foundations / Beginner Track Module 6 / 9

Generating Clean Structured Data Using Schemas

Use Pydantic and JSON Schema to constrain AI output to the shape your code expects, so parsing failures and unexpected fields are caught at the boundary instead of deep inside your application.

How to Use This Lesson

  • Start with the user problem, then map the pattern to architecture and failure modes.
  • If a code or design example is included, change one assumption and reason through the impact.
  • Use role callouts, checklists, and Q&A sections as implementation or interview prep notes.

Prerequisites: 05-structured-input-output

What a Schema Does

In the previous tutorial, you learned that JSON mode guarantees valid JSON, but not the specific fields your code needs. A schema goes further: it defines the exact shape of the data - which fields exist, what types they are, which are required, and what values are allowed.

Think of it as the difference between “fill out any form” and “fill out this specific form with these specific fields.” The form constrains the space of valid responses.

Without a schema, an AI might return:

{"result": "positive", "score": "high"}

When you expected:

{"sentiment": "positive", "confidence": 0.87, "key_phrases": ["good value", "fast shipping"]}

Both are valid JSON. Only one is useful to your code.
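
To see why valid JSON alone is not enough, here is a small sketch using only the standard library. The function name read_confidence is hypothetical; it stands in for any downstream code that depends on a specific field.

```python
import json

unexpected = '{"result": "positive", "score": "high"}'
expected = (
    '{"sentiment": "positive", "confidence": 0.87, '
    '"key_phrases": ["good value", "fast shipping"]}'
)

def read_confidence(raw: str) -> float:
    """Parse a response and read the field downstream code depends on."""
    data = json.loads(raw)      # succeeds for any valid JSON
    return data["confidence"]   # fails when the shape is not the expected one

print(read_confidence(expected))  # 0.87

try:
    read_confidence(unexpected)
except KeyError as missing:
    # The parse step passed; the failure only shows up at field access.
    print(f"valid JSON, wrong shape: missing key {missing}")
```

Both strings pass json.loads. The second one only fails later, at the point where the code reaches for a field that was never there.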

Pydantic: Python’s Schema Language

Pydantic is a Python library that lets you define data shapes as classes. It validates that incoming data matches your definition and raises clear errors when it doesn’t.

Here’s a simple Pydantic model:

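from enum import Enum
from typing import List, Optional

from pydantic import BaseModel, ConfigDict

class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class SecurityIssue(BaseModel):
    # extra="forbid" rejects unknown fields and adds
    # "additionalProperties": false to the exported JSON Schema
    model_config = ConfigDict(extra="forbid")

    issue: str
    severity: Severity
    # No default: the field must be present, but its value may be null.
    # (OpenAI's strict mode expects every field to be listed as required.)
    line_number: Optional[int]
    recommended_fix: str

class CodeReviewResult(BaseModel):
    model_config = ConfigDict(extra="forbid")

    issues: List[SecurityIssue]
    overall_risk: Severity
    summary: str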

This model says: “A CodeReviewResult must have a list of issues (each with a specific shape), an overall risk level from the enum, and a summary string.” Pydantic enforces this automatically when you instantiate it.
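
To see the enforcement in action, the sketch below feeds a deliberately wrong payload to a trimmed copy of SecurityIssue (only three fields, kept short for illustration). The payload is hypothetical.

```python
from enum import Enum

from pydantic import BaseModel, ValidationError

class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class SecurityIssue(BaseModel):  # trimmed copy of the lesson's model
    issue: str
    severity: Severity
    recommended_fix: str

# "urgent" is not a Severity value, and recommended_fix is missing.
bad_payload = {"issue": "SQL built by string concatenation", "severity": "urgent"}

failed_fields = []
try:
    SecurityIssue(**bad_payload)
except ValidationError as exc:
    # Each error carries the location of the offending field.
    failed_fields = sorted(err["loc"][0] for err in exc.errors())

print(failed_fields)  # ['recommended_fix', 'severity']
```

Pydantic reports every failing field in one pass, which is what makes the errors useful for logging and retries.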

From Pydantic Model to JSON Schema

Pydantic models can export themselves as JSON Schema, the format that OpenAI's json_schema mode accepts (in strict mode, a subset of it):

schema = CodeReviewResult.model_json_schema()
print(schema)

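This outputs a complete JSON Schema definition that you can pass to the API. One caveat: with "strict": True, OpenAI accepts only a subset of JSON Schema. Every object must set additionalProperties to false and list every property under required. If your models do not already export a schema like that (for example, by using extra="forbid" and giving no field a default), adjust the schema first or the request will be rejected: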

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "code_review_result",
            "strict": True,
            "schema": CodeReviewResult.model_json_schema()
        }
    },
    messages=[...]
)
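
If an existing model was not written with strict mode in mind, one option is to post-process the exported schema before sending it. The sketch below assumes the two strict-mode rules of forbidding extra keys and requiring every property; make_strict and the Ticket model are hypothetical names, not part of Pydantic or the OpenAI SDK, and the API may impose further restrictions this helper does not cover.

```python
import copy
from typing import Any, Optional

from pydantic import BaseModel

def make_strict(schema: dict) -> dict:
    """Return a copy in which every object forbids extras and requires all properties."""
    strict = copy.deepcopy(schema)

    def visit(node: Any) -> None:
        if isinstance(node, dict):
            if node.get("type") == "object" and isinstance(node.get("properties"), dict):
                node["additionalProperties"] = False
                node["required"] = list(node["properties"])
            for child in node.values():
                visit(child)  # also reaches nested models under "$defs"
        elif isinstance(node, list):
            for child in node:
                visit(child)

    visit(strict)
    return strict

class Ticket(BaseModel):  # hypothetical model for illustration
    subject: str
    priority: Optional[int] = None

raw_schema = Ticket.model_json_schema()
strict_schema = make_strict(raw_schema)

print(raw_schema["required"])                  # ['subject']
print(strict_schema["required"])               # ['subject', 'priority']
print(strict_schema["additionalProperties"])   # False
```

Marking an optional field as required does not make it non-nullable: priority still accepts null, so the model must emit the key but may leave the value empty.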

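With strict mode on, the returned content should match your schema whenever the model produces a complete answer. Two cases still need handling: the model can refuse (the message carries a refusal instead of content), and the output can be cut off by a token limit (finish_reason is "length"). Otherwise, parse and validate in one step: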

raw = response.choices[0].message.content
result = CodeReviewResult.model_validate_json(raw)
# result is now a typed Python object, not a dict
print(result.overall_risk)            # e.g. Severity.HIGH
if result.issues:                     # the list may be empty
    print(result.issues[0].severity)  # e.g. Severity.CRITICAL
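
To try the parsing step without calling the API, you can feed model_validate_json a hand-written JSON string. The response below is a hypothetical example, and the models are trimmed copies of the ones defined earlier.

```python
from enum import Enum
from typing import List

from pydantic import BaseModel

class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class SecurityIssue(BaseModel):  # trimmed copy for illustration
    issue: str
    severity: Severity
    recommended_fix: str

class CodeReviewResult(BaseModel):
    issues: List[SecurityIssue]
    overall_risk: Severity
    summary: str

# Hypothetical model response
raw = """
{
  "issues": [
    {"issue": "Hard-coded API key", "severity": "critical",
     "recommended_fix": "Load the key from an environment variable"}
  ],
  "overall_risk": "high",
  "summary": "One critical finding."
}
"""

result = CodeReviewResult.model_validate_json(raw)
print(result.overall_risk is Severity.HIGH)          # True
print(isinstance(result.issues[0], SecurityIssue))   # True
```

The strings "high" and "critical" arrive as plain JSON text and come out as enum members, and each nested issue becomes its own typed object.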

The Schema-Driven Workflow

Schema-Driven AI Output Validation

flowchart LR
  PM[Pydantic Model<br/>Define the shape] --> JS[JSON Schema<br/>.model_json_schema]
  JS --> API[OpenAI API<br/>json_schema mode]
  API --> RAW[Raw JSON<br/>from model]
  RAW --> VAL[model_validate_json<br/>Pydantic parses + validates]
  VAL --> OBJ[Typed Python Object<br/>ready to use]

  PM -.->|"also used for"| TESTS[Unit Tests<br/>assert structure]

  style PM fill:#dbeafe,stroke:#2563eb,color:#1d4ed8
  style OBJ fill:#dcfce7,stroke:#16a34a,color:#15803d
  style TESTS fill:#fef3c7,stroke:#d97706,color:#b45309

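The Pydantic model is your single source of truth. It defines the contract between the AI model and your business logic. Change the model once, and both the exported schema and the validation step pick up the change; only code that reads a renamed or removed field needs a matching update.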

Complete Example: Invoice Data Extraction

Here’s a realistic use case: extracting structured invoice data from unstructured text (email bodies, scanned documents, pasted text).

Extract Invoice Fields from Unstructured Text

Example code (static). Copy and run locally in your own environment.

import os
from typing import List, Optional

from openai import OpenAI
from pydantic import BaseModel, ConfigDict

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Define the output schema with Pydantic.
# Strict mode expects "additionalProperties": false on every object and every
# field listed as required, so the models forbid extras and set no defaults.
class LineItem(BaseModel):
  model_config = ConfigDict(extra="forbid")

  description: str
  quantity: float
  unit_price: float
  total: float

class InvoiceData(BaseModel):
  model_config = ConfigDict(extra="forbid")

  invoice_number: Optional[str]  # null when the text has no invoice number
  vendor_name: str
  invoice_date: Optional[str]
  due_date: Optional[str]
  line_items: List[LineItem]
  subtotal: float
  tax_amount: float
  total_amount: float
  currency: str                  # currency code such as "USD"
  notes: Optional[str]

# Unstructured text (e.g., from an email or PDF)
raw_invoice_text = """
Hi team,

Please process the following invoice from Acme Software Solutions:

Invoice #INV-2024-0847 dated March 15, 2024 (due April 14, 2024).

Services rendered:
- Cloud infrastructure setup: 8 hours at $150/hr = $1,200
- API integration development: 12 hours at $175/hr = $2,100
- Documentation and training: 4 hours at $125/hr = $500

Subtotal: $3,800
Tax (8.5%): $323
Total Due: $4,123

Please remit in USD. Questions to billing@acme-software.com.
"""

# Ask the model to extract structured data
response = client.chat.completions.create(
  model="gpt-4o-mini",
  response_format={
      "type": "json_schema",
      "json_schema": {
          "name": "invoice_data",
          "strict": True,
          "schema": InvoiceData.model_json_schema()
      }
  },
  messages=[
      {
          "role": "system",
          "content": "You are an invoice data extraction specialist. Extract all invoice fields accurately."
      },
      {
          "role": "user",
          "content": f"Extract all invoice data from this text:\n\n{raw_invoice_text}"
      }
  ],
  max_tokens=800
)

# Parse and validate with Pydantic
invoice = InvoiceData.model_validate_json(
  response.choices[0].message.content
)

# Now use as a typed Python object
print(f"Vendor: {invoice.vendor_name}")
print(f"Invoice #: {invoice.invoice_number}")
print(f"Total: {invoice.currency} {invoice.total_amount:.2f}")
print(f"Line items: {len(invoice.line_items)}")
for item in invoice.line_items:
  print(f"  - {item.description}: {item.total:.2f}")

JSON Schema Basics (When You’re Not Using Pydantic)

If you’re working in a language other than Python, or prefer to write schemas by hand, here’s the essential JSON Schema vocabulary:

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer", "minimum": 0 },
    "status": { "type": "string", "enum": ["active", "inactive", "pending"] },
    "tags": { "type": "array", "items": { "type": "string" } },
    "address": {
      "type": "object",
      "properties": {
        "city": { "type": "string" },
        "country": { "type": "string" }
      },
      "required": ["city", "country"]
    }
  },
  "required": ["name", "age", "status"],
  "additionalProperties": false
}

Key fields: type, properties, required, enum (allowed values), minimum/maximum (for numbers), items (for arrays), additionalProperties: false (reject unexpected fields).
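
To make the vocabulary concrete, here is a teaching sketch of a checker that understands only the keywords listed above. It is not a replacement for a real JSON Schema validator; the check function and the sample records are hypothetical, and the schema is a shortened version of the one shown.

```python
JSON_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
}

def check(value, schema, path="$"):
    """Return a list of problems for a small subset of JSON Schema keywords."""
    expected = schema.get("type")
    if expected:
        ok = isinstance(value, JSON_TYPES[expected])
        # bool is a subclass of int in Python, so reject it for numeric types
        if expected in ("integer", "number") and isinstance(value, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}"]
    problems = []
    if "enum" in schema and value not in schema["enum"]:
        problems.append(f"{path}: {value!r} not in enum")
    if "minimum" in schema and isinstance(value, (int, float)) and value < schema["minimum"]:
        problems.append(f"{path}: below minimum {schema['minimum']}")
    if isinstance(value, list) and "items" in schema:
        for i, item in enumerate(value):
            problems += check(item, schema["items"], f"{path}[{i}]")
    if isinstance(value, dict):
        props = schema.get("properties", {})
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}: missing required '{key}'")
        for key, item in value.items():
            if key in props:
                problems += check(item, props[key], f"{path}.{key}")
            elif schema.get("additionalProperties") is False:
                problems.append(f"{path}: unexpected field '{key}'")
    return problems

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "status": {"type": "string", "enum": ["active", "inactive", "pending"]},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "age", "status"],
    "additionalProperties": False,
}

good = {"name": "Ada", "age": 36, "status": "active", "tags": ["admin"]}
bad = {"name": "Ada", "age": -1, "status": "archived", "nickname": "A"}

print(check(good, schema))  # []
for problem in check(bad, schema):
    print(problem)
```

The second record trips three different keywords at once: minimum, enum, and additionalProperties.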

Production Gotcha: Schema Validates Structure, Not Semantics

LLMs can still hallucinate values that pass schema validation. An email field with type string will pass even if the model invents a plausible-looking but nonexistent email address. An amount field typed as number will pass even if the number is wrong.

Schema validation catches structural problems (missing fields, wrong types, invalid enums). It does not catch semantic problems (wrong values that are structurally valid). You need separate business logic to validate semantic correctness - for example, checking that extracted totals match the sum of line items.
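
The totals check mentioned above could look like the following sketch. InvoiceTotals is a trimmed, hypothetical subset of the invoice model, find_total_mismatches is an illustrative name, and the one-cent tolerance is an assumption you would tune for your own data.

```python
from typing import List

from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float

class InvoiceTotals(BaseModel):  # trimmed, hypothetical subset of InvoiceData
    line_items: List[LineItem]
    subtotal: float
    tax_amount: float
    total_amount: float

def find_total_mismatches(invoice: InvoiceTotals, tolerance: float = 0.01) -> List[str]:
    """Return human-readable problems for amounts that do not add up."""
    problems = []
    for item in invoice.line_items:
        if abs(item.quantity * item.unit_price - item.total) > tolerance:
            problems.append(f"line item '{item.description}' != quantity x unit_price")
    items_sum = sum(item.total for item in invoice.line_items)
    if abs(items_sum - invoice.subtotal) > tolerance:
        problems.append("subtotal != sum of line items")
    if abs(invoice.subtotal + invoice.tax_amount - invoice.total_amount) > tolerance:
        problems.append("total_amount != subtotal + tax_amount")
    return problems

consistent = InvoiceTotals(
    line_items=[
        LineItem(description="Cloud infrastructure setup", quantity=8, unit_price=150, total=1200),
        LineItem(description="API integration development", quantity=12, unit_price=175, total=2100),
        LineItem(description="Documentation and training", quantity=4, unit_price=125, total=500),
    ],
    subtotal=3800,
    tax_amount=323,
    total_amount=4123,
)
print(find_total_mismatches(consistent))  # []

# Simulate a transposed digit: structurally valid, semantically wrong.
transposed = consistent.model_copy(update={"total_amount": 4213})
print(find_total_mismatches(transposed))  # ['total_amount != subtotal + tax_amount']
```

Both objects pass schema validation; only the arithmetic check tells them apart.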

Handling Validation Failures

What happens when Pydantic validation fails? You get a clear, structured error:

from pydantic import ValidationError

# raw_response is the JSON string returned by the model,
# e.g. response.choices[0].message.content
try:
    invoice = InvoiceData.model_validate_json(raw_response)
except ValidationError as e:
    print(f"AI response failed validation: {e}")
    # Log the raw response for debugging
    # Retry the API call, or fall back to manual review
    # Never silently swallow this error

In production, validation failures should trigger alerts. They indicate either a prompt that needs refinement or a model behavior change that needs investigation.
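
One way to combine logging, retries, and a fallback is sketched below. The model call is passed in as a plain function so the loop can be exercised without network access; validated_call, the Summary schema, and the canned responses are all hypothetical.

```python
import logging
from typing import Callable, Optional

from pydantic import BaseModel, ValidationError

logger = logging.getLogger("ai_validation")

class Summary(BaseModel):  # hypothetical schema for illustration
    title: str
    word_count: int

def validated_call(call_model: Callable[[], str], max_attempts: int = 3) -> Optional[Summary]:
    """Call the model until its output validates, logging every rejected response."""
    for attempt in range(1, max_attempts + 1):
        raw = call_model()
        try:
            return Summary.model_validate_json(raw)
        except ValidationError as exc:
            # Keep the raw text: it is the main evidence when debugging later.
            logger.warning("attempt %d failed validation: %s | raw=%r",
                           attempt, exc.errors(), raw)
    return None  # the caller routes this to a fallback or manual review

# Stand-in for a real API call: the first response is missing a field.
fake_responses = iter([
    '{"title": "Q3 report"}',
    '{"title": "Q3 report", "word_count": 412}',
])
result = validated_call(lambda: next(fake_responses))
print(result.word_count)  # 412
```

Returning None instead of raising is a design choice for this sketch; raising a dedicated exception works equally well as long as the failure is not silently dropped.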

🧪 For QA Engineers

What this means for testing: Schema violations are the most testable kind of AI failure - they’re binary pass/fail. Write test cases that assert your Pydantic model instantiates successfully from the AI response. For edge cases (unusual inputs, long documents, non-English text), run the full pipeline and assert the output validates. This gives you deterministic, automatable test assertions instead of “does this look right?” manual checks.
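
A minimal sketch of such test cases follows. It uses plain functions with assert statements so a runner such as pytest can collect them (the choice of runner is an assumption), and SentimentResult is a hypothetical schema based on the sentiment example from earlier in the lesson.

```python
from typing import List

from pydantic import BaseModel, ValidationError

class SentimentResult(BaseModel):  # hypothetical schema for illustration
    sentiment: str
    confidence: float
    key_phrases: List[str]

# In a real suite this string would come from a recorded or live model response.
RECORDED_RESPONSE = (
    '{"sentiment": "positive", "confidence": 0.87, '
    '"key_phrases": ["good value", "fast shipping"]}'
)

def test_recorded_response_validates():
    result = SentimentResult.model_validate_json(RECORDED_RESPONSE)
    assert 0.0 <= result.confidence <= 1.0

def test_wrong_shape_is_rejected():
    try:
        SentimentResult.model_validate_json('{"result": "positive", "score": "high"}')
    except ValidationError:
        return  # expected: the shape does not match the schema
    raise AssertionError("expected ValidationError for the wrong shape")
```

The first test is the binary pass/fail check; the second guards against a schema that has accidentally become too permissive.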

⚙️ For Developers

What this means for your code: model_validate_json() is your friend - it parses and validates in a single call, raising ValidationError with field-level detail when something is wrong. Never use json.loads() followed by manual key access for AI responses you’ve invested in schema-defining. Also: store your Pydantic models in a dedicated schemas/ module. They are your API contracts, and they should be versioned and imported by both the AI call layer and the downstream data layer.

What’s Next

Your prompts now produce reliably structured output. The next challenge is scale: different users need different prompts, different contexts need different instructions, and hard-coding every variation doesn’t work. That’s where prompt templates come in.

Quick Checklist

Before shipping any AI feature that produces data your code consumes:

  1. Define a Pydantic model for the output.
  2. Use model_json_schema() with the API's json_schema mode.
  3. Parse with model_validate_json().
  4. Add a try/except for ValidationError with proper logging.
  5. Write at least one test that asserts the schema validates successfully.

Interview Notes: Pydantic v2 and Discriminated Unions

For Python apps, Pydantic v2 is a common way to validate model output after JSON parsing. Use validators for business constraints and discriminated unions when responses can take several shapes.

from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, field_validator

class RefundAction(BaseModel):
    kind: Literal["refund"]
    invoice_id: str
    amount_usd: float

    @field_validator("amount_usd")
    @classmethod
    def amount_must_be_positive(cls, value: float) -> float:
        if value <= 0:
            raise ValueError("refund amount must be positive")
        return value

class EscalateAction(BaseModel):
    kind: Literal["escalate"]
    reason: str
    team: Literal["billing", "support", "risk"]

NextAction = Annotated[Union[RefundAction, EscalateAction], Field(discriminator="kind")]

Libraries such as Instructor wrap provider calls so responses are parsed directly into Pydantic models, but validation failures still need retries, fallbacks, or human review.
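
The NextAction union above is a type alias rather than a model, so it has no model_validate_json method of its own. One way to parse into it is Pydantic's TypeAdapter, sketched here with hypothetical JSON payloads (the amount_usd validator is omitted to keep the example short).

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError

class RefundAction(BaseModel):
    kind: Literal["refund"]
    invoice_id: str
    amount_usd: float

class EscalateAction(BaseModel):
    kind: Literal["escalate"]
    reason: str
    team: Literal["billing", "support", "risk"]

NextAction = Annotated[Union[RefundAction, EscalateAction], Field(discriminator="kind")]
adapter = TypeAdapter(NextAction)

# The "kind" field selects which model is used for validation.
action = adapter.validate_json(
    '{"kind": "escalate", "reason": "chargeback dispute", "team": "risk"}'
)
print(type(action).__name__)  # EscalateAction

unknown_kind_rejected = False
try:
    adapter.validate_json('{"kind": "cancel"}')
except ValidationError as exc:
    unknown_kind_rejected = True
    print(exc.errors()[0]["msg"])
```

Because the discriminator picks one branch up front, the error for an unknown kind names the tag problem directly instead of listing failures for every member of the union.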

Interview Practice

  1. Why are schemas useful for LLM output?
  2. How do Pydantic validators differ from type annotations?
  3. When would you use a discriminated union?
  4. What are common causes of schema validation failure?
  5. How does Instructor-style parsing change the retry flow?
  6. Why should business rules live in validators or code instead of prompts only?