FDE System Design Starter Scenarios

Practice explaining AI-adjacent systems to technical and non-technical stakeholders.

How to Use This Lesson

Start with the user problem, then map the pattern to architecture and failure modes.
If a code or design example is included, change one assumption and reason through the impact.
Use role callouts, checklists, and Q&A sections as implementation or interview prep notes.

Forward deployed engineering interviews reward two abilities at once: designing the system, and explaining its trade-offs to customers, product leaders, security reviewers, and infrastructure engineers.

The SCARE Framework

Use SCARE to structure open-ended prompts:

Step	What to say
Scope	Users, workflows, non-goals, compliance boundaries
Capacity	Back-of-envelope reads, writes, storage, latency, peak load
Architecture	APIs, services, data stores, queues, caches, model calls
Reliability	Failure modes, retries, observability, SLOs, fallback
Evaluation	safety, cost, quality, human review, launch plan

Working through the steps in order keeps you from jumping straight to “use Kafka” or “put Redis in front” before the user problem is clear.

Scenario 1: API Rate Limiter

User promise: legitimate customers can use the API within their quota; abusive or runaway clients are throttled quickly and fairly.

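Start with capacity. If one tenant is allowed 1,000 requests per minute and the platform has 10,000 active tenants, the limiter must handle up to 10 million counter updates per minute (roughly 167,000 per second) when every tenant uses its full quota. A per-process in-memory map cannot enforce these quotas, because each tenant's traffic is spread across many API servers and no single server sees the full count.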
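The capacity estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the worst case, where every tenant consumes its full quota at the same time; the function name is illustrative.

```python
# Back-of-envelope capacity for the rate limiter, using the figures from the text.
REQUESTS_PER_TENANT_PER_MINUTE = 1_000
ACTIVE_TENANTS = 10_000


def peak_counter_updates(per_tenant_per_minute: int, tenants: int) -> tuple[int, float]:
    """Return (updates per minute, updates per second) at full quota usage."""
    per_minute = per_tenant_per_minute * tenants
    return per_minute, per_minute / 60


per_minute, per_second = peak_counter_updates(
    REQUESTS_PER_TENANT_PER_MINUTE, ACTIVE_TENANTS
)
print(f"{per_minute:,} updates/min, about {per_second:,.0f} updates/s")
# prints: 10,000,000 updates/min, about 166,667 updates/s
```

Real traffic is usually well below this ceiling, so in an interview it is worth stating the peak and then a more realistic utilization assumption.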

Architecture: an L7 gateway authenticates requests, extracts tenant ID and route, checks a shared rate limiter service backed by Redis, and either forwards the request or returns 429 Too Many Requests with Retry-After.

Algorithms:

Fixed window: simple but allows bursts at window boundaries.
Sliding log: accurate but memory-heavy.
Sliding window counter: good balance for most APIs.
Token bucket: allows controlled bursts while enforcing average rate.
Leaky bucket: smooths traffic at a steady drain rate.
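Of the algorithms listed above, token bucket is the one most often asked about in detail. The following is a minimal single-process sketch, not a production implementation: the class name, the `allow` signature, and passing the clock in as `now` are assumptions made to keep the example deterministic. A shared limiter would keep the same state in a store such as Redis and update it atomically.

```python
class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, enforces `refill_rate` on average."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity        # maximum burst size, in tokens
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full so new clients can burst
        self.last_refill = now          # timestamp of the last refill, in seconds

    def allow(self, now: float, cost: float = 1.0) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for a request arriving at `now`."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = max(self.last_refill, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0.0
        # Time until enough tokens accumulate; usable as a Retry-After value.
        return False, (cost - self.tokens) / self.refill_rate


bucket = TokenBucket(capacity=5, refill_rate=1.0)
burst = [bucket.allow(now=0.0)[0] for _ in range(6)]
print(burst)  # five requests pass as a burst, the sixth is throttled
```

The `retry_after` value is what a gateway could place in the Retry-After header alongside a 429 response.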

Use token bucket for customer-facing API quotas and sliding window counters for abuse detection. If Redis is unavailable, fail open for low-risk read endpoints with local emergency limits, and fail closed for expensive model endpoints if cost exposure is high.
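The fail-open versus fail-closed policy above can be expressed as a small decision function. This sketch is illustrative: `LimiterUnavailable`, `admit`, and the two callables are hypothetical names standing in for a shared limiter client and a local emergency limiter.

```python
from typing import Callable


class LimiterUnavailable(Exception):
    """Raised by the (hypothetical) shared limiter client when its store is unreachable."""


def admit(
    high_cost: bool,
    shared_check: Callable[[], bool],
    local_check: Callable[[], bool],
) -> bool:
    """Decide whether to forward a request, degrading by endpoint risk."""
    try:
        return shared_check()
    except LimiterUnavailable:
        if high_cost:
            # Fail closed: expensive model endpoints are rejected to cap cost exposure.
            return False
        # Fail open, but only within a coarse local emergency limit.
        return local_check()


def unreachable() -> bool:
    raise LimiterUnavailable("shared limiter store is down")


print(admit(high_cost=False, shared_check=unreachable, local_check=lambda: True))
print(admit(high_cost=True, shared_check=unreachable, local_check=lambda: True))
```

Keeping the policy in one function makes it easy to review with a customer: the only inputs are the endpoint's cost class and the health of the shared limiter.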

Scenario 2: Compliance Document Ingestion

User promise: compliance teams can ask grounded questions over controlled documents and see citations and audit history.

Architecture: upload service stores PDFs in object storage, metadata in PostgreSQL, and ingestion jobs in a queue. Workers extract text, chunk by section, embed chunks, store vectors, and write audit records. The query path checks permissions, retrieves relevant chunks with hybrid search, assembles context, calls the model, and returns citations.
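The ordering in the query path matters: permissions are checked before retrieval, so restricted text never reaches the model context. The sketch below illustrates that ordering only. The `Chunk` fields are hypothetical, and simple keyword overlap stands in for the hybrid search described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    section: str
    text: str
    allowed_groups: frozenset  # groups permitted to read this chunk


def retrieve(query: str, chunks: list, user_groups: frozenset, top_k: int = 3) -> list:
    """Filter by permission first, then rank the visible chunks."""
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    terms = set(query.lower().split())
    # Keyword overlap is a placeholder score, not real hybrid search.
    scored = [(len(terms & set(c.text.lower().split())), c) for c in visible]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [c for _, c in ranked[:top_k]]


def citations(chunks: list) -> list:
    """Build citation labels so every answer can point back to its sources."""
    return [f"{c.doc_id}#{c.section}" for c in chunks]
```

Filtering after ranking would also work functionally, but filtering first avoids scoring and logging content the user is not allowed to see.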

Reliability: ingestion is asynchronous and retryable with idempotency keys per document version. Human review is required when confidence is low or the action is irreversible.
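Idempotency keys per document version can be sketched as follows. Two assumptions are made here: a version is identified by a hash of the file contents, and an in-memory set stands in for a durable store (for example, a table with a unique constraint). The names `idempotency_key` and `IngestionLedger` are illustrative.

```python
import hashlib
from typing import Callable


def idempotency_key(document_id: str, content: bytes) -> str:
    """Derive a key that changes only when the document's content changes."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{document_id}:{digest}"


class IngestionLedger:
    """Tracks completed ingestion jobs so retries do not repeat finished work."""

    def __init__(self) -> None:
        self._completed: set = set()

    def run_once(self, key: str, job: Callable[[], None]) -> bool:
        """Run `job` unless `key` already completed. Return True if the job ran."""
        if key in self._completed:
            return False
        job()  # if this raises, the key is not recorded and the retry runs again
        self._completed.add(key)
        return True
```

A retried queue message for the same version becomes a no-op, while a new upload of the same document with changed content produces a new key and is ingested again.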

Scenario 3: Multi-Tenant LLM Serving

User promise: each customer gets predictable latency, correct isolation, and transparent cost attribution.

Architecture: gateway authenticates tenants and applies quotas. A scheduler routes requests by model, tenant tier, region, and context length. Inference workers batch compatible requests. Usage events stream to billing and observability.
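To make "batch compatible requests" concrete, the sketch below groups requests by model and context-length bucket, then splits each group into batches of bounded size. The bucket edges, the `Request` fields, and the batch size are assumptions for illustration; real schedulers also weigh tenant tier, region, and queueing deadlines.

```python
from bisect import bisect_left
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical context-length bucket edges, in prompt tokens.
BUCKET_EDGES = (1_024, 8_192, 32_768)


@dataclass(frozen=True)
class Request:
    request_id: str
    tenant_id: str
    model: str
    prompt_tokens: int


def context_bucket(prompt_tokens: int) -> int:
    """Return the index of the first bucket whose edge is >= prompt_tokens."""
    return bisect_left(BUCKET_EDGES, prompt_tokens)


def make_batches(requests: list, max_batch_size: int = 8) -> list:
    """Group requests that share a model and context bucket into bounded batches."""
    groups = defaultdict(list)
    for req in requests:
        groups[(req.model, context_bucket(req.prompt_tokens))].append(req)
    batches = []
    for group in groups.values():
        for start in range(0, len(group), max_batch_size):
            batches.append(group[start:start + max_batch_size])
    return batches
```

Each request keeps its `tenant_id` inside the batch, so usage events can still be attributed per tenant after batched inference.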

Isolation choices: separate API keys are not enough. Use tenant-scoped storage, tenant IDs in every metric and log, per-tenant rate limits, and optionally dedicated model pools for regulated customers.

Scenario 4: Safe Moderation Pipeline

User promise: unsafe content is blocked or escalated without making the product unusable.

Architecture: input classifier, policy engine, model call, output classifier, audit log, appeal queue, and human review console. Measure false positives, false negatives, appeal outcomes, and classifier latency.
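The metrics named above can be computed from a labeled sample of moderation decisions. This is a minimal sketch; the function names and the convention that "positive" means "flagged as unsafe" are assumptions.

```python
def moderation_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute error rates from a confusion matrix where positive = flagged unsafe.

    tp: unsafe content correctly blocked    fp: safe content wrongly blocked
    tn: safe content correctly allowed      fn: unsafe content wrongly allowed
    """
    safe_total = fp + tn
    unsafe_total = tp + fn
    return {
        # Share of safe content that users saw blocked (hurts usability).
        "false_positive_rate": fp / safe_total if safe_total else 0.0,
        # Share of unsafe content that slipped through (hurts safety).
        "false_negative_rate": fn / unsafe_total if unsafe_total else 0.0,
    }


def appeal_overturn_rate(appeals: int, overturned: int) -> float:
    """Share of appealed decisions that human reviewers reversed."""
    return overturned / appeals if appeals else 0.0
```

Reporting both rates together matters: lowering false negatives by tightening thresholds usually raises false positives, and the appeal overturn rate shows whether that cost is landing on real users.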

Safety is not a final filter bolted on at the end. It belongs in input validation, retrieval permissions, tool authorization, output review, logging, and launch monitoring.

Communication Signals

Strong FDE answers include:

L4 versus L7 load balancing trade-offs.
CAP trade-offs (consistency versus availability during a network partition) explained in product language.
Observability for customer-facing incidents.
Cost and latency estimates for model calls.
Security boundaries, token scopes, and audit trails.
A migration path from prototype to production.

Interview Practice

Design a rate limiter for a public LLM API with free and enterprise tiers.
Which rate limiter algorithm would you choose for bursty customers, and why?
How would you explain L4 versus L7 load balancing to a non-infra stakeholder?
Design a compliance ingestion pipeline with citations and human review.
What isolation controls are required for multi-tenant LLM serving?
What metrics prove a moderation pipeline is working?
Where would you fail open versus fail closed in a customer deployment?
How would you turn a prototype RAG demo into a production launch plan?