GenAI Foundations / Advanced Track Module 11 / 15
GenAI Foundations Advanced ⏱ 40 min
DEV · QA · BA · PM

Context and Memory Engineering for Enterprise Agents

Design memory layers and context budgets that improve quality and lower cost in long-running enterprise workflows.

How to Use This Lesson

  • Start with the user problem, then map the pattern to architecture and failure modes.
  • If a code or design example is included, change one assumption and reason through the impact.
  • Use role callouts, checklists, and Q&A sections as implementation or interview prep notes.

Prerequisites: intermediate/05-context-window-management, intermediate/06-memory-patterns

Memory Is a System, Not a Feature

Enterprise agents need several memory layers with different trust, cost, and retention rules. Dumping every conversation into the next prompt is expensive, risky, and often lower quality than deliberate context assembly.

Enterprise Context Assembly

flowchart LR
U[Current User Turn] --> A[Context Assembler]
S[System and Policy Blocks] --> A
M[Short-Term Session Memory] --> A
E[Entity Memory] --> A
R[RAG Retrieval] --> A
T[Recent Tool Results] --> A
A --> B[Budget and Trust Ranking]
B --> L[LLM Request]

Memory Types

| Layer | Purpose | Retention | Risk |
| --- | --- | --- | --- |
| Session buffer | Recent turns | Minutes to days | Token bloat |
| Summary memory | Compact conversation state | Days to months | Summary drift |
| Entity memory | Stable facts about users/accounts | Policy-defined | Privacy and stale data |
| Episodic memory | Past task outcomes | Policy-defined | Wrong transfer to new context |
| RAG retrieval | External knowledge | Source lifecycle | Prompt injection from content |
| Tool-result cache | Avoid repeated calls | Short TTL | Stale operational state |
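
The tool-result cache row can be sketched as a small time-to-live cache. This is a minimal illustration rather than a production cache: `TTLCache`, its `ttl_seconds` parameter, and the injectable `clock` are hypothetical names chosen for this example.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Tiny tool-result cache: entries expire after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: dict[str, tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        # Record the value together with the time it was stored.
        self._entries[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            # Stale operational state: drop it so the caller makes a fresh tool call.
            del self._entries[key]
            return None
        return value
```

A short TTL trades a few extra tool calls for a lower chance of acting on outdated operational state, which is the risk the table names for this layer.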

Context Budgeting

Context windows are large but not free. Good systems reserve budget before adding optional content.

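from dataclasses import dataclass

@dataclass
class ContextBlock:
    name: str
    text: str
    tokens: int    # token count of `text`, computed before assembly
    priority: int  # higher = more important to include
    trust: int     # higher = more trusted source; breaks priority ties

BUDGET = 32_000  # total tokens allowed per request
RESERVED = {     # set aside before any optional block is considered
    "system": 2_000,
    "response": 2_000,
    "tool_schemas": 4_000,
    "safety_margin": 1_000,
}

def assemble_context(blocks: list[ContextBlock]) -> list[ContextBlock]:
    # Tokens left for optional content after reservations.
    available = BUDGET - sum(RESERVED.values())
    selected: list[ContextBlock] = []

    # Greedy selection: highest priority first, trust as the tie-breaker.
    for block in sorted(blocks, key=lambda b: (b.priority, b.trust), reverse=True):
        # A block that does not fit is skipped; smaller, lower-ranked blocks may still fit.
        if block.tokens <= available:
            selected.append(block)
            available -= block.tokens

    return selected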

Budgeting is also a quality tool. High-trust policy and recent task state should beat low-trust old memories even if the old memories are semantically similar.
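
The `tokens` value on each block has to be computed before assembly. The sketch below assumes no tokenizer is available: the four-characters-per-token ratio is a common rule of thumb for English text, not a property of any specific model, and `estimate_tokens` and `CHARS_PER_TOKEN` are hypothetical names. Production code should use the tokenizer that matches the target model.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough heuristic for English text


def estimate_tokens(text: str) -> int:
    """Return a rounded-up token estimate for text."""
    if not text:
        return 0
    return math.ceil(len(text) / CHARS_PER_TOKEN)


# Budget arithmetic from the lesson: 32,000 total minus 9,000 reserved.
budget = 32_000
reserved = {"system": 2_000, "response": 2_000, "tool_schemas": 4_000, "safety_margin": 1_000}
available = budget - sum(reserved.values())  # 23,000 tokens left for optional content
```

Rounding up keeps the estimate from undercounting short blocks, and the reserved safety margin absorbs some of the remaining estimation error.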

Prompt Caching and Stable Prefixes

Many providers can cache repeated prompt prefixes. Put stable content first: system policy, tool instructions, rubrics, and long static reference blocks. Put volatile user turns and retrieved chunks later.

[stable] system policy
[stable] tool usage rules
[stable] output contract
[semi-stable] account profile summary
[volatile] current task
[volatile] retrieved documents
[volatile] recent tool results

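This reduces latency and cost without changing the model or its output quality. Prefix caches typically require the cached portion to match exactly, so a single edit near the front of the prompt can invalidate everything after it. A stable prefix also makes traces easier to compare because the front of the prompt is identical across runs.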
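
One way to enforce this ordering is to tag each section with a stability tier and sort on it. The sketch below is an illustration under assumptions: `PromptSection`, `order_for_caching`, and `stable_prefix_hash` are hypothetical names, and the hash is only a debugging aid for spotting accidental changes to the stable tier, not part of any provider's caching API.

```python
import hashlib
from dataclasses import dataclass

# Lower rank = earlier in the prompt. Tier names mirror the layout above.
STABILITY_RANK = {"stable": 0, "semi-stable": 1, "volatile": 2}


@dataclass(frozen=True)
class PromptSection:
    name: str
    stability: str  # one of the STABILITY_RANK keys
    text: str


def order_for_caching(sections: list[PromptSection]) -> list[PromptSection]:
    """Stable sort: tiers are ordered, original order is kept within a tier."""
    return sorted(sections, key=lambda s: STABILITY_RANK[s.stability])


def stable_prefix_hash(sections: list[PromptSection]) -> str:
    """Fingerprint of the stable tier; a change between runs suggests cache misses."""
    ordered = order_for_caching(sections)
    prefix = "\n".join(s.text for s in ordered if s.stability == "stable")
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()
```

Logging the fingerprint with each request makes it easy to see when a deployment changed the stable tier and why cache hit rates dropped.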

Compression Strategies

| Strategy | Use when | Failure mode |
| --- | --- | --- |
| Summarization | Long chat history | Drops important details |
| Extractive memory | Need exact user preferences | Misses implicit facts |
| Prompt compression | Need shorter context fast | Can remove safety constraints if careless |
| Retrieval re-ranking | Many candidate chunks | Slow if reranker is expensive |
| Hierarchical summaries | Long-running projects | Summary drift across generations |

Never compress policy, authz, or tool safety instructions with the same lossy summarizer used for chat history.
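
The rule above can be enforced in code rather than left to convention. A minimal sketch, assuming each block carries a kind label: `PROTECTED_KINDS`, `compress_context`, and the label names are hypothetical, and `summarize` stands in for whatever lossy compressor is used for chat history.

```python
from typing import Callable

# Assumption: illustrative kind labels; substitute your own taxonomy.
PROTECTED_KINDS = frozenset({"policy", "authz", "tool_safety"})


def compress_context(
    blocks: list[tuple[str, str]],
    summarize: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Apply a lossy summarizer to every (kind, text) block except protected kinds."""
    result: list[tuple[str, str]] = []
    for kind, text in blocks:
        if kind in PROTECTED_KINDS:
            result.append((kind, text))  # passed through verbatim
        else:
            result.append((kind, summarize(text)))
    return result
```

Because protected blocks bypass the summarizer entirely, a regression in the summarizer cannot silently weaken policy or authorization text.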

Conflict Resolution

Memory can disagree with retrieval or current user input. Encode the rule explicitly:

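type Evidence = {
  source: "current_user" | "tool_result" | "retrieved_doc" | "profile_memory" | "summary_memory";
  text: string;
  observedAt: string; // ISO 8601 timestamp
  trust: number;      // additive adjustment on top of the source weight
};

const THIRTY_DAYS_MS = 30 * 86_400_000;

function rankEvidence(e: Evidence): number {
  const ageMs = Date.now() - Date.parse(e.observedAt);
  // Evidence older than 30 days is penalized. An unparseable timestamp yields NaN,
  // which is treated as stale instead of being silently counted as fresh.
  const freshnessPenalty = Number.isNaN(ageMs) || ageMs > THIRTY_DAYS_MS ? 10 : 0;
  const sourceWeight: Record<Evidence["source"], number> = {
    current_user: 100,
    tool_result: 95,
    retrieved_doc: 80,
    profile_memory: 60,
    summary_memory: 40
  };
  return sourceWeight[e.source] + e.trust - freshnessPenalty;
}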

Current user intent and fresh tool results usually outrank old memory. For regulated workflows, authoritative systems of record must outrank user claims.
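
The regulated-workflow exception can be made strict instead of relying on weights alone. The sketch below is a Python illustration under assumptions: it reuses the source weights from the ranking function above, `rank_key` and the `regulated` flag are hypothetical, and it treats every tool result as coming from a system of record, which a real deployment would need to decide per tool.

```python
from datetime import datetime, timedelta

SOURCE_WEIGHT = {
    "current_user": 100,
    "tool_result": 95,
    "retrieved_doc": 80,
    "profile_memory": 60,
    "summary_memory": 40,
}
STALE_AFTER = timedelta(days=30)


def rank_key(source: str, observed_at: datetime, trust: int,
             now: datetime, regulated: bool = False) -> tuple[int, int]:
    """Sort key for evidence; higher tuples win.

    The first element is 1 only for authoritative evidence in regulated mode,
    so no trust bonus on a user claim can outrank it.
    Assumption: tool results stand in for systems of record here.
    """
    penalty = 10 if now - observed_at > STALE_AFTER else 0
    score = SOURCE_WEIGHT[source] + trust - penalty
    authoritative = 1 if regulated and source == "tool_result" else 0
    return (authoritative, score)
```

Using a tuple keeps the default ordering unchanged while guaranteeing that, in regulated mode, the override cannot be defeated by a large `trust` value.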

Privacy and Retention

Memory is personal data when it contains user preferences, account facts, or conversation history. Production designs need:

  • Purpose limitation: store memory only for a product reason.
  • Retention windows by memory type.
  • Delete/export paths for user data rights.
  • Tenant isolation and row-level authz.
  • PII redaction in logs and eval datasets.
  • Memory provenance so bad memories can be removed.
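
Several of these controls can be sketched over a simple record type. This is an illustration under assumptions, not a compliance implementation: `MemoryRecord`, the `RETENTION` windows, and the helper names are hypothetical, and the email pattern covers only one kind of PII.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class MemoryRecord:
    tenant_id: str
    user_id: str
    kind: str        # e.g. "session", "summary", "entity"
    text: str
    source: str      # provenance: which turn, tool, or job wrote this memory
    created_at: datetime


# Assumption: example retention windows; real values come from policy.
RETENTION = {
    "session": timedelta(days=1),
    "summary": timedelta(days=90),
    "entity": timedelta(days=365),
}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def purge_expired(records: list[MemoryRecord], now: datetime) -> list[MemoryRecord]:
    """Keep only records still inside the retention window for their kind."""
    return [r for r in records if now - r.created_at <= RETENTION[r.kind]]


def delete_user(records: list[MemoryRecord], tenant_id: str, user_id: str) -> list[MemoryRecord]:
    """Deletion path: drop every record for one user within one tenant."""
    return [r for r in records if not (r.tenant_id == tenant_id and r.user_id == user_id)]


def delete_by_source(records: list[MemoryRecord], source: str) -> list[MemoryRecord]:
    """Provenance-based cleanup: drop memories written by a known-bad source."""
    return [r for r in records if r.source != source]


def redact_for_logs(text: str) -> str:
    """Mask email addresses before text reaches logs or eval datasets."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)
```

Scoping deletion by both `tenant_id` and `user_id` reflects the tenant-isolation bullet: a user identifier alone is not assumed to be unique across tenants.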
⚙️ For Developers

Introduce a context assembler service. Do not let feature code append arbitrary strings directly into model prompts.
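
A minimal sketch of what such a service boundary might look like, assuming feature code submits typed contributions instead of raw strings. `Contribution`, `ContextAssembler`, and the source names are hypothetical; budgeting and ranking are omitted to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    source: str  # must be a registered source name
    text: str


class ContextAssembler:
    """Single entry point for prompt content (hypothetical interface)."""

    def __init__(self, allowed_sources: set[str]):
        self._allowed = frozenset(allowed_sources)
        self._items: list[Contribution] = []

    def add(self, item: Contribution) -> None:
        # Reject content from sources nobody has reviewed and registered.
        if item.source not in self._allowed:
            raise ValueError(f"unregistered context source: {item.source}")
        self._items.append(item)

    def render(self) -> str:
        # Each block is labeled so traces show where every line came from.
        return "\n\n".join(f"[{c.source}]\n{c.text}" for c in self._items)
```

Routing all prompt content through one object gives a single place to apply budgets, trust ranking, and redaction later.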

🧪 For QA Engineers

Test stale and conflicting memories. The agent should prefer fresh, trusted evidence and explain uncertainty when sources disagree.

🎯 For Product Managers

Define retention and deletion behavior as product requirements. Infinite memory sounds useful until compliance, privacy, and user trust are considered.

Cost Lever

Prompt caching, context ranking, and summary refresh policies often reduce cost more than switching to a smaller model.

Interview Practice

  1. Compare session memory, summary memory, entity memory, and RAG retrieval.
  2. Why should context assembly be a separate runtime layer?
  3. How do prompt caching and stable prefixes reduce cost?
  4. What can go wrong with lossy prompt compression?
  5. How should an agent resolve conflicts between old memory and fresh tool results?
  6. What privacy controls are needed for long-term memory?
  7. How would you test for stale memory regressions?
  8. Why is a larger context window not a complete memory strategy?