LLM Mastery for Enterprise AI Engineering Beginner ⏱ 55 min

Context, Embeddings, Transformers, and Model Choices

The remaining foundation layer: context windows, embeddings, transformers, attention, parameters, training vs inference, and open vs closed models.

How to Use This Lesson

Start with the user problem, then map the pattern to architecture and failure modes.
If a code or design example is included, change one assumption and reason through the impact.
Use role callouts, checklists, and Q&A sections as implementation or interview prep notes.

Prerequisites: Tokens and Tokenization

LLM Mastery course page. This lesson is part 5 of 5 in the beginner track. Use the lab and assessment sections as the completion standard, not optional reading.

Required mastery artifact: by the end of this lesson, update the running enterprise readiness packet for a realistic use case. Treat examples and vendor names as dated illustrations; defend decisions with current model, cost, risk, and evaluation evidence.

04 — Context Windows

Module 01 | Foundations

What is a Context Window?

Every LLM has a maximum number of tokens it can “see” at once.

This is called the context window — like the model’s working memory or attention span.

Analogy: Imagine you’re reading a book, but you can only keep 10 pages in front of you at a time. When you turn to page 11, page 1 falls off the back. The model is the same — it can only “see” tokens up to its limit.

GPT-3.5          →  4,096 tokens  (~3,000 words)
GPT-4 Turbo      → 128,000 tokens (~96,000 words)
Claude 3 Opus    → 200,000 tokens (~150,000 words)
LLaMA 3 8B       →   8,192 tokens (~6,000 words)
Gemini 1.5 Pro   → 1,000,000 tokens (~750,000 words)

What Goes Into the Context Window?

The context window contains EVERYTHING the model processes:

```
┌─────────────────────────────────────┐
│  System Prompt      (e.g., 500 tok) │
│  Conversation History (e.g., 2000)  │
│  Your New Message   (e.g., 200 tok) │
│  Retrieved Documents (e.g., 3000)   │
│                                     │
│  Total used: 5,700 tokens           │
│  Remaining: 122,300 tokens          │
└─────────────────────────────────────┘
```

When the context is full, something has to give: the application drops or summarizes older messages (usually starting from the beginning), or the request fails with an error.
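
The budget in the diagram above can be checked with simple arithmetic. This is a minimal sketch, assuming a 128,000-token window (the GPT-4 Turbo figure quoted in this lesson) and the example component sizes from the diagram. The function and variable names are illustrative, not part of any SDK.

```python
# Token budget check for one request, using the example numbers from the diagram.
CONTEXT_WINDOW = 128_000  # assumed limit (GPT-4 Turbo figure quoted in this lesson)

def remaining_tokens(components: dict[str, int], window: int = CONTEXT_WINDOW) -> int:
    """Return how many tokens are left after all components are counted."""
    used = sum(components.values())
    if used > window:
        raise ValueError(f"context overflow: {used} tokens used, limit is {window}")
    return window - used

request = {
    "system_prompt": 500,
    "conversation_history": 2_000,
    "new_message": 200,
    "retrieved_documents": 3_000,
}

used = sum(request.values())
left = remaining_tokens(request)
print(f"used={used} remaining={left}")
```

Running the check before each request lets the application trim or summarize history deliberately instead of discovering the overflow as an API error.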

---

## Why Context Window Size Matters

### Longer context = more capabilities
- Analyze a whole codebase at once
- Summarize long documents
- Maintain coherent very long conversations
- Process multiple documents together

### But longer context = more cost + slower responses
- Each token costs money (input tokens are usually cheaper than output)
- Processing 100K tokens takes real compute time
- You pay for every token in your context, every turn
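
Because the full history is resent on every turn, total input tokens grow much faster than the number of turns. A small sketch, assuming each turn adds a 500-token user message and an 800-token reply (the message sizes used in the Managing Context Efficiently example); the function name is illustrative.

```python
def cumulative_input_tokens(turns: int, user_tokens: int = 500, reply_tokens: int = 800) -> int:
    """Total input tokens billed across a conversation when every request
    resends all earlier messages plus the new user message."""
    total = 0
    history = 0  # tokens already in the conversation before this turn
    for _ in range(turns):
        total += history + user_tokens          # this request's input
        history += user_tokens + reply_tokens   # the finished turn joins the history
    return total

for n in (1, 10, 50):
    print(n, cumulative_input_tokens(n))
```

Under these assumptions a 50-turn conversation bills about 1.6 million input tokens in total, even though the final request alone carries only about 64,000.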

### The "Lost in the Middle" Problem
Research on long-context models (the "Lost in the Middle" study, Liu et al., 2023) shows that LLMs tend to use information at the **beginning** and **end** of the context more reliably. Information buried in the middle is more likely to be overlooked.

Practical implication: Put the most important information at the start or end of your prompts.

---

## Context Window vs Memory

These are NOT the same thing:

| Context Window | Memory |
|---------------|--------|
| Within-conversation state | Across-conversation state |
| Automatic (included in the model) | Must be built explicitly |
| Lost when session ends | Can persist indefinitely |
| Costs tokens | Usually external storage |

LLMs have context windows by default. Memory requires RAG or external systems (covered in Module 06).

---

## Managing Context Efficiently

```python
# Bad: Sending entire conversation every time
messages = [
    {"role": "user", "content": "long message 1..."},  # 500 tokens
    {"role": "assistant", "content": "long reply 1..."}, # 800 tokens
    {"role": "user", "content": "long message 2..."},  # 500 tokens
    # ... 50 more turns
    {"role": "user", "content": "new question"}
]
# Total: might be 50,000 tokens — expensive!

# Better: Summarize old turns
# Keep recent turns in full, summarize older ones
messages = [
    {"role": "system", "content": "Summary of previous conversation: [brief summary]"},
    # Last 5 turns only:
    {"role": "user", "content": "recent question"},
    {"role": "assistant", "content": "recent answer"},
    {"role": "user", "content": "new question"}
]
```

Next: 05 — Embeddings

05 — Embeddings

Module 01 | Foundations

The Problem: Computers Don’t Understand Words

Computers work with numbers. Text is just characters.

How do you make a computer “understand” that “cat” and “kitten” are similar, but “cat” and “car” are less similar?

The answer: embeddings.

What is an Embedding?

An embedding is a list of numbers that represents a piece of text.

```
"cat"    → [0.23, -0.14, 0.87, 0.03, -0.56, ...]  (1536 numbers)
"kitten" → [0.25, -0.12, 0.89, 0.01, -0.54, ...]  (1536 numbers)
"car"    → [0.71, 0.44, -0.23, 0.92, 0.11, ...]   (1536 numbers)
```

The key insight: **similar meanings = similar numbers**.

"Cat" and "kitten" have similar numbers (they're close in space).
"Cat" and "car" have very different numbers (they're far apart in space).

---

## The Vector Space Analogy

Imagine a map where every word is a point in space. Similar words are located near each other.

```
         animals
           ↑
    cat • kitten
    dog •   • puppy
           
           ←————→
        vehicles
    car •  truck
    bus •
```

This space can have 1536 dimensions (not 2 like a map), but the principle is the same.
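
Closeness in this space is commonly measured with cosine similarity. A minimal sketch using only the first five values of the cat, kitten, and car vectors shown earlier. Real embeddings have many more dimensions, so these truncated vectors are for illustration only; the ordering of the scores is the point, not their exact values.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First five values of each example vector from this lesson.
cat = [0.23, -0.14, 0.87, 0.03, -0.56]
kitten = [0.25, -0.12, 0.89, 0.01, -0.54]
car = [0.71, 0.44, -0.23, 0.92, 0.11]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_car = cosine_similarity(cat, car)
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```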

---

## Famous Embedding Math

The classic demonstration:

```
king - man + woman ≈ queen

In embedding space:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
```

This works because the model learned relational patterns, not just individual words.
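
The arithmetic can be tried on hand-made vectors. The sketch below assumes a toy two-dimensional space whose axes (royalty, gender) are invented for illustration. Real embedding dimensions are learned and are not individually interpretable like this.

```python
import math

# Hypothetical toy vectors: [royalty, gender], values chosen by hand.
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a: str, b: str, c: str) -> str:
    """Return the word closest to vector(a) - vector(b) + vector(c),
    excluding the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
print(result)
```

With learned embeddings the result is only approximately the expected word, which is why the relation is written with ≈ rather than =.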

---

## Types of Embeddings

### Token Embeddings
Each token has a learned embedding (a fixed vector). These are the input to the model.

### Contextual Embeddings
Inside the transformer, embeddings update based on context:
- "bank" near "river" → different embedding than "bank" near "money"
- The same token gets different embeddings based on context

### Sentence/Document Embeddings
You can embed entire sentences or documents:
```
"The dog ran fast" → one vector representing the whole sentence
```
Useful for search, similarity comparison, RAG.

---

## Embeddings in Practice

```python
# Getting embeddings from OpenAI
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog"
)

embedding = response.data[0].embedding
print(f"Embedding dimensions: {len(embedding)}")  # 1536
print(f"First 5 values: {embedding[:5]}")
```

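```python
# Comparing similarity between two texts
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding(text):
    # Same embeddings call as in the previous example
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

emb1 = get_embedding("I love cats")
emb2 = get_embedding("I adore kittens")
emb3 = get_embedding("I drive cars")

print(cosine_similarity(emb1, emb2))  # expected to be the higher score (similar meaning)
print(cosine_similarity(emb1, emb3))  # expected to be lower (less similar)
```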

Why Embeddings Matter for Engineers

Semantic search: Find documents by meaning, not just keywords
RAG systems: Find relevant context to inject into prompts
Classification: Cluster similar items together
Recommendation: “Similar to what you liked”
Anomaly detection: Outlier items in embedding space

Next: 06 — Transformers

06 — Transformers

Module 01 | Foundations

The Architecture That Changed Everything

In 2017, a paper titled “Attention Is All You Need” introduced the Transformer architecture.

Before transformers, most language models used RNNs (Recurrent Neural Networks), which processed text one token at a time: slow to train and prone to forgetting earlier context.

Transformers process all words at the same time (in parallel) and use “attention” to learn which words matter to which other words.

This made LLMs possible.

The Transformer Building Blocks

A transformer model has these main parts:

Input Tokens
    ↓
[Token Embedding] — converts tokens to vectors
    ↓
[Positional Encoding] — adds position information
    ↓
[Transformer Block × N] — the main processing
  ├── [Multi-Head Attention] — what to pay attention to
  ├── [Add & Normalize]
  ├── [Feed-Forward Network] — process the information
  └── [Add & Normalize]
    ↓
[Output Layer] — predicts next token probabilities

Transformer Block in Plain English

Each transformer block does two things:

1. Attention (Communication)

Tokens “look at” each other and figure out which ones are related.

“The cat sat on the mat because it was comfortable.”

What does “it” refer to? The model uses attention to figure out that “it” → “mat”.

2. Feed-Forward (Computation)

After tokens have communicated, each token processes its updated information independently.

Think of it as: attention = “gather information from neighbors”, feed-forward = “think about it yourself”.

Why “Multi-Head” Attention?

Instead of one attention mechanism, transformers use many heads running in parallel.

Each head can learn to look for a different kind of relationship. An idealized picture (real heads rarely divide this cleanly):

Head 1: Grammatical relationships (subject-verb)
Head 2: Coreference (pronoun → noun)
Head 3: Semantic similarity
Head 4: Positional relationships
… (GPT-3 has 96 attention heads per layer; GPT-4's architecture has not been published)

Then all heads’ outputs are combined.

Positional Encoding: Order Matters

Transformers process all tokens at once (in parallel), which means they don’t naturally know the order.

“Dog bites man” vs “Man bites dog”: same tokens, different meaning.

Positional encoding adds a unique signal to each token based on its position, so the model knows where each token is in the sequence.
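
One concrete scheme is the sinusoidal encoding from the original Transformer paper, sketched below in plain Python. It is only one option (many recent models use learned or rotary position embeddings instead), and the function names and tiny sizes are illustrative.

```python
import math

def sinusoidal_position(pos: int, d_model: int) -> list[float]:
    """Positional vector for one position: even indices use sine,
    odd indices use cosine, with wavelengths that grow with the index."""
    vec = []
    for i in range(d_model):
        pair = i // 2                                   # index of the sin/cos pair
        angle = pos / (10000 ** (2 * pair / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_positions(token_vectors: list[list[float]]) -> list[list[float]]:
    """Add the positional vector to each token embedding, element by element."""
    d_model = len(token_vectors[0])
    return [
        [x + p for x, p in zip(vec, sinusoidal_position(pos, d_model))]
        for pos, vec in enumerate(token_vectors)
    ]

# The same (made-up) token embedding at positions 0 and 1 ends up different.
same_token = [0.5, 0.5, 0.5, 0.5]
encoded = add_positions([same_token, same_token])
print(encoded[0])
print(encoded[1])
```

Because the two rows differ, the model can tell the first occurrence of a token from the second even though their token embeddings are identical.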

Scale: Why Size Matters

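| Model | Layers | Attention Heads | Hidden Size |
|-------|--------|-----------------|-------------|
| GPT-2 Small | 12 | 12 | 768 |
| GPT-2 Large | 36 | 20 | 1280 |
| GPT-3 | 96 | 96 | 12,288 |
| LLaMA 3 8B | 32 | 32 | 4,096 |
| LLaMA 3 70B | 80 | 64 | 8,192 |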

More layers = deeper understanding. More heads = more types of patterns learned. Larger hidden size = richer representations.

Next: 07 — Attention Mechanism

07 — Attention Mechanism

Module 01 | Foundations

The Core Idea

Attention lets the model decide: when processing this token, which other tokens should I look at?

Like a human reader: when you read “it”, your eyes scan back to find what “it” refers to. Attention is the mathematical version of that.

Queries, Keys, and Values

The attention mechanism uses three concepts: Q, K, V (Query, Key, Value).

Analogy: Library Search

Query = your search terms (“books about cats”)
Key = the label on each book
Value = the actual content inside each book

The attention mechanism:

Takes your Query
Compares it against all Keys (every token in the context)
The most matching Keys get the highest score
Returns a weighted mix of Values based on those scores

The Math (Simplified)

```
Attention(Q, K, V) = softmax(QK^T / √d) × V

Translation:
1. QK^T: How much does each query match each key? (dot product)
2. / √d: Scale down by the square root of the key dimension d (keeps the dot products from getting too large)
3. softmax(): Convert to probabilities (all add up to 1.0)
4. × V: Weight the values by those probabilities
```

You don't need to memorize this. The important insight: **higher match between Q and K = more of that token's V is included in the output**.
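
The four steps above map directly to code. A minimal single-head sketch in plain Python on tiny hand-written matrices. The Q, K, and V values are made up for illustration; real models compute them from token embeddings with learned projection weights and use a tensor library.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) x V."""
    d = len(K[0])                            # dimension of each key vector
    outputs, all_weights = [], []
    for q in Q:
        # Steps 1-2: dot product of this query with every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # Step 3: convert scores to attention weights
        weights = softmax(scores)
        # Step 4: weighted mix of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three tokens, 2-dimensional vectors (hypothetical numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over all three tokens. The first query matches the first and third keys equally, so those two values dominate its output.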

---

## Causal Masking

During training and generation, the model shouldn't be able to "cheat" by looking at future tokens.

Causal masking ensures each token can only attend to tokens **before** it (and itself):

```
Token 1: can see → [1]
Token 2: can see → [1, 2]
Token 3: can see → [1, 2, 3]
Token 4: can see → [1, 2, 3, 4]
```

This is why these models are called **causal language models**.
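
The visibility pattern above is usually implemented as a lower-triangular mask applied to the attention scores before softmax. A small sketch with illustrative function names; giving masked positions zero weight here is equivalent to setting their scores to negative infinity.

```python
import math

def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores: list[float], allowed: list[bool]) -> list[float]:
    """Softmax over allowed positions only; masked positions get weight 0."""
    exps = [math.exp(s) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
for i, row in enumerate(mask, start=1):
    visible = [j for j, ok in enumerate(row, start=1) if ok]
    print(f"Token {i}: can see -> {visible}")

# Equal raw scores for token 2: the weight is split over tokens 1 and 2 only.
weights = masked_softmax([0.0, 0.0, 0.0, 0.0], mask[1])
print(weights)
```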

---

## Attention Visualization

If you could visualize what a model attends to, it might look like this (the percentages are illustrative, not measured):

```
"The cat sat on the mat because it was comfortable"

When processing "it":
→ "mat" gets 60% attention weight
→ "cat" gets 25% attention weight  
→ "sat" gets 10% attention weight
→ others: 5%

When processing "comfortable":
→ "it" gets 45% (since we just established it = mat)
→ "mat" gets 35%
→ others: 20%
```

Next: 08 — Parameters

08 — Parameters

Module 01 | Foundations

What are Parameters?

Parameters are the learnable numbers inside a model.

Think of a model’s parameters as all the dials and knobs that get tuned during training. After training, they’re fixed — they encode the model’s “knowledge”.

When someone says “LLaMA 3 8B”, the “8B” means 8 billion parameters.

Where Parameters Live

In a transformer, parameters exist in:

Embedding tables — mapping token IDs to vectors
Attention weight matrices — Q, K, V projection weights
Feed-forward network weights — large dense matrices
Layer normalization parameters — small scaling factors

The vast majority live in attention and feed-forward layers.
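
A rough way to see this is to count weights for a GPT-style model. The sketch below assumes the classic layout (four d×d attention projections and a feed-forward layer with a 4x expansion, so about 12·d² weights per block) and a 50,257-token vocabulary as used by GPT-2 and GPT-3. It ignores biases, layer norms, and position embeddings, and it does not apply directly to architectures such as LLaMA that use different attention and feed-forward variants.

```python
def estimate_parameters(layers: int, hidden: int, vocab: int = 50_257) -> dict[str, int]:
    """Approximate weight counts for a GPT-style decoder."""
    attention = 4 * hidden * hidden          # Q, K, V and output projections
    feed_forward = 8 * hidden * hidden       # two matrices: d x 4d and 4d x d
    blocks = layers * (attention + feed_forward)
    embeddings = vocab * hidden              # token embedding table
    return {"blocks": blocks, "embeddings": embeddings, "total": blocks + embeddings}

# Layer counts and hidden sizes taken from the scale table in the Transformers section.
gpt2_small = estimate_parameters(layers=12, hidden=768)
gpt3 = estimate_parameters(layers=96, hidden=12_288)

for name, est in (("GPT-2 Small", gpt2_small), ("GPT-3", gpt3)):
    share = est["blocks"] / est["total"]
    print(f"{name}: ~{est['total'] / 1e9:.2f}B parameters, {share:.0%} in transformer blocks")
```

The estimates land close to the published sizes (about 124M and 175B). The embedding table takes a noticeable share in the small model and a negligible one in the large model.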

Parameters ≠ Intelligence (Directly)

More parameters generally means:

More capacity to memorize facts
More nuanced understanding
Better at complex reasoning

But:

A smaller model fine-tuned on specific data often beats a larger general model
Efficiency techniques lower resource needs without adding parameters: quantization shrinks the memory per parameter, and LoRA shrinks the number of parameters you train
Quality of training data matters more than raw parameter count

7B model + great data > 70B model + bad data

How Much Memory Do Parameters Need?

Each parameter is a number. Different precisions use different memory:

| Precision | Bits per parameter | Memory for 7B model |
|-----------|--------------------|---------------------|
| float32 (fp32) | 32 bits (4 bytes) | ~28 GB |
| float16 (fp16) | 16 bits (2 bytes) | ~14 GB |
| bfloat16 (bf16) | 16 bits (2 bytes) | ~14 GB |
| int8 (Q8) | 8 bits (1 byte) | ~7 GB |
| int4 (Q4) | 4 bits (0.5 bytes) | ~3.5 GB |

This is why quantization (Module 03) is so important: going from fp32 to int8 or int4 makes a model 4-8x smaller (2-4x relative to fp16), usually with only a small quality loss.

Rule of Thumb for VRAM

To run a model for inference:

```
Minimum VRAM ≈ model_parameters × bytes_per_param × 1.2

For LLaMA 3 8B at fp16:
= 8,000,000,000 × 2 bytes × 1.2
= ~19 GB VRAM

For LLaMA 3 8B at Q4:
= 8,000,000,000 × 0.5 bytes × 1.2
= ~4.8 GB VRAM
```

This is why quantized models matter so much for local inference.
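
The rule of thumb above as a small helper. The 1.2 overhead factor comes from the formula in this section. Actual usage also depends on context length, batch size, and the runtime, so treat the result as a first estimate. Names are illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def min_inference_vram_gb(parameters: float, precision: str, overhead: float = 1.2) -> float:
    """Minimum VRAM estimate in GB: parameters x bytes per parameter x overhead."""
    return parameters * BYTES_PER_PARAM[precision] * overhead / 1e9

for precision in ("fp16", "int8", "int4"):
    print(f"LLaMA 3 8B at {precision}: ~{min_inference_vram_gb(8e9, precision):.1f} GB")
```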

---

*Next: 09 — Training vs Inference*

---
---

# 09 — Training vs Inference

> *Module 01 | Foundations*

---

## Two Very Different Things

| | Training | Inference |
|--|---------|-----------|
| What it is | Teaching the model | Using the model |
| When | Before deployment | Every time someone uses it |
| Cost | Very expensive | Cheaper per use |
| Hardware | Many GPUs, weeks/months | Fewer GPUs, milliseconds to seconds per request |
| Modifies weights | Yes | No |

---

## Training in Depth

Training is what creates the model. It involves:

1. **Data preparation**: Curating and cleaning training data
2. **Forward pass**: Run data through the model, get predictions
3. **Loss calculation**: How wrong were the predictions?
4. **Backward pass**: Calculate gradients (which direction to adjust each parameter)
5. **Weight update**: Adjust parameters slightly in the right direction
6. **Repeat**: Billions of times
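
Steps 2 to 6 can be seen in miniature on a model with a single parameter. This sketch fits y = w·x to three hand-made data points with plain gradient descent. The data, learning rate, and step count are arbitrary choices for illustration; real LLM training uses automatic differentiation over billions of parameters.

```python
# Toy data generated from the rule y = 2x, so the ideal weight is 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0               # the single learnable parameter, starting from a bad guess
learning_rate = 0.05
losses = []

for step in range(100):
    # Forward pass: predictions from the current weight
    preds = [w * x for x in xs]
    # Loss calculation: mean squared error
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    losses.append(loss)
    # Backward pass: gradient of the loss with respect to w
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # Weight update: move a small step against the gradient
    w -= learning_rate * grad

print(f"learned w = {w:.4f}, first loss = {losses[0]:.3f}, last loss = {losses[-1]:.6f}")
```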

### The scale of pre-training
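- GPT-4 training: reportedly more than $100 million (OpenAI has not published a breakdown), over several months
- LLaMA 3 70B: roughly $10 million in compute by public estimates, weeks
- Fine-tuning a model: $50-$5,000, hours to days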

### Fine-tuning is also training
Fine-tuning = additional training on top of a pre-trained model. Much cheaper because:
- Starting from a good base (not random)
- Training on much less data
- Usually updating only some parameters (LoRA)

---

## Inference in Depth

Inference = using a trained model to generate outputs.

The steps:
1. Input tokens → embeddings
2. Process through all transformer layers
3. Output token probabilities
4. Sample next token
5. Repeat (autoregressive generation)
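
The loop can be sketched with a stand-in for the model. Below, a hypothetical lookup table plays the role of steps 1 to 3 (it returns next-token probabilities given only the last token), and greedy selection replaces random sampling so the run is reproducible. A real LLM conditions on the whole context and usually samples.

```python
# Hypothetical next-token probabilities, standing in for a trained model.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.8, "<end>": 0.2},
    "sat": {"<end>": 1.0},
}

def next_token_distribution(context: list[str]) -> dict[str, float]:
    """Stand-in for a forward pass: probabilities for the next token."""
    return NEXT_TOKEN_PROBS[context[-1]]

def generate(max_new_tokens: int = 10) -> list[str]:
    context = ["<start>"]
    for _ in range(max_new_tokens):
        probs = next_token_distribution(context)
        token = max(probs, key=probs.get)     # greedy: pick the most likely token
        if token == "<end>":
            break
        context.append(token)                 # feed the output back in as input
    return context[1:]

generated = generate()
print(" ".join(generated))
```

Each pass through the loop is one full forward pass of the model, which is why output tokens are generated one at a time.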

### Inference costs
- Proportional to: tokens processed × model size
- Input tokens cheaper than output tokens (output requires generating one token at a time)
- Larger models = slower inference + more memory

---

## The Memory Difference

**Training** needs to store:
- Model weights (parameters)
- Gradients (same size as weights!)
- Optimizer states (2x weights for Adam optimizer!)
- Activations (per batch)

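Total: about 4x the weight memory before activations in the simple accounting below. Mixed-precision setups that keep fp32 optimizer states and master weights need roughly 16 bytes per parameter (~8x the fp16 weights), plus activations.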

```
Training LLaMA 3 8B at fp16:
= 16 GB (weights) + 16 GB (gradients) + 32 GB (optimizer) + activations
= ~80+ GB VRAM needed
= Need multiple A100 80GB GPUs
```
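
The same accounting as a helper, shown for a 7B-parameter model (the size used in the precision table in the Parameters section). It assumes every tensor is stored at the same precision and leaves activations out because they depend on batch size and sequence length. Names are illustrative.

```python
def training_memory_gb(parameters: float, bytes_per_param: float = 2.0) -> dict[str, float]:
    """Memory before activations: weights + gradients + Adam optimizer states."""
    weights = parameters * bytes_per_param / 1e9
    gradients = weights                      # one gradient per weight
    optimizer = 2 * weights                  # Adam keeps two running averages per weight
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total_before_activations": weights + gradients + optimizer,
    }

estimate = training_memory_gb(7e9)           # 7B parameters at fp16
for name, gb in estimate.items():
    print(f"{name}: {gb:.0f} GB")
```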

**Inference** only needs:
- Model weights
- KV cache (covered in Module 04)

```
Inference LLaMA 3 8B at fp16:
= ~16-19 GB VRAM
= Can run on a single A100 40GB
```

This is why you can't fine-tune a 70B model on your laptop, but you might be able to run it.

---

## LoRA Changes the Training Story

LoRA (covered in Module 03) is a technique that:
- Freezes the original model weights during fine-tuning
- Only trains small "adapter" matrices
- Reduces trainable parameters by 99%+
- Makes training feasible on consumer hardware
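
The parameter reduction follows from the adapter shapes. In LoRA, a frozen weight matrix of shape d×k gets a trainable update formed by two small matrices of shapes d×r and r×k. The sketch below assumes a 4096×4096 projection (the LLaMA 3 8B hidden size from the scale table) and a rank of 8, a commonly used but arbitrary choice.

```python
def lora_trainable_fraction(d: int, k: int, rank: int) -> float:
    """Fraction of a d x k weight matrix's parameter count that LoRA trains."""
    full = d * k                       # frozen original weights
    adapter = rank * (d + k)           # one matrix is d x rank, the other rank x k
    return adapter / full

fraction = lora_trainable_fraction(d=4096, k=4096, rank=8)
print(f"trainable: {fraction:.2%} of the original matrix")
```

Under these assumptions the adapters hold well under 1% of the matrix's parameters, so gradients and optimizer states are needed only for that small slice.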

```
Training LLaMA 3 8B with LoRA (Q4 quantized):
= ~6 GB VRAM for the model
= ~2 GB for LoRA adapters and optimizer
= Total: ~8 GB VRAM
= Possible on a gaming GPU!
```

Next: 10 — Open-Source vs Closed-Source Models

10 — Open-Source vs Closed-Source Models

Module 01 | Foundations

The Two Worlds

Closed-Source Models

Trained and hosted by a company
You access them via API (pay per token)
You never see the weights (the actual model)
Example: GPT-4 (OpenAI), Claude (Anthropic), Gemini (Google)

Open-Source/Open-Weight Models

Weights are publicly released (you can download them)
You can run them yourself, fine-tune them, modify them
May have usage restrictions (Meta’s LLaMA has commercial terms)
Example: LLaMA 3 (Meta), Mistral, Qwen, Gemma

Side-by-Side Comparison

| Factor | Closed-Source | Open-Source |
|--------|---------------|-------------|
| Cost | Pay per token | Free to run (pay for hardware) |
| Privacy | Data sent to provider | Fully local option |
| Customization | Limited (prompts, plus provider-hosted fine-tuning where offered) | Full fine-tuning possible |
| Performance | Frontier performance | Slightly behind, closing fast |
| Deployment | Managed | You manage everything |
| Compliance | Depends on provider ToS | Full control |
| Latency | Network-dependent | Local = potentially faster |
| Uptime | Provider-dependent | You control |

When to Use Each

Use Closed-Source When:

You need best-in-class performance RIGHT NOW
You want zero infrastructure management
Your use case doesn’t need customization
Privacy isn’t critical
You’re prototyping quickly

Use Open-Source When:

Data privacy is critical (medical, legal, financial)
You need to fine-tune for a specific domain
Regulatory or contractual requirements restrict sending data to third-party processors (for example, some GDPR-governed or data-residency-bound workloads)
You want to reduce long-term costs (high volume)
You need offline/air-gapped deployment
You’re building a product and need control
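
The high-volume cost argument can be checked with a break-even estimate. Every number below is a hypothetical placeholder, not a real price quote: substitute your provider's current per-token price, your GPU's hourly cost, and your own monthly volume. The function name is illustrative too.

```python
def monthly_costs(tokens_per_month: float,
                  api_price_per_million: float,
                  gpu_cost_per_hour: float,
                  gpu_hours_per_month: float = 730.0) -> dict[str, float]:
    """Compare pay-per-token API cost with an always-on self-hosted GPU."""
    api = tokens_per_month / 1e6 * api_price_per_million
    self_hosted = gpu_cost_per_hour * gpu_hours_per_month
    # Token volume at which both options cost the same
    break_even_tokens = self_hosted / api_price_per_million * 1e6
    return {"api": api, "self_hosted": self_hosted, "break_even_tokens": break_even_tokens}

# Placeholder inputs: $5 per million tokens, $2 per GPU-hour, 1 billion tokens per month.
costs = monthly_costs(tokens_per_month=1e9, api_price_per_million=5.0, gpu_cost_per_hour=2.0)
print(costs)
```

The sketch ignores whether one GPU can actually serve the volume, as well as engineering and operations time, so it gives a lower bound on the self-hosted cost.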

The Closing Gap

Open-source models were 2-3 years behind closed-source in 2022.

By 2024-2025:

LLaMA 3 70B competes with GPT-4 on many benchmarks
Qwen 2.5 72B is reported to be competitive with GPT-4o on several coding benchmarks
Mistral Large 2 competes on reasoning
Specialized fine-tunes often beat general frontier models on narrow tasks

The gap is closing. Fast.

Practical Recommendation for Engineers

Start with:

Prototype with Claude/GPT-4 (fast, easy, good)
Identify your actual needs (privacy? cost? customization?)
Switch to open-source if needed (LLaMA 3 or Mistral as base)
Fine-tune for your specific domain
Evaluate and compare

📝 Summary — Complete Foundations Module

You now understand the core foundations:

LLMs predict the next token using neural networks trained on massive text
Tokens are the atomic units (not words or characters)
Context windows limit how much the model can see at once
Embeddings turn text into numbers that capture meaning
Transformers process all tokens in parallel using attention
Attention determines which tokens influence which others
Parameters are the learned numbers that store model knowledge
Training creates models; inference uses them
Open-source models give you freedom; closed-source gives you convenience

🧠 The Unified Mental Model

Text → Tokens → Numbers → Transformer Layers → Probabilities → Next Token
         (tokenizer)        (attention + math)  (softmax)     (sampling)

Training: Do this backward too. Adjust weights to improve predictions.
Inference: Go forward only. Generate one token at a time.

🏋️ Final Foundations Exercise

Build a mini “text similarity” app using embeddings:

```python
# Install: pip install sentence-transformers numpy

import numpy as np
from sentence_transformers import SentenceTransformer

# Claude's API doesn't expose embeddings directly, so this exercise uses a
# local HuggingFace model. OpenAI's embedding API would also work.
model = SentenceTransformer('all-MiniLM-L6-v2')  # load once, reuse for every call

def get_embedding(text):
    return model.encode(text)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Test pairs
pairs = [
    ("I love programming", "I enjoy coding"),
    ("I love programming", "The weather is nice today"),
    ("cat", "kitten"),
    ("cat", "automobile"),
    ("The bank approved my loan", "I sat by the river bank"),
]

for a, b in pairs:
    emb_a = get_embedding(a)
    emb_b = get_embedding(b)
    similarity = cosine_similarity(emb_a, emb_b)
    print(f"'{a}' vs '{b}'")
    print(f"  Similarity: {similarity:.3f}\n")
```

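**Expected output:** Pairs with similar meaning (programming/coding, cat/kitten) should score clearly higher than unrelated pairs. Exact values depend on the embedding model, so treat thresholds such as 0.8 and 0.5 as rough guides rather than guarantees.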

---

*You've completed Module 01! Move to [Module 02 — Datasets & Training](/tutorials/llm-mastery/intermediate/01-datasets-training-governance)*