LLM Mastery for Enterprise AI Engineering Intermediate ⏱ 50 min

Local AI Ecosystem

llama.cpp, Ollama, vLLM, MLX, Hugging Face, Unsloth, Axolotl, PEFT, and TRL.

How to Use This Lesson

Start with the user problem, then map the pattern to architecture and failure modes.
If a code or design example is included, change one assumption and reason through the impact.
Use role callouts, checklists, and Q&A sections as implementation or interview prep notes.

Prerequisites: Inference and Optimization

LLM Mastery course page. This lesson is part 4 of 8 in the intermediate track. Use the lab and assessment sections as the completion standard, not optional reading.

Required mastery artifact: by the end of this lesson, update the running enterprise readiness packet for a realistic use case. Treat examples and vendor names as dated illustrations; defend decisions with current model, cost, risk, and evaluation evidence.

Module 05 — Local AI Ecosystem

01 — llama.cpp

What is llama.cpp?

llama.cpp is a C/C++ inference engine, originally written to run LLaMA, that runs many open-weight LLMs on CPU (and optionally GPU).

Created by Georgi Gerganov in early 2023, it is one of the most influential open-source AI projects.

Before llama.cpp, running LLMs typically required expensive GPUs and a Python/PyTorch stack. With llama.cpp, a quantized 7B model runs on a MacBook.

Why It’s Fast on CPU

Written in C++: No Python overhead, no heavy frameworks
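GGUF quantization: GGUF files store weights at reduced precision; at 4 bits a 7B model is roughly 4 GB and fits in laptop RAM
SIMD optimizations: Uses the CPU's vector instructions (AVX2 and AVX-512 on x86, NEON on ARM)
Metal/CUDA support: Can offload layers to the GPU for speed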
Memory mapping: Loads models without copying them entirely into RAM
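To make the quantization point concrete, here is a minimal back-of-the-envelope sketch. The function name is hypothetical, and the estimate covers the weights only: real GGUF 4-bit variants also store scaling factors, so their effective bits per weight is somewhat above 4, and the KV cache needs extra memory on top.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Compare the same 7B-parameter model at three precisions
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    size = estimate_weights_gb(7e9, bits)
    print(f"7B model at {label}: ~{size:.1f} GB")
```

The 4x gap between FP16 and 4-bit is what moves a 7B model from "needs a dedicated GPU" to "fits in ordinary laptop RAM".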

Using llama.cpp

Installation

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

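# Recent llama.cpp versions build with CMake; the older `make` targets are deprecated.
# CPU only
cmake -B build
cmake --build build --config Release

# With CUDA (NVIDIA GPU)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# With Metal (Apple Silicon): enabled by default on macOS, so the CPU-only commands suffice

# Binaries land in build/bin/ (for example build/bin/llama-cli);
# run the commands below from that directory or add it to PATH.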

Basic inference

# Download a GGUF model (e.g., from HuggingFace)
wget https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf

# Run it
./llama-cli \
  -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  -p "What is the capital of Germany?" \
  -n 100 \
  --temp 0.7

# Interactive chat
./llama-cli \
  -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  -i \
  --chat-template llama3

As a server (OpenAI-compatible API)

./llama-server \
  -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  --port 8080 \
  -c 4096 \
  -ngl 33  # Number of layers to offload to GPU (33 = all layers for 8B)

# Now you have an OpenAI-compatible API at localhost:8080

Python client for llama.cpp server

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

response = client.chat.completions.create(
    model="llama-3-8b",
    messages=[{"role": "user", "content": "Hello, are you running locally?"}]
)
print(response.choices[0].message.content)

Layer Offloading

Split model across CPU RAM and GPU VRAM:

```bash
# 8B model has 33 layers (including embed/output)
# -ngl 0: CPU only (slow but works with just RAM)
# -ngl 20: 20 layers on GPU, rest on CPU (balanced)
# -ngl 33: All layers on GPU (fastest, needs ~5 GB VRAM for Q4)

./llama-cli -m model.gguf -ngl 20 -p "Your prompt"
```

This lets you use GPU acceleration even when the model doesn't fully fit in VRAM.
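A rough way to pick a starting value for `-ngl` is sketched below. It is an estimate under stated assumptions, not how llama.cpp itself decides: the helper name is hypothetical, all layers are assumed to be the same size, and a fixed amount of VRAM is reserved for the KV cache and scratch buffers.

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 1.0) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-size layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # keep room for KV cache etc.
    return min(n_layers, int(usable_gb // per_layer_gb))


# A ~5 GB Q4 model with 33 offloadable layers on GPUs of different sizes
for vram in (4, 8):
    print(f"{vram} GB VRAM -> try -ngl {layers_on_gpu(5.0, 33, vram)}")
```

Treat the result as a first guess: if loading fails with an out-of-memory error, lower the value; if VRAM is left over, raise it.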

02 — Ollama

What is Ollama?

Ollama is a user-friendly wrapper around llama.cpp (and other backends).

Analogy: llama.cpp is the engine. Ollama is the car: it adds the dashboard, steering wheel, and easy controls.

Ollama handles:
- Model downloading (like Docker images)
- Model management (list, delete, update)
- Running models as a local service
- OpenAI-compatible REST API
- Cross-platform (Mac, Windows, Linux)

Getting Started with Ollama

```bash
# Install (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# macOS / Windows: download the installer from ollama.com

# Pull a model (like docker pull)
ollama pull llama3.2:3b       # 3B — fastest
ollama pull llama3.1:8b       # 8B — good balance
ollama pull llama3.1:70b      # 70B — best quality (needs 48+ GB RAM/VRAM)
ollama pull mistral:7b        # Alternative
ollama pull qwen2.5:7b        # Alibaba's model

# Run in terminal
ollama run llama3.2:3b
>>> Hello! I'm running locally!

# List installed models
ollama list

# Remove a model
ollama rm llama3.2:3b

# See model info
ollama show llama3.1:8b
```

Ollama as API Server

Once installed, Ollama runs as a background service and exposes an API at http://localhost:11434.

# Option 1: Raw Ollama API
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "What is Fiserv?"}],
        "stream": False
    }
)
print(response.json()["message"]["content"])

# Option 2: OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain PSD2 regulation"}]
)
print(response.choices[0].message.content)

# Option 3: Ollama Python library
import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Write a Python sort function"}]
)
print(response["message"]["content"])
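The raw API example above sets `"stream": False`. With streaming enabled, the chat endpoint returns newline-delimited JSON, one fragment per line, ending with an object whose `done` field is true. The sketch below shows one way to reassemble such a stream; the helper name and the sample lines are hypothetical, and the sample stands in for a live server so the code runs offline.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join the content fragments of a newline-delimited JSON chat stream."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Hypothetical sample, shaped like a streamed /api/chat response
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_stream(sample))
```

Against a running server, the same function could consume `requests.post(..., json={..., "stream": True}, stream=True).iter_lines(decode_unicode=True)`.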

Custom Modelfiles

Like Dockerfiles for models — define your own model configuration:

```
# compliance-expert.Modelfile

FROM llama3.1:8b

SYSTEM """You are an expert in EU financial compliance regulations.
You have deep knowledge of GDPR, PSD2, MiFID II, DORA, and Basel III.
Always cite specific regulation articles when possible.
If you're unsure, say so — never hallucinate regulatory requirements."""

PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
```
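Because a Modelfile is plain text, it can also be generated from code, which helps when several variants (different system prompts or temperatures) need to be compared. A minimal sketch, assuming only the `FROM`, `SYSTEM`, and `PARAMETER` instructions used above; the function name is hypothetical.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Render a Modelfile using only FROM, SYSTEM and PARAMETER lines."""
    lines = [f"FROM {base}", "", f'SYSTEM """{system}"""', ""]
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    base="llama3.1:8b",
    system="You are an expert in EU financial compliance regulations.",
    params={"temperature": 0.3, "top_p": 0.9, "num_ctx": 4096},
)
print(text)
```

Writing `text` to a file gives an input that `ollama create` can build from.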

```bash
# Build your custom model
ollama create compliance-expert -f compliance-expert.Modelfile

# Run it
ollama run compliance-expert
>>> Tell me about DORA compliance requirements
```

Ollama with LangChain / LlamaIndex

# pip install langchain-ollama
# (the older langchain_community.llms.Ollama class is deprecated)
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOllama(model="llama3.1:8b", temperature=0.3)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful compliance expert."),
    ("human", "{question}")
])

chain = prompt | llm
result = chain.invoke({"question": "What is GDPR article 17?"})
print(result.content)  # chat models return a message object, not a plain string

03 — vLLM

Production-Grade LLM Serving

Ollama is great for development. vLLM is for production serving at scale.

Key features:

PagedAttention: Manages the KV cache in fixed-size blocks, like virtual-memory pages, so very little GPU memory is wasted
Continuous batching: Mix different-length requests efficiently
High throughput: Up to 24x higher throughput than plain Hugging Face Transformers serving in the vLLM authors' benchmarks
OpenAI-compatible API: Drop-in replacement for OpenAI API
Multi-GPU: Tensor parallelism across multiple GPUs
LoRA serving: Serve multiple LoRA adapters on one base model
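The PagedAttention point is easier to see with numbers. The sketch below is illustrative arithmetic, not vLLM code. It assumes a Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128), FP16 cache entries, and a block size of 16 tokens; the function names are hypothetical.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def reserved_tokens_contiguous(seq_lens, max_len: int) -> int:
    """Naive scheme: every request reserves a full max_len slot up front."""
    return len(seq_lens) * max_len


def reserved_tokens_paged(seq_lens, block_size: int = 16) -> int:
    """Paged scheme: each request holds only whole blocks it has touched."""
    return sum(-(-n // block_size) * block_size for n in seq_lens)


per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
seq_lens = [100, 900, 3000]  # tokens actually held by three requests

for name, reserved in [
    ("contiguous", reserved_tokens_contiguous(seq_lens, max_len=4096)),
    ("paged", reserved_tokens_paged(seq_lens)),
]:
    used = sum(seq_lens)
    print(f"{name}: {reserved * per_token / 2**20:.0f} MiB reserved, "
          f"{100 * (reserved - used) / reserved:.0f}% unused")
```

Under these assumptions the contiguous scheme reserves about three times the memory the paged scheme does for the same three requests, which is memory that could otherwise hold more concurrent requests.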

vLLM Quickstart

# Install
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype bfloat16 \
  --port 8000 \
  --max-model-len 4096

# With multiple GPUs (tensor parallelism)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --port 8000

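# With quantization (the checkpoint itself must already be AWQ-quantized;
# <awq-model-repo> is a placeholder, not a real repository name)
python -m vllm.entrypoints.openai.api_server \
  --model <awq-model-repo> \
  --quantization awq \
  --port 8000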

vLLM Python API

from vllm import LLM, SamplingParams

# Load model
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    # quantization="awq",     # only when `model` points to an AWQ (or "gptq") checkpoint
    dtype="bfloat16",
    max_model_len=4096,
    tensor_parallel_size=1    # GPUs to use
)

# Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
    stop=["<|eot_id|>"]  # LLaMA 3 stop token
)

# Generate (handles batching automatically)
prompts = [
    "What is MiFID II?",
    "Explain Basel III",
    "What is GDPR article 5?",
    # Can send thousands at once for batch processing
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Q: {output.prompt}")
    print(f"A: {output.outputs[0].text}\n")
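The `generate` call above batches requests automatically. To see why continuous batching matters, here is a toy simulation, not vLLM's scheduler. It assumes every decode step costs the same regardless of batch size and counts only the steps needed to finish all requests; the function names and request lengths are hypothetical.

```python
def static_batching_steps(output_lens, batch_size: int) -> int:
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size: int) -> int:
    """A finished request frees its slot at once for the next queued one."""
    queue = list(output_lens)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1  # one decode step produces one token per active request
        active = [n - 1 for n in active if n > 1]
    return steps


# Alternating short and long outputs, two slots
lens = [10, 200, 15, 180, 20, 190, 12, 170]
print("static:", static_batching_steps(lens, batch_size=2))
print("continuous:", continuous_batching_steps(lens, batch_size=2))
```

With static batching the short requests wait for their long batch-mates, so slots sit idle; refilling slots as soon as they free up removes most of that idle time.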

vLLM vs Ollama Comparison

Factor	Ollama	vLLM
Ease of setup	Very easy	Moderate
Target use	Development, local	Production serving
Throughput	Moderate	Very high under concurrent load
Multi-GPU	Basic	Excellent
Quantization	GGUF (llama.cpp)	AWQ, GPTQ, bitsandbytes
LoRA support	Limited	Full
Windows support	Yes	No native support (Linux; Windows via WSL)
Memory efficiency	Good	Excellent (PagedAttention)

Rule: Ollama for development, vLLM for production.

04 — MLX (Apple Silicon)

Apple’s ML Framework

MLX is Apple’s machine learning framework optimized for Apple Silicon (M1, M2, M3, M4).

Unlike PyTorch, which treats CPU and GPU memory as separate, MLX is built around unified memory: the CPU and GPU share the same memory pool, so arrays do not need to be copied between devices. This is why an M2 Max with 96 GB of unified memory can run very large models.

MLX for LLM Inference

```bash
# Install
pip install mlx-lm

# Run a model
mlx_lm.generate \
  --model mlx-community/Meta-Llama-3-8B-Instruct-4bit \
  --prompt "What is MLX?"

# Chat interface
mlx_lm.chat --model mlx-community/Meta-Llama-3-8B-Instruct-4bit
```

```python
# Python API
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

response = generate(
    model,
    tokenizer,
    prompt="What is Apple Silicon's advantage for LLMs?",
    max_tokens=500,
    verbose=True  # Shows tokens/second
)
```

Apple Silicon Performance

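Chip	Unified Memory	Largest practical model (approx. decode speed)
M1 (base)	8-16 GB	7B Q4 (slow, ~15 tok/s)
M2 Pro	16-32 GB	13B Q4 (~25 tok/s)
M2 Max	32-96 GB	34B Q4 (~20 tok/s)
M3 Max	36-128 GB	70B Q4 (~15 tok/s)
M2 Ultra	64-192 GB	70B Q8 (~25 tok/s)

Apple Silicon is genuinely competitive with cloud inference for personal use. Treat the speeds above as rough figures: real throughput depends on the exact chip variant, quantization, and context length, and the 70B numbers in particular are optimistic.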
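A useful sanity check for numbers like these: single-stream decoding is usually limited by memory bandwidth, because every generated token requires reading all the weights once. The sketch below computes that upper bound. The bandwidth figures (about 68 GB/s for the base M1 and about 400 GB/s for the M2 Max) are assumptions taken from Apple's published specifications, the model sizes are approximate, and the function name is hypothetical.

```python
def decode_tokens_per_second_bound(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/s if each token reads every weight once."""
    return bandwidth_gb_s / model_gb


# (chip, assumed memory bandwidth in GB/s, model, approximate weights in GB)
cases = [
    ("M1 (base)", 68, "7B Q4", 4.0),
    ("M2 Max", 400, "34B Q4", 20.0),
]
for chip, bandwidth, model, size_gb in cases:
    bound = decode_tokens_per_second_bound(bandwidth, size_gb)
    print(f"{chip}, {model}: at most ~{bound:.0f} tok/s")
```

Measured speeds land below this bound. A quoted figure that exceeds it for a given chip and model size deserves a second look.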

Fine-tuning with MLX on Mac

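```bash
# Fine-tune on Mac (no NVIDIA GPU needed!)
# Note: newer mlx-lm releases appear to name the --lora-layers flag --num-layers; check mlx_lm.lora --help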
mlx_lm.lora \
  --model mlx-community/Meta-Llama-3-8B-Instruct-4bit \
  --train \
  --data ./my_data \
  --batch-size 4 \
  --lora-layers 16 \
  --iters 1000

# Convert adapter for deployment
mlx_lm.fuse \
  --model mlx-community/Meta-Llama-3-8B-Instruct-4bit \
  --adapter-path ./adapters
```

On an M1 Pro with 16 GB: you can LoRA fine-tune a 4-bit 8B model, but memory is tight, so keep the batch size and sequence length small.

05 — Hugging Face

The GitHub of AI Models

Hugging Face is the central hub of the open-source AI ecosystem.

What it provides:
Model Hub: Over a million models to download
Dataset Hub: 100,000+ datasets
Spaces: Demo apps for models
Inference API: Run models without local hardware
Transformers library: The standard Python library for working with LLMs
PEFT, TRL, Datasets: Key fine-tuning libraries

The Transformers Library

The most important library for LLM engineering:

```python
from transformers import (
    AutoModelForCausalLM,  # Load any causal LM
    AutoTokenizer,          # Load matching tokenizer
    AutoConfig,             # Load model config
    pipeline,               # High-level inference
    Trainer,               # Training loop
    TrainingArguments,     # Training config
    BitsAndBytesConfig,    # Quantization config
    GenerationConfig,      # Generation settings
)

# Load any model from Hub
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Easy inference pipeline
pipe = pipeline("text-generation", model="gpt2")
result = pipe("Hello, world!")
```

Hugging Face Hub Operations

from huggingface_hub import (
    hf_hub_download,
    snapshot_download,
    HfApi,
    login
)

# Login (get token from huggingface.co/settings/tokens)
login(token="hf_xxx...")

# Download specific file
path = hf_hub_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    filename="config.json"
)

# Download whole model
local_dir = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    local_dir="./llama-3-8b"
)

# Upload your model
api = HfApi()
api.create_repo("your-username/my-fine-tuned-model", private=True)
api.upload_folder(
    folder_path="./my-fine-tuned-model",
    repo_id="your-username/my-fine-tuned-model"
)

Datasets Library

from datasets import load_dataset, Dataset, DatasetDict

# Load any dataset from Hub
dataset = load_dataset("tatsu-lab/alpaca")
print(dataset["train"][0])

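# Load from your own files (separate names so the Hub dataset above is not overwritten)
json_dataset = load_dataset("json", data_files="my_data.jsonl")
csv_dataset = load_dataset("csv", data_files="my_data.csv")

# Process and filter (the alpaca dataset has "instruction" and "output" columns)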
filtered = dataset.filter(lambda x: len(x["output"]) > 100)
mapped = dataset.map(lambda x: {"formatted": f"Q: {x['instruction']}\nA: {x['output']}"})

# Split
split = dataset["train"].train_test_split(test_size=0.1)

# Push to Hub
split.push_to_hub("your-username/my-dataset")
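The `my_data.jsonl` file loaded above is in JSON Lines format: one JSON object per line, with the same keys in every row. A minimal standard-library sketch of producing and checking such a file follows. The records are hypothetical and reuse the `instruction` and `output` field names from the examples; the helper names are also hypothetical.

```python
import json
import os
import tempfile

# Hypothetical training records
records = [
    {"instruction": "What is PSD2?",
     "output": "PSD2 is the EU's revised Payment Services Directive."},
    {"instruction": "What does GDPR stand for?",
     "output": "General Data Protection Regulation."},
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "my_data.jsonl")
write_jsonl(path, records)
print(f"wrote {len(read_jsonl(path))} rows to {path}")
```

Reading the file back before training is a cheap way to catch malformed rows early.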

06 — Unsloth

The Fastest Fine-Tuning Library

Unsloth is a library that, according to the project's own benchmarks, makes QLoRA fine-tuning 2-5x faster and uses 50-70% less VRAM than vanilla Hugging Face + PEFT.

How it achieves this:

Custom GPU kernels (key operations rewritten by hand in Triton)
Custom attention implementation
Memory-efficient gradient computation
Better Flash Attention integration

Why Use Unsloth vs PEFT/TRL Directly

Metric	PEFT + TRL	Unsloth
Training speed	1x	2-5x
VRAM usage	1x	0.3-0.5x
Code complexity	Moderate	Simple
Model support	All	Popular models
Accuracy	Baseline	Same (no quality loss)

Complete Unsloth Fine-Tuning Example

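# pip install unsloth
# Note: argument names below follow an older TRL API. Newer TRL releases move
# dataset_text_field and max_seq_length into SFTConfig and rename tokenizer to processing_class.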

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch

# 1. Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-instruct-bnb-4bit",  # Pre-quantized for speed
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# 2. Configure LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    use_rslora=False,    # Rank-stabilized LoRA (try True if unstable)
    loftq_config=None,
)

# 3. Prepare dataset
def format_example(example):
    """Format as chat template"""
    chat = [
        {"role": "system", "content": "You are a compliance expert."},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]}
    ]
    return {"text": tokenizer.apply_chat_template(chat, tokenize=False)}

dataset = load_dataset("json", data_files="my_compliance_data.jsonl", split="train")
dataset = dataset.map(format_example, batched=False)

# 4. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",        # Memory-efficient optimizer
        weight_decay=0.01,
        lr_scheduler_type="linear",
        output_dir="./outputs",
        save_strategy="epoch",
    ),
)

trainer.train()

# 5. Save adapter
model.save_pretrained("compliance-lora-adapter")
tokenizer.save_pretrained("compliance-lora-adapter")

# 6. Optional: Save merged model for deployment
model.save_pretrained_merged("compliance-merged-model", tokenizer, 
                              save_method="merged_16bit")

# 7. Optional: Save as GGUF for Ollama
model.save_pretrained_gguf("compliance-model", tokenizer, quantization_method="q4_k_m")
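The batch settings in the training arguments above interact, and it helps to work out what they imply before launching a run. The sketch below is approximate (the exact rounding depends on the Trainer version), the helper name is hypothetical, and the dataset size of 1,000 examples is an assumption for illustration.

```python
import math


def training_steps(n_examples: int, per_device_batch: int, grad_accum: int,
                   epochs: int, n_devices: int = 1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Settings from the example: batch 2, accumulation 4, 3 epochs, single GPU
effective, total = training_steps(1000, per_device_batch=2, grad_accum=4, epochs=3)
print(f"effective batch size: {effective}, total steps: ~{total}")
print(f"5 warmup steps = {100 * 5 / total:.1f}% of training")
```

Knowing the total step count makes it easier to judge whether values such as `warmup_steps` and `logging_steps` are sensible for a given dataset.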

07 — Axolotl

The Flexible Training Framework

Axolotl is a YAML-configured training framework that handles the complexity of LLM fine-tuning.

Rather than writing Python training code, you describe your training run in a config file.

Axolotl Config Example

```yaml
# compliance-finetune.yml

base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

# Data
datasets:
  - path: my_compliance_data.jsonl
    type: chat_template
    chat_template: llama3

dataset_prepared_path: ./prepared_data
val_set_size: 0.05

# LoRA (use the qlora adapter type when the base model is loaded in 4-bit, as below)
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true  # Target all linear layers

# Quantization
load_in_4bit: true
bnb_4bit_use_double_quant: true
bnb_4bit_quant_type: nf4

# Training
sequence_len: 2048
sample_packing: true  # Packs multiple short sequences into one — more efficient

micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
warmup_steps: 10

# Saving
output_dir: ./outputs/compliance-model
save_safetensors: true
saves_per_epoch: 1
logging_steps: 10

# Evaluation
eval_steps: 100
eval_table_size: 5

# wandb logging (optional)
wandb_project: compliance-finetune
wandb_run_name: llama3-compliance-v1
```
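The `sample_packing` option in the config is worth a closer look. The sketch below uses a simple first-fit-decreasing heuristic as a stand-in to show the idea; it is not Axolotl's actual packing code, and the function name and sequence lengths are hypothetical.

```python
def pack_sequences(lengths, max_len: int):
    """Greedily pack sequences into rows of at most max_len tokens."""
    rows = []  # each row is a list of sequence lengths
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            raise ValueError(f"sequence of {n} tokens exceeds max_len={max_len}")
        for row in rows:
            if sum(row) + n <= max_len:
                row.append(n)
                break
        else:
            rows.append([n])  # nothing had room, start a new row
    return rows


lengths = [1800, 900, 700, 400, 300, 200, 150]  # token counts per example
rows = pack_sequences(lengths, max_len=2048)

padding_unpacked = len(lengths) * 2048 - sum(lengths)
padding_packed = len(rows) * 2048 - sum(lengths)
print(f"{len(lengths)} padded rows -> {len(rows)} packed rows")
print(f"padding tokens: {padding_unpacked} -> {padding_packed}")
```

Fewer padding tokens means less wasted compute per step, which is why packing tends to help most on datasets with many short examples.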

```bash
# Run training
accelerate launch -m axolotl.cli.train compliance-finetune.yml

# Continue from checkpoint
accelerate launch -m axolotl.cli.train compliance-finetune.yml \
  --resume-from-checkpoint ./outputs/compliance-model/checkpoint-500
```

Axolotl vs Unsloth

Factor	Axolotl	Unsloth
Configuration	YAML config	Python code
Flexibility	Very high	Moderate
Supported formats	Many	Common
Speed	Good	Excellent
Beginner friendly	Moderate	Very
Multi-GPU	Excellent	Good

Start with Unsloth for learning. Use Axolotl for complex production training.

08 — PEFT & TRL Library

PEFT: Parameter-Efficient Fine-Tuning

PEFT is Hugging Face's library for parameter-efficient adapter methods such as LoRA, prefix tuning, and prompt tuning:

from peft import (
    LoraConfig,           # LoRA configuration
    get_peft_model,       # Apply adapters to model
    PeftModel,            # Load saved adapter
    TaskType,             # Task types (CAUSAL_LM, SEQ_CLS, etc.)
    prepare_model_for_kbit_training,  # Prepare for QLoRA
)

# Full LoRA setup
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, config)
model.print_trainable_parameters()

# Load a saved adapter later
loaded_model = PeftModel.from_pretrained(base_model, "path/to/adapter")
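The count reported by `print_trainable_parameters()` can be predicted by hand: a LoRA adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters. The sketch below applies that formula. The layer shapes are assumptions based on the published Llama 3 8B configuration (hidden size 4096, MLP size 14336, 8 KV heads of dimension 128, 32 layers), and the names are hypothetical.

```python
# (in_features, out_features) per projection; assumed Llama 3 8B shapes
LLAMA3_8B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
N_LAYERS = 32


def lora_trainable_params(r: int, target_modules, shapes=LLAMA3_8B_SHAPES,
                          n_layers: int = N_LAYERS) -> int:
    """LoRA adds an (in x r) and an (r x out) matrix per targeted layer."""
    per_layer = sum(r * (shapes[m][0] + shapes[m][1]) for m in target_modules)
    return per_layer * n_layers


attention_only = lora_trainable_params(16, ["q_proj", "v_proj"])
all_linear = lora_trainable_params(16, list(LLAMA3_8B_SHAPES))
print(f"q_proj + v_proj: {attention_only:,} trainable parameters")
print(f"all linear layers: {all_linear:,} trainable parameters")
```

Either way the adapter is well under 1% of the 8 billion base parameters, which is the core of why LoRA fits on modest hardware.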

TRL: Transformer Reinforcement Learning

TRL implements the training algorithms:

from trl import (
    SFTTrainer,     # Supervised fine-tuning
    DPOTrainer,     # Direct Preference Optimization
    PPOTrainer,     # RLHF with PPO
    RewardTrainer,  # Training reward models
    ORPOTrainer,    # ORPO (SFT and preference optimization in one stage)
)

# SFT
sft_trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=training_args,
)

# DPO
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    train_dataset=preference_dataset,  # needs "prompt", "chosen", "rejected"
    args=dpo_args,
)

# ORPO (folds preference optimization into SFT, no ref model needed)
orpo_trainer = ORPOTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=preference_dataset,
    args=orpo_args,
)
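To show what the DPO trainer optimizes, here is the per-example DPO loss written out in plain Python. It is a sketch of the published formula rather than TRL's implementation. The inputs are summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model, and the example values are hypothetical.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log-probabilities."""
    chosen_gain = policy_chosen - ref_chosen        # how much more the policy likes "chosen"
    rejected_gain = policy_rejected - ref_rejected  # same for "rejected"
    margin = beta * (chosen_gain - rejected_gain)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Policy identical to the reference: no preference learned yet
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy has moved toward the chosen answer and away from the rejected one
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's, which is why DPO needs `ref_model` while ORPO does not.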

The Complete Tool Stack Mental Map

For LOCAL INFERENCE:
  Mac (M1/M2/M3) → Ollama or MLX
  Windows/Linux with GPU → Ollama
  Production server → vLLM or llama.cpp server
  Low-level control → llama.cpp directly

For FINE-TUNING:
  Beginner, quick results → Unsloth (easiest)
  Complex/production training → Axolotl (most flexible)
  Multi-GPU scale → Axolotl + DeepSpeed
  API layers → PEFT (adapters) + TRL (training algorithms)

For MODEL MANAGEMENT:
  Download, share, discover → Hugging Face Hub
  Dataset work → Hugging Face Datasets
  Any model architecture → Hugging Face Transformers

📝 Module 05 Summary

Tool	Role	When to Use
llama.cpp	C++ LLM inference engine	Low-level, embedded, max efficiency
Ollama	User-friendly local model runner	Development, local chat, personal use
vLLM	Production LLM server	High-throughput serving, real deployments
MLX	Apple Silicon inference/training	M1/M2/M3 Mac users
Hugging Face	Model/dataset hub + core libraries	Everything — it’s the ecosystem
Unsloth	Fast fine-tuning library	Quick, efficient QLoRA training
Axolotl	Config-driven training framework	Production fine-tuning pipelines
PEFT	Adapter library	LoRA and other adapter methods
TRL	RL/alignment training	SFT, DPO, RLHF training loops

🏋️ Module Exercise

Set up a complete local AI stack:

```bash
# Step 1: Install Ollama (Linux; on macOS/Windows use the installer from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Step 2: Pull a model
ollama pull llama3.2:3b

# Step 3: Create a custom model
cat > compliance.Modelfile << 'EOF'
FROM llama3.2:3b
SYSTEM """You are an expert in EU financial regulations.
Be precise, cite specific articles when possible.
If uncertain, say so."""
PARAMETER temperature 0.2
EOF

ollama create compliance-bot -f compliance.Modelfile

# Step 4: Test it
ollama run compliance-bot "What is GDPR?"

# Step 5: Use it via Python
python3 << 'EOF'
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

questions = [
    "What is PSD2?",
    "Explain GDPR article 17",
    "What are Basel III capital requirements?"
]

for q in questions:
    response = client.chat.completions.create(
        model="compliance-bot",
        messages=[{"role": "user", "content": q}]
    )
    print(f"Q: {q}")
    print(f"A: {response.choices[0].message.content}\n")
EOF
```

Challenge: Compare the custom compliance-bot against vanilla llama3.2:3b on compliance questions. Does the system prompt make a measurable difference?
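One way to make "measurable" concrete is a small scoring harness. The sketch below is a starting point under loud assumptions: keyword matching is a crude proxy for answer quality, the test cases are hypothetical, and `fake_ask` returns canned text so the code runs without a server. All function names are illustrative.

```python
from typing import Callable, Dict, List


def keyword_score(answer: str, keywords: List[str]) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    text = answer.lower()
    return sum(k.lower() in text for k in keywords) / len(keywords)


def compare_models(ask: Callable[[str, str], str], models: List[str],
                   cases: Dict[str, List[str]]) -> Dict[str, float]:
    """Average keyword score per model; ask(model, question) returns an answer."""
    scores = {}
    for model in models:
        per_question = [keyword_score(ask(model, q), kws) for q, kws in cases.items()]
        scores[model] = sum(per_question) / len(per_question)
    return scores


# Hypothetical cases: question -> keywords a good answer should mention
cases = {
    "What is PSD2?": ["payment", "directive"],
    "Explain GDPR article 17": ["erasure"],
}


def fake_ask(model: str, question: str) -> str:
    """Canned stand-in for a real API call (not real model output)."""
    canned = {
        "compliance-bot": "The Payment Services Directive ... right to erasure ...",
        "llama3.2:3b": "It is a payment rule.",
    }
    return canned[model]


scores = compare_models(fake_ask, ["compliance-bot", "llama3.2:3b"], cases)
print(scores)
```

For the real comparison, replace `fake_ask` with a function that calls `client.chat.completions.create` as in Step 5, and run each question several times, since sampling makes single answers noisy.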

---

*Move to [Module 06 — RAG & Memory](/tutorials/llm-mastery/intermediate/05-rag-memory-access-control)*