What Breaks in Basic RAG at Scale
A basic RAG system works fine in demos. It breaks in production for four reasons:
- Retrieval recall is too low. Your dense (semantic) index misses documents that use different vocabulary than the query. A user asks “show me the refund policy” and your embeddings don’t retrieve the doc titled “return merchandise authorization procedure.”
- No confidence signal. Basic RAG generates an answer whether the retrieved context is excellent or garbage. You can’t tell which case you’re in.
- No metadata filtering. When a user asks “what changed in Q3 2024?” a pure semantic search will happily return Q3 2019 docs if they’re semantically similar.
- Re-ranking not applied. Embedding similarity is a good but imperfect signal. The top-1 document by cosine similarity is often not the most relevant document for answering the question.
This tutorial addresses all four with production patterns you can implement today.
Hybrid Search: Dense + Sparse Combined
The single highest-leverage improvement you can make to a basic RAG system is adding sparse retrieval alongside dense retrieval.
Dense retrieval (what you already have): embed the query, find the nearest vectors. Great for semantic similarity. Misses exact keyword matches.
Sparse retrieval (BM25): a probabilistic keyword scoring algorithm. Finds exact term matches. Misses semantic equivalence.
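Hybrid: run both and fuse the two result lists. One option is to normalize each retriever’s scores to [0,1] and take a weighted combination; another is to fuse by rank position. In most enterprise corpora, hybrid outperforms either approach alone, but benchmark on your own dataset.
A standard rank-based formula is Reciprocal Rank Fusion (RRF):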
RRF_score(doc) = 1/(k + rank_dense) + 1/(k + rank_sparse)
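Here rank_dense and rank_sparse are the document’s 1-based positions in each result list, and a document missing from one list simply gets no contribution from it. The constant k (60 is the commonly used default) dampens the gap between top-ranked and lower-ranked documents so that no single list dominates. This simple formula is usually a strong baseline and often competitive with learned fusion, without tuning overhead.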
Hybrid Search Architecture
flowchart TD
Q([User Query]) --> DE[Dense Embedding text-embedding-3-small]
Q --> BM[BM25 Scorer Keyword matching]
DE --> VS[(Vector Store Cosine similarity)]
BM --> IX[(Inverted Index BM25 scores)]
VS --> DR[Dense Results Top-20 with scores]
IX --> SR[Sparse Results Top-20 with scores]
DR --> RRF[Reciprocal Rank Fusion Normalize and combine]
SR --> RRF
RRF --> TOP[Top-10 Merged Results]
TOP --> CE[Cross-Encoder Re-ranker Precision pass]
CE --> FINAL[Final Top-5 Chunks]
FINAL --> LLM[LLM Generation]
style Q fill:#dbeafe,stroke:#2563eb,color:#1d4ed8
style RRF fill:#fef3c7,stroke:#d97706,color:#92400e
style CE fill:#f3e8ff,stroke:#7c3aed,color:#7c3aed
style FINAL fill:#dcfce7,stroke:#16a34a,color:#15803d
Weighted sum requires you to tune the weight α that balances dense vs sparse. RRF needs no tuning - it works by rank position alone. Use RRF unless you have a labeled evaluation set to tune weights against.
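For comparison, the weighted-sum alternative can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes min-max normalization per result list, and the names `min_max_normalize`, `weighted_fusion`, and `alpha` (the weight on the dense score) are hypothetical.

```python
def min_max_normalize(results: list[tuple[int, float]]) -> dict[int, float]:
    """Map raw scores to [0, 1]. If all scores are equal, every doc gets 1.0."""
    if not results:
        return {}
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return {doc: 1.0 for doc, _ in results}
    return {doc: (score - lo) / (hi - lo) for doc, score in results}


def weighted_fusion(
    dense: list[tuple[int, float]],
    sparse: list[tuple[int, float]],
    alpha: float = 0.5,  # assumption: weight on dense; (1 - alpha) goes to sparse
) -> list[tuple[int, float]]:
    d = min_max_normalize(dense)
    s = min_max_normalize(sparse)
    # A doc missing from one list contributes 0.0 for that list
    fused = {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
        for doc in d.keys() | s.keys()
    }
    # Highest fused score first; ties broken by doc index for determinism
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Toy inputs: (doc_idx, raw_score) pairs on very different scales
dense = [(0, 1.0), (1, 0.5), (2, 0.0)]
sparse = [(1, 12.0), (3, 4.0)]
fused = weighted_fusion(dense, sparse, alpha=0.5)
```

Unlike RRF, the outcome here depends on `alpha` and on how each retriever’s raw scores are distributed, which is why a labeled evaluation set is needed to pick the weight.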
Re-ranking: The Precision Pass
Initial retrieval (whether dense, sparse, or hybrid) optimizes for recall - get all relevant documents in the top-K. Re-ranking then optimizes for precision - put the most relevant document first.
A cross-encoder re-ranker takes a (query, document) pair and produces a single relevance score. Unlike bi-encoders (which embed query and doc separately), cross-encoders see both simultaneously and can model query-document interaction directly.
The typical pipeline:
- Retrieve top-20 candidates (cheap, fast)
- Re-rank top-20 with cross-encoder (more expensive but only 20 pairs)
- Use top-5 as context for generation
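Models like cross-encoder/ms-marco-MiniLM-L-6-v2 are small (about 22M parameters) and run locally. Scoring 20 short (query, document) pairs is typically fast enough for interactive use, and the precision gain over embedding similarity alone is often substantial; measure both on your own hardware and data.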
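The retrieve-then-re-rank pipeline above can be sketched with a pluggable pair scorer. This is an illustrative sketch: `rerank` and `overlap_score` are hypothetical names, and `overlap_score` is only a toy lexical stand-in so the example runs without downloading a model. It is not a cross-encoder.

```python
from typing import Callable, Sequence

# A scorer takes (query, document) pairs and returns one relevance score per pair
PairScorer = Callable[[list[tuple[str, str]]], Sequence[float]]


def rerank(
    query: str,
    candidates: list[str],
    score_fn: PairScorer,
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Score every (query, candidate) pair and keep the top_n by score."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_n]]


def overlap_score(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy stand-in scorer: fraction of query tokens that appear in the doc."""
    out = []
    for q, d in pairs:
        q_tokens = set(q.lower().split())
        d_tokens = set(d.lower().split())
        out.append(len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0)
    return out


candidates = [
    "shipping rates for domestic orders",
    "refund policy for online purchases",
    "how to request a refund",
]
top = rerank("refund policy", candidates, overlap_score, top_n=2)
```

With the sentence-transformers package installed, the `predict` method of `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")` accepts a list of (query, document) pairs and returns one score per pair, so it should be usable as `score_fn` in place of the toy scorer.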
Self-Query: LLM-Generated Metadata Filters
When your documents have metadata (date, author, category, product version), pure semantic search throws that signal away. Self-query lets the LLM parse the user’s intent into structured filters before retrieval.
User query: “What were the breaking changes in the v2.1 release?”
Self-query extracts:
{
  "filters": { "version": "2.1", "type": "breaking_change" },
  "semantic_query": "breaking changes"
}
Retrieval applies the metadata filters first, then runs semantic search only within the filtered subset. For date-, version-, or category-scoped queries, this is typically much more precise than semantic search over the whole corpus.
Self-query only works if your metadata schema is consistent. If “version” is sometimes “v2.1”, sometimes “2.1”, sometimes “version 2.1”, the filter will miss documents. Normalize metadata at indexing time, not at query time.
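A minimal sketch of index-time normalization followed by filter-then-search, under stated assumptions: the document shape (`text` plus a `metadata` dict), the helper names `normalize_version` and `apply_filters`, and the sample documents are all hypothetical, and the semantic search step over the filtered subset is left out.

```python
import re


def normalize_version(raw: str) -> str:
    """Canonicalize 'v2.1', 'version 2.1', and '2.1' to '2.1'."""
    match = re.search(r"\d+(?:\.\d+)*", raw)
    if match is None:
        raise ValueError(f"no version number found in {raw!r}")
    return match.group(0)


def apply_filters(docs: list[dict], filters: dict[str, str]) -> list[dict]:
    """Keep only docs whose metadata matches every filter exactly."""
    return [
        doc for doc in docs
        if all(doc["metadata"].get(key) == value for key, value in filters.items())
    ]


# Raw documents with inconsistent version strings (illustrative data)
raw_docs = [
    {"text": "Removed the legacy auth endpoint",
     "metadata": {"version": "v2.1", "type": "breaking_change"}},
    {"text": "Renamed a config key",
     "metadata": {"version": "version 2.1", "type": "breaking_change"}},
    {"text": "Added dark mode",
     "metadata": {"version": "2.1", "type": "feature"}},
    {"text": "Dropped an old runtime",
     "metadata": {"version": "v2.0", "type": "breaking_change"}},
]

# Normalize once, at indexing time
indexed = [
    {**doc, "metadata": {**doc["metadata"],
                         "version": normalize_version(doc["metadata"]["version"])}}
    for doc in raw_docs
]

# Filters as the self-query step might emit them
filters = {"version": "2.1", "type": "breaking_change"}
subset = apply_filters(indexed, filters)
# Semantic search for "breaking changes" would now run over `subset` only
```

Against the raw metadata the same filter would match none of the breaking-change documents; after normalization it matches both v2.1 entries. Running the LLM-extracted filter value through the same function is a cheap extra safeguard, since the model may also emit "v2.1".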
Self-Healing RAG: Detect and Repair Retrieval Failures
A self-healing RAG system detects when retrieval failed and attempts recovery before returning an answer.
The detection mechanism: after generating an answer, ask the LLM to assess its own confidence. If the answer required reasoning beyond what the retrieved context explicitly states, confidence is low.
A practical self-assessment prompt:
Given the context provided and the question asked, assess whether the context
contains sufficient information to answer the question accurately.
Rating: SUFFICIENT | PARTIAL | INSUFFICIENT
Reason: [one sentence]
If INSUFFICIENT: trigger a re-retrieval with a reformulated query. If PARTIAL: answer with explicit caveats. If still INSUFFICIENT after two attempts: fall back to “I don’t have enough information.”
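The branching just described can be sketched as a bounded loop. This is an illustration under assumptions, not a definitive implementation: `retrieve`, `generate`, `assess`, and `reformulate` are hypothetical callables you would back with your own retriever and LLM calls, `assess` is assumed to return text in the Rating/Reason format of the prompt above, and `max_attempts` is read here as the total number of retrieval attempts.

```python
from typing import Callable

FALLBACK = "I don't have enough information."
PARTIAL_CAVEAT = "\n\nNote: the retrieved context only partially covers this question."
VALID_RATINGS = {"SUFFICIENT", "PARTIAL", "INSUFFICIENT"}


def parse_rating(assessment: str) -> str:
    """Read the 'Rating: ...' line; treat anything unparseable as INSUFFICIENT."""
    for line in assessment.splitlines():
        if line.strip().upper().startswith("RATING:"):
            value = line.split(":", 1)[1].strip().upper()
            if value in VALID_RATINGS:
                return value
    return "INSUFFICIENT"


def answer_with_healing(
    query: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    assess: Callable[[str, list[str]], str],
    reformulate: Callable[[str], str],
    max_attempts: int = 2,  # circuit breaker: hard cap on retrieval attempts
) -> str:
    current_query = query
    for attempt in range(max_attempts):
        context = retrieve(current_query)
        answer = generate(query, context)  # always answer the original question
        rating = parse_rating(assess(query, context))
        if rating == "SUFFICIENT":
            return answer
        if rating == "PARTIAL":
            return answer + PARTIAL_CAVEAT
        # INSUFFICIENT: rephrase only if another attempt is allowed
        if attempt < max_attempts - 1:
            current_query = reformulate(current_query)
    return FALLBACK
```

Defaulting unparseable assessments to INSUFFICIENT is a conservative choice: a malformed LLM response then leads to a retry or the fallback rather than an unverified answer, and the `for` loop guarantees termination either way.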
Self-Healing RAG Loop
flowchart TD
Q([User Query]) --> HYB[Hybrid Search
Top-20 candidates]
HYB --> RR[Re-rank
Cross-encoder]
RR --> CTX[Assemble Context
Top-5 chunks]
CTX --> GEN[Generate Answer]
GEN --> ASSESS{Self-Assessment
SUFFICIENT?}
ASSESS -->|SUFFICIENT| OUT([Return Answer])
ASSESS -->|PARTIAL| WARN[Add Caveat
Return with warning]
WARN --> OUT
ASSESS -->|INSUFFICIENT| CB{Circuit Breaker
Attempts less than 2?}
CB -->|Yes| REFORM[Reformulate Query
Expand or rephrase]
REFORM --> HYB
CB -->|No| FALL([Fallback Response
Insufficient information])
style Q fill:#dbeafe,stroke:#2563eb,color:#1d4ed8
style ASSESS fill:#fef3c7,stroke:#d97706,color:#92400e
style CB fill:#fee2e2,stroke:#dc2626,color:#991b1b
style OUT fill:#dcfce7,stroke:#16a34a,color:#15803d
style FALL fill:#f1f5f9,stroke:#64748b,color:#475569
Implementation
The hybrid search implementation below uses pure Python with no vector database dependency - it builds dense vectors with OpenAI embeddings and sparse scores with a simple BM25 implementation. In production, use Weaviate (native hybrid support), Elasticsearch (kNN + BM25 built in), or Qdrant (sparse + dense vectors) to avoid building this yourself.
Hybrid Search: Cosine Similarity + BM25 with RRF
import math
import re
import numpy as np
from collections import Counter

# ── Minimal BM25 implementation ──────────────────────────────────────────────
def tokenize(text: str) -> list[str]:
    # Lowercase and split on non-word characters so "policy:" matches "policy"
    return re.findall(r"\w+", text.lower())

class BM25:
    def __init__(self, corpus: list[str], k1: float = 1.5, b: float = 0.75):
        if not corpus:
            raise ValueError("corpus must not be empty")
        self.k1 = k1  # term-frequency saturation
        self.b = b    # document-length normalization strength
        self.corpus = corpus
        self.tokenized = [tokenize(doc) for doc in corpus]
        self.doc_freqs = []
        self.idf = {}
        self.avg_dl = 0.0
        self._build_index()

    def _build_index(self):
        N = len(self.tokenized)
        self.avg_dl = sum(len(d) for d in self.tokenized) / N
        # Document frequency for each term
        df = Counter()
        for tokens in self.tokenized:
            df.update(set(tokens))
        # IDF with BM25 smoothing (the +1 keeps IDF positive for common terms)
        for term, freq in df.items():
            self.idf[term] = math.log((N - freq + 0.5) / (freq + 0.5) + 1)
        # Term frequencies per document
        self.doc_freqs = [Counter(tokens) for tokens in self.tokenized]

    def score(self, query: str, doc_idx: int) -> float:
        tokens = tokenize(query)
        dl = len(self.tokenized[doc_idx])
        score = 0.0
        for token in tokens:
            if token not in self.idf:
                continue
            tf = self.doc_freqs[doc_idx].get(token, 0)
            if tf == 0:
                continue  # term absent from this doc: contributes nothing
            numerator = tf * (self.k1 + 1)
            denominator = tf + self.k1 * (1 - self.b + self.b * dl / self.avg_dl)
            score += self.idf[token] * numerator / denominator
        return score

    def get_top_k(self, query: str, k: int = 10) -> list[tuple[int, float]]:
        scores = [(i, self.score(query, i)) for i in range(len(self.corpus))]
        scores.sort(key=lambda x: x[1], reverse=True)
        return scores[:k]

# ── Dense similarity ──────────────────────────────────────────────────────────
def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr, b_arr = np.array(a), np.array(b)
    dot = np.dot(a_arr, b_arr)
    norm = np.linalg.norm(a_arr) * np.linalg.norm(b_arr)
    return float(dot / norm) if norm > 0 else 0.0

def dense_top_k(
    query_embedding: list[float],
    doc_embeddings: list[list[float]],
    k: int = 10,
) -> list[tuple[int, float]]:
    scores = [
        (i, cosine_similarity(query_embedding, emb))
        for i, emb in enumerate(doc_embeddings)
    ]
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:k]

# ── Reciprocal Rank Fusion ────────────────────────────────────────────────────
def reciprocal_rank_fusion(
    dense_results: list[tuple[int, float]],
    sparse_results: list[tuple[int, float]],
    k: int = 60,
) -> list[tuple[int, float]]:
    # enumerate() is 0-based, so rank + 1 is the 1-based rank used in the formula.
    # Raw scores are ignored; only rank positions matter.
    rrf_scores: dict[int, float] = {}
    for rank, (doc_idx, _) in enumerate(dense_results):
        rrf_scores[doc_idx] = rrf_scores.get(doc_idx, 0) + 1 / (k + rank + 1)
    for rank, (doc_idx, _) in enumerate(sparse_results):
        rrf_scores[doc_idx] = rrf_scores.get(doc_idx, 0) + 1 / (k + rank + 1)
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

# ── Demo ──────────────────────────────────────────────────────────────────────
corpus = [
    "Return merchandise authorization procedure for defective products",
    "Refund policy: customers may request a refund within 30 days of purchase",
    "Shipping rates and delivery time estimates for domestic orders",
    "How to track your order using the online portal",
    "Customer service contact information and business hours",
    "Product warranty terms and conditions for electronics",
    "Exchange policy for clothing items purchased online",
    "Bulk order discounts and corporate account setup",
]

# Fake embeddings (in production: call text-embedding-3-small)
np.random.seed(42)
doc_embeddings = [np.random.rand(8).tolist() for _ in corpus]
# Simulate semantic clustering: place the RMA doc (0) close to the refund doc (1)
doc_embeddings[0] = (np.array(doc_embeddings[1]) + 0.05 * np.random.rand(8)).tolist()
# The query embedding is a copy of the refund doc's vector (a perfect match)
query_embedding = doc_embeddings[1].copy()
query = "show me the refund policy"

# Run hybrid search
bm25 = BM25(corpus)
sparse_results = bm25.get_top_k(query, k=5)
dense_results = dense_top_k(query_embedding, doc_embeddings, k=5)
hybrid_results = reciprocal_rank_fusion(dense_results, sparse_results)

print(f"Query: '{query}'\n")
print("BM25 (sparse) top-3:")
for rank, (idx, score) in enumerate(sparse_results[:3]):
    print(f" {rank+1}. [{score:.3f}] {corpus[idx][:60]}")
print("\nDense top-3:")
for rank, (idx, score) in enumerate(dense_results[:3]):
    print(f" {rank+1}. [{score:.3f}] {corpus[idx][:60]}")
print("\nHybrid RRF top-3:")
for rank, (idx, score) in enumerate(hybrid_results[:3]):
    print(f" {rank+1}. [RRF={score:.4f}] {corpus[idx][:60]}")
Putting It Together: The Production RAG Checklist
Before shipping a RAG system to production:
- Hybrid search - dense + BM25 with RRF fusion
- Re-ranking - cross-encoder on top-20 candidates
- Self-query metadata filtering - if docs have structured attributes
- Confidence assessment - detect low-quality retrievals
- Circuit breaker - cap re-retrieval at 2 attempts
- Source attribution - every answer cites the source chunks
- Chunk-level evaluation - periodically audit which chunks are retrieved most and whether they’re correct
Self-healing loops can spiral. An LLM that decides its answer is low-confidence will keep re-querying. Implement circuit breakers: max 2 re-retrieval attempts, then fall back to “I don’t have enough information.” Without a circuit breaker, a poorly phrased query can trigger an infinite retrieval loop, exhausting your token budget and hanging the request. Always bound your loops.
Interview Notes: RAG Failure Diagnosis
When RAG fails, classify the failure before changing the architecture: query rewrite failure, retrieval miss, ranking failure, context packing failure, generation failure, or citation failure. Advanced patterns such as HyDE, ColBERT, RAPTOR, and GraphRAG are useful only when they match the observed failure mode.
Interview Practice
- How do you diagnose a RAG failure before changing architecture?
- Compare hybrid search, reranking, HyDE, RAPTOR, ColBERT, and GraphRAG.
- What is self-healing RAG?
- How do you evaluate retrieval separately from generation?
- What citation failures matter in production?
- How do you defend a vector store from poisoned or cross-tenant content?