20. Vector Databases for Gen-AI Agents¶
Vector databases are the memory backbone of LLM-powered applications. This chapter covers RAG architecture, advanced retrieval patterns, agentic memory, evaluation, and context window management.
20.1 RAG Architecture¶
Basic RAG Pipeline¶
```mermaid
flowchart LR
    Q[User Query] --> E[Embed Query]
    E --> VDB[(Vector DB)]
    VDB --> C[Retrieved Chunks]
    C --> P["Prompt: Context + Query"]
    P --> LLM[LLM]
    LLM --> A[Answer]
```
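A minimal end-to-end sketch of this pipeline follows. The bag-of-words "embedding" and the in-memory "vector DB" are toy stand-ins so the example runs anywhere without external services; `llm` is a hypothetical client you would replace with a real model call.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" so the sketch runs without a model;
    # in practice this is a call to an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# The "vector DB": pre-embedded chunks held in memory.
CHUNKS = [
    "HNSW builds a layered proximity graph for fast ANN search.",
    "IVF partitions vectors into clusters and probes a few per query.",
    "Product quantization compresses vectors into compact codes.",
]
INDEX = [(embed(c), c) for c in CHUNKS]

def vector_search(query_vec: Counter, top_k: int = 2) -> list[str]:
    ranked = sorted(INDEX, key=lambda p: cosine(query_vec, p[0]), reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

def llm(prompt: str) -> str:
    return "<model answer>"  # hypothetical LLM client -- replace with a real call

def answer_query(query: str) -> str:
    chunks = vector_search(embed(query))                   # embed + retrieve
    prompt = ("Answer using only the context below.\n\n"   # context + query
              "Context:\n" + "\n".join(f"- {c}" for c in chunks)
              + f"\n\nQuestion: {query}")
    return llm(prompt)                                     # generate answer

print(answer_query("How does HNSW search work?"))
```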
Naive vs. Production RAG¶
| Aspect | Naive | Production |
|---|---|---|
| Retrieval | Single vector search | Hybrid (BM25 + vector) + re-rank |
| Chunking | Fixed 512 tokens | Semantic with overlap |
| Context | Dump all chunks | Select + compress relevant chunks |
| Evaluation | Manual inspection | Automated faithfulness + relevance |
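In code, the production retrieval column translates roughly to the sketch below, which fuses a BM25 ranking and a vector ranking with reciprocal rank fusion (RRF). Here `bm25_search`, `vector_search`, and `rerank` are hypothetical stand-ins for your lexical index, vector DB client, and cross-encoder re-ranker.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each list contributes 1/(k + rank) per doc;
    # k = 60 is the conventional constant from the original RRF paper.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, top_k: int = 5) -> list[str]:
    lexical = bm25_search(query, limit=50)      # keyword recall (BM25)
    semantic = vector_search(query, limit=50)   # embedding recall
    fused = rrf_fuse([lexical, semantic])
    return rerank(query, fused[:20])[:top_k]    # cross-encoder re-rank

# The fusion step on its own is runnable:
print(rrf_fuse([["doc3", "doc1", "doc7"], ["doc1", "doc5", "doc3"]]))
# -> ['doc1', 'doc3', 'doc5', 'doc7']
```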
20.2 Advanced Retrieval Patterns¶
HyDE (Hypothetical Document Embeddings)¶
Instead of embedding the query, embed a hypothetical answer:
1. LLM generates a hypothetical answer to the query
2. Embed the hypothetical answer
3. Search the vector DB with that embedding
$$
\text{HyDE}: q \xrightarrow{\text{LLM}} \hat{d} \xrightarrow{\text{embed}} \mathbf{v}_{\hat{d}} \xrightarrow{\text{search}} \text{results}
$$
HyDE often beats raw query embedding because a hypothetical answer lives in the same embedding neighborhood as the stored documents, while a short question usually does not.
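As a sketch, reusing the hypothetical `llm`, `embed`, and `vector_search` stand-ins from the pipeline example in 20.1:

```python
def hyde_search(query: str, top_k: int = 5) -> list[str]:
    # 1. LLM drafts a hypothetical answer. It may be factually wrong --
    #    only its vocabulary and shape need to resemble stored documents.
    hypothetical = llm(f"Write a short passage answering: {query}")
    # 2. Embed the hypothetical document instead of the query.
    v = embed(hypothetical)
    # 3. Search the vector DB with that embedding.
    return vector_search(v, top_k=top_k)
```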
Multi-Step Retrieval¶
```mermaid
flowchart TD
    Q[Query] --> R1[Retrieve initial docs]
    R1 --> LLM1["LLM: refine query"]
    LLM1 --> R2[Retrieve with refined query]
    R2 --> LLM2["LLM: generate answer"]
```
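A sketch of that loop, again with the hypothetical `llm`, `embed`, and `vector_search` stand-ins:

```python
def multi_step_answer(query: str) -> str:
    # Pass 1: retrieve with the raw query.
    first_pass = vector_search(embed(query), top_k=5)
    # LLM refines the query in light of what the first pass returned.
    refined = llm(
        "Rewrite this question so it better matches the documents below.\n"
        f"Question: {query}\n"
        "Documents:\n" + "\n".join(f"- {d}" for d in first_pass)
    )
    # Pass 2: retrieve with the refined query, then answer from that context.
    second_pass = vector_search(embed(refined), top_k=5)
    return llm("Context:\n" + "\n".join(second_pass)
               + f"\n\nQuestion: {query}")
```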
Parent-Child Retrieval¶
- Index: Embed small chunks (256 tokens) for precise matching
- Return: The parent chunk (1024+ tokens) for richer context
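A minimal index-and-retrieve sketch of this pattern, chunking by words rather than tokens for simplicity and reusing the toy `embed`/`cosine` helpers from 20.1:

```python
def build_parent_child_index(parents: list[str], child_words: int = 50):
    # Each index entry: (embedding of a small child chunk, id of its parent).
    index = []
    for pid, parent in enumerate(parents):
        words = parent.split()
        for i in range(0, len(words), child_words):
            child = " ".join(words[i : i + child_words])
            index.append((embed(child), pid))
    return index

def parent_child_search(index, parents: list[str], query: str, top_k: int = 3):
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q, e[0]), reverse=True)
    results, seen = [], set()
    for _, pid in ranked:
        if pid not in seen:                # dedupe children of the same parent
            seen.add(pid)
            results.append(parents[pid])   # return the parent, not the child
        if len(results) == top_k:
            break
    return results
```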
20.3 Agentic Memory¶
Short-Term vs. Long-Term Memory¶
| Memory Type | Stored in | Lifetime | Example |
|---|---|---|---|
| Working memory | LLM context window | Single turn | Current conversation |
| Short-term memory | Vector DB (session-scoped) | Conversation | Chat history embeddings |
| Long-term memory | Vector DB (persistent) | Permanent | User preferences, facts |
| Episodic memory | Vector DB + metadata | Permanent | Past interactions |
Memory Architecture for Agents¶
```mermaid
flowchart TB
    Agent --> WM["Working Memory<br/>Context Window"]
    Agent --> STM["Short-Term<br/>Session Vector DB"]
    Agent --> LTM["Long-Term<br/>Persistent Vector DB"]
    Agent --> Tools[External Tools]
    STM --> LTM
```
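A sketch of how an agent might wire these tiers together. The `session_store` and `persistent_store` objects are hypothetical vector-DB collections with `add`/`search`/`dump` methods; real clients (Qdrant, Weaviate, pgvector, ...) differ in the details, and `embed` is the stand-in from 20.1.

```python
class AgentMemory:
    def __init__(self, session_store, persistent_store):
        self.stm = session_store      # short-term: dropped when session ends
        self.ltm = persistent_store   # long-term: survives across sessions

    def remember(self, text: str, *, long_term: bool = False) -> None:
        store = self.ltm if long_term else self.stm
        store.add(vector=embed(text), payload=text)

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Query both tiers; session context is usually ranked first.
        q = embed(query)
        return self.stm.search(q, k) + self.ltm.search(q, k)

    def consolidate(self) -> None:
        # The STM -> LTM edge in the diagram: at session end, promote
        # memories worth keeping (naively, all of them) to persistent storage.
        for text in self.stm.dump():
            self.ltm.add(vector=embed(text), payload=text)
```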
20.4 Evaluation Frameworks¶
RAG Evaluation Metrics¶
| Metric | Measures | Formula |
|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? | $\frac{\text{claims supported by context}}{\text{total claims}}$ |
| Answer Relevancy | Does the answer address the query? | $\cos\big(\text{embed}(q), \text{embed}(a)\big)$ |
| Context Precision | Are retrieved chunks relevant? | $\text{precision@}k$ with LLM-judged relevance |
| Context Recall | Are all needed chunks retrieved? | $\frac{\text{relevant retrieved}}{\text{total relevant}}$ |
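As a concrete illustration, here is how the faithfulness formula from the table could be computed with an LLM judge. The sentence-level claim split is naive, and `llm` is again a hypothetical judge-model client:

```python
def faithfulness(answer: str, context: str) -> float:
    # Decompose the answer into claims (naively, one per sentence).
    claims = [c.strip() for c in answer.split(".") if c.strip()]
    if not claims:
        return 0.0
    supported = sum(
        llm(f"Context:\n{context}\n\nClaim: {claim}\n\n"
            "Is the claim supported by the context? Answer yes or no.")
        .strip().lower().startswith("yes")
        for claim in claims
    )
    # Faithfulness = supported claims / total claims.
    return supported / len(claims)
```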
LLM-as-Judge¶
Use a strong LLM to evaluate retrieval and generation quality:
```text
Given the question: {question}
And the retrieved context: {context}
And the generated answer: {answer}

Rate faithfulness (1-5): Is every claim in the answer
supported by the context?
```
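Wiring this prompt into an evaluation loop is straightforward. The sketch below assumes the hypothetical `llm` judge client from earlier and does only crude score parsing:

```python
JUDGE_PROMPT = (
    "Given the question: {question}\n"
    "And the retrieved context: {context}\n"
    "And the generated answer: {answer}\n\n"
    "Rate faithfulness (1-5): Is every claim in the answer\n"
    "supported by the context? Reply with the number only."
)

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    reply = llm(JUDGE_PROMPT.format(
        question=question, context=context, answer=answer))
    score = int(reply.strip()[0])  # crude parse; validate properly in production
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned an out-of-range score: {reply!r}")
    return score
```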
Frameworks¶
| Framework | Type | Key Feature |
|---|---|---|
| RAGAS | Python library | Automatic metrics (faithfulness, relevancy) |
| LangSmith | Platform | Tracing + evaluation |
| DeepEval | Python library | Unit tests for LLM outputs |
| TruLens | Python library | Feedback functions |
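For example, RAGAS evaluates a dataset of (question, answer, contexts) rows in a few lines. The snippet below reflects the ragas 0.1-era API and needs an LLM backend configured (an OpenAI key by default); the library's API has shifted between versions, so check the current docs:

```python
from datasets import Dataset          # Hugging Face datasets
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["What is HNSW?"],
    "answer":   ["HNSW searches a layered proximity graph."],
    "contexts": [["HNSW builds a hierarchy of navigable small-world graphs."]],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)   # e.g. {'faithfulness': ..., 'answer_relevancy': ...}
```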
20.5 Context Window Management¶
With growing context windows (128K–1M tokens), do we still need RAG?
$$
\text{Cost}_{\text{full context}} = O(n^2)
\quad \text{vs.} \quad
\text{Cost}_{\text{RAG}} = \underbrace{O(k)}_{\text{read } k \text{ chunk tokens}} + \underbrace{O(k^2)}_{\text{attention over } k \text{ tokens}},
\qquad k \ll n
$$

Here $n$ is the token count of the full corpus stuffed into the prompt and $k \ll n$ is the token count of the retrieved chunks; since self-attention is quadratic in prompt length, shrinking the prompt from $n$ to $k$ tokens dominates every other cost in the pipeline.
| Aspect | Long Context | RAG |
|---|---|---|
| Latency | High (process all tokens) | Low (retrieve k chunks) |
| Cost | High (pay for every token) | Low (vector search is cheap) |
| Accuracy | "Lost in the middle" problem | Focused on relevant chunks |
| Scale | Limited by context window | Limited only by index size |
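A back-of-the-envelope comparison under assumed numbers: a 200K-token corpus stuffed into the prompt versus four retrieved 512-token chunks, at a hypothetical $3 per million input tokens.

```python
PRICE = 3 / 1_000_000           # assumed $/input token (hypothetical)
full_context = 200_000          # whole corpus in the prompt
rag_context = 4 * 512           # k = 4 chunks of 512 tokens

print(f"full context: ${full_context * PRICE:.4f} per query")   # $0.6000
print(f"RAG:          ${rag_context * PRICE:.4f} per query")    # $0.0061
print(f"token ratio:  {full_context / rag_context:.0f}x")       # 98x
```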
References¶
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
- Gao, L., et al. (2023). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). ACL.
- Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv.