
20. Vector Databases for Gen-AI Agents

Vector databases are the memory backbone of LLM-powered applications. This chapter covers RAG architecture, advanced retrieval patterns, agentic memory, evaluation frameworks, and context window management.


20.1 RAG Architecture

Basic RAG Pipeline

flowchart LR
    Q[User Query] --> E[Embed Query]
    E --> VDB[(Vector DB)]
    VDB --> C[Retrieved Chunks]
    C --> P[Prompt: Context + Query]
    P --> LLM[LLM]
    LLM --> A[Answer]
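
The pipeline above maps directly to a few lines of code. The sketch below is illustrative rather than tied to any specific SDK: `embed()` and `generate()` are hypothetical helpers standing in for your embedding model and LLM, and `vector_db` is assumed to be a generic client with a `search()` method.

```python
from typing import List

def embed(text: str) -> List[float]:
    """Hypothetical embedding call; plug in any sentence-embedding model."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Hypothetical LLM completion call."""
    raise NotImplementedError

def answer_query(query: str, vector_db, top_k: int = 5) -> str:
    # 1. Embed the user query.
    query_vector = embed(query)
    # 2. Retrieve the nearest chunks from the vector DB.
    chunks = vector_db.search(query_vector, top_k=top_k)
    # 3. Build a prompt from the retrieved context plus the original query.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # 4. Let the LLM generate the grounded answer.
    return generate(prompt)
```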

Naive vs. Production RAG

| Aspect | Naive | Production |
|---|---|---|
| Retrieval | Single vector search | Hybrid (BM25 + vector) + re-rank |
| Chunking | Fixed 512 tokens | Semantic chunking with overlap |
| Context | Dump all chunks | Select + compress relevant chunks |
| Evaluation | Manual inspection | Automated faithfulness + relevance |
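
Production retrieval typically fuses lexical and vector scores before re-ranking. The sketch below shows one common fusion strategy, reciprocal rank fusion (RRF); `bm25_search`, `vector_search`, and `rerank` are hypothetical stand-ins for whatever lexical index, vector DB, and cross-encoder re-ranker you use.

```python
from collections import defaultdict

def hybrid_search(query: str, top_k: int = 10, rrf_k: int = 60):
    """Fuse BM25 and vector hits with reciprocal rank fusion, then re-rank."""
    bm25_hits = bm25_search(query, top_k=50)      # hypothetical lexical (BM25) search
    vector_hits = vector_search(query, top_k=50)  # hypothetical ANN vector search

    # Reciprocal rank fusion: a document scores highly if it ranks well in either list.
    scores = defaultdict(float)
    for hits in (bm25_hits, vector_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] += 1.0 / (rrf_k + rank + 1)

    fused = sorted(scores, key=scores.get, reverse=True)[: top_k * 3]
    # Final pass through a cross-encoder re-ranker (hypothetical helper).
    return rerank(query, fused)[:top_k]
```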

20.2 Advanced Retrieval Patterns

HyDE (Hypothetical Document Embeddings)

Instead of embedding the query, embed a hypothetical answer:

  1. LLM generates a hypothetical answer to the query
  2. Embed the hypothetical answer
  3. Search the vector DB with that embedding
$$ \text{HyDE}: q \xrightarrow{\text{LLM}} \hat{d} \xrightarrow{\text{embed}} \mathbf{v}_{\hat{d}} \xrightarrow{\text{search}} \text{results} $$

HyDE often retrieves better than embedding the raw query because a hypothetical answer is written in the same register as the stored documents, so it lands closer to them in embedding space than the question does.
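
In code, HyDE only changes what gets embedded. A minimal sketch, reusing the hypothetical `embed()` and `generate()` helpers from the basic pipeline above:

```python
def hyde_search(query: str, vector_db, top_k: int = 5):
    # 1. Have the LLM write a hypothetical answer; it may contain hallucinated
    #    details, but only its embedding is used, never its content.
    hypothetical_doc = generate(
        f"Write a short passage that answers the question: {query}"
    )
    # 2. Embed the hypothetical answer instead of the query.
    doc_vector = embed(hypothetical_doc)
    # 3. Search the vector DB with that embedding.
    return vector_db.search(doc_vector, top_k=top_k)
```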

Multi-Step Retrieval

flowchart TD
    Q[Query] --> R1[Retrieve initial docs]
    R1 --> LLM1[LLM: refine query]
    LLM1 --> R2[Retrieve with refined query]
    R2 --> LLM2[LLM: generate answer]
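
The loop above is retrieval interleaved with an LLM-driven query rewrite. A sketch with the same hypothetical helpers; the prompt wording and the single refinement step are illustrative choices, not a fixed recipe:

```python
def multi_step_answer(query: str, vector_db, top_k: int = 5) -> str:
    # Step 1: initial retrieval with the raw query.
    initial_chunks = vector_db.search(embed(query), top_k=top_k)
    initial_context = "\n\n".join(c.text for c in initial_chunks)

    # Step 2: ask the LLM to refine the query in light of what was found.
    refined_query = generate(
        f"Context:\n{initial_context}\n\n"
        f"Rewrite the following question so it can be answered more precisely "
        f"from documents like the ones above: {query}"
    )

    # Step 3: retrieve again with the refined query and generate the answer.
    chunks = vector_db.search(embed(refined_query), top_k=top_k)
    context = "\n\n".join(c.text for c in chunks)
    return generate(f"Context:\n{context}\n\nQuestion: {query}")
```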

Parent-Child Retrieval

  • Index: Embed small chunks (256 tokens) for precise matching
  • Return: The parent chunk (1024+ tokens) for richer context
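
The trick is to store a parent pointer in each small chunk's metadata at indexing time and dereference it at query time. A minimal sketch, where `split`, the metadata layout, and the `parents` mapping are assumptions rather than a particular library's schema:

```python
def index_document(doc_text: str, vector_db, parents: dict):
    # Split into large parent chunks, then into small child chunks for embedding.
    for parent_id, parent_chunk in enumerate(split(doc_text, size=1024)):
        parents[parent_id] = parent_chunk
        for child_chunk in split(parent_chunk, size=256):
            vector_db.insert(vector=embed(child_chunk),
                             metadata={"parent_id": parent_id})

def retrieve_parents(query: str, vector_db, parents: dict, top_k: int = 5):
    # Match against the precise child embeddings ...
    hits = vector_db.search(embed(query), top_k=top_k)
    # ... but hand the larger parent chunks to the LLM for richer context.
    parent_ids = {hit.metadata["parent_id"] for hit in hits}
    return [parents[pid] for pid in parent_ids]
```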

20.3 Agentic Memory

Short-Term vs. Long-Term Memory

| Memory Type | Stored in | Lifetime | Example |
|---|---|---|---|
| Working memory | LLM context window | Single turn | Current conversation |
| Short-term memory | Vector DB (session-scoped) | Conversation | Chat history embeddings |
| Long-term memory | Vector DB (persistent) | Permanent | User preferences, facts |
| Episodic memory | Vector DB + metadata | Permanent | Past interactions |

Memory Architecture for Agents

flowchart TB
    Agent --> WM["Working Memory<br/>Context Window"]
    Agent --> STM["Short-Term<br/>Session Vector DB"]
    Agent --> LTM["Long-Term<br/>Persistent Vector DB"]
    Agent --> Tools[External Tools]
    STM --> LTM
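
One way to wire this up: the agent writes every turn into the session store, promotes distilled facts to the persistent store, and retrieves from both on each turn. A sketch assuming two generic vector-store clients with `insert`/`search` methods and the hypothetical `embed()` helper from earlier:

```python
class AgentMemory:
    def __init__(self, session_db, persistent_db):
        self.session_db = session_db        # short-term: dropped when the session ends
        self.persistent_db = persistent_db  # long-term: survives across sessions

    def remember_turn(self, text: str):
        # Every conversation turn is embedded into short-term memory.
        self.session_db.insert(vector=embed(text), metadata={"text": text})

    def promote(self, fact: str):
        # Distilled facts (preferences, stable user info) move to long-term memory.
        self.persistent_db.insert(vector=embed(fact), metadata={"text": fact})

    def recall(self, query: str, top_k: int = 3):
        # Pull from both stores and merge for the prompt's memory section.
        recent = self.session_db.search(embed(query), top_k=top_k)
        stable = self.persistent_db.search(embed(query), top_k=top_k)
        return [hit.metadata["text"] for hit in recent + stable]
```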

20.4 Evaluation Frameworks

RAG Evaluation Metrics

| Metric | Measures | Formula |
|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? | $\frac{\text{claims supported by context}}{\text{total claims}}$ |
| Answer Relevancy | Does the answer address the query? | $\cos(\text{embed}(q), \text{embed}(a))$ |
| Context Precision | Are retrieved chunks relevant? | $\text{precision@}k$ with LLM-judged relevance |
| Context Recall | Are all needed chunks retrieved? | $\frac{\text{relevant retrieved}}{\text{total relevant}}$ |
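
Two of these metrics reduce to simple ratios once claims have been labeled (by an LLM judge or a human), and answer relevancy is a cosine between embeddings. A sketch of faithfulness and answer relevancy, assuming the hypothetical `embed()` helper and claim labeling done elsewhere:

```python
import math

def faithfulness(supported_claims: int, total_claims: int) -> float:
    # Fraction of answer claims that are backed by the retrieved context.
    return supported_claims / total_claims if total_claims else 0.0

def answer_relevancy(question: str, answer: str) -> float:
    # Cosine similarity between the question and answer embeddings.
    q, a = embed(question), embed(answer)
    dot = sum(x * y for x, y in zip(q, a))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in a))
    return dot / norm if norm else 0.0
```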

LLM-as-Judge

Use a strong LLM to evaluate retrieval and generation quality:

Given the question: {question}
And the retrieved context: {context}
And the generated answer: {answer}

Rate faithfulness (1-5): Is every claim in the answer
supported by the context?
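
Programmatically, the judge is just another LLM call with a structured prompt and a parsed numeric score. A sketch using the hypothetical `generate()` helper from earlier; the prompt mirrors the template above and the parsing is deliberately naive:

```python
def judge_faithfulness(question: str, context: str, answer: str) -> int:
    prompt = (
        f"Given the question: {question}\n"
        f"And the retrieved context: {context}\n"
        f"And the generated answer: {answer}\n\n"
        "Rate faithfulness (1-5): Is every claim in the answer "
        "supported by the context? Reply with a single digit."
    )
    reply = generate(prompt)  # hypothetical LLM call
    # Deliberately naive parsing: first digit between 1 and 5 in the reply.
    for ch in reply:
        if ch in "12345":
            return int(ch)
    raise ValueError(f"could not parse a rating from: {reply!r}")
```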

Frameworks

| Framework | Type | Key Feature |
|---|---|---|
| RAGAS | Python library | Automatic metrics (faithfulness, relevancy) |
| LangSmith | Platform | Tracing + evaluation |
| DeepEval | Python library | Unit tests for LLM outputs |
| TruLens | Python library | Feedback functions |
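
For reference, evaluating with RAGAS looks roughly like the sketch below. It is based on the early (0.1-era) RAGAS API; metric and argument names may differ in newer releases, so treat it as an approximation rather than a copy-paste recipe.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One row per (question, retrieved contexts, generated answer) triple.
eval_data = Dataset.from_dict({
    "question": ["What is HyDE?"],
    "contexts": [["HyDE embeds a hypothetical answer instead of the query ..."]],
    "answer":   ["HyDE generates a hypothetical answer and searches with its embedding."],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)  # mapping of metric name to score
```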

20.5 Context Window Management

With growing context windows (128K–1M tokens), do we still need RAG?

$$ \text{Cost}_{\text{full context}} = O(n^2) \quad \text{vs.} \quad \text{Cost}_{\text{RAG}} = O(k) + O(k^2), \qquad k \ll n $$

where $n$ is the number of tokens in the full-context prompt and $k$ is the number of tokens in the retrieved chunks.

| Aspect | Long Context | RAG |
|---|---|---|
| Latency | High (process all tokens) | Low (retrieve k chunks) |
| Cost | $$$$ (pay per token) | $ (vector search is cheap) |
| Accuracy | "Lost in the middle" problem | Focused on relevant chunks |
| Scale | Limited by context window | Unlimited |
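
The cost gap is easiest to feel with concrete numbers. A back-of-the-envelope comparison; the per-token price and the corpus size below are placeholders, not current pricing:

```python
PRICE_PER_1K_INPUT_TOKENS = 0.003   # placeholder price, not a real quote
CORPUS_TOKENS = 500_000             # tokens if the whole corpus were stuffed into the prompt
RETRIEVED_TOKENS = 4_000            # tokens for k retrieved chunks plus the question

full_context_cost = CORPUS_TOKENS / 1000 * PRICE_PER_1K_INPUT_TOKENS
rag_cost = RETRIEVED_TOKENS / 1000 * PRICE_PER_1K_INPUT_TOKENS

print(f"Full context: ${full_context_cost:.2f} per query")  # 1.50
print(f"RAG:          ${rag_cost:.3f} per query")            # 0.012
```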

References

  1. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
  2. Gao, L., et al. (2023). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). ACL.
  3. Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv.