2. Approximate Nearest Neighbor (ANN) Algorithms¶
An exact, brute-force search requires calculating the distance between your query and every single vector in the database — an $O(n \cdot d)$ operation per query. At a billion vectors and 768 dimensions, that means scanning roughly 3 TB of raw float data for every query.
To make real-time applications possible, vector databases abandon the idea of always finding the absolute nearest neighbor. Instead, they use algorithms that will almost always return an exceptionally close neighbor, just much faster. This chapter covers the four major families of Approximate Nearest Neighbor (ANN) algorithms that power the industry.
2.1 Exact vs. Approximate Search¶
Why Approximate?¶
| | Exact k-NN | Approximate k-NN |
|---|---|---|
| Complexity | $O(n \cdot d)$ | $O(\text{polylog}(n) \cdot d)$ |
| Search Time (1B vectors) | Minutes | Milliseconds |
| Accuracy (Recall) | 1.0 (mathematically perfect) | 0.90–0.99 (tunable) |
ELI5: Exact search is like finding a phone number by reading the phonebook from page 1 to page 500, checking every single name. Approximate search is like using an index to flip directly to the "S" section, and grabbing the first number that mostly matches your intent. It takes milliseconds, and 95% of the time, you get the person you actually wanted.
The key metric measuring this accuracy is Recall@k: Out of the true mathematical top $k$ results, what percentage did our fast algorithm successfully find?
For applications like Retrieval Augmented Generation (RAG) or recommendations, a recall of 95% is functionally indistinguishable to end-users from 100%.
2.2 Tree-Based Methods¶
Tree-based methods slice the spatial universe into smaller and smaller boxes, helping queries narrow down where points live.
KD-Trees¶
A KD-tree splits space by drawing a line strictly along one dimension (e.g., the X-axis), then looking inside those boxes and splitting them along another dimension (e.g., the Y-axis), and so on.
flowchart TD
subgraph KDSearch ["KD-Tree Construction"]
Root{"Split Dimension 1<br/>(x₁ < 5.0)"}
L1{"Split Dimension 2<br/>(x₂ < 3.0)"}
R1{"Split Dimension 2<br/>(x₂ < 7.0)"}
Leaf1["Leaf Node A<br/>{p₁, p₂}"]
Leaf2["Leaf Node B<br/>{p₃}"]
Leaf3["Leaf Node C<br/>{p₄, p₅}"]
Leaf4["Leaf Node D<br/>{p₆}"]
Root -- Yes --> L1
Root -- No --> R1
L1 -- Yes --> Leaf1
L1 -- No --> Leaf2
R1 -- Yes --> Leaf3
R1 -- No --> Leaf4
end
Why KD-trees fail for AI
KD-trees are brilliant for 2D maps or 3D video game rendering. However, in 768 dimensions the "boxes" become impossible to prune effectively. Because of the curse of dimensionality, a query point typically sits close to a huge number of box boundaries, so the algorithm is forced to backtrack and check almost every box, effectively devolving into a slow brute-force scan.
Ball Trees and VP-Trees¶
Instead of drawing straight lines to form boxes, Ball Trees draw nested hyper-spheres ("balls") around clusters of points.
Vantage-Point Trees (VP-Trees) pick a "vantage point", measure every point's distance to it, and split the data into "points closer than the median distance" and "points farther than the median distance". These spherical approaches survive somewhat better in higher dimensions than KD-trees, but still typically degrade past a few dozen dimensions.
2.3 Hashing-Based Methods (LSH)¶
Core Idea¶
Traditional database hashing guarantees that hash("apple") always equals exactly hash("apple"), but hash("app") yields a totally different random string. This is terrible for similarity searches.
Locality-Sensitive Hashing (LSH) flips this logic. LSH uses hash functions engineered so that similar inputs are likely to produce identical outputs: the closer two vectors are, the higher the probability they land in the same hash bucket.
ELI5: Imagine slicing a room full of people in half randomly with a laser pointer. You assign everyone on the left a 0 and everyone on the right a 1. You do this 5 times from different angles. If Bob and Alice are standing right next to each other, there is a very high probability they will end up with the exact same 5-digit code (e.g., 10110). The query just calculates its own code, goes to the 10110 bucket, and only compares distances against the people inside that bucket.
Random Hyperplane LSH (Cosine Similarity)¶
Each slice (hash function) tests which side of a random hyperplane a vector sits on: $h(x) = \operatorname{sign}(r \cdot x)$, where $r$ is a random Gaussian vector. For two vectors $x$ and $y$ separated by angle $\theta(x, y)$, the collision probability is $\Pr[h(x) = h(y)] = 1 - \theta(x, y)/\pi$.
Multi-Table Routing¶
To prevent false negatives (Alice and Bob accidentally getting split by a laser), we build multiple tables. If a point shares a bucket in at least one out of $L$ tables, we consider it a candidate.
flowchart LR
Q[Query Vector] --> HashFunc["Calculate LSH Hashes"]
HashFunc --> Table1["Hash Table 1<br/>Bucket '100'"]
HashFunc --> Table2["Hash Table 2<br/>Bucket '011'"]
HashFunc --> Table3["Hash Table 3<br/>Bucket '101'"]
Table1 --> Candidates[Candidate Vectors Pool]
Table2 --> Candidates
Table3 --> Candidates
Candidates --> ExactDist["Compute Exact Distance<br/>on small subset"]
ExactDist --> Results((Top-K Results))
style Candidates fill:#fff3e0,stroke:#f57c00
style Results fill:#e8f5e9,stroke:#4caf50
C++ Implementation: Random Hyperplane LSH (click to expand)
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <unordered_set>
#include <vector>

/**
* Random Hyperplane LSH for cosine similarity.
*
* Each hash function: h(x) = sign(r · x)
* where r is a random Gaussian vector.
*
* Collision probability:
* Pr[h(x) = h(y)] = 1 - θ(x,y) / π
*
* Parameters:
* dim — dimensionality of input vectors
* num_tables — number of hash tables L (more tables → higher recall)
* num_hashes — hash bits per table k (more bits → higher precision)
*/
// Bit-vector hash signature and its hasher (used as the unordered_map key).
struct HashSignature {
  std::vector<uint8_t> bits;
  bool operator==(const HashSignature &o) const { return bits == o.bits; }
};

struct HashSignatureHash {
  size_t operator()(const HashSignature &s) const {
    size_t h = 0;
    for (uint8_t b : s.bits) h = h * 31 + b;
    return h;
  }
};

class RandomHyperplaneLSH {
public:
RandomHyperplaneLSH(size_t dim, size_t num_tables = 10, size_t num_hashes = 8)
: dim_(dim), num_tables_(num_tables), num_hashes_(num_hashes) {
std::mt19937 rng(42);
std::normal_distribution<float> normal(0.0f, 1.0f);
// Generate random hyperplanes for each table
hyperplanes_.resize(num_tables);
for (size_t t = 0; t < num_tables; ++t) {
hyperplanes_[t].resize(num_hashes * dim);
for (auto &val : hyperplanes_[t]) {
val = normal(rng);
}
}
tables_.resize(num_tables);
}
/**
* Build the index from a set of vectors.
*/
void build(const std::vector<std::vector<float>> &vectors) {
vectors_ = vectors;
for (auto &table : tables_)
table.clear();
for (size_t i = 0; i < vectors.size(); ++i) {
for (size_t t = 0; t < num_tables_; ++t) {
auto sig = hash(vectors[i], t);
tables_[t][sig].push_back(i);
}
}
}
/**
* Query for k approximate nearest neighbors.
*
* 1. Hash query into each table
* 2. Collect all candidate IDs from matching buckets
* 3. Re-rank candidates by exact cosine similarity
*/
std::vector<size_t> query(const std::vector<float> &q, size_t k) const {
std::unordered_set<size_t> candidates;
for (size_t t = 0; t < num_tables_; ++t) {
auto sig = hash(q, t);
auto it = tables_[t].find(sig);
if (it != tables_[t].end()) {
for (size_t idx : it->second) {
candidates.insert(idx);
}
}
}
// Re-rank by cosine similarity
std::vector<std::pair<float, size_t>> scored;
scored.reserve(candidates.size());
for (size_t idx : candidates) {
float sim = cosine_sim(q, vectors_[idx]);
scored.push_back({-sim, idx}); // negate for ascending sort
}
std::sort(scored.begin(), scored.end());
std::vector<size_t> result;
for (size_t i = 0; i < std::min(k, scored.size()); ++i) {
result.push_back(scored[i].second);
}
return result;
}
size_t num_vectors() const { return vectors_.size(); }
private:
HashSignature hash(const std::vector<float> &vec, size_t table_idx) const {
HashSignature sig;
sig.bits.resize(num_hashes_);
for (size_t h = 0; h < num_hashes_; ++h) {
float dot = 0.0f;
const float *plane = &hyperplanes_[table_idx][h * dim_];
for (size_t d = 0; d < dim_; ++d) {
dot += plane[d] * vec[d];
}
sig.bits[h] = (dot > 0.0f) ? 1 : 0;
}
return sig;
}
static float cosine_sim(const std::vector<float> &a,
const std::vector<float> &b) {
float dot = 0, na = 0, nb = 0;
for (size_t i = 0; i < a.size(); ++i) {
dot += a[i] * b[i];
na += a[i] * a[i];
nb += b[i] * b[i];
}
float denom = std::sqrt(na) * std::sqrt(nb);
return (denom > 1e-10f) ? dot / denom : 0.0f;
}
size_t dim_, num_tables_, num_hashes_;
std::vector<std::vector<float>> hyperplanes_; // [table][hash*dim + d]
std::vector<
std::unordered_map<HashSignature, std::vector<size_t>, HashSignatureHash>>
tables_;
std::vector<std::vector<float>> vectors_;
};
2.4 Graph-Based Methods¶
Hierarchical Navigable Small World (HNSW)¶
HNSW is currently the undisputed king of vector search. It powers almost every major production vector database today (Pinecone, Qdrant, Weaviate, Milvus).
It is based on the idea of a "Small World Network" — much like the "Six Degrees of Kevin Bacon". You don't need to know everyone in the world; you just need to know a few distant hubs, and a few local neighbours, and you can navigate to anyone in a few hops.
ELI5: Imagine navigating a highway system. To get from a small town in Texas to a small town in New York, you don't drive on local backroads the whole way. You: 1. Drive local roads to the nearest interstate highway on-ramp. 2. Hop on the massive, sparse interstate system spanning the country. 3. Exit near New York and take local dense streets to the exact address.
HNSW builds exactly this: a multi-layered graph. The top layers are the "interstate", holding only a few sparse points that span vast distances. The bottom layer contains every single vector in dense "local streets".
flowchart TD
subgraph Layer2 ["Layer 2 (The Highway - Super Sparse)"]
direction LR
L2A((Node A)) <--> L2E((Node E))
end
subgraph Layer1 ["Layer 1 (State Roads - Medium Density)"]
direction LR
L1A((Node A)) <--> L1C((Node C))
L1C <--> L1E((Node E))
end
subgraph Layer0 ["Layer 0 (Local Streets - Deeply Dense)"]
direction LR
L0A((Node A)) <--> L0B((Node B))
L0B <--> L0C((Node C))
L0C <--> L0D((Node D))
L0D <--> L0E((Node E))
L0A <--> L0C
end
Layer2 -.-> Layer1
Layer1 -.-> Layer0
style Layer2 fill:#e3f2fd,stroke:#1e88e5
style Layer1 fill:#bbdefb,stroke:#1976d2
style Layer0 fill:#90caf9,stroke:#1565c0
HNSW Algorithm: Querying¶
- Start at a predefined entry point on the topmost layer.
- Look at the neighbours. Jump strictly to the neighbour that is mathematically closer to your query.
- Keep jumping until every neighbour is further away than the point you are currently standing on (a local minimum).
- Drop down to the next layer and repeat, using your current spot as the new starting point.
- Keep dropping until you hit Layer 0. The points you arrive at are your nearest neighbours.
| Key Parameter | Default | Effect on Database |
|---|---|---|
| `M` (max connections) | 16–64 | How many "highways" connect each node. Higher means much higher recall accuracy, but significantly more RAM usage. |
| `ef_construction` | 200 | Search quality during graph building. Higher means a better quality graph, but takes vastly longer to import data. |
| `ef_search` | 50+ | The width of the "net" cast during a user query. Higher means more accurate search results, but higher latency. |
C++ Implementation: HNSW Index (click to expand)
#include <algorithm>
#include <cmath>
#include <limits>
#include <queue>
#include <random>
#include <unordered_set>
#include <vector>

/**
* HNSW Index for approximate nearest neighbor search.
*
* Builds a multi-layer navigable small-world graph.
* - Layer 0 is the densest (all nodes present)
* - Higher layers are exponentially sparser
* - Search descends from top layer greedily, then does
* beam search at layer 0.
*
* Template parameters are avoided for clarity; uses float vectors.
*
* Key parameters:
* M — max edges per node per layer (default 16)
* M_max0 — max edges at layer 0 (default 2*M)
* ef_construction — beam width during build
* ef_search — beam width during query
* mL — level generation factor = 1/ln(M)
*/
class HNSWIndex {
public:
struct SearchResult {
float distance;
size_t id;
bool operator>(const SearchResult &o) const {
return distance > o.distance;
}
bool operator<(const SearchResult &o) const {
return distance < o.distance;
}
};
HNSWIndex(size_t dim, size_t M = 16, size_t ef_construction = 200,
size_t ef_search = 50)
: dim_(dim), M_(M), M_max0_(2 * M), ef_construction_(ef_construction),
ef_search_(ef_search), mL_(1.0 / std::log(static_cast<double>(M))),
entry_point_(NONE), max_layer_(0), rng_(42), uniform_(0.0, 1.0) {}
/**
* Insert a single vector into the index.
*
* Algorithm (simplified):
* 1. Draw random level l ~ Geometric(mL)
* 2. From entry point, greedily descend to layer l+1
* 3. At each layer [l..0], beam-search for ef_construction
* neighbors and add bidirectional edges
*/
size_t insert(const std::vector<float> &vec) {
size_t id = vectors_.size();
vectors_.push_back(vec);
int level = random_level();
// Grow graph layers if needed
while (static_cast<int>(graph_.size()) <= level) {
graph_.emplace_back();
}
// First element
if (entry_point_ == NONE) {
entry_point_ = id;
max_layer_ = level;
for (int l = 0; l <= level; ++l) {
if (graph_[l].size() <= id)
graph_[l].resize(id + 1);
}
return id;
}
// Ensure node has adjacency lists at all its layers
for (int l = 0; l <= level; ++l) {
if (graph_[l].size() <= id)
graph_[l].resize(id + 1);
}
size_t current = entry_point_;
// Phase 1: Greedy descent from max_layer to level+1
for (int l = max_layer_; l > level; --l) {
auto nearest = search_layer(vec, current, 1, l);
if (!nearest.empty())
current = nearest[0].id;
}
// Phase 2: Insert at layers [min(level, max_layer)..0]
for (int l = std::min(level, max_layer_); l >= 0; --l) {
auto candidates = search_layer(vec, current, ef_construction_, l);
size_t M_max = (l == 0) ? M_max0_ : M_;
auto neighbors = select_neighbors(candidates, M_max);
// Add bidirectional edges
for (auto &nb : neighbors) {
graph_[l][id].push_back(nb.id);
graph_[l][nb.id].push_back(id);
// Prune neighbor if over capacity
if (graph_[l][nb.id].size() > M_max) {
prune(nb.id, l, M_max);
}
}
if (!candidates.empty())
current = candidates[0].id;
}
if (level > max_layer_) {
entry_point_ = id;
max_layer_ = level;
}
return id;
}
/**
* Search for k approximate nearest neighbors.
*
* 1. Greedy descent from top layer to layer 1
* 2. Beam search at layer 0 with ef = max(ef_search, k)
* 3. Return top-k results sorted by distance
*/
std::vector<SearchResult> search(const std::vector<float> &query,
size_t k) const {
if (entry_point_ == NONE)
return {};
size_t current = entry_point_;
// Descend from top
for (int l = max_layer_; l > 0; --l) {
auto nearest = search_layer(query, current, 1, l);
if (!nearest.empty())
current = nearest[0].id;
}
size_t ef = std::max(ef_search_, k);
auto results = search_layer(query, current, ef, 0);
// Take sqrt for actual Euclidean distances
if (results.size() > k)
results.resize(k);
for (auto &r : results) {
r.distance = std::sqrt(r.distance);
}
return results;
}
/**
* Bulk insert all vectors.
*/
void build(const std::vector<std::vector<float>> &vectors) {
for (const auto &v : vectors)
insert(v);
}
size_t size() const { return vectors_.size(); }
size_t num_layers() const { return graph_.size(); }
void set_ef_search(size_t ef) { ef_search_ = ef; }
private:
static constexpr size_t NONE = std::numeric_limits<size_t>::max();
float distance_sq(const std::vector<float> &a,
const std::vector<float> &b) const {
float sum = 0.0f;
for (size_t i = 0; i < dim_; ++i) {
float d = a[i] - b[i];
sum += d * d;
}
return sum;
}
int random_level() {
return static_cast<int>(-std::log(uniform_(rng_)) * mL_);
}
/**
* Beam search in a single layer.
* Returns up to ef nearest elements, sorted by distance (ascending).
*/
std::vector<SearchResult> search_layer(const std::vector<float> &query,
size_t entry, size_t ef,
int layer) const {
if (layer >= static_cast<int>(graph_.size()))
return {};
if (entry >= graph_[layer].size())
return {};
std::unordered_set<size_t> visited;
visited.insert(entry);
float d = distance_sq(query, vectors_[entry]);
// candidates: min-heap (closest first)
std::priority_queue<SearchResult, std::vector<SearchResult>,
std::greater<SearchResult>>
candidates;
// results: max-heap (farthest first, for bounding)
std::priority_queue<SearchResult> results;
candidates.push({d, entry});
results.push({d, entry});
while (!candidates.empty()) {
auto [c_dist, c_id] = candidates.top();
candidates.pop();
float farthest = results.top().distance;
if (c_dist > farthest)
break;
// Expand neighbors
if (c_id < graph_[layer].size()) {
for (size_t nb : graph_[layer][c_id]) {
if (visited.count(nb))
continue;
visited.insert(nb);
float nb_dist = distance_sq(query, vectors_[nb]);
farthest = results.top().distance;
if (nb_dist < farthest || results.size() < ef) {
candidates.push({nb_dist, nb});
results.push({nb_dist, nb});
if (results.size() > ef)
results.pop();
}
}
}
}
// Extract sorted results
std::vector<SearchResult> out;
out.reserve(results.size());
while (!results.empty()) {
out.push_back(results.top());
results.pop();
}
std::sort(out.begin(), out.end());
return out;
}
std::vector<SearchResult>
select_neighbors(const std::vector<SearchResult> &candidates,
size_t M) const {
auto sorted = candidates;
std::sort(sorted.begin(), sorted.end());
if (sorted.size() > M)
sorted.resize(M);
return sorted;
}
void prune(size_t node, int layer, size_t M_max) {
auto &adj = graph_[layer][node];
std::vector<SearchResult> scored;
for (size_t nb : adj) {
scored.push_back({distance_sq(vectors_[node], vectors_[nb]), nb});
}
std::sort(scored.begin(), scored.end());
adj.clear();
for (size_t i = 0; i < std::min(M_max, scored.size()); ++i) {
adj.push_back(scored[i].id);
}
}
size_t dim_;
size_t M_, M_max0_;
size_t ef_construction_, ef_search_;
double mL_;
size_t entry_point_;
int max_layer_;
std::mt19937 rng_;
std::uniform_real_distribution<double> uniform_;
std::vector<std::vector<float>> vectors_;
// graph_[layer][node_id] = list of neighbor IDs
std::vector<std::vector<std::vector<size_t>>> graph_;
};
Vamana / DiskANN¶
HNSW relies heavily on storing the entire graph structure in RAM to execute those quick multi-layer hops, which becomes prohibitively expensive at billion-scale datasets. DiskANN/Vamana is an algorithm designed by Microsoft specifically to store the graph mostly on NVMe SSD drives. It utilizes asynchronous I/O batch reads to fetch neighborhoods from disk, pulling latencies down to <5ms despite operating on cheap storage.
2.5 Quantization-Based Methods¶
High-dimensional data eats massive amounts of RAM. Quantization is the mathematical technique of brutally compressing vectors while trying to retain their semantic shape.
Product Quantization (PQ)¶
Product Quantization breaks massive vectors up into tiny sub-chunks, and replaces each chunk's raw floats with a single tiny 8-bit integer (a centroid ID).
ELI5: Imagine trying to store highly detailed RGB color codes for a photograph (e.g. [251.3, 12.1, 88.9]). Instead of storing those exact floats, you look at an 8-color crayon box. You decide that pixel looks closest to "red". So instead of storing 3 floats, you just store the ID number 2 (the red crayon). You've lost some fine detail, but you compressed the pixel massively.
PQ does this, but for 768-dimensional language vectors. It chops a 768-dim vector into 32 sub-chunks, finds the "closest crayon colour" for each chunk, and stores 32 bytes instead of 3072 bytes (a massive 96x compression).
Asymmetric Distance Computation (ADC)¶
When a query arrives, it is never quantized. We precompute the exact distance from the uncompressed query to every centroid in every codebook, then approximate the distance to each stored vector by summing the table entries selected by its compressed code.
| Step | Complexity |
|---|---|
| Build the lookup table | $O(M \cdot K \cdot d_s)$ — once per query |
| Scan $n$ codes in DB | $O(n \cdot M)$ — just table lookups and additions, no per-dimension float math |
C++ Implementation: Product Quantizer (click to expand)
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <numeric>
#include <random>
#include <vector>

/**
* Product Quantizer.
*
* Compresses d-dimensional float vectors into M bytes (one code per subspace).
*
* Compression ratio: d × 4 bytes → M bytes (e.g. 128-dim × 4B = 512B → 8B).
*
* Workflow:
* 1. train() — learn M codebooks of K=256 centroids each via k-means
* 2. encode() — compress vectors to PQ codes (M uint8 per vector)
 * 3. decode() — reconstruct approximate vectors from codes
*
* Asymmetric Distance Computation (ADC):
* dist(q, x̃)² ≈ Σ_m || q^(m) - c_{code_m}^(m) ||²
* Precompute a (M × K) distance table, then M lookups per candidate.
*/
class ProductQuantizer {
public:
ProductQuantizer(size_t dim, size_t M = 8, size_t K = 256)
: dim_(dim), M_(M), K_(K), ds_(dim / M) {
assert(dim % M == 0 && "dim must be divisible by M");
codebooks_.resize(
M, std::vector<std::vector<float>>(K, std::vector<float>(ds_)));
}
/**
* Train codebooks with k-means on each subspace.
*/
void train(const std::vector<std::vector<float>> &data, size_t n_iter = 25) {
std::mt19937 rng(42);
size_t n = data.size();
for (size_t m = 0; m < M_; ++m) {
// Extract sub-vectors for subspace m
std::vector<std::vector<float>> sub(n, std::vector<float>(ds_));
for (size_t i = 0; i < n; ++i) {
for (size_t d = 0; d < ds_; ++d) {
sub[i][d] = data[i][m * ds_ + d];
}
}
// Initialize centroids randomly
std::vector<size_t> indices(n);
std::iota(indices.begin(), indices.end(), 0);
std::shuffle(indices.begin(), indices.end(), rng);
for (size_t k = 0; k < K_; ++k) {
codebooks_[m][k] = sub[indices[k % n]];
}
// k-means iterations
std::vector<size_t> assignments(n);
for (size_t iter = 0; iter < n_iter; ++iter) {
// Assign
for (size_t i = 0; i < n; ++i) {
float best_dist = std::numeric_limits<float>::max();
for (size_t k = 0; k < K_; ++k) {
float d = l2_sq(sub[i], codebooks_[m][k]);
if (d < best_dist) {
best_dist = d;
assignments[i] = k;
}
}
}
// Update centroids
std::vector<std::vector<float>> sums(K_, std::vector<float>(ds_, 0));
std::vector<size_t> counts(K_, 0);
for (size_t i = 0; i < n; ++i) {
size_t k = assignments[i];
counts[k]++;
for (size_t d = 0; d < ds_; ++d) {
sums[k][d] += sub[i][d];
}
}
for (size_t k = 0; k < K_; ++k) {
if (counts[k] > 0) {
for (size_t d = 0; d < ds_; ++d) {
codebooks_[m][k][d] = sums[k][d] / counts[k];
}
}
}
}
}
trained_ = true;
}
/**
* Encode vectors to PQ codes (M uint8 per vector).
*/
std::vector<std::vector<uint8_t>>
encode(const std::vector<std::vector<float>> &data) const {
assert(trained_);
size_t n = data.size();
std::vector<std::vector<uint8_t>> codes(n, std::vector<uint8_t>(M_));
for (size_t i = 0; i < n; ++i) {
for (size_t m = 0; m < M_; ++m) {
float best_dist = std::numeric_limits<float>::max();
uint8_t best_k = 0;
for (size_t k = 0; k < K_; ++k) {
float d = 0;
for (size_t dd = 0; dd < ds_; ++dd) {
float diff = data[i][m * ds_ + dd] - codebooks_[m][k][dd];
d += diff * diff;
}
if (d < best_dist) {
best_dist = d;
best_k = static_cast<uint8_t>(k);
}
}
codes[i][m] = best_k;
}
}
return codes;
}
/**
* Decode PQ codes back to approximate vectors.
*/
std::vector<float> decode(const std::vector<uint8_t> &code) const {
assert(trained_);
std::vector<float> vec(dim_);
for (size_t m = 0; m < M_; ++m) {
for (size_t d = 0; d < ds_; ++d) {
vec[m * ds_ + d] = codebooks_[m][code[m]][d];
}
}
return vec;
  }
private:
  static float l2_sq(const std::vector<float> &a, const std::vector<float> &b) {
    float s = 0;
    for (size_t i = 0; i < a.size(); ++i) {
      float d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }
  size_t dim_, M_, K_, ds_;
  bool trained_ = false;
  // codebooks_[m][k] = centroid k of subspace m (ds_ floats each)
  std::vector<std::vector<std::vector<float>>> codebooks_;
};
2.6 Hybrid and Composite Indexes¶
Production systems mix and match these theories for maximum power. The most common combination historically used in libraries like Meta's FAISS is IVF-PQ (Inverted File Index with Product Quantization).
- IVF: Group the entire space into distinct clusters (Voronoi cells).
- PQ: Compress all vectors inside those clusters aggressively.
- Query: Find the `nprobe` closest clusters to the query (e.g., 12), jump into them, and scan only the heavily compressed PQ codes inside.
C++ Implementation: IVF Index (click to expand)
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <numeric>
#include <random>
#include <vector>

/**
* Inverted File Index.
*
* Search complexity: O(nprobe × n/nlist × d) vs O(n × d) brute-force.
*
* Parameters:
* nlist — number of Voronoi cells (centroids)
* nprobe — number of cells to search (trade-off: recall vs speed)
*/
class IVFIndex {
public:
struct SearchResult {
float distance;
size_t id;
bool operator<(const SearchResult &o) const {
return distance < o.distance;
}
};
IVFIndex(size_t dim, size_t nlist = 100, size_t nprobe = 10)
: dim_(dim), nlist_(nlist), nprobe_(nprobe) {
inverted_lists_.resize(nlist);
}
/**
* Train centroids using k-means.
*/
void train(const std::vector<std::vector<float>> &data, size_t n_iter = 20) {
size_t n = data.size();
centroids_.resize(nlist_, std::vector<float>(dim_));
std::mt19937 rng(42);
std::vector<size_t> indices(n);
std::iota(indices.begin(), indices.end(), 0);
std::shuffle(indices.begin(), indices.end(), rng);
for (size_t c = 0; c < nlist_; ++c) {
centroids_[c] = data[indices[c % n]];
}
std::vector<size_t> assignments(n);
for (size_t iter = 0; iter < n_iter; ++iter) {
// Assign to nearest centroid
for (size_t i = 0; i < n; ++i) {
float best = std::numeric_limits<float>::max();
for (size_t c = 0; c < nlist_; ++c) {
float d = l2_sq(data[i], centroids_[c]);
if (d < best) {
best = d;
assignments[i] = c;
}
}
}
// Update centroids
std::vector<std::vector<float>> sums(nlist_, std::vector<float>(dim_, 0));
std::vector<size_t> counts(nlist_, 0);
for (size_t i = 0; i < n; ++i) {
counts[assignments[i]]++;
for (size_t d = 0; d < dim_; ++d)
sums[assignments[i]][d] += data[i][d];
}
for (size_t c = 0; c < nlist_; ++c) {
if (counts[c] > 0) {
for (size_t d = 0; d < dim_; ++d)
centroids_[c][d] = sums[c][d] / counts[c];
}
}
}
trained_ = true;
}
/**
* Add vectors to the inverted lists.
*/
void add(const std::vector<std::vector<float>> &data) {
assert(trained_);
vectors_ = data;
for (auto &list : inverted_lists_)
list.clear();
for (size_t i = 0; i < data.size(); ++i) {
float best = std::numeric_limits<float>::max();
size_t best_c = 0;
for (size_t c = 0; c < nlist_; ++c) {
float d = l2_sq(data[i], centroids_[c]);
if (d < best) {
best = d;
best_c = c;
}
}
inverted_lists_[best_c].push_back(i);
}
}
/**
* Search for k approximate nearest neighbors.
*
* 1. Find nprobe nearest centroids
* 2. Scan vectors in those cells
* 3. Return top-k
*/
std::vector<SearchResult> search(const std::vector<float> &query,
size_t k) const {
assert(trained_);
// Find nearest centroids
std::vector<std::pair<float, size_t>> centroid_dists(nlist_);
for (size_t c = 0; c < nlist_; ++c) {
centroid_dists[c] = {l2_sq(query, centroids_[c]), c};
}
std::partial_sort(centroid_dists.begin(),
centroid_dists.begin() + std::min(nprobe_, nlist_),
centroid_dists.end());
// Collect and score candidates
std::vector<SearchResult> candidates;
for (size_t p = 0; p < std::min(nprobe_, nlist_); ++p) {
size_t cell = centroid_dists[p].second;
for (size_t idx : inverted_lists_[cell]) {
float d = std::sqrt(l2_sq(query, vectors_[idx]));
candidates.push_back({d, idx});
}
}
k = std::min(k, candidates.size());
if (k > 0) {
std::partial_sort(candidates.begin(), candidates.begin() + k,
candidates.end());
candidates.resize(k);
}
return candidates;
}
void set_nprobe(size_t nprobe) { nprobe_ = nprobe; }
size_t size() const { return vectors_.size(); }
private:
static float l2_sq(const std::vector<float> &a, const std::vector<float> &b) {
float s = 0;
for (size_t i = 0; i < a.size(); ++i) {
float d = a[i] - b[i];
s += d * d;
}
return s;
}
size_t dim_, nlist_, nprobe_;
bool trained_ = false;
std::vector<std::vector<float>> centroids_;
std::vector<std::vector<size_t>> inverted_lists_;
std::vector<std::vector<float>> vectors_;
};
2.7 Algorithm Selection Guide¶
flowchart TD
A[How many vectors?] -->|"Under 100K"| B[Brute-Force Flat Search]
A -->|"100K to 10M"| C[HNSW in RAM]
A -->|"Over 10M"| D{Infrastructure Budget?}
D -->|High Budget| E[HNSW + 8-bit Quantized RAM]
D -->|Tight Budget| F[DiskANN SSD or IVF-PQ]
style B fill:#fff9c4,stroke:#fbc02d
style C fill:#c8e6c9,stroke:#388e3c
style E fill:#c8e6c9,stroke:#388e3c
style F fill:#bbdefb,stroke:#1976d2
Assignment: Implement a K-Means IVF Index¶
In Chapter 0, you built a naive "Index" that hardcoded two buckets based on whether the first dimension was positive or negative. In this assignment, you will upgrade your FastMiniVectorDB to use a true Inverted File Index (IVF) based on K-Means centroids.
Goal¶
Define $K$ centroids (for example, $K=4$). Write an ingestion pipeline that calculates the distance between a new vector and all 4 centroids, and pushes the vector ID into the bucket (inverted list) of the closest centroid. During search, implement the nprobe parameter to only scan the 2 closest buckets.
#include <iostream>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal stand-ins for the Chapter 0 types:
struct Record {
    size_t id;
    std::vector<float> vec;
    std::string text;
};

struct SearchResult {
    float distance;
    size_t id;
};
class IVFVectorDB {
private:
std::unordered_map<size_t, Record> storage;
size_t next_id = 1;
// The K Centroids (e.g., K = 4)
std::vector<std::vector<float>> centroids;
// The Inverted Lists (Buckets). Maps a centroid index (0 to K-1)
// to a list of Vector IDs.
std::vector<std::vector<size_t>> inverted_lists;
public:
IVFVectorDB(const std::vector<std::vector<float>>& initial_centroids) {
centroids = initial_centroids;
inverted_lists.resize(centroids.size());
}
size_t insert(const std::vector<float>& vec) {
size_t id = next_id++;
storage[id] = {id, vec, ""};
// EXERCISE 1: Find the nearest centroid
int best_centroid_idx = -1;
float min_dist = std::numeric_limits<float>::max();
/* Write your loop here: compare `vec` to all `centroids` */
// Push to the correct bucket
if (best_centroid_idx != -1) {
inverted_lists[best_centroid_idx].push_back(id);
}
return id;
}
std::vector<SearchResult> search(const std::vector<float>& query, int k, int nprobe) {
// EXERCISE 2: Implement nprobe routing
// 1. Find the distance from `query` to all `centroids`.
// 2. Sort the centroids by distance.
// 3. Take the top `nprobe` closest centroids.
// 4. Iterate *only* over the `inverted_lists` belonging to those specific centroids.
// 5. Calculate exact distances and return the top-K.
std::vector<SearchResult> results;
/* Write your routing and scanning logic here */
return results;
}
};
References¶
- Malkov, Y. A., & Yashunin, D. A. (2020). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE TPAMI.
- Jegou, H., Douze, M., & Schmid, C. (2011). Product Quantization for Nearest Neighbor Search. IEEE TPAMI.
- Subramanya, S. J., et al. (2019). DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. NeurIPS.
- Indyk, P., & Motwani, R. (1998). Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC.