2. Approximate Nearest Neighbor (ANN) Algorithms

An exact, brute-force search requires calculating the distance between your query and every single vector in the database — an $O(n \cdot d)$ operation per query. At a billion vectors and 768 dimensions, that means streaming roughly 3 TB of raw vector data ($10^9 \times 768 \times 4$ bytes) through the CPU for every query.
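To make that figure concrete, a quick back-of-envelope sketch (assuming 4-byte float32 components, the common embedding format; the function name is illustrative):

```cpp
#include <cstdint>

// Bytes of raw vector data one brute-force query must scan:
// n vectors x d dimensions x 4 bytes per float32 component.
constexpr std::uint64_t bytes_scanned(std::uint64_t n, std::uint64_t d) {
  return n * d * 4;
}

// 1B x 768-dim float32 vectors: 3,072,000,000,000 bytes (~3 TB) per query.
static_assert(bytes_scanned(1'000'000'000ULL, 768) == 3'072'000'000'000ULL,
              "one brute-force query touches ~3 TB");
```

Even at 100 GB/s of memory bandwidth, that is tens of seconds per query before any math happens.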

To make real-time applications possible, vector databases abandon the idea of always finding the absolute nearest neighbor. Instead, they use algorithms that, with high probability, find an exceptionally close neighbor, much faster. This chapter covers the four major families of Approximate Nearest Neighbor (ANN) algorithms that power the industry: trees, hashing, graphs, and quantization.


Why Approximate?

| | Exact k-NN | Approximate k-NN |
|---|---|---|
| Complexity | $O(n \cdot d)$ | $O(\text{polylog}(n) \cdot d)$ |
| Search Time (1B vectors) | Minutes | Milliseconds |
| Accuracy (Recall) | 1.0 (mathematically perfect) | 0.90–0.99 (tunable) |

ELI5: Exact search is like finding a phone number by reading the phonebook from page 1 to page 500, checking every single name. Approximate search is like using an index to flip directly to the "S" section, and grabbing the first number that mostly matches your intent. It takes milliseconds, and 95% of the time, you get the person you actually wanted.

The key metric measuring this accuracy is Recall@k: Out of the true mathematical top $k$ results, what percentage did our fast algorithm successfully find?

$$ \text{Recall@}k = \frac{|\mathcal{R}_{\text{approx}} \cap \mathcal{R}_{\text{exact}}|}{k} $$

For applications like Retrieval Augmented Generation (RAG) or recommendations, a recall of 95% is functionally indistinguishable to end-users from 100%.
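The recall formula is simple enough to compute directly. A minimal sketch (the function name is illustrative):

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Recall@k: fraction of the true top-k IDs that the ANN result list found.
double recall_at_k(const std::vector<size_t> &approx,
                   const std::vector<size_t> &exact, size_t k) {
  std::unordered_set<size_t> truth(exact.begin(), exact.end());
  size_t hits = 0;
  for (size_t id : approx)
    if (truth.count(id))
      ++hits;
  return static_cast<double>(hits) / static_cast<double>(k);
}
```

For example, if the true top-4 is {1, 2, 3, 4} and the index returns {1, 2, 7, 9}, recall@4 is 0.5.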


2.2 Tree-Based Methods

Tree-based methods slice the spatial universe into smaller and smaller boxes, helping queries narrow down where points live.

KD-Trees

A KD-tree splits space with an axis-aligned cut along one dimension (e.g., the X-axis), then recursively splits each resulting region along the next dimension (e.g., the Y-axis), and so on.

flowchart TD
    subgraph KDSearch ["KD-Tree Construction"]
        Root{"Split Dimension 1<br/>(x₁ < 5.0)"}
        L1{"Split Dimension 2<br/>(x₂ < 3.0)"}
        R1{"Split Dimension 2<br/>(x₂ < 7.0)"}
        Leaf1["Leaf Node A<br/>{p₁, p₂}"]
        Leaf2["Leaf Node B<br/>{p₃}"]
        Leaf3["Leaf Node C<br/>{p₄, p₅}"]
        Leaf4["Leaf Node D<br/>{p₆}"]

        Root -- Yes --> L1
        Root -- No --> R1
        L1 -- Yes --> Leaf1
        L1 -- No --> Leaf2
        R1 -- Yes --> Leaf3
        R1 -- No --> Leaf4
    end

Why KD-trees fail for AI

KD-trees are brilliant for 2D maps or 3D video game rendering. In 768 dimensions, however, the curse of dimensionality bites: a query point typically sits close to an enormous number of box boundaries, so the algorithm must backtrack and check almost every box, devolving into a slow brute-force scan.

Ball Trees and VP-Trees

Instead of drawing axis-aligned boxes, Ball Trees enclose clusters of points in nested (and possibly overlapping) hyper-spheres.

Vantage-Point Trees (VP-Trees) pick a "Vantage Point", measure every point's distance to it, and split the data into "points closer than the median distance" and "points further than the median distance". These spherical approaches hold up slightly better in high dimensions than KD-trees, but still degrade badly past roughly 50 dimensions.
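A single VP-tree split fits in a few lines. This is a toy illustration of one partition step, not a full tree; the helper names are made up here:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<float>;

float dist(const Point &a, const Point &b) {
  float s = 0.0f;
  for (size_t i = 0; i < a.size(); ++i) {
    float d = a[i] - b[i];
    s += d * d;
  }
  return std::sqrt(s);
}

// One VP-tree split: take points[0] as the vantage point, find the median
// distance to it, and partition the rest into "inner" (<= median) and "outer".
struct Split {
  Point vantage;
  float median;
  std::vector<Point> inner, outer;
};

Split vp_split(const std::vector<Point> &points) {
  Split s;
  s.vantage = points[0];
  std::vector<float> dists;
  for (size_t i = 1; i < points.size(); ++i)
    dists.push_back(dist(s.vantage, points[i]));

  std::vector<float> sorted = dists;
  std::nth_element(sorted.begin(), sorted.begin() + sorted.size() / 2,
                   sorted.end());
  s.median = sorted[sorted.size() / 2];

  for (size_t i = 1; i < points.size(); ++i)
    (dists[i - 1] <= s.median ? s.inner : s.outer).push_back(points[i]);
  return s;
}
```

Recursing on `inner` and `outer` yields the full tree; search prunes whichever side the triangle inequality rules out.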


2.3 Hashing-Based Methods (LSH)

Core Idea

Traditional database hashing guarantees that hash("apple") always equals exactly hash("apple"), but hash("app") yields a totally different random string. This is terrible for similarity searches.

Locality-Sensitive Hashing (LSH) flips this logic. LSH uses hash functions engineered so that similar inputs collide (produce identical outputs) with high probability.

ELI5: Imagine slicing a room full of people in half randomly with a laser pointer. You assign everyone on the left a 0 and everyone on the right a 1. You do this 5 times from different angles. If Bob and Alice are standing right next to each other, there is a very high probability they will end up with the exact same 5-digit code (e.g., 10110). The query just calculates its own code, goes to the 10110 bucket, and only compares distances against the people inside that bucket.

Random Hyperplane LSH (Cosine Similarity)

Each slice (hash function) tests which side of a random plane a vector sits on:

$$ h(\mathbf{x}) = \text{sign}(\mathbf{r} \cdot \mathbf{x}), \quad \mathbf{r} \sim \mathcal{N}(0, I_d) $$
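For this hash family the collision probability has a known closed form, $\Pr[h(\mathbf{x}) = h(\mathbf{y})] = 1 - \theta(\mathbf{x}, \mathbf{y})/\pi$, where $\theta$ is the angle between the vectors. A small Monte Carlo sketch can verify it empirically (the trial count and seed are arbitrary):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Estimate Pr[sign(r.x) == sign(r.y)] over random Gaussian hyperplanes r.
// Theory: this collision probability equals 1 - angle(x, y) / pi.
double collision_rate(const std::vector<float> &x, const std::vector<float> &y,
                      int trials, unsigned seed) {
  std::mt19937 rng(seed);
  std::normal_distribution<double> normal(0.0, 1.0);
  int same = 0;
  for (int t = 0; t < trials; ++t) {
    double dx = 0.0, dy = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
      double r = normal(rng); // one fresh random hyperplane per trial
      dx += r * x[i];
      dy += r * y[i];
    }
    if ((dx > 0) == (dy > 0))
      ++same;
  }
  return static_cast<double>(same) / trials;
}
```

The vectors (1, 0) and (1, 1) are 45° apart, so theory predicts a collision rate of $1 - (\pi/4)/\pi = 0.75$; the estimate should land very close to that.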

Multi-Table Routing

To prevent false negatives (Alice and Bob accidentally getting split by a laser), we build multiple tables. If a point shares a bucket in at least one out of $L$ tables, we consider it a candidate.

flowchart LR
    Q[Query Vector] --> HashFunc["Calculate LSH Hashes"]
    HashFunc --> Table1["Hash Table 1<br/>Bucket '100'"]
    HashFunc --> Table2["Hash Table 2<br/>Bucket '011'"]
    HashFunc --> Table3["Hash Table 3<br/>Bucket '101'"]

    Table1 --> Candidates[Candidate Vectors Pool]
    Table2 --> Candidates
    Table3 --> Candidates

    Candidates --> ExactDist["Compute Exact Distance<br/>on small subset"]
    ExactDist --> Results((Top-K Results))

    style Candidates fill:#fff3e0,stroke:#f57c00
    style Results fill:#e8f5e9,stroke:#4caf50
C++ Implementation: Random Hyperplane LSH

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Bucket signature (one bit per hyperplane) plus a hash functor so the
// signature can key an unordered_map.
struct HashSignature {
  std::vector<uint8_t> bits;
  bool operator==(const HashSignature &o) const { return bits == o.bits; }
};

struct HashSignatureHash {
  size_t operator()(const HashSignature &s) const {
    size_t h = 0;
    for (uint8_t b : s.bits)
      h = h * 31 + b;
    return h;
  }
};
/**
 * Random Hyperplane LSH for cosine similarity.
 *
 * Each hash function: h(x) = sign(r · x)
 * where r is a random Gaussian vector.
 *
 * Collision probability:
 *   Pr[h(x) = h(y)] = 1 - θ(x,y) / π
 *
 * Parameters:
 *   dim         — dimensionality of input vectors
 *   num_tables  — number of hash tables L (more tables → higher recall)
 *   num_hashes  — hash bits per table k (more bits → higher precision)
 */
class RandomHyperplaneLSH {
public:
  RandomHyperplaneLSH(size_t dim, size_t num_tables = 10, size_t num_hashes = 8)
      : dim_(dim), num_tables_(num_tables), num_hashes_(num_hashes) {
    std::mt19937 rng(42);
    std::normal_distribution<float> normal(0.0f, 1.0f);

    // Generate random hyperplanes for each table
    hyperplanes_.resize(num_tables);
    for (size_t t = 0; t < num_tables; ++t) {
      hyperplanes_[t].resize(num_hashes * dim);
      for (auto &val : hyperplanes_[t]) {
        val = normal(rng);
      }
    }

    tables_.resize(num_tables);
  }

  /**
   * Build the index from a set of vectors.
   */
  void build(const std::vector<std::vector<float>> &vectors) {
    vectors_ = vectors;
    for (auto &table : tables_)
      table.clear();

    for (size_t i = 0; i < vectors.size(); ++i) {
      for (size_t t = 0; t < num_tables_; ++t) {
        auto sig = hash(vectors[i], t);
        tables_[t][sig].push_back(i);
      }
    }
  }

  /**
   * Query for k approximate nearest neighbors.
   *
   * 1. Hash query into each table
   * 2. Collect all candidate IDs from matching buckets
   * 3. Re-rank candidates by exact cosine similarity
   */
  std::vector<size_t> query(const std::vector<float> &q, size_t k) const {
    std::unordered_set<size_t> candidates;

    for (size_t t = 0; t < num_tables_; ++t) {
      auto sig = hash(q, t);
      auto it = tables_[t].find(sig);
      if (it != tables_[t].end()) {
        for (size_t idx : it->second) {
          candidates.insert(idx);
        }
      }
    }

    // Re-rank by cosine similarity
    std::vector<std::pair<float, size_t>> scored;
    scored.reserve(candidates.size());
    for (size_t idx : candidates) {
      float sim = cosine_sim(q, vectors_[idx]);
      scored.push_back({-sim, idx}); // negate for ascending sort
    }
    std::sort(scored.begin(), scored.end());

    std::vector<size_t> result;
    for (size_t i = 0; i < std::min(k, scored.size()); ++i) {
      result.push_back(scored[i].second);
    }
    return result;
  }

  size_t num_vectors() const { return vectors_.size(); }

private:
  HashSignature hash(const std::vector<float> &vec, size_t table_idx) const {
    HashSignature sig;
    sig.bits.resize(num_hashes_);
    for (size_t h = 0; h < num_hashes_; ++h) {
      float dot = 0.0f;
      const float *plane = &hyperplanes_[table_idx][h * dim_];
      for (size_t d = 0; d < dim_; ++d) {
        dot += plane[d] * vec[d];
      }
      sig.bits[h] = (dot > 0.0f) ? 1 : 0;
    }
    return sig;
  }

  static float cosine_sim(const std::vector<float> &a,
                          const std::vector<float> &b) {
    float dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < a.size(); ++i) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    float denom = std::sqrt(na) * std::sqrt(nb);
    return (denom > 1e-10f) ? dot / denom : 0.0f;
  }

  size_t dim_, num_tables_, num_hashes_;
  std::vector<std::vector<float>> hyperplanes_; // [table][hash*dim + d]
  std::vector<
      std::unordered_map<HashSignature, std::vector<size_t>, HashSignatureHash>>
      tables_;
  std::vector<std::vector<float>> vectors_;
};

2.4 Graph-Based Methods

Hierarchical Navigable Small World (HNSW)

HNSW is currently the undisputed king of vector search. It powers almost every major production vector database today (Pinecone, Qdrant, Weaviate, Milvus).

It is based on the idea of a "Small World Network" — much like the "Six Degrees of Kevin Bacon". You don't need to know everyone in the world; you just need to know a few distant hubs, and a few local neighbours, and you can navigate to anyone in a few hops.

ELI5: Imagine navigating a highway system. To get from a small town in Texas to a small town in New York, you don't drive on local backroads the whole way. You: 1. Drive local roads to the nearest interstate highway on-ramp. 2. Hop on the massive, sparse interstate system spanning the country. 3. Exit near New York and take local dense streets to the exact address.

HNSW builds exactly this: a multi-layered graph. The top layers are the "interstate", holding only a few sparse points that span vast distances. The bottom layer contains every single vector in dense "local streets".

flowchart TD
    subgraph Layer2 ["Layer 2 (The Highway - Super Sparse)"]
        direction LR
        L2A((Node A)) <--> L2E((Node E))
    end

    subgraph Layer1 ["Layer 1 (State Roads - Medium Density)"]
        direction LR
        L1A((Node A)) <--> L1C((Node C))
        L1C <--> L1E((Node E))
    end

    subgraph Layer0 ["Layer 0 (Local Streets - Deeply Dense)"]
        direction LR
        L0A((Node A)) <--> L0B((Node B))
        L0B <--> L0C((Node C))
        L0C <--> L0D((Node D))
        L0D <--> L0E((Node E))
        L0A <--> L0C
    end

    Layer2 -.-> Layer1
    Layer1 -.-> Layer0

    style Layer2 fill:#e3f2fd,stroke:#1e88e5
    style Layer1 fill:#bbdefb,stroke:#1976d2
    style Layer0 fill:#90caf9,stroke:#1565c0
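The exponential thinning between layers is not configured by hand; it falls out of how each node's top layer is drawn at insertion time: $\text{level} = \lfloor -\ln(u) \cdot m_L \rfloor$ with $u \sim \text{Uniform}(0,1)$ and $m_L = 1/\ln(M)$, so roughly a $1/M$ fraction of nodes reaches each successive layer. A small sketch of that distribution (sample size and seed are arbitrary):

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// HNSW assigns each inserted node a top layer drawn as
//   level = floor(-ln(u) * mL),  u ~ Uniform(0, 1),  mL = 1 / ln(M),
// so each layer holds roughly a 1/M fraction of the layer below it.
std::vector<int> level_histogram(size_t n, size_t M, unsigned seed) {
  std::mt19937 rng(seed);
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  double mL = 1.0 / std::log(static_cast<double>(M));
  std::vector<int> counts;
  for (size_t i = 0; i < n; ++i) {
    double u = uniform(rng);
    if (u < 1e-12)
      u = 1e-12; // guard against log(0)
    int level = static_cast<int>(-std::log(u) * mL);
    if (level >= static_cast<int>(counts.size()))
      counts.resize(level + 1, 0);
    counts[level]++;
  }
  return counts;
}
```

With M = 16, about 15/16 of nodes stay on layer 0 only, roughly 1/16 reach layer 1, 1/256 reach layer 2, and so on — exactly the highway/local-street shape in the diagram.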

HNSW Algorithm: Querying

  1. Start at a predefined entry point on the topmost layer.
  2. Check the current node's neighbours and jump to the one closest to your query.
  3. Keep jumping until no neighbour is closer than the node you are standing on (a local minimum).
  4. Drop down to the next layer and repeat, using your current node as the new starting point.
  5. Keep going until you finish searching Layer 0. The closest nodes found there are your approximate nearest neighbours.
| Key Parameter | Default | Effect on Database |
|---|---|---|
| `M` (max connections) | 16–64 | How many edges each node keeps per layer. Higher means much higher recall accuracy, but significantly more RAM usage. |
| `ef_construction` | 200 | Search beam width during graph building. Higher means a better-quality graph, but vastly longer data imports. |
| `ef_search` | 50+ | The width of the "net" cast during a user query. Higher means more accurate search results, but higher latency. |
C++ Implementation: HNSW Index

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <limits>
#include <queue>
#include <random>
#include <unordered_set>
#include <vector>
/**
 * HNSW Index for approximate nearest neighbor search.
 *
 * Builds a multi-layer navigable small-world graph.
 * - Layer 0 is the densest (all nodes present)
 * - Higher layers are exponentially sparser
 * - Search descends from top layer greedily, then does
 *   beam search at layer 0.
 *
 * Template parameters are avoided for clarity; uses float vectors.
 *
 * Key parameters:
 *   M               — max edges per node per layer (default 16)
 *   M_max0          — max edges at layer 0 (default 2*M)
 *   ef_construction  — beam width during build
 *   ef_search        — beam width during query
 *   mL              — level generation factor = 1/ln(M)
 */
class HNSWIndex {
public:
  struct SearchResult {
    float distance;
    size_t id;
    bool operator>(const SearchResult &o) const {
      return distance > o.distance;
    }
    bool operator<(const SearchResult &o) const {
      return distance < o.distance;
    }
  };

  HNSWIndex(size_t dim, size_t M = 16, size_t ef_construction = 200,
            size_t ef_search = 50)
      : dim_(dim), M_(M), M_max0_(2 * M), ef_construction_(ef_construction),
        ef_search_(ef_search), mL_(1.0 / std::log(static_cast<double>(M))),
        entry_point_(NONE), max_layer_(0), rng_(42), uniform_(0.0, 1.0) {}

  /**
   * Insert a single vector into the index.
   *
   * Algorithm (simplified):
   * 1. Draw random level l ~ Geometric(mL)
   * 2. From entry point, greedily descend to layer l+1
   * 3. At each layer [l..0], beam-search for ef_construction
   *    neighbors and add bidirectional edges
   */
  size_t insert(const std::vector<float> &vec) {
    size_t id = vectors_.size();
    vectors_.push_back(vec);

    int level = random_level();

    // Grow graph layers if needed
    while (static_cast<int>(graph_.size()) <= level) {
      graph_.emplace_back();
    }

    // First element
    if (entry_point_ == NONE) {
      entry_point_ = id;
      max_layer_ = level;
      for (int l = 0; l <= level; ++l) {
        if (graph_[l].size() <= id)
          graph_[l].resize(id + 1);
      }
      return id;
    }

    // Ensure node has adjacency lists at all its layers
    for (int l = 0; l <= level; ++l) {
      if (graph_[l].size() <= id)
        graph_[l].resize(id + 1);
    }

    size_t current = entry_point_;

    // Phase 1: Greedy descent from max_layer to level+1
    for (int l = max_layer_; l > level; --l) {
      auto nearest = search_layer(vec, current, 1, l);
      if (!nearest.empty())
        current = nearest[0].id;
    }

    // Phase 2: Insert at layers [min(level, max_layer)..0]
    for (int l = std::min(level, max_layer_); l >= 0; --l) {
      auto candidates = search_layer(vec, current, ef_construction_, l);
      size_t M_max = (l == 0) ? M_max0_ : M_;
      auto neighbors = select_neighbors(candidates, M_max);

      // Add bidirectional edges
      for (auto &nb : neighbors) {
        graph_[l][id].push_back(nb.id);
        graph_[l][nb.id].push_back(id);

        // Prune neighbor if over capacity
        if (graph_[l][nb.id].size() > M_max) {
          prune(nb.id, l, M_max);
        }
      }

      if (!candidates.empty())
        current = candidates[0].id;
    }

    if (level > max_layer_) {
      entry_point_ = id;
      max_layer_ = level;
    }

    return id;
  }

  /**
   * Search for k approximate nearest neighbors.
   *
   * 1. Greedy descent from top layer to layer 1
   * 2. Beam search at layer 0 with ef = max(ef_search, k)
   * 3. Return top-k results sorted by distance
   */
  std::vector<SearchResult> search(const std::vector<float> &query,
                                   size_t k) const {
    if (entry_point_ == NONE)
      return {};

    size_t current = entry_point_;

    // Descend from top
    for (int l = max_layer_; l > 0; --l) {
      auto nearest = search_layer(query, current, 1, l);
      if (!nearest.empty())
        current = nearest[0].id;
    }

    size_t ef = std::max(ef_search_, k);
    auto results = search_layer(query, current, ef, 0);

    if (results.size() > k)
      results.resize(k);
    // Convert stored squared distances to actual Euclidean distances
    for (auto &r : results) {
      r.distance = std::sqrt(r.distance);
    }
    return results;
  }

  /**
   * Bulk insert all vectors.
   */
  void build(const std::vector<std::vector<float>> &vectors) {
    for (const auto &v : vectors)
      insert(v);
  }

  size_t size() const { return vectors_.size(); }
  size_t num_layers() const { return graph_.size(); }

  void set_ef_search(size_t ef) { ef_search_ = ef; }

private:
  static constexpr size_t NONE = std::numeric_limits<size_t>::max();

  float distance_sq(const std::vector<float> &a,
                    const std::vector<float> &b) const {
    float sum = 0.0f;
    for (size_t i = 0; i < dim_; ++i) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  int random_level() {
    return static_cast<int>(-std::log(uniform_(rng_)) * mL_);
  }

  /**
   * Beam search in a single layer.
   * Returns up to ef nearest elements, sorted by distance (ascending).
   */
  std::vector<SearchResult> search_layer(const std::vector<float> &query,
                                         size_t entry, size_t ef,
                                         int layer) const {
    if (layer >= static_cast<int>(graph_.size()))
      return {};
    if (entry >= graph_[layer].size())
      return {};

    std::unordered_set<size_t> visited;
    visited.insert(entry);

    float d = distance_sq(query, vectors_[entry]);

    // candidates: min-heap (closest first)
    std::priority_queue<SearchResult, std::vector<SearchResult>,
                        std::greater<SearchResult>>
        candidates;
    // results: max-heap (farthest first, for bounding)
    std::priority_queue<SearchResult> results;

    candidates.push({d, entry});
    results.push({d, entry});

    while (!candidates.empty()) {
      auto [c_dist, c_id] = candidates.top();
      candidates.pop();

      float farthest = results.top().distance;
      if (c_dist > farthest)
        break;

      // Expand neighbors
      if (c_id < graph_[layer].size()) {
        for (size_t nb : graph_[layer][c_id]) {
          if (visited.count(nb))
            continue;
          visited.insert(nb);

          float nb_dist = distance_sq(query, vectors_[nb]);
          farthest = results.top().distance;

          if (nb_dist < farthest || results.size() < ef) {
            candidates.push({nb_dist, nb});
            results.push({nb_dist, nb});
            if (results.size() > ef)
              results.pop();
          }
        }
      }
    }

    // Extract sorted results
    std::vector<SearchResult> out;
    out.reserve(results.size());
    while (!results.empty()) {
      out.push_back(results.top());
      results.pop();
    }
    std::sort(out.begin(), out.end());
    return out;
  }

  std::vector<SearchResult>
  select_neighbors(const std::vector<SearchResult> &candidates,
                   size_t M) const {
    auto sorted = candidates;
    std::sort(sorted.begin(), sorted.end());
    if (sorted.size() > M)
      sorted.resize(M);
    return sorted;
  }

  void prune(size_t node, int layer, size_t M_max) {
    auto &adj = graph_[layer][node];
    std::vector<SearchResult> scored;
    for (size_t nb : adj) {
      scored.push_back({distance_sq(vectors_[node], vectors_[nb]), nb});
    }
    std::sort(scored.begin(), scored.end());
    adj.clear();
    for (size_t i = 0; i < std::min(M_max, scored.size()); ++i) {
      adj.push_back(scored[i].id);
    }
  }

  size_t dim_;
  size_t M_, M_max0_;
  size_t ef_construction_, ef_search_;
  double mL_;
  size_t entry_point_;
  int max_layer_;

  std::mt19937 rng_;
  std::uniform_real_distribution<double> uniform_;

  std::vector<std::vector<float>> vectors_;
  // graph_[layer][node_id] = list of neighbor IDs
  std::vector<std::vector<std::vector<size_t>>> graph_;
};

Vamana / DiskANN

HNSW relies heavily on storing the entire graph structure in RAM to execute those quick multi-layer hops, which becomes prohibitively expensive at billion-scale datasets. DiskANN/Vamana is an algorithm designed by Microsoft specifically to store the graph mostly on NVMe SSD drives. It utilizes asynchronous I/O batch reads to fetch neighborhoods from disk, pulling latencies down to <5ms despite operating on cheap storage.


2.5 Quantization-Based Methods

High-dimensional data eats massive amounts of RAM. Quantization is the technique of aggressively compressing vectors while trying to preserve their geometric relationships.

Product Quantization (PQ)

Product Quantization breaks large vectors into small sub-chunks and replaces each sub-chunk of floats with a single 8-bit centroid ID.

ELI5: Imagine trying to store highly detailed RGB color codes for a photograph (e.g. [251.3, 12.1, 88.9]). Instead of storing those exact floats, you look at an 8-color crayon box. You decide that pixel looks closest to "red". So instead of storing 3 floats, you just store the ID number 2 (the red crayon). You've lost some fine detail, but you compressed the pixel massively.

PQ does this, but for 768-dimensional language vectors. It chops a 768-dim vector into 32 sub-chunks, finds the "closest crayon colour" for each chunk, and stores 32 bytes instead of 3072 bytes (a massive 96x compression).

Asymmetric Distance Computation (ADC)

When the query comes in, it is not quantized. Instead, we precompute the squared distance from each of the query's $M$ sub-vectors to all $K$ centroids in that subspace; the approximate distance to any compressed code is then just the sum of $M$ table lookups.

| Step | Complexity |
|---|---|
| Build the query-to-centroid lookup table | $O(M \cdot K \cdot d_s)$ — once per query |
| Scan $n$ codes in the database | $O(n \cdot M)$ — table lookups and additions only, no per-dimension math |
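A minimal, self-contained sketch of the two ADC steps with toy sizes (all numbers made up; independent of the full class below): precompute the $M \times K$ table once, then score each code with $M$ lookups.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1: table[m][k] = || q^(m) - c_k^(m) ||^2, computed once per query.
std::vector<std::vector<float>>
adc_table(const std::vector<float> &query,
          const std::vector<std::vector<std::vector<float>>> &codebooks) {
  size_t M = codebooks.size();
  size_t K = codebooks[0].size();
  size_t ds = codebooks[0][0].size();
  std::vector<std::vector<float>> table(M, std::vector<float>(K, 0.0f));
  for (size_t m = 0; m < M; ++m) {
    for (size_t k = 0; k < K; ++k) {
      for (size_t d = 0; d < ds; ++d) {
        float diff = query[m * ds + d] - codebooks[m][k][d];
        table[m][k] += diff * diff;
      }
    }
  }
  return table;
}

// Step 2: score one PQ code with M table lookups (no per-dimension math).
float adc_distance(const std::vector<std::vector<float>> &table,
                   const std::vector<uint8_t> &code) {
  float dist = 0.0f;
  for (size_t m = 0; m < table.size(); ++m)
    dist += table[m][code[m]];
  return dist;
}
```

Note the ADC distance to a code is exactly the distance from the query to the code's reconstructed centroid vector; the approximation error comes entirely from the quantization, not from the lookup trick.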
C++ Implementation: Product Quantizer

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <numeric>
#include <random>
#include <vector>
/**
 * Product Quantizer.
 *
 * Compresses d-dimensional float vectors into M bytes (one code per subspace).
 *
 * Compression ratio: d × 4 bytes → M bytes (e.g. 128-dim × 4B = 512B → 8B).
 *
 * Workflow:
 *   1. train()  — learn M codebooks of K=256 centroids each via k-means
 *   2. encode() — compress vectors to PQ codes (M uint8 per vector)
 *   3. decode() — reconstruct approximate vectors from PQ codes
 *
 * Asymmetric Distance Computation (ADC):
 *   dist(q, x̃)² ≈ Σ_m || q^(m) - c_{code_m}^(m) ||²
 *   Precompute a (M × K) distance table, then M lookups per candidate.
 */
class ProductQuantizer {
public:
  ProductQuantizer(size_t dim, size_t M = 8, size_t K = 256)
      : dim_(dim), M_(M), K_(K), ds_(dim / M) {
    assert(dim % M == 0 && "dim must be divisible by M");
    codebooks_.resize(
        M, std::vector<std::vector<float>>(K, std::vector<float>(ds_)));
  }

  /**
   * Train codebooks with k-means on each subspace.
   */
  void train(const std::vector<std::vector<float>> &data, size_t n_iter = 25) {
    std::mt19937 rng(42);
    size_t n = data.size();

    for (size_t m = 0; m < M_; ++m) {
      // Extract sub-vectors for subspace m
      std::vector<std::vector<float>> sub(n, std::vector<float>(ds_));
      for (size_t i = 0; i < n; ++i) {
        for (size_t d = 0; d < ds_; ++d) {
          sub[i][d] = data[i][m * ds_ + d];
        }
      }

      // Initialize centroids randomly
      std::vector<size_t> indices(n);
      std::iota(indices.begin(), indices.end(), 0);
      std::shuffle(indices.begin(), indices.end(), rng);
      for (size_t k = 0; k < K_; ++k) {
        codebooks_[m][k] = sub[indices[k % n]];
      }

      // k-means iterations
      std::vector<size_t> assignments(n);
      for (size_t iter = 0; iter < n_iter; ++iter) {
        // Assign
        for (size_t i = 0; i < n; ++i) {
          float best_dist = std::numeric_limits<float>::max();
          for (size_t k = 0; k < K_; ++k) {
            float d = l2_sq(sub[i], codebooks_[m][k]);
            if (d < best_dist) {
              best_dist = d;
              assignments[i] = k;
            }
          }
        }

        // Update centroids
        std::vector<std::vector<float>> sums(K_, std::vector<float>(ds_, 0));
        std::vector<size_t> counts(K_, 0);
        for (size_t i = 0; i < n; ++i) {
          size_t k = assignments[i];
          counts[k]++;
          for (size_t d = 0; d < ds_; ++d) {
            sums[k][d] += sub[i][d];
          }
        }
        for (size_t k = 0; k < K_; ++k) {
          if (counts[k] > 0) {
            for (size_t d = 0; d < ds_; ++d) {
              codebooks_[m][k][d] = sums[k][d] / counts[k];
            }
          }
        }
      }
    }
    trained_ = true;
  }

  /**
   * Encode vectors to PQ codes (M uint8 per vector).
   */
  std::vector<std::vector<uint8_t>>
  encode(const std::vector<std::vector<float>> &data) const {
    assert(trained_);
    size_t n = data.size();
    std::vector<std::vector<uint8_t>> codes(n, std::vector<uint8_t>(M_));

    for (size_t i = 0; i < n; ++i) {
      for (size_t m = 0; m < M_; ++m) {
        float best_dist = std::numeric_limits<float>::max();
        uint8_t best_k = 0;
        for (size_t k = 0; k < K_; ++k) {
          float d = 0;
          for (size_t dd = 0; dd < ds_; ++dd) {
            float diff = data[i][m * ds_ + dd] - codebooks_[m][k][dd];
            d += diff * diff;
          }
          if (d < best_dist) {
            best_dist = d;
            best_k = static_cast<uint8_t>(k);
          }
        }
        codes[i][m] = best_k;
      }
    }
    return codes;
  }

  /**
   * Decode PQ codes back to approximate vectors.
   */
  std::vector<float> decode(const std::vector<uint8_t> &code) const {
    assert(trained_);
    std::vector<float> vec(dim_);
    for (size_t m = 0; m < M_; ++m) {
      for (size_t d = 0; d < ds_; ++d) {
        vec[m * ds_ + d] = codebooks_[m][code[m]][d];
      }
    }
    return vec;
  }

private:
  static float l2_sq(const std::vector<float> &a, const std::vector<float> &b) {
    float s = 0;
    for (size_t i = 0; i < a.size(); ++i) {
      float d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }

  size_t dim_, M_, K_, ds_; // ds_ = dim / M (sub-vector length)
  bool trained_ = false;
  // codebooks_[m][k] = centroid k of subspace m
  std::vector<std::vector<std::vector<float>>> codebooks_;
};

2.6 Hybrid and Composite Indexes

Production systems mix and match these theories for maximum power. The most common combination historically used in libraries like Meta's FAISS is IVF-PQ (Inverted File Index with Product Quantization).

  1. IVF: Group the entire space into distinct clusters (voronoi cells).
  2. PQ: Compress all vectors inside those clusters aggressively.
  3. Query: Find the `nprobe` clusters whose centroids are closest to the query (say, 12 of them), jump into them, and scan only the heavily compressed PQ codes inside.
C++ Implementation: IVF Index

#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <random>
#include <utility>
#include <vector>
/**
 * Inverted File Index.
 *
 * Search complexity: O(nprobe × n/nlist × d) vs O(n × d) brute-force.
 *
 * Parameters:
 *   nlist  — number of Voronoi cells (centroids)
 *   nprobe — number of cells to search (trade-off: recall vs speed)
 */
class IVFIndex {
public:
  struct SearchResult {
    float distance;
    size_t id;
    bool operator<(const SearchResult &o) const {
      return distance < o.distance;
    }
  };

  IVFIndex(size_t dim, size_t nlist = 100, size_t nprobe = 10)
      : dim_(dim), nlist_(nlist), nprobe_(nprobe) {
    inverted_lists_.resize(nlist);
  }

  /**
   * Train centroids using k-means.
   */
  void train(const std::vector<std::vector<float>> &data, size_t n_iter = 20) {
    size_t n = data.size();
    centroids_.resize(nlist_, std::vector<float>(dim_));

    std::mt19937 rng(42);
    std::vector<size_t> indices(n);
    std::iota(indices.begin(), indices.end(), 0);
    std::shuffle(indices.begin(), indices.end(), rng);

    for (size_t c = 0; c < nlist_; ++c) {
      centroids_[c] = data[indices[c % n]];
    }

    std::vector<size_t> assignments(n);
    for (size_t iter = 0; iter < n_iter; ++iter) {
      // Assign to nearest centroid
      for (size_t i = 0; i < n; ++i) {
        float best = std::numeric_limits<float>::max();
        for (size_t c = 0; c < nlist_; ++c) {
          float d = l2_sq(data[i], centroids_[c]);
          if (d < best) {
            best = d;
            assignments[i] = c;
          }
        }
      }
      // Update centroids
      std::vector<std::vector<float>> sums(nlist_, std::vector<float>(dim_, 0));
      std::vector<size_t> counts(nlist_, 0);
      for (size_t i = 0; i < n; ++i) {
        counts[assignments[i]]++;
        for (size_t d = 0; d < dim_; ++d)
          sums[assignments[i]][d] += data[i][d];
      }
      for (size_t c = 0; c < nlist_; ++c) {
        if (counts[c] > 0) {
          for (size_t d = 0; d < dim_; ++d)
            centroids_[c][d] = sums[c][d] / counts[c];
        }
      }
    }
    trained_ = true;
  }

  /**
   * Add vectors to the inverted lists.
   */
  void add(const std::vector<std::vector<float>> &data) {
    assert(trained_);
    vectors_ = data;
    for (auto &list : inverted_lists_)
      list.clear();

    for (size_t i = 0; i < data.size(); ++i) {
      float best = std::numeric_limits<float>::max();
      size_t best_c = 0;
      for (size_t c = 0; c < nlist_; ++c) {
        float d = l2_sq(data[i], centroids_[c]);
        if (d < best) {
          best = d;
          best_c = c;
        }
      }
      inverted_lists_[best_c].push_back(i);
    }
  }

  /**
   * Search for k approximate nearest neighbors.
   *
   * 1. Find nprobe nearest centroids
   * 2. Scan vectors in those cells
   * 3. Return top-k
   */
  std::vector<SearchResult> search(const std::vector<float> &query,
                                   size_t k) const {
    assert(trained_);

    // Find nearest centroids
    std::vector<std::pair<float, size_t>> centroid_dists(nlist_);
    for (size_t c = 0; c < nlist_; ++c) {
      centroid_dists[c] = {l2_sq(query, centroids_[c]), c};
    }
    std::partial_sort(centroid_dists.begin(),
                      centroid_dists.begin() + std::min(nprobe_, nlist_),
                      centroid_dists.end());

    // Collect and score candidates
    std::vector<SearchResult> candidates;
    for (size_t p = 0; p < std::min(nprobe_, nlist_); ++p) {
      size_t cell = centroid_dists[p].second;
      for (size_t idx : inverted_lists_[cell]) {
        float d = std::sqrt(l2_sq(query, vectors_[idx]));
        candidates.push_back({d, idx});
      }
    }

    k = std::min(k, candidates.size());
    if (k > 0) {
      std::partial_sort(candidates.begin(), candidates.begin() + k,
                        candidates.end());
      candidates.resize(k);
    }
    return candidates;
  }

  void set_nprobe(size_t nprobe) { nprobe_ = nprobe; }
  size_t size() const { return vectors_.size(); }

private:
  static float l2_sq(const std::vector<float> &a, const std::vector<float> &b) {
    float s = 0;
    for (size_t i = 0; i < a.size(); ++i) {
      float d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }

  size_t dim_, nlist_, nprobe_;
  bool trained_ = false;
  std::vector<std::vector<float>> centroids_;
  std::vector<std::vector<size_t>> inverted_lists_;
  std::vector<std::vector<float>> vectors_;
};

2.7 Algorithm Selection Guide

flowchart TD
    A[How many vectors?] -->|"Under 100K"| B[Brute-Force Flat Search]
    A -->|"100K to 10M"| C[HNSW in RAM]
    A -->|"Over 10M"| D{Infrastructure Budget?}

    D -->|High Budget| E[HNSW + 8-bit Quantized RAM]
    D -->|Tight Budget| F[DiskANN SSD or IVF-PQ]

    style B fill:#fff9c4,stroke:#fbc02d
    style C fill:#c8e6c9,stroke:#388e3c
    style E fill:#c8e6c9,stroke:#388e3c
    style F fill:#bbdefb,stroke:#1976d2

Assignment: Implement a K-Means IVF Index

In Chapter 0, you built a naive "Index" that hardcoded two buckets based on whether the first dimension was positive or negative. In this assignment, you will upgrade your FastMiniVectorDB to use a true Inverted File Index (IVF) based on K-Means centroids.

Goal

Define $K$ centroids (for example, $K=4$). Write an ingestion pipeline that calculates the distance between a new vector and all $K$ centroids, and pushes the vector ID into the bucket (inverted list) of the closest centroid. During search, implement the nprobe parameter so that only the `nprobe` closest buckets (e.g., 2) are scanned.

Exercise: Implement IVF
#include <iostream>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed shapes for the Chapter 0 types — adapt to your own definitions.
struct Record {
    size_t id;
    std::vector<float> vec;
    std::string metadata;
};

struct SearchResult {
    float distance;
    size_t id;
};

class IVFVectorDB {
private:
    std::unordered_map<size_t, Record> storage;
    size_t next_id = 1;

    // The K Centroids (e.g., K = 4)
    std::vector<std::vector<float>> centroids;

    // The Inverted Lists (Buckets). Maps a centroid index (0 to K-1) 
    // to a list of Vector IDs.
    std::vector<std::vector<size_t>> inverted_lists;

public:
    IVFVectorDB(const std::vector<std::vector<float>>& initial_centroids) {
        centroids = initial_centroids;
        inverted_lists.resize(centroids.size());
    }

    size_t insert(const std::vector<float>& vec) {
        size_t id = next_id++;
        storage[id] = {id, vec, ""};

        // EXERCISE 1: Find the nearest centroid
        int best_centroid_idx = -1;
        float min_dist = std::numeric_limits<float>::max();

        /* Write your loop here: compare `vec` to all `centroids` */

        // Push to the correct bucket
        if (best_centroid_idx != -1) {
            inverted_lists[best_centroid_idx].push_back(id);
        }
        return id;
    }

    std::vector<SearchResult> search(const std::vector<float>& query, int k, int nprobe) {
        // EXERCISE 2: Implement nprobe routing
        // 1. Find the distance from `query` to all `centroids`.
        // 2. Sort the centroids by distance.
        // 3. Take the top `nprobe` closest centroids.
        // 4. Iterate *only* over the `inverted_lists` belonging to those specific centroids.
        // 5. Calculate exact distances and return the top-K.

        std::vector<SearchResult> results;

        /* Write your routing and scanning logic here */

        return results;
    }
};

References

  1. Malkov, Y. A., & Yashunin, D. A. (2020). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE TPAMI.
  2. Jegou, H., Douze, M., & Schmid, C. (2011). Product Quantization for Nearest Neighbor Search. IEEE TPAMI.
  3. Subramanya, S. J., et al. (2019). DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. NeurIPS.
  4. Indyk, P., & Motwani, R. (1998). Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC.