17. Benchmark Harness¶

How do you know which algorithm is best for your data? This chapter covers benchmark methodology, standard datasets, and how to build a fair evaluation harness.

17.1 The ANN-Benchmarks Framework¶

ANN-Benchmarks is the standard evaluation framework. It measures the recall vs. QPS Pareto frontier.

Key Metrics¶

Metric	Definition	Interpretation
Recall@k	$\frac{\\|\text{ANN}(q,k) \cap \text{BF}(q,k)\\|}{k}$	Search quality
QPS	Queries per second	Throughput
Build time	Wall-clock index construction	One-time cost
Index size	Memory footprint (bytes)	Resource cost
p99 latency	99^th percentile query time	Tail performance

The Pareto Frontier¶

The most informative view is Recall vs. QPS at fixed $k$:

flowchart LR
    A["Low ef → High QPS, Low Recall"] --> B["High ef → Low QPS, High Recall"]
    style A fill:#ff6b6b
    style B fill:#51cf66

An algorithm dominates another if it achieves higher recall at the same QPS (or higher QPS at the same recall).

17.2 Standard Datasets¶

Dataset	Vectors	Dimensions	Metric	Size	Use Case
SIFT1M	1M	128	L2	512 MB	Quick smoke test
SIFT1B	1B	128	L2	512 GB	Billion-scale
GloVe-200	1.2M	200	Cosine	960 MB	NLP embeddings
MNIST	60K	784	L2	188 MB	High-dimensional
Deep1B	1B	96	L2	384 GB	CNN features
LAION-5B	5B	768	Cosine	15 TB	CLIP embeddings

SIFT1M Format¶

# Binary format: 4-byte int (dim), then dim × 4-byte floats per vector
# File: sift_base.fvecs (128-dim × 1M = 512 MB)

17.3 Fair Evaluation Methodology¶

Common Pitfalls¶

Unfair comparisons to avoid

Tuning one algorithm but not the other — give each the same parameter search budget
Including build time in QPS — deceptive for slow-build, fast-query algorithms
Measuring on the training set — always use separate query vectors
Ignoring memory — an algorithm using 100× more RAM isn't a fair comparison
Single-threaded vs. multi-threaded — report both, or fix thread count

Recommended Protocol¶

Fix hardware: Same machine, same CPU, same memory
Fix dataset: Standard benchmark set (SIFT1M, GloVe)
Parameter sweep: For each algorithm, try 10+ configurations
Report Pareto: Plot recall vs. QPS across all configurations
Report memory: Alongside QPS, report peak RSS
Multiple runs: Average over 3+ runs to reduce variance

17.4 Building Your Own Harness¶

C++ Benchmark Structure¶

#include <chrono>
#include <vector>

struct BenchmarkResult {
    float recall;
    float qps;
    size_t memory_bytes;
    float build_time_sec;
    float p99_latency_ms;
};

BenchmarkResult run_benchmark(
    /* index */,
    const std::vector<std::vector<float>>& queries,
    const std::vector<std::vector<size_t>>& ground_truth,
    size_t k
) {
    BenchmarkResult result;

    // Measure QPS
    auto start = std::chrono::high_resolution_clock::now();
    size_t total_queries = queries.size();
    size_t correct = 0;

    for (size_t i = 0; i < total_queries; ++i) {
        auto results = /* index.search(queries[i], k) */;

        // Compute recall for this query
        std::set<size_t> gt_set(ground_truth[i].begin(),
                                ground_truth[i].begin() + k);
        for (auto& r : results) {
            if (gt_set.count(r.id)) correct++;
        }
    }

    auto end = std::chrono::high_resolution_clock::now();
    double elapsed = std::chrono::duration<double>(end - start).count();

    result.qps = total_queries / elapsed;
    result.recall = static_cast<float>(correct) / (total_queries * k);
    return result;
}

What to Vary¶

Parameter	HNSW	IVF-PQ	LSH
Build-time	M, ef_construction	nlist, PQ M	num_tables, num_hashes
Query-time	ef_search	nprobe	—

17.5 Industry Benchmarks and Leaderboards¶

Benchmark	Focus	Notable
ANN-Benchmarks	Algorithm comparison	Standard reference
Big-ANN-Benchmarks	Billion-scale	NeurIPS competition
VectorDBBench	Full database comparison	Includes filtering

🛠 Assignment: Build a Recall-vs-QPS Benchmark Harness¶

Now let's turn the pseudo-code from Section 17.4 into a fully working benchmark that uses the HNSW index from the book's src/cpp/ directory. You will sweep the ef_search parameter and print a table showing how recall and throughput trade off.

Your tasks: 1. Implement brute_force_ground_truth() to compute exact k-NN. 2. Implement measure_recall() comparing HNSW results to ground truth. 3. Implement measure_qps() timing batched queries. 4. Sweep ef_search and print a Recall vs QPS table.

Exercise: Recall-vs-QPS Benchmark Harness

// Compile from src/cpp/:
//   g++ -std=c++17 -O2 -o benchmark benchmark_harness.cpp

#include "hnsw.hpp"   // The HNSW index from the book

#include <iostream>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>
#include <set>
#include <iomanip>
#include <cassert>

// ── Step 1: Brute-force ground truth ─────────────────
float l2_dist(const std::vector<float>& a, const std::vector<float>& b) {
    float sum = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        float d = a[i] - b[i]; sum += d * d;
    }
    return sum;
}

std::vector<size_t> brute_force_knn(
    const std::vector<std::vector<float>>& dataset,
    const std::vector<float>& query, size_t k)
{
    std::vector<std::pair<float, size_t>> dists;
    for (size_t i = 0; i < dataset.size(); ++i)
        dists.push_back({l2_dist(dataset[i], query), i});
    std::partial_sort(dists.begin(), dists.begin() + k, dists.end());
    std::vector<size_t> result(k);
    for (size_t i = 0; i < k; ++i) result[i] = dists[i].second;
    return result;
}

// ── Step 2: Measure recall@K ─────────────────────────
float measure_recall(
    HNSWIndex& index,
    const std::vector<std::vector<float>>& queries,
    const std::vector<std::vector<size_t>>& ground_truth,
    size_t k)
{
    size_t total_hits = 0;
    for (size_t i = 0; i < queries.size(); ++i) {
        auto results = index.search(queries[i], k);
        std::set<size_t> gt_set(ground_truth[i].begin(),
                                ground_truth[i].end());
        for (const auto& r : results) {
            if (gt_set.count(r.id)) total_hits++;
        }
    }
    return static_cast<float>(total_hits) / (queries.size() * k);
}

// ── Step 3: Measure QPS ──────────────────────────────
float measure_qps(
    HNSWIndex& index,
    const std::vector<std::vector<float>>& queries,
    size_t k)
{
    auto start = std::chrono::high_resolution_clock::now();
    for (const auto& q : queries) {
        index.search(q, k);
    }
    auto end = std::chrono::high_resolution_clock::now();
    double elapsed = std::chrono::duration<double>(end - start).count();
    return static_cast<float>(queries.size()) / elapsed;
}

int main() {
    const size_t N = 10000, DIM = 64, K = 10;
    const size_t NUM_QUERIES = 200;
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    // Generate dataset
    std::vector<std::vector<float>> dataset(N);
    for (auto& v : dataset) {
        v.resize(DIM);
        for (auto& x : v) x = dist(rng);
    }

    // Generate queries
    std::vector<std::vector<float>> queries(NUM_QUERIES);
    for (auto& q : queries) {
        q.resize(DIM);
        for (auto& x : q) x = dist(rng);
    }

    // Compute ground truth via brute force
    std::vector<std::vector<size_t>> ground_truth(NUM_QUERIES);
    for (size_t i = 0; i < NUM_QUERIES; ++i) {
        ground_truth[i] = brute_force_knn(dataset, queries[i], K);
    }

    // Build HNSW index (M=16, ef_construction=200)
    HNSWIndex index(DIM, /*M=*/16, /*ef_construction=*/200, /*ef_search=*/10);
    index.build(dataset);

    // ── Step 4: Sweep ef_search and print table ──────
    std::vector<size_t> ef_values = {10, 20, 50, 100, 200, 500};

    std::cout << "=== HNSW Benchmark: Recall vs QPS ===" << std::endl;
    std::cout << "Dataset: " << N << " vectors × " << DIM << "D, "
              << NUM_QUERIES << " queries, K=" << K << std::endl;
    std::cout << std::endl;
    std::cout << std::setw(12) << "ef_search"
              << std::setw(12) << "Recall@K"
              << std::setw(12) << "QPS" << std::endl;
    std::cout << std::string(36, '-') << std::endl;

    for (size_t ef : ef_values) {
        index.set_ef_search(ef);
        float recall = measure_recall(index, queries, ground_truth, K);
        float qps    = measure_qps(index, queries, K);

        std::cout << std::setw(12) << ef
                  << std::setw(11) << std::fixed << std::setprecision(1)
                  << recall * 100 << "%"
                  << std::setw(12) << std::setprecision(0)
                  << qps << std::endl;
    }

    // Verify that high ef_search achieves near-perfect recall
    index.set_ef_search(500);
    float high_recall = measure_recall(index, queries, ground_truth, K);
    assert(high_recall >= 0.95f && "ef_search=500 should achieve >= 95% recall");
    std::cout << "\n✅ Assertion passed: high ef_search recall = "
              << high_recall * 100 << "%" << std::endl;

    return 0;
}

Compile and run (from src/cpp/):

g++ -std=c++17 -O2 -o benchmark benchmark_harness.cpp
./benchmark

References¶

Aumüller, M., Bernhardsson, E., & Faithfull, A. (2020). ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. Information Systems.
Simhadri, H., et al. (2022). Results of the NeurIPS'21 Challenge on Billion-Scale ANN Search. NeurIPS.