
14. IVF-PQ on GPUs

GPU-accelerated IVF-PQ is the workhorse of billion-scale vector search. This chapter shows how FAISS-style GPU kernels achieve 10,000+ QPS on a single GPU.


14.1 GPU Memory Hierarchy Review

flowchart TD
    REG["Registers<br/>~256KB per SM<br/>~0.5 cycle"] --> SMEM["Shared Memory<br/>64-228KB per SM<br/>~5 cycles"]
    SMEM --> L2["L2 Cache<br/>4-40MB<br/>~30 cycles"]
    L2 --> HBM["HBM/GDDR<br/>16-80GB<br/>~500 cycles"]

Design principle

Keep the per-query distance lookup table in shared memory; it fits easily: $M \times K \times 4 = 8 \times 256 \times 4 = 8192$ bytes (8 KB). Scan the PQ codes themselves from HBM, where they are compact ($M$ bytes per vector).
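
A quick host-side sanity check of this budget (a sketch; `M`, `K`, and `device` stand for the PQ parameters and GPU index from the surrounding text):

#include <cassert>
#include <cuda_runtime.h>

// Verify the per-query distance table fits in one block's shared memory.
// For M = 8, K = 256 the table is 8 * 256 * 4 = 8192 bytes (8 KB).
void check_table_fits(int M, int K, int device) {
    int max_smem_per_block = 0;
    cudaDeviceGetAttribute(&max_smem_per_block,
                           cudaDevAttrMaxSharedMemoryPerBlock, device);
    size_t table_bytes = (size_t)M * K * sizeof(float);
    assert(table_bytes <= (size_t)max_smem_per_block);   // 8 KB vs. 48+ KB available
}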


14.2 IVF-PQ on GPU: Execution Model

Step 1: Coarse Quantizer (Find Nearest Centroids)

// Each thread block processes one query
// Find nprobe nearest centroids from nlist candidates
__global__ void find_nearest_centroids(
    const float* queries,     // [nq × d]
    const float* centroids,   // [nlist × d]
    int* nearest_cells,       // [nq × nprobe]
    float* centroid_dists,    // [nq × nprobe]
    int nq, int nlist, int d, int nprobe
);
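
The body of this kernel is not shown above. A minimal sketch of one way to implement it follows; it assumes one block per query and a caller-provided scratch buffer `all_dists` (an extra argument not in the declaration above), and the serial argmin selection by thread 0 is a deliberate simplification. Production kernels (e.g., FAISS) use parallel k-selection instead.

#include <cfloat>

// Sketch only: brute-force distances to all centroids, then a naive
// nprobe-pass argmin selection. Not a production k-selection kernel.
__global__ void find_nearest_centroids_sketch(
    const float* queries,     // [nq × d]
    const float* centroids,   // [nlist × d]
    float* all_dists,         // [nq × nlist] scratch (assumed extra argument)
    int* nearest_cells,       // [nq × nprobe]
    float* centroid_dists,    // [nq × nprobe]
    int nq, int nlist, int d, int nprobe
) {
    int qid = blockIdx.x;
    const float* q = &queries[qid * d];
    float* dists = &all_dists[(size_t)qid * nlist];

    // Cooperatively compute the squared L2 distance to every centroid.
    for (int c = threadIdx.x; c < nlist; c += blockDim.x) {
        float dist = 0.0f;
        for (int i = 0; i < d; ++i) {
            float diff = q[i] - centroids[c * d + i];
            dist += diff * diff;
        }
        dists[c] = dist;
    }
    __syncthreads();

    // Naive selection: nprobe serial argmin passes by thread 0.
    if (threadIdx.x == 0) {
        for (int p = 0; p < nprobe; ++p) {
            int best = -1;
            float best_dist = FLT_MAX;
            for (int c = 0; c < nlist; ++c) {
                if (dists[c] < best_dist) { best_dist = dists[c]; best = c; }
            }
            nearest_cells[qid * nprobe + p] = best;
            centroid_dists[qid * nprobe + p] = best_dist;
            dists[best] = FLT_MAX;  // exclude from subsequent passes
        }
    }
}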

Step 2: Build Distance Tables

For each query, precompute the $M \times K$ table of distances between the query's subvectors and every codeword:

$$ \text{dt}[m][k] = \|q^{(m)} - c_k^{(m)}\|^2, \quad m \in [0, M), \; k \in [0, K) $$

where $q^{(m)}$ is the $m$-th subvector of the query and $c_k^{(m)}$ is the $k$-th codeword of subquantizer $m$.

// Kernel: one thread block per query
// Each thread computes distances for a subset of (m, k) pairs
// The table is staged in shared memory, then written out to global memory
// so the scan kernel in step 3 can reload it per block
__global__ void compute_distance_tables(
    const float* queries,       // [nq × d]
    const float* codebooks,     // [M × K × ds]
    float* distance_tables,     // [nq × M × K]
    int nq, int M, int K, int ds
) {
    extern __shared__ float shared_dt[];  // M × K floats

    int qid = blockIdx.x;
    int tid = threadIdx.x;

    // Each thread handles ceil(M*K / blockDim) entries
    for (int idx = tid; idx < M * K; idx += blockDim.x) {
        int m = idx / K;
        int k = idx % K;
        float dist = 0.0f;
        for (int dd = 0; dd < ds; ++dd) {
            float diff = queries[qid * (M * ds) + m * ds + dd]
                       - codebooks[m * K * ds + k * ds + dd];
            dist += diff * diff;
        }
        shared_dt[idx] = dist;
    }
    __syncthreads();

    // Copy to global memory
    for (int idx = tid; idx < M * K; idx += blockDim.x) {
        distance_tables[qid * M * K + idx] = shared_dt[idx];
    }
}
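
A possible host-side launch for this kernel (a sketch; `d_queries`, `d_codebooks`, and `d_distance_tables` are assumed device pointers, and 256 threads per block is an arbitrary but reasonable choice): one block per query, with the $M \times K$-float table allocated as dynamic shared memory.

// Illustrative launch configuration: one block per query, 256 threads,
// and M*K floats of dynamic shared memory per block.
int threads = 256;
size_t smem_bytes = (size_t)M * K * sizeof(float);   // 8 × 256 × 4 = 8 KB
compute_distance_tables<<<nq, threads, smem_bytes>>>(
    d_queries, d_codebooks, d_distance_tables, nq, M, K, ds);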

Step 3: Scan PQ Codes

The inner loop — scan all codes in selected Voronoi cells:

// Each thread block handles one (query, probed cell) pair: grid = (nq, nprobe)
// Threads cooperatively scan the cell's PQ codes
__global__ void scan_pq_codes(
    const uint8_t* pq_codes,    // [n × M]
    const float* dist_tables,   // [nq × M × K]
    const int* nearest_cells,   // [nq × nprobe], from step 1
    const int* cell_starts,     // inverted list offsets
    const int* cell_sizes,
    float* top_k_dists,         // [nq × k]
    int* top_k_ids,             // [nq × k]
    int M, int K, int k, int nprobe
) {
    int qid = blockIdx.x;
    int cell = nearest_cells[qid * nprobe + blockIdx.y];

    // Load distance table into shared memory
    extern __shared__ float sdt[];
    for (int i = threadIdx.x; i < M * K; i += blockDim.x)
        sdt[i] = dist_tables[qid * M * K + i];
    __syncthreads();

    // Each thread scans a subset of vectors in this cell
    int start = cell_starts[cell];
    int size = cell_sizes[cell];
    for (int v = threadIdx.x; v < size; v += blockDim.x) {
        float dist = 0.0f;
        const uint8_t* code = &pq_codes[(start + v) * M];
        for (int m = 0; m < M; ++m) {
            dist += sdt[m * K + code[m]];  // Table lookup!
        }
        // Thread-local top-k insertion (heap)
        // ... merge across threads using warp shuffle ...
    }
}
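
The per-thread top-k maintenance and the cross-thread merge are elided above. A minimal sketch of the merge for the special case $k = 1$, using warp shuffles (real implementations, e.g. FAISS's WarpSelect/BlockSelect, generalize this to full top-k with per-thread queues):

// Warp-level argmin reduction: each thread passes in its best (dist, id)
// candidate; after the loop, lane 0 holds the warp-wide minimum.
// Assumes all 32 lanes of the warp are active (mask 0xffffffff).
__device__ void warp_reduce_min(float& dist, int& id) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        float other_dist = __shfl_down_sync(0xffffffff, dist, offset);
        int   other_id   = __shfl_down_sync(0xffffffff, id, offset);
        if (other_dist < dist) { dist = other_dist; id = other_id; }
    }
}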

14.3 Performance Analysis

Arithmetic Intensity

For PQ code scanning, each vector requires:

- $M$ byte loads (PQ codes)
- $M$ float loads (table lookups)
- $M$ float additions

$$ \text{Arithmetic intensity} = \frac{M \text{ FLOPs}}{M + M \times 4 \text{ bytes}} = \frac{1}{5} \text{ FLOP/byte} $$

This is memory-bandwidth bound: HBM bandwidth (roughly 1.5–2 TB/s on an A100) is the limiting factor, not arithmetic throughput.

Throughput Model

$$ \text{QPS} \approx \frac{\text{HBM bandwidth}}{\text{bytes scanned per query}} = \frac{\text{HBM bandwidth}}{n_{\text{probe}} \cdot \frac{n}{n_{\text{list}}} \cdot M} $$

For A100 (2 TB/s), $n = 10^9$, $n_{\text{list}} = 65536$, $n_{\text{probe}} = 64$, $M = 8$:

$$ \text{QPS} \approx \frac{2 \times 10^{12}}{64 \times 15259 \times 8} \approx 256{,}000 \text{ QPS} $$

In practice: ~10K–50K QPS per GPU

Overhead from coarse quantizer, kernel launch, and host-device transfer reduces theoretical throughput.
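
The same estimate as a small host-side helper (illustrative only; it reproduces the idealized bandwidth-bound model above):

// Idealized bandwidth-bound throughput estimate for the PQ scan.
// estimate_qps(2e12, 1e9, 65536, 64, 8) ≈ 2.56e5, matching the ~256K QPS above.
double estimate_qps(double hbm_bytes_per_s, double n, double nlist,
                    double nprobe, double code_bytes) {
    double bytes_scanned_per_query = nprobe * (n / nlist) * code_bytes;
    return hbm_bytes_per_s / bytes_scanned_per_query;
}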


14.4 Multi-GPU Strategies

| Strategy   | Description                        | Scaling                  |
|------------|------------------------------------|--------------------------|
| Replicated | Full index copied to each GPU      | QPS scales linearly      |
| Sharded    | Data partitioned across GPUs       | Capacity scales linearly |
| Hybrid     | Shard data, replicate codebooks    | Best for large datasets  |
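
As a rough illustration of the sharded strategy, the host-side pattern might look like the sketch below. `search_shard_topk` is a hypothetical per-GPU routine (for example, the three kernels above chained together on that GPU's shard); the host concatenates each shard's candidates and keeps the global top-k.

#include <vector>
#include <algorithm>
#include <cuda_runtime.h>

struct Hit { float dist; long id; };

// Hypothetical: runs coarse quantization, table build, and PQ scan on the
// shard resident on GPU `device`, returning that shard's top-k candidates.
std::vector<Hit> search_shard_topk(int device, const float* query, int k);

std::vector<Hit> search_all_shards(int num_gpus, const float* query, int k) {
    std::vector<Hit> merged;
    for (int dev = 0; dev < num_gpus; ++dev) {
        cudaSetDevice(dev);                         // each shard lives on one GPU
        std::vector<Hit> local = search_shard_topk(dev, query, k);
        merged.insert(merged.end(), local.begin(), local.end());
    }
    // Global top-k: keep the k smallest distances across all shards' candidates.
    std::sort(merged.begin(), merged.end(),
              [](const Hit& a, const Hit& b) { return a.dist < b.dist; });
    if (merged.size() > (size_t)k) merged.resize(k);
    return merged;
}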

References

  1. Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.
  2. FAISS Wiki. GPU FAQ. https://github.com/facebookresearch/faiss/wiki/