11. Hardware Acceleration

The inner loop of vector search — distance computation — runs billions of times per second. This chapter covers how hardware features can accelerate it by 10–100×.


11.1 SIMD (Single Instruction, Multiple Data)

AVX2: Process 8 Floats Simultaneously

A single AVX2 instruction operates on 256-bit registers = 8 × float32:

C++ AVX2-Optimized L2 Distance
#if defined(__AVX2__) && defined(__FMA__)  // _mm256_fmadd_ps needs FMA, not just AVX2
#include <immintrin.h>

/**
 * AVX2-optimized L2 squared distance.
 *
 * Processes 8 floats per iteration using 256-bit SIMD registers.
 * ~4-8x faster than scalar on modern CPUs.
 */
float l2_distance_avx2(const float* x, const float* y, size_t d) {
    __m256 sum = _mm256_setzero_ps();
    size_t i = 0;

    // Process 8 floats at a time
    for (; i + 8 <= d; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        __m256 diff = _mm256_sub_ps(vx, vy);
        sum = _mm256_fmadd_ps(diff, diff, sum);  // FMA: sum += diff * diff
    }

    // Horizontal sum of 8 floats in sum register
    __m128 hi = _mm256_extractf128_ps(sum, 1);
    __m128 lo = _mm256_castps256_ps128(sum);
    __m128 sum128 = _mm_add_ps(lo, hi);
    sum128 = _mm_hadd_ps(sum128, sum128);
    sum128 = _mm_hadd_ps(sum128, sum128);
    float result = _mm_cvtss_f32(sum128);

    // Handle remaining elements
    for (; i < d; ++i) {
        float diff = x[i] - y[i];
        result += diff * diff;
    }

    return result;
}
#endif
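The speedup figures below are measured against a plain scalar loop. For reference, a minimal scalar baseline (also the natural fallback when AVX2 is unavailable; `l2_distance_scalar` is an illustrative name, not from the chapter's code) might look like:

```cpp
#include <cstddef>

// Scalar reference implementation of L2 squared distance.
// Compilers may auto-vectorize this loop, but explicit intrinsics
// give more predictable codegen across toolchains.
float l2_distance_scalar(const float* x, const float* y, size_t d) {
    float sum = 0.0f;
    for (size_t i = 0; i < d; ++i) {
        float diff = x[i] - y[i];
        sum += diff * diff;
    }
    return sum;
}
```

At dispatch time, one common pattern is to select the AVX2 path when the CPU reports support (e.g. GCC/Clang's `__builtin_cpu_supports("avx2")`) and fall back to this scalar version otherwise.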

Performance Impact

Implementation               Throughput (768-dim)   Speedup
Scalar loop                  ~50M dist/sec          1×
AVX2                         ~350M dist/sec         7×
AVX-512 (where available)    ~600M dist/sec         12×

Key SIMD Intrinsics

Intrinsic          Width     Operation           Use
_mm256_sub_ps      8 × f32   Subtraction         diff = x - y
_mm256_fmadd_ps    8 × f32   Fused multiply-add  sum += diff * diff
_mm256_dp_ps       8 × f32   Dot product         Inner product
_mm_popcnt_u64     64-bit    Population count    Hamming distance
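The last row of the table uses population count for Hamming distance over binary codes. A portable sketch (an illustrative helper, using the GCC/Clang builtin, which compiles to the same POPCNT instruction as the `_mm_popcnt_u64` intrinsic on x86-64):

```cpp
#include <cstddef>
#include <cstdint>

// Hamming distance between two binary codes packed into 64-bit words.
// XOR marks the differing bits; popcount counts them per word.
uint64_t hamming_distance(const uint64_t* a, const uint64_t* b,
                          size_t n_words) {
    uint64_t dist = 0;
    for (size_t i = 0; i < n_words; ++i) {
        dist += __builtin_popcountll(a[i] ^ b[i]);  // count differing bits
    }
    return dist;
}
```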

11.2 GPU Acceleration

When GPUs Help

GPUs excel at batch queries over large datasets:

$$ \text{GPU speedup} \propto \frac{\text{parallelizable work}}{\text{data transfer cost}} $$

FAISS GPU can process 10K+ queries/sec at 1M vectors — but data must fit in GPU memory (24–80 GB).

flowchart LR
    subgraph GPU
        SM1[SM 0: 32 queries] --> L2[Shared L2 Cache]
        SM2[SM 1: 32 queries] --> L2
        L2 --> HBM[HBM: Vectors + Index]
    end
    CPU --> PCIe --> GPU
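The transfer-vs-compute trade-off can be sketched as a toy cost model: the PCIe transfer is paid once per batch, while compute is paid per query. Every constant and name here (`gpu_us_per_query`) is an illustrative assumption, not a measurement:

```cpp
#include <cstddef>

// Toy cost model for the GPU speedup proportionality: transfer cost is
// amortized over the batch, so cost per query falls toward the pure
// GPU compute cost as the batch grows.
double gpu_us_per_query(size_t batch_size, double transfer_us,
                        double gpu_compute_us) {
    return transfer_us / static_cast<double>(batch_size) + gpu_compute_us;
}
```

With an assumed 100 μs PCIe round trip and 1 μs of GPU compute per query, a single query costs 101 μs (a CPU at, say, 20 μs easily wins), while a batch of 100 amortizes to 2 μs per query — the same intuition as the batch-size row in the decision matrix.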

CPU vs. GPU Decision Matrix

Factor               Prefer CPU         Prefer GPU
Query volume         < 100 QPS          > 1000 QPS
Batch size           1 (single query)   100+ (batch)
Index updates        Frequent           Rare (batch rebuild)
Memory capacity      > 80 GB            ≤ GPU memory
Latency requirement  < 1 ms             < 10 ms acceptable

11.3 FPGA

Field-Programmable Gate Arrays offer deterministic latency — no OS jitter, no garbage collection:

Aspect       GPU                FPGA
Throughput   Very high          High
Latency      Variable (μs–ms)   Deterministic (ns–μs)
Power        300–700 W          20–75 W
Programming  CUDA (easy)        Verilog/HLS (hard)
Flexibility  High               Low (post-synthesis)

Microsoft's Catapult project used FPGAs for Bing search acceleration.


11.4 NUMA-Aware Design

On multi-socket servers, accessing remote NUMA memory costs 2–3× more than local:

$$ \text{Remote access latency} \approx 2\text{–}3 \times \text{local access latency} $$

NUMA trap

A naïve HNSW implementation that allocates vectors across NUMA nodes will see ~40% latency regression vs. NUMA-aware placement.

NUMA-Aware Strategies

  1. Pin threads to NUMA nodes; keep their data local
  2. Shard by NUMA node — each node has its own index partition
  3. Interleave for read-heavy workloads where all threads access all data
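Strategy 1 can be sketched on Linux with `sched_setaffinity`; a full implementation would pin to a node's entire CPU mask (looked up with, e.g., libnuma's `numa_node_to_cpus`), which is omitted here. Note that under Linux's default first-touch policy, memory lands on the node of the thread that first writes it — so vectors should be initialized *after* pinning:

```cpp
#include <sched.h>  // sched_setaffinity, CPU_SET (Linux-specific)

// Pin the calling thread to a single CPU. An illustrative helper: a
// NUMA-aware index would pass the full CPU mask of one node instead.
bool pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 = calling thread. Memory this thread first-touches afterwards
    // is allocated on the pinned CPU's NUMA node.
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```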

11.5 RDMA (Remote Direct Memory Access)

For distributed vector databases, RDMA enables direct memory-to-memory data transfer between nodes, bypassing the OS kernel:

Metric           TCP/IP       RDMA
CPU involvement  Both sides   Zero-copy
Latency          50–100 μs    1–3 μs
Throughput       10–25 Gbps   100+ Gbps

Used in high-performance vector search clusters where shard-to-shard communication is the bottleneck.


References

  1. Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs (FAISS). IEEE Transactions on Big Data.
  2. Intel. Intel Intrinsics Guide. https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
  3. Putnam, A., et al. (2014). A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services (Catapult). ISCA.