11. Hardware Acceleration¶
The inner loop of vector search — distance computation — runs billions of times per second. This chapter covers how hardware features can accelerate it by 10–100×.
11.1 SIMD (Single Instruction, Multiple Data)¶
AVX2: Process 8 Floats Simultaneously¶
A single AVX2 instruction operates on 256-bit registers = 8 × float32:
C++ AVX2-optimized L2 distance:

```cpp
#if defined(__AVX2__) && defined(__FMA__)  // compile with -mavx2 -mfma
#include <immintrin.h>
#include <cstddef>

/**
 * AVX2-optimized L2 squared distance.
 *
 * Processes 8 floats per iteration using 256-bit SIMD registers.
 * ~4-8x faster than scalar on modern CPUs.
 */
float l2_distance_avx2(const float* x, const float* y, size_t d) {
    __m256 sum = _mm256_setzero_ps();
    size_t i = 0;
    // Process 8 floats at a time
    for (; i + 8 <= d; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        __m256 diff = _mm256_sub_ps(vx, vy);
        sum = _mm256_fmadd_ps(diff, diff, sum);  // FMA: sum += diff * diff
    }
    // Horizontal sum of the 8 partial sums in the 256-bit register
    __m128 hi = _mm256_extractf128_ps(sum, 1);
    __m128 lo = _mm256_castps256_ps128(sum);
    __m128 sum128 = _mm_add_ps(lo, hi);
    sum128 = _mm_hadd_ps(sum128, sum128);
    sum128 = _mm_hadd_ps(sum128, sum128);
    float result = _mm_cvtss_f32(sum128);
    // Handle the tail when d is not a multiple of 8
    for (; i < d; ++i) {
        float diff = x[i] - y[i];
        result += diff * diff;
    }
    return result;
}
#endif
```
Performance Impact¶
| Implementation | Throughput (768-dim) | Speedup |
|---|---|---|
| Scalar loop | ~50M dist/sec | 1× |
| AVX2 | ~350M dist/sec | 7× |
| AVX-512 (where available) | ~600M dist/sec | 12× |
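Since the AVX-512 row only applies on CPUs that support it, production libraries select a kernel once at startup rather than at compile time. A minimal sketch of that runtime-dispatch pattern, using GCC/Clang's `__builtin_cpu_supports` (the function names here are illustrative, not from any particular library):

```cpp
#include <cstddef>

// Portable scalar fallback kernel.
static float l2_scalar(const float* x, const float* y, size_t d) {
    float sum = 0.0f;
    for (size_t i = 0; i < d; ++i) {
        float diff = x[i] - y[i];
        sum += diff * diff;
    }
    return sum;
}

// Function-pointer type for an L2 distance kernel.
using DistFn = float (*)(const float*, const float*, size_t);

// Pick the widest kernel the running CPU supports. The AVX-512/AVX2
// branches would return the SIMD kernels (e.g. l2_distance_avx2 above)
// when those are compiled in; here they fall through so the sketch is
// self-contained and portable.
DistFn select_l2_kernel() {
    if (__builtin_cpu_supports("avx512f")) {
        // return l2_distance_avx512;  // if compiled in
    }
    if (__builtin_cpu_supports("avx2")) {
        // return l2_distance_avx2;    // if compiled in
    }
    return l2_scalar;
}
```

Resolving the function pointer once and reusing it per query keeps the branch out of the hot loop.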
Key SIMD Instructions for Vector Search¶
| Instruction | Width | Operation | Use |
|---|---|---|---|
| `_mm256_sub_ps` | 8 × f32 | Subtraction | `diff = x - y` |
| `_mm256_fmadd_ps` | 8 × f32 | Fused multiply-add | `sum += diff * diff` |
| `_mm256_dp_ps` | 8 × f32 | Dot product | Inner product |
| `_mm_popcnt_u64` | 64-bit | Population count | Hamming distance |
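The popcount row is the workhorse of binary-code search: Hamming distance between two bit-packed vectors is just XOR followed by a population count. A minimal sketch using the compiler builtin `__builtin_popcountll`, which GCC and Clang lower to the `POPCNT` instruction where available:

```cpp
#include <cstddef>
#include <cstdint>

// Hamming distance between two binary embeddings packed 64 bits per word.
// For a 256-bit code, nwords = 4.
size_t hamming_distance(const uint64_t* a, const uint64_t* b, size_t nwords) {
    size_t dist = 0;
    for (size_t i = 0; i < nwords; ++i) {
        // XOR marks the differing bits; popcount counts them.
        dist += static_cast<size_t>(__builtin_popcountll(a[i] ^ b[i]));
    }
    return dist;
}
```

For a 256-bit code this is four XORs and four popcounts per comparison, which is why binary indexes reach billions of comparisons per second per core.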
11.2 GPU Acceleration¶
When GPUs Help¶
GPUs excel at batch queries over large datasets:
FAISS GPU can process 10K+ queries/sec at 1M vectors — but data must fit in GPU memory (24–80 GB).
GPU Architecture for Vector Search¶
```mermaid
flowchart LR
    subgraph GPU
        SM1[SM 0: 32 queries] --> L2[Shared L2 Cache]
        SM2[SM 1: 32 queries] --> L2
        L2 --> HBM[HBM: Vectors + Index]
    end
    CPU --> PCIe --> GPU
```
CPU vs. GPU Decision Matrix¶
| Factor | Prefer CPU | Prefer GPU |
|---|---|---|
| Query volume | < 100 QPS | > 1000 QPS |
| Batch size | 1 (single query) | 100+ (batch) |
| Index updates | Frequent | Rare (batch rebuild) |
| Memory capacity | > 80 GB | ≤ GPU memory |
| Latency requirement | < 1ms | < 10ms acceptable |
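The batch-size row drives the others: batching lets each database vector be loaded from memory once and reused across the whole query batch, which is exactly the amortization GPUs exploit. The same loop ordering helps on CPUs too. A minimal sketch (the function name is hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Batched inner products: scores[q][n] = <queries[q], db[n]>.
// Putting the database vector in the outer loop means each db vector is
// read from memory once and reused for every query in the batch.
std::vector<std::vector<float>> batch_inner_product(
        const std::vector<std::vector<float>>& queries,
        const std::vector<std::vector<float>>& db) {
    std::vector<std::vector<float>> scores(
        queries.size(), std::vector<float>(db.size(), 0.0f));
    for (size_t n = 0; n < db.size(); ++n) {           // load db[n] once
        for (size_t q = 0; q < queries.size(); ++q) {  // reuse for each query
            float s = 0.0f;
            for (size_t i = 0; i < db[n].size(); ++i)
                s += queries[q][i] * db[n][i];
            scores[q][n] = s;
        }
    }
    return scores;
}
```

With batch size 1 the memory traffic per query is the entire database; with batch size 100 it is amortized 100-fold, which is why the single-query column favors the CPU.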
11.3 FPGA¶
Field-Programmable Gate Arrays offer deterministic latency — no OS jitter, no garbage collection:
| Aspect | GPU | FPGA |
|---|---|---|
| Throughput | Very high | High |
| Latency | Variable (μs–ms) | Deterministic (ns–μs) |
| Power | 300–700W | 20–75W |
| Programming | CUDA (easy) | Verilog/HLS (hard) |
| Flexibility | High | Low (post-synthesis) |
Microsoft's Catapult project used FPGAs for Bing search acceleration.
11.4 NUMA-Aware Design¶
On multi-socket servers, accessing remote NUMA memory costs 2–3× more than local:
NUMA trap
A naïve HNSW implementation that allocates vectors across NUMA nodes will see ~40% latency regression vs. NUMA-aware placement.
NUMA-Aware Strategies¶
- Pin threads to NUMA nodes; keep their data local
- Shard by NUMA node — each node has its own index partition
- Interleave for read-heavy workloads where all threads access all data
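The first strategy can be sketched with Linux's default first-touch policy: if a thread is pinned to a CPU before it allocates and writes its working set, the pages land on that CPU's NUMA node. A minimal, Linux-specific sketch (`sum_on_cpu` and the single-CPU pinning are illustrative; real code would query libnuma for a node's full CPU list):

```cpp
#include <sched.h>   // sched_setaffinity, cpu_set_t (Linux)
#include <cstddef>
#include <vector>

// Pin the calling thread to one CPU, then allocate and first-touch the
// working set from that thread. Under first-touch, the pages are placed
// on the NUMA node of the touching CPU, so later reads stay local.
float sum_on_cpu(unsigned cpu, size_t n) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // Best effort: if pinning fails (e.g. CPU offline), run unpinned.
    sched_setaffinity(0, sizeof(set), &set);

    std::vector<float> data(n, 1.0f);  // first touch happens here, locally
    float sum = 0.0f;
    for (float v : data) sum += v;
    return sum;
}
```

Per-shard worker threads would each call this with a CPU from their own node, so every shard's index partition stays node-local.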
11.5 RDMA (Remote Direct Memory Access)¶
For distributed vector databases, RDMA enables direct memory-to-memory data transfer between nodes, bypassing the OS kernel:
| Aspect | TCP/IP | RDMA |
|---|---|---|
| CPU involvement | Both sides | None (kernel bypass, zero-copy) |
| Latency | 50–100 μs | 1–3 μs |
| Throughput | 10–25 Gbps | 100+ Gbps |
Used in high-performance vector search clusters where shard-to-shard communication is the bottleneck.
References¶
- Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs (FAISS). IEEE Transactions on Big Data.
- Intel. Intel Intrinsics Guide. https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
- Putnam, A., et al. (2014). A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services (Catapult). ISCA.