Appendix C — Annotated Bibliography

Foundational Algorithms

Indyk, P., & Motwani, R. (1998). Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC. — Introduced the theoretical framework for LSH and $(1+\epsilon)$-approximate NN.
Malkov, Y. A., & Yashunin, D. A. (2020). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE TPAMI. — The HNSW paper. Basis for most production vector database indexes.
Jégou, H., Douze, M., & Schmid, C. (2011). Product Quantization for Nearest Neighbor Search. IEEE TPAMI. — Introduced PQ and ADC. Foundation of IVF-PQ used in FAISS.
Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale Similarity Search with GPUs. IEEE TBD. — FAISS paper. GPU-accelerated IVF-PQ and flat search.
Subramanya, S. J., et al. (2019). DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. NeurIPS. — SSD-resident graph search. Single-machine billion-scale.
Guo, R., et al. (2020). Accelerating Large-Scale Inference with Anisotropic Vector Quantization. ICML. — ScaNN: learned quantization that optimizes ranking, not just reconstruction.

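Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT. — Theoretical and empirical study of distance concentration under $L_k$ norms; argues that lower-order norms, including fractional ones, retain more contrast in high dimensions.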
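Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Mathematics. — The JL lemma: a random projection of $n$ points into $O(\epsilon^{-2} \log n)$ dimensions preserves all pairwise distances to within a factor of $1 \pm \epsilon$ with high probability.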
Blum, A., Hopcroft, J., & Kannan, R. (2020). Foundations of Data Science. Cambridge University Press. — Textbook with a chapter on high-dimensional geometry.

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP. — Made BERT practical for semantic similarity search.
Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML. — CLIP: shared text-image embedding space.
Muennighoff, N., et al. (2023). MTEB: Massive Text Embedding Benchmark. EACL. — Standard benchmark for comparing embedding models.
Kusupati, A., et al. (2022). Matryoshka Representation Learning. NeurIPS. — Variable-dimension embeddings: the leading coordinates of the full vector form a usable lower-dimensional embedding.

Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. — Introduced RAG: combining retrieval with generation.
Gao, L., et al. (2023). Precise Zero-Shot Dense Retrieval without Relevance Labels. ACL. — HyDE: generate a hypothetical document for the query with an LLM and retrieve with its embedding instead of the query's.
Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR. — Multi-vector retrieval with MaxSim scoring.

Wang, J., et al. (2021). Milvus: A Purpose-Built Vector Data Management System. SIGMOD. — Architecture of a production vector database.
Gollapudi, S., et al. (2023). Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters. WWW. — Combining metadata filtering with graph-based ANN.

Morris, J. X., et al. (2023). Text Embeddings Reveal (Almost) As Much As Text. EMNLP. — Demonstrates embedding inversion attacks.
Dwork, C. (2006). Differential Privacy. ICALP. — Foundational framework for privacy-preserving data analysis.

Aumüller, M., Bernhardsson, E., & Faithfull, A. (2020). ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. Information Systems. — Standard evaluation framework for ANN algorithms.
Simhadri, H., et al. (2022). Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search. NeurIPS 2021 Competitions and Demonstrations Track, PMLR. — Big-ANN-Benchmarks: billion-scale evaluation.