Alert button
Picture for Harsha Vardhan Simhadri

Harsha Vardhan Simhadri

Alert button

Scaling Graph-Based ANNS Algorithms to Billion-Size Datasets: A Comparative Analysis

May 07, 2023
Magdalen Dobson, Zheqi Shen, Guy E. Blelloch, Laxman Dhulipala, Yan Gu, Harsha Vardhan Simhadri, Yihan Sun

Figure 1 for Scaling Graph-Based ANNS Algorithms to Billion-Size Datasets: A Comparative Analysis
Figure 2 for Scaling Graph-Based ANNS Algorithms to Billion-Size Datasets: A Comparative Analysis
Figure 3 for Scaling Graph-Based ANNS Algorithms to Billion-Size Datasets: A Comparative Analysis
Figure 4 for Scaling Graph-Based ANNS Algorithms to Billion-Size Datasets: A Comparative Analysis

Algorithms for approximate nearest-neighbor search (ANNS) have been the topic of significant recent interest in the research community. However, evaluations of such algorithms are usually restricted to a small number of datasets with millions or tens of millions of points, whereas real-world applications require algorithms that work on the scale of billions of points. Furthermore, existing evaluations of ANNS algorithms are typically heavily focused on measuring and optimizing for queries-per second (QPS) at a given accuracy, which can be hardware-dependent and ignores important metrics such as build time. In this paper, we propose a set of principled measures for evaluating ANNS algorithms which refocuses on their scalability to billion-size datasets. These measures include ability to be efficiently parallelized, build times, and scaling relationships as dataset size increases. We also expand on the QPS measure with machine-agnostic measures such as the number of distance computations per query, and we evaluate ANNS data structures on their accuracy in more demanding settings required in modern applications, such as evaluating range queries and running on out-of-distribution data. We optimize four graph-based algorithms for the billion-scale setting, and in the process provide a general framework for making many incremental ANNS graph algorithms lock-free. We use our framework to evaluate the aforementioned graph-based ANNS algorithms as well as two alternative approaches.

Viaarxiv icon

Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search

May 08, 2022
Harsha Vardhan Simhadri, George Williams, Martin Aumüller, Matthijs Douze, Artem Babenko, Dmitry Baranchuk, Qi Chen, Lucas Hosseini, Ravishankar Krishnaswamy, Gopal Srinivasa, Suhas Jayaram Subramanya, Jingdong Wang

Figure 1 for Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search
Figure 2 for Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search
Figure 3 for Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search
Figure 4 for Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search

Despite the broad range of algorithms for Approximate Nearest Neighbor Search, most empirical evaluations of algorithms have focused on smaller datasets, typically of 1 million points~\citep{Benchmark}. However, deploying recent advances in embedding based techniques for search, recommendation and ranking at scale require ANNS indices at billion, trillion or larger scale. Barring a few recent papers, there is limited consensus on which algorithms are effective at this scale vis-\`a-vis their hardware cost. This competition compares ANNS algorithms at billion-scale by hardware cost, accuracy and performance. We set up an open source evaluation framework and leaderboards for both standardized and specialized hardware. The competition involves three tracks. The standard hardware track T1 evaluates algorithms on an Azure VM with limited DRAM, often the bottleneck in serving billion-scale indices, where the embedding data can be hundreds of GigaBytes in size. It uses FAISS~\citep{Faiss17} as the baseline. The standard hardware track T2 additional allows inexpensive SSDs in addition to the limited DRAM and uses DiskANN~\citep{DiskANN19} as the baseline. The specialized hardware track T3 allows any hardware configuration, and again uses FAISS as the baseline. We compiled six diverse billion-scale datasets, four newly released for this competition, that span a variety of modalities, data types, dimensions, deep learning models, distance functions and sources. The outcome of the competition was ranked leaderboards of algorithms in each track based on recall at a query throughput threshold. Additionally, for track T3, separate leaderboards were created based on recall as well as cost-normalized and power-normalized query throughput.

Viaarxiv icon

FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search

May 20, 2021
Aditi Singh, Suhas Jayaram Subramanya, Ravishankar Krishnaswamy, Harsha Vardhan Simhadri

Figure 1 for FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search
Figure 2 for FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search
Figure 3 for FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search
Figure 4 for FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search

Approximate nearest neighbor search (ANNS) is a fundamental building block in information retrieval with graph-based indices being the current state-of-the-art and widely used in the industry. Recent advances in graph-based indices have made it possible to index and search billion-point datasets with high recall and millisecond-level latency on a single commodity machine with an SSD. However, existing graph algorithms for ANNS support only static indices that cannot reflect real-time changes to the corpus required by many key real-world scenarios (e.g. index of sentences in documents, email, or a news index). To overcome this drawback, the current industry practice for manifesting updates into such indices is to periodically re-build these indices, which can be prohibitively expensive. In this paper, we present the first graph-based ANNS index that reflects corpus updates into the index in real-time without compromising on search performance. Using update rules for this index, we design FreshDiskANN, a system that can index over a billion points on a workstation with an SSD and limited memory, and support thousands of concurrent real-time inserts, deletes and searches per second each, while retaining $>95\%$ 5-recall@5. This represents a 5-10x reduction in the cost of maintaining freshness in indices when compared to existing methods.

* 19 pages, 22 figures 
Viaarxiv icon

DROCC: Deep Robust One-Class Classification

Feb 28, 2020
Sachin Goyal, Aditi Raghunathan, Moksh Jain, Harsha Vardhan Simhadri, Prateek Jain

Figure 1 for DROCC: Deep Robust One-Class Classification
Figure 2 for DROCC: Deep Robust One-Class Classification
Figure 3 for DROCC: Deep Robust One-Class Classification
Figure 4 for DROCC: Deep Robust One-Class Classification

Classical approaches for one-class problems such as one-class SVM (Scholkopf et al., 1999) and isolation forest (Liu et al., 2008) require careful feature engineering when applied to structured domains like images. To alleviate this concern, state-of-the-art methods like DeepSVDD (Ruff et al., 2018) consider the natural alternative of minimizing a classical one-class loss applied to the learned final layer representations. However, such an approach suffers from the fundamental drawback that a representation that simply collapses all the inputs minimizes the one class loss; heuristics to mitigate collapsed representations provide limited benefits. In this work, we propose Deep Robust One Class Classification (DROCC) method that is robust to such a collapse by training the network to distinguish the training points from their perturbations, generated adversarially. DROCC is motivated by the assumption that the interesting class lies on a locally linear low dimensional manifold. Empirical evaluation demonstrates DROCC's effectiveness on two different one-class problem settings and on a range of real-world datasets across different domains - images(CIFAR and ImageNet), audio and timeseries, offering up to 20% increase in accuracy over the state-of-the-art in anomaly detection.

Viaarxiv icon

RNNPool: Efficient Non-linear Pooling for RAM Constrained Inference

Feb 27, 2020
Oindrila Saha, Aditya Kusupati, Harsha Vardhan Simhadri, Manik Varma, Prateek Jain

Figure 1 for RNNPool: Efficient Non-linear Pooling for RAM Constrained Inference
Figure 2 for RNNPool: Efficient Non-linear Pooling for RAM Constrained Inference
Figure 3 for RNNPool: Efficient Non-linear Pooling for RAM Constrained Inference
Figure 4 for RNNPool: Efficient Non-linear Pooling for RAM Constrained Inference

Pooling operators are key components in most Convolutional Neural Networks (CNNs) as they serve to downsample images, aggregate feature information, and increase receptive field. However, standard pooling operators reduce the feature size gradually to avoid significant loss in information via gross aggregation. Consequently, CNN architectures tend to be deep, computationally expensive and challenging to deploy on RAM constrained devices. We introduce RNNPool, a novel pooling operator based on Recurrent Neural Networks (RNNs), that efficiently aggregate features over large patches of an image and rapidly downsamples its size. Our empirical evaluation indicates that an RNNPool layer(s) can effectively replace multiple blocks in a variety of architectures such as MobileNets (Sandler et al., 2018), DenseNet (Huang et al., 2017) and can be used for several vision tasks like image classification and face detection. That is, RNNPool can significantly decrease computational complexity and peak RAM usage for inference, while retaining comparable accuracy. Further, we use RNNPool to construct a novel real-time face detection method that achieves state-of-the-art MAP within computational budget afforded by a tiny Cortex M4 microcontroller with ~256 KB RAM.

* 16 pages, 8 figures 
Viaarxiv icon