Blazing Fast Approximate Nearest Neighbour Search On Apache Spark Using HNSW

Most of Recommender Systems deployed in industry today rely on representing the items and users in the form of feature vectors. These feature vectors, whether they were hand-tagged or generated using machine learning models, are embedded in a space where the distance between two points represent the similarity between them.

In the context of music recommendations at Anghami, songs, albums, artists and users features vectors are generated using Collaborative Filtering factorization models, such as Matrix Factorization and Word2Vec, in addition to audio features generated using convolutional neural networks. The features vectors of these entities would be embedded in the same euclidean space, and finding similarity between two entities is as simple as applying Cosine Similarity on the vectors representing those entities.

For instance, in order to generate a weekly discovery mixtape for a user, we need a query user feature vector, and search for the most similar songs feature vectors to that query vector. The most naive way to do that is to do a brute force search on all song vectors. However, when we have N millions of candidates, and M millions of query vectors, doing a full search requires O(NxM) operations, which is in the order of trillions of operations, so it’s computationally expensive, and costs a lot of resources. This requires solutions that approximate the nearest neighbour search with a trade-off between accuracy and speed.

We have tried two approximate neatest neighbour search solutions, first LSH, which is a commonly used algorithm for approximate nearest neighbour search, and then we tried HNSW after its success in online recommendations serving:

Locality-Sensitive Hashing (LSH)

This is a simple method and commonly used for nearest neighbour search. It’s based on hashing the feature vectors into buckets. It hashes similar feature vectors into the same bucket with high probability, and then it only searches in the bucket in which the query vector belongs. However, this algorithm didn’t end up working very well for us, with 2x speed-up, we have seen 10% drop in accuracy, and with 5x speed-up, we have seen 40% drop in accuracy. We used Recall as the accuracy metric.

LSH can be tuned to have more than 1 hash table in order to increase accuracy, but we have found that the computational performance is the same as doing brute force search when having more hash tables.

Hierarchical Navigable Small World Graphs (HNSW)

This is a recently developed algorithm that is based on the idea of navigable small world graphs, where the neighbours of a node are similar to that node and those neighbours are highly likely to be connected to each other. This algorithm guarantees a logarithmic search complexity O(log(N)).

There are two indexing time parameters, m and efConstruction, higher values of these two parameters lead to higher indexing time, but could help in having higher accuracy. There is one search time parameter ef, where higher values of this parameter lead to higher search time but higher accuracy. ef is lower bounded by the requested number of nearest neighbours to be returned k, so for small values of k (i.e < 200), it’s recommended to have higher values of ef (i.e > 200), while for large values of k, having ef=k was found to work good enough. It’s important to note that tuning index and search parameters depends on the dataset being modeled. More information on the index time and search time parameters are provided here.

We have found having m=16 and efConstruction=200 to work really well. We tuned these parameters to have as high accuracy as possible while achieving a good speed-up, where we have reached over 20x speed-up with 99% in search accuracy.

HNSW On Apache Spark

HNSW is open sourced as hnswlib and written in C++. To interface it with Scala, we used hnswlib-jna which is the Java Native Access of hnswlib. Although it wasn’t written to work natively with Apache Spark, it was simple to integrate it as follows:

Generate the HNSW index and store it on disk.
Add the index file to Spark Context which will send the index file to all nodes.
Partition the query vectors.
For every partition of the query vectors, load the index, then query the index for every vector in that partition.

HNSW Index

HNSW Index Builder

Example

Potential Issue

One potential issue with that approach is that it fetches all features into the Apache Spark driver process. This might require a large amount of memory if there are 100s of millions of items vectors. One possible solution for that problem is to shard the index into different partitions/indices. However, that adds another layer of complexity, so it’s a tradeoff.

Check out our other Machine Learning Articles:

Tags: ML