
Blazing Fast Approximate Nearest Neighbour Search On Apache Spark Using HNSW

by Abdallah Moussawi
June 15, 2020
in Engineering
Clusters Of Embeddings

Most recommender systems deployed in industry today rely on representing items and users as feature vectors. These feature vectors, whether hand-tagged or generated by machine learning models, are embedded in a space where the distance between two points represents the similarity between them.

In the context of music recommendations at Anghami, feature vectors for songs, albums, artists and users are generated using collaborative filtering factorization models, such as Matrix Factorization and Word2Vec, in addition to audio features generated by convolutional neural networks. The feature vectors of these entities are embedded in the same Euclidean space, so finding the similarity between two entities is as simple as applying cosine similarity to the vectors representing them.
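As a minimal sketch of that last step, here is cosine similarity in plain Python, applied to two small hypothetical vectors (real embeddings would have far more dimensions):

```python
# Minimal sketch: cosine similarity between two feature vectors
# embedded in the same space. Vectors here are hypothetical toy data.
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

song = [0.9, 0.1, 0.0]
user = [0.8, 0.2, 0.1]
print(round(cosine_similarity(song, user), 3))  # close to 1.0: very similar
```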

For instance, to generate a weekly discovery mixtape for a user, we take that user's feature vector as a query and search for the song feature vectors most similar to it. The most naive way to do this is a brute-force search over all song vectors. However, with N million candidates and M million query vectors, a full search requires O(N×M) operations, which is on the order of trillions, so it is computationally expensive and costs a lot of resources. This calls for solutions that approximate the nearest neighbour search, trading some accuracy for speed.
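To make the O(N×M) cost concrete, here is the brute-force baseline as a sketch in plain Python with toy data: every query is scored against every candidate, so the work grows with the product of the two set sizes.

```python
# Brute-force nearest neighbour search sketch (toy data): each query
# vector is compared against all N candidates, hence O(N*M) overall.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def brute_force_top_k(query, candidates, k):
    """Score the query against all candidates and keep the k best."""
    order = sorted(range(len(candidates)),
                   key=lambda i: cosine(query, candidates[i]),
                   reverse=True)
    return order[:k]

songs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [0.9, 0.1]]
print(brute_force_top_k([1.0, 0.1], songs, k=2))  # -> [3, 0]
```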

We tried two approximate nearest neighbour search solutions: first LSH, a commonly used algorithm for approximate nearest neighbour search, and then HNSW, after its success in serving online recommendations:

Locality-Sensitive Hashing (LSH)

This is a simple and commonly used method for nearest neighbour search. It hashes the feature vectors into buckets such that similar vectors land in the same bucket with high probability, and at query time it searches only the bucket the query vector falls into. However, this algorithm did not end up working well for us: with a 2x speed-up we saw a 10% drop in accuracy, and with a 5x speed-up a 40% drop. We used recall as the accuracy metric.
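A common way to build such a hash for cosine similarity is random hyperplanes: each vector is keyed by the signs of its projections onto a few random directions. This is a sketch of that idea in plain Python with hypothetical random data, not the exact scheme Anghami used:

```python
# Sketch of random-hyperplane LSH for cosine similarity (toy data).
# Vectors hashing to the same sign pattern land in the same bucket,
# and a query only searches its own bucket.
import random
from collections import defaultdict

random.seed(0)
DIM, NUM_PLANES = 8, 4
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PLANES)]

def hash_vector(v):
    """Bucket key: one sign bit per random hyperplane."""
    return tuple(1 if sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0 else 0
                 for p in planes)

items = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1000)]
buckets = defaultdict(list)
for idx, v in enumerate(items):
    buckets[hash_vector(v)].append(idx)

query = items[42]
candidates = buckets[hash_vector(query)]  # search only this bucket
print(len(candidates), "of", len(items), "candidates searched")
```

The speed-up comes from searching one bucket instead of all items; the accuracy drop comes from true neighbours that happen to hash into other buckets.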

LSH can be tuned to use more than one hash table in order to increase accuracy, but we found that with multiple hash tables its computational performance was the same as doing a brute-force search.

Hierarchical Navigable Small World Graphs (HNSW)

This is a more recently developed algorithm based on the idea of navigable small world graphs, in which the neighbours of a node are similar to that node and are themselves highly likely to be connected to each other. The algorithm achieves logarithmic search complexity, O(log N).
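The core search primitive on such a graph is greedy routing: start at an entry node and repeatedly hop to whichever neighbour is closest to the query, stopping when no neighbour improves. Here is a toy sketch in plain Python using 1-D points for clarity; real HNSW layers several such graphs and keeps a candidate list rather than a single node:

```python
# Toy sketch of greedy search on a navigable small-world graph
# (1-D points for clarity; HNSW uses layered graphs over real vectors).
graph = {
    0: [1, 4],       # each node lists its neighbours
    1: [0, 2, 4],
    2: [1, 3],
    3: [2, 4],
    4: [0, 1, 3],
}
points = {0: 0.0, 1: 2.0, 2: 4.0, 3: 6.0, 4: 8.0}

def greedy_search(entry, query):
    """Hop to the neighbour closest to the query until none improves."""
    current = entry
    while True:
        best = min(graph[current] + [current],
                   key=lambda n: abs(points[n] - query))
        if best == current:
            return current
        current = best

print(greedy_search(entry=0, query=5.7))  # -> 3 (the point at 6.0)
```

Because long-range links let the search cross the graph in few hops, the number of hops grows roughly logarithmically with the number of nodes.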

There are two index-time parameters, m and efConstruction: higher values of these two parameters lead to longer indexing time, but can help achieve higher accuracy. There is one search-time parameter, ef: higher values lead to longer search time but higher accuracy. ef is lower-bounded by k, the requested number of nearest neighbours, so for small values of k (e.g. < 200) it is recommended to use larger values of ef (e.g. > 200), while for large values of k, setting ef = k was found to work well enough. It is important to note that tuning the index and search parameters depends on the dataset being modeled. More information on the index-time and search-time parameters is provided here.

We found m=16 and efConstruction=200 to work really well. We tuned these parameters for the highest accuracy possible while still achieving a good speed-up, reaching over a 20x speed-up at 99% search accuracy.

HNSW On Apache Spark

HNSW is open-sourced as hnswlib and written in C++. To interface it with Scala, we used hnswlib-jna, the Java Native Access binding of hnswlib. Although it was not written to work natively with Apache Spark, it was simple to integrate as follows:

  1. Generate the HNSW index and store it on disk.
  2. Add the index file to the Spark context, which sends it to all worker nodes.
  3. Partition the query vectors.
  4. For every partition of the query vectors, load the index once, then query it for every vector in that partition.
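The four steps above can be sketched as follows, shown here with PySpark and the Python binding of hnswlib for illustration (the article's actual implementation is Scala with hnswlib-jna; the path, dimension and `query_vectors` variable are hypothetical):

```python
# Sketch of the four-step recipe above, using PySpark and the Python
# binding of hnswlib. All names, paths and sizes are hypothetical.
from pyspark import SparkContext, SparkFiles

sc = SparkContext.getOrCreate()

# Step 1 happened offline: the index was built and saved to disk.
sc.addFile("/tmp/hnsw_index.bin")          # step 2: ship the file to all nodes

# query_vectors: iterable of (user_id, vector) pairs, assumed to exist
queries = sc.parallelize(query_vectors, numSlices=64)  # step 3: partition

def query_partition(rows):
    """Step 4: load the index once per partition, then query every vector."""
    import hnswlib
    index = hnswlib.Index(space="cosine", dim=32)
    index.load_index(SparkFiles.get("hnsw_index.bin"))
    index.set_ef(200)
    for user_id, vector in rows:
        labels, _ = index.knn_query([vector], k=100)
        yield user_id, labels[0].tolist()

results = queries.mapPartitions(query_partition)
```

Loading the index once per partition, rather than once per vector, keeps the deserialization cost amortized across all queries in that partition.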

Potential Issue

One potential issue with this approach is that it fetches all feature vectors into the Apache Spark driver process, which might require a large amount of memory when there are hundreds of millions of item vectors. One possible solution is to shard the index into different partitions/indices; however, that adds another layer of complexity, so it is a trade-off.

Check out our other Machine Learning Articles:

  • 30 Million Songs Down to 30
  • Using Word2Vec For Music Recommendations

Follow us on Anghami Tech for more Engineering and Machine Learning Articles from Anghami.

Tags: ML
Abdallah Moussawi

Machine Learning Lead, Anghaminator since 2018

