
Blazing Fast Approximate Nearest Neighbour Search On Apache Spark Using HNSW

by Abdallah Moussawi
June 15, 2020
in Engineering
[Figure: Clusters of embeddings]

Most recommender systems deployed in industry today rely on representing items and users as feature vectors. These feature vectors, whether hand-tagged or generated by machine learning models, are embedded in a space where the distance between two points represents the similarity between them.

In the context of music recommendations at Anghami, the feature vectors of songs, albums, artists and users are generated using Collaborative Filtering factorization models, such as Matrix Factorization and Word2Vec, in addition to audio features generated using convolutional neural networks. The feature vectors of these entities are embedded in the same Euclidean space, so finding the similarity between two entities is as simple as applying Cosine Similarity to the vectors representing them.
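To make that concrete, here is a minimal Scala sketch of cosine similarity between two embedding vectors (illustrative code, not our production implementation):

```scala
// Cosine similarity between two embedding vectors:
// dot(a, b) / (||a|| * ||b||); in [-1, 1], higher means more similar.
def cosineSimilarity(a: Array[Float], b: Array[Float]): Double = {
  require(a.length == b.length, "vectors must have the same dimensionality")
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}
```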

For instance, to generate a weekly discovery mixtape for a user, we take that user's feature vector as a query and search for the song feature vectors most similar to it. The most naive way to do this is a brute-force search over all song vectors. However, when we have N millions of candidates and M millions of query vectors, a full search requires O(N×M) operations, on the order of trillions, which is computationally expensive and costs a lot of resources. This calls for solutions that approximate the nearest neighbour search, trading accuracy for speed.
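For reference, the brute-force baseline looks like this (a toy sketch building on cosineSimilarity above; in reality the song catalog would not fit in a single in-memory array):

```scala
// Naive exact search: score every candidate against the query and keep
// the top k. O(N) per query vector, hence O(N x M) for M query vectors.
def bruteForceKnn(query: Array[Float],
                  songs: Array[(Int, Array[Float])], // (songId, features)
                  k: Int): Array[Int] =
  songs
    .map { case (id, vec) => (id, cosineSimilarity(query, vec)) }
    .sortBy(-_._2) // highest similarity first
    .take(k)
    .map(_._1)
```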

We tried two approximate nearest neighbour search solutions: first LSH, a commonly used algorithm for approximate nearest neighbour search, and then HNSW, after its success in serving online recommendations:

Locality-Sensitive Hashing (LSH)

This is a simple and commonly used method for nearest neighbour search. It is based on hashing the feature vectors into buckets: similar feature vectors land in the same bucket with high probability, and the search then only covers the bucket to which the query vector belongs. However, this algorithm didn't work well for us: with a 2x speed-up we saw a 10% drop in accuracy, and with a 5x speed-up we saw a 40% drop. We used Recall as the accuracy metric.

LSH can be tuned to use more than one hash table in order to increase accuracy, but we found that with enough hash tables the computational cost was about the same as a brute-force search.
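For illustration, Apache Spark ML ships an LSH implementation, BucketedRandomProjectionLSH (it hashes for Euclidean distance, and is not necessarily the exact variant we evaluated); numHashTables is the accuracy knob mentioned above:

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lsh-example").getOrCreate()
import spark.implicits._

// Toy catalog of (id, features) rows.
val songs = Seq(
  (0, Vectors.dense(1.0, 0.0, 0.5)),
  (1, Vectors.dense(0.9, 0.1, 0.4)),
  (2, Vectors.dense(0.0, 1.0, 0.0))
).toDF("id", "features")

val lsh = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)
  .setNumHashTables(3) // more tables: higher recall, more computation
  .setInputCol("features")
  .setOutputCol("hashes")

val model = lsh.fit(songs)

// Approximate top-2 neighbours of a query vector.
model.approxNearestNeighbors(songs, Vectors.dense(1.0, 0.05, 0.45), 2).show()
```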

Hierarchical Navigable Small World Graphs (HNSW)

This is a more recently developed algorithm based on the idea of navigable small world graphs, in which the neighbours of a node are similar to that node and are themselves highly likely to be connected to each other. The algorithm achieves logarithmic search complexity, O(log N).

There are two indexing-time parameters, m and efConstruction; higher values of these lead to longer indexing time but can yield higher accuracy. There is one search-time parameter, ef; higher values of it lead to longer search time but higher accuracy. ef is lower-bounded by k, the requested number of nearest neighbours to return, so for small values of k (e.g. k < 200) it is recommended to use larger values of ef (e.g. ef > 200), while for large values of k, setting ef = k was found to work well enough. Note that tuning the index and search parameters depends on the dataset being modeled. More information on the index-time and search-time parameters is provided in the hnswlib documentation.
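That rule of thumb can be captured in one line (chooseEf is a hypothetical helper, not part of hnswlib):

```scala
// ef must be >= k; for small k, raise ef to ~200 for better accuracy.
def chooseEf(k: Int): Int = math.max(k, 200)
```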

We found m=16 and efConstruction=200 to work really well. We tuned these parameters to get the highest possible accuracy while still achieving a good speed-up, reaching over a 20x speed-up at 99% search accuracy.

HNSW On Apache Spark

HNSW is open-sourced as hnswlib and written in C++. To interface it with Scala, we used hnswlib-jna, a Java Native Access (JNA) binding for hnswlib. Although it wasn't written to work natively with Apache Spark, it was simple to integrate as follows (sketches of each piece follow the list):

  1. Generate the HNSW index and store it on disk.
  2. Add the index file to the Spark context, which ships it to all worker nodes.
  3. Partition the query vectors.
  4. For every partition of the query vectors, load the index, then query it for every vector in that partition.

HNSW Index
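The original gist is not reproduced here; below is a minimal sketch of such a wrapper, assuming hnswlib-jna's Index API (SpaceName, initialize, addItem, knnQuery, save, load); the exact signatures may differ between versions:

```scala
import com.stepstone.search.hnswlib.jna.{Index, SpaceName}
import java.nio.file.Paths

// Thin Scala wrapper around the hnswlib-jna native index.
class HnswIndex(dim: Int, maxElements: Int) {

  private val index = new Index(SpaceName.COSINE, dim)

  // m and efConstruction are the indexing-time parameters discussed above.
  def init(m: Int = 16, efConstruction: Int = 200): Unit =
    index.initialize(maxElements, m, efConstruction, 42)

  def add(vector: Array[Float], id: Int): Unit =
    index.addItem(vector, id)

  // Returns (id, distance) pairs for the k approximate nearest neighbours;
  // assumes knnQuery exposes ids and coefficients on its result object.
  def query(vector: Array[Float], k: Int): Array[(Int, Float)] = {
    val result = index.knnQuery(vector, k)
    result.getIds.zip(result.getCoefficients)
  }

  def save(path: String): Unit = index.save(Paths.get(path))

  def load(path: String): Unit = index.load(Paths.get(path), maxElements)
}
```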

HNSW Index Builder
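A sketch of the builder under the same assumptions: it collects the item vectors to the driver (the memory caveat discussed under "Potential Issue" below) and writes the index to disk, covering step 1 of the list above. The parquet path and column names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Step 1: build the HNSW index on the driver and store it on disk.
object HnswIndexBuilder {

  def build(spark: SparkSession, itemsPath: String, indexPath: String, dim: Int): Unit = {
    // Hypothetical schema: integer item id + float-array feature column.
    val items = spark.read.parquet(itemsPath)
      .select("id", "features")
      .collect() // pulls every item vector into the driver process

    val index = new HnswIndex(dim, maxElements = items.length)
    index.init(m = 16, efConstruction = 200) // the values that worked well for us

    items.foreach { row =>
      index.add(row.getSeq[Float](1).toArray, row.getInt(0))
    }

    index.save(indexPath)
  }
}
```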

Example
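And a sketch of the end-to-end Spark job covering steps 2-4: ship the index file to every node with addFile, partition the query vectors, and load and query the index once per partition. Paths, column names, and the dimensionality are all hypothetical:

```scala
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

object HnswSearchJob {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hnsw-ann-search").getOrCreate()
    val sc = spark.sparkContext

    val dim = 128              // embedding dimensionality (assumption)
    val k = 200                // neighbours to return per query
    val maxElements = 10000000 // upper bound on indexed items (assumption)

    // Step 2: distribute the index file to all worker nodes.
    sc.addFile("/path/to/hnsw.index")

    // Step 3: partition the query vectors, e.g. (userId, features) pairs.
    val queries = spark.read.parquet("/path/to/user_vectors.parquet")
      .select("userId", "features")
      .rdd
      .map(row => (row.getInt(0), row.getSeq[Float](1).toArray))

    // Step 4: load the index once per partition, then query every vector.
    val neighbours = queries.mapPartitions { part =>
      val index = new HnswIndex(dim, maxElements)
      index.load(SparkFiles.get("hnsw.index"))
      part.map { case (userId, features) =>
        (userId, index.query(features, k).map(_._1).mkString(","))
      }
    }

    neighbours.saveAsTextFile("/path/to/output")
    spark.stop()
  }
}
```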

Potential Issue

One potential issue with this approach is that it fetches all feature vectors into the Apache Spark driver process, which might require a large amount of memory when there are hundreds of millions of item vectors. One possible solution is to shard the index into different partitions/indices. However, that adds another layer of complexity, so it's a trade-off.
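For completeness, the sharded variant might look like this (purely a sketch of the idea, assuming the HnswIndex wrapper above and that smaller coefficients mean closer neighbours):

```scala
// Query one index per shard of the item catalog and merge the
// partial top-k lists by distance.
def queryShards(shards: Seq[HnswIndex], query: Array[Float], k: Int): Array[Int] =
  shards
    .flatMap(_.query(query, k))
    .sortBy(_._2) // ascending distance: closest first (assumption)
    .take(k)
    .map(_._1)
    .toArray
```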

Check out our other Machine Learning Articles:

  • 30 Million Songs Down to 30
  • Using Word2Vec For Music Recommendations

Follow us on Anghami Tech for more Engineering and Machine Learning Articles from Anghami.

Tags: ML
Abdallah Moussawi

Machine Learning Lead, Anghaminator since 2018
