The jigsaw puzzle of setting up a flexible and affordable Business Intelligence environment in a young, dynamic and ever-evolving industry
Anghami is the leading music streaming service in the MENA region, with more than 60 million users and a music catalogue of more than 30 million songs. What seemed like science fiction a decade ago is now simply taken for granted: anyone is just one click or tap away from playing whatever they feel like, whenever they wish and from virtually anywhere. And legally! Whether through our iOS/Android mobile app, our Mac/Windows desktop app, our website or even our brand-new Progressive Web App (PWA).
Needless to say, behind the scenes, billions of events are generated by such actions. That’s a lot of raw data to store, process, cleanse, transform, aggregate, store (again) and analyze…
The (not so) difficult choice of the right architecture
I have been working as a Business Intelligence consultant for the past 18 years, mostly in the Telecom and Banking industries. Each BI project would be nicely wrapped, would cost hundreds of man-days and would be delivered by a fully packed team of well-trained experts. We had everything from developers to Project Directors, including business analysts, functional and technical analysts, architects, DBAs, a bunch of junior and senior back-end and front-end developers, technical experts, testers and project managers, plus a well-detailed plan, Gantt charts, functional and technical specifications, use cases, test scenarios, team meetings every day and project committees every other day… Just name it and we’ll get it. Actually don’t bother, we already have it.
In parallel, clients would go for well-renowned software solutions, wisely choosing the best ETL tool, OLAP engine, databases, reporting and data mining tools and so on. And that costs a lot of money, CAPEX- and OPEX-wise. And I mean A LOT, probably as much as the GDP of a small country.
Oracle, Sybase, Teradata, SAS, MicroStrategy, SAP BusinessObjects, Informatica, IBM DataStage, Netezza, Essbase… These were my roommates at that time.
Designing each solution and choosing the right methodology would depend on the initial requirements, but to keep it simple, a classical BI architecture would look roughly like this:
Some extra layers could be added, some could be omitted or merged; most of them would be physically built, some could be logical/virtual, but basically that’s how it would look.
So… what is it?
In a not-totally-anymore-but-still start-up company such as Anghami, we have a limited number of resources. Most of them are brilliant engineers, definitely experts in what they do, but focused above anything else on delivering the best experience to our users. No time for Gantt charts and detailed functional and technical specifications, no unlimited budget to throw at shiny things, and we can’t afford to lose weeks or months each time we need to change something. We just use our imagination and creativity, iterate endlessly, bringing new features, killing other ones, improving our UI/UX, getting more content and better music recommendations to our users. And that’s it. Mostly.
But to do so, everyone agrees that we need proper analytics… We need feedback from our users. Indirectly, that is, through all this valuable data we’re idly storing. There is no point in walking blindly through a minefield: it hurts.
So to summarize:
- We have a limited number of resources, and most of them are focused on bringing the best music experience to our users.
- We have a limited budget and we need to spend it wisely.
- Things go fast and change every other day. We need to be able to quickly change and adapt.
- We have a Data team composed of “soft-handed” data scientists 😉 They are eager to crunch data but are not much into cooking the raw, unshaved kind… And our Data Engineers are quite busy improving our operational databases.
So let’s make it simple and use what the cloud jungle has to offer (we opted for AWS, but that’s another subject):
- A rather convenient, cheap and virtually unlimited storage system: Amazon S3. That will be our Data Lake. Most of our data is moved there straight from our MySQL operational databases. We then use Spark, Python or R to get whatever we need from there and apply a bunch of obscure and complex algorithms to it (you can check the following article if you’re interested in Machine Learning and music recommendation).
- Amazon Redshift: that’s our Data Warehouse. Data is extracted from S3; some of it is loaded during the night, some in real time. Should we go for a Star Schema? A Data Vault model? B. Inmon vs. R. Kimball vs. Dan Linstedt? All of them? Actually no. We decided to keep more or less the same ER structure we have in our operational database (mostly, but I’ll keep it simple). The ETL flows mainly focus on data cleansing, and almost no aggregations or business rules are applied here. Let’s leave the hard work to the “petabyte-scale” database engine (see the sketch right after this list).
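To make this a little more concrete, here is a minimal, hypothetical sketch of what such an S3-to-Redshift flow could look like. The bucket names, table name, IAM role and connection details are made up for illustration, and this is not our actual code, only the idea: a light cleansing pass in Spark, Parquet files back to S3, then a plain Redshift COPY so the engine does the heavy lifting.

```python
# Hypothetical sketch: cleanse raw play events from S3 with Spark, write Parquet,
# then let Redshift do the heavy lifting via a plain COPY.
# Bucket names, table names, role and credentials are illustrative only.
from pyspark.sql import SparkSession, functions as F
import psycopg2

spark = SparkSession.builder.appName("plays-cleansing").getOrCreate()

# 1. Read a day of raw exports from the data lake
raw = spark.read.json("s3a://example-data-lake/raw/plays/2018-06-01/")

# 2. Light cleansing only: drop obvious junk, normalise types, deduplicate
clean = (
    raw.filter(F.col("user_id").isNotNull() & F.col("song_id").isNotNull())
       .withColumn("played_at", F.to_timestamp("played_at"))
       .dropDuplicates(["user_id", "song_id", "played_at"])
)

# 3. Write the cleansed data back to S3 as Parquet
clean.write.mode("overwrite").parquet("s3a://example-data-lake/clean/plays/2018-06-01/")

# 4. Load it into Redshift with COPY; no aggregations, no business rules here
conn = psycopg2.connect(
    host="example-cluster.redshift.amazonaws.com",
    port=5439, dbname="dwh", user="etl_user", password="***",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY plays
        FROM 's3://example-data-lake/clean/plays/2018-06-01/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
        FORMAT AS PARQUET;
    """)
conn.close()
```

Keeping aggregations and business rules out of this step is deliberate: the warehouse engine (and, further down, the reporting layer) handles them far better than an ETL job that would need rewriting every time a rule changes.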
Third parties:
- Looker: a web-based reporting tool that offers an intuitive logical model builder, which allows us to hide the (sometimes twisted) physical structure of our operational tables. That’s where we actually apply most of our business rules, design our virtual subject-oriented Data Marts and build all the needed charts and dashboards on top of them. It’s ideal for the technical team to quickly implement a new logical model or change an existing one, and it’s easy for our business users to navigate through all our data and create new dashboards.
- We also use Metabase for quick ad-hoc reporting. It’s an open-source solution that we started using almost at the beginning of our journey. It’s rather limited compared to Looker or other similar solutions, but it’s getting better with each update and it gets the job done.
- We talked here about our operational data, which covers most of what we need, but what about the billions of events we’re getting each day from our app? Storing them in Redshift would drastically increase our database infrastructure cost, yet analyzing our users’ behavior is a must. Here’s where Amplitude comes in handy, and we honestly don’t regret going for it! We just send them a bunch of events (see the sketch below) and they take care of all the rest. Creating cohorts, looking at our users’ retention or segmenting our users can be done in a couple of clicks: it’s a piece of cake, but make sure to have all your attributes and events well documented, or you’ll find your Sales Director stating that you have more active users than there are people on this planet!
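As an illustration, here is a minimal sketch of what sending a single event to Amplitude can look like, assuming their HTTP API v2 (check their docs for the current endpoint and payload format). The API key, user id, event name and properties are made up, and in a real setup the events are fired from the apps and back-end services rather than from a script like this.

```python
# Hypothetical sketch: pushing one event to Amplitude over its HTTP API (v2).
# The API key, user id, event name and properties are made up for illustration.
import time
import requests

AMPLITUDE_API_KEY = "YOUR_API_KEY"  # placeholder

event = {
    "user_id": "user_42",
    "event_type": "play_song",
    "time": int(time.time() * 1000),  # milliseconds since epoch
    "event_properties": {
        "song_id": 123456,
        "source": "playlist",
        "offline": False,
    },
}

resp = requests.post(
    "https://api2.amplitude.com/2/httpapi",
    json={"api_key": AMPLITUDE_API_KEY, "events": [event]},
    timeout=10,
)
resp.raise_for_status()
```

The call itself is trivial; the real work is keeping event names and properties documented and consistent, otherwise the same human quickly ends up counted as several “active users”.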
That’s mostly it
That’s what our BI architecture roughly looks like today, keeping in mind that we’re actively working on it and improving it day by day. Obviously, I didn’t include everything here: some pipelines are missing, as well as some other tools we use such as our ads delivery system or the different solutions needed for our marketing campaigns, but I didn’t want to overload it for the sake of clarity.
I really hope you enjoyed reading this article, and if you’re interested in what’s going on at Anghami you can definitely check some of my colleagues’ publications (Helmi Rifai, Salim Batlouni, Aziz Antoun, Ramzi Karam, Charbel Khadra, Elias El Khoury). It’s worth the ride!
Join the fun: https://www.anghami.com/careers