A tale of investigating rising costs in a high-velocity startup
Anghami is the leading music & entertainment platform in the Middle East. We’ve been dubbed the Spotify of the Middle East, though we prefer just Anghami. Millions of users use our services daily to enjoy Music, Music Videos or Expressions.
We have been on the AWS Cloud since Day 0, which was 4 years ago; I joined 2 and a half years ago (and my oh my does time fly), and we mostly love the scale and flexibility we get. We rely on AWS’s SQS (Simple Queue Service) to move data asynchronously back and forth through some of our systems. For those who don’t know how SQS, or Message Queues in general, work: they are systems (or a managed service, in the case of SQS) that you push data, called “Messages”, onto from “Producers”. You then pull those messages off and process them on “Consumers”, which lets you handle tasks asynchronously. “Consumers” usually run the longer tasks, like sending out a welcome email to a New User.
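If that sounds abstract, here is a minimal sketch of the idea in Python with boto3. The queue URL and message body are made up for illustration; our actual wrapper is in-house and looks nothing like this:

```python
import boto3

# Hypothetical queue URL, for illustration only.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/welcome-emails"

sqs = boto3.client("sqs")

# Producer: push a "Message" describing work to be done.
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody='{"task": "welcome_email", "user_id": 42}',
)

# Consumer: pull messages and process them asynchronously.
response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
for message in response.get("Messages", []):
    print("processing", message["Body"])  # e.g. send the welcome email here
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```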
The wrapper and core processing implementation around SQS was written before I joined and worked pretty well at the scale we were running back then. Then something happened: we needed to process more data, so we started adding more “Consumers” latching onto SQS to fetch and process “Messages”. That’s when our SQS costs started going up disproportionately to the increase in the number of “Messages” we were processing! Queue panic mode and investigations.
I opened our AWS Billing Dashboard (it was my first time back then) to check what was going on. I found a tab called “Cost Explorer” which contained a link to the “Daily Spend View”. There I chose the date range I was interested in (before and after the jump in costs and the addition of “Consumers”), then added a filter for the SQS service, as shown above.
Then I noticed the Grouping option and tried out the different choices. The magical one for our case was grouping by API Operation, which makes sense since SQS charges by the number of API calls, not by “Messages”.
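For the curious, you can pull a similar breakdown programmatically through the Cost Explorer API instead of the console. Here is a rough sketch with boto3; the date range is arbitrary, and the exact service name string and dimension keys are from my memory of the API, so double-check them:

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Daily SQS spend, grouped by API operation (the view that cracked the case).
result = ce.get_cost_and_usage(
    TimePeriod={"Start": "2017-01-01", "End": "2017-02-01"},  # pick your own range
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE",
                           "Values": ["Amazon Simple Queue Service"]}},
    GroupBy=[{"Type": "DIMENSION", "Key": "OPERATION"}],
)

for day in result["ResultsByTime"]:
    for group in day["Groups"]:
        print(day["TimePeriod"]["Start"], group["Keys"][0],
              group["Metrics"]["UnblendedCost"]["Amount"])
```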
The chart gave me all the insight I needed to make sense of things: most of our spend was going towards the GetQueueAttributes call (in blue above). Interesting! I opened up the old SQS wrapper core we had and checked for calls to the method that invokes that API. It was being called every time before we tried to fetch messages, to check whether the queue had any contents and sleep if it didn’t. This was weird, but without it we would keep calling the ReceiveMessage API in a tight loop (when this code was initially written, SQS didn’t have long polling). By the time I had adopted the code, SQS did have long polling, so first things first: I updated all our Queues to set the Receive Message Wait Time to the maximum (20 seconds). This basically tells SQS to hold a ReceiveMessage call’s HTTP connection open for up to 20 seconds and return as soon as messages are available, with the limit being 10 messages per call.
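If you want to flip the same switch, it is a single queue attribute. A minimal sketch with boto3 (queue URL made up for illustration):

```python
import boto3

sqs = boto3.client("sqs")

# Enable long polling at the queue level: any ReceiveMessage call that doesn't
# override WaitTimeSeconds will now wait up to 20 seconds for messages.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/example-queue",
    Attributes={"ReceiveMessageWaitTimeSeconds": "20"},
)
```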
Then I updated the code to utilize long polling, which removed our reliance on the GetQueueAttributes API entirely, so I stripped those calls from the code. That was the first sweep, and it cut out the bulk of the cost, which at that point was basically waste.
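The before/after of the consumer loop looked roughly like this. It is a simplified sketch of the logic, not our actual wrapper, and the queue URL is illustrative:

```python
import time
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # illustrative

def old_loop(process):
    # Before: pay for a GetQueueAttributes call on every iteration,
    # just to decide whether to sleep.
    while True:
        attrs = sqs.get_queue_attributes(
            QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
        )
        if int(attrs["Attributes"]["ApproximateNumberOfMessages"]) == 0:
            time.sleep(5)
            continue
        response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        for message in response.get("Messages", []):
            process(message)  # deletion handled separately (more on that below)

def new_loop(process):
    # After: long polling does the waiting for us, no GetQueueAttributes needed.
    while True:
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for message in response.get("Messages", []):
            process(message)  # deletion handled separately (more on that below)
```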
After the rush of victory from my first investigation, I looked into how I could take this further and tune it down. We knew we would always be receiving “Messages” in batches of up to 10 and processing them on the same node. When a message is processed successfully you must call the DeleteMessage API to tell SQS to remove it from its pool of messages (SQS has a safety net that sets a Visibility Timeout on each Message, after which it is released back into the pool of available messages). What came next was batching those deletes into DeleteMessageBatch calls instead of individual DeleteMessage calls, taking care not to cross the Visibility Timeout of the Queues in question. After I changed that code we saw another reduction in cost, since we were now calling DeleteMessageBatch roughly once for every 10 messages (in reality it’s lower, around one call for every 7 Messages, since we don’t always get 10 messages, some fail to process, etc.).
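A sketch of the batched delete, again with boto3 and an illustrative queue URL:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # illustrative

def delete_processed(messages):
    """Delete up to 10 successfully processed messages with a single API call."""
    entries = [
        {"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
        for i, m in enumerate(messages)
    ]
    response = sqs.delete_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
    # A batch call can partially fail, so check the per-entry results.
    for failure in response.get("Failed", []):
        print("failed to delete entry", failure["Id"], failure.get("Message"))
```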
I then went on to batch the SendMessage calls that could be batched, which also brought some reduction there, but by the nature of these systems not much can be batched on the sending side, since messages are generated one by one as our API instances receive requests. We also added autoscaling logic to our “Consumer” tier, and with the above changes we no longer needed to worry; we reduced costs further by running only the number of instances we actually needed to process the message load.
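Where sends could be grouped, the change looked roughly like the sketch below; the entry Ids just need to be unique within a batch, and the bodies here are hypothetical:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # illustrative

def send_batch(bodies):
    """Send up to 10 message bodies with a single SendMessageBatch call."""
    entries = [{"Id": str(i), "MessageBody": body} for i, body in enumerate(bodies)]
    return sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)

# Usage: buffer a handful of messages, then flush them together.
send_batch(['{"task": "welcome_email", "user_id": 1}',
            '{"task": "welcome_email", "user_id": 2}'])
```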
Thanks for reading!