The latest app update is showing increased engagement? That’s wrong.
You’re proud of the new feature you’ve poured yourself into for the past month, and you’re waiting for Apple to approve the update so you can delve into analytics and prepare the next iteration. The app gets approved, plenty of users download it, and metrics start coming in. An impressive 40% of users try your new feature, which has an amazing 80% completion rate. A month later, only 20% of users have tried the feature, and the completion rate is down to 40%. You blame users who tried the feature but didn’t use it again, so you re-analyze over the whole month to take early users into account. The needle doesn’t move. What happened?
This is a problem I have faced with every feature update we’ve released for Anghami, the leading music streaming service in the Middle East, now with more than 60 million users. For every update, we make sure the proper analytics events are set up. Once the update is released, I perform an in-depth analysis of user behavior on Amplitude, a great app analytics platform that makes it really easy to get actionable answers quickly. But before solving the feature update conundrum, it’s important to set up analytics events properly.
Design proper analytics events
A new feature usually has a main flow of events, and in some cases branches into alternative flows. Sending an event for each step is crucial, so that you can later identify where the issues are when analyzing with the funnel or pathfinder tool.
Don’t forget to send an extra event just to indicate that the feature was opened, so that the funnel provides useful insights. It’s also very helpful to know how users reach the feature. There are usually multiple ways a user can access it, and typically multiple text copies being tested, so it’s important to know which source works best.
We’ve recently extended our tracking to external sources like ads, social media, email and push notifications. That way we can analyze the whole lifecycle from one place.
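To make this concrete, here is a minimal sketch in Python, assuming a hypothetical track() helper that wraps whatever analytics SDK you use; the feature, event, and property names are made up for illustration, not a real schema.

```python
from typing import Optional

def track(event: str, properties: Optional[dict] = None) -> None:
    # Stand-in for your analytics SDK call (Amplitude's logEvent equivalent, etc.).
    print(f"[analytics] {event} {properties or {}}")

# One event just to mark that the feature was opened, with the entry point attached
# so you can later compare which source (and which text copy) converts best.
track("collab_playlist_opened", {"source": "push_notification", "copy_variant": "A"})

# One event per step of the flow, so the funnel shows exactly where users drop off.
track("collab_playlist_invite_sent", {"invitees": 3})
track("collab_playlist_song_added", {"song_id": "12345"})
track("collab_playlist_completed", {"songs_added": 5})
```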
Events are most often tied to screen views, but oftentimes users perform critical actions without switching views. For example, tapping a button to make an in-app purchase will show a third-party alert without switching views. Users can cancel the payment here and remain in the same view. But it’s important to compare how many users choose to buy versus how many actually buy (which should be another event, of course). That’s why tapping the purchase button should be an event in itself, even if the user doesn’t end up buying.
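A hedged sketch of that pattern, reusing the hypothetical track() helper from the sketch above (product and event names are again just placeholders):

```python
# Fire an event on the tap itself, before the third-party payment alert appears,
# so you capture users who chose to buy even if they cancel the payment sheet.
track("purchase_button_tapped", {"product": "premium_monthly"})

# ...the store's own alert is shown here; the user may cancel and stay in the view...

# Fire a separate event only when the purchase actually completes, so the funnel
# compares "chose to buy" against "actually bought".
track("purchase_completed", {"product": "premium_monthly"})
```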
Unfair comparison between user segments
Finding out whether an enhancement is doing well or poorly involves comparing segments. Throughout my experience with analytics, unfair comparison between segments is the big elephant in the room wearing an invisibility cloak: it’s hard to tell whether the comparison between two segments is fair or not.
Going back to the original problem in this article: why does user engagement with a new feature show such inconsistent results between the first day and the first month? The thing is, users who try the new feature the same day it’s released have updated the app immediately, and they are not your typical users. They’re the highly engaged users who open your app daily and are curious enough to browse around. In essence, looking at the scope of one day skews your results towards the most active users. Over the course of a month, however, there are a lot of new or one-time app users, so highly engaged users make up a much smaller percentage of the monthly audience than of the daily audience. That’s why engagement metrics vary significantly between the first day after a feature release and a whole month.
The green line is the previous version, and the blue line is the new version that makes it easier to add songs to playlists. Notice how the green line decreases, since over time it increasingly represents the less engaged users who haven’t updated yet. Time will tell whether the blue line sustains itself.
Does that mean you need to wait for a month? For the most accurate results, yes. But practically you won’t wait a month to start working on the next iteration, so you need to find an equally engaged audience to compare against. It can be a cohort of users who updated on day one of previous releases, or, in our case, users who have played a comparable number of songs. That way the comparison between users on the updated version and the previous one is fair, since you’re no longer skewing towards more active users.
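As a rough sketch of such a matched comparison with pandas, assuming you’ve exported one row per user with hypothetical on_new_version, songs_played_last_month, and added_song_to_playlist columns (the names and the 50-song threshold are my own placeholders):

```python
import pandas as pd

# Assumed export: one row per user with columns
#   on_new_version (bool), songs_played_last_month (int), added_song_to_playlist (0 or 1).
users = pd.read_csv("users.csv")

# Naive comparison: everyone on the new version vs. everyone still on the old one.
# This skews toward the highly engaged users who update immediately.
naive = users.groupby("on_new_version")["added_song_to_playlist"].mean()

# Fairer comparison: restrict both sides to a similar level of prior engagement,
# e.g. users who played at least 50 songs last month (the threshold is arbitrary).
engaged = users[users["songs_played_last_month"] >= 50]
matched = engaged.groupby("on_new_version")["added_song_to_playlist"].mean()

print("Naive comparison:\n", naive)
print("Matched-cohort comparison:\n", matched)
```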
Unfair comparisons don’t just happen with app updates. Another example is a test we ran where we sent a text message asking users to subscribe after they had seen the subscribe screen 5 times. To analyze the effect of the text message, we have to compare against a segment that didn’t receive anything: the control segment. If the control segment is random, then of course the text message segment will show much higher conversion, because it’s focused on highly engaged users who have seen the subscribe screen 5 times. This is an unfair comparison, and it can be fixed by comparing against a segment of users who have also seen the subscribe screen 5 times but didn’t receive any message.
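Here’s a minimal sketch of how such a control group could be carved out, assuming a hypothetical export of users who have seen the subscribe screen 5 times and a placeholder send_sms pipeline:

```python
import pandas as pd

# Assumed export: one row per user who has seen the subscribe screen 5+ times.
eligible = pd.read_csv("saw_subscribe_screen_5_times.csv")  # column: user_id

# Split the eligible users at random: both groups have seen the screen 5 times,
# so the only difference between them is the text message itself.
test = eligible.sample(frac=0.5, random_state=42)
control = eligible.drop(test.index)

# Send the SMS to the test group only (send_sms is a placeholder for whatever
# messaging pipeline you actually use), then compare subscription rates later.
for user_id in test["user_id"]:
    pass  # send_sms(user_id, "...")
```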
Don’t focus only on the conversion percentage when analyzing. There are other dimensions to look at, like time to convert. Also, make sure the funnel is set to a short duration, so that the sequence of events is not spread out over several days. That said, sometimes you do want to spread out the events, to see whether a limitation affected a user’s decision to purchase at a later stage.
How valid are the results I’m looking at?
Once the results show up, excitement builds as soon as a winning segment becomes obvious. But there’s a nagging worry that the event data coming from the apps is not entirely correct, or that the complicated graph you created is wrong somewhere.
There’s a very quick way to validate event data: segment the unique event you’re validating by platform, and check whether the numbers are proportional to the platform distribution of your users. Toggle to “Active %” so that Amplitude automatically calculates the percentage per platform; you should get similar percentages here. Then compare the average number of times each user sends the event, and make sure the platforms are not too far off. These steps should be enough to at least point out whether there’s a data collection issue somewhere.
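A quick sketch of those same two checks done offline with pandas, assuming a hypothetical export of the raw event rows and placeholder active-user counts per platform:

```python
import pandas as pd

# Assumed export: one row per event, with user_id and platform columns.
# Active-user counts are illustrative placeholders; use your real dashboard numbers.
events = pd.read_csv("new_feature_opened_events.csv")  # columns: user_id, platform
active_users = pd.Series({"ios": 400_000, "android": 550_000, "web": 50_000})

# 1. Unique users firing the event per platform, as a share of that platform's
#    actives (roughly what the "Active %" toggle gives you).
unique_by_platform = events.groupby("platform")["user_id"].nunique()
print("Share of actives firing the event:\n", unique_by_platform / active_users)

# 2. Average number of times each user sends the event, per platform.
#    A big gap between platforms hints at a client-side tracking bug.
print("Average events per user:\n", events.groupby("platform").size() / unique_by_platform)
```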
It’s harder to validate that your graph setup is correct, but these tricks should do it:
Platform distribution: Are the segments you’re comparing on the same platform distribution? If one segment is mostly iOS while the other is mostly Android and Web, the results will be misleading. Even the version matters: a segment that only includes the latest iOS version cannot be compared to a segment that includes all iOS users. (The platform-mix and segment-size checks are sketched in code after this list.)
New vs existing users: Don’t compare new users to existing users, unless you’re actually studying the behavior of new vs existing users.
More engaged or technical users: Make sure one segment doesn’t skew towards more engaged users by definition. A segment of audiophiles (say, users who changed their equalizer settings) cannot be compared to the general audience.
Size of each segment: Make sure the number of users in each segment makes sense — a segment with 50k users being compared to 200 users is a no-go.
Huge improvement: Did your segment comparison show a huge improvement? There’s something fishy. A 50% improvement from a small change is a signal of an error.
Weekly vs daily: A weekly window can smooth out slight variations, but it can also hide problems. Set a daily window and validate that the trend is the same across all days. A consistent trend is reassuring; a lot of daily fluctuation, especially in a large pool of users, is cause for worry.
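A couple of these checks are mechanical enough to script. Here’s a rough sketch with pandas, where the ratio and 15-point thresholds are arbitrary assumptions rather than hard rules:

```python
import pandas as pd

def sanity_check_segments(seg_a: pd.DataFrame, seg_b: pd.DataFrame) -> None:
    """Rough pre-flight checks before trusting a comparison between two segments.
    Assumes each frame has one row per user with a 'platform' column."""
    # Segment size: wildly different sizes are a red flag (50k vs. 200 is a no-go).
    small, large = sorted([len(seg_a), len(seg_b)])
    if small == 0 or large / small > 20:  # ratio threshold is arbitrary
        print(f"Warning: segment sizes differ a lot ({len(seg_a)} vs {len(seg_b)})")

    # Platform distribution: both segments should have a similar platform mix.
    mix_a = seg_a["platform"].value_counts(normalize=True)
    mix_b = seg_b["platform"].value_counts(normalize=True)
    gap = mix_a.subtract(mix_b, fill_value=0).abs().max()
    if gap > 0.15:  # again, an arbitrary threshold
        print(f"Warning: platform mix differs by up to {gap:.0%}")
```

Run it on the two exported segments before building the comparison chart; silence doesn’t prove the comparison is fair, but a warning tells you it definitely isn’t.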
The data looks weird
So you validated the data and realized there’s something fishy. These solutions work for most cases:
External factors: Did the marketing team launch a campaign you’re not aware of? Did an operator send a bulk SMS to one of your markets? These could be the reasons for peaks in the graph that you cannot explain. A good starting point here is segmenting by country, since most external factors affect one country at a time. Then you can ask the team something like “what happened in Egypt on January 11th?”
Numbers are still low: It’s been a few days since launch and the numbers are too low to conclude anything, but you still need to make something of them. Switch to a daily view and see if the trend is consistent.
Slightly wrong data: There are small mistakes that can be corrected after the events get sent. If one app is sending the wrong event name, or event properties are not consistent across platforms, then Amplitude Custom Events makes it easy to correct those issues by renaming events. So even without an app update, you can still analyze holistically.
Really wrong data: There’s something really weird in the results, and you can’t even pinpoint the source. At this point you can involve the backend team. They often have the underlying data, which you can query directly from the database to validate against (a quick reconciliation query like the one sketched below is a good start). For a proper comparison with the graphs you created, the backend can also forward these events to Amplitude. Once done, it becomes much easier to find the culprit.
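A minimal sketch of that reconciliation step, assuming a hypothetical feature_events table on the backend (the table, column, and event names are placeholders):

```python
import sqlite3  # stand-in for whatever database the backend actually runs

# Hypothetical backend table: feature_events(user_id, event_name, country, created_at)
conn = sqlite3.connect("backend.db")
row = conn.execute(
    """
    SELECT COUNT(DISTINCT user_id)
    FROM feature_events
    WHERE event_name = 'collab_playlist_opened'
      AND created_at BETWEEN '2019-01-01' AND '2019-01-31'
    """
).fetchone()
print("Backend count of unique users:", row[0])

# Compare this number against the same event and date range in the Amplitude chart;
# a large gap points at a tracking issue rather than a real behavior change.
```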
Thinking long term
At a fast-moving startup like Anghami, we can’t afford to wait a long time before taking action. The good thing is that we don’t have to: using the tricks above, we’ve gotten plenty of actionable insights from analytics right after launching new features. That said, it’s critical to check back in 2 months and verify things are still on track.
When 2 months pass, just looking at the percentage of users entering and completing the feature funnel is not enough. What is the relation between funnel conversion and churn? Maybe the new feature shows good metrics but caused unneeded churn. The churn can be tied either to the feature itself or to how it’s communicated.
Did users who tried this feature come back to it? The original funnel can show terrific metrics, but most features are only successful if they create stickiness (users returning to the feature multiple times).
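As a rough sketch of measuring that kind of stickiness, assuming a hypothetical export of raw feature events with user_id and event_date columns:

```python
import pandas as pd

# Assumed export: one row per feature event, with user_id and event_date columns.
events = pd.read_csv("feature_events.csv", parse_dates=["event_date"])
events["day"] = events["event_date"].dt.date

# Distinct days on which each user touched the feature.
days_used = events.groupby("user_id")["day"].nunique()

# Stickiness here: the share of users who came back on at least one other day.
stickiness = (days_used >= 2).mean()
print(f"{len(days_used)} users tried the feature; {stickiness:.0%} returned on another day")
```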
Onboarding aside, the single most important reason for checking back in 2 months is the effect of novelty wearing off. Introducing a feature, or even altering a component, will increase engagement at that moment just because it’s new to the user’s eyes. A feature or change is only sustainable if users stay engaged 2 months later. But don’t let that slow you down: there are all the ways I mentioned above to help you, from designing analytics events to validating the data upon launch.