Statistical analysis of a playlist’s mood
Playlists usually group songs under a single category: an artist’s “Best of”, a genre or era, an activity (e.g. workout or cooking playlists) or a mood, such as melancholic or joyful playlists. We will analyse the coherence of mood playlists by evaluating each song’s sentiment score and using the median absolute deviation (MAD) to robustly measure the variability of the playlist’s mood.
At Anghami, we focus on delivering the best user experience and quality content for music lovers. On our path towards continuous improvement and as a data-driven company, we have turned to statistics to further enhance our customers’ listening sessions.
Converting words into numbers
Every song in a playlist has adjectives assigned to it, such as exciting, sensual, depressive or nostalgic. To analyse them, we projected the adjectives onto a single dimension with values ranging from -1 (most extreme negative) to +1 (most extreme positive).
For this task, we’ve chosen VADER Sentiment Analysis [1], an open-source lexicon and rule-based sentiment analysis tool. Words that aren’t in VADER’s dictionary were replaced with their closest synonyms with the help of www.thesaurus.com.
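The lookup with a synonym fallback described above could be sketched roughly as follows. This is a minimal illustration, not our production code: `TOY_LEXICON`, `SYNONYMS` and `mood_score` are hypothetical names, and the scores in the toy lexicon are placeholders rather than VADER’s real values.

```python
from typing import Dict, Optional

# Hypothetical toy lexicon: placeholder scores in [-1, +1], NOT VADER's real values.
TOY_LEXICON: Dict[str, float] = {
    "exciting": 0.5,
    "happy": 0.6,
    "sad": -0.5,
}

# Hypothetical hand-curated table mapping unknown words to a close synonym
# that the lexicon does contain.
SYNONYMS: Dict[str, str] = {
    "depressive": "sad",
    "joyful": "happy",
}


def mood_score(
    adjective: str,
    lexicon: Dict[str, float] = TOY_LEXICON,
    synonyms: Dict[str, str] = SYNONYMS,
) -> Optional[float]:
    """Return the sentiment score of an adjective, or None if it cannot be scored."""
    word = adjective.strip().lower()
    if word not in lexicon:
        # Fall back to the closest known synonym, if we have one.
        word = synonyms.get(word, word)
    return lexicon.get(word)


print(mood_score("exciting"))    # found directly in the lexicon
print(mood_score("Depressive"))  # scored through its synonym "sad"
print(mood_score("nostalgic"))   # neither known nor mapped, so None
```

With the `vaderSentiment` Python package installed, the dictionary lookup could presumably be swapped for `SentimentIntensityAnalyzer().polarity_scores(word)["compound"]`, which also returns a value between -1 and +1.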
Standard deviation & why we didn’t use it
The standard deviation (SD) measures how much the members of a group differ from the group’s mean value. A small SD indicates that the values are tightly clustered around the mean, while a large SD means the values are spread over a wide range. The following plot illustrates how the standard deviation is affected by the proximity of the values to the mean. Both populations have the same mean, but their values are spread differently.
A simple method of detecting outliers in a set is to flag all entries that lie more than two standard deviations away from the mean. Let’s take a look at this set of numbers: [1 1 2 2.2 3 3.5 4.1 9]
The mean is 3.2250 and the sample SD is 2.5828. The last entry (9) is greater than mean+2*SD (8.3905), so we have successfully detected the outlier here, but that’s not enough.
Let’s do the same for this set of numbers: [1 1 2 2.2 3 3.5 4.1 19 62]
Here the mean is 10.8667 and the sample SD is 19.9723. The red line marks mean+2*SD (50.811). We failed to identify 19, which logically should be considered an outlier. The reason is that the standard deviation, which is based on squared distances from the mean, is greatly influenced by the large deviations of extreme outliers, in our case 62.
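The two-standard-deviation rule applied to both sets can be reproduced with a short Python sketch. It assumes the sample standard deviation (n - 1 denominator), which is what matches the figures quoted above; `sd_outliers` is just an illustrative helper name.

```python
from statistics import mean, stdev


def sd_outliers(values, k=2.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)  # sample SD (n - 1 denominator)
    return [v for v in values if abs(v - m) > k * s]


first = [1, 1, 2, 2.2, 3, 3.5, 4.1, 9]
second = [1, 1, 2, 2.2, 3, 3.5, 4.1, 19, 62]

print(sd_outliers(first))   # [9]
print(sd_outliers(second))  # [62]: 19 is masked by the inflated SD
```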
Median absolute deviation
The median absolute deviation (MAD) is a robust measure of statistical dispersion and is more resilient to a few extreme outliers. It is defined as the median of the absolute deviations from the data’s median, or simply MAD = median( abs( Xᵢ – median(X) ) ).
Let’s compute the robust zScore of each of the previous data points using the MAD:
octave:1> a = [1 1 2 2.2 3 3.5 4.1 19 62];
octave:2> abs(a - median(a)) / mad(a,1)
ans = 1.81818 1.81818 0.90909 0.72727 0.00000 0.45455 1.00000 14.54545 53.63636
Using the same cut-off factor of 2, we find that the outliers here are entries 19 and 62.
octave:3> a(find((abs(a - median(a)) / mad(a,1)) > 2))
ans =19 62
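For readers who don’t use Octave, roughly the same computation can be sketched in Python with only the standard library. The helper names `mad` and `robust_z` are ours for illustration, and the explicit check for a zero MAD is an added assumption about how one might guard the division.

```python
from statistics import median


def mad(values):
    """Median of the absolute deviations from the median."""
    med = median(values)
    return median([abs(v - med) for v in values])


def robust_z(values):
    """Absolute deviation from the median, in units of MAD."""
    med = median(values)
    scale = mad(values)
    if scale == 0:
        # Assumption: refuse to divide by zero rather than guess a fallback.
        raise ValueError("MAD is zero; robust z-scores are undefined")
    return [abs(v - med) / scale for v in values]


a = [1, 1, 2, 2.2, 3, 3.5, 4.1, 19, 62]
scores = robust_z(a)
outliers = [v for v, z in zip(a, scores) if z > 2]

print(round(mad(a), 2))  # 1.1
print(outliers)          # [19, 62]
```

The median of the set is 3 and its MAD is 1.1, so 19 sits about 14.5 MADs from the median, far beyond the cut-off of 2.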
It’s not a perfect solution, but it’s much better than the less robust standard deviation.
Wrapping up
For every playlist, we translate each song’s mood into a normalized score using VADER and compute its robust zScore in terms of the median absolute deviation. The playlists are then ordered by the distance between their maximum and minimum sentiments and by the sum of their songs’ zScores. Thus, if a playlist contains a negative song among mostly positive ones, it is surfaced first, with the outlier songs highlighted for review.
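A minimal sketch of this ranking step might look as follows. Several things here are assumptions for illustration: the playlist names, song titles and scores are made up, the two ordering criteria are combined as a simple sort key (range first, then sum of zScores), and the zero-MAD fallback is one possible choice among many.

```python
from statistics import median


def robust_z_scores(scores):
    """Absolute deviation of each score from the median, in units of MAD."""
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    if mad == 0:
        # Assumption: when more than half the scores coincide the MAD is zero,
        # so any song off the median is treated as infinitely far from it.
        return [0.0 if s == med else float("inf") for s in scores]
    return [abs(s - med) / mad for s in scores]


def rank_playlists(playlists, cutoff=2.0):
    """Order playlists so that the least coherent ones come first."""
    report = []
    for name, songs in playlists.items():
        titles = list(songs)
        scores = [songs[t] for t in titles]
        z = robust_z_scores(scores)
        report.append({
            "playlist": name,
            "spread": max(scores) - min(scores),
            "z_sum": sum(z),
            "outliers": [t for t, zi in zip(titles, z) if zi > cutoff],
        })
    # Assumption: sort by sentiment range first, then by the sum of zScores.
    report.sort(key=lambda r: (r["spread"], r["z_sum"]), reverse=True)
    return report


# Hypothetical playlists with made-up sentiment scores in [-1, +1].
playlists = {
    "Rainy Day": {"Song A": -0.4, "Song B": -0.45, "Song C": -0.5, "Song D": -0.55},
    "Feel Good": {"Song E": 0.4, "Song F": 0.5, "Song G": 0.6,
                  "Song H": 0.7, "Song I": 0.8, "Song J": -0.6},
}

for row in rank_playlists(playlists):
    print(row["playlist"], row["outliers"])
```

In this made-up example, "Feel Good" is surfaced first because of its single negative song, while "Rainy Day" has no song beyond the cut-off.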
What’s next?
Now that we’re able to analyse playlists over a single attribute (sentiment), we have begun looking into methods that deal efficiently with multiple dimensions. Hopefully this will be the topic of another story.
[1] Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.