Music App Churn Prediction Using Logistic Regression

Greggwebster
10 min read · Dec 30, 2020

This article describes, as part of the Udacity Data Science Capstone Project, the prediction of users who cancel their subscriptions (paid or free) to a music streaming service called Sparkify. The feature creation and Machine Learning are done using PySpark.

Project Definition

This project is based on user interaction data for a music streaming service (called Sparkify) supplied by Udacity. The project aims to use Spark functionality to create features and then build a machine learning model to predict churn (users who cancel their subscriptions to the music streaming service).

Using the data supplied, a target to be predicted was created, features based on user characteristics were engineered, and models were fitted to select which features best predict churn.

The final model has low predictive ability.

Problem Statement

With the data supplied, features are to be created and models fitted which predict churn. The effectiveness of the model is then to be assessed. Since the target variable will be binary, Logistic Regression will be the main Machine Learning technique to fit the model.

Metrics

The model is to predict which users churn; the target is thus binary. The metrics chosen to assess the model estimate its discriminatory power: how well the model is able to distinguish users who churn from those who don't. The two measures considered are the Gini Coefficient and the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC). The two are related: Gini Coefficient = 2 × AUC − 1.

The AUC and Gini were chosen since these measures handle class imbalance well. The output of the logistic regression is a probability prediction between 0 and 1, which allows a range of strategies to be developed for users based on their likelihood to churn. Other metrics require a cut-off to be defined on this prediction, but the Gini and AUC do not.
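To make the Gini–AUC relationship concrete, here is a minimal pure-Python sketch (not the PySpark evaluator used in the project) of a rank-based AUC and the Gini derived from it:

```python
def auc_score(labels, scores):
    """Rank-based AUC: the probability that a random positive case is
    scored higher than a random negative case. Tied scores get the
    average of their ranks."""
    pairs = sorted(zip(scores, labels))
    rank_of = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg = (i + 1 + j) / 2  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_of[k] = avg
        i = j
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, (_, y) in zip(rank_of, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def gini(labels, scores):
    """Gini Coefficient derived from the AUC."""
    return 2 * auc_score(labels, scores) - 1
```

A random model has AUC 0.5 and Gini 0, a perfect one AUC 1 and Gini 1, which is why a cut-off-free ranking measure suits a probability-outputting model.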

Data Used and Analysis

A mini data set of user interactions with the music app was provided, mini_sparkify_event_data.json. These interactions record the date and time of playing a song, logging in, viewing an advert, adding a song to a playlist, adding a friend, etc. The interactions also include a cancellation event. Cancellation is what we want to predict using features created from the data.

The data is loaded into PySpark, where feature creation and Machine Learning take place.

Data Exploration and Feature Engineering

The columns available in the source data are listed below.

root
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)

Using the data supplied, the cancellation event (target) needed to be specified, and features created to be used to predict it. The data was also aggregated by calendar date to assist with the feature creation. This means that the model is run on data which has been aggregated to one row per user per calendar date.

The below gives the number of observations in the aggregated data as well as the cancellation rate of the users in the data given.

+--------------------+----------+---------+
| cancel_rate|cancel_vol|total_vol|
+--------------------+----------+---------+
|0.016645326504481434| 52| 3124|
+--------------------+----------+---------+

There are 3124 observations in the aggregated data set, with only 52 cancellation events at a rate of 1.67% (the mean of the binary target). With so few positive cases, it might prove difficult to predict this event.

No data anomalies were identified relating to outliers of specific features.

Features were then created to help predict cancellation. Using the data interactions, the following features were created for each user at each data date the user was active:

max_time_off: Maximum number of days a user didn’t use the app for the period prior to the data date. Expecting that more time off means higher chance of cancellation.

cum_songs_played: Total number of songs played as at the data date. Expecting that more songs played means less chance of cancellation.

cum_songs_added_to_play_list: Total number of songs added to a play list as at data date. Expecting that more songs added to play list means less chance of cancellation.

cum_friends: Total friends added as at data date. Expecting that more added friends means less chance of cancellation.

days_as_user: Number of days user has been present in the data. Expecting that the longer the user has been using the app, the less likely a cancellation.

curr_songs_to_prev_ratio: Ratio of songs played on the current date to songs played in the previous active day's session. Expecting that a ratio below 1 would indicate reduced activity and thus a higher chance of cancellation.

curr_songs_to_maxLw_ratio: Ratio of songs played on the current date to the maximum songs played in the previous 7 active days. Expecting that a ratio below 1 would indicate reduced activity and thus a higher chance of cancellation.
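As an illustration, the two ratio features can be sketched in plain Python over a single user's daily song counts (a simplified stand-in for the Spark SQL logic; it assumes the list holds only active days, in date order):

```python
def ratio_features(daily_counts):
    """daily_counts: songs played per active day, in date order.
    Returns (curr_songs_to_prev_ratio, curr_songs_to_maxLw_ratio)
    for each day; None where no prior activity exists."""
    out = []
    for i, curr in enumerate(daily_counts):
        prev = daily_counts[i - 1] if i > 0 else None
        last7 = daily_counts[max(0, i - 7):i]  # up to 7 previous active days
        prev_ratio = curr / prev if prev else None
        maxlw_ratio = curr / max(last7) if last7 else None
        out.append((prev_ratio, maxlw_ratio))
    return out
```

A value below 1 on either ratio flags a day of reduced activity relative to the user's own recent history, which is the signal these features are designed to capture.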

Data Visualisation

For selected variables, plots of the cancellation rate (blue) and volume (green) against the variable are shown below.

There doesn’t appear to be a strong trend in cancellation rate for any of the above variables.

Methodology

Data Preprocessing

The data was initially cleaned to exclude cases with missing userIds and sessionIds. There are no missing sessionIds, but there are some missing userIds. These missing userIds are excluded since they represent people who don't actually use the service.
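In PySpark this cleaning is a simple filter on the event DataFrame; the rule itself can be illustrated in plain Python over a few sample events (the empty-string userId is an assumption about how missing IDs appear in the JSON):

```python
def clean_events(events):
    """Keep only events with a non-empty userId and a sessionId."""
    return [e for e in events
            if e.get("userId") not in (None, "")
            and e.get("sessionId") is not None]

# hypothetical sample events for illustration
events = [
    {"userId": "42", "sessionId": 1, "page": "NextSong"},
    {"userId": "",   "sessionId": 2, "page": "Home"},   # logged-out visitor
    {"userId": None, "sessionId": 3, "page": "Login"},
]
```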

Implementation

Initial Set up

The code is written in Python 3, where a Spark session is created. Data processing and feature creation are done using Spark SQL logic.

The model is built using Logistic Regression from the PySpark Machine Learning package.

Logistic regression is used as it is the standard when the outcome is binary. The output is also a probability between 0 and 1 so strategies can be developed based on the likelihood of a user cancelling their subscription to the music streaming service.

Implementation Description

The features created are user specific. This involved using SQL window functions (PARTITION BY ... OVER) to create these fields. Here is the code for computing the cumulative songs played before the data is aggregated by date. It creates a running sum of songs played at each event for a specific userId.

sum(case when page = 'NextSong' then 1 else 0 end) over (
    partition by userId
    order by ts
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as cum_songs_played
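The same running sum can be sketched in plain Python to show what the window does (events partitioned per userId and ordered by ts; field names match the schema above):

```python
from collections import defaultdict

def cum_songs_played(events):
    """Running count of 'NextSong' events per userId, in ts order.
    Returns each event extended with its cumulative count."""
    counts = defaultdict(int)
    result = []
    for e in sorted(events, key=lambda e: (e["userId"], e["ts"])):
        counts[e["userId"]] += e["page"] == "NextSong"
        result.append({**e, "cum_songs_played": counts[e["userId"]]})
    return result
```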

This is done for the other variables described above. The data is then rolled up to daily data (one row per user per calendar date) by taking the maximum variable value grouped by the calendar date. Previous-day activity features (using the SQL lag function) are also created here. The final feature creation step then creates ratio fields comparing current activity to previous activity.

The features' predictive power is then assessed. Here all the data is used: the dataset is small, so it was decided to use all of it to select features. The final model build, however, is split into a train and test set. Features were selected by forward selection on the AUC measure: the best single variable is chosen first, then additional variables are added one at a time, keeping whichever most improves the AUC.
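The forward selection loop can be sketched as follows. Here score is a placeholder for "fit a logistic regression on these features and return its AUC"; the function and its stopping rule are the point, not any particular scorer:

```python
def forward_select(features, score):
    """Greedy forward selection: repeatedly add the feature that most
    improves score(selected); stop when no addition improves it."""
    selected, best = [], float("-inf")
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            return selected
        f_best = max(candidates, key=lambda f: score(selected + [f]))
        new = score(selected + [f_best])
        if new <= best:
            return selected  # no candidate improved the score
        selected, best = selected + [f_best], new
```

In this project the loop stopped after two steps, selecting curr_songs_to_maxLw_ratio and then cum_songs_added_to_play_list.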

The features are then put into a logistic regression model and the Gini is compared on the training and test data. The logistic regression was run with the following hyperparameters:

maxIter=10, regParam=0.0, elasticNetParam=0.
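With regParam=0.0 and elasticNetParam=0 there is no regularisation, so the fit reduces to plain maximum-likelihood logistic regression. As a rough illustration of what the fit does (PySpark optimises this with quasi-Newton methods, not the simple gradient descent below; the learning rate and iteration count here are illustrative):

```python
import math

def fit_logistic(X, y, max_iter=100, lr=0.5):
    """Unregularised logistic regression via batch gradient descent.
    X: list of feature vectors, y: 0/1 labels. Returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(max_iter):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            err = p - yi  # gradient of the log-loss w.r.t. the logit
            gb += err
            for j in range(d):
                gw[j] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict_prob(w, b, xi):
    """Predicted churn probability for one observation."""
    return 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
```

The probability output is what makes the Gini/AUC assessment natural: predictions can be ranked without committing to a cut-off.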

There were no major complications during the coding process. There might be some issues and refinement needed if the solution is extended onto the full data set.

Refinement

The model with all the features is assessed against the model with the features chosen in the variable selection process. The variable selection process is effectively the model refinement process.

Results

Model Evaluation and Validation

Initial Model

As an initial fit, all the variables were fitted using the training data and then tested on the validation data. This model has a Gini of 20% on the training data and -25% on the validation data. The model gives weight to variables, such as days_as_user, which do not generalise to the validation data, and so produces results opposite to what is expected. This model is unusable.

Variable Selection

Variable selection was then conducted using the approach described above to ensure that the model is not over-fitted to the training data. Note again that, due to the low sample size and low number of cancellation events, all the data was used for variable selection. The selected variables are then fitted on the training data (a 70% random sample) and tested on the validation data (the remaining 30%). Doing it this way helps to prevent over-fitting. The results of the variable selection steps are as follows:

Step 1: Select the most predictive variable of cancellation using the Area Under the ROC Curve (AUC) measure. The variable with the highest measure came out to be curr_songs_to_maxLw_ratio.

Step 2: Using curr_songs_to_maxLw_ratio as the first variable, select another variable which gives the model the highest AUC. This variable came out to be cum_songs_added_to_play_list.

Step 3: Using these two selected variables, select another variable which gives the highest AUC. At this point every additional variable resulted in a lower AUC, so the selection stops here.

Of course, on a different data set (the full data, for example) this process could continue further.

curr_songs_to_maxLw_ratio and cum_songs_added_to_play_list both represent how the user is using the service, so their inclusion in the model is logical.

Final Model

Fitting the model using the selected variables (curr_songs_to_maxLw_ratio and cum_songs_added_to_play_list) gives a training sample Gini of 13% and a validation sample Gini of 7%. The Gini values are low, but they are positive, and the drop between the training and validation sets is not as large as for the initial model fit using all the variables.

The ROC curve plot for the training data is shown below.

The blue line is the model and the orange line is a random model. The further the blue line sits above and to the left of the orange line, the greater the model's predictive power. This model has low predictive power.

Justification

The approach is a result of the low amount of data available in the workspace. Using all the data for feature selection is slightly cheating; however, with such a low sample size and number of cancellation cases, it makes sense to use all the data to find features and then feed those features into the model built on the training data. This does give more stable predictive power between the training and test sets.

Conclusion

Reflection

The project went from feature creation to model fitting for predicting churn from the music streaming service. For me, one of the most interesting aspects was the feature creation: this involved learning some new SQL coding techniques which create variables per userId and accumulate values across the rows for each user. In this case, the features created did not prove fruitful in predicting churn, but a bit more creativity and intuition might find something suitable. I think this part of the process is the most important; the Machine Learning model is only as good as the features being used.

Another interesting aspect was how to structure the data. I decided to structure the data daily. For a user, there were cases of missing days between use; it would have been useful to create rows for all calendar days and fill in zero activity for these gaps. This could have given a more meaningful ratio variable when comparing the current number of songs listened to against the maximum in the last week.
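Filling those gaps could look something like this (a plain-Python sketch; in Spark this would be a date-range join against the user's activity table rather than a loop):

```python
from datetime import date, timedelta

def fill_missing_days(activity):
    """activity: dict mapping date -> songs played on active days.
    Returns a full daily series with zeros on inactive days."""
    days = sorted(activity)
    filled = {}
    d = days[0]
    while d <= days[-1]:
        filled[d] = activity.get(d, 0)
        d += timedelta(days=1)
    return filled
```

With the zero days present, "maximum songs in the last 7 days" would cover 7 calendar days rather than 7 active days, which better reflects a user going quiet.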

The fitted model provides minimal ability to predict churn of users for the Sparkify music streaming service and is not really suitable for use.

Improvement

The following areas of further work are suggested:

  1. Extend the feature generation and model build to the full sample. This could give different results, as there should then be more cancellation events from which to determine robust trends.
  2. Create more features. Many more features could be created and considered for the model. Features relating to changes in user behaviour with the service could be particularly useful.
  3. Try Machine Learning techniques other than Logistic Regression, such as Neural Networks, k-nearest neighbours or decision trees. I don't envisage this making a material difference to predictive ability, but another approach could give a marginal improvement in model performance.
