Figuring Out Which Customers May Leave

CASE STUDY:

Predicting if customers will churn based on demographic data and cost data from CRM

The aim is to use various features from a customer CRM list to accurately predict customer behaviour, so that customers can be retained. The dataset comes from a Kaggle data set here. The code used to explore the data is here. Churn rate is defined as the rate at which customers stop subscribing to a service.

Three different models were used, and from these a model was chosen with an accuracy of 78%, a precision of 60% and a recall of 56% for the churned customers. We were also able to identify key features that are important in determining whether a customer will leave. Monthly charges and month-to-month contracts were important features for both the logistic regression and random forest models. No analysis of profitability was done here.

Defn (Accuracy): The percentage of correctly classified predictions of the model

(Correctly Classified Predictions / Total Predictions)

Defn (Precision): The ratio of predictions correctly classified as X to the predictions classified as X (Correctly and Incorrectly)

(Correctly Classified as X / Total Predicted to be X)

Defn (Recall): The ratio of predictions correctly classified as X to the actual number of outcomes that are X

(Correctly Classified as X / Total No. of X outcomes)
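To make the three definitions concrete, here is a minimal sketch in plain Python. The helper name `classification_metrics` and the ten labels are made up for illustration; they are not taken from the data set or the original code.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    predicted_pos = sum(1 for _, p in pairs if p == positive)
    actual_pos = sum(1 for t, _ in pairs if t == positive)

    accuracy = correct / len(pairs)
    # Guard against division by zero when the class is never predicted
    # or never occurs in the data.
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, precision, recall


# Made-up labels for illustration only: 1 = churned, 0 = not churned.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the model is right 7 times out of 10, but it only finds 2 of the 4 churned customers, which is why accuracy alone can look better than the model really is for the class we care about.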

Data Analysis:

The following features were used from the data set, which covers a total of 7024 different users, of whom 1869 were reported to have churned.

 
Gender | Tenure | Online Security | Streaming TV | Payment Method
Senior Citizen | Phone Service | Online Backup | Streaming Movies | Monthly Charges
Partner | Multiple Lines | Device Protection | Contract | Total Charges
Dependents | Internet Service | Tech Support | Paperless Billing | Churn

Of the customers, 26.5% have churned. It is this set of customers we aim to predict accurately using the features mentioned above. The following table of categorical features shows which value of each feature has the highest churn rate.

ChurnPercentage.png
Feature | Value | Churn Percentage
Payment Method | Electronic Check | 45%
Contract | Month to Month | 43%
Tech Support | No | 42%
Internet Service | Fiber Optic | 42%
Online Security | No | 42%
Senior Citizen | Yes | 42%
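A table like the one above can be produced by counting, for each value of a categorical feature, the share of customers who churned. The following is a rough sketch in plain Python; the helper `churn_rate_by_value`, the field names and the eight rows are hypothetical stand-ins, not the real data.

```python
from collections import defaultdict


def churn_rate_by_value(rows, feature):
    """Map each value of `feature` to the share of its customers who churned."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in rows:
        value = row[feature]
        totals[value] += 1
        if row["Churn"] == "Yes":
            churned[value] += 1
    return {value: churned[value] / totals[value] for value in totals}


# Hypothetical rows for illustration only (not taken from the data set).
customers = [
    {"Contract": "Month to Month", "Churn": "Yes"},
    {"Contract": "Month to Month", "Churn": "No"},
    {"Contract": "Month to Month", "Churn": "Yes"},
    {"Contract": "Two Years", "Churn": "No"},
    {"Contract": "Two Years", "Churn": "No"},
    {"Contract": "One Year", "Churn": "Yes"},
    {"Contract": "One Year", "Churn": "No"},
    {"Contract": "One Year", "Churn": "No"},
]

rates = churn_rate_by_value(customers, "Contract")
# The value with the highest churn rate is the one that goes in the table.
worst_value = max(rates, key=rates.get)
print(worst_value, f"{rates[worst_value]:.0%}")
```

With pandas, the same calculation is usually written as a group-by on the feature followed by the mean of a 0/1 churn column.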

Violin plots are useful for investigating the distributions of churned and non-churned customers.

Scatter.png


Looking at the violin plots, we can see a clear disparity in the distributions of Monthly Charges and Tenure. Among the churned customers, there is a higher proportion of people with a lower tenure and a higher monthly fee. Total Charges, by contrast, is similarly distributed for both churned and non-churned customers. This is likely because customers with a lower tenure have higher monthly charges, resulting in a total charge similar to that of non-churned customers, who have lower monthly charges but a longer tenure. This is demonstrated in the scatter plot.
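The Total Charges argument is easy to check with a small worked example. The two customers below are invented, and total charges are approximated as tenure times monthly charge, which ignores any price changes over time.

```python
# Made-up customers for illustration; total charges are approximated as
# tenure (in months) x monthly charge.
short_tenure_high_fee = {"tenure_months": 10, "monthly_charge": 90.0}
long_tenure_low_fee = {"tenure_months": 45, "monthly_charge": 20.0}


def approx_total_charges(customer):
    """Approximate lifetime charges for one customer."""
    return customer["tenure_months"] * customer["monthly_charge"]


total_short = approx_total_charges(short_tenure_high_fee)
total_long = approx_total_charges(long_tenure_low_fee)
print(total_short, total_long)  # both come to 900.0
```

Two very different customers end up with the same total, which is consistent with Total Charges separating churned from non-churned customers less clearly than Tenure or Monthly Charges do on their own.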


Logistic Regression

First, let’s take a look at the performance of a logistic regression.

Accuracy: 78%

An accuracy of 78% is certainly not bad. However, in this instance it is more important to correctly identify the users who have churned, so a recall of 48% is poor performance.

 
Class | Precision | Recall | f1-Score | Support
Not Churned | 84% | 88% | 86% | 1545
Churned | 61% | 53% | 57% | 562
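The f1-Score column is the harmonic mean of precision and recall. As a quick sanity check, here is a minimal sketch that recomputes it from the precision and recall values in the table (the helper name `f1_from_precision_recall` is our own, not taken from the original code):

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the logistic regression table.
f1_not_churned = f1_from_precision_recall(0.84, 0.88)
f1_churned = f1_from_precision_recall(0.61, 0.53)
print(f"{f1_not_churned:.0%} {f1_churned:.0%}")
```

Both values round to the f1-Scores shown in the table (86% and 57%). Because it is a harmonic mean, the f1-Score is pulled towards the lower of the two numbers, so a model cannot score well on it by doing well on precision alone.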

Feature Importance for Logistic Regression:

We use the model coefficients to determine feature importance. Feature importance is the process of finding the features that matter most in a model. In this instance we look at each coefficient, which essentially describes how the likelihood of churning changes when that feature changes (strictly, the change in the log-odds of churning for a one-unit change in the feature).

In the logistic regression, the most important features are:

Feature Importance
Contract: Month to Month
Contract: Two Years
Online Security: No
Payment Method: Check
Paperless Billing

Thus, from the logistic regression, the features above should be flagged as signals of someone more likely to churn.
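The coefficient-based ranking can be sketched as follows. The coefficient values and feature names below are hypothetical and chosen only to show the mechanics; they are NOT the fitted values from the model in this case study.

```python
import math

# Hypothetical coefficients for illustration only; these are NOT the
# fitted values from the logistic regression in this case study.
coefficients = {
    "contract_month_to_month": 0.8,
    "contract_two_years": -0.7,
    "online_security_no": 0.4,
    "paperless_billing": 0.3,
}


def rank_by_coefficient(coefs):
    """Sort features by |coefficient|, largest first, and attach odds ratios."""
    ranked = sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)
    # exp(coefficient) is the multiplicative change in the odds of churning
    # for a one-unit increase in the feature (0 -> 1 for a dummy variable).
    return [(name, coef, math.exp(coef)) for name, coef in ranked]


ranking = rank_by_coefficient(coefficients)
for name, coef, odds_ratio in ranking:
    print(f"{name:<26} coef={coef:+.2f} odds ratio={odds_ratio:.2f}")
```

Ranking by absolute value means a strongly negative coefficient (an odds ratio below 1, i.e. a feature associated with less churn) ranks as highly as a strongly positive one, so the sign is worth reading alongside the rank. Comparing magnitudes like this is only meaningful when the features are on comparable scales, for example dummy variables or standardised inputs.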

Random Forest:

Let’s increase model complexity and investigate the performance of a Random Forest.

Accuracy: 76%

Clearly a decrease across the board, especially in recall for churned users.

 
Class | Precision | Recall | f1-Score | Support
Not Churned | 81% | 88% | 85% | 1550
Churned | 57% | 44% | 49% | 557

Feature Importance for Random Forest:

Feature Importance
Total Charges
Monthly Charges
Tenure
Contract: Month to Month
Online Security: No

We run through the same exercise to see whether the important features overlap with those from the logistic regression.

The Random Forest clearly gives more weight to the continuous variables: Total Charges, Monthly Charges and Tenure. There is an overlap with the logistic regression on Month to Month contracts and having no online security, which suggests that these two features are clearly important.
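For reference, reading feature importances off a random forest looks roughly like this with scikit-learn. This is only a sketch on synthetic data: the three features, the rule used to generate the labels and the hyperparameters are all assumptions made for illustration, not the setup used for the results above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data (NOT the Kaggle data set).
tenure = rng.uniform(0, 72, n)
monthly_charges = rng.uniform(20, 120, n)
noise = rng.normal(size=n)  # a feature with no relation to the label
X = np.column_stack([tenure, monthly_charges, noise])

# Synthetic rule: short tenure and high monthly charges make churn likely.
y = ((monthly_charges - tenure + rng.normal(scale=10, size=n)) > 50).astype(int)

feature_names = ["tenure", "monthly_charges", "noise"]
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one per feature, normalised to sum to 1.
importances = dict(zip(feature_names, forest.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16} {score:.3f}")
```

One caveat worth keeping in mind: these impurity-based importances are known to favour features with many distinct values, which may be part of the reason the continuous variables come out on top here. scikit-learn's `permutation_importance` is a common cross-check.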


Deep Learning: 

Now let’s go more complex still and try a neural network using TensorFlow. First, let’s use a model with 1 hidden layer of 20 nodes.

Accuracy: 76%

So there is a slight increase in the recall for churned users. However, I would look for something better than 58%.

 
Class | Precision | Recall | f1-Score | Support
Not Churned | 85% | 87% | 86% | 1550
Churned | 60% | 56% | 58% | 557

The neural network is certainly not a bad route, but can we do better by making it more complex and thus allowing it to capture the non-linearities?

Now we are going to use a 3-layer model with 2000, 1000 and 500 nodes respectively. We are also going to introduce dropout, checkpoints and early stopping to avoid overfitting.
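A network of this shape could be set up in Keras roughly as follows. Only the layer sizes, dropout, checkpoints and early stopping come from the description above; the activations, dropout rate, optimiser, patience, checkpoint file name and the input size of 19 are all assumptions, so treat this as a sketch rather than the exact model that was trained.

```python
import tensorflow as tf


def build_churn_model(n_features, dropout_rate=0.5):
    """Three hidden layers (2000, 1000, 500 nodes), each followed by dropout."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(2000, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(1000, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(500, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        # Single sigmoid output: predicted probability of churn.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


callbacks = [
    # Stop when validation loss stops improving and roll back to the best weights.
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=5, restore_best_weights=True
    ),
    # Keep a copy of the best model seen so far (file name is a placeholder).
    tf.keras.callbacks.ModelCheckpoint(
        "best_churn_model.keras", monitor="val_loss", save_best_only=True
    ),
]

# 19 is a placeholder; the real input size depends on how the features are encoded.
model = build_churn_model(n_features=19)
```

Training would then pass the callbacks to `model.fit` along with some validation data, for example `model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=callbacks)`, where `X_train` and `y_train` are hypothetical names for the prepared training arrays.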

Well, we did slightly worse in fact. Not ideal.

Class | Precision | Recall | f1-Score | Support
Not Churned | 84% | 87% | 86% | 1550
Churned | 60% | 55% | 58% | 557


Future Models to try:

  • K-Nearest Neighbours

  • Support Vector Machine


This is detailed analysis from Kaggle.

Kevin Synnott