Predict Customer Churn with Machine Learning

Anzhi Tian
10 min read · Jun 18, 2021

by Anzhi Tian, Harriet Peng, Yanqing Shen; contributed equally.

Companies are spending more and more on customer acquisition and re-engagement, but in practice achieving a good return on investment (ROI) remains a struggle. If a company can foresee which customers are about to leave and act in advance, it can improve customer retention and thereby boost the ROI of its strategies.

In this article, we analyze and predict customer churn for TelCo, a telecommunications services provider, using a dataset that contains its customers’ account information, demographics and opt-in services. We first train a LightGBM model to identify customers who are likely to churn. Guided by the model’s feature importance, we then conduct survival analysis to infer how those factors affect churn. Finally, to generate specific and actionable marketing strategies for different customers, we apply K-means clustering to segment customers into four groups.

Table of Contents

  1. Exploratory Data Analysis
  2. Churn Prediction with LightGBM
  3. Survival Analysis
  4. Customer Segmentation
  5. Next Steps

Exploratory Data Analysis

The dataset for the TelCo churn analysis is from Kaggle. It has 7,043 observations and 21 variables. The target variable is Churn, and most of the explanatory variables are categorical, covering customers’ demographics, account information and the services they opt in to. Tenure, MonthlyCharges and TotalCharges are the only three numerical variables.

We fill all 11 missing values in the TotalCharges column with the product of tenure (in months) and MonthlyCharges. For binary categorical variables, all ‘Yes’ and ‘No’ values are replaced with 1 and 0.
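The two cleaning steps can be sketched in plain Python as follows. This is a minimal illustration, not the authors’ code: the sample rows are made up, the list of binary columns is only an assumed subset, and we assume missing TotalCharges values arrive as blank strings.

```python
# Hypothetical sample rows; column names follow the dataset, values are made up.
rows = [
    {"tenure": 12, "MonthlyCharges": 50.0, "TotalCharges": "600.0", "Partner": "Yes", "Churn": "No"},
    {"tenure": 0, "MonthlyCharges": 20.0, "TotalCharges": " ", "Partner": "No", "Churn": "No"},
    {"tenure": 3, "MonthlyCharges": 80.0, "TotalCharges": "", "Partner": "No", "Churn": "Yes"},
]

# Assumed subset of the Yes/No columns, for illustration only.
BINARY_COLUMNS = ["Partner", "Churn"]


def clean_row(row):
    """Return a cleaned copy: impute TotalCharges and map Yes/No to 1/0."""
    cleaned = dict(row)
    raw_total = str(row["TotalCharges"]).strip()
    if raw_total == "":
        # Missing value: approximate it as tenure (months) x monthly charge.
        cleaned["TotalCharges"] = row["tenure"] * row["MonthlyCharges"]
    else:
        cleaned["TotalCharges"] = float(raw_total)
    for column in BINARY_COLUMNS:
        cleaned[column] = 1 if row[column] == "Yes" else 0
    return cleaned


cleaned_rows = [clean_row(r) for r in rows]
```

With a data-frame library the same logic would typically be a couple of vectorized column operations; the row-wise version is shown here only to keep the sketch dependency-free.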

The distribution of tenure shows that short-tenure customers (0–5 months) form the largest group, followed by long-tenure customers (65–70 months). The remaining customers are spread fairly evenly in between. The distribution of MonthlyCharges shows that most customers pay either low or medium-to-high charges every month.

Overall, the average tenure is 32 months and the average monthly charge is $65. Among 7,043 customers, 26% of them churned. Boxplots show customers who stay have higher tenure and lower monthly charges.

The remaining categorical variables reveal the personas of customers who churned and those who stayed. Among customers who churned, 83% do not have dependents. Their average tenure is 17 months and they pay $74 per month on average. 89% of this group are on short-term month-to-month contracts, and over half of them pay their bills by electronic check. Most of them do not subscribe to cybersecurity-related services such as data backup.

Churned Customer Persona

Among customers who stayed, 66% live with dependents. They have a much higher average tenure of 38 months and a lower average monthly charge of $61. 32% of them are on long-term two-year contracts with TelCo. They show no clear preference for a particular Internet service or payment method.

Retained Customer Persona

Churn Prediction with LightGBM

After exploring the dataset, we proceed to predict customer churn with a LightGBM model. Categorical features are converted to numbers with one-hot encoding, which represents each level of a feature as a 0/1 indicator. This creates 43 dummy variables.
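To make the encoding step concrete, here is a small dependency-free sketch of one-hot encoding a single column. The helper name `one_hot` and the sample customers are hypothetical; the article does not show how its 43 dummy variables were generated.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator per level, named column_level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical column being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels:
            name = f"{column}_{level.replace(' ', '_').replace('-', '_')}"
            new_row[name] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


# Hypothetical customers, for illustration only.
customers = [
    {"tenure": 2, "Contract": "Month-to-month"},
    {"tenure": 40, "Contract": "Two year"},
    {"tenure": 15, "Contract": "One year"},
]
encoded = one_hot(customers, "Contract")
```

Each customer ends up with exactly one indicator set to 1 per encoded feature, which is why a feature with three levels contributes three dummy variables.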

Metrics for Model Performance Evaluation

In businesses where customers hold contracts, the potential cost of losing an existing customer is greater than the cost of reaching out to a new one. Therefore, it is important for the model to correctly identify customers who are about to churn.

Recall is the percentage of actual churners that the model correctly predicts as churn. ROC-AUC, one of the most commonly used metrics for classification models, measures how well the model separates the positive class from the negative class across all thresholds. PR-AUC (Precision-Recall AUC) summarizes the curve of Precision against Recall over a range of thresholds, which makes it well suited to imbalanced datasets with many more negatives than positives. Since our target variable Churn has imbalanced classes, we use these three metrics to select the better-performing model with greater business value.
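The definitions above can be illustrated with a few lines of plain Python. The labels and scores below are made up, the 0.5 threshold is an assumption, and the functions are simplified stand-ins rather than the library routines used in the analysis. ROC-AUC is computed here through its ranking interpretation: the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. PR-AUC is left out for brevity.

```python
def recall(y_true, y_pred):
    """Share of actual churners (1) that were predicted as churners."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def precision(y_true, y_pred):
    """Share of predicted churners that actually churned."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp)


def roc_auc(y_true, scores):
    """Chance that a random churner outscores a random non-churner (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted churn probabilities.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1, 0.7, 0.3, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(recall(y_true, y_pred), precision(y_true, y_pred), roc_auc(y_true, scores))
```

Note that Recall and Precision depend on the chosen threshold, while ROC-AUC depends only on how the scores rank the customers.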

Preliminary Model and Hyperparameters Tuning

We chose LightGBM for its well-established performance on classification tasks. We started with the default parameters, which yielded a ROC-AUC score of 0.832. However, the Recall is only 0.49: out of all 467 churners, only about half are correctly identified, which shows that the model requires further optimization.

RandomizedSearchCV from scikit-learn is used to randomly try combinations of parameters drawn from given candidate lists. F1 score and average precision are used as the scoring methods to build two separate models.
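The idea behind randomized search can be sketched without any library: sample a fixed number of parameter combinations from the candidate lists, score each one, and keep the best. The sketch below is not RandomizedSearchCV itself. The candidate values are hypothetical (the article does not state the ranges it searched), and `toy_score` is a placeholder for a cross-validated F1 or average-precision score.

```python
import random

# Hypothetical candidate lists; the ranges actually searched are not given.
param_candidates = {
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
}


def sample_params(candidates, rng):
    """Draw one value per parameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in candidates.items()}


def random_search(candidates, score_fn, n_iter, seed=0):
    """Evaluate n_iter random combinations and return the best one found."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_params(candidates, rng)
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Placeholder objective, NOT a real model evaluation."""
    return -abs(params["num_leaves"] - 31) - abs(params["learning_rate"] - 0.05) * 100


best_params, best_score = random_search(param_candidates, toy_score, n_iter=20)
```

Swapping the scoring function is what produces the two separate models described above: one search optimizes F1, the other average precision.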

After careful comparison, we decided to move forward with the F1 scoring-based model, as it has a higher PR-AUC and Recall, even though its ROC-AUC score is slightly lower. As a caveat, the F1 scoring-based model has a lower Precision, which means we might end up spending extra money on customers who do not intend to leave. We accept this trade-off and continue to treat Recall as the most important performance indicator.

The optimal parameters are as follows:

According to the confusion matrix of F1 scoring-based model, out of all customers who left, 70.45% can be correctly identified, which is a significant improvement from the previous model.

F1 Scoring-Based Model Confusion Matrix
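As a rough check of what this improvement means in absolute terms, the two Recall figures can be converted back into customer counts. The sketch assumes that both figures refer to the same 467 actual churners mentioned for the preliminary model; the resulting counts are inferred from the reported rates, not read from the confusion matrix.

```python
actual_churners = 467   # stated for the preliminary model's evaluation
baseline_recall = 0.49  # default-parameter model
tuned_recall = 0.7045   # F1 scoring-based model

# Approximate number of churners each model identifies correctly.
baseline_hits = round(actual_churners * baseline_recall)
tuned_hits = round(actual_churners * tuned_recall)
extra_hits = tuned_hits - baseline_hits

print(baseline_hits, tuned_hits, extra_hits)
```

Under that assumption, tuning lets the model catch on the order of a hundred additional churners in the evaluation set.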

According to the feature importance chart below, Tenure, MonthlyCharges and TotalCharges are the top three predictive factors for churn, followed by OnlineSecurity_No, InternetService_FiberOptic and Contract_Month_to_Month. We take those factors as the starting point for survival analysis and customer segmentation.

F1 Scoring-Based Model Feature Importance

Survival Analysis

Survival analysis is a powerful tool for inferring the relationship between tenure and customer retention. A Cox proportional-hazards model is used to find out which variables have more influence on customer churn at any given point in time.

Based on the feature importance chart from LightGBM and secondary research, Contract and TotalCharges are chosen as the features for survival analysis. We focus this analysis on customers with Internet service only. The model setup is as follows:

Cox.Model = coxph(Surv(tenure,Churn) ~ Contract + TotalCharges)

The model above yields a concordance of 0.949, which indicates that it ranks customers’ churn risk very well. The p-values of both features are below the 0.05 significance level, indicating that their effects are statistically significant.

Cox Proportional-Hazards Model Summary

Model coefficients give the following findings:

  • At a given instant in time, customers who have a month-to-month contract are 4.85 times as likely to churn as customers who have a 2-year contract, adjusting for their total charges.
  • At a given instant in time, customers who have a 2-year contract are 5.5 times as likely to churn as customers who have a 1-year contract.
  • At a given instant in time, customers whose total charges are 100 dollars higher are 9.58% less likely to churn.
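These statements come from exponentiating the Cox coefficients: a coefficient β corresponds to a hazard ratio of exp(β), and the effects multiply rather than add. The sketch below works backwards from the reported figures, assuming the 9.58% reduction is the hazard ratio for a 100-dollar increase in TotalCharges. The 500-dollar case is our own illustration and is not reported in the model summary.

```python
import math


def percent_change(hazard_ratio):
    """Convert a hazard ratio into a percentage change in churn hazard."""
    return (hazard_ratio - 1.0) * 100.0


# Hazard ratio implied by "9.58% less likely per 100 dollars".
hr_per_100 = 1.0 - 0.0958

# Implied per-dollar coefficient, since hazard ratio = exp(beta * dollars).
beta_per_dollar = math.log(hr_per_100) / 100.0

# Effects compound multiplicatively: 500 dollars is five 100-dollar steps.
hr_per_500 = math.exp(beta_per_dollar * 500.0)

# Coefficient implied by the 4.85x month-to-month vs. 2-year hazard ratio.
beta_month_to_month = math.log(4.85)

print(round(percent_change(hr_per_500), 1), round(beta_month_to_month, 3))
```

The point of the exercise is that a 500-dollar difference does not mean five times 9.58% less churn; the per-100-dollar ratio is raised to the fifth power instead.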

Customer Segmentation

As no single marketing strategy is effective for everyone, segmentation plays an important role in helping TelCo better understand customers and maximize their value for business. By implementing k-means clustering, customers are divided into discrete groups that share similar characteristics.

K-means Clustering Implementation

Tenure and MonthlyCharges are chosen to group similar customers because of their high feature importance. The algorithm identifies K cluster centers and then assigns each customer to the nearest center. To find the optimal number of clusters, we calculate the inertia, that is, the within-cluster sum of squared distances, for different values of K. As shown below, the optimal number of segments is 4.

K-means Clustering Elbow Plot
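For readers who want to see how inertia is obtained, here is a small pure-Python sketch of K-means (Lloyd’s algorithm) run for K = 1 to 6 on made-up (tenure, monthly charge) points. It is an illustration only: the data are hypothetical, a deterministic farthest-point initialization is used for reproducibility, and it is not the implementation used for the plot above.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic init: start at the first point, then repeatedly add
    the point that is farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iter=100):
    """Return (centers, labels, inertia) for k clusters."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    inertia = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, inertia


# Hypothetical (tenure in months, monthly charge in dollars) pairs.
points = [
    (2, 20), (4, 25), (3, 22),       # short tenure, low charge
    (3, 90), (5, 85), (4, 95),       # short tenure, high charge
    (60, 20), (65, 25), (62, 22),    # long tenure, low charge
    (61, 100), (66, 95), (70, 105),  # long tenure, high charge
]

inertia_by_k = {k: kmeans(points, k)[2] for k in range(1, 7)}
```

Plotting `inertia_by_k` against K gives the elbow curve: on this toy data the inertia drops sharply up to K = 4 and changes little afterwards.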

All the customers are divided into four clusters as shown below:

Four Clusters: At Risk, Champion, Newcomer, Loyalist

We define customers with short tenure and high monthly charges as At Risk customers, those with long tenure and high monthly charges as Champions, those with long tenure and low monthly charges as Loyalists, and those with short tenure and low monthly charges as Newcomers. The table below shows general statistics for each segment:

Segment Statistics Summary
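The naming rule can be written down as a small function that labels a cluster from its centroid. This is a hypothetical sketch: the centroids are made up, and using the overall averages reported earlier (32 months of tenure, $65 monthly charge) as cut-off points is our assumption, since the article does not state how "short" and "long" or "low" and "high" were separated.

```python
def name_segment(centroid, tenure_cut, charge_cut):
    """Map a (tenure, monthly charge) centroid to one of the four segment names."""
    tenure, monthly_charge = centroid
    long_tenure = tenure >= tenure_cut
    high_charge = monthly_charge >= charge_cut
    if long_tenure and high_charge:
        return "Champion"
    if long_tenure:
        return "Loyalist"
    if high_charge:
        return "At Risk"
    return "Newcomer"


# Hypothetical centroids as (tenure in months, monthly charge in dollars).
centroids = [(3.0, 22.3), (65.7, 100.0), (4.0, 90.0), (62.3, 22.3)]

# Assumed cut-offs: the overall average tenure and monthly charge.
names = [name_segment(c, tenure_cut=32, charge_cut=65) for c in centroids]
```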

The difference in churn rate between At Risk customers and Loyalists is much larger than the difference in the revenue they bring in. Loyalists, with long tenure and low monthly charges, have the lowest churn rate but make up the smallest share of the customer base.

In terms of revenue, 70% of revenue is generated by 30% of the customers: the Champions, who have long tenure and high monthly charges. At Risk customers have the highest churn rate even with relatively high TotalCharges. Newcomers generate the least revenue.

Profiling

After reviewing the general statistics, we move on to profile each segment based on demographics, account information and subscribed services.

  • At Risk: This segment has the highest percentage of senior customers. Over 60% of them have no partners or dependents. All of them have Internet service and most of them use fiber optic. More than 70% of them don’t have online security, online backup or tech support. 80% of this segment sign month-to-month contracts. They prefer electronic checks, and this cluster has the fewest customers who pay by credit card.
  • Champion: This segment also has a high percentage of senior customers. Most of them have partners and dependents. More than half of them have multiple lines, online backup and tech support. They usually sign longer-term contracts. Only a small percentage of them use mailed checks to pay bills.
  • Loyalist: More than 90% of this segment are non-senior customers and most of them have partners. They only use basic services. A quarter of them don’t have phone service with TelCo. Over half of them have neither Internet service nor any other additional service with TelCo. The segment has the highest percentage of two-year contract customers. Most of them opt out of paperless billing.
  • Newcomer: This segment has a demographic profile similar to that of Loyalists in terms of age, gender and family members. They only use the basic services provided by TelCo, and most of them sign short-term contracts. Over 50% of them use mailed checks to pay bills.

Compared to Champions, At Risk customers care less about online security and sign more short-term contracts. Over half of the Loyalists sign two-year contracts, and they have the lowest churn rate. Newcomers, despite their similarities to Loyalists, mostly sign short-term contracts and have a much higher churn rate.

Champions and At Risk customers tend to have Internet service, and most of them use fiber optic for faster Internet speed. These two segments also have a larger percentage of senior citizens than low-charge segments such as Loyalists and Newcomers.

Marketing Strategies

From a revenue perspective, it would be beneficial for TelCo to have as many Champions and Loyalists as possible. The following marketing strategies focus on keeping those two segments where they are, converting At Risk customers to Champions, and incentivizing Newcomers to become Loyalists.

  • To maintain the relationship with Champions: as most Champions value online security, TelCo can emphasize its online security technology in communication materials to strengthen trust. In addition, as most Champions have partners and dependents, TelCo can devise a family service bundle to better serve this segment.
  • To convert At Risk customers to Champions: since At Risk customers have services similar to those of Champions but without the protective add-ons, TelCo can introduce services that enhance data security and provide tech support to guide them through new services. TelCo could also work on incentivizing At Risk customers to sign long-term contracts.
  • To incentivize Newcomers to become Loyalists: TelCo can offer better terms for signing long-term contracts. In addition, encouraging different payment methods might reduce their payment frustration and thus improve retention.

Next Steps

The root causes of customer churn are not yet clear to us. Is it a poor Internet connection? Is it competitors’ promotions? We recommend sending out questionnaires to understand the motivations behind churn, as they are important for TelCo to improve its existing services and related marketing strategies.

How different incentives would work on different segments is also unclear. We recommend conducting follow-up A/B tests to find the most effective promotions for each segment.
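One simple way to read the outcome of such an A/B test is a two-proportion z-test on the churn rates of the control and treatment groups. The sketch below uses made-up counts and is only one of several reasonable analyses; the function name and the group sizes are hypothetical.

```python
import math


def two_proportion_z_test(churned_a, n_a, churned_b, n_b):
    """Two-sided z-test for a difference between two churn rates."""
    p_a = churned_a / n_a
    p_b = churned_b / n_b
    # Pooled churn rate under the null hypothesis of no difference.
    pooled = (churned_a + churned_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical result: group A is the control, group B received a promotion.
z, p_value = two_proportion_z_test(churned_a=120, n_a=500, churned_b=90, n_b=500)
print(round(z, 2), round(p_value, 3))
```

Running the test separately within each segment would show whether a promotion that works for, say, At Risk customers also works for Newcomers.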
