Churn prediction with Machine Learning / LazyPredict
Build a simple churn model from scratch, with the help of LazyPredict
Churn is a widely used KPI that measures the rate at which customers stop doing business with an entity. It represents the percentage of service subscribers who discontinue their subscriptions within a given time period. Rates vary by industry, and tracking them is essential, since churn affects Annual Recurring Revenue (ARR), Customer Acquisition Cost (CAC) and other relevant KPIs.
We will build a simple Machine Learning model from scratch, with the help of LazyPredict, to predict potential customer churn. The construction is divided into three steps, which could be extended further if extra information is available:
- Data Visualization;
- Data Preparation;
- Machine Learning / Prediction.
Here we go 👇
First, we import the libraries for Data Cleaning/Visualization, LazyPredict and SKLearn.
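A minimal set of imports for this walkthrough might look like the sketch below, assuming lazypredict and scikit-learn are installed; the exact list in the original notebook may differ:

```python
# Data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# LazyPredict for quick model benchmarking
from lazypredict.Supervised import LazyClassifier

# scikit-learn utilities and the models explored later on
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
```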
The dataset used in this article is originally from IBM (“Build a customer churn predictor using Watson Studio and Jupyter Notebooks”). It is available on GitHub, as shown below. Now, let's take a closer look at it:
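A sketch of loading the data, assuming the commonly mirrored copy of IBM's Telco Customer Churn CSV on GitHub (the exact URL may differ from the one used in the original notebook):

```python
# Assumed mirror of the IBM Telco Customer Churn dataset on GitHub
URL = ("https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/"
       "master/data/Telco-Customer-Churn.csv")

df = pd.read_csv(URL)
print(df.shape)  # roughly 7,043 rows and 21 columns in this dataset
df.head()        # the initial view shown below
```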
Initial view of the raw DataFrame:
Data Visualization: Now, let's plot some views to help us surface potential insights.
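As a sketch, two seaborn count plots (column names as in the Telco dataset) can contrast churn against MultipleLines and Contract:

```python
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Churn split by whether the customer has multiple lines
sns.countplot(data=df, x="MultipleLines", hue="Churn", ax=axes[0])

# Churn split by contract type (month-to-month vs. longer terms)
sns.countplot(data=df, x="Contract", hue="Churn", ax=axes[1])
plt.show()
```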
At first sight, there are two potential insights in this dataset: there seems to be no direct correlation between customers having multiple lines and churn, and the churn rate is higher for month-to-month contracts. Well, what about prices: do they affect churn?
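One way to look at prices, as a sketch (the violin plot is my choice here; the original chart type may differ):

```python
# Monthly charges distribution, split by gender and churn
sns.catplot(data=df, x="gender", y="MonthlyCharges", hue="Churn", kind="violin")
plt.show()
```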
By splitting our churn by gender and visualizing it in the graph above, we do not expect a large difference across genders, but we do infer higher churn among customers with a low MonthlyCharges. For a final visualization, we plot tenure, which represents how long the customer has stayed with the company. As expected, there is an inverse correlation: most churn happens in a customer's first periods.
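A sketch of that tenure plot, using a stacked histogram:

```python
# Tenure distribution for churned vs. retained customers;
# churn should concentrate in the early months
sns.histplot(data=df, x="tenure", hue="Churn", multiple="stack", bins=30)
plt.show()
```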
Data Preparation: Now we need to prepare the data by generating dummies for the categorical variables, looking for potential errors, and converting the datatypes in the DataFrame.
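One classic quirk of this dataset is that TotalCharges arrives as text with a few blank values; a sketch of the cleanup, assuming that quirk, might be:

```python
# TotalCharges is read as an object column with some blank strings;
# coerce it to numeric and drop the handful of rows that fail to parse
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"])

# The customer ID carries no predictive signal
df = df.drop(columns=["customerID"])

# Encode the target as 0/1
df["Churn"] = df["Churn"].map({"No": 0, "Yes": 1})
```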
We’re almost there! Now let's finish this step and create our dummies. For this article, I'm using the pd.get_dummies function from Pandas.
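A sketch with pd.get_dummies (drop_first is one possible choice here; the original notebook may have kept all levels):

```python
# One-hot encode the remaining categorical columns;
# drop_first avoids perfectly collinear dummy pairs
df = pd.get_dummies(df, drop_first=True)
df.head()
```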
Now we have everything ready to begin the Machine Learning part with the help of LazyPredict, which builds a lot of basic models with very little code and helps us understand which models work best without any parameter tuning.
As with every ML problem, we must split our DataFrame into a training set and a test set. It is crucial to evaluate how well our model performs on data it has never seen before. For that, SKLearn helps us in a very intuitive way:
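A sketch of the split; the 25% test size matches the support of 1,761 rows in the report at the end, while the random_state and stratify arguments are my assumptions:

```python
X = df.drop(columns=["Churn"])  # independent variables
y = df["Churn"]                 # dependent variable: churn

# Hold out data the models never see during training;
# stratify keeps the churn ratio similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
```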
Easy! Now for the fun part: action! In just three lines, we run 30 machine learning models:
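With LazyPredict's API, those three lines look roughly like this:

```python
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(models)  # leaderboard of ~30 fitted baseline models
```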
Voilà! Now we have a very nice view of the best algorithms for this problem. For a deeper exercise, we'll dive into four of them: AdaBoostClassifier, LogisticRegression, LinearSVC and RidgeClassifier.
At this point, it is better to take a closer look, step by step:
First, we instantiate the models, using the classes we imported at the beginning.
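A sketch of the instantiation; the constructor arguments here are my assumptions, not necessarily the original ones:

```python
ada   = AdaBoostClassifier(random_state=42)
logit = LogisticRegression(max_iter=1000)
svc   = LinearSVC()
ridge = RidgeClassifier()
```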
Then we lay out candidate hyperparameters to improve the models. By storing the parameters in variables, we can use a technique known as grid search, which iterates over several parameter combinations to find the ones with the highest performance. Calling the fit method starts the search.
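A sketch of that step with GridSearchCV; the grids below are illustrative, not the exact ones from the original notebook:

```python
# Candidate hyperparameters for each model (illustrative values)
param_grids = {
    "ada":   {"n_estimators": [50, 100, 200], "learning_rate": [0.1, 0.5, 1.0]},
    "logit": {"C": [0.01, 0.1, 1, 10]},
    "svc":   {"C": [0.01, 0.1, 1, 10]},
    "ridge": {"alpha": [0.1, 1.0, 10.0]},
}

candidates = {"ada": ada, "logit": logit, "svc": svc, "ridge": ridge}
searches = {}
for name, model in candidates.items():
    search = GridSearchCV(model, param_grids[name], scoring="roc_auc", cv=5)
    search.fit(X_train, y_train)  # the fit call starts the grid search
    searches[name] = search
```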
Then we retrieve the best parameters found by the grid search. Since we need to test them in our models, we can simply pass them as real parameters to each one by unpacking the dictionary. There is no need to rewrite everything again, as shown right after.
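A sketch of reading best_params_ and feeding each model its winning parameters:

```python
for name, search in searches.items():
    print(name, search.best_params_)

# Re-instantiate each model by unpacking its best_params_ dictionary;
# no need to retype the winning values by hand
ada   = AdaBoostClassifier(random_state=42, **searches["ada"].best_params_)
logit = LogisticRegression(max_iter=1000, **searches["logit"].best_params_)
svc   = LinearSVC(**searches["svc"].best_params_)
ridge = RidgeClassifier(**searches["ridge"].best_params_)

for model in (ada, logit, svc, ridge):
    model.fit(X_train, y_train)
```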
At the end, we ask the models to predict our dependent variable: customer churn.
Let’s go forward!
Now we call the predict method on the models and evaluate them with the AUC metric (instead of accuracy, which can be biased on imbalanced classification problems). From an interpretation standpoint, AUC is more useful because it shows how good your model is at ranking predictions. It tells you the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.
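One way to compute AUC for each model, as a sketch; note that roc_auc_score works best with continuous scores, so this falls back to decision_function for the linear models that lack predict_proba:

```python
def auc_of(model):
    # Prefer probability scores; fall back to the signed margin
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]
    else:
        scores = model.decision_function(X_test)
    return roc_auc_score(y_test, scores)

for name, model in [("AdaBoost", ada), ("LogisticRegression", logit),
                    ("LinearSVC", svc), ("RidgeClassifier", ridge)]:
    print(f"{name}: AUC = {auc_of(model):.2f}")
```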
Finally, we reach the end, and the winner is…
The best-performing model is AdaBoost, with an AUC score of 0.74!
We can plot the confusion matrix for AdaBoost, for a better review of the ratings, such as an accuracy of 0.81 (a code sketch follows the report below):
```
              precision    recall  f1-score   support

           0       0.85      0.91      0.88      1283
           1       0.70      0.56      0.62       478

    accuracy                           0.81      1761
   macro avg       0.77      0.74      0.75      1761
weighted avg       0.81      0.81      0.81      1761
```
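A sketch of producing the report above and a confusion-matrix heatmap:

```python
y_pred = ada.predict(X_test)

# Text summary: the classification report shown above
print(classification_report(y_test, y_pred))

# Confusion matrix as a heatmap for a quick visual check
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```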
There is still room for improvement, with the help of XGBoost, a larger amount of data or some feature engineering. That's it for now!
Follow me on LinkedIn via:
Thiago Bellotto Rosa | LinkedIn
Thanks For Reading!