Create Your First Project
Start adding your projects to your portfolio. Click on "Manage Projects" to get started
Statistics: Churn Modeling
As a former banker, I can verify that customer retention is a topic that generates a significant amount of attention. Banks employ different tactics to increase retention, but all banks start with the question ‘who is likely to leave, and who is likely to stay?’. The purpose of the program is to answer a single question. Which customers will/will not churn.
The data set consists of 10,000 customers and gives basic demographic information, credit score, banking balance and relationship information, and product usage information. The dataset does not list the type of banking products, which is often considered the most predictive indicator of whether or not a customer will continue their relationship with the bank.
It is assumed that “Geography”, “Age”, and “EstimatedSalary” are the variables most likely responsible for predicting whether or not a customer leaves the bank. “Geography”, “Age”, and “EstimatedSalary” will be set as the independent variables (IV) and “Exited” will be the dependent variable (DV).
One Hot Encoder (OHE) is used to encode the categorical data into numerical data. OHE gives weight to categorical data so that it can be used in a linear regression model (Gefferth, 2024). ColumnTransformer is then used to prepare the IV’s to be used in machine learning models. “Passthrough” allows all columns not specifically designated, to be passed by without change. In this case, only the IV’s are being prepared. LabelEncoder is used on the DV to convert the categorical variable into a numerical value. The data is then split into 70% for training and 30% for testing.
Each IV was then tested for variance. Variance is the measure of spread in the data from its mean position. It shows how sensitive the model is to other subsets of the data. Low variance means that the model is less sensitive to changes in the training data. High variance means the data is very sensitive (Bias and Variance in Machine, 2024). The variance scores for the DV’s were: Tenure: 8.36; Balance: 3893436175.99; EstimatedSalary: 3307456784.13; CreditScore: 9341.86; Age: 109.99.
Finally, we test for accuracy using a linear regression model. A logistic regression model is a supervised machine learning algorithm used to predict the probability of an event or observation (Kanade, 2022). The precision metric measures the proportion of true positive predictions out of all predictions. The precision score was .82/.70. Recall measures the proportion of true positive predictions out of all actual positive data points. The recall score was .98/.16. Accuracy measures the percentage of correctly classified data points. The accuracy score for precision was .76 (.79 weighted), and for recall it was .57 (.81 weighted).
The data shows that, of the tested IV’s, Tenure is the only variable that affects whether a customer will stay with the bank. However, including additional variables likely would have added value to the results.

















