Statistics: Churn Modeling 3
The primary purpose of this code was to test the accuracy of the models we built to predict customer churn. The data provided was unstructured data, meaning data that is not stored in the common structured format of rows and columns. Natural Language Processing (NLP) models are often used to process unstructured data by analyzing the text within it, searching for specific words, combinations, symbols, or images. Many tools are available for processing unstructured data; two of the most common are PyTorch and TensorFlow. Both use deep learning, which applies neural networks with multiple layers to analyze and interpret complex data (Terra, 2024). However, PyTorch works best with fixed data, so it is often used in research and testing, while TensorFlow is more adaptable and can handle changing data, so it is often used in production. Both frameworks can draw on data stored in a NoSQL, or unstructured, database. NoSQL databases often rely on data lakes, which store data as-is, without requiring it to be ordered first. Data lakes are advantageous because they can scale to very large volumes of data. Structured data is best used when the type of data being collected is fixed and comes from controlled sources in a single format, for example, customer profiles collected from call centers or front-line employees. Unstructured data is best used when the data can come from multiple sources and in a variety of formats, for example, text pulled from multiple websites.
The principal tool chosen to measure the accuracy of the model is the classification report from sklearn. The classification report offers three scores. The first is precision: the percentage of correct positive predictions relative to the total number of positive predictions. The second is recall: the percentage of correct positive predictions relative to the total number of actual positives. Finally, there is the F1 score, the harmonic mean of precision and recall. The maximum F1 score is 1, and the closer the score is to 1, the better. The report also lists support, the number of actual occurrences of each class in the data set (Bobbitt, 2022).
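A minimal sketch of how these scores can be read from sklearn's classification report. The labels below are toy values for illustration, not the project's churn data; passing `output_dict=True` exposes the per-class numbers programmatically.

```python
from sklearn.metrics import classification_report

# Toy binary labels: 1 = churned, 0 = stayed (illustrative only)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

report = classification_report(y_true, y_pred, output_dict=True)

# For class 1: 3 true positives, 4 predicted positives, 4 actual positives,
# so precision = 3/4 and recall = 3/4; support = 4 actual class-1 samples.
print(report["1"]["precision"], report["1"]["recall"], report["1"]["support"])
```

Printing `classification_report(y_true, y_pred)` without `output_dict` gives the familiar formatted table instead.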
The model ran three full classification reports, with precision scores ranging from 0.80 to 0.89, recall from 0.95 to 1.0, and F1 scores from 0.89 to 0.92. The classification reports show that the model was accurate at predicting customer churn.
The code used three types of classification models. The first was Logistic Regression, which estimates the probability of an event occurring. The second was the Support Vector Machine (SVM), which typically solves binary problems by dividing the data into two groups. The third was Random Forest, which combines the output of multiple decision trees.
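The three model types can be compared in a single loop, as sketched below. This assumes a standard scikit-learn workflow; the synthetic data from `make_classification` stands in for the actual churn dataset, which is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the churn data: 500 customers, 8 features
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit each model and report its accuracy on the held-out test split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2f}")
```

Each fitted model could also be passed to `classification_report` to produce the per-class precision, recall, and F1 scores described above.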
Finally, a bar chart was used to display the importance of each separate feature to the prediction. The feature scores ranged from 'HasCrCard' at 0.02 to 'Age' at 0.24.
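A chart like this can be produced from a fitted Random Forest's `feature_importances_` attribute. The sketch below uses synthetic data in which 'Age' drives the outcome; the feature list mirrors two names mentioned in the text plus hypothetical placeholders, not the project's actual columns.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
features = ["CreditScore", "Age", "Balance", "HasCrCard"]  # illustrative names

# Synthetic data where the 'Age' column (index 1) determines churn
X = rng.normal(size=(300, len(features)))
y = (X[:, 1] + 0.3 * rng.normal(size=300) > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Bar chart of per-feature importance scores (they sum to 1)
plt.bar(features, model.feature_importances_)
plt.ylabel("Importance")
plt.title("Feature importance")
plt.savefig("feature_importance.png")
```

Because the importances sum to 1, a score of 0.24 for 'Age' means that feature accounted for roughly a quarter of the model's total split quality.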













