Predicting Customer Churn with ML Classification Models
Link to GitHub repo
Introduction
This case study tackles the problem of predicting customer churn for a U.S. Wealth Management business with over 10,000 clients. The objective was to identify which customers were most likely to leave and to design a cost-aware retention strategy that maximizes expected value. Using a 2021 Python stack, I executed the end-to-end data science workflow (EDA → cleaning → feature engineering → modeling → decisioning) and translated the results into actions for advisors. The deliverables included productionized churn probabilities at the customer level and an interactive Dash/Flask dashboard to prioritize outreach.
Executive Summary
I developed a decision function that combines predicted churn probability with business costs (churn cost, promo cost, and promo effectiveness) to rank customers by the expected value (EV) of intervention. Model insights consistently highlighted Age, Digital Activity, Number of Products, Balance, Gender, and the East region as key drivers of churn. The final XGBoost model achieved ROC-AUC = 0.87 and Recall = 0.77 (cross-validated), chosen to minimize costly false negatives in a dataset with a ~20% churn rate. Operationally, the strategy recommended targeting ~1,510 high-EV customers, down from ~2,500 raw predicted churners at a 0.20 probability cutoff: a ~40% reduction in interventions with higher expected ROI.
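For illustration, here is a minimal sketch of such a decision function, assuming a simple EV form (the exact formula used in the project may differ); the default costs mirror the business inputs discussed under Recommendations below.

```python
import numpy as np

def expected_value(p_churn, churn_cost=1000.0, promo_cost=500.0,
                   promo_effectiveness=0.8, wrong_action_cost=1.0):
    """EV of sending a retention promo to one customer (assumed form).

    If the customer would churn (probability p_churn), the promo saves
    churn_cost with probability promo_effectiveness; if they would stay,
    the promo spend plus a small wrong-action cost is wasted.
    """
    ev_if_churner = promo_effectiveness * churn_cost - promo_cost
    ev_if_stayer = -(promo_cost + wrong_action_cost)
    return p_churn * ev_if_churner + (1 - p_churn) * ev_if_stayer

# Rank customers by EV and intervene only where EV is positive.
probs = np.array([0.95, 0.40, 0.10])   # example predict_proba outputs
ev = expected_value(probs)
order = np.argsort(-ev)
targets = order[ev[order] > 0]
```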
Data Science Approach
Exploratory Data Analysis
I profiled the dataset (10K rows; 12+ features) and visualized correlations and churn rates across key dimensions to form early hypotheses. The target Exited showed strong relationships with Age, Geography_East, IsActiveMember, Balance, and Gender, indicating these would carry predictive signal. Churn was imbalanced (~20%), requiring stratified splits and class-aware metrics. I also surfaced distribution quirks (e.g., many zero balances) and group differences (e.g., older customers and specific product-count bands showing higher exit rates) that guided later feature and threshold choices.
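A sketch of the profiling and splitting steps, assuming the standard bank-churn schema (Exited, Age, NumOfProducts) and a hypothetical file name:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")  # hypothetical file name

# Class balance: ~20% churn, so every split is stratified on the target.
print(df["Exited"].value_counts(normalize=True))

# Churn rate by age band and product count to surface risk segments.
df["AgeBand"] = pd.cut(df["Age"], bins=[18, 30, 40, 50, 60, 100])
print(df.groupby("AgeBand", observed=True)["Exited"].mean())
print(df.groupby("NumOfProducts")["Exited"].mean())

X, y = df.drop(columns=["Exited"]), df["Exited"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```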
Data Cleaning
Data validation flagged impossible values: Age = 190 and Tenure = 30 for a 29-year-old. I corrected these via regression imputation for Age (predicted from IsActiveMember; imputed to 37.8) and mean imputation for Tenure (5.0), preserving realistic relationships. Three missing CreditScore values, all associated with high-income, multi-product clients, were imputed to 650 (domain-consistent). I retained two edge CreditScores (305, 865) given real-world variability, ensuring legitimate signal wasn't truncated. The cleaned dataset reduced bias risk and stabilized both linear and tree-based training.
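A condensed sketch of the validation-and-imputation pass (the real Age fix used a regression model rather than the constant shown here):

```python
import numpy as np

# Flag impossible values: no ages above 100, and tenure cannot exceed
# the customer's adult years.
df.loc[df["Age"] > 100, "Age"] = np.nan
df.loc[df["Tenure"] > df["Age"] - 18, "Tenure"] = np.nan

# Imputations mirroring the choices described above.
df["Age"] = df["Age"].fillna(37.8)                       # regression-imputed value
df["Tenure"] = df["Tenure"].fillna(df["Tenure"].mean())  # ≈ 5.0
df["CreditScore"] = df["CreditScore"].fillna(650)        # domain-consistent
```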
Feature Engineering
To enrich demographics, I left-joined 2010 U.S. Census surname data (covering 162,255 surnames) to derive ethnicity-percentage features and name frequency. For relational effects, I created a “Relative” feature using the prop100k surname frequency: after running chi-square tests across 50 candidate thresholds in [0.01, 0.5], I selected 0.055 to flag uncommon shared surnames as likely relatives. While “Relative” proved weakly predictive, the census features improved interpretability and segmentation.
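A sketch of the join and the Relative flag, assuming the public Census surname file with its name and prop100k columns:

```python
# Census surname table; column names follow the public 2010 file.
census = pd.read_csv("Names_2010Census.csv")
census["name"] = census["name"].str.title()  # match Surname casing (assumed)

df = df.merge(census, left_on="Surname", right_on="name", how="left")

# Likely relatives: customers sharing a surname that is rare in the
# general population (prop100k below the chi-square-selected threshold).
shared = df["Surname"].duplicated(keep=False)
df["Relative"] = (shared & (df["prop100k"] < 0.055)).astype(int)
```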


Model Training
I trained Logistic Regression, Random Forest, and XGBoost to compare linear vs. ensemble learners, optimizing primarily for Recall due to the higher cost of missed churners. All models used stratified CV and class-imbalance handling. Results (mean ± σ):
- Logistic Regression: ROC-AUC 0.75 (±0.02), Recall 0.71 (±0.04), Precision 0.64 (±0.02)
- Random Forest: ROC-AUC 0.86 (±0.02), Recall 0.76 (±0.05), Precision 0.50 (±0.03)
- XGBoost: ROC-AUC 0.87 (±0.03), Recall 0.77 (±0.03), Precision 0.50 (±0.02)
XGBoost delivered the best recall with the lowest variance, so its predict_proba outputs powered the decision function.
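The comparison loop looked roughly like this sketch, assuming scikit-learn with stratified folds and class-weighting for the ~20% imbalance (X_train here is fully numeric after encoding):

```python
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "RandomForest": RandomForestClassifier(class_weight="balanced", random_state=42),
    "XGBoost": XGBClassifier(scale_pos_weight=4, random_state=42),  # ~80/20 classes
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_validate(model, X_train, y_train, cv=cv,
                            scoring=("roc_auc", "recall", "precision"))
    print(name, {m: round(scores[f"test_{m}"].mean(), 2)
                 for m in ("roc_auc", "recall", "precision")})
```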
Recommendations
I translated model insights into client-management actions prioritized by EV. The top risk segments were >40 years old, female, digitally active, holding 3–4 products, with balances over $75k: high-value clients warranting proactive retention. The East division consistently surfaced as a predictor, pointing to potential competitive pressure or operational issues worth investigating. With business inputs set to $1,000 churn cost, $500 promo cost, 80% promo effectiveness, and $1 wrong-action cost, the strategy recommended 1,510 targeted promotions (vs. ~2,500 naive churn flags), focusing spend where expected value per customer peaked at ~$266.
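As a rough sanity check under the assumed EV form sketched in the Executive Summary, a customer near the top of the predicted-probability range lands close to that peak:

```python
# EV(p) = p*(0.8*1000 - 500) - (1 - p)*(500 + 1) under the assumed form
print(expected_value(0.96))   # ≈ 268, near the ~$266 peak reported above
```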


Dashboard
To operationalize the workflow, I built a Dash + Flask app that serves real-time predictions and a ranked customer table. Advisors can (1) search by CustomerID, (2) view probability plus confidence, and (3) see rank by EV given the current business cost settings. The UI surfaces the key drivers (Age, Number of Products, Digital Activity, Balance, Gender) next to each prediction to support conversations with clients. The app was containerized and deployed on an EC2 instance in 2021, enabling lightweight access for management and front-line teams.
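A stripped-down sketch of the lookup flow (the production app had a fuller layout; the scored table and its columns here are hypothetical):

```python
import pandas as pd
from dash import Dash, Input, Output, dcc, html

# Hypothetical scored table; the real app read live model output.
scored = pd.DataFrame(
    {"proba": [0.91, 0.22], "ev_rank": [1, 1840]},
    index=pd.Index([1001, 1002], name="CustomerID"),
)

app = Dash(__name__)
app.layout = html.Div([
    dcc.Input(id="customer-id", type="number", placeholder="CustomerID"),
    html.Div(id="prediction"),
])

@app.callback(Output("prediction", "children"), Input("customer-id", "value"))
def show_prediction(cid):
    """Look up churn probability and EV rank for one customer."""
    if cid not in scored.index:
        return "Enter a valid CustomerID"
    row = scored.loc[cid]
    return f"Churn probability: {row.proba:.2f} | EV rank: {int(row.ev_rank)}"

if __name__ == "__main__":
    app.run_server(host="0.0.0.0", port=8050)
```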
Thanks for reading!