📊 Banking Customer Churn Prediction

🔍 Project Overview

This repository contains a machine learning project focused on predicting customer churn in the banking industry. Created as a portfolio project to demonstrate data science and ML engineering skills, it applies software engineering best practices to create a well-tested and well-documented prediction system.

Why Customer Churn Matters

Customer churn (when customers stop using a company's services) significantly impacts business revenue and growth. In banking specifically:

Acquiring new customers costs 5-25x more than retaining existing ones
Even small improvements in retention rates can have significant financial implications
Identifying at-risk customers before they leave enables proactive intervention

This project explores how machine learning can identify patterns that indicate increased churn probability. It follows software engineering best practices and aims for clean, production-style Python code, and its insights could inform customer retention strategies.

💡 Solution Approach

This project implements a complete ML pipeline demonstrating best practices in data science:

Data Exploration & Processing

Comprehensive EDA: Thorough exploration of banking customer data with visualization
Data Cleaning: Handling of missing values and outliers
Feature Engineering: Creating meaningful predictors from raw banking data
Data Transformation: Preparing categorical and numerical variables for modeling
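The cleaning steps listed above are not spelled out in this README. As one possible illustration, the sketch below fills missing numeric values with the column median and clips outliers using the 1.5 x IQR rule. Both choices, the helper name `clean_numeric`, and the toy values are assumptions; only the column names come from this project.

```python
import pandas as pd

# Toy rows (invented values) with one missing entry per column and one extreme credit limit
raw = pd.DataFrame({
    "Customer_Age": [45, 38, None, 52, 41],
    "Credit_Limit": [12000.0, 3500.0, 8200.0, 250000.0, None],
})


def clean_numeric(df, columns):
    """Hypothetical helper: median-impute, then clip to the 1.5 x IQR fences."""
    cleaned = df.copy()
    for col in columns:
        # Fill gaps with the median, which is robust to the outlier
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
        # Clip values that fall outside the interquartile fences
        q1, q3 = cleaned[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        cleaned[col] = cleaned[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return cleaned


cleaned = clean_numeric(raw, ["Customer_Age", "Credit_Limit"])
print(cleaned)
```

Clipping keeps every row, which matters when the minority (churn) class is small; dropping outlier rows would be an alternative with different tradeoffs.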

Model Development

Multiple Algorithms: Implementation of Random Forest and Logistic Regression
Hyperparameter Optimization: Grid search for model tuning
Performance Evaluation: Comprehensive metrics including ROC-AUC, precision, recall, and F1
Feature Importance Analysis: Identifying key factors that predict customer churn
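The model development steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual training code: the parameter grids, the 3-fold cross-validation, and the `roc_auc` scoring choice are assumptions picked to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the bank data (assumption, not the real dataset)
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.84, 0.16], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Hypothetical, deliberately small grids so the sketch runs quickly
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=42),
        {"n_estimators": [50, 100], "max_depth": [4, None]},
    ),
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Tune on the training split only; the test split is touched once, at the end
    search = GridSearchCV(estimator, grid, cv=3, scoring="roc_auc")
    search.fit(X_train, y_train)
    churn_proba = search.predict_proba(X_test)[:, 1]
    results[name] = {
        "best_params": search.best_params_,
        "test_roc_auc": roc_auc_score(y_test, churn_proba),
        "test_f1": f1_score(y_test, search.predict(X_test)),
    }

for name, summary in results.items():
    print(name, summary)
```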

Software Engineering Best Practices

Modular Design: Well-structured code with separation of concerns
Documentation: Comprehensive docstrings and comments
Testing: Complete test suite using pytest
Logging: Detailed execution logs for process monitoring
Code Quality: Adherence to PEP8 style guidelines
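As an illustration of the logging practice, the sketch below sets up a file-backed logger with the standard library. The logger name `churn_pipeline`, the file name, and the message format are hypothetical and may differ from what the project actually uses.

```python
import logging
import os
import tempfile


def build_logger(log_path, name="churn_pipeline"):
    """Hypothetical helper: return a logger that writes INFO and above to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate lines if called twice
    handler = logging.FileHandler(log_path, mode="w")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False
    return logger


# A temporary directory stands in for the repository's logs folder
log_path = os.path.join(tempfile.mkdtemp(), "churn_library.log")
logger = build_logger(log_path)
logger.info("Feature engineering successful: %s features created", 32)
logger.error("Feature engineering failed: %s", "Missing critical features")
for handler in logger.handlers:
    handler.flush()

with open(log_path) as log_file:
    log_lines = log_file.read().splitlines()
print("\n".join(log_lines))
```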

🚀 Project Implementation

Code Quality Example

import logging

import pandas as pd
from sklearn.model_selection import train_test_split


def perform_feature_engineering(df, response='Churn'):
    """
    Engineer features for machine learning model from preprocessed data.
    
    Args:
        df (pandas.DataFrame): Preprocessed data
        response (str, optional): Target variable name. Defaults to 'Churn'.
        
    Returns:
        tuple: X_train, X_test, y_train, y_test - split and prepared modeling datasets
    
    Raises:
        ValueError: If critical features are missing
        TypeError: If df is not a pandas DataFrame
    """
    if not isinstance(df, pd.DataFrame):
        logging.error("Input is not a pandas DataFrame")
        raise TypeError("Input must be a pandas DataFrame")
        
    try:
        # Validate expected columns present
        expected_features = ['Customer_Age', 'Dependent_count', 'Total_Trans_Ct']
        missing_cols = [col for col in expected_features if col not in df.columns]
        if missing_cols:
            raise ValueError(f"Missing critical features: {missing_cols}")

        # Resolve the target column, falling back to 'Churn' when none is given
        target = response or 'Churn'
        y = df[target]

        # One-hot encode the categorical columns
        cat_columns = ['Gender', 'Education_Level', 'Marital_Status',
                       'Income_Category', 'Card_Category']
        X = pd.get_dummies(df, columns=cat_columns, drop_first=True)

        # Remove the target from the feature matrix
        X = X.drop(columns=[target])
        
        # Train-test split with stratification
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=42, stratify=y
        )
        
        logging.info("Feature engineering successful: %s features created", X.shape[1])
        return X_train, X_test, y_train, y_test
        
    except Exception as err:
        logging.error("Feature engineering failed: %s", str(err))
        raise
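To make the encoding and split steps in the function above more concrete, here is a small self-contained sketch on invented rows. It shows which columns `pd.get_dummies(..., drop_first=True)` produces and how a stratified split divides a 20% churn rate. The toy values are assumptions; the column names follow the function above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 20 invented rows with a 20% churn rate
toy = pd.DataFrame({
    "Gender": ["M", "F"] * 10,
    "Card_Category": ["Blue", "Blue", "Silver", "Gold"] * 5,
    "Total_Trans_Ct": range(20),
    "Churn": [1, 0, 0, 0, 0] * 4,
})

# drop_first=True drops the first level of each category ("F" and "Blue" here)
# to avoid perfectly collinear dummy columns
encoded = pd.get_dummies(
    toy, columns=["Gender", "Card_Category"], drop_first=True
)
X = encoded.drop(columns=["Churn"])
y = encoded["Churn"]

# stratify=y keeps the churn rate similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(list(X.columns))
print("train rows:", len(X_train), "churners:", int(y_train.sum()))
print("test rows:", len(X_test), "churners:", int(y_test.sum()))
```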

Testing Approach

The project includes comprehensive testing to ensure reliability:

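import os

# Assumes import_data and perform_eda are defined in churn_library.py
from churn_library import import_data, perform_eda


def test_perform_eda_creates_expected_plots():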
    """
    Test that perform_eda function creates all expected exploratory plots.
    """
    # Setup
    df = import_data("./data/bank_data.csv")
    expected_files = [
        './images/eda/customer_age_distribution.png',
        './images/eda/marital_status_distribution.png',
        './images/eda/transaction_heatmap.png',
        './images/eda/churn_distribution.png'
    ]
    
    # Remove any existing files for clean test
    for file in expected_files:
        if os.path.exists(file):
            os.remove(file)
    
    # Execute
    perform_eda(df)
    
    # Assert
    for file in expected_files:
        assert os.path.exists(file), f"EDA failed to create {file}"
        assert os.path.getsize(file) > 0, f"EDA created empty file: {file}"

📊 Model Performance

This project compares two machine learning models for churn prediction:

🧪 Results Summary

Model	Accuracy	Precision	Recall	F1 Score
Random Forest	1.00	1.00	1.00	1.00
Logistic Regression	0.89	0.82	0.73	0.76

Precision, recall, and F1 in this table are consistent with macro averages of the per-class values in the classification reports below.

Model Performance Details

The classification reports show detailed metrics for both models:

🌲 Random Forest Performance

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5957
           1       1.00      1.00      1.00      1131

    accuracy                           1.00      7088

📉 Logistic Regression Performance

              precision    recall  f1-score   support

           0       0.91      0.97      0.94      5957
           1       0.74      0.48      0.58      1131

    accuracy                           0.89      7088
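As a quick cross-check, the logistic regression row of the results summary can be recomputed from the per-class values in the report above. The sketch below uses only the numbers printed in this README; the helper names are illustrative. It also shows the support-weighted average for comparison, since the two averages diverge on imbalanced data.

```python
# Per-class values copied from the logistic regression report above
per_class = {
    0: {"precision": 0.91, "recall": 0.97, "f1": 0.94, "support": 5957},
    1: {"precision": 0.74, "recall": 0.48, "f1": 0.58, "support": 1131},
}


def macro_average(metric):
    """Unweighted mean over classes: each class counts equally."""
    values = [row[metric] for row in per_class.values()]
    return sum(values) / len(values)


def weighted_average(metric):
    """Mean over classes weighted by support: the majority class dominates."""
    total = sum(row["support"] for row in per_class.values())
    weighted = sum(row[metric] * row["support"] for row in per_class.values())
    return weighted / total


for metric in ("precision", "recall", "f1"):
    print(
        f"{metric}: macro={macro_average(metric):.3f} "
        f"weighted={weighted_average(metric):.3f}"
    )
```

The macro values (0.825, 0.725, 0.760) line up with the 0.82, 0.73, and 0.76 shown in the summary table, which suggests the table reports macro averages.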

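The Random Forest model classifies every one of the 7,088 evaluated rows correctly. A perfect score is unusual for churn data and should be treated with caution. Possible explanations include:

Features that very clearly separate the classes in this particular dataset
Data leakage, for example a feature derived from the target or rows shared between the training and test sets
Scores computed on the training split instead of the held-out test split; with test_size=0.3 the test set is the smaller portion, so it is worth confirming which split the 7,088 rows represent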
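One inexpensive way to investigate a perfect score is to report the row count and accuracy of both splits side by side. The sketch below does this on synthetic data (an assumption, not the bank dataset): a fully grown random forest typically scores near 1.00 on the rows it was trained on, while the held-out score is lower.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data with some label noise (assumption, not the real dataset)
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.84, 0.16], flip_y=0.05, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score each split separately and print its size next to the score
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train rows={len(y_train)} accuracy={train_acc:.3f}")
print(f"test rows={len(y_test)} accuracy={test_acc:.3f}")

# The report that belongs in a results section is the one on the held-out rows
print(classification_report(y_test, model.predict(X_test)))
```

If the support column of a report matches the training row count, the report describes memorization of the training data and says little about generalization.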

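The Logistic Regression model reaches 0.89 accuracy, but that figure is helped by class imbalance: about 84% of the evaluated rows (5,957 of 7,088) belong to class 0. For the churn class it identifies fewer than half of the actual churners (recall 0.48 for class 1), at 0.74 precision.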

Confusion Matrix Analysis

A confusion matrix splits the Random Forest model's predictions into four groups:

True Positives: Correctly identified customers likely to churn
False Positives: Customers incorrectly flagged as likely to churn
False Negatives: At-risk customers that the model failed to identify
True Negatives: Correctly identified stable customers
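The four groups above can be read directly from scikit-learn's `confusion_matrix`. The labels in this sketch are invented for illustration and are not the project's predictions.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 1 = churned, 0 = stayed
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of flagged customers who really churned
recall = tp / (tp + fn)     # share of real churners the model caught

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For a retention campaign, false negatives (missed churners) and false positives (unnecessary outreach) usually carry different costs, so the two error counts are worth reading separately.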

Feature Importance Analysis

The analysis reveals several important predictors of customer churn:

Total_Trans_Ct: Transaction frequency is the strongest predictor
Total_Trans_Amt: Total spending volume is highly relevant
Customer_Age: Customer age affects churn likelihood
Credit_Limit: Available credit shows relationship with retention

These insights align with common banking industry knowledge that customer engagement (measured through transaction activity) is strongly correlated with retention.
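A ranking like the one above is commonly obtained from a fitted forest's `feature_importances_` attribute. The sketch below shows the mechanics on synthetic data. The churn rule is an assumption built into the data, so the resulting ranking demonstrates the technique only and is not evidence about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n_rows = 500

# Synthetic features that reuse the column names discussed above (values are invented)
X = pd.DataFrame({
    "Total_Trans_Ct": rng.integers(10, 140, n_rows),
    "Total_Trans_Amt": rng.integers(500, 18000, n_rows),
    "Customer_Age": rng.integers(26, 74, n_rows),
    "Credit_Limit": rng.uniform(1400, 35000, n_rows),
})

# Synthetic rule (assumption): customers with few transactions churn
y = (X["Total_Trans_Ct"] < 45).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Impurity-based importances sum to 1; sort so the strongest feature comes first
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality numeric features, so `sklearn.inspection.permutation_importance` on held-out data is a useful second opinion.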

🧪 Key Learnings

This project demonstrates several important aspects of applied machine learning:

Imbalanced Classification: Techniques for handling the typical class imbalance in churn prediction
Feature Engineering: Creating meaningful predictors from banking transaction data
Model Comparison: Evaluating tradeoffs between different algorithms
Software Engineering: Applying best practices to data science workflows
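This README does not say which imbalance-handling technique the project uses. As one commonly used option, the sketch below compares a plain logistic regression with one trained using `class_weight="balanced"`, again on synthetic data. It is an assumption-laden illustration, not the project's method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 16% positives (assumption, not the real dataset)
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.84, 0.16], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scores = {}
for label, class_weight in (("unweighted", None), ("balanced", "balanced")):
    # "balanced" reweights samples inversely to class frequency during fitting
    model = LogisticRegression(max_iter=1000, class_weight=class_weight)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[label] = {
        "recall": recall_score(y_test, predictions),
        "precision": precision_score(y_test, predictions, zero_division=0),
    }

for label, values in scores.items():
    print(label, values)
```

Reweighting usually trades some precision for higher recall on the minority class; whether that trade is worthwhile depends on the relative cost of a missed churner versus an unnecessary retention offer.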

🚀 Getting Started

Prerequisites

Python 3.8+
Libraries listed in requirements.txt

Installation

# Clone repository
git clone https://github.com/levisstrauss/Banking-Customer-Churn-Prediction.git
cd Banking-Customer-Churn-Prediction

# Create virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Running the Pipeline

# Run the main script
python churn_library.py

# Run tests
python -m pytest churn_script_logging_and_tests.py -v

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Udacity Machine Learning DevOps Engineer Nanodegree program for project inspiration
The scikit-learn team for their excellent documentation and examples
