Bank Marketing Classification Documentation


This project uses machine learning to classify bank marketing campaign outcomes, aiming to improve prediction accuracy.

Last Updated: January 5th, 2025

Table of Contents

  1. Installation
  2. Usage
  3. Features

Installation

Make sure Python is installed. Then follow these steps to set up the environment and run the application:

  1. Clone the Repository:

    git clone https://github.com/Sambonic/bank-marketing-classification
    cd bank-marketing-classification

  2. Create a Python Virtual Environment:

    python -m venv env

  3. Activate the Virtual Environment:
  • On Windows:

    env\Scripts\activate

  • On macOS and Linux:

    source env/bin/activate

  4. Ensure Pip is Up-to-Date:

    python -m pip install --upgrade pip

  5. Install Dependencies:

    pip install -r requirements.txt

  6. Open the project in Jupyter and run the notebook as shown in the Usage section below.

Usage

To use this project, follow these steps:

  1. Execute the notebook: Open main-repo/src/bank-marketing-classification-ml.ipynb in a Jupyter Notebook environment. Run all code cells sequentially.

  2. Data Exploration: The notebook begins by loading the dataset, displaying initial rows, and providing summary statistics (data types, unique values, descriptive statistics, and missing values). Data visualizations (histograms, pie charts, count plots, and correlation heatmaps) are generated to understand the data distribution and relationships between features.
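
A minimal sketch of what this exploration step could look like; the file name and separator are assumptions rather than details taken from the notebook:

```python
# Hedged sketch of the initial exploration; "bank.csv" and the ";" separator
# are assumptions (the UCI bank-marketing files are semicolon-separated).
import pandas as pd

df = pd.read_csv("bank.csv", sep=";")

print(df.head())        # initial rows
print(df.dtypes)        # data types
print(df.nunique())     # unique values per column
print(df.describe())    # descriptive statistics
print(df.isna().sum())  # missing values per column
```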

  3. Data Preprocessing: The notebook then performs data preprocessing steps (a small imputation sketch follows this list), including:

    • Renaming columns for better readability.
    • Encoding categorical features using label encoding.
    • Handling missing values. Multiple imputation methods (KNN, KMeans, median imputation for numerical features; mode, KModes imputation for categorical features) are evaluated using Random Forest classification accuracy. The best performing method is then applied. The notebook also demonstrates dropping null values as a comparison.
    • Discretizing the numerical features 'Age' and 'Campaign_Contacts' into categorical bins, and applying outlier-handling techniques.
    • Removing duplicate rows.
    • Converting data types to be appropriate for models.
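
A hedged sketch of the imputation comparison mentioned above, using a small synthetic frame; the real notebook compares more methods (KNN, KMeans, median, mode, KModes) on the actual dataset:

```python
# Sketch: score each candidate imputer by the accuracy of a Random Forest
# trained on the imputed data. The synthetic data, column names, and split
# ratio are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(500, 4)),
                 columns=["age", "balance", "duration", "campaign_contacts"])
y = (X["duration"] > 0).astype(int)
X.iloc[rng.choice(500, 50, replace=False), 0] = np.nan  # inject missing values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for name, imputer in {"median": SimpleImputer(strategy="median"),
                      "knn": KNNImputer(n_neighbors=5)}.items():
    clf = RandomForestClassifier(random_state=0).fit(imputer.fit_transform(X_train), y_train)
    acc = accuracy_score(y_test, clf.predict(imputer.transform(X_test)))
    print(f"{name} imputation accuracy: {acc:.3f}")
```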
  4. Feature Selection: Two feature selection methods are applied:

    • Chi-squared test for categorical features.
    • Random Forest feature importance for all features. The notebook then identifies common features selected by both methods.
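
A rough sketch of how the two selection methods could be combined; the helper name `select_common_features` and the top-k cut-off are hypothetical:

```python
# Sketch: chi-squared scores on label-encoded (non-negative) categorical
# features, Random Forest importances on all features, and the intersection
# of the two selections. The k cut-off is an assumption.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2

def select_common_features(X_cat, X_all, y, k=10):
    chi = SelectKBest(chi2, k=min(k, X_cat.shape[1])).fit(X_cat, y)
    chi_features = set(X_cat.columns[chi.get_support()])

    rf = RandomForestClassifier(random_state=0).fit(X_all, y)
    importances = pd.Series(rf.feature_importances_, index=X_all.columns)
    rf_features = set(importances.nlargest(k).index)

    return sorted(chi_features & rf_features)  # features chosen by both methods
```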
  5. Class Imbalance Handling: The notebook addresses the class imbalance by employing both undersampling and oversampling. It demonstrates undersampling by downsampling the majority class to match the size of the minority class, then uses SMOTENC to oversample the minority class while accounting for categorical features.
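
A hedged sketch of both balancing strategies on a toy frame (requires the imbalanced-learn package; column names and values are placeholders):

```python
# Sketch: random undersampling of the majority class, then SMOTENC
# oversampling that treats the "job" column (index 0) as categorical.
import pandas as pd
from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import RandomUnderSampler

X = pd.DataFrame({
    "job": ["admin", "technician", "services", "retired"] * 50,
    "age": list(range(200)),
    "campaign_contacts": [1, 2, 3, 4] * 50,
})
y = pd.Series([0] * 180 + [1] * 20)  # heavily imbalanced target

X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
X_over, y_over = SMOTENC(categorical_features=[0], random_state=0).fit_resample(X, y)

print(y_under.value_counts())  # 20 / 20 after undersampling
print(y_over.value_counts())   # 180 / 180 after oversampling
```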

  6. Model Training and Evaluation: Several classification models (KNeighborsClassifier, DecisionTreeClassifier, LogisticRegression, RandomForestClassifier, LGBMClassifier, and XGBClassifier) are trained and evaluated on the preprocessed data. Each model is evaluated under different combinations of feature selection and hyperparameter optimization strategies, both with and without feature selection. A learning curve for each model plots training and validation accuracy at various training sizes to reveal overfitting or underfitting. A confusion matrix and an ROC curve with AUC values are generated to assess model performance, and metrics such as accuracy, precision, recall, and F1-score are calculated and displayed for each model. Comparative analysis is provided via bar charts.
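
A condensed sketch of this evaluation loop for two of the six models; the synthetic data, split ratio, and printed summaries stand in for the notebook's fuller plots and tables:

```python
# Sketch: fit each model, report accuracy/precision/recall/F1, the confusion
# matrix, ROC-AUC, and learning-curve scores. Data and settings are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import learning_curve, train_test_split

X, y = make_classification(n_samples=1000, weights=[0.88], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=0)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(type(model).__name__)
    print(classification_report(y_test, pred))  # accuracy, precision, recall, F1
    print(confusion_matrix(y_test, pred))
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    sizes, train_sc, val_sc = learning_curve(model, X_train, y_train, cv=5)
    print("learning curve (train vs. validation):", train_sc.mean(axis=1), val_sc.mean(axis=1))
```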

  7. Model Comparison: The notebook presents a comparison of all evaluated models, showcasing their performance metrics (accuracy, precision, recall, F1-score) across different scenarios (with/without feature selection and hyperparameter tuning) to determine the best-performing model for the dataset and task. ROC curve plots compare the models.
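
A small sketch of overlaying ROC curves for several fitted models on one axis, in the spirit of those comparison plots; the models and data are placeholders:

```python
# Sketch: one shared axis, one ROC curve per fitted model (AUC in the legend).
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.88], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=0)

fig, ax = plt.subplots()
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    RocCurveDisplay.from_estimator(model, X_test, y_test, ax=ax)
plt.show()
```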

Features

  • Bank Marketing Campaign Classification: Predicts customer subscription based on various features.
  • Data Preprocessing: Handles missing values using KNN, K-means, median, mode, and k-modes imputation; evaluates different imputation methods based on RandomForestClassifier accuracy. Drops duplicates.
  • Feature Engineering: Discretizes 'Age' and 'Campaign_Contacts' features into meaningful categories, and normalizes 'Campaign_Contacts'.
  • Feature Selection: Employs Chi-squared test and Random Forest feature importance for selecting relevant features. Combines results from both methods to choose a final feature subset.
  • Class Imbalance Handling: Addresses class imbalance using undersampling of the majority class and SMOTENC for oversampling the minority class. Also explores class weights as an alternative.
  • Model Training and Evaluation: Trains and evaluates multiple classification models (KNN, Decision Tree, Logistic Regression, Random Forest, LightGBM, XGBoost) with and without feature selection and hyperparameter tuning, reporting accuracy, precision, recall, and F1-score. Uses learning curves to assess model complexity and overfitting. ROC curves compare model performance.
  • Hyperparameter Tuning: Uses RandomizedSearchCV for hyperparameter optimization of all models to maximize the F1 score (see the sketch after this list).
  • Comprehensive Visualization: Uses various plots (histograms, pie charts, bar charts, heatmaps, confusion matrices, ROC curves, learning curves) to visualize data, results, and model performance.
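
As referenced in the Hyperparameter Tuning bullet above, a hedged sketch of an F1-oriented RandomizedSearchCV run; the parameter distributions, iteration count, and fold count are illustrative assumptions rather than the notebook's settings:

```python
# Sketch: randomized search over a Random Forest, scored on F1 as in the
# project. The distributions and n_iter are placeholders.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, weights=[0.88], random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(100, 500),
        "max_depth": randint(3, 20),
        "min_samples_split": randint(2, 10),
    },
    n_iter=20,
    scoring="f1",   # optimize for F1, mirroring the project's tuning goal
    cv=5,
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```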
