- Implement a ‘Transaction Fraud Prevention System’ that leverages machine learning models to predict whether a given financial transaction is ‘Fraudulent’ or ‘Valid’.
Name | Type | Description
---|---|---
comp9417_final.py | Python script | Final script submitted for assessment.
comp9417-unsw.ipynb | Jupyter notebook | Initial exploratory data analysis + data preprocessing + Decision Tree model.
corr_pairs_sorted.csv | CSV | Correlation matrix of the transaction dataset features, sorted, used to select the optimal number of features for learning and processing optimization.
corr_pairs.csv | CSV | Correlation matrix of the transaction dataset features used to select the optimal number of features for learning and processing optimization.
EDA.ipynb | Jupyter notebook | All the exploratory data analysis of the transaction and identity train and test datasets.
mohitkhanna-comp9417.ipynb | Jupyter notebook | Data preprocessing + exploratory data analysis + Light Gradient Boosting Machine (LGBM), Random Forest, Decision Tree, Bernoulli Naive Bayes, and Extreme Gradient Boosting (XGBoost) models.
submission.csv | CSV | One of the submission files generated for the Kaggle competition.
- The dataset for the model was taken from the Kaggle competition https://www.kaggle.com/c/ieee-fraud-detection and was provided through a collaboration between IEEE and Vesta Corporation.
- TransactionDT: timedelta from a given reference datetime (not an actual timestamp).
- TransactionAMT: transaction payment amount in USD.
- (*) ProductCD: product code -> the product for each transaction. (categorical feature)
- (*) [card1, card2, card3, card4, card5, card6]: payment card information, such as card type, card category, issuing bank, country, etc. (categorical feature)
- (*) addr1: address. (categorical feature)
- (*) addr2: address. (categorical feature)
- dist: distance.
- (*) P_emaildomain: Purchaser email domain. (categorical feature)
- (*) R_emaildomain: Recipient email domain. (categorical feature)
- [C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14]: the actual meaning is masked, but they can be described as counts, such as how many addresses are associated with the payment card.
- [D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,D11,D12,D13,D14,D15]: timedeltas in days, such as days between previous transactions.
- (*) [M1,M2,M3,M4,M5,M6,M7,M8,M9]: match indicators, such as names on card and address, etc. (categorical feature)
- Vxxx: Vesta engineered rich features such as ranking, counting and other entity relations.
The field names are masked for privacy protection and contract agreement as part of Vesta's policies.
Most fields are related to identity information, such as network connection information:
- DeviceType.
- DeviceInfo.
- id_12 - id_38.
Note: Credit to Vesta (Competition Host) for providing the above data description and details. Link: https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203
After addressing the class imbalance and leveraging feature selection and exploratory data analysis, we tested the following models on the given data:
- Decision Tree: This was our baseline model.
- Bernoulli Naive Bayes.
- K-Nearest Neighbour.
- SVM: We could not obtain a conclusive result with the SVM.
- Random Forest.
- Light Gradient Boost.
- Integrated Stacked Model.
The final model is a LightGBM model with hyperparameter tuning, achieving a Kaggle score of 0.92.
For the exploratory data analysis, please refer to Final_Report_COMP9417_Project.pdf and the EDA.ipynb file in this repository.
The above image shows the original distribution of fraudulent and valid transactions.
The approaches considered to solve the class imbalance were minority oversampling and majority undersampling.
The majority undersampling approach was rejected since it carries the risk of losing important information.
We used the Synthetic Minority Over-sampling Technique (SMOTE); the details are in the report and the notebook, and a minimal sketch follows.
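The sketch below shows how the SMOTE rebalancing step might look, assuming the imbalanced-learn package; the generated data is illustrative and stands in for the real transaction features.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Illustrative imbalanced data standing in for the transaction features.
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.97, 0.03], random_state=0)

smote = SMOTE(random_state=0)
X_resampled, y_resampled = smote.fit_resample(X, y)

# The minority (fraud) class is synthetically oversampled to parity.
print(Counter(y), Counter(y_resampled))
```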
Since the dataset has over 400 features, we used the correlation matrix and the graphs created during exploratory data analysis to arrive at the most relevant features of the dataset. As part of this process, we also used sklearn's RFECV for recursive feature elimination to obtain the optimal feature subset (see the sketch after the table below).
Feature Selection | Parameters |
---|---|
RFECV | BernoulliNB(), step = 15, scoring = 'roc_auc', cv = 5, verbose = 1, n_jobs = 3 |
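A minimal sketch of the RFECV call with the parameters above; the data is illustrative, and the `importance_getter` argument is an assumption needed on newer scikit-learn releases, where naive Bayes estimators no longer expose `coef_`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.naive_bayes import BernoulliNB

# Illustrative data standing in for the 400+ transaction features.
X, y = make_classification(n_samples=5_000, n_features=60, random_state=0)

selector = RFECV(
    estimator=BernoulliNB(),
    step=15,                  # features dropped per elimination round
    cv=5,
    scoring='roc_auc',
    verbose=1,
    n_jobs=3,
    # Assumption: newer scikit-learn removed coef_ from naive Bayes, so
    # feature importances are read from feature_log_prob_ instead.
    importance_getter='feature_log_prob_',
)
selector.fit(X, y)

print(selector.n_features_)        # optimal number of features found
X_reduced = selector.transform(X)  # keep only the selected features
```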
Model/Scenario | Parameters | Kaggle Score |
---|---|---|
Label encoding; features/columns with 50% or more null values removed; balanced class variable | random_state = 0, criterion = 'entropy', max_depth = 8, splitter = 'best', min_samples_split = 30 | 0.69
One-hot encoding; features/columns with 90% or more null values removed; balanced class variable | random_state = 0, criterion = 'entropy', max_depth = 8, splitter = 'best', min_samples_split = 30 | 0.70
Label encoding; features/columns with 90% or more null values removed; imbalanced class variable | random_state = 0, criterion = 'entropy', max_depth = 8, splitter = 'best', min_samples_split = 30 | 0.72
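All three scenarios share the same sklearn `DecisionTreeClassifier` configuration; a minimal sketch, with illustrative data standing in for the preprocessed features:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative data standing in for the preprocessed transaction features.
X, y = make_classification(n_samples=5_000, weights=[0.97, 0.03], random_state=0)

clf = DecisionTreeClassifier(
    random_state=0,
    criterion='entropy',
    max_depth=8,
    splitter='best',
    min_samples_split=30,
)
clf.fit(X, y)
fraud_probs = clf.predict_proba(X)[:, 1]  # probability of the fraud class
```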
Model/Scenario | Kaggle Score |
---|---|
Imbalanced class variable | 0.50
Balanced class variable, no parameter tuning | 0.63
Parameter | Value(s) |
---|---|
alpha | [0.001,0.01,0.1,1] |
fit_prior | [True]
Scenario | Kaggle Score
---|---
Grid Search and Feature Selection | 0.75
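A minimal sketch of the grid search over the parameter grid above, assuming sklearn's `GridSearchCV`; the ROC-AUC scoring and 5-fold CV are assumptions, and the data is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Illustrative data standing in for the preprocessed transaction features.
X, y = make_classification(n_samples=5_000, weights=[0.97, 0.03], random_state=0)

param_grid = {'alpha': [0.001, 0.01, 0.1, 1], 'fit_prior': [True]}
search = GridSearchCV(BernoulliNB(), param_grid, scoring='roc_auc', cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```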
Hyperparameters | Kaggle Score |
---|---|
n_neighbors = 3, metric = 'minkowski' with p = 2 | 0.50
Hyperparameters | Kaggle Score |
---|---|
n_neighbors = 5, metric = 'minkowski' with p = 2 | 0.50
n_neighbors = 7, metric = 'minkowski' with p = 2 | 0.50
Hyperparameters | Value(s) |
---|---|
score_func | f_classif
k | 20
Hyperparameters | Kaggle Score |
---|---|
n_neighbors = 5, metric = 'minkowski' with p = 2 | 0.67
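A minimal sketch of the KNN run after feature selection, assuming `SelectKBest` and the classifier were chained in a pipeline (the pipeline itself is an assumption); the data is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Illustrative data standing in for the preprocessed transaction features.
X, y = make_classification(n_samples=5_000, n_features=60, random_state=0)

# SelectKBest keeps the 20 highest-scoring features before the KNN fit.
pipe = make_pipeline(
    SelectKBest(score_func=f_classif, k=20),
    KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2),
)
pipe.fit(X, y)
```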
We could not obtain a conclusive result for the SVM.
Hyperparameters | Kaggle Score |
---|---|
n_estimators = 100 | 0.85
n_estimators = 500, random_state = 10, max_depth = 20 | 0.82
n_estimators = 1000, random_state = 200, bootstrap = False, max_depth = 5 | 0.86
n_estimators = 1000, random_state = 121, min_samples_split = 2, bootstrap = False, max_depth = 5 | 0.88
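A minimal sketch of the best-scoring random forest configuration (the last row above); the data is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data standing in for the preprocessed transaction features.
X, y = make_classification(n_samples=5_000, weights=[0.97, 0.03], random_state=0)

rf = RandomForestClassifier(
    n_estimators=1000,
    random_state=121,
    min_samples_split=2,
    bootstrap=False,   # each tree is trained on the full training set
    max_depth=5,
)
rf.fit(X, y)
```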
Hyperparameters | Kaggle Score |
---|---|
objective = 'binary', n_estimators = 300, learning_rate = 0.1, subsample = 0.8 | 0.84
objective = 'binary', n_estimators = 200, learning_rate = 0.1 | 0.83
objective = 'binary', n_estimators = 500, learning_rate = 0.1 | 0.87
objective = 'binary', n_estimators = 500, learning_rate = 0.1, num_leaves = 50, max_depth = 7, subsample = 0.9, colsample_bytree = 0.9 | 0.89
objective = 'binary', n_estimators = 600, learning_rate = 0.1, num_leaves = 50, max_depth = 7, subsample = 0.9, colsample_bytree = 0.9 | 0.90
objective = 'binary', n_estimators = 700, learning_rate = 0.1, num_leaves = 50, max_depth = 7, subsample = 0.9, colsample_bytree = 0.9, random_state = 108 | 0.92
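A minimal sketch of the final LightGBM configuration (the last row above), assuming the lightgbm package's scikit-learn API; the data is illustrative.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

# Illustrative data standing in for the preprocessed transaction features.
X, y = make_classification(n_samples=5_000, weights=[0.97, 0.03], random_state=0)

lgbm = LGBMClassifier(
    objective='binary',
    n_estimators=700,
    learning_rate=0.1,
    num_leaves=50,
    max_depth=7,
    subsample=0.9,
    colsample_bytree=0.9,
    random_state=108,
)
lgbm.fit(X, y)
fraud_probs = lgbm.predict_proba(X)[:, 1]  # probabilities for the submission file
```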
Base Models | Kaggle Score |
---|---|
Decision Tree + K-Nearest Neighbour + Light Gradient Boosting Machine + Random Forest + Bernoulli Naive Bayes | 0.78
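A minimal sketch of how the integrated stacked model could be assembled, assuming sklearn's `StackingClassifier` with a logistic-regression meta-learner; the actual stacking mechanism and base-model parameters used in the project may differ.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative data standing in for the preprocessed transaction features.
X, y = make_classification(n_samples=5_000, weights=[0.97, 0.03], random_state=0)

stack = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier(random_state=0)),
        ('knn', KNeighborsClassifier(n_neighbors=5)),
        ('lgbm', LGBMClassifier(objective='binary')),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
        ('bnb', BernoulliNB()),
    ],
    final_estimator=LogisticRegression(),  # meta-learner: an assumption
    cv=5,
)
stack.fit(X, y)
```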
Model | Parameters | Kaggle Score |
---|---|---|
Decision Tree | random_state = 0, criterion = 'entropy', max_depth = 30, splitter = 'best', min_samples_split = 30 | 0.70 |
Naive Bayes | alpha = 0.01, fit_prior = True | 0.75
K-Nearest Neighbour | n_neighbors = 5, metric = 'minkowski', p = 2 | 0.67
Random Forest | n_estimators = 1000, random_state = 121, min_samples_split = 2, bootstrap = False, max_depth = 5 | 0.87 |
Light Gradient Boosting Machine | objective = binary, n_estimators = 700, learning_rate = 0.1, num_leaves = 50, max_depth = 7, subsample = 0.9, colsample_bytree = 0.9, random_state = 108 | 0.92 |
Integrated Stacked Model | Decision Tree + Naive Bayes + K-Nearest Neighbour + Random Forest + Light Gradient Boosting Machine | 0.77 |
- Light Gradient Boosting Machine was chosen as the final model, with a final prediction score of 0.92.
- Usama Sadiq. (Github Profile: https://github.com/usama-sadiq)
- Mohit Khanna. (Github Profile: https://github.com/mohitKhanna1411)
- Uttkarsh Sharma. (Github Profile: https://github.com/khaamosh)
- Sibo Zhang. (Github Profile: https://github.com/sibozhang400)