Shan Dou
MLND capstone project
July 2018
Link to proposal review: https://review.udacity.com/#!/reviews/1314525
Kaggle competition "TalkingData AdTracking Fraud Detection Challenge".
conda env create -f environment.yml
This operation will create a conda environment named mlnd_clean. If you wish to use a different name, please open the requirement file environment.yml and change the first line, name: mlnd_clean, to your preferred name.
Once all the dependencies are installed, please run the following command in your shell terminal to activate the environment:
source activate mlnd_clean
To deactivate, type:
source deactivate
- The following modules are installed with conda:
1. numpy
2. pandas
3. seaborn
4. sklearn
5. xgboost
6. lightgbm
7. imblearn
8. notebook
- The module for stack ensembling is installed with pip:
9. mlens
For more information about mlens, please visit its website.
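After activating the environment, a quick check that the listed modules can be found may save some debugging later. The following is a minimal sketch, not part of this repo: the helper name find_missing and the REQUIRED_MODULES list are hypothetical and simply mirror the module names listed above.

```python
import importlib.util

# Import names taken from the module list above (assumed to match the
# names used in `import` statements, e.g. `import sklearn`).
REQUIRED_MODULES = [
    "numpy", "pandas", "seaborn", "sklearn",
    "xgboost", "lightgbm", "imblearn", "notebook", "mlens",
]


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

Running this inside the activated mlnd_clean environment should report no missing modules if the installation succeeded.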
- Jupyter notebooks:
  - MLNDcapstone_shandou_main.ipynb: Main workbook
  - MLNDcapstone_shandou_robustness.ipynb: Companion workbook for the models' robustness tests
- Python modules in ./customlib/:
  - ./customlib/preprocessing.py: data processing
  - ./customlib/modeling.py: modeling
  - ./customlib/utils.py: miscellaneous tasks such as visualization and generating result summary tables
- Dataset: The raw data train.csv can be downloaded directly from Kaggle. Because of file size concerns, only downsized training data and a short excerpt of the original testing data are included in this repo.
  - train_sample.csv: 0.1% of the raw click records
  - train_sample_2.csv: 0.2% of the raw click records
  - test.csv: First 10 lines of the original test data downloaded from Kaggle. NOTE that test.csv provided by Kaggle is only used for checking data fields. In the actual implementation, testing data is instead a portion of train_sample.csv or train_sample_2.csv.
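To illustrate what "a portion of train_sample.csv" held out as testing data can look like, here is a dependency-free sketch of a random hold-out split. It is only an illustration under stated assumptions: the function names holdout_split and read_rows, the 20% test fraction, the random (rather than chronological) split, and the made-up demo rows are all hypothetical. The split actually used in this project is described in the notebooks and report.

```python
import csv
import io
import random


def read_rows(file_obj):
    """Read CSV records from an open file object into a list of dicts."""
    return list(csv.DictReader(file_obj))


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle `rows` and split them into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Tiny in-memory stand-in for train_sample.csv (values are made up,
# and only a few illustrative columns are shown).
demo_csv = io.StringIO(
    "ip,app,is_attributed\n"
    "1,3,0\n2,3,0\n3,9,1\n4,12,0\n5,3,0\n"
)
rows = read_rows(demo_csv)
train_rows, test_rows = holdout_split(rows, test_fraction=0.2, seed=42)
print(len(train_rows), len(test_rows))
```

With the real file, the rows could be read with `open("train_sample.csv", newline="")` in place of the in-memory demo. Because fraud labels are typically highly imbalanced, a stratified split may be a better fit than the plain random split sketched here.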
- Proposals and reports:
  - proposal.pdf: Proposal of the capstone project
  - proposal_review.pdf: Comments from proposal review
  - report.pdf: Report of the capstone project
- Others:
  - folder images/: contains all the images used in the report
  - matplotlib style sheet stylelib/custom.mplstyle: dataviz styler used throughout this project
  - folder