Authors: Aidan Jackson | Andi Morey Peterson | Naga Chandrasekaran | Scott Gatzemeier |
U.C. Berkeley, Masters in Information & Data Science program - datascience@berkeley
Spring 2021, W207 - Machine Learning - Tue. 6:30pm PDT
This repo contains iterative solutions (including a final solution) for the Kaggle Forest Cover Type Prediction challenge, developed by Aidan Jackson, Andi Morey, Naga Chandrasekaran, and Scott Gatzemeier. The goal of this project is to classify trees in four different wilderness areas of the Roosevelt National Forest in Northern Colorado. These areas represent forests with minimal human-caused disturbances, so existing forest cover types are more a result of ecological processes than of forest management practices. A successful model will allow the US Forest Service (USFS) to predict the predominant cover type, and therefore which trees to plant, in reforestation efforts across the 800,000 acres of the Roosevelt National Forest.
Our solution leverages a variety of modeling techniques. Base models were developed using K-Nearest Neighbors, Naive Bayes, Logistic Regression, Decision Trees, and Neural Networks. Each model was iteratively improved independently through data cleansing/formatting, feature engineering, and hyperparameter tuning. These models were then combined into an ensemble model for our final results.
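The combination step above can be sketched with scikit-learn's `VotingClassifier`. This is a minimal illustration, not the repo's actual pipeline: the hyperparameters are placeholders rather than our tuned values, and a small synthetic dataset stands in for the forest cover data.

```python
# Sketch: combining independently built base models into a hard-voting ensemble.
# Hyperparameters and data here are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 7 classes, mirroring the seven cover types.
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=7, random_state=0)

# Each base model votes; the majority class wins ("hard" voting).
ensemble = VotingClassifier(estimators=[
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("nb", GaussianNB()),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=10)),
], voting="hard")

ensemble.fit(X, y)
train_acc = ensemble.score(X, y)
```

In the real pipeline each base model was tuned separately before being handed to the ensemble, so the estimators passed in already carried their best-found hyperparameters.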
Folder | File | Description |
---|---|---|
.. | 207-final-notebook.ipynb | Jupyter Notebook containing the final write up of our report. |
Models | Individual Model Notebooks | Folder containing principal component analysis, individual model testing, and the ensemble model. Each subfolder includes the respective notebooks and results. Models tested include: Naive Bayes, Logistic Regression, Neural Network, Decision Trees, K-Nearest Neighbor, Gaussian Mixture Models, and finally the bagging ensemble. |
EDA | Individual EDA Notebooks | Exploratory Data Analysis notebooks to help with model hyperparameter tuning and feature engineering. |
presentations | Midterm_Pres_Forest_Cover_Type_Prediction | Midterm presentation of EDA and initial models. |
presentations | Final Presentation - Cover Type Prediction | Final presentation with the ensemble model. |
data | covtype.csv | Raw dataset containing both test and training data. Records: 581,012; Features: 55. |
data | test.csv | Test dataset. Records: 565,892; Features: 55. |
data | train.csv | Dataset used to train the models. Records: 15,120; Features: 56. |
Tuned & Feature-Engineered Model Results
Model | Kaggle Accuracy, Before (%) | Kaggle Accuracy, After (%) |
---|---|---|
K-Nearest Neighbor | 63 | 71 |
Naive Bayes | 42 | 42 |
Logistic Regression | 40 | 59 |
Decision Tree | 66 | 77 |
Neural Network | 35 | 72 |
Tie Breaker | - | 72 |
The final ensemble achieves an accuracy of almost 80%. With this accuracy, the final leaderboard position would have been 197 out of 1693 had this team entered the competition, breaking into the top 15%.
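The "Tie Breaker" row in the table refers to resolving cases where the base models disagree with no majority winner. A minimal sketch of that idea, assuming predictions are collected as equal-length lists per model (the function name and inputs are illustrative, not the repo's actual code):

```python
# Sketch: majority voting with a designated tie-breaker model.
from collections import Counter

def majority_vote(model_preds, tie_breaker_preds):
    """Pick the most common class per sample; on a tie, defer to the tie breaker."""
    final = []
    for i in range(len(tie_breaker_preds)):
        votes = Counter(preds[i] for preds in model_preds)
        ranked = votes.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            final.append(tie_breaker_preds[i])  # no majority: use tie breaker
        else:
            final.append(ranked[0][0])
    return final

# Example: three base models vote on four samples; class 9 marks the tie breaker.
preds = majority_vote(
    [[1, 2, 3, 1], [1, 2, 4, 2], [1, 5, 3, 3]],
    tie_breaker_preds=[9, 9, 9, 9],
)
# preds -> [1, 2, 3, 9]: only the last sample is a three-way tie.
```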
The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. Each observation is a 30m x 30m patch. We were asked to predict an integer classification for the forest cover type. The seven types are:
- Spruce/Fir
- Lodgepole Pine
- Ponderosa Pine
- Cottonwood/Willow
- Aspen
- Douglas-fir
- Krummholz
The training set (15,120 observations) contains both the features and the Cover_Type label. The test set contains only the features. The challenge was to predict the Cover_Type for every row in the test set (565,892 observations).
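The end-to-end prediction workflow can be sketched as follows. This assumes the standard Kaggle file layout (`train.csv` with an `Id` column and `Cover_Type` label, `test.csv` with `Id` only); the tiny inline frames stand in for the real CSVs, and a single untuned decision tree stands in for the full ensemble.

```python
# Sketch: fit on train.csv-style data, predict Cover_Type for test.csv-style rows.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def build_submission(train, test):
    X = train.drop(columns=["Id", "Cover_Type"])
    y = train["Cover_Type"]
    model = DecisionTreeClassifier(random_state=0).fit(X, y)
    preds = model.predict(test.drop(columns=["Id"]))
    # Kaggle expects two columns: Id and the predicted Cover_Type.
    return pd.DataFrame({"Id": test["Id"], "Cover_Type": preds})

# Tiny synthetic stand-ins for train.csv / test.csv (values illustrative).
train = pd.DataFrame({"Id": [1, 2, 3, 4],
                      "Elevation": [2596, 2590, 2804, 2785],
                      "Cover_Type": [5, 5, 2, 2]})
test = pd.DataFrame({"Id": [10, 11], "Elevation": [2600, 2800]})

submission = build_submission(train, test)
# submission.to_csv("submission.csv", index=False) would produce the upload file.
```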