We will be using machine learning to create a model that predicts which passengers survived the Titanic shipwreck. Our team selected this topic because we wanted to obtain a deeper understanding of the tragedy and how different passenger attributes impacted their odds of survival.
Our data comes from the dataset contained in the "stablelearner" R package and is stored as CSV files.
The project is broken into component pieces below.
- Database Storage: We used a local PostgreSQL instance, managed through pgAdmin, to store a backup file of the database that project members can download and restore on their local machines. Users are encouraged to use the create_database script in Jupyter Notebook to this end (see steps below).
- Machine Learning: We used a decision tree model
- Statistical Analyses: RStudio
- Interactive Dashboard: Tableau
- Website: Flask with API routes that render across Python, HTML and JavaScript (see usage details below)
- Presentation: PowerPoint
We will run statistical analysis to see how different groups fared based on factors such as age, gender, socio-economic status, etc. We are hoping to add a section of our dashboard that allows users to input their own information and generate their probability of survival.
- Create a database named `titanic_project` and ensure it is selected with an active connection.
- Add a `config.py` file to your Notebooks folder in the group repo. (The file is otherwise hidden in .gitignore.) It should read: `db_password = '[insert your password here]'`
- Open a command line terminal in this same Notebooks folder and run `jupyter notebook`
- Open the `create_database.ipynb` file and execute the four cells.
- Refresh your database. You should see both new tables, `passenger_registry` and `embarked`.
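As a rough illustration of how the password in `config.py` might be used, the sketch below builds a PostgreSQL connection URL for the `titanic_project` database. The helper name `build_db_url`, the user `postgres`, the host `localhost`, and the port `5432` are assumptions (common PostgreSQL defaults), not values taken from this repo, and the password shown is a placeholder.

```python
from urllib.parse import quote


def build_db_url(password, user="postgres", host="localhost",
                 port=5432, database="titanic_project"):
    """Return a PostgreSQL connection URL (hypothetical helper).

    The user, host, and port defaults are assumptions; adjust them
    to match your local installation.
    """
    # Percent-encode the credentials so characters such as "@" or "/"
    # cannot be mistaken for URL delimiters.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}:{port}/{database}"


# In a notebook the password would normally come from config.py:
#     from config import db_password
db_password = "p@ss/word"  # placeholder value for illustration only
url = build_db_url(db_password)
print(url)
```

A URL in this form can be passed to a database library of your choice. Keeping the password in `config.py` rather than in the notebook keeps it out of version control.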
- First, we cleaned the data and ensured it was ready for entry into the model
- Next, we set up the machine learning model
- Then, we fit the model using the decision tree method
- After that, we ran the model and arrived at the following conclusions:
  - Precision (true positives divided by the sum of true and false positives): 79%
  - Recall (true positives divided by the sum of true positives and false negatives): 90%
  - Recall is higher than precision, so the model produces fewer false negatives than false positives. It misses about 10% of the passengers who actually survived, while about 21% of its survival predictions are for passengers who did not survive. In other words, its most common error is predicting survival for someone who did not survive.
- Finally, we saved the scaler to a file using Python's pickle module
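To make the precision and recall definitions above concrete, here is a small sketch that computes both from confusion-matrix counts. The counts are hypothetical, chosen only so the results land near the reported scores; they are not the project's actual confusion matrix.

```python
def precision(tp, fp):
    """True positives divided by all positive predictions."""
    return tp / (tp + fp)


def recall(tp, fn):
    """True positives divided by all actual positives."""
    return tp / (tp + fn)


# Hypothetical counts for illustration only (not the project's results).
tp = 90  # predicted "survived", actually survived
fp = 24  # predicted "survived", actually did not survive
fn = 10  # predicted "did not survive", actually survived

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.2f} recall={r:.2f}")  # prints "precision=0.79 recall=0.90"
```

With these counts there are more false positives (24) than false negatives (10), which is what a recall score above the precision score always implies.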
- Ensure that your development environment is active with `conda activate [development-environment-name]`
- If you haven't already, install Flask with the following command: `pip install flask`
- For additional dependencies, see the requirements.txt file.
- Navigate to the /webapp folder of the repo. Run the following command: `flask run` or `python wsgi.py`. The app should start on localhost (likely http://127.0.0.1:5000/). Copy this address into your browser and enjoy!
- When you finish using the app, press `Ctrl + C` in the terminal to stop the local server.