A collection of projects completed as part of Udacity's Data Analyst Nanodegree.
- Jupyter Data Analysis
- R Exploratory Data Analysis
- Tableau Visualization
- SQL Data Wrangling
- Inferential Statistics
- Scikit-Learn Machine Learning
Baseball Statistics
This project used Sean Lahman's Major League Baseball data set to investigate whether the overall skill level of professional baseball players had improved. The inquiry covered the 1955 through 2017 seasons and emphasized batter ability (measured with On-Base plus Slugging) and pitcher ability (measured with Fielding Independent Pitching).
No trend, positive or negative, was observed in player ability.
Environmental factors (like changing the strike zone) account for much more variability in statistics than player ability.
Uses Python, matplotlib, pandas, and numpy.
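For readers unfamiliar with the two metrics named above, the following is a minimal sketch of their standard definitions in plain Python. It is not the project's actual code: the function names, the sample numbers, and the `league_constant` default are assumptions made for illustration. In practice the FIP constant is recomputed for each season, and the Lahman pitching table stores innings as outs (`IPouts`), so that value would need to be divided by 3 first.

```python
def on_base_plus_slugging(h, doubles, triples, hr, bb, hbp, ab, sf):
    """Return OPS (on-base percentage plus slugging percentage) for one batting line."""
    # On-base percentage: times on base divided by qualifying plate appearances.
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    # Total bases: singles count 1, doubles 2, triples 3, home runs 4.
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    slg = total_bases / ab
    return obp + slg


def fielding_independent_pitching(hr, bb, hbp, so, ip, league_constant=3.10):
    """Return FIP. The default league_constant is an assumed placeholder."""
    return (13 * hr + 3 * (bb + hbp) - 2 * so) / ip + league_constant


# Hypothetical season lines, invented purely for illustration.
ops = on_base_plus_slugging(h=150, doubles=30, triples=5, hr=20,
                            bb=60, hbp=5, ab=500, sf=5)
fip = fielding_independent_pitching(hr=20, bb=50, hbp=5, so=200, ip=200.0)
print(f"OPS: {ops:.3f}  FIP: {fip:.2f}")
```

Both metrics depend only on outcomes the batter or pitcher largely controls, which is presumably why they suit a comparison across eras.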
U.S. College Statistics
This project investigated a few key variables from College Scorecard, a dataset created by the U.S. Department of Education to evaluate universities across the nation. An emphasis was placed on four-year universities with variables related to admissions, finances, and location.
There appears to be a noticeable trend between tuition and five-year completion rates.
There is also a distinct association between funding type (public, non-profit, for-profit) and completion rate.
Uses R and ggplot2.
U.S. College Statistics
This project focused specifically on for-profit universities. Unlike the R data exploration project (which used the same dataset), this project analyzed data across many years.
A Tableau Story which details some of the concerns surrounding for-profit universities.
Multiple interactive charts that can be filtered by year.
OpenStreetMap Southwest Idaho
This project attempted to clean and organize a set of geographical data for Southwest Idaho.
Conversions between XML, CSV, and SQL data.
SQL queries and simple regular expressions.
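As a rough illustration of how a regular expression and a SQL query can work together in this kind of cleanup, here is a small sketch using only Python's standard library. The abbreviation map, the street names, and the `streets` table are all hypothetical; the project's actual cleaning rules and schema are not described here.

```python
import re
import sqlite3

# Hypothetical abbreviation map; the real project's rules may differ.
STREET_TYPE_FIXES = {"St": "Street", "Ave": "Avenue", "Blvd": "Boulevard", "Rd": "Road"}

# Matches the last whitespace-delimited token of a street name.
LAST_TOKEN_RE = re.compile(r"\S+$")


def clean_street_name(name):
    """Expand an abbreviated street type at the end of a name, if it is in the map."""
    match = LAST_TOKEN_RE.search(name)
    if match and match.group() in STREET_TYPE_FIXES:
        return name[:match.start()] + STREET_TYPE_FIXES[match.group()]
    return name


# Invented sample values standing in for street names extracted from the XML.
raw_names = ["W Main St", "N Eagle Rd", "Fairview Ave", "Capitol Boulevard"]
cleaned = [clean_street_name(n) for n in raw_names]

# Load the cleaned values into an in-memory SQLite table and query them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streets (name TEXT)")
conn.executemany("INSERT INTO streets (name) VALUES (?)", [(n,) for n in cleaned])
rows = conn.execute(
    "SELECT name FROM streets WHERE name LIKE '%Road' ORDER BY name"
).fetchall()
conn.close()
print(cleaned)
print(rows)
```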
Provided Stroop Effect Data
This project made use of descriptive and inferential statistics to analyze the significance of the Stroop Effect for a given set of data.
Formal report of statistical significance written in LaTeX.
Histograms generated with RStudio.
Data analyzed with Google Spreadsheets.
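The project does not state here which test was applied. A paired (dependent-samples) t-test is a common choice for Stroop data, since each participant is timed under both conditions, so the sketch below assumes that approach. The timings are made up for illustration and are not the provided data set.

```python
import math
import statistics

# Made-up completion times in seconds, one pair per participant.
congruent = [12.0, 15.0, 11.0, 14.0, 13.0]
incongruent = [20.0, 22.0, 18.0, 25.0, 20.0]

# A paired test works on the per-participant differences.
differences = [i - c for c, i in zip(congruent, incongruent)]
n = len(differences)

mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)  # sample standard deviation (n - 1)

# t statistic: mean difference divided by its standard error.
t_stat = mean_diff / (sd_diff / math.sqrt(n))
degrees_of_freedom = n - 1

print(f"mean difference = {mean_diff:.2f} s, t({degrees_of_freedom}) = {t_stat:.2f}")
```

The resulting statistic would then be compared against the critical value of the t distribution for the chosen significance level and degrees of freedom.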
Enron Data
This project scanned a pool of Enron email data for patterns, then built a classifier to identify persons likely involved in illicit activities.
Multiple algorithms used with parameter tuning.
Charts illustrating the efficacy of particular features.
A writeup detailing the forms of assessment used (accuracy, precision, recall, F1).
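The four assessment measures listed above can be computed directly from the counts of true and false positives and negatives. The sketch below does so in plain Python with invented labels (1 marks a flagged person); it is an illustration of the definitions, not the project's evaluation code. scikit-learn offers equivalents in `sklearn.metrics` (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`).

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

When positives are rare, as is likely in a data set with few flagged individuals, accuracy alone can look strong even for a weak classifier, which is why precision, recall, and F1 are reported alongside it.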