Udacity Nanodegree
In this project, I learned how to do data modeling in non-relational databases with Apache Cassandra and used Python to build an ETL pipeline.
I applied what I learned in the data modeling module to build a pipeline that merges a set of CSV files within a directory into a single streamlined CSV file, then models the data and inserts it into Apache Cassandra tables. I had to create separate denormalized tables to answer specific queries, properly using partition keys and clustering columns, following NoSQL database concepts.
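As a minimal sketch of the loading step (the keyspace, table name, and column positions below are illustrative assumptions, not the notebook's exact code), each row of the streamlined CSV can be bound to an `INSERT` statement with the Python `cassandra-driver`:

```python
import csv

from cassandra.cluster import Cluster

# Connect to a local Cassandra instance; keyspace and table names are
# assumptions made for this sketch.
session = Cluster(['127.0.0.1']).connect('sparkify')

insert_cql = """
    INSERT INTO songs_by_session (session_id, item_in_session, artist, song, length)
    VALUES (%s, %s, %s, %s, %s)
"""

# Stream the merged CSV into the denormalized table row by row.
with open('event_datafile_new.csv', encoding='utf8') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for line in reader:
        # Column positions assume the layout written in Part I of the notebook.
        session.execute(insert_cql, (int(line[8]), int(line[3]),
                                     line[0], line[9], float(line[5])))
```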
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, there is no easy way to query the data to generate results, since the data reside in a directory of CSV files logging user activity on the app.
They'd like a data engineer to create an Apache Cassandra database that can answer queries on song play data, and wish to bring you onto the project. Your role is to create a database for this analysis. You'll be able to test your database by running the queries given to you by the analytics team at Sparkify.
- Python
- Apache Cassandra
- Jupyter Notebook
This event dataset is a collection of CSV files containing user activity logs collected over a period of time. Each file in the dataset holds information about the songs played, the user, and other session attributes.
Columns:
artist, auth, firstName, gender, itemInSession, lastName, length, level, location, method, page, registration, sessionId, song, status, ts, userId
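For context on how the merged file might be produced (a hedged sketch; the notebook's Part I may differ in column selection and filtering), the per-file event CSVs can be combined into `event_datafile_new.csv` like this:

```python
import csv
import glob

# Gather every CSV under event_data (paths are relative to the project root).
file_paths = glob.glob('event_data/*.csv')

# Keep only the columns needed for the tables; this selection is an
# assumption about what Part I of the notebook retains.
columns = ['artist', 'firstName', 'gender', 'itemInSession', 'lastName',
           'length', 'level', 'location', 'sessionId', 'song', 'userId']

with open('event_datafile_new.csv', 'w', newline='', encoding='utf8') as out:
    writer = csv.writer(out)
    writer.writerow(columns)
    for path in file_paths:
        with open(path, encoding='utf8') as f:
            for row in csv.DictReader(f):
                if row['artist']:  # drop rows with no song-play information
                    writer.writerow([row[c] for c in columns])
```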
In this case, since I was working with a NoSQL database, each table is modeled to answer a specific, known query. This query-first model enables efficient reads from databases containing huge amounts of data; relational databases are not suitable in this scenario due to the magnitude of the data.
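As a minimal sketch of this query-first design (table and column names are illustrative, not taken from the notebook), a table answering "which artist and song were played for a given session and item in session" could be defined with `session_id` as the partition key and `item_in_session` as the clustering column:

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra instance; the keyspace name is assumed here.
session = Cluster(['127.0.0.1']).connect('sparkify')

# Query-first design: the partition key (session_id) keeps all rows for a
# session in one partition, and the clustering column (item_in_session)
# orders rows within it, so the target query reads a single partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS songs_by_session (
        session_id      int,
        item_in_session int,
        artist          text,
        song            text,
        length          float,
        PRIMARY KEY ((session_id), item_in_session)
    )
""")
```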
You can see an Entity Relationship Diagram (ERD) of the resulting data model below:
Files:
File / Folder | Description
---|---
`event_data` | Folder at the root of the project, where all the CSV data resides
`images` | Folder at the root of the project, where images are stored
`Project 2.ipynb` | Jupyter Notebook containing the ETL pipeline: data extraction, modeling, and loading into the tables
`README` | Readme file
`event_datafile_new.csv` | CSV containing the whole dataset after merging all the CSV files in `event_data`
Clone the repository to your local machine:

```
git clone https://github.com/BinariesGoalls/Udacity-Data-Engineering-Nanodegree
```
These tools are necessary to run the project:

- Python
- Apache Cassandra
- The `cassandra-driver` Python library (installable with `pip install cassandra-driver`)
Follow these steps to extract the data and load it into the data model:

- Navigate to the `Project 2 Data Modeling with Apache Cassandra` folder
- Run the `Project 2.ipynb` Jupyter Notebook
- Run Part I to create `event_datafile_new.csv`
- Run Part II to execute the ETL process and load the data into the tables
- Check whether the data has been loaded into the database by executing the `SELECT` queries (see the sketch below)
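As a sketch of that verification step (reusing the illustrative `songs_by_session` table from above, with arbitrary example values), a `SELECT` restricted to the partition key and clustering column looks like this:

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra instance; the keyspace name is assumed here.
session = Cluster(['127.0.0.1']).connect('sparkify')

# Fetch the song played at a given session and item-in-session position;
# the WHERE clause uses only the partition key and clustering column.
rows = session.execute(
    "SELECT artist, song, length FROM songs_by_session "
    "WHERE session_id = %s AND item_in_session = %s",
    (338, 4),
)
for row in rows:
    print(row.artist, row.song, row.length)
```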
Alisson Lima - ali2slima10@gmail.com