Udacity Nanodegree
Olist is a Brazilian startup headquartered in Curitiba. Its business model is e-commerce: its main objective is to help merchants who sell a product find buyers across all of Brazilian e-commerce.
Olist has grown considerably in recent years, driven by the increase in new users and in transactions on its platform. Growing pains are inevitable for a company at this stage, especially when it comes to data.
The whole system has to extend its capacity, both in processing and in data storage; as a consequence, query times against the databases increase, slowing down analyses and delaying metric updates in dashboards.
The objective of this project was to create an ETL pipeline over the Olist datasets and build an analytical Data Warehouse on Amazon Redshift. A use case for this analytical database is answering questions such as "Has the amount of sales increased over time?", "Which product categories are the top sellers?", "What is the total turnover?", et cetera.
With this in mind, the project will follow these steps:
- Step 1: Scope the Project and Gather Data
- Step 2: Explore and Assess the Data
- Step 3: Define the Data Model
- Step 4: Run ETL to Model the Data
- Step 5: Complete Project Write Up
In this project, we will aggregate a total of eight datasets, each representing one of the tables from the Olist e-commerce OLTP database, with the objective of modeling this source data into “Fact” and “Dimension” tables for the analytics team.
The proposed solution is a Python pipeline that ingests data from Amazon S3 into a Redshift Data Warehouse, runs transformations on Redshift, and saves the results into the proposed Star Schema model.
Data Source
Brazilian E-Commerce Public Datasets by Olist
https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
Content
Olist is the largest department store in Brazilian marketplaces. It has a catalog with over 950,000 products, hundreds of thousands of orders and a network of over 9,000 partner retailers spread across all regions of Brazil.
All the files use the CSV format.
Model
The data sources are the company's e-commerce OLTP system tables; you can check the ERD of the system in the figure below.
Please refer to inspection_notebook.ipynb.
Since the purpose of this data warehouse is OLAP and BI app usage, we will model these datasets with a star schema. The proposed data model consists of the following tables:
Fact table:
- fact_orders (order_id, product_id, date_key, seller_id, payment_key, customer_id, freight_value, price, order_items_qtd, order_status)
Dimension tables:
- dim_products (product_id, product_category_name)
- dim_date (date_key, date, year, quarter, month, day, week, is_weekend)
- dim_sellers (seller_id, seller_zip_code_prefix, seller_city, seller_state, seller_geo_lat, seller_geo_lng)
- dim_payments (payment_key, order_id, payment_sequential, payment_type, payment_installments)
- dim_customers (customer_id, customer_zip_code_prefix, customer_city, customer_state, customer_geo_lat, customer_geo_lng)
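For reference, a minimal sketch of what the Redshift DDL for the fact table and one dimension table could look like, kept as Python SQL strings. The column types, key constraints, and DISTKEY/SORTKEY choices below are illustrative assumptions, not copied from the repository.

```python
# Illustrative DDL only: column types, key constraints and DISTKEY/SORTKEY
# choices are assumptions, not the project's actual definitions.

CREATE_DIM_DATE = """
CREATE TABLE IF NOT EXISTS dim_date (
    date_key    INTEGER   PRIMARY KEY,  -- e.g. 20180131, derived from the order timestamp
    date        DATE      NOT NULL,
    year        SMALLINT,
    quarter     SMALLINT,
    month       SMALLINT,
    day         SMALLINT,
    week        SMALLINT,
    is_weekend  BOOLEAN
);
"""

CREATE_FACT_ORDERS = """
CREATE TABLE IF NOT EXISTS fact_orders (
    order_id         VARCHAR(32) NOT NULL,
    product_id       VARCHAR(32) REFERENCES dim_products (product_id),
    date_key         INTEGER     REFERENCES dim_date (date_key),
    seller_id        VARCHAR(32) REFERENCES dim_sellers (seller_id),
    payment_key      BIGINT      REFERENCES dim_payments (payment_key),
    customer_id      VARCHAR(32) REFERENCES dim_customers (customer_id),
    freight_value    DECIMAL(10,2),
    price            DECIMAL(10,2),
    order_items_qtd  INTEGER,
    order_status     VARCHAR(20)
)
DISTKEY (order_id)
SORTKEY (date_key);
"""
```

Distributing on order_id and sorting on date_key is one reasonable choice for join and time-range query performance; the real keys should be chosen from the warehouse's actual query patterns.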
The pipeline consists of the following steps:
- Store the data into Amazon S3
- Stage the data from S3 to Redshift
- Perform the necessary transformations for storing it in the corresponding tables in the Star Schema
- Do quality checks on the data
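Staging from S3 to Redshift is typically done with Redshift's COPY command. A minimal sketch, assuming a hypothetical staging table name, bucket path, and IAM role ARN (none of these identifiers come from the repository):

```python
import psycopg2

# Hypothetical bucket path and IAM role ARN; in the project these values
# come from dwh.cfg rather than being hard-coded.
COPY_STAGING_ORDERS = """
COPY staging_orders
FROM 's3://my-olist-bucket/olist_orders_dataset.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
CSV
IGNOREHEADER 1
TIMEFORMAT 'auto';
"""

def stage_orders(conn):
    """Load the orders CSV from S3 into the Redshift staging table."""
    with conn.cursor() as cur:
        cur.execute(COPY_STAGING_ORDERS)
    conn.commit()
```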
The data processing and data model were created in Python. Please refer to create_tables.py and etl.py.
The quality checks mainly verify that the fact and dimension tables were loaded and contain the expected number of records.
Please refer to test_queries.ipynb
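As an illustration of such a check, the sketch below simply verifies that every target table contains at least one record. The table list mirrors the schema above, but the exact checks in the notebook may differ.

```python
# Illustrative quality check: fail loudly if any target table is empty after the load.
TABLES = ["fact_orders", "dim_products", "dim_date",
          "dim_sellers", "dim_payments", "dim_customers"]

def check_row_counts(conn):
    """Count the records in each fact and dimension table and flag empty tables."""
    with conn.cursor() as cur:
        for table in TABLES:
            cur.execute(f"SELECT COUNT(*) FROM {table};")
            count = cur.fetchone()[0]
            if count == 0:
                raise ValueError(f"Quality check failed: {table} has no records")
            print(f"{table}: {count} records")
```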
dim_date
- date_key: primary key generated from the timestamp of the order (see the derivation sketch after the table descriptions)
- date: date of the order
- year: year of the order
- quarter: quarter of the order
- month: month of the order
- day: day of the order
- week: week of the order
- is_weekend: indicator of whether the date falls on a weekend
dim_products
- product_id: primary key, identifier of the product
- product_category_name: category description of the product
dim_sellers
- seller_id: primary key, identifier of the seller
- seller_zip_code_prefix: seller address zip code prefix
- seller_city: seller address city
- seller_state: seller address state
- seller_geo_lat: seller address latitude
- seller_geo_lng: seller address longitude
dim_payments
- payment_key: primary key, generated identifier for the payment
- order_id: foreign key, identifier of the order linked to the payment
- payment_sequential: sequential number of the payment within the order
- payment_type: payment type description (credit card, et cetera)
- payment_installments: number of installments of the payment
dim_customers
- customer_id: primary key, identifier of the customer
- customer_zip_code_prefix: customer address zip code prefix
- customer_city: customer address city
- customer_state: customer address state
- customer_geo_lat: customer address latitude
- customer_geo_lng: customer address longitude
fact_orders
- order_id: primary key, identifier of the order
- product_id: foreign key, identifier of the product linked to the order
- date_key: foreign key, identifier of the date record linked to the order
- seller_id: foreign key, identifier of the seller linked to the order
- payment_key: foreign key, identifier of the payment linked to the order
- customer_id: foreign key, identifier of the customer linked to the order
- freight_value: freight value of the order item
- price: price of the order item
- order_items_qtd: quantity of items in the order
- order_status: current status of the order
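As noted in the dim_date description, date_key is generated from the order timestamp. A minimal sketch of that derivation as a Redshift INSERT ... SELECT, assuming a staging_orders table with an order_purchase_timestamp column (both names are assumptions for illustration):

```python
# Illustrative transformation: build dim_date from order purchase timestamps.
# Table and column names are assumptions, not taken from the project scripts.
INSERT_DIM_DATE = """
INSERT INTO dim_date (date_key, date, year, quarter, month, day, week, is_weekend)
SELECT DISTINCT
    CAST(TO_CHAR(order_purchase_timestamp, 'YYYYMMDD') AS INTEGER) AS date_key,
    CAST(order_purchase_timestamp AS DATE)                          AS date,
    EXTRACT(year    FROM order_purchase_timestamp)                  AS year,
    EXTRACT(quarter FROM order_purchase_timestamp)                  AS quarter,
    EXTRACT(month   FROM order_purchase_timestamp)                  AS month,
    EXTRACT(day     FROM order_purchase_timestamp)                  AS day,
    EXTRACT(week    FROM order_purchase_timestamp)                  AS week,
    CASE WHEN EXTRACT(dow FROM order_purchase_timestamp) IN (0, 6)
         THEN TRUE ELSE FALSE END                                   AS is_weekend
FROM staging_orders
WHERE order_purchase_timestamp IS NOT NULL;
"""
```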
- AWS S3 for data storage (Because it offers industry-leading scalability, data availability, security, and performance)
- AWS Redshift as Data Warehouse (Because it is a powerful analytics system that integrates well with databases and with S3)
- Python (Because it uses a simplified syntax with an emphasis on natural language)
- PostgreSQL (Because it is an advanced open-source object-relational system which applies SQL language)
- Jupyter Notebooks (Because it allows users to compile all aspects of a data project in one place making it easier to show the entire process of a project to your intended audience)
- Depending on how frequently you would like to inspect the Data Engineering jobs, we could feed data into the corresponding S3 buckets and generate the reports. A weekly schedule should be enough for checking how the market is changing.
- All tables should be updated in an append-only mode.
- The data was increased by 100x.
If the data were increased by 100x, an alternative is to use Spark on AWS EMR, a managed distributed cluster platform for processing large datasets in the cloud.
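A minimal sketch of what that could look like with PySpark; the bucket path and the aggregation shown are assumptions for illustration, not the project's actual Spark job:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical S3 paths; on EMR the cluster's IAM role grants access to the bucket.
spark = SparkSession.builder.appName("olist-etl").getOrCreate()

orders = spark.read.csv("s3://my-olist-bucket/olist_orders_dataset.csv",
                        header=True, inferSchema=True)
items = spark.read.csv("s3://my-olist-bucket/olist_order_items_dataset.csv",
                       header=True, inferSchema=True)

# Example transformation: aggregate items per order before writing to the warehouse.
fact_orders = (items.groupBy("order_id")
                    .agg(F.sum("price").alias("price"),
                         F.sum("freight_value").alias("freight_value"),
                         F.count("*").alias("order_items_qtd"))
                    .join(orders.select("order_id", "order_status"), "order_id"))

fact_orders.write.mode("append").parquet("s3://my-olist-bucket/fact_orders/")
```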
- The data populates a dashboard that must be updated on a daily basis by 7am every day.
In this case we can build a pipeline with Apache Airflow to update the data regularly and populate the report. Apache Airflow integrates very well with Python and AWS, and more applications can be combined with it to deliver more powerful task automation.
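A minimal sketch of such a schedule, assuming a hypothetical run_etl() entry point that wraps the project's ETL logic (the DAG id, schedule, and callable are illustrative, not part of the repository):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_etl():
    # Placeholder for the project's ETL entry point (e.g. the logic in etl.py).
    ...

with DAG(
    dag_id="olist_daily_refresh",      # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 5 * * *",     # run early so the dashboard is fresh before 7am
    catchup=False,
) as dag:
    refresh = PythonOperator(task_id="run_etl", python_callable=run_etl)
```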
- The database needed to be accessed by 100+ people.
For this case we can still use AWS Redshift, which handles up to 500 concurrent connections. If this single source of truth (SSOT) database will be accessed by 100+ people, we can move it to Redshift with confidence that it will handle the load. A cost/benefit analysis will be needed before implementing this cloud solution.
Follow the steps below to extract and load the data into the data model.
- Navigate to the Capstone Project folder
- Create an S3 bucket on AWS
- Create a cluster on AWS Redshift
- Create an IAM user
- Create an IAM role granting the correct accesses to S3 and Redshift
- Link the role with the S3 bucket and the Redshift cluster
- Fill in the information in the dwh.cfg file (see the configuration sketch after this list)
- Upload the datasets to the S3 bucket
- Run create_tables.py to create/reset the tables
- Run etl.py to process and load the data into the database. This will execute the SQL queries that stage the data from S3 on Redshift and then transform and insert it into the Postgres tables on Redshift
- Run the queries in the notebook tests_queries.ipynb to validate the data loaded into the tables
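The scripts read their connection and IAM settings from dwh.cfg. As an illustration only (the section and key names below are assumptions, not copied from the project's config file), the pattern typically looks like this:

```python
import configparser

import psycopg2

# Hypothetical section/key names; the actual names are defined in the project's dwh.cfg.
config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
iam_role_arn = config["IAM_ROLE"]["ARN"]  # used in the COPY commands
```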
Alisson Lima - ali2slima10@gmail.com