INDIVIDUAL PROJECT 01

EDA - ETL - FastAPI - Docker on streaming platform datasets


Hi! My name is Guillermo Fernández, and this is my first individual project, part of the training for the Henry Data Science bootcamp. (For the Spanish version, see readme_es.)


Objective

Ingest data from various sources, apply the transformations considered relevant, and then make the clean data available for querying through an API running in a Docker container.

Project description (in Spanish)

Context

The amount of content available on streaming platforms today is constantly growing. For this project, datasets from the Amazon, Disney, Hulu, and Netflix platforms were used.


Tech stack

Python, MySQL, FastAPI, Uvicorn, Docker, Mogenius

Workplan:

In the following video I explain the steps detailed below (in Spanish):

  1. EDA (Exploratory data analysis)
  2. Relate and join the datasets
  3. ETL (Extract, Transform, Load)
  4. Create an API with FastAPI
  5. Create a Docker container with the API
  6. Queries
  7. Mogenius deployment (extra step)

Repository files

  • Datasets: This folder contains the raw files used for the project, as well as the file produced by the ETL.
  • EDA_ETL: This folder contains the files used to perform the EDA and ETL.
  • Dockerfile: Script to build the Docker image for the container.
  • main.py: Script that instantiates the API, with the functions that answer the queries.

EDA

For the first step, the EDA was performed in the PI_Script.ipynb notebook. The datasets were explored: duplicate rows were dropped, and data types and missing values were reviewed. Based on this analysis, some initial conclusions were drawn.
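
As a rough illustration of the kind of checks performed in the notebook (the file path below is an assumed example, not the actual PI_Script.ipynb code):

    import pandas as pd

    # Load one of the raw platform datasets (assumed example path)
    df = pd.read_csv("Datasets/netflix_titles.csv")

    # Basic exploration: shape, data types, and missing values per column
    print(df.shape)
    print(df.dtypes)
    print(df.isna().sum())

    # Drop duplicate rows, keeping the first occurrence
    df = df.drop_duplicates()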

Relate

The datasets were combined into a single dataset, adding a feature that indicates the source table. The result was then loaded into a MySQL database to perform some transformations.
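
A minimal sketch of how such a union could be done with pandas and SQLAlchemy (the file names, the platform column name, and the connection string are assumptions, not the project's actual code):

    import pandas as pd
    from sqlalchemy import create_engine

    # Assumed raw files, one per platform
    files = {
        "amazon": "Datasets/amazon_prime_titles.csv",
        "disney": "Datasets/disney_plus_titles.csv",
        "hulu": "Datasets/hulu_titles.csv",
        "netflix": "Datasets/netflix_titles.csv",
    }

    # Concatenate the datasets, adding a feature that records the source platform
    frames = []
    for platform, path in files.items():
        df = pd.read_csv(path)
        df["platform"] = platform
        frames.append(df)
    combined = pd.concat(frames, ignore_index=True)

    # Load the combined dataset into a MySQL table (hypothetical credentials)
    engine = create_engine("mysql+pymysql://user:password@localhost:3306/streaming")
    combined.to_sql("titles", engine, if_exists="replace", index=False)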

ETL

Queries were run with ETL_Script.sql in order to adjust the data types. The relevant value transformations were applied, focusing on the features required by the API queries. Less relevant features were left aside, since the focus of the work was not the ETL itself but the following stages.
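
The actual transformations live in ETL_Script.sql; purely as an illustration of the kind of type adjustment involved, a rough pandas equivalent could look like this (column names are assumptions):

    import pandas as pd

    # Illustration only: the project performs these adjustments in ETL_Script.sql
    combined = pd.read_csv("Datasets/combined_titles.csv")

    # Split a "90 min" / "2 Seasons" style duration into a numeric value and a unit
    duration = combined["duration"].str.extract(r"(?P<duration_int>\d+)\s*(?P<duration_type>\w+)")
    combined["duration_int"] = pd.to_numeric(duration["duration_int"], errors="coerce")
    # Normalize the unit, e.g. "Seasons" -> "season"
    combined["duration_type"] = duration["duration_type"].str.lower().str.rstrip("s")

    # Make sure the release year is stored as an integer
    combined["release_year"] = combined["release_year"].astype("Int64")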

Create an API with FastAPI

For the creation of the API, the main.py file was used, where the functions that answer the queries are defined. The script instantiates the API locally, loads the already-transformed CSV, and returns the expected results for each query. To test the API locally, Uvicorn is started from the terminal with the command: uvicorn main:app --reload

For this project, only 4 types of queries were requested (a minimal endpoint sketch is shown after the list):

  • Title with the longest duration, by platform and year. The request format is: /get_max_duration(year, platform, [min or season])

  • Total count of movies and series, by platform. The request format is: /get_count_platform(platform)

  • Genre with the most occurrences, and its platform. The request format is: /get_listedin(genero)

  • Actor with the most occurrences, by platform and year. The request format is: /get_actor(plataforma, año)
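
A minimal sketch of what one of these endpoints could look like with FastAPI (the CSV path, column names, and route template are assumptions, not the actual main.py):

    import pandas as pd
    from fastapi import FastAPI

    app = FastAPI()

    # Load the already-transformed CSV once at startup (assumed path and columns)
    df = pd.read_csv("Datasets/titles_transformed.csv")

    # The route template mirrors the request format described above
    @app.get("/get_max_duration({year},{platform},{duration_type})")
    def get_max_duration(year: int, platform: str, duration_type: str):
        # Strip the quotes used in the example URLs, then filter by year,
        # platform, and duration unit ("min" or "season")
        subset = df[
            (df["release_year"] == year)
            & (df["platform"].str.lower() == platform.strip("'").lower())
            & (df["duration_type"] == duration_type.strip("'").lower())
        ]
        if subset.empty:
            return {"error": "no titles found for the given filters"}
        # Return the title with the longest duration
        row = subset.loc[subset["duration_int"].idxmax()]
        return {"title": row["title"], "duration": int(row["duration_int"])}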

Docker container

To build the container image, the Dockerfile is used. It defines an image that already includes Python and the libraries needed to run the API. The build is done with the Docker Desktop app for Windows and a few lines in the Visual Studio Code terminal:

  • To create the image (an existing image can also be used): docker build -t <image> .
  • To create and run the container from that image: docker run -it -p 8000:8000 -v cd:/usr/src/app <image>

Queries

Once the container is running, the /docs URL can be opened to perform the queries, or a direct URL can be used, for example: localhost:8000/get_max_duration(2018,'Hulu','min'). After verifying that the queries return the expected results, the required steps are considered complete.
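
For example, the same query could be made from Python with the requests library (a sketch; the URL follows the example above):

    import requests

    # Query the API running in the local container
    url = "http://localhost:8000/get_max_duration(2018,'Hulu','min')"
    response = requests.get(url)
    print(response.status_code)
    print(response.json())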

Mogenius - extra step

To host the container with the API so that it can be consumed from anywhere, Mogenius was used. With all the files on GitHub, Mogenius is granted access to the repository so the API can be deployed and consumed.

The Mogenius deployment can be consulted at the following link:


Conclusion

Carrying out this project made it possible to integrate the concepts learned about Python and to gain new knowledge through research on APIs and Docker. There is still a lot of work to be done in SQL, which will be tackled over time.

Here is my contact info:
LinkedIn

Thank you very much for reading!
