INDIVIDUAL PROJECT 01

EDA - ETL - FastAPI - Docker on streaming platform datasets


Hi! My name is Guillermo Fernández, and this is my first individual project, part of the training for the Henry Data Science bootcamp. (For the Spanish version, see readme_es.)


Objective

Ingest data from various sources, apply the transformations considered relevant, and then make the clean data available for querying through an API running in a Docker container.

Project description (in Spanish)

Context

The amount of content available on streaming platforms today is constantly growing. For this project, datasets from the Amazon, Disney, Hulu, and Netflix platforms were used.


Tech stack

Python, MySQL, FastAPI, Uvicorn, Docker, Mogenius

Workplan:

In the following video I explain the steps detailed below (in Spanish):

  1. EDA (Exploratory data analysis)
  2. Relate and join the datasets
  3. ETL (Extract, Transform, Load)
  4. Create an API with FastAPI
  5. Create a Docker container with the API
  6. Queries
  7. Mogenius deployment (extra step)

Repository files

  • Datasets: This folder contains the raw files used for the project, as well as the file produced by the ETL.
  • EDA_ETL: This folder contains the files used to perform the EDA and ETL.
  • Dockerfile: Script to build the Docker image for the container.
  • main.py: Script that instantiates the API, with the functions that answer the queries.

EDA

For the first step, the EDA was performed in the PI_Script.ipynb notebook. The datasets were explored: duplicate rows were dropped, and data types and missing values were reviewed. Based on this analysis, some initial conclusions were drawn.
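
As a rough illustration of the kind of checks performed in the notebook (the file path below is an assumed example, not the actual PI_Script.ipynb code):

    import pandas as pd

    # Load one of the raw platform datasets (assumed example path)
    df = pd.read_csv("Datasets/netflix_titles.csv")

    # Basic exploration: shape, data types, and missing values per column
    print(df.shape)
    print(df.dtypes)
    print(df.isna().sum())

    # Drop duplicate rows, keeping the first occurrence
    df = df.drop_duplicates()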

Relate

The datasets were combined into a single dataset, adding a feature that indicates the source table. The result was then loaded into a MySQL database to perform some transformations.
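
A minimal sketch of how such a union could be done with pandas and SQLAlchemy (the file names, the platform column name, and the connection string are assumptions, not the project's actual code):

    import pandas as pd
    from sqlalchemy import create_engine

    # Assumed raw files, one per platform
    files = {
        "amazon": "Datasets/amazon_prime_titles.csv",
        "disney": "Datasets/disney_plus_titles.csv",
        "hulu": "Datasets/hulu_titles.csv",
        "netflix": "Datasets/netflix_titles.csv",
    }

    # Concatenate the datasets, adding a feature that records the source platform
    frames = []
    for platform, path in files.items():
        df = pd.read_csv(path)
        df["platform"] = platform
        frames.append(df)
    combined = pd.concat(frames, ignore_index=True)

    # Load the combined dataset into a MySQL table (hypothetical credentials)
    engine = create_engine("mysql+pymysql://user:password@localhost:3306/streaming")
    combined.to_sql("titles", engine, if_exists="replace", index=False)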

ETL

Queries were run with ETL_Script.sql in order to adjust the data types. The relevant value transformations were applied, focusing on the features required by the API queries. Less relevant features were left aside, since the focus of the work was not the ETL itself but the following stages.
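
The actual transformations live in ETL_Script.sql; purely as an illustration of the kind of type adjustment involved, a rough pandas equivalent could look like this (column names are assumptions):

    import pandas as pd

    # Illustration only: the project performs these adjustments in ETL_Script.sql
    combined = pd.read_csv("Datasets/combined_titles.csv")

    # Split a "90 min" / "2 Seasons" style duration into a numeric value and a unit
    duration = combined["duration"].str.extract(r"(?P<duration_int>\d+)\s*(?P<duration_type>\w+)")
    combined["duration_int"] = pd.to_numeric(duration["duration_int"], errors="coerce")
    # Normalize the unit, e.g. "Seasons" -> "season"
    combined["duration_type"] = duration["duration_type"].str.lower().str.rstrip("s")

    # Make sure the release year is stored as an integer
    combined["release_year"] = combined["release_year"].astype("Int64")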

Create an API with FastAPI

For the creation of the API, the main.py file was used, where the functions that answer the queries are defined. The script instantiates the API locally, loads the already-transformed CSV, and returns the expected results for each query. To test the API locally, Uvicorn is started from the terminal with the command: uvicorn main:app --reload

For this project, only 4 types of queries were requested (a minimal endpoint sketch is shown after the list):

  • Title with the longest duration, by platform and year. The request format is: /get_max_duration(year, platform, [min or season])

  • Total count of movies and series, by platform. The request format is: /get_count_platform(platform)

  • Genre with the most occurrences, and its platform. The request format is: /get_listedin(genero)

  • Actor with the most occurrences, by platform and year. The request format is: /get_actor(plataforma, año)
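
A minimal sketch of what one of these endpoints could look like with FastAPI (the CSV path, column names, and route template are assumptions, not the actual main.py):

    import pandas as pd
    from fastapi import FastAPI

    app = FastAPI()

    # Load the already-transformed CSV once at startup (assumed path and columns)
    df = pd.read_csv("Datasets/titles_transformed.csv")

    # The route template mirrors the request format described above
    @app.get("/get_max_duration({year},{platform},{duration_type})")
    def get_max_duration(year: int, platform: str, duration_type: str):
        # Strip the quotes used in the example URLs, then filter by year,
        # platform, and duration unit ("min" or "season")
        subset = df[
            (df["release_year"] == year)
            & (df["platform"].str.lower() == platform.strip("'").lower())
            & (df["duration_type"] == duration_type.strip("'").lower())
        ]
        if subset.empty:
            return {"error": "no titles found for the given filters"}
        # Return the title with the longest duration
        row = subset.loc[subset["duration_int"].idxmax()]
        return {"title": row["title"], "duration": int(row["duration_int"])}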

Docker container

To build the container image, the Dockerfile is used. It defines an image that already includes Python and the libraries needed to run the API. The build is done with the Docker Desktop app for Windows and a few lines in the Visual Studio Code terminal:

  • To create the image (an existing image can also be used): docker build -t <image> .
  • To create and run the container from that image: docker run -it -p 8000:8000 -v cd:/usr/src/app <image>

Queries

Once the container is running, the /docs URL can be opened to perform the queries, or a direct URL can be used, for example: localhost:8000/get_max_duration(2018,'Hulu','min'). After verifying that the queries return the expected results, the required steps are considered complete.
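
For example, the same query could be made from Python with the requests library (a sketch; the URL follows the example above):

    import requests

    # Query the API running in the local container
    url = "http://localhost:8000/get_max_duration(2018,'Hulu','min')"
    response = requests.get(url)
    print(response.status_code)
    print(response.json())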

Mogenius - extra step

To host the container with the API so that it can be consumed from anywhere, Mogenius was used. With all the files on GitHub, Mogenius is granted access to the repository so the API can be deployed and consumed.

The Mogenius deployment can be consulted at the following link:


Conclusion

Carrying out this project made it possible to integrate the concepts learned about Python and to gain new knowledge through research on APIs and Docker. There is still a lot of work to be done in SQL, which will be tackled over time.

Here is my contact info:
LinkedIn

Thank you very much for reading!
