This project involves building a pipeline that automates the process of extracting, transforming, and loading data using Python and SQLAlchemy. The automated ETL pipeline will extract the zip file that contains the data and process the transactions in one run.
The data for this project is contained in a CSV file inside a zip file, so the CSV file needs to be extracted and saved into the data folder. The extract.py script handles this extraction step.
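A minimal sketch of what extract.py might do, assuming a hypothetical archive name and data folder path (adjust both to the actual project layout):

```python
import zipfile
from pathlib import Path

# Hypothetical locations; replace with the actual archive and data folder.
source_path = Path("data/ppr-all.zip")
data_folder = Path("data")

def extract_data():
    """Unzip the source archive so the CSV is available in the data folder."""
    with zipfile.ZipFile(source_path, mode="r") as archive:
        archive.extractall(data_folder)

if __name__ == "__main__":
    extract_data()
```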
SQLAlchemy is a SQL toolkit written in Python. It provides a Pythonic framework for streamlining the workflow and querying the data efficiently.
The connection to the database is defined in the base.py file.
To connect to the PostgreSQL database, you have to create an engine using a connection string. The syntax for creating an engine is:
create_engine(dialect+driver://username:password@host:port/database_name)
where:
- dialect is the database type (here, postgresql)
- driver is the DBAPI that allows SQLAlchemy to obtain a connection to Postgres (for example, psycopg2)
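For example, a connection to a local Postgres instance might look like the following; the credentials and database name are placeholders, not values from the project:

```python
from sqlalchemy import create_engine

# Placeholder credentials and database name, for illustration only.
engine = create_engine(
    "postgresql+psycopg2://username:password@localhost:5432/ppr_db"
)
```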
To create a SQL table, the Python class is mapped to the SQL table; a sketch of such a model follows the list below. The steps are:
- Initialize a declarative base so that the Python class can inherit from it
- Set the table name to map the class to the PostgreSQL table
- Set the id column as the primary key
- Declare the rest of the columns
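A minimal sketch of such a model, assuming a simplified set of columns (the real PprRawAll table may declare more):

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

# Initialize the declarative base that every model inherits from.
Base = declarative_base()

class PprRawAll(Base):
    # Map the class to the PostgreSQL table.
    __tablename__ = "ppr_raw_all"

    # The primary key column.
    id = Column(Integer, primary_key=True)

    # Illustrative columns only; the real table may differ.
    date_of_sale = Column(String)
    address = Column(String)
    price = Column(String)
    description = Column(String)
```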
The Base class that the model inherits from holds the definition of the PprRawAll model in its metadata. Therefore, to create the table, call the create_all function on that metadata with the engine as an argument.
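For example, reusing the Base and engine defined above:

```python
# Create every table registered on Base's metadata, including PprRawAll.
Base.metadata.create_all(engine)
```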
In this step, the data is moved from the CSV file into a table. This makes it easy to apply transformations to the data using Python and SQL during the data cleaning process.
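One way to do this, sketched here under the assumption that the CSV headers roughly match the model's columns and that a session factory is available, is to read the file with the csv module and insert each row:

```python
import csv
from sqlalchemy.orm import sessionmaker

# Assumes the engine and PprRawAll model defined earlier.
Session = sessionmaker(bind=engine)

def load_raw_data(csv_path):
    """Insert every CSV row into the raw table."""
    session = Session()
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            # Column names here are illustrative; map them to the real headers.
            session.add(
                PprRawAll(
                    date_of_sale=row["date_of_sale"],
                    address=row["address"],
                    price=row["price"],
                    description=row["description"],
                )
            )
    session.commit()
    session.close()
```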
In the transform.py script, the following actions are used to transform the data into the required form (a sketch of these transformations follows the list):
- Lowercase all the strings in all columns
- Change date format
- Change the price to an integer by:
  - Removing the Euro symbol
  - Removing the comma
  - Converting it to a float
  - Converting to an integer
- Change the description to only show whether a property is new or second hand.
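A minimal sketch of these transformations, assuming helper functions along these lines (the exact column values and date formats are placeholders):

```python
from datetime import datetime

def transform_case(value):
    """Lowercase a string value."""
    return value.lower()

def update_date_of_sale(raw_date):
    """Change the date format, assuming the raw data uses dd/mm/yyyy."""
    return datetime.strptime(raw_date, "%d/%m/%Y").strftime("%Y-%m-%d")

def update_price(raw_price):
    """Strip the Euro symbol and comma, convert to float, then to int."""
    cleaned = raw_price.replace("€", "").replace(",", "")
    return int(float(cleaned))

def update_description(raw_description):
    """Keep only whether the property is new or second-hand."""
    description = transform_case(raw_description)
    if "new" in description:
        return "new"
    return "second-hand"
```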
This phase depends on inserting and deleting records from the table to remove erroneous rows.
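For instance, erroneous rows could be removed with the session API; the filter condition below is purely illustrative and not the project's actual cleaning rule:

```python
session = Session()

# Delete rows considered erroneous; the condition is illustrative only.
session.query(PprRawAll).filter(PprRawAll.price == "").delete()

session.commit()
session.close()
```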
Before loading the data, a new table has to be created to store the data that the analysts will use. The new table, ppr_clean_all, is created on the .
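Following the same mapping pattern as before, the clean table might be declared as another model; the class name and column list here are placeholders:

```python
class PprCleanAll(Base):
    # Map the class to the new PostgreSQL table used by the analysts.
    __tablename__ = "ppr_clean_all"

    id = Column(Integer, primary_key=True)

    # Illustrative columns; the real clean table mirrors the transformed data.
    date_of_sale = Column(String)
    address = Column(String)
    price = Column(Integer)
    description = Column(String)

# Create the new table on the database.
Base.metadata.create_all(engine)
```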