This project uses PySpark to run data processing tasks on a large dataset. It includes a Spark SQL query and a data transformation operation that prepare the data for analysis.
- Use PySpark for data processing on a substantial dataset.
- Include a Spark SQL query and a data transformation (see the sketch below).
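A minimal sketch of what the Spark SQL query and the transformation might look like; the CSV path and the column names `flavor` and `calories` are assumptions for illustration, not necessarily what this repo uses:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session
spark = SparkSession.builder.appName("BaskinRobbins").getOrCreate()

# Assumed path and columns; adjust to the real dataset
df = spark.read.csv("data/baskin_robbins.csv", header=True, inferSchema=True)

# Spark SQL query: register the DataFrame as a temp view and query it
df.createOrReplaceTempView("icecream")
spark.sql(
    "SELECT flavor, calories FROM icecream ORDER BY calories DESC LIMIT 5"
).show()

# Transformation: add a derived column flagging high-calorie flavors
df.withColumn("is_high_calorie", F.col("calories") > 300).show(5)
```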
- Extract (E): retrieves the dataset in CSV format from a specified URL.
- Transform (T): cleans, filters, and enriches the extracted data, preparing it for analysis.
- Load (L): loads the transformed data into a SQLite database table using Python's sqlite3 module.
- Query (Q): writes and executes SQL queries on the SQLite database to analyze the data and extract insights (a combined sketch of all four steps follows below).
Dataset: Baskin Robbins Ice-Cream
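A minimal, self-contained sketch of the four steps, assuming a placeholder URL, hypothetical file and table names, and illustrative column names (`flavor`, `calories`); the actual scripts may differ:

```python
import os
import sqlite3

import pandas as pd
import requests

CSV_PATH = "data/baskin_robbins.csv"  # assumed local path
DB_PATH = "baskin_robbins.db"         # assumed database file name


def extract(url="https://example.com/baskin_robbins.csv"):
    """E: download the CSV dataset from a URL (placeholder shown)."""
    os.makedirs(os.path.dirname(CSV_PATH), exist_ok=True)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(CSV_PATH, "wb") as f:
        f.write(response.content)
    return CSV_PATH


def transform_load():
    """T + L: clean the extracted CSV and load it into a SQLite table."""
    df = pd.read_csv(CSV_PATH)
    df = df.dropna()  # clean: drop incomplete rows
    # normalize column names for easier querying
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    conn = sqlite3.connect(DB_PATH)
    df.to_sql("icecream", conn, if_exists="replace", index=False)
    conn.commit()
    conn.close()
    return DB_PATH


def query():
    """Q: run an example SQL query against the loaded table."""
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute(
        "SELECT flavor, AVG(calories) AS avg_calories "
        "FROM icecream GROUP BY flavor "
        "ORDER BY avg_calories DESC LIMIT 5"
    ).fetchall()
    conn.close()
    for row in rows:
        print(row)
```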
To run the project, use the Makefile targets below (a sketch of a matching Makefile follows the list):

- `make install`: install the required Python packages
- `make lint`: check code style
- `make test`: run the tests
- `make format`: format the code
- `make extract`: extract the data
- `make transform_load`: transform and load the data
- `make query`: query the data
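For reference, a plausible shape for the Makefile behind these targets; the tool choices (`black`, `ruff`, `pytest`) and the script names (`extract.py`, `transform_load.py`, `query.py`) are assumptions, not necessarily what this repo uses:

```make
# Sketch of the Makefile targets (recipe lines must be tab-indented).
install:
	pip install --upgrade pip && pip install -r requirements.txt

format:
	black *.py

lint:
	ruff check *.py

test:
	python -m pytest -vv

extract:
	python extract.py

transform_load:
	python transform_load.py

query:
	python query.py
```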
Running `make format`, `make lint`, and `make test` in GitHub Actions completes successfully.