# PySpark Data Processing

## Purpose

This project uses PySpark to run data processing tasks on a large dataset. It includes a Spark SQL query and a data transformation step to support analysis and insight generation.


## Requirements

1. Use PySpark for data processing on a substantial dataset.
2. Include a Spark SQL query and a data transformation (see the sketch below).
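
To illustrate requirement 2, here is a minimal, hypothetical sketch of a Spark SQL query alongside a DataFrame transformation. The session name, file path, and column names (`calories`, `serving_size_g`, `name`) are illustrative placeholders, not values from this repo.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataProcessing").getOrCreate()

# Read a CSV into a DataFrame (hypothetical path and schema).
df = spark.read.csv("data/dataset.csv", header=True, inferSchema=True)

# Data transformation: derive a new column from existing ones.
df = df.withColumn("calories_per_gram", F.col("calories") / F.col("serving_size_g"))

# Spark SQL query: register a temp view and query it with SQL.
df.createOrReplaceTempView("dataset")
spark.sql("SELECT name, calories FROM dataset ORDER BY calories DESC LIMIT 5").show()
```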

## ETL

- **Extract (E):** Retrieves a dataset in CSV format from a specified URL.
- **Transform (T):** Cleans, filters, and enriches the extracted data, preparing it for analysis.
- **Load (L):** Loads the transformed data into a SQLite database table using Python's `sqlite3` module.
- **Query (Q):** Writes and executes SQL queries on the SQLite database to analyze and extract insights from the data (see the end-to-end sketch below).
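
Below is a minimal sketch of how these four stages might connect end to end. The URL, file paths, table name, and column name are hypothetical, not the repo's actual values.

```python
import sqlite3

import requests
from pyspark.sql import SparkSession

# Extract (E): download the CSV from a URL (hypothetical URL and path).
raw = requests.get("https://example.com/data.csv").content
with open("data/raw.csv", "wb") as f:
    f.write(raw)

# Transform (T): clean and filter with PySpark (illustrative logic).
spark = SparkSession.builder.appName("ETL").getOrCreate()
df = spark.read.csv("data/raw.csv", header=True, inferSchema=True)
df = df.dropna().filter(df["rating"] > 0)

# Load (L): write the transformed rows into a SQLite table via sqlite3.
conn = sqlite3.connect("data/database.db")
df.toPandas().to_sql("products", conn, if_exists="replace", index=False)

# Query (Q): run SQL against the SQLite database.
for row in conn.execute("SELECT COUNT(*) FROM products"):
    print(row)
conn.close()
```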


## Dataset: Baskin Robbins Ice-Cream


## Commands to Run the Repo

To run the project, use the Makefile with the following commands; a hypothetical sketch of how the targets might be defined follows the list:

1. Install the required Python packages:
   `make install`

2. Check code style:
   `make lint`

3. Run tests:
   `make test`

4. Format the code:
   `make format`

5. Extract data:
   `make extract`

6. Transform and load data:
   `make transform_load`

7. Query data:
   `make query`
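
For orientation, the Makefile targets might be defined roughly like this. The tool choices and script names below are hypothetical, not taken from this repo; recipe lines must be indented with tabs.

```make
# Hypothetical sketch; actual tools and script names may differ.
install:
	pip install --upgrade pip && pip install -r requirements.txt

lint:
	pylint --disable=R,C *.py

test:
	python -m pytest -vv

format:
	black *.py

extract:
	python extract.py

transform_load:
	python transform_load.py

query:
	python query.py
```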
    

## Successful Formatting, Linting, and Testing

Running `make format`, `make lint`, and `make test` in GitHub Actions completes successfully.
