
PySpark Data Processing


Purpose

This project uses PySpark to run data processing tasks on a large dataset. It includes a Spark SQL query and a data transformation to support analysis and insight extraction.


Requirements

  1. Utilize PySpark for data processing on a substantial dataset.
  2. Incorporate a Spark SQL query and a data transformation (a sketch of both follows this list).
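
A minimal sketch of what such a query and transformation might look like. The file name ice_cream.csv, the view name, and the columns flavor and calories are placeholders for illustration, not the repo's actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("PySpark-Data-Processing").getOrCreate()

# Read the CSV into a DataFrame (placeholder file name).
df = spark.read.csv("ice_cream.csv", header=True, inferSchema=True)

# Data transformation: derive a column flagging high-calorie flavors.
transformed = df.withColumn("high_calorie", F.col("calories") > 300)

# Spark SQL query: register a temporary view and query it with SQL.
transformed.createOrReplaceTempView("ice_cream")
spark.sql(
    "SELECT flavor, calories FROM ice_cream "
    "WHERE high_calorie ORDER BY calories DESC"
).show()
```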

ETL

Extract (E): Retrieves a dataset in CSV format from a specified URL.
Transform (T): Cleans, filters, and enriches the extracted data, preparing it for analysis.
Load (L): Loads the transformed data into a SQLite database table using Python's sqlite3 module.
Query (Q): Writes and executes SQL queries on the SQLite database to analyze the data and extract insights (a rough end-to-end sketch follows).
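
As a rough end-to-end sketch of these steps. The URL, file names, table name, and rows below are placeholders, not the repo's actual values; the Transform step corresponds to the PySpark sketch above:

```python
import sqlite3
import urllib.request

# Extract (E): download the CSV from a URL (placeholder URL).
urllib.request.urlretrieve("https://example.com/ice_cream.csv", "ice_cream.csv")

# Transform (T): clean, filter, and enrich with PySpark (see the sketch above);
# assume the result has been collected into plain Python rows.
rows = [("Vanilla", 250), ("Chocolate Chip", 310)]  # placeholder rows

# Load (L): write the rows into a SQLite table via Python's sqlite3 module.
conn = sqlite3.connect("ice_cream.db")
conn.execute("CREATE TABLE IF NOT EXISTS ice_cream (flavor TEXT, calories INTEGER)")
conn.executemany("INSERT INTO ice_cream VALUES (?, ?)", rows)
conn.commit()

# Query (Q): run SQL against the loaded table to extract insights.
for flavor, calories in conn.execute(
    "SELECT flavor, calories FROM ice_cream ORDER BY calories DESC"
):
    print(flavor, calories)
conn.close()
```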


Dataset: Baskin Robbins Ice-Cream


Commands to Run the Repo

To run the project, use the following Makefile targets:

  1. # To install the required Python packages
    make install
    
  2. # To check code style
    make lint
    
  3. # To run tests
    make test
    
  4. # To format the code
    make format
    
  5. # To extract data
    make extract
    
  6. # To transform and load data
    make transform_load
    
  7. # To query data
    make query
    

Successful Formatting, Linting and Testing

Running make format, make lint, and make test in GitHub Actions completes successfully.

(Screenshots: successful make lint, make format, and make test runs in GitHub Actions.)
