- Start by creating a new Google Cloud Platform (GCP) project.
- Create a storage bucket for the project on Cloud Storage (I opted for the asia-south1 region).
- Upload the dataset and the Python script written using PySpark to the bucket (see the upload sketch below).
- Then create a Hadoop cluster using Dataproc (see the cluster-creation sketch below) with the following specifications:
  - Region: asia-south1
  - Cluster type: Standard (1 master, 2 workers)
  - Machine type:
    - Master: n2-standard-4 (4 vCPUs, 16 GB RAM, Intel Cascade Lake), 50 GB SSD boot disk
    - Workers: n2-standard-2 (2 vCPUs, 8 GB RAM, Intel Cascade Lake), 30 GB SSD boot disk each
ℹ️ The estimated cost to keep the cluster running for two hours under the above specifications is 2.59 USD.
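The dataset and script can be uploaded from the console, with `gsutil`, or with the Cloud Storage Python client. Below is a minimal upload sketch using `google-cloud-storage`; the project ID, bucket name, and file names are placeholders, not values from this walkthrough.

```python
# A minimal sketch of uploading the dataset and PySpark script to the bucket
# with the google-cloud-storage client. Bucket and file names below are
# placeholders; substitute your own.
from google.cloud import storage

PROJECT_ID = "my-gcp-project"        # assumed project ID
BUCKET_NAME = "my-dataproc-bucket"   # assumed bucket name

client = storage.Client(project=PROJECT_ID)
bucket = client.bucket(BUCKET_NAME)

# Upload the raw dataset and the pre-processing script to the bucket root.
for local_path, blob_name in [
    ("data/raw_dataset.csv", "raw_dataset.csv"),
    ("preprocess.py", "preprocess.py"),
]:
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)
    print(f"Uploaded {local_path} to gs://{BUCKET_NAME}/{blob_name}")
```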
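The cluster described above can also be created programmatically with the `google-cloud-dataproc` client instead of the console. The sketch below mirrors the listed specifications (one n2-standard-4 master with a 50 GB SSD, two n2-standard-2 workers with 30 GB SSDs); the project ID and cluster name are assumptions.

```python
# A sketch of creating the same cluster programmatically with the
# google-cloud-dataproc client; the console workflow above is equivalent.
# Project and cluster names are placeholders.
from google.cloud import dataproc_v1

PROJECT_ID = "my-gcp-project"
REGION = "asia-south1"
CLUSTER_NAME = "preprocessing-cluster"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# Mirror the specs above: 1 x n2-standard-4 master (50 GB SSD),
# 2 x n2-standard-2 workers (30 GB SSD each).
cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": CLUSTER_NAME,
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n2-standard-4",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 50},
        },
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n2-standard-2",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 30},
        },
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
result = operation.result()  # blocks until the cluster is ready
print(f"Cluster created: {result.cluster_name}")
```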
- Once the cluster is set up, create a new PySpark job in the cluster and submit the Python script to pre-process the dataset (sketches of the pre-processing script and of programmatic job submission follow).
- The pre-processed dataset will be saved to your bucket.
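The pre-processing script itself depends on your dataset; the sketch below only shows the general shape of such a PySpark script, reading a raw CSV from the bucket, applying placeholder cleaning steps, and writing the result back to the bucket as Parquet. All paths and cleaning steps are assumptions.

```python
# preprocess.py -- a minimal PySpark pre-processing sketch. The input path,
# output path, and cleaning steps are placeholders for your own dataset.
from pyspark.sql import SparkSession

BUCKET = "my-dataproc-bucket"  # assumed bucket name
INPUT_PATH = f"gs://{BUCKET}/raw_dataset.csv"
OUTPUT_PATH = f"gs://{BUCKET}/preprocessed/"

spark = SparkSession.builder.appName("preprocess-dataset").getOrCreate()

# Read the raw CSV from Cloud Storage.
df = spark.read.csv(INPUT_PATH, header=True, inferSchema=True)

# Example cleaning steps: drop exact duplicates and rows containing nulls.
cleaned = df.dropDuplicates().na.drop()

# Write the pre-processed dataset back to the bucket as Parquet.
cleaned.write.mode("overwrite").parquet(OUTPUT_PATH)

spark.stop()
```

Dataproc clusters ship with the Cloud Storage connector pre-installed, so `gs://` paths can be read and written directly from Spark.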
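The script can be submitted from the Jobs page of the Dataproc console; the sketch below shows an equivalent programmatic submission with the `google-cloud-dataproc` job client. The project, cluster, bucket, and script names are placeholders carried over from the earlier sketches.

```python
# A sketch of submitting preprocess.py as a PySpark job on the cluster
# with the google-cloud-dataproc job client. Names and paths are placeholders.
from google.cloud import dataproc_v1

PROJECT_ID = "my-gcp-project"
REGION = "asia-south1"
CLUSTER_NAME = "preprocessing-cluster"
BUCKET = "my-dataproc-bucket"

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/preprocess.py"},
}

# Submit the job and wait for it to finish.
operation = job_client.submit_job_as_operation(
    request={"project_id": PROJECT_ID, "region": REGION, "job": job}
)
response = operation.result()
print("Pre-processing job finished with state:", response.status.state.name)
```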
- Now you can access the pre-processed dataset through BigQuery on GCP to generate insights using SQL commands (see the sketch below).
- In addition, you can use Looker to build insightful dashboards.
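One way to make the bucket output queryable in BigQuery is to load it into a table first. The sketch below uses the `google-cloud-bigquery` client and assumes the PySpark job wrote Parquet files and that a BigQuery dataset (here `analytics`) already exists; all names and the example query are placeholders.

```python
# A sketch of loading the pre-processed Parquet files into BigQuery and
# querying them. Dataset, table, and column names are placeholders.
from google.cloud import bigquery

PROJECT_ID = "my-gcp-project"
BUCKET = "my-dataproc-bucket"
TABLE_ID = f"{PROJECT_ID}.analytics.preprocessed"  # assumed dataset.table; dataset must exist

client = bigquery.Client(project=PROJECT_ID)

# Load the Parquet output written by the PySpark job into a BigQuery table.
load_job = client.load_table_from_uri(
    f"gs://{BUCKET}/preprocessed/*.parquet",
    TABLE_ID,
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET),
)
load_job.result()  # wait for the load to complete

# Run an example SQL query against the loaded table.
query = f"SELECT COUNT(*) AS row_count FROM `{TABLE_ID}`"
for row in client.query(query).result():
    print(f"Rows in pre-processed table: {row.row_count}")
```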