- Start by creating a new Google Cloud Platform (GCP) project.
- Create a storage bucket for the project on Cloud Storage (I opted for the asia-south1 region).
- Upload the dataset and the Python script written using PySpark to the bucket (see the upload sketch below).
- Then create a Hadoop cluster using Dataproc (see the cluster-creation sketch below) with the following specifications:
  - Region: asia-south1
  - Cluster type: Standard (1 master, 2 workers)
  - Machine type:
    - Master: n2-standard-4 (4 vCPUs, 16 GB RAM, Intel Cascade Lake), 50 GB SSD boot disk
    - Workers: n2-standard-2 (2 vCPUs, 8 GB RAM, Intel Cascade Lake), 30 GB SSD boot disk each
ℹ️ The estimated cost to keep the cluster running for two hours under the above specifications is 2.59 USD.
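The dataset and script can be uploaded from the console, with `gsutil`, or with the Cloud Storage Python client. Below is a minimal upload sketch using `google-cloud-storage`; the project ID, bucket name, and file names are placeholders, not values from this walkthrough.

```python
# A minimal sketch of uploading the dataset and PySpark script to the bucket
# with the google-cloud-storage client. Bucket and file names below are
# placeholders; substitute your own.
from google.cloud import storage

PROJECT_ID = "my-gcp-project"        # assumed project ID
BUCKET_NAME = "my-dataproc-bucket"   # assumed bucket name

client = storage.Client(project=PROJECT_ID)
bucket = client.bucket(BUCKET_NAME)

# Upload the raw dataset and the pre-processing script to the bucket root.
for local_path, blob_name in [
    ("data/raw_dataset.csv", "raw_dataset.csv"),
    ("preprocess.py", "preprocess.py"),
]:
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)
    print(f"Uploaded {local_path} to gs://{BUCKET_NAME}/{blob_name}")
```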
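The cluster described above can also be created programmatically with the `google-cloud-dataproc` client instead of the console. The sketch below mirrors the listed specifications (one n2-standard-4 master with a 50 GB SSD, two n2-standard-2 workers with 30 GB SSDs); the project ID and cluster name are assumptions.

```python
# A sketch of creating the same cluster programmatically with the
# google-cloud-dataproc client; the console workflow above is equivalent.
# Project and cluster names are placeholders.
from google.cloud import dataproc_v1

PROJECT_ID = "my-gcp-project"
REGION = "asia-south1"
CLUSTER_NAME = "preprocessing-cluster"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# Mirror the specs above: 1 x n2-standard-4 master (50 GB SSD),
# 2 x n2-standard-2 workers (30 GB SSD each).
cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": CLUSTER_NAME,
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n2-standard-4",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 50},
        },
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n2-standard-2",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 30},
        },
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
result = operation.result()  # blocks until the cluster is ready
print(f"Cluster created: {result.cluster_name}")
```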
- Once the cluster is set up, create a new PySpark job in the cluster and submit the Python script to pre-process the dataset (sketches of the pre-processing script and of programmatic job submission follow).
- The pre-processed dataset will be saved to your bucket.
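The pre-processing script itself depends on your dataset; the sketch below only shows the general shape of such a PySpark script, reading a raw CSV from the bucket, applying placeholder cleaning steps, and writing the result back to the bucket as Parquet. All paths and cleaning steps are assumptions.

```python
# preprocess.py -- a minimal PySpark pre-processing sketch. The input path,
# output path, and cleaning steps are placeholders for your own dataset.
from pyspark.sql import SparkSession

BUCKET = "my-dataproc-bucket"  # assumed bucket name
INPUT_PATH = f"gs://{BUCKET}/raw_dataset.csv"
OUTPUT_PATH = f"gs://{BUCKET}/preprocessed/"

spark = SparkSession.builder.appName("preprocess-dataset").getOrCreate()

# Read the raw CSV from Cloud Storage.
df = spark.read.csv(INPUT_PATH, header=True, inferSchema=True)

# Example cleaning steps: drop exact duplicates and rows containing nulls.
cleaned = df.dropDuplicates().na.drop()

# Write the pre-processed dataset back to the bucket as Parquet.
cleaned.write.mode("overwrite").parquet(OUTPUT_PATH)

spark.stop()
```

Dataproc clusters ship with the Cloud Storage connector pre-installed, so `gs://` paths can be read and written directly from Spark.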
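The script can be submitted from the Jobs page of the Dataproc console; the sketch below shows an equivalent programmatic submission with the `google-cloud-dataproc` job client. The project, cluster, bucket, and script names are placeholders carried over from the earlier sketches.

```python
# A sketch of submitting preprocess.py as a PySpark job on the cluster
# with the google-cloud-dataproc job client. Names and paths are placeholders.
from google.cloud import dataproc_v1

PROJECT_ID = "my-gcp-project"
REGION = "asia-south1"
CLUSTER_NAME = "preprocessing-cluster"
BUCKET = "my-dataproc-bucket"

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/preprocess.py"},
}

# Submit the job and wait for it to finish.
operation = job_client.submit_job_as_operation(
    request={"project_id": PROJECT_ID, "region": REGION, "job": job}
)
response = operation.result()
print("Pre-processing job finished with state:", response.status.state.name)
```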
- Now you can access the pre-processed dataset through BigQuery on GCP to generate insights using SQL commands (see the sketch below).
- In addition, you can use Looker to build insightful dashboards.
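One way to make the bucket output queryable in BigQuery is to load it into a table first. The sketch below uses the `google-cloud-bigquery` client and assumes the PySpark job wrote Parquet files and that a BigQuery dataset (here `analytics`) already exists; all names and the example query are placeholders.

```python
# A sketch of loading the pre-processed Parquet files into BigQuery and
# querying them. Dataset, table, and column names are placeholders.
from google.cloud import bigquery

PROJECT_ID = "my-gcp-project"
BUCKET = "my-dataproc-bucket"
TABLE_ID = f"{PROJECT_ID}.analytics.preprocessed"  # assumed dataset.table; dataset must exist

client = bigquery.Client(project=PROJECT_ID)

# Load the Parquet output written by the PySpark job into a BigQuery table.
load_job = client.load_table_from_uri(
    f"gs://{BUCKET}/preprocessed/*.parquet",
    TABLE_ID,
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET),
)
load_job.result()  # wait for the load to complete

# Run an example SQL query against the loaded table.
query = f"SELECT COUNT(*) AS row_count FROM `{TABLE_ID}`"
for row in client.query(query).result():
    print(f"Rows in pre-processed table: {row.row_count}")
```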