The DE_NASA_NeoW_Pipeline project is an end-to-end data engineering pipeline designed to fetch, process, and analyze Near-Earth Object (NEO) data provided by NASA's APIs. The pipeline leverages cloud infrastructure, containerization, and orchestration tools to ensure scalability, reliability, and ease of deployment.
- Data Collection: Fetches NEO data from NASA's API with efficient pagination and stores it for further processing.
- Data Transformation: Processes raw data using Python and prepares it for analytical use.
- Cloud Integration: Uses Google Cloud Platform (GCP) services like BigQuery and Cloud Storage for data storage and querying.
- Orchestration: Employs Apache Airflow for task scheduling and pipeline management.
- Containerization: Encapsulates the entire solution in Docker containers for consistent execution across environments.
- Scalability: Ensures the pipeline can handle growing data volumes with cloud-native tools and distributed architecture.
- Data Ingestion:
  - Extracts data from NASA's NEO API (a minimal fetch sketch follows this list).
  - Processes and validates the data.
- Data Storage:
  - Stores raw and processed data in Google Cloud Storage (GCS).
  - Loads transformed data into BigQuery for analysis.
- Data Analysis:
  - Facilitates querying and reporting using BigQuery and tools like Looker Studio.
- Pipeline Management:
  - Automates workflows and monitoring via Apache Airflow (a minimal DAG sketch follows the overview).
- Deployment:
  - Dockerized solution for deployment on local machines or cloud environments.
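As a reference for the ingestion step, here is a minimal sketch of paging through NASA's public NeoWs browse endpoint with the `requests` library. It is illustrative only, not the project's actual ingestion code; the endpoint and the `DEMO_KEY` default are NASA's documented public values, while the function and variable names are hypothetical.

```python
import os
import requests

# NASA's public NeoWs "browse" endpoint; a personal API key can be obtained at https://api.nasa.gov
NEO_BROWSE_URL = "https://api.nasa.gov/neo/rest/v1/neo/browse"
API_KEY = os.environ.get("NASA_API_KEY", "DEMO_KEY")


def fetch_neo_pages(max_pages: int = 3, page_size: int = 20):
    """Yield raw NEO records, following the API's page parameter."""
    page = 0
    while page < max_pages:
        response = requests.get(
            NEO_BROWSE_URL,
            params={"page": page, "size": page_size, "api_key": API_KEY},
            timeout=30,
        )
        response.raise_for_status()
        payload = response.json()
        for neo in payload.get("near_earth_objects", []):
            yield neo
        # Stop once the last page has been reached.
        if page >= payload["page"]["total_pages"] - 1:
            break
        page += 1


if __name__ == "__main__":
    for record in fetch_neo_pages(max_pages=1):
        print(record["id"], record["name"])
```

Setting a personal key via `NASA_API_KEY` avoids the stricter rate limits NASA applies to `DEMO_KEY`.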
This project demonstrates best practices in modern data engineering and serves as a template for building scalable ETL pipelines.
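To show how the orchestration piece fits together, the snippet below is a minimal Airflow DAG sketch with hypothetical DAG and task names. It only illustrates the shape of such a workflow; the project's real DAGs (presumably under the repository's `airflow/` directory) may differ.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_neo_data(**context):
    """Placeholder: call the NeoWs API and write raw JSON to GCS."""
    ...


def load_to_bigquery(**context):
    """Placeholder: load transformed data from GCS into BigQuery."""
    ...


with DAG(
    dag_id="nasa_neow_pipeline",      # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # Airflow 2.4+; older 2.x uses schedule_interval
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_neo_data", python_callable=extract_neo_data)
    load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)

    # Run the load task only after extraction succeeds.
    extract >> load
```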
- Google Cloud Platform (GCP) Account
  - Visit the GCP Console and create a new project.
  - Enable the required APIs for your project (e.g., BigQuery, Cloud Storage).
  - Create a service account with appropriate roles (e.g., BigQuery Admin, Storage Admin).
  - Download the service account key JSON file.
- Python
  - Install Python 3.7 or higher. Verify the installation with `python3 --version`.
  - Ensure pip is installed and up to date: `python3 -m pip install --upgrade pip`.
- Docker
  - Download and install Docker from Docker's official website.
  - Verify Docker is installed correctly with `docker --version`.
  - Verify that Docker Compose is installed with `docker-compose --version`.
- Git
  - Install Git to clone the repository. Confirm the installation with `git --version`.
- System Resources
  - RAM: at least 8 GB.
  - Disk space: 10 GB or more free for Docker images and logs.
  - CPU: dual-core processor or higher.
- Internet Access
  - Stable internet connection to install dependencies and interact with GCP services.
- Clone the repository: `git clone https://github.com/Shegzimus/DE_NASA_NeoW_Pipeline`
- Create a virtual environment on your local machine: `python3 -m venv venv`
- Activate the virtual environment: `source venv/bin/activate` (on Windows: `venv\Scripts\activate`)
- Install dependencies: `pip install -r airflow/requirements.txt`
- Create a directory to store your Google credentials: `cd airflow && mkdir -p .google`
  - Move the downloaded service account key JSON file into the `.google` directory.
  - Rename the file to `credentials.json` for consistency.
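The GCP client libraries locate this key through the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. Below is a minimal sketch for verifying the setup, assuming the key sits at `airflow/.google/credentials.json` relative to the repository root and that the `google-cloud-storage` and `google-cloud-bigquery` packages are installed; how the project itself wires this variable into Airflow or docker-compose may differ.

```python
import os

from google.cloud import bigquery, storage

# Point the client libraries at the key copied into airflow/.google/
# (path relative to the repository root; adjust if your layout differs).
os.environ.setdefault(
    "GOOGLE_APPLICATION_CREDENTIALS", "airflow/.google/credentials.json"
)

# Both clients read the key from the environment variable set above.
storage_client = storage.Client()
bigquery_client = bigquery.Client()

print("GCS buckets:", [bucket.name for bucket in storage_client.list_buckets()])
print("BigQuery project:", bigquery_client.project)
```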
- Install Docker, if you have not already done so (see Prerequisites above).
- Build the Docker image: `docker build -t nasa_neow_pipeline .`
- Start the Docker containers: `docker-compose up -d`
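  Before opening the UI, you can confirm the containers came up healthy with `docker-compose ps` and tail the logs with `docker-compose logs -f` if anything looks off (the exact service names depend on the project's `docker-compose.yml`).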
- Launch the Airflow web UI by opening http://localhost:8081 in your browser.