Shegzimus/DE_NASA_NeoW_Pipeline

Airflow-powered ETL pipeline for moving Near-Earth-Object data from NASA to Google Cloud


Table of Contents

  • Motivation and Objectives
  • Overview
  • Key Features
  • Architecture Overview
  • Architecture
  • New Personal Insights
  • Prerequisites
  • System Configuration


Overview

The DE_NASA_NeoW_Pipeline project is an end-to-end data engineering pipeline designed to fetch, process, and analyze Near-Earth Object (NEO) data provided by NASA's APIs. The pipeline leverages cloud infrastructure, containerization, and orchestration tools to ensure scalability, reliability, and ease of deployment.

Key Features

  • Data Collection: Fetches NEO data from NASA's API with efficient pagination and stores it for further processing (see the fetch sketch below).
  • Data Transformation: Processes raw data using Python and prepares it for analytical use.
  • Cloud Integration: Uses Google Cloud Platform (GCP) services like BigQuery and Cloud Storage for data storage and querying.
  • Orchestration: Employs Apache Airflow for task scheduling and pipeline management.
  • Containerization: Encapsulates the entire solution in Docker containers for consistent execution across environments.
  • Scalability: Ensures the pipeline can handle growing data volumes with cloud-native tools and distributed architecture.
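
As a rough illustration of the data-collection step, the sketch below pages through NASA's public NeoWs browse endpoint and stores the raw records locally. This is not the repository's ingestion code: the page limit, page size, and output file are assumptions made for the example.

    # Minimal sketch of paginated NEO collection (illustrative, not the repo's code).
    import json

    import requests

    API_KEY = "DEMO_KEY"  # replace with your own NASA API key
    BROWSE_URL = "https://api.nasa.gov/neo/rest/v1/neo/browse"

    def fetch_neo_pages(max_pages: int = 5, page_size: int = 20) -> list:
        """Fetch up to max_pages pages of NEO records and return them as one flat list."""
        records = []
        for page in range(max_pages):
            response = requests.get(
                BROWSE_URL,
                params={"page": page, "size": page_size, "api_key": API_KEY},
                timeout=30,
            )
            response.raise_for_status()
            payload = response.json()
            records.extend(payload.get("near_earth_objects", []))
            # Stop early once the API reports there are no further pages.
            if page + 1 >= payload.get("page", {}).get("total_pages", 0):
                break
        return records

    if __name__ == "__main__":
        neos = fetch_neo_pages()
        with open("neo_raw.json", "w") as f:
            json.dump(neos, f)
        print(f"Fetched {len(neos)} NEO records")

Requesting the data page by page keeps each API call small and makes a failed page easy to retry, which is what efficient pagination buys in practice.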

Architecture Overview

  1. Data Ingestion:
    • Extracts data from NASA's NEO API.
    • Processes and validates the data.
  2. Data Storage:
    • Stores raw and processed data in Google Cloud Storage (GCS).
    • Loads transformed data into BigQuery for analysis.
  3. Data Analysis:
    • Facilitates querying and reporting using BigQuery and tools like Looker Studio.
  4. Pipeline Management:
    • Automates workflows and monitoring via Apache Airflow.
  5. Deployment:
    • Dockerized solution for deployment on local machines or cloud environments.

This project demonstrates best practices in modern data engineering and serves as a template for building scalable ETL pipelines.
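
To make the orchestration concrete, here is a minimal sketch of what such an Airflow DAG could look like, using the Google provider's GCS-to-BigQuery transfer operator. The bucket, object path, dataset, table, and task names are placeholders rather than this repository's actual identifiers, and the real DAG may be structured differently.

    # Hedged sketch of the orchestration layer: extract from the NASA API, then load
    # staged files from GCS into BigQuery. All resource names below are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
        GCSToBigQueryOperator,
    )

    def extract_neo_data(**context):
        """Call the NASA NeoWs API and stage the raw JSON (see the fetch sketch above)."""
        ...

    with DAG(
        dag_id="nasa_neow_pipeline_sketch",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(
            task_id="extract_neo_data",
            python_callable=extract_neo_data,
        )

        load_to_bq = GCSToBigQueryOperator(
            task_id="load_neo_to_bigquery",
            bucket="your-neo-bucket",                  # placeholder bucket
            source_objects=["raw/neo_*.json"],         # placeholder object path
            destination_project_dataset_table="your_project.neo_dataset.neo_raw",
            source_format="NEWLINE_DELIMITED_JSON",
            write_disposition="WRITE_APPEND",
            autodetect=True,
        )

        extract >> load_to_bq

Airflow then takes care of scheduling, retries, and monitoring for these tasks, which is what makes it the natural pipeline-management layer here.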


Architecture

New Personal Insights

Prerequisites

  1. Google Cloud Platform (GCP) Account
    • Visit the GCP Console and create a new project.
    • Enable the required APIs for your project (e.g., BigQuery, Cloud Storage).
    • Create a service account with appropriate roles (e.g., BigQuery Admin, Storage Admin).
    • Download the service account key JSON file. (The short Python check after this list shows one way to confirm the key works.)

  2. Python
    • Install Python 3.7 or higher. Verify installation by running:
    python3 --version

    • Ensure pip is installed and updated:
    python3 -m pip install --upgrade pip

  3. Docker
    • Download and install Docker from Docker's official website.
    • After installation, verify Docker is installed correctly by running:
    docker --version
    • Verify that docker-compose is installed:
    docker-compose --version
    

  4. Git

    • Install Git to clone the repository. Confirm the installation:
    git --version
    

  5. System Resources

    • RAM: At least 8GB.
    • Disk Space: 10GB or more free space for Docker images and logs.
    • CPU: Dual-core processor or higher.

  6. Internet Access

    • Stable internet connection to install dependencies and interact with GCP services.
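
Before moving on, the service account key from step 1 can be sanity-checked with a short, optional Python snippet (not part of the repository). It assumes the google-cloud-storage and google-cloud-bigquery packages are installed and that the path below points at wherever you saved the key.

    # Optional sanity check: confirm the service account key can reach GCS and BigQuery.
    # Adjust the path to your downloaded key (a later step moves it to
    # airflow/.google/credentials.json).
    import os

    from google.cloud import bigquery, storage

    os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", "credentials.json")

    # Listing buckets and datasets fails fast if the key or its roles are misconfigured.
    storage_client = storage.Client()
    print("Buckets:", [b.name for b in storage_client.list_buckets()])

    bq_client = bigquery.Client()
    print("Datasets:", [d.dataset_id for d in bq_client.list_datasets()])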

System Configuration

  1. Clone the repository

    git clone https://github.com/Shegzimus/DE_NASA_NeoW_Pipeline
  2. Create a virtual environment on your local machine

    python3 -m venv venv
  3. Activate the virtual environment

    source venv/bin/activate        # Linux/macOS
    source venv/Scripts/activate    # Windows (Git Bash)
  4. Install dependencies

    pip install -r airflow/requirements.txt
  5. Set up your Google Cloud Platform (GCP) account

  6. Create a directory to store your Google credentials

    cd airflow && mkdir -p .google
    
    • Move the downloaded service account key JSON file into the .google directory.
    • Rename the file to "credentials.json" for consistency.
  7. Install Docker

  8. Build the Docker image

     docker build -t nasa_neow_pipeline .
    
  9. Start the Docker containers

    docker-compose up -d
  10. Launch the Airflow web UI

    open http://localhost:8081
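
Once the DAG has run end to end, the loaded data can be queried straight from BigQuery. The snippet below is illustrative only: the project, dataset, table, and column names are placeholders that depend on how your DAG writes the transformed data.

    # Illustrative query against the loaded NEO table; replace the placeholder
    # project, dataset, table, and column names with the ones your DAG produces.
    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
        SELECT name, is_potentially_hazardous_asteroid
        FROM `your_project.neo_dataset.neo_raw`
        LIMIT 10
    """

    for row in client.query(query).result():
        print(dict(row))

The same table can also be connected to Looker Studio for reporting, as noted in the architecture overview.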