Airline Flights Pipeline

Overview

The Airline Flights Pipeline project is an end-to-end ETL (Extract, Transform, Load) solution designed to process and analyze US flight data efficiently. Utilizing cutting-edge tools like Apache Spark and Apache Iceberg, this pipeline ensures robust data processing, quality checks, and insightful visualizations through an analytics dashboard.

Features

Data Processing: Processes, cleans, and enriches data using Apache Spark.
Data Storage: Leverages Apache Iceberg on top of local object storage with MinIO for scalable and efficient data storage.
Data Quality Checks: Ensures data accuracy, completeness, and consistency using PyDeequ.
Write-Audit-Publish: Implements a WAP pattern to guarantee only high-quality data is ever exposed to downstream consumers.
Unit Testing: Includes unit tests with Chispa for Spark DataFrames to ensure pipeline robustness.
Reduced Table Size: Optimized storage, reducing data sizes from ~600MB (Raw) to 120MB (Silver) to 0.3MB (Gold) for aggregated view.
Business-Intelligence-as-Code: Uses Evidence.dev to create a static BI site, built with GitHub Actions and hosted for free on GitHub Pages.
Containerization: Uses Docker-Compose for easy deployment and environment consistency.

Pipeline Architecture

Key Steps in the Pipeline:

Data Extraction: Ingests raw data from external sources.
Data Transformation: Cleans and enriches the data using Apache Spark.
Data Loading: Stores the processed data into Apache Iceberg tables (staging branches).
Data Quality Checks: Validates the data during Write-Audit-Publish, prior to publishing staged data only if validations pass.
Visualization: Presents insights via an interactive dashboard.

Example Dashboard

Design Specification

Ensures proper design and can be validated by stakeholders prior to implementation. For more details, refer to the Design Specification.

Data Sources

This project uses publicly available US Domestic Flight Data from the US Department of Transportation (DOT). Data includes:

Flight schedules
Delays and cancellations
Airport and airline metadata

Challenges Addressed

Handling large-scale data efficiently using distributed processing (Spark).
Maintaining data integrity with quality checks.
Optimizing storage and query performance with Iceberg.
Building user-friendly dashboards for non-technical stakeholders.

Getting Started

Prerequisites

Ensure the following tools and libraries are installed:

Docker Compose

Installation

Clone the repository:

git clone https://github.com/phamlamn/airline-flights-pipeline.git
cd airline-flights-pipeline

Set up the environment:
```
cp example.env .env
```
Build and start the Docker containers:
```
docker-compose up -d
```

Running the Pipeline

Run the ETL pipeline from within the Spark-Iceberg container:

docker exec -it spark-iceberg-flights python /home/iceberg/src/jobs/etl_pipeline.py

Future Enhancements and Next Steps

Data Integration

Expand historical coverage by incorporating more years from Kaggle datasets.
Integrate additional data sources, including weather data.

Pipeline Optimization

Enable direct ingestion of raw data from US DOT BTS sources.
Refactor the pipeline to adopt an incremental processing approach.

Scalability and Orchestration

Deploy the solution to a cloud environment for scalability.
Add support for orchestration tools like Dagster or Airflow.

Automation and DataOps

Implement a CI/CD pipeline for automated deployment and testing.
Introduce DataOps tools such as SQL Mesh or DBT for better data management and governance.

Contact

For questions or feedback, please contact:

Name: Lam Pham
LinkedIn: LinkedIn Profile

Thank you for exploring the Airline Flights Pipeline project! Feel free to dive into the code and contribute to its development.

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
.github/workflows		.github/workflows
docs		docs
images		images
notebooks		notebooks
spark		spark
src		src
visualizations		visualizations
.gitignore		.gitignore
README.md		README.md
docker-compose.yaml		docker-compose.yaml
example.env		example.env

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Airline Flights Pipeline

Table of Contents

Overview

Features

Pipeline Architecture

Key Steps in the Pipeline:

Example Dashboard

Design Specification

Data Sources

Challenges Addressed

Getting Started

Prerequisites

Installation

Running the Pipeline

Future Enhancements and Next Steps

Data Integration

Pipeline Optimization

Scalability and Orchestration

Automation and DataOps

Contact

About

Packages

Languages

phamlamn/airline-flights-pipeline

Folders and files

Latest commit

History

Repository files navigation

Airline Flights Pipeline

Table of Contents

Overview

Features

Pipeline Architecture

Key Steps in the Pipeline:

Example Dashboard

Design Specification

Data Sources

Challenges Addressed

Getting Started

Prerequisites

Installation

Running the Pipeline

Future Enhancements and Next Steps

Data Integration

Pipeline Optimization

Scalability and Orchestration

Automation and DataOps

Contact

About

Topics

Resources

Stars

Watchers

Forks

Packages 0

Languages

Packages