View the Interactive Dashboard Here!
- Overview
- Features
- Pipeline Architecture
- Example Dashboard
- Design Specification
- Data Sources
- Challenges Addressed
- Getting Started
- Future Enhancements
- Contribution
- License
- Contact
The Airline Flights Pipeline project is an end-to-end ETL (Extract, Transform, Load) solution designed to process and analyze US flight data efficiently. Built on Apache Spark and Apache Iceberg, the pipeline provides robust data processing, automated data quality checks, and insightful visualizations through an analytics dashboard.
- Data Processing: Processes, cleans, and enriches data using Apache Spark.
- Data Storage: Leverages Apache Iceberg on top of local object storage with MinIO for scalable and efficient data storage (see the configuration sketch after this list).
- Data Quality Checks: Ensures data accuracy, completeness, and consistency using PyDeequ.
- Write-Audit-Publish: Implements a WAP pattern to guarantee only high-quality data is ever exposed to downstream consumers.
- Unit Testing: Includes unit tests with Chispa for Spark DataFrames to ensure pipeline robustness.
- Reduced Table Size: Optimizes storage, cutting table sizes from ~600 MB (raw) to ~120 MB (Silver) and ~0.3 MB (Gold aggregated views).
- Business-Intelligence-as-Code: Uses Evidence.dev to create a static BI site, built with GitHub Actions and hosted for free on GitHub Pages.
- Containerization: Uses Docker Compose for easy deployment and environment consistency.
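To make the Iceberg-on-MinIO setup concrete, here is a minimal sketch of a Spark session wired to an Iceberg REST catalog backed by MinIO. The catalog name (`demo`), endpoints, and bucket are illustrative assumptions, not this project's actual settings (those live in the Docker Compose files and `.env`):

```python
from pyspark.sql import SparkSession

# Minimal sketch: the catalog name ("demo"), REST endpoint, and warehouse
# bucket are illustrative assumptions, not this project's actual settings.
spark = (
    SparkSession.builder.appName("airline-flights-pipeline")
    # Iceberg SQL extensions enable branch DDL and stored procedures.
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # Register an Iceberg catalog served by a REST catalog service.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "rest")
    .config("spark.sql.catalog.demo.uri", "http://rest:8181")
    # Store table data and metadata in a MinIO bucket via S3FileIO.
    .config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.demo.warehouse", "s3://warehouse/")
    .config("spark.sql.catalog.demo.s3.endpoint", "http://minio:9000")
    .getOrCreate()
)
```

This assumes the Iceberg Spark runtime and AWS bundle jars are already on the classpath, as they are in typical Spark-Iceberg container images.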
- Data Extraction: Ingests raw data from external sources.
- Data Transformation: Cleans and enriches the data using Apache Spark.
- Data Loading: Stores the processed data into Apache Iceberg tables (staging branches).
- Data Quality Checks: Validates staged data during the Write-Audit-Publish flow, publishing it only if all checks pass (see the sketch after this list).
- Visualization: Presents insights via an interactive dashboard.
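To make the Write-Audit-Publish step concrete, the sketch below stages a write on an Iceberg branch, audits it with PyDeequ, and fast-forwards `main` only if every check passes. Table, branch, and column names are hypothetical, and `spark` is the session from the sketch above; the project's actual checks live in the pipeline source:

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

TABLE = "demo.db.flights_silver"  # hypothetical table name

# `transformed_df` stands in for the cleaned DataFrame produced upstream.
transformed_df = spark.table("demo.db.flights_raw")  # hypothetical stand-in

# WRITE: route this session's writes to a staging branch instead of main.
spark.sql(f"ALTER TABLE {TABLE} SET TBLPROPERTIES ('write.wap.enabled'='true')")
spark.sql(f"ALTER TABLE {TABLE} CREATE BRANCH IF NOT EXISTS audit")
spark.conf.set("spark.wap.branch", "audit")
transformed_df.writeTo(TABLE).append()  # lands on the 'audit' branch

# AUDIT: run PyDeequ checks against the staged data (reads also resolve to
# the 'audit' branch while spark.wap.branch is set).
check = (
    Check(spark, CheckLevel.Error, "flights_silver checks")
    .isComplete("flight_date")  # illustrative column names
    .isNonNegative("distance")
)
result = VerificationSuite(spark).onData(spark.table(TABLE)).addCheck(check).run()
results_df = VerificationResult.checkResultsAsDataFrame(spark, result)

# PUBLISH: fast-forward main onto the audited branch only if all checks pass.
if results_df.filter("check_status != 'Success'").count() == 0:
    spark.sql("CALL demo.system.fast_forward('db.flights_silver', 'main', 'audit')")
else:
    raise RuntimeError("Quality checks failed; staged data was not published")
```

(PyDeequ additionally expects the Deequ jar on the classpath and a `SPARK_VERSION` environment variable.)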
A design specification ensures the pipeline is properly designed and can be validated by stakeholders prior to implementation. For more details, refer to the Design Specification.
This project uses publicly available US Domestic Flight Data from the US Department of Transportation (DOT). Data includes:
- Flight schedules
- Delays and cancellations
- Airport and airline metadata
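As a rough sketch of what ingesting this data with Spark might look like (the file path and column names are illustrative guesses, not the dataset's exact schema):

```python
# Illustrative only: the path and column names below are assumptions; check
# the dataset's documentation for the actual schema.
raw_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("/home/iceberg/data/flights.csv")
)

# Fields along these lines appear in DOT on-time performance data.
raw_df.select("FL_DATE", "AIRLINE", "ORIGIN", "DEST", "DEP_DELAY", "CANCELLED").show(5)
```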
- Handling large-scale data efficiently using distributed processing (Spark).
- Maintaining data integrity with quality checks.
- Optimizing storage and query performance with Iceberg (see the maintenance sketch after this list).
- Building user-friendly dashboards for non-technical stakeholders.
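As one example of the Iceberg-side tuning this involves, the sketch below applies monthly partitioning plus routine compaction and snapshot expiry. Table and column names are hypothetical, not taken from this project's code:

```python
# Partition by month so date-range queries can prune whole files.
spark.sql(
    "ALTER TABLE demo.db.flights_silver ADD PARTITION FIELD months(flight_date)"
)

# Compact the small files produced by incremental writes into larger ones.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'db.flights_silver',
        strategy => 'binpack'
    )
""")

# Expire old snapshots to keep metadata and object storage lean.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table => 'db.flights_silver',
        retain_last => 5
    )
""")
```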
Ensure the following tools and libraries are installed:
- Docker and Docker Compose
- Clone the repository:

  ```bash
  git clone https://github.com/phamlamn/airline-flights-pipeline.git
  cd airline-flights-pipeline
  ```

- Set up the environment:

  ```bash
  cp example.env .env
  ```

- Build and start the Docker containers:

  ```bash
  docker-compose up -d
  ```

- Run the ETL pipeline from within the Spark-Iceberg container:

  ```bash
  docker exec -it spark-iceberg-flights python /home/iceberg/src/jobs/etl_pipeline.py
  ```
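Once the run completes, one quick way to sanity-check the output is to query the tables from a PySpark shell inside the same container. The catalog, namespace, and table names below are illustrative; adjust them to whatever the pipeline actually creates:

```python
# Run inside the container, e.g. after: docker exec -it spark-iceberg-flights pyspark
spark.sql("SHOW TABLES IN demo.db").show()

# Compare row counts across the medallion layers and peek at a Gold aggregate.
for table in ["flights_silver", "agg_flights_gold"]:  # hypothetical names
    print(table, spark.table(f"demo.db.{table}").count())

spark.table("demo.db.agg_flights_gold").show(5)
```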
- Expand historical coverage by incorporating more years from Kaggle datasets.
- Integrate additional data sources, including weather data.
- Enable direct ingestion of raw data from US DOT BTS sources.
- Refactor the pipeline to adopt an incremental processing approach.
- Deploy the solution to a cloud environment for scalability.
- Add support for orchestration tools like Dagster or Airflow.
- Implement a CI/CD pipeline for automated deployment and testing.
- Introduce DataOps tools such as SQLMesh or dbt for better data management and governance.
For questions or feedback, please contact:
- Name: Lam Pham
- LinkedIn: LinkedIn Profile
Thank you for exploring the Airline Flights Pipeline project! Feel free to dive into the code and contribute to its development.