This project implements an automated, modular ETL pipeline for analyzing cryptocurrency exchange metrics. It extracts data from various sources, transforms it, and loads it into a database for visualization and analysis. The end product is a dashboard that provides near-real-time insights into exchange performance, trading volumes, and volume discrepancies across exchanges. The pipeline is built using modern data engineering practices and tools.
Link to the dashboard
The architecture was chosen to ensure scalability, maintainability, and performance. Docker is used to containerize the application, making it easy to deploy and manage dependencies. The data pipeline is built with modularity in mind, separating extraction, transformation, and loading processes. Data is stored in the Motherduck data warehouse, and logging is implemented for monitoring and debugging.
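The extract/transform/load separation described above can be sketched as three independent functions. This is a minimal, self-contained illustration, not the project's actual API: the function names, the sample rows, and the in-memory "warehouse" list (standing in for MotherDuck) are all assumptions.

```python
# Illustrative sketch of a modular ETL pipeline; names and data are hypothetical.

def extract() -> list[dict]:
    """Pull raw exchange metrics; a real run would call exchange APIs."""
    return [
        {"exchange": "binance", "reported_volume": 120.0, "observed_volume": 100.0},
        {"exchange": "kraken", "reported_volume": 80.0, "observed_volume": 79.0},
    ]

def transform(rows: list[dict]) -> list[dict]:
    """Derive a volume-discrepancy metric for each exchange."""
    return [
        {**row, "discrepancy": row["reported_volume"] - row["observed_volume"]}
        for row in rows
    ]

def load(rows: list[dict], warehouse: list[dict]) -> None:
    """Append transformed rows; a real run would write to the data warehouse."""
    warehouse.extend(rows)

warehouse: list[dict] = []
load(transform(extract()), warehouse)
```

Because each stage only depends on its inputs, stages can be tested, swapped, or re-run independently, which is the maintainability benefit the architecture aims for.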
Before running this project, ensure you have the following prerequisites installed:
- Docker
- Docker Compose
- Python 3.10 or above
- DuckDB
- Make (for using the Makefile)
- Git
Follow these step-by-step instructions to get the project up and running:
Clone the repository and navigate to the project directory:

```shell
git clone https://github.com/ukokobili/data_aggregator.git
cd data_aggregator
```
Set up the environment file and enter the required credentials:

```shell
cp env
```
Build and start the Docker containers:

```shell
make docker
```
Run the data pipeline:

```shell
python scripts/data_pipeline.py
```
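Since the pipeline logs to the `logs/` directory for monitoring and debugging, a run typically starts by configuring a logger. The snippet below is an illustrative stdlib `logging` setup, not the project's actual configuration (the logger name and format string are assumptions):

```python
import logging

# Hypothetical logger setup for the pipeline entry point;
# the project's real configuration lives under logs/.
logger = logging.getLogger("data_pipeline")
logger.setLevel(logging.INFO)

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger.addHandler(handler)

logger.info("pipeline run started")
```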
To run tests:

```shell
make ci
```
To stop and remove the containers:

```shell
make down
```
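The `make` commands above map to targets in the repository's Makefile. The recipes below are a plausible sketch of what those targets might run; the actual Makefile may differ:

```makefile
# Illustrative Makefile; target names match the commands in this README,
# but the recipe bodies are assumptions.
docker:  ## build and start the containers
	docker compose up --build -d

ci:      ## run the test suite
	pytest test/

down:    ## stop and remove the containers
	docker compose down --remove-orphans
```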
Project structure:

- `containers/`: Dockerfile and requirements for containerization.
- `logs/`: Logging configuration and log files.
- `media/`: Images for documentation.
- `scripts/`: Main Python scripts, including the data pipeline and ETL processes.
- `test/`: Unit and integration tests.
- `Makefile`: Commands for common operations.
- `docker-compose.yml`: Defines and configures Docker services.
- `env`: Environment configuration file for storing sensitive information and settings.
- Implementing a more robust error-handling system
- Exploring cloud-based solutions for improved scalability
- Incorporating real-time data streaming for more up-to-date analytics
For any questions or feedback, don't hesitate to get in touch.