Refer to the Airflow Docker Compose docs for more information.
The steps below create all the necessary folders, initialize the Airflow containers, and create the Postgres database with the project schema.
Copy .env.sample to .env and put any Docker Compose overrides you need in the new file.
cp .env.sample .env
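If .env.sample gains new variables after you have created your .env, the two files can drift apart. The sketch below is one way to spot that. The function name `missing_env_keys` is hypothetical and not part of the project, and it assumes both files use plain `KEY=value` lines (lines starting with `export` are not matched).

```shell
# Hypothetical helper (not part of the project): print every key that is
# defined in the sample file but missing from the target file.
# Assumes plain KEY=value lines; comments and blank lines are skipped.
missing_env_keys() {
  sample="${1:-.env.sample}"
  target="${2:-.env}"
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$sample" | cut -d= -f1 | while read -r key; do
    # Report the key only if the target has no line starting with KEY=
    grep -q -E "^${key}=" "$target" || echo "$key"
  done
}
```

Run `missing_env_keys` from the project root. No output means every key from .env.sample is also present in .env.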
Create airflow folders
mkdir -p ./airflow/dags ./airflow/logs ./airflow/plugins ./airflow/config ./airflow/data ./airflow/data/sql ./airflow/data/kaggle_data
Create airflow base image with project dependencies and connections
docker build -t data-eng-project-airflow-base .
Initialize airflow
docker compose up airflow-init
Run liquibase update. Execute from project root directory
Install the Python 3.11 dependencies for development (a virtual environment is recommended)
pip install -r requirements.txt
- Optional but recommended: use virtualenv (macOS)
brew install virtualenv
virtualenv -p python3.11 data_eng_project
source data_eng_project/bin/activate
python --version  # Check that the virtual env is using 3.11.x
pip install -r requirements.txt
# When done working in the virtual env, exit it:
deactivate
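The version check above can also be scripted so that a wrong interpreter is caught before the dependencies are installed. This is a minimal sketch. The function name `is_python_311` is hypothetical, and it assumes `python --version` prints a string of the form `Python 3.11.x`.

```shell
# Hypothetical helper (not part of the project): succeed only if the given
# version string, as printed by `python --version`, is a 3.11.x release.
is_python_311() {
  case "$1" in
    "Python 3.11."*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Inside the activated virtual env it could be used as `is_python_311 "$(python --version 2>&1)" || echo "Expected Python 3.11.x"`.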
Run docker compose
docker compose up -d
Airflow web UI will be available at http://localhost:8080 with default credentials username: airflow, password: airflow. Postgres connection will be available on host at localhost:5444/dwh_pg with default credentials username: dwh_user, password: dwh_user.
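To confirm that both services are reachable from the host, something like the following sketch can help. It uses the default host, port, database, and credentials stated above. The variable and function names are illustrative and not defined by the project, `psql` and `curl` are assumed to be installed on the host, and the `/health` path assumes an Airflow 2.x webserver.

```shell
# Defaults taken from the text above; override them in the environment if
# your .env changes them. These variable names are illustrative only.
PG_HOST="${PG_HOST:-localhost}"
PG_PORT="${PG_PORT:-5444}"
PG_DB="${PG_DB:-dwh_pg}"
PG_USER="${PG_USER:-dwh_user}"
PG_PASSWORD="${PG_PASSWORD:-dwh_user}"

# Build a Postgres connection URL from the values above.
pg_url() {
  echo "postgresql://${PG_USER}:${PG_PASSWORD}@${PG_HOST}:${PG_PORT}/${PG_DB}"
}

# Run a trivial query against the database; needs psql on the host.
check_postgres() {
  psql "$(pg_url)" -t -A -c 'SELECT 1'
}

# Query the webserver health endpoint (assumes Airflow 2.x); needs curl.
check_airflow() {
  curl -fsS "http://localhost:8080/health"
}
```

After `docker compose up -d` has finished starting the containers, call `check_postgres` and `check_airflow`. A non-zero exit status from either one suggests the service is not up yet or the defaults were overridden.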
To stop everything
docker compose stop
To clean project schema
To clean up everything
docker compose down --volumes --remove-orphans
Current schema
- id
- versionId (FK version)
- summaryId (FK summary)
- submitterId (FK author)
- paperId
- date
- title
- id
- number
- creationDate
- id
- name
- affiliation
- authorId (FK author)
- submissionId (FK submission)
- id
- title
- year
- publisher
- citationId (FK citation)
- submissionId (FK submission)
- authorId (FK author)
- citationId (FK citation)
- id
- pages
- figures
- category
- abstract
- id
- title
- name
- date
- date