docs: added dvc dataset control
MBenediktF committed Sep 2, 2024
1 parent c69f61f commit 24b7227
Showing 1 changed file with 14 additions and 8 deletions: readme.md
- Run `make start_mlflow_ui` to start the container again
- Run `make remove_mlflow_ui` to delete the Docker container and the image (a sketch of these targets follows below)
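
For orientation, a minimal sketch of what these two targets could look like; the container name, image, and port are illustrative assumptions, not taken from the repository:

```makefile
# Hypothetical sketch -- the real definitions live in this repo's Makefile.
# Container name, image, and port are illustrative assumptions.
start_mlflow_ui:
	docker start mlflow_ui || \
	docker run -d --name mlflow_ui -p 5000:5000 \
		ghcr.io/mlflow/mlflow mlflow ui --host 0.0.0.0

remove_mlflow_ui:
	docker rm -f mlflow_ui
	docker rmi ghcr.io/mlflow/mlflow
```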

## How to manage datasets

- Datasets are stored in the folder `datasets`. Version control is handled by DVC: the small `.dvc` metadata files are synced to GitHub, while the data itself is stored at a separate remote (default: S3)
- To download the dataset files, run `make dvc_pull_s3`
- To upload a new dataset, track it with `dvc add`, commit the resulting `.dvc` file to git, and run `make dvc_push_s3` (a sketch of these targets follows below)
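
A minimal sketch of what the two DVC targets might look like, assuming the remote is registered under the name `s3` (the remote name is an assumption):

```makefile
# Hypothetical sketch -- the actual targets live in this repo's Makefile.
# Assumes a DVC remote registered under the name "s3".
dvc_pull_s3:
	dvc pull --remote s3

dvc_push_s3:
	dvc push --remote s3
```

A new dataset would then be tracked with `dvc add datasets/<name>`, the generated `.dvc` file committed to git, and the data uploaded via `make dvc_push_s3`.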

## How to deploy a model

- Advanced model training code with multiple parameters
- Deploy to an automated testing environment and run tests there
- Deploy to target system
- Export the model
- Run tests with the compiled model and the implementation code

## How to manage raw datasets with the AWS CLI (legacy)

- Datasets are stored in the folder `datasets`, which is not synced with GitHub but stored in S3 instead
- To download all existing datasets, use `make download_datasets`
- To download a specific dataset, use `make download_dataset NAME=<dataset_name>`
- To upload a new dataset to S3, add it to the `datasets` folder and use `make upload_datasets` (see the sketch after this list)
- This feature is only for storing raw data. Processed datasets are stored as artifacts and can be accessed using the MLflow UI
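
A minimal sketch of what these legacy targets might look like, assuming a configured AWS CLI; the bucket name is a placeholder, not taken from the repository:

```makefile
# Hypothetical sketch of the legacy AWS CLI targets -- bucket name
# and paths are illustrative assumptions.
download_datasets:
	aws s3 sync s3://<bucket>/datasets datasets

download_dataset:
	aws s3 sync s3://<bucket>/datasets/$(NAME) datasets/$(NAME)

upload_datasets:
	aws s3 sync datasets s3://<bucket>/datasets
```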
