This repository contains auxiliary material for the article: "A Taxonomy of Tools for Reproducible Machine Learning Experiments" by Luigi Quaranta, Fabio Calefato, and Filippo Lanubile.
The rest of this README classifies the full sample of analyzed tools according to the features of the taxonomy presented in the paper; for the reader's convenience, a figure representing the taxonomy is also shown below.
The tool categorization reported in this README as well as the figure representing the taxonomy are licensed under a Creative Commons Attribution 4.0 International License.
Please include the following citation if you intend to (re)use our work:
L. Quaranta, F. Calefato and F. Lanubile, “A Taxonomy of Tools for Reproducible Machine Learning Experiments,” Proceedings of the AIxIA 2021 Discussion Papers Workshop (AIxIA DP 2021), 2021, pp. 65-76, online: CEUR-WS.org/Vol-3078/paper-81.pdf.
The tool sample classified according to the features of the General category.
Tool | Interaction Mode | Workflow Coverage | Languages | License |
---|---|---|---|---|
DVC | CLI | All | Language agnostic | FLOSS (Apache 2.0) |
Guild AI | CLI, API | Data Preparation + Model Building | Python; built-in framework support: TensorFlow, PyTorch, Keras, Scikit-Learn | FLOSS (Apache 2.0) |
Pachyderm | CLI, API | All | Language agnostic | Community Ed.: FLOSS (Apache 2.0); Enterprise Ed.: Proprietary |
Comet.ml | API, CLI | Data Preparation + Model Building | Python, R, Java (beta); built-in framework support: TensorFlow, PyTorch, Keras, Scikit-Learn, SageMaker | Proprietary |
MLflow | API, CLI | All | Python, R, Java; built-in framework support: Apache Spark, TensorFlow, PyTorch, Keras, Scikit-Learn, H2O | FLOSS (Apache 2.0) |
Neptune | API, CLI | All | Language agnostic (CLI); Python and R (API); built-in framework support: TensorFlow, PyTorch, Keras, MLflow, SageMaker | Proprietary |
wandb | API, CLI | Data Preparation + Model Building | Python | Proprietary |
Valohai | CLI, API | All | Language agnostic | Proprietary |
Google Colab | Cloud IDE | Data Preparation + Model Building | Python | Proprietary |
FloydHub | Cloud IDE, API, CLI | All | Python; built-in framework support: TensorFlow, PyTorch, Keras, Scikit-Learn | Proprietary |
Domino | Cloud IDE, API, CLI | All | Python, R, Julia; built-in framework support: TensorFlow, PyTorch, H2O, Apache Spark, Hadoop | Proprietary |
Spell.run | Cloud IDE, CLI | All | Python; built-in framework support: TensorFlow, Keras, Weights & Biases | Proprietary |
Polynote | Web-based IDE | Data Preparation + Model Building | Scala, Python, SQL; built-in framework support: Apache Spark | FLOSS (Apache 2.0) |
DataRobot | AutoML Platform | All | Language agnostic (Python API) | Proprietary |
databricks | Cloud IDE, API, CLI | All | Python, R, Scala, SQL; built-in framework support: Apache Spark, MLflow, Delta Lake, TensorFlow | Proprietary |
Driverless AI | AutoML Platform | All | (Python recipes) | Proprietary |
RapidMiner | AutoML Platform | All | (Python and R for custom code) | Proprietary |
dstack.ai | API | Data Preparation | Python, R | Proprietary |
~~Dotscience~~ | | | Python (Cloud IDE, API) | |
The tool sample classified according to the features of the Analysis Support category.
Tool | Notebook support | Data Visualization | Web Dashboard | Collaboration mode | Computational Resources |
---|---|---|---|---|---|
DVC | No | No | No | Async (push/pull commands) | Local |
Guild AI | Yes (on-premise) | No | Yes (local) | Async (push/pull commands) | Local |
Pachyderm | Yes (on-premise) | No | Yes (local or remote) | Async (push/pull commands) | Local + On-premise + Remote (in-house*) |
Comet.ml | Yes (on-premise) | No | Yes (remote) | No | Local + On-premise* + Remote* (in-house) |
MLflow | Yes (on-premise) | No | Yes (local) | No | Local + On-premise |
Neptune | Yes (on-premise) | No | Yes (remote) | Async (comments) | On-premise* + Remote (in-house) |
wandb | Yes (on-premise) | No | Yes (remote) | No | On-premise* + Remote (in-house) |
Valohai | Yes (on-premise or hosted) | No | Yes (remote) | No | On-premise* + Remote (in-house) |
Google Colab | Yes (hosted) | No | No | Sync (co-editing) + Async (comments) | Local + Remote (in-house or third-party) |
FloydHub | Yes (hosted) | No | Yes (remote) | No | On-premise* + Remote (in-house) |
Domino | Yes (hosted) | No | Yes (remote) | Async (reviews) | Remote (in-house*) |
Spell.run | Yes (hosted) | No | Yes (remote) | No | On-premise* + Remote (in-house) |
Polynote | Yes (on-premise) | Yes | No | No | Local |
DataRobot | No | Yes | Yes (remote) | No | On-premise* + Remote* (in-house or third-party) |
databricks | Yes (hosted) | Yes | Yes (remote) | Sync (co-editing) + Async (comments) | Remote* (third-party) |
Driverless AI | No | Yes | Yes (remote) | No | Remote* (in-house or third-party) |
RapidMiner | Yes (hosted) | Yes | Yes (remote) | No | Local + Remote* (in-house or third-party) |
dstack.ai | Yes (on-premise) | No | Yes (remote) | Async (comments) | On-premise* + Remote (in-house) |
~~Dotscience~~ | (hosted) | | (remote) | (Fork&Pull for notebooks) | Remote (in-house or third-party*) |
The tool sample classified according to the features of the Reproducibility Support category.
Tool | Code Versioning | Data Access | Data Versioning | Experiment Logging | Reproducible Pipeline |
---|---|---|---|---|---|
DVC | Yes (external, git-based) | Local + Remote (third-party) | Yes | Yes (manual) | Yes (automatic) |
Guild AI | Yes (external, git-based) | Local + Remote (third-party) | Yes | Yes (hybrid) | Yes (configuration file) |
Pachyderm | Yes (integrated) | Local + Remote (third-party) | Yes | No | Yes |
Comet.ml | Yes (external, git-based) | Local + Remote (internal) | Yes | Yes (hybrid) | ? |
MLflow | Yes (external, git-based) | Local + Remote (third-party) | No | Yes (hybrid) | Yes (configuration file) |
Neptune | Yes (integrated or external, git-based) | Local + Remote (third-party) | No | Yes (hybrid) | No |
wandb | Yes (external, git-based) | Local + Remote (internal or third-party) | No | Yes (hybrid) | Local + Remote (third-party) |
Valohai | Yes (integrated or external, git-based) | Local + Remote (third-party*) | Yes | Yes (manual) | Yes (configuration file) |
Google Colab | Yes (file-sharing services - Google Drive) | Remote (internal or third-party) | Yes | No | No |
FloydHub | Yes (integrated or external, git-based) | Remote (internal or third-party) | Yes | Yes (manual) | Yes |
Domino | Yes (integrated) | Remote (internal or third-party) | Yes | No | Yes (automatic) |
Spell.run | Yes (external, git-based) | Remote (internal or third-party) | ? | Yes (hybrid) | Yes (script) |
Polynote | No | Local | No | No | No |
DataRobot | ? | Remote | ? | Yes (automatic) | Yes (built-in) |
databricks | Yes (integrated or external, git-based) | Remote (internal or third-party) | Yes | Yes (hybrid) | ? |
Driverless AI | Yes (integrated) | Remote (internal or third-party) | Yes | Yes (automatic) | Yes (built-in) |
RapidMiner | Yes (external, git-based) | Local + Remote (third-party) | ? | Yes (automatic) | Yes (visual or built-in) |
dstack.ai | No | Local + Remote (internal) | Yes | Yes (manual) | No |
~~Dotscience~~ | (integrated) | (internal or third-party) | | (manual) | (automatic) |
* = only available in paid plans
N.B.: Rows related to Dotscience are struck through because the service seems to be shutting down. We read this blog post a few days after our trial.
The `tools/` folder contains environment templates for the tools that require a local installation. To try the tools we used, where possible, a realistic case study inspired by the lessons of Kaggle's micro-courses "Intro to Machine Learning" and "Intermediate Machine Learning". The `kernels/` folder contains template notebooks implementing the case study, while the sample dataset is stored in the `input/` folder.
To try one of the reviewed tools, follow these steps:
- go to the tool's folder: `/tools/<tool_name>`;
- if a `.env_template` file exists, make a copy of it, name the copy `.env`, and edit `.env` to give a value to each of the listed variables;
- if a `README.md` file is present, follow the specific instructions there.
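
The `.env` step above can also be scripted. The following is a minimal sketch, not part of the repository: the helper name `prepare_env` is hypothetical, and it assumes that `.env_template` uses plain `NAME=value` lines with `#` comments, which may not hold for every tool. It copies the template to `.env` only if no `.env` exists yet, then reports the variables that still have no value.

```python
from pathlib import Path
import shutil
import sys


def prepare_env(tool_dir: Path) -> list:
    """Copy .env_template to .env if needed; return names of variables with no value.

    Hypothetical helper. Assumes NAME=value lines and '#' comments in the template.
    """
    template = tool_dir / ".env_template"
    env_file = tool_dir / ".env"
    if not template.exists():
        return []  # this tool has no environment template to fill in
    if not env_file.exists():
        shutil.copyfile(template, env_file)  # never overwrite an already edited .env
    missing = []
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, value = line.partition("=")
        if not value.strip():
            missing.append(name.strip())
    return missing


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python prepare_env.py <tool_name>")
    # Run from the repository root so that tools/<tool_name> resolves.
    for name in prepare_env(Path("tools") / sys.argv[1]):
        print(f"{name} still needs a value in .env")
```

Any variable the script reports must still be filled in by hand in `.env`; the tool's own `README.md`, when present, remains the reference for what each variable means.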