docs: 960 docs add a glossary concept section (#970)
* feat: add _STEP_CATEGORY_TO_DESCRIPTION

* docs: emphasize structured generation

* docs: add pipeline visualizations

* docs: add concepts page outline

* Apply suggestions from code review

Co-authored-by: Natalia Elvira <126158523+nataliaElv@users.noreply.github.com>

* docs: processed comments Natalia

* chore: add tabulate dependency

* feat: add task overview

* docs: remove task overview from API definition

* docs: update naming tutorials

* docs: update naming

---------

Co-authored-by: Natalia Elvira <126158523+nataliaElv@users.noreply.github.com>
davidberenstein1957 and nataliaElv authored Sep 16, 2024
1 parent 75e34e1 commit b2d8eb5
Showing 31 changed files with 2,970 additions and 55 deletions.
Binary file added docs/assets/pipelines/arena-hard.png
Binary file added docs/assets/pipelines/clean-dataset.png
Binary file added docs/assets/pipelines/deepseek.png
Binary file added docs/assets/pipelines/deita.png
Binary file added docs/assets/pipelines/knowledge_graphs.png
Binary file added docs/assets/pipelines/prometheus.png
Binary file added docs/assets/pipelines/sentence-transformer.png
Binary file added docs/assets/pipelines/ultrafeedback.png
38 changes: 9 additions & 29 deletions docs/sections/getting_started/quickstart.md
@@ -10,7 +10,15 @@ hide:

# Quickstart

To start off, `distilabel` is a framework for building pipelines for generating synthetic data using LLMs, that defines a [`Pipeline`][distilabel.pipeline.Pipeline] which orchestrates the execution of the [`Step`][distilabel.steps.base.Step] subclasses, and those will be connected as nodes in a Direct Acyclic Graph (DAG).
Distilabel provides all the tools you need to build scalable and reliable pipelines for synthetic data generation and AI feedback. Pipelines are used to generate data, evaluate models, manipulate data, or perform any other general task. They are made up of different components: Steps, Tasks and LLMs, which are chained together in a directed acyclic graph (DAG).

- **Steps**: These are the building blocks of your pipeline. Normal steps are used for basic operations like loading data, applying transformations, or any other general processing.
- **Tasks**: These are steps that rely on LLMs and prompts to perform generative tasks. For example, they can be used to generate data, evaluate models or manipulate data.
- **LLMs**: These are the models that will perform the task. They can be local or remote models, and open-source or commercial models.

Pipelines are designed to be scalable and reliable. They can be executed in a distributed manner, and they can be cached and recovered. This is useful when dealing with large datasets or when you want to ensure that your pipeline is reproducible.

Besides that, pipelines are designed to be modular and flexible. You can easily add new steps, tasks, or LLMs to your pipeline, and you can also easily modify or remove them. An example architecture of a pipeline to generate a dataset of preferences is the following:
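
In code, such a preference pipeline might be sketched roughly as follows (the model ids, step names, and column mappings below are illustrative placeholders, not the exact quickstart example):

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromHub
from distilabel.steps.tasks import TextGeneration, UltraFeedback

with Pipeline(name="preference-dataset-sketch") as pipeline:
    # Step: load prompts from the Hub (the repo_id can also be provided at run time).
    load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})

    # Tasks: generate two candidate responses per prompt with two different LLMs.
    generate = [
        TextGeneration(llm=InferenceEndpointsLLM(model_id=model_id))
        for model_id in (
            "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "mistralai/Mistral-7B-Instruct-v0.3",
        )
    ]

    # Step: group the candidate generations into a single row per prompt.
    group = GroupColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    # Task: judge the grouped generations with an LLM to obtain preference ratings.
    judge = UltraFeedback(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-70B-Instruct")
    )

    load_dataset >> generate >> group >> judge
```

The `>>` operator is what wires the Steps, Tasks and LLM-backed tasks together as nodes of the DAG.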

## Installation

@@ -84,31 +92,3 @@
7. We run the pipeline with the parameters for the `load_dataset` and `text_generation` steps. The `load_dataset` step will use the repository `distilabel-internal-testing/instruction-dataset-mini` and the `test` split, and the `text_generation` task will use the `generation_kwargs` with the `temperature` set to `0.7` and the `max_new_tokens` set to `512` (see the sketch after this list).

8. Optionally, we can push the generated [`Distiset`][distilabel.distiset.Distiset] to the Hugging Face Hub repository `distilabel-example`. This will allow you to share the generated dataset with others and use it in other pipelines.
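
Putting 7 and 8 together, the tail of such a script would look roughly like this (assuming the steps were named `load_dataset` and `text_generation`, and that the generation kwargs nest under the task's `llm` runtime parameters):

```python
if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            "load_dataset": {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            "text_generation": {
                "llm": {
                    "generation_kwargs": {"temperature": 0.7, "max_new_tokens": 512}
                }
            },
        }
    )
    # Optionally share the generated Distiset on the Hugging Face Hub (step 8).
    distiset.push_to_hub("distilabel-example")
```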

## Minimal example

`distilabel` gives you a lot of flexibility to create your pipelines, but to start right away you can omit most of the details and rely on the default values:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration
from datasets import load_dataset


dataset = load_dataset("distilabel-internal-testing/instruction-dataset-mini", split="test")

with Pipeline() as pipeline:  # (1)
    TextGeneration(llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"))  # (2)


if __name__ == "__main__":
    distiset = pipeline.run(dataset=dataset)  # (3)
    distiset.push_to_hub(repo_id="distilabel-example")
```

1. The [`Pipeline`][distilabel.pipeline.Pipeline] can be created without arguments and will generate a default name on its own, which is tracked internally.

2. Just as with the [`Pipeline`][distilabel.pipeline.Pipeline], the [`Step`][distilabel.steps.base.Step]s don't explicitly need a name.

3. You can load the dataset as you would normally do with Hugging Face and pass it to the `run` method.
2 changes: 1 addition & 1 deletion docs/sections/how_to_guides/basic/llm/index.md
@@ -1,4 +1,4 @@
# Define LLMs as local or remote models
# Executing Tasks with LLMs

## Working with LLMs

2 changes: 1 addition & 1 deletion docs/sections/how_to_guides/basic/step/index.md
@@ -1,4 +1,4 @@
# Define Steps for your Pipeline
# Steps for processing data

## Working with Steps

2 changes: 1 addition & 1 deletion docs/sections/how_to_guides/basic/task/generator_task.md
@@ -1,4 +1,4 @@
# GeneratorTask
# GeneratorTask that produces output

## Working with GeneratorTasks

2 changes: 1 addition & 1 deletion docs/sections/how_to_guides/basic/task/index.md
@@ -1,4 +1,4 @@
# Define Tasks that rely on LLMs
# Tasks for generating and judging with LLMs

## Working with Tasks

@@ -1,12 +1,14 @@
---
hide: toc
---
# Benchmarking with `distilabel`: Arena Hard
# Benchmarking with `distilabel`

Benchmark LLMs with `distilabel`: reproducing the Arena Hard benchmark.

The script below first defines the `ArenaHard` and `ArenaHardResults` tasks: the former generates responses for a given collection of prompts/questions with up to two LLMs, and the latter calculates the results as per the original implementation. The second part of the example then builds a `Pipeline` to run the generation on top of the prompts with `InferenceEndpointsLLM`, while streaming the rest of the generations from a pre-computed set of GPT-4 generations, and evaluates one against the other with `OpenAILLM`, producing an alternate response, a comparison between the responses, and a result as A>>B, A>B, B>A, B>>A, or tie.

![Arena Hard](../../../assets/pipelines/arena-hard.png)

To run this example you will first need to install the Arena Hard optional dependencies, namely `pandas`, `scikit-learn`, and `numpy`.

??? Run
@@ -1,12 +1,14 @@
---
hide: toc
---
# llama.cpp with `outlines`
# Structured generation with `outlines`

Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`.

This script makes use of [`LlamaCppLLM`][distilabel.llms.llamacpp.LlamaCppLLM] and the structured output capabilities thanks to [`outlines`](https://outlines-dev.github.io/outlines/welcome/) to generate RPG characters that adhere to a JSON schema.

![Pipeline overview](../../../assets/pipelines/knowledge_graphs.png)

It makes use of a local model which can be downloaded using curl (explained in the script itself), and can be exchanged with other `LLMs` like [`vLLM`][distilabel.llms.vllm.vLLM].
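
At its core, the script boils down to declaring a `pydantic.BaseModel` and passing it through the `structured_output` argument of the LLM. A condensed sketch (the character fields and the local GGUF path are illustrative; the full script defines a richer schema):

```python
from pathlib import Path

from pydantic import BaseModel

from distilabel.llms import LlamaCppLLM


class Character(BaseModel):
    name: str
    role: str
    weapon: str
    strength: int


# `structured_output` asks outlines to constrain decoding to the JSON schema of
# `Character`; the model path assumes a GGUF file has already been downloaded locally.
llm = LlamaCppLLM(
    model_path=str(Path.home() / "Phi-3-mini-4k-instruct-q4.gguf"),
    n_gpu_layers=-1,
    n_ctx=1024,
    structured_output={"format": "json", "schema": Character},
)
```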

??? Run
@@ -1,12 +1,14 @@
---
hide: toc
---
# MistralAI with `instructor`
# Structured generation with `instructor`

Answer instructions with knowledge graphs defined as `pydantic.BaseModel` objects using `instructor` in `distilabel`.

This script makes use of [`MistralLLM`][distilabel.llms.mistral.MistralLLM] and the structured output capabilities thanks to [`instructor`](https://python.useinstructor.com/) to generate knowledge graphs from complex topics.

![Knowledge graph figure](../../../assets/pipelines/knowledge_graphs.png)

This example is translated from this [awesome example](https://python.useinstructor.com/examples/knowledge_graph/) from the `instructor` cookbook.
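
In essence, the knowledge graph is declared as nested `pydantic.BaseModel` classes and requested through the `structured_output` argument of the LLM. A minimal sketch, closely following the `instructor` cookbook (field names and the model name are illustrative):

```python
from typing import List

from pydantic import BaseModel, Field

from distilabel.llms import MistralLLM


class Node(BaseModel):
    id: int
    label: str
    color: str


class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"


class KnowledgeGraph(BaseModel):
    nodes: List[Node] = Field(default_factory=list)
    edges: List[Edge] = Field(default_factory=list)


# `instructor` patches the Mistral client so that the completion is parsed and
# validated directly into a `KnowledgeGraph` instance.
llm = MistralLLM(
    model="open-mixtral-8x22b",
    structured_output={"schema": KnowledgeGraph},
)
```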

??? Run
10 changes: 5 additions & 5 deletions docs/sections/pipeline_samples/index.md
@@ -1,13 +1,13 @@
---
hide: toc
---
# Pipeline Samples
# Tutorials

- **Tutorials** provide detailed step-by-step explanations and the code used for end-to-end workflows.
- **End-to-end tutorials** provide detailed step-by-step explanations and the code used for end-to-end workflows.
- **Paper implementations** provide reproductions of fundamental papers in the synthetic data domain.
- **Examples** don't provide explanations but simply show code for different tasks.

## Tutorials
## End-to-end tutorials

<div class="grid cards" markdown>

@@ -97,15 +97,15 @@ hide: toc

[:octicons-arrow-right-24: Example](examples/benchmarking_with_distilabel.md)

- __llama.cpp with outlines__
- __Structured generation with outlines__

---

Learn about generating RPG characters following a pydantic.BaseModel with outlines in distilabel.

[:octicons-arrow-right-24: Example](examples/llama_cpp_with_outlines.md)

- __MistralAI with instructor__
- __Structured generation with instructor__

---

4 changes: 3 additions & 1 deletion docs/sections/pipeline_samples/papers/deepseek_prover.md
@@ -8,6 +8,8 @@ The authors propose a method for generating [Lean 4](https://github.com/leanprov

Here we show how to deal with steps 1 and 2, but the authors also ensure the theorems are checked using the [lean4](https://github.com/leanprover/lean4) program on the generated proofs, iterating for a series of steps: fine-tuning a model on the synthetic data (DeepSeek Prover 7B), regenerating the dataset, and continuing the process until no further improvement is found.

![DeepSeek Prover pipeline overview](../../../assets/pipelines/deepseek.png)

### Replication

!!! Note
@@ -32,7 +34,7 @@ There are three components we needed to define for this pipeline, for the differ
!!! Note

We will use the same `LLM` for all the tasks, so we will define it once and reuse it for the different tasks:

```python
llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3-70B-Instruct",
2 changes: 2 additions & 0 deletions docs/sections/pipeline_samples/papers/deita.md
@@ -10,6 +10,8 @@ The strategy utilizes **LLMs to replace human effort in time-intensive data qual

Once again we start from a dataset of instructions/responses, and we roughly reproduce the second step: learning how to optimize the responses according to an instruction by comparing several possibilities.

![DEITA pipeline overview](../../../assets/pipelines/deita.png)

### Datasets and budget

We will dive deeper into the whole process. We will investigate each stage to efficiently select the final dataset used for supervised fine-tuning with a budget constraint. We will tackle technical challenges by explaining exactly how you would assess good data as presented in the paper.
@@ -2,6 +2,8 @@

["Self Alignment with Instruction Backtranslation"](https://arxiv.org/abs/2308.06259) presents a scalable method to build high-quality instruction following a language model by automatically labeling human-written text with corresponding instructions. Their approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high-quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model.

![Instruction Backtranslation pipeline overview](../../../assets/pipelines/instruction_backtranslation.png)

Their self-training approach assumes access to a base language model, a small amount of seed data, and a collection of unlabelled examples, e.g. a web corpus. The unlabelled data is a large, diverse set of human-written documents that includes writing about all manner of topics humans are interested in – but crucially is not paired with instructions.

A first key assumption is that there exists some subset of this very large human-written text that would be suitable as gold generations for some user instructions. A second key assumption is that they can predict instructions for these candidate gold answers that can be used as high-quality example pairs to train an instruction-following model.
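
The self-curation scoring step maps naturally onto the `InstructionBacktranslation` task in `distilabel`, which asks an LLM to rate an (instruction, generation) pair. A minimal sketch (the model id and the step wiring are illustrative, not the full replication):

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import InstructionBacktranslation

with Pipeline(name="self-curation-sketch") as pipeline:
    # Rows are expected to contain an `instruction` (the self-augmented prompt)
    # and a `generation` (the human-written document used as candidate answer).
    load_pairs = LoadDataFromHub(name="load_pairs")

    self_curation = InstructionBacktranslation(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3-70B-Instruct"),
    )

    load_pairs >> self_curation  # adds a score and the reasoning behind it
```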
2 changes: 2 additions & 0 deletions docs/sections/pipeline_samples/papers/prometheus.md
@@ -2,6 +2,8 @@

["Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models"](https://arxiv.org/pdf/2405.01535) presents Prometheus 2, a new and more powerful evaluator LLM compared to Prometheus (its predecessor) presented in ["Prometheus: Inducing Fine-grained Evaluation Capability in Language Models"](https://arxiv.org/abs/2310.08491); since GPT-4, as well as other proprietary LLMs, are commonly used to assess the quality of the responses for various LLMs, but there are concerns about transparency, controllability, and affordability, that motivate the need of open-source LLMs specialized in evaluations.

![Prometheus 2 pipeline overview](../../../assets/pipelines/prometheus.png)

Existing open evaluator LMs exhibit critical shortcomings:

1. They issue scores that significantly diverge from those assigned by humans.
2 changes: 2 additions & 0 deletions docs/sections/pipeline_samples/papers/ultrafeedback.md
@@ -4,6 +4,8 @@

UltraFeedback collects about 64k prompts from diverse resources (including UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN). These prompts are then used to query multiple LLMs (commercial models, Llama models ranging from 7B to 70B, and non-Llama models), generating four different responses for each prompt and resulting in a total of 256k samples, i.e. UltraFeedback will rate four responses on every OpenAI request.

![UltraFeedback pipeline overview](../../../assets/pipelines/ultrafeedback.png)

To collect high-quality preference and textual feedback, they design a fine-grained annotation instruction, which contains four different aspects, namely instruction-following, truthfulness, honesty and helpfulness (even though within the paper they also mention a fifth one named verbalized calibration). Finally, GPT-4 is used to generate the ratings for the generated responses to the given prompt using the previously mentioned aspects.
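
In `distilabel`, these aspects correspond to the `aspect` argument of the `UltraFeedback` task, so the paper's setup can be approximated with one judge per aspect. A minimal sketch (the GPT-4 model name is a placeholder, and using a single `overall-rating` judge instead is an equally valid design choice):

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import UltraFeedback

with Pipeline(name="ultrafeedback-aspects-sketch") as pipeline:
    # One UltraFeedback judge per annotation aspect used in the paper;
    # in the full replication these connect to a loader and generation tasks.
    judges = [
        UltraFeedback(
            name=f"ultrafeedback-{aspect}",
            aspect=aspect,
            llm=OpenAILLM(model="gpt-4"),
        )
        for aspect in (
            "instruction-following",
            "truthfulness",
            "honesty",
            "helpfulness",
        )
    ]
```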

## Replication
@@ -10,6 +10,8 @@
"- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub), [sentence-transformers](https://github.com/UKPLab/sentence-transformers)\n",
"- **Components**: [LoadDataFromHub](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromhub/), [GenerateSentencePair](https://distilabel.argilla.io/latest/components-gallery/tasks/generatesentencepair/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/)\n",
"\n",
"![GenerateSentencePair pipeline overview](../../../assets/pipelines/sentence-transformer.png)\n",
"\n",
"!!! note\n",
" For a comprehensive overview on optimizing the retrieval performance in a RAG pipeline, check this [guide](https://docs.zenml.io/user-guide/llmops-guide/finetuning-embeddings) in collaboration with [ZenML](https://github.com/zenml-io/zenml), an open-source MLOps framework designed for building portable and production-ready machine learning pipelines."
]
@@ -13,7 +13,9 @@
"source": [
"- **Goal**: Clean an existing preference dataset by providing AI feedback on the quality of the data.\n",
"- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub)\n",
"- **Components**: [LoadDataFromDicts](https://distilabel.argilla.io/dev/components-gallery/steps/loaddatafromdicts/), [UltraFeedback](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/), [KeepColumns](https://distilabel.argilla.io/latest/components-gallery/steps/groupcolumns/), [PreferenceToArgilla](https://distilabel.argilla.io/latest/components-gallery/steps/textgenerationtoargilla/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/), [GlobalStep](../../how_to_guides/basic/step/global_step.md)"
"- **Components**: [LoadDataFromDicts](https://distilabel.argilla.io/dev/components-gallery/steps/loaddatafromdicts/), [UltraFeedback](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/), [KeepColumns](https://distilabel.argilla.io/latest/components-gallery/steps/groupcolumns/), [PreferenceToArgilla](https://distilabel.argilla.io/latest/components-gallery/steps/textgenerationtoargilla/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/), [GlobalStep](../../how_to_guides/basic/step/global_step.md)\n",
"\n",
"![Knowledge graph figure](../../../assets/pipelines/clean-dataset.png)"
]
},
{
@@ -13,7 +13,9 @@
"source": [
"- **Goal**: Generate a synthetic preference dataset for DPO/ORPO.\n",
"- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub)\n",
"- **Components**: [LoadDataFromHub](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromhub/), [TextGeneration](https://distilabel.argilla.io/latest/components-gallery/tasks/textgeneration/), [UltraFeedback](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/), [GroupColumns](https://distilabel.argilla.io/latest/components-gallery/steps/groupcolumns/), [FormatTextGenerationDPO](https://distilabel.argilla.io/latest/components-gallery/steps/formattextgenerationdpo/), [PreferenceToArgilla](https://distilabel.argilla.io/latest/components-gallery/steps/textgenerationtoargilla/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/)"
"- **Components**: [LoadDataFromHub](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromhub/), [TextGeneration](https://distilabel.argilla.io/latest/components-gallery/tasks/textgeneration/), [UltraFeedback](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/), [GroupColumns](https://distilabel.argilla.io/latest/components-gallery/steps/groupcolumns/), [FormatTextGenerationDPO](https://distilabel.argilla.io/latest/components-gallery/steps/formattextgenerationdpo/), [PreferenceToArgilla](https://distilabel.argilla.io/latest/components-gallery/steps/textgenerationtoargilla/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/)\n",
"\n",
"![Knowledge graph figure](../../../assets/pipelines/generate-preference-dataset.png)"
]
},
{