docs: 960 docs add a glossary concept section (#970)
* feat: add _STEP_CATEGORY_TO_DESCRIPTION

* docs: emphasize structured generation

* docs: add pipeline visualizations

* docs: add concepts page outline

* Apply suggestions from code review

Co-authored-by: Natalia Elvira <126158523+nataliaElv@users.noreply.github.com>

* docs: processed comments Natalia

* chore: add tabulate dependency

* feat: add task overview

* docs: remove task overview from API definition

* docs: update naming tutorials

* docs: update naming

---------

Co-authored-by: Natalia Elvira <126158523+nataliaElv@users.noreply.github.com>
davidberenstein1957 and nataliaElv authored Sep 16, 2024
1 parent 75e34e1 commit b2d8eb5
Showing 31 changed files with 2,970 additions and 55 deletions.
Binary file added docs/assets/pipelines/arena-hard.png
Binary file added docs/assets/pipelines/clean-dataset.png
Binary file added docs/assets/pipelines/deepseek.png
Binary file added docs/assets/pipelines/deita.png
Binary file added docs/assets/pipelines/knowledge_graphs.png
Binary file added docs/assets/pipelines/prometheus.png
Binary file added docs/assets/pipelines/sentence-transformer.png
Binary file added docs/assets/pipelines/ultrafeedback.png
38 changes: 9 additions & 29 deletions docs/sections/getting_started/quickstart.md
@@ -10,7 +10,15 @@ hide:

# Quickstart

To start off, `distilabel` is a framework for building pipelines for generating synthetic data using LLMs, that defines a [`Pipeline`][distilabel.pipeline.Pipeline] which orchestrates the execution of the [`Step`][distilabel.steps.base.Step] subclasses, and those will be connected as nodes in a Direct Acyclic Graph (DAG).
Distilabel provides all the tools you need to build scalable and reliable pipelines for synthetic data generation and AI feedback. Pipelines are used to generate data, evaluate models, manipulate data, or perform any other general task. They are made up of different components: Steps, Tasks and LLMs, which are chained together in a directed acyclic graph (DAG).

- **Steps**: These are the building blocks of your pipeline. Normal steps are used for basic operations like loading data, applying transformations, or any other general processing.
- **Tasks**: These are steps that rely on LLMs and prompts to perform generative tasks. For example, they can be used to generate data, evaluate models or manipulate data.
- **LLMs**: These are the models that will perform the task. They can be local or remote models, and open-source or commercial models.

Pipelines are designed to be scalable and reliable. They can be executed in a distributed manner, and they can be cached and recovered. This is useful when dealing with large datasets or when you want to ensure that your pipeline is reproducible.

Besides that, pipelines are designed to be modular and flexible. You can easily add new steps, tasks, or LLMs to your pipeline, and you can also easily modify or remove them. An example architecture of a pipeline to generate a dataset of preferences is the following:
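
In code, such a preference pipeline might be sketched roughly as follows (the model ids, step names, and column mappings below are illustrative placeholders, not the exact quickstart example):

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import GroupColumns, LoadDataFromHub
from distilabel.steps.tasks import TextGeneration, UltraFeedback

with Pipeline(name="preference-dataset-sketch") as pipeline:
    # Step: load prompts from the Hub (the repo_id can also be provided at run time).
    load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})

    # Tasks: generate two candidate responses per prompt with two different LLMs.
    generate = [
        TextGeneration(llm=InferenceEndpointsLLM(model_id=model_id))
        for model_id in (
            "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "mistralai/Mistral-7B-Instruct-v0.3",
        )
    ]

    # Step: group the candidate generations into a single row per prompt.
    group = GroupColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    # Task: judge the grouped generations with an LLM to obtain preference ratings.
    judge = UltraFeedback(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-70B-Instruct")
    )

    load_dataset >> generate >> group >> judge
```

The `>>` operator is what wires the Steps, Tasks and LLM-backed tasks together as nodes of the DAG.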

## Installation

@@ -84,31 +92,3 @@
7. We run the pipeline with the parameters for the `load_dataset` and `text_generation` steps. The `load_dataset` step will use the repository `distilabel-internal-testing/instruction-dataset-mini` and the `test` split, and the `text_generation` task will use the `generation_kwargs` with the `temperature` set to `0.7` and the `max_new_tokens` set to `512` (see the sketch after this list).

8. Optionally, we can push the generated [`Distiset`][distilabel.distiset.Distiset] to the Hugging Face Hub repository `distilabel-example`. This will allow you to share the generated dataset with others and use it in other pipelines.
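
Putting 7 and 8 together, the tail of such a script would look roughly like this (assuming the steps were named `load_dataset` and `text_generation`, and that the generation kwargs nest under the task's `llm` runtime parameters):

```python
if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            "load_dataset": {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            "text_generation": {
                "llm": {
                    "generation_kwargs": {"temperature": 0.7, "max_new_tokens": 512}
                }
            },
        }
    )
    # Optionally share the generated Distiset on the Hugging Face Hub (step 8).
    distiset.push_to_hub("distilabel-example")
```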

## Minimal example

`distilabel` gives you a lot of flexibility to create your pipelines, but to start right away you can omit most of the details and rely on the default values:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration
from datasets import load_dataset


dataset = load_dataset("distilabel-internal-testing/instruction-dataset-mini", split="test")

with Pipeline() as pipeline:  # (1)
    TextGeneration(llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"))  # (2)


if __name__ == "__main__":
    distiset = pipeline.run(dataset=dataset)  # (3)
    distiset.push_to_hub(repo_id="distilabel-example")
```

1. The [`Pipeline`][distilabel.pipeline.Pipeline] can be created without arguments and will generate a default name on its own, which is tracked internally.

2. Just as with the [`Pipeline`][distilabel.pipeline.Pipeline], the [`Step`][distilabel.steps.base.Step]s don't explicitly need a name.

3. You can load the dataset as you would normally do with Hugging Face and pass it to the `run` method.
2 changes: 1 addition & 1 deletion docs/sections/how_to_guides/basic/llm/index.md
@@ -1,4 +1,4 @@
# Define LLMs as local or remote models
# Executing Tasks with LLMs

## Working with LLMs

2 changes: 1 addition & 1 deletion docs/sections/how_to_guides/basic/step/index.md
@@ -1,4 +1,4 @@
# Define Steps for your Pipeline
# Steps for processing data

## Working with Steps

2 changes: 1 addition & 1 deletion docs/sections/how_to_guides/basic/task/generator_task.md
@@ -1,4 +1,4 @@
# GeneratorTask
# GeneratorTask that produces output

## Working with GeneratorTasks

2 changes: 1 addition & 1 deletion docs/sections/how_to_guides/basic/task/index.md
@@ -1,4 +1,4 @@
# Define Tasks that rely on LLMs
# Tasks for generating and judging with LLMs

## Working with Tasks

@@ -1,12 +1,14 @@
---
hide: toc
---
# Benchmarking with `distilabel`: Arena Hard
# Benchmarking with `distilabel`

Benchmark LLMs with `distilabel`: reproducing the Arena Hard benchmark.

The script below first defines the `ArenaHard` and `ArenaHardResults` tasks: the former generates responses for a given collection of prompts/questions with up to two LLMs, and the latter calculates the results as per the original implementation. The second part of the example then builds a `Pipeline` to run the generation on top of the prompts with `InferenceEndpointsLLM`, while streaming the rest of the generations from a pre-computed set of GPT-4 generations, and evaluates one against the other with `OpenAILLM`, producing an alternate response, a comparison between the responses, and a result as A>>B, A>B, B>A, B>>A, or tie.

![Arena Hard](../../../assets/pipelines/arena-hard.png)

To run this example you will first need to install the Arena Hard optional dependencies, namely `pandas`, `scikit-learn`, and `numpy`.

??? Run
@@ -1,12 +1,14 @@
---
hide: toc
---
# llama.cpp with `outlines`
# Structured generation with `outlines`

Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`.

This script makes use of [`LlamaCppLLM`][distilabel.llms.llamacpp.LlamaCppLLM] and the structured output capabilities thanks to [`outlines`](https://outlines-dev.github.io/outlines/welcome/) to generate RPG characters that adhere to a JSON schema.

![Pipeline overview](../../../assets/pipelines/knowledge_graphs.png)

It makes use of a local model which can be downloaded using curl (explained in the script itself), and can be exchanged with other `LLMs` like [`vLLM`][distilabel.llms.vllm.vLLM].
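
At its core, the script boils down to declaring a `pydantic.BaseModel` and passing it through the `structured_output` argument of the LLM. A condensed sketch (the character fields and the local GGUF path are illustrative; the full script defines a richer schema):

```python
from pathlib import Path

from pydantic import BaseModel

from distilabel.llms import LlamaCppLLM


class Character(BaseModel):
    name: str
    role: str
    weapon: str
    strength: int


# `structured_output` asks outlines to constrain decoding to the JSON schema of
# `Character`; the model path assumes a GGUF file has already been downloaded locally.
llm = LlamaCppLLM(
    model_path=str(Path.home() / "Phi-3-mini-4k-instruct-q4.gguf"),
    n_gpu_layers=-1,
    n_ctx=1024,
    structured_output={"format": "json", "schema": Character},
)
```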

??? Run
@@ -1,12 +1,14 @@
---
hide: toc
---
# MistralAI with `instructor`
# Structured generation with `instructor`

Answer instructions with knowledge graphs defined as `pydantic.BaseModel` objects using `instructor` in `distilabel`.

This script makes use of [`MistralLLM`][distilabel.llms.mistral.MistralLLM] and the structured output capabilities thanks to [`instructor`](https://python.useinstructor.com/) to generate knowledge graphs from complex topics.

![Knowledge graph figure](../../../assets/pipelines/knowledge_graphs.png)

This example is translated from this [awesome example](https://python.useinstructor.com/examples/knowledge_graph/) from the `instructor` cookbook.
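
In essence, the knowledge graph is declared as nested `pydantic.BaseModel` classes and requested through the `structured_output` argument of the LLM. A minimal sketch, closely following the `instructor` cookbook (field names and the model name are illustrative):

```python
from typing import List

from pydantic import BaseModel, Field

from distilabel.llms import MistralLLM


class Node(BaseModel):
    id: int
    label: str
    color: str


class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"


class KnowledgeGraph(BaseModel):
    nodes: List[Node] = Field(default_factory=list)
    edges: List[Edge] = Field(default_factory=list)


# `instructor` patches the Mistral client so that the completion is parsed and
# validated directly into a `KnowledgeGraph` instance.
llm = MistralLLM(
    model="open-mixtral-8x22b",
    structured_output={"schema": KnowledgeGraph},
)
```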

??? Run
10 changes: 5 additions & 5 deletions docs/sections/pipeline_samples/index.md
@@ -1,13 +1,13 @@
---
hide: toc
---
# Pipeline Samples
# Tutorials

- **Tutorials** provide detailed step-by-step explanations and the code used for end-to-end workflows.
- **End-to-end tutorials** provide detailed step-by-step explanations and the code used for end-to-end workflows.
- **Paper implementations** provide reproductions of fundamental papers in the synthetic data domain.
- **Examples** don't provide explanations but simply show code for different tasks.

## Tutorials
## End-to-end tutorials

<div class="grid cards" markdown>

@@ -97,15 +97,15 @@ hide: toc

[:octicons-arrow-right-24: Example](examples/benchmarking_with_distilabel.md)

- __llama.cpp with outlines__
- __Structured generation with outlines__

---

Learn about generating RPG characters following a pydantic.BaseModel with outlines in distilabel.

[:octicons-arrow-right-24: Example](examples/llama_cpp_with_outlines.md)

- __MistralAI with instructor__
- __Structured generation with instructor__

---

4 changes: 3 additions & 1 deletion docs/sections/pipeline_samples/papers/deepseek_prover.md
@@ -8,6 +8,8 @@ The authors propose a method for generating [Lean 4](https://github.com/leanprov

Here we show how to deal with steps 1 and 2, but the authors also ensure the theorems are checked using the [lean4](https://github.com/leanprover/lean4) program on the generated proofs, iterating for a series of steps: fine-tuning a model on the synthetic data (DeepSeek Prover 7B), regenerating the dataset, and continuing the process until no further improvement is found.

![DeepSeek Prover pipeline overview](../../../assets/pipelines/deepseek.png)

### Replication

!!! Note
@@ -32,7 +34,7 @@ There are three components we needed to define for this pipeline, for the differ
!!! Note

We will use the same `LLM` for all the tasks, so we will define it once and reuse it for the different tasks:

```python
llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3-70B-Instruct",
2 changes: 2 additions & 0 deletions docs/sections/pipeline_samples/papers/deita.md
@@ -10,6 +10,8 @@ The strategy utilizes **LLMs to replace human effort in time-intensive data qual

Once again we start from a dataset of instructions/responses, and we roughly reproduce the second step: learning how to optimize the responses according to an instruction by comparing several possibilities.

![DEITA pipeline overview](../../../assets/pipelines/deita.png)

### Datasets and budget

We will dive deeper into the whole process. We will investigate each stage to efficiently select the final dataset used for supervised fine-tuning with a budget constraint. We will tackle technical challenges by explaining exactly how you would assess good data as presented in the paper.
@@ -2,6 +2,8 @@

["Self Alignment with Instruction Backtranslation"](https://arxiv.org/abs/2308.06259) presents a scalable method to build high-quality instruction following a language model by automatically labeling human-written text with corresponding instructions. Their approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high-quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model.

![Instruction Backtranslation pipeline overview](../../../assets/pipelines/instruction_backtranslation.png)

Their self-training approach assumes access to a base language model, a small amount of seed data, and a collection of unlabelled examples, e.g. a web corpus. The unlabelled data is a large, diverse set of human-written documents that includes writing about all manner of topics humans are interested in – but crucially is not paired with instructions.

A first key assumption is that there exists some subset of this very large human-written text that would be suitable as gold generations for some user instructions. A second key assumption is that they can predict instructions for these candidate gold answers that can be used as high-quality example pairs to train an instruction-following model.
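
The self-curation scoring step maps naturally onto the `InstructionBacktranslation` task in `distilabel`, which asks an LLM to rate an (instruction, generation) pair. A minimal sketch (the model id and the step wiring are illustrative, not the full replication):

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import InstructionBacktranslation

with Pipeline(name="self-curation-sketch") as pipeline:
    # Rows are expected to contain an `instruction` (the self-augmented prompt)
    # and a `generation` (the human-written document used as candidate answer).
    load_pairs = LoadDataFromHub(name="load_pairs")

    self_curation = InstructionBacktranslation(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3-70B-Instruct"),
    )

    load_pairs >> self_curation  # adds a score and the reasoning behind it
```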
2 changes: 2 additions & 0 deletions docs/sections/pipeline_samples/papers/prometheus.md
@@ -2,6 +2,8 @@

["Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models"](https://arxiv.org/pdf/2405.01535) presents Prometheus 2, a new and more powerful evaluator LLM compared to Prometheus (its predecessor) presented in ["Prometheus: Inducing Fine-grained Evaluation Capability in Language Models"](https://arxiv.org/abs/2310.08491); since GPT-4, as well as other proprietary LLMs, are commonly used to assess the quality of the responses for various LLMs, but there are concerns about transparency, controllability, and affordability, that motivate the need of open-source LLMs specialized in evaluations.

![Prometheus 2 pipeline overview](../../../assets/pipelines/prometheus.png)

Existing open evaluator LMs exhibit critical shortcomings:

1. They issue scores that significantly diverge from those assigned by humans.
2 changes: 2 additions & 0 deletions docs/sections/pipeline_samples/papers/ultrafeedback.md
@@ -4,6 +4,8 @@

UltraFeedback collects about 64k prompts from diverse resources (including UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN). These prompts are then used to query multiple LLMs (commercial models, Llama models ranging from 7B to 70B, and non-Llama models), generating four different responses for each prompt and resulting in a total of 256k samples, i.e. UltraFeedback will rate four responses on every OpenAI request.

![UltraFeedback pipeline overview](../../../assets/pipelines/ultrafeedback.png)

To collect high-quality preference and textual feedback, they design a fine-grained annotation instruction, which contains four different aspects, namely instruction-following, truthfulness, honesty and helpfulness (even though within the paper they also mention a fifth one named verbalized calibration). Finally, GPT-4 is used to generate the ratings for the generated responses to the given prompt using the previously mentioned aspects.
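
In `distilabel`, these aspects correspond to the `aspect` argument of the `UltraFeedback` task, so the paper's setup can be approximated with one judge per aspect. A minimal sketch (the GPT-4 model name is a placeholder, and using a single `overall-rating` judge instead is an equally valid design choice):

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import UltraFeedback

with Pipeline(name="ultrafeedback-aspects-sketch") as pipeline:
    # One UltraFeedback judge per annotation aspect used in the paper;
    # in the full replication these connect to a loader and generation tasks.
    judges = [
        UltraFeedback(
            name=f"ultrafeedback-{aspect}",
            aspect=aspect,
            llm=OpenAILLM(model="gpt-4"),
        )
        for aspect in (
            "instruction-following",
            "truthfulness",
            "honesty",
            "helpfulness",
        )
    ]
```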

## Replication
@@ -10,6 +10,8 @@
"- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub), [sentence-transformers](https://github.com/UKPLab/sentence-transformers)\n",
"- **Components**: [LoadDataFromHub](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromhub/), [GenerateSentencePair](https://distilabel.argilla.io/latest/components-gallery/tasks/generatesentencepair/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/)\n",
"\n",
"![GenerateSentencePair pipeline overview](../../../assets/pipelines/sentence-transformer.png)\n",
"\n",
"!!! note\n",
" For a comprehensive overview on optimizing the retrieval performance in a RAG pipeline, check this [guide](https://docs.zenml.io/user-guide/llmops-guide/finetuning-embeddings) in collaboration with [ZenML](https://github.com/zenml-io/zenml), an open-source MLOps framework designed for building portable and production-ready machine learning pipelines."
]
@@ -13,7 +13,9 @@
"source": [
"- **Goal**: Clean an existing preference dataset by providing AI feedback on the quality of the data.\n",
"- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub)\n",
"- **Components**: [LoadDataFromDicts](https://distilabel.argilla.io/dev/components-gallery/steps/loaddatafromdicts/), [UltraFeedback](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/), [KeepColumns](https://distilabel.argilla.io/latest/components-gallery/steps/groupcolumns/), [PreferenceToArgilla](https://distilabel.argilla.io/latest/components-gallery/steps/textgenerationtoargilla/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/), [GlobalStep](../../how_to_guides/basic/step/global_step.md)"
"- **Components**: [LoadDataFromDicts](https://distilabel.argilla.io/dev/components-gallery/steps/loaddatafromdicts/), [UltraFeedback](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/), [KeepColumns](https://distilabel.argilla.io/latest/components-gallery/steps/groupcolumns/), [PreferenceToArgilla](https://distilabel.argilla.io/latest/components-gallery/steps/textgenerationtoargilla/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/), [GlobalStep](../../how_to_guides/basic/step/global_step.md)\n",
"\n",
"![Knowledge graph figure](../../../assets/pipelines/clean-dataset.png)"
]
},
{
@@ -13,7 +13,9 @@
"source": [
"- **Goal**: Generate a synthetic preference dataset for DPO/ORPO.\n",
"- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub)\n",
"- **Components**: [LoadDataFromHub](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromhub/), [TextGeneration](https://distilabel.argilla.io/latest/components-gallery/tasks/textgeneration/), [UltraFeedback](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/), [GroupColumns](https://distilabel.argilla.io/latest/components-gallery/steps/groupcolumns/), [FormatTextGenerationDPO](https://distilabel.argilla.io/latest/components-gallery/steps/formattextgenerationdpo/), [PreferenceToArgilla](https://distilabel.argilla.io/latest/components-gallery/steps/textgenerationtoargilla/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/)"
"- **Components**: [LoadDataFromHub](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromhub/), [TextGeneration](https://distilabel.argilla.io/latest/components-gallery/tasks/textgeneration/), [UltraFeedback](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/), [GroupColumns](https://distilabel.argilla.io/latest/components-gallery/steps/groupcolumns/), [FormatTextGenerationDPO](https://distilabel.argilla.io/latest/components-gallery/steps/formattextgenerationdpo/), [PreferenceToArgilla](https://distilabel.argilla.io/latest/components-gallery/steps/textgenerationtoargilla/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/)\n",
"\n",
"![Knowledge graph figure](../../../assets/pipelines/generate-preference-dataset.png)"
]
},
{