Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
If you just want to get started, we recommend you check the documentation. Curious, and want to know more? Keep reading!
Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.
Compute is expensive and output quality is important. We help you focus on data quality, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time achieving and keeping high-quality standards for your data.
Ownership of data for fine-tuning your own LLMs is not easy but Distilabel can help you to get started. We integrate AI feedback from any LLM provider out there using one unified API.
Synthesize and judge data with latest research papers while ensuring flexibility, scalability and fault tolerance. So you can focus on improving your data and training your models.
We are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:
-
Community Meetup: listen in or present during one of our bi-weekly events.
-
Discord: get direct support from the community in #argilla-general and #argilla-help.
-
Roadmap: plans change but we love to discuss those with our community so feel encouraged to participate.
The Argilla community uses distilabel to create amazing datasets and models.
- The 1M OpenHermesPreference is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to synthesize data on an immense scale.
- Our distilabeled Intel Orca DPO dataset and the improved OpenHermes model, show how we improve model performance by filtering out 50% of the original dataset through AI feedback.
- The haiku DPO data outlines how anyone can create a dataset for a specific task and the latest research papers to improve the quality of the dataset.
pip install distilabel --upgrade
Requires Python 3.9+
In addition, the following extras are available:
anthropic
: for using models available in Anthropic API via theAnthropicLLM
integration.cohere
: for using models available in Cohere via theCohereLLM
integration.argilla
: for exporting the generated datasets to Argilla.groq
: for using models available in Groq usinggroq
Python client via theGroqLLM
integration.hf-inference-endpoints
: for using the Hugging Face Inference Endpoints via theInferenceEndpointsLLM
integration.hf-transformers
: for using models available in transformers package via theTransformersLLM
integration.litellm
: for usingLiteLLM
to call any LLM using OpenAI format via theLiteLLM
integration.llama-cpp
: for using llama-cpp-python Python bindings forllama.cpp
via theLlamaCppLLM
integration.mistralai
: for using models available in Mistral AI API via theMistralAILLM
integration.ollama
: for using Ollama and their available models viaOllamaLLM
integration.openai
: for using OpenAI API models via theOpenAILLM
integration, or the rest of the integrations based on OpenAI and relying on its client asAnyscaleLLM
,AzureOpenAILLM
, andTogetherLLM
.vertexai
: for using Google Vertex AI proprietary models via theVertexAILLM
integration.vllm
: for using vllm serving engine via thevLLM
integration.sentence-transformers
: for generating sentence embeddings using sentence-transformers.mlx
: for using MLX models via theMlxLLM
integration.
outlines
: for using structured generation of LLMs with outlines.instructor
: for using structured generation of LLMs with Instructor.
ray
: for scaling and distributing a pipeline with Ray.faiss-cpu
andfaiss-gpu
: for generating sentence embeddings using faiss.text-clustering
: for using text clustering with UMAP and Scikit-learn.minhash
: for using minhash for duplicate detection with datasketch and nltk.
To run the following example you must install distilabel
with the hf-inference-endpoints
extra:
pip install "distilabel[hf-inference-endpoints]" --upgrade
Then run:
from datasets import load_dataset
from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration
with Pipeline() as pipeline:
TextGeneration(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
generation_kwargs={"temperature": 0.7, "max_new_tokens": 512},
),
)
if __name__ == "__main__":
dataset = load_dataset("distilabel-internal-testing/instructions", split="test")
distiset = pipeline.run(dataset=dataset)
distiset.push_to_hub(repo_id="distilabel-example")
If you build something cool with distilabel
consider adding one of these badges to your dataset or model card.
[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)
[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)
To directly contribute with distilabel
, check our good first issues or open a new one.
@misc{distilabel-argilla-2024,
author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},
title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/argilla-io/distilabel}}
}