Releases: argilla-io/distilabel
1.5.2
What's Changed
- Fix structured output JSON to `pydantic.BaseModel` and `LiteLLM` async completion client by @rolshoven in #1105
New Contributors
- @rolshoven made their first contribution in #1105
Full Changelog: 1.5.1...1.5.2
1.5.1
What's Changed
- Remove deprecated `CombineColumns` step by @gabrielmbmb in #1101
- Fix image import handling and update MlxLLM initialisation by @davidberenstein1957 in #1102
- Fix `MlxLLM` by aligning it with `mlx-lm>=0.21` by @davidberenstein1957 in #1103
- `1.5.1` by @gabrielmbmb in #1104
Full Changelog: 1.5.0...1.5.1
1.5.0
✨ Release highlights
🖼️ Image Generation Support
We're excited to introduce `ImageGenerationModel`, a new abstraction for working with image generation models. This addition enables seamless integration with models that can transform text prompts into images.
Available Services
- 🤗 `InferenceEndpointsImageGeneration`: Integration with Hugging Face's Inference Endpoints
- `OpenAIImageGeneration`: Integration with OpenAI's DALL-E
Architecture
Just as `LLM`s are used by a `Task`, we've introduced `ImageTask` as a high-level abstraction for image generation workflows. `ImageTask` defines how a step should use an `ImageGenerationModel` to accomplish specific image generation tasks.
Our first implementation, the `ImageGeneration` task, provides a straightforward interface: given a text prompt, it generates the corresponding image, leveraging any of the supported image generation models.
We've also added a small tutorial on how to generate images using `distilabel`: distilabel - Tutorials - Image generation with distilabel
Images as inputs for `LLM`s
We've added initial support for providing images as input to an `LLM` through the new `TextGenerationWithImage` task. We've updated and tested `InferenceEndpointsLLM` and `OpenAILLM` with this new task, and we'll add image-as-input compatibility for others, such as `vLLM`, in upcoming releases.
Check the tutorial distilabel - Tutorials - Text generation with images in distilabel to get started!
💻 New `MlxLLM` integration
We've integrated the `mlx-lm` package with the new `MlxLLM` class, enabling native machine learning acceleration on Apple Silicon Macs. This integration supercharges synthetic data generation by leveraging MLX's highly optimized framework designed specifically for the M-series chips.
New `InstructionResponsePipeline` template
We've started making changes so `distilabel` is easier to use from minute one. We'll start adding presets or templates that let you quickly get a pipeline with sensible preconfigured defaults for generating data for certain tasks. The first task we've worked on is the SFT or Instruction Response tuning pipeline, which you can use like this:
```python
from distilabel.pipeline import InstructionResponsePipeline

pipeline = InstructionResponsePipeline()
distiset = pipeline.run()
```
Define load stages
We've added a way for users to define which steps of the pipeline should be loaded together, allowing for more efficient resource management and better control over the execution flow. This new feature is particularly useful in scenarios where resource-constrained environments limit the ability to execute all steps simultaneously, requiring steps to be executed in distinct stages.
We've added a detailed guide on how to use this feature: distilabel - How-to guides - Load groups and execution stages.
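As a rough illustration of the idea, and not distilabel's implementation, the sketch below models load stages in plain Python: steps are split into groups, and each group is loaded, run, and unloaded before the next one starts. `FakeStep` and `run_in_stages` are hypothetical names invented for this example. In distilabel the grouping is passed through the `load_groups` argument of `run` (#1075), and the guide above describes its actual semantics.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeStep:
    """Hypothetical stand-in for a pipeline step that holds a costly resource."""

    name: str
    loaded: bool = False

    def load(self) -> None:
        self.loaded = True  # e.g. allocate a model on a GPU

    def unload(self) -> None:
        self.loaded = False  # free the resource for the next stage


def run_in_stages(steps: Dict[str, FakeStep], load_groups: List[List[str]]) -> List[int]:
    """Load, run and unload each group in order.

    Returns how many steps were loaded at the same time in each stage.
    """
    loaded_per_stage = []
    for group in load_groups:
        for name in group:
            steps[name].load()
        loaded_per_stage.append(sum(step.loaded for step in steps.values()))
        # ... the steps of this group would process their batches here ...
        for name in group:
            steps[name].unload()
    return loaded_per_stage


steps = {name: FakeStep(name) for name in ("generate", "judge", "embed")}
# Two stages: "generate" never shares memory with "judge" or "embed".
peak = run_in_stages(steps, load_groups=[["generate"], ["judge", "embed"]])
print(peak)  # prints [1, 2]
```

The point of the grouping is visible in the returned counts: at most two of the three steps are resident at once, which is the kind of saving that matters when every step holds its own model.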
What's Changed
- Add common typing module by @plaguss in #1029
- docs: textcat tutorial by @sdiazlor in #949
- Add `task` decorator by @gabrielmbmb in #1028
- Update `docs` workflows to use `uv` by @gabrielmbmb in #1032
- fix: simplify prompt template `ArgillaLabeller` by @davidberenstein1957 in #1033
- Add `dataset_batch_size` argument by @gabrielmbmb in #1039
- Move all LLMs to distilabel.models by @plaguss in #1045
- Fix a tiny typo in `_Step` docstring by @sadra-barikbin in #1051
- docs: improve docs for `MinHashDedup` `Step` by @anakin87 in #1050
- Fix new response_format variable in openai api by @plaguss in #1053
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #1043
- Update `LLM.generate` output to include `statistics` by @plaguss in #1034
- Add example of structured output. by @plaguss in #1061
- feat: implement basic SFT pipeline based on synthetic data generator by @burtenshaw in #1059
- fix: broken import in instruction by @burtenshaw in #1063
- Fix StepOutput type by @plaguss in #1072
- docs: update issue templates by @sdiazlor in #1074
- Update `unload` method from `vLLM` to properly free resources by @gabrielmbmb in #1077
- Add tasks to replicate Math-shepherd by @plaguss in #1052
- Add `load_groups` argument to `run` by @gabrielmbmb in #1075
- Add `TextGenerationWithImage` task by @plaguss in #1066
- Create columns with `LLM` returned extra keys by @gabrielmbmb in #1078
- Fix `vLLM` unload logic when model is `None` by @gabrielmbmb in #1080
- Fix `merge_distilabel_metadata` function when handling outputs from `Task` with `group_generations==True` by @gabrielmbmb in #1082
- chore: update base.py by @eltociear in #1085
- Add magpie support llama cpp ollama by @davidberenstein1957 in #1086
- Feat/954 llama cpp by @bikash119 in #1000
- fix import by replacing GeneratorOutput with GeneratorStepOutput by @davidberenstein1957 in #1093
- add mlx support by @davidberenstein1957 in #1089
- Support custom default headers in `OpenAILLM` class. by @khulaifi95 in #1088
- fix/pip install messages by @davidberenstein1957 in #1095
- Fix handling empty list statistics by @gabrielmbmb in #1094
- update to outlines010 by @davidberenstein1957 in #1092
- update: search by match by @sdiazlor in #1096
- Add Legend to Component Gallery Icons by @ParagEkbote in #1090
- Image Language Models and `ImageGeneration` task by @plaguss in #1060
- Update `LLM`s to support prompt logprobs use-case by @gabrielmbmb in #1099
- `1.5.0` by @gabrielmbmb in #1100
New Contributors
- @sadra-barikbin made their first contribution in #1051
- @anakin87 made their first contribution in #1050
- @pre-commit-ci made their first contribution in #1043
- @eltociear made their first contribution in #1085
- @bikash119 made their first contribution in #1000
- @khulaifi95 made their first contribution in #1088
- @ParagEkbote made their first contribution in #1090
Full Changelog: 1.4.2...1.5.0
1.4.2
What's Changed
- Fix chat template not applied in `TransformersLLM` by @gabrielmbmb in #1083
Full Changelog: 1.4.1...1.4.2
1.4.1
What's Changed
- Fix not handling list of all primitive types in `SignatureMixin` by @gabrielmbmb in #1037
Full Changelog: 1.4.0...1.4.1
1.4.0
✨ Release highlights
Offline Batch Generation and OpenAI Batch API
We've updated the `LLM` interface so that `LLM`s using an external platform that offers a batch service can now be integrated in `distilabel`. In addition, `OpenAILLM` has been updated so it can use the OpenAI Batch API to get 50% cost reductions.
Improved cache for maximum outputs reusability
We all know that running `LLM`s is costly, and most of the time we want to reuse the outputs generated with them as much as we can. Before this release, the `distilabel` cache mechanism made it possible to recover a pipeline execution that was stopped before finishing, and to re-create the `Distiset` generated by a pipeline that finished its execution and was re-executed.
In this release, we've greatly improved the cache so the outputs of all the `Step`s are cached and can therefore be reused in other pipeline executions, even if the pipeline has changed.
In addition, we've added a `use_cache` attribute to the `Step`s that allows toggling the use of the cache at step level.
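To make the idea of step-level caching concrete, here is a minimal sketch in plain Python. It is an assumption-laden toy, not how distilabel stores its cache: `CachedStep`, `_CACHE`, and the choice of hashing the step name, parameters, and inputs into a key are all hypothetical. Only the `use_cache` flag name is taken from the release notes.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

# In-memory store standing in for an on-disk cache (assumption for this sketch).
_CACHE: Dict[str, List[Any]] = {}


class CachedStep:
    """Hypothetical step whose outputs are cached per step, not per pipeline."""

    def __init__(
        self,
        name: str,
        fn: Callable[[Any], Any],
        params: Dict[str, Any],
        use_cache: bool = True,
    ) -> None:
        self.name = name
        self.fn = fn
        self.params = params
        self.use_cache = use_cache
        self.calls = 0  # how many times the underlying function really ran

    def _key(self, rows: List[Any]) -> str:
        # The key depends only on this step and its inputs, so cached outputs
        # stay valid when other parts of the pipeline change.
        payload = json.dumps([self.name, self.params, rows], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def process(self, rows: List[Any]) -> List[Any]:
        key = self._key(rows)
        if self.use_cache and key in _CACHE:
            return _CACHE[key]
        self.calls += 1
        outputs = [self.fn(row) for row in rows]
        if self.use_cache:
            _CACHE[key] = outputs
        return outputs


upper = CachedStep("upper", str.upper, params={"lang": "en"})
upper.process(["a", "b"])
upper.process(["a", "b"])  # served from the cache
print(upper.calls)  # prints 1

no_cache = CachedStep("upper", str.upper, params={"lang": "en"}, use_cache=False)
no_cache.process(["a", "b"])
no_cache.process(["a", "b"])  # recomputed because the cache is disabled
print(no_cache.calls)  # prints 2
```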
Steps can generate artifacts
In some cases, a `Step` produces additional artifacts that are used to generate its outputs. These artifacts can take some time to generate, and they could be reused in the future. That's why we've added a new method called `Step.save_artifact` that can be called within the step to store the artifacts it generates. The artifacts generated by the `Step` will also get uploaded to the Hugging Face Hub.
```python
from typing import List, TYPE_CHECKING
from distilabel.steps import GlobalStep, StepInput, StepOutput
import matplotlib.pyplot as plt

if TYPE_CHECKING:
    from distilabel.steps import StepOutput


class CountTextCharacters(GlobalStep):
    @property
    def inputs(self) -> List[str]:
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        return ["text_character_count"]

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        character_counts = []
        for input in inputs:
            text_character_count = len(input["text"])
            input["text_character_count"] = text_character_count
            character_counts.append(text_character_count)

        # Generate plot with the distribution of text character counts
        plt.figure(figsize=(10, 6))
        plt.hist(character_counts, bins=30, edgecolor="black")
        plt.title("Distribution of Text Character Counts")
        plt.xlabel("Character Count")
        plt.ylabel("Frequency")

        # Save the plot as an artifact of the step
        self.save_artifact(
            name="text_character_count_distribution",
            write_function=lambda path: plt.savefig(path / "figure.png"),
            metadata={"type": "image", "library": "matplotlib"},
        )

        plt.close()

        yield inputs
```
New Tasks: `CLAIR`, `APIGEN` and many more!
- New CLAIR task: CLAIR uses an AI system to minimally revise a solution A→A′ such that the resulting preference A `preferred` A′ is much more contrastive and precise.
- New tasks to replicate APIGen framework: `APIGenGenerator`, `APIGenSemanticChecker`, `APIGenExecutionChecker`. These tasks allow generating datasets like the one presented in the paper: APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
- New URIAL task that allows using non-instruct models to generate a response for an instruction.
- New TextClassification task to make zero-shot text classification based on a predefined but highly customizable prompt.
- TextClustering, to generate clusters from text and group your generations, discovering labels from your data. Comes with 2 steps to run UMAP and DBSCAN algorithms.
- Updated TextGeneration to simplify customization of tasks that don’t require further post-processing.
New Steps to sample data in your pipelines and remove duplicates
- New DataSampler step to sample data from other datasets, which can be useful to inject different examples for few-shot examples in your prompts.
- New EmbeddingDedup step to remove duplicates based on embeddings and a distance metric.
- New MinHashDedup step to remove near duplicates from the text based on MinHash and MinHashLSH algorithm.
- New TruncateTextColumns to truncate the length of your texts using either the character length or the number of tokens based on a tokenizer.
- New CombineOutputs to combine the outputs of two or more steps into a single output.
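For readers unfamiliar with MinHash, the near-duplicate idea behind a step like MinHashDedup can be sketched with the standard library alone. This is an illustrative toy, not the step's implementation: the function names are hypothetical, the hash family (salted BLAKE2b) and the word 3-gram shingles are assumptions, and it compares every pair of signatures directly, whereas an LSH index would avoid the all-pairs comparison.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Return the set of word n-grams of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items: Set[str], num_perm: int = 64) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        item.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for item in items
            )
        )
    return signature


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """The fraction of matching positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(texts: List[str], threshold: float = 0.9) -> List[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept: List[str] = []
    kept_signatures: List[List[int]] = []
    for text in texts:
        signature = minhash_signature(shingles(text))
        if all(
            estimated_jaccard(signature, other) < threshold
            for other in kept_signatures
        ):
            kept.append(text)
            kept_signatures.append(signature)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "synthetic data pipelines need careful deduplication steps",
]
# The exact duplicate is dropped; the unrelated text is kept.
print(dedup(docs))
```

Lowering `threshold` makes the filter more aggressive, at the cost of removing texts that merely share a lot of phrasing.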
Generate text embeddings using vLLM
- Now you can generate embeddings using vLLMEmbeddings!
Extra things
- Easily visualize the tasks’ prompts using Task.print method.
- New use_default_structured_outputs flag in tasks to automatically use structured generation in some tasks that can benefit from it.
What's Changed
- Make `ClientvLLM.model_name` a `cached_property` by @gabrielmbmb in #862
- Pass dataset to dry_run method by @plaguss in #863
- Add default structured output for `GenerateSentencePair` task by @plaguss in #868
- Complexity scorer default structured output by @plaguss in #870
- Quality scorer default structured output by @plaguss in #873
- Ultrafeedback default structured output by @plaguss in #876
- Remove use of `default_chat_template` by @gabrielmbmb in #888
- Temporary fix for installing `llama-cpp-python` by @gabrielmbmb in #886
- Fix unit tests after release of `transformers==4.44.0` by @gabrielmbmb in #891
- Fix default structured output by @plaguss in #892
- Send as many batches as possible to input queues by @gabrielmbmb in #895
- Exclude `repo_id` from `LoadDataFromFileSystem` by @plaguss in #898
- Fix loader to read from a glob pattern by @plaguss in #877
- Add `save_artifact` method to `_Step` by @gabrielmbmb in #871
- Add new `add_raw_input` argument to `_Task` so we can automatically include the formatted input by @plaguss in #903
- New `TruncateTextColumn` to truncate the length of texts using the number of tokens or characters by @plaguss in #902
- Update `inputs` and `outputs` interface to allow returning dict indicating optionality by @gabrielmbmb in #883
- Update mistrallm by @plaguss in #904
- Deepseek prover by @plaguss in #907
- Update `RewardModelScore.inputs` property by @gabrielmbmb in #908
- Add tutorial - generate data for training embeddings and reranking models by @davidberenstein1957 in #893
- Fix load data from disk by @plaguss in #910
- docs: minor fixes by @davidberenstein1957 in #913
- Add `URIAL` task by @gabrielmbmb in #921
- Add `vLLMEmbeddings` by @plaguss in #920
- docs: add tutorials preference and clean by @sdiazlor in #917
- Fix `StructuredGeneration` examples and internal check by @plaguss in #912
- Generate deterministic pipeline name when it's not given by @plaguss in #878
- Add custom errors by @plaguss in #911
- Docs/tutorials fix by @sdiazlor in #922
- Add `revision` runtime parameter to `LoadDataFromHub` by @gabrielmbmb in #928
- Add plausible as replacement for GA by @davidberenstein1957 in #929
- Add minhash related steps to deduplicate texts by @plaguss in #931
- docs: API reference review by @sdiazlor in #932
- Refactor of MinHash to work with a single class and fix the shelve backend by @plaguss in #937
- Update `make_generator_step` to set pipeline to step and add edge to steps in trophic level 1 by @gabrielmbmb in https://g...
1.3.2
What's Changed
- Deepseek prover task by @plaguss in #733
- Do not cancel in progress docs workflows by @gabrielmbmb in #919
- Fix creating Ray placement groups for vLLM by @gabrielmbmb in #918
- Fix passing `base_url` in `model_id` in `InferenceEndpointsLLM` by @gabrielmbmb in #924
Full Changelog: 1.3.1...1.3.2
1.3.1
What's Changed
- Create new `distilabel.constants` module to store constants and avoid circular imports by @plaguss in #861
- Add OpenAI request timeout by @ashim-mahara in #858
New Contributors
- @ashim-mahara made their first contribution in #858
Full Changelog: 1.3.0...1.3.1
1.3.0
What's Changed
- Add new step `CombineKeys` by @plaguss in #747
- Refactor naming columns steps combinecolumns combinekeys expandcolumns by @davidberenstein1957 in #758
- Drop remove deprecated `LoadHubDataset` by @davidberenstein1957 in #759
- Add `requirements` list for `Pipeline` by @plaguss in #720
- Add `StepResources` and step replicas in `Pipeline` by @gabrielmbmb in #750
- Add load stages by @gabrielmbmb in #760
- Update min required version to `python==3.9` by @gabrielmbmb in #770
- Optionally include the pipeline script in the hub when pushing your distiset by @plaguss in #762
- Add `docs-pr.yml` and `docs-pr-close.yml` workflows by @gabrielmbmb in #774
- Add `RayPipeline` class by @gabrielmbmb in #769
- Fixed closed PR workflow by @gabrielmbmb in #776
- Add `Magpie` and `MagpieGenerator` tasks by @gabrielmbmb in #778
- Fix some issues related to `Magpie` task by @gabrielmbmb in #783
- Add `end_with_user` and `include_system_prompt` flags to `Magpie` tasks and handle `None`s. by @gabrielmbmb in #784
- Add workflow concurrency group for publishing docs by @gabrielmbmb in #796
- Add `_desired_num_gpus` attribute to `CudaDevicePlacementMixin` by @gabrielmbmb in #795
- Compatibility with `vLLM` with `tensor_parallel_size` argument by @gabrielmbmb in #805
- Update default names in `GroupColumns` by @plaguss in #808
- Request batches to `GeneratorStep` if only step in pipeline by @gabrielmbmb in #828
- Add default name for a pipeline by @plaguss in #809
- Update distilabel phrasing based on PR hugging face hub by @davidberenstein1957 in #821
- Some more `Magpie` improvements by @gabrielmbmb in #833
- Add `Embeddings` base class, `SentenceTransformerEmbeddings` class, `EmbeddingGeneration` and `FaissNearestNeighbour` steps by @gabrielmbmb in #830
- Create file per hostname in `CudaDevicePlacementMixin` by @gabrielmbmb in #814
- Create a `GeneratorStep` from a dataset using a helper function by @plaguss in #812
- Do not take into account `disable_cuda_device_placement` for pipeline signature by @gabrielmbmb in #838
- Add `RewardModelScore` step by @gabrielmbmb in #840
- Fix `LoadDataFromHub` attribute `_dataset` had `ellipsis` by default instead of `None` by @gabrielmbmb in #841
- Create `PlacementGroup` for steps using `vLLM` by @gabrielmbmb in #842
- Update `argilla` integration to use `argilla_sdk` v2 by @alvarobartt in #705
- Make `overall-rating` the default aspect for `UltraFeedback` task by @gabrielmbmb in #843
- fix typo index.md by @franperic in #844
- Use `CudaDevicePlacementMixin` in `RewardModelScore` step by @gabrielmbmb in #845
- Gather GPUs per Ray node to create placement groups by @gabrielmbmb in #848
- Fix typo in docs by @plaguss in #850
- Add `xfail` routing batch function tests by @gabrielmbmb in #852
- Fix creating placement group when `pipeline_parallel_size>1` by @gabrielmbmb in #851
- docs: 846 docs include google analytics by @davidberenstein1957 in #847
- Add `ClientvLLM` class by @gabrielmbmb in #854
- Add hard-negative flag to include similar challenging negatives on triplets by @plaguss in #856
- Add bibtex references in the docstrings to be shown in the README by @plaguss in #855
- distilabel `1.3.0` by @gabrielmbmb in #857
New Contributors
- @franperic made their first contribution in #844
Full Changelog: 1.2.4...1.3.0
1.2.4
What's Changed
- Update `InferenceEndpointsLLM` to use `chat_completion` method by @gabrielmbmb in #815
Full Changelog: 1.2.3...1.2.4