
Releases: argilla-io/distilabel

1.5.2

22 Jan 10:48
8717dd1

What's Changed

  • Fix structured output JSON to pydantic.BaseModel and LiteLLM async completion client by @rolshoven in #1105

Full Changelog: 1.5.1...1.5.2

1.5.1

17 Jan 14:39
69bbe3d

Full Changelog: 1.5.0...1.5.1

1.5.0

17 Jan 08:28
b261b23

✨ Release highlights

🖼️ Image Generation Support

We're excited to introduce ImageGenerationModel, a new abstraction for working with image generation models. This addition enables seamless integration with models that can transform text prompts into images.

Available Services

  • 🤗 InferenceEndpointsImageGeneration: Integration with Hugging Face's Inference Endpoints
  • OpenAIImageGeneration: Integration with OpenAI's DALL-E

Architecture

Just as LLMs are used by a Task, we've introduced ImageTask as a high-level abstraction for image generation workflows. ImageTask defines how a step should use an ImageGenerationModel to accomplish specific image generation tasks.

Our first implementation, the ImageGeneration task, provides a straightforward interface: given a text prompt, it generates the corresponding image, leveraging any of the supported image generation models.
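
Putting it together, a minimal sketch (the import paths, the image_generation_model parameter, and the model id shown are illustrative; check the tutorial below for the canonical version):

from distilabel.models.image_generation import InferenceEndpointsImageGeneration
from distilabel.steps.tasks import ImageGeneration

# Pair the ImageGeneration task with any supported image generation model
image_generation = ImageGeneration(
    image_generation_model=InferenceEndpointsImageGeneration(
        model_id="black-forest-labs/FLUX.1-schnell"  # illustrative model
    ),
)
image_generation.load()

# Given a text prompt, the task yields the corresponding image
result = next(image_generation.process([{"prompt": "a white siamese cat"}]))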

We've also added a small tutorial on how to generate images using distilabel: distilabel - Tutorials - Image generation with distilabel

Images as inputs for LLMs

We've added initial support for providing images as input to an LLM through the new TextGenerationWithImage task. We've updated and tested InferenceEndpointsLLM and OpenAILLM with this new task, and we'll add image-as-input compatibility for others such as vLLM in upcoming releases.
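
A minimal sketch with InferenceEndpointsLLM, assuming the image_type parameter and the vision model id shown below (see the tutorial for details):

from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGenerationWithImage

task = TextGenerationWithImage(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Llama-3.2-11B-Vision-Instruct"  # illustrative vision model
    ),
    image_type="url",  # the "image" input column contains URLs
)
task.load()
result = next(
    task.process(
        [{"instruction": "Describe the image briefly.", "image": "https://example.com/image.jpg"}]
    )
)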

Check the tutorial distilabel - Tutorials - Text generation with images in distilabel to get started!

💻 New MlxLLM integration

We've integrated mlx-lm package with the new MlxLLM class, enabling native machine learning acceleration on Apple Silicon Macs. This integration supercharges synthetic data generation by leveraging MLX's highly optimized framework designed specifically for the M-series chips.
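
A minimal sketch, assuming the path_or_hf_repo parameter and the generate_outputs helper shown below:

from distilabel.models import MlxLLM

llm = MlxLLM(path_or_hf_repo="mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
llm.load()
# generate_outputs takes a list of chat-formatted conversations
output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Write a haiku about Apple Silicon."}]],
)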

New InstructionResponsePipeline template

We've started making changes so that distilabel is easier to use from minute one. We'll be adding presets or templates that allow you to quickly get a pipeline with sensible preconfigured defaults for generating data for certain tasks. The first one we've worked on is the SFT, or Instruction Response tuning, pipeline, which you can use like this:

from distilabel.pipeline import InstructionResponsePipeline

pipeline = InstructionResponsePipeline()
distiset = pipeline.run()

Define load stages

We've added a way for users to define which steps of the pipeline should be loaded together, allowing for more efficient resource management and better control over the execution flow. This new feature is particularly useful in scenarios where resource-constrained environments limit the ability to execute all steps simultaneously, requiring steps to be executed in distinct stages.

We've added a detailed guide on how to use this feature: distilabel - How-to guides - Load groups and execution stages.
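
As a quick illustration, here is a minimal sketch in which each inner list passed to load_groups is a stage whose steps are loaded together (the dataset and step layout are illustrative):

from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="pipeline-with-stages") as pipeline:
    loader = LoadDataFromHub(
        repo_id="distilabel-internal-testing/instruction-dataset-mini", split="test"
    )
    generation = TextGeneration(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct")
    )
    loader >> generation

# Each inner list is a load group: its steps are loaded together in one stage,
# so the two steps below never occupy resources at the same time.
distiset = pipeline.run(
    load_groups=[
        [loader.name],
        [generation.name],
    ],
)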

Full Changelog: 1.4.2...1.5.0

1.4.2

18 Dec 16:42
bfc8445

Full Changelog: 1.4.1...1.4.2

1.4.1

16 Oct 07:30
844165f

What's Changed

  • Fix not handling list of all primitive types in SignatureMixin by @gabrielmbmb in #1037

Full Changelog: 1.4.0...1.4.1

1.4.0

08 Oct 14:53
c0d798a

✨ Release highlights

Offline Batch Generation and OpenAI Batch API

We've updated the LLM interface so that LLMs using an external platform that offers a batch service can be integrated into distilabel. In addition, OpenAILLM has been updated so it can use the OpenAI Batch API to get 50% cost reductions.

[Video: distilabel-offline-batch-generation.mp4]
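
Enabling it is a single flag on the LLM; a minimal sketch, assuming the use_offline_batch_generation attribute shown below:

from distilabel.llms import OpenAILLM

llm = OpenAILLM(
    model="gpt-4o",
    # Route generations through the OpenAI Batch API (50% cheaper, but asynchronous)
    use_offline_batch_generation=True,
)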

Improved cache for maximum output reusability

We all know that running LLMs is costly, and most of the time we want to reuse their outputs as much as possible. Before this release, distilabel's cache mechanism made it possible to recover a pipeline execution that was stopped before finishing, and to re-create the Distiset generated by a pipeline that finished and was re-executed.

In this release, we've greatly improved the cache so that the outputs of all the Steps are cached and can therefore be reused in other pipeline executions, even if the pipeline has changed.


In addition, we've added a use_cache attribute to the Steps that allows toggling the use of the cache at the step level.
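
For example, a minimal sketch disabling the cache for a single step (KeepColumns chosen just for illustration):

from distilabel.steps import KeepColumns

# use_cache=False forces this step to recompute its outputs instead of reusing
# cached ones, even if nothing in the pipeline changed
keep = KeepColumns(
    columns=["instruction", "generation"],
    use_cache=False,
)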

Steps can generate artifacts

In some cases, a Step produces additional artifacts that are used to generate its outputs. These artifacts can take some time to generate and could be reused in the future. That's why we've added a new method called Step.save_artifact that can be called within the step to store the artifacts it generates. These artifacts will also be uploaded to the Hugging Face Hub.

from typing import List, TYPE_CHECKING
from distilabel.steps import GlobalStep, StepInput
import matplotlib.pyplot as plt

if TYPE_CHECKING:
    from distilabel.steps import StepOutput


class CountTextCharacters(GlobalStep):
    @property
    def inputs(self) -> List[str]:
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        return ["text_character_count"]

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        character_counts = []

        for input in inputs:
            text_character_count = len(input["text"])
            input["text_character_count"] = text_character_count
            character_counts.append(text_character_count)

        # Generate plot with the distribution of text character counts
        plt.figure(figsize=(10, 6))
        plt.hist(character_counts, bins=30, edgecolor="black")
        plt.title("Distribution of Text Character Counts")
        plt.xlabel("Character Count")
        plt.ylabel("Frequency")

        # Save the plot as an artifact of the step
        self.save_artifact(
            name="text_character_count_distribution",
            write_function=lambda path: plt.savefig(path / "figure.png"),
            metadata={"type": "image", "library": "matplotlib"},
        )

        plt.close()

        yield inputs

New Tasks: CLAIR, APIGen and many more!

  • New CLAIR task: CLAIR uses an AI system to minimally revise a solution A→A' such that the resulting preference pair (A, A') is much more contrastive and precise.
  • New tasks to replicate APIGen framework: APIGenGenerator, APIGenSemanticChecker, APIGenExecutionChecker. These tasks allow generating datasets like the one presented in the paper: APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
  • New URIAL task that allows using non-instruct models to generate a response for an instruction.
  • New TextClassification task for zero-shot text classification based on a predefined but highly customizable prompt (see the sketch after this list).
  • TextClustering, to generate clusters from text and group your generations, discovering labels from your data. Comes with 2 steps to run UMAP and DBSCAN algorithms.
  • Updated TextGeneration to simplify customization of tasks that don’t require further post-processing.
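
As referenced above, a minimal sketch of TextClassification, assuming the context and available_labels parameters shown (the labels and model are illustrative):

from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import TextClassification

task = TextClassification(
    llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
    context="Classify the sentiment of the customer review.",  # illustrative
    available_labels=["positive", "negative", "neutral"],
)
task.load()
result = next(task.process([{"text": "Great value for the price!"}]))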

New Steps to sample data in your pipelines and remove duplicates

  • New DataSampler step to sample data from other datasets, which can be useful for injecting varied few-shot examples into your prompts.
  • New EmbeddingDedup step to remove duplicates based on embeddings and a distance metric.
  • New MinHashDedup step to remove near-duplicates from text based on the MinHash and MinHashLSH algorithms (see the sketch after this list).
  • New TruncateTextColumns to truncate your texts to either a maximum character length or a maximum number of tokens based on a tokenizer.
  • New CombineOutputs to combine the outputs of two or more steps into a single output.
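
As referenced above, a minimal sketch of MinHashDedup, assuming the parameter and output column names shown below:

from distilabel.steps import MinHashDedup

dedup = MinHashDedup(threshold=0.9, storage="dict")  # assumed parameter names
dedup.load()
batch = [
    {"text": "distilabel is a framework for synthetic data generation."},
    {"text": "distilabel is a framework for synthetic data generation!"},
]
# Near-duplicates are flagged rather than dropped (assumed output column name)
result = next(dedup.process(batch))
print([row["keep_row_after_minhash_filtering"] for row in result])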

Generate text embeddings using vLLM
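
vLLM can now also be used to compute text embeddings. A minimal sketch, assuming the vLLMEmbeddings class, import path, and encode method shown below:

from distilabel.embeddings import vLLMEmbeddings

embeddings = vLLMEmbeddings(model="intfloat/e5-mistral-7b-instruct")
embeddings.load()
vectors = embeddings.encode(inputs=["Hello, world!", "Another sentence."])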


1.3.2

23 Aug 13:15
ed88585

Full Changelog: 1.3.1...1.3.2

1.3.1

07 Aug 09:09
268358b

What's Changed

  • Create new distilabel.constants module to store constants and avoid circular imports by @plaguss in #861
  • Add OpenAI request timeout by @ashim-mahara in #858

Full Changelog: 1.3.0...1.3.1

1.3.0

06 Aug 14:16
63f948b

Full Changelog: 1.2.4...1.3.0

1.2.4

23 Jul 16:03
add2b6e

What's Changed

  • Update InferenceEndpointsLLM to use chat_completion method by @gabrielmbmb in #815

Full Changelog: 1.2.3...1.2.4