Bespoke Curator

Data Curation for Post-Training & Structured Data Extraction


Overview

Bespoke Curator makes it easy to create synthetic data pipelines. Whether you are training a model or extracting structured data, Curator will prepare high-quality data quickly and robustly.

  • Rich Python-based library for generating and curating synthetic data.
  • Interactive viewer to monitor data while it is being generated.
  • First-class support for structured outputs.
  • Built-in performance optimizations for asynchronous operations, caching, and fault recovery at every scale.
  • Support for a wide range of inference options via LiteLLM, vLLM, and popular batch APIs.

CLI in action

Check out our full documentation for getting started, tutorials, guides and detailed reference.

🛠️ Installation

pip install bespokelabs-curator

🚀 Quickstart

Using curator.LLM

from bespokelabs import curator
llm = curator.LLM(model_name="gpt-4o-mini")
poem = llm("Write a poem about the importance of data in AI.")
print(poem.to_pandas())

Note

Retries and caching are enabled by default to help you iterate rapidly on your data pipelines. If you run the same prompt again, you get the same response almost instantly. You can delete the cache at ~/.cache/curator or disable it with export CURATOR_DISABLE_CACHE=true.
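
The same cache controls can be applied from Python instead of the shell. This is a minimal sketch that assumes only the cache path and the environment variable named in the note above; the helper names `disable_cache` and `clear_cache` are hypothetical and not part of Curator.

```python
import os
import shutil
from pathlib import Path

# Default cache location named in the note above.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "curator"


def disable_cache(env=None):
    """Equivalent of `export CURATOR_DISABLE_CACHE=true` for this process."""
    env = os.environ if env is None else env
    env["CURATOR_DISABLE_CACHE"] = "true"


def clear_cache(cache_dir=DEFAULT_CACHE_DIR):
    """Delete the cache directory; return True if something was removed."""
    cache_dir = Path(cache_dir)
    if not cache_dir.exists():
        return False
    shutil.rmtree(cache_dir)
    return True
```

Calling `disable_cache()` before creating any `curator.LLM` object is the conservative choice, since this sketch does not assume when Curator reads the variable.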

Important

Make sure to set the API keys for the models you are calling as environment variables. For example, running export OPENAI_API_KEY=sk-... and export ANTHROPIC_API_KEY=ant-... lets you call OpenAI and Anthropic models. A full list of supported models and their associated environment variable names can be found in the LiteLLM docs.
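
A quick pre-flight check can catch a missing key before any request is sent. The following is a small sketch; `missing_api_keys` is a hypothetical helper, and the two variable names are the ones used in the example above.

```python
import os


def missing_api_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Report which of the keys from the example above still need to be exported.
missing = missing_api_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY"])
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```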

You can also send a list of prompts to the LLM, or a HuggingFace Dataset object (see below for more details).

Using structured outputs and custom prompting and parsing logic

Here's an example of using structured outputs and custom prompting and parsing logic.

from typing import Dict, List
from pydantic import BaseModel, Field
from bespokelabs import curator
from datasets import Dataset

class Poem(BaseModel):
    title: str = Field(description="The title of the poem.")
    poem: str = Field(description="The content of the poem.")

class Poet(curator.LLM):
    # Ask the model to return output that matches the Poem schema.
    response_format = Poem

    def prompt(self, input: Dict) -> str:
        # Build the prompt for one input row.
        return f"Write two poems about {input['topic']}."

    def parse(self, input: Dict, response: Poem) -> List[Dict]:
        # Return a list of rows so the output converts cleanly to a Dataset.
        return [{"title": response.title, "poem": response.poem}]

poet = Poet(model_name="gpt-4o-mini")
topics = Dataset.from_dict({'topic': ['Dreams of a Robot']})

poems = poet(topics)
print(poems.to_pandas())

Output:

    title                           poem
0   Dreams of a Robot: Awakening    In circuits deep, where silence hums, \nA dre..
1   Life of an AI Agent - Poem 1    In circuits woven, thoughts ignite,\nI dwell i...

In the Poet class:

  • response_format is the structured output class we defined above.
  • prompt takes the input (input) and returns the prompt for the LLM.
  • parse takes the input (input) and the structured output (response) and converts it to a list of dictionaries. This is so that we can easily convert the output to a HuggingFace Dataset object.
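
The division of labor between prompt and parse can be sketched without calling any model. This is an illustration only, not Curator's implementation: `run_rows` and `fake_model` are hypothetical stand-ins, and a dataclass replaces the Pydantic model to keep the sketch dependency-free. It shows why parse returns a list: each input row can expand into any number of output rows, which are then flattened into one table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Poem:
    title: str
    poem: str


def prompt(row: Dict) -> str:
    # Turn one input row into the text sent to the model.
    return f"Write a poem about {row['topic']}."


def parse(row: Dict, response: Poem) -> List[Dict]:
    # One input row may yield any number of output rows.
    return [{"topic": row["topic"], "title": response.title, "poem": response.poem}]


def fake_model(prompt_text: str) -> Poem:
    # Hypothetical stand-in for the LLM call; returns a canned response.
    return Poem(title="Untitled", poem=f"(response to: {prompt_text})")


def run_rows(rows: Iterable[Dict], model: Callable[[str], Poem]) -> List[Dict]:
    output: List[Dict] = []
    for row in rows:
        response = model(prompt(row))
        output.extend(parse(row, response))  # flatten the per-row lists
    return output


rows = run_rows([{"topic": "Dreams of a Robot"}, {"topic": "Algebra"}], fake_model)
print(len(rows))  # 2
```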

Note that topics can be created with another LLM class as well, and we can scale this up to create tens of thousands of diverse poems.

from typing import Dict, List

class Topics(BaseModel):
    topics_list: List[str] = Field(description="A list of topics.")

class TopicGenerator(curator.LLM):
    response_format = Topics

    def prompt(self, subject: str) -> str:
        return f"Return 3 topics related to {subject}"

    def parse(self, input: str, response: Topics) -> List[Dict]:
        # One output row per topic, keyed by "topic" so Poet.prompt can read it.
        return [{"topic": t} for t in response.topics_list]

topic_generator = TopicGenerator(model_name="gpt-4o-mini")
topics = topic_generator("Mathematics")
poems = poet(topics)

Output:

    title                      poem
0   The Language of Algebra    In symbols and signs, truths intertwine,..
1   The Geometry of Space      In the world around us, shapes do collide,..
2   The Language of Logic      In circuits and wires where silence speaks,..

You can see more examples in the examples directory.

See the docs for more details as well as for troubleshooting information.

Tip

If you are generating large datasets, you may want to use batch mode to save costs. Currently batch APIs from OpenAI and Anthropic are supported. With curator this is as simple as setting batch=True in the LLM class.

Anonymized Telemetry

We collect minimal, anonymized usage telemetry to help prioritize new features and improvements that benefit the Curator community. You can opt out by setting the TELEMETRY_ENABLED environment variable to False.

📖 Providers

Curator supports a wide range of providers, including OpenAI, Anthropic, and many more.

OpenAI backend

llm = curator.LLM(
    model_name="gpt-4o-mini",
)

For other models that support OpenAI-compatible APIs, you can use the openai backend:

llm = curator.LLM(
    model_name="gpt-4o-mini",
    backend="openai",
    backend_params={
        "base_url": "https://your-openai-compatible-api-url",
        "api_key": "<YOUR_OPENAI_COMPATIBLE_SERVICE_API_KEY>",  # replace with your key
    },
)
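
To avoid hard-coding the key in source, the backend_params dictionary can be built from an environment variable. This is a sketch under assumptions: `openai_compatible_params` is a hypothetical helper, and `MY_SERVICE_API_KEY` is a made-up variable name that you would replace with your own.

```python
import os


def openai_compatible_params(base_url, key_env_var, env=None):
    """Build a backend_params dict, reading the API key from the environment."""
    env = os.environ if env is None else env
    api_key = env.get(key_env_var)
    if not api_key:
        raise RuntimeError(f"Environment variable {key_env_var} is not set")
    return {"base_url": base_url, "api_key": api_key}


# Example with an explicit mapping instead of the real environment.
params = openai_compatible_params(
    "https://your-openai-compatible-api-url",
    "MY_SERVICE_API_KEY",  # hypothetical variable name
    env={"MY_SERVICE_API_KEY": "example-key"},
)
```

The resulting dictionary has the same two keys as the backend_params shown above, so it can be passed in their place.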

LiteLLM (Anthropic, Gemini, together.ai, etc.)

Here is an example of using Gemini with litellm backend:

llm = curator.LLM(
    model_name="gemini/gemini-1.5-flash",
    backend="litellm",
    backend_params={
        "max_requests_per_minute": 2_000,
        "max_tokens_per_minute": 4_000_000
    },
)
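
The two rate limits jointly bound throughput, and a rough lower bound on run time follows from simple arithmetic. This sketch ignores latency, retries, and however Curator actually schedules requests; `min_minutes` is a hypothetical helper and the average tokens per request is an assumed input.

```python
def min_minutes(num_requests, avg_tokens_per_request,
                max_requests_per_minute, max_tokens_per_minute):
    """Lower bound on wall-clock minutes implied by the two rate limits."""
    by_requests = num_requests / max_requests_per_minute
    by_tokens = num_requests * avg_tokens_per_request / max_tokens_per_minute
    # Whichever limit is tighter determines the bound.
    return max(by_requests, by_tokens)


# 100,000 requests of about 1,000 tokens each under the limits shown above.
print(min_minutes(100_000, 1_000, 2_000, 4_000_000))  # 50.0
```

In this example the request limit is the bottleneck: the token limit alone would allow the job to finish in 25 minutes.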

Documentation

Ollama

llm = curator.LLM(
    model_name="ollama/llama3.1:8b",  # Ollama model identifier
    backend_params={"base_url": "http://localhost:11434"},
)

Documentation

vLLM

llm = curator.LLM( 
    model_name="Qwen/Qwen2.5-3B-Instruct", 
    backend="vllm", 
    backend_params={ 
        "tensor_parallel_size": 1, # Adjust based on GPU count 
        "gpu_memory_utilization": 0.7 
    }
)

Documentation

DeepSeek

DeepSeek offers an OpenAI-compatible API that you can use with the openai backend.

Important

The DeepSeek API is experiencing intermittent issues and returns empty responses during times of high traffic. We recommend calling the DeepSeek API through the openai backend with a high max_retries, so that requests that come back empty are retried, and with moderate max_requests_per_minute and max_tokens_per_minute limits, so that retries do not overwhelm the API.

llm = curator.LLM(
    model_name="deepseek-reasoner",
    generation_params={"temperature": 0.0},
    backend_params={
        "max_requests_per_minute": 100,
        "max_tokens_per_minute": 10_000_000,
        "base_url": "https://api.deepseek.com/",
        "api_key": "<YOUR_DEEPSEEK_API_KEY>",  # replace with your key
        "max_retries": 50,
    },
    backend="openai",
)
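
The retry-on-empty-response idea can be illustrated in isolation. The following is a generic sketch, not Curator's internal retry logic: `call_with_retries` is a hypothetical helper, and the exponential backoff with a cap is one common choice among several.

```python
import time
from typing import Callable, Optional


def call_with_retries(call: Callable[[], str], max_retries: int = 50,
                      base_delay: float = 1.0, max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call `call` until it returns a non-empty string or retries run out."""
    for attempt in range(max_retries + 1):
        response = call()
        if response and response.strip():
            return response
        if attempt < max_retries:
            # Exponential backoff, capped so waits stay bounded.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return None
```

Backing off between attempts serves the same goal as the rate limits above: a burst of empty responses does not turn into a burst of immediate retries.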

kluster.ai

llm = curator.LLM(
    model_name="deepseek-ai/DeepSeek-R1", 
    backend="klusterai",
)

Documentation

📦 Batch Mode

Several providers offer a discount of about 50% on token usage in batch mode. Curator makes it easy to use batch mode with a wide range of providers.

Example with OpenAI (docs reference):

llm = curator.LLM(model_name="gpt-4o-mini", batch=True)
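
To get a feel for what a discount of about 50% means for a large job, here is a back-of-the-envelope sketch. The price of $1.00 per million tokens is purely hypothetical, `estimate_cost` is not part of Curator, and actual discounts vary by provider.

```python
def estimate_cost(num_tokens, price_per_million_tokens, batch=False, batch_discount=0.5):
    """Estimate cost in dollars, optionally applying a batch discount."""
    cost = num_tokens / 1_000_000 * price_per_million_tokens
    return cost * (1 - batch_discount) if batch else cost


# 200 million tokens at a hypothetical $1.00 per million tokens.
print(estimate_cost(200_000_000, 1.00))              # 200.0
print(estimate_cost(200_000_000, 1.00, batch=True))  # 100.0
```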

See documentation:

Bespoke Curator Viewer

Viewer in action

To run the bespoke dataset viewer:

curator-viewer

This opens a browser window with the viewer running at 127.0.0.1:3000, unless you specify a different host and port.

The dataset viewer shows all the different runs you have made. Once a run is selected, you can see the dataset and the responses from the LLM.

Optional parameters to run the viewer on a different host and port:

$ curator-viewer -h
usage: curator-viewer [-h] [--host HOST] [--port PORT] [--verbose]

Curator Viewer

options:
  -h, --help     show this help message and exit
  --host HOST    Host to run the server on (default: localhost)
  --port PORT    Port to run the server on (default: 3000)
  --verbose, -v  Enables debug logging for more verbose output
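
If the viewer is started from a Python script, the command line can be assembled from the flags listed in the help output above. This is a small sketch: `viewer_command` and `launch_viewer` are hypothetical helpers, and only the `--host`, `--port`, and `--verbose` flags from the help text are used.

```python
import subprocess
from typing import List, Optional


def viewer_command(host: Optional[str] = None, port: Optional[int] = None,
                   verbose: bool = False) -> List[str]:
    """Build the argument list for curator-viewer; omitted options use defaults."""
    cmd = ["curator-viewer"]
    if host is not None:
        cmd += ["--host", host]
    if port is not None:
        cmd += ["--port", str(port)]
    if verbose:
        cmd.append("--verbose")
    return cmd


def launch_viewer(**kwargs) -> subprocess.Popen:
    """Start the viewer as a child process (requires curator-viewer on PATH)."""
    return subprocess.Popen(viewer_command(**kwargs))
```

For example, `viewer_command(port=8080, verbose=True)` yields `["curator-viewer", "--port", "8080", "--verbose"]`.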

The only requirement for running curator-viewer is Node.js. You can install it by following the instructions here.

For example, to check if you have node installed, you can run:

node -v

If it is not installed, you can install the latest Node.js on macOS by running:

# installs nvm (Node Version Manager)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.0/install.sh | bash
# download and install Node.js (you may need to restart the terminal)
nvm install 22
# verifies the right Node.js version is in the environment
node -v # should print `v22.11.0`
# verifies the right npm version is in the environment
npm -v # should print `10.9.0`
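
The node -v check can also be scripted, for instance as a guard before launching the viewer from Python. This is a minimal sketch; `node_version` is a hypothetical helper that relies only on the standard library.

```python
import shutil
import subprocess
from typing import Optional


def node_version(executable: str = "node") -> Optional[str]:
    """Return the output of `node -v`, or None if the executable is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "-v"], capture_output=True, text=True, check=False)
    return result.stdout.strip() or None
```

A return value of None means the installation steps above still need to be run.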

Contributing

Thank you to all the contributors for making this project possible! Please follow these instructions on how to contribute.

Citation

If you find Curator useful, please consider citing us!

@software{Curator: A Tool for Synthetic Data Creation,
  author = {Marten, Ryan* and Vu, Trung* and Cheng-Jie Ji, Charlie and Sharma, Kartik and Pimpalgaonkar, Shreyas and Dimakis, Alex and Sathiamoorthy, Maheswaran},
  month = jan,
  title = {{Curator}},
  year = {2025},
  howpublished = {\url{https://github.com/bespokelabsai/curator}}
}