Integration instructor (#654)
* New module for the integration with instructor

* Move common functions related to structured outputs to their own module

* Draft instructor integration with openai

* Add tests for openai integration

* Add unit tests for the instructor integrations

* Add tests for anthropic integration

* Fix including anthropic wrapper

* Update llms to deal with instructor

* Update dependencies with instructor

* Run tests with instructor only on python>=3.9

* Fix circular import with create_distiset

* Define _prepare_structured_output as staticmethod

* Remove rewritten variable

* Remove dead code

* Check on Enum.value instead of Enum class as it isn't pickleable

* Add tests for utilities related to generation of BaseModel objects from json schema dicts

* Add fix to deal with nested BaseModel objects

* Fix call from instructor, this should be done on instructor end, but works for the moment

* Add docstrings and typing info

* Add script to generate a sample dataset and visualize the result

* Update the docstring of the structured output expected format

* Add reference in the docs to structured outputs with instructor

* Add reference to the dependency installation

* Update typing info

* Fix test with new mocked client for mistral

* Update docs/sections/learn/advanced/structured_generation.md

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

* Update docs/sections/learn/advanced/structured_generation.md

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

* Update src/distilabel/steps/tasks/structured_outputs/instructor.py

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

* Update src/distilabel/steps/tasks/structured_outputs/instructor.py

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

* Update src/distilabel/steps/tasks/structured_outputs/utils.py

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

* Update src/distilabel/steps/tasks/structured_outputs/instructor.py

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

* Update src/distilabel/steps/tasks/structured_outputs/utils.py

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

* Update src/distilabel/steps/tasks/structured_outputs/utils.py

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

* Update src/distilabel/steps/tasks/structured_outputs/utils.py

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

* Add changes from code review

* Fix type hint per code review

* Update docs/sections/learn/advanced/structured_generation.md

Co-authored-by: Alvaro Bartolome <alvaro@argilla.io>

* Remove repeated line

---------

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
Co-authored-by: Alvaro Bartolome <alvaro@argilla.io>
3 people authored May 29, 2024
1 parent 7e9230b commit 01b4292
Showing 28 changed files with 1,472 additions and 171 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -47,7 +47,7 @@ jobs:
 pip install -e .[dev,tests,anthropic,argilla,cohere,groq,hf-inference-endpoints,hf-transformers,litellm,llama-cpp,ollama,openai,outlines,vertexai,vllm]
 if [ "${python_version}" != "(3, 8)" ]; then
-  pip install -e .[mistralai]
+  pip install -e .[mistralai,instructor]
 fi;
 pip install git+https://github.com/argilla-io/LLM-Blender.git
76 changes: 75 additions & 1 deletion docs/sections/learn/advanced/structured_generation.md
@@ -8,6 +8,14 @@

The [`LLM`][distilabel.llms.LLM] has an argument named `structured_output`[^1] that determines how we can generate structured outputs with it, let's see an example using [`LlamaCppLLM`][distilabel.llms.LlamaCppLLM].

!!! Note

    For `outlines` integration to work you may need to install the corresponding dependencies:

    ```bash
    pip install distilabel[outlines]
    ```

### JSON

We will start with a JSON example, where we initially define a `pydantic.BaseModel` schema to guide the generation of the structured output.
@@ -101,7 +109,7 @@ if match:

These were some simple examples, but one can see the options this opens.

-!!! NOTE
+!!! Tip
    A full pipeline example can be seen in the following script:
    [`examples/structured_generation_with_outlines.py`](../../pipeline_samples/examples/index.md#llama-cpp-with-outlines)

@@ -119,6 +127,72 @@ These were some simple examples, but one can see the options this opens.
curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf
```

## Instructor

When working with model providers behind an API, there's no direct way of accessing the internal logit processor as `outlines` does, but thanks to [`instructor`](https://python.useinstructor.com/) we can still generate structured output from LLM providers. We have integrated `instructor` with the [`AsyncLLM`][distilabel.llms.AsyncLLM], so you can use it with the following LLMs: [`OpenAILLM`][distilabel.llms.OpenAILLM], [`AzureOpenAILLM`][distilabel.llms.AzureOpenAILLM], [`CohereLLM`][distilabel.llms.CohereLLM], [`GroqLLM`][distilabel.llms.GroqLLM], [`LiteLLM`][distilabel.llms.LiteLLM] and [`MistralLLM`][distilabel.llms.MistralLLM].

`instructor` works with `pydantic.BaseModel` objects internally, but in `distilabel` the generated examples are returned as the string representation of those objects, from which the `BaseModel` object can be regenerated.

!!! Note
    For `instructor` integration to work you may need to install the corresponding dependencies:

    ```bash
    pip install distilabel[instructor]
    ```

!!! Note
    Take a look at [`InstructorStructuredOutputType`][distilabel.steps.tasks.structured_outputs.instructor.InstructorStructuredOutputType] to see the expected format
    of the `structured_output` dict variable.

The following is the same example shown in the `JSON` section for `outlines`, repeated here for comparison.

```python
from pydantic import BaseModel

class User(BaseModel):
    name: str
    last_name: str
    id: int
```

And then we provide that schema to the `structured_output` argument of the LLM:

!!! Note
    In this example we are using *open-mixtral-8x22b*. Keep in mind that not all models support the function calling functionality required for this example to work.

```python
from distilabel.llms import MistralLLM

llm = MistralLLM(
    model="open-mixtral-8x22b",
    structured_output={"schema": User}
)
llm.load()
```

And we are ready to pass our instruction as usual:

```python
import json

result = llm.generate(
    [[{"role": "user", "content": "Create a user profile for the following marathon"}]],
    max_new_tokens=256
)

data = json.loads(result[0][0])
data
# {'name': 'John', 'last_name': 'Doe', 'id': 12345}
User(**data)
# User(name='John', last_name='Doe', id=12345)
```

We get back a JSON-formatted string that we can parse into a Python dictionary using `json.loads`, and then validate by passing it to `User`, which is a `pydantic.BaseModel` subclass.
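Since the generation comes back as a string, parsing and validation can also be done in one step. The following is a minimal sketch, assuming pydantic v2 (`model_validate_json` and `ValidationError` are pydantic APIs); the `parse_user` helper is a hypothetical name used only for illustration and is not part of `distilabel`.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class User(BaseModel):
    name: str
    last_name: str
    id: int


def parse_user(raw: str) -> Optional[User]:
    """Hypothetical helper: rebuild a `User` from a generation string.

    Returns None when the string is not valid JSON or does not match the schema.
    """
    try:
        # Parses the JSON text and validates it against the schema in one call.
        return User.model_validate_json(raw)
    except ValidationError:
        return None


user = parse_user('{"name": "John", "last_name": "Doe", "id": 12345}')
print(user)

# A generation that is missing required fields is rejected instead of raising.
print(parse_user('{"name": "John"}'))
```

Returning `None` for invalid generations is only one possible choice; in a pipeline you may prefer to log or re-raise the error.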

!!! Tip
    A full pipeline example can be seen in the following script:
    [`examples/structured_generation_with_instructor.py`](../../pipeline_samples/examples/index.md#mistralai-with-instructor)

## OpenAI JSON

OpenAI offers a [JSON Mode](https://platform.openai.com/docs/guides/text-generation/json-mode) to deal with structured output via their API. Let's see how to make use of it. JSON mode instructs the model to always return a JSON object that follows the given instruction.
41 changes: 40 additions & 1 deletion docs/sections/pipeline_samples/examples/index.md
@@ -2,7 +2,7 @@

This section contains different example pipelines that showcase different tasks, maybe you can take inspiration from them.

-### [llama.cpp with outlines](#llama-cpp-with-outlines)
+### [llama.cpp with `outlines`](#llama-cpp-with-outlines)

Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`.

@@ -21,3 +21,42 @@ Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `dis
```python title="structured_generation_with_outlines.py"
--8<-- "examples/structured_generation_with_outlines.py"
```


### [MistralAI with `instructor`](#mistralai-with-instructor)

Answer instructions with knowledge graphs defined as `pydantic.BaseModel` objects using `instructor` in `distilabel`.

??? Example "See example"

    This script makes use of [`MistralLLM`][distilabel.llms.mistral.MistralLLM] and the structured output capabilities provided by [`instructor`](https://python.useinstructor.com/) to generate knowledge graphs from complex topics.

    This example is adapted from this [awesome example](https://python.useinstructor.com/examples/knowledge_graph/) from the `instructor` cookbook.

    ??? Run

        ```python
        python examples/structured_generation_with_instructor.py
        ```

    ```python title="structured_generation_with_instructor.py"
    --8<-- "examples/structured_generation_with_instructor.py"
    ```

    ??? "Visualizing the graphs"

        Want to see how to visualize the graphs? You can test it using the following script. Generate some samples on your own and take a look:

        !!! NOTE

            This example uses graphviz to render the graph. You can install it with `pip` in the following way:

            ```console
            pip install graphviz
            ```

        ```python
        python examples/draw_kg.py 2  # You can pass 0,1,2 to visualize each of the samples.
        ```

        ![Knowledge graph figure](../../../assets/images/sections/examples/knowledge-graph-example.png)
82 changes: 82 additions & 0 deletions examples/draw_kg.py
@@ -0,0 +1,82 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
from typing import Any, Dict, List, Union

from graphviz import Digraph
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int
    label: str
    color: str


class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"


class KnowledgeGraph(BaseModel):
    nodes: List[Node] = Field(..., default_factory=list)
    edges: List[Edge] = Field(..., default_factory=list)


def visualize_knowledge_graph(kg: KnowledgeGraph):
    dot = Digraph(comment="Knowledge Graph")

    # Add nodes
    for node in kg.nodes:
        dot.node(str(node.id), node.label, color=node.color)

    # Add edges
    for edge in kg.edges:
        dot.edge(
            str(edge.source),
            str(edge.target),
            label=edge.label,
            color=edge.color or "black",
        )

    # Render the graph
    dot.render("knowledge_graph.gv", view=True)


def create_knowledge_graph(data: str) -> Union[KnowledgeGraph, None]:
    data: Dict[str, Any] = json.loads(data)

    nodes = [Node(**node) for node in data["nodes"]]
    edges = []
    for edge in data["edges"]:
        if edge.get("color") is None:
            edge["color"] = "black"
        edges.append(Edge(**edge))

    return KnowledgeGraph(nodes=nodes, edges=edges)


if __name__ == "__main__":
    import sys

    args = sys.argv[1:]

    from datasets import load_dataset

    ds = load_dataset("distilabel-internal-testing/knowledge_graphs", split="train")
    graphs = [create_knowledge_graph(g) for g in ds["generation"]]
    visualize_knowledge_graph(graphs[int(args[0])])
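To make the data flow of `examples/draw_kg.py` easier to follow without installing `graphviz`, `pydantic` or `datasets`, here is a dependency-free sketch of the same idea: take one JSON-encoded `generation` string and turn it into Graphviz DOT source text. The `knowledge_graph_to_dot` function and the sample payload are illustrative assumptions, not part of this commit, and `json.dumps` is used only as a rough quoting helper rather than a complete DOT escaper.

```python
import json


def knowledge_graph_to_dot(raw: str) -> str:
    """Hypothetical helper: convert a JSON knowledge graph into DOT source text."""
    data = json.loads(raw)
    lines = ["digraph {"]

    for node in data.get("nodes", []):
        label = json.dumps(node["label"], ensure_ascii=False)
        color = json.dumps(node["color"], ensure_ascii=False)
        lines.append(f"    {node['id']} [label={label}, color={color}]")

    for edge in data.get("edges", []):
        label = json.dumps(edge["label"], ensure_ascii=False)
        # Mirror the script's fallback: a missing or null color becomes "black".
        color = json.dumps(edge.get("color") or "black", ensure_ascii=False)
        lines.append(
            f"    {edge['source']} -> {edge['target']} [label={label}, color={color}]"
        )

    lines.append("}")
    return "\n".join(lines)


# Made-up sample in the shape the KnowledgeGraph schema describes.
sample = json.dumps(
    {
        "nodes": [
            {"id": 1, "label": "Homer", "color": "blue"},
            {"id": 2, "label": "Bart", "color": "red"},
        ],
        "edges": [{"source": 1, "target": 2, "label": "father of", "color": None}],
    }
)
dot_source = knowledge_graph_to_dot(sample)
print(dot_source)
```

The resulting text can be pasted into any DOT renderer; the script above instead relies on the `graphviz` package to build and render the graph directly.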
87 changes: 87 additions & 0 deletions examples/structured_generation_with_instructor.py
@@ -0,0 +1,87 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import List

from distilabel.llms import MistralLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int
    label: str
    color: str


class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"


class KnowledgeGraph(BaseModel):
    nodes: List[Node] = Field(..., default_factory=list)
    edges: List[Edge] = Field(..., default_factory=list)


with Pipeline(
    name="Knowledge-Graphs",
    description=(
        "Generate knowledge graphs to answer questions, this type of dataset can be used to "
        "steer a model to answer questions with a knowledge graph."
    ),
) as pipeline:
    sample_questions = [
        "Teach me about quantum mechanics",
        "Who is who in The Simpsons family?",
        "Tell me about the evolution of programming languages",
    ]

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": "You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.",
                "instruction": f"{question}",
            }
            for question in sample_questions
        ],
    )

    text_generation = TextGeneration(
        name="knowledge_graph_generation",
        llm=MistralLLM(
            model="open-mixtral-8x22b", structured_output={"schema": KnowledgeGraph}
        ),
        input_batch_size=8,
        output_mappings={"model_name": "generation_model"},
    )
    load_dataset >> text_generation


if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            text_generation.name: {
                "llm": {"generation_kwargs": {"max_new_tokens": 2048}}
            }
        },
        use_cache=False,
    )

distiset.push_to_hub("distilabel-internal-testing/knowledge_graphs")
1 change: 1 addition & 0 deletions pyproject.toml
@@ -66,6 +66,7 @@ cohere = ["cohere >= 5.2.0"]
groq = ["groq >= 0.4.1"]
hf-inference-endpoints = ["huggingface_hub >= 0.19.0"]
hf-transformers = ["transformers >= 4.34.1", "torch >= 2.0.0"]
instructor = ["instructor >= 1.2.3"]
litellm = ["litellm >= 1.30.0"]
llama-cpp = ["llama-cpp-python >= 0.2.0"]
mistralai = ["mistralai >= 0.1.0"]
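Because `instructor` is declared as an optional extra, code that relies on it may want to check that the package is importable before using it. A minimal sketch using only the standard library follows; `check_optional_dependencies` is a hypothetical helper name, and the install hint assumes the extra name matches the module name, which holds for `instructor` and `mistralai` in the list above but not necessarily for every extra.

```python
import importlib.util
from typing import Dict, Iterable


def check_optional_dependencies(modules: Iterable[str]) -> Dict[str, bool]:
    """Hypothetical helper: report which top-level modules can be imported."""
    # find_spec returns None when a top-level module cannot be located.
    return {name: importlib.util.find_spec(name) is not None for name in modules}


availability = check_optional_dependencies(["instructor", "mistralai"])
for name, installed in availability.items():
    if installed:
        print(f"{name}: installed")
    else:
        # Assumes the extra is named like the module (true for these two).
        print(f"{name}: missing, try `pip install 'distilabel[{name}]'`")
```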