Integration instructor (#654)

* New module for the integration with instructor
* Move common functions related to structured outputs to its own module
* Draft instructor integration with openai
* Add tests for openai integration
* Add unit tests for the instructor integrations
* Add tests for anthropic integration
* Fix including anthropic wrapper
* Update llms to deal with instructor
* Update dependencies with instructor
* Run tests with instructor only on python>=3.9
* Fix circular import with create_distiset
* Define _prepare_structured_output as staticmethod
* Remove rewritten variable
* Remove dead code
* Check on Enum.value instead of Enum class as it isn't pickleable
* Add tests for utilities related to generation of BaseModel objects from json schema dicts
* Add fix to deal with nested BaseModel objects
* Fix call from instructor, this should be done on instructor end, but works for the moment
* Add docstrings and typing info
* Add script to generate a sample dataset and visualize the result
* Update the docstring of the structured output expected format
* Add reference in the docs to structured outputs with instructor
* Add reference to the dependency installation
* Update typing info
* Fix test with new mocked client for mistral
* Update docs/sections/learn/advanced/structured_generation.md
  Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update docs/sections/learn/advanced/structured_generation.md
  Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/tasks/structured_outputs/instructor.py
  Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/tasks/structured_outputs/instructor.py
  Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/tasks/structured_outputs/utils.py
  Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/tasks/structured_outputs/instructor.py
  Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/tasks/structured_outputs/utils.py
  Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/tasks/structured_outputs/utils.py
  Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Update src/distilabel/steps/tasks/structured_outputs/utils.py
  Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
* Add changes from code review
* Fix type hint per code review
* Update docs/sections/learn/advanced/structured_generation.md
  Co-authored-by: Alvaro Bartolome <alvaro@argilla.io>
* Remove repeated line

---------

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
Co-authored-by: Alvaro Bartolome <alvaro@argilla.io>
argilla-io · May 29, 2024 · 01b4292
1 parent 7e9230b
commit 01b4292

Showing 28 changed files with 1,472 additions and 171 deletions.
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -47,7 +47,7 @@ jobs:
 
           pip install -e .[dev,tests,anthropic,argilla,cohere,groq,hf-inference-endpoints,hf-transformers,litellm,llama-cpp,ollama,openai,outlines,vertexai,vllm]
           if [ "${python_version}" != "(3, 8)" ]; then
-            pip install -e .[mistralai]
+            pip install -e .[mistralai,instructor]
           fi;
           pip install git+https://github.com/argilla-io/LLM-Blender.git
 

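The workflow hunk above installs the `instructor` extra only when `python_version` is not `"(3, 8)"`. How `python_version` is computed is not shown in this hunk; the sketch below assumes it is the string form of a `(major, minor)` tuple, and shows how a test module could apply the same gate using only the standard library. The helper `supports_instructor` and the test class are hypothetical names, not part of this commit.

```python
import sys
import unittest


def supports_instructor(version_info=sys.version_info) -> bool:
    """Return True when the interpreter is Python 3.9 or newer."""
    return tuple(version_info[:2]) >= (3, 9)


# Assumption: the workflow's `python_version` is str() of a (major, minor) tuple.
python_version = str(tuple(sys.version_info[:2]))


@unittest.skipUnless(supports_instructor(), "instructor tests need Python >= 3.9")
class TestInstructorIntegration(unittest.TestCase):
    def test_gate_matches_workflow_check(self):
        # On a supported interpreter the workflow string is never "(3, 8)".
        self.assertNotEqual(python_version, "(3, 8)")
```

Keeping the gate in one helper means the shell check and the test suite skip in the same situations.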
diff --git a/docs/assets/images/sections/examples/knowledge-graph-example.png b/docs/assets/images/sections/examples/knowledge-graph-example.png
diff --git a/docs/sections/learn/advanced/structured_generation.md b/docs/sections/learn/advanced/structured_generation.md
@@ -8,6 +8,14 @@
 
 The [`LLM`][distilabel.llms.LLM] has an argument named `structured_output`[^1] that determines how we can generate structured outputs with it, let's see an example using [`LlamaCppLLM`][distilabel.llms.LlamaCppLLM].
 
+!!! Note
+
+    For `outlines` integration to work you may need to install the corresponding dependencies:
+
+    ```bash
+    pip install distilabel[outlines]
+    ```
+
 ### JSON
 
 We will start with a JSON example, where we initially define a `pydantic.BaseModel` schema to guide the generation of the structured output.
@@ -101,7 +109,7 @@ if match:
 
 These were some simple examples, but one can see the options this opens.
 
-!!! NOTE
+!!! Tip
     A full pipeline example can be seen in the following script:
     [`examples/structured_generation_with_outlines.py`](../../pipeline_samples/examples/index.md#llama-cpp-with-outlines)
 
@@ -119,6 +127,72 @@ These were some simple examples, but one can see the options this opens.
     curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf
     ```
 
+## Instructor
+
+When working with model providers behind an API, there is no direct way to access the internal logit processor as `outlines` does, but thanks to [`instructor`](https://python.useinstructor.com/) we can still generate structured output from LLM providers. We have integrated `instructor` with the [`AsyncLLM`][distilabel.llms.AsyncLLM], so you can use it with the following LLMs: [`OpenAILLM`][distilabel.llms.OpenAILLM], [`AzureOpenAILLM`][distilabel.llms.AzureOpenAILLM], [`CohereLLM`][distilabel.llms.CohereLLM], [`GroqLLM`][distilabel.llms.GroqLLM], [`LiteLLM`][distilabel.llms.LiteLLM] and [`MistralLLM`][distilabel.llms.MistralLLM].
+
+`instructor` works with `pydantic.BaseModel` objects internally, but in `distilabel` each generation is returned as the string representation of that object, from which the `BaseModel` instance can be rebuilt.
+
+!!! Note
+    For `instructor` integration to work you may need to install the corresponding dependencies:
+
+    ```bash
+    pip install distilabel[instructor]
+    ```
+
+!!! Note
+    Take a look at [`InstructorStructuredOutputType`][distilabel.steps.tasks.structured_outputs.instructor.InstructorStructuredOutputType] to see the expected format
+    of the `structured_output` dict variable.
+
+The following is the same example shown in the `JSON` section for `outlines`, repeated here for comparison.
+
+```python
+from pydantic import BaseModel
+
+class User(BaseModel):
+    name: str
+    last_name: str
+    id: int
+```
+
+And then we provide that schema to the `structured_output` argument of the LLM:
+
+!!! Note
+    In this example we are using *open-mixtral-8x22b*. Keep in mind that not all models support the function calling required for this example to work.
+
+```python
+from distilabel.llms import MistralLLM
+
+llm = MistralLLM(
+    model="open-mixtral-8x22b",
+    structured_output={"schema": User}
+)
+llm.load()
+```
+
+And we are ready to pass our instruction as usual:
+
+```python
+import json
+
+result = llm.generate(
+    [[{"role": "user", "content": "Create a user profile for the following marathon"}]],
+    max_new_tokens=256
+)
+
+data = json.loads(result[0][0])
+data
+# {'name': 'John', 'last_name': 'Doe', 'id': 12345}
+User(**data)
+# User(name='John', last_name='Doe', id=12345)
+```
+
+We get back a JSON-formatted string that we can parse into a Python dictionary using `json.loads`, and then validate by passing it to `User`, our `pydantic.BaseModel` subclass.
+
+!!! Tip
+    A full pipeline example can be seen in the following script:
+    [`examples/structured_generation_with_instructor.py`](../../pipeline_samples/examples/index.md#mistralai-with-instructor)
+
 ## OpenAI JSON
 
 OpenAI offers a [JSON Mode](https://platform.openai.com/docs/guides/text-generation/json-mode) to deal with structured output via their API, let's see how to make use of them. The JSON mode instructs the model to always return a JSON object following the instruction required.

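The `instructor` section added in the hunk above parses the generation with `json.loads` and then builds `User(**data)`. As a minimal sketch, assuming pydantic v2 is installed (its `model_validate_json` method is not used in the original docs), the string can also be validated in a single step. The helper `parse_user` is a hypothetical name, and `raw_generation` is a hard-coded stand-in for `result[0][0]`, not real model output.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class User(BaseModel):
    name: str
    last_name: str
    id: int


def parse_user(raw: str) -> Optional[User]:
    """Validate a JSON string against `User`, returning None if it does not fit."""
    try:
        return User.model_validate_json(raw)
    except ValidationError:
        return None


# Stand-in for `result[0][0]`; not real model output.
raw_generation = '{"name": "John", "last_name": "Doe", "id": 12345}'
user = parse_user(raw_generation)

# A generation that is missing required fields is rejected instead of raising.
incomplete = parse_user('{"name": "John"}')
```

Returning `None` for invalid generations is only one possible policy; a pipeline might prefer to log or re-raise the `ValidationError` instead.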
diff --git a/docs/sections/pipeline_samples/examples/index.md b/docs/sections/pipeline_samples/examples/index.md
@@ -2,7 +2,7 @@
 
 This section contains different example pipelines that showcase different tasks, maybe you can take inspiration from them.
 
-### [llama.cpp with outlines](#llama-cpp-with-outlines)
+### [llama.cpp with `outlines`](#llama-cpp-with-outlines)
 
 Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`.
 
@@ -21,3 +21,42 @@ Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `dis
     ```python title="structured_generation_with_outlines.py"
     --8<-- "examples/structured_generation_with_outlines.py"
     ```
+
+
+### [MistralAI with `instructor`](#mistralai-with-instructor)
+
+Answer instructions with knowledge graphs defined as `pydantic.BaseModel` objects using `instructor` in `distilabel`.
+
+??? Example "See example"
+
+    This script makes use of [`MistralLLM`][distilabel.llms.mistral.MistralLLM] and the structured output capabilities provided by [`instructor`](https://python.useinstructor.com/) to generate knowledge graphs from complex topics.
+
+    This example is adapted from this [awesome example](https://python.useinstructor.com/examples/knowledge_graph/) in the `instructor` cookbook.
+
+    ??? Run
+
+        ```console
+        python examples/structured_generation_with_instructor.py
+        ```
+
+    ```python title="structured_generation_with_instructor.py"
+    --8<-- "examples/structured_generation_with_instructor.py"
+    ```
+
+    ??? "Visualizing the graphs"
+
+        Want to see how to visualize the graphs? You can test it using the following script. Generate some samples on your own and take a look:
+
+        !!! NOTE
+
+            This example uses graphviz to render the graph. You can install it with `pip` as follows:
+
+            ```console
+            pip install graphviz
+            ```
+
+        ```console
+        python examples/draw_kg.py 2  # You can pass 0,1,2 to visualize each of the samples.
+        ```
+
+        ![Knowledge graph figure](../../../assets/images/sections/examples/knowledge-graph-example.png)
diff --git a/examples/draw_kg.py b/examples/draw_kg.py
@@ -0,0 +1,82 @@
+# Copyright 2023-present, Argilla, Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+from typing import Any, Dict, List, Union
+
+from graphviz import Digraph
+from pydantic import BaseModel, Field
+
+
+class Node(BaseModel):
+    id: int
+    label: str
+    color: str
+
+
+class Edge(BaseModel):
+    source: int
+    target: int
+    label: str
+    color: str = "black"
+
+
+class KnowledgeGraph(BaseModel):
+    nodes: List[Node] = Field(default_factory=list)
+    edges: List[Edge] = Field(default_factory=list)
+
+
+def visualize_knowledge_graph(kg: KnowledgeGraph):
+    dot = Digraph(comment="Knowledge Graph")
+
+    # Add nodes
+    for node in kg.nodes:
+        dot.node(str(node.id), node.label, color=node.color)
+
+    # Add edges
+    for edge in kg.edges:
+        dot.edge(
+            str(edge.source),
+            str(edge.target),
+            label=edge.label,
+            color=edge.color or "black",
+        )
+
+    # Render the graph
+    dot.render("knowledge_graph.gv", view=True)
+
+
+def create_knowledge_graph(data: str) -> Union[KnowledgeGraph, None]:
+    parsed: Dict[str, Any] = json.loads(data)
+
+    nodes = [Node(**node) for node in parsed["nodes"]]
+    edges = []
+    for edge in parsed["edges"]:
+        if edge.get("color") is None:
+            edge["color"] = "black"
+        edges.append(Edge(**edge))
+
+    return KnowledgeGraph(nodes=nodes, edges=edges)
+
+
+if __name__ == "__main__":
+    import sys
+
+    args = sys.argv[1:]
+
+    from datasets import load_dataset
+
+    ds = load_dataset("distilabel-internal-testing/knowledge_graphs", split="train")
+    graphs = [create_knowledge_graph(g) for g in ds["generation"]]
+    visualize_knowledge_graph(graphs[int(args[0])])
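
`create_knowledge_graph` in `examples/draw_kg.py` trusts that every edge points at a declared node. Since the graphs come from an LLM, a check before rendering may be useful. The sketch below works on the raw JSON string with only the standard library; the function name `find_dangling_edges` and the sample payload are hypothetical and not part of this commit.

```python
import json
from typing import Any, Dict, List


def find_dangling_edges(raw: str) -> List[Dict[str, Any]]:
    """Return the edges whose source or target id is not a declared node id."""
    graph = json.loads(raw)
    node_ids = {node["id"] for node in graph.get("nodes", [])}
    return [
        edge
        for edge in graph.get("edges", [])
        if edge["source"] not in node_ids or edge["target"] not in node_ids
    ]


# Hypothetical generation: the second edge points at a node that was never declared.
sample = json.dumps(
    {
        "nodes": [
            {"id": 1, "label": "Homer", "color": "blue"},
            {"id": 2, "label": "Bart", "color": "red"},
        ],
        "edges": [
            {"source": 1, "target": 2, "label": "father of"},
            {"source": 2, "target": 3, "label": "brother of"},
        ],
    }
)
dangling = find_dangling_edges(sample)
```

Graphviz creates undeclared nodes implicitly, so without a check like this such an edge would show up as an extra node labelled only with its bare id.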
diff --git a/examples/structured_generation_with_instructor.py b/examples/structured_generation_with_instructor.py
@@ -0,0 +1,87 @@
+# Copyright 2023-present, Argilla, Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import List
+
+from distilabel.llms import MistralLLM
+from distilabel.pipeline import Pipeline
+from distilabel.steps import LoadDataFromDicts
+from distilabel.steps.tasks import TextGeneration
+from pydantic import BaseModel, Field
+
+
+class Node(BaseModel):
+    id: int
+    label: str
+    color: str
+
+
+class Edge(BaseModel):
+    source: int
+    target: int
+    label: str
+    color: str = "black"
+
+
+class KnowledgeGraph(BaseModel):
+    nodes: List[Node] = Field(default_factory=list)
+    edges: List[Edge] = Field(default_factory=list)
+
+
+with Pipeline(
+    name="Knowledge-Graphs",
+    description=(
+        "Generate knowledge graphs to answer questions, this type of dataset can be used to "
+        "steer a model to answer questions with a knowledge graph."
+    ),
+) as pipeline:
+    sample_questions = [
+        "Teach me about quantum mechanics",
+        "Who is who in The Simpsons family?",
+        "Tell me about the evolution of programming languages",
+    ]
+
+    load_dataset = LoadDataFromDicts(
+        name="load_instructions",
+        data=[
+            {
+                "system_prompt": "You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.",
+                "instruction": f"{question}",
+            }
+            for question in sample_questions
+        ],
+    )
+
+    text_generation = TextGeneration(
+        name="knowledge_graph_generation",
+        llm=MistralLLM(
+            model="open-mixtral-8x22b", structured_output={"schema": KnowledgeGraph}
+        ),
+        input_batch_size=8,
+        output_mappings={"model_name": "generation_model"},
+    )
+    load_dataset >> text_generation
+
+
+if __name__ == "__main__":
+    distiset = pipeline.run(
+        parameters={
+            text_generation.name: {
+                "llm": {"generation_kwargs": {"max_new_tokens": 2048}}
+            }
+        },
+        use_cache=False,
+    )
+
+    distiset.push_to_hub("distilabel-internal-testing/knowledge_graphs")
diff --git a/pyproject.toml b/pyproject.toml
@@ -66,6 +66,7 @@ cohere = ["cohere >= 5.2.0"]
 groq = ["groq >= 0.4.1"]
 hf-inference-endpoints = ["huggingface_hub >= 0.19.0"]
 hf-transformers = ["transformers >= 4.34.1", "torch >= 2.0.0"]
+instructor = ["instructor >= 1.2.3"]
 litellm = ["litellm >= 1.30.0"]
 llama-cpp = ["llama-cpp-python >= 0.2.0"]
 mistralai = ["mistralai >= 0.1.0"]
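
The new `instructor` extra in `pyproject.toml` requires `instructor >= 1.2.3`. As a rough sketch of checking that requirement at runtime with the standard library only: the helper names below are hypothetical, and the version parsing is deliberately naive (purely numeric dotted versions; a real check would more likely rely on `packaging.version`).

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.2.3' into (1, 2, 3); only handles purely numeric dotted versions."""
    return tuple(int(part) for part in version.split("."))


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(installed: Optional[str], minimum: str = "1.2.3") -> bool:
    """Check an installed version string against the minimum from pyproject.toml."""
    if installed is None:
        return False
    return parse_version(installed) >= parse_version(minimum)
```

Calling `meets_minimum(installed_version("instructor"))` would then tell whether the extra is usable; comparing tuples rather than strings avoids ranking `1.10.0` below `1.2.3`.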