---
title: "Using LLMs in Python for Text Generation"
---
## Intro
In this tutorial, we'll explore how to leverage Large Language Models (LLMs) to generate text using OpenAI's API. We'll use the `gpt-4o-mini` model to generate responses to fixed and variable prompts, optimize our code with helper functions and vectorization, and handle data using pandas DataFrames.
## Learning Objectives
- Set up the OpenAI client
- Define and use simple functions to generate text
- Use vectorization to apply functions to DataFrames
## Setting Up the OpenAI Client
First, we need to set up the OpenAI client using your API key. Here, we store the key in a file called `local_settings.py`, then import it into our script.
```{python}
from openai import OpenAI
import pandas as pd
import numpy as np
from local_settings import OPENAI_KEY
# Initialize the OpenAI client with your API key
client = OpenAI(api_key=OPENAI_KEY)
```
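For reference, `local_settings.py` only needs to define the key as a plain string. Here's a minimal sketch with a placeholder value:

```{python}
#| eval: false
# local_settings.py
# Keep this file out of version control (e.g., add it to .gitignore)
OPENAI_KEY = "sk-..."  # placeholder; replace with your own key
```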
Alternatively, you can pass your API key directly as a string to the `api_key` argument, but be cautious not to expose it in your code, especially if you plan to share or publish it.
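If you do pass the key directly, the call looks like this (shown with a placeholder value so it won't run as-is):

```{python}
#| eval: false
# Not recommended for shared code: the key sits in plain text
client = OpenAI(api_key="sk-...")  # placeholder; use your own key
```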
## Making an API Call
Let's make an API call to the `gpt-4o-mini` model to generate a response to a prompt.
```{python}
response = client.chat.completions.create(
model="gpt-4o-mini", messages=[{"role": "user", "content": "What is the most tourist-friendly city in France?"}]
)
print(response.choices[0].message.content)
```
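The response object also carries metadata beyond the message text. For example, `response.usage` reports the token counts for the call, which is handy for keeping an eye on cost:

```{python}
# Inspect how many tokens the call consumed (prompt + completion)
print(response.usage)
```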
## Defining Helper Functions
To simplify our code and avoid repetition, we'll define a helper function for making API calls. API calls contain a lot of boilerplate code, so encapsulating this logic in a function makes our code cleaner and more maintainable.
If you ever forget how to structure the API calls, refer to the [OpenAI API documentation](https://platform.openai.com/docs/api-reference/introduction) or search for "OpenAI Python API example" online.
Here's how we can define the `llm_chat` function:
```{python}
def llm_chat(message):
response = client.chat.completions.create(
model="gpt-4o-mini", messages=[{"role": "user", "content": message}]
)
return response.choices[0].message.content
```
This function takes a `message` as input, sends it to the LLM, and returns the generated response. The `model` parameter specifies which model to use—in this case, `gpt-4o-mini`. We use this model for its balance of quality, speed, and cost. If you want a more capable model, you can use `gpt-4o`, but be careful not to exceed your API quota.
## Fixed Questions
Let's start by sending a fixed question to the `gpt-4o-mini` model and retrieving a response.
```{python}
# Example usage
response = llm_chat("What is the most tourist-friendly city in France?")
print(response)
```
::: {.callout-tip title="Practice"}
## Practice Q: Get tourist-friendly city in Brazil
Use the `llm_chat` function to ask the model for the most tourist-friendly city in Brazil. Store the response in a variable called `rec_brazil`. Print the response.
```{python}
# Your code here
```
```{python}
#| include: false
response = llm_chat("What is the most tourist-friendly city in Brazil?")
print(response)
```
:::
## Variables as Prompt Inputs
Often, you'll want to generate responses based on varying inputs. Let's create a function that takes a country as input and asks the model for the most tourist-friendly city in that country.
```{python}
def city_rec(country):
prompt = f"What is the most tourist-friendly city in {country}?"
return llm_chat(prompt)
```
Now, you can get recommendations for different countries by calling `city_rec("Country Name")`:
```{python}
city_rec("Nigeria")
```
However, if we try to use this function on a list of countries or a DataFrame column directly, it won't process each country individually. Instead, the entire column is interpolated into a single prompt string, which isn't the desired behavior.
```{python}
# Incorrect usage
country_df = pd.DataFrame({"country": ["Nigeria", "Chile", "France", "Canada"]})
response = city_rec(country_df["country"])
print(response)
```
To process each country individually, we can use NumPy's `vectorize` function. This function transforms `city_rec` so that it can accept arrays (like lists or NumPy arrays) and apply the function element-wise.
```{python}
# Vectorize the function
city_rec_vec = np.vectorize(city_rec)
# Apply the function to each country
country_df["city_rec"] = city_rec_vec(country_df["country"])
country_df
```
This code will output a DataFrame with a new column `city_rec` containing city recommendations corresponding to each country.
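As an aside, pandas' `.apply` achieves the same element-wise behavior without NumPy; which one you use is largely a matter of preference:

```{python}
#| eval: false
# Equivalent approach: call city_rec once per element with pandas .apply
country_df["city_rec"] = country_df["country"].apply(city_rec)
```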
::: {.callout-tip title="Practice"}
## Practice Q: Get local dishes
Create a function called `get_local_dishes` that takes a country name as input and returns some of the most famous local dishes from that country. Then, vectorize this function and apply it to the `country_df` DataFrame to add a column with local dish recommendations for each country.
```{python}
# Your code here
```
```{python}
#| include: false
def get_local_dishes(country):
prompt = f"What are some of the most famous local dishes from {country}?"
return llm_chat(prompt)
# Vectorize the function
get_local_dishes_vec = np.vectorize(get_local_dishes)
# Apply to the DataFrame
country_df['local_dishes'] = get_local_dishes_vec(country_df['country'])
country_df
```
:::
## Automated Summary: Movies Dataset
In this example, we'll use the movies dataset from `vega_datasets` to generate automated summaries for each movie. We'll convert each movie's data into a dictionary and use it as input for the LLM to generate a one-paragraph performance summary.
First, let's load the movies dataset and preview the first few rows:
```{python}
import pandas as pd
import vega_datasets as vd
# Load the movies dataset
movies = vd.data.movies().head() # Using only the first 5 rows to conserve API credits
movies
```
Next, we'll convert each row of the DataFrame into a dictionary. This will be useful for passing the data to the LLM.
```{python}
# Convert each movie's data into a dictionary
movies.to_dict(orient="records")
```
Let's store these dictionaries in a new column of the DataFrame:
```{python}
movies["full_dict"] = movies.to_dict(orient="records")
movies
```
Now, let's define a function `movie_performance` that takes a movie's data dictionary, constructs a prompt, and calls the `llm_chat` function to get a summary:
```{python}
def movie_performance(movie_data):
prompt = f"Considering the following data on this movie {movie_data}, provide a one-paragraph summary of its performance for my report."
return llm_chat(prompt)
```
We'll vectorize this function so we can apply it to the entire `full_dict` column:
```{python}
import numpy as np
# Vectorize the function to apply it to the DataFrame
movie_performance_vec = np.vectorize(movie_performance)
```
Let's test our function with an example:
```{python}
# Example usage
movie_performance("Name: Kene's Movie, Sales: 100,000 USD")
```
Finally, we'll apply the vectorized function to generate summaries for each movie:
```{python}
# Generate summaries for each movie
movies["llm_summary"] = movie_performance_vec(movies["full_dict"])
```
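To spot-check the output, you can preview each title next to its generated summary (the `Title` column name here assumes the `vega_datasets` movies schema):

```{python}
# Preview titles alongside the generated summaries
movies[["Title", "llm_summary"]]
```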
You can now save the DataFrame with the generated summaries to a CSV file:
```{python}
# Save the results to a CSV file
movies.to_csv("movies_output.csv", index=False)
```
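Optionally, since `full_dict` just duplicates the other columns, you might drop it before saving:

```{python}
#| eval: false
# Drop the helper column before writing, to keep the CSV tidy
movies.drop(columns=["full_dict"]).to_csv("movies_output.csv", index=False)
```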
This approach allows you to generate detailed summaries for each movie based on its full set of data, which can be incredibly useful for automated reporting and data analysis.
::: {.callout-tip title="Practice"}
## Practice Q: Weather Summary
Using the first 5 rows of the `seattle_weather` dataset from `vega_datasets`, create a function that takes all weather columns for a particular day and generates a summary of the weather conditions for that day. The function should use the LLM to generate a one-paragraph summary for a report, considering the data provided. Store the output in a column called `weather_summary`.
```{python}
weather = vd.data.seattle_weather().head()
weather
```
```{python}
# Your code here
```
```{python}
#| include: false
# Step 1: Load the dataset
weather = vd.data.seattle_weather().head()
# Step 2: Convert each row into a dictionary and add it to the DataFrame
weather_dicts = weather.to_dict(orient="records")
weather["full_dict"] = weather_dicts
# Step 3: Define the function to generate summaries
def weather_summary(weather_data):
prompt = (
f"Considering the following weather data: {weather_data}, "
"provide a one-paragraph summary of the weather conditions for my report."
)
return llm_chat(prompt)
# Step 4: Vectorize the function
weather_summary_vec = np.vectorize(weather_summary)
# Step 5: Apply the function to generate summaries
weather["summary"] = weather_summary_vec(weather["full_dict"])
```
:::
## Wrap-up
In this tutorial, we learned the basics of using OpenAI's LLMs in Python for text generation, created helper functions, and applied these functions to datasets using vectorization.
In the next lesson, we'll look at structured outputs that allow us to specify the format of the response we want from the LLM. We'll use this to extract structured data from unstructured text, a common task in data analysis.