1
00:00:05,766 --> 00:00:06,600
In this video

2
00:00:06,600 --> 00:00:09,900
we'll focus on checking outputs
generated by the system.

3
00:00:09,900 --> 00:00:13,700
Checking outputs before showing them to
users can be important for ensuring

4
00:00:13,700 --> 00:00:17,466
the quality, relevance and safety
of the responses provided to them.

5
00:00:17,633 --> 00:00:21,633
or used in automation flows. We'll learn
how to use the Moderation API,

6
00:00:21,633 --> 00:00:25,166
but this time for outputs, and how to use
additional prompts to the model

7
00:00:25,166 --> 00:00:27,600
to evaluate output quality
before displaying them.

8
00:00:28,566 --> 00:00:30,866
So let's dive into the examples.

9
00:00:30,866 --> 00:00:34,733
We've already discussed the Moderation
API in the context of evaluating inputs.

10
00:00:35,300 --> 00:00:39,300
Now let's revisit it
in the context of checking outputs.

11
00:00:39,300 --> 00:00:43,566
The Moderation API can also be used to filter
and moderate outputs generated by

12
00:00:43,800 --> 00:00:45,266
the system itself.

13
00:00:45,266 --> 00:00:47,600
And so here's an example.

14
00:00:47,600 --> 00:00:49,266
So here's

15
00:00:49,600 --> 00:00:56,066
a generated response to the user,
and we're going to use the Moderation

16
00:00:56,066 --> 00:01:02,000
API in the same way
that we saw in the earlier video.

17
00:01:02,000 --> 00:01:06,133
So let's see if this output is flagged.

18
00:01:06,133 --> 00:01:09,833
As you can see, this output is not flagged

19
00:01:10,366 --> 00:01:15,333
and has very low scores in all categories
which makes sense given the response.

20
00:01:16,266 --> 00:01:19,066
In general, it can also be important
to check the outputs.

21
00:01:19,666 --> 00:01:24,633
For example, if you were creating
a chatbot for sensitive audiences,

22
00:01:24,633 --> 00:01:27,500
you could use lower thresholds
for flagging outputs.

23
00:01:28,000 --> 00:01:31,500
In general, if the moderation output
indicates that the content is flagged,

24
00:01:31,766 --> 00:01:34,800
you can take appropriate actions
such as responding with a fallback

25
00:01:34,800 --> 00:01:38,100
answer or generating a new response.

26
00:01:39,900 --> 00:01:40,733
Note that as we

27
00:01:40,733 --> 00:01:43,900
improve the models,
they also are becoming less

28
00:01:43,900 --> 00:01:47,300
and less likely
to return some kind of harmful output.

29
00:01:48,033 --> 00:01:51,300
Another approach for checking outputs
is to ask the model itself

30
00:01:51,300 --> 00:01:55,233
if the generated output was satisfactory
and if it follows a certain rubric

31
00:01:55,233 --> 00:01:56,566
that you define.

32
00:01:56,566 --> 00:01:59,800
This can be done by providing
the generated output as part of the input

33
00:01:59,800 --> 00:02:05,133
to the model and asking it
to rate the quality of the output.

34
00:02:05,133 --> 00:02:06,900
You can do this in various different ways.

35
00:02:06,900 --> 00:02:09,366
So let's see an example.

36
00:02:09,800 --> 00:02:12,866
So our system message is:
You are an assistant that evaluates

37
00:02:12,866 --> 00:02:14,866
whether the customer service
agent responses

38
00:02:14,866 --> 00:02:18,266
sufficiently answer customer questions
and also validates

39
00:02:19,466 --> 00:02:21,900
that all of the facts the assistant

40
00:02:21,900 --> 00:02:23,966
cites
from the product information are correct.

41
00:02:25,166 --> 00:02:26,866
The product information and user

42
00:02:27,900 --> 00:02:29,166
and customer service agent

43
00:02:29,166 --> 00:02:31,366
messages will be delimited by three
backticks.

44
00:02:33,000 --> 00:02:34,666
Respond with a Y

45
00:02:34,666 --> 00:02:37,733
or N character with no punctuation: Y

46
00:02:37,733 --> 00:02:39,933
if the output sufficiently answers
the question

47
00:02:40,666 --> 00:02:44,000
and the response
correctly uses product information, and N

48
00:02:44,000 --> 00:02:46,300
otherwise. Output a single letter only.

49
00:02:46,766 --> 00:02:51,600
And you could also use a chain of thought
reasoning prompt for this.

50
00:02:52,000 --> 00:02:55,100
This might be a little bit difficult for
the model to validate both in one step.

51
00:02:55,100 --> 00:02:56,700
So you could play around with this.

52
00:02:56,700 --> 00:02:59,133
You could also add
some other kind of guidelines.

53
00:02:59,133 --> 00:03:04,500
You could give a rubric, like a rubric
for an exam or grading an essay.

54
00:03:05,333 --> 00:03:09,500
You could use that kind of format and say,
"Does this use a friendly tone in line

55
00:03:09,500 --> 00:03:11,400
with our brand guidelines?"
and maybe outline

56
00:03:11,400 --> 00:03:14,200
some of your brand guidelines if that's
something that's very important to you.

57
00:03:15,366 --> 00:03:17,333
So let's add our customer message.

58
00:03:17,333 --> 00:03:21,166
So this is the initial message
used to generate this response.

59
00:03:21,566 --> 00:03:24,200
And then also paste
in our product information.

60
00:03:24,533 --> 00:03:27,666
And so this is the product information
we fetched in the previous step

61
00:03:28,100 --> 00:03:30,166
for all of the products
mentioned in this message.

62
00:03:33,766 --> 00:03:35,866
And now we'll

63
00:03:36,300 --> 00:03:37,666
define a comparison.

64
00:03:37,666 --> 00:03:40,666
So the customer message is

65
00:03:41,533 --> 00:03:45,066
the customer message, product information,
and then the agent response,

66
00:03:45,533 --> 00:03:50,400
which is the response to the customer
that we have from this previous.

67
00:03:50,400 --> 00:03:54,000
So let's format this
into a messages list

68
00:03:54,566 --> 00:03:59,133
and get the response from the model.

69
00:03:59,133 --> 00:04:03,000
So the model says, yes,
the product information is

70
00:04:03,533 --> 00:04:06,533
correct and the question is answered
sufficiently well.

71
00:04:06,533 --> 00:04:08,600
In general, for this kind of evaluation,

72
00:04:09,000 --> 00:04:10,633
I also think it is

73
00:04:10,633 --> 00:04:14,533
better to use a more advanced model
because that is better at reasoning.

74
00:04:14,900 --> 00:04:16,566
So something like GPT-4.

75
00:04:20,000 --> 00:04:23,466
Let's try another example.

76
00:04:23,466 --> 00:04:26,700
So this response is,
"Life is like a box of chocolates."

77
00:04:27,800 --> 00:04:29,833
So let's add our message to do the output

78
00:04:29,833 --> 00:04:36,033
checking.

79
00:04:36,033 --> 00:04:40,700
And the model has determined that this
does not sufficiently answer the question.

80
00:04:40,700 --> 00:04:43,800
or use the retrieved information.

81
00:04:43,800 --> 00:04:46,566
This question, "Does it use
the retrieved information correctly?",

82
00:04:46,833 --> 00:04:49,433
is a good

83
00:04:49,500 --> 00:04:54,100
prompt to use if you want to make sure
that the model isn't hallucinating,

84
00:04:54,433 --> 00:04:59,866
which is making up things
that aren't true.

85
00:04:59,866 --> 00:05:04,500
And feel free to pause the video now
and try some of your own customer

86
00:05:04,500 --> 00:05:07,433
messages, responses
and adding product information

87
00:05:07,733 --> 00:05:11,766
to test how this works.

88
00:05:11,766 --> 00:05:13,633
So as you can see,
the model can provide feedback

89
00:05:13,633 --> 00:05:17,366
on the quality of a generated output,
and you can use this feedback to decide

90
00:05:17,366 --> 00:05:21,100
whether to present the output to the user
or to generate a new response.

91
00:05:21,100 --> 00:05:24,833
You could even experiment with generating
multiple model responses per user query

92
00:05:25,066 --> 00:05:27,600
and then having the model choose
the best one to show the user.

93
00:05:27,900 --> 00:05:29,900
So there's lots of different things
you could try.

94
00:05:29,900 --> 00:05:33,533
In general, checking outputs
using the Moderation API is good practice,

95
00:05:33,833 --> 00:05:37,166
but while asking the model
to evaluate its own output might be useful

96
00:05:37,166 --> 00:05:40,500
for immediate feedback
to ensure the quality of responses

97
00:05:40,500 --> 00:05:46,166
in a very small number of cases,
I think it's probably unnecessary

98
00:05:46,200 --> 00:05:49,866
most of the time, especially if you're
using a more advanced model like GPT-4.

99
00:05:50,366 --> 00:05:53,333
I haven't actually seen many people
do something like this in production.

100
00:05:53,633 --> 00:05:56,400
It would also increase the latency
and cost of your system

101
00:05:56,666 --> 00:05:58,900
because you'd have to wait
for an additional call for the model,

102
00:05:59,400 --> 00:06:01,133
and that's also additional tokens.

103
00:06:01,133 --> 00:06:03,766
If it's really important
for your app or product that

104
00:06:04,600 --> 00:06:09,866
your error rate is 0.000001%,
then maybe you should try this approach.

105
00:06:10,100 --> 00:06:12,933
But overall, I wouldn't really recommend
that you do this in practice.

106
00:06:13,600 --> 00:06:15,300
In the next video,
we're going to put together

107
00:06:15,300 --> 00:06:18,366
everything we've learned in the evaluate
input section, process

108
00:06:18,366 --> 00:06:21,933
section and checking output section
to build an end-to-end system.