
Explicit Tool Turns #626

Merged

Conversation

psionic-k
Contributor

@psionic-k psionic-k commented Feb 9, 2025

Draft

It does not pass tests yet, but should be close to passing. I'm just sleepy and wanted to give PST a chance to take a look at the approaches before I finish up.

One of my commits almost vanished during merge and became whitespace only.

  • tool-call & tool-result correspondence enforcement & warning
  • make tests pass
  • tool call display function name & args via overlay, re-hydrate overlays when opening file

@karthink
Owner

karthink commented Feb 9, 2025

The general approach looks good to me. I have one major suggestion and one question. The suggestion first:

I don't think it's worth propertizing the prompt-prefix and response-prefix. The current approach is to strip them via text search (see gptel--trim-prefixes). Additionally, all leading and trailing whitespace is stripped from the user prompt as well, so the most common gptel-response-separators are taken care of automatically.

I understand this is less than ideal when you change the prefixes. For example, if you open an old chat file from before you changed the prefixes, the prefixes in the file will no longer be stripped.

However, I think the new approach is worse on balance because it doubles or triples the length of the serialized bounds list when you save the chat. We want to store as little metadata as we can get away with. Note also that the local variables block which is used for serialization in markdown and text chat files has a maximum allowed length of 2000 characters or something. (Enforced by Emacs.) Each extra character of metadata we store bumps us closer to this limit, and all hell breaks loose if we cross it.

Continuing to strip prefixes/suffixes using string matching will also simplify the code a fair bit -- no need for a gptel-response-separator function, for instance.

Finally, I haven't noticed any auto-mimicry problems with retaining the prefixes in the messages array, possibly because if the prefixes fail to be stripped, they always end up as part of the user prompts, not previous LLM responses. So failing to get it right is not a critical problem.

To address the case of rehydrating older chat files correctly, if we absolutely must strip the prefixes correctly we can store the prefix strings in the file as Org properties/local variables. I'm not sure this is required, and in any case it doesn't have to be part of this PR.

Will address the question in my next message.

@karthink
Owner

karthink commented Feb 9, 2025

IIUC, you haven't yet updated gptel--get-buffer-bounds, is that correct? This is required for serializing the response (and now tool result) bounds to the buffer. When you implement this, keep in mind that we want to be as parsimonious as possible. So response text bounds (the most common case) should still be something like (1234 1500). Currently, you are assuming here that it will look like (1234 1500 response). The ignore case can also probably be made shorter/simpler than (1612 1700 ignore).
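To illustrate the parsimony requirement, here is a minimal sketch of what a serializer in the spirit of gptel--get-buffer-bounds might emit (the function name and the `response` symbol for the common case are illustrative, not gptel's actual internals):

```emacs-lisp
;; Illustrative only: a compact bounds list where the common
;; `response' case omits the type symbol, and only special
;; regions (like `ignore') carry one.
(defun my/compact-bounds (regions)
  "Serialize REGIONS, a list of (BEG END TYPE) triples.
Drop TYPE when it is the common `response' case."
  (mapcar (lambda (region)
            (pcase-let ((`(,beg ,end ,type) region))
              (if (eq type 'response)
                  (list beg end)
                (list beg end type))))
          regions))

;; (my/compact-bounds '((1234 1500 response) (1612 1700 ignore)))
;; ⇒ ((1234 1500) (1612 1700 ignore))
```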

@karthink
Owner

karthink commented Feb 9, 2025

The tool-use API looks like this right now: the gptel-request callback is called with two different types of lists for tool calls and tool results as the RESPONSE argument.

For tool calls, it is called with a list of the form

((tool args closure) ...)

where tool is a gptel-tool struct. (The closure here is the closure used to continue the gptel request with the tool call result.)

For tool results, it is called with a list of the form

((name args result) ...)

where name is a string, the name of the tool.

  • These have to be of different types because the callback needs to know if what it received was a tool call (waiting for confirmation/run) or a tool result (to insert into the buffer etc).

  • Further, these have to be different in a way that's easy for a package author to understand, since this is part of the gptel-request API.

  • I judged the more common case to be the one involving the tool result, which is why it's simpler -- an alist mapping a tool name (like "get_buffer") to the LLM-supplied args and result of calling it. In contrast, the ((tool args callback) ...) case sends the callback the tool object itself since it needs to run the tool function (gptel-tool-function tool).

  • Bonus: it would be good to hew close to the API of the llm package, for future cooperation/coordination/reuse. As I understand it, llm package does not currently do the first kind of call at all (no tool call confirmations), so it has an easier time of it. It sends ((name . result) ...) to the callback.
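For a package author, telling the two shapes apart might look like this (a sketch; the cl-defstruct here is a self-contained stand-in for gptel's own tool struct, and the callback name is made up):

```emacs-lisp
(require 'cl-lib)

;; Stand-in for gptel's tool struct, so this sketch is self-contained.
(cl-defstruct gptel-tool name function args description category)

(defun my/tool-callback (response _info)
  "Distinguish tool calls from tool results by the shape of RESPONSE."
  (cond
   ;; ((tool args closure) ...): the caar is a gptel-tool struct,
   ;; so this is a tool call awaiting confirmation or a run.
   ((and (consp response) (gptel-tool-p (caar response)))
    (message "tool call: %s" (gptel-tool-name (caar response))))
   ;; ((name args result) ...): the caar is a string, the tool name,
   ;; so this is a tool result ready to insert.
   ((and (consp response) (stringp (caar response)))
    (message "tool result for %s" (caar response)))))
```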

For your use case, this API needs to be redesigned because you need the tool call id along with the result. We should keep the above goals in mind when making the API change.


One option I considered before is an explicit mention of the purpose of the call, i.e. send to the gptel-request callback

(:tool-call ((tool args closure) ...))

for a tool-call confirmation/run, and

(:tool-result ((tool args result) ...))

for a tool result.

I judged this as an unnecessarily complex structure before (a plist of list of lists). Same for

(tool-call   . ((tool args closure)...)) ;and
(tool-result . ((tool args result) ...))

These are too verbose because most consumers of gptel-request + tools will only ever see one of these, if at all. I prefer the simplicity of ((name args result) ...) if we can get away with it.

See the documentation for the other types of RESPONSE gptel-request can send to the callback:

  • a string for response text,
  • nil if there's an error,
  • the symbol abort if the request was aborted.

We want something at this level of simplicity for tool calls and tool results.

@karthink
Owner

karthink commented Feb 9, 2025

We also need to add new tests, but maybe that can wait until the design is final?

@psionic-k
Contributor Author

psionic-k commented Feb 10, 2025

Note also that the local variables block which is used for serialization in markdown and text chat files has a maximum allowed length of 2000 characters or something. ... all hell breaks loose if we cross it

Technically it's broken for big chats already. When the day comes that someone needs this, we can use file storage. Add an option for the property to be a string. The string names a file at (in the no-littering case) /user-emacs-dir/var/gptel/bounds/<UUID-or-hash>.el. No longer self-contained, but IMO unavoidable in the end ✂️.

I understand this is less than ideal when you change the prefixes.

It's not something I would stress over. I agree stripping can be very valuable in a few cases. An empty line can be ignored as a user message instead of treated blindly like a turn in the chat. In that case, I no longer need to mark some lines as ignore. The ignore property can be re-hydrated from #+tool_call and ```tool-call matches almost as easily as it can be stored.

The tool-use API

Aight, I think I will have to asynchronously answer this part.

@karthink
Owner

karthink commented Feb 10, 2025

When the day comes that someone needs this, we can use file storage. Add an option for the property to be a string. The string names a file at (in the no-littering case) /user-emacs-dir/var/gptel/bounds/<UUID-or-hash>.el. No longer self-contained, but IMO unavoidable in the end.

Losing self-contained files is not acceptable. At that point we can switch to using a TOML frontmatter block or something in Markdown files instead, the equivalent of Org properties. But anyway, this is not currently a problem.

It's not something I would stress over. I agree stripping can be very valuable in a few cases. An empty line can be ignored as a user message instead of treated blindly like a turn in the chat. In that case, I no longer need to mark some lines as ignore. The ignore property can be re-hydrated from #+tool_call and ```tool-call matches almost as easily as it can be stored.

Not sure what you mean by "stress over it". The current approach of stripping the prefixes via text search is simple and sometimes wrong, but that's me not stressing over it.

I'd like to avoid explicitly marking prefixes as ignored text for the mentioned reasons:

  1. it's okay to get this wrong sometimes,
  2. it avoids extra code in gptel,
  3. it doubles/triples the length of the serialized bounds, and
  4. it's easy to add in the future, but not as easy to remove.

Adding the ignore property to the tool call block decorations is fine 👍

@karthink
Owner

An empty line can be ignored as a user message instead of treated blindly like a turn in the chat.

Empty/whitespace lines should already be stripped, IIRC.

@psionic-k
Contributor Author

Not sure what you mean by "stress over it". The current approach of stripping the prefixes via text search is simple and sometimes wrong, but that's me not stressing over it.

I meant to confirm that not stressing over it was already correct. My reasoning was that any use case for keeping prefixes in sync is extremely niche and can be done by a simple text replace if the user really needs it.

@karthink
Owner

The ignore property can be re-hydrated from #+tool_call and ```tool-call matches almost as easily as it can be stored.

If I'm reading this right, you're debating if you should serialize the bounds of #+tool_call and ```tool-call decorators as (beg end ignore) when writing the buffer to disk, and instead use text search to mark these as ignored when restoring the chat state?

@psionic-k
Contributor Author

The ignore property can be re-hydrated from #+tool_call and ```tool-call matches almost as easily as it can be stored.

If I'm reading this right, you're debating if you should serialize the bounds of #+tool_call and ```tool-call decorators as (beg end ignore) when writing the buffer to disk, and instead use text search to mark these as ignored when restoring the chat state?

Avoid serialization, yes. However, instead of marking them ignore, just stripping them out before calling gptel--parse-buffer.

I think I can filter them on the frontends easily enough. That seems more proper since they're an artifact of each frontend anyway. In both Markdown and Org, stripping these block headers and footers is just a linear scan going forward if we escape the block contents. It's more robust in the case that someone munges a block. ignore is still really useful, but probably should be avoided where stripping is trivial.

@karthink
Owner

Avoid serialization, yes.

Sure. As I see it the less we need to serialize the more robust the persistence feature remains.

However, instead of marking them ignore, just stripping them out before calling gptel--parse-buffer.

Hmm, this requires going all-in into using a buffer copy for parsing. We're currently doing this for Org because of gptel-org-branching-context and org-unescape-code-in-region, but still avoiding it in non-Org buffers. This is more than just Markdown -- it's every other major-mode in which gptel is used.

We've paid the price for Org already, but I'm hoping to avoid doing this everywhere else.

I think I can filter them on the frontends easy enough. That seems more proper since they're an artifact of each frontend anyway. In both markdown and org, stripping these block header + footers is just a linear scan going forward if we escape the block contents.

Well one of your original arguments was that using intervals (text-properties/overlays) is both more efficient (O(1) or O(log(point-max)) or something) and robust, and we don't pay a parsing penalty. I like the idea.

It's more robust in the case that someone munges a block.

If someone messes up a block, aren't text properties more robust?

If you're using #+begin_tool_call... and someone messes this up to read #+begin_tool_cal..., it will still work fine if you're using the ignore property. The same argument applies to messing up the contents of the block. The tool call args or results as recorded will be distorted, but it will still be recognized as a tool call.

On the other hand, if you're text-matching #+begin_tool_call and it's modified, it will get inserted as part of the response, leading eventually to auto-mimicry.

ignore is still really useful, but probably should be avoided where stripping is trivial.

I propose a hybrid solution:

  1. Don't serialize the tool call block decorators as (beg end ignore). This keeps the serialized list short and tidy.
  2. When rehydrating, use text matching and do apply the ignore property to the decorators. This requires text processing, which is slower, but it only runs once on rehydration instead of before every request.
  3. When creating the payload, respect the current ignore rules. This way you don't need to create a new buffer to filter out text except in Org mode (where it is required for other reasons.)
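Step 2 might look something like this (a sketch; the regexp, property name, and function name are illustrative, and the real decorators differ between Org and Markdown):

```emacs-lisp
(defun my/rehydrate-ignore-decorators ()
  "Mark tool-call block decorators with an `ignore' text property.
Runs once when restoring a chat, not before every request.
Note: the closing-fence pattern is deliberately loose; a real
implementation would track which fence opened the block."
  (save-excursion
    (goto-char (point-min))
    ;; Match Org and Markdown style decorator lines.
    (while (re-search-forward
            "^\\(?:#\\+\\(?:begin\\|end\\)_tool_call\\|```tool-call\\|```\\)\\s-*$"
            nil t)
      (put-text-property (match-beginning 0) (match-end 0)
                         'gptel 'ignore))))
```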

@psionic-k
Contributor Author

We've paid the price for Org already, but I'm hoping to avoid doing this everywhere else.

If we're talking about 100MB of contents, I could see this being heavy, but especially where we don't even activate the mode, buffer copying is idiomatic rather than an anti-pattern. It's super cheap and in many cases much more straightforward. We have something like 5GB/s available on this kind of workload.

Try this function out on a typical buffer:

(defun clone-and-report ()
  "Copy the current buffer into 100 temp buffers, report time and consing."
  (interactive)
  (let ((source (current-buffer))
        (beg (point-min))
        (end (point-max))               ; was (point-min), which copied nothing
        (conses cons-cells-consed)
        (start (float-time))
        (created))
    (dotimes (_ 100)
      (let ((buffer (generate-new-buffer (buffer-name) t)))
        (with-current-buffer buffer
          (insert-buffer-substring source beg end))
        (push buffer created)))
    (let ((new-conses cons-cells-consed)
          (now (float-time)))
      (message "Old: %s New: %s Delta: %s"
               conses new-conses (- new-conses conses))
      (message "Start: %s End: %s Elapsed: %s"
               start now (- now start)))
    (mapc #'kill-buffer created)        ; mapc: results are discarded
    (let ((new-conses cons-cells-consed)
          (now (float-time)))
      (message "After killing - Old: %s New: %s Delta: %s"
               conses new-conses (- new-conses conses))
      (message "After killing - Start: %s End: %s Elapsed: %s"
               start now (- now start)))))

I ran this on one of my org buffers and it was 4ms and 1858 conses, many of which are used to run the command and print the results. It's way, way faster than human speed and doesn't gum up the GC.

There are reasons to do other things, like only copying a region from the buffer, but in general buffer-to-buffer copy is one of the fastest, cheapest, lowest-residue things in Emacs.

@karthink
Owner

karthink commented Feb 10, 2025

There are reasons to do other things, like only copying a region from the buffer, but in general buffer-to-buffer copy is one of the fastest, cheapest, lowest-residue things in Emacs.

I take your point. 👍

Independent of this, did you think about the other arguments (robustness to text munging, interval trees vs string/regexp search) for painting the tool call block decorators with the ignore property and avoiding text parsing?

@psionic-k
Contributor Author

If someone messes up a block, aren't text properties more robust?

Unless we save them, until restoring. Hmm...

Btw, linear scans are O(n); they grow with buffer size. Parsing without the need to backtrack is also O(n) in time. Realistically it's only heavy in Org because of the overhead of push-down parsing and of deciding on and scooping up every single kind of element into the result.

((name . result) ...)

Without the id of the tool call, the API is incomplete. Possibly the llm package is incomplete and needs to make a breaking change here. Without ids, I'd say it's broken to begin with.

I'll look around, but I definitely like the idea of any state machine callback receiving an explicit indication of state, either through a method name or a leading symbol.

(:tool-call . ((tool args))) and (:tool-result . ((tool args result))) are my instinct. I need to look at the other callbacks. I don't like implicit information when matching, such as "if it's a cons, do Y", because that kind of stuff is made to break.

Whatever the case, since the feature is young, I'd be in favor of an aggressive breaking change to get to a good place early in its life, one that is easy to understand. Let's not make it backwards compatible but convoluted.

@psionic-k psionic-k force-pushed the explicit-tool-turns branch 2 times, most recently from 2e9ab99 to ccee2e8, on February 10, 2025 at 12:36
@psionic-k
Contributor Author

After looking through a bit more, I don't think any :tool-call or :tool-result is necessary here. The type check of the caar is fine for now. In the default callback, both results are lists and so are passed correctly to gptel--display-tool-calls.

It wouldn't be the worst idea to add a TYPE argument to the callback but I just don't think it solves enough here. The user's callback is about as simple as it can get unless a separate :tool-callback is made to separate the two entirely.

I need to throw some time at this tomorrow. Too sleepy to write anything smart.

@karthink
Owner

karthink commented Feb 10, 2025

Unless we save them, until restoring. Hmm...

I don't follow. I meant that painting the decorators with ignore is more robust against fat-fingering by the user.

Btw, linear scans are O(n); they grow with buffer size. Parsing without the need to backtrack is also O(n) in time. Realistically it's only heavy in Org because of the overhead of push-down parsing and of deciding on and scooping up every single kind of element into the result.

Interval tree lookups are O(log(m)), where m is the number of intervals (as opposed to buffer-size in chars), and the proportionality constant and memory requirements are much lower too. regex search is a heavy operation in comparison. That said, you're right that practically it won't matter until org-element enters the picture.

Without the id of the tool call, the API is incomplete. Possibly the llm package is incomplete and needs to make a breaking change here. Without ids, I'd say it's broken to begin with.

I disagree, the tool call id is an internal communication detail. There's no reason the callback/API consumer needs to be aware of some random UUID. Your case is special since you want to encode the messages array as the buffer in a lossless way. (See my note at the end.)

Let's not make it backwards compatible but convoluted.

The concern isn't backwards compatibility, as there is no "backwards" yet. We are at the start line.

The concern is the complexity of and confusion in the gptel-request API. Right now you're sending the callback these structures:

For tool calls confirmation/run:

((#[gptel-tool name function args description category ...] args closure) ...)

where #[...] is the gptel-tool struct.

For tool results:

((:name ... :args ... :result ... :id ... :arguments ... :function ...) ...)

Except for closure in the first case and :result in the second, these contain the exact same information packaged two different ways, both of which are quite complex. This seems convoluted to me. And possibly very confusing for a gptel-request consumer.

(The structure of the plist above isn't even uniform across backends, by the way. For OpenAI the above has both :args and :arguments, for example. However, all of them will have the keys :name, :args, :function, which are the only ones we need.)

Compare with

((#[gptel-tool name function args description category ...] args closure) ...)

for a tool call confirmation/run and

((name args result) ...)

for a tool result, keeping in mind that the latter is going to be the much more common case in uses of gptel-request.

These two are much more distinct.

I don't like implicit information when matching, such as "if it's a cons, do Y" because that kind of stuff is made to break.

But this is what's happening right now.


In any case, the user can switch backends at any time so whatever tool id you store has to work for all backends. So I have a hypothesis: modulo some template or prefix like "call_", the tool call id does not matter as long as it's unique. If this is true we don't need to pass the tool call id to the callback at all, you can just make up a unique identifier.
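The common-representation idea might be sketched like this: store a backend-agnostic id, and let each backend's parser template it on the way out (the function names are made up, and the prefixes are assumptions based on the examples in this thread):

```emacs-lisp
(defun my/make-tool-call-id ()
  "Generate a unique, backend-agnostic tool call id."
  (format "gptel-%x%x" (random (expt 2 32)) (random (expt 2 32))))

(defun my/template-id (id backend)
  "Render the internal ID in the form BACKEND expects.
The id templating lives in the per-backend parsers, so a stored
chat keeps working when the user switches backends."
  (pcase backend
    ('openai    (concat "call_"  id))
    ('anthropic (concat "toolu_" id))
    (_ id)))

;; (my/template-id "123abc" 'anthropic) ⇒ "toolu_123abc"
```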

@psionic-k
Contributor Author

The concern isn't backwards compatibility, as there is no "backwards" yet. We are at the start line.

agree.

This seems convoluted to me. And possibly very confusing for a gptel-request consumer.

The structure of the plist above isn't even uniform across backends, by the way.

we don't need to pass the tool call id to the callback at all

I snipped a great deal of distracting reading because I need more answers from reading / writing code. Here is another great deal of less distracting reading. 🥲

While not for now, I think backends will need to be able to leak data into the front-end, but in a completely pass-through manner, using a gptel-<backend>-<model> property. Backend context caching is such a detail.

I think it would be better to always pass tool structs to the callback rather than "tool-name". This can later support a :passthrough or :backend slot on the tool struct that is then blindly attached to the buffer contents through properties without attempting to decipher or manipulate them. Passing tool structs is more consistent.

The calls & results would be: (tool args closure) and (tool args result).

In this case, while the id is hopefully not backend specific, I would like to use it and just hope for the best. If it doesn't matter, why replace it with another unique value unless we know such a value will fail less and not more? How this can actually help backends (and they might not tell us) is that the computation up to the tool call might be cached, leading to faster first tokens. All of the big providers are strongly incentivized to silently cache and use several methods of content cleaning and hashing to avoid re-computing the model state up to the next token.

It does seem better to provide the callback with RESULT INFO TYPE so that the user can just pcase on TYPE instead of having to look at the type of a caar in a list, etc. That's the kind of decision that is inconvenient and easy to get wrong in the callback. If the user misses a case, they can warn/log on _, so it's robust moving forward.

Your case is special

Which part, the round tripping of the backend detail or retaining the tool result in context?

I meant that painting the decorators with ignore is more robust against fat-fingering by the user.

If during restore I'm parsing decorators to re-ignore them, the fat fingered decorator will not be propertized. Only if we save the properties of ignored content does this advantage hold during persistence.

There's other problems with fat fingering, such as leaving behind a result that can no longer be read. Maybe we should mark every tool call as read-only and be done with these user what-if headaches. It's almost never meaningful for them to update the results by hand unless they just delete the entire tool call. I'm going with this.

@karthink
Owner

karthink commented Feb 11, 2025

While not for now, I think backends will need to be able to leak data into the front-end, but in a completely pass-through manner, using a gptel-<backend>-<model> property. Backend context caching is such a detail.

I'm not following, but I'll wait for the PR to develop further instead of taking up more of your time on future plans right now.

I think it would be better to always pass tool structs to the callback rather than "tool-name". This can later support a :passthrough or :backend slot on the tool struct that is then blindly attached to the buffer contents through properties without attempting to decipher or manipulate them. Passing tool structs is more consistent.

  • Always passing tool structs: ✅ I like the uniformity/simplicity of this.
  • Adding a slot to the tool struct to pass information to the buffer ❌ Hard no. This is a mingling of unrelated concerns. The tool struct is pretty much an immutable object. It should not be responsible for passing data from the "backend" to the "frontend", as you put it.

In this case, while the id is hopefully not backend specific, I would like to use it and just hope for the best. If it doesn't matter, why replace it with another unique value unless we know such a value will fail less and not more?

You still need a whole system written around the tool call ids to accommodate the various APIs. If you've stored call_1234567890 and the user now switches to Anthropic, you have to convert it to toolu_1234567890. It's better to use a common internal representation and store that instead, pushing the id templating logic into the parser functions for each backend.

How this can actually help backends (and they might not tell us) is that the computation up to the tool call might be cached, leading to faster first tokens. All of the big providers are strongly incentivized to silently cache and use several methods of content cleaning and hashing to avoid re-computing the model state up to the next token.

I understand the idea in general, but doubt very much tool call ids have anything to do with it. It also depends on where the cache boundary is placed. Tool calls appear in the last two messages in the messages array, unlikely to make much difference.

The calls & results would be: (tool args closure) and (tool args result).

I like this and agree, except for two things:

  1. Neither of these involves passing the tool call id, which you want
  2. Your following point about distinguishing between them won't work:

It does seem better to provide the callback with RESULT INFO TYPE so that the user can just pcase on TYPE instead of having to look at the type of a caar in a list, etc. That's the kind of decision that is inconvenient and easy to get wrong in the callback. If the user misses a case, they can warn/log on _, so it's robust moving forward.

Changing the callback calling convention is a no-no. This would make it a backwards compatibility issue.

But it's also unnecessary, since INFO is a bucket meant for exactly this kind of dynamic information and variadic use. Instead of TYPE it can be stored as (plist-put INFO :type 'tool-call) or something, and the callback can look it up if RESPONSE is a cons. Other uses of the callback remain unaffected.

However, looking at integrating the Deepseek model properly is convincing me that the callback is going to have to fulfill an increasing number of duties. In this case, it has to decide how to handle the "reasoning" text block supplied by Deepseek, for which gptel needs to signal that the provided text is a reasoning block. So a dispatch system where RESPONSE is set to different lists is looking more appealing:

tool-call:

(tool-call . ((tool args closure) ...)) ; or just (call . ((...)))?

tool-result:

(tool-result . ((tool args result) ...)) ; or just (result . ((...)))?

reasoning block:

(reasoning . "The user is asking for...")

This is an alternative to running (plist-put info :type 'tool-result) before sending the callback the results, etc.
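With a dispatch system like that, a consumer's callback could pcase on the car of RESPONSE alongside the existing string/nil/abort cases (a sketch; the symbol names follow the proposal above and the function name is made up):

```emacs-lisp
(defun my/request-callback (response _info)
  "Dispatch on the different RESPONSE shapes gptel-request might send."
  (pcase response
    ((pred stringp)            (message "response text: %s" response))
    (`(reasoning . ,text)      (message "reasoning: %s" text))
    (`(tool-call . ,calls)     (message "%d tool call(s)" (length calls)))
    (`(tool-result . ,results) (message "%d tool result(s)" (length results)))
    ('abort                    (message "request aborted"))
    ('nil                      (message "error"))))
```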

Your case is special

Which part, the round tripping of the backend detail or retaining the tool result in context?

The latter, you're trying to map the messages array to the buffer in a one-to-one fashion. I want this for gptel's chat feature, but most other uses of tools don't have this goal.

@karthink
Owner

karthink commented Feb 11, 2025

I meant that painting the decorators with ignore is more robust against fat-fingering by the user.

If during restore I'm parsing decorators to re-ignore them, the fat fingered decorator will not be propertized. Only if we save the properties of ignored content does this advantage hold during persistence.

Yes, you lose robustness the moment text parsing enters the picture. But it's not all or nothing, as many chats (I'm guessing a majority) are single-use and not persisted to disk, and it works fine for these.

Alternatively, you can save the properties of the ignored content too, if we can find some way to shorten the length of the serialized string: perhaps just (241 276 i) instead of (241 276 ignore).

On that note, tool calls and result properties can also be shortened when serialized, to tc and tr?

(I must reiterate here that the prompt/response prefix don't deserve this much care as they're part of the user prompt and not a cause of auto-mimicry. Even applying ignore is unnecessary, text parsing should continue to work fine for them.)

There's other problems with fat fingering, such as leaving behind a result that can no longer be read.

I'm imagining the situation where a result is split into two chunks by an errant yank. Or worse, when a tool call (name, args) is split.

There needs to be some way to sanitize a parsed tool block. I see a TODO item in your PR about coupling tool calls and results so they appear next to each other in the messages array.

In the worst case it should all be parsed and included as regular response text, I guess.

Maybe we should mark every tool call as read-only and be done with these user what-if headaches. It's almost never meaningful for them to update the results by hand unless they just delete the entire tool call. I'm going with this.

You can't mark a region of the chat buffer read-only if it keeps the user from running, say, M-< C-w. On the other hand, ensuring that deletion is allowed when it's atomic across the whole tool result block is a fair compromise.

But how do you ensure that the entire tool call can be deleted but a part of it can't? I can think of a couple ways to do this; they're super clunky.

@psionic-k
Contributor Author

I want this for gptel's chat feature, but most other uses of tools don't have this goal.

They will. The front-ends are just rudimentary right now. The killer pattern will be having a usually hidden but editable session buffer to store and remix context. When re-writing regions, you want gptel to have seen how other source code works and to remember it for the duration of potentially 2-3 re-write attempts with different prompts. This session buffer can be used across multiple buffers, such as with occur workflows. It can be cloned to start multiple requests in parallel, forked, supports undo barriers and so on. Chat combines the idea of a working buffer and a target buffer.

It's kind of like macros with LLMs providing the gummy heuristics and pseudo-programming interface and tools providing the RAG data necessary for contextualizing work with accurate facts.

But it's also unnecessary, since INFO is a bucket meant for exactly this kind of dynamic information and variadic use.

I'm hesitant because it's a shared mutable object that lives for the entire request lifecycle. Doesn't feel right to throw such a clear function signature relationship in there.

I like your later suggestion of 'tool-call in the car better.

So a dispatch system where RESPONSE is set to different lists is looking more appealing

Not the best, but simple enough. Recommend putting a TODO(2.0) on this?

Neither of these involves passing the tool call id, which you want

Ah, hell. True. I was off on a tangent imagining there was a tool-call object.

If you've stored call_1234567890 and the user now switches to Anthropic, you have to convert it to toolu_1234567890. It's better to use a common internal representation

  • template & de-template 👍
  • made up ID 🙅

Recommend:

(tool-call ((tool args closure id))...) and (tool-result ((tool args result id))...). Backends without ids can generate it.

Tool calls appear in the last two messages in the messages array, unlikely to make much difference.

The opposite is true. As I've demonstrated, tool calls are going to wind up heavily interwoven in recursive lookup use cases. We're pretty good at sending an append-only log in many cases. It's good to be consistent.

On that note, tool calls and result properties can also be shortened when serialized, to tc and tr?

We can even combine tool calls and results. It might make other problems simpler. I'll take a look.

A 2D structure is most efficient here. For enough flexibility in the values:

(response (beg end) (beg end)
 tool (beg end id) (beg end id)
 ignore (beg end) (beg end))

Fairly simple logic to use this.
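For concreteness, lookup over such a structure might look like this, assuming the grouped form is normalized into an alist of (TYPE (BEG END [VAL]) ...) entries; `gptel--bounds-at` is a hypothetical helper, not part of the PR.

```elisp
;; Sketch: find the typed bound containing POS in a bounds alist like
;; ((response (10 42) (50 80)) (tool (90 120 "id1")) (ignore (5 9))).
;; `gptel--bounds-at' is a hypothetical name, not from the PR.
(defun gptel--bounds-at (bounds-alist pos)
  "Return (TYPE BEG END [VAL]) for the entry in BOUNDS-ALIST containing POS."
  (catch 'found
    (dolist (entry bounds-alist)
      (dolist (bound (cdr entry))
        (when (and (<= (car bound) pos) (< pos (cadr bound)))
          (throw 'found (cons (car entry) bound)))))))
```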

As for tool bounds that are haphazardly split by the user, I recommend interning them into re-education camps until they stop expecting to pour skittles into the gas tank without something breaking. I'll see what's in the manual about that.

Claude Sonnet 3.0 wrote that 🙊

@karthink
Owner

karthink commented Feb 11, 2025

I want this for gptel's chat feature, but most other uses of tools don't have this goal.

They will. The front-ends are just rudimentary right now. The killer pattern will be having a usually hidden but editable session buffer to store and remix context. When re-writing regions, you want gptel to have seen how other source code works and to remember it for the duration of potentially 2-3 re-write attempts with different prompts. This session buffer can be used across multiple buffers, such as with occur workflows. It can be cloned to start multiple requests in parallel, forked, supports undo barriers and so on. Chat combines the idea of a working buffer and a target buffer.

Interesting idea -- but what you're describing will essentially reuse the mapping that you're creating in this PR. No one else is going to write this feature from scratch. A big chunk of tool use is going to be for side-effects, and another chunk is going to be one-off queries or queries where retaining the history is not important.

But it's also unnecessary, since INFO is a bucket meant for exactly this kind of dynamic information and variadic use.

I'm hesitant because it's a shared mutable object that lives for the entire request lifecycle. Doesn't feel right to throw such a clear function signature relationship in there.

INFO changes all the time anyway. You can do something like

(plist-put info :type 'tool-result)
(funcall (plist-get info :callback) result)
(plist-put info :type nil)

I like your later suggestion of 'tool-call in the car better.

That said, including the type in the car is fine, yeah.

So a dispatch system where RESPONSE is set to different lists is looking more appealing

Not the best, but simple enough. Recommend putting a TODO(2.0) on this?

Yeah. Scales well enough to new tasks too. The way I think about it is that every case looks like this:

(response    . "response text")
(reasoning   . "reasoning text")
(tool-call   . ((tool args closure) ...))
(tool-result . ((tool args result) ...))

only the first case is 90%+ of uses so we make an exception and directly send "response text".
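The shape-dispatch described above is a natural fit for `pcase`. As a sketch (assuming the (TYPE . PAYLOAD) convention proposed here; `my/gptel-response-type` is a made-up helper, not gptel API):

```elisp
;; Sketch: classify the RESPONSE argument a gptel callback might get,
;; assuming the (TYPE . PAYLOAD) convention proposed above.
;; `my/gptel-response-type' is a made-up name, not part of gptel.
(defun my/gptel-response-type (response)
  "Return the kind of RESPONSE: response, reasoning, tool-call or tool-result."
  (pcase response
    ((pred stringp)      'response)      ; the 90%+ case: plain text
    (`(reasoning . ,_)   'reasoning)
    (`(tool-call . ,_)   'tool-call)
    (`(tool-result . ,_) 'tool-result)))
```

A real callback would dispatch the same way, acting on the payload in each branch instead of returning a symbol.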

Recommend:

(tool-call ((tool args closure id))...) and (tool-result ((tool args result id))...). Backends without ids can generate it.

I hope the missed dot is a typo: we want (tool-call . ((tool args closure id) ...)) and not (tool-call ((tool args closure id) ...)).

I maintain that including the id here is premature, as almost no one will need it except for us in this PR. (If an API consumer wants a persistent history of tool-calls, they'll just be reusing our UI.)

How about picking it up from INFO instead? From within the callback,

(plist-get info :tool-use)

contains all the ids, as well as other details you might need.

Tool calls appear in the last two messages in the messages array, unlikely to make much difference.

The opposite is true. As I've demonstrated, tool calls are going to wind up heavily interwoven in recursive lookup use cases. We're pretty good at sending an append-only log in many cases. It's good to be consistent.

Not quite -- the cache miss will only be for a fraction of the buffer each time, you can run through the experiment. But we might as well store the original id, yeah.

On that note, tool calls and result properties can also be shortened when serialized, to tc and tr?

We can even combine tool calls and results. It might make other problems simpler. I'll take a look.

A 2D structure is most efficient here. For enough flexibility in the values:

(response (beg end) (beg end)
 tool (beg end id) (beg end id)
 ignore (beg end) (beg end))

Fairly simple logic to use this.

No strong opinions here, except that a slight extension of the current system is about the same, since we don't need to indicate response, which is the default:

(beg end) (beg end)
(beg end tc id) (beg end tr id)
(beg end i) (beg end i)
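Rehydrating that compact form back into text properties could look roughly like this. Both `my/gptel-rehydrate` and the property values used are illustrative only; the PR itself uses values like (tool . id).

```elisp
;; Sketch: rehydrate compact serialized bounds into text properties.
;; Entries are (BEG END) for response text, (BEG END i) for ignored
;; text, and (BEG END tc ID) / (BEG END tr ID) for tool calls/results.
;; `my/gptel-rehydrate' and the property values are illustrative only.
(defun my/gptel-rehydrate (bounds)
  "Apply serialized BOUNDS to the current buffer as gptel text properties."
  (dolist (bound bounds)
    (pcase-let ((`(,beg ,end . ,rest) bound))
      (put-text-property
       beg end 'gptel
       (pcase rest
         ('nil      'response)
         ('(i)      'ignore)
         (`(tc ,id) (cons 'tool-call id))
         (`(tr ,id) (cons 'tool-result id)))))))
```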

As for tool bounds that are haphazardly split by the user, I recommend interning them into re-education camps until they stop expecting to pour skittles into the gas tank without something breaking. I'll see what's in the manual about that.

To Emacs users a buffer is a sandbox, not a gas tank. So it's perfectly reasonable from their perspective to pour skittles into it. That it represents a structured array of LLM conversation turns is unfortunately our problem to handle.

@psionic-k
Contributor Author

(defun gptel--modification-allow-ask (_beg _end)
  "Ask before allowing user to edit tool region."
  (if (y-or-n-p "Warning! Editing GPTel tool could break next request!
Do you know what you're doing?  Be honest, bro.")
      (save-excursion
        (when (text-property-search-backward
               'modification-hooks '(gptel--modification-allow-ask) t)
          (when-let ((found (text-property-search-forward
                             'modification-hooks
                             '(gptel--modification-allow-ask) t)))
            (let ((inhibit-modification-hooks t))
              (remove-text-properties (prop-match-beginning found)
                                      (prop-match-end found)
                                      '(modification-hooks nil
                                        insert-in-front-hooks nil
                                        insert-behind-hooks nil)))
            (message "Region modification protection removed."))))
    (user-error "Edit aborted.  You did the right thing.")))

(defun gptel--modification-undo-ask (beg end)
  "Ask whether to keep an insertion into a tool region.
BEG and END bound the newly inserted text."
  (unless (y-or-n-p "Warning! Inserting into GPTel tool could break next request!
Do you know what you're doing?  Be honest, bro.")
    (let ((inhibit-modification-hooks t))
      (delete-region beg end)
      (message "Insertion deleted"))))

(defun gptel--modification-protect (beg end)
  "Protect the region BEG..END from accidental modification."
  (add-text-properties
   beg end
   '(modification-hooks
     (gptel--modification-allow-ask)
     insert-in-front-hooks
     (gptel--modification-undo-ask)
     insert-behind-hooks
     (gptel--modification-undo-ask))))

This is what I spiked out to prevent fat-fingering while still allowing editing if the user insists. It is more complex than setting `read-only`.

Instead, just adding `read-only` to a region is pretty nearly ideal. It allows killing. It's just tricky enough to edit that the user has to be industrious enough to write some Elisp.
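For comparison, the `read-only` variant is tiny. A sketch, not the PR's code; `my/gptel-protect-region` is a made-up name:

```elisp
;; Sketch of the simpler read-only approach.  `my/gptel-protect-region'
;; is a made-up name.  With `kill-read-only-ok' non-nil, killing just
;; copies the text without complaint; in-place edits signal an error
;; unless the user binds `inhibit-read-only'.
(defun my/gptel-protect-region (beg end)
  "Mark BEG..END read-only; rear-nonsticky keeps typing after it normal."
  (add-text-properties beg end '(read-only t rear-nonsticky (read-only))))

;; The industrious user's escape hatch:
;; (let ((inhibit-read-only t))
;;   (remove-text-properties beg end '(read-only nil)))
```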

@psionic-k
Contributor Author

Merged master and squashed fixups.

benthamite and others added 12 commits February 28, 2025 21:12
* gptel.el (gptel--handle-post-insert):
It was ambiguous to read

* gptel-transient.el (gptel--regenerate):
* gptel.el (gptel--get-response-bounds):
Structural, non-behavioral.  Nothing to see.

* gptel.el (gptel--insert-response):
This change is structural

The conditions are not checked properly.

* gptel-curl.el:
* gptel-transient.el:
* gptel.el:
The warning and fallback to parsing entire buffer can happen in the
conditionals.

This change is primarily structural, non-behavioral.

* gptel-org.el:
Include tool call & result as explicitly recoverable messages within message log
buffers (like chat).  The recovery of tool messages enables re-use of tool
results in several rounds of messages, making RAG-like use cases feasible.  The
:include feature of tools tended to confuse LLMs, resulting in auto-mimicry and
other undesirable behaviors.  These changes correct that behavior.

To mark explicit tool messages, the gptel text property has been given new
unique values, including (tool . id).  The persistence & rehydration of these
properties is handled in the next commit.

The default callback insertion functions were modified to support a RAW argument
to indicate that they should not propertize the insertion.  This allows re-use
of the several tracking markers and other behavior without treating tool
insertion like response insertion.  See gptel-curl--stream-insert-response
and gptel--insert-response.

To facilitate creation of a tool message on the backend, the tool call form is
written literally to the beginning of tool messages.  The backend recovers it by
calling read.  After the first read form, the rest of the message is treated
like the tool result.  The result has been coerced to a string through
gptel--to-string before buffer insertion, so the literal buffer string is
treated as the result.  This can erase some type information for backends which
support JSON types besides string (including OpenAI and Gemini).  Further work
may be needed to allow tools to return non-string results to backends which
support them.

This commit only supports the OpenAI backend.  The others will need similar
enhancements:

- unformat and format their tool ids
- match the gptel text property values for (tool . id)
- create tool call and result messages from each recorded call

Buffer-encoded tool messages needed to be parsed.  This commit covers the OpenAI
backend.  The strategy is to pcase on the gptel property.  Since tool insertions
have one distinct gptel text property, the call and response message are
recovered at the same time.

This commit also addresses several shortcomings of the UX and display that were
necessarily tangled due to the path of work.

Because the org block display in packages like org-modern keeps parameters
visible when folded, the block was chosen.  The truncated call data is rendered
as block parameters to easily see which calls were performed while blocks are
folded.  To aid in the recognition of tool messages in text / markdown, the tool
message fence has truncated call data appended.  It uses the 'ignore property
even though removing these fences can employ a similar strategy as org mode.

To prevent org block escape by its tool result contents, the tool result is
escaped before insertion for display and unescaped before sending to the
backend.  Unescape takes place in a buffer containing a clone of the data.  The
path for branching context and non-branching context both similarly clone the
buffer to clean it before sending to the backend.

To avoid excessive gptel-separator insertion, new markers were introduced during
insertion, stored in :info as usual.  Using independent tool and message markers,
we can infer if we did the last update to the buffer or if another state-path
has done an insertion, meaning we need to re-insert a regular gptel-separator
instead of just a newline or nothing.

The 'ignore property is introduced and supported, but has been de-emphasized.
Users can make use of it to have non-conversation content intermingled with
chat.

Fixup: when in org, inform downstream parsing calls to presume org
Fixup: Bug in org unescape
The prev-pt may be ahead of the current point if a read or regex search forward
blows past prev-pt.  In such case, we want to at worst "escape" prev-pt prev-pt.

Fixup: downgrade of malformed tool-use region warning to message
Fixup: propagate local variables into org mode prior to calling backend parsing
Fixup: messed up the switch to rx while trying to make it readable.

Squash: Don't push messages for empty content during parsing (this is later
propagated to other backends)
Squash: Order tool call messages before tool result when parsing chat log

Squash: Tool result string coercion

This commit relates to the conversion of the results of tools.  The result was
to always coerce tool results to strings before handing off to display or to
each backend.

Not all backends support all types of results.  When writing tools, it was
observed that strings are supported by the OpenAI API.  The API docs state that
arrays are also supported, but only an array of objects.  The decision made for
now is to do string coercion of whatever the user passes back.

In the case that the user ever wants to take advantage of non-string types, we
will have to accommodate per-backend.  Backends that support non-string types
will call `read' a second time to recover the tool result.

Earlier I had removed a string-coercion because it can lose type information: A
string that is readable is indistinguishable from an Elisp object after coercing
the object to string.  However, representing strings as escaped strings
in-buffer means we're sending double-escaped results to the backend in many
cases.  The LLMs might not interpret such over-escaping as well.

Anthropic API might reveal something about the right choice.  I can't navigate
the Gemini docs well enough to judge it clearly.  R1 docs say String only.

Squashed: Update non-streaming callback to feature parity

Result is almost identical to the streaming callback.  The more I looked at each
little behavior, the more I could see that the streaming callback is just better
code and operates in the non-streaming case.

* gptel-anthropic.el:
* gptel-curl.el:
* gptel-openai.el:
* gptel-org.el:
* gptel-transient.el:
* gptel.el:
This goes in the direction of our backend agnosticism over persisted tool calls.
As long as they only use these to pair call and result, we can perhaps re-use
truncated ids in other backends.

* gptel-openai.el:
Property bounds are stored in an alist for compactness.  The alist values are
lists of bounds, which are always lists (trading conses for shorter local
variables) and contain either BEG END or BEG END VAL for text properties like
tool calls that can store an ID.

With work, the tool call ID can be obtained from the buffer data instead of
saved in the local variable as application state, but this is not consistent
across frontends.

fixup: property rehydration
fixup: comment for caadar
fixup: use generalized variables (`setf' and `plist-get')
1. gptel--trim-prefixes returns nil whenever a string is empty after
trimming (for blank strings or prompt-only strings, this will occur)

2. all models check for a non-nil after-trimming string before appending
messages to their parts

3. all models trim the whole buffer using simple string-trim when response
tracking is off

4. zero-length multi-part vectors are also skipped

Squashed: Trim prefixes to nil

Also fixes the edge case where a prefix includes whitespace.  Whitespace removal
is added to each literal prefix, so the literals need to be trimmed of whitespace.

* gptel-anthropic.el:
* gptel-gemini.el:
* gptel-ollama.el:
* gptel-openai.el:
* gptel.el:
This was an opportunistic update, but it seems this model is obviously out of
use and will be abandoned soon

* gptel-kagi.el:
There might be a mismatch in the JSON structure of the initial tool call and
subsequent tool turns as encoded in tool blocks.

Fixups:
remove id reference unavailable on Gemini

* gptel-gemini.el:
@psionic-k psionic-k force-pushed the explicit-tool-turns branch from a06ea68 to 4d0689a Compare March 1, 2025 07:02
@karthink
Owner

karthink commented Mar 4, 2025

...and I'm back. Sorry about that, I was having two of the busiest weeks of my life there.

I'm going to get things moving again soon. I'll merge this into the feature-tool-use branch, split a couple of the commits, then merge everything into master except the persistence change, since that has lasting effects. I can address the reasoning-block handling immediately after.

@psionic-k
Contributor Author

except the persistence change, since that has lasting effects.

It never existed if I didn't answer a call to support it.

I have to say I don't think persistence is actually useful. It was an idea for a certain kind of user who just didn't understand that the cost is going to plummet and that the high-value use cases will all involve much more ephemeral use in the background and in parallel. Persisting chat will ultimately be like persisting process buffers. The output is sometimes useful, but I certainly don't care about saving, much less reanimating it.

split a couple of the commits

Git history, like LLM output, is mostly just log output. Really it strikes me as a dark pattern of making work that does not move things to hopefully put off work that would move things, owing to a dread of too many things moving. What it means is that there is a need for more bandwidth, not less.

Probably what needs to happen is some delegation and liberalization of the development, possibly through a fork. The LLM package is too incomplete to build on top of. GPTel is kind of like a full stack and becoming beholden to users who will just bog down newer users with rapidly changing usage patterns. There's likely an opportunity where GPTel minus a specific frontend and backend is a very good package to build on top of. The tools to feed buffers into llm requests, manage context, prompts, files, lifecycle, and FSM parts are all useful. If these pieces are made the focus, the downstream packages, unencumbered by a specific frontend instance, can be more successful.

@karthink
Owner

karthink commented Mar 5, 2025

except the persistence change, since that has lasting effects.

It never existed if I didn't answer a call to support it.

I don't understand what this means.

I have to say I don't think persistence is actually useful. It was an idea for a certain kind of user who just didn't understand that the cost is going to plummet and that the high-value use cases will all involve much more ephemeral use in the background and in parallel. Persisting chat will ultimately be like persisting process buffers. The output is sometimes useful, but I certainly don't care about saving, much less reanimating it.

You have a certain way of using LLMs. Other people, including myself, use them very differently. There is no other way to resume a conversation if you can't decipher the roles in a text file.

Git history, like LLM output, is mostly just log output.

I disagree.

Really it strikes me as a dark pattern of making work that does not move things to hopefully put off work that would move things, owing to a dread of too many things moving. What it means is that there is a need for more bandwidth, not less.

I have seen the kind of open source software that "increased bandwidth" and ad hoc committing generate. The result might work well for the user, but I'm not interested in becoming a manager whose primary job is to adjudicate heated discussions and review high volume commits. Sounds like hell to me.

Probably what needs to happen is some delegation and liberalization of the development, possibly through a fork. The LLM package is too incomplete to build on top of. GPTel is kind of like a full stack and becoming beholden to users who will just bog down newer users with rapidly changing usage patterns.

Your comment elsewhere about the need for increased velocity doesn't make sense to me -- needed by whom? If I cared about keeping up with the competition -- for some definition of "competition" -- I wouldn't be using Emacs. It's too limited.

There's likely an opportunity where GPTel minus a specific frontend and backend is a very good package to build on top of.
The tools to feed buffers into llm requests, manage context, prompts, files, lifecycle, and FSM parts are all useful. If these pieces are made the focus, the downstream packages, unencumbered by a specific frontend instance, can be more successful.

This is what the llm package offers. It supports all the "middleware" you mention, including the FSM. What is it missing (besides the UI) that gptel has?

That said, I would like to break up gptel into a few pieces, and hand off some of the pieces to other maintainers. Org support would be one of the pieces, for instance.

@psionic-k
Contributor Author

I cared about keeping up with the competition -- for some definition of "competition" -- I wouldn't be using Emacs.

💀

hand off some of the pieces to other maintainers

This is management.

When I want to do something when asked but I dread it when it shows up at the front door, the truth is I don't want to do it.

I think we're pulling in different directions and at best indicating willingness to cooperate when there isn't sufficient underlying alignment. I wish you all the best but I sense a need from the ecosystem that I believe is going to go unanswered.

So are we forking? Yeah, we're forking. For now my intent is to develop GPTel independent of this repo. If the prefix collision actually starts causing confusion, I can rename on request.

@shroomist

hey guys. I'm just a regular user, I'd like to see the tool-use the way it was demoed by @psionic-k on his YT.
I do feel much respect and gratitude for @karthink efforts and contributions to emacs community.

I have yet to test the branch in question here; it might be better or worse than what we have on the main branch.

On a tangent, I did find that it's confusing to me how the context and gptel separators interact.
My workflow does include modifying text in the gptel buffer. I don't approach it as a chat; it's more as if I'm forming an org document with gptel's help, and I find it cumbersome to manage prompt-prefix and response-prefix. From the discussions above I feel like there's not that much love for org-mode support from the primary maintainer. Not sure if this feedback is relevant or helps the discussion in any way; I just felt that there's some misalignment on how to mark the response (and perhaps the prompt) in the buffer, and the suggested text properties instead of prompt/response-prefix sounded appealing to me.

@psionic-k I suggest we agree with the maintainer to clean up the git history. The above changeset is actually big and plugs in some opinionated use cases. Having separate commits gives us some hope that those changes might reach upstream, even with a separate PR, at some point in the future.

@karthink
Owner

karthink commented Mar 5, 2025 via email

@karthink karthink merged commit 4d0689a into karthink:feature-tool-use Mar 5, 2025
@karthink
Owner

karthink commented Mar 5, 2025

So are we forking? Yeah, we're forking. For now my intent is to develop GPTel independent of this repo. If the prefix collision actually starts causing confusion, I can rename on request.

Very well.

In the meantime I've merged this into the tool use branch and will be breaking up the commits as I would like them. In the rebase process I'm going to try to retain your authorship on all the commits but if I mess it up please let me know.

@shroomist

Could you explain the problem with the separators in a little more detail? Why do you need to "manage" it instead of just setting it once in your config?

I need to clean up those separators from the buffer when my session is done. I'm doing branching contexts, coming back to edit the initial prompt, editing responses, etc., and the @prompt and @agent prefixes most of the time end up in the wrong place. Often there's text that ends up after my prompt, so I need to change the structure of the document so that it's not included in the prompt. Perhaps a postfix marker or a #+begin block would have solved this for me.

Again, it's not clear what you mean by misalignment. gptel uses both text-properties and prefix strings.

miscommunication of you and the author of this PR I mean. I see a long thread here, both you guys are involved, that's great for the project. We don't need more forks, let's find a way to plug the functionality in.

"yes" for velocity, but also "yes" for clean and focused contributions, git history can be valuable, smaller pr easier to push through, smaller commits are great for tracking back.

@karthink
Owner

karthink commented Mar 5, 2025

I need to clean up those separators from the buffer when my session is done. I'm doing branching contexts, coming back to edit the initial prompt, editing responses, etc., and the @prompt and @agent prefixes most of the time end up in the wrong place. Often there's text that ends up after my prompt, so I need to change the structure of the document so that it's not included in the prompt. Perhaps a postfix marker or a #+begin block would have solved this for me.

If you are editing the text in the buffer, you're responsible for the text. How can gptel handle this?

If the prefixes are getting in the way you can also just set them to nil. Even if the prefix text is invisibly ignored by gptel when parsing, you're going to have errant @prompt and @agent strings littering the buffer.

We don't need more forks, let's find a way to plug the functionality in.

The functionality in this PR will be added. I merged it into the feature-tool-use branch just now, and it will make its way to master.

Future contributions are welcome as well, but they would have to be focused and incremental PRs.

The fork is because @psionic-k has different ideas about how gptel should be developed.

@shroomist

If you are editing the text in the buffer, you're responsible for the text. How can gptel handle this?

I'm sharing the way I use the tool; it's different from an interface with one box for input and one for output, and that's where I find value in gptel - it's just a buffer that can be edited.

If the prefixes are getting in the way you can also just set them to nil.

Great, I didn't know about that; I thought it was essential for correct functioning.

The fork is because @psionic-k has different ideas about how gptel should be developed.

from the sidelines, it certainly looks like you guys had a heated argument.

@psionic-k
Contributor Author

that's great for the project. We don't need more forks, let's find a way to plug the functionality in.

To make way for changes, I already ripped out persistence and org branching context. That should make it clear that the level of hand wringing we've subjected ourselves to was about to become a lot more unproductive. There are two independent visions for future development that should be allowed to diverge. I am grateful to Karthik for the hand wringing because that is how I was introduced to many of the features and architecture.

It's also easy to forget that two-way flow of information happens well after forking. Long after the git history is incompatible, we just patch things with diffs. To the extent that my goal of offloading backend behavior onto llm is possible, the result just flows back into gptel. To the extent that I see commits I want, I will just apply them.

Ultimately, I needed a starting point. I could have either built up on top of llm or picked a completed stack and started whittling. Starting with GPTel is the latter method. The presence of a lot of solved problems is easy to reduce down to a good foundation. The direction I want to go is quite a bit different than the use patterns we've seen so far. If the development process seems to strain under stress, and much more stress is necessary to express the desired goal, the remedy is to fork. It's a subjective decision and there is no right answer, but the most wrong answer is to demand one singular result in the presence of great uncertainty.

from the sidelines

I credit any appearance of argument to expectations created by the sickness in our present internet culture. I am merely pickaxing in a straight line as much as possible. Twitter fights and journos using the word "slam" don't really understand that open source is as much about independence as it is cooperation. Consumers tend to favor cooperation, but the dirty little secret is that they favor cooperation among people they are dependent upon 😉 It's not my nor Karthik's job to suppress our own opinions and differences in technical vision or even ambition to placate the anxiety of those who benefit in a purely one-way manner. (My sponsors can complain, and my response is that this way will go faster and get to a better place.)

@karthink
Owner

karthink commented Mar 7, 2025

Added tool result parsing support for Ollama and Anthropic. And fixed Gemini's support for it.

@karthink
Owner

karthink commented Mar 7, 2025

Pushed everything from this PR except the :message-marker commit to master. I find this change very confusing so I'm going to try to find a better way to do it.

Going to fix the reasoning block handling next.
