Question: best way to cancel an ongoing generation? #227

Open
deet opened this issue Mar 7, 2025 · 1 comment

deet commented Mar 7, 2025

I'm looking into how feasible it would be to interrupt an in-progress LLM or VLM generation.

It seems reasonably straightforward to cancel once the output tokens are being generated, since the developer can check for e.g. Task.isCancelled in their didGenerate block and abort.
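For example, a minimal sketch of that (assuming the MLXLMCommon generate(input:parameters:context:didGenerate:) entry point; the exact signature may differ between versions, and lmInput / modelContext are assumed to already be set up):

```swift
import MLXLMCommon

// Sketch only: stop emitting tokens as soon as the surrounding Task is cancelled.
// `lmInput: LMInput` and `modelContext: ModelContext` are assumed to exist already.
let result = try generate(
    input: lmInput,
    parameters: GenerateParameters(),
    context: modelContext
) { tokens in
    if Task.isCancelled { return .stop }   // abort generation early
    return .more                           // keep generating
}
```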

However, significant time can also be spent in earlier phases before any output is generated, such as in the UserInputProcessor and the TokenIterator.

So a few questions:

  1. Are there plans to support cancellation of generation tasks?
  2. If I wanted to implement a version of this now, is checking for Task.isCancelled in various places, e.g. in the implementations of UserInputProcessor, the best approach? For vision models this would presumably be between the processing of each image or each video frame.
  3. How feasible would it be to store the state already computed, such as the KVCache, when interrupted, so that the task could be resumed later from the same position?
@davidkoski
Collaborator

  1. Are there plans to support cancellation of generation tasks?

Nothing planned, but this sounds useful.

  2. If I wanted to implement a version of this now, is checking for Task.isCancelled in various places, e.g. in the implementations of UserInputProcessor, the best approach? For vision models this would presumably be between the processing of each image or each video frame.

Yes, that sounds like it would be needed. The UserInputProcessor has to return an LMInput:

public func prepare(input: UserInput) throws -> LMInput

So you would need to figure out what to do if it is cancelled. It could throw an error (maybe a well-known taskCancelled error that the caller could deal with) or return an empty LMInput. The call that prepares the UserInput -> LMInput is already in user code, so call sites would need to know what to do -- I think this ties in to question 3 below. If the prepare() call is interrupted, the caller needs to know that the LMInput isn't valid, which suggests a throw to me; however, that might be an unexpected error to some callers. Do we need to let the caller specify the behavior here? Pass in some kind of "step observer block" for policy? I think this requires some playing around to see what fits well.
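To illustrate the first option, a minimal sketch that assumes the synchronous prepare(input:) signature quoted above; Task.checkCancellation() throws CancellationError, which the call site can treat as "no valid LMInput was produced":

```swift
import MLXLMCommon

// Sketch only: wrap an existing processor and bail out before the (potentially
// expensive) preparation work if the surrounding Task has been cancelled.
// For VLMs the same check would go inside the per-image / per-frame loop.
struct CancellableProcessor: UserInputProcessor {
    let wrapped: any UserInputProcessor

    func prepare(input: UserInput) throws -> LMInput {
        try Task.checkCancellation()   // throws CancellationError when cancelled
        return try wrapped.prepare(input: input)
    }
}
```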

The other checks on isCancelled could go either in the generate() callback or in generate() itself. generate() already has mechanisms for stopping early, so it seems like either could work well, though it may depend on what is decided for the prepare() call (whether the cancellation handling is explicit or not).

The TL;DR, I think, is one of these two general approaches:

  • prepare() and generate() implicitly handle Task.isCancelled, but callers of prepare() need to know whether the LMInput is valid, so that probably means an Error
  • prepare() and generate() opt in to cancellation behavior via an optional parameter (a bool or a block); same caveat on prepare() wrt the Error (but the caller determines whether this is possible) -- a sketch of this shape follows below
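Purely to illustrate that second, opt-in shape -- a hypothetical helper with a shouldCancel block; none of this exists in the current API:

```swift
import MLXLMCommon

// Hypothetical opt-in variant (sketch only; this parameter is not in the
// current API): the caller supplies the cancellation policy as a block.
func prepareCancellable(
    _ processor: any UserInputProcessor,
    input: UserInput,
    shouldCancel: () -> Bool = { Task.isCancelled }
) throws -> LMInput {
    if shouldCancel() {
        throw CancellationError()   // caller knows no valid LMInput was produced
    }
    return try processor.prepare(input: input)
}
```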

  3. How feasible would it be to store the state already computed, such as the KVCache, when interrupted, so that the task could be resumed later from the same position?

See #196 for some discussion. Since the KVCache is a reference type and you can pass it in, you should be able to hold it in the code that calls the iterator without a problem. For a VLM I wonder if this is sufficient? I think it would otherwise want to redo the image/video portion of the input -- you would want to capture the LMInput rather than the UserInput (which is what is exposed at the generate() level).
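A rough sketch of holding the cache across an interruption. The newCache(parameters:) call and the cache: parameter on TokenIterator are assumptions based on the #196 discussion, so check the current API for the exact names:

```swift
import MLXLMCommon

// Sketch only: hold the KVCache (a reference type) outside the iterator so an
// interrupted generation can be resumed later. Parameter names are assumptions.
final class ResumableGeneration {
    private var cache: [KVCache]?

    func step(input: LMInput, context: ModelContext, parameters: GenerateParameters) throws {
        let cache = self.cache ?? context.model.newCache(parameters: parameters)
        self.cache = cache   // retained even if we bail out below

        var iterator = try TokenIterator(
            input: input, model: context.model, cache: cache, parameters: parameters)
        while let token = iterator.next() {
            if Task.isCancelled { return }   // cache still holds the progress so far
            _ = token                        // handle the token (detokenize, stream, ...)
        }
    }
}
```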

Also be aware of this:

we are doing asynchronous evaluation to prepare the next token (keeping the GPU busy in the gaps between tokens). I think that is fine, but something to think about.
