-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Several Inference Endpoint fixes #66
Conversation
This was generating a tricky error when calling "/health" at the server startup: this was calling prefill and returning None as the cached batch, that was failing to be serialized.
Doing that on TPU seems to slow down (due to compilation?) and takes a lot of memory.
This allows to correctly handle warmup.
This clears a potential issue when clearing TGI requests. When a client cancels a TGI request, two different methods can be called on the TGI server: - if the request is cancelled after prefill, then the router asks the server to "filter" the decoding batch from the corresponding request. This is correctly implemented, - if the request is cancelled during prefill, then the router asks the server to clear the whole prefill batch. This was not correctly implemented because in that configuration we cleared all requests, even those not included in that prefill batch. This is now fixed, basically reproducing TGI Neuron fix: huggingface/optimum-neuron#609
input_ids = torch.full((batch_size, seq_length), self.tokenizer.pad_token_id, dtype=torch.int64) | ||
attention_mask = torch.full((batch_size, seq_length), 0, dtype=torch.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qq: Can't we make it use int32
for input_ids
and attention_mask
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Umh, maybe we could, but the price would be a frequent cast between int32 and int64, because PB integers are serialized in int64. I think it's not worth doing it here, we could look at the advantages/disadvantages in another PR.
if slot.state != Slot.State.EMPTY and slot.request_id not in request_ids: | ||
logger.debug(f"Removing request {slot.request_id}") | ||
if slot.state != Slot.State.EMPTY and slot.id not in keep_slot_ids: | ||
logger.info(f"Removing slot {slot.id} with request {slot.request_id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug
might be enough for this imo
What does this PR do?