Support for KV-cache loading? #8

Open
mark-lord opened this issue Oct 2, 2024 · 0 comments

@mark-lord

It's not clear to me from the code whether this library supports the following pattern:

mlx_lm.cache_prompt --prompt 'Here are 100 examples of how to produce a desired output: {examples}'

... cachedprompt.safetensors saved to cwd

prompt_template = "Now produce an output from this sentence: {sentence}"
prompts_raw = [prompt_template.format(sentence=sentence) for sentence in sentences]
response = batch_generate(kv_cache_file="cachedprompt.safetensors", prompts=prompts_raw)

... batched generations created

Is this something the library can do, or could support? I'd like to provide multi-shot examples without incurring huge prompt-processing times from re-encoding the same pre-prompt on every call.
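
For reference, here's a minimal single-prompt sketch of the loading half, based on the prompt-cache helpers mlx-lm itself ships. I'm assuming load_prompt_cache lives in mlx_lm.models.cache and that generate accepts a prompt_cache argument, as in recent mlx-lm; the model path is just an example.

# Load a model plus the KV cache saved earlier by mlx_lm.cache_prompt,
# so the multi-shot pre-prompt is never re-encoded.
# Assumptions: load_prompt_cache and the prompt_cache kwarg to generate()
# behave as in current mlx-lm.
from mlx_lm import load, generate
from mlx_lm.models.cache import load_prompt_cache

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")
prompt_cache = load_prompt_cache("cachedprompt.safetensors")

# Only the new suffix is passed as the prompt; the cached pre-prompt
# tokens are already held in prompt_cache.
response = generate(
    model,
    tokenizer,
    prompt="Now produce an output from this sentence: The cat sat.",
    prompt_cache=prompt_cache,
)

The open question is whether batch_generate here could accept a cache like that once and share it across every prompt in the batch.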
