
Models should not need to be re-loaded between back-to-back prompts #210

Open
neilmehta24 opened this issue Feb 21, 2025 · 2 comments
Labels: bug (Something isn't working)

@neilmehta24 (Contributor)

When using mlx-vlm through the Python API, we need to call mlx_vlm.utils.load before every request to stream_generate, because calling stream_generate without reloading the model raises exceptions. We see this across multiple VLM architectures when the model is not re-loaded. The exceptions differ between architectures, but there usually appears to be some state that is not being reset.
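For context, a minimal sketch of the reload-per-request workaround described above. The model path, prompt/image pairs, and the exact stream_generate keyword arguments are illustrative only and may differ between mlx-vlm versions; chat-template formatting of the prompt is omitted for brevity.

```python
from mlx_vlm.utils import load, stream_generate

requests = [
    ("Describe this image.", "photo1.jpg"),    # placeholder prompt/image pairs
    ("What changed in this one?", "photo2.jpg"),
]

for prompt, image in requests:
    # Workaround: re-load the model before every request. Without this,
    # the second stream_generate call raises architecture-specific exceptions.
    model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")
    for chunk in stream_generate(model, processor, prompt, image=image):
        # chunk may be a plain string or a result object depending on the version
        print(getattr(chunk, "text", chunk), end="", flush=True)
    print()
```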

@Blaizzy (Owner) commented Feb 24, 2025

Hey @neilmehta24

Thanks for reporting this!

Could you share a reproducible example?

@Blaizzy (Owner) commented Feb 24, 2025

> The exceptions differ between architectures, but there usually appears to be some state that is not being reset.

I usually use stream_generate in the way you describe during development.

[Image attachment]

So I suspect either the KV cache or the growing input size (i.e., the number of images). The former is easy to fix; the latter has limitations, because not all models support multiple images and/or multi-turn conversation.
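For reference, the back-to-back pattern in question (load once, generate repeatedly) looks roughly like the sketch below; the exceptions reported above show up on the second iteration for some architectures. Same caveats as the earlier sketch: argument names and model path are assumptions.

```python
from mlx_vlm.utils import load, stream_generate

# Load once and reuse across prompts -- the behaviour this issue asks for.
model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")

for prompt, image in [("Describe image A.", "a.jpg"), ("Describe image B.", "b.jpg")]:
    # If stale per-request state (e.g. the KV cache) is not reset between calls,
    # the second iteration is where the architecture-specific exceptions appear.
    for chunk in stream_generate(model, processor, prompt, image=image):
        print(getattr(chunk, "text", chunk), end="", flush=True)
    print()
```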

Blaizzy added the bug label on Feb 24, 2025