
Models should not need to be re-loaded between back-to-back prompts #210

Open
neilmehta24 opened this issue Feb 21, 2025 · 2 comments
Labels: bug (Something isn't working)

@neilmehta24 (Contributor)

When using mlx-vlm through the Python API, we need to call mlx_vlm.utils.load before every request to stream_generate, because calling stream_generate without reloading the model raises exceptions. We see this across multiple VLM architectures when the model is not re-loaded. The exceptions differ between architectures, but there usually appears to be some state that is not being reset.
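For context, a minimal sketch of the reload-per-request workaround described above. The model path, prompt/image pairs, and the exact stream_generate keyword arguments are illustrative only and may differ between mlx-vlm versions; chat-template formatting of the prompt is omitted for brevity.

```python
from mlx_vlm.utils import load, stream_generate

requests = [
    ("Describe this image.", "photo1.jpg"),    # placeholder prompt/image pairs
    ("What changed in this one?", "photo2.jpg"),
]

for prompt, image in requests:
    # Workaround: re-load the model before every request. Without this,
    # the second stream_generate call raises architecture-specific exceptions.
    model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")
    for chunk in stream_generate(model, processor, prompt, image=image):
        # chunk may be a plain string or a result object depending on the version
        print(getattr(chunk, "text", chunk), end="", flush=True)
    print()
```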

@Blaizzy (Owner) commented Feb 24, 2025

Hey @neilmehta24

Thanks for reporting this!

Could you share a reproducible example?

@Blaizzy (Owner) commented Feb 24, 2025

> The exceptions differ between architectures, but there usually appears to be some state that is not being reset.

I usually use stream_generate in the way you describe during development.

[Image attachment]

So I suspect either the KV cache or the growing input size (i.e., the number of images). The former is easy to fix; the latter has limitations, because not all models support multiple images and/or multi-turn conversation.
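For reference, the back-to-back pattern in question (load once, generate repeatedly) looks roughly like the sketch below; the exceptions reported above show up on the second iteration for some architectures. Same caveats as the earlier sketch: argument names and model path are assumptions.

```python
from mlx_vlm.utils import load, stream_generate

# Load once and reuse across prompts -- the behaviour this issue asks for.
model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")

for prompt, image in [("Describe image A.", "a.jpg"), ("Describe image B.", "b.jpg")]:
    # If stale per-request state (e.g. the KV cache) is not reset between calls,
    # the second iteration is where the architecture-specific exceptions appear.
    for chunk in stream_generate(model, processor, prompt, image=image):
        print(getattr(chunk, "text", chunk), end="", flush=True)
    print()
```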

Blaizzy added the bug label on Feb 24, 2025