When using mlx-vlm through the Python API, we need to call `mlx_vlm.utils.load` before every request to `stream_generate`. We need to do this because exceptions are raised when we call `stream_generate` without reloading the model. We are seeing this across multiple VLM architectures when the model is not reloaded. The exceptions differ between architectures, but there is usually some state that is not being reset.
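The per-request reload workaround described above can be sketched as below. `mlx_vlm.utils.load` and `stream_generate` are the real names from the report, but the wrapper function and the stub loader/generator used here for illustration are assumptions, not mlx-vlm code:

```python
# Sketch of the workaround: reload the model before every stream_generate
# call so no stale state survives between requests. In practice `load`
# would be mlx_vlm.utils.load and `generate` would be
# mlx_vlm.utils.stream_generate; stubs stand in for them here.

def generate_with_fresh_model(load, generate, model_path, prompt):
    """Reload the model and processor before each generation request."""
    model, processor = load(model_path)  # fresh state on every call
    return list(generate(model, processor, prompt))

# Stub loader/generator for illustration only (hypothetical, not mlx-vlm).
def stub_load(path):
    return object(), object()

def stub_generate(model, processor, prompt):
    yield from prompt.split()

chunks = generate_with_fresh_model(stub_load, stub_generate, "model-dir", "hello world")
print(chunks)  # ['hello', 'world']
```

This avoids the exceptions at the cost of paying the full model-load time on every request.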
The exceptions differ between architectures; there is usually some state that is not being reset.
I usually use stream_generate in the manner you refer to in dev.
So I suspect the KV cache or the increase in input size (i.e., number of images). The former is easy to fix, the latter has limitations because not all models support multiple images and/or multi-turn conversation.
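As a toy illustration of the suspected failure mode (pure Python, not mlx-vlm code): state such as a KV cache that is created at load time and mutated during generation carries stale entries into the next request unless it is cleared between calls:

```python
# Toy model of the suspected bug: per-model cache state that grows during
# generation and is never cleared, so a second request sees leftovers.

class ToyModel:
    def __init__(self):
        self.kv_cache = []  # stands in for a per-layer key/value cache

    def generate(self, tokens):
        # Appends to the cache as real decoding would; never resets it.
        self.kv_cache.extend(tokens)
        return len(self.kv_cache)

model = ToyModel()
assert model.generate([1, 2, 3]) == 3  # first request: cache matches input
assert model.generate([4, 5]) == 5     # second request: 2 new + 3 stale entries

# The "easy to fix" case from the comment: reset the cache between requests.
model.kv_cache.clear()
assert model.generate([4, 5]) == 2     # behaves like a freshly loaded model
```

Reloading the whole model (as in the workaround above) achieves the same reset as the `clear()` here, just far more expensively.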