Fix idefics3 do-image-split #192
base: main
Conversation
I also think it will probably fix the broadcast error reported here. I tried adding print statements to idefics3.swift so I could see the tensors, but I can't see them in the CLI when I run it using mlx_run. Here is the execution command I tried.
Btw, the vision patch embed sanitization is not working well, though the code checks out to me.
@Blaizzy Unfortunately this PR didn't solve that issue. It crashes here before entering.
Full stack trace before the crash:
@@ -794,7 +783,7 @@ public class Idefics3Processor: UserInputProcessor {

     // From the Python code and default config, we know image_token_id is usually 49153.
     // Hardcode this since we can't pass it in or rely on it from the processor config.
-    private let imageTokenId = 49153
+    private let imageTokenId = 49190
This no longer matches the comment. Also, might want to change this line:
self.imageTokenId = (try? container.decode(Int.self, forKey: .imageTokenId)) ?? 49153
That said, perhaps we should address the "Hardcode this since we can't pass it in or rely on it from the processor config" part of it -- if UserInputProcessor needs it then it should get passed in (VLMModelFactory).
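One hypothetical shape for that (a minimal sketch; the type and parameter names here are illustrative, not the repo's actual API):

```swift
// Hypothetical sketch, not the repo's actual API: have the factory pass the
// token id into the processor instead of hard-coding it inside.
final class Idefics3InputProcessor {
    let imageTokenId: Int

    // VLMModelFactory (or whatever builds the processor) would supply the value,
    // with the known default kept only as a fallback.
    init(imageTokenId: Int? = nil) {
        self.imageTokenId = imageTokenId ?? 49153
    }
}

let processor = Idefics3InputProcessor(imageTokenId: 49190)
print(processor.imageTokenId)  // 49190
```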
Comparing this to the other VLMs it looks like the difference is here:
// Encode only the text part of the prompt, without <image>
var promptTokens = try tokenizer.encode(text: prompt)
let imageTokenIndex = promptTokens.count / 2
promptTokens.insert(imageTokenId, at: imageTokenIndex)
this deals in numeric tokens after tokenization while the others are injecting text.
I think we can keep this hard-coded for this specific issue, but it would be good to decide which approach to take here:
- inject the <image> text into the prompt before tokenization (no token id needed; see the sketch below)
- keep inserting the numeric token id after tokenization, for which we need the model config
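For the first option, continuing from the snippet above, something like this could work (a minimal sketch; it assumes the tokenizer's vocabulary contains the literal "<image>" special token, as the Python processor relies on):

```swift
// Minimal sketch of the text-injection approach ("tokenizer" is the same
// instance used in the snippet above; assumes "<image>" is a known special token).
let prompt = "What is in this image?"
let promptWithImage = "<image>\n\(prompt)"
// The image token now comes out of tokenization itself, so no hard-coded id
// and no manually computed insertion point are needed.
let promptTokens = try tokenizer.encode(text: promptWithImage)
```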
Looking at pixel_values coming in to the Swift code vs. the Python preprocessing, it looks like:
if do_image_splitting:
    # We first resize both height and width of each image to the nearest max_image_size multiple, disregarding the aspect ratio
    # for size=(10, max_image_size) -> rescaled_size=(max_image_size, max_image_size)
    # for size=(11, max_image_size+1) -> rescaled_size=(max_image_size, max_image_size*2)
    ...
As far as I can tell we are passing a non-split image in where a split image is expected. I don't know if that is the entirety of it, but that seems to be the difference between the Swift and Python code before it hits the assertions.
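A rough Swift restatement of that resize rule, for reference (the function and parameter names here are mine, not from either codebase):

```swift
// Illustrative only: round each side up to the nearest multiple of maxImageSize,
// matching the examples quoted above (10 -> maxImageSize, maxImageSize+1 -> 2*maxImageSize).
func rescaledSize(height: Int, width: Int, maxImageSize: Int) -> (height: Int, width: Int) {
    func roundUpToMultiple(_ value: Int) -> Int {
        let multiples = (value + maxImageSize - 1) / maxImageSize  // integer ceiling
        return max(1, multiples) * maxImageSize
    }
    return (roundUpToMultiple(height), roundUpToMultiple(width))
}

// rescaledSize(height: 10, width: 512, maxImageSize: 512) -> (512, 512)
// rescaledSize(height: 11, width: 513, maxImageSize: 512) -> (512, 1024)
```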
In Python it works with do_image_splitting set to both True and False, because we flatten the first three of the image_feature tensor's four dimensions.
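A minimal MLX Swift sketch of that flattening, with assumed shapes (1 batch, 13 splits, 64 patch tokens, 768 dims):

```swift
import MLX

// Sketch: collapse the first three image_feature axes into one, e.g.
// (1, 13, 64, 768) -> (832, 768), so downstream code sees the same layout
// whether or not the image was split.
let imageFeatures = MLX.zeros([1, 13, 64, 768])
let flat = imageFeatures.flattened(start: 0, end: 2)
print(flat.shape)  // [832, 768]
```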
@davidkoski thanks a lot for the tips!
My next question, to be able to move on this, is how to set everything up using Xcode? I struggled for 30 minutes and ended up doing it via the terminal. To reiterate, I'm just getting started with Swift and Xcode :)
Without knowing specifically what you are running into it is hard to say. You might look here:
Once you have it installed, you should be able to open the project. These are the schemes: you pick one and a build/run destination. The simplest will be macOS. You can press cmd-opt-r to edit the "run" configuration for the scheme. In the first tab you can select between Debug/Release builds (Debug if you want to run in the debugger, Release to run full speed). The arguments tab is useful for setting command-line arguments. If you run into specific problems, just ask!
Thanks! I'm running macOS :)
Hi @Blaizzy, welcome to Swift-land :) Incidentally, I started working on this before seeing your PR. I have a preliminary question: did you manage to get any SmolVLM model running, even before your patch? I tested the default model in this repo (mlx-community/SmolVLM-Instruct-4bit) and I'm getting gibberish (after changing the image token id to 49190).
Description
This PR matches the fix for Idefics 2 and 3 in mlx-vlm.
Source: Blaizzy/mlx-vlm#191
@davidkoski @awni I'm not a Swift expert and would love to start contributing. This is my first attempt.
Although I couldn't manage to run this example successfully, this change should allow the model to process images at multiple resolutions, boosting OCR and finer-grained detail. For this, Idefics 3 models have a preprocessor argument called do_image_splitting, which is True by default. I had previously set it to False on all the MLX models, but I have now pinpointed the bug and made a fix.
Input pixel values:
-> (1, 3, 512, 512) without do_image_splitting
-> (13, 3, 512, 512) with do_image_splitting
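To make the expectation concrete, here is a hypothetical check of those shapes on the Swift side (`pixelValues` and `doImageSplitting` are assumed names, not the repo's):

```swift
// Hypothetical shape check (names assumed): with splitting on, this example
// image yields 13 sub-images (presumably the crops plus the downscaled global
// image); with splitting off, just the single resized image.
let expectedCount = doImageSplitting ? 13 : 1
precondition(
    pixelValues.shape == [expectedCount, 3, 512, 512],
    "unexpected pixel_values shape: \(pixelValues.shape)")
```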
Python MLX-VLM example
Input Image:
Before
After (matches transformers ✅)