Fix idefics3 do-image-split #192

Draft · wants to merge 1 commit into main

Conversation


@Blaizzy Blaizzy commented Jan 31, 2025

Description

This PR matches the fix applied to Idefics 2 and 3 in mlx-vlm.

Source: Blaizzy/mlx-vlm#191

@davidkoski @awni I'm not a Swift expert and would love to start contributing. This is my first attempt.

Although I couldn't manage to run this example successfully, this change should allow the model to process images at multiple resolutions, which boosts OCR and finer-grained detail. For this, Idefics 3 models have a preprocessor argument called do_image_splitting, which is True by default. I had previously set it to False on all MLX models, but I have now pinpointed the bug and made a fix.

Input pixel values:
-> (1, 3, 512, 512) without do_image_splitting
-> (13, 3, 512, 512) with do_image_splitting
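
For intuition, here is a minimal sketch (my own assumption about the preprocessing, not code from this PR) of where the extra frames in the second shape come from: the image is tiled into max_image_size x max_image_size crops and the downscaled global image is appended as one extra frame.

    import Foundation

    // Hypothetical helper, assuming the usual Idefics3 splitting behavior:
    // rows x cols tiles plus one global image.
    func expectedFrameCount(resizedWidth: Int, resizedHeight: Int, maxImageSize: Int = 512) -> Int {
        let cols = Int(ceil(Double(resizedWidth) / Double(maxImageSize)))
        let rows = Int(ceil(Double(resizedHeight) / Double(maxImageSize)))
        if rows * cols <= 1 { return 1 }   // small images are not split at all
        return rows * cols + 1             // e.g. a 3 x 4 grid + the global image = 13
    }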

Python MLX-VLM example

Input Image:

[image: g700_darkblue]

Before

 In this image we can see an airplane flying in the air. In the background we can see the land with water.<end_of_utterance>
==========
Prompt: 873 tokens, 1766.771 tokens-per-sec
Generation: 26 tokens, 200.883 tokens-per-sec
Peak memory: 2.118 GB

After (matches transformers ✅)

The image depicts a small, sleek, and narrow aircraft, likely a private jet, flying at a high altitude. The aircraft is oriented with its tail and wings pointing towards the right side of the image, suggesting it is moving from left to right. The aircraft's body is primarily white with a black and gray tail fin and a black and gray propeller. The tail fin has the text "G700" printed on it in black.

The aircraft's wings are extended, and the landing gear is retracted, indicating that it is either preparing to land or is taxiing. The aircraft's windows are visible, and there are several small, circular openings along the sides, which are likely for the cockpit windows.
==========
Prompt: 873 tokens, 1801.514 tokens-per-sec
Generation: 250 tokens, 200.506 tokens-per-sec
Peak memory: 2.118 GB

@Blaizzy
Author

Blaizzy commented Jan 31, 2025

I also think it will probably fix the broadcast error reported here.

I tried adding print statements to Idefics3.swift so I could see the tensors, but I can't see them in the CLI when I run it using mlx-run.

Here is the execution command I tried.

./mlx-run llm-tool --model /Users/prince_canuma/SmolVLM-500M-Instruct-6bit --prompt "Describe this image" --image "/Users/prince_canuma/Downloads/78000D05-4283-4575-BB67-952702717C99.JPEG" --max-tokens 100
--- xcodebuild: WARNING: Using the first of multiple matching destinations:
{ platform:macOS, arch:arm64, id:00006034-000860D90184001C, name:My Mac }
{ platform:macOS, arch:x86_64, id:00006034-000860D90184001C, name:My Mac }
{ platform:macOS, name:Any Mac }
Model loaded -> directory(file:///Users/prince_canuma/SmolVLM-500M-Instruct-6bit)
Starting generation ...
Describe this image MLX error: Shapes (1,576,768) and (1,1024,768) cannot be broadcast. at /Users/prince_canuma/Library/Developer/Xcode/DerivedData/mlx-swift-examples-bbrnuectwzmzmfbvavmcvmahnvzs/SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/ops.cpp:31

Btw, the vision patch-embed sanitization is not working, even though the code checks out to me.

./mlx-run llm-tool --model HuggingFaceTB/SmolVLM-500M-Instruct --prompt "Describe this image" --image "/Users/prince_canuma/Downloads/78000D05-4283-4575-BB67-952702717C99.JPEG" --max-tokens 100
--- xcodebuild: WARNING: Using the first of multiple matching destinations:
{ platform:macOS, arch:arm64, id:00006034-000860D90184001C, name:My Mac }
{ platform:macOS, arch:x86_64, id:00006034-000860D90184001C, name:My Mac }
{ platform:macOS, name:Any Mac }
Error: Mismatched parameter weight shape. Actual [768, 3, 16, 16], expected [768, 16, 16, 3]

@isbee

isbee commented Feb 3, 2025

I also think it will probably fix the broadcast error reported in Blaizzy/mlx-vlm#194.

@Blaizzy Unfortunately this PR didn't solve that issue. It crashes here, before entering prepareInputsForMultimodal().

  • The actual crash happens in ops.cpp, but since this is a shape bug, I don't think mlx-c is the issue.

Full stacktrace before crash

Thread 4 Queue : com.apple.root.user-initiated-qos.cooperative (concurrent)
#0	0x0000000106c58e18 in ::mlx_add(mlx_array *, mlx_array, mlx_array, mlx_stream) at /Users/user/Library/Developer/Xcode/DerivedData/mlx-swift-examples-djkwqspwvzjyvodqeyxeahqxpsse/SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/ops.cpp:27
#1	0x0000000106b5e520 in static MLXArray.+ infix(_:_:) at /Users/user/Library/Developer/Xcode/DerivedData/mlx-swift-examples-djkwqspwvzjyvodqeyxeahqxpsse/SourcePackages/checkouts/mlx-swift/Source/MLX/MLXArray+Ops.swift:45
#2	0x0000000106ad7f2c in Vision.VisionEmbeddings.callAsFunction(_:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXVLM/Models/Idefics3.swift:553
#3	0x0000000106ad8d10 in Vision.VisionModel.callAsFunction(_:outputHiddenStates:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXVLM/Models/Idefics3.swift:579
#4	0x0000000106ada8c0 in Idefics3.getInputEmbeddings(inputIds:pixelValues:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXVLM/Models/Idefics3.swift:649
#5	0x0000000106adb178 in Idefics3.prepare(_:cache:windowSize:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXVLM/Models/Idefics3.swift:703
#6	0x0000000106adbf5c in protocol witness for LanguageModel.prepare(_:cache:windowSize:) in conformance Idefics3 ()
#7	0x000000010791d868 in TokenIterator.prepare(input:windowSize:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXLMCommon/Evaluate.swift:328
#8	0x000000010791de08 in TokenIterator.init(input:model:cache:parameters:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXLMCommon/Evaluate.swift:299
#9	0x000000010791ff00 in generate(input:parameters:context:didGenerate:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXLMCommon/Evaluate.swift:491
#10	0x0000000106ab4938 in closure #1 in VLMEvaluator.generate(prompt:image:) at /Users/user/practice/mlx-swift-examples/Applications/VLMEval/ContentView.swift:339
#11	0x0000000106ab7d4c in partial apply for closure #1 in VLMEvaluator.generate(prompt:image:) ()
#12	0x0000000107935374 in ModelContainer.perform<τ_0_0>(_:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXLMCommon/ModelContainer.swift:66
#13	0x0000000106ab3b10 in VLMEvaluator.generate(prompt:image:) at /Users/user/practice/mlx-swift-examples/Applications/VLMEval/ContentView.swift:333
#14	0x0000000106aaf374 in closure #1 in ContentView.generate() at /Users/user/practice/mlx-swift-examples/Applications/VLMEval/ContentView.swift:230
#15	0x0000000106ab734c in partial apply for closure #1 in ContentView.generate() ()
#16	0x0000000106ab9834 in thunk for @escaping @isolated(any) @callee_guaranteed @async () -> (@out A) ()
#17	0x0000000106aba48c in partial apply for thunk for @escaping @isolated(any) @callee_guaranteed @async () -> (@out A) ()

@@ -794,7 +783,7 @@ public class Idefics3Processor: UserInputProcessor {
     // From the Python code and default config, we know image_token_id is usually 49153.
     // Hardcode this since we can't pass it in or rely on it from the processor config.
-    private let imageTokenId = 49153
+    private let imageTokenId = 49190
Collaborator

This no longer matches the comment. Also, might want to change this line:

        self.imageTokenId = (try? container.decode(Int.self, forKey: .imageTokenId)) ?? 49153

That said, perhaps we should address the "Hardcode this since we can't pass it in or rely on it from the processor config" part of it -- if UserInputProcessor needs it then it should get passed in (VLMModelFactory).
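
Here is a minimal sketch of the "pass it in" option (the init signature and the factory wiring are illustrative assumptions, not the repo's actual API):

    // Hypothetical shape: the factory reads the id from the model/processor config
    // and hands it to the processor, so nothing is hardcoded in the processor itself.
    public class Idefics3ProcessorSketch {
        private let imageTokenId: Int

        public init(imageTokenId: Int = 49190) {
            self.imageTokenId = imageTokenId
        }
    }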

Comparing this to the other VLMs it looks like the difference is here:

            // Encode only the text part of the prompt, without <image>
            var promptTokens = try tokenizer.encode(text: prompt)

            let imageTokenIndex = promptTokens.count / 2
            promptTokens.insert(imageTokenId, at: imageTokenIndex)

This deals in numeric tokens after tokenization, while the others inject text.

I think we can keep this hard coded for this specific issue but it would be good to decide which approach to take here:

  • inject text into the prompt (no token id needed; see the sketch after this list)
  • we need the model config
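
For reference, a rough sketch of the first option, assuming "<image>" is the literal image-token string in the SmolVLM/Idefics3 tokenizer vocabulary (the real prompt expansion, with wrapper tokens and image_seq_len repeats, is more involved):

    // Inject the placeholder text and let the tokenizer produce the token id itself,
    // so no hardcoded imageTokenId is needed.
    let promptWithImage = "<image>" + prompt
    let promptTokens = try tokenizer.encode(text: promptWithImage)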

@davidkoski
Collaborator

davidkoski commented Feb 3, 2025

Looking at pixel_values coming into getInputEmbeddings in Python, I see:

pixel_values.shape=(1, 13, 3, 512, 512)

vs

(lldb) po pixelValues!.shape
▿ 4 elements
  - 0 : 1
  - 1 : 384
  - 2 : 384
  - 3 : 3

It looks like:

  • we should not have fixedImageSize but get that from Idefics3ProcessorConfiguration (need the property)
  "max_image_size": {
    "longest_edge": 512
  },
  • I think the code from transformers (python) idefics3/image_processor needs to be ported:
        if do_image_splitting:
            # We first resize both height and width of each image to the nearest max_image_size multiple, disregarding the aspect ratio
            # for size=(10, max_image_size) -> rescaled_size=(max_image_size, max_image_size)
            # for size=(11, max_image_size+1) -> rescaled_size=(max_image_size, max_image_size*2)
...
  • and we need the config to trigger this (a config-decoding sketch follows below)
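
A sketch of the missing configuration plumbing (assumption: the property and JSON key names mirror preprocessor_config.json; the actual Idefics3ProcessorConfiguration in this repo may be structured differently):

    // Decodes "max_image_size": { "longest_edge": 512 } and "do_image_splitting"
    // so the processor no longer needs a hardcoded fixedImageSize.
    struct MaxImageSize: Codable {
        let longestEdge: Int
        enum CodingKeys: String, CodingKey { case longestEdge = "longest_edge" }
    }

    struct ProcessorConfigSketch: Codable {
        let doImageSplitting: Bool?       // triggers the splitting path
        let maxImageSize: MaxImageSize?   // replaces the hardcoded fixedImageSize

        enum CodingKeys: String, CodingKey {
            case doImageSplitting = "do_image_splitting"
            case maxImageSize = "max_image_size"
        }
    }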

As far as I can tell we are passing a non-split image in where a split image is expected. I don't know if that is the entirety of it, but that seems to be the difference between the swift and python code before it hits the assertions.

@davidkoski
Collaborator

@Blaizzy

I tried adding print statements to idefics3.swift so I can see the tensors but I can't see them in the CLI when I run it using mlx_run.

I typically use the debugger in Xcode for this. You can press cmd-opt-r to set the arguments to the program. You may want to set the Run Configuration to Debug:

[screenshot: Run scheme configuration]

That said, I am able to get it to print shapes:

./mlx-run --debug llm-tool --model mlx-community/SmolVLM-500M-Instruct-6bit --image ~/Downloads/408605771-277c33f2-502c-4ee1-a6be-d94c086aaed7.png --prompt "what is this"
2025-02-03 09:00:03.344 xcodebuild[92546:18373549]  DVTDeviceOperation: Encountered a build number "" that is incompatible with DVTBuildVersion.
2025-02-03 09:00:03.345 xcodebuild[92546:18373540] [MT] DVTDeviceOperation: Encountered a build number "" that is incompatible with DVTBuildVersion.
--- xcodebuild: WARNING: Using the first of multiple matching destinations:
{ platform:macOS, arch:arm64, id:00006021-001C18320C47401E, name:My Mac }
{ platform:macOS, arch:x86_64, id:00006021-001C18320C47401E, name:My Mac }
{ platform:macOS, name:Any Mac }
Model loaded -> id("mlx-community/SmolVLM-500M-Instruct-6bit")
Starting generation ...
what is this pixels: [1, 384, 384, 3]
MLX error: Shapes (1,576,768) and (1,1024,768) cannot be broadcast. at /Users/dkoski/Library/Developer/Xcode/DerivedData/mlx-swift-examples-eimbjcofifunwybkcvhnzjbqwyri/SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/ops.cpp:3

The --debug is to switch it to pick up the Debug Run Configuration. If you are using Release you can omit that.

I just added a print like this:

diff --git a/Libraries/MLXVLM/Models/Idefics3.swift b/Libraries/MLXVLM/Models/Idefics3.swift
index 634c05e..624241c 100644
--- a/Libraries/MLXVLM/Models/Idefics3.swift
+++ b/Libraries/MLXVLM/Models/Idefics3.swift
@@ -852,6 +852,8 @@ public class Idefics3Processor: UserInputProcessor {
             {
                 pixels = pixels.transposed(0, 2, 3, 1)
             }
+            
+            print("pixels: \(pixels.shape)")
 
             return LMInput(
                 text: .init(tokens: promptArray, mask: mask),

@Blaizzy
Author

Blaizzy commented Feb 12, 2025

@Blaizzy Unfortunately this PR didn't solve that issue. It crashes here, before entering prepareInputsForMultimodal().

The actual crash happens in ops.cpp, but since this is a shape bug, I don't think mlx-c is the issue.

I see, thanks a lot!

I figured as much but I got sick last week and only returned this week.

@Blaizzy
Author

Blaizzy commented Feb 12, 2025

As far as I can tell we are passing a non-split image in where a split image is expected. I don't know if that is the entirety of it, but that seems to be the difference between the swift and python code before it hits the assertions.

In Python it works with do_image_splitting both True and False.

That's because we are flattening the first 3/4 dimensions of the image_feature tensor.
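
For illustration, a sketch of the equivalent flatten on the Swift side, assuming pixel values arrive as (batch, numSplits, channels, height, width) like the Python trace above (this is not code from the PR):

    import MLX

    func flattenSplits(_ pixelValues: MLXArray) -> MLXArray {
        let s = pixelValues.shape                      // e.g. [1, 13, 3, 512, 512]
        guard s.count == 5 else { return pixelValues } // already flat
        // Merge the batch and split dimensions so the vision tower sees (13, 3, 512, 512).
        return pixelValues.reshaped([s[0] * s[1], s[2], s[3], s[4]])
    }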

@Blaizzy
Author

Blaizzy commented Feb 12, 2025

@davidkoski thanks a lot for the tips!

I typically use the debugger in Xcode for this. You can press cmd-opt-r to set the arguments to the program. You may want to set the Run Configuration to Debug:

My next question, so I can move forward on this, is how to set everything up in Xcode. I struggled for 30 minutes and ended up doing it via the terminal. To reiterate, I'm just getting started with Swift and Xcode :)

@davidkoski
Collaborator

My next question, so I can move forward on this, is how to set everything up in Xcode. I struggled for 30 minutes and ended up doing it via the terminal. To reiterate, I'm just getting started with Swift and Xcode :)

Without knowing specifically what you are running into it is hard to say. You might look here:

Once you have it installed, you should be able to open mlx-swift-examples.xcodeproj, select a scheme (roughly a target), and build and run.

These are the schemes:

[screenshot: scheme list]

You pick one and a build/run destination; the simplest will be macOS. You can press cmd-opt-r to edit the "run" configuration for the scheme. In the first tab you can select between Debug and Release builds (Debug if you want to run in the debugger, Release to run at full speed). The Arguments tab is useful for llm-tool, to edit the arguments passed.

If you run into specific problems just ask!

@Blaizzy
Author

Blaizzy commented Feb 12, 2025

Thanks!

I'm running macOS :)

@pcuenca
Contributor

pcuenca commented Feb 14, 2025

Hi @Blaizzy, welcome to Swift-land :) Incidentally, I started working on this before seeing your PR. I have a preliminary question: did you manage to get any SmolVLM running, even before your patch? I tested the default model in this repo (mlx-community/SmolVLM-Instruct-4bit) and I'm getting gibberish (after changing the image token id to 49190).
