Fix idefics3 do-image-split #192

Draft · wants to merge 1 commit into main

Conversation


@Blaizzy Blaizzy commented Jan 31, 2025

Description

This PR matches the fix applied to Idefics 2 and 3 in mlx-vlm.

Source: Blaizzy/mlx-vlm#191

@davidkoski @awni I'm not a Swift expert and would love to start contributing. This is my first attempt.

Although I couldn't manage to run this example successfully, this change should allow the model to process images at multiple resolutions, which boosts OCR and finer-grained detail. For this, Idefics 3 models have a preprocessor argument called do_image_splitting, which is True by default. I had previously set it to False on all MLX models, but I have now pinpointed the bug and made a fix.

Input pixel values:
-> (1, 3, 512, 512) without do_image_splitting
-> (13, 3, 512, 512) with do_image_splitting
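
For intuition, here is a minimal sketch (my own assumption about the preprocessing, not code from this PR) of where the extra frames in the second shape come from: the image is tiled into max_image_size x max_image_size crops and the downscaled global image is appended as one extra frame.

    import Foundation

    // Hypothetical helper, assuming the usual Idefics3 splitting behavior:
    // rows x cols tiles plus one global image.
    func expectedFrameCount(resizedWidth: Int, resizedHeight: Int, maxImageSize: Int = 512) -> Int {
        let cols = Int(ceil(Double(resizedWidth) / Double(maxImageSize)))
        let rows = Int(ceil(Double(resizedHeight) / Double(maxImageSize)))
        if rows * cols <= 1 { return 1 }   // small images are not split at all
        return rows * cols + 1             // e.g. a 3 x 4 grid + the global image = 13
    }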

Python MLX-VLM example

Input Image:

[image: g700_darkblue]

Before

 In this image we can see an airplane flying in the air. In the background we can see the land with water.<end_of_utterance>
==========
Prompt: 873 tokens, 1766.771 tokens-per-sec
Generation: 26 tokens, 200.883 tokens-per-sec
Peak memory: 2.118 GB

After (matches transformers ✅)

The image depicts a small, sleek, and narrow aircraft, likely a private jet, flying at a high altitude. The aircraft is oriented with its tail and wings pointing towards the right side of the image, suggesting it is moving from left to right. The aircraft's body is primarily white with a black and gray tail fin and a black and gray propeller. The tail fin has the text "G700" printed on it in black.

The aircraft's wings are extended, and the landing gear is retracted, indicating that it is either preparing to land or is taxiing. The aircraft's windows are visible, and there are several small, circular openings along the sides, which are likely for the cockpit windows.
==========
Prompt: 873 tokens, 1801.514 tokens-per-sec
Generation: 250 tokens, 200.506 tokens-per-sec
Peak memory: 2.118 GB

@Blaizzy
Author

Blaizzy commented Jan 31, 2025

I also think it will probably fix the broadcast error reported here.

I tried adding print statements to Idefics3.swift so I could see the tensors, but I can't see them in the CLI when I run it using mlx-run.

Here is the execution command I tried.

./mlx-run llm-tool --model /Users/prince_canuma/SmolVLM-500M-Instruct-6bit --prompt "Describe this image" --image "/Users/prince_canuma/Downloads/78000D05-4283-4575-BB67-952702717C99.JPEG" --max-tokens 100
--- xcodebuild: WARNING: Using the first of multiple matching destinations:
{ platform:macOS, arch:arm64, id:00006034-000860D90184001C, name:My Mac }
{ platform:macOS, arch:x86_64, id:00006034-000860D90184001C, name:My Mac }
{ platform:macOS, name:Any Mac }
Model loaded -> directory(file:///Users/prince_canuma/SmolVLM-500M-Instruct-6bit)
Starting generation ...
Describe this image MLX error: Shapes (1,576,768) and (1,1024,768) cannot be broadcast. at /Users/prince_canuma/Library/Developer/Xcode/DerivedData/mlx-swift-examples-bbrnuectwzmzmfbvavmcvmahnvzs/SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/ops.cpp:31

Btw, the vision patch-embed sanitization is not working, even though the code checks out to me.

./mlx-run llm-tool --model HuggingFaceTB/SmolVLM-500M-Instruct --prompt "Describe this image" --image "/Users/prince_canuma/Downloads/78000D05-4283-4575-BB67-952702717C99.JPEG" --max-tokens 100
--- xcodebuild: WARNING: Using the first of multiple matching destinations:
{ platform:macOS, arch:arm64, id:00006034-000860D90184001C, name:My Mac }
{ platform:macOS, arch:x86_64, id:00006034-000860D90184001C, name:My Mac }
{ platform:macOS, name:Any Mac }
Error: Mismatched parameter weight shape. Actual [768, 3, 16, 16], expected [768, 16, 16, 3]

@isbee

isbee commented Feb 3, 2025

I also think it will probably fix the broadcast error reported in Blaizzy/mlx-vlm#194.

@Blaizzy Unfortunately this PR didn't solve that issue. It crashes here, before entering prepareInputsForMultimodal().

  • The actual crash happens in ops.cpp, but since this is a shape bug, I don't think mlx-c is the issue.

Full stacktrace before crash

Thread 4 Queue : com.apple.root.user-initiated-qos.cooperative (concurrent)
#0	0x0000000106c58e18 in ::mlx_add(mlx_array *, mlx_array, mlx_array, mlx_stream) at /Users/user/Library/Developer/Xcode/DerivedData/mlx-swift-examples-djkwqspwvzjyvodqeyxeahqxpsse/SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/ops.cpp:27
#1	0x0000000106b5e520 in static MLXArray.+ infix(_:_:) at /Users/user/Library/Developer/Xcode/DerivedData/mlx-swift-examples-djkwqspwvzjyvodqeyxeahqxpsse/SourcePackages/checkouts/mlx-swift/Source/MLX/MLXArray+Ops.swift:45
#2	0x0000000106ad7f2c in Vision.VisionEmbeddings.callAsFunction(_:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXVLM/Models/Idefics3.swift:553
#3	0x0000000106ad8d10 in Vision.VisionModel.callAsFunction(_:outputHiddenStates:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXVLM/Models/Idefics3.swift:579
#4	0x0000000106ada8c0 in Idefics3.getInputEmbeddings(inputIds:pixelValues:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXVLM/Models/Idefics3.swift:649
#5	0x0000000106adb178 in Idefics3.prepare(_:cache:windowSize:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXVLM/Models/Idefics3.swift:703
#6	0x0000000106adbf5c in protocol witness for LanguageModel.prepare(_:cache:windowSize:) in conformance Idefics3 ()
#7	0x000000010791d868 in TokenIterator.prepare(input:windowSize:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXLMCommon/Evaluate.swift:328
#8	0x000000010791de08 in TokenIterator.init(input:model:cache:parameters:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXLMCommon/Evaluate.swift:299
#9	0x000000010791ff00 in generate(input:parameters:context:didGenerate:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXLMCommon/Evaluate.swift:491
#10	0x0000000106ab4938 in closure #1 in VLMEvaluator.generate(prompt:image:) at /Users/user/practice/mlx-swift-examples/Applications/VLMEval/ContentView.swift:339
#11	0x0000000106ab7d4c in partial apply for closure #1 in VLMEvaluator.generate(prompt:image:) ()
#12	0x0000000107935374 in ModelContainer.perform<τ_0_0>(_:) at /Users/user/practice/mlx-swift-examples/Libraries/MLXLMCommon/ModelContainer.swift:66
#13	0x0000000106ab3b10 in VLMEvaluator.generate(prompt:image:) at /Users/user/practice/mlx-swift-examples/Applications/VLMEval/ContentView.swift:333
#14	0x0000000106aaf374 in closure #1 in ContentView.generate() at /Users/user/practice/mlx-swift-examples/Applications/VLMEval/ContentView.swift:230
#15	0x0000000106ab734c in partial apply for closure #1 in ContentView.generate() ()
#16	0x0000000106ab9834 in thunk for @escaping @isolated(any) @callee_guaranteed @async () -> (@out A) ()
#17	0x0000000106aba48c in partial apply for thunk for @escaping @isolated(any) @callee_guaranteed @async () -> (@out A) ()

@@ -794,7 +783,7 @@ public class Idefics3Processor: UserInputProcessor {
     // From the Python code and default config, we know image_token_id is usually 49153.
     // Hardcode this since we can't pass it in or rely on it from the processor config.
-    private let imageTokenId = 49153
+    private let imageTokenId = 49190
Collaborator

This no longer matches the comment. Also, might want to change this line:

        self.imageTokenId = (try? container.decode(Int.self, forKey: .imageTokenId)) ?? 49153

That said, perhaps we should address the "Hardcode this since we can't pass it in or rely on it from the processor config" part of it -- if UserInputProcessor needs it then it should get passed in (VLMModelFactory).
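
Here is a minimal sketch of the "pass it in" option (the init signature and the factory wiring are illustrative assumptions, not the repo's actual API):

    // Hypothetical shape: the factory reads the id from the model/processor config
    // and hands it to the processor, so nothing is hardcoded in the processor itself.
    public class Idefics3ProcessorSketch {
        private let imageTokenId: Int

        public init(imageTokenId: Int = 49190) {
            self.imageTokenId = imageTokenId
        }
    }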

Comparing this to the other VLMs it looks like the difference is here:

            // Encode only the text part of the prompt, without <image>
            var promptTokens = try tokenizer.encode(text: prompt)

            let imageTokenIndex = promptTokens.count / 2
            promptTokens.insert(imageTokenId, at: imageTokenIndex)

This deals in numeric tokens after tokenization, while the others inject text.

I think we can keep this hard coded for this specific issue but it would be good to decide which approach to take here:

  • inject text into the prompt (no token id needed; see the sketch after this list)
  • we need the model config
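
For reference, a rough sketch of the first option, assuming "<image>" is the literal image-token string in the SmolVLM/Idefics3 tokenizer vocabulary (the real prompt expansion, with wrapper tokens and image_seq_len repeats, is more involved):

    // Inject the placeholder text and let the tokenizer produce the token id itself,
    // so no hardcoded imageTokenId is needed.
    let promptWithImage = "<image>" + prompt
    let promptTokens = try tokenizer.encode(text: promptWithImage)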

@davidkoski
Collaborator

davidkoski commented Feb 3, 2025

Looking at pixel_values coming into getInputEmbeddings in Python, I see:

pixel_values.shape=(1, 13, 3, 512, 512)

vs

(lldb) po pixelValues!.shape
▿ 4 elements
  - 0 : 1
  - 1 : 384
  - 2 : 384
  - 3 : 3

It looks like:

  • we should not have fixedImageSize but get that from Idefics3ProcessorConfiguration (need the property)
  "max_image_size": {
    "longest_edge": 512
  },
  • I think the code from transformers (python) idefics3/image_processor needs to be ported:
        if do_image_splitting:
            # We first resize both height and width of each image to the nearest max_image_size multiple, disregarding the aspect ratio
            # for size=(10, max_image_size) -> rescaled_size=(max_image_size, max_image_size)
            # for size=(11, max_image_size+1) -> rescaled_size=(max_image_size, max_image_size*2)
...
  • and we need the config to trigger this (a config-decoding sketch follows below)
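
A sketch of the missing configuration plumbing (assumption: the property and JSON key names mirror preprocessor_config.json; the actual Idefics3ProcessorConfiguration in this repo may be structured differently):

    // Decodes "max_image_size": { "longest_edge": 512 } and "do_image_splitting"
    // so the processor no longer needs a hardcoded fixedImageSize.
    struct MaxImageSize: Codable {
        let longestEdge: Int
        enum CodingKeys: String, CodingKey { case longestEdge = "longest_edge" }
    }

    struct ProcessorConfigSketch: Codable {
        let doImageSplitting: Bool?       // triggers the splitting path
        let maxImageSize: MaxImageSize?   // replaces the hardcoded fixedImageSize

        enum CodingKeys: String, CodingKey {
            case doImageSplitting = "do_image_splitting"
            case maxImageSize = "max_image_size"
        }
    }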

As far as I can tell we are passing a non-split image in where a split image is expected. I don't know if that is the entirety of it, but that seems to be the difference between the swift and python code before it hits the assertions.

@davidkoski
Collaborator

@Blaizzy

I tried adding print statements to idefics3.swift so I can see the tensors but I can't see them in the CLI when I run it using mlx_run.

I typically use the debugger in Xcode for this. You can press cmd-opt-r to set the arguments to the program. You may want to set the Run Configuration to Debug:

[screenshot: Run scheme configuration]

That said, I am able to get it to print shapes:

./mlx-run --debug llm-tool --model mlx-community/SmolVLM-500M-Instruct-6bit --image ~/Downloads/408605771-277c33f2-502c-4ee1-a6be-d94c086aaed7.png --prompt "what is this"
2025-02-03 09:00:03.344 xcodebuild[92546:18373549]  DVTDeviceOperation: Encountered a build number "" that is incompatible with DVTBuildVersion.
2025-02-03 09:00:03.345 xcodebuild[92546:18373540] [MT] DVTDeviceOperation: Encountered a build number "" that is incompatible with DVTBuildVersion.
--- xcodebuild: WARNING: Using the first of multiple matching destinations:
{ platform:macOS, arch:arm64, id:00006021-001C18320C47401E, name:My Mac }
{ platform:macOS, arch:x86_64, id:00006021-001C18320C47401E, name:My Mac }
{ platform:macOS, name:Any Mac }
Model loaded -> id("mlx-community/SmolVLM-500M-Instruct-6bit")
Starting generation ...
what is this pixels: [1, 384, 384, 3]
MLX error: Shapes (1,576,768) and (1,1024,768) cannot be broadcast. at /Users/dkoski/Library/Developer/Xcode/DerivedData/mlx-swift-examples-eimbjcofifunwybkcvhnzjbqwyri/SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/ops.cpp:3

The --debug is to switch it to pick up the Debug Run Configuration. If you are using Release you can omit that.

I just added a print like this:

diff --git a/Libraries/MLXVLM/Models/Idefics3.swift b/Libraries/MLXVLM/Models/Idefics3.swift
index 634c05e..624241c 100644
--- a/Libraries/MLXVLM/Models/Idefics3.swift
+++ b/Libraries/MLXVLM/Models/Idefics3.swift
@@ -852,6 +852,8 @@ public class Idefics3Processor: UserInputProcessor {
             {
                 pixels = pixels.transposed(0, 2, 3, 1)
             }
+            
+            print("pixels: \(pixels.shape)")
 
             return LMInput(
                 text: .init(tokens: promptArray, mask: mask),

@Blaizzy
Author

Blaizzy commented Feb 12, 2025

@Blaizzy Unfortunately this PR didn't solve that issue. It crashes here, before entering prepareInputsForMultimodal().

The actual crash happens in ops.cpp, but since this is a shape bug, I don't think mlx-c is the issue.

I see, thanks a lot!

I figured as much but I got sick last week and only returned this week.

@Blaizzy
Author

Blaizzy commented Feb 12, 2025

As far as I can tell we are passing a non-split image in where a split image is expected. I don't know if that is the entirety of it, but that seems to be the difference between the swift and python code before it hits the assertions.

In Python it works with do_image_splitting both True and False.

That's because we are flattening the first 3/4 dimensions of the image_feature tensor.
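
For illustration, a sketch of the equivalent flatten on the Swift side, assuming pixel values arrive as (batch, numSplits, channels, height, width) like the Python trace above (this is not code from the PR):

    import MLX

    func flattenSplits(_ pixelValues: MLXArray) -> MLXArray {
        let s = pixelValues.shape                      // e.g. [1, 13, 3, 512, 512]
        guard s.count == 5 else { return pixelValues } // already flat
        // Merge the batch and split dimensions so the vision tower sees (13, 3, 512, 512).
        return pixelValues.reshaped([s[0] * s[1], s[2], s[3], s[4]])
    }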

@Blaizzy
Author

Blaizzy commented Feb 12, 2025

@davidkoski thanks a lot for the tips!

I typically use the debugger in Xcode for this. You can press cmd-opt-r to set the arguments to the program. You may want to set the Run Configuration to Debug:

My next question, so I can move forward on this, is how to set everything up in Xcode. I struggled for 30 minutes and ended up doing it via the terminal. To reiterate, I'm just getting started with Swift and Xcode :)

@davidkoski
Collaborator

My next question, so I can move forward on this, is how to set everything up in Xcode. I struggled for 30 minutes and ended up doing it via the terminal. To reiterate, I'm just getting started with Swift and Xcode :)

Without knowing specifically what you are running into it is hard to say. You might look here:

Once you have it installed, you should be able to open mlx-swift-examples.xcodeproj, select a scheme (roughly a target), and build and run.

These are the schemes:

[screenshot: scheme list]

You pick one and a build/run destination; the simplest will be macOS. You can press cmd-opt-r to edit the "run" configuration for the scheme. In the first tab you can select between Debug and Release builds (Debug if you want to run in the debugger, Release to run at full speed). The Arguments tab is useful for llm-tool, to edit the arguments passed.

If you run into specific problems just ask!

@Blaizzy
Author

Blaizzy commented Feb 12, 2025

Thanks!

I'm running macOS :)

@pcuenca
Contributor

pcuenca commented Feb 14, 2025

Hi @Blaizzy, welcome to Swift-land :) Incidentally, I started working on this before seeing your PR. I have a preliminary question: did you manage to get any SmolVLM running, even before your patch? I tested the default model in this repo (mlx-community/SmolVLM-Instruct-4bit) and I'm getting gibberish (after changing the image token id to 49190).
