Add Qwen 2.5 VL #222

Open -- wants to merge 5 commits into main

Conversation

DePasqualeOrg
Contributor

@davidkoski, I'm working on Qwen 2.5 VL, and I'm getting a build error in VLMModelFactory.swift that I don't understand. I've defined everything just like Qwen 2 VL, but it can't find the modules. Are you able to see what the problem is?


// Create attention mask
let attentionMask = full(
(1, sequenceLength, sequenceLength),
Collaborator


Suggested change
- (1, sequenceLength, sequenceLength),
+ [1, sequenceLength, sequenceLength],

// Create attention mask
let attentionMask = full(
(1, sequenceLength, sequenceLength),
MLXArray.finfo(q.dtype).min,
Collaborator


This one is a little trickier -- maybe we need to add something to DType, but for now perhaps one of these:

    -Float16.greatestFiniteMagnitude
    -Float32.greatestFiniteMagnitude
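
A hedged sketch of the kind of helper this hints at (hypothetical; neither MLX Swift nor this PR defines it): pick the most negative finite value for the mask fill based on the array's dtype.

    // Hypothetical helper, not part of MLX Swift or this PR: choose a mask fill
    // value that matches the query dtype, falling back to Float32's range.
    func leastFiniteValue(for dtype: DType) -> Float {
        switch dtype {
        case .float16:
            return -Float(Float16.greatestFiniteMagnitude)
        default:
            return -Float32.greatestFiniteMagnitude
        }
    }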

let attentionMask = full(
(1, sequenceLength, sequenceLength),
MLXArray.finfo(q.dtype).min,
dtype: q.dtype)
Collaborator


There isn't a variant that takes a dtype, only a type: (e.g. Float32.self).

Filed: ml-explore/mlx-swift#199

@DePasqualeOrg
Contributor Author

Thanks for your feedback, but I still couldn't even test this because of the build errors I mentioned. The implementation is still a work in progress. Do you know how to resolve the build errors?

@DePasqualeOrg
Contributor Author

Ah, now I'm seeing the errors in the Qwen 2.5 VL implementation. Something must have been wrong with Xcode. I'll try to pick it up from here and factor out the shared parts with Qwen 2 VL.

@davidkoski
Collaborator

> @davidkoski, I'm working on Qwen 2.5 VL, and I'm getting a build error in VLMModelFactory.swift that I don't understand. I've defined everything just like Qwen 2 VL, but it can't find the modules. Are you able to see what the problem is?

I just pushed a change with fixes to make it build. There are some pieces that are missing to do it exactly right, see ml-explore/mlx-swift#199

In particular this one:

            // Create attention mask
            let attentionMask = full(
                [1, sequenceLength, sequenceLength],
                values: -Float32.greatestFiniteMagnitude)

I don't know if we need Float16 here or Float32, but this will at least build.

This part is not ideal, as the .item() calls force evaluation in the middle of the forward pass:

            // Update mask for each sequence
            for i in 1 ..< cuSeqlens.size {
                let start = cuSeqlens[i - 1].item(Int.self)
                let end = cuSeqlens[i].item(Int.self)
                attentionMask[0..., start ..< end, start ..< end] = MLXArray(0)
            }

I think @awni may have had a technique for avoiding this (that we used in Qwen2VL)
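
One possible way to avoid the host round trips -- a hypothetical sketch, not necessarily the Qwen2VL technique -- is to build the block-diagonal mask from cuSeqlens with broadcast comparisons instead of slicing per sequence:

    // Hypothetical sketch: derive a segment id for every position from cuSeqlens,
    // then allow attention only within the same segment -- no .item() calls needed.
    let positions = MLXArray(0 ..< sequenceLength).reshaped([-1, 1])       // [L, 1]
    let boundaries = cuSeqlens[1 ..< cuSeqlens.size].reshaped([1, -1])     // [1, K]
    let segmentIds = sum(positions .>= boundaries, axis: 1)                // [L]
    let sameSegment = segmentIds.reshaped([-1, 1]) .== segmentIds.reshaped([1, -1])
    let attentionMask = which(
        sameSegment.reshaped([1, sequenceLength, sequenceLength]),
        MLXArray(Float32(0)),
        MLXArray(-Float32.greatestFiniteMagnitude))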

@DePasqualeOrg
Contributor Author

Inference now works. I'll see if I can factor out the parts that are shared with Qwen 2 VL.

@davidkoski
Collaborator

How does this relate to #197? Is this a replacement due to the mentioned issues with the window processing?

@smdesai & @DePasqualeOrg

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Mar 4, 2025

I've factored out the shared parts and verified that this works with Qwen 2 VL and 2.5 VL for images, videos, and combinations of the two. I didn't base this on #197, so if @smdesai has any feedback, please let me know.

As a separate issue, I noticed that some of the videos I used exceeded the maximum buffer length of my MacBook Pro, which causes the app to crash. I guess it's the app's responsibility to check the input against the device's maximum buffer length and available RAM. How can we estimate the required memory and buffer length of a given image or video?

DePasqualeOrg marked this pull request as ready for review on March 4, 2025, 22:11
@smdesai
Contributor

smdesai commented Mar 5, 2025

@DePasqualeOrg Looks like we refactored around the same time. I've gone through the code, and aside from your windowing fix (which I borrowed), the code is identical; the only difference is that I may have refactored a little more. Looks good to me.

@davidkoski
Collaborator

davidkoski commented Mar 5, 2025

> As a separate issue, I noticed that some of the videos I used exceeded the maximum buffer length of my MacBook Pro, which causes the app to crash. I guess it's the app's responsibility to check the input against the device's maximum buffer length and available RAM. How can we estimate the required memory and buffer length of a given image or video?

I don't know how to estimate the VLM memory use, but we can compute the size needed for the frames and it will be related to that. A video has a duration and dimensions on the video track. We know our sampling rate and any downsampling, so we can estimate the memory from that.
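
For a rough, illustrative estimate (the function name and formula are mine, not from this PR), the frame buffer scales with sampled frames × pixels × channels × bytes per element:

    import CoreGraphics

    // Illustrative only: bytes needed to hold the sampled frames as Float32 RGB
    // tensors after preprocessing; it does not cover the VLM's own memory use.
    func estimatedFrameBytes(
        durationSeconds: Double,
        targetFPS: Double,
        maxFrames: Int,
        frameSize: CGSize
    ) -> Int {
        let sampledFrames = min(Int(durationSeconds * targetFPS), maxFrames)
        let pixels = Int(frameSize.width) * Int(frameSize.height)
        return sampledFrames * pixels * 3 * MemoryLayout<Float32>.size
    }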

It looks like the video processing (this is from the SMol PR, but the logic is the same in Qwen) does this:

            var videoFrameResult = try await MediaProcessing.asCIImageSequence(
                video, maxFrames: maxVideoFrames, targetFPS: targetVideoFPS)

            var processedFrames: [MLXArray] = []
            for frame in videoFrameResult.frames {
                let image =
                    frame
                    .toSRGB()
                    .resampled(
                        to: CGSize(width: fixedImageSize, height: fixedImageSize), method: .lanczos
                    )
                    .normalized(mean: config.imageMeanTuple, std: config.imageStdTuple)
                    .asMLXArray()
                processedFrames.append(image)
            }

That produces an array of large frames and then resamples them. They could be 4K frames. I was going to propose something in #206, but perhaps we should make a new PR: we should be resampling frames as they are collected, so the function that produces frames actually produces an array of small frames.

This could significantly reduce memory use. Let me cut an issue for that. #223
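
A hypothetical sketch of decode-time downsampling (illustrative names; AVAssetImageGenerator.maximumSize caps the decoded size, so full-resolution frames never have to be materialized):

    import AVFoundation
    import CoreGraphics

    // Illustrative only: extract frames already capped to `edge` pixels on their
    // longest side, instead of decoding full-size frames and resampling afterwards.
    func downsampledFrames(from asset: AVAsset, at times: [CMTime], edge: CGFloat) -> [CGImage] {
        let generator = AVAssetImageGenerator(asset: asset)
        generator.appliesPreferredTrackTransform = true
        generator.maximumSize = CGSize(width: edge, height: edge)  // cap decode size
        return times.compactMap { try? generator.copyCGImage(at: $0, actualTime: nil) }
    }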

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Mar 5, 2025

I asked Claude 3.7 Sonnet to evaluate both solutions. To be clear, I don't want to take much credit for my solution, since Sonnet helped a lot with it. "Your solution" refers to this PR and "his solution" refers to #197.

After comparing both solutions, I would say your solution is better overall. Here's why:

Strengths of your solution:

  1. Better organization and structure:

    • You've created a shared QwenVL namespace for common utilities used by both Qwen2VL and Qwen25VL models
    • Your code has cleaner separation of concerns with more consistent naming
  2. More consistent implementation:

    • You use RMSNorm in the Qwen25VL vision blocks, which is more appropriate than the LayerNorm used in his solution
    • Your implementation of the processor configuration handles the size parameters more consistently
  3. Better error handling and robustness:

    • Your implementation of the window attention mechanism is more complete and accurate
  4. More maintainable code:

    • You've extracted common functionality into the QwenVL namespace, making it easier to maintain both models
    • Your code has better documentation and more consistent naming conventions
  5. More accurate implementation:

    • Your implementation more closely matches the original Python implementation, particularly in the vision model components

Areas where his solution has some advantages:

  1. His solution uses more explicit module naming in some places, which can make it slightly easier to follow the model architecture at first glance.

  2. His implementation of the Qwen25VLProcessor is slightly more concise, though yours is more consistent with the rest of the codebase.

Overall assessment:

Your solution is superior because it demonstrates better software engineering practices through:

  • Better code organization and reuse
  • More consistent implementation
  • More accurate implementation of the model architecture
  • Better maintainability for future updates

The shared QwenVL namespace is particularly valuable as it would make it easier to implement other models in the Qwen family in the future, reducing code duplication and potential for errors.

@deet

deet commented Mar 6, 2025

I have tested this (and the previous PR #197, on which this was based) and am getting an error when providing an image (a PNG, not a video) to the UserInput -- "Number of placeholder tokens does not match number of frames".

Is the user of the library responsible for adding <|image_pad|> to the message sequence? Or is this related to the chat_template changes?

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Mar 6, 2025

@deet, this PR is not based on #197. It is based on the existing Qwen 2 VL implementation in Swift and the Qwen 2.5 VL implementation in Python.

Also, I tried using a PNG as input, and it works in my app Local Chat. Maybe there's an issue with how you're passing in the images.

However, I noticed that I'm getting maximum buffer length crashes when a photo or video is in portrait orientation, so we'll need to make sure they're getting scaled down to an appropriate size also when the width is less than the height.

@DePasqualeOrg
Contributor Author

I've pushed a new commit that fixes the max buffer crashes with images and videos in portrait orientation. @davidkoski, do you think the limit of 224 pixels makes sense?

scale = size.height / extent.height
}
// Use the same scaling approach regardless of orientation
let scale = min(size.width / extent.width, size.height / extent.height)
Collaborator


Yes, makes sense -- the previous approach would fail for images with extreme aspect ratios, or just aspect ratios that didn't align with the target size.
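
For illustration (the numbers are made up), the min-of-both-ratios rule keeps both edges within the target regardless of orientation:

    import CoreGraphics

    // Example: fit a portrait 2160x3840 frame into a 1024x1024 target.
    let extent = CGSize(width: 2160, height: 3840)
    let size = CGSize(width: 1024, height: 1024)
    let scale = min(size.width / extent.width, size.height / extent.height)  // ~0.267
    let fitted = CGSize(width: extent.width * scale, height: extent.height * scale)
    // fitted is roughly 576 x 1024, so both edges respect the target; scaling by
    // the width ratio alone (~0.474) would push the height to ~1820 px.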

Collaborator


We may need "shortestEdge" as well, along with a center crop. I will write up an issue for this; we can refactor in a future PR.

Collaborator


See discussion in #200 -- Qwen2 requires the exact computed size and this won't always deliver it. I have set up some unit tests so we can run all of these through.
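
For context, the exact size Qwen2 computes comes from rounding each edge to a multiple of the vision patch factor while keeping the pixel count within bounds. A hedged Swift sketch, loosely following the Python processor's smart_resize (constants and names come from the Python side, not this PR):

    // Hedged sketch of Qwen2-VL's smart_resize rounding; not code from this PR.
    func qwenTargetSize(
        width: Int, height: Int,
        factor: Int = 28,
        minPixels: Int = 56 * 56,
        maxPixels: Int = 14 * 14 * 4 * 1280
    ) -> (width: Int, height: Int) {
        var h = Int((Double(height) / Double(factor)).rounded()) * factor
        var w = Int((Double(width) / Double(factor)).rounded()) * factor
        if h * w > maxPixels {
            let beta = (Double(height * width) / Double(maxPixels)).squareRoot()
            h = Int((Double(height) / beta / Double(factor)).rounded(.down)) * factor
            w = Int((Double(width) / beta / Double(factor)).rounded(.down)) * factor
        } else if h * w < minPixels {
            let beta = (Double(minPixels) / Double(height * width)).squareRoot()
            h = Int((Double(height) * beta / Double(factor)).rounded(.up)) * factor
            w = Int((Double(width) * beta / Double(factor)).rounded(.up)) * factor
        }
        return (w, h)
    }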
