Add Qwen 2.5 VL #222

Open -- wants to merge 5 commits into main

Conversation

DePasqualeOrg
Contributor

@davidkoski, I'm working on Qwen 2.5 VL, and I'm getting a build error in VLMModelFactory.swift that I don't understand. I've defined everything just like Qwen 2 VL, but it can't find the modules. Are you able to see what the problem is?


// Create attention mask
let attentionMask = full(
(1, sequenceLength, sequenceLength),
Collaborator


Suggested change
- (1, sequenceLength, sequenceLength),
+ [1, sequenceLength, sequenceLength],

// Create attention mask
let attentionMask = full(
(1, sequenceLength, sequenceLength),
MLXArray.finfo(q.dtype).min,
Collaborator


This one is a little trickier -- maybe we need to add something to DType, but for now perhaps one of these:

    -Float16.greatestFiniteMagnitude
    -Float32.greatestFiniteMagnitude
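
A hedged sketch of the kind of helper this hints at (hypothetical; neither MLX Swift nor this PR defines it): pick the most negative finite value for the mask fill based on the array's dtype.

    // Hypothetical helper, not part of MLX Swift or this PR: choose a mask fill
    // value that matches the query dtype, falling back to Float32's range.
    func leastFiniteValue(for dtype: DType) -> Float {
        switch dtype {
        case .float16:
            return -Float(Float16.greatestFiniteMagnitude)
        default:
            return -Float32.greatestFiniteMagnitude
        }
    }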

let attentionMask = full(
(1, sequenceLength, sequenceLength),
MLXArray.finfo(q.dtype).min,
dtype: q.dtype)
Collaborator


There isn't a variant that takes a dtype, only a type: (e.g. Float32.self).

Filed: ml-explore/mlx-swift#199

@DePasqualeOrg
Contributor Author

Thanks for your feedback, but I still couldn't even test this because of the build errors I mentioned. The implementation is still a work in progress. Do you know how to resolve the build errors?

@DePasqualeOrg
Contributor Author

Ah, now I'm seeing the errors in the Qwen 2.5 VL implementation. Something must have been wrong with Xcode. I'll try to pick it up from here and factor out the shared parts with Qwen 2 VL.

@davidkoski
Collaborator

> @davidkoski, I'm working on Qwen 2.5 VL, and I'm getting a build error in VLMModelFactory.swift that I don't understand. I've defined everything just like Qwen 2 VL, but it can't find the modules. Are you able to see what the problem is?

I just pushed a change with fixes to make it build. There are some pieces that are missing to do it exactly right, see ml-explore/mlx-swift#199

In particular this one:

            // Create attention mask
            let attentionMask = full(
                [1, sequenceLength, sequenceLength],
                values: -Float32.greatestFiniteMagnitude)

I don't know if we need Float16 here or Float32, but this will at least build.

This part is not ideal, as the .item() calls force evaluation in the middle of the forward pass:

            // Update mask for each sequence
            for i in 1 ..< cuSeqlens.size {
                let start = cuSeqlens[i - 1].item(Int.self)
                let end = cuSeqlens[i].item(Int.self)
                attentionMask[0..., start ..< end, start ..< end] = MLXArray(0)
            }

I think @awni may have had a technique for avoiding this (that we used in Qwen2VL)
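
One possible way to avoid the host round trips -- a hypothetical sketch, not necessarily the Qwen2VL technique -- is to build the block-diagonal mask from cuSeqlens with broadcast comparisons instead of slicing per sequence:

    // Hypothetical sketch: derive a segment id for every position from cuSeqlens,
    // then allow attention only within the same segment -- no .item() calls needed.
    let positions = MLXArray(0 ..< sequenceLength).reshaped([-1, 1])       // [L, 1]
    let boundaries = cuSeqlens[1 ..< cuSeqlens.size].reshaped([1, -1])     // [1, K]
    let segmentIds = sum(positions .>= boundaries, axis: 1)                // [L]
    let sameSegment = segmentIds.reshaped([-1, 1]) .== segmentIds.reshaped([1, -1])
    let attentionMask = which(
        sameSegment.reshaped([1, sequenceLength, sequenceLength]),
        MLXArray(Float32(0)),
        MLXArray(-Float32.greatestFiniteMagnitude))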

@DePasqualeOrg
Contributor Author

Inference now works. I'll see if I can factor out the parts that are shared with Qwen 2 VL.

@davidkoski
Collaborator

How does this relate to #197? Is this a replacement due to the mentioned issues with the window processing?

@smdesai & @DePasqualeOrg

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Mar 4, 2025

I've factored out the shared parts and verified that this works with Qwen 2 VL and 2.5 VL for images, videos, and combinations of the two. I didn't base this on #197, so if @smdesai has any feedback, please let me know.

As a separate issue, I noticed that some of the videos I used exceeded the maximum buffer length of my MacBook Pro, which causes the app to crash. I guess it's the app's responsibility to check the input against the device's maximum buffer length and available RAM. How can we estimate the required memory and buffer length of a given image or video?

DePasqualeOrg marked this pull request as ready for review on March 4, 2025, 22:11
@smdesai
Contributor

smdesai commented Mar 5, 2025

@DePasqualeOrg Looks like we refactored around the same time. I've gone through the code, and aside from your windowing fix (which I borrowed), the code is identical; the only difference is that I may have refactored a little more. Looks good to me.

@davidkoski
Collaborator

davidkoski commented Mar 5, 2025

> As a separate issue, I noticed that some of the videos I used exceeded the maximum buffer length of my MacBook Pro, which causes the app to crash. I guess it's the app's responsibility to check the input against the device's maximum buffer length and available RAM. How can we estimate the required memory and buffer length of a given image or video?

I don't know how to estimate the VLM memory use, but we can compute the size needed for the frames and it will be related to that. A video has a duration and dimensions on the video track. We know our sampling rate and any downsampling, so we can estimate the memory from that.
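
For a rough, illustrative estimate (the function name and formula are mine, not from this PR), the frame buffer scales with sampled frames × pixels × channels × bytes per element:

    import CoreGraphics

    // Illustrative only: bytes needed to hold the sampled frames as Float32 RGB
    // tensors after preprocessing; it does not cover the VLM's own memory use.
    func estimatedFrameBytes(
        durationSeconds: Double,
        targetFPS: Double,
        maxFrames: Int,
        frameSize: CGSize
    ) -> Int {
        let sampledFrames = min(Int(durationSeconds * targetFPS), maxFrames)
        let pixels = Int(frameSize.width) * Int(frameSize.height)
        return sampledFrames * pixels * 3 * MemoryLayout<Float32>.size
    }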

It looks like the video processing (this is from the SMol PR, but the logic is the same in Qwen) does this:

            var videoFrameResult = try await MediaProcessing.asCIImageSequence(
                video, maxFrames: maxVideoFrames, targetFPS: targetVideoFPS)

            var processedFrames: [MLXArray] = []
            for frame in videoFrameResult.frames {
                let image =
                    frame
                    .toSRGB()
                    .resampled(
                        to: CGSize(width: fixedImageSize, height: fixedImageSize), method: .lanczos
                    )
                    .normalized(mean: config.imageMeanTuple, std: config.imageStdTuple)
                    .asMLXArray()
                processedFrames.append(image)
            }

That produces an array of large frames and then resamples them. They could be 4K frames. I was going to propose something in #206, but perhaps we should make a new PR: we should be resampling frames as they are collected, so the function that produces frames actually produces an array of small frames.

This could significantly reduce memory use. Let me cut an issue for that. #223
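
A hypothetical sketch of decode-time downsampling (illustrative names; AVAssetImageGenerator.maximumSize caps the decoded size, so full-resolution frames never have to be materialized):

    import AVFoundation
    import CoreGraphics

    // Illustrative only: extract frames already capped to `edge` pixels on their
    // longest side, instead of decoding full-size frames and resampling afterwards.
    func downsampledFrames(from asset: AVAsset, at times: [CMTime], edge: CGFloat) -> [CGImage] {
        let generator = AVAssetImageGenerator(asset: asset)
        generator.appliesPreferredTrackTransform = true
        generator.maximumSize = CGSize(width: edge, height: edge)  // cap decode size
        return times.compactMap { try? generator.copyCGImage(at: $0, actualTime: nil) }
    }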

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Mar 5, 2025

I asked Claude 3.7 Sonnet to evaluate both solutions. To be clear, I don't want to take much credit for my solution, since Sonnet helped a lot with it. "Your solution" refers to this PR and "his solution" refers to #197.

After comparing both solutions, I would say your solution is better overall. Here's why:

Strengths of your solution:

  1. Better organization and structure:

    • You've created a shared QwenVL namespace for common utilities used by both Qwen2VL and Qwen25VL models
    • Your code has cleaner separation of concerns with more consistent naming
  2. More consistent implementation:

    • You use RMSNorm in the Qwen25VL vision blocks, which is more appropriate than the LayerNorm used in his solution
    • Your implementation of the processor configuration handles the size parameters more consistently
  3. Better error handling and robustness:

    • Your implementation of the window attention mechanism is more complete and accurate
  4. More maintainable code:

    • You've extracted common functionality into the QwenVL namespace, making it easier to maintain both models
    • Your code has better documentation and more consistent naming conventions
  5. More accurate implementation:

    • Your implementation more closely matches the original Python implementation, particularly in the vision model components

Areas where his solution has some advantages:

  1. His solution uses more explicit module naming in some places, which can make it slightly easier to follow the model architecture at first glance.

  2. His implementation of the Qwen25VLProcessor is slightly more concise, though yours is more consistent with the rest of the codebase.

Overall assessment:

Your solution is superior because it demonstrates better software engineering practices through:

  • Better code organization and reuse
  • More consistent implementation
  • More accurate implementation of the model architecture
  • Better maintainability for future updates

The shared QwenVL namespace is particularly valuable as it would make it easier to implement other models in the Qwen family in the future, reducing code duplication and potential for errors.

@deet

deet commented Mar 6, 2025

I have tested this (and the previous PR #197, on which this was based) and am getting an error when providing an image (a PNG, not a video) to the UserInput -- "Number of placeholder tokens does not match number of frames".

Is the user of the library responsible for adding <|image_pad|> to the message sequence? Or is this related to the chat_template changes?

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Mar 6, 2025

@deet, this PR is not based on #197. It is based on the existing Qwen 2 VL implementation in Swift and the Qwen 2.5 VL implementation in Python.

Also, I tried using a PNG as input, and it works in my app Local Chat. Maybe there's an issue with how you're passing in the images.

However, I noticed that I'm getting maximum buffer length crashes when a photo or video is in portrait orientation, so we'll need to make sure they're getting scaled down to an appropriate size also when the width is less than the height.

@DePasqualeOrg
Contributor Author

I've pushed a new commit that fixes the max buffer crashes with images and videos in portrait orientation. @davidkoski, do you think the limit of 224 pixels makes sense?

scale = size.height / extent.height
}
// Use the same scaling approach regardless of orientation
let scale = min(size.width / extent.width, size.height / extent.height)
Collaborator


Yes, makes sense -- the previous approach would fail for images with extreme aspect ratios, or just aspect ratios that didn't align with the target size.
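
For illustration (the numbers are made up), the min-of-both-ratios rule keeps both edges within the target regardless of orientation:

    import CoreGraphics

    // Example: fit a portrait 2160x3840 frame into a 1024x1024 target.
    let extent = CGSize(width: 2160, height: 3840)
    let size = CGSize(width: 1024, height: 1024)
    let scale = min(size.width / extent.width, size.height / extent.height)  // ~0.267
    let fitted = CGSize(width: extent.width * scale, height: extent.height * scale)
    // fitted is roughly 576 x 1024, so both edges respect the target; scaling by
    // the width ratio alone (~0.474) would push the height to ~1820 px.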

Collaborator


We may need "shortestEdge" as well, along with a center crop. I will write up an issue for this; we can refactor in a future PR.

Collaborator


See discussion in #200 -- Qwen2 requires the exact computed size and this won't always deliver it. I have set up some unit tests so we can run all of these through.
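
For context, the exact size Qwen2 computes comes from rounding each edge to a multiple of the vision patch factor while keeping the pixel count within bounds. A hedged Swift sketch, loosely following the Python processor's smart_resize (constants and names come from the Python side, not this PR):

    // Hedged sketch of Qwen2-VL's smart_resize rounding; not code from this PR.
    func qwenTargetSize(
        width: Int, height: Int,
        factor: Int = 28,
        minPixels: Int = 56 * 56,
        maxPixels: Int = 14 * 14 * 4 * 1280
    ) -> (width: Int, height: Int) {
        var h = Int((Double(height) / Double(factor)).rounded()) * factor
        var w = Int((Double(width) / Double(factor)).rounded()) * factor
        if h * w > maxPixels {
            let beta = (Double(height * width) / Double(maxPixels)).squareRoot()
            h = Int((Double(height) / beta / Double(factor)).rounded(.down)) * factor
            w = Int((Double(width) / beta / Double(factor)).rounded(.down)) * factor
        } else if h * w < minPixels {
            let beta = (Double(minPixels) / Double(height * width)).squareRoot()
            h = Int((Double(height) * beta / Double(factor)).rounded(.up)) * factor
            w = Int((Double(width) * beta / Double(factor)).rounded(.up)) * factor
        }
        return (w, h)
    }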
