Add Qwen 2.5 VL #222
base: main
Conversation
```swift
// Create attention mask
let attentionMask = full(
    (1, sequenceLength, sequenceLength),
```
Suggested change:

```diff
-    (1, sequenceLength, sequenceLength),
+    [1, sequenceLength, sequenceLength],
```
```swift
// Create attention mask
let attentionMask = full(
    (1, sequenceLength, sequenceLength),
    MLXArray.finfo(q.dtype).min,
```
This one is a little trickier -- maybe we need to add something to `DType`, but for now perhaps one of these:
- `-Float16.greatestFiniteMagnitude`
- `-Float32.greatestFiniteMagnitude`
```swift
let attentionMask = full(
    (1, sequenceLength, sequenceLength),
    MLXArray.finfo(q.dtype).min,
    dtype: q.dtype)
```
There isn't a variant that takes a `dtype:`, only a `type:` (e.g. `Float32.self`). Filed: ml-explore/mlx-swift#199
Thanks for your feedback, but I still couldn't even test this because of the build errors I mentioned. The implementation is still a work in progress. Do you know how to resolve the build errors?
Ah, now I'm seeing the errors in the Qwen 2.5 VL implementation. Something must have been wrong with Xcode. I'll try to pick it up from here and factor out the shared parts with Qwen 2 VL.
I just pushed a change with fixes to make it build. There are some pieces that are missing to do it exactly right, see ml-explore/mlx-swift#199. In particular this one:

```swift
// Create attention mask
let attentionMask = full(
    [1, sequenceLength, sequenceLength],
    values: -Float32.greatestFiniteMagnitude)
```

I don't know if we need Float16 here or Float32, but this will at least build. This part is not ideal as it requires evaluation in the middle of evaluation:

```swift
// Update mask for each sequence
for i in 1 ..< cuSeqlens.size {
    let start = cuSeqlens[i - 1].item(Int.self)
    let end = cuSeqlens[i].item(Int.self)
    attentionMask[0..., start ..< end, start ..< end] = MLXArray(0)
}
```

I think @awni may have had a technique for avoiding this (that we used in …).
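For what it's worth, here is one way that loop could be expressed with broadcasting so nothing is pulled back to the CPU with `.item()`. This is only a sketch added for illustration, not code from the PR, and it assumes mlx-swift's `which`, `reshaped`, and elementwise comparison operators:

```swift
import MLX

// Sketch only: build the block-diagonal attention mask without evaluating
// cuSeqlens mid-graph. Assumes cuSeqlens is a 1-D MLXArray of cumulative
// sequence lengths, e.g. [0, 4, 9, 12].
func blockDiagonalMask(cuSeqlens: MLXArray, sequenceLength: Int) -> MLXArray {
    let positions = MLXArray(0 ..< sequenceLength)
    let boundaries = cuSeqlens[1 ..< cuSeqlens.size]
    // seqIds[i] = index of the sequence that position i falls into
    let seqIds = (positions.reshaped([sequenceLength, 1]) .>= boundaries.reshaped([1, -1]))
        .sum(axis: 1)
    // same[i, j] is true when positions i and j belong to the same sequence
    let same = seqIds.reshaped([sequenceLength, 1]) .== seqIds.reshaped([1, sequenceLength])
    // 0 where attention is allowed, a large negative value everywhere else
    let mask = which(same, MLXArray(Float32(0)), MLXArray(-Float32.greatestFiniteMagnitude))
    return mask.reshaped([1, sequenceLength, sequenceLength])
}
```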
Inference now works. I'll see if I can factor out the parts that are shared with Qwen 2 VL.
How does this relate to #197? Is this a replacement due to the mentioned issues with the window processing?
I've factored out the shared parts and verified that this works with Qwen 2 VL and 2.5 VL for images, videos, and combinations of the two. I didn't base this on #197, so if @smdesai has any feedback, please let me know. As a separate issue, I noticed that some of the videos I used exceeded the maximum buffer length of my MacBook Pro, which causes the app to crash. I guess it's the app's responsibility to check the input against the device's maximum buffer length and available RAM. How can we estimate the required memory and buffer length of a given image or video?
@DePasqualeOrg Looks like we refactored around the same time. I've gone through the code, and aside from your windowing fix, which I borrowed, the code is identical; the only difference is that I may have refactored a little more. Looks good to me.
I don't know how to estimate the VLM memory use, but we can compute the size needed for the frames, and it will be related to that. A video has a duration and dimensions on the video track. We know our sampling rate and any downsampling, so we can estimate the memory from that. It looks like the video processing (this is from the SMol PR, but the logic is the same in Qwen) produces an array of large frames and then resamples them. They could be 4K frames. I was going to propose something in #206, but perhaps we should make a new PR: we should be resampling frames as they are collected, so the function that produces frames actually produces an array of small frames. This could significantly reduce memory use. Let me cut an issue for that. #223
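As a rough illustration of the estimate described above (a sketch added for illustration, not code from this PR; the sample rate and target frame size are assumed inputs):

```swift
import AVFoundation

// Rough estimate of the memory needed to hold the sampled frames of a video,
// assuming BGRA frames (4 bytes per pixel) at a known target size and sample rate.
func estimatedFrameMemory(
    asset: AVAsset, framesPerSecond: Double, targetSize: CGSize
) async throws -> Int {
    let duration = try await asset.load(.duration).seconds
    let frameCount = Int((duration * framesPerSecond).rounded(.up))
    let bytesPerFrame = Int(targetSize.width) * Int(targetSize.height) * 4
    return frameCount * bytesPerFrame
}
```

Comparing such an estimate against the device's maximum buffer length and available RAM before decoding would let the app fail gracefully instead of crashing.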
I asked Claude 3.7 Sonnet to evaluate both solutions. To be clear, I don't want to take much credit for my solution, since Sonnet helped a lot with it. "Your solution" refers to this PR and "his solution" refers to #197.
I have tested this (and the previous PR #197, on which this was based) and am getting an error when providing an image (a PNG, not video) to the … Is the user at the library level responsible for adding …?
@deet, this PR is not based on #197. It is based on the existing Qwen 2 VL implementation in Swift and the Qwen 2.5 VL implementation in Python. Also, I tried using a PNG as input, and it works in my app Local Chat. Maybe there's an issue with how you're passing in the images. However, I noticed that I'm getting maximum buffer length crashes when a photo or video is in portrait orientation, so we'll need to make sure they're scaled down to an appropriate size even when the width is less than the height.
I've pushed a new commit that fixes the max buffer crashes with images and videos in portrait orientation. @davidkoski, do you think the limit of 224 pixels makes sense? |
```diff
-    scale = size.height / extent.height
-}
+// Use the same scaling approach regardless of orientation
+let scale = min(size.width / extent.width, size.height / extent.height)
```
Yes, makes sense -- the previous would fail for images with extreme aspect ratios or just aspect ratios that didn't align with the target size.
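For reference, the uniform scale amounts to an aspect-fit resize. A minimal sketch (the function name is illustrative, not from the PR):

```swift
import CoreImage

// Aspect fit: scale uniformly so neither dimension exceeds the target size,
// which behaves the same for portrait and landscape inputs.
func aspectFitScaled(_ image: CIImage, to size: CGSize) -> CIImage {
    let extent = image.extent
    let scale = min(size.width / extent.width, size.height / extent.height)
    return image.transformed(by: CGAffineTransform(scaleX: scale, y: scale))
}
```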
We may need "shortestEdge" as well, along with a center crop. I will write up an issue for this, we can refactor in a future PR
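A hedged sketch of what a shortest-edge resize plus center crop could look like for that future refactor (the names and the CoreImage approach are assumptions, not the library's API):

```swift
import CoreImage

// Scale so the shorter side matches `shortestEdge`, then crop the center to `cropSize`.
func shortestEdgeResizeAndCenterCrop(
    _ image: CIImage, shortestEdge: CGFloat, cropSize: CGSize
) -> CIImage {
    let extent = image.extent
    let scale = shortestEdge / min(extent.width, extent.height)
    let scaled = image.transformed(by: CGAffineTransform(scaleX: scale, y: scale))
    let scaledExtent = scaled.extent
    let cropOrigin = CGPoint(
        x: scaledExtent.midX - cropSize.width / 2,
        y: scaledExtent.midY - cropSize.height / 2)
    return scaled.cropped(to: CGRect(origin: cropOrigin, size: cropSize))
}
```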
See discussion in #200 -- Qwen2 requires the exact computed size and this won't always deliver it. I have set up some unit tests so we can run all of these through.
@davidkoski, I'm working on Qwen 2.5 VL, and I'm getting a build error in VLMModelFactory.swift that I don't understand. I've defined everything just like Qwen 2 VL, but it can't find the modules. Are you able to see what the problem is?