
adding support for Qwen2.5-VL #197 (Open)
smdesai wants to merge 7 commits into main
Conversation

@smdesai (Contributor Author) commented Feb 6, 2025

No description provided.

Review comment from @davidkoski (Collaborator) on:

    public class Qwen25VLProcessor: UserInputProcessor {
Does this end up being the same as the Qwen2VL Processor? If not I wonder if we would want to factor these so that they share code?

See also #173 -- it would be nice to have that work apply to both.

@smdesai (Contributor Author) replied:

@davidkoski Yes, it's the same UserInputProcessor. I'll refactor it and then take a look at #173.
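(For context, a minimal sketch of the factoring under discussion, with illustrative names; the `prepare(input:)` signature is assumed from the MLXLMCommon protocol, and the body is deliberately stubbed, so treat this as an outline rather than the PR's actual code:)

```swift
import MLXLMCommon
import Tokenizers

// Sketch only: a single processor type that both Qwen2VL and Qwen25VL could
// construct, so the image preprocessing and prompt-token logic lives in one
// place instead of being duplicated per model.
public class QwenVLProcessor: UserInputProcessor {
    private let tokenizer: any Tokenizer

    public init(tokenizer: any Tokenizer) {
        self.tokenizer = tokenizer
    }

    public func prepare(input: UserInput) async throws -> LMInput {
        // Shared steps would go here: resize images, compute patch grids,
        // insert vision tokens, and tokenize the chat-templated prompt.
        fatalError("sketch only -- see the PR diff for the real implementation")
    }
}
```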

@davidkoski (Collaborator) commented:

@DePasqualeOrg Given that this refactoring of the processor overlaps with your work on the new template in #173, what do you think about how to proceed? The two models share the same processor, and I think it will be good to have it in only one place. If you are going to land your changes soon, perhaps we should merge that first and then do this refactor (I think that will just be copying code around).

Or would it be easier to put your work on top of the processor broken out like this?

@smdesai any thoughts from your side?

@DePasqualeOrg (Contributor) commented:

I think it would be best to merge #173 first.

@smdesai (Contributor Author) commented Feb 8, 2025

@davidkoski I don't have a preference; as @DePasqualeOrg mentions above, it's probably best to merge #173 first, and then I can make the appropriate changes here.

@DePasqualeOrg (Contributor) commented:

#173 is ready for review. Thanks @smdesai for your work on supporting Qwen 2.5 VL. I'm looking forward to trying it out!

@smdesai (Contributor Author) commented Feb 11, 2025

@DePasqualeOrg Thank you for your work. I applied your changes together with changes on my side and noticed that the chat template in Qwen2.5 VL is missing the image_pad, vision_start, and vision_end tokens; it looks like a normal template with function calling. To get it to work, I used the Qwen2 chat template as a string and passed that to applyChatTemplate, as shown here:

var promptTokens = try tokenizer.applyChatTemplate(messages: messages, chatTemplate: chatTemplate)

in the call to prepare.
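(A sketch of that workaround; the template text is abbreviated and the message shape is illustrative, so treat this as an outline of the approach rather than the PR's actual code:)

```swift
// Workaround sketch: pass the Qwen2-VL template (which emits the vision
// tokens) instead of the template shipped with the Qwen2.5-VL model. The
// template string is abbreviated here; see Qwen2-VL's tokenizer_config.json
// for the full text.
let chatTemplate = """
    {%- for message in messages -%}
    <|im_start|>{{ message.role }}
    {# the Qwen2-VL template emits <|vision_start|><|image_pad|><|vision_end|> for image content here #}
    {{ message.content }}<|im_end|>
    {%- endfor -%}
    """
let messages: [[String: Any]] = [["role": "user", "content": "Describe this image."]]
let promptTokens = try tokenizer.applyChatTemplate(messages: messages, chatTemplate: chatTemplate)
```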

@DePasqualeOrg (Contributor) commented:

Wow. This is not the first time I've been surprised by the chat templates that ship with models. I've opened a discussion on Hugging Face about replacing the current chat template.

@DePasqualeOrg (Contributor) commented:

@smdesai, I made some changes in the image processing before my PR got merged. It's too bad that several people were working on this at the same time, but hopefully we can sort out these last few conflicts.

@smdesai (Contributor Author) commented Feb 14, 2025

@DePasqualeOrg I've combined the merged changes into the refactored processor here; the only difference is that Qwen2.5-VL applies a hard-coded chat template rather than the one from the config. Once the model config is fixed, that change will be removed.

Right now, I'm going through the code to ensure the port is accurate, as I'm seeing different results on laptop and device. Once I've gone through the code, I'll push the change that resolves the conflicts.

Commit: merged with changes from ml-explore#173 and added chat template for Qwen2.5VL as the one in the config does not support image/video
@smdesai (Contributor Author) commented Feb 17, 2025

@davidkoski I believe this is ready for review.

@DePasqualeOrg (Contributor) commented:

@JustinLin610, as you can see in the discussion here and on Hugging Face, the chat template for the Qwen 2.5 VL models doesn't include the vision logic. The fix should be very simple: Just replace the chat template in tokenizer_config.json with the one used by Qwen 2 VL models in the repos on Hugging Face.

@DePasqualeOrg (Contributor) commented:

@Blaizzy, do you want to fix the chat templates in the Qwen 2.5 VL models on Hugging Face by replacing them with the templates used in Qwen 2 VL? See the discussion above for more context. If the Qwen team doesn't want to do it, we can at least fix them for MLX users.

@smdesai (Contributor Author) commented Feb 25, 2025

@davidkoski @DePasqualeOrg Thanks for pushing to get the correct template added. As for the final review, I may have been somewhat hasty: the windowing function isn't altogether correct as it stands. It isn't representative of the Python MLX code, and I'm not happy with some of the results, which differ from the Python MLX output. I'm working to achieve the same results as Python MLX, or as close to them as possible.
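(For readers following along, a rough sketch of the structural difference in question, with hypothetical names: per the Hugging Face config, Qwen2.5-VL's vision tower uses windowed attention in most blocks and full attention only at a few configured indexes, fullatt_block_indexes, which default to [7, 15, 23, 31]:)

```swift
import MLX

// Sketch with hypothetical names: windowed attention in most vision blocks,
// full attention only at the configured indexes.
func applyVisionBlocks(
    _ hiddenStates: MLXArray,
    blocks: [(MLXArray, MLXArray?) -> MLXArray],
    windowMask: MLXArray,
    fullAttentionBlockIndexes: Set<Int> = [7, 15, 23, 31]  // HF config default (assumed)
) -> MLXArray {
    var h = hiddenStates
    for (index, block) in blocks.enumerated() {
        // nil mask -> full attention; windowMask -> attention within local windows
        h = block(h, fullAttentionBlockIndexes.contains(index) ? nil : windowMask)
    }
    return h
}
```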

@DePasqualeOrg (Contributor) commented Feb 25, 2025

@smdesai, I thought I solved that already in my solution for Qwen 2 VL. I haven't looked at the Python implementation, but if you change the approach, please make sure that it supports a mix of images and videos in any order, anywhere in a multi-turn chat, which my solution does.
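(A minimal sketch of the property being described, with hypothetical types; the vision tokens match the ones named earlier in this thread:)

```swift
// Sketch: walk each message's content items in order, so images and videos can
// appear anywhere in a multi-turn chat, emitting the matching vision tokens.
enum ContentItem {
    case text(String)
    case image
    case video
}

func promptFragment(for content: [ContentItem]) -> String {
    content.map { item in
        switch item {
        case .text(let s): return s
        case .image: return "<|vision_start|><|image_pad|><|vision_end|>"
        case .video: return "<|vision_start|><|video_pad|><|vision_end|>"
        }
    }.joined()
}
```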

@smdesai (Contributor Author) commented Feb 25, 2025

@DePasqualeOrg Thanks for pointing it out. I'll take a closer look at the Qwen 2 VL implementation and compare.

@Blaizzy commented Feb 25, 2025

> @Blaizzy, do you want to fix the chat templates in the Qwen 2.5 VL models on Hugging Face by replacing them with the templates used in Qwen 2 VL? See the discussion above for more context. If the Qwen team doesn't want to do it, we can at least fix them for MLX users.

Fixed all models hosted in the MLX community :)

@Blaizzy commented Feb 25, 2025

There are zero differences between the Qwen2-VL and Qwen2.5-VL instruct chat templates.

However, there is a deeper issue with the preprocessor that I just fixed.

[screenshot of the preprocessor fix attached]

@DePasqualeOrg (Contributor) commented:

> There are zero differences between the Qwen2-VL and Qwen2.5-VL instruct chat templates.

@Blaizzy, you can see the differences here:

https://huggingface.co/mlx-community/Qwen2-VL-7B-Instruct-4bit/blob/main/tokenizer_config.json

https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-4bit/blob/main/tokenizer_config.json

@smdesai (Contributor Author) commented Feb 25, 2025

@Blaizzy It looks like you're comparing chat_template.json, which is present for both models. Yes, those are identical; what @DePasqualeOrg points to is the chat template in tokenizer_config.json, which is different and is what's actually used when the template is applied.

@Blaizzy commented Feb 25, 2025

> @Blaizzy It looks like you're comparing chat_template.json, which is present for both models. Yes, those are identical; what @DePasqualeOrg points to is the chat template in tokenizer_config.json, which is different and is what's actually used when the template is applied.

Yes, transformers now uses the chat_template.json and not the one in tokenizer_config.json.

It happened late last year.

@DePasqualeOrg (Contributor) commented:

> Yes, transformers now uses the chat_template.json and not the one in tokenizer_config.json.
>
> It happened late last year.

@pcuenca, I guess we should use the template from chat_template.json, if present, in swift-transformers, and fall back to the one in tokenizer_config.json if not?

@Blaizzy commented Feb 25, 2025

According to them, tokenizer_config.json shouldn't contain a chat_template for VLMs.

I know it's complicated.

@Blaizzy commented Feb 25, 2025

We should use the processor's chat template and not the tokenizer's; that's why there is a chat_template.json file now. See huggingface/transformers#31691.

@Blaizzy commented Feb 25, 2025

> @pcuenca, I guess we should use the template from chat_template.json, if present, in swift-transformers, and fall back to the one in tokenizer_config.json if not?

Yep, we could use some improvements to simplify the API here.

Perhaps chat_template.json can be the source of truth for both the processor (if present) and the tokenizer.
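(A minimal sketch of the fallback order being proposed, as a hypothetical helper rather than swift-transformers' actual API:)

```swift
import Foundation

// Hypothetical helper: prefer the processor's chat_template.json, falling back
// to the "chat_template" key in tokenizer_config.json when it's absent.
func resolveChatTemplate(in modelDirectory: URL) throws -> String? {
    func template(fromJSONAt url: URL) throws -> String? {
        guard FileManager.default.fileExists(atPath: url.path) else { return nil }
        let data = try Data(contentsOf: url)
        let json = try JSONSerialization.jsonObject(with: data) as? [String: Any]
        return json?["chat_template"] as? String
    }
    if let processorTemplate = try template(
        fromJSONAt: modelDirectory.appendingPathComponent("chat_template.json")) {
        return processorTemplate
    }
    return try template(
        fromJSONAt: modelDirectory.appendingPathComponent("tokenizer_config.json"))
}
```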

@DePasqualeOrg (Contributor) commented:

I submitted a PR in swift-transformers to use the correct chat template: huggingface/swift-transformers#184

@DePasqualeOrg (Contributor) commented:

@smdesai, I'm curious to know: What's the difficulty in making my solution work for Qwen 2.5 VL? If Qwen 2 uses the same processing, can't you just factor out my solution and use it for both models? I'm tempted to try it myself but wanted to check with you first.

@smdesai (Contributor Author) commented Mar 4, 2025

@DePasqualeOrg First of all, thank you for the PR to swift-transformers for the chat template. I've tested it by modifying Package.swift to point at the new version, and I've removed the temporary change I made to the Qwen25VLProcessor.

The windowing function is different in Qwen25VL, at least in the Python source this port is based on. I'll factor your solution into the VisionModel and see how it performs; all going well, I'll push an update to the PR.

I've also run into the reshape issue; it has to do with small images (588×1291) and how they're resized.
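(For reference, a sketch of the "smart resize" step where this kind of sizing issue tends to originate, loosely following the Python reference; the constants are the Hugging Face defaults and are assumptions here:)

```swift
import Foundation

// Sketch, loosely following the Python reference: round each dimension to a
// multiple of factor = patchSize * mergeSize = 28, then rescale so the total
// pixel count lands inside [minPixels, maxPixels]. A rounding slip here is one
// way an image like 588x1291 can later fail a reshape.
func smartResize(
    height: Int, width: Int,
    factor: Int = 28,
    minPixels: Int = 56 * 56,
    maxPixels: Int = 14 * 14 * 4 * 1280
) -> (height: Int, width: Int) {
    var h = Int((Double(height) / Double(factor)).rounded()) * factor
    var w = Int((Double(width) / Double(factor)).rounded()) * factor
    if h * w > maxPixels {
        let scale = (Double(height * width) / Double(maxPixels)).squareRoot()
        h = Int(floor(Double(height) / scale / Double(factor))) * factor
        w = Int(floor(Double(width) / scale / Double(factor))) * factor
    } else if h * w < minPixels {
        let scale = (Double(minPixels) / Double(height * width)).squareRoot()
        h = Int(ceil(Double(height) * scale / Double(factor))) * factor
        w = Int(ceil(Double(width) * scale / Double(factor))) * factor
    }
    return (h, w)
}
```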

@davidkoski mentioned this pull request on Mar 4, 2025
@smdesai (Contributor Author) commented Mar 5, 2025

Here's the refactored version; it's identical to #222 by @DePasqualeOrg, the only difference being the refactoring of additional code. You can close this out and merge #222 if that looks OK.
