adding support for Qwen2.5-VL #197
base: main
Conversation
public class Qwen25VLProcessor: UserInputProcessor {
Does this end up being the same as the Qwen2VL processor? If not, I wonder if we would want to factor these so that they share code?
See also #173 -- it would be nice to have that work apply to both.
@davidkoski Yes it's the same UserInputProcessor. I'll refactor it and then take a look at #173
@DePasqualeOrg given this refactoring of the processor is in the same place as your work on using the new template in #173, what do you think about how to proceed? The two models share the same processor, and I think it will be good to have it in only one place. If you are going to land your changes soon, perhaps we should merge that and then do this refactor (I think that will just be copying code around). Or would it be easier to put your work on top of the processor broken out like this? @smdesai any thoughts from your side?
I think it would be best to merge #173 first.
@davidkoski I don't have a preference, and as @DePasqualeOrg mentions above, perhaps it's best to merge #173 first; then I can make the appropriate changes here.
@DePasqualeOrg Thank you for your work. I applied your changes together with changes on my side and noticed that the chat template in Qwen2.5 VL is missing the image_pad, vision_start, and vision_end tokens. The template looks like a normal template with function calling. To get it to work, I used the Qwen2 chat template as a string and passed that to `applyChatTemplate` in the call to `prepare`, as shown here: `var promptTokens = try tokenizer.applyChatTemplate(messages: messages, chatTemplate: chatTemplate)`.
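To illustrate what the missing tokens imply, here is a minimal Python sketch of how a Qwen2-VL-style template wraps each image in a vision-token span — exactly the span the Qwen2.5 VL config template omits. The function name and the pad-count parameter are hypothetical; the special-token strings follow the Qwen2-VL convention.

```python
# Sketch (an assumption, not the actual template logic): a Qwen2-VL-style
# chat template expands each image placeholder into a span of vision tokens.
VISION_START = "<|vision_start|>"
VISION_END = "<|vision_end|>"
IMAGE_PAD = "<|image_pad|>"

def render_user_turn(text: str, num_image_pads: int = 0) -> str:
    """Build the prompt text for one user turn.

    num_image_pads is the number of image placeholder tokens the
    processor computed from the image's patch grid (hypothetical
    parameter name).
    """
    image_span = ""
    if num_image_pads > 0:
        image_span = VISION_START + IMAGE_PAD * num_image_pads + VISION_END
    return f"<|im_start|>user\n{image_span}{text}<|im_end|>\n"
```

A template without this expansion produces a prompt with no slots for the image embeddings, which is why the hard-coded Qwen2 template was needed as a workaround.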
Wow. This is not the first time I've been surprised by the chat templates that ship with models. I've opened a discussion on Hugging Face about replacing the current chat template.
@smdesai, I made some changes in the image processing before my PR got merged. It's too bad that several people were working on this at the same time, but hopefully we can sort out these last few conflicts.
@DePasqualeOrg I've combined the merged changes into the refactored processor here, the only change being the chat template for Qwen2.5-VL, which uses a hard-coded template vs. one from the config. Once that model config is fixed, the change will be removed. Right now, I'm going through the code to ensure the port is accurate, as I'm seeing different results on laptop and device. Once I've gone through the code, I'll push the change which resolves the conflicts.
Merged with changes from ml-explore#173 and added chat template for Qwen2.5VL, as the one in the config does not support image/video
@davidkoski I believe this is ready for review.
@JustinLin610, as you can see in the discussion here and on Hugging Face, the chat template for the Qwen 2.5 VL models doesn't include the vision logic. The fix should be very simple: just replace the chat template in
@Blaizzy, do you want to fix the chat templates in the Qwen 2.5 VL models on Hugging Face by replacing them with the templates used in Qwen 2 VL? See the discussion above for more context. If the Qwen team doesn't want to do it, we can at least fix them for MLX users.
@davidkoski @DePasqualeOrg Thanks for pushing to get the correct template added. As for the final review, I may have been somewhat hasty on that. The windowing function isn't altogether correct as it stands: it's not representative of the Python MLX code, and I'm not altogether happy with some of the results, as they differ from the Python MLX output. I'm working to achieve the same results as Python MLX, or as close to them as possible.
@smdesai, I thought I solved that already in my solution for Qwen 2 VL. I haven't looked at the Python implementation, but if you change the approach, please make sure that it supports a mix of images and videos in any order, anywhere in a multi-turn chat, which my solution does.
@DePasqualeOrg Thanks for pointing it out. I'll take a closer look at the Qwen 2 VL implementation and compare.
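For readers unfamiliar with the windowing being debated above: the vision encoder groups the patch grid into fixed-size windows and pads ragged edges before running attention per window. The sketch below is a generic window partition over flattened patch indices — the window size, padding value, and index ordering are illustrative assumptions, not the actual port.

```python
import math

def window_partition(grid_h: int, grid_w: int, window: int = 8):
    """Group a grid_h x grid_w patch grid into window x window blocks.

    Patches are identified by their flattened index i * grid_w + j;
    ragged edges are padded with -1 so every block has window**2 slots.
    Illustrative sketch only -- the real model's window size and
    ordering may differ.
    """
    pad_h = math.ceil(grid_h / window) * window
    pad_w = math.ceil(grid_w / window) * window
    blocks = []
    for bh in range(0, pad_h, window):
        for bw in range(0, pad_w, window):
            block = []
            for i in range(bh, bh + window):
                for j in range(bw, bw + window):
                    if i < grid_h and j < grid_w:
                        block.append(i * grid_w + j)
                    else:
                        block.append(-1)  # padding slot
            blocks.append(block)
    return blocks
```

Getting the index order and padding to match the reference exactly is where ports like this tend to diverge, which fits the symptom described above of results differing from the Python MLX code.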
Fixed all models hosted in the MLX community :) |
@Blaizzy, you can see the differences here: https://huggingface.co/mlx-community/Qwen2-VL-7B-Instruct-4bit/blob/main/tokenizer_config.json https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-4bit/blob/main/tokenizer_config.json
@Blaizzy It looks like you're comparing the `chat_template.json` that's present for both models. Yes, these are identical; what @DePasqualeOrg points to is the chat template in `config.json`, which is different and is what's being used when the template is applied.
Yes, transformers now uses the chat template from `chat_template.json`. It happened late last year.
@pcuenca, I guess we should use the template from
According to them I know it's complicated.
We should use the processor's chat template and not the tokenizer one.
Yep, we could use some improvements to simplify the API here. Perhaps
I submitted a PR in swift-transformers to use the correct chat template: huggingface/swift-transformers#184
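The preference argued for above — the processor's chat template over the tokenizer's — amounts to a lookup order over the files shipped with a model. A minimal sketch, assuming the Hugging Face file-name convention mentioned in this thread (the function itself is hypothetical, not library code):

```python
import json
import os

def resolve_chat_template(model_dir: str):
    """Return the chat template string for a local model directory,
    preferring the processor's chat_template.json over the template
    embedded in tokenizer_config.json. Sketch of the lookup order
    discussed above, not an actual library API."""
    proc_path = os.path.join(model_dir, "chat_template.json")
    if os.path.exists(proc_path):
        with open(proc_path) as f:
            return json.load(f)["chat_template"]
    tok_path = os.path.join(model_dir, "tokenizer_config.json")
    if os.path.exists(tok_path):
        with open(tok_path) as f:
            return json.load(f).get("chat_template")
    return None
```

Making the loader prefer the processor file is what avoids the broken tokenizer-side template for Qwen 2.5 VL.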
Pull from latest
@smdesai, I'm curious to know: What's the difficulty in making my solution work for Qwen 2.5 VL? If Qwen 2 uses the same processing, can't you just factor out my solution and use it for both models? I'm tempted to try it myself but wanted to check with you first.
@DePasqualeOrg First of all, thank you for the PR to swift-transformers for the chat template. I've tested by modifying Package.swift with the new version and have removed the temporary change I made to the Qwen25VLProcessor. The windowing function is different in Qwen25VL, at least from the Python source on which this port is based. I'll factor your solution into the VisionModel and see how that performs; all going well, I'll push an update to the PR. I've also run into the reshape issue, and it has to do with small images (588x1291) and how they're sized.
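On how images like the 588x1291 one get sized: the Python processor this port follows rounds image dimensions to multiples of the patch-merge factor while clamping the total pixel count. A sketch of that logic, modeled on the Qwen2-VL reference — the factor and the default pixel bounds here are assumptions, not values taken from this PR:

```python
import math

def smart_resize(height: int, width: int, factor: int = 28,
                 min_pixels: int = 56 * 56,
                 max_pixels: int = 14 * 14 * 4 * 1280):
    """Round (height, width) to multiples of `factor`, keeping the total
    pixel count within [min_pixels, max_pixels]. Sketch of the resizing
    in the Python Qwen2-VL processor; the defaults are assumptions."""
    # Round each side to the nearest multiple of factor (never zero).
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too many pixels: scale both sides down by the same ratio.
        beta = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / beta / factor) * factor
        w = math.floor(width / beta / factor) * factor
    elif h * w < min_pixels:
        # Too few pixels: scale both sides up by the same ratio.
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w
```

Under these assumed bounds, a 588x1291 image rounds to a grid-aligned size rather than being scaled, so any reshape failure would come from a mismatch between this rounding and the patch arithmetic downstream.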
provide public accessors for ModelConfigurations from registries (ml-explore#219)
Here's the refactored version; it's identical to #222 by @DePasqualeOrg, the only difference being refactoring of additional code. You can close this out and merge #222 if that looks OK.