Reproduce results of LVBench #715
Comments
Hi, thank you for your interest in Qwen2.5-VL. The parameters we used during evaluation are as follows:
Could you also provide the prompt you used? We currently follow MVBench's prompt format, but the model keeps generating free-form answers instead of selecting one of the options. Thank you very much!
The prompt we use isn't much different from MVBench's prompt, so MVBench's prompt should also work. The prompt is as follows:
Thank you for your quick reply. Could you provide the evaluation script for lvbench? Below is the script I used:
The model I used is Qwen2.5-VL-7B-Instruct, and the score I got is 42.03, which is about 3 points lower than the officially reported number (45.3 for Qwen2.5-VL-7B). Is the gap caused by a model difference (with/without instruct tuning), or are there other reasons? Thanks for the help!
Our reported result (45.3) uses an image list as input: we pre-decode the video at 2 fps with a maximum of 768 frames. With the same native video input as yours, our evaluation score is 43.7. The main difference between the two methods comes from the mrope time id. For ultra-long video understanding (e.g. hour-level), there is a slight improvement when using the image list for inference. We will clarify this in the subsequent tech report.
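For anyone trying to reproduce the image-list setting, here is a rough sketch of how the frame indices for "2 fps, at most 768 frames" could be chosen. The function name `sample_frame_indices` is hypothetical, and the uniform subsampling used once a video exceeds the cap is an assumption; the thread does not say how frames are selected in that case, so this is not necessarily what the maintainers do.

```python
def sample_frame_indices(duration_s, native_fps, target_fps=2.0, max_frames=768):
    """Pick native-frame indices approximating `target_fps`, capped at `max_frames`.

    Assumption: when the cap is hit, frames are spread uniformly over the
    whole video (the thread does not specify the selection rule).
    """
    total_native = int(duration_s * native_fps)
    if total_native <= 0:
        return []
    # Number of frames wanted at the target rate, limited by the cap
    # and by how many frames the video actually has.
    wanted = max(1, min(int(duration_s * target_fps), max_frames, total_native))
    step = total_native / wanted  # spacing between kept frames, in native frames
    return [min(int(i * step), total_native - 1) for i in range(wanted)]


# A 1-minute clip stays at 2 fps; an hour-long clip is capped at 768 frames.
short_clip = sample_frame_indices(duration_s=60, native_fps=30)
hour_clip = sample_frame_indices(duration_s=3600, native_fps=25)
print(len(short_clip), len(hour_clip))
```

The selected frames would then be decoded (with whatever video reader you already use) and passed to the model as a list of images rather than as a single video input.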
Thank you for your reply. I have follow-up questions:
Thank you!
Hi,
Thank you for your great work! Could you provide details for reproducing the results on LVBench? What is the total number of frames used for each video, and what is the max_pixels? Also, did you use the start and end times from the metadata to trim the video? Thank you!