
Reproduce results of LVBench #715

Open
zgzxy001 opened this issue Feb 3, 2025 · 7 comments
@zgzxy001

zgzxy001 commented Feb 3, 2025

Hi,

Thank you for your great work! Could you provide details on how to reproduce the results on LVBench? What is the total number of frames used for each video, and what is the max_pixels? Also, did you use the start and end times from the metadata to trim the videos? Thank you!

@sibosutd

sibosutd commented Feb 5, 2025

Hi, thank you for your interest in Qwen2.5-VL. The parameters we used during evaluation are as follows:

"min_pixels": 48  * 28 * 28
"max_pixels": 128 * 28 * 28
"min_frames": 4
"max_frames": 768
"fps": 2

@zgzxy001
Author

zgzxy001 commented Feb 5, 2025

Could you also provide the prompt you used? We currently follow MVBench's prompt format, but the model keeps generating free-form answers instead of selecting one of the options. Thank you very much!

@sibosutd

sibosutd commented Feb 5, 2025

The prompt we use isn't much different from MVBench's, so MVBench's prompt should also work.
Could you share more information, please? For example, what model size do you use for inference, and could you provide the full inference script?

The prompt is as follows,

"Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects and the action and pose of persons.\nBased on your observations, select the best option that accurately addresses the question.\nQuestion: {question}\nOptions:\n{option_string}Answer with the option\'s letter from the given choices directly and only give the best option."

@zgzxy001
Author

zgzxy001 commented Feb 5, 2025

Thank you for your quick reply. Could you provide the evaluation script for LVBench? Below is the script I used:

import json
import os
import cv2
import torch

from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)


processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

def qa_template(data):
        question = f"Question: {data['question']}\n"
        question += "Options:\n"
        answer = data['answer']
        answer_idx = -1
        for idx, c in enumerate(data['candidates']):
            question += f"({chr(ord('A') + idx)}) {c}\n"
            if c == answer:
                answer_idx = idx
        question = question.rstrip()
        answer = f"({chr(ord('A') + answer_idx)}) {answer}"
        return question, answer


def generate_single(video_path, each_data, video_start, video_end, fps):

    system = 'Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects and the action and pose of persons.\nBased on your observations, select the best option that accurately addresses the question.\n'
    question, answer = qa_template(each_data)
    question_prompt="\nAnswer with the option\'s letter from the given choices directly and only give the best option."
    input_text_prompt = system + question + question_prompt
    
   
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_path,
                    "min_pixels": 48  * 28 * 28,
                    "max_pixels": 128 * 28 * 28,
                    "min_frames": 4,
                    "max_frames": 768,
                    "fps": 2
                },
                {"type": "text", "text": input_text_prompt},
            ],
        }
    ]
    
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
        **video_kwargs,
    )
    inputs = inputs.to("cuda")

    generated_ids = model.generate(**inputs, max_new_tokens=128)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    return output_text


The model I used is Qwen2.5-VL-7B-Instruct, and the score I got is 42.03, which is about 3 points lower than the officially reported number (45.3 for Qwen2.5-VL-7B). Is the gap caused by the model difference (with/without instruction tuning), or is there some other reason? Thanks for the help!
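
For scoring, the predicted letter is pulled out of the decoded text with a small helper along these lines (a hypothetical sketch, not part of the script above; it assumes the model replies with a single uppercase option letter, possibly wrapped in parentheses):

import re

def extract_choice(output_text, num_options):
    # Hypothetical helper: find the first standalone option letter (A, B, ...),
    # optionally wrapped in parentheses, in the model output.
    valid = "".join(chr(ord("A") + i) for i in range(num_options))
    match = re.search(rf"\(?\b([{valid}])\b\)?", output_text.strip())
    return match.group(1) if match else None

# accuracy = mean over all samples of
# (extract_choice(pred, len(candidates)) == ground-truth letter)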

@sibosutd

sibosutd commented Feb 10, 2025

Our reported result (45.3) is based on using an image list as input, where we pre-decode the video at 2 fps with a maximum of 768 frames. Using the same native video input as yours, our evaluation score is 43.7.

The main difference between the two methods comes from the M-RoPE time ids. For ultra-long video understanding (e.g. hour-level), there is a slight improvement when using the image list for inference. We will clarify this in the subsequent tech report.
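
A minimal sketch of that pre-decoding step, for readers who want to try the image-list setup (illustrative code, not the official evaluation script; the exact sampling used when a video exceeds the 768-frame cap is not specified here):

import os
import cv2

def extract_frames(video_path, out_dir, fps=2, max_frames=768):
    # Illustrative sketch: decode a video at a fixed fps and save frames as JPEGs,
    # stopping once max_frames frames have been written.
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or fps
    step = max(int(round(native_fps / fps)), 1)
    frame_paths, idx = [], 0
    while len(frame_paths) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            path = os.path.join(out_dir, f"{len(frame_paths):06d}.jpg")
            cv2.imwrite(path, frame)
            frame_paths.append(path)
        idx += 1
    cap.release()
    return frame_paths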

@zgzxy001
Author

Thank you for your reply. I have a few follow-up questions:

  1. Do you use the end_time in LVBench to trim the video (i.e. each video is trimmed to [0, end_time]), or do you use the entire video without trimming?
  2. Do you use the multi-image inference API, or the video API with a list of video frame paths as inputs?
  3. Do you use vision ids (i.e. add_vision_id=True)?

Thank you!

@sibosutd

  1. No trimming is applied during evaluation.
  2. The video API is used with a list of video frame paths as inputs (see the sketch below).
  3. No, we do not add any vision ids.
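
For readers following along, the "list of frame paths as video" input looks roughly like this in qwen_vl_utils message form (paths are placeholders and the prompt is the MVBench-style prompt from earlier in the thread; this is an illustrative sketch, not the official script):

# Illustrative sketch: pre-decoded frames passed as the video field
# (frame paths are placeholders; sampling at 2 fps is done offline).
prompt = "Carefully watch the video ..."  # full prompt as shown earlier
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [f"file:///path/to/frames/{i:06d}.jpg" for i in range(768)],
                "min_pixels": 48 * 28 * 28,
                "max_pixels": 128 * 28 * 28,
            },
            {"type": "text", "text": prompt},
        ],
    }
]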
