
Reproduce results of LVBench #715

Open
zgzxy001 opened this issue Feb 3, 2025 · 7 comments
@zgzxy001

zgzxy001 commented Feb 3, 2025

Hi,

Thank you for your great work! Could you provide details on how to reproduce the results on LVBench? What is the total number of frames used for each video, and what is the max_pixels? Also, did you use the start and end times from the metadata to trim the videos? Thank you!

@sibosutd

sibosutd commented Feb 5, 2025

Hi, thank you for your interest in Qwen2.5-VL. The parameters we used during evaluation are as follows:

"min_pixels": 48  * 28 * 28
"max_pixels": 128 * 28 * 28
"min_frames": 4
"max_frames": 768
"fps": 2

@zgzxy001
Author

zgzxy001 commented Feb 5, 2025

Could you also provide the prompt you used? We currently follow MVBench's prompt format, but the model keeps generating free-form answers instead of selecting one of the options. Thank you very much!

@sibosutd

sibosutd commented Feb 5, 2025

The prompt we use isn't much different from MVBench's, so MVBench's prompt should also work.
Could you share more information, please? For example, what model size do you use for inference, and could you provide the full inference script?

The prompt is as follows,

"Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects and the action and pose of persons.\nBased on your observations, select the best option that accurately addresses the question.\nQuestion: {question}\nOptions:\n{option_string}Answer with the option\'s letter from the given choices directly and only give the best option."

@zgzxy001
Author

zgzxy001 commented Feb 5, 2025

Thank you for your quick reply. Could you provide the evaluation script for LVBench? Below is the script I used:

import json
import os
import cv2
import torch

from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)


processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

def qa_template(data):
        question = f"Question: {data['question']}\n"
        question += "Options:\n"
        answer = data['answer']
        answer_idx = -1
        for idx, c in enumerate(data['candidates']):
            question += f"({chr(ord('A') + idx)}) {c}\n"
            if c == answer:
                answer_idx = idx
        question = question.rstrip()
        answer = f"({chr(ord('A') + answer_idx)}) {answer}"
        return question, answer


def generate_single(video_path, each_data, video_start, video_end, fps):

    system = 'Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects and the action and pose of persons.\nBased on your observations, select the best option that accurately addresses the question.\n'
    question, answer = qa_template(each_data)
    question_prompt="\nAnswer with the option\'s letter from the given choices directly and only give the best option."
    input_text_prompt = system + question + question_prompt
    
   
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_path,
                    "min_pixels": 48  * 28 * 28,
                    "max_pixels": 128 * 28 * 28,
                    "min_frames": 4,
                    "max_frames": 768,
                    "fps": 2
                },
                {"type": "text", "text": input_text_prompt},
            ],
        }
    ]
    
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
        **video_kwargs,
    )
    inputs = inputs.to("cuda")

    generated_ids = model.generate(**inputs, max_new_tokens=128)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    return output_text


The model I used is Qwen2.5-VL-7B-Instruct, and the score I got is 42.03, which is about 3 points lower than the officially reported number (45.3 for Qwen2.5-VL-7B). Is the gap caused by the model difference (with/without instruction tuning), or is there some other reason? Thanks for the help!
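
For scoring, the predicted letter is pulled out of the decoded text with a small helper along these lines (a hypothetical sketch, not part of the script above; it assumes the model replies with a single uppercase option letter, possibly wrapped in parentheses):

import re

def extract_choice(output_text, num_options):
    # Hypothetical helper: find the first standalone option letter (A, B, ...),
    # optionally wrapped in parentheses, in the model output.
    valid = "".join(chr(ord("A") + i) for i in range(num_options))
    match = re.search(rf"\(?\b([{valid}])\b\)?", output_text.strip())
    return match.group(1) if match else None

# accuracy = mean over all samples of
# (extract_choice(pred, len(candidates)) == ground-truth letter)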

@sibosutd

sibosutd commented Feb 10, 2025

Our reported result (45.3) is based on using an image list as input, where we pre-decode the video at 2 fps with a maximum of 768 frames. Using the same native video input as yours, our evaluation score is 43.7.

The main difference between the two methods comes from the M-RoPE time ids. For ultra-long video understanding (e.g. hour-level), there is a slight improvement when using the image list for inference. We will clarify this in the subsequent tech report.
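
A minimal sketch of that pre-decoding step, for readers who want to try the image-list setup (illustrative code, not the official evaluation script; the exact sampling used when a video exceeds the 768-frame cap is not specified here):

import os
import cv2

def extract_frames(video_path, out_dir, fps=2, max_frames=768):
    # Illustrative sketch: decode a video at a fixed fps and save frames as JPEGs,
    # stopping once max_frames frames have been written.
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or fps
    step = max(int(round(native_fps / fps)), 1)
    frame_paths, idx = [], 0
    while len(frame_paths) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            path = os.path.join(out_dir, f"{len(frame_paths):06d}.jpg")
            cv2.imwrite(path, frame)
            frame_paths.append(path)
        idx += 1
    cap.release()
    return frame_paths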

@zgzxy001
Author

Thank you for your reply. I have a few follow-up questions:

  1. Do you use the end_time in LVBench to trim the video (i.e. each video is trimmed to [0, end_time]), or do you use the entire video without trimming?
  2. Do you use the multi-image inference API, or the video API with a list of video frame paths as inputs?
  3. Do you use vision ids (i.e. add_vision_id=True)?

Thank you!

@sibosutd

  1. No trimming is applied during evaluation.
  2. The video API is used with a list of video frame paths as inputs (see the sketch below).
  3. No, we do not add any vision ids.
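
For readers following along, the "list of frame paths as video" input looks roughly like this in qwen_vl_utils message form (paths are placeholders and the prompt is the MVBench-style prompt from earlier in the thread; this is an illustrative sketch, not the official script):

# Illustrative sketch: pre-decoded frames passed as the video field
# (frame paths are placeholders; sampling at 2 fps is done offline).
prompt = "Carefully watch the video ..."  # full prompt as shown earlier
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [f"file:///path/to/frames/{i:06d}.jpg" for i in range(768)],
                "min_pixels": 48 * 28 * 28,
                "max_pixels": 128 * 28 * 28,
            },
            {"type": "text", "text": prompt},
        ],
    }
]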
