
Monkeypatch for Qwen2.5-VL #552

Open · wants to merge 6 commits into main
Conversation

@BenasdTW (Contributor) commented Jan 31, 2025

Summary

Qwen2.5-VL was released a few days ago; this PR adds support for it.
Its architecture differs slightly from Qwen2-VL's.
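For context, support of this kind is applied by monkeypatching the relevant transformers module. Below is a minimal sketch of the approach, modeled on Liger Kernel's existing apply_liger_kernel_to_qwen2_vl patch; the qwen2_5_vl module path and the exact Liger symbol names here are assumptions, not the final API of this PR:

```python
# Sketch only: modeled on Liger Kernel's existing Qwen2-VL patch.
# The qwen2_5_vl module path and exact symbol names are assumptions.
from liger_kernel.transformers.qwen2vl_mrope import liger_multimodal_rotary_pos_emb
from liger_kernel.transformers.rms_norm import LigerRMSNorm


def apply_liger_kernel_to_qwen2_5_vl(rope: bool = True, rms_norm: bool = True) -> None:
    # Imported lazily so transformers is only needed when patching.
    from transformers.models.qwen2_5_vl import modeling_qwen2_5_vl

    if rope:
        # Qwen2.5-VL reuses Qwen2-VL's multimodal (sectioned) RoPE.
        modeling_qwen2_5_vl.apply_multimodal_rotary_pos_emb = liger_multimodal_rotary_pos_emb
    if rms_norm:
        modeling_qwen2_5_vl.Qwen2RMSNorm = LigerRMSNorm
```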

Details

As of February 1, the latest git version of transformers (62db3e6) supports it, but the support is not yet in a stable transformers release.

I need help with the tests, as I got OOMs on my laptop GPU. I have tried LoRA fine-tuning with this PR, and it works well.

Testing Done

I can't run the full test suite on my hardware; I got OOMs. A narrowed single-model invocation is sketched below the checklist.

  • Hardware Type: RTX 4070 laptop
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence
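For just the Qwen2.5-VL multimodal convergence cases, a convenience sketch (equivalent to invoking pytest with a -k filter on the test file shown in the logs further down; not one of the Makefile targets):

```python
# Run only the Qwen2.5-VL multimodal convergence cases.
import pytest

pytest.main(["test/convergence/test_mini_models_multimodal.py", "-k", "qwen2_5_vl", "-v"])
```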

@tyler-romero (Collaborator)

I made some fixes to the multimodal convergence test. However, the float32 convergence test is failing:

test_mini_model_multimodal[mini_qwen2_5_vl-32-0.0001-dtype2-1e-08-1e-05-0.005-1e-05-0.005-1e-05]

        # Raise AssertionError with detailed information if there are mismatches
        if not all_close and num_mismatched >= 1:
            mismatch_details = [f"Number of mismatched elements: {num_mismatched}"]
            print_count = min(max_print, num_mismatched)
            for index in mismatched_indices[:print_count]:
                i = tuple(index.tolist())
                mismatch_details.append(f"Mismatch at index {i}: tensor1[{i}] = {tensor1[i]}, tensor2[{i}] = {tensor2[i]}")
            if num_mismatched > max_print:
                mismatch_details.append(f"... and {num_mismatched - max_print} more mismatched elements.")
    
>           raise AssertionError("\n".join(mismatch_details))
E           AssertionError: Number of mismatched elements: 2
E           Mismatch at index (0, 24): tensor1[(0, 24)] = 0.09020370990037918, tensor2[(0, 24)] = 0.0902027115225792
E           Mismatch at index (0, 27): tensor1[(0, 27)] = 0.08561328053474426, tensor2[(0, 27)] = 0.08561427891254425

These are small differences, but I wonder if they show up in this test, rather than in the other convergence tests, because the multimodal RoPE implementation is slightly different. Could you look into it?
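For concreteness, plugging the first reported mismatch into the test's float32 tolerances (atol=1e-8, rtol=1e-5, per the test id above) shows how marginal the failure is; a quick sketch:

```python
import torch

# Values copied from the first reported mismatch at index (0, 24).
a = torch.tensor(0.09020370990037918)
b = torch.tensor(0.0902027115225792)

diff = (a - b).abs().item()           # ~1.0e-6
bound = 1e-8 + 1e-5 * b.abs().item()  # atol + rtol * |b|, ~9.1e-7
print(diff, bound, torch.isclose(a, b, rtol=1e-5, atol=1e-8).item())
# diff just exceeds the bound, so isclose reports False: a marginal failure.
```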

Thanks - other than this, the PR looks close.

@tyler-romero self-requested a review on February 8, 2025.
@BenasdTW (Contributor, Author) commented Feb 9, 2025

@tyler-romero Thanks for reviewing my PR and helping out with testing.

> I made some fixes to the multimodal convergence test. However, the float32 convergence test is failing:
> These are small differences, but I wonder if they show up in this test instead of the other convergence tests because the multimodal RoPE implementation is slightly different. Could you look into it?

Does it only fail for Qwen2.5-VL? I have checked: the transformers library uses the same implementation for both Qwen2-VL and Qwen2.5-VL.
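One way to check this mechanically is to diff the two functions' sources, assuming the mrope helper keeps the name apply_multimodal_rotary_pos_emb in both modules (a sketch; it requires a transformers build that includes Qwen2.5-VL):

```python
import difflib
import inspect

from transformers.models.qwen2_5_vl import modeling_qwen2_5_vl
from transformers.models.qwen2_vl import modeling_qwen2_vl

# Empty output means the two mrope implementations are textually identical.
old = inspect.getsource(modeling_qwen2_vl.apply_multimodal_rotary_pos_emb).splitlines()
new = inspect.getsource(modeling_qwen2_5_vl.apply_multimodal_rotary_pos_emb).splitlines()
print("\n".join(difflib.unified_diff(old, new, lineterm="")))
```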

The discrepancy might be related to PR #412. I'm not sure why, but #412 only changed the rtol in test_mini_models.py and not in test_mini_models_multimodal.py.
From #412:

> The root cause of this issue lies in HuggingFace's release of new transformers, which introduced modifications to QWEN2VL. Since the discrepancy doesn't originate from a bug in the Liger QWEN2VL implementation, it's acceptable to slightly relax the relative tolerances (rtols) to accommodate these changes.

Should we just change the rtol to match its value in test_mini_models.py?
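If that is the right fix, the change would be a one-value edit to the parametrization. A sketch, with the parameter order inferred from the failing test id (model, steps, lr, dtype, then loss/logits/param atol-rtol pairs) and the relaxed value purely illustrative:

```python
import pytest
import torch

# Hypothetical relaxed float32 case for test_mini_models_multimodal.py.
# Parameter order inferred from the test id; the actual file may differ.
pytest.param(
    "mini_qwen2_5_vl",
    32,             # steps
    1e-4,           # lr
    torch.float32,
    1e-8,           # loss atol
    1e-4,           # loss rtol, relaxed from 1e-5 (illustrative value)
    5e-3,           # logits atol
    1e-5,           # logits rtol
    5e-3,           # param atol
    1e-5,           # param rtol
)
```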

@tyler-romero (Collaborator) commented Feb 9, 2025

For Qwen2.5-VL, we don't have this same issue:

> The root cause of this issue lies in HuggingFace's release of new transformers, which introduced modifications to QWEN2VL

because Qwen2.5-VL support was added after that change to huggingface, so there aren't multiple versions to support. We should track down whatever that modification was and update our patch to support it correctly.

@BenasdTW (Contributor, Author) commented Feb 9, 2025

I haven't made any changes to the code, but somehow the tests now pass on my laptop (transformers main branch: 6b55046).

=========================================================================== test session starts ===========================================================================
platform linux -- Python 3.11.10, pytest-8.3.4, pluggy-1.5.0
rootdir: /workspaces/PatchLiger/Liger-Kernel
configfile: pyproject.toml
plugins: hypothesis-6.115.5, xdist-3.6.1, rerunfailures-15.0, anyio-4.8.0
collecting ... 
--------------------------------------------------------------------------- live log collection ---------------------------------------------------------------------------
INFO     datasets:config.py:54 PyTorch version 2.5.1+cu124 available.
collected 6 items                                                                                                                                                         

test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype0-1e-08-1e-05-0.005-1e-05-0.005-1e-05] PASSED              [ 16%]
test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01] PASSED                    [ 33%]
test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_5_vl-32-0.0001-dtype2-1e-08-1e-05-0.005-1e-05-0.005-1e-05] PASSED            [ 50%]
test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_5_vl-32-0.0001-dtype3-0.001-0.01-0.1-0.01-0.01-0.01] PASSED                  [ 66%]
test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_mllama-32-0.0001-dtype4-1e-08-1e-05-0.005-1e-05-0.005-1e-05] FAILED                [ 83%]
test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_mllama-32-0.0001-dtype5-0.001-0.01-0.1-0.01-0.01-0.01] FAILED                      [100%]

@BenasdTW (Contributor, Author) commented Feb 9, 2025

In this PR, the forward function of Qwen2.5-VL has already been changed to match the transformers implementation.
Beyond that, although there are some architectural changes compared to Qwen2-VL, they all reuse existing components (Qwen2RMSNorm, CrossEntropyLoss, Qwen2MLP) without modifying them.

My theory: before #412, Qwen2-VL failed test_mini_models.py, and the Liger Kernel implementation has not been updated since, so the tiny discrepancy is still there.
#412 only relaxed the rtol in test_mini_models.py, not in test_mini_models_multimodal.py, which Qwen2-VL could also have failed; it was just lucky enough not to.
That's why I proposed that the tolerances in test_mini_models_multimodal.py should also be relaxed.
