
Monkeypatch for Qwen2.5-VL #552

Open · wants to merge 6 commits into main
Conversation

@BenasdTW (Contributor) commented Jan 31, 2025

Summary

Qwen2.5-VL was released a few days ago; this PR adds support for it.
Its architecture differs slightly from Qwen2-VL's.
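For context, support of this kind is applied by monkeypatching the relevant transformers module. Below is a minimal sketch of the approach, modeled on Liger Kernel's existing apply_liger_kernel_to_qwen2_vl patch; the qwen2_5_vl module path and the exact Liger symbol names here are assumptions, not the final API of this PR:

```python
# Sketch only: modeled on Liger Kernel's existing Qwen2-VL patch.
# The qwen2_5_vl module path and exact symbol names are assumptions.
from liger_kernel.transformers.qwen2vl_mrope import liger_multimodal_rotary_pos_emb
from liger_kernel.transformers.rms_norm import LigerRMSNorm


def apply_liger_kernel_to_qwen2_5_vl(rope: bool = True, rms_norm: bool = True) -> None:
    # Imported lazily so transformers is only needed when patching.
    from transformers.models.qwen2_5_vl import modeling_qwen2_5_vl

    if rope:
        # Qwen2.5-VL reuses Qwen2-VL's multimodal (sectioned) RoPE.
        modeling_qwen2_5_vl.apply_multimodal_rotary_pos_emb = liger_multimodal_rotary_pos_emb
    if rms_norm:
        modeling_qwen2_5_vl.Qwen2RMSNorm = LigerRMSNorm
```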

Details

As of February 1, the latest git version of transformers (62db3e6) supports it, but the support is not yet in a stable transformers release.

I need help with the tests, as I got OOMs on my laptop GPU. I have tried LoRA fine-tuning with this PR, and it works well.

Testing Done

I can't run the full test suite on my hardware; I got OOMs. A narrowed single-model invocation is sketched below the checklist.

  • Hardware Type: RTX 4070 laptop
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence
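For just the Qwen2.5-VL multimodal convergence cases, a convenience sketch (equivalent to invoking pytest with a -k filter on the test file shown in the logs further down; not one of the Makefile targets):

```python
# Run only the Qwen2.5-VL multimodal convergence cases.
import pytest

pytest.main(["test/convergence/test_mini_models_multimodal.py", "-k", "qwen2_5_vl", "-v"])
```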

@tyler-romero (Collaborator)

I made some fixes to the multimodal convergence test. However, the float32 convergence test is failing:

test_mini_model_multimodal[mini_qwen2_5_vl-32-0.0001-dtype2-1e-08-1e-05-0.005-1e-05-0.005-1e-05]

        # Raise AssertionError with detailed information if there are mismatches
        if not all_close and num_mismatched >= 1:
            mismatch_details = [f"Number of mismatched elements: {num_mismatched}"]
            print_count = min(max_print, num_mismatched)
            for index in mismatched_indices[:print_count]:
                i = tuple(index.tolist())
                mismatch_details.append(f"Mismatch at index {i}: tensor1[{i}] = {tensor1[i]}, tensor2[{i}] = {tensor2[i]}")
            if num_mismatched > max_print:
                mismatch_details.append(f"... and {num_mismatched - max_print} more mismatched elements.")
    
>           raise AssertionError("\n".join(mismatch_details))
E           AssertionError: Number of mismatched elements: 2
E           Mismatch at index (0, 24): tensor1[(0, 24)] = 0.09020370990037918, tensor2[(0, 24)] = 0.0902027115225792
E           Mismatch at index (0, 27): tensor1[(0, 27)] = 0.08561328053474426, tensor2[(0, 27)] = 0.08561427891254425

These are small differences, but I wonder if they show up in this test, rather than in the other convergence tests, because the multimodal RoPE implementation is slightly different. Could you look into it?
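For concreteness, plugging the first reported mismatch into the test's float32 tolerances (atol=1e-8, rtol=1e-5, per the test id above) shows how marginal the failure is; a quick sketch:

```python
import torch

# Values copied from the first reported mismatch at index (0, 24).
a = torch.tensor(0.09020370990037918)
b = torch.tensor(0.0902027115225792)

diff = (a - b).abs().item()           # ~1.0e-6
bound = 1e-8 + 1e-5 * b.abs().item()  # atol + rtol * |b|, ~9.1e-7
print(diff, bound, torch.isclose(a, b, rtol=1e-5, atol=1e-8).item())
# diff just exceeds the bound, so isclose reports False: a marginal failure.
```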

Thanks - other than this, the PR looks close.

@tyler-romero self-requested a review on February 8, 2025.
@BenasdTW (Contributor, Author) commented Feb 9, 2025

@tyler-romero Thanks for reviewing my PR and helping out with testing.

> I made some fixes to the multimodal convergence test. However, the float32 convergence test is failing:
> These are small differences, but I wonder if they show up in this test instead of the other convergence tests because the multimodal RoPE implementation is slightly different. Could you look into it?

Does it only fail for Qwen2.5-VL? I have checked: the transformers library uses the same implementation for both Qwen2-VL and Qwen2.5-VL.
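One way to check this mechanically is to diff the two functions' sources, assuming the mrope helper keeps the name apply_multimodal_rotary_pos_emb in both modules (a sketch; it requires a transformers build that includes Qwen2.5-VL):

```python
import difflib
import inspect

from transformers.models.qwen2_5_vl import modeling_qwen2_5_vl
from transformers.models.qwen2_vl import modeling_qwen2_vl

# Empty output means the two mrope implementations are textually identical.
old = inspect.getsource(modeling_qwen2_vl.apply_multimodal_rotary_pos_emb).splitlines()
new = inspect.getsource(modeling_qwen2_5_vl.apply_multimodal_rotary_pos_emb).splitlines()
print("\n".join(difflib.unified_diff(old, new, lineterm="")))
```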

The discrepancy might be related to PR #412. I'm not sure why, but #412 only changed the rtol in test_mini_models.py and not in test_mini_models_multimodal.py.
From #412:

> The root cause of this issue lies in HuggingFace's release of new transformers, which introduced modifications to QWEN2VL. Since the discrepancy doesn't originate from a bug in the Liger QWEN2VL implementation, it's acceptable to slightly relax the relative tolerances (rtols) to accommodate these changes.

Should we just change the rtol to match its value in test_mini_models.py?
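If that is the right fix, the change would be a one-value edit to the parametrization. A sketch, with the parameter order inferred from the failing test id (model, steps, lr, dtype, then loss/logits/param atol-rtol pairs) and the relaxed value purely illustrative:

```python
import pytest
import torch

# Hypothetical relaxed float32 case for test_mini_models_multimodal.py.
# Parameter order inferred from the test id; the actual file may differ.
pytest.param(
    "mini_qwen2_5_vl",
    32,             # steps
    1e-4,           # lr
    torch.float32,
    1e-8,           # loss atol
    1e-4,           # loss rtol, relaxed from 1e-5 (illustrative value)
    5e-3,           # logits atol
    1e-5,           # logits rtol
    5e-3,           # param atol
    1e-5,           # param rtol
)
```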

@tyler-romero (Collaborator) commented Feb 9, 2025

For Qwen2.5-VL, we don't have this same issue:

> The root cause of this issue lies in HuggingFace's release of new transformers, which introduced modifications to QWEN2VL

because Qwen2.5-VL support was added after that change to huggingface, so there aren't multiple versions to support. We should track down whatever that modification was and update our patch to support it correctly.

@BenasdTW (Contributor, Author) commented Feb 9, 2025

I haven't made any changes to the code, but somehow the tests now pass on my laptop (transformers main branch: 6b55046).

=========================================================================== test session starts ===========================================================================
platform linux -- Python 3.11.10, pytest-8.3.4, pluggy-1.5.0
rootdir: /workspaces/PatchLiger/Liger-Kernel
configfile: pyproject.toml
plugins: hypothesis-6.115.5, xdist-3.6.1, rerunfailures-15.0, anyio-4.8.0
collecting ... 
--------------------------------------------------------------------------- live log collection ---------------------------------------------------------------------------
INFO     datasets:config.py:54 PyTorch version 2.5.1+cu124 available.
collected 6 items                                                                                                                                                         

test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype0-1e-08-1e-05-0.005-1e-05-0.005-1e-05] PASSED              [ 16%]
test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01] PASSED                    [ 33%]
test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_5_vl-32-0.0001-dtype2-1e-08-1e-05-0.005-1e-05-0.005-1e-05] PASSED            [ 50%]
test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_5_vl-32-0.0001-dtype3-0.001-0.01-0.1-0.01-0.01-0.01] PASSED                  [ 66%]
test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_mllama-32-0.0001-dtype4-1e-08-1e-05-0.005-1e-05-0.005-1e-05] FAILED                [ 83%]
test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_mllama-32-0.0001-dtype5-0.001-0.01-0.1-0.01-0.01-0.01] FAILED                      [100%]

@BenasdTW (Contributor, Author) commented Feb 9, 2025

In this PR, the forward function of Qwen2.5-VL has already been changed to match the transformers implementation.
Beyond that, although there are some architectural changes compared to Qwen2-VL, they all reuse existing components (Qwen2RMSNorm, CrossEntropyLoss, Qwen2MLP) without modifying them.

My theory: before #412, Qwen2-VL failed test_mini_models.py, and the Liger Kernel implementation has not been updated since, so the tiny discrepancy is still there.
#412 only relaxed the rtol in test_mini_models.py, not in test_mini_models_multimodal.py, which Qwen2-VL could also have failed; it was just lucky enough not to.
That's why I proposed that the tolerances in test_mini_models_multimodal.py should also be relaxed.
