Fix undefined symbol issue for transformer_engine::getenv #763

Merged
timmoon10 merged 1 commit into NVIDIA:main from fix_getenv_undefined_symbol on Apr 10, 2024

Conversation

@jinzex (Contributor) commented Apr 9, 2024

Fixes #756. The issue arises because we recently included common/util/system.h in the PyTorch C++ source files (https://github.com/NVIDIA/TransformerEngine/pull/713/files) to use the transformer_engine::getenv method. As a result, the PyTorch transformer_engine_extensions module now has to resolve symbols from the libtransformer_engine.so library.

However, when a user builds TransformerEngine in a local environment (conda or PyPI), the pre-built PyTorch from either source has CXX11_ABI set to False (_GLIBCXX_USE_CXX11_ABI=0), whereas the common library is built with the new CXX11_ABI. The two sides then expect different mangled names for functions that take std::string, so importing transformer_engine_extensions fails with an undefined symbol error. The NGC PyTorch container uses CXX11_ABI=1, so it is not affected.
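
For illustration only (not part of this PR): the symbol reported in #756 encodes std::string as Ss, which is how libstdc++ mangles the old-ABI string, while code built with the new ABI produces names containing __cxx11. The Python sketch below is a rough heuristic for telling the two apart in a mangled name. The helper name libstdcxx_string_abi is hypothetical, and the substring check is an assumption that can misfire on unrelated names.

```python
def libstdcxx_string_abi(mangled: str) -> str:
    """Guess which libstdc++ std::string ABI a mangled symbol refers to.

    Heuristic only (an assumption for illustration, not a demangler):
    - "__cxx11" appears when std::__cxx11::basic_string is used (new ABI).
    - "Ss" is the Itanium abbreviation for the old-ABI std::string.
    """
    if "__cxx11" in mangled:
        return "new"
    if "Ss" in mangled:
        return "old"
    return "unknown"


# Symbol from the linked issue: the extension asks for an old-ABI name.
missing = "_ZN18transformer_engine6getenvIiEET_RKSsRKS1_"
print(libstdcxx_string_abi(missing))  # prints "old"

# Fragment of a new-ABI mangled std::string, shown only for comparison.
new_abi_fragment = "RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"
print(libstdcxx_string_abi(new_abi_fragment))  # prints "new"
```

In practice the input would come from a tool such as nm -D run on the shared libraries; a library that only exports the "new" form cannot satisfy a lookup for the "old" one.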

Signed-off-by: Jinze Xue <jinzex@nvidia.com>
@timmoon10 (Collaborator)

/te-ci pytorch

@timmoon10 (Collaborator)

This hack works around the immediate issue, but I think it may be worth doing a proper refactor to avoid ABI issues in the future. See #756 (comment).

@ptrendx ptrendx added the 1.6.0 label Apr 10, 2024

@timmoon10 (Collaborator) left a comment

We'll merge this before the 1.6 release to quickly unblock users. Long-term, the build system should have logic to ensure ABI compatibility with PyTorch (see #756 (comment)).
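
One possible shape for the build-system logic mentioned above, as a minimal sketch: ask the installed PyTorch which ABI it was compiled with and pass the matching define to the compiler. torch.compiled_with_cxx11_abi() is an existing PyTorch call; the helper names and the fallback default are assumptions for illustration and do not reflect TransformerEngine's actual build scripts.

```python
def detect_torch_cxx11_abi(default: bool = True) -> bool:
    """Return True if the installed PyTorch was built with the new C++11 ABI.

    Falls back to `default` when PyTorch is not importable (hypothetical
    policy chosen for this sketch).
    """
    try:
        import torch
    except ImportError:
        return default
    return bool(torch.compiled_with_cxx11_abi())


def cxx11_abi_define(use_cxx11_abi: bool) -> str:
    """Build the libstdc++ dual-ABI compiler define for the given setting."""
    return f"-D_GLIBCXX_USE_CXX11_ABI={int(use_cxx11_abi)}"


# Flag to add to the C++ compile options of every library that exchanges
# std::string (or other dual-ABI types) with the PyTorch extension.
abi_flag = cxx11_abi_define(detect_torch_cxx11_abi())
```

Passing the same define to both the common library and the framework extension keeps their std::string symbols consistent, which is the mismatch described in this PR.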

@timmoon10 timmoon10 merged commit 1b20f2d into NVIDIA:main Apr 10, 2024
9 of 10 checks passed
@jinzex jinzex deleted the fix_getenv_undefined_symbol branch April 15, 2024 16:11
pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request May 9, 2024
Signed-off-by: Jinze Xue <jinzex@nvidia.com>
Co-authored-by: Jinze Xue <jinzex@nvidia.com>
pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request May 15, 2024
Signed-off-by: Jinze Xue <jinzex@nvidia.com>
Co-authored-by: Jinze Xue <jinzex@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request May 23, 2024
Signed-off-by: Jinze Xue <jinzex@nvidia.com>
Co-authored-by: Jinze Xue <jinzex@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

Successfully merging this pull request may close these issues.

_ZN18transformer_engine6getenvIiEET_RKSsRKS1_ on the latest main branch
3 participants