Undefined symbol: _ZNK5torch8autograd4Node4nameEv
Hi team,
I installed fbgemm_gpu with the command pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cu121/ and am using torch 2.4.0.
Currently I am facing the error below:
/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
Traceback (most recent call last):
File "/home/gr-optimizations/train.py", line 29, in <module>
import fbgemm_gpu # noqa: F401, E402
File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/__init__.py", line 22, in <module>
import fbgemm_gpu.docs # noqa: F401, E402
File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/docs/__init__.py", line 9, in <module>
from . import jagged_tensor_ops, table_batched_embedding_ops # noqa: F401
File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/docs/jagged_tensor_ops.py", line 14, in <module>
torch.ops.fbgemm.jagged_2d_to_dense,
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1131, in __getattr__
raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'
Could you please let me know how to resolve this issue?
Best,
The latest stable release of FBGEMM_GPU targets binary compatibility with torch 2.3.x. The nightly version should be used for running against torch 2.4.x:
pip install --pre fbgemm-gpu --index-url https://download.pytorch.org/whl/nightly/cu121/
Could you try this and let us know if there are any issues?
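As a quick sanity check after reinstalling (a minimal sketch; the op below is the one from your traceback), you can verify that the fbgemm ops are registered:

import torch
import fbgemm_gpu  # noqa: F401

print(torch.__version__)
# If the shared library loaded correctly, this lookup resolves
# instead of raising AttributeError.
print(torch.ops.fbgemm.jagged_2d_to_dense)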
Thanks for your kind reply. Now I am facing the following error.
Could you please let me know how to resolve this issue?
/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK3c105Error4whatEv
/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai_py.so: undefined symbol: _ZNK3c105Error4whatEv
INFO:root:cuda.matmul.allow_tf32: True
I0611 17:30:01.247293 139738748224128 train.py:135] cuda.matmul.allow_tf32: True
INFO:root:cudnn.allow_tf32: True
I0611 17:30:01.247349 139738748224128 train.py:136] cudnn.allow_tf32: True
INFO:root:Training model on rank 0.
I0611 17:30:01.247383 139738748224128 train.py:137] Training model on rank 0.
Initialize _item_emb.weight as truncated normal: torch.Size([131263, 256]) params
INFO:root:Rank 0: writing logs to ./exps/ml-20m-l200/HSTU_CUSTOM-b16-h8-dqk32-dv32-lsilud0.2-ad0.0_DotProduct_local-l2-eps1e-06_ssl-t0.05-n128-b128-lr0.001-wu0-wd0-2024-06-11
0%| | 0/1082 [00:00<?, ?it/s]INFO:root:running build_ext
I0611 17:30:10.720256 139738748224128 dist.py:985] running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
0%| | 0/1082 [00:03<?, ?it/s]
......
Traceback logs....
.......
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1131, in __getattr__
raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'asynchronous_complete_cumsum'
We have not run into the issue undefined symbol: _ZNK3c105Error4whatEv before. However, I suspect it might have to do with the exact nightly version of PyTorch, and that a more recent nightly version might resolve the issue. We use the installation instructions here for reproducible environments; could you try installing through these and let us know if you still run into the issue?
Was PyTorch built with the C++11 ABI? (If torch.compiled_with_cxx11_abi() returns True, then yes.) If so, the pre-built wheels are incompatible.
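A minimal way to run that check, using the same Python that runs your training script:

import torch
# True means this torch build uses the CXX11 ABI; the pre-built
# FBGEMM_GPU wheels discussed here target the pre-CXX11 ABI.
print(torch.compiled_with_cxx11_abi())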
Try a lower version of both torch and fbgemm_gpu.
This works for me:
torch==2.3.0+cu121
fbgemm_gpu==0.7.0+cu121
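For reference, a pinned install along those lines (assuming the same cu121 wheel index used earlier in this thread):

pip install torch==2.3.0+cu121 fbgemm-gpu==0.7.0+cu121 --index-url https://download.pytorch.org/whl/cu121/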
Hi @wookjeHan, we no longer support fbgemm v0.7.0. Please use at least v1.0.0, and let us know if you still run into this error.
Hi, if anybody is still facing this issue, try this:
pip install torch==2.3.1
pip install --pre fbgemm-gpu --index-url https://download.pytorch.org/whl/nightly/cu121/
It worked for me.
I'm facing very much the same issue with PyTorch and torchrec.
I'm using PyTorch version 2.6.0+cu124 with CUDA 12.4.
OSError: /home/ankur/miniconda3/envs/py3_12/lib/python3.12/site-packages/fbgemm_gpu/fbgemm_gpu_config.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEb
I installed torchrec following the instructions here: https://github.com/pytorch/torchrec
Running a simple torchrec program:

import torchrec

ebc = torchrec.EmbeddingBagCollection(
    device="cpu",
    tables=[
        torchrec.EmbeddingBagConfig(
            name="product_table",
            embedding_dim=64,
            num_embeddings=4096,
            feature_names=["product"],
            pooling=torchrec.PoolingType.SUM,
        ),
        torchrec.EmbeddingBagConfig(
            name="user_table",
            embedding_dim=64,
            num_embeddings=4096,
            feature_names=["user"],
            pooling=torchrec.PoolingType.SUM,
        ),
    ],
)
print(ebc.embedding_bags)
I tried using the cuda12_4 version of fbgemm and the CPU version; same error. Please help!
Hi @ankur6ue, it is likely that you have an old version of libstdc++ that does not export the GLIBCXX_3.4.29 symbols. You will need to install a more recent version of gcc (11.4+) in order to get an updated libstdc++ installation that FBGEMM_GPU can use. We recommend doing the gcc + FBGEMM_GPU + Torchrec installation inside an isolated Conda environment, which can be done by following the instructions here.
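One way to confirm what your toolchain provides (a sketch; the libstdc++ path below assumes a typical x86-64 Ubuntu layout, adjust as needed):

import subprocess
# List the GLIBCXX symbol versions exported by the system libstdc++.
out = subprocess.run(
    ["strings", "/usr/lib/x86_64-linux-gnu/libstdc++.so.6"],
    capture_output=True, text=True,
)
versions = sorted(set(l for l in out.stdout.splitlines() if l.startswith("GLIBCXX_")))
print(versions[-5:])  # gcc 11.4+ should provide GLIBCXX_3.4.29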
I updated my gcc version by upgrading to Ubuntu 22.04, but the issue persists.
I'll try the isolated Conda environment instructions. Any other tips? Could it be the GPUs? I have 1080 Ti GPUs, which are quite old, although that should result in a runtime error, not a missing-symbol issue, I'd think?
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Actually, there's no such symbol in any of the PyTorch libraries; we can only find torch::autograd::PyNode::name() const in libtorch_python.so. torch::autograd::Node::name() const only appears in the C++ sources of PyTorch and is not exported to the libraries. It could be a bug.
https://github.com/pytorch/pytorch/blob/d636c181f9140a7b59be10b36eae23039fc2bb72/torch/csrc/autograd/function.cpp#45-47
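For anyone who wants to reproduce the check, a sketch (the library path is an assumption; point it at your own torch install):

import subprocess
lib = "/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so"
sym = "_ZNK5torch8autograd4Node4nameEv"
# nm -D lists the dynamic symbol table; an empty result means this
# build does not export torch::autograd::Node::name() const.
out = subprocess.run(["nm", "-D", lib], capture_output=True, text=True)
print([line for line in out.stdout.splitlines() if sym in line])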
There are some undefined symbols in the built binary, but they should not affect the library's functionality. As of today, the following combinations should work: the torch 2.8 release + the fbgemm_gpu 1.3.0 release, or torch nightly + fbgemm_gpu nightly. Any other combination is not guaranteed to work correctly and may crash on library load.
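For example, a pinned install matching the first combination might look like this (a sketch; adjust the wheel index for your CUDA version as in the earlier commands in this thread):

pip install torch==2.8.0 fbgemm-gpu==1.3.0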
Thank you for your guidance. I discovered that the issue stems from a CXX11 ABI discrepancy. The symbol _ZNK5torch8autograd4Node4nameEv belongs to the non-CXX11 ABI. Starting with PyTorch 2.6, the default ABI standard was updated to CXX11.
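For what it's worth, the symbol demangles to exactly the function discussed above (assuming binutils' c++filt is available):

import subprocess
sym = "_ZNK5torch8autograd4Node4nameEv"
# Demangle the Itanium-ABI name back into a C++ signature.
print(subprocess.run(["c++filt", sym], capture_output=True, text=True).stdout.strip())
# -> torch::autograd::Node::name() const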
Ah yes, the ABI standard was upgraded around that time frame, and from time to time we do get bug reports that stem from that update. Upgrading to a later version of torch should solve the problem; please let us know if you still observe issues.