Undefined symbol: _ZNK5torch8autograd4Node4nameEv

Open wookjeHan opened this issue 1 year ago • 4 comments

Hi team, I installed fbgemm_gpu with the command pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cu121/ and I am using torch 2.4.0.

Currently I am facing the error below:

/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
Traceback (most recent call last):
  File "/home/gr-optimizations/train.py", line 29, in <module>
    import fbgemm_gpu  # noqa: F401, E402
  File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/__init__.py", line 22, in <module>
    import fbgemm_gpu.docs  # noqa: F401, E402
  File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/docs/__init__.py", line 9, in <module>
    from . import jagged_tensor_ops, table_batched_embedding_ops  # noqa: F401
  File "/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/docs/jagged_tensor_ops.py", line 14, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1131, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'

Could you please let me know how to resolve this issue?

Best,

wookjeHan avatar Jun 11 '24 16:06 wookjeHan
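For anyone hitting this, the failing import can be wrapped so the underlying undefined-symbol error is reported up front instead of surfacing later as a confusing missing-op AttributeError. A minimal sketch (the helper name try_import_fbgemm is illustrative, not part of fbgemm_gpu):

```python
def try_import_fbgemm():
    """Attempt to load fbgemm_gpu, reporting load failures clearly.

    Returns True on success, False with a diagnostic otherwise.
    """
    try:
        import fbgemm_gpu  # noqa: F401
        return True
    except (ImportError, OSError) as e:
        # An "undefined symbol" OSError here almost always means the
        # fbgemm-gpu wheel was built against a different torch release
        # (or a different C++ ABI) than the torch that is installed.
        print(f"fbgemm_gpu failed to load: {e}")
        return False

print(try_import_fbgemm())
```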

The latest stable release of FBGEMM_GPU targets binary compatibility with torch 2.3.x. The nightly version should be used for running against torch 2.4.x:

pip install --pre fbgemm-gpu --index-url https://download.pytorch.org/whl/nightly/cu121/

Could you try this and let us know if there are any issues?

q10 avatar Jun 11 '24 17:06 q10

Thanks for your kind reply. Now I am facing the following error.

Could you please let me know how to resolve this issue?

/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK3c105Error4whatEv
/usr/local/lib/python3.10/dist-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai_py.so: undefined symbol: _ZNK3c105Error4whatEv
INFO:root:cuda.matmul.allow_tf32: True
I0611 17:30:01.247293 139738748224128 train.py:135] cuda.matmul.allow_tf32: True
INFO:root:cudnn.allow_tf32: True
I0611 17:30:01.247349 139738748224128 train.py:136] cudnn.allow_tf32: True
INFO:root:Training model on rank 0.
I0611 17:30:01.247383 139738748224128 train.py:137] Training model on rank 0.
Initialize _item_emb.weight as truncated normal: torch.Size([131263, 256]) params
INFO:root:Rank 0: writing logs to ./exps/ml-20m-l200/HSTU_CUSTOM-b16-h8-dqk32-dv32-lsilud0.2-ad0.0_DotProduct_local-l2-eps1e-06_ssl-t0.05-n128-b128-lr0.001-wu0-wd0-2024-06-11
  0%|                                                                                                                                                           | 0/1082 [00:00<?, ?it/s]INFO:root:running build_ext
I0611 17:30:10.720256 139738748224128 dist.py:985] running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
INFO:root:running build_ext
  0%|                                                                                                                                                           | 0/1082 [00:03<?, ?it/s]
......
Traceback logs....
.......
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1131, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'asynchronous_complete_cumsum'

wookjeHan avatar Jun 11 '24 17:06 wookjeHan

We have not run into the undefined symbol: _ZNK3c105Error4whatEv issue before. However, I suspect it might have to do with the exact nightly version of PyTorch, and a more recent nightly version might resolve the issue. We use the installation instructions here for reproducible environments; could you try installing through those and let us know if you still run into the issue?

q10 avatar Jun 17 '24 18:06 q10

Was pytorch built with C++11 ABI? (If torch.compiled_with_cxx11_abi() returns True, then yes). If so, the pre-built wheels are incompatible.

isuruf avatar Sep 09 '24 17:09 isuruf
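isuruf's ABI check can be scripted so it degrades gracefully when torch is absent; a small sketch (the function name is mine):

```python
import importlib.util


def torch_uses_cxx11_abi():
    """Return torch's compile-time C++11-ABI flag, or None if torch
    is not installed.

    If this flag disagrees with the ABI the installed fbgemm-gpu
    wheel was built with, the pre-built wheels are incompatible.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return torch.compiled_with_cxx11_abi()

print(torch_uses_cxx11_abi())
```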

Try a lower version of both torch and fbgemm_gpu.

This works for me:

torch==2.3.0+cu121
fbgemm_gpu==0.7.0+cu121

NiDHanWang avatar Jan 05 '25 08:01 NiDHanWang

Hi @wookjeHan, we no longer support fbgemm v0.7.0; please use at least v1.0.0 and let us know if you still run into this error.

q10 avatar Jan 06 '25 00:01 q10

Hi, if anybody is still facing this issue, try this:

pip install torch==2.3.1
pip install --pre fbgemm-gpu --index-url https://download.pytorch.org/whl/nightly/cu121/

It worked for me.

krP471 avatar Feb 04 '25 04:02 krP471

I'm facing very much the same issue with pytorch and torchrec.

I'm using pytorch version 2.6.0+cu124 on CUDA 12.4.

OSError: /home/ankur/miniconda3/envs/py3_12/lib/python3.12/site-packages/fbgemm_gpu/fbgemm_gpu_config.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEb

installed torchrec following instructions here: https://github.com/pytorch/torchrec

Running a simple torchrec program:

import torchrec

ebc = torchrec.EmbeddingBagCollection(
    device="cpu",
    tables=[
        torchrec.EmbeddingBagConfig(
            name="product_table",
            embedding_dim=64,
            num_embeddings=4096,
            feature_names=["product"],
            pooling=torchrec.PoolingType.SUM,
        ),
        torchrec.EmbeddingBagConfig(
            name="user_table",
            embedding_dim=64,
            num_embeddings=4096,
            feature_names=["user"],
            pooling=torchrec.PoolingType.SUM,
        ),
    ],
)
print(ebc.embedding_bags)

I tried using the cuda12_4 version of fbgemm and the cpu version; same error. Please help!

ankur6ue avatar Mar 03 '25 04:03 ankur6ue

Hi @ankur6ue, it is likely that you have an old version of libstdc++ that does not contain the GLIBCXX_3.4.29 symbols. You will need to install a more recent version of gcc (11.4+) in order to get an updated libstdc++ installation that FBGEMM_GPU can use. We recommend doing the gcc + FBGEMM_GPU + TorchRec installation inside an isolated Conda environment, which can be done by following the instructions here.

q10 avatar Mar 03 '25 07:03 q10

I updated my gcc version by upgrading to Ubuntu 22.04, but the issue persists.

I'll try the isolated conda environment instructions. Any other tips? Could it be the GPUs? I have 1080 Ti GPUs, which are quite old, although that should result in a runtime error, not a missing-symbol issue, I'd think?

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ankur6ue avatar Mar 03 '25 23:03 ankur6ue

Actually, there's no such symbol in any of the PyTorch libraries; we can only find torch::autograd::PyNode::name() const in libtorch_python.so. torch::autograd::Node::name() const only appears in the C++ sources of PyTorch and is not exported to the libraries. It could be a bug.

https://github.com/pytorch/pytorch/blob/d636c181f9140a7b59be10b36eae23039fc2bb72/torch/csrc/autograd/function.cpp#L45-L47

HowardZorn avatar Sep 04 '25 03:09 HowardZorn
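For readers who want to see what these mangled names encode without reaching for c++filt, simple Itanium-ABI nested names can be decoded by hand. A toy sketch that handles only plain length-prefixed names like the ones in this thread (no templates, parameters, or substitutions):

```python
import re


def demangle_simple(sym: str) -> str:
    """Toy demangler for plain Itanium-ABI nested names such as
    _ZNK5torch8autograd4Node4nameEv. Use c++filt for real work."""
    m = re.match(r"_ZN(K?)", sym)
    if not m:
        raise ValueError("unsupported mangled name")
    is_const = m.group(1) == "K"  # K after _ZN marks a const member
    rest = sym[m.end():]
    parts = []
    # Nested-name components are length-prefixed: "5torch" -> "torch"
    while (n := re.match(r"\d+", rest)):
        length = int(n.group())
        rest = rest[n.end():]
        parts.append(rest[:length])
        rest = rest[length:]
    # 'E' closes the nested name; a trailing 'v' is an empty parameter list
    return "::".join(parts) + "()" + (" const" if is_const else "")

print(demangle_simple("_ZNK5torch8autograd4Node4nameEv"))
# torch::autograd::Node::name() const
print(demangle_simple("_ZNK3c105Error4whatEv"))
# c10::Error::what() const
```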

There are some undefined symbols in the built binary, but they should not affect the library's functionality. As of today, the following combinations should work: torch 2.8 release + fbgemm_gpu 1.3.0 release, or torch nightly + fbgemm_gpu nightly. Any other combination is not guaranteed to work correctly and may crash on library load.

q10 avatar Sep 04 '25 17:09 q10

Thank you for your guidance. I discovered that the issue stems from a CXX11 ABI discrepancy. The symbol _ZNK5torch8autograd4Node4nameEv belongs to the non-CXX11 ABI. Starting with PyTorch 2.6, the default ABI standard was updated to CXX11.

HowardZorn avatar Sep 05 '25 02:09 HowardZorn
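The dual-ABI distinction is visible directly in the mangled names: under the C++11 ABI, std::string mangles through the std::__cxx11 inline namespace, while the old ABI abbreviates it as Ss. A quick sketch illustrating this on symbols from this thread (note the caveat in the docstring: names that take no std::string carry no tag either way):

```python
def has_cxx11_abi_tag(mangled: str) -> bool:
    """True if the mangled name references the std::__cxx11 inline
    namespace, i.e. it involves a C++11-ABI std::string (or similar).
    Names without such parameters carry no tag, so absence of the
    tag does not by itself prove the old ABI."""
    return "__cxx11" in mangled

# The torchrec symbol takes a std::string, so the ABI tag appears:
print(has_cxx11_abi_tag(
    "_ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_string"
    "IcSt11char_traitsIcESaIcEEEb"))  # True
# Node::name() const takes no std::string, so it carries no tag:
print(has_cxx11_abi_tag("_ZNK5torch8autograd4Node4nameEv"))  # False
```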

Ah yes, the ABI standard was upgraded around that time frame, and from time to time we get bug reports stemming from that update. Upgrading to a later version of torch should solve the problem; please let us know if you still observe any issues.

q10 avatar Sep 05 '25 06:09 q10