Illegal memory access error after get_chi_atom_indices
I tried to predict a 5-subunit complex (~5000 aa in total) and get the following error with various settings (1-4x A100 80 GB, with and without --inplace, with and without --chunk_size 1-32). The error seems to be associated with exceeding GPU memory, and I am not sure whether this is expected at the given sequence length and available GPU memory. I installed FastFold from the recent commit 930a58a into a clean conda environment and built Triton from source. A smaller complex (~2000 aa) ran without errors.
terminate called after throwing an instance of 'c10::Error'
what(): NCCL error in: /opt/conda/conda-bld/pytorch_1659484810403/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:173, unhandled cuda error, NCCL version 2.10.3
Process Group destroyed on rank 1
Exception raised from ncclCommAbort at /opt/conda/conda-bld/pytorch_1659484810403/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:173 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f43cf264497 in .../fastfold/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f43cf23bc94 in .../fastfold/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19ea61 (0x7f44092e2a61 in .../fastfold/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x118 (0x7f44092c6098 in .../fastfold/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x9 (0x7f44092c6369 in .../fastfold/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #5: <unknown function> + 0x9d7799 (0x7f440f4fd799 in .../fastfold/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x354732 (0x7f440ee7a732 in .../fastfold/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x3555ff (0x7f440ee7b5ff in .../fastfold/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x116878 (0x55a0881ca878 in .../fastfold/bin/python3)
frame #9: <unknown function> + 0x11699d (0x55a0881ca99d in .../fastfold/bin/python3)
frame #10: <unknown function> + 0x1fd471 (0x55a0882b1471 in .../fastfold/bin/python3)
frame #11: <unknown function> + 0x10e937 (0x55a0881c2937 in .../fastfold/bin/python3)
frame #12: _PyGC_CollectNoFail + 0x2b (0x55a0882b134b in .../fastfold/bin/python3)
frame #13: PyImport_Cleanup + 0x371 (0x55a0882b11b1 in .../fastfold/bin/python3)
frame #14: Py_FinalizeEx + 0x7a (0x55a0882aff9a in .../fastfold/bin/python3)
frame #15: Py_Exit + 0x8 (0x55a0881454bc in .../fastfold/bin/python3)
frame #16: <unknown function> + 0x9141b (0x55a08814541b in .../fastfold/bin/python3)
frame #17: <unknown function> + 0x910ee (0x55a0881450ee in .../fastfold/bin/python3)
frame #18: PyRun_SimpleStringFlags + 0x4a (0x55a088141f12 in .../fastfold/bin/python3)
frame #19: Py_RunMain + 0x27b (0x55a0882abc1b in .../fastfold/bin/python3)
frame #20: Py_BytesMain + 0x39 (0x55a088283619 in .../fastfold/bin/python3)
frame #21: __libc_start_main + 0xf5 (0x7f444b239555 in /lib64/libc.so.6)
frame #22: <unknown function> + 0x1cf525 (0x55a088283525 in .../fastfold/bin/python3)
Traceback (most recent call last):
File ".../.../FastFold/inference.py", line 519, in <module>
main(args)
File ".../.../FastFold/inference.py", line 149, in main
inference_multimer_model(args)
File ".../.../FastFold/inference.py", line 282, in inference_multimer_model
torch.multiprocessing.spawn(inference_model, nprocs=args.gpus, args=(args.gpus, result_q, batch, args))
File ".../fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File ".../fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File ".../fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File ".../fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File ".../.../FastFold/inference.py", line 136, in inference_model
out = model(batch)
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 507, in forward
outputs, m_1_prev, z_prev, x_prev = self.iteration(
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 264, in iteration
template_embeds = self.template_embedder(
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/fastnn/embedders_multimer.py", line 351, in forward
self.template_single_embedder(
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/fastnn/embedders_multimer.py", line 238, in forward
all_atom_multimer.compute_chi_angles(
File ".../.../FastFold/fastfold/utils/all_atom_multimer.py", line 403, in compute_chi_angles
chi_atom_indices = get_chi_atom_indices(aatype.device)
File ".../.../FastFold/fastfold/utils/all_atom_multimer.py", line 365, in get_chi_atom_indices
return torch.tensor(chi_atom_indices, device=device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
The most likely reason is indeed out of memory. With single-precision (fp32) inference on a 40 GB card, a sequence length of around 5000 is the limit. It is recommended to use --inplace --chunk_size 1.
You may also need to set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:15000 to run inference on such an extremely long sequence, or switch to bfloat16 for inference.
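A minimal launch sketch of this suggestion (the FASTA path and any remaining inference.py arguments are placeholders; only --gpus, --inplace, and --chunk_size are taken from this thread):

```python
# Sketch only: set the allocator hint in this process so the spawned
# inference process inherits it, then launch with the suggested flags.
import os
import subprocess

# Standard PyTorch allocator option to limit block splitting / fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:15000"

# Environment variables set above are inherited by the child process.
subprocess.run(
    [
        "python", "inference.py",
        "target.fasta",          # placeholder input; use your own paths/args
        "--gpus", "1",
        "--inplace",
        "--chunk_size", "1",
        # ...remaining inference.py arguments as in your usual run
    ],
    check=True,
)
```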
I tried again with --inplace --chunk_size 1 and PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:15000, but I still get the same error. The card has 80 GB of memory. Is my understanding correct that using 2 GPUs would speed up the job but not increase the memory limit to 160 GB for a long sequence? What would be the best way to switch to bf16?
CUDA execution is asynchronous, so you need to set CUDA_LAUNCH_BLOCKING=1 to locate the bug. Alternatively, you can give us the FASTA file so we can reproduce the bug.
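For reference, a tiny sketch of setting this before anything touches CUDA (exporting the variable in the shell before launching inference.py works equally well):

```python
# CUDA kernel launches are asynchronous, so an illegal access is often
# reported at a later API call. With CUDA_LAUNCH_BLOCKING=1 every launch is
# synchronized, so the traceback points at the kernel that actually faulted.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

# ...then import torch / run the usual inference entry point in this process.
```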
Here is the traceback with CUDA_LAUNCH_BLOCKING=1. I saw in the installation instructions that CUDA >= 11.4 is suggested for building Triton, but the environment.yml installs cudatoolkit 11.3 and there doesn't seem to be a ColossalAI release for CUDA > 11.3. Could this cause any problems?
Traceback (most recent call last):
File ".../fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File ".../.../FastFold/inference.py", line 136, in inference_model
out = model(batch)
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 507, in forward
outputs, m_1_prev, z_prev, x_prev = self.iteration(
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 264, in iteration
template_embeds = self.template_embedder(
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/fastnn/embedders_multimer.py", line 339, in forward
pair_act = self.template_pair_embedder(
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/fastnn/embedders_multimer.py", line 215, in forward
query_embedding = self.query_embedding_layer_norm(query_embedding)
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/fastnn/kernel/layer_norm.py", line 38, in forward
return LayerNormTritonFunc.apply(input, self.normalized_shape, self.weight, self.bias,
File ".../.../FastFold/fastfold/model/fastnn/kernel/triton/layer_norm.py", line 164, in forward
_layer_norm_fwd_fused[(M,)](
File ".../fastfold/lib/python3.8/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "<string>", line 23, in _layer_norm_fwd_fused
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
The bug has been fixed in https://github.com/hpcaitech/FastFold/pull/103 and will be merged into the main branch soon.
Great, thanks a lot! It now ran for some time, but another error occurred.
Command line args were:
--gpus 4 --inplace --chunk_size 1
Traceback (most recent call last):
File ".../fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File ".../.../FastFold/inference.py", line 136, in inference_model
out = model(batch)
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 507, in forward
outputs, m_1_prev, z_prev, x_prev = self.iteration(
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 373, in iteration
m, z, s = self.evoformer.inplace(
File ".../.../FastFold/fastfold/model/fastnn/evoformer.py", line 319, in inplace
m, z = checkpoint_blocks(
File ".../.../FastFold/fastfold/utils/checkpointing.py", line 73, in checkpoint_blocks
return exec(blocks, args)
File ".../.../FastFold/fastfold/utils/checkpointing.py", line 60, in exec
a = wrap(block(*a))
File ".../.../FastFold/fastfold/model/fastnn/evoformer.py", line 131, in inplace
z = self.communication.inplace(m[0], msa_mask, z)
File ".../.../FastFold/fastfold/model/fastnn/ops.py", line 206, in inplace
left_act = M_mask_col * left_act
RuntimeError: The size of tensor a (1282) must match the size of tensor b (5128) at non-singleton dimension 2
Fixed in https://github.com/hpcaitech/FastFold/pull/105.
Thanks again for the super-fast fix. It now runs for several hours, but at some point it still crashes with OOM. I already set max_split_size down to 1000 MB. Is it possible to globally change the precision to fp16?
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File ".../fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File ".../.../FastFold/inference.py", line 136, in inference_model
out = model(batch)
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 507, in forward
outputs, m_1_prev, z_prev, x_prev = self.iteration(
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 389, in iteration
outputs["sm"] = self.structure_module(
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/nn/structure_module.py", line 886, in forward
outputs = self._forward_multimer(s, z, aatype, mask)
File ".../.../FastFold/fastfold/model/nn/structure_module.py", line 825, in _forward_multimer
s = s + self.ipa(s, z, rigids, mask)
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/nn/structure_module.py", line 397, in forward
pt_att = sum([c**2 for c in pt_att])
RuntimeError: CUDA out of memory. Tried to allocate 4.70 GiB (GPU 0; 79.21 GiB total capacity; 77.87 GiB already allocated; 103.12 MiB free; 77.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
It goes out of memory in the structure module because multimer structure generation consumes much more memory than the monomer case. Unfortunately, we haven't optimized that part yet.
Monomer inference supports bf16, but multimer doesn't yet. We may support it in the future.
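For the monomer case, here is a generic PyTorch autocast sketch of bf16 inference (a toy module stands in for the real model; this is not a FastFold-specific API):

```python
import torch

# Toy stand-in for the real model; the point is only the autocast pattern.
model = torch.nn.Linear(64, 64).cuda().eval()
batch = torch.randn(8, 64, device="cuda")

# Run the forward pass with activations computed in bfloat16 where supported.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(batch)

print(out.dtype)  # torch.bfloat16
```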
Looking forward to these optimizations. Overall, a great project!
How can I use bf16 or fp16 to reduce GPU memory use during inference? Thanks for your kind help!