AlphaFold Stuck at hhblits Step on Cluster Compute Node
Issue Description: I am having trouble running AlphaFold on a compute node in our cluster. The run consistently hangs at the hhblits step. When I run the same program directly on the login node, it gets through this step without any issues. The hang occurs only when the job is submitted to a compute node: it stops at "- 22:54:01.709 INFO: Prefiltering database" and never progresses further.
Additionally, system call tracing of the hung process shows the following calls repeating:
futex(0x2b5ff13dd634, FUTEX_WAIT_PRIVATE, 4294967295, NULL) = 0
futex(0x2b5ff13dd634, FUTEX_WAKE_PRIVATE, 1) = 0
...
During this time, memory usage does not increase and the GPU does not appear to be computing, although the process itself still appears to be under load.
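Since the same program behaves differently on the two nodes, one thing that may be worth comparing is how many CPUs the process is actually allowed to use on each. The sketch below is only a diagnostic aid, not a confirmed explanation: it assumes Linux, and the idea that the scheduler restricts the job's CPUs or thread settings on the compute node is an untested assumption. The function name `cpu_report` is made up for illustration.

```python
import os


def cpu_report():
    """Collect CPU and thread settings visible to this process (Linux only)."""
    # CPUs this process is permitted to run on; a batch scheduler may
    # restrict this to fewer cores than the machine physically has.
    allowed = len(os.sched_getaffinity(0))
    return {
        "cpu_count": os.cpu_count(),          # CPUs present on the node
        "allowed_cpus": allowed,              # CPUs usable by this process
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),  # None if unset
    }


if __name__ == "__main__":
    for key, value in cpu_report().items():
        print(f"{key}: {value}")
```

Running this once on the login node and once inside a submitted job would show whether the two environments differ in these settings.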
On the login node, the output proceeds normally, like this:
22:15:22.605 INFO: Searching 32053680 column state sequences.
22:15:22.729 INFO: /tmp/yanghao2022/MSA_4508283962/seq.fasta is in A2M, A3M or FASTA format
22:15:22.730 INFO: Iteration 1
22:15:22.808 INFO: Prefiltering database
22:16:19.797 INFO: HMMs passed 1st prefilter (gapless profile-profile alignment) : 693794
22:16:25.614 INFO: HMMs passed 2nd prefilter (gapped profile-profile alignment) : 292
22:16:25.614 INFO: HMMs passed 2nd prefilter and not found in previous iterations : 292
22:16:25.614 INFO: Scoring 292 HMMs using HMM-HMM Viterbi alignment
22:16:26.110 INFO: Alternative alignment: 0
22:16:31.556 INFO: 292 alignments done
22:16:31.559 INFO: Alternative alignment: 1
22:16:31.623 INFO: 287 alignments done
22:16:31.624 INFO: Alternative alignment: 2
22:16:31.648 INFO: 20 alignments done
22:16:31.648 INFO: Alternative alignment: 3
22:16:31.679 INFO: 3 alignments done
22:16:31.984 INFO: Realigning 210 HMM-HMM alignments using Maximum Accuracy algorithm
22:16:33.013 INFO: 77 sequences belonging to 77 database HMMs found with an E-value < 0.001
22:16:33.013 INFO: Number of effective sequences of resulting query HMM: Neff = 5.92897
22:16:33.040 INFO: Iteration 2
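To give a baseline for how long each stage takes when the run works, the timestamps in the login-node log can be turned into per-stage durations. This is a minimal sketch, assuming each log line has the form `HH:MM:SS.mmm INFO: message` (optionally preceded by "- ") and that the log does not cross midnight; the function name `stage_durations` is hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as "22:15:22.808 INFO: Prefiltering database",
# with an optional leading "- " as seen in the compute-node output.
LINE_RE = re.compile(r"^-?\s*(\d{2}:\d{2}:\d{2}\.\d{3}) INFO: (.*)$")


def stage_durations(log_text):
    """Return (message, seconds until the next log line) pairs."""
    entries = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            entries.append((stamp, match.group(2)))
    # Pair each entry with the one that follows it; the last line has
    # no successor, so it gets no duration.
    return [
        (msg, (next_stamp - stamp).total_seconds())
        for (stamp, msg), (next_stamp, _) in zip(entries, entries[1:])
    ]


sample = """22:15:22.808 INFO: Prefiltering database
22:16:19.797 INFO: HMMs passed 1st prefilter (gapless profile-profile alignment) : 693794
22:16:25.614 INFO: HMMs passed 2nd prefilter (gapped profile-profile alignment) : 292"""

for message, seconds in stage_durations(sample):
    print(f"{seconds:8.3f} s  {message}")
```

On the three lines used here, the prefiltering stage takes about 57 seconds on the login node, which gives a rough reference for how long the compute-node job should be allowed to sit at the same message before it is considered hung.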
Environment Description:
_libgcc_mutex 0.1 main defaults
_openmp_mutex 5.1 1_gnu defaults
absl-py 0.13.0 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
biopython 1.79 pypi_0 pypi
ca-certificates 2023.12.12 h06a4308_0 defaults
cachetools 5.3.2 pypi_0 pypi
certifi 2023.11.17 pypi_0 pypi
charset-normalizer 3.3.2 pypi_0 pypi
chex 0.0.7 pypi_0 pypi
click 8.1.7 pypi_0 pypi
contextlib2 21.6.0 pypi_0 pypi
cudatoolkit 11.3.1 h9edb442_10 conda-forge
cudatoolkit-dev 11.3.1 py38h497a2fe_0 conda-forge
cudnn 8.2.1.32 h86fa8c9_0 conda-forge
dm-haiku 0.0.4 pypi_0 pypi
dm-tree 0.1.6 pypi_0 pypi
fftw 3.3.10 nompi_h77c792f_102 conda-forge
flatbuffers 1.12 pypi_0 pypi
gast 0.4.0 pypi_0 pypi
google-auth 2.26.1 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
grpcio 1.34.1 pypi_0 pypi
h5py 3.1.0 pypi_0 pypi
hhsuite 3.3.0 py38pl5321h8ded8fe_5 bioconda
hmmer 3.3.2 h87f3376_2 bioconda
idna 3.6 pypi_0 pypi
immutabledict 2.0.0 pypi_0 pypi
importlib-metadata 7.0.1 pypi_0 pypi
jax 0.2.14 pypi_0 pypi
jaxlib 0.1.69+cuda111 pypi_0 pypi
kalign2 2.04 hec16e2b_3 bioconda
keras-nightly 2.5.0.dev2021032900 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
libblas 3.9.0 15_linux64_openblas conda-forge
libcblas 3.9.0 15_linux64_openblas conda-forge
libedit 3.1.20230828 h5eee18b_0 defaults
libffi 3.2.1 hf484d3e_1007 defaults
libgcc-ng 11.2.0 h1234567_1 defaults
libgfortran-ng 13.2.0 h69a702a_0 conda-forge
libgfortran5 13.2.0 ha4646dd_0 conda-forge
libgomp 11.2.0 h1234567_1 defaults
liblapack 3.9.0 15_linux64_openblas conda-forge
libnsl 2.0.0 h5eee18b_0 defaults
libopenblas 0.3.20 pthreads_h78a6416_0 conda-forge
libstdcxx-ng 11.2.0 h1234567_1 defaults
markdown 3.5.1 pypi_0 pypi
markupsafe 2.1.3 pypi_0 pypi
ml-collections 0.1.0 pypi_0 pypi
ncurses 6.4 h6a678d5_0 defaults
numpy 1.19.5 pypi_0 pypi
oauthlib 3.2.2 pypi_0 pypi
ocl-icd 2.3.1 h7f98852_0 conda-forge
ocl-icd-system 1.0.0 1 conda-forge
openmm 7.5.1 py38ha082873_1 conda-forge
openssl 1.1.1w h7f8727e_0 defaults
opt-einsum 3.3.0 pypi_0 pypi
pandas 1.3.4 pypi_0 pypi
pdbfixer 1.7 pyhd3deb0d_0 conda-forge
perl 5.32.1 0_h5eee18b_perl5 defaults
pillow 10.2.0 pypi_0 pypi
pip 23.3.2 pypi_0 pypi
protobuf 3.20.3 pypi_0 pypi
pyasn1 0.5.1 pypi_0 pypi
pyasn1-modules 0.3.0 pypi_0 pypi
python 3.8.0 h0371630_2 defaults
python-dateutil 2.8.2 pypi_0 pypi
python_abi 3.8 2_cp38 conda-forge
pytz 2023.3.post1 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
readline 7.0 h7b6447c_5 defaults
requests 2.31.0 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
rsa 4.9 pypi_0 pypi
scipy 1.7.0 pypi_0 pypi
setuptools 68.2.2 py38h06a4308_0 defaults
six 1.15.0 pypi_0 pypi
sqlite 3.33.0 h62c20be_0 defaults
svgwrite 1.4.3 pypi_0 pypi
tabulate 0.9.0 pypi_0 pypi
tensorboard 2.11.2 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.1 pypi_0 pypi
tensorflow 2.5.0 pypi_0 pypi
tensorflow-cpu 2.5.0 pypi_0 pypi
tensorflow-estimator 2.5.0 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
tk 8.6.12 h1ccaba5_0 defaults
toolz 0.12.0 pypi_0 pypi
tree 0.2.4 pypi_0 pypi
typing-extensions 3.7.4.3 pypi_0 pypi
urllib3 2.1.0 pypi_0 pypi
werkzeug 3.0.1 pypi_0 pypi
wheel 0.41.2 py38h06a4308_0 defaults
wrapt 1.12.1 pypi_0 pypi
xz 5.4.5 h5eee18b_0 defaults
zipp 3.17.0 pypi_0 pypi
zlib 1.2.13 h5eee18b_0 defaults
AlphaFold Version: 2.3.2
Operating System and Version: CentOS Linux release 7.6.1810 (Core)
Thank you very much in advance.