Task with IO stuck when using 2 nodes

Open mpellejero opened this issue 8 years ago • 7 comments

This is happening on Marenostrum. We think the reason is that the temporary files disagree when the job runs on two different nodes. The temporary directory is local to each compute node.

mpellejero avatar Oct 04 '17 00:10 mpellejero

Hi, can you post a code snippet or a few more specifics about the code that is leading to the error? What version are you using? If it's 0.2.7, it could be related to a bug (#419), which we will be fixing shortly in a new release (0.2.8).

nickhand avatar Oct 05 '17 12:10 nickhand

It is 0.2.7. Most of the tests were fine when we were using only 1 node, so I suspect it is because the temp storage is local. I don't think #419 depends on the number of nodes? I guess we can try again with 0.2.8 after it is tagged.

@mpellejero will you be able to deploy and test this (or to urge Albert to do it)?

BTW the job system on Marenostrum works with the stock conda mpich, so apparently we don't even need to recompile specially for the machine; the packages from the bccp channel worked directly. We didn't run benchmarks to see whether it falls back to TCP/IP, though.

rainwoodman avatar Oct 06 '17 06:10 rainwoodman

Does nbodykit use internal temp storage anywhere? Or is it being created in the user script?

nickhand avatar Oct 06 '17 10:10 nickhand

But #419 would not show up in serial. It was due to uneven allreduce(size) calls when some ranks had non-unity boolean selections while others did not.

nickhand avatar Oct 06 '17 11:10 nickhand

Hi, we installed nbodykit 0.2.8 and tried to run the tests under MPI on 4 nodes using:

python run-tests.py --mpirun="srun -n 4"

but it again got stuck at:

../testenv/lib/python3.6/site-packages/nbodykit/base/tests/test_catalogmesh.py .

mpellejero avatar Oct 06 '17 21:10 mpellejero

I suspect the issue still persists. Could you try to set $TMPDIR to the scratch folder?
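A minimal Python sketch of that idea, assuming the temporary files are created through Python's standard tempfile module (not confirmed for the nbodykit tests). The helper name `use_shared_tmpdir` is hypothetical and not part of nbodykit. It would have to run on every rank before any temporary file is created:

```python
import os
import tempfile


def use_shared_tmpdir(path):
    """Point Python's tempfile module at `path`.

    `path` is assumed to be a directory on a filesystem that every
    node can see (for example a scratch folder); this is a
    hypothetical helper, not nbodykit API.
    """
    os.makedirs(path, exist_ok=True)
    # tempfile consults TMPDIR first when choosing a default directory.
    os.environ["TMPDIR"] = path
    # tempfile caches its choice, so clear the cache to force a re-read.
    tempfile.tempdir = None
    return tempfile.gettempdir()
```

Setting TMPDIR in the job script before launching the tests should have the same effect, since child processes normally inherit the environment.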

rainwoodman avatar Nov 01 '17 00:11 rainwoodman

As soon as the cluster is back from maintenance I'll try.

mpellejero avatar Nov 01 '17 00:11 mpellejero