nbodykit Task with IO stuck when using 2 nodes

This is happening on MareNostrum. We think the reason is that the temporary files disagree when the job runs on two different nodes: the temporary directory is local to each compute node.

Oct 04 '17 00:10 mpellejero

Hi, can you post a code snippet or a few more specifics about the code that leads to the error? What version are you using? If it's 0.2.7, it could be related to a bug (#419), which we will fix shortly and release in a new version (0.2.8).

Oct 05 '17 12:10 nickhand

It is 0.2.7. Most of the tests were fine when we were using only 1 node -- thus I suspected it is due to the fact that the temp storage is local -- I don't think #419 cares about the number of nodes? I guess we can try again with 0.2.8 after it is tagged.

@mpellejero will you be able to deploy and test this (or to urge Albert to do it)?

BTW the job system on MareNostrum works with the stock conda mpich -- so apparently we don't even need to recompile specially for the machine -- the packages from the bccp channel worked directly. We didn't run benchmarks to see whether it falls back to TCP/IP, though.

Oct 06 '17 06:10 rainwoodman

Does nbodykit use internal temp storage anywhere? Or is it being created in the user script?

Oct 06 '17 10:10 nickhand

But #419 would not show up in serial. It was due to uneven allreduce(size) calls when some ranks had non-unity boolean selections while others did not.
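
To illustrate why uneven collective calls hang a parallel job but never a serial one, here is a rough simulation. It is only a sketch: it uses threads and a `threading.Barrier` as a stand-in for an MPI collective, it is not nbodykit's code, and `run_ranks` and its arguments are hypothetical names made up for this example.

```python
import threading


def run_ranks(nranks, skips_collective, timeout=0.5):
    """Simulate `nranks` MPI ranks as threads.

    A Barrier stands in for a collective call such as allreduce: it only
    completes once every rank has entered it. Ranks listed in
    `skips_collective` take a different code path and never make the call.
    Returns one status string per rank.
    """
    barrier = threading.Barrier(nranks)
    status = [None] * nranks

    def rank_main(rank):
        if rank in skips_collective:
            # This rank decided (from its local data) that no collective is needed.
            status[rank] = "skipped"
            return
        try:
            barrier.wait(timeout=timeout)
            status[rank] = "completed"
        except threading.BrokenBarrierError:
            # Real MPI has no timeout here: the rank would wait forever.
            status[rank] = "stuck"

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(nranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status
```

Calling `run_ranks(4, set())` lets every simulated rank finish, while `run_ranks(4, {0})` leaves ranks 1 to 3 waiting for a participant that never arrives. With a single rank there is nobody to wait for, which is consistent with the bug not appearing in serial.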

Oct 06 '17 11:10 nickhand

Hi, we installed nbodykit 0.2.8 and tried to run the tests under MPI on 4 nodes using:

python run-tests.py --mpirun="srun -n 4"

but it got stuck again at:

../testenv/lib/python3.6/site-packages/nbodykit/base/tests/test_catalogmesh.py .

Oct 06 '17 21:10 mpellejero

I suspect the issue still persists. Could you try to set $TMPDIR to the scratch folder?
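
A minimal sketch of what redirecting the temporary directory looks like from Python, assuming the files in question are created through the standard `tempfile` module (whether that is true of the failing test is not confirmed here). The helper name `use_shared_tmpdir` is hypothetical, and the path passed to it must be a directory that every node can see.

```python
import os
import tempfile


def use_shared_tmpdir(path):
    """Point Python's tempfile module at `path`.

    `path` is assumed to live on a filesystem shared by all nodes.
    Returns the directory that tempfile will now use.
    """
    os.makedirs(path, exist_ok=True)
    os.environ["TMPDIR"] = path
    # tempfile caches its choice; clearing it forces TMPDIR to be read again.
    tempfile.tempdir = None
    return tempfile.gettempdir()
```

This would have to run on every rank, with the same path, before anything creates a temporary file. Exporting TMPDIR in the job script before launching the tests should have the same effect without touching any Python code.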

Nov 01 '17 00:11 rainwoodman

As soon as the cluster is back from maintenance I'll try.

Nov 01 '17 00:11 mpellejero