Yawei Li

Results: 15 comments by Yawei Li

> @sandylaker Thank you for opening this issue. I will have a look ASAP and try to reproduce on my side.
>
> Moreover, could you explain a bit more...

> @sandylaker could you please test this code with the nightly version: `pip install --pre pytorch-ignite`?
> I think it should raise this runtime error:
>
> https://github.com/pytorch/ignite/blob/d16d15efbbbfc476702e91f3ab2bc757b839be26/ignite/distributed/comp_models/native.py#L218-L222
>
> ...
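For context, here is a minimal sketch (not from the thread) of what launching through `ignite.distributed.Parallel` looks like; it follows the public pytorch-ignite API, and the `training` function and empty config are placeholders:

```python
# Hedged sketch: launching with ignite.distributed.Parallel after installing
# the nightly build (`pip install --pre pytorch-ignite`). The training
# function below is a placeholder, not code from the thread.
import ignite.distributed as idist

def training(local_rank, config):
    # real training logic would go here
    print(f"global rank {idist.get_rank()}, local rank {local_rank}")

if __name__ == "__main__":
    # With backend="nccl", Parallel sets up the process group; under SLURM it
    # derives ranks from the SLURM_* variables, which is what the
    # configuration check linked above guards against conflicting with.
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training, {})
```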

@sdesrozis Yeah, it might be caused by the SLURM settings. I did not run another script with `ignite.distributed.Parallel`. But I have run multiple scripts with `torch.distributed.launch`, and they worked quite...
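For reference, a minimal sketch of the kind of `script.py` entry point that `torch.distributed.launch` expects; this is an assumed baseline, not the thread's actual script:

```python
# Hedged sketch of a launch-compatible entry point; model/training code is
# omitted. torch.distributed.launch passes --local_rank to each process and
# exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE for init_method="env://".
import argparse
import torch
import torch.distributed as dist

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)  # injected by the launcher
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)  # one GPU per process
    dist.init_process_group(backend="nccl", init_method="env://")
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```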

@sdesrozis Thank you for your explanation. I removed the `if` block and ran with `srun python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 script.py`, but got the following error:

```python3
RuntimeError: NCCL error in:...
```
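One standard way to get more detail out of such NCCL failures (a general NCCL facility, not something suggested in the thread) is to raise NCCL's log level before initializing the process group:

```python
# Hedged debugging aid: NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL
# environment variables; setting them before init_process_group makes NCCL
# print which transport/initialization step the error occurred in.
import os
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")
```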

> > @sdesrozis Thank you for your explanation. I removed the `if` block and ran with `srun python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 script.py`, but got the following error:
> >
> > ```python...

> @sandylaker which CUDA version?
>
> I cannot reproduce the issue with NCCL with my setup using 1.8.1 and CUDA 11.1 and 2 GPUs.

CUDA 11.0
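A quick way to compare setups like this is to print the relevant versions with standard PyTorch introspection calls (a generic sketch, not from the thread):

```python
# Hedged sketch using standard PyTorch introspection APIs: prints the torch
# build, the CUDA toolkit it was compiled against, and the bundled NCCL.
import torch

print("torch:", torch.__version__)
print("cuda (build):", torch.version.cuda)
print("nccl:", torch.cuda.nccl.version())
```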

@sdesrozis So it is like this: (a) `srun -N1 -n4 --ntasks-per-node=4`: the job gets stuck in the queue; (b) `srun -N4 -n4 --ntasks-per-node=1`: works for multi-node. But as I said, I have...
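Under `srun`, an alternative to wrapping `torch.distributed.launch` is to derive ranks directly from SLURM's own environment variables; a hedged sketch, with `MASTER_ADDR`/`MASTER_PORT` as placeholders:

```python
# Hedged sketch: srun starts one task per process and exports SLURM_PROCID,
# SLURM_NTASKS, and SLURM_LOCALID, which map onto the global rank, world
# size, and per-node local rank respectively.
import os
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank
world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder: first node's hostname
os.environ.setdefault("MASTER_PORT", "29500")      # placeholder: any free port

torch.cuda.set_device(local_rank)
dist.init_process_group("nccl", rank=rank, world_size=world_size)
```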

@sdesrozis Thank you very much for the detailed experiments. I suppose SLURM might be misconfigured on my server.

> Hi @Dorniwang, if Eq. 2 is wrong, are all the following formulations wrong too? Or is there any influence on the following formulations?

The sum of C_ij w.r.t. i...