
Error in distributed training (DDP)

Open · 1zeryu opened this issue on Apr 21, 2024 · 0 comments

When I run:

torchrun --standalone --nproc_per_node=4 train.py --outdir=training-output \
    --data=datasets/ffhq-64x64.zip --cond=0 --arch=ddpmpp \
    --batch=256 --cres=1,2,2,2 --lr=2e-4 --dropout=0.05 --augment=0.15 \
    --precond=fdm_edm --warmup_ite=800 --fdm_multiplier=1

it fails with: RuntimeError: params[127] in this process with sizes [256, 256, 1, 1] appears not to match strides of the same param in process 0.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 91129 closing signal SIGTERM
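For context, this RuntimeError is raised when torch.nn.parallel.DistributedDataParallel detects a parameter whose memory layout (strides) differs between ranks even though the shapes match. Below is a minimal sketch of one common workaround, not a confirmed fix for this repo: force every parameter into a contiguous layout on all ranks before the DDP wrap. The name net is a hypothetical stand-in for the model built in train.py.

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical sketch: unify parameter strides across ranks so that
# DDP's cross-process layout verification passes. `net` stands in for
# the model constructed in train.py before it is wrapped in DDP.
for param in net.parameters():
    param.data = param.data.contiguous()

ddp_net = DDP(net, device_ids=[torch.cuda.current_device()])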
