
How to run on a single Linux server with multiple GPUs

Open 1999kevin opened this issue 3 years ago • 12 comments

Nice job! I wonder how I can run the code on a single Linux server with multiple GPUs. I can run the code on a server with one GPU by not using mpiexec. But what if I want to use multiple GPUs, as with nn.DataParallel?

1999kevin avatar Apr 20 '23 13:04 1999kevin

@1999kevin Can you tell me how to generate images with a pretrained model on a single GPU, without the NCCL communication backend? Thank you.

stonecropa avatar Apr 21 '23 06:04 stonecropa

> @1999kevin Can you tell me how to generate images with a pretrained model on a single GPU, without the NCCL communication backend? Thank you.

Just delete the mpiexec part of the sampling command.

1999kevin avatar Apr 21 '23 06:04 1999kevin

@1999kevin But I can't find mpiexec in image_sample.py. Thanks.

stonecropa avatar Apr 21 '23 08:04 stonecropa

Could I have a look at the code after your changes? I would appreciate it if you could send it over. Thanks.

stonecropa avatar Apr 21 '23 08:04 stonecropa

> Could I have a look at the code after your changes? I would appreciate it if you could send it over. Thanks.

I'm still working on the training phase and am not sure about the inference phase yet. I guess you can follow Line 48 and Line 51 in scripts/launch.sh to sample images. If you only want a single process, just run the command directly: python image_sample.py ...

1999kevin avatar Apr 21 '23 15:04 1999kevin

I added CUDA_VISIBLE_DEVICES=6,7 in front of the inference command, giving CUDA_VISIBLE_DEVICES=6,7 mpiexec -n 2 python ./scripts/image_sample.py ..., and changed the code at ./cm/dist_util.py#L27 to:

    # If no GPU restriction is set, give each MPI rank its own physical GPU.
    if 'CUDA_VISIBLE_DEVICES' not in os.environ:
        os.environ["CUDA_VISIBLE_DEVICES"] = f"{MPI.COMM_WORLD.Get_rank() % GPUS_PER_NODE}"
    else:
        # Otherwise, map each rank onto one of the GPUs the user listed,
        # so rank 0 takes the first listed GPU, rank 1 the second, etc.
        gpu_inds_list = os.environ["CUDA_VISIBLE_DEVICES"].split(',')
        idx = MPI.COMM_WORLD.Get_rank() % GPUS_PER_NODE
        os.environ["CUDA_VISIBLE_DEVICES"] = gpu_inds_list[idx]
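The mapping above can be checked without MPI by treating the rank as a plain argument. This is just a sketch: `remap_visible_devices` is a hypothetical helper, and `GPUS_PER_NODE = 8` is an assumption standing in for the constant in cm/dist_util.py.

```python
GPUS_PER_NODE = 8  # assumption: mirrors the constant in cm/dist_util.py

def remap_visible_devices(env: dict, rank: int) -> str:
    """Return the single GPU id a given MPI rank should see (illustrative helper)."""
    if "CUDA_VISIBLE_DEVICES" not in env:
        # No restriction set: rank i simply takes physical GPU i.
        return str(rank % GPUS_PER_NODE)
    # A restriction like "6,7" is set: rank i takes the i-th listed GPU.
    gpu_inds_list = env["CUDA_VISIBLE_DEVICES"].split(",")
    return gpu_inds_list[rank % GPUS_PER_NODE]

# With CUDA_VISIBLE_DEVICES=6,7 and mpiexec -n 2:
print(remap_visible_devices({"CUDA_VISIBLE_DEVICES": "6,7"}, 0))  # 6
print(remap_visible_devices({"CUDA_VISIBLE_DEVICES": "6,7"}, 1))  # 7
# Without the env var, rank 3 would just take GPU 3:
print(remap_visible_devices({}, 3))  # 3
```

Note that each process ends up seeing exactly one GPU, so inside the process that GPU is always device 0.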

Does it work?

tyshiwo1 avatar Apr 22 '23 14:04 tyshiwo1

> Does it work?

I will test it once I finish current training.

1999kevin avatar Apr 23 '23 02:04 1999kevin

> Does it work?

> I will test it once I finish current training.

Btw, I found that training with only batch size 4 and image size 64 costs about 18G of memory per GPU. Is there something wrong with that?

tyshiwo1 avatar Apr 23 '23 04:04 tyshiwo1

> Btw, I found training with only 4 batchsize and 64 image size costs about 18G memory per GPU. Is there something wrong with it?

I also encountered similar problems in my tests. I trained the model with batch size 2 and image size 256, which cost me 35G of memory.

1999kevin avatar Apr 23 '23 07:04 1999kevin

> Btw, I found training with only 4 batchsize and 64 image size costs about 18G memory per GPU. Is there something wrong with it?

> I also encounter simialr problems in my test. I train the model with batchsize 2 and 256 image size, costing me 35G memory.

Will the pretrained model also use such a large amount of GPU memory?

stonecropa avatar Apr 23 '23 07:04 stonecropa

> Will the pre-training model also use such a large amount of Gpu memory?

I have not tested that case yet.

1999kevin avatar Apr 24 '23 02:04 1999kevin

> I add CUDA_VISIBLE_DEVICES=6,7 in front of the inference command to form CUDA_VISIBLE_DEVICES=6,7 mpiexec -n 2 python ./scripts/image_sample.py ..., and change the code of ./cm/dist_util.py#L27

This change does enable multi-GPU training. However, it may cause the error 'Expected q.stride(-1) == 1 to be true, but got false', as in issue #3. Changing the flash attention to the default attention resolves the error.

1999kevin avatar Apr 24 '23 02:04 1999kevin