Some problems with training time
Thanks for your brilliant work, and your code is really easy to read.

I'm quite interested in your work and was planning to try it. I noticed you said the EBMs can be trained on a single Tesla V100 GPU in 3 days on CLEVR. I have an A40, which should be sufficient, but when I ran the code, each iteration took about 10 seconds, so the estimated cost for one epoch is around 12 hours. I think something might be wrong.

Is Slurm a must? I did not use Slurm. Or will training get faster after the first epoch? I noticed there are some buffers.

I could really use some help, thank you 💗
Maybe check whether the program is actually running on the GPU you were assigned.
If multiple programs are competing for the same resources at the same time, this kind of slowdown can happen.
But it shouldn't happen if you have the GPU to yourself, assuming you are using the same setup.
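One quick, stdlib-only sanity check (a sketch, not from the repo) is to look at `CUDA_VISIBLE_DEVICES`, which Slurm sets to pin a job to its assigned GPUs. If it's unset when you launch manually, the process can see every GPU on the node and may end up contending with other jobs. The helper name here is my own, for illustration:

```python
import os


def describe_gpu_visibility(env=None):
    """Report which GPUs this process is restricted to.

    Slurm (and manual `CUDA_VISIBLE_DEVICES=0 python train.py ...` launches)
    use this variable to limit a process to specific GPUs. When it is unset,
    the process sees all GPUs on the node.
    """
    if env is None:
        env = os.environ
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return "all GPUs visible (no restriction set)"
    return f"restricted to GPU(s): {visible}"


if __name__ == "__main__":
    print(describe_gpu_visibility())
```

You can also just run `nvidia-smi` while training to confirm your process is on the expected GPU and that its utilization is high; near-zero GPU utilization with 10 s/iteration usually points to the work landing on the CPU or a contended device.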