
Using `accelerate` in a SLURM environment

Open deven-gqc opened this issue 3 years ago • 2 comments

Hello,

A member of our team has access to a compute cluster, and we would like to use accelerate in that environment for distributed training across multiple GPUs.

However, the documentation on the accelerate config page is a little confusing to me.

Here are a few questions:

  1. If I have multi-GPU selected as yes, does that mean `--num_processes` == the number of GPUs?
  2. `--num_cpu_threads_per_process` -- "The number of CPU threads per process. Can be tuned for optimal performance." -- can someone explain this a little further? Does it mean a single .py file would be run on multiple threads in parallel?

Let me know if I should move this to the forums or the Discord channel. TIA!

deven-gqc avatar Sep 29 '22 20:09 deven-gqc

Hello @deven-gqc,

The forum would be the correct place for the above queries. As for your questions:

  1. With multi-GPU, `--num_processes` is usually equal to the number of GPUs per machine * the number of machines.
  2. `--num_cpu_threads_per_process` is mainly for the multi-CPU scenario: it parallelizes CPU-intensive tasks, e.g., data tokenization, across CPU cores. It does not mean your script itself is run multiple times in parallel.
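To make answer 1 concrete, here is a rough sketch of what a SLURM submission script for `accelerate launch` might look like. The node/GPU counts, port, and `train.py` script name are all placeholders; adjust them to your cluster and check them against the `accelerate launch` docs before relying on this.

```shell
#!/bin/bash
#SBATCH --job-name=accelerate-train   # hypothetical job name
#SBATCH --nodes=2                     # 2 machines
#SBATCH --ntasks-per-node=1           # one launcher process per node
#SBATCH --gres=gpu:4                  # 4 GPUs per node (adjust to your cluster)
#SBATCH --cpus-per-task=16

# As noted above: num_processes = GPUs per node * number of machines
NUM_PROCESSES=$(( 4 * SLURM_NNODES ))

# Use the first node in the allocation as the rendezvous host
MAIN_HOST=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun accelerate launch \
    --multi_gpu \
    --num_machines "$SLURM_NNODES" \
    --num_processes "$NUM_PROCESSES" \
    --machine_rank "$SLURM_NODEID" \
    --main_process_ip "$MAIN_HOST" \
    --main_process_port 29500 \
    train.py   # hypothetical training script
```

With `--nodes=2` and 4 GPUs per node, this launches 8 processes in total, one per GPU, which matches answer 1.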

Sylvain and Zach would be able to add more context.

pacman100 avatar Sep 30 '22 05:09 pacman100

@pacman100 I've created a post on the forums over here cc @sengv1

deven-gqc avatar Oct 05 '22 19:10 deven-gqc

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Oct 30 '22 15:10 github-actions[bot]