ZHANG Bowen
Sorry for the confusion, we should've added a short help message to this argument. DA refers to Distribution Alignment, which is a technique proposed in the ReMixMatch paper. > To...
Hi, did this happen before the training even started or during the training? If it's the former, make sure the distributed arguments are correctly set (i.e. `world-size`, `rank`, `dist-url`), in...
@Ajaypatel1234 According to your description, did you set rank=1 on both machines (nodes)? In your case, you should set rank=0 on the first node and rank=1 on the second.
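To illustrate the rank assignment above, here is a minimal sketch in plain Python. It assumes the common convention where each node is given a unique node rank starting at 0 and each GPU process derives its global rank as `node_rank * gpus_per_node + local_gpu_index`. The helper names `global_ranks` and `node_ranks_are_valid` are hypothetical and not part of any library; whether a given training script uses exactly this formula is an assumption.

```python
from typing import List


def global_ranks(node_rank: int, gpus_per_node: int) -> List[int]:
    """Return the global ranks of the processes on one node.

    Assumed convention: global rank = node_rank * gpus_per_node + local index.
    """
    if node_rank < 0 or gpus_per_node < 1:
        raise ValueError("node_rank must be >= 0 and gpus_per_node >= 1")
    return [node_rank * gpus_per_node + local for local in range(gpus_per_node)]


def node_ranks_are_valid(node_ranks: List[int]) -> bool:
    """Check that the node ranks are exactly 0..N-1 with no duplicates."""
    return sorted(node_ranks) == list(range(len(node_ranks)))


# Two nodes, both launched with rank=1: invalid, no node owns rank 0.
print(node_ranks_are_valid([1, 1]))
# Two nodes launched with rank=0 and rank=1: valid.
print(node_ranks_are_valid([0, 1]))
# With 4 GPUs per node (hypothetical), node 1 owns these global ranks.
print(global_ranks(1, 4))
```

Under this convention, setting rank=1 on both nodes leaves nobody holding global rank 0, so the process group never finishes forming.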
I doubt whether they'll open-source it at all 👎
Nine months have passed and I still see this error.
I know, no, obviously.
Unfortunately, I don't think Slurm is installed on our cluster, nor do I have root privileges to configure it. Are there any other startup methods e.g. using torchrun...
Thank you for the reply. It's very nice of you! I'll try again tomorrow. I thought there should be +override.*** when the argument already exists in the yaml, and without...
Clear to me now. Thanks again for the clarification.👍 Will try out distributed training again tomorrow; hopefully it will work. On Wed, Feb 16, 2022, 00:56 chevalierNoir ***@***.***> wrote: >...
Really frustrating. I've been working on this for a whole day and I just couldn't get it right. :-< Here is what I did (I wrote the port number 12356...