ngoyal2707
@PrashanthVenkatesan First of all, I'm glad the solution helped you, and thanks for the comment. The code is very old (3-4 years) and I haven't had a chance to visit that...
Is it okay with you if we merge this directly to v3 and then push a small PR to metaseq to make it work with v3?
On the NCCL logs: instead of turning debug off, let's write them to a separate file using NCCL_DEBUG_FILE: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-file
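A minimal sketch of what that could look like, assuming the variables are set in the launcher process before any NCCL communicator is created. The helper name `configure_nccl_logging` and the `nccl_logs` directory are hypothetical; `NCCL_DEBUG`, `NCCL_DEBUG_FILE`, and the `%h` (hostname) and `%p` (PID) placeholders come from the NCCL environment variable docs linked above.

```python
import os


def configure_nccl_logging(log_dir, level="INFO"):
    """Keep NCCL debug output on, but route it to per-process files.

    Hypothetical helper: call it before the distributed backend is
    initialized so NCCL sees the variables when it starts up.
    """
    os.makedirs(log_dir, exist_ok=True)
    os.environ["NCCL_DEBUG"] = level
    # NCCL expands %h to the hostname and %p to the process PID,
    # so every rank writes to its own file instead of stdout.
    os.environ["NCCL_DEBUG_FILE"] = os.path.join(log_dir, "nccl.%h.%p.log")
    return os.environ["NCCL_DEBUG_FILE"]


if __name__ == "__main__":
    print(configure_nccl_logging("nccl_logs"))
```

This keeps the training logs readable while still leaving the NCCL output available for debugging hangs.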
Let's not merge this. I tried it locally, and I think the current way of creating the skip iterator makes has_next() always return False, which makes the code think every update is end_of_epoch....
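To illustrate the kind of failure described above, here is a toy sketch. It is not the code from the PR: `CountingIterator`, `skip_buggy`, and `skip_fixed` are hypothetical names, and the specific mistake shown (mixing an absolute position with a relative total) is only one assumed way has_next() could end up always False.

```python
class CountingIterator:
    """Hypothetical iterator that tracks an absolute position `n`."""

    def __init__(self, iterable, start=0, total=0):
        self._it = iter(iterable)
        self.n = start        # items consumed so far, counted from epoch start
        self.total = total    # position at which the epoch ends

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._it)
        self.n += 1
        return item

    def has_next(self):
        return self.n < self.total


def skip_buggy(data, offset):
    # Assumed mistake: `n` is absolute (starts at offset) but `total` is
    # the length of the remaining slice, so n >= total once offset is large.
    remaining = data[offset:]
    return CountingIterator(remaining, start=offset, total=len(remaining))


def skip_fixed(data, offset):
    # Keep both `n` and `total` in the same (absolute) coordinate system.
    remaining = data[offset:]
    return CountingIterator(remaining, start=offset, total=len(data))


if __name__ == "__main__":
    data = list(range(10))
    print(skip_buggy(data, 6).has_next())
    print(skip_fixed(data, 6).has_next())
```

A training loop that computes end_of_epoch as `not itr.has_next()` would treat every update from the buggy iterator as the last one in the epoch.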
How I identified the issue: I added a record_function around logging_stats, which runs after the train_step: https://github.com/facebookresearch/metaseq/blob/11bf89f3aa128acc44de359aa1de02c275e54f8f/metaseq_cli/train.py#L283-L295 The profile roughly looks something like:  The...
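A rough sketch of that labeling technique, assuming PyTorch's `torch.profiler.record_function` context manager. The functions `train_step`, `log_stats`, and `run_update` are hypothetical stand-ins, not the metaseq code, and the fallback to a no-op context manager is only there so the sketch runs without PyTorch installed.

```python
import contextlib

try:
    from torch.profiler import record_function
except ImportError:
    # Fallback so the sketch stays runnable without PyTorch.
    def record_function(name):
        return contextlib.nullcontext()


def train_step(batch):
    # Stand-in for the real forward/backward/optimizer step.
    return sum(batch)


def log_stats(loss):
    # Stand-in for the stats logging that runs after the train step.
    return {"loss": float(loss)}


def run_update(batch):
    # Each labeled region shows up as its own named span when this runs
    # under an active profiler, which makes slow logging easy to spot.
    with record_function("train_step"):
        loss = train_step(batch)
    with record_function("logging_stats"):
        stats = log_stats(loss)
    return stats


if __name__ == "__main__":
    print(run_update([1, 2, 3]))
```

Wrapping the logging step separately from the train step is what lets the profile show whether time is going to the update itself or to what runs after it.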