Colin Taylor
@chongxiaoc is this resolved for you?
IMO we should call it "apply_optimizer_in_backward". Fused/non-fused is an implementation detail, and whether it's done in torch.autograd or requires comms (e.g. PT-D) can also be flexible
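For anyone skimming the naming discussion, here's a toy, framework-free sketch of what "apply optimizer in backward" means conceptually: each parameter is updated as soon as its gradient is produced, instead of buffering all gradients and running a separate optimizer step afterwards. All names here are hypothetical illustrations, not the torchrec or PyTorch API.

```python
# Toy sketch of "apply optimizer in backward" (hypothetical names, not a real API).
# During backprop, gradients arrive roughly in reverse parameter order; applying
# the update immediately lets the gradient buffer be freed right away.

class Param:
    def __init__(self, value):
        self.value = value

def sgd_update(param, grad, lr):
    # plain SGD step for one parameter
    param.value -= lr * grad

def backward_with_fused_optimizer(params, grads, lr=0.25):
    # simulate gradients arriving in reverse order and updating eagerly
    for param, grad in zip(reversed(params), reversed(grads)):
        sgd_update(param, grad, lr)

params = [Param(1.0), Param(2.0)]
backward_with_fused_optimizer(params, grads=[4.0, 2.0])
print([p.value for p in params])  # [0.0, 1.5]
```

Whether the eager update happens via autograd hooks or inside a fused kernel is exactly the implementation detail the name deliberately leaves open.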
@wangkuiyi sorry for the delay :) I think the snippet is a bit confusing, but the core API as landed is shard_embedding_modules https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/shard_embedding_modules.py#L24 This will replace (module swap) the embedding...
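To make the "module swap" behavior concrete, here's a torch-free toy of the idea: walk a model's children and replace each embedding module with a sharded wrapper in place. This is only an illustration of the concept; the class names, sharding scheme, and signature below are made up and do not match the real shard_embedding_modules implementation.

```python
# Toy illustration of a "module swap" (hypothetical, not the torchrec API):
# replace each embedding attribute on a model with a sharded wrapper.

class Embedding:
    """Stand-in for an embedding module."""
    def __init__(self, num_embeddings):
        self.num_embeddings = num_embeddings

class ShardedEmbedding:
    """Hypothetical sharded wrapper; each rank owns a slice of the rows."""
    def __init__(self, original, world_size):
        self.original = original
        self.rows_per_rank = original.num_embeddings // world_size

class Model:
    def __init__(self):
        self.embedding = Embedding(num_embeddings=8)

def shard_embedding_modules(model, world_size):
    # swap every Embedding attribute for a ShardedEmbedding, in place
    for name, child in list(vars(model).items()):
        if isinstance(child, Embedding):
            setattr(model, name, ShardedEmbedding(child, world_size))
    return model

model = shard_embedding_modules(Model(), world_size=4)
print(type(model.embedding).__name__)  # ShardedEmbedding
print(model.embedding.rows_per_rank)   # 2
```

After the swap, callers keep using `model.embedding` as before, which is the point of doing a replacement rather than asking users to rewrite their model code.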
thanks @Luo-Liang -> I think this isn't relevant anymore, sorry for missing the PR
@davidxiaozhi I'm not so familiar with horovod, but my understanding is that it does not use the pytorch distributed (https://pytorch.org/docs/stable/distributed.html) library and does the collective / p2p comms itself. torchrec is...
closing due to lack of engagement, @davidxiaozhi feel free to reopen or follow up about horovod integration if you are still interested
This has landed in master and will go out in the next stable release
@henrylhtsang yes, that is where DDP modules are set up (using actual DDP) to make these data_parallel tables call all_reduce to get the correct gradients. Why do you call this...
@pytorchmergebot -g
@pytorchbot --help