mabingqi
The implementation of stochastic depth in the NFNet code seems to be batch-wise dropout (a single keep/drop decision shared by the whole batch) rather than the per-sample, block-level dropout described in the paper.
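For reference, a minimal sketch of the difference between the two behaviors (the class name and the `mode` switch are illustrative, not taken from the NFNet code): block-level stochastic depth draws an independent Bernoulli mask per sample, while batch-wise dropout draws one mask for the entire batch.

```python
import torch
from torch import nn

class StochasticDepth(nn.Module):
    """Randomly drops the residual branch during training.

    mode="row": one keep/drop draw per sample (block-level, as in the paper).
    mode="batch": a single draw shared by the whole batch.
    """
    def __init__(self, drop_prob: float, mode: str = "row"):
        super().__init__()
        self.drop_prob = drop_prob
        self.mode = mode

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_prob == 0.0:
            return residual
        keep_prob = 1.0 - self.drop_prob
        if self.mode == "row":
            # Independent decision per sample: shape (N, 1, 1, 1) for NCHW input.
            shape = (residual.shape[0],) + (1,) * (residual.dim() - 1)
        else:
            # One decision for the entire batch.
            shape = (1,) * residual.dim()
        mask = torch.empty(shape, device=residual.device).bernoulli_(keep_prob)
        # Inverted scaling keeps the expected activation magnitude unchanged.
        return residual * mask / keep_prob
```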
DDP in PyTorch cannot distinguish expert parameters from shared parameters, so expert parameters may be updated with the all-reduced shared gradient. The TutelDistributedOptimizer seems to be an implementation of ZeRO, which...
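A minimal sketch of one common workaround, assuming expert parameters can be recognized by name (the `"experts"` substring is a hypothetical convention; real MoE stacks often tag parameters with an attribute instead): skip DDP entirely and, after `loss.backward()`, all-reduce only the shared gradients so expert gradients stay rank-local.

```python
import torch
import torch.distributed as dist

def allreduce_shared_grads_only(model: torch.nn.Module) -> None:
    """Average gradients across ranks, skipping expert parameters.

    Assumes experts are identifiable by the (hypothetical) substring
    "experts" in the parameter name. Call this after loss.backward()
    instead of wrapping the model in DDP.
    """
    world_size = dist.get_world_size()
    for name, param in model.named_parameters():
        if param.grad is None or "experts" in name:
            continue  # expert gradients are per-rank: do not all-reduce
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad.div_(world_size)  # average the shared gradient
```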
### System Info

- `transformers` version: 4.41.0.dev0
- Platform: Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.17
- Python version: 3.9.12
- Huggingface_hub version: 0.21.4
- Safetensors version: 0.4.2
- Accelerate version: 0.21.0
- Accelerate config: not...