Karan Jariwala
- Data was not sharded across GPUs when running Horovod distributed training. This PR fixes that issue (a sketch of per-rank sharding is shown below).
- Fixed the issue with the `validation` condition where it was doing...
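The sketch below illustrates the kind of per-rank sharding the first bullet refers to, assuming a Horovod + PyTorch setup with a `DistributedSampler`; the dataset and hyperparameters are hypothetical, and the actual PR may implement the sharding differently.

```python
import horovod.torch as hvd
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

hvd.init()  # one process per GPU

# Hypothetical stand-in for the real training data.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

# Shard the data so each Horovod rank sees a disjoint slice:
# num_replicas = total number of workers, rank = this worker's index.
sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for features, labels in loader:
        pass  # forward/backward pass would go here
```

Without the sampler, every rank iterates over the full dataset, so adding GPUs does not reduce the work per worker.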
Hi, I am running the `lstm_benchmark.py` test on CPU and on multi-GPU devices (Amazon EC2), and I am not getting the expected scaling. Below is the relevant information: **Instance:** P3.8xLarge (Amazon...
## Description of changes:
- Moved local directory creation and existence check from CloudUploader to the Writer class

## Issue #, if available:

## Merge Checklist:
_Put an `x` without space...
**Environment**
- OS: Ubuntu 20.04
- Hardware (GPU, or instance type): A100, >= 2 GPUs

**To reproduce**
Steps to reproduce the behavior:
When trying to run [examples/bert mlm training](https://github.com/mosaicml/examples/tree/main/examples/bert#mlm-pre-training) (using...