Jeff Rasley issues

Results: 18 issues of Jeff Rasley

Add explicit gradient_accumulation_dtype config

DeepSpeed now supports several dtypes (i.e., fp32, fp16, bf16). However, it's becoming less clear which parts of training use which dtypes, and when. For example, in...

enhancement
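
As a rough illustration of the gradient accumulation dtype request above, the sketch below shows how an explicit setting might look in a config dict. The key name `gradient_accumulation_dtype` is taken from the issue title only; it is an assumption here, not a confirmed DeepSpeed option, and the helper `accumulation_dtype` is hypothetical.

```python
# Hypothetical sketch: "gradient_accumulation_dtype" is the key proposed in the
# issue title, not a confirmed DeepSpeed config option.
ALLOWED_DTYPES = {"fp32", "fp16", "bf16"}

ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "gradient_accumulation_dtype": "fp32",  # hypothetical key
}


def accumulation_dtype(config, default="fp32"):
    """Return the explicitly requested accumulation dtype, or the default."""
    dtype = config.get("gradient_accumulation_dtype", default)
    if dtype not in ALLOWED_DTYPES:
        raise ValueError(f"unsupported accumulation dtype: {dtype!r}")
    return dtype
```

Making the dtype an explicit, validated setting is one way to remove the ambiguity the issue describes: training could run in bf16 while gradients accumulate in fp32, and the config would say so directly.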

[zero-3] add support for new params added during fwd pass

/cc @stas00

accelerate requires torch 1.9+

We noticed our DeepSpeed + Accelerate unit tests are failing on torch 1.8. `torch.distributed.run` requires torch 1.9+, so we are bumping your min torch version to 1.9. If you'd rather guard the...
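
For the guard alternative mentioned above, a minimal sketch of a version check might look like the following. The helper name `meets_min_torch` is hypothetical, and it works on a plain version string so it does not need torch installed to run.

```python
import re


def meets_min_torch(version_str, minimum=(1, 9)):
    """Return True if a torch version string is at least `minimum` (major, minor)."""
    # Only the leading "major.minor" is compared; suffixes such as "+cu111" are ignored.
    match = re.match(r"(\d+)\.(\d+)", version_str)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_str!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum
```

In practice this would be called as `meets_min_torch(torch.__version__)` before importing anything that depends on `torch.distributed.run`. Comparing integer tuples avoids the string-comparison trap where "1.10" sorts before "1.9".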

add ds inject policies

[bloom] use mii cache dir for config/tokenizer

In AML deployments the model dir is not writeable, so download the config/tokenizer to a writeable cache path.
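
The fallback described above can be sketched as follows. This is an illustration only: the function name `resolve_download_dir` and the default cache location are assumptions, not MII's actual code or paths.

```python
import os
import tempfile


def resolve_download_dir(model_dir, cache_dir=None):
    """Return model_dir if it is writeable, otherwise a writeable cache dir."""
    if os.path.isdir(model_dir) and os.access(model_dir, os.W_OK):
        return model_dir
    if cache_dir is None:
        # Hypothetical default location, chosen only for this example.
        cache_dir = os.path.join(tempfile.gettempdir(), "mii_cache_example")
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir
```

The config and tokenizer would then be downloaded into the returned directory, so a read-only model dir does not cause the download to fail.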

Add local AML deployment option

Provide a local AML deployment option. This will use the [AML inference server](https://pypi.org/project/azureml-inference-server-http/) for the front end. We can then easily deploy an MII-generated score file via: `azmlinfsrv --model_dir --entry_script...

enhancement

Expose DS-inference and ZeRO-inference configs to user

only override forward if using cuda-graph

re-enable neox inference tests

pre/post forward calls to engine + generate method