andyG
**Describe the bug** When training Llama 13B (https://github.com/ymcui/Chinese-LLaMA-Alpaca), I observed that parameter memory cannot be freed when using the ZeRO-3 + Offload parameter strategy in PyTorch 1.9, but the same parameter memory is freed in PyTorch 1.13...
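For context, a minimal sketch of the kind of ZeRO-3 + parameter offload setup being described (the reporter's actual config is not shown; the values and the commented `deepspeed.initialize` call below are illustrative assumptions):

```python
# Illustrative DeepSpeed ZeRO-3 config with CPU parameter/optimizer offload.
# Not the reporter's actual config; batch size and precision are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "fp16": {"enabled": True},
}

# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```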
The shape of vocab_parallel_logits is [seq_len, batch_size, vocab_size / tp]. When vocab_size is very large, as in Llama 3, using an in-place subtraction reduces memory usage.
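A minimal sketch of the idea, assuming a Megatron-style vocab-parallel cross entropy where the per-token global max is subtracted for numerical stability (the function name and signature here are illustrative, not the exact diff):

```python
import torch
import torch.distributed as dist


def stable_shift_logits(vocab_parallel_logits: torch.Tensor,
                        tp_group=None) -> torch.Tensor:
    """Subtract the global per-token max from vocab-parallel logits.

    vocab_parallel_logits: [seq_len, batch_size, vocab_size // tp]
    """
    # Max over the local vocab shard, then all-reduce (MAX) across the
    # tensor-parallel group to obtain the global per-token max.
    logits_max = torch.max(vocab_parallel_logits, dim=-1)[0]
    if dist.is_initialized():
        dist.all_reduce(logits_max, op=dist.ReduceOp.MAX, group=tp_group)

    # Out-of-place version allocates a second [s, b, v/tp] tensor:
    #   shifted = vocab_parallel_logits - logits_max.unsqueeze(-1)
    # The in-place subtraction reuses the existing buffer, which matters
    # when vocab_size is very large (e.g. Llama 3's ~128k vocabulary).
    vocab_parallel_logits.sub_(logits_max.unsqueeze(-1))
    return vocab_parallel_logits
```

In Megatron-style implementations this subtraction typically lives inside a custom torch.autograd.Function, so the in-place update does not conflict with autograd; the sketch above only illustrates the memory argument.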
# What does this PR do?

Update the deepspeed-mlu dependency.

Related PRs:
1. https://github.com/huggingface/transformers/pull/34362
2. https://github.com/microsoft/DeepSpeed/pull/6472

## Before submitting
- [ ] This PR fixes a typo or improves the docs...