DeepSpeed
Add sharded checkpoint loading for AutoTP path to reduce the peak memory in initialization stage
@delock @yao-matrix
Hi @molly-smith, this PR reduces the host memory used per rank by adding sharded checkpoint loading to the AutoTP path, in the same way shard loading already works in the kernel injection path.
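To illustrate the general idea (this is not the code in this PR), here is a minimal sketch of shard-by-shard loading. It assumes a checkpoint split into several shard files and a tensor-parallel rank that only needs a slice of each weight. The names `load_sharded`, `slice_for_rank`, and `load_shard` are hypothetical, plain Python lists stand in for tensors, and every weight is split evenly along one dimension, which a real AutoTP policy would not do for all layers.

```python
from typing import Callable, Dict, Iterable, List

Tensor = List[float]  # stand-in for a real tensor type (assumption)


def slice_for_rank(values: Tensor, rank: int, world_size: int) -> Tensor:
    """Return the contiguous chunk of `values` owned by `rank`."""
    chunk = len(values) // world_size
    return values[rank * chunk:(rank + 1) * chunk]


def load_sharded(
    shard_names: Iterable[str],
    load_shard: Callable[[str], Dict[str, Tensor]],  # hypothetical loader
    rank: int,
    world_size: int,
) -> Dict[str, Tensor]:
    """Build this rank's state dict one shard at a time.

    Only one full shard is held in host memory at any moment, so the
    peak is roughly one shard plus the rank-local slices collected so
    far, rather than the entire checkpoint.
    """
    local_state: Dict[str, Tensor] = {}
    for name in shard_names:
        shard = load_shard(name)  # read a single shard file
        for key, values in shard.items():
            # Keep only the slice this rank is responsible for.
            local_state[key] = slice_for_rank(values, rank, world_size)
        del shard  # drop our reference before reading the next shard
    return local_state
```

In practice `load_shard` would wrap whatever reads a single checkpoint file from disk; it is passed in as a callable here so the sketch stays independent of any particular checkpoint format.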