André Bauer
+1 for this feature. Is there any way to get the desired behavior already?
> In the config json, set "stage3_prefetch_bucket_size": 0, that should work

While this might "work", it still does not solve the problem, for example with `mixtral`, since this kind of MoE...
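For anyone trying the suggestion above, here is a minimal sketch of where that key sits in a ZeRO stage 3 config, written as the Python dict form of the config json; everything except the bucket-size line is an illustrative placeholder, not a recommendation:

```python
# Minimal ZeRO stage 3 config sketch: only stage3_prefetch_bucket_size
# comes from the suggestion above, the rest are illustrative placeholders.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 0,  # disable parameter prefetching
    },
    "fp16": {"enabled": True},            # placeholder precision setting
    "train_micro_batch_size_per_gpu": 1,  # DeepSpeed expects a batch size even
                                          # when the engine is used for inference
}
```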
> I had some success loading the model this way:
>
> ```
> with deepspeed.OnDevice(dtype=dtype, device="meta"):
>     model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True)
> model = deepspeed.init_inference(
>     model,
>     tensor_parallel...
> ```
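Since the quoted snippet is cut off, here is one way the full pattern could plausibly continue; the model name, `tp_size`, dtype, and the commented-out checkpoint argument are my assumptions, not the original poster's exact code:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model_name = "mistralai/Mixtral-8x7B-v0.1"  # assumed model for illustration
dtype = torch.float16

# Build the module structure on the meta device so from_pretrained()
# allocates no real weights in host RAM.
with deepspeed.OnDevice(dtype=dtype, device="meta"):
    model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True)

# init_inference shards the model and loads the real weights; the kwargs
# below are a guess at how the truncated call continued.
model = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 1},
    dtype=dtype,
    replace_with_kernel_inject=False,
    # checkpoint="checkpoints.json",  # with meta tensors DeepSpeed usually needs
    #                                 # a checkpoint description to fill in weights
)
```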
That means if I have 300 GB of RAM for a 301 GB model, there is no way to offload only the 1 GB of params to NVMe in inference 🤔? I...
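For reference, the offload being asked about lives in the `offload_param` section of the ZeRO config; the sketch below shows the existing knobs (the path and buffer numbers are placeholders). `max_in_cpu` is possibly the closest documented setting to a partial offload, since it bounds how many parameter elements stay in CPU RAM when the rest go to NVMe:

```python
# ZeRO-Infinity parameter offload section the question refers to; nvme_path
# and the buffer/max values are placeholders, not tuned recommendations.
ds_config["zero_optimization"]["offload_param"] = {
    "device": "nvme",            # offload ZeRO-3 parameters to NVMe
    "nvme_path": "/local_nvme",  # placeholder mount point for the NVMe drive
    "pin_memory": True,
    "buffer_count": 5,
    "buffer_size": 1e8,
    "max_in_cpu": 1e9,           # parameter elements allowed to remain in CPU RAM
}
```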