
Fails to resume when training with float16

Open CrazyBoyM opened this issue 2 years ago • 1 comment

with the option:

--mixed_precision="fp16" \

it trains faster, but when I try to resume training with

--resume_unet="/epoch_41_step_0/lora_weights/lora_e41_s0.pt" \

I get this error:

TypeError: cannot assign 'torch.HalfTensor' as parameter 'weight' 

Is there any way to solve this problem?
I tried to fix it with:

if loras is not None:
    print("########## inject from checkpoint ###########")
    _module._modules[name].lora_up.weight = torch.nn.Parameter(torch.tensor(loras.pop(0)).float().detach())
    _module._modules[name].lora_down.weight = torch.nn.Parameter(torch.tensor(loras.pop(0)).float().detach())

but it still fails.
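
For context, the TypeError itself comes from nn.Module.__setattr__: once an attribute is registered as an nn.Parameter, assigning a plain tensor to it (here the fp16 tensor loaded from the checkpoint) is rejected. A minimal sketch that reproduces the same error, using a toy nn.Linear as a stand-in rather than the actual lora code:

import torch
import torch.nn as nn

layer = nn.Linear(4, 4, bias=False)                      # toy stand-in for a LoRA up/down projection
ckpt_weight = torch.randn(4, 4, dtype=torch.float16)     # fp16 tensor as loaded from the resume checkpoint

# assigning a plain tensor to a registered Parameter attribute raises:
# TypeError: cannot assign 'torch.HalfTensor' as parameter 'weight' (torch.nn.Parameter or None expected)
layer.weight = ckpt_weight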

CrazyBoyM, Mar 02 '23 03:03

I think it's related to PyTorch version issues; just modifying this section of the code will do:

...
require_grad_params.append(_module._modules[name].lora_up.parameters())
require_grad_params.append(_module._modules[name].lora_down.parameters())
wt_tensor_type = _module._modules[name].lora_up.weight.dtype
if loras is not None:
    _module._modules[name].lora_up.weight.data = loras.pop(0).to(wt_tensor_type)
    _module._modules[name].lora_down.weight.data = loras.pop(0).to(wt_tensor_type)
...
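
The same idea in a self-contained form, so it can be checked in isolation (the toy nn.Linear and the variable names below are illustrative, not the actual lora code): write into the existing Parameter's .data, cast to the dtype the module already uses, which sidesteps the Parameter type check that triggered the TypeError.

import torch
import torch.nn as nn

lora_up = nn.Linear(4, 4, bias=False)                 # toy stand-in for the injected LoRA layer
loaded = torch.randn(4, 4, dtype=torch.float16)       # fp16 weight restored from the checkpoint

# copy into the existing Parameter's storage, cast to the module's current dtype
wt_tensor_type = lora_up.weight.dtype
lora_up.weight.data = loaded.to(wt_tensor_type)

# an equivalent alternative: rewrap explicitly as a Parameter before assigning
lora_up.weight = nn.Parameter(loaded.to(wt_tensor_type))

Either way the assignment no longer trips nn.Module's Parameter check, so resuming from an fp16 checkpoint should go through.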

danieltanhx, Jul 28 '23 07:07