How to load the pretrained safetensors and continue training?

JunyuanDeng opened this issue on Jun 19, 2024 · 8 comments

Hello, thanks for sharing your code!

I am now trying to train phase 2 stage 2 with the provided vista.safetensors.

So I changed the command to the following:

torchrun \
    --nnodes=1 \
    --nproc_per_node=8 \
    train.py \
    --base configs/training/vista_phase2_stage2.yaml \
    --finetune ${PATH_TO_STAGE1_CKPT}/vista.safetensors \
    --num_nodes 1 \
    --n_devices 8

But there are lots of missing keys, like these: (screenshot attached)
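
As a side note for anyone debugging this, here is a minimal sketch (not from the repo) for listing which keys are missing or unexpected before training. The checkpoint path is a placeholder and `model` is assumed to be the model built from the stage-2 config:

from safetensors.torch import load_file

ckpt = load_file("path/to/vista.safetensors")    # hypothetical path
model_keys = set(model.state_dict().keys())      # `model` built from the stage-2 config (assumed)
ckpt_keys = set(ckpt.keys())

# Keys the model expects but the checkpoint does not provide, and vice versa.
print("missing (in model, not in ckpt):", sorted(model_keys - ckpt_keys)[:20])
print("unexpected (in ckpt, not in model):", sorted(ckpt_keys - model_keys)[:20])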

I expected the loss to be low, but that is not what I observe: (screenshot attached)

I downloaded the sampled video "samples_mp4_epoch00_batch0000_step000001.mp4":

https://github.com/OpenDriveLab/Vista/assets/62542727/80f5237f-9d68-46f5-8d5b-9ec0b5587b63

What should I do to use the provided weights to start the phase 2 stage 2 training?

JunyuanDeng avatar Jun 19 '24 08:06 JunyuanDeng

Sorry for the trouble. I haven't verified this resuming feature yet. It seems that some weights are left randomly initialized after loading. Make sure the new weights are initialized as zeros. In addition, if there are some "unexpected" weights when loading the checkpoint, make sure all of them are remapped to "missing" weights. This can be done by renaming the keys in the state dictionary and loading the dictionary into the model again.
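
A rough sketch of that remapping idea (not from the repo), assuming `model` is already built from the stage-2 config; the key-renaming rule below is only a placeholder that you would need to adapt to the actual mismatched names:

import torch
from safetensors.torch import load_file

state = load_file("path/to/vista.safetensors")   # hypothetical path

# Rename "unexpected" keys so they match the names the new model expects
# (placeholder mapping; inspect the missing/unexpected lists to fill it in).
renamed = {k.replace("old_prefix.", "new_prefix."): v for k, v in state.items()}

missing, unexpected = model.load_state_dict(renamed, strict=False)

# Zero-initialize whatever is still missing so the newly added layers
# start out as no-ops instead of random values.
with torch.no_grad():
    for name, param in model.named_parameters():
        if name in missing:
            param.zero_()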

Little-Podi avatar Jul 29 '24 12:07 Little-Podi

@JunyuanDeng Hi, have you resolved this issue? Could you please share how you did it? Thank you!

zhoujiawei3 avatar Nov 07 '24 08:11 zhoujiawei3

@Little-Podi Hi, just to make sure I understand: do you mean we need to change the code so that the missing keys are initialized as zeros in this case? When I set the values of these missing keys to zero, samples_mp4_epoch00_batch0000_step000001.mp4 still looks broken in the same way.

zhoujiawei3 avatar Nov 10 '24 08:11 zhoujiawei3

@Little-Podi Hi, thanks a lot for sharing this great work! I ran into the same issue. Could you share the checkpoint after stage 1 so we can continue training? Thanks a lot!

jywu511 avatar Dec 25 '24 01:12 jywu511

@zhoujiawei3 Hello, did you ever find an answer to this?

zzz5y avatar Jan 08 '25 08:01 zzz5y

Upon inspection, vista.safetensors does not appear to contain model_ema weights. I manually enabled init_ema in this line and it seems to work.
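
A quick check along these lines (assumed, not from the repo) for whether the checkpoint actually carries EMA weights; the path and the "model_ema" prefix are assumptions based on the discussion above:

from safetensors import safe_open

with safe_open("path/to/vista.safetensors", framework="pt") as f:
    ema_keys = [k for k in f.keys() if k.startswith("model_ema")]

print(f"found {len(ema_keys)} model_ema keys")
# If the list is empty, re-initialize the EMA (e.g. the init_ema /
# model.reinit_ema() path discussed in this thread) after loading the weights.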

hungdche avatar Jan 21 '25 03:01 hungdche

@hungdche Hi, thanks for your response! Do you mean that we should always call "model.reinit_ema()"?

jywu511 avatar Mar 06 '25 07:03 jywu511

Hello, when I fine-tune the LoRA layers using the pretrained weights, I get this error: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. Do you know how to solve it? Sorry to bother you, and thank you very much!
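
This error usually means the training loss has no path to any parameter with requires_grad=True. A sanity-check sketch (not a fix from the repo), assuming the LoRA parameter names contain "lora":

# Freeze base weights and train only the LoRA parameters (assumed naming).
for name, param in model.named_parameters():
    param.requires_grad = "lora" in name

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors")   # should be greater than 0

# Also make sure the forward pass that computes the training loss is not
# wrapped in torch.no_grad() or torch.inference_mode().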

johnren-code avatar May 24 '25 14:05 johnren-code