
Errors in LLaVA pretraining with phi3_mini_4k_instruct_clip_vit_large_p14_336

Open · JiamingLv opened this issue 1 year ago · 3 comments

I strictly followed the documentation for phi3_mini_4k_instruct_clip_vit_large_p14_336 and ran the command:

NPROC_PER_NODE=4 xtuner train llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2 --seed 1024

Conda environment: python==3.10, transformers==4.41.1, torch==2.3.0, CUDA 12.1; hardware: 4x 3090.

05/23 08:10:20 - mmengine - INFO - before_train in EvaluateChatHook.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it). Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.
[rank3]: Traceback (most recent call last):
[rank3]:   File "/media/ljm/anaconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1271, in call_hook
[rank3]:     getattr(hook, fn_name)(self, **kwargs)
[rank3]:   File "/home/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 230, in before_train
[rank3]:     self._generate_samples(runner, max_new_tokens=50)
[rank3]:   File "/home/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 216, in _generate_samples
[rank3]:     self._eval_images(runner, model, device, max_new_tokens,
[rank3]:   File "/home/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 148, in _eval_images
[rank3]:     generation_output = model.generate(
[rank3]:   File "/media/ljm/anaconda3/envs/xtuner/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/media/ljm/anaconda3/envs/xtuner/lib/python3.10/site-packages/transformers/generation/utils.py", line 1758, in generate
[rank3]:     result = self._sample(
[rank3]:   File "/media/ljm/anaconda3/envs/xtuner/lib/python3.10/site-packages/transformers/generation/utils.py", line 2390, in _sample
[rank3]:     model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs)
[rank3]:   File "/media/ljm/anaconda3/envs/xtuner/lib/python3.10/site-packages/transformers/generation/utils.py", line 1321, in _get_initial_cache_position
[rank3]:     past_length = model_kwargs["past_key_values"][0][0].shape[2]
[rank3]: TypeError: 'NoneType' object is not subscriptable

JiamingLv · May 23 '24
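For anyone hitting the same trace: the failure itself is generic Python behaviour. transformers' _get_initial_cache_position indexes model_kwargs["past_key_values"], which is None at that point, so the subscript raises the TypeError. A quick way to confirm the environment and reproduce the failing expression in isolation (the one-liners below are illustrative sketches, not xtuner or transformers code):

# print the installed versions in the active conda environment
python -c "import transformers, torch; print(transformers.__version__, torch.__version__)"

# reproduce the failing expression with past_key_values set to None
python -c "mk = {'past_key_values': None}; mk['past_key_values'][0][0].shape[2]"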

Downgrading transformers can solve this.

acdart · May 27 '24
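A sketch of what that downgrade could look like in the reporter's environment; the exact target version is an assumption here and is not confirmed anywhere in this thread, beyond being older than the reporter's 4.41.1:

# downgrade transformers to a pre-4.41 release
pip install "transformers<4.41"
# or pin a specific release (version shown is only an example)
pip install "transformers==4.40.2"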

@acdart Hi, my transformers version is currently 4.41.1. Which version should I downgrade to?

J0eky · May 28 '24

How were you even able to run that command? It tells me that it doesn't recognize NPROC_PER_NODE=4, and if I run it without that bit (i.e. just running xtuner train llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2 --seed 1024), it says it doesn't know what llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain is.

nakoeni · Jun 07 '24
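On the NPROC_PER_NODE question: a minimal sketch, assuming a bash-like shell and that the installed xtuner release ships this built-in config (both assumptions; the list-cfg step checks the second). The inline VAR=value prefix is plain shell syntax rather than an xtuner flag, so it only works in POSIX-style shells:

# equivalent two-step form of the original command
export NPROC_PER_NODE=4
xtuner train llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain \
    --deepspeed deepspeed_zero2 --seed 1024

# confirm the built-in config name exists in the installed xtuner version
xtuner list-cfg | grep llava_phi3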