DiffSynth-Studio

HunyuanVideo ValueError: Image features and image tokens do not match: tokens: 1, features 2359296

Open qq1343277857 opened this issue 6 months ago • 2 comments

How can I fix this?

Traceback (most recent call last):
  File "/root/paddlejob/workspace/env_run/zwr_workspace/DiffSynth-Studio/hunyuanvideo_i2v_24G.py", line 43, in <module>
    video = pipe(prompt, input_images=images, num_inference_steps=50, seed=0, i2v_resolution=i2v_resolution)
  File "/root/paddlejob/workspace/env_run/miniconda3/envs/Diff-S/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/paddlejob/workspace/env_run/zwr_workspace/DiffSynth-Studio/diffsynth/pipelines/hunyuan_video.py", line 190, in __call__
    prompt_emb_posi = self.encode_prompt(prompt, positive=True, input_images=input_images)
  File "/root/paddlejob/workspace/env_run/zwr_workspace/DiffSynth-Studio/diffsynth/pipelines/hunyuan_video.py", line 106, in encode_prompt
    prompt_emb, pooled_prompt_emb, text_mask = self.prompter.encode_prompt(
  File "/root/paddlejob/workspace/env_run/zwr_workspace/DiffSynth-Studio/diffsynth/prompters/hunyuan_video_prompter.py", line 288, in encode_prompt
    prompt_emb, attention_mask = self.encode_prompt_using_mllm(prompt_formated, images, llm_sequence_length, device,
  File "/root/paddlejob/workspace/env_run/zwr_workspace/DiffSynth-Studio/diffsynth/prompters/hunyuan_video_prompter.py", line 191, in encode_prompt_using_mllm
    last_hidden_state = self.text_encoder_2(input_ids=input_ids,
  File "/root/paddlejob/workspace/env_run/miniconda3/envs/Diff-S/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/paddlejob/workspace/env_run/miniconda3/envs/Diff-S/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/paddlejob/workspace/env_run/zwr_workspace/DiffSynth-Studio/diffsynth/models/hunyuan_video_text_encoder.py", line 63, in forward
    outputs = super().forward(input_ids=input_ids,
  File "/root/paddlejob/workspace/env_run/miniconda3/envs/Diff-S/lib/python3.10/site-packages/transformers/utils/generic.py", line 943, in wrapper
    output = func(self, *args, **kwargs)
  File "/root/paddlejob/workspace/env_run/miniconda3/envs/Diff-S/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 455, in forward
    outputs = self.model(
  File "/root/paddlejob/workspace/env_run/miniconda3/envs/Diff-S/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/paddlejob/workspace/env_run/miniconda3/envs/Diff-S/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/paddlejob/workspace/env_run/miniconda3/envs/Diff-S/lib/python3.10/site-packages/transformers/utils/generic.py", line 943, in wrapper
    output = func(self, *args, **kwargs)
  File "/root/paddlejob/workspace/env_run/miniconda3/envs/Diff-S/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 296, in forward
    raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 1, features 2359296

qq1343277857 • Jul 18 '25 08:07

same problem

lixincst • Jul 29 '25 08:07

This problem can be resolved by downgrading transformers. In my testing, transformers==4.45 works.

Tianxinhuang • Oct 14 '25 15:10
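
For anyone landing here, a quick way to check whether your environment is newer than the version reported to work above. This is only a sketch: the thread gives a single data point (transformers==4.45 worked), so the threshold below is an assumption, not a confirmed cutoff, and `major_minor` and `check_transformers` are hypothetical helper names, not part of DiffSynth-Studio or transformers.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Assumption: taken from the comment above (transformers==4.45 was reported to work).
# The thread does not say which later release introduced the mismatch.
KNOWN_GOOD = (4, 45)


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '4.45.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_transformers(installed: Optional[str] = None) -> str:
    """Return a short hint comparing the installed transformers version to KNOWN_GOOD."""
    if installed is None:
        try:
            installed = metadata.version("transformers")
        except metadata.PackageNotFoundError:
            return "transformers is not installed"
    if major_minor(installed) > KNOWN_GOOD:
        return (
            f"transformers {installed} is newer than the version reported to work; "
            "try: pip install transformers==4.45"
        )
    return f"transformers {installed} is not newer than the version reported to work"


if __name__ == "__main__":
    print(check_transformers())
```

Run it inside the same conda environment you use for hunyuanvideo_i2v_24G.py. If it suggests the downgrade, pin the version, then rerun the script.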