Finetune LLaVaNeXT -> ValueError: Image features and image tokens do not match
Hi everyone,
I tried running the notebook provided here for fine-tuning LLaVA-NeXT: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa-NeXT/Fine_tune_LLaVaNeXT_on_a_custom_dataset_(with_PyTorch_Lightning).ipynb
However, during training, I encountered the following error:
ValueError: Image features and image tokens do not match: tokens: 251, features 2160
I'm using transformers==4.51.3 and did not modify the notebook. I tried to debug this by reviewing the code around the collate function, but couldn't find the issue. Has anyone else run into this error, or have ideas on what's going wrong?
Thanks
cc @zucchini-nlp
@benjwolff hey, can you try increasing MAX_LENGTH to 3000 tokens? In the latest transformers versions we include all image tokens in the max-length count, and from the next release onward we'll raise an error when max_length is too small to fit all image tokens. Until then, you can set a very large max length so the image tokens don't get truncated.
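To illustrate why a small MAX_LENGTH triggers this error, here's a toy sketch (not the actual processor code; `NUM_IMAGE_FEATURES` matches the 2160 from the error message, `TEXT_TOKENS` is made up): the processor expands the single image placeholder into one token per visual feature, so truncating the tokenized sequence afterwards drops most of the image tokens and the count no longer matches the features.

```python
# Toy model of LLaVA-NeXT tokenization: 1 = image-placeholder token, 0 = text token.
NUM_IMAGE_FEATURES = 2160   # visual features from the error message
TEXT_TOKENS = 90            # prompt + answer length (illustrative)

def count_image_tokens(max_length):
    """How many image-placeholder tokens survive truncation to max_length."""
    sequence = [1] * NUM_IMAGE_FEATURES + [0] * TEXT_TOKENS
    truncated = sequence[:max_length]
    return sum(1 for t in truncated if t == 1)

# A small MAX_LENGTH truncates image tokens away -> token/feature mismatch.
assert count_image_tokens(256) < NUM_IMAGE_FEATURES

# MAX_LENGTH = 3000 keeps all 2160 image tokens -> counts match.
assert count_image_tokens(3000) == NUM_IMAGE_FEATURES
```

In the notebook this corresponds to raising the MAX_LENGTH constant that gets passed as `max_length` to the processor in the collate function.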
@zucchini-nlp Thanks for helping out! Increasing MAX_LENGTH to 3000 resolves the mismatch, but now I'm running into memory problems: on an A100 (40GB) it runs out of GPU memory during training.
Yeah, LLaVA-NeXT training needs a lot of GPU memory. I ran the script on an 80GB card, IIRC.
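For anyone stuck on 40GB, a couple of Lightning-side knobs usually reduce memory pressure. This is just a sketch, not from the notebook: the argument values are illustrative, and `model_module` stands in for the notebook's LightningModule.

```python
import lightning as L  # the notebook uses PyTorch Lightning

trainer = L.Trainer(
    accelerator="gpu",
    devices=1,
    precision="bf16-mixed",     # mixed precision roughly halves activation memory vs fp32
    accumulate_grad_batches=8,  # keep the effective batch size while using batch_size=1
    max_epochs=1,
)
# trainer.fit(model_module)  # model_module: the notebook's LightningModule (placeholder)
# Also worth trying on the underlying HF model:
# model.gradient_checkpointing_enable()  # trades compute for activation memory
```

None of this removes the need for a large max_length; it only shrinks the activation/optimizer footprint per step.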