[WIP] Add LLaVA model

Open · youssefadr opened this pull request 2 years ago · 6 comments

What does this PR do?

This PR adds the LLaVA model (https://arxiv.org/abs/2304.08485), an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding.
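
For context, here is a minimal sketch of the core idea from the paper: a learned projection maps features from a frozen vision encoder into the LLM's embedding space, so image patches can be fed to the language model alongside text tokens. All names below are illustrative, not part of any transformers API:

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Illustrative sketch of the LLaVA vision-language bridge: a single
    linear layer maps vision-encoder features into the LLM embedding space."""

    def __init__(self, vision_hidden_size: int, llm_hidden_size: int):
        super().__init__()
        self.projection = nn.Linear(vision_hidden_size, llm_hidden_size)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_hidden) -> (batch, num_patches, llm_hidden)
        return self.projection(image_features)
```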

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. https://github.com/huggingface/transformers/issues/22848
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of who to tag. Please tag fewer than 3 people.

Models:

  • text models: @ArthurZucker and @younesbelkada
  • vision models: @amyeroberts

youssefadr commented on May 29, 2023

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Hey! Thanks for wanting to contribute. I would suggest you follow the guide on how to share a model, like this one. Since this is basically patching two models together, it should be easy to fit on the Hub! 🤗
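
For reference, the custom-code workflow from that guide looks roughly like the sketch below. The class names, repo id, and `dummy` layer are placeholders; a real version would implement the full model:

```python
import torch.nn as nn
from transformers import PretrainedConfig, PreTrainedModel

# Hypothetical placeholder classes standing in for the eventual LLaVA code.
# Note: for the code itself to be uploaded to the Hub, these classes must
# live in their own .py file (see the custom models docs).
class LlavaConfig(PretrainedConfig):
    model_type = "llava"

class LlavaForCausalLM(PreTrainedModel):
    config_class = LlavaConfig

    def __init__(self, config):
        super().__init__(config)
        # A real implementation would build the vision tower, projector,
        # and language model here.
        self.dummy = nn.Linear(1, 1)

    def forward(self, x):
        return self.dummy(x)

# Register the classes so the Auto* API can resolve them from code stored
# on the Hub, then push both the code and the weights to a repository.
LlavaConfig.register_for_auto_class()
LlavaForCausalLM.register_for_auto_class("AutoModelForCausalLM")

model = LlavaForCausalLM(LlavaConfig())
model.push_to_hub("your-username/llava-remote-code")  # placeholder repo id
```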

ArthurZucker commented on May 30, 2023

Hey @ArthurZucker! Thank you for your message!

I have looked at the guide you linked for sharing a model, and as I understand it, you are referring to uploading the model weights to the Hub and adding a model card, right?

However, I am a little confused, since the model weights are already on the Hub (https://huggingface.co/liuhaotian/LLaVA-7b-delta-v0), but they cannot be run with the current LLaMA implementation in transformers. I was thinking instead of following this guide and including in my PR new classes for LLaVA inheriting from PretrainedConfig and PreTrainedModel, plus a LlavaForCausalLM class, as implemented here: https://github.com/haotian-liu/LLaVA/blob/main/llava/model/llava.py.
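
A rough sketch of what that in-library approach could look like, loosely following the reference implementation, which subclasses the existing LLaMA classes (names and defaults here are illustrative):

```python
import torch.nn as nn
from transformers import LlamaConfig, LlamaForCausalLM

class LlavaConfig(LlamaConfig):
    model_type = "llava"

    def __init__(self, mm_hidden_size=1024, **kwargs):
        # Hidden size of the vision-encoder features to be projected.
        self.mm_hidden_size = mm_hidden_size
        super().__init__(**kwargs)

class LlavaForCausalLM(LlamaForCausalLM):
    config_class = LlavaConfig

    def __init__(self, config):
        super().__init__(config)
        # Linear projector mapping vision features into the LLM hidden size;
        # the forward pass would splice projected image tokens into the text
        # embedding sequence before running the LLaMA decoder.
        self.mm_projector = nn.Linear(config.mm_hidden_size, config.hidden_size)
```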

What do you think of it, @ArthurZucker? (@jprivera44, do not hesitate to join the conversation, since we will be collaborating on this PR.)

youssefadr commented on May 30, 2023

Hi @youssefadr, following up on your post, I am also following the same HF guide, although we might be interpreting the steps slightly differently. I'm not sure which step you are on, but even though the original researchers included the model card, it should be used to derive the initial weights from the LLaMA weights (I'm still waiting on Meta for these). Once the pre-trained weights are loaded, tracing the forward pass (in the original repo) to see which functions are needed for transformers/LLaVA kicks off the whole process. Were you able to get the original LLaMA weights from Meta?
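
For anyone working through the same step, here is a minimal sketch of how such delta weights are typically merged, assuming the two checkpoints share parameter names. (The reference repo has its own delta-application utility, which should be preferred; among other things it handles shape mismatches such as resized embeddings. The paths below are placeholders.)

```python
import torch

def apply_delta(base_state_dict, delta_state_dict):
    """Add delta weights to base LLaMA weights, parameter by parameter.

    Assumes both state dicts use identical keys for shared parameters;
    keys present only in the delta (e.g. the vision projector) are kept as-is.
    """
    merged = {}
    for name, delta in delta_state_dict.items():
        if name in base_state_dict:
            merged[name] = base_state_dict[name] + delta
        else:
            merged[name] = delta  # new parameter introduced by LLaVA
    return merged

base = torch.load("llama-7b/pytorch_model.bin", map_location="cpu")        # placeholder path
delta = torch.load("llava-7b-delta/pytorch_model.bin", map_location="cpu")  # placeholder path
torch.save(apply_delta(base, delta), "llava-7b-merged/pytorch_model.bin")
```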

jprivera44 commented on May 31, 2023

Hey @youssefadr, what I meant is that you should host the code on the Hub; others will then be able to run your code using trust_remote_code=True. This is easier to do and more aligned with the way this model seems to work!
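
Concretely, once the modeling code is hosted in the repository, loading would look along these lines (the repo id is a placeholder):

```python
from transformers import AutoModelForCausalLM

# trust_remote_code=True lets transformers download and execute the modeling
# code stored in the Hub repository instead of using a built-in class.
model = AutoModelForCausalLM.from_pretrained(
    "your-username/llava-remote-code",  # placeholder repo id
    trust_remote_code=True,
)
```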

ArthurZucker commented on May 31, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] commented on Jun 29, 2023