Multimodal support with Phi 3 Vision + Transformers
Adds a general framework for supporting multimodal models in Guidance, as well as an implementation of Phi 3 Vision, using the transformers library.
- Refactored some code in `_parser.py`. Changed the `__init__` function for `TokenParser` so that it contains less logic. This was necessary to implement more flexible processing of the prompt to get the token IDs when dealing with various data formats. Now engines can do more specific preparation of the `TokenParser` if they need to.
- Added Phi 3 Vision chat template
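The intent of the `TokenParser` refactor can be sketched roughly as follows. This is an illustrative toy, not the actual guidance classes: the class and method bodies here are hypothetical stand-ins showing the division of responsibility (engine prepares token IDs, `__init__` just stores them).

```python
# Hypothetical sketch of the refactor's intent: TokenParser's __init__ no
# longer tokenizes the prompt itself; the engine prepares the token IDs up
# front (handling whatever data format it needs) and hands them over.
# All names and bodies here are illustrative, not the real guidance API.
class TokenParser:
    def __init__(self, token_ids: list[int]):
        # __init__ now just stores prepared state; no prompt processing here
        self.token_ids = token_ids


class Engine:
    def start(self, prompt: str) -> TokenParser:
        # Engine-specific preparation happens here, e.g. handling
        # multimodal placeholders before tokenizing. A character-code
        # "tokenizer" stands in for a real one in this sketch.
        token_ids = [ord(c) for c in prompt]
        return TokenParser(token_ids)
```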
- Model and engine refactor
  - Created a `Modality` enum which is used to indicate the modality of data in a prompt segment. Some types have been created to indicate that the data points to a URL. This might enable us to support APIs like Gemini in the future, which requires users to supply a Google Cloud URL for the blob data in the API request.
  - The model state is still stored as a string. Multimodal data is stored in the model object's key-value store, keyed by its ID. The multimodal data is encoded in the prompt as `<|_{modality.name}:{str(id(data))}|>`. It's essentially a placeholder for where the larger blob data should be inserted later on when prompting the model.
  - Image, audio, and video bytes are appended to the model with functions like `append_image()`, `append_audio_bytes()`, etc. These functions are meant to be used by user-facing guidance library functions such as `image()`, so those guidance functions can load the data into the model state.
  - When the model calls the engine, the multimodal blob data must be passed to the engine, which might live in a separate process or server somewhere. To allow this, a new `media` dict parameter was added to `__call__()`, `get_next_token()`, and `get_logits()` in `Engine`. There is a default implementation of these provided in the `Engine` base class. Subclasses of `Engine` can override these functions as necessary to parse the prompt string and pack the media data as needed, depending on the API.
  - The `prompt` string parameter sent to the engine will still contain the placeholders formatted like `<|_{modality.name}:{str(id(data))}|>`. Engines will parse this string and extract the ID part using a regex. This ID is used to map to the actual blob data in the media dict: `{id: blob_data}`.
  - `get_next_token()` and `get_logits()` also receive the media dict parameter because sometimes engines will need the media data, along with the prompt string, at that particular point in time to prepare an API request. The idea is to ensure there's enough flexibility to handle various kinds of models.
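The placeholder scheme and media dict flow described above can be sketched as follows. The `Modality` enum, the `<|_{modality.name}:{str(id(data))}|>` placeholder format, and the `{id: blob_data}` mapping come from this PR; the helper names `append_media` and `extract_media` are illustrative stand-ins, not the actual guidance API.

```python
# Sketch of the placeholder scheme: the model side stores blob data in a
# media dict keyed by id(data) and splices a placeholder into the state
# string; the engine side parses placeholders back out with a regex and
# looks up each blob. Helper names are illustrative, not the real API.
import re
from enum import Enum


class Modality(Enum):
    IMAGE = 1
    AUDIO = 2
    VIDEO = 3


def append_media(state: str, media: dict, modality: Modality, data: bytes) -> str:
    """Store the blob in the media dict and append its placeholder to the state."""
    media[str(id(data))] = data
    return state + f"<|_{modality.name}:{str(id(data))}|>"


PLACEHOLDER_RE = re.compile(r"<\|_([A-Z]+):(\d+)\|>")


def extract_media(prompt: str, media: dict):
    """Engine side: find placeholders and resolve each ID to its blob."""
    return [
        (Modality[m.group(1)], media[m.group(2)])
        for m in PLACEHOLDER_RE.finditer(prompt)
    ]
```

Keeping the state a plain string means only the small placeholder travels with the prompt; the engine decides when (and whether) to materialize the blob.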
- A new hack was included in Transformers `Tokenizer` creation to accommodate Phi 3 Vision's tokenizer, which uses the sentencepiece convention for encoding spaces but is not an `sp_model` itself
- A specific class was written for `TransformersPhi3VisionEngine`. I found it difficult to subclass the existing Transformers classes to add the needed new code. I think it's best to consider this new `TransformersPhi3VisionEngine` as a prototype for what a multimodal Transformers engine looks like. For now, there were some Phi 3 behaviors I had to account for. In the future, as we support more multimodal Transformers models, we might notice specific patterns arising that we can use to make the implementation cleaner and more general
  - Multimodal Transformers models use an `AutoProcessor` instead of an `AutoTokenizer` to prepare model inputs. The model inputs for Phi 3 include token IDs, image pixel values, and an attention mask
  - Phi 3 Vision uses a convention of negative token IDs to pad the tokens with the space needed to fit the image embeddings in the model input. The negative IDs correspond to the image index: if there are 3 image inputs, then you would see tokens -1, -2, and -3, for example. Note that Phi 3 Vision is only trained on 1 image input, though. On HuggingFace, they have stated their stance is to allow people to fine-tune for multi-image use cases if they want: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/discussions/60
  - The input tokens might look something like this for a prompt like `"Hello <image_1> what is this <image_2>?"`: `[30, 50, 22, -1, -1, -1, -1, -1, -1, ..., -1, 54, 893, 250, -2, -2, -2, -2, -2, ..., -2, 542]`
- LL Guidance has a `process_prompt()` function which does the initial token healing and grammar forcing for a prompt string. For multimodal prompts, there are boundaries between text data and multimodal data in the token space. Token healing or forcing cannot be applied across those boundaries. So we will preserve the existing tokens provided by the initial tokenization, then send only the text tokens at the end of the prompt, after all multimodal data, to `process_prompt()`. If the prompt ends with a multimodal input, we will not use `process_prompt()`. We might improve on this later, but it seems to work for now.
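The boundary rule above can be sketched as a simple split, assuming (per the Phi 3 convention in this PR) that negative token IDs mark multimodal padding. The function name is hypothetical; only the tail of text tokens after the last multimodal run would be handed to `process_prompt()`, and an empty tail means the prompt ends with multimodal data and `process_prompt()` is skipped.

```python
# Hedged sketch of the boundary rule: tokens up to and including the last
# multimodal run are preserved verbatim; only the trailing text tokens are
# eligible for token healing / grammar forcing. Negative ids stand in for
# multimodal padding tokens here, per the Phi 3 Vision convention.
def split_for_process_prompt(tokens):
    """Return (preserved_tokens, healable_tail).

    The tail is empty when the prompt ends with multimodal data, in which
    case process_prompt() would be skipped entirely.
    """
    cut = len(tokens)
    while cut > 0 and tokens[cut - 1] >= 0:
        cut -= 1
    return tokens[:cut], tokens[cut:]
```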
TODO:
- Fix and improve tests
Thanks for your reviews Hudson & Harsha! I am picking this PR back up now and will make revisions based on your feedback. I'm also going to work on integrating Llama 3.2 to come up with a more general solution.
Closing this PR since Hudson has implemented an alternative system - it can remain a reference for local models like Phi 3 vision in the future though.
Note that there's a bit more work / a few loose ends to tie up -- will tackle this cycle.