feat(api): Add image multimodal support for LLMNode
Summary
Enhance LLMNode with multimodal capability, introducing support for
image outputs.
This implementation extracts base64-encoded images from LLM responses,
saves them to the storage service, and records the file metadata in the
ToolFile table. In conversations, these images are rendered as
markdown-based inline images.
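For reference, here is a minimal, self-contained sketch of the extraction-and-save step. The response part layout and the `BlobStore` / `ToolFileRecord` classes are illustrative stand-ins, not Dify's actual storage service or ToolFile model.

```python
# Sketch only: the part layout, BlobStore, and ToolFileRecord are assumptions,
# not Dify's real internals.
import base64
import uuid
from dataclasses import dataclass


@dataclass
class ToolFileRecord:
    """Stand-in for a row in the ToolFile table."""
    file_key: str
    mime_type: str
    size: int


class BlobStore:
    """Stand-in for the storage service; keeps blobs in memory."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def save(self, key: str, data: bytes) -> None:
        self.blobs[key] = data


def save_inline_images(parts: list[dict], store: BlobStore) -> list[ToolFileRecord]:
    """Decode base64-encoded image parts from an LLM response and persist them."""
    records = []
    for part in parts:
        if part.get("type") != "image" or not part.get("base64_data"):
            continue
        data = base64.b64decode(part["base64_data"])
        key = f"tools/{uuid.uuid4()}.png"
        store.save(key, data)                       # upload to the storage service
        records.append(                             # record metadata for the ToolFile table
            ToolFileRecord(key, part.get("mime_type", "image/png"), len(data))
        )
    return records
```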
Additionally, the images are included in the LLMNode's output as
file variables, enabling subsequent nodes in the workflow to utilize them.
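Continuing the sketch above, the saved records could surface both as inline markdown in the conversation and as file variables on the node output. The `sign_url` helper and the `files` dict shape are assumptions for illustration, not the actual LLMNode output schema.

```python
# Continues the sketch above; sign_url is a hypothetical callable that turns a
# storage key into a downloadable URL.
from typing import Callable


def build_node_outputs(
    text: str,
    records: list[ToolFileRecord],
    sign_url: Callable[[str], str],
) -> dict:
    """Render saved images as inline markdown and expose them as file variables."""
    markdown_images = "\n".join(f"![image]({sign_url(r.file_key)})" for r in records)
    return {
        "text": f"{text}\n\n{markdown_images}" if records else text,
        # Illustrative file-variable shape, not the real schema.
        "files": [
            {"type": "image", "transfer_method": "tool_file", "key": r.file_key}
            for r in records
        ],
    }
```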
To integrate file outputs into workflows, adjustments to the frontend code are necessary.
For multimodal output functionality, updates to related model configurations are required. Currently, this capability has been applied exclusively to Google's Gemini models.
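As a rough illustration of how that configuration gate might look, a check along these lines could decide whether to attempt image extraction at all; the `"multimodal-output"` feature name is an assumption, not the identifier actually used in the model configuration files.

```python
def supports_image_output(model_schema: dict) -> bool:
    """Return True if the model declares the (hypothetical) multimodal-output feature."""
    # "multimodal-output" is an assumed flag name, not the real identifier.
    return "multimodal-output" in model_schema.get("features", [])
```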
Close #15814.
Screenshots
| Before | After |
|---|---|
The image is shown twice; I don't know why (maybe an issue in the frontend code?).
To use the multimodal output capability, an updated version of the Gemini models is required. The related PR will be submitted later.
Checklist
- [ ] This change requires a documentation update, included: Dify Document
- [x] I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
- [x] I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
- [x] I've updated the documentation accordingly.
- [x] I ran `dev/reformat` (backend) and `cd web && npx lint-staged` (frontend) to appease the lint gods
@QuantumGhost I can't find the gemini 2.0 flash image generation model to switch to. Is it a provider issue? I'm running your branch code.
@inspire-boy Thank you for trying out the branch! Currently, there isn't a toggle to enable or disable multimodal support within the system. To experiment with Google Gemini's multimodal capabilities, you will need to integrate an updated version of the Google Gemini model. For more details, please refer to this PR: langgenius/dify-official-plugins#687. In addition, you need to run in development mode as described here.
It's worth noting that, as far as I know, multimodal output is supported only by the Gemini-2.0-Flash-exp model at this time.
Let me know if you have further questions or need assistance!
Thank you for your reply. I checked out your branch and debugged it, but there is an error.
Branch info:
Hi @inspire-boy,
The issue you reported has been addressed in QuantumGhost/dify-official-plugins@998c669. Could you please pull the latest code and give it another try? Let me know if you encounter any further issues!
Thank you for your reply. It's pending now, and it finally turned into a timeout.
Version info:
dify: QuantumGhost:feat/support-image-generate-for-gemini
plugin
Linux `curl` works well.
I noticed you used the model name "gemini-2.0-flash-experiment", but only gemini-2.0-flash-exp exists in the gemini\models\llm folder. Somewhere else it's named "gemini-2.0-flash-exp-image-generation". Is this another issue?
cool, waiting to use it
Hi @inspire-boy,
Thank you for catching the error in the model name check. The correct name should indeed be gemini-2.0-flash-exp.
I have made the necessary adjustments in the new PR langgenius/dify-official-plugins#804.
Regarding the timeout issue, it seems likely to be related to your network environment. Could you try setting up a proxy for the plugin daemon (as well as the Gemini plugin process, if you're running in development mode) and test it again? Let me know if the issue persists or if you encounter anything else.