Summary

Enhance LLMNode with multimodal capability, introducing support for image outputs.

This implementation extracts base64-encoded images from LLM responses, saves them to the storage service, and records the file metadata in the ToolFile table. In conversations, these images are rendered as markdown-based inline images. Additionally, the images are included in the LLMNode's output as file variables, enabling subsequent nodes in the workflow to utilize them.
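As a rough illustration of the flow described above, here is a minimal sketch. It is not the code from this PR: `ToolFileRecord`, `InMemoryStorage`, `handle_image_part`, the storage key layout, and the `/files/tools/...` URL shape are all hypothetical stand-ins for the real storage service, the ToolFile table, and the file URL scheme.

```python
from __future__ import annotations

import base64
import uuid
from dataclasses import dataclass


@dataclass
class ToolFileRecord:
    """Hypothetical stand-in for a row in the ToolFile table."""

    id: str
    storage_key: str
    mime_type: str
    size: int


class InMemoryStorage:
    """Hypothetical stand-in for the storage service."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def save(self, key: str, data: bytes) -> None:
        self.blobs[key] = data


def handle_image_part(
    mime_type: str,
    b64_data: str,
    storage: InMemoryStorage,
    table: list[ToolFileRecord],
) -> tuple[str, ToolFileRecord]:
    """Decode one base64 image, persist it, and return (markdown, record)."""
    raw = base64.b64decode(b64_data)

    # Derive a unique id and a storage key (the key layout is an assumption).
    file_id = uuid.uuid4().hex
    extension = mime_type.split("/")[-1]
    key = f"tool_files/{file_id}.{extension}"

    # Save the bytes, then record the file metadata.
    storage.save(key, raw)
    record = ToolFileRecord(
        id=file_id, storage_key=key, mime_type=mime_type, size=len(raw)
    )
    table.append(record)

    # Inline markdown image for the conversation (the URL shape is an assumption).
    markdown = f"![image](/files/tools/{file_id}.{extension})"
    return markdown, record
```

In this sketch, the returned markdown string would be appended to the text shown in the conversation, and the returned record plays the role of the file variable that downstream nodes would receive.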

To integrate file outputs into workflows, adjustments to the frontend code are necessary.

For multimodal output functionality, updates to related model configurations are required. Currently, this capability has been applied exclusively to Google's Gemini models.

Close #15814.

Screenshots

Before	After

The image is shown twice. I don't know why; it may be an issue in the frontend code.

To use the multimodal output capability, updates to the Gemini models are required. A related PR will be submitted later.

Checklist

[ ] This change requires a documentation update, included: Dify Document
[x] I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
[x] I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
[x] I've updated the documentation accordingly.
[x] I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

Apr 03 '25 00:04 QuantumGhost

@QuantumGhost I can't find a Gemini 2.0 Flash image generation model to switch to. Is it a provider issue? I'm running your branch code.

Apr 13 '25 21:04 inspire-boy

@inspire-boy Thank you for trying out the branch! Currently, there isn't a toggle to enable or disable multimodal support within the system. To experiment with Google Gemini's multimodal capabilities, you will need to integrate an updated version of the Google Gemini model. For more details, please refer to this PR: langgenius/dify-official-plugins#687. Besides, you need to use the development mode as described here.

It's worth noting that, as far as I know, multimodal output is supported only by the Gemini-2.0-Flash-exp model at this time.

Let me know if you have further questions or need assistance!

Apr 15 '25 09:04 QuantumGhost

Thank you for your reply. I checked out your branch and debugged it, but there is an error.

Branch info:

Apr 15 '25 14:04 inspire-boy

Hi @inspire-boy,

The issue you reported has been addressed in QuantumGhost/dify-official-plugins@998c669. Could you please pull the latest code and give it another try? Let me know if you encounter any further issues!

Apr 23 '25 03:04 QuantumGhost

Thanks for your reply. The request now stays pending and eventually times out.

version info

dify: QuantumGhost:feat/support-image-generate-for-gemini plugin

On Linux, curl works well.

I noticed that you used the model name "gemini-2.0-flash-experiment", but only gemini-2.0-flash-exp exists in the gemini\models\llm folder. Elsewhere, it is named "gemini-2.0-flash-exp-image-generation". Is this a separate issue?

Apr 23 '25 13:04 inspire-boy

cool, waiting to use it

Apr 25 '25 21:04 juniorsereno

Hi @inspire-boy,

Thank you for catching the error in the model name check. The correct name should indeed be gemini-2.0-flash-exp. I have made the necessary adjustments in the new PR langgenius/dify-official-plugins#804.

Regarding the timeout issue, it seems likely to be related to your network environment. Could you try setting up a proxy for the plugin daemon (as well as the Gemini plugin process, if you're running in development mode) and test it again? Let me know if the issue persists or if you encounter anything else.

Apr 27 '25 14:04 QuantumGhost

feat(api): Add image multimodal support for LLMNode