feat(api): Add image multimodal support for LLMNode

Open QuantumGhost opened this issue 10 months ago • 7 comments

Summary

Enhance LLMNode with multimodal capability, introducing support for image outputs.

This implementation extracts base64-encoded images from LLM responses, saves them to the storage service, and records the file metadata in the ToolFile table. In conversations, these images are rendered as markdown-based inline images. Additionally, the images are included in the LLMNode's output as file variables, enabling subsequent nodes in the workflow to utilize them.
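The flow described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the part layout (`type`, `text`, `base64_data`, `mime_type`), the `InMemoryStorage` class, the `StoredImage` record, and the `/files/` URL prefix are all hypothetical stand-ins for the real storage service, the ToolFile table, and the file URLs.

```python
import base64
import hashlib
from dataclasses import dataclass


@dataclass
class StoredImage:
    """Hypothetical stand-in for a ToolFile row (metadata only)."""
    key: str
    mime_type: str
    size: int


class InMemoryStorage:
    """Hypothetical stand-in for the storage service."""

    def __init__(self):
        self.blobs = {}

    def save(self, key, data):
        self.blobs[key] = data


def process_response_parts(parts, storage):
    """Decode image parts, store them, and build markdown plus file records.

    `parts` is an assumed response shape: a list of dicts that are either
    {"type": "text", "text": ...} or
    {"type": "image", "mime_type": ..., "base64_data": ...}.
    """
    chunks = []
    files = []
    for part in parts:
        if part["type"] == "text":
            chunks.append(part["text"])
        elif part["type"] == "image":
            data = base64.b64decode(part["base64_data"])
            extension = part["mime_type"].split("/", 1)[1]
            # Content-derived key, so identical images map to the same name.
            key = f"{hashlib.sha256(data).hexdigest()[:16]}.{extension}"
            storage.save(key, data)
            files.append(StoredImage(key, part["mime_type"], len(data)))
            # Inline markdown image, so the conversation can render it.
            chunks.append(f"![image](/files/{key})")
    return "".join(chunks), files


if __name__ == "__main__":
    raw = b"\x89PNG not a real image"
    demo_parts = [
        {"type": "text", "text": "Here is the picture: "},
        {
            "type": "image",
            "mime_type": "image/png",
            "base64_data": base64.b64encode(raw).decode("ascii"),
        },
    ]
    text, stored = process_response_parts(demo_parts, InMemoryStorage())
    print(text)
    print(stored)
```

The returned list plays the role of the file variables exposed in the node output, while the returned string is what would be shown in the conversation.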

To integrate file outputs into workflows, adjustments to the frontend code are necessary.

For multimodal output functionality, updates to related model configurations are required. Currently, this capability has been applied exclusively to Google's Gemini models.

Close #15814.

Screenshots

Before After
image image

The image is shown twice. I don't know why (maybe an issue in the frontend code?).

To use the multimodal output capability, the Gemini models must be updated. A related PR will be submitted later.

Checklist

  • [ ] This change requires a documentation update, included: Dify Document
  • [x] I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • [x] I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • [x] I've updated the documentation accordingly.
  • [x] I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

QuantumGhost avatar Apr 03 '25 00:04 QuantumGhost

@QuantumGhost I can't find the Gemini 2.0 Flash image generation model to switch to. Is it a provider issue? I'm running your branch code.

inspire-boy avatar Apr 13 '25 21:04 inspire-boy

@inspire-boy Thank you for trying out the branch! Currently, there isn't a toggle to enable or disable multimodal support within the system. To experiment with Google Gemini's multimodal capabilities, you will need to integrate an updated version of the Google Gemini model. For more details, please refer to this PR: langgenius/dify-official-plugins#687. You also need to use development mode, as described here.

It's worth noting that, as far as I know, multimodal output is supported only by the Gemini-2.0-Flash-exp model at this time.

Let me know if you have further questions or need assistance!

QuantumGhost avatar Apr 15 '25 09:04 QuantumGhost

Thank you for your reply. I checked out your branch and debugged it, but there is an error: image

Branch info:

image image

inspire-boy avatar Apr 15 '25 14:04 inspire-boy

Hi @inspire-boy,

The issue you reported has been addressed in QuantumGhost/dify-official-plugins@998c669. Could you please pull the latest code and give it another try? Let me know if you encounter any further issues!

QuantumGhost avatar Apr 23 '25 03:04 QuantumGhost

Thank you for your reply. The request now stays pending and eventually times out. image

Version info:

dify: QuantumGhost:feat/support-image-generate-for-gemini; plugin: image

curl on Linux works well: image

I noticed you used the model name "gemini-2.0-flash-experiment", but only gemini-2.0-flash-exp exists in the gemini\models\llm folder. Elsewhere, it is named "gemini-2.0-flash-exp-image-generation". Is this a separate issue?

image

inspire-boy avatar Apr 23 '25 13:04 inspire-boy

cool, waiting to use it

juniorsereno avatar Apr 25 '25 21:04 juniorsereno

Hi @inspire-boy,

Thank you for catching the error in the model name check. The correct name should indeed be gemini-2.0-flash-exp. I have made the necessary adjustments in the new PR langgenius/dify-official-plugins#804.

Regarding the timeout issue, it seems likely to be related to your network environment. Could you try setting up a proxy for the plugin daemon (as well as the Gemini plugin process, if you're running in development mode) and test it again? Let me know if the issue persists or if you encounter anything else.
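As a quick diagnostic before retesting, a small script like the one below can show which proxy settings a process actually sees. It is only a sketch: it assumes the proxy is configured through the conventional `HTTP_PROXY`, `HTTPS_PROXY`, and `NO_PROXY` environment variables, and whether the plugin daemon and the Gemini plugin honor those variables is an assumption, not something confirmed in this thread. The helper name `proxy_env_summary` is hypothetical.

```python
import os


def proxy_env_summary(environ=None):
    """Return the conventional proxy variables visible in `environ`.

    Both upper-case and lower-case spellings are checked, since tools
    differ in which one they read. Results use the upper-case names.
    """
    env = os.environ if environ is None else environ
    found = {}
    for name in ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"):
        value = env.get(name) or env.get(name.lower())
        if value:
            found[name] = value
    return found


if __name__ == "__main__":
    summary = proxy_env_summary()
    if summary:
        for name, value in summary.items():
            print(f"{name}={value}")
    else:
        print("No proxy variables are set for this process.")
```

Running it in the same shell or container that launches the plugin process shows whether the proxy settings reach that environment at all.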

QuantumGhost avatar Apr 27 '25 14:04 QuantumGhost