dify Voice selection in Text-to-Speech UI has no effect on output

Self Checks

[x] This is only for bug report, if you would like to ask a question, please head to Discussions.
[x] I have searched for existing issues, including closed ones.
[x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[x] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[x] Please do not modify this template :) and fill in all the required fields.

Dify version

1.4.1

Cloud or Self Hosted

Cloud

Steps to reproduce

Environment

Cloud Dify workflow
Node: Text to Speech
Model: TTS1-HD (langgenius/openai/openai)

✔️ Expected Behavior

Speech should switch timbre and speaking style according to the voice chosen in the UI.

❌ Actual Behavior

Regardless of the voice selected, there is no audible difference in timbre or speaking style.

May 31 '25 09:05 fuki01

Hi @fuki01! I'm Dosu and I’m helping the dify team.

Based on the code and architecture of Dify's Text-to-Speech (TTS) feature, the voice selected in the UI is passed through to the backend, which invokes the TTS model with that voice parameter. This path is provider-agnostic and should work for any supported model, including TTS1-HD (langgenius/openai/openai), provided that the model and provider support multiple voices and that the backend receives and uses the correct voice value in the request. The backend retrieves the available voices from the model provider and uses the selected voice when generating audio, so a change of voice should be audible in the output if the model supports it (source, source, source).

If you are not hearing any difference when switching voices, there are a few likely causes:

The TTS1-HD model or its provider may not actually support multiple distinct voices, or the available voices may not differ significantly in timbre or style.
The provider's API may ignore the voice parameter, or all voices may be mapped to the same underlying voice.
There could be a misconfiguration or limitation on the provider side, rather than in Dify's UI or backend logic.

No documentation or prior issues were found describing this specific problem or listing known limitations for TTS1-HD voice selection.

If you want to confirm whether the model supports multiple voices, you can check the list of available voices in the UI or via the API. If the list contains only one voice, or if all voices sound the same, this is likely a provider/model limitation rather than a Dify bug.

If you believe the model should support multiple distinct voices and it does not, you may want to reach out to the model provider or check their documentation for supported features. If the provider confirms support for multiple voices and the issue persists, it could indicate a deeper integration issue.
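One way to separate a provider limitation from an integration issue is to call the provider directly, once per voice, and compare the results. The sketch below is only an illustration. It assumes that the Dify entry TTS1-HD corresponds to OpenAI's `tts-1-hd` model and uses OpenAI's documented `/v1/audio/speech` endpoint with a subset of its documented voice names. The helper names (`build_speech_request`, `synthesize_openai`, `group_voices_by_audio`) are hypothetical and are not part of Dify.

```python
import hashlib
import json
import urllib.request
from typing import Callable, Dict, Iterable, List

# Assumption: Dify's "TTS1-HD" maps to OpenAI's `tts-1-hd` model.
OPENAI_SPEECH_URL = "https://api.openai.com/v1/audio/speech"
# A subset of the voice names documented by OpenAI.
VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]


def build_speech_request(api_key: str, voice: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a speech request for a single voice."""
    payload = {"model": "tts-1-hd", "voice": voice, "input": text}
    return urllib.request.Request(
        OPENAI_SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_openai(api_key: str, voice: str, text: str) -> bytes:
    """Send the request and return the raw audio bytes (MP3 by default)."""
    with urllib.request.urlopen(build_speech_request(api_key, voice, text)) as resp:
        return resp.read()


def group_voices_by_audio(
    synthesize: Callable[[str, str], bytes],
    voices: Iterable[str],
    text: str,
) -> Dict[str, List[str]]:
    """Group voice names by the SHA-256 digest of the audio they produce.

    `synthesize` takes (voice, text) and returns audio bytes. A single
    group means every voice returned byte-identical audio.
    """
    groups: Dict[str, List[str]] = {}
    for voice in voices:
        digest = hashlib.sha256(synthesize(voice, text)).hexdigest()
        groups.setdefault(digest, []).append(voice)
    return groups
```

To use it, wrap the provider call, for example `group_voices_by_audio(lambda v, t: synthesize_openai(api_key, v, t), VOICES, "Hello from Dify")`, and ideally also save each result to a file and listen to it. If every voice lands in one group, the voice parameter is probably being ignored. Distinct digests are not conclusive on their own, because speech synthesis may not be byte-for-byte deterministic. If the direct calls sound clearly different but the Dify workflow output does not, that points toward the integration rather than the provider.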

To reply, just mention my name @dosu.


May 31 '25 09:05 dosubot[bot]

Can you reproduce this issue in the local environment?

Jun 07 '25 09:06 water-in-stone