elevenlabs-python

WebSocket text-to-speech with input streaming - always returns MP3 format

Open skinnynpale opened this issue 2 years ago • 1 comments

Hi there! No matter what output_format I pass in, MP3 is always returned. This happens only with WebSockets (input streaming).

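from elevenlabs import Voice, VoiceSettings
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")  # placeholder key

def text_stream():
    # Passing a generator as text selects the WebSocket input-streaming path
    yield "Hello my friend "
    yield "How are you today? "

audio_stream = client.generate(
    text=text_stream(),
    stream=True,
    model="eleven_multilingual_v2",
    output_format="pcm_24000",
    # Attempted workaround: also send output_format as a header
    request_options={
        "additional_headers": {
            "output_format": "pcm_24000"
        }
    },
    optimize_streaming_latency=3,
    voice=Voice(
        voice_id='yDVKFZyiAaNeYZvcliQG',
        settings=VoiceSettings(stability=0.32, similarity_boost=1.0, style=1.0, use_speaker_boost=True)
    )
)

# Print each raw audio chunk to inspect its format
for chunk in audio_stream:
    print(chunk)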
Example of response (looks like MP3):
b'\xff\xfb\x90\xc4\x00\x00\x10\xf9\x9b#\xac0\xc6\x89\xec\xb1(t\xf4\x8d\xb6\x00\x00\x00\x10\x84\x92mF\xa0j\x0e\x8fe\xa3@\x81\xa0\x88(\xa9\x00`\x1f\x11\xc9\xe6fk\xc9\x93&@\x82\x1e\xed\x81"\xc9\xa6}\xdd\xde\xc4DD\x18B\x13\xbb\xbb\xbd\xf6A\x08\x88\x88\x88\xbb\xbb<\x9awq\x11\x11\x11\x11d\xc9\xdd\xdd\xefh\x88 \x84D]\xdd\xdd\xdd\xd9\x08\x88\x88\x88\x8b\xbb\xbb\xbb\xbb\x88\x88\x88\x8c\x8b\xbb\xbb\xdf\xf4\xc4""\x08G\xdfv\xf7\xb6....
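
One way to confirm that a chunk is MP3 rather than raw PCM is to look at its first bytes: an MPEG audio frame header starts with 11 set bits, and some MP3 streams begin with an ID3 tag. The sketch below is only a heuristic, and looks_like_mp3 is a hypothetical helper name, not part of the SDK. Raw PCM samples can start with the same bytes by coincidence, so a single match is a hint and not proof.

```python
def looks_like_mp3(chunk: bytes) -> bool:
    """Heuristic: True if chunk starts with an ID3 tag or an MPEG frame sync."""
    if chunk[:3] == b"ID3":
        return True
    # MPEG audio frame sync: first byte 0xFF, top three bits of the second byte set.
    return len(chunk) >= 2 and chunk[0] == 0xFF and (chunk[1] & 0xE0) == 0xE0


# Leading bytes copied from the example response above.
first_chunk = b"\xff\xfb\x90\xc4\x00\x00\x10\xf9"
print(looks_like_mp3(first_chunk))  # True
```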

skinnynpale • Apr 03 '24 18:04

Yeah, I was having this issue and I think I solved it. Go to your site-packages directory and find the realtime_tts.py file. This file defines the RealtimeTextToSpeechClient class and its convert_realtime() method, which is what gets called when you use client.generate() with stream=True and a generator as the text, as in your code. But for some reason, generate() does not pass the output_format param to convert_realtime().

So what you need to do is find line 89 in realtime_tts.py and add the output format you want to the end of the URL, like so:

"wss://api.elevenlabs.io/", f"v1/text-to-speech/{jsonable_encoder(voice_id)}/stream-input?model_id={model_id}&output_format=pcm_24000"

That should do it!
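
For anyone who would rather not hard-code the format, here is a minimal sketch of building that same URL with the output format as a parameter. build_stream_input_url is a hypothetical helper, not a function in the SDK, and it uses only the standard library. The host, path, and query parameter names are taken from the URL above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "wss://api.elevenlabs.io/"


def build_stream_input_url(voice_id: str, model_id: str, output_format: str) -> str:
    """Build the stream-input WebSocket URL with output_format in the query string."""
    # Percent-encode the voice ID so it is safe to place in the URL path.
    path = f"v1/text-to-speech/{quote(voice_id, safe='')}/stream-input"
    query = urlencode({"model_id": model_id, "output_format": output_format})
    return f"{BASE_URL}{path}?{query}"


print(build_stream_input_url("yDVKFZyiAaNeYZvcliQG", "eleven_multilingual_v2", "pcm_24000"))
```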

ryandonahue1 • Apr 06 '24 19:04

@skinnynpale @ryandonahue1 appreciate the patience here. The SDK has been fixed in v1.5.0. Please let us know if you run into any more issues!
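
To check whether an environment already has the fixed release, one option is to compare the installed version against 1.5.0. This is a rough sketch: has_output_format_fix is a hypothetical helper, the distribution name "elevenlabs" is an assumption about how the package was installed, and the parsing is simplified to the leading numeric components (pre-release suffixes are ignored).

```python
import re
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (1, 5, 0)  # release named in the comment above


def has_output_format_fix(installed: str) -> bool:
    """Return True if a version string like '1.5.0' is at least 1.5.0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", installed)
    if match is None:
        raise ValueError(f"unrecognized version string: {installed!r}")
    major, minor, patch = (int(part or 0) for part in match.groups())
    return (major, minor, patch) >= FIXED_IN


try:
    # Assumes the SDK was installed under the distribution name "elevenlabs".
    installed = version("elevenlabs")
    print(installed, has_output_format_fix(installed))
except PackageNotFoundError:
    print("elevenlabs is not installed in this environment")
```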

dsinghvi • Jul 18 '24 12:07