Duration and buffer control for inference endpoints
Context: I'm calling Piper from a streaming application where each chunk of WAV audio is sent to clients as soon as it's ready, to reduce download time. My understanding is that PiperVoice.synthesize_stream_raw gives me the chunked audio bytes I want, without the WAV header fields (nframes, framerate, ...). This is great, but to stream partial results from Piper to clients I also need a way to control the speed/duration of the generated audio, just so I can derive nframes.
Problem: The header I send at the beginning contains an nframes of 0, simply because I don't yet know how long the generated audio will be. Is there any way to control either the speed per word or the duration of the audio? That would let me derive nframes in advance.
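(For clarity, the arithmetic I want to do is plain WAV math; a sketch with an example rate, where the real value would come from the voice config:)

# 16-bit mono PCM: one 2-byte frame per sample
sample_rate = 22050                       # example; use the voice's actual rate
duration_s = 3.5                          # if the duration could be fixed up front
nframes = int(duration_s * sample_rate)   # what I'd put in the header
data_bytes = nframes * 2 * 1              # sampwidth=2 bytes, nchannels=1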
My streaming flow from server to client works as follows:
- Upon a client request, create a buffer containing the WAV header and send it to the client:
import io
import wave

def create_wav_header():
    """Create the header bytes for a WAV file."""
    header_buf = io.BytesIO()
    with wave.open(header_buf, "wb") as wave_file:
        # pylint: disable=no-member
        wave_file.setframerate(sample_rate)  # the voice's rate, e.g. 22050 Hz
        wave_file.setsampwidth(2)   # 16-bit
        wave_file.setnchannels(1)   # mono
        wave_file.writeframes(b"")  # no frames yet, so the header gets nframes=0
    return header_buf.getvalue()
# tell client about file metadata
header_buf = create_wav_header()
sio.emit("piper_assets", ({"tsid": body.tsid, "eventID": event_id}, header_buf), to=sid)
- Then generate each chunk of speech from Piper and deliver it to the client as raw bytes, in order:
# start sending chunks of data
# tts(chunks) just calls PiperVoice.synthesize_stream_raw internally
for audio in tts(chunks):
    sio.emit("piper_assets", ({"tsid": body.tsid, "eventID": event_id}, audio), to=sid)