Kokoro-FastAPI Missing x-timestamps-path Response Header on dev/captioned

Describe the bug In the v0.2.4 release, the response from the dev/captioned_speech endpoint does not include the x-timestamps-path header, even when the request body explicitly sets "return_timestamps": true. The expected behavior is that the header is returned whenever timestamps are requested.

Screenshots or console output The following screenshots compare the response headers from v0.2.2 and v0.2.4 using the same request. In v0.2.2, the x-timestamps-path header is present, whereas in v0.2.4, it is missing.
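
For anyone who wants to repeat the comparison without screenshots, here is a minimal sketch of the header check. The helper name find_header and the sample header dictionaries are made up for illustration; they are not real server output.

```python
def find_header(headers, name="x-timestamps-path"):
    """Return the value of `name` from a header mapping, ignoring case, or None."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


# Placeholder header sets (assumed values, for illustration only)
with_header = {"X-Timestamps-Path": "placeholder.json"}
without_header = {"Content-Length": "123"}

print(find_header(with_header))
print(find_header(without_header))
```

With the requests library, response.headers is already a case-insensitive mapping, so response.headers.get("x-timestamps-path") performs the same check directly on a live response.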

Branch / Deployment used Docker container running the image pulled directly from ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.4

Operating System Docker Desktop v4.42.1 on Mac mini 4 (macOS 15.5 (24F74))

Additional context n/a

Jul 01 '25 05:07 opiuman

Hi. I'm experiencing the same issue.

Branch / Deployment used Pulled from ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.4

Operating System Docker desktop 28.2.2, build e6534b4, Windows 11 Home, Version 10.0.26100 Build 26100

Jul 03 '25 17:07 tongshen-yong

@opiuman @tongshen-yong Right now the timestamps are returned in the JSON response body, either in chunks (streaming) or all at once (non-streaming):

Taken from readme.md:

Streaming:

import requests
import base64
import json

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": True,
    },
    stream=True
)

# Open the output file in a context manager so it is closed when streaming ends
with open("output.mp3", "wb") as f:
    for chunk in response.iter_lines(decode_unicode=True):
        if chunk:
            chunk_json = json.loads(chunk)

            # Decode the base64-encoded audio to bytes
            chunk_audio = base64.b64decode(chunk_json["audio"].encode("utf-8"))

            # Append this chunk's audio to the output file
            f.write(chunk_audio)

            # Print word-level timestamps for this chunk
            print(chunk_json["timestamps"])

Non-streaming:

import requests
import base64
import json

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": False,
    },
    stream=False
)

# The whole response body is a single JSON object
audio_json = json.loads(response.content)

# Decode the base64-encoded audio to bytes
audio_bytes = base64.b64decode(audio_json["audio"].encode("utf-8"))

# Write the complete audio to disk
with open("output.mp3", "wb") as f:
    f.write(audio_bytes)

# Print word-level timestamps
print(audio_json["timestamps"])
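
If you want the audio and the timestamps collected together instead of printed as they arrive, something like the following might work. It is only a sketch: collect_captioned_chunks is a hypothetical helper, not part of Kokoro-FastAPI, and it assumes each JSON line carries a base64 "audio" string and a "timestamps" list, as in the README snippets above. Whether the timestamps in streamed chunks are relative to the chunk or to the whole utterance is not something this sketch verifies.

```python
import base64
import json


def collect_captioned_chunks(lines):
    """Merge JSON lines into (audio_bytes, timestamps).

    Assumes each non-empty line is a JSON object with a base64 "audio"
    string and a "timestamps" list (an assumption based on the README).
    """
    audio = bytearray()
    timestamps = []
    for line in lines:
        if not line:
            continue  # skip blank or keep-alive lines
        chunk = json.loads(line)
        audio.extend(base64.b64decode(chunk["audio"]))
        # Tolerate a missing or null "timestamps" field
        timestamps.extend(chunk.get("timestamps") or [])
    return bytes(audio), timestamps
```

For the streaming request this could be called as collect_captioned_chunks(response.iter_lines(decode_unicode=True)), and for the non-streaming request as collect_captioned_chunks([response.text]).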