Add TTFT (time to first token) to Langfuse traces
Is your feature request related to a problem? Please describe. We are developing several chatbot-like applications that require streaming the response from the LLM. There are a couple of metrics to look at, and one of them is TTFT (time to first token), which indicates how long the user needs to wait before seeing anything in the output dialog box. However, due to the way tracing spans are handled in the pipeline, the run invocation inside the component does not have direct access to the span, so we are not able to log this information to the tracer.
Describe the solution you'd like The simplest solution would be to give the component's run() method visibility into the tracing span. This could be a context variable that the methods inside the component have access to, but I am not very confident about the exact approach here.
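A minimal sketch of the context-variable idea, assuming a hypothetical current_span context variable that the tracer would set before invoking run() (none of these names exist in Haystack today):

# Hypothetical: expose the active tracing span to component code via contextvars.
from contextvars import ContextVar
from typing import Any, Optional

current_span: ContextVar[Optional[Any]] = ContextVar("current_span", default=None)

def run_component_with_span(component, span, **inputs):
    # The pipeline/tracer would set the variable right before calling run() ...
    token = current_span.set(span)
    try:
        return component.run(**inputs)
    finally:
        current_span.reset(token)

def on_first_token(timestamp):
    # ... and code inside the component (e.g. a streaming callback) could read it.
    span = current_span.get()
    if span is not None:
        span.set_tag("completion_start_time", timestamp.isoformat())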
Describe alternatives you've considered The only temporary solution right now is to directly manipulate the low-level tracing SDKs inside the streaming callback function, i.e. write a special callback that uploads the timestamp upon receiving the first SSE chunk.
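The workaround looks roughly like the sketch below; the generation object comes from the low-level Langfuse SDK, and how it gets wired into the callback is application-specific.

# Sketch of the current workaround: a stateful streaming callback that records
# the timestamp of the first SSE chunk and pushes it to Langfuse directly.
from datetime import datetime, timezone

class FirstTokenTimestampCallback:
    def __init__(self, generation):
        # `generation` is a Langfuse generation client created elsewhere via the
        # low-level SDK; passing it in here is an application-specific choice.
        self.generation = generation
        self.seen_first_chunk = False

    def __call__(self, chunk):
        if not self.seen_first_chunk:
            self.seen_first_chunk = True
            # Langfuse derives TTFT from completion_start_time.
            self.generation.update(completion_start_time=datetime.now(timezone.utc))
        # ... forward the chunk to the UI / the original streaming callback ...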
Note to self:
TTFT in Langfuse is automatically calculated when completion_start_time (the timestamp of the first token) is provided to the generation span, i.e. just call update on the generation span with this key/value, e.g. completion_start_time=datetime.now()
This could be done by attaching a custom streaming callback to the chat generator (most likely from our LangfuseTracer), consuming the first token in the callback and calling update on the generation span - perhaps directly in that callback, roughly as sketched below.
It remains to be investigated how we can eventually do this in async calls as well.
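A rough sketch of that wrapping idea, assuming the tracer has a handle on both the user's streaming callback and the Langfuse generation span (wrap_streaming_callback is an illustrative name, not an existing Haystack API):

# Illustrative only: wrap the user's streaming callback so the Langfuse
# generation span receives completion_start_time when the first chunk arrives.
from datetime import datetime, timezone

def wrap_streaming_callback(user_callback, generation_span):
    state = {"seen_first_chunk": False}

    def wrapped(chunk):
        if not state["seen_first_chunk"]:
            state["seen_first_chunk"] = True
            generation_span.update(completion_start_time=datetime.now(timezone.utc))
        if user_callback is not None:
            user_callback(chunk)

    return wrapped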
@LastRemote and @julian-risch
This was, in fact, not so hard to do. Forget the recommendation above. We simply need to timestamp the first chunk received from the LLM, and we should do this across all LLM chat generators. When streaming, we don't have data for prompt and completion tokens available, and what's really interesting is that even if we set "prompt_tokens" and "completion_tokens" to 0 (see branch above), Langfuse somehow counts them correctly. I'll check with them how this is actually done. The resulting trace is shown in the screenshot below.
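Conceptually the generator-side change is tiny; a hedged sketch (where exactly the timestamp ends up in the reply meta is an assumption here, not the final implementation):

# Sketch: inside a chat generator's streaming loop, timestamp the first chunk
# and surface it in the reply metadata so the Langfuse tracer can pick it up.
from datetime import datetime, timezone

def stream_and_timestamp(stream, streaming_callback, assemble_reply):
    completion_start_time = None
    chunks = []
    for chunk in stream:  # `stream` is the provider's streaming response
        if completion_start_time is None:
            completion_start_time = datetime.now(timezone.utc)
        chunks.append(chunk)
        if streaming_callback is not None:
            streaming_callback(chunk)
    reply = assemble_reply(chunks)  # provider-specific assembly, passed in
    reply.meta.setdefault("usage", {})["completion_start_time"] = completion_start_time
    return reply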
Interesting, I have never thought about this approach, but I guess it should work. So basically we store completion_first_chunk as a part of the usage meta so it can be accessed inside of haystack.component.output?
By the way, for OpenAI and the latest versions of Azure OpenAI models, if you set stream_options accordingly, the last streaming chunk will include the actual usage data. Additionally Langfuse will automatically count the usage tokens if the model is from OpenAI or Claude (see screenshot below).
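For reference, with the raw OpenAI Python SDK this looks roughly like the following; with include_usage enabled, OpenAI sends one extra final chunk whose usage field is populated and whose choices list is empty.

# Streaming with usage reporting enabled via stream_options.
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage is not None:
        print("\nusage:", chunk.usage)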
Aha nice @LastRemote, that's why I collected those meta chunks, hoping that one day this would work. The change in the Langfuse tracer was minimal as well: a one-LOC change in langfuse/tracer.py at around line 151, where we need to call:
span._span.update(usage=meta.get("usage") or None,
                  model=meta.get("model"),
                  completion_start_time=meta.get("usage", {}).get("completion_start_time"))
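For that call to pick up TTFT, the chat generator's reply meta would need to carry something like the shape below (my assumption based on this thread, not a documented contract):

# Assumed shape of the reply meta consumed by the tracer update above
# (whether the timestamp is a datetime or an ISO string is an implementation detail).
meta = {
    "model": "gpt-4o-mini",
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 87,
        "completion_start_time": "2024-09-25T10:15:03+00:00",  # timestamp of the first chunk
    },
}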
I'll speak to @julian-risch about scheduling this change in the near future but if you wish - feel free to create a PR that updates our chat generators with this change and we'll take it from there - I can review the PR and we can try out various chat generators together.
@vblagoje Okay sure, I will make a PR.
@LastRemote before we open a bunch of PRs or group together all the changes into one PR for all chat generators - let's do one trial PR and set the standard for other chat generators.
@vblagoje Sorry for the delayed response. I took some days off last week. Here we go: https://github.com/deepset-ai/haystack/pull/8444
Note that I only made minimal changes, since I am not exactly sure how to enable include_usage in the OpenAI SDK. I have a customized OpenAI implementation based on httpx that works, but that would require a complete refactor.
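The direction I would try (untested; it assumes generation_kwargs are forwarded unchanged to chat.completions.create, and it should only be set together with a streaming_callback, since stream_options is rejected on non-streaming requests):

# Untested sketch: ask OpenAI to append the usage chunk to the stream by
# forwarding stream_options through generation_kwargs.
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

generator = OpenAIChatGenerator(
    model="gpt-4o-mini",
    streaming_callback=lambda chunk: print(chunk.content, end=""),
    generation_kwargs={"stream_options": {"include_usage": True}},
)
result = generator.run(messages=[ChatMessage.from_user("Hello")])
print(result["replies"][0].meta.get("usage"))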
By the way, I also attempted to support Anthropic (including Bedrock Anthropic) models. It seems there's a mismatch in the usage data format between the Langfuse API and Anthropic when updating the Langfuse span, which causes the operation to fail. I believe it would be better for Langfuse to provide direct support for the raw Anthropic format.
Hey @LastRemote I'll check it out and get back to you. The most recent Anthropic Haystack release should work out of the box. I don't think Bedrock Anthropic works with Langfuse atm.
Hey @LastRemote how do we stand here? What else do we need to do to consider this issue done?
Hi @vblagoje , thanks for the reminder. We should close this.
An extra note for future readers: we only implemented TTFT tracking for OpenAIChatGenerator, as all generators will be deprecated in the near future. See #8444 for more information.
