
Sagemaker client issue

Open SuchethaChintha opened this issue 1 year ago • 3 comments

When I execute token_benchmark_ray.py I get the error below:

    File "token_benchmark_ray.py", line 456, in <module>
        run_token_benchmark(
    File "token_benchmark_ray.py", line 297, in run_token_benchmark
        summary, individual_responses = get_token_throughput_latencies(
    File "token_benchmark_ray.py", line 111, in get_token_throughput_latencies
        request_metrics[common_metrics.INTER_TOKEN_LAT] /= num_output_tokens
    TypeError: unsupported operand type(s) for /=: 'list' and 'int'
    (SageMakerClient pid=15473) Warning Or Error: 'SageMakerRuntime' object has no attribute 'invoke_endpoint_with_response_stream'
    (SageMakerClient pid=15473) None

SuchethaChintha avatar Jun 07 '24 09:06 SuchethaChintha

Hey @SuchethaChintha, did you fix that?

Tatiats7 avatar Nov 15 '24 16:11 Tatiats7

This is probably due to an older version of the SageMaker SDK, which would explain the missing `invoke_endpoint_with_response_stream` attribute. Updating it should fix the issue.
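A quick way to check is a short sketch like the one below. This assumes boto3/botocore is the SDK in question (the streaming call lives on the `sagemaker-runtime` client, and releases predating SageMaker's 2023 streaming launch lack it); the region is just a placeholder:

```python
# Sketch: verify the installed SDK exposes the streaming invocation API.
# Older boto3/botocore releases do not have invoke_endpoint_with_response_stream,
# which matches the warning in the traceback above.
try:
    import boto3
except ImportError:
    boto3 = None

if boto3 is None:
    print("boto3 is not installed; run `pip install -U boto3 botocore`")
else:
    print("boto3", boto3.__version__)
    # Creating a client does not require credentials or a live endpoint.
    client = boto3.client("sagemaker-runtime", region_name="us-east-1")
    if hasattr(client, "invoke_endpoint_with_response_stream"):
        print("streaming API available")
    else:
        print("SDK too old: run `pip install -U boto3 botocore`")
```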

vjaramillo avatar Dec 10 '24 17:12 vjaramillo

It seems that this error occurs because there's an inconsistency in how INTER_TOKEN_LAT is handled between different LLM clients.

SageMaker client keeps INTER_TOKEN_LAT as a list https://github.com/ray-project/llmperf/blob/f1d6bed47e4501b0e371082b41601b59ab55269f/src/llmperf/ray_clients/sagemaker_client.py#L109

On the other hand, OpenAI client sums the latencies before returning https://github.com/ray-project/llmperf/blob/f1d6bed47e4501b0e371082b41601b59ab55269f/src/llmperf/ray_clients/openai_chat_completions_client.py#L112

I think that modifying sagemaker_client.py as follows will fix it:

    metrics[common_metrics.INTER_TOKEN_LAT] = sum(time_to_next_token)

After this change, token_benchmark_ray.py still divides INTER_TOKEN_LAT by the number of output tokens, so the final metric (the mean inter-token latency) is computed correctly.
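To make the mismatch concrete, here is a minimal self-contained sketch (the metric key and latency values are invented for illustration) showing why the list-valued metric fails in-place division while the summed value works:

```python
# Hypothetical stand-ins for common_metrics.INTER_TOKEN_LAT and the
# per-token latencies collected by the client (seconds per token).
INTER_TOKEN_LAT = "inter_token_latency"
time_to_next_token = [0.02, 0.03, 0.025, 0.03]
num_output_tokens = len(time_to_next_token)

# What the SageMaker client currently returns: the raw list.
metrics = {INTER_TOKEN_LAT: time_to_next_token}
try:
    # This mirrors what token_benchmark_ray.py does with the metric.
    metrics[INTER_TOKEN_LAT] /= num_output_tokens
except TypeError as e:
    print("fails:", e)  # unsupported operand type(s) for /=: 'list' and 'int'

# What the OpenAI client returns: the summed total, which divides cleanly
# into a mean inter-token latency.
metrics = {INTER_TOKEN_LAT: sum(time_to_next_token)}
metrics[INTER_TOKEN_LAT] /= num_output_tokens
print("mean inter-token latency:", metrics[INTER_TOKEN_LAT])
```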

ryoshirahama avatar Feb 07 '25 20:02 ryoshirahama