High CPU Usage (100% per stream) in AWS Transcribe Streaming
Describe the bug
The AWS Transcribe Streaming SDK C++ implementation is consuming excessive CPU resources when processing audio streams. Each individual stream consumes approximately 100% CPU usage, scaling linearly with multiple streams (e.g., 3 streams = 300% CPU usage). This appears inefficient for an operation that should primarily be handling audio data transmission to AWS Transcribe service.
I have also tested the CRT-HTTP version and get similar results. I will follow up with a CRT-HTTP Docker version if requested.
CPU usage fluctuates slightly but mostly sticks around 100%. I have tested on a MacBook M1 running Docker and on multiple Linux EC2 instance types, with the same results.
Is this performance intended/expected?
Regression Issue
- [ ] Select this option if this issue appears to be a regression.
Expected Behavior
- Minimal CPU usage for streaming audio to AWS Transcribe service
- Efficient handling of multiple concurrent streams without linear CPU scaling
- CPU usage should primarily be focused on audio data transmission rather than processing
Current Behavior
- Each individual stream consumes 100% CPU
- Multiple streams scale linearly (e.g., 3 streams = 300% CPU)
- CPU usage monitored through the top command shows excessive utilization
- The high CPU usage persists throughout the entire streaming session
- Behavior is consistent across multiple test runs
Reproduction Steps
Here are minimal reproduction steps in a single Dockerfile, using the sample code.
Dockerfile
FROM public.ecr.aws/lts/ubuntu:22.04_stable
RUN apt-get update && \
    apt-get install -y build-essential cmake git libcurl4-openssl-dev zlib1g-dev libssl-dev curl ffmpeg
# Build SDK from source
RUN git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp && \
    cd aws-sdk-cpp && \
    mkdir build && \
    cd build && \
    cmake .. -G "Unix Makefiles" -DBUILD_ONLY="transcribestreaming;transcribe" && \
    make install
# Build transcribe samples
RUN git clone https://github.com/awsdocs/aws-doc-sdk-examples.git && \
    cd aws-doc-sdk-examples/cpp/example_code/transcribe-streaming && \
    mkdir build && \
    cd build && \
    cmake .. -G "Unix Makefiles" && \
    make
# Download and convert the test file
RUN cd /aws-doc-sdk-examples/cpp/example_code/transcribe-streaming/.media && \
    rm -f transcribe-test-file.wav && \
    curl -L "https://ia800202.us.archive.org/26/items/desophisticiselenchis/desophisticiselenchis_01_aristotle_pdf557.wav" -o original.wav && \
    ffmpeg -i original.wav -ar 8000 transcribe-test-file.wav && \
    rm original.wav
Please note:
- Test file: Using a longer audio file from archive.org (converted to match original specs)
Steps:
1. Build the Docker container using the provided Dockerfile:
docker build -t transcribe-cpu-test-example .
2. Run the container with AWS credentials:
docker run -d \
    -e AWS_ACCESS_KEY_ID=<key> \
    -e AWS_SECRET_ACCESS_KEY=<secret> \
    -e AWS_SESSION_TOKEN=<token> \
    --name transcribe-container \
    transcribe-cpu-test-example \
    tail -f /dev/null
3. In a first terminal, run:
docker exec -it transcribe-container bash
top # Keep this running to monitor CPU
4. In a second terminal, execute:
docker exec -it transcribe-container bash
/aws-doc-sdk-examples/cpp/example_code/transcribe-streaming/build/get_transcript
Repeat step 4 in additional terminals to observe CPU scaling with multiple streams.
You will notice high CPU usage.
Possible Solution
Potentially a busy-wait loop or inefficient resource handling in the streaming implementation.
Additional Information/Context
- This is just a single example; I have seen the same behavior in my own implementation with different file types as well
- Issue affects scalability of applications requiring multiple concurrent streams
AWS CPP SDK version used
Latest
Compiler and Version used
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Operating System and version
Ubuntu 22.04 LTS (running in Docker container)
It seems I may have to submit another, unrelated bug: building with CRT-HTTP breaks the sample code.
It hits this error: Transcribe streaming error Request Timeout Has Expired
This is unrelated to the current issue, though; just noting it for later.
@sbiscigl Who do you think would be best to respond to my issue?
I am eager to get this resolved.
Hello, from the example, the client configuration field httpLibPerfMode defaults to Http::TransferLibPerformanceMode::LOW_LATENCY.
Could you please retry locally with the following setting and check:
config.httpLibPerfMode = Http::TransferLibPerformanceMode::REGULAR;
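For context, here is a minimal, untested sketch of where that setting would go when constructing the client; the header paths are assumed from the usual SDK layout:

#include <aws/core/Aws.h>
#include <aws/core/client/ClientConfiguration.h>
#include <aws/transcribestreaming/TranscribeStreamingServiceClient.h>

int main() {
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        Aws::Client::ClientConfiguration config;
        // REGULAR lets libCurl decide how often to poll for input instead of
        // the default LOW_LATENCY mode, which polls continuously.
        config.httpLibPerfMode = Aws::Http::TransferLibPerformanceMode::REGULAR;
        Aws::TranscribeStreamingService::TranscribeStreamingServiceClient client(config);
        // ... start the stream as in the get_transcript sample ...
    }
    Aws::ShutdownAPI(options);
}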
@sbera87 Thank you for the response I will try this out first thing tomorrow and get back to you.
Is the current CPU usage I am seeing with LOW_LATENCY expected?
Yes, it's expected.
Same issue in the Python SDK. Here is a script to reproduce:
import amazon_transcribe
amazon_transcribe.__version__
# '0.6.2'

from amazon_transcribe.client import TranscribeStreamingClient
import threading
import asyncio

async def _transcribe_main(region):
    client = TranscribeStreamingClient(region=region)
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=8000,
        media_encoding="pcm",
        show_speaker_label=False,
        vocabulary_name=None,
        enable_partial_results_stabilization=True,
        partial_results_stability="high",
    )

def _run_async():
    asyncio.run(_transcribe_main("us-east-1"))

thread = threading.Thread(target=_run_async, daemon=True)
thread.start()

import psutil

def monitor_cpu():
    while True:
        cpu_usage = psutil.Process().cpu_percent(interval=0.2)
        print(f"Current Process CPU Usage: {cpu_usage}%", end="\r")

cpu_thread = threading.Thread(target=monitor_cpu, daemon=True)
cpu_thread.start()
# Current Process CPU Usage: 99.2%
@sbera87 Thank you for the response. I tried this as requested and saw a 50% reduction in CPU usage per transcription stream. This is helpful, but is there any way I could get it even lower?
- I would like around 10% CPU usage per stream, if possible.
- This is only supported in the CURL version, correct? Any intention of adding this to the CRT version?
Here is the Dockerfile I used to run this by the way:
FROM public.ecr.aws/lts/ubuntu:22.04_stable
RUN apt-get update && \
    apt-get install -y build-essential cmake git libcurl4-openssl-dev zlib1g-dev libssl-dev curl ffmpeg
# Build SDK from source
RUN git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp && \
    cd aws-sdk-cpp && \
    mkdir build && \
    cd build && \
    cmake .. -G "Unix Makefiles" -DBUILD_ONLY="transcribestreaming;transcribe" && \
    make install
# Build transcribe samples
RUN git clone https://github.com/awsdocs/aws-doc-sdk-examples.git && \
    cd aws-doc-sdk-examples/cpp/example_code/transcribe-streaming && \
    # Patch the source code to add the performance mode configuration
    sed -i '/Aws::Client::ClientConfiguration config;/a\ config.httpLibPerfMode = Http::TransferLibPerformanceMode::REGULAR;' get_transcript.cpp && \
    mkdir build && \
    cd build && \
    cmake .. -G "Unix Makefiles" && \
    make
# Download and convert the test file
RUN cd /aws-doc-sdk-examples/cpp/example_code/transcribe-streaming/.media && \
    rm -f transcribe-test-file.wav && \
    curl -L "https://ia800202.us.archive.org/26/items/desophisticiselenchis/desophisticiselenchis_01_aristotle_pdf557.wav" -o original.wav && \
    ffmpeg -i original.wav -ar 8000 transcribe-test-file.wav && \
    rm original.wav
Hi @blundercode,
The config setting Http::TransferLibPerformanceMode::REGULAR at the moment takes effect only on the default libCurl HTTP client.
It changes how often libCurl polls the input for more data to be sent. By default, in performance mode, it polls constantly, resulting in the high CPU utilization. In regular mode, we let libCurl control how often it polls for data input. Unfortunately, libCurl does not provide an option to control how often it will ask an easy_handle for data. However, if you don't provide data for some time, libCurl will slow down input polling to as little as once per second, and polling once per second is far too infrequent to account for 50% CPU utilization.
What I'm trying to say is that we don't have (or at least are not aware of) any other hot loops within the SDK code for streaming. We will try to profile using your Docker example, thank you for this.
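To illustrate the difference (a generic sketch, not SDK code): constant input polling spins a core at 100%, while an event-driven wait costs almost nothing while idle.

#include <atomic>
#include <condition_variable>
#include <mutex>

std::atomic<bool> dataReady{false};
std::mutex mu;
std::condition_variable cv;

// Performance-mode style: ask "any data yet?" as fast as the CPU allows.
// This alone pins one core per stream while waiting for input.
void busyPoll() {
    while (!dataReady.load()) { /* spin */ }
}

// Event-driven style: sleep until the producer signals; ~0% CPU while idle.
void blockingWait() {
    std::unique_lock<std::mutex> lock(mu);
    cv.wait(lock, [] { return dataReady.load(); });
}

// Producer side: publish data and wake the waiter.
void signalData() {
    dataReady.store(true);
    cv.notify_one();
}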
In the meantime, I'd also suggest modifying the streaming sample: I can see it is actually streaming 5x faster than the real bitrate, because the streaming loop sleeps for 25 ms while each chunk holds 125 ms of audio: https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/cpp/example_code/transcribe-streaming/get_transcript.cpp#L110-L111 https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/cpp/example_code/transcribe-streaming/get_transcript.cpp#L24
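A minimal sketch of real-time pacing (assuming 16-bit mono PCM at 8 kHz and the 125 ms chunks mentioned above; the names and constants are illustrative, not the sample's):

#include <chrono>
#include <cstddef>
#include <thread>

constexpr std::size_t kSampleRateHz  = 8000;  // audio sample rate
constexpr std::size_t kBytesPerFrame = 2;     // 16-bit mono PCM
constexpr std::size_t kChunkBytes    = 2000;  // 1000 frames = 125 ms of audio

// Hypothetical helper: call after writing each chunk so the writer runs at
// the audio bitrate (sleep 125 ms per chunk) instead of 5x faster.
void paceAfterChunk() {
    const auto chunkMs = kChunkBytes / kBytesPerFrame * 1000 / kSampleRateHz;
    std::this_thread::sleep_for(std::chrono::milliseconds(chunkMs));
}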
Please also modify sample to use PooledThreadExecutor:
#include <aws/core/utils/threading/PooledThreadExecutor.h>
...
clientConfig.executor = Aws::MakeShared<Aws::Utils::Threading::PooledThreadExecutor>("SomeTag", 5 /* number of worker threads to spin*/);
as it is more efficient than the original legacy default executor.
Another performance-tuning option here might be the streaming buffer size: https://github.com/aws/aws-sdk-cpp/blob/main/src/aws-cpp-sdk-core/include/aws/core/utils/stream/ConcurrentStreamBuf.h#L33 But for the given sample code, 8 MB is enough.
We have a longer-term plan to rewrite streaming support to use the AWS CRT async HTTP client (or a libcurl multi handle), as it will better suit our needs, but we cannot provide any estimate.
Best regards, Sergey
Hi @SergeyRyabinin
Thank you for the well-written and thorough response.
I am glad the Docker setup is helpful; I tried to make it as plug-and-play as possible so anyone could run and test it easily.
I will test out your extra suggestions early next week and get back to you with the results.
I agree that 50% still feels high, so hopefully these extra steps will help.
We are trying to use the streaming at a large scale so any optimizations we can get will be very beneficial.
The usual live-transcription stream design is a source (a file or audio device with a constant read speed), a stream buffer (to absorb the delays between the source and the sink), and a sink (the {AWS SDK, network, transcribe engine} triple, which runs at a variable speed). Instead of the application having to calculate how much data to send, and how often, to accommodate the variable sink speed, the sink can notify the application when it is ready to accept more data.
This simplifies the application design considerably and improves application performance. This is the design used by Google's live transcription streaming, where CPU usage with multiple channels is around 5%.
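A minimal sketch of that design (generic C++, not an SDK API): a bounded buffer where the producer blocks until the sink signals it is ready for more data, so neither side busy-polls.

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

class BoundedAudioBuffer {
public:
    explicit BoundedAudioBuffer(std::size_t maxChunks) : maxChunks_(maxChunks) {}

    // Producer side: blocks (no CPU spin) until the sink has drained a chunk.
    void push(std::vector<char> chunk) {
        std::unique_lock<std::mutex> lock(mu_);
        notFull_.wait(lock, [&] { return chunks_.size() < maxChunks_; });
        chunks_.push_back(std::move(chunk));
        notEmpty_.notify_one();
    }

    // Sink side: blocks until a chunk is available, then signals readiness.
    std::vector<char> pop() {
        std::unique_lock<std::mutex> lock(mu_);
        notEmpty_.wait(lock, [&] { return !chunks_.empty(); });
        std::vector<char> chunk = std::move(chunks_.front());
        chunks_.pop_front();
        notFull_.notify_one();  // "ready to accept more data"
        return chunk;
    }

private:
    std::mutex mu_;
    std::condition_variable notFull_, notEmpty_;
    std::deque<std::vector<char>> chunks_;
    std::size_t maxChunks_;
};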
@SergeyRyabinin
So we have been using it in production for some time, but REGULAR mode is just not sufficient for us: it keeps dropping transcriptions in flight. If we use LOW_LATENCY we get massive CPU usage; if we use REGULAR it is so sporadic that we lose lots of transcripts.
Do you have any thoughts on how to resolve or optimize my situation?
@SergeyRyabinin @blundercode Any fix for this CPU usage? This has been an issue; neither LOW_LATENCY nor REGULAR solves the actual problem. Because of it, I had to rebuild the application using https://docs.aws.amazon.com/transcribe/latest/dg/getting-started-http-websocket.html
@vdharashive No, I couldn't find a solution; it's internal to their SDK. The only mitigation I found is to pin all transcribe-streaming processes to a single core, so multiple streams at least only saturate one core between them.
It's really a shame, because REGULAR mode is extremely erratic in timing in all my testing.
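For reference, a sketch of that pinning mitigation on Linux (equivalent to launching the process with taskset -c 0; error handling omitted):

#include <sched.h>  // Linux-specific: sched_setaffinity

// Restrict the calling process (pid 0) to CPU core 0 so multiple streams
// contend for one core instead of each saturating its own.
void pinToCoreZero() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);
}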
@SergeyRyabinin
Curious whether this has been resolved. I noticed this post in a conversation on the other SDK, where the AWS dev says it has been resolved. Is it also resolved in the C++ SDK?
https://github.com/awslabs/amazon-transcribe-streaming-sdk/issues/121