Humongous allocation with CRT client when downloading chunks of files
Describe the bug
When downloading a 4 MB chunk from a large file, we were seeing humongous allocations from native code: the CRT client passes a heap-allocated byte[] bodyBytesIn to the caller (see the adapter snippet below).
Can we make it off-heap?
Version: aws-crt:0.33.9
Lowering the part size helps by producing smaller arrays, but there is still heap pressure.
We are using 8 MB as the part size. Will lowering it to 512 KB affect partial-download or full-file download performance?
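For context, a minimal sketch of how that part size is configured on the CRT-based client (the value shown is illustrative rather than our exact setup):
import software.amazon.awssdk.services.s3.S3AsyncClient;

// Each part of this size currently surfaces on the Java heap as a byte[] of the same size,
// so lowering minimumPartSizeInBytes shrinks the individual allocations.
S3AsyncClient s3Async = S3AsyncClient.crtBuilder()
        .minimumPartSizeInBytes(8L * 1024 * 1024)   // 8 MB parts
        .build();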
class S3MetaRequestResponseHandlerNativeAdapter {
    int onResponseBody(byte[] bodyBytesIn, long objectRangeStart, long objectRangeEnd) {
        return this.responseHandler.onResponseBody(ByteBuffer.wrap(bodyBytesIn), objectRangeStart, objectRangeEnd);
    }
}
Regression Issue
- [ ] Select this option if this issue appears to be a regression.
Expected Behavior
An off-heap buffer should be used.
Current Behavior
A heap buffer is used.
Reproduction Steps
Download a 4 MB chunk from a large file.
GetObjectRequest request = GetObjectRequest.builder()
        .bucket(bucket)
        .key(rkey)
        .range(String.format("bytes=%d-%d", pos, pos + size - 1))
        .build();
return s3Async.getObject(request, asyncResponseTransformerImp);
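asyncResponseTransformerImp is our own transformer and isn't shown here; a minimal stand-in, assuming the SDK's AsyncResponseTransformer.toPublisher() and the request/client variables above, that makes the delivered buffer sizes visible would be:
// Hypothetical stand-in for asyncResponseTransformerImp: print the size and kind of each
// ByteBuffer the client delivers. With the CRT client, a ranged GET typically arrives as a
// single part-sized heap buffer (a ByteBuffer wrapping the byte[] handed over from native code).
s3Async.getObject(request, AsyncResponseTransformer.toPublisher())
        .thenCompose(publisher -> publisher.subscribe(buf ->
                System.out.println("got " + buf.remaining() + " bytes, direct=" + buf.isDirect())))
        .join();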
Possible Solution
No response
Additional Information/Context
No response
AWS Java SDK version used
aws-crt:0.33.9
JDK version used
JDK22
Operating System and version
Ubuntu 22.04 LTS. Also reproduced on Mac M1.
Hi @jasonstack thank you for reaching out. I'm trying to better understand your use case.
Are you manually downloading 4mb chunks of a file, then manually combining the chunks to form the file back?
- If the answer is yes, have you tried using S3TransferManager? S3TransferManager downloadFile (and other APIs too, see the documentation) will automatically use multipart operations when used with the CRT-based S3 client:

  FileDownload downloadFile = transferManager.downloadFile(downloadFileRequest);

- If you can use s3Async.getObject(request, Path) instead of using an asyncResponseTransformer, this will write the file directly to disk and greatly improve memory usage. This was released in SDK version 2.32.11. (See the sketch below.)

- Which version of the Java SDK are you using? aws-crt:0.33.9 is fairly old, so I'm assuming you're using an old version of the Java SDK as well. We improved the memory management in S3 operations in recent releases, so I recommend you test a more recent SDK version (2.38.7 is the latest) to see if you still experience the high memory allocation.
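A minimal sketch of that Path-based variant, reusing the bucket/key/range variables from your reproduction snippet (the destination path is a placeholder):
// Sketch: stream the ranged GET straight to a file on disk instead of through a transformer,
// so the response body never has to be surfaced to your code as in-memory buffers.
CompletableFuture<GetObjectResponse> download = s3Async.getObject(
        GetObjectRequest.builder()
                .bucket(bucket)
                .key(rkey)
                .range(String.format("bytes=%d-%d", pos, pos + size - 1))
                .build(),
        Paths.get("/tmp/part-" + pos));   // placeholder destination file
download.join();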
Let us know if you have questions about the suggestions above.
If you upgrade to a more recent version and still see the memory issue, please provide a self-contained SSCCE code example we can use to reproduce, and a heap dump.
Thanks for the reply.
Are you manually downloading 4mb chunks of a file, then manually combining the chunks to form the file back?
Not combining the chunks.
If you can use s3Async.getObject(request, Path) instead of using asyncResponseTransformer, this will write the file directly to disk and greatly improve memory usage.
Right, this might help to reduce memory pressure. Initially, the plan was to serve the buffer before writing to disk to reduce some latency. It's worth a try.
Which version of the Java SDK are you using? aws-crt:0.33.9 is fairly old, so I'm assuming you're using an old version of the Java SDK as well. We improved the memory management in S3 operations in recent releases, so I recommend you test a more recent SDK version (2.38.7 is the latest)
It's s3:2.31.18.
According to the latest CRT native adapter code, it still passes a byte[] from native code, so I don't think it avoids the humongous allocation.
Right, this might help to reduce memory pressure. Initially, the plan was to serve the buffer before writing to disk to reduce some latency. It's worth a try.
Let us know if writing to disk works for your use case.
If you want us to look further into the memory issue, please send us a repro code and heap dump.
Hi @debora-ito,
I'm a colleague of @jasonstack, and I'd like to try to rephrase the issue a bit.
The main context is that we would like to use the s3Async.getObject(request, transformer) API and not s3Async.getObject(request, Path). I don't want to focus too much on why: the API exists, so surely it's meant to be used. Suffice to say that we need access to the retrieved bytes in memory, regardless of whether we also write them to disk.
Further, we care about efficiency, so we try to follow S3's best practices and use fairly large byte-range requests, say 8 MB. In particular, the S3 best-practices whitepaper says "Typical sizes for byte-range requests are 8 MB or 16 MB" (page 5), and I believe S3TransferManager also uses 8 MB ranges under the hood for that reason.
The issue is that if you follow those best practices with the CRT client and the s3Async.getObject(request, transformer) API, the client allocates contiguous 8 MB byte arrays on the heap, which count as humongous allocations for any sanely configured JVM (with G1, any single allocation larger than half a heap region is humongous, and 8 MB exceeds that threshold at common region sizes). And humongous allocations are far from optimal from a GC point of view.
This isn't specific to our use case, this is how the code works: s_on_s3_meta_request_body_callback in s3_client.c always allocates a JNI byte array for the full response body, and calls the S3MetaRequestResponseHandlerNativeAdapter#onResponseBody method mentioned in the description only once, with that full 8 MB byte array.
Note that this isn't an issue with the Netty-based client, as it relies on Netty's default HttpClientCodec, which processes the response in chunks of 8 KB.
To be clear, for our own use case, we can and are working around this, in particular by switching from the CRT client to the Netty-based one (and yes, s3Async.getObject(request, Path) is another workaround where appropriate).
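For reference, that switch is just a different client builder; a minimal sketch (the maxConcurrency value is illustrative, not our exact setting):
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.services.s3.S3AsyncClient;

// Netty-based async client: the response body is delivered to the transformer in small
// (~8 KB) chunks, so no single part-sized heap array is allocated for a ranged GET.
S3AsyncClient nettyS3 = S3AsyncClient.builder()
        .httpClientBuilder(NettyNioAsyncHttpClient.builder()
                .maxConcurrency(64))   // illustrative value
        .build();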
But I admit that "following documented best practices with a public API leads to notoriously bad GC behavior" feels like a bug to me, so we figured this might be worth reporting.
As a small aside, since the CRT client uses native code, we were initially (a bit naively) hoping that the response bytes would be passed to the transformer as a direct ByteBuffer over the network buffer, which would bypass this humongous-allocation issue (even if the body is passed to the transformer as a single buffer) and would be great performance-wise. This is the context of the "Can we make it off-heap?" ask in the description. In hindsight, we understand that API-wise this would make it easy to create use-after-free issues and is probably not an option. Still, we'd love such an option and would use it if it were available (even if that meant promising not to use the ByteBuffer passed to Consumer#onNext after the call returns within the transformer).
CRT was primarily designed from the perspective of wanting to get multi-GB transfers to go as quickly as possible, and it makes some tradeoffs to support that. The primary one is that CRT will burn memory to support scaling to multiple parallel transfers. What this means on the native side is allocating multiple part-sized buffers that are all written in parallel (ignoring buffer reuse, etc., for simplicity's sake). Those buffers are delivered across the JNI boundary in order as soon as data is available, and since on the native side each is already a part-sized buffer, Java is handed a part-sized buffer.
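For completeness, a minimal sketch of the CRT builder knobs that bound how much of that parallel buffering happens at once (values are illustrative; maxNativeMemoryLimitInBytes is assumed to be present in your SDK version, as it was only added in relatively recent releases):
import software.amazon.awssdk.services.s3.S3AsyncClient;

// The native memory limit caps CRT's buffer pool, and concurrency bounds how many
// part-sized buffers can be in flight in parallel.
S3AsyncClient crtS3 = S3AsyncClient.crtBuilder()
        .maxNativeMemoryLimitInBytes(256L * 1024 * 1024)   // cap on CRT's native buffering
        .maxConcurrency(16)                                // parallel requests / parts
        .build();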
We are aware that passing big buffers across JNI is not ideal behavior, and eliminating the need to create huge Java arrays is one area we are actively working on. It sounds like in your case you are interested in writing data directly to a file; if the CRT client could write directly to disk instead of passing data to the Java side, which then writes to the file, would that have sidestepped your memory concern?
It sounds like in your case, you are interested in writing data directly to file
Not quite. We have cases where we're essentially applying some transformation to the downloaded bytes. We do happen to write the resulting bytes to disk ultimately, but the point is that it's not direct-to-file. (We have multiple use cases, and some of them are multi-GB transfers directly to file; for those we're using the CRT client successfully.)
We are aware that passing big buffers cross jni is not ideal behavior and it is one area that we are actively working on to eliminate the need to create huge java arrays.
Nice. Our goal with this issue was mostly to say that we ran into the consequences of this current non-ideal behavior, and I guess it surprised us. We had no similar issue with the Netty-based client, and saw no prior warning that this might be a problem when switching to the CRT one (maybe it is documented somewhere and we missed it, though).
But we have switched back to the Netty-based client for now and aren't blocked on this. If you're tracking elsewhere the improvement to "eliminate the need to create huge java arrays", feel free to close this ticket.
Note: I think that in the Netty flow, where it is streaming to memory, it will default to only having one HTTP connection open. So it's a tradeoff between performance and memory usage: the Netty flow is much better from a memory perspective but is potentially slower, while the CRT approach can achieve much higher bandwidth but its memory usage is non-ideal for now. I'll leave the decision on whether to keep this open to the Java team. We are internally tracking potential approaches to improve the Java-CRT integration, but I don't think there is anything on GitHub right now.