OutOfMemoryError on GHRepository.readZip due to lack of streaming
Describe the bug I encounter an OutOfMemoryError in GHRepository.readZip even though a reasonable amount of heap is free.
To Reproduce Steps to reproduce the behavior:
- Consider a large GitHub repository (e.g. 50MB)
- Call GHRepository.readZip while being somewhat memory-constrained (e.g. with -Xmx128M)
- Encounter an OutOfMemoryError
Expected behavior I would expect the whole InputStream not to be materialized in memory by org.kohsuke.github.connector.GitHubConnectorResponse.ByteArrayResponse.bodyStream(); the response body should be streamed instead.
Additional context A typical stack looks like:
Java heap space: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
at org.apache.commons.io.output.AbstractByteArrayOutputStream.toByteArrayImpl(AbstractByteArrayOutputStream.java:366)
at org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:163)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:2241)
at org.kohsuke.github.connector.GitHubConnectorResponse$ByteArrayResponse.bodyStream(GitHubConnectorResponse.java:187)
at org.kohsuke.github.Requester.lambda$fetchStream$3(Requester.java:117)
at org.kohsuke.github.Requester$$Lambda$568/0x00000008404da440.apply(Unknown Source)
at org.kohsuke.github.GitHubClient.createResponse(GitHubClient.java:485)
at org.kohsuke.github.GitHubClient.sendRequest(GitHubClient.java:387)
at org.kohsuke.github.GitHubClient.sendRequest(GitHubClient.java:355)
at org.kohsuke.github.Requester.fetchStream(Requester.java:117)
at org.kohsuke.github.GHRepository.downloadArchive(GHRepository.java:3246)
at org.kohsuke.github.GHRepository.readZip(GHRepository.java:3195)
Agreed, the current implementation has this issue. We chose to do it that way to address other problems in more common scenarios. Allocating some memory that will be quickly returned is okay in most cases.
There are a number of possible fixes, but they trade added complexity for memory efficiency. One is to use file-based caches. Another is to provide a different code path for endpoints likely to return large streams: something that takes a function and lets the client read the HTTP response body stream directly.
PRs are welcome, but if you decide to do it, sketch out what you have in mind before implementing it fully so we can agree on the design.
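To make the design discussion concrete, here is a minimal sketch of the last idea above: a code path that hands the raw response body stream to a caller-supplied function instead of buffering it into a byte array first. It uses only the JDK. The names `StreamingSketch`, `BodyHandler`, `withBody` and `spoolToFile` are hypothetical and are not part of the library's API; how such a hook would be wired into Requester or GitHubConnectorResponse is left open.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not the library's API: all names here are assumptions.
final class StreamingSketch {

    /** Hypothetical callback type: consumes the body stream and returns a result. */
    @FunctionalInterface
    interface BodyHandler<T> {
        T apply(InputStream body) throws IOException;
    }

    /**
     * Passes the body stream straight to the handler and closes it afterwards.
     * Nothing is copied into a byte[] here, so heap usage is decided by the handler.
     */
    static <T> T withBody(InputStream body, BodyHandler<T> handler) throws IOException {
        try (InputStream in = body) {
            return handler.apply(in);
        }
    }

    /**
     * Example handler body: copies the stream to a file through a fixed 8 KiB buffer,
     * so memory use stays constant regardless of the archive size.
     * Returns the number of bytes written.
     */
    static long spoolToFile(InputStream in, Path target) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        try (OutputStream out = Files.newOutputStream(target)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }
}
```

A caller would use it as `StreamingSketch.withBody(responseBody, in -> StreamingSketch.spoolToFile(in, tmpFile))`, where `responseBody` stands for whatever stream the HTTP connector exposes. Spooling to a temporary file also covers the file-based cache option, at the cost of disk I/O and cleanup of the file.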