
OutOfMemoryError on GHRepository.readZip due to lack of streaming

Open blacelle opened this issue 3 years ago • 1 comments

Describe the bug: I encounter an OutOfMemoryError in GHRepository.readZip despite having a reasonable amount of free heap.

To Reproduce: Steps to reproduce the behavior:

  1. Take a large GitHub repository (e.g. 50MB)
  2. Call GHRepository.readZip while being a bit memory constrained (e.g. with -Xmx128M)
  3. Encounter an OutOfMemoryError

Expected behavior: I would expect the whole InputStream not to be materialized in memory by org.kohsuke.github.connector.GitHubConnectorResponse.ByteArrayResponse.bodyStream()

Additional context: A typical stack trace looks like:

Java heap space: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
	at org.apache.commons.io.output.AbstractByteArrayOutputStream.toByteArrayImpl(AbstractByteArrayOutputStream.java:366)
	at org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:163)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:2241)
	at org.kohsuke.github.connector.GitHubConnectorResponse$ByteArrayResponse.bodyStream(GitHubConnectorResponse.java:187)
	at org.kohsuke.github.Requester.lambda$fetchStream$3(Requester.java:117)
	at org.kohsuke.github.Requester$$Lambda$568/0x00000008404da440.apply(Unknown Source)
	at org.kohsuke.github.GitHubClient.createResponse(GitHubClient.java:485)
	at org.kohsuke.github.GitHubClient.sendRequest(GitHubClient.java:387)
	at org.kohsuke.github.GitHubClient.sendRequest(GitHubClient.java:355)
	at org.kohsuke.github.Requester.fetchStream(Requester.java:117)
	at org.kohsuke.github.GHRepository.downloadArchive(GHRepository.java:3246)
	at org.kohsuke.github.GHRepository.readZip(GHRepository.java:3195)
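
For illustration: the trace above ends in IOUtils.toByteArray, which holds the entire response body in a single array. Below is a minimal Java sketch of the alternative, copying a stream through a fixed-size buffer so that heap use does not grow with the archive size. It makes no claim about the library's internals. The class StreamingCopy and the helper zeros are hypothetical names used only for this example, and zeros merely stands in for a large HTTP response body.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical example class, not part of github-api.
final class StreamingCopy {

    /** Copies in to out through a fixed 8 KiB buffer and returns the byte count. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    /** Synthetic stream of `size` zero bytes, produced lazily (stand-in for a large body). */
    static InputStream zeros(long size) {
        return new InputStream() {
            private long remaining = size;

            @Override
            public int read() {
                if (remaining <= 0) {
                    return -1;
                }
                remaining--;
                return 0;
            }

            @Override
            public int read(byte[] b, int off, int len) {
                if (len == 0) {
                    return 0;
                }
                if (remaining <= 0) {
                    return -1;
                }
                int n = (int) Math.min(len, remaining);
                Arrays.fill(b, off, off + n, (byte) 0);
                remaining -= n;
                return n;
            }
        };
    }

    public static void main(String[] args) throws IOException {
        // 50 MB is the repository size mentioned in the report; the data goes to disk, not to the heap.
        Path target = Files.createTempFile("archive", ".zip");
        try (InputStream in = zeros(50L * 1024 * 1024);
             OutputStream out = Files.newOutputStream(target)) {
            System.out.println("copied " + copy(in, out) + " bytes");
        } finally {
            Files.deleteIfExists(target);
        }
    }
}
```

The only allocation that scales here is the 8 KiB buffer, so a run of this sketch should not need heap proportional to the 50 MB payload.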

blacelle, Mar 15 '22 21:03

Agreed, the current implementation has this issue. We chose to do it that way to address other problems in more common scenarios. Allocating some memory that will be quickly returned is okay in most cases.

There are a number of possible fixes, but they add complexity in exchange for memory efficiency. One is using file-based caches. Another is providing a different code path for endpoints likely to return large streams: something that takes a function and lets the client read the HTTP response body stream directly.
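
A rough Java sketch of how the file-based idea and the function-taking idea might combine, offered only as a starting point for discussion and not as the library's design: the body is spooled to a temporary file, and a caller-supplied function then reads it back as a stream. SpooledBody, StreamFunction, withSpooledBody and countZipEntries are all hypothetical names invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical example class, not part of github-api.
final class SpooledBody {

    /** Hypothetical callback type: a function that consumes a body stream. */
    @FunctionalInterface
    interface StreamFunction<T> {
        T apply(InputStream body) throws IOException;
    }

    /** Spools the body to a temp file, then hands a file-backed stream to the handler. */
    static <T> T withSpooledBody(InputStream body, StreamFunction<T> handler) throws IOException {
        Path spool = Files.createTempFile("response-body", ".tmp");
        try {
            // createTempFile already created the file, so REPLACE_EXISTING is required.
            Files.copy(body, spool, StandardCopyOption.REPLACE_EXISTING);
            try (InputStream replay = Files.newInputStream(spool)) {
                return handler.apply(replay);
            }
        } finally {
            Files.deleteIfExists(spool);
        }
    }

    /** Example handler: counts zip entries while reading the stream incrementally. */
    static int countZipEntries(InputStream in) throws IOException {
        int count = 0;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            while (zip.getNextEntry() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny zip in memory purely to have something to feed through the sketch.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf)) {
            zip.putNextEntry(new ZipEntry("README.md"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        int entries = withSpooledBody(new ByteArrayInputStream(buf.toByteArray()),
                SpooledBody::countZipEntries);
        System.out.println("entries: " + entries);
    }
}
```

The trade-off is the one described above: the temp file adds disk I/O and cleanup logic in exchange for bounded heap use. A variant that passes the live response stream to the handler would skip the spool file, but the connection would then have to stay open until the handler returns.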

PR welcome, but if you decide to do it, sketch out what you have in mind before implementing it fully so we can agree on the design.

bitwiseman, Mar 16 '22 01:03