bagit-java

Verifying a bag throws an exception: Unable to create new native threads

Open chinuhub opened this issue 7 years ago • 5 comments

When submitting an issue please include:

  • Bagit library version 5.1.1
  • MacOS version 10.12.6
  • If available Attach all logs, and or output, and or screenshots

Given

  • I have a bag of size 5.8 GB. Number of data files is 28730.

When

  • I run the bag.verify() method on this bag from Java (JDK 1.8)

Then

  • After some time, verify throws an exception:

    Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
        at gov.loc.repository.bagit.verify.PayloadVerifier.checkAllFilesListedInManifestExist(PayloadVerifier.java:146)
        at gov.loc.repository.bagit.verify.PayloadVerifier.verifyPayload(PayloadVerifier.java:103)

The exception is thrown in the method checkAllFilesListedInManifestExist(Set<Path> files) in PayloadVerifier.java, at the line this.executor.execute(new CheckIfFileExistsTask(file, missingFiles, latch)); when a new task is about to be submitted to the executor.

When I checked the thread creation limit on the Mac, it was 709.
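
The trace is consistent with how a cached thread pool behaves: Executors.newCachedThreadPool() starts a new thread for each submitted task whenever no idle thread is available, so a burst of tasks can run into the OS thread limit. Here is a minimal JDK-only sketch of that growth. PoolGrowthDemo and peakThreads are hypothetical names, not part of bagit-java, and the tasks block on a latch only to mimic work that has not finished yet.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolGrowthDemo {

    /** Submits `tasks` blocking jobs and returns the peak number of threads the pool created. */
    public static int peakThreads(ThreadPoolExecutor pool, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    release.await(); // keep the worker busy until every task is submitted
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        release.countDown(); // let all tasks finish
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return pool.getLargestPoolSize();
    }

    public static void main(String[] args) {
        // Kept small (50 tasks) so the demo itself stays far below any thread limit.
        int cached = peakThreads((ThreadPoolExecutor) Executors.newCachedThreadPool(), 50);
        int fixed = peakThreads((ThreadPoolExecutor) Executors.newFixedThreadPool(4), 50);
        System.out.println("cached pool peak threads: " + cached); // 50: one thread per task
        System.out.println("fixed pool peak threads:  " + fixed);  // 4: extra tasks wait in the queue
    }
}
```

With one task per data file, the same pattern on a 28730-file bag would ask for far more threads than the 709 available here.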

chinuhub, Jul 11 '18 16:07

When instantiating BagVerifier I am now using Executors.newSingleThreadExecutor(). It fixed the issue.

chinuhub, Jul 11 '18 19:07

For a large bag you may want to use Executors.newFixedThreadPool() and specify how many threads to use, rather than a single thread, since multi-threading will be much faster (as long as you aren't hitting I/O problems).

@acdha does Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()) seem reasonable for being the default instead of Executors.newCachedThreadPool()?
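
To make the suggestion concrete, here is a rough sketch of a bounded existence check using only the JDK. BoundedExistenceCheck and findMissing are hypothetical names, and this is not the library's actual PayloadVerifier code; it only mirrors the latch-plus-executor pattern visible in the stack trace, with the pool capped at the CPU count.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BoundedExistenceCheck {

    /** Checks each path on the given executor and returns the ones that do not exist. */
    public static Set<Path> findMissing(Collection<Path> files, ExecutorService executor) {
        Set<Path> missing = ConcurrentHashMap.newKeySet(); // safe for concurrent adds
        CountDownLatch latch = new CountDownLatch(files.size());
        for (Path file : files) {
            executor.execute(() -> {
                try {
                    if (!Files.exists(file)) {
                        missing.add(file);
                    }
                } finally {
                    latch.countDown(); // always count down so the caller cannot hang
                }
            });
        }
        try {
            latch.await(); // wait until every file has been checked
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return missing;
    }

    public static void main(String[] args) {
        // At most one worker thread per CPU; remaining tasks wait in the pool's queue.
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            Set<Path> missing = findMissing(
                Arrays.asList(
                    Paths.get(System.getProperty("java.io.tmpdir")),
                    Paths.get("no-such-file-for-this-demo")),
                pool);
            System.out.println("missing: " + missing);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because a fixed pool queues tasks instead of spawning threads, the thread count stays the same whether the bag has 30 files or 30000.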

jscancella, Jul 17 '18 17:07

That's exactly what I was wondering — as long as someone can override it for unusual cases, the CPU count seems like a reasonable default.

acdha, Jul 17 '18 18:07

Yup, they can, because a different user asked to be able to fine-tune that thread pool. That's why there are 4 different constructors on that class: to override various parts of it or keep the defaults.

jscancella, Jul 17 '18 18:07

Yeah, I figure there are a few cases where someone would need to change that value but it really does seem like most of them would be edge cases.

acdha, Jul 17 '18 18:07