Referential integrity check does not complete with large bundles.
🐛 Describe the bug
When validating a large bundle, the referential integrity check may fail to complete due to an out-of-memory error. This happened recently with the spacewatch bundle, which has 1.8 million products.
📜 To Reproduce
Steps to reproduce the behavior:
Run validate on the spacewatch bundle:
```
validate -D -R pds4.bundle -t gbo.ast.survey.spacewatch
```
I've attached the output from our last validation run which illustrates the error, as well.
🕵️ Expected behavior
The referential integrity check should have completed, and raised errors in the bundle related to a LID collection id mismatch. At a minimum, a more clearly marked abnormal exit would be desirable, so it's clear that the validation results cannot be used.
📚 Version of Software Used
Validate 2.1.4
🩺 Test Data / Additional context
If needed, I can provide a stub bundle with all of the spacewatch labels.
🖥 System Info
- OS: CentOS7
⚙️ Engineering Details
@jstone-psi copy. per the error message, this is a Java VM issue. we will need to increase the available java memory on your machine. we will look into this for B13.0
@jstone-psi as a note, one way to avoid this would potentially be to break up the bundle into chunks for product validation (using GNU Parallel) and then run referential integrity checking separately. See our advanced execution instructions here: https://nasa-pds.github.io/validate/operate/index.html#Advanced_Bundle_Validation
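A rough sketch of that chunk-then-check workflow is below. The batch size, job count, and demo paths are placeholders (not from this thread), and a stand-in is used in place of the real `validate` launcher so the batching shape is visible without it on PATH:

```shell
# Sketch of the chunked approach. VALIDATE is a stand-in; point it at
# your actual validate script. Batch size / job count are arbitrary.
VALIDATE="echo validate"            # stand-in; replace with the real launcher

# Demo bundle with a few product labels:
BUNDLE=$(mktemp -d)
touch "$BUNDLE/p1.xml" "$BUNDLE/p2.xml" "$BUNDLE/p3.xml"

# 1) Validate products in batches. xargs is shown for portability; with
#    GNU Parallel the equivalent is: ... | parallel -0 -n 100 -j 4 validate -t
find "$BUNDLE" -name '*.xml' -print0 | xargs -0 -n 2 $VALIDATE -t

# 2) Then a single integrity-only pass over the whole bundle:
$VALIDATE --skip-product-validation -R pds4.bundle -t "$BUNDLE"
```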
@jordanpadams This already is just for the referential integrity check. I'm guessing that this check can't be split up in the same way.
I forgot to indicate that in the instructions to reproduce the problem. I've updated the issue.
@jstone-psi copy. not sure if this will help, but can you try running with the flag below instead of -D?
```
--skip-product-validation    Disables product validation when
                             attempting to run pds4.bundle or
                             pds4.collection validation. The
                             software will perform member integrity
                             checks but will not validate individual
                             products or their labels.
```
@jstone-psi just following up if you got this to work?
@jordanpadams Looks like we're still running out of memory. I did manage to get a partial stack trace this time, however. stacktrace.txt
copy. thanks Jessie. to verify, can you send me exactly how the command-line is being executed?
also, are there any other instances of validate or Java Apps being run while that command-line execution occurs?
```
validate --skip-product-validation --skip-content-validation -R pds4.bundle -r referential_validation.txt -t /surveys/archive/spacewatch
```
This was a pretty long running-job on a shared system, so there may have been occasional invocations of validate running, but only for short periods.
@jstone-psi copy, thanks. how many products are there right now?
There are about 1.8 million products.
Thanks @jstone-psi, we will look into this. I am surprised this is failing, and curious if this is just hitting the overall system memory max. Worst case, we get you all to ingest all this data into the Registry, and we have a good initial test case for referential integrity checking with the registry.
The system has 24GB of memory, so that seems unlikely. I think the startup script only allocates 4GB, though.
Copy. Yeah. With that much memory on the machine, increasing the allocated java memory in the validate script would definitely be an option to fix this. Can you try doubling each of those numbers and give it a shot?
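The doubling could look something like the sketch below. It assumes the launcher script embeds fixed heap flags (e.g. `-Xms256m -Xmx1024m`; the actual variable names and values in your copy of the validate script may differ), and it is demonstrated on a throwaway copy rather than the installed script:

```shell
# Hypothetical stand-in for the installed validate launcher:
mkdir -p /tmp/validate-demo
printf 'exec java -Xms256m -Xmx1024m -jar validate.jar "$@"\n' \
  > /tmp/validate-demo/validate

# Double both heap numbers into a new launcher script:
sed 's/-Xms256m/-Xms512m/; s/-Xmx1024m/-Xmx2048m/' \
  /tmp/validate-demo/validate > /tmp/validate-demo/validatexl
chmod +x /tmp/validate-demo/validatexl
cat /tmp/validate-demo/validatexl
```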
@jstone-psi ☝️
Can do
Ok, it's running. I'll let you know how it goes.
It finally finished executing, and still ran out of memory. The validation report is attached, along with the modified script that I used to run it.
validatexl.zip referential_validation.txt.zip
The command line was:
```
/sbn/tools/validate-latest/bin/validatexl --skip-product-validation --skip-content-validation -R pds4.bundle -r referential_validation.txt -t /surveys/archive/spacewatch
```
@jstone-psi copy. not sure what is going on here. IMG runs this on data sets as large or larger and does not encounter any issues. we will look to investigate further.
@jstone-psi what version of java are you running?
```
openjdk version "1.8.0_332"
OpenJDK Runtime Environment (build 1.8.0_332-b09)
OpenJDK 64-Bit Server VM (build 25.332-b09, mixed mode)
```
@jstone-psi thanks. I am curious if upgrading Java would help? I know newer versions have better garbage collection, but not sure if that is the issue.
Ok, I'm trying it with Java 11. I'm also running with the -v0 flag, so we'll see if anything comes up from that.
I have intermittent issues with bundle level validation (BLV): if the bundles are particularly large, the java process will sometimes completely wink out and vanish. Even when I'm monitoring the process, it just ... stops.
So, I've taken to using a set of evasive maneuvers to increase the odds of success:
- I modified the `validate` command (the shell script) to double the heap sizes
- I also, just for good measure, push the BLV to a background process and then terminate my session, thereby forcing the java process to become owned by the `init` process (you could use `nohup` to good effect here, too)
- I always add `--skip-product-validation` (I've validated the products prior to BLV)
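A sketch of that detach-and-skip invocation is below, assuming `validate` is the launcher on PATH; the bundle path and report name follow the examples earlier in this thread:

```shell
# nohup plus backgrounding keeps the job alive after the login session
# ends; once the shell exits, the process is reparented to init.
nohup validate --skip-product-validation -R pds4.bundle \
      -r referential_validation.txt -t /surveys/archive/spacewatch \
      > blv.log 2>&1 &
```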
Even with these preconditions, I get into situations where the validate process will run for days. This bit me on Mars2020 Release 3: after running for 80 hours, it still had not completed.
This is a pernicious issue exacerbated by accumulating bundles of large size
Just some thoughts and thinks...
thanks @myche ! all hoping this will become loads easier and dramatically faster once we get the Registry in the loop for referential integrity checks.
Ok, this time it almost completed, but still ran out of memory in the end.
@jordanpadams can you explain the priority status for this item? Is it something that we should be expected to work around?
@mdrum since we have been unable to replicate this in any of our environments, our solution is to instead work on providing a new feature to use the registry for referential integrity checking, versus trying to figure this out using the file system. does that work for you all?