Referential integrity check does not complete with large bundles.

Open jstone-psi opened this issue 3 years ago • 27 comments

🐛 Describe the bug

When validating a large bundle, there is a risk that the referential integrity check will not complete, due to an out of memory error. This happened recently with the spacewatch bundle, which has 1.8 million products.

📜 To Reproduce

Steps to reproduce the behavior:

Run validate on the spacewatch bundle: validate -D -R pds4.bundle -t gbo.ast.survey.spacewatch

spacewatch_labels.txt.gz

I've attached the output from our last validation run which illustrates the error, as well.

🕵️ Expected behavior

The referential integrity check should have completed, and raised errors in the bundle related to a LID collection id mismatch. At minimum, a more clearly marked abnormal exit would be desirable, so that it's clear that the validation results cannot be used.
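To illustrate what a "clearly marked abnormal exit" could look like, here is a minimal Java sketch. It is not validate's code: the class name `AbnormalExitSketch`, the exit code `3`, and the message text are all hypothetical. Catching `OutOfMemoryError` is also not guaranteed to work in every situation, since the JVM may be too starved to run the handler, so treat this as an illustration of the idea rather than a fix.

```java
public class AbnormalExitSketch {
    /** Hypothetical exit code reserved for "ran out of memory". */
    static final int EXIT_OUT_OF_MEMORY = 3;

    /** Runs a task and maps its outcome to a process exit code. */
    static int runGuarded(Runnable task) {
        try {
            task.run();
            return 0;
        } catch (OutOfMemoryError e) {
            // Make it unmistakable that any report written so far is incomplete.
            System.err.println("FATAL: validation aborted (out of memory); "
                    + "results are incomplete and must not be used.");
            return EXIT_OUT_OF_MEMORY;
        }
    }

    public static void main(String[] args) {
        // Pass --simulate-oom to exercise the failure path without exhausting memory.
        boolean simulate = args.length > 0 && args[0].equals("--simulate-oom");
        Runnable task;
        if (simulate) {
            task = () -> { throw new OutOfMemoryError("simulated"); };
        } else {
            task = () -> { /* the real validation work would go here */ };
        }
        System.exit(runGuarded(task));
    }
}
```

A wrapper script could then check the exit status and refuse to publish a report when it is non-zero.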

📚 Version of Software Used

Validate 2.1.4

🩺 Test Data / Additional context

If needed, I can provide a stub bundle with all of the spacewatch labels.

🏞Screenshots

Screen Shot 2022-04-19 at 3 41 23 PM

🖥 System Info

  • OS: CentOS7

🦄 Related requirements

⚙️ Engineering Details

jstone-psi avatar Apr 19 '22 22:04 jstone-psi

@jstone-psi copy. per the error message, this is a Java VM issue. we will need to increase the available java memory on your machine. we will look into this for B13.0

jordanpadams avatar Apr 20 '22 15:04 jordanpadams

@jstone-psi as a note, one way to avoid this might be to break the bundle up into chunks for product validation (using GNU Parallel) and then run referential integrity checking separately. See our advanced execution instructions here: https://nasa-pds.github.io/validate/operate/index.html#Advanced_Bundle_Validation

jordanpadams avatar Apr 20 '22 15:04 jordanpadams

@jordanpadams This already is just for the referential integrity check. I'm guessing that this check can't be split up in the same way.

I forgot to indicate that in the instructions to reproduce the problem. I've updated the issue.
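A rough sketch of why a check like this is hard to chunk, assuming a simple in-memory design (this is not how validate is known to work; `ReferenceCheckSketch`, `findDangling`, and the sample identifiers are made up for illustration). A reference can only be called dangling once the full set of identifiers is known, so that set has to exist somewhere before the lookups start.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ReferenceCheckSketch {
    /**
     * Returns every reference that does not match a known identifier.
     * Each map key is a product identifier; its value lists the identifiers it references.
     */
    static List<String> findDangling(Map<String, List<String>> products) {
        // The whole identifier set is built first; with millions of products
        // this set alone can take a large share of the heap.
        Set<String> known = new HashSet<>(products.keySet());
        List<String> dangling = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : products.entrySet()) {
            for (String ref : entry.getValue()) {
                if (!known.contains(ref)) {
                    dangling.add(entry.getKey() + " -> " + ref);
                }
            }
        }
        return dangling;
    }

    public static void main(String[] args) {
        // Made-up identifiers, for illustration only.
        Map<String, List<String>> products = new LinkedHashMap<>();
        products.put("bundle", Arrays.asList("collection_a", "collection_b"));
        products.put("collection_a", Collections.emptyList());
        System.out.println(findDangling(products));
    }
}
```

Checking each chunk against only its own identifiers would report cross-chunk references as missing, which is why the identifier set needs to be global (or held in an external index).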

jstone-psi avatar Apr 20 '22 15:04 jstone-psi

@jstone-psi copy. not sure if this will help, but can you try running with the flag below instead of -D?

    --skip-product-validation            Disables product validation when
                                         attempting to run pds4.bundle or
                                         pds4.collection validation. The
                                         software will perform member integrity
                                         checks but will not validate individual
                                         products or their labels.

jordanpadams avatar Apr 20 '22 16:04 jordanpadams

@jstone-psi just following up if you got this to work?

jordanpadams avatar May 09 '22 20:05 jordanpadams

@jordanpadams Looks like we're still running out of memory. I did manage to get a partial stack trace this time, however. stacktrace.txt

jstone-psi avatar May 16 '22 22:05 jstone-psi

copy. thanks Jessie. to verify, can you send me exactly how the command-line is being executed?

also, are there any other instances of validate or Java Apps being run while that command-line execution occurs?

jordanpadams avatar May 17 '22 15:05 jordanpadams

validate --skip-product-validation --skip-content-validation -R pds4.bundle -r referential_validation.txt -t /surveys/archive/spacewatch

This was a pretty long-running job on a shared system, so there may have been occasional invocations of validate running, but only for short periods.

jstone-psi avatar May 17 '22 15:05 jstone-psi

@jstone-psi copy, thanks. how many products are there right now?

jordanpadams avatar May 17 '22 15:05 jordanpadams

There are about 1.8 million products.

jstone-psi avatar May 17 '22 15:05 jstone-psi

Thanks @jstone-psi, we will look into this. I am surprised this is failing, and curious if this is just hitting the overall system memory max. Worst case, we get you all to ingest all this data into the Registry, and we have a good initial test case for referential integrity checking with the Registry.

jordanpadams avatar May 17 '22 15:05 jordanpadams

The system has 24GB of memory, so that seems unlikely. I think the startup script only allocates 4GB, though.

jstone-psi avatar May 17 '22 15:05 jstone-psi

Copy. Yeah. With that much memory on the machine, increasing the allocated java memory in the validate script would definitely be an option to fix this. Can you try doubling each of those numbers and give it a shot?
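One way to confirm that a raised heap limit actually takes effect is to run a tiny Java program with the same `-Xmx` value and print what the JVM reports. This is a minimal sketch; the class name `HeapCheck` is hypothetical and not part of validate, and the exact options in the validate script are not shown in this thread.

```java
public class HeapCheck {
    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the ceiling the JVM will try to use (driven by -Xmx).
        System.out.println("max heap (MiB):   " + toMiB(rt.maxMemory()));
        // totalMemory() is what the JVM has reserved so far; freeMemory() is the unused part of it.
        System.out.println("total heap (MiB): " + toMiB(rt.totalMemory()));
        System.out.println("free heap (MiB):  " + toMiB(rt.freeMemory()));
    }
}
```

After compiling with `javac HeapCheck.java`, running `java -Xmx8g HeapCheck` (8g here assumes the script's 4 GB was doubled) should report a maximum close to the requested size. If it does not, the limit is being set somewhere else.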

jordanpadams avatar May 17 '22 16:05 jordanpadams

@jstone-psi ☝️

jordanpadams avatar May 17 '22 16:05 jordanpadams

Can do

jstone-psi avatar May 17 '22 17:05 jstone-psi

Ok, it's running. I'll let you know how it goes.

jstone-psi avatar May 17 '22 18:05 jstone-psi

It finally finished executing, and still ran out of memory. The validation report is attached, along with the modified script that I used to run it.

validatexl.zip referential_validation.txt.zip

The command line was:

/sbn/tools/validate-latest/bin/validatexl --skip-product-validation --skip-content-validation -R pds4.bundle -r referential_validation.txt -t /surveys/archive/spacewatch

jstone-psi avatar Jun 02 '22 15:06 jstone-psi

@jstone-psi copy. not sure what is going on here. IMG runs this on as large or larger data sets and does not encounter any issues. we will look to investigate further.

jordanpadams avatar Jun 02 '22 18:06 jordanpadams

@jstone-psi what version of java are you running?

jordanpadams avatar Jun 02 '22 19:06 jordanpadams

    openjdk version "1.8.0_332"
    OpenJDK Runtime Environment (build 1.8.0_332-b09)
    OpenJDK 64-Bit Server VM (build 25.332-b09, mixed mode)

jstone-psi avatar Jun 02 '22 19:06 jstone-psi

@jstone-psi thanks. I am curious if upgrading Java would help? I know newer versions have better garbage collection, but not sure if that is the issue.

jordanpadams avatar Jun 02 '22 19:06 jordanpadams

Ok, I'm trying it with Java 11. I'm also running with the -v0 flag, so we'll see if anything comes up from that.

jstone-psi avatar Jun 02 '22 22:06 jstone-psi

I have intermittent issues with bundle level validation (BLV): If the bundles are particularly large, the java process will sometimes completely twink out and vanish. Even when I'm monitoring the process, it just ... stops

So, I've taken to using a set of evasive maneuvers to increase the odds of success:

  • I modified the validate command (the shell script) to double the heap sizes
  • I also, just for good measure, push the BLV to a background process and then terminate my session, thereby forcing the java process to become owned by the init process (you could use nohup to good effect here, too)
  • I always add --skip-product-validation (I've validated the products prior to BLV)

Even with these preconditions, I get into situations where the validate process will run for days. This bit me on Mars2020 Release 3: after running for 80 hours, it still had not completed.

This is a pernicious issue, exacerbated by accumulating bundles of large size.

Just some thoughts and thinks...

myche avatar Jun 02 '22 23:06 myche

thanks @myche ! all hoping this will become loads easier and dramatically faster once we get the Registry in the loop for referential integrity checks.

jordanpadams avatar Jun 02 '22 23:06 jordanpadams

Ok, this time it almost completed, but still ran out of memory in the end.

referential_validation.20220620.txt.gz

jstone-psi avatar Jun 24 '22 17:06 jstone-psi

@jordanpadams can you explain the priority status for this item? Is it something that we should be expected to work around?

mdrum avatar Dec 20 '22 16:12 mdrum

@mdrum since we have been unable to replicate this in any of our environments, our solution is to instead work on providing a new feature to use the registry for referential integrity checking, versus trying to figure this out using the file system. does that work for you all?

jordanpadams avatar Dec 20 '22 16:12 jordanpadams