
validate does not raise a warning when a table has more records than label says it has

mace-space opened this issue 10 months ago

Checked for duplicates

Yes - I've already checked, e.g. #149

🐛 Describe the bug

When I ran Validate with the --rule pds4.bundle option, it reported a collection file with a mismatch between the number of records reported in the label compared to the actual number of records in the table:

ERROR  [error.table.records_mismatch]   Number of records read is not equal to the defined number of records in the collection (expected 5, got 7).

(and that was very helpful). However, when I ran Validate at the individual product level (no --rule setting) on the same collection inventory file, I noticed that Validate passed the label without any warning/error even though the label had only 5 <records> for a table that actually contained 7 records. I confirmed that the same behaviour occurred with a non-collection product (I attach the collection_document.xml, and another example product label, vg2u_49xr_1986025t024210.xml with their associated files below).

Changing the value of <records> in the label and re-running Validate on that collection file (with no --rule setting), I saw that error.table.records_mismatch is only raised when the label <records> is greater than what's actually in the table:

 ERROR  [error.table.records_mismatch]   data object 1: Number of records read is not equal to the defined number of records in the label (expected 8, got 7).

This is related to other issues, e.g. #149, where it seems that the desired outcome was for:

  • an error to be raised if the table is shorter than the label says it is (e.g. table has 7 records but label says 8),
  • a warning to be raised if the table is longer than the label says it is (e.g. table has 7 records but label says 5).

However, given that was back in 2020, I'd like to check whether that is still the desired behaviour.

Tagging interested parties, @radiosci and @matthewtiscareno.

🕵️ Expected behavior

I expected Validate to report a warning (or perhaps an error) if the number of <records> in the label is fewer than the actual number of records in the table. (Validate does report an error if the number of <records> described in the label is more than what's actually in the table.)

📜 To Reproduce

% validate ./collection_document.xml
% validate ./vg2u_49xr_1986025t024210.xml

🖥 Environment Info

  • Validate v3.7.0 (v3.6.3 and v3.5.1 also exhibit this behaviour)
  • macOS Sequoia 15.1

📚 Version of Software Used

3.7.0, 3.6.3, 3.5.1

🩺 Test Data / Additional context

The tar.gz is too large to attach directly, so I have put it on Dropbox.

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

🎉 Integration & Test

No response

mace-space avatar Apr 28 '25 23:04 mace-space

thanks @mace-space . added this to our backlog. we should be able to get to this in the next couple months.

jordanpadams avatar May 12 '25 17:05 jordanpadams

@jordanpadams @mace-space

I am looking for the issue and will update this comment when I find it, but my memory is: because labels do not have to describe a whole table but only sections of it, we cannot detect underflow.

When running as a bundle or collection rule, we know more about the inventory file beyond a simple table. Hence, a different table analysis is done and can detect underflow.

Does that make sense? Is this still an issue?

al-niessner avatar May 30 '25 16:05 al-niessner

@al-niessner I agree with what you are saying, but maybe we try this:

  • "peek" ahead to see if there are additional records and, if so, throw a warning
  • add a flag to disable that warning if someone doesn't want to see it

in the past we avoided spurious warnings, but I think since we have now introduced the ability to disable these various warnings, we can just add this as well.

do we want to maybe take this opportunity to generalize the disabling of specific warnings? I'm sure this won't be the last case where someone wants a warning and someone else does not.

jordanpadams avatar May 30 '25 16:05 jordanpadams

@jordanpadams

Um, not easily. It gets really complicated, but the short story is that in fixing another problem we introduced this one. The code to peek forward is already there in TableValidator. However, it no longer works because of a fix that you can track backward from there.

The simple reason it went away: we only read the data the label defines and no more. We can undo the change in jparser and see what breaks. Our test suite is now more robust, and with this check we may be able to find a solution that fixes both.

Large files would also be fully processed rather than just the block the label describes, causing validate to take longer again.

al-niessner avatar May 30 '25 17:05 al-niessner

@jordanpadams

Made changes in JParser to no effect. Turns out this is the beasty: https://github.com/NASA-PDS/validate/blob/49f40a649a4cdb6e4e07166ea19de1084fb32c7c/src/main/java/gov/nasa/pds/tools/validate/rule/pds4/TableValidator.java#L620-L630

Blame says it has been unchanged for 3 years, and I cannot look before that.

It means when you hit this line, it is always false: https://github.com/NASA-PDS/validate/blob/49f40a649a4cdb6e4e07166ea19de1084fb32c7c/src/main/java/gov/nasa/pds/tools/validate/rule/pds4/TableValidator.java#L657-L669

I can see why it bails. If it tried to parse the next row and it did not match, it would spew lots of errors that were not the fault of the row but of validate reading too much. In fact, I am not sure how to peek without that being true: if we read a row past the end and it has errors, does that mean it matches but is erroneous, or does it not match because it is past the end? It really throws into question this idea of peeking beyond the label and thinking you can make a statement. Even if a row does match in the sense of the parser, it may not be a continuation.

Let me give a concrete example because I know you love it when I give you long reading assignments. Let's say we have a table with 3 columns and 20 rows. For simplicity, each column for the first 10 rows can be any value a thru f inclusive; for the next 10 rows, a thru g inclusive. It does not take much imagination to see that some of the last 10 rows could be mistaken for the first 10 rows if you select values at random.

Now, if you peek:

  1. Row 11 has a value out of range for first 10 rows,
    1. Do you stop peeking?
      1. Yes, because you assume the label is thus correct? Then why is the label table length suspect?
      2. No, because you cannot trust the label?
    2. Do you look at row 12, and if row 12 matches, do you infer that row 11 was just a typo and the label is wrong?
    3. Just how far do you want to dig with this logic?
  2. Row 11 has a value in range for the first 10 rows,
    1. Do you stop peeking?
      1. Yes, because underflow has been detected and the user can now see the label is wrong and should fix it by counting rows, not by using our count.
      2. No, because the extent of the underflow needs to be determined for a good message, since the user is going to use our number. Repeat for the whole table from row 12.

The point really is: when do you trust the label and when do you not? In the case where we KNOW it is an inventory file (the rule is bundle or collection), we can apply the special case of a predetermined table format that does not allow for similar lines that could be mistaken for inventory content, which allows underflow detection. In the case of a generic label describing a portion of a table, you are assuming big things:

  1. next row will not be like the last row
  2. if next row looks similar, then it is part of the table
  3. somehow we know these errors are different and should not be recorded in the log but should change the processing behavior (processing continues to the next row despite errors on this row).
  4. if peeking, then we will know when to stop peeking

I think the first two assumptions sink the whole idea. If you generate a table randomly using my concrete example and uniform random sampling, there is a very real chance that the entire table would look like it matched the first 10 rows despite that not being true. The third is pragmatic in how validate works: it just spews messages to the user but really knows nothing about the errors or how to correct them. The last assumption is the trickiest given the first two, and the last assumption controls the third. No, I think you can only do overflow detection for generic tables.

Again, I can expand that logic table all you want and just keep putting up more and more impossible-to-resolve cases. I think the one case that you care about - the inventory file - is solved with the bundle and collection rules.
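To make the ambiguity concrete in code: here is a minimal sketch of the 3-column example above (this is not validate's actual logic, and the `matchesDescribedBlock` helper is hypothetical). A row from the second block can be indistinguishable from a row of the described block, so a peeking validator cannot prove anything either way.

```java
public class PeekAmbiguity {
    // Hypothetical check: a row "matches" the described block if every
    // value is a single character in the a-f range (the first 10 rows).
    static boolean matchesDescribedBlock(String[] row) {
        for (String v : row) {
            if (v.length() != 1 || v.charAt(0) < 'a' || v.charAt(0) > 'f') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Rows 11-20 may legally use a-g, but this particular row stays
        // within a-f, so a peeking validator would accept it as a
        // continuation of the described block even though it is not.
        String[] row11 = {"b", "c", "a"};
        // This row is out of range: is the label wrong, or is this a typo?
        String[] row12 = {"g", "a", "b"};
        System.out.println(matchesDescribedBlock(row11)); // prints true
        System.out.println(matchesDescribedBlock(row12)); // prints false
    }
}
```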

al-niessner avatar May 30 '25 20:05 al-niessner

@jordanpadams

Ah, if you try to read past the label's number of rows:

      FATAL_ERROR  [error.table.bad_file_read]   : Error occurred while reading the data file: The index is out of range 1 - 5

There are even deeper reasons for not trying to peek: it is an exception from jparser that is turned into error.table.bad_file_read.

al-niessner avatar May 30 '25 21:05 al-niessner

@al-niessner Totally hear you on the complications here. This all goes back to the origin where we just stopped doing this.

That being said, I think we may have pulled this back too much: we basically assume the label is correct, assume that any data beyond what is described is some expected footer, and do all of this without notifying the user of anything. So if there is some footer, we silently leave it, rather than notifying the user that there is a footer or some space in the file that is not described.

Maybe instead of trying to read the data as a table, we just throw a warning that there are X undescribed bytes in the data file "between objects A and B" or "after object C". We can then provide a flag to disable this warning (or the more generic flag to disable any warning).
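A rough sketch of how such a gap check might work, assuming hypothetical names (`ObjectSpan`, `findGaps`) and message wording; this is not validate's implementation, just the idea of walking the described byte ranges in offset order and flagging anything uncovered.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GapCheck {
    // Hypothetical description of a data object: name plus [offset, offset+length).
    record ObjectSpan(String name, long offset, long length) {}

    // Report undescribed byte ranges between described objects and after the last one.
    static List<String> findGaps(long fileSize, List<ObjectSpan> spans) {
        List<ObjectSpan> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingLong(ObjectSpan::offset));
        List<String> warnings = new ArrayList<>();
        long cursor = 0;
        String previous = "start of file";
        for (ObjectSpan s : sorted) {
            if (s.offset() > cursor) {
                warnings.add((s.offset() - cursor) + " undescribed byte(s) between "
                        + previous + " and " + s.name());
            }
            cursor = Math.max(cursor, s.offset() + s.length());
            previous = s.name();
        }
        if (fileSize > cursor) {
            warnings.add((fileSize - cursor) + " undescribed byte(s) after " + previous);
        }
        return warnings;
    }

    public static void main(String[] args) {
        // Label describes 5 records x 80 bytes, but the file holds 7 records.
        List<String> warnings =
                findGaps(7 * 80, List.of(new ObjectSpan("Table_Character", 0, 5 * 80)));
        warnings.forEach(System.out::println); // prints: 160 undescribed byte(s) after Table_Character
    }
}
```

Note that this only measures coverage; it never tries to parse the extra bytes, which sidesteps the peeking ambiguity entirely.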

jordanpadams avatar Jun 02 '25 20:06 jordanpadams

@jordanpadams

Maybe instead of trying to read the data as a table, we just throw a warning that there are X undescribed bytes in the data file "between objects A and B" or "after object C". We can then provide a flag to disable this warning (or the more generic flag to disable any warning).

Yes, that seems the most sensible. I will code it up this week.

I think there is another ticket about disabling. I would limit disabling to things that save significant amounts of time, content validation for example. Disabling requires changing code and not doing checks; there is no way to do this generically. We can allow for filters that ignore rather than disable. A good example would be context validation: it saves 10 ms over a million labels that took a week. Who cares? Still, I do not want to see them or have them be counted. Filter.

al-niessner avatar Jun 02 '25 20:06 al-niessner

copy, ok. we can just create the ability to disable specific warnings on a case-by-case basis, but to keep our --help menu from getting more crazy, it may be good to roll those into one --disable-warning flag that has a specific set of allowable values.

jordanpadams avatar Jun 02 '25 21:06 jordanpadams

@mace-space would this be OK?

Maybe instead of trying to read the data as a table, we just throw a warning that there are X undescribed bytes in the data file "between objects A and B" or "after object C". We can then provide a flag to disable this warning (or the more generic flag to disable any warning).

This is not exactly what you are requesting, but due to some of the complications noted above in peeking ahead, we think this is the best alternative: it will at least warn you that there is some undescribed data after your table.

jordanpadams avatar Jun 02 '25 22:06 jordanpadams

This conversation has been going on for a while, and I can see that a general solution is almost impossible without spending extraordinary amounts of time in validate and/or dealing with very large numbers of useless warnings.

However, there was a brief mention in an earlier exchange about being rigorous with Inventory, which is a special case of Table_Delimited — and (I think) one that prompted our concern. I wonder whether a rigorous test of rows could be applied to Inventory even if not much can be done elsewhere.

radiosci avatar Jun 03 '25 21:06 radiosci

@jordanpadams @mace-space @radiosci

However, there was a brief mention in an earlier exchange about being rigorous with Inventory, which is a special case of Table_Delimited — and (I think) one that prompted our concern. I wonder whether a rigorous test of rows could be applied to Inventory even if not much can be done elsewhere.

I thought the same thing and tried it. No.

The general table checker uses jparser to read the file. Each row of the block is read and tested independently of all the other rows in the block and file. When you read past the label-defined block, jparser throws an error and does not return data (which is what I ran into after testing for inventory).

In the case of the bundle and collection rules, the table is read using standard Java tools, not jparser. Hence, the file is read in full and then compared against the expectations of an inventory file, not a generalized table. It does other checks like duplicate lidvids, better boundary checks, that the version on the lidvid exists, etc. These are not specified in the label. Therefore the general table checker cannot be extended to handle inventory files.
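The distinction can be sketched like this: a whole-file read with plain java.nio can count every record and compare against the label, which is exactly what a row-at-a-time parser bounded by the label cannot do. The class name and message wording below are hypothetical, not validate's code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class InventoryRecordCheck {
    // Read the whole inventory file, not just the label-defined block, and
    // compare the actual record count against the label's <records> value.
    static String check(Path inventoryCsv, int recordsInLabel) throws IOException {
        long actual = Files.readAllLines(inventoryCsv).stream()
                .filter(line -> !line.isBlank())
                .count();
        if (actual == recordsInLabel) {
            return "OK";
        }
        return "ERROR [error.table.records_mismatch] expected " + recordsInLabel + ", got " + actual;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical inventory with 7 members but a label claiming 5.
        Path tmp = Files.createTempFile("collection_document", ".csv");
        StringBuilder rows = new StringBuilder();
        for (int i = 1; i <= 7; i++) {
            rows.append("P,urn:nasa:pds:example:document:product_").append(i).append("::1.0\n");
        }
        Files.writeString(tmp, rows.toString());
        System.out.println(check(tmp, 5)); // prints the records_mismatch error
        Files.delete(tmp);
    }
}
```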

al-niessner avatar Jun 04 '25 19:06 al-niessner

@jordanpadams

Maybe promote the "whole file not covered" warning to error.table.records_mismatch for inventory tables, since it means the same thing? I could add other table types that have the inventory distinction (all rows must be processed) as they creep up.

al-niessner avatar Jun 04 '25 19:06 al-niessner

@jordanpadams

A small number of changes got me to this:

validate -t github1234/collection_document.xml

PDS Validate Tool Report

Configuration:
   Version     3.8.0-SNAPSHOT
   Date        2025-06-04T20:51:33Z

Parameters:
   Targets                      [file:github1234/collection_document.xml]
   Severity Level               WARNING
   Recurse Directories          true
   File Filters Used            [*.xml, *.XML]
   Data Content Validation      on
   Product Level Validation     on
   Max Errors                   100000
   Registered Contexts File     
resources/util/registered_context_products.json


Product Level Validation Results

  FAIL: file:github1234/collection_document.xml
    Begin Content Validation: file:github1234/collection_document.csv
      ERROR  [error.table.records_mismatch]   : Number of records read is not equal to the defined number of records in the label (expected 5, got 7).
    End Content Validation: file:github1234/collection_document.csv
        1 product validation(s) completed

Summary:

  1 product(s)
  1 error(s)
  0 warning(s)

  Product Validation Summary:
    0          product(s) passed
    1          product(s) failed
    0          product(s) skipped
    1          product(s) total

  Referential Integrity Check Summary:
    0          check(s) passed
    0          check(s) failed
    0          check(s) skipped
    0          check(s) total

  Message Types:
    1            error.table.records_mismatch

End of Report
Completed execution in 5243 ms

This modifies how tables are validated and adds a special check on inventory files. No other file type will report underflow, nor that blocks are not being tested. It will be hard to expand other types to detect underflow. Is this sufficient for this ticket, with wait-and-see for other tables?

al-niessner avatar Jun 04 '25 20:06 al-niessner

Partially implemented by https://github.com/NASA-PDS/validate/pull/1266 for a clear check on inventory files. TODO: more work on all tables.

jordanpadams avatar Jun 06 '25 17:06 jordanpadams

Thanks to EN for the upgrade applied to Inventory.

radiosci avatar Jun 06 '25 18:06 radiosci

Hi @jordanpadams and @al-niessner. We're discussing this within RMS right now, and I'm afraid we've lost lock on what exactly you're proposing.

Our concern is that, if a label says that a table is 30 lines long, but that table is actually 40 lines long, then at least a warning should be generated. If there is a reason why this is not practical, then can we discuss it? Apologies if this repeats something you've already discussed.

matthewtiscareno avatar Jun 17 '25 21:06 matthewtiscareno

@matthewtiscareno we are looking to implement this fix in two phases:

  1. Done - Add a specialized check for collection inventories since those are pretty straightforward and easy to flag (https://github.com/NASA-PDS/validate/pull/1266)

  2. Up next - Throw a warning if there are any remaining bytes after or between the table definition(s). We will not try to match those remaining bytes with the table definition, but we can at least alert you that there are some undescribed bytes in case that is unintended.

Would that satisfy this use case?

We can try to dig further to detect whether the remaining bytes actually match the table definition, but it gets very complicated once we start talking about handling different table types, and products where multiple tables are defined within the same file. Additionally, it is really just a guess whether or not it actually matches, and there may be other issues with those extra rows that cause spurious errors to be thrown, potentially leading the user astray.

Thoughts?

jordanpadams avatar Jun 17 '25 21:06 jordanpadams

Yes, @mit3ch and @mace-space and I agree that this solution would be sufficient.

For the second part of your solution, you might consider throwing only one warning per file, instead of one for every instance, if the volume of warnings is an issue.
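The one-warning-per-file idea can be done with a simple seen-set keyed on file and message type. This is just an illustrative sketch (the class and method names are made up, not a proposed patch):

```java
import java.util.HashSet;
import java.util.Set;

public class WarningDeduper {
    private final Set<String> seen = new HashSet<>();

    // Returns true only the first time a (file, message type) pair is reported,
    // so repeat instances within the same file are silently dropped.
    boolean shouldEmit(String dataFile, String messageType) {
        return seen.add(dataFile + "|" + messageType);
    }

    public static void main(String[] args) {
        WarningDeduper d = new WarningDeduper();
        System.out.println(d.shouldEmit("table.csv", "warning.data.not_described")); // prints true
        System.out.println(d.shouldEmit("table.csv", "warning.data.not_described")); // prints false
        System.out.println(d.shouldEmit("other.csv", "warning.data.not_described")); // prints true
    }
}
```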

matthewtiscareno avatar Jun 17 '25 21:06 matthewtiscareno

@jordanpadams

We already implemented this in #686 for #535. To keep our regression tests passing, you have to turn it on with the --complete-descriptions switch; then partially described tables will give you a warning.

al-niessner avatar Jun 18 '25 17:06 al-niessner

@matthewtiscareno @mace-space FYI, you can try downloading the latest SNAPSHOT and run validate with --complete-descriptions to perform these checks

jordanpadams avatar Jun 26 '25 16:06 jordanpadams

https://github.com/NASA-PDS/validate/releases/tag/v3.8.0-SNAPSHOT

jordanpadams avatar Jun 26 '25 16:06 jordanpadams

Thanks for all your help with this @al-niessner and @jordanpadams.

I now see the following error message when I run validate-3.8.0-SNAPSHOT without any options (like --complete-descriptions):

FAIL: file:/Volumes/pdsdata-sandbox/voyager2_rss_uranus_49xr_raw_v1.0_mjtm/collection_document.xml
  Begin Content Validation: file:/Volumes/pdsdata-sandbox/voyager2_rss_uranus_49xr_raw_v1.0_mjtm/collection_document.csv
   ERROR [error.table.records_mismatch]  : Number of records read is not equal to the defined number of records in the label (expected 6, got 7).
  End Content Validation: file:/Volumes/pdsdata-sandbox/voyager2_rss_uranus_49xr_raw_v1.0_mjtm/collection_document.csv
    1 product validation(s) completed

which is very helpful (the collection file in question does actually have 7 records).

However, when I use --complete-descriptions, I see the same output. Is this expected behaviour?

mace-space avatar Jun 27 '25 21:06 mace-space

@mace-space

However, when I use --complete-descriptions, I see the same output. Is this expected behaviour?

Yes. The --complete-descriptions switch is for non-inventory tables.

al-niessner avatar Jun 27 '25 21:06 al-niessner

@mace-space yes! sorry for the confusion.

For collections, since they have a very specific definition in the standard, are well known, and should not contain any un-described bits, we perform that check by default.

For data tables, we have that flag since there are many data providers that intentionally leave bits un-described in the label.

jordanpadams avatar Jun 28 '25 14:06 jordanpadams

I see, thank you. Yes, there's a warning.data.not_described for non-inventory labels where the reported <records> is fewer than the actual number of records in the table.

mace-space avatar Jun 30 '25 19:06 mace-space