
As a user, I want validate to report empty (blank) PDS4 labels

Open jennifergward opened this issue 10 months ago • 11 comments

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Data Engineers and Data Providers

💪 Motivation

so I know if an XML label file didn't transfer properly or is just blank for some other reason.

📖 Additional Details

Attached is an example showing that validate ignores the blank XML label, lend_rdr_dlx_20240615.xml.

validate_test.zip

Acceptance Criteria

Given an XML file in a folder path/to/data that is not a PDS4 XML label, When I perform validate --rule pds4.folder --target path/to/data, Then I expect to receive a warning.label.xml_not_label warning message

Given an empty XML file in a folder path/to/data that is not a PDS4 XML label, When I perform validate --rule pds4.folder --target path/to/data, Then I expect to receive a warning.label.xml_not_label warning message

Given an XML file in a folder path/to/data that is not a PDS4 XML label, When I perform validate --rule pds4.folder --target path/to/data --quiet-warning-xml-not-label, Then I expect to not receive a warning.label.xml_not_label warning message

⚙️ Engineering Details

No response

🎉 I&T

No response

jennifergward avatar Apr 03 '25 14:04 jennifergward

@jennifergward @jordanpadams

This is really long because there is a lot of setup/context needed to understand the design decision that has to be made. Ultimately, 'pds4.folder' might be a bad choice, and users should instead use other tools to build a list of products.

With -R pds4.folder, every file with an appropriate suffix has its content tested to see whether it is a label. That statement has two important implications, which force a trade-off between signal-to-noise ratio and signal loss:

  1. one can tell validate that the appropriate suffix is *, and that may even be the default case
  2. data files (non-labels) are allowed to have the same suffix as labels

In pds4.folder mode, the choice was made not to bombard the user with useless messages about files that are, to them, obviously not labels, even though validate has to check each file's content to know whether it is a label.

An easy solution to this is to run validate this way:

validate -t lend_rdr_dlx_20240615.xml lend_rdr_dld_20240615.xml

PDS Validate Tool Report

Configuration:
   Version     3.7.0-SNAPSHOT
   Date        2025-04-03T16:45:37Z

Parameters:
   Targets                      [file:/lend_rdr_dlx_20240615.xml, file:/lend_rdr_dld_20240615.xml]
   Severity Level               WARNING
   Recurse Directories          true
   File Filters Used            [*.xml, *.XML]
   Data Content Validation      on
   Product Level Validation     on
   Max Errors                   100000
   Registered Contexts File     registered_context_products.json


Error 
  SXXP0003   Error reported by XML parser: Premature end of file.
Fatal error: [node=null,object=null,url=null,line=-1,col=-1,offset=-1]: Premature end of file.
Error 
  SXXP0003   Error reported by XML parser: Premature end of file.
Error 
  SXXP0003   Error reported by XML parser: Premature end of file.
Error 
  SXXP0003   Error reported by XML parser: Premature end of file.
Fatal error: [node=null,object=null,url=null,line=-1,col=-1,offset=-1]: Premature end of file.

Product Level Validation Results

  FAIL: file:/lend_rdr_dlx_20240615.xml
      WARNING  [error.validation.missing_required_file]   Cannot check versioning because XML could not be parsed.
      ERROR  [error.label.schema]   Premature end of file.
        1 product validation(s) completed

  PASS: file:/lend_rdr_dld_20240615.xml
        2 product validation(s) completed

Summary:

  2 product(s)
  1 error(s)
  1 warning(s)

  Product Validation Summary:
    1          product(s) passed
    1          product(s) failed
    0          product(s) skipped
    2          product(s) total

  Referential Integrity Check Summary:
    0          check(s) passed
    0          check(s) failed
    0          check(s) skipped
    0          check(s) total

  Message Types:
    1            error.label.schema
    1            error.validation.missing_required_file

End of Report
Completed execution in 9896 ms

Obviously this is much harder when there are more than 2 labels. However, using tools like find or ls - or the equivalent for the user's platform - would enable them to build lists based on more sensible criteria than suffix alone. For instance validate -t $(find validate_test -name label_\*.xml) or validate -t $(ls validate_test/label_*.xml). Redirection is another choice as in validate -t < mylist.txt to load a very large pile of products.
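As a complement to find or ls, a pre-scan could flag blank label candidates before validate ever runs. This is only a minimal sketch in Python: the function name find_blank_xml and the default suffix list are hypothetical and are not part of validate.

```python
from pathlib import Path


def find_blank_xml(root, suffixes=(".xml", ".lblx")) -> list:
    """Return files under root with a label-like suffix that are empty
    or contain only whitespace. Name and defaults are illustrative."""
    blank = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            # bytes.strip() removes ASCII whitespace; nothing left means blank
            if not path.read_bytes().strip():
                blank.append(path)
    return blank
```

Calling find_blank_xml("validate_test") would return the paths to inspect by hand; everything else can be passed to validate -t as usual.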

Right now, validate is set to maximize signal to noise at the expense of some signal loss. We could undo this a bit and add a warning that a file found in the folder looks like but does not test to be a label: warning.label.content_not_label. However, this is not a fix as much as a squishy work around until another user pushes to have the squish go back the other way.

Again, this is ultimately a design question. Providing pds4.folder gives the user a very imperfect way to filter content. Adding more messages increases the noise. Making the filtering more complex is nonlinearly costly. Does it therefore make sense to provide pds4.folder, or should users use tools on their platform to do the work for them?

al-niessner avatar Apr 03 '25 17:04 al-niessner

This problem seems to affect other malformed xml files, as well. The validator will treat these as non-labels.

I see where you are coming from that xml files do not have to be labels, so this is a difficult situation.

The standards do require that any file with the .xml extension is an XML formatted file (6C.1.6 Reserved File Name Extensions), so at least warning about invalid XML files (as opposed to non-label files) should be uncontroversial. This would cover empty XML files as well.
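A well-formedness check of this kind is independent of whether the file is a label. A rough sketch using Python's standard library (the helper name is_well_formed_xml is made up for illustration and says nothing about how validate is implemented):

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(path) -> bool:
    """Return True if the file parses as XML at all.

    An empty or whitespace-only file has no root element, so the
    parser raises ParseError and the file is reported as not well formed.
    """
    try:
        ET.parse(str(path))
        return True
    except ET.ParseError:
        return False
```

This only tests XML syntax. A file that passes may still be a non-label XML document, which is the separate question discussed above.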

jstone-psi avatar Apr 24 '25 16:04 jstone-psi

I thought policy was that a decision needs to be made at the bundle level whether all labels (and only labels) will be .xml files or whether all labels will be .lblx files. Shouldn't this remove the ambiguity at the heart of this problem?

matthewtiscareno avatar Apr 24 '25 16:04 matthewtiscareno

The standards reference says that it's at the collection level, and not the bundle level.

jstone-psi avatar Apr 24 '25 21:04 jstone-psi

Thanks for the correction, @jstone-psi, but I think my point remains. Shouldn't Validate always know whether a .xml file is intended to be a label or not, and thus not have the ambiguity that we're discussing?

matthewtiscareno avatar Apr 24 '25 21:04 matthewtiscareno

I think it's a significant difference, since there will no longer be a guaranteed single extension type across validation runs. It might be possible to still come up with some logic that will work, but I can say as a fellow programmer that I'm glad it's not my problem to solve.

We did find a temporary workaround on our end, however. If you run with a referential integrity check, you will get clues that something is wrong with the label. It still takes some work to find the root cause (bad labels), but the validation run won't improperly pass.

jstone-psi avatar Apr 25 '25 15:04 jstone-psi

@matthewtiscareno @jstone-psi thanks for the input here. as @matthewtiscareno notes, Validate "should" be able to know what file extensions to look for, but folks have been putting non-XML files in directories, even if they shouldn't. So we got a request to more gracefully handle coming across these files. We may have quieted too much of the logging information here. We will bring this back with a warning.

@al-niessner let's maybe go with this:

Right now, validate is set to maximize signal to noise at the expense of some signal loss. We could undo this a bit and add a warning that a file found in the folder looks like but does not test to be a label: warning.label.xml_not_label. However, this is not a fix as much as a squishy work around until another user pushes to have the squish go back the other way.

Let's support this warning by default, and add a flag like --quiet-warning-xml-not-label to quiet these warnings

jordanpadams avatar May 12 '25 17:05 jordanpadams

This will actually allow us to eventually start adding more --quiet-warning flags for some of the other warnings people do not want

jordanpadams avatar May 12 '25 17:05 jordanpadams

folks have been putting non-XML files in directories, even if they shouldn't

I'm not aware of anything in the standards that forbids having files in the same directory as PDS4 files that are not part of the PDS4 archive. Please advise if otherwise.

I had assumed that @jennifergward and @jstone-psi were talking about .xml files whose LIDs are listed in the collection inventory. Those should be checked as labels and reported if they are empty or blank.

However, now I'm not so sure whether that is what they meant. Do they mean instead that any .xml file (blank or not) should be checked even if it is not listed in the inventory?

A corollary question is whether Validate can assume that each label's filename should match a LID in the inventory, or whether Validate must look inside the .xml file in order to determine its LID and thus check whether it is listed in the inventory.

matthewtiscareno avatar May 12 '25 23:05 matthewtiscareno

For your last question, I think that the second part is true. The match between a LID and a filename is only a coincidence, and even if they are similar, they don't necessarily match exactly.

This leaves us with a bootstrapping problem. If the xml file is blank or otherwise malformed, then it is unparseable and the LID is unknowable. Since the inventories only have the LIDs, there is no way to determine if the file is a label and if it should have been in the inventory.
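To make the bootstrapping problem concrete, here is a small sketch of reading a LID out of a label. It assumes the label uses the core PDS4 namespace and keeps its LID in Identification_Area/logical_identifier; the function name read_lid is hypothetical.

```python
import xml.etree.ElementTree as ET

# Core PDS4 namespace, assumed to be the default namespace of the label.
PDS_NS = "http://pds.nasa.gov/pds4/pds/v1"


def read_lid(label_text):
    """Return the logical_identifier of a label, or None if it cannot be read.

    A blank or malformed file never gets past the parser, so there is
    no LID to compare against the collection inventory.
    """
    try:
        root = ET.fromstring(label_text)
    except ET.ParseError:
        return None
    node = root.find(
        f"{{{PDS_NS}}}Identification_Area/{{{PDS_NS}}}logical_identifier"
    )
    if node is None or not node.text:
        return None
    return node.text.strip()
```

For a blank file read_lid returns None, which is exactly the case where the inventory cannot tell us whether the file was meant to be a label.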

jstone-psi avatar May 13 '25 22:05 jstone-psi

I do still think that we would get a lot of mileage out of checking that XML files (whether they have a .xml or .lblx extension) are parseable at all. It is unambiguously a problem if these files are not valid XML, regardless of whether they are in the inventory.

jstone-psi avatar May 13 '25 22:05 jstone-psi

Thanks for the input @jstone-psi @matthewtiscareno. I think the plan will be:

  1. When .xml is the expected file extension for labels, check all and raise a warning in the event there is an empty XML. As Jesse noted, we have no idea if this is supposed to be a label or not, so I think a warning suffices. We will have a flag to disable in the event this is expected.

  2. When .lblx is used, do not try to read the .xml files, since they are now considered “data files” and the standard does mention that .lblx and .xml labels cannot be intermixed within a collection.
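The two rules above can be sketched as a small decision function. This is an illustration of the plan only, not validate's code: the name triage, its return values, and the quiet parameter (standing in for the proposed flag to disable the warning) are all assumptions, and it only covers the unparseable case, not well-formed XML that is not a label.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def triage(path, label_extension="xml", quiet=False) -> str:
    """Decide what to do with one file found in the folder.

    Returns "validate", "warning", or "ignore" (hypothetical outcomes).
    """
    path = Path(path)
    ext = label_extension.lower()
    if path.suffix.lower() != "." + ext:
        # Rule 2: with .lblx labels, .xml files are data files and are not read.
        return "ignore"
    if ext != "xml":
        # .lblx files are unambiguously labels, so hand them to normal validation.
        return "validate"
    try:
        ET.parse(str(path))
    except ET.ParseError:
        # Rule 1: a blank or unparseable .xml may or may not be a label,
        # so warn unless the user has disabled the warning.
        return "ignore" if quiet else "warning"
    return "validate"
```

Keeping the decision in one place makes it easy to see that the warning only arises in the ambiguous case, where .xml is the label extension.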

jordanpadams avatar Jun 19 '25 11:06 jordanpadams

@jordanpadams and all that are watching,

validate has been updated and now throws some interesting errors and counts. One in particular is that blank files now count as a success with a warning. I am not sure I can change a warning to a failure or to uncounted, because ValidationProblem() requires a URL and the report just counts URLs. I am not sure how, or if, you want to fix this, but the obvious option is to elevate it to an error.

Below are two comments. The first with .xml as the suffix. The second with .lblx as the suffix. Let me know if this is good enough.

al-niessner avatar Jun 25 '25 19:06 al-niessner

After fixing the report generator, it reads better:

validate --skip-context-validation -R pds4.folder -t test/resources/github1201

PDS Validate Tool Report

Configuration:
   Version     3.8.0-SNAPSHOT
   Date        2025-06-25T19:57:39Z

Parameters:
   Targets                      [file:test/resources/github1201/]
   Rule Type                    pds4.folder
   Severity Level               WARNING
   Recurse Directories          true
   File Filters Used            [*.xml, *.XML]
   Data Content Validation      on
   Product Level Validation     on
   Max Errors                   100000
   Registered Contexts File     main/resources/util/registered_context_products.json



Product Level Validation Results

  PASS: file:/home/niessner/Projects/PDS/validate/src/test/resources/github1201/lend_rdr_dld_20240615.xml
        1 product validation(s) completed

  SKIP: file:/home/niessner/Projects/PDS/validate/src/test/resources/github1201/lend_rdr_dlx_20240615.xml
      WARNING  [warning.label.not_understandable]   The label file cannot be parsed or understood using PDS4 schema but it may be a data file being as a label.
        2 product validation(s) completed

Summary:

  2 product(s)
  0 error(s)
  1 warning(s)

  Product Validation Summary:
    1          product(s) passed
    0          product(s) failed
    1          product(s) skipped
    2          product(s) total

  Referential Integrity Check Summary:
    0          check(s) passed
    0          check(s) failed
    0          check(s) skipped
    0          check(s) total

  Message Types:
    1            warning.label.not_understandable

End of Report
Completed execution in 11365 ms

al-niessner avatar Jun 25 '25 19:06 al-niessner

validate --skip-context-validation -R pds4.folder --label-extension lblx -t test/resources/github1201

PDS Validate Tool Report

Configuration:
   Version     3.8.0-SNAPSHOT
   Date        2025-06-25T19:41:54Z

Parameters:
   Targets                      [file:test/resources/github1201/]
   Rule Type                    pds4.folder
   Severity Level               WARNING
   Recurse Directories          true
   File Filters Used            [*.lblx, *.LBLX]
   Data Content Validation      on
   Product Level Validation     on
   Max Errors                   100000
   Registered Contexts File     main/resources/util/registered_context_products.json


Product Level Validation Results

  PASS: file:test/resources/github1201/lend_rdr_dld_20240615.lblx
        1 product validation(s) completed
  FAIL: file:/test/resources/github1201/lend_rdr_dlx_20240615.lblx
      ERROR  [error.label.not_understandable]   The label file cannot be parsed or understood using PDS4 schema.
        2 product validation(s) completed

Summary:

  2 product(s)
  1 error(s)
  0 warning(s)

  Product Validation Summary:
    1          product(s) passed
    1          product(s) failed
    0          product(s) skipped
    2          product(s) total

  Referential Integrity Check Summary:
    0          check(s) passed
    0          check(s) failed
    0          check(s) skipped
    0          check(s) total

  Message Types:
    1            error.label.not_understandable

End of Report
Completed execution in 16398 ms

al-niessner avatar Jun 25 '25 19:06 al-niessner

@jordanpadams Okay, I think I got it all fixed up and reading well.

al-niessner avatar Jun 25 '25 20:06 al-niessner