As a user, I want validate to report empty (blank) PDS4 labels
Checked for duplicates
Yes - I've already checked
🧑🔬 User Persona(s)
Data Engineers and Data Providers
💪 Motivation
so I know if an XML label file didn't transfer properly or is just blank for some other reason.
📖 Additional Details
Attached is an example showing that validate ignores the blank XML label, lend_rdr_dlx_20240615.xml.
Acceptance Criteria
Given an XML file in a folder path/to/data that is not a PDS4 XML
When I perform validate --rule pds4.folder --target path/to/data
Then I expect to receive a warning.label.xml_not_label warning message
Given an empty XML file in a folder path/to/data that is not a PDS4 XML
When I perform validate --rule pds4.folder --target path/to/data
Then I expect to receive a warning.label.xml_not_label warning message
Given an XML file in a folder path/to/data that is not a PDS4 XML
When I perform validate --rule pds4.folder --target path/to/data --quiet-warning-xml-not-label
Then I expect to not receive a warning.label.xml_not_label warning message
⚙️ Engineering Details
No response
🎉 I&T
No response
@jennifergward @jordanpadams
This is really long because there is a lot of setup/context needed to understand the design decision that has to be made. Ultimately, 'pds4.folder' might be a bad choice, and users should instead use other tools to build a list of products.
With -R pds4.folder, every file with the appropriate suffix has its content tested to see whether it is a label or not. There are two important implications in that statement that require a trade-off between signal-to-noise ratio and signal loss:
- one can tell validate that the appropriate suffix is * (any file), and that may even be the default case
- data files (non-labels) are allowed to have the same suffix as labels
In pds4.folder mode, the choice was made not to bombard the user with useless messages about files that are obviously not labels to them, even though validate has to check each file's content to know whether it is a label or not.
An easy solution to this is to run validate this way:
validate -t lend_rdr_dlx_20240615.xml lend_rdr_dld_20240615.xml
PDS Validate Tool Report
Configuration:
Version 3.7.0-SNAPSHOT
Date 2025-04-03T16:45:37Z
Parameters:
Targets [file:/lend_rdr_dlx_20240615.xml, file:/lend_rdr_dld_20240615.xml]
Severity Level WARNING
Recurse Directories true
File Filters Used [*.xml, *.XML]
Data Content Validation on
Product Level Validation on
Max Errors 100000
Registered Contexts File registered_context_products.json
Error
SXXP0003 Error reported by XML parser: Premature end of file.
Fatal error: [node=null,object=null,url=null,line=-1,col=-1,offset=-1]: Premature end of file.
Error
SXXP0003 Error reported by XML parser: Premature end of file.
Error
SXXP0003 Error reported by XML parser: Premature end of file.
Error
SXXP0003 Error reported by XML parser: Premature end of file.
Fatal error: [node=null,object=null,url=null,line=-1,col=-1,offset=-1]: Premature end of file.
Product Level Validation Results
FAIL: file:/lend_rdr_dlx_20240615.xml
WARNING [error.validation.missing_required_file] Cannot check versioning because XML could not be parsed.
ERROR [error.label.schema] Premature end of file.
1 product validation(s) completed
PASS: file:/lend_rdr_dld_20240615.xml
2 product validation(s) completed
Summary:
2 product(s)
1 error(s)
1 warning(s)
Product Validation Summary:
1 product(s) passed
1 product(s) failed
0 product(s) skipped
2 product(s) total
Referential Integrity Check Summary:
0 check(s) passed
0 check(s) failed
0 check(s) skipped
0 check(s) total
Message Types:
1 error.label.schema
1 error.validation.missing_required_file
End of Report
Completed execution in 9896 ms
Obviously this is much harder when there are more than 2 labels. However, using tools like find or ls - or the equivalent for the user's platform - would enable them to build lists based on more sensible criteria than suffix alone. For instance validate -t $(find validate_test -name label_\*.xml) or validate -t $(ls validate_test/label_*.xml). Redirection is another choice as in validate -t < mylist.txt to load a very large pile of products.
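For platforms without find or ls, the same list-building idea can be sketched in Python. This is only an illustration: the helper names find_label_targets and build_validate_command are hypothetical, the label_*.xml pattern is taken from the example above, and the command is assembled but not executed.

```python
from pathlib import Path


def find_label_targets(root, pattern="label_*.xml"):
    """Return sorted paths under root whose file name matches pattern.

    Hypothetical helper: selects targets by name pattern rather than
    by suffix alone, like the find/ls examples in this comment.
    """
    return sorted(str(p) for p in Path(root).rglob(pattern) if p.is_file())


def build_validate_command(targets):
    """Assemble an argument list using only the -t option shown in this thread."""
    return ["validate", "-t", *targets]


if __name__ == "__main__":
    # validate_test is the example directory name used in the comment above.
    print(build_validate_command(find_label_targets("validate_test")))
```

The resulting list could be handed to subprocess.run, assuming validate is on the PATH.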
Right now, validate is set to maximize signal to noise at the expense of some signal loss. We could undo this a bit and add a warning that a file found in the folder looks like but does not test to be a label: warning.label.content_not_label. However, this is not a fix as much as a squishy workaround until another user pushes to have the squish go back the other way.
Again, this is ultimately a design question. Providing pds4.folder gives the user a very imperfect way to filter content. Adding more messages increases the noise. Making the filtering more complex gets costly in a nonlinear way. Does it therefore make sense to provide pds4.folder, or should users rely on tools on their platform to do the work for them?
This problem seems to affect other malformed xml files, as well. The validator will treat these as non-labels.
I see where you are coming from that xml files do not have to be labels, so this is a difficult situation.
The standards do require that any file with the .xml extension is an XML formatted file (6C.1.6 Reserved File Name Extensions), so at least warning about invalid XML files (as opposed to non-label files) should be uncontroversial. This would cover empty XML files as well.
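As a rough sketch of what such a well-formedness check could look like (this is not how validate implements it; classify_xml and its three result strings are hypothetical names), Python's standard XML parser is enough to tell blank, malformed, and well-formed content apart:

```python
import xml.etree.ElementTree as ET


def classify_xml(data: bytes) -> str:
    """Classify raw file content as 'empty', 'malformed', or 'well_formed'.

    Only XML well-formedness is tested; nothing here checks whether
    the file is a PDS4 label.
    """
    if not data.strip():
        # Zero bytes or whitespace only: the blank-file case in this issue.
        return "empty"
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return "malformed"
    return "well_formed"


if __name__ == "__main__":
    for sample in (b"", b"<a><b></a>", b"<a><b/></a>"):
        print(repr(sample), classify_xml(sample))
```

Anything other than well_formed would be grounds for a warning on a .xml file, whether or not the file was meant to be a label.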
I thought policy was that a decision needs to be made at the bundle level whether all labels (and only labels) will be .xml files or whether all labels will be .lblx files. Shouldn't this remove the ambiguity at the heart of this problem?
The standards reference says that it's at the collection level, and not the bundle level.
Thanks for the correction, @jstone-psi, but I think my point remains. Shouldn't Validate always know whether a .xml file is intended to be a label or not, and thus not have the ambiguity that we're discussing?
I think it's a significant difference, since there will no longer be a guaranteed single extension type across validation runs. It might be possible to still come up with some logic that will work, but I can say as a fellow programmer that I'm glad it's not my problem to solve.
We did find a temporary workaround on our end, however. If you run with a referential integrity check, you will get clues that something is wrong with the label. It still takes some work to find the root cause (bad labels), but the validation run won't improperly pass.
@matthewtiscareno @jstone-psi thanks for the input here. As @matthewtiscareno notes, Validate "should" be able to know what file extensions to look for, but folks have been putting non-XML files in directories, even if they shouldn't. So we got a request to handle coming across these files more gracefully. We may have quieted too much of the logging information here. We will bring this back with a warning.
@al-niessner let's maybe go with this:
Right now, validate is set to maximize signal to noise at the expense of some signal loss. We could undo this a bit and add a warning that a file found in the folder looks like but does not test to be a label: warning.label.xml_not_label. However, this is not a fix as much as a squishy workaround until another user pushes to have the squish go back the other way.
Let's support this warning by default, and add a flag like --quiet-warning-xml-not-label to quiet these warnings
This will actually allow us to eventually start adding more --quiet-warning flags for some of the other warnings people do not want
folks have been putting non-XML files in directories, even if they shouldn't
I'm not aware of anything in the standards that forbids having files that are not part of the PDS4 archive in the same directory as PDS4 files. Please advise if otherwise.
I had assumed that @jennifergward and @jstone-psi were talking about .xml files whose LIDs are listed in the collection inventory. Those should be checked as labels and reported if they are empty or blank.
However, now I'm not so sure whether that is what they meant. Do they mean instead that any .xml file (blank or not) should be checked even if it is not listed in the inventory?
A corollary question is whether Validate can assume that each label's filename should match a LID in the inventory, or whether Validate must look inside the .xml file in order to determine its LID and thus check whether it is listed in the inventory.
For your last question, I think that the second part is true. The match between a LID and a filename is only a coincidence, and even if they are similar, they don't necessarily match exactly.
This leaves us with a bootstrapping problem. If the xml file is blank or otherwise malformed, then it is unparseable and the LID is unknowable. Since the inventories only have the LIDs, there is no way to determine if the file is a label and if it should have been in the inventory.
I do still think that we would get a lot of mileage out of checking that XML files (whether they have a .xml or a .lblx extension) are parseable at all. It is unambiguously a problem if these files are not valid XML, regardless of whether they are in the inventory.
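The bootstrapping problem can be made concrete with a small sketch. It assumes the standard PDS4 namespace and the Identification_Area/logical_identifier path; extract_lid is a hypothetical helper, not part of validate. A blank or malformed file yields no LID, so there is nothing to look up in the inventory:

```python
import xml.etree.ElementTree as ET

# Assumed PDS4 core namespace, in ElementTree's {uri}tag notation.
PDS_NS = "{http://pds.nasa.gov/pds4/pds/v1}"


def extract_lid(data: bytes):
    """Return the label's logical_identifier, or None if it cannot be read."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Blank or malformed XML: the LID is unknowable.
        return None
    node = root.find(f"{PDS_NS}Identification_Area/{PDS_NS}logical_identifier")
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()


if __name__ == "__main__":
    label = (
        b'<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1">'
        b"<Identification_Area>"
        b"<logical_identifier>urn:nasa:pds:example:data:item</logical_identifier>"
        b"</Identification_Area>"
        b"</Product_Observational>"
    )
    print(extract_lid(label))
    print(extract_lid(b""))
```

The LID in the usage example is made up for illustration.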
Thanks for the input @jstone-psi @matthewtiscareno. I think the plan will be:
- When .xml is the expected file extension for labels, check all of them and raise a warning in the event there is an empty XML file. As Jesse noted, we have no idea if this is supposed to be a label or not, so I think a warning suffices. We will have a flag to disable the warning in the event this is expected.
- When .lblx is used, do not try to read the .xml files, since they are now considered "data files" and the standard does mention that .lblx and .xml labels cannot be intermixed within a collection.
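The plan above can be sketched as a small decision function. This is an illustration only, not the validate implementation: check_blank_file and its parameters are hypothetical, the warning key is the one proposed earlier in this thread, and reporting a blank .lblx file as an error is an assumption in line with the report output later in this thread.

```python
from pathlib import Path


def check_blank_file(name, data, label_extension="xml", quiet=False):
    """Return None, or a (severity, message key) pair for a blank file.

    quiet mimics the proposed --quiet-warning-xml-not-label flag.
    """
    suffix = Path(name).suffix.lower().lstrip(".")
    if suffix != label_extension.lower():
        # Not the label extension for this run: treated as a data file.
        return None
    if data.strip():
        # Non-blank content is left to the normal label checks.
        return None
    if label_extension.lower() == "xml":
        # A blank .xml may or may not have been meant as a label.
        return None if quiet else ("WARNING", "warning.label.xml_not_label")
    # Assumption: a blank .lblx is unambiguously a broken label.
    return ("ERROR", "error.label.not_understandable")


if __name__ == "__main__":
    print(check_blank_file("lend_rdr_dlx_20240615.xml", b""))
    print(check_blank_file("lend_rdr_dlx_20240615.xml", b"", label_extension="lblx"))
```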
@jordanpadams and all that are watching,
validate has been updated and now throws some interesting errors and counts. One in particular is that blank files now count as a success with a warning. Not sure I can change a warning to failure or uncounted. It is because ValidationProblem() requires a URL and report is just counting URLs. Not sure how/if you want to try and fix this but the obvious thing is to elevate it to an error.
Below are two comments. The first with .xml as the suffix. The second with .lblx as the suffix. Let me know if this is good enough.
After fixing the report generator, it reads better:
validate --skip-context-validation -R pds4.folder -t test/resources/github1201
PDS Validate Tool Report
Configuration:
Version 3.8.0-SNAPSHOT
Date 2025-06-25T19:57:39Z
Parameters:
Targets [file:test/resources/github1201/]
Rule Type pds4.folder
Severity Level WARNING
Recurse Directories true
File Filters Used [*.xml, *.XML]
Data Content Validation on
Product Level Validation on
Max Errors 100000
Registered Contexts File main/resources/util/registered_context_products.json
Product Level Validation Results
PASS: file:/home/niessner/Projects/PDS/validate/src/test/resources/github1201/lend_rdr_dld_20240615.xml
1 product validation(s) completed
SKIP: file:/home/niessner/Projects/PDS/validate/src/test/resources/github1201/lend_rdr_dlx_20240615.xml
WARNING [warning.label.not_understandable] The label file cannot be parsed or understood using PDS4 schema but it may be a data file being as a label.
2 product validation(s) completed
Summary:
2 product(s)
0 error(s)
1 warning(s)
Product Validation Summary:
1 product(s) passed
0 product(s) failed
1 product(s) skipped
2 product(s) total
Referential Integrity Check Summary:
0 check(s) passed
0 check(s) failed
0 check(s) skipped
0 check(s) total
Message Types:
1 warning.label.not_understandable
End of Report
Completed execution in 11365 ms
validate --skip-context-validation -R pds4.folder --label-extension lblx -t test/resources/github1201
PDS Validate Tool Report
Configuration:
Version 3.8.0-SNAPSHOT
Date 2025-06-25T19:41:54Z
Parameters:
Targets [file:test/resources/github1201/]
Rule Type pds4.folder
Severity Level WARNING
Recurse Directories true
File Filters Used [*.lblx, *.LBLX]
Data Content Validation on
Product Level Validation on
Max Errors 100000
Registered Contexts File main/resources/util/registered_context_products.json
Product Level Validation Results
PASS: file:test/resources/github1201/lend_rdr_dld_20240615.lblx
1 product validation(s) completed
FAIL: file:/test/resources/github1201/lend_rdr_dlx_20240615.lblx
ERROR [error.label.not_understandable] The label file cannot be parsed or understood using PDS4 schema.
2 product validation(s) completed
Summary:
2 product(s)
1 error(s)
0 warning(s)
Product Validation Summary:
1 product(s) passed
1 product(s) failed
0 product(s) skipped
2 product(s) total
Referential Integrity Check Summary:
0 check(s) passed
0 check(s) failed
0 check(s) skipped
0 check(s) total
Message Types:
1 error.label.not_understandable
End of Report
Completed execution in 16398 ms
@jordanpadams Okay, I think I got it all fixed up and reading well.