Context reference check is not collapsing whitespace
Checked for duplicates
Yes - I've already checked
🐛 Describe the bug
When I created test labels that have a line break in a particularly long observing system component name ("The Origins, Spectral Interpretation, Resource Identification, Security, Regolith Explorer (OSIRIS-REx) Spacecraft"), I noticed that the validate tool raised a warning that the component name did not match the context object.
This warning does not appear when there are no line breaks in the component name. However, it does appear again when I have multiple spaces in the name.
🕵️ Expected behavior
The data type of Observing_System_Component/name is UTF8_Short_String_Collapsed, so these extra spaces and line breaks should be collapsed into a single space before the check is performed. When the values are collapsed, no warning would occur for these labels.
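To illustrate the expected behavior, here is a minimal Python sketch of a comparison that collapses both values first. It follows the XML Schema `whiteSpace="collapse"` rule (tab, line feed, and carriage return become spaces, runs of spaces shrink to one, leading and trailing spaces are dropped). The names `collapse` and `names_match` are hypothetical and are not taken from the validate code base, which is written in Java.

```python
import re

# XML Schema treats only these characters as whitespace: tab, LF, CR, space.
_REPLACE = re.compile(r"[\t\n\r]")
_RUNS = re.compile(r" {2,}")


def collapse(value: str) -> str:
    """Apply XML Schema whiteSpace='collapse' to a string (hypothetical helper)."""
    replaced = _REPLACE.sub(" ", value)          # step 1: replace tab/LF/CR with a space
    return _RUNS.sub(" ", replaced).strip(" ")   # step 2: squeeze runs, trim both ends


def names_match(label_name: str, context_name: str) -> bool:
    """Compare two names after collapsing whitespace in both (hypothetical helper)."""
    return collapse(label_name) == collapse(context_name)


context_name = (
    "The Origins, Spectral Interpretation, Resource Identification, "
    "Security, Regolith Explorer (OSIRIS-REx) Spacecraft"
)
# Same name as it might appear in a label: one line break and one double space.
label_name = (
    "The Origins, Spectral Interpretation,\n"
    "        Resource Identification,  Security, "
    "Regolith Explorer (OSIRIS-REx) Spacecraft"
)

print(label_name == context_name)             # raw comparison
print(names_match(label_name, context_name))  # comparison after collapsing
```

With a check like this, the raw strings differ but the collapsed strings are equal, so no mismatch warning would be raised.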
📜 To Reproduce
- Find a label that validates correctly.
- Add additional spaces in the middle of the Observing_System_Component/name value.
- Try to validate the new label, and observe the warning.
🖥 Environment Info
- Version of this software: validate 3.7.1
- Operating System: macOS Sonoma 14.7.5 with OpenJDK 23.0.2
📚 Version of Software Used
validate 3.7.1
🩺 Test Data / Additional context
🦄 Related requirements
- 🦄 https://github.com/NASA-PDS/validate/issues/970
- 🦄 https://github.com/NASA-PDS/validate/issues/861
- 🦄 #857
⚙️ Engineering Details
No response
🎉 Integration & Test
No response
@jordanpadams @jstone-psi
I am not a lawyer, nor do I play one on television, nor do I want to do either. However, the definition of UTF8_Short_String_Collapsed says that the value "is" collapsed white space, not that it will be, should be, or could be. In other words, what is written in the XML must already be collapsed; it is not left to the reader to collapse it. Is this going to become a DDWG thing?
I think you are not off the mark with your reading (the standards reference describes all of the collapsed data types as if only pre-collapsed data were allowed). If what you are saying is correct, then this means that there is still a different problem. It seems that this should have been a validation failure, since the original value contained invalid characters.
Something strikes me as not quite right about this answer, though. First, the example products have values that are not pre-collapsed. Additionally, the XML specifications say that this should be handled at the data-normalization step, not in storage.
- https://www.w3.org/TR/xmlschema-2/#datatype-components
- https://www.w3.org/TR/REC-xml/#AVNormalize
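To make the two readings concrete, here is a small Python sketch. Under the strict reading, a value is acceptable only if it is already in collapsed form as written in the XML. Under the normalization reading, any value is acceptable and is collapsed before use. The function names are hypothetical and only illustrate the distinction; they do not describe how validate or the schematron rules work today.

```python
import re


def collapse(value: str) -> str:
    """XML Schema whiteSpace='collapse': replace tab/LF/CR, squeeze spaces, trim."""
    replaced = re.sub(r"[\t\n\r]", " ", value)
    return re.sub(r" {2,}", " ", replaced).strip(" ")


def is_pre_collapsed(raw: str) -> bool:
    """Strict reading: the stored text must already equal its collapsed form."""
    return raw == collapse(raw)


raw_value = "OSIRIS-REx\n    Spacecraft"

# Strict reading: this raw value would be flagged as invalid.
print(is_pre_collapsed(raw_value))

# Normalization reading: the value is accepted and compared in collapsed form.
print(collapse(raw_value))
```

Which of the two checks applies is exactly the question for the DDWG.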
At this point, I'm willing to back away from this issue for now, but I'll leave these for reference.
@jstone-psi I hate to say it, but can we get a ticket opened with the DDWG to clarify what is meant here? I agree this is confusing, and we clearly are not checking this properly either in schematron or in our validation checks.
@rsjoyner any thoughts here?