As a user, I want to have the ability to skip directories in a validation run
Checked for duplicates
No - I haven't checked
🧑‍🔬 User Persona(s)
Data Engineers and Data Providers
💪 Motivation
...so that I can produce Validate reports without WARNINGs regarding the absence of PDS4 labels for files/directories that are not intended to have them.
📖 Additional Details
For example, MSL archive bundles meet both PDS3 and PDS4 standards by containing the essential files needed to pass both validations. However, the (PDS3) /EXTRAS directory holds three sizes of browse files: /FULL, /BROWSE, and /THUMBNAIL. Only the /BROWSE directory gets PDS4 XML labels added. The /FULL and /THUMBNAIL directories do not, yet Validate lists WARNINGs for every image within these two directories, needlessly lengthening the validation report. Being able to omit specific directories that are not of interest would save time and avoid needless reporting of files that are not listed in any collection_*.csv file.
Acceptance Criteria
Given When I perform Then I expect
⚙️ Engineering Details
No response
🎉 I&T
No response
Thanks @ralanis-jpl, we will add this to the backlog. Is this a blocker to anything you are trying to do? Or is this more of an inconvenience in triaging a validate run?
It's not a blocker, but it has been noticeable as we're folding more still-active archives into the PDS3 to PDS4 migration. As you said, it would help with analyzing validation runs. Thanks.
@jordanpadams @ralanis-jpl
Would it be possible to have the full content in one tree, then have a script build a PDS4 tree that links back to the full tree, omitting the unwanted directories? You could then validate the PDS4 tree, and it would behave as you want. The script would also give you much better control over omissions that you may want later.
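A minimal sketch of such a script, in Python, under several assumptions: the function name `build_link_tree` and the example directory names are illustrative only, the links are symbolic links, and it has not been confirmed here that Validate follows symbolic links when walking a bundle, so that should be checked before relying on this approach.

```python
import os
from pathlib import Path


def build_link_tree(full_root, link_root, excluded):
    """Mirror full_root under link_root using symlinks, skipping excluded dirs.

    excluded holds directory paths relative to full_root, e.g. "EXTRAS/FULL".
    Returns the number of files linked.
    """
    full_root = Path(full_root).resolve()
    link_root = Path(link_root)
    # Normalize entries so "EXTRAS/FULL" and "/EXTRAS/FULL/" compare equal.
    skip = {Path(p).as_posix().strip("/") for p in excluded}
    linked = 0
    for dirpath, dirnames, filenames in os.walk(full_root):
        rel_dir = Path(dirpath).relative_to(full_root)
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if (rel_dir / d).as_posix() not in skip]
        target_dir = link_root / rel_dir
        target_dir.mkdir(parents=True, exist_ok=True)
        for name in filenames:
            link = target_dir / name
            if not (link.is_symlink() or link.exists()):
                # Each link points back at the file in the full tree.
                link.symlink_to(Path(dirpath) / name)
                linked += 1
    return linked


# Hypothetical usage for the layout described in this ticket:
# build_link_tree("MSLNAV_1XXX", "MSLNAV_1XXX_pds4", ["EXTRAS/FULL", "EXTRAS/THUMBNAIL"])
```

The linked tree could then be given to validate in place of the full tree. Keeping the exclusions in one list makes it easy to add or drop directories later without touching the archive itself.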
@ralanis-jpl Based upon all the priorities we have in our backlog, unfortunately, I am going to need to keep this in the icebox for the time being.
As @al-niessner noted, I would recommend doing a find/tree on the file system, filtering the result down to what you want validate to look at, and feeding that in either as a manifest or via the CLI as a list of targets. When running the final bundle validation, you will need to just let it run, but that will hopefully be a last stop.
Let us know if this becomes a blocker for running validate, and we will reevaluate.
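The find-and-filter step suggested above could be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `list_targets` and `write_manifest` are made up for this example, labels are assumed to be the files ending in `.xml`, and the manifest is assumed to be a plain text file with one path per line. The exact manifest format and the option used to pass it to validate are not stated in this thread and should be taken from the Validate documentation.

```python
import os
from pathlib import Path


def list_targets(root, excluded, suffix=".xml"):
    """Return sorted paths of label files under root, skipping excluded dirs.

    excluded holds directory paths relative to root, e.g. "EXTRAS/FULL".
    Assumption: a label is any file whose name ends with suffix.
    """
    root = Path(root)
    skip = {Path(p).as_posix().strip("/") for p in excluded}
    targets = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = Path(dirpath).relative_to(root)
        # Prune in place so excluded directories are never visited.
        dirnames[:] = [d for d in dirnames if (rel_dir / d).as_posix() not in skip]
        for name in filenames:
            if name.lower().endswith(suffix):
                targets.append(str(Path(dirpath) / name))
    return sorted(targets)


def write_manifest(targets, manifest_path):
    """Write one target path per line (assumed manifest layout)."""
    Path(manifest_path).write_text("\n".join(targets) + "\n")


# Hypothetical usage for the layout described in this ticket:
# labels = list_targets("MSLNAV_1XXX", ["EXTRAS/FULL", "EXTRAS/THUMBNAIL"])
# write_manifest(labels, "targets.txt")
```

Note that a list of individual labels like this exercises label-level checks only; as stated above, the final bundle validation still has to walk the whole bundle.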
For the Validate option of "-t", does one specify a directory only with its name or is the path also required? For example,
$ ./validate MSLNAV_1XXX -R pds4.bundle -D -t bundle.xml DATA EXTRAS/BROWSE -v 2 -r MSLNAV_1XXX_rpt_01.txt
(DATA/, EXTRAS/ and bundle.xml all sit at the same top level)
Thanks
@ralanis-jpl
I am not expert on this so @jordanpadams may have to correct me.
The -R pds4.bundle option tells validate that the object pointed to by -t, in your case bundle.xml, is a bundle, and that it should use everything it knows about bundles to process it; that is, use it as the root location, search all the directories it points to, and so on.
If the bundle is in the directory MSLNAV_1XXX, then it should be -t MSLNAV_1XXX/bundle.xml.
As written, validate thinks there are 4 bundles: MSLNAV_1XXX, bundle.xml, DATA, and EXTRAS/BROWSE.
If all you want to do is process the bundle: ./validate -D -v 2 -r MSLNAV_1XXX_rpt_01.txt -R pds4.bundle -t MSLNAV_1XXX/bundle.xml
or rather, $ ./validate -D -v 2 -r MSLNAV_1XXX_rpt_01.txt -R pds4.bundle -t MSLNAV_1XXX/bundle.xml MSLNAV_1XXX/DATA MSLNAV_1XXX/EXTRAS/BROWSE
If you want it to process directories rather than bundles,
./validate -D -v 2 -r MSLNAV_1XXX_rpt_01.txt -R pds4.directory -t MSLNAV_1XXX MSLNAV_1XXX/DATA MSLNAV_1XXX/EXTRAS/BROWSE
It will not do what you want. It will walk EVERY directory in MSLNAV_1XXX.
When you give it more than one target, it processes the union of those targets. Also, all targets should be of the same type, and that type should match the -R value (or be a label when -R is not specified).
My intention was to carry out a 'bundle' validation with all of its integrity checking as well. I was trying to circumvent Validate's inability to ignore specific directories by instead targeting only the sub-directories that contain PDS4 labels.
I suspected as much. Unfortunately, or fortunately depending on your outlook on bundles, validate tries to identify files that are not mentioned. I guess many of those who develop PDS bundles like this feature, so it is fortunate.
It makes sense if the bundle is a 100% PDS4 bundle. The bundles I am dealing with are "hybrid" PDS3/PDS4 bundles, meaning that not all of the directories are intended to be accounted for by the PDS4 validation software. Some are there solely for PDS3's sake. The nice thing about the old PDS3 validation software was that one could specify which directories to ignore. I had hoped PDS4 Validate could do the same. Thanks.
The reason for hybrid bundles is to preserve the PDS3 format that users still want and have developed software for. At the same time, adding PDS4 labels to the data that is already there reduces duplication of data and meets the PDS3 to PDS4 migration mandate.
@ralanis-jpl Understood on the migration. As @al-niessner mentioned, when running bundle validation, just point to the bundle.xml and nothing else. With validate's current functionality, it will go through all sub-directories. Unfortunately, we have lots of other work to do, and since this is more of a nuisance than a blocker, we have given this a "could-have" priority and put it into our icebox for implementation at a later date. Thanks!
I too would like to see this improvement for my archive and migration tasks. I submitted ticket #1252 with my input and use cases (and then closed it) before a deeper search turned up this ticket and #1079.
I would like to see an --exclude option that takes files/directories, and perhaps an --exclude-list option that points to a list of files/directories to exclude.