Use Case: Describe a collection of highly related files

Open • multimeric opened this issue 8 months ago • 1 comment

As a researcher, I want to be able to describe a set of related files so that the metadata file does not contain redundant descriptions.

Use Case

Here is a simple example dataset from a MERSCOPE microscope:

$ ls -1 region_R1/images
manifest.json
micron_to_mosaic_pixel_transform.csv
mosaic_DAPI_z0.tif
mosaic_DAPI_z1.tif
mosaic_DAPI_z2.tif
mosaic_DAPI_z3.tif
mosaic_DAPI_z4.tif
mosaic_DAPI_z5.tif
mosaic_DAPI_z6.tif
mosaic_PolyT_z0.tif
mosaic_PolyT_z1.tif
mosaic_PolyT_z2.tif
mosaic_PolyT_z3.tif
mosaic_PolyT_z4.tif
mosaic_PolyT_z5.tif
mosaic_PolyT_z6.tif

According to the user guide:

The images are single channel, single plane, 16-bit grayscale tiff files, with the naming convention mosaic_{stain name}_z{ZIndex}.tif

Now, I could describe every single image file here, which would produce 14 (in real life, many more) almost identical entities:

[
    {
        "@id": "mosaic_DAPI_z0.tif",
        "@type": "File",
        "encodingFormat": "image/tiff",
        "description": "Mosaic tiff capturing the 0th Z-slice for the DAPI stain."
    },
    {
        "@id": "mosaic_DAPI_z1.tif",
        "@type": "File",
        "encodingFormat": "image/tiff",
        "description": "Mosaic tiff capturing the 1st Z-slice for the DAPI stain."
    },
    ...
]

I also don't like the idea of describing these only as part of the description of the parent Dataset, because then I miss all of the image-specific properties, I lose the ability to run queries like "find all TIFF files", and the Dataset description would become exceedingly long.

Suggestion

One suggestion I have is to allow us to use glob-style patterns to describe sets of files.

One way this might work is simply by allowing an ID which is a glob. For example:

    {
        "@id": "mosaic_DAPI_z*.tif",
        "@type": "File",
        "encodingFormat": "image/tiff",
        "description": "Mosaic tiff capturing a single Z-slice for the DAPI stain."
    }

The only downside of this is that * is an unusual character in an ID, but it is technically legal in an IRI according to RFC 3987.
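To illustrate how a consumer might interpret such an ID, here is a minimal Python sketch. It assumes glob semantics as implemented by Python's standard `fnmatch` module; the function name `expand_glob_entity` and the short file list are hypothetical and not part of any RO-Crate library.

```python
from fnmatch import fnmatchcase


def expand_glob_entity(entity, filenames):
    """Return one copy of `entity` per filename matching its glob-style @id.

    Assumption: the @id is interpreted as a case-sensitive glob pattern.
    """
    pattern = entity["@id"]
    return [
        {**entity, "@id": name}  # same properties, concrete file name as @id
        for name in sorted(filenames)
        if fnmatchcase(name, pattern)
    ]


# Hypothetical subset of the directory listing above.
files = [
    "manifest.json",
    "mosaic_DAPI_z0.tif",
    "mosaic_DAPI_z1.tif",
    "mosaic_PolyT_z0.tif",
]

glob_entity = {
    "@id": "mosaic_DAPI_z*.tif",
    "@type": "File",
    "encodingFormat": "image/tiff",
    "description": "Mosaic tiff capturing a single Z-slice for the DAPI stain.",
}

expanded = expand_glob_entity(glob_entity, files)
for item in expanded:
    print(item["@id"])
```

With this interpretation, a query such as "find all TIFF files" could still run against the expanded entities, while the metadata file itself keeps only the single glob entity.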

Alternatively, we could create a new property called pattern (I'm sure we could find an IRI for it that corresponds to practical usage) whose value is a glob pattern selecting a set of files. Attaching it to a Dataset would capture that subset of files, and any other property on the Dataset would then be assumed to describe each file the pattern matches. For example:

    {
        "@id": "#mosaic-dapi",
        "@type": "Dataset",
        "encodingFormat": "image/tiff",
        "pattern": "mosaic_DAPI_z*.tif",
        "description": "Mosaic tiff capturing a single Z-slice for the DAPI stain."
    }

I like this less, because it's a bit odd and ugly to attach File properties to a Dataset.
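For comparison, this is roughly how a consumer would have to resolve the properties of a single file under the second approach. It is only a sketch: `pattern` is the hypothetical property proposed here (not an existing RO-Crate or schema.org term), `inherited_properties` is an illustrative name, and `@type` is assumed to be a plain string rather than a list.

```python
from fnmatch import fnmatchcase

# Keys that identify the Dataset itself and should not be copied to files.
# "pattern" is the hypothetical property from this proposal.
STRUCTURAL_KEYS = {"@id", "@type", "pattern"}


def inherited_properties(filename, graph):
    """Collect properties from every Dataset whose pattern matches `filename`."""
    props = {}
    for entity in graph:
        pattern = entity.get("pattern")
        if (
            entity.get("@type") == "Dataset"
            and pattern
            and fnmatchcase(filename, pattern)
        ):
            props.update(
                {k: v for k, v in entity.items() if k not in STRUCTURAL_KEYS}
            )
    return props


graph = [
    {
        "@id": "#mosaic-dapi",
        "@type": "Dataset",
        "encodingFormat": "image/tiff",
        "pattern": "mosaic_DAPI_z*.tif",
        "description": "Mosaic tiff capturing a single Z-slice for the DAPI stain.",
    }
]

print(inherited_properties("mosaic_DAPI_z3.tif", graph))
print(inherited_properties("mosaic_PolyT_z3.tif", graph))
```

The extra lookup step is part of why this variant feels heavier: every consumer needs to know that Dataset properties are meant to apply to the matched files.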

multimeric, May 29 '25 06:05

Understand the stress, do not like the solution.

IMHO the tension to "not repeat yourself" should be resolved by tools that help authors create the describing graphs (e.g., we are working on introducing a ro-creator-tool that takes a simplified roc.yml file in a certain folder and produces the actual ro-crate-metadata.json).

I see two separate concerns:

  • (effective reading) be clear and provide directly accessible information in a full graph, possibly with redundant repetition
  • (effective writing) be effective, concise, efficient, consistent, reduce copy-paste errors

I would not try to mix these concerns.
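The split between the two concerns can be sketched with a small generator: the author writes a concise specification (here just a list of stains and a Z-slice count), and a script emits the full, redundant graph. This is not the ro-creator-tool or its roc.yml format, neither of which is shown in this thread; the function names and parameters below are hypothetical, and the file naming follows the convention quoted from the user guide.

```python
import json
from itertools import product


def ordinal(n):
    """Format 0 -> '0th', 1 -> '1st', 2 -> '2nd', 11 -> '11th', and so on."""
    if 10 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def generate_entities(stains, z_count):
    """Emit one File entity per (stain, Z index) pair."""
    return [
        {
            "@id": f"mosaic_{stain}_z{z}.tif",
            "@type": "File",
            "encodingFormat": "image/tiff",
            "description": (
                f"Mosaic tiff capturing the {ordinal(z)} Z-slice "
                f"for the {stain} stain."
            ),
        }
        for stain, z in product(stains, range(z_count))
    ]


# Concise "writing" input: two stains, seven Z-slices each.
entities = generate_entities(["DAPI", "PolyT"], 7)

# Verbose "reading" output: show the first two of the generated entities.
print(json.dumps(entities[:2], indent=4))
```

The metadata file stays fully explicit for readers and query tools, while the repetition is handled once, in the generator.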

mpo-vliz, Oct 15 '25 09:10