Kerchunk for representing a subset of a dataset

forrestfwilliams opened this issue 3 years ago • 2 comments

Is it possible, or how could we enable, using kerchunk to create files that represent a subset of a dataset? In other words, given that you know which chunks you would like to include, and where those chunks are located, can you use kerchunk to create a file that zarr can use to read in a portion of a single dataset?

The particular use case I have in mind is creating a kerchunk JSON file that represents a single burst within a Sentinel-1 SLC dataset. Bursts are subsets of SLC files, and each SLC comes with metadata stating where each burst is located within the image. SLCs come as GeoTIFFs that are chunked by line, and the metadata gives the lines on which each burst begins and ends. I've done some digging, and here is a rough idea of how this might work (a sketch of steps 2-4 follows the list):

  1. Obtain the line location information for a burst
  2. Convert this line information to byte offset and length
  3. Pass this information to fsspec/tifffile
  4. Write this information to a kerchunk json file
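
For steps 2-4, something along these lines might work. This is a minimal sketch, not tested against real SLC data: it assumes one TIFF strip per line and uncompressed strips, glosses over Sentinel-1's complex-int16 sample format, and the function name and the `burst` array name are made up for illustration.

```python
import json

import tifffile

def burst_to_kerchunk(tiff_path, tiff_url, line_start, line_end, out_json):
    """Write a kerchunk reference file covering lines line_start..line_end (inclusive)."""
    with tifffile.TiffFile(tiff_path) as tif:
        page = tif.pages[0]
        offsets = page.dataoffsets      # byte offset of each strip in the file
        lengths = page.databytecounts   # byte length of each strip
        width = page.imagewidth
        dtype = page.dtype              # caution: complex-int16 needs special handling

    n_lines = line_end - line_start + 1
    zarray = {
        "shape": [n_lines, width],
        "chunks": [1, width],           # one strip (one line) per chunk
        "dtype": dtype.str,
        "compressor": None,             # assumes uncompressed strips
        "filters": None,
        "fill_value": None,
        "order": "C",
        "zarr_format": 2,
    }
    refs = {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "burst/.zarray": json.dumps(zarray),
    }
    for i, line in enumerate(range(line_start, line_end + 1)):
        # each chunk reference is a [url, byte_offset, byte_length] triple
        refs[f"burst/{i}.0"] = [tiff_url, offsets[line], lengths[line]]

    with open(out_json, "w") as f:
        json.dump({"version": 1, "refs": refs}, f)
```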

I have already completed step 1, but could use assistance with the last three. At this point, I'm not concerned with including metadata with the subsetted data. As an example, here is a link to download a Sentinel-1 SLC from the Alaska Satellite Facility's archive, and I've included the relevant file and line location information for one burst below:

Burst Number: 1
PATH: S1A_IW_SLC__1SDV_20211229T231926_20211229T231953_041230_04E66A_3DBE.SAFE/measurement/s1a-iw2-slc-vv-20211229t231926-20211229t231951-041230-04e66a-005.tiff
Line Start: 0
Line End: 1509

forrestfwilliams · Jul 10 '22 22:07

As things stand, the single-file kerchunk ingestors produce one reference for every chunk in the target file. It is expected that these reference files are small (compared to the data) and can then be reused in a variety of second-stage processing.
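
(For concreteness, a version-1 reference set is just JSON mapping each chunk key to a [url, offset, length] triple, with the zarr metadata stored inline as JSON strings; the paths and numbers below are invented:)

```json
{
  "version": 1,
  "refs": {
    ".zgroup": "{\"zarr_format\": 2}",
    "data/.zarray": "{\"shape\": [1510, 25000], \"chunks\": [1, 25000], ...}",
    "data/0.0": ["s3://bucket/slc.tiff", 8192, 100000],
    "data/1.0": ["s3://bucket/slc.tiff", 108192, 100000]
  }
}
```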

At the combine stage, there are many options for how to go about the process, and more are being added. Currently, the input references can be preprocessed and the final reference set post-processed by arbitrary functions; you would want the former for this case. Since we also consider the coordinates of each chunk, we could easily allow explicit logic to exclude some chunks from consideration, but this has not been done.
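
For example, a preprocess callable could drop chunk references outside the burst before combining. A sketch, assuming MultiZarrToZarr's preprocess hook receives and returns the references dict (check the signature against the kerchunk docs); the key-filtering logic and file names are invented:

```python
from kerchunk.combine import MultiZarrToZarr

def keep_burst_chunks(refs):
    # pass metadata keys (".zarray", ".zattrs", ...) through untouched;
    # keep only chunk keys whose first (row) index falls inside the burst
    out = {}
    for key, val in refs.items():
        name = key.rsplit("/", 1)[-1]
        if name.startswith(".") or 0 <= int(name.split(".")[0]) <= 1509:
            out[key] = val
    return out

mzz = MultiZarrToZarr(
    ["slc_refs.json"],
    concat_dims=["time"],          # placeholder concat dimension
    preprocess=keep_burst_chunks,
)
combined = mzz.translate()
```

Note that simply dropping keys this way leaves gaps that read back as fill values; hence the renumbering caveat below.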

Editing the reference sets is super-easy, however, so you could process any JSON file to make an amended version as you like, making sure that if you remove keys, the remaining keys are renamed and the zarr metadata is updated. This might be a common enough pattern to warrant helper functions in kerchunk (@peterm790, if interested).
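
A sketch of that editing pattern, assuming a version-1 JSON reference file, chunks of one line each, and an array named `data` (file and array names are illustrative):

```python
import json

with open("slc_refs.json") as f:       # references for the whole SLC
    fileset = json.load(f)

first, last = 0, 1509                  # burst line (chunk) range, inclusive

new_refs = {}
for key, val in fileset["refs"].items():
    name = key.rsplit("/", 1)[-1]
    if name.startswith("."):           # copy zarr metadata keys as-is
        new_refs[key] = val
        continue
    indices = name.split(".")
    row = int(indices[0])
    if first <= row <= last:
        # renumber the kept chunks so they start at 0
        indices[0] = str(row - first)
        new_refs[key.rsplit("/", 1)[0] + "/" + ".".join(indices)] = val

# shrink the array shape in the zarr metadata to match the kept chunks
zarray = json.loads(new_refs["data/.zarray"])
zarray["shape"][0] = last - first + 1
new_refs["data/.zarray"] = json.dumps(zarray)

with open("burst_refs.json", "w") as f:
    json.dump({"version": 1, "refs": new_refs}, f)
```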

It is important to note that you are in every case stuck* with the inherent chunking of the original files; you do not get an arbitrary choice.

(* except for the rare and special case of uncompressed raw C buffers)

martindurant · Jul 11 '22 13:07

Hi @martindurant, thank you for your reply. Luckily my data is already chunked conveniently for the subsets I want to produce. Could you give me some further guidance on how to access the references kerchunk creates at the various stages you mentioned? Specifically, how can I access information regarding the references created for the chunks of the target file?
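
For concreteness, generating the single-file reference set and inspecting its entries might look like the sketch below, assuming kerchunk's TIFF ingestor `kerchunk.tiff.tiff_to_zarr` (check the docs for the exact return shape; it may be the bare references mapping or a `{"version": 1, "refs": ...}` payload):

```python
import kerchunk.tiff

# scan the SLC measurement GeoTIFF and build its reference set
refs = kerchunk.tiff.tiff_to_zarr(
    "s1a-iw2-slc-vv-20211229t231926-20211229t231951-041230-04e66a-005.tiff"
)

# chunk keys map to [url, offset, length] triples; keys ending in
# ".zarray"/".zattrs" hold the zarr metadata as JSON strings
for key in list(refs)[:10]:
    print(key, refs[key])
```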

forrestfwilliams · Jul 11 '22 21:07