
MultiZarrToZarr speed improvements

Open lsterzinger opened this issue 4 years ago • 9 comments

Opening an issue so we can discuss processing time for MultiZarrToZarr. I'm running with the following code:

mzz = MultiZarrToZarr(
    json_list,
    remote_protocol='az',
    remote_options={
       'account_name' : 'goeseuwest'
    },    
    xarray_open_kwargs={
        'decode_cf' : False,
        'mask_and_scale' : False,
        'decode_times' : False,
        'use_cftime' : False,
        'decode_coords' : False,
    },
    xarray_concat_args={
        "data_vars": "minimal",
        "coords": "minimal",
        "compat": "override",
        "join": "override",
        "combine_attrs": "override",
        "dim": "t"

    }
)

mzz.translate('combined.json')

Running mzz.translate() takes upwards of 40 minutes on a list of 144 reference jsons, which were generated with inline_threshold=300. It looks like a good chunk of the time is divided between to_zarr() and split().

[snakeviz profile screenshot]

I've attached the profile file here as well, which can be visualized with snakeviz ./multizarr_profile.

multizarr_profile.zip

lsterzinger avatar Jun 29 '21 16:06 lsterzinger
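For context, per-file reference JSONs with inline_threshold=300 are typically produced along these lines (a minimal sketch, not the exact code used for this dataset; the glob pattern and output filenames are illustrative):

import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

fs = fsspec.filesystem("az", account_name="goeseuwest")
urls = fs.glob("container/path/*.nc")   # illustrative pattern, not the real path
for u in urls:
    with fs.open(u) as f:
        # inline_threshold=300 embeds any chunk smaller than 300 bytes directly in the JSON
        refs = SingleHdf5ToZarr(f, u, inline_threshold=300).translate()
    with open(u.split("/")[-1] + ".json", "w") as out:
        json.dump(refs, out)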

posixpath.split is being called 391,721,185 times! How many chunks are there here? Even dask.optimize takes a decent while. Sure, we could stick an lru_cache around split, but something else is going on here.

martindurant avatar Jun 29 '21 16:06 martindurant
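For reference, the lru_cache idea mentioned above would look roughly like this (a sketch only; whether it helps depends on how often the same key strings are split):

import functools
import posixpath

# Memoize path splits: reference keys repeat the same prefixes many times,
# so repeated calls hit the cache instead of re-parsing the string.
split_cached = functools.lru_cache(maxsize=None)(posixpath.split)
# Crude monkeypatch for experimentation; only affects callers that look up
# posixpath.split at call time, not those that imported split directly.
posixpath.split = split_cached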

I think it's just 1 chunk per file? At least that seems to be what xarray is treating it as. But one of the JSONs lists a variable as having chunks: [226, 226]. I'll attach a zip of all the JSONs I'm trying to combine. combined.zip

lsterzinger avatar Jun 29 '21 17:06 lsterzinger

Each JSON has 17,217 references. An example data variable, and there are many of these:

In [20]: ds.CMI_C02.data
Out[20]: dask.array<xarray-CMI_C02, shape=(5424, 5424), dtype=float32, chunksize=(226, 226), chunktype=numpy.ndarray>

has 576 chunks of about 200 kB each. This is tiny! If you were to actually do any analysis on this data, you would have to specify a much larger chunk size to dask in order not to completely swamp the scheduler.

martindurant avatar Jun 29 '21 18:06 martindurant
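Those numbers follow directly from the array shape and chunk shape (a quick check):

import math

shape, chunk, itemsize = 5424, 226, 4          # float32 is 4 bytes
chunks_per_dim = math.ceil(shape / chunk)      # 24
n_chunks = chunks_per_dim ** 2                 # 576 chunks per variable
chunk_bytes = chunk * chunk * itemsize         # 204,304 bytes, roughly 200 kB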

That would explain why I was having problems actually doing analysis on it 😆

Can we specify a chunk size with xarray during initial individual json creation? I'm guessing this would also speed up the MultiZarr creation

lsterzinger avatar Jun 29 '21 18:06 lsterzinger

Can we specify a chunk size with xarray during initial individual json creation

No, xarray is not actually reading the data at this point (except some specific coordinate arrays), it's reading the metadata pieces that take so long. We could probably do something smart about the caching of bytes blocks, but the access pattern of scanning through the file seems to be unique to each HDF data provider.

The end goal is to have the scanning done in the cloud, close to the data - that's what pangeo-forge is all about. Maybe the problem will go away.

martindurant avatar Jun 29 '21 18:06 martindurant

Gotcha, so my understanding is that, as things stand, there's no easy way to make this process faster. That's fine with me; what I want to show off is the speed of opening a dataset for analysis from references that have already been created. I should be able to specify a larger chunk size in xarray when I do an xr.open_dataset on an fsspec mapper, correct?

lsterzinger avatar Jun 29 '21 18:06 lsterzinger

Exactly right - there is little getting around the upfront cost of creating the references, but analysis will be faster as a result for downstream users.

martindurant avatar Jun 29 '21 18:06 martindurant
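For the downstream side of this, opening the combined reference file with larger dask chunks looks roughly like the following (a sketch; the dimension names and chunk sizes are illustrative, chosen as multiples of the stored 226 x 226 chunks):

import fsspec
import xarray as xr

# Build a mapper over the combined references, pointing at the remote Azure store
m = fsspec.get_mapper(
    "reference://",
    fo="combined.json",
    remote_protocol="az",
    remote_options={"account_name": "goeseuwest"},
)
# Ask dask for much larger logical chunks than the tiny on-disk ones
ds = xr.open_dataset(
    m,
    engine="zarr",
    backend_kwargs={"consolidated": False},
    chunks={"x": 2712, "y": 2712},   # illustrative; 2712 = 12 * 226
)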

.. running with the following code:

mzz = MultiZarrToZarr(
    json_list,
    remote_protocol='az',
    remote_options={
       'account_name' : 'goeseuwest'
    },    
    xarray_open_kwargs={
        'decode_cf' : False,
        'mask_and_scale' : False,
        'decode_times' : False,
        'use_cftime' : False,
        'decode_coords' : False,
    },
    xarray_concat_args={
        "data_vars": "minimal",
        "coords": "minimal",
        "compat": "override",
        "join": "override",
        "combine_attrs": "override",
        "dim": "t"

    }
)

mzz.translate('combined.json')

Is this syntax still all valid? I ask because, despite reading the docstrings, there is (unless I'm mistaken) no full example of how to use the xarray kwargs. Would using use_cftime here suffice, for example, or is the following still needed?

        multi_zarr = MultiZarrToZarr(
            input_references,
            remote_protocol="file",
            ..
            ..
            coo_map={"time": "cf:time"},
            out=output,
        )

NikosAlexandris avatar Nov 11 '23 11:11 NikosAlexandris

No, we don't use any direct xarray kwargs - MultiZarrToZarr doesn't use xarray to infer anything

martindurant avatar Nov 13 '23 19:11 martindurant
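For completeness, a minimal sketch of the current API for the same kind of time concatenation, with no xarray kwargs; the dimension name "t" and the remote options mirror the original example but are otherwise illustrative:

from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    json_list,                                     # per-file reference JSONs
    remote_protocol="az",
    remote_options={"account_name": "goeseuwest"},
    concat_dims=["t"],                             # concatenate along the time dimension
    coo_map={"t": "cf:t"},                         # decode coordinate values using CF conventions
)
mzz.translate("combined.json")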