
MultiZarrToZarr append method - coo_map not working as expected

Open sreesanjeevkg opened this issue 1 year ago • 5 comments

I have approximately 4000 kerchunk JSON files in the mentioned directory. All of them need to be combined with MultiZarrToZarr into a single reference file, and all of them require HTTP calls. When I call MultiZarrToZarr.translate() on all of them at once, I get a server disconnected error. As a quick workaround, I tried appending to the reference file in batches, but the append call raised an error.

It seems that coo_map is not working as expected for the append operation.

Additionally, I'm wondering if there's a way to append directly to an empty path, rather than first creating a reference file and then appending to it.

Could you please provide guidance on how to improve this approach?

Screenshot 2024-07-18 at 5 52 41 PM

sreesanjeevkg avatar Jul 18 '24 23:07 sreesanjeevkg

The server disconnected error when I try to access a large number of files:

Screenshot 2024-07-18 at 6 24 20 PM Screenshot 2024-07-18 at 6 24 34 PM

sreesanjeevkg avatar Jul 18 '24 23:07 sreesanjeevkg

The following should fix the first issue:

--- a/kerchunk/combine.py
+++ b/kerchunk/combine.py
@@ -212,7 +212,7 @@ class MultiZarrToZarr:
         )
         mzz.coos = {}
         for var, selector in mzz.coo_map.items():
-            if selector.startswith("cf:") and "M" not in mzz.coo_dtypes.get(var, ""):
+            if isinstance(selector, str) and selector.startswith("cf:") and "M" not in mzz.coo_dtypes.get(var, ""):
                 import cftime
                 import datetime
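
To illustrate why the guard matters: coo_map values are not always strings (the kerchunk docs also allow, among others, lists and compiled regex patterns), and only strings have startswith. A minimal sketch with a made-up coo_map, independent of kerchunk itself:

```python
import re

# Hypothetical coo_map for illustration only; the variable names are made up.
coo_map = {
    "time": "cf:time",                  # string selector
    "step": re.compile(r"step_(\d+)"),  # compiled regex selector
    "member": [0, 1, 2],                # explicit list of values
}


def is_cf_selector(selector):
    """Mirror the patched condition: check the type before using str methods."""
    return isinstance(selector, str) and selector.startswith("cf:")


# Unguarded call, as in the original line: fails on a non-string selector.
try:
    coo_map["step"].startswith("cf:")
    unguarded_error = None
except AttributeError as exc:
    unguarded_error = type(exc).__name__

# Guarded check: non-string selectors are simply skipped.
guarded = {var: is_cf_selector(sel) for var, sel in coo_map.items()}
print(unguarded_error)
print(guarded)
```
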

As for your question: it would be totally reasonable to have append() create the reference set if it doesn't already exist, so that you would not have to have two different calls in your code.
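
Until then, the two-call pattern could look roughly like the sketch below. It assumes that MultiZarrToZarr.append accepts an in-memory reference dict as original_refs and takes the same keyword arguments (concat_dims, coo_map, remote_protocol, ...) as the constructor. The helper names batched and combine_in_batches and the default batch size are made up for illustration, and this has not been run against real data.

```python
import json
from itertools import islice


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def combine_in_batches(ref_files, out_path, batch_size=200, **mzz_kwargs):
    """Build the reference set from the first batch, then append the rest.

    `mzz_kwargs` is passed unchanged to every MultiZarrToZarr call, because
    append() is expected to receive the same arguments as the initial combine.
    """
    # Imported here so that `batched` is usable without kerchunk installed.
    from kerchunk.combine import MultiZarrToZarr

    batches = batched(ref_files, batch_size)
    first = next(batches, None)
    if first is None:
        raise ValueError("ref_files is empty")

    # First call: create the reference set in memory.
    refs = MultiZarrToZarr(first, **mzz_kwargs).translate()

    # Remaining calls: append each batch to the references built so far.
    for batch in batches:
        refs = MultiZarrToZarr.append(
            batch, original_refs=refs, **mzz_kwargs
        ).translate()

    with open(out_path, "w") as f:
        json.dump(refs, f)
    return refs
```

Keeping the references in memory between batches avoids re-reading and re-writing the output file on every step; only the final result is written to out_path.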

For the final issue with ServerDisconnect: this is probably happening during inlining of values. The backend HTTPFileSystem has a few ways to limit the number of concurrent connections allowed. Probably the easiest is to set the following:

fsspec.config.conf["nofiles_gather_batch_size"] = N

where N is a number well below the default of 1280. This setting applies to the current session only (but to all async backends) unless you explicitly save the config.
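
The effect of a smaller batch size can be pictured with plain asyncio: requests are awaited in groups of at most N, so no more than N are in flight at once. The following is only a sketch of that idea with a fake fetch function; it is not fsspec's actual implementation, and gather_in_batches and fake_fetch are made-up names.

```python
import asyncio

# Track how many fake requests run at the same time.
stats = {"in_flight": 0, "peak": 0}


async def fake_fetch(i):
    """Stand-in for one HTTP request; records the concurrency it observes."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0)  # yield control so the other tasks in the batch start
    stats["in_flight"] -= 1
    return i * 2


async def gather_in_batches(coro_factories, batch_size):
    """Await coroutines in consecutive groups of at most `batch_size`."""
    results = []
    for start in range(0, len(coro_factories), batch_size):
        batch = coro_factories[start:start + batch_size]
        results.extend(await asyncio.gather(*(make() for make in batch)))
    return results


# Ten fake requests, at most three in flight at any time.
factories = [lambda i=i: fake_fetch(i) for i in range(10)]
results = asyncio.run(gather_in_batches(factories, batch_size=3))
print(results, stats["peak"])
```
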

martindurant avatar Jul 19 '24 13:07 martindurant

Sure, thanks Martin. Could you open a PR with the changes to the combine method and merge it?

And is there any timeline on the feature request for the append()? When can it be pushed?

Also, let me try the fsspec config as well, for the server requests.

sreesanjeevkg avatar Jul 19 '24 15:07 sreesanjeevkg

any timeline on the feature request for the append()

I'm not sure when I'll get to it, but you can keep pinging me :)

martindurant avatar Jul 19 '24 15:07 martindurant

https://github.com/fsspec/kerchunk/pull/481

martindurant avatar Jul 19 '24 17:07 martindurant