MultiZarrToZarr append method - coo_map not working as expected
I have approximately 4000 kerchunk JSON files in the mentioned directory, all of which need to be combined with MultiZarrToZarr into a single reference file, and all of which require HTTP requests. When I call the MultiZarrToZarr.translate() method on all of them at once, I encounter a server disconnected error. As a quick workaround, I thought of appending to the reference file in batches. However, I came across an error when attempting to append.
It seems that coo_map is not working as expected for the append operation.
Additionally, I'm wondering if there's a way to append directly to an empty path, rather than first creating a reference file and then appending to it.
Could you please provide guidance on how to improve this approach?
The server disconnected error when I try to access a large number of files:
The following should fix the first issue:
--- a/kerchunk/combine.py
+++ b/kerchunk/combine.py
@@ -212,7 +212,7 @@ class MultiZarrToZarr:
)
mzz.coos = {}
for var, selector in mzz.coo_map.items():
- if selector.startswith("cf:") and "M" not in mzz.coo_dtypes.get(var, ""):
+ if isinstance(selector, str) and selector.startswith("cf:") and "M" not in mzz.coo_dtypes.get(var, ""):
import cftime
import datetime
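The guard in the patch matters because coo_map values are not always strings: kerchunk also accepts lists of known values, compiled regular expressions, and callables as selectors, and none of these has a startswith method. The following is a minimal sketch of the check in isolation; the helper name is_cf_selector and the example selectors are hypothetical and are not part of kerchunk.

```python
import re


def is_cf_selector(selector):
    """Return True only for string selectors of the form 'cf:<var>'."""
    # Check the type first: calling .startswith on a list or a compiled
    # regex would raise AttributeError.
    return isinstance(selector, str) and selector.startswith("cf:")


# Hypothetical coo_map with one string selector and two non-string ones.
selectors = {
    "time": "cf:time",
    "level": [1000, 850, 500],
    "member": re.compile(r"member_(\d+)"),
}

flags = {var: is_cf_selector(sel) for var, sel in selectors.items()}
print(flags)  # {'time': True, 'level': False, 'member': False}
```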
As for your question: it would be totally reasonable to have append() create the reference set if it doesn't already exist, so that you would not have to have two different calls in your code.
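For reference, the two-call pattern described in the question (create from a first batch, then append the remaining batches) can be sketched roughly as follows. The helper names batched and combine_in_batches are hypothetical, and the create and append arguments are caller-supplied hooks that would wrap the actual MultiZarrToZarr calls; the sketch only shows the control flow and does not call kerchunk itself.

```python
def batched(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def combine_in_batches(paths, size, create, append):
    """Create the reference set from the first batch, then append the rest.

    `create(batch)` and `append(batch)` are hypothetical hooks: `create`
    would wrap the initial MultiZarrToZarr(...).translate(...) call and
    `append` would wrap the MultiZarrToZarr.append step for later batches.
    """
    for i, batch in enumerate(batched(paths, size)):
        if i == 0:
            create(batch)
        else:
            append(batch)


# Demonstration with stand-in hooks that only record what was called.
calls = []
paths = [f"ref_{i}.json" for i in range(10)]  # placeholder file names
combine_in_batches(
    paths,
    4,
    create=lambda batch: calls.append(("create", len(batch))),
    append=lambda batch: calls.append(("append", len(batch))),
)
print(calls)  # [('create', 4), ('append', 4), ('append', 2)]
```

If append() were changed to create the reference set when it is missing, the i == 0 branch would no longer be needed and every batch could go through the same hook.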
For the final issue with ServerDisconnect: this is probably happening during inlining of values. The backend HTTPFileSystem has a few ways to limit the number of concurrent connections allowed. Probably the easiest is to set the following:
fsspec.config.conf["nofiles_gather_batch_size"] = N
where N is a number well below the default of 1280. This setting applies to the current session only (but to all async backends) unless you explicitly save the config.
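If the lower limit should apply only while the references are being combined, one option is to set the key temporarily and restore the previous state afterwards. The following is a minimal sketch; temporary_setting is a hypothetical helper, not an fsspec function, and it works on any dict-like config object. A plain dict stands in for fsspec.config.conf here so the example runs without fsspec installed.

```python
from contextlib import contextmanager

_MISSING = object()  # sentinel meaning "key was not set before"


@contextmanager
def temporary_setting(conf, key, value):
    """Set conf[key] = value inside the with-block, then restore it."""
    previous = conf.get(key, _MISSING)
    conf[key] = value
    try:
        yield conf
    finally:
        if previous is _MISSING:
            # The key did not exist before, so remove it again.
            conf.pop(key, None)
        else:
            conf[key] = previous


demo_conf = {}  # stand-in for fsspec.config.conf
with temporary_setting(demo_conf, "nofiles_gather_batch_size", 100):
    inside = dict(demo_conf)  # snapshot while the override is active
after = dict(demo_conf)       # snapshot once the block has exited

print(inside)  # {'nofiles_gather_batch_size': 100}
print(after)   # {}
```

With fsspec installed, the same helper would be used as `with temporary_setting(fsspec.config.conf, "nofiles_gather_batch_size", 100): ...`, on the assumption that fsspec reads the key each time it issues a batch of requests rather than once at import time.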
Sure, thanks Martin. Could you open a PR with the changes to the combine method and merge it?
Also, any timeline on the feature request for the append()? When can it be pushed?
I'll also try the fsspec config setting for the server requests.
any timeline on the feature request for the append()
I'm not sure when I'll get to it, but you can keep pinging me :)
https://github.com/fsspec/kerchunk/pull/481