Download issue - Land Cover data from CDS

Open gritk opened this issue 1 year ago • 4 comments

What happened?

Downloading Land Cover data from CDS does not appear to be possible: the request fails with the error shown in the log output below. The data is needed for the development of tutorials and use cases foreseen in the C3S-LOT5 contract.

What are the steps to reproduce the bug?

If you are using the pre-downloaded data then please set DOWNLOAD_FROM_CDS to False and set the LOCAL_DATA_DIR to where you stored the data.

DOWNLOAD_FROM_CDS = True
LOCAL_DATA_DIR = "../data/"

if DOWNLOAD_FROM_CDS:
    lc_data = ek.data.from_source(
        "cds",
        'satellite-land-cover',
        {
            'year': '2022',
            'version': 'v2.1.1',
            'variable': 'all',
            'format': 'zip',
        }
    )
    lc_data.save(f"{LOCAL_DATA_DIR}/lc_2022.zip")
else:
    lc_data = ek.data.from_source("file", f"{LOCAL_DATA_DIR}/lc_2022.zip")

Version

Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] Type 'copyright', 'credits' or 'license' for more information IPython 8.15.0 -- An enhanced Interactive Python. Type '?' for help.

Platform (OS and architecture)

Windows 11 Pro

Relevant log output

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[4], line 7
      4 LOCAL_DATA_DIR = "../data/"
      6 if DOWNLOAD_FROM_CDS:
----> 7     lc_data = ek.data.from_source(
      8         "cds",
      9     'satellite-land-cover',
     10     {
     11         'year': '2022',
     12         'version': 'v2.1.1',
     13         'variable': 'all',
     14         'format': 'zip',
     15     }
     16     )
     19     # # This command was used to save the data files in our managed storage,
     20     # #  they are not required for the notebook to run, and your computer will cache the 
     21     # #  results so you don't have to download again
     22     lc_data.save(f"{LOCAL_DATA_DIR}/lc_2022.zip")

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/__init__.py:143, in from_source(name, lazily, *args, **kwargs)
    140     return from_source_lazily(name, *args, **kwargs)
    142 prev = None
--> 143 src = get_source(name, *args, **kwargs)
    144 while src is not prev:
    145     prev = src

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/__init__.py:124, in SourceMaker.__call__(self, name, *args, **kwargs)
    117 klass = find_plugin(os.path.dirname(__file__), name, loader)
    119 # if os.environ.get("FIEDLIST_TESTING_ENABLE_MOCKUP_SOURCE", False):
    120 #     from earthkit.data.mockup import SourceMockup
    121 
    122 #     klass = SourceMockup
--> 124 source = klass(*args, **kwargs)
    126 if getattr(source, "name", None) is None:
    127     source.name = name

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/core/__init__.py:21, in MetaBase.__call__(cls, *args, **kwargs)
     19 obj = cls.__new__(cls, *args, **kwargs)
     20 args, kwargs = cls.patch(obj, *args, **kwargs)
---> 21 obj.__init__(*args, **kwargs)
     22 return obj

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/cds.py:92, in CdsRetriever.__init__(self, dataset, *args, **kwargs)
     89 nthreads = min(self.settings("number-of-download-threads"), len(requests))
     91 if nthreads < 2:
---> 92     self.path = [self._retrieve(dataset, r) for r in requests]
     93 else:
     94     with SoftThreadPool(nthreads=nthreads) as pool:

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/cds.py:92, in <listcomp>(.0)
     89 nthreads = min(self.settings("number-of-download-threads"), len(requests))
     91 if nthreads < 2:
---> 92     self.path = [self._retrieve(dataset, r) for r in requests]
     93 else:
     94     with SoftThreadPool(nthreads=nthreads) as pool:

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/cds.py:104, in CdsRetriever._retrieve(self, dataset, request)
    101 def retrieve(target, args):
    102     self.client().retrieve(args[0], args[1], target)
--> 104 return self.cache_file(
    105     retrieve,
    106     (dataset, request),
    107     extension=EXTENSIONS.get(request.get("format"), ".cache"),
    108 )

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/__init__.py:62, in Source.cache_file(self, create, args, **kwargs)
     59 if owner is None:
     60     owner = re.sub(r"(?!^)([A-Z]+)", r"-\1", self.__class__.__name__).lower()
---> 62 return cache_file(owner, create, args, **kwargs)

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/core/caching.py:916, in cache_file(owner, create, args, hash_extra, extension, force, replace)
    912 with FileLock(lock):
    913     if not os.path.exists(
    914         path
    915     ):  # Check again, another thread/process may have created the file
--> 916         owner_data = create(path + ".tmp", args)
    917         os.rename(path + ".tmp", path)
    918         CACHE.update_entry(path, owner_data)

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/cds.py:102, in CdsRetriever._retrieve.<locals>.retrieve(target, args)
    101 def retrieve(target, args):
--> 102     self.client().retrieve(args[0], args[1], target)

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/cdsapi/api.py:364, in Client.retrieve(self, name, request, target)
    363 def retrieve(self, name, request, target=None):
--> 364     result = self._api("%s/resources/%s" % (self.url, name), request, "POST")
    365     if target is not None:
    366         result.download(target)

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/cdsapi/api.py:519, in Client._api(self, url, request, method)
    517             break
    518         self.error("  %s", n)
--> 519     raise Exception(
    520         "%s. %s."
    521         % (reply["error"].get("message"), reply["error"].get("reason"))
    522     )
    524 raise Exception("Unknown API state [%s]" % (reply["state"],))

Exception: the request you have submitted is not valid. Request too large. Requesting 372 items, limit is 10.

Accompanying data

No response

Organisation

No response

gritk avatar Mar 08 '24 11:03 gritk

Thank you for reporting this issue. Please can you provide me with the earthkit-data and cdsapi versions you are using?

In my environment the actual CDS retrieval works; however, from_source crashes at a later stage when it tries to parse the NetCDF file that the zip file contains (see issue https://github.com/ecmwf/earthkit-data/issues/337).

However, even once that is fixed,

lc_data.save(f"{LOCAL_DATA_DIR}/lc_2022.zip")

would not work properly because it would only create a NetCDF file called "lc_2022.zip". This is because lc_data represents a NetCDF file and is decoupled from the zip that originally contained it.
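
To illustrate the point about the misleading file name, a file's actual container format can be checked from its content rather than its extension. The following is only a minimal sketch using the Python standard library; it is not part of earthkit-data, and `sniff_format` is a hypothetical helper name. It assumes the HDF5 signature sits at the start of the file, which is typically the case for NetCDF-4 files.

```python
import zipfile

# Leading bytes ("magic numbers") of the NetCDF container formats.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # NetCDF-4 files are HDF5 containers
NETCDF_CLASSIC_MAGIC = b"CDF"      # NetCDF classic and 64-bit offset variants


def sniff_format(path):
    """Guess the container format from the file content, ignoring the extension."""
    # zipfile.is_zipfile looks for the zip end-of-central-directory record.
    if zipfile.is_zipfile(path):
        return "zip"
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(HDF5_MAGIC):
        return "netcdf4/hdf5"
    if head.startswith(NETCDF_CLASSIC_MAGIC):
        return "netcdf-classic"
    return "unknown"
```

Calling `sniff_format` on a file written by `lc_data.save(...)` would show whether it is really a zip archive or a NetCDF file that merely carries a `.zip` name.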

Unfortunately, there is currently no way in earthkit-data to retrieve data into a user-specified target file without parsing/interpreting the downloaded file(s). So it cannot be used as a simple file retriever! (See issue: https://github.com/ecmwf/earthkit-data/issues/338)

So there are a couple of issues here that we need to sort out before your use case can work. I will let you know when these features are available.

sandorkertesz avatar Mar 08 '24 13:03 sandorkertesz

Thank you for the prompt reply!

earthkit-data version: '0.1.1.dev40+g1aaf922'
cdsapi version: 0.6.1

Since the retrieval works for you, what do I need to change to download at least one file in its original format?

lc_data = ek.data.from_source(
    "cds",
    'satellite-land-cover',
    {
        'year': '2022',
        'version': 'v2.1.1',
        'variable': 'all',
    }
)
lc_data.save(f"{LOCAL_DATA_DIR}/")

gritk avatar Mar 08 '24 16:03 gritk

Thanks!

I noticed that your download error might be related to your permissions to access the CDS:

Exception: the request you have submitted is not valid. Request too large. Requesting 372 items, limit is 10.

You can check this easily by running the cdsapi code posted below. If it produces the same error, the problem is definitely not on the earthkit side.
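
One way to make that check explicit is to run the retrieval through a small wrapper that recognises the item-limit message. This is only a sketch: `parse_request_limit` and `run_and_classify` are hypothetical helper names, the pattern is taken from the error text quoted above, and catching a plain `Exception` reflects what the traceback shows for cdsapi 0.6.1 (other versions may raise something else).

```python
import re

# Pattern for the limit message seen in the reported error text.
_LIMIT_RE = re.compile(r"Requesting (\d+) items, limit is (\d+)")


def parse_request_limit(message):
    """Return (requested, limit) if the message reports an item limit, else None."""
    match = _LIMIT_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def run_and_classify(retrieve):
    """Call retrieve() and report how it ended as a (status, detail) tuple."""
    try:
        retrieve()
    except Exception as exc:  # the traceback shows a plain Exception being raised
        limits = parse_request_limit(str(exc))
        if limits is not None:
            return "limit-exceeded", limits
        return "other-error", str(exc)
    return "ok", None
```

Passing the same request once through cdsapi and once through earthkit-data (each wrapped in a zero-argument function or lambda) and comparing the two results would show whether both paths hit the same limit.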

I also noticed that your earthkit-data version is very old. The latest one available is 0.5.6, and I suggest you upgrade to it. However, it is not yet able to download the zip file the way your code intends. For that purpose I recommend using cdsapi like this:

import cdsapi

cds = cdsapi.Client()
cds.retrieve(
    'satellite-land-cover',
    {
        'year': '2022',
        'version': 'v2.1.1',
        'variable': 'all',
        'format': 'zip',
    },
    'download.zip',
)
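
Once 'download.zip' exists, the NetCDF file(s) inside can be unpacked with the standard library. The sketch below assumes the archive members of interest end in `.nc`, which has not been verified for this dataset; `extract_netcdf_members` and the output directory are hypothetical names chosen for illustration.

```python
import zipfile
from pathlib import Path


def extract_netcdf_members(zip_path, out_dir):
    """Extract every .nc member of zip_path into out_dir and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Assumption: the data files carry a .nc extension.
            if name.lower().endswith(".nc"):
                extracted.append(Path(archive.extract(name, path=out_dir)))
    return extracted
```

For example, `extract_netcdf_members("download.zip", "../data/lc_2022")` would return the paths of the extracted files, which can then be opened with whatever NetCDF reader is preferred.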

sandorkertesz avatar Mar 12 '24 14:03 sandorkertesz

Thank you very much - now it works!

gritk avatar Mar 12 '24 15:03 gritk