MetPy icon indicating copy to clipboard operation
MetPy copied to clipboard

MetPy's usage of POOCH is frustrating

Open akrherz opened this issue 5 years ago • 22 comments

My blissful world includes running python within environments with no internet access, jailed HTTP web services and python processes with read-only state. When I install MetPy from conda-forge, I manually edit cbook.py to remove all the POOCH usage as I lamely do not want to deal with it and the warnings it will spew to stdout/stderr with the envs above.

With metpy=0.12, it now will auto-load the airport-codes.csv as part of metpy.io.station_data import since metpy.io imports *. For example:

Traceback (most recent call last):
  File "pyWWA/parsers/nexrad3_attr.py", line 8, in <module>
    from metpy.io.nexrad import Level3File
  File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/metpy/io/__init__.py", line 12, in <module>
    from .metar import *  # noqa: F403
  File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/metpy/io/metar.py", line 15, in <module>
    from .station_data import station_info
  File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/metpy/io/station_data.py", line 140, in <module>
    station_info = StationLookup()

So all I want is just from metpy.io.nexrad import Level3File, but am having to download an unrelated station file.

Would you consider just batteries-including these files that are required for MetPy functionality vs testing? Having to set the TEST_DATA_DIR environment variable is not very intuitive just to get a running MetPy in the environments above.

So there I ranted, sorry.

akrherz avatar Feb 11 '20 19:02 akrherz

EEEEK! That's no bueno. That's no rant, that's...super annoying and would definitely cause problems elsewhere in no-internet environments.

My preferred solution, though, would be to make the code that loads the station file more lazy. You shouldn't pay an import price for anything you're not using. Does that sound ok?

dopplershift avatar Feb 13 '20 01:02 dopplershift

Thanks @dopplershift for the gentle response. I think lazy loading would be the simplest and most direct thing to do, but I still wonder about the usage of get_test_data() and the environment variable TEST_DATA_DIR to control this loading of non-test data. For testing, all staticdata files are already provided, right? So pooch doesn't really fetch anything in the case of testing?

~~I dunno, this gets more into design and implementation, but why is pooch even needed here for how it is currently being used?~~

akrherz avatar Feb 13 '20 03:02 akrherz

Well, staticdata is also there for examples. I can see a case to be made that for things needed for metpy features, we should just be shipping the data files needed; but then, do we really want to be shipping the state/county shapefiles?

dopplershift avatar Feb 14 '20 06:02 dopplershift

For me I am going to use your shapefiles to generate Metpy maps. That is something in my case that I want to have access to. My goal will be get any weather data in GIS format from NOAA web site to display the data. As for station information I was also thinking of getting it from Aviation Weather web site as a CSV file. Sometimes we just want a subset of METAR reports not the full dataset. I find it much quicker to retrieve METAR reports from Aviation Weather where no parsing is needed. I find the parsing takes too long to process.

eliteuser26 avatar Feb 14 '20 15:02 eliteuser26

@eliteuser26 For the county/state shapefiles we won't make it any less accessible than it already is. The question is whether they should be part of the install package, or only downloaded (and cached) when a user actually tries to use the feature.

dopplershift avatar Feb 27 '20 05:02 dopplershift

@dopplershift I suppose I can download the shapefiles manually and save it locally. In this case it would only be done once. As for the airport stations information I have been saving this into a MySQL database. The problem with this is that I need to keep it updated. One less thing to download I guess.

eliteuser26 avatar Feb 27 '20 17:02 eliteuser26

Let me clarify: the current situation is that MetPy, through the use of pooch, automatically downloads the shapefiles for you—no manual intervention needed. The only alternative I’m considering is including them in the metpy package files.

Regardless, there will be no manual user intervention needed here. Either MetPy will download the files for you, or MetPy will ship with the files.

dopplershift avatar Feb 27 '20 23:02 dopplershift

@dopplershift If it downloads the shapefiles by default then that would be better. At least it will provide the latest version of the shapefiles.

eliteuser26 avatar Feb 28 '20 00:02 eliteuser26

We just ran into this issue (literally as GitHub has been having issues today)

Random runtime errors due to a hidden external service are no good:

ERROR test/....py - requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://raw.githubusercontent.com/Unidata/MetPy/v0.12.0/staticdata/sfstns.tbl
ERROR test/....py - requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://raw.githubusercontent.com/Unidata/MetPy/v0.12.0/staticdata/sfstns.tbl
ERROR test/....py - requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://raw.githubusercontent.com/Unidata/MetPy/v0.12.0/staticdata/sfstns.tbl

I'd much rather have them bundled, storage is cheap. +1 from me.

four43 avatar Apr 21 '20 16:04 four43

sigh Yeah, just encountered it myself, though it seems to be resolved or at least working somewhat for me. But yes, we'll have to do something about this so that regardless it's not a blocker to using basic MetPy functionality.

dopplershift avatar Apr 21 '20 16:04 dopplershift

The only alternative I’m considering is including them in the metpy package files.

A third option would be to bundle this data in an optional, data-only package. This way, the data could be updated on a separate timeline from the main package, and explicitly downloaded before moving offline.

minchinweb avatar Oct 04 '20 03:10 minchinweb

So while storage is cheap, we don't want people to have to download a 100M package for MetPy alone--packages like that are the ones I swear at when I'm teaching a workshop and stuck on shotty wifi.

@MinchinWeb Yeah, I'm coming around to that as an option. Just want to do this in a way that doesn't add too much testing burden on us.

dopplershift avatar Oct 05 '20 21:10 dopplershift

So the data files used by features (not tests/examples) in MetPy:

  • US Counties shapefiles: 22M
  • US States shapesfiles: 6.3M
  • Station tables: 7M

So we're looking at 35M of data files right now on disk (18M compressed). Package sizes from anaconda.org:

  • Matplotlib: ~7M (23M uncompressed)
  • Numpy: ~5M (25M uncompressed)
  • Scipy: ~21M (65M uncompressed)
  • MKL: ~150-200M
  • Qt: ~100M (320M uncompressed)

So this would increase our package size to something comparable with downloading SciPy, which seems big (we're currently 720K)...but it would also eliminate some potential unexpected network requirements. It's probably a mistake that we use a function called get_test_data to implement features in the code base. Anyone have arguments for bundled together vs. separate package? The separate package approach doesn't seem to be as frequently done on PyPI, but does have some use on Anaconda.

dopplershift avatar May 26 '21 22:05 dopplershift

I recently ran into a similar frustration. To piggyback on what @akrherz originally said, I, too, run in environments that can be a bit more restricted than typical users. A current use case of mine includes using MetPy as part of an embedded flask application in Apache. I create virtual environments using venv. I think in normal cases, the files are downloaded to a share/ directory that is writable. For my case, and what I think is related to it being an Apache-embedded process, I get the following error:

PermissionError: [Errno 13] Permission denied: '/usr/share/httpd/.cache' | Pooch could not create data cache folder '/usr/share/httpd/.cache/metpy/v1.0.1'. Will not be able to download data files.

If it would write to the share/ folder in the virtual environment it would be fine, but it instead tries to write to locked down system folder. Ultimately, I do not think I even use the file, but my application chokes because it is not present and cannot be downloaded and written to disk. Setting TEST_DATA_DIR is not really an optimal solution in my case either, though I may have to use it to at least run in the meantime. I actually looked to see in the staticdata was placed in its own package as is done from time to time.

nawendt avatar Sep 19 '22 20:09 nawendt

Just to be clear, on Linux it's trying to write to $HOME/.cache, so the problem there is that $HOME for Apache is set to /user/share/httpd.

@nawendt What files are you encountering problems with it downloading?

dopplershift avatar Sep 19 '22 22:09 dopplershift

The file was sfstns.tbl which seems to be brought in because I import from metpy.io._tools import IOBuffer, NamedStruct.

nawendt avatar Sep 19 '22 23:09 nawendt

The file was sfstns.tbl which seems to be brought in

@nawendt you sure that happened with current MetPy? I thought that was fixed with #1840

akrherz avatar Sep 20 '22 02:09 akrherz

Well, dang. I feel silly. I was on old version because of it being RHEL7. My bad.

nawendt avatar Sep 20 '22 02:09 nawendt

No worries. I do feel like this is still a problem overall, I'm just not satisfied with any of the solutions so far.

dopplershift avatar Sep 20 '22 21:09 dopplershift