
Dask delayed operations and Pint Quantity

huard opened this issue on Jan 21, 2019 · 8 comments

I couldn't tell from the documentation whether MetPy supports delayed operations with Dask. The code for unit conversion seems to access `_data_array.values`, which suggests that the entire array is loaded into memory. We have multi-GB files that require unit conversion, and ideally the converted DataArray would be lazily evaluated.
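The eager-load behavior described above can be reproduced with plain Dask, independent of MetPy (a minimal sketch): arithmetic on a chunked array stays lazy, while pulling concrete data out (here via `np.asarray`, which is essentially what a `.values` access does on a dask-backed DataArray) forces the whole array to be computed.

```python
import numpy as np
import dask.array as da

# A chunked, lazily evaluated array.
lazy = da.ones((1000, 1000), chunks=(250, 250))

# Arithmetic stays lazy: no chunk has been computed yet.
scaled = lazy * 2.0

# Asking for concrete values (as a `.values` access does on a
# dask-backed DataArray) materializes the full array in memory.
concrete = np.asarray(scaled)
```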

huard avatar Jan 21 '19 20:01 huard

I haven't taken a look with dask yet, but your initial analysis seems about right unfortunately.

This might be out of our control due to pint, but is definitely something on my todo list to take a look at. I can see it also being another reason to adjust how we handle the unit problem internally.

dopplershift avatar Jan 22 '19 01:01 dopplershift

Upon attempting this with RH calculations over a large dataset in xarray (with Dask enabled), I can confirm that calling MetPy does load the arrays into memory. Dask still allows parallel computation on chunks, which keeps RAM usage from running away, and performance is acceptable, but the lazy evaluation stops at the point of calling `metpy.calc`. The temporary workaround is to do all subsetting operations before the MetPy computations.

tjwixtrom avatar Jul 14 '19 15:07 tjwixtrom

We've worked around this issue by calling `units.convert` instead of using the `.to` method.

huard avatar Jul 22 '19 14:07 huard

So you're saying `.to()` breaks the parallelism but `units.convert()` doesn't?

dopplershift avatar Jul 22 '19 19:07 dopplershift

It's probably not that straightforward, and it's been a while, but I think that when using `.to` we had to copy the input array, then take the output of `.to` and change the values in place. I don't quite remember why we needed this copy, but it was the culprit, not the conversion itself (see https://github.com/Ouranosinc/xclim/pull/156/files).

huard avatar Jul 22 '19 20:07 huard

Just an update from upstream: Pint v0.10 (to be released in the next week or so) will have preliminary support for wrapping Dask arrays. However, https://github.com/dask/dask/issues/4583 is holding up full compatibility and the ability to put together a robust set of integration tests, so there will likely be issues remaining (such as non-commutativity and Dask mistakenly wrapping Pint).
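As a rough illustration of the wrapping described above (assuming a Pint version with the preliminary Dask support, i.e. v0.10 or later), a `Quantity` built around a Dask array keeps a lazy magnitude through a multiplicative unit conversion; only an explicit compute materializes values. This is a sketch of the intended behavior, not a guarantee for every operation:

```python
import dask.array as da
import pint

ureg = pint.UnitRegistry()

# Quantity-wraps-Dask order; the reverse order is where the
# upstream non-commutativity issues noted above can bite.
q = ureg.Quantity(da.ones((100, 100), chunks=(50, 50)), 'meter')

# A multiplicative conversion just scales chunks, so the
# magnitude should remain a lazy dask array.
q_km = q.to('kilometer')

# Materialize explicitly.
result = q_km.magnitude.compute()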

So, from MetPy's point-of-view, I think it would be good to start some early experiments with Dask support in calculations, but it won't be ready for v1.0?

jthielen avatar Dec 27 '19 20:12 jthielen

That seems about right. I think overall full support will be something we look at beyond the GEMPAK work.

dopplershift avatar Dec 27 '19 22:12 dopplershift

Leaving a note here for future Dask compatibility work: the window smoother added in #1223 explicitly casts to `ndarray`, which prevents Dask compatibility for that smoother (and dependent smoothers like circular, rectangular, and n-point) (see https://github.com/Unidata/MetPy/pull/1223#discussion_r366079624).

jthielen avatar Jan 14 '20 00:01 jthielen