
Create "Optimising memory use" notebook in `Frequently_used_code`

robbibt opened this issue 4 years ago • 2 comments

We should create a Frequently_used_code notebook that documents some useful techniques for optimising memory use when analysing DEA data.

@Kirill888 has lots of useful tools for doing that here: https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/_masking.py

E.g. you can use something like `fmask_to_bool` to produce a boolean mask from fmask flags: https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/_masking.py#L517

Then pass that to `erase_bad` to set those "bad" values to the data's nodata value (still in the original data type): https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/_masking.py#L97

Then finally convert to floats at the end using `to_float` (this is the first time the nodata values are set to NaN): https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/_masking.py#L204

The idea behind those functions is to keep things as dask arrays and integer dtypes until the last possible moment, so that memory use stays under control. I'm not entirely sure, though, whether there are options for computing things like means/medians on the data in its original data type (taking the custom nodata values into account), but this would also be good to include, as these are very common workflows.
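The three-step pattern above can be sketched with plain numpy (this is a standalone illustration of the idea, not odc.algo's actual implementation; the names `fmask_to_bool`, `erase_bad` and `to_float` refer to the linked source, and the fmask category values here are made up):

```python
import numpy as np

# A small int16 "measurement" with a custom nodata value, as loaded from the datacube
NODATA = -999
data = np.array([[100, 200], [300, NODATA]], dtype="int16")

# Step 1: build a boolean "bad pixel" mask from categorical flags
# (stand-in for fmask_to_bool; pretend 2 == cloud in this toy example)
fmask = np.array([[1, 2], [1, 1]], dtype="uint8")
bad = fmask == 2

# Step 2: erase bad pixels to nodata while staying in int16 (erase_bad's job)
erased = np.where(bad, np.int16(NODATA), data)

# Step 3: convert to float32 only at the end, mapping nodata -> NaN (to_float's job)
as_float = erased.astype("float32")
as_float[erased == NODATA] = np.nan
```

Until the final step everything is a compact 2-byte integer array; the odc.algo versions apply the same logic lazily over dask chunks.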

robbibt avatar Sep 28 '21 01:09 robbibt

From Kirill:

It's a bit messy (some internals are exposed that should not be, and docs quality is not uniform). Probably the best place to get an overview of what's available is here: https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/__init__.py#L19 There are things like

  • `enum_to_bool`
  • `to_float` / `from_float`
  • `apply_numexpr`

There are no nodata-aware reduction functions. Maybe these are supported by masked arrays in numpy? But really, the bigger problem is not so much the representation and handling of missing values; the bigger problem is that integer math can be hard to reason about and implement correctly (without silent overflows). So I prefer to convert to float, then use the `nan{mean,sum,...}` family of functions, followed by conversion back to integer. `to_float` is also useful for plotting, as NaNs are automatically transparent whereas nodata values are not.
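The float-round-trip Kirill describes can be sketched in plain numpy (a minimal illustration of the pattern, not the odc.algo `to_float`/`from_float` code itself):

```python
import numpy as np

NODATA = -999
ints = np.array([[100, NODATA], [300, 500]], dtype="int16")

# "to_float": convert to float32, mapping nodata -> NaN
floats = ints.astype("float32")
floats[ints == NODATA] = np.nan

# Nodata-aware reduction via the nan* family, avoiding tricky integer math
mean = np.nanmean(floats, axis=0)

# "from_float": convert back to int, mapping NaN -> nodata
back = np.where(np.isnan(mean), NODATA, np.round(mean)).astype("int16")
```

NaN pixels are simply skipped by `nanmean`, so custom nodata values never pollute the statistic, and the result can be stored compactly as int16 again at the end.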

robbibt avatar Sep 28 '21 02:09 robbibt

A step-by-step example of using some of these tools is available in the Cloud and Pixel Quality Masking notebook in deafrica-sandbox-notebooks, including the `mask_cleanup` function, which is pretty handy. It doesn't explicitly reference memory optimisation, but it might provide some boilerplate code for starting this notebook.
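For context, `mask_cleanup` applies morphological filters to a boolean cloud mask (e.g. dilating it to buffer cloud edges). A dependency-free sketch of the dilation part, written from scratch rather than taken from odc.algo:

```python
import numpy as np

def dilate(mask: np.ndarray, radius: int = 1) -> np.ndarray:
    """Grow True regions of a 2D boolean mask by `radius` pixels
    (square structuring element), akin to a cloud-buffering step."""
    h, w = mask.shape
    # Pad with False so dilation near the edges behaves sensibly
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    # OR together every shifted copy of the mask within the window
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

# A single cloudy pixel grows into a 3x3 block with radius=1
mask = np.zeros((5, 5), dtype=bool)
mask[2, 2] = True
buffered = dilate(mask, radius=1)
```

The odc.algo version works lazily on dask arrays and can chain several such filters; this sketch only shows what one dilation does to the mask.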

cbur24 avatar Oct 07 '21 04:10 cbur24