Create "Optimising memory use" notebook in `Frequently_used_code`
We should create a Frequently_used_code notebook that documents some useful techniques for optimising memory use when analysing DEA data.
@Kirill888 has lots of useful tools for doing that here: https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/_masking.py
E.g. you can use something like fmask_to_bool to produce a boolean mask from fmask flags:
https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/_masking.py#L517
Then pass that to erase_bad to set those "bad" values to the data's nodata value (still in the original data type):
https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/_masking.py#L97
Then finally convert it to floats at the end using to_float (this is the first time the nodata values will be set to NaN):
https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/_masking.py#L204
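The three steps above can be sketched roughly as follows. This is plain NumPy standing in for fmask_to_bool, erase_bad and to_float, not the odc.algo functions themselves. The fmask codes (2 = cloud, 3 = shadow) and the nodata value of -999 are assumptions for illustration only.

```python
import numpy as np

NODATA = -999  # assumed nodata value for an int16 band

# Toy int16 band and a matching fmask layer
# (assumed codes: 1 = valid, 2 = cloud, 3 = shadow)
band = np.array([[120, 340, 560], [780, 910, 130]], dtype="int16")
fmask = np.array([[1, 2, 1], [3, 1, 1]], dtype="uint8")

# Step 1: boolean mask of "bad" pixels (the role fmask_to_bool plays)
bad = np.isin(fmask, [2, 3])

# Step 2: overwrite bad pixels with nodata, staying in int16
# (the role erase_bad plays)
erased = band.copy()
erased[bad] = NODATA

# Step 3: only now convert to float and swap nodata for NaN
# (the role to_float plays)
as_float = erased.astype("float32")
as_float[erased == NODATA] = np.nan

print(erased.dtype, as_float.dtype)
print(as_float)
```

The data only becomes float in the last step, so every earlier intermediate costs 2 bytes per pixel rather than 4 or 8.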
The idea behind those funcs is to keep things as dask arrays and int datatypes until the last possible moment so that memory is kept better under control. I'm not entirely sure, though, whether there are options there for computing things like means/medians etc. on the data in its original data type (taking into account the custom nodata values), but this would also be good to include as these are very common workflows.
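To give a feel for why staying in an integer dtype matters, here is a small sketch comparing the memory footprint of the same array as int16, float32 and float64. The array shape is an arbitrary assumption (10 time steps of 1000 x 1000 pixels), not a DEA product size.

```python
import numpy as np

# Hypothetical stack: 10 time steps of 1000 x 1000 pixels
shape = (10, 1000, 1000)

as_int16 = np.zeros(shape, dtype="int16")
as_float32 = as_int16.astype("float32")
as_float64 = as_int16.astype("float64")

# Report the in-memory size of each representation
for name, arr in [
    ("int16", as_int16),
    ("float32", as_float32),
    ("float64", as_float64),
]:
    print(f"{name}: {arr.nbytes / 1e6:.0f} MB")
```

The float64 copy is four times the size of the int16 original, so delaying the conversion keeps the largest arrays out of memory for as long as possible.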
From Kirill:
It's a bit messy (some internals are exposed that should not be, and docs quality is not uniform). Probably the best place to get an overview of what's available is here: https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/__init__.py#L19 There are things like enum_to_bool, to_float, from_float and apply_numexpr.
There are no nodata-aware reduction functions. Maybe these are supported by masked arrays in numpy? But really the bigger problem is not so much representation and handling of missing values; the bigger problem is that integer math can be hard to reason about and implement correctly (without silent overflows). So I prefer to convert to float, then use the nan{mean,sum,...} family of functions, followed by conversion back to integer.
to_float is also useful for plotting, as NaNs are automatically transparent, whereas nodata values are not.
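A minimal sketch of the points above, again in plain NumPy, not odc.algo: it shows int16 addition wrapping around silently, then the "convert to float, reduce with a nan-aware function, convert back to integer" pattern, with a NumPy masked array as one possible alternative. The nodata value of -999 and the toy data are assumptions.

```python
import numpy as np

NODATA = -999  # assumed nodata value

# Integer math can overflow silently: 30000 + 30000 does not fit in int16
a = np.array([30000], dtype="int16")
b = np.array([30000], dtype="int16")
wrapped = a + b
print(wrapped)  # wraps around to a negative number, with no warning

# Toy (time, y, x) stack with some nodata pixels
stack = np.array(
    [
        [[100, 200], [NODATA, 400]],
        [[300, NODATA], [500, 600]],
        [[500, 400], [300, NODATA]],
    ],
    dtype="int16",
)

# Convert to float, replacing nodata with NaN
as_float = stack.astype("float32")
as_float[stack == NODATA] = np.nan

# Reduce over time with a nan-aware function
mean_float = np.nanmean(as_float, axis=0)

# Convert back to integer, restoring nodata wherever the result is NaN
mean_int = np.where(
    np.isnan(mean_float), NODATA, np.round(mean_float)
).astype("int16")
print(mean_int)

# Alternative: a masked array skips nodata values in the reduction
masked_mean = np.ma.masked_equal(stack, NODATA).mean(axis=0)
print(masked_mean)
```

Only the reduction step holds a float copy here; the result goes back to int16 with the same nodata convention as the input.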
A step-by-step example of using some of these tools is available in the Cloud and Pixel Quality Masking notebook in deafrica-sandbox-notebooks, including the mask_cleanup function, which is pretty handy. It doesn't explicitly reference memory optimisation, but might provide some boilerplate code for starting this notebook.