PyHa
Figuring out how to reduce memory burden when processing large datasets
Currently, PyHa appears to crash when generating automated labels for particularly large datasets. I suspect this is because the automated-label dataframe grows too large to fit in memory.
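One quick way to test that suspicion is to measure the dataframe's actual footprint before it crashes. A minimal sketch, assuming we can get a reference to the combined label dataframe (the column names below are stand-ins, not PyHa's real schema):

```python
import pandas as pd

def report_memory(df: pd.DataFrame) -> None:
    """Print the dataframe's deep memory footprint in megabytes."""
    mb = df.memory_usage(deep=True).sum() / 1e6
    print(f"{len(df)} rows, {mb:.1f} MB in memory")

# Tiny stand-in dataframe just to show the call; column names are assumptions.
report_memory(pd.DataFrame({"OFFSET": [0.0, 3.0], "DURATION": [3.0, 3.0]}))
```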
Potential fixes:
- Downcast the stored floats to a smaller dtype such as float32 (or float16 where the precision is acceptable). By default, Pandas uses float64, which takes 8 bytes per value (see the dtype sketch after this list).
- Try Python's built-in csv library. We could build each clip's dataframe individually and append its rows to a master CSV file as we go, which would hopefully shift the burden from memory onto storage (see the appending sketch after this list).
- Look into parallelization with Dask (this may speed things up, but I am skeptical that it addresses the memory problems; see the Dask sketch after this list).
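For the dtype idea, a minimal sketch of downcasting every float64 column to float32, which roughly halves their memory use. The column names are assumptions for illustration, not PyHa's actual output schema:

```python
import pandas as pd

def downcast_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Convert every float64 column to float32 to roughly halve its memory footprint."""
    float_cols = df.select_dtypes(include="float64").columns
    df[float_cols] = df[float_cols].astype("float32")
    return df

# Hypothetical label dataframe; the real columns may differ.
labels = pd.DataFrame({
    "OFFSET": [0.0, 3.0, 6.0],
    "DURATION": [3.0, 3.0, 3.0],
})
labels = downcast_floats(labels)
print(labels.dtypes)  # both columns should now be float32
```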
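For the CSV idea, a sketch of appending one clip's rows at a time to a master file with the built-in csv module, so only a single clip's results are ever held in memory. The file path, field names, and row values are hypothetical:

```python
import csv
import os

MASTER_CSV = "automated_labels.csv"                          # hypothetical output path
FIELDNAMES = ["IN FILE", "OFFSET", "DURATION", "CONFIDENCE"]  # assumed columns

def append_clip_labels(rows):
    """Append one clip's label rows to the master CSV, writing the header only once."""
    write_header = not os.path.exists(MASTER_CSV)
    with open(MASTER_CSV, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)

# Hypothetical per-clip result; in practice this would come from the detector.
append_clip_labels([
    {"IN FILE": "clip_0001.wav", "OFFSET": 0.0, "DURATION": 3.0, "CONFIDENCE": 0.91},
    {"IN FILE": "clip_0001.wav", "OFFSET": 3.0, "DURATION": 3.0, "CONFIDENCE": 0.42},
])
```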
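And for the Dask idea, a rough sketch using dask.delayed to run the per-clip labeling in parallel. The function body and file list are placeholders, not PyHa code; note the final concatenation still materializes everything in memory, which is the reason for the skepticism above:

```python
import dask
import pandas as pd

@dask.delayed
def label_clip(path):
    """Stand-in for the per-clip labeling step; the real version would run the detector."""
    # ... run the model on `path` and collect its detections ...
    return pd.DataFrame({"IN FILE": [path], "OFFSET": [0.0], "DURATION": [3.0]})

clip_paths = ["clip_0001.wav", "clip_0002.wav"]  # hypothetical input list
results = dask.compute(*[label_clip(p) for p in clip_paths])

# Concatenating still builds the full dataframe in memory, so this helps speed
# more than it helps the memory ceiling.
all_labels = pd.concat(results, ignore_index=True)
```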
How large are said datasets, and what model was used?