
Serialisation of decompositions

Open yngvem opened this issue 4 years ago • 1 comments

Decomposing large tensors can be time-consuming, and it would therefore be useful to have an easy-to-use interface for storing these decompositions to disc. I am happy to work on this once we decide the API.

Possible file formats

I have had a good experience working with the Python binding for the HDF5 format (h5py) and can recommend that. Alternatively, we can follow xarray and use the NetCDF format. SciPy has bindings for NetCDF v1 and v2; however, these are legacy formats. The current NetCDF standard is compatible with HDF5, and there are two separate Python bindings: netCDF4 and h5netcdf. The former provides bindings for the NetCDF C-library, which also depends on the HDF5 C-library, while the latter uses h5py.

I know from experience that h5py is a very nice library to work with. It is well documented and it makes it very easy to compress the data to save disc space, but I'm happy to use NetCDF too.
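To illustrate the compression point, here is a minimal sketch, assuming h5py and NumPy are installed. The helper names `write_compressed` and `read_array` and the dataset name `factor` are made up for this example; `compression` and `compression_opts` are the keyword arguments h5py's `create_dataset` accepts.

```python
import os
import tempfile

import h5py
import numpy as np


def write_compressed(path, name, array, compression="gzip", compression_opts=4):
    """Write one array to an HDF5 file as a gzip-compressed dataset (hypothetical helper)."""
    with h5py.File(path, "w") as h5:
        h5.create_dataset(
            name,
            data=array,
            compression=compression,            # compression filter, e.g. "gzip"
            compression_opts=compression_opts,  # gzip level, 0 (none) to 9 (max)
        )


def read_array(path, name):
    """Read a dataset back into memory as a NumPy array (hypothetical helper)."""
    with h5py.File(path, "r") as h5:
        return h5[name][()]


# Round trip for a single factor matrix
factor = np.zeros((1000, 10))
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "factor.h5")
    write_compressed(path, "factor", factor)
    restored = read_array(path, "factor")
```

Decompression is transparent on read, so the loading code does not need to know which filter was used.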

API draft

Here is a draft for the API:

import h5py


def store_DECOMPOSITION_TYPE(decomposition, path, internal_path="/", compression_opts=None, compression_args=None):
    # Check if file exists and handle collisions
    with h5py.File(path, "a") as h5:
        # Check if internal path clashes and handle collisions;
        # require_group reuses an existing group (the root "/" always exists)
        group = h5.require_group(internal_path)
        group.attrs["decomposition_type"] = "DECOMPOSITION_TYPE"
        # Add additional attributes such as the number of modes to the attrs field
        # Store the decomposition

def load_DECOMPOSITION_TYPE(path, internal_path="/"):
    with h5py.File(path, "r") as h5:
        # Check if internal path exists
        group = h5[internal_path]
        stored_type = group.attrs["decomposition_type"]
        if stored_type != "DECOMPOSITION_TYPE":
            raise ValueError(f"The HDF5 file contains a {stored_type} decomposition, not a DECOMPOSITION_TYPE.")
        # Load the decomposition
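To make the draft above more concrete, here is one way it could look for a CP decomposition. This is only a sketch under assumptions: the decomposition is passed as a plain weights vector and a list of NumPy factor matrices rather than a TensorLy object, and the names `store_cp`, `load_cp`, `n_modes`, `weights` and `factor_{mode}` are hypothetical, not an agreed layout.

```python
import os
import tempfile

import h5py
import numpy as np


def store_cp(weights, factors, path, internal_path="cp", compression="gzip"):
    """Store CP weights and factor matrices in an HDF5 group (hypothetical layout)."""
    with h5py.File(path, "a") as h5:
        # Refuse to overwrite an existing group instead of silently clobbering it
        if internal_path in h5:
            raise ValueError(f"{internal_path!r} already exists in {path!r}")
        group = h5.create_group(internal_path)
        group.attrs["decomposition_type"] = "CP"
        group.attrs["n_modes"] = len(factors)
        group.create_dataset("weights", data=weights)
        for mode, factor in enumerate(factors):
            group.create_dataset(f"factor_{mode}", data=factor, compression=compression)


def load_cp(path, internal_path="cp"):
    """Load CP weights and factor matrices written by store_cp."""
    with h5py.File(path, "r") as h5:
        if internal_path not in h5:
            raise KeyError(f"{internal_path!r} not found in {path!r}")
        group = h5[internal_path]
        stored_type = group.attrs["decomposition_type"]
        if stored_type != "CP":
            raise ValueError(
                f"The HDF5 file contains a {stored_type} decomposition, not a CP decomposition."
            )
        n_modes = int(group.attrs["n_modes"])
        weights = group["weights"][()]
        factors = [group[f"factor_{mode}"][()] for mode in range(n_modes)]
    return weights, factors


# Round trip with a random rank-3 decomposition of a 4 x 5 x 6 tensor
rng = np.random.default_rng(0)
weights = np.ones(3)
factors = [rng.standard_normal((size, 3)) for size in (4, 5, 6)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "decompositions.h5")
    store_cp(weights, factors, path)
    loaded_weights, loaded_factors = load_cp(path)
```

Storing each factor as its own dataset keeps modes of different sizes separate, and the `internal_path` argument lets several decompositions share one file.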

Closing notes

The downside of this addition is that it introduces an extra dependency. However, we can make it optional and disable serialisation if h5py (or a NetCDF binding) is not installed.
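One possible way to keep the dependency optional is to check for it up front and import it only when a serialisation function is called. The sketch below uses only the standard library; `HAS_H5PY` and `require_h5py` are hypothetical names, not existing TensorLy API.

```python
import importlib.util

# True only if h5py can be found in the current environment
HAS_H5PY = importlib.util.find_spec("h5py") is not None


def require_h5py():
    """Return the h5py module, or raise a clear error if it is missing (hypothetical helper)."""
    if not HAS_H5PY:
        raise ImportError(
            "Serialising decompositions requires the optional dependency h5py. "
            "Install it with `pip install h5py`."
        )
    import h5py  # imported lazily so the rest of the package works without it

    return h5py
```

Each store and load function would then call `require_h5py()` at the top, so users without h5py only see an error when they actually try to serialise.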

yngvem, Aug 12 '21 08:08

Sounds like a good idea - feels to me like a good candidate for tensorly-lab, what do you think?

JeanKossaifi, Aug 16 '21 18:08