[FEATURE-REQUEST] Support for HDF5 special types (e.g., variable-length dtypes.)

Open callous4567 opened this issue 2 years ago • 0 comments

Thank you for reaching out and helping us improve Vaex!

Description Add support in vaex.DataFrame.export_hdf5(...) for columns whose elements are variable-length lists or arrays, and for other HDF5 "special types" (see https://docs.h5py.org/en/stable/special.html). Here is example code that would ideally run and produce a valid HDF5 file:

import vaex
import numpy as np

# Generate a test list-of-lists with rows of random length (0 to 99)
rng = np.random.default_rng()
lol = [list(range(rng.integers(0, 100))) for _ in range(1000)]
lol = np.array(lol, dtype=object)

# To vaex
df = vaex.from_arrays(lol=lol)

# Export to a file 
df.export_hdf5("test.hdf5")

The column lol holds variable-length lists (the elements could equally be other variable-length objects). h5py/HDF5 supports these, see https://docs.h5py.org/en/stable/special.html, and I've confirmed this in Python 3.10 with the method below, a scrap from something I'm writing that writes lists-of-lists without problems:


    # Scrap from a larger class: assumes `import h5py`, `import numpy as np`, and a self.path attribute.
    def write_list(self, group: str, dataset: str, _list: list, **kwargs):
        """
        Write the provided list to [group, dataset] in the file located at self.path.

        Behaviour
        ----
            If [group, dataset] already exists, it is deleted from the group and a new dataset is created. Note
            that this only unlinks the data from the HDF5 file's tree; it does not free file space. Special
            behaviour arises when the elements of your list are not all the same size, see
            https://docs.h5py.org/en/stable/special.html.

        **kwargs
        ----
            _vtype: str (optional, default False)
                If your list is made up of lists or other elements of varying length, you must specify the
                element dtype, e.g., "int32" or "float64". The list-of-lists is converted to a list-of-arrays
                before being written.

        :param group: Parent key
        :param dataset: Child key
        :param _list: list
        :return: True on success (errors propagate as exceptions).
        """
        with h5py.File(self.path, 'a') as f:
            # Create the parent group on first use.
            if group not in f.keys():
                f.create_group(group)

            # Replace any existing dataset of the same name.
            if dataset in f[group].keys():
                del f[group][dataset]

            _vtype = kwargs.get("_vtype", False)
            if _vtype is not False:
                # Variable-length rows: use an HDF5 vlen dtype and write a list of arrays.
                _dtype = h5py.vlen_dtype(np.dtype(_vtype))
                _list = [np.array(d, _vtype) for d in _list]
                f.create_dataset(name=group + "/" + dataset, dtype=_dtype, data=_list)
            else:
                # Rows of equal size: h5py can infer the dtype and shape directly.
                f.create_dataset(name=group + "/" + dataset, data=_list)

        return True
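
For reference, the h5py pattern the method above relies on can be sketched as a self-contained round trip. This is a minimal sketch, not vaex code: the dataset name `lol` and the file name are arbitrary, and the file is opened with h5py's `core` driver and `backing_store=False` so that nothing is written to disk.

```python
import h5py
import numpy as np

# Ragged input: three rows of different lengths.
rows = [[0, 1, 2], [5], [7, 8]]

# Variable-length dtype whose elements are int32 arrays.
vlen_int32 = h5py.vlen_dtype(np.dtype("int32"))

# In-memory HDF5 file (core driver, no backing store): no file is left on disk.
with h5py.File("vlen_demo.hdf5", "w", driver="core", backing_store=False) as f:
    ds = f.create_dataset("lol", shape=(len(rows),), dtype=vlen_int32)
    for i, row in enumerate(rows):
        ds[i] = np.asarray(row, dtype="int32")

    # Each element reads back as a 1-D array with its own length.
    restored = [ds[i].tolist() for i in range(len(rows))]
    # Base type of the variable-length dtype stored in the file.
    stored_dtype = h5py.check_vlen_dtype(ds.dtype)
```

Writing element by element is the form shown in the h5py documentation for special types; passing the whole list through `data=`, as the method above does, is the variant I have tested in my own code.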

Is your feature request related to a problem? Please describe. Not as far as I am aware.

Additional context When vaex attempts to export a column of variable-length objects, the following error is raised:

Traceback (most recent call last):
  File "A:\straszaks\pycharm_tpa\DBKnowPy-sstrasza\class_DB.py", line 430, in <module>
    DB().test()
  File "A:\straszaks\pycharm_tpa\DBKnowPy-sstrasza\class_DB.py", line 428, in test
    self._export()
  File "A:\straszaks\pycharm_tpa\DBKnowPy-sstrasza\class_DB.py", line 211, in _export
    self.FileLookup.export_hdf5(os.path.join(self.Root, self.Name + "_FileLookup.hdf5"), progress=False)
  File "C:\Users\sstrasza\Documents\miniforge3\lib\site-packages\vaex\dataframe.py", line 6949, in export_hdf5
    writer.layout(self, progress=progressbar_layout)
  File "C:\Users\sstrasza\Documents\miniforge3\lib\site-packages\vaex\hdf5\writer.py", line 85, in layout
    raise TypeError(f"Cannot export column of type: {dtype} (column {name})")
TypeError: Cannot export column of type: object (column _keys)

vaex.DataFrame.export_hdf5 should offer an option that lets the user declare which columns of the DataFrame hold variable-length types (or other HDF5 "special" types), so that vaex can export those columns to the HDF5 file instead of raising a TypeError.
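
To make the request more concrete: such an option would need to know which columns are variable-length and what their element type is. The sketch below is purely illustrative and uses only NumPy. `find_vlen_columns` is a hypothetical helper, not part of vaex or h5py, and it works on a plain dict of column name to array rather than on a DataFrame.

```python
import numpy as np


def find_vlen_columns(columns):
    """Map each object-dtype column of flat sequences to a common element dtype.

    Hypothetical helper for illustration only; not part of vaex or h5py.
    """
    found = {}
    for name, values in columns.items():
        arr = np.asarray(values)
        if arr.dtype != object:
            continue  # fixed-type column: no special handling needed
        # Convert every row to an array so the element dtypes can be compared.
        rows = [np.asarray(row) for row in arr]
        if any(row.ndim != 1 for row in rows):
            continue  # not a column of flat sequences
        dtypes = [row.dtype for row in rows if row.size > 0]
        if not dtypes:
            continue  # only empty rows: element type cannot be inferred
        found[name] = np.result_type(*dtypes)
    return found


# Fill an object array row by row so NumPy keeps each list as one element.
data = [[1, 2, 3], [4], [5, 6]]
ragged = np.empty(len(data), dtype=object)
for i, row in enumerate(data):
    ragged[i] = row

columns = {"x": np.arange(3.0), "lol": ragged}
vlen_columns = find_vlen_columns(columns)  # only "lol" is reported
```

The inferred element dtype is what would be handed to `h5py.vlen_dtype(...)` for each reported column; letting the user supply that mapping explicitly would avoid the inference step.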

callous4567 • Sep 19 '23 16:09