
Error when opening an HDF5 Dataset with np.ndarray

sbyrdsell opened this issue 6 years ago • 1 comment

When I try to open an HDF5 Dataset with np.ndarray, the application quits unexpectedly with no error message. I have Python 2.7 installed.

See the attached data file and screenshots.

[screenshot 1]

[screenshot 2]

Canada_Population.h5.zip

sbyrdsell • Oct 23 '19 15:10

I tried the same HDF5 dataset with HDF Compass v0.6.0 and it worked. Since you mentioned np.ndarray, I also tried with Python 3.6.9. Below is the IPython output:

In [1]: import h5py

In [2]: h5py.version.version
Out[2]: '2.10.0'

In [3]: h5py.version.hdf5_version
Out[3]: '1.10.4'

In [4]: f = h5py.File('Canada_Population.h5', 'r')

In [5]: labels = f['/Record/Labels/Values']

In [6]: labels.shape
Out[6]: (1,)

In [7]: labels.dtype
Out[7]: dtype([('Country', 'O', (1,)), ('Continent', 'O', (1,)), ('Abbreviation', 'O', (1,)), ('Language', 'O', (2,)), ('DataSource', 'O', (1,))])

In [8]: labels[0]
Segmentation fault: 11
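
As a side note, a hypothetical way to narrow down which member triggers this (not something I ran for this report) is to read the compound fields one at a time with h5py's field selection. A segfault cannot be caught in Python, so the process will simply die at the offending field, but the last name printed identifies it:

import h5py

# Hypothetical probe: read each compound member separately. If the
# vlen-string conversion of one member is the culprit, the interpreter
# will crash on that field; flush=True makes sure the field name is
# printed before the process dies.
with h5py.File('Canada_Population.h5', 'r') as f:
    labels = f['/Record/Labels/Values']
    for name in labels.dtype.names:
        print('reading field:', name, flush=True)
        print(labels[name])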

The reported stack trace indicates that the segmentation fault happened while h5py was converting the dataset's data into NumPy memory structures:

0   _conv.cpython-36m-darwin.so     0x000000010efbde9a __pyx_f_4h5py_5_conv_conv_vlen2str + 186
1   _conv.cpython-36m-darwin.so     0x000000010efbdd58 __pyx_f_4h5py_5_conv_generic_converter + 680
2   _conv.cpython-36m-darwin.so     0x000000010efbc76e __pyx_f_4h5py_5_conv_vlen2str + 62
3   libhdf5.103.dylib               0x000000010bfadede H5T_convert + 478
4   libhdf5.103.dylib               0x000000010bfc9f31 H5T__conv_array + 2481
5   libhdf5.103.dylib               0x000000010bfadfc1 H5T_convert + 705
6   libhdf5.103.dylib               0x000000010bfc5314 H5T__conv_struct_opt + 2788
7   libhdf5.103.dylib               0x000000010bfadfc1 H5T_convert + 705
8   libhdf5.103.dylib               0x000000010bfadc28 H5Tconvert + 1272
9   defs.cpython-36m-darwin.so      0x000000010c1f433c __pyx_f_4h5py_4defs_H5Tconvert + 76
10  _proxy.cpython-36m-darwin.so    0x000000010f1aaa18 __pyx_f_4h5py_6_proxy_dset_rw + 2824
11  h5d.cpython-36m-darwin.so       0x000000010f1bbf2e __pyx_pw_4h5py_3h5d_9DatasetID_1read + 542
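
For reference, the conversion path in the trace (compound via H5T__conv_struct_opt, then array via H5T__conv_array, then the vlen-to-str converter) matches the datatype nesting shown below, which can be double-checked from Python with h5py's low-level type API. This is only a sketch; the dataset path comes from this thread and everything else is illustrative:

import h5py

# Walk the compound datatype's members and confirm the nesting the stack
# trace passes through: compound -> array -> variable-length string.
with h5py.File('Canada_Population.h5', 'r') as f:
    tid = f['/Record/Labels/Values'].id.get_type()
    for i in range(tid.get_nmembers()):
        name = tid.get_member_name(i).decode()
        member = tid.get_member_type(i)
        if member.get_class() == h5py.h5t.ARRAY:
            base = member.get_super()
            print(name, member.get_array_dims(),
                  'vlen str:', base.is_variable_str())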

The h5dump description of the dataset's datatype is:

            DATATYPE  H5T_COMPOUND {
               H5T_ARRAY { [1] H5T_STRING {
                  STRSIZE H5T_VARIABLE;
                  STRPAD H5T_STR_NULLTERM;
                  CSET H5T_CSET_ASCII;
                  CTYPE H5T_C_S1;
               } } "Country";
               H5T_ARRAY { [1] H5T_STRING {
                  STRSIZE H5T_VARIABLE;
                  STRPAD H5T_STR_NULLTERM;
                  CSET H5T_CSET_ASCII;
                  CTYPE H5T_C_S1;
               } } "Continent";
               H5T_ARRAY { [1] H5T_STRING {
                  STRSIZE H5T_VARIABLE;
                  STRPAD H5T_STR_NULLTERM;
                  CSET H5T_CSET_ASCII;
                  CTYPE H5T_C_S1;
               } } "Abbreviation";
               H5T_ARRAY { [2] H5T_STRING {
                  STRSIZE H5T_VARIABLE;
                  STRPAD H5T_STR_NULLTERM;
                  CSET H5T_CSET_ASCII;
                  CTYPE H5T_C_S1;
               } } "Language";
               H5T_ARRAY { [1] H5T_STRING {
                  STRSIZE H5T_VARIABLE;
                  STRPAD H5T_STR_NULLTERM;
                  CSET H5T_CSET_ASCII;
                  CTYPE H5T_C_S1;
               } } "DataSource";
            }

which in my opinion is a bit unconventional: every member is an array of variable-length strings, even when that array holds a single element. I'd suggest simplifying those compound fields to plain string types if interoperability of this file format is important.
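
As a rough sketch of that simplification (field names are taken from the file; the values, the output filename, and the semicolon-joined Language field are illustrative assumptions), the same record could be written with plain variable-length string members:

import h5py
import numpy as np

# Each field is a plain variable-length string instead of a one- or
# two-element array of variable-length strings.
str_t = h5py.string_dtype(encoding='ascii')
dt = np.dtype([('Country', str_t),
               ('Continent', str_t),
               ('Abbreviation', str_t),
               ('Language', str_t),      # e.g. 'English;French'
               ('DataSource', str_t)])

row = np.array([('Canada', 'North America', 'CA',
                 'English;French', 'example source')], dtype=dt)

with h5py.File('Canada_Population_simplified.h5', 'w') as f:
    f.create_dataset('/Record/Labels/Values', data=row)

With no H5T_ARRAY wrapper around the strings, reading at least sidesteps the array-of-vlen conversion path seen in the stack trace.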

ghost • Oct 28 '19 17:10