[Bug]: NeuroConv dev tests - Last axis of data shapes is None
What happened?
I believe the merge of https://github.com/hdmf-dev/hdmf/pull/1093 broke some things on NeuroConv.
I mainly suspect that PR because it seems to be the only one merged in the last 2 days, and our dev tests were passing fine before then: https://github.com/catalystneuro/neuroconv/actions/workflows/dev-testing.yml
It might be advantageous to set up some dev testing of NeuroConv here on HDMF to ensure PRs don't have ripple effects throughout the ecosystem (for example, NWB Inspector tests against both dev PyNWB and downstream DANDI to ensure both upward and downward compatibility).
The full log: https://github.com/catalystneuro/neuroconv/actions/runs/9006012395/job/24742623348?pr=845
This seems to affect all interfaces and was caught during the roundtrip stage of testing (files seem to write just fine, but don't read back).
The final line of the traceback might be the most informative: some shape property has become None instead of the finite value that seems to be expected.
Steps to Reproduce
The NeuroConv tests are pretty basic (any interface or converter should do). I have not tried to reproduce locally on randomly generated data, but based on the error I assume the result would be the same.
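For reference, a standalone sketch of what a local repro could look like (untested; it uses pynwb's SpatialSeries because its data spec, like PoseEstimationSeries here, fixes the last axis). The actual failing NeuroConv test follows:

```python
from datetime import datetime, timezone

import numpy as np
from pynwb import NWBFile, NWBHDF5IO
from pynwb.behavior import SpatialSeries

nwbfile = NWBFile(
    session_description="repro",
    identifier="repro",
    session_start_time=datetime.now(timezone.utc),
)
nwbfile.add_acquisition(
    SpatialSeries(
        name="position",
        data=np.zeros((10, 2)),  # spec allows (None,), (None, 1), (None, 2), (None, 3)
        reference_frame="(0, 0) is the bottom-left corner",
        rate=30.0,
        starting_time=0.0,
    )
)
with NWBHDF5IO("repro.nwb", "w") as io:
    io.write(nwbfile)  # with hdmf dev, 'data' is written with maxshape=(None, None)
with NWBHDF5IO("repro.nwb", "r") as io:
    io.read()  # expected to raise: incorrect shape for 'data' (got '(None, None)', ...)
```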
def test_converter_in_converter(self):
    class TestConverter(NWBConverter):
        data_interface_classes = dict(TestPoseConverter=LightningPoseConverter)

    converter = TestConverter(
        source_data=dict(
            TestPoseConverter=dict(
                file_path=self.file_path,
                original_video_file_path=self.original_video_file_path,
                labeled_video_file_path=self.labeled_video_file_path,
                image_series_original_video_name=self.original_video_name,
                image_series_labeled_video_name=self.labeled_video_name,
            )
        )
    )
    conversion_options = dict(TestPoseConverter=self.conversion_options)
    nwbfile_path = str(self.test_dir / "test_lightningpose_converter_in_nwbconverter.nwb")
    converter.run_conversion(nwbfile_path=nwbfile_path, conversion_options=conversion_options)
>   self.assertNWBFileStructure(nwbfile_path, **self.conversion_options)
    # Error occurs during read of the file at this line
Test parameters (the HDF5 datasets may have been closed after pytest captured this info):
self = <ndx_pose.io.pose.PoseEstimationSeriesMap object at 0x7f17771692b0>
kwargs = {'comments': 'no comments', 'confidence': <Closed HDF5 dataset>, 'conversion': 1.0, 'data': <Closed HDF5 dataset>, ...}
builder = root/processing/behavior/PoseEstimation/PoseEstimationSeriesnose_bot GroupBuilder {'attributes': {'comments': 'no comm...iesnose_bot/starting_time DatasetBuilder {'attributes': {'rate': 250.0, 'unit': 'seconds'}, 'data': 0.0}}, 'links': {}}
manager = <hdmf.build.manager.BuildManager object at 0x7f1778de8af0>
parent = {'source': '/tmp/tmp3zj0duok/test_lightningpose_converter_in_nwbconverter.nwb', 'location': 'root/behavior/PoseEstimation', 'namespace': 'ndx-pose', 'data_type': 'PoseEstimation'}
cls = <class 'ndx_pose.pose.PoseEstimationSeries'>
subspecs = {{'doc': 'Description of the time series.', 'name': 'description', 'required': False, 'dtype': 'text', 'default_value'...32768/8000 = 9.5367e-9.", 'name': 'conversion', 'required': False, 'dtype': 'float32', 'default_value': 1.0}: 1.0, ...}
const_args = {'comments': 'no comments', 'confidence': <Closed HDF5 dataset>, 'conversion': 1.0, 'data': <Closed HDF5 dataset>, ...}
subspec = {'dtype': 'float64', 'doc': 'Timestamp of the first sample in seconds. When timestamps are uniformly spaced, the times...': "Unit of measurement for time, which is fixed to 'seconds'.", 'name': 'unit', 'dtype': 'text', 'value': 'seconds'}]}
Traceback
tests/test_on_data/test_format_converters/test_lightningpose_converter.py:199:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_on_data/test_format_converters/test_lightningpose_converter.py:138: in assertNWBFileStructure
nwbfile = io.read()
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pynwb/__init__.py:326: in read
file = super().read(**kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/backends/hdf5/h5tools.py:484: in read
return super().read(**kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/backends/io.py:60: in read
container = self.__manager.construct(f_builder)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:285: in construct
result = self.__type_map.construct(builder, self, None)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:803: in construct
return obj_mapper.construct(builder, build_manager, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1232: in construct
subspecs = self.__get_subspec_values(builder, self.spec, manager)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1161: in __get_subspec_values
self.__get_sub_builders(groups, spec.groups, manager, ret)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1212: in __get_sub_builders
ret.update(self.__get_subspec_values(sub_builder, subspec, manager))
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1161: in __get_subspec_values
self.__get_sub_builders(groups, spec.groups, manager, ret)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1204: in __get_sub_builders
sub_builder = self.__flatten(sub_builder, subspec, manager)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1217: in __flatten
tmp = [manager.construct(b) for b in sub_builder]
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1217: in <listcomp>
tmp = [manager.construct(b) for b in sub_builder]
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:281: in construct
result = self.__type_map.construct(builder, self, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:803: in construct
return obj_mapper.construct(builder, build_manager, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1232: in construct
subspecs = self.__get_subspec_values(builder, self.spec, manager)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1161: in __get_subspec_values
self.__get_sub_builders(groups, spec.groups, manager, ret)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1204: in __get_sub_builders
sub_builder = self.__flatten(sub_builder, subspec, manager)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1217: in __flatten
tmp = [manager.construct(b) for b in sub_builder]
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1217: in <listcomp>
tmp = [manager.construct(b) for b in sub_builder]
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:281: in construct
result = self.__type_map.construct(builder, self, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:803: in construct
return obj_mapper.construct(builder, build_manager, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1232: in construct
subspecs = self.__get_subspec_values(builder, self.spec, manager)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1161: in __get_subspec_values
self.__get_sub_builders(groups, spec.groups, manager, ret)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1204: in __get_sub_builders
sub_builder = self.__flatten(sub_builder, subspec, manager)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1217: in __flatten
tmp = [manager.construct(b) for b in sub_builder]
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1217: in <listcomp>
tmp = [manager.construct(b) for b in sub_builder]
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:281: in construct
result = self.__type_map.construct(builder, self, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:803: in construct
return obj_mapper.construct(builder, build_manager, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
-----------------------------------------------------------------------------
@docval({'name': 'builder', 'type': (DatasetBuilder, GroupBuilder),
'doc': 'the builder to construct the AbstractContainer from'},
{'name': 'manager', 'type': BuildManager, 'doc': 'the BuildManager for this build'},
{'name': 'parent', 'type': (Proxy, AbstractContainer),
'doc': 'the parent AbstractContainer/Proxy for the AbstractContainer being built', 'default': None})
def construct(self, **kwargs):
''' Construct an AbstractContainer from the given Builder '''
builder, manager, parent = getargs('builder', 'manager', 'parent', kwargs)
cls = manager.get_cls(builder)
# gather all subspecs
subspecs = self.__get_subspec_values(builder, self.spec, manager)
# get the constructor argument that each specification corresponds to
const_args = dict()
# For Data container classes, we need to populate the data constructor argument since
# there is no sub-specification that maps to that argument under the default logic
if issubclass(cls, Data):
if not isinstance(builder, DatasetBuilder): # pragma: no cover
raise ValueError('Can only construct a Data object from a DatasetBuilder - got %s' % type(builder))
const_args['data'] = self.__check_ref_resolver(builder.data)
for subspec, value in subspecs.items():
const_arg = self.get_const_arg(subspec)
if const_arg is not None:
if isinstance(subspec, BaseStorageSpec) and subspec.is_many():
existing_value = const_args.get(const_arg)
if isinstance(existing_value, list):
value = existing_value + value
const_args[const_arg] = value
# build kwargs for the constructor
kwargs = dict()
for const_arg in get_docval(cls.__init__):
argname = const_arg['name']
override = self.__get_override_carg(argname, builder, manager)
if override is not None:
val = override
elif argname in const_args:
val = const_args[argname]
else:
continue
kwargs[argname] = val
try:
> obj = self.__new_container__(cls, builder.source, parent, builder.attributes.get(self.__spec.id_key()),
**kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1262:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1275: in __new_container__
obj.__init__(**kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:667: in func_call
pargs = _check_args(args, kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = (<[AttributeError("'PoseEstimationSeries' object has no attribute '_AbstractContainer__name'") raised in repr()] PoseEstimationSeries object at 0x7f177935c790>,)
kwargs = {'comments': 'no comments', 'confidence': <Closed HDF5 dataset>, 'conversion': 1.0, 'data': <Closed HDF5 dataset>, ...}
def _check_args(args, kwargs):
"""Parse and check arguments to decorated function. Raise warnings and errors as appropriate."""
# this function was separated from func_call() in order to make stepping through lines of code using pdb
# easier
parsed = __parse_args(
loc_val,
args[1:] if is_method else args,
kwargs,
enforce_type=enforce_type,
enforce_shape=enforce_shape,
allow_extra=allow_extra,
allow_positional=allow_positional
)
parse_warnings = parsed.get('future_warnings')
if parse_warnings:
msg = '%s: %s' % (func.__qualname__, ', '.join(parse_warnings))
warnings.warn(msg, category=FutureWarning, stacklevel=3)
for error_type, ExceptionType in (('type_errors', TypeError),
('value_errors', ValueError),
('syntax_errors', SyntaxError)):
parse_err = parsed.get(error_type)
if parse_err:
msg = '%s: %s' % (func.__qualname__, ', '.join(parse_err))
> raise ExceptionType(msg)
E ValueError: PoseEstimationSeries.__init__: incorrect shape for 'data' (got '(None, None)', expected '((None, 2), (None, 3))')
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:660: ValueError
The above exception was the direct cause of the following exception:
Operating System
macOS
Python Executable
Conda
Python Version
3.12
Package Versions
No response
get_data_shape is used by the shape validator. If maxshape is present, then get_data_shape returns maxshape. Newly written data will have maxshape set to None for all axes. This will break the shape validator. I don't understand why get_data_shape returns maxshape if present. But get_data_shape is used in many places for many types of data, including DataChunkIterator. Maybe the get_data_shape called for DCI needs to be a different function, or there should be a flag to return maxshape if present.
@mavaylon1 could you look into this? We can revert that PR if needed.
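For anyone trying to isolate this outside of NeuroConv, here is a minimal sketch of the failure mode, assuming the get_data_shape behavior described above:

```python
import h5py
import numpy as np
from hdmf.utils import get_data_shape

with h5py.File("demo.h5", "w") as f:
    # hdmf dev (after PR #1093) writes datasets with maxshape=(None, None)
    dset = f.create_dataset("data", data=np.zeros((10, 2)), maxshape=(None, None))
    # get_data_shape prefers maxshape when present, so every axis comes back None
    print(get_data_shape(dset))  # -> (None, None), not (10, 2)
    # On read, docval compares this against the spec's allowed shapes
    # ((None, 2), (None, 3)) and raises the ValueError shown in the traceback.
```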
get_data_shape is also used by the validator, so I worry that newly written datasets with a shape requirement that is non-None in some axis will be marked as invalid, because the dataset will be processed as having shape [None] * ndim.
Newly written data will have maxshape set to None for all axes.
It seems that this may be too "aggressive". If the schema specifies a dimension as fixed in size, then we should not set it to None. I.e., we should only make expandable the dimensions that the schema allows to be expandable. Is this information somehow communicated to the backend in the Builder, so that we could adjust the logic that was added in https://github.com/hdmf-dev/hdmf/pull/1093 to make only the dimensions that are not fixed in the schema expandable?
The backend does not have direct access to the schema associated with a builder and is intentionally siloed from the schema. write_builder, write_group, write_dataset, etc. write the builder data as it was received. The builder, also intentionally, does not have access to the schema. The backend could query for the schema associated with a builder through the BuildManager, but that breaks the separation. It might be better to have the ObjectMapper set a maxshape property on a DatasetBuilder during build and use that when calling write_dataset if expandable=True. Alternatively, the builder could maintain a reference to the spec that it is mapped to. We would have to modify every place that a builder is created. That doesn't seem so bad, but there may be negative consequences of breaking that separation.
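As a sketch of that ObjectMapper idea, the maxshape derivation itself would be small; the function name and where it hooks in are assumptions, not existing hdmf API:

```python
def derive_maxshape(spec_shape, data_shape):
    """Expand only the axes the schema leaves unbounded (None); pin fixed axes."""
    return tuple(
        None if spec_dim is None else data_dim
        for spec_dim, data_dim in zip(spec_shape, data_shape)
    )

# e.g., PoseEstimationSeries.data with spec shape (None, 2) and data shape (100, 2):
print(derive_maxshape((None, 2), (100, 2)))  # -> (None, 2), expandable in time only
```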
Let me see if I understand. get_data_shape is used by the shape validator and the validator expects shape and not maxshape. If so, do we want the validator to verify the shape and not maxshape? If that is the case, then I also agree it is weird to have maxshape returned if present.
As for the DCI, we could add a bool parameter to get_data_shape that, if True, returns maxshape when present, with the default being False.
For the data that is fixed in size in the schema, I would need to give this more thought. @rly Thoughts?
@mavaylon1 Yes, the validator should validate using actual shape not maxshape.
TODO:
- [ ] Update get_data_shape to return shape unless the new bool parameter for maxshape is True (see the sketch after this list).
- [ ] Ensure that a dataset of references supports being expanded.
- [ ] Set maxshape to None only in the axes that are None in the shape specification (via ObjectMapper). It might be easier to have the builder carry the shape info as a field. (Check all shape options from the schema.)
@rly I am thinking about the problem for shapes defined in the schema. How are these allowed to be written? By setting maxshape right at the end in dset, I think we are skipping shape checks that would have prevented the data from being written in the first place. This leads to read throwing a fit. This assumes there is such a check.
I think if on write we use the shape from the schema, this should leave read alone.
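For illustration, writing with a schema-derived maxshape rather than all-None keeps the file consistent with the spec while staying expandable along the unbounded axis (plain h5py, not hdmf code):

```python
import h5py
import numpy as np

data = np.zeros((100, 2))
with h5py.File("example.h5", "w") as f:
    # Expandable only along the first axis, matching a spec shape of (None, 2).
    # h5py enables chunking automatically for resizable datasets.
    dset = f.create_dataset("data", data=data, maxshape=(None, 2))
    dset.resize((150, 2))      # growing in time is still allowed
    # dset.resize((150, 3))    # would raise: axis 1 is fixed at 2
```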
I think shape is validated before write in the docval of init of the particular container class. If there is a custom container class, then the shape validation in docval is supposed to be consistent with the schema. If the container class is auto-generated, then the shape validation parameters in docval are generated from the schema.
I'm not sure if the shape is being validated elsewhere. It is on the todo list to run the validator before write though.
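For reference, here is a minimal sketch of what that docval shape validation looks like in a container class. This toy class is illustrative only (not from pynwb or ndx-pose); the shape rule mirrors the one on PoseEstimationSeries.data:

```python
from hdmf.container import Container
from hdmf.utils import docval, getargs

class ToySeries(Container):
    """Toy container; the shape rule mirrors PoseEstimationSeries.data."""

    @docval(
        {'name': 'name', 'type': str, 'doc': 'name of this series'},
        {'name': 'data', 'type': 'array_data', 'doc': '(x, y) or (x, y, z) positions',
         'shape': ((None, 2), (None, 3))},  # checked via get_data_shape at __init__
    )
    def __init__(self, **kwargs):
        name, data = getargs('name', 'data', kwargs)
        super().__init__(name=name)
        self.data = data

ToySeries(name='ok', data=[[0.0, 0.0]] * 10)  # (10, 2) matches (None, 2)
# An expandable HDF5 dataset whose get_data_shape comes back (None, None)
# matches neither option, producing the ValueError in the traceback above.
```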
If get_data_shape returns actual shape instead of maxshape during validation, then the errors should be fixed. However, the maxshape would be set to None in every axis, which is overly lenient, and such a maxshape does not conform to the shape rule in the schema. So if I understand your last comment correctly, then yes, if we set maxshape to the shape from the schema, then the maxshape would be nice and consistent with the schema.
One edge case is that the shape in the schema can be a list of shape options, e.g., images can have shape [None, None], [None, None, 3] or [None, None, 4] which correspond to [width, height], [width, height, rgb], and [width, height, rgba]. The maxshape should correspond to whichever shape option out of the allowed shape options from the schema best matches the data - hopefully there is only one, but I haven't thought through all the possible variations.
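A sketch of what "best match" selection could look like, under the assumption that a matching option must have the same number of dimensions and agree on every fixed axis (this helper is hypothetical, not hdmf code):

```python
def match_shape_option(data_shape, shape_options):
    """Return the first allowed shape option compatible with the actual data shape."""
    for option in shape_options:
        if len(option) == len(data_shape) and all(
            dim is None or dim == actual for dim, actual in zip(option, data_shape)
        ):
            return option
    return None

# The image example from above: [width, height], [width, height, rgb], [width, height, rgba]
options = [(None, None), (None, None, 3), (None, None, 4)]
print(match_shape_option((480, 640, 3), options))  # -> (None, None, 3)
print(match_shape_option((480, 640), options))     # -> (None, None)
```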
Yeah I believe the shape is validated in docval. What I was thinking about was the goal you mentioned of having the shape be validated prior to write.
Note: we need to consider this when working on implementing extendable datasets in HDMF again @mavaylon1