[Bug]: NeuroConv dev tests - Last axis of data shapes is None
What happened?
I believe the merge of https://github.com/hdmf-dev/hdmf/pull/1093 broke some things on NeuroConv.
I mainly suspect that PR because it seems to be the only one merged in the last 2 days, and our dev tests were passing fine before then: https://github.com/catalystneuro/neuroconv/actions/workflows/dev-testing.yml
It might be advantageous to set up some dev testing of NeuroConv here on HDMF to ensure PRs don't have ripple effects throughout the ecosystem (for example, NWB Inspector tests against both dev PyNWB and downstream DANDI to ensure both upward and downward compatibility).
The full log: https://github.com/catalystneuro/neuroconv/actions/runs/9006012395/job/24742623348?pr=845
This seems to affect all interfaces and was caught during the roundtrip stage of testing (files seem to write just fine, but don't read back).
The final line of the traceback might be the most informative: some shape property has become None instead of the finite value that seems to be expected.
Steps to Reproduce
The NeuroConv tests are pretty basic (any interface or converter should do). I have not tried to reproduce locally on randomly generated data, but based on the error I assume the result would be the same.
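For reference, a standalone sketch of what a local repro could look like (untested; it uses pynwb's SpatialSeries because its data spec, like PoseEstimationSeries here, fixes the last axis). The actual failing NeuroConv test follows:

```python
from datetime import datetime, timezone

import numpy as np
from pynwb import NWBFile, NWBHDF5IO
from pynwb.behavior import SpatialSeries

nwbfile = NWBFile(
    session_description="repro",
    identifier="repro",
    session_start_time=datetime.now(timezone.utc),
)
nwbfile.add_acquisition(
    SpatialSeries(
        name="position",
        data=np.zeros((10, 2)),  # spec allows (None,), (None, 1), (None, 2), (None, 3)
        reference_frame="(0, 0) is the bottom-left corner",
        rate=30.0,
        starting_time=0.0,
    )
)
with NWBHDF5IO("repro.nwb", "w") as io:
    io.write(nwbfile)  # with hdmf dev, 'data' is written with maxshape=(None, None)
with NWBHDF5IO("repro.nwb", "r") as io:
    io.read()  # expected to raise: incorrect shape for 'data' (got '(None, None)', ...)
```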
def test_converter_in_converter(self):
    class TestConverter(NWBConverter):
        data_interface_classes = dict(TestPoseConverter=LightningPoseConverter)

    converter = TestConverter(
        source_data=dict(
            TestPoseConverter=dict(
                file_path=self.file_path,
                original_video_file_path=self.original_video_file_path,
                labeled_video_file_path=self.labeled_video_file_path,
                image_series_original_video_name=self.original_video_name,
                image_series_labeled_video_name=self.labeled_video_name,
            )
        )
    )
    conversion_options = dict(TestPoseConverter=self.conversion_options)
    nwbfile_path = str(self.test_dir / "test_lightningpose_converter_in_nwbconverter.nwb")
    converter.run_conversion(nwbfile_path=nwbfile_path, conversion_options=conversion_options)
>   self.assertNWBFileStructure(nwbfile_path, **self.conversion_options)
    # Error occurs during read of the file at this line
Test parameters (the HDF5 datasets may have been closed after pytest captured this info):
self = <ndx_pose.io.pose.PoseEstimationSeriesMap object at 0x7f17771692b0>
kwargs = {'comments': 'no comments', 'confidence': <Closed HDF5 dataset>, 'conversion': 1.0, 'data': <Closed HDF5 dataset>, ...}
builder = root/processing/behavior/PoseEstimation/PoseEstimationSeriesnose_bot GroupBuilder {'attributes': {'comments': 'no comm...iesnose_bot/starting_time DatasetBuilder {'attributes': {'rate': 250.0, 'unit': 'seconds'}, 'data': 0.0}}, 'links': {}}
manager = <hdmf.build.manager.BuildManager object at 0x7f1778de8af0>
parent = {'source': '/tmp/tmp3zj0duok/test_lightningpose_converter_in_nwbconverter.nwb', 'location': 'root/behavior/PoseEstimation', 'namespace': 'ndx-pose', 'data_type': 'PoseEstimation'}
cls = <class 'ndx_pose.pose.PoseEstimationSeries'>
subspecs = {{'doc': 'Description of the time series.', 'name': 'description', 'required': False, 'dtype': 'text', 'default_value'...32768/8000 = 9.5367e-9.", 'name': 'conversion', 'required': False, 'dtype': 'float32', 'default_value': 1.0}: 1.0, ...}
const_args = {'comments': 'no comments', 'confidence': <Closed HDF5 dataset>, 'conversion': 1.0, 'data': <Closed HDF5 dataset>, ...}
subspec = {'dtype': 'float64', 'doc': 'Timestamp of the first sample in seconds. When timestamps are uniformly spaced, the times...': "Unit of measurement for time, which is fixed to 'seconds'.", 'name': 'unit', 'dtype': 'text', 'value': 'seconds'}]}
Traceback
tests/test_on_data/test_format_converters/test_lightningpose_converter.py:199:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_on_data/test_format_converters/test_lightningpose_converter.py:138: in assertNWBFileStructure
nwbfile = io.read()
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pynwb/__init__.py:326: in read
file = super().read(**kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/backends/hdf5/h5tools.py:484: in read
return super().read(**kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/backends/io.py:60: in read
container = self.__manager.construct(f_builder)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:285: in construct
result = self.__type_map.construct(builder, self, None)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:803: in construct
return obj_mapper.construct(builder, build_manager, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1232: in construct
subspecs = self.__get_subspec_values(builder, self.spec, manager)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1161: in __get_subspec_values
self.__get_sub_builders(groups, spec.groups, manager, ret)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1212: in __get_sub_builders
ret.update(self.__get_subspec_values(sub_builder, subspec, manager))
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1161: in __get_subspec_values
self.__get_sub_builders(groups, spec.groups, manager, ret)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1204: in __get_sub_builders
sub_builder = self.__flatten(sub_builder, subspec, manager)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1217: in __flatten
tmp = [manager.construct(b) for b in sub_builder]
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1217: in <listcomp>
tmp = [manager.construct(b) for b in sub_builder]
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:281: in construct
result = self.__type_map.construct(builder, self, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:803: in construct
return obj_mapper.construct(builder, build_manager, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1232: in construct
subspecs = self.__get_subspec_values(builder, self.spec, manager)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1161: in __get_subspec_values
self.__get_sub_builders(groups, spec.groups, manager, ret)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1204: in __get_sub_builders
sub_builder = self.__flatten(sub_builder, subspec, manager)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1217: in __flatten
tmp = [manager.construct(b) for b in sub_builder]
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1217: in <listcomp>
tmp = [manager.construct(b) for b in sub_builder]
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:281: in construct
result = self.__type_map.construct(builder, self, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:803: in construct
return obj_mapper.construct(builder, build_manager, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1232: in construct
subspecs = self.__get_subspec_values(builder, self.spec, manager)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1161: in __get_subspec_values
self.__get_sub_builders(groups, spec.groups, manager, ret)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1204: in __get_sub_builders
sub_builder = self.__flatten(sub_builder, subspec, manager)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1217: in __flatten
tmp = [manager.construct(b) for b in sub_builder]
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1217: in <listcomp>
tmp = [manager.construct(b) for b in sub_builder]
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:281: in construct
result = self.__type_map.construct(builder, self, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/manager.py:803: in construct
return obj_mapper.construct(builder, build_manager, parent)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:668: in func_call
return func(args[0], **pargs)
-----------------------------------------------------------------------------
@docval({'name': 'builder', 'type': (DatasetBuilder, GroupBuilder),
'doc': 'the builder to construct the AbstractContainer from'},
{'name': 'manager', 'type': BuildManager, 'doc': 'the BuildManager for this build'},
{'name': 'parent', 'type': (Proxy, AbstractContainer),
'doc': 'the parent AbstractContainer/Proxy for the AbstractContainer being built', 'default': None})
def construct(self, **kwargs):
''' Construct an AbstractContainer from the given Builder '''
builder, manager, parent = getargs('builder', 'manager', 'parent', kwargs)
cls = manager.get_cls(builder)
# gather all subspecs
subspecs = self.__get_subspec_values(builder, self.spec, manager)
# get the constructor argument that each specification corresponds to
const_args = dict()
# For Data container classes, we need to populate the data constructor argument since
# there is no sub-specification that maps to that argument under the default logic
if issubclass(cls, Data):
if not isinstance(builder, DatasetBuilder): # pragma: no cover
raise ValueError('Can only construct a Data object from a DatasetBuilder - got %s' % type(builder))
const_args['data'] = self.__check_ref_resolver(builder.data)
for subspec, value in subspecs.items():
const_arg = self.get_const_arg(subspec)
if const_arg is not None:
if isinstance(subspec, BaseStorageSpec) and subspec.is_many():
existing_value = const_args.get(const_arg)
if isinstance(existing_value, list):
value = existing_value + value
const_args[const_arg] = value
# build kwargs for the constructor
kwargs = dict()
for const_arg in get_docval(cls.__init__):
argname = const_arg['name']
override = self.__get_override_carg(argname, builder, manager)
if override is not None:
val = override
elif argname in const_args:
val = const_args[argname]
else:
continue
kwargs[argname] = val
try:
> obj = self.__new_container__(cls, builder.source, parent, builder.attributes.get(self.__spec.id_key()),
**kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1262:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/build/objectmapper.py:1275: in __new_container__
obj.__init__(**kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:667: in func_call
pargs = _check_args(args, kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = (<[AttributeError("'PoseEstimationSeries' object has no attribute '_AbstractContainer__name'") raised in repr()] PoseEstimationSeries object at 0x7f177935c790>,)
kwargs = {'comments': 'no comments', 'confidence': <Closed HDF5 dataset>, 'conversion': 1.0, 'data': <Closed HDF5 dataset>, ...}
def _check_args(args, kwargs):
"""Parse and check arguments to decorated function. Raise warnings and errors as appropriate."""
# this function was separated from func_call() in order to make stepping through lines of code using pdb
# easier
parsed = __parse_args(
loc_val,
args[1:] if is_method else args,
kwargs,
enforce_type=enforce_type,
enforce_shape=enforce_shape,
allow_extra=allow_extra,
allow_positional=allow_positional
)
parse_warnings = parsed.get('future_warnings')
if parse_warnings:
msg = '%s: %s' % (func.__qualname__, ', '.join(parse_warnings))
warnings.warn(msg, category=FutureWarning, stacklevel=3)
for error_type, ExceptionType in (('type_errors', TypeError),
('value_errors', ValueError),
('syntax_errors', SyntaxError)):
parse_err = parsed.get(error_type)
if parse_err:
msg = '%s: %s' % (func.__qualname__, ', '.join(parse_err))
> raise ExceptionType(msg)
E ValueError: PoseEstimationSeries.__init__: incorrect shape for 'data' (got '(None, None)', expected '((None, 2), (None, 3))')
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/hdmf/utils.py:660: ValueError
The above exception was the direct cause of the following exception:
Operating System
macOS
Python Executable
Conda
Python Version
3.12
Package Versions
No response
get_data_shape is used by the shape validator. If maxshape is present, then get_data_shape returns maxshape. Newly written data will have maxshape set to None for all axes. This will break the shape validator. I don't understand why get_data_shape returns maxshape if present. But get_data_shape is used in many places for many types of data, including DataChunkIterator. Maybe the get_data_shape called for DCI needs to be a different function, or there should be a flag to return maxshape if present.
@mavaylon1 could you look into this? We can revert that PR if needed.
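For anyone trying to isolate this outside of NeuroConv, here is a minimal sketch of the failure mode, assuming the get_data_shape behavior described above:

```python
import h5py
import numpy as np
from hdmf.utils import get_data_shape

with h5py.File("demo.h5", "w") as f:
    # hdmf dev (after PR #1093) writes datasets with maxshape=(None, None)
    dset = f.create_dataset("data", data=np.zeros((10, 2)), maxshape=(None, None))
    # get_data_shape prefers maxshape when present, so every axis comes back None
    print(get_data_shape(dset))  # -> (None, None), not (10, 2)
    # On read, docval compares this against the spec's allowed shapes
    # ((None, 2), (None, 3)) and raises the ValueError shown in the traceback.
```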
get_data_shape is also used by the validator, so I worry that newly written datasets with a shape requirement that is non-None in some axis will be marked as invalid, because the dataset will be processed as having shape [None] * ndim.
Newly written data will have maxshape set to None for all axes.
It seems that this may be too "aggressive". If the schema specifies a dimension as fixed in size, then we should not set it to None. I.e., we should only make expandable the dimensions that the schema allows to be expandable. Is this information somehow communicated to the backend in the Builder, so that we could adjust the logic that was added in https://github.com/hdmf-dev/hdmf/pull/1093 to make only the dimensions that are not fixed in the schema expandable?
The backend does not have direct access to the schema associated with a builder and is intentionally siloed from the schema. write_builder, write_group, write_dataset, etc. write the builder data as it was received. The builder, also intentionally, does not have access to the schema. The backend could query for the schema associated with a builder through the BuildManager, but that breaks the separation. It might be better to have the ObjectMapper set a maxshape property on a DatasetBuilder during build and use that when calling write_dataset if expandable=True. Alternatively, the builder could maintain a reference to the spec that it is mapped to. We would have to modify every place that a builder is created. That doesn't seem so bad, but there may be negative consequences of breaking that separation.
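As a sketch of that ObjectMapper idea, the maxshape derivation itself would be small; the function name and where it hooks in are assumptions, not existing hdmf API:

```python
def derive_maxshape(spec_shape, data_shape):
    """Expand only the axes the schema leaves unbounded (None); pin fixed axes."""
    return tuple(
        None if spec_dim is None else data_dim
        for spec_dim, data_dim in zip(spec_shape, data_shape)
    )

# e.g., PoseEstimationSeries.data with spec shape (None, 2) and data shape (100, 2):
print(derive_maxshape((None, 2), (100, 2)))  # -> (None, 2), expandable in time only
```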
Let me see if I understand. get_data_shape is used by the shape validator and the validator expects shape and not maxshape. If so, do we want the validator to verify the shape and not maxshape? If that is the case, then I also agree it is weird to have maxshape returned if present.
As for the DCI, we could add a bool parameter to get_data_shape that, if True, returns maxshape when present, with the default being False.
For the data that is fixed in size in the schema, I would need to give this more thought. @rly Thoughts?
@mavaylon1 Yes, the validator should validate using actual shape not maxshape.
TODO:
- [ ] Update get_data_shape to return shape unless the new bool parameter for maxshape is True (see the sketch after this list).
- [ ] Ensure that a dataset of references supports being expanded.
- [ ] Set maxshape to None only in the axes that are None in the shape specification (via ObjectMapper). It might be easier to have the builder carry the shape info as a field. (Check all shape options from the schema.)
@rly I am thinking about the problem for shapes defined in the schema. How are these allowed to be written? By setting maxshape right at the end in dset, I think we are skipping shape checks that would have prevented the data from being written in the first place. This leads to read throwing a fit. This assumes there is such a check.
I think if on write we use the shape from the schema, this should leave read alone.
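For illustration, writing with a schema-derived maxshape rather than all-None keeps the file consistent with the spec while staying expandable along the unbounded axis (plain h5py, not hdmf code):

```python
import h5py
import numpy as np

data = np.zeros((100, 2))
with h5py.File("example.h5", "w") as f:
    # Expandable only along the first axis, matching a spec shape of (None, 2).
    # h5py enables chunking automatically for resizable datasets.
    dset = f.create_dataset("data", data=data, maxshape=(None, 2))
    dset.resize((150, 2))      # growing in time is still allowed
    # dset.resize((150, 3))    # would raise: axis 1 is fixed at 2
```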
I think shape is validated before write in the docval of init of the particular container class. If there is a custom container class, then the shape validation in docval is supposed to be consistent with the schema. If the container class is auto-generated, then the shape validation parameters in docval are generated from the schema.
I'm not sure if the shape is being validated elsewhere. It is on the todo list to run the validator before write though.
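For reference, here is a minimal sketch of what that docval shape validation looks like in a container class. This toy class is illustrative only (not from pynwb or ndx-pose); the shape rule mirrors the one on PoseEstimationSeries.data:

```python
from hdmf.container import Container
from hdmf.utils import docval, getargs

class ToySeries(Container):
    """Toy container; the shape rule mirrors PoseEstimationSeries.data."""

    @docval(
        {'name': 'name', 'type': str, 'doc': 'name of this series'},
        {'name': 'data', 'type': 'array_data', 'doc': '(x, y) or (x, y, z) positions',
         'shape': ((None, 2), (None, 3))},  # checked via get_data_shape at __init__
    )
    def __init__(self, **kwargs):
        name, data = getargs('name', 'data', kwargs)
        super().__init__(name=name)
        self.data = data

ToySeries(name='ok', data=[[0.0, 0.0]] * 10)  # (10, 2) matches (None, 2)
# An expandable HDF5 dataset whose get_data_shape comes back (None, None)
# matches neither option, producing the ValueError in the traceback above.
```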
If get_data_shape returns actual shape instead of maxshape during validation, then the errors should be fixed. However, the maxshape would be set to None in every axis, which is overly lenient, and such a maxshape does not conform to the shape rule in the schema. So if I understand your last comment correctly, then yes, if we set maxshape to the shape from the schema, then the maxshape would be nice and consistent with the schema.
One edge case is that the shape in the schema can be a list of shape options, e.g., images can have shape [None, None], [None, None, 3] or [None, None, 4] which correspond to [width, height], [width, height, rgb], and [width, height, rgba]. The maxshape should correspond to whichever shape option out of the allowed shape options from the schema best matches the data - hopefully there is only one, but I haven't thought through all the possible variations.
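A sketch of what "best match" selection could look like, under the assumption that a matching option must have the same number of dimensions and agree on every fixed axis (this helper is hypothetical, not hdmf code):

```python
def match_shape_option(data_shape, shape_options):
    """Return the first allowed shape option compatible with the actual data shape."""
    for option in shape_options:
        if len(option) == len(data_shape) and all(
            dim is None or dim == actual for dim, actual in zip(option, data_shape)
        ):
            return option
    return None

# The image example from above: [width, height], [width, height, rgb], [width, height, rgba]
options = [(None, None), (None, None, 3), (None, None, 4)]
print(match_shape_option((480, 640, 3), options))  # -> (None, None, 3)
print(match_shape_option((480, 640), options))     # -> (None, None)
```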
Yeah I believe the shape is validated in docval. What I was thinking about was the goal you mentioned of having the shape be validated prior to write.
Note: we need to consider this when working on implementing extendable datasets in HDMF again @mavaylon1