Reading an empty cell array blob fails
Bug Report
Description
An empty cell array inserted from MATLAB with mym cannot be read in DataJoint Python.
Reproducibility
- OS (Windows (MATLAB) & macOS (Python))
- Python Version (3.7) & MATLAB Version (2019b)
- MySQL Version (10.2.33-MariaDB)
- DataJoint Version (0.13.7)
I have a corner case reading certain blobs in DataJoint Python when they were stored with mym in MATLAB. Here is the type of blob stored in the DB, as read in MATLAB:
la = bdata('select protocol_data from bdata.sessions where sessid=889527');
la{1}.crash_comments
ans =
3×1 cell array
{0×0 double}
{0×0 double}
{0×0 double}
As you can see, part of the blob is a 3x1 cell array whose elements are all empty.
When trying to read this data in Python, I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/sg/5bw1t8p11nx09k7kmytnmfb40000gp/T/ipykernel_31219/4072205635.py in <module>
3 session_key = {'sessid': 889527}
4 # session_key = {'sessid': 889664}
----> 5 session_data = (bdata.Sessions & session_key).fetch('protocol_data', as_dict=True)
6 parsed_events = (bdata.ParsedEvents & session_key).fetch(as_dict=True)
~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/fetch.py in __call__(self, offset, limit, order_by, format, as_dict, squeeze, download_path, *attrs)
234 squeeze=squeeze,
235 download_path=download_path,
--> 236 format="array",
237 )
238 if attrs_as_dict:
~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/fetch.py in __call__(self, offset, limit, order_by, format, as_dict, squeeze, download_path, *attrs)
287 for name in heading:
288 # unpack blobs and externals
--> 289 ret[name] = list(map(partial(get, heading[name]), ret[name]))
290 if format == "frame":
291 ret = pandas.DataFrame(ret).set_index(heading.primary_key)
~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/fetch.py in _get(connection, attr, data, squeeze, download_path)
112 squeeze=squeeze,
113 )
--> 114 if attr.is_blob
115 else data
116 )
~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in unpack(blob, squeeze)
619 return blob
620 if blob is not None:
--> 621 return Blob(squeeze=squeeze).unpack(blob)
~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in unpack(self, blob)
127 blob_format = self.read_zero_terminated_string()
128 if blob_format in ("mYm", "dj0"):
--> 129 return self.read_blob(n_bytes=len(self._blob) - self._pos)
130
131 def read_blob(self, n_bytes=None):
~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in read_blob(self, n_bytes)
161 % data_structure_code
162 )
--> 163 v = call()
164 if n_bytes is not None and self._pos - start != n_bytes:
165 raise DataJointError("Blob length check failed! Invalid blob")
~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in read_struct(self)
463 self.read_blob(n_bytes=int(self.read_value())) for _ in range(n_fields)
464 )
--> 465 for __ in range(n_elem)
466 ]
467
~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in <listcomp>(.0)
463 self.read_blob(n_bytes=int(self.read_value())) for _ in range(n_fields)
464 )
--> 465 for __ in range(n_elem)
466 ]
467
~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in <genexpr>(.0)
461 raw_data = [
462 tuple(
--> 463 self.read_blob(n_bytes=int(self.read_value())) for _ in range(n_fields)
464 )
465 for __ in range(n_elem)
~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in read_blob(self, n_bytes)
161 % data_structure_code
162 )
--> 163 v = call()
164 if n_bytes is not None and self._pos - start != n_bytes:
165 raise DataJointError("Blob length check failed! Invalid blob")
~/opt/anaconda3/envs/bl_pipeline_python_env/lib/python3.7/site-packages/datajoint/blob.py in read_cell_array(self)
508 return (
509 self.squeeze(
--> 510 np.array(result).reshape(shape, order="F"), convert_to_scalar=False
511 )
512 ).view(MatCell)
ValueError: cannot reshape array of size 0 into shape (3,1)
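The final error can be reproduced with NumPy alone, without DataJoint. This is a minimal sketch that assumes each `{0×0 double}` cell deserializes to an empty array of shape `(0, 0)`; the variable names are only illustrative.

```python
import numpy as np

# Three empty 0x0 arrays, standing in for the elements of the MATLAB cell array
# (assumption: this is what read_blob returns for each {0x0 double} cell).
result = [np.empty((0, 0)) for _ in range(3)]
shape = (3, 1)

# np.array stacks the equal-shaped elements into one (3, 0, 0) array of size 0
# instead of keeping three separate cells.
stacked = np.array(result)
print(stacked.shape, stacked.size)

try:
    stacked.reshape(shape, order="F")
    reshape_error = None
except ValueError as err:
    reshape_error = str(err)
print(reshape_error)
```

The stacked array has no elements left to distribute over the three cells, which is why the reshape to `(3, 1)` fails.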
I have “patched” the read_cell_array function in blob.py with:
if result.size == 0:
    return (
        self.squeeze(
            np.array(np.empty(shape, dtype=type(result[0]))), convert_to_scalar=False
        )
    ).view(MatCell)
else:
    return (
        self.squeeze(
            np.array(result).reshape(shape, order="F"), convert_to_scalar=False
        )
    ).view(MatCell)
This just adds a branch for the case where the array size is zero (a NumPy array built from empty arrays has size 0). It is probably not the cleanest way to do it.
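To illustrate what the empty branch returns, here is a small sketch. It passes `dtype=object` explicitly, on the assumption that `type(result[0])` in the patch resolves to NumPy's generic object dtype; the `MatCell` view and `self.squeeze` are left out.

```python
import numpy as np

shape = (3, 1)

# Preallocate an object array with the cell array's shape.
# np.empty fills object arrays with None, so no element data is needed.
placeholder = np.empty(shape, dtype=object)
print(placeholder)
```

Every slot holds `None`, which matches the layout shown under Expected Behavior.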
Expected Behavior
Reading this kind of blob should return something like this:
session_data['crash_comments']
MatCell([[None],
[None],
[None]], dtype=object)
Thank you for submitting this @Alvalunasan. I think I understand the problem.
Hi @dimitri-yatsenko, in brodylab this error has resurfaced with the new DataJoint integration. Is it possible to get this merged (or the bug fixed some other way)?
Thank you very much for your help
Hi @dimitri-yatsenko, a note that this is not restricted to old MATLAB versions: it was happening with a recent MATLAB, on data from August 2023. Maybe on newer data too; I haven't checked yet.
Would it be appropriate to merge Alvaro's patch?
@Alvalunasan updates his patch and suggests replacing lines 495-499 with
sizes_array = [x.size for x in result]
sum_sizes = sum(sizes_array)
if n_elem == 0:
    return np.array(np.empty(0)).view(MatCell)
elif sum_sizes == 0:
    return (self.squeeze(np.array(np.empty(shape, dtype=type(result[0]))), convert_to_scalar=False)).view(MatCell)
else:
    return (self.squeeze(np.array(result).reshape(shape, order="F"), convert_to_scalar=False)).view(MatCell)
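The three branches of the updated patch can be tried outside DataJoint with a stand-in function. This is only a sketch: `cells_to_array` is a hypothetical name, it drops `self.squeeze` and the `MatCell` view, it uses `dtype=object` explicitly, and it assumes every element of `result` is a NumPy array.

```python
import numpy as np

def cells_to_array(result, shape):
    """Hypothetical stand-in for the tail of read_cell_array (no squeeze, no MatCell)."""
    n_elem = int(np.prod(shape))
    if n_elem == 0:
        # No cells at all.
        return np.empty(0)
    if sum(x.size for x in result) == 0:
        # Every cell is empty: return an object array of None placeholders.
        return np.empty(shape, dtype=object)
    # At least one cell holds data: same conversion as the original code.
    return np.array(result).reshape(shape, order="F")

empty_cells = cells_to_array([np.empty((0, 0)) for _ in range(3)], (3, 1))
scalar_cells = cells_to_array([np.array(1.0), np.array(2.0), np.array(3.0)], (3, 1))
no_cells = cells_to_array([], (0, 0))
print(empty_cells.shape, scalar_cells.shape, no_cells.size)
```

Cells that mix empty and non-empty elements, or elements of different shapes, are not covered by this version.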
OK, will incorporate ASAP. We are starting to work on a new release. Thanks.
See next comment
@dimitri-yatsenko , @carlosbrody
New corner case for the function (reading cell matrices, not only nx1 vectors):
def read_cell_array(self):
    """deserialize MATLAB cell array"""
    load_as_object = False
    n_dims = self.read_value()
    shape = self.read_value(count=n_dims)
    n_elem = int(np.prod(shape))
    result = [self.read_blob(n_bytes=self.read_value()) for _ in range(n_elem)]
    # If it is a matrix (and not an nx1 vector), load as object
    if np.sum(shape > 1) > 1:
        load_as_object = True
    # Check the size of each element (the vector could contain empty elements)
    if n_elem > 0:
        # If the elements of result are arrays, tuples or lists, load as object (unless all are empty)
        if isinstance(result[0], np.ndarray):
            sizes_array = [x.size for x in result]
            sum_sizes = sum(sizes_array)
            load_as_object = True
        elif isinstance(result[0], (tuple, list)):
            sizes_array = [len(x) for x in result]
            sum_sizes = sum(sizes_array)
            load_as_object = True
        else:
            sum_sizes = n_elem
    # If there are no trials in the array
    if n_elem == 0:
        return np.array(np.empty(0)).view(MatCell)
    # If all trials contain "empty" data
    elif sum_sizes == 0:
        return (self.squeeze(np.array(np.empty(shape, dtype=type(result[0]))), convert_to_scalar=False)).view(MatCell)
    # If some trials contain data and others contain "empty" data
    elif sum_sizes != n_elem or load_as_object:
        return (self.squeeze(np.array(result, dtype='object').reshape(shape, order="F"), convert_to_scalar=False)).view(MatCell)
    # Regular case, all trials contain data
    else:
        return (self.squeeze(np.array(result).reshape(shape, order="F"), convert_to_scalar=False)).view(MatCell)
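For comparison, one way to sidestep the stacking behaviour of `np.array` altogether is to preallocate an object array and fill it one cell at a time. This is a sketch of an alternative, not the patch above and not DataJoint's implementation; `cells_to_object_array` is a hypothetical name, and `self.squeeze` and the `MatCell` view are omitted.

```python
import numpy as np

def cells_to_object_array(result, shape):
    """Hypothetical helper: place each cell into its own slot of an object array."""
    out = np.empty(len(result), dtype=object)
    for i, item in enumerate(result):
        out[i] = item  # elementwise assignment stores the item itself, no stacking
    # Same column-major (Fortran-order) reshape as the code above.
    return out.reshape(shape, order="F")

# A 2x2 cell matrix mixing empty arrays, a numeric array and a string.
cells = [np.empty((0, 0)), np.array([1.0, 2.0]), "text", np.empty((0, 0))]
matrix = cells_to_object_array(cells, (2, 2))
all_empty = cells_to_object_array([np.empty((0, 0)) for _ in range(3)], (3, 1))
print(matrix.shape, all_empty.shape)
```

With this approach the all-empty case keeps the empty arrays in the cells rather than `None`, so it differs from the output shown under Expected Behavior.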
Excellent. I am traveling until next week and will work on this when I return. Thank you so much for this solution.