
Constrained chunk shape estimation

Open CodyCBakerPhD opened this issue 2 years ago • 1 comment

Motivation

fix #995

Potential replacement of #996

This is actually really good timing; the latest stage of Zarr integration in NeuroConv (https://github.com/catalystneuro/neuroconv/pull/569) benefits greatly from having these types of methods become static methods of the iterator (and its children) for broader usage, much as @oruebel suggests in https://github.com/hdmf-dev/hdmf/issues/995#issuecomment-1802509057

@bendichter Anyway, I did some mathematical reduction and refactoring of the suggested approach from #996. It is actually quite similar to the original axis-ratio method; one could imagine a more general algorithm that places 'weight factors' on each axis to determine how strongly to bias the distribution of bytes along it.

In the recently suggested update, these weight factors would all be 1; in the original, they are proportional to the relative length of each axis.
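As a rough illustration of the all-ones weighting, here is a minimal sketch of a hypercube-style estimator (the function name and default byte budget are hypothetical; this is not the PR's actual implementation):

```python
import math

def hypercube_chunk_shape(maxshape, itemsize, target_bytes=10 * 1024**2):
    """Sketch: with weight factor 1 on every axis, aim for equal chunk
    extent per axis, clipped to the full dataset shape."""
    target_elements = max(1, target_bytes // itemsize)
    # Ideal edge length if no axis were shorter than the target edge.
    edge = math.ceil(target_elements ** (1.0 / len(maxshape)))
    return tuple(min(edge, axis_length) for axis_length in maxshape)
```

For example, a `(10**6, 384)` int16 dataset with a ~10 MiB budget would get a roughly square 2290 × 384 chunk rather than one elongated along the time axis.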

This PR also addresses @oruebel's comments by no longer requiring secondary methods

Two other changes:

I strongly suggest keeping the previous method accessible 'just in case' something unexpected arises in the new implementation; a lesson learned from previous bugs in this iterator. The only other fallback would be to force installation of a previous HDMF version, which would be less desirable and could cause other problems (such as incompatibilities) on top of that.

Another lesson learned from the past: always use math.prod for these operations. These methods will eventually be used on shapes with very large values, and np.prod overflows easily and silently (a warning is occasionally emitted, but not reliably).
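To illustrate the silent-overflow hazard (a standalone demonstration, not code from this PR):

```python
import math
import numpy as np

shape = (2**62, 4)  # element counts large enough that the product exceeds int64

# np.prod reduces in a fixed-width integer dtype and wraps silently on overflow.
np_result = int(np.prod(np.array(shape, dtype=np.int64)))  # wraps to 0

# math.prod uses Python's arbitrary-precision integers and stays exact.
exact = math.prod(shape)  # 2**64
```

Here the NumPy reduction wraps around to 0 with no exception, while `math.prod` returns the exact value.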

How to test the behavior?

Existing tests may require updating to match the newly predicted default values

A few new tests should be added covering static usage and showcasing the expected results from each strategy

Checklist

  • [ ] Did you update CHANGELOG.md with your changes?
  • [X] Have you checked our Contributing document?
  • [X] Have you ensured the PR clearly describes the problem and the solution?
  • [X] Is your contribution compliant with our coding style? This can be checked running ruff from the source directory.
  • [X] Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • [X] Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

CodyCBakerPhD avatar Nov 08 '23 23:11 CodyCBakerPhD

> I think you probably want the ratio to be informed by the overall size of the dataset. I.e., for (time, electrodes) where time is much longer than electrodes, then you'd also want the chunks to be larger in the time dimension. (@oruebel https://github.com/hdmf-dev/hdmf/pull/996#issuecomment-1802574509)

@oruebel, this was our original guess, and it is conveniently also the approach h5py uses, but our experience with @magland has shown that it can cause big problems for certain common applications, and at this point I am leaning towards the hypercube approach, as I don't expect it to blow up badly in most cases. The (time, electrode) case is the perfect example: it's actually much more common for users to want many channels over a short time than one channel over a long time, particularly in streaming applications.
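To make the contrast concrete, here is a hedged sketch of the axis-ratio strategy (hypothetical function; not hdmf's actual implementation), which scales every chunk edge by the same factor so long axes dominate the chunk:

```python
import math

def ratio_chunk_shape(shape, itemsize, target_bytes=10 * 1024**2):
    """Sketch: weight factors proportional to axis length, i.e. each chunk
    edge is the axis length scaled down by a common factor."""
    target_elements = max(1, target_bytes // itemsize)
    # Common scale factor so the scaled shape holds ~target_elements.
    scale = (target_elements / math.prod(shape)) ** (1.0 / len(shape))
    return tuple(max(1, min(d, math.floor(d * scale))) for d in shape)
```

For a `(10_000_000, 384)` int16 array with a ~10 MiB budget, this yields a long, narrow chunk (hundreds of thousands of samples by roughly a dozen channels), whereas a hypercube chunk of the same budget would span all 384 channels over a much shorter time window; the latter serves the many-channels-short-time access pattern better.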

> This PR also addresses @oruebel's comments by no longer requiring secondary methods

See the way I separated the function out in my PR. I think the function as-is is a bit too long and should be broken up, but that's more of a style thing.

I'm fine with keeping the old method for now. Let's see how often we or anyone else tends to use it.

And yes, using math.prod is a good idea. I do remember that being a particularly sneaky bug we definitely do not want to reintroduce.

bendichter avatar Nov 09 '23 02:11 bendichter