One `imread` to rule them all
A lot of people have put a lot of effort into imread lately. This is great, and it's really helped. However, we've still got a way to go.
Here are the four major areas where I see problems popping up:
1. Read image data into Dask arrays accurately. We need more simple test cases here. Bug report: https://github.com/dask/dask-image/issues/220
2. Reduce confusion. Currently, there are multiple implementations of a dask `imread` function. The two most easily confused are `dask_image.imread.imread()` and `dask.array.image.imread()`. We need to figure out which is best, and only use that one.
3. Read data in fast. For that, we'll need to have some proper benchmarks, and run them routinely as part of the CI. This will help us decide (2) above. Previous discussion:
    - Imread performance issue https://github.com/dask/dask-image/issues/181
    - Getting movie files into Dask efficiently https://github.com/dask/dask-image/issues/134
4. Process the image data fast, too. For that to happen, we need smart default choices for how we chunk image data in dask arrays. Jackson Maxfield Brown describes the problem well in this short video here. (A minimal illustration follows this list.)
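To make item 4 concrete, here is a minimal sketch (the paths and chunk sizes are placeholders, not recommendations): the per-frame chunks you get from reading are often not the chunks you want for processing, so today you end up rechunking by hand.

```python
import dask_image.imread
import dask_image.ndfilters

# dask_image.imread.imread defaults to nframes=1, i.e. one frame per chunk.
stack = dask_image.imread.imread("frames/*.tif")

# Many per-frame chunks are inefficient for filtering; rechunk to, say,
# 10 frames per chunk before processing (the size here is arbitrary).
stack = stack.rechunk({0: 10})
smoothed = dask_image.ndfilters.gaussian_filter(stack, sigma=1)
```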
The first step is getting a benchmark script going.
We need to:
- [ ] Report benchmark results for `dask_image.imread.imread()` and `dask.array.image.imread()` (for an apples-to-apples comparison, you might need to explicitly pass `pims.open` as a keyword argument to `dask.array.image.imread()`); a rough timing sketch follows this list
- [ ] Add benchmarking to run on our CI
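As a starting point, a rough timing sketch for the first checkbox might look like this (the data path and repeat count are placeholders, not a finished benchmark):

```python
import timeit

import dask.array.image
import dask_image.imread


def time_read(build, label, repeat=3):
    # Compute the full array each time, so both graph construction and
    # the actual reading are included in the measurement.
    t = min(timeit.repeat(lambda: build().compute(), number=1, repeat=repeat))
    print(f"{label}: {t:.3f} s")


time_read(lambda: dask_image.imread.imread("data/*.tif"),
          "dask_image.imread.imread")
time_read(lambda: dask.array.image.imread("data/*.tif"),
          "dask.array.image.imread")
# For the apples-to-apples variant mentioned in the checklist, one could
# also try dask.array.image.imread("data/*.tif", imread=pims.open),
# though a later comment in this thread notes that option may not work
# out of the box.
```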
> Read data in fast. For that, we'll need to have some proper benchmarks, and run them routinely as part of the CI. This will help us decide (2) above. Previous discussion:
> - Imread performance issue #181
> - Getting movie files into Dask efficiently #134
Highly recommend using `asv`. We use it on aicsimageio to get our pure reading benchmarks. After working on the Dask Summit presentation, I also added the example from my slides as a benchmark suite and a "LibCompareSuite" to monitor aicsimageio and dask-image performance:
- our benchmark code for general IO suites
- our benchmark code for lib comparison suites
- the CI setup for our benchmarks (we run it as a part of our doc building on push to `main`)
- the produced benchmarks webpage
Note that because I just changed the benchmark parameters, a lot of the visualizations were reset, but it does show the benchmarks for the most recent commit, basically as scatter plots. As more commits are added with the same benchmark configuration, they will show as a timeseries.
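For illustration, a minimal `asv` suite for the comparison discussed above might look something like this (the file name, data path, and parameters are hypothetical):

```python
# benchmarks/benchmark_imread.py -- sketch of an asv benchmark suite.
import dask.array.image
import dask_image.imread


class ImreadSuite:
    """Time graph construction and full reads for both imread variants."""

    params = ["dask_image", "dask_array"]
    param_names = ["implementation"]

    def setup(self, implementation):
        # Hypothetical test data generated or downloaded ahead of time.
        self.pattern = "data/frames_*.tif"
        self.readers = {
            "dask_image": dask_image.imread.imread,
            "dask_array": dask.array.image.imread,
        }

    def time_graph_construction(self, implementation):
        # Only builds the dask graph; no pixel data is read.
        self.readers[implementation](self.pattern)

    def time_full_read(self, implementation):
        # Builds the graph and reads all pixel data.
        self.readers[implementation](self.pattern).compute()
```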
> Report benchmark results for `dask_image.imread.imread()` and `dask.array.image.imread()` (for an apples-to-apples comparison, you might need to explicitly pass `pims.open` as a keyword argument to `dask.array.image.imread()`)
I tried doing the above during my benchmark setup on aicsimageio and I couldn't get the `pims.open` option working.
For the default case, I felt it was an unfair comparison. `dask.array.image.imread` reads a whole file per chunk and, from my quick look, is meant for glob reading of files (i.e. using `dask.array.image.imread` will result in each chunk of the dask array being a whole file read of one file in the glob). `dask-image` can do glob reading as well, but I still feel like the most common API interaction is reading a massive single file. (Probably just my usage bias, though.)
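You can see this directly in the chunk structure (the glob here is a hypothetical placeholder):

```python
import dask.array.image

arr = dask.array.image.imread("movies/*.tif")
# With n matched files, the leading axis has n chunks of size 1 --
# i.e. each chunk holds one whole file's contents.
print(arr.chunks)
```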
Happy to help and PR into dask-image where I can. At the very least, my talk is now basically built into the library on every commit :smile:
That's a strong recommendation for asv! Very helpful to have those links and implementation details.
> For the default case, I felt it was an unfair comparison. `dask.array.image.imread` reads a whole file per chunk and, from my quick look, is meant for glob reading of files (i.e. using `dask.array.image.imread` will result in each chunk of the dask array being a whole file read of one file in the glob). `dask-image` can do glob reading as well, but I still feel like the most common API interaction is reading a massive single file. (Probably just my usage bias, though.)
I'm not sure whether it's the most common, but it's definitely common enough that we need good performance.
> Report benchmark results for `dask_image.imread.imread()` and `dask.array.image.imread()` (for an apples-to-apples comparison, you might need to explicitly pass `pims.open` as a keyword argument to `dask.array.image.imread()`)
Wanted to link here a quick performance comparison we had done in the past: https://github.com/dask/dask-image/issues/194#issuecomment-791987698. The conclusion had been that `dask.array.image` is significantly faster than `dask_image.imread`, both when using `skimage.io` and `pims`. The only advantage of the latter is dask graph creation time when input image files are large (as `dask.array.image` reads in a file to determine image shape, while `dask_image.imread` uses `pims` for this).
> The only advantage of the latter is dask graph creation time when input image files are large (as `dask.array.image` reads in a file to determine image shape, while `dask_image.imread` uses `pims` for this).
Presumably we could add this behaviour to dask.array.image if that's useful.
> Presumably we could add this behaviour to `dask.array.image` if that's useful.
Definitely. Wouldn't currently think it's too critical though.
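For concreteness, a hedged sketch of what that behaviour might look like, assuming uniform files and that `skimage.io.imread` returns `(nframes,) + frame_shape` per file (the helper name is hypothetical, not an existing API):

```python
import dask
import dask.array as da
import numpy as np
import pims
from skimage.io import imread as sk_imread


def imread_with_cheap_metadata(filenames):
    # Get shape/dtype from pims metadata instead of reading the first
    # file's full pixel data, as dask.array.image currently does.
    # Assumes every file matches the first one.
    with pims.open(filenames[0]) as frames:
        shape = (len(frames),) + frames.frame_shape
        dtype = np.dtype(frames.pixel_type)
    delayed_reads = [dask.delayed(sk_imread)(fn) for fn in filenames]
    return da.stack(
        [da.from_delayed(d, shape=shape, dtype=dtype) for d in delayed_reads]
    )
```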
@jni says that scikit-image also has a good guide to asv. I think this is it here: https://scikit-image.org/docs/dev/contribute.html#benchmarks
One big disadvantage of `dask.array.image.imread` is poor chunking behaviour. It looks like it makes a single chunk for every filename on disk. This is not great for movie files or multi-slice TIFFs, etc., where you probably don't want to load the whole movie file into RAM.
See https://github.com/dask/dask-image/issues/262#issuecomment-1125063820
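For reference, `dask_image.imread.imread`'s `nframes` argument already addresses this on its side (the filename and chunk size below are placeholders):

```python
import dask_image.imread

movie = dask_image.imread.imread("movie.tif", nframes=100)
# Chunks of 100 frames along the leading axis, instead of one chunk
# holding the entire file.
print(movie.chunks)
```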
Yeah, this comes up with large multipage TIFFs. They can be kind of movie-like.
Wonder if we should just make the move to using ImageIO here once PR ( https://github.com/imageio/imageio/pull/739 ) is in? It's hard supporting all of the different file formats/use cases out there. Maybe a better separation of concerns would improve the user experience.
Edit: Also broadly related ( https://github.com/dask/dask/issues/9049 )
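If we went that route, a lazily chunked ImageIO-backed reader might look roughly like this sketch (the wrapper is hypothetical, not an existing dask-image or ImageIO API; whether `improps` avoids a full decode depends on the plugin):

```python
import dask
import dask.array as da
import imageio.v3 as iio


def imread_imageio(path):
    # Use imageio's v3 interface to get shape/dtype metadata, then defer
    # the per-frame pixel reads. index=... treats the file as a stack.
    props = iio.improps(path, index=...)
    frame_shape = props.shape[1:]
    frames = [
        da.from_delayed(
            dask.delayed(iio.imread)(path, index=i),
            shape=frame_shape,
            dtype=props.dtype,
        )
        for i in range(props.shape[0])
    ]
    return da.stack(frames)
```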