One `imread` to rule them all
A lot of people have put a lot of effort into imread lately. This is great, and it's really helped. However, we've still got a way to go.
Here are the four major areas where I see problems popping up:
1. Read image data into Dask arrays accurately. We need more simple test cases here. Bug report: https://github.com/dask/dask-image/issues/220
2. Reduce confusion. Currently, there are multiple implementations of a dask `imread` function. The two most easily confused are `dask_image.imread.imread()` and `dask.array.image.imread()`. We need to figure out which is best, and only use that one.
3. Read data in fast. For that, we'll need to have some proper benchmarks, and run them routinely as part of the CI. This will help us decide (2) above. Previous discussion:
    - Imread performance issue https://github.com/dask/dask-image/issues/181
    - Getting movie files into Dask efficiently https://github.com/dask/dask-image/issues/134
4. Process the image data fast, too. For that to happen, we need smart default choices for how we chunk image data in dask arrays. Jackson Maxfield Brown describes the problem well in this short video here. (A minimal illustration follows this list.)
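To make item 4 concrete, here is a minimal sketch (the paths and chunk sizes are placeholders, not recommendations): the per-frame chunks you get from reading are often not the chunks you want for processing, so today you end up rechunking by hand.

```python
import dask_image.imread
import dask_image.ndfilters

# dask_image.imread.imread defaults to nframes=1, i.e. one frame per chunk.
stack = dask_image.imread.imread("frames/*.tif")

# Many per-frame chunks are inefficient for filtering; rechunk to, say,
# 10 frames per chunk before processing (the size here is arbitrary).
stack = stack.rechunk({0: 10})
smoothed = dask_image.ndfilters.gaussian_filter(stack, sigma=1)
```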
The first step is getting a benchmark script going.
We need to:
- [ ] Report benchmark results for `dask_image.imread.imread()` and `dask.array.image.imread()` (for an apples-to-apples comparison, you might need to explicitly pass `pims.open` as a keyword argument to `dask.array.image.imread()`); a rough timing sketch follows this list
- [ ] Add benchmarking to run on our CI
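As a starting point, a rough timing sketch for the first checkbox might look like this (the data path and repeat count are placeholders, not a finished benchmark):

```python
import timeit

import dask.array.image
import dask_image.imread


def time_read(build, label, repeat=3):
    # Compute the full array each time, so both graph construction and
    # the actual reading are included in the measurement.
    t = min(timeit.repeat(lambda: build().compute(), number=1, repeat=repeat))
    print(f"{label}: {t:.3f} s")


time_read(lambda: dask_image.imread.imread("data/*.tif"),
          "dask_image.imread.imread")
time_read(lambda: dask.array.image.imread("data/*.tif"),
          "dask.array.image.imread")
# For the apples-to-apples variant mentioned in the checklist, one could
# also try dask.array.image.imread("data/*.tif", imread=pims.open),
# though a later comment in this thread notes that option may not work
# out of the box.
```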
> Read data in fast. For that, we'll need to have some proper benchmarks, and run them routinely as part of the CI. This will help us decide (2) above. Previous discussion:
> - Imread performance issue #181
> - Getting movie files into Dask efficiently #134
Highly recommend using `asv`. We use it on aicsimageio to get our pure reading benchmarks. After working on the Dask Summit presentation, I also added the example from my slides as a benchmark suite and a "LibCompareSuite" to monitor aicsimageio and dask-image performance:
- our benchmark code for general IO suites
- our benchmark code for lib comparison suites
- the CI setup for our benchmarks (we run it as a part of our doc building on push to `main`)
- the produced benchmarks webpage
Note that because I just changed the benchmark parameters, a lot of the visualizations were reset, but it does show the benchmarks for the most recent commit, basically as scatter plots. As more commits are added with the same benchmark configuration, they will show as a timeseries.
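For illustration, a minimal `asv` suite for the comparison discussed above might look something like this (the file name, data path, and parameters are hypothetical):

```python
# benchmarks/benchmark_imread.py -- sketch of an asv benchmark suite.
import dask.array.image
import dask_image.imread


class ImreadSuite:
    """Time graph construction and full reads for both imread variants."""

    params = ["dask_image", "dask_array"]
    param_names = ["implementation"]

    def setup(self, implementation):
        # Hypothetical test data generated or downloaded ahead of time.
        self.pattern = "data/frames_*.tif"
        self.readers = {
            "dask_image": dask_image.imread.imread,
            "dask_array": dask.array.image.imread,
        }

    def time_graph_construction(self, implementation):
        # Only builds the dask graph; no pixel data is read.
        self.readers[implementation](self.pattern)

    def time_full_read(self, implementation):
        # Builds the graph and reads all pixel data.
        self.readers[implementation](self.pattern).compute()
```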
> Report benchmark results for `dask_image.imread.imread()` and `dask.array.image.imread()` (for an apples-to-apples comparison, you might need to explicitly pass `pims.open` as a keyword argument to `dask.array.image.imread()`)
I tried doing the above during my benchmark setup on aicsimageio and I couldn't get the `pims.open` option working.
For the default case, I felt it was an unfair comparison. `dask.array.image.imread` reads a whole file per chunk and, from my quick look, is meant for glob reading of files (i.e. using `dask.array.image.imread` will result in each chunk of the dask array being a whole file read of one file in the glob). `dask-image` can do glob reading as well, but I still feel like the most common API interaction is reading a massive single file. (Probably just my usage bias, though.)
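You can see this directly in the chunk structure (the glob here is a hypothetical placeholder):

```python
import dask.array.image

arr = dask.array.image.imread("movies/*.tif")
# With n matched files, the leading axis has n chunks of size 1 --
# i.e. each chunk holds one whole file's contents.
print(arr.chunks)
```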
Happy to help and PR into dask-image where I can. At the very least, my talk is now basically built into the library on every commit :smile:
That's a strong recommendation for asv! Very helpful to have those links and implementation details.
> For the default case, I felt it was an unfair comparison. `dask.array.image.imread` reads a whole file per chunk and, from my quick look, is meant for glob reading of files (i.e. using `dask.array.image.imread` will result in each chunk of the dask array being a whole file read of one file in the glob). `dask-image` can do glob reading as well, but I still feel like the most common API interaction is reading a massive single file. (Probably just my usage bias, though.)
I'm not sure whether it's the most common, but it's definitely common enough that we need good performance.
> Report benchmark results for `dask_image.imread.imread()` and `dask.array.image.imread()` (for an apples-to-apples comparison, you might need to explicitly pass `pims.open` as a keyword argument to `dask.array.image.imread()`)
Wanted to link here a quick performance comparison we had done in the past: https://github.com/dask/dask-image/issues/194#issuecomment-791987698. The conclusion had been that `dask.array.image` is significantly faster than `dask_image.imread`, both when using `skimage.io` and `pims`. The only advantage of the latter is dask graph creation time when input image files are large (as `dask.array.image` reads in a file to determine image shape, while `dask_image.imread` uses `pims` for this).
> The only advantage of the latter is dask graph creation time when input image files are large (as `dask.array.image` reads in a file to determine image shape, while `dask_image.imread` uses `pims` for this).
Presumably we could add this behaviour to dask.array.image if that's useful.
> Presumably we could add this behaviour to `dask.array.image` if that's useful.
Definitely. Wouldn't currently think it's too critical though.
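For concreteness, a hedged sketch of what that behaviour might look like, assuming uniform files and that `skimage.io.imread` returns `(nframes,) + frame_shape` per file (the helper name is hypothetical, not an existing API):

```python
import dask
import dask.array as da
import numpy as np
import pims
from skimage.io import imread as sk_imread


def imread_with_cheap_metadata(filenames):
    # Get shape/dtype from pims metadata instead of reading the first
    # file's full pixel data, as dask.array.image currently does.
    # Assumes every file matches the first one.
    with pims.open(filenames[0]) as frames:
        shape = (len(frames),) + frames.frame_shape
        dtype = np.dtype(frames.pixel_type)
    delayed_reads = [dask.delayed(sk_imread)(fn) for fn in filenames]
    return da.stack(
        [da.from_delayed(d, shape=shape, dtype=dtype) for d in delayed_reads]
    )
```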
@jni says that scikit-image also has a good guide to asv. I think this is it here: https://scikit-image.org/docs/dev/contribute.html#benchmarks
One big disadvantage of `dask.array.image.imread` is poor chunking behaviour. It looks like it makes a single chunk for every filename on disk. This is not great for movie files or multi-slice TIFFs, etc., where you probably don't want to load the whole movie file into RAM.
See https://github.com/dask/dask-image/issues/262#issuecomment-1125063820
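For reference, `dask_image.imread.imread`'s `nframes` argument already addresses this on its side (the filename and chunk size below are placeholders):

```python
import dask_image.imread

movie = dask_image.imread.imread("movie.tif", nframes=100)
# Chunks of 100 frames along the leading axis, instead of one chunk
# holding the entire file.
print(movie.chunks)
```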
Yeah, this comes up with large multipage TIFFs. They can be kind of movie-like.
Wonder if we should just make the move to using ImageIO here once PR ( https://github.com/imageio/imageio/pull/739 ) is in? It's hard supporting all of the different file formats/use cases out there. Maybe a better separation of concerns would improve the user experience.
Edit: Also broadly related ( https://github.com/dask/dask/issues/9049 )
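If we went that route, a lazily chunked ImageIO-backed reader might look roughly like this sketch (the wrapper is hypothetical, not an existing dask-image or ImageIO API; whether `improps` avoids a full decode depends on the plugin):

```python
import dask
import dask.array as da
import imageio.v3 as iio


def imread_imageio(path):
    # Use imageio's v3 interface to get shape/dtype metadata, then defer
    # the per-frame pixel reads. index=... treats the file as a stack.
    props = iio.improps(path, index=...)
    frame_shape = props.shape[1:]
    frames = [
        da.from_delayed(
            dask.delayed(iio.imread)(path, index=i),
            shape=frame_shape,
            dtype=props.dtype,
        )
        for i in range(props.shape[0])
    ]
    return da.stack(frames)
```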