Slower performance than PIL-SIMD and VIPS on large images
I have been trying to use image as an alternative to speed up image cropping in Python.
I tested this against the Python bindings provided by PIL-SIMD and vips. image + rayon gives great results for mini-batches of small images (e.g. 512 x 512); however, it does not seem to scale to large images (e.g. 4000 x 4000). Here are some detailed results (full repo here):
Gray Scale 4000 x 4000 JPEG Image:
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=0 --use-threading=0 --use-grayscale=1
python crop average over 100 trials : 0.48177589893341066 +/- 0.06522766741342648 sec
rust crop average over 100 trials : 0.5541290354728698 +/- 0.01701503452857914 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=0 --use-threading=1 --use-grayscale=1
python crop average over 100 trials : 0.46666148900985716 +/- 0.04960818629061339 sec
rust crop average over 100 trials : 0.5545140480995179 +/- 0.018513324366932055 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=1 --use-threading=0 --use-grayscale=1
python crop average over 100 trials : 0.4660675573348999 +/- 0.057874179663163314 sec
rust crop average over 100 trials : 0.5588112926483154 +/- 0.02093116503601568 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=1 --use-threading=1 --use-grayscale=1
python crop average over 100 trials : 0.4722147536277771 +/- 0.0485672093840053 sec
rust crop average over 100 trials : 0.5598368859291076 +/- 0.023788753163516176 sec
Best: PIL-SIMD or vips
Gray Scale 4000 x 4000 PNG Image:
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=0 --use-threading=0 --use-grayscale=1
python crop average over 100 trials : 0.48749495029449463 +/- 0.08113528509347652 sec
rust crop average over 100 trials : 0.5558233976364135 +/- 0.016653053525873318 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=0 --use-threading=1 --use-grayscale=1
python crop average over 100 trials : 0.47109495639801025 +/- 0.04871502077173173 sec
rust crop average over 100 trials : 0.5580063557624817 +/- 0.03254635395283061 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=1 --use-threading=0 --use-grayscale=1
python crop average over 100 trials : 0.45097975969314574 +/- 0.05091062232673361 sec
rust crop average over 100 trials : 0.5570903038978576 +/- 0.02757379330167939 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=1 --use-threading=1 --use-grayscale=1
python crop average over 100 trials : 0.46305535078048704 +/- 0.05080726901263223 sec
rust crop average over 100 trials : 0.5532968330383301 +/- 0.024264257976699295 sec
Best: PIL-SIMD or vips
Gray Scale 4000 x 4000 BMP Image:
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=0 --use-threading=0 --use-grayscale=1
python crop average over 100 trials : 0.41575870513916013 +/- 0.2293008182224854 sec
rust crop average over 100 trials : 0.41494714736938476 +/- 0.008218232557933018 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=0 --use-threading=1 --use-grayscale=1
python crop average over 100 trials : 0.40679455041885376 +/- 0.1453958218374439 sec
rust crop average over 100 trials : 0.4185313296318054 +/- 0.01249058083031065 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=1 --use-threading=0 --use-grayscale=1
python crop average over 100 trials : 0.4151870512962341 +/- 0.14371236661889436 sec
rust crop average over 100 trials : 0.4274453210830689 +/- 0.019399714659261984 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=1 --use-threading=1 --use-grayscale=1
python crop average over 100 trials : 0.4173280334472656 +/- 0.15052042382476047 sec
rust crop average over 100 trials : 0.4264633345603943 +/- 0.021683232996469272 sec
Best: all almost equal
Color 512 x 512 PNG Image:
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=0 --use-threading=0
python crop average over 100 trials : 0.1581390118598938 +/- 0.07718986836579676 sec
rust crop average over 100 trials : 0.049066624641418456 +/- 0.0063537388027109275 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=0 --use-threading=1
python crop average over 100 trials : 0.16133457899093628 +/- 0.08104974713972524 sec
rust crop average over 100 trials : 0.04776606798171997 +/- 0.002771486956986403 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=1 --use-threading=0
python crop average over 100 trials : 0.1862180781364441 +/- 0.08092567829068359 sec
rust crop average over 100 trials : 0.048760356903076174 +/- 0.004486362361121834 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=1 --use-threading=1
python crop average over 100 trials : 0.15620680570602416 +/- 0.07046950472514979 sec
rust crop average over 100 trials : 0.0478854775428772 +/- 0.0029215781261751495 sec
Best: parallel_image_crop Rust Library
Color 512 x 512 JPEG Image:
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=0 --use-threading=0
python crop average over 100 trials : 0.2412877368927002 +/- 0.08093420929641125 sec
rust crop average over 100 trials : 0.05020998954772949 +/- 0.0029470025228493205 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=0 --use-threading=1
python crop average over 100 trials : 0.24630839586257935 +/- 0.08689070644503785 sec
rust crop average over 100 trials : 0.053718960285186766 +/- 0.005535220006405835 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=1 --use-threading=0
python crop average over 100 trials : 0.25572638273239134 +/- 0.07592489302786763 sec
rust crop average over 100 trials : 0.056830999851226804 +/- 0.0025679050981689214 sec
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-vips=1 --use-threading=1
python crop average over 100 trials : 0.275831139087677 +/- 0.06285512636507674 sec
rust crop average over 100 trials : 0.05264364004135132 +/- 0.004437379466866233 sec
Best: parallel_image_crop Rust Library
Please include a requirements.txt file for use with pip/virtualenv or similar to facilitate reproduction.
Are you sure you are loading the release version (cargo build --release) of the library? Because if there has been a debug build I am pretty sure that https://github.com/jramapuram/parallel_image_crop/blob/ba0aeca9e0c68fdd99aed7ff8e2fc2747ed66898/benchmarks/test.py#L173 will first encounter the path
../target/debug/libparallel_image_crop.so
before visiting
../target/release/libparallel_image_crop.so
I see the following numbers, using the release build:
512x512
python crop average over 10 trials : 0.16067423820495605 +/- 0.3671143569581757 sec
rust crop average over 10 trials : 0.04678645133972168 +/- 0.0012110288525016243 sec
4000x4000
python crop average over 10 trials : 0.7032721757888794 +/- 0.3654377582986589 sec
rust crop average over 10 trials : 0.7183722257614136 +/- 0.0021783159746312864 sec
@HeroicKatora : thanks for compiling it; let me know if you still need a requirements.txt. Was going to add it after I got things working well.
I did compile a release build (and have hardcoded this in benchmarks.py for commit https://github.com/jramapuram/parallel_image_crop/commit/2467a0cfc0be875f84ac8bf0b54f6ed9c95dc9ac ). As mentioned it does work great for small images! (the image present in assets is only 512x512).
You will have to resize to see the difference:
convert assets/lena.png -resize 4000x4000 assets/lena.png
Here is the workflow:
# let's run crops over a batch of 32 images w/100 trials for 512x512
(base) ➜ parallel_image_crop git:(master) python benchmarks/test.py --batch-size=32 --num-trials=100
python crop average over 100 trials : 0.116404447555542 +/- 0.1595486246983642 sec
rust crop average over 100 trials : 0.09679011821746826 +/- 0.0062470562722498025 sec
# Now we convert it 4000 x 4000 and try the same
(base) ➜ parallel_image_crop git:(master) convert assets/lena.png -resize 4000x4000 assets/lena.png
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100
python crop average over 100 trials : 3.173974087238312 +/- 0.2067425334971192 sec
rust crop average over 100 trials : 4.159008667469025 +/- 0.20634427332646715 sec
See full GIST here: https://gist.github.com/jramapuram/f3a69e5810f56347c470d10f54414c5e
I missed that there is no 4000x4000 image in the repository but figured that all results were in release build. The second set of results that I posted is also with an upscaled image of size 4000x4000. There are several small inefficiencies in the Rust source but none of them dramatic:
- There is no direct reason for Array: you can convert all variants of DynamicImage to a Vec<u8>, or even extract a &[u8] from the DynamicImage. This avoids some unsafe and maybe unsoundness.
- Cropping could be implemented without the copy, but a current error in the interface (an unnecessary 'static lifetime bound on imageops::resize) prevents that.
Still, this gives only ~4% speedup on my machine.
Take this with the grain of salt that is healthy for simple benchmarks, but here is a measurement of load time vs. processing time:
Loading: 399.338101ms, Processing: 3.197122ms
where the first is measured from the start of the function until after image::open, and the second from that point until the end of crop_and_resize. The rest of the execution time should therefore be spent copying back the result?
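A minimal sketch of how such a load/processing split could be measured with the standard library alone. The helper names `timed` and `split_timings` are hypothetical, and the two closures are placeholders for calls such as image::open and crop_and_resize rather than the actual benchmark code:

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `f` once and returns its result together
/// with the elapsed wall-clock time.
pub fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}

/// Times a load phase and a processing phase separately.
/// `load` and `process` are placeholders (assumptions) for the real
/// decoding and crop/resize steps.
pub fn split_timings<I, O>(
    load: impl FnOnce() -> I,
    process: impl FnOnce(I) -> O,
) -> (O, Duration, Duration) {
    let (input, load_time) = timed(load);
    let (output, process_time) = timed(|| process(input));
    (output, load_time, process_time)
}
```

The two returned durations can then be printed with `{:?}` next to each other. If the load phase dominates, as it appears to here, optimizing the crop or the copy can only recover a small part of the total time.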
Fair points, I will look into removing array: you are right, I initially had it as a return value. In the current state it is not necessary.
Regarding load time: do you think lazy loading will solve this problem? Eg: something like pyramid tiled tif? All I'm doing are crop()'s followed by a resize()
Afaik, most decoders do not yet implement an API that would load only part of an image. I can link you more precise issues if you want to track the progress of this, but the interface was not yet clear. This means that open, crop followed by resize is likely about as fast as load_rect and resize.
Edit: This has been requested more intensely in the recent past, so there might be a focus on getting something available soon.
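To illustrate why the crop step itself is cheap once the whole image has been decoded, here is a standalone sketch that crops a rectangle out of a row-major 8-bit grayscale buffer. It is not the image crate's implementation; the function name `crop_gray` and the buffer layout are assumptions made for the example:

```rust
/// Hypothetical helper: copies the `w` x `h` rectangle whose top-left corner
/// is at (`x`, `y`) out of a row-major, 8-bit grayscale buffer.
/// Returns None if the buffer size or the rectangle is inconsistent.
pub fn crop_gray(
    pixels: &[u8],
    width: usize,
    height: usize,
    x: usize,
    y: usize,
    w: usize,
    h: usize,
) -> Option<Vec<u8>> {
    // The buffer must hold exactly width * height bytes.
    if pixels.len() != width.checked_mul(height)? {
        return None;
    }
    // The rectangle must lie fully inside the image.
    if x.checked_add(w)? > width || y.checked_add(h)? > height {
        return None;
    }
    let mut out = Vec::with_capacity(w * h);
    for row in y..y + h {
        // Only w bytes of each selected row are touched.
        let start = row * width + x;
        out.extend_from_slice(&pixels[start..start + w]);
    }
    Some(out)
}
```

The crop copies only w * h bytes, while a decoder without partial loading still has to produce all width * height pixels first. That is consistent with the load time dominating the measurements above.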
I think one of them might be mine :P (i.e. load_rect not working https://github.com/PistonDevelopers/image/issues/802)
Thanks for your help! Any idea why image is so much faster for smaller images though? I think parity with PIL-SIMD is great, but it seems to do much better with smaller images --> wonder if that can be translated to the larger ones. But I guess that will have to do with load_rect and other such impls. Note though: PIL-SIMD does also load the entire image for crops now as opposed to lazy loading.
@HeroicKatora : I removed the Array implementation and simply copy the vec over:
let mut resultant_vec = vec![];
image_paths_vec.into_par_iter().zip(scale_values)
    .zip(x_values.par_iter()).zip(y_values)
    .map(|(((path, scale), x), y)| {
        crop_and_resize(path,
                        *scale, *x, *y,
                        max_img_percent,
                        window_size,
                        window_size).raw_pixels()
    }).collect_into_vec(&mut resultant_vec);

// copy the buffer into the return array
let win_size = (window_size * window_size * chans) as usize;
for (begin, rvec) in izip!((0..length*win_size).step_by(win_size), resultant_vec)
{
    assert!(rvec.len() == win_size, "rvec [{:?}] != window_size [{:?}]",
            rvec.len(), win_size);
    unsafe { ptr::copy(rvec.as_ptr() as *const u8, return_ptr.offset(begin as isize),
                       win_size) };
}
Not sure if the above was what you had in mind? I don't believe I have any control of the static lifetime bound issue you mentioned in point 2 though.
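For reference, the copy loop could also be sketched without unsafe pointer arithmetic, assuming the output is available as a mutable slice. In the FFI setting that slice would first have to be created from return_ptr (for example with std::slice::from_raw_parts_mut, which is itself unsafe). The name `copy_into_flat` and its signature are hypothetical:

```rust
/// Hypothetical helper: copies each per-image buffer into its slot of a flat
/// output slice. Returns an error instead of panicking on a size mismatch.
pub fn copy_into_flat(
    out: &mut [u8],
    buffers: &[Vec<u8>],
    win_size: usize,
) -> Result<(), String> {
    if win_size == 0 {
        return Err("win_size must be non-zero".to_string());
    }
    if out.len() != buffers.len() * win_size {
        return Err(format!(
            "output length {} != {} buffers * window size {}",
            out.len(),
            buffers.len(),
            win_size
        ));
    }
    // Each chunk of `win_size` bytes is the destination slot of one image.
    for (slot, buf) in out.chunks_exact_mut(win_size).zip(buffers) {
        if buf.len() != win_size {
            return Err(format!(
                "buffer length {} != window size {}",
                buf.len(),
                win_size
            ));
        }
        slot.copy_from_slice(buf);
    }
    Ok(())
}
```

Slicing with chunks_exact_mut keeps the bounds checks in safe code, so the only unsafe step left would be the single conversion of the raw pointer into a slice.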
Results:
Listed below is a comparison against vips. [Note: vips allows sequential reading (similar to read_scanline), which is why cropping should be faster on average; I think this is a good baseline target for image.] vips is currently quite a bit faster:
# test over 4000 x 4000 b/w image
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-grayscale --use-vips
python crop average over 100 trials : 0.7368819999694824 +/- 0.20126972433300294 sec
rust crop average over 100 trials : 1.3322323775291443 +/- 0.06476191433811372 sec
Listed below is a comparison against PIL-SIMD (this is much closer):
# test over 4000 x 4000 b/w image
(base) ➜ parallel_image_crop git:(master) ✗ python benchmarks/test.py --batch-size=32 --num-trials=100 --use-grayscale
python crop average over 100 trials : 1.0117538738250733 +/- 0.17556666522802858 sec
rust crop average over 100 trials : 1.2921310329437257 +/- 0.044052204986870375 sec
Yes, issue 2 needs to be resolved in this crate and should be published in the next version (might even get a minor version in, it simply relaxes a constraint so no breaking change). I can't promise anything for the speed issue but I will take a look and maybe we can get it to at least the PIL-SIMD baseline.
It would be really nice if image could match or beat pillow-simd in benchmarks. I'm almost exclusively having to use Python for image processing because pillow-simd beats other libraries by a large margin.
I'm interested here too -- we are evaluating Rust image to interface with kornia and load directly into tensors. My idea is to wrap image-rs with https://github.com/dmlc/dlpack to adopt our new kornia.Image API /cc @carlosb1 @strasdat