Speed up prediction by chunks
We need to rethink how the final prediction is reconstructed during inference by chunks. Right now we use a lot of disk (and time) because of the mask that has to be built and divided to account for the overlap/padding, as here.
Hi @danifranco, I hope you are doing well. I will look at this issue next week and send an associated PR!
Hi @ClementCaporal, glad to hear from you! I was also planning to change how the inference is done and speed up the process by incorporating a PyTorch generator. What do you think if I modify that first and then you rethink this disk-related problem?
It depends on whether the inference with the PyTorch generator really changes this cropping part (I suppose not?). If it does, I'd better wait!
https://github.com/BiaPyX/BiaPy/blob/f872e1fd74fe55b66c694ef0289bb07495265c5f/biapy/engine/base_workflow.py#L1127
And regarding the inference structure, how do you think it will compare with https://thejacksonlaboratory.github.io/zarrdataset/examples/advanced_example_pytorch_inference.html ?
Yeah, I think that will be changed too. I have an idea in mind: let me cook for two weeks or so and see if I can have a first version of the generator. Then we can both build upon it, correcting bugs that may appear and possibly including zarrdataset to better handle Zarrs (not sure about this last part, as I would like to keep support for large H5 files too). What do you think?
Can't wait to taste what you've cooked! See you in two weeks
+1 to not drop .h5 support
TLDR: This new way of doing the "by chunks" inference should be faster than the one we had, and it does not require a mask to divide the final Zarr.
Long explanation: I have just finished the changes!! I completely redid the by-chunks inference and now it works like this:
- For each test sample:
  - A data loader (`chunked_test_pair_data_generator`) is created, which is in charge of sharding the Zarr/H5 data. It takes into account the number of workers plus the number of GPUs. Using this data loader is the key difference from our old implementation, as it should speed up the process thanks to the underlying caching/prefetching it performs.
  - Each chunk covers a part of the original data plus a padding margin to avoid border effects. In my experience this is enough and overlapping is not necessary; the overlap was also what forced us to create the division mask.
  - Each chunk is then processed by the model.
  - The prediction is post-processed by the specific workflow at hand via the `after_one_patch_prediction_by_chunks` function. In the old implementation this was done in a subsequent step, after the whole image had been predicted. Now, with the generator, this part should be faster (see the sketch below).
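To make the flow concrete, here is a minimal, serial sketch of the order of operations (shard -> pad -> predict -> trim padding -> write back). This is not BiaPy's actual generator: `iter_padded_chunks` and `predict_by_chunks` are hypothetical names, it assumes a 3D single-channel volume, and in BiaPy the sharding is done by the data loader across workers/GPUs.

```python
import numpy as np
import torch


def iter_padded_chunks(volume, patch_size, padding):
    """Yield (padded_patch, target_slice) pairs covering a 3D volume."""
    steps = [range(0, s, p) for s, p in zip(volume.shape, patch_size)]
    for z0 in steps[0]:
        for y0 in steps[1]:
            for x0 in steps[2]:
                sl = tuple(slice(o, min(o + p, s))
                           for o, p, s in zip((z0, y0, x0), patch_size, volume.shape))
                core = volume[sl]
                # Pad with a margin so the model never predicts on a hard border.
                padded = np.pad(core, [(p, p) for p in padding], mode="edge")
                yield padded, sl


def predict_by_chunks(volume, model, patch_size=(26, 128, 128), padding=(4, 16, 16)):
    """Predict chunk by chunk, trimming the padding before writing each result back."""
    out = np.zeros(volume.shape, dtype=np.float32)
    model.eval()
    with torch.no_grad():
        for padded, sl in iter_padded_chunks(volume, patch_size, padding):
            x = torch.from_numpy(padded[None, None].astype(np.float32))  # (N, C, Z, Y, X)
            pred = model(x)[0, 0].cpu().numpy()
            pz, py, px = padding
            trimmed = pred[pz:pred.shape[0] - pz,
                           py:pred.shape[1] - py,
                           px:pred.shape[2] - px]
            out[sl] = trimmed  # trimmed has exactly the core patch shape
            # In BiaPy, the workflow-specific after_one_patch_prediction_by_chunks step
            # also runs per patch around here (see the discussion below about whether it
            # sees the padded or the trimmed patch).
    return out


# e.g. predict_by_chunks(np.zeros((52, 256, 256), np.float32), torch.nn.Identity())
```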
Notes:
Now a Zarr is created per rank (GPU) first, and then the main rank gathers them all into a single one. I tried several times to avoid that gathering step by writing directly into the final Zarr, i.e. not creating one per rank, but there is ALWAYS data loss. The old Zarr docs say that if the data is chunked with no overlapping chunks, as is the case here, it should be possible to write to the Zarr simultaneously. They also provide some synchronization through files/threads. All options lead to data loss.
I tried using the latest Zarr version (3.0.6), but this parallel writing is not documented there yet, and I faced several errors when adapting the code to the new Zarr API. I think they are still solving issues from the migration from Zarr 2 to Zarr 3; maybe we need to wait until they document it so we can try again.
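For illustration only, here is a rough Zarr 2-style sketch of the "one Zarr per rank, then a sequential gather on the main rank" pattern described above. The function names and the rank-to-slice bookkeeping (`owned_slices`, `slices_per_rank`) are hypothetical; in BiaPy the loader decides which chunks each rank owns.

```python
import zarr


def write_rank_zarr(rank, shape, chunks, owned_slices, patches, tmp_dir="preds"):
    """Each rank writes only the chunks it predicted into its own temporary store."""
    z = zarr.open(f"{tmp_dir}/rank_{rank}.zarr", mode="w",
                  shape=shape, chunks=chunks, dtype="float32")
    for sl, patch in zip(owned_slices, patches):
        z[sl] = patch


def gather_into_final(world_size, shape, chunks, slices_per_rank,
                      tmp_dir="preds", out_path="prediction.zarr"):
    """Run on the main rank only: a sequential copy avoids the concurrent-write data loss."""
    final = zarr.open(out_path, mode="w", shape=shape, chunks=chunks, dtype="float32")
    for rank in range(world_size):
        part = zarr.open(f"{tmp_dir}/rank_{rank}.zarr", mode="r")
        for sl in slices_per_rank[rank]:
            final[sl] = part[sl]
    return final
```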
On the other hand, you can now filter out regions of the test data based on some statistics. I think you should try it to reduce the number of patches that are processed, e.g. to discard all background patches. Find the variables here: https://github.com/BiaPyX/BiaPy/blob/483c98ec9b41dbc4f97fdbb56df6a7226b5fa837/biapy/config/config.py#L538 . I think it is worth trying all of this on a medium-sized file first, so you can tune the filtering, and then apply the model to the whole huge image(s).
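As a toy illustration only (not BiaPy's filtering code, whose exact properties and thresholds are set through the config variables linked above), the idea is simply to compute a cheap statistic per patch and skip the ones that look like background before they reach the GPU:

```python
import numpy as np


def keep_patch(patch: np.ndarray, mean_threshold: float = 0.05) -> bool:
    """Hypothetical filter: discard patches whose mean intensity says 'background'."""
    return float(patch.mean()) > mean_threshold


# e.g. patches_to_predict = [p for p in patches if keep_patch(p)]
```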
Let me know what you think!
Hi @danifranco, (you cooked for exactly 2 weeks, ahah)
Fantastic, you solved many problems with those changes. I will try it when I have time to measure the effect on runtime with our dataset. I will also check the border effect of padding without overlapping.
I will check the code, but in case it is simpler for you to answer directly:
- When the point detection is done subsequently on a patch, has the padding already been removed from the patch?
- Is the detection conducted by another thread? If so, can we choose the number of workers for the inference / detection?
- Sample filtering will be super useful. Does it run on a specific Zarr layer? (i.e., can I choose to check layer 2 to skip a batch of layer 0 patches?)
Thanks a lot for the update!
Hi,
Ofc, I'm a man of my word! ;)
> When the point detection is done subsequently on a patch, has the padding already been removed from the patch?
Yeah, the padding is already removed by that time. See where `after_one_patch_prediction_by_chunks` is called and how the padding is removed just above it, here: https://github.com/BiaPyX/BiaPy/blob/e19664f47027b0741d932c749538c378c23d0424/biapy/engine/base_workflow.py#L1381
> Is the detection conducted by another thread? If so, can we choose the number of workers for the inference / detection?
The loader gives a patch to each GPU/rank. Each rank then predicts and after that processes the detection on that patch (without padding). So this is done by each rank, which means it is always parallelized. There are no longer two separate steps; everything is done together.
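If it helps to picture it, the sharding across ranks could look like the following round-robin toy sketch. This is hypothetical code, not BiaPy's loader; `rank` and `world_size` are the usual torch.distributed quantities.

```python
def chunks_for_rank(num_chunks: int, rank: int, world_size: int) -> list[int]:
    """Round-robin assignment: rank r handles chunks r, r + world_size, r + 2*world_size, ..."""
    return list(range(rank, num_chunks, world_size))


# Each rank then predicts and runs the detection on its own subset, in parallel.
```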
> Sample filtering will be super useful. Does it run on a specific Zarr layer? (i.e., can I choose to check layer 2 to skip a batch of layer 0 patches?)
I'm not understanding this very well. Do you mean whether it can work on specific channels of the data? It cannot; each patch is processed entirely to determine whether it is discarded or not.
> Yeah, the padding is already removed by that time. See where `after_one_patch_prediction_by_chunks` is called and how the padding is removed just above it, here:
I would argue that the padding should be kept so the detection can run smoothly, because I think what we discussed here is still valid: https://github.com/BiaPyX/BiaPy/issues/50
> I'm not understanding this very well. Do you mean whether it can work on specific channels of the data? It cannot; each patch is processed entirely to determine whether it is discarded or not.
In my experience, loading patches can be the time bottleneck, so the fewer patches we load the better. When I wrote "layer" I was thinking of the Zarr resolution level: loading a patch from a lower-resolution Zarr (`my_zarr/1` vs `my_zarr/2` in OME-Zarr) would allow me to skip patches faster --> faster inference!
> a Zarr is created per rank (GPU) first, and then the main rank gathers them all into a single one.
I could also add a PR so that the inference is not written at all? The user workflow (at least for me) could be:
- check what the inference looks like on a subpart of the data
- run it on the whole dataset without saving the inference
> I would argue that the padding should be kept so the detection can run smoothly, because I think what we discussed here is still valid: #50
I don't think this problem still applies. Try with the implementation we have now, and if the problem is still there, then we can keep the padding before processing.
> In my experience, loading patches can be the time bottleneck, so the fewer patches we load the better.
Great, because the filters are applied before feeding the patches to the GPUs, so there will be a speed-up here too.
> When I wrote "layer" I was thinking of the Zarr resolution level: loading a patch from a lower-resolution Zarr (`my_zarr/1` vs `my_zarr/2` in OME-Zarr) would allow me to skip patches faster --> faster inference!
Oh, I see. You can decide where within the Zarr the data is read from by enabling `DATA.TEST.INPUT_ZARR_MULTIPLE_DATA` and setting the path with `DATA.TEST.INPUT_ZARR_MULTIPLE_DATA_RAW_PATH`. Find these variables in config.py for more details.
> I could also add a PR so that the inference is not written at all? The user workflow (at least for me) could be:
> - check what the inference looks like on a subpart of the data
> - run it on the whole dataset without saving the inference
I'm not getting this. If you want to just check how the inference is done, you can try with a small Zarr image.
Thank you for your answers!
> I could also add a PR so that the inference is not written at all? The user workflow (at least for me) could be: check what the inference looks like on a subpart of the data; run it on the whole dataset without saving the inference.
>
> I'm not getting this. If you want to just check how the inference is done, you can try with a small Zarr image.
If I understand correctly, the new detection workflow is:
load -> inference per patch + padding -> detect per patch -> save inference in .zarr (per rank) and detection .csv (per patch)
I think it could be useful to add an option to skip saving the .zarr and keep only the .csv.
> load -> inference per patch + padding -> detect per patch -> save inference in .zarr (per rank) and detection .csv (per patch)
There is no .csv per patch now, only the final one is generated.
Now I'm understanding you. I get your point about avoiding the creation of the Zarr parts and the final Zarr. Feel free to add a PR, or give me a few days to do it.
I hope to propose something by Tuesday, once I have checked the padding in detection.
I tried your new implementation on our dataset.
For ~50 GB it took 47 minutes against 1h10 before, so a cool improvement with your new version! (I don't know if this improvement scales linearly, but it is still a good sign.) Thank you, we will gain several hours on larger datasets!
Concerning the patch size:
First test with the new implementation, patch size = (26, 128, 128)
Second test with the new implementation, patch size = (64, 256, 256)
Previously, with the padding also kept for the detection, before your fast-loader implementation:
Discussion
Thus, I think we have (at least in my understanding of the tests above) the same problem detailed in #50 again. The theoretical patch border/cut corresponds to the detection peak in the histogram, and it scales with the patch size used for inference. You can check my code to generate the plots here: https://gist.github.com/ClementCaporal/d9eddd4e15c8ee47250cb3aa8699f243
I was really convinced the problem would have been fixed by now... I added the changes in b298c7, please try it again!
Apart from that and regarding the speed-up, I'm now limiting the number of workers to 2 * number of GPUs (code). It is set like that in training too, as creating more workers means duplicating the dataset many times, once per worker... There are ways to avoid that, like what I did for the chunked generator used when inferring on Zarr/H5, but it still needs to be implemented (https://github.com/BiaPyX/BiaPy/issues/111). Before, we were creating a lot of workers, 8 per GPU, so maybe you can touch the code in your local branch, increase that limit, and check whether setting more workers speeds things up a bit more. If so, maybe we can create another variable to control the workers in inference, or just increase the limit in that case.
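For reference, a tiny sketch of the cap being described; the real expression lives in the linked code, and the multiplier here is just the idea for experimenting on a local branch.

```python
import torch

# Current cap: 2 data-loader workers per GPU.
num_gpus = max(1, torch.cuda.device_count())
num_workers = 2 * num_gpus
# num_workers = 8 * num_gpus  # old behaviour; worth benchmarking for inference only
```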
Just realised that I'm getting different results when inferring with one vs. multiple GPUs, so don't try it until I've solved the error... Sorry for that!
@ClementCaporal just finished fixing the errors. Looking forward to hearing your feedback!
Edit: now the CSV files of each patch are generated again, as they are necessary for merging all the points into a single CSV.
I don't know which version you are using, but I just realised that I had a bug in the patch prediction (solved now in 95e2a7eb35c38d853e87f0e9b5645b52949ad63b). The bug had been present since this commit: f525aa6fc4b25c09bdb8dfcfd04d9f546ef5b025
Aaaaah, thank you @danifranco, it is possible that this is what I have been trying to debug for a week. I will try again tomorrow.
Yay, I think the chunk issue is solved. Actually, you are not removing the padding before the detection; I must have misunderstood "Each rank then predicts and after that processes the detection on that patch (without padding)".
But actually the padding removal is done afterwards:
https://github.com/BiaPyX/BiaPy/blob/5b478d63f6e66656b0c52d1b524748d3bb412b48/biapy/engine/detection.py#L879-L896
So you can close the issue!
Thank you @danifranco!