
Profile the predict_tile method.

Open bw4sz opened this issue 2 years ago • 6 comments

The predict_tile function is the workhorse of the entire package. Most people will arrive with large geospatial tiles that cannot fit into memory. The predict_tile function needs to be more thoroughly profiled and understood.

General instructions and pseudo-code

  1. Take a large tile from https://zenodo.org/record/5912107#.ZBiLR-zMKDU
wget -O 2018_TEAK_3_315000_4094000_image_crop.tif "https://zenodo.org/record/5912107/files/2018_TEAK_3_315000_4094000_image_crop.tif?download=1"
  2. Use the predict_tile function to generate predictions.

  3. Post the cProfile results here (a minimal profiling sketch is shown after this list) and identify which facets are slow under the following four conditions: CPU with config["workers"] > 0, CPU with config["workers"] == 0, GPU with config["workers"] > 0, and GPU with config["workers"] == 0. What number of workers is fastest? Please post full details of operating system and GPU.
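
Something along these lines could work for steps 2 and 3 (a minimal sketch; it assumes the tile has been downloaded to the working directory, and the patch_size/patch_overlap values are illustrative rather than prescribed by this issue):

import cProfile
import pstats

from deepforest import main

# Assumes the Zenodo tile above was downloaded to the working directory.
raster_path = "2018_TEAK_3_315000_4094000_image_crop.tif"

model = main.deepforest()
model.use_release()

profiler = cProfile.Profile()
profiler.enable()
model.predict_tile(raster_path=raster_path, patch_size=400, patch_overlap=0.05)
profiler.disable()

# Print the 25 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)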

The motivation: I was running predict_tile on a large number of RGB tiles on a SLURM Linux setup with a GPU and 10 workers. I'm seeing a lot of variation in the run time among tiles. This makes sense when there are no trees predicted, since there will be no non-max suppression to run (which suggests NMS is costly), but beyond that the iterations per second seem to vary wildly.

One tile (389 crops) took 30 minutes:


Predicting DataLoader 0: 100%|██████████| 389/389 [30:40<00:00,  4.73s/it]

Another tile (729 crops) took 23 seconds just a minute later. Is something going on here with the workers locking?


100%|██████████| 729/729 [00:23<00:00, 30.65it/s]

bw4sz avatar Mar 20 '23 16:03 bw4sz

Hello!

I am trying the predict_tile method on my computer's CPU (8 cores), and setting workers to 1, 4, or 8 makes no difference in the time taken by predict_tile. I wonder if this has anything to do with the warning at https://pytorch.org/docs/stable/data.html#multi-process-data-loading (which points to this thorough comment: https://github.com/pytorch/pytorch/issues/13246#issuecomment-905703662).

My code:

import time

from deepforest import main

img_filepath = "..."
num_workers = 1  # could be 4 or 8

model = main.deepforest(config_args={"workers": num_workers})
model.use_release()

start_time = time.time()
result = model.predict_tile(raster_path=img_filepath)
print(f"--- Inference on a tile: {(time.time() - start_time):.2f} seconds ---")

In my case, for the same image, I always get:

Predicting DataLoader 0: 100% 36/36 [01:44<00:00,  2.90s/it]
345 predictions in overlapping windows, applying non-max supression
263 predictions kept after non-max suppression
--- Inference on a tile: 104.82 seconds ---

The 104.82 seconds changes slightly between runs but definitely does not scale with the number of workers.

Thank you. Best, Martí

martibosch avatar Feb 09 '24 12:02 martibosch

Have you tried changing the number of workers, e.g. num_workers = 1  # could be 4 or 8?

henrykironde avatar Feb 09 '24 12:02 henrykironde

Ah, never mind, I think it's something else. Let me check that out.

henrykironde avatar Feb 09 '24 12:02 henrykironde

Hi @martibosch! I was just recommending detectree to a student who needed a tree/not-tree segmentation. I'm away next week, but happy to connect. Let's take a quick look here. It's always very hard to know whether parallelization is working as hoped in PyTorch dataloaders. I have this problem often in other models: sometimes the data IO is so heavy that it doesn't matter how many workers you have. I often set it to zero in other contexts (https://github.com/pytorch/pytorch/issues/12831). Looking here:

https://github.com/weecology/DeepForest/blob/06436707b28418b300591641e4678a9715df2fd6/deepforest/main.py#L301

I can see that the workers argument is correctly connected. Let's print the config at that time to make sure the kwargs approach is getting correctly synced up.

import time

from deepforest import main

img_filepath = "..."
num_workers = 4  # could be 4 or 8

model = main.deepforest(config_args={"workers": num_workers})
model.use_release()

assert model.config["workers"] == 4

No issues there. Are you on CPU or GPU? If you are local, can you see the Python processes being created? I'm trying to decide whether you are just not seeing a speedup, or whether the workers genuinely aren't being created.
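
One way to check, beyond the task manager, would be something like this rough sketch (it assumes psutil is installed and that img_filepath points to a real tile; the watcher helper is just for illustration). DataLoader worker processes only exist while the loader is iterating, so the count has to be taken while predict_tile is running:

import threading
import time

import psutil

from deepforest import main

img_filepath = "..."  # same elided tile path as above
num_workers = 1

model = main.deepforest(config_args={"workers": num_workers})
model.use_release()

done = threading.Event()

def watch_children(interval=1.0):
    """Periodically report how many child processes this interpreter has spawned."""
    parent = psutil.Process()
    while not done.is_set():
        print(f"child processes: {len(parent.children(recursive=True))}")
        time.sleep(interval)

watcher = threading.Thread(target=watch_children, daemon=True)
watcher.start()
model.predict_tile(raster_path=img_filepath)
done.set()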

As an aside, we are working on a very large collaboration (https://milliontrees.idtrees.org/) and a retraining of the DeepForest model (we know it's sensitive to resolution and patch size); you are welcome to be involved.

bw4sz avatar Feb 09 '24 18:02 bw4sz

Hello @bw4sz!

thank you for your response (and for recommending detectree :smile:). I am working locally on my CPU (an Ubuntu computer without a GPU). I have conducted several tests, and the issue seems to be that 4 Python processes are created regardless of what I enter as num_workers, even if I set it to 1 or 2. Is it possible that PyTorch overrides the setting at some point based on some check of my computer's capabilities?

Regarding the milliontrees project, I'd be very happy to contribute - thank you for the invitation. I will write you on a separate mail. Best, Martí

martibosch avatar Feb 14 '24 14:02 martibosch

I think this is going to take some time to debug. I've added a PR to assert that the dataloader gets the arguments; that way we can be confident this isn't a DeepForest problem. We can't really unpack what's happening inside PyTorch, but we still want to understand it better. Can you paste any code or screenshots about:

I have conducted several tests and the issue seems to be that 4 Python processes are created regardless of what I enter as num_workers, even if I set it to 1 or 2.

Is this just from looking at the activity monitor/task manager? Which operating system?

I think the key idea is that workers copy the entire process (https://pytorch.org/docs/stable/data.html#multi-process-data-loading), and the entire reason a user would use predict_tile is that the data is large. We need to understand how this relates to pinned memory and prefetching. I don't think we ever want to prefetch data and put it on the GPU, because a user would not likely want to run predict_tile on the same image twice outside of debugging. It's possible we have some heavy memory processes elsewhere and that's slowing down predict_tile worker creation?
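
To separate the data-loading cost from inference, a toy benchmark along these lines might be useful (this is a sketch with a synthetic dataset standing in for the cropped windows; it is not DeepForest's actual dataloader, and the knob values are just starting points to vary):

import time

import torch
from torch.utils.data import DataLoader, Dataset

class WindowDataset(Dataset):
    """Synthetic stand-in for the cropped windows that predict_tile iterates over."""
    def __init__(self, n_windows=64, size=400):
        self.n_windows = n_windows
        self.size = size

    def __len__(self):
        return self.n_windows

    def __getitem__(self, idx):
        # With real rasters, reading each window is the IO cost that can
        # dominate regardless of how many workers are configured.
        return torch.rand(3, self.size, self.size)

loader = DataLoader(
    WindowDataset(),
    batch_size=1,
    num_workers=4,      # 0 keeps loading in the main process (no worker copies)
    pin_memory=False,   # pinned host memory only helps when copying batches to a GPU
    prefetch_factor=2,  # batches pre-loaded per worker; leave unset when num_workers=0
)

start = time.time()
for batch in loader:
    pass  # timing this loop isolates data loading from model inference
print(f"data loading only: {time.time() - start:.2f} s")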

bw4sz avatar Feb 17 '24 15:02 bw4sz