BirdNET-Analyzer

Using all memory and swap and then killed

Tindtily opened this issue Mar 18 '24 · 3 comments

I have 48,391 audio files totaling 80 GB. I trained with the default parameters. During the "Training model..." phase, memory and swap were fully occupied and the process was eventually killed.

...Done. Loaded 48345 training samples and 1335 labels.
Building model...
...Done.
Training model...
[1]    2843557 killed

Ubuntu with Python 3.10.12, 126 GB memory, 8 GB swap, and a 32-core CPU.

Tindtily · Mar 18 '24

With step-by-step debugging, I found that it gets stuck on this line of code:

https://github.com/kahst/BirdNET-Analyzer/blob/7c6fe30231525a233217e28f3d7609bde7be7e5f/utils.py#L206

Loading training data...
        ...loading from cache: ./cache/train_cache.npz
...Done. Loaded 48345 training samples and 1335 labels.
Building model...
...Done.
Training model...
split
range end
app1
len: 51593784
[1]    3063959 killed     python3 train.py

Tindtily · Mar 18 '24

That is because train.py preprocesses all audio clips and loads the spectrogram embeddings for the entire dataset into memory. This is bound to fail for large datasets. The solution would be to modify the script so that data is only loaded when needed (here) and to replace the in-memory feature vectors x with a dataset class that either computes them on the fly or loads them from disk once cached. I am facing the same problem right now and might find time to update the script so that it only loads recordings into memory on demand. I will post a solution here if I get around to doing that.
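
A rough sketch of what such a dataset class could look like (the file layout, class name, and batch size below are illustrative assumptions, not the repo's actual code): the idea is to hand model.fit() an object that loads one batch of cached embeddings from disk at a time instead of keeping the full feature matrix in memory.

import numpy as np
import tensorflow as tf

class EmbeddingSequence(tf.keras.utils.Sequence):
    """Hypothetical on-demand loader: yields batches of cached per-sample
    embeddings from disk instead of holding the whole feature matrix in memory."""

    def __init__(self, embedding_paths, labels, batch_size=32):
        super().__init__()
        self.embedding_paths = embedding_paths  # one cached .npy file per sample (assumed layout)
        self.labels = labels                    # (num_samples, num_classes) array
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.embedding_paths) / self.batch_size))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = start + self.batch_size
        # Load only this batch's embeddings from disk
        x = np.stack([np.load(p) for p in self.embedding_paths[start:end]])
        y = self.labels[start:end]
        return x, y

# model.fit(EmbeddingSequence(paths, labels), epochs=...) would then stream
# batches from disk rather than requiring the full feature matrix up front.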

bkellenb · Mar 22 '24

Initially I also thought that was the reason, but after inspecting the x_train, y_train, and labels data in the cache, I found that the final cached data was only 145 MB.

With my friend's help, I changed the code on this line

https://github.com/kahst/BirdNET-Analyzer/blob/7c6fe30231525a233217e28f3d7609bde7be7e5f/utils.py#L199

to:

non_event_indices = np.where(np.sum(y[:,:], axis=1) == 0)[0]

and the problem was solved. I can't explain it technically, and my friend said it might be a bug. You can check whether this helps.
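
For reference, a minimal standalone illustration of what that replacement line selects (the label matrix here is made up): it returns the indices of samples whose label row is all zeros, i.e. the non-event samples.

import numpy as np

# Toy multi-hot label matrix: one row per sample, one column per class.
y = np.array([
    [0, 1, 0],   # sample 0: has a label  -> event
    [0, 0, 0],   # sample 1: no labels    -> non-event
    [1, 0, 1],   # sample 2: has labels   -> event
    [0, 0, 0],   # sample 3: no labels    -> non-event
])

# The replacement line: a sample counts as a non-event when its row sums to 0.
non_event_indices = np.where(np.sum(y[:,:], axis=1) == 0)[0]
print(non_event_indices)  # [1 3]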

Tindtily · Mar 23 '24

Awesome, thank you! That seems to have been the problem in my case as well. That line of code finds the all-zero rows (non-event samples) in the NumPy ndarray y. Calculating the sum first does indeed reduce the memory footprint there, but I am likewise surprised that this causes OOM issues; that would really only happen if y were enormous to begin with.
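
For scale, a quick back-of-the-envelope check of y's size for this dataset (assuming float32 labels, which I have not verified against the cache format):

num_samples, num_labels = 48345, 1335
size_mib = num_samples * num_labels * 4 / 1024**2   # 4 bytes per float32 entry
print(f"{size_mib:.0f} MiB")                         # ~246 MiB, nowhere near 126 GB

So y itself is only moderately large; the blow-up presumably comes from temporaries created while evaluating the original expression.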

Other than that, those pre-calculated features seem to occupy less space than I thought...

bkellenb · Mar 27 '24