Training uses all memory and swap and is then killed
I have 48,391 audio files totaling 80 GB. I trained with the default parameters. During the "Training model" phase, memory and swap were fully occupied and the program was eventually killed.
...Done. Loaded 48345 training samples and 1335 labels.
Building model...
...Done.
Training model...
[1] 2843557 killed
Ubuntu with Python 3.10.12, 126 GB of memory, 8 GB of swap, and a 32-core CPU.
With step-by-step debugging, I found that it gets stuck on this line of code:
https://github.com/kahst/BirdNET-Analyzer/blob/7c6fe30231525a233217e28f3d7609bde7be7e5f/utils.py#L206
Loading training data...
...loading from cache: ./cache/train_cache.npz
...Done. Loaded 48345 training samples and 1335 labels.
Building model...
...Done.
Training model...
split
range end
app1
len: 51593784
[1] 3063959 killed python3 train.py
That is because train.py preprocesses all audio clips and loads the spectrogram embeddings of the entire dataset into memory. This is bound to fail with large datasets.
The solution would be to modify the script so that data only gets loaded when needed (here) and to replace the in-memory feature vectors x with a dataset class that either computes them on the fly or loads them from disk once they are cached.
I am facing the same problem right now and might be able to find time to update the script so that it only loads recordings into memory on demand. I will post a solution here if I get around to doing that.
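For anyone who wants to experiment before an official change lands, here is a minimal sketch of what such an on-demand dataset class could look like. This is only an illustration, assuming a Keras `Sequence`; `extract_features` is a hypothetical placeholder for whatever per-file feature extraction train.py already performs, not an existing function in the repo.

```python
import numpy as np
import tensorflow as tf


# Hypothetical placeholder: replace with the feature/embedding extraction
# that train.py already performs for a single audio file.
def extract_features(path):
    raise NotImplementedError


class OnDemandDataset(tf.keras.utils.Sequence):
    """Serves one batch of features at a time instead of holding the
    whole training set in memory."""

    def __init__(self, file_paths, labels, batch_size=32):
        super().__init__()
        self.file_paths = file_paths  # list of audio file paths
        self.labels = labels          # ndarray, shape (n_samples, n_classes)
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.file_paths) / self.batch_size))

    def __getitem__(self, idx):
        # Load/compute features only for the requested batch.
        start = idx * self.batch_size
        end = start + self.batch_size
        batch_x = np.stack([extract_features(p) for p in self.file_paths[start:end]])
        batch_y = self.labels[start:end]
        return batch_x, batch_y


# model.fit(OnDemandDataset(paths, y), epochs=50)
```

Passing an instance of such a class to `model.fit()` keeps only one batch of features in memory at a time, at the cost of recomputing (or re-reading) features every epoch.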
Initially I also thought this was the reason, but after inspecting x_train, y_train, and the labels in the cache, I found that the final cached data was only 145 MB.
With my friend's help, I changed the code on this line
https://github.com/kahst/BirdNET-Analyzer/blob/7c6fe30231525a233217e28f3d7609bde7be7e5f/utils.py#L199 to:
non_event_indices = np.where(np.sum(y[:,:], axis=1) == 0)[0]
and the problem was solved. I can't explain why technically, and my friend said it might be a bug. You can check whether this helps.
Awesome, thank you! That seems to have been the problem in my case as well.
That line of code finds zero entries in the NumPy ndarray y. Calculating the sum first does indeed reduce the memory footprint for that, but I am likewise surprised that this causes OOM issues. That really would only happen if y had an enormous size to begin with.
Other than that, those pre-calculated features seem to occupy less space than I thought...
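To illustrate the difference on a toy label matrix (a sketch for illustration only, not the actual utils.py code): an element-wise zero search returns one index per zero entry, so the result grows with n_samples * n_classes, while summing each row first returns one index per all-zero row.

```python
import numpy as np

# Toy label matrix: 6 samples, 4 classes; rows 2 and 5 contain no event.
y = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],  # non-event sample
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],  # non-event sample
])

# Element-wise zero search: one index per zero entry,
# so the result grows with n_samples * n_classes.
elementwise = np.where(y == 0)[0]
print(elementwise.size)  # 19 (row indices, with repeats)

# Row-sum first: one index per all-zero row,
# so the result only grows with n_samples.
non_event_indices = np.where(np.sum(y, axis=1) == 0)[0]
print(non_event_indices)  # [2 5]
```

With ~48k samples and 1335 labels, an element-wise zero search alone would produce an index array with tens of millions of entries, which is roughly the order of magnitude of the `len: 51593784` printed in the debug output above.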