
AvalancheConcatDataset is slow for online buffer updates

HamedHemati opened this issue 3 years ago · 3 comments

As pointed out in PR #1024, in online strategies the buffers are updated after every (sub-)experience, so creating a new instance of AvalancheConcatDataset after every experience slows down the update process. This is not a problem for regular/offline strategies. Keeping the data in a list (possibly together with their transformations) and using only the first N samples (e.g., 1 < N <= 10) to create a small buffer dataset is always an option, but does anyone know a better way to create large datasets efficiently?
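For illustration, here is a minimal sketch of that list-based idea (plain PyTorch, not the Avalanche API; ListBufferDataset, raw_buffer, update_buffer, and n are hypothetical names): keep the raw samples in a Python list, which is cheap to extend, and only wrap the first n of them in a small Dataset when the buffer is actually consumed.

from torch.utils.data import Dataset

class ListBufferDataset(Dataset):
    """Wrap a plain Python list of (x, y) samples, applying an optional transform."""
    def __init__(self, samples, transform=None):
        self.samples = samples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        x, y = self.samples[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, y

raw_buffer = []  # grows by a handful of samples per (sub-)experience; extending a list is cheap

def update_buffer(new_samples, n=10, transform=None):
    # No dataset object is rebuilt here; only the list grows.
    raw_buffer.extend(new_samples)
    # Materialize a small dataset from the first n samples (e.g., 1 < n <= 10) only when needed.
    return ListBufferDataset(raw_buffer[:n], transform=transform)

The trade-off is that the full concatenated view is never built, so anything that needs the whole buffer as a single dataset still has to construct it at that point.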

HamedHemati avatar May 13 '22 14:05 HamedHemati

Are you sure that the problem is AvalancheConcatDataset? I tried to measure its runtime some time ago and it was quite fast, but I may not remember correctly. Can you check the runtime of the buffer update without any other strategy code?

AntonioCarta avatar May 16 '22 07:05 AntonioCarta

So, to find the bottleneck in the update_from_dataset function of ReservoirSamplingBuffer, I added the following condition, which skips the dataset concatenation once the buffer is full, as a test to confirm that it is the concatenation that slows down the update:

        new_weights = torch.rand(len(new_data))
        cat_weights = torch.cat([new_weights, self._buffer_weights])

        # ADDED this: skip concatenation once the buffer is already full
        if len(self.buffer) == self.max_size:
            cat_data = self.buffer
        else:
            cat_data = AvalancheConcatDataset([new_data, self.buffer])

        sorted_weights, sorted_idxs = cat_weights.sort(descending=True)
        buffer_idxs = sorted_idxs[: self.max_size]
        self.buffer = AvalancheSubset(cat_data, buffer_idxs)
        self._buffer_weights = sorted_weights[: self.max_size]

This was still a bit slower than the naive online strategy (which was expected), but much faster than concatenating the dataset after each experience.

HamedHemati avatar May 17 '22 08:05 HamedHemati

These are the results that I get from a quick profiling script after 11552 iterations (batch size = 1):

  • reservoir sampling: 116.8 sec
  • random sampling: 71.3 sec

Not optimal, but not that bad as a starting point, considering that the training loop will often be much more expensive.

import time
from os.path import expanduser
from unittest.mock import Mock

from tqdm import tqdm

from avalanche.benchmarks import fixed_size_experience_split, SplitMNIST
from avalanche.training import ParametricBuffer, ReservoirSamplingBuffer

benchmark = SplitMNIST(
    n_experiences=5,
    dataset_root=expanduser("~") + "/.avalanche/data/mnist/",
)

experience = benchmark.train_stream[0]
print("len experience: ", len(experience.dataset))

# start = time.time()
# buffer = ReservoirSamplingBuffer(100)
# for exp in tqdm(fixed_size_experience_split(experience, 1)):
#     buffer.update_from_dataset(exp.dataset)
#
# end = time.time()
# duration = end - start
# print("ReservoirSampling Duration: ", duration)


start = time.time()
buffer = ParametricBuffer(100)
# Mock stands in for the strategy object normally passed to update(),
# exposing the current experience and its dataset.
for exp in tqdm(fixed_size_experience_split(experience, 1)):
    buffer.update(Mock(experience=exp, dataset=exp.dataset))

end = time.time()
duration = end - start
print("ParametricBuffer (random sampling) Duration: ", duration)

AntonioCarta avatar May 17 '22 11:05 AntonioCarta