Memory Replay increased GPU memory consumption after the first experience
Hi,
I am not sure if this is expected behavior: when I run an experiment with the Memory Replay plugin only, the GPU memory usage increases after the first experience (in my case, quite significantly) according to nvidia-smi.
training on first experience (experience 0) memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.129      Driver Version: 410.129      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1E.0 Off |                   0* |
| N/A   38C    P0   139W / 150W |   3984MiB /  7618MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
training on second experience (experience 1) memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.129      Driver Version: 410.129      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1E.0 Off |                   0* |
| N/A   52C    P0   135W / 150W |   7284MiB /  7618MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
This is for the setting with:
replay = ReplayPlugin(mem_size=250)
ResNet-50 network,
SGD optimizer,
Cross entropy loss,
batch size 32,
256 by 256 pixels input image size,
2 experiences (experience 0 and experience 1)
The training strategy and loop are as in the Avalanche examples:
cl_strategy = SupervisedTemplate(
    net, optimizer, criterion,
    plugins=[replay], device=device,
    train_mb_size=batch_size, train_epochs=1,
    eval_mb_size=batch_size, evaluator=eval_plugin)
for i, experience in enumerate(generic_scenario.train_stream, 0):
    print(i)
    print("Start of experience: ", experience.current_experience)
    cl_strategy.train(experience)  # train on the i-th experience
    print('Training completed')
According to my understanding, memory replay methods are based on sampling a subset of previous data (of a size defined by mem_size) and adding those samples to the current training batches, which shouldn't increase the occupied GPU memory compared to the first experience.
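To make my assumption explicit, here is a minimal torch-only sketch of how I imagine a replay step working (placeholder tensors standing in for my real data; this is not Avalanche's actual implementation): replayed samples fill part of the minibatch, so its total size, and hence the activation memory, stays the same.

import torch

# placeholder data, shapes roughly as in my setup
memory_x = torch.rand(250, 3, 256, 256)      # fixed-size buffer, bounded by mem_size
memory_y = torch.randint(0, 2, (250,))
current_x = torch.rand(32, 3, 256, 256)      # one minibatch from the current experience
current_y = torch.randint(0, 2, (32,))

k = current_x.size(0) // 2                   # half of the slots come from memory
idx = torch.randperm(memory_x.size(0))[:k]
mixed_x = torch.cat((current_x[:k], memory_x[idx]))   # still 32 samples in total
mixed_y = torch.cat((current_y[:k], memory_y[idx]))
# the forward/backward pass would then run on a batch of the same size as before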
So is this GPU memory usage increase the expected behavior when using the ReplayPlugin?
Thanks, Woj
Do you have a script to reproduce this problem? What happens in the following steps? Does the memory keep growing continuously?
The code is attached below.
There are 2 experiences in the data stream scenario (generic_scenario = dataset_benchmark([trainset1, trainset2], [testset1, testset2])):
- during the first experience the GPU memory usage is ~4 GB (see the nvidia-smi output labelled "training on first experience (experience 0) memory usage" in my comment above),
- during the second experience it is ~7 GB (see the output labelled "training on second experience (experience 1) memory usage").
The memory does not keep growing beyond that; while running the code below I just check the GPU usage manually with nvidia-smi.
import torch
from torch.utils.data import TensorDataset
import torch.nn as nn
import torch.optim as optim
from torchvision import models
from avalanche.training.templates import SupervisedTemplate
from avalanche.training.plugins import ReplayPlugin
from avalanche.evaluation.metrics import forgetting_metrics, accuracy_metrics,\
loss_metrics, confusion_matrix_metrics
from avalanche.models import SimpleMLP
import numpy as np
from avalanche.logging import InteractiveLogger
from avalanche.training.plugins import EvaluationPlugin
from avalanche.benchmarks.generators import dataset_benchmark
N1 = 500
torch.manual_seed(0)
np.random.seed(0)
torch.backends.cudnn.deterministic = True
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
# train sets - experience 0 and 1
x_data1 = torch.rand(N1,3,320,320)+2
y_data1 = torch.ones(N1).long()
x_data1_neg = torch.rand(N1,3,320,320)
y_data1_neg = torch.zeros(N1).long()
x_data1 = torch.cat((x_data1,x_data1_neg))
y_data1 = torch.cat((y_data1,y_data1_neg))
x_data2 = torch.rand(N1,3,320,320)+4
y_data2 = torch.ones(N1).long()
x_data2_neg = torch.rand(N1,3,320,320)+0.5
y_data2_neg = torch.zeros(N1).long()
x_data2 = torch.cat((x_data2,x_data2_neg))
y_data2 = torch.cat((y_data2,y_data2_neg))
trainset1 = TensorDataset(x_data1, y_data1)
trainset2 = TensorDataset(x_data2, y_data2)
# test sets - experience 0 and 1
x_data1 = torch.rand(N1,3,320,320)+2
y_data1 = torch.ones(N1).long()
x_data1_neg = torch.rand(N1,3,320,320)
y_data1_neg = torch.zeros(N1).long()
x_data1 = torch.cat((x_data1,x_data1_neg))
y_data1 = torch.cat((y_data1,y_data1_neg))
x_data2 = torch.rand(N1,3,320,320)+4
y_data2 = torch.ones(N1).long()
x_data2_neg = torch.rand(N1,3,320,320)+0.5
y_data2_neg = torch.zeros(N1).long()
x_data2 = torch.cat((x_data2,x_data2_neg))
y_data2 = torch.cat((y_data2,y_data2_neg))
testset1 = TensorDataset(x_data1, y_data1)
testset2 = TensorDataset(x_data2, y_data2)
generic_scenario = dataset_benchmark([trainset1, trainset2],
[testset1, testset2])
eval_plugin = EvaluationPlugin(
accuracy_metrics(minibatch=True, epoch=True, experience=True, stream=True),
loss_metrics(minibatch=True, epoch=True, experience=True, stream=True),
forgetting_metrics(experience=True, stream=True),
confusion_matrix_metrics(num_classes=2, save_image=False, stream=True),
loggers=[InteractiveLogger()],
strict_checks=False
)
class ResNet_2C(nn.Module):
    def __init__(self, model):
        super(ResNet_2C, self).__init__()
        # ResNet-50 backbone without the final average pooling and fc layers
        self.feature_extractor = nn.Sequential(*list(model.children())[0:8])
        # 2-class classification head on top of the 2048-dim ResNet-50 features
        self.classification_layer = nn.Sequential(
            nn.Linear(2048, 2),
        )
        self.avg_pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        x = self.feature_extractor(x)
        x = self.avg_pool(x)
        x = x.view(-1, 2048)
        x = self.classification_layer(x)
        return x
batch_size=16
model = models.resnet50(pretrained=True)
net = ResNet_2C(model).to(device)
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()
replay = ReplayPlugin(mem_size=250)
cl_strategy = SupervisedTemplate(
    net, optimizer, criterion,
    plugins=[replay], device=device,
    train_mb_size=batch_size, train_epochs=1,
    eval_mb_size=batch_size, evaluator=eval_plugin)
for i, experience in enumerate(generic_scenario.train_stream, 0):
    print(i)
    print("Start of experience: ", experience.current_experience)
    cl_strategy.train(experience)  # train on the i-th experience
    print('Training completed')
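For reference, the same check can also be done from inside the script with PyTorch's own memory counters instead of watching nvidia-smi by hand. Below is a minimal sketch (report_gpu_memory is just a helper name I am making up here; these counters only track tensors managed by PyTorch, so nvidia-smi will show a somewhat higher total that also includes the CUDA context and cache):

import torch  # already imported above, repeated here for completeness

def report_gpu_memory(tag, device):
    # Peak memory tracked by PyTorch's caching allocator since the last reset;
    # nvidia-smi reports more because it also counts the CUDA context and cache.
    alloc_mib = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    reserved_mib = torch.cuda.max_memory_reserved(device) / 1024 ** 2
    print(f"[{tag}] max allocated: {alloc_mib:.0f} MiB, max reserved: {reserved_mib:.0f} MiB")
    torch.cuda.reset_peak_memory_stats(device)

# called after each cl_strategy.train(experience) in the loop above, e.g.:
# report_gpu_memory(f"experience {i}", device)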
Hi @matkowski-voy!
The batch size for the replay dataloader is equal to 2 x batch_size by default, because it concatenates two batches of the same size: one from the dataloader of the current experience and one from the memory. This could be the reason for the increase in memory usage. You can change the default setting with the following parameters when initializing the replay plugin:
ReplayPlugin(mem_size=50, batch_size=bsize//2, batch_size_mem=bsize//2)
where bsize is the original batch size.
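Applied to the script above, that would look roughly like this (a sketch keeping your original mem_size=250 and batch_size=16; only the plugin construction changes, the strategy stays the same):

batch_size = 16  # original mini-batch size from the script above

# Split the effective batch between new and replayed data so that the replay
# dataloader still yields minibatches of batch_size samples in total.
replay = ReplayPlugin(
    mem_size=250,                    # same buffer size as before
    batch_size=batch_size // 2,      # samples drawn from the current experience
    batch_size_mem=batch_size // 2,  # samples drawn from the replay memory
)

cl_strategy = SupervisedTemplate(
    net, optimizer, criterion,
    plugins=[replay], device=device,
    train_mb_size=batch_size, train_epochs=1,
    eval_mb_size=batch_size, evaluator=eval_plugin)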
Can you give it a try and see if it solves the issue?
Hi @HamedHemati ,
yes, that solves the issue.
Thanks a lot! :)