Large memory occupation
Hi, I'm training Faster R-CNN on 4 GPUs with the COCO dataset converted to LMDB. I used `num_workers=4` for the DataLoader, and I found that the memory occupation is almost 60 GB. I suspect that the whole dataset is being read into memory. But per your description in the README,

> Here I choose LMDB because hdf5, pth, n5, though with a straightforward json-like API, require putting the whole file into memory. This is not practical when you play with large datasets like ImageNet.

LMDB shouldn't behave like this. Any thoughts on this? I can share part of my dataset code:
```python
import os

import lmdb
import numpy as np
import pyarrow as pa
import six
from PIL import Image
from torch.utils.data import Dataset


class LMDBWrapper(object):
    def __init__(self, lmdb_path):
        self.env = lmdb.open(lmdb_path, max_readers=1,
                             subdir=os.path.isdir(lmdb_path),
                             readonly=True, lock=False,
                             readahead=False, meminit=False)
        with self.env.begin(write=False) as txn:
            self.length = pa.deserialize(txn.get(b'__len__'))
            self.keys = pa.deserialize(txn.get(b'__keys__'))

    def get_image(self, image_key):
        env = self.env
        with env.begin(write=False) as txn:
            byteflow = txn.get(u'{}'.format(image_key).encode('ascii'))
            imgbuf = pa.deserialize(byteflow)
            buf = six.BytesIO()
            buf.write(imgbuf)
            buf.seek(0)
            image = Image.open(buf).convert('RGB')
            return np.asarray(image)


class LMDBDataset(Dataset):
    def __init__(self, lmdb_path):
        self.lmdb = None
        self.lmdb_path = lmdb_path

    def init_lmdb(self):
        self.lmdb = LMDBWrapper(self.lmdb_path)

    def __getitem__(self, idx):
        # open the LMDB environment lazily, so it is created inside each
        # DataLoader worker instead of being inherited from the parent
        if self.lmdb is None:
            self.init_lmdb()


class CocoInstanceLMDBDataset(LMDBDataset):
    def __init__(self, lmdb_path):
        super().__init__(lmdb_path=lmdb_path)

    def __getitem__(self, idx):
        super().__getitem__(idx)
        ann = self.filtered_anns[idx]
        data = dict()
        # transforms
        return data
```
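For context, here is a simplified, hypothetical sketch of how a dataset like this ends up being driven in each training process. The real launcher, sampler, and model code are omitted; `train_worker`, the batch size, and the LMDB path are placeholders, and it assumes the full dataset class also defines `__len__` (only part of it is shown above).

```python
# Hypothetical sketch of the surrounding training setup; the real sampler and
# model code are omitted and "/path/to/coco.lmdb" is a placeholder.
import torch.multiprocessing as mp
from torch.utils.data import DataLoader


def train_worker(rank, lmdb_path):
    # one process per GPU; the dataset keeps the LMDB env unopened until
    # __getitem__ runs, i.e. the env is created inside each DataLoader worker
    dataset = CocoInstanceLMDBDataset(lmdb_path)
    loader = DataLoader(dataset, batch_size=2, num_workers=4, pin_memory=True)
    for batch in loader:
        pass  # forward/backward pass omitted


if __name__ == "__main__":
    mp.spawn(train_worker, args=("/path/to/coco.lmdb",), nprocs=4)
```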
Same problem here, @Lyken17.
@xieydd @gathierry Can you share the versions of your torch and py-lmdb?
In my case, torch==1.4.0+cu92 and lmdb==0.98
I have a similar problem. I used the `ImageFolderLMDB` class in folder2lmdb.py, and while iterating over the DataLoader, RAM usage continuously increased. The problem may be caused by `txn.get(self.keys[index])`, but I don't know how to fix it.
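Since LMDB memory-maps the database file, pages that have been read stay resident (and show up in htop) even though the kernel can reclaim them, so one way to narrow this down is to log the process's RSS directly while reading samples. This is just a diagnostic sketch, not part of folder2lmdb; the LMDB path is a placeholder:

```python
# Diagnostic sketch (not part of folder2lmdb): read samples straight from the
# dataset and log this process's resident memory, to see whether the suspected
# txn.get(self.keys[index]) path really makes RSS grow without bound.
import psutil
from folder2lmdb import ImageFolderLMDB

dataset = ImageFolderLMDB("/ImageNet/train.lmdb")  # placeholder path
proc = psutil.Process()

for i in range(len(dataset)):
    _ = dataset[i]  # goes through txn.get(self.keys[i]) internally
    if i % 1000 == 0:
        print("sample %d: rss = %.1f MiB" % (i, proc.memory_info().rss / 2**20))
```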
I did a simple test using the ImageNet dataset; however, I failed to observe any memory leak:
```python
import torch
from torchvision import transforms

from folder2lmdb import ImageFolderLMDB

dst = ImageFolderLMDB(
    "/ImageNet/train.lmdb",
    transform=transforms.Compose([
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ]))

train_loader = torch.utils.data.DataLoader(
    dst, batch_size=64, num_workers=40, pin_memory=True)

for i, _ in enumerate(train_loader):
    if i % 10 == 0:
        print("[%d/%d]" % (i, len(train_loader)))
```
The memory usage shown in htop does not increase over time.

Though I notice there are some issues mentioning this (https://github.com/pytorch/vision/issues/619), could you provide more detailed settings (e.g., a sample snippet that leads to the memory leak)?
Maybe you need to remove the param `max_readers=1`?
I tried without `max_readers=1`, but it doesn't change anything. Do you think it's because I started the program with `mp.spawn`, so it runs in a multi-process context?
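One way to check whether the multi-process context matters would be to run the same iteration test inside `mp.spawn` and compare against the single-process run above. A sketch along these lines, where the LMDB path, `num_workers`, and process count are placeholders:

```python
# Sketch only: run the same iteration test inside processes started by
# mp.spawn, to compare against the single-process result above.
import torch
import torch.multiprocessing as mp
from torchvision import transforms

from folder2lmdb import ImageFolderLMDB


def run(rank, nprocs):
    dst = ImageFolderLMDB(
        "/ImageNet/train.lmdb",  # placeholder path
        transform=transforms.Compose([
            transforms.CenterCrop(224),
            transforms.ToTensor(),
        ]))
    loader = torch.utils.data.DataLoader(
        dst, batch_size=64, num_workers=4, pin_memory=True)
    for i, _ in enumerate(loader):
        if i % 10 == 0:
            print("[rank %d] [%d/%d]" % (rank, i, len(loader)))


if __name__ == "__main__":
    nprocs = 4
    mp.spawn(run, args=(nprocs,), nprocs=nprocs)
```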