Large memory occupation
Hi, I'm training Faster R-CNN on 4 GPUs with the COCO dataset converted to LMDB. I used `num_workers=4` for the DataLoader, and I found that the memory occupation is almost 60 GB. I suspect that the whole dataset is being read into memory. But per your description in the README,

> Here I choose LMDB because hdf5, pth, n5, though with a straightforward json-like API, require putting the whole file into memory. This is not practical when you play with large datasets like ImageNet.

LMDB shouldn't behave like this. Any thoughts on this? I can share part of my dataset code:
```python
import os

import lmdb
import numpy as np
import pyarrow as pa
import six
from PIL import Image
from torch.utils.data import Dataset


class LMDBWrapper(object):
    def __init__(self, lmdb_path):
        self.env = lmdb.open(lmdb_path, max_readers=1,
                             subdir=os.path.isdir(lmdb_path),
                             readonly=True, lock=False,
                             readahead=False, meminit=False)
        with self.env.begin(write=False) as txn:
            self.length = pa.deserialize(txn.get(b'__len__'))
            self.keys = pa.deserialize(txn.get(b'__keys__'))

    def get_image(self, image_key):
        env = self.env
        with env.begin(write=False) as txn:
            byteflow = txn.get(u'{}'.format(image_key).encode('ascii'))
            imgbuf = pa.deserialize(byteflow)
            buf = six.BytesIO()
            buf.write(imgbuf)
            buf.seek(0)
            image = Image.open(buf).convert('RGB')
            return np.asarray(image)


class LMDBDataset(Dataset):
    def __init__(self, lmdb_path):
        self.lmdb = None
        self.lmdb_path = lmdb_path

    def init_lmdb(self):
        self.lmdb = LMDBWrapper(self.lmdb_path)

    def __getitem__(self, idx):
        # open the LMDB environment lazily, so it is created inside each
        # DataLoader worker instead of being inherited from the parent
        if self.lmdb is None:
            self.init_lmdb()


class CocoInstanceLMDBDataset(LMDBDataset):
    def __init__(self, lmdb_path):
        super().__init__(lmdb_path=lmdb_path)

    def __getitem__(self, idx):
        super().__getitem__(idx)
        ann = self.filtered_anns[idx]
        data = dict()
        # transforms
        return data
```
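For context, here is a simplified, hypothetical sketch of how a dataset like this ends up being driven in each training process. The real launcher, sampler, and model code are omitted; `train_worker`, the batch size, and the LMDB path are placeholders, and it assumes the full dataset class also defines `__len__` (only part of it is shown above).

```python
# Hypothetical sketch of the surrounding training setup; the real sampler and
# model code are omitted and "/path/to/coco.lmdb" is a placeholder.
import torch.multiprocessing as mp
from torch.utils.data import DataLoader


def train_worker(rank, lmdb_path):
    # one process per GPU; the dataset keeps the LMDB env unopened until
    # __getitem__ runs, i.e. the env is created inside each DataLoader worker
    dataset = CocoInstanceLMDBDataset(lmdb_path)
    loader = DataLoader(dataset, batch_size=2, num_workers=4, pin_memory=True)
    for batch in loader:
        pass  # forward/backward pass omitted


if __name__ == "__main__":
    mp.spawn(train_worker, args=("/path/to/coco.lmdb",), nprocs=4)
```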
Same problem here, @Lyken17.
@xieydd @gathierry Can you share the versions of your torch and py-lmdb?
In my case, torch==1.4.0+cu92 and lmdb==0.98
I have a similar problem. I used the `ImageFolderLMDB` class in folder2lmdb.py, and while iterating over the DataLoader, RAM usage continuously increased. The problem may be caused by `txn.get(self.keys[index])`, but I don't know how to fix it.
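Since LMDB memory-maps the database file, pages that have been read stay resident (and show up in htop) even though the kernel can reclaim them, so one way to narrow this down is to log the process's RSS directly while reading samples. This is just a diagnostic sketch, not part of folder2lmdb; the LMDB path is a placeholder:

```python
# Diagnostic sketch (not part of folder2lmdb): read samples straight from the
# dataset and log this process's resident memory, to see whether the suspected
# txn.get(self.keys[index]) path really makes RSS grow without bound.
import psutil
from folder2lmdb import ImageFolderLMDB

dataset = ImageFolderLMDB("/ImageNet/train.lmdb")  # placeholder path
proc = psutil.Process()

for i in range(len(dataset)):
    _ = dataset[i]  # goes through txn.get(self.keys[i]) internally
    if i % 1000 == 0:
        print("sample %d: rss = %.1f MiB" % (i, proc.memory_info().rss / 2**20))
```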
I did a simple test using the ImageNet dataset; however, I failed to observe any memory leak:
```python
import torch
from torchvision import transforms

from folder2lmdb import ImageFolderLMDB

dst = ImageFolderLMDB(
    "/ImageNet/train.lmdb",
    transform=transforms.Compose([
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ]))

train_loader = torch.utils.data.DataLoader(
    dst, batch_size=64, num_workers=40, pin_memory=True)

for i, _ in enumerate(train_loader):
    if i % 10 == 0:
        print("[%d/%d]" % (i, len(train_loader)))
```
The memory usage shown in htop does not increase over time.

Though I notice there are some issues mentioning this (https://github.com/pytorch/vision/issues/619), could you provide more detailed settings (e.g., a sample snippet that leads to the memory leak)?
Maybe you need to remove the param `max_readers=1`?
I tried without `max_readers=1`, but it doesn't change anything. Do you think it's because I started the program with `mp.spawn`, so it runs in a multi-process context?
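One way to check whether the multi-process context matters would be to run the same iteration test inside `mp.spawn` and compare against the single-process run above. A sketch along these lines, where the LMDB path, `num_workers`, and process count are placeholders:

```python
# Sketch only: run the same iteration test inside processes started by
# mp.spawn, to compare against the single-process result above.
import torch
import torch.multiprocessing as mp
from torchvision import transforms

from folder2lmdb import ImageFolderLMDB


def run(rank, nprocs):
    dst = ImageFolderLMDB(
        "/ImageNet/train.lmdb",  # placeholder path
        transform=transforms.Compose([
            transforms.CenterCrop(224),
            transforms.ToTensor(),
        ]))
    loader = torch.utils.data.DataLoader(
        dst, batch_size=64, num_workers=4, pin_memory=True)
    for i, _ in enumerate(loader):
        if i % 10 == 0:
            print("[rank %d] [%d/%d]" % (rank, i, len(loader)))


if __name__ == "__main__":
    nprocs = 4
    mp.spawn(run, args=(nprocs,), nprocs=nprocs)
```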