Automatic limiting of local batch size bounds after OOM
This PR updates the upper limit on the local batch size when training hits an OOM. The new upper limit is capped at LOCAL_BSZ_CUTOFF_PCT of the current local batch size. After setting the limit we have to take a quick checkpoint and restart, because a simple retry does not work: PyTorch's GPU memory allocator caches allocations, so merely reducing the current batch size has little effect on the total allocated memory (including caches) and results in subsequent OOMs.
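As a minimal sketch of the bound update, assuming a hypothetical value for `LOCAL_BSZ_CUTOFF_PCT` and a standalone helper (the real constant and state live in the library's code):

```python
# Assumed value, for illustration only; the library defines the real constant.
LOCAL_BSZ_CUTOFF_PCT = 0.75


def limit_local_bsz(current_bsz: int) -> int:
    """Return the new upper bound on local batch size after an OOM.

    The bound is capped at LOCAL_BSZ_CUTOFF_PCT of the batch size that
    triggered the OOM; training then checkpoints and restarts so the
    caching allocator starts from a clean slate.
    """
    return max(1, int(current_bsz * LOCAL_BSZ_CUTOFF_PCT))


print(limit_local_bsz(128))  # → 96
```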
A new retry decorator is introduced to catch the OOM exception, which is not visible from inside the dataloader. The train function should be decorated with retry, which limits the batch size of the current dataloader and then retries the training loop from the position saved before the restart.
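The decorator's control flow can be sketched as follows. This is a simplified, self-contained sketch, not the actual implementation: the dict-based `state`, the hard-coded 0.75 cutoff, and the simulated OOM in `train` are all stand-ins for the library's real checkpoint/restart machinery.

```python
import functools


def retry(train_fn):
    """Sketch: catch a CUDA OOM raised from the training loop, tighten the
    batch-size bound, and rerun the loop from the last checkpoint."""
    @functools.wraps(train_fn)
    def wrapper(state, *args, **kwargs):
        while True:
            try:
                return train_fn(state, *args, **kwargs)
            except RuntimeError as exc:
                # PyTorch raises CUDA OOMs as RuntimeError; re-raise anything else.
                if "out of memory" not in str(exc):
                    raise
                # Cap the upper bound at a fraction of the OOMing batch size
                # (0.75 stands in for LOCAL_BSZ_CUTOFF_PCT), then resume.
                state["max_bsz"] = max(1, int(state["bsz"] * 0.75))
                state["bsz"] = min(state["bsz"], state["max_bsz"])
    return wrapper


@retry
def train(state):
    # Simulated training step: OOM whenever the batch size exceeds capacity.
    if state["bsz"] > 64:
        raise RuntimeError("CUDA out of memory")
    return state["bsz"]


print(train({"bsz": 128, "max_bsz": 128}))  # → 54 (128 → 96 → 72 → 54)
```

In the real library the decorator restores dataloader position from the checkpoint taken before the restart, rather than looping in-process as shown here.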
Fixes #40