Shared pinned buffers
This PR implements a better way of sharing torch tensors between processes: (large enough) shared tensors are allocated once and reused as a transfer mechanism. On the fragment environment `seh_frag.py` I'm getting a 30% wall-time improvement for simple settings with batch size 64 (I'm sure we could have fun maxing that out and seeing how far we can push GPU utilization).
Some notes:
- The effect is mostly felt when sampling (which is where most time is spent in the first place); sending `Batch` and `GraphActionCategorical`s through shared buffers improves that time
- Passing batches to the training loop (which are much bigger and "rarer") doesn't seem to have a significant speedup, but I've implemented it nonetheless for future proofing
Other changes:
- Removed local grad clipping, which was not quite correct; the difference is minimal but relevant, and there's also a nice speedup
- Made all algorithms inherit from `GFNAlgorithm`
- `global_cfg` is set for all algorithms
- `cond_info` is now folded into the batch object rather than being passed as an argument everywhere
- Fixed `GraphActionCategorical.entropy` when masks are used; gradients w.r.t. the logits would be NaN
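The masked-entropy pitfall can be shown in a few lines. This is a pure-Python illustration of the failure mode only (the actual fix operates on torch tensors inside `GraphActionCategorical`, and there the NaN also shows up in the gradients): masked actions get a logit of `-inf`, so the naive `p * log p` sum produces a `0 * -inf = nan` term.

```python
# Illustration of why entropy with -inf masked logits produces NaN,
# and the fix of excluding masked entries from the sum. Pure Python;
# function names are illustrative, not the repo's API.
import math

def entropy_naive(logits, mask):
    # Masked actions get logit -inf; softmax gives them probability 0,
    # but the entropy term is then 0 * log(0) = 0 * -inf = nan.
    masked = [l if m else float("-inf") for l, m in zip(logits, mask)]
    logz = math.log(sum(math.exp(l) for l in masked))
    return -sum(math.exp(l - logz) * (l - logz) for l in masked)

def entropy_masked(logits, mask):
    # Fix: drop masked entries entirely, so no 0 * -inf terms appear
    # (in torch, the analogous fix also keeps gradients finite).
    kept = [l for l, m in zip(logits, mask) if m]
    logz = math.log(sum(math.exp(l) for l in kept))
    return -sum(math.exp(l - logz) * (l - logz) for l in kept)

naive = entropy_naive([1.0, 2.0, 3.0], [True, True, False])
fixed = entropy_masked([1.0, 2.0, 3.0], [True, True, False])
assert math.isnan(naive)
assert not math.isnan(fixed)
```

In torch the forward value can sometimes be patched up, but autograd still propagates NaN through the `0 * -inf` product, which is why the fix has to remove the masked terms rather than merely zero the result.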
Note: `EnvelopeQL` is still in a broken state; will fix in #127
I'm of a mind to merge this actually. It's not the cleanest implementation possible but there are significant gains here (as mentioned, a 30% speedup with the default settings on seh_frag.py). Will test across tasks and report back.
Made significant simplifications to the method by subclassing `Pickler`/`Unpickler`, and found some very tricky bugs along the way (I was misusing pinned CUDA buffers and ended up with rare race conditions). Speedups remain (might even be a bit faster).
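The `Pickler`/`Unpickler` trick can be sketched with the stdlib's `persistent_id`/`persistent_load` hooks. This is an illustrative sketch, not the PR's code: the PR diverts torch tensors into shared pinned buffers, while here large `bytes` payloads go into a plain `bytearray` standing in for the shared buffer; `BufferPickler`, `SHARED_BUF`, and the size threshold are all made up for the example.

```python
# Sketch: a Pickler subclass that intercepts large payloads, copies them
# into a preallocated buffer, and pickles only (offset, length). The
# matching Unpickler reads them back out. Names are illustrative.
import io
import pickle

SHARED_BUF = bytearray(1 << 20)  # allocated once, reused every transfer

class BufferPickler(pickle.Pickler):
    def __init__(self, file):
        super().__init__(file)
        self.offset = 0

    def persistent_id(self, obj):
        # Divert large bytes payloads into the shared buffer; the pickle
        # stream then carries only a tiny (tag, offset, length) tuple.
        if isinstance(obj, bytes) and len(obj) > 64:
            start = self.offset
            SHARED_BUF[start:start + len(obj)] = obj
            self.offset += len(obj)
            return ("shared", start, len(obj))
        return None  # everything else is pickled normally

class BufferUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        tag, start, length = pid
        assert tag == "shared"
        return bytes(SHARED_BUF[start:start + length])

# Round-trip: the pickle stream stays tiny, the payload rides the buffer.
payload = {"logits": b"\x00" * 4096, "reward": 1.5}
f = io.BytesIO()
BufferPickler(f).dump(payload)
restored = BufferUnpickler(io.BytesIO(f.getvalue())).load()
assert restored == payload
```

Hooking the pickler this way keeps the whole thing transparent to callers: objects are sent and received exactly as before, and only the pickler decides which pieces travel through the shared buffer instead of the byte stream.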
Merged with trunk + made a few fixes. Pretty happy with this now!