Shared pinned buffers
This PR implements a better way of sharing torch tensors between processes: (large enough) shared tensors are allocated once and reused as a transfer mechanism. On the fragment environment `seh_frag.py` I'm getting a 30% wall-time improvement for simple settings with batch size 64 (I'm sure we could have fun maxing that out and seeing how far we can push GPU utilization).
Some notes:
- The effect is mostly felt when sampling (which is where most time is spent in the first place); sending `Batch` and `GraphActionCategorical`s through shared buffers improves that time
- Passing batches to the training loop (which are much bigger and "rarer") doesn't seem to have a significant speedup, but I've implemented it nonetheless for future proofing
Other changes:
- Removed local grad clipping, which was not quite correct; the difference is minimal but relevant, and there's also a nice speedup
- Made all algorithms inherit from `GFNAlgorithm`
- `global_cfg` is set for all algorithms
- `cond_info` is now folded into the batch object rather than being passed as an argument everywhere
- Fixed `GraphActionCategorical.entropy` when masks are used; gradients w.r.t. the logits would be NaN
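The masked-entropy pitfall can be shown in a few lines. This is a pure-Python illustration of the failure mode only (the actual fix operates on torch tensors inside `GraphActionCategorical`, and there the NaN also shows up in the gradients): masked actions get a logit of `-inf`, so the naive `p * log p` sum produces a `0 * -inf = nan` term.

```python
# Illustration of why entropy with -inf masked logits produces NaN,
# and the fix of excluding masked entries from the sum. Pure Python;
# function names are illustrative, not the repo's API.
import math

def entropy_naive(logits, mask):
    # Masked actions get logit -inf; softmax gives them probability 0,
    # but the entropy term is then 0 * log(0) = 0 * -inf = nan.
    masked = [l if m else float("-inf") for l, m in zip(logits, mask)]
    logz = math.log(sum(math.exp(l) for l in masked))
    return -sum(math.exp(l - logz) * (l - logz) for l in masked)

def entropy_masked(logits, mask):
    # Fix: drop masked entries entirely, so no 0 * -inf terms appear
    # (in torch, the analogous fix also keeps gradients finite).
    kept = [l for l, m in zip(logits, mask) if m]
    logz = math.log(sum(math.exp(l) for l in kept))
    return -sum(math.exp(l - logz) * (l - logz) for l in kept)

naive = entropy_naive([1.0, 2.0, 3.0], [True, True, False])
fixed = entropy_masked([1.0, 2.0, 3.0], [True, True, False])
assert math.isnan(naive)
assert not math.isnan(fixed)
```

In torch the forward value can sometimes be patched up, but autograd still propagates NaN through the `0 * -inf` product, which is why the fix has to remove the masked terms rather than merely zero the result.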
Note: `EnvelopeQL` is still in a broken state; will fix in #127
I'm of a mind to merge this actually. It's not the cleanest implementation possible but there are significant gains here (as mentioned, a 30% speedup with the default settings on seh_frag.py). Will test across tasks and report back.
Made significant simplifications to the method by subclassing `Pickler`/`Unpickler`, and found some very tricky bugs along the way (I was misusing pinned CUDA buffers and ended up with rare race conditions). Speedups remain (might even be a bit faster).
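The `Pickler`/`Unpickler` trick can be sketched with the stdlib's `persistent_id`/`persistent_load` hooks. This is an illustrative sketch, not the PR's code: the PR diverts torch tensors into shared pinned buffers, while here large `bytes` payloads go into a plain `bytearray` standing in for the shared buffer; `BufferPickler`, `SHARED_BUF`, and the size threshold are all made up for the example.

```python
# Sketch: a Pickler subclass that intercepts large payloads, copies them
# into a preallocated buffer, and pickles only (offset, length). The
# matching Unpickler reads them back out. Names are illustrative.
import io
import pickle

SHARED_BUF = bytearray(1 << 20)  # allocated once, reused every transfer

class BufferPickler(pickle.Pickler):
    def __init__(self, file):
        super().__init__(file)
        self.offset = 0

    def persistent_id(self, obj):
        # Divert large bytes payloads into the shared buffer; the pickle
        # stream then carries only a tiny (tag, offset, length) tuple.
        if isinstance(obj, bytes) and len(obj) > 64:
            start = self.offset
            SHARED_BUF[start:start + len(obj)] = obj
            self.offset += len(obj)
            return ("shared", start, len(obj))
        return None  # everything else is pickled normally

class BufferUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        tag, start, length = pid
        assert tag == "shared"
        return bytes(SHARED_BUF[start:start + length])

# Round-trip: the pickle stream stays tiny, the payload rides the buffer.
payload = {"logits": b"\x00" * 4096, "reward": 1.5}
f = io.BytesIO()
BufferPickler(f).dump(payload)
restored = BufferUnpickler(io.BytesIO(f.getvalue())).load()
assert restored == payload
```

Hooking the pickler this way keeps the whole thing transparent to callers: objects are sent and received exactly as before, and only the pickler decides which pieces travel through the shared buffer instead of the byte stream.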
Merged with trunk + made a few fixes. Pretty happy with this now!