
[BUG] (feature/fvdb) multi-card training error and memory leak

Open xiaoc57 opened this issue 10 months ago • 0 comments

Issues encountered during experiments with fvdb

I encountered two issues during my experiments using the latest fvdb environment:

  1. Incorrect Function Call Causing Multi-GPU Training Failure

    The current implementation in TorchDeviceBuffer::create is incompatible with recent changes to NanoVDB's buffer creation. Specifically, NanoVDB buffer creation has been updated from:

auto buffer = BufferT::create(mData.size, &pool, false); // only allocate buffer on the device 

to:

auto buffer = BufferT::create(mData.size, &pool, device, mStream); // only allocate buffer on the device

However, the corresponding TorchDeviceBuffer::create method signature remains:

TorchDeviceBuffer::create(uint64_t size, const TorchDeviceBuffer *proto, bool host, void *stream)

Due to this mismatch, the function call fails, preventing successful multi-GPU training.

  2. Potential GPU Memory Leak During Training

    I observed a gradual increase in GPU memory usage during training, eventually leading to out-of-memory errors. However, it is hard to confirm definitively whether the leak originates in the fvdb framework, since the per-iteration memory growth is very small.

Both issues were observed with the latest fvdb version and its associated environment. Based on initial observations, the XCube implementation built on fvdb does not appear to exhibit these problems.

xiaoc57 · Apr 16 '25 10:04