Add empty_cache for releasing GPU memory
When I run the code sample below (in a Jupyter notebook), my GPU monitoring shows that the memory allocated during execution is still held afterwards, even though I've done everything I know to do to have the intermediate tensors disposed:
#r "nuget:TorchSharp-cuda-linux"
#r "nuget:TorchSharp"
open TorchSharp
let test () =
use d = torch.NewDisposeScope()
use tt = torch.randn(50000,50000,device=torch.device("cuda:7"))
tt.MoveToOuterDisposeScope()
let test2() =
use d2 = torch.NewDisposeScope()
use ttt = test()
()
let empty_result = test2()
I get a similar result when I do the same experiment in Python with PyTorch:
```python
import torch

def test():
    tt = torch.randn(50000, 50000, device=torch.device('cuda:7'))
    return tt  # fixed: was `return ttt`, but the local variable is `tt`

def test2():
    ttt = test()
    return 0

empty_result = test2()
```
In Python, though, I can free the memory by calling `torch.cuda.empty_cache()`.

@NiklasGustafsson says:
> The underlying library does keep a high-watermark of allocated GPU memory, so even when you dispose of tensors, the overall allocation won't necessarily go down. I'll see how I can get empty_cache() implemented.
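To make the quoted behavior concrete, here is a minimal C# sketch; the device index and tensor size are illustrative, and it assumes a CUDA-enabled TorchSharp install:

```csharp
using TorchSharp;

// Allocate a large CUDA tensor, then dispose it. The memory goes back to
// libtorch's caching allocator, but the per-process allocation that tools
// like nvidia-smi report stays at the high-watermark, because the allocator
// keeps the freed blocks cached for reuse.
var t = torch.randn(new long[] { 50000, 50000 }, device: torch.device("cuda:7"));
t.Dispose();
// The tensor is gone as far as TorchSharp is concerned, yet the GPU memory
// is still reserved by the process; only an empty_cache() call (not available
// in TorchSharp at the time of this thread) would hand it back to the driver.
```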
After some digging, I have found the function that would be needed to implement 'empty_cache()'; it's exported from torch_cuda_cpp.dll/.so. We don't statically link against this library when the native component of TorchSharp is built, so we would have to find it at runtime by going looking for the DLL and using the mangled name to import the function. Perfectly doable, but very ugly code.
I'll keep it on the backlog, but it's probably not going to be the highest priority.
Thanks, Niklas, for running that down. It does sound like the solution would be pretty gnarly. We're able to do what we need to as things stand, so we'd be glad to have the feature if it ever comes out, but we're not in a hurry for it.
BTW - I'm pretty new to TorchSharp but I'm really enjoying working in it. Thanks so much for all you do to make it happen!
@NiklasGustafsson, looks like we need to port this code to LibTorchSharp first.
That's the CUDA backend (or parts of it); porting it would mean duplication.
It would be better to hack it by dynamically loading the CUDA backend and finding the entry point in it when we know we're loading that backend (in Torch.cs). The entry point will have a mangled name, which complicates things, since mangling schemes differ between compilers.
As stated above:

> After some digging, I have found the function that would be needed to implement 'empty_cache()'; it's exported from torch_cuda_cpp.dll/.so. We don't statically link against this library when the native component of TorchSharp is built, so we would have to find it at runtime by going looking for the DLL and using the mangled name to import the function. Perfectly doable, but very ugly code.
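For concreteness, a rough sketch of what that hack could look like in C#, using System.Runtime.InteropServices.NativeLibrary. The library path and the mangled symbol name are assumptions: the symbol shown is the Itanium-ABI (GCC/Clang) mangling of `c10::cuda::CUDACachingAllocator::emptyCache()`, MSVC builds use a different scheme, and the exported name can vary across libtorch versions, so it would have to be verified per platform.

```csharp
using System;
using System.Runtime.InteropServices;

static class CudaCacheHack
{
    // ASSUMPTION: Itanium-ABI (GCC/Clang) mangling of
    // `void c10::cuda::CUDACachingAllocator::emptyCache()`. Verify the real
    // exported name first (e.g. `nm -D libtorch_cuda.so | grep emptyCache`).
    private const string MangledEmptyCache =
        "_ZN3c104cuda20CUDACachingAllocator10emptyCacheEv";

    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    private delegate void EmptyCacheDelegate();

    public static void EmptyCache(string cudaBackendPath)
    {
        // Load (or bump the ref-count of) the CUDA backend library, e.g.
        // libtorch_cuda.so on Linux or torch_cuda_cpp.dll on Windows.
        IntPtr lib = NativeLibrary.Load(cudaBackendPath);
        try
        {
            if (!NativeLibrary.TryGetExport(lib, MangledEmptyCache, out IntPtr fn))
                throw new EntryPointNotFoundException(MangledEmptyCache);

            // Marshal the raw export to a callable delegate and invoke it.
            Marshal.GetDelegateForFunctionPointer<EmptyCacheDelegate>(fn)();
        }
        finally
        {
            NativeLibrary.Free(lib);
        }
    }
}
```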
I think this is the bridge between the CUDA backend and Python; if you open the file, you can see the PyObject binding that implements it:
https://pytorch.org/docs/stable/_modules/torch/cuda/memory.html#empty_cache
https://github.com/pytorch/pytorch/blob/main/torch/_C/init.pyi.in#L1545
https://github.com/pytorch/pytorch/blob/main/torch/csrc/cuda/Module.cpp#L1422
The CUDA backend is here:
https://github.com/pytorch/pytorch/blob/main/torch/csrc/api/include/torch/cuda.h
https://github.com/pytorch/pytorch/blob/main/torch/csrc/api/src/cuda.cpp
Is there a workaround for this?
Nope. Fixing it will require changing how we build TorchSharp, since the native entry point is not in the backend-independent libtorch C API. That will take more engineering resources than we currently have assigned.
Any fix or workaround for torch.cuda.empty_cache() so far?
The workaround is to load the native binaries yourself and call the empty_cache entry point directly, as has been done here, for example: https://github.com/K024/llm-sharp
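For illustration, a call site using the hypothetical CudaCacheHack helper sketched earlier in this thread might look like this; the library path is an assumption and depends on your platform and package layout:

```csharp
// Hypothetical usage of the CudaCacheHack sketch above. The path is
// illustrative; point it at wherever the CUDA backend library actually
// lives in your environment (libtorch_cuda.so / torch_cuda_cpp.dll).
CudaCacheHack.EmptyCache("/path/to/libtorch_cuda.so");
```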
Thank you for your response @K1T00. I will check it out.