[Feature] Memory resource predictor for primitives
Hi, I'd like to request a feature.
Context: In the project I develop (pytket-cutensornet), we make extensive use of cuTensorNet's primitive operations on tensors: tensor.decompose (both QRMethod and SVDMethod) and contract (often applied to only two or three tensors). We have encountered multiple cases where we hit OutOfMemory errors, and we would like to improve the user experience around them. To do so, we need to be able to detect whether an OOM error would occur if we applied one of these primitives. With that information, we may sometimes be able to prevent the OOM error, for instance by truncating tensors more aggressively before applying the primitive. Conceptually, this must be possible: if I set CUTENSORNET_LOG_LEVEL=6, I can see how much workspace memory each primitive requests from the GPU, and I can keep track of how much memory I am using to store my tensor network on the GPU.
Feature request: A method for the user to obtain an upper bound on the GPU memory used by the primitives contract, tensor.decompose (both QRMethod and SVDMethod) and experimental.contract_decompose on the inputs given by the user. Such a method should not run the primitive itself, only report the memory resources it would require. Alternatively, I'd be happy with an optional memory_budget: int parameter passed to these primitives so that, if the operation requires more than memory_budget, it is not applied and the user is informed it was skipped (without erroring out; or, if it does error, it throws an exception that can be handled at the Python level to recover from it).
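To make the second option concrete, here is a hypothetical sketch of the behaviour I have in mind. Note that neither the `memory_budget` parameter nor `safe_decompose` exists in cuTensorNet; the names and the workspace estimate are purely illustrative:

```python
# Hypothetical sketch of the requested `memory_budget` option: skip the
# primitive (without erroring out) when its workspace requirement exceeds
# the budget, and tell the caller it was skipped.

def safe_decompose(estimated_workspace_bytes, memory_budget):
    """Return (applied, result); `applied` is False when the call was skipped."""
    if estimated_workspace_bytes > memory_budget:
        return False, None  # skipped; caller can truncate tensors and retry
    return True, "decomposed"  # stand-in for the actual QR/SVD result

applied, result = safe_decompose(2 * 1024**3, memory_budget=1024**3)
assert not applied  # a 2 GiB workspace exceeds a 1 GiB budget, so it is skipped
```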
If this sounds interesting, I'd be happy to provide more details of my use case and refine the feature request.
Thanks for the clear description of the feature request. :) I will discuss it with the team.
NetworkOptions.memory_limit is meant to act as the budget guide, but it appears there may be a bug on our side: we don't throw a MemoryError in decompose/contract_decompose when the required memory exceeds the budget. Would it be sufficient if we threw this MemoryError with a message reporting the actual required workspace size? Then you may be able to resolve it with try/except handling?
Ah, I had not seen NetworkOptions.memory_limit, thanks for pointing that out!
> Would it be sufficient if we threw this MemoryError with a message reporting the actual required workspace size?
As long as it is guaranteed that the tensors were not modified if MemoryError is thrown, then this would indeed work for me. Receiving the actual required workspace size in the error message would be very useful.
What is the current behaviour for decompose when a memory_limit is set? I'm wondering if there is a workaround that I could play with while I wait for the bugfix (and addition of required space in the message).
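For reference, the recovery pattern I have in mind would look roughly like the sketch below. Both functions are stand-ins (they are not cuTensorNet APIs, and the workspace estimate is made up); in real code the inner call would be tensor.decompose with options.memory_limit set, relying on the guarantee that the tensors are untouched when the error is raised:

```python
def decompose_with_limit(bond_dim, memory_limit):
    # Stand-in for tensor.decompose honouring options.memory_limit: pretend
    # the workspace grows with the bond dimension and raise when over budget.
    required = bond_dim * 10  # hypothetical workspace estimate, in MiB
    if required > memory_limit:
        raise MemoryError(f"needs {required} MiB, limit is {memory_limit} MiB")
    return bond_dim  # stand-in for the decomposition result

def decompose_or_truncate(bond_dim, memory_limit):
    # Try the primitive; on OOM, truncate more aggressively and retry.
    while bond_dim > 0:
        try:
            return decompose_with_limit(bond_dim, memory_limit)
        except MemoryError:
            bond_dim //= 2  # more aggressive truncation before retrying
    raise MemoryError("cannot fit decomposition even at bond dimension 1")

assert decompose_or_truncate(100, memory_limit=500) == 50
```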
The current decompose doesn't actually check options.memory_limit; this is a bug on our side. Ideally we should check the required workspace as we do in contract, see here.
For decompose, one just needs to insert the memory check here
For contract_decompose, it would be here
Thanks! Is there an expected date for a release including the bugfix and adding the extra info in the MemoryError message?
Our next release is planned for around the end of Oct or early Nov. Please stay tuned!
This has been fixed by the new MemoryLimitExceeded exception class in cuquantum-python 24.11: https://docs.nvidia.com/cuda/cuquantum/24.11.0/python/api/generated/cuquantum.MemoryLimitExceeded.html#cuquantum.MemoryLimitExceeded. Closing this issue.
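For later readers, the resulting pattern looks roughly like the sketch below. The cuTensorNet call and the exception class are replaced by local stand-ins so the snippet runs without a GPU; in real code you would call the decompose primitive with NetworkOptions(memory_limit=...) set and catch cuquantum.MemoryLimitExceeded instead, whose message reports the required workspace size (the `limit`/`requirement` attributes here are illustrative, not part of the real API):

```python
class MemoryLimitExceeded(MemoryError):
    # Local stand-in for cuquantum.MemoryLimitExceeded.
    def __init__(self, limit, requirement):
        super().__init__(f"needs {requirement} bytes, limit is {limit} bytes")
        self.limit = limit
        self.requirement = requirement

def decompose(memory_limit):
    # Stand-in for the decompose primitive with a memory limit set.
    required = 2 * 1024**3  # pretend the SVD needs 2 GiB of workspace
    if required > memory_limit:
        raise MemoryLimitExceeded(memory_limit, required)
    return "decomposed"

try:
    decompose(memory_limit=1024**3)
except MemoryLimitExceeded as exc:
    # Recover at the Python level, e.g. truncate tensors more and retry.
    shortfall = exc.requirement - exc.limit
assert shortfall == 1024**3
```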