
[Feature] Memory resource predictor for primitives

Open PabloAndresCQ opened this issue 1 year ago • 6 comments

Hi, I'd like to request a feature.

Context: In the project I develop (pytket-cutensornet), we make extensive use of cuTensorNet's primitive operations on tensors: tensor.decompose (both QRMethod and SVDMethod) and contract (often applied to only two or three tensors). We have encountered multiple cases where we hit OutOfMemory errors, and we would like to improve the user experience around these. To do so, we need to be able to detect whether an OOM error would occur if we were to apply one of these primitives. With this information, we may sometimes be able to prevent the OOM error, for instance by truncating tensors more aggressively before applying the primitive. Conceptually, this must be possible: if I set CUTENSORNET_LOG_LEVEL=6, I can see how much workspace memory each primitive requests from the GPU, and I can keep track of how much memory I am using to store my tensor network on the GPU.

Feature request: a method for the user to obtain an upper bound on the GPU memory used by the primitives contract, tensor.decompose (both QRMethod and SVDMethod) and experimental.contract_decompose for the inputs given by the user. Such a method should not run the primitive itself, only report the memory resources it would require. Alternatively, I'd be happy with an optional memory_budget: int parameter passed to these primitives so that, if the operation requires more than memory_budget, it is not applied and the user is informed that it was skipped (either without erroring out, or by throwing an exception that can be handled at the Python level to recover from it).
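To make the second variant concrete, here is a hedged sketch of one possible semantics for the requested memory_budget parameter. This is not cuQuantum API; every name here (bytes_needed, contract_with_budget, apply_fn) is hypothetical, and the estimate only counts the dense output tensor, not the library's internal workspace:

```python
import math

def bytes_needed(shape, itemsize=16):
    """Upper-bound estimate for one dense tensor (complex128 = 16 bytes)."""
    return math.prod(shape) * itemsize

def contract_with_budget(apply_fn, out_shape, memory_budget):
    """Skip the primitive (and say so) when the estimated output alone
    would already exceed the budget; otherwise run it."""
    if bytes_needed(out_shape) > memory_budget:
        return None, False  # skipped: the caller may truncate and retry
    return apply_fn(), True
```

Returning a "skipped" flag matches the non-erroring variant of the request; the other variant would raise a catchable exception instead.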

If this sounds interesting, I'd be happy to provide more details of my use case and refine the feature request.

PabloAndresCQ avatar Aug 29 '24 15:08 PabloAndresCQ

Thanks for the clear description of the feature request. :) I will discuss it with the team.

daniellowell avatar Aug 29 '24 16:08 daniellowell

NetworkOptions.memory_limit is meant to act as the budget guard, but it appears there may be a bug: we don't throw a MemoryError in decompose/contract_decompose when the required memory exceeds the budget. Would it be sufficient if we threw this MemoryError with a message stating the actual required workspace size? Then you may be able to resolve it with try/except handling.
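The try/except recovery suggested here could look like the following sketch. A stand-in decompose_fn takes the place of a real cuTensorNet call, and all names are illustrative, not cuQuantum API:

```python
def decompose_with_fallback(decompose_fn, truncate_fn, tensor, max_retries=3):
    """Retry a decomposition, truncating the input more aggressively
    after each MemoryError, up to max_retries attempts."""
    for _ in range(max_retries):
        try:
            return decompose_fn(tensor)
        except MemoryError:
            tensor = truncate_fn(tensor)  # shrink before retrying
    raise MemoryError("still over the memory budget after truncation")
```

This pattern only makes sense if the primitive leaves its operands unmodified when it raises, which is exactly the guarantee asked about below.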

yangcal avatar Aug 29 '24 23:08 yangcal

Ah, I had not seen NetworkOptions.memory_limit, thanks for pointing that out!

Would it be sufficient if we throw this MemoryError with message on the actual required workspace size?

As long as it is guaranteed that the tensors were not modified if MemoryError is thrown, then this would indeed work for me. Receiving the actual required workspace size in the error message would be very useful.

What is the current behaviour of decompose when a memory_limit is set? I'm wondering if there is a workaround that I could play with while I wait for the bugfix (and the addition of the required size to the message).
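One rough interim workaround is to bound the output memory of a QR decomposition by hand from the matrix shape and compare it against the memory you know is free. This is a hypothetical helper, and it does not account for cuTensorNet's internal workspace, which only the library itself can report:

```python
def qr_output_bytes(m, n, itemsize=16):
    """Bytes for the Q (m x k) and R (k x n) factors of an m x n matrix,
    with k = min(m, n); complex128 assumed (16 bytes per element)."""
    k = min(m, n)
    return (m * k + k * n) * itemsize
```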

PabloAndresCQ avatar Aug 30 '24 09:08 PabloAndresCQ

The current decompose doesn't actually check options.memory_limit on our side (this is a bug). Ideally we should check the required workspace the way we do in contract, see here.

For decompose, one just needs to insert the memory check here

For contract_decompose, it would be here

yangcal avatar Aug 30 '24 16:08 yangcal

Thanks! Is there an expected date for a release that includes the bugfix and adds the extra info to the MemoryError message?

PabloAndresCQ avatar Sep 05 '24 12:09 PabloAndresCQ

Our next release is planned for around the end of October or early November. Please stay tuned!

yangcal avatar Sep 06 '24 18:09 yangcal

This has been fixed by the new MemoryLimitExceeded exception class in cuquantum-python 24.11: https://docs.nvidia.com/cuda/cuquantum/24.11.0/python/api/generated/cuquantum.MemoryLimitExceeded.html#cuquantum.MemoryLimitExceeded. Closing this issue.
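For reference, the resulting handling pattern could look like the sketch below. A stand-in exception class is defined here so the snippet is self-contained; with cuquantum-python 24.11+ one would catch cuquantum.MemoryLimitExceeded instead, and the other names are illustrative:

```python
class MemoryLimitExceeded(MemoryError):
    """Stand-in for cuquantum.MemoryLimitExceeded (24.11+)."""

def run_with_recovery(op, on_over_limit):
    """Run op(); if it exceeds the memory limit, fall back instead of crashing."""
    try:
        return op()
    except MemoryLimitExceeded:
        return on_over_limit()
```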

yangcal avatar Jul 17 '25 22:07 yangcal