DeepRobust

CUDA memory cannot be emptied

Open Kidleyh opened this issue 4 years ago • 10 comments

Hello, I used deeprobust 0.2.2 on Windows 11 and everything was fine. But yesterday I used deeprobust 0.2.4 on Ubuntu 16.04 LTS, and the function self.inner_train(self, features, adj_norm, idx_train, idx_unlabeled, labels) cannot empty its CUDA memory after epoch 0; after only 200 perturbations the GPU is out of memory. For example, on Windows at epoch 1, the allocated memory is 561737216 bytes before inner_train and 511405056 after it. But on Ubuntu at epoch 1, the allocated memory is 561737216 bytes before inner_train and 586903040 after it. I debugged for an hour but could not find the error, so I have to open this issue. Looking forward to your reply, thank you!
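For reference, the before/after numbers above can be collected with a small helper like the following. This is only an illustrative sketch: the probe argument stands in for torch.cuda.memory_allocated, so the helper itself does not require a GPU.

```python
from contextlib import contextmanager

@contextmanager
def report_memory(probe, label, log):
    """Record memory reported by `probe` before and after a block.

    `probe` is any zero-argument callable returning allocated bytes,
    e.g. torch.cuda.memory_allocated on a CUDA machine.
    """
    before = probe()
    yield
    after = probe()
    # delta > 0 after every epoch would indicate the leak described above
    log.append((label, before, after, after - before))
```

On a CUDA machine you would pass torch.cuda.memory_allocated as the probe and wrap the self.inner_train(...) call in the with block; a delta that stays positive epoch after epoch matches the behavior reported here.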

Kidleyh avatar Dec 02 '21 07:12 Kidleyh

I also tried deeprobust==0.2.2 on Ubuntu 16.04 LTS; it has the same problem as 0.2.4.

Kidleyh avatar Dec 02 '21 07:12 Kidleyh

Can you provide more details on this bug? Did you try examples/test_mettack.py?

ChandlerBang avatar Dec 02 '21 19:12 ChandlerBang

Yes, I also tried examples/test_mettack.py, and it has the same problem with self.inner_train. I hit this bug again on Windows 11 when using torch==1.10.0, so I think it may be a problem with torch==1.10.0. I haven't tried torch<1.10.0 on Ubuntu 16.04 LTS, because when I run pip install deeprobust, your code requires me to install torch==1.10.0.
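One way to test an older torch on the same machine, as an illustrative workaround (not an official recommendation), is to install the pinned torch first and then install deeprobust without letting pip resolve its dependencies:

```shell
# Hypothetical workaround: pin torch==1.8.0 before installing deeprobust,
# then skip dependency resolution so pip does not upgrade torch.
pip install torch==1.8.0
pip install deeprobust --no-deps
# any remaining requirements (scipy, numpy, ...) can then be installed manually
```

With --no-deps, pip leaves the already-installed torch==1.8.0 in place instead of replacing it with the pinned 1.10.0.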

Kidleyh avatar Dec 03 '21 01:12 Kidleyh

Hi, I just tried examples/test_mettack.py with torch==1.10.0 and it works fine for me. Can you provide more details on the error information? (by copying the whole error message)

ChandlerBang avatar Dec 03 '21 02:12 ChandlerBang

The error message may not tell us anything useful, because it is just a CUDA out-of-memory error caused by self.inner_train not emptying its memory normally. I successfully ran examples/test_mettack.py with torch==1.8.0 on Ubuntu 16.04 LTS without any problem. With torch==1.10.0 on Windows 11, the error message is:

RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 6.00 GiB total capacity; 3.42 GiB already allocated; 0 bytes free; 4.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

(The torch==1.10.0 install on Ubuntu 16.04 LTS has already been uninstalled.)

Kidleyh avatar Dec 03 '21 03:12 Kidleyh

Now I can run the code normally using torch==1.8.0, but I'm afraid the code has some issues with torch==1.10.0. Anyway, I think you can check self.inner_train in Metattack. I am very grateful for your work and your help. Thank you!

Kidleyh avatar Dec 03 '21 03:12 Kidleyh

Ok, let me see if I can figure it out.

ChandlerBang avatar Dec 03 '21 03:12 ChandlerBang

Hi, sorry to bother you again. I also want to know how much CUDA memory is enough for training MetaApprox to attack Pubmed. I have attacked Cora, Citeseer, and Polblogs successfully, but when I attack Pubmed, the error message shows CUDA out of memory. By the way, my GPU has 12 GB of memory.

Kidleyh avatar Dec 04 '21 01:12 Kidleyh

I am not sure about the exact GPU memory usage for attacking Pubmed, but it worked for me on a GPU with 32 GB. The memory complexity of Metattack is very high since the search space is quadratic in the number of nodes.
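To see why Pubmed is so much heavier than Cora, here is a rough back-of-the-envelope estimate, assuming dense float32 tensors (the actual usage depends on the implementation, and several such n x n tensors, e.g. the adjacency, its gradient, and meta-gradients across inner steps, can be alive at once):

```python
def dense_matrix_gib(n, bytes_per_elem=4):
    """GiB needed for a single dense n x n tensor (float32 by default)."""
    return n * n * bytes_per_elem / 2**30

# Approximate node counts of the datasets mentioned above.
for name, n in [("cora", 2708), ("pubmed", 19717)]:
    print(name, round(dense_matrix_gib(n), 3))
# cora 0.027
# pubmed 1.448
```

At roughly 1.45 GiB per dense n x n tensor for Pubmed, a handful of simultaneous copies plus the model itself can plausibly exhaust 12 GB while still fitting in 32 GB.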

You may turn to some scalable attack instead, e.g., https://github.com/DSE-MSU/DeepRobust/blob/master/deeprobust/graph/targeted_attack/sga.py.

ChandlerBang avatar Dec 04 '21 23:12 ChandlerBang

Hi! I've encountered the same problem. Metattack works fine with the following environment on ubuntu 16.04.12:

numpy==1.18.1
scipy==1.6.2
torch==1.8.1
torch_geometric==1.6.3
torch_scatter==2.0.9
torch_sparse==0.6.12

But it shows CUDA out of memory with the latest version of torch on ubuntu 20.04.1:

deeprobust==0.2.6
numpy==1.23.3
scikit_learn==1.1.3
scipy==1.8.1
torch==1.12.1
torch_geometric==2.1.0
torch_sparse==0.6.15

I also find that self.inner_train does not empty the gradients. I hope my case will be helpful in solving the problem.
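A minimal, framework-agnostic sketch of the kind of cleanup this points at (the names are illustrative, not DeepRobust's actual API; with PyTorch you would follow it with torch.cuda.empty_cache()):

```python
import gc

def release_grads(params):
    """Drop stale .grad references so the backing tensors can be freed.

    `params` is any iterable of objects carrying a .grad attribute,
    e.g. the weights an inner training loop iterates over.
    """
    for p in params:
        p.grad = None  # release the reference entirely, not just zero it
    gc.collect()
```

Setting .grad = None (rather than zeroing in place) releases the gradient tensors entirely, which is what the caching allocator needs in order to reuse that memory.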

Leirunlin avatar Nov 27 '22 11:11 Leirunlin