AnalogConv2d fails when using TT-v2
Description
When I tried to use the TT-v2 algorithm to train the convolutional network, I got a Cuda error.
How to reproduce
After running the following main.py file, I got an error RuntimeError: CUDA_CALL Error 'an illegal memory access was encountered' at cuda_util.cu:653
# main.py
import torch
from aihwkit.nn import AnalogConv2d
from aihwkit.optim import AnalogSGD
from aihwkit.simulator.configs import build_config
from aihwkit.simulator.configs.devices import SoftBoundsReferenceDevice
DEVICE = 'cuda:0'
rpu_config = build_config('ttv2', device=SoftBoundsReferenceDevice())
model = AnalogConv2d(
in_channels=1, out_channels=3, kernel_size=5, rpu_config=rpu_config
).to(DEVICE)
optimizer = AnalogSGD(model.parameters(), lr=0.1)
optimizer.regroup_param_groups(model)
# if I use images = torch.empty((128, 1, 32, 32)), even the forward fails
images = torch.ones((128, 1, 32, 32))
images = images.to(DEVICE)
output = model(images)
loss = output.norm()**2
loss.backward()
optimizer.step()
Besides, if I create torch.empty() instead of torch.one(), the forward clause model(images) never stops, I guess there could be some endless loop happening.
Other information
- Pytorch version: 2.1.2
- Package version: aihwkit-gpu 0.9.0
- OS: Linux
- Python version: 3.10.13
- Conda version (or N/A) :23.11.0
Hi @Zhaoxian-Wu, Thanks for reporting this issue. what GPU were you using when you encountered this issue?
Hi @Zhaoxian-Wu for the CUDA memory problem, it looked like the problem had to do with how to set the DEVICE. If I set it by DEVICE = torch.cuda.set_device(0) instead of DEVICE = torch.device('cuda:0') then I did not see the problem. I found the solution in this issue
It tried the same technique with torch.empty and I did not see the hanging(looping) issue either. So this is torch problem. Let us know if you have any questions.
@Zhaoxian-Wu do you still have this issue. if not, we can close this. Please let us know
Hi @Zhaoxian-Wu for the CUDA memory problem, it looked like the problem had to do with how to set the DEVICE. If I set it by
DEVICE = torch.cuda.set_device(0)instead ofDEVICE = torch.device('cuda:0')then I did not see the problem. I found the solution in this issue
I tried this solution and it worked! It seems to be the issue from Pytorch. Thank you for your help @kkvtran! It is weird for the Pytorch community to leave this issue for such a long time.