
RuntimeError: CUDA error: out of memory

Open away-back opened this issue 1 year ago • 0 comments

May I ask which kind of GPU was used to train this project? I am using an NVIDIA RTX A6000 with 47.5 GB of memory, but it still reports out of memory.

The following is my full error output:

```
[2025-02-18 16:41:01,068][smart_tree.model.train][INFO] - Train Dataset Size: 480
[2025-02-18 16:41:01,068][smart_tree.model.train][INFO] - Validation Dataset Size: 60
[2025-02-18 16:41:01,068][smart_tree.model.train][INFO] - Test Dataset Size: 60
/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py:214: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  amp_ctx = torch.cuda.amp.autocast() if cfg.fp16 else contextlib.nullcontext()
/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py:215: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  scaler = torch.cuda.amp.grad_scaler.GradScaler()
Epoch:   0%|          | 0/1 [00:27<?, ?it/s]
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py", line 232, in main
    val_tracker = eval_epoch(
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/train.py", line 74, in eval_epoch
    for sp_input, targets, mask, fn in tqdm(
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/model/helper.py", line 14, in get_batch
    for (feats, target_feats), coords, mask, filenames in dataloader:
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 757, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/anaconda3/envs/smart-tree/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/dataset/dataset.py", line 74, in __getitem__
    cld = self.load(filename)
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/dataset/dataset.py", line 68, in load
    self.cache[filename] = self.load_cloud(filename).pin_memory()
  File "/home/huangyongchang/OpenProject/smart-tree/smart_tree/data_types/cloud.py", line 112, in pin_memory
    rgb = self.rgb.pin_memory() if self.rgb is not None else None
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

away-back · Feb 18 '25 08:02