
[BUG]: RuntimeError: setStorage: sizes [640, 640], strides [1, 640], storage offset 54179840, and itemsize 2 requiring a storage size of 109178880 are out of bounds for storage of size 0 /root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check. rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.") Summoning checkpoint.

Open Alfred-Duncan opened this issue 3 years ago • 0 comments

🐛 Describe the bug

```
Epoch 0:   0%| | 0/8 [00:00<?, ?it/s]
/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/utilities/data.py:85: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 16. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
  warning_cache.warn(
/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:233: UserWarning: You called self.log('global_step', ...) in your training_step but the value needs to be floating point. Converting it to torch.float32.
  warning_cache.warn(
Traceback (most recent call last):
  File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 804, in <module>
    trainer.fit(model, data)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 598, in fit
    call._call_and_handle_interrupt(
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 640, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in _run
    results = self._run_stage()
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run_stage
    self._run_train()
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run_train
    self.fit_loop.run()
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 203, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 250, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 371, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1337, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1673, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 411, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 74, in optimizer_step
    closure_result = closure()
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 150, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 145, in closure
    self._backward_fn(step_output.closure_loss)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 306, in backward_fn
    self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1475, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 207, in backward
    self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, optimizer_idx, *args, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 58, in backward
    optimizer.backward(tensor)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py", line 184, in backward
    self.module.backward(loss)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 266, in backward
    loss.backward()
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/_tensor.py", line 388, in backward
    return handle_torch_function(
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 170, in __torch_function__
    ret = func(*args, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/home/tongange/ColossalAI/examples/images/diffusion/ldm/modules/diffusionmodules/util.py", line 141, in backward
    output_tensors = ctx.run_function(*shallow_copies)
  File "/home/tongange/ColossalAI/examples/images/diffusion/ldm/modules/attention.py", line 262, in _forward
    x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tongange/ColossalAI/examples/images/diffusion/ldm/modules/attention.py", line 163, in forward
    q = self.to_q(x)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: setStorage: sizes [640, 640], strides [1, 640], storage offset 54179840, and itemsize 2 requiring a storage size of 109178880 are out of bounds for storage of size 0
/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
  rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
Summoning checkpoint.
```
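The numbers in the `RuntimeError` are internally consistent, which suggests the view itself is fine and the problem is that the weight's underlying storage has been released (0 bytes). A small check of the arithmetic (the ZeRO/Gemini-sharding interpretation is my assumption, not confirmed in this thread):

```python
# Verify the storage requirement quoted in the error message.
sizes, strides = [640, 640], [1, 640]
offset, itemsize = 54179840, 2  # fp16 elements are 2 bytes

# A strided view touches elements up to offset + sum((n-1)*stride);
# the storage must hold (last_index + 1) elements.
last_index = offset + sum((n - 1) * s for n, s in zip(sizes, strides))
required_bytes = (last_index + 1) * itemsize
print(required_bytes)  # -> 109178880, exactly what the message demands
```

Since 109178880 bytes are required but the storage reports size 0, the `to_q` weight tensor was materialized at some point and later had its storage freed, and the checkpointed re-forward during backward then hit the empty storage.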

Although this exception is thrown, my first GPU keeps running at full utilization, as if training were never interrupted, and pressing Ctrl+C does not stop it. I would like to know why. Is the exception actually not affecting training, or is the process only pretending to make progress? I have four RTX 3090 GPUs.
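One likely explanation (my assumption, not confirmed here): in a multi-rank run, the rank that raised the `RuntimeError` dies, but the surviving ranks stay blocked inside a NCCL collective, busy-polling the GPU at 100%, and Ctrl+C in the launcher often does not reach those workers. PyTorch's NCCL backend can abort the peers instead of hanging if this environment variable is set before the process group is initialized:

```python
import os

# Must be set before torch.distributed.init_process_group (or exported in the
# shell before launching main.py). With it, a failure on one rank tears down
# the other ranks instead of leaving them spinning in a collective.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
print(os.environ["NCCL_ASYNC_ERROR_HANDLING"])  # -> 1
```

Equivalently from the shell: `NCCL_ASYNC_ERROR_HANDLING=1 python main.py ...`. Stuck workers that are already spinning have to be killed manually (e.g. by PID from `nvidia-smi`).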

Environment

I trained following https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion with:

- pytorch-lightning 1.9.0.dev0
- colossalai 0.1.10+torch1.12cu11.3
- stable-diffusion-v1-4

and I launch training with `python main.py --logdir /tmp/ -t -b configs/Teyvat/train_colossalai_teyvat.yaml`.

Alfred-Duncan avatar Dec 22 '22 03:12 Alfred-Duncan