🐛 Describe the bug
Epoch 0: 0%| | 0/8 [00:00<?, ?it/s]/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/utilities/data.py:85: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 16. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
warning_cache.warn(
/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:233: UserWarning: You called self.log('global_step', ...) in your training_step but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(
Traceback (most recent call last):
File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 804, in <module>
trainer.fit(model, data)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 598, in fit
call._call_and_handle_interrupt(
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 640, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in _run
results = self._run_stage()
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run_stage
self._run_train()
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run_train
self.fit_loop.run()
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
batch_output = self.batch_loop.run(kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 203, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 250, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 371, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1337, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1673, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 411, in optimizer_step
return self.precision_plugin.optimizer_step(
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 74, in optimizer_step
closure_result = closure()
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 150, in __call__
self._result = self.closure(*args, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 145, in closure
self._backward_fn(step_output.closure_loss)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 306, in backward_fn
self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1475, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 207, in backward
self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, optimizer_idx, *args, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 58, in backward
optimizer.backward(tensor)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py", line 184, in backward
self.module.backward(loss)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 266, in backward
loss.backward()
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/_tensor.py", line 388, in backward
return handle_torch_function(
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
result = torch_func_method(public_api, types, args, kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 170, in __torch_function__
ret = func(*args, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/home/tongange/ColossalAI/examples/images/diffusion/ldm/modules/diffusionmodules/util.py", line 141, in backward
output_tensors = ctx.run_function(*shallow_copies)
File "/home/tongange/ColossalAI/examples/images/diffusion/ldm/modules/attention.py", line 262, in _forward
x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tongange/ColossalAI/examples/images/diffusion/ldm/modules/attention.py", line 163, in forward
q = self.to_q(x)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: setStorage: sizes [640, 640], strides [1, 640], storage offset 54179840, and itemsize 2 requiring a storage size of 109178880 are out of bounds for storage of size 0
/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
Summoning checkpoint.
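As a side note on the first warning above: Lightning recommends passing `batch_size=` to `self.log(...)` so it does not have to infer the size from an ambiguous dict batch. A minimal sketch of that call shape, using a hypothetical `DummyModule` with a stub `log` method in place of a real `LightningModule` (the batch key `"image"` is an assumption about the diffusion dataloader):

```python
class DummyModule:
    """Stand-in for a LightningModule, to illustrate the logging call only."""

    def __init__(self):
        self.logged = {}

    def log(self, name, value, batch_size=None, **kwargs):
        # LightningModule.log accepts batch_size= to skip size inference.
        self.logged[name] = (value, batch_size)

    def training_step(self, batch, batch_idx):
        loss = 0.123  # placeholder loss value
        # Derive the batch size from a known entry in the batch dict
        # instead of letting Lightning guess from the collection.
        bs = len(batch["image"])
        self.log("train/loss", loss, batch_size=bs)
        return loss


m = DummyModule()
m.training_step({"image": [0] * 16, "caption": ["a"] * 16}, batch_idx=0)
print(m.logged["train/loss"])  # → (0.123, 16)
```

This only silences the inference warning; it is unrelated to the `setStorage` RuntimeError itself.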
Although this exception is thrown, my first GPU keeps running at full utilization, so it does not seem to be interrupted by the exception. Pressing Ctrl+C does not stop it either. I would like to know why: does this exception not actually affect training, or is the process only pretending to make progress?
I have four RTX 3090 GPUs.
Environment
I trained following the example below:
https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion
pytorch-lightning: 1.9.0.dev0
colossalai: 0.1.10+torch1.12cu11.3
model: stable-diffusion-v1-4
launched with:
python main.py --logdir /tmp/ -t -b configs/Teyvat/train_colossalai_teyvat.yaml