[BUG]: Dreambooth training example fails with PyTorch nightly build
🐛 Describe the bug
The Dreambooth training example fails on the PyTorch nightly build, but works fine on the current stable build (1.13). The run aborts with:
AttributeError: module 'torch._C' has no attribute 'DisableTorchFunction'. Did you mean: '_EnableTorchFunction'?
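For reference, the rename can be confirmed directly in a Python session on the affected build. This is an illustrative check, not part of the original report; the DisableTorchFunctionSubclass name is an assumption based on the upstream rename and should be verified on your build.

    import torch

    print(torch.__version__)
    # The old name is missing on this nightly (this is exactly what the traceback shows).
    print(hasattr(torch._C, "DisableTorchFunction"))
    # 2.0 builds are expected to expose the subclass-scoped variant instead (assumption).
    print(hasattr(torch._C, "DisableTorchFunctionSubclass"))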
❯ ./colossalai.sh
2023-01-11 20:30:41.120169: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[01/11/23 20:30:43] INFO colossalai - colossalai - INFO: /home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[01/11/23 20:30:45] INFO colossalai - colossalai - INFO: /home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/initialize.py:117 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
INFO colossalai - colossalai - INFO: /home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth_colossalai.py:427 main
INFO colossalai - colossalai - INFO: Loading tokenizer from pretrained model
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
INFO colossalai - colossalai - INFO: /home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth_colossalai.py:439 main
INFO colossalai - colossalai - INFO: Loading text_encoder from /home/alpha/Storage/AIModels/diffusers/pmerge
INFO colossalai - colossalai - INFO: /home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth_colossalai.py:447 main
INFO colossalai - colossalai - INFO: Loading AutoencoderKL from /home/alpha/Storage/AIModels/diffusers/pmerge
[01/11/23 20:30:46] INFO colossalai - colossalai - INFO: /home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth_colossalai.py:454 main
INFO colossalai - colossalai - INFO: Loading UNet2DConditionModel from /home/alpha/Storage/AIModels/diffusers/pmerge
The config attributes {'class_embed_type': None, 'mid_block_type': 'UNetMidBlock2DCrossAttn', 'resnet_time_scale_shift': 'default'} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
[01/11/23 20:30:46] INFO colossalai - ProcessGroup - INFO: /home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/tensor/process_group.py:24 get
INFO colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]
Traceback (most recent call last):
File "/home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth_colossalai.py", line 677, in <module>
main(args)
File "/home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth_colossalai.py", line 456, in main
unet = UNet2DConditionModel.from_pretrained(args.pretrained_model_name_or_path,
File "/home/alpha/.local/lib/python3.10/site-packages/diffusers/modeling_utils.py", line 519, in from_pretrained
model = cls.from_config(config, **unused_kwargs)
File "/home/alpha/.local/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 210, in from_config
model = cls(**init_dict)
File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/utils/model/utils.py", line 54, in wrapper
f(module, *args, **kwargs)
File "/home/alpha/.local/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 567, in inner_init
init(self, *args, **init_kwargs)
File "/home/alpha/.local/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 122, in __init__
self.conv_in = nn.Conv2d(in_channels, block_out_channels[0], kernel_size=3, padding=(1, 1))
File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/utils/model/utils.py", line 54, in wrapper
f(module, *args, **kwargs)
File "/home/alpha/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 450, in __init__
super(Conv2d, self).__init__(
File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/utils/model/utils.py", line 55, in wrapper
self._post_init_method(module, *args, **kwargs)
File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/utils/model/colo_init_context.py", line 130, in _post_init_method
setattr(submodule, param_name, colo_param)
File "/home/alpha/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1609, in __setattr__
self.register_parameter(name, value)
File "/home/alpha/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 559, in register_parameter
elif param.grad_fn:
File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/tensor/colo_parameter.py", line 91, in __torch_function__
return super().__torch_function__(func, types, args, kwargs)
File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/tensor/colo_tensor.py", line 182, in __torch_function__
with torch._C.DisableTorchFunction():
AttributeError: module 'torch._C' has no attribute 'DisableTorchFunction'. Did you mean: '_EnableTorchFunction'?
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11711) of binary: /usr/bin/python
Traceback (most recent call last):
File "/home/alpha/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 779, in main
run(args)
File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 770, in run
elastic_launch(
File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dreambooth_colossalai.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-01-11_20:30:49
host : Asus-GA401IV
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 11711)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Environment
torch==2.0.0.dev20230111+cu118, CUDA 11.8, Python 3.10.9, xformers (git)
The laptop I am testing this on:
alpha@Asus-GA401IV
OS: CachyOS Linux x86_64
Host: ROG Zephyrus G14 GA401IV_GA401IV (1.0)
Kernel: 6.1.4-1-cachyos-lto
Uptime: 5 mins
Packages: 1195 (pacman)
Shell: fish 3.5.1
Resolution: 3840x2160 @ 60Hz
DE: KDE Plasma 5.26.5
WM: KWin (Wayland)
WM Theme: Breeze
Theme: Lightly (CachyOSNord) [QT], cachyos-nor]
Icons: breeze-dark [QT], breeze-dark [GTK2/3/4]
Font: Noto Sans (10pt) [QT], Noto Sans (10pt) ]
Cursor: capitaine (24px)
Terminal: alacritty
Terminal Font: monospace (12pt)
CPU: AMD Ryzen 9 4900HS (16) @ 3 GHz
GPU: AMD Renoir
GPU: NVIDIA GeForce RTX 2060 Max-Q
Memory: 3.04 GiB / 15.05 GiB (20%)
Disk (/): 114 GiB / 139 GiB (82%)
Disk (/home/alpha/Storage): 279 GiB / 344 GiB )
Disk (/run/media/alpha/External): 140 GiB / 93]
Disk (/windows): 296 GiB / 434 GiB (68%) [Remo]
Battery: 100% [Not charging]
Locale: en_US.UTF-8
Hi, Colossal-AI is not compatible with torch 2.0 for now.
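For anyone who needs a stopgap before official support lands, a minimal forward-compatible shim is sketched below. It assumes the nightly exposes torch._C.DisableTorchFunctionSubclass as the replacement for the removed name; this is an illustration of the idea, not the project's actual fix.

    import torch

    # Use whichever disable-torch-function context manager the installed torch provides.
    # Assumption: 2.0 nightlies rename DisableTorchFunction to DisableTorchFunctionSubclass.
    if hasattr(torch._C, "DisableTorchFunctionSubclass"):
        _DisableTorchFunction = torch._C.DisableTorchFunctionSubclass
    else:
        _DisableTorchFunction = torch._C.DisableTorchFunction

    def call_without_torch_function(func, args=(), kwargs=None):
        # Run func on the underlying tensors without re-entering __torch_function__.
        kwargs = kwargs or {}
        with _DisableTorchFunction():
            return func(*args, **kwargs)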
While on this topic, do y'all think the new torch.compile() feature could reduce VRAM usage?
Would it even be feasible to use it in Colossal-AI?
Possibly. In fact, we plan to use it to parallelize training as well. We will integrate it with Colossal-AI upon the official release of torch 2.0.
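For context, torch.compile is used roughly as in the sketch below (written against the torch 2.0 API; the toy model and shapes are made up for illustration). Its primary goal is speed, so any VRAM savings would depend on the backend and the model rather than being guaranteed.

    import torch
    import torch.nn as nn

    # Toy model purely for illustration.
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    # Requires torch >= 2.0; the first call triggers compilation, later calls reuse the compiled graph.
    compiled_model = torch.compile(model)

    x = torch.randn(32, 128, device=device)
    out = compiled_model(x)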
We have made a lot of updates since then. This issue was closed due to inactivity. Thanks.