[BUG]: Dreambooth training example fails with PyTorch nightly build

Open brucethemoose opened this issue 3 years ago • 3 comments

🐛 Describe the bug

The example works fine on the current stable build (1.13), but on the nightly it fails with:

AttributeError: module 'torch._C' has no attribute 'DisableTorchFunction'. Did you mean: '_EnableTorchFunction'?


❯ ./colossalai.sh
2023-01-11 20:30:41.120169: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[01/11/23 20:30:43] INFO     colossalai - colossalai - INFO:
                             /home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/
                             colossalai/context/parallel_context.py:521 set_device
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0
[01/11/23 20:30:45] INFO     colossalai - colossalai - INFO:
                             /home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/
                             colossalai/context/parallel_context.py:557 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024,
                             python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the
                             default parallel seed is ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO:
                             /home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/
                             colossalai/initialize.py:117 launch
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data
                             parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
                    INFO     colossalai - colossalai - INFO:
                             /home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth
                             _colossalai.py:427 main
                    INFO     colossalai - colossalai - INFO: Loading tokenizer from pretrained model
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
                    INFO     colossalai - colossalai - INFO:
                             /home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth
                             _colossalai.py:439 main
                    INFO     colossalai - colossalai - INFO: Loading text_encoder from
                             /home/alpha/Storage/AIModels/diffusers/pmerge
                    INFO     colossalai - colossalai - INFO:
                             /home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth
                             _colossalai.py:447 main
                    INFO     colossalai - colossalai - INFO: Loading AutoencoderKL from
                             /home/alpha/Storage/AIModels/diffusers/pmerge
[01/11/23 20:30:46] INFO     colossalai - colossalai - INFO:
                             /home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth
                             _colossalai.py:454 main
                    INFO     colossalai - colossalai - INFO: Loading UNet2DConditionModel from
                             /home/alpha/Storage/AIModels/diffusers/pmerge
The config attributes {'class_embed_type': None, 'mid_block_type': 'UNetMidBlock2DCrossAttn', 'resnet_time_scale_shift': 'default'} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
[01/11/23 20:30:46] INFO     colossalai - ProcessGroup - INFO:
                             /home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/
                             colossalai/tensor/process_group.py:24 get
                    INFO     colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]
Traceback (most recent call last):
  File "/home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth_colossalai.py", line 677, in <module>
    main(args)
  File "/home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth_colossalai.py", line 456, in main
    unet = UNet2DConditionModel.from_pretrained(args.pretrained_model_name_or_path,
  File "/home/alpha/.local/lib/python3.10/site-packages/diffusers/modeling_utils.py", line 519, in from_pretrained
    model = cls.from_config(config, **unused_kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 210, in from_config
    model = cls(**init_dict)
  File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/utils/model/utils.py", line 54, in wrapper
    f(module, *args, **kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 567, in inner_init
    init(self, *args, **init_kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 122, in __init__
    self.conv_in = nn.Conv2d(in_channels, block_out_channels[0], kernel_size=3, padding=(1, 1))
  File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/utils/model/utils.py", line 54, in wrapper
    f(module, *args, **kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 450, in __init__
    super(Conv2d, self).__init__(
  File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/utils/model/utils.py", line 55, in wrapper
    self._post_init_method(module, *args, **kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/utils/model/colo_init_context.py", line 130, in _post_init_method
    setattr(submodule, param_name, colo_param)
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1609, in __setattr__
    self.register_parameter(name, value)
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 559, in register_parameter
    elif param.grad_fn:
  File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/tensor/colo_parameter.py", line 91, in __torch_function__
    return super().__torch_function__(func, types, args, kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/tensor/colo_tensor.py", line 182, in __torch_function__
    with torch._C.DisableTorchFunction():
AttributeError: module 'torch._C' has no attribute 'DisableTorchFunction'. Did you mean: '_EnableTorchFunction'?
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11711) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/home/alpha/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 779, in main
    run(args)
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 770, in run
    elastic_launch(
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dreambooth_colossalai.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-01-11_20:30:49
  host      : Asus-GA401IV
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11711)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
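
The traceback ends in Colossal-AI's colo_tensor.py, which wraps tensor operations in "with torch._C.DisableTorchFunction():". On the 2.0 nightlies that private context manager appears to have been renamed to DisableTorchFunctionSubclass, which would explain why the attribute lookup fails on the nightly but not on stable 1.13. A minimal compatibility sketch (my guess at a shim, not the official Colossal-AI fix) would pick whichever name the installed build exposes:

import torch

# Use whichever "disable __torch_function__" context manager this torch build provides.
if hasattr(torch._C, "DisableTorchFunctionSubclass"):   # torch 2.0 nightlies
    _DisableTorchFunction = torch._C.DisableTorchFunctionSubclass
else:                                                    # torch <= 1.13 stable
    _DisableTorchFunction = torch._C.DisableTorchFunction

with _DisableTorchFunction():
    # Inside this block, __torch_function__ overrides on tensor subclasses are bypassed.
    x = torch.ones(2, 2) + 1
print(x)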

Environment

torch==2.0.0.dev20230111+cu118
CUDA 11.8
Python 3.10.9
xformers (built from git)
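
For reference, these can be confirmed from the Python interpreter; the commented values are the ones reported above:

import sys
import torch

print(sys.version.split()[0])     # 3.10.9
print(torch.__version__)          # 2.0.0.dev20230111+cu118
print(torch.version.cuda)         # 11.8
print(torch.cuda.is_available())  # True (RTX 2060 Max-Q)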

Laptop I am testing this on:

OS: CachyOS Linux x86_64
Host: ROG Zephyrus G14 GA401IV_GA401IV (1.0)
Kernel: 6.1.4-1-cachyos-lto
Shell: fish 3.5.1
DE: KDE Plasma 5.26.5, KWin (Wayland)
CPU: AMD Ryzen 9 4900HS (16) @ 3 GHz
GPU: AMD Renoir
GPU: NVIDIA GeForce RTX 2060 Max-Q
Memory: 3.04 GiB / 15.05 GiB (20%)

brucethemoose, Jan 12 '23 01:01

Hi, Colossal-AI is not compatible with torch 2.0 for now.
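
Until that support lands, a hypothetical early check at the top of the training script (not something the example currently does) could fail fast with a clearer message than the AttributeError above:

import torch

# Hypothetical guard: Colossal-AI 0.2.0 targets the torch 1.x API surface.
major = int(torch.__version__.split(".")[0])
if major >= 2:
    raise RuntimeError(
        f"torch {torch.__version__} is not supported yet; "
        "please use the stable 1.13 release instead."
    )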

FrankLeeeee, Jan 12 '23 01:01

While on this topic, do y'all think the new torch.compile() feature could reduce VRAM usage?

Would it even be feasible to use in ColossalAI?

brucethemoose, Jan 12 '23 02:01

Possibly; in fact, we plan to use it to parallelize training as well. We will integrate it with Colossal-AI upon the official release of torch 2.0.
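
For reference, basic torch.compile usage on a standalone module looks roughly like the sketch below (a toy model, not the Dreambooth UNet, and whether it actually reduces VRAM still needs to be measured):

import torch
import torch.nn as nn

# Toy stand-in for a model block; torch.compile requires torch >= 2.0.
model = nn.Sequential(nn.Conv2d(4, 8, kernel_size=3, padding=1), nn.ReLU())
compiled = torch.compile(model)   # compiles/fuses the graph on the first call
out = compiled(torch.randn(1, 4, 64, 64))
print(out.shape)                  # torch.Size([1, 8, 64, 64])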

FrankLeeeee, Jan 12 '23 02:01

We have made a lot of updates since then. This issue is being closed due to inactivity. Thanks.

binmakeswell, Apr 18 '23 07:04