[BUG]: Dreambooth training example fails with PyTorch nightly build

Open brucethemoose opened this issue 3 years ago • 3 comments

🐛 Describe the bug

The example works fine on the current stable build (1.13), but on the nightly it fails with:

AttributeError: module 'torch._C' has no attribute 'DisableTorchFunction'. Did you mean: '_EnableTorchFunction'?


❯ ./colossalai.sh
2023-01-11 20:30:41.120169: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[01/11/23 20:30:43] INFO     colossalai - colossalai - INFO:
                             /home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/
                             colossalai/context/parallel_context.py:521 set_device
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0
[01/11/23 20:30:45] INFO     colossalai - colossalai - INFO:
                             /home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/
                             colossalai/context/parallel_context.py:557 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024,
                             python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the
                             default parallel seed is ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO:
                             /home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/
                             colossalai/initialize.py:117 launch
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data
                             parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
                    INFO     colossalai - colossalai - INFO:
                             /home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth
                             _colossalai.py:427 main
                    INFO     colossalai - colossalai - INFO: Loading tokenizer from pretrained model
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
                    INFO     colossalai - colossalai - INFO:
                             /home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth
                             _colossalai.py:439 main
                    INFO     colossalai - colossalai - INFO: Loading text_encoder from
                             /home/alpha/Storage/AIModels/diffusers/pmerge
                    INFO     colossalai - colossalai - INFO:
                             /home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth
                             _colossalai.py:447 main
                    INFO     colossalai - colossalai - INFO: Loading AutoencoderKL from
                             /home/alpha/Storage/AIModels/diffusers/pmerge
[01/11/23 20:30:46] INFO     colossalai - colossalai - INFO:
                             /home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth
                             _colossalai.py:454 main
                    INFO     colossalai - colossalai - INFO: Loading UNet2DConditionModel from
                             /home/alpha/Storage/AIModels/diffusers/pmerge
The config attributes {'class_embed_type': None, 'mid_block_type': 'UNetMidBlock2DCrossAttn', 'resnet_time_scale_shift': 'default'} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
[01/11/23 20:30:46] INFO     colossalai - ProcessGroup - INFO:
                             /home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/
                             colossalai/tensor/process_group.py:24 get
                    INFO     colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]
Traceback (most recent call last):
  File "/home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth_colossalai.py", line 677, in <module>
    main(args)
  File "/home/alpha/tempclone/ColossalAI/examples/images/dreambooth/train_dreambooth_colossalai.py", line 456, in main
    unet = UNet2DConditionModel.from_pretrained(args.pretrained_model_name_or_path,
  File "/home/alpha/.local/lib/python3.10/site-packages/diffusers/modeling_utils.py", line 519, in from_pretrained
    model = cls.from_config(config, **unused_kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 210, in from_config
    model = cls(**init_dict)
  File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/utils/model/utils.py", line 54, in wrapper
    f(module, *args, **kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 567, in inner_init
    init(self, *args, **init_kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 122, in __init__
    self.conv_in = nn.Conv2d(in_channels, block_out_channels[0], kernel_size=3, padding=(1, 1))
  File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/utils/model/utils.py", line 54, in wrapper
    f(module, *args, **kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 450, in __init__
    super(Conv2d, self).__init__(
  File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/utils/model/utils.py", line 55, in wrapper
    self._post_init_method(module, *args, **kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/utils/model/colo_init_context.py", line 130, in _post_init_method
    setattr(submodule, param_name, colo_param)
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1609, in __setattr__
    self.register_parameter(name, value)
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 559, in register_parameter
    elif param.grad_fn:
  File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/tensor/colo_parameter.py", line 91, in __torch_function__
    return super().__torch_function__(func, types, args, kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/colossalai-0.2.0-py3.10.egg/colossalai/tensor/colo_tensor.py", line 182, in __torch_function__
    with torch._C.DisableTorchFunction():
AttributeError: module 'torch._C' has no attribute 'DisableTorchFunction'. Did you mean: '_EnableTorchFunction'?
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11711) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/home/alpha/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 779, in main
    run(args)
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 770, in run
    elastic_launch(
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dreambooth_colossalai.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-01-11_20:30:49
  host      : Asus-GA401IV
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11711)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
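
The traceback ends in Colossal-AI's colo_tensor.py, which wraps tensor operations in "with torch._C.DisableTorchFunction():". On the 2.0 nightlies that private context manager appears to have been renamed to DisableTorchFunctionSubclass, which would explain why the attribute lookup fails on the nightly but not on stable 1.13. A minimal compatibility sketch (my guess at a shim, not the official Colossal-AI fix) would pick whichever name the installed build exposes:

import torch

# Use whichever "disable __torch_function__" context manager this torch build provides.
if hasattr(torch._C, "DisableTorchFunctionSubclass"):   # torch 2.0 nightlies
    _DisableTorchFunction = torch._C.DisableTorchFunctionSubclass
else:                                                    # torch <= 1.13 stable
    _DisableTorchFunction = torch._C.DisableTorchFunction

with _DisableTorchFunction():
    # Inside this block, __torch_function__ overrides on tensor subclasses are bypassed.
    x = torch.ones(2, 2) + 1
print(x)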

Environment

torch==2.0.0.dev20230111+cu118
CUDA 11.8
Python 3.10.9
xformers (built from git)
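
For reference, these can be confirmed from the Python interpreter; the commented values are the ones reported above:

import sys
import torch

print(sys.version.split()[0])     # 3.10.9
print(torch.__version__)          # 2.0.0.dev20230111+cu118
print(torch.version.cuda)         # 11.8
print(torch.cuda.is_available())  # True (RTX 2060 Max-Q)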

Laptop I am testing this on:

OS: CachyOS Linux x86_64
Host: ROG Zephyrus G14 GA401IV_GA401IV (1.0)
Kernel: 6.1.4-1-cachyos-lto
Shell: fish 3.5.1
DE: KDE Plasma 5.26.5, KWin (Wayland)
CPU: AMD Ryzen 9 4900HS (16) @ 3 GHz
GPU: AMD Renoir
GPU: NVIDIA GeForce RTX 2060 Max-Q
Memory: 3.04 GiB / 15.05 GiB (20%)

brucethemoose, Jan 12 '23 01:01

Hi, Colossal-AI is not compatible with torch 2.0 for now.
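
Until that support lands, a hypothetical early check at the top of the training script (not something the example currently does) could fail fast with a clearer message than the AttributeError above:

import torch

# Hypothetical guard: Colossal-AI 0.2.0 targets the torch 1.x API surface.
major = int(torch.__version__.split(".")[0])
if major >= 2:
    raise RuntimeError(
        f"torch {torch.__version__} is not supported yet; "
        "please use the stable 1.13 release instead."
    )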

FrankLeeeee, Jan 12 '23 01:01

While on this topic, do y'all think the new torch.compile() feature could reduce VRAM usage?

Would it even be feasible to use in ColossalAI?

brucethemoose, Jan 12 '23 02:01

Possibly; in fact, we plan to use it to parallelize training as well. We will integrate it with Colossal-AI upon the official release of torch 2.0.
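
For reference, basic torch.compile usage on a standalone module looks roughly like the sketch below (a toy model, not the Dreambooth UNet, and whether it actually reduces VRAM still needs to be measured):

import torch
import torch.nn as nn

# Toy stand-in for a model block; torch.compile requires torch >= 2.0.
model = nn.Sequential(nn.Conv2d(4, 8, kernel_size=3, padding=1), nn.ReLU())
compiled = torch.compile(model)   # compiles/fuses the graph on the first call
out = compiled(torch.randn(1, 4, 64, 64))
print(out.shape)                  # torch.Size([1, 8, 64, 64])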

FrankLeeeee, Jan 12 '23 02:01

We have made a lot of updates since then. This issue is being closed due to inactivity. Thanks.

binmakeswell, Apr 18 '23 07:04