[BUG] ImportError: /root/.cache/torch_extensions/py38_cu117/utils/utils.so: cannot open shared object file: No such file or directory
I am trying to use Accelerate and DeepSpeed for training, but I encountered the following error:
ImportError: /root/.cache/torch_extensions/py38_cu117/utils/utils.so: cannot open shared object file: No such file or directory
My Accelerate config:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  zero3_init_flag: true
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config: {}
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
and my ds_report:
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch version .................... 2.0.0+cu117
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
Here is a script that you can run directly with accelerate launch --mixed_precision="fp16" train_toy.py:
#!/usr/bin/env python
# coding=utf-8
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
import datasets as datasets_1
import torch
import torch.utils.checkpoint
import transformers
from accelerate import Accelerator
from accelerate.logging import get_logger
from accelerate.utils import ProjectConfiguration
from transformers import CLIPTextModel, CLIPTokenizer
from accelerate.utils import DummyOptim
import diffusers
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from diffusers.utils import check_min_version
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.15.0.dev0")
logger = get_logger(__name__, log_level="INFO")
dataset_name_mapping = {
"lambdalabs/pokemon-blip-captions": ("image", "text"),
}
def main():
    accelerator_project_config = ProjectConfiguration(total_limit=None)
    accelerator = Accelerator(
        gradient_accumulation_steps=4,
        mixed_precision=None,
        project_config=accelerator_project_config,
    )
    if accelerator.is_local_main_process:
        datasets_1.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_warning()
        diffusers.utils.logging.set_verbosity_info()
    else:
        datasets_1.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()
        diffusers.utils.logging.set_verbosity_error()

    # Load scheduler, tokenizer and models.
    pretrained_model_name_or_path = "stabilityai/stable-diffusion-2-1"
    text_encoder = CLIPTextModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="text_encoder"
    )
    vae = AutoencoderKL.from_pretrained(pretrained_model_name_or_path, subfolder="vae")
    unet = UNet2DConditionModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="unet"
    )

    # Use DummyOptim when the DeepSpeed config supplies its own optimizer.
    optimizer_cls = (
        torch.optim.AdamW
        if accelerator.state.deepspeed_plugin is None
        or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config
        else DummyOptim
    )
    # optimizer_cls = deepspeed.ops.adam.DeepSpeedCPUAdam
    optimizer = optimizer_cls(
        text_encoder.parameters(),
        lr=0.0001,
        betas=(0.9, 0.999),
        weight_decay=0.0001,
        eps=0.00000001,
    )

    test_data = datasets.FashionMNIST(
        root="data",
        train=False,
        download=True,
        transform=ToTensor(),
    )
    train_dataloader = torch.utils.data.DataLoader(test_data, batch_size=2)

    unet, vae, text_encoder, optimizer, train_dataloader = accelerator.prepare(
        unet, vae, text_encoder, optimizer, train_dataloader
    )


if __name__ == "__main__":
    main()
The complete error message is:
Traceback (most recent call last):
File "train_toy.py", line 106, in <module>
main()
File "train_toy.py", line 101, in main
unet, vae, text_encoder, optimizer,train_dataloader = accelerator.prepare(
File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/accelerate/accelerator.py", line 1090, in p
repare
result = self._prepare_deepspeed(*args)
File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/accelerate/accelerator.py", line 1367, in _
prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initia
lize
engine = DeepSpeedEngine(args=args,
File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in
__init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1167, in
_configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1398, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 154, in __init__
util_ops = UtilsBuilder().load()
File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
return self.jit_load(verbose)
File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
op_module = load(name=self.name,
File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/root/anaconda3/envs/mabing_py38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 556, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1166, in create_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py38_cu117/utils/utils.so: cannot open shared object file: No such file or directory
I am facing a similar issue:
ImportError: /root/.cache/torch_extensions/py38_cu102/transformer_inference/transformer_inference.so: cannot open shared object file: No such file or directory
Were you able to resolve it?
This problem could be because the extensions folder is located under /root, which is privileged. Can you try using /tmp instead, by setting export TORCH_EXTENSIONS_DIR=/tmp?
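For example, the variable can be exported before launching, or set at the very top of the training script before anything triggers the extension build (a minimal sketch; the /tmp/torch_extensions path is just an example):

import os

# Point torch's JIT extension cache at a directory every process can write to,
# before deepspeed builds or loads utils.so.
os.environ["TORCH_EXTENSIONS_DIR"] = "/tmp/torch_extensions"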
Same as the above issues. I checked DeepSpeed with ds_report; maybe you should install DeepSpeed with its ops pre-built, rather than relying on JIT mode.
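For example, DeepSpeed supports pre-building ops at install time via its DS_BUILD_* environment variables (a sketch, assuming a CUDA toolkit and compiler compatible with your torch build are available when pip runs):

DS_BUILD_UTILS=1 pip install deepspeed --no-cache-dir
# or pre-build all ops compatible with the environment:
DS_BUILD_OPS=1 pip install deepspeed --no-cache-dir

With the ops pre-built, ds_report should show them as installed, and no JIT compile happens at runtime.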
Same as the above issues for me.
@feimadecaogaozhi, did you try changing the extensions folder as suggested above?
Maybe you can check whether you have installed transformers with both pip and conda. If it was installed by both, multi-GPU training can hit a conflict between the two copies, while single-GPU training may not run into this problem.
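For example, both package managers can be queried to see whether each has its own copy:

pip show transformers     # location and version of pip's copy
conda list transformers   # whether conda installed a separate copy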
I also met this problem, and found it was because the gcc on the machine was older than 5.0.0; after upgrading gcc, the problem was solved. A gcc that is too old does not support the compile flags that DeepSpeed's JIT compilation needs, so upgrading gcc to version 5 or newer is enough.
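A quick way to see which compiler the JIT build would pick up (how to upgrade is distro-specific; gcc-9 is just one example package on Ubuntu):

gcc --version                        # the JIT build uses the gcc found on PATH
sudo apt-get install gcc-9 g++-9     # one way to get a newer toolchain on Ubuntu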
This doesn't work for me.
When using Accelerate, it starts multiple processes, and they all trigger the JIT compile at the same time, which causes this issue. We can trigger DeepSpeed's JIT compile once before running the task:
python -c "from deepspeed.ops.op_builder import UtilsBuilder; UtilsBuilder().load()"
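If you would rather keep the warm-up inside the script, one variant (a sketch built from the objects already in train_toy.py; accelerator.wait_for_everyone() is Accelerate's process barrier) is to let only the local main process build the extension and have the other ranks wait for the cached .so:

from deepspeed.ops.op_builder import UtilsBuilder

# After creating the Accelerator and before accelerator.prepare():
# rank 0 on each node JIT-builds utils.so while the other ranks block
# at the barrier, then everyone loads the now-cached extension.
if accelerator.is_local_main_process:
    UtilsBuilder().load()
accelerator.wait_for_everyone()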
I tried the suggestion above (export TORCH_EXTENSIONS_DIR=/tmp, since the extensions folder under /root is privileged), and it works.
Closing as it seems a solution was found.