LiuXin
I hit the same error as the original poster. Has anyone solved the multi-GPU training problem with modelscope, or is this an environment issue? Task related config: error: unrecognized arguments: --local-rank=0 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 185461) of binary: /opt/conda/envs/modelscope/bin/python Traceback (most recent call last): File "/opt/conda/envs/modelscope/lib/python3.8/runpy.py", line 194, in _run_module_as_main return...
I have the same question. Do you have a solution? Or could this be related to collections taking a long time to load when I deploy the interface?
> # 1. Construct the dataset > ``` > train.jsonl (each line): {"query_id": "111", "query": "吃饭的猫猫1", "image_id": "222", "image": "/path/to/cat_1.jpg"} > validation.jsonl (each line): {"query_id": "333", "query": "吃饭的猫猫2", "image_id": "444",...
> Please check the training data; format reference: (https://alibaba-damo-academy.github.io/FunASR/en/egs_modelscope/asr/TEMPLATE/README.html#finetune-with-your-data) Hello, single-GPU training works fine for me, but multi-GPU training fails. My launch command is `CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node 2 finetune.py`, and the error is as follows: Task related config: error: unrecognized arguments: --local-rank=0 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 185479) of...
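A note on the `unrecognized arguments: --local-rank=0` message: as far as I know, `torch.distributed.launch` in PyTorch 2.0 and later passes the flag spelled with a hyphen (`--local-rank`), while older scripts only declare `--local_rank`, so the script's argparse rejects it. Below is a minimal sketch of a parser that accepts both spellings and falls back to the `LOCAL_RANK` environment variable. The function name `parse_local_rank` is hypothetical, and whether the parser in your `finetune.py` (or in the library it calls) can be edited this way is an assumption I have not verified.

```python
import argparse
import os


def parse_local_rank(argv=None):
    """Return the local rank from either flag spelling or from LOCAL_RANK.

    Hypothetical helper for illustration; not part of modelscope or FunASR.
    """
    parser = argparse.ArgumentParser()
    # Register both spellings on one argument so either launcher style works.
    parser.add_argument(
        "--local_rank",
        "--local-rank",
        dest="local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    # parse_known_args ignores flags that belong to the rest of the script.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


if __name__ == "__main__":
    print(parse_local_rank())
```

If the parser lives inside the library and cannot be changed, another option that may work is launching with `torchrun --nproc_per_node 2 finetune.py`, which sets `LOCAL_RANK` in the environment instead of passing a command-line flag. That only helps if the training code reads the environment variable, which I have not confirmed for this script.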
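> Replace the detection weights det_weights Thanks for the answer. I noticed that the checkpoints I trained differ in size from the official ones: the pretrained ppyoloe_crn_l_36e_640x640_mot17half.pdparams is 204M, while the ones I trained myself are all 214M. For other models, such as centernet_dla34_140e_coco.pdparams, my trained checkpoints are the same size as the official ones. Also, when I swap in my weights for validation, I get a warning that no objects were detected. What might be causing this?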
Same here. I'm running it on a server and it stays stuck at this point. Is there any fix?