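PaddleNLP [Question]: Error at "Loading best model from checkpoint" after multi-process training launched with paddle.distributed.launch finishes

Please ask your question

When training with the example at https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class#readme, I ran the following command: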

python3 -m paddle.distributed.launch --nproc_per_node=24 train.py \
    --do_train \
    --do_eval \
    --do_export \
    --model_name_or_path ernie-3.0-tiny-medium-v2-zh \
    --output_dir checkpoint \
    --device cpu \
    --num_train_epochs 100 \
    --early_stopping True \
    --early_stopping_patience 5 \
    --learning_rate 3e-5 \
    --max_length 128 \
    --per_device_eval_batch_size 32 \
    --per_device_train_batch_size 32 \
    --metric_for_best_model accuracy \
    --load_best_model_at_end \
    --logging_steps 5 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --save_total_limit 3

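With multi-process parallelism enabled, loading the result after training completes raises the error below. Am I using it incorrectly? What is the correct command to enable multi-process or multi-threaded computation in CPU mode? I could not find it in the official documentation, and there is no obvious option among the arguments; using the enable_auto_parallel argument raises an error. See #8428.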

[2024-05-13 12:49:22,098] [ INFO] - [timelog] checkpoint saving time: 0.00s (2024-05-13 12:49:22)
[2024-05-13 12:55:42,547] [ INFO] - ***** Running Evaluation *****
[2024-05-13 12:55:42,548] [ INFO] - Num examples = 1955
[2024-05-13 12:55:42,548] [ INFO] - Total prediction steps = 3
[2024-05-13 12:55:42,548] [ INFO] - Pre device batch size = 32
[2024-05-13 12:55:42,548] [ INFO] - Total Batch size = 768
[2024-05-13 12:55:56,791] [ INFO] - [timelog] checkpoint saving time: 0.00s (2024-05-13 12:55:56)
[2024-05-13 12:55:56,791] [ INFO] - Training completed.

[2024-05-13 12:55:56,805] [ INFO] - Loading best model from checkpoint/checkpoint-170 (score: 0.8204603580562659).
[2024-05-13 12:55:57,120] [ INFO] - set state-dict :([], [])
Traceback (most recent call last):
  File "train.py", line 230, in <module>
    main()
  File "train.py", line 185, in main
    shutil.rmtree(checkpoint_path)
  File "/usr/lib/python3.8/shutil.py", line 715, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/usr/lib/python3.8/shutil.py", line 672, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/usr/lib/python3.8/shutil.py", line 670, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tokenizer_config.json'

May 13 '24 06:05 jazzly

The versions used are as follows:

paddlepaddle: 2.6.1
paddlenlp: 2.8.0

May 13 '24 06:05 jazzly

Take a look at your checkpoint/checkpoint-170 directory and check whether the tokenizer was not saved. A simple workaround is to remove the argument:

load_best_model_at_end

May 13 '24 13:05 w5688414

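The thing is, load_best_model_at_end is required in order to use early_stopping. By the time this error is raised, directories such as checkpoint-170 no longer exist. Checking the worklog, I found that training had actually completed. The likely cause is the multi-process launch: every process tries to run load_best_model_at_end, so only one process succeeds and the others presumably all fail.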
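A minimal sketch of one way to avoid this kind of race, assuming the failure comes from every launched process running the same shutil.rmtree cleanup of the checkpoint-* directories (as the traceback suggests). The helper names cleanup_checkpoints and current_rank are hypothetical, and reading the rank from the PADDLE_TRAINER_ID environment variable is an assumption about what paddle.distributed.launch sets for each worker. This is not the fix adopted by PaddleNLP.

```python
import os
import shutil


def current_rank() -> int:
    """Return this worker's rank (assumption: the launcher exports PADDLE_TRAINER_ID)."""
    return int(os.environ.get("PADDLE_TRAINER_ID", "0"))


def cleanup_checkpoints(output_dir: str, rank: int) -> list:
    """Delete checkpoint-* subdirectories of output_dir, but only on rank 0.

    Returns the names of the directories this call attempted to remove.
    """
    removed = []
    if rank != 0:
        # Other workers skip the cleanup so they never race on the same files.
        return removed
    for name in sorted(os.listdir(output_dir)):
        path = os.path.join(output_dir, name)
        if name.startswith("checkpoint-") and os.path.isdir(path):
            # ignore_errors=True tolerates files that vanish during the walk.
            shutil.rmtree(path, ignore_errors=True)
            removed.append(name)
    return removed
```

In a training script this might be called as cleanup_checkpoints("checkpoint", current_rank()) once training and export have finished. Restricting the deletion to a single rank is the main guard; ignore_errors=True is only a second line of defence.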

python3 -m paddle.distributed.launch --nproc_per_node=24

Is this the correct way to enable multiple processes in CPU mode?

May 14 '24 01:05 jazzly

Training on CPU is not recommended because it is inefficient. For distributed training on GPU, see the documentation:

https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/launch_cn.html#launch

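--nproc_per_node: the number of processes launched on each node. In GPU training, it should be less than or equal to the number of GPUs in the system. For example, --nproc_per_node=8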

May 14 '24 07:05 w5688414

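I have no GPU at hand for now, so I am testing on CPU. The example task takes a little under 4 hours to train on 24 CPU cores, which is still usable. What I mean is: in CPU mode, if paddle.distributed.launch is not used, how should multi-threaded or multi-process training be enabled correctly?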

May 14 '24 08:05 jazzly

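You can open an issue in the framework repository for this. The CPU scenario is not very common and is probably not supported. For distributed training, see the documentation: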

https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/06_distributed_training/index_cn.html

May 14 '24 15:05 w5688414

OK, understood. Thanks.

May 15 '24 06:05 jazzly

This issue is stale because it has been open for 60 days with no activity.

Jul 15 '24 00:07 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

Jul 29 '24 00:07 github-actions[bot]