[BUG] Chart viewing problem after training finishes: cannot select the training curve at step=200

Open WjMessi1 opened this issue 9 months ago • 2 comments

With logging_steps set to 2, I wanted to inspect the run at step 200 after training finished, but step 200 cannot be selected in the chart. Only steps 198 and 202 can be selected.
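
For reference, here is a minimal sketch of why step 200 would be expected to exist in the logged data. It assumes metrics are logged whenever the global step is a multiple of logging_steps; that convention is an assumption and has not been confirmed against this run. The helper name `expected_logged_steps` and the value `max_step=204` are hypothetical and used only for illustration.

```python
def expected_logged_steps(logging_steps: int, max_step: int) -> list[int]:
    """Return the steps at which a logger would fire, ASSUMING it logs
    on every global step that is a multiple of `logging_steps`."""
    return [s for s in range(1, max_step + 1) if s % logging_steps == 0]


# max_step=204 is a made-up upper bound, not the real length of this run.
steps = expected_logged_steps(logging_steps=2, max_step=204)
print(200 in steps)   # True
print(steps[-4:])     # [198, 200, 202, 204]
```

Under that assumption, step 200 sits between 198 and 202 in the logged series, so being unable to select it points at how the chart picks points rather than at the logging interval.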

Image

Describe the main elements of the bug

🧑‍💻 Step to reproduce

Experiment link: https://swanlab.cn/@Cube/telechat2_7b_32k/runs/et47g3pj4vrsltxy2ifqw/chart

Training launch command:

MASTER_PORT=29501 CUDA_VISIBLE_DEVICES=5,6,7 NPROC_PER_NODE=3 swift rlhf --rlhf_type grpo --model /home/gpu/modelscope/TeleChat2-7B-32K/output/v0-20250417-092611/save/checkpoint-90-merged --reward_funcs accuracy format repetition --train_type lora --lora_rank 8 --lora_alpha 32 --target_modules all-linear --torch_dtype bfloat16 --dataset 'AI-MO/NuminaMath-TIR#2000' --max_completion_length 2048 --num_train_epochs 1 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --learning_rate 1e-5 --gradient_accumulation_steps 1 --eval_steps 100 --save_steps 100 --save_total_limit 2 --logging_steps 2 --max_length 2048 --output_dir output --deepspeed zero2 --report_to swanlab --swanlab_project telechat2_7b_32k --swanlab_workspace Cube --swanlab_exp_name telechat2_7b_32k_3_A100_grpo_pure --warmup_ratio 0.05 --beta 0.1 --max_grad_norm 0.5 --dataloader_num_workers 4 --dataset_num_proc 4 --num_generations 8 --temperature 0.9 --deepspeed zero2 --system '/home/gpu/modelscope/TeleChat2-7B-32K/prompt.txt' --log_completions true

  • SwanLab Version: 0.5.5

  • Platform: 3× A100 GPUs, ms-swift 3.3.0.post1

WjMessi1 avatar Apr 21 '25 07:04 WjMessi1

Feels quite similar to #926

SAKURA-CAT avatar Apr 22 '25 03:04 SAKURA-CAT

Got it, I'll follow up on this issue!

ShaohonChen avatar Apr 23 '25 05:04 ShaohonChen

This is probably similar to the bug in #1085. Closing this issue; any progress will be posted on #1085.

SAKURA-CAT avatar Jul 07 '25 10:07 SAKURA-CAT