Joel
@lidizheng Multithread asyncio client works fine, but the message is very annoying. Is there any work on this?
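For context, here is a minimal sketch of the "multithreaded asyncio client" pattern being discussed: one event loop owns the async client, and worker threads submit coroutines to it with `run_coroutine_threadsafe`. This is not OpenRLHF or gRPC code; `fake_rpc` is a hypothetical stand-in for a real `grpc.aio` stub call.

```python
import asyncio
import threading

async def fake_rpc(x):
    # Placeholder for an actual grpc.aio stub call (network I/O).
    await asyncio.sleep(0)
    return x * 2

def main():
    # One loop owns all async clients; it runs in a dedicated thread.
    loop = asyncio.new_event_loop()
    t = threading.Thread(target=loop.run_forever, daemon=True)
    t.start()

    # Any other thread hands coroutines to that loop instead of
    # touching the async client directly from its own thread.
    fut = asyncio.run_coroutine_threadsafe(fake_rpc(21), loop)
    result = fut.result(timeout=5)

    loop.call_soon_threadsafe(loop.stop)
    return result

print(main())  # 42
```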
> @wuxibin89 We do want to improve the UX of the API. Can you help to create a new issue to describe the problem you are seeing? @lidizheng My problem...
@lidizheng Hi, I want to help fix this issue. I spent some time diving into the code, but can't figure out why this happens. As we can see, `PollerCompletionQueue._handle_events`...
> Thanks, I will try. I suspect the issue might stem from the interaction between deepspeed ZeRO and QLoRA leading to value_head information not being saved in the checkpoint directory,...
@karthik19967829 I can't reproduce this problem with your script; my job succeeded as expected. Can you post the ray job supervisor's log? You can find it at `/tmp/ray/session_latest/logs/job-driver-raysubmit_{JOBID}.log`
My hardware info is 1 node with 8 A100 GPUs, and the run command is:

```bash
ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"working_dir": "."}' \
  --no-wait \
  -- python3 examples/train_ppo_ray.py \
  --ref_num_nodes...
```
> also could you share the exact version of libraries by using `pip list` in your environment ? > > thank you so much for the quick response :) hope...
@tianhao-nexusflow I don't think it's related to vllm. Is your CUDA version 12.3?
@tianhao-nexusflow Can you post your run command and hardware info?
@tianhao-nexusflow I can't reproduce with your script either; let me switch to CUDA 12 and torch 2.2.