Joel
@lidizheng Multithread asyncio client works fine, but the message is very annoying. Is there any work on this?
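For context, here is a minimal sketch of the "multithreaded asyncio client" pattern being discussed: one event loop owns the async client, and worker threads submit coroutines to it with `run_coroutine_threadsafe`. This is not OpenRLHF or gRPC code; `fake_rpc` is a hypothetical stand-in for a real `grpc.aio` stub call.

```python
import asyncio
import threading

async def fake_rpc(x):
    # Placeholder for an actual grpc.aio stub call (network I/O).
    await asyncio.sleep(0)
    return x * 2

def main():
    # One loop owns all async clients; it runs in a dedicated thread.
    loop = asyncio.new_event_loop()
    t = threading.Thread(target=loop.run_forever, daemon=True)
    t.start()

    # Any other thread hands coroutines to that loop instead of
    # touching the async client directly from its own thread.
    fut = asyncio.run_coroutine_threadsafe(fake_rpc(21), loop)
    result = fut.result(timeout=5)

    loop.call_soon_threadsafe(loop.stop)
    return result

print(main())  # 42
```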
> @wuxibin89 We do want to improve the UX of the API. Can you help to create a new issue to describe the problem you are seeing? @lidizheng My problem...
@lidizheng Hi, I want to help fix this issue. I spent some time diving into the code, but can't figure out why this happens. As we can see, `PollerCompletionQueue._handle_events`...
> Thanks, I will try. I suspect the issue might stem from the interaction between deepspeed ZeRO and QLoRA leading to value_head information not being saved in the checkpoint directory,...
@karthik19967829 I can't reproduce this problem with your script; my job succeeded as expected. Can you post the ray job supervisor's log? You can find it at `/tmp/ray/session_latest/logs/job-driver-raysubmit_{JOBID}.log`
My hardware info is 1 node with 8 A100 GPUs, and the run command is:

```bash
ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"working_dir": "."}' \
  --no-wait \
  -- python3 examples/train_ppo_ray.py \
  --ref_num_nodes...
```
> also could you share the exact version of libraries by using `pip list` in your environment ? > > thank you so much for the quick response :) hope...
@tianhao-nexusflow I don't think it's related to vllm. Is your CUDA version 12.3?
@tianhao-nexusflow Can you post your run command and hardware info?
@tianhao-nexusflow I can't reproduce with your script either; let me switch to CUDA 12 and torch 2.2.