Hi @mrm8488, that would be great! @LouisCastricato There were a few issues a while back, but I think they are no longer relevant since Colab upgraded to 3.8 a week ago, the same version as...
It might be possible to initialize the reward model separately with ZeRO-Inference[1]. Afaik Accelerate by itself doesn't support a second DeepSpeed config. [1] https://www.deepspeed.ai/2022/09/09/zero-inference.html
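For reference, a rough sketch (not tested against trlx's setup) of what wrapping a reward model with ZeRO-Inference could look like; the checkpoint name, config values, and `reward_fn` wiring are just placeholders:

```python
# Hypothetical sketch: wrap a reward model with DeepSpeed ZeRO-Inference
# (stage-3 parameter partitioning + CPU offload), kept separate from the
# Accelerate config that drives the policy model.
import torch
import deepspeed
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_model_name = "reward-model"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(reward_model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained(reward_model_name)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # required key even for inference-only use
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler);
# only the engine is needed for forward passes here
engine, *_ = deepspeed.initialize(model=model, config=ds_config)
engine.module.eval()

@torch.no_grad()
def reward_fn(samples):
    batch = tokenizer(samples, return_tensors="pt", padding=True, truncation=True)
    batch = {k: v.to(engine.device) for k, v in batch.items()}
    return engine(**batch).logits.squeeze(-1)
```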
Hey @RobertKirk, for larger reward models you can adapt the code at https://github.com/CarperAI/trlx/blob/main/examples/hh/ppo_hh.py#L113-L183. With it you either host a reward model behind Triton server's gRPC endpoint or dedicate a separate GPU...
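For context, a stripped-down sketch of the Triton gRPC client pattern used there might look roughly like this; the model name and the `input_ids`/`rewards` tensor names are placeholders that depend on your Triton model config:

```python
# Query a reward model hosted behind Triton's gRPC endpoint.
import numpy as np
import tritonclient.grpc as triton

client = triton.InferenceServerClient(url="localhost:8001", verbose=False)

def reward_fn(token_ids: np.ndarray) -> np.ndarray:
    # tensor names must match the deployed Triton model config (placeholders here)
    inp = triton.InferInput("input_ids", list(token_ids.shape), "INT32")
    inp.set_data_from_numpy(token_ids.astype(np.int32))
    result = client.infer("reward_model", [inp])
    return result.as_numpy("rewards")
```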
@LouisCastricato yes with #156
Hello! The reward function currently expects the whole `sample = prompt + output` as input, so in the case of summarization you could split it on "TL;DR" to recover prompts,...
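Something along these lines (the exact `TL;DR:` delimiter and the `score` helper are just placeholders for illustration):

```python
# Recover (prompt, summary) pairs from full samples by splitting on the delimiter,
# then score each pair with whatever reward model you are using.
def reward_fn(samples, **kwargs):
    rewards = []
    for sample in samples:
        prompt, _, summary = sample.partition("TL;DR:")
        rewards.append(score(prompt.strip(), summary.strip()))  # `score` is a placeholder
    return rewards
```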
Hi @xwjiang2010, thanks for dropping this! In short, we're lacking knowledge of how to do hyperparameter optimization with Tune in the way that's least invasive to our codebase. Currently we have a...
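For example, a least-invasive wrapper might look something like this (purely a sketch, not our actual integration; `run_training` and the searched parameters are placeholders):

```python
# Wrap an existing training entry point as a Ray Tune trainable and search
# over a couple of hyperparameters.
from ray import tune

def trainable(params):
    metrics = run_training(lr=params["lr"], batch_size=params["batch_size"])  # placeholder
    tune.report(reward_mean=metrics["reward_mean"])

analysis = tune.run(
    trainable,
    config={
        "lr": tune.loguniform(1e-6, 1e-4),
        "batch_size": tune.choice([32, 64, 128]),
    },
    num_samples=8,
)
print(analysis.get_best_config(metric="reward_mean", mode="max"))
```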
Hey, thanks for giving us some pointers! It seems like the most that could be achieved with Ray Train would be DDP handled internally by Ray. @ayulockin Maybe a more lightweight...
@cat-state something like that? https://github.com/vwxyzjn/cleanrl/pull/307
> Sure, although you might need compute?

@reciprocated maybe we should make a single-node config version that can be fine-tuned on a single GPU quickly? FWIW, {ppo,ilql}_config.yml were meant to...
Resolved with https://github.com/CarperAI/trlx/pull/357