verl
[Learning Verl] what is max_colocate_count?
The code block in question:
for resource_pool_name, process_on_nodes in self.resource_pool_spec.items():
    # max_colocate_count means the number of WorkerGroups (i.e. processes) in each RayResourcePool
    # For FSDP backend, we recommend using max_colocate_count=1 that merge all WorkerGroups into one.
    # For Megatron backend, we recommend using max_colocate_count>1
    # that can utilize different WorkerGroup for different models
    resource_pool = RayResourcePool(
        process_on_nodes=process_on_nodes, use_gpu=True, max_colocate_count=1, name_prefix=resource_pool_name
    )
    self.resource_pool_dict[resource_pool_name] = resource_pool
So what is max_colocate_count? According to the comment, max_colocate_count>1 should be set for the Megatron backend; however, the value is hardcoded to 1, so the recommendation cannot be followed without editing the code.
From checking here, it looks more like the number of CPUs reserved for each group of colocated processes. Is the name confusing?
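To make that reading concrete, here is a minimal sketch of how I understand it. It is not verl's actual code: the helper names build_bundles and gpu_fraction_per_worker are hypothetical, and it assumes each GPU gets one resource bundle that reserves max_colocate_count CPUs, so that up to max_colocate_count worker processes (one CPU each) can share that GPU.

```python
# Hypothetical sketch of the interpretation above; not verl's implementation.
# Assumption: one bundle per GPU, each reserving `max_colocate_count` CPUs.


def build_bundles(process_on_nodes, max_colocate_count, use_gpu=True):
    """Return, for each node, a list of resource bundles (one per process slot)."""
    bundles_per_node = []
    for process_count in process_on_nodes:
        bundle = {"CPU": max_colocate_count}
        if use_gpu:
            bundle["GPU"] = 1
        # Copy the dict so every slot owns an independent bundle.
        bundles_per_node.append([dict(bundle) for _ in range(process_count)])
    return bundles_per_node


def gpu_fraction_per_worker(max_colocate_count):
    """GPU share each colocated worker could request if they split one GPU evenly."""
    return 1 / max_colocate_count


# One node with 2 GPUs, allowing up to 4 colocated workers per GPU.
bundles = build_bundles([2], max_colocate_count=4)
print(bundles)
print(gpu_fraction_per_worker(4))
```

Under this assumption, max_colocate_count=1 leaves room for a single process per GPU, which matches the FSDP recommendation in the comment, while a larger value both raises the CPU reservation and shrinks each worker's GPU share.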
@wuxibin89 @vermouth1992 can you help review my fix? 🙏