
[Learning Verl] what is max_colocate_count?

Open • kzhou92 opened this issue 3 months ago • 1 comment

This line contains the following code block:

        for resource_pool_name, process_on_nodes in self.resource_pool_spec.items():
            # max_colocate_count means the number of WorkerGroups (i.e. processes) in each RayResourcePool
            # For FSDP backend, we recommend using max_colocate_count=1 that merge all WorkerGroups into one.
            # For Megatron backend, we recommend using max_colocate_count>1
            # that can utilize different WorkerGroups for different models
            resource_pool = RayResourcePool(
                process_on_nodes=process_on_nodes, use_gpu=True, max_colocate_count=1, name_prefix=resource_pool_name
            )
            self.resource_pool_dict[resource_pool_name] = resource_pool

So what is max_colocate_count? According to the comment, max_colocate_count>1 should be set for the Megatron backend; however, the value is hardcoded to 1.

Judging from the code here, it looks more like the number of CPUs reserved per colocated process. Is the name misleading?
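To make that reading concrete, here is a minimal sketch of how I understand the parameter. It is not verl's code: `PoolSketch`, `bundles`, and `per_worker_request` are hypothetical names, and the equal GPU split between colocated workers is my assumption rather than something the snippet above confirms. The only part taken from what I saw is that the CPU count reserved per slot follows max_colocate_count.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PoolSketch:
    """Hypothetical stand-in for a resource pool; NOT verl's RayResourcePool."""

    process_on_nodes: list[int]   # number of process slots on each node
    max_colocate_count: int = 1   # how many worker groups may share one slot
    use_gpu: bool = True

    def bundles(self) -> list[dict[str, float]]:
        # One resource bundle per process slot. The CPU reservation scales
        # with max_colocate_count, which is why the parameter reads like
        # "CPUs per colocated process".
        bundle = {"CPU": float(self.max_colocate_count)}
        if self.use_gpu:
            bundle["GPU"] = 1.0
        return [dict(bundle) for n in self.process_on_nodes for _ in range(n)]

    def per_worker_request(self) -> dict[str, float]:
        # Assumption: each colocated worker takes one CPU and an equal
        # share of the slot's single GPU.
        request = {"CPU": 1.0}
        if self.use_gpu:
            request["GPU"] = 1.0 / self.max_colocate_count
        return request


# Hardcoded case from the snippet above: one worker group owns the whole slot.
merged = PoolSketch(process_on_nodes=[8], max_colocate_count=1)

# What the comment suggests for Megatron: several worker groups per slot.
shared = PoolSketch(process_on_nodes=[2, 2], max_colocate_count=4)
```

Under these assumptions, `shared` reserves 4 CPUs and 1 GPU per slot, and each of the up to 4 colocated workers asks for a quarter of that GPU, while `merged` gives the single worker group the whole slot. If that is what the real code does, the name describes the sharing limit, and the CPU count is only a side effect of it.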

kzhou92 • Nov 08 '25 02:11

@wuxibin89 @vermouth1992 can you help review my fix? 🙏

JobQiu • Nov 22 '25 22:11