UCP/WIREUP: Add rma_bw lanes in smaller chunks for each memtype
What
Add rma_bw lanes in smaller chunks for each memtype.
Why ?
The rma_bw lanes are added for each memtype, starting from host. Thus, the lanes added for host memory can exhaust the limit on the total number of lanes, causing the cuda_ipc lane to be missed. Adding the lanes iteratively in smaller chunks avoids this issue.
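To make the effect concrete, here is a minimal sketch of the chunked approach. It is not the UCX implementation: MAX_LANES, NUM_MEMTYPES, and add_lanes_chunked() are hypothetical names, and lanes are reduced to simple counters.

```c
#include <assert.h>

#define MAX_LANES    4 /* hypothetical limit on the total number of lanes */
#define NUM_MEMTYPES 2 /* index 0: host, index 1: cuda (illustrative only) */

/*
 * Visit the memtypes round-robin, adding at most `chunk` lanes per memtype on
 * each pass, until the lane limit is reached or no memtype has candidates
 * left. Returns the total number of lanes added; added[] receives the
 * per-memtype counts.
 */
static int add_lanes_chunked(const int candidates[NUM_MEMTYPES], int chunk,
                             int added[NUM_MEMTYPES])
{
    int total    = 0;
    int progress = 1;
    int m, n;

    assert(chunk > 0);
    for (m = 0; m < NUM_MEMTYPES; ++m) {
        added[m] = 0;
    }

    while (progress && (total < MAX_LANES)) {
        progress = 0;
        for (m = 0; (m < NUM_MEMTYPES) && (total < MAX_LANES); ++m) {
            n = candidates[m] - added[m];  /* candidates still unused */
            if (n > chunk) {
                n = chunk;                 /* cap by chunk size */
            }
            if (n > (MAX_LANES - total)) {
                n = MAX_LANES - total;     /* cap by remaining lane budget */
            }
            if (n > 0) {
                added[m] += n;
                total    += n;
                progress  = 1;
            }
        }
    }

    return total;
}
```

With 4 host candidates and 1 cuda candidate, a chunk size of 4 lets host take all 4 lanes and cuda gets none; a chunk size of 1 gives host 3 lanes and cuda 1.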
How ?
@brminich One issue I'm running into is that the local and remote device bitmaps are not preserved across iterations of the new loop. This is because we don't modify the bitmaps in bw_info; we modify them locally in add_bw_lanes. However, if I update add_bw_lanes to modify the bitmaps in bw_info, then updates for one memtype will affect the bitmaps for other memtypes too.
I think it should be OK, because if the same device is suitable for several memory types, we would want to initiate just one set of lanes.
@brminich ucp_wireup_add_fast_lanes will not work as intended with the new changes, because it compares the bw of the lanes that are in sinfo_array, and when we add the lanes for each mem_type over multiple iterations, we lose the global bw.
@yosefe @brminich can you please take another look at this PR so we can have it merged for 1.17? Thanks. But please don't merge until I remove the label.
I've been debugging a case where the selected lanes differ between a smaller and a larger batch size. The root cause is that ucp_wireup_rma_bw_score_func() and ucp_wireup_add_fast_lanes() calculate the bandwidth differently. As a result, the rma_bw lane selected by ucp_wireup_select_transport (which has the highest score) may not be the lane with the highest bandwidth (from the point of view of ucp_wireup_add_fast_lanes()).
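The following sketch illustrates how two metrics can disagree. The formulas are assumptions chosen for illustration and are not the actual UCX calculations: lane_score() ranks lanes by the inverse of an estimated transfer time, while lane_is_fast() keeps lanes whose raw bandwidth is within a ratio of the maximum. All names and numbers here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double bw;      /* bandwidth, bytes per second */
    double latency; /* latency, seconds */
} lane_t;

/* Assumed score: inverse of the estimated time to move msg_size bytes. */
static double lane_score(const lane_t *lane, double msg_size)
{
    return 1.0 / (lane->latency + (msg_size / lane->bw));
}

/* Index of the lane with the highest score. */
static size_t best_score_lane(const lane_t *lanes, size_t n, double msg_size)
{
    size_t best = 0;
    size_t i;

    assert(n > 0);
    for (i = 1; i < n; ++i) {
        if (lane_score(&lanes[i], msg_size) >
            lane_score(&lanes[best], msg_size)) {
            best = i;
        }
    }
    return best;
}

/* Assumed filter: a lane is "fast" if its raw bw >= ratio * max raw bw. */
static int lane_is_fast(const lane_t *lanes, size_t n, size_t idx,
                        double ratio)
{
    double max_bw = 0.0;
    size_t i;

    for (i = 0; i < n; ++i) {
        if (lanes[i].bw > max_bw) {
            max_bw = lanes[i].bw;
        }
    }
    return lanes[idx].bw >= (ratio * max_bw);
}
```

For example, with a 10 GB/s lane at 5 us latency and a 12 GB/s lane at 50 us latency, the first lane has the higher score for a 64 KiB message, yet a 0.9 ratio filter on raw bandwidth would drop it.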
Closing this PR as https://github.com/openucx/ucx/pull/9814 addresses the original problem.