distributed-join
By default, the memory pool size is the total GPU memory minus 500MB. During some OOM runs, we observed that using a smaller memory pool resolved the OOM issue. This indicates...
`UCXBufferCommunicator` allocates buffers for `count` and `recv_buffer` used in callbacks but never frees them. Although these buffers are small (8 bytes each), we should consider cleaning them up properly.
`cudaStreamDefault` is a flag meant to be passed to `cudaStreamCreateWithFlags`, not a valid stream handle. We should replace it with either `rmm::cuda_stream_default` or `0`.
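To illustrate the distinction (a sketch, not the repo's actual call site):

```cpp
#include <cuda_runtime.h>

int main() {
  // cudaStreamDefault is a stream-creation *flag* (value 0x00); it is only
  // valid as the `flags` argument of cudaStreamCreateWithFlags:
  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamDefault);

  // Where an API expects a stream handle, pass an actual stream (or the
  // default stream, i.e. rmm::cuda_stream_default or 0), never the flag:
  cudaMemsetAsync(nullptr, 0, 0, stream);  // ok: `stream` is a handle
  cudaStreamDestroy(stream);
  return 0;
}
```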
We should check the device memory usage and compare it with what we projected in the modeling, both with and without pipelining. We should consider whether the extra device usage is...
We should profile and improve the computation-communication overlap efficiency on
- a single-node DGX with NVLink
- multiple DGX nodes connected with IB
The error-checking utilities of this repo (currently located at `src/error.cuh`) should be aligned with cuDF's error-checking utilities (`cudf/utilities/error.hpp`). I believe this will allow more code reuse. For example,...
- Currently `UCXCommunicator` uses a different communication-tag design than `UCXBufferCommunicator`. We should bring them in line with each other.
- Currently, when there's no comm buffer available in the buffer...