Aluminum

High-performance, GPU-aware communication library

Results 11 Aluminum issues

Provide optimized versions of our custom-implemented NCCL collectives:
- [ ] `Alltoall`
- [ ] `Gather`
- [ ] `Scatter`
- [ ] `Allgatherv`
- [ ] `Alltoallv`
- [...

enhancement

The progress engine code has become crufty and has accumulated a lot of hacks. Clean it up.

enhancement

Support a compile-time flag to only start the progress engine on demand (i.e., if something is submitted to it). This is a flag so that we only pay this runtime...

enhancement
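As a rough illustration of the on-demand progress engine idea above, here is a minimal sketch, not Aluminum's actual implementation. The macro `EXAMPLE_PE_START_ON_DEMAND` and the class `ExampleProgressEngine` are hypothetical names invented for this example. With the flag enabled, the worker thread is only created on the first `submit`; with it disabled, the thread starts in the constructor and `submit` skips the start-up logic entirely.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical compile-time flag (assumption; not Aluminum's real option name).
#ifndef EXAMPLE_PE_START_ON_DEMAND
#define EXAMPLE_PE_START_ON_DEMAND 1
#endif

// Hypothetical, simplified progress engine used only for illustration.
class ExampleProgressEngine {
public:
  ExampleProgressEngine() {
#if !EXAMPLE_PE_START_ON_DEMAND
    // Eager mode: pay the thread start-up cost at construction.
    std::lock_guard<std::mutex> lock(mutex_);
    start_locked();
#endif
  }

  ~ExampleProgressEngine() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_one();
    // The worker drains any queued operations before exiting.
    if (worker_.joinable()) worker_.join();
  }

  void submit(std::function<void()> op) {
    assert(op && "submitted operation must be callable");
    {
      std::lock_guard<std::mutex> lock(mutex_);
#if EXAMPLE_PE_START_ON_DEMAND
      // Lazy mode: the first submission starts the worker thread.
      if (!running_.load()) start_locked();
#endif
      queue_.push(std::move(op));
    }
    cv_.notify_one();
  }

  bool is_running() const { return running_.load(); }

private:
  // Must be called with mutex_ held.
  void start_locked() {
    running_.store(true);
    worker_ = std::thread([this] { run(); });
  }

  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    for (;;) {
      cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
      if (queue_.empty()) return;  // Stop requested and nothing left to do.
      std::function<void()> op = std::move(queue_.front());
      queue_.pop();
      lock.unlock();  // Run the operation without holding the lock.
      op();
      lock.lock();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> queue_;
  std::thread worker_;
  std::atomic<bool> running_{false};
  bool stop_ = false;
};
```

In this sketch the choice is made by the preprocessor, so the eager build carries no extra check in `submit`; a real implementation would also need to consider thread binding and stream handling, which are left out here.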

Our current testing infrastructure does not actually check results when using `half`, since MPI does not support it.

The default MPI error handler is typically bad, because it kills the application, but doesn't give you a stack trace. This adds a better error handler.

enhancement

```
$ jsrun --bind packed:8 --nrs 1 --rs_per_host 1 --tasks_per_rs 1 --launch_distribution packed --cpu_per_rs ALL_CPUS --gpu_per_rs ALL_GPUS ./test_ops.exe --backend mpi --op scatter --inplace
Aborting after hang in Al size=1
```
...

bug

Our coding style here is a bit of a mess and needs to be unified, especially variable names.

The NCCL backend's in-place reduce-scatter uses `sendbuf = recvbuf` [here](https://github.com/LLNL/Aluminum/blob/master/src/nccl_impl.hpp#L403). But per the NCCL documentation, the in-place reduce-scatter should actually have `recvbuf` point to the appropriate offset into `sendbuf` (see [here](https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/usage/inplace.html#in-place-operations))....
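To make the pointer arithmetic concrete: the NCCL documentation linked above states that a reduce-scatter is in-place when `recvbuff == sendbuff + rank * recvcount`. The helper below is a minimal sketch of that offset computation; `inplace_reduce_scatter_recvbuf` is a hypothetical name for illustration and is not part of Aluminum or NCCL.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (assumption; not an Aluminum or NCCL function).
// Returns the receive pointer NCCL expects for an in-place reduce-scatter:
// this rank's slice of the send buffer, i.e. sendbuf + rank * recvcount,
// where recvcount is the number of elements each rank receives.
template <typename T>
T* inplace_reduce_scatter_recvbuf(T* sendbuf, std::size_t recvcount, int rank) {
  assert(sendbuf != nullptr);
  assert(rank >= 0);
  return sendbuf + static_cast<std::size_t>(rank) * recvcount;
}

// Intended use with the real NCCL call (shown as a comment so the sketch
// compiles without NCCL or CUDA):
//
//   T* recvbuf = inplace_reduce_scatter_recvbuf(sendbuf, recvcount, rank);
//   ncclReduceScatter(sendbuf, recvbuf, recvcount, datatype, op, comm, stream);
```

For example, with 4 ranks and `recvcount = 2`, rank 3 would receive into elements 6 and 7 of its own send buffer rather than into the start of the buffer.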

We should have component support to make life easier for picky downstreams. For example, distconv requires NCCL and HostTransfer backends, and it'd be better to just ``` find_package(Aluminum COMPONENTS NCCL...

enhancement

As of ROCm 4.2, HIP supports `hipStreamWaitValue32`/`64` and `hipStreamWriteValue32`/`64` (analogous to the corresponding `cuStreamWaitValue`/`WriteValue` methods we use). We should support these as well and make them the default implementation on...

enhancement