Pavan Balaji
Can you create a separate PR for new algorithm selections? That's orthogonal to this PR.
@hzhou Your interpretation of the standard is correct. But I think @wesbland's point is that the implementation should be correct even if some processes give a different value than others,...
I don't know how `allreduce(sum)` will help. We should probably add an internal op called `is_equal`.
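A rough sketch of what such an `is_equal` op could look like, written as an associative combine step over a (value, flag) pair. The names `eq_state`, `eq_combine`, and `all_ranks_equal` are hypothetical and are not MPICH code; the plain loop only stands in for the reduction across processes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction state: a representative value plus a flag that
 * stays 1 only while every contribution folded in so far has matched. */
typedef struct {
    int value;
    int all_equal;
} eq_state;

/* Combine two partial results. The flag does not depend on how the
 * partial results are grouped, so this can serve as a reduction step. */
static eq_state eq_combine(eq_state a, eq_state b)
{
    eq_state out;
    out.value = a.value;
    out.all_equal = a.all_equal && b.all_equal && (a.value == b.value);
    return out;
}

/* Stand-in for the collective: folds one value per simulated rank. */
static int all_ranks_equal(const int *per_rank, size_t nranks)
{
    assert(nranks > 0);
    eq_state acc = { per_rank[0], 1 };
    for (size_t i = 1; i < nranks; i++) {
        eq_state next = { per_rank[i], 1 };
        acc = eq_combine(acc, next);
    }
    return acc.all_equal;
}
```

Unlike a sum, the flag here cannot be made to look consistent by mismatched contributions that happen to cancel out.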
> > I don't know how `allreduce(sum)` will help. We should probably add an internal op called `is_equal`.
>
> ```c
> if (sum == comm.size)
>     is_equal = true;...
> ```
Is this failing without GPU support?
@hzhou Yaksa is 32-bit clean. Are you seeing some error on 32-bit systems?
I don't think this is a yaksa bug. The user of yaksa (MPICH) is required to make sure the data buffer that's passed in is correctly aligned for the datatype...
That is the user's (MPICH) responsibility. For example, if the user passes a buffer to Yaksa as a collection of integers, then that buffer must be aligned for integers. The...
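For illustration, a caller could check this alignment requirement on its side before handing a buffer over. This is a minimal sketch assuming C11; `is_aligned` is a hypothetical helper and not part of the Yaksa or MPICH API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the address held in ptr is a
 * multiple of alignment, 0 otherwise. */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment > 0);
    return ((uintptr_t) ptr % alignment) == 0;
}
```

A buffer described as a collection of integers would then be checked with something like `is_aligned(buf, _Alignof(int))`.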
@minsii The DMA access latencies for most GPUs are in microseconds. In comparison, a function pointer dereference is ~25 cycles. So perhaps an expected gain comparison is useful before doing...
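As a back-of-the-envelope version of that comparison, the sketch below converts the ~25 cycles into nanoseconds and expresses it as a share of one DMA access. The 2 GHz clock and the 1 microsecond DMA latency used in the note afterwards are assumed round numbers for illustration, not measurements, and the function names are hypothetical.

```c
#include <assert.h>

/* Convert a cycle count to nanoseconds at a given clock rate in GHz. */
static double cycles_to_ns(double cycles, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return cycles / clock_ghz;
}

/* Fraction of one DMA access that the indirect call accounts for. */
static double indirect_call_share(double call_cycles, double clock_ghz,
                                  double dma_latency_us)
{
    assert(dma_latency_us > 0.0);
    return cycles_to_ns(call_cycles, clock_ghz) / (dma_latency_us * 1000.0);
}
```

Under those assumptions, `cycles_to_ns(25.0, 2.0)` is 12.5 ns, so the dereference is a little over 1% of a single 1 microsecond DMA access.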