mpi all_reduce
This MR targets replicated_tasks, as it can be tied to the use of alternative_source.
StarPU lacks an all-reduce. While this operation could be done with a reduce followed by a broadcast, it may be worthwhile to have a dedicated operation with fewer steps.
This MR proposes a simple scheme akin to the butterfly pattern in an FFT. It works with a non-power-of-2 number of contributions by adding an extra step. The literature on MPI collectives from the 2000s describes many all-reduce patterns, some of which involve halving the results (this makes sense with matrices, and could be achieved with partitioning in StarPU). A trade-off exists between latency and bandwidth.
The present implementation should be latency-optimal, i.e. it takes fewer steps than a reduce followed by a bcast.
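
To make the step count concrete, here is a minimal sketch that simulates a butterfly sum all-reduce over an array standing in for the ranks. It does not use the StarPU or MPI APIs and is not the code of this MR: `butterfly_allreduce_sim`, `pow2_floor` and `SIM_MAX_RANKS` are hypothetical names. It also assumes the common fold-in/fold-out handling of non-power-of-2 sizes, which costs two extra steps; the scheme in this MR, described above as adding one extra step, may differ.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_MAX_RANKS 64 /* arbitrary bound for this illustration */

/* Largest power of two that is <= n (requires n >= 1). */
static size_t pow2_floor(size_t n)
{
    size_t p = 1;
    while (p * 2 <= n)
        p *= 2;
    return p;
}

/* Hypothetical helper: simulate a sum all-reduce over n "ranks" in place.
 * val[r] holds rank r's contribution on entry and the global sum on exit.
 * Returns the number of communication steps the pattern would take. */
int butterfly_allreduce_sim(long *val, size_t n)
{
    long tmp[SIM_MAX_RANKS];
    size_t p2, d, r;
    int steps = 0;

    assert(n >= 1 && n <= SIM_MAX_RANKS);
    p2 = pow2_floor(n);

    /* Fold-in (assumption): ranks p2..n-1 send their value to rank r - p2. */
    if (n > p2) {
        for (r = p2; r < n; r++)
            val[r - p2] += val[r];
        steps++;
    }

    /* Butterfly: at distance d, rank r exchanges with rank r ^ d
     * and both keep the sum of the two values. */
    for (d = 1; d < p2; d *= 2) {
        for (r = 0; r < p2; r++)
            tmp[r] = val[r] + val[r ^ d];
        for (r = 0; r < p2; r++)
            val[r] = tmp[r];
        steps++;
    }

    /* Fold-out (assumption): ranks p2..n-1 receive the result back. */
    if (n > p2) {
        for (r = p2; r < n; r++)
            val[r] = val[r - p2];
        steps++;
    }
    return steps;
}
```

With 8 contributions the simulation takes 3 steps, and with 6 contributions it takes 4. For comparison, a binomial-tree reduce followed by a binomial-tree bcast over 8 ranks takes 3 + 3 = 6 steps.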
- docs
- example
- "simple" all-reduce, providing a benchmark
- all-reduce + alternative_source ?
- fortran interface