Here is a distributed implementation of the Conjugate Gradient (CG) method. It's based on the basic example already in StarPU, examples/cg/cg.c, but distributes the data among MPI processes and takes MPI into account when submitting tasks.
Any review is welcome. :)
Background: I'm looking for a memory-bound application whose communications are overlapped by computations. The Conjugate Gradient seems to be a good candidate. Since I want communications to happen, I'm not especially trying to optimize or reduce them.
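
For reviewers who want a concrete picture of what "distributes data among MPI processes" can mean, here is a minimal sketch of a contiguous block split of rows across ranks. This is only an illustration under my own assumptions: `block_range` and `block_owner` are hypothetical helper names, they are not taken from the patch or from the StarPU API, and the actual distribution used in the code may differ.

```c
#include <assert.h>

/* Hypothetical helper (not from the patch): half-open range
 * [*begin, *end) of rows owned by `rank` when `n` rows are split into
 * contiguous blocks over `nprocs` processes. The first n % nprocs
 * ranks each get one extra row. */
static void block_range(int n, int nprocs, int rank, int *begin, int *end)
{
    assert(n >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    int base = n / nprocs;   /* rows every rank gets */
    int extra = n % nprocs;  /* leftover rows, one per low rank */
    *begin = rank * base + (rank < extra ? rank : extra);
    *end = *begin + base + (rank < extra ? 1 : 0);
}

/* Hypothetical helper: rank that owns global row `row` under the
 * same split as block_range(). */
static int block_owner(int n, int nprocs, int row)
{
    assert(nprocs > 0 && row >= 0 && row < n);
    int base = n / nprocs;
    int extra = n % nprocs;
    int cut = extra * (base + 1); /* first row owned by a "small" rank */
    if (row < cut)
        return row / (base + 1);
    return extra + (row - cut) / base;
}
```

With 10 rows over 4 processes, this split gives ranks 0 and 1 three rows each and ranks 2 and 3 two rows each. An owner lookup of this kind is the sort of information needed to decide which process a given piece of data lives on.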