* [StarPU](https://starpu.gitlabpages.inria.fr/): task-based runtime system
* [Chameleon](https://solverstack.gitlabpages.inria.fr/chameleon/): dense linear algebra library, built on top of StarPU. Provides benchmarks of linear algebra kernels, very useful!
Both are available as Guix (in the [Guix-HPC channel](https://gitlab.inria.fr/guix-hpc/guix-hpc)) or Spack packages, and usually available as modules on some clusters (well, maybe only PlaFRIM).
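If you go through a package manager instead, installation might look like this (a sketch; the package names `starpu` and `chameleon` are assumptions to check against the Guix-HPC channel and Spack):

```sh
# with Guix, once the Guix-HPC channel is configured
guix install starpu chameleon
# or with Spack
spack install starpu chameleon
```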
## Building StarPU
```sh
sudo apt install libtool-bin libhwloc-dev libmkl-dev pkg-config # and probably others I already have installed
git clone git@gitlab.inria.fr:starpu/starpu.git # or https://gitlab.inria.fr/starpu/starpu.git if you don't have a gitlab.inria.fr account with a registered SSH key
cd starpu && ./autogen.sh && mkdir build && cd build # generate the configure script, then build out of tree
../configure --prefix=$HOME/dev/builds/starpu --disable-opencl --disable-cuda --disable-fortran # adapt to your use case, see https://files.inria.fr/starpu/testing/master/doc/html/CompilationConfiguration.html
make -j && make install
```
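To check the build, you can run one of the installed tools, e.g. `starpu_machine_display`, which lists the detected workers (assuming the prefix above):

```sh
$HOME/dev/builds/starpu/bin/starpu_machine_display
```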
StarPU should have detected MPI during its build.
For Chameleon, you have to add the options `-DCHAMELEON_USE_MPI=ON -DCHAMELEON_USE_MPI_DATATYPES=ON` to the `cmake` command line and build again.
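A minimal sketch of such a Chameleon build, assuming an out-of-tree build directory and an install prefix alongside the StarPU one (you may also need to point `PKG_CONFIG_PATH` at the StarPU install so that it is found):

```sh
cd chameleon/build
export PKG_CONFIG_PATH=$HOME/dev/builds/starpu/lib/pkgconfig:$PKG_CONFIG_PATH
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/dev/builds/chameleon \
         -DCHAMELEON_USE_MPI=ON -DCHAMELEON_USE_MPI_DATATYPES=ON
make -j && make install
```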
The common way of using distributed StarPU is to launch one MPI/StarPU process per compute node; StarPU then takes care of feeding all the available cores with tasks.
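You can run, for example (a sketch assuming the Chameleon install prefix from the build above; adapt the path to your setup):

```sh
mpirun -np 4 $HOME/dev/builds/chameleon/bin/chameleon_stesting -o potrf -n 32000 -H
```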
This will execute a Cholesky decomposition (`potrf`) with 4 MPI processes (`-np 4`) and present the results in a human-readable way (`-H`; for a CSV-like output, omit this option).
You can measure the performance of different matrix sizes with the option `-n 3200:32000:3200` (from matrix size 3200 to 32000 with a step of 3200).
You can run several iterations for the same matrix size with `--niter 2`.
## Basic performance tuning
A good matrix distribution is a square 2D block-cyclic one; to get it, add `-P x`, where `x` should be (close to) the square root of the number of MPI processes (i.e., you should use a square number of compute nodes).
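For example, with the 4 MPI processes used above, this gives a 2 × 2 process grid (a sketch, same assumed paths as before):

```sh
mpirun -np 4 $HOME/dev/builds/chameleon/bin/chameleon_stesting -o potrf -n 32000 -P 2 -H
```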
To get better results, you should bind the main thread:
```sh
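# bind the main (task submission) thread to its own reserved core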
export STARPU_MAIN_THREAD_BIND=1
```
Set the number of workers (CPU cores executing tasks) to the number of cores available on the compute node minus one:
```sh
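# e.g. 15 workers for a 16-core node; one core stays free for the bound main thread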
export STARPU_NCPU=15
```
You should not use hyperthreads.
To find a good matrix size range, just run with sizes of, say, `3200:50000:3200`, plot the obtained Gflop/s, and see at which size you reach the plateau.
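Putting it all together, a tuned distributed run could look like this (a sketch under the same assumptions as above):

```sh
export STARPU_MAIN_THREAD_BIND=1
export STARPU_NCPU=15
# depending on your MPI launcher, the environment variables may need to be forwarded explicitly (e.g. -x with Open MPI)
mpirun -np 4 $HOME/dev/builds/chameleon/bin/chameleon_stesting -o potrf -P 2 -n 3200:50000:3200 --niter 2
```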