diff --git a/content/pages/faq.md b/content/pages/faq.md
index 1d69df682c9c99fe70c8b3666a0b9daa5d7a0c0b..9852c1676bd6112e238fa1ac623a011431dbcccb 100644
--- a/content/pages/faq.md
+++ b/content/pages/faq.md
@@ -10,42 +10,48 @@ attribute: 3
 # Getting started with StarPU and Chameleon
 
 * [StarPU](https://starpu.gitlabpages.inria.fr/): task-based runtime system
-* [Chameleon](https://solverstack.gitlabpages.inria.fr/chameleon/): dense linear algebra library, built on top of StarPU. Provides benchmarks of linear algebra kernels, very useful!
+* [Chameleon](https://solverstack.gitlabpages.inria.fr/chameleon/):
+  dense linear algebra library, built on top of StarPU.
 
-Both are available as Guix (in the [Guix-HPC channel](https://gitlab.inria.fr/guix-hpc/guix-hpc)) or Spack packages, and usually available as modules on some clusters (well, maybe only PlaFRIM).
+Both are available as Guix (in the [Guix-HPC
+channel](https://gitlab.inria.fr/guix-hpc/guix-hpc)) or Spack
+packages.
 
 ## Building StarPU
 
 ```sh
-sudo apt install libtool-bin libhwloc-dev libmkl-dev pkg-config # and probably other I already have installed
-git clone git@gitlab.inria.fr:starpu/starpu.git # or https://gitlab.inria.fr/starpu/starpu.git if you don't have a gitlab.inria.fr account with registered SSH key
+sudo apt install libtool-bin libhwloc-dev libmkl-dev pkg-config
+git clone https://gitlab.inria.fr/starpu/starpu.git
 cd starpu
 ./autogen.sh
 mkdir build
 cd build
-../configure --prefix=$HOME/dev/builds/starpu --disable-opencl --disable-cuda --disable-fortran # adapt to your usecase, see https://files.inria.fr/starpu/testing/master/doc/html/CompilationConfiguration.html
+../configure --prefix=$HOME/dev/builds/starpu --disable-opencl --disable-cuda --disable-fortran
+# adapt the flags to your use case; see https://files.inria.fr/starpu/testing/master/doc/html/CompilationConfiguration.html
 make -j && make -j install
 ```
 
-Adjust environment variables (in your .bash_profile / ...):
+Adjust environment variables (for example in your `.bash_profile`):
 
 ```sh
 export PATH=$HOME/dev/builds/starpu/bin:${PATH}
 export LD_LIBRARY_PATH=$HOME/dev/builds/starpu/lib/:${LD_LIBRARY_PATH}
 export PKG_CONFIG_PATH=$HOME/dev/builds/starpu/lib/pkgconfig:${PKG_CONFIG_PATH}
 ```
 
-After sourcing your .bash_profile, you should be able to execute:
+After sourcing `.bash_profile`, you should be able to execute:
 
 ```sh
 starpu_machine_display
 ```
 
-Which shows which hardware is available on your local machine.
+This command shows the hardware available on your local machine.
 
+Full information on how to build StarPU is available
+[here](https://files.inria.fr/starpu/doc/html_web_installation/).
+
 ## Building Chameleon
 
 ```sh
-sudo apt install cmake libmkl-dev # and probably other I already have installed
-git clone --recurse-submodules git@gitlab.inria.fr:solverstack/chameleon.git # or https://gitlab.inria.fr/solverstack/chameleon.git
+sudo apt install cmake libmkl-dev
+git clone --recurse-submodules https://gitlab.inria.fr/solverstack/chameleon.git
 cd chameleon
 mkdir build
 cd build
@@ -59,34 +65,49 @@ $HOME/dev/builds/chameleon/bin/chameleon_stesting -o potrf -H # should print som
 
-StarPU should have detected MPI during its building.
+StarPU should have detected MPI when it was built.
 
-For Chameleon, you have to add the options `-DCHAMELEON_USE_MPI=ON -DCHAMELEON_USE_MPI_DATATYPES=ON` to the cmake command line and build again.
+For Chameleon, you have to add the options `-DCHAMELEON_USE_MPI=ON
+-DCHAMELEON_USE_MPI_DATATYPES=ON` to the `cmake` command line and build
+again.
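+
+For example, a minimal sketch of the MPI-enabled rebuild (the install
+prefix here is an assumption, inferred from the `chameleon_stesting`
+path used on this page):
+
+```sh
+cd chameleon/build
+cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/dev/builds/chameleon \
+         -DCHAMELEON_USE_MPI=ON -DCHAMELEON_USE_MPI_DATATYPES=ON
+make -j && make -j install
+```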
-The common way of using distributed StarPU is to launch one MPI/StarPU process per compute node, and then StarPU takes care of feeding all available cores with task. You can do:
+
+The common way of using distributed StarPU is to launch one MPI/StarPU
+process per compute node, and then StarPU takes care of feeding all
+available cores with tasks. You can run:
 
 ```sh
 mpirun -np 4 $HOME/dev/builds/chameleon/bin/chameleon_stesting -o potrf -H
 ```
 
-This will execute a Cholesky decomposition (`potrf`) with 4 MPI processes (`-np 4`) and presents results in a human-readable way (`-H`; for a CSV-like output, you can omit this option).
-You can measure performance of different matrix size with the option `-n 3200:32000:3200` (from matrix size 3200 to 32000 with a step of 3200).
+This will execute a Cholesky decomposition (`potrf`) with 4 MPI
+processes (`-np 4`) and present results in a human-readable way
+(`-H`; for a CSV-like output, you can omit this option).
+
+You can measure performance for different matrix sizes with the option
+`-n 3200:32000:3200` (from matrix size 3200 to 32000 with a step of
+3200).
-You can do several iteration of the same matrix size with `--niter 2`.
+You can run several iterations for the same matrix size with `--niter 2`.
 
 ## Basic performance tuning
 
-A good matrix distribution is square 2D-block-cyclic, for this add `-P x` where `x` should be (close to) the square root of the number of MPI processes (ie, you should use a square number of compute nodes).
+A good matrix distribution is a square 2D block-cyclic one; for this,
+add `-P x`, where `x` should be (close to) the square root of the
+number of MPI processes (i.e., you should use a square number of
+compute nodes).
 
 To get better results, you should bind the main thread:
 
 ```sh
 export STARPU_MAIN_THREAD_BIND=1
 ```
 
-Set the number of workers (CPU cores executing task) to the number of cores available on the compute node minus one:
+Set the number of workers (CPU cores executing tasks) to the number of
+cores available on the compute node minus one:
 
 ```sh
 export STARPU_NCPU=15
 ```
 
 You should not use hyperthreads.
 
-To know what is the good matrix size range, just execute with sizes, let's say, `3200:50000:3200`, plot the obtained Gflop/s and see with which size you reach the plateau.
+To find a good matrix size range, just run with sizes such as
+`3200:50000:3200`, plot the obtained Gflop/s, and see at which size
+you reach the plateau.
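+
+Putting the tuning options above together, a run on 4 nodes might look
+like this sketch (`-P 2` matches the 4 MPI processes, and
+`STARPU_NCPU=15` assumes 16-core compute nodes):
+
+```sh
+export STARPU_MAIN_THREAD_BIND=1
+export STARPU_NCPU=15  # cores per node minus one
+# depending on your MPI implementation, you may need to forward these
+# variables explicitly to the remote processes
+mpirun -np 4 $HOME/dev/builds/chameleon/bin/chameleon_stesting -o potrf -P 2 -n 3200:50000:3200 --niter 2 -H
+```
 
 ## Misc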