Linking an external application with Chameleon libraries

Compilation and linking with the Chameleon libraries have been tested with the GNU compiler suite (gcc/gfortran) and the Intel compiler suite (icc/ifort).

Flags required

The compiler and linker flags necessary to build an application using Chameleon are provided through the pkg-config mechanism.

export PKG_CONFIG_PATH=/home/jdoe/install/chameleon/lib/pkgconfig:$PKG_CONFIG_PATH
pkg-config --cflags chameleon
pkg-config --libs chameleon
pkg-config --libs --static chameleon

The .pc files required are located in the sub-directory lib/pkgconfig of your Chameleon install directory.
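
For instance, assuming the install path used above and a single source file main.c, a pkg-config based build could look like the following sketch:

export PKG_CONFIG_PATH=/home/jdoe/install/chameleon/lib/pkgconfig:$PKG_CONFIG_PATH
gcc $(pkg-config --cflags chameleon) -c main.c -o main.o
gcc main.o -o main $(pkg-config --libs chameleon)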

Static linking in C

Let's imagine you have a file main.c that you want to link with the Chameleon static libraries. Let's consider /home/yourname/install/chameleon as the install directory of Chameleon, containing the sub-directories include/ and lib/. Your compilation command with the gcc compiler could be:

gcc -I/home/yourname/install/chameleon/include -o main.o -c main.c

Now if you want to link your application with Chameleon static libraries, you could do:

gcc main.o -o main                                         \
/home/yourname/install/chameleon/lib/libchameleon.a        \
/home/yourname/install/chameleon/lib/libchameleon_starpu.a \
/home/yourname/install/chameleon/lib/libcoreblas.a         \
-lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64           \
-lmkl_sequential -lmkl_core -lpthread -lm -lrt

As you can see in this example, we also link with some dynamic libraries: starpu-1.2, the Intel MKL libraries (for BLAS/LAPACK/CBLAS/LAPACKE), pthread, m (math) and rt. These libraries depend on the configuration of your Chameleon build. You can find these dependencies in the .pc files generated during compilation and installed in the sub-directory lib/pkgconfig of your Chameleon install directory. Note also that you may need to specify where to find these libraries with the -L option of your compiler/linker.

Before running your program, make sure that all the shared library paths your executable depends on are known. Enter ldd main to check. If some shared library paths are missing, append them to the LD_LIBRARY_PATH environment variable on Linux systems (DYLD_LIBRARY_PATH on Mac).

Dynamic linking in C

For dynamic linking (Chameleon must be built with the CMake option BUILD_SHARED_LIBS=ON), the procedure is similar to the static compilation/link, but instead of giving the paths to your static libraries you indicate the path to the dynamic libraries with the -L option and give the names of the libraries with the -l option, like this:

gcc main.o -o main \
-L/home/yourname/install/chameleon/lib \
-lchameleon -lchameleon_starpu -lcoreblas \
-lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64 \
-lmkl_sequential -lmkl_core -lpthread -lm -lrt

Note that updating your environment variable LD_LIBRARY_PATH (DYLD_LIBRARY_PATH on Mac) with the path of the libraries may be required before executing your program:

export LD_LIBRARY_PATH=path/to/libs:path/to/chameleon/lib

Using Chameleon executables

Chameleon provides several test executables that are compiled and linked with Chameleon's dependencies. Instructions about the arguments to give to the executables are available through the option -[-]help or -[-]h. This set of binaries is separated into three categories and can be found in three different directories:

  • example: contains examples of API usage and more specifically the sub-directory lapack_to_morse/ provides a tutorial that explains how to use Chameleon functionalities starting from a full LAPACK code, see Tutorial LAPACK to Chameleon
  • testing: contains testing drivers to check numerical correctness of Chameleon linear algebra routines with a wide range of parameters
    ./testing/stesting 4 1 LANGE 600 100 700
        

    The first two arguments are the numbers of cores and GPUs to use. The third one is the name of the algorithm to test. The other arguments depend on the algorithm; here they stand for the number of rows, the number of columns and the leading dimension of the problem.

    The names of the algorithms available for testing are:

    • LANGE: matrix norms (infinity, one, max, Frobenius)
    • GEMM: general matrix-matrix multiply
    • HEMM: Hermitian matrix-matrix multiply
    • HERK: Hermitian matrix-matrix rank k update
    • HER2K: Hermitian matrix-matrix rank 2k update
    • SYMM: symmetric matrix-matrix multiply
    • SYRK: symmetric matrix-matrix rank k update
    • SYR2K: symmetric matrix-matrix rank 2k update
    • PEMV: matrix-vector multiply with pentadiagonal matrix
    • TRMM: triangular matrix-matrix multiply
    • TRSM: triangular solve, multiple rhs
    • POSV: solve linear systems with symmetric positive-definite matrix
    • GESV_INCPIV: solve linear systems with general matrix
    • GELS: linear least squares with general matrix
    • GELS_HQR: gels with hierarchical tree
    • GELS_SYSTOLIC: gels with systolic tree
  • timing: contains timing drivers to assess the performance of Chameleon routines. There are two sets of executables: those that do not use the tile interface and those that do (with _tile in the name of the executable). Executables without the tile interface allocate data following the LAPACK conventions, and these data can be given as arguments to Chameleon routines as you would do with LAPACK. Executables with the tile interface generate the data directly in the format that Chameleon tile algorithms use to submit tasks to the runtime system. Executables with the tile interface should be more efficient because no data copy from the LAPACK matrix layout to the tile matrix layout is necessary (a sketch of such a call is given after the list of timing algorithms below). Calling example:
    ./timing/time_dpotrf --n_range=1000:10000:1000 --nb=320
                         --threads=9 --gpus=3
                         --nowarmup
        

    List of main options that can be used in timing:

    • --help: show usage
    • --threads: Number of CPU workers (default: _SC_NPROCESSORS_ONLN)
    • --gpus: number of GPU workers (default: 0)
    • --n_range=R: range of N values, with R=Start:Stop:Step (default: 500:5000:500)
    • --m=X: dimension (M) of the matrices (default: N)
    • --k=X: dimension (K) of the matrices (default: 1), useful for GEMM algorithm (k is the shared dimension and must be defined >1 to consider matrices and not vectors)
    • --nrhs=X: number of right-hand sides (default: 1)
    • --nb=X: block/tile size. (default: 128)
    • --ib=X: inner-blocking/IB size. (default: 32)
    • --niter=X: number of iterations performed for each test (default: 1)
    • --rhblk=X: if X > 0, enable Householder mode for QR and LQ factorization. X is the size of each subdomain (default: 0)
    • --[no]check: check result (default: nocheck)
    • --[no]profile: print profiling information (default: noprofile)
    • --[no]trace: enable/disable trace generation (default: notrace)
    • --[no]dag: enable/disable DAG generation (default: nodag)
    • --[no]inv: check on inverse (default: noinv)
    • --nocpu: all GPU kernels are exclusively executed on GPUs
    • --ooc: Enable out-of-core (available only with StarPU)
    • --bound: Compare result to area bound (available only with StarPU) (default: 0)

    List of timing algorithms available:

    • LANGE: norms of matrices
    • GEMM: general matrix-matrix multiply
    • TRSM: triangular solve
    • POTRF: Cholesky factorization with a symmetric positive-definite matrix
    • POTRI: Cholesky inversion
    • POSV: solve linear systems with symmetric positive-definite matrix
    • GETRF_NOPIV: LU factorization of a general matrix using the tile LU algorithm without row pivoting
    • GESV_NOPIV: solve linear system for a general matrix using the tile LU algorithm without row pivoting
    • GETRF_INCPIV: LU factorization of a general matrix using the tile LU algorithm with partial tile pivoting with row interchanges
    • GESV_INCPIV: solve linear system for a general matrix using the tile LU algorithm with partial tile pivoting with row interchanges
    • GEQRF: QR factorization of a general matrix
    • GELQF: LQ factorization of a general matrix
    • GEQRF_HQR: GEQRF with hierarchical tree
    • GEQRS: solve linear systems using a QR factorization
    • GELS: solves overdetermined or underdetermined linear systems involving a general matrix using the QR or the LQ factorization
    • GESVD: general matrix singular value decomposition
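
    As a concrete illustration of the tile drivers mentioned above, the following sketch launches a Cholesky timing run through the tile interface; the executable name time_dpotrf_tile is an assumption derived from the _tile naming convention, and only options listed above are used:

    ./timing/time_dpotrf_tile --n_range=1000:10000:1000 --nb=320 --threads=9 --gpus=3 --check --niter=3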

Execution trace using StarPU

<sec:trace>

StarPU can generate its own trace log files if it has been compiled with the --with-fxt option at the configure step (you may have to specify the directory where FxT is installed by giving --with-fxt=... instead of --with-fxt alone). By doing so, traces are generated after each execution of a program that uses StarPU, in the directory pointed to by the STARPU_FXT_PREFIX environment variable.

export STARPU_FXT_PREFIX=/home/jdoe/fxt_files/

When executing a ./timing/... Chameleon program, if tracing has been enabled (StarPU compiled with FxT and -DCHAMELEON_ENABLE_TRACING=ON), you can give the option --trace to tell the program to generate trace log files.

Finally, to generate the trace file, which can be opened with the ViTE program, you can use the starpu_fxt_tool executable of StarPU. This tool should be in $STARPU_INSTALL_REPOSITORY/bin. You can use it to generate the trace file like this:

path/to/your/install/starpu/bin/starpu_fxt_tool -i prof_filename

There is one file per MPI process (prof_filename_0, prof_filename_1, …). To generate a trace of MPI programs you can call it like this:

path/to/your/install/starpu/bin/starpu_fxt_tool -i prof_filename*

The trace file will be named paje.trace (use the -o option to specify an output name). Alternatively, for non-MPI execution (only one process and one profiling file), you can set the environment variable STARPU_GENERATE_TRACE=1 to generate the paje trace file automatically.
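
Putting the pieces together, a typical tracing session could look like the sketch below; the paths and the vite viewer invocation are assumptions to adapt to your installation:

export STARPU_FXT_PREFIX=/home/jdoe/fxt_files/
./timing/time_dpotrf --n_range=4000:4000:1 --nb=320 --threads=9 --trace
path/to/your/install/starpu/bin/starpu_fxt_tool -i /home/jdoe/fxt_files/prof_filename*
vite paje.trace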

Use simulation mode with StarPU-SimGrid

<sec:simu>

Simulation mode can be activated by setting the CMake option CHAMELEON_SIMULATION to ON. This mode allows you to simulate the execution of algorithms with StarPU compiled with SimGrid. To do so, we provide some perfmodels in the simucore/perfmodels/ directory of the Chameleon sources. To use these perfmodels, set your STARPU_HOME environment variable to path/to/your/chameleon_sources/simucore/perfmodels. Finally, set your STARPU_HOSTNAME environment variable to the name of the machine to simulate, for example STARPU_HOSTNAME=mirage. Note that only POTRF kernels with block sizes of 320 or 960 (in single and double precision) on the mirage and sirocco machines are available for now. The database of models is subject to change.
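
As an illustration, a simulated Cholesky run could be set up as in the following sketch, assuming Chameleon was built with CHAMELEON_SIMULATION=ON on top of a StarPU compiled with SimGrid (the block size must match one of the available perfmodels, e.g. 960):

export STARPU_HOME=path/to/your/chameleon_sources/simucore/perfmodels
export STARPU_HOSTNAME=mirage
./timing/time_dpotrf --n_range=9600:9600:1 --nb=960 --threads=9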

Chameleon API

Chameleon provides routines to solve dense general systems of linear equations, symmetric positive definite systems of linear equations and linear least squares problems, using LU, Cholesky, QR and LQ factorizations. Real and complex arithmetic are supported in both single and double precision. The linear algebra routines have the following form:

MORSE_name[_Tile[_Async]]
  • all user routines are prefixed with MORSE
  • in the pattern MORSE_name[_Tile[_Async]], name follows the BLAS/LAPACK naming scheme for algorithms (e.g. sgemm for general matrix-matrix multiply in single precision)
  • Chameleon provides three interface levels
    • MORSE_name: the simplest interface, very close to CBLAS and LAPACKE; matrices are given following the LAPACK data layout (1-D array, column-major). It involves copies of data from the LAPACK layout to the tile layout and conversely (to update the LAPACK data), see Step1.
    • MORSE_name_Tile: the tile interface avoids copies between the LAPACK and tile layouts. It is the standard interface of Chameleon and should achieve better performance than the previous, simplest interface. The data are given through a specific structure called a descriptor, see Step2.
    • MORSE_name_Tile_Async: similar to the tile interface, it avoids the synchronization barrier normally called between Tile routines. At the end of an Async function, completion of the tasks is not guaranteed and the data are not necessarily up-to-date. To ensure that all tasks have been executed, a synchronization function has to be called after the sequence of Async functions, see Step4.

MORSE routine calls have to be preceded by

MORSE_Init( NCPU, NGPU );

to initialize MORSE and the runtime system and followed by

MORSE_Finalize();

to free some data and finalize the runtime and/or MPI.
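
To make this call sequence concrete, here is a minimal sketch in C using the LAPACK-layout interface; the header name morse.h, the MORSE_dpotrf/MORSE_dpotrs signatures and the MorseUpper constant are assumptions based on the usual LAPACK-like conventions, so check the headers of your installation:

#include <stdlib.h>
#include <morse.h>

int main(void) {
    int N = 1000, NRHS = 1;
    /* Allocate A (N x N) and B (N x NRHS) in LAPACK column-major layout. */
    double *A = malloc((size_t)N * N    * sizeof(double));
    double *B = malloc((size_t)N * NRHS * sizeof(double));
    /* ... fill A with a symmetric positive-definite matrix and B with the right-hand side ... */

    MORSE_Init(4, 0);                              /* 4 CPU workers, 0 GPUs */
    MORSE_dpotrf(MorseUpper, N, A, N);             /* Cholesky factorization A = U^T U */
    MORSE_dpotrs(MorseUpper, N, NRHS, A, N, B, N); /* solve A x = B, solution overwrites B */
    MORSE_Finalize();

    free(A); free(B);
    return 0;
}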

Tutorial LAPACK to Chameleon

<sec:tuto>

This tutorial is dedicated to the API usage of Chameleon. The idea is to start from a simple code and explain, step by step, how to use Chameleon routines. The first step is a full BLAS/LAPACK code without any dependency on Chameleon, a code that most users should easily understand. Then, the different interfaces Chameleon provides are presented, from the simplest API (step1) to more complicated ones (up to step4). The way some important parameters are set is discussed in step5. step6 is an example of distributed computation with MPI. Finally, step7 shows how to let Chameleon initialize the user's data (matrices/vectors) in parallel.

Source files can be found in the example/lapack_to_morse/ directory. If the CMake option CHAMELEON_ENABLE_EXAMPLE is ON, the source files are compiled with the project libraries. The arithmetic precision is double. To execute a step X, enter the following command:

./stepX --option1 --option2 ...

Instructions about the arguments to give to the executables are available through the option -[-]help or -[-]h. Note that default values exist for the options.

For all steps, the program solves a linear system $Ax=B$. The matrix values are randomly generated but ensure that the matrix $A$ is symmetric positive definite, so that $A$ can be factorized in an $LL^T$ form using the Cholesky factorization.