# This file is part of the Chameleon User's Guide.
# Copyright (C) 2017 Inria
# See the file ../users_guide.org for copying conditions.

** Linking an external application with Chameleon libraries
   Compiling and linking against the Chameleon libraries has been
   tested with the GNU compiler suite ~gcc/gfortran~ and the Intel
   compiler suite ~icc/ifort~.

*** Flags required
    The compiler and linker flags needed to build an application
    using Chameleon are provided through the [[https://www.freedesktop.org/wiki/Software/pkg-config/][pkg-config]]
    mechanism.
    #+begin_src
    export PKG_CONFIG_PATH=/home/jdoe/install/chameleon/lib/pkgconfig:$PKG_CONFIG_PATH
    pkg-config --cflags chameleon
    pkg-config --libs chameleon
    pkg-config --libs --static chameleon
    #+end_src
    The .pc files required are located in the sub-directory
    ~lib/pkgconfig~ of your Chameleon install directory.
*** Static linking in C
    Suppose you have a file ~main.c~ that you want to link with the
    Chameleon static libraries, and that
    ~/home/yourname/install/chameleon~ is the Chameleon install
    directory, containing the sub-directories ~include/~ and ~lib/~.
    A possible compilation command with the gcc compiler is:
    #+begin_src
    gcc -I/home/yourname/install/chameleon/include -o main.o -c main.c
    #+end_src
    To link your application with the Chameleon static libraries, you
    could then run:
    #+begin_src
    gcc main.o -o main                                         \
    /home/yourname/install/chameleon/lib/libchameleon.a        \
    /home/yourname/install/chameleon/lib/libchameleon_starpu.a \
    /home/yourname/install/chameleon/lib/libcoreblas.a         \
    -lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64           \
    -lmkl_sequential -lmkl_core -lpthread -lm -lrt
    #+end_src
    As this example shows, we also link with some dynamic libraries:
    *starpu-1.2*, the *Intel MKL* libraries (for
    BLAS/LAPACK/CBLAS/LAPACKE), *pthread*, *m* (math) and *rt*.  These
    libraries depend on the configuration of your Chameleon build.
    You can find these dependencies in the .pc files that are
    generated during compilation and installed in the sub-directory
    ~lib/pkgconfig~ of your Chameleon install directory.  Note also that
    you may need to specify where to find these libraries with the *-L*
    option of your compiler/linker.

    Before running your program, make sure that the paths of all the
    shared libraries your executable depends on are known.  Run
    ~ldd main~ to check.  If some shared library paths are missing,
    append them to the LD_LIBRARY_PATH environment variable on Linux
    systems (DYLD_LIBRARY_PATH on Mac).

*** Dynamic linking in C
    Dynamic linking requires Chameleon to be built with the CMake
    option BUILD_SHARED_LIBS=ON.  It is similar to static
    compilation/linking, but instead of specifying the paths to the
    static libraries you indicate the directory of the dynamic
    libraries with the *-L* option and give the library names with the
    *-l* option, like this:
    #+begin_src
    gcc main.o -o main \
    -L/home/yourname/install/chameleon/lib \
    -lchameleon -lchameleon_starpu -lcoreblas \
    -lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64 \
    -lmkl_sequential -lmkl_core -lpthread -lm -lrt
    #+end_src
    Note that you may have to update your LD_LIBRARY_PATH environment
    variable (DYLD_LIBRARY_PATH on Mac) with the paths of the
    libraries before executing:
    #+begin_src
    export LD_LIBRARY_PATH=path/to/libs:path/to/chameleon/lib
    #+end_src

# # *** Build a Fortran program with Chameleon                         :noexport:
# #
# #     Chameleon provides a Fortran interface to user functions. Example:
# #     #+begin_src
# #     call chameleon_version(major, minor, patch) !or
# #     call CHAMELEON_VERSION(major, minor, patch)
# #     #+end_src
# #
# #     Build and link are very similar to the C case.
# #
# #     Compilation example:
# #     #+begin_src
# #     gfortran -o main.o -c main.f90
# #     #+end_src
# #
# #     Static linking example:
# #     #+begin_src
# #     gfortran main.o -o main                                    \
# #     /home/yourname/install/chameleon/lib/libchameleon.a        \
# #     /home/yourname/install/chameleon/lib/libchameleon_starpu.a \
# #     /home/yourname/install/chameleon/lib/libcoreblas.a         \
# #     -lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64           \
# #     -lmkl_sequential -lmkl_core -lpthread -lm -lrt
# #     #+end_src
# #
# #     Dynamic linking example:
# #     #+begin_src
# #     gfortran main.o -o main                          \
# #     -L/home/yourname/install/chameleon/lib           \
# #     -lchameleon -lchameleon_starpu -lcoreblas        \
# #     -lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64 \
# #     -lmkl_sequential -lmkl_core -lpthread -lm -lrt
# #     #+end_src

** Using Chameleon executables

   Chameleon provides several test executables that are compiled and
   linked with Chameleon's dependencies.  The arguments accepted by
   each executable are described by the option ~-[-]help~ or ~-[-]h~.
   These binaries are divided into three categories and can be found
   in three different directories:
   * *example*: contains examples of API usage. In particular, the
     sub-directory ~lapack_to_chameleon/~ provides a tutorial that explains
     how to use Chameleon functionalities starting from a full LAPACK
     code, see [[sec:tuto][Tutorial LAPACK to Chameleon]]
   * *testing*: contains testing drivers to check the numerical
     correctness of Chameleon linear algebra routines with a wide
     range of parameters
     #+begin_src
     ./testing/stesting 4 1 LANGE 600 100 700
     #+end_src
     The first two arguments are the numbers of cores and GPUs to use.
     The third one is the name of the algorithm to test.
     The remaining arguments depend on the algorithm; here they are
     the number of rows, the number of columns and the leading
     dimension of the problem.

     The names of the algorithms available for testing are:
     * LANGE: norms of matrices Infinite, One, Max, Frobenius
     * GEMM: general matrix-matrix multiply
     * HEMM: hermitian matrix-matrix multiply
     * HERK: hermitian matrix-matrix rank k update
     * HER2K: hermitian matrix-matrix rank 2k update
     * SYMM: symmetric matrix-matrix multiply
     * SYRK: symmetric matrix-matrix rank k update
     * SYR2K: symmetric matrix-matrix rank 2k update
     * PEMV: matrix-vector multiply with pentadiagonal matrix
     * TRMM: triangular matrix-matrix multiply
     * TRSM: triangular solve, multiple rhs
     * POSV: solve linear systems with symmetric positive-definite matrix
     * GESV_INCPIV: solve linear systems with general matrix
     * GELS: linear least squares with general matrix
     * GELS_HQR: gels with hierarchical tree
     * GELS_SYSTOLIC: gels with systolic tree
   * *timing*: contains timing drivers to assess the performance of
     Chameleon routines. There are two sets of executables: those that
     do not use the tile interface and those that do (with _tile in the
     name of the executable). Executables without the tile interface
     allocate data following LAPACK conventions, and these data can be
     given as arguments to Chameleon routines as you would do with
     LAPACK. Executables with the tile interface directly generate the
     data in the format that Chameleon tile algorithms use to submit
     tasks to the runtime system. Executables with the tile interface
     should perform better because no copy from the LAPACK matrix
     layout to the tile matrix layout is necessary. Calling example:
     #+begin_src
     ./timing/time_dpotrf --n_range=1000:10000:1000 --nb=320 \
                          --threads=9 --gpus=3               \
                          --nowarmup
     #+end_src

     List of main options that can be used in timing:
     * ~--help~: Show usage
     * Machine parameters
       * ~-t x, --threads=x~: Number of CPU workers (default: automatic
         detection through runtime)
       * ~-g x, --gpus=x~: Number of GPU workers (default: ~0~)
       * ~-P x, --P=x~: Rows (P) in the PxQ process grid (default: ~1~)
       * ~--nocpu~: All GPU kernels are exclusively executed on GPUs
     * Matrix parameters
       * ~-m x, --m=X, --M=x~: Dimension (M) of the matrices (default:
         ~N~)
       * ~-n x, --n=X, --N=x~: Dimension (N) of the matrices
       * ~-N R, --n_range=R~: Range of N values to time with
         ~R=Start:Stop:Step~ (default: ~500:5000:500~)
       * ~-k x, --k=x, --K=x, --nrhs=x~: Dimension (K) of the matrices
         or number of right-hand sides (default: ~1~). This is useful for
         GEMM algorithms (k is the shared dimension and must be set to
         a value >1 to consider matrices and not vectors)
       * ~-b x, --nb=x~: NB size. (default: ~320~)
       * ~-i x, --ib=x~: IB size. (default: ~32~)
     * Check/prints
       * ~--niter=X~: Number of iterations performed for each test
         (default: ~1~)
       * ~-W, --nowarning~: Do not show warnings
       * ~-w, --nowarmup~: Cancel the warmup run to pre-load libraries
       * ~-c, --check~: Check result
       * ~-C, --inc~: Check on inverse
       * ~--mode=x~ : Change the xLATMS matrix mode generation for
         SVD/EVD (default: ~4~). It must be between 0 and 20 inclusive.
     * Profiling parameters
       * ~-T, --trace~: Enable trace generation
       * ~--progress~: Display progress indicator
       * ~-d, --dag~: Enable DAG generation. Generates a dot_dag_file.dot.
       * ~-p, --profile~: Print profiling information
     * HQR parameters
       * ~-a x, --qr_a=x, --rhblk=x~: Define the size of the local TS
         trees in Householder reduction trees for QR and LQ
         factorization. N is the size of each subdomain (default: ~-1~)
       * ~-l x, --llvl=x~: Tree used for low level reduction inside
         nodes (default: ~-1~)
       * ~-L x, --hlvl=x~: Tree used for high level reduction between
         nodes, only if P > 1 (default: ~-1~). Possible values are -1:
         Automatic, 0: Flat, 1: Greedy, 2: Fibonacci, 3: Binary, 4:
         Replicated greedy.
       * ~-D, --domino~: Enable the domino between upper and lower trees
     * Advanced options
       * ~--nobigmat~: Disable the single large matrix allocation in
         favor of multiple tiled allocations
       * ~-s, --sync~: Enable synchronous calls in wrapper functions such
         as POTRI
       * ~-o, --ooc~: Enable out-of-core (available only with StarPU)
       * ~-G, --gemm3m~: Use gemm3m complex method
       * ~--bound~: Compare result to area bound

     List of timing algorithms available:
     * LANGE: norms of matrices
     * GEMM: general matrix-matrix multiply
     * TRSM: triangular solve
     * POTRF: Cholesky factorization with a symmetric
       positive-definite matrix
     * POTRI: Cholesky inversion
     * POSV: solve linear systems with symmetric positive-definite matrix
     * GETRF_NOPIV: LU factorization of a general matrix using the tile LU algorithm without row pivoting
     * GESV_NOPIV: solve linear system for a general matrix using the tile LU algorithm without row pivoting
     * GETRF_INCPIV: LU factorization of a general matrix using the tile LU algorithm with partial tile pivoting with row interchanges
     * GESV_INCPIV: solve linear system for a general matrix using the tile LU algorithm with partial tile pivoting with row interchanges
     * GEQRF: QR factorization of a general matrix
     * GELQF: LQ factorization of a general matrix
     * GEQRF_HQR: GEQRF with hierarchical tree
     * GEQRS: solve linear systems using a QR factorization
     * GELS: solves overdetermined or underdetermined linear systems involving a general matrix using the QR or the LQ factorization
     * GESVD: general matrix singular value decomposition
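   As a reading aid for the ~--n_range=Start:Stop:Step~ option listed
   above, here is a minimal C sketch of how such a range can be parsed
   and enumerated.  The helper names ~parse_range~ and ~count_sizes~
   are hypothetical and not part of Chameleon, and treating ~Stop~ as
   inclusive is an assumption.
   #+begin_src c
   #include <assert.h>
   #include <stdio.h>

   /* Hypothetical helper (not a Chameleon function): parse a
    * "Start:Stop:Step" string such as "1000:10000:1000".
    * Returns 0 on success, -1 if the string is malformed. */
   int parse_range(const char *s, int *start, int *stop, int *step)
   {
       assert(s && start && stop && step);
       if (sscanf(s, "%d:%d:%d", start, stop, step) != 3)
           return -1;
       if (*step <= 0 || *start > *stop)
           return -1;
       return 0;
   }

   /* Number of matrix sizes visited, assuming Stop is inclusive. */
   int count_sizes(int start, int stop, int step)
   {
       int count = 0;
       for (int n = start; n <= stop; n += step)
           count++;
       return count;
   }
   #+end_src
   Under this assumption, the range ~1000:10000:1000~ of the calling
   example would time the ten sizes 1000, 2000, ..., 10000.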

*** Execution trace using StarPU
    <<sec:trace>>

    StarPU can generate its own trace log files if it is compiled with
    the ~--with-fxt~ option at the configure step (you may have to
    specify the directory where you installed FxT by giving
    ~--with-fxt=...~ instead of ~--with-fxt~ alone).  Traces are then
    generated after each execution of a program which uses StarPU, in
    the directory pointed to by the STARPU_FXT_PREFIX environment
    variable.
    #+begin_example
    export STARPU_FXT_PREFIX=/home/jdoe/fxt_files/
    #+end_example
    When executing a ~./timing/...~ Chameleon program, if tracing has
    been enabled (StarPU compiled with FxT and Chameleon configured
    with *-DCHAMELEON_ENABLE_TRACING=ON*), you can give the option
    ~--trace~ to tell the program to generate trace log files.

    Finally, to generate the trace file which can be opened with the
    [[http://vite.gforge.inria.fr/][Vite]] program, you can use the *starpu_fxt_tool* executable of
    StarPU.  This tool should be in ~$STARPU_INSTALL_REPOSITORY/bin~.
    You can use it to generate the trace file like this:
    #+begin_src
    path/to/your/install/starpu/bin/starpu_fxt_tool -i prof_filename
    #+end_src
    There is one file per MPI process (prof_filename_0,
    prof_filename_1 ...).  To generate a trace of MPI programs you can
    call it like this:
    #+begin_src
    path/to/your/install/starpu/bin/starpu_fxt_tool -i prof_filename*
    #+end_src
    The trace file will be named paje.trace (use the -o option to
    specify an output name).  Alternatively, for non-MPI executions
    (only one process and one profiling file), you can set the
    environment variable *STARPU_GENERATE_TRACE=1* to automatically
    generate the paje trace file.

*** Use simulation mode with StarPU-SimGrid
    <<sec:simu>>

    Simulation mode can be activated by setting the CMake option
    CHAMELEON_SIMULATION to ON.  This mode allows you to simulate the
    execution of algorithms with StarPU compiled with [[http://simgrid.gforge.inria.fr/][SimGrid]].  To do
    so, we provide some perfmodels in the simucore/perfmodels/
    directory of the Chameleon sources.  To use these perfmodels, please
    set your *STARPU_HOME* environment variable to
    ~path/to/your/chameleon_sources/simucore/perfmodels~.  Finally, you
    need to set your *STARPU_HOSTNAME* environment variable to the name
    of the machine to simulate.  For example: *STARPU_HOSTNAME=mirage*.
    Note that only POTRF kernels with block sizes of 320 or 960
    (single and double precision) on the /mirage/ and /sirocco/
    machines are available for now.  The database of models is
    subject to change.

** Chameleon API

   Chameleon provides routines to solve dense general systems of
   linear equations, symmetric positive definite systems of linear
   equations and linear least squares problems, using LU, Cholesky, QR
   and LQ factorizations.  Real arithmetic and complex arithmetic are
   supported in both single precision and double precision.  Linear
   algebra routines are named according to the following pattern:
   #+begin_src
   CHAMELEON_name[_Tile[_Async]]
   #+end_src
   * all user routines are prefixed with *CHAMELEON*
   * in the pattern *CHAMELEON_name[_Tile[_Async]]*, /name/ follows the
     BLAS/LAPACK naming scheme for algorithms (/e.g./ sgemm for
     single-precision general matrix-matrix multiply)
   * Chameleon provides three interface levels
     * *CHAMELEON_name*: the simplest interface, very close to CBLAS and
       LAPACKE. Matrices are given following the LAPACK data layout
       (1-D array column-major).  It involves copying data from the
       LAPACK layout to the tile layout and back (to update the LAPACK
       data), see [[sec:tuto_step1][Step1]].
     * *CHAMELEON_name_Tile*: the tile interface avoids copies between
       the LAPACK and tile layouts. It is the standard interface of
       Chameleon and it should achieve better performance than the
       previous, simplest interface. The data are given through a
       specific structure called a descriptor, see [[sec:tuto_step2][Step2]].
     * *CHAMELEON_name_Tile_Async*: similar to the tile interface, but
       it avoids the synchronization barrier normally called between
       *Tile* routines.  At the end of an *Async* function, completion
       of tasks is not guaranteed and data are not necessarily
       up-to-date.  To ensure that all tasks have been executed, a
       synchronization function has to be called after the sequence
       of *Async* functions, see [[sec:tuto_step4][Step4]].

   CHAMELEON routine calls have to be preceded by
   #+begin_src
   CHAMELEON_Init( NCPU, NGPU );
   #+end_src
   to initialize CHAMELEON and the runtime system, and followed by
   #+begin_src
   CHAMELEON_Finalize();
   #+end_src
   to free some data and finalize the runtime and/or MPI.

*** Tutorial LAPACK to Chameleon
    <<sec:tuto>>

    This tutorial is dedicated to the API usage of Chameleon.  The
    idea is to start from a simple code and explain step by step how
    to use Chameleon routines.  The first step is a full BLAS/LAPACK
    code without any dependency on Chameleon, a code that most users
    should easily understand.  Then, the different interfaces
    Chameleon provides are presented, from the simplest API (step1) to
    more complicated ones (up to step4).  The way some important
    parameters are set is discussed in step5.  step6 is an example
    of distributed computation with MPI.  Finally, step7 shows how
    to let Chameleon initialize the user's data (matrices/vectors) in
    parallel.

    Source files can be found in the ~example/lapack_to_chameleon/~
    directory.  If the CMake option *CHAMELEON_ENABLE_EXAMPLE* is ON,
    the source files are compiled with the project libraries.  The
    arithmetic precision is /double/.  To execute a step
    *X*, enter the following command:
    #+begin_src
    ./stepX --option1 --option2 ...
    #+end_src
    The arguments accepted by the executables are described by the
    option ~-[-]help~ or ~-[-]h~.  Note that the options have default
    values.

    For all steps, the program solves a linear system $Ax=B$.  The
    matrix values are randomly generated but ensure that matrix $A$ is
    symmetric positive definite so that $A$ can be factorized in a
    $LL^T$ form using the Cholesky factorization.


    The different steps of the tutorial are:
    * Step0: a simple Cholesky example using the C interface of BLAS/LAPACK
    * Step1: introduces the LAPACK equivalent interface of Chameleon
    * Step2: introduces the tile interface
    * Step3: indicates how to give your own tile matrix to Chameleon
    * Step4: introduces the tile async interface
    * Step5: shows how to set some important parameters
    * Step6: introduces how to benefit from MPI in Chameleon
    * Step7: introduces how to let Chameleon initialize the user's matrix data

**** Step0
     The C interfaces of BLAS and LAPACK, that is, CBLAS and LAPACKE,
     are used to solve the system. The size of the system (matrix) and
     the number of right-hand sides can be given as arguments to the
     executable (be careful not to give huge numbers if you do not
     have an infinite amount of RAM!).  As for every step, the
     correctness of the solution is checked by calculating the norm
     $||Ax-B||/(||A||||x||+||B||)$.  The time spent in
     factorization+solve is recorded and, because we know exactly the
     number of operations of these algorithms, we deduce the number of
     operations that have been processed per second (in GFlop/s).
     The important part of the code that solves the problem is:
     #+begin_example
     /* Cholesky factorization:
      * A is replaced by its factorization L or L^T depending on uplo */
     LAPACKE_dpotrf( LAPACK_COL_MAJOR, 'U', N, A, N );
     /* Solve:
      * B is stored in X on entry, X contains the result on exit.
      * Forward ...
      */
     cblas_dtrsm(
         CblasColMajor,
         CblasLeft,
         CblasUpper,
         CblasConjTrans,
         CblasNonUnit,
         N, NRHS, 1.0, A, N, X, N);
     /* ... and back substitution */
     cblas_dtrsm(
         CblasColMajor,
         CblasLeft,
         CblasUpper,
         CblasNoTrans,
         CblasNonUnit,
         N, NRHS, 1.0, A, N, X, N);
     #+end_example
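     The two measurements described above, the scaled residual and the
     GFlop/s rate, can be sketched in plain C.  This is only an
     illustration: the function names are hypothetical, and the
     operation counts are the usual estimates for a Cholesky
     factorization ($n^3/3 + n^2/2 + n/6$) and for the two triangular
     solves ($2 n^2 nrhs$), assumed here rather than taken from the
     step0 sources.
     #+begin_src c
     #include <assert.h>

     /* Usual operation count of a Cholesky factorization of order n. */
     double potrf_flops(double n)
     {
         return n * n * n / 3.0 + n * n / 2.0 + n / 6.0;
     }

     /* Usual operation count of the forward and back substitutions. */
     double potrs_flops(double n, double nrhs)
     {
         return 2.0 * n * n * nrhs;
     }

     /* Rate in GFlop/s for a factorization+solve that took `seconds`. */
     double gflops(double n, double nrhs, double seconds)
     {
         assert(seconds > 0.0);
         return (potrf_flops(n) + potrs_flops(n, nrhs)) / seconds * 1e-9;
     }

     /* Scaled residual ||Ax-B|| / (||A|| ||x|| + ||B||), computed from
      * norms that the caller has already evaluated. */
     double scaled_residual(double rnorm, double anorm,
                            double xnorm, double bnorm)
     {
         assert(anorm * xnorm + bnorm > 0.0);
         return rnorm / (anorm * xnorm + bnorm);
     }
     #+end_src
     Which matrix norm is used for the check is left to the caller in
     this sketch; a small residual relative to machine precision
     indicates a correct solution.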

**** Step1
     <<sec:tuto_step1>>

     This step introduces the simplest Chameleon interface, which is
     equivalent to CBLAS/LAPACKE.  The code is very similar to step0
     but instead of calling CBLAS/LAPACKE functions, we call the
     equivalent Chameleon functions.  The solving code becomes:
     #+begin_example
     /* Factorization: */
     CHAMELEON_dpotrf( UPLO, N, A, N );
     /* Solve: */
     CHAMELEON_dpotrs(UPLO, N, NRHS, A, N, X, N);
     #+end_example
     The API is almost the same, so it is easy to use for beginners.
     It is important to keep in mind that before any call to CHAMELEON routines,
     *CHAMELEON_Init* has to be invoked to initialize CHAMELEON and the runtime system.
     Example:
     #+begin_example
     CHAMELEON_Init( NCPU, NGPU );
     #+end_example
     After all CHAMELEON calls have been done, a call to *CHAMELEON_Finalize* is
     required to free some data and finalize the runtime and/or MPI.
     #+begin_example
     CHAMELEON_Finalize();
     #+end_example
     We use CHAMELEON routines with the LAPACK interface, which means the
     routines accept the same matrix format as LAPACK (1-D array
     column-major).  Note that we copy the matrix to get it in our own
     tile structures, see details about this format in [[sec:tile][Tile Data Layout]].
     This means the copies can introduce an overhead.
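     To make the source of this overhead concrete, here is a minimal
     sketch of a copy from a column-major LAPACK array to a tile
     layout.  It is not Chameleon's implementation: it assumes that
     ~n~ is a multiple of the tile size ~nb~, that tiles are stored
     contiguously one after another (tile column by tile column), and
     that each tile is itself column-major.  The function name
     ~lapack_to_tile_sketch~ is hypothetical.
     #+begin_src c
     #include <assert.h>
     #include <stddef.h>

     /* Illustrative only, not the Chameleon routine: copy the n-by-n
      * column-major matrix A (leading dimension lda) into T, where the
      * nb-by-nb tiles are stored contiguously, tile column by tile
      * column.  Assumes n is a multiple of nb. */
     void lapack_to_tile_sketch(const double *A, size_t lda,
                                double *T, size_t n, size_t nb)
     {
         assert(nb > 0 && n % nb == 0 && lda >= n);
         size_t mt = n / nb;   /* number of tiles in one tile column */
         for (size_t j = 0; j < n; j++) {
             for (size_t i = 0; i < n; i++) {
                 size_t tile = (i / nb) + (j / nb) * mt; /* which tile */
                 size_t in   = (i % nb) + (j % nb) * nb; /* offset in tile */
                 T[tile * nb * nb + in] = A[i + j * lda];
             }
         }
     }
     #+end_src
     Every element is read and written once, so a conversion of this
     kind costs memory traffic proportional to the matrix size each
     time data move between the two layouts.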

**** Step2
     <<sec:tuto_step2>>

     This program is a copy of step1, but instead of using the LAPACK
     interface, which leads to copying LAPACK matrices inside CHAMELEON
     routines, we use the tile interface.  We still use the standard
     matrix format, but we will see how to give this matrix to create
     a CHAMELEON descriptor, a structure wrapping the data on which we
     want to apply sequential task-based algorithms.
     The solving code becomes:
     #+begin_example
     /* Factorization: */
     CHAMELEON_dpotrf_Tile( UPLO, descA );
     /* Solve: */
     CHAMELEON_dpotrs_Tile( UPLO, descA, descX );
     #+end_example
     To use the tile interface, a specific structure *CHAM_desc_t* must be
     created.
     This can be achieved in different ways.
     1. Use the existing function *CHAMELEON_Desc_Create*: the matrix
        data are considered contiguous in memory, as in PLASMA
        ([[sec:tile][Tile Data Layout]]).
     2. Use the existing function *CHAMELEON_Desc_Create_OOC*: the
        matrix data are allocated on demand in memory, tile by tile, and
        possibly pushed to disk if they do not fit in memory.
     3. Use the existing function *CHAMELEON_Desc_Create_User*: it is more
        flexible than *Desc_Create* because you can give your own way to
        access tile data so that your tiles can be allocated
        wherever you want in memory, see next paragraph [[sec:tuto_step3][Step3]].
     4. Create your own function to fill the descriptor.  If you
        understand well the meaning of each item of *CHAM_desc_t*, you
        should be able to fill the structure correctly.

     In Step2, we use the first way to create the descriptor:
     #+begin_example
     CHAMELEON_Desc_Create(&descA, NULL, ChamRealDouble,
                           NB, NB, NB*NB, N, N,
                           0, 0, N, N,
                           1, 1);
     #+end_example
     * *descA* is the descriptor to create.
     * The second argument is a pointer to existing data. The existing
       data must follow the LAPACK/PLASMA matrix layout [[sec:tile][Tile Data Layout]]
       (1-D array column-major) if *CHAMELEON_Desc_Create* is used to create
       the descriptor. The *CHAMELEON_Desc_Create_User* function can be used
       if you have data organized differently. This is discussed in
       the next paragraph [[sec:tuto_step3][Step3]].  Giving a *NULL* pointer means you let
       the function allocate memory space.  You then have to copy your
       data into the memory allocated by *Desc_Create*.  This can be
       done with
       #+begin_example
       CHAMELEON_Lapack_to_Tile(A, N, descA);
       #+end_example
     * The third argument of *Desc_Create* is the datatype (used for
       memory allocation).
     * The fourth to eighth arguments stand for, respectively, the
       number of rows (~NB~) and columns (~NB~) in each tile, the total
       number of values in a tile (~NB*NB~), and the number of rows (~N~)
       and columns (~N~) in the entire matrix.
     * The ninth to twelfth arguments stand for, respectively, the
       first row (0) and column (0) indexes of the submatrix and the
       number of rows (N) and columns (N) in the submatrix.  These
       arguments are only needed in specific cases.  If you do not
       consider submatrices, just use 0, 0, NROWS, NCOLS.
     * The last two arguments are the parameters of the 2-D block-cyclic
       distribution grid, see [[http://www.netlib.org/scalapack/slug/node75.html][ScaLAPACK]].  To be able to use another data
       distribution over the nodes, the *CHAMELEON_Desc_Create_User*
       function should be used.
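
     The following sketch illustrates how the tile size, the matrix
     size and the 2-D block-cyclic grid parameters passed to
     *Desc_Create* relate to each other.  It is not Chameleon code:
     the helper names are hypothetical, and the row-by-row numbering
     of processes in the P x Q grid is an assumption made for
     illustration.
     #+begin_src c
     #include <assert.h>

     /* Number of tiles needed to cover n rows (or columns) with tiles
      * of size nb; the last tile may be only partially filled. */
     int tile_count(int n, int nb)
     {
         assert(n >= 0 && nb > 0);
         return (n + nb - 1) / nb;
     }

     /* Owner of tile (m, n) in a 2-D block-cyclic distribution over a
      * P x Q process grid, assuming processes are numbered row by row. */
     int tile_owner(int m, int n, int P, int Q)
     {
         assert(m >= 0 && n >= 0 && P > 0 && Q > 0);
         return (m % P) * Q + (n % Q);
     }
     #+end_src
     With the ~1, 1~ grid used in Step2, every tile is owned by
     process 0, which corresponds to a non-distributed run.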

**** Step3
     <<sec:tuto_step3>>

     This program makes use of the same interface as Step2 (tile
     interface) but does not allocate LAPACK matrices anymore, so that
     no copy between the LAPACK matrix layout and the tile matrix
     layout is necessary to call CHAMELEON routines.  To generate
     random right-hand sides you can use:
     #+begin_example
     /* Allocate memory and initialize descriptor B */
     CHAMELEON_Desc_Create(&descB,  NULL, ChamRealDouble,
                           NB, NB,  NB*NB, N, NRHS,
                           0, 0, N, NRHS, 1, 1);
     /* generate RHS with random values */
     CHAMELEON_dplrnt_Tile( descB, 5673 );
     #+end_example
     The other important point is that it is possible to create a
     descriptor, the structure CHAMELEON needs to work efficiently, by
     giving your own pointer to tiles if your matrix is not organized
     as a 1-D column-major array.  This can be achieved with the
     *CHAMELEON_Desc_Create_User* routine.  Here is an example:
     #+begin_example
     CHAMELEON_Desc_Create_User(&descA, matA, ChamRealDouble,
                            NB, NB, NB*NB, N, N,
                            0, 0, N, N, 1, 1,
                            user_getaddr_arrayofpointers,
                            user_getblkldd_arrayofpointers,
                            user_getrankof_zero);
     #+end_example
     The first arguments are the same as for the *CHAMELEON_Desc_Create*
     routine.  The following arguments let you give pointers to
     functions that manage the access to tiles from the structure given
     as second argument.  Here for example, *matA* is an array
     containing addresses of tiles, see the function
     *allocate_tile_matrix* defined in step3.h.  The three functions
     you have to define for *Desc_Create_User* are:
     * a function that returns the address of tile $A(m,n)$, m and n
       standing for the indexes of the tile in the global matrix.  Let
       us consider a 4x4 matrix with tile size 2x2: the matrix
       contains four tiles of indexes $A(m=0,n=0)$, $A(m=0,n=1)$,
       $A(m=1,n=0)$, $A(m=1,n=1)$
     * a function that returns the leading dimension of tile $A(m,*)$
     * a function that returns the MPI rank of tile $A(m,n)$

     Examples of these functions are visible in step3.h.  Note that
     the way we define these functions is related to the tile matrix
     format and to the data distribution considered.  This example
     should not be used with MPI since all tiles are assigned to
     process 0, which means a large amount of data will potentially
     be transferred between nodes.
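
     The three callbacks can be sketched as follows.  This is only an
     illustration: ~toy_desc_t~ is a hypothetical stand-in for
     *CHAM_desc_t* with just the fields needed here, the ~toy_~
     function names are made up, and indexing the array of tile
     pointers column by column is an assumption (step3.h may order
     them differently).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the descriptor; the real CHAM_desc_t has
 * more fields and a different layout. */
typedef struct {
    void *mat;   /* user structure: here an array of pointers to tiles */
    int   mb;    /* rows per tile */
    int   mt;    /* number of tile rows */
    int   nt;    /* number of tile columns */
} toy_desc_t;

/* Address of tile A(m,n): tile pointers are indexed column by column
 * (an assumption of this sketch). */
void *toy_getaddr_arrayofpointers(const toy_desc_t *A, int m, int n)
{
    assert(m >= 0 && m < A->mt && n >= 0 && n < A->nt);
    double **tiles = (double **)A->mat;
    return tiles[(size_t)n * A->mt + m];
}

/* Leading dimension of the tiles in tile row m: each tile is stored
 * on its own, so it is simply the tile height. */
int toy_getblkldd_arrayofpointers(const toy_desc_t *A, int m)
{
    (void)m;
    return A->mb;
}

/* Owner of tile A(m,n): every tile lives on MPI rank 0. */
int toy_getrankof_zero(const toy_desc_t *A, int m, int n)
{
    (void)A; (void)m; (void)n;
    return 0;
}
```

     For the 4x4 matrix with 2x2 tiles mentioned above, ~mt~ and ~nt~
     are both 2 and the array holds four tile pointers.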

**** Step4
     <<sec:tuto_step4>>

     This program is a copy of step2 but instead of using the tile
     interface, it uses the tile async interface.  The goal is to
     exhibit the runtime synchronization barriers.  Keep in mind that
     when a tile interface routine such as *CHAMELEON_dpotrf_Tile* is
     called, a synchronization function, waiting for the actual
     execution and termination of all tasks, is called to ensure the
     proper completion of the algorithm (i.e. data are up-to-date).
     The code shows how to exploit the async interface to pipeline
     subsequent algorithms so that fewer synchronizations are done.
     The code becomes:
     #+begin_example
     /* Cham structure containing parameters and a structure to interact with
      * the Runtime system */
     CHAM_context_t *chamctxt;
     /* CHAMELEON sequence uniquely identifies a set of asynchronous function calls
      * sharing common exception handling */
     RUNTIME_sequence_t *sequence = NULL;
     /* CHAMELEON request uniquely identifies each asynchronous function call */
     RUNTIME_request_t request = CHAMELEON_REQUEST_INITIALIZER;
     int status;

     ...

     chameleon_sequence_create(chamctxt, &sequence);

     /* Factorization: */
     CHAMELEON_dpotrf_Tile_Async( UPLO, descA, sequence, &request );

     /* Solve: */
     CHAMELEON_dpotrs_Tile_Async( UPLO, descA, descX, sequence, &request);

     /* Synchronization barrier (the runtime ensures that all submitted tasks
      * have been terminated) */
     RUNTIME_barrier(chamctxt);
     /* Ensure that all data processed on the gpus we are depending on are back
      * in main memory */
     RUNTIME_desc_getoncpu(descA);
     RUNTIME_desc_getoncpu(descX);

     status = sequence->status;
     #+end_example

     Here the sequence of *dpotrf* and *dpotrs* algorithms is processed
     without synchronization so that some tasks of *dpotrf* and *dpotrs*
     can be executed concurrently, which could increase performance.
     The async interface is very similar to the tile one.  It is only
     necessary to give two new objects, *RUNTIME_sequence_t* and
     *RUNTIME_request_t*, used to handle asynchronous function calls.

     #+CAPTION: POTRI (POTRF, TRTRI, LAUUM) algorithm with and without synchronization barriers, courtesy of the [[http://icl.cs.utk.edu/plasma/][PLASMA]] team.
     #+NAME: fig:potri_async
     #+ATTR_HTML: :width 640px :align center
     [[file:potri_async.png]]
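
     The idea of a sequence as a set of calls "sharing common
     exception handling" can be pictured with the toy model below.
     It is purely illustrative: ~toy_sequence_t~ and ~toy_submit~ are
     hypothetical, and the rule that calls submitted after a failure
     are rejected is an assumption of this sketch, not a statement
     about how *RUNTIME_sequence_t* is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not the RUNTIME_sequence_t type itself. */
typedef struct {
    int status;     /* 0 while every call of the sequence succeeded */
    int submitted;  /* number of calls accepted so far */
} toy_sequence_t;

/* Pretend to submit an asynchronous call whose outcome is `result`
 * (0 for success).  Once the sequence has failed, later calls are
 * rejected without doing any work (assumption of this sketch). */
int toy_submit(toy_sequence_t *seq, int result)
{
    assert(seq != NULL);
    if (seq->status != 0)
        return seq->status;      /* sequence already in error */
    seq->submitted++;
    if (result != 0)
        seq->status = result;    /* the first error is recorded */
    return result;
}
```

     With such a model the caller submits the whole pipeline and
     inspects the status a single time after the barrier, as done
     with ~sequence->status~ in the Step4 code.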

**** Step5
     <<sec:tuto_step5>>

     Step5 shows how to set some important parameters.  This program
     is a copy of Step4 but some additional parameters are given by
     the user.  The parameters that can be set are:
     * number of threads
     * number of GPUs

       The number of workers can be given as argument
       to the executable with ~--threads=~ and ~--gpus=~ options.  It is
       important to notice that we assign one thread per GPU to
       optimize data transfer between main memory and device memory.
       The number of workers of each type, CPU and CUDA,
       must be given at *CHAMELEON_Init*.
       #+begin_example
       if ( iparam[IPARAM_THRDNBR] == -1 ) {
           get_thread_count( &(iparam[IPARAM_THRDNBR]) );
           /* reserve one thread per cuda device to optimize memory transfers */
           iparam[IPARAM_THRDNBR] -= iparam[IPARAM_NCUDAS];
       }
       NCPU = iparam[IPARAM_THRDNBR];
       NGPU = iparam[IPARAM_NCUDAS];
       /* initialize CHAMELEON with main parameters */
       CHAMELEON_Init( NCPU, NGPU );
       #+end_example

     * matrix size
     * number of right-hand sides
     * block (tile) size

       The problem size is given with ~--n=~ and ~--nrhs=~ options.  The
       tile size is given with option ~--nb=~.  These parameters are
       required to create descriptors.  The tile size NB is a key
       parameter for performance since it defines the granularity
       of tasks.  If NB is too large compared to N, there are few
       tasks to schedule.  If the number of workers is large, this
       limits parallelism.  On the contrary, if NB is too
       small (/i.e./ many small tasks), workers may not be correctly
       fed and the runtime system operations could represent a
       substantial overhead.  A trade-off has to be found depending on
       many parameters: problem size, algorithm (which drives data
       dependencies), architecture (number of workers, worker speed,
       worker uniformity, memory bus speed).  By default it is set
       to 128.  Do not hesitate to play with this parameter and
       compare performance on your machine.

     * inner-blocking size

        The inner-blocking size is given with option ~--ib=~.
        This parameter is used by kernels (optimized algorithms applied on tiles) to
        perform subsequent operations with a data block size that fits the cache of
        workers.
        Parameters NB and IB can be set with the *CHAMELEON_Set* function:
        #+begin_example
        CHAMELEON_Set(CHAMELEON_TILE_SIZE,        iparam[IPARAM_NB] );
        CHAMELEON_Set(CHAMELEON_INNER_BLOCK_SIZE, iparam[IPARAM_IB] );
        #+end_example
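
     The effect of NB on granularity can be estimated with a few
     lines of C.  The sketch below counts the kernel calls of the
     usual tile Cholesky factorization (POTRF, TRSM, SYRK, GEMM) as a
     function of the number of tile rows.  The ~toy_~ function names
     are hypothetical and the count is a back-of-the-envelope model,
     not something reported by CHAMELEON.

```c
#include <assert.h>

/* Number of tiles along one dimension of an n-by-n matrix. */
long toy_tile_count(long n, long nb)
{
    assert(n > 0 && nb > 0);
    return (n + nb - 1) / nb;   /* ceiling of n / nb */
}

/* Number of kernel calls in a tile Cholesky factorization on an
 * nt-by-nt tile matrix: nt POTRF, nt(nt-1)/2 TRSM, nt(nt-1)/2 SYRK
 * and nt(nt-1)(nt-2)/6 GEMM. */
long toy_cholesky_tasks(long nt)
{
    assert(nt > 0);
    return nt + nt * (nt - 1) + nt * (nt - 1) * (nt - 2) / 6;
}
```

     With N=10000, ~toy_tile_count~ gives 32 tile rows for NB=320 and
     79 for NB=128, that is 5984 and 85320 kernel calls respectively:
     a smaller tile gives many more, but much smaller, tasks.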

**** Step6
     <<sec:tuto_step6>>

     This program is a copy of Step5 with some additional parameters
     to be set for the data distribution.  To use this program
     properly, CHAMELEON must use the StarPU runtime system and the MPI
     option must be activated at configure time.  The data distribution
     used here is 2-D block-cyclic, see for example [[http://www.netlib.org/scalapack/slug/node75.html][ScaLAPACK]] for an
     explanation.  The user can enter the parameters of the
     distribution grid at execution with the ~--p=~ option.  Example
     using OpenMPI on four nodes with one process per node:
     #+begin_example
     mpirun -np 4 ./step6 --n=10000 --nb=320 --ib=64 --threads=8 --gpus=2 --p=2
     #+end_example

     In this program we use the tile data layout from PLASMA so that the call
     #+begin_example
     CHAMELEON_Desc_Create_User(&descA, NULL, ChamRealDouble,
                            NB, NB, NB*NB, N, N,
                            0, 0, N, N,
                            GRID_P, GRID_Q,
                            chameleon_getaddr_ccrb,
                            chameleon_getblkldd_ccrb,
                            chameleon_getrankof_2d);
     #+end_example
     is equivalent to the following call

     #+begin_example
     CHAMELEON_Desc_Create(&descA, NULL, ChamRealDouble,
                       NB, NB, NB*NB, N, N,
                       0, 0, N, N,
                       GRID_P, GRID_Q);
     #+end_example
     the functions *chameleon_getaddr_ccrb*, *chameleon_getblkldd_ccrb*,
     *chameleon_getrankof_2d* being used in *Desc_Create*.  It is interesting
     to notice that the code is almost the same as Step5.  The only
     additional information to give is the way tiles are distributed,
     through the third function given to *CHAMELEON_Desc_Create_User*.
     Here, because we have made experiments only with a 2-D
     block-cyclic distribution, we have parameters P and Q in the
     interface of *Desc_Create*, but they make sense only for a 2-D
     block-cyclic distribution, that is, when using the
     *chameleon_getrankof_2d* function.  Of course they could be used
     with other distributions, in which case they would no longer be
     the parameters of a 2-D block-cyclic grid but those of another
     distribution.
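
     As an illustration of what a rank function computes for a 2-D
     block-cyclic distribution, here is a minimal sketch.  The name
     ~toy_getrankof_2d~ is hypothetical, the function takes the grid
     sizes directly instead of a descriptor, and it assumes that
     ranks are numbered row by row over the P x Q process grid and
     that no submatrix offset is involved.

```c
#include <assert.h>

/* Rank owning tile (m, n) on a p-by-q process grid, assuming ranks
 * are numbered row by row over the grid. */
int toy_getrankof_2d(int m, int n, int p, int q)
{
    assert(p > 0 && q > 0 && m >= 0 && n >= 0);
    return (m % p) * q + (n % q);
}
```

     On a 2 x 2 grid the tiles of the first two tile rows and columns
     go to ranks 0, 1, 2 and 3, and the pattern then repeats
     cyclically in both directions.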

**** Step7
     <<sec:tuto_step7>>

     This program is a copy of step6 with some additional calls to
     build a matrix from within Chameleon using a function provided by
     the user.  This can be seen as a replacement for functions like
     *CHAMELEON_dplgsy_Tile()*, which can be used to fill the matrix with
     random data, *CHAMELEON_dLapack_to_Tile()*, which fills the matrix
     with data stored in a LAPACK-like buffer, or
     *CHAMELEON_Desc_Create_User()*, which can be used to describe an
     arbitrary tile matrix structure.  In this example, the build
     callback functions are just wrappers around *CORE_xxx()* functions,
     so the output of the program step7 should be exactly the same as
     that of step6.  The difference is that the function used to fill
     the tiles is provided by the user, and therefore this approach is
     much more flexible.

     The new function to understand is *CHAMELEON_dbuild_Tile*, e.g.
     #+begin_example
     struct data_pl data_A={(double)N, 51, N};
     CHAMELEON_dbuild_Tile(ChamUpperLower, descA, (void*)&data_A, Cham_build_callback_plgsy);
     #+end_example

     The idea here is to let Chameleon fill the matrix data in a
     task-based fashion (parallel) by using a function given by the
     user.  First, the user should define whether all the blocks must
     be entirely filled or just the upper/lower part with, /e.g./,
     ChamUpperLower.  We still rely on the same structure
     *CHAM_desc_t*, which must be initialized with the proper
     parameters, by calling for example *CHAMELEON_Desc_Create*.  Then, an
     opaque pointer is used to let the user give some extra data used
     by their function.  The last parameter is the pointer to the user's
     function.
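
     A user build function could look like the sketch below, which
     fills one block from its global row and column ranges.  Every
     name here is hypothetical, and the argument list is an
     assumption made for illustration: check step7.h and the doxygen
     documentation for the callback prototype actually expected by
     *CHAMELEON_dbuild_Tile*.

```c
#include <assert.h>

/* Hypothetical user data passed through the opaque pointer. */
typedef struct {
    double bump;   /* value added on the diagonal */
} toy_build_data_t;

/* Fill the block covering global rows row_min..row_max and columns
 * col_min..col_max (inclusive).  buffer is column-major with leading
 * dimension ld.  Entry (i,j) is 1/(i+j+1), plus bump on the diagonal.
 * The argument order is an assumption of this sketch. */
void toy_build_callback(const void *user_data,
                        int row_min, int row_max,
                        int col_min, int col_max,
                        double *buffer, int ld)
{
    const toy_build_data_t *d = (const toy_build_data_t *)user_data;
    assert(row_min <= row_max && col_min <= col_max);
    assert(ld >= row_max - row_min + 1);
    for (int j = col_min; j <= col_max; j++) {
        for (int i = row_min; i <= row_max; i++) {
            double v = 1.0 / (double)(i + j + 1);
            if (i == j)
                v += d->bump;
            buffer[(j - col_min) * ld + (i - row_min)] = v;
        }
    }
}
```

     Because each call only depends on the index ranges it receives,
     the blocks can be filled independently, which is what makes a
     task-based (parallel) fill possible.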

*** List of available routines
**** Linear Algebra routines

     We list the linear algebra routines of the form
     *CHAMELEON_name[_Tile[_Async]]* (/name/ follows the LAPACK naming scheme, see
     http://www.netlib.org/lapack/lug/node24.html) that can be used
     with the Chameleon library.  For details about these functions
     please refer to the doxygen documentation.  /name/ can be one of the
     following:

     * *BLAS 2/3 routines*
       * gemm: matrix matrix multiply and addition
       * hemm: gemm with A Hermitian
       * herk: rank k operations with A Hermitian
       * her2k: rank 2k operations with A Hermitian
       * lauum: computes the product U * U' or L' * L, where the
         triangular factor U or L is stored in the upper or lower
         triangular part of the array A
       * symm: gemm with A symmetric
       * syrk: rank k operations with A symmetric
       * syr2k: rank 2k with A symmetric
       * trmm: gemm with A triangular
     * *Triangular solving routines*
       * trsm: computes triangular solve
       * trsmpl: performs the forward substitution step of solving a
         system of linear equations after the tile LU factorization of
         the matrix
       * trsmrv:
       * trtri: computes the inverse of a complex upper or lower triangular matrix A
     * *LL' (Cholesky) routines*
       * posv: linear systems solving using Cholesky factorization
       * potrf: Cholesky factorization
       * potri: computes the inverse of a complex Hermitian positive
         definite matrix A using the Cholesky factorization of A
       * potrimm:
       * potrs: linear systems solving using existing Cholesky
         factorization
       * sysv: linear systems solving using Cholesky decomposition with
         A symmetric
       * sytrf: Cholesky decomposition with A symmetric
       * sytrs: linear systems solving using existing Cholesky
         decomposition with A symmetric
     * *LU routines*
       * gesv_incpiv: linear systems solving with LU factorization and
         partial pivoting
       * gesv_nopiv: linear systems solving with LU factorization and
         without pivoting
       * getrf_incpiv: LU factorization with partial pivoting
       * getrf_nopiv: LU factorization without pivoting
       * getrs_incpiv: linear systems solving using existing LU
         factorization with partial pivoting
       * getrs_nopiv: linear systems solving using existing LU
         factorization without pivoting
     * *QR/LQ routines*
       * gelqf: LQ factorization
       * gelqf_param: gelqf with hqr
       * gelqs: computes a minimum-norm solution min || A*X - B || using
         the LQ factorization
       * gelqs_param: gelqs with hqr
       * gels: uses QR or LQ factorization to solve an overdetermined or
         underdetermined linear system with full rank matrix
       * gels_param: gels with hqr
       * geqrf: QR factorization
       * geqrf_param: geqrf with hqr
       * geqrs: computes a minimum-norm solution min || A*X - B || using
         the QR factorization
       * hetrd: reduces a complex Hermitian matrix A to real symmetric
         tridiagonal form S
       * geqrs_param: geqrs with hqr
       * tpgqrt: generates a partial Q matrix formed with a blocked QR
         factorization of a "triangular-pentagonal" matrix C, which is
         composed of an unused triangular block and a pentagonal block V,
         using the compact representation for Q. See tpqrt to
         generate V
       * tpqrt: computes a blocked QR factorization of a
         "triangular-pentagonal" matrix C, which is composed of a
         triangular block A and a pentagonal block B, using the compact
         representation for Q
       * unglq: generates an M-by-N matrix Q with orthonormal rows,
         which is defined as the first M rows of a product of the
         elementary reflectors returned by CHAMELEON_zgelqf
       * unglq_param: unglq with hqr
       * ungqr: generates an M-by-N matrix Q with orthonormal columns,
         which is defined as the first N columns of a product of the
         elementary reflectors returned by CHAMELEON_zgeqrf
       * ungqr_param: ungqr with hqr
       * unmlq: overwrites C with Q*C or C*Q, or the equivalent
         operations with the conjugate transpose of Q (see doxygen
         documentation)
       * unmlq_param: unmlq with hqr
       * unmqr: similar to unmlq (see doxygen documentation)
       * unmqr_param: unmqr with hqr
     * *EVD/SVD*
       * gesvd: singular value decomposition
       * heevd: eigenvalues/eigenvectors computation with A Hermitian
     * *Extra routines*
       * *Norms*
         * lange: compute norm of a matrix (Max, One, Inf, Frobenius)
         * lanhe: lange with A Hermitian
         * lansy: lange with A symmetric
         * lantr: lange with A triangular
       * *Random matrices generation*
         * plghe: generate a random Hermitian matrix
         * plgsy: generate a random symmetric matrix
         * plrnt: generate a random matrix
       * *Others*
         * geadd: general matrix matrix addition
         * lacpy: copy matrix into another
         * lascal: scale a matrix
         * laset: copy the triangular part of a matrix into another, set a
           value for the diagonal and off-diagonal part
         * tradd: trapezoidal matrices addition
       * *Map functions*
         * map: apply a user operator on each tile of the matrix

**** Options routines
     Enable a CHAMELEON feature.
     #+begin_src
     int CHAMELEON_Enable  (CHAMELEON_enum option);
     #+end_src
     Features that can be enabled:
     * *CHAMELEON_WARNINGS*:   printing of warning messages,
     * *CHAMELEON_AUTOTUNING*: autotuning for tile size and inner block size,
     * *CHAMELEON_PROFILING_MODE*:  activate kernels profiling,
     * *CHAMELEON_PROGRESS*:  to print a progress status,
     * *CHAMELEON_GEMM3M*: to enable the use of the /gemm3m/ BLAS function.

     Disable a CHAMELEON feature.
     #+begin_src
     int CHAMELEON_Disable (CHAMELEON_enum option);
     #+end_src
     Symmetric to *CHAMELEON_Enable*.

     Set a CHAMELEON parameter.
     #+begin_src
     int CHAMELEON_Set     (CHAMELEON_enum param, int  value);
     #+end_src
     Parameters that can be set:
     * *CHAMELEON_TILE_SIZE*:        size of a matrix tile,
     * *CHAMELEON_INNER_BLOCK_SIZE*: size of a tile inner block,
     * *CHAMELEON_HOUSEHOLDER_MODE*: type of Householder trees (FLAT or TREE),
     * *CHAMELEON_HOUSEHOLDER_SIZE*: size of the groups in Householder trees,
     * *CHAMELEON_TRANSLATION_MODE*: related to *CHAMELEON_Lapack_to_Tile*, see ztile.c.

     Get the value of a CHAMELEON parameter.
     #+begin_src
     int CHAMELEON_Get     (CHAMELEON_enum param, int *value);
     #+end_src

**** Auxiliary routines

     Reports the CHAMELEON version number.
     #+begin_src
     int CHAMELEON_Version        (int *ver_major, int *ver_minor, int *ver_micro);
     #+end_src

     Initialize CHAMELEON: initialize some parameters, initialize the runtime and/or MPI.
     #+begin_src
     int CHAMELEON_Init           (int nworkers, int ncudas);
     #+end_src

     Finalize CHAMELEON: free some data and finalize the runtime and/or MPI.
     #+begin_src
     int CHAMELEON_Finalize       (void);
     #+end_src

     Suspend the CHAMELEON runtime from polling for new tasks, to avoid useless CPU
     consumption when no tasks have to be executed by the CHAMELEON runtime system.
     #+begin_src
     int CHAMELEON_Pause          (void);
     #+end_src

     Symmetrical call to CHAMELEON_Pause, used to resume the workers polling for new tasks.
     #+begin_src
     int CHAMELEON_Resume         (void);
     #+end_src

     Return the MPI rank of the calling process.
     #+begin_src
     int CHAMELEON_My_Mpi_Rank    (void);
     #+end_src

     Return the size of the distributed computation.
     #+begin_src
     int CHAMELEON_Comm_size( int *size );
     #+end_src

     Return the rank of the distributed computation.
     #+begin_src
     int CHAMELEON_Comm_rank( int *rank );
     #+end_src

     Prepare the distributed processes for computation.
     #+begin_src
     int CHAMELEON_Distributed_start(void);
     #+end_src

     Clean the distributed processes after computation.
     #+begin_src
     int CHAMELEON_Distributed_stop(void);
     #+end_src

     Return the number of CPU workers initialized by the runtime.
     #+begin_src
     int CHAMELEON_GetThreadNbr();
     #+end_src

     Conversion from LAPACK layout to tile layout.
     #+begin_src
     int CHAMELEON_Lapack_to_Tile (void *Af77, int LDA, CHAM_desc_t *A);
     #+end_src

     Conversion from tile layout to LAPACK layout.
     #+begin_src
     int CHAMELEON_Tile_to_Lapack (CHAM_desc_t *A, void *Af77, int LDA);
     #+end_src

**** Descriptor routines

     Create matrix descriptor, internal function.
     #+begin_src
     int CHAMELEON_Desc_Create(CHAM_desc_t **desc, void *mat, cham_flttype_t dtyp,
                           int mb, int nb, int bsiz, int lm, int ln,
                           int i, int j, int m, int n, int p, int q);
     #+end_src

     Create matrix descriptor, user function.
     #+begin_src
     int CHAMELEON_Desc_Create_User(CHAM_desc_t **desc, void *mat, cham_flttype_t dtyp,
                                int mb, int nb, int bsiz, int lm, int ln,
                                int i, int j, int m, int n, int p, int q,
                                void* (*get_blkaddr)( const CHAM_desc_t*, int, int),
                                int (*get_blkldd)( const CHAM_desc_t*, int ),
                                int (*get_rankof)( const CHAM_desc_t*, int, int ));
     #+end_src

     Create matrix descriptor for a tiled matrix which may not fit
     in memory.
     #+begin_src
     int CHAMELEON_Desc_Create_OOC(CHAM_desc_t **descptr, cham_flttype_t dtyp, int mb, int nb, int bsiz,
                               int lm, int ln, int i, int j, int m, int n, int p, int q);
     #+end_src

     User's function version of CHAMELEON_Desc_Create_OOC.
     #+begin_src
     int CHAMELEON_Desc_Create_OOC_User(CHAM_desc_t **descptr, cham_flttype_t dtyp, int mb, int nb, int bsiz,
                                    int lm, int ln, int i, int j, int m, int n, int p, int q,
                                    int (*get_rankof)( const CHAM_desc_t*, int, int ));
     #+end_src

     Destroys matrix descriptor.
     #+begin_src
     int CHAMELEON_Desc_Destroy (CHAM_desc_t **desc);
     #+end_src

     Ensures that all data of the descriptor are up-to-date.
     #+begin_src
     int CHAMELEON_Desc_Acquire (CHAM_desc_t  *desc);
     #+end_src

     Release the data of the descriptor acquired by the
     application.  Should be called if CHAMELEON_Desc_Acquire has been
     called on the descriptor and if you do not need to access its
     data anymore.
     #+begin_src
     int CHAMELEON_Desc_Release (CHAM_desc_t  *desc);
     #+end_src

     Ensure that all data are up-to-date in main memory (even if some
     tasks have been processed on GPUs).
     #+begin_src
     int CHAMELEON_Desc_Flush(CHAM_desc_t  *desc, RUNTIME_sequence_t *sequence);
     #+end_src

     Set the sizes for the MPI tags.  Default value: tag_width=31,
     tag_sep=24, meaning that the MPI tag is stored in 31 bits, with
     24 bits for the tile tag and 7 for the descriptor.  This function
     must be called before any descriptor creation.
     #+begin_src
     void CHAMELEON_user_tag_size(int user_tag_width, int user_tag_sep);
     #+end_src
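
     The split of the tag into a tile part and a descriptor part can
     be pictured with the sketch below.  ~toy_make_tag~ is a
     hypothetical helper, and placing the descriptor identifier in the
     high bits is an assumption of this sketch; only the bit counts
     (31 bits in total, 24 for the tile, 7 for the descriptor) come
     from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a descriptor id and a tile id into one tag of tag_width bits,
 * with the low tag_sep bits holding the tile id.  Putting the
 * descriptor id in the high bits is an assumption of this sketch. */
int32_t toy_make_tag(int tag_width, int tag_sep,
                     uint32_t desc_id, uint32_t tile_id)
{
    assert(tag_width > 0 && tag_width <= 31);
    assert(tag_sep > 0 && tag_sep < tag_width);
    assert(tile_id < (UINT32_C(1) << tag_sep));
    assert(desc_id < (UINT32_C(1) << (tag_width - tag_sep)));
    return (int32_t)((desc_id << tag_sep) | tile_id);
}
```

     With the default values this leaves room for 128 descriptors of
     up to 16777216 tiles each, which shows why the sizes may need to
     be changed for applications that create many descriptors.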

**** Sequences routines

     Create a sequence.
     #+begin_src
     int CHAMELEON_Sequence_Create  (RUNTIME_sequence_t **sequence);
     #+end_src

     Destroy a sequence.
     #+begin_src
     int CHAMELEON_Sequence_Destroy (RUNTIME_sequence_t *sequence);
     #+end_src

     Wait for the completion of a sequence.
     #+begin_src
     int CHAMELEON_Sequence_Wait    (RUNTIME_sequence_t *sequence);
     #+end_src

     Terminate a sequence.
     #+begin_src
     int CHAMELEON_Sequence_Flush(RUNTIME_sequence_t *sequence, RUNTIME_request_t *request);
     #+end_src