Newer
Older
# This file is part of the Chameleon User's Guide.
# Copyright (C) 2017 Inria
# See the file ../users_guide.org for copying conditions.
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
** Linking an external application with Chameleon libraries
Compilation and link with Chameleon libraries have been tested with
the GNU compiler suite ~gcc/gfortran~ and the Intel compiler suite
~icc/ifort~.
*** Flags required
The compiler, linker flags that are necessary to build an
application using Chameleon are given through the [[https://www.freedesktop.org/wiki/Software/pkg-config/][pkg-config]]
mechanism.
#+begin_src
export PKG_CONFIG_PATH=/home/jdoe/install/chameleon/lib/pkgconfig:$PKG_CONFIG_PATH
pkg-config --cflags chameleon
pkg-config --libs chameleon
pkg-config --libs --static chameleon
#+end_src
The .pc files required are located in the sub-directory
~lib/pkgconfig~ of your Chameleon install directory.
*** Static linking in C
Lets imagine you have a file ~main.c~ that you want to link with
Chameleon static libraries. Lets consider
~/home/yourname/install/chameleon~ is the install directory
of Chameleon containing sub-directories ~include/~ and
~lib/~. Here could be your compilation command with gcc
compiler:
#+begin_src
gcc -I/home/yourname/install/chameleon/include -o main.o -c main.c
#+end_src
Now if you want to link your application with Chameleon static libraries, you
could do:
#+begin_src
gcc main.o -o main \
/home/yourname/install/chameleon/lib/libchameleon.a \
/home/yourname/install/chameleon/lib/libchameleon_starpu.a \
/home/yourname/install/chameleon/lib/libcoreblas.a \
-lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64 \
-lmkl_sequential -lmkl_core -lpthread -lm -lrt
#+end_src
As you can see in this example, we also link with some dynamic
libraries *starpu-1.2*, *Intel MKL* libraries (for
BLAS/LAPACK/CBLAS/LAPACKE), *pthread*, *m* (math) and *rt*. These
libraries will depend on the configuration of your Chameleon
build. You can find these dependencies in .pc files we generate
during compilation and that are installed in the sub-directory
~lib/pkgconfig~ of your Chameleon install directory. Note also that
you could need to specify where to find these libraries with *-L*
option of your compiler/linker.
Before to run your program, make sure that all shared libraries
paths your executable depends on are known. Enter ~ldd main~
to check. If some shared libraries paths are missing append them
in the LD_LIBRARY_PATH (for Linux systems) environment
variable (DYLD_LIBRARY_PATH on Mac).
*** Dynamic linking in C
For dynamic linking (need to build Chameleon with CMake option
BUILD_SHARED_LIBS=ON) it is similar to static compilation/link but
instead of specifying path to your static libraries you indicate
the path to dynamic libraries with *-L* option and you give
the name of libraries with *-l* option like this:
#+begin_src
gcc main.o -o main \
-L/home/yourname/install/chameleon/lib \
-lchameleon -lchameleon_starpu -lcoreblas \
-lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64 \
-lmkl_sequential -lmkl_core -lpthread -lm -lrt
#+end_src
Note that an update of your environment variable LD_LIBRARY_PATH
(DYLD_LIBRARY_PATH on Mac) with the path of the libraries could be
required before executing
#+begin_src
export LD_LIBRARY_PATH=path/to/libs:path/to/chameleon/lib
#+end_src
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
# # *** Build a Fortran program with Chameleon :noexport:
# #
# # Chameleon provides a Fortran interface to user functions. Example:
# # #+begin_src
# # call morse_version(major, minor, patch) !or
# # call MORSE_VERSION(major, minor, patch)
# # #+end_src
# #
# # Build and link are very similar to the C case.
# #
# # Compilation example:
# # #+begin_src
# # gfortran -o main.o -c main.f90
# # #+end_src
# #
# # Static linking example:
# # #+begin_src
# # gfortran main.o -o main \
# # /home/yourname/install/chameleon/lib/libchameleon.a \
# # /home/yourname/install/chameleon/lib/libchameleon_starpu.a \
# # /home/yourname/install/chameleon/lib/libcoreblas.a \
# # -lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64 \
# # -lmkl_sequential -lmkl_core -lpthread -lm -lrt
# # #+end_src
# #
# # Dynamic linking example:
# # #+begin_src
# # gfortran main.o -o main \
# # -L/home/yourname/install/chameleon/lib \
# # -lchameleon -lchameleon_starpu -lcoreblas \
# # -lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64 \
# # -lmkl_sequential -lmkl_core -lpthread -lm -lrt
# # #+end_src
** Using Chameleon executables
Chameleon provides several test executables that are compiled and
linked with Chameleon's dependencies. Instructions about the
arguments to give to executables are accessible thanks to the
option ~-[-]help~ or ~-[-]h~. This set of binaries are separated into
three categories and can be found in three different directories:
* *example*: contains examples of API usage and more specifically the
sub-directory ~lapack_to_morse/~ provides a tutorial that explains
how to use Chameleon functionalities starting from a full LAPACK
code, see [[sec:tuto][Tutorial LAPACK to Chameleon]]
* *testing*: contains testing drivers to check numerical correctness of
Chameleon linear algebra routines with a wide range of parameters
#+begin_src
./testing/stesting 4 1 LANGE 600 100 700
#+end_src
Two first arguments are the number of cores and gpus to use.
The third one is the name of the algorithm to test.
The other arguments depend on the algorithm, here it lies for the number of
rows, columns and leading dimension of the problem.
Name of algorithms available for testing are:
* LANGE: norms of matrices Infinite, One, Max, Frobenius
* GEMM: general matrix-matrix multiply
* HEMM: hermitian matrix-matrix multiply
* HERK: hermitian matrix-matrix rank k update
* HER2K: hermitian matrix-matrix rank 2k update
* SYMM: symmetric matrix-matrix multiply
* SYRK: symmetric matrix-matrix rank k update
* SYR2K: symmetric matrix-matrix rank 2k update
* PEMV: matrix-vector multiply with pentadiagonal matrix
* TRMM: triangular matrix-matrix multiply
* TRSM: triangular solve, multiple rhs
* POSV: solve linear systems with symmetric positive-definite matrix
* GESV_INCPIV: solve linear systems with general matrix
* GELS: linear least squares with general matrix
* GELS_HQR: gels with hierarchical tree
* GELS_SYSTOLIC: gels with systolic tree
* *timing*: contains timing drivers to assess performances of
Chameleon routines. There are two sets of executables, those who
do not use the tile interface and those who do (with _tile in the
name of the executable). Executables without tile interface
allocates data following LAPACK conventions and these data can be
given as arguments to Chameleon routines as you would do with
LAPACK. Executables with tile interface generate directly the
data in the format Chameleon tile algorithms used to submit tasks
to the runtime system. Executables with tile interface should be
more performant because no data copy from LAPACK matrix layout to
tile matrix layout are necessary. Calling example:
#+begin_src
./timing/time_dpotrf --n_range=1000:10000:1000 --nb=320
--threads=9 --gpus=3
--nowarmup
#+end_src
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
List of main options that can be used in timing:
* ~--help~: show usage
* ~--threads~: Number of CPU workers (default:
~_SC_NPROCESSORS_ONLN~)
* ~--gpus~: number of GPU workers (default: ~0~)
* ~--n_range=R~: range of N values, with ~R=Start:Stop:Step~
(default: ~500:5000:500~)
* ~--m=X~: dimension (M) of the matrices (default: ~N~)
* ~--k=X~: dimension (K) of the matrices (default: ~1~), useful for
GEMM algorithm (k is the shared dimension and must be defined
>1 to consider matrices and not vectors)
* ~--nrhs=X~: number of right-hand size (default: ~1~)
* ~--nb=X~: block/tile size. (default: ~128~)
* ~--ib=X~: inner-blocking/IB size. (default: ~32~)
* ~--niter=X~: number of iterations performed for each test
(default: ~1~)
* ~--rhblk=X~: if X > 0, enable Householder mode for QR and LQ
factorization. X is the size of each subdomain (default: ~0~)
* ~--[no]check~: check result (default: ~nocheck~)
* ~--[no]profile~: print profiling informations (default:
~noprofile~)
* ~--[no]trace~: enable/disable trace generation (default: ~notrace~)
* ~--[no]dag~: enable/disable DAG generation (default: ~nodag~)
* ~--[no]inv~: check on inverse (default: ~noinv~)
* ~--nocpu~: all GPU kernels are exclusively executed on GPUs
* ~--ooc~: Enable out-of-core (available only with StarPU)
* ~--bound~: Compare result to area bound (available only with
StarPU) (default: ~0~)
List of timing algorithms available:
* LANGE: norms of matrices
* GEMM: general matrix-matrix multiply
* TRSM: triangular solve
* POTRF: Cholesky factorization with a symmetric
positive-definite matrix
* POTRI: Cholesky inversion
* POSV: solve linear systems with symmetric positive-definite matrix
* GETRF_NOPIV: LU factorization of a general matrix using the tile LU algorithm without row pivoting
* GESV_NOPIV: solve linear system for a general matrix using the tile LU algorithm without row pivoting
* GETRF_INCPIV: LU factorization of a general matrix using the tile LU algorithm with partial tile pivoting with row interchanges
* GESV_INCPIV: solve linear system for a general matrix using the tile LU algorithm with partial tile pivoting with row interchanges matrix
* GEQRF: QR factorization of a general matrix
* GELQF: LQ factorization of a general matrix
* QEQRF_HQR: GEQRF with hierarchical tree
* QEQRS: solve linear systems using a QR factorization
* GELS: solves overdetermined or underdetermined linear systems involving a general matrix using the QR or the LQ factorization
* GESVD: general matrix singular value decomposition
*** Execution trace using StarPU
<<sec:trace>>
StarPU can generate its own trace log files by compiling it with
the ~--with-fxt~ option at the configure step (you can have to
specify the directory where you installed FxT by giving
~--with-fxt=...~ instead of ~--with-fxt~ alone). By doing so, traces
are generated after each execution of a program which uses StarPU
in the directory pointed by the STARPU_FXT_PREFIX environment
variable.
#+begin_example
export STARPU_FXT_PREFIX=/home/jdoe/fxt_files/
#+end_example
When executing a ~./timing/...~ Chameleon program, if it has been
enabled (StarPU compiled with FxT and
*-DCHAMELEON_ENABLE_TRACING=ON*), you can give the option ~--trace~ to
tell the program to generate trace log files.
Finally, to generate the trace file which can be opened with [[http://vite.gforge.inria.fr/][Vite]]
program, you can use the *starpu_fxt_tool* executable of StarPU.
This tool should be in ~$STARPU_INSTALL_REPOSITORY/bin~. You can
use it to generate the trace file like this:
#+begin_src
path/to/your/install/starpu/bin/starpu_fxt_tool -i prof_filename
#+end_src
There is one file per mpi processus (prof_filename_0,
prof_filename_1 ...). To generate a trace of mpi programs you can
call it like this:
#+begin_src
path/to/your/install/starpu/bin/starpu_fxt_tool -i prof_filename*
#+end_src
The trace file will be named paje.trace (use -o option to specify
an output name). Alternatively, for non mpi execution (only one
processus and profiling file), you can set the environment
variable *STARPU_GENERATE_TRACE=1* to automatically generate the
paje trace file.
*** Use simulation mode with StarPU-SimGrid
<<sec:simu>>
Simulation mode can be activated by setting the cmake option
CHAMELEON_SIMULATION to ON. This mode allows you to simulate
execution of algorithms with StarPU compiled with [[http://simgrid.gforge.inria.fr/][SimGrid]]. To do
so, we provide some perfmodels in the simucore/perfmodels/
directory of Chameleon sources. To use these perfmodels, please
set your *STARPU_HOME* environment variable to
~path/to/your/chameleon_sources/simucore/perfmodels~. Finally, you
need to set your *STARPU_HOSTNAME* environment variable to the name
of the machine to simulate. For example: *STARPU_HOSTNAME=mirage*.
Note that only POTRF kernels with block sizes of 320 or 960
(simple and double precision) on /mirage/ and /sirocco/ machines are
available for now. Database of models is subject to change.
** Chameleon API
Chameleon provides routines to solve dense general systems of
linear equations, symmetric positive definite systems of linear
equations and linear least squares problems, using LU, Cholesky, QR
and LQ factorizations. Real arithmetic and complex arithmetic are
supported in both single precision and double precision. Routines
that compute linear algebra are of the following form:
#+begin_src
MORSE_name[_Tile[_Async]]
#+end_src
* all user routines are prefixed with *MORSE*
* in the pattern *MORSE_name[_Tile[_Async]]*, /name/ follows the
BLAS/LAPACK naming scheme for algorithms (/e.g./ sgemm for general
matrix-matrix multiply simple precision)
* Chameleon provides three interface levels
* *MORSE_name*: simplest interface, very close to CBLAS and
LAPACKE, matrices are given following the LAPACK data layout
(1-D array column-major). It involves copy of data from LAPACK
layout to tile layout and conversely (to update LAPACK data),
see [[sec:tuto_step1][Step1]].
* *MORSE_name_Tile*: the tile interface avoid copies between LAPACK
and tile layouts. It is the standard interface of Chameleon and
it should achieved better performance than the previous
simplest interface. The data are given through a specific
structure called a descriptor, see [[sec:tuteo_step2][Step2]].
* *MORSE_name_Tile_Async*: similar to the tile interface, it avoids
synchonization barrier normally called between *Tile* routines.
At the end of an *Async* function, completion of tasks is not
guaranteed and data are not necessarily up-to-date. To ensure
that tasks have been all executed, a synchronization function
has to be called after the sequence of *Async* functions, see
[[tuto_step4][Step4]].
MORSE routine calls have to be preceded from
#+begin_src
MORSE_Init( NCPU, NGPU );
#+end_src
to initialize MORSE and the runtime system and followed by
#+begin_src
MORSE_Finalize();
#+end_src
to free some data and finalize the runtime and/or MPI.
*** Tutorial LAPACK to Chameleon
This tutorial is dedicated to the API usage of Chameleon. The
idea is to start from a simple code and step by step explain how
to use Chameleon routines. The first step is a full BLAS/LAPACK
code without dependencies to Chameleon, a code that most users
should easily understand. Then, the different interfaces
Chameleon provides are exposed, from the simplest API (step1) to
more complicated ones (until step4). The way some important
parameters are set is discussed in step5. step6 is an example
about distributed computation with MPI. Finally step7 shows how
to let Chameleon initialize user's data (matrices/vectors) in
parallel.
Source files can be found in the ~example/lapack_to_morse/~
directory. If CMake option *CHAMELEON_ENABLE_EXAMPLE* is ON then
source files are compiled with the project libraries. The
arithmetic precision is /double/. To execute a step
*X*, enter the following command:
#+begin_src
#+end_src
Instructions about the arguments to give to executables are
accessible thanks to the option ~-[-]help~ or ~-[-]h~. Note there
exist default values for options.
For all steps, the program solves a linear system $Ax=B$ The
matrix values are randomly generated but ensure that matrix \$A\$ is
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
symmetric positive definite so that $A$ can be factorized in a
$LL^T$ form using the Cholesky factorization.
The different steps of the tutorial are:
* Step0: a simple Cholesky example using the C interface of BLAS/LAPACK
* Step1: introduces the LAPACK equivalent interface of Chameleon
* Step2: introduces the tile interface
* Step3: indicates how to give your own tile matrix to Chameleon
* Step4: introduces the tile async interface
* Step5: shows how to set some important parameters
* Step6: introduces how to benefit from MPI in Chameleon
* Step7: introduces how to let Chameleon initialize the user's matrix data
**** Step0
The C interface of BLAS and LAPACK, that is, CBLAS and LAPACKE,
are used to solve the system. The size of the system (matrix) and
the number of right hand-sides can be given as arguments to the
executable (be careful not to give huge numbers if you do not
have an infinite amount of RAM!). As for every step, the
correctness of the solution is checked by calculating the norm
$||Ax-B||/(||A||||x||+||B||)$. The time spent in
factorization+solve is recorded and, because we know exactly the
number of operations of these algorithms, we deduce the number of
operations that have been processed per second (in GFlops/s).
The important part of the code that solves the problem is:
#+begin_example
/* Cholesky factorization:
* A is replaced by its factorization L or L^T depending on uplo */
LAPACKE_dpotrf( LAPACK_COL_MAJOR, 'U', N, A, N );
/* Solve:
* B is stored in X on entry, X contains the result on exit.
* Forward ...
*/
cblas_dtrsm(
CblasColMajor,
CblasLeft,
CblasUpper,
CblasConjTrans,
CblasNonUnit,
N, NRHS, 1.0, A, N, X, N);
/* ... and back substitution */
cblas_dtrsm(
CblasColMajor,
CblasLeft,
CblasUpper,
CblasNoTrans,
CblasNonUnit,
N, NRHS, 1.0, A, N, X, N);
#+end_example
**** Step1
<<sec:tuto_step1>>
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
It introduces the simplest Chameleon interface which is
equivalent to CBLAS/LAPACKE. The code is very similar to step0
but instead of calling CBLAS/LAPACKE functions, we call Chameleon
equivalent functions. The solving code becomes:
#+begin_example
/* Factorization: */
MORSE_dpotrf( UPLO, N, A, N );
/* Solve: */
MORSE_dpotrs(UPLO, N, NRHS, A, N, X, N);
#+end_example
The API is almost the same so that it is easy to use for beginners.
It is important to keep in mind that before any call to MORSE routines,
*MORSE_Init* has to be invoked to initialize MORSE and the runtime system.
Example:
#+begin_example
MORSE_Init( NCPU, NGPU );
#+end_example
After all MORSE calls have been done, a call to *MORSE_Finalize* is
required to free some data and finalize the runtime and/or MPI.
#+begin_example
MORSE_Finalize();
#+end_example
We use MORSE routines with the LAPACK interface which means the
routines accepts the same matrix format as LAPACK (1-D array
column-major). Note that we copy the matrix to get it in our own
tile structures, see details about this format here [[sec:tile][Tile Data
Layout]]. This means you can get an overhead coming from copies.
**** Step2
<<sec:tuto_step2>>
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
This program is a copy of step1 but instead of using the LAPACK interface which
reads to copy LAPACK matrices inside MORSE routines we use the tile interface.
We will still use standard format of matrix but we will see how to give this
matrix to create a MORSE descriptor, a structure wrapping data on which we want
to apply sequential task-based algorithms.
The solving code becomes:
#+begin_example
/* Factorization: */
MORSE_dpotrf_Tile( UPLO, descA );
/* Solve: */
MORSE_dpotrs_Tile( UPLO, descA, descX );
#+end_example
To use the tile interface, a specific structure *MORSE_desc_t* must be
created.
This can be achieved from different ways.
1. Use the existing function *MORSE_Desc_Create*: means the matrix
data are considered contiguous in memory as it is considered
in PLASMA ([[sec:tile][Tile Data Layout]]).
2. Use the existing function *MORSE_Desc_Create_OOC*: means the
matrix data is allocated on-demand in memory tile by tile, and
possibly pushed to disk if that does not fit memory.
3. Use the existing function *MORSE_Desc_Create_User*: it is more
flexible than *Desc_Create* because you can give your own way to
access to tile data so that your tiles can be allocated
wherever you want in memory, see next paragraph [[sec:tuto_step3][Step3]].
4. Create you own function to fill the descriptor. If you
understand well the meaning of each item of *MORSE_desc_t*, you
should be able to fill correctly the structure.
In Step2, we use the first way to create the descriptor:
#+begin_example
MORSE_Desc_Create(&descA, NULL, MorseRealDouble,
NB, NB, NB*NB, N, N,
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
0, 0, N, N,
1, 1);
#+end_example
* *descA* is the descriptor to create.
* The second argument is a pointer to existing data. The existing
data must follow LAPACK/PLASMA matrix layout [[sec:tile][Tile Data Layout]]
(1-D array column-major) if *MORSE_Desc_Create* is used to create
the descriptor. The *MORSE_Desc_Create_User* function can be used
if you have data organized differently. This is discussed in
the next paragraph [[sec_tuto_step3][Step3]]. Giving a *NULL* pointer means you let
the function allocate memory space. This requires to copy your
data in the memory allocated by the *Desc_Create. This can be
done with
#+begin_example
MORSE_Lapack_to_Tile(A, N, descA);
#+end_example
* Third argument of @code{Desc_Create} is the datatype (used for
memory allocation).
* Fourth argument until sixth argument stand for respectively,
the number of rows (*NB*), columns (*NB*) in each tile, the total
number of values in a tile (*NB*NB*), the number of rows (*N*),
colmumns (*N*) in the entire matrix.
* Seventh argument until ninth argument stand for respectively,
the beginning row (0), column (0) indexes of the submatrix and
the number of rows (N), columns (N) in the submatrix. These
arguments are specific and used in precise cases. If you do
not consider submatrices, just use 0, 0, NROWS, NCOLS.
* Two last arguments are the parameter of the 2-D block-cyclic
distribution grid, see [[http://www.netlib.org/scalapack/slug/node75.html][ScaLAPACK]]. To be able to use other data
distribution over the nodes, *MORSE_Desc_Create_User* function
should be used.
**** Step3
<<sec:tuto_step3>>
This program makes use of the same interface than Step2 (tile
interface) but does not allocate LAPACK matrices anymore so that
no copy between LAPACK matrix layout and tile matrix layout are
necessary to call MORSE routines. To generate random right
hand-sides you can use:
#+begin_example
/* Allocate memory and initialize descriptor B */
MORSE_Desc_Create(&descB, NULL, MorseRealDouble,
NB, NB, NB*NB, N, NRHS,
0, 0, N, NRHS, 1, 1);
/* generate RHS with random values */
MORSE_dplrnt_Tile( descB, 5673 );
#+end_example
The other important point is that is it possible to create a
descriptor, the necessary structure to call MORSE efficiently, by
giving your own pointer to tiles if your matrix is not organized
as a 1-D array column-major. This can be achieved with the
*MORSE_Desc_Create_User* routine. Here is an example:
#+begin_example
MORSE_Desc_Create_User(&descA, matA, MorseRealDouble,
NB, NB, NB*NB, N, N,
0, 0, N, N, 1, 1,
user_getaddr_arrayofpointers,
user_getblkldd_arrayofpointers,
user_getrankof_zero);
#+end_example
Firsts arguments are the same than *MORSE_Desc_Create* routine.
Following arguments allows you to give pointer to functions that
manage the access to tiles from the structure given as second
argument. Here for example, *matA* is an array containing
addresses to tiles, see the function *allocate_tile_matrix*
defined in step3.h. The three functions you have to
define for *Desc_Create_User* are:
* a function that returns address of tile $A(m,n)$, m and n
standing for the indexes of the tile in the global matrix. Lets
consider a matrix @math{4x4} with tile size 2x2, the matrix
contains four tiles of indexes: $A(m=0,n=0)$, $A(m=0,n=1)$,
$A(m=1,n=0)$, $A(m=1,n=1)$
* a function that returns the leading dimension of tile $A(m,*)$
* a function that returns MPI rank of tile $A(m,n)$
Examples for these functions are vizible in step3.h. Note that
the way we define these functions is related to the tile matrix
format and to the data distribution considered. This example
should not be used with MPI since all tiles are affected to
processus 0, which means a large amount of data will be
potentially transfered between nodes.
**** Step4
<<sec:tuto_step4>>
This program is a copy of step2 but instead of using the tile
interface, it uses the tile async interface. The goal is to
exhibit the runtime synchronization barriers. Keep in mind that
when the tile interface is called, like *MORSE_dpotrf_Tile*,
a synchronization function, waiting for the actual execution and
termination of all tasks, is called to ensure the proper
completion of the algorithm (i.e. data are up-to-date). The code
shows how to exploit the async interface to pipeline subsequent
algorithms so that less synchronisations are done. The code
becomes:
#+begin_example
/* Morse structure containing parameters and a structure to interact with
* the Runtime system */
MORSE_context_t *morse;
/* MORSE sequence uniquely identifies a set of asynchronous function calls
* sharing common exception handling */
MORSE_sequence_t *sequence = NULL;
/* MORSE request uniquely identifies each asynchronous function call */
MORSE_request_t request = MORSE_REQUEST_INITIALIZER;
int status;
...
morse_sequence_create(morse, &sequence);
/* Factorization: */
MORSE_dpotrf_Tile_Async( UPLO, descA, sequence, &request );
/* Solve: */
MORSE_dpotrs_Tile_Async( UPLO, descA, descX, sequence, &request);
/* Synchronization barrier (the runtime ensures that all submitted tasks
* have been terminated */
RUNTIME_barrier(morse);
/* Ensure that all data processed on the gpus we are depending on are back
* in main memory */
RUNTIME_desc_getoncpu(descA);
RUNTIME_desc_getoncpu(descX);
status = sequence->status;
#+end_example
Here the sequence of *dpotrf* and *dpotrs* algorithms is processed
without synchronization so that some tasks of *dpotrf* and *dpotrs*
can be concurently executed which could increase performances.
The async interface is very similar to the tile one. It is only
necessary to give two new objects *MORSE_sequence_t* and
*MORSE_request_t* used to handle asynchronous function calls.
#+CAPTION: POTRI (POTRF, TRTRI, LAUUM) algorithm with and without synchronization barriers, courtesey of the [[http://icl.cs.utk.edu/plasma/][PLASMA]] team.
#+NAME: fig:potri_async
#+ATTR_HTML: :width 640px :align center
[[file:potri_async.png]]
**** Step5
<<sec:tuto_step5>>
Step5 shows how to set some important parameters. This program
is a copy of Step4 but some additional parameters are given by
the user. The parameters that can be set are:
* number of Threads
* number of GPUs
The number of workers can be given as argument
to the executable with ~--threads=~ and ~--gpus=~ options. It is
important to notice that we assign one thread per gpu to
optimize data transfer between main memory and devices memory.
The number of workers of each type CPU and CUDA
must be given at *MORSE_Init*.
#+begin_example
if ( iparam[IPARAM_THRDNBR] == -1 ) {
get_thread_count( &(iparam[IPARAM_THRDNBR]) );
/* reserve one thread par cuda device to optimize memory transfers */
iparam[IPARAM_THRDNBR] -=iparam[IPARAM_NCUDAS];
}
NCPU = iparam[IPARAM_THRDNBR];
NGPU = iparam[IPARAM_NCUDAS];
/* initialize MORSE with main parameters */
MORSE_Init( NCPU, NGPU );
#+end_example
* matrix size
* number of right-hand sides
* block (tile) size
The problem size is given with ~--n=~ and ~--nrhs=~ options. The
tile size is given with option ~--nb=~. These parameters are
required to create descriptors. The size tile NB is a key
parameter to get performances since it defines the granularity
of tasks. If NB is too large compared to N, there are few
tasks to schedule. If the number of workers is large this
leads to limit parallelism. On the contrary, if NB is too
small (/i.e./ many small tasks), workers could not be correctly
fed and the runtime systems operations could represent a
substantial overhead. A trade-off has to be found depending on
many parameters: problem size, algorithm (drive data
dependencies), architecture (number of workers, workers speed,
workers uniformity, memory bus speed). By default it is set
to 128. Do not hesitate to play with this parameter and
compare performances on your machine.
* inner-blocking size
The inner-blocking size is given with option ~--ib=~.
This parameter is used by kernels (optimized algorithms applied on tiles) to
perform subsequent operations with data block-size that fits the cache of
workers.
Parameters NB and IB can be given with *MORSE_Set* function:
#+begin_example
MORSE_Set(MORSE_TILE_SIZE, iparam[IPARAM_NB] );
MORSE_Set(MORSE_INNER_BLOCK_SIZE, iparam[IPARAM_IB] );
#+end_example
**** Step6
<<sec:tuto_step6>>
This program is a copy of Step5 with some additional parameters
to be set for the data distribution. To use this program
properly MORSE must use StarPU Runtime system and MPI option must
be activated at configure. The data distribution used here is
2-D block-cyclic, see for example [[http://www.netlib.org/scalapack/slug/node75.html][ScaLAPACK]] for explanation. The
user can enter the parameters of the distribution grid at
execution with ~--p=~ option. Example using OpenMPI on four nodes
with one process per node:
#+begin_example
mpirun -np 4 ./step6 --n=10000 --nb=320 --ib=64 --threads=8 --gpus=2 --p=2
#+end_example
In this program we use the tile data layout from PLASMA so that the call
#+begin_example
MORSE_Desc_Create_User(&descA, NULL, MorseRealDouble,
NB, NB, NB*NB, N, N,
0, 0, N, N,
GRID_P, GRID_Q,
morse_getaddr_ccrb,
morse_getblkldd_ccrb,
morse_getrankof_2d);
#+end_example
is equivalent to the following call
#+begin_example
MORSE_Desc_Create(&descA, NULL, MorseRealDouble,
NB, NB, NB*NB, N, N,
0, 0, N, N,
GRID_P, GRID_Q);
#+end_example
functions *morse_getaddr_ccrb*, *morse_getblkldd_ccrb*,
*morse_getrankof_2d* being used in *Desc_Create*. It is interesting
to notice that the code is almost the same as Step5. The only
additional information to give is the way tiles are distributed
through the third function given to *MORSE_Desc_Create_User*.
Here, because we have made experiments only with a 2-D
block-cyclic distribution, we have parameters P and Q in the
interface of *Desc_Create* but they have sense only for 2-D
block-cyclic distribution and then using *morse_getrankof_2d*
function. Of course it could be used with other distributions,
being no more the parameters of a 2-D block-cyclic grid but of
another distribution.
**** Step7
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
<<sec:tuto_step7>>
This program is a copy of step6 with some additional calls to
build a matrix from within chameleon using a function provided by
the user. This can be seen as a replacement of the function like
*MORSE_dplgsy_Tile()* that can be used to fill the matrix with
random data, *MORSE_dLapack_to_Tile()* to fill the matrix with data
stored in a lapack-like buffer, or *MORSE_Desc_Create_User()* that
can be used to describe an arbitrary tile matrix structure. In
this example, the build callback function are just wrapper
towards *CORE_xxx()* functions, so the output of the program step7
should be exactly similar to that of step6. The difference is
that the function used to fill the tiles is provided by the user,
and therefore this approach is much more flexible.
The new function to understand is *MORSE_dbuild_Tile*, e.g.
#+begin_example
struct data_pl data_A={(double)N, 51, N};
MORSE_dbuild_Tile(MorseUpperLower, descA, (void*)&data_A, Morse_build_callback_plgsy);
#+end_example
The idea here is to let Chameleon fill the matrix data in a
task-based fashion (parallel) by using a function given by the
user. First, the user should define if all the blocks must be
entirelly filled or just the upper/lower part with, /e.g./
MorseUpperLower. We still relies on the same structure
*MORSE_desc_t* which must be initialized with the proper
parameters, by calling for example *MORSE_Desc_Create*. Then, an
opaque pointer is used to let the user give some extra data used
by his function. The last parameter is the pointer to the user's
function.
*** List of available routines
**** Linear Algebra routines
We list the linear algebra routines of the form
*MORSE_name[_Tile[_Async]]* (/name/ follows LAPACK naming scheme, see
http://www.netlib.org/lapack/lug/node24.html) that can be used
with the Chameleon library. For details about these functions

PRUVOST Florent
committed
please refer to the doxygen documentation. /name/ can be one of the
following:
* *BLAS 2/3 routines*
* gemm: matrix matrix multiply and addition
* hemm: gemm with A Hermitian
* herk: rank k operations with A Hermitian
* her2k: rank 2k operations with A Hermitian
* lauum: computes the product U * U' or L' * L, where the
triangular factor U or L is stored in the upper or lower
triangular part of the array A
* symm: gemm with A symmetric
* syrk: rank k operations with A symmetric
* syr2k: rank 2k with A symmetric
* trmm: gemm with A triangular
* *Triangular solving routines*
* trsm: computes triangular solve
* trsmpl: performs the forward substitution step of solving a
system of linear equations after the tile LU factorization of
the matrix
* trsmrv:
* trtri: computes the inverse of a complex upper or lower triangular matrix A
* *LL' (Cholesky) routines*
* posv: linear systems solving using Cholesky factorization
* potrf: Cholesky factorization
* potri: computes the inverse of a complex Hermitian positive
definite matrix A using the Cholesky factorization A
* potrimm:
* potrs: linear systems solving using existing Cholesky
factorization
* sysv: linear systems solving using Cholesky decomposition with
A symmetric
* sytrf: Cholesky decomposition with A symmetric
* sytrs: linear systems solving using existing Cholesky
decomposition with A symmetric
* *LU routines*
* gesv_incpiv: linear systems solving with LU factorization and
partial pivoting
* gesv_nopiv: linear systems solving with LU factorization and
without pivoting
* getrf_incpiv: LU factorization with partial pivoting
* getrf_nopiv: LU factorization without pivoting
* getrs_incpiv: linear systems solving using existing LU
factorization with partial pivoting
* getrs_nopiv: linear systems solving using existing LU
factorization without pivoting
* *QR/LQ routines*
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
* gelqf: LQ factorization
* gelqf_param: gelqf with hqr
* gelqs: computes a minimum-norm solution min || A*X - B || using
the LQ factorization
* gelqs_param: gelqs with hqr
* gels: Uses QR or LQ factorization to solve a overdetermined or
underdetermined linear system with full rank matrix
* gels_param: gels with hqr
* geqrf: QR factorization
* geqrf_param: geqrf with hqr
* geqrs: computes a minimum-norm solution min || A*X - B || using
the RQ factorization
* hetrd: reduces a complex Hermitian matrix A to real symmetric
tridiagonal form S
* geqrs_param: geqrs with hqr
* tpgqrt: generates a partial Q matrix formed with a blocked QR
factorization of a "triangular-pentagonal" matrix C, which is
composed of a unused triangular block and a pentagonal block V,
using the compact representation for Q. See tpqrt to
generate V
* tpqrt: computes a blocked QR factorization of a
"triangular-pentagonal" matrix C, which is composed of a
triangular block A and a pentagonal block B, using the compact
representation for Q
* unglq: generates an M-by-N matrix Q with orthonormal rows,
which is defined as the first M rows of a product of the
elementary reflectors returned by MORSE_zgelqf
* unglq_param: unglq with hqr
* ungqr: generates an M-by-N matrix Q with orthonormal columns,
which is defined as the first N columns of a product of the
elementary reflectors returned by MORSE_zgeqrf
* ungqr_param: ungqr with hqr
* unmlq: overwrites C with Q*C or C*Q or equivalent operations
with transposition on conjugate on C (see doxygen
documentation)
* unmlq_param: unmlq with hqr
* unmqr: similar to unmlq (see doxygen documentation)
* unmqr_param: unmqr with hqr
* *EVD/SVD*
* gesvd: singular value decomposition
* heevd: eigenvalues/eigenvectors computation with A Hermitian
* *Extra routines*
* *Norms*
* lange: computes norm of a matrix (Max, One, Inf, Frobenius)
* lanhe: lange with A Hermitian
* lansy: lange with A symmetric
* lantr: lange with A triangular
* *Random matrices generation*
* plghe: generates a random Hermitian matrix
* plgsy: generates a random symmetrix matrix
* plrnt: generates a random matrix
* *Others*
* geadd: general matrix matrix addition
* lacpy: copy matrix into another
* lascal: scales a matrix
* laset: copy the triangular part of a matrix into another, set a
value for the diagonal and off-diagonal part
* tradd: trapezoidal matrices addition
**** Options routines
Enable MORSE feature.
#+begin_src
int MORSE_Enable (MORSE_enum option);
#+end_src
Feature to be enabled:
* *MORSE_WARNINGS*: printing of warning messages,
* *MORSE_AUTOTUNING*: autotuning for tile size and inner block size,
* *MORSE_PROFILING_MODE*: activate kernels profiling,
* *MORSE_PROGRESS*: to print a progress status,
* *MORSE_GEMM3M*: to enable the use of the /gemm3m/ blas bunction.
Disable MORSE feature.
#+begin_src
int MORSE_Disable (MORSE_enum option);
#+end_src
Symmetric to *MORSE_Enable*.
Set MORSE parameter.
#+begin_src
int MORSE_Set (MORSE_enum param, int value);
#+end_src
Parameters to be set:
* *MORSE_TILE_SIZE*: size matrix tile,
* *MORSE_INNER_BLOCK_SIZE*: size of tile inner block,
* *MORSE_HOUSEHOLDER_MODE*: type of householder trees (FLAT or TREE),
* *MORSE_HOUSEHOLDER_SIZE*: size of the groups in householder trees,
* *MORSE_TRANSLATION_MODE*: related to the *MORSE_Lapack_to_Tile*, see ztile.c.
Get value of MORSE parameter.
#+begin_src
int MORSE_Get (MORSE_enum param, int *value);
#+end_src
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
Reports MORSE version number.
#+begin_src
int MORSE_Version (int *ver_major, int *ver_minor, int *ver_micro);
#+end_src
Initialize MORSE: initialize some parameters, initialize the runtime and/or MPI.
#+begin_src
int MORSE_Init (int nworkers, int ncudas);
#+end_src
Finalyze MORSE: free some data and finalize the runtime and/or MPI.
#+begin_src
int MORSE_Finalize (void);
#+end_src
Suspend MORSE runtime to poll for new tasks, to avoid useless CPU consumption when
no tasks have to be executed by MORSE runtime system.
#+begin_src
int MORSE_Pause (void);
#+end_src
Symmetrical call to MORSE_Pause, used to resume the workers polling for new tasks.
#+begin_src
int MORSE_Resume (void);
#+end_src
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
Return the MPI rank of the calling process.
#+begin_src
int MORSE_My_Mpi_Rank (void);
#+end_src
Return the size of the distributed computation
#+begin_src
int MORSE_Comm_size( int *size )
#+end_src
Return the rank of the distributed computation
#+begin_src
int MORSE_Comm_rank( int *rank )
#+end_src
Prepare the distributed processes for computation
#+begin_src
int MORSE_Distributed_start(void)
#+end_src
Clean the distributed processes after computation
#+begin_src
int MORSE_Distributed_stop(void)
#+end_src
Return the number of CPU workers initialized by the runtime
#+begin_src
int MORSE_GetThreadNbr()
#+end_src
Conversion from LAPACK layout to tile layout.
#+begin_src
int MORSE_Lapack_to_Tile (void *Af77, int LDA, MORSE_desc_t *A);
#+end_src
Conversion from tile layout to LAPACK layout.
#+begin_src
int MORSE_Tile_to_Lapack (MORSE_desc_t *A, void *Af77, int LDA);
#+end_src
**** Descriptor routines
Create matrix descriptor, internal function.
#+begin_src

PRUVOST Florent
committed
int MORSE_Desc_Create(MORSE_desc_t **desc, void *mat, MORSE_enum dtyp,
int mb, int nb, int bsiz, int lm, int ln,
int i, int j, int m, int n, int p, int q);
#+end_src
Create matrix descriptor, user function.
#+begin_src
int MORSE_Desc_Create_User(MORSE_desc_t **desc, void *mat, MORSE_enum dtyp,
int mb, int nb, int bsiz, int lm, int ln,
int i, int j, int m, int n, int p, int q,
void* (*get_blkaddr)( const MORSE_desc_t*, int, int),
int (*get_blkldd)( const MORSE_desc_t*, int ),
int (*get_rankof)( const MORSE_desc_t*, int, int ));
#+end_src

PRUVOST Florent
committed
Create matrix descriptor for tiled matrix which may not fit
memory.
#+begin_src
int MORSE_Desc_Create_OOC(MORSE_desc_t **descptr, MORSE_enum dtyp, int mb, int nb, int bsiz,
int lm, int ln, int i, int j, int m, int n, int p, int q);
#+end_src
User's function version of MORSE_Desc_Create_OOC.
#+begin_src
int MORSE_Desc_Create_OOC_User(MORSE_desc_t **descptr, MORSE_enum dtyp, int mb, int nb, int bsiz,
int lm, int ln, int i, int j, int m, int n, int p, int q,
int (*get_rankof)( const MORSE_desc_t*, int, int ));
#+end_src
Destroys matrix descriptor.
#+begin_src
int MORSE_Desc_Destroy (MORSE_desc_t **desc);
#+end_src

PRUVOST Florent
committed
Ensures that all data of the descriptor are up-to-date.
#+begin_src
int MORSE_Desc_Acquire (MORSE_desc_t *desc);
#+end_src
Release the data of the descriptor acquired by the
application. Should be called if MORSE_Desc_Acquire has been
called on the descriptor and if you do not need to access to its
data anymore.
#+begin_src