
Linking an external application with Chameleon libraries

Compilation and linking with the Chameleon libraries have been tested with the GNU compiler suite (gcc/gfortran) and the Intel compiler suite (icc/ifort).

Flags required

The compiler and linker flags necessary to build an application using Chameleon are provided through the pkg-config mechanism.

export PKG_CONFIG_PATH=/home/jdoe/install/chameleon/lib/pkgconfig:$PKG_CONFIG_PATH
pkg-config --cflags chameleon
pkg-config --libs chameleon
pkg-config --libs --static chameleon

The .pc files required are located in the sub-directory lib/pkgconfig of your Chameleon install directory.
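
For instance, the pkg-config output can be injected directly into your compile and link commands. A minimal sketch, assuming your application is a single main.c and that PKG_CONFIG_PATH is set as above:

gcc $(pkg-config --cflags chameleon) -c main.c
gcc main.o -o main $(pkg-config --libs chameleon)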

Static linking in C

Let's imagine you have a file main.c that you want to link with the Chameleon static libraries, and let's consider /home/yourname/install/chameleon to be the Chameleon install directory, containing the sub-directories include/ and lib/. Your compilation command with the gcc compiler could be:

gcc -I/home/yourname/install/chameleon/include -o main.o -c main.c

Now if you want to link your application with Chameleon static libraries, you could do:

gcc main.o -o main                                         \
/home/yourname/install/chameleon/lib/libchameleon.a        \
/home/yourname/install/chameleon/lib/libchameleon_starpu.a \
/home/yourname/install/chameleon/lib/libcoreblas.a         \
-lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64           \
-lmkl_sequential -lmkl_core -lpthread -lm -lrt

As you can see in this example, we also link with some dynamic libraries: starpu-1.2, the Intel MKL libraries (for BLAS/LAPACK/CBLAS/LAPACKE), pthread, m (math) and rt. These libraries depend on the configuration of your Chameleon build. You can find these dependencies in the .pc files generated during compilation and installed in the lib/pkgconfig sub-directory of your Chameleon install directory. Note also that you may need to specify where to find these libraries with the -L option of your compiler/linker.
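
Alternatively, rather than listing every dependency by hand, you can let pkg-config supply the full static link line taken from those .pc files (a sketch; the exact flags depend on your build configuration):

gcc main.o -o main $(pkg-config --libs --static chameleon)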

Before running your program, make sure that the paths of all shared libraries your executable depends on are known. Run ldd main to check. If some shared library paths are missing, append them to the LD_LIBRARY_PATH environment variable on Linux systems (DYLD_LIBRARY_PATH on Mac).
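
For example (the install path below is only illustrative):

ldd main
export LD_LIBRARY_PATH=/home/yourname/install/chameleon/lib:$LD_LIBRARY_PATH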

Dynamic linking in C

Dynamic linking (which requires building Chameleon with the CMake option BUILD_SHARED_LIBS=ON) is similar to static compilation/linking, but instead of giving the paths to your static libraries, you indicate the directory of the dynamic libraries with the -L option and the library names with the -l option, like this:

gcc main.o -o main \
-L/home/yourname/install/chameleon/lib \
-lchameleon -lchameleon_starpu -lcoreblas \
-lstarpu-1.2 -Wl,--no-as-needed -lmkl_intel_lp64 \
-lmkl_sequential -lmkl_core -lpthread -lm -lrt

Note that you may need to update your LD_LIBRARY_PATH environment variable (DYLD_LIBRARY_PATH on Mac) with the path of the libraries before executing:

export LD_LIBRARY_PATH=path/to/libs:path/to/chameleon/lib

Using Chameleon executables

Chameleon provides several test executables that are compiled and linked with Chameleon's dependencies. Instructions about the arguments to give to each executable are accessible through the -[-]help or -[-]h option. These binaries are split into three categories and can be found in three different directories:

  • example: contains examples of API usage; more specifically, the sub-directory lapack_to_morse/ provides a tutorial that explains how to use Chameleon functionalities starting from a full LAPACK code, see Tutorial LAPACK to Chameleon
  • testing: contains testing drivers to check numerical correctness of Chameleon linear algebra routines with a wide range of parameters
    ./testing/stesting 4 1 LANGE 600 100 700
        

    The first two arguments are the numbers of cores and GPUs to use. The third one is the name of the algorithm to test. The other arguments depend on the algorithm; here they stand for the number of rows, the number of columns and the leading dimension of the problem.

    The names of the algorithms available for testing are:

    • LANGE: norms of matrices (Infinity, One, Max, Frobenius)
    • GEMM: general matrix-matrix multiply
    • HEMM: hermitian matrix-matrix multiply
    • HERK: hermitian matrix-matrix rank k update
    • HER2K: hermitian matrix-matrix rank 2k update
    • SYMM: symmetric matrix-matrix multiply
    • SYRK: symmetric matrix-matrix rank k update
    • SYR2K: symmetric matrix-matrix rank 2k update
    • PEMV: matrix-vector multiply with pentadiagonal matrix
    • TRMM: triangular matrix-matrix multiply
    • TRSM: triangular solve, multiple rhs
    • POSV: solve linear systems with symmetric positive-definite matrix
    • GESV_INCPIV: solve linear systems with general matrix
    • GELS: linear least squares with general matrix
    • GELS_HQR: gels with hierarchical tree
    • GELS_SYSTOLIC: gels with systolic tree
  • timing: contains timing drivers to assess the performance of Chameleon routines. There are two sets of executables: those that do not use the tile interface and those that do (with _tile in the name of the executable). Executables without the tile interface allocate data following the LAPACK conventions, and these data can be given as arguments to Chameleon routines as you would do with LAPACK. Executables with the tile interface generate the data directly in the tile format that Chameleon algorithms use to submit tasks to the runtime system. Executables with the tile interface should be more performant because no data copy from the LAPACK matrix layout to the tile matrix layout is necessary. Calling example:
    ./timing/time_dpotrf --n_range=1000:10000:1000 --nb=320
                         --threads=9 --gpus=3
                         --nowarmup
        

    List of main options that can be used in timing:

    • --help: Show usage
    • Machine parameters
      • -t x, --threads=x: Number of CPU workers (default: automatic detection through runtime)
      • -g x, --gpus=x: Number of GPU workers (default: 0)
      • -P x, --P=x: Rows (P) in the PxQ process grid (default: 1)
      • --nocpu: All GPU kernels are exclusively executed on GPUs
    • Matrix parameters
      • -m x, --m=x, --M=x: Dimension (M) of the matrices (default: N)
      • -n x, --n=x, --N=x: Dimension (N) of the matrices
      • -N R, --n_range=R: Range of N values to time with R=Start:Stop:Step (default: 500:5000:500)
      • -k x, --k=x, --K=x, --nrhs=x: Dimension (K) of the matrices or number of right-hand sides (default: 1). This is useful for GEMM algorithms (k is the shared dimension and must be defined >1 to consider matrices and not vectors)
      • -b x, --nb=x: NB size. (default: 320)
      • -i x, --ib=x: IB size. (default: 32)
    • Check/prints
      • --niter=x: Number of iterations performed for each test (default: 1)
      • -W, --nowarning: Do not show warnings
      • -w, --nowarmup: Cancel the warmup run that pre-loads libraries
      • -c, --check: Check result
      • -C, --inc: Check on inverse
      • --mode=x: Change the xLATMS matrix generation mode for SVD/EVD (default: 4). It must be between 0 and 20 inclusive.
    • Profiling parameters
      • -T, --trace: Enable trace generation
      • --progress: Display progress indicator
      • -d, --dag: Enable DAG generation. Generates a dot_dag_file.dot.
      • -p, --profile: Print profiling information
    • HQR parameters
      • -a x, --qr_a=x, --rhblk=x: Define the size of the local TS trees in the Householder reduction trees for the QR and LQ factorization. N is the size of each subdomain (default: -1)
      • -l x, --llvl=x: Tree used for low level reduction inside nodes (default: -1)
      • -L x, --hlvl=x: Tree used for high level reduction between nodes, only if P > 1 (default: -1). Possible values are -1: Automatic, 0: Flat, 1: Greedy, 2: Fibonacci, 3: Binary, 4: Replicated greedy.
      • -D, --domino: Enable the domino between upper and lower trees
    • Advanced options
      • --nobigmat: Disable single large matrix allocation for multiple tiled allocations
      • -s, --sync: Enable synchronous calls in wrapper functions such as POTRI
      • -o, --ooc: Enable out-of-core (available only with StarPU)
      • -G, --gemm3m: Use gemm3m complex method
      • --bound: Compare result to area bound

    List of timing algorithms available:

    • LANGE: norms of matrices
    • GEMM: general matrix-matrix multiply
    • TRSM: triangular solve
    • POTRF: Cholesky factorization with a symmetric positive-definite matrix
    • POTRI: Cholesky inversion
    • POSV: solve linear systems with symmetric positive-definite matrix
    • GETRF_NOPIV: LU factorization of a general matrix using the tile LU algorithm without row pivoting
    • GESV_NOPIV: solve linear system for a general matrix using the tile LU algorithm without row pivoting
    • GETRF_INCPIV: LU factorization of a general matrix using the tile LU algorithm with partial tile pivoting with row interchanges
    • GESV_INCPIV: solve linear system for a general matrix using the tile LU algorithm with partial tile pivoting with row interchanges
    • GEQRF: QR factorization of a general matrix
    • GELQF: LQ factorization of a general matrix
    • GEQRF_HQR: GEQRF with hierarchical tree
    • GEQRS: solve linear systems using a QR factorization
    • GELS: solves overdetermined or underdetermined linear systems involving a general matrix using the QR or the LQ factorization
    • GESVD: general matrix singular value decomposition

Execution trace using StarPU

<sec:trace>

StarPU can generate its own trace log files if it has been compiled with the --with-fxt option at the configure step (you may have to specify the directory where you installed FxT by giving --with-fxt=... instead of --with-fxt alone). With this option, traces are generated after each execution of a program that uses StarPU, in the directory pointed to by the STARPU_FXT_PREFIX environment variable.

export STARPU_FXT_PREFIX=/home/jdoe/fxt_files/

When executing a ./timing/... Chameleon program, if tracing has been enabled (StarPU compiled with FxT and -DCHAMELEON_ENABLE_TRACING=ON), you can give the option --trace to tell the program to generate trace log files.

Finally, to generate the trace file that can be opened with the ViTE program, you can use the starpu_fxt_tool executable of StarPU. This tool should be in $STARPU_INSTALL_REPOSITORY/bin. You can use it to generate the trace file like this:

path/to/your/install/starpu/bin/starpu_fxt_tool -i prof_filename

There is one file per MPI process (prof_filename_0, prof_filename_1, …). To generate a trace of MPI programs you can call it like this:

path/to/your/install/starpu/bin/starpu_fxt_tool -i prof_filename*

The trace file will be named paje.trace (use the -o option to specify an output name). Alternatively, for non-MPI execution (only one process and one profiling file), you can set the environment variable STARPU_GENERATE_TRACE=1 to generate the paje trace file automatically.
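
Putting these steps together, a tracing session could look like the following sketch (the paths, problem sizes and the ViTE invocation are illustrative and depend on your install):

export STARPU_FXT_PREFIX=/home/jdoe/fxt_files/
./timing/time_dpotrf --n_range=1000:10000:1000 --nb=320 --trace
path/to/your/install/starpu/bin/starpu_fxt_tool -i /home/jdoe/fxt_files/prof_filename*
vite paje.trace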

Use simulation mode with StarPU-SimGrid

<sec:simu>

Simulation mode can be activated by setting the CMake option CHAMELEON_SIMULATION to ON. This mode allows you to simulate the execution of algorithms with StarPU compiled with SimGrid. To do so, we provide some performance models in the simucore/perfmodels/ directory of the Chameleon sources. To use these models, set your STARPU_HOME environment variable to path/to/your/chameleon_sources/simucore/perfmodels. Finally, set your STARPU_HOSTNAME environment variable to the name of the machine to simulate, for example STARPU_HOSTNAME=mirage. Note that only POTRF kernels with block sizes of 320 or 960 (simple and double precision) on the mirage and sirocco machines are available for now. The database of models is subject to change.
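
As an illustration, a simulated run of the Cholesky factorization timing driver could be set up as follows (the executable name and sizes are only an example; remember that only POTRF with block sizes 320 or 960 is modeled):

export STARPU_HOME=path/to/your/chameleon_sources/simucore/perfmodels
export STARPU_HOSTNAME=mirage
./timing/time_dpotrf_tile --n_range=1000:10000:1000 --nb=320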

Chameleon API

Chameleon provides routines to solve dense general systems of linear equations, symmetric positive definite systems of linear equations and linear least squares problems, using LU, Cholesky, QR and LQ factorizations. Real and complex arithmetic are supported in both single and double precision. The linear algebra routines are of the following form:

MORSE_name[_Tile[_Async]]
  • all user routines are prefixed with MORSE
  • in the pattern MORSE_name[_Tile[_Async]], name follows the BLAS/LAPACK naming scheme for algorithms (e.g. sgemm for general matrix-matrix multiply simple precision)
  • Chameleon provides three interface levels
    • MORSE_name: the simplest interface, very close to CBLAS and LAPACKE; matrices are given following the LAPACK data layout (1-D array, column-major). It involves copies of data from the LAPACK layout to the tile layout and conversely (to update the LAPACK data), see Step1.
    • MORSE_name_Tile: the tile interface avoids copies between the LAPACK and tile layouts. It is the standard interface of Chameleon and it should achieve better performance than the previous, simplest interface. The data are given through a specific structure called a descriptor, see Step2.
    • MORSE_name_Tile_Async: similar to the tile interface, it avoids the synchronization barrier normally called between Tile routines. At the end of an Async function, the completion of tasks is not guaranteed and the data are not necessarily up-to-date. To ensure that all tasks have been executed, a synchronization function has to be called after the sequence of Async functions, see Step4.

MORSE routine calls have to be preceded by a call to

MORSE_Init( NCPU, NGPU );
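
As an illustration, here is a minimal sketch of a program using the simplest (LAPACK-layout) interface. The prototype of MORSE_dpotrf, the MorseLower constant and the morse.h header name follow the naming scheme described above but should be checked against your install; the matrix content is just a toy example:

#include <stdio.h>
#include <stdlib.h>
#include <morse.h>                                /* Chameleon/MORSE user API (header name assumed) */

int main(void) {
    int N = 1000, LDA = 1000, info;
    /* LAPACK data layout: 1-D array, column-major */
    double *A = malloc((size_t)LDA * N * sizeof(double));

    /* Toy symmetric positive-definite matrix: strongly dominant diagonal */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            A[(size_t)j * LDA + i] = (i == j) ? 2.0 * N : 1.0;

    MORSE_Init(4, 0);                             /* 4 CPU workers, 0 GPU workers */
    info = MORSE_dpotrf(MorseLower, N, A, LDA);   /* Cholesky factorization, simplest interface */
    MORSE_Finalize();

    printf("MORSE_dpotrf returned %d\n", info);
    free(A);
    return info;
}

The tile variants of the same routine (MORSE_dpotrf_Tile, MORSE_dpotrf_Tile_Async) take a descriptor instead of the raw LAPACK array, as explained in Step2 and Step4.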