# This file is part of the Chameleon User's Guide.
# Copyright (C) 2017 Inria
# See the file ../users_guide.org for copying conditions.
** MORSE project
#+NAME: fig:morse_header
#+ATTR_HTML: :align center
[[file:morse_header.png]]
Chameleon is a linear algebra software created jointly by several
research teams as part of the MORSE associate team: [[http://www.icl.utk.edu/][ICL]], [[https://www.inria.fr/en/][Inria]],
[[https://www.kaust.edu.sa/en][KAUST]], [[http://www.ucdenver.edu/pages/ucdwelcomepage.aspx][The University of Colorado Denver]].
*** MORSE Objectives
When processor clock speeds flatlined in 2004, after more than
fifteen years of exponential increases, the era of near automatic
performance improvements that the HPC application community had
previously enjoyed came to an abrupt end. To develop software that
will perform well on petascale and exascale systems with thousands
of nodes and millions of cores, the list of major challenges that
must now be confronted is formidable:
1) dramatic escalation in the costs of intrasystem communication
between processors and/or levels of memory hierarchy;
2) increased heterogeneity of the processing units (mixing CPUs,
GPUs, etc. in varying and unexpected design combinations);
3) high levels of parallelism and more complex constraints mean
that cooperating processes must be dynamically and unpredictably
scheduled for asynchronous execution;
4) software will not run at scale without much better resilience to
faults and far more robustness; and
5) new levels of self-adaptivity will be required to enable
software to modulate process speed in order to satisfy limited
energy budgets.
The MORSE associate team will tackle the first three challenges by
orchestrating work between research groups respectively
specialized in sparse linear algebra, dense linear algebra and
runtime systems. The overall objective is to develop robust linear
algebra libraries relying on innovative runtime systems that can
fully benefit from the potential of those future large-scale
complex machines. Challenges 4) and 5) will also be investigated
by the different teams in the context of other partnerships, but
they will not be the main focus of the associate team as they are
much more prospective.
*** Research fields
The overall goal of the MORSE associate team is to enable advanced
numerical algorithms to be executed on a scalable unified runtime
system for exploiting the full potential of future exascale
machines. We expect advances in three directions based first on
strong and close interactions between the runtime and numerical
linear algebra communities. This initial activity will then
naturally expand to more focused but still joint research in both
fields.
**** Fine interaction between linear algebra and runtime systems
On parallel machines, HPC applications need to take care of data
movement and consistency, which can be either explicitly managed
at the level of the application itself or delegated to a runtime
system. We adopt the latter approach in order to better keep up
with hardware trends whose complexity is growing exponentially.
One major task in this project is to define a proper interface
between HPC applications and runtime systems in order to maximize
productivity and expressivity. As mentioned in the next section,
a widely used approach consists in abstracting the application as
a DAG that the runtime system is in charge of scheduling.
Scheduling such a DAG over a set of heterogeneous processing units
introduces a lot of new challenges, such as predicting accurately
the execution time of each type of task over each kind of unit,
minimizing data transfers between memory banks, performing data
prefetching, etc. Expected advances: In a nutshell, a new runtime
system API will be designed to allow applications to provide
scheduling hints to the runtime system and to get real-time
feedback about the consequences of scheduling decisions.
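To make this task-based interaction concrete, here is a minimal,
hedged sketch (not a Chameleon kernel) using the
[[http://runtime.bordeaux.inria.fr/StarPU/][StarPU]] task-insertion interface: the application only declares
which data each task reads or writes, and the runtime infers the
dependencies between tasks and schedules them asynchronously. The
scaling codelet and the vector data are invented for the example;
the StarPU calls themselves are standard (StarPU 1.2 or later is
assumed).
#+BEGIN_SRC c
#include <stdint.h>
#include <starpu.h>

/* Example kernel: scales a registered vector in place.  The runtime
 * passes the data it manages through `buffers`. */
static void scal_cpu(void *buffers[], void *cl_arg)
{
    float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    float factor;
    starpu_codelet_unpack_args(cl_arg, &factor);
    for (unsigned i = 0; i < n; i++)
        v[i] *= factor;
}

/* The codelet declares how the task accesses its data (read-write here);
 * this is what lets the runtime build the DAG edges by itself. */
static struct starpu_codelet scal_cl = {
    .cpu_funcs = { scal_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    float x[1024];
    float factor = 2.0f;
    starpu_data_handle_t xh;

    for (int i = 0; i < 1024; i++) x[i] = (float)i;

    starpu_init(NULL);
    /* Hand the data over to the runtime (home node 0 is main memory). */
    starpu_vector_data_register(&xh, 0, (uintptr_t)x, 1024, sizeof(float));

    /* Two asynchronous tasks touching the same handle in RW mode: the
     * runtime serializes them (a write-after-write dependency) without
     * any explicit synchronization in the application. */
    starpu_task_insert(&scal_cl, STARPU_RW, xh,
                       STARPU_VALUE, &factor, sizeof(factor), 0);
    starpu_task_insert(&scal_cl, STARPU_RW, xh,
                       STARPU_VALUE, &factor, sizeof(factor), 0);

    starpu_task_wait_for_all();   /* wait for the whole DAG to complete */
    starpu_data_unregister(xh);   /* data is valid again in `x`         */
    starpu_shutdown();
    return 0;
}
#+END_SRC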
**** Runtime systems
A runtime environment is an intermediate layer between the system
and the application. It provides low-level functionality not
provided by the system (such as scheduling or management of
heterogeneity) and high-level features (such as performance
portability). In the framework of this proposal, we will work on
the scalability of runtime environments. To achieve scalability it
is required to avoid all centralization. Here, the main problem
is the scheduling of the tasks. In many task-based runtime
environments the scheduler is centralized and becomes a bottleneck
as soon as too many cores are involved. It is therefore required
to distribute the scheduling decisions or to compute a data
distribution that imposes the mapping of tasks using, for instance,
the so-called ``owner-compute'' rule. Expected advances: We will
design runtime systems that enable an efficient and scalable use
of thousands of distributed multicore nodes enhanced with
accelerators.
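As an illustration of the ``owner-compute'' rule, the following
self-contained sketch (hypothetical code, not the Chameleon or
StarPU implementation) shows how every process of a P x Q grid can
compute locally, from a 2D block-cyclic distribution, which process
owns a given tile, and therefore which tasks it must execute,
without consulting any centralized scheduler.
#+BEGIN_SRC c
#include <stdio.h>

/* Hypothetical sketch of the ``owner-compute'' rule over a 2D
 * block-cyclic distribution: every process can compute locally which
 * process owns tile (m, n), and therefore which tasks it has to
 * execute, without a centralized scheduling decision. */
static int tile_owner(int m, int n, int P, int Q)
{
    /* The process grid is P x Q; tiles are dealt out cyclically in both
     * dimensions, as in ScaLAPACK's block-cyclic distribution. */
    return (m % P) * Q + (n % Q);
}

int main(void)
{
    const int MT = 4, NT = 4;   /* tile grid of the matrix */
    const int P  = 2, Q  = 2;   /* 2 x 2 process grid      */

    /* Print which process owns each tile of the matrix. */
    for (int m = 0; m < MT; m++) {
        for (int n = 0; n < NT; n++)
            printf(" %d", tile_owner(m, n, P, Q));
        printf("\n");
    }
    return 0;
}
#+END_SRC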
**** Linear algebra
Because of its central position in HPC and of the well understood
structure of its algorithms, dense linear algebra has often
pioneered new challenges that HPC had to face. Again, dense
linear algebra has been in the vanguard of the new era of
petascale computing with the design of new algorithms that can
efficiently run on a multicore node with GPU accelerators. These
algorithms are called ``communication-avoiding'' since they have
been redesigned to limit the amount of communication between
processing units (and between the different levels of memory
hierarchy). They are expressed through Directed Acyclic Graphs
(DAG) of fine-grained tasks that are dynamically
scheduled. Expected advances: First, we plan to investigate the
impact of these principles in the case of sparse applications
(whose algorithms are slightly more complicated but often rely on
dense kernels). Furthermore, both in the dense and sparse cases,
the scalability on thousands of nodes is still limited; new
numerical approaches need to be found. We will specifically
design sparse hybrid direct/iterative methods that represent a
promising approach.
*** Research papers
Research papers about MORSE can be found [[http://icl.cs.utk.edu/projectsdev/morse/pubs/index.html][here]].
** Chameleon
*** Chameleon software
The main purpose is to address the performance shortcomings of the
[[http://www.netlib.org/lapack/][LAPACK]] and [[http://www.netlib.org/scalapack/][ScaLAPACK]] libraries on multicore processors and
multi-socket systems of multicore processors and their inability to
efficiently utilize accelerators such as Graphics Processing Units
(GPUs).
Chameleon is a framework written in C which provides routines to
solve dense general systems of linear equations, symmetric positive
definite systems of linear equations and linear least squares
problems, using LU, Cholesky, QR and LQ factorizations. Real
arithmetic and complex arithmetic are supported in both single
precision and double precision. It supports Linux and Mac OS X
machines (only tested on the Intel x86-64 architecture).
Chameleon is based on the [[http://icl.cs.utk.edu/plasma/][PLASMA]] source code but is not limited to
shared-memory environments and can exploit multiple GPUs. Chameleon
is interfaced in a generic way with both the [[http://icl.cs.utk.edu/quark/][QUARK]] and [[http://runtime.bordeaux.inria.fr/StarPU/][StarPU]] runtime
systems. This feature makes it possible to analyze, in a unified
framework, how sequential task-based algorithms behave with
different runtime system implementations. Using Chameleon with the
[[http://runtime.bordeaux.inria.fr/StarPU/][StarPU]] runtime system makes it possible to exploit GPUs through
kernels provided by [[https://developer.nvidia.com/cublas][cuBLAS]], as well as clusters of interconnected
nodes with distributed memory (using [[http://www.open-mpi.org/][MPI]]). Computation of very large
systems with dense matrices on a cluster of nodes is still being
experimented with and stabilized; stable performance should not be
expected from the current version when using MPI.
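For illustration, a call to one of these routines through
Chameleon's LAPACK-like interface might look like the sketch below.
The routine and constant names (MORSE_Init, MORSE_dpotrf,
MorseLower, MORSE_Finalize) and the header name are assumptions
based on the PLASMA-like naming convention of the MORSE interface;
they should be checked against the installed version of Chameleon.
#+BEGIN_SRC c
#include <stdlib.h>
#include "morse.h"   /* assumed header name; check your Chameleon installation */

/* Hedged sketch of a LAPACK-like call into Chameleon: factorize a dense
 * symmetric positive definite matrix with Cholesky.  The MORSE_* names
 * are assumptions and may differ between Chameleon versions. */
int main(void)
{
    int N = 1000, NCPU = 4, NGPU = 0;
    double *A = malloc((size_t)N * N * sizeof(double));

    /* ... fill A with a symmetric positive definite matrix ... */

    MORSE_Init(NCPU, NGPU);                        /* start the runtime workers */
    int info = MORSE_dpotrf(MorseLower, N, A, N);  /* A = L * L^T, lower storage */
    MORSE_Finalize();

    free(A);
    return info;
}
#+END_SRC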
*** PLASMA's design principles
Chameleon was originally based on [[http://icl.cs.utk.edu/plasma/][PLASMA]], so its design principles
are very similar. The content of this section has been copied from
the /Design principles/ section of the PLASMA User's Guide.
**** Tile Algorithms
Tile algorithms are based on the idea of processing the matrix by
square tiles of relatively small size, such that a tile fits
entirely in one of the cache levels associated with one core.
This way a tile can be loaded to the cache and processed
completely before being evicted back to the main memory. Of the
three types of cache misses, *compulsory*, *capacity* and *conflict*,
the use of tile algorithms minimizes the number of capacity
misses, since each operation loads an amount of data that does not
``overflow'' the cache.
For some operations such as matrix multiplication and Cholesky
factorization, translating the classic algorithm to the tile
algorithm is trivial. In the case of matrix multiplication, the
tile algorithm simply results from applying the technique of
*loop tiling* to the canonical definition of three nested loops. It
is very similar for the Cholesky factorization. The *left-looking*
definition of Cholesky factorization from LAPACK is a loop with a
sequence of calls to four routines: xSYRK (symmetric *rank-k*
update), xPOTRF (Cholesky factorization of a small block on the
diagonal), xGEMM (matrix multiplication) and xTRSM (triangular
solve). If the xSYRK, xGEMM and xTRSM operations are expressed
with the canonical definition of three nested loops and the
technique of loop tiling is applied, the tile algorithm results.
Since the algorithm is produced by simple reordering of
operations, neither the number of operations nor numerical
stability of the algorithm are affected.
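To make the reordering concrete, here is a minimal sequential sketch
of the tile Cholesky factorization (lower triangular case) written
with standard CBLAS/LAPACKE kernels. It is illustrative only, not
the PLASMA or Chameleon code: the flat tile layout, the TILE macro
and the function name are assumptions. Each kernel call on an
NB x NB tile corresponds to one fine-grained task of the DAGs
discussed in this guide.
#+BEGIN_SRC c
#include <cblas.h>
#include <lapacke.h>

/* Illustrative layout helper (assumed, not the PLASMA/Chameleon code):
 * A is an NT x NT grid of NB x NB tiles; tiles follow one another in
 * column-major order of tiles, and each tile is itself column-major. */
#define TILE(A, m, n, NT, NB) ((A) + ((size_t)(n) * (NT) + (m)) * (NB) * (NB))

/* Sequential tile Cholesky (lower triangular, right-looking): the work of
 * the classic algorithm reordered into NB x NB tile kernels. */
int tile_dpotrf_lower(double *A, int NT, int NB)
{
    for (int k = 0; k < NT; k++) {
        /* Factorize the diagonal tile: A[k][k] = L[k][k] * L[k][k]^T */
        int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', NB,
                                  TILE(A, k, k, NT, NB), NB);
        if (info != 0) return info;

        /* Panel update: A[m][k] = A[m][k] * L[k][k]^{-T} */
        for (int m = k + 1; m < NT; m++)
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,
                        CblasTrans, CblasNonUnit, NB, NB, 1.0,
                        TILE(A, k, k, NT, NB), NB,
                        TILE(A, m, k, NT, NB), NB);

        /* Trailing submatrix update */
        for (int n = k + 1; n < NT; n++) {
            /* A[n][n] -= A[n][k] * A[n][k]^T */
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        NB, NB, -1.0, TILE(A, n, k, NT, NB), NB,
                        1.0, TILE(A, n, n, NT, NB), NB);
            /* A[m][n] -= A[m][k] * A[n][k]^T for tiles below the diagonal */
            for (int m = n + 1; m < NT; m++)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            NB, NB, NB, -1.0,
                            TILE(A, m, k, NT, NB), NB,
                            TILE(A, n, k, NT, NB), NB,
                            1.0, TILE(A, m, n, NT, NB), NB);
        }
    }
    return 0;
}
#+END_SRC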
The situation becomes slightly more complicated for LU and QR
factorizations, where the classic algorithms factorize an entire
panel of the matrix (a block of columns) at every step of the
algorithm. One can observe, however, that the process of matrix
factorization is synonymous with introducing zeros in appropriate
places, and a tile algorithm can be thought of as one that zeroes
one tile of the matrix at a time. This process is referred to as
updating of a factorization or *incremental factorization*. The
process is equivalent to factorizing the top tile of a panel, then
placing the upper triangle of the result on top of the tile below
and factorizing again, then moving to the next tile and so on.
Here, the tile LU and QR algorithms perform slightly more floating
point operations and require slightly more memory for auxiliary
data. Also, the tile LU factorization applies a different
pivoting pattern and, as a result, is less numerically stable than
classic LU with partial pivoting. Numerical stability is not an
issue in case of the tile QR, which relies on orthogonal
transformations (Householder reflections), which are numerically
stable.
#+CAPTION: Schematic illustration of the tile LU factorization (kernel names for real arithmetic in double precision), courtesy of the [[http://icl.cs.utk.edu/plasma/][PLASMA]] team.
#+NAME: fig:tile_lu
#+ATTR_HTML: :width 640px :align center
[[file:tile_lu.jpg]]
**** Tile Data Layout
Tile layout is based on the idea of storing the matrix by square
tiles of relatively small size, such that each tile occupies a
contiguous memory region. This way a tile can be loaded into the
cache memory efficiently and the risk of evicting it from the
cache memory before it is completely processed is minimized. Of
the three types of cache misses, *compulsory*, *capacity* and
*conflict*, the use of tile layout minimizes the number of conflict
misses, since a contiguous region of memory will completely fill
out a /set-associative/ cache memory before an eviction can
happen. Also, from the standpoint of multithreaded execution, the
probability of *false sharing* is minimized. It can only
affect the cache lines containing the beginning and the ending of
a tile.
In a standard *cache-based* architecture, tiles contiguously laid out
in memory maximize the profit from automatic prefetching. Tile
layout is also beneficial in situations involving the use of
accelerators, where explicit communication of tiles through DMA
transfers is required, such as moving tiles between the system
memory and the local store of the Cell B.E., or moving tiles between
the host memory and the device memory in GPUs. In most
circumstances tile layout also minimizes the number of TLB misses
and conflicts to memory banks or partitions. With the standard
(*column-major*) layout, access to each column of a tile is much
more likely to cause a conflict miss, a false sharing miss, a TLB
miss or a bank or partition conflict. The use of the standard
layout for dense matrix operations is a performance minefield.
Although occasionally one can pass through it unscathed, the risk
of hitting a spot deadly to performance is very high.
Another property of the layout utilized in PLASMA is that it is
``flat'', meaning that it does not involve a level of
indirection. Each tile stores a small square submatrix of the main
matrix in a *column-major* layout. In turn, the main matrix is an
arrangement of tiles immediately following one another in a
*column-major* layout. The offset of each tile can be calculated
through address arithmetic and does not involve pointer
indirection. Alternatively, a matrix could be represented as an
array of pointers to tiles, located anywhere in memory. Such a
layout would be a radical and unjustifiable departure from LAPACK
and ScaLAPACK. Flat tile layout is a natural progression from
LAPACK's *column-major* layout and ScaLAPACK's
/block-cyclic/ layout.
Another related property of PLASMA's tile layout is that it
includes provisions for padding of tiles, i.e., the actual region
of memory designated for a tile can be larger than the memory
occupied by the actual data. This makes it possible to force a certain
alignment of tile boundaries, while using the flat organization
described in the previous paragraph. The motivation is that, at
the price of small memory overhead, alignment of tile boundaries
may prove beneficial in multiple scenarios involving memory
systems of standard multicore processors, as well as accelerators.
The issues that come into play are, again, the use of TLBs and
memory banks or partitions.
#+CAPTION: Schematic illustration of the tile layout with *column-major* order of tiles, *column-major* order of elements within tiles and (optional) padding for enforcing a certain alignment of tile boundaries, courtesy of the [[http://icl.cs.utk.edu/plasma/][PLASMA]] team.
#+NAME: fig:tile_layout
#+ATTR_HTML: :width 640px :align center
[[file:tile_layout.jpg]]
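The following small sketch (hypothetical helper names, not the
PLASMA internals) illustrates the address arithmetic implied by this
flat tile layout: tiles are stored one after another in
*column-major* order of tiles, elements inside a tile are
*column-major*, and an intra-tile leading dimension larger than the
tile size models the optional padding.
#+BEGIN_SRC c
#include <stdio.h>
#include <stddef.h>

/* Tiles of size NB x NB are stored one after another in column-major
 * order of tiles; inside a tile, elements are column-major with a leading
 * dimension LDT >= NB so that padding can enforce aligned tile boundaries. */
static double *tile_addr(double *A, int mt, int nt, int MT, int LDT, int NB)
{
    /* Offset of tile (mt, nt) in a tile grid with MT tile rows. */
    size_t tile_size = (size_t)LDT * NB;   /* elements per (padded) tile */
    return A + ((size_t)nt * MT + mt) * tile_size;
}

static double *elem_addr(double *A, int i, int j, int MT, int LDT, int NB)
{
    /* Element (i, j) of the global matrix: locate its tile, then the
     * element inside the tile, without any pointer indirection. */
    double *tile = tile_addr(A, i / NB, j / NB, MT, LDT, NB);
    return tile + (size_t)(j % NB) * LDT + (i % NB);
}

int main(void)
{
    enum { MT = 4, NT = 4, NB = 3, LDT = 4 };   /* LDT > NB models padding */
    double A[MT * NT * LDT * NB];               /* flat storage of all tiles */

    /* Element (7, 5) lives in tile (2, 1) at local coordinates (1, 2);
     * its address follows from pure arithmetic. */
    printf("offset = %td\n", elem_addr(A, 7, 5, MT, LDT, NB) - A);
    return 0;
}
#+END_SRC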
**** Dynamic Task Scheduling
Dynamic scheduling is the idea of assigning work to cores based on
the availability of data for processing at any given point in time
and is also referred to as *data-driven* scheduling. The concept is
related closely to the idea of expressing computation through a
task graph, often referred to as the DAG (*Directed Acyclic Graph*),
and the flexibility of exploring the DAG at runtime. Thus, to a
large extent, dynamic scheduling is synonymous with *runtime
scheduling*. An important concept here is the one of the *critical
path*, which defines the upper bound on the achievable parallelism,
and needs to be pursued at the maximum speed. This is in direct
opposition to the *fork-and-join* or *data-parallel* programming
models, where artificial synchronization points expose serial
sections of the code, where multiple cores are idle, while
sequential processing takes place. The use of dynamic scheduling
introduces a *trade-off*, though. The more dynamic (flexible)
scheduling is, the more centralized (and less scalable) the
scheduling mechanism is. For that reason, currently PLASMA uses
two scheduling mechanisms, one which is fully dynamic and one
where work is assigned statically and dependency checks are done
at runtime.
The first scheduling mechanism relies on unfolding a *sliding
window* of the task graph at runtime and scheduling work by
resolving data hazards: *Read After Write (RAW)*, *Write After Read
(WAR)* and *Write After Write (WAW)*, a technique analogous to
instruction scheduling in superscalar processors. It also relies
on *work-stealing* for balancing the load among all the cores.
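As a minimal sketch of how such hazard resolution can work
(illustrative code, not the QUARK or StarPU implementation), the
fragment below records, for every piece of data, its last writer and
the readers seen since that write, and derives the RAW, WAR and WAW
edges when a new task touching that data is submitted.
#+BEGIN_SRC c
/* Minimal sketch of data-hazard resolution (RAW, WAR, WAW), as a
 * data-driven scheduler might perform it while unfolding a window of the
 * task graph.  All names are illustrative, not the PLASMA/QUARK/StarPU code. */
enum access_mode { ACC_READ, ACC_WRITE };

struct task {
    int ndeps;   /* unresolved dependencies; the task is ready at zero */
};

struct data {
    struct task *last_writer;   /* producer of the current value            */
    struct task *readers[64];   /* readers since the last write (a bounded
                                   window, kept small for the example)      */
    int          nreaders;
};

static void add_edge(struct task *from, struct task *to)
{
    /* A real runtime records the edge so that `to` is released when `from`
     * completes; here we only count the unresolved dependencies. */
    (void)from;
    to->ndeps++;
}

/* Called at task submission time, once per piece of data the task touches. */
static void resolve_hazards(struct task *t, struct data *d, enum access_mode mode)
{
    if (mode == ACC_READ) {
        if (d->last_writer)                    /* Read After Write  (RAW) */
            add_edge(d->last_writer, t);
        d->readers[d->nreaders++] = t;
    } else {
        if (d->last_writer)                    /* Write After Write (WAW) */
            add_edge(d->last_writer, t);
        for (int i = 0; i < d->nreaders; i++)  /* Write After Read  (WAR) */
            add_edge(d->readers[i], t);
        d->last_writer = t;
        d->nreaders = 0;
    }
}
#+END_SRC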
The second scheduling mechanism relies on statically designating a
path through the execution space of the algorithm to each core and
following a cycle: transition to a task, wait for its
dependencies, execute it, update the overall progress. Tasks are
identified by tuples and task transitions are done through locally
evaluated formulas. Progress information can be centralized,
replicated or distributed (currently centralized).
#+CAPTION: A trace of the tile QR factorization executing on eight cores without any global synchronization points (kernel names for real arithmetic in single precision), courtesy of the [[http://icl.cs.utk.edu/plasma/][PLASMA]] team.
#+NAME: fig:trace_qr
#+ATTR_HTML: :width 640px :align center
[[file:trace_qr.jpg]]