# This file is part of the Chameleon User's Guide.
# Copyright (C) 2017 Inria
# See the file ../users_guide.org for copying conditions.
** MORSE project
#+NAME: fig:morse_header
#+ATTR_HTML: :align center
[[file:morse_header.png]]
Chameleon is a linear algebra software created jointly by several
research teams as part of the MORSE associate team: [[http://www.icl.utk.edu/][ICL]], [[https://www.inria.fr/en/][Inria]],
[[https://www.kaust.edu.sa/en][KAUST]], [[http://www.ucdenver.edu/pages/ucdwelcomepage.aspx][The University of Colorado Denver]].
*** MORSE Objectives
When processor clock speeds flatlined in 2004, after more than
fifteen years of exponential increases, the era of near automatic
performance improvements that the HPC application community had
previously enjoyed came to an abrupt end. To develop software that
will perform well on petascale and exascale systems with thousands
of nodes and millions of cores, the list of major challenges that
must now be confronted is formidable:
1) dramatic escalation in the costs of intrasystem communication
between processors and/or levels of memory hierarchy;
2) increased heterogeneity of the processing units (mixing CPUs,
GPUs, etc. in varying and unexpected design combinations);
3) high levels of parallelism and more complex constraints mean
that cooperating processes must be dynamically and unpredictably
scheduled for asynchronous execution;
4) software will not run at scale without much better resilience to
faults and far more robustness; and
5) new levels of self-adaptivity will be required to enable
software to modulate process speed in order to satisfy limited
energy budgets.
The MORSE associate team will tackle the first three challenges by
orchestrating work between research groups respectively
specialized in sparse linear algebra, dense linear algebra and
runtime systems. The overall objective is to develop robust linear
algebra libraries relying on innovative runtime systems that can
fully benefit from the potential of those future large-scale
complex machines. Challenges 4) and 5) will also be investigated
by the different teams in the context of other partnerships, but
they will not be the main focus of the associate team as they are
much more prospective.
*** Research fields
The overall goal of the MORSE associate team is to enable advanced
numerical algorithms to be executed on a scalable unified runtime
system for exploiting the full potential of future exascale
machines. We expect advances in three directions based first on
strong and close interactions between the runtime and numerical
linear algebra communities. This initial activity will then
naturally expand to more focused but still joint research in both
fields.
**** Fine interaction between linear algebra and runtime systems
On parallel machines, HPC applications need to take care of data
movement and consistency, which can be either explicitly managed
at the level of the application itself or delegated to a runtime
system. We adopt the latter approach in order to better keep up
with hardware trends whose complexity is growing exponentially.
One major task in this project is to define a proper interface
between HPC applications and runtime systems in order to maximize
productivity and expressivity. As mentioned in the next section,
a widely used approach consists in abstracting the application as
a DAG that the runtime system is in charge of scheduling.
Scheduling such a DAG over a set of heterogeneous processing units
introduces a lot of new challenges, such as predicting accurately
the execution time of each type of task over each kind of unit,
minimizing data transfers between memory banks, performing data
prefetching, etc. Expected advances: In a nutshell, a new runtime
system API will be designed to allow applications to provide
scheduling hints to the runtime system and to get real-time
feedback about the consequences of scheduling decisions.
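To make this task-based interaction concrete, here is a minimal,
hedged sketch (not a Chameleon kernel) using the
[[http://runtime.bordeaux.inria.fr/StarPU/][StarPU]] task-insertion interface: the application only declares
which data each task reads or writes, and the runtime infers the
dependencies between tasks and schedules them asynchronously. The
scaling codelet and the vector data are invented for the example;
the StarPU calls themselves are standard (StarPU 1.2 or later is
assumed).
#+BEGIN_SRC c
#include <stdint.h>
#include <starpu.h>

/* Example kernel: scales a registered vector in place.  The runtime
 * passes the data it manages through `buffers`. */
static void scal_cpu(void *buffers[], void *cl_arg)
{
    float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    float factor;
    starpu_codelet_unpack_args(cl_arg, &factor);
    for (unsigned i = 0; i < n; i++)
        v[i] *= factor;
}

/* The codelet declares how the task accesses its data (read-write here);
 * this is what lets the runtime build the DAG edges by itself. */
static struct starpu_codelet scal_cl = {
    .cpu_funcs = { scal_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    float x[1024];
    float factor = 2.0f;
    starpu_data_handle_t xh;

    for (int i = 0; i < 1024; i++) x[i] = (float)i;

    starpu_init(NULL);
    /* Hand the data over to the runtime (home node 0 is main memory). */
    starpu_vector_data_register(&xh, 0, (uintptr_t)x, 1024, sizeof(float));

    /* Two asynchronous tasks touching the same handle in RW mode: the
     * runtime serializes them (a write-after-write dependency) without
     * any explicit synchronization in the application. */
    starpu_task_insert(&scal_cl, STARPU_RW, xh,
                       STARPU_VALUE, &factor, sizeof(factor), 0);
    starpu_task_insert(&scal_cl, STARPU_RW, xh,
                       STARPU_VALUE, &factor, sizeof(factor), 0);

    starpu_task_wait_for_all();   /* wait for the whole DAG to complete */
    starpu_data_unregister(xh);   /* data is valid again in `x`         */
    starpu_shutdown();
    return 0;
}
#+END_SRC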
**** Runtime systems
A runtime environment is an intermediate layer between the system
and the application. It provides low-level functionality not
provided by the system (such as scheduling or management of
heterogeneity) and high-level features (such as performance
portability). In the framework of this proposal, we will work on
the scalability of runtime environments. To achieve scalability it
is required to avoid all centralization. Here, the main problem
is the scheduling of the tasks. In many task-based runtime
environments the scheduler is centralized and becomes a bottleneck
as soon as too many cores are involved. It is therefore required
to distribute the scheduling decisions or to compute a data
distribution that imposes the mapping of tasks using, for instance,
the so-called ``owner-compute'' rule. Expected advances: We will
design runtime systems that enable an efficient and scalable use
of thousands of distributed multicore nodes enhanced with
accelerators.
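As an illustration of the ``owner-compute'' rule, the following
self-contained sketch (hypothetical code, not the Chameleon or
StarPU implementation) shows how every process of a P x Q grid can
compute locally, from a 2D block-cyclic distribution, which process
owns a given tile, and therefore which tasks it must execute,
without consulting any centralized scheduler.
#+BEGIN_SRC c
#include <stdio.h>

/* Hypothetical sketch of the ``owner-compute'' rule over a 2D
 * block-cyclic distribution: every process can compute locally which
 * process owns tile (m, n), and therefore which tasks it has to
 * execute, without a centralized scheduling decision. */
static int tile_owner(int m, int n, int P, int Q)
{
    /* The process grid is P x Q; tiles are dealt out cyclically in both
     * dimensions, as in ScaLAPACK's block-cyclic distribution. */
    return (m % P) * Q + (n % Q);
}

int main(void)
{
    const int MT = 4, NT = 4;   /* tile grid of the matrix */
    const int P  = 2, Q  = 2;   /* 2 x 2 process grid      */

    /* Print which process owns each tile of the matrix. */
    for (int m = 0; m < MT; m++) {
        for (int n = 0; n < NT; n++)
            printf(" %d", tile_owner(m, n, P, Q));
        printf("\n");
    }
    return 0;
}
#+END_SRC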
**** Linear algebra
Because of its central position in HPC and of the well understood
structure of its algorithms, dense linear algebra has often
pioneered new challenges that HPC had to face. Again, dense
linear algebra has been in the vanguard of the new era of
petascale computing with the design of new algorithms that can
efficiently run on a multicore node with GPU accelerators. These
algorithms are called ``communication-avoiding'' since they have
been redesigned to limit the amount of communication between
processing units (and between the different levels of memory
hierarchy). They are expressed through Directed Acyclic Graphs
(DAG) of fine-grained tasks that are dynamically
scheduled. Expected advances: First, we plan to investigate the
impact of these principles in the case of sparse applications
(whose algorithms are slightly more complicated but often rely on
dense kernels). Furthermore, both in the dense and sparse cases,
the scalability on thousands of nodes is still limited; new
numerical approaches need to be found. We will specifically
design sparse hybrid direct/iterative methods that represent a
promising approach.
*** Research papers
Research papers about MORSE can be found [[http://icl.cs.utk.edu/projectsdev/morse/pubs/index.html][here]].
** Chameleon
*** Chameleon software
The main purpose is to address the performance shortcomings of the
[[http://www.netlib.org/lapack/][LAPACK]] and [[http://www.netlib.org/scalapack/][ScaLAPACK]] libraries on multicore processors and
multi-socket systems of multicore processors and their inability to
efficiently utilize accelerators such as Graphics Processing Units
(GPUs).
Chameleon is a framework written in C which provides routines to
solve dense general systems of linear equations, symmetric positive
definite systems of linear equations and linear least squares
problems, using LU, Cholesky, QR and LQ factorizations. Real
arithmetic and complex arithmetic are supported in both single
precision and double precision. It supports Linux and Mac OS X
machines (only tested on the Intel x86-64 architecture).
Chameleon is based on the [[http://icl.cs.utk.edu/plasma/][PLASMA]] source code but is not limited to
shared-memory environments and can exploit multiple GPUs. Chameleon
is interfaced in a generic way with both the [[http://icl.cs.utk.edu/quark/][QUARK]] and [[http://runtime.bordeaux.inria.fr/StarPU/][StarPU]] runtime
systems. This feature makes it possible to analyze, in a unified
framework, how sequential task-based algorithms behave with
different runtime system implementations. Using Chameleon with the
[[http://runtime.bordeaux.inria.fr/StarPU/][StarPU]] runtime system makes it possible to exploit GPUs through
kernels provided by [[https://developer.nvidia.com/cublas][cuBLAS]], as well as clusters of interconnected
nodes with distributed memory (using [[http://www.open-mpi.org/][MPI]]). Computation of very large
systems with dense matrices on a cluster of nodes is still being
experimented with and stabilized; stable performance should not be
expected from the current version when using MPI.
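For illustration, a call to one of these routines through
Chameleon's LAPACK-like interface might look like the sketch below.
The routine and constant names (MORSE_Init, MORSE_dpotrf,
MorseLower, MORSE_Finalize) and the header name are assumptions
based on the PLASMA-like naming convention of the MORSE interface;
they should be checked against the installed version of Chameleon.
#+BEGIN_SRC c
#include <stdlib.h>
#include "morse.h"   /* assumed header name; check your Chameleon installation */

/* Hedged sketch of a LAPACK-like call into Chameleon: factorize a dense
 * symmetric positive definite matrix with Cholesky.  The MORSE_* names
 * are assumptions and may differ between Chameleon versions. */
int main(void)
{
    int N = 1000, NCPU = 4, NGPU = 0;
    double *A = malloc((size_t)N * N * sizeof(double));

    /* ... fill A with a symmetric positive definite matrix ... */

    MORSE_Init(NCPU, NGPU);                        /* start the runtime workers */
    int info = MORSE_dpotrf(MorseLower, N, A, N);  /* A = L * L^T, lower storage */
    MORSE_Finalize();

    free(A);
    return info;
}
#+END_SRC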
*** PLASMA's design principles
Chameleon was originally based on [[http://icl.cs.utk.edu/plasma/][PLASMA]], so its design principles
are very similar. The content of this section has been copied from
the /Design principles/ section of the PLASMA User's Guide.
**** Tile Algorithms
Tile algorithms are based on the idea of processing the matrix by
square tiles of relatively small size, such that a tile fits
entirely in one of the cache levels associated with one core.
This way a tile can be loaded to the cache and processed
completely before being evicted back to the main memory. Of the
three types of cache misses, *compulsory*, *capacity* and *conflict*,
the use of tile algorithms minimizes the number of capacity
misses, since each operation loads an amount of data that does not
``overflow'' the cache.
For some operations such as matrix multiplication and Cholesky
factorization, translating the classic algorithm to the tile
algorithm is trivial. In the case of matrix multiplication, the
tile algorithm simply results from applying the technique of
*loop tiling* to the canonical definition of three nested loops. It
is very similar for the Cholesky factorization. The *left-looking*
definition of Cholesky factorization from LAPACK is a loop with a
sequence of calls to four routines: xSYRK (symmetric *rank-k*
update), xPOTRF (Cholesky factorization of a small block on the
diagonal), xGEMM (matrix multiplication) and xTRSM (triangular
solve). If the xSYRK, xGEMM and xTRSM operations are expressed
with the canonical definition of three nested loops and the
technique of loop tiling is applied, the tile algorithm results.
Since the algorithm is produced by simple reordering of
operations, neither the number of operations nor numerical
stability of the algorithm are affected.
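To make the reordering concrete, here is a minimal sequential sketch
of the tile Cholesky factorization (lower triangular case) written
with standard CBLAS/LAPACKE kernels. It is illustrative only, not
the PLASMA or Chameleon code: the flat tile layout, the TILE macro
and the function name are assumptions. Each kernel call on an
NB x NB tile corresponds to one fine-grained task of the DAGs
discussed in this guide.
#+BEGIN_SRC c
#include <cblas.h>
#include <lapacke.h>

/* Illustrative layout helper (assumed, not the PLASMA/Chameleon code):
 * A is an NT x NT grid of NB x NB tiles; tiles follow one another in
 * column-major order of tiles, and each tile is itself column-major. */
#define TILE(A, m, n, NT, NB) ((A) + ((size_t)(n) * (NT) + (m)) * (NB) * (NB))

/* Sequential tile Cholesky (lower triangular, right-looking): the work of
 * the classic algorithm reordered into NB x NB tile kernels. */
int tile_dpotrf_lower(double *A, int NT, int NB)
{
    for (int k = 0; k < NT; k++) {
        /* Factorize the diagonal tile: A[k][k] = L[k][k] * L[k][k]^T */
        int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', NB,
                                  TILE(A, k, k, NT, NB), NB);
        if (info != 0) return info;

        /* Panel update: A[m][k] = A[m][k] * L[k][k]^{-T} */
        for (int m = k + 1; m < NT; m++)
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,
                        CblasTrans, CblasNonUnit, NB, NB, 1.0,
                        TILE(A, k, k, NT, NB), NB,
                        TILE(A, m, k, NT, NB), NB);

        /* Trailing submatrix update */
        for (int n = k + 1; n < NT; n++) {
            /* A[n][n] -= A[n][k] * A[n][k]^T */
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        NB, NB, -1.0, TILE(A, n, k, NT, NB), NB,
                        1.0, TILE(A, n, n, NT, NB), NB);
            /* A[m][n] -= A[m][k] * A[n][k]^T for tiles below the diagonal */
            for (int m = n + 1; m < NT; m++)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            NB, NB, NB, -1.0,
                            TILE(A, m, k, NT, NB), NB,
                            TILE(A, n, k, NT, NB), NB,
                            1.0, TILE(A, m, n, NT, NB), NB);
        }
    }
    return 0;
}
#+END_SRC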
The situation becomes slightly more complicated for LU and QR
factorizations, where the classic algorithms factorize an entire
panel of the matrix (a block of columns) at every step of the
algorithm. One can observe, however, that the process of matrix
factorization is synonymous with introducing zeros in appropriate
places, and a tile algorithm can be thought of as one that zeroes
one tile of the matrix at a time. This process is referred to as
updating of a factorization or *incremental factorization*. The
process is equivalent to factorizing the top tile of a panel, then
placing the upper triangle of the result on top of the tile below
and factorizing again, then moving to the next tile and so on.
Here, the tile LU and QR algorithms perform slightly more floating
point operations and require slightly more memory for auxiliary
data. Also, the tile LU factorization applies a different
pivoting pattern and, as a result, is less numerically stable than
classic LU with partial pivoting. Numerical stability is not an
issue in case of the tile QR, which relies on orthogonal
transformations (Householder reflections), which are numerically
stable.
#+CAPTION: Schematic illustration of the tile LU factorization (kernel names for real arithmetic in double precision), courtesy of the [[http://icl.cs.utk.edu/plasma/][PLASMA]] team.
#+NAME: fig:tile_lu
#+ATTR_HTML: :width 640px :align center
[[file:tile_lu.jpg]]
**** Tile Data Layout
Tile layout is based on the idea of storing the matrix by square
tiles of relatively small size, such that each tile occupies a
contiguous memory region. This way a tile can be loaded into the
cache memory efficiently and the risk of evicting it from the
cache memory before it is completely processed is minimized. Of
the three types of cache misses, *compulsory*, *capacity* and
*conflict*, the use of tile layout minimizes the number of conflict
misses, since a contiguous region of memory will completely fill
out a /set-associative/ cache memory before an eviction can
happen. Also, from the standpoint of multithreaded execution, the
probability of *false sharing* is minimized. It can only
affect the cache lines containing the beginning and the ending of
a tile.
In a standard *cache-based* architecture, tiles contiguously laid out
in memory maximize the profit from automatic prefetching. Tile
layout is also beneficial in situations involving the use of
accelerators, where explicit communication of tiles through DMA
transfers is required, such as moving tiles between the system
memory and the local store of the Cell B.E., or moving tiles between
the host memory and the device memory in GPUs. In most
circumstances tile layout also minimizes the number of TLB misses
and conflicts to memory banks or partitions. With the standard
(*column-major*) layout, access to each column of a tile is much
more likely to cause a conflict miss, a false sharing miss, a TLB
miss or a bank or partition conflict. The use of the standard
layout for dense matrix operations is a performance minefield.
Although occasionally one can pass through it unscathed, the risk
of hitting a spot deadly to performance is very high.
Another property of the layout utilized in PLASMA is that it is
``flat'', meaning that it does not involve a level of
indirection. Each tile stores a small square submatrix of the main
matrix in a *column-major* layout. In turn, the main matrix is an
arrangement of tiles immediately following one another in a
*column-major* layout. The offset of each tile can be calculated
through address arithmetic and does not involve pointer
indirection. Alternatively, a matrix could be represented as an
array of pointers to tiles, located anywhere in memory. Such a
layout would be a radical and unjustifiable departure from LAPACK
and ScaLAPACK. Flat tile layout is a natural progression from
LAPACK's *column-major* layout and ScaLAPACK's
/block-cyclic/ layout.
Another related property of PLASMA's tile layout is that it
includes provisions for padding of tiles, i.e., the actual region
of memory designated for a tile can be larger than the memory
occupied by the actual data. This makes it possible to force a certain
alignment of tile boundaries, while using the flat organization
described in the previous paragraph. The motivation is that, at
the price of small memory overhead, alignment of tile boundaries
may prove beneficial in multiple scenarios involving memory
systems of standard multicore processors, as well as accelerators.
The issues that come into play are, again, the use of TLBs and
memory banks or partitions.
#+CAPTION: Schematic illustration of the tile layout with *column-major* order of tiles, *column-major* order of elements within tiles and (optional) padding for enforcing a certain alignment of tile boundaries, courtesy of the [[http://icl.cs.utk.edu/plasma/][PLASMA]] team.
#+NAME: fig:tile_layout
#+ATTR_HTML: :width 640px :align center
[[file:tile_layout.jpg]]
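The following small sketch (hypothetical helper names, not the
PLASMA internals) illustrates the address arithmetic implied by this
flat tile layout: tiles are stored one after another in
*column-major* order of tiles, elements inside a tile are
*column-major*, and an intra-tile leading dimension larger than the
tile size models the optional padding.
#+BEGIN_SRC c
#include <stdio.h>
#include <stddef.h>

/* Tiles of size NB x NB are stored one after another in column-major
 * order of tiles; inside a tile, elements are column-major with a leading
 * dimension LDT >= NB so that padding can enforce aligned tile boundaries. */
static double *tile_addr(double *A, int mt, int nt, int MT, int LDT, int NB)
{
    /* Offset of tile (mt, nt) in a tile grid with MT tile rows. */
    size_t tile_size = (size_t)LDT * NB;   /* elements per (padded) tile */
    return A + ((size_t)nt * MT + mt) * tile_size;
}

static double *elem_addr(double *A, int i, int j, int MT, int LDT, int NB)
{
    /* Element (i, j) of the global matrix: locate its tile, then the
     * element inside the tile, without any pointer indirection. */
    double *tile = tile_addr(A, i / NB, j / NB, MT, LDT, NB);
    return tile + (size_t)(j % NB) * LDT + (i % NB);
}

int main(void)
{
    enum { MT = 4, NT = 4, NB = 3, LDT = 4 };   /* LDT > NB models padding */
    double A[MT * NT * LDT * NB];               /* flat storage of all tiles */

    /* Element (7, 5) lives in tile (2, 1) at local coordinates (1, 2);
     * its address follows from pure arithmetic. */
    printf("offset = %td\n", elem_addr(A, 7, 5, MT, LDT, NB) - A);
    return 0;
}
#+END_SRC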
**** Dynamic Task Scheduling
Dynamic scheduling is the idea of assigning work to cores based on
the availability of data for processing at any given point in time
and is also referred to as *data-driven* scheduling. The concept is
related closely to the idea of expressing computation through a
task graph, often referred to as the DAG (*Directed Acyclic Graph*),
and the flexibility of exploring the DAG at runtime. Thus, to a
large extent, dynamic scheduling is synonymous with *runtime
scheduling*. An important concept here is the one of the *critical
path*, which defines the upper bound on the achievable parallelism,
and needs to be pursued at the maximum speed. This is in direct
opposition to the *fork-and-join* or *data-parallel* programming
models, where artificial synchronization points expose serial
sections of the code, where multiple cores are idle, while
sequential processing takes place. The use of dynamic scheduling
introduces a *trade-off*, though. The more dynamic (flexible)
scheduling is, the more centralized (and less scalable) the
scheduling mechanism is. For that reason, currently PLASMA uses
two scheduling mechanisms, one which is fully dynamic and one
where work is assigned statically and dependency checks are done
at runtime.
The first scheduling mechanism relies on unfolding a *sliding
window* of the task graph at runtime and scheduling work by
resolving data hazards: *Read After Write (RAW)*, *Write After Read
(WAR)* and *Write After Write (WAW)*, a technique analogous to
instruction scheduling in superscalar processors. It also relies
on *work-stealing* for balancing the load among all the cores.
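As a minimal sketch of how such hazard resolution can work
(illustrative code, not the QUARK or StarPU implementation), the
fragment below records, for every piece of data, its last writer and
the readers seen since that write, and derives the RAW, WAR and WAW
edges when a new task touching that data is submitted.
#+BEGIN_SRC c
/* Minimal sketch of data-hazard resolution (RAW, WAR, WAW), as a
 * data-driven scheduler might perform it while unfolding a window of the
 * task graph.  All names are illustrative, not the PLASMA/QUARK/StarPU code. */
enum access_mode { ACC_READ, ACC_WRITE };

struct task {
    int ndeps;   /* unresolved dependencies; the task is ready at zero */
};

struct data {
    struct task *last_writer;   /* producer of the current value            */
    struct task *readers[64];   /* readers since the last write (a bounded
                                   window, kept small for the example)      */
    int          nreaders;
};

static void add_edge(struct task *from, struct task *to)
{
    /* A real runtime records the edge so that `to` is released when `from`
     * completes; here we only count the unresolved dependencies. */
    (void)from;
    to->ndeps++;
}

/* Called at task submission time, once per piece of data the task touches. */
static void resolve_hazards(struct task *t, struct data *d, enum access_mode mode)
{
    if (mode == ACC_READ) {
        if (d->last_writer)                    /* Read After Write  (RAW) */
            add_edge(d->last_writer, t);
        d->readers[d->nreaders++] = t;
    } else {
        if (d->last_writer)                    /* Write After Write (WAW) */
            add_edge(d->last_writer, t);
        for (int i = 0; i < d->nreaders; i++)  /* Write After Read  (WAR) */
            add_edge(d->readers[i], t);
        d->last_writer = t;
        d->nreaders = 0;
    }
}
#+END_SRC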
The second scheduling mechanism relies on statically designating a
path through the execution space of the algorithm to each core and
following a cycle: transition to a task, wait for its
dependencies, execute it, update the overall progress. Tasks are
identified by tuples and task transitions are done through locally
evaluated formulas. Progress information can be centralized,
replicated or distributed (currently centralized).
#+CAPTION: A trace of the tile QR factorization executing on eight cores without any global synchronization points (kernel names for real arithmetic in single precision), courtesy of the [[http://icl.cs.utk.edu/plasma/][PLASMA]] team.
#+NAME: fig:trace_qr
#+ATTR_HTML: :width 640px :align center
[[file:trace_qr.jpg]]