<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
            "http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<HEAD>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<TITLE>StarPU hands-on session</TITLE>
<link rel="stylesheet" type="text/css" href="../../style.css" />
<link rel="Shortcut icon" href="http://www.inria.fr/extension/site_inria/design/site_inria/images/favicon.ico" type="image/x-icon" />
</HEAD>

<body>

<div class="title">
<h1><a href="../../">StarPU</a></h1>
<h2>ComPAS 2018</h2>
<h3>StarPU Tutorial - Toulouse, July 2018</h3>
</div>

<div class="menu">
      <a href="../">Back to the main page</a>
</div>

<div class="section">

<p>
Other materials (talk slides, links) are available at the
<a href="index.html#other">bottom</a> of this page.
</p>
</div>

<div class="section">
<h2>Setup</h2>

<div class="section">
<h3>Connection to the Platform</h3>
<p>
<!--
The lab works are going to be done on
the <a href="https://groupes.renater.fr/wiki/poincare/public/description_de_poincare">MDS</a> platform.
-->
<!--
A subset of machines has been specifically booked for our own usage.
-->
You should have received information on how to connect to the
platform.
</p>

<!--
<P>
To use StarPU on the machines, you need to load the following modules
</p>

<tt>
<pre>
module load cuda
module load openmpi
module load hwloc/1.6.2_gnu47
module load openblas/v0.2.8_gnu48
</pre>
</tt>
-->

<p>
  The following environment variables need to be set to use StarPU.
</p>

<tt>
<pre>
export TP_DIR=/mnt/n7fs/ens/tp_abuttari/TP_StarPU/

export HWLOC_PATH=$TP_DIR/hwloc-1.11.10
export PATH=$HWLOC_PATH/bin:$PATH
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$HWLOC_PATH/lib/pkgconfig
export LD_LIBRARY_PATH=$HWLOC_PATH/lib:$LD_LIBRARY_PATH

export FXT_PATH=$TP_DIR/fxt-0.3.8
export PATH=$FXT_PATH/bin:$PATH
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$FXT_PATH/lib/pkgconfig
export LD_LIBRARY_PATH=$FXT_PATH/lib:$LD_LIBRARY_PATH

export STARPU_PATH=$TP_DIR/starpu
export PATH=$STARPU_PATH/bin:$PATH
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$STARPU_PATH/lib/pkgconfig
export LD_LIBRARY_PATH=$STARPU_PATH/lib:$LD_LIBRARY_PATH
export STARPU_IDLE_FILE=$HOME/starpu_idle_microsec.log

export LIBRARY_PATH=$LD_LIBRARY_PATH

</pre>
</tt>

<p>
  You can either add the previous lines to your
<tt>$HOME/.bash_profile</tt> file, or source the script
<tt>/mnt/n7fs/ens/tp_abuttari/TP_StarPU/tp_vars.sh</tt>.
</p>

</div>

<div class="section">
<h3>Testing the installation</h3>

<p>
To check that everything is properly set up, run the
<tt>starpu_machine_display</tt> tool, which shows the topology (CPUs,
GPUs, memory nodes) that StarPU detects:
</p>

<tt>
<pre>
#!/bin/bash
source /mnt/n7fs/ens/tp_abuttari/TP_StarPU/tp_vars.sh
starpu_machine_display
</pre>
</tt>

<p>
  You will find a copy of the script in <tt>/mnt/n7fs/ens/tp_abuttari/TP_StarPU/starpu_machine_display.sh</tt>.
  To execute the script, simply call:
</p>

<tt>
<pre>
starpu_machine_display.sh
</pre>
</tt>

<p>
Note that the first time <tt>starpu_machine_display</tt> is executed,
it calibrates the performance model of the bus; the results are then
stored in files in the
directory <tt>$HOME/.starpu/sampling/bus</tt>.
</p>

<p>
Of course, on a heterogeneous cluster, the cluster launcher script
should set various hostnames for the different node classes, as
appropriate, since the calibration files are indexed by hostname.
</p>
</div>

<div class="section">
<h3>Tutorial Material</h3>

<p>
All files needed for the lab works are available on the machine in the
directory <tt>/mnt/n7fs/ens/tp_abuttari/TP_StarPU/material</tt>.
</p>

</div>

</div>

<div class="section">
<h2>Session Part 1: Task-based Programming Model</h2>

<div class="section">
<h3>Application Example: Vector Scaling</h3>

<h4>Making it and Running it</h4>

<p>
A typical <a href="files/Makefile"><tt>Makefile</tt></a> for
applications using StarPU is the following:
</p>

<tt>
<pre>
CFLAGS += $(shell pkg-config --cflags starpu-1.3)
LDFLAGS += $(shell pkg-config --libs starpu-1.3)
%.o: %.cu
	nvcc $(CFLAGS) $< -c -o $@

vector_scal_task_insert: vector_scal_task_insert.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o
</pre>
</tt>

<p>
Here are the source files for the application:
<ul>
<li><a href="files/vector_scal_task_insert.c">The main application</a></li>
<li><a href="files/vector_scal_cpu.c">The CPU implementation of the codelet</a></li>
<li><a href="files/vector_scal_cuda.cu">The CUDA implementation of the codelet</a></li>
<li><a href="files/vector_scal_opencl.c">The OpenCL host implementation of the codelet</a></li>
<li><a href="files/vector_scal_opencl_kernel.cl">The OpenCL device implementation of the codelet</a></li>
</ul>

Run <tt>make</tt>, and run the
resulting <tt>vector_scal_task_insert</tt> executable using
the <a href="files/vector_scal.sh">given script
<tt>vector_scal.sh</tt></a>. It should work: it simply scales a given
vector by a given factor.
</p>

<tt>
<pre>
source /mnt/n7fs/ens/tp_abuttari/TP_StarPU/tp_vars.sh

make vector_scal_task_insert

./vector_scal_task_insert
</pre>
</tt>

<h4>Computation Kernels</h4>
<p>
Examine the source code, starting from <tt>vector_scal_cpu.c</tt>: this is
the actual computation code, which is wrapped into a <tt>vector_scal_cpu</tt>
function which takes a series of DSM interfaces and a non-DSM parameter. The
code simply gets the factor value from the non-DSM parameter,
an actual pointer from the first DSM interface,
and performs the vector scaling.
</p>
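
<p>
As a reference, the CPU codelet typically looks like the following
sketch (close to, but not necessarily identical to, the actual
<tt>vector_scal_cpu.c</tt>):
</p>

<tt>
<pre>
/* buffers[] holds the DSM interfaces, cl_arg the packed non-DSM
   parameter (here, the scaling factor). */
void vector_scal_cpu(void *buffers[], void *cl_arg)
{
    struct starpu_vector_interface *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
    float factor;
    unsigned i;

    /* Retrieve the non-DSM parameter. */
    starpu_codelet_unpack_args(cl_arg, &factor);

    /* Perform the actual scaling. */
    for (i = 0; i < n; i++)
        val[i] *= factor;
}
</pre>
</tt>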

<p>
The GPU implementation, in <tt>vector_scal_cuda.cu</tt>, is basically
the same, with the host part (<tt>vector_scal_cuda</tt>) which extracts the
actual CUDA pointer from the DSM interface, and passes it to the device part
(<tt>vector_mult_cuda</tt>) which performs the actual computation.
</p>

<p>
The OpenCL implementation in <tt>vector_scal_opencl.c</tt> and
<tt>vector_scal_opencl_kernel.cl</tt> is more involved due to the low-level
nature of the OpenCL standard, but the principle remains the same.
</p>

<p>
Modify the source code of the different implementations (CPU, CUDA and
OpenCL) and see which one gets executed. You can force the execution
of one of the implementations simply by disabling a type of device when
running your application, e.g.:
</p>

<tt>
<pre>
# to force execution on a GPU device (CUDA is enabled by default)
STARPU_NCPUS=0 vector_scal_task_insert

# to force execution on an OpenCL device
STARPU_NCPUS=0 STARPU_NCUDA=0 vector_scal_task_insert
</pre>
</tt>

<p>
You can set the environment variable <tt>STARPU_WORKER_STATS</tt> to 1 when
running your application to see the number of tasks executed by each
device. You can see the whole list of environment
variables <a href="http://starpu.gforge.inria.fr/doc/html/ExecutionConfigurationThroughEnvironmentVariables.html">here</a>.
</p>

<tt>
<pre>
STARPU_WORKER_STATS=1 vector_scal_task_insert
</pre>
</tt>

<h4>Main Code</h4>
<p>
Now examine <tt>vector_scal_task_insert.c</tt>: the <tt>cl</tt>
(codelet) structure simply gathers pointers on the functions
mentioned above.
</p>

<p>
The <tt>main</tt> function
<ul>
<li>Allocates a <tt>vector</tt> application buffer and fills it.</li>
<li>Registers it to StarPU, and gets back a DSM handle. From now on, the
application is not supposed to access <tt>vector</tt> directly, since its
content may be copied and modified by a task on a GPU, the main-memory copy then
being outdated.</li>
<li>Submits an (asynchronous) task to StarPU.</li>
<li>Waits for task completion.</li>
<li>Unregisters the vector from StarPU, which brings back the modified version
to main memory.</li>
</ul>
</p>
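
<p>
A condensed sketch of these steps, assuming the codelet structure is
named <tt>cl</tt> and <tt>NX</tt> is the vector size (error checking
omitted):
</p>

<tt>
<pre>
float factor = 3.14;
float *vector = malloc(NX * sizeof(*vector));
starpu_data_handle_t handle;

starpu_init(NULL);
/* ... fill vector ... */

/* Register the buffer; from now on, access it only through the handle. */
starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                            (uintptr_t)vector, NX, sizeof(vector[0]));

/* Submit an asynchronous task; the factor is passed by value. */
starpu_task_insert(&cl,
                   STARPU_VALUE, &factor, sizeof(factor),
                   STARPU_RW, handle,
                   0);

starpu_task_wait_for_all();

/* Bring the up-to-date content back to main memory. */
starpu_data_unregister(handle);
starpu_shutdown();
</pre>
</tt>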

</div>

<div class="section">
<h3>Data Partitioning</h3>

<p>
In the previous section, we submitted only one task. Here we discuss how to
<i>partition</i> data so as to submit multiple tasks which can be executed in
parallel by the various CPUs and GPUs.
</p>

<p>
Let's examine <a href="files/mult.c">mult.c</a>.

<ul>
<li>
The computation kernel, <tt>cpu_mult</tt>, is a trivial matrix multiplication
kernel, which operates on 3 given DSM interfaces. These will actually not be
whole matrices, but only small parts of matrices.
</li>
<li>
<tt>init_problem_data</tt> initializes the whole A, B and C matrices.
</li>
<li>
<tt>partition_mult_data</tt> does the actual registration and partitioning.
Matrices are first registered completely, then two partitioning filters are
declared. The first one, <tt>vert</tt>, is used to split B and C vertically. The
second one, <tt>horiz</tt>, is used to split A and C horizontally. We thus end
up with a grid of pieces of C to be computed from stripes of A and B (see the
sketch after this list).
</li>
<li>
<tt>launch_tasks</tt> submits the actual tasks: for each piece of C, take
the appropriate pieces of A and B to produce that piece of C.
</li>
<li>
The access mode is interesting: A and B just need to be read from, and C
will only be written to. This means that StarPU will copy the pieces
of A and B across the machine to where tasks need them, and will give the
tasks uninitialized buffers for the pieces of C, since they will not be read
from.
</li>
<li>
The <tt>main</tt> code initializes StarPU and data, launches tasks, unpartitions data,
and unregisters it. Unpartitioning is an interesting step: until then the pieces
of C are residing on the various GPUs where they have been computed.
Unpartitioning will collect all the pieces of C into the main memory to form the
whole C result matrix.
</li>
</ul>
</p>
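
<p>
As announced above, here is a sketch of the partitioning step, with
names following <tt>mult.c</tt> (the grid
dimensions <tt>nslicesx</tt>/<tt>nslicesy</tt> are assumptions):
</p>

<tt>
<pre>
struct starpu_data_filter vert =
{
    .filter_func = starpu_matrix_filter_vertical_block,
    .nchildren = nslicesx
};
struct starpu_data_filter horiz =
{
    .filter_func = starpu_matrix_filter_block,
    .nchildren = nslicesy
};

starpu_data_partition(B_handle, &vert);              /* B: vertical stripes   */
starpu_data_partition(A_handle, &horiz);             /* A: horizontal stripes */
starpu_data_map_filters(C_handle, 2, &vert, &horiz); /* C: 2D grid of pieces  */

/* Tasks then operate on sub-handles, e.g. the (i,j) piece of C: */
starpu_data_handle_t c_ij = starpu_data_get_sub_data(C_handle, 2, i, j);
</pre>
</tt>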

<p>
Run the application with the script <a href="files/mult.sh">mult.sh</a>, enabling some statistics:
</p>

<tt>
<pre>
make mult
STARPU_WORKER_STATS=1 ./mult.sh
</pre>
</tt>

<p>
The statistics show how the computation was distributed over the various
processing units.
</p>
</div>

<div class="section">
<h3>Other example</h3>

<p>
<a href="files/gemm/xgemm.c"><tt>gemm/xgemm.c</tt></a> is a very similar
matrix-matrix product example, but which makes use of BLAS kernels for
much better performance. The <tt>mult_kernel_common</tt> function
shows how we call <tt>DGEMM</tt> (CPUs) or <tt>cublasDgemm</tt> (GPUs)
on the DSM interface.
</p>

<p>
Let's execute it.
</p>

<tt>
<pre>
#!/bin/bash
source /mnt/n7fs/ens/tp_abuttari/TP_StarPU/tp_vars.sh

make gemm/sgemm
STARPU_WORKER_STATS=1 ./gemm/sgemm
</pre>
</tt>

<!--
<p>
We can notice that StarPU gave much more tasks to the GPU. You can also try
to set <tt>num_gpu=2</tt> to run on the machine which has two GPUs (there is
only one of them, so you may have to wait a long time, so submit this in
background in a separate terminal), the interesting thing here is that
with <b>no</b> application modification beyond making it use a task-based
programming model, we get multi-GPU support for free!
</p>
-->

</div>

<!--
<div class="section">
<h3>More Advanced Examples</h3>
<p>
<tt>examples/lu/xlu_implicit.c</tt> is a more involved example: this is a simple
LU decomposition algorithm. The <tt>dw_codelet_facto_v3</tt> is actually the
main algorithm loop, in a very readable, sequential-looking way. It simply
submits all the tasks asynchronously, and waits for them all.
</p>

<p>
<tt>examples/cholesky/cholesky_implicit.c</tt> is a similar example, but which makes use
of the <tt>starpu_insert_task</tt> helper. The <tt>_cholesky</tt> function looks
very much like <tt>dw_codelet_facto_v3</tt> of the previous paragraph, and all
task submission details are handled by <tt>starpu_insert_task</tt>.
</p>

<p>
Thanks to being already using a task-based programming model, MAGMA and PLASMA
have been easily ported to StarPU by simply using <tt>starpu_insert_task</tt>.
</p>
</div>
-->

<div class="section">
<h3>Exercise</h3>
<p>
Take the vector example again, and add partitioning support to it, using the
matrix-matrix multiplication as an example. Here we will use the
<a href="http://starpu.gforge.inria.fr/doc/html/group__API__Data__Partition.html#ga212189d3b83dfa4e225609b5f2bf8461"><tt>starpu_vector_filter_block()</tt></a> filter function. You can see the list of
predefined filters provided by
StarPU <a href="http://starpu.gforge.inria.fr/doc/html/starpu__data__filters_8h.html">here</a>.
Try to run it with various numbers of tasks.
</p>
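
<p>
A possible starting point, assuming the vector has been registered
as <tt>vector_handle</tt> and <tt>NB_TASKS</tt> is the number of
pieces:
</p>

<tt>
<pre>
struct starpu_data_filter f =
{
    .filter_func = starpu_vector_filter_block,
    .nchildren = NB_TASKS
};
unsigned i;

starpu_data_partition(vector_handle, &f);

for (i = 0; i < NB_TASKS; i++)
{
    /* Submit one task per piece of the vector. */
    starpu_data_handle_t sub = starpu_data_get_sub_data(vector_handle, 1, i);
    starpu_task_insert(&cl,
                       STARPU_VALUE, &factor, sizeof(factor),
                       STARPU_RW, sub,
                       0);
}

starpu_task_wait_for_all();

/* Gather the pieces back into the original buffer. */
starpu_data_unpartition(vector_handle, STARPU_MAIN_RAM);
</pre>
</tt>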
</div>
</div>

<div class="section">
<h2>Session Part 2: Optimizations</h2>


<p>
This part is based on the
<a href="http://starpu.gforge.inria.fr/doc/html/HowToOptimizePerformanceWithStarPU.html">optimization
  chapter</a> of StarPU's documentation.
</p>

<div class="section">
<h3>Data Management</h3>

<p>
We have explained how StarPU can overlap computation and data transfers
thanks to DMAs. This is however only possible when CUDA has control over the
application buffers. The application should thus use <a href="http://starpu.gforge.inria.fr/doc/html/group__API__Standard__Memory__Library.html#ga49603eaea3b05e8ced9ba1bd873070c3"><tt>starpu_malloc()</tt></a>
when allocating its buffer, to permit asynchronous DMAs from and to
it.
</p>

<p>
Take the vector example again, and fix the allocation, to make it use
<a href="http://starpu.gforge.inria.fr/doc/html/group__API__Standard__Memory__Library.html#ga49603eaea3b05e8ced9ba1bd873070c3"><tt>starpu_malloc()</tt></a>.
</p>
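
<p>
A minimal sketch of the change, assuming the buffer is
called <tt>vector</tt> and holds <tt>NX</tt> floats:
</p>

<tt>
<pre>
float *vector;

/* Allocate pinned memory, so that CUDA can perform asynchronous DMAs
   from and to it. */
starpu_malloc((void **)&vector, NX * sizeof(vector[0]));

/* ... register, compute, unregister ... */

starpu_free(vector);
</pre>
</tt>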

</div>

<div class="section">
<h3>Task Submission</h3>

<p>
To let StarPU reorder tasks, submit data transfers in advance, etc., task
submission should be asynchronous whenever possible. Ideally, the application
should proceed as follows: submit the
whole graph of tasks, then wait for termination.
</p>

</div>

<div class="section">
<h3>Task Scheduling Policy</h3>
<p>
By default, StarPU uses the <tt>eager</tt> simple greedy scheduler. This is
because it provides correct load balance even if the application codelets do not
have performance models: it uses a single central queue, from which workers draw
tasks to work on. This however does not allow prefetching data, since the
scheduling decision is taken late.
</p>

<p>
If the application codelets have performance models, the scheduler should be
changed to take advantage of them. StarPU will then take scheduling
decisions in advance according to the performance models, and issue data prefetch
requests, to overlap data transfers and computations.
</p>

<p>
For instance, compare the <tt>eager</tt> (default) and <tt>dmda</tt> scheduling
policies:
</p>

<tt>
<pre>
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 gemm/sgemm -x 1024 -y 1024 -z 1024
</pre>
</tt>

<p>
with:
</p>

<tt>
<pre>
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=dmda gemm/sgemm -x 1024 -y 1024 -z 1024
</pre>
</tt>

<p>
You should see that most (if not all) of the computation has been done on
GPUs, leading to better performance.
</p>

<p>
Try other schedulers; use <tt>STARPU_SCHED=help</tt> to get the
list.
</p>

<p>
Also try with various sizes and draw curves.
</p>

<p>
You can also try the double-precision version, <tt>dgemm</tt>, and notice that
the GPUs achieve less impressive performance there.
</p>

</div>


<div class="section">
<h3>Performance Model Calibration</h3>

<p>
Performance prediction is essential for proper scheduling decisions; the
performance models thus have to be calibrated.  This is done automatically by
StarPU when a codelet is executed for the first time.  Once this is done, the
result is saved to a file in <tt>$STARPU_HOME</tt> for later re-use.  The
<tt>starpu_perfmodel_display</tt> tool can be used to check the resulting
performance model.
</p>
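
<p>
For reference, a codelet opts into this mechanism by declaring a
performance model; a minimal sketch using a history-based model (the
symbol name is illustrative):
</p>

<tt>
<pre>
static struct starpu_perfmodel vector_scal_model =
{
    .type = STARPU_HISTORY_BASED,
    .symbol = "vector_scal"   /* identifies the calibration file */
};

static struct starpu_codelet cl =
{
    .cpu_funcs = { vector_scal_cpu },
    .nbuffers = 1,
    .modes = { STARPU_RW },
    .model = &vector_scal_model
};
</pre>
</tt>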

<tt>
<pre>
$ starpu_perfmodel_display -l
file: &lt;starpu_sgemm_gemm.mirage&gt;
$ starpu_perfmodel_display -s starpu_sgemm_gemm
performance model for cpu_impl_0
# hash		size		flops		mean (us)	stddev (us)		n
8bd4e11d	2359296        	0.000000e+00   	1.848856e+04   	4.026761e+03   	12
performance model for cuda_0_impl_0
# hash		size		flops		mean (us)	stddev (us)		n
8bd4e11d	2359296        	0.000000e+00   	4.918095e+02   	9.404866e+00   	66
...
</pre>
</tt>

<p>
This shows that for the sgemm kernel with a roughly 2.4MB matrix slice, the
average execution time on CPUs was about 18ms, with a 4ms standard deviation,
over 12 samples, while it took about 0.49ms on GPUs, with a 0.009ms standard
deviation. It is a good idea to check this before doing actual performance
measurements. If the kernel has varying performance, it may be a good idea to
force StarPU to continue calibrating the performance model, by using <tt>export
STARPU_CALIBRATE=1</tt>.
</p>

<p>
If the code of a computation kernel is modified, its performance changes, and the
performance model thus has to be recalibrated from scratch. To do so, use
<tt>export STARPU_CALIBRATE=2</tt>.
</p>

<p>
The performance model can also be drawn by using <tt>starpu_perfmodel_plot</tt>,
which will emit a gnuplot file in the current directory.
</p>

</div>
</div>

<div class="section">
<h2>Session Part 3: MPI Support</h2>

<p>
StarPU provides support for MPI communications. It does so in two ways. Either the
application specifies MPI transfers by hand, or it lets StarPU infer them from
data dependencies.
</p>

<div class="section">
<h3>Manual MPI transfers</h3>

<p>Basically, StarPU provides
equivalents of <tt>MPI_*</tt> functions, but which operate on DSM handles
instead of <tt>void*</tt> buffers. The difference is that the source data may be
residing on a GPU where it just got computed. StarPU will automatically handle
copying it back to main memory before submitting it to MPI.
</p>
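
<p>
For instance, a detached (asynchronous) send/receive of a handle looks
like the following sketch; the rank and tag variables are
illustrative:
</p>

<tt>
<pre>
/* StarPU fetches the data back from a GPU if needed, then hands it
   over to MPI; the NULL arguments are an optional callback. */
starpu_mpi_isend_detached(token_handle, next_rank, tag,
                          MPI_COMM_WORLD, NULL, NULL);
starpu_mpi_irecv_detached(token_handle, prev_rank, tag,
                          MPI_COMM_WORLD, NULL, NULL);
</pre>
</tt>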

<p>
<a href="files/mpi/ring_async_implicit.c"><tt>ring_async_implicit.c</tt></a>
shows an example of mixing MPI communications and task submission. It
is a classical ring MPI ping-pong, but the token which is being passed
on from neighbour to neighbour is incremented by a StarPU task at each
step.
</p>

<p>
This is written very naturally by simply submitting all MPI
communication requests and task submission asynchronously in a
sequential-looking loop, and eventually waiting for all the tasks to
complete.
</p>

<tt>
<pre>
#!/bin/bash
source /mnt/n7fs/ens/tp_abuttari/TP_StarPU/tp_vars.sh

make ring_async_implicit
mpirun -np 2 $PWD/ring_async_implicit
</pre>
</tt>
</div>

<div class="section">
<h3>starpu_mpi_insert_task</h3>

<p>
<a href="files/mpi/stencil5.c">A stencil application</a> shows a basic MPI
task model application. The data distribution over MPI
nodes is decided by the <tt>my_distrib</tt> function, and can thus be changed
trivially.
It also shows how data can be migrated to a
new distribution.
</p>
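
<p>
A sketch of the task insertion call, with illustrative handle names:
each handle has been assigned an owner node
with <tt>starpu_mpi_data_register()</tt>, every node runs the same
submission loop, and StarPU-MPI infers the needed transfers from the
data dependencies:
</p>

<tt>
<pre>
/* Called with the same arguments on every MPI node; the task executes
   on the node owning the data written to (here, xy_handle), and the
   other nodes automatically send it the inputs it lacks. */
starpu_mpi_insert_task(MPI_COMM_WORLD, &stencil5_cl,
                       STARPU_RW, xy_handle,
                       STARPU_R, xm1y_handle,
                       STARPU_R, xp1y_handle,
                       STARPU_R, xym1_handle,
                       STARPU_R, xyp1_handle,
                       0);
</pre>
</tt>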

<tt>
<pre>
#!/bin/bash
source /mnt/n7fs/ens/tp_abuttari/TP_StarPU/tp_vars.sh

make stencil5
mpirun -np 2 $PWD/stencil5 -display
</pre>
</tt>

</div>

</div>
<!--div class="section">
<h2>Session Part 4: OpenMP Support</h2>

<div class="section">
<h3>The Klang-Omp OpenMP Compiler</h3>

<p>
The <b>Klang-Omp</b> OpenMP compiler converts C/C++ source codes annotated with OpenMP 4 directives into StarPU enabled codes. Klang-Omp is source-to-source compiler based on the LLVM/CLang compiler framework.
</p>

<p>
The following shell sequence shows an example of an OpenMP version of the Cholesky decomposition compiled into StarPU code.
</p>

<tt>
<pre>
cd
source /gpfslocal/pub/training/runtime_june2016/openmp/environment
cp -r /gpfslocal/pub/training/runtime_june2016/openmp/Cholesky .
cd Cholesky
make
./cholesky_omp4.starpu
</pre>
</tt>

<p>
Homepage of the Klang-Omp OpenMP compiler: <a href="http://kstar.gforge.inria.fr/">Klang-Omp</a>
</p>

</div>
</div-->


<div class="section" id="contact">
<h2>Contact</h2>
<p>
For any questions regarding StarPU, please contact the StarPU developers
mailing list:
<a href="mailto:starpu-devel@lists.gforge.inria.fr?subject=StarPU">starpu-devel@lists.gforge.inria.fr</a>.
</p>
</div>

<div class="section">
<h2>More Performance Optimizations</h2>
<p>
The StarPU
documentation <a href="http://starpu.gforge.inria.fr/doc/html/PerformanceFeedback.html">performance
    feedback chapter</a> provides more optimization tips for further
reading after this tutorial.
</p>

<!--
<div class="section">
<h3>FxT Tracing Support</h3>

<p>
In addition to online profiling, StarPU provides offline profiling tools,
based on recording a trace of events during execution, and analyzing it
afterwards.
</p>

<p>
To use the version of StarPU compiled with FxT support, you need to reload the
module StarPU after loading the module FxT.
</p>

<tt>
<pre>
module unload runtime/starpu/1.1.4
module load trace/fxt/0.2.13
module load runtime/starpu/1.1.4
</pre>
</tt>

<p>
The trace file is stored in <tt>/tmp</tt> by default. Since execution will
happen on a cluster node, the file will not be reachable after execution,
we need to tell StarPU to store output traces in the home directory, by
setting:
</p>

<tt>
<pre>
$ export STARPU_FXT_PREFIX=$HOME/
</pre>
</tt>

<p>
do not forget the add the line in your file <tt>.bash_profile</tt>.
</p>

<p>
The application should be run again, and this time a <tt>prof_file_XX_YY</tt>
trace file will be generated in your home directory. This can be converted to
several formats by using:
</p>

<tt>
<pre>
$ starpu_fxt_tool -i ~/prof_file_*
</pre>
</tt>

<p>
This will create
<ul>
<li>
a <tt>paje.trace</tt> file, which can be opened by using the <a
href="http://vite.gforge.inria.fr/">ViTE</a> tool. This shows a Gant diagram of
the tasks which executed, and thus the activity and idleness of tasks, as well
as dependencies, data transfers, etc. You may have to zoom in to actually focus
on the computation part, and not the lengthy CUDA initialization.
</li>
<li>
a <tt>dag.dot</tt> file, which contains the graph of all the tasks
submitted by the application. It can be opened by using Graphviz.
</li>
<li>
an <tt>activity.data</tt> file, which records the activity of all processing
units over time.
</li>
</ul>
</p>
</div>
</div>
-->
</div>

<div class="section" id="other">
<h2>Other Materials: Talk Slides and Website Links</h2>
<p>
<h3>The StarPU Runtime System</h3>
<ul>
<li> <a href="slides/01_introducing_starpu.pdf">Slides: StarPU - Part. 1 – Introducing StarPU</a></li>
<li> <a href="slides/02_mastering_starpu.pdf">Slides: StarPU - Part. 2 – Mastering StarPU</a></li>
</ul>

</p>
</div>

<div class="section bot">
<p class="updated">
  Last updated on 2018/07/03.
</p>
</div>
</body>
</html>