<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
            "http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<HEAD>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<TITLE>StarPU hands-on session</TITLE>
<link rel="stylesheet" type="text/css" href="../../style.css" />
<link rel="Shortcut icon" href="http://www.inria.fr/extension/site_inria/design/site_inria/images/favicon.ico" type="image/x-icon" />
</HEAD>

<body>

<div class="title">
<h1><a href="../../">StarPU</a></h1>
<h2>Runtime Systems for Heterogeneous Platform Programming</h2>
<h3>StarPU Tutorial - Bordeaux, May 2014</h3>
</div>

<div class="menu">
      <a href="../">Back to the main page</a>
</div>

<div class="section">
<p>
This tutorial is part of
the <a href="http://events.prace-ri.eu/conferenceDisplay.py?confId=269">PATC
  Training "Runtime systems for heterogeneous platform programming"</a>.
</p>

<p>
Other materials (talk slides, links) for the whole tutorial session are available at the
<a href="index.html#other">bottom</a> of this page.
</p>
</div>

<div class="section">
<h2>Setup</h2>

<div class="section">
<h3>Connection to the Platform</h3>
<p>
The lab works are going to be done on
the <a href="https://plafrim.bordeaux.inria.fr/">PlaFRIM/DiHPES</a> platform.
A subset of machines has been specifically booked for our own usage.
You should have received information on how to connect to the
platform.
</p>
<p>
Once you are connected, we advise you to add the following lines at
the end of your <tt>.bashrc</tt> file.
</p>

<tt><pre>
module purge
module load compiler/intel
module load hardware/hwloc
module load gpu/cuda/5.5
module load mpi/intel
module load trace/fxt/0.2.13
module load runtime/starpu/1.1.2
</pre></tt>

<p>
<b>Important:</b>

Due to an issue with the NFS-mounted home, you need to redirect CUDA's cache to
<tt>/tmp</tt>, so also put this in your <tt>.bashrc</tt>:

<tt><pre>
rm -fr ~/.nv
mkdir -p /tmp/$USER-nv
ln -s /tmp/$USER-nv ~/.nv
</pre></tt>
</p>

</div>

<div class="section">
<h3>Job Submission</h3>
<p>
Jobs can be submitted to the platform to reserve a set of nodes and to
execute an application on those nodes. We advise not to reserve nodes
interactively so as not to block the machines for the other
participants. Here is
a <a href="files/starpu_machine_display.pbs">script</a> to submit your
first StarPU application. It calls the
tool <tt>starpu_machine_display</tt> which shows the processing units
that StarPU can use, and the bandwidth and affinity measured between
the memory nodes.
</p>

<p>
PlaFRIM/DiHPES nodes are normally accessed through queues. For our lab
works, no queue needs to be specified. However, if you have an account
on the platform and you want to use the same pool of machines, you
will need to specify that you want to use the <tt>mirage</tt> queue.
</p>

<tt><pre>
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12

starpu_machine_display
</pre></tt>

<p>
To submit the script, simply call:
</p>

<tt><pre>
qsub starpu_machine_display.pbs
</pre></tt>

<p>
The state of the job can be queried by calling the command <tt>qstat | grep $USER</tt>.
Once finished, the standard output and the standard error generated by
the script execution are available in the files:
<ul>
<li>jobname.<b>o</b>sequence_number</li>
<li>jobname.<b>e</b>sequence_number</li>
</ul>
</p>

<p>
Note that the first time <tt>starpu_machine_display</tt> is executed,
it calibrates the performance model of the bus; the results are then
stored in different files in the
directory <tt>$HOME/.starpu/sampling/bus</tt>. If you run the command
several times, you will notice that StarPU may calibrate the bus speed
several times. This is because the cluster's batch scheduler may assign a
different node each time, and StarPU does not know that the local
cluster we use is homogeneous, and thus assumes that all nodes of the
cluster may be different. Let's force it to use the same machine ID
for the whole cluster:
</p>

<tt><pre>
$ export STARPU_HOSTNAME=mirage
</pre></tt>

<p>
Also add this to your <tt>.bashrc</tt> for future connections. Of course, on
a heterogeneous cluster, the cluster launcher script should set various
hostnames for the different node classes, as appropriate.
</p>
</div>

<div class="section">
<h3>Tutorial Material</h3>

<p>
All files needed for the lab works are available in
the <a href="files/2014.05.22.patcStarPU.zip">zip file</a>. Copy that
file to your PlaFRIM/DiHPES account and unzip its contents.
</p>

</div>

</div>

<div class="section">
<h2>Session Part 1: Task-based Programming Model</h2>

<div class="section">
<h3>Application Example: Vector Scaling</h3>

<h4>Making it and Running it</h4>

<p>
A typical <a href="files/Makefile"><tt>Makefile</tt></a> for
applications using StarPU is the following:
</p>

<tt><pre>
CFLAGS += $(shell pkg-config --cflags starpu-1.1)
LDFLAGS += $(shell pkg-config --libs starpu-1.1)
%.o: %.cu
	nvcc $(CFLAGS) $< -c -o $@

vector_scal_task_insert: vector_scal_task_insert.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o
</pre></tt>

<p>
Here are the source files for the application:
<ul>
<li><a href="files/vector_scal_task_insert.c">The main application</a></li>
<li><a href="files/vector_scal_cpu.c">The CPU implementation of the codelet</a></li>
<li><a href="files/vector_scal_cuda.cu">The CUDA implementation of the codelet</a></li>
<li><a href="files/vector_scal_opencl.c">The OpenCL host implementation of the codelet</a></li>
<li><a href="files/vector_scal_opencl_kernel.cl">The OpenCL device implementation of the codelet</a></li>
</ul>

Run <tt>make</tt>, and run the
resulting <tt>vector_scal_task_insert</tt> executable through the batch
scheduler, using the <a href="files/vector_scal.pbs">provided qsub script <tt>vector_scal.pbs</tt></a>. It should work: it simply scales a given vector by a
given factor.
</p>

<tt><pre>
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12

# go in the directory from which the submission was made
cd $PBS_O_WORKDIR

make vector_scal_task_insert
./vector_scal_task_insert
</pre></tt>

<h4>Computation Kernels</h4>
<p>
Examine the source code, starting from <tt>vector_scal_cpu.c</tt>: this is
the actual computation code, which is wrapped into a <tt>vector_scal_cpu</tt>
function which takes a series of DSM interfaces and a non-DSM parameter. The
code simply gets the factor value from the non-DSM parameter,
an actual pointer from the first DSM interface,
and performs the vector scaling.
</p>
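
<p>
To fix the picture, the overall shape of such a CPU codelet is sketched below.
This is a simplified sketch rather than the exact content of
<tt>vector_scal_cpu.c</tt>: the function receives the DSM interfaces in
<tt>buffers[]</tt> and the packed non-DSM arguments in <tt>cl_arg</tt>.
</p>

<tt><pre>
void vector_scal_cpu(void *buffers[], void *cl_arg)
{
    /* first (and only) DSM interface: the vector to scale */
    struct starpu_vector_interface *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

    /* non-DSM parameter: the scaling factor */
    float factor;
    starpu_codelet_unpack_args(cl_arg, &amp;factor);

    for (unsigned i = 0; i &lt; n; i++)
        val[i] *= factor;
}
</pre></tt>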

<p>
The GPU implementation, in <tt>vector_scal_cuda.cu</tt>, is basically
the same, with the host part (<tt>vector_scal_cuda</tt>) which extracts the
actual CUDA pointer from the DSM interface, and passes it to the device part
(<tt>vector_mult_cuda</tt>) which performs the actual computation.
</p>

<p>
The OpenCL implementation in <tt>vector_scal_opencl.c</tt> and
<tt>vector_scal_opencl_kernel.cl</tt> is more hairy due to the low-level aspect
of the OpenCL standard, but the principle remains the same.
</p>

<p>
Modify the source code of the different implementations (CPU, CUDA and
OpenCL) and see which one gets executed. You can force the execution
of one of the implementations simply by disabling a type of device when
running your application, e.g.:
</p>

<tt><pre>
# to force execution on a GPU device; by default, the CUDA implementation is used
STARPU_NCPUS=0 vector_scal_task_insert

# to force execution on an OpenCL device
STARPU_NCPUS=0 STARPU_NCUDA=0 vector_scal_task_insert
</pre></tt>

<p>
You can set the environment variable <tt>STARPU_WORKER_STATS</tt> to 1 when
running your application to see the number of tasks executed by each
device. You can see the whole list of environment
variables <a href="/files/doc/html/ExecutionConfigurationThroughEnvironmentVariables.html">here</a>.
</p>

<tt><pre>
STARPU_WORKER_STATS=1 vector_scal_task_insert
</pre></tt>

<h4>Main Code</h4>
<p>
Now examine <tt>vector_scal_task_insert.c</tt>: the <tt>cl</tt>
(codelet) structure simply gathers pointers to the functions
mentioned above.
</p>

<p>
The <tt>main</tt> function
<ul>
<li>Allocates a <tt>vector</tt> application buffer and fills it.</li>
<li>Registers it with StarPU, and gets back a DSM handle. From now on, the
application is not supposed to access <tt>vector</tt> directly, since its
content may be copied and modified by a task on a GPU, the main-memory copy then
being outdated.</li>
<li>Submits an (asynchronous) task to StarPU.</li>
<li>Waits for task completion.</li>
<li>Unregisters the vector from StarPU, which brings back the modified version
to main memory.</li>
</ul>
</p>
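
<p>
Put together, the sequence described above boils down to the following
simplified sketch, using the <tt>cl</tt> codelet mentioned above; the variable
names and the <tt>NX</tt> size are illustrative, the actual file may differ
slightly:
</p>

<tt><pre>
float factor = 3.14;
float vector[NX];
starpu_data_handle_t vector_handle;

starpu_init(NULL);
/* ... fill vector ... */

/* register the buffer: StarPU now manages its coherency */
starpu_vector_data_register(&amp;vector_handle, 0 /* main memory */,
                            (uintptr_t)vector, NX, sizeof(vector[0]));

/* submit an asynchronous task working on the registered data */
starpu_insert_task(&amp;cl,
                   STARPU_VALUE, &amp;factor, sizeof(factor),
                   STARPU_RW, vector_handle,
                   0);

starpu_task_wait_for_all();

/* bring the up-to-date content back to main memory */
starpu_data_unregister(vector_handle);
starpu_shutdown();
</pre></tt>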

</div>

<div class="section">
<h3>Data Partitioning</h3>

<p>
In the previous section, we submitted only one task. Here we discuss how to
<i>partition</i> data so as to submit multiple tasks which can be executed in
parallel by the various CPUs and GPUs.
</p>

<p>
Let's examine <a href="files/mult.c">mult.c</a>.

<ul>
<li>
The computation kernel, <tt>cpu_mult</tt>, is a trivial matrix multiplication
kernel, which operates on 3 given DSM interfaces. These will actually not be
whole matrices, but only small parts of matrices.
</li>
<li>
<tt>init_problem_data</tt> initializes the whole A, B and C matrices.
</li>
<li>
<tt>partition_mult_data</tt> does the actual registration and partitioning.
Matrices are first registered completely, then two partitioning filters are
declared. The first one, <tt>vert</tt>, is used to split B and C vertically. The
second one, <tt>horiz</tt>, is used to split A and C horizontally. We thus end
up with a grid of pieces of C to be computed from stripes of A and B.
</li>
<li>
<tt>launch_tasks</tt> submits the actual tasks: for each piece of C, take
the appropriate piece of A and B to produce the piece of C.
</li>
<li>
The access mode is interesting: A and B just need to be read from, and C
will only be written to. This means that StarPU will make copies of the pieces
of A and B across the machine's memory nodes, where they are needed for tasks, and will give
the tasks
uninitialized buffers for the pieces of C, since they will not be read
from.
</li>
<li>
The <tt>main</tt> code initializes StarPU and data, launches tasks, unpartitions data,
and unregisters it. Unpartitioning is an interesting step: until then, the pieces
of C reside on the various GPUs where they have been computed.
Unpartitioning will collect all the pieces of C into the main memory to form the
whole C result matrix.
</li>
</ul>
</p>
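
<p>
For reference, the partitioning step described above essentially boils down to
the following sketch, simplified from <tt>partition_mult_data</tt>; handle and
slice-count names are illustrative:
</p>

<tt><pre>
struct starpu_data_filter vert = {
    .filter_func = starpu_matrix_filter_vertical_block,
    .nchildren = nslicesx
};
struct starpu_data_filter horiz = {
    .filter_func = starpu_matrix_filter_block,
    .nchildren = nslicesy
};

starpu_data_partition(B_handle, &amp;vert);   /* B is cut vertically */
starpu_data_partition(A_handle, &amp;horiz);  /* A is cut horizontally */
/* C is cut in both directions, giving a 2D grid of sub-matrices */
starpu_data_map_filters(C_handle, 2, &amp;vert, &amp;horiz);

/* a given piece of C is then obtained from its (x, y) coordinates */
starpu_data_handle_t sub_C = starpu_data_get_sub_data(C_handle, 2, x, y);
</pre></tt>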

<p>
Run the application with the <a href="files/mult.pbs">batch scheduler</a>, enabling some statistics:
</p>

<tt><pre>
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12

# go in the directory from which the submission was made
cd $PBS_O_WORKDIR

make mult
STARPU_WORKER_STATS=1 ./mult
</pre></tt>

<p>
The printed figures show how the computation was distributed over the various processing
units.
</p>
</div>

<div class="section">
<h3>Other Example</h3>

<p>
<a href="files/gemm/xgemm.c"><tt>gemm/xgemm.c</tt></a> is a very similar
matrix-matrix product example, but which makes use of BLAS kernels for
much better performance. The <tt>mult_kernel_common</tt> function
shows how we call <tt>DGEMM</tt> (CPUs) or <tt>cublasDgemm</tt> (GPUs)
on the DSM interface.
</p>

<p>
Let's execute it.
</p>

<tt><pre>
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12

# go in the directory from which the submission was made
cd $PBS_O_WORKDIR

make gemm/sgemm
STARPU_WORKER_STATS=1 ./gemm/sgemm
</pre></tt>

<!--
<p>
We can notice that StarPU gave much more tasks to the GPU. You can also try
to set <tt>num_gpu=2</tt> to run on the machine which has two GPUs (there is
only one of them, so you may have to wait a long time, so submit this in
background in a separate terminal), the interesting thing here is that
with <b>no</b> application modification beyond making it use a task-based
programming model, we get multi-GPU support for free!
</p>
-->

</div>

<!--
<div class="section">
<h3>More Advanced Examples</h3>
<p>
<tt>examples/lu/xlu_implicit.c</tt> is a more involved example: this is a simple
LU decomposition algorithm. The <tt>dw_codelet_facto_v3</tt> is actually the
main algorithm loop, in a very readable, sequential-looking way. It simply
submits all the tasks asynchronously, and waits for them all.
</p>

<p>
<tt>examples/cholesky/cholesky_implicit.c</tt> is a similar example, but which makes use
of the <tt>starpu_insert_task</tt> helper. The <tt>_cholesky</tt> function looks
very much like <tt>dw_codelet_facto_v3</tt> of the previous paragraph, and all
task submission details are handled by <tt>starpu_insert_task</tt>.
</p>

<p>
Thanks to being already using a task-based programming model, MAGMA and PLASMA
have been easily ported to StarPU by simply using <tt>starpu_insert_task</tt>.
</p>
</div>
-->

<div class="section">
<h3>Exercise</h3>
<p>
Take the vector example again, and add partitioning support to it, using the
matrix-matrix multiplication as an example. Here we will use the
<a href="/files/doc/html/group__API__Data__Partition.html#ga212189d3b83dfa4e225609b5f2bf8461"><tt>starpu_vector_filter_block()</tt></a> filter function. You can see the list of
predefined filters provided by
StarPU <a href="/files/doc/html/starpu__data__filters_8h.html">here</a>.
Try to run it with various numbers of tasks.
</p>
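
<p>
As a starting hint, declaring the filter and submitting one task per sub-vector
could look like the sketch below; <tt>NB_PIECES</tt> and the variable names are
illustrative, and the rest of the exercise is left to you:
</p>

<tt><pre>
struct starpu_data_filter f = {
    .filter_func = starpu_vector_filter_block,
    .nchildren = NB_PIECES
};
starpu_data_partition(vector_handle, &amp;f);

/* submit one task per sub-vector */
for (unsigned i = 0; i &lt; NB_PIECES; i++)
    starpu_insert_task(&amp;cl,
                       STARPU_VALUE, &amp;factor, sizeof(factor),
                       STARPU_RW, starpu_data_get_sub_data(vector_handle, 1, i),
                       0);

starpu_task_wait_for_all();
starpu_data_unpartition(vector_handle, 0 /* gather in main memory */);
</pre></tt>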
</div>
</div>

<div class="section">
<h2>Session Part 2: Optimizations</h2>


<p>
This is based on StarPU's documentation
<a href="/files/doc/html/HowToOptimizePerformanceWithStarPU.html">optimization
  chapter</a>.
</p>

<div class="section">
<h3>Data Management</h3>

<p>
We have explained how StarPU can overlap computation and data transfers
thanks to DMAs. This is however only possible when CUDA has control over the
application buffers. The application should thus use <a href="/files/doc/html/group__API__Standard__Memory__Library.html#ga49603eaea3b05e8ced9ba1bd873070c3"><tt>starpu_malloc()</tt></a>
when allocating its buffer, to permit asynchronous DMAs from and to
it.
</p>

<p>
Take the vector example again, and fix the allocation, to make it use
<a href="/files/doc/html/group__API__Standard__Memory__Library.html#ga49603eaea3b05e8ced9ba1bd873070c3"><tt>starpu_malloc()</tt></a>.
</p>
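
<p>
The change is typically limited to the allocation and deallocation calls, along
the lines of the following sketch:
</p>

<tt><pre>
float *vector;
/* allocate a pinned buffer, suitable for asynchronous DMA transfers */
starpu_malloc((void **)&amp;vector, NX * sizeof(vector[0]));

/* ... register the buffer, submit tasks, unregister, as before ... */

starpu_free(vector);
</pre></tt>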

</div>

<div class="section">
<h3>Task Submission</h3>

<p>
To let StarPU reorder tasks, submit data transfers in advance, etc., task
463 464
submission should be asynchronous whenever possible. Ideally, the application
should behave as follows: submit the
whole graph of tasks, and wait for termination.
</p>
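
<p>
In code, the intended structure is thus simply the following sketch:
</p>

<tt><pre>
for (i = 0; i &lt; ntasks; i++)
    starpu_insert_task(&amp;cl, /* ... data handles and arguments ... */ 0);

/* synchronize only once, after the whole task graph has been submitted */
starpu_task_wait_for_all();
</pre></tt>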

</div>

<div class="section">
<h3>Task Scheduling Policy</h3>
<p>
By default, StarPU uses the <tt>eager</tt> simple greedy scheduler. This is
because it provides correct load balance even if the application codelets do not
have performance models: it uses a single central queue, from which workers draw
tasks to work on. This however does not permit prefetching data, since the
scheduling decision is taken late.
</p>

<p>
If the application codelets have performance models, the scheduler should be
changed to take advantage of them. StarPU will then take scheduling
decisions in advance according to the performance models, and issue data prefetch
requests, to overlap data transfers and computations.
</p>
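
<p>
Declaring a history-based performance model for a codelet only takes a few
lines. Here is a sketch for the vector scaling example (the symbol name is
arbitrary, it only identifies the model in the calibration files, and the
codelet field list is simplified):
</p>

<tt><pre>
static struct starpu_perfmodel vector_scal_model = {
    .type = STARPU_HISTORY_BASED,
    .symbol = "vector_scal"
};

static struct starpu_codelet cl = {
    .cpu_funcs = { vector_scal_cpu, NULL },
    .cuda_funcs = { vector_scal_cuda, NULL },
    .nbuffers = 1,
    .modes = { STARPU_RW },
    .model = &amp;vector_scal_model
};
</pre></tt>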

<p>
For instance, compare the <tt>eager</tt> (default) and <tt>dmda</tt> scheduling
policies:
</p>

<tt><pre>
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 gemm/sgemm -x 1024 -y 1024 -z 1024
</pre></tt>

<p>
with:
</p>

<tt><pre>
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=dmda gemm/sgemm -x 1024 -y 1024 -z 1024
</pre></tt>

<p>
You can see that most (if not all) of the computation has been done on the GPUs,
leading to better performance.
</p>

<p>
Try other schedulers; use <tt>STARPU_SCHED=help</tt> to get the
list.
</p>

<p>
Also try with various sizes and draw curves.
</p>

<p>
You can also try the double-precision version, <tt>dgemm</tt>, and notice that the GPUs
achieve lower performance than in single precision.
</p>

</div>


<div class="section">
<h3>Performance Model Calibration</h3>

<p>
Performance prediction is essential for proper scheduling decisions; the
performance models thus have to be calibrated.  This is done automatically by
StarPU when a codelet is executed for the first time.  Once this is done, the
result is saved to a file in <tt>$HOME/.starpu</tt> for later re-use.  The
<tt>starpu_perfmodel_display</tt> tool can be used to check the resulting
performance model.
</p>

<tt><pre>
$ starpu_perfmodel_display -l
file: &lt;starpu_sgemm_gemm.mirage&gt;
$ starpu_perfmodel_display -s starpu_sgemm_gemm
performance model for cpu_impl_0
# hash		size		flops		mean (us)	stddev (us)		n
8bd4e11d	2359296        	0.000000e+00   	1.848856e+04   	4.026761e+03   	12
performance model for cuda_0_impl_0
# hash		size		flops		mean (us)	stddev (us)		n
8bd4e11d	2359296        	0.000000e+00   	4.918095e+02   	9.404866e+00   	66
...
</pre></tt>

<p>
This shows that for the sgemm kernel with a 2.5M matrix slice, the average
execution time on CPUs was about 18ms, with a 4ms standard deviation, over
12 samples, while it took about 0.49ms on GPUs, with a 0.009ms standard
deviation. It is a good idea to check this before doing actual performance
measurements. If the kernel has varying performance, it may be a good idea to
force StarPU to continue calibrating the performance model, by using <tt>export
STARPU_CALIBRATE=1</tt>.
</p>

<p>
If the code of a computation kernel is modified, its performance changes, and the
performance model thus has to be recalibrated from scratch. To do so, use
<tt>export STARPU_CALIBRATE=2</tt>.
</p>

<p>
The performance model can also be drawn by using <tt>starpu_perfmodel_plot</tt>,
which will emit a gnuplot file in the current directory.
</p>

</div>
</div>

<div class="section">
<h2>Session Part 3: MPI Support</h2>

<p>
StarPU provides support for MPI communications. It does so in two ways: either the
application specifies MPI transfers by hand, or it lets StarPU infer them from
data dependencies.
</p>

<div class="section">
<h3>Manual MPI transfers</h3>

<p>Basically, StarPU provides
equivalents of <tt>MPI_*</tt> functions, but which operate on DSM handles
instead of <tt>void*</tt> buffers. The difference is that the source data may be
residing on a GPU where it just got computed. StarPU will automatically handle
copying it back to main memory before submitting it to MPI.
</p>
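
<p>
For instance, sending and receiving a registered data handle looks like the
following sketch; the detached variants return immediately, which makes them
convenient to mix with task submission (handle and rank variables are
illustrative):
</p>

<tt><pre>
/* send the content of token_handle to MPI rank dest, with tag 42 */
starpu_mpi_isend_detached(token_handle, dest, 42, MPI_COMM_WORLD, NULL, NULL);

/* receive into token_handle from MPI rank source, with tag 42 */
starpu_mpi_irecv_detached(token_handle, source, 42, MPI_COMM_WORLD, NULL, NULL);
</pre></tt>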

<p>
<a href="files/mpi/ring_async_implicit.c"><tt>ring_async_implicit.c</tt></a>
shows an example of mixing MPI communications and task submission. It
is a classical ring MPI ping-pong, but the token which is being passed
on from neighbour to neighbour is incremented by a StarPU task at each
step.
</p>

<p>
This is written very naturally by simply submitting all MPI
communication requests and task submission asynchronously in a
sequential-looking loop, and eventually waiting for all the tasks to
complete.
</p>

<tt><pre>
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12

# go in the directory from which the submission was made
cd $PBS_O_WORKDIR

make ring_async_implicit
mpirun -np 2 $PWD/ring_async_implicit
</pre></tt>
</div>

<div class="section">
<h3>starpu_mpi_insert_task</h3>

<p>
<a href="files/mpi/stencil5.c">A stencil application</a> shows a basic MPI
task model application. The data distribution over MPI
nodes is decided by the <tt>my_distrib</tt> function, and can thus be changed
trivially.
It also shows how data can be migrated to a
new distribution.
</p>
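
<p>
The core of the task submission then looks like the sketch below (handle names
are illustrative). Each data handle was assigned an owner MPI rank at
registration time, following <tt>my_distrib</tt>; StarPU uses this ownership to
decide where each task runs and generates the required MPI transfers
automatically.
</p>

<tt><pre>
starpu_mpi_insert_task(MPI_COMM_WORLD, &amp;stencil5_cl,
                       STARPU_RW, data_handles[x][y],
                       STARPU_R,  data_handles[x-1][y],
                       STARPU_R,  data_handles[x+1][y],
                       STARPU_R,  data_handles[x][y-1],
                       STARPU_R,  data_handles[x][y+1],
                       0);
</pre></tt>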

</div>
</div>


<div class="section" id="contact">
<h2>Contact</h2>
<p>
For any questions regarding StarPU, please contact the StarPU developers mailing list.
<pre>
<a href="mailto:starpu-devel@lists.gforge.inria.fr?subject=StarPU">starpu-devel@lists.gforge.inria.fr</a>
</pre>
</p>
</div>

<div class="section">
<h2>More Performance Optimizations</h2>
<p>
The StarPU
documentation <a href="/files/doc/html/PerformanceFeedback.html">performance
    feedback chapter</a> provides more optimization tips for further
reading after this tutorial.
</p>

<div class="section">
<h3>FxT Tracing Support</h3>

<p>
In addition to online profiling, StarPU provides offline profiling tools,
based on recording a trace of events during execution, and analyzing it
afterwards.
</p>

<p>
StarPU should be recompiled with FxT support:
</p>

<tt><pre>
$ ./configure --with-fxt --prefix=$HOME
$ make clean
$ make
$ make install
</pre></tt>

<p>
You should make sure that the summary at the end
of <tt>./configure</tt> shows that tracing was enabled:
</p>

<tt><pre>
Tracing enabled: yes
</pre></tt>

<p>
The trace file is output in <tt>/tmp</tt> by default. Since execution will
happen on a cluster node, the file will not be reachable after execution, so
we need to direct StarPU to output traces to the home directory, by
using:
</p>

<tt><pre>
$ export STARPU_FXT_PREFIX=$HOME/
</pre></tt>

<p>
and add it to your <tt>.bashrc</tt>.
</p>

<p>
The application should be run again, and this time a <tt>prof_file_XX_YY</tt>
trace file will be generated in your home directory. This can be converted to
several formats by using:
</p>

<tt><pre>
$ starpu_fxt_tool -i ~/prof_file_*
</pre></tt>

<p>
That will create
<ul>
<li>
a <tt>paje.trace</tt> file, which can be opened by using the <a
href="http://vite.gforge.inria.fr/">ViTE</a> tool. This shows a Gant diagram of
the tasks which executed, and thus the activity and idleness of tasks, as well
as dependencies, data transfers, etc. You may have to zoom in to actually focus
on the computation part, and not the lengthy CUDA initialization.
</li>
<li>
a <tt>dag.dot</tt> file, which contains the graph of all the tasks
submitted by the application. It can be opened by using Graphviz.
</li>
<li>
an <tt>activity.data</tt> file, which records the activity of all processing
units over time.
</li>
</ul>
</p>
</div>
</div>

<div class="section" id="other">
<h2>Other Materials: Talk Slides and Website Links</h2>
<p>
<h3>General Session Introduction</h3>
<ul>
<li> <a href="slides/00_intro_runtimes.pdf">Slides: Introduction to Runtime Systems</a>
</li>
</ul>
<h3>The Hardware Locality Library (hwloc)</h3>
<ul>
<li> <a href="http://www.open-mpi.org/projects/hwloc/tutorials/">Tutorial: hwloc</a>
</li>
</ul>
<h3>The XKaapi Runtime System</h3>
<ul>
<li> <a href="http://kaapi.gforge.inria.fr/dokuwiki/doku.php">XKaapi wiki</a>
</li>
<li>
<a href="http://kaapi.gforge.inria.fr/dokuwiki/doku.php?id=documentation">XKaapi documentation</a>
</li>
</ul>
<h3>The StarPU Runtime System</h3>
<ul>
<Li> <a href="slides/01_introducing_starpu.pdf">Slides: StarPU - Part. 1 – Introducing StarPU</a></li>
<Li> <a href="slides/02_mastering_starpu.pdf">Slides: StarPU - Part. 2 – Mastering StarPU</a></li>
</ul>
<h3>The EZtrace Performance Debugging Framework</h3>
<ul>
<li> <a href="http://eztrace.gforge.inria.fr/tutorials/index.html">Tutorial: EzTrace</a>
</li>
</ul>

</p>
</div>

<div class="section bot">
<p class="updated">
  Last updated on 2014/05/19.
</p>
</div>
</body>
</html>