<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
            "http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<HEAD>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<TITLE>StarPU hands-on session</TITLE>
<link rel="stylesheet" type="text/css" href="../../style.css" />
<link rel="Shortcut icon" href="http://www.inria.fr/extension/site_inria/design/site_inria/images/favicon.ico" type="image/x-icon" />
</HEAD>

<body>

<div class="title">
<h1><a href="../../">StarPU</a></h1>
<h2>StarPU Tutorial - Runtime Systems for Heterogeneous Platform Programming - Bordeaux, May 2014</h2>
</div>

<div class="menu">
<a href="/Runtime/">RUNTIME TEAM</a> |
      <a href="../">Back to the main page</a>
</div>

<div class="section">
<p>
This tutorial is part of
the <a href="http://events.prace-ri.eu/conferenceDisplay.py?confId=269">PATC
  Training "Runtime systems for heterogeneous platform programming"</a>.
</p>
</div>

<div class="section">
<h2>Setup</h2>

<div class="section">
<h3>Connection to the Platform</h3>
<p>
The lab work will be done on
the <a href="https://plafrim.bordeaux.inria.fr/">PlaFRIM/DiHPES</a> platform.
A subset of machines has been specifically booked for our use.
You should have received information on how to connect to the
platform.
</p>

<p>
Once you are connected, we advise you to add the following lines at
the end of your <tt>.bashrc</tt> file.
</p>


<tt><pre>
module purge
module load compiler/intel
module load hardware/hwloc
module load gpu/cuda/5.5
module load mpi/intel
module load trace/fxt/0.2.13
module load runtime/starpu/1.1.2
</pre></tt>

<p>
<b>Important:</b>
Due to an issue with the NFS-mounted home directory, you need to redirect CUDA's
cache to <tt>/tmp</tt>, so also add the following to your <tt>.bashrc</tt>:

<tt><pre>
rm -fr ~/.nv
mkdir -p /tmp/$USER-nv
ln -s /tmp/$USER-nv ~/.nv
</pre></tt>
</p>

</div>

<div class="section">
<h3>Job Submission</h3>
<p>
Jobs can be submitted to the platform to reserve a set of nodes and to
execute an application on those nodes. We advise against reserving nodes
interactively, so as not to block the machines for the other
participants. Here is
a <a href="files/starpu_machine_display.pbs">script</a> to submit your
first StarPU application. It calls the
tool <tt>starpu_machine_display</tt>, which shows the processing units
that StarPU can use, and the bandwidth and affinity measured between
the memory nodes.
</p>

<p>
PlaFRIM/DiHPES nodes are normally accessed through queues. For our lab
work, no queue needs to be specified; however, if you have an account
on the platform and you want to use the same pool of machines, you
will need to specify the <tt>mirage</tt> queue.
</p>

<tt><pre>
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12

starpu_machine_display
</pre></tt>

<p>
To submit the script, simply call:
</p>

<tt><pre>
qsub starpu_machine_display.pbs
</pre></tt>

<p>
The state of the job can be queried by calling the command <tt>qstat | grep $USER</tt>.
Once finished, the standard output and the standard error generated by
the script execution are available in the files:
<ul>
<li>jobname.<b>o</b>sequence_number</li>
<li>jobname.<b>e</b>sequence_number</li>
</ul>
</p>

<p>
Note that the first time <tt>starpu_machine_display</tt> is executed,
it calibrates the performance model of the bus; the results are then
stored in files in the
directory <tt>$HOME/.starpu/sampling/bus</tt>. If you run the command
several times, you will notice that StarPU may calibrate the bus speed
several times. This is because the cluster's batch scheduler may assign a
different node each time, and StarPU does not know that the local
cluster we use is homogeneous, so it assumes that the nodes of the
cluster may all be different. Let's force it to use the same machine ID
for the whole cluster:
</p>

<tt><pre>
$ export STARPU_HOSTNAME=mirage
</pre></tt>

<p>
Also add this to your <tt>.bashrc</tt> for future connections. Of course, on
a heterogeneous cluster, the cluster launcher script should set different
hostnames for the different node classes, as appropriate.
</p>
</div>

<div class="section">
<h3>Tutorial Material</h3>

<p>
All files needed for the lab work are available in
the <a href="files/2014.05.22.patcStarPU.zip">zip file</a>. Copy that
file to your PlaFRIM/DiHPES account and unzip its contents.
</p>

</div>

</div>

<div class="section">
<h2>Session Part 1: Task-based Programming Model</h2>

<div class="section">
<h3>Application Example: Vector Scaling</h3>

<h4>Making it and Running it</h4>

<p>
A typical <a href="files/Makefile"><tt>Makefile</tt></a> for
applications using StarPU is the following:
</p>

<tt><pre>
CFLAGS += $(shell pkg-config --cflags starpu-1.1)
LDFLAGS += $(shell pkg-config --libs starpu-1.1)
%.o: %.cu
	nvcc $(CFLAGS) $< -c -o $@

vector_scal_task_insert: vector_scal_task_insert.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o
</pre></tt>

<p>
Here are the source files for the application:
<ul>
<li><a href="files/vector_scal_task_insert.c">The main application</a></li>
<li><a href="files/vector_scal_cpu.c">The CPU implementation of the codelet</a></li>
<li><a href="files/vector_scal_cuda.cu">The CUDA implementation of the codelet</a></li>
<li><a href="files/vector_scal_opencl.c">The OpenCL host implementation of the codelet</a></li>
<li><a href="files/vector_scal_opencl_kernel.cl">The OpenCL device implementation of the codelet</a></li>
</ul>

Run <tt>make</tt>, and run the
resulting <tt>vector_scal_task_insert</tt> executable through the batch
scheduler, using the <a href="files/vector_scal.pbs">given qsub script vector_scal.pbs</a>. It should work out of the box: it simply scales a given vector by a
given factor.
</p>

<tt><pre>
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12

# go in the directory from which the submission was made
cd $PBS_O_WORKDIR

make vector_scal_task_insert
./vector_scal_task_insert
</pre></tt>

<h4>Computation Kernels</h4>
<p>
Examine the source code, starting from <tt>vector_scal_cpu.c</tt>: this is
the actual computation code, which is wrapped into a <tt>vector_scal_cpu</tt>
function which takes a series of DSM interfaces and a non-DSM parameter. The
code simply gets the factor value from the non-DSM parameter,
an actual pointer from the first DSM interface,
and performs the vector scaling.
</p>
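
<p>
For reference, a minimal CPU kernel along these lines could look like the
following sketch (a simplified illustration, not necessarily the exact code
shipped in <tt>vector_scal_cpu.c</tt>):
</p>

<tt><pre>
void vector_scal_cpu(void *buffers[], void *cl_arg)
{
    /* get the actual pointer and length from the first DSM interface */
    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    unsigned i;

    /* get the non-DSM parameter */
    float factor;
    starpu_codelet_unpack_args(cl_arg, &amp;factor);

    for (i = 0; i &lt; n; i++)
        val[i] *= factor;
}
</pre></tt>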
<p>
The GPU implementation, in <tt>vector_scal_cuda.cu</tt>, is basically
the same, with the host part (<tt>vector_scal_cuda</tt>) extracting the
actual CUDA pointer from the DSM interface and passing it to the device part
(<tt>vector_mult_cuda</tt>), which performs the actual computation.
</p>

<p>
The OpenCL implementation, in <tt>vector_scal_opencl.c</tt> and
<tt>vector_scal_opencl_kernel.cl</tt>, is hairier due to the low-level nature
of the OpenCL standard, but the principle remains the same.
</p>
<p>
Modify the source code of the different implementations (CPU, CUDA and
OpenCL) and see which one gets executed. You can force the execution
of one of the implementations simply by disabling a type of device when
running your application, e.g.:
</p>

<tt><pre>
# to force execution on a GPU device; by default, CUDA will be used
STARPU_NCPUS=0 vector_scal_task_insert

# to force execution on an OpenCL device
STARPU_NCPUS=0 STARPU_NCUDA=0 vector_scal_task_insert
</pre></tt>

<p>
You can set the environment variable <tt>STARPU_WORKER_STATS</tt> to 1 when
running your application to see the number of tasks executed by each
device. You can see the whole list of environment
variables <a href="http://runtime.bordeaux.inria.fr/StarPU/doc/html/ExecutionConfigurationThroughEnvironmentVariables.html">here</a>.
</p>

<tt><pre>
STARPU_WORKER_STATS=1 vector_scal_task_insert
</pre></tt>

<h4>Main Code</h4>
<p>
Now examine <tt>vector_scal_task_insert.c</tt>: the <tt>cl</tt>
(codelet) structure simply gathers pointers to the functions
mentioned above.
</p>
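
<p>
Such a codelet declaration typically looks like the following sketch (the
OpenCL host function name is assumed here; check the actual source for the
exact fields):
</p>

<tt><pre>
static struct starpu_codelet cl =
{
    .cpu_funcs = { vector_scal_cpu },
    .cuda_funcs = { vector_scal_cuda },
    .opencl_funcs = { vector_scal_opencl },
    .nbuffers = 1,          /* one DSM interface: the vector */
    .modes = { STARPU_RW }, /* which is both read and written */
};
</pre></tt>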

<p>
The <tt>main</tt> function does the following (a minimal sketch is given after the list):
<ul>
<li>Allocates a <tt>vector</tt> application buffer and fills it.</li>
<li>Registers it to StarPU, and gets back a DSM handle. From now on, the
application is not supposed to access <tt>vector</tt> directly, since its
content may be copied and modified by a task on a GPU, the main-memory copy then
being outdated.</li>
<li>Submits an (asynchronous) task to StarPU.</li>
<li>Waits for task completion.</li>
<li>Unregisters the vector from StarPU, which brings back the modified version
to main memory.</li>
</ul>
</p>
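
<p>
Putting these steps together, a minimal sketch of such a <tt>main</tt>
function (error checking omitted, and assuming the codelet structure
<tt>cl</tt> shown above) could be:
</p>

<tt><pre>
#include &lt;starpu.h&gt;

#define NX 2048

int main(void)
{
    float vector[NX];
    float factor = 3.14f;
    starpu_data_handle_t vector_handle;
    unsigned i;

    starpu_init(NULL);

    for (i = 0; i &lt; NX; i++)
        vector[i] = 1.0f;

    /* register the buffer to StarPU and get a DSM handle back */
    starpu_vector_data_register(&amp;vector_handle, 0 /* main memory */,
                                (uintptr_t)vector, NX, sizeof(vector[0]));

    /* submit an asynchronous task... */
    starpu_insert_task(&amp;cl,
                       STARPU_RW, vector_handle,
                       STARPU_VALUE, &amp;factor, sizeof(factor),
                       0);

    /* ...and wait for its completion */
    starpu_task_wait_for_all();

    /* unregister: this brings the up-to-date data back to main memory */
    starpu_data_unregister(vector_handle);

    starpu_shutdown();
    return 0;
}
</pre></tt>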

</div>

<div class="section">
<h3>Data Partitioning</h3>

<p>
In the previous section, we submitted only one task. Here we discuss how to
<i>partition</i> data so as to submit multiple tasks which can be executed in
parallel by the various CPUs and GPUs.
</p>
286

287
288
<p>
Let's examine <a href="files/mult.c">mult.c</a>.
289
290

<ul>
291
292
<li>
The computation kernel, <tt>cpu_mult</tt>, is a trivial matrix multiplication
kernel, which operates on 3 given DSM interfaces. These will actually not be
whole matrices, but only small parts of matrices.
</li>
<li>
<tt>init_problem_data</tt> initializes the whole A, B and C matrices.
</li>
<li>
<tt>partition_mult_data</tt> does the actual registration and partitioning.
Matrices are first registered completely, then two partitioning filters are
declared. The first one, <tt>vert</tt>, is used to split B and C vertically. The
second one, <tt>horiz</tt>, is used to split A and C horizontally. We thus end
up with a grid of pieces of C to be computed from stripes of A and B (a
sketch of the filter declarations is given after this list).
</li>
<li>
<tt>launch_tasks</tt> submits the actual tasks: for each piece of C, take
the appropriate piece of A and B to produce the piece of C.
</li>
<li>
The access mode is interesting: A and B just need to be read from, and C
will only be written to. This means that StarPU will copy the pieces
of A and B to the memory nodes where tasks need them, and will give the tasks
uninitialized buffers for the pieces of C, since they will not be read
from.
</li>
<li>
The <tt>main</tt> code initializes StarPU and data, launches tasks, unpartitions data,
and unregisters it. Unpartitioning is an interesting step: until then, the pieces
of C reside on the various GPUs where they have been computed.
Unpartitioning will collect all the pieces of C into main memory to form the
whole C result matrix.
</li>
</ul>
</p>
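
<p>
As a rough sketch of the partitioning step described above (variable names
are illustrative and may differ from the actual <tt>mult.c</tt>):
</p>

<tt><pre>
/* split B and C into nslicesx vertical stripes */
struct starpu_data_filter vert =
{
    .filter_func = starpu_matrix_filter_vertical_block,
    .nchildren = nslicesx,
};

/* split A and C into nslicesy horizontal stripes */
struct starpu_data_filter horiz =
{
    .filter_func = starpu_matrix_filter_block,
    .nchildren = nslicesy,
};

starpu_data_partition(B_handle, &amp;vert);
starpu_data_partition(A_handle, &amp;horiz);
/* apply both filters to C, yielding a 2D grid of sub-matrices */
starpu_data_map_filters(C_handle, 2, &amp;vert, &amp;horiz);
</pre></tt>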

<p>
Run the application with the <a href="files/mult.pbs">batch scheduler</a>, enabling some statistics:
</p>

<tt><pre>
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12

# go in the directory from which the submission was made
cd $PBS_O_WORKDIR

make mult
STARPU_WORKER_STATS=1 ./mult
</pre></tt>

<p>
The displayed statistics show how the computation was distributed over the
various processing units.
</p>
</div>

<div class="section">
<h3>Other Example</h3>

<p>
<a href="files/gemm/xgemm.c"><tt>gemm/xgemm.c</tt></a> is a very similar
matrix-matrix product example, but which makes use of BLAS kernels for
much better performance. The <tt>mult_kernel_common</tt> function
shows how we call <tt>DGEMM</tt> (CPUs) or <tt>cublasDgemm</tt> (GPUs)
on the DSM interface.
</p>

<p>
Let's execute it.
</p>

<tt><pre>
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12

# go in the directory from which the submission was made
cd $PBS_O_WORKDIR

make gemm/sgemm
STARPU_WORKER_STATS=1 ./gemm/sgemm
</pre></tt>

<!--
<p>
We can notice that StarPU gave much more tasks to the GPU. You can also try
to set <tt>num_gpu=2</tt> to run on the machine which has two GPUs (there is
only one of them, so you may have to wait a long time, so submit this in
background in a separate terminal), the interesting thing here is that
with <b>no</b> application modification beyond making it use a task-based
programming model, we get multi-GPU support for free!
</p>
-->

</div>

<!--
<div class="section">
<h3>More Advanced Examples</h3>
<p>
<tt>examples/lu/xlu_implicit.c</tt> is a more involved example: this is a simple
LU decomposition algorithm. The <tt>dw_codelet_facto_v3</tt> is actually the
main algorithm loop, in a very readable, sequential-looking way. It simply
submits all the tasks asynchronously, and waits for them all.
</p>

<p>
<tt>examples/cholesky/cholesky_implicit.c</tt> is a similar example, but which makes use
of the <tt>starpu_insert_task</tt> helper. The <tt>_cholesky</tt> function looks
very much like <tt>dw_codelet_facto_v3</tt> of the previous paragraph, and all
task submission details are handled by <tt>starpu_insert_task</tt>.
</p>

<p>
Thanks to being already using a task-based programming model, MAGMA and PLASMA
have been easily ported to StarPU by simply using <tt>starpu_insert_task</tt>.
</p>
</div>
-->

<div class="section">
<h3>Exercise</h3>
<p>
Take the vector example again, and add partitioning support to it, using the
matrix-matrix multiplication as an example. Here we will use the
<a href="http://runtime.bordeaux.inria.fr/StarPU/doc/html/group__API__Data__Partition.html#ga212189d3b83dfa4e225609b5f2bf8461"><tt>starpu_vector_filter_block()</tt></a> filter function. You can see the list of
predefined filters provided by
StarPU <a href="http://runtime.bordeaux.inria.fr/StarPU/doc/html/starpu__data__filters_8h.html">here</a>.
Try to run it with various numbers of tasks.
</p>
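
<p>
As a starting point, the partitioning itself could follow this kind of sketch
(<tt>NB_PARTS</tt> is a name chosen here; the task-submission loop is left as
the exercise):
</p>

<tt><pre>
struct starpu_data_filter f =
{
    .filter_func = starpu_vector_filter_block,
    .nchildren = NB_PARTS,
};
starpu_data_partition(vector_handle, &amp;f);

/* submit one task per sub-vector, obtained with
   starpu_data_get_sub_data(vector_handle, 1, i) */

/* gather the pieces back into the whole vector in main memory */
starpu_data_unpartition(vector_handle, 0);
</pre></tt>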
</div>
</div>

<div class="section">
<h2>Session Part 2: Optimizations</h2>

<p>
This section is based on StarPU's
documentation <a href="http://runtime.bordeaux.inria.fr/StarPU/doc/html/HowToOptimizePerformanceWithStarPU.html">optimization
  chapter</a>.
</p>

<div class="section">
<h3>Data Management</h3>
<p>
We have explained how StarPU can overlap computation and data transfers
thanks to DMAs. This is however only possible when CUDA has control over the
application buffers. The application should thus use <a href="http://runtime.bordeaux.inria.fr/StarPU/doc/html/group__API__Standard__Memory__Library.html#ga49603eaea3b05e8ced9ba1bd873070c3"><tt>starpu_malloc()</tt></a>
when allocating its buffers, to permit asynchronous DMAs from and to
them.
</p>
<p>
Take the vector example again, and fix the allocation, to make it use
<a href="http://runtime.bordeaux.inria.fr/StarPU/doc/html/group__API__Standard__Memory__Library.html#ga49603eaea3b05e8ced9ba1bd873070c3"><tt>starpu_malloc()</tt></a>.
</p>
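
<p>
Concretely, the change amounts to something like the following sketch:
</p>

<tt><pre>
float *vector;

/* allocate a pinned buffer, so that CUDA can DMA from/to it asynchronously */
starpu_malloc((void **)&amp;vector, NX * sizeof(vector[0]));

/* ... register it, submit tasks, unregister it ... */

starpu_free(vector);
</pre></tt>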

</div>

<div class="section">
<h3>Task Submission</h3>
<p>
To let StarPU reorder tasks, submit data transfers in advance, etc., task
submission should be asynchronous whenever possible. Ideally, the application
should proceed as follows: submit the
whole graph of tasks, and wait for termination.
</p>
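
<p>
In other words, the submission part of the application should look like the
following sketch, with a single synchronization point at the end
(<tt>handles</tt> is an illustrative array of data handles):
</p>

<tt><pre>
unsigned i;
for (i = 0; i &lt; ntasks; i++)
    /* asynchronous: returns as soon as the task is queued */
    starpu_insert_task(&amp;cl, STARPU_RW, handles[i], 0);

/* single synchronization point, once the whole graph has been submitted */
starpu_task_wait_for_all();
</pre></tt>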

</div>

<div class="section">
<h3>Task Scheduling Policy</h3>
<p>
By default, StarPU uses the <tt>eager</tt> simple greedy scheduler. This is
because it provides correct load balance even if the application codelets do not
have performance models: it uses a single central queue, from which workers draw
tasks to work on. This however does not permit prefetching data, since the
scheduling decision is taken late.
</p>

<p>
If the application codelets have performance models, the scheduler should be
changed to take advantage of them. StarPU will then take scheduling
decisions in advance according to the performance models, and issue data prefetch
requests, to overlap data transfers and computations.
</p>
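
<p>
Declaring a history-based performance model for a codelet is just a matter of
filling a <tt>starpu_perfmodel</tt> structure and pointing the codelet to it,
along the lines of this sketch:
</p>

<tt><pre>
static struct starpu_perfmodel vector_scal_model =
{
    .type = STARPU_HISTORY_BASED,
    .symbol = "vector_scal", /* name under which measurements are stored */
};

static struct starpu_codelet cl =
{
    .cpu_funcs = { vector_scal_cpu },
    /* ... other implementations ... */
    .nbuffers = 1,
    .modes = { STARPU_RW },
    .model = &amp;vector_scal_model,
};
</pre></tt>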

<p>
For instance, compare the <tt>eager</tt> (default) and <tt>dmda</tt> scheduling
policies:
</p>

<tt><pre>
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 gemm/sgemm -x 1024 -y 1024 -z 1024
</pre></tt>

<p>
with:
</p>

<tt><pre>
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=dmda gemm/sgemm -x 1024 -y 1024 -z 1024
</pre></tt>

<p>
You can see that most (if not all) of the computation has been done on the
GPUs, leading to better performance.
</p>
<p>
Try other schedulers, use <tt>STARPU_SCHED=help</tt> to get the
list.
</p>
<p>
Also try with various sizes and draw curves.
</p>
<p>
You can also try the double-precision version, <tt>dgemm</tt>, and notice that
the GPUs achieve noticeably lower performance.
</p>

</div>


<div class="section">
<h3>Performance Model Calibration</h3>
<p>
Performance prediction is essential for proper scheduling decisions; the
performance models thus have to be calibrated. This is done automatically by
StarPU when a codelet is executed for the first time. Once this is done, the
result is saved to a file in <tt>$HOME/.starpu</tt> for later re-use. The
<tt>starpu_perfmodel_display</tt> tool can be used to check the resulting
performance model.
</p>

<tt><pre>
$ starpu_perfmodel_display -l
file: &lt;starpu_sgemm_gemm.mirage&gt;
$ starpu_perfmodel_display -s starpu_sgemm_gemm
performance model for cpu_impl_0
# hash		size		flops		mean (us)	stddev (us)		n
8bd4e11d	2359296        	0.000000e+00   	1.848856e+04   	4.026761e+03   	12
performance model for cuda_0_impl_0
# hash		size		flops		mean (us)	stddev (us)		n
8bd4e11d	2359296        	0.000000e+00   	4.918095e+02   	9.404866e+00   	66
...
</pre></tt>

<p>
This shows that for the sgemm kernel with a 2.5M matrix slice, the average
execution time on CPUs was about 18ms, with a 4ms standard deviation, over
12 samples, while it took about 0.49ms on GPUs, with a 0.009ms standard
deviation. It is a good idea to check this before doing actual performance
measurements. If the kernel has varying performance, it may be a good idea to
force StarPU to continue calibrating the performance model, by using <tt>export
STARPU_CALIBRATE=1</tt>.
</p>

<p>
If the code of a computation kernel is modified, its performance changes, and the
performance model thus has to be recalibrated from scratch. To do so, use
<tt>export STARPU_CALIBRATE=2</tt>.
</p>

<p>
The performance model can also be drawn by using <tt>starpu_perfmodel_plot</tt>,
which will emit a gnuplot file in the current directory.
</p>

</div>
</div>

<div class="section">
<h2>Session Part 3: MPI Support</h2>
<p>
StarPU provides support for MPI communications. It does so in two ways: either the
application specifies MPI transfers by hand, or it lets StarPU infer them from
data dependencies.
</p>

<div class="section">
<h3>Manual MPI transfers</h3>

<p>Basically, StarPU provides
equivalents of <tt>MPI_*</tt> functions, but which operate on DSM handles
instead of <tt>void*</tt> buffers. The difference is that the source data may be
residing on a GPU where it just got computed. StarPU will automatically handle
copying it back to main memory before submitting it to MPI.
</p>
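
<p>
For instance, a detached (asynchronous) send and receive on DSM handles might
look like the following sketch; the handles, ranks and tags are placeholders:
</p>

<tt><pre>
#include &lt;starpu_mpi.h&gt;

/* send the content of data_handle to node 'dest', wherever it currently resides */
starpu_mpi_isend_detached(data_handle, dest, tag, MPI_COMM_WORLD, NULL, NULL);

/* receive into data_handle from node 'source' */
starpu_mpi_irecv_detached(data_handle, source, tag, MPI_COMM_WORLD, NULL, NULL);
</pre></tt>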

<p>
<a href="files/mpi/ring_async_implicit.c"><tt>ring_async_implicit.c</tt></a>
shows an example of mixing MPI communications and task submission. It
is a classical ring MPI ping-pong, but the token which is being passed
on from neighbour to neighbour is incremented by a starpu task at each
step.
</p>

<p>
This is written very naturally by simply submitting all the MPI
communication requests and tasks asynchronously in a
sequential-looking loop, and eventually waiting for all the tasks to
complete.
</p>

<tt><pre>
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12

# go in the directory from which the submission was made
cd $PBS_O_WORKDIR

make ring_async_implicit
mpirun -np 2 $PWD/ring_async_implicit
</pre></tt>
</div>

<div class="section">
<h3>starpu_mpi_insert_task</h3>

<p>
<a href="files/mpi/stencil5.c">A stencil application</a> shows a basic MPI
task model application. The data distribution over MPI
nodes is decided by the <tt>my_distrib</tt> function, and can thus be changed
trivially.
It also shows how data can be migrated to a
new distribution.
</p>
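
<p>
Schematically, each piece of data is given an owner rank and an MPI tag, and
all nodes then submit the same tasks with <tt>starpu_mpi_insert_task()</tt>,
which lets StarPU-MPI generate the required communications (names below are
illustrative):
</p>

<tt><pre>
/* tell StarPU-MPI which node owns this piece of data, and which tag to use */
starpu_data_set_rank(data_handle, owner_rank);
starpu_data_set_tag(data_handle, tag);

/* every node submits the same task graph; StarPU-MPI runs each task on the
   owner of the data it writes to, and transfers the other pieces as needed */
starpu_mpi_insert_task(MPI_COMM_WORLD, &amp;stencil_cl,
                       STARPU_RW, data_handle,
                       0);
</pre></tt>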

</div>
</div>


<div class="section" id="contact">
<h2>Contact</h2>
<p>
For any questions regarding StarPU, please contact the StarPU developers mailing list.
<pre>
<a href="mailto:starpu-devel@lists.gforge.inria.fr?subject=StarPU">starpu-devel@lists.gforge.inria.fr</a>
</pre>
</p>
</div>

<div class="section">
<h2>More Performance Optimizations</h2>
<p>
The StarPU
documentation <a href="http://runtime.bordeaux.inria.fr/StarPU/doc/html/PerformanceFeedback.html">performance
    feedback chapter</a> provides more optimization tips for further
reading after this tutorial.
</p>

<div class="section">
<h3>FxT Tracing Support</h3>
<p>
In addition to online profiling, StarPU provides offline profiling tools,
based on recording a trace of events during execution, and analyzing it
afterwards.
</p>
<p>
StarPU should be recompiled with FxT support:
</p>

<tt><pre>
$ ./configure --with-fxt --prefix=$HOME
$ make clean
$ make
$ make install
</pre></tt>

<p>
You should make sure that the summary at the end
of <tt>./configure</tt> shows that tracing was enabled:
</p>

<tt><pre>
Tracing enabled: yes
</pre></tt>

<p>
The trace file is output in <tt>/tmp</tt> by default. Since execution will
happen on a cluster node, the file will not be reachable after execution,
so we need to direct StarPU to output traces to the home directory, by
using:
</p>

<tt><pre>
$ export STARPU_FXT_PREFIX=$HOME/
</pre></tt>

<p>
and add it to your <tt>.bashrc</tt>.
</p>

<p>
The application should be run again, and this time a <tt>prof_file_XX_YY</tt>
trace file will be generated in your home directory. This can be converted to
several formats by using:
</p>

<tt><pre>
$ starpu_fxt_tool -i ~/prof_file_*
</pre></tt>

<p>
That will create:
<ul>
<li>
a <tt>paje.trace</tt> file, which can be opened by using the <a
href="http://vite.gforge.inria.fr/">ViTE</a> tool. This shows a Gant diagram of
the tasks which executed, and thus the activity and idleness of tasks, as well
as dependencies, data transfers, etc. You may have to zoom in to actually focus
on the computation part, and not the lengthy CUDA initialization.
</li>
<li>
a <tt>dag.dot</tt> file, which contains the graph of all the tasks
submitted by the application. It can be opened by using Graphviz.
</li>
<li>
an <tt>activity.data</tt> file, which records the activity of all processing
units over time.
</li>
</ul>
</p>
</div>
</div>

<div class="section">
<h2>Other Materials</h2>
<p>
<ul>
<li> <a href="slides/00_intro_runtimes.pdf">Introduction to Runtime Systems</a>
</li>
<li> <a href="http://www.open-mpi.org/projects/hwloc/tutorials/">hwloc</a>
</li>
<li> Kaapi: <a href="http://kaapi.gforge.inria.fr/dokuwiki/doku.php">here</a>
  and <a href="http://kaapi.gforge.inria.fr/dokuwiki/doku.php?id=documentation">here</a>
</li>
<li> StarPU:
<ul>
<Li> <a href="slides/01_introducing_starpu.pdf">StarPU - Part. 1 – Introducing StarPU</a></li>
<Li> <a href="slides/02_mastering_starpu.pdf">StarPU - Part. 2 – Mastering StarPU</a></li>
</ul>
</li>
<li> <a href="http://eztrace.gforge.inria.fr/tutorials/index.html">EzTrace</a>
</li>
</ul>

</p>
</div>

<div class="section bot">
<p class="updated">
  Last updated on 2014/05/19.
</p>
</div>
</body>
</html>