<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<HEAD>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<TITLE>StarPU</TITLE>
<link rel="stylesheet" type="text/css" href="style.css" />
<link rel="Shortcut icon" href="http://www.inria.fr/extension/site_inria/design/site_inria/images/favicon.ico" type="image/x-icon" />
</HEAD>

<body>

<div class="title">
<h1><a href="./">StarPU</a></h1>
<h2>A Unified Runtime System for Heterogeneous Multicore Architectures</h2>
</div>

<div class="menu">
<a href="http://runtime.bordeaux.inria.fr/">RUNTIME TEAM</a> |
&nbsp; &nbsp; &nbsp;
|
<a href="#overview">Overview</a> |
<a href="#news">News</a> |
<a href="#contact">Contact</a> |
<a href="#features">Features</a> |
<a href="#software">Software</a> |
<a href="#tryit">Try it!</a> |
<a href="#publications">Publications</a> |
<a href="internships/">Jobs/Interns</a> |
<a href="files/">Download</a> |
<a href="tutorials">Tutorials</a> |
<a href="https://wiki.bordeaux.inria.fr/runtime/doku.php?id=starpu">Intranet</a>
</div>

<div class="section" id="overview">
<h3>Overview</h3>
  <p>
<span class="important">StarPU is a task programming library for hybrid architectures</span>
<ol>
<li><b>The application provides algorithms and constraints</b>
    <ul>
    <li>CPU/GPU implementations of tasks</li>
    <li>A graph of tasks, using either StarPU's high-level <b>GCC plugin</b> pragmas or StarPU's rich <b>C API</b></li>
    </ul>
<br>
</li>
<li><b>StarPU handles run-time concerns</b>
    <ul>
    <li>Task dependencies</li>
    <li>Optimized heterogeneous scheduling</li>
    <li>Optimized data transfers and replication between main memory and discrete memories</li>
    <li>Optimized cluster communications</li>
    </ul>
</li>
</ol>
</p>
<p>
<span class="important">Rather than handling low-level issues, <b>programmers can concentrate on algorithmic concerns!</b></span>
</p>

<p>
<span class="note">The StarPU documentation is available in <a href="./doc/starpu.pdf">PDF</a> and in <a href="./doc/index.html">HTML</a>.</span> Please note that these documents are up-to-date with the latest release of StarPU.
</p>
</div>

<div class="section emphasize newslist" id="news">
<h3>News</h3>
<p>
September 2015 <b>&raquo;&nbsp;</b><a href="http://gforge.inria.fr/frs/?group_id=1570"><b>The
      v1.1.5 release of StarPU is now available!</b></a> This release notably brings the concept of
      scheduling contexts, which allow computation resources to be
      separated.
</p>
<p>
August 2015 <b>&raquo;&nbsp;</b><a href="http://gforge.inria.fr/frs/?group_id=1570"><b>The
      fourth release candidate of the v1.2.0 release of StarPU is now
      available!</b></a>
      This release notably brings out-of-core support, MIC Xeon
      Phi support, OpenMP runtime support, and a new internal
      communication system for MPI.
</p>
<p>
July 2015 <b>&raquo;&nbsp;</b><a href="http://gforge.inria.fr/frs/?group_id=1570"><b>The
      third release candidate of the v1.2.0 release of StarPU is now
      available!</b></a>
      This release notably brings out-of-core support, MIC Xeon
      Phi support, OpenMP runtime support, and a new internal
      communication system for MPI.
</p>
<p>
May 2015 <b>&raquo;&nbsp;</b><a href="http://gforge.inria.fr/frs/?group_id=1570"><b>The
      second release candidate of the v1.2.0 release of StarPU is now
      available!</b></a>
      This release notably brings out-of-core support, MIC Xeon
      Phi support, OpenMP runtime support, and a new internal
      communication system for MPI.
</p>
<p>
April 2015 <b>&raquo;&nbsp;</b>A <a href="https://events.prace-ri.eu/event/339/">tutorial</a> on runtime systems including
StarPU will be given at INRIA Bordeaux in June 2015.
</p>
<p>
March 2015 <b>&raquo;&nbsp;</b><a href="http://gforge.inria.fr/frs/?group_id=1570"><b>The
      first release candidate of the v1.2.0 release of StarPU is now
      available!</b></a>
      This release notably brings out-of-core support, MIC Xeon
      Phi support, OpenMP runtime support, and a new internal
      communication system for MPI.
</p>
<p>
March 2015 <b>&raquo;&nbsp;</b><a href="http://gforge.inria.fr/frs/?group_id=1570"><b>The
      v1.1.4 release of StarPU is now available!</b></a> This release notably brings the concept of
      scheduling contexts, which allow computation resources to be
      separated.
</p>
<p>
September 2014 <b>&raquo;&nbsp;</b><a href="http://gforge.inria.fr/frs/?group_id=1570"><b>The
      v1.1.3 release of StarPU is now available!</b></a> This release notably brings the concept of
      scheduling contexts, which allow computation resources to be
      separated.
</p>
</div>

<div class="section emphasizebot" style="text-align: right; font-style: italic;">
Get the latest StarPU news by subscribing to the <a href="http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/starpu-announce">starpu-announce mailing list</a>.
See also the full <a href="news/">news</a>.
</div>

<div class="section" id="contact">
<h3>Contact</h3>
<p>For any questions regarding StarPU, please contact the StarPU developers mailing list.</p>
<pre>
<a href="mailto:starpu-devel@lists.gforge.inria.fr?subject=StarPU">starpu-devel@lists.gforge.inria.fr</a>
</pre>
</div>

<div class="section" id="features">
<h3>Features</h3>

<h4>Portability</h4>
  <p>
Portability is obtained by means of a unified abstraction of the machine.
StarPU offers a unified offloadable task abstraction named <em>codelet</em>. Rather
than rewriting the entire code, programmers can encapsulate existing functions
within codelets. If a codelet can run on heterogeneous architectures, <b>it
is possible to specify one function for each architecture</b> (e.g. one function
for CUDA and one function for CPUs). StarPU takes care of scheduling and
executing those codelets as efficiently as possible over the entire machine, including
multiple GPUs.
One can even specify <b>several functions for each architecture</b> (new in
v1.0) as well as
<b>parallel implementations</b> (e.g. in OpenMP), and StarPU will
automatically determine which version is best for each input size (new in v0.9).
  </p>
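<p>
The codelet idea can be illustrated with a minimal Python sketch. This is not the StarPU C API: the <tt>Codelet</tt> class, its methods and the timings are hypothetical names and made-up numbers, chosen only to show one task carrying several implementations and a per-input-size choice based on measured times.
</p>

```python
import math


class Codelet:
    """Hypothetical stand-in for a codelet: one named task, several implementations."""

    def __init__(self, name):
        self.name = name
        self.impls = {}      # architecture name -> callable
        self.history = {}    # (architecture, input size) -> list of measured seconds

    def add_impl(self, arch, func):
        self.impls[arch] = func

    def record(self, arch, size, seconds):
        self.history.setdefault((arch, size), []).append(seconds)

    def best_arch(self, size):
        """Return the architecture with the lowest mean measured time for this size."""
        def mean_time(arch):
            samples = self.history.get((arch, size))
            return sum(samples) / len(samples) if samples else math.inf
        return min(sorted(self.impls), key=mean_time)


def scale_cpu(values, factor):
    return [v * factor for v in values]


def scale_gpu(values, factor):
    # Placeholder for an accelerator kernel: same result, different cost profile.
    return [v * factor for v in values]


scale = Codelet("scale")
scale.add_impl("cpu", scale_cpu)
scale.add_impl("cuda", scale_gpu)

# Made-up timings: the accelerator only pays off on large inputs.
scale.record("cpu", 1_000, 2e-6)
scale.record("cuda", 1_000, 20e-6)
scale.record("cpu", 1_000_000, 2e-3)
scale.record("cuda", 1_000_000, 1e-4)
```

<p>
With these numbers, <tt>scale.best_arch(1_000)</tt> selects the CPU version and <tt>scale.best_arch(1_000_000)</tt> selects the CUDA version.
</p>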

<h4>Data transfers</h4>
  <p>
To relieve programmers from the burden of explicit data transfers, a high-level
data management library enforces memory coherency over the machine: before a
codelet starts (e.g. on an accelerator), all its <b>data are automatically made
available on the compute resource</b>. Data are also kept on devices such as GPUs as long as
they are needed for further tasks. When a device runs out of memory, StarPU uses
an LRU strategy to <b>evict unused data</b>. StarPU also takes care of <b>automatically
prefetching</b> data, which makes it possible to <b>overlap data transfers with computations</b>
(including GPU-GPU direct transfers) and get the most out of the architecture.
  </p>
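<p>
As an illustration of LRU eviction, here is a toy Python model of one device memory. It is a sketch under simplifying assumptions, not StarPU's implementation: the <tt>DeviceMemory</tt> class is hypothetical, and it ignores write-back of modified data and blocks pinned by running tasks.
</p>

```python
from collections import OrderedDict


class DeviceMemory:
    """Toy model of one device's memory holding replicas of data blocks."""

    def __init__(self, capacity):
        self.capacity = capacity       # total bytes available on the device
        self.used = 0
        self.blocks = OrderedDict()    # block name -> size, least recently used first
        self.evicted = []              # names evicted so far, in order

    def access(self, name, size):
        """Make `name` resident, evicting least recently used blocks if needed."""
        if name in self.blocks:
            self.blocks.move_to_end(name)   # mark as most recently used
            return
        if size > self.capacity:
            raise MemoryError(f"{name} does not fit on the device")
        while self.used + size > self.capacity:
            victim, victim_size = self.blocks.popitem(last=False)
            self.used -= victim_size
            self.evicted.append(victim)
        self.blocks[name] = size
        self.used += size


mem = DeviceMemory(capacity=100)
mem.access("A", 40)
mem.access("B", 40)
mem.access("A", 40)   # touch A again, so B becomes the least recently used block
mem.access("C", 40)   # does not fit next to A and B, so B is evicted
```

<p>
After these accesses the device holds A and C, and B is the only block that was evicted.
</p>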

<h4>Dependencies</h4>
  <p>
Dependencies between tasks can be expressed in several ways, to give the
programmer as much flexibility as possible:
  <ul>
    <li><b>explicitly</b> between pairs of tasks,</li>
    <li>explicitly through <b>tags</b> which act as rendez-vous points between
    tasks (thus including tasks which have not been created yet),</li>
    <li><b>implicitly</b> from RAW, WAW, and WAR data dependencies.</li>
  </ul>
  </p>
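<p>
The implicit mode can be sketched as follows: given tasks in submission order, each declaring how it accesses its data, the RAW, WAW and WAR dependencies follow mechanically. The function name and the <tt>"R"</tt>/<tt>"W"</tt>/<tt>"RW"</tt> mode strings are illustrative assumptions, not StarPU identifiers.
</p>

```python
def infer_dependencies(tasks):
    """Infer task dependencies from data accesses.

    tasks: list of (name, {data: mode}) in submission order,
           with mode one of "R", "W" or "RW".
    Returns a dict mapping each task name to the set of tasks it must wait for.
    """
    last_writer = {}     # data -> name of the last task that wrote it
    readers_since = {}   # data -> tasks that read it since the last write
    deps = {}
    for name, accesses in tasks:
        deps[name] = set()
        for data, mode in accesses.items():
            if "R" in mode and data in last_writer:
                deps[name].add(last_writer[data])                # read after write
            if "W" in mode:
                if data in last_writer:
                    deps[name].add(last_writer[data])            # write after write
                deps[name].update(readers_since.get(data, ()))   # write after read
        # Update the bookkeeping once all accesses of this task were examined.
        for data, mode in accesses.items():
            if "W" in mode:
                last_writer[data] = name
                readers_since[data] = set()
            else:
                readers_since.setdefault(data, set()).add(name)
    return deps


deps = infer_dependencies([
    ("t1", {"x": "W"}),
    ("t2", {"x": "R"}),
    ("t3", {"x": "R"}),
    ("t4", {"x": "W"}),
    ("t5", {"y": "W"}),
])
```

<p>
Here <tt>t2</tt> and <tt>t3</tt> only wait for <tt>t1</tt> and may run concurrently, <tt>t4</tt> waits for all three, and <tt>t5</tt> is independent.
</p>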
  <p>
StarPU also supports an OpenMP-like <a href="doc/html/DataManagement.html#DataReduction">reduction</a> access mode (new in v0.9).
  </p>
  <p>
It also supports a <a href="doc/html/DataManagement.html#DataCommute">commute</a> access mode to allow data access commutativity (new in v1.2).
  </p>

<h4>Heterogeneous Scheduling</h4>
  <p>
StarPU obtains
portable performance by efficiently (and easily) using all computing resources
at the same time. StarPU also takes advantage of the <b>heterogeneous</b> nature of a
machine, for instance by using scheduling strategies based on auto-tuned
performance models. These determine the relative performance achieved
by the different processing units for the various kinds of tasks, and thus
make it possible to <b>automatically let processing units execute the tasks they are best suited for</b>.
Various strategies and variants are available: dmda (a data-aware MCT strategy,
thus similar to HEFT, but which starts executing tasks before the whole task graph is
submitted, thus allowing dynamic task submission), eager, locality-aware
work-stealing, etc. The overhead per task is typically on the order of
a microsecond. Tasks should thus be a few orders of magnitude
bigger, such as 100 microseconds or 1 millisecond, to make the overhead
negligible.
  </p>
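<p>
A data-aware MCT (minimum completion time) decision can be sketched in a few lines of Python. This is a simplified illustration rather than the dmda implementation: the worker description, its field names and all numbers are assumptions made for the example.
</p>

```python
def pick_worker(task_kind, data_bytes, workers):
    """Return the name of the worker with the minimum estimated completion time.

    workers maps a worker name to a dict with:
      ready_at  - time (s) at which the worker finishes its already queued work
      exec_time - estimated execution time (s) per task kind (the performance model)
      bandwidth - bytes/s for fetching data the worker does not hold yet
      has_data  - True if the task's data is already in the worker's memory
    """
    def completion_time(name):
        w = workers[name]
        transfer = 0.0 if w["has_data"] else data_bytes / w["bandwidth"]
        return w["ready_at"] + transfer + w["exec_time"][task_kind]
    return min(sorted(workers), key=completion_time)


# Made-up platform: a slow CPU core holding the data, and a fast GPU that must fetch it.
workers = {
    "cpu0": {"ready_at": 0.0, "exec_time": {"gemm": 0.010},
             "bandwidth": 1.0, "has_data": True},
    "gpu0": {"ready_at": 0.002, "exec_time": {"gemm": 0.001},
             "bandwidth": 8e9, "has_data": False},
}
choice = pick_worker("gemm", 8_000_000, workers)
```

<p>
With these numbers the GPU wins despite the transfer (about 4 ms against 10 ms). If the GPU queue were already full for the next 20 ms, the same call would pick the CPU core instead.
</p>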

<h4>Clusters</h4>
  <p>
To deal with clusters, StarPU integrates with <a href="doc/html/MPISupport.html">MPI</a> through
explicit network communications, which are then <b>automatically combined and
overlapped</b> with the intra-node data transfers and computation. The application
can also just provide the whole task graph and a data distribution over MPI nodes, and StarPU
will automatically determine which MPI node should execute which task, and
<b>generate all required MPI communications</b> accordingly (new in v0.9). We
have obtained excellent scaling on a 144-node cluster with GPUs, but have not yet
had the opportunity to test on a larger cluster. We have however measured
that with naive task submission, it should scale to a thousand nodes, and with
pruning-tuned task submission, it should scale to about a million nodes.
  </p>

<h4>Out of core</h4>
  <p>
When memory is not big enough for the working set, one may have to resort to
using disks. StarPU makes this seamless thanks to its <a href="doc/html/OutOfCore.html">out of core support</a> (new in 1.2).
StarPU will automatically evict data from the main memory in advance, and
prefetch back required data before it is needed for tasks.
  </p>

<h4>Extensions to the C Language</h4>
<p>
  StarPU comes with a GCC plug-in
  that <a href="doc/html/cExtensions.html">extends the C programming
  language</a> with pragmas and attributes that make it easy
  to <b>annotate a sequential C program to turn it into a parallel
  StarPU program</b> (new in v1.0).
</p>

<h4>OpenCL-compatible interface</h4>
<p>
  StarPU provides an <a href="doc/html/SOCLOpenclExtensions.html">OpenCL-compatible interface, SOCL</a>,
  which makes it possible to run OpenCL applications on top of StarPU (new in v1.0).
</p>

<h4>Simulation support</h4>
<p>
  StarPU can very accurately simulate an application execution
  and measure the resulting performance thanks to the
  <a href="http://simgrid.gforge.inria.fr">SimGrid simulator</a> (new in v1.1).  This makes it possible
  to quickly experiment with various scheduling heuristics, various application
  algorithms, and even various platforms (available GPUs and CPUs, available
  bandwidth)!
</p>

<h4>All in all</h4>
  <p>
All that means that, with the help
of <a href="doc/html/cExtensions.html">StarPU's extensions to the C
language</a>, the following sequential source code of a tiled version of
the classical Cholesky factorization algorithm using BLAS is also valid
StarPU code, possibly running on all the CPUs and GPUs, and given a data
distribution over MPI nodes, it is even a distributed version!
  </p>

  <tt><pre>
for (k = 0; k < tiles; k++) {
  potrf(A[k,k])
  for (m = k+1; m < tiles; m++)
    trsm(A[k,k], A[m,k])
  for (m = k+1; m < tiles; m++)
    syrk(A[m,k], A[m, m])
  for (m = k+1; m < tiles; m++)
    for (n = k+1; n < m; n++)
      gemm(A[m,k], A[n,k], A[m,n])
}</pre></tt>
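<p>
To check the loop nest itself, here is a runnable Python transcription under one strong simplification: every tile is 1x1, so each BLAS kernel degenerates to scalar arithmetic (<tt>potrf</tt> becomes a square root, <tt>trsm</tt> a division, <tt>syrk</tt> and <tt>gemm</tt> a multiply-subtract). It runs sequentially and does not involve StarPU. The function name is hypothetical.
</p>

```python
import math


def cholesky_tiles(A):
    """In-place right-looking Cholesky on the lower triangle, with 1x1 tiles."""
    tiles = len(A)
    for k in range(tiles):
        A[k][k] = math.sqrt(A[k][k])              # potrf(A[k,k])
        for m in range(k + 1, tiles):
            A[m][k] = A[m][k] / A[k][k]           # trsm(A[k,k], A[m,k])
        for m in range(k + 1, tiles):
            A[m][m] -= A[m][k] * A[m][k]          # syrk(A[m,k], A[m,m])
        for m in range(k + 1, tiles):
            for n in range(k + 1, m):
                A[m][n] -= A[m][k] * A[n][k]      # gemm(A[m,k], A[n,k], A[m,n])
    return A


# Lower triangle of a symmetric positive definite matrix (upper part left at zero).
L = cholesky_tiles([
    [4.0, 0.0, 0.0],
    [12.0, 37.0, 0.0],
    [-16.0, -43.0, 98.0],
])
```

<p>
The result is the lower-triangular factor with rows (2, 0, 0), (6, 1, 0) and (-8, 5, 3). Each call only reads and writes the tiles named in its arguments, which is the information a runtime needs to infer the task graph.
</p>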

<h4>Supported Architectures</h4>
<ul>
<li>SMP/Multicore Processors (x86, PPC, ...) </li>
<li>NVIDIA GPUs (e.g. heterogeneous multi-GPU), with pipelined and concurrent kernel execution support (new in v1.2) and GPU-GPU direct transfers (new in 1.1)</li>
<li>OpenCL devices</li>
<li>Cell Processors (experimental)</li>
</ul>
and soon (in v1.2)
<ul>
<li>Intel SCC</li>
<li>Intel MIC / Xeon Phi</li>
</ul>

<h4>Supported Operating Systems</h4>
<ul>
<li>GNU/Linux</li>
<li>Mac OS X</li>
<li>Windows</li>
</ul>

<h4>Performance analysis tools</h4>
  <p>
In order to understand the performance obtained by StarPU, it is helpful to
visualize the actual behaviour of the applications running on complex
heterogeneous multicore architectures.  StarPU therefore makes it possible to
generate Pajé traces that can be visualized thanks to the <a
href="http://vite.gforge.inria.fr/"><b>ViTE</b> (Visual Trace Explorer) open
source tool.</a>
  </p>

<p>
<b>Example:</b> LU decomposition on 3 CPU cores and a GPU using a very simple
greedy scheduling strategy. The green (resp. red) sections indicate when the
corresponding processing unit is busy (resp. idle). The number of ready tasks
is displayed in the curve on top: it appears that with this scheduling policy,
the algorithm suffers from a certain lack of parallelism. <b>Measured speed: 175.32
GFlop/s</b>
<center><a href="./images/greedy-lu-16k-fx5800.png"> <img src="./images/greedy-lu-16k-fx5800.png" alt="LU decomposition (greedy)" width="75%"></a></center>
</p>

<p>
This second trace depicts the behaviour of the same application using a
scheduling strategy trying to minimize load imbalance thanks to auto-tuned
performance models and to keep data locality as high as possible. In this
example, the Pajé trace clearly shows that this scheduling strategy outperforms
the previous one in terms of processor usage. <b>Measured speed: 239.60
GFlop/s</b>
<center><a href="./images/dmda-lu-16k-fx5800.png"><img src="./images/dmda-lu-16k-fx5800.png" alt="LU decomposition (dmda)" width="75%"></a></center>
</p>

<p>
<a href="http://www.hlrs.de/temanejo">Temanejo</a> can be used to debug the task
graph, as shown below (new in v1.1).
</p>

<center>
<a href="images/temanejo.png"><img src="images/temanejo.png" width="50%"/></a>
</center>

</div>

<div class="section" id="software">
<h3>Software using StarPU</h3>

<p>
Several software packages are known to use StarPU to tackle heterogeneous
architectures. Here is a non-exhaustive list:
</p>

<ul>
	<li><a href="http://icl.cs.utk.edu/magma/">MAGMA</a>, dense linear algebra library, starting from version 1.1</li>
	<li><a href="https://project.inria.fr/chameleon/">Chameleon</a>, dense linear algebra library</li>
	<li><a href="http://www.ida.liu.se/~chrke/skepu/">SkePU</a>, a skeleton programming framework.</li>
	<li><a href="http://pastix.gforge.inria.fr/">PaStiX</a>, sparse linear algebra library, starting from version 5.2.1</li>
</ul>

<p>
You can find below the list of publications related to applications
using StarPU.
</p>

</div>

<div class="section" id="tryit">
<h3>Give it a try!</h3>
<p>
You can easily try out the performance on, for instance, the Cholesky
factorization. Make sure that pkg-config and
<a href="http://www.open-mpi.org/projects/hwloc/">hwloc</a>
are installed for
proper CPU control, and that BLAS kernels for your computation units are installed and configured in
your environment (e.g. MKL for CPUs and CUBLAS for GPUs).
</p>

<tt><pre>
$ wget http://starpu.gforge.inria.fr/files/starpu-someversion.tar.gz
$ tar xf starpu-someversion.tar.gz
$ cd starpu-someversion
$ ./configure
$ make -j 12
$ STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
$ STARPU_SCHED=dmdas mpirun -np 4 -machinefile mymachines ./mpi/examples/matrix_decomposition/mpi_cholesky_distributed -size $((960*40*4)) -nblocks $((40*4))</pre></tt>

<p>Note that the dmdas scheduler uses performance models, and thus needs
calibration runs before exhibiting optimized performance (repeat the execution until the "model
something is not calibrated enough" messages go away).</p>

<p>To get a glimpse of what happened, you can get an execution trace by
installing
<a href="http://savannah.nongnu.org/projects/fkt">FxT</a>
and <a href="http://vite.gforge.inria.fr/">ViTE</a>, and enabling traces:
</p>

<tt><pre>
$ ./configure --with-fxt
$ make -j 12
$ STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
$ ./tools/starpu_fxt_tool -i /tmp/prof_file_${USER}_0
$ vite paje.trace
</pre></tt>

<p>
Starting with StarPU 1.1, it is also possible to reproduce the performance that
we show in our articles on our machines, by installing simgrid, and then using
the simulation mode of StarPU using the performance models of our machines:
</p>
  <tt><pre>
$ ./configure --enable-simgrid
$ make -j 12
$ STARPU_PERF_MODEL_DIR=$PWD/tools/perfmodels/sampling STARPU_HOSTNAME=mirage STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
# size	ms	GFlops
38400	10216	1847.6</pre></tt>
<p>(MPI simulation is not supported yet)</p>
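<p>
The sample output above can be cross-checked by hand. The sketch below assumes the usual flop count for a Cholesky factorization of an n x n matrix, n<sup>3</sup>/3 + n<sup>2</sup>/2 + n/6. That this is the formula used by the example program is an assumption, and the function name is hypothetical.
</p>

```python
def cholesky_gflops(n, milliseconds):
    """GFlop/s for an n x n Cholesky factorization, assuming n^3/3 + n^2/2 + n/6 flops."""
    flops = n**3 / 3 + n**2 / 2 + n / 6
    return flops / (milliseconds / 1000) / 1e9


size = 960 * 40                         # same value as $((960*40)) on the command line
rate = cholesky_gflops(size, 10216)     # time in ms taken from the sample output
```

<p>
Under that assumption, a 38400 x 38400 matrix factorized in 10216 ms gives a rate that rounds to the 1847.6 GFlop/s shown in the output.
</p>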

</div>

<div class="section" id="publications">
<h3>Publications</h3>
<p>
All StarPU related publications are also
listed <a href="http://runtime.bordeaux.inria.fr/Publis/Keyword/STARPU.html">here</a>
with the corresponding Bibtex entries.
</p>

<p>A good overview is available in
the following <a href="http://hal.archives-ouvertes.fr/inria-00467677">Research Report</a>.
</p>

<h4>General presentations</h4>
<ol>
<li>
C. Augonnet.
<br/>
<b>Scheduling Tasks over Multicore machines enhanced with Accelerators: a Runtime System's Perspective</b>.
PhD thesis, Université Bordeaux 1, December 2011.
<br/>
Available <a href="http://tel.archives-ouvertes.fr/tel-00777154">here</a>.
</li>
<li>
C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier.
<br/>
<b>StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures.</b>
<em>Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009</em>, 23:187-198, February 2011.
<br/>
Available <a href="http://hal.inria.fr/inria-00550877">here</a>.
</li>
<li>
C. Augonnet.
<br/>
<b>StarPU: un support exécutif unifié pour les architectures multicoeurs hétérogènes</b>.
In <em>19èmes Rencontres Francophones du Parallélisme</em>, September 2009. Note: Best Paper Award.
<br/>
Available <a href="http://hal.inria.fr/inria-00411581">here</a>. (French version)
</li>
<li>
C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier.
<br/>
<b>StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures.</b>
In <em>Proceedings of the 15th International Euro-Par Conference</em>, volume 5704 of LNCS, August 2009.
<br/>
Available <a href="http://hal.inria.fr/inria-00384363">here</a>. (short version)
</li>

<li>
C. Augonnet and R. Namyst.
<br/>
<b>A unified runtime system for heterogeneous multicore architectures.</b>
In <em>Proceedings of the International Euro-Par Workshops 2008, HPPC'08</em>, volume 5415 of LNCS, August 2008.
<br/>
Available <a href="http://hal.inria.fr/inria-00326917">here</a>. (early version)
</li>
</ol>

<h4>On Composability</h4>
<ol>
<li>
A. Hugo, A. Guermouche, R. Namyst, and P.-A. Wacrenier.
<br/>
<b>Composing multiple StarPU applications over heterogeneous machines:
  a supervised approach.</b> In <em>Third International Workshop on
  Accelerators and Hybrid Exascale Systems</em>, Boston, USA, May
2013.
<br/>
Available <a href="http://hal.inria.fr/hal-00824514">here</a>.
</li>

<li>
A. Hugo.
<br/>
<b>Le problème de la composition parallèle : une approche
  supervisée.</b> In <em>21èmes Rencontres Francophones du
  Parallélisme (RenPar'21)</em>, Grenoble, France, January 2013.
<br/>
Available <a href="http://hal.inria.fr/hal-00773610">here</a>.
</li>
</ol>

<h4>On Scheduling</h4>
<ol>
<li>
E. Agullo, O. Beaumont, L. Eyraud-Dubois, J. Herrmann, S. Kumar, L. Marchal, and
S. Thibault.
<br/>
<b>Bridging the Gap between Performance and Bounds of Cholesky Factorization on
Heterogeneous Platforms.</b> In <em>Heterogeneity in Computing Workshop
2015</em>, Hyderabad, India, May 2015.
<br/>
Available <a href="https://hal.inria.fr/hal-01120507">here</a>.
</li>

<li>
M. Sergent and S. Archipoff.
<br/>
<b>Modulariser les ordonnanceurs de tâches : une approche structurelle.</b> In
<em>Conférence d’informatique en Parallélisme, Architecture et Système
	(Compas'2014)</em>, Neuchâtel, Switzerland, April 2014.  
<br/>
Available <a href="http://hal.inria.fr/hal-00978364">here</a>.
</li>
</ol>

<h4>On the C Extensions</h4>
<ol>
<LI>
L. Courtès.
<br/>
<b>C Language Extensions for Hybrid CPU/GPU Programming with
  StarPU.</b>
<br/>
Available <a href="http://hal.inria.fr/hal-00807033/en">here</a>.
</li>
</ol>

<h4>On OpenMP support on top of StarPU</h4>
<ol>
<LI>
P. Virouleau, P. Brunet, F. Broquedis, N. Furmento, S. Thibault, O. Aumage, and T. Gautier.
<br/>
<b>Evaluation of OpenMP Dependent Tasks with the KASTORS Benchmark Suite.</b> In
<em>10th International Workshop on OpenMP, IWOMP2014</em>, September 2014.
<br/>
Available <a href="https://hal.inria.fr/hal-01081974">here</a>.
</li>
</ol>

<h4>On MPI support</h4>
<ol>
<li>
C. Augonnet, O. Aumage, N. Furmento, R. Namyst, and S. Thibault.
<br/>
<b>StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators.</b>
INRIA Research Report RR-8538, May 2014.
<br/>
Available <a href="http://hal.inria.fr/hal-00992208">here</a>.
</li>
<li>
C. Augonnet, O. Aumage, N. Furmento, R. Namyst, and S. Thibault.
<br/>
<b>StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators.</b>
In <em>EuroMPI 2012</em>, volume 7490 of LNCS, September 2012. Note: Poster Session.
<br/>
Available <a href="http://hal.inria.fr/hal-00725477">here</a>.
</li>
</ol>

<h4>On data transfer management</h4>
<ol>
<li>
C. Augonnet, J. Clet-Ortega, S. Thibault, and R. Namyst.
<br/>
<b>Data-Aware Task Scheduling on Multi-Accelerator based Platforms.</b>
In <em>The 16th International Conference on Parallel and Distributed Systems (ICPADS)</em>, December 2010.
<br/>
Available <a href="http://hal.inria.fr/inria-00523937">here</a>.
</li>
</ol>

<h4>On performance model tuning</h4>
<ol>
<li>
C. Augonnet, S. Thibault, and R. Namyst.
<br/>
<b>Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures.</b>
In <em>Proceedings of the International Euro-Par Workshops 2009, HPPC'09</em>, volume 6043 of LNCS, August 2009.
<br/>
Available <a href="http://hal.inria.fr/inria-00421333">here</a>.
</li>
</ol>

<h4>On the simulation support through SimGrid</h4>
<ol>
<li>
L. Stanisic, S. Thibault, A. Legrand, B. Videau, and J.-F. Méhaut.<br/>
<b>Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures.</b>
In <em>Euro-par 2014 - 20th International Conference on Parallel Processing</em>, Porto, Portugal, August 2014.<br/>
Available <a href="http://hal.inria.fr/hal-01011633">here</a>.
</li>
</ol>

<h4>On the Cell support</h4>
<ol>
<li>
C. Augonnet, S. Thibault, R. Namyst, and M. Nijhuis.
<br/>
<b>Exploiting the Cell/BE architecture with the StarPU unified runtime system.</b>
In <em>SAMOS Workshop - International Workshop on Systems, Architectures, Modeling, and Simulation</em>, volume 5657 of LNCS, July 2009.
<br/>
Available <a href="http://hal.inria.fr/inria-00378705">here</a>.
</li>
</ol>

<h4>On Applications</h4>
<ol>

<li>
S. Henry, A. Denis, D. Barthou, M.-C. Counilh, and R. Namyst.<br/>
<b>Toward OpenCL Automatic Multi-Device Support.</b>
<em>Euro-Par 2014</em>, Porto, Portugal, August 2014.<br/>
Available <a href="http://hal.inria.fr/hal-01005765">here</a>.
</li>

<li>
X. Lacoste, M. Faverge, P. Ramet, S. Thibault, and G. Bosilca.<br/>
<b>Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes.</b>
<em>HCW'2014 workshop of IPDPS</em>, May 2014.<br/>
Available <a href="http://hal.inria.fr/hal-00987094">here</a>.
</li>

<li>
T. Odajima, T. Boku, M. Sato, T. Hanawa, Y. Kodama, R. Namyst, S. Thibault, and O. Aumage.<br/>
<b>Adaptive Task Size Control on High Level Programming for GPU/CPU Work Sharing.</b>
In <em>The 2013 International Symposium on Advances of Distributed and Parallel Computing (ADPC 2013)</em>, Vietri sul Mare, Italy,
December 2013.<br/>
Available <a href="http://hal.inria.fr/hal-00920915">here</a>.
</li>

<li>
S. Henry.<br/>
<b>Modèles de programmation et supports exécutifs pour architectures hétérogènes</b>.
PhD thesis, Université Bordeaux 1, November 2013.<br/>
Available <a href="http://tel.archives-ouvertes.fr/tel-00948309">here</a>.
</li>

<li>
S. Ohshima, S. Katagiri, K. Nakajima, S. Thibault, and R. Namyst.<br/>
<b>Implementation of FEM Application on GPU with StarPU.</b>
In <em>SIAM CSE13 - SIAM Conference on Computational Science and Engineering 2013</em>, Boston, USA,
February 2013.<br/>
Available <a href="http://hal.inria.fr/hal-00926144">here</a>.
</li>

<li>
C. Rossignon.<br/>
<b>Optimisation du produit matrice-vecteur creux sur architecture GPU
  pour un simulateur de réservoir.</b> In <em>21èmes Rencontres
  Francophones du Parallélisme (RenPar'21)</em>, Grenoble, France,
January 2013.<br/>
Available <a href="http://hal.inria.fr/hal-00773571">here</a>.
</li>

<li>
S. Henry, A. Denis, and D. Barthou.<br/>
<b>Programmation unifiée multi-accélérateur OpenCL</b>.
<em>Techniques et Sciences Informatiques</em>, (8-9-10):1233-1249, 2012.
<br/>
Available <a href="http://hal.inria.fr/hal-00772742">here</a>.
</li>

<li>
S.A. Mahmoudi, P. Manneback, C. Augonnet, and S. Thibault.<br/>
<b>Traitements d'Images sur Architectures Parallèles et Hétérogènes.</b>
<em>Technique et Science Informatiques</em>, 2012.
<br/>
Available <a href="http://hal.inria.fr/hal-00714858/">here</a>.
</li>

<li>
S. Benkner, S. Pllana, J.L. Träff, P. Tsigas, U. Dolinsky, C. Augonnet, B. Bachmayer, C. Kessler, D. Moloney, and V. Osipov.
<br/>
<b>PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems.</b> <em>IEEE Micro</em>, 31(5):28-41, September 2011.
<br/>
Available <a href="http://hal.inria.fr/hal-00648480">here</a>.
</li>

<li>
U. Dastgeer, C. Kessler, and S. Thibault.<br/>
<b>Flexible runtime support for efficient skeleton programming on hybrid systems.</b>
In <em>Proceedings of the International Conference on Parallel Computing (ParCo), Applications, Tools and Techniques on the Road to Exascale Computing</em>, volume 22 of Advances of Parallel Computing, August 2011.
<br/>
Available <a href="http://hal.inria.fr/inria-00606200/">here</a>.
</li>

<li>
S. Henry.
<br/>
<b>Programmation multi-accélérateurs unifiée en OpenCL.</b>
In <em>20èmes Rencontres Francophones du Parallélisme (RenPar'20)</em>, May 2011.
<br/>
Available <a href="http://hal.archives-ouvertes.fr/hal-00643257">here</a>.
</li>

<li>
S.A. Mahmoudi, P. Manneback, C. Augonnet, and S. Thibault.
<br/>
<b>Détection optimale des coins et contours dans des bases d'images volumineuses sur architectures multicoeurs hétérogènes.</b>
In <em>20èmes Rencontres Francophones du Parallélisme</em>, May 2011.
<br/>
Available <a href="http://hal.inria.fr/inria-00606195">here</a>.
</li>

<li>
E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, and S. Tomov.
<br/>
<b>A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs.</b>
In <em>GPU Computing Gems, volume 2</em>, September 2010.
<br/>
Available <a href="http://hal.inria.fr/inria-00547847">here</a>.
</li>
<li>
E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, and S. Tomov.
<br/>
<b>QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators</b>.
In <em>25th IEEE International Parallel & Distributed Processing Symposium (IEEE IPDPS 2011)</em>, May 2011.
<br/>
Available <a href="http://hal.inria.fr/inria-00547614">here</a>.
</li>
<li>
E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, J. Roman, S. Thibault, and S. Tomov.
<br/>
<b>Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators.</b>
In <em>Symposium on Application Accelerators in High Performance Computing (SAAHPC)</em>, July 2010.
<br/>
Available <a href="http://hal.inria.fr/inria-00547616">here</a>.
</li>
<li>
E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov.
<br/>
<b>LU factorization for accelerator-based systems.</b>
In <em>9th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 11)</em>, June 2011.
<br/>
Available <a href="http://hal.inria.fr/hal-00654193">here</a>.
</li>
</ol>

</div>

<div class="section bot">
<p class="updated">
  Last updated on 2012/10/03.
</p>
</div>

</body>
</html>