<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<title>StarPU</title>
<link rel="stylesheet" type="text/css" href="style.css" />
<link rel="Shortcut icon" href="http://www.inria.fr/extension/site_inria/design/site_inria/images/favicon.ico" type="image/x-icon" />
</head>

<body>

<div class="title">
<h1><a href="./">StarPU</a></h1>
<h2>A Unified Runtime System for Heterogeneous Multicore Architectures</h2>
</div>

<div class="menu">
<a href="/Runtime/">RUNTIME TEAM</a> |
&nbsp; &nbsp; &nbsp;
|
<a href="#overview">Overview</a> |
<a href="#news">News</a> |
<a href="#contact">Contact</a> |
<a href="#features">Features</a> |
<a href="#software">Software</a> |
<a href="#publications">Publications</a> |
<a href="internships/">Internships</a> |
<a href="files/">Download</a> |
<a href="tutorials">Tutorials</a> |
<a href="https://wiki.bordeaux.inria.fr/runtime/doku.php?id=starpu">Intranet</a>
</div>

<div class="section" id="overview">
<h3>Overview</h3>
  <p>
<span class="important">StarPU is a task programming library for hybrid architectures</span>
<ol>
<li><b>The application provides algorithms and constraints</b>
    <ul>
    <li>CPU/GPU implementations of tasks</li>
    <li>A graph of tasks, using either StarPU's high-level <b>GCC plugin</b> pragmas or StarPU's rich <b>C API</b></li>
    </ul>
<br>
</li>
<li><b>StarPU handles run-time concerns</b>
    <ul>
    <li>Task dependencies</li>
    <li>Optimized heterogeneous scheduling</li>
    <li>Optimized data transfers and replication between main memory and discrete memories</li>
    <li>Optimized cluster communications</li>
    </ul>
</li>
</ol>
</p>
<p>
<span class="important">Rather than handling low-level issues, <b>programmers can concentrate on algorithmic concerns!</b></span>
</p>
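<p>
As a very small illustration, a basic StarPU program built on the C API can look roughly like the sketch below. The kernel and variable names are ours, not taken from the StarPU distribution.
</p>

<tt><pre>
#include &lt;starpu.h&gt;

/* Illustrative kernel: scale a vector, reading its address and size from
   the StarPU buffer descriptor. */
static void scal_cpu(void *buffers[], void *cl_arg)
{
  float *v = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
  unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
  for (unsigned i = 0; i < n; i++)
    v[i] *= 2.0f;
}

static struct starpu_codelet scal_cl =
{
  .cpu_funcs = { scal_cpu, NULL },   /* CPU implementation of the codelet */
  .nbuffers  = 1,
  .modes     = { STARPU_RW },
};

int main(void)
{
  float vector[1024];
  starpu_data_handle_t handle;

  starpu_init(NULL);
  starpu_vector_data_register(&handle, 0, (uintptr_t) vector, 1024, sizeof(float));

  struct starpu_task *task = starpu_task_create();
  task->cl = &scal_cl;
  task->handles[0] = handle;
  starpu_task_submit(task);          /* scheduled asynchronously by StarPU */

  starpu_task_wait_for_all();
  starpu_data_unregister(handle);
  starpu_shutdown();
  return 0;
}</pre></tt>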

<p>
<span class="note">The StarPU documentation is available in <a href="./starpu.pdf">PDF</a> and in <a href="./starpu.html">HTML</a>.</span> Please note that these documents are up-to-date with the latest release of StarPU.
</p>
</div>

<div class="section emphasize newslist" id="news">
<h3>News</h3>
<p>
November 2012 <b>&raquo;&nbsp;</b> StarPU at SuperComputing'12: a
StarPU poster is on display at the Inria booth (#1209). Feel free to come and have a chat!
</p>
<p>
October 2012 <b>&raquo;&nbsp;</b><a href="http://gforge.inria.fr/frs/?group_id=1570"><b>The
      v1.0.4 release of StarPU is now available!</b></a> This release
      mainly brings bug fixes.
</p>
<p>
September 2012 <b>&raquo;&nbsp;</b> StarPU was presented at the <a href="http://www.par.univie.ac.at/conference/eurompi2012/">EuroMPI</a> conference.
</p>
<p>
September 2012  <b>&raquo;&nbsp;</b><a href="http://gforge.inria.fr/frs/?group_id=1570"><b>The
      v1.0.3 release of StarPU is now available!</b></a> This release
      mainly brings bug fixes.
</p>
<p>
August 2012  <b>&raquo;&nbsp;</b><a href="http://gforge.inria.fr/frs/?group_id=1570"><b>The
      v1.0.2 release of StarPU is now available!</b></a> This release
      notably fixes CPU/GPU binding.
</p>
<p>
July 2012 <b>&raquo;&nbsp;</b> StarPU was presented at the <a href="http://gcc.gnu.org/wiki/cauldron2012#StarPU.27s_C_Extensions_for_Hybrid_CPU.2BAC8-GPU_Task_Programming.2C_or.2C_An_Experience_in_Turning_a_Clumsy_API_Into_Language_Extensions">GNU Tools Cauldron 2012</a>.
</p>
</div>

<div class="section emphasizebot" style="text-align: right; font-style: italic;">
Get the latest StarPU news by subscribing to the <a href="http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/starpu-announce">starpu-announce mailing list</a>.
See also the <a href="news/">news archive</a>.
</div>

<div class="section" id="contact">
<h3>Contact</h3>
<p>For any questions regarding StarPU, please contact the StarPU developers' mailing list:</p>
<pre>
<a href="mailto:starpu-devel@lists.gforge.inria.fr?subject=StarPU">starpu-devel@lists.gforge.inria.fr</a>
</pre>
</div>

<div class="section" id="features">
<h3>Features</h3>

<h4>Portability</h4>
  <p>
Portability is obtained by means of a unified abstraction of the machine.
StarPU offers a unified offloadable task abstraction named <em>codelet</em>. Rather
than rewriting the entire code, programmers can encapsulate existing functions
within codelets. In case a codelet can run on heterogeneous architectures, <b>it
is possible to specify one function for each architecture</b> (e.g. one function
for CUDA and one function for CPUs). StarPU takes care of scheduling and
executing those codelets as efficiently as possible over the entire machine, including
multiple GPUs.
One can even specify <b>several functions for each architecture</b>, and StarPU will
automatically determine which version is best for each input size.
  </p>
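<p>
For instance, a codelet providing both a CPU and a CUDA implementation might be declared roughly as in the following hedged sketch (the kernel names are illustrative, not from the StarPU distribution). At run time, StarPU picks whichever implementation fits the processing unit the task is scheduled on.
</p>

<tt><pre>
#include &lt;starpu.h&gt;

/* Illustrative kernels: a plain C version and one that launches a CUDA kernel. */
extern void scal_cpu(void *buffers[], void *cl_arg);
extern void scal_cuda(void *buffers[], void *cl_arg);

static struct starpu_codelet scal_cl =
{
  .cpu_funcs  = { scal_cpu, NULL },
  .cuda_funcs = { scal_cuda, NULL },
  .nbuffers   = 1,
  .modes      = { STARPU_RW },
};</pre></tt>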

<h4>Data transfers</h4>
  <p>
To relieve programmers from the burden of explicit data transfers, a high-level
data management library enforces memory coherency over the machine: before a
codelet starts (e.g. on an accelerator), all its <b>data are automatically made
available on the compute resource</b>. Data are also kept on e.g. GPUs as long as
they are needed for further tasks. When a device runs out of memory, StarPU uses
an LRU strategy to <b>evict unused data</b>. StarPU also takes care of <b>automatically
prefetching</b> data, which makes it possible to <b>overlap data transfers with computations</b>
(including GPU-GPU direct transfers) to get the most out of the architecture.
  </p>
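<p>
A hedged sketch of how data is registered and then handed to tasks (function and variable names are ours; the codelet is the one sketched above):
</p>

<tt><pre>
#include &lt;starpu.h&gt;

extern struct starpu_codelet scal_cl;   /* the codelet sketched above */

void scale_on_whatever_unit_is_free(float *vector, unsigned n)
{
  starpu_data_handle_t handle;

  /* Describe the data to StarPU: home node 0 (main memory), n floats.
     From then on, StarPU moves and replicates it wherever tasks need it. */
  starpu_vector_data_register(&handle, 0, (uintptr_t) vector, n, sizeof(float));

  /* Tasks access the data through the handle and an access mode
     (STARPU_R, STARPU_W, STARPU_RW), from which transfers are deduced. */
  struct starpu_task *task = starpu_task_create();
  task->cl = &scal_cl;
  task->handles[0] = handle;
  starpu_task_submit(task);

  /* Unregistering waits for the tasks using the handle and brings the
     up-to-date data back to main memory. */
  starpu_data_unregister(handle);
}</pre></tt>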

<h4>Dependencies</h4>
  <p>
Dependencies between tasks can be expressed in several ways, to give the
programmer maximum flexibility:
  <ul>
    <li><b>explicitly</b> between pairs of tasks,</li>
    <li>explicitly through <b>tags</b> which act as rendez-vous points between
    tasks (thus including tasks which have not been created yet),</li>
    <li><b>implicitly</b> from RAW, WAW, and WAR data dependencies.</li>
  </ul>
  </p>
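<p>
For illustration, the first two mechanisms can be used roughly as in the following sketch (task and codelet names are ours):
</p>

<tt><pre>
#include &lt;starpu.h&gt;

extern struct starpu_codelet some_cl;   /* illustrative codelet */

void submit_dependent_tasks(void)
{
  struct starpu_task *task_a = starpu_task_create();
  struct starpu_task *task_b = starpu_task_create();
  task_a->cl = &some_cl;
  task_b->cl = &some_cl;

  /* 1. Explicitly, between pairs of tasks: task_b waits for task_a. */
  struct starpu_task *deps[] = { task_a };
  starpu_task_declare_deps_array(task_b, 1, deps);

  /* 2. Through tags, which may name tasks that do not exist yet:
        whatever task carries tag 43 will wait for tag 42 (task_b). */
  task_b->tag_id = 42;
  task_b->use_tag = 1;
  starpu_tag_declare_deps((starpu_tag_t) 43, 1, (starpu_tag_t) 42);

  starpu_task_submit(task_a);
  starpu_task_submit(task_b);
}</pre></tt>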
  <p>
StarPU also supports an OpenMP-like <a href="http://runtime.bordeaux.inria.fr/StarPU/starpu.html#Data-reduction">reduction</a> access mode.
  </p>

<h4>Heterogeneous Scheduling</h4>
  <p>
StarPU obtains
portable performance by efficiently (and easily) using all computing resources
at the same time. StarPU also takes advantage of the <b>heterogeneous</b> nature of a
machine, for instance by using scheduling strategies based on auto-tuned
performance models. These determine the relative performance achieved
by the different processing units for the various kinds of tasks, and thus
allow StarPU to <b>automatically let each processing unit execute the tasks it is best at</b>.
  </p>
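<p>
A hedged sketch of how an auto-tuned, history-based performance model can be attached to a codelet (the symbol and kernel names are ours):
</p>

<tt><pre>
#include &lt;starpu.h&gt;

extern void scal_cpu(void *buffers[], void *cl_arg);

/* Auto-tuned, history-based performance model: the symbol is the name
   under which StarPU stores the calibration results. */
static struct starpu_perfmodel scal_model =
{
  .type   = STARPU_HISTORY_BASED,
  .symbol = "scal_kernel",
};

static struct starpu_codelet scal_cl =
{
  .cpu_funcs = { scal_cpu, NULL },
  .nbuffers  = 1,
  .modes     = { STARPU_RW },
  .model     = &scal_model,   /* used e.g. by the dmda scheduling policy */
};</pre></tt>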

<h4>Clusters</h4>
  <p>
To deal with clusters, StarPU can nicely integrate with <a href="starpu.html#StarPU-MPI-support">MPI</a> through
explicit network communications, which will then be <b>automatically combined and
overlapped</b> with the intra-node data transfers and computation. The application
can also just provide the whole task graph and a data distribution over MPI nodes, and StarPU
will automatically determine which MPI node should execute which task, and
<b>generate all required MPI communications</b> accordingly.
  </p>
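<p>
A hedged sketch of what this can look like with the StarPU-MPI helpers, reusing the codelet sketched above (the exact helper names may differ between StarPU versions):
</p>

<tt><pre>
#include &lt;starpu_mpi.h&gt;

extern struct starpu_codelet scal_cl;   /* the codelet sketched above */

void run_on_owner(starpu_data_handle_t handle)
{
  /* Declare which MPI node owns this piece of data, and give it an MPI tag. */
  starpu_data_set_rank(handle, 1);
  starpu_data_set_tag(handle, 42);

  /* From the data distribution, StarPU decides which node executes the task
     and generates (and overlaps) the required MPI communications. */
  starpu_mpi_insert_task(MPI_COMM_WORLD, &scal_cl,
                         STARPU_RW, handle,
                         0);
}</pre></tt>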

<h4>Extensions to the C Language</h4>
<p>
  StarPU comes with a GCC plug-in
  that <a href="starpu.html#C-Extensions">extends the C programming
  language</a> with pragmas and attributes that make it easy
  to <b>annotate a sequential C program to turn it into a parallel
  StarPU program</b>.
</p>
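<p>
A rough, illustrative sketch of what an annotated program can look like; please refer to the <a href="starpu.html#C-Extensions">C Extensions</a> chapter of the documentation for the exact pragma and attribute syntax.
</p>

<tt><pre>
/* Declare a task and its CPU implementation with the plug-in's attributes
   (sketch only; names and exact syntax are illustrative). */
static void vector_scal(unsigned size, float vector[size], float factor)
  __attribute__ ((task));

static void vector_scal_cpu(unsigned size, float vector[size], float factor)
  __attribute__ ((task_implementation ("cpu", vector_scal)));

static void vector_scal_cpu(unsigned size, float vector[size], float factor)
{
  for (unsigned i = 0; i < size; i++)
    vector[i] *= factor;
}

int main(void)
{
  float vector[1024];
#pragma starpu initialize
#pragma starpu register vector
  vector_scal(1024, vector, 2.0f);   /* becomes an asynchronous StarPU task */
#pragma starpu wait
#pragma starpu shutdown
  return 0;
}</pre></tt>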

<h4>All in all</h4>
  <p>
All that means that, with the help
of <a href="starpu.html#C-Extensions">StarPU's extensions to the C
language</a>, the following sequential source code of a tiled version of
the classical Cholesky factorization algorithm using BLAS is also valid
StarPU code, possibly running on all the CPUs and GPUs; given a data
distribution over MPI nodes, it even becomes a distributed version!
  </p>

  <tt><pre>
for (k = 0; k < tiles; k++) {
  potrf(A[k,k])
  for (m = k+1; m < tiles; m++)
    trsm(A[k,k], A[m,k])
  for (m = k+1; m < tiles; m++)
    syrk(A[m,k], A[m,m])
  for (m = k+1; m < tiles; m++)
    for (n = k+1; n < m; n++)
      gemm(A[m,k], A[n,k], A[m,n])
}</pre></tt>

<h4>Supported Architectures</h4>
<ul>
<li>SMP/Multicore Processors (x86, PPC, ...) </li>
<li>NVIDIA GPUs (including heterogeneous multi-GPU setups)</li>
<li>OpenCL devices</li>
<li>Cell Processors (experimental)</li>
</ul>
and soon
<ul>
<li>Intel SCC</li>
<li>Intel MIC / Xeon Phi</li>
</ul>

<h4>Supported Operating Systems</h4>
<ul>
<li>GNU/Linux</li>
<li>Mac OS X</li>
<li>Windows</li>
</ul>

<h4>Performance analysis tools</h4>
  <p>
In order to understand the performance obtained by StarPU, it is helpful to
visualize the actual behaviour of the applications running on complex
heterogeneous multicore architectures.  StarPU therefore makes it possible to
generate Pajé traces that can be visualized with the <a
href="http://vite.gforge.inria.fr/"><b>ViTE</b> (Visual Trace Explorer)</a> open
source tool.
  </p>

<p>
<b>Example:</b> LU decomposition on 3 CPU cores and a GPU using a very simple
greedy scheduling strategy. The green (resp. red) sections indicate when the
corresponding processing unit is busy (resp. idle). The number of ready tasks
is displayed in the curve on top: it appears that with this scheduling policy,
the algorithm suffers from a certain lack of parallelism. <b>Measured speed: 175.32
GFlop/s</b>
<center><a href="./images/greedy-lu-16k-fx5800.png"> <img src="./images/greedy-lu-16k-fx5800.png" alt="LU decomposition (greedy)" width="75%"></a></center>
</p>

<p>
This second trace depicts the behaviour of the same application using a
scheduling strategy trying to minimize load imbalance thanks to auto-tuned
performance models and to keep data locality as high as possible. In this
example, the Pajé trace clearly shows that this scheduling strategy outperforms
the previous one in terms of processor usage. <b>Measured speed: 239.60
GFlop/s</b>
<center><a href="./images/dmda-lu-16k-fx5800.png"><img src="./images/dmda-lu-16k-fx5800.png" alt="LU decomposition (dmda)" width="75%"></a></center>
</p>

</div>

<div class="section" id="software">
<h3>Software using StarPU</h3>

<p>
Several pieces of software are known to be able to use StarPU to tackle heterogeneous
architectures; here is a non-exhaustive list:
</p>

<ul>
	<li><a href="http://icl.cs.utk.edu/magma/">MAGMA</a>, dense linear algebra library, starting from version 1.1</li>
	<li><a href="http://www.ida.liu.se/~chrke/skepu/">SkePU</a>, a skeleton programming framework.</li>
	<li><a href="http://pastix.gforge.inria.fr/">PaStiX</a>, sparse linear algebra library, starting from version 5.2.1</li>
</ul>

<p>
The publications related to applications using StarPU are listed in the
section below.
</p>

</div>

<div class="section" id="publications">
<h3>Publications</h3>
<p>
All StarPU-related publications are also
listed <a href="http://runtime.bordeaux.inria.fr/Publis/Keyword/STARPU.html">here</a>
together with the corresponding BibTeX entries.
</p>

<p>A good overview is available in
the following <a href="http://hal.archives-ouvertes.fr/inria-00467677">Research Report</a>.
</p>

<h4>General presentations</h4>
<ol>
<li>
C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier.
<br/>
<b>StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures.</b>
<em>Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009</em>, 23:187-198, February 2011.
<br/>
Available <a href="http://hal.inria.fr/inria-00550877">here</a>.
</li>
<li>
C. Augonnet.
<br/>
<b>StarPU: un support exécutif unifié pour les architectures multicoeurs hétérogènes</b>.
In <em>19èmes Rencontres Francophones du Parallélisme</em>, September 2009. Note: Best Paper Award.
<br/>
Available <a href="http://hal.inria.fr/inria-00411581">here</a>. (French version)
</li>
<li>
C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier.
<br/>
<b>StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures.</b>
In <em>Proceedings of the 15th International Euro-Par Conference</em>, volume 5704 of LNCS, August 2009.
<br/>
Available <a href="http://hal.inria.fr/inria-00384363">here</a>. (short version)
</li>

<li>
C. Augonnet and R. Namyst.
<br/>
<b>A unified runtime system for heterogeneous multicore architectures.</b>
In <em>Proceedings of the International Euro-Par Workshops 2008, HPPC'08</em>, volume 5415 of LNCS, August 2008.
<br/>
Available <a href="http://hal.inria.fr/inria-00326917">here</a>. (early version)
</li>
</ol>

<h4>On MPI support</h4>
<ol>
<li>
C. Augonnet, O. Aumage, N. Furmento, R. Namyst, and S. Thibault.
<br/>
<b>StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators.</b>
In <em>EuroMPI 2012</em>, volume 7490 of LNCS, September 2012. Note: Poster Session.
<br/>
Available <a href="http://hal.inria.fr/hal-00725477">here</a>.
</li>
</ol>

<h4>On data transfer management</h4>
<ol>
<li>
C. Augonnet, J. Clet-Ortega, S. Thibault, and R. Namyst
<br/>
<b>Data-Aware Task Scheduling on Multi-Accelerator based Platforms.</b>
In <em>The 16th International Conference on Parallel and Distributed Systems (ICPADS)</em>, December 2010.
<br/>
Available <a href="http://hal.inria.fr/inria-00523937">here</a>.
</li>
</ol>

<h4>On performance model tuning</h4>
<ol>
<li>
C. Augonnet, S. Thibault, and R. Namyst.
<br/>
<b>Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures.</b>
In <em>Proceedings of the International Euro-Par Workshops 2009, HPPC'09</em>, volume 6043 of LNCS, August 2009.
<br/>
Available <a href="http://hal.inria.fr/inria-00421333">here</a>.
</li>
</ol>

<h4>On the Cell support</h4>
<ol>
<li>
C. Augonnet, S. Thibault, R. Namyst, and M. Nijhuis.
<br/>
<b>Exploiting the Cell/BE architecture with the StarPU unified runtime system.</b>
In <em>SAMOS Workshop - International Workshop on Systems, Architectures, Modeling, and Simulation</em>, volume 5657 of LNCS, July 2009.
<br/>
Available <a href="http://hal.inria.fr/inria-00378705">here</a>.
</li>
</ol>

<h4>On Applications</h4>
<ol>
<li>
S.A. Mahmoudi, P. Manneback, C. Augonnet, and S. Thibault.<br/>
<b>Traitements d'Images sur Architectures Parallèles et Hétérogènes.</b>
<em>Technique et Science Informatiques</em>, 2012.
<br/>
Available <a href="http://hal.inria.fr/hal-00714858/">here</a>.
</li>

<li>
S. Benkner, S. Pllana, J.L. Träff, P. Tsigas, U. Dolinsky, C. Augonnet, B. Bachmayer, C. Kessler, D. Moloney, and V. Osipov.
<br/>
<b>PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems.</b> <em>IEEE Micro</em>, 31(5):28-41, September 2011.
<br/>
Available <a href="http://hal.inria.fr/hal-00648480">here</a>.
</li>

<li>
U. Dastgeer, C. Kessler, and S. Thibault.<br/>
<b>Flexible runtime support for efficient skeleton programming on hybrid systems.</b>
In <em>Proceedings of the International Conference on Parallel Computing (ParCo), Applications, Tools and Techniques on the Road to Exascale Computing</em>, volume 22 of Advances of Parallel Computing, August 2011.
<br/>
Available <a href="http://hal.inria.fr/inria-00606200/">here</a>.
</li>

<li>
S. Henry.
<br/>
<b>Programmation multi-accélérateurs unifiée en OpenCL.</b>
In <em>20èmes Rencontres Francophones du Parallélisme (RenPar'20)</em>, May 2011.
<br/>
Available <a href="http://hal.archives-ouvertes.fr/hal-00643257">here</a>.
</li>

<li>
S.A. Mahmoudi, P. Manneback, C. Augonnet, and S. Thibault.
<br/>
<b>Détection optimale des coins et contours dans des bases d'images volumineuses sur architectures multicoeurs hétérogènes.</b>
In <em>20èmes Rencontres Francophones du Parallélisme</em>, May 2011.
<br/>
Available <a href="http://hal.inria.fr/inria-00606195">here</a>.
</li>

<li>
E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, and S. Tomov.
<br/>
<b>A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs.</b>
In <em>GPU Computing Gems, volume 2.</em>, September 2010.
<br/>
Available <a href="http://hal.inria.fr/inria-00547847">here</a>.
<li>
E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, and S. Tomov.
<br/>
<b>QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators</b>.
In <em>25th IEEE International Parallel & Distributed Processing Symposium (IEEE IPDPS 2011)</em>, May 2011.
<br/>
Available <a href="http://hal.inria.fr/inria-00547614">here</a>.
</li>
<li>
E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, J. Roman, S. Thibault, and S. Tomov.
<br/>
<b>Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators.</b>
In <em>Symposium on Application Accelerators in High Performance Computing (SAAHPC)</em>, July 2010.
<br/>
Available <a href="http://hal.inria.fr/inria-00547616">here</a>.
</li>
<li>
E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov.
<br/>
<b>LU factorization for accelerator-based systems.</b>
In <em>9th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 11)</em>, June 2011.
<br/>
Available <a href="http://hal.inria.fr/hal-00654193">here</a>
</li>
</ol>

</div>

<div class="section bot">
<p class="updated">
  Last updated on 2012/10/03.
</p>
</div>

</body>
</html>