A Unified Runtime System for Heterogeneous Multicore Architectures




March 2012 » The third release candidate for StarPU 1.0.0 is now available! It brings additional bug fixes, along with fixes for memory leaks and race conditions.

February 2012 » The second release candidate for StarPU 1.0.0 is now available! Compared to rc1, it brings additional bug fixes and features, notably automatically generated manual pages for some of the StarPU tools, a reduction mode for MPI, and an example written in C++.

January 2012 » The first release candidate for StarPU 1.0.0 is now available! Notably, this release provides a gcc plugin that extends the C interface with pragmas, making it easy to define codelets and issue tasks, and a new multi-format interface that makes it possible to use different binary formats on CPUs and GPUs.

May 2011 » StarPU 0.9.1 is now available! This release provides a reduction mode, an external API for schedulers, theoretical bounds, power-based optimization, parallel tasks, an MPI DSM, profiling interfaces, initial support for CUDA 4 (GPU-GPU transfers), improved documentation and, of course, various fixes.

September 2010 » Discover how we ported the MAGMA and PLASMA libraries on top of StarPU, in collaboration with ICL/UTK, in this LAPACK Working Note.

August 2010 » StarPU 0.4 is now available! This release provides support for task-based dependencies, implicit data-based dependencies (RAW/WAR/WAW), profiling feedback, an MPI layer, OpenCL and Windows support, as well as an API naming revamp.

July 2010 » StarPU was presented during a tutorial entitled "Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and the DPLASMA and StarPU Scheduler" at SAAHPC in Knoxville, TN, USA (slides).

May 2010 » Want to get an overview of StarPU? Check out our latest research report!

June 2009 » NVIDIA awarded the StarPU team a Professor Partnership and donated several high-end CUDA-capable cards.



» All releases and the development tree of StarPU are freely available on INRIA's gforge under the LGPL license. Some releases are available under the BSD license.

» Get the latest release

» Get the latest nightly snapshot.

» The current development version is also accessible via svn:

svn checkout svn:// StarPU

StarPU Overview

Traditional processors have reached architectural limits that heterogeneous multicore designs and hardware specialization (e.g. coprocessors and accelerators) are intended to address. However, exploiting such machines raises challenging issues at all levels, ranging from programming models and compilers to the design of scalable hardware solutions. The design of efficient runtime systems for these architectures is therefore critical. StarPU makes it much easier for high-performance libraries or compiler environments to exploit heterogeneous multicore machines, possibly equipped with GPGPUs or Cell processors: rather than handling low-level issues, programmers can concentrate on algorithmic concerns.

Portability is obtained by means of a unified abstraction of the machine. StarPU offers a unified offloadable task abstraction named "codelet". Rather than rewriting the entire code, programmers can encapsulate existing functions within codelets. When a codelet can run on several kinds of architecture, it is possible to specify one function for each architecture (e.g. one function for CUDA and one function for CPUs). StarPU takes care of scheduling and executing those codelets as efficiently as possible over the entire machine. To relieve programmers of the burden of explicit data transfers, a high-level data management library enforces memory coherency over the machine: before a codelet starts (e.g. on an accelerator), all its data are transparently made available on the compute resource.
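
The codelet and data management ideas described above can be illustrated with a small, self-contained Python sketch. This is not the StarPU API (StarPU is a C library): the names Codelet, DataHandle, acquire and run are hypothetical and chosen only for illustration. The sketch assumes a codelet is a table of per-architecture functions, and that a data handle records which memory nodes hold a valid copy, so that a copy is fetched only when needed and a write invalidates the other copies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Codelet:
    """Hypothetical stand-in for a codelet: one function per architecture."""
    name: str
    impls: Dict[str, Callable[[List[float]], List[float]]]

    def can_run_on(self, arch: str) -> bool:
        return arch in self.impls


@dataclass
class DataHandle:
    """Hypothetical handle recording which memory nodes hold a valid copy."""
    values: List[float]
    valid_on: Set[str] = field(default_factory=lambda: {"ram"})
    transfers: int = 0  # number of simulated copies between memory nodes

    def acquire(self, node: str, write: bool) -> List[float]:
        # Fetch a copy only if this node does not already hold a valid one.
        if node not in self.valid_on:
            self.transfers += 1
            self.valid_on.add(node)
        # A write makes every other copy stale.
        if write:
            self.valid_on = {node}
        return self.values


def scale_cpu(v: List[float]) -> List[float]:
    return [2.0 * x for x in v]


def scale_cuda(v: List[float]) -> List[float]:
    # Placeholder: a real implementation would launch a GPU kernel here.
    return [2.0 * x for x in v]


def run(codelet: Codelet, handle: DataHandle, arch: str, node: str) -> List[float]:
    """Make the data available on `node`, then run the matching implementation."""
    if not codelet.can_run_on(arch):
        raise ValueError(f"{codelet.name} has no implementation for {arch}")
    data = handle.acquire(node, write=True)
    handle.values = codelet.impls[arch](data)
    return handle.values


scale = Codelet("scale", {"cpu": scale_cpu, "cuda": scale_cuda})
vector = DataHandle([1.0, 2.0])
run(scale, vector, "cuda", "gpu0")  # data is copied to gpu0 first
run(scale, vector, "cpu", "ram")    # data is copied back before the CPU runs
```

In this toy model the caller never moves data explicitly: the transfer happens inside acquire, which is the point the paragraph above makes about transparent data management.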

Given its expressive interface and portable scheduling policies, StarPU obtains portable performance by efficiently (and easily) using all computing resources at the same time. StarPU also takes advantage of the heterogeneous nature of a machine, for instance by using scheduling strategies based on auto-tuned performance models.
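
One plausible reading of an "auto-tuned performance model" is a table of measured execution times that is refined every time a task completes. The sketch below assumes exactly that: a running mean per (codelet, architecture, problem size). The class HistoryModel and the function fastest_arch are hypothetical names, not part of StarPU, and the way StarPU actually calibrates its models may differ.

```python
from typing import Dict, Iterable, Optional, Tuple

Key = Tuple[str, str, int]  # (codelet name, architecture, problem size)


class HistoryModel:
    """Running mean of measured execution times (hypothetical sketch)."""

    def __init__(self) -> None:
        self._total: Dict[Key, float] = {}
        self._count: Dict[Key, int] = {}

    def record(self, codelet: str, arch: str, size: int, seconds: float) -> None:
        """Feed one measurement back into the model after a task finishes."""
        key = (codelet, arch, size)
        self._total[key] = self._total.get(key, 0.0) + seconds
        self._count[key] = self._count.get(key, 0) + 1

    def predict(self, codelet: str, arch: str, size: int) -> Optional[float]:
        """Return the mean measured time, or None if not calibrated yet."""
        key = (codelet, arch, size)
        if key not in self._count:
            return None
        return self._total[key] / self._count[key]


def fastest_arch(model: HistoryModel, codelet: str, size: int,
                 archs: Iterable[str]) -> Optional[str]:
    """Pick the architecture with the lowest predicted time, if any is known."""
    known = [(model.predict(codelet, a, size), a) for a in archs]
    known = [(t, a) for t, a in known if t is not None]
    return min(known)[1] if known else None


model = HistoryModel()
model.record("gemm", "cuda", 1024, 0.010)
model.record("gemm", "cuda", 1024, 0.014)
model.record("gemm", "cpu", 1024, 0.200)
choice = fastest_arch(model, "gemm", 1024, ["cpu", "cuda"])
```

The timings fed to the model here are invented for the example. A scheduler can consult such predictions before placing a task, and the predictions improve as more tasks run.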

Supported Architectures

Supported Operating Systems

Performance analysis tools

In order to understand the performance obtained by StarPU, it is helpful to visualize the actual behaviour of applications running on complex heterogeneous multicore architectures. StarPU therefore makes it possible to generate Pajé traces, which can be visualized with the open source ViTE (Visual Trace Explorer) tool.

Example: LU decomposition on 3 CPU cores and a GPU using a very simple greedy scheduling strategy. The green (resp. red) sections indicate when the corresponding processing unit is busy (resp. idle). The number of ready tasks is displayed in the curve on top: with this scheduling policy, the algorithm suffers from a lack of parallelism. Measured speed: 175.32 GFlop/s.

LU decomposition (greedy)

This second trace depicts the behaviour of the same application using a scheduling strategy that tries to minimize load imbalance using auto-tuned performance models while keeping data locality as high as possible. In this example, the Pajé trace clearly shows that this scheduling strategy outperforms the previous one in terms of processor usage. Measured speed: 239.60 GFlop/s.

LU decomposition (dmda)
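
The difference between the two strategies can be reproduced in a toy simulation. The sketch below is not StarPU's scheduler and does not model LU decomposition: the function simulate, the worker list, the per-architecture execution times and the transfer cost are all hypothetical values chosen for illustration. The greedy policy hands each task to whichever worker becomes free first; the model-aware policy hands it to the worker with the earliest expected completion time, counting a data transfer when the data is not already on that worker's memory node.

```python
from typing import Dict, List, Tuple

Worker = Tuple[str, str, str]  # (worker name, architecture, memory node)


def simulate(tasks: List[str], workers: List[Worker],
             exec_time: Dict[str, float], transfer_cost: float,
             model_aware: bool) -> float:
    """Return the makespan of running `tasks` (one data id each) in order."""
    free_at = {name: 0.0 for name, _, _ in workers}
    location: Dict[str, str] = {}  # data id -> node holding it (default "ram")

    for data in tasks:
        def end_time(worker: Worker) -> float:
            name, arch, node = worker
            move = transfer_cost if location.get(data, "ram") != node else 0.0
            return free_at[name] + move + exec_time[arch]

        if model_aware:
            # Minimize expected completion time, transfer included.
            chosen = min(workers, key=end_time)
        else:
            # Greedy: first worker to become free, whatever its speed.
            chosen = min(workers, key=lambda w: free_at[w[0]])

        free_at[chosen[0]] = end_time(chosen)
        location[data] = chosen[2]  # the task leaves its data on this node

    return max(free_at.values())


# Made-up platform: 3 CPU cores sharing main memory, plus one GPU.
workers = [("cpu0", "cpu", "ram"), ("cpu1", "cpu", "ram"),
           ("cpu2", "cpu", "ram"), ("gpu0", "cuda", "gpu0")]
exec_time = {"cpu": 20.0, "cuda": 1.0}   # made-up seconds per task
tasks = ["a", "b", "a", "b", "a", "b"]   # two data blocks reused by six tasks

greedy_makespan = simulate(tasks, workers, exec_time, 0.5, model_aware=False)
aware_makespan = simulate(tasks, workers, exec_time, 0.5, model_aware=True)
```

With these invented numbers the greedy policy parks three tasks on slow CPU cores, while the model-aware policy keeps both data blocks on the GPU and reuses them there. The figures say nothing about the measured speeds quoted above; they only illustrate why taking performance predictions and data locality into account can shorten a schedule.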

Documentation and Related Publications


For any questions regarding StarPU, please contact the StarPU developers mailing list.

Last updated on 2011/01/19.