Commit 277abffc authored by Nathalie Furmento's avatar Nathalie Furmento

website

git-svn-id: svn+ssh://scm.gforge.inria.fr/svn/starpu/website@4088 176f6dd6-97d6-42f4-bd05-d3db9ad07c7a
parent 1c40e92a
CFLAGS += $(shell pkg-config --cflags libstarpu)
LDFLAGS += $(shell pkg-config --libs libstarpu)
vector_scal: vector_scal.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o
%.o: %.cu
nvcc $(CFLAGS) $< -c -o $@
clean:
rm -f vector_scal *.o
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<HEAD>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<TITLE>StarPU hands-on session</TITLE>
<link rel="stylesheet" type="text/css" href="../style.css" />
</HEAD>
<body>
<h1><a href="./">StarPU</a></h1>
<h1 class="sub">Hands-on session part 1: Task-based programming model</h1>
<a href="/Runtime/">RUNTIME homepage</a> |
<a href="../">StarPU homepage</a> |
<a href="/Publis/Keyword/StarPU.html">Publications</a>
<hr class="main"/>
<div class="section">
<h3>Download &amp; Install</h3>
<h4>DAS-4 modules</h4>
<p>The first step is of course to download and install StarPU. Before doing so,
make sure to enable the paths to the CUDA and CUBLAS environments on your machine;
on DAS-4 this means running</p>
<tt><pre>
$ module load cuda32/toolkit
$ module load cuda32/blas
</pre></tt>
<p>We will also use the <tt>prun</tt> tool:</p>
<tt><pre>
$ module load prun
</pre></tt>
<p>You should probably put these <tt>module load</tt> commands in your
<tt>.bashrc</tt> for further connections to DAS-4.</p>
<h4>hwloc</h4>
<p>In order to properly discover the machine cores, StarPU uses the hwloc
library. It can be downloaded from
<a href="http://www.open-mpi.org/software/hwloc/v1.2/">the hwloc website</a>.
The build procedure is the usual one:
</p>
<tt><pre>
$ ./configure --prefix=$HOME
$ make
$ make install
</pre></tt>
<p>
To easily get the proper compiler and linker flags for StarPU, as well as the
execution paths, add the installation directories to the <tt>pkg-config</tt>
search path, the library path, and <tt>PATH</tt>:</p>
<tt><pre>
$ export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$HOME/lib/pkgconfig
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/lib
$ export PATH=$PATH:$HOME/bin
</pre></tt>
<p>You should add these lines to your <tt>.bashrc</tt> file for further
connections.</p>
<h4>StarPU</h4>
<p>The StarPU source code can be downloaded from
<a href="http://starpu.gforge.inria.fr/">the StarPU website</a>; make sure to get
the latest release, that is 0.9.1. The build process follows the usual GNU style:</p>
<tt><pre>
$ ./configure --prefix=$HOME
$ make
$ make install
</pre></tt>
<p>In the summary printed at the end of the <tt>configure</tt> step,
check that CUDA support was detected (<tt>CUDA enabled: yes</tt>), as well
as hwloc.
</p>
<p>You can test the execution of a "Hello world!" program (using <tt>prun</tt> to
run the command on a DAS-4 computation node, as required by the usage policy):</p>
<tt><pre>
$ prun -np 1 ./examples/basic_examples/hello_world
</pre></tt>
<p>If execution fails to find the cudart library, make sure that your
<tt>.bashrc</tt> properly preserves existing paths in the <tt>LD_LIBRARY_PATH</tt>
environment variable.</p>
<p>Run the command several times; you will notice that StarPU calibrates the
bus speed each time. This is because DAS-4's job scheduler assigns a different
node each time, and StarPU does not know that the local cluster we use
is homogeneous; it thus assumes that all nodes of the cluster may be
different. Let's force it to use the same machine ID for the whole cluster:</p>
<tt><pre>
$ export STARPU_HOSTNAME=das4
</pre></tt>
<p>Also add this to your <tt>.bashrc</tt> for further connections. Of course, on
a heterogeneous cluster, the cluster launcher script should set various
hostnames for the different node classes, as appropriate.</p>
</div>
<hr class="main"/>
<div class="section">
<h3>Application example: vector scaling</h3>
<h4>Making it and running it</h4>
<p>A typical <tt>Makefile</tt> for applications using StarPU is then the
following (<a href="Makefile">available for download</a>):</p>
<tt><pre>
CFLAGS += $(shell pkg-config --cflags libstarpu)
LDFLAGS += $(shell pkg-config --libs libstarpu)
vector_scal: vector_scal.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o
%.o: %.cu
nvcc $(CFLAGS) $< -c -o $@
clean:
rm -f vector_scal *.o
</pre></tt>
<p>Copy the <tt>vector_scal*.c*</tt> files from
<tt>examples/basic_examples</tt> into a new empty directory, along with
the <tt>Makefile</tt> mentioned above. Run <tt>make</tt>, and try
<tt><pre>$ prun -np 1 ./vector_scal</pre></tt>
It should work: it simply scales a given vector by a given factor.</p>
<h4>Computation kernels</h4>
<p>Examine the source code, starting from <tt>vector_scal_cpu.c</tt>: this is
the actual computation code, wrapped into a <tt>scal_cpu_func</tt>
function which takes a series of DSM interfaces and a non-DSM parameter. The
code simply gets an actual pointer from the first DSM interface and the factor
value from the non-DSM parameter, and performs the vector scaling.</p>
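<p>For reference, the shape of such a CPU codelet function is roughly the
following (a minimal sketch using StarPU's vector interface macros; the loop and
variable names are illustrative):</p>
<tt><pre>
/* buffers[0] is the DSM vector interface, cl_arg carries the factor. */
void scal_cpu_func(void *buffers[], void *cl_arg)
{
    float factor = *(float *)cl_arg;
    unsigned n   = STARPU_VECTOR_GET_NX(buffers[0]);
    float *val   = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned i;

    for (i = 0; i &lt; n; i++)
        val[i] *= factor;
}
</pre></tt>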
<p>The GPU implementation, in <tt>vector_scal_cuda.cu</tt>, is basically
the same, with a host part (<tt>scal_cuda_func</tt>) which extracts the
actual CUDA pointer from the DSM interface and passes it to the device part
(<tt>vector_mult_cuda</tt>), which performs the actual computation.</p>
<p>The OpenCL implementation is more hairy due to the low-level aspect of the
OpenCL standard, but the principle remains the same.</p>
<h4>Main code</h4>
<p>Now examine <tt>vector_scal.c</tt>: the <tt>cl</tt> (codelet) structure simply gathers
pointers to the functions mentioned above. It also includes a performance model,
which we will discuss this afternoon.</p>
<p>The <tt>main</tt> function (a sketch is given after the list):
<ul>
<li>Allocates a <tt>vector</tt> application buffer and fills it.</li>
<li>Registers it with StarPU, and gets back a DSM handle. From now on, the
application is not supposed to access <tt>vector</tt> directly, since its
content may be copied and modified by a task on a GPU, leaving the main-memory
copy outdated.</li>
<li>Submits a (synchronous) task to StarPU.</li>
<li>Unregisters the vector from StarPU, which brings back the modified version
to main memory.</li>
</ul>
</p>
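<p>Putting these steps together, the overall structure looks roughly like the
following minimal sketch (CPU-only, no error checking; it follows the current
StarPU API, whose type and field names differ slightly from the 0.9.1 release
used here):</p>
<tt><pre>
#include &lt;stdint.h>
#include &lt;starpu.h>

#define NX 2048

extern void scal_cpu_func(void *buffers[], void *cl_arg);

/* The codelet gathers the implementations and the number of DSM buffers. */
static struct starpu_codelet cl = {
    .cpu_funcs = { scal_cpu_func },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    float vector[NX];
    float factor = 3.14f;
    starpu_data_handle_t handle;
    unsigned i;

    starpu_init(NULL);

    /* Fill the application buffer, then register it with StarPU. */
    for (i = 0; i &lt; NX; i++)
        vector[i] = 1.0f;
    starpu_vector_data_register(&amp;handle, 0, (uintptr_t)vector,
                                NX, sizeof(vector[0]));

    /* Submit one synchronous task working on the whole vector. */
    struct starpu_task *task = starpu_task_create();
    task->cl          = &amp;cl;
    task->handles[0]  = handle;
    task->cl_arg      = &amp;factor;
    task->cl_arg_size = sizeof(factor);
    task->synchronous = 1;
    starpu_task_submit(task);

    /* Unregistering brings the modified data back into the plain array. */
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}
</pre></tt>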
</div>
<div class="section">
<h3>Data partitioning</h3>
<p>In the previous section, we submitted only one task. Here we discuss how to
<i>partition</i> data so as to submit multiple tasks which can be executed in
parallel by the various CPUs and GPUs.</p>
<p>Let's examine <tt>examples/basic_examples/mult.c</tt> (a sketch of the
partitioning step is given after the list).
<ul>
<li>The computation kernel, <tt>cpu_mult</tt>, is a trivial matrix multiplication
kernel which operates on 3 given DSM interfaces. These will actually not be
whole matrices, but only small parts of matrices.</li>
<li><tt>init_problem_data</tt> initializes the whole A, B and C matrices.</li>
<li><tt>partition_mult_data</tt> does the actual registration and partitioning.
Matrices are first registered completely, then two partitioning filters are
declared. The first one, <tt>vert</tt>, is used to split B and C vertically. The
second one, <tt>horiz</tt>, is used to split A and C horizontally. We thus end
up with a grid of pieces of C to be computed from stripes of A and B.</li>
<li><tt>launch_tasks</tt> submits the actual tasks: for each piece of C, it takes
the appropriate pieces of A and B to produce that piece of C.</li>
<li>The access mode is interesting: A and B just need to be read from, and C
will only be written to. This means that StarPU will copy the pieces
of A and B across the machine to wherever tasks need them, and will give the
tasks uninitialized buffers for the pieces of C, since they will not be read
from.</li>
<li>The main code initializes StarPU and data, launches tasks, unpartitions data,
and unregisters it. Unpartitioning is an interesting step: until then the pieces
of C are residing on the various GPUs where they have been computed.
Unpartitioning will collect all the pieces of C into the main memory to form the
whole C result matrix.</li>
</ul>
</p>
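<p>For illustration, the registration and partitioning step can be sketched as
follows (a fragment only; it uses the matrix filter names of current StarPU
releases, which differ slightly from those in 0.9.1, and the matrix sizes and
slice counts are illustrative):</p>
<tt><pre>
/* Register the full C matrix (leading dimension, nx, ny, element size). */
starpu_matrix_data_register(&amp;C_handle, 0, (uintptr_t)C,
                            ydim, ydim, xdim, sizeof(float));

/* Two partitioning filters: vertical and horizontal block splits. */
struct starpu_data_filter vert = {
    .filter_func = starpu_matrix_filter_vertical_block,
    .nchildren   = nslicesx,
};
struct starpu_data_filter horiz = {
    .filter_func = starpu_matrix_filter_block,
    .nchildren   = nslicesy,
};

/* B is split vertically, A horizontally, and C along both dimensions. */
starpu_data_partition(B_handle, &amp;vert);
starpu_data_partition(A_handle, &amp;horiz);
starpu_data_map_filters(C_handle, 2, &amp;vert, &amp;horiz);

/* A task computing the (x, y) piece of C then uses the sub-handle: */
starpu_data_handle_t Cxy = starpu_data_get_sub_data(C_handle, 2, x, y);
</pre></tt>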
<p>Run the application, enabling some statistics:
<tt><pre>
$ prun -np 1 STARPU_WORKER_STATS=1 ./examples/basic_examples/mult
</pre></tt>
The figures show how the computation was distributed over the various processing
units. We will discuss performance further this afternoon.
</p>
<p>
<tt>examples/mult/xgemm.c</tt> is a very similar matrix-matrix product example,
which makes use of BLAS kernels for much better performance. The <tt>mult_kernel_common</tt> function
shows how we call <tt>DGEMM</tt> (CPUs) or <tt>cublasDgemm</tt> (GPUs) on the DSM interfaces.
It can also benefit from a parallel implementation of <tt>DGEMM</tt>; we
will however not have time to discuss this still-experimental feature.
</p>
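<p>The part of interest is how the kernel pulls pointers and leading dimensions
out of the DSM matrix interfaces before handing them to BLAS; roughly (a
fragment sketch, variable names illustrative):</p>
<tt><pre>
/* descr[] holds the three DSM matrix interfaces of the task. */
float *A = (float *)STARPU_MATRIX_GET_PTR(descr[0]);
float *B = (float *)STARPU_MATRIX_GET_PTR(descr[1]);
float *C = (float *)STARPU_MATRIX_GET_PTR(descr[2]);

unsigned ldA = STARPU_MATRIX_GET_LD(descr[0]);
unsigned ldB = STARPU_MATRIX_GET_LD(descr[1]);
unsigned ldC = STARPU_MATRIX_GET_LD(descr[2]);

unsigned nxC = STARPU_MATRIX_GET_NX(descr[2]);
unsigned nyC = STARPU_MATRIX_GET_NY(descr[2]);
unsigned nyA = STARPU_MATRIX_GET_NY(descr[0]);

/* These are then passed directly to cblas_sgemm()/cblas_dgemm() on CPU
 * workers, or to cublasSgemm()/cublasDgemm() on CUDA workers. */
</pre></tt>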
<p>Let's execute it on a node with one GPU:
<tt><pre>
$ prun -native '-l gpu=GTX480' -np 1 STARPU_WORKER_STATS=1 ./examples/mult/sgemm
</pre></tt>
(it takes some time for StarPU to perform an off-line bus performance
calibration, but this is done only once).
</p>
<p>We can notice that StarPU gave many more tasks to the GPU. You can also try
to set <tt>num_gpu=2</tt> to run on the machine which has two GPUs (there is
only one such machine, so you may have to wait a long time; submit this in the
background in a separate terminal). The interesting thing here is that
with <b>no</b> application modification beyond making it use a task-based
programming model, we get multi-GPU support for free!</p>
</div>
<div class="section">
<h3>More advanced examples</h3>
<p>
<tt>examples/lu/xlu_implicit.c</tt> is a more involved example: this is a simple
LU decomposition algorithm. The <tt>dw_codelet_facto_v3</tt> function contains the
main algorithm loop, written in a very readable, sequential-looking way. It simply
submits all the tasks asynchronously and waits for them all.
</p>
<p>
<tt>examples/cholesky/cholesky_implicit.c</tt> is a similar example, but which makes use
of the <tt>starpu_insert_task</tt> helper. The <tt>_cholesky</tt> function looks
very much like <tt>dw_codelet_facto_v3</tt> of the previous paragraph, and all
task submission details are handled by <tt>starpu_insert_task</tt>.
</p>
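<p>A single such call describes one task, its data and their access modes, from
which StarPU infers all dependencies; for instance (a sketch with illustrative
handle and codelet names):</p>
<tt><pre>
/* One task: read the diagonal block, update the sub-diagonal block. */
starpu_insert_task(&amp;cl_trsm,
                   STARPU_R,  diag_block_handle,
                   STARPU_RW, subdiag_block_handle,
                   STARPU_VALUE, &amp;alpha, sizeof(alpha),
                   0);
</pre></tt>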
<p>
Since they already use a task-based programming model, MAGMA and PLASMA
were easily ported to StarPU by simply using <tt>starpu_insert_task</tt>.
</p>
</div>
<div class="section">
<h3>Exercise</h3>
<p>Take the vector example again, and add partitioning support to it, using the
matrix-matrix multiplication as an example. Try to run it with various numbers
of tasks.</p>
</div>
<hr class="main" />
<h1 class="sub">Hands-on session part 2: Optimizations</h1>
<p>This is based on StarPU's documentation
<a href="http://runtime.bordeaux.inria.fr/StarPU/starpu.html#Performance-optimization">optimization chapter</a></p>
<div class="section">
<h3>Data management</h3>
<p>We have explained how StarPU can overlap computation and data transfers
thanks to DMAs. This is however only possible when CUDA has control over the
application buffers. The application should thus use <tt>starpu_malloc</tt>
when allocating its buffer, to permit asynchronous DMAs from and to it.</p>
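<p>For instance, instead of a plain <tt>malloc</tt> (a sketch; <tt>NX</tt> is
illustrative):</p>
<tt><pre>
float *vector;

/* starpu_malloc() returns pinned memory, so CUDA can perform
 * asynchronous DMA transfers from and to this buffer. */
starpu_malloc((void **)&amp;vector, NX * sizeof(vector[0]));

/* ... register the buffer, submit tasks, unregister ... */

starpu_free(vector);
</pre></tt>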
</div>
<div class="section">
<h3>Task submission</h3>
<p>To let StarPU reorder tasks, submit data transfers in advance, etc., task
submission should be asynchronous whenever possible. Ideally, the application
should behave like the applications we have observed this morning: submit the
whole graph of tasks, and wait for termination.</p>
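<p>In code, that amounts to leaving tasks asynchronous (the default) and waiting
only once at the end (a sketch):</p>
<tt><pre>
unsigned i;

/* Submit the whole graph of tasks without blocking... */
for (i = 0; i &lt; ntasks; i++)
{
    struct starpu_task *task = starpu_task_create();
    task->cl = &amp;cl;
    /* ... set task->handles[] and task->cl_arg here ... */
    task->synchronous = 0;  /* asynchronous submission (the default) */
    starpu_task_submit(task);
}

/* ... and wait for all of them at once. */
starpu_task_wait_for_all();
</pre></tt>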
</div>
<div class="section">
<h3>Task scheduling policy</h3>
<p>By default, StarPU uses the <tt>eager</tt> simple greedy scheduler. This is
because it provides correct load balance even if the application codelets do not
have performance models: it uses a single central queue, from which workers draw
tasks to work on. This however does not permit prefetching data, since the
scheduling decision is taken late.</p>
<p>
If the application codelets have performance models, the scheduler should be
changed to take advantage of them. StarPU will then actually take scheduling
decisions in advance according to the performance models, and issue data prefetch
requests, so as to overlap data transfers and computations.
</p>
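<p>Concretely, a codelet advertises a performance model through a dedicated
structure (a sketch following the current StarPU API; the structure and field
names differ slightly in the 0.9.1 release):</p>
<tt><pre>
/* History-based model: StarPU records execution times per data size,
 * keyed by the "symbol" name, and saves them under $HOME for re-use. */
static struct starpu_perfmodel vector_scal_model = {
    .type   = STARPU_HISTORY_BASED,
    .symbol = "vector_scal",
};

static struct starpu_codelet cl = {
    /* ... implementations, nbuffers and modes as before ... */
    .model = &amp;vector_scal_model,
};
</pre></tt>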
<p>For instance, compare the <tt>eager</tt> (default) and <tt>heft</tt> scheduling
policies:
<tt><pre>
prun -native '-l gpu=GTX480' -np 1 STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 ./examples/mult/sgemm -x 1024 -y 1024 -z 1024
</pre></tt>
with
<tt><pre>
prun -native '-l gpu=GTX480' -np 1 STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=heft ./examples/mult/sgemm -x 1024 -y 1024 -z 1024
</pre></tt>
</p>
<p>There are far fewer data transfers, and StarPU realizes that there is no
point in giving tasks to GPUs, resulting in better performance.</p>
<p>You may have to use <tt>STARPU_CALIBRATE=2</tt>, as the performance of the
kernels seems to vary on DAS-4; we have not had the time to check why.</p>
<p>Depending on whether you use <tt>gpu=GTX480</tt> (older GPUs) or <tt>gpu=C2050</tt>
(more recent GPUs), you should set <tt>export STARPU_HOSTNAME=das4-gtx480</tt> or
<tt>export STARPU_HOSTNAME=das4-c2050</tt>, so that StarPU uses separate performance
models for these two kinds of machines.</p>
<p>Try other schedulers; use <tt>STARPU_SCHED=help</tt> to get the
list.</p>
<p>Also try with various sizes and draw curves.</p>
<p>You can also try the double-precision version, <tt>dgemm</tt>, and notice that
the GPUs achieve relatively lower performance.</p>
</div>
<div class="section">
<h3>Performance model calibration</h3>
<p>Performance prediction is essential for proper scheduling decisions; the
performance models thus have to be calibrated. This is done automatically by
StarPU when a codelet is executed for the first time. Once this is done, the
result is saved to a file in <tt>$HOME</tt> for later re-use. The
<tt>starpu_perfmodel_display</tt> tool can be used to check the resulting
performance model.
</p>
<tt><pre>
$ starpu_perfmodel_display -l
file: &lt;starpu_sgemm_gemm.das4&gt;
$ starpu_perfmodel_display -s starpu_sgemm_gemm
performance model for cpu
# hash size mean dev n
8bd4e11d 2359296 9.318547e+04 4.335047e+02 700
performance model for cuda_0
# hash size mean dev n
8bd4e11d 2359296 3.396056e+02 3.391979e+00 900
</pre></tt>
<p>This shows that for the sgemm kernel with a 2.5M matrix slice, the average
execution time on CPUs was about 93ms, with a 0.4ms standard deviation, over
700 samples, while on the GPU it took about 0.34ms, with a 0.003ms standard
deviation, over 900 samples. It is a good idea to check this before doing actual
performance measurements. If the kernel has varying performance, it may be a good
idea to force StarPU to keep calibrating the performance model, by using <tt>export
STARPU_CALIBRATE=1</tt>.
</p>
<p>If the code of a computation kernel is modified, its performance changes, and the
performance model thus has to be recalibrated from scratch. To do so, use
<tt>export STARPU_CALIBRATE=2</tt>.
</p>
</div>
<div class="section">
<h3>More performance optimizations</h3>
<p>The StarPU documentation <a href="http://runtime.bordeaux.inria.fr/StarPU/starpu.html#Performance-optimization">optimization chapter</a> provides more optimization tips for further reading after the Spring School.</p>
</div>
<div class="section">
<h3>FxT tracing support</h3>
<p>In addition to online profiling, StarPU provides offline profiling tools,
based on recording a trace of events during execution, and analyzing it
afterwards.</p>
<p>The tool used by StarPU to record a trace is called FxT, and can be downloaded from <a href="http://download.savannah.gnu.org/releases/fkt/fxt-0.2.2.tar.gz">savannah</a>. The build process is as usual:
</p>
<tt><pre>
$ ./configure --prefix=$HOME
$ make
$ make install
</pre></tt>
<p>StarPU should then be recompiled with FxT support:</p>
<tt><pre>
$ ./configure --with-fxt --prefix=$HOME
$ make clean
$ make
$ make install
</pre></tt>
<p>You should make sure that the summary at the end of <tt>./configure</tt> shows that tracing was enabled:</p>
<tt><pre>
Tracing enabled: yes
</pre></tt>
<p>The trace file is output in <tt>/tmp</tt> by default. Since execution will
happen on a cluster node, the file would not be reachable after execution;
we thus need to direct StarPU to output traces to the home directory, by using</p>
<tt><pre>
$ export STARPU_FXT_PREFIX=$HOME/
</pre></tt>
<p>and add it to your <tt>.bashrc</tt>.</p>
<p>The application should be run again, and this time a <tt>prof_file_XX_YY</tt>
trace file will be generated in your home directory. This can be converted to
several formats by using</p>
<tt><pre>
$ starpu_fxt_tool -i ~/prof_file_*
</pre></tt>
<p>That will create
<ul>
<li>a <tt>paje.trace</tt> file, which can be opened by using the <a
href="http://vite.gforge.inria.fr/">ViTE</a> tool. This shows a Gantt diagram of
the tasks which were executed, and thus the activity and idleness of the processing
units, as well as dependencies, data transfers, etc. You may have to zoom in to
actually focus on the computation part, and not the lengthy CUDA initialization.</li>
<li>a <tt>dag.dot</tt> file, which contains the graph of all the tasks submitted by the application. It can be opened by using Graphviz.</li>
<li>an <tt>activity.data</tt> file, which records the activity of all processing
units over time.</li>
</ul>
</p>
</div>
<div class="section">
<h3>MPI support</h3>
<p>StarPU provides support for MPI communications. Basically, it provides
equivalents of the <tt>MPI_*</tt> functions that operate on DSM handles
instead of <tt>void*</tt> buffers. The difference is that the source data may
reside on a GPU where it has just been computed. StarPU will automatically handle
copying it back to main memory before submitting it to MPI.
</p>
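<p>For instance, a detached (asynchronous) send or receive operates directly on
a handle (a fragment sketch; the rank and tag values are illustrative):</p>
<tt><pre>
/* StarPU fetches the data back from the GPU if needed, then hands the
 * main-memory buffer to MPI; the NULL arguments are an optional
 * completion callback and its argument. */
if (rank == 0)
    starpu_mpi_isend_detached(token_handle, 1 /* dest */, 42 /* tag */,
                              MPI_COMM_WORLD, NULL, NULL);
else
    starpu_mpi_irecv_detached(token_handle, 0 /* source */, 42 /* tag */,
                              MPI_COMM_WORLD, NULL, NULL);
</pre></tt>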
<p><tt>mpi/tests/ring_async_implicit.c</tt> shows an example of mixing MPI communications and task submission. It is a classical MPI ring ping-pong, but the token which is passed on from neighbour to neighbour is incremented by a StarPU task at each step.
</p>
<p>This is written very naturally by simply submitting all MPI communication requests and task submission asynchronously in a sequential-looking loop, and eventually waiting for all the tasks to complete.</p>
</div>
<div class="section">
<h3>starpu_mpi_insert_task</h3>
<p>The Cholesky factorization shown in the presentation slides is available in
<tt>mpi/examples/cholesky/mpi_cholesky.c</tt>. The data distribution over MPI
nodes is decided by the <tt>my_distrib</tt> function, and can thus be changed
trivially.</p>
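<p>The idea, in a nutshell (a sketch; the distribution function below is purely
illustrative):</p>
<tt><pre>
/* Owner of block (x, y); changing this one function changes the whole
 * data distribution over the MPI nodes. */
static int my_distrib(int x, int y, int nb_nodes)
{
    return (x + y) % nb_nodes;
}

/* Every node submits the same sequential-looking task graph; for each
 * task, starpu_mpi_insert_task() uses the owner rank attached to the
 * handles (derived from my_distrib at registration time) to decide which
 * node executes it and which data transfers are required. */
starpu_mpi_insert_task(MPI_COMM_WORLD, &amp;cl_update,
                       STARPU_RW, block_handle,
                       0);
</pre></tt>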
</div>
<hr class="main" />
<h1 class="sub">Contact</h1>
<div class="section" id="contact">
<h3>Contact</h3>
For any questions regarding StarPU, please contact the StarPU developers mailing list.
<pre>
<a href="mailto:starpu-devel@lists.gforge.inria.fr?subject=StarPU">starpu-devel@lists.gforge.inria.fr</a>
</pre>
</div>
<hr class="main" />
<p class="updated">
Last updated on 2011/05/11.
</p>
</body>
</html>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<HEAD>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<TITLE>StarPU</TITLE>
<link rel="stylesheet" type="text/css" href="style.css" />
</HEAD>
<body>
<h1><a href="./">StarPU</a></h1>
<h1 class="sub">A Unified Runtime System for Heterogeneous Multicore Architectures</h1>
<a href="/Runtime/">RUNTIME homepage</a> |
<a href="/Publis/">Publications</a> |
<a href="/Runtime/software.html">Software</a> |
<a href="/Runtime/index.html.en#contacts">Contacts</a> |
<a href="/Intranet/">Intranet</a>
<hr class="main"/>
<div class="section" id="news">
<h3>News</h3>
<p>
May 2011 <b>&raquo;&nbsp;</b> <a href="http://gforge.inria.fr/frs/?group_id=1570"><b>StarPU 0.9.1 is now available!</b></a>
This release provides a reduction mode, an external API for schedulers, theoretical bounds, power-based optimization, parallel
tasks, an MPI DSM, profiling interfaces, initial support for CUDA 4 (GPU-GPU
transfers), improved documentation, and of course various fixes.
</p>
<p>
September 2010 <b>&raquo;&nbsp;</b> Discover how we ported the MAGMA and the PLASMA libraries on top of StarPU in collaboration with ICL/UTK in <a href="http://www.netlib.org/lapack/lawnspdf/lawn230.pdf"><b>this Lapack Working Note</b></a>.
</p>
<p>
August 2010 <b>&raquo;&nbsp;</b> <a href="http://gforge.inria.fr/frs/?group_id=1570"><b>StarPU 0.4 is now available!</b></a>
This release provides support for task-based dependencies, implicit data-based dependencies
(RAW/WAR/WAW), profiling feedback, an MPI layer, OpenCL and Windows support, as
well as an API naming revamp.
</p>
<p>
July 2010 <b>&raquo;&nbsp;</b> StarPU was presented during a tutorial entitled "Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and the DPLASMA and StarPU Scheduler" at SAAHPC in Knoxville, TN (USA) <a href="http://runtime.bordeaux.inria.fr/StarPU/saahpc.pdf">(slides)</a>
</p>
<p>
May 2010 <b>&raquo;&nbsp;</b> Want to get an overview of StarPU? Check out our <a href="http://hal.archives-ouvertes.fr/inria-00467677">latest research report</a>!
</p>
<!--<p>
October 2009 <b>&raquo;&nbsp;</b> <a href="http://gforge.inria.fr/frs/?group_id=1570"><b>StarPU 0.2.901 (0.3-rc1) is now available !</b></a>
This release adds support for asynchronous GPUs and heterogeneous multi-GPU platforms as well as many other improvements.
</p> -->
<p>
June 2009 <b>&raquo;&nbsp;</b>NVIDIA granted the StarPU team a professor partnership and donated several high-end CUDA-capable cards.
</p>
</div>
<div class="section" id="download">
<h3>Download</h3>
<!-- PLEASE LEAVE "Powered By Gforge" on your site -->
<a href="http://gforge.org/"><img src="/Images/pow-gforge.png" align="right" alt="Powered By GForge Collaborative Development Environment" border="0"></a>
<p>
<b>&raquo;&nbsp;</b>StarPU is freely available on <a href="http://gforge.inria.fr/projects/starpu/">INRIA's gforge</a> under the LGPL license.
</p>
<p>
<b>&raquo;&nbsp;</b>Get the <a href="http://gforge.inria.fr/frs/?group_id=1570">latest release</a>
</p>
<p>
<b>&raquo;&nbsp;</b>Get the <a href="http://starpu.gforge.inria.fr/testing/">latest nightly snapshot</a>.
</p>
<p>
<b>&raquo;&nbsp;</b>The current development version is also accessible via SVN:
</p>
<p style ="text-indent:25px">
svn checkout svn://scm.gforge.inria.fr/svn/starpu/trunk StarPU
</p>
</div>
<div class="section" id="StarPU">
<h3><b>StarPU</b> Overview</h3>
<p>
Traditional processors have reached architectural limits which heterogeneous
multicore designs and hardware specialization (e.g. coprocessors, accelerators,
...) intend to address. However, exploiting such machines introduces numerous
challenging issues at all levels, ranging from programming models and compilers
to the design of scalable hardware solutions. The design of efficient runtime
systems for these architectures is a critical issue. StarPU typically makes it
much easier for high performance libraries or compiler environments to exploit
heterogeneous multicore machines possibly equipped with GPGPUs or Cell
processors: rather than handling low-level issues, programmers may concentrate
on algorithmic concerns.
</p>
<p>
Portability is obtained by means of a unified abstraction of the machine.
StarPU offers a unified offloadable task abstraction named "codelet". Rather
than rewriting the entire code, programmers can encapsulate existing functions
within codelets. In case a codelet may run on heterogeneous architectures, it
is possible to specify one function for each architecture (e.g. one function
for CUDA and one function for CPUs). StarPU takes care of scheduling and executing
those codelets as efficiently as possible over the entire machine. In order to
relieve programmers from the burden of explicit data transfers, a high-level
data management library enforces memory coherency over the machine: before a
codelet starts (e.g. on an accelerator), all its data are transparently made
available on the compute resource.
</p>
<p>
Given its expressive interface and portable scheduling policies, StarPU obtains
portable performance by efficiently (and easily) using all computing resources
at the same time. StarPU also takes advantage of the heterogeneous nature of a
machine, for instance by using scheduling strategies based on auto-tuned
performance models.