Commit e870bba7 authored by THIBAULT Samuel

fixes

git-svn-id: svn+ssh://scm.gforge.inria.fr/svn/starpu/website@12924 176f6dd6-97d6-42f4-bd05-d3db9ad07c7a
parent 30028617
@@ -51,6 +51,7 @@ module load compiler/intel
module load hardware/hwloc
module load gpu/cuda/5.5
module load mpi/intel
module load lib/fxt/0.2.13
module load runtime/starpu/1.1.2
</pre></tt>
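<p>
One can then check that StarPU is available, for instance as follows
(assuming the module sets up the <tt>pkg-config</tt> search path):
</p>
<tt><pre>
$ pkg-config --cflags --libs starpu-1.1
</pre></tt>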
@@ -181,22 +182,24 @@ vector_scal_task_insert
<h4>Computation Kernels</h4>
<p>
Examine the source code, starting from <tt>vector_scal_cpu.c</tt>: this is
the actual computation code, which is wrapped into a <tt>vector_scal_cpu</tt>
function which takes a series of DSM interfaces and a non-DSM parameter. The
code simply gets the factor value from the non-DSM parameter,
an actual pointer from the first DSM interface,
and performs the vector scaling.
</p>
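<p>
As a sketch (simplified, not the verbatim content of the file, though the
StarPU calls shown are the actual API), the kernel looks roughly like:
</p>
<tt><pre>
#include &lt;starpu.h&gt;

void vector_scal_cpu(void *buffers[], void *cl_arg)
{
    /* get the actual pointer and length from the first DSM interface */
    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);

    /* get the factor value from the non-DSM parameter */
    float factor;
    starpu_codelet_unpack_args(cl_arg, &amp;factor);

    /* perform the vector scaling */
    unsigned i;
    for (i = 0; i &lt; n; i++)
        val[i] *= factor;
}
</pre></tt>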
<p>
The GPU implementation, in <tt>vector_scal_cuda.cu</tt>, is basically
the same: the host part (<tt>vector_scal_cuda</tt>) extracts the
actual CUDA pointer from the DSM interface and passes it to the device part
(<tt>vector_mult_cuda</tt>), which performs the actual computation.
</p>
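<p>
Again only as a simplified sketch (the kernel geometry and parameter names in
the actual file may differ), the structure is roughly:
</p>
<tt><pre>
#include &lt;starpu.h&gt;

static __global__ void vector_mult_cuda(unsigned n, float *val, float factor)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &lt; n)
        val[i] *= factor;
}

extern "C" void vector_scal_cuda(void *buffers[], void *cl_arg)
{
    /* extract the actual CUDA pointer from the DSM interface */
    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    float factor;
    starpu_codelet_unpack_args(cl_arg, &amp;factor);

    unsigned threads_per_block = 64;
    unsigned nblocks = (n + threads_per_block - 1) / threads_per_block;
    /* launch on the CUDA stream that StarPU dedicates to this worker */
    vector_mult_cuda&lt;&lt;&lt;nblocks, threads_per_block, 0, starpu_cuda_get_local_stream()&gt;&gt;&gt;(n, val, factor);
    cudaStreamSynchronize(starpu_cuda_get_local_stream());
}
</pre></tt>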
<p>
The OpenCL implementation, in <tt>vector_scal_opencl.c</tt> and
<tt>vector_scal_opencl_kernel.cl</tt>, is hairier due to the low-level nature
of the OpenCL standard, but the principle remains the same.
</p>
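<p>
The device part, for instance, is essentially the same kernel written in
OpenCL C (an illustrative sketch; the actual kernel and parameter names may
differ):
</p>
<tt><pre>
__kernel void vector_mult_opencl(unsigned int n, __global float *val, float factor)
{
    const int i = get_global_id(0);
    if (i &lt; n)
        val[i] *= factor;
}
</pre></tt>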
<p>
@@ -279,7 +282,7 @@ uninitialized buffers for the pieces of C, since they will not be read
from.
</li>
<li>
The <tt>main</tt> code initializes StarPU and data, launches tasks, unpartitions data,
and unregisters it. Unpartitioning is an interesting step: until then the pieces
of C are residing on the various GPUs where they have been computed.
Unpartitioning will collect all the pieces of C into the main memory to form the
@@ -310,7 +313,7 @@ units.
<h3>Other example</h3>
<p>
<a href="files/gemm/xgemm.c"><tt>xgemm.c</tt></a> is a very similar
<a href="files/gemm/xgemm.c"><tt>gemm/xgemm.c</tt></a> is a very similar
matrix-matrix product example, which makes use of BLAS kernels for
much better performance. The <tt>mult_kernel_common</tt> function
shows how we call <tt>DGEMM</tt> (CPUs) or <tt>cublasDgemm</tt> (GPUs)
@@ -397,6 +400,11 @@ when allocating its buffer, to permit asynchronous DMAs from and to
it.
</p>
<p>
Take the vector example again and fix the allocation to make it use
<tt>starpu_malloc</tt>.
</p>
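<p>
The change amounts to something like the following sketch (assuming the
buffer is called <tt>vector</tt> and holds <tt>NX</tt> floats):
</p>
<tt><pre>
float *vector;

/* allocate a buffer pinned by StarPU, so that asynchronous DMAs
   from and to it are possible */
starpu_malloc((void **)&amp;vector, NX * sizeof(vector[0]));

/* ... register the vector, submit tasks, unregister ... */

starpu_free(vector);
</pre></tt>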
</div>
<div class="section">
@@ -474,7 +482,7 @@ less great performance.
Performance prediction is essential for proper scheduling decisions, so the
performance models have to be calibrated. This is done automatically by
StarPU when a codelet is executed for the first time. Once this is done, the
result is saved to a file in <tt>$HOME/.starpu</tt> for later re-use. The
<tt>starpu_perfmodel_display</tt> tool can be used to check the resulting
performance model.
</p>
@@ -485,7 +493,7 @@ file: &lt;starpu_sgemm_gemm.mirage&gt;
$ starpu_perfmodel_display -s starpu_sgemm_gemm
performance model for cpu_impl_0
# hash size flops mean (us) stddev (us) n
8bd4e11d 2359296 0.000000e+00 1.848856e+04 4.026761e+03 12
performance model for cuda_0_impl_0
# hash size flops mean (us) stddev (us) n
8bd4e11d 2359296 0.000000e+00 4.918095e+02 9.404866e+00 66
@@ -494,7 +502,7 @@ performance model for cuda_0_impl_0
<p>
This shows that for the sgemm kernel with a 2.5M matrix slice, the average
execution time on CPUs was about 18ms, with a 4ms standard deviation, over
12 samples, while it took about 0.49ms on GPUs, with a 0.009ms standard
deviation. It is a good idea to check this before doing actual performance
measurements. If the kernel has varying performance, it may be a good idea to
@@ -514,7 +522,15 @@ performance model thus has to be recalibrated from start. To do so, use
<h2>Session Part 3: MPI Support</h2>
<p>
StarPU provides support for MPI communications. It does so in two ways: either
the application specifies MPI transfers by hand, or it lets StarPU infer them
from data dependencies.
</p>
<div class="section">
<h3>Manual MPI transfers</h3>
<p>Basically, StarPU provides
equivalents of the <tt>MPI_*</tt> functions, which operate on DSM handles
instead of <tt>void*</tt> buffers. The difference is that the source data may be
residing on a GPU where it just got computed. StarPU will automatically handle
@@ -588,20 +604,7 @@ afterwards.
</p>
<p>
StarPU should be recompiled with FxT support:
</p>
<tt><pre>
@@ -670,7 +673,7 @@ units over time.
<div class="section bot">
<p class="updated">
Last updated on 2014/05/19.
</p>
</body>
</html>