# Using The GPU FAµST API
__Sections in this tutorial:__
1. <a href="#1" >Creating a GPU Faust object</a>
2. <a href="#2" >Generating a GPU Faust</a>
3. <a href="#3" >Manipulating GPU Fausts and CPU interoperability</a>
4. <a href="#4" >Benchmarking your GPU with matfaust!</a>
5. <a href="#5" >Running some FAµST algorithms on GPU</a>
6. <a href="#6" >Manually loading the matfaust GPU module</a>
In this tutorial we'll quickly see how to leverage the GPU computing power with matfaust.
Since matfaust 3.0.0 the API has been modified to make the GPU available directly from the Matlab wrapper.
Indeed, an independent GPU module (aka ``gpu_mod``) has been developed for this purpose.
The first question you might ask is: does it work on my computer? Here is the answer: the loading of this module is quite transparent. If an NVIDIA GPU is available and CUDA is properly installed on your system, you normally have nothing to do except installing matfaust to get the GPU implementations at your fingertips. We'll see at the end of this tutorial how to load the module manually and how to get further information in case of an error.
It is worth noting two drawbacks of the matfaust GPU support:
- Mac OS X is not supported because NVIDIA has stopped supporting this OS.
- On Windows and Linux, the matfaust GPU support is currently limited to CUDA 9.2.
In addition to these drawbacks, please note that the GPU module support is still considered to be in beta: the code is relatively young and still evolving. However, the API shouldn't evolve much in the near future.
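If you want a quick programmatic answer before going further, here is a minimal sketch, assuming matfaust exposes an ``is_gpu_mod_enabled`` function mirroring the pyfaust function of the same name (shown in the notebook at the end of this page):
```matlab
% Assumed API (mirrors pyfaust.is_gpu_mod_enabled): returns true
% if the gpu_mod library was found and loaded
matfaust.is_gpu_mod_enabled()
```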
### <a name="1">1. Creating a GPU Faust object</a>
Let's start with some basic Faust creation on the GPU. Almost all the ways of creating a Faust object in CPU memory are also available for creating a GPU Faust.
First of all, creating a Faust using the constructor works seamlessly on GPU; you only need to specify the ``dev`` keyword argument, as follows:
```matlab
import matfaust.Faust
M = rand(10,10);
N = rand(10, 15);
gpuF = Faust({M, N}, 'dev', 'gpu')
Output:
gpuF =
- GPU FACTOR 0 (real) DENSE size 10 x 10, addr: 0x154e9a7d8870, density 1.000000, nnz 100
- GPU FACTOR 1 (real) DENSE size 10 x 15, addr: 0x154e99ccb500, density 1.000000, nnz 150
```
It's clearly indicated in the output that the Faust object is instantiated in GPU memory (the M and N matrices are copied from CPU to GPU memory). However, it's also possible to check this programmatically:
```matlab
device(gpuF)
Output:
ans =
'gpu'
```
While for a CPU Faust you'll get:
```matlab
device(Faust({M, N}, 'dev', 'cpu'))
Output:
ans =
'cpu'
```
In ``gpuF`` the factors are dense matrices, but it's totally possible to instantiate sparse matrices on the GPU, just as you can on the CPU side.
```matlab
S = sprand(10, 15, .25);
T = sprand(15, 10, .05);
sparse_gpuF = Faust({S, T}, 'dev', 'gpu')
Output:
sparse_gpuF =
- GPU FACTOR 0 (real) SPARSE size 10 x 15, addr: 0x14a5a7e882d0, density 0.220000, nnz 33
- GPU FACTOR 1 (real) SPARSE size 15 x 10, addr: 0x14a5a7e88910, density 0.053333, nnz 8
```
You can also create a GPU Faust by explicitly copying a CPU Faust to GPU memory. Actually, at any time you can copy a CPU Faust to GPU and conversely; the ``clone()`` member function exists precisely for this purpose. Below we copy ``gpuF`` to CPU and back again to GPU in the new Faust ``gpuF2``.
```matlab
cpuF = clone(gpuF, 'cpu')
Output:
cpuF =
Faust size 10x15, density 1.66667, nnz_sum 250, 2 factor(s):
- FACTOR 0 (real) DENSE, size 10x10, density 1, nnz 100
- FACTOR 1 (real) DENSE, size 10x15, density 1, nnz 150
gpuF2 = clone(cpuF, 'gpu')
Output:
gpuF2 =
- GPU FACTOR 0 (real) DENSE size 10 x 10, addr: 0x154e43dda130, density 1.000000, nnz 100
- GPU FACTOR 1 (real) DENSE size 10 x 15, addr: 0x154e43dd7b70, density 1.000000, nnz 150
```
### <a name="2">2. Generating a GPU Faust</a>
Many of the functions for generating a Faust object on CPU are available on GPU too. It always works the same way: set the ``dev`` argument to ``'gpu'`` and you'll get a GPU Faust instead of a CPU Faust.
For example, the code below will successively create a random GPU Faust, a Hadamard transform GPU Faust, an identity GPU Faust and finally a DFT GPU Faust.
```matlab
import matfaust.wht
import matfaust.dft
% Random GPU Faust
matfaust.rand(10, 10, 'num_factors', 11, 'dev', 'gpu')
Output:
ans =
- GPU FACTOR 0 (real) SPARSE size 10 x 10, addr: 0x154e43acb150, density 0.500000, nnz 50
- GPU FACTOR 1 (real) SPARSE size 10 x 10, addr: 0x154e43ddb830, density 0.500000, nnz 50
- GPU FACTOR 2 (real) SPARSE size 10 x 10, addr: 0x154e43ddbce0, density 0.500000, nnz 50
- GPU FACTOR 3 (real) SPARSE size 10 x 10, addr: 0x154e43ddce90, density 0.500000, nnz 50
- GPU FACTOR 4 (real) SPARSE size 10 x 10, addr: 0x154e43de38c0, density 0.500000, nnz 50
- GPU FACTOR 5 (real) SPARSE size 10 x 10, addr: 0x154e43de4710, density 0.500000, nnz 50
- GPU FACTOR 6 (real) SPARSE size 10 x 10, addr: 0x154e43de55a0, density 0.500000, nnz 50
- GPU FACTOR 7 (real) SPARSE size 10 x 10, addr: 0x154e43de6490, density 0.500000, nnz 50
- GPU FACTOR 8 (real) SPARSE size 10 x 10, addr: 0x154e43de7380, density 0.500000, nnz 50
- GPU FACTOR 9 (real) SPARSE size 10 x 10, addr: 0x154e43de8270, density 0.500000, nnz 50
- GPU FACTOR 10 (real) SPARSE size 10 x 10, addr: 0x154e43de9160, density 0.500000, nnz 50
% Hadamard GPU Faust
wht(32, 'dev', 'gpu')
Output:
ans =
- GPU FACTOR 0 (real) SPARSE size 32 x 32, addr: 0x154e43df0c20, density 0.062500, nnz 64
- GPU FACTOR 1 (real) SPARSE size 32 x 32, addr: 0x154e43dee3d0, density 0.062500, nnz 64
- GPU FACTOR 2 (real) SPARSE size 32 x 32, addr: 0x154e43df3780, density 0.062500, nnz 64
- GPU FACTOR 3 (real) SPARSE size 32 x 32, addr: 0x154e43df4480, density 0.062500, nnz 64
- GPU FACTOR 4 (real) SPARSE size 32 x 32, addr: 0x154e43df51c0, density 0.062500, nnz 64
% Identity GPU Faust
matfaust.eye(16, 'dev', 'gpu')
Output:
ans =
- GPU FACTOR 0 (real) SPARSE size 16 x 16, addr: 0x154e43de7410, density 0.062500, nnz 16
% DFT GPU Faust
dft(32, 'dev', 'gpu')
Output:
ans =
- GPU FACTOR 0 (complex) SPARSE size 32 x 32, addr: 0x154e43df5720, density 0.062500, nnz 64
- GPU FACTOR 1 (complex) SPARSE size 32 x 32, addr: 0x154e43df21e0, density 0.062500, nnz 64
- GPU FACTOR 2 (complex) SPARSE size 32 x 32, addr: 0x154e43df3640, density 0.062500, nnz 64
- GPU FACTOR 3 (complex) SPARSE size 32 x 32, addr: 0x154e43de5520, density 0.062500, nnz 64
- GPU FACTOR 4 (complex) SPARSE size 32 x 32, addr: 0x154e43de3880, density 0.062500, nnz 64
- GPU FACTOR 5 (complex) SPARSE size 32 x 32, addr: 0x154e43df2da0, density 0.031250, nnz 32
```
### <a name="3">3. Manipulating GPU Fausts and CPU interoperability</a>
Once you've created GPU Faust objects, you can perform operations on them while staying in the GPU world (that is, with no array transfer to CPU memory). That's of course not always possible.
For example, let's consider Faust-scalar multiplication and the Faust-matrix product. In the first case the scalar is copied to GPU memory, and likewise in the second case the matrix is copied from CPU to GPU in order to proceed with the computation. However, in both cases the Faust factors stay in GPU memory and don't move during the computation.
```matlab
% Faust-scalar multiplication
2*gpuF
Output:
ans =
- GPU FACTOR 0 (real) DENSE size 10 x 10, addr: 0x154e34980120, density 1.000000, nnz 100
- GPU FACTOR 1 (real) DENSE size 10 x 15, addr: 0x154e99ccb500, density 1.000000, nnz 150
```
As you can see, the first factor's address has changed in the result compared to what it was in ``gpuF``. Indeed, when you multiply by a scalar, only one factor is actually multiplied; the others don't change and are shared between the Faust being multiplied and the resulting Faust. This is an optimization, and to go further in this direction the factor chosen to be multiplied is the smallest one in memory (not necessarily the first one).
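To see this optimization at work, here is a small sketch (the addresses shown will of course differ on your machine): with a large first factor and a small second one, only the second factor's address should change in the display of the result.
```matlab
% Sketch: the factor scaled by the scalar is the smallest one in memory
F = Faust({rand(100, 100), rand(100, 5)}, 'dev', 'gpu');
disp(F)      % note the addresses of the two factors
disp(2*F)    % only the small 100x5 factor should show a new address
```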
```matlab
% Faust-matrix product (the matrix is copied to GPU
% then the multiplication is performed on GPU)
gpuF*rand(size(gpuF,2), 15)
Output:
ans =
20.0361 18.9839 19.2403 15.6763 17.6467 14.2024 22.0221 16.8871 14.9518 18.0550 16.9907 16.2313 19.1798 17.7156 15.0071
17.5964 16.4156 16.3757 13.8127 15.6474 12.0306 18.6480 15.2072 13.2615 14.7832 14.3821 13.4478 15.9126 14.7898 12.3165
24.2206 22.4507 22.6955 18.7368 21.6624 16.4971 26.3582 21.3575 18.2987 20.9960 20.2449 19.6180 22.7556 20.8585 17.5761
21.9159 21.0406 20.5890 16.8110 19.0772 14.9778 23.8053 19.0746 16.5460 18.4001 18.4172 17.2334 19.9098 18.6816 16.0544
24.6674 23.2190 23.0114 18.8039 21.8146 16.8671 27.4397 21.7257 18.6095 22.0368 20.2354 19.1477 22.9723 21.2252 18.0958
17.5488 16.5573 16.7642 13.8421 15.7318 12.2374 19.3189 15.1602 13.1840 15.6283 15.1275 14.6058 17.0519 15.7529 13.1074
21.8256 20.3676 20.1061 16.9731 19.4686 14.6711 23.4962 18.4808 16.2128 18.6823 18.3134 17.7391 20.4453 19.1173 15.9421
19.1147 17.7332 17.3880 14.7597 16.9095 12.8873 20.0457 16.0832 14.1063 15.8157 15.7174 14.8118 17.1714 15.9060 13.5531
23.3360 21.6335 21.7350 18.3694 20.7213 16.1601 24.4399 18.9016 17.0501 19.8258 19.6005 18.7066 21.6264 20.0042 16.9067
19.1308 17.9028 17.7825 14.9782 16.8418 13.1389 20.1918 16.3237 14.3236 15.9297 15.7401 14.6461 17.1327 15.7691 13.5463
```
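If you want to convince yourself that the GPU product matches its CPU counterpart, here is a quick sketch (in both cases the result of a Faust-matrix product is returned as a regular Matlab array):
```matlab
X = rand(size(gpuF, 2), 15);
gpu_res = gpuF * X;                % product computed on GPU
cpu_res = clone(gpuF, 'cpu') * X;  % same product computed on CPU
norm(gpu_res - cpu_res) / norm(cpu_res)  % should be negligible (float error)
```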
By contrast, and this matters for optimization, there is no CPU-GPU transfer at all when you create another GPU Faust, named for example ``gpuF2``, on the GPU and decide to multiply the two of them like this:
```matlab
gpuF2 = matfaust.rand(size(gpuF, 2), 18, 'dev', 'gpu')
gpuF2 =
- GPU FACTOR 0 (real) SPARSE size 15 x 15, addr: 0x154e4bbbcae0, density 0.333333, nnz 75
- GPU FACTOR 1 (real) SPARSE size 15 x 18, addr: 0x154e34ac9810, density 0.277778, nnz 75
- GPU FACTOR 2 (real) SPARSE size 18 x 18, addr: 0x154e34b088d0, density 0.277778, nnz 90
- GPU FACTOR 3 (real) SPARSE size 18 x 18, addr: 0x154e9a8e1df0, density 0.277778, nnz 90
- GPU FACTOR 4 (real) SPARSE size 18 x 18, addr: 0x154e9af31d30, density 0.277778, nnz 90
gpuF3 = gpuF*gpuF2
gpuF3 =
- GPU FACTOR 0 (real) DENSE size 10 x 10, addr: 0x154e9a7d8870, density 1.000000, nnz 100
- GPU FACTOR 1 (real) DENSE size 10 x 15, addr: 0x154e99ccb500, density 1.000000, nnz 150
- GPU FACTOR 2 (real) SPARSE size 15 x 15, addr: 0x154e4bbbcae0, density 0.333333, nnz 75
- GPU FACTOR 3 (real) SPARSE size 15 x 18, addr: 0x154e34ac9810, density 0.277778, nnz 75
- GPU FACTOR 4 (real) SPARSE size 18 x 18, addr: 0x154e34b088d0, density 0.277778, nnz 90
- GPU FACTOR 5 (real) SPARSE size 18 x 18, addr: 0x154e9a8e1df0, density 0.277778, nnz 90
- GPU FACTOR 6 (real) SPARSE size 18 x 18, addr: 0x154e9af31d30, density 0.277778, nnz 90
```
Besides, it's important to note that the ``gpuF3`` factors are not duplicated in memory because they already exist for ``gpuF`` and ``gpuF2``; that's an extra optimization. ``gpuF3`` is just a memory view of the factors of ``gpuF`` and ``gpuF2`` (the same GPU arrays are shared between ``Faust`` objects). The same holds for CPU ``Faust`` objects.
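A quick way to check that ``gpuF3`` simply stacks the existing factors is to count them (a sketch using matfaust's ``numfactors``):
```matlab
% gpuF3 holds views of all the factors of gpuF and gpuF2,
% so the factor counts simply add up (2 + 5 = 7 here)
numfactors(gpuF3) == numfactors(gpuF) + numfactors(gpuF2)
```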
Finally, please notice that CPU Faust objects are not directly interoperable with GPU Faust objects. If you try, you'll end up with an error.
```matlab
cpuF = matfaust.rand(5, 5, 'num_factors', 5, 'dev', 'cpu');
gpuF = matfaust.rand(5, 5, 'num_factors', 6, 'dev', 'gpu');
% A first try to multiply a CPU Faust with a GPU one...
cpuF*gpuF
Output:
Error using mexFaustReal
Handle not valid.
Error in matfaust.Faust/call_mex (line 3007)
[varargout{1:nargout}] = mexFaustReal(func_name, F.matrix.objectHandle, varargin{:});
Error in matfaust.Faust/mtimes_trans (line 714)
C = matfaust.Faust(F, call_mex(F, 'mul_faust', A.matrix.objectHandle));
Error in * (line 652)
G = mtimes_trans(F, A, 0);
% it doesn't work, you must either convert cpuF to a GPU Faust or gpuF to a CPU Faust before multiplying.
% A second try using conversion as needed...
clone(cpuF, 'gpu')*gpuF
Output:
ans =
- GPU FACTOR 0 (real) SPARSE size 5 x 5, addr: 0x154e34aca8b0, density 1.000000, nnz 25
- GPU FACTOR 1 (real) SPARSE size 5 x 5, addr: 0x154e34aca910, density 1.000000, nnz 25
- GPU FACTOR 2 (real) SPARSE size 5 x 5, addr: 0x154e34aca7f0, density 1.000000, nnz 25
- GPU FACTOR 3 (real) SPARSE size 5 x 5, addr: 0x154e34aca850, density 1.000000, nnz 25
- GPU FACTOR 4 (real) SPARSE size 5 x 5, addr: 0x154e349df930, density 1.000000, nnz 25
- GPU FACTOR 5 (real) SPARSE size 5 x 5, addr: 0x154e34aca6f0, density 1.000000, nnz 25
- GPU FACTOR 6 (real) SPARSE size 5 x 5, addr: 0x154e34ae5b80, density 1.000000, nnz 25
- GPU FACTOR 7 (real) SPARSE size 5 x 5, addr: 0x154e34ae6770, density 1.000000, nnz 25
- GPU FACTOR 8 (real) SPARSE size 5 x 5, addr: 0x154e349a3d80, density 1.000000, nnz 25
- GPU FACTOR 9 (real) SPARSE size 5 x 5, addr: 0x154e34adfa30, density 1.000000, nnz 25
- GPU FACTOR 10 (real) SPARSE size 5 x 5, addr: 0x154e34ae2ba0, density 1.000000, nnz 25
% Now it works
```
### <a name="4">4. Benchmarking your GPU with matfaust!</a>
Of course, when we run some code on GPU rather than on CPU, it is clearly to enhance performance. So let's try your GPU and find out whether it is worth it compared to your CPU.
First, measure how much time it takes on CPU to compute a Faust norm and the dense array corresponding to the product of its factors:
```matlab
cpuF = matfaust.rand(1024, 1024, 'num_factors', 10, 'fac_type', 'dense');
timeit(@() norm(cpuF, 2))
Output:
ans =
0.1673
timeit(@() full(cpuF))
Output:
ans =
2.5657
```
Now let's make some GPU heat with norms and matrix products!
```matlab
gpuF = clone(cpuF, 'dev', 'gpu');
timeit(@() norm(gpuF, 2))
Output:
ans =
0.0557
timeit(@() full(gpuF))
Output:
ans =
0.1354
```
Of course, not all GPUs are equal. The results above were obtained with an NVIDIA GTX980 GPU; below are the results I got using a Tesla V100:
```matlab
timeit(@() norm(gpuF, 2))
Output:
ans =
0.0076
timeit(@() full(gpuF))
Output:
ans =
0.0103
```
Likewise, let's compare the performance obtained for a sparse Faust (on a GTX980):
```matlab
cpuF2 = matfaust.rand(1024, 1024, 'num_factors', 10, 'fac_type', 'sparse');
gpuF2 = clone(cpuF2, 'dev', 'gpu');
timeit(@() norm(gpuF2, 2))
Output:
ans =
0.0093
timeit(@() full(gpuF2))
Output:
ans =
0.0178
```
And then on a Tesla V100:
```matlab
timeit(@() norm(gpuF2, 2))
Output:
ans =
0.0059
timeit(@() full(gpuF2))
Output:
ans =
0.0102
```
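To repeat these measurements on your own hardware, here is a small convenience sketch to save as ``bench_devices.m`` (a hypothetical helper name; it only relies on the ``clone``, ``norm``, ``full`` and ``timeit`` calls used above):
```matlab
function bench_devices(F)
    % Time norm() and full() for a CPU Faust F and its GPU clone
    G = clone(F, 'gpu');
    fprintf('norm: CPU %.4f s | GPU %.4f s\n', ...
            timeit(@() norm(F, 2)), timeit(@() norm(G, 2)));
    fprintf('full: CPU %.4f s | GPU %.4f s\n', ...
            timeit(@() full(F)), timeit(@() full(G)));
end
```
For example, ``bench_devices(cpuF2)`` reproduces the sparse Faust comparison above in one call.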
### <a name="5">5. Running some FAµST algorithms on GPU</a>
Some of the FAµST algorithms implemented in the C++ core are now also available in pure GPU mode.
For example, let's compare the factorization times taken by the hierarchical factorization when launched on CPU and on GPU.
When running on GPU, the matrix to factorize is copied into GPU memory and almost all operations executed during the algorithm don't involve the CPU in any way (the only exception at this stage of development is the proximal operators, which only run on CPU).
First, save the following function in a file named ``factorize_MEG.m`` in Matlab's current working directory (or in a directory you add to your path by calling ``addpath``).
```matlab
function [MEG16, total_time, err] = factorize_MEG(dev)
    import matfaust.fact.hierarchical
    % Load the MEG matrix; matrix_MEG.mat (defining the variable 'matrix')
    % is assumed to be on the Matlab path (it ships with the FAµST data)
    load('matrix_MEG.mat', 'matrix');
    MEG = matrix.';
    num_facts = 9;
    k = 10;
    s = 8;
    tic
    MEG16 = hierarchical(MEG, {'rectmat', num_facts, k, s}, 'backend', 2020, 'gpu', strcmp(dev, 'gpu'));
    total_time = toc;
    err = norm(MEG16 - MEG, 'fro') / norm(MEG, 'fro');
end
```
**Warning: THE COMPUTATION CAN LAST THIRTY MINUTES OR SO ON CPU**
So, executing this function on an Intel Xeon CPU (E5-2620), I got the following results:
```matlab
[MEG16, total_time, err] = factorize_MEG('cpu')
Output:
Faust::hierarchical: 1
Faust::hierarchical: 2
Faust::hierarchical: 3
Faust::hierarchical: 4
Faust::hierarchical: 5
Faust::hierarchical: 6
Faust::hierarchical: 7
Faust::hierarchical: 8
MEG16 =
Faust size 204x8193, density 0.0631649, nnz_sum 105572, 9 factor(s):
- FACTOR 0 (real) SPARSE, size 204x204, density 0.293589, nnz 12218
- FACTOR 1 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 2 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 3 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 4 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 5 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 6 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 7 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 8 (real) SPARSE, size 204x8193, density 0.0490196, nnz 81930
total_time =
1.4411e+03
err =
0.1289
```
And for comparison, here are the results I got on a Tesla V100 GPU:
```matlab
[MEG16, total_time, err] = factorize_MEG('gpu')
Output:
Faust::hierarchical: 1
Faust::hierarchical: 2
Faust::hierarchical: 3
Faust::hierarchical: 4
Faust::hierarchical: 5
Faust::hierarchical: 6
Faust::hierarchical: 7
Faust::hierarchical: 8
MEG16 =
Faust size 204x8193, density 0.0631649, nnz_sum 105572, 9 factor(s):
- FACTOR 0 (real) SPARSE, size 204x204, density 0.293589, nnz 12218
- FACTOR 1 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 2 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 3 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 4 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 5 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 6 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 7 (real) SPARSE, size 204x204, density 0.0392157, nnz 1632
- FACTOR 8 (real) SPARSE, size 204x8193, density 0.0490196, nnz 81930
total_time =
320.0904
err =
0.1296
```
As you can see, it's far faster than on the CPU!
### <a name="6">6. Manually loading the matfaust GPU module</a>
If something goes wrong when trying to use the matfaust GPU extension, here is how to manually load the module and obtain more information.
The key is the function [enable_gpu_mod](https://faustgrp.gitlabpages.inria.fr/faust/last-doc/html/namespacematfaust.html#a75568ecea590cd9f9cd14dce87bfdc84).
This function lets you give ``gpu_mod`` loading another try with verbose mode enabled.
Below I copy the output showing what it looks like when it doesn't work:
```matlab
matfaust.enable_gpu_mod('silent', false)
Output:
WARNING: you must call enable_gpu_mod() before using GPUModHandler singleton.
loading libgm
libgm.so: cannot open shared object file: No such file or directory
```
**NOTE**: this tutorial was written for FAµST version 3.0.8.
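To check which FAµST version you have installed, a one-line sketch (assuming ``matfaust.version`` mirrors pyfaust's ``version`` function used at the end of the notebook below):
```matlab
matfaust.version()
```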
# Using The GPU FAµST API
In this notebook we'll quickly see how to leverage the GPU computing power with pyfaust.
Since pyfaust 2.9.0 the API has been modified to make the GPU available directly from the python wrapper.
Indeed, an independent GPU module (aka ``gpu_mod``) has been developed for this purpose.
The first question you might ask is: does it work on my computer? Here is the answer: the loading of this module is quite transparent. If an NVIDIA GPU is available and CUDA is properly installed on your system, you normally have nothing to do except installing pyfaust to get the GPU implementations at your fingertips. We'll see at the end of this notebook how to load the module manually and how to get further information in case of an error.
It is worth noting two drawbacks of the pyfaust GPU support:
- Mac OS X is not supported because NVIDIA has stopped supporting this OS.
- On Windows and Linux, the pyfaust GPU support is currently limited to CUDA 9.2.
In addition to these drawbacks, please note that the GPU module support is still considered to be in beta: the code is relatively young and still evolving. However, the API shouldn't evolve much in the near future.
### Creating a GPU Faust object
Let's start with some basic Faust creation on the GPU. Almost all the ways of creating a Faust object in CPU memory are also available for creating a GPU Faust.
First of all, creating a Faust using the constructor works seamlessly on GPU; you only need to specify the ``dev`` keyword argument, as follows:
``` python
from pyfaust import Faust
from numpy.random import rand
M, N = rand(10,10), rand(10,15)
gpuF = Faust([M, N], dev='gpu')
gpuF
```
It's clearly indicated in the output that the Faust object is instantiated in GPU memory (the M and N numpy arrays are copied from CPU to GPU memory). However, it's also possible to check this programmatically:
``` python
gpuF.device
```
While for a CPU Faust you'll get:
``` python
Faust([M, N], dev='cpu').device
```
In ``gpuF`` the factors are dense matrices, but it's totally possible to instantiate sparse matrices on the GPU, just as you can on the CPU side.
``` python
from pyfaust import Faust
from scipy.sparse import random, csr_matrix
S, T = csr_matrix(random(10, 15, density=0.25)), csr_matrix(random(15, 10, density=0.05))
sparse_gpuF = Faust([S, T], dev='gpu')
sparse_gpuF
```
You can also create a GPU Faust by explicitly copying a CPU Faust to GPU memory. Actually, at any time you can copy a CPU Faust to GPU and conversely; the ``clone()`` member function exists precisely for this purpose. Below we copy ``gpuF`` to CPU and back again to GPU in the new Faust ``gpuF2``.
``` python
cpuF = gpuF.clone('cpu')
gpuF2 = cpuF.clone('gpu')
gpuF2
```
### Generating a GPU Faust
Many of the functions for generating a Faust object on CPU are available on GPU too. It always works the same way: set the ``dev`` argument to ``'gpu'`` and you'll get a GPU Faust instead of a CPU Faust.
For example, the code below will successively create a random GPU Faust, a Hadamard transform GPU Faust, an identity GPU Faust and finally a DFT GPU Faust.
``` python
from pyfaust import rand as frand, eye as feye, wht, dft
print("Random GPU Faust:", frand(10,10, num_factors=11, dev='gpu'))
print("Hadamard GPU Faust:", wht(32, dev='gpu'))
print("Identity GPU Faust:", feye(16, dev='gpu'))
print("DFT GPU Faust:", dft(32, dev='gpu'))
```
### Manipulating GPU Fausts and CPU interoperability
Once you've created GPU Faust objects, you can perform operations on them while staying in the GPU world (that is, with no array transfer to CPU memory). That's of course not always possible.
For example, let's consider Faust-scalar multiplication and the Faust-matrix product. In the first case the scalar is copied to GPU memory, and likewise in the second case the matrix is copied from CPU to GPU in order to proceed with the computation. However, in both cases the Faust factors stay in GPU memory and don't move during the computation.
``` python
# Faust-scalar multiplication
2*gpuF
```
As you can see, the first factor's address has changed in the result compared to what it was in ``gpuF``. Indeed, when you multiply by a scalar, only one factor is actually multiplied; the others don't change and are shared between the Faust being multiplied and the resulting Faust. This is an optimization, and to go further in this direction the factor chosen to be multiplied is the smallest one in memory (not necessarily the first one).
``` python
# Faust-matrix product (the matrix is copied to GPU
# then the multiplication is performed on GPU)
gpuF@rand(gpuF.shape[1],15)
```
By contrast, and this matters for optimization, there is no CPU-GPU transfer at all when you create another GPU Faust, named for example ``gpuF2``, on the GPU and decide to multiply the two of them like this:
``` python
from pyfaust import rand as frand
gpuF2 = frand(gpuF.shape[1],18, dev='gpu')
gpuF3 = gpuF@gpuF2
gpuF3
```
Besides, it's important to note that the ``gpuF3`` factors are not duplicated in memory because they already exist for ``gpuF`` and ``gpuF2``; that's an extra optimization. ``gpuF3`` is just a memory view of the factors of ``gpuF`` and ``gpuF2`` (the same GPU arrays are shared between ``Faust`` objects). The same holds for CPU ``Faust`` objects.
Finally, please notice that CPU Faust objects are not directly interoperable with GPU Faust objects. If you try, you'll end up with an error.
``` python
cpuF = frand(5, 5, 5, dev='cpu')
gpuF = frand(5, 5, 6, dev='gpu')
try:
    print("A first try to multiply a CPU Faust with a GPU one...")
    cpuF@gpuF
except Exception:
    print("it doesn't work, you must either convert cpuF to a GPU Faust or gpuF to a CPU Faust before multiplying.")
    print("A second try using conversion as needed...")
    print(cpuF.clone('gpu')@gpuF)  # this is what you should do
    print("Now it works!")
```
### Benchmarking your GPU with pyfaust!
Of course, when we run some code on GPU rather than on CPU, it is clearly to enhance performance. So let's try your GPU and find out whether it is worth it compared to your CPU.
First, measure how much time it takes on CPU to compute a Faust norm and the dense array corresponding to the product of its factors:
``` python
from pyfaust import rand as frand
cpuF = frand(1024, 1024, num_factors=10, fac_type='dense')
%timeit cpuF.norm(2)
%timeit cpuF.toarray()
```
Now let's make some GPU heat with norms and matrix products!
``` python
gpuF = cpuF.clone(dev='gpu')
%timeit gpuF.norm(2)
%timeit gpuF.toarray()
```
Of course, not all GPUs are equal; below are the results I got using a Tesla V100:
```
6.85 ms ± 9.06 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.82 ms ± 90.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Likewise, let's compare the performance obtained for a sparse Faust:
``` python
from pyfaust import rand as frand
cpuF2 = frand(1024, 1024, num_factors=10, fac_type='sparse', density=.2)
gpuF2 = cpuF2.clone(dev='gpu')
print("CPU times:")
%timeit cpuF2.norm(2)
%timeit cpuF2.toarray()
print("GPU times:")
%timeit gpuF2.norm(2)
%timeit gpuF2.toarray()
```
On a Tesla V100 it gives these results:
```
9.86 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
13.8 ms ± 39.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
### Running some FAµST algorithms on GPU
Some of the FAµST algorithms implemented in the C++ core are now also available in pure GPU mode.
For example, let's compare the factorization times taken by the hierarchical factorization when launched on CPU and on GPU.
When running on GPU, the matrix to factorize is copied into GPU memory and almost all operations executed during the algorithm don't involve the CPU in any way (the only exception at this stage of development is the proximal operators, which only run on CPU).
**Warning: THE COMPUTATION CAN LAST THIRTY MINUTES OR SO ON CPU**
``` python
from scipy.io import loadmat
from pyfaust.demo import get_data_dirpath
d = loadmat(get_data_dirpath()+'/matrix_MEG.mat')

def factorize_MEG(dev='cpu'):
    from pyfaust.fact import hierarchical
    from time import time
    from numpy.linalg import norm
    MEG = d['matrix'].T
    num_facts = 9
    k = 10
    s = 8
    t_start = time()
    MEG16 = hierarchical(MEG, ['rectmat', num_facts, k, s], backend=2020, on_gpu=dev=='gpu')
    total_time = time()-t_start
    err = norm(MEG16.toarray()-MEG)/norm(MEG)
    return MEG16, total_time, err
```
``` python
gpuMEG16, gpu_time, gpu_err = factorize_MEG(dev='gpu')
print("GPU time, error:", gpu_time, gpu_err)
```
``` python
cpuMEG16, cpu_time, cpu_err = factorize_MEG(dev='cpu')
print("CPU time, error:", cpu_time, cpu_err)
```
Depending on your GPU card and CPU, the results may vary, so below are shown some results obtained on specific hardware.
<table align="left">
<tr align="center">
<th>Implementation</th>
<th> Hardware </th>
<th> Time (s) </th>
<th>Error Faust vs MEG matrix </th>
</tr>
<tr>
<td>CPU</td>
<td>Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz</td>
<td>1241.00</td>
<td>.129</td>
</tr>
<tr>
<td>GPU</td>
<td>NVIDIA GTX980</td>
<td>465.42</td>
<td>.129</td>
</tr>
<tr>
<td>GPU</td>
<td>NVIDIA Tesla V100</td>
<td>321.50</td>
<td>.129</td>
</tr>
</table>
### Manually loading the pyfaust GPU module
If something goes wrong when trying to use the pyfaust GPU extension, here is how to manually load the module and obtain more information.
The key is the function [enable_gpu_mod](https://faustgrp.gitlabpages.inria.fr/faust/last-doc/html/namespacepyfaust.html#aea03fff2525fc834f2a56e63fd30a54f). This function lets you give ``gpu_mod`` loading another try with verbose mode enabled.
``` python
import pyfaust
pyfaust.enable_gpu_mod(silent=False, fatal=True)
```
Afterwards, you can call ``pyfaust.is_gpu_mod_enabled()`` to verify whether it works in your script.
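For example, a minimal sketch combining the two functions just mentioned:
``` python
import pyfaust
if not pyfaust.is_gpu_mod_enabled():
    # retry loading gpu_mod with verbose output to diagnose the problem
    pyfaust.enable_gpu_mod(silent=False)
print("GPU mod enabled:", pyfaust.is_gpu_mod_enabled())
```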
Below I copy outputs that show what it looks like when it doesn't work:
1) If you asked for a fatal error using ``enable_gpu_mod(silent=False, fatal=True)``, an exception will be raised and your code won't be able to continue after this call:
```
python -c "import pyfaust; pyfaust.enable_gpu_mod(silent=False, fatal=True)"
WARNING: you must call enable_gpu_mod() before using GPUModHandler singleton.
loading libgm
libcublas.so.9.2: cannot open shared object file: No such file or directory
[...]
Exception: Can't load gpu_mod library, maybe the path (/home/test/venv_pyfaust-2.10.14/lib/python3.7/site-packages/pyfaust/lib/libgm.so) is not correct or the backend (cuda) is not installed or configured properly so the libraries are not found.
```
2) If you just want a warning, use ``enable_gpu_mod(silent=False)``; the code will continue with no gpu_mod enabled, but you'll get some information about what is going wrong (here the CUDA toolkit version 9.2 is not installed):
```
python -c "import pyfaust; pyfaust.enable_gpu_mod(silent=False)"
WARNING: you must call enable_gpu_mod() before using GPUModHandler singleton.
loading libgm
libcublas.so.9.2: cannot open shared object file: No such file or directory
```
------------------------------------------------------------
**Note**: this notebook was executed using the following pyfaust version:
``` python
import pyfaust
pyfaust.version()
```
Thanks for reading this notebook! Many others are available at [faust.inria.fr](https://faust.inria.fr).