Add Arm NEON instruction set tutorial

change names to be more generic on SIMD instructions sets

Commit 2029ee7c authored 3 years ago by ANDRADE-BARROSO Guillermo
--- a/README.md
+++ b/README.md
@@ -9,17 +9,17 @@ Work with these tutorials supposes that you have a python interpreter installed
 ## pthread simple tutorial
 There is a simple tutorial for begin manipulate threads using POSIX Threads library [(pthread)](pthread/README.md)

-## SSE and OpenMP 
-### Introduction to C++ SSE using Python and inline compilation
+## SIMD and OpenMP 
+### Introduction to C++ SIMD instructions (Intel SSE or ARM NEON) using Python and inline compilation
 There is a tutorial showing :
 * how to include C++ code in python script with [inline module](https://github.com/GuillermoAndrade/inline)
-* how to optimize C++ code using SSE and measure performance
-[SSE tutorial using Python and Inline module](SSE/README.md)
+* how to optimize C++ code using SIMD instructions and measure performance
+[SIMD tutorial using Python and Inline module](SIMD/README.md)

-### Multi-core programming with OpenMP and SSE using Python and Inline module 
+### Multi-core programming with OpenMP and SIMD instructions using Python and Inline module 
 There is a tutorial showing :
-* how to optimize C++ code using OpenMP and SSE
-[OpenMP and SSE tutorial using Python and Inline module](OpenMP/README.md)
+* how to optimize C++ code using OpenMP and SIMD instructions
+[OpenMP and SIMD instructions tutorial using Python and Inline module](OpenMP/README.md)

 ## CUDA 
 ### Blocks and Grid on scalar multiplication kernel

--- a/SIMD/README.md
+++ b/SIMD/README.md
+## Goals of this tutorial  ##
+* Launch and modify Python scripts using the NumPy module
+* Learn how to run C++ code inside Python using the [inline module](https://github.com/GuillermoAndrade/inline)
+* Optimize C++ code using SIMD instructions:
+  * with an Intel-compatible CPU, work with the SSE instruction set through its C intrinsics API
+  * with an ARMv7-compatible CPU, work with the NEON instruction set through its C intrinsics API
+
+
+## A very short introduction to **Python** and the **NumPy** module ##
+[Python](http://www.python.org/) is a programming language described in http://docs.python.org/2/tutorial/index.html as follows: *''Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.''*
+
+Python is well suited to tutorials on parallel computing thanks to modules like [**inline**](https://github.com/GuillermoAndrade/inline), **PyCuda** and **PyOpenCL**. We will work with Python 2.x versions (but all instructions also run with Python 3.x).
+### Launch a Python console on your PC ###
+Python comes with an interpreter console; you can launch it by typing:
+
+```
+$ python
+Python 3.8.5 (default, Jul 28 2020, 12:59:40) 
+[GCC 9.3.0] on linux
+Type "help", "copyright", "credits" or "license" for more information.
+>>> 5+7
+12
+>>> exit()
+$
+```
+
+#### IPython ####
+There is a very powerful and comfortable interactive console named IPython. IPython gives you access to the command history, command completion using the [TAB] key, help about commands or objects using "?", and facilities for debugging scripts. I recommend that you work with it. To launch the IPython console just type:
+
+```
+$ ipython
+Python 3.8.5 (default, Jul 28 2020, 12:59:40) 
+Type 'copyright', 'credits' or 'license' for more information
+IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.
+
+In [1]:   
+```
+
+
+##### Ask for in-line documentation #####
+
+
+```
+In [1]: range?                                                                                                                                                                       
+Init signature: range(self, /, *args, **kwargs)
+Docstring:     
+range(stop) -> range object
+range(start, stop[, step]) -> range object
+
+Return an object that produces a sequence of integers from start (inclusive)
+to stop (exclusive) by step.  range(i, j) produces i, i+1, i+2, ..., j-1.
+start defaults to 0, and stop is omitted!  range(4) produces 0, 1, 2, 3.
+These are exactly the valid indices for a list of 4 elements.
+When step is given, it specifies the increment (or decrement).
+Type:           type
+Subclasses: 
+```
+
+##### Completion #####
+Completion using the [TAB] key:
+
+```
+In [2]: mylist = li 
+                    license()
+                    list        
+                                
+```
+
+##### Using a variable and lists #####
+
+Define a variable ```my_list``` that contains a list of integers:
+
+```
+In [2]: my_list = list(range(10))                                                                                                                                                    
+
+In [3]: my_list                                                                                                                                                                      
+Out[3]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+
+In [4]: list.reverse?                                                                                                                                                                
+Signature: list.reverse(self, /)
+Docstring: Reverse *IN PLACE*.
+Type:      method_descriptor
+
+In [5]: my_list.reverse()                                                                                                                                                            
+
+In [6]: my_list                                                                                                                                                                      
+Out[6]: [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
+
+In [7]: exit() 
+$
+```
+
+
+### Write a python program ###
+You can edit a Python program using a file editor like vim or gedit (graphic mode only). Python files usually have the ".py" extension.
+
+`$ gedit my_test.py`
+
+By default, Python 2 interprets source files as ASCII (Python 3 defaults to UTF-8). You can set another character encoding by adding this kind of instruction at the beginning of the file (here for the UTF-8 character set):
+
+```python
+# -*- coding: utf-8 -*-
+```
+This allows you to type characters with French accents if you want.
+A comment starts with the "#" character and runs to the end of the line:
+```python
+a= 5 # this is a comment 
+```
+or multi-line comments (in fact, triple-quoted strings) :
+
+```python
+'''
+This is a multiline
+comment.
+'''
+```
+
+#### Block syntax  ####
+Unlike C or C++, Python doesn't need semicolons to separate instructions. You can just separate instructions with line feeds:
+
+```python
+a=5
+b=8
+print ("a=", a)
+print ("b=", b)
+print ("a+b=", a+b)
+```
+
+And unlike C or C++, in Python instruction blocks aren't delimited by "{}" but by indentation:
+
+```python
+if a> 2:
+  c=1
+  print ("a is bigger that 2")
+else:
+  print ("a is smaller or equal to 2")
+  c=2
+print ("c=",c)
+```
+
+
+#### Functions ####
+You can define functions with parameters, and parameters with default values:
+
+```python
+def my_function(x, y=0):
+  a= 5*y + x**2
+  return a
+
+w= my_function(3)
+z= my_function(3,1)
+y= my_function(y=2, x=3)
+
+print("(w,z,y)=", (w,z,y))
+```
+
+
+### Run your program ###
+You can run your program with python:
+
+```
+$ python my_test.py 
+a= 5
+b= 8
+a+b= 13
+a is bigger that 2
+c= 1
+(w,z,y)= (9, 14, 19)
+$
+```
+
+
+Inside ipython you can run your program with:
+
+```
+
+
+In [1]: %run my_test.py
+a= 5
+b= 8
+a+b= 13
+a is bigger that 2
+c= 1
+(w,z,y)= (9, 14, 19)
+
+
+```
+
+After your program has run, you can inspect the state of variables or call functions:
+
+```
+In [2]: b
+Out[2]: 8
+
+In [3]: my_function(4)
+Out[3]: 16
+```
+
+
+### NumPy ###
+NumPy is a module that lets you manipulate arrays of data efficiently, at a high level, with optimized operations.
+You can find more information here: http://docs.scipy.org/doc/numpy/user/index.html
+
+You can access NumPy functions by importing the module:
+```python
+import numpy
+```
+#### Manipulate arrays  ####
+##### Create an array of float32 #####
+An uninitialized linear array of 1000 elements:
+```python
+a= numpy.empty(1000,numpy.float32)
+```
+A linear array of 1000 elements with all values set to ```zero``` :
+```python
+a= numpy.zeros(1000,numpy.float32)
+```
+A linear array of 1000 elements with all values set to ```one``` :
+```python
+a= numpy.ones(1000,numpy.float32)
+```
+A linear array of 1000 elements with random values between 0 and 1:
+```python
+a=numpy.random.rand(1000).astype(numpy.float32)
+```
+A linear array of 1000 elements with values from 0 to 999:
+```python
+a= numpy.arange(1000, dtype=numpy.float32)
+```
+##### Access to elements #####
+Get the first element:
+```python
+In [3]: a[0]
+Out[3]: 0.0
+```
+Get the last:
+```python
+In [4]: a[999]
+Out[4]: 999.0
+```
+More generally, to get the last element of any linear array:
+```python
+In [5]: a[-1]
+Out[5]: 999.0
+```
+
+##### Copy and reference of an array #####
+Assignment by reference:
+
+```python
+In [3]: b=a
+
+In [4]: a[0]
+Out[4]: 0.0
+
+In [5]: b[0]=5
+
+In [6]: b[0]
+Out[6]: 5.0
+
+In [7]: a[0]
+Out[7]: 5.0
+```
+
+
+To make a copy (a physical copy in memory):
+
+```python
+In [2]: a=numpy.arange(1000, dtype=numpy.float32) 
+
+In [3]: b=a.copy()
+
+In [4]: b[0]=5
+
+In [5]: a[0]
+Out[5]: 0.0
+```
+
+#### Views and slicing ####
+NumPy allows you to get different views of an array using ''slicing''. A slice is created by index manipulation. For example:
+
+```python
+In [2]: a=numpy.arange(10, dtype=numpy.float32)
+
+In [3]: a
+Out[3]: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.], dtype=float32)
+
+In [4]: a[0:9:2]
+Out[4]: array([ 0.,  2.,  4.,  6.,  8.], dtype=float32)
+```
+
+The index `0:9:2` means from begin = 0 up to (but not including) end = 9, with a step of 2.
+##### Reverse Indexing  #####
+Step can be a negative integer:
+```python
+In [5]: a[9:0:-2]
+Out[5]: array([ 9.,  7.,  5.,  3.,  1.], dtype=float32)
+```
+
+In this case `a[9:0:-2]` is equivalent to `a[-1:0:-2]`.
+A slice is a view of an existing array. It is not a copy of the data.
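The difference between a view and a copy can be checked directly. The following is a minimal sketch, not part of the commit: it assumes only NumPy, and the names `view` and `copy` are chosen for illustration.

```python
import numpy

a = numpy.arange(10, dtype=numpy.float32)

view = a[0:9:2]          # slicing returns a view: no data is copied
copy = a[0:9:2].copy()   # an explicit, independent copy

view[0] = 100.0          # writes through to a[0]
copy[1] = -1.0           # changes only the copy, a[2] keeps its value

print(a[0], a[2])
print(numpy.shares_memory(a, view))   # does the view share memory with a?
print(numpy.shares_memory(a, copy))   # does the copy share memory with a?
```

This matters later in the tutorial: a strided view such as `a[::2]` does not occupy contiguous memory, so it should not be passed as-is to C code that walks a plain `float *`.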
+##### Default Index #####
+It is possible to take default values for begin or end index:
+```python
+In [6]: a[::-2]
+Out[6]: array([ 9.,  7.,  5.,  3.,  1.], dtype=float32)
+```
+Or for step:
+```python
+In [7]: a[:]
+Out[7]: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.], dtype=float32)
+
+In [8]: a[5:]
+Out[8]: array([ 5.,  6.,  7.,  8.,  9.], dtype=float32)
+
+In [9]: a[:5]
+Out[9]: array([ 0.,  1.,  2.,  3.,  4.], dtype=float32)
+```
+#### Array Operators ####
+##### Scalar operation #####
+A scalar operation on an array is equivalent to applying the operation between the scalar and every element of the array:
+
+
+```python
+In [1]: import numpy
+
+In [2]: a=numpy.arange(10, dtype=numpy.float32)
+
+In [3]: a*4
+Out[3]: array([  0.,   4.,   8.,  12.,  16.,  20.,  24.,  28.,  32.,  36.], dtype=float32)
+
+In [4]: 5+a
+Out[4]: array([  5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.,  13.,  14.], dtype=float32)
+```
+
+##### Array Vs. Array operation #####
+Binary operations like "*" and "+" are done element-wise:
+
+```python
+In [1]: import numpy
+
+In [2]: a=numpy.arange(10, dtype=numpy.float32)
+
+In [3]: a*4
+Out[3]: array([  0.,   4.,   8.,  12.,  16.,  20.,  24.,  28.,  32.,  36.], dtype=float32)
+
+In [4]: 5+a
+Out[4]: array([  5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.,  13.,  14.], dtype=float32)
+
+In [5]: a+a
+Out[5]: array([  0.,   2.,   4.,   6.,   8.,  10.,  12.,  14.,  16.,  18.], dtype=float32)
+
+In [6]: a*a
+Out[6]: array([  0.,   1.,   4.,   9.,  16.,  25.,  36.,  49.,  64.,  81.], dtype=float32)
+
+In [7]: c=-a
+
+In [8]: a*c
+Out[8]: array([ -0.,  -1.,  -4.,  -9., -16., -25., -36., -49., -64., -81.], dtype=float32)
+```
+
+
+Matrix dot product must be called explicitly :
+```python
+In [15]: numpy.dot(a,a)
+Out[15]: 285.0
+```
+## Using **Inline** module to integrate C/C++ code in python ##
+**Inline** is a module that allows in-line inclusion of C/C++ code in Python using code generation and compilation on the fly.
+
+See https://github.com/GuillermoAndrade/inline for more information.
+
+In fact, the Python interpreter is itself written in C, and it is possible to call functions from a shared library through a C interface. For C++ functions, Python needs to pass through a C wrapper function that handles the conversion between pure C parameters and the C++ class objects used by the real C++ functions.
+### Hello world with Inline ###
+Imagine a C function that takes a string parameter `text` and prints a message:
+
+```c
+#include <stdio.h>
+void Hello( char * text)
+{
+    printf("Hello %s \n", text);
+}
+```
+
+
+
+We can call a similar function in python with this code:
+
+```python
+In [1]: code=r'''
+   ...: #include <stdio.h>
+   ...: void Hello( char * text)
+   ...: {
+   ...:     printf("Hello %s \n", text);
+   ...: }
+   ...: '''
+
+In [2]: import inline
+
+In [3]: lib=inline.c(code)
+
+In [4]: lib.Hello(b'world')
+Hello world 
+Out[4]: 13
+
+```
+
+After defining the C code as a Python raw string, we compile it with the `inline.c` function, which produces an object representing a shared library. Then we can call a function from this library like this: `lib.Hello(b'world')`.
+The Python text passed to this function must be a string of bytes; that is why we use `b'world'`, which tells Python that the string is a bytes array.
+
+
+### Function with two arguments and a return value ###
+See next C example:
+
+```c
+#include<math.h>
+float norm( float x, float y)
+{
+    return sqrt(x*x+y*y);
+}
+```
+
+We can inline compile this code as before:
+
+```python
+In [7]: code=r'''
+   ...: #include<math.h>
+   ...: float norm( float x, float y)
+   ...: {
+   ...:     return sqrt(x*x+y*y);
+   ...: }
+   ...: '''
+
+In [8]: lib=inline.c(code)
+
+```
+But to call the `norm` function, we need to tell the Python interface object of this function that its arguments are C floats. We do that with the types defined in the `ctypes` module. There are two ways to do it:
+
+**First way**: wrap each value in an object that represents a C float in the call:
+
+```python
+In [6]: import ctypes
+
+In [7]: lib.norm( ctypes.c_float(3.0), ctypes.c_float(5.0))
+
+```
+
+**Second way**: define the argument types of the `norm` function:
+```python
+In [8]: lib.norm.argtypes = [ctypes.c_float, ctypes.c_float]
+
+In [9]: lib.norm(3.0,5.0)
+
+```
+For the value returned by the function call (a float), we also need to define the conversion type to Python:
+
+```python
+In [11]: lib.norm.restype = ctypes.c_float
+
+In [12]: lib.norm(3.0,5.0)
+Out[12]: 5.830951690673828
+
+```
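The result above is not the double-precision value of the square root of 34, because `norm` returns a C `float` (32 bits) while a Python float is a C `double` (64 bits). A minimal sketch of that rounding, using only the standard `ctypes` and `math` modules (the variable names are illustrative):

```python
import ctypes
import math

exact = math.sqrt(3.0 * 3.0 + 5.0 * 5.0)   # Python float, i.e. a 64-bit C double
single = ctypes.c_float(exact).value        # the same value rounded to a 32-bit C float

print(exact)
print(single)            # should match the value shown in Out[12]
print(exact == single)   # the two differ after about 7 significant digits
```

The same rounding happens to every element stored in a `numpy.float32` array, which is worth remembering when comparing results between implementations.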
+
+### Passing NumPy Arrays as arguments in read/write mode ###
+Passing a NumPy array to inline C code is equivalent to passing a pointer to the array in C, with a specific type converter:
+
+
+```python
+import inline
+import ctypes
+import numpy
+size=10
+a=numpy.arange(size, dtype=numpy.float32)
+b=numpy.arange(size, dtype=numpy.float32)
+c=numpy.empty_like(a)
+code = '''
+
+void sum( int size, float *a, float *b, float *c)
+{
+    for(int i=0; i < size ; i++)
+    {
+        c[i]=a[i]+b[i];
+    }
+}
+'''
+
+lib = inline.c(code)
+p_float= numpy.ctypeslib.ndpointer(dtype=numpy.float32)
+
+lib.sum.argtypes = [ctypes.c_int, p_float, p_float, p_float]
+
+```
+The result is :
+
+```python
+
+In [18]: lib.sum(len(a), a, b, c)
+Out[18]: 10
+
+In [19]: c
+Out[19]: array([ 0.,  2.,  4.,  6.,  8., 10., 12., 14., 16., 18.], dtype=float32)
+```
+We can compare `c` array with the sum computed by numpy `a+b` :
+```python
+
+In [20]: c - (a+b)
+Out[20]: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)
+
+```
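The `ndpointer` type used in `argtypes` also acts as a guard: ctypes calls its `from_param` method on every argument, and NumPy raises `TypeError` when the array does not match the declared dtype or flags. The sketch below illustrates this without compiling any C code; the `flags="C_CONTIGUOUS"` requirement and the helper name `accepted` are additions for illustration, not part of the tutorial code.

```python
import numpy

# Same converter as in the tutorial, with an extra (assumed) contiguity requirement
p_float = numpy.ctypeslib.ndpointer(dtype=numpy.float32, flags="C_CONTIGUOUS")

def accepted(array):
    """Return True if `array` would be accepted for a p_float argument."""
    try:
        p_float.from_param(array)   # the check ctypes performs at call time
        return True
    except TypeError:
        return False

good = numpy.arange(10, dtype=numpy.float32)
wrong_dtype = numpy.arange(10, dtype=numpy.float64)
strided_view = good[::2]            # a view with a stride of 2 elements

print(accepted(good))
print(accepted(wrong_dtype))
print(accepted(strided_view))
```

Rejecting a `float64` array or a strided view up front is safer than letting the C loop read the raw memory with the wrong element size or stride.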
+
+## Optimization of a C++ code with SIMD instructions set ##
+It's time to combine NumPy and Inline into a benchmark program for optimization tests.
+
+### Benchmark code ###
+This code defines a benchmark to test a classical procedure from linear algebra packages, SAXPY, in a particular case.
+It passes additional optional arguments to Inline:
+* ```compiler_extra_args``` : compilation arguments that enable SIMD and OpenMP instructions
+* ```link_extra_args``` : link arguments that enable SIMD and OpenMP instructions
+
+```python
+import numpy
+import ctypes
+from time import time
+import inline
+sizeX = 1000000
+numberIterations =1000
+X = numpy.random.rand(sizeX).astype(numpy.float32)
+Y = numpy.empty(sizeX).astype(numpy.float32)
+
+
+def BenchmarkCode(name, code,X,Y):
+	# init
+	Y[:]=0.5
+	X[:]=1.0
+
+	# compile the code
+	lib=inline.cxx(code, compiler_extra_args=['-march=native','-fopenmp'], link_extra_args= ['-march=native','-fopenmp']) 
+	p_float= numpy.ctypeslib.ndpointer(dtype=numpy.float32) 	
+	lib.compute.argtypes = [ctypes.c_int, ctypes.c_int, p_float, p_float] 
+
+	# start chronometer
+	start_time = time()
+	# run the code
+	lib.compute(numberIterations, sizeX, X, Y)
+	# stop chronometer             
+	stop_time = time()
+	execution_time= stop_time - start_time
+	print("execution time for "+name+" code = "+ str(execution_time))
+	return execution_time
+
+# C++ reference code
+referenceCode="""
+#line 33 "saxpy.py" // helpful for debug
+extern "C" {
+#ifdef __SSE2__
+#include<xmmintrin.h>
+#else
+#include <arm_neon.h>
+#endif
+
+
+void saxpy(int n, float alpha, float *X, float *Y)
+{
+	int i;
+	for (i=0; i<n; i++)
+		Y[i] += alpha * X[i];
+}
+
+void compute(int numberIterations, int sizeX, float *X, float *Y )
+{
+	for(int j=0; j< numberIterations;j++)
+  		saxpy(sizeX, 0.001f, X, Y);
+	return ;
+}
+
+}
+"""
+
+referenceTime=BenchmarkCode('Reference', referenceCode,X,Y)
+
+SIMDCode="""
+#line 58 "saxpy.py" // helpful for debug
+extern "C" {
+
+#ifdef __SSE2__
+#include<xmmintrin.h>
+#else
+#include <arm_neon.h>
+#endif
+
+void saxpy(int n, float alpha, float *X, float *Y)
+{
+	int i;
+	for (i=0; i<n; i++)
+		Y[i] += alpha * X[i];
+}
+
+
+void compute(int numberIterations, int sizeX, float *X, float *Y )
+{
+	for(int j=0; j< numberIterations;j++)
+  		saxpy(sizeX, 0.001f, X, Y);
+	return ;
+}
+
+}
+"""
+
+SIMDTime=BenchmarkCode('SIMD', SIMDCode,X,Y)
+
+print("speed up for SIMD = " + str(referenceTime/SIMDTime))
+
+
+```
+
+
+### SAXPY ###
+SAXPY is a classical linear algebra routine that takes two arrays of floats, `X` and `Y`, as parameters and produces the output `Y = Y + alpha * X`, where `alpha` is a scalar input parameter.
+In this tutorial we are interested in the computation performance of SAXPY when we need to accumulate 1000 successive calls of this function:
+
+```c
+for(int j=0; j< numberIterations;j++)
+  saxpy(sizeX, 0.001f, X, Y);
+```
+
+### SIMD version of SAXPY ###
+In the benchmark, we have the reference code of the SAXPY function in the ```referenceCode``` string and the code to be modified in ```SIMDCode```.
+
+
+
+1. Modify SIMDCode to compute SAXPY using either Intel SSE or Arm NEON instructions, depending on your machine architecture. You can find the NEON intrinsics reference here: https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&f:@navigationhierarchiesreturnbasetype=[float] and the Intel SSE reference here: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=SSE,SSE2
+
+1. Test this version and compare the speed ratio between the reference code and SIMDCode.
+1. Draw a diagram of the memory transfers at every call of the SIMD code and over all the calls in the main code.
+1. Modify the main code to move the loop `for(int j=0; j< numberIterations;j++)` inside the SAXPY implementation.
+1. Change the order of the loops to determine the best memory access pattern.
+1. Measure the corresponding acceleration (speed-up) relative to the reference code.
+1. Verify that the results are identical for the reference function and your code.
+------------------
+Go back to [Parallel Computing Tutorials](../README.md)
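For the last two exercise steps (changing the loop order and verifying the results), a NumPy model can serve as an oracle before touching the C++ code. The sketch below is an illustration under stated assumptions, not the tutorial's solution: the function names, the block size of 256 and the reduced array size are hypothetical choices, and the arithmetic is done in float32 to mimic the C code.

```python
import numpy

def saxpy(alpha, X, Y):
    """One SAXPY step in float32: Y = Y + alpha * X, updated in place."""
    Y += numpy.float32(alpha) * X

def compute_reference(number_iterations, X, Y):
    """Reference loop order: sweep the whole array at every iteration."""
    for _ in range(number_iterations):
        saxpy(0.001, X, Y)

def compute_blocked(number_iterations, X, Y, block=256):
    """Swapped loop order: finish all iterations on one block before moving on."""
    for start in range(0, len(X), block):
        x_block = X[start:start + block]   # slices are views, so Y is updated in place
        y_block = Y[start:start + block]
        for _ in range(number_iterations):
            saxpy(0.001, x_block, y_block)

size = 1000   # much smaller than the tutorial's sizeX, to keep the check fast
X = numpy.ones(size, dtype=numpy.float32)
Y_ref = numpy.full(size, 0.5, dtype=numpy.float32)
Y_blk = Y_ref.copy()

compute_reference(1000, X, Y_ref)
compute_blocked(1000, X, Y_blk)

print(numpy.allclose(Y_ref, Y_blk, rtol=1e-5))   # both loop orders agree
print(abs(float(Y_ref[0]) - 1.5) < 1e-3)         # 0.5 + 1000 * 0.001 * 1.0
```

When comparing a compiled SIMD version against the reference, a tolerance-based check such as `numpy.allclose` is a reasonable choice, since a compiler may fuse the multiply and the add into a single instruction and change the last bits of the result.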
--- a/SIMD/inline.py
+++ b/SIMD/inline.py
+#!/usr/bin/env python
+# -*- coding:utf-8 -*-
+
+import atexit
+import ctypes
+import distutils.ccompiler
+import os.path
+import platform
+import shutil
+import sys
+import tempfile
+
+
+__version__ = '0.0.1'
+
+
+def c(source, libraries=[], compiler_extra_args=[], link_extra_args=[]):
+    r"""
+    >>> c('int add(int a, int b) {return a + b;}').add(40, 2)
+    42
+    >>> sqrt = c('''
+    ... #include <math.h>
+    ... double _sqrt(double x) {return sqrt(x);}
+    ... ''', ['m'])._sqrt
+    >>> sqrt.restype = ctypes.c_double
+    >>> sqrt(ctypes.c_double(400.0))
+    20.0
+    """
+    path = _cc_build_shared_lib(source, '.c', libraries, 
+        compiler_extra_args, link_extra_args)
+    return ctypes.cdll.LoadLibrary(path)
+
+
+def cxx(source, libraries=[], compiler_extra_args=[], link_extra_args=[]):
+    r"""
+    >>> cxx('extern "C" { int add(int a, int b) {return a + b;} }').add(40, 2)
+    42
+    """
+    path = _cc_build_shared_lib(source, '.cc', libraries,
+        compiler_extra_args, link_extra_args)
+    return ctypes.cdll.LoadLibrary(path)
+
+cpp = cxx  # alias
+
+def cxx2asm(source, compiler_extra_args=[]):
+    return _cc_get_assembly_code(source, '.cc', compiler_extra_args)
+
+def c2asm(source, compiler_extra_args=[]):
+    return _cc_get_assembly_code(source, '.c', compiler_extra_args)
+
+def python(source):
+    r"""
+    >>> python('def add(a, b): return a + b').add(40, 2)
+    42
+    """
+    obj = type('', (object,), {})()
+    _exec(source, obj.__dict__, obj.__dict__)
+    return obj
+
+
+def _cc_get_assembly_code(source, suffix, compiler_extra_args):
+    assembly_code=None
+    tempdir = tempfile.mkdtemp()
+    atexit.register(lambda: shutil.rmtree(tempdir))
+    with tempfile.NamedTemporaryFile('w+', suffix=suffix, dir=tempdir) as f:
+        f.write(source)
+        f.seek(0)
+        name=f.name
+        cc = distutils.ccompiler.new_compiler()
+        args = [] + compiler_extra_args
+        if platform.system() == 'Linux':
+            args.append('-fPIC')
+        # generate the assembly on every platform, not only on Linux
+        assembly_args = ['-S']
+        assembly_file = cc.compile((name,), tempdir, extra_postargs=args+assembly_args)
+        with open(assembly_file[0],'r') as fa:
+            assembly_code=fa.read()
+        return assembly_code
+
+def _cc_build_shared_lib(source, suffix, libraries, 
+    compiler_extra_args, link_extra_args, return_assembly_code=False):
+    tempdir = tempfile.mkdtemp()
+    atexit.register(lambda: shutil.rmtree(tempdir))
+    with tempfile.NamedTemporaryFile('w+', suffix=suffix, dir=tempdir) as f:
+        f.write(source)
+        f.seek(0)
+        name=f.name
+        cc = distutils.ccompiler.new_compiler()
+        args = [] + compiler_extra_args
+        if platform.system() == 'Linux':
+            args.append('-fPIC')
+        obj = cc.compile((name,), tempdir, extra_postargs=args)
+        for library in libraries:
+            cc.add_library(library)
+        cc.link_shared_lib(obj, name, tempdir, extra_postargs=link_extra_args)
+        filename = cc.library_filename(name, 'shared')
+        return os.path.join(tempdir, filename)
+
+def _exec(object, globals, locals):
+    r"""
+    >>> d = {}
+    >>> _exec('a = 0', d, d)
+    >>> d['a']
+    0
+    """
+    if sys.version_info < (3,):
+        exec('exec object in globals, locals')
+    else:
+        exec(object, globals, locals)
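A note on the `args = [] + compiler_extra_args` lines in `inline.py`: the functions declare list defaults (`compiler_extra_args=[]`), and a default list is created once and shared between calls. Building a new list before appending `-fPIC` keeps that shared default untouched. A minimal sketch of the difference, with two hypothetical helper names that do not exist in `inline.py`:

```python
def build_args_copy(extra_args=[]):
    """Mimics inline.py: concatenate into a new list before appending."""
    args = [] + extra_args
    args.append('-fPIC')
    return args

def build_args_shared(extra_args=[]):
    """Counter-example: appends directly to the shared default list."""
    args = extra_args
    args.append('-fPIC')
    return args

first_copy = build_args_copy()
second_copy = build_args_copy()
first_shared = list(build_args_shared())   # snapshot of the result of call 1
second_shared = build_args_shared()        # call 2 sees the leftover flag

print(first_copy, second_copy)
print(first_shared, second_shared)
```

The same reasoning applies to a list passed by the caller: with the copy, the caller's `compiler_extra_args` list is not modified by the compilation helpers.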
--- a/SIMD/saxpy_simd_origin.py
+++ b/SIMD/saxpy_simd_origin.py
+import numpy
+import ctypes
+from time import time
+import inline
+sizeX = 1000000
+numberIterations =1000
+X = numpy.random.rand(sizeX).astype(numpy.float32)
+Y = numpy.empty(sizeX).astype(numpy.float32)
+
+
+
+def get_assembly(code):
+	return inline.cxx2asm(code, compiler_extra_args=['-march=native','-fopenmp', '-lstdc++']) 
+
+
+def BenchmarkCode(name, code, X, Y, SIMD=True, OMP=True):
+	# init
+	Y[:]=0.5
+	X[:]=1.0
+
+	# compile the code
+	compiler_extra_args = ['-g0','-lstdc++']
+	link_extra_args = ['-lstdc++']
+	if SIMD :
+		compiler_extra_args+=['-march=native']
+		link_extra_args += ['-march=native']
+	if OMP:
+		compiler_extra_args+=['-fopenmp']
+		link_extra_args += ['-fopenmp']
+	lib=inline.cxx(code, compiler_extra_args= compiler_extra_args, link_extra_args= link_extra_args) 
+	p_float= numpy.ctypeslib.ndpointer(dtype=numpy.float32) 	
+	lib.compute.argtypes = [ctypes.c_int, ctypes.c_int, p_float, p_float] 
+
+	# print assembler code 
+	asm_code=inline.cxx2asm(code, compiler_extra_args= compiler_extra_args) 
+	with open(name+'.s','w') as f:
+		f.write(asm_code)
+
+	# start chronometer
+	start_time = time()
+	# run the code
+	lib.compute(numberIterations, sizeX, X, Y)
+	# stop chronometer             
+	stop_time = time()
+	execution_time= stop_time - start_time
+	print("execution time for "+name+" code = "+ str(execution_time))
+	return execution_time
+
+
+# C++ reference code
+referenceCode="""
+#line 52 "saxpy.py" // helpful for debug
+extern "C" {
+#ifdef __SSE2__
+#include<xmmintrin.h>
+#else
+#include <arm_neon.h>
+#endif
+
+void saxpy(int n, float alpha, float *X, float *Y)
+{
+	int i;
+	for (i=0; i<n; i++)
+		Y[i] += alpha * X[i];
+}
+
+void compute(int numberIterations, int sizeX, float *X, float *Y )
+{
+	for(int j=0; j< numberIterations;j++)
+  		saxpy(sizeX, 0.001f, X, Y);
+	return ;
+}
+
+}
+"""
+
+referenceTime=BenchmarkCode('Reference', referenceCode, X, Y)
+
+SIMDCode="""
+#line 82 "saxpy.py" // helpful for debug
+extern "C" {
+
+#ifdef __SSE2__
+#include<xmmintrin.h>
+#else
+#include <arm_neon.h>
+#endif
+
+void saxpy(int n, float alpha, float *X, float *Y)
+{
+	int i;
+	for (i=0; i<n; i++)
+		Y[i] += alpha * X[i];
+}
+
+
+void compute(int numberIterations, int sizeX, float *X, float *Y )
+{
+	for(int j=0; j< numberIterations;j++)
+  		saxpy(sizeX, 0.001f, X, Y);
+	return ;
+}
+
+}
+"""
+
+SIMDTime=BenchmarkCode('SIMD', SIMDCode,X,Y)
+
+print("speed up for SIMD = " + str(referenceTime/SIMDTime))
+
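`BenchmarkCode` measures a single call with `time()`, so one disturbed run can skew the speed-up. A possible refinement, shown here as a sketch rather than as part of the commit, is to repeat the measurement and keep the best time, using `time.perf_counter`, which is intended for measuring short durations. The helper name `best_time` and the NumPy workload are illustrative assumptions.

```python
from time import perf_counter

import numpy

def best_time(function, repeats=5):
    """Call `function` several times and return the shortest elapsed time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = perf_counter()
        function()
        elapsed = perf_counter() - start
        best = min(best, elapsed)
    return best

# Illustrative workload: one NumPy SAXPY step on a small float32 array
X = numpy.ones(100000, dtype=numpy.float32)
Y = numpy.full(100000, 0.5, dtype=numpy.float32)

def one_step():
    Y[:] = Y + numpy.float32(0.001) * X

elapsed = best_time(one_step, repeats=5)
print("best time for one step = " + str(elapsed))
```

In the benchmark script, the same helper could wrap the call `lib.compute(numberIterations, sizeX, X, Y)`, provided `Y` is reset to its initial value before each repeat so that every run performs the same work.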