#+TITLE: Getting started
#+SUBTITLE: Vecto
#+AUTHOR: Emmanuel Agullo, Olivier Aumage, Alycia Lisito and Mathieu Faverge
#+INCLUDE: https://gitlab.inria.fr/elementaryx/emacs-elementaryx-ox-html-themes/-/raw/main/org/theme-bigblow-less.setup
#+PROPERTY: header-args:bash :eval no :exports both

In this step, you will extend your sequential code (~-v seq~) to
support blocking and vectorization.

** Blocked matrix product

What is being optimized when a matrix product is decomposed into products of
blocks? Write a ~dgemm_bloc~ routine (again conforming to the =cblas=
interface) that computes the matrix product this way.

Reuse your previous scalar routine to compute the block products. This is
probably a good time to rename it ~dgemm_scalaire~ and to turn ~dgemm_seq~
into a high-level routine that calls your fastest implementation (~dgemm_seq~
is the routine we will test on our side).
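
As an illustration of the loop structure only, here is a minimal sketch of a
blocked product. It is not the expected ~dgemm_bloc~: it assumes column-major
storage, no transposition, and computes only =C += A * B= (no ~alpha~ or
~beta~). The names ~gemm_blocked_sketch~, ~tile_kernel~ and the block size
~BS~ are hypothetical and chosen for this example.

#+begin_src c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block size: to be tuned for the cache of the target machine. */
#define BS 64

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Scalar kernel on one tile: C += A * B, column-major storage. */
static void tile_kernel(size_t m, size_t n, size_t k,
                        const double *A, size_t lda,
                        const double *B, size_t ldb,
                        double *C, size_t ldc)
{
    for (size_t j = 0; j < n; j++)
        for (size_t l = 0; l < k; l++) {
            const double b = B[l + j * ldb];
            for (size_t i = 0; i < m; i++)   /* unit-stride inner loop */
                C[i + j * ldc] += A[i + l * lda] * b;
        }
}

/* Blocked driver: C += A * B with A (M x K), B (K x N), C (M x N). */
void gemm_blocked_sketch(size_t M, size_t N, size_t K,
                         const double *A, size_t lda,
                         const double *B, size_t ldb,
                         double *C, size_t ldc)
{
    /* Leading dimensions must cover the number of rows of each matrix. */
    assert(lda >= M && ldb >= K && ldc >= M);

    for (size_t j = 0; j < N; j += BS)
        for (size_t l = 0; l < K; l += BS)
            for (size_t i = 0; i < M; i += BS)
                /* min_sz handles the partial tiles on the matrix borders. */
                tile_kernel(min_sz(BS, M - i), min_sz(BS, N - j),
                            min_sz(BS, K - l),
                            &A[i + l * lda], lda,
                            &B[l + j * ldb], ldb,
                            &C[i + j * ldc], ldc);
}
#+end_src

The driver only selects tiles and delegates the arithmetic to the tile kernel,
which is where the existing scalar routine would plug in. A full =cblas=-style
routine would additionally have to handle the layout, the transposition flags
and the ~alpha~ and ~beta~ scalings.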

** Vectorization

Vectorize your code. Do all routines need to be vectorized? Which routine is
the most relevant one to vectorize? You may organize your code as you wish
but, /in fine/, make sure that the ~dgemm_seq~ routine takes advantage of all
your optimizations (blocking, vectorization, ...).

Do not hesitate to study the generated code (see in particular [[https://godbolt.org/][godbolt]]).

Note also that, as it stands, no option is passed to the compiler to take
advantage of the architecture. It is up to you to modify your
[[../CMakeLists.txt]] accordingly, e.g. by specifying the target architecture
with ~set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -march=haswell")~ or by asking the
compiler to optimize for the detected architecture with
~set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -march=native")~. You can also read the
dedicated section for the =guix= case at the end of [[./setup-guix.org][setup guix]].
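
One way to check that a ~-march~ flag actually reached the compiler is to look
at the macros it predefines. The sketch below relies on the macros that GCC
and Clang define for each instruction set (~__AVX2__~, ~__AVX512F__~, ...);
the function name ~simd_level~ is hypothetical and not part of
=mini-chameleon=.

#+begin_src c
/* Return the widest SIMD instruction set enabled at compile time,
 * according to the macros predefined by GCC and Clang. */
const char *simd_level(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__ARM_NEON)
    return "NEON";
#else
    return "scalar";
#endif
}
#+end_src

Printing the returned string from a small driver (e.g. with ~puts~) before and
after editing the ~CMAKE_C_FLAGS~ shows whether the build picked up the new
flag.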

*** Option 1: Intrinsics

Put [[https://moodle.bordeaux-inp.fr/mod/page/view.php?id=121116][your course]] into practice.
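
As a starting point, here is a minimal sketch of an intrinsics kernel, assuming
an x86-64 target compiled with AVX2 and FMA enabled (e.g. ~-march=haswell~).
It vectorizes the update =y += alpha * x=, which is the shape of the innermost
loop of a column-major product. The name ~daxpy_simd_sketch~ is hypothetical,
and the code falls back to a scalar loop when the instruction sets are not
enabled.

#+begin_src c
#include <stddef.h>
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* y[0..n) += alpha * x[0..n) */
void daxpy_simd_sketch(size_t n, double alpha, const double *x, double *y)
{
    size_t i = 0;
#if defined(__AVX2__) && defined(__FMA__)
    /* Broadcast alpha to the four lanes of a 256-bit register. */
    const __m256d va = _mm256_set1_pd(alpha);
    for (; i + 4 <= n; i += 4) {
        /* Unaligned loads: no alignment assumption on x and y. */
        __m256d vx = _mm256_loadu_pd(x + i);
        __m256d vy = _mm256_loadu_pd(y + i);
        vy = _mm256_fmadd_pd(va, vx, vy);   /* vy = va * vx + vy */
        _mm256_storeu_pd(y + i, vy);
    }
#endif
    /* Scalar remainder (or the whole loop without AVX2 and FMA). */
    for (; i < n; i++)
        y[i] += alpha * x[i];
}
#+end_src

The remainder loop keeps the routine correct for any ~n~, not only multiples
of the vector width. Aligned loads, wider registers or register blocking over
several columns are possible refinements that this sketch leaves out.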

*** Option 2: MIPP

You can also consider the [[https://github.com/aff3ct/MIPP][MyIntrinsics++ (MIPP)]] portable and
open-source wrapper for vector intrinsic functions (SIMD).

*October 27, 2023*: Please *update* =mini-chameleon= (and =guix= if
you are using it) to benefit from the ~MIPP~ support.

Enable it in =mini-chameleon= with ~-DENABLE_MIPP=ON~. Assuming you are at
the root of =mini-chameleon=:

#+begin_src bash
  mkdir -p build/mipp
  cmake . -B build/mipp -DENABLE_MIPP=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo
  cmake --build build/mipp
#+end_src

This will build the ~myblas/gemm_mipp.cpp~ [[https://github.com/aff3ct/MIPP/blob/04d9f5f2733dcc4681d80f87ee0fa76a185dfd12/examples/gemm.cpp][MIPP reference example]] as
a starting point. It is your responsibility to design a
fully-featured ~GEMM~ routine by calling your (improved) MIPP routine
within ~gemm_mipp~ from your ~dgemm_seq~ main sequential routine.

**** IDE setup

Note that you may want to refine your [[./setup-ide.org][IDE setup]]. In this case remove
your ~compile_commands.json~ at the root of the =mini-chameleon=
project (if you previously generated it) and re-generate it as follows
(from a clean ~build/mipp~):

#+begin_src bash
  mkdir -p build/mipp
  cmake . -B build/mipp/ -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DENABLE_MPI=ON -DENABLE_STARPU=ON -DENABLE_MIPP=ON # -DCMAKE_BUILD_TYPE=Debug #
  ln -s build/mipp/compile_commands.json .
#+end_src

The ~-DCMAKE_BUILD_TYPE=Debug~ flag may be left disabled so that you can
inspect the code generated with the effective compilation flags. In
~emacs-bedrock~, this can be done by enabling [[https://github.com/emacsmirror/rmsbolt/blob/master/doc/rmsbolt.org][rmsbolt]] within your
~gemm_mipp.cpp~ buffer (~M-x rmsbolt-mode~). You can then press
~C-c C-c~ each time you want to (re-)generate the assembly code
corresponding to your current code. As a reminder, you can run
~emacs-bedrock~ with:

#+begin_src bash
guix shell --pure emacs-bedrock-as-default -D mini-chameleon -- emacs -nw -f xterm-mouse-mode
#+end_src