Commit 03f041bf authored by COULAUD Olivier
parents ce1151d8 41ecd20c
@@ -114,13 +114,15 @@ if (MORSE_DISTRIB_DIR OR EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/CMakeModules/morse/
option( SCALFMM_DISABLE_NATIVE_OMP4 "Set to ON to disable the gcc/intel omp4" ON )
endif()
option( SCALFMM_TIME_OMPTASKS "Set to ON to time omp4 tasks and generate output file" OFF )
# SIMGRID and performance models options
option( SCALFMM_SIMGRID_NODATA "Set to ON to avoid the allocation of numerical parts in the group tree" OFF )
option( SCALFMM_SIMGRID_TASKNAMEPARAMS "Set to ON to have verbose information in the task name" OFF )
# STARPU options
option( STARPU_SIMGRID_MLR_MODELS "Set to ON to enable MLR models needed for calibration and simulation" OFF )
# STARPU options
CMAKE_DEPENDENT_OPTION(SCALFMM_STARPU_USE_COMMUTE "Set to ON to enable commute with StarPU" ON "SCALFMM_USE_STARPU" OFF)
CMAKE_DEPENDENT_OPTION(SCALFMM_STARPU_USE_REDUX "Set to ON to enable redux with StarPU" OFF "SCALFMM_USE_STARPU" OFF)
CMAKE_DEPENDENT_OPTION(SCALFMM_STARPU_USE_PRIO "Set to ON to enable priority with StarPU" ON "SCALFMM_USE_STARPU" OFF)
CMAKE_DEPENDENT_OPTION(SCALFMM_STARPU_FORCE_NO_SCHEDULER "Set to ON to disable heteroprio even if supported" OFF "SCALFMM_USE_STARPU" OFF)
CMAKE_DEPENDENT_OPTION(SCALFMM_USE_STARPU_EXTRACT "Set to ON to enable extract with StarPU mpi implicit" ON "SCALFMM_USE_STARPU" OFF)
endif()
message(STATUS "AVANT ${CMAKE_CXX_COMPILER_ID}" )
#
@@ -599,6 +601,10 @@ if (MORSE_DISTRIB_DIR OR EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/CMakeModules/morse/
set(SCALFMM_INCLUDES "${SCALFMM_INCLUDES}; ${STARPU_INCLUDE_DIRS}")
endif()
# Adding SimGrid includes
set(SCALFMM_INCLUDES "${SCALFMM_INCLUDES};$ENV{SIMGRID_INCLUDE}")
message(STATUS " Adding SIMGRID = $ENV{SIMGRID_INCLUDE}")
# TODO: replace this by a component of find starpu
OPTION( SCALFMM_USE_OPENCL "Set to ON to use OPENCL with StarPU" OFF )
MESSAGE( STATUS "SCALFMM_USE_OPENCL = ${SCALFMM_USE_OPENCL}" )
#+TITLE: Implicit MPI usage in ScalFmm
#+AUTHOR: Martin Khannouz
#+LANGUAGE: en
#+STARTUP: inlineimages
#+OPTIONS: H:3 num:t toc:t \n:nil @:t ::t |:t ^:nil -:t f:t *:t <:t
#+OPTIONS: TeX:t LaTeX:t skip:nil d:nil todo:nil pri:nil tags:not-in-toc
#+EXPORT_SELECT_TAGS: export
#+EXPORT_EXCLUDE_TAGS: noexport
#+TAGS: noexport(n)
* Abstract
The Fast Multipole Method (FMM) is one of the most prominent algorithms for computing pair-wise particle interactions, with applications in many physical problems such as astrophysical simulation, molecular dynamics, the boundary element method, radiosity in computer graphics and dislocation dynamics. Its linear complexity makes it an excellent candidate for large-scale simulations.
The following document aims to describe how to switch from an
explicit StarPU MPI code to an implicit StarPU MPI code. It also describes
the methodology used to compare MPI algorithms in ScalFmm and the results obtained.
* Introduction
** N-Body problem
<<sec:nbody_problem>>
In physics, the /n-body/ problem is the problem of predicting the individual motions of a group of celestial objects interacting with each other gravitationally.
Solving this problem has been motivated by the desire to understand the motions of the Sun, Moon, planets and the visible stars.
It has since been extended to other fields such as electrostatics or molecular dynamics.
NOTE: Taken from Wikipedia; not sure this is scholarly enough.
** Fast Multipole Method (FMM)
*** Description
The Fast Multipole Method (FMM) is a hierarchical method for the /n−body/ problem introduced in [[sec:nbody_problem]] that has been ranked among the top ten algorithms of the 20th century by SIAM (source).
In the original study, the FMM was presented to solve a 2D particle interaction problem, but it was later extended to 3D.
The FMM dissociates the near field from the far field and still uses accurate direct computation for the near field.
The original FMM proposal was based on a mathematical method to approximate the far field, combined with an algorithm based on a quadtree/octree, for molecular dynamics or astrophysics.
The algorithm is the core part because it is responsible for calling the mathematical functions in the correct order to approximate the interactions between clusters that represent distant particles.
When an application is said to be accelerated by the FMM, it means that the application uses the FMM algorithm, usually with another mathematical kernel that matches its problem.
The FMM is used to solve a variety of problems: astrophysical simulations, molecular dynamics, the boundary element method, radiosity in computer graphics and dislocation dynamics, among others.
NOTE: Taken directly from Bérenger's thesis.
*** Algorithm
The FMM algorithm relies on an octree (a quadtree in 2D) obtained by recursively splitting the simulation space into 8 parts (4 parts in 2D). The construction is shown in figure [[fig:octree]].
#+CAPTION: 2D space decomposition (Quadtree). Grid view and hierarchical view.
#+name: fig:octree
[[./figure/octree.png]]
#+CAPTION: Different steps of the FMM algorithm; upward pass (left), transfer pass and direct step (center), and downward pass (right).
#+name: fig:algorithm
[[./figure/FMM.png]]
The algorithm is illustrated in figure [[fig:algorithm]]. It first applies
the P2M operator to approximate the particle interactions in the multipoles.
Then the M2M operator is applied from each level to the next to propagate
these approximations up the tree. Then M2L and P2P are performed between neighbors.
Finally, the L2L operator passes the approximations down to the level below, and the L2P
operator applies the approximations of the last level to the particles.
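To make the ordering of the operators concrete, here is a minimal, schematic C sketch of one FMM iteration. The operators are stubs that only print what they would do, the octree is reduced to its levels, and /TREE_HEIGHT/ is a made-up parameter; the real ScalFmm kernels and tree traversals are of course much richer.
#+begin_src c
#include <stdio.h>

#define TREE_HEIGHT 5   /* hypothetical tree height */

/* Stub operators: a real kernel (e.g. Chebyshev) would do the math here. */
static void P2M(int level) { printf("P2M at leaf level %d\n", level); }
static void M2M(int level) { printf("M2M from level %d to %d\n", level + 1, level); }
static void M2L(int level) { printf("M2L between neighbors at level %d\n", level); }
static void P2P(void)      { printf("P2P between neighboring leaves\n"); }
static void L2L(int level) { printf("L2L from level %d to %d\n", level, level + 1); }
static void L2P(int level) { printf("L2P at leaf level %d\n", level); }

int main(void)
{
    const int leafLevel = TREE_HEIGHT - 1;

    /* Upward pass */
    P2M(leafLevel);
    for (int level = leafLevel - 1; level >= 2; --level)
        M2M(level);

    /* Transfer pass and direct step */
    for (int level = 2; level <= leafLevel; ++level)
        M2L(level);
    P2P();

    /* Downward pass */
    for (int level = 2; level < leafLevel; ++level)
        L2L(level);
    L2P(leafLevel);
    return 0;
}
#+end_src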
* State of the art
TODO
What are the other FMM libraries?
What is their performance?
What are the other FMM libraries with MPI?
What is their performance?
What data mapping do they use?
** Task based FMM
*** Runtime
In the field of HPC, a runtime system is in charge of the parallel execution of an application.
A runtime system must provide facilities to split and to pipeline the work but also to use the hardware efficiently. In our case, we restrict this definition and remove the thread libraries.
We list here some of the well-known runtime systems: [[https://www.bsc.es/computer-sciences/programming-models/smp-superscalar/programming-model][SMPSs]], [[http://starpu.gforge.inria.fr/][StarPU]], [[http://icl.cs.utk.edu/parsec/][PaRSEC]], [[https://software.intel.com/sites/landingpage/icc/api/index.html][CnC]], [[http://icl.cs.utk.edu/quark][Quark]], SuperMatrix and [[http://openmp.org/wp/][OpenMP]].
These different runtime systems rely on several paradigms, such as the two well-known fork-join and task-based models.
We can describe the tasks that compose an application and their dependencies with a directed acyclic graph (DAG); the tasks are represented by nodes/vertices and their dependencies by edges.
For example, the DAG given by A → B states that there are two tasks A and B, and that A must be finished before B is released.
Such a dependency happens if A modifies a value that will later be used by B, or if A reads a value that B will modify.
The task-based paradigm has been studied by the dense linear algebra community and used in new production solvers such as [[http://icl.cs.utk.edu/plasma/news/news.html?id=212][Plasma]], [[http://icl.cs.utk.edu/magma/software/][Magma]] or Flame.
The robustness and high efficiency of these dense solvers have motivated the study of more irregular algorithms such as sparse linear solvers and now the fast multipole method.
NOTE: Taken directly from Bérenger's thesis.
NOTE: somehow, mention that ScalFmm uses the StarPU runtime.
*** Sequential Task Flow
A sequential task flow (STF) is a sequential algorithm that describes the tasks to be done and their dependencies, while the tasks themselves can be executed asynchronously and in parallel.
In this way, the main algorithm remains almost as simple as the sequential one and can expose a lot of parallelism.
It also creates a DAG whose properties can be used to prove interesting results. (Not really sure)
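As an illustration (a hypothetical example, not code from ScalFmm), the following StarPU sketch submits two tasks sequentially on the same handle; the runtime infers the dependency A → B from the access modes and executes the resulting DAG asynchronously.
#+begin_src c
#include <stdint.h>
#include <starpu.h>

/* Each task simply increments the variable behind its handle. */
static void increment_cpu(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    int *value = (int *)STARPU_VARIABLE_GET_PTR(buffers[0]);
    (*value)++;
}

static struct starpu_codelet increment_cl =
{
    .cpu_funcs = { increment_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    int value = 0;
    starpu_data_handle_t handle;

    if (starpu_init(NULL) != 0) return 1;
    starpu_variable_data_register(&handle, STARPU_MAIN_RAM,
                                  (uintptr_t)&value, sizeof(value));

    /* Sequential submission: task A then task B on the same RW handle.
       StarPU infers A -> B and runs them asynchronously in that order. */
    starpu_task_insert(&increment_cl, STARPU_RW, handle, 0); /* A */
    starpu_task_insert(&increment_cl, STARPU_RW, handle, 0); /* B */

    starpu_task_wait_for_all();
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}
#+end_src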
*** Group tree
A group tree is like the original octree (or quadtree in 2D) where cells and
particles are packed together into groups of cells and groups of particles. Tasks are
then executed on those groups rather than on a single cell (or
multipole).
The group tree was introduced because the original algorithm generated too
many small tasks and the time spent in the runtime increased too much. With the
group tree, tasks become long enough that the time spent in the runtime becomes negligible again.
A group tree is built following a simple rule. Given a group size,
particles (or multipoles) ordered by their Morton index are grouped together
regardless of their parents or children.
#+CAPTION: A quadtree and the corresponding group tree with a group size of 3.
#+name: fig:grouptree
[[./figure/blocked.png]]
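The grouping rule itself is simple enough to sketch. In the following illustration the Morton indices and the group size are made up, and the real group tree of course also stores the cell and particle data.
#+begin_src c
#include <stdio.h>

#define GROUP_SIZE 3   /* hypothetical group size */

int main(void)
{
    /* Leaves sorted by Morton index (made-up values). */
    const long mortonIndex[] = { 0, 1, 4, 5, 6, 9, 12, 13, 15, 21 };
    const int  nbLeaves = sizeof(mortonIndex) / sizeof(mortonIndex[0]);

    /* Pack consecutive leaves into groups of GROUP_SIZE,
       regardless of their parents or children. */
    for (int first = 0; first < nbLeaves; first += GROUP_SIZE) {
        int last = first + GROUP_SIZE - 1;
        if (last >= nbLeaves) last = nbLeaves - 1;
        printf("group [%d..%d]: Morton %ld to %ld\n",
               first, last, mortonIndex[first], mortonIndex[last]);
    }
    return 0;
}
#+end_src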
*** Scalfmm
ScalFmm is a library for simulating N-body interactions using the Fast
Multipole Method; it can be found
[[http://scalfmm-public.gforge.inria.fr/doc/][here]].
*** Distributed FMM
In what follows, I will refer to the initial distributed algorithm in ScalFmm
using StarPU as the explicit MPI algorithm.
TODO: how does it currently work?
The MPI communications are posted manually.
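For reference, in the explicit algorithm the programmer posts the communications by hand with the detached StarPU-MPI calls; a rough sketch (the handle, tag and ranks are hypothetical) could look like this:
#+begin_src c
/* Sketch of explicitly posted communications (hypothetical handle/tag/ranks).
   In the explicit algorithm, the programmer decides which handle goes where. */
#include <starpu_mpi.h>

void exchange_sketch(starpu_data_handle_t handle, int my_rank)
{
    const int tag = 42;   /* must match on both sides */

    if (my_rank == 0)
        starpu_mpi_isend_detached(handle, /*dest=*/1, tag,
                                  MPI_COMM_WORLD, NULL, NULL);
    else if (my_rank == 1)
        starpu_mpi_irecv_detached(handle, /*source=*/0, tag,
                                  MPI_COMM_WORLD, NULL, NULL);
}
#+end_src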
* Setup
** Installing
*** FxT
StarPU can use the FxT library to generate traces to help in identifying bottlenecks and to compute different runtime metrics.
To install FxT:
#+begin_src
wget http://download.savannah.gnu.org/releases/fkt/fxt-0.2.11.tar.gz
tar xf fxt-0.2.11.tar.gz
cd fxt-0.2.11
./configure --prefix=$FXT_INSTALL_DIR
make
make install
#+end_src
Remember to set /FXT_INSTALL_DIR/.
For more information, check [[http://starpu.gforge.inria.fr/doc/html/OfflinePerformanceTools.html][here]].
*** StarPU
Then, install StarPU and its dependencies.
It is recommended to install hwloc. Please refer to your package manager, or
follow the installation guide from [[https://www.open-mpi.org/projects/hwloc/][here]].
#+begin_src
svn checkout svn://scm.gforge.inria.fr/svn/starpu/trunk StarPU
cd StarPU
./autogen.sh
./configure --prefix=$STARPU_INSTALL_DIR \
--disable-cuda \
--disable-opencl \
--disable-fortran \
--with-fxt=$FXT_INSTALL_DIR \
--disable-debug \
--enable-openmp \
--disable-verbose \
--disable-gcc-extensions \
--disable-starpu-top \
--disable-build-doc \
--disable-build-examples \
--disable-starpufft \
--disable-allocation-cache
make
make install
#+end_src
Remember to set /STARPU_INSTALL_DIR/.
If you are on Debian or a Debian-like distribution, simpler ways to install StarPU are described [[http://starpu.gforge.inria.fr/doc/html/BuildingAndInstallingStarPU.html][here]].
The packaged version may not be the very latest release and some features may not be
available.
For now, ScalFmm requires the trunk version.
**** Set up environment
These are environment variables that might be useful to set.
#+begin_src
export PKG_CONFIG_PATH=$FXT_INSTALL_DIR/lib/pkgconfig:$PKG_CONFIG_PATH
export PKG_CONFIG_PATH=$STARPU_INSTALL_DIR/lib/pkgconfig:$PKG_CONFIG_PATH
export LD_LIBRARY_PATH=$STARPU_INSTALL_DIR/lib:$LD_LIBRARY_PATH
export PATH=$PATH:$STARPU_INSTALL_DIR/bin
export STARPU_DIR=$STARPU_INSTALL_DIR
#+end_src
*** Scalfmm
Finally, install ScalFmm.
If you have access to the ScalFmm repository, you can get it this way:
#+begin_src
git clone git+ssh://mkhannou@scm.gforge.inria.fr/gitroot/scalfmm/scalfmm.git
cd scalfmm
git checkout mpi_implicit
#+end_src
If you do not have access to this repository, you can
get a tarball this way:
#+begin_src
wget http://scalfmm.gforge.inria.fr/orgmode/implicit/scalfmm.tar.gz
tar xf scalfmm.tar.gz
cd scalfmm
#+end_src
Then:
#+begin_src
cd Build
cmake .. -DSCALFMM_USE_MPI=ON -DSCALFMM_USE_STARPU=ON -DSCALFMM_USE_FFT=ON -DSCALFMM_BUILD_EXAMPLES=ON -DSCALFMM_BUILD_TESTS=ON
make testBlockedChebyshev testBlockedImplicitChebyshev testBlockedMpiChebyshev testBlockedImplicitAlgorithm testBlockedMpiAlgorithm
#+end_src
*** Execute
Here is a quick way to execute on your computer:
#+begin_src
cd scalfmm/Build
export STARPU_FXT_PREFIX=otter
mpiexec -n 2 ./Tests/Release/testBlockedImplicitChebyshev -nb 50000 -bs 100 -h 5
#+end_src
If you want to gather the traces for this specific case:
#+begin_src
starpu_fxt_tool -i otter*
#+end_src
This will create /paje.trace/, /trace.rec/, /task.rec/ and many other trace files.
** Useful scripts
*** Setup on plafrim
To set up everything that is needed on plafrim, I first install spack.
#+begin_src sh
git clone https://github.com/fpruvost/spack.git
#+end_src
Then you have to add the spack binary directory to your path.
#+begin_src sh
export PATH=$PATH:$PWD/spack/bin
#+end_src
If your default Python interpreter is not Python 2, you might have to replace the first line of spack/bin/spack with
#+begin_src sh
#!/usr/bin/env python2
#+end_src
so that the script is automatically run with Python 2.
Then, you have to add your SSH key to your ssh agent. The following script kills all running ssh agents, respawns one and adds the SSH key.
#+begin_src sh
SSH_KEY=".ssh/rsa_inria"
killall -9 ssh-agent > /dev/null
eval `ssh-agent` > /dev/null
ssh-add $SSH_KEY
#+end_src
Because users on plafrim cannot connect to the rest of the world, you have to copy the data there.
So copy the spack directory, then use spack to create a mirror that will be sent to plafrim so that spack will be able to install packages.
#+begin_src sh
MIRROR_DIRECTORY="tarball_scalfmm"
#Copy spack to plafrim
scp -r spack mkhannou@plafrim:/home/mkhannou
#Recreate the mirror
rm -rf $MIRROR_DIRECTORY
mkdir $MIRROR_DIRECTORY
spack mirror create -D -d $MIRROR_DIRECTORY starpu@svn-trunk+mpi \^openmpi
#Create an archive and send it to plafrim
tar czf /tmp/canard.tar.gz $MIRROR_DIRECTORY
scp /tmp/canard.tar.gz mkhannou@plafrim-ext:/home/mkhannou
rm -f /tmp/canard.tar.gz
#Install on plafrim
ssh mkhannou@plafrim 'tar xf canard.tar.gz; rm -f canard.tar.gz'
ssh mkhannou@plafrim "/home/mkhannou/spack/bin/spack mirror add local_filesystem file:///home/mkhannou/$MIRROR_DIRECTORY"
ssh mkhannou@plafrim '/home/mkhannou/spack/bin/spack install starpu@svn-trunk+mpi+fxt \^openmpi'
#+end_src
TODO: add the script I added on the plafrim side with the library links.
*** Execute on plafrim
To run my tests on plafrim, I used the two following scripts.
The first one sends the ScalFmm repository to plafrim.
#+include: "~/narval.sh" src sh
Note: you might have to add your SSH key again if you killed your previous ssh agent.
The second one is run on plafrim; it configures, compiles and submits all the jobs.
#+begin_src sh
module add slurm
module add compiler/gcc/5.3.0 tools/module_cat/1.0.0 intel/mkl/64/11.2/2016.0.0
# specific to plafrim to get missing system libs
export LIBRARY_PATH=/usr/lib64:$LIBRARY_PATH
# load spack env
export SPACK_ROOT=$HOME/spack
. $SPACK_ROOT/share/spack/setup-env.sh
spack load fftw
spack load hwloc
spack load openmpi
spack load starpu@svn-trunk+fxt
cd scalfmm/Build
rm -rf CMakeCache.txt CMakeFiles > /dev/null
cmake .. -DSCALFMM_USE_MPI=ON -DSCALFMM_USE_STARPU=ON -DSCALFMM_USE_FFT=ON -DSCALFMM_BUILD_EXAMPLES=ON -DSCALFMM_BUILD_TESTS=ON -DCMAKE_CXX_COMPILER=`which g++`
make clean
make testBlockedChebyshev testBlockedImplicitChebyshev testBlockedMpiChebyshev testBlockedImplicitAlgorithm testBlockedMpiAlgorithm
cd ..
files=./jobs/*.sh
mkdir jobs_result
for f in $files
do
echo "Submit $f..."
sbatch $f
if [ "$?" != "0" ] ; then
break;
fi
done
#+end_src
*** Export orgmode somewhere accessible
A good place I found to put your org-mode file and its HTML export is on the Inria forge, in your project repository.
For me it was the path /home/groups/scalfmm/htdocs.
So I created a directory named orgmode and wrote the following script to update the files.
#+include: "~/scalfmm/export_orgmode.sh" src sh
* Implicit MPI FMM
** Sequential Task Flow with implicit communication
There are very few differences between the STF version and the implicit MPI STF version.
*** Init
The first difference between a plain StarPU algorithm and an implicit StarPU
MPI one is the call to /starpu_mpi_init/ right after /starpu_init/
and a call to /starpu_mpi_shutdown/ right before /starpu_shutdown/.
The call to /starpu_mpi_init/ looks like:
#+begin_src c
starpu_mpi_init(argc, argv, initialize_mpi)
#+end_src
/initialize_mpi/ should be set to 0 if a call to /MPI_Init/ (or
/MPI_Init_thread/) has already been made.
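Putting the calls together, a minimal sketch of the initialization and shutdown order could look like this (error handling is reduced to a bare minimum, and letting StarPU initialize MPI itself is an assumption of the example):
#+begin_src c
#include <starpu.h>
#include <starpu_mpi.h>

int main(int argc, char **argv)
{
    /* StarPU first, then its MPI layer; here StarPU calls MPI_Init itself. */
    if (starpu_init(NULL) != 0) return 1;
    if (starpu_mpi_init(&argc, &argv, /*initialize_mpi=*/1) != 0) return 1;

    /* ... register handles, submit tasks, wait for them ... */

    /* Shut down in the reverse order. */
    starpu_mpi_shutdown();
    starpu_shutdown();
    return 0;
}
#+end_src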
*** Data handle
The second difference is the way StarPU handles are registered.
There is still the classical call to /starpu_variable_data_register/ so that
StarPU knows about the data, but it also needs a call to /starpu_mpi_data_register/.
The call looks like this:
#+begin_src c
starpu_mpi_data_register(starpu_handle, tag, mpi_rank)
#+end_src
/starpu_handle/ : the handle used by StarPU to work with the data.
/tag/ : the MPI tag, which must be different for each
handle but must correspond to the same handle on all MPI nodes.
/mpi_rank/ : the MPI node on which the data will be stored.
Note that when a handle is registered on a node different from the
current one, the call to /starpu_variable_data_register/ should look like:
#+begin_src c
starpu_variable_data_register(starpu_handle, -1, buffer, buffer_size);
#+end_src
The -1 specifies that the data is not stored in the main memory; in this
case, it is stored on another node.
At the end of the application, handles should be unregistered with /starpu_data_unregister/,
but only those that were registered on the current node.
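A sketch combining both registrations (the function name, the owner rank and the tag are hypothetical) might look like:
#+begin_src c
#include <stdint.h>
#include <starpu.h>
#include <starpu_mpi.h>

/* Sketch: register one variable owned by rank `owner`.
   The tag must be unique per handle and identical on every rank. */
starpu_data_handle_t register_sketch(int my_rank, int owner,
                                     int *buffer, int tag)
{
    starpu_data_handle_t handle;

    if (my_rank == owner)   /* the data lives in this node's main memory */
        starpu_variable_data_register(&handle, STARPU_MAIN_RAM,
                                      (uintptr_t)buffer, sizeof(*buffer));
    else                    /* home node -1: the data is stored on another node */
        starpu_variable_data_register(&handle, -1,
                                      (uintptr_t)buffer, sizeof(*buffer));

    /* Tell StarPU-MPI which rank owns the handle and which tag identifies it. */
    starpu_mpi_data_register(handle, tag, owner);
    return handle;
}
#+end_src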
*** Data mapping function
The last difference, and probably the most interesting one, is the data
mapping function. This function must return the node on which the data
will be mapped, given information about the data.
For now, in ScalFmm, it uses the level in the octree and the Morton index
in this level, but it could use anything, such as external
information previously computed by another software.
For now, here is the data mapping function:
#+begin_src c
int dataMappingBerenger(MortonIndex const idx, int const idxLevel) const {
for(int i = 0; i < nproc; ++i)
if(nodeRepartition[idxLevel][i][0] <= nodeRepartition[idxLevel][i][1] && idx >= nodeRepartition[idxLevel][i][0] && idx <= nodeRepartition[idxLevel][i][1])
return i;
if(mpi_rank == 0)
cout << "[scalfmm][map error] idx " << idx << " on level " << idxLevel << " isn't mapped on any proccess." << endl;
return -1;
}
#+end_src
/nodeRepartition/ is an array that describes, for each level, the working
interval of each node.
** Data Mapping
One of the main advantages of using implicit MPI communication in StarPU is that the data mapping can be separated from the algorithm. It is then possible to change the data mapping without changing the algorithm.
*** Level Split Mapping
The level split mapping aims to balance the work between MPI processes by splitting each level evenly between all of them, as sketched below.
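This is only an illustration of the idea (the tree height, the number of processes and the /nodeRepartition/-like layout are assumptions): each level of an octree holds 8^level Morton indices, which are split into contiguous intervals of nearly equal size.
#+begin_src c
#include <stdio.h>

#define TREE_HEIGHT 4   /* hypothetical tree height */
#define NB_PROC     3   /* hypothetical number of MPI processes */

int main(void)
{
    /* nodeRepartition[level][proc][0..1] = first/last Morton index owned. */
    long long nodeRepartition[TREE_HEIGHT][NB_PROC][2];

    for (int level = 0; level < TREE_HEIGHT; ++level) {
        /* 8^level cells exist at this level of an octree. */
        long long nbCells = 1LL << (3 * level);
        for (int proc = 0; proc < NB_PROC; ++proc) {
            nodeRepartition[level][proc][0] = (nbCells * proc) / NB_PROC;
            nodeRepartition[level][proc][1] = (nbCells * (proc + 1)) / NB_PROC - 1;
        }
    }

    /* Print the intervals of the deepest level. */
    for (int proc = 0; proc < NB_PROC; ++proc)
        printf("proc %d: [%lld, %lld]\n", proc,
               nodeRepartition[TREE_HEIGHT - 1][proc][0],
               nodeRepartition[TREE_HEIGHT - 1][proc][1]);
    return 0;
}
#+end_src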
*** Leaf Load Balancing
The leaf load balancing aims to balance the work between MPI processes by splitting the leaves evenly between them.
The upper levels are then split following algorithm 13 of Bérenger's thesis.
*** TreeMatch Balancing
TreeMatch is a tool developed at Inria Bordeaux, which is accessible [[http://treematch.gforge.inria.fr/][here]].
TreeMatch aims to reorganize the process mapping to obtain the best performance.
It takes as input the hardware topology and information about the MPI exchanges.
This tool could be used to enforce a certain data mapping in the implicit MPI version and achieve higher performance.
** Result
*** Hardware
One node has 2 dodeca-core Haswell Intel® Xeon® E5-2680 processors at 2.5 GHz, 128 GB of RAM (DDR4 2133 MHz) and 500 GB of storage (SATA).
*** Aims
The aim is to compare the explicit and implicit versions, as well as any other MPI
version or MPI data mapping.
To measure the impact of implicit communication, we need an implicit version as close to the explicit version as possible.
Mainly, this means the same particles in the same group tree, with the same tasks executed on the same nodes.
All algorithms are studied on two different particle distributions, a uniform cube
and an ellipsoid, shown in figures [[fig:uniform]] and
[[fig:ellipse]].
#+CAPTION: cube (volume).
#+name: fig:uniform
[[./figure/uniformdistribution.png]]
#+CAPTION: Ellipsoid (surface).
#+name: fig:ellipse
[[./figure/ellipsedistribution.png]]
The point of working on the uniform cube is to validate the algorithms on a
simple case. It also allows checking for any performance regression. The
ellipsoid (which is a surface and not a volume) is a more challenging
particle set because it generates an unbalanced tree. It is used to see
whether one algorithm is better than another.
*** Description of the plots
**** Time
The time plots display the time spent in each part of the execution.
They are useful to diagnose what takes the most time in a run.
**** Parallel efficiency
The parallel efficiency plots display how much faster an algorithm is
compared to its one-node version.
**** Normalized time
The normalized time plot shows the speedup compared to a one-node
algorithm, namely the StarPU algorithm without any MPI communication
in it.
**** Efficiency
Not sure yet
**** Speedup
The speedup plot shows how much faster an algorithm is compared to a reference
algorithm.
The explicit algorithm was used as the reference. It was chosen instead
of the StarPU algorithm because the comparison was done for each number
of nodes, and the StarPU algorithm (without any MPI communication) only
runs on one node.
*** Measurement
<<sec:measurement>>
To compute the execution time, and to make sure each algorithm measures it
the same way, we do the following:
#+begin_src C
mpiComm.global().barrier();
FTic timer;
starpu_fxt_start_profiling();
groupalgo.execute();
mpiComm.global().barrier();
starpu_fxt_stop_profiling();
timer.tac();
#+end_src
A barrier is placed before starting the measurement and another one at the end,
before stopping it.
What is measured corresponds to the time of one iteration of the
algorithm, without the time of object creation or kernel pre-computation.
There is one small exception for the StarPU algorithm
(the StarPU version without MPI): because this algorithm always runs
on one node, there is no need to add MPI barriers to correctly measure its
execution time.
*** One percent checking
The one percent check is performed during the post-processing. It aims to
assess the correctness of our time measurement.
It is computed by the following Python script, where
/config.num_nodes/ is the number of MPI nodes and /config.num_threads/ the
number of threads per MPI node.
#+begin_src python
sum_time = (runtime_time + task_time + scheduling_time + communication_time +
idle_time)/(config.num_nodes*config.num_threads)
diff_time = float('%.2f'%(abs(global_time-sum_time)/global_time))
if diff_time > 0.01:
print('/!\\Timing Error of ' + str(diff_time))
#+end_src
Note that /runtime_time/, /task_time/, /scheduling_time/ and
/communication_time/ are computed from the execution traces, while
/global_time/ is measured at run time as explained in section
[[sec:measurement]].
*** Scripts and jobs
<<sec:result>>
The job script for StarPU on a single node:
#+include: "~/scalfmm/jobs/starpu_chebyshev.sh" src sh
The job script for StarPU with implicit MPI on 10 nodes:
#+include: "~/scalfmm/jobs/implicit_chebyshev.sh" src sh
The results are stored in a single directory, ~/scalfmm/jobs_results, on
plafrim. They need to be downloaded and aggregated.
This work is done by the two following scripts. All results are aggregated
into a single CSV file, which is used by R scripts to generate the plots.
#+include: "~/suricate.sh" src sh
#+include: "~/scalfmm/Utils/benchmark/loutre.py" src python
*** Display
**** General
#+CAPTION: Cube speedup ([[./output/cube-speedup.pdf][pdf]]).
[[./output/cube-speedup.png]]
#+CAPTION: Normalized time on cube ([[./output/cube-normalized-time.pdf][pdf]]).
[[./output/cube-normalized-time.png]]
#+CAPTION: Parallel efficiency on cube ([[./output/cube-parallel-efficiency.pdf][pdf]]).
[[./output/cube-parallel-efficiency.png]]
#+CAPTION: Communication volume ([[./output/cube-comm.pdf][pdf]]).
[[./output/cube-comm.png]]
**** Individual time
#+CAPTION: Implicit time on cube ([[./output/cube-implicit-times.pdf][pdf]]).
[[./output/cube-implicit-times.png]]
#+CAPTION: Explicit time on cube ([[./output/cube-explicit-times.pdf][pdf]]).
[[./output/cube-explicit-times.png]]
#+CAPTION: StarPU single node time on cube ([[./output/cube-starpu-times.pdf][pdf]]).
[[./output/cube-starpu-times.png]]
**** Individual Efficiencies
#+CAPTION: Implicit efficiencies ([[./output/cube-implicit-efficiencies.pdf][pdf]]).
[[./output/cube-implicit-efficiencies.png]]
#+CAPTION: Explicit efficiencies ([[./output/cube-explicit-efficiencies.pdf][pdf]]).
[[./output/cube-explicit-efficiencies.png]]
**** Gantt
#+CAPTION: Implicit gantt 10 nodes ([[./output/cube-implicit-10N-50M-gantt.pdf][pdf]]).
[[./output/cube-implicit-10N-50M-gantt.png]]
#+CAPTION: Explicit gantt 10 nodes ([[./output/cube-explicit-10N-50M-gantt.pdf][pdf]]).
[[./output/cube-explicit-10N-50M-gantt.png]]
#+CAPTION: Vite view.
[[./figure/narva_full.png]]
#+CAPTION: Implicit gantt 4 nodes ([[./output/cube-implicit-4N-50M-gantt.pdf][pdf]]).
[[./output/cube-implicit-4N-50M-gantt.png]]
#+CAPTION: Explicit gantt 4 nodes ([[./output/cube-explicit-4N-50M-gantt.pdf][pdf]]).
[[./output/cube-explicit-4N-50M-gantt.png]]
* Journal
** Very naive implicit MPI implementation
The main goal of this first version was to discover and get familiar with the StarPU MPI functions.
The first ones were starpu_mpi_init and starpu_mpi_shutdown, quickly followed by the functions used to /tag/ the StarPU /handles/ and attach them to MPI nodes.
On top of that, every call to starpu_insert_task or starpu_task_submit was turned into a call to starpu_mpi_insert_task.
For simplicity, each MPI node holds the entire tree, even though this is not a viable solution in the long run.
To check that everything worked correctly, I mapped all the data on a first MPI node and all the tasks on a second one.
I could then verify that the tree of the first node held the correct results and that the second node had nothing but errors.
** Less naive implementation
With the idea of building a barely decent version 0 able to compute with more than two MPI nodes, I wrote a data /mapping/ function.
It consisted in sharing each level between all processes as evenly as possible.
#+CAPTION: Split of each level between the processes. Tree groups of size 4.
[[./figure/naive_split.png]]
** Reproducing the explicit MPI mapping
To be able to make comparisons, it was necessary to reproduce the same task /mapping/ as the explicit MPI version.
In the implicit version as currently implemented, the data /mapping/ drives the task /mapping/.
The simplest way to proceed is to make sure that the particles end up on the same MPI nodes.
*** First problem with the groups
Since the distribution of the particles over the MPI nodes is decided by a distributed sort, it was wiser to save the particle /mapping/ to a file, then load it (in the implicit version) and use it to drive the /mapping/ of the implicit version.
The issue with the distributed sort is that it tries to balance the particles over the nodes without taking the groups of the group tree into account.
#+CAPTION: Problem arising from the way the groups are built.
#+NAME: fig:SED-HR4049
[[./figure/group_issue1.png]]
However, the data /mapping/ is done at the granularity of the groups of the group tree.
A first solution would be to slightly modify the tree algorithm to force it to build somewhat smaller groups, so that they match the groups of the explicit MPI version.
The problem is, when going back up the tree, what should be done with the cells that are present on several MPI nodes, and what about the root?
*** Chosen solution
Rather than trying to reproduce a data /mapping/ identical to that of the explicit version for any particle set, we chose to limit the number of reproducible cases and to segment this /mapping/ by level.
Thus, with a perfect tree where every Morton index holds the same number of particles, it is possible to reproduce the same data /mapping/ on a certain number of levels.
This number depends on the group size of the group tree.
#+CAPTION: Method used to generate a particle at a given Morton index.
#+NAME: fig:SED-HR4049
[[./figure/morton_box_center.png]]
*** Solution provided later on
After discussing with Bérenger, it turned out that reproducing the parallel sort was not that difficult, and that is what I worked on during the following days.
A constructor was therefore added to the blocked tree to describe the size of each block at each level.
This requires a pre-computation performed by an intermediate function.
It amounts to:
- Sorting the particles by their Morton index.
- Counting the number of distinct leaves.
- Distributing the leaves using the FLeafBalance object.
- Creating the groups of the upper levels based on the working interval given by algorithm 13 of Bérenger's thesis.
*** Validating the results
To validate these results, I reused the task naming system made for SimGrid. This way I could write, to a file, information about each task:
the Morton indices as well as the MPI nodes on which the tasks execute.
**** Observation
Leaving aside the levels where we know task errors will be found, we have:
- More tasks in the explicit version, because it has symmetric tasks (P2P, M2L).
- All the tasks produced by the implicit algorithm are found in the set of explicit tasks.
- All the tasks are mapped on at least the same MPI node, symmetric tasks sometimes being mapped on two different nodes.
*** Performance comparison
To compare the performance of the explicit and implicit approaches, it was decided to add a new test judiciously named testBlockedImplicitChebyshev. The comparisons are thus based on a real computation kernel and not on an unrealistic test kernel. The results are presented in section [[sec:result]]. Note that the reported times only cover the creation of the kernels and of the object implementing the FMM algorithm, together with its execution; they do not include the time spent building the tree, storing the particles or reading them from a file, and checking the results.
The elapsed time is measured as follows:
#+BEGIN_SRC C
//Start the timer
FTic timerExecute;
//Creating some useful object
const MatrixKernelClass MatrixKernel;
GroupKernelClass groupkernel(NbLevels, loader.getBoxWidth(), loader.getCenterOfBox(), &MatrixKernel);
GroupAlgorithm groupalgo(&groupedTree,&groupkernel, distributedMortonIndex);
//Executing the algorithm
groupalgo.execute();
//Stop the timer
double elapsedTime = timerExecute.tacAndElapsed();
//Sum the result
timeAverage(mpi_rank, nproc, elapsedTime);
#+END_SRC
The results show two things:
- The implicit algorithm balances the computations poorly.
- A curious situation: with the test kernel, the implicit version is 10x faster, while with the Chebyshev kernel it is 5x slower.
After a short investigation, this curious situation turned out not to be due to a bad distribution of the particles, since that distribution is the same.
It turned out that for the P2P computation, all the nodes were waiting for one another in cascade, but this waiting time can be reduced by defining the following constant: STARPU_USE_REDUX.
This enables the reduction on the P2P and gives us the same performance as the explicit algorithm.
*** Errors encountered
A /bug/ appeared in the explicit MPI version, where segfaults occur if the tree does not have at least one particle at every Morton index.
This error does not yet hinder the progress of the internship, because in practice there are enough particles to fill the tree.
** Other /mappings/
*** Treematch
Following the presentation by Emmanuel Jeannot of the TreeMatch tool, which can suggest interesting /mappings/, it would be good to take advantage of this tool.
A file structure to communicate with TreeMatch could be the following:
#+BEGIN_EXAMPLE
123 7
124:6000
45:3000
23:400
#+END_EXAMPLE
This (simplistic) structure would read as follows:
- Group 123 is located on node 7.
- Group 123 exchanges 6000 bytes with group 124.
- Group 123 exchanges 3000 bytes with group 45.
- Group 123 exchanges 400 bytes with group 23.
The groups correspond to the StarPU /handles/, which themselves correspond to the groups of the blocked tree.