Commit 18612f99 authored by Martin Khannouz's avatar Martin Khannouz Committed by Berenger Bramas

Delete unused files

parent adaf30ff
#+TITLE: Usage implicit de MPI dans l'algorithme de ScalFmm usant de StarPU
#+AUTHOR: Martin Khannouz
#+STARTUP: inlineimages
#+OPTIONS: H:3 num:t toc:t \n:nil @:t ::t |:t ^:nil -:t f:t *:t <:t
#+OPTIONS: TeX:t LaTeX:t skip:nil d:nil todo:nil pri:nil tags:not-in-toc
#+TAGS: noexport(n)
* Implicit MPI in Scalfmm using StarPU
** Installation
*** Requirement
sudo pacman -S openmpi fftw blas lapack
*** StarPU
#+BEGIN_SRC sh :exports code :eval never
svn checkout svn:// StarPU
cd StarPU
mkdir install
./configure --prefix=$HOME/StarPU/install
make -j `nproc`
export STARPU_DIR=$HOME/StarPU/install
*** ScalFmm
#+BEGIN_SRC sh :exports code :eval never
git clone git+ssh://
cd scalfmm/Build
ccmake ..
Then generate the Makefile by pressing 'c' the 'g'.
To check if everything seems good execute the following command.
#+BEGIN_SRC sh :exports code :eval never
Some information about the group tree should be printed and a core dump if something went wrong.
** First Try
Firstly, copy original files.
#+BEGIN_SRC sh :exports code :eval never
cp Tests/GroupTree/testBlockedAlgorithm.cpp Tests/GroupTree/testBlockedImplicitAlgorithm.cpp
cp Src/GroupTree/Core/FGroupTaskStarpuAlgorithm.hpp Src/GroupTree/Core/FGroupTaskStarpuImplicitAlgorithm.hpp
Edit testBlockedImplicitAlgorithm and replace FGroupTaskStarPUAlgorithm by FGroupTaskStarPUImplicitAlgorithm and change an include to include FGroupTaskStarpuImplicitAlgorithm instead of FGroupTaskStarpuAlgorithm.
Edit FGroupTaskStarpuImplicitAlgorithm and replace FGroupTaskStarPUAlgorithm by FGroupTaskStarPUImplicitAlgorithm and add, somewhere :
#include <starpu_mpi.h>
As MPI need to be initialized a call to starpu_mpi_init has to be made :
FAssertLF(starpu_mpi_init ( 0, 0, 1 ) == 0);
The first and second 0 are there because we don't need to give argument to MPI.
The 1 is to tell starpu to call MPI_Init.
As every malloc need a free, a starpu_mpi_init need starpu_mpi_shutdown which is written in the destructor, before the call of starpu_shutdown.
To check if everything goes right, the following code had added right after starpu_mpi_init and it should print the number of node and the rank of the current node.
int rank, nbProc;
cout << rank << "/" << nbProc << endl;
To run (on my computer) multiple node, I use mpirun as follow.
#+BEGIN_SRC sh :exports code :eval never
mpirun -n 2 Tests/Release/testBlockedImplicitAlgorithm
Somewhere, the following two lines should appear in the output of mpirun.
** Let's MPIse a little bit
*** Simple start
Firstly, every starpu_insert_task (or starpu_task_insert) must be replaced by starpu_mpi_insert_task.
So a previous call like
starpu_insert_task(&some_codelet ...
will end as
starpu_mpi_insert_task(MPI_COMM_WORLD, &some_codelet ...
After a quick execution, it turn out that starpu_mpi doesn't like unregistered data, so in the function buildHandles we must add some starpu_mpi_data_register.
For each handle registered in starpu, I added
starpu_mpi_data_register(corresponding_handle, tag++, 0);
with tag initialized to 0.
The tag just has to be unique and all handle are register in the same function so I used it this way.
I tried to use starpu_mpi_data_set_rank, but for some dark reasons, it couldn't register more than one handle (even if they were different).
So from here we have all data registered on the node 0 and all tasks known by starpu_mpi.
If I run the code, the node 1 has multiple errors because his Octree doesn't contains values it should.
This is quite understandable because according to StarPU documentation (and according to Marc) each task is executed where their data are.
And because all the data are registered on node 0, no wonder node 1 doesn't have any valid data.
Because there is some error in syncData function, I disable it for the node 1. But in the future each node will have to acquire and release handle of its node only.
If we don't want any error on node 1 either, we should register data handles with the mpi rank instead of 0. So each node will execute the whole algorithm without exchanging data with the others and will have an consistant octree.
*** Complicated continue
Let's see if starpu_mpi can handle many node working together.
Thanks to the macro STARPU_EXECUTE_ON_NODE, a task can be attached to a specific node.
So if each task inserted is executed on node 1 and all data are registered on node 0, StarPU will have to exhange data somehow.
So I added
for each starpu_mpi_insert_task.
When I runned it, everything seems right but, to ensure that node 1 execute everything, I prevented node 1 from creating the DAG (Directed Acyclic Graph).
And it turn out that node 0 was blocked in starpu_task_wait_for_all at the end of executeCore. Which mean that all tasks were really executed on node 1 so starpu seems good at exhanging data.
** Distributed FMM with StarPU
*** First Data Mapping
To correctly distribute data among nodes, the algorithm need a function that return the mpi node were a block of the grouped tree will be registered, given it location in the tree.
The following methode was used to distribute data.
int dataMapping(int idxGroup, int const nbGroup) const {
int bonusGroup = nbGroup%nproc; //Number of node in the bonus group
int groupPerNode = (int)floor((double)nbGroup / (double)nproc);
int mpi_node;
if(idxGroup < bonusGroup*(groupPerNode+1)) { //GroupCell in the bonus group
mpi_node = (idxGroup - (idxGroup%(groupPerNode+1)))/(groupPerNode+1);
else { //GroupCell outside the bonus group
idxGroup -= bonusGroup*(groupPerNode+1); //Remove the bonus group to compute easily
mpi_node = (idxGroup - (idxGroup%groupPerNode))/groupPerNode; //Find the corresponding node as if their was node in the bonus group
mpi_node += bonusGroup; //Shift to find the real node
return mpi_node;
The choice was to split each level between all nodes. So each node has consecutive group of cells. That's why the only parameters needed wera the size of the level and the index in the level.
The bonus group is the first nodes that have one more element to compute because the level size isn't a multiple of the number of node.
For simpler code a new public methode has been added to check if a block of the grouped tree is owned.
bool isDataOwned(int idxGroup, int nbGroup) const {
return dataMapping(idxGroup, nbGroup) == mpi_rank;
This function is used, when unregistering data so only the data owned by the node are unregitered.
void cleanHandle(){
for(int idxLevel = 0 ; idxLevel < tree->getHeight() ; ++idxLevel){
for(int idxHandle = 0 ; idxHandle < int(cellHandles[idxLevel].size()) ; ++idxHandle){
if(isDataOwned(idxHandle, int(cellHandles[idxLevel].size())))//Clean only our data handle
for(int idxHandle = 0 ; idxHandle < int(particleHandles.size()) ; ++idxHandle){
if(isDataOwned(idxHandle, int(particleHandles.size())))
Now let's look at the function that build handles.
void buildHandles(){
int where, tag = 0;
for(int idxLevel = 2 ; idxLevel < tree->getHeight() ; ++idxLevel){
for(int idxGroup = 0 ; idxGroup < tree->getNbCellGroupAtLevel(idxLevel) ; ++idxGroup){
int registeringNode = dataMapping(idxGroup, tree->getNbCellGroupAtLevel(idxLevel));
where = (registeringNode == mpi_rank) ? STARPU_MAIN_RAM : -1;
const CellContainerClass* currentCells = tree->getCellGroup(idxLevel, idxGroup);
starpu_variable_data_register(&cellHandles[idxLevel][idxGroup].symb, where,
(uintptr_t)currentCells->getRawBuffer(), currentCells->getBufferSizeInByte());
starpu_variable_data_register(&cellHandles[idxLevel][idxGroup].up, where,
(uintptr_t)currentCells->getRawMultipoleBuffer(), currentCells->getMultipoleBufferSizeInByte());
starpu_variable_data_register(&cellHandles[idxLevel][idxGroup].down, where,
(uintptr_t)currentCells->getRawLocalBuffer(), currentCells->getLocalBufferSizeInByte());
starpu_mpi_data_register(cellHandles[idxLevel][idxGroup].symb, tag++, registeringNode);
starpu_mpi_data_register(cellHandles[idxLevel][idxGroup].up, tag++, registeringNode);
starpu_mpi_data_register(cellHandles[idxLevel][idxGroup].down, tag++, registeringNode);
cellHandles[idxLevel][idxGroup].intervalSize = int(currentCells->getNumberOfCellsInBlock());
starpu_data_assign_arbiter(cellHandles[idxLevel][idxGroup].up, arbiterGlobal);
starpu_data_assign_arbiter(cellHandles[idxLevel][idxGroup].down, arbiterGlobal);
for(int idxGroup = 0 ; idxGroup < tree->getNbParticleGroup() ; ++idxGroup){
int registeringNode = dataMapping(idxGroup, tree->getNbParticleGroup());
where = (registeringNode == mpi_rank) ? STARPU_MAIN_RAM : -1;
ParticleGroupClass* containers = tree->getParticleGroup(idxGroup);
starpu_variable_data_register(&particleHandles[idxGroup].symb, where,
(uintptr_t)containers->getRawBuffer(), containers->getBufferSizeInByte());
starpu_variable_data_register(&particleHandles[idxGroup].down, where,
(uintptr_t)containers->getRawAttributesBuffer(), containers->getAttributesBufferSizeInByte());
starpu_mpi_data_register(particleHandles[idxGroup].symb, tag++, registeringNode);
starpu_mpi_data_register(particleHandles[idxGroup].down, tag++, registeringNode);
starpu_data_set_reduction_methods(particleHandles[idxGroup].down, &p2p_redux_perform,
starpu_data_assign_arbiter(particleHandles[idxGroup].down, arbiterGlobal);
particleHandles[idxGroup].intervalSize = int(containers->getNumberOfLeavesInBlock());
There is at least 2 new stuff. The variable "where" which tell if a handle is in the node memory or if it's somewhere else. And the variable "registeringNode" which tell on which node the block should be registered.
*** Checking result
Now that I have a beautiful mapping function, how on earth am I suppose to check if the result is correct ?
To answer this question, I need to clarify a little bit on how StarPU MPI work (from what I understood of the documentation).
When a node register a data, every time a task will write this data somewhere else, the fresh data should be write back to the node that registered the data.
So each node should have an up to date value in its own cells.
To check the result I used the following algorithm.
- Load the particle and create two tree (a classic tree and a grouped one)
- Execute both distributed and classic algorithm
- For each cell and leaf that I own in the grouped tree
* get the corresponding cell in the classic tree
* compare the value
* Call checking function created by Bérenger
For those who prefer the C code, there it is.
for(int idxLevel = 2 ; idxLevel < groupedTree.getHeight() ; ++idxLevel){
for(int idxGroup = 0 ; idxGroup < groupedTree.getNbCellGroupAtLevel(idxLevel) ; ++idxGroup){
if(groupalgo.isDataOwned(idxGroup, groupedTree.getNbCellGroupAtLevel(idxLevel))){
GroupOctreeClass::CellGroupClass* currentCells = groupedTree.getCellGroup(idxLevel, idxGroup);
currentCells->forEachCell([&](GroupCellClass gcell){
const CellClass* cell = tree.getCell(gcell.getMortonIndex(), idxLevel);
if(cell == nullptr){
std::cout << "[Empty node(" << rank << ")] Error cell should not exist " << gcell.getMortonIndex() << "\n";
else {
if(gcell.getDataUp() != cell->getDataUp()){
std::cout << "[Up node(" << rank << ")] Up is different at index " << gcell.getMortonIndex() << " level " << idxLevel << " is " << gcell.getDataUp() << " should be " << cell->getDataUp() << "\n";
if(gcell.getDataDown() != cell->getDataDown()){
std::cout << "[Down node(" << rank << ")] Down is different at index " << gcell.getMortonIndex() << " level " << idxLevel << " is " << gcell.getDataDown() << " should be " << cell->getDataDown() << "\n";
int idxLevel = groupedTree.getHeight()-1;
for(int idxGroup = 0 ; idxGroup < groupedTree.getNbCellGroupAtLevel(idxLevel) ; ++idxGroup){
if(groupalgo.isDataOwned(idxGroup, groupedTree.getNbCellGroupAtLevel(idxLevel))){
GroupOctreeClass::ParticleGroupClass* particleGroup = groupedTree.getParticleGroup(idxGroup);
GroupOctreeClass::CellGroupClass* cellGroup = groupedTree.getCellGroup(idxLevel, idxGroup);
cellGroup->forEachCell([&](GroupCellClass cell){
MortonIndex midx = cell.getMortonIndex();
const int leafIdx = particleGroup->getLeafIndex(midx);
GroupContainerClass leaf = particleGroup->template getLeaf<GroupContainerClass>(leafIdx);
const FSize nbPartsInLeaf = leaf.getNbParticles();
if(cell.getDataUp() != nbPartsInLeaf){
std::cout << "[P2M node(" << rank << ")] Error a Cell has " << cell.getDataUp() << " (it should be " << nbPartsInLeaf << ")\n";
const long long int* dataDown = leaf.getDataDown();
for(FSize idxPart = 0 ; idxPart < nbPartsInLeaf ; ++idxPart){
if(dataDown[idxPart] != loader.getNumberOfParticles()-1){
std::cout << "[Full node(" << rank << ")] Error a particle has " << dataDown[idxPart] << " (it should be " << (loader.getNumberOfParticles()-1) << ") at index " << cell.getMortonIndex() << "\n";
*** Bérenger's Mapping
The next goal is to try matching the data mapping used by Bérenger in the explicit algorithm of MPI.
Knowing that the same data mapping should imply the same task mapping.
**** Same leaf mapping
Because the explicit MPI algorithm first sort particles before knowing which node works on which data, I choose to dump the data related to an execution.
Which mean, the particle data and the morton index done by each mpi node.
Then, in the implicit version, the sofware will read those file to reproduce the same data mapping.
**** Same cell mapping
Because reproducing the same data mapping at leaf level, doesn't mean all the cells in other levels will be mapped on the exact same node, I had to reproduce the algorithme to split cells among nodes.
This algorithm is described in the section 4.3.1 in Bérenger Thesis.
But this algorithm isn't aware of the groups of the grouped tree and because the implicit mpi code split data only by groups of the grouped tree I had to made sligth changes.
Keep in mind that mpi data transfert are managed by starpu mpi and use starpu_data_handle. And in the implicite algorithm, data handles correspond to a group of the grouped tree.
So I choose that if the first cell of the group is mapped on the node i, the whole group is mapped on the node i.
That's when an other problem come out. Because the mpi quicksort algorithm make sure that all the particles of a leaf are on the same node, but it does not make sure that all cells of a group are mapped on the same mpi node.
And if it does, what about the next level of the tree ? Does the distributed quicksort make sure that group of node at this level are complete and on the same node ? Firstly, it isn't the case and secondly it woud have been too hard.
So, is it possible to reproduce the exact same data mapping as Berenger does in the explicit version ?
Well, not the exact same mapping. But it is still possible to make the same mapping at a specific level. And because the first levels (ones close to the root) imply far less work than the leaf level, it is less significant if we can't match the same mapping at these levels.
For example, if there is the same number of particle per leaf and each morton index exists (so it's a perfect tree), it is possible to tell if a specific level has the same mapping as Berenger's mapping.
Why ?
If all morton index are used, each mpi node will work always on the same part of the tree which is easily predictible and with a perfect tree it is easier to make sure groups of the grouped tree are created the same way as in the explicit mpi code.
To check if a mapping error could appear at a level, there is one simple rule : the number of group at the level i, must be divisible by the number of mpi node.
There is no issue with 'not completly filled group' because, the tree is perfect.
These are numbers I tried and which seems to work pretty well.
8 mpi node. (2^3)
A size of a group in the grouped tree of 8. (because it was an octree, it looked like it was a good idea)
((2^(tree height-1))/number of mpi node) particle to generate for each node.
A tree higher than 2 levels. Usually, it was 5.
Now remain only one problem, we need to generate a particle for each Morton index.
It seemed difficult in the first place, but after discussing it with Quentin, it looked pretty easy, so here is the code.
void fillParticleAtMortonIndex(FPoint<FReal>*const inParticlePositions, MortonIndex idx, unsigned int treeHeight){
MortonIndex mask = 0x1LL;
//Box with at the leaf level
FReal leafWidth = boxWidth / FReal(1<<(treeHeight-1));
//offset from the previous center. Which is half the box width
FReal currentOffset = leafWidth / 2.0;
//Start from the center of the box
FReal x, y, z;
x = centerOfBox.getX();
y = centerOfBox.getY();
z = centerOfBox.getZ();
for(unsigned int i = 0; i < treeHeight-1; ++i)
bool x_offset, y_offset, z_offset;
z_offset = (idx & mask);
idx >>= 1;
y_offset = (idx & mask);
idx >>= 1;
x_offset = (idx & mask);
idx >>= 1;
x += currentOffset;
x -= currentOffset;
y += currentOffset;
y -= currentOffset;
z += currentOffset;
z -= currentOffset;
//Increase the offset as we go down the tree. So the box is getting larger
currentOffset *= 2;
inParticlePositions->setPosition( x, y, z);
Let's have a quick look at the function. We want to insert a particle in specific morton index.
I use a two dimensional matrix because it's easier for the drawing.
Let's say, the morton index is 7. Which give 0111 in binary.
To find the center of the 7th box, we are going to find the center of its father.
The first bit is 0, so, it substract one fourth of the box width to the x axis.
The second bit is 1, it add one fourth of the box width to the y axis.
So now we have the center of the sub box were the 7th box is. So, we keep going.
Because the third and the fourth bits are 1, we'll respectivly add one eighth of the box width to the x and y axis.
#+NAME: fig:MortonForNoobs
As you can see on the picture it is possible to do start either by the least or the most significant bit and in the function fillParticleAtMortonIndex, it start by the least significant.
**** Task mapping
//TODO check that blody DAG !
#!/usr/bin/env bash
## name of job
#SBATCH -J Implicit_MPI_time
#SBATCH -p special
## Resources: (nodes, procs, tasks, walltime, ... etc)
# # standard output message
#SBATCH -o batch%j.out
# # output error message
#SBATCH -e batch%j.err
module purge
module load slurm
module add compiler/gcc/5.3.0 tools/module_cat/1.0.0 intel/mkl/64/11.2/2016.0.0
. /home/mkhannou/spack/share/spack/
spack load fftw
spack load hwloc
spack load openmpi
spack load starpu@svn-trunk
## modules to load for the job
export GROUP_SIZE=500
export TREE_HEIGHT=5
export NB_PARTICLE_PER_NODE=100000
echo "=====my job informations ===="
echo "Node List: " $SLURM_NODELIST
echo "my jobID: " $SLURM_JOB_ID
echo "Nb node: " $NB_NODE
echo "Particle per node: " $NB_PARTICLE_PER_NODE
echo "In the directory: `pwd`"
rm -f canard.fma > /dev/null 2> /dev/null
mpiexec -n $NB_NODE ./Build/Tests/Release/testBlockedMpiAlgorithm -nb $NB_PARTICLE_PER_NODE -bs $GROUP_SIZE -h $TREE_HEIGHT > loutre
cat loutre | grep Executing
cat loutre | grep Average
sleep 10
mpiexec -n $NB_NODE ./Build/Tests/Release/testBlockedImplicitAlgorithm -f canard.fma -bs $GROUP_SIZE -h $TREE_HEIGHT > loutre
cat loutre | grep Executing
cat loutre | grep Average
rm -f canard.fma > /dev/null 2> /dev/null
sleep 10
mpiexec -n $NB_NODE ./Build/Tests/Release/testBlockedMpiChebyshev -nb $NB_PARTICLE_PER_NODE -bs $GROUP_SIZE -h $TREE_HEIGHT > loutre
cat loutre | grep Executing
cat loutre | grep Average
sleep 10
mpiexec -n $NB_NODE ./Build/Tests/Release/testBlockedImplicitChebyshev -f canard.fma -bs $GROUP_SIZE -h $TREE_HEIGHT > loutre
cat loutre | grep Executing
cat loutre | grep Average
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment