Edit testBlockedImplicitAlgorithm: replace FGroupTaskStarPUAlgorithm with FGroupTaskStarPUImplicitAlgorithm and change the include to FGroupTaskStarpuImplicitAlgorithm instead of FGroupTaskStarpuAlgorithm.
Edit FGroupTaskStarpuImplicitAlgorithm: replace FGroupTaskStarPUAlgorithm with FGroupTaskStarPUImplicitAlgorithm and add, somewhere:
#+BEGIN_SRC C
#include <starpu_mpi.h>
#+END_SRC
As MPI needs to be initialized, a call to starpu_mpi_init has to be made:
#+BEGIN_SRC C
FAssertLF(starpu_mpi_init ( 0, 0, 1 ) == 0);
#+END_SRC
The first and second 0 are there because we don't need to give any arguments to MPI.
The 1 tells StarPU to call MPI_Init itself.
As every malloc needs a free, every starpu_mpi_init needs a starpu_mpi_shutdown, which is called in the destructor, before the call to starpu_shutdown.
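Roughly, the teardown order looks like this (a simplified sketch, not the actual destructor body):
#+BEGIN_SRC C
// Simplified sketch of the teardown order only.
starpu_mpi_shutdown();   // shut down StarPU-MPI first...
starpu_shutdown();       // ...then StarPU itself
#+END_SRC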
To check that everything goes right, the following code was added right after starpu_mpi_init; it should print the number of nodes and the rank of the current node.
#+BEGIN_SRC C
int rank, nbProc;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nbProc);
std::cout << rank << "/" << nbProc << std::endl;
#+END_SRC
To run the code on multiple nodes (on my computer), I launch it with mpirun.
After a quick run, it turns out that starpu_mpi doesn't like unregistered data, so in the buildHandles function we must add some calls to starpu_mpi_data_register.
The tag just has to be unique, and since all the handles are registered in the same function, I used it this way.
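As a rough sketch of that registration (the handle array, loop bound and tag counter names here are placeholders, not the real members):
#+BEGIN_SRC C
// Sketch only: give every block handle a unique tag and, for now, node 0 as owner.
int mpi_tag = 0;
for(int idxGroup = 0 ; idxGroup < nbGroup ; ++idxGroup){
    starpu_mpi_data_register(cellHandles[idxGroup], mpi_tag++, 0);
}
#+END_SRC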
I tried to use starpu_mpi_data_set_rank, but for some dark reason it couldn't register more than one handle (even if they were different).
So from here, all the data are registered on node 0 and all the tasks are known by starpu_mpi.
If I run the code, node 1 reports multiple errors because its octree doesn't contain the values it should.
This is quite understandable because, according to the StarPU documentation (and according to Marc), each task is executed where its data is.
And because all the data are registered on node 0, no wonder node 1 doesn't have any valid data.
Because there are some errors in the syncData function, I disabled it for node 1. But in the future, each node will have to acquire and release only the handles it owns.
If we don't want any errors on node 1 either, we should register the data handles with the MPI rank instead of 0. That way each node will execute the whole algorithm without exchanging data with the others and will have a consistent octree.
*** A more complicated follow-up
Let's see if starpu_mpi can handle several nodes working together.
Thanks to the macro STARPU_EXECUTE_ON_NODE, a task can be attached to a specific node.
So if every inserted task is executed on node 1 while all the data are registered on node 0, StarPU will have to exchange data somehow.
So I added
#+BEGIN_SRC C
STARPU_EXECUTE_ON_NODE, 1
#+END_SRC
for each starpu_mpi_insert_task.
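In context, an insertion might look like the following sketch (the codelet and the handles are placeholders, not an actual kernel of the algorithm):
#+BEGIN_SRC C
starpu_mpi_insert_task(MPI_COMM_WORLD, &some_cl,
        STARPU_VALUE, &idxGroup, sizeof(idxGroup),
        STARPU_RW, cellHandles[idxGroup],
        STARPU_R,  particleHandles[idxGroup],
        STARPU_EXECUTE_ON_NODE, 1,   // force execution on MPI node 1
        0);
#+END_SRC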
When I ran it, everything seemed right, but to ensure that node 1 executed everything, I prevented node 1 from creating the DAG (Directed Acyclic Graph).
It turned out that node 0 then blocked in starpu_task_wait_for_all at the end of executeCore, which means the tasks really were executed on node 1 (and not silently run by node 0), so StarPU seems good at exchanging data.
** Distributed FMM with StarPU
*** First Data Mapping
To distribute data correctly among the nodes, the algorithm needs a function that returns the MPI node where a block of the grouped tree will be registered, given its location in the tree.
The following method was used to distribute the data.
#+BEGIN_SRC C
int dataMapping(int idxGroup, int const nbGroup) const {
    int bonusGroup = nbGroup%nproc; // Number of nodes in the bonus group
    int groupPerNode = (int)floor((double)nbGroup / (double)nproc);
    int mpi_node;
    if(idxGroup < bonusGroup*(groupPerNode+1)) { // GroupCell in the bonus group
        mpi_node = idxGroup / (groupPerNode+1); // Nodes of the bonus group own one extra group
    }
    else {
        idxGroup -= bonusGroup*(groupPerNode+1); // Remove the bonus groups to compute easily
        mpi_node = (idxGroup - (idxGroup%groupPerNode))/groupPerNode; // Find the corresponding node as if there were no bonus group
        mpi_node += bonusGroup; // Shift to find the real node
    }
    return mpi_node;
}
#+END_SRC
The choice was to split each level between all the nodes, so each node gets consecutive groups of cells. That's why the only parameters needed were the size of the level and the index of the group in the level.
The bonus group is made of the first nodes, which handle one group more than the others because the level size isn't a multiple of the number of nodes.
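As a quick sanity check, here is a hedged standalone version of the same arithmetic (nproc is passed explicitly instead of being a class member, and floor is replaced by integer division):
#+BEGIN_SRC C
#include <stdio.h>

/* Standalone copy of the mapping logic, for checking the distribution by hand. */
static int dataMapping(int idxGroup, int nbGroup, int nproc){
    int bonusGroup = nbGroup % nproc;      /* number of nodes that get one extra group */
    int groupPerNode = nbGroup / nproc;
    if(idxGroup < bonusGroup*(groupPerNode+1))
        return idxGroup / (groupPerNode+1);
    idxGroup -= bonusGroup*(groupPerNode+1);
    return bonusGroup + idxGroup/groupPerNode;
}

int main(void){
    /* 10 groups over 3 nodes: node 0 owns groups 0-3, node 1 owns 4-6, node 2 owns 7-9. */
    for(int idx = 0; idx < 10; ++idx)
        printf("group %d -> node %d\n", idx, dataMapping(idx, 10, 3));
    return 0;
}
#+END_SRC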
To keep the code simpler, a new public method has been added to check whether a block of the grouped tree is owned by the current node.
#+BEGIN_SRC C
bool isDataOwned(int idxGroup, int nbGroup) const {
    // mpi_rank: rank of the current node (member variable; the real name may differ)
    return dataMapping(idxGroup, nbGroup) == mpi_rank;
}
#+END_SRC
There are at least two new things: the variable "where", which tells whether a handle is in the node's memory or somewhere else, and the variable "registeringNode", which tells on which node the block should be registered.
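A hedged sketch of how these two variables could fit together when a block is registered (the handle, pointer and size names are placeholders):
#+BEGIN_SRC C
// Sketch only: register a block on its owner, keeping a local buffer only if we own it.
int registeringNode = dataMapping(idxGroup, nbGroup);              // owner of this block
int where = (registeringNode == mpi_rank) ? STARPU_MAIN_RAM : -1;  // -1: no local copy
starpu_variable_data_register(&blockHandle, where,
        (where == STARPU_MAIN_RAM) ? (uintptr_t)blockPtr : 0, blockSize);
starpu_mpi_data_register(blockHandle, mpi_tag++, registeringNode);
#+END_SRC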
*** Checking the result
Now that I have a beautiful mapping function, how on earth am I supposed to check that the result is correct?
To answer this question, I need to clarify a little how StarPU MPI works (from what I understood of the documentation).
When a node registers a piece of data, every time a task writes to that data on another node, the fresh value should be written back to the node that registered it.
So each node should have up-to-date values in its own cells.
To check the result I used the following algorithm.
- Load the particles and create two trees (a classic one and a grouped one)
- Execute both the distributed and the classic algorithm
- For each cell and leaf that I own in the grouped tree, compare its up and down values with the matching cell of the classic tree (excerpt below)
std::cout << "[Empty node(" << rank << ")] Error cell should not exist " << gcell.getMortonIndex() << "\n";
}
else {
if(gcell.getDataUp() != cell->getDataUp()){
std::cout << "[Up node(" << rank << ")] Up is different at index " << gcell.getMortonIndex() << " level " << idxLevel << " is " << gcell.getDataUp() << " should be " << cell->getDataUp() << "\n";
}
if(gcell.getDataDown() != cell->getDataDown()){
std::cout << "[Down node(" << rank << ")] Down is different at index " << gcell.getMortonIndex() << " level " << idxLevel << " is " << gcell.getDataDown() << " should be " << cell->getDataDown() << "\n";
std::cout << "[Full node(" << rank << ")] Error a particle has " << dataDown[idxPart] << " (it should be " << (loader.getNumberOfParticles()-1) << ") at index " << cell.getMortonIndex() << "\n";
}
}
});
}
}
}
#+END_SRC
*** Bérenger's Mapping
The next goal is to try to match the data mapping used by Bérenger in the explicit MPI algorithm, knowing that the same data mapping should imply the same task mapping.
**** Same leaf mapping
Because the explicit MPI algorithm first sorts the particles before knowing which node works on which data, I chose to dump the data related to an execution.
That means the particle data and the Morton indexes handled by each MPI node.
Then, in the implicit version, the software reads those files to reproduce the same data mapping.
**** Same cell mapping
Reproducing the same data mapping at the leaf level doesn't mean all the cells of the other levels will be mapped on the exact same node, so I had to reproduce the algorithm that splits cells among nodes.
This algorithm is described in section 4.3.1 of Bérenger's thesis.
But this algorithm isn't aware of the groups of the grouped tree, and because the implicit MPI code splits data only by groups of the grouped tree, I had to make slight changes.
Keep in mind that MPI data transfers are managed by StarPU MPI and use starpu_data_handle, and that in the implicit algorithm a data handle corresponds to a group of the grouped tree.
So I chose that if the first cell of a group is mapped on node i, the whole group is mapped on node i.
That's when another problem comes out: the MPI quicksort algorithm makes sure that all the particles of a leaf end up on the same node, but it does not make sure that all the cells of a group are mapped on the same MPI node.
And even if it did, what about the next levels of the tree? Does the distributed quicksort make sure that the groups of cells at those levels are complete and on the same node? Firstly, it isn't the case, and secondly it would have been too hard.
So, is it possible to reproduce the exact same data mapping as Bérenger does in the explicit version?
Well, not the exact same mapping. But it is still possible to get the same mapping at a specific level. And because the first levels (the ones close to the root) imply far less work than the leaf level, it matters less if we can't match the mapping at those levels.
For example, if there is the same number of particles per leaf and every Morton index exists (so it's a perfect tree), it is possible to tell whether a specific level has the same mapping as Bérenger's.
Why?
If all Morton indexes are used, each MPI node always works on the same, easily predictable part of the tree, and with a perfect tree it is easier to make sure the groups of the grouped tree are created the same way as in the explicit MPI code.
To check whether a mapping error could appear at a level, there is one simple rule: the number of groups at that level must be divisible by the number of MPI nodes.
For instance, with 8 MPI nodes and groups of 8 cells, any level of a perfect octree holding at least 64 cells contains a multiple of 8 groups.
There is no issue with partially filled groups because the tree is perfect.
These are the numbers I tried, and they seem to work pretty well:
- 8 MPI nodes (2^3).
- A group size of 8 in the grouped tree (because it is an octree, it looked like a good idea).
- ((2^(tree height-1))/number of MPI nodes) particles to generate for each node.
- A tree higher than 2 levels; usually it was 5.
Only one problem remains: we need to generate a particle for each Morton index.
It seemed difficult at first, but after discussing it with Quentin it looked pretty easy, so here is the code.
#+BEGIN_SRC C
void fillParticleAtMortonIndex(FPoint<FReal>*const inParticlePositions, MortonIndex idx, unsigned int treeHeight){
    // leafWidth and centerOfBox come from the enclosing class
    const MortonIndex mask = 0x1; // extract one bit of the Morton index at a time
    // Offset from the parent's center, starting at half the leaf width
    FReal currentOffset = leafWidth / 2.0;
    // Start from the center of the box
    FReal x, y, z;
    x = centerOfBox.getX();
    y = centerOfBox.getY();
    z = centerOfBox.getZ();
    for(unsigned int i = 0; i < treeHeight-1; ++i)
    {
        // Read the three bits of the current level, deepest level first
        bool x_offset, y_offset, z_offset;
        z_offset = (idx & mask);
        idx >>= 1;
        y_offset = (idx & mask);
        idx >>= 1;
        x_offset = (idx & mask);
        idx >>= 1;
        if(x_offset)
            x += currentOffset;
        else
            x -= currentOffset;
        if(y_offset)
            y += currentOffset;
        else
            y -= currentOffset;
        if(z_offset)
            z += currentOffset;
        else
            z -= currentOffset;
        // Double the offset as we move up toward the root, where the boxes are larger
        currentOffset *= 2;
    }
    inParticlePositions->setPosition( x, y, z);
}
#+END_SRC
Let's have a quick look at the function. We want to place a particle at a specific Morton index.
I use a two-dimensional example because it's easier to draw.
Let's say the Morton index is 7, which gives 0111 in binary.
To find the center of the 7th box, we first find the center of its father.
The first bit is 0, so we subtract one fourth of the box width on the x axis.
The second bit is 1, so we add one fourth of the box width on the y axis.
Now we have the center of the sub-box where the 7th box lies, so we keep going.
Because the third and the fourth bits are 1, we respectively add one eighth of the box width on the x and y axes.
#+CAPTION: Finding the center of a box from its Morton index (2D example).
#+NAME: fig:MortonForNoobs
[[./morton_box_center.png]]
As you can see in the picture, it is possible to start either with the least or the most significant bit; the function fillParticleAtMortonIndex starts with the least significant one.
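For completeness, here is a hedged sketch of how the function might be called to get one particle per leaf of a perfect octree (the array name and bounds are illustrative; in practice each node would only generate its own share):
#+BEGIN_SRC C
const unsigned int treeHeight = 5;
const MortonIndex nbLeaves = MortonIndex(1) << (3 * (treeHeight - 1)); // 8^(treeHeight-1) leaves
FPoint<FReal>* positions = new FPoint<FReal>[nbLeaves];
for(MortonIndex idx = 0 ; idx < nbLeaves ; ++idx){
    fillParticleAtMortonIndex(&positions[idx], idx, treeHeight);
}
#+END_SRC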