std::cout<<"[Part] Nb particles is different at index "<<gcell.getMortonIndex()<<" is "<<gleaf->getNbParticles()<<" should be "<<src->getNbParticles()<<"\n";
std::cout<<"[Empty node("<<rank<<")] Error cell should not exist "<<gcell.getMortonIndex()<<"\n";
}
else{
if(gcell.getDataUp()!=cell->getDataUp()){
std::cout<<"[Up node("<<rank<<")] Up is different at index "<<gcell.getMortonIndex()<<" level "<<idxLevel<<" is "<<gcell.getDataUp()<<" should be "<<cell->getDataUp()<<"\n";
}
if(gcell.getDataDown()!=cell->getDataDown()){
std::cout<<"[Down node("<<rank<<")] Down is different at index "<<gcell.getMortonIndex()<<" level "<<idxLevel<<" is "<<gcell.getDataDown()<<" should be "<<cell->getDataDown()<<"\n";
std::cout<<"[Full node("<<rank<<")] Error a particle has "<<dataDown[idxPart]<<" (it should be "<<(loader.getNumberOfParticles()-1)<<") at index "<<cell.getMortonIndex()<<"\n";
Edit testBlockedImplicitAlgorithm: replace FGroupTaskStarPUAlgorithm with FGroupTaskStarPUImplicitAlgorithm and change the include from FGroupTaskStarpuAlgorithm to FGroupTaskStarpuImplicitAlgorithm.
Edit FGroupTaskStarpuImplicitAlgorithm: replace FGroupTaskStarPUAlgorithm with FGroupTaskStarPUImplicitAlgorithm and add, somewhere:
#+BEGIN_SRC C
#include <starpu_mpi.h>
#+END_SRC
As MPI needs to be initialized, a call to starpu_mpi_init has to be made:
#+BEGIN_SRC C
FAssertLF(starpu_mpi_init ( 0, 0, 1 ) == 0);
#+END_SRC
The first and second 0 are there because we don't need to pass any arguments to MPI.
The 1 tells StarPU to call MPI_Init itself.
Just as every malloc needs a free, every starpu_mpi_init needs a starpu_mpi_shutdown, which is placed in the destructor, before the call to starpu_shutdown.
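A minimal sketch of the intended pairing, assuming the calls live in the constructor and destructor of the algorithm class (the real members and arguments differ):
#+BEGIN_SRC C
FGroupTaskStarPUImplicitAlgorithm(/* ... */){
    FAssertLF(starpu_mpi_init(0, 0, 1) == 0); // let StarPU call MPI_Init for us
    // ...
}
~FGroupTaskStarPUImplicitAlgorithm(){
    // ...
    starpu_mpi_shutdown(); // mirror of starpu_mpi_init
    starpu_shutdown();     // only after the MPI layer is shut down
}
#+END_SRC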
To check that everything goes right, the following code was added right after starpu_mpi_init; it should print the number of nodes and the rank of the current node.
#+BEGIN_SRC C
int rank, nbProc;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nbProc);
std::cout << rank << "/" << nbProc << std::endl;
#+END_SRC
To run on multiple nodes (on my computer), I use mpirun; a typical invocation could look like the line below.
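This is only an example command, the binary path and process count depend on the build and the machine:
#+BEGIN_SRC sh
mpirun -np 2 ./testBlockedImplicitAlgorithm
#+END_SRC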
After a quick run, it turns out that starpu_mpi doesn't like unregistered data, so in the function buildHandles we must add some starpu_mpi_data_register calls.
The tag just has to be unique, and since all the handles are registered in the same function, keeping the tags unique is easy.
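For illustration, such a registration loop could look like the following sketch; the handle array, its size and the tag counter are assumed names, not the actual ScalFMM code:
#+BEGIN_SRC C
// Register every handle on node 0 with a unique MPI tag (sketch, assumed names).
int tag = 0;
for(int idxHandle = 0 ; idxHandle < nbHandles ; ++idxHandle){
    starpu_mpi_data_register(handles[idxHandle], tag++, 0);
}
#+END_SRC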
I tried to use starpu_mpi_data_set_rank, but for some obscure reason it couldn't register more than one handle (even when the handles were different).
So from here, all the data are registered on node 0 and all the tasks are known by starpu_mpi.
If I run the code, node 1 reports multiple errors because its octree doesn't contain the values it should.
This is quite understandable because, according to the StarPU documentation (and according to Marc), each task is executed where its data is.
And since all the data are registered on node 0, no wonder node 1 doesn't have any valid data.
Because there are some errors in the syncData function, I disabled it for node 1. In the future, each node will have to acquire and release only the handles it owns.
If we don't want any errors on node 1 either, we could register the data handles with the MPI rank instead of 0; each node would then execute the whole algorithm without exchanging data with the others and would end up with a consistent octree.
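For illustration, the same registration loop as before with the owner set to the local rank (mpi_rank is an assumed variable name):
#+BEGIN_SRC C
// Every node registers the handles on itself, so no data is ever exchanged (sketch).
int tag = 0;
for(int idxHandle = 0 ; idxHandle < nbHandles ; ++idxHandle){
    starpu_mpi_data_register(handles[idxHandle], tag++, mpi_rank);
}
#+END_SRC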
*** Complicated continuation
Let's see if starpu_mpi can handle many nodes working together.
Thanks to the macro STARPU_EXECUTE_ON_NODE, a task can be attached to a specific node.
So if every inserted task is executed on node 1 while all the data are registered on node 0, StarPU will have to exchange data somehow.
So I added
#+BEGIN_SRC C
STARPU_EXECUTE_ON_NODE, 1
#+END_SRC
to each starpu_mpi_insert_task call.
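A complete call could look like the sketch below; the codelet and handle names are assumptions, not the actual ScalFMM codelets:
#+BEGIN_SRC C
// Data registered on node 0, task forced onto node 1 (sketch, assumed names).
starpu_mpi_insert_task(MPI_COMM_WORLD, &p2m_cl,
                       STARPU_RW, cellHandle,
                       STARPU_R, particleHandle,
                       STARPU_EXECUTE_ON_NODE, 1,
                       0);
#+END_SRC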
When I ran it, everything seemed right; but to make sure that node 1 really executed everything, I prevented node 1 from creating the DAG (Directed Acyclic Graph).
It turned out that node 0 got blocked in starpu_task_wait_for_all at the end of executeCore, which means that all the tasks were really executed on node 1, so StarPU seems good at exchanging data.
** Distributed FMM with StarPU
*** First Data Mapping
To correctly distribute data among the nodes, the algorithm needs a function that returns the MPI node where a block of the grouped tree will be registered, given its location in the tree.
The following method was used to distribute the data.
#+BEGIN_SRC C
int dataMapping(int idxGroup, int const nbGroup) const {
    int bonusGroup = nbGroup%nproc; //Number of nodes that get one extra group
    int groupPerNode = (int)floor((double)nbGroup / (double)nproc);
    int mpi_node;
    if(idxGroup < bonusGroup*(groupPerNode+1)) { //GroupCell in the bonus group
        mpi_node = idxGroup / (groupPerNode+1); //Nodes of the bonus group hold groupPerNode+1 groups each
    }
    else {
        idxGroup -= bonusGroup*(groupPerNode+1); //Remove the bonus group to compute easily
        mpi_node = (idxGroup - (idxGroup%groupPerNode))/groupPerNode; //Find the corresponding node as if there were no bonus group
        mpi_node += bonusGroup; //Shift to find the real node
    }
    return mpi_node;
}
#+END_SRC
The choice was to split each level between all the nodes, so that each node holds consecutive groups of cells; that's why the only parameters needed were the size of the level and the index within the level.
The bonus group is made of the first nodes, which get one more group to handle because the level size isn't a multiple of the number of nodes. For example, with 10 groups and 3 nodes, node 0 gets 4 groups while nodes 1 and 2 get 3 each.
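Purely for illustration, here is a standalone adaptation of the mapping (nproc passed as a parameter instead of being read from the class) that prints the split for 10 groups on 3 nodes:
#+BEGIN_SRC C
#include <math.h>
#include <stdio.h>

/* Standalone version of dataMapping, for checking the split by hand. */
int mapping(int idxGroup, const int nbGroup, const int nproc){
    const int bonusGroup = nbGroup % nproc;
    const int groupPerNode = (int)floor((double)nbGroup / (double)nproc);
    if(idxGroup < bonusGroup*(groupPerNode+1)){
        return idxGroup / (groupPerNode+1); /* group is in the bonus region */
    }
    idxGroup -= bonusGroup*(groupPerNode+1);
    return bonusGroup + idxGroup/groupPerNode;
}

int main(void){
    for(int idxGroup = 0 ; idxGroup < 10 ; ++idxGroup){
        printf("group %d -> node %d\n", idxGroup, mapping(idxGroup, 10, 3));
    }
    return 0;
}
#+END_SRC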
To simplify the code, a new public method has been added to check whether a block of the grouped tree is owned by the current node.
#+BEGIN_SRC C
bool isDataOwned(int idxGroup, int nbGroup) const {
    // Body reconstructed (assumption): a block is owned when the mapping designates this node.
    return dataMapping(idxGroup, nbGroup) == mpi_rank;
}
#+END_SRC
There are at least two new things: the variable "where", which tells whether a handle is in this node's memory or somewhere else, and the variable "registeringNode", which tells on which node the block should be registered.
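The general registration pattern behind those two variables could look like the sketch below (variable names are assumptions and the exact register call for a block of the grouped tree differs; this just mirrors the usual StarPU MPI idiom):
#+BEGIN_SRC C
int registeringNode = dataMapping(idxGroup, nbGroup); // owner of this block
// STARPU_MAIN_RAM if the block lives in this node's memory, -1 (no local copy) otherwise.
int where = (registeringNode == mpi_rank) ? STARPU_MAIN_RAM : -1;
starpu_variable_data_register(&handle, where,
                              (where == -1 ? 0 : (uintptr_t)blockPtr), blockSize);
starpu_mpi_data_register(handle, tag++, registeringNode);
#+END_SRC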
*** Checking result
Now that I have a beautiful mapping function, how on earth am I supposed to check that the result is correct?
To answer this question, I need to clarify a little how StarPU MPI works (from what I understood of the documentation).
When a node registers a piece of data, every time a task writes this data on another node, the fresh value is written back to the node that registered it.
So each node should have up-to-date values in its own cells.
To check the result, I used the following procedure:
- Load the particles and create two trees (a classic tree and a grouped one)
- Execute both the distributed and the classic algorithm
- For each cell and leaf that I own in the grouped tree, compare it with its counterpart in the classic tree, as in the snippet below
std::cout << "[Empty node(" << rank << ")] Error cell should not exist " << gcell.getMortonIndex() << "\n";
}
else {
if(gcell.getDataUp() != cell->getDataUp()){
std::cout << "[Up node(" << rank << ")] Up is different at index " << gcell.getMortonIndex() << " level " << idxLevel << " is " << gcell.getDataUp() << " should be " << cell->getDataUp() << "\n";
}
if(gcell.getDataDown() != cell->getDataDown()){
std::cout << "[Down node(" << rank << ")] Down is different at index " << gcell.getMortonIndex() << " level " << idxLevel << " is " << gcell.getDataDown() << " should be " << cell->getDataDown() << "\n";