The main motivation for creating a distributed version of the FMM is to run large simulations.
Such simulations contain more particles than a single computer can host, which requires using several computers.
Moreover, it is not reasonable to ask a master process to load an entire file and to dispatch the data to the other processes: without knowing the entire tree, it could only send the data to the slaves arbitrarily.
To overcome this problem, our solution can be viewed as a two-step process.
First, each node loads a part of the file so that it owns a subset of the particles.
After this task, each node can compute the Morton index of the particles it has loaded.
The Morton index of a particle depends on the system properties but also on the tree height.
Since we want to be able to choose the tree height and the number of nodes at run time, we cannot pre-process the file.
The second step is a parallel sort of the particles by Morton index across all the nodes, with a balancing operation at the end.
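As an illustration, the sketch below shows how a leaf-level Morton index can be derived from a particle position once the simulation box and the tree height are known. The function name, the cubic-box assumption and the bit ordering are illustrative choices and do not necessarily match the scalfmm implementation.
\begin{verbatim}
#include <array>
#include <cstdint>

// Hypothetical helper: Morton index of a particle at the leaf level of an
// octree of the given height, inside a cubic box of width boxWidth whose
// lower corner is boxCorner. The particle is assumed to lie inside the box.
std::uint64_t mortonIndex(const std::array<double,3>& position,
                          const std::array<double,3>& boxCorner,
                          double boxWidth, int treeHeight) {
    // The leaf level contains 2^(treeHeight-1) cells per dimension.
    const std::uint64_t cellsPerDim = std::uint64_t(1) << (treeHeight - 1);
    const double cellWidth = boxWidth / double(cellsPerDim);

    std::uint64_t index = 0;
    for (int dim = 0; dim < 3; ++dim) {
        // Discrete coordinate of the particle along this dimension.
        std::uint64_t coord =
            std::uint64_t((position[dim] - boxCorner[dim]) / cellWidth);
        if (coord >= cellsPerDim) coord = cellsPerDim - 1; // clamp boundary
        // Interleave the bits of the three coordinates (z-order curve).
        for (int bit = 0; bit < treeHeight - 1; ++bit) {
            index |= ((coord >> bit) & 1u) << (3 * bit + dim);
        }
    }
    return index;
}
\end{verbatim}
The dependence on the tree height appears directly in the number of interleaved bits, which is why the indices cannot be pre-computed when the height is only known at run time.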
Once each node has a set of particles, we need to sort them.
This problem boils down to a parallel sort in which Morton indices are used to compare particles.
We use two different approaches to sort the data; in the next version of scalfmm the less efficient of the two should be removed.
\subsection{Using QuickSort}
A first approach is to use a well-known sorting algorithm.
We chose quicksort because its distributed and shared-memory versions are very similar.
Our implementation is based on the algorithm described in \cite{itpc03}.
The efficiency of this algorithm depends largely on the choice of the pivot.
In fact, a common misconception about parallel quicksort is to think that each process first sorts its own particles with quicksort and then uses a merge sort to combine the results.
Instead, the nodes choose a common pivot and perform one quicksort iteration together.
After this step, each process has an array with a left part whose values are lower than the pivot and a right part whose values are greater than or equal to the pivot.
Then the nodes exchange data, and some of them continue to work on the lower part while the others work on the upper part, until each part is handled by a single process.
At this point, each process performs a shared-memory quicksort on its local data.
To choose the pivot, we use an average of all the data hosted by the nodes.
At the beginning, our implementation contained a bug: we computed this average by summing all the values first and dividing afterwards. However, the Morton indices may be extremely large and the sum may overflow, so each value has to be divided before performing the sum.
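As a rough sketch, and assuming MPI is used for the reduction, the pivot selection with the division performed before the summation could look as follows. The function name and the exact averaging scheme (an average of the per-node averages) are illustrative choices and not the actual scalfmm code.
\begin{verbatim}
#include <mpi.h>
#include <cstdint>
#include <vector>

// Hypothetical pivot selection: each Morton index is divided *before* being
// accumulated, so that the 64-bit accumulation cannot overflow, at the price
// of a small rounding error. The sum over all processes of these local
// contributions is the average of the per-node averages.
std::uint64_t choosePivot(const std::vector<std::uint64_t>& mortonIndices,
                          MPI_Comm comm) {
    int nbProcs = 1;
    MPI_Comm_size(comm, &nbProcs);

    const std::uint64_t divider =
        std::uint64_t(mortonIndices.empty() ? 1 : mortonIndices.size())
        * std::uint64_t(nbProcs);

    std::uint64_t localContribution = 0;
    for (std::uint64_t idx : mortonIndices) {
        localContribution += idx / divider;   // divide first, sum after
    }

    std::uint64_t pivot = 0;
    MPI_Allreduce(&localContribution, &pivot, 1,
                  MPI_UINT64_T, MPI_SUM, comm);
    return pivot;
}
\end{verbatim}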
The L2L operator is very similar to the M2M, but works in the opposite direction: a result hosted by a single node needs to be shared with every other node that is responsible for at least one child of this cell.
The L2L operator fills the local arrays of the children from the local array of the parent, so there is no need to specify which cell is sent, since it is always the parent cell. Consequently, there is no need for a message header.
\BlankLine
\begin{algorithm}[H]
\RestyleAlgo{boxed}
\LinesNumbered
\SetAlgoLined
\KwData{none}
\KwResult{none}
\BlankLine
\For{idxLevel $\leftarrow$ 2 \KwTo$Height -2$}{
\uIf{$cells[0]$ not in my working interval}{
irecv($cells[0]$)\;
hasRecv $\leftarrow$ true\;
}
\uIf{$cells[end]$ in another working interval}{
isend($cells[end]$)\;
hasSend $\leftarrow$ true\;
}
\ForAll{Cell c at level idxLevel in working interval}{
L2L(c, c.child)\;
}
\emph{Wait for send and recv if needed}\;
\uIf{hasRecv is true}{
L2L($cells[0]$, $cells[0].child$)\;
}
}
\BlankLine
\caption{Distributed L2L}
\end{algorithm}
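For illustration only, the sketch below shows how the boundary exchange of one level of this algorithm could be written with non-blocking MPI calls, assuming a hypothetical Cell structure whose local expansion is stored in a contiguous buffer. It is a simplified view, not the actual scalfmm implementation.
\begin{verbatim}
#include <mpi.h>
#include <vector>

// Hypothetical cell: the local expansion is a plain buffer so that it can
// be sent and received directly.
struct Cell {
    std::vector<double> local;   // local expansion coefficients
    // ... Morton index, links to the children, etc.
};

// Boundary exchange for one level of the distributed L2L. 'cells' holds the
// cells of this level stored by this process; prevRank/nextRank are the
// neighbouring processes along the Morton ordering.
void l2lLevelExchange(std::vector<Cell>& cells,
                      bool firstCellOwnedByPrev, bool lastCellSharedWithNext,
                      int prevRank, int nextRank, MPI_Comm comm) {
    MPI_Request requests[2];
    int nbRequests = 0;

    // The first cell belongs to the previous process but has children here:
    // receive its local expansion (the buffer is assumed to be allocated).
    if (firstCellOwnedByPrev) {
        MPI_Irecv(cells.front().local.data(), int(cells.front().local.size()),
                  MPI_DOUBLE, prevRank, 0, comm, &requests[nbRequests++]);
    }
    // The last cell also has children on the next process: send it.
    if (lastCellSharedWithNext) {
        MPI_Isend(cells.back().local.data(), int(cells.back().local.size()),
                  MPI_DOUBLE, nextRank, 0, comm, &requests[nbRequests++]);
    }

    // ... apply the L2L to the cells fully owned by this process here, so
    // that the computation overlaps the communications ...

    MPI_Waitall(nbRequests, requests, MPI_STATUSES_IGNORE);
    // ... then apply the L2L from the received parent to its local children.
}
\end{verbatim}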
To compute the P2P, a leaf needs to know all of its direct neighbors.
Even though the Morton indexing maximizes locality, the neighbors of a leaf can be located on any node.
Moreover, the tree used in our library is an indirection tree, which means that only the leaves that contain particles are created.
For this reason, when we know that a leaf needs another leaf located on a different node, this other node may not be aware of the relation if the neighbor leaf does not exist in its own tree.
On the contrary, if this neighbor leaf exists, then that node will also need the first leaf to compute its own P2P.
In our current version, we first go through all the potential neighbors to determine which communications will be needed.
Then the nodes perform an all-gather to inform each other of how many messages each of them is going to send.
Finally, they exchange the data asynchronously and overlap these communications with the P2P computations they can already perform locally.
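A hedged sketch of the counting and all-gather step, assuming MPI, is shown below; the function and variable names are hypothetical and do not correspond to the actual scalfmm code.
\begin{verbatim}
#include <mpi.h>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the P2P communication-counting step: this process
// has already counted, for every other process p, how many leaves it will
// have to send to p (nbLeavesToSend[p], of size nbProcs). The all-gather
// lets every process know how many messages to expect from everyone.
std::vector<int> exchangeSendCounts(std::vector<int> nbLeavesToSend,
                                    MPI_Comm comm) {
    int nbProcs = 0;
    MPI_Comm_size(comm, &nbProcs);

    // After the all-gather, allCounts[q * nbProcs + p] is the number of
    // leaves process q will send to process p.
    std::vector<int> allCounts(std::size_t(nbProcs) * std::size_t(nbProcs));
    MPI_Allgather(nbLeavesToSend.data(), nbProcs, MPI_INT,
                  allCounts.data(), nbProcs, MPI_INT, comm);

    // The particle data itself can then be exchanged with non-blocking
    // sends and receives while the purely local P2P is computed.
    return allCounts;
}
\end{verbatim}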