The main motivation for creating a distributed version of the FMM is to run large simulations.
Such simulations contain more particles than a single computer can host, which requires using several computers.
Moreover, it is not reasonable to ask a master process to load an entire file and to dispatch the data to the other processes: without knowing the entire tree, it could only send the data to the slaves arbitrarily.
To overcome this problem, our solution can be viewed as a two-step process.
First, each node loads a part of the file so that it owns a subset of the particles.
After this task, each node can compute the Morton index of the particles it has loaded.
The Morton index of a particle depends on the system properties but also on the tree height.
Since we want to be able to choose the tree height and the number of nodes at run time, we cannot pre-process the file.
The second step is a parallel sort of the particles by Morton index across all the nodes, with a balancing operation at the end.
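As an illustration, the sketch below shows how a leaf-level Morton index can be derived from a particle position once the simulation box and the tree height are known. The function name, the cubic-box assumption and the bit ordering are illustrative choices and do not necessarily match the scalfmm implementation.
\begin{verbatim}
#include <array>
#include <cstdint>

// Hypothetical helper: Morton index of a particle at the leaf level of an
// octree of the given height, inside a cubic box of width boxWidth whose
// lower corner is boxCorner. The particle is assumed to lie inside the box.
std::uint64_t mortonIndex(const std::array<double,3>& position,
                          const std::array<double,3>& boxCorner,
                          double boxWidth, int treeHeight) {
    // The leaf level contains 2^(treeHeight-1) cells per dimension.
    const std::uint64_t cellsPerDim = std::uint64_t(1) << (treeHeight - 1);
    const double cellWidth = boxWidth / double(cellsPerDim);

    std::uint64_t index = 0;
    for (int dim = 0; dim < 3; ++dim) {
        // Discrete coordinate of the particle along this dimension.
        std::uint64_t coord =
            std::uint64_t((position[dim] - boxCorner[dim]) / cellWidth);
        if (coord >= cellsPerDim) coord = cellsPerDim - 1; // clamp boundary
        // Interleave the bits of the three coordinates (z-order curve).
        for (int bit = 0; bit < treeHeight - 1; ++bit) {
            index |= ((coord >> bit) & 1u) << (3 * bit + dim);
        }
    }
    return index;
}
\end{verbatim}
The dependence on the tree height appears directly in the number of interleaved bits, which is why the indices cannot be pre-computed when the height is only known at run time.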
Once each node has a set of particles, we need to sort them.
This problem boils down to a parallel sort in which Morton indices are used to compare particles.
We use two different approaches to sort the data; in the next version of scalfmm the less efficient of the two should be removed.
\subsection{Using QuickSort}
A first approach is to use a well-known sorting algorithm.
We chose quicksort because its distributed and shared-memory versions are very similar.
Our implementation is based on the algorithm described in \cite{itpc03}.
The efficiency of this algorithm depends largely on the choice of the pivot.
In fact, a common misconception about parallel quicksort is to think that each process first sorts its own particles with quicksort and then uses a merge sort to combine the results.
Instead, the nodes choose a common pivot and perform one quicksort iteration together.
After this step, each process has an array with a left part whose values are lower than the pivot and a right part whose values are greater than or equal to the pivot.
Then the nodes exchange data, and some of them continue to work on the lower part while the others work on the upper part, until each part is handled by a single process.
At this point, each process performs a shared-memory quicksort on its local data.
To choose the pivot, we use an average of all the data hosted by the nodes.
At the beginning, our implementation contained a bug: we computed this average by summing all the values first and dividing afterwards. However, the Morton indices may be extremely large and the sum may overflow, so each value has to be divided before performing the sum.
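As a rough sketch, and assuming MPI is used for the reduction, the pivot selection with the division performed before the summation could look as follows. The function name and the exact averaging scheme (an average of the per-node averages) are illustrative choices and not the actual scalfmm code.
\begin{verbatim}
#include <mpi.h>
#include <cstdint>
#include <vector>

// Hypothetical pivot selection: each Morton index is divided *before* being
// accumulated, so that the 64-bit accumulation cannot overflow, at the price
// of a small rounding error. The sum over all processes of these local
// contributions is the average of the per-node averages.
std::uint64_t choosePivot(const std::vector<std::uint64_t>& mortonIndices,
                          MPI_Comm comm) {
    int nbProcs = 1;
    MPI_Comm_size(comm, &nbProcs);

    const std::uint64_t divider =
        std::uint64_t(mortonIndices.empty() ? 1 : mortonIndices.size())
        * std::uint64_t(nbProcs);

    std::uint64_t localContribution = 0;
    for (std::uint64_t idx : mortonIndices) {
        localContribution += idx / divider;   // divide first, sum after
    }

    std::uint64_t pivot = 0;
    MPI_Allreduce(&localContribution, &pivot, 1,
                  MPI_UINT64_T, MPI_SUM, comm);
    return pivot;
}
\end{verbatim}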
The L2L operator is very similar to the M2M, but works in the opposite direction: a result hosted by a single node needs to be shared with every other node that is responsible for at least one child of this cell.
The L2L operator fills the local arrays of the children from the local array of the parent, so there is no need to specify which cell is sent, since it is always the parent cell. Consequently, there is no need for a message header.
\BlankLine
\begin{algorithm}[H]
\RestyleAlgo{boxed}
\LinesNumbered
\SetAlgoLined
\KwData{none}
\KwResult{none}
\BlankLine
\For{idxLevel $\leftarrow$ 2 \KwTo$Height -2$}{
\uIf{$cells[0]$ not in my working interval}{
irecv($cells[0]$)\;
hasRecv $\leftarrow$ true\;
}
\uIf{$cells[end]$ in another working interval}{
isend($cells[end]$)\;
hasSend $\leftarrow$ true\;
}
\ForAll{Cell c at level idxLevel in working interval}{
L2L(c, c.child)\;
}
\emph{Wait for send and recv if needed}\;
\uIf{hasRecv is true}{
L2L($cells[0]$, $cells[0].child$)\;
}
}
\BlankLine
\caption{Distributed L2L}
\end{algorithm}
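For illustration only, the sketch below shows how the boundary exchange of one level of this algorithm could be written with non-blocking MPI calls, assuming a hypothetical Cell structure whose local expansion is stored in a contiguous buffer. It is a simplified view, not the actual scalfmm implementation.
\begin{verbatim}
#include <mpi.h>
#include <vector>

// Hypothetical cell: the local expansion is a plain buffer so that it can
// be sent and received directly.
struct Cell {
    std::vector<double> local;   // local expansion coefficients
    // ... Morton index, links to the children, etc.
};

// Boundary exchange for one level of the distributed L2L. 'cells' holds the
// cells of this level stored by this process; prevRank/nextRank are the
// neighbouring processes along the Morton ordering.
void l2lLevelExchange(std::vector<Cell>& cells,
                      bool firstCellOwnedByPrev, bool lastCellSharedWithNext,
                      int prevRank, int nextRank, MPI_Comm comm) {
    MPI_Request requests[2];
    int nbRequests = 0;

    // The first cell belongs to the previous process but has children here:
    // receive its local expansion (the buffer is assumed to be allocated).
    if (firstCellOwnedByPrev) {
        MPI_Irecv(cells.front().local.data(), int(cells.front().local.size()),
                  MPI_DOUBLE, prevRank, 0, comm, &requests[nbRequests++]);
    }
    // The last cell also has children on the next process: send it.
    if (lastCellSharedWithNext) {
        MPI_Isend(cells.back().local.data(), int(cells.back().local.size()),
                  MPI_DOUBLE, nextRank, 0, comm, &requests[nbRequests++]);
    }

    // ... apply the L2L to the cells fully owned by this process here, so
    // that the computation overlaps the communications ...

    MPI_Waitall(nbRequests, requests, MPI_STATUSES_IGNORE);
    // ... then apply the L2L from the received parent to its local children.
}
\end{verbatim}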
To compute the P2P, a leaf needs to know all of its direct neighbors.
Even though the Morton indexing maximizes locality, the neighbors of a leaf can be located on any node.
Moreover, the tree used in our library is an indirection tree, which means that only the leaves that contain particles are created.
For this reason, when we know that a leaf needs another leaf located on a different node, this other node may not be aware of the relation if the neighbor leaf does not exist in its own tree.
On the contrary, if this neighbor leaf exists, then that node will also need the first leaf to compute its own P2P.
In our current version, we first go through all the potential neighbors to determine which communications will be needed.
Then the nodes perform an all-gather to inform each other of how many messages each of them is going to send.
Finally, they exchange the data asynchronously and overlap these communications with the P2P computations they can already perform locally.
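A hedged sketch of the counting and all-gather step, assuming MPI, is shown below; the function and variable names are hypothetical and do not correspond to the actual scalfmm code.
\begin{verbatim}
#include <mpi.h>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the P2P communication-counting step: this process
// has already counted, for every other process p, how many leaves it will
// have to send to p (nbLeavesToSend[p], of size nbProcs). The all-gather
// lets every process know how many messages to expect from everyone.
std::vector<int> exchangeSendCounts(std::vector<int> nbLeavesToSend,
                                    MPI_Comm comm) {
    int nbProcs = 0;
    MPI_Comm_size(comm, &nbProcs);

    // After the all-gather, allCounts[q * nbProcs + p] is the number of
    // leaves process q will send to process p.
    std::vector<int> allCounts(std::size_t(nbProcs) * std::size_t(nbProcs));
    MPI_Allgather(nbLeavesToSend.data(), nbProcs, MPI_INT,
                  allCounts.data(), nbProcs, MPI_INT, comm);

    // The particle data itself can then be exchanged with non-blocking
    // sends and receives while the purely local P2P is computed.
    return allCounts;
}
\end{verbatim}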