Commit 1804156d authored by berenger-bramas

Correct and update the ParallelDetails.tex

git-svn-id: svn+ssh://scm.gforge.inria.fr/svn/scalfmm/scalfmm/trunk@172 2616d619-271b-44dc-8df4-d4a8f33a7222
parent 3d859277
@@ -24,3 +24,4 @@ tmp/
 *.toc
 *.log
 *.aux
+*.brf
\documentclass[12pt,letterpaper,titlepage]{report}
\usepackage{algorithm2e}
\usepackage{listings}
\usepackage{geometry}
\usepackage{graphicx}
\usepackage[hypertexnames=false, pdftex]{hyperref}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% use pdflatex ParallelDetails.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\author{Berenger Bramas}
\title{ScalFmm - Parallel Algorithms (Draft)}
\date{August, 2011}

%% Package config
\lstset{language=c++, frame=lines}
\restylealgo{boxed}
\geometry{scale=0.8, nohead}
\hypersetup{colorlinks=true, linkcolor=black, urlcolor=blue, citecolor=blue}
%% Remove introduction numbering
\setcounter{secnumdepth}{-1}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\maketitle{}
\newpage
\tableofcontents
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
In this document we introduce the principles and the algorithms used in our library to run in a distributed environment using MPI.
The algorithms in this document may not be up to date compared to those used in the code.
We advise checking the versions of this document and of the code to make sure you have the latest ones.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Building the tree in Parallel}
\section{Description}
The main motivation to create a distributed version of the FMM is to run large simulations.
These simulations contain more particles than a single computer can host, which requires using several computers.
Moreover, it is not reasonable to ask a master process to load an entire file and to dispatch the data to the other processes: without knowing the entire tree, it could only send the data to the slaves more or less randomly.
To overcome this situation, our solution can be viewed as a two-step process.
First, each node loads a part of the file to obtain a subset of the particles.
After this task, each node can compute the Morton index of the particles it has loaded.
The Morton index of a particle depends on the system properties but also on the tree height.
If we want to choose the tree height and the number of nodes at run time, then we cannot pre-process the file.
The second step is a parallel sort based on the Morton index between all nodes, with a balancing operation at the end.
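For illustration, the following sketch shows one common way to compute a Morton index from a particle position by interleaving the bits of its grid coordinates at the leaf level. It is only a sketch under the assumption that the leaf level holds $2^{Height-1}$ cells per side; the names (mortonIndexOf, boxCorner, boxWidth) are ours and not the ScalFmm API.
\begin{lstlisting}
// Illustrative sketch only, not the ScalFmm implementation.
typedef long long MortonIndex; // assumed to be large enough

MortonIndex mortonIndexOf(const double position[3], const double boxCorner[3],
                          const double boxWidth, const int treeHeight) {
    // The leaf level is assumed to hold 2^(treeHeight-1) cells per side.
    const long long cellsPerSide = 1LL << (treeHeight - 1);
    const double cellWidth = boxWidth / double(cellsPerSide);

    MortonIndex index = 0;
    for (int dim = 0; dim < 3; ++dim) {
        // Grid coordinate of the leaf containing the particle along this axis.
        const long long coord =
            (long long)((position[dim] - boxCorner[dim]) / cellWidth);
        // Interleave the bits of the three coordinates (z-order curve).
        for (int bit = 0; bit < treeHeight - 1; ++bit) {
            index |= ((coord >> bit) & 1LL) << (3 * bit + dim);
        }
    }
    return index;
}
\end{lstlisting}
Because both the leaf width and the number of interleaved bits depend on the tree height, the index cannot be computed before the height is known, which is why the file cannot be pre-processed.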
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Load a file in parallel}
We use the MPI I/O functions to split a file between all the MPI processes.
The prerequisite to make the splitting easier is to have a binary file.
Thereby, using a very basic formula, each node knows which part of the file it needs to load.
\begin{equation}
size\ per\ proc \leftarrow \left( file\ size - header\ size \right) / nbprocs
\end{equation}
\begin{equation}
offset \leftarrow header\ size + size\ per\ proc \cdot \left( rank - 1 \right)
\end{equation}
\newline
The MPI I/O functions use a view model to manage data access.
We first construct a view using the function MPI\_File\_set\_view and then read the data with MPI\_File\_read as described in the following C++ code.
\begin{lstlisting}
// From FMpiFmaLoader
particles = new FReal[bufsize];
MPI_File_set_view(file, headDataOffSet + startPart * 4 * sizeof(FReal),
    MPI_FLOAT, MPI_FLOAT, const_cast<char*>("native"), MPI_INFO_NULL);
MPI_File_read(file, particles, bufsize, MPI_FLOAT, &status);
\end{lstlisting}
Our files are composed of a header followed by all the particles.
The header makes it possible to check several properties such as the precision of the file.
Finally, a particle is represented by four floating-point values: a position and a physical value.
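As an illustration of the two formulas above, here is a minimal, self-contained sketch that computes the per-process chunk and reads it with the same MPI I/O calls. Note that the formula writes $(rank - 1)$, which corresponds to 1-based ranks, while the sketch uses the 0-based ranks returned by MPI\_Comm\_rank; the file name and the two-value header are assumptions, so the constants differ from the real FMpiFmaLoader.
\begin{lstlisting}
// Hedged sketch of the splitting formulas, not the real FMpiFmaLoader.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nbprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nbprocs);

    MPI_File file;
    MPI_File_open(MPI_COMM_WORLD, "particles.bin", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &file);

    const MPI_Offset headerSize   = 2 * sizeof(float); // assumed header layout
    const MPI_Offset particleSize = 4 * sizeof(float); // x, y, z, physical value
    MPI_Offset fileSize = 0;
    MPI_File_get_size(file, &fileSize);

    // size per proc = (file size - header size) / nbprocs, in whole particles.
    const MPI_Offset nbParticles = (fileSize - headerSize) / particleSize;
    const MPI_Offset perProc     = nbParticles / nbprocs;
    // The last process also takes the remainder.
    const MPI_Offset myCount     = (rank == nbprocs - 1)
                                 ? nbParticles - perProc * (nbprocs - 1) : perProc;
    // offset = header size + size per proc * rank (0-based ranks here).
    const MPI_Offset offset      = headerSize + perProc * rank * particleSize;

    std::vector<float> particles(myCount * 4);
    MPI_Status status;
    MPI_File_set_view(file, offset, MPI_FLOAT, MPI_FLOAT,
                      const_cast<char*>("native"), MPI_INFO_NULL);
    MPI_File_read(file, particles.data(), int(myCount * 4), MPI_FLOAT, &status);

    MPI_File_close(&file);
    MPI_Finalize();
    return 0;
}
\end{lstlisting}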
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Sorting the particles}
Once each node has a set of particles we need to sort them.
This problem boils down to a simple parallel sort where the Morton indexes are used to compare particles.

\subsection{Using QuickSort}
A first approach is to use a well-known sorting algorithm.
We chose the quick sort algorithm because the distributed and the shared memory approaches are mostly similar.
Our implementation is based on the algorithm described in \cite{itpc03}.
The efficiency of this algorithm depends heavily on the pivot choice.
In fact, a common misconception about the parallel quick sort is to think that each process first sorts its own particles with a quick sort and then uses a merge sort to share the results.
Instead, the nodes choose a pivot and progress through one quick sort iteration together.
From that point every process has an array with a left part where all values are lower than the pivot and a right part where all values are greater than or equal to the pivot.
Then, the nodes exchange data and some of them work on the lower part while the others work on the upper part, until there is one process per part.
At this point, each process performs a shared memory quick sort.
To choose the pivot we tried to use an average of all the data hosted by the nodes:
\newline
\begin{algorithm}[H]
\linesnumbered
\SetLine
\KwResult{A Morton index as next iteration pivot}
\BlankLine
myFirstIndex $\leftarrow$ particles$[0]$.index\;
allFirstIndexes $\leftarrow$ MortonIndex$[nbprocs]$\;
allGather(myFirstIndex, allFirstIndexes)\;
pivot $\leftarrow$ Sum(allFirstIndexes) / nbprocs\;
\BlankLine
\caption{Choosing the QS pivot}
\end{algorithm}
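This pivot choice translates almost directly into MPI. The sketch below is an illustration under the assumption that a Morton index fits in a long long; the helper name is ours, not part of ScalFmm.
\begin{lstlisting}
// Illustrative pivot selection: average of every process' first Morton index.
#include <mpi.h>
#include <vector>

typedef long long MortonIndex; // assumed to be large enough

MortonIndex choosePivot(MortonIndex myFirstIndex, MPI_Comm comm) {
    int nbprocs = 1;
    MPI_Comm_size(comm, &nbprocs);

    std::vector<MortonIndex> allFirstIndexes(nbprocs);
    MPI_Allgather(&myFirstIndex, 1, MPI_LONG_LONG,
                  allFirstIndexes.data(), 1, MPI_LONG_LONG, comm);

    MortonIndex sum = 0;
    for (int idx = 0; idx < nbprocs; ++idx) {
        sum += allFirstIndexes[idx];
    }
    return sum / nbprocs; // pivot for the next quick sort iteration
}
\end{lstlisting}
A poorly chosen pivot still leads to partitions of very different sizes, which is one of the reasons why a balancing step is needed afterwards.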
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Using an intermediate Octree}
The second approach uses an octree to sort the particles in each process instead of a sorting algorithm.
The time complexity is equivalent but it needs more memory since it is not done in place.
After inserting the particles in the tree, we can iterate at the leaf level and access the particles in an ordered way.
Then, the processes perform a minimum and a maximum reduction to know the real Morton interval of the system.
Even with the system interval expressed in terms of Morton indexes, the nodes cannot know how the data are scattered inside it.
Finally, the processes split the interval in a uniform manner and exchange data with $P^{2}$ communications in the worst case.
\newline
\newline
In both approaches the data may not be balanced at the end.
In fact, the first method is pivot dependent and the second one assumes that the data are uniformly distributed.
That is the reason why we need to balance the data among the nodes.
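The minimum and maximum reductions used by the octree approach can be expressed with two all-reduce operations, as in the hedged sketch below; the MortonIndex type and the function name are assumptions, not the ScalFmm API.
\begin{lstlisting}
// Illustrative reduction of the global Morton interval [min, max].
#include <mpi.h>

typedef long long MortonIndex; // assumed representation

void globalMortonInterval(MortonIndex localMin, MortonIndex localMax,
                          MortonIndex* systemMin, MortonIndex* systemMax,
                          MPI_Comm comm) {
    MPI_Allreduce(&localMin, systemMin, 1, MPI_LONG_LONG, MPI_MIN, comm);
    MPI_Allreduce(&localMax, systemMax, 1, MPI_LONG_LONG, MPI_MAX, comm);
}
\end{lstlisting}
Each process can then cut the interval $[systemMin, systemMax]$ into $nbprocs$ uniform sub-intervals and send every leaf to the owner of its sub-interval, which is where the $P^{2}$ worst case comes from.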
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Balancing the leaves}
After sorting, each process potentially has several leaves.
If we have two processes $P_{i}$ and $P_{j}$ with $i < j$, the sort guarantees that all the leaves on node $i$ have lower Morton indexes than the leaves on node $j$.
But the leaves are randomly distributed among the nodes and we need to balance them.
Our idea is to use a two-pass algorithm that can be described as a sand settling:
\begin{enumerate}
\item We can see each node as a heap of sand.
This heap represents the leaves of the octree.
Some nodes host many leaves and therefore form a big heap; on the contrary, some are small heaps because they are composed of only a few leaves.
Starting from the extremities, each node can know how much sand there is on its left and on its right.
\item Because each node knows the total amount of sand in the system, which is the sum of the sand on each of its sides plus its own sand, it can compute its part of the balancing (a minimal sketch is given after the figure below).
What would happen if we put a heavy plank on top of all our sand heaps?
The system would settle until all heaps have the same size.
The same happens here: each node can know what to do to balance the system.
Each node communicates only with its two neighbors by sending or receiving entire leaves.
\end{enumerate}
At the end of the algorithm our system is completely balanced with the same number of leaves on each process.
\begin{figure}[h!]
\begin{center}
\includegraphics[width=14cm, height=17cm, keepaspectratio=true]{SandSettling.png}
\caption{Sand Settling Example}
\end{center}
\end{figure}
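A minimal sketch of the balancing computation follows. It only derives how many leaves must cross each boundary (positive values mean sending, negative receiving), assuming the exchange itself is done with the neighbors afterwards; in extreme cases a node may even have to forward leaves it has just received, which the sketch does not show. The function and variable names are ours.
\begin{lstlisting}
// Illustrative computation of the sand settling targets.
#include <mpi.h>

void balancePlan(long long myNbLeaves, long long* toLeft, long long* toRight,
                 MPI_Comm comm) {
    int rank = 0, nbprocs = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nbprocs);

    // Leaves hosted by the processes on my left (exclusive prefix sum)
    // and total number of leaves in the system.
    long long leftLeaves = 0, totalLeaves = 0;
    MPI_Exscan(&myNbLeaves, &leftLeaves, 1, MPI_LONG_LONG, MPI_SUM, comm);
    if (rank == 0) leftLeaves = 0; // MPI_Exscan leaves rank 0 undefined
    MPI_Allreduce(&myNbLeaves, &totalLeaves, 1, MPI_LONG_LONG, MPI_SUM, comm);

    // Ideal interval of global leaf positions for this process.
    const long long idealBegin = (totalLeaves * rank) / nbprocs;
    const long long idealEnd   = (totalLeaves * (rank + 1)) / nbprocs;

    // My current interval is [leftLeaves, leftLeaves + myNbLeaves).
    *toLeft  = (rank == 0)           ? 0 : idealBegin - leftLeaves;
    *toRight = (rank == nbprocs - 1) ? 0 : (leftLeaves + myNbLeaves) - idealEnd;
}
\end{lstlisting}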
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Simple operators: P2M, M2M, L2L}
We present the different FMM operators in two separate parts depending on their parallel complexity.
In this first part we present the three simplest operators: P2M, M2M and L2L.
Their simplicity comes from the fact that we can predict which node hosts a cell and how to organize the communications.

\section{P2M}
The P2M remains unchanged from the sequential approach to the distributed memory algorithm.
In fact, in the sequential model we compute a P2M between all the particles of a leaf and this leaf, which is also a cell.
Since a leaf and the particles it hosts belong to only one node, applying the P2M operator does not require any information from another node.
From that point, using the shared memory operator makes sense.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{M2M}
During the upward pass, information moves from a level to the level above.
The problem in a distributed memory model is that one cell can exist in several trees, i.e.\ on several nodes.
Because the M2M operator computes the relation between a cell and its children, the nodes which have a cell in common need to share information.
Moreover, we have to decide which process will be responsible for the computation if the cell is present on more than one node.
We have decided that the node with the smallest rank has the responsibility to compute the M2M and to propagate the value for the future operations.
Even though the other processes do not compute this cell, they have to send the children of this shared cell to the responsible node.
We can establish some rules and properties of the communications during this operation.
In fact, at each iteration a process never needs to send more than $8-1$ cells; likewise, it never needs to receive more than $8-1$ cells.
The shared cells are always at the extremities of the working intervals, and one process cannot be designated as responsible for more than one shared cell per level.
\begin{algorithm}[H]
\restylealgo{boxed}
\linesnumbered
\SetLine
\KwData{none}
\KwResult{none}
\BlankLine
\For{idxLevel $\leftarrow$ $Height - 2$ \KwTo 1}{
    \ForAll{Cell c at level idxLevel}{
        M2M(c, c.child)\;
    }
}
\BlankLine
\caption{Traditional M2M}
\end{algorithm}
\begin{algorithm}[H]
\restylealgo{boxed}
\linesnumbered
\SetLine
\KwData{none}
\KwResult{none}
\BlankLine
\For{idxLevel $\leftarrow$ $Height - 2$ \KwTo 1}{
    \uIf{$cells[0]$ not in my working interval}{
        isend($cells[0].child$)\;
        hasSend $\leftarrow$ true\;
    }
    \uIf{$cells[end]$ in another working interval}{
        irecv(recvBuffer)\;
        hasRecv $\leftarrow$ true\;
    }
    \ForAll{Cell c at level idxLevel in working interval}{
        M2M(c, c.child)\;
    }
    \emph{Wait send and recv if needed}\;
    \uIf{hasRecv is true}{
        M2M($cells[end]$, recvBuffer)\;
    }
}
\BlankLine
\caption{Distributed M2M}
\end{algorithm}
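A hedged sketch of the boundary exchange only is given below; the serialization of the child multipoles, the buffer size and the interval tests are placeholders, and we assume the shared cell is shared with the adjacent rank, so this is not the actual ScalFmm communication code.
\begin{lstlisting}
// Illustrative exchange of the shared-cell children at one level.
#include <mpi.h>

const int MULTIPOLE_SIZE = 1024; // assumed serialized size of one child

void exchangeLevelBorders(char* sendBuffer,   // my copies of cells[0].child
                          char* recvBuffer,   // children of cells[end] from rank+1
                          bool firstCellOwnedByLeft,    // cells[0] not in my interval
                          bool lastCellSharedWithRight, // cells[end] also on rank+1
                          int rank, MPI_Comm comm) {
    MPI_Request requests[2];
    int nbRequests = 0;

    if (firstCellOwnedByLeft) {
        // Send to the responsible process (the smaller rank).
        MPI_Isend(sendBuffer, 8 * MULTIPOLE_SIZE, MPI_BYTE, rank - 1,
                  0, comm, &requests[nbRequests++]);
    }
    if (lastCellSharedWithRight) {
        // Receive the children held by the higher rank for my last cell.
        MPI_Irecv(recvBuffer, 8 * MULTIPOLE_SIZE, MPI_BYTE, rank + 1,
                  0, comm, &requests[nbRequests++]);
    }

    // ... here: M2M on the cells fully inside my working interval ...

    MPI_Waitall(nbRequests, requests, MPI_STATUSES_IGNORE);
    // If something was received, finish with M2M(cells[end], recvBuffer).
}
\end{lstlisting}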
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{L2L}
The L2L operator is very similar to the M2M.
It simply works the other way around: a result hosted by only one node needs to be shared with every other node that is responsible for at least one child of this cell.
\BlankLine
\begin{algorithm}[H]
\restylealgo{boxed}
\linesnumbered
\SetLine
\KwData{none}
\KwResult{none}
\BlankLine
\For{idxLevel $\leftarrow$ 2 \KwTo $Height - 2$ }{
    \uIf{$cells[0]$ not in my working interval}{
        irecv($cells[0]$)\;
        hasRecv $\leftarrow$ true\;
    }
    \uIf{$cells[end]$ in another working interval}{
        isend($cells[end]$)\;
        hasSend $\leftarrow$ true\;
    }
    \ForAll{Cell c at level idxLevel in working interval}{
        L2L(c, c.child)\;
    }
    \emph{Wait send and recv if needed}\;
    \uIf{hasRecv is true}{
        L2L($cells[0]$, $cells[0].child$)\;
    }
}
\BlankLine
\caption{Distributed L2L}
\end{algorithm}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Complex operators: P2P, M2L}
These two operators are more complex than the ones presented in the previous chapter.
In fact, it is very difficult to predict the communications between nodes.
Each step requires a pre-processing stage to find the potential communications and a gather to inform the other nodes about the needs.

\section{P2P}
To compute the P2P a leaf needs to know all its direct neighbors.
Even if the Morton indexing maximizes the locality, the neighbors of a leaf can be on any node.
Also, the tree used in our library is an indirection tree.
It means that only the leaves that contain particles are created.
That is the reason why, when we know that a leaf needs another one hosted on a different node, this other node may not be aware of the relation if the neighbor leaf does not exist in its own tree.
On the contrary, if this neighbor leaf exists, then the other node will also require the first leaf to compute its own P2P.
In our current version we first process each potential need to know which communications will be required.
Then the nodes perform an all gather to inform each other how much data they are going to send.
Finally they send and receive data asynchronously and overlap the communications with the P2P computations they can already do.
\BlankLine
\begin{algorithm}[H]
\restylealgo{boxed}
\linesnumbered
\SetLine
\KwData{none}
\KwResult{none}
\BlankLine
\ForAll{Leaf lf}{
    neighborsIndexes $\leftarrow$ $lf.potentialNeighbors()$\;
    \ForAll{index in neighborsIndexes}{
        \uIf{index belongs to another proc}{
            isend(lf)\;
            \emph{Mark lf as a leaf that is linked to another proc}\;
        }
    }
}
\emph{All gather how many particles to send to whom}\;
\emph{Prepare the buffers to receive the data}\;
\ForAll{Leaf lf}{
    \uIf{lf is not linked to another proc}{
        neighbors $\leftarrow$ $tree.getNeighbors(lf)$\;
        P2P(lf, neighbors)\;
    }
}
\emph{Wait send and recv if needed}\;
\emph{Put received particles in a fake tree}\;
\ForAll{Leaf lf}{
    \uIf{lf is linked to another proc}{
        neighbors $\leftarrow$ $tree.getNeighbors(lf)$\;
        otherNeighbors $\leftarrow$ $fakeTree.getNeighbors(lf)$\;
        P2P(lf, neighbors + otherNeighbors)\;
    }
}
\BlankLine
\caption{Distributed P2P}
\end{algorithm}
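The ``all gather how many particles to send to whom'' step can be sketched as follows; the vector holding the counts is an assumption, not the ScalFmm data structure.
\begin{lstlisting}
// Illustrative exchange of the send counts before the asynchronous P2P.
#include <mpi.h>
#include <vector>

// toSend[p] = number of particles this process will send to process p.
std::vector<int> gatherReceiveCounts(std::vector<int> toSend, MPI_Comm comm) {
    int rank = 0, nbprocs = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nbprocs);

    // Every process publishes its complete "to send" vector.
    std::vector<int> allCounts(nbprocs * nbprocs);
    MPI_Allgather(toSend.data(), nbprocs, MPI_INT,
                  allCounts.data(), nbprocs, MPI_INT, comm);

    // toReceive[p] = number of particles process p will send to me,
    // used to size the reception buffers before posting the irecv.
    std::vector<int> toReceive(nbprocs);
    for (int p = 0; p < nbprocs; ++p) {
        toReceive[p] = allCounts[p * nbprocs + rank];
    }
    return toReceive;
}
\end{lstlisting}
Once the counts are known, each process can allocate its reception buffers, post the asynchronous receives and overlap them with the P2P of the leaves that have no remote neighbor.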
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{M2L}
The M2L operator is relatively similar to the P2P.
Whereas the P2P is done at the leaf level, the M2L is done on several levels, from $Height - 2$ down to $2$.
At each level, a node needs access to all the distant neighbors of the cells it owns, and those neighbors can be hosted by any other node.
Nevertheless, each node can compute a part of the M2L with the data it already has.
The algorithm can be viewed as several tasks:
\begin{enumerate}
\item Compute to know which data have to be sent
\item All gather to know which data have to be received
\item Do all the computation we can without the data from the other nodes
\item Wait for the send/receive operations
\item Compute the M2L with the data we received
\end{enumerate}
\BlankLine
\begin{algorithm}[H]
\restylealgo{boxed}
\linesnumbered
\SetLine
\KwData{none}
\KwResult{none}
\BlankLine
\ForAll{Level idxLevel from 2 to Height - 2}{
    \ForAll{Cell c at level idxLevel}{
        neighborsIndexes $\leftarrow$ $c.potentialDistantNeighbors()$\;
        \ForAll{index in neighborsIndexes}{
            \uIf{index belongs to another proc}{
                isend(c)\;
                \emph{Mark c as a cell that is linked to another proc}\;
            }
        }
    }
}
\emph{Normal M2L}\;
\emph{Wait send and recv if needed}\;
\ForAll{Cell c received}{
    $lightOctree.insert( c )$\;
}
\ForAll{Level idxLevel from 2 to Height - 2}{
    \ForAll{Cell c at level idxLevel that are marked}{
        neighborsIndexes $\leftarrow$ $c.potentialDistantNeighbors()$\;
        neighbors $\leftarrow$ lightOctree.get(neighborsIndexes)\;
        M2L( c, neighbors)\;
    }
}
\BlankLine
\caption{Distributed M2L}
\end{algorithm}
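The light octree that stores the received cells only needs to map a Morton index at a given level to the corresponding cell data, so a plain ordered map is enough for a sketch; the actual ScalFmm structure is different and the types below are placeholders.
\begin{lstlisting}
// Minimal stand-in for the light octree holding the received distant cells.
#include <cstddef>
#include <map>
#include <vector>

typedef long long MortonIndex; // assumed representation
struct CellData;               // multipole data of a received cell (opaque here)

class LightOctree {
    // One map per level: Morton index -> received cell.
    std::map<MortonIndex, CellData*> cells[64]; // 64 = assumed maximum height
public:
    void insert(int level, MortonIndex index, CellData* cell) {
        cells[level][index] = cell;
    }
    // Return only the distant neighbors that were actually received: missing
    // indexes correspond to cells that do not exist (indirection tree).
    std::vector<CellData*> get(int level,
                               const std::vector<MortonIndex>& indexes) const {
        std::vector<CellData*> found;
        for (std::size_t i = 0; i < indexes.size(); ++i) {
            std::map<MortonIndex, CellData*>::const_iterator it =
                cells[level].find(indexes[i]);
            if (it != cells[level].end()) {
                found.push_back(it->second);
            }
        }
        return found;
    }
};
\end{lstlisting}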
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{thebibliography}{9}
\bibitem{itpc03}
Ananth Grama, George Karypis, Vipin Kumar, Anshul Gupta,
\emph{Introduction to Parallel Computing},
2nd Edition, Addison Wesley, Massachusetts, 2003.
\end{thebibliography}
\end{document}