\begin{Pythoncode}
sampler.set_target(0.75*n, 0.01*n, 'gc')
samples = [sampler.targeted_sample() for _ in range(1000)]
\end{Pythoncode}
Note the use of \texttt{Sampler}'s method \texttt{targeted\_sample()} in place of \texttt{sample()}. This method provides access to an automatic mechanism that
returns only samples within the tolerance from the target. To make such a rejection strategy effective, we iteratively sample, estimate the current mean of the feature value and then update the feature weight. Concretely, \Infrared implements a form of multi-dimensional Boltzmann sampling~\cite{Bodini2010} as applied in \RNARedPrint~\cite{hammer2019fixed}.
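To make the underlying idea concrete, the following self-contained sketch mimics such a targeting loop for the single feature \GC content. It does not use \Infrared's classes: the toy generator \texttt{draw\_sample}, the batch size, and the learning rate \texttt{gamma} are purely illustrative choices.
\begin{Pythoncode}
import math, random

def gc_content(seq):
    return sum(c in "GC" for c in seq) / len(seq)

def draw_sample(weight, n):
    # toy stand-in for Boltzmann sampling: bias the per-position
    # GC probability by the current feature weight
    p_gc = 1 / (1 + math.exp(-weight))
    return "".join(random.choice("GC" if random.random() < p_gc else "AU")
                   for _ in range(n))

def targeted_samples(target, tolerance, n, num_samples, gamma=1.0):
    weight, accepted = 0.0, []
    while len(accepted) < num_samples:
        batch = [draw_sample(weight, n) for _ in range(100)]
        values = [gc_content(s) for s in batch]
        # keep only samples within the tolerance around the target ...
        accepted += [s for s, v in zip(batch, values)
                     if abs(v - target) <= tolerance]
        # ... and shift the distribution towards the target by
        # updating the weight based on the current estimated mean
        mean = sum(values) / len(values)
        weight += gamma * (target - mean)
    return accepted[:num_samples]

samples = targeted_samples(target=0.75, tolerance=0.01, n=35, num_samples=10)
\end{Pythoncode}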
\subsection{Controlling energy---Multiple features}
While, for instructional purposes, we first presented how to target \GC content, the same mechanism lets us control further features, in particular the energies of the target structures.
Similar to the \GC content, RNA energy can be modeled as a sum of function values. This holds even for the detailed nearest-neighbor energy model of RNAs, where energy is composed of empirically determined or trained loop energies~\cite{Turner2010,Andronescu2007}. Here, we focus on the much simpler base pair energy model, which has been demonstrated to be an effective proxy for the Turner (nearest neighbor) model in design applications~\cite{hammer2019fixed}.
In this simple model, every type of base pair (A-U, C-G, or G-U) receives a different energy. To define the feature \texttt{energy}, we impose a function for each base pair \texttt{(i,j)} in the target structure and moreover distinguish terminal and non-terminal base pairs, simply by \texttt{(i-1, j+1) not in bps}. \Infrared provides a default parameterization, which was originally trained for use with \RNARedPrint~\cite{hammer2019fixed}.
\begin{Pythoncode}
bps = parse(target)
model.add_functions([BPEnergy(i, j, (i-1, j+1) not in bps)
                     for (i, j) in bps], 'energy')
\end{Pythoncode}
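For intuition, the value of the \texttt{energy} feature of a concrete sequence is simply a sum over the base pairs of the target structure, where each pair contributes an energy depending on its type (A-U, C-G, or G-U) and on whether it is terminal. The following stand-alone sketch spells this out with made-up energy parameters; these are \emph{not} \Infrared's trained defaults, and \texttt{bp\_energy} is a hypothetical helper rather than part of the library.
\begin{Pythoncode}
# illustrative (made-up) energies per pair type:
# first entry: non-terminal base pair, second entry: terminal base pair
PAIR_ENERGY = {
    frozenset("CG"): (-3.0, -2.0),
    frozenset("AU"): (-2.0, -1.0),
    frozenset("GU"): (-1.0, -0.5),
}

def bp_energy(seq, bps):
    """Base pair energy of sequence seq w.r.t. the base pairs bps."""
    total = 0.0
    for (i, j) in bps:
        terminal = (i - 1, j + 1) not in bps
        pair = frozenset((seq[i], seq[j]))
        # the boolean 'terminal' selects the tuple entry (False->0, True->1);
        # non-canonical pairs contribute nothing in this toy model
        total += PAIR_ENERGY.get(pair, (0.0, 0.0))[terminal]
    return total
\end{Pythoncode}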
Finally, as expected, \Infrared will indeed generate sequences that are compatible with all of the target structures.
\begin{Pythoncode}
samples = [sampler.targeted_sample() for _ in range(10)]
\end{Pythoncode}
At this point, we arrived at reimplementing the essential functionality of \RNARedPrint~\cite{hammer2019fixed} in the \Infrared framework. A full-fledged \Infrared-based implementation with command line interface is moreover provided, as \RNARedPrint~2.x, at \url{https://gitlab.inria.fr/amibio/RNARedPrint}.
\subsubsection{Excursion 2: A deeper dive into \Infrared's sampling engine}
While the constraint modeling syntax of \Infrared makes it natural to set several target structures and to target a specific \GC content and specific energies for them, it is not at all obvious \emph{a priori} how the system effectively generates solutions that satisfy the constraints and hit the targeted properties. It is therefore worthwhile to take a closer look at the workings of this machinery, in order to understand the computational possibilities and limitations of the framework. In the following, we describe the network of constraints and functions together with its tree decomposition, explain why the complexity depends on the tree width, and hint at the generality of this method.
\paragraph{Generation of samples from a multi-dimensional Boltzmann distribution.}
For generating samples, \Infrared implements a general solving strategy based on tree decompositions and cluster tree elimination (CTE)~\cite{Dechter..}. Such techniques are well known in the (larger) context of constraint processing~\cite{Dechter..}; more recently, we described this approach specialized to multi-target RNA design in \RNARedPrint~\cite{hammer2019fixed}.
The cluster tree elimination scheme yields a fixed-parameter tractable algorithm to compute (partial) partition functions, which lets us generate samples from a multi-dimensional Boltzmann distribution.
In our example, this means that we can efficiently generate samples with probabilities proportional to
\begin{equation*}
\exp\Big(w_{\text{gc}}\cdot\operatorname{GC}(s) + \sum_{k=0}^{2} w_{\text{energy}_k}\cdot\operatorname{energy}_k(s)\Big).
\end{equation*}
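Spelled out for a single candidate sequence, this unnormalized Boltzmann weight is just the exponential of a weighted sum of feature values. Reusing the hypothetical helpers \texttt{gc\_content} and \texttt{bp\_energy} from the sketches above (with arbitrary weights), it could be computed as follows:
\begin{Pythoncode}
import math

def boltzmann_weight(seq, w_gc, w_energy, target_bps):
    """Unnormalized multi-dimensional Boltzmann weight of seq;
    target_bps lists the target structures as base pair lists."""
    value = w_gc * gc_content(seq)
    for w_k, bps_k in zip(w_energy, target_bps):
        value += w_k * bp_energy(seq, bps_k)
    return math.exp(value)
\end{Pythoncode}
Of course, the engine never evaluates this quantity sequence by sequence; the point of the machinery described next is to sample from this distribution without enumerating sequences.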
Due to the CTE scheme, the computation is based on a tree decomposition of the network of dependencies induced by the constraints and functions of the model (also known as the \emph{dependency graph}); see Figure~\ref{fig:dependency-graph}. The tree decomposition allows us to recursively compute partial partition functions for its subtrees by processing the variables in its bags in bottom-up order. Moreover, this computation can be performed efficiently by dynamic programming, which tabulates partial results that would otherwise be re-computed redundantly.
After all partition functions are computed, each sample is generated in a backtrace running from the root to the leaves. In this way, whenever a new variable is introduced in the depth-first/top-down traversal of the tree decomposition, its value can be chosen with the correct probability to generate the desired multi-dimensional Boltzmann distribution.
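The two passes can be illustrated on a drastically simplified instance: a chain of variables over the nucleotide alphabet with one function between each pair of neighboring variables, whose tree decomposition is just a path of two-variable bags. The sketch below is self-contained and purely illustrative (toy function, arbitrary weight); \Infrared's engine realizes the same bottom-up/top-down scheme for arbitrary tree decompositions, constraints, and functions.
\begin{Pythoncode}
import math, random

DOMAIN = "ACGU"

def pair_score(a, b):
    # toy function on neighboring variables (arbitrary choice)
    return 1.0 if {a, b} in ({"A", "U"}, {"C", "G"}, {"G", "U"}) else 0.0

def partition_functions(n, weight):
    # bottom-up pass: Z[i][a] is the partial partition function of the
    # variables X_i,...,X_{n-1} given X_i = a
    Z = [{a: 1.0 for a in DOMAIN} for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for a in DOMAIN:
            Z[i][a] = sum(math.exp(weight * pair_score(a, b)) * Z[i + 1][b]
                          for b in DOMAIN)
    return Z

def sample(Z, weight):
    # top-down stochastic backtrace: each newly introduced variable is
    # drawn with probability proportional to its partial partition function
    seq = random.choices(DOMAIN, weights=[Z[0][a] for a in DOMAIN])
    for i in range(1, len(Z)):
        ws = [math.exp(weight * pair_score(seq[-1], b)) * Z[i][b]
              for b in DOMAIN]
        seq += random.choices(DOMAIN, weights=ws)
    return "".join(seq)

Z = partition_functions(10, weight=2.0)
samples = [sample(Z, weight=2.0) for _ in range(5)]
\end{Pythoncode}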
This solving strategy explains why \Infrared can sample from one network faster than from another. Tree-like networks of dependencies are processed quickly, while complex cyclic dependencies require tree decompositions with more variables per bag, since valid tree decompositions must satisfy certain conditions w.r.t.\ the dependencies in the network (which in turn guarantee the correctness of the dynamic programming evaluation).
Finally, since the computation requires enumerating all possible sub-assignments of each bag, the computation time is exponential in the maximum number of variables per bag---this complexity is commonly described in terms of the tree width, which is defined as this number minus one.
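To get a feeling for this parameter, one can inspect the width of heuristically computed tree decompositions of small dependency graphs, e.g.\ with the \texttt{networkx} library (used here purely for illustration; \Infrared computes its own tree decompositions internally):
\begin{Pythoncode}
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

n = 20
# chain-like dependencies, e.g. between consecutive positions only
chain = nx.path_graph(n)
# the same chain plus crossing dependencies, e.g. from a second structure
crossed = nx.path_graph(n)
crossed.add_edges_from((i, n - 1 - i) for i in range(n // 2))

for name, graph in [("chain", chain), ("crossed", crossed)]:
    width, _decomposition = treewidth_min_degree(graph)
    print(name, "tree width (upper bound):", width)
\end{Pythoncode}
The chain yields tree width one, while the crossing dependencies force larger bags and hence a higher width.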
\begin{figure}
\centering
\includegraphicscenter[width=0.7\textwidth]{Figs/example-dependency_graph}\hfill%
\includegraphicscenter[width=0.27\textwidth]{Figs/example-treedecomp}
\caption{\textbf{(Left)} Dependency graph of the multi-target design model, showing the dependencies between the variables $X_0,\dots,X_{34}$ of this model. \textbf{(Right)} A tree decomposition of this graph, which puts the variables into bags (only variable indices are shown). The directed edges are labeled with the indices of the variables introduced by the child bag. This decomposition has a tree width of two, since its largest bags contain three (tree width plus one) variables.}
\label{fig:dependency-graph}
\end{figure}
\paragraph{Targeting specific properties.} To target specific properties like a certain \GC content and prescribed energies of the target structures, \Infrared uses the just described sampling engine iteratively: it samples from a multi-dimensional Boltzmann distribution, evaluates the generated samples w.r.t.\ the targets of the individual features, and updates the feature weights. By suitable updates of the weights, the distribution is shifted towards the targeted feature values, which increases the probability of satisfying the targets (within the given tolerance). During this entire learning procedure, \Infrared returns only samples inside of the tolerance range and rejects all others. In this way, \Infrared implements a variant of multi-dimensional Boltzmann sampling~\cite{Bodini2010}, which lets it handle the kind of complex constraints that are set by targeting tolerance ranges for features composed from `local' functions.
As a consequence of this mechanism, the sampling efficiency of \Infrared depends on the complexity of the constraint network as well as on the (in)dependence of the targeted features and the demanded tolerances. For practical applications, it is thus advantageous to be aware of the solving strategy and of how these factors interact. This is especially important since the framework easily allows modeling extremely hard problems, while (as we demonstrate) it remains useful for a wide range of applications in practice; these properties make the system attractive for a variety of complex design applications, but they also intrinsically delimit its applicability.
\subsection{Negative design by direct sampling}
Good RNA designs typically must satisfy certain constraints and show high