Work on negative design; add todos

a336e730 · Sebastian Will · ea0882f4 · a336e730
Commit a336e730 authored 3 years ago by Sebastian Will
--- a/infrared-bookchapter.tex
+++ b/infrared-bookchapter.tex
@@ -134,6 +134,9 @@ fontsize=\footnotesize
 %

 \maketitle
+
+\TODO{Introduction still immature and incomplete. Check use of Vienna RNA package. Excursion 2 still needs to be written. Write last methods subsection about application-specific extensions (e.g. further constraints) and modifications(e.g. modified objective); example!?.}
+
 \begin{abstract}
    Applications in biotechnology and bio-medical research call for effective strategies to design novel RNAs with very specific properties. Such advanced design tasks require support by computational design but at the same time put high demands on the flexibility of computational tools and their expressivity to model the applications-specific requirements. To address such demands, we present the computational framework \Infrared. It supports to develop advanced customized design tools, which generate designs with specific properties, often in a few lines of Python code.
    This text guides the reader in tutorial-format through the development of a complex multi-target RNA design application. Thanks to \Infrared, it can be described as a step-by-step extension of an elementary design task: generating sequences that are compatible with a single RNA structure.
@@ -462,13 +465,11 @@ Due to the compositionality of constraints (and functions) in \Infrared, definin

 \begin{Pythoncode}
 model = Model(n,4)
-
 for k, target in enumerate(targets):
    bps = parse(target)
    model.add_constraints(BPComp(i,j) for (i,j) in bps)
    model.add_functions([BPEnergy(i, j, (i-1, j+1) not in bps)
                         for (i,j) in bps], f'energy{k}')
-
 model.add_functions([GCCont(i) for i in range(n)], 'gc')
 \end{Pythoncode}

@@ -480,9 +481,7 @@ for k, target in enumerate(targets):
    model.add_feature(f'Energy{k}', f'energy{k}',
        lambda sample, target=target:
            RNA.energy_of_struct(ass_to_seq(sample), target))
-
 sampler = Sampler(model)
-
 sampler.set_target(0.75*n, 0.01*n, 'gc')
 sampler.set_target( -15, 1, 'Energy0')
 sampler.set_target( -20, 1, 'Energy1')
@@ -555,9 +554,9 @@ for i in range(100):
 Note that in both cases, we control the \GC content of our designs. This is just one example how the core idea of finding designs by sampling can be extended by further targets or constraints. (Adding prior knowledge as IUPAC string, would be another.) Generally, this approach will be most successful for problems with a medium-sized solution space. This would also apply to many design problems, where only parts of the RNA should be redesigned, while others are kept constant. 

 \subsection{Larger single-target designs by constraint-generation}
-For larger and harder design instances, we can apply a constraint-generation strategy following \RNAPOND~\cite{Yao2001}.
-\DOTS
+For larger and harder single-target design instances, we suggested a constraint-generation strategy in \RNAPOND~\cite{Yao2001}. Here, we demonstrate a slightly stripped down version of this appraoch, which requires only a few lines of code using \Infrared.

+The method starts with the attempt to solve negative design by sampling (=positive design) as in the subsection before. When we fold the suggested designs, it can be observed that some base pairs that don't occur in the target, occur more frequently than others. This motivates us to identify these most frequent disruptive base pairs and forbid them in a next round of sampling. This strategy is iterated until a design (according to the mfe criterion) is discovered. In the code presented here, we decide to terminate the design attempt, when the problem becomes inconsistent or a certain complexity of the constraint model is exceeded.

 \begin{Pythoncode}
 def cg_design_iteration(dbps):
@@ -566,7 +565,7 @@ def cg_design_iteration(dbps):
    sampler = Sampler(model, lazy=True)
    sampler.set_target(0.7 * n, 0.1 * n, 'gc' )
    if sampler.treewidth() > 10 or not sampler.is_consistent():
-        return dbps[:-1], None
+        return dbps, "Not found"
    ctr = Counter()
    for i in range(steps):
        seq = ass_to_seq(sampler.targeted_sample())
@@ -581,46 +580,71 @@ dbps, seq = [], None
 while seq is None: dbps, seq = cg_design_iteration(dbps)
 print(seq)
 \end{Pythoncode}
+\TODO{Some more discussion of the appraoch and results. Include figure (show frequencies of disruptive base pairs in dot plot, shift of distribution \dots.}

 \subsection{Negative design by stochastic optimization search with partial resampling}

-Based on \Infrared we can moreover easily implement stochastic optimization strategies to optimize negative objectives for single or multi-target design. We show how \Infrared supports such strategies based on resampling of connected components of the model's network.
+\Infrared supports stochastic optimization strategies to optimize negative objectives for single or multi-target design. We are going to present a Metropolis-Hastings optimization strategy for optimizing the solutions from a multi-design models. It starts with a sample from the model and evaluates it by the objective function. Then, it iteratively picks a connected component of the dependency graph; constructs a model for resampling of the variables in this component; and generates a sample. Based on its evaluation by the objective function, the sample is either accepted or rejected based on the Metropolis-Hastings criterion.

-For this purpose, we define a method \texttt{multi\_design\_model} that returns a multi-target design network model for resampling. This model looks essentially like the one presented before, but fixes 
-all nucleotides outside of a specified \texttt{subset} to a provided \texttt{solution} (later, the last solution in stochastic search). This is accomplished by the code snippet
+For our example, we choose too minimize the multi-defect~\cite{Hammer2017}, which is a weighted sum of the distance of the target energies to the ensemble energy and the energy distance between the targets. Minimizing this function thus means to increase the probability of the targets and balance the target energies as much as possible (see function \texttt{multi\_defect} in Notes).

+Next, we define a function \texttt{multi\_design\_model} (see Notes) that returns a multi-target design network model (closely resembling the one presented before). In addition, crucially for the use in stochastic optimization, this model supports partial resampling. Given a 
+subset of the model variables and a complete solution, it fixes all variables outside of the subset to the solution. This is accomplished by the code snippet
 \begin{Pythoncode}
    for i in range(n):
        if i not in subset:
            value = solution.values()[i]
            model.restrict_domains(i,(value,value))
 \end{Pythoncode}
-In the method (full code given in Notes), we moreover avoid to introduce constraints on or between variables outside of the subset.
-
-We can now implement a Metropolis-Hastings Monte Carlo optimization algorithm as follows
+In this function (see Notes), we moreover avoid to introduce constraints on or between variables outside of the subset.

+For the stochastic optimization, we require additional imports
 \begin{Pythoncode}
-from random import random
+from random import random, choices
 from math import exp
-def optimize(create_model, objective, steps, temp):
-    model = create_model(None, None)
-    cur = Sampler(model).sample()
-    curval = objective(ass_to_seq(cur))
-    ccs = model.connected_components()
-    weights = [1/len(cc) for cc in ccs]
-    bestval = math.inf
+\end{Pythoncode}
+The function \texttt{multi\_design\_optimize} returns the best multi-target design and its multi-defect after a number of iterations (\texttt{steps}); a further parameter \texttt{temp} (temperature) controls the acceptance probability.
+\begin{Pythoncode}
+def multi_design_optimize(steps, temp):
+    cc, cur, curval, bestval = None, None, math.inf, math.inf
    for i in range(steps):
-        cc = random.choices(ccs,weights)[0]
-        new = Sampler(create_model(cc, cur)).sample()
-        newval = objective(ass_to_seq(new))
+        model = multi_design_model(targets, cur, cc)
+        new = Sampler(model).sample()
+        newval = multi_defect(ass_to_seq(new),targets,1)
        if (newval <= curval
-            or random() <= exp(-(newval - curval ) / temp)):
+            or random() <= exp(-(newval-curval)/temp)):
            cur, curval = new, newval
            if curval < bestval:
-                best,bestval = cur, curval
+                best, bestval = cur, curval
+        if i==0:
+            ccs = model.connected_components()
+            weights = [1/len(cc) for cc in ccs]
+        cc = choices(ccs,weights)[0]
    return (ass_to_seq(best), bestval)
 \end{Pythoncode}

+Finally, we run the multi-defect optimization on our example target structures by
+\begin{Pythoncode}
+multi_design_optimize(1000, 0.015)
+\end{Pythoncode}
+Note that here the number of 1000 iterations and the temperature 0.015 were chosen after some experimentation. In practice, we will often restart such procedures to obtain better solutions and/or a diverse set of good solutions. In our tests, one such optimization run took moderate run-time below 10s on current notebook hardware, while it can find very good designs for the targets. 
+The sequence \texttt{CCCCUUGCCUCAAGGGCCCUCUUCAGAGGAAGGGG} is a particularly good solution found by this strategy. For this design all three target structures have minimum free energy of -15.40 kcal/mol (at a multi-defect of 1.21).
+Even the close suboptimals contain only structures similar to the targets as can be seen from the output of \texttt{RNAsubopt} of the Vienna RNA package, which enumerates all structures within 1 kcal/mol of the minimum free energy.
+\begin{bashcode}
+$ RNAsubopt -s <<<CCCCUUGCCUCAAGGGCCCUCUUCAGAGGAAGGGG
+CCCCUUGCCUCAAGGGCCCUCUUCAGAGGAAGGGG -15.40   1.00
+((((((((((...))))((((....)))))))))) -15.40
+((((((.((((((((....))))..)))))))))) -15.40
+.((((((...)))))).(((((((....))))))) -15.40
+((((((.(((((((......)))..)))))))))) -15.10
+((((((.((((.(((....)))...)))))))))) -14.80
+((((((.((((.(((.....)))..)))))))))) -14.80
+((((((.((((..(((....)))..)))))))))) -14.80
+((((((.((((.((((...))))..)))))))))) -14.60
+(((((((((.....)))((((....)))))))))) -14.50
+((((((.((((.((......))...)))))))))) -14.50
+.((((((...))))))(((.((((....))))))) -14.50
+\end{bashcode}

 \subsection{Further application-specific design goals}

@@ -666,37 +690,29 @@ def multi_defect(sequence, targets, xi=1):
    return diff_ee + xi * diff_targets
 \end{Pythoncode}

-\subsection{Multi-design model with support of partial resampling}
+\subsection{Multi-design model with support for partial resampling}

 \begin{Pythoncode}
-def multi_design_model(targets,subset=None,solution=None):
+def multi_design_model(targets, subset=None, solution=None):
    n = len(targets[0])
    model = Model(n, 4)
    if subset is None: subset = set(range(n))
-
-    # fix the entries outside of subset from solution
-    for i in range(n):
-        if i not in subset:
-            value = solution.values()[i]
-            model.restrict_domains(i,(value,value))
-
-    model.add_functions([GCCont(i) for i in range(n) if i in subset], 'gc')
-
+    for i in set(range(n))-subset:
+        value = solution.values()[i]
+        model.restrict_domains(i,(value,value))
+    model.add_functions([GCCont(i) for i in subset], 'gc')
    for target in targets:
-        bps = parse(target)
-        bps = [(i,j) for (i,j) in bps if i in subset or j in subset]
-        model.add_constraints(BPComp(i, j) for (i, j) in bps)
-        model.add_functions([BPEnergy(i, j, (i-1, j+1) not in bps)
-            for (i,j) in bps], 'energy')
-
+        s = parse(target)
+        ss = [(i,j) for (i,j) in s if i in subset or j in subset]
+        model.add_constraints(BPComp(i, j) for (i, j) in ss)
+        model.add_functions([BPEnergy(i, j, (i-1, j+1) not in s)
+            for (i,j) in ss], 'energy')
    model.set_feature_weight(-1, 'energy')
    model.set_feature_weight(-0.3, 'gc')
-
    return model
 \end{Pythoncode}


-
 \bibliographystyle{plain}
 \bibliography{biblio}