-
Sebastian Will authoredSebastian Will authored
infrared-bookchapter.tex 48.38 KiB
%%%%%%%%%%%%%%%%%%%%%%% file typeinst.tex %%%%%%%%%%%%%%%%%%%%%%%%%
%
% This is the LaTeX source for the instructions to authors using
% the LaTeX document class 'llncs.cls' for contributions to
% the Lecture Notes in Computer Sciences series.
% http://www.springer.com/lncs Springer Heidelberg 2006/05/04
%
% It may be used as a template for your own input - copy it
% to a new file with a new name and use it as the basis
% for your article.
%
% NB: the document class 'llncs' has its own and detailed documentation, see
% ftp://ftp.springer.de/data/pubftp/pub/tex/latex/llncs/latex2e/llncsdoc.pdf
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[runningheads,a4paper]{llncs}
\usepackage{amssymb}
\setcounter{tocdepth}{3}
\usepackage{graphicx}
\usepackage[utf8x]{inputenc}
\usepackage{amsmath}
\usepackage{xspace}
\usepackage{todonotes}
\newcommand{\TODO}[1]{\todo[inline,backgroundcolor=blue!20,linecolor=blue!20!black]{\sffamily Todo: #1}}
\usepackage{url}
\newcommand{\keywords}[1]{\par\addvspace\baselineskip
\noindent\keywordname\enspace\ignorespaces#1}
\usepackage{fancyvrb}
\definecolor{shadecolor}{rgb}{.95, .95, .99}
\newenvironment{shellout}%
{\snugshade\scriptsize\verbatim}%
{\endverbatim\endsnugshade}
\usepackage{xcolor}
\usepackage{minted}
\newminted{Python}{%
autogobble,
frame=lines,
framesep=2mm,
baselinestretch=1.2,
bgcolor=shadecolor,
%fontsize=\footnotesize
}
\newminted{bash}{%
autogobble,
breaklines,
breakbefore=-/,
frame=lines,
framesep=2mm,
baselinestretch=1.2,
bgcolor=shadecolor,
fontsize=\footnotesize
}
\newcommand{\DOTS}{\textcolor{red}{\textbf{[\dots]}}\xspace}
\usepackage{align-objs}
\begin{document}
\mainmatter % start of an individual contribution
% first the title is needed
\title{Developing complex RNA design applications in the \Infrared framework}
% a short form should be given in case it is too long for the running head
\titlerunning{The \Infrared framework for RNA design}
% the name(s) of the author(s) follow(s) next
%
% NB: Chinese authors should write their first names(s) in front of
% their surnames. This ensures that the names appear correctly in
% the running heads and the author index.
%
%\newcommand{\etal}{ \emph{et al}~}
%\newcommand{\aka}{{a.k.a.}\xspace}
%\newcommand{\ie}{{i.e.}\xspace}
%\newcommand{\cf}{{c.f.}\xspace}
%\newcommand{\eg}{{e.g.}\xspace}
%\newcommand{\wrt}{w.r.t.\xspace}
\newcommand{\Base}[1]{\text{{\sffamily#1}}}
\newcommand{\Ab}{\Base{A}\xspace}
\newcommand{\Cb}{\Base{C}\xspace}
\newcommand{\Gb}{\Base{G}\xspace}
\newcommand{\Ub}{\Base{U}\xspace}
\newcommand{\GC}{\Base{GC}\xspace}
\newcommand{\ProblemIO}[2]{\vspace{.5em}\newline\vspace{.5em}{\bfseries Input:} #1\\[.3em]{\bfseries Output:} #2}
\newcommand{\software}[1]{{\sffamily#1}\xspace}
\newcommand{\Infrared}{\software{Infrared}}
\newcommand{\IncARNation}{\software{IncARNation}}
\newcommand{\RNARedPrint}{\software{RNARedPrint}}
\newcommand{\RNAPOND}{\software{RNAPOND}}
\newcommand{\RNABluePrint}{\software{RNABluePrint}}
\newcommand{\RNAinverse}{\software{RNAinverse}}
\newcommand{\RefLIX}{1}
\newcommand{\RefMcGill}{2}
\author{Hua-Ting Yao$^\text{\RefLIX,\RefMcGill}$ \and Yann Ponty$^\text{\RefLIX}$ \and Sebastian Will$^\text{\RefLIX}$}
%
\authorrunning{Yao, Ponty and Will}
% (feature abused for this document to repeat the title also on left hand pages)
% the affiliations are given next; don't give your e-mail address
% unless you accept that it will be published
\institute{LIX, CNRS UMR 7161, Ecole Polytechnique, Institut Polytechnique de Paris, Palaiseau, France
\and School of Computer Science, McGill University, Montreal, Canada}
%
% NB: a more complex sample for affiliations and the mapping to the
% corresponding authors can be found in the file "llncs.dem"
% (search for the string "\mainmatter" where a contribution starts).
% "llncs.dem" accompanies the document class "llncs.cls".
%
\maketitle
\TODO{Introduction still immature and incomplete. Check use of Vienna RNA package---ideally, most of the tutorial would not require the package (this certainly, does not work for all the negative design---so, we have to be careful how to communicate this and not to lock out all the Windows users). Write last methods subsection about application-specific extensions (e.g. further constraints) and modifications(e.g. modified objective); example!?.}
\begin{abstract}
Applications in biotechnology and bio-medical research call for effective strategies to design novel RNAs with very specific properties. Such advanced design tasks require support by computational design but at the same time put high demands on the flexibility of computational tools and their expressivity to model the applications-specific requirements. To address such demands, we present the computational framework \Infrared. It supports to develop advanced customized design tools, which generate designs with specific properties, often in a few lines of Python code.
This text guides the reader in tutorial-format through the development of a complex multi-target RNA design application. Thanks to \Infrared, it can be described as a step-by-step extension of an elementary design task: generating sequences that are compatible with a single RNA structure.
Finally, we point to further possibilities beyond the presented examples and discuss properties.
\keywords{RNA design; Kinetic landscapes; Riboswitches}
\end{abstract}
\section{Introduction}
\newcommand{\Def}[1]{{\bfseries #1}}
Designing molecules with novel functionality or very specific desirable properties for applications in biological fundamental research, biotechnology, and medicine, is a highly complex task that typically requires interdisciplinary efforts, combining biochemical experimentation and computational design. In several ways, designing RNAs can be even more attractive than designing proteins. On the one hand, functional RNA molecules save the detour of translation into proteins, and can therefore act more efficiently, e.g. as fast on/off-switches of gene activity. On the other hand, the design process itself can build on the well-understood combinatorics of RNA secondary structure and available computational models and algorithms.
Still, the supporting RNA design computationally is highly demanding. First of all, RNA design is an optimization problem with often complex objectives \DOTS with respect to multiple (secondary) structures \DOTS
Moreover, RNA design is computationally complex even in simple problem variants. For example, one cannot efficiently design an RNA that preferentially folds into a single given target structure in the nearest-neighbor energy model, since this problem is NP-hard.
Here we present the framework \Infrared, which addresses the multiple demands of computational RNA design in several ways:
\begin{itemize}
\item To address the issues of computational complexity, it follows the classical decomposition of RNA design into two related sub-problems, often called \textbf{positive and negative design}. Positive RNA design aims at very specific properties (e.g. specific energy) for certain `target' RNA structures, while negative design additionally aims to avoid similar properties for all (exponentially many) other, `non-target' structures. While efficient algorithms for the latter problem can not exist,
the system automatically derives fixed-parameter tractable algorithms to solve the (sub-problem) positive design efficiently. In many cases, this can already solve negative design by searching through relatively few samples of good positive designs. To address harder negative design tasks, we demonstrate constraint generation as well as classic stochastic optimization techniques as supported by the system.
\item Real-world RNA design applications typically demand for targeting thermodynamic criteria referring to multiple target structures (e.g. on and off-states of riboswitches, binding pockets of aptamers and competing structures in specific energetic relations), potentially under side constraints like moderate \GC-content or avoidance of specific sequence motifs. Thus, we design Infrared as a library that supports programmers to develop complex and potentially novel design strategies based on a declarative constraint framework. This allows application programmers to harness the power of fixed-parameter tractable sampling of designs in an easy to use system.
In this chapter, we guide the reader through the use of this library and show by examples how to develop and run design programs from IPython notebooks. Further examples, covering as well non-design applications, are provided as part of the documentation of \Infrared.
In addition it should be mentioned that we published complete design tools based on the library for the specific applications of the generation of multi-target designs (\RNARedPrint v2, as reimplementation of \RNARedPrint~\cite{hammer2019fixed}) and negative RNA design (\RNAPOND~\cite{Yao2021}), which we hope are both directly useful to some readers. These tools can as well serve as further examples for the use of \Infrared in command line tools.
\end{itemize}
\begin{figure}
\centering
\includegraphicscenter[scale=0.6]{Figs/example-target1}\qquad
\includegraphicscenter[scale=0.6]{Figs/example-target2}\qquad
\includegraphicscenter[scale=0.6]{Figs/example-target3}
\bigskip
\centering
\begin{minipage}{0.4\textwidth}
\begin{verbatim}
((((((((((...))))((((....))))))))))
((((((.((((((((....))))..))))))))))
.((((((...)))))).(((((((....)))))))
\end{verbatim}
\end{minipage}
\caption{Three RNA target structures for RNA design, which will serve us as running examples.
They are shown in 2D representation (by {\tt VARNA}~\cite{Darty2009}) and as dot-bracket strings. The latter
represent base pairs by balanced parentheses.}
\label{fig:target-structures}
\end{figure}
Let us illustrate some fundamental ideas of the \Infrared framework with a first simple example. After importing \Infrared
\begin{Pythoncode*}{linenos=true}
from infrared import *
from infrared.rna import *
\end{Pythoncode*}
a few lines of code let us generate 10 random (uniformly sampled) designs that can fold into the first structure of Figure~\ref{fig:target-structures}.
\begin{Pythoncode*}{linenos=true,firstnumber=last}
target = "((((((((((...))))((((....))))))))))"
model = Model(len(target), 4)
model.add_constraints(BPComp(i,j) for (i,j) in parse(target))
sampler = Sampler(model)
samples = [sampler.sample() for _ in range(10)]
\end{Pythoncode*}
While this code doesn't yet come close to exploiting the power of the system, it demonstrates the main
structure of using \Infrared's FPT sampling engine. After a short prelude and defining the input, the users set's up a constraint model, here defining that we want to sample one of 4 nucleotides for each sequence position. Moreover by the constraints \texttt{BPComp}, we require
the nucleotides of base pairs to be complementary.
Finally, we generate a sampler for this constraint model, which specifies the entire problem, and generate 10 samples.
A slight extension of this example provides a first glimpse at the possibilities in \Infrared. Generating uniformly sampled sequences for multi-target design, analogously to \RNABluePrint~\cite{Hammer2017}, requires only a slight extension of the above model.
To target the three structures of Figure~\ref{fig:target-structures}, we define
\begin{Pythoncode}
targets = ["((((((((((...))))((((....))))))))))",
"((((((.((((((((....))))..))))))))))",
".((((((...)))))).(((((((....)))))))"]
\end{Pythoncode}
Then, we simply add constraints for all these structures to our model. For this purpose, we rewrite line 6 of the previous example to loop over the targets
\begin{Pythoncode}
for target in targets:
model.add_constraints(BPComp(i,j) for (i,j) in parse(target))
\end{Pythoncode}
and generate samples from this model in exactly the same way as before. The involved computation and algorithmic structure required by the multi-target case is handled transparently "under the hood". Internally, the problem is solved by an efficient algorithm that automatically adapts in complexity to the dependency network due to the multiple target structures.
\paragraph{Chapter overview.} We continue this chapter by describing the installation of the library and recommended additional software. In Section Methods, we systematically develop a complex multi-target design application using the Infrared framework. We will start by targeting positive design objectives and then discuss the integtration of negative objectives. Finally, we provide an out-look at more complex design constraints.
\section{Material: Installing \Infrared }
\label{sec:installation}
We recommend installing \Infrared using the package manager
\software{Conda} as described below. Alternatively but less conveniently, the software can be compiled and installed from its source found on Gitlab \url{https://gitlab.inria.fr/amibio/Infrared}. In this chapter, we describe the release 1.0 of the software.
\subsubsection{Package manager \software{Conda} installation.}
Unless \software{Conda} is already installed on you system, we recommend to install it in the form of \software{Miniconda} from
\url{https://conda.io/en/latest/miniconda.html}. The page contains installation instructions for Windows, MacOs and Linux.
\subsubsection{\Infrared installation.}
To use Conda, it typically has to be activated by running the shell command
\begin{bashcode}
conda activate
\end{bashcode}
in a terminal.
All required and recommended software can be installed from the command line by
\begin{bashcode}
conda install -c conda-forge -c bioconda infrared viennarna jupyter
\end{bashcode}
The command installs the packages \texttt{infrared}, \texttt{viennarna}, and \texttt{jupyter}; the flags \texttt{-c conda-forge -c bioconda} specify the required channels. For the largest part of the tutorial, only the package \texttt{\bf infrared} is required. As of yet, the Vienna RNA package can not be installed under Windows via Conda; Windows users must therefore remove \texttt{\bf viennarna} from the installation command. Note that there is as well no other convenient way to install the \emph{Python interface} to the Vienna RNA package on Windows. Nevertheless, most of the the examples can be run without Python bindings to the Vienna RNA package. Package \texttt{\bf jupyter} will allow running the code of this tutorial as IPython notebook by
\begin{bashcode}
jupyter-notebook bookchapter-tutorial.ipynb
\end{bashcode}
This notebook and other tutorials and example notebooks are provided at...
\DOTS
\section{Methods}
Note that all example code can be run from a provided IPython notebook, available for download from
Infrared's Gitlab repository (\url{https://gitlab.inria.fr/amibio/Infrared/-/tree/v1.0/Doc/bookchapter.ipynb}).
\subsection{Elementary use of \Infrared---A simple design model}
Recall the first example from the introduction. We want to generate
sequences compatible with a single target structure. In Infrared this idea
can be expressed in the form of a constraint/function network model.
For this purpose, given a target length $n$
\begin{Pythoncode}
n = 35
\end{Pythoncode}
we set up a new model with \texttt{n} variables $X_0,\dots,X_{n-1}$ that each can take one of four different values (representing the nucleotides A,C,G,U by integers 0,1,2,3).
\begin{Pythoncode}
model = Model(n, 4)
\end{Pythoncode}
This is simply the \Infrared-way of modeling the solutions of the design task, namely nucleotide sequences of length $n$.
%
Next, we restrict the solution space by enforcing compatibility to a
target structure (again from the running example). This is accomplished by adding a complementarity
constraint for each base pair
\begin{Pythoncode}
target = "((((((((((...))))((((....))))))))))"
model.add_constraints(BPComp(i,j) for (i,j) in parse(target))
\end{Pythoncode}
The constraint \texttt{BPComp(i,j)} is \emph{valid} (or
\emph{satisfied}) if the variables $X_i$ and $X_j$ represent complementary
nucleotides (ie. one pair of A-U, C-G, G-C, G-U, U-A, U-G).
%
Technically, solutions are
assignments that assign to each of the variables $X_0,\dots,X_{n-1}$ a
value of its domain $\{0,1,2,3\}$; moreover they must be \emph{valid}
assignments that satisfy all the constraints.
%
Once we have a model, which defines the solution space, we can instantiate a sampler
\begin{Pythoncode}
sampler = Sampler(model)
\end{Pythoncode}
which allows us to draw sample solutions by calls to \texttt{sample()}, e.g. we obtain 10 sampled designs by
\begin{Pythoncode}
samples = [sampler.sample() for _ in range(10)]
\end{Pythoncode}
Recall that nucleotides are represented by numbers, such that we typically
want to convert sampled assignments to sequences (over A, C, G, U); for this purpose,
one typically applies\Infrared's function \texttt{ass\_to\_seq()}, like
\begin{Pythoncode}
sequences = [ass_to_seq(sample) for sample in samples]
\end{Pythoncode}
Without further specifications,
such solutions will be sampled from the uniform distribution.
\subsection{Imposing additional constraints---Sequence constraints in IUPAC code}
As one of the main of the constraint paradigm, more complex tasks can be specified compositionally. This allows us
to tailor the solution space very naturally by imposing additional constraints.
In design problems it is often useful to express prior knowledge on the sequence design space in IUPAC nucleotide encoding. To provide just one (arbitrary) example, we could specify the loop regions as \text{RYY} and \text{GNRA}, as well as the left and right most base as strong (\texttt{S}) by the constraint string
\begin{Pythoncode}
iupac_sequence = "SNNNNNNNNNRYYNNNNNNNNGNRANNNNNNNNNS"
\end{Pythoncode}
% "((((((((((...))))((((....))))))))))"
Corresponding constraints can then be added to our model by
\begin{Pythoncode}
for i, x in enumerate(iupac_sequence):
model.add_constraints(ValueIn(i, iupacvalues(x)))
\end{Pythoncode}
Here, we make use of \Infrared's function \texttt{iupacvalues()} which returns the possible nucelotides values allowed by a IUPAC symbol and the
constraint \texttt{ValueIn} which constrains the allowed values of a variable.
\subsection{Functions and features---Control of \GC content}
\Infrared offers many ways to build on this initial design model and tailor
it towards more sophisticated design and very specific applications.
For a start, we want to generate sequences with control over their \GC content.
In Infrared, we can define the \GC content as a feature of solutions, which itself is specified by a group of functions.
For this purpose, we extend the previous model by
\begin{Pythoncode}
model.add_functions([GCCont(i) for i in range(n)], 'gc')
\end{Pythoncode}
This defines one function \texttt{GCCont} for each variable (or nucleotide)
and add them to the model (all in the same function group named \texttt{gc}).
Each single function has value 1, if the respective variable $X_i$
represents \Cb or \Gb and value 0, otherwise. \Infrared automatically
derives a feature with name \texttt{gc} that sums over all these function values.
Thus, it counts the \Gb and \Cb nucleotides (reflecting the \GC content as the \emph{feature value}).
\subsubsection{Shifting the \GC content distribution.} Infrared offers two major ways to make use of this feature.
In the simpler case, one can control the sample distribution by setting the
weight of the feature before constructing the sampler and drawing samples, e.g.
\begin{Pythoncode}
model.set_feature_weight(1, 'gc')
sampler = Sampler(model)
samples = [sampler.sample() for _ in range(1000)]
\end{Pythoncode}
Plotting sequence logos and histograms of the \GC contents of these samples shows that it is no longer uniformly distributed, but the sample distribution is shifted towards higher \GC content. Negative weights shift to lower \GC content, while 0 keeps the distribution uniform (Fig.~\ref{fig:gc-content}).
Technically, the samples $s$ are generated from a Boltzmann distribution, depending on the weight $w_{\text{gc}}$ of feature \texttt{gc},
i.e. their probabilities are proportional to the Boltzmann weights based on their \GC contents $\operatorname{GC}(s)$:
\begin{equation*}
exp(w_{\text{gc}} \cdot \operatorname{GC}(s)).
\end{equation*}
\begin{figure}
\hspace{-0.3cm}\includegraphics[height=2.3cm]{Figs/gc_content_minus-logo}\hspace{-0.5cm}\includegraphics[height=2.3cm]{Figs/gc_content_minus-hist}
\hspace{-0.3cm}\includegraphics[height=2.3cm]{Figs/gc_content_zero-logo}\hspace{-0.5cm}\includegraphics[height=2.3cm]{Figs/gc_content_zero-hist}
\hspace{-0.3cm}\includegraphics[height=2.3cm]{Figs/gc_content_plus-logo}\hspace{-0.5cm}\includegraphics[height=2.3cm]{Figs/gc_content_plus-hist}
\caption{Sequence logos and GC-content histograms for each time 1000 samples with respective weights -1, 0, and 1 (top, middle, bottom) of the feature \texttt{gc}. The sequences are compatible to the first example target structure and the IUPAC string \texttt{SNNNNNNNNNRYYNNNNNNNNGNRANNNNNNNNNS}.}
\label{fig:gc-content}
\end{figure}
\subsubsection{Excursion 1: a first glimpse at the \Infrared sampling engine.}
Having seen a few sampling examples, we are ready to talk about the computations behind the scenes. First observe that variables not involved in base pairs can apparently be sampled independently (and in a rather straightforward manner: either uniformly or with probability proportional to their Boltzmann factor).
It's more interesting how to sample variables $X_i$ and $X_j$ connected by a base pair $(i,j)$. Since determining the variables simultaneously does not generalize well, we handle one after the other. Choosing the value of the variable $X_i$ before $X_j$ requires looking ahead to the possible solutions in combination with $X_j$. In uniform sampling, we can choose based on the number of possible solutions for each value in $X_i$'s domain. For example, choosing $\Gb$ leaves two values for $X_j$ to satisfy the complementarity constraint, whereas $\Cb$ leaves exactly one; consequently, $\Gb$ must be selected two times as often as $\Cb$ if we want to sample uniformly.
Remarkably, sampling from a Boltzmann distribution works in the same way but 'counting' solutions weighted by their Boltzmann factors. Conversely, uniform sampling turns out to be as a special case of Boltzmann sampling with weight zero.
To inform the choice of $X_i$ (before choosing $X_j$), we marginalize the sums of Boltzmann weights $\exp(w_{\text{gc}} \cdot \operatorname{GC}(s))$ over the possible values of $X_j$. In other words, we utilize partial partition functions.
\begin{center}
\begin{tabular}{c@{\quad}|@{\quad}c@{\quad}|@{\quad}c}
Value of $X_i$ & Possible values of $X_j$ & Partition function (\GC content)\\
\hline
A & U & $\exp(w_\text{gc}\cdot 0)$\\
C & G & $\exp(w_\text{gc}\cdot 2)$\\
G & C, U & $\exp(w_\text{gc}\cdot 2) + \exp(w_\text{gc}\cdot 1)$\\
U & A, G & $\exp(w_\text{gc}\cdot 0) + \exp(w_\text{gc}\cdot 1)$\\
\end{tabular}
\end{center}
For sampling, the value of $X_i$ is selected with probability proportional
to its corresponding partition function from the table. After $X_i$ is
determined, the choice of a value for $X_j$ becomes comparably simple.
For the previous examples, \Infrared's solving mechanism indeed boils down to fixing an arbitrary order of the variables and precomputing partition functions as described for the variables in base pairs. In the presence of more complex dependencies between variables, \Infrared still chooses values for the variables one-by-one in a predetermined optimized variable order. Given this order, it precomputes partial partition functions to derive the probabilities to choose values in order to sample from the desired Boltzmann distribution.
The key to efficient uniform sampling in \Infrared is thus to precompute such partition functions as efficiently as possible. After the precomputation the choice for each variable is performed in constant time, resulting in linear time sampling. We are going to discuss further details of the \Infrared engine, when we progress to network models with more complex dependencies as they result from simultaneously targeting several RNA structures.
\subsubsection{Targeting specific \GC content.}
Let us return to the control of the \GC content. Instead of direct tweaking of the weight, we can even ask Infrared to target a specific feature value. For example, we generate targeted samples with a \GC content of $75\% \pm 1\%$ from our model by
\begin{Pythoncode}
sampler = Sampler(model)
sampler.set_target(0.75*n, 0.01*n, 'gc')
samples = [sampler.targeted_sample() for _ in range(1000)]
\end{Pythoncode}
Note the use of \texttt{Sampler}'s method \texttt{targeted\_sample()} in place of \texttt{sample()}. This method provides access to an automatic mechanism that
returns only samples within the tolerance from the target. To make such a rejection strategy effective, we iteratively sample, estimate the current mean of the feature value and then update the feature weight. Concretely, \Infrared implements a form of multi-dimensional Boltzmann sampling~\cite{Bodini2010} as applied in \RNARedPrint~\cite{hammer2019fixed}.
\subsection{Controlling energy---Multiple features}
While, for instructional purposes, we first presented how to target \GC content, an even more obvious target of RNA design is the energy of the target structure---or in other words, the affinity of the designs to the target structure.
Similar to the \GC content, RNA energy can be modeled as a sum of function values. This holds even for the detailed nearest-neighbor energy model of RNAs, where energy is composed of empirically determined or trained loop energies~\cite{Turner2010,Andronescu2007}. Here, we focus on the much simpler base pair energy model, which has been demonstrated to be an effective proxy for the Turner (nearest neighbor) model in design applications~\cite{hammer2019fixed}.
In this simple model, every type of base pair (A-U, C-G or G-U) receives a different energy. To define the feature \texttt{energy}, we impose a function for each base pair \texttt{(i,j)} in the target structure and moreover distinguish terminal and non-terminal base pairs, simply by \texttt{(i-1, j+1) not in bps}. \Infrared provides a default parameterization, which has been originally trained for use with \RNARedPrint~\cite{hammer2019fixed}.
\begin{Pythoncode}
bps = parse(target)
model.add_functions([BPEnergy(i, j, (i-1, j+1) not in bps)
for (i,j) in bps], 'energy')
\end{Pythoncode}
In the same way as for the feature \texttt{gc}, we can set the weight of the new feature \texttt{energy} and in this way shift the distribution of base pair energies. In \Infrared, can even simultaneously control the energy and the \GC content, since it samples (depending on the weights $w_{\text{gc}}$ and $w_{\text{energy}}$) from the two-dimensional Boltzmann distribution, where probabilities are proportional to
\begin{equation*}
\exp(w_{\text{gc}}\cdot\operatorname{GC}(s) + w_{\text{energy}}\cdot\operatorname{BPEnergy}(s)).
\end{equation*}
We observe that, for a fixed target structure, the energies of sequences in the base pair energy model are strongly correlated to their energies in the Turner model. Thus shifting the distribution of base pair energies indirectly shifts the distribution of nearest neighbor energies.
For example, we generate sequences with high affinity to the target structure and specific \GC content, straightforwardly extending previous code and ideas:
\begin{Pythoncode}
model.set_feature_weight(-2 'energy')
sampler = Sampler(model)
sampler.set_target(0.75*n, 0.01*n, 'gc')
samples = [sampler.targeted_sample() for _ in range(10)]
\end{Pythoncode}
We arrived at the functionality of \IncARNation~\cite{Reinharz-Incarnation} as application of the \Infrared library, requiring only a few lines of Python code. We remark that \IncARNation implements the stacking energy model, which is slightly more complex than the base pair model. This model is as well available in \Infrared, but was not shown to have clear advantages for sampling. To use it, we would simply replace adding the \texttt{BPEnergy} functions
by
\begin{Pythoncode}
model.add_functions([StackEnergy(i, j)
for (i,j) in bps if (i+1,j-1) in bps], 'energy')
\end{Pythoncode}
\subsection{Targeting Turner energies---Customized features}
This correlation between the base pair and Turner energy models can be moreover exploited in \Infrared to directly target Turner energies. This only requires us to add a custom feature for Turner energy that controls the group of functions \texttt{energy}:
\begin{Pythoncode}
model.add_feature('Energy', 'energy',
lambda sample, target=target:
RNA.energy_of_struct(ass_to_seq(sample), target))
\end{Pythoncode}
Consequently, we can target specific (realistic) Turner energy of the target structure (and simultaneously target specific \GC content)
\begin{Pythoncode}
sampler = Sampler(model)
sampler.set_target(0.75*n, 0.01*n, 'gc')
sampler.set_target(-10, 0.5, 'Energy')
samples = [sampler.targeted_sample() for _ in range(10)]
\end{Pythoncode}
This example provides a first demonstration of the targeting flexibility due to \Infrared multi-dimensional Boltzmann sampling engine, which can simultaneously target several features. We already see how this generalizes over advanced design tools like \IncARNation.
\subsection{Multiple-target structures---Complex dependencies}
Due to the compositionality of constraints (and functions) in \Infrared, defining multiple targets does not look any different than defining a single target structure. Thus, let us right away define model to target the structures of Fig.~\ref{fig:target-structures}, which were previously defined in the list \texttt{targets} with lenght \texttt{n}.
\begin{Pythoncode}
model = Model(n,4)
for k, target in enumerate(targets):
bps = parse(target)
model.add_constraints(BPComp(i,j) for (i,j) in bps)
model.add_functions([BPEnergy(i, j, (i-1, j+1) not in bps)
for (i,j) in bps], f'energy{k}')
model.add_functions([GCCont(i) for i in range(n)], 'gc')
\end{Pythoncode}
Note how we just add constraints and functions for each target structure, but define different names for their functions groups (\texttt{energy0}, \texttt{energy1}, \texttt{energy2}), such that we could control them separately. As well, note that we are going to control \GC content. Here, we could add further constraints like the ones for a specific IUPAC sequence.
By now, it will appear natural to the attentive reader that we can go on by defining Turner energy features and specific targets.
\begin{Pythoncode}
for k, target in enumerate(targets):
model.add_feature(f'Energy{k}', f'energy{k}',
lambda sample, target=target:
RNA.energy_of_struct(ass_to_seq(sample), target))
sampler = Sampler(model)
sampler.set_target(0.75*n, 0.01*n, 'gc')
sampler.set_target( -15, 1, 'Energy0')
sampler.set_target( -20, 1, 'Energy1')
sampler.set_target( -20, 1, 'Energy2')
\end{Pythoncode}
Finally, as expected, \Infrared will indeed generate sequences that are compatible to all structures and hit the prescribed target energies and \GC content.
\begin{Pythoncode}
samples = [sampler.targeted_sample() for _ in range(10)]
\end{Pythoncode}
At this point, we arrived at reimplementing the essential functionality of \RNARedPrint~\cite{hammer2019fixed} in the \Infrared framework. A full-fledged \Infrared-based implementation with command line interface is moreover provided, as \RNARedPrint~2.x, at \url{https://gitlab.inria.fr/amibio/RNARedPrint}.
\subsubsection{Excursion 2: a deeper dive into Infrared's sampling engine}
Setting several target structures and targeting specific energies for them seems natural due to \Infrared's modeling syntax, nevertheless it is not at all obvious \emph{a priori} how the system effectively generates solutions that satisfy the constraints and target specific properties.
\paragraph{Generation of samples from a multi-dimensional Boltzmann distribution.}
For generating samples, \Infrared implements a general solving strategy based on tree decompositions and cluster tree elimination (CTE)~\cite{Decther..}. Such techniques have been well known in the (larger) context of constraint processing~\cite{Dechter..}; more recently, we described this approach specialized to multi-target RNA design for our approach \RNARedPrint.
The cluster tree elimination scheme yields a fixed-parameter tractable algorithm to compute (partial) partition functions, which let's us generate samples from a multi-dimensional Boltzmann distribution.
In our example, this means that we can efficiently generate samples with probabilities proportional to
\begin{equation*}
\exp(w_{\text{gc}}\cdot\operatorname{GC}(s) + \sum_{k=0}^2 w_{\text{energy$k$}}\cdot\operatorname{energy\emph{k}}(s)).
\end{equation*}
Due to the CTE scheme, the computation is based on a tree decomposition of the network of the dependencies induced by the constraints and functions in the model (aka \emph{dependency graph}); see Figure~\ref{fig:dependency-graph}. The concept of tree decomposition allows us to recursively compute partial partition functions for the subtrees of the tree decomposition by processing the variables in its bags in bottom-up order. Moreover, this computation can be performed efficiently due to dynamic programming (which tabulates partial result that would otherwise be re-computed redundantly).
After all partition functions are computed, each sample is generated in a backtrace running from the root the leaves. In this way, whenever a new variable is introduced in the depth-first/top-down traversal of the tree decomposition, its value can be chosen with the correct probability to generate the desired multi-dimensional Boltzmann distribution.
This solving strategy explains why \Infrared can sample from one network faster than from the other. Tree-like networks of dependencies are processed quickly, while complex cyclic dependencies require tree decompositions with more variables per bag, since valid tree decompositions must satisfy certain conditions w.r.t. the dependencies in the network (which in turn guarantee the correctness of the dynamic programming evaluation).
Finally, since the computation requires to enumerate all possible sub-assign\-ments in each bag, the computation time is exponential in the maximum number of variables per bag---this complexity is commonly described in terms of tree width, which is defined as this number minus 1.
\begin{figure}
\centering
\includegraphicscenter[width=0.7\textwidth]{Figs/example-dependency_graph}\hfill%
\includegraphicscenter[width=0.27\textwidth]{Figs/example-treedecomp}
\caption{\textbf{(Left)} Dependency graph of the multi-target design model showing the dependencies between the variables $X0,\dots,X{34}$ of this model \textbf{(Right)} A tree decomposition of this graph that puts the variables into bags (only variable indices shown). The directed edges are labeled with the indices of variables introduced by the child bag. This decomposition has a tree width of two, since it's largest bags contain three (tree width plus one) variables.}
\label{fig:dependency-graph}
\end{figure}
\paragraph{Targeting specific properties.} For targeting very specific properties like certain \GC content and energies of the target structures, \Infrared utilizes the just described sampling engine to iteratively sample from a multi-dimensional Boltzmann distribution, evaluate the generated distribution w.r.t.{} the target properties of the single targeted features and update their weights. By suitable updpates of the weights, it is possible to shift the distribution towards the targeted feature values and increase the probability to satisfy these targets (within the given tolerance). During this entire learning procedure, \Infrared returns samples inside of the tolerance range and rejects all others. In this way, \Infrared implements a variant of multi-dimensional Boltzmann sampling~\cite{Bodini2010}, which let's it solve the kind of complex constraints set are set by targeting certain tolerance ranges for features (which are composed from 'local' functions).
As a consequence of this entire mechanism, the sampling efficiency in \Infrared is a result of the complexity of the constraint network as well as the (in)dependence of the targeted features and the demanded tolerances. For practical applications of \Infrared it is thus generally advantageous to be aware of the properties of the solving strategy and the resulting dependencies between these factors. This is especially important, since the framework easily allows modeling extremely hard problems, while (as we demonstrate) it is useful for a wide range of applications in practice. Its specific properties make the system attractive for a variety of complex design applications but as well intrinsically influence its applicability.
\subsection{Negative design by direct sampling}
Good RNA designs typically must satisfy certain constraints and show high
(or specific) affinity to the target structures, but as well must avoid
high affinity to all non-target structures. Objectives of the latter type are called
negative design criteria (whereas the former are called positive design
criteria). Intuitively, design towards negative criteria seems harder since
it requires to avoid affinity to exponentially many structures. (Indeed, as
we discussed before, negative design was shown to be NP hard in relevant
settings.)
A classic (single-target) negative design objective, requires the target structure to have
minimum free energy among all structures of the designed RNA; it can be tested using the Vienna RNA package:
\begin{Pythoncode}
def is_mfe_design(sequence, target):
fc = RNA.fold_compound(sequence)
return fc.eval_structure(target) == fc.mfe()[1]
\end{Pythoncode}
In many cases, sampling (targeting only positive design criteria) can be sufficient to satisfy negative design criteria. For example, we can easily find designs for any of our three target structures by straightforward ``generate-and-test'':
\begin{Pythoncode}
sampler = Sampler(single_target_design_model(target))
sampler.set_target(0.7 * n, 0.1 * n, 'gc')
for i in range(50):
seq = ass_to_seq(sampler.sample())
if is_mfe_design(seq,target):
print(f"{i} {seq}")
\end{Pythoncode}
where \texttt{design\_model(target)} returns a model for the given target structure with 'gc' feature and a bias to good base pair energy as we have developed it before (see Notes).
Based on this code, we easily find 10 to 20 solutions from generating 50 (biased) samples for each of our example target structures (in less than 0.1 seconds on a current notebook).
A similar approach can still be sufficient to optimize other negative criteria for single-target design like the frequency of the target structure (see Notes for function \texttt{target\_frequency}).
The following code regularly finds designs with >80\% target frequency for our example targets in comparable run-time (roughly, twice as long).
\begin{Pythoncode}
sampler = Sampler(single_target_design_model(target))
sampler.set_target(0.7 * n, 0.1 * n, 'gc')
best = 0
for i in range(100):
seq = ass_to_seq(sampler.sample())
freq = target_frequency(seq, target)
if freq > best:
best = freq
print(f"{i} {seq} {freq:.6f}")
\end{Pythoncode}
Note that in both cases, we control the \GC content of our designs. This is just one example how the core idea of finding designs by sampling can be extended by further targets or constraints. (Adding prior knowledge as IUPAC string, would be another.) Generally, this approach will be most successful for problems with a medium-sized solution space. This would also apply to many design problems, where only parts of the RNA should be redesigned, while others are kept constant.
\subsection{Larger single-target designs by constraint-generation}
For larger and harder single-target design instances, we suggested a constraint-generation strategy in \RNAPOND~\cite{Yao2001}. Here, we demonstrate a slightly stripped down version of this appraoch, which requires only a few lines of code using \Infrared.
The method starts with the attempt to solve negative design by sampling (=positive design) as in the subsection before. When we fold the suggested designs, it can be observed that some base pairs that don't occur in the target, occur more frequently than others. This motivates us to identify these most frequent disruptive base pairs and forbid them in a next round of sampling. This strategy is iterated until a design (according to the mfe criterion) is discovered. In the code presented here, we decide to terminate the design attempt, when the problem becomes inconsistent or a certain complexity of the constraint model is exceeded.
\begin{Pythoncode}
def cg_design_iteration(dbps):
model = single_target_design_model(target)
model.add_constraints(NotBPComp(i, j) for (i, j) in dbps)
sampler = Sampler(model, lazy=True)
sampler.set_target(0.7 * n, 0.1 * n, 'gc' )
if sampler.treewidth() > 10 or not sampler.is_consistent():
return dbps, "Not found"
ctr = Counter()
for i in range(steps):
seq = ass_to_seq(sampler.targeted_sample())
fc = RNA.fold_compound(seq)
mfe, mfe_e = fc.mfe()
if fc.eval_structure(target) == mfe_e:
return dbps, seq
ctr.update(parse(mfe))
ndbps = [x[0] for x in ctr.most_common() if x[0] not in bps]
return dbps + ndbps[:3], None
dbps, seq = [], None
while seq is None: dbps, seq = cg_design_iteration(dbps)
print(seq)
\end{Pythoncode}
\TODO{Some more discussion of the appraoch and results. Include figure (show frequencies of disruptive base pairs in dot plot, shift of distribution \dots.}
\subsection{Negative design by stochastic optimization search with partial resampling}
\Infrared supports stochastic optimization strategies to optimize negative objectives for single or multi-target design. We are going to present a Metropolis-Hastings optimization strategy for optimizing the solutions from a multi-design models. It starts with a sample from the model and evaluates it by the objective function. Then, it iteratively picks a connected component of the dependency graph; constructs a model for resampling of the variables in this component; and generates a sample. Based on its evaluation by the objective function, the sample is either accepted or rejected based on the Metropolis-Hastings criterion.
For our example, we choose too minimize the multi-defect~\cite{Hammer2017}, which is a weighted sum of the distance of the target energies to the ensemble energy and the energy distance between the targets. Minimizing this function thus means to increase the probability of the targets and balance the target energies as much as possible (see function \texttt{multi\_defect} in Notes).
Next, we define a function \texttt{multi\_design\_model} (see Notes) that returns a multi-target design network model (closely resembling the one presented before). In addition, crucially for the use in stochastic optimization, this model supports partial resampling. Given a
subset of the model variables and a complete solution, it fixes all variables outside of the subset to the solution. This is accomplished by the code snippet
\begin{Pythoncode}
for i in range(n):
if i not in subset:
value = solution.values()[i]
model.restrict_domains(i,(value,value))
\end{Pythoncode}
In this function (see Notes), we moreover avoid to introduce constraints on or between variables outside of the subset.
For the stochastic optimization, we require additional imports
\begin{Pythoncode}
from random import random, choices
from math import exp
\end{Pythoncode}
The function \texttt{multi\_design\_optimize} returns the best multi-target design and its multi-defect after a number of iterations (\texttt{steps}); a further parameter \texttt{temp} (temperature) controls the acceptance probability.
\begin{Pythoncode}
def multi_design_optimize(steps, temp):
cc, cur, curval, bestval = None, None, math.inf, math.inf
for i in range(steps):
model = multi_design_model(targets, cur, cc)
new = Sampler(model).sample()
newval = multi_defect(ass_to_seq(new),targets,1)
if (newval <= curval
or random() <= exp(-(newval-curval)/temp)):
cur, curval = new, newval
if curval < bestval:
best, bestval = cur, curval
if i==0:
ccs = model.connected_components()
weights = [1/len(cc) for cc in ccs]
cc = choices(ccs,weights)[0]
return (ass_to_seq(best), bestval)
\end{Pythoncode}
Finally, we run the multi-defect optimization on our example target structures by
\begin{Pythoncode}
multi_design_optimize(1000, 0.015)
\end{Pythoncode}
Note that here the number of 1000 iterations and the temperature 0.015 were chosen after some experimentation. In practice, we will often restart such procedures to obtain better solutions and/or a diverse set of good solutions. In our tests, one such optimization run took moderate run-time below 10s on current notebook hardware, while it can find very good designs for the targets.
The sequence \texttt{CCCCUUGCCUCAAGGGCCCUCUUCAGAGGAAGGGG} is a particularly good solution found by this strategy. For this design all three target structures have minimum free energy of -15.40 kcal/mol (at a multi-defect of 1.21).
Even the close suboptimals contain only structures similar to the targets as can be seen from the output of \texttt{RNAsubopt} of the Vienna RNA package, which enumerates all structures within 1 kcal/mol of the minimum free energy.
\begin{bashcode}
$ RNAsubopt -s <<<CCCCUUGCCUCAAGGGCCCUCUUCAGAGGAAGGGG
CCCCUUGCCUCAAGGGCCCUCUUCAGAGGAAGGGG -15.40 1.00
((((((((((...))))((((....)))))))))) -15.40
((((((.((((((((....))))..)))))))))) -15.40
.((((((...)))))).(((((((....))))))) -15.40
((((((.(((((((......)))..)))))))))) -15.10
((((((.((((.(((....)))...)))))))))) -14.80
((((((.((((.(((.....)))..)))))))))) -14.80
((((((.((((..(((....)))..)))))))))) -14.80
((((((.((((.((((...))))..)))))))))) -14.60
(((((((((.....)))((((....)))))))))) -14.50
((((((.((((.((......))...)))))))))) -14.50
.((((((...))))))(((.((((....))))))) -14.50
\end{bashcode}
\subsection{Further application-specific design goals}
discuss examples of further design goals, e.g. forbidding, enforcing motifs. Show how this can be added naively
specific negative design objectives, e.g. energy differences between target structures. Example of riboswitch design (artificial, Moerl-like)
\section{Notes}
\subsection{A model for single target design}
\begin{Pythoncode}
def single_target_design_model(target):
n, bps = len(target), parse(target)
model = Model(n, 4)
model.add_constraints(BPComp(i, j) for (i, j) in bps)
model.add_functions([GCCont(i) for i in range(n)], 'gc')
model.add_functions([BPEnergy(i, j, (i-1, j+1) not in bps)
for (i,j) in bps], 'energy')
model.set_feature_weight(-1.5, 'energy')
return model
\end{Pythoncode}
\subsection{Target frequency in ensemble}
The frequency of the target structure in the ensemble of the designed sequence is a good example of a negative design criterion.
\begin{Pythoncode}
def target_frequency(sequence, target):
fc = RNA.fold_compound(sequence)
fc.pf()
return fc.pr_structure(target)
\end{Pythoncode}
\subsection{Multi-Defect}
The multi-defect can be computed with the help of the Vienna RNA library.
\begin{Pythoncode}
def multi_defect(sequence, targets, xi=1):
k = len(targets)
fc = RNA.fold_compound(sequence)
ee = fc.pf()[1]
eos = [fc.eval_structure(target) for target in targets]
diff_ee = sum(1/k * eos[i] - ee for i in range(k))
diff_targets = sum(2/(k*(k-1)) * abs(eos[i]-eos[j])
for i in range(k) for j in range(k) if i<j)
return diff_ee + xi * diff_targets
\end{Pythoncode}
\subsection{Multi-design model with support for partial resampling}
\begin{Pythoncode}
def multi_design_model(targets, subset=None, solution=None):
n = len(targets[0])
model = Model(n, 4)
if subset is None: subset = set(range(n))
for i in set(range(n))-subset:
value = solution.values()[i]
model.restrict_domains(i,(value,value))
model.add_functions([GCCont(i) for i in subset], 'gc')
for target in targets:
s = parse(target)
ss = [(i,j) for (i,j) in s if i in subset or j in subset]
model.add_constraints(BPComp(i, j) for (i, j) in ss)
model.add_functions([BPEnergy(i, j, (i-1, j+1) not in s)
for (i,j) in ss], 'energy')
model.set_feature_weight(-1, 'energy')
model.set_feature_weight(-0.3, 'gc')
return model
\end{Pythoncode}
\bibliographystyle{plain}
\bibliography{biblio}
\end{document}