StreAM-Tg: algorithms for analyzing coarse grained RNA dynamics based on Markov models of connectivity-graphs

Background: In this work, we present a new coarse grained representation of RNA dynamics. It is based on adjacency matrices and their interactions patterns obtained from molecular dynamics simulations. RNA molecules are well-suited for this representation due to their composition which is mainly modular and assessable by the secondary structure alone. These interactions can be represented as adjacency matrices of k nucleotides. Based on those, we define transitions between states as changes in the adjacency matrices which form Markovian dynamics. The intense computational demand for deriving the transition probability matrices prompted us to develop StreAM-Tg, a streambased algorithm for generating such Markov models of k-vertex adjacency matrices representing the RNA. Results: We benchmark StreAM-Tg (a) for random and RNA unit sphere dynamic graphs (b) for the robustness of our method against different parameters. Moreover, we address a riboswitch design problem by applying StreAM-Tg on six long term molecular dynamics simulation of a synthetic tetracycline dependent riboswitch (500 ns) in combination with five different antibiotics. Conclusions: The proposed algorithm performs well on large simulated as well as real world dynamic graphs. Additionally, StreAM-Tg provides insights into nucleotide based RNA dynamics in comparison to conventional metrics like the root-mean square fluctuation. In the light of experimental data our results show important design opportunities for the riboswitch.


Background
The computational design of switchable and catalytic ribonucleic acids (RNA) becomes a major challenge for synthetic biology [1]. So far, available models and simulation tools to design and analyze functionally complex RNA based devices are very limited [2]. Although several tools are available to assess secondary as well as tertiary RNA structure [3], current capabilities to simulate dynamics are still underdeveloped [4] and rely heavily on atomistic molecular dynamics (MD) techniques [5]. RNA structure is largely modular and composed of repetitive motifs [4] that form structural elements such as hairpins and stems based on hydrogen-bonding patterns [6]. Such structural modules play an important role for nano design [1,7].
In order to understand RNA dynamics [8,14] we develop a new method to quantify all possible structural transitions, based on a coarse grained, transferable representation of different module sizes. The computation of Markov State Models (MSM) have recently become practical to reproduce long-time conformational dynamics of biomolecules using data from MD simulations [15].
To this end, we convert MD trajectories into dynamic graphs and derive the Markovian dynamics in the space of adjacency matrices. Aggregated matrices for each nucleotide represent RNA coarse grained dynamics. However, a full investigation of all transitions is computationally expensive.
To address this challenge we extend StreaM-a streambased algorithm for counting 4-vertex motifs in dynamic graphs with an outstanding performance for analyzing (bio)molecular trajectories [16]. The extension StreAM computes one transition matrix for a single set of vertices or a full set for combinatorial many matrices. To gain insight into global folding and stability of an RNA molecule, we propose StreAM-T g : It combines all adjacency-based Markov models for a nucleotide into one global weighted stochastic transition matrix T g (a). However, deriving Markovian dynamics from MD simulations of RNA is an emerging method to describe folding pathways [13] or to elucidate the kinetics of stacking interactions [11]. Especially MSM of atomistic aptamer simulations like the theophylline [12] and thrombin aptamer could help to understand structure-function relationships as well as the folding process [18]. Nonetheless, all the methods mentioned above rely on Root Mean Square Deviation (RMSD) computations in combination with clustering in order to identify relevant transition states. For StreAM-T g , the transition states are given by small adjacency matrices representing structural motifs.
The remainder of this paper is structured as follows: In "Our approach for coarse grained analysis", we introduce the concept of StreAM-T g as well as our biological test setup. We describe details of the algorithm in "Algorithm". We present runtime evaluations as well as application scenario of our algorithm in "Evaluation" for a synthetic tetracycline (TC) dependent riboswitch (TC-Aptamer). Furthermore, we investigate the influence upon ligand binding of four different TC derivates and compare them with a conventional method. Finally, we summarize our work in "Summary, conclusion, and future work".

Structural representation of RNA
Predicting the function of complex RNA molecules depends critically on understanding both, their structure as well as their conformational dynamics [17,19]. To achieve the latter we propose a new coarse grained RNA representation. For our approach, we start with an MD simulation to obtain a trajectory of the RNA. We reduce these simulated trajectories to nucleotides represented by their (C3 ′ ) atoms. From there, we represent RNA structure as an undirected graph [20] using each C3 ′ as a vertex and distance dependent interactions as edges [3]. It is well known that nucleotide-based molecular interactions take place between more than one partner [21]. For this reason interactions exist for several edges observable in the adjacency matrix (obtained via a Euclidean distance cutoff ) of C3 ′ coordinates at a given time-step. The resulting edges represent, e.g., strong local interactions such as Watson-Crick pairing, Hoogsteen, or π −π-stacking.
Our algorithm estimates adjacency matrix transition rates of a given set of vertices (nucleotides) and builds a Markov model. Moreover, by deriving all Markov models of all possible combinations of vertices, we can reduce them afterwards into a global weighted transition matrix for each vertex representing the ensemble that the nucleotide modeled as a vertex is immersed in.

Dynamic graphs, their analysis, and Markovian dynamics
. v |V | } and edges E. We refer to a single vertex of V as a. Here, we only consider undirected graphs without self-loops, i.e., E ⊆ {{v, w} : v, w ∈ V , v � = w}. We define a self-loop as an edge that connects a vertex to itself. For a subset V ′ of the vertex set V, we refer to We refer to the powerset of V as P(V ). The adjacency matrix A(G) = A i,j (Eq. 1) of a graph G is a |V | × |V | matrix, defined as follows: Here, the symbol ♦ denotes for an undefined matrix entry. We denote the set of all adjacency matrices of size k as A k , with |A k | = 2 k·(k−1) 2 . In our current implementation k can takes values in {2, 3, 4, 5, 6, 7, 8, 9, 10} . With concat(A), we denote the row-by-row concatenation of all defined values of an adjacency matrix A. We define the adjacency id of a matrix A as the numerical value of the binary interpretation of its concatenation, i.e., id(A) = concat(A) 2 ∈ N. We refer to id(V ′ ) := id(A(G[V ′ ])) as the adjacency id of the V ′ -induced subgraph of G. For example, the concatenation of the adjacency matrix of graph G 1 [V ′ ] (shown in Fig. 1) is concat(A(G 1 [V ′ ])) = 011011 and its adjacency id is id(V ′ ) = 011011 2 = 27 10 .
As a dynamic graph G t = (V , E t ), we consider a graph whose edge set changes over time. For each point in time t ∈ [1, τ ], we consider G t as the snapshot or state of the dynamic graph at that time. The transition of a dynamic graph G t−1 to the next state G t is described by a pair of edge sets which contain the edges added to and removed from G t−1 , i.e., (E + t , E − t ). We refer to these changes as a batch, defined as follows: The batch size is referred as δ t = |E + t | + |E − t | and the average batch size is refered as δ avg and is defined as t δ t τ . The analysis of dynamic graphs is commonly performed using stream-or batch-based algorithms. Both output the desired result for each snapshot G t . Streambased algorithms take a single update to the graph as input, i.e., the addition or removal of an edge e. Batchbased algorithms take a pair (E + t+1 , E − t+1 ) as input. They can always be implemented by executing a streambased algorithm for each edge addition e ∈ E + t+1 and removal e ∈ E − t+1 . We refer to id t (V ′ ) as the adjacency id of the V ′ -induced subgraph of each snapshot of G t . The result of analyzing the adjacency id of V ′ for a dynamic graph G t is a list (id t (V ′ ) : t ∈ [1, τ ]). We consider each pair (id t (V ′ ), id t+1 (V ′ )) as an adjacency transition of V ′ and denote the set of all transitions as T (V ′ ). Then, we define the local transition matrix T (V ′ ) of V ′ as a |A k | × |A k | matrix, which contains the number of transitions between any two adjacency ids over time, i.e., T i,j (V ′ ) := |(i + 1, j + 1) ∈ T (V ′ )| for an adjacency size k. From T (V ′ ), we can derive a Markov model to describe these transitions. By : |V ′ | = k and a ∈ V ′ , we derive a transition tensor C a (V ). Thus C a (V ) has the dimensions of We define the weighting matrix W (V ′ ) with the dimen- weighting for every subset V ′ ∈ C a (V ). It is defined as Here, S(V ′ ) is a matrix containing the sum of every transition between adjacency id(V ′ ) and every other id(V ′ ) of the same matrix T (V ′ ) is considered as the local distribution weighted by its global distribution of transitions matrices of V ′ . Finaly, we define a global transition matrix, a vertex a is immeresd in, as For a local or global transition matrix the respective dominant eigenvector 1 is called π and represents the stationary distribution attained for infinite (or very long) times. The corresponding conformational entropy of the ensemble of motifs is H := − i π i · log π i . The change in conformational entropy upon, e.g., binding a ligand is then given as H = H wt − H complex .

MD simulation setup
We use a structure of a synthetic tetracycline binding riboswitch (PDB: 3EGZ, chain B, resolution: 2.2 Å, Fig. 2) [23] and perform six simulations: the TC-Aptamer with five different tetracycline types in complex and one without tetracycline. As tetracycline binding alters the structural entropy of the molecule [24] our proposed method should be able to detect changes in (local) dynamics due the presence of tetracycline. All simulations were performed using the GROMACS software package (version 2016). For water molecules, we used the The first row shows the dynamic graph G t and the second the induced subgraph V ′ with its respective adjacency matrix. At the bottom is a short example of how to compute the adjacency id for the displayed subgraphs TIP3P model, the RNA interact through the CHARMM force field, while the tetracycline analogs interact through a modified CHARMM force field from Aleksandrov and Simonson [25,26]. The systems were first energy minimized and equilibrated for 1 ns in the NVT-ensemble at a temperature of 300 K and for 5 ns in the NpT-ensemble at a temperature of 300 K and a pressure of 1 bar. During the equilibration, temperature was controlled using the velocity-rescale thermostat [27] (τ T = 0.1 ps) and pressure was controlled using the Berendsen barostat [28] (τ P = 0.5 ps). Isothermal compressibility was set to 4.5 × 10 −5 bar −1 , which is the corresponding value for water. Production runs were performed for 500 ns. The temperature was controlled using the Nosé-Hoover thermostat [29,30] (τ T = 1 ps) and pressure was controlled using the Parrinello-Rahman barostat [31] (τ P = 1 ps ) during the production runs. Bond lengths were constrained using the LINCS [32] algorithm. The Lennard-Jones nonbonded interactions were evaluated using a cutoff distance of 1.2 nm. The electrostatic interactions were evaluated using the particle mesh Ewald method with a real space cutoff 1.2 nm and a grid-spacing 0.12 nm. Long-range corrections to energy and pressure due to the truncation of Lennard-Jones potential were accounted for. The equations of motion were integrated using a 2 fs time step.

Tetracycline derivates
For the comparison of TC derivates we use tetracycline (tc), doxycycline (dc), anhydrotetracycline (atc) and 6-deoxy-6-demythyltetracycline (ddtc) in our MD simulation. These four analogs share the characteristic 4-ring-structure and functional groups of all tetracyclines. Still, the possibility and the mode of interaction with the RNA is an open question. The first ring of tetracycline carries a dimethylamino group, while the third ring carries a hydroxy and a methyl group facing towards the same direction away from the 4-ring-system. The detailed chemical structures are shown in Fig. 3. In comparison to these two rings the fourth, aromatic ring has an especially small steric volume on this side of the molecule. From tc over dc and atc to ddtc this steric volume is further reduced by shifting the aforementioned hydroxy and methyl group away from the fourth ring or eliminating some of them entirely. Note, that our graphbased approach is capable to easily distinguish between different modes of interaction upon changes in the, e.g., the side-chains of the rings. The molecular data of tc, dc, atc and ddtc was created using the Avogadro software [33]. Structures were manually constructed and moved into the extended conformation described to be 3 kcal/mol more stable than its twisted alternative by Alexandrov et al. [24]. The molecules were then fitted to the position of 7-chlorotetracycline (7-cl-tc) bound in the TC-Aptamer structure used for simulation. Note, that the geometry of 7-cl-tc was already present in the crystal structure of the TC-Aptamer. All considered antibiotics show different properties upon ligand binding. They range from high activity (tc, 7-cl-tc) to weak activity (dc, ddtc, atc) based on in vivo experiments [34].

RNA trajectory and contact probability
An RNA trajectory X is represented as a list of T frames X = (� x t 0 , � x t 1 , . . .). Each frame � x t ∈ R 3n contains the three-dimensional coordinates of the simulated system of the n atoms at the respective point in time t. We define a binary contact matrix B(t) with dimensions |V | × |V |. Its entries scan range between {0, 1}. A single contact B i,j (t) between one pair of atom coordinates r i (t) and r j (t) is generated if their Euclidean distance [L2-norm, L2(. . .)] is shorter than d. Thus B(t) entries are defined as follows: The contact probability of one pair of atom coordinates r i and r j is defined as:

Graph transformation
All considered MD simulations have a total length of 500 ns using an integration stepsize of 2 fs. We created snapshots every 250 ps resulting in 100,000 frames. We generated dynamic graphs G t = (V , E t ) containing |V | = 65 vertices (Table 1), each modelling a nucleic 3C ′ (Fig. 2). This resolution is sufficient to represent both small secondary structure elements as well as large quaternary RNA complexes [35,36]. We create undirected edges between two vertices in case their Euclidean cut-off (d) is shorter than {d ∈ N |10 ≤ d ≤ 15} Å (cmp. Table 1).

Markov state models (MSM) of local adjacency and global transition matrix
StreAM counts adjacency transitions (e.g. as a set T (V ′ ) ) of an induced subgraph for a given adjacency size. Now the transition matrix T (V ′ ) can be derived from T (V ′ ) but not all possible states are necessarily visited in a given, finite simulation, although a "missing state" potentially might occur in longer simulations. In order to allow for this, we introduce a minimal pseudo-count [37] of P k = 1 |A k | . All models that fullfill {V ′ ∈ P(V ) : |V ′ | = k, a ∈ V ′ } have the same matrix dimension and thus can be envisioned to be combined in a tensor C a (V ). Now, C a i,j,l (V ) is one entry of the tensor of transitions between adjacency id i and j in the l th Thus C a (V ) contains all T (V ′ ) a specific vertex is immersed in and due to this it contains all possible information of local markovian dynamics. To derive T g (a) every entry C a i,j,l (V ) is normalized by the count of all transitions of i in all matrices S(V ) j,l = i C a i,j,l (V ). For a given set of l transition matrices T (V ′ ) we can combine them into a global model with respect to their probability:

Stationary distribution and entropy
As T g (a) (Eq. 4) is a row stochastic matrix we can compute its dominant eigenvector from a spectral decomposition. It represents a basic quantity of interest: the stationary probability � π := (π 1 , . . . , π i , . . .) of microstates i [37]. To this end we used the markovchain library in R [38,39]. For measuring the changes in conformational entropy H := − |A k | i=1 π i · log π i upon binding a ligand, we define H = H wt − H complex , form a stationary distribution.

Conventional analysis: root mean square fluctuation (RMSF)
The flexibility of an atom can be quantitatively assessed by its Root-mean-square fluctuation (RMSF). This measure is the time average L2-norm L2(. . .) of one particular atom's position r i (t) to its time-averaged position � i r. The RMSF of an nucleotide i (represented by its respective C3 ′ atom) is defined as:

Overview
In this section, we introduce the required algorithms to compute T g (a). First, we describe StreAM, a stream-based algorithm for computing the adjacency id(V ′ ) for a given V ′ . Afterwards we describe, the batch-based computation using StreAM B to derive id t (V ′ ). By computing the adjacency id of a dynamic graph G t [V ′ ] we derive a list represents an adjacency transition. The respective transitions are than stored in T (V ′ ). Now, a single T (V ′ ) can be derived by counting the transitions in T (V ′ ). At last we introduce StreAM-T g , an algorithm for the computation of a global transition matrix T g (a) for a given vertex a from a dynamic graph G t [V ]. To this end, StreAM-T g computes the tensor C a (V ) which includes every single matrix T (V ′ ) where V ′ ∈ P(V ) and |V ′ | = k with vertex a ∈ V ′ . Finally, StreAM-T g computes T g (a) from C a (V ).

StreAM and StreAM B .
We compute the adjacency id id(V ′ ) for vertices V ′ ⊆ V in the dynamic graph G t using the streambased algorithm StreAM, as described in Algorithm 1. In the bit representation of its adjacency id id(V ′ ), this edge is represented by the bit (i − 1) · k + j − i · (i + 1)/2. When interpreting this bit representation as a number, an addition or removal of the respective edge corresponds to the addition or subtraction of 2 k·(k−1)/2−((i−1)·k+j−i·(i+1)/2) . This operation is performed to update id(V ′ ) for each edge removal or addition. In the following, we refer to this position as e(a, b, V ′ ) := |V ′ |·(|V ′ |−1) Furthermore, in Algorithm 2 we show StreAM B for the batch-based computation of the adjacency id for vertices V ′

StreAM-T g
For the design or redesign of aptamers it is crucial to provide experimental researchers informations about e.g. dynamics at the nulceotide level. To this end, StreAM-T g combines every adajcency-based transition matrix, one nucleotide participates in, into a global model T g (a). This model can be derived for every nucleotide of the regarded RNA structure and contains all the structural transition of a nuclotide between the complete ensemble of remaining nucleotides. In order to do this, we present StreAM-T g , an algorithm for the computation of global transition matrices, one particular vertex is participating in, given in Algorithm 3. A full computation with StreAM-T g can be divided into the following steps. The first step is the computation of all possible Markov models that fulfill V ′ ∈ P(V ) : |V ′ | = k with StreAM for a given k with k ∈ [2,10]. This results in |V | k · k! = |V |! (|V |−k)! combinations. Afterwards, StreAM-T g sorts the matrices by vertex id into different sets, each with the size of |V | k − 1 · (k − 1)!. For each vertex a, StreAM-T g combines the obtained T (V ′ ) that fulfill a ∈ V ′ in a transition tensor C a (V ), which is normalized by W (V ′ ) the global distribution of transition states a vertex is immersing in, taking the whole ensemble into account. W (V ′ ) can be directly computed from C a (V ) (e.g. "Dynamic graphs, their analysis, and Markovian dynamics")

StreAM-T g optimization using precomputed contact probability
The large computational demands for a full computation of the |V | k · k! = |V |! (|V |−k)! transition matrices to derive a set of T g (a), motivated us to implement an optimization: The number of Markov models can be reduced by considering only adjacencies including possible contacts between at least two vertices of G t = (V , E t ). This can be precomputed before the full computation by considering the contact probability P(X, r i , r j ) between vertices. To this end we only compute transition matrices forming a contact within the dynamic graph with P(X, r i , r j ) > 0.

Objectives
As StreAM-T g is intended to analyze large MD trajectories we first measure the speed of StreAM for computing a single T (V ′ ) to estimate overall computational resources. With this in mind, we benchmark different G t with increasing adjacency size k (Table 1). Furthermore, we need to quantify the dependence of computational speed with respect to δ t . Note, δ t represents changes in conformations within G t . For the full computation of T g (a), we want to measure computing time in order to benchmark StreAM-T g by increasing network size |V| and k for a given system due to exponentially increasing matrix dimensions |A k | = 2 k·(k−1) 2 (k = 3 8, k = 4 64, k = 5 1,024, k = 6 32,768, k = 7 2,097,152 size of matrix dimensions). We expect due to combinatorial complexity of matrix computation a linear relation between |V| and speed and an exponential relation between increasing k and speed. To access robustness of influence of d robustness regarding the computation of T g (a) stationary distribution π. We expect a strong linear correlation between derived stationary distributions. Details are shown in "Robustness against threshold". We compare Markovian dynamics between the native TC-Aptamer and the structure in complex with 7-cl-tc with experimental data. We discuss the details in "Workflow" and "Application to molecular synthetic biology". Furthermore, we want to illustrate the biological relevance by applying it to a riboswitch design problem; this is shown in detail in "Application to molecular synthetic biology". For the last part, we investigate the ligand binding of four different TC derivates using StreAM-T g and compare them with a classical metric (e.g. RMSF) in "Comparison of tetracycline derivates".

Evaluation setup
All benchmarks were performed on a machine with four Intel(R) Xeon(R) CPU E5-2687W v2 processors with 3.4GHz running a Debian operating system. We implemented StreAM in Java; all sources are available in a GitHub repository. 2 The final implementation StreAM-T g is integrated in a Julia repository. 3 We created plots using the AssayToolbox library for R [39,40]. We generate all random graphs using a generator for dynamic graphs 4 derived for vertex combination.

Runtime dependencies of StreAM on adjacency size
For every dynamic graph G t (V , E t ), we selected a total number of 100,000 snapshots to measure StreAM runtime performance. In order to perform benchmarks with increasing k, we chose randomly nodes k ∈ [3,10] and repeated this 500 times for different numbers of snapshots (every 10,000 steps). We determined the slope (speed frames ms ) of compute time vs. k for random and MD graphs with different parameters (Table 1).

Runtime dependence of StreAM on batch size
We measured runtime performance of StreAM for the computation of a set of all transitions T (V ′ ) with different adjacency sizes k as well as dynamic networks with increasing batch sizes. To test StreAM batch size dependencies, 35 random graphs were drawn with increasing batch size and constant numbers of vertex and edges. All graphs contained 100,000 snapshots and k is calculated from 500 random combinations of vertices.

Runtime dependencies of StreAM-T g on network size
We benchmarked the full computation of T g (a) with different k ∈ [3,5] for increasing network sizes |V|. Therefore we performed a full computation with StreAM. StreAM-T g sorts the obtained transition list, converts them into transition matrices and combines them into a global Markov model for each vertex. Figure 4b shows computational speeds for each dynamic graph. Speed decreases linearly with a small slope (Fig. 4a). While this is encouraging the computation of transition matrices for k > 5 is still prohibitively expensive due to the exponential increase of the matrix dimensions with 2 k·(k−1) 2 . For G t obtained from MD simulations, we observe fast speeds due to small batch sizes (Table 1). Figure 4b reveals that T cpu increases linearly with increasing |V| and with k exponentially. We restrict the T g (a) full computation to k < 5. In Fig. 4c, speed decreases linearly with δ t . As δ t represents the changes between snapshots our observation has implications for 2 https://github.com/BenjaminSchiller/Stream.

Performance enhancing by precomputed contact probability
The exponential increase of transition matrix dimensions with 2 is an obvious disadvantage of the proposed method. However, there exist several T (V ′ ) where every vertex is never in contact with another vertex from the set. These adjacencies remain only in one state during the whole simulation. To avoid the computation of the respective Markov models we precomputed P(X, r i , r j ) of all vertices. Thus only combinations are considered with P(X, r i , r j ) > 0. This procedure leads to a large reduction of T cpu due to fewer number of matrices to be computed to derive T g (a). To illustrate this reduction, we compute the number of adjacencies left after a precomputation of P(X, r i , r j ) as a function of d for the TC-Aptamer simulation without TC. The remaining number of transition matrices for adjacency sizes k = 3, 4, 5 are shown in Fig. 5b. For further illustration we show the graph of the RNA molecule obtained for a cut-off of d = 15 Å in Fig. 5a.
We can observe that using a precomputation of P(X, r i , r j ) to a full computation of T g (a) hardly depends on the Euclidean cut-off (d) for all considered adjacencies. The reduced computational costs in case of a full computation can be expressed by a significant smaller number of transition matrices left to compute for all considered adjacency sizes k = 3, 4, 5. For example if we use k = 4 and d = 13 Å we have to compute 16,248,960 transition matrices, if we use a precomputation of P(X, r i , r j ) we can reduce this value to 2,063,100, this roughly eightfold. Furthermore, in case of new contact formation due to an increased d the number of transition matrices can increase.

Robustness against threshold
Here, we investigate the influence of threshold d for the full computation of T g (a). To this end, we created dynamic graphs with different d ∈ [11,15] Å of the TC-Aptamer simulation without TC. Here, we focus on a simple model with an adjacency size of k = 3, thus with eight states. In particular, we focus on the local adjacency matrix of combination 52, 54 and 51 because these nucleotides are important for TC binding and stabilization of intermediates.
To access the overall robustness of a full computation of T g (a) we compute the stationary distribution for every T g (a) and afterwards we compare them with each other. For the comparison we use the Pearson product moment correlation (Pearson's r). Figure 6 illustrates the comparison of stationary distributions obtained from 65 T g (a) for unit sphere dynamic graphs with different d.
The obtained Pearson correlations r are also shown in Fig. 6 (a, upper triangle). We observed a high robustness expressed by an overall high correlation (r = 0.938 to r = 0.98) of the dynamic graphs created with different d. However transient states disappear with increasing threshold d (Fig. 6b). This observation stems from the fact that the obtained graph becomes more and more densely connected. One consequence of a high threshold d is that the adjacency remain in the same state.

Accuracy of StreAM
In this section we discuss the accuracy of StreAM for the computation of a set of all transitions T (V ′ ) on finite data samples. Our approach estimates the transition probabilities from a trajectory as frequencies of occurrences. It could be shown that uncertainties derived from a transition matrix (e.g derived from a molecular dynamics simulation) decreases with increasing simulation time [22]. Thus the error and bias in our estimator are driven by the available data set size to derive T (V ′ ) . Additionally, there is an implicit influence of k on the accuracy since the number of k determines the transition matrix dimensions. Consequently, the available trajectory (system) data must be at least larger than the number of entries in the transition matrix to be estimated in order to use StreAM.

Application to molecular synthetic biology
This section is devoted to investigate possible changes in Markovian dynamics of the TC-Aptamer upon binding of 7-cl-tc. This particular antibiotic is part of the crystal structure of the TC-Aptamer thus structure of 7-cl-tc has the correct geometry and orientation of functional groups.
For both simulations of "Workflow", we computed 16,248,960 transition matrices and combined them into 65 global models (one for each vertex of the riboswitch). To account for both the pair-interactions and potential stacking effects we focus on k = 4-vertex adjacencies and use dynamic RNA graphs with d = 13 Å. One global transition matrix contains all the transitions a single nucleotide participates in. The stationary distribution and the implied entropy (changes) help to understand the effects of ligand binding and potential improvements on this (the design problem at hand). The H obtained are shown in Fig. 7.
A positive value of H in Fig. 7 indicates a loss of conformational entropy upon ligand binding. Interestingly, the binding loop as well as complexing nucleotides gain entropy. This is due to the fact of rearrangements between the nucleotides in spatial proximity to the ligand because 70% of the accessible surface area of TC is buried within the binding pocket L3 [23]. Experiments confirmed that local rearrangement of the binding pocket are necessary to prevent a possible release of the ligand [41]. Furthermore crystallographic studies have revealed that the largest changes occur in L3 upon TC binding [23]. Furthermore, we observe the highest entropy difference for nucleotide G51. Experimental data reveals that G51 crosslinks to tetracycline when the complex is subjected to UV irradiation [42]. These findings suggest a strong interaction with TC and thus a dramatic, positive change in H. Nucleotides A52 and U54 show a positive entropy difference inside L3. Interestingly, molecular probing experiments show that G51, A52, and U54 of L3 are-in the absence of the antibiotic-the most modified nucleotides [23,34]. Clearly, they change their conformational flexibility upon ligand binding due they direct interaction with the solvent. U54 further interacts with A51,A52,A53 and A55 building the core of the riboswitch [23]. Taken together, these observations reveal that U54 is necessary for the stabilization of L3. A more flexible dynamics ( H ) will change the configuration of the binding pocket and promotes TC release.

Comparison of tetracycline derivates
In this section, we want to investigate possible changes in configuration entropy by binding of different TC derivates. Moreover, we want to contrast StreAM-T g to conventional metrics like RMSF (Eq. 5) using the entropy of the stationary distributions obtained from T g (a). Therefore, we simulated a set consisting of four different antibiotics (atc, dc, ddtc, tc) in complex with the riboswitch of "Workflow". The structures of all derivates, each with different functional groups and different chemical properties, are shown in Fig. 3. For this approach we use a precomputation of P(X, r i , r j ) to reduce the number of transition matrices for a full computation of T g (a). Hence for all four simulations of TC derivates, we computed 1,763,208 (for tc), 1,534,488 (for atc), 2,685,816 (for dc) and 2,699,280 (for ddtc) transition matrices and combined them into 65 global models T g (a) each. Similar to "Application to molecular synthetic biology", we compute H = H wt − H complex from the stationary distribution as well as RMSF = RMSF wt − RMSF complex from individual RMSF computations. The results are shown in Fig. 8.
The RMSF in Fig. 8b and in H Fig. 8a shows a similar picture in terms of nucleotide dynamics. If we focus on atc we can observe a loss of conformational entropy upon ligand binding for almost every nucleotide. Considering this example the RMSF only detects a significant loss of nucleotide-based dynamics ranging from nucleotide 37-46. However, for dc, we observe the same effects like for dc. Contrary to this observation we detect, for ddtc, an increase in dynamic upon ligand binding as well as negative RMSF values. For tc, we observe a similar picture as for 7-cl-tc ("Comparison of tetracycline derivates"). In a next step, we want to compare the obtained differences in stationary distribution with experimental values. To this end,we use an experimental metric: xfold values. A xfold value describes the efficiency of regulation in vivo and is given as the ratio of fluorescence without and with antibiotic in the experimental setup [43]. Unfortunately, atc reveals no experimental dynamics due to growth inhibition caused by the toxicity of the respective tc derivative [43]. In contrast to atc, dc and ddtc show only a weak performance (xfold = 1.1) in comparison to tc (xfold = 5.8) and 7-cl-tc (xfold = 3.8) [43]. On the one hand, atc and dc appear overall too rigid and on the other hand ddtc too flexible to obtain a stable bound structure, implying insufficient riboswitch performance. For our design criterion of high xfold, we conclude that only certain nucleotides are allowed to be affected upon ligand binding. In particular, we need flexible nucleotides for the process of induced ligand binding (like nucleotide G51 Fig. 7) and stabilization of the complex intermediates ("Application to molecular synthetic biology"). Additionally, the switch needs rigidity for nucleotides building the stem region of the TC-Aptamer upon ligand binding (like nucleotides A51, A52 and A53 Fig. 7).

Summary, conclusion, and future work
Simulation tools to design and analyze functionally RNA based devices are nowadays very limited. In this study, we developed a new method StreAM-T g to analyze structural transitions, based on a coarse grained representation of RNA MD simulations, in order to gain insights into RNA dynamics. We demonstrate that StreAM-T g fulfills our demands for a method to extract the coarse-grained Markovian dynamics of motifs of a complex RNA molecule. Moreover StreAM-T g provides valuable insights into nucleotide based RNA dynamics in comparison to conventional metrics like the RMSF.
The effects observed in a designable riboswitch can be related to known experimental facts, such as conformational altering caused by ligand binding. Hence StreAM-T g derived Markov models in an abstract space of motif creation and destruction. This allows for the efficient analysis of large MD trajectories. Thus we hope to elucidate molecular relaxation timescales, spectral analysis in relation to single-molecule studies, as well as transition path theory in the future. At present, we use it for the design of switchable synthetic RNA based circuits in living cells [2,44].
To broaden the application areas of StreAM-T g we will extend it to proteins as well as evolutionary graphs mimicking the dynamics of molecular evolution in sequence space [45].