Next:Second AlgorithmUp:RNA secondary structure analysisPrevious:Introduction

Subsections

Example: sequence C C G G A

Explanation:

Complexity

Time:
Space:

Parallel version

Limitations

First algorithm

Specification

The first algorithm uses a very simple model: given a RNA-sequence, it returnsone of the secondary structures of the sequence having the maximum number of links (A-U and C-G). This can be expressed by the following difference equation:

$\begin{displaymath}E(S_{n,i})=min\left\{ \begin{array}{c}E(S_{n-2,i+1})+\alpha......n}\left( E(S_{k,i})+E(S_{n-k,i+k})\right)\end{array}\right.\end{displaymath}$

(1)

where r_i is the i-th nucleotide, S_n_,i the sequence of length n starting at nucleotide i, and E the energy of the considered sequence. $\alpha (r_{i},r_{j})$ = -1 if $\{r_{i},r_{j}\}$ ={A,U} or {C,G}, 0 in the other cases.

The algorithm operates as follows: it constructs two matrices row by row (where the row index is n and where the column index is i). The rows are assigned using (1). The first matrix (called energy) contains the energy of the considered sequence, and the second one (called previous) contains an integer pointing to one of the best solutions in (1); for instance, let us consider previous[n, i]: if E(S_n_,i)=E(S_k_0,i)+E(S_n_-k0,i+k0)then previous[n, i] = k₀, and if $E(S_{n,i})=E(S_{n-2,i+1})+\alpha (r_{i},r_{i+n-1})$ , then previous[n, i] = n.

The matrix previous is used at the end of the computation to retrace all the matching pairs of nucleotides.

Nucleus of the algorithm

Generating the matrices:

char* seq (sequence)
int n (length of the sequence)
for length = 3 to n
begin
for i = 1 ton-length+1
compute_energy(length, i, seq)
end

where compute_energy computes the energy of the part of the sequence of length length starting at position i, and places the result in energy[length, i], and the step at which it reached it (the above k₀) in previous[length, i].

Retracing the solution:

procedure find_couple(length, i, seq) begin
    if $length\geq 2$ then
    begin
        k = previous[length, i]
        if k = 0 then
        begin
            create_couple(i, i+length-1)
            find_couple(length-2, I+1, seq)
        end
        else
        begin
            find_couple(k, i, seq)
            find_couple(length-k, i+k, seq)
        end
    end
end

find_couple is a recursive function: it is initially called with the values n, 1 and seq to find all the matching pairs of the solution. These pairs constitute the secondary structure of a given RNA string. Note that this algorithm produces only one of the optimal solutions; indeed there may be several solutions that yield the same minimal energy level.

Example: sequence C C G G A

The two matrices below are generated by the program when given the sequence C C G G A : Energy :

	C	C	G	G	A
1	0	0	0	0	0
2	0	-1	0	0
3	-1	-1	0
4	-2	-1
5	-2

Previous :

	C	C	G	G	A
1	-	-	-	-	-
2	-	-	-	-
3	3	3	3
4	4	2
5	4

Explanation:

The notation $S\rightarrow U$ and V (where S, U and V are parts of the initial molecule) indicates that the string S has two components U and V, and that computing the matching pairs of S is reduced to computing those of U and V.

C-C-G-G-A $\rightarrow$ C-C-G-G and A according to Previous[5, 1] = 4.
C-C-G-G $\rightarrow$ C and C-G and G according to Previous[4, 1]=4, $\alpha (C,G)<0$ so C and G are matched.
C-G $\rightarrow$ C and G : for the same reason, C and G are matched.

Finally, two pairs are generated: {1, 4} and {2, 3}.

The following DAG summarizes how the original molecule is separated into its components.

$\resizebox*{0.3\textwidth}{!}{\includegraphics{figures/graph1.ps}}$

All the graphs of this type have been generated using Graphviz (see [11]).

Complexity

Time:

There are two loops in the program; therefore the complexity is O(n²), but the function compute_energy is n-linear. So the overall complexity of this algorithm is O(n³); the other function find_couple is recursive and calls itself twice: therefore its complexity is O(n²)and the highest complexity O(n³) prevails.

Space:

The two generated matrices have O(n²) space complexity. The recursive function creates a stack of size $O(\ln (n))$ . Therefore, the space complexity is O(n²).

Parallel version

Modification on the algorithm:

In this version, one of the loops is done in parallel:

char* seq (sequence)
int n (length of the sequence)
for length = 3 to n
begin
    in parallel do
        for i = 1 to n-length+1
        compute_energy(length, i, seq)
    od
end

Results:

Let us now consider the time complexity of the parallel algorithm, and its speedup with the number of processors. The following results were obtained using a randomly generated sequence of 1000 nucleotides.

$\resizebox*{0.7\textwidth}{!}{\includegraphics{figures/speedup.eps}}$

$\resizebox*{0.7\textwidth}{!}{\includegraphics{figures/evo_time.eps}}$

Note that :

the maximum speedup is reached when 10 processors are utilized, its value being approximately 7.
the speedup is linear up to 6 processors

Limitations

The energy considered here is K*[number of matching pairs] (where K < 0). However, this model is not very realistic; free nucleotides have different destabilizing energies according to their position in the structure; moreover, embedded matching pairs (e.g., A A A ...U U U) produce helices which have greater stabilizing properties than those same pairs considered independently.

The algorithm presented here gives one particular solution (always the same); note however that there may be several solutions of minimum energy; in addition, the user of this program may already know some particular aspects of the secondary structure he is trying to obtain. That is why the algorithm should be able to keep track of all minimum energy solutions. Those features are incorporated to the second algorithm.

First algorithm listing

Next:Second AlgorithmUp:RNA secondary structure analysisPrevious:Introduction

Nicolas DeRobert

2000-06-29