Next:Conclusion and suggestions forUp:RNA secondary structure analysisPrevious:First algorithm

Subsections

Complexity

Time:
Space:

Representing the solutions

Example:

Parallel analysis

Modification of the algorithm:

Second Algorithm

Specification

In this second algorithm we use a slightly more complicated model: here, we introduce L_n_,i in addition to S_n_,i. The latter is as before, and the former represents the same string, but with the additional information that the two extreme nucleotides of the sequence are matched (that means L_n_,iis not defined for every n and i).

$\begin{displaymath}E(S_{n,i})=min\left\{ \begin{array}{c}E(S_{n-1,i})\\E(S_{......k,i})+E(S_{n-k,i+k})\right) \\E(L_{n,i})\end{array}\right.\end{displaymath}$

(2)

For E(L_n_,i), we have the different values:

$\alpha (r_{i},r_{i+n-1})+\xi (n-2),$ if L_n_,i is a hairpin loop.
$\alpha (r_{i},r_{i+n-1})+\eta +E(L_{n-2,i+1})$ if L_n_,i is a helical region (or double helix).
$min_{1\leq p<n-3}\left( \alpha (r_{i},r_{i+n-1})+\beta (p)+E(L_{n-p-2,i+1})\right)$ if L_n_,i is a bulge on i.
$min_{1\leq q<n-3}\left( \alpha (r_{i},r_{i+n-1})+\beta (q)+E(L_{n-q-2,i+q+1})\right)$ if L_n_,i is a bulge on i+n-1.
$min_{p,q}\left( \alpha (r_{i},r_{i+n-1})+\gamma (p+q)+E(L_{n-p-q-2,i+q+1})\right)$ if L_n_,i is an interior loop.
$min_{1<k<n}\left( E(S_{k,i})+E(S_{n-k,i+k})\right)$ if L_n_,iis a multi-loop.

The variable configurations (hairpin, helix, bulge, interior loop, multiloop) are depicted below:

$\resizebox*{0.7\textwidth}{!}{\includegraphics{figures/rna_struct_annot.ps}}$

The values considered in computing the energy are:

$\alpha (C-G)=-10$
$\alpha (A-U)=-7$
$\alpha (G-U)=-1$ : this pair has a higher energy than the two others; this explains why it does not appear very often in secondary structures.
$\eta =-1$
$\xi (k)=\beta (k)=\gamma (k)=k$

These values have been inferred from the information given on M. Zuker's page (see [6]); however note that they may vary substantially according to the environment, the temperature, etc.

$\xi ,\beta$ and $\gamma$ are destabilizing energies (>0) whereas $\alpha$ and $\eta$ are stabilizing (<0). The second algorithm operates in a similar manner to the first, except that instead of having only one matrix Energy and one matrix Previous, we have two pairs (one corresponding to S, and one corresponding to L): Energy, Loop, Prev_energy and Prev_loop. Moreover, Prev_energy and Prev_loop are now 3-dimensional arrays; that means that, for given values of n and i, Prev_energy[n, i] and Prev_loop[n, i] are two arrays of pointers: this is to handle the fact that there may be several solutions. Note also that the pointers in Prev_loop are now represented by a couple of integers: this is due to the large number of possible solutions at each step. For given values of n and i, the program first computes E(L_n_,i) and then E(S_n_,i) (the latter requiring known values of the former). These prev-matrices are a compact way of storing all the solutions of minimal energy. Actually they form a DAG (directed acyclic graph). Note that although this DAG contains all the solutions, listing them could very well be exponential.

Nucleus of this algorithm

Generating the matrices:

char* seq (sequence)
int n (length of the sequence)
for length = 3 to n
begin
    for i = 1 ton-length+1
    begin
        compute_loop(length, i, seq)
        compute_energy(length, i, seq)
    end
end

The above program uses about the same approach as in the previous algorithm except that there is an additional function: compute_loop that takes into account all the additional elements mentionned in section 2.1.

Retracing the solutions:

The boolean variable transition determines the energy to be considered: 0 for prev_energy and 1 for prev_loop.

If transition = 0, a random element of the array previous_energy[length, i] is considered; according to its value, 1 or 2 may occur: 1 indicates that the procedure undergoes a loop mode, 2 means the considered part of the sequence is split in two parts.

If transition = 1,then, 3, 4 or 5 occurs: 3 corresponds to a hairpin loop, 4 to a multiloop, and 5 to a bulge or an interior loop.

procedure find_couple(length, i , seq, transition)
    begin
    if $length\geq 2$ then
        begin
            if transition = 0 then
            begin
                k = previous_energy[length, i, random]
                if k = 0 then find_couple(length, I, seq, 1)                             (1)
            else
            begin
                find_couple(k, i, seq, 0)
                find_couple(length-k, i+k, seq, 0)                                            (2)
            end
        end
        else
        begin
            create_couple(i,i+length-1)
            (p,q) = previous_loop[length, i, random]
            if p=-1 then
            begin
                if q=-1 then end                                                                          (3)
                else
                begin
                    find_couple(q, i+1, seq, 0)
                    find_couple(n-q-2, i+q+1, seq, 0)                                        (4)
                end
            end
            else find_couple(length-p-q, i+p, 1)                                        (5)
        end
    end
end

In this procedure, only one random solution is traced. Indeed, all the solutions are contained in the two matrices prev_energy and prev_loop, but it would take an exponential time to list them all. This solution is obtained by using the random numbers that are generated by the function random.

Example:

The following figure describes the operational behavior of the function find_couples:

$\resizebox*{0.7\textwidth}{!}{\includegraphics{figures/grapht.ps}}$

This graph corresponds to the following secondary structure:

Complexity

Time:

Once again, there are two loops in the program; therefore the complexity is O(n²); however, now the function compute_loop is O(n²), instead of O(n): indeed, there is minimum on p and q to compute (for the interior loop). So the complexity of this algorithm is O(n⁴); however there is an artifice to reduce the complexity of the interior loop by creating a matrix called diagonal:

diagonal[length, i, l] = $min\left\{ \begin{array}{c}loop[length-l-2,i+l]\\diagonal[length-1,i,l]\end{array}\right.$ with l corresponding to p+q.

This matrix is computed on the fly in the main loop (on length and i): l may vary from 1 to n, so the complexity is O(n). Once this matrix is known, the formula for the interior loop becomes: $min_{2\leq l\leq n-4}\left( \alpha (r_{i},r_{i+n-1})+\gamma (l)+diagonal[n,i,l]\right)$ . The complexity time becomes O(n). So, instead of having a n²-complexity, we have a single loop and a n-complex function within two embedded loops; this reduces the complexity from O(n⁴) to O(n³).

Actually, the length-index of diagonal is not even needed since the recursion depth is 1: thus we may reduce the storage to diagonal1[i,l] and diagonal2[i,l]. Of course, a prev_diagonal matrix is parallel to diagonal, so that p and q may be recovered, and solutions traced.

The other function find_couple has the same complexity as in the previous algorithm: O(n²). So, the highest complexity O(n³) prevails.

Space:

The two matrices energy and loop have O(n²) space complexity. The prev-matrices have O(n³) space complexity. The recursive function creates a stack of size $O(\ln (n))$ . The addition of diagonal and prev_diagonal does not change the space complexity, since they are O(n²) and O(n³). So the space complexity is O(n³).

Representing the solutions

The solutions are represented using the ESSA package (see [3] and [7]). The following figure provides an example of secondary structure of an RNA sequence. The ESSA program has as input the RNA structure and the set of indices representing two nucleotides that are linked.

Sometimes, a user wishes to determine the best secondary structure according to certain requirements. For instance he may want to have a helix between nucleotides p and q. To make this task simpler, a linear notation for displaying the structure is introduced:

( : first element of a U-A pair.
) : second element of a U-A pair.
[ : first element of a G-C pair.
] : second element of a G-C pair.
{ : first element of a G-U pair.
} : second element of a G-U pair.
. : nucleotide not linked.

For instance:

( ( ( ( ( ) ) ) ) ) is a helix made of 5 A-U pairs.
( . [ ( ( [ { [ [ [ . ( . [ . [ ( [ . . [ ( ( ( . . ) [ [ [ [ ( . [ [ [ [ ( [ . ] ) . [ . ] ] ] . ( [ ( ( . ) ) ] ) ] ] ) [ [ . . ] ] ] ] ] ] ) [ ( [ . . ] ) ] ) ] ] ) ] ] [ . [ . ] ] ) ] ( { [ [ [ ( . ) . ] ] . ] } ) ] ] } ] ) ) ] ) corresponds to the above two-dimensinal structure.

The user may require the presence of a particular part of the structure by using the following additional annotations:

< : first element of any couple.
> : second element of any couple.
* : any element (linked or not).

Example:

Let us consider the following sequence: ACCCGGGUUUAGUUUGCCCCC.

This sequence has 11 different solutions :

1.: . [ [ . ] ] [ . ( . ) [ . . . [ . ] ] ] .
2.: . . [ . ] [ [ ( . . ) [ . . . [ . ] ] ] ]
3.: . . [ . ] [ [ . ( . ) [ . . . [ . ] ] ] ]
4.: . [ . . ] [ [ ( . . ) [ . . . [ . ] ] ] ]
5.: . [ [ . ] ] [ . ( . ) [ . . . [ . ] ] . ]
6.: . [ [ . ] ] [ ( . . ) [ . . . [ . ] ] ] .
7.: . [ . . ] [ [ . ( . ) [ . . . [ . ] ] ] ]
8.: . [ [ . ] ] [ . ( . ) [ . . . [ . . ] ] ]
9.: . [ [ . ] ] [ . ( . ) [ . . . [ . ] . ] ]
10.: . [ [ . ] ] [ ( . . ) [ . . . [ . ] . ] ]
11.: . [ [ . ] ] [ ( . . ) [ . . . [ . ] ] . ]

Assume that the user knows for some reason that the second and fifth nucleotides of the sequence are linked: in this case, he will enter the following string: * < * * >; < and > standing for second and fifth. In this case, the program outputs one of the two following sequences :

. [ . . ] [ [ ( . . ) [ . . . [ . ] ] ] ]
. [ . . ] [ [ . ( . ) [ . . . [ . ] ] ] ]

Note that this information is not taken into account in the generation of the DAG. The present approach consists of generating random optimal solutions and checking if one of them satisfies the user's requirements. A better version is proposed in the conclusion.

Parallel analysis

The parallelization described below pertains to the DAG generation part of the algorithm. The parallelization is performed in the second embedded loop.

Modification of the algorithm:

char* seq (sequence)
int n (length of the sequence)
for length = 3 to n
begin
    in parallel do
        for i = 1 ton-length+1
        begin
            compute_loop(length, i, seq)
            compute_energy(length, i, seq)
        end
    od
end

$\resizebox*{0.7\textwidth}{!}{\includegraphics{figures/speeds2.eps}}$

The following table presents the actual times in seconds for the SGI-Onyx machine

200 400 600 800 1000

1 3.3 24.9 86.7 216.4 461.1

2 3.3 17.9 55.1 130.6 244.1

5 3.3 13.4 35.1 74.4 141.7

8 3.3 12.9 31.3 63.2 111.1

The algorithm has cubic complexity, so the execution time may be approximated by a polynomial: $P_{1}(n)=0.7\left( \frac{n}{100}\right) ^{3}-3.5\left( \frac{n}{100}\right) ^{2}+10.8\frac{n}{100}-8$ .

When dealing with n_proc parallel processors, the execution time T_nprocmay be approximated by: $P_{n_{proc}}(n)=\frac{0.7\left( \frac{n}{100}\right) ^{3}-3.5\left( \frac{n}{100}\right) ^{2}}{n_{proc}}+10.8\frac{n}{100}-8$ . Indeed, the parallel instructions are executed within the second loop of the algorithm, this means that they have an effect on the quadratic and cubic coefficients of the polynomial.

The following curves represents: $\frac{T_{n_{proc}}}{P_{n_{proc}}}(n)$ :

$\resizebox*{0.7\textwidth}{!}{\includegraphics{figures/rapport_parallel.eps}}$

The values obtained for $1000<n\leq 2000$ have been estimated by polynomial interpolation. The above figure clearly shows that for larger n and n_proc, the results approximate the value 1.0 which is the expected asymptotic value of $\frac{T_{n_{proc}}}{P_{n_{proc}}}(n)$ .

Therefore, the model is satisfactory, but not accurate enough to estimate the speed for larger numbers of processors. Let us make a more accurate analysis of the algorithm: inside the two outermost loops, we have the call of a function and a loop. The execution time is: C*length+D.

This component is executed (n-length) times; therefore the execution time becomes: (n-length)(C*length+D)+E, the multiplicative constant being included in C and D. In addition, one should take into account an embedded n-loop. So, the final execution time is:
$\sum ^{n}_{length=1}\left[ (n-length)(C*length+D)+E\right] +F$ .

Again, the multiplicative constant is included in C, D and E. The parallel part operates on the second loop; the approximate parallel execution time becomes:

$\begin{displaymath}Q_{n_{proc}}(n)=\sum ^{n}_{length=1}\left[ \frac{(n-length)}{n_{proc}}(C*length+D)+E\right] +F\end{displaymath}$

$\resizebox*{0.7\textwidth}{!}{\includegraphics{figures/ratio_parallel2.eps}}$

These results are much more accurate for the larger lengths (above 800), the fact that the estimates have an odd behaviour for lengths under 800 is due to the fact that the above formula remains an approximation; the errors are reduced as n increases. We may now make estimates for the algorithm's efficiency as the number of processors is increased:

$\resizebox*{0.7\textwidth}{!}{\includegraphics{figures/speedup_estim.eps}}$

These curves show that for a string of length n=1000, the number of processors needed to obtain significant speed-up is no larger than 10. Nevertheless, when n=4000, gains are possible beyond 50 processors.

Second algorithm listing

Example

Next:Conclusion and suggestions forUp:RNA secondary structure analysisPrevious:First algorithm

Nicolas DeRobert

2000-06-29

	200	400	600	800	1000
1	3.3	24.9	86.7	216.4	461.1
2	3.3	17.9	55.1	130.6	244.1
5	3.3	13.4	35.1	74.4	141.7
8	3.3	12.9	31.3	63.2	111.1