Next:Second
AlgorithmUp:RNA
secondary structure analysisPrevious:Introduction
Subsections
First algorithm
Specification
The first algorithm uses a very simple model: given a RNA-sequence, it
returnsone of the secondary structures of the sequence having the maximum
number of links (A-U and C-G). This can be expressed by the following difference
equation:
|
(1) |
where ri is the i-th nucleotide, Sn,i
the sequence of length n starting at nucleotide i, and E
the energy of the considered sequence.
= -1 if ={A,U}
or {C,G}, 0 in the other cases.
The algorithm operates as follows: it constructs two matrices row by
row (where the row index is n and where the column index is i).
The rows are assigned using (1). The first
matrix (called energy) contains the energy of the considered sequence,
and the second one (called previous) contains an integer pointing
to one of the best solutions in (1); for
instance, let us consider previous[n, i]: if E(Sn,i)=E(Sk0,i)+E(Sn-k0,i+k0)then
previous[n, i] = k0, and if ,
then previous[n, i] = n.
The matrix previous is used at the end of the computation to
retrace all the matching pairs of nucleotides.
Nucleus of the algorithm
Generating the matrices:
char* seq (sequence)
int n (length of the sequence)
for length = 3 to n
begin
for i = 1 ton-length+1
compute_energy(length, i, seq)
end
where compute_energy computes the energy of the part of the sequence
of length length starting at position i, and places the result
in energy[length, i], and the step at which it reached it
(the above k0) in previous[length, i].
Retracing the solution:
procedure find_couple(length, i, seq) begin
ifthen
begin
k = previous[length,
i]
if k = 0 then
begin
create_couple(i, i+length-1)
find_couple(length-2, I+1, seq)
end
else
begin
find_couple(k, i, seq)
find_couple(length-k, i+k, seq)
end
end
end
find_couple is a recursive function: it is initially called with
the values n, 1 and seq to find all the matching pairs of
the solution. These pairs constitute the secondary structure of a given
RNA string. Note that this algorithm produces only one of the optimal solutions;
indeed there may be several solutions that yield the same minimal energy
level.
Example: sequence C C G G A
The two matrices below are generated by the program when given the sequence
C C G G A :
Energy :
|
C |
C |
G |
G |
A |
1 |
0 |
0 |
0 |
0 |
0 |
2 |
0 |
-1 |
0 |
0 |
|
3 |
-1 |
-1 |
0 |
|
|
4 |
-2 |
-1 |
|
|
|
5 |
-2 |
|
|
|
|
Previous :
|
C |
C |
G |
G |
A |
1 |
- |
- |
- |
- |
- |
2 |
- |
- |
- |
- |
|
3 |
3 |
3 |
3 |
|
|
4 |
4 |
2 |
|
|
|
5 |
4 |
|
|
|
|
Explanation:
The notation and
V (where S, U and V are parts of the initial molecule) indicates
that the string S has two components U and V, and
that computing the matching pairs of S is reduced to computing those
of U and V.
-
C-C-G-G-A
C-C-G-G and A according to Previous[5, 1] = 4.
-
C-C-G-G
C and C-G and G according to Previous[4, 1]=4,
so C and G are matched.
-
C-G
C and G : for the same reason, C and G are matched.
Finally, two pairs are generated: {1, 4} and {2, 3}.
The following DAG summarizes how the original molecule is separated
into its components.
All the graphs of this type have been generated using Graphviz (see
[11]).
Complexity
Time:
There are two loops in the program; therefore the complexity is O(n2),
but the function compute_energy is n-linear. So the overall
complexity of this algorithm is O(n3); the other
function find_couple is recursive and calls itself twice: therefore
its complexity is O(n2)and the highest complexity
O(n3) prevails.
Space:
The two generated matrices have O(n2) space complexity.
The recursive function creates a stack of size .
Therefore, the space complexity is O(n2).
Parallel version
Modification on the algorithm:
In this version, one of the loops is done in parallel:
char* seq (sequence)
int n (length of the sequence)
for length = 3 to n
begin
in parallel do
for i = 1 to n-length+1
compute_energy(length,
i, seq)
od
end
Results:
Let us now consider the time complexity of the parallel algorithm, and
its speedup with the number of processors. The following results were obtained
using a randomly generated sequence of 1000 nucleotides.
Note that :
-
the maximum speedup is reached when 10 processors are utilized, its value
being approximately 7.
-
the speedup is linear up to 6 processors
Limitations
The energy considered here is K*[number of matching pairs] (where
K < 0). However, this model is not very realistic; free nucleotides
have different destabilizing energies according to their position in the
structure; moreover, embedded matching pairs (e.g., A A A ...U U U)
produce helices which have greater stabilizing properties than those same
pairs considered independently.
The algorithm presented here gives one particular solution (always the
same); note however that there may be several solutions of minimum energy;
in addition, the user of this program may already know some particular
aspects of the secondary structure he is trying to obtain. That is why
the algorithm should be able to keep track of all minimum energy solutions.
Those features are incorporated to the second algorithm.
Next:Second
AlgorithmUp:RNA
secondary structure analysisPrevious:Introduction
Nicolas DeRobert
2000-06-29