Olivier Baby's PhD Dissertation

Non-Deterministic, Constraint-Based Parsing of Human Genes

A novel approach is proposed for predicting coding and non-coding
regions (i.e., exons and introns) in human genes. Unlike other
approaches, the one advocated in this thesis does not use learning
methods, sequence comparison, or statistical information. Instead, it
relies solely on the identification of the patterns occurring at the
boundaries between exons and introns, and on a set of simple
constraints. These are formulated in two different ways: one in which
directed acyclic graphs are generated using non deterministic
finite-state automata, and the other in which more compact, directed
acyclic graphs are constructed directly.

The first algorithm consists of defining one or more non-deterministic
finite-state automata (NDFSA) that recognize all gene sequences
satisfying the various patterns and constraints. Given the NDFSA and a
sequence to be analyzed, it is possible to generate a directed acyclic
graph (DAG) whose paths specify all possible parses, i.e.,
alternations of exons and introns, in that sequence. The generation
of this DAG is performed in linear time even though the number of
paths may be exponential. The partitioning of the transitions of the
NDFSA in two types, coding and non-coding, allows one to determine
gene structure by considering the subsequences of the input sequence
that are predicted as non-coding in all parses.

The purpose of the second algorithm is to reduce the memory space used
by the first algorithm. This is achieved by a direct construction of a
more compact DAG whose paths also describe all possible parses of the
sequence being analyzed. As in the first algorithm, gene structure
prediction is performed in linear time by considering the subsequences
that are predicted as non-coding in all parses.

The approach has higher accuracy than that achieved by the methods
combining different statistical measures, and it reveals the reason
why a predicted alternation of exons and introns satisfies the specified
patterns and constraints. The two proposed algorithms represent two
possible directions within a spectrum of possible means to solve the
problem using NDFSA and DAGs.