Jacques Cohen - Bio-Informatics
 

Over the past few years, I have directed my research and teaching efforts in the new area called Bio-Informatics (a term preferred by biologists) or Computational Biology (as computer scientists usually call it). I was initially attracted to this new subject by the work of David Searls [Sea 92], wherein he outlined the relationship between formal languages and biology. I became fascinated by the topic of parsing DNA and amino acids sequences for the purpose of determining genes and protein shape.

Most of the work done in parsing and compiling assumes deterministic grammars; this is largely insufficient to describe biological models. Grammars proposed for determining genes and the secondary structure of proteins are highly ambiguous. In addition, one has to handle very large strings and due to variations in the strings being analyzed a probabilistic approach is in order.

One of my doctoral students, Olivier Baby, wrote his dissertation on non-deterministic FSGs that define all possible alternations of exons and introns [Bab 98]. That is an important problem in gene finding. It turns out that ambiguous FSGs are very useful in defining those alternations. Unfortunately, the existing work in formal languages neglects this important area. The signifcant work in FSGs establishes that non-deterministic FSGs -- and therefore ambiguous grammars -- have equivalent deterministic counterparts. In the case of ambiguous grammars that counterpart is obtained by a transformation that annihilates all vestiges of multiple parsers.

The important consideration in this case is that: given an ambiguous FSA and a string it accepts, it is always possible to generate (in linear time) a DAG whose paths represent all possible parses of that string. The main problem then becomes finding of the most likely (optimal) path in the DAG. That is the essence of Hidden Markov Models (HMMs) that have been widely used in speech recognition and now in bio-informatics. It requires the use of probabilistic and dynamic programming methods.

It turns out that a similar kind of technique is useful in parsing ambiguous CFGs. For example, RNA shape can be defined by highly ambiguous probabilistic CFG grammar and the goal of parsing is the transformation of those grammars into DAGs that can later be analyzed for optimal paths using dynamic programming.

In [Coh 99c], I summarize my views about the usefulness of logic programming (LP) and constraint LP (CLP) in solving some computational biology problems. I have also started to investigate the usage of machine learning (ML), in particular the methods used in Inductive LP (ILP). What is needed is to introduce probabilistic approaches in ML to deal adequately with the interesting problems in bio-informatics.

Presently, there are two major problems that interest me:

  • The inverse protein-folding problem: given a sequence of amino acids representing a known three-dimensional protein structure, determine sequences that are likely to have the same structure. The so-called threading approach belongs to this category of algorithms. A recent paper by Kleinberg [Kle 99] explores this topic from a theoretical operations research perspective.
  • The reverse problem of cell regulation, also called modeling: given the micro-array data describing curves specifying how protein production or gene-expression varies with time, determine the regulation graph that governs cell behavior. Ideally, one would want to determine metabolic-pathways, which are essentially generalizations of the regulation graphs that take into account detailed interactions between the environment of a cell and its inner workings.

My recent paper [Coh 01b] explores how constraints may be used to generate regulation graphs from micro-array data.

In the spring of 2002 I was invited to co-teach a joint course in Bio-Informatics offered to Brandeis and Wellesley College students. That course enabled me to establish closer ties with two biologists: Dagmar Ringe from Brandeis and Andrew Webb from Wellesley. I am a believer that work in bio-informatics has to combine researchers in both biology and in computer science. Interacting with biologists is among my highest priorities.

References

[Sea 92] D. Searls, The Linguistics of DNA, American Scientist, 80, 579-591, 1992.

[Kle 99] J. Kleinberg, Efficient Algorithms for Protein Sequence Design and the Analysis of Certain Evolutionary Fitness, Proceedings of the 3rd ACM RECOMB International Conference on Computational Molecular Biology, 1999.


Home - Research Topics - Publications - Novel Contributions - Biographical Notes - Editorial Boards - Educational Activities - Ph. D. Dissertations and Senior Theses - Visiting Post-Docs - Interns - Teaching - Hobby

Last modified July 10th, 2002
jc@cs.brandeis.edu
Valid HTML 4.01!