Simple Lattice Model of Protein Folding and the Inverse Folding Problem

The purpose of this work is to describe a simple lattice model for protein folding and to report the results obtained with it. We also describe a recent polynomial-time approach to the inverse protein folding problem proposed by J. Kleinberg. The lattice model is a simplified two-dimensional version of the 3-D model suggested by researchers at the NEC Center in Princeton. As an application, we apply Kleinberg's inverse protein folding approach to the main configurations of proteins in the 2-D lattice model.

The Lattice Model

Given a 3x3x3 cube with 27 internal mini-cubes, consider a Self-Avoiding Path (SAP) made up of 27 edges of the mini-cubes (see Figure 1). A node in the SAP is one of the vertices of the mini-cubes, and an edge of the SAP corresponds to an edge of a mini-cube. A SAP does not intersect itself (i.e., no edge is shared and the path never returns to a node it has already visited). In other words, no two nodes in a SAP have the same coordinates (i, j, k). It is assumed that a SAP starts at the point (0, 0, 0).

It can be shown that there are about 50,000 such SAPs that cannot be obtained from one another by symmetry. We define as Untouching Neighboring Nodes (UNN) any two nodes in a SAP that are not joined by an edge of the path but are neighbors in the sense that they are the two endpoints of an edge of a mini-cube.
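The UNN definition can be sketched in a few lines of Python. This is only an illustration: the function name `unn_pairs` and the representation of a path as a list of integer coordinate tuples are our assumptions, not part of the original model description.

```python
from itertools import combinations

def unn_pairs(path):
    """Return index pairs (i, j) of Untouching Neighboring Nodes.

    `path` is a list of integer coordinate tuples in path order
    (any dimension). Two nodes form a UNN pair when they are not
    joined by a path edge (j > i + 1) but are one lattice step apart.
    """
    pairs = []
    for i, j in combinations(range(len(path)), 2):
        if j == i + 1:
            continue  # consecutive nodes share a path edge
        dist = sum(abs(a - b) for a, b in zip(path[i], path[j]))
        if dist == 1:
            pairs.append((i, j))
    return pairs

# A four-node path around a unit square: only its two ends are UNN.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(unn_pairs(square))
```

The same function works unchanged for 2-D paths, since it only relies on the lattice distance between coordinate tuples.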

Fig. 1 The Lattice Model

The figure shows a 3x3x3 lattice and a Self-Avoiding Path (SAP).

Now consider all binary strings of length 27. There are 2²⁷ = 134,217,728 such strings. Notice that this number is much larger than the number of SAPs in the original cube.

Each of the ones and zeros in a binary string S corresponds to an amino acid. Amino acids are roughly classified into two types: Hydrophobic (H) or Polar (P). Hydrophobicity is a property of an amino acid that specifies a preference to be placed in the "inside" of the protein. In contrast, a polar amino acid preferentially occupies a position on the surface of the protein.

Intuitively, the approximate protein folding problem consists of determining, for each binary string S (also called an HP string), the "optimal" SAP, that is, the one that places "most" of its H elements in the interior of the cube and "most" of its P elements on the surface of the cube. The words "optimal" and "most" are defined quantitatively in the sequel.

A pair of UNN nodes in the cube can be of three types: HH, PP, and HP, since HP and PH represent the same pair. We now define an energy function E(P,S) where P is a SAP and S is a 27 digit binary string representing a protein. Given a string S and a SAP path P, E is computed as follows:

Let i and j be the indices of elements in a string S such that they correspond to a pair of UNN nodes in the cube. Then:

$$E(P,S)=\sum_{i,j}P(i,j)$$

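where P(i,j) = -3 if i and j are both hydrophobic; P(i,j) = -1 if one of i and j is hydrophobic and the other is polar; and P(i,j) = 0 if both are polar. These are the contributions used in the energy computation of the Appendix; with the negative sign, hydrophobic contacts lower the energy, so the optimal SAP is the one of minimal energy.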

In the lattice model one considers a given binary string S and exhaustively determines, among all SAPs, the one that yields the minimal energy E. A question arises: how are the strings S distributed among the SAPs? The researchers of NEC at Princeton [1] considered that problem and solved it by exhaustive search over all 27-digit binary strings.

An important remark is in order: if a given string optimally matches two or more SAPs, then it is reasonable to disregard it, since, in nature, a sequence of amino acids folds predominantly into a unique configuration.

A large number of strings S match a given optimal SAP. This means that many sequences of amino acids acquire the same shape when folded. This fact is validated by laboratory experiments. In humans, there may be 100,000 valid sequences of amino acids corresponding to genes, and it is estimated that those proteins fold into only a few thousand shapes (see Figure 2). Each of those 100,000 sequences contains hundreds of amino acids. In nature, each of the 20 existing amino acids has a different degree of polarity and hydrophobicity, specified by a real number between 0 and 1.

Fig. 2 HP Sequence Mapping to Shapes

Configurations for the Simplified 2-D Model

Instead of a 3x3x3 lattice, we used a 4x4 2-D lattice model. Figure 3 shows the SAPs for this model after symmetry elimination. In total, 2¹⁶ = 65,536 HP sequences of length 16 can be mapped onto the 4x4 lattice. We mapped those sequences to the 38 SAPs using an NEC-like approach. Figure 4 shows the distribution of the mapping. Note that the SAP numbered 22 is the most frequent optimal shape among all the 65,536 HP strings considered in the 2-D model.

Fig. 3 The 38 SAPs After the Symmetry Elimination


structure 1:

*--*  *--*  
|  |  |  |  
*  *  *  *  
|  |  |  |  
*  *  *  *  
|  |  |  |  
*  *--*  *  

-----------------

structure 2:

*--*  *--*  
|  |  |  |  
*  *  *  *  
|  |  |     
*  *  *--*  
|  |     |  
*  *--*--*  

-----------------

structure 3:

*--*  *--*  
|  |     |  
*  *  *--*  
|  |  |     
*  *  *--*  
|  |     |  
*  *--*--*  

-----------------

structure 4:

*--*  *--*  
|  |  |  |  
*  *  *  *  
|  |  |  |  
*  *  *  *  
|  |     |  
*  *--*--*  

-----------------

structure 5:

*--*  *--*  
|  |  |  |  
*  *  *  *  
|  |  |  |  
*  *--*  *  
|        |  
*  *--*--*  

-----------------

structure 6:

*--*  *--*  
|  |     |  
*  *--*  *  
|     |  |  
*  *--*  *  
|  |     |  
*  *--*--*  

-----------------

structure 7:

*--*  *--*  
|  |  |  |  
*  *--*  *  
|        |  
*  *--*--*  
|  |        
*  *--*--*  

-----------------

structure 8:

*--*  *--*  
|  |  |  |  
*  *--*  *  
|        |  
*  *--*  *  
|  |     |  
*  *--*--*  

-----------------

structure 9:

*--*  *--*  
|  |  |  |  
*  *--*  *  
|        |  
*  *--*  *  
|  |  |  |  
*  *  *--*  

-----------------

structure 10:

*--*--*--*  
|        |  
*  *--*--*  
|  |        
*  *  *--*  
|  |  |  |  
*  *--*  *  

-----------------

structure 11:

*--*--*--*  
|        |  
*  *--*--*  
|  |        
*  *  *--*  
|  |     |  
*  *--*--*  

-----------------

structure 12:

*--*--*--*  
|        |  
*  *--*--*  
|  |        
*  *--*--*  
|        |  
*  *--*--*  

-----------------

structure 13:

*--*--*--*  
|        |  
*  *  *--*  
|  |  |     
*  *  *--*  
|  |     |  
*  *--*--*  

-----------------

structure 14:

*--*--*--*  
|        |  
*  *--*  *  
|  |  |  |  
*  *  *--*  
|  |        
*  *--*--*  

-----------------

structure 15:

*--*--*--*  
|        |  
*  *--*  *  
|  |  |  |  
*  *  *  *  
|  |     |  
*  *--*--*  

-----------------

structure 16:

*--*--*--*  
|        |  
*  *--*  *  
|     |  |  
*  *--*  *  
|  |     |  
*  *--*--*  

-----------------

structure 17:

*--*--*--*  
|        |  
*  *--*  *  
|  |  |  |  
*  *  *  *  
|  |  |  |  
*  *  *--*  

-----------------

structure 18:

*--*--*--*  
|        |  
*--*--*  *  
      |     
*--*  *--*  
|  |     |  
*  *--*--*  

-----------------

structure 19:

*--*--*--*  
|        |  
*--*  *--*  
      |     
*--*  *--*  
|  |     |  
*  *--*--*  

-----------------

structure 20:

*  *--*--*  
|  |     |  
*--*  *--*  
      |     
*--*  *--*  
|  |     |  
*  *--*--*  

-----------------

structure 21:

*--*--*--*  
|        |  
*--*--*  *  
      |  |  
*--*  *  *  
|  |     |  
*  *--*--*  

-----------------

structure 22:

*--*--*--*  
|        |  
*--*  *--*  
   |  |     
*--*  *--*  
|        |  
*  *--*--*  

-----------------

structure 23:

*--*--*--*  
|        |  
*--*--*  *  
      |  |  
*--*--*  *  
|        |  
*  *--*--*  

-----------------

structure 24:

*--*--*--*  
|        |  
*--*  *--*  
   |  |     
*  *  *  *  
|  |  |  |  
*--*  *--*  

-----------------

structure 25:

*--*--*--*  
|        |  
*--*  *--*  
   |  |     
*  *  *--*  
|  |     |  
*--*  *--*  

-----------------

structure 26:

*--*--*--*  
|        |  
*--*  *  *  
   |  |  |  
*  *  *  *  
|  |  |  |  
*--*  *--*  

-----------------

structure 27:

*--*--*--*  
|        |  
*  *--*  *  
   |  |  |  
*  *  *  *  
|  |  |  |  
*--*  *--*  

-----------------

structure 28:

*--*--*--*  
|        |  
*--*--*  *  
         |  
*  *--*  *  
|  |  |  |  
*--*  *--*  

-----------------

structure 29:

*--*  *--*  
|  |  |  |  
*  *--*  *  
         |  
*  *--*  *  
|  |  |  |  
*--*  *--*  

-----------------

structure 30:

*--*--*--*  
|        |  
*--*--*  *  
      |  |  
*  *--*  *  
|  |     |  
*--*  *--*  

-----------------

structure 31:

*--*--*--*  
|        |  
*--*  *--*  
   |        
*  *--*--*  
|        |  
*--*--*--*  

-----------------

structure 32:

*--*--*--*  
|        |  
*  *--*--*  
   |        
*  *--*--*  
|        |  
*--*--*--*  

-----------------

structure 33:

*--*--*--*  
|        |  
*--*  *--*  
   |  |     
*  *  *--*  
|        |  
*--*--*--*  

-----------------

structure 34:

*--*--*--*  
|        |  
*--*  *  *  
   |  |  |  
*  *--*  *  
|        |  
*--*--*--*  

-----------------

structure 35:

*--*--*--*  
|        |  
*--*--*  *  
      |  |  
*  *--*  *  
|        |  
*--*--*--*  

-----------------

structure 36:

*--*--*--*  
|        |  
*--*  *--*  
      |     
*--*  *--*  
|        |  
*--*--*--*  

-----------------

structure 37:

*--*--*--*  
|        |  
*--*--*  *  
      |  |  
*--*  *  *  
|        |  
*--*--*--*  

-----------------

structure 38:

*--*  *--*  
|  |  |  |  
*  *  *  *  
|  |  |  |  
*  *  *  *  
|        |  
*--*--*--*  

-----------------

Fig. 4 Number of HP Strings Mapped to SAPs


Kleinberg's Approach

We now define the inverse protein folding problem. Given a SAP or a structure $\sigma$ in space, determine all the strings S that have $\sigma$ as the structure minimizing the energy E.

Kleinberg [2] proposes a polynomial-time approximate algorithm: given a structure $\sigma$ specified by a sequence of straight line segments in space, determine the set of HP strings that are potential candidates for having $\sigma$ as the folded structure. He thereby provides an approximate solution to the inverse protein folding problem. Kleinberg uses a Fitness Function $\Phi$ instead of the previously defined Energy function.

$\Phi$ is defined as:

$$\Phi(S)=\alpha \sum_{i,j\in S_{H},\, i<j-2} g(d_{ij})+\beta \sum_{i\in S_{H}} s_{i}$$

$\alpha$ and $\beta$ are constants; d_ij is the distance between the space coordinates of nodes i and j, and g is a function of that distance; s_i is the solvent-exposed surface area of node i in the given $\sigma$, which is non-zero only for nodes on the surface. S_H denotes the set of positions i such that the i-th element of the sequence S is H (Hydrophobic).

In what follows we summarize how one would make practical use of the inverse protein folding approach. Assume that, given a shape $\sigma$, one could determine all the HP strings corresponding to that $\sigma$. Kleinberg shows how one can determine one such HP string, and he then discusses how one could approach the problem of finding the set of strings that would also fold into $\sigma$.

He considers the set of HP strings obtainable from the optimal string by changing one amino acid type, i.e., an H being changed into a P, or a P being changed into an H. Kleinberg showed that such an ensemble of strings may not have a continuous space representation. This means that if a string S_i is transformed by a change of one amino acid type into a string S_j, and afterwards into a string S_k also by a single change, then it may occur that S_i and S_k are optimally foldable into $\sigma$ whereas S_j is not. In other words, the space of strings matching $\sigma$ is not convex or continuous. Kleinberg calls the set of all the S_i having $\sigma$ as optimal shape the landscape L of $\sigma$.
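The single-change neighborhood and the notion of a landscape that is not "continuous" can be illustrated with a short Python sketch. The function names and the toy landscapes below are hypothetical and chosen only for illustration; they are not taken from Kleinberg's paper.

```python
def single_flip_neighbors(s):
    """All HP strings that differ from `s` in exactly one position."""
    flip = {"H": "P", "P": "H"}
    return [s[:k] + flip[ch] + s[k + 1:] for k, ch in enumerate(s)]

def is_connected(landscape):
    """True if any string of `landscape` can be reached from any other
    through single H/P changes that never leave the set."""
    landscape = set(landscape)
    if not landscape:
        return True
    start = next(iter(landscape))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for nb in single_flip_neighbors(current):
            if nb in landscape and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == landscape

# Toy landscapes (hypothetical): in the first one, the intermediate
# strings "HP" and "PH" are missing, so "HH" and "PP" are isolated.
print(is_connected({"HH", "PP"}))
print(is_connected({"HH", "HP", "PP"}))
```

In the first toy set, S_i = "HH" and S_k = "PP" belong to the set while every intermediate S_j does not, which is the situation described above.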

It turns out that determining L is itself an exponential problem, since L may contain a very large number of strings. It would therefore be desirable to construct a recognizer R (say, an automaton) capable of determining whether a given string S belongs to L. This is an open problem that could be studied with machine learning approaches (or grammar inference techniques) using the model described in this work.

Given the pairs $(\sigma _{i},R_{i})$ for a large number of different shapes $\sigma _{i}$ known to exist in nature, the search for the shape $\sigma _{k}$ corresponding to a given string S would proceed as follows. Attempt to recognize S with each of the recognizers R_i. If S is accepted by the k-th recognizer R_k, then $\sigma _{k}$ is the most likely structure for S.

Let us now return to the problem of finding one HP string that optimizes a given structure $\sigma$. Kleinberg shows that the problem can be reduced to a maximum-flow/minimum-cut computation on a graph directly obtainable from $\sigma$. In Figure 5, we show the graph obtained from the structure numbered 22. The vertex set V of the graph G consists of s, t, a vertex v_i for each residue position i = 1,2,...,16 in the target shape, and a vertex u_ij for each pair of residue positions i,j with i<j-2 and g(d_ij)>0. The edge set E of G consists of an edge (s,u_ij) for each vertex u_ij, an edge (v_i,t) for each vertex v_i that has a non-zero solvent-exposed contact surface area s_i, and edges (u_ij,v_i) and (u_ij,v_j) for each vertex u_ij. We assign a capacity of $\alpha g(d_{ij})$ to the edge (s,u_ij), a capacity of $\beta s_{i}$ to the edge (v_i,t), and a capacity of B+1, where B is $\sum \alpha g(d_{ij})$, to all edges of the form (u_ij,v_i) and (u_ij,v_j).

Figure 5 The Graph for Kleinberg's Algorithm for the Structure Numbered 22

The nodes 3,4,11 and 12 are internal nodes and therefore there are no edges linking them to the sink node t.

Results Obtained by Applying Kleinberg's Method to the Configurations of the 2-D Model

Suggestions for further work using the 4 x 4 - 2D model

Bibliography

URL Links

Appendix

1. Computation of Self-Avoiding Paths (SAP)

We determined the SAPs (Self-Avoiding Paths) of a $4\times 4$ 2-D lattice using a Prolog program. The predicate "sap(X,Y)" determines the SAP and the pairs of contacting residues in the SAP. "sappath(X)" is the clause that finds the SAP. It first makes a list of residues representing the path and then sets the constraints for the path. There are two constraints for a self-avoiding path. The first is set by "rightsucc(X)", which ensures that residue i and residue i+1 are neighbors: exactly one of $\left\vert x_{i}-x_{i+1}\right\vert$ and $\left\vert y_{i}-y_{i+1}\right\vert$ is equal to 1, and the other is equal to 0. The second is self-avoidance, which means that the path never crosses itself. It is set by "selfavoid(X)", which ensures that no two residues occupy the same location. Finally, "allcontact(X,Y)" does a linear search for each residue to find its neighbors and places the result in "Y".
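The two constraints can also be sketched in Python rather than Prolog. The following is not the original program: the names `neighbors`, `sap_paths` and `all_contacts` are our own, and the search is a plain depth-first enumeration that applies the successor and self-avoidance constraints at every step.

```python
def neighbors(a, b):
    """Successor constraint: exactly one coordinate differs, by exactly 1."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1

def sap_paths(n=4):
    """Yield every directed self-avoiding path that visits all n*n sites."""
    sites = [(x, y) for x in range(n) for y in range(n)]

    def extend(path, used):
        if len(path) == n * n:
            yield tuple(path)
            return
        for site in sites:
            # self-avoidance: a site may be occupied by one residue only
            if site not in used and neighbors(path[-1], site):
                used.add(site)
                path.append(site)
                yield from extend(path, used)
                path.pop()
                used.remove(site)

    for start in sites:
        yield from extend([start], {start})

def all_contacts(path):
    """Pairs (i, j), j > i + 1, of residues that sit on neighboring sites."""
    return [(i, j)
            for i in range(len(path))
            for j in range(i + 2, len(path))
            if neighbors(path[i], path[j])]

paths = list(sap_paths(4))
print(len(paths), all_contacts(paths[0]))
```

On the 4x4 lattice this enumeration produces 552 directed paths before any symmetry elimination; every one of them has 9 contacting pairs, since the lattice has 24 edges and the path uses 15 of them.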

2. Symmetry Detection

Among all the SAPs, some are symmetrical to the others. For example, the points P(X,Y) and P(X,-Y) are symmetric. There are 8 types of symmetries. If P₁(X₁,Y₁), P₂(X₂,Y₂) are two positions in the two symmetrical structures, each of the following clauses describes a condition for one type of symmetry.
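As an illustration of symmetry detection, the sketch below lists the 8 symmetries of the square lattice and reduces a path to a canonical representative. It is a Python sketch, not the original Prolog clauses. It assumes lattice indices 0..n-1, so that the reflection written above as (X,-Y) becomes (x, n-1-y), and it assumes that a path and its reversal count as the same shape; both choices and all function names are ours.

```python
def symmetries(n=4):
    """The 8 symmetries of an n x n lattice with indices 0..n-1."""
    m = n - 1
    return [
        lambda x, y: (x, y),          # identity
        lambda x, y: (m - x, y),      # mirror in x
        lambda x, y: (x, m - y),      # mirror in y
        lambda x, y: (m - x, m - y),  # half turn
        lambda x, y: (y, x),          # mirror in the main diagonal
        lambda x, y: (m - y, m - x),  # mirror in the anti-diagonal
        lambda x, y: (m - y, x),      # quarter turn
        lambda x, y: (y, m - x),      # quarter turn, other direction
    ]

def canonical(path, n=4, undirected=True):
    """Smallest image of `path` under the 8 symmetries (and reversal)."""
    variants = [tuple(path)]
    if undirected:
        variants.append(tuple(reversed(path)))
    return min(tuple(f(x, y) for x, y in v)
               for v in variants
               for f in symmetries(n))

def same_shape(p, q, n=4):
    """True if q can be obtained from p by a lattice symmetry."""
    return canonical(p, n) == canonical(q, n)

# Two paths on a 2 x 2 lattice that are mirror images of each other.
p = [(0, 0), (0, 1), (1, 1), (1, 0)]
q = [(1, 0), (1, 1), (0, 1), (0, 0)]
print(same_shape(p, q, n=2))
```

Collecting `canonical(path)` for every enumerated SAP in a set keeps one representative per symmetry class.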

3. Mapping Sequences to Structures Using NEC Energy Formula

Given a sequence, the energy is computed for each shape. The energy is defined as follows. For each pair of contacting residues, if both residues are hydrophobic, they contribute -3 to the energy. If exactly one of the residues is hydrophobic, they contribute -1 to the energy. If neither residue is hydrophobic, they contribute 0 to the energy.

For each sequence, the energy is computed over all the structures. If exactly one structure gives the minimal NEC energy, the sequence is mapped to that structure. After doing this mapping for all sequences, we have, for each structure, the set of sequences that map to it. The number of mapped sequences per structure is plotted in Figure 4.
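The energy rule and the unique-minimum mapping can be sketched as follows in Python. The two paths are our own transcription, as (row, column) pairs read from the drawings of structures 1 and 22 in Fig. 3; reading each structure in one fixed direction, and all function names, are assumptions of this sketch rather than details of the original program.

```python
STRUCTURES = {
    1: [(3, 0), (2, 0), (1, 0), (0, 0), (0, 1), (1, 1), (2, 1), (3, 1),
        (3, 2), (2, 2), (1, 2), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)],
    22: [(3, 0), (2, 0), (2, 1), (1, 1), (1, 0), (0, 0), (0, 1), (0, 2),
         (0, 3), (1, 3), (1, 2), (2, 2), (2, 3), (3, 3), (3, 2), (3, 1)],
}

def contacts(path):
    """Pairs (i, j) of residues on neighboring sites, not path-adjacent."""
    index = {site: k for k, site in enumerate(path)}
    pairs = []
    for i, (r, c) in enumerate(path):
        for site in ((r + 1, c), (r, c + 1)):  # look down and right only
            j = index.get(site)
            if j is not None and abs(i - j) > 1:
                pairs.append((min(i, j), max(i, j)))
    return pairs

def nec_energy(seq, path):
    """-3 per HH contact, -1 per contact with exactly one H, 0 otherwise."""
    score = {2: -3, 1: -1, 0: 0}
    return sum(score[(seq[i] == "H") + (seq[j] == "H")]
               for i, j in contacts(path))

def map_sequence(seq, structures):
    """Number of the unique minimal-energy structure, or None on a tie."""
    energies = {k: nec_energy(seq, p) for k, p in structures.items()}
    best = min(energies.values())
    winners = [k for k, e in energies.items() if e == best]
    return winners[0] if len(winners) == 1 else None

print(map_sequence("PPPHPPPPPPPPPPPP", STRUCTURES))
print(map_sequence("P" * 16, STRUCTURES))
```

With only these two structures, the first sequence maps to structure 22 because its single H sits on an interior site there, while the all-P sequence has energy 0 everywhere and is discarded as a tie. Running the same function over all 65,536 sequences and all 38 structures would give counts of the kind plotted in Figure 4.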

4. Find Sequence for Structures with Kleinberg Fitness Value

g(d_ij) = 1 if residues i and j are both hydrophobic and are neighbors; 0 otherwise.

s_i = 0 if residue i is not on the surface; 1 if the residue is on the surface but not at a corner; 2 if the residue is at a corner.

We use both a brute-force approach and a network minimum-cut algorithm to determine the sequences that give the minimal Kleinberg fitness value for a given structure. The brute-force approach simply considers all the sequences, computes the Kleinberg fitness value of each sequence on each structure, and then finds the sequence that gives the minimal fitness value for each structure. Kleinberg used a polynomial-time algorithm to find the sequence yielding the minimal fitness value. Basically, a weighted directed graph is built from the given structure, and the minimum cut of the graph partitions it into two parts. The nodes in one part represent hydrophobic residues, while the nodes in the other represent polar residues. The minimum-cut algorithm is described in Kleinberg's paper.

In the program, we first build the graph according to Kleinberg's method in the function "MakeGraph". Then we determine the maximum flow of the graph using the algorithm library LEDA. This also gives us the flow on each edge. In the function "MarkGraph" we find the minimum cut of the graph by a marking method. The method can be described as follows:

The residues whose representing nodes are in the marked set are hydrophobic and the others are polar.
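The whole procedure can be sketched in Python, with a small breadth-first augmenting-path maximum flow standing in for LEDA. Several things here are assumptions of the sketch and not taken from the original program: the values alpha = -2 and beta = 1 (chosen only so that all capacities are integers), the use of -alpha as the capacity of the edges leaving s (which presumes alpha < 0 and beta > 0), marking as "reachable from s in the final residual graph", the hand transcription of structure 22, and every function name.

```python
from collections import deque
from itertools import product

# Structure 22 transcribed from Fig. 3 as (row, column) in path order.
PATH_22 = [(3, 0), (2, 0), (2, 1), (1, 1), (1, 0), (0, 0), (0, 1), (0, 2),
           (0, 3), (1, 3), (1, 2), (2, 2), (2, 3), (3, 3), (3, 2), (3, 1)]

def contacts(path):
    """Pairs (i, j), i < j - 2, of residues on neighboring lattice sites."""
    index = {site: k for k, site in enumerate(path)}
    pairs = []
    for i, (r, c) in enumerate(path):
        for site in ((r + 1, c), (r, c + 1)):
            j = index.get(site)
            if j is not None and abs(i - j) > 2:
                pairs.append((min(i, j), max(i, j)))
    return pairs

def surface(path, n=4):
    """s_i: 0 for interior sites, 1 on an edge, 2 at a corner."""
    return [(r in (0, n - 1)) + (c in (0, n - 1)) for r, c in path]

def phi(seq, cont, surf, alpha, beta):
    """Kleinberg fitness with g = 1 for each HH contact."""
    hh = sum(1 for i, j in cont if seq[i] == "H" and seq[j] == "H")
    return alpha * hh + beta * sum(s for ch, s in zip(seq, surf) if ch == "H")

def min_cut_design(path, alpha=-2, beta=1):
    """Sequence of minimal fitness, read off a minimum s-t cut."""
    cont, surf = contacts(path), surface(path)
    cap = {}  # residual capacities: cap[u][v]

    def add_edge(u, v, c):
        cap.setdefault(u, {})[v] = c
        cap.setdefault(v, {}).setdefault(u, 0)

    big = -alpha * len(cont) + 1  # B + 1: can never be saturated
    for i, j in cont:
        add_edge("s", ("u", i, j), -alpha)
        add_edge(("u", i, j), ("v", i), big)
        add_edge(("u", i, j), ("v", j), big)
    for i, s_i in enumerate(surf):
        if s_i > 0:
            add_edge(("v", i), "t", beta * s_i)

    def marked_from_source():
        parent, queue = {"s": None}, deque(["s"])
        while queue:
            u = queue.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = marked_from_source()
        if "t" not in parent:
            break  # no augmenting path left: the flow is maximal
        v, bottleneck = "t", float("inf")
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = "t"
        while parent[v] is not None:
            cap[parent[v]][v] -= bottleneck
            cap[v][parent[v]] += bottleneck
            v = parent[v]

    # Residues whose vertex is marked are hydrophobic, the others polar.
    return "".join("H" if ("v", i) in parent else "P"
                   for i in range(len(path)))

def brute_force_min(path, alpha=-2, beta=1):
    """Minimal fitness over all 2^n sequences, for cross-checking."""
    cont, surf = contacts(path), surface(path)
    return min(phi("".join(s), cont, surf, alpha, beta)
               for s in product("HP", repeat=len(path)))

print(min_cut_design(PATH_22))
```

Because the sequences reported in the table depend on the values of $\alpha$ and $\beta$ used by the original program, which are not stated here, this sketch is not expected to reproduce them; it only shows that the minimum cut and the brute-force search agree on the minimal fitness value.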

Structure	Number of Mapped Sequences by NEC Method	Sequence Found by Kleinberg Method	NEC and Kleinberg Method Map the Same Sequence
1	1358	HHHPHHHHHHHHPHHH	No
2	232	HPHHHHPHHHHHPHHH	No
3	276	HPHHHHPHHHHHPHHH	No
4	564	HHHPHHPHHHHHPHHH	No
5	560	HHPHHPHHHHHHPHHH	Yes
6	851	HPHHPHHHHHHHPHHH	Yes
7	1116	HHHHHHHPHHHHPHHH	No
8	449	HHHHPHHPHHHHPHHH	No
9	667	HHHHPHHPHHHHPHHH	No
10	1361	HHPHHHHHHPHHPHHH	No
11	490	HHPHHHHHHPHHPHHH	No
12	549	HHPHHHHHHPHHPHHH	No
13	791	HHHHPHHHHPHHPHHH	Yes
14	1025	HHHHHHHHHPHHPHHH	No
15	23	HHHHHHPHHPHHPHHH	No
16	98	HHHHHHPHHPHHPHHH	No
17	617	HHHHHHPHHPHHPHHH	No
18	976	HPHHPHHHHHPHHPHH	No
19	489	HHPHHPHHHHPHHPHH	No
20	1520	HHPHHPHHHHPHHPHH	No
21	846	HHHHPHHPHHPHHPHH	No
22	2214	HHPHHHHPHHPHHHHH	Yes
23	1650	HHPHHPHHPHHHHHHH	Yes
24	253	HPHHHHPHHPHHHHPH	No
25	103	HPHHHHPHHPHHHHPH	No
26	203	HHHPHHPHHPHHHHPH	No
27	300	HPHHPHHPHHHHHHPH	No
28	151	HHHPHHPHHPHHHHPH	No
29	137	HPHHHHPHHPHHHHPH	No
30	394	HPHHPHHPHHHHHHPH	No
31	639	HHPHHPHHHHHPHHPH	Yes
32	1053	HPHHPHHHHHHPHHPH	Yes
33	1040	HHHPHHPHHHHPHHPH	Yes
34	230	HHHHHPHHPHHPHHPH	No
35	325	HHHHHPHHPHHPHHPH	No
36	1519	HHPHHPHHHHPHHPHH	Yes
37	1760	HHHHPHHPHHPHHPHH	Yes
38	1819	HHHPHHPHHPHHPHHH	Yes