ISMB-99, 95-105

Using Sequence Motifs for Enhanced Neural Network Prediction of Protein Distance Constraints

Jan Gorodkin(1,2), Ole Lund(1), Claus A. Andersen(1), and Søren Brunak(1)

(1) Center for Biological Sequence Analysis, Department of Biotechnology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
[email protected], [email protected], [email protected], [email protected]
Phone: +45 45 25 24 77, Fax: +45 45 93 15 85

(2) Department of Genetics and Ecology, The Institute of Biological Sciences, University of Aarhus, Building 540, Ny Munkegade, DK-8000 Aarhus C, Denmark

Abstract

Correlations between sequence separation (in residues) and distance (in Angstrom) of any pair of amino acids in polypeptide chains are investigated. For each sequence separation we define a distance threshold. For pairs of amino acids where the distance between Cα atoms is smaller than the threshold, a characteristic sequence (logo) motif is found. The motifs change as the sequence separation increases: for small separations they consist of one peak located in between the two residues, then additional peaks at these residues appear, and finally the center peak smears out for very large separations. We also find correlations between the residues in the center of the motif. This and other statistical analyses are used to design neural networks with enhanced performance compared to earlier work. Importantly, the statistical analysis explains why neural networks perform better than simple statistical data-driven approaches such as pair probability density functions. The statistical results also explain characteristics of the network performance for increasing sequence separation. The improvement of the new network design is significant in the sequence separation range 10-30 residues. Finally, we find that the performance curve for increasing sequence separation is directly correlated to the corresponding information content. A WWW server, distanceP, is available at http://www.cbs.dtu.dk/services/distanceP/.

Keywords: Distance prediction; sequence motifs; distance constraints; neural network; protein structure.

Introduction

Much work has over the years been put into approaches which either analyze or predict features of the three-dimensional structure using distributions of distances, correlated mutations, and more lately neural networks, or combinations of these, e.g. (Tanaka & Scheraga 1976; Miyazawa & Jernigan 1985; Bohr et al. 1990; Sippl 1990; Maiorov & Crippen 1992; Göbel et al. 1994; Mirny & Shaknovich 1996; Thomas, Casari, & Sander 1996; Lund et al. 1997; Olmea & Valencia 1997; Skolnick, Kolinski, & Ortiz 1997; Fariselli & Casadio 1999). The ability to deduce structure from sequence depends on constructing an appropriate cost function for the native structure. In search of such a function, we here concentrate on finding a method to predict distance constraints that correlate well with the observed distances in proteins. As the neural network approach is the only approach so far which includes sequence context for the considered pair of amino acids, such networks are expected not only to perform better, but also to capture more features relating distance constraints and sequence composition.

The analysis includes investigation of the distances between separated residues, as well as sequence motifs and correlations for these residues. We construct a prediction scheme which significantly improves on an earlier approach (Lund et al. 1997).

For each particular sequence separation, the corresponding distance threshold is computed as the average of all physical distances in a large data set between any two amino acids separated by that number of residues (Lund et al. 1997). Here, we include an analysis of the distance distributions relative to these thresholds and use them to explain the qualitative behavior of the neural network prediction scheme, thus extending earlier studies (Reese et al. 1996). For the prediction scheme used here it is essential to relate the distributions to their means. Analysis of the network weight composition reveals intriguing properties of the distance constraints: the sequence motifs can be decomposed into sub-motifs associated with each of the hidden units in the neural network.

Further, as the sequence separation increases there is a clear correspondence in the change of the mean value, the distance distributions, and the sequence motifs describing the distance constraints of the separated amino acids. The predicted distance constraints may be used as inputs to threading, ab initio, or loop modeling algorithms.

[Footnotes: Present address: Structural Advanced Technologies A/S, Agern Alle 3, DK-2970 Hørsholm, Denmark. Copyright (c) 1999, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.]

Materials and Method

Data extraction

The data set was extracted from the Brookhaven Protein Data Bank (Bernstein et al. 1977), release 82, containing 5762 proteins. In brief, entries were excluded if: (1) the secondary structure of the proteins could not be assigned by the program DSSP (Kabsch & Sander 1983), as the DSSP assignment is used to quantify the secondary structure identity in the pairwise alignments; (2) the proteins had any physical chain breaks (defined as neighboring amino acids in the sequence having Cα distances exceeding 4.0 Å), or the DSSP program detected chain breaks or incomplete backbones; (3) they had a resolution value greater than 2.5 Å, since proteins with worse resolution are less reliable as templates for homology modeling of the Cα trace (unpublished results).

Individual chains of entries were discarded if (1) they had a length of less than 30 amino acids, (2) they had less than 50% secondary structure assignment as defined by the program DSSP, (3) they had more than 85% non-amino acids (nucleotides) in the sequence, and (4) they had more than 10% non-standard amino acids (B, X, Z) in the sequence.

A representative set with low pairwise sequence similarity was selected by running algorithm #1 of Hobohm et al. (1992) as implemented in the program RedHom (Lund et al. 1997). In brief, the sequences were sorted according to resolution (all NMR structures were assigned resolution 100). Sequences with the same resolution were sorted so that higher priority was given to longer proteins. The sequences were aligned using the local alignment program ssearch (Myers & Miller 1988; Pearson 1990) with the pam120 amino acid substitution matrix (Dayhoff & Orcutt 1978) and gap penalties -12, -4. As a cutoff for sequence similarity we applied the threshold T = 290/√L, where T is the percentage of identity in the alignment and L the length of the alignment. Starting from the top of the list, each sequence was used as a probe to exclude all sequence-similar proteins further down the list.

By visual inspection seven proteins were removed from the list, since their structure is either 'sustained' by DNA or predominantly buried in the membrane. The resulting 744 protein chains are composed of the residues whose Cα-atom position is specified in the PDB entry. Ten cross-validation sets were selected such that they all contain approximately the same number of residues and all have the same length distribution of the chains. All the data are made publicly available through the world wide web page http://www.cbs.dtu.dk/services/distanceP/.

Information Content / Relative entropy measure

Here we use the relative entropy (Kullback & Leibler 1951) to measure the information content of aligned regions between sequence-separated residues. The information content is obtained by summing over the positions in the alignment, I = Σ_{i=1}^{L} I_i, where I_i is the information content of position i in the alignment. The information content at each position will sometimes be displayed as a sequence logo (Schneider & Stephens 1990). The position-dependent information content is given by

    I_i = Σ_k q_ik log2(q_ik / p_k),    (1)

where k runs over the symbols of the alphabet considered (here amino acids). The observed fraction of symbol k at position i is q_ik, and p_k is the background probability of finding symbol k by chance in the sequences. p_k will sometimes be replaced by a position-dependent background probability, that is, the probability of observing letter k at some position in the alignment in another data set one wishes to compare to. Symbols in logos turned 180 degrees indicate that q_ik < p_k.

Neural networks

As in the previous work, we apply two-layer feed-forward neural networks, trained by standard back-propagation, see e.g. (Brunak, Engelbrecht, & Knudsen 1991; Bishop 1996; Baldi & Brunak 1998), to predict whether two residues are below or above a given distance threshold in space. The sequence input is sparsely encoded. In (Lund et al. 1997) the inputs were processed as two windows centered around each of the separated amino acids. Here we extend that scheme by allowing the windows to grow towards each other, and even merge into a single large window covering the complete sequence between the separated amino acids. Even though such a scheme increases the computational requirements, it allows us to search for the optimal coverage between the separated amino acids.

As there can be a large difference in the number of positive (contact) and negative (no contact) sequence windows for a given separation, we apply the balanced learning approach (Rost & Sander 1993). Training is done by a 10-set cross-validation approach (Bishop 1996), and the result is reported as the average performance over the partitions. The performance on each partition is evaluated by the Mathews correlation coefficient (Mathews 1975)

    C = (Pt·Nt - Pf·Nf) / √((Nt+Nf)(Nt+Pf)(Pt+Nf)(Pt+Pf)),    (2)

where Pt is the number of true positives (contact, predicted contact), Nt the number of true negatives (no contact, no contact predicted), Pf the number of false positives (no contact, contact predicted), and Nf the number of false negatives (contact, no contact predicted).

The analysis of the patterns stored in the weights of the networks is done through the saliency of the weights, that is, the cost of removing a single weight while keeping the remaining ones. Due to the sparse encoding, each weight connected to a hidden unit corresponds exactly to a particular amino acid at a given position in the sequence windows used as inputs. We can then obtain a ranking of symbols at each position in the input field. To compute the saliencies we use the approximation for two-layer one-output networks of (Gorodkin et al. 1997), who showed that the saliencies for the weights between input and hidden layer can be written as

    s^k_ji = s_ji = w_ji² W_j² K,    (3)

where w_ji is the weight between input i and hidden unit j, and W_j is the weight between hidden unit j and the output. K is a constant. The kth symbol is implicitly given due to the sparse encoding.
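The two measures just defined, the per-position relative entropy of Eq. (1) and the Mathews correlation of Eq. (2), can be sketched as follows (an illustrative implementation; the function names and data layout are ours, not the paper's):

```python
import math

def information_profile(columns, background):
    """Eq. (1): I_i = sum_k q_ik * log2(q_ik / p_k) for each alignment column.

    columns    -- list of dicts mapping amino acid -> count at that position
    background -- dict mapping amino acid -> background probability p_k
    """
    profile = []
    for col in columns:
        total = sum(col.values())
        info = 0.0
        for k, n in col.items():
            q = n / total
            p = background.get(k, 0.0)
            if q > 0.0 and p > 0.0:
                info += q * math.log2(q / p)
        profile.append(info)
    return profile  # summing the profile gives the total information content I

def mathews_correlation(pt, nt, pf, nf):
    """Eq. (2): correlation between predicted and observed contact classes."""
    denom = math.sqrt((nt + nf) * (nt + pf) * (pt + nf) * (pt + pf))
    return (pt * nt - pf * nf) / denom if denom > 0 else 0.0
```

With a position-dependent background, `background` is simply replaced by one probability dict per column, as is done for the logos discussed below.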

Figure 1: Length distributions of residue segments for the corresponding sequence separations, relative to the respective mean values. Sequence separations 2, 3, 4, 11, 12, 13, 16, 20, and 60 are shown. These show the representative shapes of the distance distributions. The vertical line through zero indicates the displacement with respect to the mean distance.
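The separation-dependent thresholds underlying Figure 1 (the mean Cα-Cα distance over all residue pairs at a given sequence separation) can be sketched as follows (an illustrative, unoptimized version; the data layout is our assumption):

```python
def distance_thresholds(chains, max_sep=100):
    """For each sequence separation, return the mean Ca-Ca distance over
    all residue pairs with that separation; these means serve as the
    distance-constraint thresholds (Lund et al. 1997).

    chains -- list of chains, each a list of (x, y, z) Ca coordinates
    """
    sums, counts = {}, {}
    for coords in chains:
        n = len(coords)
        for sep in range(2, min(max_sep, n - 1) + 1):
            for i in range(n - sep):
                (x1, y1, z1), (x2, y2, z2) = coords[i], coords[i + sep]
                d = ((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2) ** 0.5
                sums[sep] = sums.get(sep, 0.0) + d
                counts[sep] = counts.get(sep, 0) + 1
    return {sep: sums[sep] / counts[sep] for sep in sums}
```

Subtracting the threshold from each observed distance gives the displaced distributions of Figure 1, and the sign of that displacement is the above/below class label used for training.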

If the signs of w_ji and W_j are opposite, we display the corresponding symbol upside down in the weight-saliency logo.

Results

We conduct statistical analysis of the data and the distance constraints between amino acids, and subsequently use the results to design and explain the behavior of a neural network prediction scheme with enhanced performance.

Statistical analysis

For each sequence separation (in residues), we derive the mean of all the distances between pairs of Cα atoms. We use these means as distance constraint thresholds, and in a prediction scheme we wish to predict whether the distance between any pair of amino acids is above or below the threshold corresponding to a particular separation in residues. To analyze which pairs are above and below the threshold, it is relevant to compare: (1) the distribution of distances between amino acid pairs below and above the threshold, and (2) the sequence composition of segments where the pairs of amino acids are below and above the threshold.

First we investigate the distribution of the distances as a function of the sequence separation. A complete investigation of physical distances for increasing sequence separation is given by (Reese et al. 1996). In particular it was found that the α-helices cause a distinct peak up to sequence separation 20, whereas β-strands are seen up to separation 5 only. However, when we perform a simple translation of these distributions relative to their respective means, the same plots provide essentially new qualitative information, which is anticipated to be strongly correlated to the performance of a predictor (above/below the threshold). In particular we focus on the distributions shown in Figure 1, but we also use the remaining distributions to make some important observations.

When inspecting the distance distributions relative to their mean, two main observations are made. First, the distance distribution for sequence separation 3 is the distance distribution where the data is most bimodal. Thus sequence separation 3 provides the most distinct partition of the data points. Hence, in agreement with the results in (Lund et al. 1997), we anticipate that the best prediction of distance constraints can be obtained for sequence separation 3. Furthermore, we observe that the α-helix peak shifts relative to the mean when going from sequence separation 11 to 13: the length of the helices becomes longer than the mean distances. Interestingly, this shift involves the peak being placed at the mean value itself for sequence separation 12. Due to this phenomenon we anticipate that, for an optimized predictor, it can be slightly harder to predict distance constraints for separation 12 than for separations 11 and 13.

Second, as the sequence separation increases, the distance distribution approaches a universal shape, presumably independent of structural elements, which are governed by more local distance constraints. The traits from the local elements do, as mentioned by (Reese et al. 1996) (who merge large separations), vanish for separations 20-25. (Here we considered separations up to 100 residues.) In Figure 1 we see that the transition from bimodal distributions to unimodal distributions centered around their means indicates that prediction of distance constraints must become harder with increasing sequence separation. In particular, when the universal shape has been reached (sequence separation larger than 30), we should not expect a change in the prediction ability as the sequence separation increases. The universal distribution has its mode value approximately at -3.5 Angstrom, i.e., the most probable physical distance between two Cα atoms at large sequence separations lies about 3.5 Å below the mean.

Notice that the universality only appears when the distribution is displaced by its mean distance. This is interesting since the mean of the physical distances grows as the sequence separation increases. From a predictability perspective, it therefore makes good sense to use the mean value as a threshold to decide the distance constraint for an arbitrary sequence separation.

The peak present at the mean value for sequence separation 12 does indeed reflect the length of helices, as demonstrated clearly in Figure 2. Rather than using the simple rule that each residue in a helix increases the physical distance by 1.5 Angstrom (Branden & Tooze 1999), we computed the actual physical length for each size of helix to obtain a more accurate picture. The physical length of the α-helices was calculated by finding the helical axis and measuring the translation per Cα atom along this axis. The helical axis was determined by the mass centers of four consecutive Cα atoms. Helices of length four are therefore not included, since only one center of mass is present. We see that helices of 12 residues coincide with sequence separation 12. Again we use this as an indication that at sequence separation 12 it may be slightly harder to predict distance constraints than at separation 13.

A useful prediction scheme must rely on the information available in the sequences. To investigate whether there exists a detectable signal between the sequence-separated residues, for each sequence separation we constructed sequence logos (Schneider & Stephens 1990) as follows: The sequence segments for which the physical distance between the separated amino acids was above the threshold were used to generate a position-dependent background distribution of amino acids. The segments with corresponding physical distance below the threshold were all aligned and displayed in sequence logos using the computed background distribution of amino acids. The pure information content curves are shown for increasing sequence separation in Figure 3. The corresponding sequence logos are displayed in Figure 4. We used a "margin" of 4 residues to the left and right of the logos; e.g., for sequence separation 2, the physical distance is measured between positions 5 and 7 in the logo.

The change in the sequence patterns is consistent with the change in the distribution of physical distances. Up to separations 6-7, the distribution of distances (Figure 1) contains two peaks, with the β-strand peak vanishing completely for separations 7-8 residues. For the same separations, the sequence motif changes from containing one to three peaks. For larger sequence separations, the motif consists of three characteristic peaks, two of them located at positions exactly corresponding to the separated amino acids. The third peak appears in the center. This peak tells us that for physical distances within the threshold, it indeed matters whether the residues in the center can bend or turn the chain. We see that exactly such amino acids are located in the center peak: glycine, aspartic acid, glutamic acid, and lysine. All of these are medium-size residues (except glycine), being hydrophilic, thereby having affinity for the solvent, but not being on the outermost surface of the protein (see e.g. (Creighton 1993) for amino acid properties).

Figure 2: Mean distances for increasing sequence separation and computed average physical helix lengths for increasing number of residues.
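The helix-length measurement described above (an axis traced by the mass centers of four consecutive Cα atoms, with the translation per residue measured along it) can be sketched as follows; this is our simplified reading of the procedure, not the authors' code:

```python
def helix_rise_and_length(ca_coords):
    """Estimate the translation (rise) per residue along the helical axis
    and the total physical helix length.

    The axis is traced by the mass centers of four consecutive Ca atoms,
    so helices shorter than five residues yield no estimate (a single
    center gives no direction), as noted in the text.
    """
    if len(ca_coords) < 5:
        return None
    # Mass centers of all windows of four consecutive Ca atoms.
    centers = [
        tuple(sum(p[d] for p in ca_coords[i:i + 4]) / 4.0 for d in range(3))
        for i in range(len(ca_coords) - 3)
    ]
    # Each step between consecutive centers advances one residue along the axis.
    steps = [
        sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
        for p, q in zip(centers, centers[1:])
    ]
    rise = sum(steps) / len(steps)
    return rise, rise * (len(ca_coords) - 1)
```

For an ideal α-helix this yields a rise close to the 1.5 Å per residue rule of thumb quoted from (Branden & Tooze 1999).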

Figure 3: Sequence information profiles for increasing sequence separation.

As the sequence separation increases from about 20 to 30, the center peak smears out, in agreement with the transition of the distance distribution to the universal distribution in this range of sequence separation. The sequence motif likewise becomes "universal": only the peaks located at the separated residues are left.

The composition of single-peak motifs resembles to a large degree the composition of the center of motifs having three peaks. There is, however, a slightly increased amount of leucine that moves from the center of the one-peak motifs to the separated amino acid positions in the three-peak motifs; the reverse happens for glycine. Interestingly, as the sequence separation increases, valine (and isoleucine) become present at the outermost peaks. In agreement with the helix length shift from below to above the threshold at separations 12 and 13, the center peak shifts from slight over- to slight underrepresentation of alanine: helices are no longer dominant for distances below the threshold.
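The logo construction described in this section, where segments whose pair distance is above the threshold supply a position-dependent background against which the below-threshold segments are displayed, can be sketched as follows (a bare-bones illustration; real logos additionally scale letter heights by I_i, and letters absent from the background are simply skipped here):

```python
import math

def column_freqs(segments, pos):
    """Relative letter frequencies at one position of equal-length segments."""
    letters = [s[pos] for s in segments]
    return {k: letters.count(k) / len(letters) for k in set(letters)}

def logo_column(below, above, pos):
    """One logo position: information content against the position-dependent
    background (Eq. 1 with p_k taken from the above-threshold segments),
    plus the letters that would be drawn upside down (q_ik < p_k)."""
    q = column_freqs(below, pos)
    p = column_freqs(above, pos)
    info = sum(qk * math.log2(qk / p[k])
               for k, qk in q.items() if p.get(k, 0.0) > 0.0)
    flipped = sorted(k for k in q if p.get(k, 0.0) > q[k])
    return info, flipped
```

The same routine with a single, position-independent background dict reproduces the ordinary logos of (Schneider & Stephens 1990).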


HCMWCC CCWCHCWHQWHQYMQRNMSGANHDSQYTGSKNGSYKNRSNKGSI NSGNKSGAI KNGSNYNHSYVTWKHSYTLNKGSNKYVKDGSVDKNGSVNGI PNKHDSVQWNKGMI DNHMSQVDNMHSVNHYI PNVPHMSYTWKDHMSYQTNWPHYNYVDKNSYVKNWKNGSYEKNSGYI NDSGDNSYNYWNYWNYWNWWNNYWNYW N I HC HMC CWHCMWHCMW . | HMWHMWCWCHMWFHYHMWFMYRNYNNDSYT KGANAA A ‹ KGSKGNGVVSGN GI DSVGI GI DSVTDGSDGVGVDSYVDNSVTDGSGDSVDVDVVGDSTDSEDSEKDGSYNYKDEGSYTDEKSTEDSENSYI EKDSYDNKSNI NSYDNSYTI HMYHCMWCWCMCWFHMY NY| 0.0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69

Figure 4: Sequence logos for increasing sequence separation. A margin of 4 residues on both ends of the logo is used. Symbols displayed upside down indicate that their amount is lower than the background amount. This figure is available in full color at http://www.cbs.dtu.dk/services/distanceP/.

and 13, the center peak shifts from slight over- to slight underrepresentation of alanine: helices are no longer dominant for distances below the threshold.

The smearing out of the center peak for large sequence separations does not indicate a lack of bending/turn propensity, but reflects that such amino acids can be placed in a large region in between the two separated amino acids. What we see for small separations is that the signal in addition must be located in a very specific position relative to the separated amino acids.

Neural networks: predictions and behavior

As found above, optimized prediction of distance constraints for varying sequence separation should be performed by a non-linear predictor. Here we use neural networks. Clearly, by far most of the information is found in the sequence logos, so we expect only a few hidden neurons (units) to be necessary in the design of the neural network architecture. As earlier, we only use two-layer networks with one output unit (above/below threshold). We fix the number of hidden units to 5, but the size of the input field may vary.

We start out by quantitatively investigating the relation between the sequence motifs in the logos above and the amount of sequence context needed in the prediction scheme. First, we choose the amount of sequence context by starting with local windows around the separated amino acids and then extending the sequence region, r, to be included. We only include additional residues in the direction towards the center. The number of residues used in the direction away from the center is fixed to 4, except in the cases of r = 0 and r = 2, where that number is zero and two, respectively. At some point the two sequence regions may reach each other. Whether they overlap or just connect does not affect the performance. For each r = 0, 2, 4, 6, 8, 10, 12, 14, we trained networks for sequence separations 2 to 99, resulting in the training of 8000 networks, since we used 10 cross-validation sets. The performances in "fraction correct" and "correlation coefficient" are shown in Figure 5.

Figure 5(b) shows the performance in terms of correlation coefficient. We see that each time the sequence separation becomes so large that the two context regions do not connect, the performance drops when compared to contexts that merge into a single window. This behavior is generic; it happens consistently for increasing size of the sequence context region, though the phenomenon is asymptotically decreasing. An effect can be found up to a region size of about 30, see Figure 6. Several features appear on the performance curve, and they can all be explained by the statistical observations made earlier. First, the curve with no context (r = 0) corresponds to the curve one obtains using pair probability density functions (Lund et al. 1997). The poor performance of such a predictor is to be expected due to the lack of the motif, which was not included in the model. The green curve (r = 4) corresponds to the neural network predictor in (Lund et al. 1997). As observed therein, a significant improvement is obtained when including just a little bit of sequence context. As the size of the context region is increased, so is the performance, up to about sequence separation 30. Then the performance remains the same independent of the sequence separation. We even note the drop in performance for sequence separation 12, as predicted from the statistical analysis above. It is also characteristic that, as the distribution of physical distances approaches the universal shape, and as the center peak of the sequence logos vanishes, the performance drops and becomes constant at approximately 0.15 in correlation coefficient. The change in prediction performance takes place at the sequence separations for which the distribution of physical distances and the sequence motifs change. Due to the non-vanishing signal in the logos for large separations, it is possible to predict distance constraints significantly better than random, and better than with a no-context approach. The conclusion is not surprising: the best performing network is the one which uses as much context as possible. However, for large sequence separations, more than 30–35 residues, the same performance is obtained by just using a small amount of sequence context around the separated amino acids, as in (Lund et al. 1997). We found that the performance difference between the worst and best cross-validation set is about 0.1 for separations up to 12 residues, then about 0.07 for separations up to 30 residues, and then 0.1–0.2 for separations larger than 30. Hence, at sequence separation 31 the performance starts to fluctuate, indicating that the sequence motif becomes less well defined throughout the sequences in the respective cross-validation sets. Hence we can use the networks as an indicator of when a sequence motif is well defined.

As an independent test, we predicted on nine CASP3 (http://predictioncenter.llnl.gov/casp3/) targets (and submitted the predictions). None of these targets (T0053, T0067, T0071, T0077, T0081, T0082, T0083, T0084, T0085) have sequence similarity to any sequence with known structure. The previous method (Lund et al. 1997) had on average 64.5% correct predictions and an average correlation coefficient of 0.224. The method presented here has on average 70.3% correct predictions and an average correlation coefficient of 0.249. The performance of the original method corresponds to the performance previously published (Lund et al. 1997). However, the average measure is a less convenient way to report the performance, as the good performance on small separations is biased by the many large-separation networks that have essentially the same performance. Therefore we have also constructed performance curves similar to the one in Figure 5 (not shown). The main features of the prediction curve were present. In Figure 8 we show a prediction example of the distance constraints for T0067. The server (http://www.cbs.dtu.dk/services/distanceP/) provides predictions up to a sequence separation of 100. Notice that predictions up to a sequence separation of 30 clearly capture the main part of the distance constraints.

We also investigated the qualitative relation between the network performance and the information content in the sequence logos. Interestingly, and in agreement with the observations made earlier, we found that the two curves have the same qualitative behavior as the sequence separation increases (Figure 7). Both curves peak at separation 3, both curves drop at separation 12, and both curves reach a plateau at sequence separation 30. We note that the relative entropy curve increases slightly as the separation increases.
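The window construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `pair_input` and `one_hot` are hypothetical helper names, the outward flank is fixed at 4, and the special cases r = 0 and r = 2 (smaller outward flank) are omitted for brevity.

```python
# Minimal sketch (not the authors' code) of the input construction: a fixed
# outward flank of 4 residues around each of the two amino acids, plus r
# additional residues extending toward the center.  When the two windows
# reach each other they merge, so overlapping positions are encoded once.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(aa):
    """Sparse encoding of one residue; out-of-alphabet symbols give all zeros."""
    vec = [0.0] * len(AMINO_ACIDS)
    if aa in AMINO_ACIDS:
        vec[AMINO_ACIDS.index(aa)] = 1.0
    return vec

def pair_input(seq, i, j, r, flank=4):
    """Encode the residue pair (i, j), i < j, with r residues of inward context."""
    left = set(range(i - flank, i + r + 1))    # window around residue i
    right = set(range(j - r, j + flank + 1))   # window around residue j
    positions = sorted(left | right)           # touching windows merge here
    return [x for p in positions
            for x in one_hot(seq[p] if 0 <= p < len(seq) else "-")]

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
x = pair_input(seq, 8, 20, r=2)   # two disjoint windows of 7 positions each
```

With r = 2 and a flank of 4 each window holds 7 positions, giving a 14 × 20 = 280-dimensional sparse input; for closer pairs the two windows merge and the input shrinks accordingly.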

A 0.60

Fraction correct 0.55

0.50 ? 0.45 ? 0 20 40 60 80 100

Sequence Separation@ (residues)

(b) Mean performances 0.50 r=0 r=2 r=4 0.40 r=6 r=8 r=10 r=12 0.30 r=14

0.20 Correlation coefficient

0.10

B B B B B 0.00 B 0 20 40 60 80 100

Sequence separationC (residues)

Figure 5: The performance as a function of increasing sequence context when predicting distance constraints. The sequence context is the number of additional residues r = 0, 2, 4, 6, 8, 10, 12, 14 from the one amino acid toward the other amino acid. The performance is shown in percentage (a) and in correlation coefficient (b). This figure is available in full color at http://www.cbs.dtu.dk/services/distanceP/.

This is due to a well known artifact of entropy measures at decreasing sampling sizes. The correlation between the curves also indicates that most of the information used by the predictor stems from the motifs in the sequence logos.
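The information content compared against network performance in Figure 7 is the per-position relative entropy of the logo columns. A small sketch of this quantity, assuming a flat background distribution for simplicity (illustrative, not the paper's code):

```python
# Illustrative sketch of the per-position information content of a sequence
# logo: the Kullback-Leibler divergence between the observed amino acid
# frequencies in a column and a background distribution.  For small samples
# this plug-in estimate is biased upward -- the artifact mentioned above.
import math
from collections import Counter

def relative_entropy(column, background):
    """D(p || q) in bits for one column of residues vs. background frequencies."""
    counts = Counter(column)
    total = sum(counts.values())
    d = 0.0
    for aa, c in counts.items():
        p = c / total
        d += p * math.log2(p / background[aa])
    return d

background = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}  # flat, for illustration
info = relative_entropy("AAAAG", background)  # 0.8*log2(16) + 0.2*log2(4) = 3.6 bits
```

A conserved column scores high while a column matching the background scores zero, so summing (or averaging) over logo positions yields the information content curve plotted against network performance.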

Figure 6: The performance curve using a single window only. We also show the performance when including homologues in the respective tests; this completes a profile.

Figure 7: Information content and network performance for increasing sequence separation.

Figure 8: Prediction of distance constraints for the CASP3 target T0067. The upper triangle shows the distance constraints of the published coordinates, where dark points indicate distances above the thresholds (mean values) and light points distances below the thresholds. The lower triangle shows the actual neural network predictions. The scale indicates the predicted probability for sequence separations having distances below the thresholds. Thus light points represent predictions of distances below the thresholds and dark points predictions of distances above the thresholds. The numbers along the edges of the plot show the position in the sequence. This figure is available in full color at http://www.cbs.dtu.dk/services/distanceP/.

Finally, we conclude by an analysis of the neural network weight composition, with the aim of revealing significant motifs used in the learning process. In general, we found the same motifs appearing for each of the 10 cross-validation sets for a given sequence separation. As described in the Methods section, we compute saliency logos of the network parameters; that is, we display the amino acids for which the corresponding weights, if removed, cause the largest increase in network error. We made saliency logos for each of the 5 hidden neurons. For the short sequence separations the network strongly punishes having a proline in the center of the input field, that is, a proline in between the separated amino acids. This makes sense as proline is not located within secondary structure elements, and the helix lengths for small separations are smaller than the mean distances. (The network has learnt that proline is not present in helices.) In contrast, the presence of aspartic acid can have a positive or negative effect depending on the position in the input field.

For larger sequence separations, some of the neurons display a three-peak motif resembling what was found in the sequence logos. The center of such logos can be very glycine rich, but also shows a strong punishment for valine or tryptophan. Sometimes two neurons can share the motif that is present for one neuron in another cross-validation set. Other details can be found as well. Here we just outlined the main features: the networks not only look for favorable amino acids, they also detect those that are not favorable at all. In Figure 9 we show an example of the motifs for 2 of the 5 hidden neurons for a network with sequence separation 13.
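The saliency definition used above, the increase in network error when a weight is removed, can be sketched by brute force. The paper computes saliencies analytically (cf. Gorodkin et al. 1997), so this is only an illustration; the toy network, its weights, and the data below are made up for the example.

```python
# Brute-force illustration of saliency: the increase in network error when a
# single weight is removed (set to zero).  A real saliency logo would rank
# input weights this way, position by position, for each hidden neuron.
import math

def mse(net, data):
    """Mean squared error of a small one-hidden-layer tanh network."""
    w_in, w_out = net
    err = 0.0
    for x, t in data:
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
        y = math.tanh(sum(v * h for v, h in zip(w_out, hidden)))
        err += (y - t) ** 2
    return err / len(data)

def saliencies(net, data):
    """Error increase caused by zeroing each input-to-hidden weight in turn."""
    w_in, _ = net
    base = mse(net, data)
    sal = []
    for i, row in enumerate(w_in):
        for j in range(len(row)):
            saved, row[j] = row[j], 0.0          # remove one weight ...
            sal.append(((i, j), mse(net, data) - base))
            row[j] = saved                       # ... and restore it
    return sal

net = ([[1.0, -0.5], [0.3, 0.8]], [1.2, -0.7])          # toy weights
data = [([1.0, 0.0], 0.5), ([0.0, 1.0], -0.5)]          # toy (input, target) pairs
worst = max(saliencies(net, data), key=lambda s: s[1])  # most salient weight
```

In a saliency logo, the letters drawn large at a position correspond to the input weights whose removal increases the error most.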

Figure 9: Three saliency-weight logos from the trained neural network at sequence separations 9 and 13. For each position of the input field the saliency of the respective amino acids is displayed. Letters turned upside down contribute negatively to the weighted sum of the network input. The top logo (sequence separation 9) illustrates strong punishment for proline and glycine upstream. On the middle logo (sequence separation 13), N is turned upside down on the peaks near the edges. On the bottom logo the I's close to the edges contribute positively, and the N in the center peak also contributes positively. The I's in the center peak contribute negatively (they are upside down). This figure is available in full color at http://www.cbs.dtu.dk/services/distanceP/.

Final remarks

We studied prediction of distance constraints in proteins. We found that sequence motifs were related to the distribution of distances. Using the motifs we showed how to construct an optimal neural network predictor which improves on an earlier approach. The behavior of the predictor could be completely explained from a statistical analysis of the data; in particular, a drop in performance at a sequence separation of 12 residues was predicted from the statistical analysis. We also correctly predicted separation 3 as having optimal performance. We found that the information content of the logos has the same qualitative behavior as the network performance for increasing sequence separation. Finally, the weight composition of the network was analyzed, and we found that the sequence motifs appearing were in agreement with the sequence logos.

The perspectives are many: the network performance may be improved further by using three windows for sequence separations between 20 and 30 residues. The output of the network can be used as input to a new collection of networks which can clean up certain types of false predictions. Combining the method with a secondary structure prediction should give a significant performance improvement on the short sequence separations. The relation between information content and network performance might be explained quantitatively through extensive algebraic considerations.

Acknowledgments

This work was supported by The Danish National Research Foundation. OL was supported by The Danish 1991 Pharmacy Foundation.

References

Baldi, P., and Brunak, S. 1998. Bioinformatics — The Machine Learning Approach. Cambridge, Mass.: MIT Press.

Bernstein, F. C.; Koetzle, T. G.; Williams, G. J. B.; Meyer, E. F.; Brice, M. D.; Rogers, J. R.; Kennard, O.; Shimanouchi, T.; and Tasumi, M. 1977. The Protein Data Bank: A computer based archival file for macromolecular structures. J. Mol. Biol. 122:535–542. (http://www.pdb.bnl.gov/).

Bishop, C. M. 1996. Neural Networks for Pattern Recognition. Oxford: Oxford University Press.

Bohr, H.; Bohr, J.; Brunak, S.; Cotterill, R. M. J.; Fredholm, H.; Lautrup, B.; and Petersen, S. B. 1990. A novel approach to prediction of the 3-dimensional structures of protein backbones by neural networks. FEBS Lett. 261:43–46.

Branden, C., and Tooze, J. 1999. Introduction to Protein Structure. New York: Garland Publishing Inc., 2nd edition.

Brunak, S.; Engelbrecht, J.; and Knudsen, S. 1991. Prediction of human mRNA donor and acceptor sites from the DNA sequence. J. Mol. Biol. 220:49–65.

Creighton, T. E. 1993. Proteins. New York: W. H. Freeman and Company, 2nd edition.

Dayhoff, M. O.; Schwartz, R. M.; and Orcutt, B. C. 1978. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5, Suppl. 3:345–352.

Fariselli, P., and Casadio, R. 1999. A neural network based predictor of residue contacts in proteins. Prot. Eng. 12:15–21.

Göbel, U.; Sander, C.; Schneider, R.; and Valencia, A. 1994. Correlated mutations and residue contacts in proteins. Proteins 18:309–317.

Gorodkin, J.; Hansen, L. K.; Lautrup, B.; and Solla, S. A. 1997. Universal distribution of saliencies for pruning in layered neural networks. Int. J. Neural Systems 8:489–498. (http://www.wspc.com.sg/journals/ijns/85_6/gorod.pdf).

Hobohm, U.; Scharf, M.; Schneider, R.; and Sander, C. 1992. Selection of a representative set of structures from the Brookhaven Protein Data Bank. Prot. Sci. 1:409–417.

Kabsch, W., and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637.

Kullback, S., and Leibler, R. A. 1951. On information and sufficiency. Ann. Math. Stat. 22:79–86.

Lund, O.; Frimand, K.; Gorodkin, J.; Bohr, H.; Bohr, J.; Hansen, J.; and Brunak, S. 1997. Protein distance constraints predicted by neural networks and probability density functions. Prot. Eng. 10:1241–1248. (http://www.cbs.dtu.dk/services/CPHmodels).

Maiorov, V. N., and Crippen, G. M. 1992. Contact potential that recognizes the correct folding of globular proteins. J. Mol. Biol. 227:876–888.

Matthews, B. W. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405:442–451.

Mirny, L. A., and Shakhnovich, E. I. 1996. How to derive a protein folding potential? A new approach to an old problem. J. Mol. Biol. 264:1164–1179.

Miyazawa, S., and Jernigan, R. L. 1985. Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18:534–552.

Myers, E. W., and Miller, W. 1988. Optimal alignments in linear space. CABIOS 4:11–17.

Olmea, O., and Valencia, A. 1997. Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Folding & Design 2:S25–S32.

Pearson, W. R. 1990. Rapid and sensitive sequence comparison with FASTP and FASTA. Meth. Enzymol. 183:63–98.

Reese, M. G.; Lund, O.; Bohr, J.; Bohr, H.; Hansen, J. E.; and Brunak, S. 1996. Distance distributions in proteins: A six parameter representation. Prot. Eng. 9:733–740.

Rost, B., and Sander, C. 1993. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232:584–599.

Schneider, T. D., and Stephens, R. M. 1990. Sequence logos: a new way to display consensus sequences. Nucl. Acids Res. 18:6097–6100.

Sippl, M. J. 1990. Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol. 213:859–883.

Skolnick, J.; Kolinski, A.; and Ortiz, A. R. 1997. MONSSTER: A method for folding globular proteins with a small number of distance restraints. J. Mol. Biol. 265:217–241.

Tanaka, S., and Scheraga, H. A. 1976. Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules 9:945–950.

Thomas, D. J.; Casari, G.; and Sander, C. 1996. The prediction of protein contacts from multiple sequence alignments. Prot. Eng. 9:941–948.