STACK: a toolkit for analysing β-helix proteins

Master of Science Thesis (20 points)

Salvatore Cappadona, Lars Diestelhorst

Abstract

β-helix proteins contain a solenoid fold consisting of repeated coils forming parallel β-sheets. Our goal is to formalise the intuitive notion of a β-helix in an objective algorithm. Our approach is based on first identifying residue stacks (linear spatial arrangements of residues with similar conformations) and then combining these elementary patterns to form β-coils and β-helices. Our algorithm has been implemented within STACK, a toolkit for analysing β-helix proteins. STACK distinguishes aromatic, aliphatic and amidic stacks such as the asparagine ladder. Geometrical features are computed and stored in a relational database. These features include the axis of the β-helix, the stacks, the cross-sectional shape, the area of the coils and related packing information. An interface between STACK and a molecular visualisation program enables structural features to be highlighted automatically.

Contents

1 Introduction

2 Biological Background
  2.1 Basic Concepts of Protein Structure
  2.2 Secondary Structure
  2.3 The β-Helix Fold

3 Parallel β-Helices
  3.1 Introduction
  3.2 Nomenclature
    3.2.1 Parallel β-Helix and its β-Sheets
    3.2.2 Stacks
    3.2.3 Coils
    3.2.4 The Core Region
  3.3 Description of Known Structures
    3.3.1 Helix Handedness
    3.3.2 Right-Handed Parallel β-Helices
    3.3.3 Left-Handed Parallel β-Helices
  3.4 Amyloidosis

4 The STACK Toolkit
  4.1 Identification of Structural Elements
    4.1.1 Stacks
    4.1.2 β-coils and β-Helices
    4.1.3 The Core Residues of β-Helices
  4.2 Geometrical Analysis of Structural Elements
    4.2.1 Axis of β-Helices
    4.2.2 Shape of β-Coils
    4.2.3 The Pitch and twist of β-coils
    4.2.4 Area of β-Coils
    4.2.5 Orientation of Side Chains
    4.2.6 Packing of β-Coils

5 Results

6 Conclusions

A User and Installation Manual
  A.1 Readership
  A.2 Installation
    A.2.1 Requirements
    A.2.2 Installation Steps
  A.3 Configuration
    A.3.1 GENERAL Properties
    A.3.2 STORE Properties
    A.3.3 OPERATION Properties
  A.4 Starting and Using STACK
    A.4.1 The activate-Command
    A.4.2 The calculate-Command
    A.4.3 The color-Command
    A.4.4 The deactivate-Command
    A.4.5 The exit-Command
    A.4.6 The help-Command
    A.4.7 The identify-Command
    A.4.8 The import-Command
    A.4.9 The list-Command
    A.4.10 The visualize-Command

B Maintenance Manual
  B.1 Readership
  B.2 Development Environment
    B.2.1 Tools and Software Components
  B.3 STACK Architecture
    B.3.1 Files and Directory Structure
    B.3.2 Package Overview
    B.3.3 The geom and util Package
    B.3.4 The store Package
    B.3.5 The structure Package
    B.3.6 The operations Package

List of Figures

2.1 Primary, secondary, tertiary and quaternary structure of proteins
2.2 Secondary structure
2.3 Two types of β-sheet structures
2.4 Cartoon representation of a protein chain with a right-handed β-helix fold

3.1 Nomenclature of β-sheets in right-handed β-helical proteins
3.2 Packing of successive coils in PelC
3.3 Core of a β-helical protein
3.4 The right-hand rule
3.5 Left-handed and right-handed beta-helices have a different cross-section
3.6 Chiral packing of isoleucines at the centre of LpxA
3.7 Schematic of a typical coil of the pectate lyase family and of a resulting helix
3.8 N-terminal end and T3 loop of BsPel
3.9 Ribbon illustration of TmAFP
3.10 Lattice matching/occupation model for TmAFP binding to ice
3.11 Typical left-handed parallel β-helix
3.12 The putative ice-binding site of SbwAFP
3.13 β-helical models of
3.14 The α-helical cap at the N-terminal of a right-handed parallel β-helix

4.1 Stacking of residues
4.2 An aromatic stack of pectate lyase from Bacillus subtilis
4.3 β-coils, repetition of stacks in the residue sequence
4.4 Highlighted β-coils of pectate lyase from Bacillus sp.
4.5 “Gap-filling” to extend the core
4.6 The core of Rgase A from Aspergillus aculeatus
4.7 Dependencies between the algorithms implemented in STACK
4.8 Axis of PelC from Erwinia chrysanthemi
4.9 Cα-trace of a β-coil as the basis for shape approximation and the parameterized vector notation of the
4.10 Shape of a β-coil
4.11 Pitch and twist of β-coils
4.12 Area of a β-coil
4.13 Orientation of side chains based on intersection count
4.14 Orientation of side chains based on scalar product
4.15 The packing of β-coils

B.1 STACK package diagram
B.2 UML class diagram of the store
B.3 Query related class diagram
B.4 Composition of query objects
B.5 Class hierarchy modeling β-helical proteins
B.6 Using a factory-method to create an operation

>2GJLK:_ ACKNOWLEDGEMENT ......
SYNYIHGVKKVGLDGSSSSDTGRNITYH
HNYYNDVNARLPLQRGGLVHAYNNLYTNITGSGLNVR
QNGQALITHENWFEKAINPVTSRYDGKNFGTWVLKGN
NIKDFSTYTWTADTKPYVNADSWTSTGTFPTVAYNYS
KLPYAGVGGPVSAQCVK THANK IYOU IGRAHAM .

Chapter 1

Introduction

β-helix proteins contain a solenoid domain of parallel β-strands folded into a large prism. The repeated unit of the solenoid is called a β-coil and consists of a succession of a few (usually three) β-strands. β-strands from adjacent coils stack to form parallel β-sheets that make up the faces of the prism. These faces are linked by loop regions that protrude from the helix and, in many cases, form the binding site of the helix. The cross section of this prism is typically L-shaped in right-handed parallel β-helices and triangular in left-handed parallel β-helices.

The stability of the domain is mainly obtained by the stacking of similar residues at equivalent positions of the coils, both inside and outside the helix. The inward side chains are mainly hydrophobic and, when they are not, maximal hydrogen bonding or electrostatic interactions neutralise their polar or charged groups.

Our goal is to formalise the intuitive notion of a β-helix in a set of objective algorithms that define a few basic features of these proteins. Consistent with the literature, we identified stacks, β-coils, core and β-helix as essential attributes to be determined (the core is defined as the helical domain of the protein, as distinguished from the protruding loop regions). Our algorithms have been implemented within STACK, a toolkit for analysing β-helix proteins. STACK first identifies aromatic, aliphatic and amidic stacks and then combines these elementary patterns to form β-coils, core, and β-helices. Once these are defined, geometrical features are computed and stored in a relational database. These features include the axis of the β-helix, the cross-sectional shape, the area of the coils and related packing information. STACK is implemented in Java and runs on Unix and Windows. An interface between STACK and RasMol enables structural features to be highlighted automatically.

After a brief introduction to protein structure in Chapter 2, we discuss in detail in Chapter 3 all the geometrical features of β-helices and present a model, based on the parallel β-helix, which is the basis of recent speculation on the molecular make-up of amyloids. The algorithms devised and implemented in our work are described in Chapter 4. Some results are then given in Chapter 5 and conclusions are drawn in Chapter 6. The User and Installation Manual and the Maintenance Manual are in Appendices A and B.

Chapter 2

Biological Background

2.1 Basic Concepts of Protein Structure

The variety of functions performed by proteins arises from the huge number of different three-dimensional shapes that they can adopt: we can say that function follows structure. In considering the structure of a protein, it is helpful to distinguish the levels of organisation [BT98] shown in Figure 2.1.

1. All protein molecules are polymers built up from 20 different amino acids linked end-to-end by peptide bonds. The sequence is known as the primary structure of the protein and it is directly determined by the nucleotide succession in the structural gene. The repeating sequence of atoms along the chain is referred to as the polypeptide backbone; attached to this repetitive chain are the different amino acid side chains, which can be non-polar, acidic, basic, or uncharged polar.

2. The way the protein folds into higher-level structures is a direct consequence of the amino acid sequence. The secondary structure of the polypeptide chain can take the form either of α-helices or β-sheets, formed through regular hydrogen-bonding interactions between the N-H and C=O groups in neighbouring regions of the polypeptide backbone. Sometimes, the secondary structures can also link together to form motifs (also called super-secondary structures) by packing side chains from adjacent α-helices or β-sheets close to each other in space. Several motifs usually combine to form compact globular structures called domains, such as the β-helix; domains are usually encoded by an exon and constitute units of function.

3. The three-dimensional conformation formed by one or more domains is known as tertiary structure and is, usually, energetically favourable and stable.

4. If a particular protein is formed as a complex of more than one polypeptide chain, then the whole complex is designated as its quaternary structure.

2.2 Secondary Structure

Two regular folding patterns, α-helix and β-sheet, are often found in proteins.


Figure 2.1: Primary, secondary, tertiary and quaternary structure of proteins (Adapted from http://www.cs.msstate.edu/~graphics/molvis.html)

Both of these are very common because they result from hydrogen-bonding between the N-H and C=O groups in the polypeptide backbone, without involving the side chains of amino acids. This means that they can be formed by many amino acid sequences. The general bonding pattern is shown in Figure 2.2. The α-helix is generated by a single polypeptide chain turning around itself to make a rigid cylinder. A hydrogen bond is made between every fourth peptide bond, linking the C=O of one peptide bond to the N-H of another. This gives rise to a regular helix with a complete turn every 3.6 amino acids. There are two types of β-sheets, both producing a very rigid structure held together by hydrogen bonds that connect the peptide bonds in neighbouring parts of the chain. If the peptide chain folds back and forth upon itself, with each section of the chain running in the direction opposite to that of its immediate neighbour, the structure is known as an antiparallel β-sheet; if instead the two neighbouring sections run in the same direction, we talk about a parallel β-sheet. Both types are displayed in Figure 2.3.
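The regularity of the α-helix described above can be made concrete with a small sketch. It assumes an idealised helix with exactly 3.6 residues per turn and a hydrogen bond from the C=O of residue i to the N-H of residue i + 4; the function names are illustrative only and are not part of STACK.

```python
# Illustrative sketch of idealised alpha-helix geometry (not part of STACK).
RESIDUES_PER_TURN = 3.6  # value quoted in the text


def rotation_per_residue(residues_per_turn=RESIDUES_PER_TURN):
    """Angle in degrees by which the backbone turns from one residue to the next."""
    return 360.0 / residues_per_turn


def helix_hydrogen_bonds(n_residues, offset=4):
    """List the pairs (i, i + offset) joined by a backbone hydrogen bond.

    Residues are numbered from 1; the C=O of residue i is assumed to bond
    to the N-H of residue i + offset.
    """
    return [(i, i + offset) for i in range(1, n_residues - offset + 1)]


if __name__ == "__main__":
    print(rotation_per_residue())
    print(helix_hydrogen_bonds(10))
```

Because the hydrogen bonds involve only backbone atoms, this pattern is independent of the residue types, which is why many different sequences can adopt the same secondary structure.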

2.3 The β-Helix Fold

The first example of this fold was discovered in 1993 at the University of California, Riverside, by the group of Frances Jurnak. It is a completely novel fold in which the polypeptide chain is folded into a wide helix with three or four β-strands for each turn. These β-strands align to form three or four β-sheets linked by loop regions and with a core between the sheets completely filled with side chains (Figure 2.4). An exhaustive description of this domain will be given in Chapter 3.

Figure 2.2: Secondary structure (Adapted from http://fig.cox.miami.edu/~cmallery/150/protein/proteinsb.htm)

Figure 2.3: Two types of β-sheet structures. On the left we can see an antiparallel β-sheet and on the right a parallel β-sheet. (Adapted from http://broccoli.mfn.ki.se/pps_course_96/ss_960723_4.html)

Figure 2.4: Cartoon representation of a protein chain with a right-handed β-helix fold. As shown by the red segments, the loops connecting corresponding strands in consecutive β-coils of the β-helix can differ in shape and length.

Chapter 3

Parallel β-Helices

3.1 Introduction

The parallel β-helix fold is extremely simple: we can imagine that a gene that codes for a β-helical protein could be created by copying the DNA of a relatively short peptide many times. This is almost certainly the origin of the penta- and hexapeptide repeat families of proteins, but the same process can also account for the evolution of all the repetitive folds from a common ancestor. β-rolls, leucine-rich repeat proteins and folds, for instance, all share significant structural features with the parallel β-helix family, such as relatively untwisted β-sheets, stacking of similar residues and turns between β-strands often stabilised by hydrogen bonding along the direction of their axis. In this chapter we will describe the main features of parallel β-helices, focusing on their architecture rather than their function. We will also include a brief description of structures, to point out their similarities to parallel β-helices.

3.2 Nomenclature

3.2.1 Parallel β-Helix and its β-Sheets

In 1993 Jurnak et al. [YNJ93] observed the structure of the first β-helix in pectate lyase C (PelC). The publication of the first left-handed structure, that of UDP-N-acetylglucosamine acyltransferase, by Raetz and Roderick (1995) [RR95] required an extension of the name to include the specification:

• left-handed or

• right-handed.

The nomenclature typically used to describe their basic architecture is due to Jurnak et al. [YLJ93], who introduced the names PB1, PB2 and PB3 for the three parallel β-sheets and T1, T2, T3 for the three regions following the β-sheets. This nomenclature was later revised to include PB1a and T1a, when the structure of Rgase A clearly showed a fourth β-sheet. This convention is shown in Figure 3.1.


Figure 3.1: Nomenclature of β-sheets in right-handed β-helical proteins. (A) Pertactin has the longest known β-helix. PB1, PB2 and PB3 are shown in yellow, green and red. (B) PehA, showing the fourth β-sheet PB1a coloured in blue. (Adapted from [JP01]).

3.2.2 Stacks

Jurnak et al. [YLJ93] first described stacking as an alignment of similar residues at the equivalent positions in neighbouring coils. It is one of the most basic features of the architecture of this motif, and no parallel β-helix has been observed without stacking. Three different types of stacks found in the parallel β-helix proteins have been reported in the literature:

• polar stacks of asparagine or serine;

• aliphatic stacks of alanine, isoleucine, leucine and valine;

• aromatic stacks of phenylalanine or tyrosine.

The extensive stacking suggests that coils can be added or removed by duplication or deletion of the DNA corresponding to one or more coils, and explains how homologous proteins can have different numbers of coils. We will describe our algorithm for finding stacks in Section 4.1.1.
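The three stack types listed above can be expressed as a simple lookup on residue type. The following sketch is only an illustration of that classification: it assumes the residues at equivalent positions in neighbouring coils are already known and given as one-letter codes, and it ignores all geometry. It is not the stack-finding algorithm of Section 4.1.1, and the names used are hypothetical.

```python
# Illustrative classification of a candidate stack by residue type only.
# The residue sets follow the three stack types reported in the literature.
STACK_CLASSES = {
    "polar": set("NS"),        # asparagine, serine
    "aliphatic": set("AILV"),  # alanine, isoleucine, leucine, valine
    "aromatic": set("FY"),     # phenylalanine, tyrosine
}


def residue_class(residue):
    """Return the stack class of a one-letter residue code, or None."""
    for name, members in STACK_CLASSES.items():
        if residue in members:
            return name
    return None


def classify_stack(residues):
    """Classify residues found at equivalent positions in neighbouring coils.

    Returns "polar", "aliphatic" or "aromatic" when at least two residues
    are given and all of them fall into the same class; otherwise None.
    """
    if len(residues) < 2:
        return None
    classes = {residue_class(r) for r in residues}
    if len(classes) == 1:
        (only,) = classes
        return only  # may be None if no residue belongs to a known class
    return None


if __name__ == "__main__":
    print(classify_stack("NNN"))  # an asparagine ladder
    print(classify_stack("VIL"))
```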

3.2.3 Coils

The term “coil” has generally been used in the literature to describe a single turn of the solenoid structure of β-helices; in this thesis we will often also use the term “β-coil”. The identification of the coils has never been a straightforward problem: depending on the definition used, the number of coils can vary considerably, and it is not clear whether to count the coils using topological arguments or to accept only coils with the regular three-stranded architecture. We will discuss our method for identifying coils in Section 4.1.2. Figure 3.2 illustrates what is meant by coil and by packing of successive coils.

3.2.4 The Core Region

The β-sheets of the parallel β-helices are always connected by loop regions, which protrude from the core of the protein (Figure 3.3). These loops may vary in size and conformation; consequently no specific amino acid sequence pattern has been detected among the coils. In Section 4.1.3 we will present our approach to identifying the core region of the helix and distinguishing it from the protruding loops.

3.3 Description of Known Structures

In this section we summarise some of the salient structural features of parallel β-helices. These are described in greater detail in [JP01].

3.3.1 Helix Handedness

All known β-helices reported in Table 3.1 can be easily classified as left-handed or right-handed. These helical proteins can be distinguished by using the “right-hand rule” shown in Figure 3.4.
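The right-hand rule can also be stated numerically. The sketch below is one possible formulation under simplifying assumptions, not a description of how STACK works: it takes backbone coordinates ordered from the N- to the C-terminus, estimates the helix axis crudely as the vector between the centroids of the two halves of the chain, and sums the signed rotation of consecutive points about that axis. All function names and the synthetic test helix are hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _mean(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def helix_handedness(points):
    """Return "right-handed" or "left-handed" for an ordered chain of 3D points.

    The axis estimate (centroid of second half minus centroid of first half)
    is an assumption of this sketch and is only reasonable for a chain that
    makes several regular turns.
    """
    if len(points) < 4:
        raise ValueError("need at least four points")
    half = len(points) // 2
    axis = _sub(_mean(points[half:]), _mean(points[:half]))
    centre = _mean(points)
    # A positive sum means the chain turns counter-clockwise while advancing
    # along the axis, which is the right-hand rule.
    signed = sum(_dot(axis, _cross(_sub(p, centre), _sub(q, centre)))
                 for p, q in zip(points, points[1:]))
    return "right-handed" if signed > 0 else "left-handed"


def ideal_helix(sign, turns=4, points_per_turn=18, radius=10.0, rise_per_turn=5.0):
    """Synthetic helix for testing; sign=+1 is right-handed, sign=-1 left-handed."""
    step = 2.0 * math.pi / points_per_turn
    return [(radius * math.cos(k * step),
             sign * radius * math.sin(k * step),
             rise_per_turn * k / points_per_turn)
            for k in range(turns * points_per_turn)]


if __name__ == "__main__":
    print(helix_handedness(ideal_helix(+1)))
    print(helix_handedness(ideal_helix(-1)))
```

The geometric values of the synthetic helix are arbitrary; with real coordinates, a more careful axis determination would be needed for short or irregular helices.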

Figure 3.2: Packing of successive coils in PelC. (A) This image of PelC shows a schematic of coils 5 and 7 coloured green and red. (B) Space-filled representation of coil 5 (green) and 7 (red). This view suggests the strong shape complementarity that results in good steric packing between adjacent β-coils.

Figure 3.3: Core of a β-helical protein. Loops are shown in yellow to distinguish them from the core, which is shown in blue.

RIGHT-HANDED PARALLEL BETA-HELICES

Protein (short name)
    PDB code   E.C.        CATH

Pectate lyase PelC from Erwinia chrysanthemi (PelC)
    1AIR       4.2.2.2     2.160.20.10
    1PLU       4.2.2.2     2.160.20.10
    2PEC       4.2.2.2     2.160.20.10   *
Pectate lyase from Bacillus subtilis (BsPel)
    1BN8       4.2.2.2     2.160.20.10
    2BSP       4.2.2.2     2.160.20.10
Pectate lyase PelE from Erwinia chrysanthemi (PelE)
    1PCL       4.2.2.2     -
Pectin lyase PnlA from Aspergillus niger (PnlA)
    1IDJ       4.2.2.10    2.160.20.10
    1IDK       4.2.2.10    2.160.20.10
Pectin lyase PnlB from Aspergillus niger (PnlB)
    1QCX       4.2.2.10    2.160.20.10
Rhamnogalacturonase A from Aspergillus aculeatus (RGase A)
    1RMG       3.2.1.-     2.160.20.10
PehA from Erwinia carotovora (PehA)
    1BHE       3.2.1.15    2.160.20.10
Polygalacturonase II from Aspergillus niger (PG II)
    1CZF       3.2.1.15    2.160.20.10
Polygalacturonase from Aspergillus aculeatus
    1IA5       3.2.1.15    2.160.20.10   *
    1IB4       3.2.1.15    2.160.20.10   *
Endopolygalacturonase I from Stereum purpureum
    1K5C       3.2.1.15    -             *
Endopolygalacturonase I from Stereum purpureum complexed with a galacturonate
    1KCC       3.2.1.15    -             *
Endopolygalacturonase I from Stereum purpureum complexed with 2 galacturonate
    1KCD       3.2.1.15    -             *
Pectin methylesterase from Erwinia chrysanthemi (PemA)
    1QJV       3.1.1.11    2.160.20.40
Pectate lyase Pel-15 from Bacillus sp. strain KSM-P15 (Pel-15)
    1EE6       4.2.2.2     6.1.178.10
Crystal Structure Of Pectate Lyase A (C2 Form) from Erwinia chrysanthemi
    1JRG       4.2.2.2     -             *
    1JTA       4.2.2.2     -             *
Salmonella P22 phage tailspike endorhamnosidase (TSP)
    1TSP       -           2.160.20.20
    1TYU       -           2.160.20.20
    1TYV       -           2.160.20.20
    1TYW       -           2.160.20.20
    1CLW       -           2.160.20.20
    1QA1       -           2.160.20.20
    1QA2       -           2.160.20.20
    1QQ1       -           2.160.20.20
    1QRB       -           2.160.20.20
    1QRC       -           2.160.20.20
Chondroitinase B from Flavobacterium heparinum
    1DBG       -           2.160.20.30
    1DBO       -           2.160.20.30
P69 pertactin from Bordetella pertussis (Pertactin)
    1DAB       -           6.1.18.10
Glutamate synthase from Azospirillum brasilense
    1EA0       -           -
Antifreeze protein from Tenebrio molitor (TmAFP)
    1EZG       -           2.160.20.50

LEFT-HANDED PARALLEL BETA-HELICES

Protein (short name)
    PDB code   E.C.        CATH

UDP-N-acetylglucosamine acyltransferase from Escherichia coli (LpxA)
    1LXA       2.3.1.129   2.160.10.10
Carbonic anhydrase from Methanosarcina thermophila (Cam)
    1THJ       4.2.1.1     2.160.10.10
Tetrahydrodipicolinate N-succinyltransferase (DapD)
    1TDT       2.3.1.117   2.160.10.10
    2TDT       2.3.1.117   2.160.10.10
    3TDT       2.3.1.117   2.160.10.10
    1KGQ       2.3.1.117   -             *
    1KGT       2.3.1.117   -             *
Xenobiotic acetyltransferase from Pseudomonas aeruginosa (PaXAT)
    1XAT       -           -
    2XAT       -           -             *
N-acetylglucosamine 1-phosphate uridyltransferase from E. coli
    1HV9       2.7.7.23    -
N-acetylglucosamine 1-phosphate uridyltransferase from Streptococcus pneumoniae (Glmu)
    1G95       2.7.7.23    -
    1G97       2.7.7.23    -
    1HM0       2.7.7.23    -
    1HM8       2.7.7.23    -
    1HM9       2.7.7.23    -
Antifreeze protein from spruce budworm (SbwAFP)
    1EWW       -           2.160.10.20

Table 3.1: Right-handed and left-handed β-helical proteins. This table is adapted from [JP01]; asterisks in the last column mark new protein structures that were not included in that review. Each protein is described by its full name and origin, short name, PDB code, E.C. number and CATH classification (Version 2.4, released January 2002). The Brookhaven Protein Data Bank (PDB) is the single worldwide archive of structural data of biological macromolecules [BWF+00]. The Enzyme Classification (E.C.) database contains all the known enzyme structures that have been deposited in the PDB [NI92]. CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H) [PLB+00].
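As an illustration of the four-level CATH code described in the caption, the sketch below splits a code such as 2.160.20.10 into its Class, Architecture, Topology and Homologous superfamily numbers and compares two codes down to the topology level. The type and function names are hypothetical and are not taken from STACK or from any CATH software.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """The four levels of a CATH code, in order C.A.T.H."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath(code):
    """Parse a dotted CATH code such as "2.160.20.10"."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers: %r" % code)
    return CathCode(*(int(p) for p in parts))


def same_topology(a, b):
    """True when two codes agree at the Class, Architecture and Topology levels."""
    return parse_cath(a)[:3] == parse_cath(b)[:3]


if __name__ == "__main__":
    print(parse_cath("2.160.20.10"))
    # PelC (1AIR) and TmAFP (1EZG) differ only at the superfamily level.
    print(same_topology("2.160.20.10", "2.160.20.50"))
    # PelC (1AIR) and LpxA (1LXA) differ at the topology level.
    print(same_topology("2.160.20.10", "2.160.10.10"))
```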

Figure 3.4: The right-hand rule. In a right-handed helix, if you extend your right hand and trace with your fingers along the backbone of the helix (from the N-terminal to the C-terminal), the hand and thumb move upwards. In a left-handed helix, in order to have your hand move upwards with your thumb pointing up, you would need to use your left hand.

Figure 3.5: Left-handed and right-handed beta-helices have a different cross-section (A) Triangular cross-section of LpxA (B) L-shaped cross-section of PG II.

Very often, they can even be discriminated just by visual inspection of their shape. As shown in Figure 3.5, a typical left-handed β-helix has a triangular cross-section, while a typical right-handed β-helix is rather L-shaped.

The question of why a chain folds into a left-handed rather than a right-handed helix, or vice versa, is still largely unanswered. Richardson [Ric76] proposed that the inherent right-handed twist of extended polypeptides naturally folds a protein into a right-handed coil when its ends are brought together. In the left-handed β-helix, though, this may not be important because the sheets are unusually flat and the connections between adjacent β-strands are long.

Kisker et al. [KSA+96] suggested the α-L turns as the origin of the left-handed β-helix chirality, but their idea was immediately refuted by the prevalence of such turns in the right-handed folds.

Bateman et al. [BMT98] suggested that the most likely origin of the chirality is the interaction between side chains. If, for instance, three isoleucines come together with their side chains forming a chiral packing at the centre of a coil, as happens in LpxA (Figure 3.6), the result is an extraordinary example of symmetric close packing. This reasoning cannot easily be used to establish the importance of the chiral isoleucine side chains in defining the optimal helix chirality, because it is difficult to demonstrate that there is not a similar packing with the hand of the helix reversed. However, it can be used to explain the triangular cross-section of left-handed parallel β-helices as compared to the rather L-shaped cross-section of the right-handed family. This L-shaped section arises from interactions involving side chains from only two β-sheets, which are closer to a sandwich than to an equilateral triangle.

It is interesting that all the right-handed parallel β-helices, except pertactin, have an α-helix at their N-terminal, while the left-handed β-helix proteins have an α-helix at their C-terminal. However, Brown et al. [BPD+99] showed that these helices are not essential for establishing the overall chirality.

Stacking (as defined in Section 3.2.2) can be mentioned as a last potential origin of chirality. Aromatic stacks, for instance, are very rare in the core of left-handed parallel β-helices, while they are quite common in the right-handed family. The reason might be the greater twist of the sheets in the right-handed family, as the rings must stack with sufficient offset so that the electron-rich centres do not repel each other too strongly. Furthermore, the presence of external aromatic stacks in the left-handed parallel β-helices suggests that either space limitations are critical or aromatic residues do not favour the left-handed fold.

Figure 3.6: Chiral packing of isoleucines at the centre of LpxA The illustrated side chain packing can explain the triangular shape of left-handed β-helices.

3.3.2 Right-Handed Parallel β-Helices

3.3.2.1 Pectinases

The Extra Cellular Pectate Lyase Family

This family contains most known pectate lyases, including the archetype PelC, and all known pectin lyases (the first group of proteins in Table 3.1). The overall fold of these enzymes is very well conserved, with the right-handed β-helical core, the N-terminal α-helix and the N- and C-terminal extensions recurring throughout the family. The structural unit of the helix is a coil of three β-strands, each of which forms a parallel β-sheet with β-strands in neighbouring coils. The resulting structure is a prism made up of seven or eight complete coils, with a cross-section that is often described as L-shaped because of the arrangement of the parallel β-sheets.

As we can see in Figure 3.7, all of the β-strands are unusually short, ranging from three to five residues in length. PB1 and PB2 are roughly antiparallel and interact with one another in a β-sandwich interface, while PB3 lies at approximately 120° relative to PB2. Three regions called T1, T2 and T3 connect the parallel β-sheets. Among them, T2 is the only turn with a fixed length (two residues), while T1 and T3 are composed of loops that vary in size and conformation and protrude from the helix to form, probably, the active site of the enzyme. The loops in T3 (Figure 3.8) are sometimes long enough to bury some hydrophobic residues as well as forming hydrogen bonds. Even if the whole region can appear as a separate domain, there is no information to suggest that this structure can fold independently of the parallel β-helix.

Figure 3.7: Schematic of a typical coil of the pectate lyase family and of a resulting helix (Adapted from http://www.cs.columbia.edu/~abk2001/betawrap.htm)

Because two of the loop regions vary in size, the type and number of amino acids in each coil are variable, with a minimum of 22. This variability makes alignment without knowledge of the structure difficult and generates a low level of sequence identity, which is just enough to assert the homology between these enzymes.

All the proteins of this family share some interesting features concerning the side chains that are oriented toward the centre of the helix. First, all of these side chains are stacked with side chains from adjacent coils, leaving no room for a channel. Second, not all of them are hydrophobic, since there are also polar and charged groups that are neutralised by maximal hydrogen bonding or electrostatic interactions. All these interactions, as well as the stacking of some of the side chains that are oriented outward from the helix, may explain the observed stability of pectate lyases in solution.

The last remarkable structural feature of the pectate lyase family is the presence of an α-helix cap covering the N-terminal end of the β-helix domain, as shown in Figure 3.8. The function of this cap is not clear. It may serve just to prevent solvent entering the semi-hydrophobic core; however, a more fascinating suggestion for its role will be given in Section 3.4.

Figure 3.8: N-terminal end and T3 loop of BsPel. This figure of BsPel shows both the α-helical cap at the N-terminal end of the helix (coloured green, towards the top of the picture) and the T3 loop region (coloured blue).

Polygalacturonases and Rhamnogalacturonase A

The second group of proteins in Table 3.1 has been identified as family 28 of the glycosyl hydrolases. Although these enzymes are clearly homologous, they have diverged beyond the level of sequence identity that would allow the construction of correct models. Compared to the pectate lyase family, family 28 is distinguished by:

• a much more regular pattern of hydrogen bonds;

• a longer helix, made of ten complete coils;

• the presence in each coil of a fourth strand, called PB1a;

• the occurrence of many aliphatic stacks that dominate the interior of the parallel β-helices, leaving little space for aromatic and polar stacks.

Apart from these differences, the proteins of these two families share a very similar structure, including the presence of the N-terminal α-helix.

Pectin Methylesterase

The number and shape of the coils and the presence of the N-terminal α-helix cap make this enzyme very similar to those of the pectate lyase family. The only significant difference can be found in the C-terminal extension, which is much longer and interacts with PB2 rather than PB3.

Pectate Lyase Pel-15

This is another right-handed parallel β-helix with the same shape and number of coils as in the extra-cellular lyases. However, in this case the sequence is significantly shorter than that of the other pectinases and both the N-terminal helix and the C-terminal extension are missing.

3.3.2.2 The P22 Phage Tailspike

The Salmonella typhimurium phage P22 tailspike protein (TSP) has also been reported to have a tri β-strand parallel domain. It has been for many years one of the principal model systems used to study protein folding and misfolding both in vivo and in vitro, mainly because of its stability at high temperature and in the presence of denaturants such as SDS. TSP is a homotrimer, each subunit of which includes 13 coils of three β-strands and numerous examples of the two-residue 120° turn. The nomenclature describing the structure of TSP was developed independently from that of pectinases, so the literature regarding this protein often describes the three β-strands PB1, PB2 and PB3 respectively as C, A and B. The overall shape of the monomer resembles a fish, where the helix represents the body and the C-terminal domain the “caudal fin”. The long loops corresponding to T3 and T1 are named the “dorsal and ventral fins”. Distinctive traits of this right-handed parallel β-helix are:

• the hydrophobicity of the interior residues;

• the presence of the N-terminal α-helix cap;

• the lack of extended stacking in the core of the helix: no asparagine ladders can be found, but there are several aliphatic stacks and an aromatic stack on PB1.

3.3.2.3 Chondroitinase B

With its 13 coils, chondroitinase B is the longest of the parallel β-helix enzymes. Its overall fold is similar to that of the other right-handed parallel β-helices, both in the general L-shape of the coils and in several specific features like the N-terminal α-helix and a long C-terminal extension. Nevertheless, there is no obvious relationship between the sequence of this enzyme and any other protein whose structure is known. Regarding stacking, the interior of the parallel β-helix contains a classical asparagine ladder as well as aliphatic and aromatic stacks. Aromatic stacks are also found outside the helix.

3.3.2.4 P69 Pertactin

Pertactin is probably the most regular of the parallel β-helical proteins, with remarkable internal aliphatic stacks on rather flat β-sheets. With its 16 coils, it is also the longest β-helical protein observed so far. Whilst its N-terminal region does not start with an α-helix, it is otherwise rather similar to the β-helix enzymes.

3.3.2.5 Glutamate Synthase

This is a very large enzyme of 1472 amino acids with a four-domain architecture. The C-terminal domain, from residue 1203 to residue 1472, has a seven-coil right-handed β-helix, which is very regular, but does not resemble any other in the PDB and seems to have a hydrophobic interior.

3.3.2.6 The Antifreeze Protein from Tenebrio molitor

The Tenebrio molitor antifreeze protein (TmAFP) has the highest resolution parallel β-helix structure so far solved (1.4 Å). Its helix is the smallest known, with only 12 residues per coil, and is not evidently related to any other structure. The overall fold is quite different from the other right-handed parallel β-helices. The cross-section of the helix, for instance, is not L-shaped, but almost rectangular. All the coils are nearly identical in the backbone, and the conserved side chains are also positioned in essentially the same orientations. The interior of the helix has no space for large hydrophobic residues, and stability is ensured by a regular pattern of disulphide bridges, rather than by the usual stacks.

This structure seems designed to present a flat and rigid β-sheet along one side of the molecule: this is supposed to be the ice-binding site. Each turn of the helix contributes a short β-strand to the sheet, typically with the sequence TCT, with the two threonine residues projecting outward in a two-dimensional array (Figure 3.9). This spatial arrangement of threonine side chains makes an extraordinarily good match to the repeated spacing between oxygen atoms in the ice lattice on the primary prism plane (7.35 Å and 4.52 Å), and a reasonable match to the basal plane (7.83 Å and 4.52 Å). This lattice matching is highly suggestive of the ice-binding function of the protein (Figure 3.10), which is achieved by inhibiting further growth of the bound ice crystals.

Figure 3.9: Ribbon illustration of TmAFP (a) The flat (unbent and untwisted) β-sheet. (b) The stacking of threonine residues pointing outward and the regular pattern of disulphide bridges. (Adapted from [LTDJ00])

Figure 3.10: Lattice matching/occupation model for TmAFP binding to ice (Taken from [LTDJ00])

Figure 3.11: Typical left-handed parallel β-helix This image of DapD gives an idea of the general features of left-handed parallel β-helices: trimeric form, alignment of the axes of the monomers and triangular cross section of the helices.

3.3.3 Left-Handed Parallel β-Helices

3.3.3.1 Structures Containing Hexapeptide Repeats

All enzymes of this family resemble a triangular prism, with the flat but pleated β-sheets forming the faces. They share the same hexapeptide repeat motif [LIV]-[GAED]-X2-[STAV]-X. Their active form is always a trimer, as shown in Figure 3.11, and the trimerization method is very well conserved. The axis of the trimer is usually aligned with those of the three helices to within a few degrees; however, PaXAT has an angle of 21° between the two axes, probably due to the small dimension of the monomers’ helices.

Stacking is even more frequent in the left-handed family than in the right-handed one because of the repetitive sequence, but it is mostly restricted to aliphatic residues. No asparagine ladders can be found, and both polar and aromatic residues are very rare. All these common features, as well as the similar position of the active sites, are persuasive evidence of homology.

3.3.3.2 The Antifreeze Protein from Spruce Budworm

The spruce budworm antifreeze protein, SbwAFP, is the first parallel β-helical protein structure determined by NMR. Its triangular shape and the spacing between the β-sheets are very similar to those seen in the other left-handed parallel β-helices. Yet, it has a smaller coil than the hexapeptide repeat proteins and can be considered as a protein. SbwAFP is one of the least regular parallel β-helices observed so far, and few stacks can be found inside its core. It is also one of the least repetitive, with only four β-coils followed by some anti-parallel β-strands.

As with other AFPs, SbwAFP probably functions by binding ice nuclei to prevent their growth. As proposed by [GKG+00], the ice-binding site contains nine of the surface-accessible threonines, structured in a regular array of TXT motifs that match both the basal and the prism plane of the ice lattice (as shown in Figure 3.12).

3.4 Amyloidosis

Parallel β-helices form a bridge between globular and fibrous proteins. The rather flat parallel β-sheets and the stacking of similar residues are typical features shared by these two classes of proteins. This has led to analogies being made between parallel β-helices and some models of amyloids.

Amyloidosis (i.e. the formation of polymeric fibrillar structures from normal cellular proteins or peptides) is often involved in the development of neuro-degenerative diseases (e.g. Alzheimer’s disease and prion-induced dementia). It is thus interesting to study which structural features confer the ability to form such polymers and what the overall 3D structure of the filament is.

High-resolution methods cannot be used to investigate the structure of amyloid fibrils (e.g. the insolubility of prion proteins does not allow structural studies by X-ray crystallography or NMR spectroscopy). Therefore, a reasonable way to approach structural questions is to couple low-resolution data with the construction of feasible models that can be tested by further experiments. Several recent papers support models based on the β-helix folding motif:

• The late Max Perutz et al. [PFBL02] describe the polyglutamine aggregates important in Huntington’s disease as β-helices.

• Prusiner et al. [WMG+02] describe the parallel β-helix as the only known fold that satisfies the constraints on the structure of the scrapie prion given by electron microscopy data. Their paper also includes models for the hexagonal units of the crystal, based on structures of known parallel β-helices (Figure 3.13).

Figure 3.12: The putative ice-binding site of SbwAFP Schematic representation of SbwAFP showing the arrangement of the threonine residues in a grid-like ordering spaced 7.4 Å and 4.5 Å apart.

Figure 3.13: β-helical models of amyloids Models that satisfy the constraints about the structure of the scrapie prion given by electron microscopy data can be constructed both using left-handed parallel β-helices (A,B,C,D) and right-handed parallel β-helices (E,F,G,H). (Adapted from [WMG+02])

Figure 3.14: The α-helical cap at the N-terminal of a right-handed parallel β-helix The α-helix, shown in red toward the top of the figure, might be important to prevent further extension of the helix.

Chapter 4

The STACK Toolkit

STACK is a toolkit for analysing the structures of β-helix proteins. The general architecture of these proteins and their biological relevance have been described in Chapter 2. In this chapter we focus on the algorithms that identify the different super-secondary structural elements or that calculate geometrical features. The algorithms that we describe here are those that we found to be the most successful in empirical tests. During the development phase we considered alternative approaches, and we describe the problems and limitations that we found when experimenting with them.

In 1983 Kabsch and Sander [KS83] defined protein secondary structure by means of objective algorithms, which were implemented in DSSP. Following the spirit of that work, we implemented objective algorithms to identify structural patterns in the β-helix fold. The algorithms implemented in STACK therefore constitute our definition of the structural elements in β-helix proteins and of their geometrical features.

4.1 Identification of Structural Elements

4.1.1 Stacks

4.1.1.1 Algorithm

We use stacking of residues as the basic building block for the identification of more complex structural elements. Stacking residues can be identified intuitively in a protein viewer. They are characterised by their similar side chain and backbone conformation and by their parallel and close spatial arrangement.

Kabsch and Sander [KS83] use backbone hydrogen-bonding patterns to define residue bridges. This bridge concept corresponds directly to the stacking of residues in STACK. A possible approach to identifying pairs of stacking residues is therefore to use the bridges defined by DSSP. However, visual inspection shows that an intuitive residue-stacking assignment does not completely agree with the definition made by DSSP. We therefore base the identification of stacking patterns on the conformational and spatial similarity of residues.

Linear spatial arrangements of stacking residues are stacks and can be found in the β-helix motif (see Section 3.2.2). To identify them, the algorithm first identifies all pairs of stacking residues. In the second step, using the transitivity of the binary stacking relation, these pairs are merged into stacks. Formally, two residues $r_s, r_t \in R = \{r_1, r_2, \ldots, r_n\}$ are stacking if and only if the following conditions hold:


Figure 4.1: Stacking of residues

1. The distance between the residues within the protein sequence must be greater than a threshold $d_1$. This constraint is needed to avoid stacking of residues within α-helices, and therefore $d_1$ is set to 5.

   $$ |s - t| > d_1 $$

2. The distance in three-dimensional space between the Cα atoms of the residues is less than a threshold $d_2$. This ensures the spatial relatedness of the residues; from empirical testing, we found that a value of $d_2 = 6$ Å works well.

   $$ \left\| \overrightarrow{C^{\alpha}_s C^{\alpha}_t} \right\| < d_2 $$

3. The angle between the vector from $C^{\alpha}_s$ to $C_s$ and the vector from $C^{\alpha}_t$ to $C_t$ is less than a threshold $\alpha_1$. Together with the next constraint, this requirement describes the conformational similarity of the residues. For the thresholds $\alpha_1$ and $\alpha_2$ the values 0.5 radians and 1.5 radians seem to be appropriate.

   $$ \angle\left( \overrightarrow{C^{\alpha}_s C_s},\ \overrightarrow{C^{\alpha}_t C_t} \right) < \alpha_1 $$

4. The angle between the vector from $C^{\alpha}_s$ to $C^{\beta}_s$ and the vector from $C^{\alpha}_t$ to $C^{\beta}_t$ is less than a threshold $\alpha_2$.

   $$ \angle\left( \overrightarrow{C^{\alpha}_s C^{\beta}_s},\ \overrightarrow{C^{\alpha}_t C^{\beta}_t} \right) < \alpha_2 $$
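The four constraints can be sketched as a single predicate. This is a minimal illustration, not the STACK implementation: the dictionary layout of a residue (`index`, `CA`, `C`, `CB`) and the function names are hypothetical, and constraint 3 is read here as comparing the vectors from Cα to the backbone carbonyl carbon, which is an assumption. Residues without a Cβ atom (glycine) would need separate handling.

```python
import math

def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def angle(a, b):
    """Angle in radians between two 3D vectors."""
    cos = sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))
    return math.acos(max(-1.0, min(1.0, cos)))

def is_stacking(rs, rt, d1=5, d2=6.0, alpha1=0.5, alpha2=1.5):
    """Test the four stacking constraints for two residues.

    rs, rt: dicts with 'index' (sequence position) and the coordinates
    'CA', 'C' and 'CB' (hypothetical layout chosen for this sketch).
    """
    if abs(rs["index"] - rt["index"]) <= d1:           # constraint 1
        return False
    if norm(sub(rt["CA"], rs["CA"])) >= d2:            # constraint 2
        return False
    backbone = angle(sub(rs["C"], rs["CA"]), sub(rt["C"], rt["CA"]))
    if backbone >= alpha1:                             # constraint 3 (assumed CA -> C)
        return False
    side = angle(sub(rs["CB"], rs["CA"]), sub(rt["CB"], rt["CA"]))
    return side < alpha2                               # constraint 4
```

The default thresholds are the values quoted in the list above (5 residues, 6 Å, 0.5 rad and 1.5 rad).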

Figure 4.1 shows the different vectors that are used for comparing two residues. In the second step the algorithm enumerates over the found pairs (x, y) with x, y ∈ R of stacking residues and distinguishes three cases:

1. If there are no stacks that contain x or y create a new stack with x and y as initial residues.

2. If there is a stack that contains one element of the pair, but no stack that contains the other element, extend the stack with the missing residue.

3. If there is a stack that contains x and a different stack that contains y, merge the two stacks into one.
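The three cases can be sketched as follows. The function name `merge_pairs` and the representation of a stack as a Python set are choices made for this illustration only; the sketch assumes residues are hashable identifiers such as sequence numbers.

```python
def merge_pairs(pairs):
    """Merge pairs of stacking residues into stacks (sets of residues)."""
    stacks = []
    for x, y in pairs:
        sx = next((s for s in stacks if x in s), None)
        sy = next((s for s in stacks if y in s), None)
        if sx is None and sy is None:
            stacks.append({x, y})          # case 1: start a new stack
        elif sx is None:
            sy.add(x)                      # case 2: extend the stack of y
        elif sy is None:
            sx.add(y)                      # case 2: extend the stack of x
        elif sx is not sy:
            sx |= sy                       # case 3: unite both stacks
            stacks = [s for s in stacks if s is not sy]
    return stacks
```

For example, the pairs (1, 2), (3, 4) and (2, 3) end up in a single stack containing the residues 1 to 4, because the third pair connects the two stacks created by the first two.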

The described algorithm identifies stacks that can consist of all types of residues. In order to identify stacks that contain only residues of certain types the stacking constraint is refined to allow only certain types of residues. This means that to identify aromatic stacks only aromatic residues are considered in the algorithm. The different kinds of stacks are described in Section 3.2.2 on page 8. Figure 4.2 shows the aromatic stack of pectate lyase from Bacillus subtilis.

Figure 4.2: An aromatic stack of pectate lyase from Bacillus subtilis

4.1.1.2 Discussion

The part of the algorithm that finds all pairs of stacking residues computes the Cartesian product of the set of residues with itself and considers each element as a possible pair of stacking residues. As only residues with a Cα distance of less than 6 Å can be stacking, a geometrical indexing of the Cα atoms improves the algorithm. If we divide the geometrical space into cubes with a side length of 6 Å, the Cα atoms of two stacking residues lie either in the same cube or in neighbouring cubes. Therefore, we can find the partner of a stacking residue by searching only the residue’s cube and all neighbouring cubes. This can be done easily if the residues of each cube are stored in a list.

If a β-bulge is part of a stack, the algorithm treats the parts separated by the β-bulge as individual stacks. The reason is that the bulge prevents the algorithm from identifying the pair of stacking residues that connects both parts. The same effect can be observed if an inner pair of stacking residues has not been identified.
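The cube-based indexing can be sketched as below. This is an illustrative version under simple assumptions (Cα coordinates given as a list of 3D tuples, cube side equal to the cutoff); the function name `close_pairs` is hypothetical and not part of STACK.

```python
import math
from collections import defaultdict
from itertools import product

def close_pairs(coords, cutoff=6.0):
    """Return index pairs (i, j), i < j, of points closer than cutoff."""
    def cell_of(p):
        # Integer coordinates of the cube that contains point p.
        return tuple(math.floor(c / cutoff) for c in p)

    grid = defaultdict(list)
    for i, p in enumerate(coords):
        grid[cell_of(p)].append(i)

    pairs = set()
    for i, p in enumerate(coords):
        cell = cell_of(p)
        # Search the point's own cube and its 26 neighbours only.
        for offset in product((-1, 0, 1), repeat=3):
            neighbour = tuple(a + b for a, b in zip(cell, offset))
            for j in grid.get(neighbour, ()):
                if i < j and math.dist(p, coords[j]) < cutoff:
                    pairs.add((i, j))
    return sorted(pairs)
```

Only the candidate pairs returned by such a search would then need to be checked against the remaining stacking constraints.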

4.1.2 β-coils and β-Helices

4.1.2.1 Algorithm

β-coils and β-helices, described in Section 3.2.3, are identified by the same algorithm, which is based on the previously identified stacks. The algorithm follows the residue sequence and identifies one β-coil after another. The first residue in the sequence that is part of a stack defines the start position of the first β-coil. This coil is extended as long as no stack occurs a second time in the sequence; the first repetition of a stack therefore marks the beginning of the next coil. That coil is extended in the same way as the first one, such that no stack has two residues in the same β-coil. This procedure is repeated until the end of the chain.

The algorithm detects whether an identified β-coil is the first coil of a new β-helix by comparing the stacks of the coil with those of the previous coil. If the β-coils do not have any stacks in common, they belong to different β-helices. Figure 4.3 shows the principal idea behind this algorithm by wrapping the residue sequence into lines. The squares represent residues and their colouring defines stacks. The bar under the squares illustrates the β-coil assignment, and a different colour is used for each assignment. The algorithm was used to identify the β-coils of pectate lyase from Bacillus sp. (PDB: 1EE6), and Figure 4.4 shows the result.
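The coil assignment can be sketched as a single pass over the sequence. The input format (one stack identifier per residue, `None` for residues outside any stack) and the name `find_coils` are assumptions made for this illustration; the assignment of coils to helices is left out.

```python
def find_coils(stack_ids):
    """Split a chain into coils.

    stack_ids[i] is the identifier of the stack that residue i belongs to,
    or None. Returns a list of (start, end) index pairs, end inclusive.
    """
    coils, start, seen = [], None, set()
    for i, sid in enumerate(stack_ids):
        if sid is None:
            continue
        if start is None:
            start = i                      # first stack residue opens the first coil
        elif sid in seen:
            coils.append((start, i - 1))   # a repeated stack closes the current coil
            start, seen = i, set()
        seen.add(sid)
    if start is not None:
        coils.append((start, len(stack_ids) - 1))
    return coils
```

In this sketch the last coil simply runs to the end of the chain, and every coil includes the residues that follow its last stack residue.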

Figure 4.3: β-coils, repetition of stacks in the residue sequence

Figure 4.4: Highlighted β-coils of pectate lyase from Bacillus sp.

4.1.2.2 Discussion

The β-coils identified by the algorithm overlap, meaning that they are too long. The reason is that the stacks do not cover the whole β-helix. A possible solution to this problem is a backtracking approach: when the toolkit finishes a β-coil, it steps back upstream along the residue sequence as long as the geometrical distance to the start residue decreases.

Another problem of the implemented approach is that it does not recognise the handedness of the β-helix. Adjusting the area calculation algorithm will solve this problem (see Section 4.2.4.2). Our approach inherently assumes that β-helices never change their handedness. This is true except for SbwAFP, which is described in Section 3.3.3.2. However, it is not clear whether a single coil that has a different handedness from the others should be considered part of the helix. Therefore, we did not try to adapt our algorithm.

All of the examined proteins, listed in Table 3.1, contain only one β-helix per chain. If this holds for all β-helical proteins, the implementation can be simplified by disregarding the possibility of several β-helices per chain.

4.1.3 The Core Residues of β-Helices

4.1.3.1 Algorithm

The core residues of a β-helix are the residues that are strongly conserved in the β-coils according to their geometrical position. Due to the geometrical definition of stacks (see Section 4.1.1), the residues of the stacks are assigned to be core residues. However, visual inspection suggests that more residues should be part of the core. Therefore, the set of core residues is extended by “gap-filling”. This means that those fragments of the sequence with a length less than a threshold and positioned between stack residues are also considered as core residues.

Figure 4.5 shows a fraction of a residue sequence. The residues with dark background are stacking residues and therefore automatically part of the core. Gaps between those residues with a length less than three are also considered as core residues. This algorithm was used to identify the core of Rgase A from Aspergillus aculeatus (PDB: 1RMG) and Figure 4.6 shows the result.
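A minimal sketch of the gap-filling step is given below. It assumes that stack membership is available as one boolean per residue; the function name `fill_gaps` and the parameter `max_gap` are hypothetical, with the default taken from the "length less than three" rule above.

```python
def fill_gaps(in_stack, max_gap=3):
    """Return one boolean per residue marking the core residues.

    in_stack[i] is True if residue i belongs to a stack. Gaps shorter
    than max_gap that lie between two stack residues are filled.
    """
    core = list(in_stack)
    last = None                            # index of the previous stack residue
    for i, flag in enumerate(in_stack):
        if flag:
            if last is not None and 0 < i - last - 1 < max_gap:
                for j in range(last + 1, i):
                    core[j] = True
            last = i
    return core
```

Fragments before the first or after the last stack residue are never filled, because they are not enclosed by stack residues.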

Figure 4.5: “Gap-filling” to extend the core The figure shows a fragment of a sequence and the residues are numbered from 30 to 49. The dark filled squares represent residues that are part of a stack. The bar under the sequence illustrates the core assignment. The described algorithm assigns the residues 33 and 34 to be core residues.

4.2 Geometrical Analysis of Structural Elements

4.2.1 Axis of β-Helices

4.2.1.1 Algorithm

Finding the β-helix axis is a prerequisite for identifying most of the other structural features described later in this chapter. The dependencies between the algorithms are illustrated in Figure 4.7. It is therefore important that the axis can be found reliably. An axis is described mathematically as a line specified by a point on the line and a vector giving the line’s direction.

The algorithm starts by finding the axis of each stack. This is done by fitting a line through the Cα atoms of the stack’s residues, minimising the root mean square distance. Then the point of application of the axis is calculated by projecting the centre of gravity of the Cα atoms onto the axis. In principle, the direction of the β-helix axis is defined as the average of the directions of the stacks’ axes. To find the point of application of the β-helix axis, the calculated points of application of the stacks’ axes are projected onto a plane that is perpendicular to the direction of the β-helix axis. We use the centre of gravity of those points as the point of application of the β-helix axis. Finally, the calculated axis is rewritten in order to ease the visualisation in RasMol. Figure 4.8 shows the axis of PelC from Erwinia chrysanthemi (PDB: 1AIR).

Figure 4.6: The core of Rgase A from Aspergillus aculeatus
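The combination of the stack axes into the β-helix axis can be sketched as follows. The line fit for the individual stacks is not shown; the sketch assumes that each stack axis is already given as a (point, direction) pair. The function name `helix_axis`, the alignment of all directions with the first one before averaging, and the choice of a projection plane through the origin are assumptions of this illustration.

```python
import math

def helix_axis(stack_axes):
    """Combine stack axes, given as (point, direction) pairs, into one axis.

    Returns (point, unit_direction) of the helix axis.
    """
    reference = stack_axes[0][1]
    total = [0.0, 0.0, 0.0]
    for _, d in stack_axes:
        length = math.sqrt(sum(c * c for c in d))
        # Assumption: flip directions that point against the first axis.
        sign = 1.0 if sum(a * b for a, b in zip(d, reference)) >= 0 else -1.0
        total = [t + sign * c / length for t, c in zip(total, d)]
    length = math.sqrt(sum(c * c for c in total))
    direction = tuple(c / length for c in total)

    # Project each point of application onto the plane through the
    # origin that is perpendicular to the averaged direction.
    projected = []
    for p, _ in stack_axes:
        h = sum(a * b for a, b in zip(p, direction))
        projected.append(tuple(a - h * b for a, b in zip(p, direction)))

    count = len(projected)
    point = tuple(sum(q[i] for q in projected) / count for i in range(3))
    return point, direction
```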

4.2.1.2 Discussion

Variations of this algorithm might weight the stacks according to their length when calculating the direction of the β-helix axis. Weighting might also be used in the calculation of the axis’ point of application. None of the alternative approaches that were considered during the development of the algorithm was successful, simply because they all contained the implicit assumption of equally distributed stacks, which does not hold.

4.2.2 Shape of β-Coils

4.2.2.1 Algorithm

We define the shape of a β-coil as the projection of its backbone atoms onto a plane perpendicular to the axis of the β-helix. The toolkit provides a method to approximate the shape with two-dimensional harmonics. As the aim of the algorithm is to compute an approximation of the shape, we do not use the projection of the full backbone, but the projection of the Cα-trace. Connecting the first and the last Cα atoms of the trace results in a closed curve (see the left box of Figure 4.9).

The first mathematical idea¹ behind our shape approximation is that a curve x can be written in vector notation, parameterized with a scalar t:

$$ x(t) = \begin{pmatrix} u(t) \\ v(t) \end{pmatrix} \quad \text{with } t \in [a, b] $$

The coordinates of the start point are given by x(a) and those of the end point by x(b), and x(a) = x(b) holds. It is also possible to choose an arbitrary interval $[\tilde{a}, \tilde{b}]$ by substituting t with $\tilde{t}$:

$$ \tilde{t} = \tilde{a} + \frac{(t - a)\,(\tilde{b} - \tilde{a})}{b - a} $$

We choose the interval $[0, 2\pi]$ for t, so that we can rewrite both component functions u and v as infinite Fourier series:

$$ u(t) = F_u(t) = \frac{a_0^{(u)}}{2} + \sum_{k=1}^{\infty} \left[ a_k^{(u)} \cos(k t) + b_k^{(u)} \sin(k t) \right] $$

$$ v(t) = F_v(t) = \frac{a_0^{(v)}}{2} + \sum_{k=1}^{\infty} \left[ a_k^{(v)} \cos(k t) + b_k^{(v)} \sin(k t) \right] $$

Using only the terms with k ≤ n gives us a shape-approximating function $x_n$:

$$ x_n(t) = \begin{pmatrix} \dfrac{a_0^{(u)}}{2} + \sum_{k=1}^{n} \left[ a_k^{(u)} \cos(k t) + b_k^{(u)} \sin(k t) \right] \\[2ex] \dfrac{a_0^{(v)}}{2} + \sum_{k=1}^{n} \left[ a_k^{(v)} \cos(k t) + b_k^{(v)} \sin(k t) \right] \end{pmatrix} $$

¹ The mathematical formulas in this section are based on [AO97] and [AO94].

Figure 4.7: Dependencies between the algorithms implemented in STACK [Diagram. Identification of structural elements: stacks, β-coil/β-helix, core. Identification/calculation of features: axis, pitch, twist, area, packing, shape.]

Figure 4.8: Axis of PelC from Erwinia chrysanthemi

This method gives remarkably good shape approximations for the β-coils with n = 2 (see Figure 4.10). We use the discrete fast Fourier transform to calculate the Fourier coefficients. More details about the algorithm can be found in Appendix B.3.3.
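The truncated series can be illustrated with a small sketch. It computes the coefficients by direct summation instead of a fast Fourier transform, which is slower but gives the same values, and it assumes that the samples of the closed curve are equally spaced in the parameter t over [0, 2π). The function names are hypothetical; the functions are applied to the u and v components separately.

```python
import math

def fourier_coefficients(values, n):
    """Coefficients a_0..a_n and b_0..b_n of one component function.

    values: samples of u (or v) at t_j = 2*pi*j/m for j = 0..m-1.
    """
    m = len(values)
    a = [2.0 / m * sum(x * math.cos(2 * math.pi * k * j / m)
                       for j, x in enumerate(values)) for k in range(n + 1)]
    b = [2.0 / m * sum(x * math.sin(2 * math.pi * k * j / m)
                       for j, x in enumerate(values)) for k in range(n + 1)]
    return a, b

def approximate(a, b, t):
    """Evaluate a_0/2 + sum_k [a_k cos(k t) + b_k sin(k t)] at t."""
    return a[0] / 2 + sum(a[k] * math.cos(k * t) + b[k] * math.sin(k * t)
                          for k in range(1, len(a)))
```

With n = 2, as used above, each component is described by five numbers (a_0, a_1, a_2, b_1, b_2); b_0 is always zero.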

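4.2.2.2 Discussion

Translating the shape in 2D only affects the coefficients $a_0^{(u)}$ and $a_0^{(v)}$, without changing the parameterization.

Translation vector:

$$ s = \begin{pmatrix} s_u \\ s_v \end{pmatrix} $$

Adding s to the function $x_n(t)$ leads to:

$$ \tilde{x}_n(t) = s + x_n(t) = \begin{pmatrix} s_u \\ s_v \end{pmatrix} + \begin{pmatrix} \dfrac{a_0^{(u)}}{2} + \sum_{k=1}^{n} \left[ a_k^{(u)} \cos(k t) + b_k^{(u)} \sin(k t) \right] \\[2ex] \dfrac{a_0^{(v)}}{2} + \sum_{k=1}^{n} \left[ a_k^{(v)} \cos(k t) + b_k^{(v)} \sin(k t) \right] \end{pmatrix} $$

$$ \tilde{x}_n(t) = \begin{pmatrix} \dfrac{a_0^{(u)}}{2} + s_u + \sum_{k=1}^{n} \left[ a_k^{(u)} \cos(k t) + b_k^{(u)} \sin(k t) \right] \\[2ex] \dfrac{a_0^{(v)}}{2} + s_v + \sum_{k=1}^{n} \left[ a_k^{(v)} \cos(k t) + b_k^{(v)} \sin(k t) \right] \end{pmatrix} $$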

Figure 4.9: Cα-trace of a β-coil as the basis for shape approximation and the parameterized vector notation of the curve

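Comparing the constant terms, and recalling that the constant term of each series is half its zeroth coefficient, we can derive:

$$ \tilde{a}_0^{(u)} = a_0^{(u)} + 2 s_u \qquad \tilde{a}_0^{(v)} = a_0^{(v)} + 2 s_v $$

The effect of rotating the shape can also be calculated easily. Rotation matrix:

$$ R = \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix} $$

Applying R to $x_n$ gives:

$$ \tilde{x}_n(t) = \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix} \begin{pmatrix} u_n(t) \\ v_n(t) \end{pmatrix} = \begin{pmatrix} u_n(t) \cos\varphi - v_n(t) \sin\varphi \\ u_n(t) \sin\varphi + v_n(t) \cos\varphi \end{pmatrix} $$

Then the coefficients of $\tilde{x}_n(t)$ can be calculated with:

$$ \tilde{a}_k^{(u)} = a_k^{(u)} \cos\varphi - a_k^{(v)} \sin\varphi \qquad \tilde{b}_k^{(u)} = b_k^{(u)} \cos\varphi - b_k^{(v)} \sin\varphi $$

$$ \tilde{a}_k^{(v)} = a_k^{(u)} \sin\varphi + a_k^{(v)} \cos\varphi \qquad \tilde{b}_k^{(v)} = b_k^{(u)} \sin\varphi + b_k^{(v)} \cos\varphi $$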

Figure 4.10: Shape of a β-coil

4.2.3 The Pitch and Twist of β-Coils

4.2.3.1 Algorithm

To calculate the pitch between consecutive β-coils, the toolkit first finds all stacking residues in the two coils. Then, for both β-coils, the centre of gravity of the identified residues’ Cα atoms is projected onto the axis. The distance between the two projections is taken as the pitch between the coils. We define the twist between consecutive β-coils as the average of the twists between their stacking residues. The twist between two stacking residues is calculated as follows: first the Cα atoms are projected onto the axis, and then the angle between the vectors from the projections to the original atom positions is calculated. Figure 4.11 shows the pitch and the twist considering only one pair of stacking residues.
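Both quantities can be sketched with a few vector operations. The sketch assumes that the axis is given by a point `origin` and a unit vector `direction`, and it returns the unsigned angle for the twist; how the toolkit derives the sign of the twist is not described here, so it is left out. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def project_on_axis(p, origin, direction):
    """Orthogonal projection of p onto the line origin + h * direction."""
    h = dot(sub(p, origin), direction)
    return tuple(o + h * d for o, d in zip(origin, direction))

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def pitch(coil_a, coil_b, origin, direction):
    """Distance between the projected centroids of the stacking CA atoms."""
    pa = project_on_axis(centroid(coil_a), origin, direction)
    pb = project_on_axis(centroid(coil_b), origin, direction)
    return math.dist(pa, pb)

def pair_twist(ca_s, ca_t, origin, direction):
    """Unsigned angle between the axis-to-atom vectors of two CA atoms."""
    v = sub(ca_s, project_on_axis(ca_s, origin, direction))
    w = sub(ca_t, project_on_axis(ca_t, origin, direction))
    cos = dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))
    return math.acos(max(-1.0, min(1.0, cos)))

def coil_twist(pairs, origin, direction):
    """Average twist over all pairs of stacking CA atoms of two coils."""
    return sum(pair_twist(s, t, origin, direction) for s, t in pairs) / len(pairs)
```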

4.2.4 Area of β-Coils 4.2.4.1 Algorithm

The area of a β-coil is determined from the projection of the core residues’ Cα atoms onto a plane perpendicular to the β-helix axis. Figure 4.12 shows the projection of a β-coil that contains only core residues and the resulting polygon divided into triangles. The total area can be calculated by summing the areas of the triangles. Let $c(t_i)$ be the vector to the i-th Cα atom and m the total number of Cα atoms considered. The β-coil area is then calculated as:

$$ A = \frac{1}{2} \left\| \sum_{i=1}^{m-1} \left[ c(t_i) \times c(t_{i+1}) \right] + c(t_m) \times c(t_1) \right\| $$
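For points that have already been projected into the plane, the cross products reduce to their single non-zero component, and the sum becomes the well-known shoelace formula. The sketch below assumes 2D coordinates in polygon order; the function name `signed_area` is hypothetical. It keeps the sign of the sum, which is positive for counter-clockwise and negative for clockwise vertex order, so the absolute value corresponds to A in the equation above.

```python
def signed_area(polygon):
    """Signed area of a closed polygon given as a list of (x, y) vertices."""
    m = len(polygon)
    total = 0.0
    for i in range(m):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % m]      # wraps around to close the polygon
        total += x1 * y2 - x2 * y1         # z-component of c(t_i) x c(t_i+1)
    return total / 2.0
```

The formula also holds for concave polygons such as an L-shaped cross-section, because triangles that reach outside the polygon contribute with the opposite sign.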

4.2.4.2 Discussion

The triangles drawn in Figure 4.12 cross the boundary of the Cα polygon, but the given equation for the total area is robust against this because the partial areas are calculated with the vector product. This means that reversing the mathematical orientation gives a result vector in the opposite direction. This property of the vector product can be used to distinguish left- and right-handed β-helices by comparing the direction of the area vector with the direction of the vector between the Cα atoms of stacking residues.

Figure 4.11: Pitch and twist of β-coils

4.2.5 Orientation of Side Chains

4.2.5.1 Algorithm

The toolkit distinguishes side chains of the core residues that point into the β-helix from those that point outwards. As glycine residues have only a hydrogen atom as their side chain, they are not taken into account. Detecting the side chain orientation of a β-coil residue is equivalent to deciding whether the Cβ atom of the residue lies inside or outside the coil. We can determine whether a point lies inside a polygon by considering a line from the point to infinity and counting how often the line crosses the boundary of the polygon. If the number of crossing points is odd, then the point must lie inside the polygon.

Figure 4.13 shows the projection of a Cα trace onto a plane perpendicular to the β-helix axis. Three points (A, B and C) and a line parallel to the y-axis are additionally drawn. The line intersects the polygon in the line segments a, b, c and d. As our implementation always uses a line through the Cβ atom parallel to the y-axis, we can easily identify the intersected line segments by comparing the x-coordinate of the Cβ atom with those of the line segment’s Cα atoms. Inserting the x-coordinate of the Cβ atom into the segment’s line equation and comparing the result with the y-coordinate of the Cβ atom shows whether the Cβ atom is “below” or “above” the line segment. We then count the number of intersected line segments “above” the Cβ atom (greater y-value) and test the parity.
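The parity test with a ray parallel to the y-axis can be sketched as follows, assuming projected 2D coordinates for the Cβ atom and for the Cα polygon. The function name `is_inside` is hypothetical, and the half-open comparison used to decide whether a segment spans the ray is one common way to treat vertices that lie exactly on the ray.

```python
def is_inside(point, polygon):
    """Return True if point lies inside the polygon (list of (x, y) vertices)."""
    px, py = point
    crossings = 0
    m = len(polygon)
    for i in range(m):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % m]
        # The segment is hit only if its end points lie on different
        # sides of the vertical line x = px (so x1 != x2 here).
        if (x1 <= px) != (x2 <= px):
            y_on_segment = y1 + (px - x1) * (y2 - y1) / (x2 - x1)
            if y_on_segment > py:          # count only segments "above" the point
                crossings += 1
    return crossings % 2 == 1
```

For an L-shaped polygon this test classifies a point in the concave notch as lying outside.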

Figure 4.12: Area of a β-coil

4.2.5.2 Discussion

An alternative approach is to calculate the orientation of the side chains by comparing the direction of the vector from the axis to the Cα atom of the residue with that of the vector from the Cα to the Cβ atom of the residue. If the scalar product of these vectors is positive, we consider the side chain as pointing out of the β-helix. An attraction of this approach is that it requires less calculation. However, Figure 4.14 shows that it detects the orientation of some residues incorrectly. These residues are coloured red in Figure 4.14, whereas the residues with correctly detected side chain orientation are coloured dark and light grey. The error is caused by the concavity of the β-coil area.

4.2.6 Packing of β-Coils 4.2.6.1 Description of the Algorithm

We define the packing index of a β-coil as the quotient of the inner side chain volume Vs and the total inner β-coil volume Vt. The value of Vs is calculated by summing the side chain volumes of all inward-pointing core residues of that β-coil. The specific side chain volumes are calculated by taking the residue volumes given in [Zam72] and subtracting the value of glycine. The calculated values can be found in Table 4.1. The volume Vt is the product of the β-coil’s area and the average pitch.
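As a small worked sketch, the packing index can be computed as shown below. The dictionary holds only a few side chain volumes copied from Table 4.1, keyed by three-letter residue codes; the function name, the input format and the example numbers (an area of 100 Å² and a pitch of 4.8 Å) are illustrative assumptions, not values produced by STACK.

```python
# Side chain volumes in cubic angstroms, a subset of Table 4.1.
SIDE_CHAIN_VOLUME = {
    "GLY": 0.0,
    "ALA": 28.5,
    "ASN": 54.0,
    "ILE": 106.6,
    "LEU": 106.6,
    "PHE": 129.8,
}

def packing_index(inward_residues, area, average_pitch):
    """Quotient Vs / Vt for one coil.

    inward_residues: residue codes of the inward-pointing core residues.
    area: coil area; average_pitch: average pitch (consistent length units).
    """
    vs = sum(SIDE_CHAIN_VOLUME[name] for name in inward_residues)
    vt = area * average_pitch
    return vs / vt

# Three inward-pointing residues in a coil of 100 square angstroms
# with a pitch of 4.8 angstroms: Vs = 241.7, Vt = 480.
example = packing_index(["LEU", "ILE", "ALA"], 100.0, 4.8)
```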

Figure 4.13: Orientation of side chains based on intersection count

Name             Residue Volume [Å³]   Side Chain Volume [Å³]
Alanine                 88.6                   28.5
Arginine               173.4                  113.3
Aspartic Acid          111.1                   51
Asparagine             114.1                   54
Cysteine               108.5                   48.4
Glutamic Acid          138.4                   78.3
Glutamine              143.8                   83.7
Glycine                 60.1                    0
Histidine              153.2                   93.1
Isoleucine             166.7                  106.6
Leucine                166.7                  106.6
Lysine                 168.6                  108.5
Methionine             162.9                  102.8
Phenylalanine          189.9                  129.8
Proline                112.7                   52.6
Serine                  89.0                   28.9
Threonine              116.1                   76.0
Tryptophan             227.8                  167.7
Tyrosine               193.6                  153.5
Valine                 140.0                   99.9

Table 4.1: Residue and residue side chain volume

Figure 4.14: Orientation of side chains based on scalar product This figure shows the projection of the backbone’s C, Cα and N atoms and the side chains’ Cβ atoms on a plane perpendicular to the β-helix axis. Additionally the projection of the axis is shown. The residues with inward pointing side chain are coloured light grey and those outward pointing are shown in dark grey. The residues with incorrectly identified side chain orientation are coloured red.

Figure 4.15: The packing of β-coils This figure shows the same β-coil as in Figure 4.14, but the inward-pointing side chains are in ball-and-stick representation. Additionally, a sketch of the inner β-coil volume Vt is shown.

Chapter 5

Results

We used the STACK toolkit to analyse the number of coils, the pitch and the twist of all proteins in our database. The results are shown in Table 5.1. Some of these results are consistent with observations reported in the literature.

• Right-handed β-helices have a higher twist than left-handed ones, which are, in fact, supposed to have flatter β-sheets [JP01]. The only exception to this statement is the ice-binding protein TmAFP (1EZG). We should point out, though, that this structure is curved and, since we used the axis (a straight line) for calculating the twist, the values might be inconsistent.

• The pitch of PelC is very close to the ideal pitch of the pectate lyase family, which should be 4.8 Å [JP01].

• The pitches of both ice-binding proteins, especially that of TmAFP, are very close to the distance between oxygen atoms on the prism plane of ice, as shown in Figure 3.10 [LTDJ00].

The calculated results also suggest two relationships between the twist and the handedness of a helix, at least for the proteins that we have studied. First, the handedness of a β-helix can usually be discriminated by the absolute value of the twist per β-coil. Secondly, the sign of the twist can also be used for the same purpose, as all the left-handed proteins seem to twist clockwise.


PDB CODE  COILS  SUM TWIST [RAD]  SUM TWIST [°]  AVG TWIST [°]  ABS AVG TWIST [°]  SUM PITCH [Å]  AVG PITCH [Å]
1G95       11        0.0509          2.9167         0.2917          0.2917            56.7130        5.6713
1LXA        8        0.1456          8.3401         1.1914          1.1914            35.5042        5.0720
1THJ.A      6        0.1204          6.8971         1.3794          1.3794            25.7442        5.1488
1THJ.C      6        0.1215          6.9619         1.3924          1.3924            25.7446        5.1489
1THJ.B      6        0.1268          7.2631         1.4526          1.4526            26.0139        5.2028
1TDT.B      5        0.1365          7.8194         1.9548          1.9548            20.4011        5.1003
1TDT.C      5        0.1422          8.1493         2.0373          2.0373            20.4169        5.1042
1TDT.A      5        0.1454          8.3308         2.0827          2.0827            20.4025        5.1006
1XAT        4        0.1361          7.7969         2.5990          2.5990            16.0826        5.3609
1EZG.B      7        0.2801         16.0457         2.6743          2.6743            27.9871        4.6645
1EWW        3        0.1001          5.7343         2.8671          2.8671             9.8012        4.9006
1EZG.A      7        0.3721         21.3199         3.5533          3.5533            27.7345        4.6224
--------------------------------------------------------------------------------------------------------------
1K5C       10       -0.5743        -32.9053        -3.6561          3.6561            44.1404        4.9045
1RMG       10       -0.6219        -35.6329        -3.9592          3.9592            45.2181        5.0242
1CZF.A     10       -0.6611        -37.8778        -4.2086          4.2086            47.6236        5.2915
1BHE       10       -0.6773        -38.8066        -4.3118          4.3118            47.2812        5.2535
1CZF.B     10       -0.7190        -41.1943        -4.5771          4.5771            47.3017        5.2557
1TSP       12       -0.8944        -51.2437        -4.6585          4.6585            58.2284        5.2935
1EE6        7       -0.5143        -29.4670        -4.9112          4.9112            29.5909        4.9318
1DAB       16       -1.3095        -75.0296        -5.0020          5.0020            75.3541        5.0236
1IA5       10       -0.8052        -46.1354        -5.1262          5.1262            46.7142        5.1905
1BN8        8       -0.6350        -36.3805        -5.1972          5.1972            34.0957        4.8708
1JTA        8       -0.6474        -37.0936        -5.2991          5.2991            33.6682        4.8097
1QCX        8       -0.6830        -39.1325        -5.5904          5.5904            35.4064        5.0581
1AIR        9       -0.7817        -44.7878        -5.5985          5.5985            38.1093        4.7637
1DBG       11       -1.0337        -59.2238        -5.9224          5.9224            50.0680        5.0068
1IDK        8       -0.7483        -42.8762        -6.1252          6.1252            34.8890        4.9841

Table 5.1: Results The table shows the results of querying the database for number of coils, pitch and twist of the proteins that we have studied. The data are sorted by the absolute values of their twist per β-coil and the thick horizontal line divides the entries with positive twist from those with negative twist. Right-handed helices are coloured grey to be distinguished from left-handed ones.

Chapter 6

Conclusions

In our project we developed STACK, a toolkit for the structural analysis of β-helix proteins. The toolkit identifies characteristic structural elements of the β-helix fold, such as different kinds of stacks, the β-coils, the core and the β-helix. Furthermore, it calculates a number of geometrical features, including the axis of the β-helix, the cross-sectional shape and area of a β-coil, the packing index of a β-coil and the β-coils’ pitch and twist. The implemented algorithms are objective and can therefore be seen as a definition of the described structural elements and features.

Developing the algorithms has not been a straightforward process and, learning from our mistakes, we have finally achieved a deep insight into the structure of β-helices. The choice to store all the computed features in a relational database, rather than focusing on the bare algorithms, has led us to a final version of the toolkit that, we think, can easily be used by anyone who has a basic knowledge of SQL. Admittedly, there was not much time left to analyse the data, but the good results that we achieved by querying the database make us content with our choice. The toolkit provides a solid platform for further investigation of the amazing structure of β-helix proteins.

Appendix A

User and Installation Manual

A.1 Readership

In this manual we describe the installation procedure and the commands of STACK, a toolkit for analyzing β-helical proteins. The reader of this manual should be able to install software on the target platform (e.g. Windows, Sun Solaris). This includes editing configuration files in a text editor of choice (e.g. Emacs, vi, Notepad) and basic tasks such as extracting files and performing basic commands on the file system.

A.2 Installation

A.2.1 Requirements

Platform
The toolkit runs on platforms that support the Java programming language version 2. It has been successfully tested on Windows 2000, Windows XP and Sun Solaris.

Java Runtime Environment or Java Development Kit
The toolkit is written entirely in the Java programming language. To execute the program, a Java runtime environment (version 1.4.x) must be installed¹. To compile the toolkit, a Java development kit (version 1.4.x) is needed, which also includes the runtime environment. In both cases we suggest the latest version (1.4.1). The Java executable must be in the path and the Java classpath environment variable must be set². The compilation process is described in Section B.3.1 of the “maintenance manual” in Appendix B.

Relational Database
The data is stored in a relational database. So far we have only used SapDB³. We assume that the database is installed properly and that a user with RESOURCE privileges has been created. The corresponding database instance must be running.

¹ The JRE and the JDK are available at http://www.javasoft.com
² For further information refer to the Java documentation.
³ http://www.sapdb.org

RasMol. RasMol, a program for visualising proteins, must be installed⁴. We mainly used RasMol 2.7.1.1, but any other version of RasMol should also work.

Files and Libraries. To install the toolkit you need either the binary distribution or the source code distribution file⁵. Additionally, two libraries are needed: sapdbc.jar⁶, a JDBC driver to connect to SapDB, and jama-1.0.1.jar⁷, a numerical library.

A.2.2 Installation Steps

1. Make sure that the requirements given in Section A.2.1 are fulfilled.

2. Unpack the toolkit distribution to a directory of your choice. In the rest of the document we assume C:\Program Files\STACK.

3. Customise the configuration file C:\Program Files\STACK\conf\main.properties (see the next section).

A.3 Configuration

The toolkit is configured in a Java properties file that contains a set of key-value pairs and additional documentation. Each key is separated from its value by ‘=’.
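As a rough illustration of how such a file is read, the following sketch parses a few sample lines with the standard java.util.Properties class. The class name PropertiesSketch and the inline sample text are assumptions made for this example and are not part of STACK; it is written for a current Java version rather than 1.4. It also shows why Windows paths are written with doubled backslashes in the examples of this section: a single backslash is an escape character in the properties format.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of STACK: parses properties text in the same
// way java.util.Properties would read a file such as conf/main.properties.
class PropertiesSketch {

    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from a String should not fail; rethrow unchecked if it does.
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        // Sample content. In the file itself every backslash is doubled;
        // the Java string literal doubles them once more.
        String sample =
              "# lines starting with '#' are comments\n"
            + "analyserDB.general.temp=c:\\\\temp\\\\\n"
            + "analyserDB.store=sapDB\n";
        Properties p = parse(sample);
        System.out.println(p.getProperty("analyserDB.general.temp"));
        System.out.println(p.getProperty("analyserDB.store"));
        // A key that is absent falls back to the supplied default value.
        System.out.println(p.getProperty("analyserDB.store.startup.clean", "false"));
    }
}
```

After parsing, the value of analyserDB.general.temp contains single backslashes again, so it can be used directly as a path.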

A.3.1 GENERAL Properties

A.3.1.1 Sets the Analyser Root Directory
Key: analyserDB.general.root
Value: a Java-string, the root directory
Example: x:\\

A.3.1.2 Sets the Analyser Import Directory
Key: analyserDB.general.import
Value: a Java-string, the import directory
Example: x:\\data\\

A.3.1.3 Sets the Analyser Temporary Directory
Key: analyserDB.general.temp
Value: a Java-string, the temporary directory
Example: c:\\temp\\

⁴ http://www.openRasMol.org
⁵ http://www.mdstud.chalmers.se/~md1lars
⁶ http://www.sapdb.org/sap_db_jdbc.htm
⁷ http://math.nist.gov/javanumerics/jama/index.html

A.3.1.4 Sets the Protein Viewer
So far only RasMol on Windows has been tested.
Key: analyserDB.RasMol
Value: (windows | unix)
Example: windows

Key: analyserDB.controller.rasmol.filename
Value: a Java-string, the filename of the temporary script
Example: c:\\temp\\RasMolScript

Key: analyserDB.controller.rasmol.command
Value: a Java-string, the command to start RasMol
Example: x:\\bin\\raswin.exe

A.3.2 STORE Properties

A.3.2.1 Selects the Store
Key: analyserDB.store
Value: (rambased | sapDB)
rambased: use the RamBasedEntityStore (NOT recommended for large datasets)
sapDB: use the SapDBEntityStore (a SapDB must be available and set up properly)
Example: sapDB

A.3.2.2 Configures the Connection Properties
The properties must be chosen according to analyserDB.store.
Key: analyserDB.store.sapdb.driver
Value: a Java-string, the class that implements the JDBC driver
Example: com.sap.dbtech.jdbc.DriverSapDB

Key: analyserDB.store.sapdb.login
Value: a Java-string, the database login
Example: jdbc

Key: analyserDB.store.sapdb.passwd
Value: a Java-string, the user password
Example: geheim

Key: analyserDB.store.sapdb.host
Value: a Java-string, the database host
Example: localhost

Key: analyserDB.store.sapdb.instance
Value: a Java-string, the instance name (schema name)
Example: proDB

A.3.2.3 Configures the Startup Behaviour
Deletes all existing PersistentObjectContainers at startup. This has no effect on non-persistent PersistentObjectContainers such as ‘rambased’.
Key: analyserDB.store.startup.clean
Value: (true | false)
Example: false

A.3.2.4 Setting the Caching Strategy
Selects the caching strategy for each PersistentObjectContainer.
Key: analyserDB.store.caching.‘ContainerName’
Value: (LRU | WEAK)
WEAK: use the WeakHashMap
LRU: use the LRUCacheMap (LRU = Least Recently Used)
If the LRU caching strategy is chosen, the maximum number of cached entries must also be set.
Example: LRU

Key: analyserDB.store.caching.‘ContainerName’.LRU.size
Value: integer, maximum size of the cache
Example: 100
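To make the difference between the two strategies concrete, the sketch below shows one common way to obtain least-recently-used eviction in Java, namely a java.util.LinkedHashMap kept in access order. The class LruCacheSketch is a hypothetical stand-in; this manual does not state how STACK's own LRUCacheMap is implemented. A WeakHashMap, by contrast, has no fixed size: its entries become eligible for removal once their keys are no longer strongly referenced elsewhere.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache, not STACK's LRUCacheMap: a LinkedHashMap in access
// order that drops its eldest entry once the configured size is exceeded.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        super(16, 0.75f, true); // third argument true = order entries by last access
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by put(); returning true evicts the least recently used entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        // The size would correspond to the ...LRU.size property (2 here for brevity).
        LruCacheSketch<String, String> cache = new LruCacheSketch<>(2);
        cache.put("stack-1", "a");
        cache.put("stack-2", "b");
        cache.get("stack-1");      // stack-1 is now the most recently used entry
        cache.put("stack-3", "c"); // exceeds the size, so stack-2 is evicted
        System.out.println(cache.keySet());
    }
}
```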

A.3.3 OPERATION Properties

A.3.3.1 Selects the Algorithm to Identify Stacks
Key: analyserDB.operations.betaHelixAxisFinder.Algorithm
Value: (simpleAngle)
simpleAngle: the stacks are identified based on residue-residue stacking, as described in Section 4.1.1.
Example: simpleAngle

A.3.3.2 Selects the Algorithm to Identify β-Coils and β-Helices
Key: analyserDB.operations.betaHelixFinder.Algorithm
Value: (simple | simpleBackMD)
simpleBackMD (simpleBackMeasurmentDistance): the first β-coil starts with the same stack as in the ‘simple’ algorithm and also ends with a residue from that stack. Each following β-coil starts directly after the residue that completes the previous coil. To prevent overlapping, the algorithm tracks back along the sequence as long as each step decreases the geometrical distance to the start residue.
simple: identifies β-coils based on stack repetition in the sequence (see Section 4.1.2).
Example: simple

A.3.3.3 Selects the Algorithm to Identify the Core Residues of a β-Helix
Key: analyserDB.operations.betaHelixCoreFinder.Algorithm
Value: (simple | gapFilling)
gapFilling: the core consists of all helix residues that are part of a stack and those residues that have a stacking residue in their neighbourhood. The maximum size of a gap is 2.
simple: the core consists of all helix residues that are part of a stack.
Example: gapFilling
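The gapFilling option can be read in more than one way. The sketch below shows one possible reading, stated here as an assumption: a run of at most two unstacked residues becomes part of the core if it is bounded on both sides by stacked residues. The class name, the method name and the boolean-array representation are hypothetical and do not come from the STACK source code.

```java
import java.util.Arrays;

// Hypothetical illustration of gap filling along the helix sequence.
// inStack[i] is true if residue i of the helix is part of a stack.
class CoreGapFillingSketch {

    static boolean[] fillGaps(boolean[] inStack, int maxGap) {
        boolean[] core = inStack.clone();
        int lastStacked = -1; // index of the previous stacked residue, if any
        for (int i = 0; i < inStack.length; i++) {
            if (!inStack[i]) {
                continue;
            }
            int gap = i - lastStacked - 1; // unstacked residues in between
            if (lastStacked >= 0 && gap > 0 && gap <= maxGap) {
                for (int j = lastStacked + 1; j < i; j++) {
                    core[j] = true; // short gap: count these residues as core
                }
            }
            lastStacked = i;
        }
        return core;
    }

    public static void main(String[] args) {
        boolean[] inStack = { true, false, false, true, false, false, false, true, false };
        // The gap of two residues is filled, the gap of three is not, and the
        // trailing residue has no stacked residue on its right-hand side.
        System.out.println(Arrays.toString(fillGaps(inStack, 2)));
    }
}
```

With the ‘simple’ option the core would be the input array itself, without any filling.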

A.3.3.4 Selects the Algorithm to Calculate the Axis of a β-Helix
Key: analyserDB.operations.betaHelixAxisFinder.Algorithm
Value: (stackAxis | simple | stacksCABestFit | completeCoil)
stacksCABestFit: the axis is the line that best fits the Cα atoms of the residues that are in the ‘General’ stacks of the β-helix.
simple: the axis is the line that best fits the centres of gravity of the core residues of the β-coils.
stackAxis: the direction of the β-helix axis is the average of the stack axes. The position of the β-helix axis is calculated by projecting the positions of the stack axes onto a plane that is perpendicular to the direction of the β-helix axis; the position is the centre of gravity of the projected points.
completeCoil: the same as the ‘simple’ algorithm, but it uses all residues of a β-coil for the centre of gravity calculation.
Example: stackAxis
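The stackAxis option can be sketched in a few lines of vector arithmetic. The code below is only an illustration under stated assumptions: the stack axis directions are taken to be consistently oriented (no axis points the opposite way), the projection plane is taken to pass through the origin, and the class and method names are hypothetical. The choice of plane does not change the resulting line, because any plane perpendicular to the direction only shifts the point along the axis.

```java
import java.util.Arrays;

// Hypothetical sketch of the 'stackAxis' idea; vectors are double[3].
class StackAxisSketch {

    static double dot(double[] a, double[] b) {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    // Direction of the helix axis: normalised mean of the unit stack directions.
    static double[] meanDirection(double[][] stackDirections) {
        double[] sum = new double[3];
        for (double[] d : stackDirections) {
            double len = Math.sqrt(dot(d, d));
            for (int k = 0; k < 3; k++) {
                sum[k] += d[k] / len;
            }
        }
        double len = Math.sqrt(dot(sum, sum));
        for (int k = 0; k < 3; k++) {
            sum[k] /= len;
        }
        return sum;
    }

    // Position of the helix axis: centre of gravity of the stack axis positions
    // after projection onto the plane through the origin with unit normal n.
    static double[] projectedCentroid(double[][] stackPositions, double[] n) {
        double[] c = new double[3];
        for (double[] p : stackPositions) {
            double t = dot(p, n); // component of p along the axis direction
            for (int k = 0; k < 3; k++) {
                c[k] += (p[k] - t * n[k]) / stackPositions.length;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] directions = { { 0, 0.1, 1 }, { 0, -0.1, 1 } };
        double[][] positions = { { 1, 0, 5 }, { -1, 2, 7 } };
        double[] axisDirection = meanDirection(directions);
        double[] axisPoint = projectedCentroid(positions, axisDirection);
        System.out.println(Arrays.toString(axisDirection));
        System.out.println(Arrays.toString(axisPoint));
    }
}
```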

A.3.3.5 Selects the Algorithm for Calculating the Pitch
Key: analyserDB.operations.betaCoilPitch.Algorithm
Value: (simple)
simple: the calculation is based on the coils’ centres of gravity. The pitch is defined as the distance between the projections of the centres of gravity of two adjacent coils onto the β-helix axis.
Example: simple
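The definition of the pitch reduces to a single scalar product: the distance between the two projected points equals the absolute value of the component of the difference vector along the axis. The following sketch illustrates this; the class and method names are hypothetical and the coordinates are made-up sample values, not measurements.

```java
// Hypothetical sketch of the pitch definition; vectors are double[3].
class PitchSketch {

    // Distance between the projections of two centres of gravity onto a line
    // with direction axisDirection (which need not be normalised).
    static double pitch(double[] centre1, double[] centre2, double[] axisDirection) {
        double length = 0;
        for (int k = 0; k < 3; k++) {
            length += axisDirection[k] * axisDirection[k];
        }
        length = Math.sqrt(length);
        double along = 0;
        for (int k = 0; k < 3; k++) {
            along += (centre2[k] - centre1[k]) * axisDirection[k] / length;
        }
        return Math.abs(along);
    }

    public static void main(String[] args) {
        double[] coilA = { 1.0, 2.0, 0.0 };
        double[] coilB = { 3.0, -1.0, 4.8 };
        double[] axis = { 0.0, 0.0, 2.0 };
        // Only the separation along the axis counts; the lateral offset is ignored.
        System.out.println(pitch(coilA, coilB, axis));
    }
}
```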

A.3.3.6 Selects the Algorithm for Calculating the Twist
Key: analyserDB.operations.betaCoilTwist.Algorithm
Value: (simple)
simple: the average of the stack-based twist angles.
Example: simple

A.3.3.7 Selects the Algorithm for Calculating the Cross-Sectional Area
Key: analyserDB.operations.betaCoilArea.Algorithm
Value: (simple)
simple: the area calculation is based on a projection of the Cα atoms of the core residues onto a plane perpendicular to the axis. A polygon through the Cα atoms defines the boundary of the area.
Example: simple
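A polygon area of this kind can be computed with the shoelace formula once the atoms have been given two-dimensional coordinates in the projection plane. The sketch below illustrates this under two assumptions that are not taken from the STACK source: the projected polygon does not intersect itself, and the points are supplied in sequence order around the coil. All class and method names are hypothetical.

```java
// Hypothetical sketch: project 3D points onto the plane perpendicular to the
// helix axis and compute the area of the polygon through the projected points.
class CoilAreaSketch {

    static double dot(double[] a, double[] b) {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    static double[] cross(double[] a, double[] b) {
        return new double[] {
            a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0] };
    }

    static double[] unit(double[] a) {
        double len = Math.sqrt(dot(a, a));
        return new double[] { a[0] / len, a[1] / len, a[2] / len };
    }

    // 2D coordinates of the points in an orthonormal basis (u, v) of the plane.
    static double[][] projectToPlane(double[][] points, double[] axisDirection) {
        double[] n = unit(axisDirection);
        // Any helper vector that is not parallel to n will do.
        double[] helper = Math.abs(n[0]) < 0.9 ? new double[] { 1, 0, 0 } : new double[] { 0, 1, 0 };
        double[] u = unit(cross(n, helper));
        double[] v = cross(n, u);
        double[][] xy = new double[points.length][];
        for (int i = 0; i < points.length; i++) {
            xy[i] = new double[] { dot(points[i], u), dot(points[i], v) };
        }
        return xy;
    }

    // Shoelace formula; the polygon is closed from the last vertex to the first.
    static double polygonArea(double[][] xy) {
        double twiceArea = 0;
        for (int i = 0; i < xy.length; i++) {
            double[] a = xy[i];
            double[] b = xy[(i + 1) % xy.length];
            twiceArea += a[0] * b[1] - b[0] * a[1];
        }
        return Math.abs(twiceArea) / 2.0;
    }

    public static void main(String[] args) {
        // A 2 x 2 square whose corners lie at different heights along the axis.
        double[][] atoms = { { 0, 0, 1 }, { 2, 0, 5 }, { 2, 2, -3 }, { 0, 2, 0 } };
        double[] axis = { 0, 0, 3 };
        System.out.println(polygonArea(projectToPlane(atoms, axis)));
    }
}
```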

A.3.3.8 Selects the Algorithm for Estimating the Shape of a β-coil
Key: analyserDB.operations.coilShape.Algorithm
Value: (FFT)
FFT: uses a fast Fourier transform to calculate the Fourier coefficients.
Example: FFT

Key: analyserDB.operations.coilShape.FFT.R
Value: integer, the exponent R that sets the number of sampling points n = 2^R
Example: 10

Key: analyserDB.operations.coilShape.FFT.store
Value: integer, the database stores the coefficients with indices ≤ ‘store’
Example: 15
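The relation between R, the number of sampling points and the stored coefficients can be illustrated with a compact radix-2 Cooley-Tukey transform. The sketch below is not the FFT.java of STACK; the class name is hypothetical and the sampled function is an arbitrary test signal, since the way STACK samples the outline of a coil is described in Section 4.2.2 and not repeated here. The radix-2 scheme is the reason the number of samples has to be a power of two.

```java
// Hypothetical sketch of a recursive radix-2 Cooley-Tukey FFT.
class FftSketch {

    // Transforms the complex sequence (re, im) in place.
    // Both arrays must have the same length, which must be a power of two.
    static void fft(double[] re, double[] im) {
        int n = re.length;
        if (n == 1) {
            return;
        }
        if (Integer.bitCount(n) != 1) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        int half = n / 2;
        double[] evenRe = new double[half], evenIm = new double[half];
        double[] oddRe = new double[half], oddIm = new double[half];
        for (int k = 0; k < half; k++) {
            evenRe[k] = re[2 * k];
            evenIm[k] = im[2 * k];
            oddRe[k] = re[2 * k + 1];
            oddIm[k] = im[2 * k + 1];
        }
        fft(evenRe, evenIm);
        fft(oddRe, oddIm);
        for (int k = 0; k < half; k++) {
            // Twiddle factor exp(-2*pi*i*k/n) applied to the odd half.
            double angle = -2 * Math.PI * k / n;
            double wr = Math.cos(angle), wi = Math.sin(angle);
            double tr = wr * oddRe[k] - wi * oddIm[k];
            double ti = wr * oddIm[k] + wi * oddRe[k];
            re[k] = evenRe[k] + tr;
            im[k] = evenIm[k] + ti;
            re[k + half] = evenRe[k] - tr;
            im[k + half] = evenIm[k] - ti;
        }
    }

    public static void main(String[] args) {
        int r = 3;      // plays the role of ...coilShape.FFT.R (10 in the example above)
        int n = 1 << r; // n = 2^R sampling points
        int store = 2;  // plays the role of ...coilShape.FFT.store
        double[] re = new double[n];
        double[] im = new double[n];
        for (int t = 0; t < n; t++) {
            re[t] = 1.0 + Math.cos(2 * Math.PI * t / n); // arbitrary test signal
        }
        fft(re, im);
        for (int k = 0; k <= store; k++) { // keep only the indices <= store
            System.out.println(k + ": " + re[k] + " + " + im[k] + "i");
        }
    }
}
```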

A.4 Starting and Using STACK

The toolkit provides a very simple command-line tool to perform basic tasks. Depending on your operating system, the toolkit is started from the shell with analyser or analyser.bat. The following subsections describe the available commands:

A.4.1 The activate-Command

activate proteinId
    Loads a protein from the database and activates it. This means that the following commands refer to this protein. The proteinId is not the PDB code. Use the list-command to see all possible protein identifiers.

A.4.2 The calculate-Command

calculate axis
    Calculates the axis of the protein’s stacks and β-helices.
calculate area
    Calculates the area of the β-coils.
calculate pitch
    Calculates the pitch of the β-coils.
calculate twist
    Calculates the twist of the β-coils.
calculate shape
    Calculates the shape coefficients of the β-coils.
calculate packing
    Calculates the packing index of the β-coils.

A.4.3 The color-Command

color stacks
    Colours each of the general stacks of a protein differently and defines the stacks as sets in RasMol. To see an effect in the viewer, either the visualize-command must be used or the generated script must be run from RasMol with the source-command.
color stacks type color
    Colours the stacks of the given type in the given colour and defines the stacks as sets in RasMol. All colours in the RasMol colour format are allowed. Possible stack types are aromatic, aliphatic and amidic. To see an effect, either the visualize-command must be used or the generated script must be run from RasMol with the source-command.
color betaCoils
    Colours each of the β-coils of the protein differently and defines the β-coils as sets in RasMol. To see an effect, either the visualize-command must be used or the generated script must be run from RasMol with the source-command.
color core color
    Colours the core residues of the protein in the given colour. All colours in the RasMol colour format are allowed. To see an effect, either the visualize-command must be used or the generated script must be run from RasMol with the source-command.

A.4.4 The deactivate-Command

deactivate
    Deactivates an activated protein. A protein must be deactivated before a new protein can be activated.

A.4.5 The exit-Command

exit
    Closes the STACK command line client.

A.4.6 The help-Command

help
    Shows the help text with all commands.

A.4.7 The identify-Command

identify stacks
    Identifies all stacks in the activated protein.
identify betaElements
    Identifies β-helices and β-coils in the activated protein.
identify core
    Identifies the core residues of the β-helix.

A.4.8 The import-Command

import
    Imports all pdb-files that are listed in the import file of the configuration directory. The import process does not overwrite any data.

A.4.9 The list-Command

list
    Lists all proteins in the database.
list betaCoils
    Lists all β-coils of the activated protein.
list betaCoils property
    Lists the given property of the protein’s β-coils. Possible properties are: residues, area, pitch, twist, packing and shape.
list stacks
    Lists all stacks of the activated protein.
list stacks property
    Lists the given property of the protein’s stacks. Possible properties are: type and axis.

A.4.10 The visualize-Command

visualize
    Starts or updates the RasMol viewer.

Appendix B

Maintenance Manual

B.1 Readership

This manual is addressed to computer scientists and bioinformaticians who want to change or extend the existing code. The aim of the manual is to give a brief overview of the important concepts and tools used. We recommend reading the “User and Installation Manual” (Appendix A) before reading the “Maintenance Manual”.

B.2 Development Environment

B.2.1 Tools and Software Components

The tools used in the development process were:

ant 1.5.1 is a build tool developed by the Apache Jakarta Project.

jdk 1.4.1 is the latest¹ version of the Java development kit developed by Sun².

SapDB 7.3 is an RDBMS developed by SAP³. It was formerly known as Adabas D and has transaction support. We used JDBC⁴ (sapdbc.jar) to connect to the RDB.

Perforce 2002.1 is the version control system that we used. It is available for different operating systems including many UNIX derivatives and the Windows platform.

RasMol is a protein viewer that is available on UNIX and on Windows. We used RasMol to visualise our results.

Jama is a Java library for numerical matrix operations.

B.3 STACK Architecture

In this section we describe the architecture of STACK. First, a general overview of the packages and their dependencies is given; then the most important design concepts are described.

¹ December 2002
² http://java.sun.com
³ http://www.sapdb.org
⁴ Java Database Connectivity

B.3.1 Files and Directory Structure

Several files and directories are located under the root directory of the STACK toolkit (see Section A.3.1). The files are the start scripts for Windows and Unix and the ant build file (build.xml). The src directory contains the Java source files; its directory structure mirrors the package structure. The compilation process is started by invoking ant compile from the toolkit’s root directory and writes the class files to the build directory of the installation. The Java API documentation is generated into the doc directory by typing ant doc at the command line. With ant dist the class files are compressed into a jar file (Java library). All generated files are deleted with ant clean. The configuration files (main.properties, logging.properties and import) are in the conf directory and the log files are in the log directory. We recommend creating a temporary directory for the RasMol command script.

B.3.2 Package Overview

According to the Java packaging scheme, all packages of STACK are subpackages of the se.chalmers.cs.bioinf package⁵. To shorten the notation, unqualified names are used for subpackages in this section (e.g. util instead of se.chalmers.cs.bioinf.util). Figure B.1 shows the UML⁶ package diagram of the STACK toolkit; a brief introduction to the UML is given in [FS00]. The util package contains several useful helper classes but not the geometrical routines, which are in the geom package. The geometrical code is based on a C version provided by the department. Both packages are independent of all other packages, so they can be used as standalone packages in future applications. The store package and its subpackages provide persistence for first-class objects. The store depends on some of the classes of the util package but may be used without the geom package. The STACK application logic is embedded in the structure and operations packages: the structure package contains the data model and the operations package contains the methods that are defined on it. Import and export methods are in the imex package. The class files directly under the se.chalmers.cs.bioinf package are main programs that invoke different operations.

B.3.3 The geom and util Package

Both packages contain helper classes that can be used independently of the other packages of the toolkit. The geom package contains classes for geometrical operations and linear transformations (e.g. Point.java, Vector.java, Line.java, Plane.java). They contain only those geometrical operations that were needed in other parts of the toolkit, so the package cannot be considered a complete geometrical library. The package depends on the Jama library. The util package provides classes that wrap existing data collections with an Iterator interface (e.g. IteratorIterator.java, EmptyIterator.java, WrapperIterator.java, OneIterator.java). The Iterator is a design pattern [GHJV94] that provides simplified access to collections. Other classes in the package implement a very simple complex arithmetic (Complex.java), which is needed for the FFT algorithm (FFT.java, Function.java). The implementation is based on the FFT algorithm by Cooley and Tukey, taken from [AO97].
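To give an idea of what such a wrapper class can look like, the sketch below chains several iterators into one. It is a guess at the kind of behaviour a class named IteratorIterator might offer; the class ChainedIteratorSketch is hypothetical, and the actual util classes may differ in both interface and behaviour.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch, not STACK's IteratorIterator: presents a sequence of
// iterators as one continuous iterator, skipping empty ones.
class ChainedIteratorSketch<T> implements Iterator<T> {
    private final Iterator<Iterator<T>> outer;
    private Iterator<T> current = null;

    ChainedIteratorSketch(Iterator<Iterator<T>> outer) {
        this.outer = outer;
    }

    @Override
    public boolean hasNext() {
        // Advance to the next inner iterator that still has elements.
        while ((current == null || !current.hasNext()) && outer.hasNext()) {
            current = outer.next();
        }
        return current != null && current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    public static void main(String[] args) {
        List<Iterator<String>> parts = Arrays.asList(
            Arrays.asList("stack-1", "stack-2").iterator(),
            Arrays.asList("stack-3").iterator());
        Iterator<String> all = new ChainedIteratorSketch<>(parts.iterator());
        while (all.hasNext()) {
            System.out.println(all.next());
        }
    }
}
```

A caller sees a single Iterator and does not need to know how many underlying collections contribute elements.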

⁵ The package names in Java are derived from the developer’s domain name. This source code convention was made to ensure worldwide unique namespaces.
⁶ Unified Modeling Language

[Package diagram. Packages shown under se.chalmers.cs.bioinf: util, imex, geom, operations, store and structure, together with subpackages labelled interfaces, rdb, sapDB and rambased. Notes in the diagram: “All other packages depend on the util package.” and “Depends also on JAMA.”]

Figure B.1: STACK package diagram

B.3.4 The store Package

Because we use a relational database to store geometrical and analytical data, we need methods that store the data in the RDB and retrieve it again. We separated this functionality, including the possibility to write queries, from the classes that represent the application logic of STACK, that is, from the classes in the structure and operations packages. The software architecture of this package is complex and involves several design patterns and abstraction layers. A detailed explanation would be far too long for this document, but we try to give a good starting point for further investigation of the source code. A UML class diagram of the store is displayed in Figure B.2. In the upper left corner is the Service class, the linchpin of the toolkit. This Singleton [GHJV94] encapsulates all basic services in the toolkit. The instantiation of the Service class therefore triggers the instantiation of the store and of those classes that represent collections of β-helix structural elements. For clarity only the Stacks class is drawn, but it can be replaced with other containers from the structure package (e.g. Atoms, Residues, ShapeCoefficients, BetaCoils). Stacks is a subclass of the PersistentObjectContainer class and extends the superclass with methods that retrieve a collection of Stack objects fulfilling certain conditions (e.g. getStacksOfProtein(String id) returns an Iterator over all stacks that are part of the protein with the given id). In principle, the implementation of these methods passes a constructed Query object to the corresponding EntityContainer of this PersistentObjectContainer. Figure B.3 shows a UML class diagram of the query classes from which complex queries are built. An example of such a query is presented in Figure B.4. Depending on the implementation of the EntityContainer, the passed query is either transformed into an SQL statement and sent to the RDB or performed internally in a RambasedEntityContainer.
The same levels of abstraction can be found in the corresponding elements of the different collections. In our example these are the classes Stack, PersistentObject and Entity. The “Abstract...” classes in the diagram contain general methods that are used in all of the more specific implementations.
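The idea of building a query as a tree of objects and letting the container decide how to execute it can be sketched with a small composite and a visitor. The classes below are simplified stand-ins modelled loosely on the names in Figure B.3; they are assumptions for illustration, not the query classes of STACK, and the generated SQL fragment is only one conceivable output format.

```java
// Hypothetical, much reduced sketch of composable query objects.
class QuerySketch {

    interface Visitor {
        void binary(String operator, Node left, Node right);
        void attribute(String alias, String name);
        void value(Object constant);
    }

    abstract static class Node {
        abstract void visit(Visitor v);
    }

    // Stands in for operators such as AND, OR, =, <=.
    static final class Binary extends Node {
        final String operator;
        final Node left, right;
        Binary(String operator, Node left, Node right) {
            this.operator = operator;
            this.left = left;
            this.right = right;
        }
        @Override void visit(Visitor v) { v.binary(operator, left, right); }
    }

    // A column of a container, addressed through an alias.
    static final class Attribute extends Node {
        final String alias, name;
        Attribute(String alias, String name) { this.alias = alias; this.name = name; }
        @Override void visit(Visitor v) { v.attribute(alias, name); }
    }

    static final class Value extends Node {
        final Object constant;
        Value(Object constant) { this.constant = constant; }
        @Override void visit(Visitor v) { v.value(constant); }
    }

    // One possible visitor: renders the tree as an SQL condition.
    static final class SqlVisitor implements Visitor {
        final StringBuilder sql = new StringBuilder();
        @Override public void binary(String operator, Node left, Node right) {
            sql.append('(');
            left.visit(this);
            sql.append(' ').append(operator).append(' ');
            right.visit(this);
            sql.append(')');
        }
        @Override public void attribute(String alias, String name) {
            sql.append(alias).append('.').append(name);
        }
        @Override public void value(Object constant) {
            // Illustration only: real code should bind values as statement parameters.
            sql.append(constant instanceof String ? "'" + constant + "'" : String.valueOf(constant));
        }
    }

    static String toSql(Node query) {
        SqlVisitor v = new SqlVisitor();
        query.visit(v);
        return v.sql.toString();
    }

    public static void main(String[] args) {
        Node query = new Binary("AND",
            new Binary("=", new Attribute("sa", "Stack"), new Value("S1")),
            new Binary("<=", new Attribute("r", "Position"), new Value(42)));
        System.out.println(toSql(query));
    }
}
```

A RAM-based container could traverse the same tree with a different visitor that evaluates the condition against objects in memory, which is what makes the separation between query objects and their execution useful.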

B.3.5 The structure Package

The structure package models the application-specific data. The classes presented here were identified by a literature review [JP01][HMS+98][JMP98][CBM+02] and by our own inspection of β-helical proteins. Our data model keeps the native hierarchy of the levels of protein structure. The smallest element of a protein is a single atom. Atoms together build residues, which are connected to each other and form long chains. Finally, the chains compose the protein. These entities are mapped directly to classes. By inspecting the β-helical structure we identified sets of residues that are arranged linearly in space. We call such a set a stack. We distinguish between different kinds of stacks based on restrictions on the residue type (e.g. one kind of stack allows all types of residues and another only aromatic residues). Therefore, several stacks can be assigned to one residue. Because our store is based on a relational model, we need an association class that represents the assignment of a stack to a residue. In β-helical proteins, chains can fold into a helix consisting of several coils. To distinguish them from other terms in the literature we use the terms β-helix and β-coil. To store the Fourier coefficients estimated in the shape approximation we added a further class. The classes described in this section are visualised in Figure B.5.

[Class diagram. Classes shown: Service, PersistentObjectContainer, Stacks, PersistentObject and Stack; AbstractEntityStore, EntityStore and RDBEntityStore; AbstractEntityContainer, EntityContainer and RDBEntityContainer; AbstractEntity, Entity, RDBEntity and EntityDefinition. The associations and multiplicities of the original figure are not reproduced here.]

Figure B.2: UML class diagram of the store

[Class diagram. Classes shown: Query with the operation visit(qv: QueryVisitor); BinaryQuery; QueryAnd, QueryOr, QueryEquals, QueryNot, QueryLess, QueryLessEquals, ...; and QueryElement with the attributes attributeType: int, value: Object, conainer: EntityContainer, attribute: String and alias: String.]

Figure B.3: Query related class diagram

[Object diagram. A tree of QueryAnd objects combines QueryEquals, QueryLessEquals and QueryGreaterEquals objects. Their leaves are QueryElement objects that refer to the containers sa, bc and r through the attributes 'Residue', 'Stack', 'ID', 'EndPosition', 'StartPosition' and 'Position', and to the values stackId and betaCoilId.]

Figure B.4: Composition of query objects

[Class diagram. Classes shown: Protein, Chain, BetaHelix (+ startPosition: int, + endPosition: int), Residue (+ position: int), BetaCoil (+ startPosition: int, + endPosition: int), StackAssignment, Stack, Atom and ShapeCoefficient. The multiplicities of the original figure are not reproduced here.]

Figure B.5: Class hierarchy modelling β-helical proteins

[Class diagram with source code excerpt.]

BetaHelixAxisFinder
    static getInstance(): BetaHelixAxisFinder
    calculateAxis(String betaHelixId): void

Concrete classes shown: BetaHelixAxisFinderStacksCABestFit, BetaHelixAxisFinderCompleteCoil, BetaHelixAxisFinderStackAxis, BetaHelixAxisFinderSimple

    public static BetaHelixAxisFinder getInstance() {
        Properties p = Service.getProperties();
        String name = p.getProperty("AxisFinder", "simple");
        if (name.equals("simple")) {
            return new BetaHelixAxisFinderSimple();
        } else if (name.equals("StackAxis")) {
            return ...
        } else if ( ...) {
            return ...
        }
    }

Figure B.6: Using a factory method to create an operation

B.3.6 The operations Package

We put all classes that contain either algorithms to identify structural elements of β-helical proteins or algorithms to calculate features in the operations package. Each approach to identifying a structural element or calculating a geometrical feature is placed in a separate class. To make the toolkit easily extendable and configurable, we decided to use a Factory Method combined with a Singleton [GHJV94] to instantiate the different versions of the algorithms. The combination of both patterns and part of the source code are illustrated in Figure B.6.
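A self-contained version of this pattern might look as follows. The sketch is an assumption-laden illustration rather than the code of STACK: the class AxisFinderSketch and its nested classes are hypothetical, the properties are passed in as a parameter instead of being obtained from the Service singleton, only two of the four documented algorithm names are covered, and the property key is the one listed in Section A.3.3.4.

```java
import java.util.Properties;

// Hypothetical sketch of a configurable factory method for axis algorithms.
abstract class AxisFinderSketch {
    static final String KEY = "analyserDB.operations.betaHelixAxisFinder.Algorithm";

    // Stands in for the actual calculation, which is not reproduced here.
    abstract String describe();

    // Factory method: the configuration decides which implementation is created.
    static AxisFinderSketch getInstance(Properties configuration) {
        String name = configuration.getProperty(KEY, "simple");
        switch (name) {
            case "simple":
                return new Simple();
            case "stackAxis":
                return new StackAxis();
            default:
                throw new IllegalArgumentException("Unknown axis algorithm: " + name);
        }
    }

    static final class Simple extends AxisFinderSketch {
        @Override String describe() {
            return "best-fit line through the centres of gravity of the coils";
        }
    }

    static final class StackAxis extends AxisFinderSketch {
        @Override String describe() {
            return "average of the stack axes";
        }
    }

    public static void main(String[] args) {
        Properties configuration = new Properties();
        configuration.setProperty(KEY, "stackAxis");
        System.out.println(getInstance(configuration).describe());
    }
}
```

Callers depend only on the abstract class, so a new algorithm can be added by writing one subclass and extending the factory method, without touching the code that requests an axis calculation.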

Bibliography

[AO94] R. Ansorge and H. J. Oberle. Mathematik für Ingenieure, volume 2 of Differential- und Integralrechnung mehrerer Variabler, Gewöhnliche Differentialgleichungen, Partielle Differentialgleichungen, Integraltransformationen, Funktionen einer komplexen Variablen. Akademie Verlag, first edition, 1994.

[AO97] R. Ansorge and H. J. Oberle. Mathematik für Ingenieure, volume 1 of Lineare Algebra und analytische Geometrie, Differential- und Integralrechnung einer Variablen. Akademie Verlag, second edition, 1997.

[BMT98] A. Bateman, A.G. Murzin, and S.A. Teichmann. Structure and distribution of pentapeptide repeats in bacteria. Protein Sci., 7:1477–1480, 1998.

[BPD+99] K. Brown, F. Pompeo, S. Dixon, D. Mengin-Lecreulx, C. Cambillau, and Y. Bourne. Crystal structure of the bifunctional N-acetylglucosamine 1-phosphate uridyltransferase from Escherichia coli: a paradigm for the related pyrophosphorylase superfamily. EMBO J., 18:4096–4107, 1999.

[BT98] C. Branden and J. Tooze. Introduction to Protein Structure. Garland Publishing, 19 Union Square West, New York, second edition, 1998.

[BWF+00] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, and P.E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28:235–242, 2000.

[CBM+02] L. Cowen, P. Bradley, M. Menke, J. King, and B. Berger. Predicting the beta-helix fold from protein sequence data. Journal of Computational Biology, 9(2):261–276, 2002.

[FS00] M. Fowler and K. Scott. UML Distilled: a brief guide to the standard object modelling language. Addison-Wesley, second edition, 2000.

[GHJV94] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1994.

[GKG+00] S.P. Graether, M.J. Kuiper, S.M. Gagne, V.K. Walker, Z. Jia, B.D. Sykes, and P.L. Davies. Beta-helix structure and ice-binding properties of a hyperactive antifreeze protein from an insect. Nature, 406:325–328, July 2000.

[HMS+98] S. Heffron, G.R. Moe, V. Sieber, J. Mengaud, P. Cossart, J. Vitali, and F. Jurnak. Sequence Profile of the Parallel β Helix in the Pectate Lyase Superfamily. Journal of Structural Biology, 122(1-2):223–235, 1998.

[JMP98] J. Jenkins, O. Mayans, and R. Pickersgill. Structure and evolution of parallel beta-helix proteins. Journal of Structural Biology, 122(1-2):236–246, 1998.

[JP01] J. Jenkins and R. Pickersgill. The architecture of parallel beta-helices and related folds. Progress in Biophysics and Molecular Biology, 77(2):111–175, October 2001.

[KS83] W. Kabsch and C. Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12):2577–2637, December 1983.

[KSA+96] C. Kisker, H. Schindelin, B.E. Alber, J.G. Ferry, and D.C. Rees. A left-handed beta-helix revealed by the crystal structure of a carbonic anhydrase from the archaeon Methanosarcina thermophila. EMBO J., 15:2323–2330, 1996.

[LTDJ00] Y.-C. Liou, A. Tocilj, P.L. Davies, and Z. Jia. Mimicry of ice structure by surface hydroxyls and water of a β-helix antifreeze protein. Nature, 406:322–324, 2000.

[NI92] NC-IUBMB. Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzyme-Catalysed Reactions. In Enzyme Nomenclature. Academic Press, 1992.

[PFBL02] M. F. Perutz, J.T. Finch, J. Berriman, and A. Lesk. Amyloid fibers are water-filled nanotubes. Proc. Natl. Acad. Sci. USA, 99(8):5591–5595, April 2002.

[PLB+00] F.M.G. Pearl, D. Lee, J.E. Bray, I. Sillitoe, A.E. Todd, A.P. Harrison, J.M. Thornton, and C.A. Orengo. Assigning genomic sequences to CATH. Nucleic Acids Research, 28:277–282, 2000.

[Ric76] J.S. Richardson. Handedness of crossover connections in beta-sheets. Proc. Natl. Acad. Sci. USA, 73:2619–2623, 1976.

[RR95] C.R. Raetz and S.L. Roderick. A left-handed parallel beta helix in the structure of UDP-N-acetylglucosamine acyltransferase. Science, 270:997–1000, 1995.

[WMG+02] H. Wille, M.D. Michelitsch, V. Guenebaut, S. Supattapone, A. Serban, F.E. Cohen, D.A. Agard, and S.B. Prusiner. Structural studies of the scrapie prion protein by electron crystallography. Proc. Natl. Acad. Sci. USA, 99(6):3563–3568, 2002.

[YLJ93] M.D. Yoder, S.E. Lietzke, and F. Jurnak. Unusual structural features in the parallel beta-helix in pectate lyases. Structure, 1(4):241–251, December 1993.

[YNJ93] M.D. Yoder, N.T. Keen, and F. Jurnak. New domain motif: the structure of pectate lyase C, a secreted plant virulence factor. Science, 260:1503–1507, June 1993.

[Zam72] A. A. Zamyatnin. Protein volume in solution. Progress in Biophysics and Molecular Biology, 24:107–123, 1972.