Variability in type I helical pitch reflected in D

periodic fibrillar structure.

G. J. Cameron1, D. E. Cairns1 & T. J. Wess2

1 Department of Computing Science and Mathematics, University of Stirling, Stirling, FK9 4LA. 2 Structural Biophysics Group, School of Optometry and Vision Science, Cardiff University, Redwood Building, King Edward VII Avenue, Cardiff, CF11 2NB.

Short Title : Variability in type I collagen helical pitch.

Corresponding Author : D. E. Cairns, e-mail : [email protected], Fax : (+44) 1786 464551

1 Summary

The variability in axial rise per residue of the collagen was investigated as a possible factor missing in many structural models of fibrillar collagen to date. The feasibility of this approach is supported by previous evidence from axial structures determined by electron microscopy and X-ray diffraction as well as studies of the local sequence dependent conformation of the collagen helix. Here the sequence dependent variation of the axial rise per residue was used to improve the fit between model structures of the axially projected microfibrillar structure with the observed X-ray diffraction pattern from hydrated rat tail tendon. Models were adjusted using a Genetic Algorithm approach that allowed a wide range of structures to be tested efficiently. The results show that variation of the axial rise per residue could reduce the difference measure between model and data by up to 50%, indicating that such a variable is a necessary part of fibril structure. The variation in amino acid translation was also found to be influenced by the number of and residues in the structure.

Key Words : Type I Collagen, Genetic algorithm, fibril, structure, X-ray.

2 Introduction

The triple helical polyproline (PP) II helix is regarded as a rigid feature in protein secondary structure where the axial rise per amino acid residue is highly restricted.

The classical description of the collagen helix is as a PP-II helix where the amino acid sequence Gly-Pro-Hyp predominates. Collagen gene products are however more diverse in their sequence and Gly-Pro-Hyp only accounts for approximately 10% of collagen sequences. The sequence variability points to the local requirements of the collagen molecule that will lead to local deviations of collagen helical structure that may be required in satisfying the topological, crosslinking and dynamic properties that feature in supramolecular collagen structures.

The majority of collagen in animals is in a fibrillar form where the collagen triple helices interact with neighbouring molecules to form a structure that links the molecular strength of the triple helix with the mesoscopic properties of a fibril and thence the tissue. The axis of each collagen fibril exhibits long-range order, with each of the approximately 300nm long molecules being staggered by 67nm (D) with respect to its neighbour. This corresponds to approximately 234.2 +/- 0.5 amino acids per D period for type I collagen[1]. Due to the non integer relationship between molecular length and the D period, each D periodic projection of the axial structure repeat consists of a gap region and an overlap region which occur in the ratio of

0.54:0.46 D (for a type I collagen fibril)[2]. The ends of the triple helical region are defined by the telopeptide which do not conform to the repeating Gly – Xaa – Yaa structure of the helical region and are responsible for the formation of inter molecular cross-links[3].

Determination of fibrillar collagen structure has to date relied mostly on X-ray diffraction and electron microscopic evidence. In both cases the broad structural

3 features can be accounted for using models where the amino acids sequence is translated into electron density by placing the residues at positions dictated by the formal collagen helical co-ordinates of either a 310 or 72 helix, both of which have a very similar axial translation per residue. Studies that seek to elucidate structures at higher resolution fail to maintain the correlation between the predicted position of amino acids and the electron density profile, this indicates regions of possible rarefaction and compaction of electron density in the fibrillar structure, rather than the expected constant 0.286nm per residue. The electron density profile and molecular packing interpretation that was produced by Orgel et al [4], confirmed the general approach of model building, however the study only obtained an optimal fit of the electron density by allowing a fold in the telopeptide region and utilising a non- uniform distribution of amino acid scattering objects, a feature which was also noted by the electron microscopy studies of Chapman et al. [5].

In this study, we have further investigated the modelling of non-uniform axial translation of amino-acids in the collagen helix through the utilisation of a Genetic

Algorithm (GA) approach [6]. The aim of this study was to evolve a sequence based structure that obtains the best fit to the observed meridional X-ray diffraction data

(figure 1). Three structural elements of the model were investigated – the telopeptide inter-residue spacing, folding of the telopeptide and the helical inter-residue spacing.

4 Results

The initial model structure under investigation fixed the distances between amino acids within the helical region of the molecule at 0.286nm, however the telopeptide regions at either end of the helix were allowed to evolve. In particular, the distances between residues were free to change and random folding positions within the telopeptides were also encouraged as a previous study by Orgel et al. [7] indicated the possibility of a fold in the C-terminal telopeptide. The resultant set of model intensities scored over 30 orders, gave an R-factor value of 0.541 when compared to the observed intensities and did not show a significant improvement upon previous studies.

It is recognised that the distances between amino acids within the helical part of the collagen molecule may not be uniform[5]. This can be quantified by measuring the breadth of the meridional helical diffraction peaks corresponding to the axial rise per residue in the X-ray diffraction images, the breadth of this peak defines the limits feasible for the distribution of axial rises per residue that could be used within the

Genetic Algorithm. For the models under investigation, the measured minimum and maximum spacing were 0.270nm and 0.304nm. A further criteria for the helical region of the collagen molecule was that the average inter residue spacing should be equal to 0.286nm and that a distribution plot of inter residue distances should closely follow a trace of the helical diffraction peak that led to the defined limits.

The evolution of the structure of the extrahelical telopeptides was less restricted than the helical region of the molecule due to the relative lack of information about these regions of the collagen molecule. It was decided to allow any inter residue spacing for the amino acids of the telopeptide up to a distance of 0.38nm, the maximum possible size of a peptide bond[8], however it is believed that these

5 regions are compressed with an average inter residue spacing of approximately

0.2nm[9].

Table 1 indicates the recorded values of the first 30 orders of observed intensity as well as the values for the telopeptide model tested previously. The remainder of the table records the intensity values generated for the best fit models scored over 30 orders of intensity using two separate methods to evolve the inter- residue spacing – these have been designated as the A3 and A6 methods. The A3 method incorporated the effects of 3 amino acids in a triplet to determine spacing. The

A6 method used 6 amino acids (3 from the triplet before the inter-residue step and 3 after it).

The A6 method gives the best fit to the observed data when scored over the first 30 orders of intensity with an R-factor value of 0.173. The A3 method of evolving the inter-residue spaces produced models with R-factor values for 30 orders of 0.247. Although this fit is weaker than for the A6 approach, the A3 model was deemed to be more biologically feasible as it was able to reach a better match for the structural criteria that had been stipulated within the software. Figures 2 and 3 show the fit of the A3 and A6 models scored over 30 orders of meridional intensity.

The biological structures of these models must be examined against the previously defined criteria in order to determine whether they are feasible. The length of each molecule was of interest as collagen molecules are known to be approximately 300 nm in length[10]. Table 1 contains the length of each model that contributed to the best fit of the data for each method of optimising the inter-residue distance. The length of the modelled collagen molecule is dependant upon the distances between each of the triplets in the helical region and between amino acids in the telopeptides with the model with the lowest R factor (A6 over 30 orders of

6 intensity) producing a length of 302.914nm. All of the models with low R factors recorded lengths which could are within the tolerances one would expect from a collagen molecule.

Figure 4 demonstrates a simulation of the electron density profile of the model which gives the best fit to the observed data (A6-30). The profile was obtained by calculating a scattering factor for each amino acid utilising the method reported by

Hulmes et al. [11] and using amino acid side chain volumes calculated by Tsai et al.

[12]. The scattering factors for each amino acid were assigned a fractional co-ordinate depending upon their location within the repeating unit and these factors were summed in order to achieve a profile for the complete repeating unit.

The simulated electron density profile shows the gap and overlap regions which can be identified by the relative heights of the peaks, the overlap region has higher peaks and is found at the left hand end of the graph. The changeover to the gap region occurs at 0.46 as stipulated in the structural parameters of the modelling process.

Figure 5 entails a representation of how the inter-residue spacing changes along the length of the helical region within a model D-period. This was constructed by taking the model molecule and dividing it into four sections of 67nm and the rest of the molecule. The average distance for each location through the model was calculated and is shown as a function of location along the D-periodic structure in figure 5. It can be seen that there are regions of compression, indicated by low average spacings, and rarefaction, indicated by high average spacings. This figure also implies that the regions of compression and rarefaction in each of the five sections of the molecule are aligned when the model is built using the Hodge-Petruska model as a template [13].

7 In order to use a GA to evolve potential solutions that best fit observed data, each of the 20 amino acids that appear in the sequence of type I collagen were assigned a factor that would influence the inter-residue spacing. Each amino acid factor in a triplet of amino acids then contributes to the inter-residue spacing, which in makes up the total length of the molecule. In order for the model to be structurally feasible these distances must fall within the limits defined by the width of the 0.286nm reflection of the X-ray diffraction pattern of rat tail tendon. Figure 6 shows the distribution of distances between triplets in the helical region for the best fit models generated by the A3 and A6 methods over 30 orders. It can be seen from the figure that these limits are not strictly adhered to in the models with low R factors.

This can be confirmed by further examination of figure 5. The limits of spacings between amino acids have been calculated as 0.270nm and 0.304nm, however it can be seen that, especially in the A6 model, a number of points fall outside of these limits. It is our interpretation that this occurs due to the representation of the collagen molecule as a one dimensional ‘flattened’ set of aligned triplets rather than a true three dimensional model. Modelling the collagen molecule in this form will have reduced the structural information available to the Genetic Algorithm when determining a proper fit. Table 2 provides a list of the factors which resulted from the final models of both the A3 and the A6 methods.

In the A3 model, 56.9% of all inter-residue distances fall between the set limits with 16.5% being smaller and 26.6% being larger. The value within the limits increases to 93.2% if the maximum and minimum limits are relaxed by +/- 5%, indicating that many of these spacings are only marginally out with the defined limits.

The results for the A6 average models are less promising with 42.8% within the limits when the model is scored over 30 orders. Again this value improves if the limits are

8 relaxed, increasing to 46.4%. These results suggest that although the fit between the intensities of the models and the recorded values are closer when using the A6 method, the A3 method is better suited to obtaining a model which is adheres to the structural parameters defined by the behaviour of collagen in vivo.

The average inter-residue distances throughout the helical region of the model molecule are 0.288 for the A3 method scored over 30 orders. For the A6 method scored over 30 orders, the value is 0.289, indicating larger inter-residue spacing than in the A3 method.

By taking all of these factors into consideration, it can be stated that the A3 model when scored over 30 orders of intensity, is the model that is most representative of the structure of collagen while still achieving a low R factor of

0.247. This is an improvement over previous results where models of this type have been constructed [11].

9 Discussion

The models presented here give for the first time a quantitative evaluation of the deviation of the axial rise per residue of the collagen helix in situ within a fibril.

The broadening of the 0.286nm meridional diffraction peak may in part be due to the variation of the length of helical rise per residue and evidence from electron microscopy [REFS] and studies of collagen elasticity and stability [REFS] have eluded to regions of the collagen helix that my have a different structural state where the polyproline II helical structure is an average consensus structure.

The telopeptide regions were of interest since previous studies have found their conformation to be essential when fitting the model to the data [7,9]. The algorithms used to manipulate the telopeptides in this study were different from those used in the helical part of the molecule since the amino acids in the telopeptide region are not constrained by the helical structure and are therefore able to vary independently of the neighbouring chains. Each telopeptide of each alpha chain was treated separately with the distances between them only being limited by the constraints of the peptide bond and folding of the telopeptide chains was also permitted.

The telopeptide structures which are present in the best fit models (shown in figures 2 and 3) have similar characteristics which include a fold in the C-telopeptide of at least one of the α1(I) chains and an average spacing which tends to be less than the 0.286 nm found in the helical region of the molecule, as also predicted by Hulmes et al. [9]. The model which gave the best fit to the observed intensities, the A6 method scored over 30 orders, has a telopeptide structure which involves the C-terminal of one α1(I) chain folding at proline residue number 1039 (residue number 12 within the telopeptide). This is similar to the telopeptide structure reported by Orgel et al. [7].

10 However, the model that gives the best fit to the data also includes a fold in the N- terminal of the other α1(I) chain at the point where the telopeptide meets the helical region. This is probably not structurally feasible due to steric hindrances caused by an unfeasible packing density with many chains trying to fit between molecules within the microfibrillar structure. However such increased packing densities have been proposed by Orgel et al. [4], for type I collagen and Ortolani et al. [14], for type II collagen.

Further models were generated where the folding of the N-terminal telopeptide was restricted, the C-terminal telopeptide was permitted to evolve into any conformation. This model, using the A6 method of evolution over 30 orders of intensity, gave an R-factor value of 0.2932. However, the inter-residue spacings within the helical region of the molecule were either very large (in the region of 0.36 nm) or very small (0.15 nm) leading to the view that this particular model is not compatible with biochemical studies and the constraints of peptide bond length.

Consideration must be given to how the amino acid sequence accounts for the changes in distance between the residues. Figure 5 shows a graph which shows the average distance between amino acids along the length of a simulated D-periodic structure. It should be noted that the scale used in figure 7 is the same as that in figure

5, and the line depicting the average spacing between amino acids is also the same as that in figure 5.

In order to examine how the amino acid sequence of the collagen molecule affects the A3 model a number of groupings of amino acids were derived. The groupings, charged residues (this group was further split into three separate groups consisting of positively charged residues, negatively charged residues and a sum of these two groups), hydrophobic residues and a grouping of proline and

11 hydroxyproline residues, were each placed alongside the graph of average spacings and a correlation function between them was calculated.

The groups of amino acids were calculated by summing how many of each type of residue appeared in the triplets throughout the D-periodic repeat. These numbers were then averaged so that the final value represents the number of the residues per amino acid triplet. These values were then used to examine the correlation with the average distance between amino acid triplets along the length of the D-period using equation 1.

Cov(X ,Y ) " x, y = (Eq 1) ! x! y

Where ρx,y is a value between -1 and 1 where 1 is a positive correlation and -1 is a negative correlation and 0 is no correlation. This equation calculates the correlation function by dividing the covariance of x and y by the square root of the variance of x multiplied by the square root of the variance of y. The correlation values are recorded in table 3.

These values show that there is some correlation between the locations of the proline and hydroxyproline residues and the distance between amino acids triplets from the A3 model. There is also a negative correlation between the positions of the positively charged amino acids and the distances between the triplets along the length of the D-periodic repeating structure. Figure 7 shows the distance data alongside the hydroxyproline / proline data as described above. These weak correlations lead us to speculate that local variations may be induced in the process of fibrillogenesis, where in order to optimise interactions between residues within a fibril there are local adjustments in the axial rise per residue.

As part of this study, it was important to examine the size of the solution space in which the model structures exist. If the space was sufficiently large, any number of

12 models could exist that fit the data yet have no biological merit and could be inconsistent with each other. The most straightforward model that could have been developed would specify a different spacing for each triplet pattern in the model.

There are 310 triplet patterns in the sequence, which therefore provide 310 dimensions for a model of this nature. In this study we constrain variation to the 20 amino acids from which the triplets are composed and therefore reducing the available space of models from 310 to 20 dimensions. In the case of the A6 model, there are

774 distinct triplet to triplet combinations which therefore provides an even greater potential dimensionality but again, because of the limit set by only modifying the 20 amino acids, the model dimensionality remains set at 20. Given the potentially large model dimensionality, the approach taken in this study produces a model specificity which is very low and therefore a structure which is highly constrained.

In order to improve the overall fit of these models it may be prudent to make a fundamental change in the approach and treat each of the three alpha chains within the molecules on their own instead of using a triplet composed of one amino acid from each of the chains. This change would need further restrictions to ensure the three dimensional structure of the triple helix is maintained.

13 Conclusions

The results reported here show that utilising a Genetic Algorithm to evolve a variable inter-residue spacing and associated collagen triple helix structure improves significantly on previous results. The inclusion of a variable distance between helical amino acids is significant as this parameter produces a model with a better fit to the data than a structure with a constant inter-residue spacing whilst maintaining biological feasibility.

Investigation of variations on the type of model employed has shown that models have to be closely scrutinised, since it is possible to generate some models which give a good fit to the diffraction data but do not fit other biological criteria. In order to extend this work to incorporate further meridional reflections, we intend to develop a more accurate three dimensional collagen structure which could be used to investigate how different types of collagen molecule interact at the level of amino acids within the helix (i.e. type I collagen and type III collagen as seen in heterotypic structures such as skin).

14 Materials and Methods

Sample Preparation and Environment: Tails were removed from rats immediately after death and were frozen at -40°C. The tendons, once removed from the tails, were kept hydrated by immersion in a phosphate buffered saline (PBS; pH

7.4) solution immediately prior to use. X-ray diffraction data were collected from samples maintained in a water tight sample cell which contained a small reservoir of

PBS solution that was used to maintain adequate hydration.

Data Collection and Analysis: Low angle diffraction patterns were required to obtain the first 30 reflections sampling the meridian (figure 1). These low angle images were obtained at Beamline ID02, ESRF, France with a sample to detector distance of 3000 mm whilst high angle images were obtained at Station 14.1, SRS,

Daresbury, UK, with a sample to detector distance of 100mm. Data were collected electronically on X-ray sensitive detectors and a data file was automatically generated with each of the detector systems. Diffraction data was recorded as a two dimensional image and intensity was represented visually by colour depth. Spatial detector shape corrections were carried out automatically by the station software as a part of the process which generated the diffraction image file.

In order to prevent large scale corrections to the data obtained by the diffraction techniques only samples which were tilted in the plane of the X-ray beam by less than two degrees were considered. The angle of sample tilt was measured using the XFIX software developed by the CCP13 group.

The FIT2D software suite, developed by A. Hammersley, ESRF [15] was used in conjunction with the XFIT program (CCP13), to obtain values for the integration of the intensity in each of the meridional reflections. This recorded intensity value is

15 then corrected to take into account the lack of intersection with the Ewald Sphere.

The integral of the peak is directly related to the intensity of the Bragg reflection on the meridian, this relationship allows a direct comparison between the recorded and simulated data. In addition to these data, the validity and limits of a model can be tested (in part) by the constraints imposed by determining the breadth of the collagen helical reflection that corresponds to the axial rise per residue with an average spacing of 0.286nm (a 100mm camera length was used to observe this reflection at station

14.1, SRS, Daresbury). A one dimensional trace of the appropriate reflection was then obtained in the manner previously described, and the limits were derived by calculating the third standard deviation of the resultant .

Genetic algorithms: A Genetic Algorithm (GA) is a stochastic method used to solve complex optimisation problems based on the principal of survival of the fittest, combined with a replication process inspired by genetics. It is an iterative procedure which maintains a population of structures that are candidate solutions to specific domain challenges. An initial population of candidate solutions to a problem are randomly generated with the population scattered widely about the solution space.

Each individual in the population is composed of a set of parameter values where each parameter can be thought of as a ‘gene’. During each generation the individual solutions in the current population are rated for their effectiveness by using the parameter set (genetic pattern) to generate a model and then comparing the performance of the model against a target.

The algorithm then proceeds through a number of breeding generations where solutions with a better fit to the target outcome are more likely to be picked for replication. The end goal being to evolve an individual with a set of parameter values

16 that produce a close fit to the desired target. On the basis of the evaluations, a new population of candidate structures is produced using the ‘genetic operators’ of crossover and mutation. Cross-over involves copying a segment of genetic material (a set of parameter values) from one of the original parents to a new off-spring. This segment can vary in length from one parameter value to the complete set of parameter values.

Mutation of genes is produced through one of four effects. The simplest case is to randomly assign a new value for a gene, ignoring either of the parent values for the given parameter associated with the gene. The remaining three effects produce a new parameter value based on one of the two original parent values but with a small deviation. The deviation is either a small random increment or decrement from one of the parent values, or a new value that lies at a random point between the parameter values of the first and second parent for the given gene.

The above GA process can be represented using the following pseudo code:

1. Create an initial randomised population of potential solutions P where

each Pi consists of a set of 19 parameters specifying amino acid

spacing values between 0.270 and 0.304nm.

2. Generate a score Si for each individual solution Pi in the population P

as follows:

a. Generate a collagen model Mi from the parameter values of Pi

(in this case, the amino acid inter-residue spacing factors).

b. Perform a Fourier transform on the model Mi to generate a set

of axial intensities AMi associated with the model Mi.

c. Measure the intensity difference ADi for model i between the

modelled axial AMi and the recorded axial intensities AR by

17 summing the squared differences:

N 2 # n n & ADi = % A " A ( )$ R Di' (Eq 2) n=1

where N=30 is the number of observed axial intensities. !

d. Calculate the weighted sum squared difference BDi between the

biological components of model i's structure and reported

values:

X 2 # x x & BDi = % B " B ( W )$ Mi' (Eq 3) x=1

where W=5 is a weighting used to increase the influence of the ! components with respect to the axial intensity scores and X=4

is the number of component used (B1 is reported molecule

length where B1=300nm; B2 and B3 are the minimum and

maximum inter-residue distance respectively with B2=0.270,

and B3=0.304nm; B4 is the average helical inter-residue

distance where B4=0.286nm).

e. Add ADi and BDi together and take the square root of this value

to get a combined score Si for model Mi:

S (Eq 4) i= ADi +BDi

This generates a value equivalent to the weighted N ! dimensional Euclidean difference between the combined axial

intensities and biological components.

3. Calculate a normalised score NSi for each model in the current

population such that the best solution has a score of 0.0 and the worst

solution has a score of 1.0:

18 Si " Smin NSi = (Eq 5) Smax " Smin

! where Smin and Smax are the smallest and largest scores in the current

population respectively. Note that Smin defines the current distance

from the ideal solution since this problem is a minimisation task.

4. Pick two solutions Px and Py from the population to act as parents for

the creation of new solutions, where each parent is picked as follows:

a. Generate a random number r between 0 and 1 that will be used

as the minimum score required for a solution to be picked as a

parent.

b. Generate a random number z between 1 and P (the population

size).

c. If r >= NSz, choose individual Pz as a parent.

d. If r < NSz, let z = z + 1 mod P and repeat (c)

5. Pick a solution to delete from the population Pd so that it can be

replaced by the ‘offspring’ of the above parent solutions by repeating

step 4 except reversing the logic of (c) and (d). Step 4 will guarantee

that parents are picked with a likelihood of being chosen that is

proportional to their relative fitness in the current population.

Reversing step 4 will invert this process such that the weaker a solution

is, relative to the rest of the population, the more likely it is to be

picked for removal. The principal behind this approach is to generally

pick fitter solutions and remove weaker solutions but to periodically

include parameter values from weaker solutions that could contribute

to an overall increase in performance. If the top scoring solutions were

picked at every iteration then a process similar to ‘hill climbing’ would

19 be produced and the algorithm would be more likely to get stuck in a

local minima.

6. Generate a new solution to replace Pd by going through each parameter

value f specified in Pd and replacing it as follows:

#0 <= r<0.4,Pxf % 0.4<= r<0.8,Pyf % 2 3 (Eq 6) %0.8 <= r<0.875,L(Pxf +R(0,0.03),B ,B ) 2 3 %0.875 <= r<0.95,L(Pxf "R(0,0.03),B ,B ) Pdf =$ 2 3 %0.95 <= r<0.96,L(Pyf +R(0,0.03),B ,B ) 2 3 %0.96 <= r<0.97,L(Pyf "R(0,0.03),B ,B ) % 2 3 0.97<= r<0.98,Pxf +L((Pxf "Pyf ).R(0,0.03),B ,B ) % 2 3 &0.98 <= r<=1.0,R(0,0.3),B ,B )

where R(x,y) generates a random number between x and y; L(v,x,y) ! constrains the value v to lie within the lower and upper bounds x and y.

7. Calculate the score Sd for individual Pd using the process outlined in

step 2. This score will reflect the new parameter values derived in 6.

8. If Smin, the minimum population score, is above the stopping threshold

T then repeat the above process from step 3. If Smin

then a solution has been obtained.

Generation of Model: In this study, the parameter set comprised of individual inter-residue spacing factors that were assigned to each amino acid type and used to generate inter-residue spacing in the triple-helical and telopeptide regions together with fold points in the telopeptide region. Since the parameter affected using this methodology is the inter-residue spacing, the most obvious approach would be to assign individual values for each amino acid pair in the α1(I) and α2(I) chains.

However, this would produce a model with a high degree of freedom (3029) and possible inconsistencies. Using this particular approach, it would be possible for two pairs of identical longitudinally adjacent amino acids to have a different inter-residue spacing at different places on the same chain.

20 The approach chosen in this study was to assign a spacing factor to each amino acid type present in the collagen sequence and then use this factor to determine the relative size of the distance for a given amino acid pairing. The task of the GA was to modify these factors in such a way that a suitable fit to the observed intensity data was achieved. These factors provided significant constraints upon the degrees of freedom in the model since any adaptation to the factor of one amino acid would result in multiple changes throughout the model structure. Given that there are only 20 different amino acid types present in the α1(I) and α2(I) chains, this effectively produces a model with only 19 degrees of freedom. In the case of type I collagen there are 20 different amino acids as post translational modification of proline and lysine residues creates hydroxyproline and hydroxylysine. However there are no instances of either cysteine or tryptophan in the amino acid sequence of type I collagen.

In order to constrain the model further, and ensure that the amino-acids in each of the chains remained in step as one progressed down the collagen molecule, it was decided that the amino acids in the three chains should be grouped into ‘lateral’ triplets with each of the three alpha chains contributing one amino acid (see figure 8).

The GA was then used to calculate the distance between these triplets using a number of different approaches. The most straight forward approach is to take the average of the factors for the three amino acids which constitute a triplet.

A + A + A S = n,1 n,2 n,3 An,An+1 3 (Eq 7)

An,x is the factor assigned to amino acid A which is at index n in the amino acid sequence and x identifies the chain to which it belongs. This method of calculating the inter-residue spacing is referred to in this study as the A3 method. An

21 alternative variation, based on this approach, is to state that the inter-residue spacing

S, should be determined by the largest factor present in the triplet of index n. This approach is based on the hypothesis that a large amino acid could prevent the other amino acids in the triplet from coming too close together due to steric hindrance. This approach is referred to as the A3 Maximum method, however this method produced larger than desired inter-residue spacings and led to models with little or no structural merit.

An evolution of this approach was to model the inter-residue distance between triplets which lie axially adjacent to each other in the amino acid sequence. The simplest solution in this case is to take the average of the interactions between each of the amino acid pairs.

& A + A A + A A + A # S $ n,1 n+1,1 n,2 n+1,2 n,3 n+1,3 ! / 3 An,An+1 = $ + + ! (Eq 8) % 2 2 2 "

On testing this approach it was difficult to achieve an adequate level of fit with the observed data, an alternative method was therefore derived as shown in equation

9.

& A .A A .A A .A # S n,1 n+1,1 n,2 n+1,2 n,3 n+1,3 / 3 An,An+1 = $ + + ! (Eq 9) % 0.286 0.286 0.286 "

In this approach the multiplied factors are scaled using 0.286 in order to enforce a normalisation upon them such that they remained within specific constraints. This is referred to as the A6 method. An A6 Maximum method was also derived, however this again produced models that had little merit due to the large sizes of inter-residue spaces.

22 Validation of the structure: The figure of merit of the model is an essential component of the GA approach as such a scoring method is used to drive the process of selecting parents for the replication process. In this study, five measured objectives were considered when calculating the score. The score, S, is derived by taking the sum of the squares of the weighted differences between the model and the observed data for each of the factors. For the purposes of the operation of the GA, it is the relative value of this score for a given solution compared with a competing solution that is important. A further validation of the model under scrutiny is to calculate the

R-factor (Equation 10).

R factor || F | | F ||2 / | F |2 (Eq 10) " = [!( obs " calc ) !( obs )]

In this equation Fobs is the observed intensity and Fcalc is the model intensity, the R-factor is a value which indicates how the model set of intensities differs from the observed data. The simulated series of reflections is achieved by the Fourier inversion of the one-dimensional electron density profile of the simulated axial projection of a microfibril over a given number of orders.

In the case of each proposed structure, the resultant molecular model was tested for structural feasibility against the criteria discussed below. In order to generate a constrained set of models, the GA penalised model structures that did not satisfy the relevant structural criteria. The criteria used were features which are readily recognised as being inherent to collagen molecules, with the most heavily penalised aspect being the length of the molecule. This is approximately 300 nm [10].

The α1(I) polypeptide chain (the longest of those present in type I collagen), consists of 1052 amino acids. If the average inter residue distance is 0.286 nm [8] the total length of the molecule would be 300.872 nm. If compression and folding of the telopeptides are taken into account this value could be as low as 295 nm, therefore a

23 penalty was introduced for any model that deviated from the optimal length of 300nm.

This penalty was calculated as an extra term in equation 4 with a large weight value W used in order to create an ‘evolutionary pressure’ to keep the length value close to

300nm.

The ratio of the gap section to the overlap section in the collagen molecule is also of fundamental importance in achieving a set of diffraction intensities which match those of the observed data, since the lower orders of the meridional set are dependent on the size of the observed low resolution step function. Literature points to the overlap region accounting for 0.46 of the D period [2] and it was important to ensure the model structures contained this element. Any model structure that did not contain this gap to overlap ratio was also penalised by the Genetic Algorithm using a similar mechanism to the penalty applied for length measurement.

24

Acknowledgements

The authors acknowledge the staff of the ESRF and SRS for their help in obtaining the diffraction data, in particular Mike MacDonald (beamline 14.1) and

Manfred Roessle (previously of ID02). This work was funded by BBSRC research grant (98/B13991).

25 References

Meek, K. M., Chapman, J. A., Hardcastle, R. A., 1979. The staining pattern of collagen fibrils. I. J. Biol. Chem. 254, 10710-10714.

Brodsky, B., Eikenberry, E. F., 1982. Characterisation of fibrous forms of collagen. Meth. Enzymology. 82, 127-174.

Miller, A., 1982. Molecular packing in collagen fibrils. TIBS. 7, 13-18.

Orgel, J. P. R., Miller, A., Irving, T. C., Fischetti, R. F., Hammersley, A. P., Wess, T. J., 2001. The in situ supermolecular structure of type I collagen. Structure. 9, 1061- 1069.

Chapman, J. A., Holmes, D. F., Meek, K. M., Rattew, C. J., 1981. In: Structural Aspects of Recognition and Assembly in Biological Macromolecules. 1 pp387

Goldberg, D. Genetic Algorithms in Search, Optimization & Machine Learning Addison Wesley, 1989.

Orgel, J. P. R., Wess, T. J., Miller, A., 2000. The in situ conformation and axial location of the intermolecular cross-linking non-helical telopeptides of type I collagen. Structure. 8, 137-142.

Creighton, T. E., 1993. Proteins : Structures and molecular properties. 2nd Edition. W. H. Freeman and Company, New York.

Hulmes, D. J. S., Miller, A., White, S. W., Timmins, P. A., Berthet-Colominas, C., 1980. Interpretation of the low-angle meridional neutron diffraction patterns from collagen fibres in terms of the amino acid sequence. Int. J. Biol. Macromols. 2, 338- 345.

Schmitt, F. O., Gross, J., Highberger, J. H., 1955. Tropocollagen and the properties of fibrous collagen. Experimental Cell Research, Supplement 3 326-334

Hulmes, D. J. S., Miller, A., White, S. W., Doyle, B. B., 1977. Interpretation of the meridional x-ray diffraction pattern from collagen fibers in terms of the known amino acid sequence. J. Mol. Biol. 110, 643-666.

Tsai, J., Taylor, R., Chothia, C., Gerstein, M., 1999. The packing density in proteins : Standard radii and volumes. J. Mol. Biol. 290, 253-266.

Hodge, A. J., Petruska, J. A., 1962. In: Breese, S. S., (Ed.) Electron Microscopy, Paper QQ-1. Academic Press, New York, Vol. 1.

Ortolani, F., Giordano, M., Marchini, M., 2000. A model for type II collagen fibrils: distinctive D-band patterns in native and reconstituted fibrils compared with sequence data for helix and telopeptide domains. Biopolymers. 54, 448-463.

26 Wess, T. J., Hammersley, A., Wess, L., Miller, A., 1995. Type I collagen packing, conformation of the triclinic unit cell. J. Mol. Biol. 248, 487-493.

27 Figure Legends

Figure 1 : X-ray diffraction image of the meridional region from fibrillar type I collagen. The meridional Bragg peaks are a formed as a result of the interactions of the X-rays with the regular repeating structure of the axial packing scheme of the collagen molecules. This figure shows the resultant Bragg peaks of diffracted rat tail tendon with sample to detector distances of 2000mm (left hand side; ID02, ESRF).

For guidance, a number of the more intense peaks have been labelled.

Figure 2 : A3 model scored over 30 orders of intensity. In this figure the broken line represents the intensity of each reflection as recorded from the X-ray diffraction patterns. The solid line represents the best fit model obtained using the A3 method of altering the distance between amino acids. The values along the x-axis represent the order of the meridional reflection, in order to avoid a large value the first order has been left off of the chart. The first reflection represented at the left hand end of the figure is the 2nd order, labelled 2. The R Factor for this comparison of model and data is 0.247495.

Figure 3 : A6 model scored over 30 orders of intensity. As in figure 3, the broken line represents the intensity values recorded from the X-ray data. However, in this case the solid line represents the best fit model obtained using the A6 method of altering the model. As with figure 3, the values along the x-axis represent the meridional order of the reflection with the first order being left out so that it does not dominate the y-axis scale. In this comparison the R-Factor is recorded as 0.172834.

28 Figure 4 : Simulated Electron Density of A6 model scored over 30 orders of intensity. The figure indicates the simulated electron density profile of one repeating

D periodic unit of the model which gives the best fit to the data (A6 model scored over 30 orders of intensity).

Figure 5 : Average inter-residue spacings along the length of a simulated D- period. The values shown here were obtained by averaging the inter-residue spacings at each point along the length of the D-periodic repeat.

Figure 6 : Inter residue spacings of best fit models. Inter-residue distances of the

A6 and the A3 models over 30 orders. All of the 1010 inter-residue spaces in the helical region of each model are accounted for here, with any spacing which falls within the defined limits being in the column marked 0.270-0.303, i.e. this includes a count of one for each spacing between 0.270 and 0.304.

Figure 7 : Inter-residue spacing along the length of the D-period against proline and hydroxyproline residues. The average inter-residue spacing along the length of the simulated D-period is shown against the average number of proline and hydroxyproline residues in a triplet at the same point. The correlation factor between these two sets of values is shown in table 3.

Figure 8 : Schematic representation of the three alpha chains of the triple helix.

This figure shows an unspecific region of the collagen triple helix with nine amino acids labels a-i. These amino acids are arranged into three triplets which are denoted by the dotted lines which group the amino acids together into three groups of three

29 amino acids. The differences between the two methods of calculating inter residue distance can be explained using this figure. The inter-residue distance marked ‘1’ if calculated by the A3 method is only affected by the amino acids in the group ‘adg’.

The inter-residue distance marked ‘2’ if calculated by the A6 method is affected by the two groups marked ‘beh’ and ‘cfi’. It should also be noted that the staggered effect of the collagen molecule has been taken into account, for example if amino acid

‘a’ is then amino acids ‘e’ and ‘i’ are also glycine residues.

Table 1 : Meridional intensities of the best fit models using the two methods of determining the inter residue distance. This tables shows the recorded and model intensities generated by the different modelling methods described in the text. The data at the foot of each column describes the total length of each model molecule as well as the R-factor generated in fitting it to the recorded diffraction data.

Table 2 : Distances which each amino acid contributes to the calculation. This table is generated by calculating the contribution of each amino acid to the overall length of the model molecule for each of the two modelling methods.

Table 3 : Correlation Values. These values represent the correlation between the location of certain residues and the distances between amino acids at those locations in the A3 model.

30 TABLES

Telo Only Order Observed A3 A6 A6 (Figure 3) (Figure 4) 2 386.998 1039.7 627.551 634.688 3 2265.9 2265.9 2265.9 2265.9 4 250.246 1117.22 903.741 254.841 5 1216.35 1322.06 1260.81 1253.39 6 573.205 995.471 683.49 598.468 7 448.95 1074.51 632.641 672.516 8 348.124 1055.82 614.581 523 9 2426.47 2215.19 2091.25 2072.13 10 453.868 727.248 410.221 471.693 11 435.348 1054.1 539.824 446.338 12 932.487 829.822 825.182 1044.63 13 87.046 529.577 164.856 122.412 14 293.137 988.951 515.934 419.211 15 152.231 362.503 232.882 283.448 16 208.242 365.296 461.703 259.528 17 231.587 415.664 268.492 229.222 18 107.456 593.198 162.926 148.047 19 100.742 437.967 231.236 289.68 20 1085.99 1239.63 883.889 917.26 21 915.536 1160.55 746.524 781.352 22 365.536 333.541 383.779 408.208 23 30.3433 332.492 213.042 191.58 24 50.0509 384.565 147.24 141.451 25 394.851 732.525 455.603 220.396 26 172.042 608.691 270.865 320.337 27 296.283 687.12 450.122 404.6 28 12.9726 379.004 113.568 118.813 29 141.279 348.855 268.42 136.707 30 394.334 921.329 469.597 310.988

Length (nm) 297.31 303.673 302.914 R-factor 0.540559 0.247495 0.172834

Table 1

31 Amino Acid A3 Method A6 Method

Ala 0.27114 0.266768 Arg 0.244211 0.280193 Asn 0.330677 0.366966 Asp 0.3488 0.38 Gln 0.310293 0.264342 Glu 0.301428 0.328892 Gly 0.291405 0.29301 His 0.290728 0.304412 Hyl 0.234685 0.300695 Hyp 0.318422 0.294785 Ile 0.310935 0.225106 Leu 0.263374 0.372712 Lys 0.224928 0.212691 Met 0.313316 0.111038 Phe 0.238186 0.365361 Pro 0.312398 0.264977 Ser 0.303794 0.237173 Thr 0.34492 0.365527 Tyr 0.10053 0.108694 Val 0.246449 0.289977

Table 2

Residue Group Correlation Value

Proline & Hydroxyproline 0.211316 All Charged Residue -0.23096 Positively Charged Residues -0.27596 Negatively Charged Residues -0.06191 Hydrophobic Residues -0.17417

Table 3

32

FIGURE 1

33

FIGURE 2

34

FIGURE 3

35

FIGURE 4

36

FIGURE 5

37

FIGURE 6

38

FIGURE 7.

39

FIGURE 8

40