Biocomputational Studies on Protein Structures

From the Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden

Biocomputational studies on protein structures

by Erik Nordling

Stockholm 2002

© Erik Nordling ISBN 91-7349-295-7

To my beloved Mirna

Abstract

Biology in the post-genomic era produces large amounts of data. This, in combination with the need for efficient algorithms to find genes in the genomic material, has brought a renaissance to the field of computational biology. Methods now range from lead discovery in the drug discovery process, by virtual ligand screening, through sequence comparisons and homology searches, to micro array data analysis and visualisation. This thesis primarily deals with sequence analysis, different aspects of protein structure prediction and enzyme–ligand complex characterisation. I have applied bioinformatic techniques to the enzyme families of medium-chain dehydrogenases/reductases (MDR) and short-chain dehydrogenases/reductases (SDR), and to biologically active peptides.

Sequence analysis of the MDR superfamily extends the evolutionary context of this superfamily, as MDR enzymes were collected from completely sequenced genomes. The analysis reveals the presence of eight families, several of which were previously uncharacterised. Three families are formed by dimeric alcohol dehydrogenases (ADH), cinnamyl alcohol dehydrogenases (CAD) and tetrameric alcohol dehydrogenases (YADH). Three further families are centred on forms initially detected as mitochondrial respiratory function proteins (MRF), acetyl-CoA reductases of fatty acid synthases (ACR), and leukotriene B4 dehydrogenases (LTD). The two remaining families, with polyol dehydrogenases (PDH) and quinone reductases (QOR), are also distinct but with variable sequences. The analysis also suggests that new functions have evolved in this superfamily in higher organisms.

Factors that govern the substrate specificity of γγ ADH were investigated with docking calculations and can be traced to active site characteristics, most notably the Ser48/Thr48 replacement between γγ ADH and ββ ADH, which allows the oxidation of 3β-hydroxy bile acids, such as isoUDCA, in γγ ADH, while both enzymes are inactive versus 3α-hydroxy bile acids.

A homology model of type 10 17β-hydroxysteroid dehydrogenase was constructed from 7α-hydroxysteroid dehydrogenase. The validity of the model was investigated by its ability to distinguish between active and inactive ligands using docking calculations. Substrates tested ranged from steroids and bile acids to L- and D-hydroxyacyl CoA. Ligands with 17β or 3α hydroxy groups and L-hydroxyacyl CoA could achieve interactions favourable for catalysis at the active site. A crystallographically determined structure published after the submission of our paper verified large portions of our model.

The role of a conserved asparagine in the short-chain dehydrogenase/reductase (SDR) fold is investigated through structural comparisons of 21 members with experimentally verified structures. An extensive hydrogen-bonding network including parts of the active site is revealed in 16 out of 21 SDR forms.

Molecular dynamics simulations were employed to study the effect of deleterious mutants and replacements, known to diminish fibrillation, on the stability of the helical region of the amyloidogenic peptides amyloid β-peptide (Aβ) and surfactant protein-C (SP-C). The effects in SP-C are quantitatively distinguishable, while the noise ratio in the Aβ simulations makes valid predictions somewhat difficult.

Sequence comparisons of the C-peptide of proinsulin display a sequence variability that is 1 to 2 orders of magnitude greater than that of insulin, but of the same order of magnitude as that of the well-established peptide hormones relaxin and parathormone, which in conjunction with functional reports may indicate a hormonal function for the C-peptide.

Erik Nordling: Biocomputational studies on protein structures; Stockholm 2002; ISBN 91-7349-295-7


Contents

Abbreviations
List of original articles
Introduction
Background
Medium-chain Dehydrogenases/Reductases (papers I-III)
Short-chain Dehydrogenases/Reductases (papers IV-V)
Biologically active peptides (papers VI-VII)
Techniques
Sequence comparisons
Pairwise alignments
Multiple alignments
Scoring matrices
Databases
Database searching
FASTA
BLAST
Phylogenetic analysis
Maximum parsimony
Neighbour joining
Bootstrap analysis
Secondary structure prediction
Molecular modelling
Force field methods
Local minimisation techniques
Steepest descent
Conjugate gradient
Global optimisation techniques
Monte Carlo (MC)
Molecular dynamics (MD)
Simulated annealing
Docking calculations
Rigid body dockings
Flexible ligand–rigid receptor
Flexible ligand–flexible receptor
Comments on docking methods
Homology modelling
Aim of the study
Project overview
Genome comparison and modelling of active sites in the MDR family (paper I)
Differential multiplicity of MDR alcohol dehydrogenases (paper II)
Molecular modelling and docking of bile acids to γγ and ββ ADH class I (paper III)
Molecular modelling and substrate docking of human type 10 17β-hydroxysteroid dehydrogenase (paper IV)
Structural Role of Conserved Asn179 in the Short-Chain Dehydrogenase/Reductase Scaffold (paper V)
Molecular dynamic studies of sequence influence upon fibrillation (paper VI)
A structural basis for proinsulin C-peptide activity (paper VII)
Conclusions
Perspectives
Acknowledgements
References

Abbreviations

One and three letter codes for the 20 genetically encoded amino acids.

Alanine         Ala   A
Arginine        Arg   R
Asparagine      Asn   N
Aspartic acid   Asp   D
Cysteine        Cys   C
Glutamic acid   Glu   E
Glutamine       Gln   Q
Glycine         Gly   G
Histidine       His   H
Isoleucine      Ile   I
Leucine         Leu   L
Lysine          Lys   K
Methionine      Met   M
Phenylalanine   Phe   F
Proline         Pro   P
Serine          Ser   S
Threonine       Thr   T
Tryptophan      Trp   W
Tyrosine        Tyr   Y
Valine          Val   V

17β-HSD-10      type 10 17β-hydroxysteroid dehydrogenase
Aβ              Amyloid β-peptide
ACR             Acyl-CoA reductase
AD              Alzheimer's disease
ADH             Alcohol dehydrogenase
BPMC            Biased probability Monte Carlo
CAD             Cinnamyl alcohol dehydrogenase
FAS             Fatty acyl synthase
HSD             Hydroxysteroid dehydrogenase
LTD             Leukotriene B4 dehydrogenase
MC              Monte Carlo
MD              Molecular dynamics
MDR             Medium-chain dehydrogenases/reductases
MRF             Mitochondrial respiratory function protein
NAD+/NADH       Nicotinamide adenine dinucleotide (oxidised/reduced)
NADP+/NADPH     Nicotinamide adenine dinucleotide phosphate (oxidised/reduced)
PAP             Pulmonary alveolar proteinosis
PDB             Protein data bank (www.rcsb.org)
PDH             Polyol dehydrogenase
QOR             Quinone oxidoreductase
RMSD            Root mean square deviation
SDR             Short-chain dehydrogenases/reductases
SNP             Single nucleotide polymorphism
SP-C            Surfactant protein-C
UDCA            Ursodeoxycholic acid
YADH            Yeast alcohol dehydrogenase
Å               Ångström (10^-10 m)

List of original articles

This thesis is based on the following articles, which will be referred to by their Roman numerals.

I. Nordling, E., Jörnvall, H. and Persson, B. Medium-chain dehydrogenases/reductases (MDR): Family characterizations including genome comparisons and active site modelling. Eur. J. Biochem. 2002, 269, in press

II. Nordling, E., Persson, B. and Jörnvall, H. Differential multiplicity of MDR alcohol dehydrogenases: enzyme genes in the human genome versus those in organisms initially studied. Cell. Mol. Life Sci. 2002, 59:1070-1075

III. Marschall, H.U., Oppermann, U.C., Svensson, S., Nordling, E., Persson, B., Höög, J.-O. and Jörnvall, H. Human liver class I alcohol dehydrogenase γγ isozyme: the sole cytosolic 3β-hydroxysteroid dehydrogenase of iso bile acids. Hepatology 2000, 31:990-996

IV. Nordling, E., Oppermann, U.C., Jörnvall, H. and Persson, B. Human type 10 17β-hydroxysteroid dehydrogenase: molecular modelling and substrate docking. J. Mol. Graph. Model. 2001, 19:514-520, 591-593

V. Filling, C., Nordling, E., Benach, J., Berndt, K.D., Ladenstein, R., Jörnvall, H. and Oppermann, U. Structural role of conserved Asn179 in the short-chain dehydrogenase/reductase scaffold. Biochem. Biophys. Res. Commun. 2001, 289:712-717

VI. Nordling, E., Kallberg, Y., Johansson, J. and Persson, B. Molecular dynamics studies of sequence influence upon fibrillation. Manuscript

VII. Jörnvall, H., Nordling, E., Persson, B., Shafqat, J., Ekberg, K., Wahren, J. and Johansson, J. A structural basis for proinsulin C-peptide activity. Manuscript

The published articles are reprinted with permission from the copyright holders.

Related papers not included in the thesis

• Dreij, K., Sundberg, K., Johansson, A.S., Nordling, E., Seidel, A., Persson, B., Mannervik, B. and Jernström, B. Catalytic activities of human alpha class glutathione transferases toward carcinogenic dibenzo[a,l]pyrene diol epoxides. Chem. Res. Toxicol. 2002, 15:825-831

• Filling, C., Berndt, K.D., Benach, J., Knapp, S., Prozorovski, T., Nordling, E., Ladenstein, R., Jörnvall, H. and Oppermann, U. Critical residues for structure and catalysis in short-chain dehydrogenases/reductases. J. Biol. Chem. 2002, 277:25677-25684

• Strömberg, P., Svensson, S., Hedberg, J.J., Nordling, E. and Höög, J.-O. Identification and characterisation of two allelic forms of human alcohol dehydrogenase 2. Cell. Mol. Life Sci. 2002, 59:552-559

• Jonsson, A.P., Aissouni, Y., Palmberg, C., Percipalle, P., Nordling, E., Daneholt, B., Jörnvall, H. and Bergman, T. Recovery of gel-separated proteins for in-solution digestion and mass spectrometry. Anal. Chem. 2001, 73:5370-5377


Introduction

The structure of a protein determines its function. It is essential for substrate specificity, stability and interactions with other proteins. The quest for protein structures and interactions is recognised as one of the big challenges after the completion of the human genome, and several large-scale structural genomics projects have been launched around the globe (Burley 2000). The realisation that all information required for protein folding is inherent in the protein sequence (Epstein et al. 1963) gave rise to the greatest challenge for computational biology, the folding problem. It has not been possible to find a universal solution, as was done for the genetic code. One complication is that many proteins require the assistance of chaperone proteins to fold into their native structure (Ellis et al. 1989). Nevertheless, several methods exist that exhibit moderate predictive accuracy (Bonneu et al. 2001, Skolnick et al. 2001, Hardin et al. 2002). The most accurate structure prediction methods to date are homology modelling, or comparative modelling, methods, which use the information from homologous structures to predict the tertiary structure. The quality of the model depends on the number of residue identities in common between the two proteins, but the model error is often below 1 Å in RMSD if the sequence identity is relatively high.

The activity of an enzyme is determined by the shape and amino acid residue distribution of the active site. This makes it possible to search for potential substrates by computational methods termed docking calculations, which find the optimal placement of the substrate with respect to electrostatic/hydrophobic interactions and shape complementarity.

The completion of the first genome, that of the SV40 virus, with a total of 5,224 base pairs (Fiers et al. 1978), marked the arrival of a new perspective in biology. For the first time, all the genetic information of an organism was known. Now, with the completion of the human genome (Venter et al. 2001, Lander et al. 2001), the needs for analysis are enormous. One of the more appealing possibilities offered by the genomic sequences is to compare species and to try to explain their differences in terms of their genes (Bansal 1999). Whole genome comparisons have also been used for phylogenetic analysis of organisms (Koonin et al. 1997, Fitz-Gibbon and House 1999, Tekaia et al. 1999), which largely confirms the previous picture of the divisions of life.

The large amounts of data generated through the advances in biology and biochemistry have given rise to the discipline of bioinformatics, which deals with biological information and tries to use it as a basis for predicting different biological properties, such as the sub-cellular localisation or secondary structure of a protein.

In my thesis, I have applied bioinformatic techniques to the enzyme families of medium-chain dehydrogenases/reductases (MDR) and short-chain dehydrogenases/reductases (SDR), and to biologically active peptides.

Background

Medium-chain Dehydrogenases/Reductases (papers I-III)

Medium-chain dehydrogenases/reductases (MDRs) constitute a large enzyme superfamily with (including species variants) close to 1000 members, which have been compared at separate stages of establishment (Persson et al. 1994, Jörnvall et al. 1999). MDR proteins are either dimeric or tetrameric proteins with about 350 residues. The tertiary structure reveals that they consist of one catalytic and one coenzyme-binding domain (Fig. 1).

Fig. 1. Model of dimeric γγ ADH (paper III) in ribbon representation. NAD+ is shown as a ball and stick model, and zinc ions are displayed in space-filling representation. isoUDCA is docked into the active site and displayed as a ball and stick model with the solvent accessible surface coloured by the electrostatic potential. Picture created in Weblab Viewer Lite (Accelrys Inc.)

Several, but not all, of the members have one zinc ion bound with catalytic function at the active site. Some, in particular the classical, dimeric alcohol dehydrogenases (ADHs), also have a second zinc ion at a structural site (Brändén et al. 1975). The coenzyme-binding domain has a Rossmann fold that binds NAD(P)H (Rossmann et al. 1975). Several gene duplications have led to the present state of different families, summarised in Fig. 2.

Fig. 2. Evolutionary tree showing the eight MDR families (scale bar 0.1). ADH – alcohol dehydrogenases, CAD – cinnamyl alcohol dehydrogenases, YADH – yeast alcohol dehydrogenases, ACR – acyl-CoA reductases, LTD – leukotriene B4 dehydrogenases, QOR – quinone oxidoreductases, MRF – mitochondrial respiratory function proteins, PDH – polyol dehydrogenases. Tree created with ClustalW (Thompson et al. 1994) and visualised in Treeview (Page 1996).

The MDR enzymes represent many different enzyme activities, of which the zinc-dependent ADHs were the earliest to be investigated thoroughly (Bonnichsen and Wassen 1948, Brändén et al. 1975). They participate in the oxidation of alcohols, detoxification of aldehydes/alcohols and the metabolism of bile acids (Jörnvall et al. 2000, Marschall et al. 2000). There are also tetrameric alcohol dehydrogenases, originally detected in yeast as yeast alcohol dehydrogenases (YADH) (Negelein and Wulff 1937), but present also in other species (paper II). The polyol dehydrogenases (PDHs) have activities originally detected for sorbitol dehydrogenase (Jörnvall et al. 1981). All their substrates are widespread in nature because of their derivation from glucose, fructose, and general metabolism. Cinnamyl alcohol dehydrogenases, CAD, are predominantly found in plants. This enzyme type catalyses the last step in the biosynthesis of the monomeric precursors of lignin, the main constituent of plant cell walls (Boudet et al. 1995). It has been extensively characterised from several plant sources (Sarni et al. 1984, Halpin et al. 1994, Galliano et al. 1993), because of its importance for the pulp industry (cf. Persson et al. 1993). Downregulation or inhibition of CAD will reduce wood lignin content and yield a pulp of higher quality (Campbell and Sederoff 1995). The first characterised member of the quinone oxidoreductase (QOR) family was a lens protein (ζ-crystallin) (Gonzalez et al. 1995), in which a mutational loss may result in cataract formation already at birth. This suggests that ζ-crystallin has a role in the protection of the lens against oxidative damage (Rao et al. 1997). The mitochondrial respiratory function proteins (MRF) are necessary for respiratory function, although the mechanism is not clear (Yamazoe et al. 1994). The acyl-CoA reductases (ACR) constitute one domain of fatty acyl synthase (FAS) (Amy et al. 1989) and erythronolide synthase (Donadio et al. 1991) multienzymes in animals, which are responsible for the synthesis of long-chain fatty acids. This is an example of fusion of genes encoding a multisubunit complex, since separate genes provide the same function in most bacteria (Magnuson et al. 1993) and plants (Ohlrogge and Browse 1995). The leukotriene B4 dehydrogenases (LTDs) are involved in the metabolism of immunoreactive molecules (Yokomizu et al. 1996).

In common, therefore, as demonstrated by the examples above, all MDR families appear to have some members with protective functions in different organismal defences (Jörnvall et al. 1999), similar to the P450 family. However, recent findings suggest more specific functions for the family members in higher organisms (paper II).

Short-chain Dehydrogenases/Reductases (papers IV-V)

The first reported enzyme of the short-chain dehydrogenases/reductases (SDR) family was Drosophila ADH (Sofer and Ursprung 1968, Schwartz and Jörnvall 1976), sharing function with horse liver alcohol dehydrogenase, but not evolutionary origin. As additional proteins were found to be similar in sequence, the SDR family was established (Jörnvall et al. 1981, Jany et al. 1984, Jörnvall et al. 1984, Persson et al. 1991). The SDR proteins are one-domain NAD(P)(H)-dependent enzymes of typically 250 amino acid residues (Jörnvall et al. 1995), often multimeric (Fig. 3).

Fig. 3. Tetrameric structure of 20β-hydroxysteroid dehydrogenase (pdb code 2HSD; Ghosh et al. 1994). Each subunit is shaded differently, NAD+ is displayed in space-filling representation. Figure created with Weblab Viewer Lite (Accelrys, Inc.)

They display a wide substrate spectrum, ranging from steroids, alcohols, sugars, and aromatic compounds to xenobiotics. SDRs are subdivided into classical and extended SDRs (Persson et al. 1995), differing in lengths and cofactor-binding motifs. SDR constitutes a large family (Jörnvall et al. 1999) with about 3000 forms known, including species variants in all living kingdoms (Kallberg et al. 2002). The family is highly divergent, with typically 15–30% residue identity in pairwise comparisons. For 21 members, the three-dimensional structures have been determined. In spite of the low residue identities between the different members, the folding pattern is conserved with superimposable peptide backbones (Krook et al. 1993; Ghosh et al. 2001). The criterion for SDR membership is the occurrence of typical sequence motifs, arranged in a specific manner. These motifs contain residues crucial for nucleotide binding and for the active site, including its highly conserved triad of Ser, Tyr, and Lys residues (Jörnvall et al. 1995, Persson et al. 1995, Oppermann et al. 1997).

Biologically active peptides (papers VI-VII)

Peptides with a biological activity are widespread in nature. Their functions range from metabolic regulation by hormones like glucagon (Thomsen et al. 1972), through innate host defence by anti-bacterial peptides such as LL-37 (Agerberth et al. 1995) and NK-lysin (Andersson et al. 1995) and neurological functions such as those of neuropeptide Y (Minth et al. 1986) and similar peptides, to the more deleterious effects of the disintegrins of snake toxins (Huang et al. 1986). Many of the peptides are synthesised as larger inactive precursors, which are cleaved to produce the active components. The short peptides may also require structural assistance to achieve the folded conformation (Beers and Lomax 1995). Often the precursors contain more than one active peptide, which are then released together in equimolar amounts.

The peptides included in this thesis are the C-peptide, surfactant protein C (SP-C) and the amyloid β-peptide (Aβ). The C-peptide is produced by the posttranslational cleavage of proinsulin to form mature insulin and C-peptide (Steiner et al. 1967). The understanding of its function has progressed: originally considered only a structural aid for the correct folding of insulin, it is now known to have hormonal activities of its own (Wahren et al. 2000). In diabetes type I patients it has beneficial effects on glycaemic control and protects against diabetic complications, while synthetic insulin, which is identical to endogenous insulin but lacks C-peptide, does not suffice (Wahren and Johansson 1998). The surfactant protein C (Johansson et al. 1988) is a natural part of the surfactant mixture in the lungs and lowers the surface tension in the alveolus (Gustafsson and Johansson 1998). It is produced as a larger precursor, which allows it to fold into an α-helical structure (Beers and Lomax 1995). The amyloid β-peptide is a product of the APP protein (Kang et al. 1987), which has recently been suggested to function as a kinesin 1 receptor, mediating the axonal transport of β-secretase and presenilin-1 (Kamal et al. 2001).

Techniques

In this thesis, I have used several bioinformatic techniques. They are described below in further detail than the format of the original articles allowed.

Sequence comparisons

The procedure of comparing two or more sequences to bring as many identical or similar residues as possible into vertical register produces an alignment of the sequences. Two main types of sequence alignment are recognised, global and local. The global alignment optimises the alignment over the full length of the sequences, while in the local alignment, stretches of sequence with the highest density of matches are given the highest priority.

Pairwise alignments

A direct manner of comparing sequences is the dot plot analysis (Gibbs and McIntyre 1970). The two sequences to be compared form the rows and columns of a matrix, and in the simplest case a dot is plotted in a graphical representation of the matrix wherever the sequences are identical. In practice, the comparison is most often done in a window of specified length, and a dot is put in the graph whenever the score within this window is above a certain specified threshold. The score might be calculated as the number of identities or as a similarity based on a scoring matrix. Sequence similarities will be obvious as diagonals in the plot. The method does not align sequences but it constitutes a quick manner of spotting similarities. It has the additional advantage over other alignment methods that it can detect repetitions in the sequences as parallel diagonals in the plot.

Global alignments aim to optimally align two or more sequences over their full length. The methods most used for global alignments are based on algorithms originally developed by Needleman and Wunsch (1970) and later modified by Sellers (1974). This procedure (the NWS method) uses a dynamic programming algorithm that simplifies the enormous task of calculating a score for all possible alignments of two sequences with gaps of any length. The sequences to be aligned are arranged as rows and columns of a rectangular matrix. A score is calculated for each position of the matrix according to three possible events: replacement (or conservation) of a residue, insertion in sequence A or insertion in sequence B. The alignment is traced back from the highest scoring matrix element to the beginning of the sequence.

The dynamic programming algorithm can also be used for finding local sequence similarities. The Smith and Waterman (1981) algorithm is very similar to the NWS method except that a calculated negative number for a matrix position is replaced by zero, indicating that no sequence similarity has been detected up to that point. When all matrix elements have been calculated, the maximum number in the matrix is located, and the alignment is traced back from this point until the first positive number.
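The local alignment procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the papers: the match, mismatch and gap values are arbitrary assumptions, a single linear gap penalty is used instead of separate opening and extension penalties, and the function name `smith_waterman` is chosen here for illustration only.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions, not those of any program
    cited in the text.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            diag = H[i - 1][j - 1] + sub   # replacement or conservation
            up = H[i - 1][j] + gap         # gap in sequence b
            left = H[i][j - 1] + gap       # gap in sequence a
            # Negative values are replaced by zero (the local-alignment rule).
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the maximum element until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("AAAGGGTTT", "CCGGGCC")
```

Dropping the `max(0, ...)` floor, initialising the first row and column with cumulative gap penalties and tracing back from the final matrix element would turn the same recursion into a global alignment.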

Multiple alignments

A multiple sequence alignment of a protein family gives evolutionary information about the family, since the alignment picks up the evolutionary pressure on each amino acid and thus shows which positions are crucial for maintaining the fold and function of the proteins of the investigated family.

The simplest method to obtain a multiple sequence alignment is to use one sequence as the base for the alignment and simply align all other sequences pairwise to this sequence (Altschul et al. 1997). The most common procedure for multiple sequence alignment uses hierarchical methods. In these methods, alignments of all pairs of sequences are made first using the dynamic programming algorithm. The sequences are then grouped according to their similarities into a tree (hierarchical cluster analysis). Finally, starting with the most similar pairs, all the sequences are aligned stepwise to each other using the dynamic programming method. This is the procedure used in the most popular program, CLUSTAL W (Thompson et al. 1994), used in all papers except papers III and VI.

Scoring matrices

All alignment methods need some sort of scoring for matches and mismatches. For a certain alignment, a number is assigned to each position in the sequence depending on the match at that position. The scores for all positions in the alignment are then added to calculate a total score, which is used to select the optimal alignment among alternative alignments. The simplest way of scoring is to assign a score of 1 for a match and 0 for a mismatch. Such a matrix is often referred to as a unitary matrix (Feng et al. 1984). The commonly used matrices in protein sequence analysis are generally presented as log-odds matrices. Each score in the matrix is the logarithm of an odds ratio. The odds ratio used is the number of times residue "A" is observed to replace residue "B" divided by the number of times residue "A" would be expected to replace residue "B" if the replacement occurred at random.

The PAM series are based on estimated mutation rates (Point Accepted Mutations) from closely related proteins (Dayhoff et al. 1978). PAM 1 stands for 1 % accepted mutations. PAM matrices for less similar sequences are obtained by extrapolations. The PAM 100 matrix corresponds to 100 accepted mutations per 100 residues, but since the same residue might change more than once, two sequences with this level of mutations will have about 50 % identity. A major drawback is that the series was developed upon a limited set of sequences. The Gonnet matrix is based upon exhaustive matching of the entire protein sequence database (Gonnet et al. 1992), an almost identical approach to the original PAM matrices but with a more divergent set of sequences. The BLOSUM series is calculated from blocks of aligned sequences from homologous proteins with a certain level of identity. In this manner, the pattern of changes observed at a certain level of identity is used, instead of extrapolations from patterns for closely related proteins (Henikoff and Henikoff 1992). The first PAM matrix is almost as efficient as the newer BLOSUM and GONNET matrices in exhaustive comparisons (Henikoff and Henikoff 1993, Vogt et al. 1995, Abagyan and Batalov 1997), which is a considerable achievement considering the limited set of proteins available at the time of its creation.

The scores from all pairs of aligned residues are combined with suitable penalties for introducing gaps to calculate a total score, which is used to select the optimal alignment. The gap penalties are normally determined by two parameters, one for opening a gap and one for elongating it, proportional to the length of the gap. Most programs allow the user to choose these parameters, which might have different optima for different protein types. The choice of gap opening and extension penalties will depend much on the purpose of the alignment. If, for instance, the alignment is to be used as input for the construction of a homology model, it may be beneficial to use a high gap opening penalty and a low gap extension penalty, since the quality of the model is sensitive to numerous insertion/deletion events (Abagyan and Batalov 1997).
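The two quantities described above, the log-odds score and the two-parameter gap penalty, can be written out as a short sketch. The logarithm base, the scaling factor, the default penalty values and the convention that the opening penalty covers the first gap position are all assumptions made for illustration; published matrices and alignment programs differ in these details.

```python
import math


def log_odds(observed, expected, scale=1.0):
    """Score for one residue pair: the logarithm of the odds ratio.

    `observed` and `expected` are replacement counts (or frequencies).
    Base 2 and `scale` are assumptions for illustration.
    """
    return scale * math.log2(observed / expected)


def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of `length` residues.

    The opening penalty is charged once and the extension penalty for every
    additional position (one common convention, assumed here).
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# A pair seen four times more often than chance scores positive,
# a pair seen at chance level scores zero.
favoured = log_odds(4, 1)
neutral = log_odds(1, 1)

# With a high opening and low extension penalty, one gap of four residues
# is far cheaper than four separate one-residue gaps.
one_long_gap = affine_gap_penalty(4)
four_short_gaps = 4 * affine_gap_penalty(1)
```

With these illustrative defaults a single four-residue gap costs 11.5 while four isolated gaps cost 40, which is the behaviour the text recommends for alignments intended as input to homology modelling.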

Databases

The huge amount of biological information that is collected by the scientific community worldwide is stored in large databases. The information in the databases may be as diverse as nucleotide sequences (Stoesser et al. 2001, Benson et al. 2002), metabolic pathways (Kanehisa 1997) or expression profiles from micro array experiments (DeRisi et al. 1997). The databases most relevant for the work in this thesis contain protein sequences and structures. Swissprot (Bairoch and Apweiler 2000) and PIR (Wu et al. 2002) are manually curated databases that contain annotated protein sequences of high quality. Since there are several protein databases, the need to combine them for a complete view of the protein universe has arisen. In that process, a high degree of redundancy is generated, since the databases contain overlapping information. The solution to that problem is to remove identical sequences of the same length or shorter versions of other entries in the database (Kallberg and Persson 1999). The major protein structure database is the Protein data bank (PDB) (Berman et al. 2000), where experimentally determined structures and theoretical models are stored. Also worth mentioning are the CATH (Orengo et al. 1997) and the SCOP (Murzin et al. 1995) databases, which contain structural classifications of proteins.

Database searching

The question posed when performing a database search is whether there are sequences homologous to the query sequence in the database. The problem is resolved in a similar fashion as in the sequence alignment case. As dynamic programming can be quite time consuming, several heuristic methods have been developed. These methods are not guaranteed to find the best possible solution. However, they are significantly faster and can therefore be used to search large databases.

FASTA

One of the more popular search methods is FASTA (Pearson and Lipman 1988). The FASTA algorithm is a fast heuristic approximation to the Smith-Waterman algorithm. The FASTA algorithm divides the query sequence into overlapping words, usually of length two for proteins or six for nucleic acids. Then, as each sequence is read from the database, it is also divided into its overlapping words. These two lists of words are compared to find the identical words in both sequences. High scoring sequences are subjected to a Smith-Waterman alignment within the region of the dot-plot defined by the concentrated identities and the window size.
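The word-matching step can be illustrated with a small sketch. It covers only the first stage described above (splitting both sequences into overlapping words and collecting identical words per dot-plot diagonal) and omits the scoring and the restricted Smith-Waterman alignment that follow in the real program. The function names `words` and `diagonal_hits` are hypothetical and chosen for this example.

```python
from collections import defaultdict


def words(seq, k=2):
    """Map every overlapping k-letter word in `seq` to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k=2):
    """Count identical words on each diagonal of the query/subject dot plot.

    A diagonal is identified by the offset (query position - subject
    position); many hits on one diagonal mark a region of concentrated
    identities.
    """
    query_index = words(query, k)
    counts = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in query_index.get(subject[j:j + k], ()):
            counts[i - j] += 1
    return dict(counts)


hits = diagonal_hits("ACDEFG", "XXACDEFG")
```

In this toy case all five shared words fall on the diagonal with offset -2, so only a narrow band around that diagonal would need to be examined by the slower dynamic programming step.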

BLAST

The BLAST algorithm (Altschul et al. 1990) uses a word-based heuristic similar to that of FASTA to approximate a simplification of the Smith-Waterman algorithm known as the maximal segment pairs algorithm. A major drawback of the method is that it does not allow gaps. The list of words is expanded with the addition of similar words in order to recover the sensitivity lost by matching only identical words. BLAST then examines the database sequences for words that exactly match any of the words on the expanded list.

A new gapped BLAST (Altschul et al. 1997) gets around the problem of not allowing gaps and still avoids the high computational cost of a full Smith-Waterman alignment on the pair of sequences, by building the alignment out from a central high scoring pair of aligned amino acids. Another improvement of the BLAST algorithm is PSI-BLAST (Altschul et al. 1997), which starts with a gapped BLAST search and creates a profile out of the sequences found. This profile is used as the basis for a new search, and the procedure is iterated until the search has converged and no new proteins are found, or the number of iterations reaches a number specified by the user. This method has been found to be very sensitive in picking out remote homologies (Jones and Swindell 2002).

Phylogenetic analysis

The objective of phylogenetic analysis is to trace evolution. This can be done for morphological features of organisms or, as favoured in later years, by tracing mutational events in the genetic material of organisms. The work included in this thesis is based solely on protein sequences and reflects primarily the evolution of new functions in protein families and not the birth of new species, although speciation events also are included in the trees. There are a few different methods available to construct a phylogenetic tree. The two most used methods are outlined below. Both start with the construction of a multiple alignment of the sequences in question and then follow the procedures described in each section.

Maximum parsimony

The phylogenetic trees are constructed so that the sequences are grouped together by the criterion of minimum replacements from the last common ancestor (Fitch 1971, Hartigan 1973). The method yields a number of low scoring trees. The search can in principle examine all possible trees to find the best one, but this is not feasible for more than about 20 sequences/taxa (Swofford 1998). Instead, a heuristic approach is usually used where the trees are built by stepwise addition of sequences to the growing tree. This method has difficulty accommodating the fact that multiple substitutions may occur at the same site (e.g. Ala → Gly followed by Gly → Ala at the same position later in time) (Nei 1996). Popular programs include PAUP (Swofford 1998) and PHYLIP (Felsenstein 1993). The method is used in paper I to construct the tree of the MDR superfamily.
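The minimum-replacement criterion can be made concrete with a sketch of the counting step for one fixed tree, following the set-intersection rule usually attributed to Fitch (1971). The nested-tuple tree encoding, the function names and the four toy sequences are assumptions made for this example. Searching over tree topologies, which is the expensive part discussed above, is not shown.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one column.

    `tree` is a leaf name (str) or a pair (left, right) of subtrees;
    `states` maps each leaf name to its residue in the column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The subtrees can agree on a state: no replacement needed here.
        return common, left_cost + right_cost
    # No shared state: one replacement must have occurred on this branch pair.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of replacements over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[col] for name, seq in alignment.items()})[1]
        for col in range(length)
    )


alignment = {"a": "AG", "b": "AG", "c": "GG", "d": "GA"}
grouped_ab = parsimony_score((("a", "b"), ("c", "d")), alignment)
grouped_ac = parsimony_score((("a", "c"), ("b", "d")), alignment)
```

Here the tree that groups the identical sequences a and b needs two replacements and the alternative needs three, so the first would be preferred under the parsimony criterion.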

Neighbour joining

The trees are constructed using pairwise distances of all sequences in the set. The final tree is put together so that the distance between two sequences in the tree (branch lengths) corresponds to the similarity between the two sequences. A drawback of the method is that the same distance may be obtained by comparing two different sequences (Saitou and Nei 1987). ClustalW (Thompson et al. 1994) and the already mentioned PAUP and PHYLIP are some of the most used programs based upon this methodology. The method is used in papers I and II for the tree construction.
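As an illustration of the input to such a method, the sketch below computes a pairwise distance matrix from aligned sequences. The simple proportion of differing positions (often called the p-distance) is an assumption chosen for clarity. Programs such as ClustalW may apply corrections for multiple substitutions, and the joining step that builds the tree from the matrix is not shown.

```python
def p_distance(s1, s2):
    """Fraction of differing residues between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    pairs = [(x, y) for x, y in zip(s1, s2) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Return {(name_a, name_b): distance} for every pair in the alignment."""
    names = list(alignment)
    return {
        (a, b): p_distance(alignment[a], alignment[b])
        for a in names
        for b in names
    }


alignment = {"ref": "ACDE", "x": "ACDF", "y": "GCDE"}
dist = distance_matrix(alignment)
```

The toy alignment also shows the drawback mentioned in the text: sequences x and y differ from the reference at different positions, yet both have the same distance to it, so the information about where the changes occurred is lost once the sequences are reduced to distances.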

Bootstrap analysis

Phylogenetic trees are usually tested for reliability by a bootstrap test (Felsenstein 1985). The test is made by constructing a tree based on a random resampling, with replacement, of the columns in the multiple alignment that the original tree is based on. This procedure is repeated multiple times and gives an estimate of the reliability as the number of times the analysis results in the same tree. A bootstrap value over 95% for a branch point is usually considered to be highly confident (Efron et al. 1996).
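The resampling step of the bootstrap test can be sketched as follows. This is a toy illustration only: the four sequences are invented, and the "tree" is reduced to the single closest pair of sequences (a hypothetical stand-in, `closest_pair`, for a real tree-building program), so that the support value simply counts how often that grouping is recovered.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, keeping rows aligned."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Toy stand-in for tree building: the pair with the fewest mismatches."""
    names = sorted(alignment)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = sum(x != y for x, y in zip(alignment[a], alignment[b]))
            if best is None or d < best[0]:
                best = (d, (a, b))
    return best[1]


def bootstrap_support(alignment, replicates=200, seed=0):
    """Fraction of resampled alignments that recover the original grouping."""
    rng = random.Random(seed)
    reference = closest_pair(alignment)
    hits = sum(
        closest_pair(bootstrap_columns(alignment, rng)) == reference
        for _ in range(replicates)
    )
    return reference, hits / replicates


# Invented example alignment (not data from the thesis).
alignment = {
    "s1": "ACDEFGHIKL",
    "s2": "ACDEFGHIKV",
    "s3": "MCDQFGWIRV",
    "s4": "MCNQYGWLRV",
}
pair, support = bootstrap_support(alignment)
```

In a real analysis each replicate would be passed through the same tree construction as the original alignment, and the support would be reported per branch point.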

Secondary structure prediction

Predicting the secondary structure of a protein is an important step towards the elucidation of its three-dimensional structure, as well as its function, in cases where the three-dimensional structure is not available, as is the situation for most membrane proteins and for very large proteins. The secondary structure prediction methods commonly try to predict whether a residue belongs to one of three states: α-helix, β-strand or coil (irregular parts and non-ordered structural segments). Some methods specialised to predict just one type of structure have recently been developed, most notably a β-turn predictor, which has the highest accuracy yet for a secondary structure prediction method (Shepherd et al. 1999). The first generation of secondary structure prediction methods was based upon the statistically inferred single amino acid propensity for a certain secondary structure element. The methods were not very successful, since the secondary structure is dependent upon both local and long-range interactions (cf. Rost and Sander 2000). The second generation of prediction methods was able to incorporate the local environment by calculating the propensities for segments of variable sizes (Chou and Fasman 1974). The accuracy of these methods reached around 60% but could not go higher. The breakthrough in accuracy came with the use of multiple alignments as the input to the prediction methods (Zvelebil et al. 1987), which led to the third generation of prediction methods. The multiple alignment includes evolutionary information about residue replacements, thus yielding indirect information about long-range interactions and the local environment. The top-performing methods today are based upon machine learning techniques. Neural networks are used in the two most prominent methods. The first method that broke the 70% barrier was PHD (Rost and Sander 1993), which used a multiple alignment as the input to a neural network trained upon sequences of known structure from the PDB. One of the largest advances in secondary structure prediction was PSIPRED (Jones 1999), which used position-specific profiles from PSI-BLAST searches as the input to a simple feed-forward neural network. This led to an increase in accuracy to roughly 75%, which is still the frontier in secondary structure prediction. To increase this further, long-range interactions also need to be considered. A first step in this direction can be seen in a study by Baldi et al. (1999), where a neural network based method is developed that can accommodate long-range interactions and scores as well as or better than the existing methods.

Molecular modelling

The structure of a protein determines its physical and chemical properties. Methods that utilize physical or chemical information to predict structural, and therefore implicitly functional, properties of a biological molecule are collectively termed molecular modelling methods (cf. Forster 2002). A broad division can be drawn between methods that predict the binding of a ligand to a macromolecular receptor (docking methods) and methods that aim at predicting the structure of a protein. A further division can be made among the structure prediction methods, between fast threading methods, which predict the fold of the protein without generating a three-dimensional model, and methods that construct a complete three-dimensional model of the protein by ab initio prediction or homology modelling. The speed of the threading methods makes them suitable for whole-genome prediction of folds or for identifying possible templates for such a more thorough investigation of the structure.


The physical properties of a molecule can be described by Schrödinger's equation. Unfortunately, it can be solved exactly only for one-electron systems (H, He+, Li2+). However, the concept can be extended by approximations to include larger systems, and this has developed into a discipline of its own, quantum chemistry (cf. Oldfield 2002). The approximation methods can be categorized as either ab initio or semiempirical. Semiempirical methods use parameters that compensate for neglecting some of the time-consuming mathematical terms in Schrödinger's equation and can be used on systems of up to a few hundred atoms (Stewart 1990), whereas ab initio methods include all such terms and can only handle a few dozen atoms (Schmidt et al. 1993). The parameters used by semiempirical methods can be derived from experimental measurements or by performing ab initio calculations on model systems.

Force field methods

The Schrödinger equation can be bypassed by writing the energy as a parametric function of the nuclear coordinates. This allows the dynamics of the atoms to be treated by classical mechanics. For time-independent phenomena, the problem reduces to the calculation of the energy at a given geometry (Jensen 1999). The foundation of this methodology is the observation that molecules tend to be composed of units that are structurally similar in different molecules. The O–H bond in methanol and in ethanol, for example, is of the same length and energy content. Force fields are a generalisation of the picture of molecules being composed of structural units, “functional groups”, which behave similarly in different molecules. Force fields take care of the different properties of atoms by assigning them to different atom types according to the element and the type of chemical bonding it is involved in. The MMFF force field (Halgren 1995a), used in, among other programs, ICM, which is extensively used in this thesis, has 40 types of carbon and 20 types of oxygen, depending on hybridisation state, bond number, bond partners and charge. The parameters assigned to each atom type are based upon empirical measurements or higher-order calculations (Chandrasekhar and van Gunsteren 2002).

The energy is written as a sum of terms, each a function describing the energy associated with a particular aspect of the molecule, including the energy of bond stretching, rotation around a bond, bending of bond angles and non-bonded atom–atom interactions, such as van der Waals or electrostatic interactions. This approach has the added advantage that it is quite simple to change the nature of the function for each term, and equally easy to add or remove terms from the total energy function (Abagyan 1993). Such an energy function of the nuclear coordinates makes it possible to calculate the geometry of the possible conformations of a molecule and their relative energies. Stable conformations correspond to minima on the potential energy surface and can be found by minimizing the energy as a function of the nuclear coordinates.
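The idea of an energy written as a sum of interchangeable terms can be sketched as below. The functional forms (harmonic bond, 12-6 van der Waals, Coulomb) are common textbook choices and the parameter values are invented; they are assumptions for illustration and are not the MMFF or ICM parameterisation.

```python
def bond_energy(r, r0, k):
    """Harmonic bond-stretching term; zero at the reference length r0."""
    return 0.5 * k * (r - r0) ** 2


def lennard_jones(r, epsilon, sigma):
    """12-6 van der Waals term with well depth epsilon."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)


def coulomb(r, q1, q2, scale=332.06):
    """Electrostatic term; scale converts e^2/Å to kcal/mol (assumed units)."""
    return scale * q1 * q2 / r


def total_energy(terms):
    """Sum a list of (function, arguments) pairs; terms are easy to add or drop."""
    return sum(func(*args) for func, args in terms)


# Illustrative (invented) parameters, energies in kcal/mol and distances in Å.
terms = [
    (bond_energy, (0.96, 0.96, 1100.0)),                 # bond at its reference length
    (lennard_jones, (2 ** (1 / 6) * 3.4, 0.2, 3.4)),     # pair at the vdW minimum
    (coulomb, (3.0, 0.4, -0.4)),                         # opposite partial charges
]
energy = total_energy(terms)
```

Because each term is an independent function, replacing the van der Waals form or removing the electrostatic term only changes the list passed to `total_energy`.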

Local minimisation techniques

Local minimisation techniques are employed to find the nearest minimum, which may or, more probably, may not correspond to the global minimum.

Steepest descent

The gradient vector points in the direction in which the energy function increases most, i.e. the function value can always be lowered by stepping in the negative gradient direction. Function evaluations are made along the negative gradient direction until the energy increases and a temporary minimum along that direction is located. At this point a new gradient is calculated and used for the next search (Morse and Feschbach 1953). This method will always lower the function value and is guaranteed to approach a minimum. However, two main problems are associated with this technique. First, two subsequent searches are made perpendicular to each other, and each round will therefore partially undo the decrease of the function value achieved in the previous minimisation step. Second, as the minimum is approached, the rate of convergence decreases and the method will never actually reach the minimum; it will crawl towards it at an ever-decreasing speed. Despite its obvious drawbacks, the method is simple, robust and guaranteed to lower the function value, which may explain its widespread use.
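Both properties described above (perpendicular successive searches and slow crawling towards the minimum) can be seen in a minimal sketch on a toy quadratic energy f(x) = ½ xᵀAx. The matrix, the starting point and the use of an exact line search are assumptions chosen for illustration; a molecular force field would need a numerical line search instead.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def energy(A, x):
    """Toy quadratic energy f(x) = 0.5 * x^T A x with its minimum at the origin."""
    return 0.5 * dot(x, matvec(A, x))


def steepest_descent(A, x, steps):
    """Repeated exact line searches along the negative gradient -A x."""
    points = [list(x)]
    for _ in range(steps):
        d = [-g for g in matvec(A, x)]          # negative gradient
        dd = dot(d, d)
        if dd == 0.0:                           # already at the minimum
            break
        alpha = dd / dot(d, matvec(A, d))       # minimum along this direction
        x = [xi + alpha * di for xi, di in zip(x, d)]
        points.append(x)
    return points


# Elongated valley: curvature differs tenfold between the two coordinates.
A = [[1.0, 0.0], [0.0, 10.0]]
points = steepest_descent(A, [10.0, 1.0], 10)
energies = [energy(A, p) for p in points]
step0 = [b - a for a, b in zip(points[0], points[1])]
step1 = [b - a for a, b in zip(points[1], points[2])]
```

On this surface the energy falls at every step, consecutive steps are at right angles, and after ten steps the minimum has still not been reached.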

Conjugate gradient

The method is very similar to the previous method but tries to avoid the partial undoing of the previous step by taking each step except the first in a direction that is a mixture of the current negative gradient and the previous search direction. The next step is thus taken in a direction that is conjugate to the previous direction, hence the name (Polak 1971).
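A minimal sketch of the mixing of the new negative gradient with the previous search direction is given below for the toy quadratic energy f(x) = ½ xᵀAx. The mixing factor used here is the Fletcher–Reeves ratio, which is one common choice and an assumption of this example; the matrix and starting point are likewise invented.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, x, steps):
    """Minimise f(x) = 0.5 * x^T A x (A symmetric positive definite)."""
    r = [-g for g in matvec(A, x)]      # negative gradient at the start
    d = list(r)                         # the first step is plain steepest descent
    for _ in range(steps):
        rr = dot(r, r)
        if rr < 1e-24:                  # gradient has vanished
            break
        Ad = matvec(A, d)
        alpha = rr / dot(d, Ad)         # exact minimum along direction d
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        beta = dot(r, r) / rr           # Fletcher-Reeves mixing factor
        d = [ri + beta * di for ri, di in zip(r, d)]
    return x


A = [[1.0, 0.0], [0.0, 10.0]]
x_min = conjugate_gradient(A, [10.0, 1.0], 2)
```

For a quadratic surface in two variables the minimum is reached after two steps, whereas plain steepest descent from the same starting point is still zigzagging after ten.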

Global optimisation techniques

The previous methods can only locate the minimum nearest to the starting conformation, as they mimic a marble's path down a funnel and find the lowest nearby spot in the energy landscape. If the desired minimum is the global minimum, the deepest valley in the energy landscape, other methods have to be employed. The optimal way to find the global minimum would be to sample all possible conformations and calculate the energy of each. This systematic approach is not practically feasible with more than 10–12 free variables, since the number of possible conformations increases exponentially with the number of free variables, and so does the computational time of the analysis. Large biomolecular systems, such as proteins, lipid membranes or nucleic acids, cannot be studied by a systematic approach. In order to investigate them, methods have been developed that sample a large number of possible conformations in order to locate minima. Although they are not guaranteed to find the global minimum, they often find a local minimum that is close to the global minimum in terms of energy.

Monte Carlo (MC)

The search starts from a given conformation. In each MC step the position of one or more atoms is randomly changed. The new conformation is accepted as the starting point for the next perturbation if its energy is lower than the energy of the previous conformation. Steps that generate conformations with higher energy than the previous one are evaluated by calculating the Boltzmann factor (Eq. 1, where T is the temperature in Kelvin, kB is Boltzmann's constant and ∆E is the change in energy) and comparing it to a random number between 0 and 1.

\[ \text{Boltzmann factor} = e^{-\Delta E / k_B T} \qquad \text{(Eq. 1)} \]

If the Boltzmann factor is greater than the random number, the conformation is used as the starting conformation for the next step in the Monte Carlo procedure; otherwise it is rejected. This allows the calculation to step out of local minima and continue the search for the global minimum (Hammersly 1960). A further enhancement of the methodology is the inclusion of information on preferred side-chain angles for amino acids in proteins. Each amino acid has a set of discrete side-chain conformations, rotamers, that are energetically more favourable (Brändén and Tooze 1999). By including them in the search algorithm one obtains the biased probability Monte Carlo (BPMC) search algorithm, which samples the conformational space more efficiently (Abagyan and Totrov 1994).
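The acceptance rule can be sketched on a one-dimensional toy problem. The double-well energy function, the reduced temperature `kT`, the move size and the seed are all assumptions chosen for illustration; a real conformational search would perturb atomic coordinates or torsion angles and use a force field energy.

```python
import math
import random


def energy(x):
    """Toy double well: shallow minimum near x = +1, deeper minimum near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def metropolis(x, kT, steps, max_move=0.5, seed=1):
    """Monte Carlo search using the Boltzmann-factor acceptance test."""
    rng = random.Random(seed)
    e = energy(x)
    best_x, best_e = x, e
    for _ in range(steps):
        trial = x + rng.uniform(-max_move, max_move)   # random perturbation
        e_trial = energy(trial)
        delta = e_trial - e
        # Downhill moves are always accepted; uphill moves are accepted when
        # the Boltzmann factor exceeds a random number between 0 and 1.
        if delta <= 0.0 or math.exp(-delta / kT) > rng.random():
            x, e = trial, e_trial
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


# Start in the shallow well; kT is in the same (arbitrary) units as the energy.
best_x, best_e = metropolis(1.0, kT=0.3, steps=5000)
```

Starting in the shallow well, the occasional acceptance of uphill moves lets the search cross the barrier and locate the deeper well, which a purely downhill minimiser started at the same point would not reach.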


Molecular dynamics (MD)

MD methods solve Newton's equation of motion (Eq. 2), where the force F is determined by the mass, m, and the acceleration, a, for atoms on an energy surface (Alder and Wainwright 1957).

\[ F = m\,a \qquad \text{(Eq. 2)} \]

The available energy is distributed between potential and kinetic energy, and molecules are thus able to overcome barriers separating minima if the barrier height is less than the total energy minus the potential energy. Newton's equation requires small time steps (femtosecond level) to be integrated, which makes the simulated time short, nowadays usually in the nanosecond range, even though the microsecond barrier has been reached for a 36-residue peptide with a Cray T3E supercomputer (Duan and Kollman 1998). This, together with the use of physiologically relevant temperatures, means that in practice only the local area around the starting point is sampled and that only relatively small barriers can be overcome. Molecular dynamics simulations have evolved from the first simulation of a protein in vacuum (McCammon et al. 1977) to include realistic descriptions of solvent effects (Berendsen et al. 1981). This allows MD simulations to give insights into the natural dynamics of biomolecules in solution on different timescales (Hansson et al. 2002). Popular implementations are the CHARMM, AMBER and GROMACS (Lindahl et al. 2001) packages. The GROMACS package was used in paper VI.
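How small time steps are used to integrate Newton's equation can be sketched for a single particle on a harmonic energy surface. The velocity Verlet scheme shown is a widely used integrator but an assumption of this example, as are the reduced units, the force constant and the step size; it is not taken from the GROMACS setup of paper VI.

```python
import math


def force(x, k=1.0):
    """Force on a particle in a harmonic well, F = -dE/dx with E = 0.5*k*x^2."""
    return -k * x


def velocity_verlet(x, v, m, dt, steps, k=1.0):
    """Integrate F = m*a with the velocity Verlet scheme."""
    a = force(x, k) / m
    trajectory = [(x, v)]
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt * dt      # advance position
        a_new = force(x, k) / m                 # acceleration at the new position
        v = v + 0.5 * (a + a_new) * dt          # advance velocity
        a = a_new
        trajectory.append((x, v))
    return trajectory


def total_energy(x, v, m, k=1.0):
    """Kinetic plus potential energy of the particle."""
    return 0.5 * m * v * v + 0.5 * k * x * x


# Reduced units: unit mass and force constant, 1000 steps of length 0.01.
trajectory = velocity_verlet(x=1.0, v=0.0, m=1.0, dt=0.01, steps=1000)
e_start = total_energy(*trajectory[0], m=1.0)
e_end = total_energy(*trajectory[-1], m=1.0)
```

With a sufficiently small step the total energy stays nearly constant while it is continuously exchanged between the kinetic and the potential part; a larger step degrades this conservation, which is why femtosecond steps are needed for atomic systems.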

Simulated annealing

In simulated annealing the starting temperature is chosen to be high, 2000–3000 K, and is then gradually decreased during an MC or MD run. This allows an initially extensive sampling of the conformational space of the molecule, and as the temperature decreases the molecule is trapped in a minimum. If the decrease were infinitely slow, the molecule would be trapped in the global minimum, but that would require infinite computational time as well (Kirkpatrick et al. 1983).
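A cooling schedule combined with an MC run can be sketched as follows. The one-dimensional double-well energy, the geometric cooling schedule and the reduced temperatures are assumptions for illustration and do not correspond to the 2000–3000 K protocol mentioned above.

```python
import math
import random


def energy(x):
    """Toy double well: shallow minimum near x = +1, deeper minimum near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def simulated_annealing(x, kT_start=2.0, kT_end=0.01, steps=20000,
                        max_move=0.5, seed=2):
    """Monte Carlo run in which the temperature is lowered geometrically."""
    rng = random.Random(seed)
    e = energy(x)
    cooling = (kT_end / kT_start) ** (1.0 / steps)   # per-step cooling factor
    kT = kT_start
    for _ in range(steps):
        trial = x + rng.uniform(-max_move, max_move)
        e_trial = energy(trial)
        # Same acceptance test as in plain MC, but with a falling temperature.
        if e_trial <= e or math.exp(-(e_trial - e) / kT) > rng.random():
            x, e = trial, e_trial
        kT *= cooling
    return x, e


x_final, e_final = simulated_annealing(1.0)
```

Early in the run almost every move is accepted and both wells are visited; towards the end only downhill moves survive and the system settles in one minimum. With a finite schedule this is not guaranteed to be the deeper one.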

Docking calculations

The solution to the docking problem in computational biology is the bound conformation of a substrate to a macromolecular receptor, together with a correct estimate of the binding energy. Reliable docking depends on both an adequate scoring function, whose global minimum corresponds to the biologically relevant complex, and an optimisation algorithm that consistently finds that global minimum.

Rigid body dockings

The earliest docking methods used rigid body docking techniques (Kuntz et al. 1982), where both the receptor and the ligand are represented as rigid bodies. In this approach the docking problem is simplified to a search for surface complementarity. With the rigidity of the molecules, the search space is reduced to three variables that determine the rotational and translational movements of the ligand. Even this three-dimensional search space is hard to cover completely with a brute-force approach, and stochastic search schemes such as Monte Carlo or simulated annealing techniques are often used.

Geometric surface models and data structures are used in order to find reasonable binding modes, and heuristic cost functions, mostly rating some notion of geometric fitness, are used to rank candidate complexes.

The first automated docking program was DOCK (Kuntz et al. 1982), which in its later versions (Ewing et al. 1997) uses a search algorithm derived from graph theory that superimposes ligand atoms onto predefined site points that map the negative image of the binding site. The hits are scored for the intermolecular interactions, based on the intermolecular terms of a molecular mechanics force field.

The gain of using this method is the swiftness of the process, which allows large databases of compounds to be screened against a receptor (Schneider and Böhm 2002). A drawback of the method is that the ligand and receptor have to be co-crystallized to achieve high efficiency (Lengauer and Rarey 1996).

Flexible ligand – rigid receptor

As more powerful computers were developed, the difficulty of the docking problem was increased by allowing internal rearrangements of the ligand through rotation around bonds. These techniques involve a flexible ligand and a rigid receptor. This allows for the discovery of novel substrates for a receptor at the expense of more computationally demanding calculations (Lengauer and Rarey 1996). The free variables added to the calculations are the rotations around bonds in the ligand. In the rigid body approach, three variables were needed to rotate and position the substrate; each rotatable bond increases the dimension of the problem by one. Thus, for a ligand with two rotatable bonds, there are five variables to optimise. As the ligands are often small with relatively few variables, the speed of the dockings is high and a large number of possible ligands can be tested. The docked conformations are scored by semi-empirically derived, force field based energy functions (Sippl 1995, Halgren 1995b). In order to speed up the scoring of the docked conformations, the macromolecular receptor can be represented by grids that contain the affinities for each atom type and the possibilities for electrostatic interactions (Morris et al. 1996, Totrov and Abagyan 1996). The gain is that the grids allow fast calculation of the score of the docked conformation. Usually Monte Carlo algorithms (Morris et al. 1996) or genetic algorithms (Jones et al. 1997, Morris et al. 1998) are used to sample the possible conformations efficiently. Some popular programs of this type are AutoDock (Goodsell and Olsen 1990), FLEXX (Rarey et al. 1996) and DOCK version 4 (Ewing et al. 2001).
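The speed-up obtained from receptor grids can be sketched as follows: the receptor potential is tabulated once, and each ligand pose is then scored by table look-ups. The probe potential, the nearest-grid-point look-up and all coordinates are assumptions made for illustration; programs such as AutoDock use one grid per atom type and interpolate between grid points.

```python
import math


def probe_potential(r, r_min=3.5):
    """Illustrative 12-6 style potential with its minimum (-1) at r_min."""
    r = max(r, 0.5)                       # clamp to avoid division by zero
    s6 = (r_min / r) ** 6
    return s6 * s6 - 2.0 * s6


def build_grid(receptor_atoms, origin, spacing, shape):
    """Precompute the summed receptor potential at every grid point."""
    grid = {}
    for i in range(shape[0]):
        for j in range(shape[1]):
            for k in range(shape[2]):
                point = (origin[0] + i * spacing,
                         origin[1] + j * spacing,
                         origin[2] + k * spacing)
                grid[(i, j, k)] = sum(
                    probe_potential(math.dist(point, atom))
                    for atom in receptor_atoms
                )
    return grid


def grid_score(grid, origin, spacing, ligand_atoms):
    """Score a pose by nearest-grid-point look-up; no loop over receptor atoms."""
    total = 0.0
    for atom in ligand_atoms:
        index = tuple(round((c - o) / spacing) for c, o in zip(atom, origin))
        total += grid[index]              # KeyError if the atom leaves the grid
    return total


# Invented one-atom receptor and two-atom ligand pose on a small grid (Å).
receptor = [(0.0, 0.0, 0.0)]
origin, spacing = (0.0, 0.0, 0.0), 0.5
grid = build_grid(receptor, origin, spacing, shape=(12, 1, 1))
pose = [(3.5, 0.0, 0.0), (4.0, 0.0, 0.0)]
score = grid_score(grid, origin, spacing, pose)
direct = sum(probe_potential(math.dist(a, receptor[0])) for a in pose)
```

The cost of scoring a pose then depends only on the number of ligand atoms, not on the size of the receptor, which is what makes the screening of many ligands and poses fast.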

Flexible ligand – flexible receptor

A natural extension of the above technique is to introduce flexibility in the receptor as well. A first step is to allow rearrangement of the protein's side-chains. This has allowed the successful docking of new classes of substrates to receptors (Totrov and Abagyan 1994, Totrov and Abagyan 1997). The extra free variables introduced to the system make the approach unfeasible for scanning large compound databases. To reduce the computational need, side-chains located at a distance from the active site are often fixed (papers III and IV).

All these methods have trouble dealing with induced fit mechanisms, common in some enzyme families (Najmanovich et al. 2000). A recent paper estimates that 50% of the imaginable protein–ligand complexes are not reachable by today's techniques due to induced fit mechanisms (Österberg et al. 2002). This can be somewhat circumvented by using receptor structures already in the active conformation, which are co-crystallized with a substrate. Computational techniques that address the problem have been developed. One is based upon selecting hinge regions in the proteins that can, as the name suggests, function as hinges in the receptor and thus allow opening and closing movements of the receptor (Sandak et al. 1998). Another technique, which is a development of the popular AutoDock software, creates grids based upon both the bound complex and the free receptor and thus accounts for the conformational change upon binding of substrate (Österberg et al. 2002).


Comments on docking methods

The two most attractive techniques are the grid docking methods, for their swiftness, and the flexible ligand – flexible receptor methods, for their thoroughness. The success of grid docking requires an active conformation of the active site, as it cannot accommodate large rearrangements. Smaller rearrangements can be incorporated by allowing a certain overlap of ligand and receptor, which can later be relieved by an all-atom refinement of the top-scoring complexes. The need for larger or multiple side-chain rearrangements can only be accommodated by the flexible ligand – flexible receptor technique. Thus, this technique is often suited for larger substrates, since they quite naturally often require more side-chain movements to fit in the active site. The same is true for dockings of substrates into models, since the side-chains are most likely not in an optimal placement for ligand binding. This was demonstrated in the recent docking of carcinogenic dibenzopyrene diol epoxides to glutathione transferases (Dreij et al. 2002).

Homology modelling

Homology modelling or comparative modelling methods, first reported by Browne et al. (1969), make it possible to predict the three-dimensional structure of a protein by using information derived from a homologous protein of known structure (Sali and Blundell 1993, Sanchez and Sali 1997, Cardozo et al. 1995) without attempting the laborious task of ab initio structure prediction.

The basis for this approach is that two proteins that have evolved from the same ancestral protein retain the overall fold. The fold is more conserved through evolution than the sequence of a protein. It acts as a scaffold upon which new functions can be attached by changing the amino acid sequence. This has led to the evolution of superfamilies such as MDR (Persson et al. 1994) and SDR (Jany et al. 1984, Persson et al. 1991, Jörnvall et al. 1995), studied in this thesis, with a conserved overall fold and many divergent functions.

In order to construct a homology model, the query sequence has to be aligned with one or more homologous proteins to be used as templates in the modelling process. If the sequence identity between query and template falls below 30%, the quality of the alignment will deteriorate (Venclovas et al. 1999), yielding more unreliable models. There are two main methods for the transfer of three-dimensional coordinates from the template to the query. Fragment-based homology modelling methods identify structurally conserved regions through the alignment. These conserved fragments usually correspond to secondary structure elements, and the coordinates of the templates are copied to them. The variable parts correspond to a large extent to loop structures, whose conformations are investigated through searches in a structural database of loop structures (Blundell et al. 1987, Levitt 1992). The restraint-based homology modelling methods use the alignment to infer geometrical restraints, such as limits on distances between Cα atoms and ranges of backbone and side-chain angles. These restraints are then combined with an energy function to obtain a scoring function that is used to calculate the resulting structure of both the structurally conserved regions and the loop regions (Sali and Blundell 1993, Cardozo et al. 1995). This has the advantage that the structure generation is not divided into two separate stages, and the prediction method is thus more robust (Forster 2002). The hardest problem is the correct prediction of the loop regions, which has all the problems associated with ab initio structure prediction, although limited in size and with defined end points. The accuracy of loop predictions has been shown to increase when a deformation zone is taken into account by extending the loops a few residues in each direction from the gap in the alignment. The procedure can accommodate the change in the local environment caused by the deviation from the fold of the template structure (Abagyan et al. 1997).

If the sequences are distantly related, additional information to assist in the alignment procedure is of great value. A multiple alignment of related proteins, even if they lack experimentally verified structures, adds information on conserved sequence stretches that may correspond to secondary structure elements and on regions that allow insertions and deletions, which may hint at loop regions. Similarly, secondary structure predictions may assist the assignment of equivalent secondary structure elements between the query and the template; in addition, information about the placement of gaps can be inferred from the predictions. This strategy has been used in paper IV.
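The 30% sequence identity threshold mentioned above can be checked with a few lines of code once a pairwise alignment is available. The definition used here (identical residues divided by the number of alignment columns without gaps) is one of several in use and is an assumption of this sketch, as are the two short example sequences.

```python
def percent_identity(aligned_query, aligned_template, gap="-"):
    """Percentage of identical residues over the gap-free alignment columns."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (q, t)
        for q, t in zip(aligned_query, aligned_template)
        if q != gap and t != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(q == t for q, t in pairs)
    return 100.0 * matches / len(pairs)


# Invented toy alignment: five gap-free columns, four of them identical.
identity = percent_identity("ACDE-F", "ACQEGF")
needs_extra_care = identity < 30.0
```

A template falling below the threshold would be a signal to bring in the additional information described above, such as multiple alignments and secondary structure predictions.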


Aim of the study

The aim of the study was to apply different bioinformatic and biocomputational methods to medium-chain dehydrogenases/reductases (MDR), short-chain dehydrogenases/reductases (SDR) and biologically active peptides, with an emphasis on methods that utilise structural information, in order to study and characterise proteins within these families. More specifically, the aims were:
• To characterise the MDR superfamily in complete genomes (papers I and II)
• To explain why isoUDCA is a substrate for γγ ADH but not for ββ ADH (paper III)
• To obtain a model of human type 10 17β-hydroxysteroid dehydrogenase and use it as a basis to explain the binding properties of steroid substrates to the protein (paper IV)
• To characterise the role of a conserved residue in the SDR fold (paper V)
• To analyse the stability of secondary structure elements in amyloidogenic peptides (paper VI)
• To compare the sequence variation between species of peptide hormones, including the C-peptide of proinsulin (paper VII)


Project overview

Genome comparison and modelling of active sites in the MDR family (paper I)

The availability of completed genomes provides a great opportunity to analyse all members of enzyme families. We have therefore studied the medium-chain dehydrogenases/reductases (MDR) enzymes in all eukaryotic genomes available and, for comparison, the Escherichia coli genome.

The genomes were searched for MDR members using FASTA (Pearson and Lipman 1988) with known MDR proteins as query sequences. Multiple sequence alignments were calculated using ClustalW (Thompson et al. 1994). The evolutionary tree, constructed from the aligned MDR sequences from the six genomes, allows distinction of several subgroups (schematic representation in Fig. 2). Three families are formed by dimeric alcohol dehydrogenases (ADH; originally detected in animals/plants), cinnamyl alcohol dehydrogenases (CAD; originally detected in plants) and tetrameric alcohol dehydrogenases (YADH; originally detected in yeast). Although similar in name, those three families are clearly different and separated structurally. Three further families are centred on forms initially detected as mitochondrial respiratory function proteins (MRF), acetyl-CoA reductases of fatty acid synthases (ACR), and leukotriene B4 dehydrogenases (LTD). The two remaining families, with polyol dehydrogenases (PDH; originally detected as sorbitol dehydrogenase) and quinone reductases (QOR; originally detected as ζ-crystallin), are also distinct but with variable sequences.

The most common family in the human genome is ADH with nine members (Fig. 4) (further investigated in paper II), followed by QOR with seven members. A. thaliana has a similar pattern, but also has a large number of CAD and LTD members. In the unicellular organisms, PDH is the most common family, pointing at different living conditions between unicellular and multicellular organisms.


[Figure 4: pie charts showing the share of each MDR family (ADH, CAD, YADH, MRF, ACR, LTD, PDH, QOR and others) in each genome. Total number of members: H. sapiens 23, A. thaliana 38, D. melanogaster 10, C. elegans 13, S. cerevisiae 15, E. coli 17.]

Fig. 4. Distribution of MDR members among the different organisms investigated; the total number of members in each genome is also given.

The quinone reductase family members are homologous to the synaptic vesicle protein VAT-1, indicating a possible neuronal function for the whole family. The polyol dehydrogenases are numerous in the E. coli genome, contributing half of the MDR forms, emphasising their importance in carbohydrate metabolism. Molecular modelling of the active sites was performed for members of the different subgroups in order to give estimates of the geometry, hydrophobicity and volume of the substrate-binding pockets. The results are summarised in Table 1.

Table 1. Active site characteristics (volume and hydrophobicity index) for the investigated families and common functions of the members. Volume in parentheses is for all but one member.

Family  Volume (Å³)        Hydrophobicity index  Function
PDH     77–257 (144–257)   –1.28 to +0.77        Polyol dehydrogenases; threonine 3-dehydrogenase
CAD     160–248            –0.69 to +0.49        Mannitol dehydrogenase; cinnamyl-alcohol dehydrogenase
QOR     124–289            –0.56 to +1.47        Quinone oxidoreductase; synaptic vesicle membrane protein VAT-1 homologue
MRF     168–243            –1.48 to –0.18        Mitochondrial respiratory function protein
LTD     161–291            –1.65 to –0.02        NADP-dependent leukotriene dehydrogenase
ACR     140–189            +0.64 to +1.08        Fatty acid synthase


The volumes of the active sites are more or less in the same range, with a broad span of volumes for each family, except for the narrow range of the ACR family, explained by few members with a high degree of sequence similarity. MRF and LTD clearly prefer hydrophilic substrates while ACR prefers hydrophobic substrates, quite natural considering its role in fatty acid synthesis. The variation of the hydrophobicity in PDH, CAD and QOR indicates a wide variety of substrates for the proteins in these groups.

The sequence comparisons and subdivisions make it possible to define sequence patterns useful for characterisation of MDR members. For QOR, a Prosite pattern (Hofmann et al. 1999) already exists (PS01162). However, this pattern only finds 5 of our presently recognised 18 QOR members. Based upon the sequences now available, we propose a new pattern that will better detect the QOR members (Table 2). It finds 15 of the QOR members, which is three-fold more than the existing Prosite pattern, and it misses only 3 QOR forms, while detecting no false positives. The patterns for the PDH and the CAD families are based on residues that bind the catalytic zinc and the substrate. The PDH pattern (Table 2) is somewhat complex but captures all the different substrate specificities of this group and gives only one false positive when the pattern is screened against Swissprot. The MRF, LTD and ACR patterns (Table 2) are highly specific. They find no false positive matches and miss none of our members.

The patterns are useful for proper recognition of new genomic sequences. They allow rapid annotation into the different families of the MDR superfamily from the huge amounts of sequences generated by ongoing genome projects. They are also ideal for finding particular enzymes in the ever-increasing sequence databases.

Table 2. Sequence patterns and screening results. The column hits gives the number of MDR forms detected in the six genomes investigated, fp (false positives) gives the number of non-members detected in Swissprot, and fn (false negatives) gives the number of proteins classified as members but not found by the pattern.

Family  hits  fp  fn  Pattern
QOR     15    0   3   [GAS]-x-N-x(2)-[DEN]-x(5)-G-x(6,19)-[PS]-x(3)-[GA]-x-[ED]-x(2)-G-x-[VIL]-x(3)-G
MRF     6     0   0   L-x(6)-[VL]-T-Y-G-G-M-[SA]-[KR]
PDH     22    1   0   [GA]-[VIL]-[CS]-[GN]-[STA]-D-[VILMS]-[HKP]-x(14,27)-G-H-[ED]-x(2)-G-x-[VI]-x(10,12)-G-[DEQ]-x-[IV]
CAD     12    0   0   C-G-x-C-x(2)-D-x(17)-G-H-E
LTD     16    0   0   D-x-[YF]-x-[DE]-N-V-G-[GS]-x(3)-[DEN]
ACR     5     0   0   W-x(5)-W-x(8)-P-x(2)-Y-x(3)-Y-Y
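Patterns written in this notation can be screened against sequences by translating them into regular expressions. The sketch below handles only the elements that occur in Table 2 (single residues, `x`, bracketed alternatives and repeat counts); exclusions and terminus anchors of the full Prosite syntax are not covered, and the function name `prosite_to_regex` and the test peptide are invented for illustration.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a compiled regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        core = "." if residue == "x" else residue   # x means any residue
        if low is None:
            parts.append(core)
        elif high is None:
            parts.append(f"{core}{{{low}}}")         # fixed repeat, e.g. x(6)
        else:
            parts.append(f"{core}{{{low},{high}}}")  # range, e.g. x(6,19)
    return re.compile("".join(parts))


# Patterns taken from Table 2.
mrf = prosite_to_regex("L-x(6)-[VL]-T-Y-G-G-M-[SA]-[KR]")
cad = prosite_to_regex("C-G-x-C-x(2)-D-x(17)-G-H-E")

# Invented peptide that contains the MRF motif, for demonstration only.
hit = mrf.search("AAALAAAAAAVTYGGMSKAA")
```

Screening a sequence database then amounts to running `search` over every entry and counting hits, false positives and false negatives against a curated list of family members.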


Differential multiplicity of MDR alcohol dehydrogenases (paper II)

The originally purified ADH enzyme types (yeast, Drosophila, liver and plant) were screened for in the genomes of human, Arabidopsis thaliana, Saccharomyces cerevisiae and Drosophila melanogaster, and were compared, where relevant, to corresponding forms in mouse, pea, Escherichia coli and Caenorhabditis elegans. The screenings revealed the presence of ten human, ten Arabidopsis, five yeast and one Drosophila MDR-ADH genes. Of the human genes, three had properties like pseudogenes (recognized by unusual exon/intron patterns, combined with a lack of upstream promoter elements such as TATA, CAAT or GC boxes, and with deviating chromosome localizations), and the remaining seven genes correspond to the enzymes of classes I (three genes) and II–V (one gene each) (Jörnvall and Höög 1995, Duester et al. 1999). These seven genes are all on chromosome 4 (Fig. 5), compatible with the early assignments (Duester et al. 1985).

Fig. 5. Gene organisation of ADH on chromosome 4.

The screenings also show that ethanol-active MDR-ADH, when present, appears to be of multiple occurrence. The multiplicity (Jörnvall et al. 2000) and the general occurrence of ethanol activity (Fernández et al. 1995) have been noticed before, but can now be correlated with enzyme type and with eukaryotic type of organism. These and other functional conclusions are best noticed when the MDR-ADH enzymes are divided into the two constituent families of dimeric and tetrameric ADHs, initially represented by the liver and yeast ADHs, respectively.

The dimeric ethanol-active ADHs diverge into at least three groups, corresponding to the formaldehyde-active class III enzyme from all species, the ethanol-active non-class-III enzymes from animals, and the ethanol-active non-class-III enzymes from plants (Fig. 6).


Fig. 6. Evolutionary tree of dimeric ADH. Tree created in ClustalW and viewed in TreeView.

Several functional conclusions regarding these enzymes can be drawn from the evolutionary tree:
• Only plants and vertebrates appear to have the ethanol-active dimeric MDR-ADHs.
• Extensive multiplicity occurs in both the vertebrate and the plant groups of these enzymes.
• The patterns in the non-class-III groups from vertebrates and plants show some similarities, with both early and late branching frequently at similar levels.

A comparison of animal and plant lines of MDR-ADH, with species variants included, shows that the time estimates, calculated in two different manners for the species separation, in the class III line are too distant, falsely suggesting the class III species separations to be further away than those for classes I and P and also more distant than accepted radiation nodes of higher vertebrates. Hence, the constant speed expected for the class III evolution may be slightly higher than calculated before (Cañestro et al. 2002), or class III may evolve faster in higher eukaryotes, perhaps giving novel enzymogenesis and additional functions for class III in higher eukaryotes, as already known for class I with its pattern of recent isozyme emergence (Jörnvall and Höög 1995, Duester et al. 1999).

The present pattern, with the distinction towards absence of dimeric ADH between higher/lower eukaryotes, and with partly parallel patterns in plants and animals, may suggest that a specific function is associated with ADH in higher eukaryotes. If so, dimeric ADH may have key functions in higher eukaryotes rather than just general detoxications (Jörnvall et al. 2000).

A similar visualization of the tetrameric family shows considerably less multiplicity and places yeast ADH with ADH of C. elegans, and with some E. coli and A. thaliana forms. Thus, C. elegans and A. thaliana have both dimeric and tetrameric ADH enzymes.

Molecular modelling and docking of bile acids to γγ and ββ ADH class I (paper III)

A three-dimensional model of the class I alcohol dehydrogenase (ADH) γ subunit was obtained by adapting its amino acid sequence to the known fold of horse liver alcohol dehydrogenase (PDB code 3bto) using the program ICM (Molsoft LLC, Abagyan et al. 1994). For docking studies using the human class I ADH ββ dimer, the structure was taken from the PDB file 1deh. Docking of bile acids to the dimeric enzyme (ββ structure and γγ model) was performed using a non-rigid procedure, allowing free movement of the substrate, of the rotatable bonds of the substrate and of the χ angles of the residues within 7 Å of the active site, with additional distance restraints of 2.0–2.4 Å between the O atom at C-3 of the bile acids and the catalytic zinc, and between the H atom at C-3 of the bile acids and C-4 of the nicotinamide of NAD+.

Molecular modelling of the human class I γ ADH subunit and docking calculations demonstrated that iso-ursodeoxycholic acid (isoUDCA) fits into the active site tightly surrounded by mainly hydrophobic residues (Fig. 7), and that it interacts with the catalytic residues in a favourable way (Table 3). For comparison, similar docking calculations were performed with UDCA. This substrate could not form the interactions crucial for catalysis (Fig. 7, Table 3). If the 3α-hydroxyl group was forced to fit at the active site, the rest of the steroid was tilted in comparison to the 3β-hydroxyl epimer and positioned away from the active-site channel, compatible with the fact that this conformation is not suitable for efficient catalysis.

The same calculations were also performed for the human class I ββ ADH, where the presence of Thr48 (instead of Ser48 as in the γγ isozyme) caused inefficient binding, in accordance with kinetic data. Distances crucial for catalysis are collected in Table 3. Distances required for catalysis, in the range of 2–2.4 Å, are obtained only by isoUDCA in γγ ADH.

Fig. 7. isoUDCA (dark gray) and UDCA (light gray) docked into the active site of γγ ADH. Residues within 5 Å of the bile acids are displayed as stick models with their surfaces shown transparently. Residues F93, T94, Y110, N114 and L116 were omitted for increased visibility. Arrows show catalytic interactions between the ligand, Oγ of S48 and C4 of NAD+. The picture was created in ICM version 2.8 (Molsoft LLC, San Diego, CA).

Table 3. Distances in Ångström obtained from the docking calculations for isoUDCA and UDCA

Isozyme   Substrate   Oγ(Ser48)–H(C3-OH)   Zn–O(C3-OH)   C4(NAD+)–H(C3)
ADH ββ    UDCA        2.31                 2.51          2.94
ADH ββ    isoUDCA     3.35                 2.24          1.94
ADH γγ    UDCA        3.20                 2.88          3.53
ADH γγ    isoUDCA     2.22                 2.30          2.19
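
The distance criterion applied to Table 3 can be written out as a short check. The following is a minimal sketch in Python, not part of the original docking protocol: the values are transcribed from Table 3, while the names `TABLE3` and `catalytically_competent`, and the use of a strict 2.0–2.4 Å window for all three distances, are assumptions made for illustration.

```python
# Distances (in Å) transcribed from Table 3.
# Column order: Oγ(Ser48)-H(C3-OH), Zn-O(C3-OH), C4(NAD+)-H(C3)
TABLE3 = {
    ("ADH ββ", "UDCA"): (2.31, 2.51, 2.94),
    ("ADH ββ", "isoUDCA"): (3.35, 2.24, 1.94),
    ("ADH γγ", "UDCA"): (3.20, 2.88, 3.53),
    ("ADH γγ", "isoUDCA"): (2.22, 2.30, 2.19),
}

# Range quoted in the text; applying it strictly to every distance
# is an assumption of this sketch.
D_MIN, D_MAX = 2.0, 2.4


def catalytically_competent(distances, d_min=D_MIN, d_max=D_MAX):
    """Return True if every distance lies within [d_min, d_max]."""
    return all(d_min <= d <= d_max for d in distances)


# Keep only the isozyme/substrate pairs for which all three distances fit.
competent = [pair for pair, dist in TABLE3.items()
             if catalytically_competent(dist)]

for isozyme, substrate in competent:
    print(f"{isozyme} + {substrate}: all distances within {D_MIN}-{D_MAX} Å")
```

With the tabulated values, only the γγ ADH/isoUDCA complex passes the check, in line with the statement above.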


Molecular modelling and substrate docking of human type 10 17β-hydroxysteroid dehydrogenase (paper IV)

The human type 10 17β-hydroxysteroid dehydrogenase (17β-HSD-10) is a multifunctional mitochondrial enzyme. It catalyses the oxidative inactivation at C17 of androgens and estrogens. However, it also mediates oxidation of 3α-hydroxy groups of androgens, thereby reactivating androgen metabolites. Finally, it is involved in β-oxidation of fatty acids by catalysing the L-hydroxyacyl-CoA dehydrogenase reaction of the β-oxidation cycle. Since no three-dimensional structure of 17β-HSD-10 was available, homology modelling was carried out to understand the molecular basis of these substrate specificities. The template in the homology modelling was 7α-HSD (1fmc; Tanaka et al. 1996), which has a sequence identity of 27% to the target protein. The low sequence identity required special care in the modelling process.

The molecular modelling indicated that the reaction resulting in 3α-OH/3-oxo conversion can be predicted to occur with 5α-reduced steroids such as 3α-ol,5α-androstane,17-one, with optimal distances for the 3α-OH hydrogen bonding to the tyrosyl Oη of Tyr168 of 1.89 Å, to C-4 of NAD+ of 2.23 Å, and a distance between the 3α-OH oxygen and the Ser155 side-chain Hγ of 1.96 Å. Oxidation of the 3β-hydroxyl, as in 3β-ol,5α-androstane,17-one or 3β-ol,5α-androstane,17β-ol, is not possible, due to a predicted distance of over 4 Å between the 3β-OH oxygen and the Ser155 side chain. This distance will not allow the hydrogen bond necessary for catalysis. Docking simulations with 5β-reduced steroids, such as bile acids (isoUDCA and UDCA), again reveal unfavourable distances to Ser155, thus preventing catalysis. The 17β-OH dehydrogenase reaction is possible with steroids of the androgen and estrogen classes, yielding favourable geometry and distances for 3β,17β,5α-androstane-diol, testosterone, 5α-dihydrotestosterone and estradiol. These predicted properties correlate with experimental data.

After submission of this manuscript, another group solved the structure of the rat homologue by crystallography (Powell et al. 2000). This structure confirms large portions of our model, including some of the enzyme–substrate complexes that we predicted. In Fig. 8, the backbones of our model and of the crystal structure are superimposed. The cores of the structures superimpose well, including the majority of the active site residues. Parts of the predicted substrate-binding loop are missing in the crystal structure, indicating an unordered structure of that part.


Nevertheless, we have correctly predicted the start and end of the loop. The dark-coloured regions corresponding to mispredicted areas are mainly centred in loops, which are notoriously hard to predict, and in secondary structure elements at the edges of the structure. The entire C-terminal part from Asn243 to Pro261 is mispredicted due to an alignment error to the template, and the whole region is shifted by one residue. The placement of the NAD+ is highly accurate in the vicinity of the active site and allows us to correctly predict which substrates may be converted by the enzyme.

The overall quality of our predictions suggests that it can be justifiable to create homology models based on templates with quite low sequence identity. In our case, a successful model was created at a sequence identity level below 30%, provided that the overall fold of the family is conserved.

Fig. 8. Model of ERAB and crystal structure of the rat homologue superimposed. White areas were correctly predicted, while dark areas were mispredicted. NAD+ and substrate of the model are coloured light grey, while in the crystal structure they are darker. Figure created in ICM version 2.8 (Molsoft LLC, San Diego, CA).
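
A comparison of the kind shown in Fig. 8 can be illustrated with a minimal sketch, assuming that the two backbones have already been superimposed and that the Cα atoms are paired one-to-one. The coordinates, the function names and the 2 Å cutoff are hypothetical and serve only as illustration; they are not data from the model or from the crystal structure, and the sketch does not perform the superposition itself.

```python
import math


def per_residue_deviation(model, reference):
    """Distance (Å) between paired Cα atoms of two superimposed structures."""
    if len(model) != len(reference):
        raise ValueError("both structures must contain the same number of atoms")
    return [math.dist(a, b) for a, b in zip(model, reference)]


def rmsd(deviations):
    """Root-mean-square deviation over all paired atoms."""
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))


def mispredicted(deviations, cutoff=2.0):
    """Indices of residues deviating by more than the (assumed) cutoff in Å."""
    return [i for i, d in enumerate(deviations) if d > cutoff]


# Hypothetical Cα coordinates (x, y, z) in Å for four paired residues.
model_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
crystal_ca = [(0.0, 0.0, 0.0), (3.8, 0.5, 0.0), (7.6, 0.0, 0.0), (11.4, 3.0, 4.0)]

deviations = per_residue_deviation(model_ca, crystal_ca)
print(f"RMSD: {rmsd(deviations):.2f} Å")
print("Residues flagged as mispredicted:", mispredicted(deviations))
```

In this toy example only the last residue exceeds the cutoff, which corresponds to the dark regions in a colouring scheme such as that of Fig. 8.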


Structural Role of Conserved Asn179 in the Short-Chain Dehydrogenase/Reductase Scaffold (paper V)

Short-chain dehydrogenases/reductases (SDR) constitute a large family of enzymes found in all forms of life. Despite a low level of sequence identity, the three-dimensional structures determined display a nearly superimposable α/β folding pattern. We identified a conserved asparagine residue located within strand βF and analysed its role in the short-chain dehydrogenase/reductase architecture. Mutagenic replacement of Asn179 by Ala in bacterial 3β/17β-hydroxysteroid dehydrogenase yields a folded but enzymatically inactive enzyme. Significantly altered unfolding characteristics indicate an apparent increase in stability. Crystallographic analysis of the wild-type enzyme at 1.2 Å resolution reveals a hydrogen bonding network including a buried and well-ordered water molecule, connecting strands βE to βF (Fig. 9), a common feature found in 16 of 21 known three-dimensional structures of the family.

Fig. 9. Strands βE and βF and interactions of Asn179, with distances given in Ångström. Figure created with ICM, version 2.8 (Molsoft LLC, San Diego, CA).

This segment interacts with the active site and the substrate-binding region, which could explain the loss of function in the N179A mutant. The main interactions of this conserved site are made from side-chain to main-chain atoms, which makes it impossible to trace them through sequence comparisons, since no evolutionary pressure is applied on the interacting residues in order to maintain the interactions. Based on these results we predict that in mammalian 11β-hydroxysteroid dehydrogenase the essential Asn-linked glycosylation site, which corresponds to the conserved segment, displays the same structural features and has a central role in maintaining the SDR scaffold, rather than being a carbohydrate attachment point.

Molecular dynamics studies of sequence influence upon fibrillation (paper VI)

Diseases such as the prion diseases and Alzheimer's disease (AD), caused by fibrillating proteins, are gaining increased attention due to their importance in dementia among an aging population. Presently, over 20 diseases are known to be linked to fibrillating proteins in various tissues. We have studied the stability of the helical peptides involved in AD and pulmonary alveolar proteinosis (PAP) using 4-nanosecond molecular dynamics simulations. The effect of stabilising replacements and the effect of the mutation linked to the hereditary cerebral amyloidosis of the Flemish type are also investigated.

The stability of the peptides is clearly related to their amino acid sequence. The amyloid β-peptide (Aβ) has lost most of its helical structure midway through the simulation and assumes a more or less unordered structure with some helical tendencies (Fig. 10A), much like the variant involved in hereditary cerebral amyloidosis of the Flemish type (Aβ Flemish type), which loses all of its helical tendencies and even shows β-strand tendencies (Fig. 10C). The experimentally verified stable analogue Aβ (K16A+L17A+F20A) retains a helical structure during the whole 4 ns simulation (Fig. 10B). Surfactant protein C (SP-C) is more stable than Aβ, but the helix collapses into short segments with helical properties (Fig. 10D). The stable analogue SP-C(Leu) maintains a rigid helix during the whole simulation (Fig. 10E), while the tri-substituted SP-C (V18L+V20L+V24L) shows less helical character than SP-C(Leu), but more than wild-type SP-C (Fig. 10F).

The results can be used to design more stable analogues based on the concepts derived from this study. The deleterious effect of the mutation in hereditary cerebral amyloidosis of the Flemish type can be explained by decreased α-helical stability of the peptide. We have also investigated the secondary structure propensities, and a correlation between β-propensities and fibrillation was found, compatible with the molecular dynamics results.


Fig. 10. Snapshots (at 0 ps, 2000 ps and 4000 ps) of the development of the structure during the MD simulations. A: Aβ; B: Aβ (K16A+L17A+F20A); C: Aβ Flemish type (A21G); D: SP-C; E: SP-C(Leu); F: SP-C (V18L+V20L+V24L).
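
The loss of helical structure over a trajectory, as followed in Fig. 10, can be quantified in several ways. The sketch below shows one simple possibility and is not the analysis used in the study: it assumes that backbone dihedral angles (φ, ψ, in degrees) are available for each snapshot and counts a residue as helical when both angles lie within an assumed window around typical α-helical values. The window, the function names and the angle values are hypothetical, and angle periodicity is not handled.

```python
def is_helical(phi, psi, phi0=-60.0, psi0=-45.0, tol=30.0):
    """Assumed criterion: both dihedrals within tol degrees of (phi0, psi0)."""
    return abs(phi - phi0) <= tol and abs(psi - psi0) <= tol


def helical_fraction(frame):
    """Fraction of residues in one snapshot classified as helical."""
    return sum(is_helical(phi, psi) for phi, psi in frame) / len(frame)


# Hypothetical (phi, psi) angles for a four-residue peptide at three
# time points (ps); the values are invented for illustration only.
trajectory = {
    0: [(-60, -45), (-62, -41), (-58, -47), (-65, -40)],
    2000: [(-60, -45), (-70, -35), (-120, 130), (-65, -40)],
    4000: [(-120, 130), (-135, 140), (-60, -45), (-140, 150)],
}

fractions = {t: helical_fraction(frame) for t, frame in trajectory.items()}
for t, fraction in fractions.items():
    print(f"{t:>5} ps: helical fraction {fraction:.2f}")
```

A peptide that retains its helix would keep a fraction close to 1 throughout, whereas a decreasing fraction, as in this invented example, would correspond to the unfolding seen for the less stable variants.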


A structural basis for proinsulin C-peptide activity (paper VII)

The insulin and C-peptide parts of proinsulin differ markedly in interspecies variability. With 37 forms known in primary structure, C-peptide is between one and two orders of magnitude more variable than insulin, is about as variable as relaxin in the same protein family, and shows a tripartite variability pattern. Considering the mammalian C-peptides, 9 positions of the N- and C-terminal parts are conserved. In contrast, mid-positions of C-peptide are highly variable and susceptible to mutational removal. These structural features are consistent with binding interactions recently defined experimentally (Ohtomo et al. 1998, Rigler et al. 1999, Wahren et al. 2000, Pramanik et al. 2001), highlighting the importance of the C-terminal pentapeptide.

The main argument against a hormonal function for the C-peptide is the large species variability, with no strictly conserved residue. To investigate this apparent problem, 15 peptide hormones were investigated by constructing multiple alignments. The results are shown in Table 4, where C-peptide is grouped together with three other highly variable, but well-established peptide hormones. This puts the variability of C-peptide in a new perspective and indicates the possibility of hormonal activities of C-peptide.

Table 4. Sequence variability among peptide hormones; differences are given in % and in PAM units (accepted point mutations)

Hormone                  n    Human–Rat    Human–Chicken   Human–Cod    Human–Hagfish
                              %     PAM    %     PAM       %     PAM    %     PAM
Somatostatin             14   0*    0      0     0         -     -      0     0
Glucagon                 29   0     0      0     0         -     -      -     -
VIP                      28   0     0      14    16        21    25     -     -
Substance P              11   0     0      9     10        27    33     -     -
GIP                      42   5     5      -     -         -     -      -     -
Insulin                  51   8     8      14    16        -     -      35    47
IGF-1                    70   4     4      11    12        -     -      -     -
IGF-2                    67   6     6      15    17        -     -      -     -
CCK                      33   9     10     39    55        -     -      -     -
Secretin                 27   11    12     48    75        -     -      -     -
Pancreatic polypeptide   36   22    26     56    97        -     -      -     -
Gastrin                  17   18    21     -     -         -     -      -     -
GRP                      27   26    32     30    38        -     -      -     -
Parathormone             84   27    33     62    119       -     -      -     -
C-Peptide                31   32    42     68    148       -     -      93    988
Relaxin                  54   48    75     -     -         -     -      -     -

* Comparison versus the mouse homologue instead of rat
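
The percentage values of the kind listed in Table 4 are derived from pairwise comparisons within the multiple alignments. The calculation can be sketched as follows, assuming two rows of an alignment of equal length and assuming that columns containing a gap in either sequence are ignored. How gaps were actually treated is not stated here, and the function name and the two sequences are hypothetical. The conversion to PAM units is not shown.

```python
def percent_difference(seq_a, seq_b, gap="-"):
    """Percentage of differing residues between two aligned sequences.

    Columns with a gap in either sequence are skipped (an assumption of
    this sketch, not necessarily the convention used for Table 4).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    differences = sum(a != b for a, b in pairs)
    return 100.0 * differences / len(pairs)


# Hypothetical aligned sequences: 11 columns, one of them gapped.
human = "ACDEFGHIKLM"
other = "ACDQFGHI-LW"

print(f"{percent_difference(human, other):.0f} % difference")
```

In this invented example, two of the ten ungapped columns differ, giving a difference of 20 %.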


Conclusions

• Sequence comparisons are important and fruitful for examining evolutionary and functional relationships in families of both large and small proteins/peptides.
• Docking calculations are a powerful tool to elucidate substrate specificities and to investigate the molecular mechanisms of substrate binding.
• Homology modelling is feasible below 30% sequence identity if the fold of the family is well conserved. Loops and outer parts are likely to be slightly mispredicted, while the accuracy of the core and the active site can be expected to be high.
• A multiple structural alignment is a powerful complement to traditional multiple sequence alignments, as factors not obvious from the sequence conservation may be detected. The importance of structural comparisons is expected to increase as more structures become available.
• MD simulations are ideal in studies of the stability of small peptides, as the time scales practically reachable are biologically relevant.
• The whole-genome comparisons of the MDR superfamily have given further insights into the evolution of this superfamily, and several new families have been characterised. Different multiplicity patterns of the superfamily are also detected in higher and lower eukaryotes, indicating different functions.
• The substrate specificity of γγ ADH and ββ ADH is primarily determined by the Ser48/Thr48 replacement, where the Ser in γγ ADH allows the oxidation of bulkier secondary alcohols.
• The model of type 10 17β-HSD is compatible with known enzymatic data and can be used to explain enzyme–substrate interactions at a molecular level.
• The structural analysis of the SDR superfamily revealed a conserved hydrogen-bonded network centred around a conserved asparagine, common to 16 of the 21 SDR proteins with experimentally verified structure.
• MD simulations of amyloidogenic helical peptides and non-fibrillating variants point to the importance of the stability of the helical portion for their amyloidogenic tendencies.
• The helical peptides involved in pulmonary alveolar proteinosis and Alzheimer's disease readily lose their helical conformation without the influence of external factors.
• The sequence variability of the C-peptide does not exclude it from being a peptide hormone, as other well-characterised peptide hormones exhibit similar variability patterns.

Perspectives

One of the driving forces in computational biology is the parallel development of faster and more powerful computers, which allows more complex systems to be addressed by computational methods. The Blue Gene project by IBM, in which a supercomputer is constructed to try to fold a protein by MD simulations, is one of the projects that most obviously benefit from new technology. The recent advances in computer technology and the construction of more efficient algorithms in MD simulation software indicate that the current limiting factor is the crudeness of the force fields used in the simulations.

The development in biology is of more importance than technological breakthroughs. The completion of the human genome last year gave a renaissance for the field of sequence comparisons, and gene prediction algorithms in particular are and will be of utmost importance in the near future. Increased biological knowledge about the mechanisms that govern each process will provide the basis for making better prediction algorithms for transmembrane proteins, subcellular localisation, glycosylation and secondary structure.


The project that will have the greatest impact on homology modelling is the structural genomics initiative, in which proteins without homology to experimentally determined structures are selected for crystallisation. The aim is to cover all existing folds of soluble proteins, which would make it possible to predict the structure of all known proteins. The initiative has identified 5285 proteins that will cover the three-dimensional space of protein folds with reasonably small gaps to fill by homology modelling; of these, 459 have been crystallized to date and 15 have their structure deposited in PDB (www.jcsg.org).

The databases of single nucleotide polymorphisms (SNP) can give enormous amounts of both disease-related and structurally relevant information, especially in relation to known or modelled three-dimensional folds.

The enormous amount of biological information generated worldwide threatens to drown the information relevant to each researcher. Initiatives like GPCRDB (ref) are therefore valuable, where a consortium of researchers gathers all available data about a protein family and makes it clearly and easily available to the rest of the world. In this case the family is the G-protein coupled receptors, but the concept could be transferred to any protein family.

As more resources are directed into the field of bioinformatics, the harder problems are no longer shunned. Therefore I hope to see an algorithm within 5–10 years that produces fairly accurate (RMSD below 2 Å) estimates of the three-dimensional structure of proteins based on the sequence of the protein alone.


Acknowledgements

This thesis was carried out at Chemistry I, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden. I am very grateful to all of you that have helped me in the completion of my thesis. In particular I would like to express my gratitude to the following persons.

Bengt Persson, my supervisor, for rescuing me from the lab-bench and toxic fumes in the far north and allowing me to work in a field which I barely could dream of during the cold arctic nights in Sundsvall.

Hans Jörnvall, the head of Chemistry I, for your enthusiasm and for freely sharing your knowledge and for providing excellent working facilities.

The people that lured me into going for a Ph.D. after my degree project, Magnus Gustafsson, Malin Hult, Tim Prozorovski, Yuqin Wang, Yvonne Kallberg and Valentina Bonetto, I had a really good time.

My roommate Yvonne Kallberg for lots of discussion about science, Sundsvall and unrelated stuff in our dark cave, for nice conference and party company. Lately, our sanctuary has been invaded by Annika Norin who tried to let some sunshine into our lives, with varying degrees of success.

The boys of Chemistry I with unhealthy interests in non-science projects, Andreas P. Jonsson, Jesper J. Hedberg, Magnus Gustafsson, Mikael Henriksson, Patrik Strömberg and Stefan Svensson, you're not safe yet.

My present and past student colleagues in Chemistry I: Andreas Almlén, Juan Astorga-Wells, Ann-Charlotte Bergman, Peter Bergman, Åsa Brunnström, Elo Eriste, Daniel Hirschberg, Charlotta Filling, Waltteri Hosia, Johan Lengquist, Jing Li, Ingemar Lindh, Charlotte Lindhé, Suya Liu, Ermias Melles, Al Necal, Johan Nilsson, Åke Norberg, Madalina Oppermann, Anna Päivio, Essam Refai, Naeem Shafqat, Margareta Stark, Maria Tollin, Xiaoqiu Wu, Eli Zahou and Shah Zaltash, for sharing the daily grind of a Ph.D.


I would like to thank my main collaborators during these years, Udo Oppermann, Jan Johansson and Hans Jörnvall. I learned a lot, and hopefully science took an ant-step in the right direction from my contributions.

Time to thank the HEJ-lab core: Ann-Margreth Jörnvall, Ingegerd Nylander, Irene Byman, Ella Cederlund, Carina Palmberg, Eva Mårtensson, Ulrika Waldenström, Monica Lindh, Marie Stahlberg, Gunvor Alvelius, Jan Sjövall, Jan-Olof Höög, Mats Andersson, Birgitta Agerberth, Mustafa El-Ahmad, Tomas Bergman, William Griffiths, Lars Hjelmqvist, Hanns-Ulrich Marschall, Rannar Sillard and Jawed Shafqat, there would be no HEJ-lab without you guys.

I would also like to grasp the opportunity and say cheers to the friends I found all over MBB. If you're forgotten, blame me, it's late and everything else is finished: Jenny, Stina, Eva, Helena (Structural Biology), Andis, Hans, Simone, Stefan (Organic chemistry), Micke, Mattias, Pontus (Kemi II), Aristi (Biochemistry).

I would like to express my deepest gratitude and love to my parents, who always stood by and supported me.

If it weren't for my sister Karin's travelling, perhaps I wouldn't have found the courage to leave the safe haven of Sundsvall, thanks a billion :-)

Mirna, finally I'm finished, I owe it all to you.

Ok, to sum it up, it's been good; I will miss you all.


References

Abagyan, R. and Totrov, M. Biased probability Monte Carlo conformational searches and electrostatic calculations for peptides and proteins. J. Mol. Biol. 1994, 235:983-1002
Abagyan, R., Batalov, S., Cardozo, T., Totrov, M., Webber, J. and Zhou, Y. Homology modeling with internal coordinate mechanics: deformation zone mapping and improvements of models via conformational search. Proteins 1997, Suppl 1:29-37
Abagyan, R.A. and Batalov, S. Do aligned sequences share the same fold? J. Mol. Biol. 1997, 273:355-68
Abagyan, R.A. Towards protein folding by global energy optimization. FEBS Lett. 1993, 325:17-22
Agerberth, B., Gunne, H., Odeberg, J., Kogner, P., Boman, H.G. and Gudmundsson, G.H. FALL-39, a putative human peptide antibiotic, is cysteine-free and expressed in bone marrow and testis. Proc. Natl. Acad. Sci. USA 1995, 92:195-199
Alder, B.J. and Wainwright, T.E. Phase transition for a hard-sphere system. J. Chem. Phys. 1957, 27:1208
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 1990, 215:403-410
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25:3389-3402
Amy, C.M., Witkowski, A., Naggert, J., Williams, B., Randhawa, Z. and Smith, S. Molecular cloning and sequencing of cDNAs encoding the entire rat fatty acid synthase. Proc. Natl. Acad. Sci. USA 1989, 86:3114-3118
Andersson, M., Gunne, H., Agerberth, B., Boman, A., Bergman, T., Sillard, R., Jörnvall, H., Mutt, V., Olsson, B., Wigzell, H., Dagerlind, A., Boman, H.G. and Gudmundsson, G.H. NK-lysin, a novel effector peptide of cytotoxic T and NK cells. Structure and cDNA cloning of the porcine form, induction by interleukin 2, antibacterial and antitumour activity. EMBO J. 1995, 14:1615-1625
Bairoch, A. and Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28:45-48
Baldi, P., Brunak, S., Frasconi, P., Soda, G. and Pollastri, G. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 1999, 15:937-946
Bansal, A.K. An automated comparative analysis of 17 complete microbial genomes. Bioinformatics 1999, 15:900-908
Beers, M.F. and Lomax, C. Synthesis and processing of hydrophobic surfactant protein C by isolated rat type II cells. Am. J. Physiol. 1995, 269:744-753


Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A. and Wheeler, D.L. GenBank. Nucleic Acids Res. 2002, 30:17-20
Berendsen, H.J.C., Postma, J.P.M., van Gunsteren, W.F. and Hermans, J. Interaction models for water in relation to protein hydration. In: Intermolecular Forces (ed. Pullman, B.), Reidel, Dordrecht, 1981, 331-342
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. The Protein Data Bank. Nucleic Acids Res. 2000, 28:235-242
Blundell, T.L., Sibanda, B.L., Sternberg, M.J. and Thornton, J.M. Knowledge-based prediction of protein structures and the design of novel molecules. Nature 1987, 326:347-352
Bonneau, R., Tsai, J., Ruczinski, I., Chivian, D., Rohl, C., Strauss, C.E.M. and Baker, D. Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins 2001, Suppl 5:119-126
Bonnichsen, R.K. and Wassén, A.M. Crystalline alcohol dehydrogenase from horse liver. Arch. Biochem. Biophys. 1948, 18:361-363
Boudet, A.M., Lapierre, C. and Grima-Pettenati, J. Biochemistry and molecular biology of lignification. New Phytol. 1995, 129:203-236
Browne, W.J., North, A.C., Phillips, D.C., Brew, K., Vanaman, T.C. and Hill, R.L. A possible three-dimensional structure of bovine alpha-lactalbumin based on that of hen's egg-white lysozyme. J. Mol. Biol. 1969, 42:65-86
Brändén, C.I., Jörnvall, H., Eklund, H. and Furugren, B. Alcohol dehydrogenases. In: The enzymes, third edition (ed. Boyer, P.D.), chapter 3, Academic Press, 1975
Brändén, C. and Tooze, J. Introduction to protein structure, second edition. Garland Publishing Inc., NY, 1999
Burley, S.K. An overview of structural genomics. Nat. Struct. Biol. 2000, Suppl:932-934
Campbell, M.M. and Sederoff, R.R. Variation in lignin content and composition. Mechanisms of control and implications for the genetic improvement of plants. Plant Physiol. 1995, 110:3-13
Cañestro, C., Albalat, R., Hjelmqvist, L., Godoy, L., Jörnvall, H. and Gonzàlez-Duarte, R. Ascidian and amphioxus Adh genes correlate functional and molecular features of the ADH family expansion during vertebrate evolution. J. Mol. Evol. 2002, 54:81-89
Cardozo, T., Totrov, M. and Abagyan, R. Homology modeling by the ICM method. Proteins 1995, 23:403-414
Chandrasekhar, I. and van Gunsteren, W.F. A comparison of the potential energy parameters of aliphatic alkanes: molecular dynamics simulations of triacylglycerols in the alpha phase. Eur. Biophys. J. 2002, 31:89-101


Chou, P.Y. and Fasman, G.D. Prediction of protein conformation. Biochemistry 1974, 13:222-245
Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. A model of evolutionary change in proteins. Matrices for detecting distant relationships. In: Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington DC, 1978
DeRisi, J.L., Iyer, V.R. and Brown, P.O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997, 278:680-686
Donadio, S., Staver, M.J., McAlpine, J.B., Swanson, S.J. and Katz, L. Modular organization of genes required for complex polyketide biosynthesis. Science 1991, 252:675-679
Dreij, K., Sundberg, K., Johansson, A.S., Nordling, E., Seidel, A., Persson, B., Mannervik, B. and Jernström, B. Catalytic activities of human alpha class glutathione transferases toward carcinogenic dibenzo[a,l]pyrene diol epoxides. Chem. Res. Toxicol. 2002, 15:825-831
Duan, Y. and Kollman, P.A. Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science 1998, 282:740-744
Duester, G., Farrés, J., Felder, V.C., Höög, J.-O., Parés, X., Plapp, B., Yin, S.-y. and Jörnvall, H. Alcohol dehydrogenase nomenclature. Biochem. Pharmacol. 1999, 58:389-395
Duester, G., Hatfield, G.W. and Smith, M. Molecular genetic analysis of human alcohol dehydrogenase. Alcohol 1985, 2:53-56
Efron, B., Halloran, E. and Holmes, S. Bootstrap confidence levels for phylogenetic trees. Proc. Natl. Acad. Sci. USA 1996, 93:13429-13434
Ellis, R.J., van der Vies, S.M. and Hemmingsen, S.M. The molecular chaperone concept. Biochem. Soc. Symp. 1989, 55:145-153
Epstein, C.J., Goldberger, R.F. and Anfinsen, C.B. The genetic control of tertiary protein structure. Model systems. Cold Spring Harbor Symp. Quant. Biol. 1963, 28:439-449
Ewing, T.J., Makino, S., Skillman, A.G. and Kuntz, I.D. DOCK 4.0: search strategies for automated molecular docking of flexible molecule databases. J. Comput. Aided Mol. Des. 2001, 15:411-428
Ewing, T.J.A. and Kuntz, I.D. Critical Evaluation of Search Algorithms for Automated Molecular Docking and Database Screening. J. Comp. Chem. 1997, 9:1175-1189
Felsenstein, J. PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle. 1993
Felsenstein, J. Confidence-limits on phylogenies - an approach using the bootstrap. Evolution 1985, 39:783-791
Feng, D.F., Johnson, M.S. and Doolittle, R.F. Aligning amino acid sequences: comparison of commonly used methods. J. Mol. Evol. 1984-85, 21:112-125


Fernández, M.R., Biosca, J.A., Norin, A., Jörnvall, H. and Parés, X. Class III alcohol dehydrogenase from Saccharomyces cerevisiae: Structural and enzymatic features differ toward the human/mammalian forms in a manner consistent with functional needs in formaldehyde detoxication. FEBS Lett. 1995, 370:23-26
Fiers, W., Contreras, R., Haegemann, G., Rogiers, R., van de Voorde, A., van Heuverswyn, H., van Herreweghe, J., Volckaert, G. and Ysebaert, M. Complete nucleotide sequence of SV40 DNA. Nature 1978, 273:113-120
Fitch, W.M. Toward defining the course of evolution: minimum change for a specific tree topology. Sys. Zool. 1971, 20:406-416
Fitz-Gibbon, S.T. and House, C.H. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 1999, 27:4218-4222
Forster, M.J. Molecular modelling in structural biology. Micron 2002, 33:365-384
Galliano, H., Cabane, M., Eckerskorn, C., Lottspeich, F., Sandermann, H. Jr and Ernst, D. Molecular cloning, sequence analysis and elicitor-/ozone-induced accumulation of cinnamyl alcohol dehydrogenase from Norway spruce (Picea abies L.). Plant Mol. Biol. 1993, 23:145-156
Ghosh, D., Sawicki, M., Pletnev, V., Erman, M., Ohno, S., Nakajin, S. and Duax, W.L. Porcine carbonyl reductase: Structural basis for a functional monomer in short chain dehydrogenases/reductases. J. Biol. Chem. 2001, 276:18457-18463
Ghosh, D., Wawrzak, Z., Weeks, C.M., Duax, W.L. and Erman, M. The refined three-dimensional structure of 3α,20β-hydroxysteroid dehydrogenase and possible roles of the residues conserved in short-chain dehydrogenases. Structure 1994, 2:629-640
Gibbs, A.J. and McIntyre, G.A. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 1970, 16:1-11
Gonnet, G.H., Cohen, M.A. and Benner, S.A. Exhaustive matching of the entire protein sequence database. Science 1992, 256:1443-1445
Gonzalez, P., Rao, P.V., Nunez, S.B. and Zigler, J.S. Jr. Evidence for independent recruitment of ζ-crystallin/quinone reductase (CRYZ) as a crystallin in camelids and hystricomorph rodents. Mol. Biol. Evol. 1995, 12:773-781
Goodsell, D.S. and Olson, A.J. Automated Docking of Substrates to Proteins by Simulated Annealing. Proteins: Str. Func. and Genet. 1990, 8:195-202
Gustafsson, M. and Johansson, J. Structural and functional properties of pulmonary surfactant protein C and related analogs. J. Protein Chem. 1998, 17:540-542
Halgren, T.A. Merck Molecular Force Field. I.-V. J. Comp. Chem. 1995a, 17:490-641
Halgren, T.A. Potential energy functions. Curr. Opin. Struct. Biol. 1995b, 5:205-210


Halpin, C., Knight, M.E., Foxon, G.A., Campbell, M., Boudet, A.M., Boon, J.J., Chabbert, B., Tollier, M.T. and Schuch, W. Manipulation of lignin quality by down-regulation of cinnamyl alcohol dehydrogenase. Plant Journal 1994, 6:339-350
Hammersley, J.M. Monte Carlo Methods for Solving Multivariable Problems. Ann. New York Acad. Sci. 1960, 86:844-874
Hansson, T., Oostenbrink, C. and van Gunsteren, W. Molecular dynamics simulations. Curr. Opin. Struct. Biol. 2002, 12:190-196
Hardin, C., Eastwood, M.P., Prentiss, M., Luthey-Schulten, Z. and Wolynes, P.G. Folding funnels: the key to robust protein structure prediction. J. Comput. Chem. 2002, 23:138-146
Hartigan, J.A. Minimum evolution fits to a given tree. Biometrics 1973, 29:53-65
Henikoff, S. and Henikoff, J.G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 1992, 89:10915-10919
Henikoff, S. and Henikoff, J.G. Performance evaluation of amino acid substitution matrices. Proteins 1993, 17:49-61
Hofmann, K., Bucher, P., Falquet, L. and Bairoch, A. The PROSITE database, its status in 1999. Nucleic Acids Res. 1999, 27:215-219
Huang, T.F., Holt, J.C., Lukasiewicz, H. and Niewiarowski, S. Trigramin. A low molecular weight peptide inhibiting fibrinogen interaction with platelet receptors expressed on glycoprotein IIb-IIIa complex. J. Biol. Chem. 1987, 262:16157-16163
Jany, K.D., Ulmer, W., Froschle, M. and Pfleiderer, G. Complete amino acid sequence of glucose dehydrogenase from Bacillus megaterium. FEBS Lett. 1984, 165:6-10
Jensen, F. Introduction to Computational Chemistry. Wiley and Sons, Chichester, 1999
Johansson, J., Curstedt, T., Robertson, B. and Jörnvall, H. Size and structure of the hydrophobic low molecular weight surfactant-associated polypeptide. Biochemistry 1988, 27:3544-3547
Jones, D.T. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 1999, 292:195-202
Jones, D.T. and Swindells, M.B. Getting the most from PSI-BLAST. Trends Biochem. Sci. 2002, 27:161-164
Jones, G., Willett, P., Glen, R.C., Leach, A.R. and Taylor, R. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol. 1997, 267:727-748
Jörnvall, H. and Höög, J.-O. Nomenclature of alcohol dehydrogenases. Alcohol and Alcoholism 1995, 30:153-161


Jörnvall, H., Höög, J.-O. and Persson, B. SDR and MDR: completed genome sequences show these protein families to be large, of old origin, and of complex nature. FEBS Lett. 1999, 445:261-264
Jörnvall, H., Höög, J.-O., Persson, B. and Parés, X. Pharmacogenetics of the alcohol dehydrogenase system. Am. J. Pharmacol. 2000, 61:184-191
Jörnvall, H., Persson, B., Krook, M., Atrian, S., Gonzalez-Duarte, R., Jeffery, J. and Ghosh, D. Short-chain dehydrogenases/reductases (SDR). Biochemistry 1995, 34:6003-6013
Jörnvall, H., Persson, M. and Jeffery, J. Alcohol and polyol dehydrogenases are both divided into two protein types, and structural properties cross-relate the different enzyme activities within each type. Proc. Natl. Acad. Sci. USA 1981, 78:4226-4230
Jörnvall, H., von Bahr-Lindström, H., Jany, K.D., Ulmer, W. and Froschle, M. Extended superfamily of short alcohol-polyol-sugar dehydrogenases: structural similarities between glucose and ribitol dehydrogenases. FEBS Lett. 1984, 165:190-196
Kallberg, Y., Oppermann, U., Jörnvall, H. and Persson, B. Short-chain dehydrogenase/reductase (SDR) relationships: A large family with eight clusters common to human, animal, and plant genomes. Protein Sci. 2002, 11:636-641
Kallberg, Y. and Persson, B. KIND - a non-redundant protein database. Bioinformatics 1999, 15:260-261
Kamal, A., Almenar-Queralt, A., LeBlanc, J.F., Roberts, E.A. and Goldstein, L.S. Kinesin-mediated axonal transport of a membrane compartment containing beta-secretase and presenilin-1 requires APP. Nature 2001, 414:643-648
Kanehisa, M. A database for post-genome analysis. Trends Genet. 1997, 13:375-376
Kang, J., Lemaire, H.-G., Unterbeck, A., Salbaum, J.M., Masters, C.L., Grzeschik, K.-H., Multhaup, G., Beyreuther, K. and Mueller-Hill, B. The precursor of Alzheimer's disease amyloid A4 protein resembles a cell-surface receptor. Nature 1987, 325:733-736
Kirkpatrick, S., Gelatt, C.D. and Vecchi, M.P. Optimization by Simulated Annealing. Science 1983, 220:671-680
Koonin, E.V., Mushegian, A.R., Galperin, M.Y. and Walker, D.R. Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol. 1997, 25:619-637
Krook, M., Ghosh, D., Strömberg, R., Carlquist, M. and Jörnvall, H. Carboxyethyllysine in a protein: Native carbonyl reductase/NADP+-dependent prostaglandin dehydrogenase. Proc. Natl. Acad. Sci. USA 1993, 90:502-506
Kuntz, I.D., Blaney, J.M., Oatley, S.J., Langridge, R. and Ferrin, T.E. A geometric approach to macromolecule-ligand interactions. J. Mol. Biol. 1982, 161:269-288
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., et al. and Morgan, M.J. Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921


Lengauer, T. and Rarey, M. Computational methods for biomolecular docking. Curr. Opin. Struct. Biol. 1996, 6:402-406
Levitt, M. Accurate modeling of protein conformation by automatic segment matching. J. Mol. Biol. 1992, 226:507-533
Lindahl, E., Hess, B. and van der Spoel, D. GROMACS 3.0: a package for molecular simulation and trajectory analysis. J. Mol. Model. 2001, 7:306-317
Magnuson, K., Jackowski, S., Rock, C.O. and Cronan, J.E. Jr. Regulation of fatty acid biosynthesis in Escherichia coli. Microbiol. Rev. 1993, 57:522-542
Marschall, H.-U., Oppermann, U.C., Svensson, S., Nordling, E., Persson, B., Höög, J.-O. and Jörnvall, H. Human liver class I alcohol dehydrogenase γγ isozyme: the sole cytosolic 3β-hydroxysteroid dehydrogenase of iso bile acids. Hepatology 2000, 31:990-996
McCammon, J.A., Gelin, B.R. and Karplus, M. Dynamics of folded proteins. Nature 1977, 267:585-590
Minth, C.D., Andrews, P.C. and Dixon, J.E. Characterization, sequence, and expression of the cloned human neuropeptide Y gene. J. Biol. Chem. 1986, 261:11974-11979
Morris, G.M., Goodsell, D.S., Halliday, R.S., Huey, R., Hart, W.E., Belew, R.K. and Olson, A.J. Automated Docking Using a Lamarckian Genetic Algorithm and an Empirical Binding Free Energy Function. J. Computational Chemistry 1998, 19:1639-1662
Morris, G.M., Goodsell, D.S., Huey, R. and Olson, A.J. Distributed automated docking of flexible ligands to proteins: Parallel applications of AutoDock 2.4. J. Computer-Aided Molecular Design 1996, 10:293-304
Morse, P.M. and Feshbach, H. Asymptotic Series; Method of Steepest Descent. In: Methods of Theoretical Physics, Part I. McGraw-Hill, New York, 1953, 434-443
Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995, 247:536-540
Najmanovich, R., Kuttner, J., Sobolev, V. and Edelman, M. Side-chain flexibility in proteins upon ligand binding. Proteins 2000, 39:261-268
Needleman, S.B. and Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970, 48:443-453
Negelein, E. and Wulff, H.-J. Kristallisation des Proteins der Acet-aldehyd-reduktase. Biochem. Z. 1937, 289:436-437
Nei, M. Phylogenetic analysis in molecular evolutionary genetics. Annu. Rev. Genet. 1996, 30:371-403
Ohlrogge, J. and Browse, J. Lipid biosynthesis. Plant Cell 1995, 7:957-970


Ohtomo, Y., Bergman, T., Johansson, B.-L., Jörnvall, H. and Wahren, J. Differential effects of proinsulin C-peptide fragments on Na+,K+-ATPase activity of renal tubule segments. Diabetologia 1998, 41:287-291
Oldfield, E. Chemical shifts in amino acids, peptides, and proteins: from quantum chemistry to drug design. Annu. Rev. Phys. Chem. 2002, 53:349-378
Oppermann, U.C., Persson, B., Filling, C. and Jörnvall, H. Structure–function relationships of SDR hydroxysteroid dehydrogenases. Adv. Exp. Med. Biol. 1997, 414:403-415
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B. and Thornton, J.M. CATH - A Hierarchic Classification of Protein Domain Structures. Structure 1997, 5:1093-1108
Page, R.D. TreeView: an application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 1996, 12:357-358
Pearson, W.R. and Lipman, D.J. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 1988, 85:2444-2448
Persson, B., Hallborn, J., Walfridsson, M., Hahn-Hägerdal, B., Keränen, S., Penttilä, M. and Jörnvall, H. Dual relationships of xylitol and alcohol dehydrogenases in families of two protein types. FEBS Lett. 1993, 324:9-14
Persson, B., Krook, M. and Jörnvall, H. Characteristics of short-chain alcohol dehydrogenases and related enzymes. Eur. J. Biochem. 1991, 200:537-543
Persson, B., Krook, M. and Jörnvall, H. Short-chain dehydrogenases/reductases. Adv. Exp. Med. Biol. 1995, 372:383-395
Persson, B., Zigler, J.S. Jr and Jörnvall, H. A super-family of medium-chain dehydrogenases/reductases (MDR). Sub-lines including ζ-crystallin, alcohol and polyol dehydrogenases, quinone oxidoreductase, enoyl reductases, VAT-1 and other proteins. Eur. J. Biochem. 1994, 226:15-22
Polak, E. Chapter 2.3 In: Computational Methods in Optimization. New York: Academic Press, 1971
Powell, A.J., Read, J.A., Banfield, M.J., Gunn-Moore, F., Yan, S.D., Lustbader, J., Stern, A.R., Stern, D.M. and Brady, R.L. Recognition of Structurally Diverse Substrates by Type II 3-Hydroxyacyl-CoA Dehydrogenase (HADH II)/Amyloid-Beta Binding Alcohol Dehydrogenase (ABAD). J. Mol. Biol. 2000, 303:311-327
Pramanik, A., Ekberg, K., Zhong, Z., Shafqat, J., Henriksson, M., Jansson, O., Tibell, A., Tally, M., Wahren, J., Jörnvall, H., Rigler, R. and Johansson, J. C-peptide binding to human cell membranes: importance of Glu27. Biochem. Biophys. Res. Commun. 2001, 284:94-98
Rao, P.V., Gonzalez, P., Persson, B., Jörnvall, H., Garland, D. and Zigler, J.S. Jr. Guinea pig and bovine ζ-crystallins have distinct functional characteristics highlighting replacements in otherwise similar structures. Biochemistry 1997, 36:5353-5362


Rarey, M., Kramer, B., Lengauer, T. and Klebe, G. A fast flexible docking method using an incremental construction algorithm. J. Mol. Biol. 1996, 261:470-489
Rigler, R., Pramanik, A., Jonasson, P., Kratz, G., Jansson, O.T., Nygren, J.-Å., Ståhl, S., Ekberg, K., Johansson, B.-L., Uhlén, S., Uhlén, M., Jörnvall, H. and Wahren, J. Specific binding of proinsulin C-peptide to human cell membranes. Proc. Natl. Acad. Sci. USA 1999, 96:13318-13323
Rossmann, M.G., Moras, D. and Olsen, K.W. Chemical and biological evolution of nucleotide-binding protein. Nature 1974, 250:194-199
Rost, B. and Sander, C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 1993, 232:584-599
Rost, B. and Sander, C. Third generation prediction of secondary structure. In: Protein Structure Prediction: Methods and Protocols (ed. Webster, D.), Humana Press, Clifton, NJ, 2000, 71-95
Saitou, N. and Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987, 4:406-425
Sali, A. and Blundell, T.L. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 1993, 234:779-815
Sanchez, R. and Sali, A. Advances in comparative protein-structure modelling. Curr. Opin. Struct. Biol. 1997, 7:206-214
Sandak, B., Wolfson, H.J. and Nussinov, R. Flexible docking allowing induced fit in proteins: insights from an open to closed conformational isomers. Proteins 1998, 32:159-174
Sarni, F., Grand, C. and Boudet, A.M. Purification and properties of cinnamoyl-CoA reductase and cinnamyl alcohol dehydrogenase from poplar stems (Populus X euramericana). Eur. J. Biochem. 1984, 139:259-265
Schmidt, M.W., Baldridge, K.K., Boatz, J.A., Elbert, S.T., Gordon, M.S., Jensen, J.H., Koseki, S., Matsunaga, N., Nguyen, K.A., et al. General atomic and molecular electronic structure system. J. Comput. Chem. 1993, 14:1347-1363
Schneider, G. and Bohm, H.J. Virtual screening and fast automated docking methods. Drug Discov. Today 2002, 7:64-70
Schwartz, M.F. and Jörnvall, H. Structural analyses of mutant and wild-type alcohol dehydrogenases from Drosophila melanogaster. Eur. J. Biochem. 1976, 68:159-168
Sellers, P.H. On the theory and the computation of evolutionary distances. SIAM J. Appl. Math. 1974, 26:787-793
Shepherd, A.J., Gorse, D. and Thornton, J.M. Prediction of the location and type of beta-turns in proteins using neural networks. Protein Sci. 1999, 8:1045-55


Sippl, M.J. Knowledge-based potentials for proteins. Curr. Opin. Struct. Biol. 1995, 5:229-235
Skolnick, J., Kolinski, A., Kihara, D., Betancourt, M., Rotkiewicz, P. and Boniecki, M. Ab initio protein structure prediction via a combination of threading, lattice folding, clustering, and structure refinement. Proteins 2001, Suppl 5:149-156
Smith, T.F. and Waterman, M.S. Identification of common molecular subsequences. J. Mol. Biol. 1981, 147:195-197
Sofer, W. and Ursprung, H. Drosophila alcohol dehydrogenase. J. Biol. Chem. 1968, 243:3110-3115
Steiner, D., Cunningham, D., Spigelman, L. and Aten, B. Insulin biosynthesis: evidence for a precursor. Science 1967, 157:697-700
Stewart, J.J. MOPAC: a semiempirical molecular orbital program. J. Comput. Aided Mol. Des. 1990, 4:1-105
Stoesser, G., Baker, W., van den Broek, A., Camon, E., Garcia-Pastor, M., Kanz, C., Kulikova, T., Lombard, V., Lopez, R., Parkinson, H., Redaschi, N., Sterk, P., Stoehr, P. and Tuli, M.A. The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 2001, 29:17-21
Swofford, D.L. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts, 1998
Tanaka, N., Nonaka, T., Tanabe, T., Yoshimoto, T., Tsuru, D. and Mitsui, Y. Crystal structures of the binary and ternary complexes of 7α-hydroxysteroid dehydrogenase from Escherichia coli. Biochemistry 1996, 35:7715-7730
Tekaia, F., Lazcano, A. and Dujon, B. The genomic tree as revealed from whole proteome comparisons. Genome Res. 1999, 9:550-557
Thompson, J.D., Higgins, D.G. and Gibson, T.J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22:4673-4680
Thomsen, J., Kristiansen, K., Brunfeldt, K. and Sundby, F. The amino acid sequence of human glucagon. FEBS Lett. 1972, 21:315-319
Totrov, M. and Abagyan, R. Detailed ab initio prediction of lysozyme-antibody complex with 1.6 Å accuracy. Nat. Struct. Biol. 1994, 1:259-263
Totrov, M. and Abagyan, R. Flexible protein-ligand docking by global energy optimization in internal coordinates. Proteins 1997, Suppl 1:215-220
Wahren, J. and Johansson, B. New aspects of C-peptide physiology. Horm. Metab. Res. 1998, 30:A2-A5


Wahren, J., Ekberg, K., Johansson, J., Henriksson, M., Pramanik, A., Johansson, B.L., Rigler, R. and Jörnvall, H. Role of C-peptide in human physiology. Am. J. Physiol. Endocrinol. Metab. 2000, 278:E759-E768
Venclovas, C., Zemla, A., Fidelis, K. and Moult, J. Some measures of comparative performance in the three CASPs. Proteins 1999, Suppl 3:231-237
Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., et al. and Zhu, X. The sequence of the human genome. Science 2001, 291:1304-1351
Vogt, G., Etzold, T. and Argos, P. An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J. Mol. Biol. 1995, 249:816-831
Wu, C.H., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z.Z., Ledley, R.S., Lewis, K.C., Mewes, H.W., Orcutt, B.C., Suzek, B.E., Tsugita, A., Vinayaka, C.R., Yeh, L.S., Zhang, J. and Barker, W.C. The Protein Information Resource: an integrated public resource of functional annotation of proteins. Nucleic Acids Res. 2002, 30:35-37
Yamazoe, M., Shirahige, K., Rashid, M.B., Kaneko, Y., Nakayama, T., Ogasawara, N. and Yoshikawa, H. A protein which binds preferentially to single-stranded core sequence of autonomously replicating sequence is essential for respiratory function in mitochondrial of Saccharomyces cerevisiae. J. Biol. Chem. 1994, 269:15244-15252
Yokomizo, T., Ogawa, Y., Uozumi, N., Kume, K., Izumi, T. and Shimizu, T. cDNA cloning, expression, and mutagenesis study of leukotriene B4 12-hydroxydehydrogenase. J. Biol. Chem. 1996, 271:2844-2850
Zvelebil, M.J., Barton, G.J., Taylor, W.R. and Sternberg, M.J.E. Prediction of protein secondary structure and active sites using alignment of homologous sequences. J. Mol. Biol. 1987, 195:957-961
Österberg, F., Morris, G.M., Sanner, M.F., Olson, A.J. and Goodsell, D.S. Automated docking to multiple target structures: incorporation of protein mobility and structural water heterogeneity in AutoDock. Proteins 2002, 46:34-40