Structural Approaches to Protein Sequence Analysis
by David Tudor Jones

Department of Biochemistry and Molecular Biology
University College
Gower Street
London WC1E 6BT

This thesis is submitted in fulfilment of the requirements for the degree of Doctor of Philosophy from the University of London.

April 17, 1993

ProQuest Number: 10045889. Published by ProQuest LLC (2016). Copyright of the Dissertation is held by the Author.

Abstract

Various protein sequence analysis techniques are described, aimed at improving the prediction of protein structure by means of pattern matching. To investigate the possibility that improvements in amino acid comparison matrices could result in improvements in the sensitivity and accuracy of protein sequence alignments, a method for rapidly calculating amino acid mutation data matrices from large sequence data sets is presented. The method is then applied to the membrane-spanning segments of integral membrane proteins in order to investigate the nature of amino acid mutability in a lipid environment. Whilst purely sequence analytic techniques work well for cases where some residual sequence similarity remains between a newly characterized protein and a protein of known 3-D structure, in the harder cases, there is little or no sequence similarity with which to recognize proteins with similar folding patterns.
In the light of these limitations, a new approach to protein fold recognition is described, which uses a statistically derived pairwise potential to evaluate the compatibility between a test sequence and a library of structural templates, derived from solved crystal structures. The method, which is called optimal sequence threading, proves to be highly successful, and is able to detect the common TIM barrel fold between a number of enzyme sequences, which has not been achieved by any previous sequence analysis technique.

Finally, a new method for the prediction of the secondary structure and topology of membrane proteins is described. The method employs a set of statistical tables compiled from well-characterized membrane protein data, and a novel dynamic programming algorithm to recognize membrane topology models by expectation maximization. The statistical tables show definite biases towards certain amino acid species on the inside, middle and outside of a cellular membrane.

Acknowledgements

In many ways this section was the hardest of all to write. It seems that I have received help and encouragement in one way or another from just about everyone I have met during my PhD course. Unfortunately, due to the limits of both space and memory, I can only hope to mention a fraction of those who have played an important part in the completion of this project. My sincerest apologies to anyone I have omitted.

It goes without saying that I wouldn’t be writing this were it not for the help, encouragement, trust, friendship and of course education I have received from my two supervisors, Professor Janet Thornton and Dr Willie Taylor. In many senses, it is hard for a PhD student to actually choose a supervisor (let alone TWO supervisors) - they are pretty much chosen for you by the constraints of geography, project availability, timing, luck and of course whether or not the supervisor in question actually wants you to work on their project.
All I can say is that if I could have chosen from every supervisor in the field, I could not have made a better choice. I greatly look forward to our continued friendship and collaboration in the future.

Thanks to the following:

Dr Christine Orengo for all the general help, encouragement, technical discussions on algorithms over the years. Above all, thanks for never physically ejecting me from your office when I’ve been looking for someone to talk to and you’ve had a desk-full of work to do!

Dr Tom Flores for all the help and advice, and the many discussions we’ve had on topics too numerous to mention - most of which left me out of my depth after about 2 milliseconds.

Dr Mark Swindells for the coffee, ice-cream, The Simpsons™, extremely stimulating technical arguments, and humour beyond the call of duty. Don’t stay in Japan for ever!

Dr Simon Hubbard for even more coffee, ice-cream, discussions, the brilliant wit, and keeping a grip on reality even when the rest of the world appeared to be auditioning for a part in a sequel to One Flew Over the Cuckoo's Nest.

Dr Nigel Brown for teaching me about Un*x, and the world of real computers.

Dr Richard Mott for much helpful advice on statistics and many useful suggestions.

Dr Tom Kirkwood for making me welcome in the Laboratory of Mathematical Biology at the National Institute for Medical Research (NIMR), where some of this work was carried out.

And the following people for help, discussion, computer code, jokes but above all just for making my workplace fun to be in: Dr Andras Azodi, Steve Gardner, Dr Michael Green, Dr Gail Hutchinson, Dr Axel Kowald, Dr Roman Laskowski, Malcolm MacArthur, Dr Laurence Pearl, Frances Richardson, Dr Jus Singh, Helen Stirk, Dr Dek Woolfson, and Dr Marketa Zvelebil.

This work was supported by a SERC CASE studentship with the MRC National Institute for Medical Research, Mill Hill, London.

To my parents - Rose and Tudor Jones.
Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Commonly used abbreviations
Standard amino acid codes

Chapter 1 - Pattern Matching Methods in Protein Structure Prediction
1.1 Protein structure prediction from first principles
1.2 Pattern matching methods in the prediction of protein structure and function
1.3 Consensus methods
1.4 Regular expressions
1.5 Specific structural patterns
1.6 Pattern assessment
1.7 Prediction of progress at last

Chapter 2 - The Calculation of New Amino Acid Comparison Matrices
2.1 Introduction
2.2 Types of scoring matrix
2.3 Construction of the raw PAM matrix
2.4 Calculation of relative mutabilities
2.5 Calculation of the mutation probability matrix
2.6 Calculating the log-odds matrix
2.7 Automating the procedure
2.8 Program implementation
2.9 Results
2.10 Summary
2.11 Alignment parameter optimization
2.12 A mutation data matrix for transmembrane proteins
2.13 Discussion

Chapter 3 - Protein Tertiary Structure Prediction by Fold Recognition
3.1 Introduction
3.2 A limited number of folds
3.3 The fold library