Download/Tandemms.Doc)
Total Page:16
File Type:pdf, Size:1020Kb
COMPUTATIONAL METHODS FOR IMPROVED PEPTIDE AND PROTEIN IDENTIFICATION IN PROTEOMICS by KAREN MEYER-ARENDT B.A., Reed College, 1981 M.S., Oregon State University, 1987 A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirement for the degree of Doctor of Philosophy Department of Chemistry and Biochemistry 2011 This thesis entitled: Computational Methods for Improved Peptide and Protein Identification in Proteomics written by Karen Meyer-Arendt has been approved for the Department of Chemistry and Biochemistry ________________________________ Natalie G. Ahn ________________________________ William M. Old Date____________ The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline. ABSTRACT Meyer-Arendt, Karen (Ph.D., Biochemistry) Computational Methods for Improved Peptide and Protein Identification in Proteomics Thesis directed by Professor Natalie G. Ahn Shotgun proteomics is an analytical method used to identify proteins from complex mixtures such as a whole-cell lysates. This method utilizes high-resolution mass spectrometers, proteolysis, and fractionation techniques in order to maximize the number and quantities of proteins being detected. Knowing the identity and abundance of proteins in a cell provides insights into cell functioning, and how cells respond to external stimuli. Our lab uses proteomics to further understanding of signaling networks and how these are dysregulated during melanoma progression. Computer algorithms are an essential aspect of shotgun proteomics in order to match hundreds of thousands of spectra to the peptide sequences from which they came. The most productive peptide identification methods search databases of protein sequences, looking for the best peptide spectrum matches, but these methods can be plagued by false positives and false negatives. I designed and implemented MSPlus, software which increases sensitivity and specificity in peptide identification by using physicochemical filters and consensus scoring between multiple database search programs, approaches which are now commonly in use. After peptides are identified, they must be mapped back to the proteins from which they derive, a non-trivial task in the human proteome with its extensive alternative splicing, gene iii duplication, and post-translational processing. I designed and implemented IsoformResolver, software which accurately and efficiently infers proteins from peptides using a pre-calculated peptide-centric reformatted protein database. Proteins are reported in the context of protein groups, a concise representation which allows experimentalists to see the most likely proteins in the context of all possible proteins for which there is mass spectrometry evidence. This novel representation minimizes the protein volatility inherent to the more common protein-centric output. Finally I examine the capabilities and limits of shotgun proteomics. I introduce a tier- based representation of protein abundances, and investigate how the abundances vary by protein class and at different sampling depths. I compare proteomics and transcriptomics results, and investigate to what extent proteomics can be used to identify members which distinguish cell states. iv ACKNOWLEDGMENTS Thanks go, above all, to Natalie Ahn and Katheryn Resing for hiring me years ago, encouraging me to go to graduate school, and teaching me so much about mass spectrometry, scientific rigor, and thinking big. Thank you also to my committee members Rob Knight, Will Old, Xuedong Liu, and Robin Dowell, all of whom who gave me valuable feedback on my research over the years. I am thankful to my co-members on the ABRF iPRG Study Group for many valuable discussions. Thanks also go out to my past and present colleagues at the Mass Spectrometry Lab, especially Chia-Yu Yen, Stephane Houel, Alex Mendoza, Lauren Aveline- Wolf, and Karen Jonscher, who were always helpful and fun to work with. I also want to mention how pleased I was with the CU Biochemistry Department, and especially the Molecular Biophysics Training Program, for teaching me more about Biochemistry than I could have imagined. There are many other people that I am indebted to, so I would also like to thank my parents, who set the example of finding a profession they loved, even if it took a circuitous route to get there; my friends, for encouraging me and always helping me out when needed; and my kids, Ada and Thomas, for their understanding when I was traveling or busy. Most of all, I thank my husband, George, who was immensely supportive of my studies and research these past years. v CONTENTS CHAPTER I. INTRODUCTION Overview of Proteomics .................................................................................1 Peptide Identification ......................................................................................4 Protein Inference .............................................................................................9 II. IMPROVING REPRODUCIBILITY AND SENSITIVITY IN IDENTIFYING HUMAN PROTEINS BY SHOTGUN PROTEOMICS Introduction ...................................................................................................15 Methods.........................................................................................................17 Results and Discussion .................................................................................24 Conclusion ....................................................................................................48 III. ISOFORM RESOLVER: A PEPTIDE-CENTRIC ALGORITHM FOR PROTEIN INFERENCE Introduction ...................................................................................................51 Methods.........................................................................................................56 Results ...........................................................................................................64 Conclusions .................................................................................................105 IV. CONCLUSIONS AND FUTURE DIRECTIONS Discussion ...................................................................................................108 REFERENCES .....................................................................................................................130 APPENDIX A. MANUAL ANALYSIS OF EXAMPLES IN TABLE 2.2 ILLUSTRATING “DISTRACTION” ...............................................................................................135 B. DISTRIBUTION OF MS/MS SPECTRAL ION INTENSITIES IN THE DISTRACTED GROUP VS. THE MSPLUS-VALIDATED GROUP ...............144 vi TABLES TABLE 2.1 Samples of soluble protein extracts from human K562 erythroleukemia cells .....18 2.2 Examples illustrating "distraction" by Sequest and Mascot, in which correct sequence assignments are replaced by incorrect assignments as the database size increases .........................................................................................................37 2.3 Isoform Resolver specifies protein variants from observed peptide sequences ....45 3.1 Terminology ...........................................................................................................53 3.2 Datasets used in these studies ................................................................................57 3.3 Protein inference reduces protein repeatability between replicates .......................78 3.4 Protein inference methods underestimate overlap between proteins obtained from datasets with lower vs. higher depth of sampling .........................................83 3.5 Comparison of protein inference programs ...........................................................98 3.6 Examples of ISD groups found in common by six protein inference programs ..102 4.1 Protein locations vary by tier between low- and high-depth sampling ................114 4.2 Predicted numbers of proteins using capture-recapture calculations ...................115 vii FIGURES FIGURE 1.1 Diagram of a mass spectrometer .................................................................................2 1.2 Peptide fragmentation .................................................................................................3 1.3 Shotgun proteomics protocol ......................................................................................4 1.4 Distribution of peptide identification scores ...............................................................6 1.5 MSPlus filters remove false positive assignments ......................................................8 1.6 Distinct vs. shared peptides .......................................................................................10 1.7 Protein-centric vs. peptide-centric strategy ...............................................................12 1.8 Protein volatility........................................................................................................13 2.1 Randomly determined peptide scores .......................................................................25 2.2 Combining Sequest and Mascot results to validate more DTA files ........................30 2.3 Using SCX chromatography as an MSPlus filtering criterion ..................................33 2.4 Distraction increases with database size ...................................................................41 3.1 IsoformResolver