Bioinformatics Methods for Protein Identification

BIOINFORMATICS METHODS FOR PROTEIN IDENTIFICATION USING PEPTIDE MASS FINGERPRINTING DATA _______________________________________ A Dissertation presented to the Faculty of the Graduate School at the University of Missouri-Columbia _______________________________________________________ In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy _____________________________________________________ by ZHAO SONG Dr. Dong Xu, Dissertation Supervisor MAY 2009 The undersigned, appointed by the dean of the Graduate School, have examined the [thesis or dissertation] entitled BINFORMATICS METHODS FOR PROTEIN IDENTIFICATION USING PEPTIDE MASS FINGERPRINTING presented by Zhao Song, a candidate for the degree of [doctor of philosophy of Computer Science], and hereby certify that, in their opinion, it is worthy of acceptance. Professor Dong Xu Professor Ye Duan Professor Chi-Ren Shyu Professor Dmitry Korkin Professor Sounak Chakraborty ACKNOWLEDGEMENTS I would like to thank my advisor Dr. Dong Xu who is an experienced researcher and has always been supportive through the years of my study. With his guidance, I have finished the project successfully and learned how to do research and write papers. Dr. Dong Xu is a great mentor in life as well. It has been great experience working in his lab. I would like to thank Dr. Chi-Ren Shyu, Dr. Duan Ye, Dr. Sounak Chakraborty and Dr. Dmitry Korkin who are willing to serve on my committee. I acknowledge Dr. Chi- Ren Shyu for his valuable advices on database search issue. I acknowledge Dr. Sounak Chakraborty for his suggestions on the statistic model design. I acknowledge Dr. Duan Ye and Dr. Dmitry Korkin for the discussions on many detailed problems. I would like to thank Dr. Luonan Chen, who has proposed plenty of valuable ideas in the theory framework construction. I also wish to thank the Proteomics Center at University of Missouri for providing us mass spectrometry services. We would like to thank Beverly DaGue, Nengbing Tao, David Emerich, Gary Stacey, and Laurent Brechenmacher for helpful discussions and assistance. I would like to thank the members in Digital Biology Laboratory, University of Missouri: Chao Zhang, Nick Lin and Jianjiong Gao for their effective work and support in the collaboration. I also appreciate the great friendship we have developed through these years. I would like to acknowledge the financial support for my dissertation work from MU-Monsanto Program and a National Science Foundation grant NSF/ITRIIS-0407204. Part of my support is also from the Shumaker Fellowship. The last but not the least, this dissertation is dedicated to my wife, Yang Guo, who is not only my soul mate in life, but also a great help for my doctoral studies. ii TABLE OF CONTENTS ACKNOWLEDGEMENTS ................................................................................................ ii TABLE OF CONTENTS ................................................................................................... iii LIST OF TABLES ............................................................................................................ vii LIST OF FIGURES ......................................................................................................... viii ABSTRACT ....................................................................................................................... ix Chapter 1. INTRODUCTION ....................................................................................................1 1.1 High-throughput Data in Proteomics .............................................................2 1.2 Protein Identification Pipeline ........................................................................4 1.2.1 Protein Extraction and Digestion ..........................................................6 1.2.2 Protein Separation .................................................................................7 1.2.3 Mass Spectrometry Analysis .................................................................8 1.2.4 Protein Identification .............................................................................9 1.3 Mass Spectrometry Technology ...................................................................12 1.3.1 Peptide Mass Fingerprinting (PMF) ....................................................13 1.3.2 Tandem Mass (MS/MS) ......................................................................13 1.4 Materials and Database ................................................................................14 1.4.1 Materials ..............................................................................................14 1.4.2 Database ..............................................................................................15 1.5 Existing Computational Methods .................................................................16 1.5.1 MOWSE ..............................................................................................17 1.5.2 Profound ..............................................................................................18 1.5.3 Protein Prospector ...............................................................................20 iii 1.5.4 Normal Distribution Scoring Function ................................................21 1.5.5 Protein Identification Methods for Tandem Mass ...............................22 1.6 Dissertation Structure ..................................................................................23 2. PROTEIN IDENTIFICATION SCORING FUNCTIONS .....................................24 2.1 Introduction ..................................................................................................24 2.2 Data Sources .................................................................................................25 2.3 Scoring Function ..........................................................................................28 2.3.1 Scoring Function Review ....................................................................28 2.3.1.1 Mowse ....................................................................................28 2.3.1.2 Profound .................................................................................29 2.3.1.3 Normal Distribution Scoring Function ...................................30 2.3.2 Probability Based Scoring Function ....................................................31 2.3.2.1 Framework ..............................................................................31 2.3.2.2 Dependency of Peptides and Protein ......................................35 2.3.2.3 Peak Selection and Normalization .........................................39 2.3.2.3.1 Peak Selection .........................................................39 2.3.2.3.2 Peak Normalization .................................................41 2.3.2.4 Modified PBSF .......................................................................45 2.4 Results ..........................................................................................................45 2.4.1 Score Schema Comparison ..................................................................45 2.4.2 Comparison with Mascot and Protein Prospector ...............................49 2.5 Discussion ....................................................................................................50 3. CONFIDENCE ASSESSMENT .............................................................................53 3.1 Introduction ..................................................................................................53 3.2 Theory Fundamental ....................................................................................54 iv 3.2.1 Binomial Distribution ..........................................................................54 3.2.2 Central Limit Theory ...........................................................................56 3.2.3 Normal Approximation to Binomial Distribution ...............................56 3.3 Confidence Assessment Approaches ...........................................................58 3.3.1 Central Limit Theory Approach ..........................................................58 3.3.2 Gram-Charlier Expansion Approach ...................................................61 3.4 Results ..........................................................................................................63 3.4.1 Study on Individual Protein Kinase 2 .................................................64 3.4.2 Bench Mark of Entire Data Set ...........................................................65 3.4.3 Bootstrap for Confidence Interval .......................................................71 3.4.4 Confidence Interpretation ....................................................................74 3.5 Discussion ....................................................................................................75 4. SOFTWARE ...........................................................................................................76 4.1 SpotLink .......................................................................................................76 4.1.1 Clickable 2D gel ..................................................................................76 4.1.2 Web Pages of Protein Expression Profile ..........................................79 4.2 ProteinDecision ............................................................................................80 4.2.1 Functionality ........................................................................................81

Bioinformatics Methods for Protein Identification

MALDI-TOF/TOF MS Protocols Objectives

Optimized Search Strategy to Maximize PTM Characterization and Protein Coverage in Proteome Discoverer Software

Protein Identification by Mass Spectrometry

Large-Scale Database Searching Using Tandem Mass Spectra: Looking up the Answer in the Back of the Book

Ptmap—A Sequence Alignment Software for Unrestricted, Accurate, and Full-Spectrum Identification of Post-Translational Modification Sites

The Power of Three in Cannabis Shotgun Proteomics: Proteases, Databases and Search Engines

Using Mascot to Characterise Protein Modifications

John Cottrell, David Creasy & Darryl Pappin

Discovering the N-Terminal Methylome by Repurposing of Proteomic

Modifications

Evaluating Peptide Mass Fingerprinting-Based Protein Identification

Peptide Mass Fingerprinting