Computational Proteomics for Genome Annotation

A thesis submitted to The University of Manchester for the Degree of PhD in the Faculty of Life Sciences Computational Proteomics for Genome Annotation Paul David Blakeley January 2013 1 List of Contents List of Figures ....................................................................................................................... 5 List of Tables ........................................................................................................................ 6 Abstract ................................................................................................................................. 7 Declaration ............................................................................................................................ 8 Copyright .............................................................................................................................. 9 Acknowledgements ............................................................................................................. 10 The Author .......................................................................................................................... 11 Rationale for an Alternative Format Thesis .................................................................... 12 Abbreviations ..................................................................................................................... 13 Chapter 1: Introduction .................................................................................................... 14 1.1. Genome annotation .............................................................................................................. 15 1.1.1. Accurate genome annotation requires experimental evidence ........................................ 16 1.1.2. Genome annotation pipelines map experimental data to the genome ............................. 19 1.1.3. The need for protein-level data ....................................................................................... 21 1.2. Mapping the proteome ........................................................................................................ 23 1.2.1. High-throughput proteomics ........................................................................................... 24 1.2.2. Targeted Proteomics ....................................................................................................... 26 1.2.3. Peptide Identification via database searching ................................................................. 28 1.3. Proteogenomics .................................................................................................................... 30 1.3.1. Introducing proteogenomic approaches to genome annotation....................................... 31 1.4. Overview ............................................................................................................................... 36 1.5. References ............................................................................................................................. 38 Chapter 2: Investigating protein isoforms via proteomics: a feasibility study. ............ 49 Abstract ........................................................................................................................................ 50 2.1. Introduction .......................................................................................................................... 51 2.2. Materials and Methods ........................................................................................................ 53 2.2.1. Genome and proteome sequences ................................................................................... 53 2.2.2. Peptide identifications ..................................................................................................... 54 2.2.3. Chicken samples and data analysis ................................................................................. 55 2.2.4. Peptide mapping .............................................................................................................. 55 2.2.5. Database comparison with Swiss-Prot HPI..................................................................... 56 2.2.6. QconCAT design............................................................................................................. 57 2.3. Results and discussion ......................................................................................................... 58 2.3.1. Proteomic data and alternate splicing ............................................................................. 58 2.3.2. Examples of alternate splicing with supporting proteomic evidence .............................. 67 2.3.3. Protein Isoforms in UniProtKB/Swiss-Prot .................................................................... 69 2.3.4. Implications for design.................................................................................................... 71 2.3.5. Isoform detection and quantification using QconCATs and SRM ................................. 74 1 WORD COUNT: 39,225 2 2.4. Conclusions ........................................................................................................................... 77 2.5. References ............................................................................................................................. 77 Chapter 3: Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies ....................................................................................... 85 Abstract ........................................................................................................................................ 86 3.1. Introduction .......................................................................................................................... 87 3.2. Materials and Methods ........................................................................................................ 90 3.2.1. EST dataset ..................................................................................................................... 90 3.2.2. Preparation of chicken samples and mass spectrometry ................................................. 90 3.2.3. Databases ........................................................................................................................ 91 3.2.4. EORF description ............................................................................................................ 91 3.2.5. Mass spectrometry database searching ........................................................................... 94 3.2.6. Statistical methods for validating PSMs ......................................................................... 95 3.2.7. Estimating the proportion of correct PSMs ..................................................................... 96 3.3. Results and Discussion ......................................................................................................... 96 3.3.1. Searching against Six-frame or redundant databases affects sensitivity. ........................ 96 3.3.2. The target-decoy approach over-estimates the q-value/PEP for the six-frame database search. ..................................................................................................................................... 104 3.3.3. Six-frame databases confound the target-decoy assumption ........................................ 106 3.3.4. Equalising the target and decoy databases improves the sensitivity for the six-frame searches. .................................................................................................................................. 114 3.4. Conclusion .......................................................................................................................... 119 3.5. References ........................................................................................................................... 122 Chapter 4: Improving Genome Annotation using Shotgun Proteomics and De Novo- Assembled Transcripts .................................................................................................... 129 Abstract ...................................................................................................................................... 130 4.1. Introduction ........................................................................................................................ 131 4.2. Materials and Methods ...................................................................................................... 134 4.2.1. EST dataset ................................................................................................................... 134 4.2.2. Conceptual translations ................................................................................................. 135 4.2.3. Preparation of chicken samples and mass spectrometry ............................................... 135 4.2.4. Databases ...................................................................................................................... 136 4.2.5. Mass spectrometry database searching ......................................................................... 136 4.2.6. Coverage of proteins predicted from ESTs ................................................................... 137 4.2.7. Identification of mismatches in peptide sequences ....................................................... 137 4.2.8. Validation of

Computational Proteomics for Genome Annotation

Applied Category Theory for Genomics – an Initiative

Distinctive Regulatory Architectures of Germline-Active and Somatic Genes in C

Personal and Population Genomics of Human Regulatory Variation

Technology Development for Single-Molecule Protein Sequencing

Computational Proteomics: High-Throughput Analysis for Systems Biology

Genomic Approaches for Understanding the Genetics of Complex Disease

Molecular Biologist's Guide to Proteomics

Clinical Proteomics

Proteomics & Bioinformatics Part II

The Economic Impact and Functional Applications of Human Genetics and Genomics

S41598-018-25035-1.Pdf

PROTEOMICS the Human Proteome Takes the Spotlight