The University of Chicago the Metaproteomic Analysis Of
Total Page:16
File Type:pdf, Size:1020Kb
THE UNIVERSITY OF CHICAGO THE METAPROTEOMIC ANALYSIS OF ARCTIC SOILS WITH NOVEL BIOINFORMATIC METHODS A DISSERTATION SUBMITTED TO THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES IN CANDIDACY FOR THE DEGREE OF DOCTOR OF PHILOSOPHY DEPARTMENT OF THE GEOPHYSICAL SCIENCES BY SAMUEL MILLER CHICAGO, ILLINOIS DECEMBER 2018 COPYRIGHT © 2018 Samuel Edward Miller All Rights Reserved TABLE OF CONTENTS LIST OF FIGURES ................................................................................................................. vii LIST OF TABLES .....................................................................................................................ix ACKNOWLEDGMENTS ........................................................................................................... x ABSTRACT ..............................................................................................................................xi I. INTRODUCTION ................................................................................................................... 1 I.A. IMPETUS FOR RESEARCH ...........................................................................................1 I.B. PROTEOMICS BACKGROUND ....................................................................................2 I.B.1. TANDEM MASS SPECTROMETRY .......................................................................3 I.B.2. METHODS OF AUTOMATIC SEQUENCE ASSIGNMENT ...................................9 I.B.2.i. DE NOVO SEQUENCING BY PEPNOVO+ .................................................... 11 I.B.2.ii. DE NOVO SEQUENCING BY NOVOR.......................................................... 12 I.C. REFERENCES ............................................................................................................... 15 II. CHAPTER 1. POSTNOVO: POST-PROCESSING ENABLES ACCURATE AND FDR- CONTROLLED DE NOVO SEQUENCING ............................................................................ 19 ABSTRACT.......................................................................................................................... 19 II.A. INTRODUCTION ........................................................................................................ 20 II.B. METHODS ................................................................................................................... 23 II.B.1. PROTEOMIC DATASETS .................................................................................... 23 II.B.2. ALGORITHM DESCRIPTION .............................................................................. 25 II.B.3. ALGORITHM EVALUATION .............................................................................. 28 II.C. RESULTS AND DISCUSSION .................................................................................... 30 iii II.C.1. POSTNOVO PERFORMANCE COMPARED TO INDIVIDUAL DE NOVO SEQUENCING TOOLS .................................................................................................... 30 II.C.2. CONTRIBUTION OF NOVEL FEATURES TO THE POSTNOVO CLASSIFICATION MODEL ............................................................................................ 33 II.C.2.i. CONSENSUS SEQUENCES AND MODEL RESULTS .................................. 34 II.C.2.ii. MASS TOLERANCE AGREEMENT ............................................................. 37 II.C.2.iii. PRECURSOR CLUSTERING........................................................................ 38 II.C.2.iv. POTENTIAL SEQUENCE ERRORS ............................................................. 38 II.D. FDR CONTROL FROM POSTNOVO SCORING .................................................... 39 II.E. ACCURATE SEQUENCES NOT FOUND BY DATABASE SEARCH ................... 40 II.F. CONCLUSIONS ........................................................................................................... 44 II.G. SUPPORTING INFORMATION .................................................................................. 46 II.G.1. CONSENSUS SEQUENCE IDENTIFICATION ................................................... 46 II.G.2. CLUSTERING SPECTRA FROM THE SAME MOLECULAR SPECIES ............ 49 II.G.3. POTENTIAL SEQUENCE ERRORS..................................................................... 50 II.G.4. OTHER FEATURES OF POSTNOVO MODEL ................................................... 51 II.G.5. LENGTH-ACCURACY TRADEOFF OF PARTIAL-LENGTH SEQUENCES ..... 51 II.G.6. SCORE MODELS ................................................................................................. 52 II.G.7. SUPPORTING FIGURES ...................................................................................... 54 II.G.8. SUPPORTING TABLES ....................................................................................... 64 II.H. REFERENCES ............................................................................................................. 84 iv III. CHAPTER 2. CONSIDERATIONS IN THE ANALYSIS OF DE NOVO PEPTIDE SEQUENCES ........................................................................................................................... 89 III.A. INTRODUCTION ....................................................................................................... 89 III.B. HOMOLOGOUS SEQUENCE IDENTIFICATION .................................................... 90 III.C. TAXONOMIC AND FUNCTIONAL SCREENING AND ANNOTATION ................ 97 III.D. DISCUSSION ............................................................................................................. 98 III.E. REFERENCES .......................................................................................................... 101 IV. CHAPTER 3. THE METAPROTEOMIC ANALYSIS OF ARCTIC SOILS ..................... 103 IV.A. INTRODUCTION ..................................................................................................... 103 IV.B. METHODS................................................................................................................ 107 IV.B.1. SAMPLES .......................................................................................................... 107 IV.B.2. PROTEIN EXTRACTION .................................................................................. 109 IV.B.3. DATA ANALYSIS ............................................................................................. 111 IV.B.3.i. NUCLEOTIDE DATA.................................................................................. 111 IV.B.3.ii. PEPTIDE DATA ......................................................................................... 115 IV.C. RESULTS ................................................................................................................. 121 IV.C.1. COMPARISON OF ENVIRONMENTS USING PROTEIN EXPRESSION PROFILES ...................................................................................................................... 121 IV.C.1.i. COMPARISON OF OVERALL PROTEIN EXPRESSION LEVELS ........... 121 IV.C.1.ii. MULTIVARIATE ANALYSIS OF PROTEIN EXPRESSION BY TAXA IN DIFFERENT ENVIRONMENTS ................................................................................ 129 IV.C.2. COMPARISON OF THE FUNCTIONAL PROFILES OF TAXA ...................... 135 v IV.C.2.i. OVERALL PATTERNS AND CELLULAR ACTIVITY .............................. 135 IV.C.2.ii. CARBON METABOLISM AND ENERGY CONSERVATION ................. 142 IV.C.2.iii. NUTRIENTS AND TRACE ELEMENTS .................................................. 151 IV.C.2.iv. CELL ENVELOPE AND MOVEMENT ..................................................... 158 IV.D. DISCUSSION AND CONCLUSION ........................................................................ 163 IV.E. REFERENCES .......................................................................................................... 171 V. CONCLUSION .................................................................................................................. 181 REFERENCE ...................................................................................................................... 182 VI. APPENDIX ...................................................................................................................... 183 vi LIST OF FIGURES Figure I.1. Bottom-up proteomics workflow .................................................................................. 4 Figure I.2. Fragmentation sites on the peptide backbone ............................................................... 7 Figure I.3. Fragmentation spectrum interpretation ......................................................................... 8 Figure I.4. Part of the Novor decision tree .................................................................................... 13 Figure II.1. Postnovo workflow .................................................................................................... 23 Figure II.2. Comparison of Postnovo to individual tools (datasets 1-4) ........................................ 32 Figure II.3. Contributions to Postnovo model ............................................................................... 35 Figure II.4. Relation of Postnovo score to sequence precision ...................................................... 41 Figure II.5. Postnovo consensus sequence procedure ................................................................... 54 Figure II.6. Pooled de novo sequencing results from six low-resolution test datasets .................. 55 Figure II.7. Comparison of Postnovo to individual tools (datasets