Proteogenomics for Personalised Molecular Profiling

Christoph Norbert Schlaffner

Department of Chemistry University of Cambridge Clare Hall

This dissertation is submitted for the degree of Doctor of Philosophy

June 2017

I would like to dedicate this thesis to my loving family.

DECLARATION

I hereby declare that this thesis describes work carried out between September 2014 and June 2017 under supervision of Dr Jyoti Choudhary and Dr Andreas Bender at the Wellcome Trust Sanger Institute and the Department of Chemistry, University of Cambridge, while member of Clare Hall, University of Cambridge. This thesis is the result of my own work except where specific reference is made to the work of others. The contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other university. This dissertation contains fewer than 60,000 words including summary, tables, and footnotes according to the specifications defined by the Board of Graduate Studies and the Chemistry Degree Committee.

Christoph Norbert Schlaffner June 2017

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my supervisors Jyoti Choudhary and Andreas Bender. I would like to thank Jyoti for giving me the opportunity to carry out this project, her valuable advice, support, and encouragement. Thank you for keeping me on track and motivated throughout the PhD. I am especially grateful for the freedom of driving my research and the opportunities she pushed me to pursue. I would also like to thank Andreas for taking me on as one of his PhD students, for his support and fruitful discussions.

Many thanks to James Wright and Hendrik Weisser, who supported me through many discussions about data analysis and best practices. I would also like to thank them for developing the proteogenomic pipeline on which my comprehensive pipeline is built and for generating the custom sequence databases. Similarly, I would like to thank Theodoros Roumeliotis, who collected the proteomics data for the colorectal cancer and osteoarthritis panels. I would also like to thank him for providing me with the canonical and variant peptide identifications of the colorectal cancer data set. I would like to extend my gratitude to Lu Yu and Mercedes Pardo for the endless hours they spent helping me to understand the wet-lab work on which my project is built. Furthermore, a special thanks to Thomas Bernwinkler for the parallel high performance version of MS SMiV he set out to implement under my supervision. A special thanks also to Georg Pirklbauer for his work on PoGo enabling the mapping of peptides with amino acid substitutions and his work on peptide-peptide correlation heatmaps, both of which he carried out under my supervision. I also would like to thank all current members and alumni of the Proteomics Mass Spectrometry team at the Wellcome Trust Sanger Institute for many tea and cake breaks and group outings and for making it a great place to work - especially Daniel Bode, Petra Gutenbrunner, Aida Barreiro Alonso, and Charlie Hillier.

I also would like to thank all current and previous members of the Bender group for interesting discussions, pub visits after work, and their willingness to include me in many social events even though I made myself scarce in the office in Chemistry. A special thanks goes out to Krishna Bulusu, Ben Alexander-Dann, Fatima Baldo, Stephanie Ashenden, Kathryn Giblin, Erin Oerton, Fynn Krause and Nitin Sharma for their support and encouragement, especially during the end of my PhD. Without their persistence in pushing me to extend my comfort zone, I would not have had such a great PhD experience and I am very grateful for their friendship.

I extend my gratitude to Juan Antonio Vizcaino and Yasset Perez Riverol for their valuable input for proteogenomic mapping and their determination to integrate PoGo into the PRIDE archive. I am also grateful to the HAVANA team, especially Jonathan Mudge, Adam Frankish, and Jennifer Harrow, who started to investigate my personalised proteogenomics data and track hubs. Furthermore, I would like to thank Sergio Santos and Alvis Brazma for investigating select examples of my proteogenomics data with regards to tissue specific expression of transcript isoforms in RNA-seq experiments. Similarly, I thank Elefteria Zeggini, Graham Ritchie, Julia Steinberg, Christine Le Maitre, and J. Mark Wilkinson for collecting the osteoarthritic patient samples, and providing RNA-sequencing and genotype data, without which my pilot study on personal molecular profiling would not have been possible. Additionally, I would like to extend my gratitude to Jyoti Choudhary and Theodoros Roumeliotis, who kindly read the draft of this thesis and provided me with valuable feedback. Thank you.

A very special thanks goes out to Tomislav Ilicic. Over the last eight years his friendship, objective criticism, and exemplary can-do outlook on life have helped me grow professionally and personally. Without his reach-for-the-stars approach to life and subsequent success, I would not have dreamed about pursuing a PhD at Cambridge. Similarly, I would like to thank Carina Üblackner for her friendship and encouragement over the last eight years. I am thankful for their places in my life.

On a personal note, I would like to thank my parents Gabriele and Norbert Schlaffner for supporting me morally and financially in my endeavour. I would also like to thank my sisters, Sandra and Verena Schlaffner, for putting up with my ridiculous working hours even during holidays when I visited them. I also thank my grandparents, aunts, uncles, and cousins for their support.

Lastly, I would like to extend my gratitude to the GENCODE project for providing me with the funding to pursue a PhD. I am also grateful to Andreas Bender and Clare Hall for stepping in and providing financial support when my funding unexpectedly ended due to the closure of the Proteomics Mass Spectrometry facility at the Wellcome Trust Sanger Institute.

SUMMARY

Technological advancements in mass spectrometry allowing quantification of almost complete proteomes make proteomics a key platform for generating unique functional molecular data. Furthermore, the integrative analysis of genomic and proteomic data, termed proteogenomics, has emerged as a new field revealing insights into gene expression regulation, cell signalling, and disease processes. However, the lack of software tools for high-throughput integration and for unbiased modification and variant detection hinders efforts for large-scale proteogenomics studies. The main objectives of this work are to address these issues by developing and applying new software tools and data analysis methods.

Firstly, I address mapping of peptide sequences to reference genomes. I introduce a novel tool for high-throughput mapping and highlight its unique features facilitating quantitative and post-translational modification mapping alongside accounting for amino acid substitutions. The performance is benchmarked. Furthermore, I offer an additional tool that permits generation of web accessible hubs of genome wide mappings.

To enable unbiased identification of post-translational modifications and amino acid substitutions for high resolution mass spectrometry data, I present algorithmic updates to the mass tolerant blind spectrum comparison tool ‘MS SMiV’. I demonstrate the applicability of the changes by benchmarking against a published mass tolerant database search of a high resolution tandem mass spectrometry dataset.

I then present the application of ‘MS SMiV’ on a panel of 50 colorectal cancer cell lines. I show that the adaptation of ‘MS SMiV’ outperforms traditional sequence database based identification of single amino acid variants. Furthermore, I highlight the utility of mass tolerant spectrum matching in combination with isobaric labelled quantitative proteomics in distinguishing between post-translational modifications and amino acid variants of similar mass.

In the last part of this work I integrate both tools with a high-throughput proteogenomic identification pipeline and apply it to a pilot study of chondrocytes derived from 12 osteoarthritic individuals. I show the value of this approach in identifying variation between individuals and across molecular levels, and highlight this with individual examples. I show that multiplexed proteogenomics can be used to infer genotypes of individuals.

TABLE OF CONTENTS

List of figures

List of tables

Nomenclature

1 Introduction
1.1 From genes to proteins
1.2 Gene expression and RNA-sequencing
1.3 Limitations of RNA-sequencing
1.4 Protein expression and shotgun proteomics
1.4.1 Spectral identification and scoring
1.4.2 Statistical significance measures and quality control
1.4.3 Peptide and protein quantification
1.4.3.1 Overview over quantitation methods
1.4.3.2 Quantitation using isobaric labelling
1.5 Limitations of shotgun proteomics
1.6 Proteogenomics as solution to overcome limitations
1.7 Biological and technical biases affecting integrative analysis
1.8 The orphan of multi-omics: post-translational modifications
1.9 Novel challenges for high-throughput multi-omics
1.10 Objectives and Outline

2 Mapping peptides to genomic loci
2.1 Introduction
2.2 Algorithm development
2.2.1 Input formats
2.2.2 PoGo algorithm
2.2.2.1 Connecting protein sequences with genomic coordinates
2.2.2.2 Identifying protein of origin for input peptides
2.2.2.3 Retrieving genomic coordinates for peptides
2.2.3 Mapping post-translational modifications
2.2.4 Adding quantitative information for multiple samples
2.2.5 Generating different output formats
2.2.5.1 BED format
2.2.5.2 GTF format
2.2.5.3 GCT format
2.2.6 Benchmarking and application
2.2.6.1 Datasets
2.2.6.2 Protein sequences, gene annotation and PoGo settings
2.2.6.3 Comparison of algorithms for performance evaluation
2.2.7 Generation of track hubs
2.3 Results and discussion
2.3.1 Speed and quality of mapping
2.3.2 Track hubs for tissue data
2.3.3 Accounting for SNPs
2.3.4 PTM mapping
2.3.5 Integrating peptide quantitation
2.4 Conclusions
2.5 Publication Note and Contributions

3 Unbiased Detection of Modifications and Variants Using MS SMiV
3.1 Introduction
3.2 Algorithm adaptation, optimization and testing
3.2.1 Adaptations to the algorithm
3.2.1.1 Spectrum processing
3.2.1.2 Peak matching of differently charged spectra
3.2.2 Optimization
3.2.3 Benchmarking
3.3 Results and Discussion
3.3.1 Assessment and optimization
3.3.2 Runtime analysis
3.3.3 Comparison with open modification search
3.3.4 Increased spectra identification rate
3.4 Conclusions
3.5 Publication Note and Contribution

4 Variant Identification in Colorectal Cancer
4.1 Introduction
4.2 Materials and Methods
4.2.1 Data acquisition, identification, and quantification
4.2.2 Unbiased mass tolerant spectrum pairing
4.2.3 Gaussian fit analysis
4.2.4 Identification of PTMs and SNPs using peptide quantitation
4.2.5 Comparison of unbiased SNP detection with manually curated identifications
4.3 Results and Discussion
4.3.1 Peptide identification through database searching
4.3.2 Spectrum pairing for false discovery rate estimation
4.3.3 Unbiased spectrum pairing for modification masses
4.3.4 Peptide quantitation distinguishing between post-translational modifications and amino acid variants
4.3.5 Benchmarking MS SMiV against database searching for variant identification
4.3.6 De novo identification of amino acid variants
4.4 Conclusions
4.5 Publication Note and Contributions

5 Quantitative Proteogenomics for Personalised Molecular Profiling
5.1 Introduction
5.2 Materials and Methods
5.2.1 Data acquisition
5.2.2 RNA-sequencing analysis
5.2.3 Comprehensive proteogenomic analysis pipeline
5.2.3.1 OpenMS identification and quantification
5.2.3.2 Proteogenomic mapping
5.2.3.3 Sequence variance detection
5.2.3.4 MS SMiV application for de novo SNP identification
5.3 Results and Discussion
5.3.1 Identification results
5.3.2 Proteogenomic resolution of splicing with missing RNA-sequencing support
5.3.3 Splice level variation
5.3.4 Alternative inclusion of 5’ exons
5.3.5 De novo SNP detection
5.3.5.1 Protein abundance variation in individuals
5.3.6 Differential regulation of protein and RNA
5.3.6.1 Isoform switching
5.4 Conclusions
5.5 Contributions and Publication Note

6 Concluding Remarks

Bibliography

LIST OF FIGURES

1.1 Complexity of Information Flow from Gene To Protein
1.2 Gene Expression Analysis Technologies
1.3 Mass Spectrometry Workflow
1.4 Spectra Identification Algorithms
1.5 Significance Measures
1.6 Mass Spectrometry Quantification Methods
1.7 Structure of Isobaric Labels
1.8 Isotope distribution in TMT 10-plex reagents
1.9 GENCODE Protein-coding Gene Annotation Over Time
1.10 Custom Databases in Proteogenomics

2.1 Aspects of Mapping Peptides to Reference Genomes
2.2 Example of The Input Format of PoGo
2.3 Schematic of Data Formats and Processing
2.4 Schematic Representation of Mapping of Proteins onto Genomic Coordinates
2.5 Schematic Representation of Peptide Mapping with Amino Acid Substitutions
2.6 Schematic Representation of Retrieval of Genomic Coordinates for Peptides
2.7 Schematic Representation of PTM Visualization
2.8 Comparison between PoGo and PGx Results
2.9 PoGo and PGx Mapping to High Complexity Region of SPRR3
2.10 CASS4 Example of Track Hub Visualisation
2.11 RBP3 Example of Track Hub Visualisation
2.12 Comparison of Runtime, Memory Requirement and Spread of Peptide Mappings for Allowing Mismatches
2.13 Example of Peptide Mappings Allowing Mismatches
2.14 Example of Phosphorylation Site and Quantitative Mapping
2.15 IGV view of Quantitative Mapping of Phosphoproteome

3.1 MS SMiV Utilities
3.2 Visual Representation of Two-step Spectrum Preprocessing
3.3 Workflow of File Preparation for Benchmarking
3.4 FDR Comparison for Peak Picking Adaptation
3.5 Visual Representation Data Fragmentation and MS SMiV Application
3.6 ∆Mass Distributions
3.7 Overlap of Mass Bins Identified by Mass Tolerant Database Searching and MS SMiV
3.8 Cumulative Ratio Distributions for MS SMiV Mass Tolerant Search
3.9 FDR Comparison using Different Sets of Identifications

4.1 Experimental Workflow for Colorectal Cancer Tribrid MS
4.2 Identification Workflow and Manual Variant Curation
4.3 Workflow for MS SMiV Preparation and Application
4.4 Schematic Representation of Spectrum-Spectrum Correlation Heatmap Generation
4.5 Venn Diagram of Identified Canonical and Variant Peptides
4.6 Examples of Inclusion and Exclusion Criteria for Manual Curation
4.7 Amino Acid Substitution with Overlapping PTM Masses in Unimod
4.8 FDR Comparison across Experimental Batches
4.9 ∆mass Distribution of Spectrum Pairs of MS SMiV Using PTM Masses
4.10 Overlap of Bins with Similar ∆Mass
4.11 Peptide-peptide Correlation of PURB
4.12 Normalized Peptide Abundance and Differences for PURB
4.13 Peptide-peptide Correlation of VAPA
4.14 Normalized Peptide Abundance and Differences for VAPA
4.15 Peptide-peptide Correlation of SON
4.16 Extension Peptide-peptide Correlation of SON
4.17 Extension to Peptide-peptide Correlation of SON
4.18 Normalized Peptide Abundance and Differences for SON
4.19 Overlap of Variant Peptide Identifications
4.20 Peptide-peptide Correlation
4.21 Normalized Peptide Abundance and Differences for SLU7
4.22 Variant Peptide Annotated in Protein SLU7

5.1 Sample Processing for Osteoarthritis Pilot Study
5.2 Proteogenomic Analysis Pipeline
5.3 Normalization of Reporter Intensities for Sample Equal Loading
5.4 Example View in IGV
5.5 Example of Binary Reporter Intensity Pattern Indicating Sequence Variant
5.6 IGV View of ERO1A Gene with Short 12bp Exon
5.7 IGV View of 3’ Alternative Exons of Gene IKBIP
5.8 Relative Reporter Intensities for Alternative 3’ Exons in Gene IKBIP
5.9 IGV View of gene PLEC Showing Alternative Splicing
5.10 IGV of 5’ Alternative Exons in Gene PLEC
5.11 Filtered MS SMiV Identification of ∆Masses in Relation to Scores
5.12 Differential Regulation of Protein PLEC
5.13 Reporter Intensities for Gene PLEC across Individuals in Experiment 5
5.14 De Novo Variant Identification in Gene PLEC
5.15 Reporter Intensities for Peptides of Gene CFHR5
5.16 Reporter Intensities for Peptides of NOS2
5.17 IGV View of NOS2 Peptide and RNA-seq Genomic Alignment
5.18 IGV View of NAGNAG Splicing in Gene HAPLN1
5.19 Differential Regulation of Gene HAPLN1
5.20 Isoform Regulation of Gene ACTN1
5.21 IGV View of Alternative Splicing in Gene ACTN1

LIST OF TABLES

2.1 Colour Code for Level of Uniqueness of Mapping
2.2 Colour Code for Mapping of Post-translational Modifications

3.1 MS SMiV Parameter Settings for Intensity Cluster Optimization
3.2 Statistical Metrics for Intensity Clusters
3.3 Statistical Metrics for Charge Classes
3.4 Summary of 20 Most Frequently Identified Mass Shifts
3.5 Summary of Assignment Rates

4.1 Summary of Batches of 10-plex TMT for 50 Colorectal Cancer Cell Lines
4.2 PSMs and Unique Peptides in Colorectal Cancer
4.3 Table of MS2 Spectra and MS SMiV pairs per Experimental Batch
4.4 Summary of Spectral Identification Using a Canonical Sequence Database and Mass Tolerant MS SMiV
4.5 Unimod Entries with Masses of 28 Da, 1 Da and 111 Da
4.6 Unimod Entries with Masses of 14 Da and 30 Da
4.7 MS SMiV Associates Mass Shifts to SON Peptides
4.8 Proteogenomic Mapping Identifying Pairs of Variant and Reference Sequences
4.9 Unimod Entries with Masses of 18 Da

5.1 TMT Experimental Setup
5.2 Composition of Customised Sequence Database
5.3 Identification Filtering for Personal Proteogenomics
5.4 Gene Identification at Confidence Levels
5.5 List of Transcripts Identified With Multiple Isoforms
5.6 Unimod Entries with Masses of 28 Da
5.7 Summary of Genes Exhibiting Fold Changes between Individuals

NOMENCLATURE

Acronyms / Abbreviations

cDNA Complementary Deoxyribonucleic Acid
CDS Coding DNA Sequence
DNA Deoxyribonucleic Acid
FDR False Discovery Rate
FPR False Positive Rate
IGV Integrative Genomics Viewer
lncRNA long non-coding Ribonucleic Acid
MS Mass Spectrometry
MS/MS Tandem Mass Spectrometry
nsSNP non-synonymous Single Nucleotide Polymorphism
PCR Polymerase Chain Reaction
PEP Posterior Error Probability
PSM Peptide Spectrum Match
PTM Post-translational Modification
RNA Ribonucleic Acid
SAAV Single Amino Acid Variant
SNP Single Nucleotide Polymorphism
SNV Single Nucleotide Variant
SO Sequence Ontology
TMT Tandem Mass Tag

CHAPTER 1

INTRODUCTION

1.1 From genes to proteins

A large variety of molecules is required for the smooth flow of biological processes within and between cells. For a long time, proteins were thought to be the molecules driving biology and responsible for passing specific traits on to offspring (Mendel, 1866). However, Avery et al. (1944) identified deoxyribonucleic acid as the molecule containing hereditary information through transformation experiments in Streptococcus pneumoniae. It took more than another decade until Crick (1958) described the idea of the ‘central dogma’. Therein he proposed the one-way flow of information from deoxyribonucleic acid (DNA) through ribonucleic acid (RNA) to proteins, as illustrated in Figure 1.1.

Fig. 1.1 Graphical representation of the flow of information from genomic sequence to functional protein. The information flow described in the ‘central dogma’ of biology follows the processes of transcription of DNA into RNA followed by translation into protein. Different types of regulation during transcription and translation, as well as after translation, lead to highly complex modulation of gene and protein expression. Additionally, changes in these regulatory processes introduce increasing dynamic ranges for each molecule class within cells. Translation of DNA and RNA consisting of 4 nucleotides into protein sequences comprising 20 amino acids, followed by post-translational modification, increases sequence complexity and requires different analysis methods.

The advent of amplification of DNA through polymerase chain reaction (PCR) (Mullis and Faloona, 1987) and of RNA by reverse transcription PCR (reviewed in Freeman et al., 1999) enabled quantification of expression of lowly abundant messenger RNA (mRNA). While nucleotide-based molecules can be amplified and quantitatively assessed, no amplification method for proteins has been found to date. This quickly led to PCR-based quantification overtaking protein expression assays. In turn, gene expression was used to estimate protein abundance in a sample and has driven research in molecular biology and biomedicine over the last decades.

1.2 Gene expression and RNA-sequencing

Gene expression, i.e. the transcription of different genes into RNA molecules, has been central to studying biological processes. Various technologies to catalogue RNA molecules, measure their abundance, and link their expression to biological processes have been developed (reviewed in Hoheisel, 2006; Ozsolak and Milos, 2011; Wang et al., 2009). Hybridisation approaches, known as microarrays, were amongst the first methods developed to assess the quantity of RNA molecules genome wide and in parallel (Taub et al., 1983). For a pre-defined set of genes, unique and a priori known complementary DNA (cDNA) sequences representing each gene in the set are spotted onto the microarray. In the sample of interest, RNA molecules are reverse transcribed and labelled with fluorescent dye before hybridisation to the cDNA fragments on the array. The strength of the measured fluorescent signal is a proxy for the amount of captured RNA molecules and thus indirectly represents the quantity of RNA for the given set of genes in the sample (Figure 1.2 A, reviewed in Trevino et al., 2007).

While still in use today, microarrays are commonly focused on an a priori defined set of genes and thus prove impractical for assessing expression of all transcripts present in a sample (termed the transcriptome). Newer approaches use overlapping stretches of DNA, without a priori knowledge of the presence or structure of a gene, to probe the transcriptome including non-coding sequences. These approaches, with varying lengths of overlapping DNA probes, are termed tiling arrays (Buckley et al., 2002; Kapranov et al., 2002; Rinn et al., 2003; Snijders et al., 2001). However, while tiling arrays are cheap options to assess the expression of transcripts, the technological drawbacks of using fluorescent dyes limit the dynamic range that can be captured and restrict the number of samples that can be assessed at the same time (Agarwal et al., 2010). With the advent of next generation sequencing, high-throughput RNA-sequencing of the entire transcriptome of multiple samples became feasible (reviewed in Ozsolak and Milos, 2011; Wang et al., 2009). While purified sample RNA can be amplified or directly reverse transcribed into single stranded cDNA similar to microarray protocols, ‘library preparation’ for RNA-sequencing incorporates additional steps. Using additional primers, the second DNA strand is synthesized by DNA polymerase. Furthermore, double-stranded cDNA is fragmented into shorter pieces before sequencing adaptors enabling DNA sequencing (Applied Biosystems SOLiD, Illumina, Roche 454 Life Science) are ligated to each fragment (Figure 1.2 B). Adaptors may contain barcodes that uniquely identify them and allow mixing of multiple samples with different adaptors within the same sequencing lane, termed multiplexing. Single cDNA fragments can be sequenced either from one side (single-end) or from both directions (paired-end), whereby the latter approach facilitates detection of genomic rearrangements and repetitive sequence elements.

Fig. 1.2 Comparison of technologies enabling measurement of gene expression. The initial sample preparation steps of mRNA extraction and reverse transcription into cDNA are similar for microarray and RNA-sequencing technologies. A) For the microarray protocol, the resulting cDNA is labelled with fluorescent dye before it is turned into single strands (denaturation). Labelled single stranded cDNA molecules are poured over the microarray plate to allow hybridization with single stranded complementary DNA fragments spotted onto the plate. After washing, the intensity of light from the fluorescent dye is measured, indicating the abundance of captured cDNA molecules (example shows 2-channel array). B) For RNA-sequencing, double stranded cDNA molecules are fragmented, resulting in sequences of lengths between 50 and 1000 base pairs. Sequencing adaptors containing identifiers (barcodes) are ligated to cDNA fragments before sequencing commences. Resulting sequence reads are aligned to a reference genome and the number of reads mapping to the genomic region of a gene reflects its expression.

Commonly, sets of short sequences, named reads, spanning 50 to 1000 base pairs (bp) are the output generated by next generation sequencing. All reads are provided with quality information for each base and are stored in the FASTQ format (Cock et al., 2010). To identify expressed genes in the sample, reads are aligned to a reference genome sequence. Different methods and software tools such as TopHat2 (Kim et al., 2013) and STAR (Dobin et al., 2013) have been developed that either ignore splice junctions or take known or novel splice junctions into account. Once reads are aligned to the genome, the number of reads mapping to each gene is used to quantify gene expression. Tools such as edgeR (Robinson et al., 2010), Cufflinks (Trapnell et al., 2012) or HTSeq (Anders et al., 2015) have been developed and provide additional statistical functionalities to assess differential gene expression between samples. However, all methods follow the assumption that the number of sequencing reads is proportional to the initial number of RNA molecules in the sample.
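As an illustration of the per-base quality information stored in FASTQ files, the following minimal sketch parses four-line FASTQ records and converts quality characters into Phred scores. It assumes the Sanger Phred+33 encoding and unwrapped sequence lines; the names FastqRecord, parse_fastq and error_probability, as well as the example read, are hypothetical and introduced here only for illustration.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: List[int]  # Phred scores, one per base


def parse_fastq(lines: List[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ entries.

    Assumes Phred+33 quality encoding and no line wrapping.
    """
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at entry line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        # Each quality character encodes one Phred score as ASCII code minus offset.
        yield FastqRecord(header[1:], sequence, [ord(c) - offset for c in quality])


def error_probability(phred: int) -> float:
    """Convert a Phred score Q into the base-call error probability 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Invented example read: two high-quality bases, one medium, one unusable.
example = ["@read1", "ACGT", "+", "II5!"]
record = next(parse_fastq(example))
```

A Phred score of 20 corresponds to an error probability of 1 in 100, which is why low scores within short reads widen the space of possible alignments.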

1.3 Limitations of RNA-sequencing

One of the major limitations of RNA-sequencing arises from multi-mappings of reads in the alignment. Although multi-mappings can be removed from the output, they make up a significant fraction of alignments due to repetitive DNA sequences and shared domains between paralogous genes. In cases where the references to which reads are mapped are represented through transcript sequences, multi-mappings are even more likely due to shared exons and splice junctions between isoforms of a gene. Either way, identification and quantification of transcripts is essential to determine alternatively expressed genes between samples.

Identifying transcripts from reads of lengths ranging from 50 to 1000 bp that rarely span several splice junctions is challenging. The low complexity of short stretches within reads originating, for example, from small exons leads to potential alignment to multiple genomic loci, introducing false positives. Furthermore, variants and low quality sequenced bases in short RNA-seq reads increase the space of potential matches. In addition, inaccurate determination of transcription start and end sites together with sequence variation and sequencing and alignment errors make it difficult to correctly identify transcript boundaries (Steijger et al., 2013; Weirick et al., 2016). Increased coverage of exons and splice junctions unique to single annotated transcripts helps to resolve false-positive identifications and strengthens confidence in the identification of single transcripts.

The identification of genes and transcripts, as challenging as it is, facilitates their quantification within the sample. With simple read counts forming the basis for all approaches, multi-mappings affect the quantitation of different genes and, more significantly, of transcripts. In many cases quantitation of single transcripts using reads mapping to a unique exon is insufficient. Therefore, transcript abundance is estimated using probabilistic models that allocate multi-mapping reads among transcripts, as implemented in tools like Sailfish (Patro et al., 2014) and kallisto (Bray et al., 2016).
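The idea of probabilistically allocating multi-mapping reads can be sketched with a simple expectation-maximisation loop. This is a deliberately simplified illustration and not the algorithm implemented in Sailfish or kallisto: it ignores transcript length, fragment length distributions and sequence bias. The function name allocate_reads and the read-to-transcript assignments are hypothetical.

```python
from typing import Dict, Sequence


def allocate_reads(compatibility: Sequence[Sequence[str]], n_iter: int = 100) -> Dict[str, float]:
    """Estimate expected read counts per transcript with a simple EM loop.

    compatibility[i] lists the transcripts that read i aligns to.
    """
    transcripts = sorted({t for hits in compatibility for t in hits})
    # Start from uniform relative abundances.
    abundance = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = dict(abundance)
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: split each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        for hits in compatibility:
            total = sum(abundance[t] for t in hits)
            for t in hits:
                counts[t] += abundance[t] / total
        # M-step: re-estimate relative abundances from the fractional counts.
        n_reads = sum(counts.values())
        abundance = {t: counts[t] / n_reads for t in transcripts}
    return counts


# Invented example: 3 reads unique to isoform A, 1 unique to isoform B,
# and 4 reads from a shared exon that map equally well to both.
reads = [["A"]] * 3 + [["B"]] + [["A", "B"]] * 4
expected_counts = allocate_reads(reads)
```

In this toy case the shared reads are split according to the evidence from uniquely mapping reads, so isoform A receives most of them. Discarding multi-mapping reads instead would lose half of the data.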

A way to mitigate these limitations of short RNA-seq reads would be full length transcript reads. However, sequencing errors become more frequent with increasing length of reads. This could be counteracted by sequencing to higher depth, which in turn increases the cost significantly (Hou et al., 2015). For differential gene expression analysis transcripts are required to be sequenced with a depth of 10-25 million reads (Liu et al., 2014). For comprehensive high quality data allowing confident identification of single nucleotide variants, for example in cancer, or alternative splicing a read depth of 50-100 million reads is required (Liu et al., 2013). This in turn demands additional amplification of cDNA fragments, masking the true gene expression in the sample.

Genomics approaches have become central in biomedical research since the proposal of the ‘central dogma’ but have dominated since the publication of the complete human genome sequence in 2004 (Collins et al., 2003; International Human Genome Sequencing Consortium, 2004). Despite differences in morphology of various cell types containing the same genome sequence, gene expression was used as approximation of protein levels (reviewed in Greenbaum et al., 2003; Guo et al., 2008; Maier et al., 2009). However, recent multi-omics studies integrating protein abundance with gene expression have found low to moderate correlation (Pearson’s r between 0.4 and 0.6) (reviewed in de Sousa Abreu et al., 2009). While gene expression technologies capture transcriptional regulation, they are oblivious to translational and post-translational regulation as well as transport of the gene product to different compartments of a cell (see Figure 1.1). This neglect of post-translational regulation leading to functional proteins is reflected in the modest correlations reported between gene expression and protein abundance. This makes gene expression analysis a poor proxy for protein expression, and direct methods to assess protein abundances are required.

1.4 Protein expression and shotgun proteomics

Proteomics provides the tools to directly assess protein expression in samples and avoid the use of gene expression as a proxy. Similarly to microarrays for gene expression, protein expression can be indirectly measured through staining. In contrast to the cDNA fragments unique to the a priori selected set of genes to enable hybridisation, antibodies for proteins of interest are used. While sample proteins are captured by the immobilized antibodies, additional antibodies linked with specific enzymes are added prior to the substrate. The second antibody attaches to the proteins of interest and the conversion of substrate through the linked enzymes can be measured. The intensity correlates with the abundance of captured protein in the sample (Van Weemen and Schuurs, 1971).

Fig. 1.3 Schematic depiction of a standard mass spectrometry workflow. Proteins, extracted from the sample, are digested into shorter amino acid sequences (peptides). Liquid chromatography enables a first stage of separation based on physicochemical properties of peptides before these are positively charged to enable mass analysis (MS). The behaviour of charged peptides in the electromagnetic field of the mass analyser allows selection of ions at a given mass to charge ratio (m/z). These are collected and subjected to fragmentation. Mass analysis of fragment ions is performed resulting in tandem MS spectra (MS/MS).

Mass spectrometry describes a method with higher throughput as well as higher accuracy for protein identification and quantification. It has become more popular in protein research with the development of the ionization techniques electrospray ionization (ESI) (Fenn et al., 1989; Yamashita and Fenn, 1984) and matrix-assisted laser desorption ionization (MALDI) (Hillenkamp et al., 1991) for large molecules. Commonly, data are generated utilizing the bottom-up method, also termed shotgun approach, which is a combination of liquid chromatography (LC) and tandem mass spectrometry (MS/MS) with a typical workflow shown in Figure 1.3. Proteins are commonly digested into short peptides using trypsin, giving rise to the name ‘shotgun approach’. In contrast to genomics, where DNA and RNA fragments are actually sequenced, shotgun proteomics results in tandem mass spectra constituting collections of mass to charge ratios (m/z) of fragment ions and associated measured intensities.

1.4.1 Spectral identification and scoring

To determine the underlying peptide sequence of experimentally acquired MS/MS spectra multiple different approaches are available. Commonly, they are matched against theoretical spectra derived from a reference protein sequence database through database search algorithms that model the experimental procedures (see Figure 1.4 A). Protein sequences are digested in silico and within a specified error tolerance candidate peptides with similar mass to the experimental precursor ion are selected. For each candidate sequence all potential fragments are generated and their masses calculated. The theoretical spectrum then is compared to the experimental tandem MS spectrum and scored. Various different scoring algorithms have been introduced over the last two decades with different software tools such as Mascot (Perkins et al., 1999), MS-GF+ (Kim and Pevzner, 2014) and SEQUEST (Eng et al., 1994). Identification, however, is restricted to sequences present in the database. It is therefore necessary to provide near complete protein sequence databases to enable the identification of the full proteome in the sample. This can be extended through inclusion of post-translational modification in the calculation of theoretical precursor and fragment masses.
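The first steps of a sequence database search described above (in silico digestion, candidate selection by precursor mass, and generation of theoretical fragments) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Mascot, MS-GF+ or SEQUEST: the function names are hypothetical, trypsin is assumed to cleave after K or R unless followed by P, only unmodified peptides are considered, and fragments are limited to singly charged b- and y-ions.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(protein, missed_cleavages=0):
    """Cleave after K/R (not before P), allowing missed cleavages."""
    sites = [i + 1 for i, aa in enumerate(protein[:-1])
             if aa in "KR" and protein[i + 1] != "P"]
    bounds = [0] + sites + [len(protein)]
    peptides = []
    for i in range(len(bounds) - 1):
        last = min(i + 1 + missed_cleavages, len(bounds) - 1)
        for j in range(i + 1, last + 1):
            peptides.append(protein[bounds[i]:bounds[j]])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of the neutral, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def select_candidates(proteins, precursor_mass, tol_ppm=10.0,
                      missed_cleavages=1):
    """Peptides whose mass lies within the tolerance of the precursor."""
    tol = precursor_mass * tol_ppm / 1e6
    hits = set()
    for protein in proteins:
        for pep in tryptic_digest(protein, missed_cleavages):
            if abs(peptide_mass(pep) - precursor_mass) <= tol:
                hits.add(pep)
    return sorted(hits)


def fragment_mz(peptide):
    """m/z values of singly charged b- and y-ions."""
    b_ions, y_ions = [], []
    running = 0.0
    for aa in peptide[:-1]:
        running += RESIDUE_MASS[aa]
        b_ions.append(running + PROTON)
    running = 0.0
    for aa in reversed(peptide[1:]):
        running += RESIDUE_MASS[aa]
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


database = ["AGKPEPRTIDEK"]
candidates = select_candidates(database, peptide_mass("TIDEK"))
b_ions, y_ions = fragment_mz(candidates[0])
```

A scoring function would then compare `b_ions` and `y_ions` with the peaks of the experimental spectrum; that step differs between search engines and is left out here.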

Even more restricted is the comparison to previously identified spectra in so called spectrum libraries. This method of identification is commonly used in mass spectrometry analysis of small molecules (reviewed in Scheubert et al., 2013). While the selection process of candidate spectra from the library is similar to the aforementioned sequence database search, the comparison between spectra and scoring of matches is significantly different. Under the assumption of reproducibility of a spectrum for the same sequence, intensities from library and query spectrum can be taken into account for scoring (see Figure 1.4 B). This is commonly done through calculation of the dot product, treating both spectra as vectors of equal length. Spectra are therefore divided into predefined bins along the mass range and fragment peak intensities accumulated within each bin. Variations thereof are implemented in the National Institute of Standards and Technology (NIST) MS Search (Stein and Scott, 1994), X!Hunter (Craig et al., 2006), Bibliospec (Frewen and MacCoss, 2007), and SpectraST (Lam et al., 2008). Although this approach is more sensitive than database searching, the latter is commonly employed as it can include variations in form of post-translational modifications (PTM) not previously seen in experimental spectra.
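The binning and dot product calculation can be sketched as below. This is a generic illustration rather than the scoring of any of the cited tools: the bin width, the mass range and the normalisation to unit length are assumptions made for the example.

```python
import math


def bin_spectrum(peaks, bin_width=1.0, max_mz=2000.0):
    """Accumulate peak intensities into fixed-width m/z bins.

    peaks: list of (m/z, intensity) tuples.
    """
    n_bins = int(max_mz / bin_width) + 1
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if 0 <= mz <= max_mz:
            vector[int(mz / bin_width)] += intensity
    return vector


def normalised_dot_product(query, library, bin_width=1.0, max_mz=2000.0):
    """Dot product of two binned spectra, scaled to the range 0 to 1."""
    q = bin_spectrum(query, bin_width, max_mz)
    lib = bin_spectrum(library, bin_width, max_mz)
    dot = sum(a * b for a, b in zip(q, lib))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in lib))
    return dot / norm if norm > 0 else 0.0


library_spectrum = [(147.11, 50.0), (262.14, 100.0), (375.22, 80.0)]
query_spectrum = [(147.12, 45.0), (262.15, 110.0), (375.20, 70.0)]
score = normalised_dot_product(query_spectrum, library_spectrum)
```

Because intensities enter the score directly, two spectra with the same fragment masses but different intensity patterns score lower than a reproducible replicate.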

Direct peptide sequence identification tools exist in form of de novo identification, whereby single amino acids are fitted between m/z values until a full sequence can be assigned to the spectrum (see Figure 1.4 C). This method, however, is not commonly used due to short and ambiguous sequence results (reviewed in McHugh and Arthur, 2008).

Fig. 1.4 Schematic representation of three main classes of spectrum identification algorithms employed in proteomics. A) Algorithms for protein sequence database searching emulate the experimental process of mass spectrometry. Protein sequences are digested in silico to provide a comprehensive set of peptides. Candidate sequences are selected based on agreement with the experimentally determined peptide mass (precursor mass). For each candidate all masses of potential fragments are calculated to generate a theoretical spectrum. Theoretical and experimental spectra are then compared/correlated and scored. The highest scoring theoretical spectrum for each experimental spectrum is assumed as identification, termed peptide-spectrum match (PSM). B) Spectrum library search algorithms employ curated collections of high quality, identified tandem mass spectra to identify experimental spectra. Similar to sequence database searching, candidate library spectra are selected based on precursor mass. However, selection criteria are extended to include charge state as well. Candidate spectra then are correlated (dot-product) with experimental spectra. C) Identification algorithms for de novo identification are available. Amino acids and their modified counterparts are fitted by mass between fragment peaks in the experimental spectrum. Overlapping and incomplete ion ladders from C- and N-termini often result in only short sequence tags being identified.

Adaptations of algorithms have been implemented to increase the identification rate and mitigate sensitivity and specificity shortcomings. Open mass tolerance searching, which allows a wider selection of candidate peptides from sequence databases, has been used to identify post-translational modifications without their explicit specification (Chick et al., 2015). Error tolerant searching or two pass searching, on the other hand, performs a first search like regular database search engines allowing a limited set of PTMs. In the second step the search database then is restricted to sequences identified in the first search while all possible PTMs are allowed (Creasy and Cottrell, 2002). Adapted library searching tools, e.g. pMatch (Ye et al., 2010) and Modificomb (Savitski et al., 2006), use the open mass tolerance approach to allow additional candidate library spectra with precursor mass differences ∆m. The scoring is preceded by a peak matching step allowing fragments to match between spectra if their mass difference is within a predefined tolerance or similar to the precursor mass difference ∆m.
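The peak matching step of such mass tolerant library searches can be sketched as follows. This is a simplified illustration, not the algorithm of pMatch or Modificomb: the function name `match_peaks` is hypothetical, fragments are assumed to be singly charged so that a modified fragment is shifted by the full ∆m, and each library peak is paired with the first query peak that fits.

```python
def match_peaks(query_mz, library_mz, delta_m, tol=0.02):
    """Pair library fragments with query fragments.

    A pair is accepted if the masses agree within `tol` (unshifted) or
    if they differ by the precursor mass difference `delta_m` (shifted).
    Returns tuples of (query m/z, library m/z, is_shifted).
    """
    matches = []
    for lib in library_mz:
        for q in query_mz:
            diff = q - lib
            if abs(diff) <= tol:
                matches.append((q, lib, False))
                break
            if abs(diff - delta_m) <= tol:
                matches.append((q, lib, True))
                break
    return matches


# Query peptide carrying a modification of +15.995 Da that affects
# the second and third fragment but not the first.
library_fragments = [100.0, 200.0, 300.0]
query_fragments = [100.0, 215.995, 315.995]
pairs = match_peaks(query_fragments, library_fragments, delta_m=15.995)
```

The position at which matches switch from unshifted to shifted gives a hint towards the location of the modification or substitution within the sequence.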

1.4.2 Statistical significance measures and quality control

Each spectrum identification algorithm provides an in-built scoring system indicating the quality of the sequence assignment to the experimental tandem mass spectrum. However, interpretation and thresholding for overall ‘good’ identifications requires expertise and manual assessment and differs from laboratory to laboratory. As the scores are commonly not associated with meaningful and comparable statistical measures, post-processing tools such as Percolator (The et al., 2016) have been developed to translate the scores into an interpretable format. Distinct statistical measures such as p-values, the false discovery rate (FDR), q-values, and posterior error probabilities (PEP) are commonly employed (reviewed in Nesvizhskii, 2010).

Spectrum library search tools widely rely on the calculation of p-values to statistically validate their identifications. The p-value thereby is defined as the probability of observing an incorrect match with a given score or higher. A small probability of observing an incorrect match between a query and library spectrum thus is indicated by a low p-value. While p-values are based on the false positive rate (FPR), calculated as proportion of incorrect matches above a given threshold over all incorrect matches (see Figure 1.5), small p-values are expected to be observed by chance in large datasets. Corrections for multiple testing, such as the conservative Bonferroni correction (Bonferroni, 1935; Shaffer, 1995), mitigate the chance observation of low p-values. However, accounting for both factors leads to highly stringent score thresholds.

Fig. 1.5 Visual representation of statistic measurements for spectrum identification. Score distributions of spectrum/spectrum or peptide/spectrum matches can be estimated from matches to target and decoy entries. The posterior error probability (PEP) is based on the proportion of false positive (FP) matches amongst all, i.e. true positive (TP) and FP, at a given score threshold. At the same time the false discovery rate (FDR) takes all matches above the score threshold into account. Thus the FDR is calculated as the ratio of all FP (ΣFP) over all TP and FP (ΣTP + ΣFP) above the score threshold. The metric p-value is based on the false positive rate (FPR), i.e. the ratio of all FP above the score threshold (ΣFP) over all incorrect matches (ΣFP + ΣTN).

In contrast to spectrum library searching, post-processing software for database search algorithms employs FDR, q-values and PEP. The FPR on which p-values are based requires knowledge of all incorrect identifications, whereas the FDR is defined as an expected proportion of incorrect peptide spectrum matches (PSM) amongst a selected set of PSMs, i.e. above a given score threshold as shown in Figure 1.5. As multiple scores can be associated with a given FDR, the metric of q-values, which indicates the minimal FDR threshold at which a PSM is accepted, was introduced (Storey and Tibshirani, 2003). The FDR and q-value measures are associated with individual scores due to their calculation involving all PSMs in a dataset. However, when single high quality identifications are required, e.g. for identification of novel proteins, splice junctions and variant peptides as well as biomarker discovery, spectrum level significance measures are more appropriate.

The PEP represents the probability of a single PSM at a given score to be incorrect. For calculation of PEP values the underlying score distributions for correct and incorrect identifications are required. Therefore the target/decoy search strategy was introduced (Moore et al., 2002). The protein sequence database (target) is concatenated with a reversed (Moore et al., 2002), randomized (Colinge et al., 2003) or shuffled (Klammer and MacCoss, 2006) database introducing known false peptide spectrum matches (decoy). While these decoy generation strategies work best with small databases resulting in close to no overlap between target and decoy sequences, the increasing size of target databases for current proteomics studies has led to sequence overlaps by chance. Therefore more advanced methods of generating decoy databases have been developed (Wright and Choudhary, 2016). Utilizing the PSM identifications matching target and decoy database portions, the distributions for correct and incorrect matches can be determined, enabling not only the calculation of posterior error probabilities but also FDR, p- and q-values (see Figure 1.5).
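How FDR and q-values follow from a target/decoy search can be sketched with a few lines of code. The example assumes a concatenated target/decoy search and the simple estimate FDR = decoys / targets above a score threshold; it is not the procedure of Percolator or any other specific tool, and the function name `q_values` is hypothetical.

```python
def q_values(psms):
    """Assign q-values to PSMs from a concatenated target/decoy search.

    psms: list of (score, is_decoy) tuples; higher scores are better.
    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR when the threshold is set at this score.
        fdrs.append(decoys / max(targets, 1))
    # The q-value is the minimal FDR at which a PSM is still accepted,
    # i.e. the minimum over this and all more permissive thresholds.
    q_vals = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_vals.append(running_min)
    q_vals.reverse()
    return [(score, is_decoy, q)
            for (score, is_decoy), q in zip(ranked, q_vals)]


psms = [(10, False), (9, False), (8, True), (7, False), (6, False), (5, True)]
result = q_values(psms)
accepted = [score for score, is_decoy, q in result
            if q <= 0.25 and not is_decoy]
```

Taking the running minimum makes the q-value monotonic in the score, so a better scoring PSM never receives a worse q-value than a lower scoring one.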

The above described statistical measurements, however, are not independent from datasets, databases and parameter settings. Therefore they should be calculated for each change. In light of the theme of this thesis specific emphasis should be given to the size of the protein sequence database in relation to the quality and quantity of peptide identifications. While a small database of a few hundred sequences does not provide enough search space, leaving many high quality spectra unidentified, large databases of 100,000 sequences and more introduce higher possibilities of false assignments with high scores. In terms of the distribution of false identifications (see Figure 1.5), true negative as well as false positive PSMs are more likely to result in high scores, therefore skewing the distribution to the right. Even at the same fixed false discovery rate the score threshold is increased, reducing the number of total PSMs and therefore reducing sensitivity of the algorithm (Jeong et al., 2012). Larger databases containing two distinct portions, such as canonical and isoform sequences, should be treated with care. Search results for each portion should be treated and filtered separately to reduce the impact of the database size on sensitivity of identifications (Li et al., 2016). Lastly, databases should be chosen with care. They should be as complete as possible while keeping the overall database size comparatively small, such as the human canonical protein database (Nesvizhskii, 2014).

1.4.3 Peptide and protein quantification

Besides recent advances in mass spectrometry technology allowing close to complete identification of proteomes, MS is the most common method employed for protein quantitation. However, proteomic mass spectrometry is inherently not quantitative due to great variability of physicochemical properties, such as charge, hydrophobicity and molecular mass, of the analysed proteolytic peptides. This results in variability in the MS response between multiple runs (Aebersold and Mann, 2003). Additionally, only a small percentage of total peptides in a sample are analysed by MS, misrepresenting the total amount of peptide in the sample (Bantscheff et al., 2007). Multiple approaches using MS measurements for semi-quantitative and relative quantitation have been developed (reviewed in Bantscheff et al., 2012). Most methods will only be mentioned in brief in the following, while quantitation using isobaric labelling will be discussed in more detail as this method is used for peptide quantitation in the following chapters.

1.4.3.1 Overview over quantitation methods

The initial method of counting spectra matching to an identified peptide is similar to the read count approach in RNA-sequencing experiments. However, several MS-based techniques and workflows have been developed to quantitatively capture peptide and protein expression (see Figure 1.6, reviewed in DeSouza and Siu, 2013; Zhang et al., 2010). In addition to spectrum counting, label-free quantification employs intensity based integration over retention time to more accurately estimate abundance of peptides in the sample. However, this allows comparison only between samples during data analysis. While these approaches can be applied to MS-experiments with single samples at a time, multiplexing approaches through labelling of proteins or peptides allow direct comparison of multiple samples (currently 2-10) within a single experiment. Hereby, proteins and peptides can be labelled through growth in media using light and heavy isotopes (metabolic labelling, e.g. SILAC), through incorporation of heavy isotopes (e.g. 18O) during digestion (chemical labelling) or through labelling with isobaric tags (e.g. TMT and iTRAQ). While all approaches applied in shotgun proteomics quantify the expression of peptides, protein abundance can only be inferred after identification of peptides related to the proteins.

Most of the methods described above are used for relative quantitation. However, when a predetermined concentration of a synthetic peptide is spiked in during sample preparation absolute quantities can be inferred. This method has gained ground in recent years with data independent acquisition (DIA) through monitoring selected peptide masses and fragment masses (SRM-based methods). These targeted methods of using standards for absolute quantification on a small set of proteins of interest are of specific value in a clinical setting for identification and quantification of biomarkers (reviewed in Bantscheff et al., 2012).

Fig. 1.6 Schematic representation of quantitative mass spectrometry techniques. Approaches for peptide quantitation can be divided into label free and labelled methods. A) In a label-free approach samples are measured separately. Peptides are then quantified by integration of peak intensities over mass (m/z) and elution time (RT) in multiple MS spectra. Differential expression can be assessed by comparison between quantified peptides. B) In the simplest label-free approach samples are analysed separately. After identification of tandem mass spectra (MS/MS) the spectra matching to a given peptide are counted as a proxy of abundance. C) Metabolic labelling (e.g. SILAC) grows samples in medium containing amino acids with heavy and light isotopes. Samples then can be pooled and processed together. Quantification is achieved by intensity or integration of intensities on MS level. Differential expression can be directly assessed by comparison between two samples within the same mass spectrometry run. D) Isobaric labelling quantification requires addition of isobaric labels. After labelling samples can be pooled and processed together. During fragmentation, balance and reporter groups fall off and reporter groups are measured, resulting in individual peaks in the low mass region of MS/MS spectra. Differential expression of peptides can be assessed by comparison of reporter intensities.

Fig. 1.7 Structure of the isobaric labels iTRAQ (isobaric Tags for Relative and Absolute Quantitation, 4-plex) and TMT (tandem mass tags). For both the iTRAQ reagent and the TMT reagent the reporter group, balance group and peptide reactive group are indicated.

1.4.3.2 Quantitation using isobaric labelling

Isobaric labelling for relative quantitation has gained ground in recent years due to its capability of multiplexing up to 10 samples within a single MS experiment. The method relies on labelling peptides after proteolytic digestion with tags of equal mass (isobaric) but differences in isotopic composition. Two different types of tags are available, namely isobaric tags for relative and absolute quantification (iTRAQ, Ross et al., 2004) and tandem mass tags (TMT, Thompson et al., 2003). The tags follow the same basic set-up in their respective chemical structure combining three distinct groups (see Figure 1.7). The peptide reactive group enables the attachment of the tag onto peptides while the reporter group results in a singly charged fragment during fragmentation. Introducing heavy isotopes (13C/15N/18O) and using them as different tags allows for the discrimination based on the fragment mass (see Figure 1.8). The balance group equalizes the mass of the whole tag by substituting heavy isotopes with light isotopes, countering the inclusion of heavy isotopes in the reporter group.

Fig. 1.8 Distribution of carbon and nitrogen isotopes (13C/15N marked as red asterisks) in TMT 10-plex reagents. Reporter ion masses: 126 (126.1277 Da), 127N (127.1246 Da), 127C (127.1309 Da), 128N (128.1281 Da), 128C (128.1341 Da), 129N (129.1317 Da), 129C (129.1376 Da), 130N (130.1348 Da), 130C (130.1409 Da), 131 (131.1381 Da). Vertical dashed lines indicate reporter group, balance group, peptide reactive group from left to right as described in Figure 1.7. Accumulation of heavy isotopes in the reporter group results in differing fragment masses while the balance group compensates keeping the overall reagent mass the same (isobaric).

In a quantitative mass spectrometry workflow with isobaric tagging samples are prepared in a similar manner as in a non-quantitative workflow (reviewed in Bantscheff et al., 2007, 2012). However, after proteolytic digestion each sample is tagged with a different isobaric tag. The tagged samples then are pooled. Due to the isobaric nature of the tags, the same peptide sequences from different samples maintain the same physical and chemical properties during chromatography and MS scans. During fragmentation reporter and balance groups break off, leaving the peptides, with the amino reactive group still attached, isobaric. This enables standard MS/MS scanning to determine the identity of the peptide. While the balance group remains uncharged and therefore cannot be measured as a fragment ion, the reporter group forms a singly charged ion that will be measured in the MS/MS scan. The differing incorporation of heavy isotopes then results in peaks at the low mass end of the spectrum (114-131 Da). Relative quantitation is achieved by comparing the reporter ion intensities between the multiplexed samples (Bantscheff et al., 2007, 2012).
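Reading out reporter ion intensities from a single MS/MS spectrum can be sketched as follows. The reporter masses are those listed for the TMT 10-plex reagents in Figure 1.8; the function names, the matching tolerance of 0.003 Da and the normalisation to the summed reporter signal are assumptions made for this illustration and do not reproduce a specific quantitation tool.

```python
# Reporter ion masses (Da) of the TMT 10-plex channels as given in Fig. 1.8.
TMT10_REPORTERS = {
    "126": 126.1277, "127N": 127.1246, "127C": 127.1309,
    "128N": 128.1281, "128C": 128.1341, "129N": 129.1317,
    "129C": 129.1376, "130N": 130.1348, "130C": 130.1409,
    "131": 131.1381,
}


def reporter_intensities(peaks, tol=0.003):
    """Sum the intensity of peaks within `tol` Da of each reporter mass.

    peaks: list of (m/z, intensity) tuples of one MS/MS spectrum.
    The N and C variants of a channel differ by about 0.006 Da, so the
    tolerance has to stay below half of that difference.
    """
    return {
        channel: sum(i for mz, i in peaks if abs(mz - reporter_mz) <= tol)
        for channel, reporter_mz in TMT10_REPORTERS.items()
    }


def relative_abundance(intensities):
    """Scale reporter intensities to fractions of the total signal."""
    total = sum(intensities.values())
    if total == 0:
        return {channel: 0.0 for channel in intensities}
    return {channel: value / total for channel, value in intensities.items()}


spectrum = [(126.1277, 100.0), (127.1246, 300.0), (500.3, 999.0)]
intensities = reporter_intensities(spectrum)
fractions = relative_abundance(intensities)
```

In this toy spectrum only two channels carry signal, and the peptide is three times more abundant in the sample labelled with the 127N tag than in the one labelled with the 126 tag.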

Isobaric labelling enables the most accurate quantitation as samples can be compared within a single mass spectrometric experiment. Other methods, specifically label free quantitation, require multiple technical replicates, i.e. multiple MS runs of the same sample, to enable confident quantitation of peptides. Ratio compression due to co-selection and fragmentation of peptides can introduce errors in accuracy (Bantscheff et al., 2012). However, this issue has been addressed through introducing an additional selection and measurement step (MS3) before quantitation (Paulo et al., 2016; Ting et al., 2011).

1.5 Limitations of shotgun proteomics

Shotgun mass spectrometry has firmly held its place in proteomic research over the last two decades. While technology and algorithms steadily improve, there are still limitations to identification and quantification of proteins. One major limitation of shotgun proteomics originates from the large proportion, on average 75%, of analysed spectra that remain unidentified (Griss et al., 2016). This is partly due to search engines using large user defined protein sequence databases and a small set of PTMs. Thus the three most likely causes for missing coverage are (i) low signal-to-noise ratio of fragment peaks, (ii) the underlying peptide is not present in the sequence database, or (iii) unanticipated PTMs change the fragmentation pattern or shift the mass of fragments. While each of the reasons can be addressed separately through spectrum library searching, de novo identification, and error tolerant database searching, respectively, each approach has its own limitation. Spectrum libraries, for example, suffer from the very confined search space of previously identified spectra (Ji et al., 2013) and de novo algorithms are low throughput and introduce many false positives (Frank et al., 2007). Open mass tolerance and error tolerant searching approaches still suffer from missing identifications due to sequences absent in the database. However, unanticipated modifications can be identified. Furthermore, adapted spectrum library searches lack control of false matching and thus introduce false positives.

Another significant limitation of proteomics, specifically in comparison to RNA-sequencing, is the amount of sample required to successfully identify peptides and proteins. While RNA molecules from very few sample cells can be amplified through the polymerase chain reaction (PCR), no equivalent amplification exists for proteins. This often leads to low abundance of proteins and their peptide ions and results in lower intensity measurements compared to background noise (signal to noise ratio). This in turn then leads to missing identifications when they are compared against theoretical spectra derived from sequence databases.

Missing identifications due to low abundance, unanticipated post-translational modifications and missing sequences in the database in addition to the stochasticity of peptide ion selection for MS/MS measurement can significantly affect the identification of the same peptides across multiple experiments. While this, in itself, does not significantly affect the identification of protein groups it can, however, hinder the identification of different protein isoforms through numerous shared but few unique peptides (Li and Radivojac, 2012). Furthermore, this inference challenge impacts quantification even more so. Shared peptides, even when highly accurate peptide quantities have been obtained, are commonly omitted from quantification due to unknown contributions of different isoforms, thus significantly reducing the number of confidently quantified proteins (He et al., 2016; Nikolov et al., 2012; Ning et al., 2016). A way to overcome this issue is the use of western blotting for whole proteins. While this method is restricted to proteins for which antibodies can be generated, it is commonly used to validate mass spectrometry quantification due to the low cost and work required (Aebersold et al., 2013). Protein microarrays can also be used in the same manner. Though not as sensitive as mass spectrometry and potentially hiding the identity of the sequence, functional assays for identification of protein-protein, protein-small molecule, and protein-nucleotide interactions as well as kinase activity can be more easily achieved through protein microarrays than mass spectrometry (Sutandy et al., 2013). These advantages of microarrays are utilized by the Human Protein Atlas project, enabling stratification of tissue specificity and localization within a cell (Thul et al., 2017; Uhlen et al., 2015, 2017).

Even though quantification methods in shotgun proteomics are advanced, protein and peptide abundance can vary due to uncontrolled variation of experimental conditions. This specifically affects multiplexing techniques. Generally, the earlier differently labelled samples are combined in the workflow, the smaller the impact of variation in the sample preparation on the relative quantification (see Figure 1.6). Of particular note is that metabolic labelling, during the first step of sample preparation, restricts multiplexing to only a few samples while isobaric labelling of peptides, after digestion just before LC-MS/MS application, allows the pooling of up to ten samples in a single experiment.

1.6 Proteogenomics as solution to overcome limitations

Both methods for identification and quantification of biological molecules, namely RNA-sequencing and shotgun proteomics, suffer from the limitation of providing insight into only parts of the complex regulatory mechanisms of transcription and translation. Gene expression used as proxy for protein abundance fails to capture regulation at translational and post-translational level. Additionally, RNA-seq data provide limited indication of functions of gene products. On the other hand, shotgun proteomics only captures protein expression post translation, which implicitly is affected by transcriptional regulation prior to translation of mRNA. Furthermore, variation of the underlying genome sequence of samples, in form of single nucleotide polymorphisms, splice variants and other genome aberrations, is not taken into account in standard proteomics experiments. A solution to mitigate these limitations is the integration of genomics and transcriptomics data with proteomics.

Peptide identification in proteomics through database searching is based on the assumption that all protein sequences are known and contained in the search database. However, manual genome annotation efforts have highlighted the high structural complexity of mammalian genomes (Harrow et al., 2006, 2012). Continuous updates to the number of protein-coding gene and transcript annotations with each release, as shown for GENCODE annotation of human and mouse genomes in Figure 1.9, illustrate that not all protein sequences have been found yet. Using translated sequences of predicted genes and transcripts for database searching allows identification of novel peptides as well as splice junctions and provides protein-level evidence for existing gene models. This approach utilizing proteomic data to support genome annotation was first described in literature in 2004 and termed ‘proteogenomics’ (Jaffe et al., 2004).

Fig. 1.9 Comparison of protein-coding annotation in GENCODE over time. Reference releases are indicated by red asterisks and version number. While the number of annotations of protein-coding genes has decreased in human, the number of protein-coding transcripts has increased. However, this number seems to have stabilized since GENCODE v21. Annotation for mouse shows a mitigated drop in protein-coding genes between the first and latest release. The number of transcripts, however, follows a slight upwards trend.

Today proteogenomics comprises functional annotation (Pinto et al., 2014), identification of novel single amino acid variants (SAAV) (Krug et al., 2014), splice isoforms (Gawron et al., 2014), and detection of unannotated protein coding genomic regions (Manda et al., 2014). The focus of each type of study is determined by the underlying protein sequence database (see Figure 1.10). The identification of non-synonymous single nucleotide polymorphisms (nsSNP) using mass spectrometry can be achieved by incorporating known SNPs from specialized databases such as HapMap (International HapMap Consortium, 2003) and dbSNP (Sherry et al., 1999) into the protein sequences. Additional databases such as the Human Cancer Proteome Variation Database (CanProVar) (Li et al., 2010), COSMIC (Forbes et al., 2017), the Online Mendelian Inheritance in Man database (OMIM) (Hamosh et al., 2000) and the Protein Mutant Database (Kawabata et al., 1999) provide disease associated SNP collections, which are of specific interest when comparing samples between healthy and diseased states. As database search engines can only identify sequences present in the database, it is imperative to include all combinations of non-synonymous SNP events within a protein. This exponential increase in the search space of customized databases poses additional challenges to sensitivity and false discoveries (Nesvizhskii, 2014). To reduce database complexity, approaches such as limiting permutations of variants surrounding an initial variant site (Li et al., 2011) or pre-digested sequence databases providing only unique tryptic peptides (Bunger et al., 2007) have been introduced. Novel protein coding genes can be identified by using translations of ab initio predicted genes. Tools such as Augustus (Keller et al., 2011; Stanke and Waack, 2003) and GeneID (Alioto et al., 2013; Guigo et al., 1992) use genomic features of annotated protein coding genes indicating coding potential to predict novel genes. With alternative use of exons for predicted as well as known genes, exon splice graphs and translation of novel exon combinations also facilitate the identification of new and alternative transcripts (Brosch et al., 2011; Castellana et al., 2008, 2014; Tanner et al., 2007; Wright et al., 2016).

Fig. 1.10 Protein sequences as gene models utilized in custom proteogenomic databases with their source and effects as results. Annotated protein-coding genes and transcripts (light blue) result in identification of reference sequences, while novel sequences can be introduced into the database by extension of coding annotation and annotated non-coding transcripts (grey). Furthermore, gene predictions (green), de novo transcript assembly from RNA-sequencing (dark blue), as well as alternative splicing support identification of novel peptides from the customised database. Lastly, incorporation of non-synonymous single nucleotide variants into annotated protein-coding transcripts enables identification of variant peptide sequences (pink and yellow).
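The combinatorial growth of variant-containing databases described above can be illustrated with a minimal sketch. The peptide, the substitution sites and the function name `variant_peptides` are made up for illustration and do not correspond to any of the cited tools; the point is only that every additional variant site multiplies the number of sequences a search engine has to consider.

```python
from itertools import product

def variant_peptides(peptide, variants):
    """Enumerate every combination of substitutions for one peptide.

    `variants` maps a 0-based position to the list of alternative residues
    at that position (hypothetical input format).
    """
    options = []
    for i, residue in enumerate(peptide):
        # The reference residue is always kept as the first option.
        options.append([residue] + list(variants.get(i, [])))
    return ["".join(combo) for combo in product(*options)]

peptide = "LSVAGNSTEK"                            # made-up tryptic peptide
variants = {1: ["T"], 4: ["A"], 7: ["I", "S"]}    # made-up substitution sites
combos = variant_peptides(peptide, variants)
print(len(combos))  # 2 * 2 * 3 = 12 sequences, 11 of them non-reference
```

With n independent biallelic sites the number of sequences is 2^n, which is why restricting combinations to a window around each variant site or to unique tryptic peptides keeps the database tractable.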

In addition to identification of peptides derived from novel variants, splice junctions and genes, relating these sequences back to their genomic region is essential in proteogenomics. The tool tblastn matches protein queries against a translated nucleotide database using BLAST (NCBI) (Camacho et al., 2009) and successfully provides this functionality for longer amino acid sequences, while failing to confidently identify genomic coordinates for short peptides due to low complexity of the amino acid composition. Numerous approaches and tools have been published to accommodate mapping of peptides to a genome. Some approaches integrate the genomic location directly into the protein sequence database (Risk et al., 2013), others reverse translate the peptide sequences and approach the mapping like RNA-seq read mapping against a reference genome (Sanders et al., 2011). Utilizing the combination of genomic coordinates provided by annotation and prediction tools and their translation also provides indirect mapping through the protein sequences (Askenazi et al., 2016; Ghali et al., 2014; Kuhring and Renard, 2012).
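One reason why reverse translation of short peptides yields ambiguous genomic placements is the degeneracy of the genetic code. The following sketch counts how many nucleotide sequences encode a given peptide under the standard genetic code; the example peptide and the function name are illustrative assumptions, and the sketch is not the method of any cited tool.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of codons per amino acid, e.g. 6 for leucine, 1 for methionine.
DEGENERACY = Counter(CODON_TABLE.values())

def reverse_translation_count(peptide):
    """Number of distinct nucleotide sequences encoding `peptide`."""
    return prod(DEGENERACY[aa] for aa in peptide)

print(reverse_translation_count("ELVISK"))  # 2 * 6 * 4 * 3 * 6 * 2 = 1728
```

Even a six-residue peptide corresponds to more than a thousand candidate nucleotide sequences, before splicing across exon boundaries is considered.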

Early proteogenomics applications, e.g. for annotation of the mouse and human genomes by GENCODE and ENCODE (Brosch et al., 2011; Khatun et al., 2013) with small datasets and lower mass resolution, had limited exposure within the wider scientific community. Technological advances in recent years allowed proteogenomics to gain ground. Furthermore, it was put into the spotlight recently by two large scale LC-MS/MS studies claiming to provide proteomic evidence for 90% of protein coding genes as well as identifying novel genes (Kim et al., 2014; Wilhelm et al., 2014). A review by Nesvizhskii (2014) highlights pitfalls and challenges for variant and novel identifications and proposes strategies such as separate FDR estimations to deal with those issues in proteogenomic analysis. Furthermore, a reanalysis of the high impact proteogenomics papers using stringent filtering as well as manual annotation promotes a ‘quality-over-quantity’ approach, as far fewer novel identifications were made than in the original publications (Wright et al., 2016).

Proteogenomics was initially described as an aid to genome annotation. However, the definition has been expanded to include sample specific protein sequence databases derived from RNA-sequencing experiments. Diverse approaches to utilize RNA-sequencing data have been described in the literature, ranging from incorporating sample specific SNPs in protein sequences and constraining the search database to proteins with RNA-seq support to adding transcripts assembled de novo from raw reads (reviewed in Wang et al., 2014a). The use of sample specific protein databases also enables direct comparison between the measured proteome and transcriptome. Few studies have successfully applied quantitative integration between transcriptome and proteome. This approach, with integration of additional molecular levels, has been more generally termed multi-omics due to the use of various -omics technologies that enable systematic interpretation of biological processes in cells (Ritchie et al., 2015). Early multi-omics studies have shown only modest correlations between gene and protein expression (Pearson’s r between 0.4 and 0.6, reviewed in de Sousa Abreu et al., 2009) and hinted at different types of post-transcriptional regulation (reviewed in Vogel and Marcotte, 2012). Later studies found that translational control is time dependent, showing higher correlation when comparing gene expression changes with differential protein abundance over time courses, specifically in yeast (Fournier et al., 2010; Lee et al., 2011). Furthermore, multi-omics analysis has thus far provided insights into the regulatory network during early proliferation of hematopoietic stem cells (Cabezas-Wallscheid et al., 2014) and has started elucidating the complex molecular machinery leading to diseases such as cancer (reviewed in Boja and Rodriguez, 2014).
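For reference, the Pearson correlation used in such comparisons between transcript and protein abundance can be computed as in the sketch below. The abundance values are invented purely for illustration and do not come from any cited study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Covariance and variances without the shared 1/n factor, which cancels.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical log-scaled abundances for five genes.
mrna = [2.1, 3.4, 5.0, 6.2, 7.7]
protein = [4.0, 3.1, 6.5, 5.2, 8.9]
print(round(pearson_r(mrna, protein), 2))
```

In practice the coefficient is usually computed on log-transformed abundances, since both transcript and protein levels span several orders of magnitude.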

In recent years, multi-omics analyses in human have largely focused on cancer studies. This can primarily be attributed to the extensive characterization and sequencing efforts undertaken by large-scale collaborative initiatives such as The Cancer Genome Atlas (TCGA) (reviewed in Tomczak et al., 2015) and the International Cancer Genome Consortium (ICGC) (International Cancer Genome Consortium et al., 2010). These coordinated efforts generating large genomic and transcriptomic datasets provide a highly valuable source for cancer specific protein sequence databases containing, e.g., frame shifts, alternative splicing, stop codon read-through and fusion gene sequences that are necessary for proteogenomic identifications (Alfaro et al., 2014). While translational and post-translational regulation make DNA- and RNA-level measurements unreliable predictors of protein abundance, the combination of multiple omics datasets has led to the identification of novel potential biomarkers in colon and rectal cancer (CRC) (Zhang et al., 2014). Additionally, proteomics data have captured features not detectable in transcriptome profiles, leading to new subtypes of CRC. Along the same lines, a study on human high-grade serous ovarian cancer (HGSC) was able to identify differentially regulated pathways and functional modules correlating significantly with survival rates (Zhang et al., 2016). Furthermore, these functional modules revealed new potential drivers of HGSC as well as more robust signatures allowing patient stratification for informed therapeutic management. While subtyping for therapy is essential in cancer treatment, identification of novel drug targets is a main goal. A third study, on breast cancer, identified a new target showing strong correlation of DNA amplification, RNA, protein, and phosphoprotein besides the known tyrosine kinase-type cell surface receptor HER2 (Mertins et al., 2016).

The integration of novel proteomics datasets with previously published genomic and transcriptomic data enables identification of pathway and functional module regulation between health and disease states and between cancer subtypes. While the proteomic variation between disease states has been widely studied, the effect of genetic variation between populations on protein abundance was largely unknown. The integration of protein levels with genotype information for 95 individuals from the HapMap project captured molecular phenotypes exhibiting variation between individuals, populations and sexes (Wu et al., 2013). Furthermore, the study found that covarying had corresponding covarying proteins while RNA-protein correlation indicates partly independent regulation of protein levels. Wu et al. (2013) also highlight that calculated gene expression quantitative trait loci (eQTL) do not perfectly overlap with protein QTLs (pQTL), demonstrating distinct genetic control mechanisms for gene and protein expression. The addition of ribosome profiling as well as further RNA-sequencing replicates shows that translation efficiency is a key mechanism for understanding differences between gene expression and protein abundance (Cenik et al., 2015). The study further highlights the importance of considering individual specific variation potentially dependent on cell type. By translating these findings into a model organism in which researchers can alter the genetic structure, the genetic effects on protein expression can be studied more closely. Recently, genetic variants in diversity outbred mice were correlated with gene expression and protein abundance (Chick et al., 2016). The study identified local and distant QTLs mediated at the translational and post-translational level and further facilitated the generation of models that successfully predict protein expression in the founding population of the diversity outbred mice.

Besides comparing disease states of different populations and variation between sexes and ethnicity, the multi-omics approach also facilitates molecular profiling of single individuals (reviewed in Chen and Snyder, 2012; Mias and Snyder, 2013). A longitudinal approach based on tracking molecular changes in an individual, as performed in the pilot study of the integrative Personal Omics Profile (iPOP), can help with the diagnosis and tracking of disease progression as well as recovery (Chen et al., 2012). The identification of deleterious single nucleotide variants associated with type-2 diabetes and aplastic anemia through whole genome sequencing in the iPOP individual prompted the authors to monitor medical phenotypes including proteome, transcriptome, metabolome and autoantibodies, as well as associated environmental factors. Alternative splicing events confirmed during viral infection indicate mediating effects on immune response, and longitudinal comparison enabled diagnosis of early onset of type-2 diabetes. While the authors highlight the disease risk estimation, longitudinal monitoring and preventive treatment, they also emphasize that the data only give an individual centric view that prohibits extrapolation of novel associations to a broader level (reviewed in Chen and Snyder, 2013; Li-Pook-Than and Snyder, 2013). A similar comprehensive longitudinal molecular monitoring of multiple individuals would increase costs of health services and would also require dedicated infrastructure to ensure secure data storage.

1.7 Biological and technical biases affecting integrative analysis

Technical biases affecting integrative multi-omics analysis arise from the aforementioned limitations of the technologies used. As no step of the RNA-sequencing library preparation protocol is completely efficient, technical noise is introduced in addition to the biological noise of the samples. This noise mainly stems from RNA degradation during the library preparation and cDNA amplification biases of genomic regions due to primer binding (reviewed in Ozsolak and Milos, 2011). Additionally, RNA-sequencing is strongly affected by sequencing depth. Specifically, identification of SNPs and splice junctions requires higher read depths than commonly used (Liu et al., 2013, 2014). The amplification bias as well as lower transcript coverage through low read depths affects confident identification of genes and isoforms (Engstrom et al., 2013; Steijger et al., 2013; reviewed in Garber et al., 2011).

Technical biases through sample preparation also influence proteomic mass spectrometry data generation. Compared to RNA-sequencing, larger amounts of sample are required to attain similar coverage at the proteome level; alternatively, targeted approaches can be employed (Sun et al., 2014; reviewed in Maier et al., 2009). Additionally, the shotgun proteomics approach introduces technical bias through tryptic digestion of proteins prior to measurement. Not all resulting peptides exhibit properties that allow charging, which impairs their mobility as ions in the mass spectrometer (reviewed in Karpievitch et al., 2010). The stochastic isolation of precursor ions as well as low abundance proteins requires additional sample fractionation and longer gradients to provide more complete proteome coverage. These additional sample handling steps introduce further technical noise that interferes with biological variation (Scheerlinck et al., 2015). Furthermore, the stochasticity results in inter-run variability at the peptide level, while protein level variability based on inference is commonly not affected (Tabb et al., 2010).

The identification rate of tandem mass spectra through database searching is oftentimes below 30% due to conservative filtering and sensitivity issues when comparing theoretical with experimental spectra (Griss et al., 2016). The sensitivity of database searching algorithms is further affected by the size of the sequence search space, i.e. the database. The generation of sample specific databases derived from RNA-sequencing data introduces biases due to unique translated sequences. While this affects comparison between different proteomics datasets only at a minor level, a comprehensive custom database comprising all RNA-sequencing derived proteins can provide the same search space across different experiments. However, the increase of search space that negatively influences sensitivity has to be taken into account through orthogonal identification algorithms.

Lastly, protein identification using RNA-sequencing experiments introduces biases not only when compared against other samples. Even within a single sample, the different omics technologies give only a partial insight into the biological processes (reviewed in Ritchie et al., 2015). This includes identification of sets of genes and proteins which do not completely overlap. This sampling bias, which compromises complete comparison between gene and protein expression due to the limitations of the underlying technologies, is technical. However, time dependent transcription and translation of proteins can also produce mismatching sets of genes and proteins measured at the same time (Fournier et al., 2010; Lee et al., 2011).

1.8 The orphan of multi-omics: post-translational modifications

The central dogma with information flow from genes through RNA to proteins as defined by Crick (1958) was extended to comprise post-translational modifications (PTM) of proteins. In 1981 the definition comprised modifications affecting peptide bonds, carboxyl- (C-) and amino- (N-) termini as well as individual amino acid side chains (Wold, 1981). More generally speaking, post-translational modifications are defined as covalent differences between the linear polypeptide sequence as a result of information flow of the central dogma and the functional protein (Han and Martinage, 1992). Mass spectrometry has proven successful in identifying numerous amino acid side chain modifications that are collected in the Unimod database (Creasy and Cottrell, 2004). Using these annotations, post-translational modifications have successfully been identified, confidently localized within the protein sequence (Ahrne et al., 2010; Beausoleil et al., 2006), and associated with biological processes and diseases (Greaves and Chamberlain, 2014; Huang et al., 2014; Wang et al., 2014b).

The identification of modifications is accompanied by difficulties such as the similar mass shifts introduced by different modifications, the abundance of the modified peptide in the sample, and the lability of the PTM during MS/MS analysis (Parker et al., 2010). Additional challenges arise during the identification of modified peptides from the acquired MS/MS spectra. While variable modifications can be included in database search algorithms, their combinatorial inclusion into the in silico generated theoretical spectra increases the search space significantly and thus raises the probability of false-positive identifications. However, this can be taken into account with an accurate target/decoy model when monitoring the FDR. More challenging still is the localization of modifications within the peptide sequence when multiple candidate sites exist (Witze et al., 2007). Several approaches for site localization of phosphorylation (Albuquerque et al., 2008; Beausoleil et al., 2006; Fermin et al., 2013; Olsen et al., 2006; Phanstiel et al., 2011; Taus et al., 2011) and more broadly post-translational modifications (Bailey et al., 2009; Fermin et al., 2015) have been introduced over the last decade. While these tools significantly improve the site localization of post-translational modifications, they still suffer from incorrect assignments and are dependent on the identification results of database searching algorithms (Gutenbrunner, 2016), leading to site ambiguities that cannot be resolved to date.
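The target/decoy estimate mentioned above can be sketched as follows. This is a simplified illustration assuming a concatenated target-decoy search in which the FDR at a score threshold is approximated by the ratio of decoy to target matches above that threshold; the scores and function names are hypothetical, and real pipelines typically add refinements such as q-values or separate estimates for variant peptides.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate the FDR among PSMs scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) tuples; higher scores are better.
    """
    targets = sum(1 for score, decoy in psms if score >= threshold and not decoy)
    decoys = sum(1 for score, decoy in psms if score >= threshold and decoy)
    return decoys / targets if targets else 0.0

def score_cutoff(psms, max_fdr=0.01):
    """Lowest observed score at which the estimated FDR is still <= max_fdr."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            best = threshold
    return best

# Made-up search results: (score, is_decoy).
psms = [(9.1, False), (8.7, False), (8.2, False), (7.9, True),
        (7.5, False), (6.8, False), (6.1, True)]
print(score_cutoff(psms, max_fdr=0.01))  # 8.2
print(score_cutoff(psms, max_fdr=0.25))  # 6.8
```

A larger search space increases the number of high-scoring decoy matches, which pushes the cutoff upwards and removes true identifications; this is the sensitivity loss described in the text.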

Proteomics is commonly used for integration of whole proteome with purification approaches for targeted modifications (Lawrence et al., 2016; Mertins et al., 2013), while only few multi-omics studies incorporate PTMs in the complex integration analysis. The combinatorial explosion of search space in combination with large customized databases for proteogenomics significantly lowers the number of highly confident identifications of novel or variant peptides and alternative splicing events. Novel approaches combining RNA-sequencing derived protein databases with search algorithms restricting possible PTMs to only previously identified sites have been introduced (Cesnik et al., 2016). These allow better control of the high false discovery rate and have led to confident identification of variant and PTM peptides within the same search.

Integrating post-translational modifications with genomic information can also be driven by identifying the effects of single nucleotide variants (SNV) on the presence of post-translational modifications. One study utilized numerous publicly available datasets to identify post-translational modifications and subsequently inferred the underlying triplet code of the modified amino acid (Keegan et al., 2016). This in turn provides additional biochemical insights on the consequence of SNVs and supplements tools assessing impact on conserved amino acids or the three dimensional structure of a protein.

While these two approaches are focused on the global PTM landscape, they are qualitative integration methods similar to the initial proteogenomics definition. Quantitative integration of phosphoproteomics with proteomics and transcriptomics data in ovarian cancer resulted in the identification of a subset of patients with short overall survival (Zhang et al., 2016). This was driven by multiple pathways exhibiting significantly increased phosphorylation in patients with poor clinical outcome. A breast cancer study highlights the identification of complementary subclasses through employing genomics, proteomics and phosphoproteomics analysis (Mertins et al., 2016). Additionally, specific mutant cell lines exhibited distinct phosphoproteome signatures strongly correlating mutations with phosphorylation. These analyses show that integration of post-translational modifications can elucidate involvement in biological processes and aid patient stratification and the understanding of pathways and disease progression.

1.9 Novel challenges for high-throughput multi-omics

Although nowadays RNA-sequencing and mass spectrometry experiments can be conducted in a high-throughput manner, specialist software for each is required. Additionally, manual steps and parameters in genomics software that are uncommon in proteomics constitute a handicap for mass spectrometry groups to fully embark on a variety of integrative studies and vice versa. A crucial step in enabling cross community understanding of proteogenomics and multi-omics data is the combination of data in a single coordinate space. Many different visualization platforms and tools for genomics have been implemented over the years and are continuously extended (Down et al., 2011; Kent et al., 2002; Thorvaldsdottir et al., 2013; Yates et al., 2016). Additionally, a wide variety of proteogenomic mapping tools have been published, such as proBamSuite (Wang et al., 2016), ProteoAnnotator (Ghali et al., 2014) and PGx (Askenazi et al., 2016) to name a few. However, high throughput transcriptomics and proteomics data for large scale multi-omics studies require scalable and faster tools. The identification of amino acid variant peptides adds the challenge of mapping those to a reference genome.

Quantitative representation of proteomics data in a genome browser through mapping of peptide spectrum matches (PSM) has been introduced (Wang et al., 2016). This is similar to visualizing gene expression by the number of RNA-sequencing reads mapping to a specific genomic locus. Mass spectrometry, however, offers more advanced quantification methods on peptide level. These need to be integrated on genome wide level to enable identification of protein degradation and expression effects in relation to other genomic features such as alternative splice sites or nucleotide variation. Furthermore, the integration of post-translational modifications has proven its use in recent publications (Mertins et al., 2016; Zhang et al., 2016). Thus the direct sample specific association of PTMs with genomic features presents an additional challenge for proteogenomic mapping.

Integrating proteomics studies with genomics can be a daunting task and oftentimes deters researchers from utilizing published proteomics results in their multi-omics approach. The main challenge is the need to either start from raw mass spectrometry data or integrate data only by matching between protein and gene identifiers. By utilizing the frameworks for genome wide visualisation and integrating proteogenomic mappings with the large genomic and transcriptomic repositories, proteomic studies can be more easily accessed and integrated with other omics datasets.

Over the last decades proteomics has proven that the analysis of post-translational modifications through mass spectrometry is essential to increase the understanding of signalling and other biochemical mechanisms in cells. However, full integration of proteomics approaches, including post-translational modifications, with other omics technologies is in its infancy. This is in part due to high error rates when including PTMs into the already large search space of proteogenomic sequence databases. As amino acid substitutions and modifications can have almost isobaric masses, unbiased identification approaches are required to keep the sequence search space from further expanding.
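The near-isobaric relationship between substitutions and modifications can be made concrete with a small sketch. It compares mass differences between amino acid residues with the mass shifts of three common modifications; the masses are standard monoisotopic values rounded to five decimals, isoleucine is omitted because it is isobaric with leucine, and the tolerance and function name are illustrative assumptions rather than parameters of any tool discussed here.

```python
# Monoisotopic residue masses in Da, rounded to five decimals.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
# Monoisotopic mass shifts of three common modifications.
MOD_MASS = {"Deamidation": 0.98402, "Methyl": 14.01565, "Oxidation": 15.99491}

def isobaric_pairs(tolerance=0.001):
    """List (reference, substitute, modification) triples whose mass shifts
    agree within `tolerance` Da."""
    hits = []
    for ref, m_ref in RESIDUE_MASS.items():
        for alt, m_alt in RESIDUE_MASS.items():
            delta = m_alt - m_ref
            for mod, m_mod in MOD_MASS.items():
                if abs(delta - m_mod) <= tolerance:
                    hits.append((ref, alt, mod))
    return hits

for ref, alt, mod in isobaric_pairs():
    print(f"{ref}->{alt} is indistinguishable by mass from {mod}")
```

For example, an Asn to Asp substitution has the same mass shift as deamidation, and an Ala to Ser substitution the same as oxidation, so precursor mass alone cannot separate the two explanations.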

Lastly, efforts assessing variation between individuals on genome and transcriptome level have hinted that variation in gene expression might be higher between different tissues than between individuals (Mele et al., 2015; The GTEx Consortium, 2015). In light of personalized medicine, disease treatment, and the findings of multi-omics studies in cancer cell lines and mouse models showing that genomic variation can have significant impact on protein expression, this highlights the need for individualised multi-omics profiling. Although analysis of variation on the proteome level has previously been done (Ryu et al., 2014; reviewed in Nedelkov, 2008; Nedelkov et al., 2006), multi-omics studies across multiple individuals are still missing.

1.10 Objectives and Outline

The objectives of the work presented in this thesis are the improvement of software tools to enable high throughput proteogenomics studies and unbiased identification of single amino acid variants and post-translational modifications leading to personalised multi-omics analysis.

In the first part I address the issues of proteogenomic mapping and compare available software tools based on features such as their support by online and offline genome browsers and their mapping reference. Furthermore, I present the software tool ‘PoGo’ and compare it to two other available tools before I highlight unique features facilitating quantitative and post-translational modification mapping alongside accounting for amino acid substitutions. Moreover, I offer an additional software tool which permits the generation of web accessible hubs of large genome wide mappings.

I then present an update to the software tool ‘MS SMiV’ for unbiased identification of post-translational modifications and amino acid variants through adapted spectrum clustering. The update comprises charge dependent peak picking, fragment intensity adjustment and parallelization. I show that this approach outperforms open mass tolerance database searching by benchmarking against results obtained by the search engine ‘SEQUEST’ with a large high resolution tandem mass spectrometry dataset.

The application of ‘MS SMiV’ to a recent multiplexed panel of 50 colorectal cancer cell lines then highlights the utility of the approach in the next part. Combined with quantitative information, near isobaric PTMs and amino acid substitutions can be discriminated. Furthermore, comparison of spectrum pairs identified through extreme peptide-peptide fold-changes and MS SMiV highlights de novo identification of amino acid substitutions not present in the search database and cell line specific database.

Lastly, I describe the development and application of a comprehensive and quantitative proteogenomic pipeline on a pilot phase study of chondrocytes derived from osteoarthritic individuals. The integration of the tools presented in this thesis with the pipeline then leads to identification of variation between disease states, molecular levels and individuals. Moreover, I highlight each type of variation through examples.

CHAPTER 2

MAPPING PEPTIDES TO GENOMIC LOCI

2.1 Introduction

Substantial advances in mass spectrometry technologies enable more complete identification and quantification of proteomes. This makes proteomic data more comparable to transcriptomics and has led to the characterization of the consequences of natural genetic diversity on the proteome (Chick et al., 2016). Proteomic mass spectra are commonly identified through correlation with in silico generated spectra derived from protein sequence databases. These database search algorithms also account for user specified post-translational modifications for comparing theoretical spectra against the experimentally acquired ones. Many different proprietary data formats to store peptide to spectrum matches (PSM), protein identifications and quantitative information have been introduced over the last decades through numerous different database search tools. The Proteomic Standards Initiative (PSI) has embarked on unifying those into single open source formats and to enhance data exchange (Deutsch et al., 2015).

Common proteogenomic approaches integrate proteomic with transcriptomic results through gene name or common identifiers and focus on complete gene and transcript abundance. This type of integration, however, does not utilize the full potential of proteomics, as it disregards effects of variation in parts of proteins, e.g. alternative splicing and allele expression. Having data from different molecular levels in a single coordinate system overcomes these restrictions. Mapping peptides onto genomic loci, however, is a non-trivial task. Triplet codons for single amino acids, alternative splicing that can occur at any position within a sequence, even within single triplet codons, frame shifts and orientations of transcript and protein sequences given their notation introduce high levels of complexity that need to be taken into account when mapping peptides to genomic coordinates.
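To illustrate the coordinate arithmetic involved, the following sketch maps a peptide with a known position in a protein onto genomic blocks, given the coding exons of the corresponding transcript. It is a simplified illustration and not the algorithm of any tool discussed in this chapter: it assumes 0-based half-open intervals, a complete CDS starting in frame, and hypothetical coordinates and function names.

```python
def peptide_to_genome(cds_exons, strand, aa_start, aa_length):
    """Map a peptide onto genomic blocks.

    cds_exons: coding parts of exons as (start, end) genomic intervals,
               0-based half-open, sorted by genomic start.
    strand:    "+" or "-".
    aa_start:  0-based index of the first peptide residue in the protein.
    Returns a list of genomic (start, end) blocks sorted by start.
    """
    # Peptide span within the spliced coding sequence, in nucleotides.
    cds_from = 3 * aa_start
    cds_to = cds_from + 3 * aa_length
    # Walk exons in transcript order (reverse genomic order on "-").
    ordered = cds_exons if strand == "+" else cds_exons[::-1]
    blocks, consumed = [], 0
    for start, end in ordered:
        length = end - start
        # Overlap between the peptide and this exon in CDS offsets.
        lo = max(cds_from, consumed)
        hi = min(cds_to, consumed + length)
        if lo < hi:
            if strand == "+":
                blocks.append((start + lo - consumed, start + hi - consumed))
            else:
                blocks.append((end - (hi - consumed), end - (lo - consumed)))
        consumed += length
    return sorted(blocks)

exons = [(100, 110), (200, 220)]  # hypothetical 30 nt CDS in two exons
print(peptide_to_genome(exons, "+", aa_start=2, aa_length=3))
# [(106, 110), (200, 205)]: the fourth codon is split by the intron
print(peptide_to_genome(exons, "-", aa_start=6, aa_length=2))
# [(106, 110), (200, 202)]
```

The first example shows a codon split across a splice junction, with one nucleotide in the first exon and two in the second, which is exactly the case that makes naive per-residue mapping fail.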

To facilitate full integration of transcriptomics and proteomics data, numerous different tools have been introduced to map peptides identified through proteomic mass spectrometry onto genomes. Figure 2.1 shows the main differences between various tools in two principal groups of aspects. The first group covers aspects crucial for fast and high throughput peptide mapping and visualization. A small and comprehensive reference against which peptides are mapped is essential to reduce runtime while also minimizing false mappings. Additionally, standalone software, which can easily be integrated with various proteomic pipelines, is preferred over framework bound mapping tools. Lastly, supporting broad access to visualisation of mapped peptides and providing web access for larger scale studies enables data exchange similar to genomics efforts. The second group considers features of peptide to genome mapping necessary for modern proteogenomics studies going beyond sequence identity. With sample specific databases in use for proteogenomic peptide identification, variants differing from a reference genome and proteome are likely and need to be taken into account to facilitate correct mappings. Additionally, visualisation of quantitative information in a genomics context to enable identification of quantitative trait loci on exon level is required in multi-omics studies. Finally, post-translational modifications are becoming a key facet of multi-omics studies for understanding post-translational regulation. Therefore, putting post-translational modifications into genomic context is key to enabling these types of studies.

The available approaches address these aspects differently. For example, they can be categorised based on their mapping reference. Proteogenomic Mapping Tool (Sanders et al., 2011) aligns a reverse translated peptide sequence onto the genome, while PGNexus (Pang et al., 2014) utilizes the position within the protein annotated by the search engine to reconstruct the nucleotide sequence of the peptide with the help of the provided gene annotation. PGMiner (Has et al., 2016) and ACTG (Choi et al., 2016) in contrast first translate the genomic database before mapping peptides to the translated protein sequences. The translation can either be in 3 or 6 reading frames or guided by the splice annotation. ProteoAnnotator (Ghali et al., 2014), ProBam suite (Wang et al., 2016), iPiG (Kuhring and Renard, 2012), and PGx (Askenazi et al., 2016) on the other hand skip the translation from nucleotide to amino acid sequences and map peptides directly to a provided protein database.
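The translation in multiple reading frames mentioned above can be sketched in a few lines. The example uses the standard genetic code and a made-up nucleotide sequence; it is a generic illustration, not the implementation used by PGMiner, ACTG or any other cited tool.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate complete codons of `seq`; trailing 1-2 bases are ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def six_frame(seq):
    """Return translations of all three forward and three reverse frames."""
    reverse = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

for frame, protein in sorted(six_frame("ATGGCCAAGTAA").items()):
    print(frame, protein)
# +1 MAK*, +2 WPS, +3 GQV, -1 LLGH, -2 YLA, -3 TWP
```

A 3-frame translation corresponds to the three forward frames of a transcript with known orientation, while a 6-frame translation is needed when the strand is unknown, at the cost of doubling the search space.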

Fig. 2.1 Visual comparison of publicly available peptide to genome mapping tools with respect to their mapping reference type, integration into frameworks or availability as standalone software, and support of their output formats through online and offline genome browsers (blue). Furthermore, additional features supporting mapping of variant peptides, inclusion of quantitative information and accounting for post-translational modifications show the superior performance of PoGo over other tools (orange).

Tools described here that utilize a priori annotation of genomic loci as provided for RNA-sequencing alignments, such as PGx (Askenazi et al., 2016), commonly take the annotated start and end coordinates as correct coordinates without additional checks. While this holds true for straightforward translation of these annotated transcripts into protein sequences, issues arise when using 3-frame translations and standardized annotation from public resources such as Ensembl (Yates et al., 2016) and GENCODE (Wright et al., 2016). This requires reformatting of the readily available annotation to fit the software dependent formats and requirements. Reformatting a single transcript is a trivial task and can be done manually. However, with 199,325 annotated transcripts in GENCODE v26 this trivial task becomes challenging and can only be achieved computationally. The reformatting is further complicated by partly incomplete annotations and frame shifts that need to be taken into account. All these intricacies make the task of reformatting standardised genome annotation to fit software proprietary specifications error prone and daunting for non-computational researchers.

The tools can also be classified based on their output formats and their support through online or offline genome browsers. Prominent online genome browsers from Ensembl (Yates et al., 2016), UCSC (Kent et al., 2002) and BioDalliance (Down et al., 2011) are accessible through the internet and allow easy sharing of data mapped to genomic coordinates. However, because they are open to many researchers and accessed through the internet, many restrict the upload of custom data larger than 20 MB. Offline genome browsers such as the Integrative Genomics Viewer (IGV) (Thorvaldsdottir et al., 2013), on the other hand, require computational resources and an installation procedure on the user’s side but are capable of dealing with large input files. To counter the file size restrictions in online genome browsers and further open access across the scientific communities, track hubs (collections of binary mapping files in predefined folder hierarchies) were introduced by UCSC (Raney et al., 2014). To date this mode of sharing genomic mapping of proteomic data is not supported by any of the above described public tools. Additionally, some tools can be used as standalone tools while others are firmly integrated into software workflows. Most of the described tools do not take full advantage of all aspects in group one and none address all aspects in group two. Therefore, to improve speed and quality of mapping peptides directly onto reference annotations, without the need to reformat these, I developed PoGo utilizing the annotated protein coding sequences (CDS) together with a reference protein sequence database (protein-DB). I extended PoGo functionality to account for non-synonymous single nucleotide polymorphisms (nsSNPs), any type of quantitative information provided for multiple samples and post-translational modifications in its mapping.

2.2 Algorithm development

2.2.1 Input formats

PoGo was developed to map peptides against reference proteins and through transcript annotations onto the respective genomic loci. To allow this mapping the reference proteins have to be provided in FASTA format. The header lines for each sequence have to contain a sequence identifier which relates to the underlying transcripts. The transcript annotation must be provided in the general transfer format (GTF). This format is hierarchically structured and each layer is identified through a type. Genes may have multiple transcripts and a transcript may contain multiple exons. Exons, in turn, may contain coding sequence annotation (CDS). Corresponding entries in the file are linked through unique gene and transcript identifiers. PoGo focuses on the genes, transcripts and CDS to infer the genomic coordinates for a reference protein. The translated coding sequences and associated annotations are commonly available in the specified file formats from annotation resources such as Ensembl (Yates et al., 2016) or GENCODE (Wright et al., 2016).

To enhance usability of the mapping tool PoGo the input format was designed to contain only minimal information about each peptide. The input format thus is a tab delimited file with four columns (see Figure 2.2). The columns contain (i) a sample identifier, (ii) the peptide with post-translational modifications annotated in the sequence in the format specified by the Proteomics Standards Initiative (PSI), (iii) the number of peptide to spectrum matches (PSMs), and (iv) a value associated with the abundance of the peptide sequence in the specified sample. Each peptide sequence may occur multiple times in the input file, though only in association with different sample identifiers. To make PoGo more accessible to the wider proteomics community two PSI file formats, namely mzIdentML and mzTab, were selected as additional input formats, and converters to the minimalist four column input file were implemented in Java utilizing the PRIDE API ms-data-core-api for parsers of mzIdentML and mzTab files (Perez-Riverol et al., 2015). Furthermore, an easy to use graphical user interface was implemented in Java. Figure 2.3 indicates the input formats and their processing before the PoGo mapping algorithm.

Fig. 2.2 Example of the input format of PoGo. Four columns specify a sample identifier, the peptide sequence with post-translational modifications annotated using the PSI name, a semi quantitative measure of peptide to spectrum matches per peptide and sample, and user defined quantitative information. This can be, for example, derived from label free approaches or isobaric labelling.
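The four column layout described above can be read with a few lines of code. The following is a minimal sketch, not PoGo's own parser; the record type `PeptideRow`, the function `read_pogo_input`, and the example sample names and values are all hypothetical and chosen only for illustration.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type; the field names are illustrative, not PoGo's own.
PeptideRow = namedtuple("PeptideRow", "sample peptide psms quant")


def read_pogo_input(handle):
    """Parse a four-column tab-delimited file into PeptideRow records."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        sample, peptide, psms, quant = fields
        try:
            rows.append(PeptideRow(sample, peptide, int(psms), float(quant)))
        except ValueError:
            continue  # skip a header line or rows with non-numeric values
    return rows


# Made-up example: the same peptide reported for two different samples.
example = "liver\tVPEPGCTK\t3\t1520.5\nretina\tVPEPGCTK\t1\t310.0\n"
rows = read_pogo_input(io.StringIO(example))
```

Rows that cannot be interpreted numerically are skipped silently here; a stricter reader might report them instead.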

2.2.2 PoGo algorithm

2.2.2.1 Connecting protein sequences with genomic coordinates

For each protein entry captured from the FASTA input PoGo extracts corresponding gene, transcript, and CDS lines from the GTF. In the GTF file the CDSs of transcripts are ordered in the direction of the resulting mRNA, thus reflecting the reading direction during translation. This leads to the reverse order of genomic coordinates of CDSs for transcripts annotated on the reverse strand. Through this, the directionalities of the annotated CDSs and corresponding protein sequences align. The exonic structure of the coding sequence is mapped onto the protein sequence through construction of protein exons. A transcript $T$ is defined as a set of coding exons $t_1, t_2, \ldots, t_n$, where $n$ is the number of exons. Each exon $t$ contains the chromosome identifier, the start and end positions within the chromosome, $S_t$ and $E_t$ respectively, and the strand on which the transcript is annotated. The corresponding protein $P$ then is defined as a set of protein exons $p_1, p_2, \ldots, p_n$. Each protein exon $p$ contains the start and end positions within the amino acid sequence, $s_p$ and $e_p$ respectively. The protein is mapped onto its transcript $T$ so that $f: P \to T,\ p_i \mapsto t_i$. This map is constructed for all protein and transcript pairs from the FASTA and GTF input (see Figure 2.4).

Fig. 2.3 Schema of input and output formats and their processing before PoGo mapping. Reference input is provided through protein sequences in FASTA format and transcript annotations in GTF format. Standardised proteomics identification formats are converted into a proprietary tab separated format holding only the minimum information required for PoGo mappings. PoGo generates four different output formats, all containing genomic location annotation for peptides and supplemented with additional information like uniqueness, post-translational modification location and quantitative information.

Fig. 2.4 Protein sequences (blue) and transcript annotation (green) linked through unique transcript and gene identifiers. The genomic coding sequence annotations of transcripts (green blocks) are mapped onto respective protein sequences through intermediate coordinates (turquoise) representing the exonic structure of the transcript within the protein.

To account for frame shifts between genomic exons $t_i$ and $t_{i+1}$ each protein exon $p$ also contains information about the number of base pairs (bp) contributing to the codon of the first (N-term) and last (C-term) amino acid of $P(p)$. This information is encoded as offset $O \in \{1,2,3\}$ where the value 3 indicates a complete codon. In general, the N-term offset at the beginning of a protein is defined as $O(p_1(N\text{-term})) = 3$ resulting in $O(p_n(C\text{-term})) = 3$ for complete annotations of coding transcripts. This may vary for incomplete annotations where start and end codons are missing. The N-terminal offset is determined by the annotated frame $F \in \{0,1,2,3\}$, where $F = 0$ and $F = 3$ represent complete codons with $O(p_i(N\text{-term})) = 3$, and $F = 1$ and $F = 2$ indicate frame shifts resulting in $O(p_i(N\text{-term})) = 1$ and $O(p_i(N\text{-term})) = 2$, respectively. For missing frame annotations the initial protein offset is assumed to be complete, $O(p_1(N\text{-term})) = 3$, while offsets of following protein exons are calculated so that $\left(O(p_i(C\text{-term})) + O(p_{i+1}(N\text{-term}))\right) \bmod 3 = 0$. Each C-term offset $O(p_i(C\text{-term}))$ for protein exon $p_i$ is determined by the length of the genomic exon $L(t_i)$ and the N-terminal offset $O(p_i(N\text{-term}))$ so that $O(p_i(C\text{-term})) = X \bmod 3$ where $X = (L(t_i) \bmod 3) - O(p_i(N\text{-term})) + 3$, with the exception of $X \bmod 3 = 0$ where $O(p_i(C\text{-term})) = 3$. This alignment of the reference protein sequences to their corresponding coding sequence annotations builds the basis for retrieving genomic coordinates of peptides originating from these protein sequences.
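The offset rules above can be worked through in a short sketch. This covers the case of missing frame annotations, where the first N-term offset is assumed to be 3 and each subsequent N-term offset is derived from the preceding C-term offset. The function names are hypothetical and the code is an illustration of the stated formulas rather than PoGo's implementation.

```python
def c_term_offset(exon_length, n_offset):
    """Base pairs contributing to the codon of the last amino acid of an exon.

    Implements X = (L mod 3) - O(N-term) + 3 and O(C-term) = X mod 3,
    with a remainder of 0 mapped to 3 (complete codon).
    """
    x = (exon_length % 3) - n_offset + 3
    return x % 3 or 3


def next_n_term_offset(c_offset):
    """N-term offset of the following exon so that both parts sum to a full codon."""
    return 3 if c_offset == 3 else 3 - c_offset


def exon_offsets(exon_lengths, first_n_offset=3):
    """Return (N-term, C-term) offsets for consecutive coding exons."""
    offsets = []
    n_offset = first_n_offset
    for length in exon_lengths:
        c_offset = c_term_offset(length, n_offset)
        offsets.append((n_offset, c_offset))
        n_offset = next_n_term_offset(c_offset)
    return offsets


# Three coding exons of 10, 8 and 9 bp (27 bp, i.e. nine codons in total).
print(exon_offsets([10, 8, 9]))  # [(3, 1), (2, 3), (3, 3)]
```

In the example the first exon ends with a single base of a split codon, which the first two bases of the second exon complete.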

2.2.2.2 Identifying protein of origin for input peptides

Through the minimal information per peptide sequence required in the input file of PoGo, it is necessary to find the parent proteins for each peptide. PoGo utilizes this to take amino acid substitutions in the peptide sequence into account to identify reference proteins of variant peptides. To allow fast lookup of proteins and peptide positions within each protein PoGo indexes the FASTA input. Each protein sequence P with length L is split into L − (k − 1) words of length k (k-mers) that overlap by k − 1 amino acids. These k-mers are stored as keys in a dictionary and associated with gene and transcript identifiers as well as start positions within the corresponding protein sequence (Figure 2.5 A). The dictionary is designed to consider leucine (L) and isoleucine (I) as equal through substitution with character J as they are indistinguishable in MS experiments.
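The indexing step can be sketched as follows. This is a simplified illustration under the assumption that proteins are held in a plain dictionary of identifier to sequence; the names `normalise` and `build_kmer_index` are hypothetical, and gene identifiers are omitted for brevity.

```python
from collections import defaultdict


def normalise(sequence):
    """Treat leucine (L) and isoleucine (I) as the same residue, written as J."""
    return sequence.replace("I", "J").replace("L", "J")


def build_kmer_index(proteins, k=5):
    """Map each k-mer to a list of (protein identifier, zero-based start) pairs.

    `proteins` is a dict of identifier -> amino acid sequence.
    """
    index = defaultdict(list)
    for protein_id, sequence in proteins.items():
        seq = normalise(sequence)
        # A sequence of length L yields L - (k - 1) overlapping words.
        for start in range(len(seq) - k + 1):
            index[seq[start:start + k]].append((protein_id, start))
    return index


# Made-up ten-residue sequence: 10 - (5 - 1) = 6 k-mers are stored.
index = build_kmer_index({"T1": "MKVLIPEPGK"}, k=5)
```

Storing the start position alongside each identifier is what later allows a hit on any k-mer to be extended to the full peptide without rescanning the protein.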

Peptides identified through MS and provided through the input file are searched against the dictionary to retrieve identifiers and start positions in parent proteins. Hereby, PoGo takes up to two amino acid substitutions (mismatches $m$) into account to identify reference parent proteins for variant peptides caused by underlying non-synonymous single nucleotide variants. To enhance the speed PoGo follows different lookup strategies depending on the word size $k$, the number of allowed mismatches $m$ and the peptide length $l$. Figure 2.5 B and C illustrate the different procedures. For peptides shorter than $(m + 1) \times k$ residues the first word of length $k$ is obtained from the sequence. All possible combinations with $m$ amino acid substitutions within the word are generated. In total $\binom{k}{m} \times \lVert AA \rVert^{m}$ new words are generated, where $\lVert AA \rVert$ is the size of the amino acid alphabet $AA$. Each new word then is looked up in the dictionary to retrieve gene and transcript identifiers and start positions of the k-mer. Peptides longer than $(m + 1) \times k$ residues are split into consecutive k-mers. When allowing $m$ amino acid substitutions, at most $m$ of the consecutive k-mers can contain a mismatch, leaving at least one k-mer without any substitutions. Consequently, only the first $m + 1$ consecutive k-mers are looked up in the dictionary without introducing amino acid substitutions. The presence of the peptide in each found protein then is verified through extension from the k-mer start position in the protein to the full length of the peptide. Thereby remaining mismatches are taken into account so that at most $m$ substitutions can occur over the length of the whole peptide. The gene and transcript identifiers and the respective start positions within the proteins are retained to retrieve the genomic coordinates for each peptide.
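The two lookup strategies can be illustrated with a compact sketch. It is a simplified reading of the procedure described above, not PoGo's code: all function names are hypothetical, words with up to $m$ substitutions are enumerated for short peptides, and verification uses a plain position-wise mismatch count.

```python
from collections import defaultdict
from itertools import combinations, product

ALPHABET = "ACDEFGHJKMNPQRSTVWY"  # standard residues with I and L merged into J


def normalise(seq):
    return seq.replace("I", "J").replace("L", "J")


def build_index(proteins, k):
    index = defaultdict(list)
    for pid, seq in proteins.items():
        seq = normalise(seq)
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((pid, i))
    return index


def variants(word, m):
    """The word itself plus all words with substitutions at up to m positions."""
    out = {word}
    for n in range(1, m + 1):
        for positions in combinations(range(len(word)), n):
            for letters in product(ALPHABET, repeat=n):
                chars = list(word)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                out.add("".join(chars))
    return out


def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))


def find_parents(peptide, proteins, index, k, m):
    """Return {(protein id, start)} where the peptide matches with <= m mismatches."""
    pep = normalise(peptide)
    if len(pep) < (m + 1) * k:
        # Short peptide: substitute within the first word.
        seeds = [(0, word) for word in variants(pep[:k], m)]
    else:
        # Long peptide: at least one of the first m + 1 words is unchanged.
        seeds = [(i * k, pep[i * k:(i + 1) * k]) for i in range(m + 1)]
    hits = set()
    for offset, word in seeds:
        for pid, kmer_start in index.get(word, ()):
            start = kmer_start - offset
            if start < 0:
                continue
            target = normalise(proteins[pid])[start:start + len(pep)]
            if len(target) == len(pep) and mismatches(pep, target) <= m:
                hits.add((pid, start))
    return hits
```

For example, with a made-up protein `MKVLIPEPGCTKAAQR`, word size 5 and one allowed mismatch, the variant peptide `VLIAEPGCTK` is recovered at start position 2 through its second, unchanged word `PGCTK`.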

Fig. 2.5 Graphical representation of the construction of the k-mer dictionary from protein sequences and the retrieval of matching proteins for given peptides. A) Protein sequences (blue) are split into words of length k overlapping by k − 1 amino acids, termed k-mers (red). The k-mers are stored in a dictionary with associated transcript and gene identifiers as well as the start position of the k-mer in the protein sequence. B) Lookup procedure for input peptides (yellow) shorter than length k times the number of allowed mismatches +1. For the first word of length k of the peptide sequence, words with all combinations of mismatches are generated and protein candidates are looked up (red). C) For peptide sequences (yellow) longer than k times the number of allowed mismatches +1 the first non-overlapping words of length k are used for lookup (red).

Fig. 2.6 Graphical representation of genomic coordinate retrieval from protein-genomic coordinate linkage for peptides. Peptides (yellow) are mapped against their originating protein sequence (blue). Based on their coordinates within the protein overlapping protein-exons (turquoise) are used to directly calculate the genomic coordinates.

2.2.2.3 Retrieving genomic coordinates for peptides

Peptide sequences with associated gene and transcript identifiers as well as their start position within the protein sequence are used to calculate the genomic coordinates (see Figure 2.6). The end position $e_A$ of peptide $A$ is calculated based on the length of the peptide $L(A)$ and the start position $s_A$ within protein $P$. The start position is relative to the protein start $S_P$, which is a zero-based index, so that $e_A = s_A + L(A) - 1$. To calculate the exact genomic coordinates for the peptide the set of overlapping protein exons $x$ is obtained first so that $P(A) = \{x \in P \mid s_x \le s_A \le e_x \lor s_x \le e_A \le e_x\}$ is satisfied. Through the mapping of protein exons to their respective genomic coding exons PoGo can now determine the genomic coordinates for the peptide sequence and build a model from the transcript $P(A) \to T(A)$. The genomic coordinates for the peptide sequence are calculated as follows. Start and end coordinates are relative to the genomic start $S_E$ of exon $E$. The distances are defined as $d_{S_A} = (s_A - s_P - 1) \times 3 + O(P(N\text{-term}))$ and $d_{E_A} = (e_A - s_P) \times 3 + O(P(N\text{-term})) - 1$. The genomic start and end coordinates of the peptide are then calculated as $S_A = S_E \pm d_{S_A}$ and $E_A = S_E \pm d_{E_A}$, respectively, where the distances are added for genes annotated on the forward strand and subtracted from the exon start $S_E$ for genes on the reverse strand. For peptides a modified version of the map of protein exons to genomic exons is constructed and stored, whereby the protein is substituted by the peptide sequence. Additionally, all C-terminal and N-terminal offsets are assumed to be complete, i.e. $O(\text{peptide}(N\text{-term})) = 3$ and $O(\text{peptide}(C\text{-term})) = 3$.
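A worked example may help to follow the distance formulas. The sketch below handles a peptide lying within a single protein exon; the function name is hypothetical, and it assumes that the exon start $S_E$ denotes the first coding base in reading direction (the highest coordinate of the exon on the reverse strand).

```python
def peptide_genomic_span(s_a, length, s_p, n_offset, exon_start, strand):
    """Genomic start and end of a peptide lying within one protein exon.

    s_a:        start of the peptide within the protein
    length:     peptide length L(A) in residues
    s_p:        start of the protein exon (same index base as s_a)
    n_offset:   N-term offset of the protein exon (1, 2 or 3)
    exon_start: genomic start S_E of the coding exon in reading direction
    strand:     "+" or "-"
    """
    e_a = s_a + length - 1
    d_start = (s_a - s_p - 1) * 3 + n_offset
    d_end = (e_a - s_p) * 3 + n_offset - 1
    if strand == "+":
        return exon_start + d_start, exon_start + d_end
    # Reverse strand: distances are subtracted from the exon start.
    return exon_start - d_start, exon_start - d_end


# Four residues starting at protein position 2 in an exon beginning at 1000.
print(peptide_genomic_span(2, 4, 0, 3, 1000, "+"))  # (1006, 1017)
```

The returned interval spans twelve bases, as expected for four complete codons.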

2.2.3 Mapping post-translational modifications

Along with mapping peptides to their respective genomic coordinates PoGo is capable of mapping post-translational modifications (PTMs) onto the genome. Commonly, PTMs are identified through database search algorithms which annotate these in the peptide sequence using round brackets containing the standardised name (PSI) of the modification following the modified residue. With the position of a modification marked in the peptide sequence and the previous genomic mapping of the peptide PoGo can easily retrieve the genomic location of the PTM. The equations for retrieving genomic coordinates for peptides are adjusted to calculate the start $S_{PTM}$ and end $E_{PTM}$ coordinates as follows for any given PTM or set of PTMs of the same type in a peptide $A$: $d_{S_{PTM}} = (s_{PTM} - s_A - 1) \times 3 + O(A(N\text{-term}))$ and $d_{E_{PTM}} = (e_{PTM} - s_A) \times 3 + O(A(N\text{-term})) - 1$, where $s_A$ is the start of the peptide exon in which the PTM is annotated and $s_{PTM}$ and $e_{PTM}$ are the start and end coordinates of a modification. For single modifications $s_{PTM} = e_{PTM}$, while for multiple modifications of the same type $s_{PTM}$ denotes the first occurrence of this modification type and $e_{PTM}$ the last in the peptide sequence. The distances $d_{S_{PTM}}$ and $d_{E_{PTM}}$ are relative to the genomic start $S_A$ of the peptide exon. PoGo stores the genomic coordinates for PTMs with the associated annotated peptide sequences and groups them by modification type. This enables PoGo to introduce colour coding of different PTM types in the output.
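To illustrate how modification positions can be read from an annotated peptide, the following sketch extracts the bare sequence and the first and last modified residue per modification type. It assumes annotations of the form `S(phospho)` with no nested brackets; the function names and the example peptide are hypothetical.

```python
def parse_modified_peptide(annotated):
    """Split e.g. 'AS(phospho)PK' into the bare sequence and PTM positions.

    Returns (sequence, {ptm_name: [zero-based residue indices]}).
    """
    sequence = []
    ptms = {}
    i = 0
    while i < len(annotated):
        char = annotated[i]
        if char == "(":
            close = annotated.index(")", i)
            name = annotated[i + 1:close].lower()
            # The annotation follows the residue it modifies.
            ptms.setdefault(name, []).append(len(sequence) - 1)
            i = close + 1
        else:
            sequence.append(char)
            i += 1
    return "".join(sequence), ptms


def ptm_ranges(ptms):
    """First and last modified residue per modification type."""
    return {name: (min(sites), max(sites)) for name, sites in ptms.items()}


seq, ptms = parse_modified_peptide("AS(phospho)PT(phospho)M(oxidation)K")
print(seq)               # ASPTMK
print(ptm_ranges(ptms))  # {'phospho': (1, 3), 'oxidation': (4, 4)}
```

The range per modification type corresponds to $s_{PTM}$ and $e_{PTM}$ in peptide coordinates, which can then be converted into genomic coordinates with the distance formulas given in the text.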

2.2.4 Adding quantitative information for multiple samples

PoGo is also able to store additional information linked to the peptide in the input file. For each mapped peptide the related sample identifiers as well as the number of peptide to spectrum matches (PSMs) and other quantitative data are stored. While these data points are stored as triplets for each mapping, a separate list of occurring sample identifiers is compiled. This then enables PoGo to generate comparative output of all mapped peptides across different samples.

2.2.5 Generating different output formats

PoGo generates output in three formats commonly used in genomics. Specifications of these formats were adapted to include peptide centric information. This allows PoGo to visually represent the different aspects and functionalities described in the algorithm.

Colour | Uniqueness of mapping
Red | Peptide mapping uniquely to single gene and single transcript therein
Black | Peptide mapping uniquely to single gene but to multiple transcripts therein
Grey | Peptide mapping to multiple genes and multiple transcripts therein

Table 2.1 Colour code of uniqueness of peptide mapping. The uniqueness is determined through the number of genes and transcripts with which a peptide is associated.

2.2.5.1 BED format

The first and central output is generated in BED format. In this format each mapped peptide is represented by a single line of twelve tab delimited columns. Besides the chromosome coordinates and the peptide sequence, the start positions and lengths of peptide blocks mapping to genomic exons are included. Furthermore, BED files support individual colouring of each feature through an RGB (red, green, blue) colour code per line. PoGo utilizes this in two different forms. In the general BED output that is centred on the peptide mapping PoGo indicates the uniqueness of a mapping through a three tier colour code (Table 2.1). Peptides mapping uniquely to a single gene and a single transcript therein are marked in red, while mappings to single genes but shared between multiple transcripts are coloured in black. Peptides not unique to single transcripts and genes but with multiple different loci are indicated through their grey colour.
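As an illustration of the twelve column layout, the sketch below assembles one BED line for a peptide spanning two exons. It follows the standard BED12 column order with zero-based, half-open coordinates. The specific RGB triplets chosen for red, black and grey, the score of 1000, and all function names are assumptions made for this example and are not taken from PoGo.

```python
# Assumed RGB triplets for the three tier colour code (red, black, grey).
COLOURS = {
    "single_transcript": "255,0,0",
    "single_gene": "0,0,0",
    "multiple_genes": "128,128,128",
}


def uniqueness(n_genes, n_transcripts):
    """Classify a peptide by the number of genes and transcripts it maps to."""
    if n_genes == 1 and n_transcripts == 1:
        return "single_transcript"
    if n_genes == 1:
        return "single_gene"
    return "multiple_genes"


def bed_line(chrom, strand, peptide, blocks, n_genes, n_transcripts):
    """Build a BED12 line; blocks are (start, end) pairs sorted by start."""
    start, end = blocks[0][0], blocks[-1][1]
    sizes = ",".join(str(e - s) for s, e in blocks)
    starts = ",".join(str(s - start) for s, _ in blocks)  # relative to chromStart
    fields = [
        chrom, start, end, peptide, 1000, strand,
        start, end,  # thickStart and thickEnd cover the whole peptide here
        COLOURS[uniqueness(n_genes, n_transcripts)],
        len(blocks), sizes, starts,
    ]
    return "\t".join(str(field) for field in fields)


line = bed_line("chr1", "+", "VPEPGCTK", [(100, 112), (200, 212)], 1, 1)
```

For the PTM specific output described below, the thick start and end columns would instead delimit the modified residues rather than the whole peptide.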

PoGo generates separate output in BED format for PTM mappings. In this instance the colouring indicates the type of modification. The BED feature to indicate the coding sequence of a transcript through a thick block was re-purposed to indicate the position of PTMs within the peptide. For occurrences of multiple PTMs of the same type the thick block indicates the range between the first and last instance of this type in the sequence as indicated in Figure 2.7. This type of visualisation is determined by the format of genomic features that are redefined in this instance. The block-type occurrence of PTMs does reflect the presence of multiple instances of the PTM within the peptide sequence and should not be interpreted as ambiguity of the PTM site localisation described in Chapter 1.

Fig. 2.7 Peptide sequences with annotated post-translational modifications (PTM) are visualized through continuous sequence lines (black) and the thick block feature (indicated in grey) of the BED file format. Single PTMs are shown as a thick block covering the modified residue (middle row). Multiple PTMs of the same type are indicated through blocks spanning from the first to the last occurrence (bottom row). Thick blocks may also span across splice junctions if multiple PTMs of the same type occur on either side of the splice junction (right column). The block visualisation is not to be confused with PTM site ambiguity (see Chapter 1).

2.2.5.2 GTF format

For a more comprehensive view of the mappings PoGo additionally generates output in the general transfer format (GTF). Each line in this format represents a feature with associated genomic coordinates and additional information. The format is structured hierarchically so that gene lines are followed by lines of type transcript that belong to the gene. Additionally, each transcript line is followed by lines containing exons that in combination represent the whole transcript. The format is specified through nine tab separated columns containing the genomic coordinates with strand and frame in columns 1, 4, 5, 7, and 8. Column 3 specifies a feature as controlled vocabulary, e.g. gene, transcript, exon. PoGo redefines feature types to allow visualisation of mapped peptides. While lines of the original GTF input are copied for genes with mapped peptides, the transcript type is redefined to represent a mapped peptide. The exon feature then refers to parts of the peptide mapping to the coding exons of the original transcript. Column 9 contains attributes uniquely specifying each line and additional information about each feature. Attributes are separated by a semicolon and a single space. Mandatory attributes for each line are ‘gene_id’ and ‘transcript_id’ and provide linkage between different lines. While ‘gene_id’ must be a unique identifier for the genomic source of a transcript, i.e. the gene, ‘transcript_id’ must be globally unique for the transcript. PoGo uses the gene identifiers provided in the input annotation file associated with each mapped peptide as they are globally unique. To provide association of the transcript identifier and the peptide sequence PoGo concatenates those with a ‘.’ to build the ‘transcript_id’. In cases where the same peptide sequence occurs multiple times in a transcript, e.g. in repeat regions, the identifier is further extended with a number keeping ‘transcript_id’ globally unique. Through additional tags at the end of the attributes PoGo adds the sample, PSM, and quantitative information.
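The construction of the identifier and the attribute column can be sketched as follows. How exactly the disambiguating number is appended is not specified above, so joining it with a further ‘.’ is an assumption, as are the function names, the example identifiers and the tag names used for sample information.

```python
def peptide_transcript_id(transcript_id, peptide, occurrence=None):
    """Join transcript identifier and peptide with '.'.

    `occurrence` disambiguates repeated matches within one transcript;
    appending it with another '.' is an assumption made for this sketch.
    """
    identifier = f"{transcript_id}.{peptide}"
    if occurrence is not None:
        identifier = f"{identifier}.{occurrence}"
    return identifier


def gtf_attributes(gene_id, transcript_id, extra=None):
    """Format column 9: attributes separated by a semicolon and a single space."""
    parts = [f'gene_id "{gene_id}"', f'transcript_id "{transcript_id}"']
    for key, value in (extra or {}).items():
        parts.append(f'{key} "{value}"')
    return "; ".join(parts) + ";"


# Placeholder identifiers, not taken from any real annotation.
tid = peptide_transcript_id("ENST_EXAMPLE", "VPEPGCTK", occurrence=2)
print(gtf_attributes("ENSG_EXAMPLE", tid, {"sample": "liver", "psms": 3}))
```

The mandatory attributes come first so that genome browsers relying only on ‘gene_id’ and ‘transcript_id’ can still group the lines.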

2.2.5.3 GCT format

While prominent offline and online genome browsers are able to parse BED and GTF files, which focus on the splicing structure of the mapped entity, the GCT format targets the quantitative comparison across multiple samples. The format is designed as a tab separated matrix of samples as columns and mapped peptides as rows. Values in the cells represent the quantitative information of a peptide in a sample as specified by the user input. The first two columns are mandatory: the first contains a unique identifier, while the second contains a description and the genomic coordinates separated by vertical bars. This structure and the information it contains allow the Integrative Genomics Viewer (IGV) (Thorvaldsdottir et al., 2013) to visualise the aligned peptides with their associated quantitation in different samples and enable comparative visualisation.
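A minimal sketch of writing such a matrix is given below. It follows the general GCT layout of a version line, a dimensions line and a header row; the exact content of the description column, the use of 0 for samples in which a peptide was not observed, and the function name are assumptions made for illustration.

```python
def gct_table(rows, samples):
    """Build GCT text from rows of (identifier, description, {sample: value}).

    Samples without a value for a peptide are filled with 0, which is an
    assumption of this sketch.
    """
    lines = [
        "#1.2",                                      # format version line
        f"{len(rows)}\t{len(samples)}",              # number of rows and samples
        "\t".join(["Name", "Description"] + list(samples)),
    ]
    for identifier, description, values in rows:
        cells = [str(values.get(sample, 0)) for sample in samples]
        lines.append("\t".join([identifier, description] + cells))
    return "\n".join(lines) + "\n"


rows = [
    ("pep1", "VPEPGCTK |@chr1:100-212|", {"liver": 1520.5, "retina": 310.0}),
    ("pep2", "AAQRK |@chr2:50-65|", {"liver": 12.0}),
]
text = gct_table(rows, ["liver", "retina"])
```

Keeping one row per mapped peptide and one column per sample mirrors the separate list of sample identifiers compiled during mapping.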

2.2.6 Benchmarking and application

2.2.6.1 Datasets

To allow benchmarking against additional tools high-resolution MS data from 59 foetal and adult human tissues were used. The raw data for these draft human proteome maps were generated by the Pandey lab (Kim et al., 2014), the Kuster lab (Wilhelm et al., 2014), and Cutler lab (Desiere et al., 2006). All three datasets were reanalysed and combined by Wright et al. (2016) and deposited in the PRIDE archive (PXD002967). The search results in mzid format were downloaded and filtered to the highest stringency level described in Wright et al. (2016) for identification of novel coding regions (q-value ≤ 0.01, PEP ≤ 0.01, peptide length between 7 and 29 residues, full tryptic peptides, maximum of two missed cleavages). A single file of four columns as specified for the PoGo input was generated.

To demonstrate PoGo’s unique features the isobaric labelled phosphoproteome data from an ovarian cancer tumour study comprising 69 samples were downloaded as a tab separated phosphopeptides report file from the data portal of the Clinical Proteomic Tumor Analysis Consortium (CPTAC) (Zhang et al., 2016). The data were reformatted to match the PoGo input format, and lower case characters indicating phosphorylation (s, t and y) were substituted by their upper case counterparts followed by the PSI annotation for phosphorylation in round brackets.

Colour | Post-translational modification | PSI name
 | phosphorylation | phospho
 | acetylation | acetyl
 | amidation | amidated
 | oxidation | oxidation
 | methylation | methyl
 | ubiquitinylation | glygly; gg
 | sulfation | sulfo
 | palmitoylation | palmitoyl
 | formylation | formyl
 | deamidation | deamidated
 | Any other post-translational modification |

Table 2.2 For peptide mappings with annotated positions of modifications the PTM type is indicated through the colour in the PTM BED file output. PoGo supports 10 biologically relevant post-translational modifications by default. Their colours, modification name and PSI modification short name for annotation within peptide sequences are indicated in this table.

2.2.6.2 Protein sequences, gene annotation and PoGo settings

Wright et al. (2016) used GENCODE v20 as their main source of annotated protein sequences. To ensure consistency with these identifications the annotation of human genes for GENCODE v20 in the GTF format and the corresponding protein coding sequence translations in FASTA format were downloaded from http://www.gencodegenes.org. Gene and transcript identifiers follow the Ensembl unique identifiers of “ENSG” and “ENST” for genes and transcripts, respectively, followed by 11 digits. The word length to build the k-mer dictionary was set to 5 amino acids, and 10 biologically relevant post-translational modifications were selected and included in PoGo’s source code for easy discriminability of the colour code (Table 2.2).

2.2.6.3 Comparison of algorithms for performance evaluation

Benchmarking of PoGo was performed against PGx (Askenazi et al., 2016) and iPiG (Kuhring and Renard, 2012), respectively downloaded from https://github.com/FenyoLab/PGx and https://sourceforge.net/projects/ipig/, using default parameters. Each dataset was formatted using R and perl scripts to fit the required input format for each tool. To compare mappings and assess the overlap between the tools, mapped peptides were marked as equal if chromosome name, start and end coordinates, start coordinates and lengths of exons, and the peptide sequences were the same. Peptide matches not marked as equal then were assessed by shifting start and end coordinates by up to two base pairs to identify frame shifts. These frame shifts then were compared to the annotation of the originating transcripts to identify the origin of the shifts. Furthermore, the frame shifts and remaining unmarked mappings were compared to the 6 frame translation provided in the IGV genome browser to identify any false instances.
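The comparison criteria can be expressed as a small classification function. This is a simplified sketch rather than the R and perl scripts used for the benchmark: the record type and function name are hypothetical, and a frame shift is recognised only from an identical shift of start and end coordinates, without re-checking the exon blocks.

```python
from typing import NamedTuple, Tuple


class Mapping(NamedTuple):
    chrom: str
    start: int
    end: int
    block_starts: Tuple[int, ...]
    block_sizes: Tuple[int, ...]
    peptide: str


def compare(a, b, max_shift=2):
    """Classify two mappings as 'equal', 'frame_shift' or 'different'."""
    if a == b:
        return "equal"  # all compared fields agree
    if a.chrom == b.chrom and a.peptide == b.peptide:
        shift = b.start - a.start
        # Start and end moved together by one or two base pairs.
        if 0 < abs(shift) <= max_shift and b.end - a.end == shift:
            return "frame_shift"
    return "different"


a = Mapping("chr1", 100, 124, (0,), (24,), "VPEPGCTK")
print(compare(a, a._replace(start=101, end=125)))  # frame_shift
```

Mappings classified as frame shifts would then still need to be checked against the transcript annotation to establish which of the two coordinates is correct.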

2.2.7 Generation of track hubs

Prominent online genome browsers have limits for the size of files directly uploaded to them. Large studies, such as the draft human proteome maps and their reanalysis, result in files with sizes over the limit for direct upload. For this reason UCSC introduced track hubs, web-accessible directories of binary genomic data. To allow web access to large mapping files through prominent online genome browsers a Track-Hub Generator application was implemented in perl. The tool automatically generates the folder structure and required files for track hubs and converts BED files provided by the user into binary BigBed files. For testing and annotation purposes track hubs were generated to visualize different aspects of the human proteome maps. The data were filtered as described before to high stringency and additionally filtered to proteomics standard significance (q-value ≤ 0.01 (1% FDR), PEP ≤ 0.05 and minimum peptide length of 7 residues). Each significance set was split into subsets for individual tissues resulting in 120 individual files (60 per filtering level). PoGo was run with default parameters using comma separated lists of input files to be mapped separately against GENCODE v20. The script ‘fetchChromSizes.sh’ and tool ‘bedToBigBed’ from UCSC that are used in the Track-Hub Generator application were downloaded from http://hgdownload.cse.ucsc.edu. The Track-Hub Generator application then was run using the resulting BED files to generate two track hubs, one for each significance level filter, each comprising all 60 tissues. The generated track hub folders were moved to a web server and are accessible through http://www.sager.ac.uk/science/projects/proteogenomicshubs.
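The folder and file structure of a track hub can be outlined with a short sketch. It produces the text of the three configuration files of a UCSC track hub (hub.txt, genomes.txt and trackDb.txt) without touching the file system. It is not the perl implementation of the Track-Hub Generator; the function name, labels, e-mail address and the genome identifier used in the example are assumptions.

```python
def track_hub_files(hub_name, genome, email, tracks):
    """Return a dict of relative path -> text for a minimal track hub.

    tracks: list of (track_name, bigbed_filename) pairs; track names must
    not contain spaces, and the bigBed files are expected to sit next to
    trackDb.txt inside the genome folder.
    """
    hub = (
        f"hub {hub_name}\n"
        f"shortLabel {hub_name}\n"
        f"longLabel {hub_name} peptide mappings\n"
        "genomesFile genomes.txt\n"
        f"email {email}\n"
    )
    genomes = f"genome {genome}\ntrackDb {genome}/trackDb.txt\n"
    stanzas = []
    for name, bigbed in tracks:
        stanzas.append(
            f"track {name}\n"
            f"bigDataUrl {bigbed}\n"
            f"shortLabel {name}\n"
            f"longLabel Peptides identified in {name}\n"
            "type bigBed 12\n"   # twelve-column BED converted to bigBed
            "itemRgb on\n"       # honour the per-line colour code
            "visibility dense\n"
        )
    return {
        "hub.txt": hub,
        "genomes.txt": genomes,
        f"{genome}/trackDb.txt": "\n".join(stanzas),
    }


files = track_hub_files(
    "proteome", "hg38", "user@example.org",
    [("liver", "liver.bb"), ("retina", "retina.bb")],
)
```

Enabling `itemRgb` in each stanza is what preserves the uniqueness and PTM colour codes of the BED output once the files are converted to bigBed.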

2.3 Results and discussion

2.3.1 Speed and quality of mapping

For benchmarking PoGo, PGx, and iPiG were run on a set of over 3 million peptides across 59 adult and foetal tissues (233,055 unique peptides) against GENCODE v20. PoGo (94 seconds) was 6.9 and 96.4 times faster than PGx (651 seconds) and iPiG (memory error after 9,064 seconds), respectively. Furthermore, PoGo required 20% less memory compared to PGx (9.7 GB and 11.9 GB respectively). These data show a major improvement of speed and memory usage of PoGo over similar tools. Additionally, it also demonstrates ease of application of PoGo with readily available reference annotation.

In total 250,488 mappings (89.3%) were common between PoGo and PGx, while 10.5% (29,507) and 0.2% (422) were uniquely reported by PGx and PoGo, respectively (see Figure 2.8 A). By shifting start and end positions of the mapped peptides unique to PGx by up to two nucleotides, all 29,507 mappings unique to PGx were identified as false assignments of frame caused by the incomplete annotation of transcripts, e.g. a missing start codon. After frame shifting, 98.8% of mappings unique to PGx match correctly mapped peptides shared by PoGo and PGx. The remaining 1.2%, after correction of their frame, matched 333 of PoGo’s unique mappings (78.9%), which were correct due to PoGo’s incorporation of the annotated reading frame (see Figure 2.8 B). An additional 72 of PoGo’s unique results were identified as correct mappings for incompletely annotated transcripts. Furthermore, 17 unique mappings correspond to alternative splicing, immunoglobulin genes and multiple overlapping mappings in a repeat region. PGx found mappings to consecutive loci in the SPRR3 gene for the peptide sequence ‘VPEPGCTKVPEPGCTK’, which is a combination of two repeats of 8 amino acids caused by a missed cleavage. However, PoGo mapped the sequence four times with repeats overlapping each other (Figure 2.9).

Fig. 2.8 Comparison of mapping results between PoGo and PGx. A) Between both tools a total of 280,417 peptide to genome mappings are retrieved. While 89.3% of mappings are shared between tools, PoGo reveals 422 (0.2%) unique mappings and PGx 10.5%. B) After adjusting for incompletely annotated transcripts that result in frame shifts of PGx mappings, a total of 250,910 mappings remain. PoGo and PGx share 99.96%, and PoGo reveals 89 unique mappings for incompletely annotated transcripts, high complexity regions and immunoglobulin genes.

Fig. 2.9 IGV view of peptide mappings with genomic coordinates shown at the top as x-axis. The peptide ‘VPEPGCTKVPEPGCTK’, with a missed cleavage between two repeats of 8 amino acids within the gene SPRR3 (GENCODE (v20) annotation shown in blue), is mapped by PoGo to four overlapping loci (black), while PGx only maps it to two consecutive loci (green). Furthermore, PoGo maps each peptide once to a single locus. PGx, however, maps all occurrences in the input set to their respective genomic positions.

2.3.2 Track hubs for tissue data

The large number of mapped peptides for the draft human proteome maps resulted in a PoGo output BED file of 19.4 MB. While the UCSC genome browser was able to load and parse the file, it exceeded the 19 MB size limit of the Ensembl browser. PoGo was able to map all 60 tissue input files per significance-filtered set individually into BED files. Applying the Track-Hub Generator produced the folder and file structure specified by UCSC. After the track hubs had been moved to a web-accessible server, the ‘hubCheck’ tool, provided by UCSC to assess the correctness of a track hub setup, terminated without errors for both hubs. Manual inspection of the high significance track hub in the UCSC genome browser revealed proteins identified in single tissues only. The scaffolding protein CASS4, for example, was only found in platelets, as shown in Figure 2.10. Figure 2.11 shows the genomic region of RBP3, a protein only identified in retina. The identified peptides nevertheless support all annotated splice junctions.

2.3.3 Accounting for SNPs

The large number of single nucleotide variants in individuals can affect protein sequences and hinder the identification of peptides through database searching against a reference genome. Commonly, customised sequence databases integrating nucleotide variants are generated to identify variant peptides in mass spectrometry data (Nesvizhskii, 2014). Unlike other tools, PoGo is able to account for up to 2 non-synonymous nucleotide substitutions and map variant peptides against a reference genome. Application to the draft human proteome maps allowing 1 and 2 variants resulted in a 1.5- and 60.8-fold increase in runtime, respectively (see Figure 2.12 A). Unique mappings to single transcripts were reduced by 5.1% and 15.9%, while the number of peptides mapped to multiple genes increased exponentially, by 220.9% and 3,175.2% for 1 and 2 allowed mismatches, respectively (see Figure 2.12 B).
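Why the number of candidate mappings grows so quickly with allowed mismatches can be seen in a naive sketch. The code below is not PoGo's mapping algorithm; it simply scans a protein sequence for every window that differs from the peptide in at most a given number of residues, treating leucine and isoleucine as identical because they are isobaric. The function name and the brute-force scan are assumptions made for illustration.

```python
def find_with_mismatches(peptide, protein, max_mismatches=0):
    """Return (position, mismatches) for each window of protein that differs
    from peptide in at most max_mismatches residues.

    I and L are both rewritten to J before comparison, so they never count
    as a mismatch against each other.
    """
    to_j = str.maketrans("IL", "JJ")
    pep = peptide.translate(to_j)
    prot = protein.translate(to_j)
    hits = []
    for pos in range(len(prot) - len(pep) + 1):
        mismatches = 0
        for a, b in zip(pep, prot[pos:pos + len(pep)]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many substitutions, discard this window
        else:
            hits.append((pos, mismatches))
    return hits
```

For a repeat-rich sequence such as three copies of ‘VPEPGCTK’ in which the middle copy carries one substitution, the exact search returns two positions and the search with one allowed mismatch returns all three.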

The mapping accounting for multiple mismatches revealed highly similar genomic repeats of the peptide sequence ‘VPEPGCTK’ within the SPRR3 protein. These additional repeats are shown in Figure 2.13 and marked by red boxes. Peptides of the exact sequence, including the amino acid substitutions, were previously identified through database searching in the same sample and validate the mappings obtained by PoGo when taking mismatches into account.

Fig. 2.10 View of the genomic region of CASS4 in the high significance filtered draft human proteome maps track hub. Genomic coordinates are shown as x-axis while tissues are represented as rows. GENCODE v20 annotation of two transcripts is shown in black at the top and peptides identified across the whole dataset are shown underneath. Red indicates unique mapping to one transcript, while black indicates peptides mapping to both transcripts. Peptides within single tissues are shown below in each tissue line. All peptides of CASS4 were only found in platelets (red and black bars in the lower third).

Fig. 2.11 View of the genomic region of RBP3 in the high significance filtered draft human proteome maps track hub. Genomic coordinates are shown as x-axis while tissues are represented as rows. GENCODE v20 annotation of the protein coding transcript is shown in blue at the top and peptides identified across the whole dataset are shown underneath in red, indicating unique mapping to the single transcript. Peptides within single tissues are shown below in each tissue line. All peptides of RBP3 were only found in retina (red bars in the lower third), spanning all three splice junctions and providing proteomic support for the annotated gene structure.

Fig. 2.12 Comparison of runtime, memory usage and distribution of uniqueness of mappings for PoGo applied to the high stringency filtered draft human proteome map with increasing number of mismatches allowed. A) Starting from the initial 250,000 peptide mappings with no mismatches allowed, increasing the number of mismatches leads to an exponential increase in runtime (left side y-axis) and in the random access memory required for mapping (right side y-axis). B) The number of mappings (x-axis) increases exponentially with the number of allowed mismatches, driven by non-unique mappings (dark grey). Mappings of peptides to single transcripts or to multiple transcripts within a single gene, however, are affected far less and are reduced by only ∼8,000 mappings between 0 and 2 mismatches, resulting in a smaller set of high confidence mappings.

Fig. 2.13 Example of a single peptide with missed cleavage including two repeats of the sequence ‘VPEPGCTK’ and its mappings across multiple repeats within the annotated gene of SPRR3. The exact repeats are shown with zero mismatches (0 Mismatches). Application of PoGo with up to two mismatches shows mappings of the peptide sequence to additional repeats (1 and 2 Mismatches, red boxes). Additional mappings of the peptide are validated through peptides of exact sequence with amino acid substitutions identified through protein sequence database searching in the same sample (Validation). Leucine (L) and isoleucine (I) are both represented by the substitute letter J because they are isobaric.

2.3.4 PTM mapping

PoGo demonstrated its additional functionality of mapping post-translational modifications through application to the phosphoproteome of high-grade serous ovarian cancer with isobaric labelling of 69 tumour samples. The set comprises 13,646 unique peptides with 19,156 annotated phosphorylation sites. PoGo was able to map 13,617 peptides (99.8%) to 15,944 genomic loci in 66.9 seconds. PGx and iPiG, on the other hand, failed to interpret the annotated phosphorylation sites and produced no mappings. The small fraction of 0.2% of peptides not mapped by PoGo could be attributed to differences between the original search database, RefSeq, and GENCODE v20 used for mapping. Furthermore, PoGo colour coded phosphorylated peptides in red as expected, as shown for an example peptide within MAPK3 in Figure 2.14 A. The protein has previously been identified with differentially phosphorylated sites.

2.3.5 Integrating peptide quantitation

Utilising the provided log2-fold changes of phosphopeptides between the 69 samples in the isobaric labelled ovarian cancer dataset, PoGo demonstrated its ability to incorporate any given quantitative information into the mapping. The output in GCT format then enables comparative visualisation in the Integrative Genomics Viewer (IGV) and downstream quantitative analysis. Across the 13,617 mapped peptides PoGo was able to map the associated 325,681 log2-fold changes across the 69 samples in the GCT file (Figure 2.15). For example, MAPK3 was identified in the ovarian cancer study with multiple phosphorylated sites and differential expression between samples, as shown in Figure 2.14 B.

2.4 Conclusions

In this chapter I described the development and implementation of PoGo, a new software tool to map peptides with linked quantitative information across multiple samples onto a reference genome. PoGo has advantages compared with other peptide to genome mapping tools. The data show that PoGo represents a major advance in speed and memory usage. Although the results in this chapter focus on human tissue and cancer cell lines, PoGo can be applied to any species for which annotation of coding sequences and their translation are available in GTF and FASTA formats, respectively.

Fig. 2.14 Mapping of phosphorylation sites and quantitative information for an example peptide with multiple phosphorylation events across 69 ovarian cancer samples within the MAPK3 gene. GENCODE v20 annotation of MAPK3 transcripts is shown in blue in the upper half of each screenshot. A) Genomic loci of post-translational modifications within the peptide, annotated through round brackets in the sequence, are indicated as thick blocks as expected. The colour code of post-translational modifications results in the red colour of the mappings. B) Quantitative information expressed through log2-fold changes for the example peptide is depicted across 69 samples (y-axis). High values are shown in red while blue indicates low log2-ratios.

Fig. 2.15 Mapping of identified phosphopeptides in 69 ovarian cancer samples (y-axis) across the whole human genome reference GRCh38 (x-axis). The histogram indicates the number of phosphopeptide mappings per genomic locus. The heat map underneath indicates log2-fold changes of peptide expression over all samples compared to a pooled reference sample. Red shows high log2-fold changes indicating up-regulation, while blue represents down-regulation (low log2-fold changes).

PoGo demonstrated additional strengths through its unique features such as allowing up to two non-synonymous single nucleotide substitutions, mapping of post-translational modifications, and integration of quantitation. This sets PoGo apart from other available mapping tools. The adoption of semi-standardized file formats commonly used in genomics and proteomics for input as well as output and the scalability for large datasets make PoGo an indispensable component of small and large-scale multi-omics studies.

Its fast and diverse mapping capabilities prompted the integration of the algorithm into the PRIDE tool suite. This and the Track-Hub Generator application further promote open access proteogenomics for the wider proteomics as well as the genomics community. PoGo has been developed to cope with the rapid increase in quantitative high-resolution datasets capturing proteomes and global modifications, and to support studies focusing on the integration of gene, protein and post-translational modification expression (Alvarez et al., 2016). The integration of orthogonal genomics platforms with these datasets through PoGo will be valuable for large-scale analysis in personal variation and precision medicine studies.

2.5 Publication Note and Contributions

Most of the work described in this chapter has been accepted for publication in Cell Systems. The peptide to protein mapping algorithm was implemented by Georg Pirklbauer under my supervision. With exception of this and unless explicitly stated otherwise, the algorithm design, implementation and analysis described herein is the work I performed myself, under supervision of Andreas Bender and Jyoti Choudhary.

C. N. Schlaffner, G. J. Pirklbauer, A. Bender, and J. S. Choudhary. Fast, Quantitative and Variant Enabled Mapping of Peptides to Genomes. Cell Systems, 5(2):152–156.e4, 2017. doi: 10.1016/j.cels.2017.07.007

CHAPTER 3

UNBIASED DETECTION OF MODIFICATIONS AND VARIANTS USING MS SMIV

3.1 Introduction

Mapping peptide sequences to their respective genomic locations is a cornerstone of combining proteomic mass spectrometry with next-generation sequencing technologies and enabling an integrative view of molecular interactions in cells. Advancements in mass spectrometry technology have made confident identification and quantification of close to complete proteomes comparable in coverage to RNA-sequencing experiments. However, on average only up to 30% of acquired tandem mass spectra are identified in standard proteomics experiments (Griss et al., 2016). While of high quality, a large fraction of the spectra remain unidentified due to sequence variation and post-translational modifications. Additionally, low abundance peptides often result in spectra that are filtered from confident identification sets because of low scores caused by lower signal-to-noise ratios.

To address the identification of spectra stemming from low abundance peptide species, spectrum library searching can be applied. This method of comparing sample spectra against a library of previously identified spectra was first applied in small molecule mass spectrometry (reviewed in Stein, 2012). For proteomics, numerous tools such as X!Hunter (Craig et al., 2006), BiblioSpec (Frewen and MacCoss, 2007), and SpectraST (Lam et al., 2007) have been introduced that compare experimental tandem mass spectra to curated libraries. While these tools exhibit high sensitivity, specifically for identification of spectra resulting from low abundance peptides, they are limited to matching exact sequences.

In proteogenomics, sequence variants resulting from genomic alterations are in part identified through customized sequence databases against which tandem mass spectra are searched. Numerous algorithms to incorporate variants into protein sequence databases have been proposed using known variants from databases such as dbSNP (Sherry et al., 1999) or COSMIC (Forbes et al., 2017). Furthermore, variant calls from RNA-sequencing experiments provide sample specific sequence alterations (Bunger et al., 2007; Li et al., 2011). The incorporation of additional sequences in the database increases the size of the search space significantly and leads to challenges regarding false discovery estimation and decreased sensitivity (Nesvizhskii, 2014). It is estimated that a human genome contains 10,000-11,000 non-synonymous variant sites (The 1000 Genomes Project Consortium et al., 2010). However, the identification of 42,549 missense mutations across 44 Caucasian subjects (Shen et al., 2013) highlights that only a small fraction of unidentified spectra may be recovered from variant sequence identification.

The largest fraction of unidentified spectra arises from post-translationally modified peptides. Over 200 types of post-translational modifications (PTM) are known (Jensen, 2006). These bind covalently and reversibly to different amino acids, and their potential occurrence and co-occurrence on proteins and peptides lead to an exponential increase in sequence complexity. Database search algorithms enable identification of peptides with PTMs by including their masses and potential sites in the generation of in silico spectra. The combinatorial increase of search space, however, severely impacts runtime and false discovery rate estimates (Shortreed et al., 2015). Many database search tools therefore introduced caps on the numbers and combinations of dynamic PTMs that are taken into account.

In the last decade a two-pass search strategy termed error-tolerant database searching was introduced to enable comprehensive identification of PTMs in a sample. In the first pass only two variable modifications are allowed in the search, while the whole sequence database is used to identify spectra. The second pass then reduces the database to sequences identified in the first pass while all PTMs are taken into account (Creasy and Cottrell, 2002; Na et al., 2008a). Recently, the application of database searching using a wide peptide mass tolerance of 500 Da proved successful in recovering an additional 9% of spectra due to PTMs (Chick et al., 2015). However, the open mass tolerance resulted in a reduced sensitivity compared to database searches with dynamic modifications. A novel approach to open database searching was also recently introduced by the tool MSFragger, leading to an average increase of 300% in identifications with modifications; this work also highlighted the requirement for open searching to accurately estimate false discovery rates (Kong et al., 2017).

In recent years sensitive spectrum library algorithms have been adapted to accommodate mass shifts introduced by sequence variants and post-translational modifications. Some are built on the premise that spectra of modified and unmodified versions of a peptide follow the same fragmentation pattern. Griss et al. (2016) implemented a clustering algorithm allowing for mass differences between precursor masses. By contrast, other tools try to identify the precursor mass difference between two spectra in the fragment peaks as well (Savitski et al., 2006; Ye et al., 2010). These tools, however, rely on a previously identified library of tandem mass spectra and do not take full advantage of recent advancements in mass spectrometry technology.

I previously developed a software tool to fully utilize high resolution mass spectrometry for variant and modification detection (Schlaffner, 2014). The algorithm, MS SMiV, matches fragment peaks between tandem mass spectra taking the mass difference of precursors into account (see Figure 3.1). In contrast to the aforementioned adapted library search tools it relies solely on tandem mass spectra, without prior knowledge of the underlying peptide sequences. It can therefore provide similarity scores for any given pair of spectra in a dataset. Additionally, MS SMiV accepts a separate library file in Mascot Generic Format (MGF), allowing spectrum matching between different datasets.

Tandem mass spectra exhibit a high degree of complexity through large numbers of peaks. However, commonly only peaks with high intensities represent the underlying peptide fragments. MS SMiV pre-processes the spectra by dividing them into equidistant bins along the m/z axis and retaining only the highest intensity peaks within each bin. This complexity reduction is followed by the selection of spectrum pair candidates, for which the precursor mass difference is checked against a user defined white list of mass shifts. For candidate pairs, fragment peaks are matched based on m/z values, allowing a user defined error tolerance in addition to the mass difference between precursors. Under the assumption that relative fragment intensities are reproducible, the most likely peak matches are ensured by ordering peak lists by descending intensity. Based on the matching of peaks between spectra, the similarity score is calculated using an adapted dot-product score. This is extended with an intensity sensitive penalty for unmatched peaks. The similarity score is defined as

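\[
\text{MS SMiV score} = \frac{\sum_{i=1}^{k} I_{a,\mathrm{matched}_i} \ast I_{b,\mathrm{matched}_i}}{\sqrt{\sum_{i=1}^{n} I_{a_i}^{2}} \ast \sqrt{\sum_{i=1}^{m} I_{b_i}^{2}}} - \frac{\sum_{i=1}^{n-k} I_{a,\mathrm{unmatched}_i}^{2} \ast \sum_{i=1}^{m-k} I_{b,\mathrm{unmatched}_i}^{2}}{\sum_{i=1}^{n} I_{a_i}^{2} \ast \sum_{i=1}^{m} I_{b_i}^{2}} \tag{3.1}
\]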

Intensities are represented through Ia and Ib, while n and m indicate the number of peaks in spectrum a and b, respectively. Furthermore, k refers to the number of matched peaks between the two spectra.
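A minimal sketch of this score is given below. It reads Equation 3.1 as a normalised dot product over matched peaks minus a penalty formed from the squared intensities of unmatched peaks, each normalised by the total squared intensities of both spectra; this reading, the function name and the representation of matches as index pairs are assumptions of the example.

```python
import math


def ms_smiv_score(int_a, int_b, matches):
    """Similarity of two spectra given their peak intensities and peak matches.

    int_a, int_b: lists of peak intensities of spectra a and b.
    matches: list of (index_in_a, index_in_b); each index is used at most once.
    """
    sum_sq_a = sum(x * x for x in int_a)
    sum_sq_b = sum(x * x for x in int_b)
    if sum_sq_a == 0 or sum_sq_b == 0:
        return 0.0  # an empty spectrum cannot be scored

    matched_a = {i for i, _ in matches}
    matched_b = {j for _, j in matches}

    # Dot-product term over the k matched peak pairs.
    dot = sum(int_a[i] * int_b[j] for i, j in matches)
    similarity = dot / (math.sqrt(sum_sq_a) * math.sqrt(sum_sq_b))

    # Penalty term from the n-k and m-k unmatched peaks.
    unmatched_a = sum(x * x for i, x in enumerate(int_a) if i not in matched_a)
    unmatched_b = sum(x * x for j, x in enumerate(int_b) if j not in matched_b)
    penalty = (unmatched_a * unmatched_b) / (sum_sq_a * sum_sq_b)

    return similarity - penalty
```

Under this reading, two identical spectra with all peaks matched score 1, and the penalty only becomes non-zero when both spectra contain unmatched peaks.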

The implementation of MS SMiV presented in Schlaffner (2014) restricts spectrum pair matching to spectra of charge state 2+. To utilize all tandem mass spectra in a dataset, updates to the algorithm are necessary. This chapter describes major extensions of MS SMiV. Adaptations include the spectrum pre-processing and peak matching methods to enable similarity scoring between differently charged spectra. Furthermore, an intensity transformation is introduced as part of the spectrum pre-processing to take differences in intensity distributions between spectra of different charges and experimental datasets into account. The large number of spectrum pairs that need to be scored, a result of increasing numbers of spectra due to faster acquisition, requires parallelized versions of MS SMiV scoring. These support execution on single computers and large computational clusters. The enhanced features are benchmarked against an open mass tolerant database search (Chick et al., 2015).

Fig. 3.1 Graphical representation of MS SMiV utilities. Protein samples are digested into peptides and subsequently subjected to LC-MS/MS analysis. Resulting tandem mass spectra are used as input for MS SMiV. After complexity reduction and peak intensity adjustment, the algorithm pairs spectra based on precursor masses and a set of allowed mass shifts before determining similarity using a dot-product based score. Taking precursor mass shifts into account while matching fragment peaks (colour coded in spectra on right hand side) allows MS SMiV to cluster spectra of the same sequence (top) and, furthermore, identify spectra of modified (middle) and variant (bottom) peptide sequences.

3.2 Algorithm adaptation, optimization and testing

3.2.1 Adaptations to the algorithm

3.2.1.1 Spectrum processing

The initial processing in MS SMiV reduces spectrum complexity by dividing each spectrum into equidistant mass windows of a size defined by user input. From each of the mass windows in a spectrum a specified fixed number of peaks is retained starting from highest intensity. The window width of 100 m/z and the top 6 peaks are set as default parameters. These steps are only applied to spectra of charge 2+ in the initial implementation. To account for charge states higher than 2+ a two-step processing was introduced (see Figure 3.2). In the first step the number of retained peaks is changed depending on spectrum charge. The second step adjusts intensities of remaining peaks to decrease the variance between spectra of different charge and datasets.
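The window-based complexity reduction can be sketched as follows, using the default window width of 100 m/z and the top 6 peaks per window. Peaks are assumed to be (m/z, intensity) pairs and windows are assumed to start at m/z 0; the function name is chosen for this example.

```python
from collections import defaultdict


def pick_top_peaks(peaks, window_width=100.0, peaks_per_window=6):
    """Keep the most intense peaks of each equidistant m/z window.

    peaks: list of (mz, intensity). Returns the retained peaks ordered by
    descending intensity, the order used later for peak matching.
    """
    windows = defaultdict(list)
    for mz, intensity in peaks:
        windows[int(mz // window_width)].append((mz, intensity))

    kept = []
    for members in windows.values():
        members.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(members[:peaks_per_window])

    kept.sort(key=lambda peak: peak[1], reverse=True)
    return kept
```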

Charge dependent peak picking. The charge of a peptide ion can have significant impact on the distribution of fragment ion masses and intensities. This is in part due to charge losses during fragmentation resulting in fragment ions with significantly different charge than the parent ion. The same fragment sequence of a peptide can occur as multiple peaks due to different charges. Furthermore, higher charge states result in lower m/z ratios and reduced distance between peaks leading to accumulation at the lower end of the m/z range of a spectrum. Selection of a fixed number of peaks in equidistant mass windows therefore would result in an incomplete representation of the spectrum after complexity reduction.

Fig. 3.2 Visual representation of two-step spectrum pre-processing. In the first step the spectrum is divided into bins of predefined length along the mass dimension (m/z). Within each bin the N highest intensity peaks are retained. In the second step peak intensities are adjusted to reduce the variance between spectra of different charge states and experiments. Intensity harmonization divides intensities into a predefined number of percentiles, called intensity clusters. Within each cluster intensities are adjusted to the highest intensity therein.

To address the shortcoming of the initial pre-processing method for highly charged spectra a charge dependent peak picking was developed. While the width of mass windows remains unchanged and equidistant across the spectrum, the number of peaks retained from each window is adjusted. Windows at lower masses, likely to comprise highly charged peaks, will retain more peaks than windows at the higher mass end of the spectrum. This is achieved by use of a cosine curve between its minimum and maximum. The function is defined as

\[
ppw_i = \left\lfloor (z - 2) \ast \cos\left( (\pi + 1) \times \frac{i + 1}{n - 1} - 1 \right) + 0.5 \right\rfloor + ppw \tag{3.2}
\]

where ppw denotes the user defined number of peaks per window, z represents the charge state of the precursor ion, i the index of the current mass window and n the number of mass windows in the spectrum. The index i is zero based and refers to an empty window at index 0. Furthermore, the last mass window at index i = n remains a partial window, usually smaller and with fewer peaks than the windows before. Basing the function on the cosine results in a maximum and a minimum in windows i = 1 and i = n − 1, respectively. The user defined ppw is therefore increased for windows at low mass (small index) and decreased for higher indices.
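The effect of the charge on the number of retained peaks can be explored with a small sketch. It reads Equation 3.2 as floor((z − 2) · cos((π + 1) · (i + 1)/(n − 1) − 1) + 0.5) + ppw, i.e. the cosine term is rounded to the nearest integer before the user defined ppw is added; this grouping of the terms is an assumption, as is the function name.

```python
import math


def peaks_per_window(i, n, z, ppw=6):
    """Number of peaks retained in window i (zero based) of n windows
    for a precursor of charge z, starting from the user defined ppw."""
    if n < 2:
        return ppw  # a single window leaves nothing to modulate
    angle = (math.pi + 1) * (i + 1) / (n - 1) - 1
    return math.floor((z - 2) * math.cos(angle) + 0.5) + ppw
```

For z = 2 the cosine term vanishes and every window keeps ppw peaks, which reproduces the behaviour of the original pre-processing. For higher charges, windows at low index retain up to z − 2 additional peaks and windows at high index up to z − 2 fewer.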

Intensity harmonization. Different charge states, analysis on different mass spectrometers as well as different conditions during measurement can affect the distribution of fragment ion intensities in tandem mass spectra. While peak picking reduces the complexity of a spectrum, intensities are not affected. Therefore an adjustment of intensities was introduced to enable comparison of spectra with differing intensity distributions due to charge or other factors. The fragment ion intensities of a tandem mass spectrum are divided into percentiles, hereafter called intensity clusters. This is achieved by ordering peaks according to decreasing intensity. Within each intensity cluster all intensities are adjusted so that

\[
\forall c \in C.\ \forall i \in [0, n_c] : I_{c_i} \Longrightarrow \max(I_c) \tag{3.3}
\]

where c is a single cluster in the set of intensity clusters C, i the index of a fragment peak within cluster c and I the intensity of the peak. Following intensity adjustment the order of peaks, however, remains based on the original intensities (see spectra on right hand side in Figure 3.2).
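A minimal sketch of intensity harmonization is shown below. It assumes that the intensity clusters are formed by cutting the intensity-ordered peak list into groups of equal size by rank, and that 0 clusters means unchanged intensities; the function name and the equal-size grouping are assumptions of this example.

```python
def harmonize_intensities(peaks, n_clusters=7):
    """Replace each intensity by the maximum intensity of its cluster.

    peaks: list of (mz, intensity). The result is ordered by descending
    original intensity, so the order of peaks is preserved even though
    intensities within a cluster become identical.
    """
    ordered = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    if n_clusters <= 0 or not ordered:
        return ordered  # no harmonization requested

    # Ceiling division; short peak lists may yield fewer clusters.
    size = -(-len(ordered) // n_clusters)
    harmonized = []
    for start in range(0, len(ordered), size):
        cluster = ordered[start:start + size]
        top = cluster[0][1]  # highest intensity within this cluster
        harmonized.extend((mz, top) for mz, _ in cluster)
    return harmonized
```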

3.2.1.2 Peak matching of differently charged spectra

The original algorithm of MS SMiV ensures the unique matching of a peak in the query spectrum to the most likely respective modified or unmodified peak in the library spectrum through the descending order of peaks based on intensity. The newly described pre-processing step of intensity harmonization, which balances distributions between different experiments and charge states, retains the ordering of intensity adjusted peaks based on their original intensity. The peak matching algorithm can therefore largely remain as described in Schlaffner (2014). However, to enable matching of peaks that result from different fragment charges, the algorithm was extended to match peaks of different charges between spectra. Each peak in the query spectrum is therefore first assumed to be singly charged and compared to peaks in the library spectrum. Likewise, each of the library spectrum peaks is assumed to be singly charged in the beginning; if a peak is not matched, the charge state is increased up to the precursor charge. Equally, query peaks that are not matched to a peak in the library spectrum iterate charges from single to precursor charge. At each stage the fragment mass is calculated from the peak m/z and the assumed charge. Peak pairs with a mass difference within the user defined error tolerance, or within the error tolerance around the precursor mass shift, are matched. Pseudo-code highlights the difference between the original (see Algorithm 1) and the charge adjusted algorithm (see Algorithm 2).
Algorithm 1 Pseudo code for the original peak matching algorithm of MS SMiV

MSSMIVMATCHING(Q, L, m_PQ, m_PL, T_f)
Where Q = (m, I) and L = (m, I) are peak lists with I_i ≥ I_{i+1}, m_PQ and m_PL the precursor masses, and T_f the error tolerance for matching fragment peaks
    ∆M ← m_PQ − m_PL
    for all peaks i ∈ Q do
        for all peaks j ∈ L do
            if |m_Q,i − m_L,j| ≤ T_f ∨ |m_Q,i − m_L,j − ∆M| ≤ T_f then
                match peak
                break
            end if
        end for
    end for
end

Algorithm 2 Pseudo code for the charge adjusted peak matching algorithm of MS SMiV

MSSMIVMATCHING(Q, L, m_PQ, m_PL, z_PQ, z_PL, T_f)
Where Q = (mz, I) and L = (mz, I) are peak lists with I_i ≥ I_{i+1}, m_PQ and m_PL the precursor masses, z_PQ and z_PL the precursor charges, and T_f the error tolerance for matching fragment peaks
    ∆M ← m_PQ − m_PL
    for all peaks i ∈ Q do
        for all peak charges z_i < z_PQ do
            m_Q,i ← calculateM(mz_Q,i, z_i)
            for all peaks j ∈ L do
                for all peak charges z_j < z_PL do
                    m_L,j ← calculateM(mz_L,j, z_j)
                    if |m_Q,i − m_L,j| ≤ T_f ∨ |m_Q,i − m_L,j − ∆M| ≤ T_f then
                        match peak
                        break
                    end if
                end for
            end for
        end for
    end for
end
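The charge adjusted matching can be sketched in runnable form as follows. Several details are assumptions of this example rather than statements about MS SMiV: fragment masses are computed as z · (m/z − proton mass), assumed fragment charges run from 1 up to but excluding the precursor charge (following the `z < z_P` bound in Algorithm 2, with charge 1 always tried), each library peak is matched at most once, and the tolerance is given in Da. Use `z + 1` as the upper bound if the precursor charge itself should be included.

```python
PROTON_MASS = 1.007276  # Da


def calculate_m(mz, z):
    """Neutral mass of a peak observed at mz, assuming charge z."""
    return z * (mz - PROTON_MASS)


def match_peaks(query, library, m_pq, m_pl, z_pq, z_pl, tol):
    """Match query peaks to library peaks, trying increasing fragment charges.

    query, library: lists of (mz, intensity) ordered by descending intensity.
    Returns a list of (query_index, library_index) pairs.
    """
    delta_m = m_pq - m_pl
    query_charges = range(1, max(z_pq, 2))    # 1 .. z_pq - 1, at least 1
    library_charges = range(1, max(z_pl, 2))  # 1 .. z_pl - 1, at least 1
    used = set()
    matches = []
    for qi, (q_mz, _) in enumerate(query):
        hit = None
        for zq in query_charges:
            m_q = calculate_m(q_mz, zq)
            for li, (l_mz, _) in enumerate(library):
                if li in used:
                    continue
                for zl in library_charges:
                    diff = m_q - calculate_m(l_mz, zl)
                    # Unmodified fragment, or fragment carrying the mass shift.
                    if abs(diff) <= tol or abs(diff - delta_m) <= tol:
                        hit = li
                        break
                if hit is not None:
                    break
            if hit is not None:
                break
        if hit is not None:
            used.add(hit)
            matches.append((qi, hit))
    return matches
```

Because both peak lists are ordered by descending intensity, the first acceptable library peak is also the most intense one, which mirrors the role of the intensity ordering described above.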

3.2.2 Optimization

Number of intensity clusters. To assess the impact of the pre-processing step of intensity harmonization and to adjust the number of clusters to be used by default, MS SMiV was applied to a published dataset of the cell line HEK293 (Chick et al., 2015). The raw mass spectrometry data as well as the identification results were downloaded from PRIDE Archive (PXD001468). The peak lists in Mascot Generic Format (MGF) were extracted from raw files using Proteome Discoverer (Thermo Scientific) with default settings. The identification results for the standard search using a narrow mass tolerance were assumed to be correct and MGF files were split into separate files for identified and unidentified spectra. All identified spectra were combined into a single MGF file. MS SMiV was run for identified spectra against themselves with the parameters shown in Table 3.1. The number of intensity clusters, however, was changed for each run so that one to ten intensity clusters were tested. The results of MS SMiV were then annotated with the identified peptide sequence for each spectrum, and the false discovery rate was calculated for each run with an altered number of intensity clusters. MS SMiV score thresholds were determined for 1%, 5%, and 10% FDR, and accuracy, sensitivity and specificity were calculated for each number of intensity clusters.
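Determining a score threshold for a given FDR can be sketched as below. The sketch assumes that each scored spectrum pair carries a flag stating whether the annotated sequences agree, and that the FDR at a threshold is the fraction of disagreeing pairs among all pairs scoring at or above it; the function name and this definition are assumptions, and tied scores are not treated specially.

```python
def score_threshold_for_fdr(scored_pairs, max_fdr=0.01):
    """Lowest score at which the accepted pairs still meet max_fdr.

    scored_pairs: iterable of (score, is_correct).
    Returns None if no threshold satisfies the requested FDR.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    threshold = None
    false_hits = 0
    for rank, (score, is_correct) in enumerate(ranked, start=1):
        if not is_correct:
            false_hits += 1
        # FDR among the top `rank` pairs.
        if false_hits / rank <= max_fdr:
            threshold = score
    return threshold
```

Running this once per number of intensity clusters and per FDR level (0.01, 0.05, 0.10) would yield a table of thresholds of the kind described above.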

Parameter name     Value
-MSMSTolerance     0.02
-PTMTolerance      5
-mods              zero.txt [file containing only 0.0 as mass shift]
-nic               number of intensity clusters from range [0;10]

Table 3.1 Parameter settings for MS SMiV. Narrow mass tolerances of 5 ppm and 0.02 Da are used to assess the impact of intensity harmonization. The number of intensity clusters (nic) is changed for each run, ranging from 0 to 10 clusters. The parameter setting with 0 intensity clusters indicates use of unchanged fragment ion intensities.

High performance implementation. The complexity of comparing numerous spectra against library candidates increases with the allowed range or set of precursor mass tolerances. To allow the whole range of UniMod annotated post-translational modifications to be taken into account, the C++ implementation of MS SMiV was adapted for high performance computing. Multi-threading and multi-core processing through the message passing interface (MPI) were selected and implemented. Multi-core support was implemented so that one core acts as the master node, reading in the parameters and spectra, distributing query and library spectra, collecting results and printing these to the output file. All other cores specified by the user to partake in the execution of MS SMiV execute the algorithm on a subset of query spectra. While this requires more computational resources, the execution time can be reduced significantly.

The parallelization of MS SMiV was tested on the aforementioned dataset. The set of identified spectra was chosen as both query and library file. A mass shift of 0 Da, provided in a tab separated file, was allowed with an error tolerance of up to 5 ppm. Fragment tolerance was set to 0.02 Da and 7 intensity clusters were used for intensity adjustment. Serial and parallel execution were tested on the compute farm at the Wellcome Trust Sanger Institute with CPU 2x 2.1 GHz 16 core AMD 6378 running Linux (64bit). Only a single core using 16 GB random access memory (RAM) was used for serial processing, while multi-threading and MPI execution were performed using 7 cores and a total of 16 GB RAM on a single machine.
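The division of labour between a master and its workers can be illustrated with a short sketch. It uses a thread pool from the Python standard library as a stand-in for MPI, so it shows the structure only: in CPython, threads will not speed up CPU-bound scoring. The function names are chosen for this example, and `score_fn` is a placeholder for any pairwise scoring function.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, n_chunks):
    """Split items into at most n_chunks contiguous, nearly equal parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def score_chunk(queries, library, score_fn):
    """Worker: score a subset of query spectra against the whole library."""
    return [(q, l, score_fn(q, l)) for q in queries for l in library]


def parallel_score(queries, library, score_fn, n_workers=6):
    """Master: distribute query subsets, then collect results in order."""
    results = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(score_chunk, part, library, score_fn)
                   for part in chunk(queries, n_workers)]
        for future in futures:
            results.extend(future.result())
    return results
```

With 7 cores as in the test described above, one core would take the master role and `n_workers=6` would correspond to the remaining cores.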

3.2.3 Benchmarking

For benchmarking, the aforementioned dataset of the HEK293 cell line was chosen together with the associated standard search parameters (closed search) and a mass tolerant SEQUEST search (open search) (Chick et al., 2015). The benchmarking workflow is shown in Figure 3.3 and indicates the pre-processing of raw spectra and identification results for closed and open database searches. Firstly, raw files were converted into peak lists in the Mascot Generic Format (MGF) using Proteome Discoverer (Thermo Scientific) with default parameters. The resulting MGF files were then split into separate files for identified (CI) and unidentified (CU) spectra using the closed search results. Furthermore, unidentified spectra were divided into identified (CUOI) and unidentified (CUOU) sets according to the open search results. The CUOI set of spectra was compared as query to the set of CI spectra as library using MS SMiV. The list of 523 delta mass bins identified by Chick et al. (2015) was used as a white list of mass shifts for spectrum pairs using the parameter -mods. Precursor mass tolerance was set to 5 ppm, as specified in Chick et al. (2015) for the closed database search, while the fragment mass tolerance was set to 0.02 Da. For spectrum pre-processing the window size was set to 100 Da, the seed number of top peaks retained was set to 6, and peak intensities were adjusted in 7 clusters. The filter option was applied to retain only spectrum pairs scoring higher than the threshold established for 1% FDR in the optimization described above.
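The splitting of spectra into the CI, CU, CUOI and CUOU sets amounts to a few set operations. The sketch below assumes that spectra are referred to by identifiers and that the closed and open search results are available as collections of identified spectrum identifiers; the function name `partition_spectra` is hypothetical.

```python
def partition_spectra(all_ids, closed_ids, open_ids):
    """Split spectrum identifiers into the CI, CU, CUOI and CUOU sets."""
    all_ids = set(all_ids)
    ci = all_ids & set(closed_ids)   # identified in the closed search
    cu = all_ids - ci                # not identified in the closed search
    cuoi = cu & set(open_ids)        # of those, identified in the open search
    cuou = cu - cuoi                 # identified by neither search
    return {"CI": ci, "CU": cu, "CUOI": cuoi, "CUOU": cuou}
```

By construction CI, CUOI and CUOU are disjoint and together cover all spectra.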

MS SMiV results were annotated with identified peptide sequences from the closed search for library spectra. Additionally, results were annotated with identified peptide sequences from the mass tolerant search for query spectra. Results were filtered to only contain entries where query and library spectra were both identified. Remaining entries were grouped by equal (Hamming distance ≤1) and unequal (Hamming distance >1) sequences between query and library spectra and compared to the estimated FDR.
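The grouping by Hamming distance can be sketched as below. Hamming distance is only defined for sequences of equal length; treating pairs of different length as unequal is an assumption of this sketch, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions, or None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def classify_pair(query_seq, library_seq, max_mismatch=1):
    """Label a spectrum pair 'equal' (distance <= max_mismatch) or 'unequal'."""
    distance = hamming(query_seq, library_seq)
    if distance is not None and distance <= max_mismatch:
        return "equal"
    return "unequal"
```

Allowing one mismatch keeps pairs that differ by a single amino acid substitution in the "equal" group.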

Because SMiV scoring uses absolute mass shifts between precursor masses, the delta mass bins with negative mean masses listed in Chick et al. (2015) were transformed into positive bins. Bins were then merged if their mean delta mass values were less than 0.02 Da apart. New mean values per bin were calculated as weighted means using the absolute frequency of peptides as weights. Absolute frequencies of delta masses were calculated for each bin using mass windows of 1 Da centred around the specified mean value of each newly merged bin.
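The bin transformation described above can be sketched as follows. The greedy merge against the running weighted mean of the previous bin is an assumption of this sketch (the text only states the 0.02 Da criterion), and `merge_delta_mass_bins` is a hypothetical name.

```python
def merge_delta_mass_bins(bins, max_gap=0.02):
    """Convert signed bins to absolute delta masses and merge close neighbours.

    bins: iterable of (mean_delta_mass, peptide_count) with positive counts.
    Returns a list of (weighted_mean_delta_mass, total_count) sorted by mass.
    """
    merged = []  # each entry: [weighted_mean, total_count]
    for mass, count in sorted((abs(m), c) for m, c in bins):
        if merged and mass - merged[-1][0] < max_gap:
            mean, total = merged[-1]
            new_total = total + count
            # Weighted mean, using peptide counts as weights.
            merged[-1] = [(mean * total + mass * count) / new_total, new_total]
        else:
            merged.append([mass, count])
    return [(mean, total) for mean, total in merged]
```

For example, bins at -15.995 Da (100 peptides) and 15.994 Da (300 peptides) collapse into one bin with a weighted mean of 15.99425 Da and 400 peptides.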

3.3 Results and Discussion

3.3.1 Assessment and optimization

Before benchmarking MS SMiV against the open mass tolerant database search (precursor tolerance ±500 Da) on a high resolution mass spectrometry dataset of the human HEK293 cell line (Chick et al., 2015), the impact of the algorithmic adaptations on the quality of MS SMiV scoring was assessed. For this purpose, only identified spectra from the closed database search (precursor tolerance ±5 ppm) were used to enable comparison of peptide sequences for calculation of statistical metrics. This was undertaken under the assumption that identifications from the sequence database search followed by filtering to an FDR level of 1% are indeed correct peptide-spectrum matches. Of the initial 1.2 million spectra 35.5% (398,558) remained as identified through a database search with standard narrow parameters. These are hereafter referred to as closed-identified (CI) spectra, while the remaining unidentified spectra are termed closed-unidentified (CU). Application of MS SMiV comparing pre-processing with and without peak picking depending on spectrum charge state resulted in 11.3 million spectrum pairs. Comparison of false discovery rates across scores between the initial and the charge dependent versions of the peak picking step revealed a slight but significant increase (mean difference ∆FDR=0.00287, Student’s t-test p≪0.05) of the false discovery rate when peak selection includes spectrum charge states (see Figure 3.4). Na et al. (2008b) highlight that in spectra of charge 3+ a higher fraction of total intensity is attributed to fragment ions with charge 2+ as compared to spectra of charge 2+, where fragment peaks of z=1+ dominate. This supports the general idea of charge dependent peak picking for CID spectra. Shao et al. (2014) assessed the proportion of fragment ions depending on charge state within 10 bins along the relative mass range of CID and HCD spectra. Their findings, however, highlight that multiply charged ions appear more frequently with increasing relative mass, both in CID and HCD spectra. While charge dependent peak picking increases the weight of low mass fragments in the MS SMiV scoring, the results by Shao et al. (2014) show that it overestimates the impact of higher charge states on the relative mass distribution. However, the impact of charge dependent peak picking on the false discovery rate (mean ∆FDR=0.049%) for MS SMiV spectrum pairs with scores above the 1% FDR threshold is low. This indicates that the effect of charge dependent peak picking is negligible for spectrum pairs of interest.

Fig. 3.3 Workflow for benchmarking of MS SMiV. A) Raw spectrum files are searched against a human protein database using a narrow (±5 ppm) and a wide (±500 Da) peptide mass tolerance. The raw files are then converted into Mascot Generic Format (MGF) before extraction of the subset of spectra identified (CI) during the closed SEQUEST search. Remaining spectra (CU) are further divided into identified (CUOI) and unidentified (CUOU) sets using identification results from the open SEQUEST search. B) FDR is estimated using the CI subset as input for MS SMiV. Pairing of spectra with the same mass and subsequent use of peptide identifications enables determination of true and false matches. This enables the estimation of an FDR. C) Spectra identified uniquely in the open search (CUOI) are paired with spectra identified in the closed search (CI) allowing a total of 532 mass shifts. After filtering to 1% FDR level the distribution of ∆m is compared to the distribution of ∆m of the open database search to establish the overlap between approaches. D) Previously unidentified spectra (CUOU) are paired with spectra identified by the closed database search (CI) to determine the recovery of spectrum matches unique to MS SMiV.

Fig. 3.4 Performance comparison of the spectrum pre-processing step of peak picking. While initial (blue) and charge dependent (red) peak picking perform similarly with regards to FDR, the latter results in slightly higher FDR values at any given score.
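The FDR estimation from same-mass CI pairs used in this assessment can be sketched as below: a pair counts as a false match when the peptide sequences annotated to its two spectra differ. The function names and the use of an inclusive threshold are assumptions of this sketch.

```python
def fdr_at_threshold(pairs, threshold):
    """FDR among pairs scoring at or above threshold.

    pairs: list of (score, same_sequence) tuples, same_sequence being a bool.
    """
    accepted = [same for score, same in pairs if score >= threshold]
    if not accepted:
        return 0.0
    return sum(1 for same in accepted if not same) / len(accepted)


def lowest_threshold_for_fdr(pairs, max_fdr=0.01):
    """Smallest observed score whose accepted set keeps FDR <= max_fdr, else None."""
    for threshold in sorted({score for score, _ in pairs}):
        if fdr_at_threshold(pairs, threshold) <= max_fdr:
            return threshold
    return None
```

Because the estimate relies on sequence annotations, it is only as reliable as the underlying database identifications.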

More significant is the impact of intensity adjustment on the performance of MS SMiV. The application with the parameters outlined in Table 3.1, including ramping through the range from one to ten intensity clusters in addition to original fragment intensities (zero clusters), resulted in overall high accuracies (mean=0.944, median=0.956, σ²=0.000875) and specificities (mean=0.974, median=0.977, σ²=0.000370) at score thresholds relating to approximate FDR levels of 1%, 5%, and 10% (Table 3.2). The overall high accuracy and specificity are consistent with regular spectrum library search algorithms (Deutsch, 2011), showing little effect of intensity harmonization on the performance of MS SMiV. On the other hand, sensitivity shows a higher variability (mean=0.879, median=0.930, σ²=0.014507), highlighting that, as expected, the intensity based MS SMiV score is affected by changes to intensities in the spectra. Accuracy and sensitivity anti-correlate with the MS SMiV score (mean Pearson’s r=-0.92 and r=-0.91, respectively), which is affected by the number of intensity clusters.
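For reference, the metrics reported here follow the standard confusion-matrix definitions, sketched below. How pairs are assigned to the four classes is an assumption of this sketch: a pair is taken as positive when it passes the score threshold and as true when both spectra carry the same peptide sequence.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # recall of correct pairs
        "specificity": tn / (tn + fp),   # rejection of incorrect pairs
        "fdr": fp / (tp + fp),           # false discoveries among accepted pairs
    }
```

Sensitivity depends only on the correct pairs, which is why it reacts most strongly when intensity changes push correct pairs above or below the score threshold.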

Using no intensity clusters (zero), that is, the original fragment intensities, for scoring shows the lowest overall sensitivity and accuracy, while resulting in the highest SMiV score thresholds across all three approximated FDR values. Accordingly, lowest SMiV scores are associated with highest accuracy and sensitivity. While the overall accuracies range between 84.9% and 97.9%, indicating only minor improvement, spectrum pairs filtered at 1% FDR level demonstrate a low mean sensitivity of 0.735 compared to 0.933 and 0.969 at 5% and 10% FDR levels, respectively. However, the sensitivity of MS SMiV results filtered at 1% FDR exhibits the most significant increase when intensities are adjusted within clusters. In detail, the MS SMiV run using raw fragment intensities identifies only 50% of spectrum pairs as resulting from the same peptide sequence, which can be attributed to the high variability of intensities between spectra of the same peptide sequence. Decreasing the number of intensity clusters, however, increases the recall to almost 90%. This indicates that the variability of intensities between differently charged spectra of the same peptide sequence is by far greater than anticipated and that only the reduction of the variance can assist successful matching of these pairs.

Table 3.2 Statistical metrics of MS SMiV results following intensity adjustment within different numbers of clusters during spectrum pre-processing. Each of the metrics is divided into approximated FDR thresholds of 1%, 5%, and 10%. Colour gradient within each column indicates the change within each metric for changing numbers of intensity clusters resulting in best (blue) to worst (red) performance.

Therefore, the impact of intensity harmonization was further investigated. To assess the contribution of charge states of spectrum pairs to the drastic change of sensitivity at 1% FDR, MS SMiV results for each run with a differing number of clusters were filtered using the score thresholds for 1% FDR indicated in Table 3.2. Results were classed by the charge states of both spectra of a pair, resulting in groups of equally charged (EC) and differently charged (DC) spectra. For 59.7% of the total 11.3 million pairs the spectra share the same precursor charge. While between 27.6% and 35.3% of EC pairs are recovered at <1% FDR level for raw intensities and a single cluster, respectively, only 2.0% of DC spectrum pairs are amongst the identified set for raw intensities. Using intensity harmonization, higher rates of DC spectra can be successfully matched. As an example, the MS SMiV application using a single intensity cluster recovered 19.3% of DC spectrum pairs.

The observed anti-correlation between accuracy and specificity for decreasing numbers of intensity clusters is similar between the different charge states (see Table 3.3). Although this is consistent with the overall observed anti-correlation for the whole set filtered at <1% FDR, DC pairs exhibit an exceptionally low FDR for use of 7 intensity clusters. Furthermore, DC pairs using original peak intensities recall only 7% of correct sequences. A steep increase from 22% to 75% using 10 to 1 intensity clusters, respectively, highlights the positive effect of this pre-processing step on the overall outcome of MS SMiV scoring. With a consistently high specificity (>0.99) and overall high accuracy (>85%), selection of an optimal number of intensity clusters relies on false discovery rate and sensitivity.

To reduce the number of false spectrum pairs with different charges the FDR is the main criterion for driving the choice of a default number for the intensity clusters. Due to the extremely low FDR of 0.0008 for DC pairs and an overall low FDR of 0.0096, 7 intensity clusters appear as an appropriate default value for future MS SMiV applications. Other library search tools utilizing a dot-product based score transform intensities by using their square root or rank within the spectrum (reviewed in Griss, 2016). While these adjustment methods reduce variability specifically for high intensity peaks, results show that using few intensity clusters and adjusting each peak intensity to the maximum within each cluster is a feasible way of ensuring that high intensity peaks continue to represent the spectrum to a higher degree than low intensity fragments.

Table 3.3 Statistical metrics of MS SMiV results following intensity adjustment within different numbers of clusters during spectrum pre-processing. Each of the metrics is divided by groups of spectrum pairs indicating all spectrum pairs, spectrum pairs with equal charge states (EC), and pairs with different charge states (DC). Colour gradient within each column indicates the change within each metric for changing numbers of intensity clusters resulting in best (blue) to worst (red) performance.
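The intensity adjustment discussed above, raising each peak to the maximum intensity of its cluster, can be sketched as follows. How MS SMiV forms the clusters is not specified in this section; the sketch assumes equal-width intensity ranges between the lowest and highest peak purely for illustration, and `harmonize_intensities` is a hypothetical name.

```python
def harmonize_intensities(intensities, n_clusters):
    """Set each peak intensity to the maximum intensity within its cluster.

    Assumption: clusters are equal-width intensity ranges between the minimum
    and maximum peak intensity. n_clusters == 0 leaves intensities unchanged.
    """
    if n_clusters <= 0 or not intensities:
        return list(intensities)
    lo, hi = min(intensities), max(intensities)
    if hi == lo:
        return list(intensities)
    width = (hi - lo) / n_clusters
    # Cluster label per peak; the top peak is clamped into the last cluster.
    labels = [min(int((x - lo) / width), n_clusters - 1) for x in intensities]
    cluster_max = {}
    for label, x in zip(labels, intensities):
        cluster_max[label] = max(cluster_max.get(label, x), x)
    return [cluster_max[label] for label in labels]
```

With a single cluster every retained peak receives the same intensity, so the score is driven by which peaks match rather than by their relative heights.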

3.3.2 Runtime analysis

The high performance adaptation of the SMiV scoring algorithm through multi-threading and message passing interface (MPI) resulted in a comprehensive implementation of MS SMiV for application on different computing systems. Thus MS SMiV application on single machines providing multiple computational cores (CPUs) and even computer clusters was enabled through a single additional parameter ‘par’.

In total 398,558 tandem mass spectra were compared against each other, resulting in 11.3 million scored spectrum pairs with a precursor mass difference of 0 Da ±5 ppm. Serial execution required a maximum of 7.0 GB RAM and completed after 9h 9min. While this execution time is in an acceptable range for spectrum identification, runtime will increase drastically when allowing additional mass shifts. The application of MS SMiV on 7 cores using the multi-threading approach resulted in a 38.8% increase of required memory while reducing runtime by 83.3% to 1h 32min. This demonstrates the efficient multi-threaded implementation of the scoring.
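The runtime figures above translate into speed-up and parallel efficiency as in the following worked sketch. It uses only the run times quoted in the text, rounded to whole minutes, so the computed reduction (about 83%) can differ from the reported value in the last digit; the function names are hypothetical.

```python
def to_minutes(hours, minutes):
    """Convert a run time given as hours and minutes into minutes."""
    return hours * 60 + minutes


def runtime_summary(serial_min, parallel_min, n_cores):
    """Return relative runtime reduction, speed-up and parallel efficiency."""
    speedup = serial_min / parallel_min
    return {
        "reduction": 1 - parallel_min / serial_min,
        "speedup": speedup,
        "efficiency": speedup / n_cores,
    }


# Serial: 9h 9min; multi-threaded on 7 cores: 1h 32min.
summary = runtime_summary(to_minutes(9, 9), to_minutes(1, 32), n_cores=7)
```

This corresponds to a speed-up of roughly six on seven cores, that is, a parallel efficiency of about 85%.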

The execution of MS SMiV using MPI across 7 cores, however, was terminated through the scheduling framework of the computer cluster at the Wellcome Trust Sanger Institute after exceeding the runtime limit of 168h. Reduction to two executing cores then resulted in a maximum memory usage of 2.3 GB and successful termination after 4h 16min. Closer investigation to elucidate the reason for the drastic runtime increase of MPI execution revealed combined execution times for spectral reading, pre-processing and scoring of 4h 13min, 1h 30min, and 2h 6min for 2 and 7 cores on a single machine and 8 cores across 4 different machines, respectively. Serial and multi-threading executed the same operations in 9h 5min and 1h 30min, respectively. This demonstrates that the MPI implementation of MS SMiV is successful in parallelizing the scoring section of the algorithm and results in a comparable runtime improvement to multi-threading. However, the consolidation of scoring results through the message passing interface introduces a bottleneck leading to the increased runtime. Alternative implementations of the result consolidation process, including transfer through intermediate CPUs to the master CPU and direct transfer to the master CPU, did not change the overall outcome.

Implementation for computational clusters spanning multiple computers using message passing requires copying of results to consolidate them in a single master core, which solely provides functionality for reading input and writing output files. However, a major fraction of the large number of spectrum comparisons that are consolidated consists of spectrum pairs with low similarity. After filtering at <1% FDR only 23% of pairs remain, discarding 8.7 million spectrum comparisons overall. These have been copied between computational cores only for the purpose of estimating the FDR. This shows that for large-scale MS SMiV executions on computational clusters a different way of estimating false discovery rates is required.

3.3.3 Comparison with open modification search

For benchmarking, MS SMiV was compared to the orthogonal method of open mass tolerant database searching. This approach, using the sequence database search tool SEQUEST, provides the same level of customisation with regards to fragment mass tolerances. Additionally, it utilises a wide precursor mass tolerance to enable identification of post-translational modifications. The recently published search of 1.2 million high resolution tandem mass spectra with an open mass tolerance (±500 Da) in SEQUEST provides suitable data quality for benchmarking (Chick et al., 2015). While the 35.5% of CI spectra, extracted as described previously, remained unchanged, CU spectra were further divided based on the identification results of the open mass tolerant SEQUEST search (see Figure 3.5). In total 16.8% (188,706) of spectra were recovered as identified with mass shifts up to 500 Da and are referred to as the closed-unidentified-open-identified (CUOI) set.

MS SMiV comparison of the CUOI set as queries against the CI spectra, allowing mass shifts of all 532 bins identified in Chick et al. (2015) and using the previously established score cut-off of 0.45 representing a 1% FDR level, resulted in 774,049 spectrum pairs with a mean score of 0.62 (see Figure 3.5). The distribution of mass shifts as depicted in Figure 3.6 A shows distinct ∆Masses found disproportionately often in spectrum pairs, indicating enrichment of post-translational modifications with these masses in the dataset. After removal of identifications with no mass shift from the published open identification results, ∆Masses exhibit a similar distribution, indicating that MS SMiV confidently identifies the same mass shifts as the open SEQUEST search (see Figure 3.6 B). However, the mass shifts identified through the open search appear at a frequency of one order of magnitude less than in the SMiV results. The comparison of spectra between query and library sets as performed by MS SMiV can result in association of a single query spectrum with multiple library spectra and vice versa. Thus the investigation of the result composition with regards to query and library spectra revealed a total of 75,815 query spectra associated with 142,730 library spectra. While each query spectrum is paired on average with 10.2 library spectra, library spectra are paired with only 5.4 query spectra. This can explain the overall frequency disparity between sequence database search results and MS SMiV ∆Mass identification. Furthermore, the difference in ratios between query and library pairs indicates that for each modified spectrum 10 spectra of the unmodified sequence have previously been identified by database searching. However, not all peptides identified in a standard database search are found with modifications in the dataset.

Fig. 3.5 Visual representation of fragmentation of tandem mass spectra. A) Of the initial 1.2 million spectra 35.5% are identified by standard database searching (subset CI). An additional 16.8% are identified using a mass tolerant database search approach (subset CUOI). The remaining 47.7% of tandem mass spectra remain unidentified by either database search method (subset CUOU). B) Application of MS SMiV pairing CI spectra with CI spectra of the same mass results in 11.3 million spectrum pairs enabling FDR estimation. C) Spectrum pairing between CUOI and CI spectra allowing 532 different mass shifts results in 774,049 spectrum pairs after filtering. D) MS SMiV application pairing CUOU spectra with CI spectra allowing 523 different mass shifts results in 2.8 million spectrum pairs after filtering, leading to new identifications uniquely provided by MS SMiV.

Fig. 3.6 Distribution of ∆Mass for MS SMiV and the open mass database search. A) MS SMiV identified ∼1.1 million spectrum pairs filtered at 1% FDR. The use of 532 mass bins ranging from 0.02 to 500 Da prohibits identification of spectra from the same peptide sequence, thus showing no peak at ∆Mass 0 Da. B) The 532 original bins with masses ranging from -500 Da to 500 Da are converted into 454 bins of absolute ∆Mass, where bins with mean values less than 0.02 Da apart are merged. A total of 339,578 identifications with no mass shift (∆Mass=0) are removed from the distribution.

To compare MS SMiV and the open database search in more detail, the initial 532 distinct mass bins ranging from -500 to 500 Da were converted into bins of absolute mass shifts. Bins with mean mass differences <0.02 Da were merged, resulting in 454 ∆Mass bins. Following the analysis described by Chick et al. (2015) that resulted in the ∆Mass bins from the open search, Gaussian fit analysis of the MS SMiV output determined 345 distinct mass bins with at least 200 spectrum pairs in each bin. In total 98% of spectrum pairs are assigned to bins, consistent with 98% of peptides assigned to the open database search bins. Shared between the mass tolerant database search and MS SMiV are 321 bins with a mass difference between bin means of up to 0.02 Da, leaving 109 and 24 bins unique to the database search and SMiV, respectively (see Figure 3.7). In total, 52% of the 20 most frequently identified mass shifts by the mass tolerant database search and MS SMiV can be found amongst the 20 most frequent mass shifts of the other method (see Table 3.4). However, the remaining 48% show a high variability in frequency between open mass searching and MS SMiV. This demonstrates that mass tolerant database searching and blind spectrum matching utilise different aspects of spectral features, leading to the differing frequency of identified modification masses. Nevertheless, both methods provide similar results and can be used to confirm each other.

Fig. 3.7 Comparison of mass bins identified by open mass tolerant database searching and MS SMiV. Both approaches share the identification of 321 (70.7%) of mass shifts. Mass tolerant database searching identifies an additional 109 mass bins, while MS SMiV identifies 24 unique bins.
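The comparison of bin means between both methods can be sketched as a one-to-one matching of sorted values within 0.02 Da. The greedy two-pointer strategy and the name `match_bins` are assumptions of this sketch; the example values are taken from Table 3.4.

```python
def match_bins(open_bins, smiv_bins, max_diff=0.02):
    """Greedy one-to-one matching of bin means that lie within max_diff Da.

    Returns (shared pairs, bins unique to the open search, bins unique to SMiV).
    """
    a, b = sorted(open_bins), sorted(smiv_bins)
    i = j = 0
    shared, only_open, only_smiv = [], [], []
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= max_diff:
            shared.append((a[i], b[j]))
            i += 1
            j += 1
        elif a[i] < b[j]:
            only_open.append(a[i])
            i += 1
        else:
            only_smiv.append(b[j])
            j += 1
    only_open.extend(a[i:])
    only_smiv.extend(b[j:])
    return shared, only_open, only_smiv


shared, only_open, only_smiv = match_bins(
    [0.95676, 15.99435, 89.03053], [0.954675, 15.99431, 125.8985])
```

In this small example two bins are shared and one bin is unique to each method.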

                                                                                 Bin mass               Rank
Unimod entries                                                                   Open Search  MS SMiV   Open Search  MS SMiV
unknown                                                                          0.000244     0.002474  1            6
Glu↔Lys, Asn↔Xle                                                                 0.95676      0.954675  3            7
Dehydration, Lys oxidation                                                       1.026362     1.024491  4            8
Asp↔Xle                                                                          1.927599     1.927035  18           24
15N(2), Cys↔Thr, Thr↔Val, Met↔Glu                                                1.975496     1.978258  6            14
unknown                                                                          2.053404     2.053139  11           26
unknown                                                                          2.900815     2.901037  20           44
15N(3), 13C3 SILAC label                                                         3.021622     3.021331  19           29
Oxidation, deoxidation, thiocarboxylic acid, Cys↔Ser, Ser↔Ala, Tyr↔Phe, Met↔Asp  15.99435     15.99431  2            1
Asn↔Pro, Met↔Asn                                                                 16.99607     16.99741  14           18
Ammonium, ammonia loss, deuterated methyl ester, Gln→pyroGlu                     17.02547     17.02512  7            5
Fluoro, dehydrated (phosphorylation), Cys→Oxoalanine, Glu→pyro-Glu, Phe↔Glu      18.00613     18.0121   24           11
Cation:Mg[II], cation:Na, Met→Hpg                                                21.97849     21.97977  35           19
Formylation, Pro oxidation, Asp↔Ser, Glu↔Thr, Arg↔Lys                            27.99462     27.99523  10           4
Sulfide, dioxidation, Met↔Val, Cys↔Ala, Glu↔Pro, Tyr↔Met                         31.98991     31.98978  12           13
Carbamyl, Asn↔Ala, Arg↔Xle                                                       43.00598     43.00675  5            2
Cation:Fe[II]                                                                    53.91898     53.91717  9            12
Carbamidomethyl, Gly, Asn↔Gly, Gln↔Ala, Arg↔Val                                  57.02306     57.02271  22           16
Phosphorylation, sulfonylation                                                   79.96655     79.96584  16           36
Met loss + acetylation, Trp↔Pro                                                  89.03053     89.03079  15           186
Iodination                                                                       125.8997     125.8985  41           20
Lys, Lys loss, 4-trimethylammonium-butyryl                                       128.0958     128.0949  17           15
Aminoethylbenzenesulfonylation                                                   183.0367     183.0352  13           9
unknown                                                                          249.0537     249.9837  168          10
unknown                                                                          301.9864     301.9875  8            3
unknown                                                                          302.9771     302.9954  77           17

Table 3.4 Summary table of the 20 most frequently identified mass shifts for mass tolerant database searching and MS SMiV. Both methods identify the same mass shifts amongst the 20 most frequent ones in 52% of cases.

Annotation of library and query spectra with peptide sequences identified from the closed and open searches, respectively, revealed sequence concordance of 541,083 spectrum pairs (69.9%), while 30.1% exhibited different peptide sequences. This indicates that identification results from database searches are not suitable for FDR estimation, specifically for spectrum pairing with large mass differences. The cumulative distributions of spectrum pairs with same and different sequences indicate that high SMiV scores are not necessarily more sensitive than lower scores (see Figure 3.8 A). The slope of the cumulative distribution of the same sequence pairs is curved similarly to the distribution of different sequence pairs. Using only identification results from the open search to annotate sequences onto query and library spectra reveals a slight improvement (see Figure 3.8 B). The cumulative ratio of pairs with different sequences annotated exhibits a steeper increase at a smaller MS SMiV score and indicates that the estimation of false discoveries is dependent on the provided sequence annotation.

To assess the impact of different database searches on the estimation of false pairings, the FDR was calculated using (i) open and closed search results to annotate query and library spectra, respectively, and (ii) solely the open search results to annotate query and library spectra of MS SMiV pairs. The FDR distribution using open and closed search results reveals an increase in false positive matches for high similarity scores and highlights a lowest FDR of 5.8% at a score of 0.89 (see Figure 3.9 B). However, the MS SMiV similarity score incorporates peak intensities into the scoring, and unmatched high intensity peaks contribute to the penalty to a higher degree than low intensity peaks. Given this nature of the dot-product based scoring, the substantial ratio increase for highly scored spectrum pairs of supposedly dissimilar nature indicates errors in the underlying sequence identifications from narrow and open mass tolerant database searches. Use of peptide identifications solely from the open mass tolerant search revealed 93,654 (12.1%) pairs with no identification in the open search for library spectra, while the FDR for high scores dropped significantly as shown in Figure 3.9 C. This highlights that sequence based false discovery estimation is biased by varying search results. These are in turn dependent on the available search space derived from sequence databases. MS SMiV, however, provides the same pairing as it is blind to peptide sequence identifications and solely relies on pairing spectra within single experiments.

Fig. 3.8 Cumulative ratio of spectrum pairs with same and different sequences annotated. A) Cumulative ratios of MS SMiV query and library spectra annotated with sequences from open and closed database searches, respectively. Cumulative ratios of spectrum pairs with same (blue) and different (red) peptide sequences annotated exhibit steep increases at lower MS SMiV scores. The steep increase of the cumulative ratio for same sequence pairs under-performs, as the steepest increase is expected at high MS SMiV scores. B) Cumulative ratios of MS SMiV query and library spectra annotated with sequences from the open database search. Spectrum pairs annotated with same (blue) or different (red) peptide sequences follow a similar curve. However, the steepest increase for pairs with different sequences is shown at a lower MS SMiV score. The cumulative ratio curve of same sequence pairs appears shallower.

Fig. 3.9 FDR in relation to MS SMiV scores. FDR estimates based on use of sets of spectrum identifications from open and closed database searches for query and library spectra. A) FDR calculation based on comparison of spectra identified in the closed search against themselves allowing no mass shifts. B) MS SMiV spectrum comparison for the CUOI set against CI using all 532 mass bins. Identifications from open and closed searches are used for query and library spectra, respectively. FDR peaks at high scores, which represent high similarity between spectra. C) MS SMiV results of CUOI against CI sets. For FDR calculation only the open identification is used. In total 93,654 spectrum pairs are removed due to missing identification. FDR shows a fast decline at high scores.

3.3.4 Increased spectra identification rate

Benchmarking MS SMiV against an open mass database search revealed consistent identification of mass shifts between both methods. However, a total of 47.7% of spectra remain unidentified after open mass tolerant searching (see Figure 3.5). To assess whether MS SMiV can identify additional tandem mass spectra, the unidentified spectra were matched against spectra identified in the standard (closed) search. Using the MS SMiV score threshold established earlier, MS SMiV returned 2,781,621 spectrum pairs. Combining the results of closed and mass tolerant database searching with these novel MS SMiV assignments reveals a total assignment rate of 68.7%, leaving ∼31% of tandem mass spectra unassigned. This sequential joining of identification results highlights that ∼31% of all tandem mass spectra remain unidentified due to low abundance, sequence variants, and post-translational modifications.

Identification Method    Number of Identified Spectra    Assignment Rate    Cumulative Assignment Rate
closed search            398,558                         35.5%              35.5%
open search              188,706                         16.8%              52.3%
MS SMiV                  183,517                         16.4%              68.7%

Table 3.5 Summary of spectral assignments using different methods. Sequential use of standard database searching with closed mass tolerance, open mass database searching, and MS SMiV reveals an overall assignment rate of 68.7%.
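The cumulative assignment rates in Table 3.5 follow from summing the per-method rates in the order in which the methods are applied. A minimal sketch using only the rates from the table (the function name is hypothetical):

```python
from itertools import accumulate


def cumulative_rates(rates_percent):
    """Running sum of assignment rates in percent, rounded to one decimal."""
    return [round(total, 1) for total in accumulate(rates_percent)]


# Assignment rates (%) of closed search, open search and MS SMiV.
cumulative = cumulative_rates([35.5, 16.8, 16.4])
unassigned = round(100 - cumulative[-1], 1)
```

The remainder after all three methods is 31.3% of spectra.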

3.4 Conclusions

In this chapter I described updates to the blind spectrum matching tool MS SMiV. These include charge dependent peak picking, intensity adjustment, and peak matching, and enable comparison of spectra of different charge states. Results show that selection of top intensity peaks within bins of 100 Da outperformed the new charge dependent approach, which favours selection of top intensity peaks in bins with lower mass. This supports the findings of Shao et al. (2014) that masses of higher charge ions are not overrepresented in the low mass region.

While the charge dependent peak picking did not lead to an improvement in the quality of MS SMiV spectrum pairing, intensity harmonization showed a significant improvement. This led to an increase in sensitivity from 53% to 89% for pairs of spectra with differing charge states at an FDR level of less than 1%. Overall increased performance with regards to accuracy, FDR and sensitivity indicates that intensity adjustment ensures the correct matching of pairs and leads to higher MS SMiV scores. Furthermore, specificity of more than 99% for all tested ranges of intensity clusters showed that MS SMiV results at low FDR levels contain few incorrect spectrum pairs. The use of few intensity clusters resulted in increased recall of correct spectrum pairs, indicating the high variability of fragment peak intensities between different charge states.

Besides the algorithmic changes to MS SMiV, support for multi-threading and computational clusters was added. The use of a message passing interface (MPI) to allow execution on large clusters revealed a bottleneck of the algorithm. While pre-processing of spectra and scoring of pairs in the MPI implementation showed significant speed-up, the large number of resulting spectrum pairs and their collection in one master node rendered the achieved speed-up void. Different approaches to reduce the runtime of the result gathering process remained unsuccessful. However, only about a quarter of spectrum pairs pass the threshold of <1% FDR. This highlights that the majority of spectrum pairs are solely scored to enable estimation of a false discovery rate. Control of false discovery rates prior to the gathering of results can significantly reduce runtime on computational clusters and enable fast comparison of millions of spectra. As a first starting point, I therefore propose to filter results on a per query spectrum basis. Pairs of library spectra with a given query spectrum should be discarded if their score is within three times the standard deviation of the mean score. Additionally, only positive scores should be retained. While this filter appears to be stringent, it will allow for fast assessment of mass shifts present in a dataset. Runtime results also show a significant speed-up from the serial to the multi-threading implementation of MS SMiV. Overall runtime was reduced by 83% using multi-threading. This allows for fast spectrum comparison and identification of overrepresented mass shifts in a given mass spectrometry dataset.
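The proposed per-query filter can be sketched as follows. This is one literal reading of the proposal, not an implemented MS SMiV feature: the use of the population standard deviation, the handling of queries with fewer than two pairs, and the name `filter_query_pairs` are assumptions of this sketch.

```python
from statistics import mean, pstdev


def filter_query_pairs(scores, n_sd=3.0):
    """Filter the scores of all library spectra paired with one query spectrum.

    A score is discarded if it lies within n_sd standard deviations of the mean
    score of that query, or if it is not positive.
    """
    if len(scores) < 2:
        # Assumption: without a score distribution only the sign filter applies.
        return [s for s in scores if s > 0]
    centre, spread = mean(scores), pstdev(scores)
    return [s for s in scores if s > 0 and abs(s - centre) > n_sd * spread]
```

Because each worker could apply such a filter locally, only the surviving pairs would need to be sent to the master, which is the motivation given for the proposal.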

Benchmarking of MS SMiV against a mass tolerant database search demonstrated that the blind spectrum pairing approach can successfully identify mass shifts due to post-translational modifications present in a dataset. However, results also showed that a target-decoy based approach for FDR estimation using spectrum identifications derived from database searches is not suitable for mass tolerant spectrum matching. Therefore, novel ways to determine the rate of false discoveries are required. As described above, I propose filtering based on score distributions of library spectra associated with a given query spectrum.

Reproducibility of mass bins identified by the open mass database search and blind spectrum matching highlights that MS SMiV can be used to validate results of mass tolerant searching. Differences in the frequencies with which the mass tolerant database search and MS SMiV identify modification masses indicate that MS SMiV can be used as an orthogonal tool to prioritise further assessment of the most frequently present modifications in a dataset. The identification of an additional 16% of spectra that remain unidentified even by the mass tolerant database search demonstrates that MS SMiV outperforms mass tolerant database searching with regard to sensitivity. As highlighted by Chick et al. (2015), however, database searches with targeted selection of dynamic modifications are more sensitive than mass tolerant searches. The selection of a set of protein modifications to be used in targeted database searches requires specialist knowledge of the data acquisition. The 15 most abundantly detected modifications in the open search show that a fraction of these are unanticipated and would not have been taken into account in standard database searches. This stresses that data driven identification of modifications present in a dataset is essential for enhancing identification rates of database searches. Therefore, I propose that MS SMiV is used with a wide mass tolerance to identify the most abundant mass shifts in the dataset and inform the selection of modifications for targeted database searching. Annotating spectrum pairs with the identified peptide sequences will further increase the overall assignment rate.

3.5 Publication Note and Contribution

A manuscript for most of the work described in this chapter is in preparation. The parallelization for high performance execution of MS SMiV was implemented by Thomas Bernwinkler under my supervision. With the exception of this and unless stated otherwise, the design, implementation and analysis described herein are my own work, performed under the supervision of Andreas Bender and Jyoti Choudhary.

Chapter 4

Variant Identification in Colorectal Cancer Proteome

4.1 Introduction

Coordinated expression of genes involved in the same biological pathways and protein complexes is key to cellular functions. Co-expression of genes from high throughput mRNA profiles has been analysed using gene clustering, network and gene set enrichment analysis to infer co-functionality of their gene products. However, non-specific cis-regulation and transcriptional leakage can result in similar mRNA expression even though the underlying genes are not necessarily functionally linked. The RNA-sequencing data used for co-expression analysis of genes prominently captures changes driven by transcriptional regulation but misses post-transcriptional processes. This makes RNA-sequencing a poor proxy for inferring co-functionality of gene products. Recent advances in mass spectrometry technologies have enabled near complete identification and quantification of cellular proteomes. In line with the notion that proteins generally execute the functions encoded in the genome, recent studies have shown that proteomics outperforms transcriptomics in predicting co-functionality (Roumeliotis et al., 2016; Wang et al., 2016). Multi-omics approaches utilizing the interdependence of proteins with additional molecular information have led to improved understanding of the function of mitochondrial proteins (Stefely et al., 2016). While concordance of protein and mRNA expression has only been suggested in a few publications, many studies have reported significant discrepancies (reviewed in Haider and Pal, 2013; Maier et al., 2009). The contribution of technological issues, such as protein and transcript quantification, and of underlying biological effects to these discrepancies is not entirely clear. A likely contributing factor is post-translational modification affecting peptide and therefore protein quantitation. Unanticipated post-translational modifications and sequence variants commonly result in missing identifications and thus violate the assumption of proportionality of abundance within a peptide mixture of a protein.

Modern mass spectrometry based proteomics has been used to assess relationships between genomic features, gene expression patterns, and phenotypic traits in different experiments using large cohorts of well-studied cancer tissues (Mertins et al., 2016; Zhang et al., 2014, 2016). Identification of sequence variant peptides due to genomic aberrations and post-translational modifications can lead to missing values and variation of quantitation of peptides of the same protein. Such cases are therefore commonly excluded from protein quantification. Colorectal cancer cell lines with well-studied genomes and transcriptomes are widely used as models that approximate cancer behaviour; however, the impact of non-synonymous nucleotide variants and post-translational modifications on protein abundance remains largely unexplored.

Proteogenomics allows the identification of genomic variants through searching mass spectrometry data against customized databases. These custom databases are generated by incorporating genomic alterations, such as non-synonymous SNPs, insertions/deletions of single or multiple nucleotides and copy number variation into canonical protein sequences (Krug et al., 2014). However, the combinatorial inclusion of sequence variants and the increasing sample size of proteogenomics studies introduce large numbers of customized sequences increasing the search space for spectral identification. Numerous tools have been developed to incorporate sequence variation from databases such as dbSNP (Sherry et al., 1999) or COSMIC (Forbes et al., 2017) and RNA-sequencing data into the search database while trying to reduce its complexity in various ways (Bunger et al., 2007; Li et al., 2011; Park et al., 2014; Sheynkman et al., 2014). The still increased search space, however, makes stricter filtering or separate treatment of canonical and variant peptide spectrum matches a requirement to reduce the number of false identifications, which in turn results in lower sensitivity (Nesvizhskii, 2014). Additionally, searching customized databases follows the assumption that variant databases and variant identifications from RNA-sequencing data provide a comprehensive set of SNPs present in the samples. Consequently, non-synonymous mutations previously not reported or novel somatic variants are persistently overlooked.

In this chapter, MS SMiV leverages the diversity of genomic aberrations of 50 colorectal cancer cell lines analysed with isobaric labelling and tribrid mass spectrometry to identify post-translational modifications and amino acid variants. This is achieved by allowing mass shifts during matching of unidentified spectra against PSMs that have been identified in a search against a canonical protein sequence database. Peptide quantitation through TMT reporter intensities is used to identify correlation between peptides and associated MS SMiV spectra within single proteins and to discriminate mass shifts caused by single amino acid variants from post-translational modifications. The approach is compared to a database search using a concatenated database of canonical proteins and sequences with incorporated single amino acid variants, and to a manually curated set of identified variant peptides. Lastly, it is shown that MS SMiV in combination with peptide quantitation enables de novo identification of variants.

4.2 Materials and Methods

4.2.1 Data acquisition, identification, and quantification

Quantitative mass spectrometry analysis was performed on a panel of 50 colorectal cancer cell lines. Data acquisition, identification and quantification are described in detail by Roumeliotis et al. (2016). Briefly, proteins were extracted from cell pellets containing 3×10^6 cells through pulsed probe sonication and boiling. Cellular debris was removed by centrifugation. Aliquots of 100 µg of total protein were digested overnight using trypsin. TMT 10-plex reagent was added to each sample tube and samples were combined into batches as shown in Figure 4.1 A and Table 4.1. Pooled samples were then subjected to high-pH Reverse Phase (RP) fractionation. Fractions were collected in a time dependent manner every 30 sec.

LC-MS analysis was performed on the Dionex Ultimate 3000 UHPLC system coupled with the Orbitrap Fusion Tribrid Mass Spectrometer (Thermo Scientific). Precursors were selected with a mass resolution of 120k in top speed mode within 3 sec before isolation for CID fragmentation. Furthermore, MS3 quantification spectra were acquired with HCD fragmentation within 120-140 m/z at 60k resolution for the top 10 most abundant CID fragments isolated with Synchronous Precursor Selection (SPS), as indicated in Figure 4.1 B.

Identification of acquired mass spectra was performed in Proteome Discoverer 1.4 (Thermo Scientific) using the SequestHT search algorithm. Spectra were searched against a UniProt fasta file containing 20,165 reviewed human entries using a precursor mass tolerance of 20 ppm and a fragment ion mass tolerance of 0.5 Da. Only fully tryptic peptides with a minimum length of 7 residues were taken into account, allowing up to 2 missed cleavages. TMT6plex at the N-terminus and K, and Carbamidomethyl at C, were set as static modifications. Oxidation of M and Deamidation of N,Q were allowed as variable modifications with a maximum of two different dynamic modifications and two repetitions of modifications. Peptide confidence was estimated by Percolator using a decoy database search. Results were filtered at <1% peptide level FDR (see Figure 4.2 A).

Identification of single amino acid variant peptides was facilitated by constructing a protein fasta file that incorporates 77k missense SNPs (Iorio et al., 2016) by replacing canonical amino acids with the respective variant ones. Variant sequence identifiers were generated through combination of Ensembl transcript IDs, the identifier of the cell line in which the mutation occurs, and the substitution. Furthermore, the canonical protein sequences were included. A proteogenomic search using SequestHT in Proteome Discoverer 2.1 (Thermo Scientific) was performed using the parameter settings described above. Identification results were filtered at <1% Percolator FDR (see Figure 4.2 A). Additionally, quantification of TMT-10plex reporter ions (S/N of reporter peaks) at MS3 level was included.

TMT Reagent  Exp. 1  Exp. 2  Exp. 3
TMT-126  SW48  SW48
TMT-127N  SW48  HT-115  NCI-H716
TMT-127C  HCC2998  SNU-407  RKO
TMT-128N  SNU-C1  SNU-C2B  SNU-C5
TMT-128C  HT55  HCC-56  LS-123
TMT-129N  CL-40  SW1463  HCT-116
TMT-129C  LS-531  HT-29  RCM-1
TMT-130N  COLO-320-HER  LS-180  SW620
TMT-130C  MDST8  CaR-1  C2BBe1
TMT-131  MDST8  DIFI

TMT Reagent  Exp. 4  Exp. 5  Exp. 6
TMT-126  SW48  SW48  SW48
TMT-127N  COLO-205  LS-1034  SNU-175
TMT-127C  COLO-741  SNU-81  CL-11
TMT-128N  LS-411N  NCI-H630  SW948
TMT-128C  NCI-H508  SK-CO-1  KM12
TMT-129N  SW1116  HCT-15  SNU-1040
TMT-129C  SW837  T84  GEO
TMT-130N  COLO-678  LoVo  LIM1215
TMT-130C  SNU-61  NCI-H747  CW-2
TMT-131  CCK-81  SW48  GP5d

Table 4.1 Summary of 50 colorectal cancer cell lines. Isobaric labelled mass spectrometry was performed on TMT-labelled multi-plexed batches of 8 to 10 samples. Names of cell lines combined in batches are given.

Fig. 4.1 Graphical display of experimental MS workflow. A) Samples of cancer cell lines are labelled with TMT 10-plex reagents. Labelled samples are pooled and subjected to high pH reverse phase fractionation. Single fractions are then analysed using tribrid LC-MS. B) Graphical representation of tribrid mass spectrometry using synchronous precursor selection (SPS). Peptide precursors are selected during MS1 and subjected to CID fragmentation and mass analysis of fragments, resulting in MS2 spectra. The ten most abundant fragment precursors are then selected using SPS and fragmented using HCD to retrieve reporter fragments. Reporter fragments are then mass analysed, resulting in MS3 spectra.

Manual curation of variant identification was achieved by filtering identification results from Proteome Discoverer 2.1 (Thermo Scientific) based on TMT reporter S/N. Search results provide for each PSM the peptide sequence as well as the associated protein identifiers. The protein identifiers for variant proteins contain Ensembl transcript identifiers along with information about the incorporated mutations and the cell line in which these mutations occur. Therefore, variant peptides are associated in the search results with the cell line in which the substitution is caused by a nsSNP. Additionally, variant peptides exhibit a mass shift, leading to different MS2 spectra for variant peptides compared to canonical sequences. This also leads to differences in the MS3 quantification. High reporter ion intensity in the cell line exhibiting the variant peptide, combined with low or missing intensities for the remaining samples, is indicative of non-synonymous SNPs. Therefore, it is expected that a variant peptide associated with a cell line through the protein identifier demonstrates high intensity of the TMT reporter in this cell line (see Figure 4.2 B). To confidently identify a variant peptide, the maximum reporter intensity per MS3 spectrum was required to occur in the cell line where the variant sequence was expected.
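The curation criterion can be illustrated with a short sketch. The function name `is_curated_variant` and the dictionary layout (cell line name mapped to reporter S/N, with `None` for missing values) are assumptions made for illustration only and do not reflect the Proteome Discoverer export format.

```python
def is_curated_variant(reporter_sn, expected_cell_line):
    """Return True if the expected cell line carries the maximum reporter S/N.

    reporter_sn: dict mapping cell line name -> S/N of one MS3 spectrum,
    with None for missing reporter values (hypothetical layout).
    """
    observed = {cell: sn for cell, sn in reporter_sn.items() if sn is not None}
    if expected_cell_line not in observed:
        return False
    expected_sn = observed[expected_cell_line]
    # The variant is accepted only if no other channel exceeds the expected one.
    return expected_sn > 0 and expected_sn == max(observed.values())


spectrum = {"HCT-15": 120.0, "LS-1034": 3.0, "SW48": None}
print(is_curated_variant(spectrum, "HCT-15"))
```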

4.2.2 Unbiased mass tolerant spectrum pairing

Raw mass spectrometry data were converted into MGF format in Proteome Discoverer version 1.4 (Thermo Scientific) using default parameters. Furthermore, each resulting MGF file was split into two files containing identified and unidentified spectra, respectively, based on identification results from the canonical database search (see Figure 4.3). Within each batch (see Table 4.1) all files containing identified spectra were combined into a single MGF file. The same was done for files containing unidentified spectra. Identified spectra were searched against themselves within each batch using MS SMiV (described in detail in Chapter 3) allowing only same mass pairs (see Figure 4.3 A). PTM tolerance was set to 5 ppm and 0.02 Da was given as MS/MS tolerance. Default values were used for the remaining parameters.

Fig. 4.2 Graphical representation of the identification workflow. A) For identification of canonical peptides, MS2 spectra are searched against a database of canonical protein sequences (left). MS3 spectra are added to provide quantitation. Spectral identifications are filtered at <1% FDR. Variant peptides are identified by searching against a customised protein sequence database (right). This database contains canonical proteins and sequences with imputed non-synonymous SNPs. MS2 and MS3 spectra are used for identification and quantification, respectively. An overall <1% FDR filter is applied to all identifications from the custom database search before variant peptides, which are solely derived from the nsSNP sequence database, are extracted. B) Example of reporter S/N for 3 theoretical variant peptides predicted from nsSNPs in cell lines 2, 3 and 9, respectively. Highest S/N in the top spectrum is consistent with the predicted cell line, validating the variant identification. The middle spectrum shows highest S/N in a cell line different from the genome predicted cell line, indicating a novel amino acid variant. The bottom spectrum exhibits a range of S/N ratios across cell lines, indicating false identification of a PTM as a variant.

Fig. 4.3 Graphical representation of workflow for data preparation and application of MS SMiV. A) Raw mass spectrometry files are converted into Mascot Generic Format (MGF) before extracting only spectra that were identified in the canonical database search. Identified spectra are then paired with themselves using MS SMiV allowing no mass shifts. MS SMiV results are annotated with peptide sequences to enable FDR estimation. B) Raw mass spectra are converted into MGF format before being split into identified (ID) and unidentified (noID) spectra based on results of the canonical database search. Unidentified spectra as query are then paired with identified spectra as library using MS SMiV, allowing mass shifts annotated in Unimod. Results are annotated with search results from canonical and custom database searches for library and query spectra, respectively.

MS SMiV results were annotated with the respective peptide sequence for each spectrum, using canonical database search results for library spectra and results from the variant database search for query spectra. Sequences of each spectrum pair were compared. The score distributions for spectrum pairs with equal and different peptide sequences were used to calculate the false discovery rate (FDR). Score thresholds for each experiment were established at 1% FDR level.
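One way to derive such a score threshold is sketched below. The exact FDR definition is an assumption: here the FDR at a given score is taken as the fraction of spectrum pairs with differing sequences among all pairs scoring at least that high, and the lowest score still meeting the FDR limit is returned. The function name `score_threshold_at_fdr` is hypothetical.

```python
def score_threshold_at_fdr(scored_pairs, max_fdr=0.01):
    """Return the lowest score at which the cumulative FDR is <= max_fdr.

    scored_pairs: iterable of (score, same_sequence) tuples, where
    same_sequence is True if both spectra carry the same peptide sequence.
    Returns None if no score satisfies the FDR limit.
    """
    ordered = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    threshold = None
    n_total = n_false = 0
    i = 0
    while i < len(ordered):
        score = ordered[i][0]
        # Pairs with identical scores are accepted or rejected together.
        while i < len(ordered) and ordered[i][0] == score:
            n_total += 1
            if not ordered[i][1]:
                n_false += 1
            i += 1
        if n_false / n_total <= max_fdr:
            threshold = score
    return threshold
```

Spectrum pairs scoring at or above the returned value would be retained for the respective experiment.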

For identification of post-translational modifications, unidentified spectra (query) were compared to identified spectra (library) using the MS SMiV parameters described above, except that mass shifts were allowed (see Figure 4.3 B). All masses annotated as post-translational modifications in Unimod, including neutral-loss masses, were saved in a tab separated text file and provided as allowed mass shifts between spectrum pairs. Additionally, the output filter option to return only spectrum pairs with scores higher than a user defined threshold was used with the score cut-offs established above for the 1% FDR level.

4.2.3 Gaussian fit analysis

To assess in detail which post-translational modification masses were identified, the spectrum pairs were divided into mass shift bins. Bins of 1 Da width were created for integer masses from 0 Da to 500 Da, each extending 0.5 Da to either side. Gaussian mixture models were then fitted for each bin using the ‘mclust’ library in R as follows. In a first round, the optimal number of components n was estimated using the Bayesian information criterion (BIC), followed by fitting n Gaussian distributions through mixture modelling. For bins with more than one component (n>1), any mixture component with a variance greater than the variance of all data points in the bin was removed. Additionally, Gaussian components whose means differed by less than 0.01 Da were combined, resulting in n-x Gaussian distributions overall. In a second round, BIC was calculated for up to n-x components only and the optimal number n’ was used for fitting the final Gaussian mixture models. Each resulting component was then annotated with known mass shifts from Unimod, allowing a tolerance of 0.01 Da between the mean of the Gaussian component and the monoisotopic masses of known modifications.
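The binning and the post-processing of fitted components can be sketched in a few lines. The mixture fitting itself was performed with ‘mclust’ in R and is not reproduced here; the helper names below are hypothetical, and averaging the means of combined components is an assumption, since the text does not state how combined components were parameterised.

```python
import math


def mass_bin(mass_shift):
    """Assign a mass shift to its integer-centred bin: bin k covers [k - 0.5, k + 0.5)."""
    return math.floor(mass_shift + 0.5)


def merge_close_components(means, min_separation=0.01):
    """Combine components whose sorted neighbouring means differ by < min_separation Da."""
    groups = []
    for m in sorted(means):
        if groups and m - groups[-1][-1] < min_separation:
            groups[-1].append(m)
        else:
            groups.append([m])
    # Assumption: a combined component is represented by the average of its means.
    return [sum(g) / len(g) for g in groups]


def annotate(component_mean, known_shifts, tol=0.01):
    """Return names of known modifications within tol Da of a component mean."""
    return [name for name, mass in known_shifts.items()
            if abs(mass - component_mean) <= tol]


means = merge_close_components([15.990, 15.995, 16.030])
print(mass_bin(15.9949), [annotate(m, {"Oxidation": 15.994915}) for m in means])
```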

4.2.4 Identification of PTMs and SNPs using peptide quantitation

Detailed investigation of mass shift bins with multiple distinct Gaussian distributions was performed using TMT reporter information by comparison of peptide to peptide, peptide to spectrum and spectrum to spectrum correlations within each sample. For this purpose, spectra from the MS SMiV results were annotated with quantitative information. Reporter intensities collected at MS2 level allow easy integration with identification results, for example through spectrum titles and scan numbers. However, the use of quantification at MS3 level in the current dataset, in addition to fractions that had already been jointly processed, required a novel mapping of MS3 reporter ion information onto MS2 spectra.

Therefore, sequence annotated MS SMiV results were first collapsed so that each MS2 spectrum occurs only once in the output. To maintain the mass tolerant pairing information, each query spectrum retains links to the associated library spectra with annotated peptide sequences through unique library spectrum identifiers. Quantitative information for all MS3 spectra was extracted from Proteome Discoverer 2.1 (Thermo Scientific) for each of the 6 batches. Reporter signal-to-noise (S/N) ratios were then mapped onto MS2 spectra using corresponding precursor m/z values. To ensure accurate mapping between MS2 and MS3 spectra, the pairing with the smallest difference in retention time was selected and only MS2 and MS3 spectra originating from the same raw file could be paired. Furthermore, MS2-MS3 pairings with retention time differences above the 95% quantile within each batch were removed.
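The mapping steps above can be sketched as follows for a single batch. The scan dictionaries with keys `id`, `raw_file`, `precursor_mz` and `rt`, the rounding of precursor m/z to four decimals for matching, and the nearest-rank quantile are all assumptions for illustration.

```python
import math
from collections import defaultdict


def map_ms3_to_ms2(ms2_scans, ms3_scans, rt_quantile=0.95):
    """Pair each MS2 scan with the MS3 scan closest in retention time.

    Scans are dicts with keys 'id', 'raw_file', 'precursor_mz', 'rt'
    (hypothetical layout). Returns (ms2_id, ms3_id, rt_difference) tuples.
    """
    # Index MS3 scans by raw file and precursor m/z (rounded; an assumption).
    ms3_index = defaultdict(list)
    for ms3 in ms3_scans:
        ms3_index[(ms3["raw_file"], round(ms3["precursor_mz"], 4))].append(ms3)

    pairings = []
    for ms2 in ms2_scans:
        key = (ms2["raw_file"], round(ms2["precursor_mz"], 4))
        candidates = ms3_index.get(key, [])
        if not candidates:
            continue
        best = min(candidates, key=lambda s: abs(s["rt"] - ms2["rt"]))
        pairings.append((ms2["id"], best["id"], abs(best["rt"] - ms2["rt"])))

    if not pairings:
        return []
    # Remove pairings above the chosen quantile of retention time differences.
    deltas = sorted(delta for _, _, delta in pairings)
    rank = max(0, math.ceil(rt_quantile * len(deltas)) - 1)
    cutoff = deltas[rank]
    return [p for p in pairings if p[2] <= cutoff]
```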

Spectra without any quantitative information were removed before further analysis. Reporter signal-to-noise ratios were converted into quantitative values relative to the maximum S/N value per spectrum. For each protein in the dataset, relative quantitation was extracted for identified library spectra and associated query spectra. Library spectra were collapsed by canonical peptide sequence, reporter S/N ratios were summed within each channel, and the resulting unique peptides were given novel identifiers per protein consisting of the prefix ‘P’ followed by a number (e.g. P1). Query spectra were annotated with the identifier of the associated canonical library peptide and the mass shift identified by MS SMiV (e.g. P1±43.005). Spectrum-spectrum correlation matrices representing log2-fold changes between spectra were generated per cell line (see Figure 4.4). Missing or low quantitation of a spectrum in a cell line thereby results in extreme log2-fold changes with other spectra of the protein, whereas consistent quantification of the remaining spectra leads to log2-fold changes close to zero. Heatmaps for single cell lines were generated for all identified and MS SMiV paired spectra of a protein.

Pairing between identified canonical peptides and MS SMiV associated query spectra is highlighted by the naming convention, where query spectrum P1 ± 43.005 is associated with peptide P1 while exhibiting a mass shift of ±43.005 Da. Resulting heatmaps were manually assessed to confirm identification of PTM and variant spectra (see Figure 4.4).
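The construction of a spectrum-spectrum log2-fold change matrix for one cell line can be illustrated as below. The function names and the small pseudo-count that keeps missing (zero) quantitation finite are assumptions and not taken from the original analysis.

```python
import math


def relative_quant(sn_by_channel):
    """Scale reporter S/N values to the maximum S/N of the spectrum."""
    top = max(sn_by_channel.values())
    if top <= 0:
        raise ValueError("spectrum carries no quantitative information")
    return {channel: sn / top for channel, sn in sn_by_channel.items()}


def log2_fold_change_matrix(rel_by_spectrum, channel, pseudo=1e-3):
    """Pairwise log2-fold changes between all spectra of a protein in one cell line.

    rel_by_spectrum: dict spectrum name -> relative quantitation per channel.
    pseudo: pseudo-count (assumption) so that zero values give extreme but finite ratios.
    """
    names = list(rel_by_spectrum)
    matrix = [
        [math.log2((rel_by_spectrum[a][channel] + pseudo) /
                   (rel_by_spectrum[b][channel] + pseudo)) for b in names]
        for a in names
    ]
    return names, matrix


raw = {
    "P1": {"line1": 100.0, "line2": 50.0},
    "P2": {"line1": 80.0, "line2": 40.0},
    "P2+17.974": {"line1": 0.0, "line2": 60.0},
}
rel = {name: relative_quant(sn) for name, sn in raw.items()}
names, matrix = log2_fold_change_matrix(rel, "line1")
```

In this toy example P1 and P2 are quantified consistently in `line1`, giving a log2-fold change of zero, while the spectrum with the mass shift is missing in that cell line and therefore produces extreme values against both peptides.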

4.2.5 Comparison of unbiased SNP detection with manually curated identifications

Peptide sequences identified as variant sequences from the customized database search were mapped onto the human reference genome GRCh38 through GENCODE v25 using PoGo with no allowed mismatches (see Chapter 2). The mapping was repeated allowing one mismatch in the peptide sequences. Peptides that mapped to the human genome without mismatches in the BED output were removed from the second mapping to ensure that only sequences with a single amino acid substitution remained. Furthermore, multi-mappings of remaining sequences were discarded. Identifications from the standard database search were also mapped onto the human reference genome using PoGo with no mismatches. The resulting mappings of reference peptide sequences were matched to peptides with one amino acid variant based on their genomic loci, providing pairs of reference and alternative allele peptides identified in the mass spectrometry data.

Spectra that remained unidentified in the standard database search were paired with identified spectra using MS SMiV. Parameters for MS/MS tolerance and PTM tolerance were set to 0.02 Da and 5 ppm, respectively. Additionally, all masses representing amino acid substitutions in the Unimod database were allowed as mass shifts between spectrum pairs. Spectrum pairs with scores higher than the previously established thresholds for <1% FDR were retained. The remaining parameters were left at their default settings. Peptide sequence results from the standard database search were mapped onto library spectra, and sequences identified as variant were mapped onto query spectra in the MS SMiV output to establish the overlap with the reference and alternative allele peptide set. Furthermore, the results were compared to the manually curated set of variant peptides.

[Fig. 4.4 image: panel A, schematic; panel B, heatmaps for cell line 1 and cell line 2 with rows and columns P1, P2, P3, P1 ± 0.984, P1 ± 43.005, P2 ± 17.974 and P3 ± 15.994]

Fig. 4.4 A) Schematic representation of the process leading to spectrum-spectrum correlation heatmaps. Reporter ion intensities are generally similar across peptides of the same protein. Therefore, log2-fold changes between them are centred around zero. This is demonstrated by annotated ratios between quantitative values A and C in cell line 2. Missing or low quantification, as highlighted for peptide P2 in cell line 2 (annotation B), leads to log2-fold changes at both extremes of the range (log2-correlation matrix cells in red and blue). Blue cells indicate low log2-fold changes, while high log2-fold changes are shown in red. B) Heatmap of theoretical peptides (prefix ‘P’) and associated MS SMiV spectra (association shown by the mass shift to the peptide sequence with prefix ‘P’). Comparison of heatmaps for individual cell lines allows identification of alternative outliers. Here, spectrum P2 ± 17.974 exhibits extreme log2-fold changes in the left heatmap, with dark blue and dark red indicating highly negative and highly positive changes, respectively. The heatmap on the right shows no extreme fold change for spectrum P2 ± 17.974 but does so for peptide P2. This comparison of extreme fold changes across different cell lines allows identification of pairs of reference sequence (P2) and variant spectrum (P2 ± 17.974).

Database           Peptide-Spectrum Matches (<1% FDR)   Unique Peptides
Standard           1,349,141                            162,895
Custom+Standard    1,624,329                            170,292
Custom only        13,103                               2,404

Table 4.2 Summary of peptide spectrum matches and unique peptides identified through standard and custom database searching. Over 163k and 170k peptides were identified through searching against a standard database and a standard database with added variant sequences, respectively. A total of 2,404 peptides were solely identified from the variant sequences.

4.3 Results and Discussion

4.3.1 Peptide identification through database searching

To assess the extent of variation on peptide and protein sequences in 50 colorectal cancer cell lines, mass tolerant spectrum pairing was applied in combination with isobaric peptide labelling (TMT-10plex) and MS3 quantification. Over 5.4 million tandem mass spectra in 6 batches were generated and analysed. In total 1,349,141 peptide-spectrum matches (PSMs) representing 162,895 non-redundant peptides at FDR <1% were obtained from searching against a canonical protein sequence database (see Table 4.2). The overall spectrum assignment rate of 24.8% is consistent with the previously described fraction of spectra remaining unidentified (Griss, 2016). Searching the acquired tandem mass spectra against a customized database containing 77k variants (Iorio et al., 2016) resulted in 1,624,329 PSMs of 170,292 unique peptide sequences at <1% FDR. In total, 157,178 (89.3%) peptides are shared between the canonical and custom database searches. This highlights that the customized database significantly affects the identification rate due to the increased search space. Amongst the 13,114 (7.5%) peptides uniquely identified by the custom database search, 2,404 peptide sequences resulting from variants were found (see Figure 4.5).

Fig. 4.5 Venn diagram of identified canonical and variant peptides in database searches against a canonical sequence database and a custom sequence database. Overall 176,009 peptides were identified between both searches. Peptides derived from variant sequences only in the custom database were further filtered based on manual curation.

Manual curation of variant peptides using highest reporter intensities in expected cell lines resulted in high confidence identification of 769 amino acid substitution peptides. For the remaining 68%, the expected cell line for the variant did not exhibit the highest reporter intensity. Figure 4.6 depicts examples of reporter S/N of three peptide sequences originating from the customized sequences reflecting amino acid substitutions in cell lines HCT-15 and NCI-H508. While the top bar plot exhibits highest reporter S/N in cell line HCT-15, consistent with the amino acid substitution ‘DMELP[T310I]EK’ in cell line HCT-15, the middle bar plot displays the highest S/N of reporter intensities in cell line LS-1034. This is not consistent with the substitution Q285P predicted for cell line HCT-15. The identified peptide, in fact, lies further downstream, starting at position 419 in the protein, and is a reference peptide without any variant associated in the database. Therefore, the peptide does not fit the criteria for a curated variant peptide; however, it highlights that novel variants might be present in unanticipated cell lines. Lastly, the third bar plot in Figure 4.6 depicts varying S/N ratios across cell lines. The highest S/N is present in cell line NCI-H747 while the amino acid substitution D3061G for peptide ‘ASLGSLEGEAEAEASSPK’ is predicted for a cell line in a different batch, namely NCI-H508. Furthermore, the identified peptide is downstream of the predicted variant site, starting at position 5,766. This highlights that database search algorithms supplied with large custom databases and user defined accessions can falsely associate peptides with variant accessions. Additionally, almost isobaric amino acid substitutions and post-translational modifications found in the Unimod database (see Figure 4.7) indicate potential false identifications. The cell lines used in this study are cancer lines, prone to novel somatic mutations (Forbes et al., 2015, 2017; Kennedy et al., 2012; Luzzatto, 2011). Therefore, identification of variant sequences in unanticipated cell lines might indicate unmapped mutations, which would require repetition of the genotyping (see Figure 4.6).

Fig. 4.6 Examples of inclusion and exclusion criteria for variant peptide sequences using reporter S/N ratios. The top bar plot shows the sole and highest reporter S/N value in cell line HCT-15, consistent with variant cell line HCT-15 in the custom protein identifier of the peptide. This type of variant peptide identification is kept in the manually curated set. The middle bar plot highlights reporter S/N ratios for a peptide wrongly identified as variant downstream of a predicted substitution Q285P in cell line HCT-15. Low abundance in cell line HCT-15, while cell line LS-1034 exhibits the highest S/N ratio, indicates a novel mutation in cell line LS-1034. The cell lines of the predicted mutation and of the solely quantified peptide are not consistent and the peptide is therefore rejected from the manual curation set. In the bottom bar plot, highest expression for peptide ‘ASLGSLEGEAEAEASSPK’ is present in cell line NCI-H747 while overall expression varies between cell lines. The cell line giving rise to the predicted peptide sequence, however, is not present in the batch. Similar identifications are also rejected from the manual curation set.

Fig. 4.7 Frequency of post-translational modifications in the Unimod database with monoisotopic mass within ±0.02 Da of mass shifts caused by amino acid substitutions. In total, 29 substitutions exhibit a mass similar to that of single or multiple post-translational modifications.

| Batch | Spectra | Identified Spectra (Standard Search) | MS SMiV Pairs (Same Precursor Mass) | Pairs with Same Peptide Sequence Annotated |
|---|---|---|---|---|
| Exp. 1 | 655,867 | 143,023 (21.8%) | 3,815,838 | 822,635 (21.6%) |
| Exp. 2 | 829,517 | 235,581 (28.4%) | 9,942,479 | 1,574,999 (15.8%) |
| Exp. 3 | 1,184,094 | 290,897 (24.6%) | 17,899,443 | 4,395,035 (24.6%) |
| Exp. 4 | 903,376 | 211,815 (23.4%) | 8,075,985 | 1,051,711 (13.0%) |
| Exp. 5 | 980,249 | 250,985 (25.6%) | 11,125,870 | 1,600,957 (14.4%) |
| Exp. 6 | 876,287 | 216,840 (24.7%) | 8,625,656 | 1,358,938 (15.8%) |

Table 4.3 The colorectal cancer dataset comprises a total of 5.4 million spectra. Experimental batches 1-6 contribute in varying degrees to the total amount. Additionally, the number of spectra identified during standard database searching as input for MS SMiV comparison is given. MS SMiV spectrum pairs amount to 59.5 million over all batches with varying degrees of true positive identifications based on annotation of same peptide sequences.

4.3.2 Spectrum pairing for false discovery rate estimation

To establish thresholds for MS SMiV scores that result in few false mappings, the FDR was estimated using spectrum pairs annotated with the same sequence (correct pairings) and with differing sequences (false pairings). This was achieved by comparing identified spectra against themselves, allowing only spectrum pairs of the same precursor mass. An average of 224,856 spectra were identified in the standard database search for each batch. MS SMiV comparison of each of these spectra with the remaining identified spectra in a batch, allowing only spectrum pairs of the same precursor mass, resulted in a total of 59,485,259 scored spectrum pairs. Across batches, between 13.0% and 24.6% of spectrum pairs were annotated with the same peptide sequence. The exact set-up per experimental batch is shown in Table 4.3.

Using the score distributions of spectrum pairs with same and different sequences as true and false positive matches, respectively, enabled the calculation of a false discovery rate and the determination of score thresholds for each experimental batch of samples (see Figure 4.8). Overall, false discovery rates remain low for high scores, while a significant change was found for MS SMiV similarity scores smaller than 0.5 (Figure 4.8 A-C). Furthermore, comparison of FDR curves between batches reveals that the rate of false spectrum pairings is unique for each batch at any given MS SMiV score. This implies that MS SMiV scoring is strictly dependent on the supplied spectra. Therefore, the specification of MS SMiV score thresholds representing a given FDR is dataset dependent and requires FDR estimation for every dataset independently. However, FDR estimation based on matched-sequence spectral comparisons that do not allow mass shifts can be used to establish confident MS SMiV score thresholds for mass tolerant searches, as shown in Chapter 3. Overall FDR is lowest in the largest batch, i.e. batch 3 (see Figure 4.8 A). However, at higher scores for FDR values <1% this trend is reversed, resulting in higher rates of false spectrum matches (see Figure 4.8 D). Overall, score thresholds at a <1% FDR level of 0.545, 0.595, 0.615, 0.585, 0.575, and 0.685 were established for experimental batches 1 to 6, respectively.

Fig. 4.8 Comparison of FDR in relation to MS SMiV score for all experimental batches. A) Overall FDR remains lowest for the largest experimental batch (batch 3). B) At a maximum FDR of 10%, batch 3 shows the slowest increase of FDR with decreasing degree of spectrum pair similarity based on MS SMiV score. C) At a more stringent FDR level of <5%, small changes can be observed between different experimental batches. D) At the highest stringency level with FDR <1%, batch 3 reveals the steepest increase of false matches at high MS SMiV scores compared to the other batches. Score thresholds exhibit higher variance.
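The threshold determination described above can be sketched as follows. This is only a minimal illustration, not the MS SMiV implementation: it assumes that the FDR at a given score is the fraction of different-sequence pairs among all pairs scoring at or above that score, and that the reported threshold is the lowest score whose cumulative FDR still lies at or below the chosen level. The function names `fdr_by_threshold` and `score_threshold` are hypothetical.

```python
from itertools import groupby


def fdr_by_threshold(scores, same_sequence):
    """Cumulative FDR for each distinct score, from highest to lowest.

    scores        -- similarity score of each spectrum pair
    same_sequence -- True if both spectra carry the same peptide sequence
                     (treated as a true match), False otherwise

    Assumption: FDR = false pairs / all pairs at or above the score.
    """
    ranked = sorted(zip(scores, same_sequence), key=lambda p: p[0], reverse=True)
    true_pos = 0
    false_pos = 0
    curve = []
    # Group ties so that pairs with an identical score are counted together.
    for score, group in groupby(ranked, key=lambda p: p[0]):
        for _, is_same in group:
            if is_same:
                true_pos += 1
            else:
                false_pos += 1
        curve.append((score, false_pos / (true_pos + false_pos)))
    return curve


def score_threshold(scores, same_sequence, max_fdr=0.01):
    """Lowest score whose cumulative FDR is at or below max_fdr (None if none)."""
    threshold = None
    for score, fdr in fdr_by_threshold(scores, same_sequence):
        if fdr <= max_fdr:
            threshold = score
    return threshold


# Toy data for illustration only (not taken from the dataset).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [True, True, True, False, True]
print(score_threshold(toy_scores, toy_labels, max_fdr=0.01))
```

Because the curve is computed per set of input pairs, the same call would have to be repeated for every experimental batch, in line with the observation that thresholds are dataset dependent.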

| Batch | Spectra | Identified Spectra (Standard Search) | Unidentified Spectra | MS SMiV Pairs (Unimod) | MS SMiV Assignments | Total Assignments |
|---|---|---|---|---|---|---|
| Exp. 1 | 655,867 | 143,023 (21.8%) | 512,844 | 213,484 | 45,760 (7.0%) | 188,783 (28.8%) |
| Exp. 2 | 829,517 | 235,581 (28.4%) | 593,936 | 268,980 | 47,715 (5.8%) | 283,296 (34.2%) |
| Exp. 3 | 1,184,094 | 290,897 (24.6%) | 893,197 | 359,818 | 65,006 (5.5%) | 355,903 (30.1%) |
| Exp. 4 | 903,376 | 211,815 (23.4%) | 691,561 | 219,188 | 41,148 (4.6%) | 252,063 (28.0%) |
| Exp. 5 | 980,249 | 250,985 (25.6%) | 729,264 | 331,977 | 50,513 (5.2%) | 301,498 (30.8%) |
| Exp. 6 | 876,287 | 216,840 (24.7%) | 659,447 | 63,619 | 19,459 (2.2%) | 236,299 (27.0%) |

Table 4.4 Summary of spectral identification across 6 batches. Searching a total of 5.4 million spectra against a canonical sequence database results in assignment of a peptide sequence to 24.8% of spectra. Mass tolerant spectrum pairing of 4.1 million unidentified with identified spectra results in identification of an additional 5.0% of spectra.

4.3.3 Unbiased spectrum pairing for modification masses

To test whether variant peptide identifications through custom database searching reflect the genotype of the respective cell lines, MS SMiV was applied to identify spectra in the dataset which relate to identified PSMs by post-translational modification masses. This resulted in 1,457,066 spectrum pairs at <1% FDR (Table 4.4). These were obtained by comparing, on average across all 6 batches, 680,042 spectra that remained unidentified in the standard searches with 224,856 identified spectra, allowing mass shifts of all post-translational modifications consolidated in the Unimod database (Creasy and Cottrell, 2004). Gaussian fit analysis across the mass range of 500 Da revealed a total of 801 distinct mass bins across all 6 batches with an average of 228 bins per batch. The distribution of spectrum pairs across ∆masses (see Figure 4.9) is similar to the mass shifts identified by an open mass tolerant SEQUEST search of another human cell line as described in Chapter 3. This highlights the confidence of MS SMiV spectrum pairing for unbiased mass shift identification.

Fig. 4.9 Distributions of ∆mass for spectrum pairs identified by MS SMiV allowing PTM masses at <1% FDR. Distributions are given per experimental batch and show consistent representation of ∆masses. In single experiments individual ∆masses are overrepresented compared to other experiments, representing potential amino acid substitutions.

[Figure 4.10 labels: ∆mass=15.00721; ∆mass=14.97048; Lys↔Xle; Gln↔Xle; Asn↔Val; Serine to lactic acid; ISD (z+2)-series; Tyrosine oxidation to 2-aminotyrosine; Carboxylic acid to hydroxamic acid; Deamidation + Methylation]

Fig. 4.10 Density of ∆masses identified by MS SMiV between 14.5 Da and 15.5 Da in the 3rd batch. Gaussian fit analysis identified distinct mass bins, with one displaying a mean ∆mass similar to two amino acid substitutions. The mass bin identified at higher frequency features a ∆mass similar to post-translational modifications, artefacts and one substitution.

Interestingly, several ∆mass bins exhibited similar mean masses. Annotation of known modification and amino acid substitution masses compiled in Unimod shows that in a total of 13 such cases one Gaussian component relates to post-translational modification masses while the other component exhibits a mean mass indicating amino acid substitutions (see Figure 4.10). This demonstrates that MS SMiV spectrum pairing is capable of distinguishing post-translational modifications from almost isobaric amino acid substitutions.

4.3.4 Peptide quantitation distinguishing between post-translational modifications and amino acid variants

Post-translational modifications and amino acid substitutions with highly similar masses hinder the identification of either in the mass spectra. Database searches often result in false identifications, choosing amino acid substitutions provided in a custom database over post-translational modifications. To assess how different approaches can discriminate post-translational modifications from amino acid variants during spectrum matching, peptide quantitation was utilized as an additional criterion. A total of 269,683 query and 245,217 library spectra in the MS SMiV results were mapped to quantitative values and were used for downstream analysis.

| Unimod Description | Monoisotopic Mass |
|---|---|
| Val→Ala substitution | -28.0313 |
| Met→Cys substitution | -28.0313 |
| Ala→Val substitution | 28.0313 |
| Cys→Met substitution | 28.0313 |
| di-methylation | 28.0313 |
| Acetaldehyde +28 | 28.0313 |
| Ethylation | 28.0313 |
| Lysine oxidation to aminoadipic semialdehyde | -1.031634 |
| Half of a disulfide bridge | -1.007825 |
| 15N(1) | 0.997035 |
| DMPO spin-trap nitrone adduct | 111.068414 |

Table 4.5 Entries in Unimod with monoisotopic mass within ±0.02 Da of the mass shifts ∆mass=28.029 Da, ∆mass=1.018 Da, and ∆mass=111.063 Da identified by MS SMiV. In total 2 amino acid substitutions and 3 post-translational modifications are suggested by Unimod as explanation.
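The ±0.02 Da lookup behind Table 4.5 can be illustrated with a short sketch. The entry list below is a small subset copied from the table, not the Unimod database itself, and the function name `annotate_delta_mass` is hypothetical. The sketch also assumes that matching is done on absolute mass, since the table lists both directions of each substitution for a single observed shift.

```python
# Subset of Table 4.5: (Unimod description, monoisotopic mass in Da).
UNIMOD_SUBSET = [
    ("Val->Ala substitution", -28.0313),
    ("Met->Cys substitution", -28.0313),
    ("Ala->Val substitution", 28.0313),
    ("Cys->Met substitution", 28.0313),
    ("di-methylation", 28.0313),
    ("Acetaldehyde +28", 28.0313),
    ("Ethylation", 28.0313),
    ("DMPO spin-trap nitrone adduct", 111.068414),
]


def annotate_delta_mass(delta_mass, entries=UNIMOD_SUBSET, tolerance=0.02):
    """Return the entries whose absolute mass lies within `tolerance` Da
    of the absolute observed mass shift.

    Assumption: the sign of the shift is ignored, so a substitution and
    its reverse are both reported for the same observed shift.
    """
    target = abs(delta_mass)
    return [
        (description, mass)
        for description, mass in entries
        if abs(abs(mass) - target) <= tolerance
    ]


# Candidate explanations for the 28.029 Da shift discussed in the text.
for description, mass in annotate_delta_mass(28.029):
    print(f"{description}: {mass}")
```

With this subset, the 28.029 Da shift returns the two substitution pairs (each in both directions) together with the three modifications, which is why quantitative evidence is needed to choose between them.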

Reporter S/N ratios were used to calculate log2-fold changes between peptides for each protein within each cell line. A total of 227,530 heatmaps were generated incorporating spectra identified through canonical database searching as well as MS SMiV associated spectra. To facilitate the visual inspection of the heatmaps, only proteins with at least one identified peptide were used. Furthermore, only heatmaps with up to 50 identified peptides and MS SMiV paired spectra were plotted. Empty channels and negatively correlated extreme log2-fold changes between peptides and MS SMiV related spectra highlight the presence of variant peptides and their reference counterpart. As an example, the protein PURB, a protein binding to single stranded DNA sequences in the origin of replication and gene flanking regions, was identified in batch 6 with 17 PSMs corresponding to 7 unique peptide sequences with a protein sequence coverage of 48.4%. MS SMiV identified 3 spectra at <1% FDR representing 3 distinct peptides of PURB with mass shifts of 28.029 Da, 111.063 Da, and 1.018 Da and MS SMiV scores of 0.72, 0.72, and 0.73, respectively. Unimod annotation of these mass shifts revealed one post-translational modification for 111.063 Da and three for 1.018 Da (see Table 4.5). Additionally, the mass shift of 28.029 Da resulted in 2 potential amino acid substitutions and 3 potential post-translational modifications (see Table 4.5). However, the identified peptide sequence ’GGGGFGAGPGPGGJQSGQTJAJPAQGJJEFR’ would only allow di-methylation of proline, and it supports the substitution from alanine to valine.
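The peptide-peptide comparison underlying the heatmaps can be sketched as below. This is an illustrative assumption about the calculation rather than the code used in this work: each heatmap cell is taken to be the log2 ratio of the reporter S/N values of two spectra within one cell line, and an empty channel yields a missing value. The function name and the S/N values are made up for the example.

```python
import math


def log2_fold_changes(reporter_sn):
    """Pairwise log2 fold changes between spectra within one cell line.

    reporter_sn -- dict mapping a spectrum label to its reporter S/N in a
                   single cell line; None or 0 marks an empty channel.

    Returns a dict keyed by (row, column) holding log2(row / column),
    or None where either channel is empty.
    """
    changes = {}
    for row, sn_row in reporter_sn.items():
        for column, sn_column in reporter_sn.items():
            if row == column:
                continue
            if not sn_row or not sn_column:
                changes[(row, column)] = None  # empty channel, no ratio
            else:
                changes[(row, column)] = math.log2(sn_row / sn_column)
    return changes


# Hypothetical S/N values for one cell line (illustration only).
cell_line = {"P1": 10.0, "P2": 80.0, "P3": 80.0, "P1 ± 28.029": None}
changes = log2_fold_changes(cell_line)
print(changes[("P1", "P2")])           # strongly negative: P1 is depleted
print(changes[("P1 ± 28.029", "P2")])  # None: channel is empty
```

A reference peptide and its variant counterpart would be expected to show opposite signs, or missing values, against the remaining peptides of the protein, which is the pattern inspected visually in the heatmaps.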

Heatmaps of log2-fold changes between the different spectra highlight one peptide (P1) as an SNP candidate in cell line SNU-1040 due to its extreme log2-fold changes, while the related MS SMiV spectrum (P1 ± 28.029) with a mass shift of 28.029 Da is the exception, showing extreme log2-fold changes in the remaining cell lines (see Figure 4.11). The peptide ’GGGGFGAGPGPGGJQSGQTJAJPAQGJJEFR’ (P1) appears to be down-regulated in cell line SNU-1040 compared to the other peptides identified in the standard database search (peptides with prefix ’P’) and to other cell lines. The related spectrum identified by MS SMiV (P1 ± 28.029) with a mass shift of ∼28 Da follows the opposite pattern. While down-regulated (blue) or absent (grey) in all other cell lines, it exhibits log2-fold changes close to 0 with the sequence database identified spectra P2 and P3 in sample SNU-1040. This mirror-inverted abundance change of peptide P1 and spectrum P1 ± 28.029 indicates a binary presence of either of the peptides, which can only occur in the presence of an amino acid substitution.

The additional graphical representation of changes in reporter intensities between peptides within cell lines (line plots) confirms both peptide and spectrum as SNP candidates (see Figure 4.12). Overall, reporter intensities for all peptides are centred around the median within each cell line (line). The exception is P1 in cell line SNU-1040 (red in Figure 4.12 top), which represents an extreme in comparison with the other cell lines. Equally, a normalized abundance of ∼1 for spectrum P1 ± 28.029 in sample SNU-1040 indicates correlation with the other canonical peptides, while the remaining cell lines exhibit normalized intensities close to 0, indicating the absence of this peptide sequence from these cell lines. Differences between reporter intensities in one cell line and the mean across all other cell lines highlight the MS SMiV identified spectrum as a clear SNP candidate (see Figure 4.12 bottom). However, the low abundance of peptide P1 in cell line SNU-1040 is not represented as an exception.
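The two quantities shown in the line plots can be illustrated with a small sketch. It assumes that reporter intensities are divided by the median within each cell line and that the difference is then taken between the normalized value in one cell line and the mean of the normalized values in all other cell lines; whether the differences in this work were computed on normalized or raw intensities is an assumption here, and the function names and numbers are hypothetical.

```python
from statistics import mean, median


def normalise_by_median(intensities):
    """Divide every reporter intensity by the median of its cell line.

    intensities -- dict: cell line -> dict: spectrum label -> intensity.
    Assumes every spectrum is quantified and the median is non-zero.
    """
    normalised = {}
    for cell_line, spectra in intensities.items():
        centre = median(spectra.values())
        normalised[cell_line] = {
            label: value / centre for label, value in spectra.items()
        }
    return normalised


def difference_to_others(normalised, cell_line, label):
    """Normalized value in one cell line minus the mean of all others."""
    others = [
        spectra[label]
        for name, spectra in normalised.items()
        if name != cell_line
    ]
    return normalised[cell_line][label] - mean(others)


# Hypothetical intensities for three cell lines (illustration only).
toy = {
    "A": {"P1": 2.0, "P2": 4.0, "P3": 8.0},
    "B": {"P1": 4.0, "P2": 4.0, "P3": 4.0},
    "C": {"P1": 4.0, "P2": 4.0, "P3": 4.0},
}
norm = normalise_by_median(toy)
print(difference_to_others(norm, "A", "P1"))
```

A spectrum that is present in a single cell line only stands out clearly in such a difference, whereas a reference peptide that is merely reduced in one cell line may not, which matches the behaviour described for P1 above.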

The mutation identified by MS SMiV and the peptide-peptide quantitative comparison is confirmed by the manually curated set of variant peptides identified in the custom database search. There the spectrum P1 ± 28.029 is identified with sequence ’GGGGFGAGPGPGGJQSGQTJVJP[A↔V]QGJJEFR’ with a substitution of alanine to valine at position 21 in the peptide sequence or position 173 in the protein sequence. The cell line project (Forbes et al., 2017) identified a substitution of cytosine to thymine at position 518 within the gene (chr7:44884831) leading to the substitution of alanine to valine at protein position 173. This shows that MS SMiV in combination with peptide quantitation can discern amino acid substitutions from post-translational modifications.

[Figure 4.11: heatmap panels for cell lines GEO, SNU-1040, SW48, SNU-175, CL-11, SW948, KM12, LIM1215, CW-2 and GP5d. Rows and columns: P1, P2, P3, P1 ± 28.029, P2 ± 111.063, P3 ± 1.018 (P1 and P1 ± 28.029 marked). Colour scale: log2(fold change) from < −3 to > 3.]

Fig. 4.11 Example of peptide-peptide correlation heatmaps for 10 cell lines in experimental batch 6 of protein PURB (UniProt Q96QR8). High positive log2-fold changes are indicated by red colour, while blue depicts highly negative fold changes. The log2-fold changes of peptides identified by database searching centre around 0 (white) with the exception of peptide P1 in cell line SNU-1040. Here the peptide displays highly negative log2-fold changes. In contrast, MS SMiV identified spectrum P1 ± 28.029, which shows a mirror-inverted correlation between cell lines, as a variant of peptide P1. In the custom database search a variant peptide sequence is identified for spectrum P1 ± 28.029. Asterisks indicate the canonical and variant sequence pair.

[Figure 4.12: line plots coloured by cell line (SW48, SNU-175, CL-11, SW948, KM12, SNU-1040, GEO, LIM1215, CW-2, GP5d). Top panel y-axis: Normalized Abundances; bottom panel y-axis: Differences. x-axis in both panels: P1, P2, P3, P1 ± 28.029, P2 ± 111.063, P3 ± 1.018.]

Fig. 4.12 Graphical representation of peptide quantification per cell line for PURB. Reporter intensities normalized by the median within each sample highlight P1 as a lowly abundant peptide in the SNU-1040 cell line (red), while P1 ± 28.029 has an abundance similar to the median (top). Additionally, all other cell lines show spectrum P1 ± 28.029 as an outlier with low or missing relative abundance. Relative differences of reporter intensity of one cell line compared to the mean of all other cell lines highlight the MS SMiV identified spectrum P1 ± 28.029 as an outlier, while all other spectra exhibit high variability of differences.

| Unimod Description | Monoisotopic Mass |
|---|---|
| Ala→Gly substitution | -14.01565 |
| Glu→Asp substitution | -14.01565 |
| Leu/Ile→Val substitution | -14.01565 |
| Thr→Ser substitution | -14.01565 |
| Gln→Asn substitution | -14.01565 |
| Asp→Glu substitution | 14.01565 |
| Gly→Ala substitution | 14.01565 |
| Ser→Thr substitution | 14.01565 |
| Val→Leu/Ile substitution | 14.01565 |
| Asn→Gln substitution | 14.01565 |
| Methylation | 14.01565 |
| Proline oxidation to pyrrolidinone | -30.010565 |
| Ser→Gly substitution | -30.010565 |
| Thr→Ala substitution | -30.010565 |
| Homoserine | -29.992806 |
| Met→Thr substitution | -29.992806 |
| Thr→Met substitution | 29.992806 |
| Ala→Thr substitution | 30.010565 |
| Gly→Ser substitution | 30.010565 |
| Hydroxymethyl | 30.010565 |

Table 4.6 Entries in Unimod with monoisotopic mass within ±0.02 Da of the mass shifts ∆mass=14.011 Da, ∆mass=14.014 Da, and ∆mass=30.012 Da identified by MS SMiV. In total 8 amino acid substitutions and 4 post-translational modifications are suggested by Unimod as explanation.

Another example of successful de novo SNP detection using MS SMiV and peptide quantitation was found in the VAPA protein. In total 40 PSMs for 9 unique peptides were identified, covering 50.6% of the protein sequence. MS SMiV spectrum pairing returned 5 previously unidentified spectra associated with 6 PSMs of 3 distinct canonical peptide sequences. The peptide ’VAHSDKPGSTSTASFR’ (P3) revealed overall low abundance in two cell lines, namely GEO and GP5d, indicating a sequence variant or post-translational modification (see Figure 4.13). The results of MS SMiV provided 3 spectra (P3 ± 14.011, P3 ± 14.014, P3 ± 30.012) associated with these PSMs with mass differences of 14.011 Da, 14.014 Da, and 30.012 Da, respectively. Unimod provides 5 substitutions and a single PTM as explanation for a mass of 14 Da, while 3 PTMs and 3 variants would explain a mass difference of 30 Da (see Table 4.6).

Intriguingly, two of these MS SMiV spectra (P3 ± 14.011 and P3 ± 14.014) exhibit slightly increased log2-fold changes when compared to the other identified peptides in cell line GEO, while low expression or missing quantification was observed in the remaining samples. This opposing direction of expression with the identified peptides P1 and P2 indicates that P3 ± 14.011 and P3 ± 14.014 are indeed spectra resulting from a variant sequence of ’VAHSDKPGSTSTASFR’ (P3) in cell line GEO. However, the third MS SMiV spectrum P3 ± 30.012 relating to peptide P3 is lowly abundant in the sample, indicating that it is not a variant sequence in this cell line. A closer look at cell line GP5d then reveals that spectrum P3 ± 30.012 shows an opposing peptide abundance to the identified spectra P1 and P2. Similarly to the other two MS SMiV spectra in cell line GEO, this indicates a variant sequence in GP5d. This variant, however, is different from the variant in GEO, as log2-fold changes between both are extreme or missing values.

Normalized peptide abundances of the peptide P3 with sequence ’VAHSDKPGSTSTASFR’ in cell lines GEO (blue) and GP5d (orange) are highlighted as SNP candidates in Figure 4.14 (top) due to their relative abundance close to 0, while the MS SMiV identified spectra P3 ± 14.011 and P3 ± 14.014 exhibit high normalized abundance in the GEO sample, and P3 ± 30.012 shows median abundance in cell line GP5d. Furthermore, differences between peptide quantitation in a single sample and the mean of the remaining cell lines in the set highlight both suspected variants in the MS SMiV results (see Figure 4.14 bottom). In general, these line plots indicate variant spectra well when the peptide is not present in the remaining samples.

The set of manually curated variant sequences confirms the presence of two variants with substitution of threonine to serine at position 262 and alanine to threonine at position 263 in the protein for cell lines GEO and GP5d, respectively. This highlights that MS SMiV successfully identified the mass shifts of 14 Da and 30 Da. The use of isobaric labelling quantification enables the discrimination between potential post-translational modifications and near isobaric amino acid variants to confidently identify the two variant sequences.

The heatmaps of log2-fold changes between peptides within a given sample also highlight global differences between cell lines. The gene SON was identified with 42 unique peptides (117 PSMs) covering 29.3% of the primary isoform sequence. MS SMiV output, filtered at <1% FDR, retains 11 spectra (9 peptides) matching to the standard database and 32 associated unidentified spectra, of which 27 are provided with quantitative values, with mass differences ranging between 0.958 Da and 453.224 Da (see Table 4.7). For the sequence ’EVPPPPK’ (P1) a variant was identified in the custom database search for a cell line in a different experimental batch. The heatmaps, however, highlight that the peptide does not exhibit extreme fold changes in any of the cell lines in batch 3 (see Figure 4.15). In several cell lines, spectra identified by MS SMiV show log2-fold changes compared to peptides identified by database search (prefix ’P’) that indicate differential regulation or false spectrum pairing (highlighted by black boxes). However, in other cell lines MS SMiV spectra show log2-fold changes of 0 (white), indicating correlation of unidentified peptides with identified species of the protein.

[Figure 4.13: heatmap panels for cell lines SNU-175, GEO, GP5d, SW48, CL-11, SW948, KM12, SNU-1040, LIM1215 and CW-2. Rows and columns: P1, P2, P3, P1 ± 14.015, P2 ± 113.081, P3 ± 14.011, P3 ± 14.014, P3 ± 30.012 (P3 and its associated spectra marked). Colour scale: log2(fold change) from < −6 to > 6.]

Fig. 4.13 Example of peptide-peptide correlation heatmaps for 10 cell lines in experimental batch 6 for protein VAPA (UniProt Q9P0L0). High positive log2-fold changes are indicated by red colour, while blue depicts highly negative fold changes. The reference peptide sequence ’VAHSDKPGSTSTASFR’ (P3) is identified by standard database searching and is less abundant in cell lines GEO and GP5d compared to the remaining identified sequences (prefix ’P’). Spectra identified as related to identified sequences are labelled with the same sequence identifier and the associated mass shift. MS SMiV spectra P3 ± 14.011 and P3 ± 14.014 exhibit opposing quantitation to P3 in GEO, while P3 ± 30.012 shows opposing quantitation in GP5d.

[Figure 4.14: line plots coloured by cell line (SW48, SNU-175, CL-11, SW948, KM12, SNU-1040, GEO, LIM1215, CW-2, GP5d). Top panel y-axis: Normalized Abundances; bottom panel y-axis: Differences. x-axis in both panels: P1, P2, P3, P1 ± 14.015, P2 ± 113.081, P3 ± 14.011, P3 ± 14.014, P3 ± 30.012.]

Fig. 4.14 Graphical representation of peptide quantification per cell line for VAPA. Reporter intensities normalized by the median within each sample highlight peptide P3 as lowly abundant in the GEO and GP5d cell lines, while spectrum P3 ± 30.012 has an abundance similar to the median in GP5d (top). Additionally, P3 ± 14.011 and P3 ± 14.014 exhibit extreme abundance in the GEO sample. In the remaining cell lines P3 ± 14.011, P3 ± 14.014, and P3 ± 30.012 reveal no abundance at all. Relative differences of reporter intensity of one cell line compared to the mean of all other cell lines highlight the MS SMiV identified spectra P3 ± 14.011, P3 ± 14.014, and P3 ± 30.012 as outliers, while all other spectra exhibit high variability of differences between cell lines.

| Identified Peptide | ∆Mass | Identified Peptide | ∆Mass |
|---|---|---|---|
| P1 | 33.969 | P3 | 42.051 |
| P1 | 43.004 | P3 | 42.052 |
| P1 | 43.005 | P3 | 42.053 |
| P1 | 43.006 | P3 | 103.969 |
| P1 | 43.007 | P4 | 98.002 |
| P2 | 0.957 | P5 | 1.02 |
| P2 | 1.048 | P5 | 14.015 |
| P2 | 70.041 | P5 | 17.973 |
| P2 | 71.07 | P5 | 103.968 |
| P2 | 118.054 | P5 | 103.969 |
| P2 | 159.059 | P7 | 153.992 |
| P2 | 159.06 | P7 | 229.024 |
| P2 | 159.061 | P7 | 287.006 |
| | | P9 | 287.061 |

Table 4.7 Peptides identified by standard database searching (prefix ’P’) matching to the SON protein and associated mass shifts identified by MS SMiV spectral pairing.

Peptide abundances normalized per cell line (see Figure 4.18 top) and differences between reporter intensities in one cell line and the mean of intensities across the remaining cell lines (see Figure 4.18 bottom) do not reveal outlier spectra as described above for variants. This is consistent with the lack of identification of variant peptides in the heatmaps. While normalized peptide abundances and differences between intensities are highly variable, the variability of log2-fold changes for MS SMiV spectra in the heatmaps, revealing consistent quantification for some cell lines and differential regulation for others, indicates the identification of post-translational modifications. The heatmaps further highlight that the composition of log2-fold changes between peptides within each cell line appears to be unique for each sample. This suggests that peptide-peptide fold changes per protein might be used as a mass spectrometric fingerprint for individuals and samples. However, further analysis on replicates is required to assess the global applicability of this proof-of-concept workflow.

[Figure 4.15: heatmap panels for cell lines SW48, NCI-H716 and RKO. Rows and columns: identified peptides P1 to P9 (P1 marked) and the MS SMiV associated spectra from P1 ± 33.969 to P9 ± 287.061. Colour scale: log2(fold change) from < −2 to > 2.]

Fig. 4.15 Example of peptide-peptide correlation heatmaps for 3 cell lines in experimental batch 3 of protein SON (UniProt P18583). Positive log2-fold changes are indicated by red colour, while blue depicts negative fold changes. Spectra identified in the standard database search (prefix ’P’) exhibit no fold changes (white) between each other, while in some cell lines spectra identified with a mass shift relating to identified sequences (unidentified spectra are associated with identified sequences through their identifier with prefix ’P’ and ∆mass) exhibit differential log2-fold changes (marked by black boxes).

[Figure 4.16: heatmap panels for cell lines SNU-C5, LS-123, HCT-116 and RCM-1. Rows and columns: identified peptides P1 to P9 (P1 marked) and the MS SMiV associated spectra from P1 ± 33.969 to P9 ± 287.061.]

Fig. 4.16 Extension of Figure 4.15 showing peptide-peptide correlation heatmaps for 4 cell lines in experimental batch 3 of protein SON (UniProt P18583). Positive log2-fold changes are indicated by red colour, while blue depicts negative fold changes. Spectra identified in the standard database search (prefix ‘P’) exhibit no fold changes (white) between each other, while in some cell lines spectra identified with a mass shift relating to identified sequences (unidentified spectra are associated with identified sequences through their identifier with prefix ‘P’ and ∆mass) exhibit differential log2-fold changes (marked by black boxes).

[Figure 4.17: heatmap panels for cell lines SW620, C2BBe1 and DIFI. Both axes list the identified peptides P1 to P9 and the MS SMiV associated spectra, labelled by peptide identifier and mass shift (P<n> ± ∆mass, in Da).]

Fig. 4.17 Extension of Figure 4.15 showing peptide-peptide correlation heatmaps for 3 cell lines in experimental batch 3 of protein SON (UniProt P18583). Positive log2-fold changes are indicated by red colour, while blue depicts negative fold changes. Spectra identified in the standard database search (prefix ‘P’) exhibit no fold changes (white) between each other, while in some cell lines spectra identified with a mass shift relating to identified sequences (unidentified spectra are associated with identified sequences through their identifier with prefix ‘P’ and ∆mass) exhibit differential log2-fold changes (marked by black boxes).

[Figure 4.18: two line plots for SON with one series per cell line (SW48, NCI-H716, RKO, SNU-C5, LS-123, HCT-116, RCM-1, SW620, C2BBe1, DIFI). Top panel y-axis: Normalized Abundances. Bottom panel y-axis: Differences. The x-axis of both panels lists peptides P1 to P9 and the MS SMiV associated spectra (P<n> ± ∆mass, in Da).]

Fig. 4.18 Graphical representation of peptide quantification per cell line for SON. Reporter intensities normalized by the median within each sample highlight low variability of quantitation for identified spectra (prefix ‘P’). Spectra identified by MS SMiV with mass shifts are associated with their peptide sequences by the peptide identifier and the mass difference ∆m and show high variability in both normalized abundance (top) and reporter intensity differences between a single cell line and the mean of the remaining samples (bottom).

| ID | Database | Number of Unique Peptides | Allowed Mismatches | Number of Mapped Loci |
|---|---|---|---|---|
| REF0 | Canonical | 162,589 | 0 | 176,696 |
| VAR0 | Custom | 168,474 | 0 | 192,530 |
| VAR≤1 | Custom | 170,288 | ≤1 | 567,078 |
| VAR1 | Custom | 1,814 | 1 | 3,498 |
| VAR1, filter | Custom | 1,519 | 1 & single mapping | 1,519 |
| REF0 ∩ VAR1, filter | | 1,038 REF; 1,055 VAR | | 1,055 |

Table 4.8 Proteogenomic mapping of peptides to identify pairs of variant and reference sequences. Sets of variant peptides identified by custom database searching and reference sequences identified through standard database searching are denoted VAR and REF, respectively. Unique peptides identified and associated genomic loci are given for mappings derived by PoGo using different settings for allowed mismatches in the alignment.

4.3.5 Benchmarking MS SMiV against database searching for variant identification

To ensure that variant peptide identifications from the customized database search are genuine variant sequences, peptides from the standard and custom searches were mapped against the human reference genome. The aim was to identify pairs of reference and variant sequences, which are expected because of the multiplexed experimental set-up. The mapping was performed with PoGo (see Chapter 2) using the GENCODE v25 annotation for the reference genome GRCh38. Proteogenomic mapping of the 162,895 peptides identified in the custom database search resulted in 192,530 and 567,078 unique loci with no and up to one allowed mismatch, respectively (see Table 4.8). Combined across all cell lines, 1,519 loci were identified with exactly one amino acid substitution. Peptides resulting in these mappings did not map anywhere else in the genome without allowing for a mismatch. In total, these identified 1,814 unique variant peptide sequences. Similarly, peptides identified through standard database searching resulted in 176,696 unique mappings. Overlap between unique variant mappings and the reference peptides revealed a set of 1,519 mappings with sequences for variant and reference alleles. Overall, 99.4% appear as pairs of single reference and alternative sequences, while 9 loci display 2 variant sequences.
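The pairing of reference and variant peptides by shared genomic locus can be illustrated with a short Python sketch. It is not PoGo itself: the record layout (peptide, locus string, number of mismatches), the function name `pair_reference_and_variant` and the toy peptides and coordinates are all assumptions made for this example. The filter mirrors the description above: a variant peptide is kept only if it has a single mapping and that mapping requires exactly one mismatch.

```python
from collections import defaultdict


def pair_reference_and_variant(ref_mappings, var_mappings):
    """Pair variant and reference peptides that map to the same locus.

    ref_mappings: iterable of (peptide, locus), mapped without mismatches.
    var_mappings: iterable of (peptide, locus, mismatches).
    Both layouts are hypothetical and chosen for illustration only.
    """
    # Collect every mapping of each variant-search peptide.
    hits_by_peptide = defaultdict(list)
    for peptide, locus, mismatches in var_mappings:
        hits_by_peptide[peptide].append((locus, mismatches))

    # Keep peptides with a single mapping that needs exactly one mismatch.
    variants_by_locus = defaultdict(set)
    for peptide, hits in hits_by_peptide.items():
        if len(hits) == 1 and hits[0][1] == 1:
            variants_by_locus[hits[0][0]].add(peptide)

    # Index reference peptides by locus.
    refs_by_locus = defaultdict(set)
    for peptide, locus in ref_mappings:
        refs_by_locus[locus].add(peptide)

    # A pair exists wherever both a reference and a variant peptide share a locus.
    return {
        locus: (sorted(refs_by_locus[locus]), sorted(variants))
        for locus, variants in variants_by_locus.items()
        if locus in refs_by_locus
    }


# Invented toy data: one genuine pair, one ambiguous variant, one exact match.
ref = [("LVNELTEFAK", "chr1:100-129:+"), ("SAMPLER", "chr2:500-520:-")]
var = [
    ("LVNELTDFAK", "chr1:100-129:+", 1),  # single mapping, one mismatch: kept
    ("SAMPLEK", "chr2:500-520:-", 1),     # maps to two loci: filtered out
    ("SAMPLEK", "chr3:10-30:+", 1),
    ("AGTHERK", "chr4:1-21:+", 0),        # maps without mismatch: not a variant
]
pairs = pair_reference_and_variant(ref, var)
print(pairs)
```

Loci that carry more than one variant peptide appear with several entries in the second list, which corresponds to the small number of loci with 2 variant sequences mentioned above.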

Fig. 4.19 Overlap of 769 manually curated variant sequences (blue) with 1,055 pairs of variant and reference peptides identified through custom database and standard database searches (yellow, REF0 ∩ VAR1, filter in Table 4.8). Additionally, 3,222 MS SMiV spectrum pairs (red) share 9.0% of all variant identifications.

Application of MS SMiV allowing mass shifts of amino acid substitutions, with score thresholds reflecting <1% FDR, resulted in 182,173 spectrum pairs across all 6 experiments. These represent spectrum pairs of reference and variant peptides and were therefore compared to the set of paired sequences described above. Between MS SMiV and the combined canonical and custom database search, 1,055 pairs of reference and variant peptides were identified (see Figure 4.19). While 21.7% are identified by both methods, resulting in high confidence of variant detection, 8.3% and 67.0% are unique to database searches and MS SMiV, respectively. Utilizing reporter intensities for manual curation of variant peptides contributes an additional 273 variant peptides, resulting in an overall overlap of 9.0% between all three methods. These 341 instances can be confidently called variants because reporter intensities are highest in the expected cell lines, reference and variant sequences are identified through database searching, and MS SMiV pairs spectra with a high similarity when allowing for the variant mass shift. Interestingly, there are two instances where a variant sequence was identified with highest reporter intensities in the expected cell lines but no reference peptide was found, even though MS SMiV identified suitable spectra. This highlights the sensitivity with which MS SMiV pairs spectra compared to database searches. Overall, MS SMiV identified 2,412 spectrum pairs with mass shifts of amino acid substitutions, indicating de novo variant identification from mass spectrometry data.

| Unimod Description | Monoisotopic Mass (Da) |
|---|---|
| Asp→Pro substitution | -17.974179 |
| Met→Leu/Ile substitution | -17.956421 |
| Leu/Ile→Met substitution | 17.956421 |
| Pro→Asp substitution | 17.974179 |

Table 4.9 Entries in Unimod with monoisotopic mass within ±0.02 Da of mass shift ∆mass=17.962 Da identified by MS SMiV. In total 2 amino acid substitutions are suggested by Unimod to explain the mass difference.

4.3.6 De novo identification of amino acid variants

Spectrum pairs of MS SMiV with mass shifts of amino acid substitutions that were not found by database searching indicate de novo SNP detection. Quantitation combined with manual assessment of peptide correlation heatmaps revealed the protein SLU7, for which a particular peptide is distinctly downregulated in only one cell line. At the same time, a related spectrum identified by MS SMiV exhibits downregulation or missing quantitation in all remaining cell lines (see Figure 4.20). The protein is identified in the standard database search by 16 unique peptides (34 PSMs) covering 37.9% of its sequence. MS SMiV output at <1% FDR retains 4 peptides and identifies 19 previously unidentified spectra. The peptide P3 with peptide sequence ‘GACENCGAMTHK’ reveals differential expression (extreme log2-fold changes) in cell line NCI-H716, while for the remaining samples fold changes remain around 0, indicating a variant in NCI-H716. MS SMiV identified a single spectrum (P3 ± 17.962) associated with P3 with a mass shift of 17.962 Da. Unimod lists 2 amino acid substitutions matching the mass shift between P3 and P3 ± 17.962 within an error tolerance of 0.02 Da (see Table 4.9).
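The lookup of substitutions that explain a mass shift can be reproduced from standard monoisotopic residue masses. The following Python sketch is an illustration only and does not query Unimod: the function name `candidate_substitutions` is hypothetical, and treating the sign of the shift as unknown (so that both gain and loss are tested) is an assumption derived from the ± notation used for the paired spectra.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}


def candidate_substitutions(peptide, delta_mass, tol=0.02):
    """Return (position, from, to, shift) for every single substitution in
    `peptide` whose absolute mass change matches |delta_mass| within `tol`.

    Positions are 1-based. The sign of delta_mass is ignored (assumption).
    """
    hits = []
    for pos, src in enumerate(peptide, start=1):
        for dst, dst_mass in RESIDUE_MASS.items():
            if dst == src:
                continue
            shift = dst_mass - RESIDUE_MASS[src]
            if abs(abs(shift) - abs(delta_mass)) <= tol:
                hits.append((pos, src, dst, round(shift, 6)))
    return hits


hits = candidate_substitutions("GACENCGAMTHK", 17.962)
for pos, src, dst, shift in hits:
    print(f"{src}{pos}{dst}: {shift:+.6f} Da")
```

Because the peptide contains neither aspartate nor proline, only the methionine at peptide position 9 yields candidates (to leucine or isoleucine, which are isobaric), with a calculated shift of -17.956421 Da.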

Similarly, line plots depicting peptide abundances per sample and differences between the intensities of peptides in single cell lines and the mean intensities of the remaining samples (Figure 4.21, top and bottom, respectively) highlight the peptide P3 and spectrum P3 ± 17.962 as outliers. While P3 lies at the extreme lower end of the abundance range in NCI-H716, P3 ± 17.962 shows an abundance slightly above the median of the protein and the maximum absolute difference to the reporter intensities of the other cell lines. These results, in addition to the matching substitutions from Unimod, provide strong evidence for the presence of an amino acid variant.
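The two quantities plotted in these line plots can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to generate the figures: the nested dictionary layout, the function names and the toy intensities are hypothetical, and the median is taken over the spectra of the single protein shown, which the text does not specify.

```python
from statistics import median


def median_normalise(intensities):
    """Divide each reporter intensity by the median of its cell line.

    intensities: {spectrum_id: {cell_line: intensity}} for one protein
    (hypothetical layout). Every spectrum must be quantified in each cell line.
    """
    cell_lines = sorted({cl for row in intensities.values() for cl in row})
    medians = {
        cl: median(row[cl] for row in intensities.values()) for cl in cell_lines
    }
    return {
        sid: {cl: value / medians[cl] for cl, value in row.items()}
        for sid, row in intensities.items()
    }


def leave_one_out_difference(normalised):
    """Difference between one cell line and the mean of all remaining ones."""
    result = {}
    for sid, row in normalised.items():
        total = sum(row.values())
        others = len(row) - 1  # requires at least two cell lines
        result[sid] = {cl: v - (total - v) / others for cl, v in row.items()}
    return result


# Invented toy intensities: P3 is depleted in cell line A only.
toy = {
    "P1": {"A": 100.0, "B": 100.0, "C": 100.0},
    "P2": {"A": 100.0, "B": 100.0, "C": 100.0},
    "P3": {"A": 10.0, "B": 100.0, "C": 100.0},
}
norm = median_normalise(toy)
diff = leave_one_out_difference(norm)
print(norm["P3"], diff["P3"])
```

In the toy data the depleted peptide shows a normalized abundance of 0.1 in cell line A and a strongly negative difference to the mean of the other cell lines, while unaffected peptides stay at 0.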

Looking at the identified peptide sequence ‘GACENCGAMTHK’, only one of the amino acid substitutions listed in Table 4.9 is possible: methionine to leucine or isoleucine at position 9 in the peptide. This corresponds to position 126 within the protein sequence (see Figure 4.22). No variant was identified in the custom database search for residue 126 of SLU7, nor for the NCI-H716 cell line at any position within SLU7. Furthermore, manual assessment of variants for the NCI-H716 cell line in the COSMIC database (Forbes et al., 2017) revealed no matching single nucleotide variant. Further investigation in other variant databases for SLU7 revealed a known non-synonymous single nucleotide variant in the dbSNP database (Sherry et al., 1999). The variant rs41275313 introduces a nucleotide change from T to A at position 159840985 on chromosome 5. This variant leads to the substitution of methionine at position 126 in the SLU7 protein to leucine. Even though the variant is known with evidence from the 1000 Genomes Project (The 1000 Genomes Project Consortium et al., 2010), it is not annotated for the NCI-H716 cell line. This highlights that novel and unanticipated variants can be detected through quantitative proteomics and mass tolerant spectrum pairing. Furthermore, it shows that genome and transcriptome sequencing may miss variants that affect the protein. Such cases can be resolved through proteomic analysis, which can prompt confirmatory targeted sequencing of the ambiguous genes.

[Figure 4.20: heatmap panels for cell lines SW48, NCI-H716, RKO, SNU-C5, LS-123, HCT-116, RCM-1, SW620, C2BBe1 and DIFI. Both axes list the identified peptides P1 to P4 and the MS SMiV associated spectra (P<n> ± ∆mass, in Da). Colour scale: log2(fold change) from < −6 to > 6.]

Fig. 4.20 Example of peptide-peptide correlation heatmaps for 10 cell lines in experimental batch 3 of protein SLU7 (UniProt O95391). High positive log2-fold changes are indicated by red colour, while blue depicts highly negative fold changes. Identified peptide P3 exhibits differential expression of the sequence ‘GACENCGAMTHK’ solely in cell line NCI-H716. The associated unidentified spectrum P3 ± 17.962, identified by MS SMiV with a mass shift of 17.962 Da, shows low or missing expression in all cell lines except NCI-H716, indicating an amino acid variant.

[Figure 4.21: two line plots for SLU7 with one series per cell line (SW48, NCI-H716, RKO, SNU-C5, LS-123, HCT-116, RCM-1, SW620, C2BBe1, DIFI). Top panel y-axis: Normalized Abundances. Bottom panel y-axis: Differences. The x-axis of both panels lists peptides P1 to P4 and the MS SMiV associated spectra (P<n> ± ∆mass, in Da).]

Fig. 4.21 Graphical representation of peptide quantification per cell line for SLU7. Reporter intensities normalized by the median within each sample highlight P3 as a lowly abundant peptide in the NCI-H716 cell line, while the related MS SMiV spectrum P3 ± 17.962 has similar abundance to the median (top). Additionally, all other cell lines show spectrum P3 ± 17.962 as an outlier. Relative differences of reporter intensity of one cell line compared to the mean of all other cell lines highlight the MS SMiV identified spectrum P3 ± 17.962 and the sequence database identified peptide P3 as outliers, while all other spectra exhibit high variability of differences.

Fig. 4.22 Protein sequence of SLU7 with peptide ‘GACENCGAMTHK’ highlighted at position 121 to 132 (green). The potential variant site of methionine is located at position 126.

4.4 Conclusions

In the previous chapter (Chapter 3) I have shown the application and benchmarking of MS SMiV for identification of post-translational modifications. I concluded that applying MS SMiV to identify modification mass shifts, which then inform database searching using dynamic modifications, gives the best results. In this chapter, I described the application of the blind spectrum matching tool MS SMiV to an isobaric labelled dataset of 50 colorectal cancer cell lines. While the results for post-translational modifications are equivalent in informing modification selection for database searching, MS SMiV proved to be a valuable orthogonal approach to verify amino acid variants from proteogenomic database searching and to identify novel nsSNPs.

Identification of amino acid variants in proteogenomics requires prior knowledge of genomic alterations that can be incorporated into a customized search database. Therefore, samples need to be genotyped or sequenced prior to proteogenomic analysis. The resulting larger databases reduce sensitivity and are likely to identify post-translational modifications as variants with similar mass. Application of MS SMiV on top of standard identification processing through canonical database searching enables the identification of variant and PTM spectra that relate to identified peptides. Even though the ratio of false associations can be estimated, the complexity introduced by open mass difference spectral matching is confounded by false associations between peptides and related spectra due to propagated false peptide spectrum matches and the presence of mixed spectra. Results need to be evaluated carefully through appropriate orthogonal methods.

The use of TMT reporter intensities in combination with identified and MS SMiV associated spectra enables the discrimination of spectra with mass shifts due to post-translational modifications from those due to amino acid substitutions. Results showed that post-translational modifications lead to moderate log2-fold changes between peptides of a protein and high variability between samples. Heatmaps of peptide-peptide correlation therefore appear to be sample specific, indicating unique PTM compositions and extent of modification for each sample. However, larger scale multiplexed proteomic experiments will need to be carried out to validate this theory and estimate the possibility of false positive spectral pairings.

Overall, log2-fold changes close to 0 between peptides of a given protein indicate that no variant or post-translational modification is present. However, extreme log2-fold changes for peptides that deviate from the consistent pattern across cell lines indicate the presence of an amino acid substitution. The complete absence of the peptide compared to other peptides of the same protein can reveal a non-synonymous SNP with homozygous alternative alleles. Because missing quantification can also be attributed to technical factors, identification of the corresponding variant sequence is required to confirm the presence of a SNP. Spectra matched by MS SMiV with the inverse pattern of log2-fold changes therefore point to the actual variant peptide. To date it is unclear how heterozygous events affect protein expression. MS SMiV spectral pairing in combination with peptide quantitation will allow the identification of both alleles. Similar to the opposing extreme log2-fold changes of homozygous SNPs, lower expression of both alleles of heterozygous variants will lead to opposing log2-fold changes. These will, however, not be extreme. This approach will not only enable the identification of heterozygous events, but can also facilitate their quantitation and ultimately allow the functional difference between alleles to be determined.
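The interpretation rules summarised above can be written down as a small Python sketch. It is only an illustration of the reasoning: the function names, the use of the median fold change against all other peptides, and the two thresholds (`moderate` and `extreme`) are assumptions chosen for this example and are not values used in this work.

```python
import math
from statistics import median


def log2_fold_change_matrix(abundance):
    """Pairwise log2 ratios between spectra of one protein in one cell line.

    abundance: {spectrum_id: normalised abundance}; values must be positive,
    so spectra with missing quantitation have to be handled beforehand.
    """
    ids = sorted(abundance)
    return {
        a: {b: math.log2(abundance[a] / abundance[b]) for b in ids} for a in ids
    }


def classify(matrix, moderate=1.0, extreme=3.0):
    """Label each spectrum by its median log2-fold change against the others.

    The thresholds are hypothetical and would need calibration on real data.
    """
    labels = {}
    for a, row in matrix.items():
        typical = median(fc for b, fc in row.items() if b != a)
        if abs(typical) >= extreme:
            labels[a] = "candidate substitution"
        elif abs(typical) >= moderate:
            labels[a] = "candidate PTM"
        else:
            labels[a] = "unchanged"
    return labels


# Invented abundances for one cell line: P3 is 16-fold lower than the rest.
abundance = {"P1": 1.0, "P2": 1.0, "P3": 0.0625, "P4": 1.0}
matrix = log2_fold_change_matrix(abundance)
labels = classify(matrix)
print(labels)
```

A candidate from such a screen is only a starting point: as stated above, the corresponding variant spectrum with the inverse fold change pattern is needed to confirm a substitution.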

The results of MS SMiV application and the use of TMT reporter intensities also highlight that this approach is capable of identifying novel nsSNPs. These are commonly not identified in a standard proteogenomics workflow because they are missed during genotyping or sequencing, or newly arise in the samples after genotyping and sequencing have been performed. This highlights that sample specific databases do not contain all mutations present in a sample. Therefore, MS SMiV in combination with multiplexed quantitation outperforms database search strategies that employ custom variant databases by enabling unbiased de novo identification of SNPs. While genomics and proteomics studies commonly try to identify QTLs, i.e. loci affecting a phenotype, missing variants undermine the confidence in these types of studies. Proteomic coverage and comprehensive identification of variants and post-translational modifications are necessary to enhance the identification of functional QTLs. Therefore, the combination of MS SMiV with multiplexed quantitation can inform selection of protein modifications for database searching and substitute custom databases, while confidently identifying variant peptides and verifying post-translational modifications.

4.5 Publication Note and Contributions

A manuscript for most of the work described in this chapter is currently in preparation. The data acquisition, identification and quantification as well as manual curation of variant peptides were performed by Theodoros Roumeliotis. Mapping of MS3 quantification onto MS2 spectra, calculation of peptide-peptide correlations and plot generation were conducted by Georg Pirklbauer under my supervision. With the exception of this and unless stated otherwise, the analysis described herein is work I performed myself, under the supervision of Andreas Bender and Jyoti Choudhary.

CHAPTER 5

QUANTITATIVE PROTEOGENOMICS FOR PERSONALISED MOLECULAR PROFILING

5.1 Introduction

The central dogma, starting with the transcription of a gene into messenger RNA and its subsequent translation into a protein, is the foundation on which all aspects of molecular biology are based (Crick, 1958). Variation in nucleotide sequences can therefore have a significant impact on the structure and downstream functions of genes and proteins in a cell. Recent large-scale genomic and transcriptomic efforts such as the 1000 Genomes Project (The 1000 Genomes Project Consortium et al., 2010) and the Genotype-Tissue Expression consortium (GTEx) (The GTEx Consortium, 2015) have assessed single nucleotide variants, alternative splicing events and other aberrations in the human genome. Furthermore, they associate these genomic variants with changes in gene expression and phenotype in order to identify quantitative trait loci (QTL) responsible for health, disease and disease susceptibility.

However, over the last decades various mechanisms of transcriptional and translational regulation, ranging from splicing and transcription factors to modulation of ribosome occupancy, have been implicated. Together these lead to increased complexity from static genes to the highly modulated functional proteins (Sonenberg and Hinnebusch, 2009; Tamarkin-Ben-Harush et al., 2014; Wang et al., 2015). Correlation between mRNA and protein of 0.5 suggests the involvement of transcriptional and translational regulation (reviewed in Haider and Pal, 2013; Maier et al., 2009). Additional levels of regulation, however, occur post-translationally and are not reflected at the transcript level.

Recent advances in instrumentation have allowed the successful integration of proteomics with previously published transcriptomics and genomics data focusing on transcriptional and translational regulation. Results include the identification of unique proteomic subtypes of colorectal cancer compared to those identified from transcriptomics data (Zhang et al., 2014). Variation of protein expression has further been used to assess the impact of somatic mutations on protein complexes and has led to predictive models of drug response that outperform models based on gene expression (Roumeliotis et al., 2016). High sensitivity mass spectrometry in combination with multiplexed analysis and advances in preparation methods for genomics, transcriptomics and proteomics analysis from the same sample enable full integrative analysis of different molecular levels across multiple individuals.

This chapter describes a pilot study to establish a pipeline for global and personal proteogenomics. Here the proteogenomic analysis of an integrated molecular phenotyping study of twelve osteoarthritic individuals is presented. With over 40% of individuals over the age of 70 affected by osteoarthritis (Vos et al., 2012) it is the leading cause of pain and loss of physical function (Dieppe and Lohmander, 2005). With both heritable and environmental factors contributing to susceptibility (Valdes and Spector, 2011) cartilage degeneration, thought to be induced by an imbalance of complex protein networks including proteases and cytokines (reviewed in Lee et al., 2013; Xia et al., 2014; Yuan et al., 2014), is a key feature in osteoarthritis. Due to the lack of curative therapy, symptom control is used for disease management and joint replacement surgery in severe cases. While disease tissues are inaccessible for many common complex diseases, cartilage is readily accessible at joint replacement surgery and in recent years, studies examining individual -omics levels have expanded our understanding of osteoarthritis pathogenesis (reviewed in Ramos and Meulenbelt, 2017; Reynard and Loughlin, 2013; Ruiz-Romero et al., 2015; Steinberg and Zeggini, 2016). Increasing prevalence of the disease (Lawrence et al., 1998; Vos et al., 2012) combined with multi-omics approaches to characterize the underlying molecular processes of disease progression might enable the detection of distinct variation between molecular levels and between individuals and guide personalised treatment.

In this study genotyping, RNA-sequencing and isobaric labelling-based mass spectrometry data derived from knee joint tissue from 12 osteoarthritis patients are subjected to integrative analysis to assess variation on protein, splice isoform, and on an individual basis. Matched pairs of cartilage tissue from patients undergoing joint replacement surgery were used in this study, with one sample demonstrating little or no evidence of cartilage degeneration and the other sample demonstrating advanced degenerative change. This experimental set-up allows comparison between individuals and disease states. Despite the limited number of individuals assessed in this pilot study, the analysed panel of samples was sufficient to highlight the capabilities of the pipeline and can be used to establish targeted analysis methods for individual events.

5.2 Materials and Methods

5.2.1 Data acquisition

Detailed information and protocols for patient consent, study approval, and sample processing are available in Steinberg and Zeggini (2016). In brief, samples were collected from patients undergoing total knee replacement for primary osteoarthritis at the University of Sheffield. All subjects provided written, informed consent prior to participation in the study. Patients with a diagnosis other than osteoarthritis were excluded. Half of each osteochondral sample was used for chondrocyte extraction and the remaining tissue was used for histological and immunohistochemical analysis. Cartilage tissue was graded using the OARSI scoring system at Sheffield Hallam University to determine the overall grade of cartilage as healthy/low-grade (control/C) or high-grade (disease/D) degeneration (Mankin et al., 1971; Pearson et al., 2011). Furthermore, DNA, RNA and protein were extracted from chondrocytes at Sheffield Hallam University and samples were frozen at -80°C prior to assays (see Figure 5.1). Genotyping was performed at the Wellcome Trust Sanger Institute.

Fig. 5.1 Sample extraction and assays. Cartilage was extracted during total knee replacement surgery. Samples were then divided and graded using the OARSI scoring system, representing low (control/C) and high (disease/D) grade damage. Subsequently DNA, mRNA and proteins were extracted and subjected to genotyping, RNA-sequencing and shotgun proteomics, respectively.

Sequencing of RNA was performed through Sample Management and Illumina Bespoke teams at the Wellcome Trust Sanger Institute. Poly-A tailed RNA (mRNA) was purified from total RNA-samples, subjected to cDNA library creation, and ligated to Illumina Paired-end Sequencing adapters. PCR amplification for 10 cycles was performed on libraries and samples were pooled. The multiplexed library then was sequenced on the Illumina HiSeq 2000, 75bp paired-end read length followed by quality control and data was provided as individual indexed library BAM files. Raw reads were extracted from BAM files and provided in FASTQ format.

Proteomic data were acquired by the Proteomics Mass Spectrometry group at the Wellcome Trust Sanger Institute. In short, the protein content of each sample was digested after precipitation using trypsin overnight. Samples were then labelled with TMT six-plex reagents and pooled into experiments as shown in Table 5.1, resulting in healthy/low-grade (control/C) and high-grade (disease/D) degenerated samples of single individuals with neighbouring TMT labels. Pooled samples were subjected to offline fractionation prior to LC-MS analysis on the Dionex Ultimate 3000 UHPLC system coupled with the high-resolution LTQ Orbitrap Velos mass spectrometer (Thermo Scientific). The ten most abundant multiply charged precursors within 380-1500 m/z were selected and isolated for HCD fragmentation and tandem mass spectra were acquired. Targeted precursors were dynamically excluded from further isolation and activation for 40 seconds with 10 ppm mass tolerance.

         TMT-126      TMT-127      TMT-128      TMT-129      TMT-130      TMT-131
Set 1    49C (1)      49D (3)      51C (0/1)    51D (3)      56C† (0/1)   56D† (3/4)
Set 2    60C (0/1)    60D (3/4)    61C (0)      61D (3)      66C† (0/1)   66D† (3/4)
Set 3    59C (0/1)    59D (3/4)    64C (0)      64D (3)      65C (0/1)    65D (3/4)
Set 4    52C‡ (-)     52D‡ (-)     54C (0/1)    54D (3/4)    55C (0)      55D (3/4)
†removed due to low protein/RNA sample quality
‡no RNA-sequencing data available

Table 5.1 Setup of 24 six-plex TMT labelled samples of 12 osteoarthritic individuals across 4 different experiments. Both samples per individual are assigned to neighbouring TMT-reporter channels and annotated as low-grade damaged/control (C) or high-grade damaged/disease (D) replicate using the OARSI scores (in brackets).

5.2.2 RNA-sequencing analysis

Preliminary analysis of the RNA-sequencing data by the Analytical Genomics of Complex Traits group at the Wellcome Trust Sanger Institute revealed low quality of RNA-sequencing results for 2 individuals leading to their subsequent removal from further analyses.

The RNA-sequencing reads that passed QC were realigned to the human genome assembly GRCh38 using bowtie version 2.2.3, a splice-aware alignment software tool (Langmead et al., 2009), using the --library-type option fr-firststrand. Alignment was guided by a reference transcriptome downloaded from Ensembl release 76 (Flicek et al., 2014) and only uniquely mapping reads were used for further analysis. Quantitation for each sample was achieved by read counting using htseq-count from the HTSeq tool suite (Anders et al., 2015). Quantitation of single transcripts from RNA-sequencing data remains challenging (Conesa et al., 2016); thus, only read counts for unique exons and supported splice junctions were used to verify findings from quantitative proteomics and proteogenomics analyses.

Results from RNA-sequencing read alignment and quantitation were used solely to validate proteogenomic findings. Therefore, unlike in other proteogenomic studies (reviewed in Sheynkman et al., 2016), no additional processing of the RNA-sequencing data was performed.

5.2.3 Comprehensive proteogenomic analysis pipeline

Proteogenomic analysis of the acquired tandem mass spectrometry data was conducted using multiple database search engines in OpenMS as previously described by Wright et al. (2016) and was extended with quantification as shown in Figure 5.2. Identification and quantification results were complemented by proteogenomic mapping through PoGo (see Chapter 2) and spectrum pairing for de novo variant identification using MS SMiV (see Chapter 3). Besides de novo identification of sequence variants the pipeline provides additional novel features. These include sequence based mapping of peptides to a reference genome and proteogenomic mapping of post-translational modifications to enable visualisation of PTM based quantitative trait loci. Furthermore, this output is provided in a web accessible format facilitating sharing between the genomics and proteomics communities through track hubs. Additionally, proteogenomic mapping is provided in combination with peptide quantitation, allowing comparative analysis across multiple samples and experiments in relation to genomic features. This enables novel use of quantitative data in proteogenomics based annotation, allowing identification of primary and alternative isoforms. In the pilot study presented here a total of 81 raw files comprising 899,165 MS/MS spectra were converted into mzML files and Mascot Generic Format (MGF) using OpenMS (Rost et al., 2016) prior to application of the workflow.

5.2.3.1 OpenMS identification and quantification

Fig. 5.2 Comprehensive proteogenomic workflow incorporating peptide level quantitation, variant identification, and genome mapping for visualization and comparative analysis. The central identification workflow using consensus of two database search engines, namely Mascot and MS-GF+, followed by rescoring of peptide spectrum matches is supplemented by intensity based reporting for labelled and label free quantitation. Furthermore, tandem MS spectra are subjected to MS SMiV for variant identification. Results filtered to high stringency are combined with peptide quantitation and mapped onto a reference genome before visualisation in genome browsers. Boxes with continuous borders indicate processing steps in the OpenMS framework while dashed borders highlight processes performed outside the OpenMS framework.

Peptide identification was performed for spectra in the mzML format using two database search engines, Mascot (Perkins et al., 1999) and MS-GF+ (Kim et al., 2014). Spectra were searched against a customised target protein sequence database previously used by Wright et al. (2016). The database comprises the human reference proteome, translated sequences of annotated transcripts, 3-frame translated sequences of RNA-sequencing data as well as gene predictions in combination with contaminants, resulting in 4,206,472 protein sequences. The full composition of the database with the number of protein sequences contributed by single databases is given in Table 5.2. The database was concatenated with an equal number (4,206,472) of decoy sequences generated by DecoyPyrat (Wright and Choudhary, 2016). The resulting database of 8,412,944 target and decoy sequences was used to estimate the false discovery rate (FDR) of spectrum identifications.
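As a rough illustration of how a concatenated target-decoy database supports FDR estimation, the sketch below counts target and decoy matches at or above a score threshold. It is a simplified textbook estimate and not the Percolator-based scoring used in this workflow; the function name and the (score, is_decoy) input layout are assumptions made for the example.

```python
def estimate_fdr(psms, threshold):
    """Estimate the FDR among matches scoring at or above `threshold`.

    psms: iterable of (score, is_decoy) tuples, where a higher score is better
          (hypothetical layout chosen for this example).
    Returns decoy hits divided by target hits, or 0.0 if there are no target hits.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score < threshold:
            continue
        if is_decoy:
            decoys += 1
        else:
            targets += 1
    return decoys / targets if targets else 0.0
```

With an equal number of target and decoy sequences, as in the database described here, the number of decoy hits approximates the number of false target hits at the same threshold.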

Commonly in proteogenomics the sample specific protein sequence databases derived from RNA-sequencing experiments and genotypes are incorporated into customised databases. However, here the RNA-sequencing and genotyping data were deliberately excluded to enable orthogonal validation of sequence based variation after identification by searching against the concatenated database containing sequences derived from other RNA-sequencing experiments.

Searches were performed allowing no missed cleavages, with a precursor mass tolerance of 30 ppm and a fragment ion tolerance of 0.02 Da for Mascot, and 30 ppm mass tolerance with the high-resolution instrument setting for MS-GF+. Static modifications were set as six-plex TMT tags on lysine residues (K) and N-termini as well as carbamidomethyl on cysteine (C). Oxidation of methionine (M), deamidation of asparagine (N) and glutamine (Q), and phosphorylation of serine (S), threonine (T) and tyrosine (Y) were allowed as variable modifications. Identification results of each search engine were percolated using Mascot Percolator (Wright et al., 2012) and MS-GF+ Percolator (Granholm et al., 2014). A consensus of the results was built, retaining identifications with support from both search engines. The consensus score per identification was determined as the worst peptide level FDR (q-value) between search engines. Results were filtered at 1% peptide level FDR. Peptide results with associated protein identifiers as well as precursor m/z value, retention time, intensity, and charge were exported in CSV format. Furthermore, the resulting identification files were filtered to discard peptides shorter than 7 or longer than 30 residues (see Table 5.3). While these filtering steps compromise the overall coverage and sensitivity of protein and peptide identifications, they provide highly robust and high quality identifications. Personal proteogenomics focuses on variation between individuals and therefore requires high quality identifications, in contrast to global proteogenomics analysis, which aims to increase coverage and novel identifications at the expense of robustness (see Kim et al. 2014; Wilhelm et al. 2014 and Wright et al. 2016).
Table 5.2 Composition of customised protein sequence database. The database comprises known proteins and translations of known protein-coding genes as reference sequences while also including additional translations of non-protein coding genes such as pseudogenes, non-coding RNA (ncRNA) and 5' UTR sequences. Furthermore, the database is extended with translation sequences of RNA-sequencing experiments and gene predictions. With exception of contaminants, on average the largest protein sequences are contained in the reference proteins.

The row alignment of this table is not preserved; each column is listed separately in its extracted order.
Source Type: RNA-seq; Reference; Predictions; Contaminants; Non-coding Annotation
Database: HAL; CON; MIT bodymap; Yale; Ensembl bodymap; UniProt (April 2014); Augustus (2 Versions); GENCODE (v20) CDS; GENCODE (v20) UTR; GENCODE (v20) ncRNAs; GENCODE (v20) Pseudogenes; Caltech Cufflinks (ENCODE)
Number of Proteins: 247; 8,084; 88,708; 93,246; 10,280; 73,432; 11,335; 206,225; 704,224; 255,141; 126,160; 2,629,390
Average Protein Length: 31.13; 28.52; 29.28; 24.69; 28.50; 29.12; 395.97; 373.50; 147.40; 185.87; 518.74; 199.64
Number of Full Tryptic Peptides (7 ≤ Length ≤ 30): 6,590; 72,609; 99,295; 75,233; 335,159; 405,442; 195,201; 128,193; 1,653,152; 1,639,110; 1,095,311; 4,017,182

Filter         Identification Set   Number of PSMs   Number of Unique Peptides
q-value≤1%     Mascot               248,378          60,378
q-value≤1%     MS-GF+               279,215          71,265
Consensus      Mascot∩MS-GF+        288,185          72,539
q-value≤1%     Consensus            206,358          53,336
7≤length≤30    Consensus            203,226          52,064

Table 5.3 Filtering steps applied to spectrum identifications from two search engines, Mascot and MS-GF+. Results from each search engine are individually filtered at a peptide level FDR (q-value). Filtered results are combined to form a consensus list with the worst q-value as consensus score. Consensus identifications are filtered again at ≤1% peptide level FDR to ensure high quality identifications. Lastly, a length filter is applied, discarding peptide sequences unlikely to be measured during tandem MS analysis.
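The consensus and filtering steps summarised in Table 5.3 can be sketched as follows. This is a minimal peptide-level illustration, not the OpenMS implementation: the function name and the dictionary inputs (peptide sequence mapped to q-value, one dictionary per search engine) are assumptions made for the example, whereas the actual workflow operates on peptide spectrum matches.

```python
def consensus_peptides(mascot_q, msgf_q, fdr=0.01, min_len=7, max_len=30):
    """Build a consensus list from two search engines.

    mascot_q, msgf_q: dict mapping peptide sequence -> q-value (hypothetical layout).
    A peptide is retained only if both engines report it, the worse (larger)
    of the two q-values is within `fdr`, and its length lies in the allowed range.
    Returns a dict mapping peptide sequence -> consensus (worst) q-value.
    """
    consensus = {}
    for peptide in mascot_q.keys() & msgf_q.keys():
        worst_q = max(mascot_q[peptide], msgf_q[peptide])
        if worst_q <= fdr and min_len <= len(peptide) <= max_len:
            consensus[peptide] = worst_q
    return consensus
```

Taking the worst q-value as the consensus score means that a peptide passes the final 1% filter only if it passes that filter in both search engines individually.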

Quantitation using TMT six-plex tags of all 899,165 MS/MS spectra was obtained independently from identification results in OpenMS using the IsobaricAnalyzer node. Isotope impurity correction was allowed and reporter ion intensities, precursor retention time, m/z value, precursor intensity and charge for all MS/MS spectra were exported to CSV-files. Reporter ion intensities of two individuals, i.e. four samples marked in Table 5.1, were removed due to low quality of proteome and RNA-sequencing results (Ritchie et al., 2015). The remaining reporter intensities were normalised within each set for equal sample loading. Reporter intensities within a set were weighted so that median intensities for samples were equalised (see Figure 5.3).
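The median-based normalisation within one TMT set can be sketched as below. This is an illustrative assumption-laden example rather than the code used in the study: missing values are assumed to be stored as 0, the common target is assumed to be the median of the per-channel medians, and the function name and input layout are hypothetical.

```python
from statistics import median


def normalise_channels(spectra):
    """Scale reporter channels of one TMT set so that channel medians are equal.

    spectra: list of dicts mapping channel name -> reporter intensity
             (hypothetical layout; 0 is treated as a missing value).
    Returns a new list of dicts with scaled intensities.
    """
    if not spectra:
        return []
    channels = list(spectra[0].keys())
    medians = {}
    for channel in channels:
        observed = [s[channel] for s in spectra if s[channel] > 0]
        if observed:
            medians[channel] = median(observed)
    if not medians:
        return [dict(s) for s in spectra]
    target = median(medians.values())
    # Channels without any observed value keep a scaling factor of 1.0.
    factors = {c: target / medians[c] if c in medians else 1.0 for c in channels}
    return [{c: s[c] * factors[c] for c in channels} for s in spectra]
```

Scaling each channel by a single factor preserves the relative intensities between spectra within a channel while removing differences attributable to unequal sample loading.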

Identification results and TMT reporter ion quantitation were merged using precursor information (m/z value, retention time, intensity and charge) available in both output files as distinguishing features for MS/MS spectra. Combined results were filtered to remove peptides that were only present in samples previously discarded due to low quality, by discarding all entries with no remaining quantitative information. Additionally, identifications relating to decoy and contaminant database entries were removed. Peptide quantitation was achieved through summation over reporter intensities per TMT channel of all peptide spectrum matches (PSMs), whereby modified peptides were regarded as distinct sequences.
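A minimal sketch of this merge and summation step is given below. The record layout, the field names and the rounding of precursor values to build a matching key are assumptions made for illustration; the exported CSV files of the workflow may be structured differently.

```python
from collections import defaultdict


def spectrum_key(row, digits=4):
    """Build a matching key from precursor features shared by both outputs."""
    return (
        round(row["mz"], digits),
        round(row["rt"], digits),
        round(row["intensity"], digits),
        row["charge"],
    )


def quantify_peptides(identifications, quantitations):
    """Sum reporter intensities per TMT channel over all PSMs of a peptide.

    identifications: list of dicts with precursor fields and "modified_sequence".
    quantitations: list of dicts with precursor fields and "reporters"
                   (dict mapping channel -> intensity).
    Modified peptides are kept apart because the modified sequence is the key.
    """
    reporters = {spectrum_key(q): q["reporters"] for q in quantitations}
    totals = defaultdict(lambda: defaultdict(float))
    for ident in identifications:
        channels = reporters.get(spectrum_key(ident))
        if channels is None:
            continue  # no quantitative information for this spectrum
        for channel, value in channels.items():
            totals[ident["modified_sequence"]][channel] += value
    return {peptide: dict(channels) for peptide, channels in totals.items()}
```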

5.2.3.2 Proteogenomic mapping

Fig. 5.3 Graphical representation of reporter intensity distribution in log2 in quantiles of spectra per experimental set (raw). Normalization is based on the median of the distribution to ensure equal intensity distribution between samples of each set, reflecting the equal amount of sample used for tandem MS analysis (normalized). Panels: Set 1, Set 2, Set 3, Set 4.

Peptides mapping uniquely to protein sequences of the non-reference portion of the search database, i.e. , lncRNA, 5’ UTR, gene prediction and RNA-sequencing translations, were extracted as novel identifications and subjected to manual analysis including sequence alignment to the genome and assessment of related identifications of the same protein sequence. The remaining reference identifications, i.e. UniProt and GENCODE CDS sequences, were mapped onto their respective genes and transcripts by linkage of UniProt accessions with Ensembl gene and transcript identifiers using Ensembl BioMart release 76 for human (August 2014). Peptides shared between multiple genes were removed as their reporter ion intensities represent the expression of more than a single gene and thus are deemed unreliable for further proteogenomic and personal proteomics analysis. All identifications with associated reporter intensities of the four different experiments were merged into a single file linked by peptide sequence and genes identified through single peptides were removed. Additionally, identifications were divided into classes depending on associated transcripts identified through peptides.
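The two gene-level filters described in this section, removal of peptides shared between genes and removal of genes identified by a single peptide, can be sketched as follows. The function name and the peptide-to-genes dictionary are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def filter_gene_evidence(peptide_to_genes, min_peptides=2):
    """Keep gene-specific peptides and genes supported by enough of them.

    peptide_to_genes: dict mapping peptide sequence -> set of gene identifiers
                      (hypothetical layout).
    Peptides mapping to more than one gene are dropped; genes with fewer than
    `min_peptides` remaining peptides are dropped afterwards.
    Returns a dict mapping gene identifier -> set of supporting peptides.
    """
    by_gene = defaultdict(set)
    for peptide, genes in peptide_to_genes.items():
        if len(genes) == 1:
            by_gene[next(iter(genes))].add(peptide)
    return {gene: peps for gene, peps in by_gene.items() if len(peps) >= min_peptides}
```

Applying the shared-peptide filter first means that a gene supported only by one unique and several shared peptides is treated as a single-peptide identification.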

To enable direct comparison with RNA-sequencing results used for validation of proteogenomic results, all identified peptides were mapped onto the human reference genome (GRCh38 primary assembly) through the peptide to genome mapping tool PoGo (see Chapter 2) with default parameters and GENCODE v20 transcript annotations and protein coding transcript translations. Comparisons for selected findings were made manually by loading the mapped peptides and RNA-sequencing reads in the Integrative Genomics Viewer (IGV, Thorvaldsdottir et al., 2013). An example view of annotated transcripts, mapped peptides and RNA-sequencing reads is shown in Figure 5.4. Transcripts annotated in GENCODE v20 are highlighted at the top in blue, while mapped peptides are shown underneath in red, black and grey depending on the uniqueness of their genomic mapping. Details about the colour coding of peptides mapped by PoGo are given in Chapter 2. Peptide quantitation is provided as a heatmap whereby individual samples are indicated by rows. The bottom of the IGV view highlights the coverage of single nucleotides by RNA-sequencing reads in the form of histograms.

5.2.3.3 Sequence variance detection

Fig. 5.4 Example view in IGV. Genomic coordinates are highlighted in the top panel while annotated transcripts are shown underneath (blue). Exons are indicated by thick blocks connected by lines indicating introns. In the genomic alignment of peptides, thick blocks indicate partial peptide sequences mapping to exons. Peptides spanning splice junctions highlight these by connecting lines spanning introns. Peptide quantitation in relation to genomic loci is indicated by a heatmap with high quantitative values shown in red and low abundance depicted in blue. Coverage of nucleotides in the genome by RNA-sequencing reads is highlighted through gray histograms in the bottom panel.

Fig. 5.5 A) Schematic representation of missing values and of binary reporter ion intensity patterns indicating the presence of a SNP (panel title: Schematic Representation of Intensity Pattern for SNP Detection). B) Example of an intensity pattern for individuals in set 3 (panel title: Example of Intensity Pattern). High abundance in both samples of individual 64 is regarded as the ’on’ state, while complete absence of intensities in individual 59 indicates the ’off’ state. The binary pattern in replicate samples of individuals indicates a variant in the sequence affecting co-selection and co-fragmentation for tandem MS analysis.

Genotype data, which previously had been mapped to GRCh37, were transferred to GRCh38 using the ’liftOver’ tool from UCSC (Hinrichs et al., 2006) to enable verification of de novo SNP identifications from the proteogenomics analysis pipeline. Additionally, the effects of genotyped variants were predicted using the Variant Effect Predictor (VEP) (McLaren et al., 2016) and filtered to only retain missense mutations. Genomic coordinates of peptides were intersected with missense variant coordinates to identify variant peptides. This was further extended to assess the behaviour of reporter ion intensities in relation to amino acid substitutions.
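The intersection of peptide genomic coordinates with missense variant positions can be illustrated with the short sketch below. The data layout (1-based inclusive blocks per peptide, so that peptides spanning splice junctions carry several blocks) and all names are assumptions for the example, and the linear scan is chosen for clarity rather than speed.

```python
def variant_peptides(peptides, variants):
    """Find peptides whose genomic blocks contain a missense variant position.

    peptides: dict mapping peptide name -> (chromosome, [(start, end), ...])
              with 1-based inclusive block coordinates (hypothetical layout).
    variants: list of (chromosome, position) tuples.
    Returns a dict mapping peptide name -> list of overlapping variant positions.
    """
    hits = {}
    for name, (chrom, blocks) in peptides.items():
        overlapping = [
            pos
            for v_chrom, pos in variants
            if v_chrom == chrom and any(start <= pos <= end for start, end in blocks)
        ]
        if overlapping:
            hits[name] = overlapping
    return hits
```

Checking each block separately ensures that a variant located in an intron between two blocks of a splice junction peptide is not reported.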

Non-synonymous sequence variants result in shifts in peptide mass and these commonly cause missing identifications or missing quantification in a multiplexed analysis setting such as TMT-labelled quantification (see Figure 5.5 A). Because samples are pooled after TMT labelling, these variant peptides are commonly not co-selected for fragmentation with their reference counterparts and thus cause distinct ‘on/off’ patterns of reporter ion intensities. Following the assumption that amino acid variants result in missing reporter intensities for both replicate samples of an individual, all binary permutations for reporter ion intensities within each experiment were generated. Patterns expected to be the result of amino acid variation were annotated accordingly and all patterns were compared to the reporter ion intensities per peptide using cosine similarity. Cosine similarity scores were appended to the output only if the highest scoring binary pattern was one of the expected ’on/off’ substitution patterns (see Figure 5.5 B). However, mass shifts introducing missing quantitation can also be caused by other sequence variants and post-translational modifications. Therefore, concordance between identified patterns and occurrence of missense mutations in the genotype for each individual was used to verify quantitation based SNP identifications.
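The pattern comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: missing reporter intensities are encoded as 0, neighbouring channels are assumed to hold the two samples of one individual (as in Table 5.1), and an expected substitution pattern is taken to be one in which both samples of each individual share the same state while at least one individual is ’on’ and at least one is ’off’. All function names are hypothetical.

```python
from itertools import product
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equally long numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_expected(pattern):
    """True if both channels of every individual agree and states are mixed."""
    pairs = [pattern[i:i + 2] for i in range(0, len(pattern), 2)]
    paired = all(first == second for first, second in pairs)
    return paired and 0 < sum(pattern) < len(pattern)


def best_snp_pattern(intensities):
    """Return (pattern, score) if the best binary pattern is an expected one.

    intensities: reporter intensities of one peptide in channel order.
    Returns None if the highest scoring pattern is not an expected
    'on/off' substitution pattern.
    """
    patterns = [p for p in product((0, 1), repeat=len(intensities)) if any(p)]
    best = max(patterns, key=lambda p: cosine(intensities, p))
    if not is_expected(best):
        return None
    return best, cosine(intensities, best)
```

For a six-plex experiment this enumerates 63 non-empty binary patterns, of which six satisfy the paired ’on/off’ criterion assumed here.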

5.2.3.4 MS SMiV application for de novo SNP identification

Identification of variant peptides can commonly be achieved through incorporation of variant protein sequences in the search database. However, this enlarges the search space and can lead to reduced sensitivity. Due to the large size of the utilized search database, variant sequences could not be identified using common database search strategies. The combination of blind mass tolerant spectrum pairing through MS SMiV and multiplexed TMT-based quantification facilitated the identification of variant peptides and the associated reference sequence (see Chapter 4). Therefore, MS SMiV was applied to find spectrum pairs of identified reference sequences and variant peptides.

All spectra from peptide spectrum matches that were annotated with missense variants from the genotyping data were extracted and used as query input for MS SMiV. Previously unidentified spectra were used as library input and all mass differences for amino acid substitutions were allowed as mass shifts and provided in tab separated format. The tolerance for precursor masses was set to 30 ppm and 0.02 Da were allowed as fragment tolerance. The remaining parameters were used in their default setting. For each query spectrum only matched library spectra were retained when their score had more than 3 standard deviations distance from the mean. These candidates were plotted using score and delta masses to manually determine amino acid substitutions with high scores for each query spectrum.
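The score-based selection of candidate library spectra per query can be sketched as below. The tuple layout and function name are hypothetical, and two details are assumptions not stated in the text: only scores above the mean are treated as outliers, and the sample standard deviation is used.

```python
from statistics import mean, stdev


def outlier_matches(matches, n_sd=3.0):
    """Retain library matches scoring more than `n_sd` SDs above the mean.

    matches: list of (library_id, score, delta_mass) tuples for ONE query
             spectrum (hypothetical layout).
    Returns the retained tuples in their original order.
    """
    scores = [score for _, score, _ in matches]
    if len(scores) < 2:
        return []  # a standard deviation needs at least two scores
    cutoff = mean(scores) + n_sd * stdev(scores)
    return [match for match in matches if match[1] > cutoff]
```

The retained tuples keep their delta mass, so they can be plotted by score and delta mass for the manual assignment of amino acid substitutions.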

TMT-reporter channels were extracted for each pair from the quantitation output. Reporter intensities were adjusted by division of maximum intensity to reflect relative abundance and enable comparability between the spectra in each pair. Relative reporter intensities then were manually compared to individual genotypes.
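The adjustment to relative abundance amounts to a division by the maximum reporter intensity of each spectrum, as in the following small sketch (function name and dictionary layout are assumptions for illustration).

```python
def relative_intensities(reporters):
    """Scale reporter intensities of one spectrum to its maximum channel.

    reporters: dict mapping channel name -> reporter intensity.
    Returns values between 0 and 1; all zeros if no channel has signal.
    """
    top = max(reporters.values(), default=0.0)
    return {
        channel: (value / top if top > 0 else 0.0)
        for channel, value in reporters.items()
    }
```

After this scaling the spectra of a pair can be compared channel by channel irrespective of their absolute signal.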

5.3 Results and Discussion

5.3.1 Identification results

Database searching of 899,165 tandem mass spectra using consensus of two search engines resulted in 203,226 peptide spectrum matches (PSMs) after conservative significance filtering for 1% peptide level FDR, peptide length between 7 and 30 residues and mapping to single genes. These represent 52,064 unique peptide sequences and 3,707 modified versions thereof. Of the 52,064 peptides, 160 (0.3%) were uniquely identified from the novel sequence fraction of the search database comprising translated sequences from RNA-sequencing experiments, annotated non-coding transcripts and gene predictions. Additionally, more stringent filtering (0.1% FDR) and the requirement of at least two unique peptides identifying novel translation products as used by Wright et al. (2016) resulted in no identification of novel translated sequences. The remaining 51,904 peptides derived from the reference portion of the search database were mapped to 29% (5,731) of the annotated protein coding genes. These include 10,244 (19.7%) peptides spanning exon-exon splice junctions. The application of a filter to remove single peptide identifications, commonly used in proteomics, resulted in a higher confidence set of 4,814 genes (24%). Furthermore, in total 14% (2,853) of these high confidence genes were identified in all individuals, allowing comparison across all individuals (see Table 5.4). Commonly in proteomics studies a requirement of identification in at least 50% of samples is introduced to reduce missing values and enable quantitative and statistical analysis. This requirement is well founded and has proven sufficient in many studies (Mertins et al., 2016; Zhang et al., 2014).
However, when assessing variation between individuals, and more specifically on the level of splice peptides and non-synonymous nucleotide polymorphisms, missing reporter intensities do bear valuable information. This type of information has previously been discarded through the requirement of identification and quantification in ≥50% of samples. In this study the simple requirement of at least 2 peptides to identify a gene was used and missing values were utilized for variant detection.

Number of Genes   Minimum Number of Peptides   Minimum Number of Individuals
5,731 (29%)       1                            1
4,814 (24%)       2                            1
2,853 (14%)       2                            10

Table 5.4 Summary of gene identification at differing confidence levels. Single peptide identifications result in the highest coverage of protein-coding genes in the human genome. However, these identifications have a higher error rate and therefore only genes with ≥2 peptides are regarded as identified. To increase confidence in gene identifications and enable comparison across samples and experimental batches, the requirement of identification in all samples that passed RNA-sequencing quality control can be added, resulting in a loss of 10% of identifications of protein-coding genes.

Comparison of the peptides identified in the chondrocyte samples with tissue samples from the draft human proteome maps (Kim et al., 2014; Wilhelm et al., 2014) reanalyzed by Wright et al. (2016) allowed distinction of peptides unique in the cartilage samples. In total 1,380 peptides (2.9%) were identified solely in the chondrocyte samples and were absent in the identification set (<1% peptide level FDR) of 61 human tissues described in the reanalysis of the draft human proteome maps, suggesting cell type specificity. Overall, these map to 818 genes in the dataset. However, each of the genes has previously been identified by different peptides in multiple of the 61 human tissues. The lack of tissue specific genes could be attributed to the low proteome coverage in this pilot study. A gene based identification of tissue specificity using peptide sequences does not take different isoforms into account and additionally dismisses expression of proteins. Alternative isoforms and regulation as well as co-expression of several proteins, however, might well be specific for single tissues and provide biomarkers as previously highlighted for tissue specificity of combinations of isoforms of the gene Plec in mouse (Fuchs et al., 1999). Large-scale quantitative studies of deep proteomes are therefore required in future.

5.3.2 Proteogenomic resolution of splicing with missing RNA-sequencing support

To assess whether the proteomics identification of proteins is consistent with transcript identification from RNA-sequencing reads, support of exons through mapped peptides and RNA-seq reads was investigated. Amongst the genes with a single transcript uniquely identified in the peptide data, ERO1A was found with a sequence coverage of 39.7%. This is amongst the 25% of proteins with highest expression. ERO1A is annotated with a splice variant whereby two transcripts miss an exon of length 12bp highlighted in Figure 5.6. The web-based tool Annotating principal splice isoforms (APPRIS, Rodriguez et al., 2013) predicts the transcript containing the exon as principal isoform while transcripts missing these 12bp are annotated as nonsense_mediated_decay. PoGo mapping of the peptide sequence ‘RPLNPLASGQGTSEENTFYSWLEGLCVEK’ with a length of 29 residues uniquely identifying ERO1A resulted in a peptide mapping across four exons as shown by the red peptide mapping in Figure 5.6. This includes the small second exon of length 12bp. Comparison with transcriptomics results shows that even though ERO1A is supported through RNA-sequencing (median maximum read coverage per locus of 256.5 reads) the small exon is merely supported by one to nine reads in seven of the samples. The remaining eleven samples provided no evidence in the form of mapped reads for this exon. Missing coverage of the 12bp exon in the transcriptomic data might be attributed to the complexity of read mapping and its low sensitivity to short sequences (Finotello et al., 2014). This in turn does not affect the identification and quantification of whole genes but can affect the identification of single transcripts. The identification, quantification and mapping of a reference peptide accounting for small exons (Figure 5.6) thus provides evidence for transcript translation in cases where RNA-sequencing coverage is too low to identify transcript isoforms.

Fig. 5.6 IGV view of gene ERO1A showing transcript annotations (blue), aligned peptide (red) and RNA-seq coverage (grey) for a subsection of ERO1A spanning up to 4 exons (black arrows). (A) IGV view across 4 exons. (B) IGV view zoomed to the 12bp exon. Transcript annotation reveals alternative splicing of a short exon of length 12 bp. While two isoforms (top and bottom blue lines) include the short exon, two other isoforms (middle two blue lines) exclude this (marked with red box). The peptide sequence aligned to the annotation highlights the inclusion of the short exon while RNA-sequencing reads fail to map to the 12 bp sequence.

5.3.3 Splice level variation

Following the ongoing discussion in the literature on primary transcript isoforms and the translation of several of these, the data were used to test whether proteogenomics can identify proteomic evidence of multiple isoform translation. The proteogenomic mapping of the filtered peptides to Ensembl gene and transcript identifiers, and the grouping of genes based on unique identification of transcripts, revealed three groups. The first group comprises 64.6% (3,110) of identified genes, for which all peptide identifications support multiple annotated transcripts. The second group contains 35% (1,684) of genes, for which peptides uniquely identify a single transcript. Lastly, the third group comprises 20 genes (0.4%) for which multiple transcripts are each supported by unique peptides.
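The grouping described above can be illustrated with a minimal sketch. It assumes a simplified input in which every peptide of a gene is represented by the set of transcripts of that gene it maps to; the function name `classify_gene`, the group labels and the toy identifiers are hypothetical and are not part of PoGo or of the workflow used in this chapter.

```python
def classify_gene(peptide_to_transcripts):
    """Assign a gene to one of three groups based on transcript-unique peptides.

    peptide_to_transcripts: dict mapping a peptide sequence to the set of
    transcript identifiers of this gene that the peptide maps to.
    """
    # Transcripts supported by at least one peptide that maps to them alone.
    uniquely_identified = {
        next(iter(transcripts))
        for transcripts in peptide_to_transcripts.values()
        if len(transcripts) == 1
    }
    if not uniquely_identified:
        return "shared_only"           # peptides only support multiple transcripts
    if len(uniquely_identified) == 1:
        return "single_transcript"     # exactly one transcript uniquely identified
    return "multiple_transcripts"      # two or more transcripts uniquely identified


# Toy example with hypothetical peptide and transcript names.
example = {"PEPTIDEA": {"T-001"}, "PEPTIDEB": {"T-002"}, "PEPTIDEC": {"T-001", "T-002"}}
print(classify_gene(example))
```

In this sketch a gene enters the third group as soon as two of its transcripts each have at least one peptide mapping to that transcript alone, which mirrors the criterion used for the 20 genes in Table 5.5.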

In 14 cases one transcript is identified with at least twice as many unique peptides as the other transcripts (see Table 5.5). Furthermore, in 5 cases both transcripts are identified by only a single unique peptide each. However, the inhibitor of nuclear factor kappa-B kinase interacting protein (IKBIP), a proapoptotic gene and target of p53, is a single outlier as 13 and 11 unique peptides identify two of its isoforms (see Figure 5.7). To establish whether transcripts with more unique peptides are the dominant isoforms, an online resource for the prediction of primary isoforms based on conservation (APPRIS, Rodriguez et al., 2013) was employed. The identified isoforms of IKBIP were annotated as principal and alternative isoforms, as shown in Table 5.5. Furthermore, transcripts which are not annotated as protein-coding are highlighted in the table based on their biotype, i.e. a category that reflects the known or predicted biological significance of a transcript. These are mostly ‘nonsense_mediated_decay’. The predictions of principal transcript isoforms are overall in concordance with the number of unique peptide identifications of the dominating transcript. IKBIP is the exception, with 13 and 11 unique peptides identifying the principal and an alternative isoform, respectively, indicating similar importance of both isoforms (see Figure 5.8). Despite the low coverage of this pilot phase study, 20 genes with multiple protein isoforms were identified. This shows that proteogenomics provides evidence for the expression of multiple gene isoforms at the protein level. In future, this will allow functional differences between isoforms to be identified.

Gene       Unique     Isoform         APPRIS Prediction /        Unique Peptides
           Peptides                   Transcript Biotype         per Transcript
           per Gene

PLEC       228        PLEC-001        PRINCIPAL                  5
                      PLEC-004                                   2
                      PLEC-005                                   1
                      PLEC-008                                   1
                      PLEC-015                                   1
                      PLEC-003                                   1
NARS       15         NARS-001        PRINCIPAL                  8
                      NARS-009        nonsense_mediated_decay    1
ABCE1      21         ABCE1-001       PRINCIPAL                  4
                      ABCE1-005       nonsense_mediated_decay    1
PDIA3      38         PDIA3-001       PRINCIPAL                  31
                      PDIA3-006       nonsense_mediated_decay    2
TMPO       20         TMPO-001        PRINCIPAL                  7
                      TMPO-002        ALTERNATIVE                1
ESYT1      28         ESYT1-001       PRINCIPAL                  1
                      ESYT1-004       ALTERNATIVE                1
HNRNPU     23         HNRNPU-001      PRINCIPAL                  1
                      HNRNPU-002      ALTERNATIVE                1
IKBIP      28         IKBIP-002       PRINCIPAL                  13
                      IKBIP-003       ALTERNATIVE                11
PLCD1      28         PLCD1-001       PRINCIPAL                  1
                      PLCD1-006       ALTERNATIVE                1
CARKD      13         CARKD-001       PRINCIPAL                  4
                      CARKD-201       ALTERNATIVE                2
PAFAH1B1   16         PAFAH1B1-001    PRINCIPAL                  6
                      PAFAH1B1-004                               1
IDI1       6          IDI1-004        PRINCIPAL                  5
                      IDI1-002                                   1
SDK2       21         SDK2-002        PRINCIPAL                  5
                      SDK2-004                                   1
SDHB       10         SDHB-001        PRINCIPAL                  1
                      SDHB-006                                   1
ACTN4      46         ACTN4-001       PRINCIPAL                  4
                      ACTN4-004                                  1
COL6A2     62         COL6A2-001      PRINCIPAL                  9
                      COL6A2-002                                 2
HAPLN1     42         HAPLN1-001      PRINCIPAL                  14
                      HAPLN1-007                                 2
MMP3       16         MMP3-001        PRINCIPAL                  10
                      MMP3-002                                   1
SUGT1      16         SUGT1-003       PRINCIPAL                  1
                      SUGT1-002                                  1
BGN        30         BGN-001         PRINCIPAL                  18
                      BGN-002                                    3

Table 5.5 List of genes with multiple transcripts identified through unique peptides. Genes and the number of unique peptides for the overall gene are shown beside isoform identifiers and the number of peptides uniquely identifying these. Transcripts are annotated with the APPRIS prediction for principal and alternative isoforms. Additionally, transcript biotypes are indicated where isoforms are not protein coding. Transcripts predicted as principal isoforms are consistently identified through higher numbers of peptides.

5.3.4 Alternative inclusion of 5’ exons

To determine the confidence of multiple isoform identifications through proteomics data, peptide mappings to the genome uniquely identifying transcripts were assessed with regard to the structure of the annotated transcripts. The gene plectin (PLEC), involved in orchestrating dynamic changes in cytoarchitecture and cell shape, was identified with a total of 228 peptides covering 55.3% of its sequence. A small number of peptides (9) uniquely identify 5 alternative 5’ exons. Furthermore, one peptide matches uniquely to the transcript isoform PLEC-003 due to an alternative splice donor site within the same exon, as highlighted in Figure 5.9.

Fig. 5.7 IGV view of gene IKBIP highlighting proteogenomic identification of two alternative 3’ exons. Peptide quantitation across the remaining 10 osteoarthritic individuals after RNA-sequencing quality control filtering (heatmap), ranging from low (blue) to high absolute reporter intensities (red), indicates confident identification.

Fig. 5.8 Graphical representation of relative reporter intensities for 28 peptides identifying gene IKBIP. A total of 13 peptides uniquely identify the principal (blue) isoform while 11 peptides identify the alternative (red) isoform. Relative intensities of peptides shared between transcripts are highlighted in black. Overall, peptides uniquely identifying either isoform and shared peptides show similar expression patterns.

Manual assessment of the peptide spectrum matches (PSMs) confirmed almost complete fragment ion ladders, indicating high quality of the identifications. While the identification of multiple isoforms expressed at the same time is consistent with tissue specificity previously found in mice (Fuchs et al., 1999), the exonic structure of the alternatively spliced transcripts cannot be confidently resolved through the unique peptides mapping to 5’ alternative exons (Figure 5.10). Furthermore, RNA-sequencing results lack reads uniquely identifying splice junctions between the second and the alternative first exons. Manual checks by the HAVANA team at the Wellcome Trust Sanger Institute, using additional data sources used for annotation by GENCODE, likewise could not confirm the identification of distinct isoforms as opposed to the translation of single exons.

5.3.5 De novo SNP detection

Fig. 5.9 IGV view of transcripts with an alternative splice donor site and associated splice peptides of the PLEC gene on the reverse strand (right to left). Directionality of peptide mapping is highlighted from N-term to C-term (right to left). Peptide sequences are shown in reading direction N-term to C-term (left to right). Genomic location is shown in the top section of the screenshot. The peptides share the same C-term, indicating the same splice acceptor site, but differ in their N-termini. One peptide (black) exhibits mapping to a common splice acceptor site as shown in the annotation of transcripts (blue). The second peptide (red) highlights alternative mapping to an earlier splice donor site (compare transcript ENST00000527069.2).

Fig. 5.10 IGV view of 5’ splicing of gene PLEC on the reverse strand incorporating different exons. Genomic location is shown at the top while transcript annotation is shown underneath (blue). Peptide mappings uniquely identifying single exons are highlighted at the bottom (red). While only two peptides show mapping across two exons (red lines on the right hand side marked by red box), identifying two different isoforms, the remaining peptides solely identify single 5’ exons. Missing splice junction peptides prevent confident identification of additional splice isoforms.

As shown in Chapter 4, proteogenomic analysis incorporating peptide-level isobaric labelling quantitation can be used to identify non-synonymous single nucleotide variants. Using the genotypes of individuals in this pilot study, it was assessed whether quantitative proteogenomics can further reflect individual genotypes, homozygous and heterozygous. All individuals, with the exception of individual number 52, were genotyped for a total of 542,585 known single nucleotide variants (SNVs). After remapping those to the new genome assembly GRCh38, 541,923 remained. Of these, a total of 273,164 (50.4%) sites were removed as the genotypes did not vary across all individuals or were unidentified. The remaining 268,759 (49.6%) SNVs were subjected to the online variant effect predictor VEP (McLaren et al., 2016) to determine the impact on protein sequences. For 4% (10,976) of these, variant effects were successfully predicted, ranging from synonymous substitutions to splice region variants and changes to stop and start codons. In total 34,541 consequences were predicted, with 69.5% (24,007) describing missense mutations leading to amino acid substitutions. A single SNP affecting a gene can result in multiple consequences due to its differing effects on the transcripts of that gene. This leads to an average of 3 consequences per variant with successful effect prediction.
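The removal of uninformative genotyped sites can be sketched as follows. This is only an illustration under stated assumptions: genotypes are encoded as VCF-style strings (`0/0`, `0/1`, `1/1`), `./.` marks an unidentified call, and a site is kept only when at least two distinct called genotypes are observed across individuals. The function names and the site identifiers are hypothetical.

```python
def is_informative(genotypes, missing="./."):
    """Return True if the called genotypes at a site differ between individuals.

    Assumption: a site with only missing calls, or with a single genotype
    shared by all called individuals, carries no information and is dropped.
    """
    called = {g for g in genotypes if g != missing}
    return len(called) > 1


def keep_informative(sites):
    """Filter a dict of site identifier -> list of genotypes per individual."""
    return {site: gts for site, gts in sites.items() if is_informative(gts)}


# Hypothetical toy input: three sites genotyped in three individuals.
sites = {
    "site_a": ["0/0", "0/1", "1/1"],   # varies, kept
    "site_b": ["0/0", "0/0", "0/0"],   # invariant, removed
    "site_c": ["./.", "./.", "./."],   # unidentified, removed
}
print(sorted(keep_informative(sites)))
```

The counts reported in the text are consistent with such a filter: 541,923 remapped sites minus 273,164 removed sites leaves 268,759 sites for effect prediction.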

Binary drop-out patterns of reporter ion intensities were used to determine the absence of peptides from both samples of an individual. In total, 14.8% (7,520) of modified and unmodified peptides exhibited high cosine similarity (mean = 0.952, standard deviation = 0.024) with one of the expected binary patterns. These were compared with genotype data mapped onto the identified peptides. In total, 148 peptides overlapped with 169 genotyped variants. These represent 150 variant effects, of which 74.7% (112) are annotated with the sequence ontology (SO) term ‘missense_variant’. Additionally, three predicted variant effects indicate the gain of a stop codon while 35 are predicted as synonymous substitutions. Furthermore, 148 unique peptides show cosine similarity (mean = 0.956, standard deviation = 0.025) with one of the expected on/off patterns. The SO terms associated with the mapped SNPs for the 148 unique peptides then revealed 110 ‘missense_variant’ annotations as expected, but also 32 peptides for which only ‘synonymous_variant’ annotations were found.
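The comparison of reporter intensities with binary drop-out patterns can be illustrated with a short sketch. It assumes one experiment with three individuals and two samples per individual, with channels ordered individual by individual, and it only models the case in which a single individual lacks the peptide. The function names and the intensity values are hypothetical and do not reproduce the implementation used for this analysis.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dropout_patterns(n_individuals, samples_per_individual=2):
    """Binary patterns with both samples of exactly one individual set to 0."""
    patterns = {}
    for absent in range(n_individuals):
        pattern = []
        for individual in range(n_individuals):
            value = 0.0 if individual == absent else 1.0
            pattern.extend([value] * samples_per_individual)
        patterns[absent] = pattern
    return patterns


def best_dropout_match(intensities, n_individuals, samples_per_individual=2):
    """Return (index of absent individual, cosine similarity) of the best pattern."""
    patterns = dropout_patterns(n_individuals, samples_per_individual)
    scores = {k: cosine_similarity(intensities, p) for k, p in patterns.items()}
    absent = max(scores, key=scores.get)
    return absent, scores[absent]


# Hypothetical reporter intensities: the second individual shows no signal.
absent, score = best_dropout_match([100.0, 120.0, 0.0, 0.0, 90.0, 110.0], n_individuals=3)
print(absent, round(score, 3))
```

A similarity close to 1 indicates that the peptide is quantified in all individuals except one, which is the signature expected for a homozygous alternative allele.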

Fig. 5.11 (panels Set 1 to Set 4) MS SMiV results filtered to contain only spectrum pairs ≥ 3σ per query spectrum. Each point represents a pair of query and library spectra with the related difference of precursor masses and MS SMiV similarity score. Results are enriched for distinct mass shifts highlighting masses of amino acid substitutions and post-translational modifications.

The application of MS SMiV (see Chapters 3 and 4) takes a different approach by finding previously unidentified spectra with high similarity but with mass shifts of amino acid substitutions. In total 2,163 tandem mass spectra associated with a single nucleotide variant were searched against 919,244 unidentified spectra. This resulted in over 8.8 million scored spectrum pairs. After retaining, per query spectrum, only scores that lie more than three standard deviations from the absolute median and are above zero, 0.08% (7,413) of the spectrum pairs remained (see Figure 5.11). In total these remaining spectrum pairs map to only 29 unique peptide sequences that are associated with missense mutations. While ∆mass values of the remaining spectrum pairs range between 0 and 120 Da, distinct mass shifts appear more frequently and additionally with similarly high scores. These are, for example, mass shifts around 2 Da, 15 Da, and 18 Da. Manual inspection with Unimod revealed more amino acid substitutions than post-translational modifications accounting for these mass differences. To further investigate the related spectra identified by MS SMiV, reporter intensities of spectrum pairs grouped by protein were manually assessed.
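The per-query score filter can be sketched as below. The sketch makes several assumptions that are not stated in the text: pairs are given as `(query_id, library_id, score)` tuples, the population standard deviation is used, and only scores above the median by more than three standard deviations are retained. The function name `filter_pairs` is hypothetical and is not part of MS SMiV.

```python
import statistics
from collections import defaultdict


def filter_pairs(pairs, n_sd=3.0):
    """Keep pairs scoring above zero and above median + n_sd * SD of their query."""
    by_query = defaultdict(list)
    for query_id, library_id, score in pairs:
        by_query[query_id].append((library_id, score))

    kept = []
    for query_id, hits in by_query.items():
        scores = [score for _, score in hits]
        if len(scores) < 2:
            continue  # no spread can be estimated from a single score
        threshold = statistics.median(scores) + n_sd * statistics.pstdev(scores)
        kept.extend(
            (query_id, library_id, score)
            for library_id, score in hits
            if score > 0 and score > threshold
        )
    return kept


# Hypothetical toy input: one query with 19 low-scoring pairs and one outlier.
pairs = [("q1", f"lib{i}", 0.01) for i in range(19)] + [("q1", "lib19", 0.70)]
print(filter_pairs(pairs))
```

Evaluating the threshold separately for every query spectrum means that a score is judged against the background of that spectrum only, so query spectra with generally high similarity to many library spectra do not dominate the filtered result.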

The gene PLEC was identified at the protein level by 184 peptides in total. Peptide sequences mapping with up to two nsSNPs to other proteins in the genome were removed, resulting in 119 peptides uniquely mapping to PLEC. These represent sequence coverage of 31% across all individuals in the dataset. Gene and protein expression of unique peptides show no differential regulation between control and disease samples (see Figure 5.12). Overall, individual number 59 exhibits higher reporter ion intensities compared to individuals examined in the same experiment, as shown in Figure 5.13. While 117 peptides exhibit no significant inter-individual variation, two peptides with sequences ‘AGVAAPATQVAQVTJQSVQR’ and ‘QEQAJJEEJER’ appear without reporter intensities for both samples of individual number 59. Additionally, these spectra are associated with cosine similarities of 0.98 and 0.99, respectively, with the distinct binary drop-out patterns described above.

Fig. 5.12 Fold changes of gene and protein expression between disease and control samples. Expression of mRNA shows no differential expression between disease states (mean fold change 0.83). Slight but not significant elevation of PLEC can be found in disease samples (mean fold change 1.76).

Fig. 5.13 Reporter ion intensities of 119 peptides uniquely identifying the gene PLEC within a single mass spectrometry experiment. Median values per sample highlight that the highest expression amongst the 3 individuals is present in the disease sample of individual 59.

Fig. 5.14 Reporter intensities for two spectra identified with high similarity score and a mass shift of 27.99 Da through MS SMiV. While the spectrum (reference) identified through database searching shows missing quantification for both samples of individual 59, the SMiV identified spectrum (alternative) exhibits a complementary pattern of reporter intensities. Genotype data (top) correlate with the intensity patterns, resulting in the absence of reference peptide quantification due to the presence of homozygous alternative alleles (1/1). This is replicated vice versa with homozygous reference alleles (0/0) resulting in the absence of the alternative peptide. The heterozygous individual (0/1) expresses both variants of the peptide.

Unimod Description            Monoisotopic Mass
Val→Ala substitution          -28.0313
Met→Cys substitution          -28.0313
Arg→Lys substitution          -28.006148
Pyrrolidone from proline      -27.994915
Asp→Ser substitution          -27.994915
Glu→Thr substitution          -27.994915
Ser→Asp substitution          27.994915
Thr→Glu substitution          27.994915
Formylation                   27.994915
Lys→Arg substitution          28.006148
Ala→Val substitution          28.0313
Cys→Met substitution          28.0313
di-Methylation                28.0313
Acetaldehyde +28              28.0313
Ethylation                    28.0313

Table 5.6 Entries in Unimod with monoisotopic mass within ±0.05 Da of mass shift ∆mass = 27.99 Da identified by MS SMiV. In total 4 amino acid substitutions and 4 post-translational modifications are suggested by Unimod as explanation.

All spectrum pairs with one of the two case peptides as query spectrum were extracted, and after filtering no matching library spectrum remained for the peptide sequence ‘QEQAJJEEJER’. In contrast, a single library spectrum matched the spectrum with the sequence ‘AGVAAPATQVAQVTJQSVQR’ with a score of 0.70 and a mass shift of 27.99 Da. In total 4 amino acid substitutions and 4 post-translational modifications from Unimod can explain the mass difference between the query and the library spectrum within a mass error of 0.05 Da (see Table 5.6). The complete absence of reporter intensities in both samples of individual number 59, however, indicates a systematic change to the peptide sequence as opposed to the presence of a post-translational modification. Furthermore, the complementing reporter intensities of the unidentified library spectrum support the explanation through amino acid substitution (see Figure 5.14). This finding is supported by a non-synonymous SNP identified from the individuals’ genotyping data within the genomic region of the mapped peptide. This SNP is predicted to introduce an amino acid substitution from alanine to valine with a mass difference of 28.03 Da. Genotypes of individuals for this SNP correspond with the reporter intensities of both spectra, highlighting expression of both variant forms to a lower extent in the heterozygous individual 65, as shown in Figure 5.14. Once more, this shows that quantitative proteogenomics can confidently identify nsSNPs. Furthermore, it shows that the application of MS SMiV can identify spectra of the related variant sequence. The correlation of the de novo variant identification with individual genotypes highlights the value of this method for future studies on allelic expression at the protein level.
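The lookup of candidate explanations for an observed mass shift, as summarised in Table 5.6, can be sketched with a simple tolerance search. The sketch uses only a handful of entries copied from Table 5.6 rather than the full Unimod database; the list `DELTA_MASSES` and the function name `explain_mass_shift` are hypothetical.

```python
# A few (description, monoisotopic delta mass in Da) entries taken from Table 5.6.
DELTA_MASSES = [
    ("Val->Ala substitution", -28.0313),
    ("Formylation", 27.994915),
    ("Lys->Arg substitution", 28.006148),
    ("Ala->Val substitution", 28.0313),
    ("di-Methylation", 28.0313),
]


def explain_mass_shift(delta_mass, tolerance=0.05, entries=DELTA_MASSES):
    """Return entries within +/- tolerance of delta_mass, closest first."""
    hits = [
        (description, mass, abs(mass - delta_mass))
        for description, mass in entries
        if abs(mass - delta_mass) <= tolerance
    ]
    return sorted(hits, key=lambda hit: hit[2])


for description, mass, error in explain_mass_shift(27.99):
    print(f"{description}: {mass} Da (error {error:.4f} Da)")
```

Because substitutions and modifications of near-identical mass cannot be separated by the mass shift alone, the sketch returns all candidates within the tolerance; the decision between them rests on the reporter intensity patterns and genotypes discussed in the text.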

5.3.5.1 Protein abundance variation in individuals

In standard proteomics and proteogenomics studies, only frequently identified genes or proteins present in at least 50% of the samples are used, to allow robust statistical testing and to reduce missing values. However, less stringent filtering can retain valuable information with regard to inter-individual variation. To assess the extent of differing levels of protein expression, the requirement for protein identification in more than one individual was dropped. Furthermore, the level of inter-personal variation was evaluated using inter-individual fold changes. In the data presented here the gene CFHR5, involved in complement regulation, was found solely expressed in a single individual, patient 52, as shown in Figure 5.15. While RNA-sequencing data support the absence of protein expression in other individuals, no transcriptomics data are available for patient 52. Although this could be an interesting observation, the consistent absence in other individuals in the proteomics and transcriptomics data, as well as the identification of the protein in synovial fluid in the ProteomicsDB resource (Wilhelm et al., 2014), indicates a possible contamination of the chondrocyte sample with synovial fluid.

Fig. 5.15 Reporter ion intensities of unique peptides identifying the gene CFHR5 across samples for ten osteoarthritic individuals. While no peptides for CFHR5 are identified for individuals 49, 51, 60, and 61, only a single peptide is present in samples of individuals 59, 64, and 65. All seven peptides are identified in the last experimental batch containing individuals 52, 54, and 55. The intensity discrepancy between individual 52 and the others indicates personal expression solely in individual 52.

Fold Change Between Individuals    Number of Genes
≤2                                 3,264 (68%)
>2                                 1,550 (32%)
>5                                 202 (4%)

Table 5.7 Summary table showing fractions of genes exhibiting inter-individual fold changes >2 and >5.

Fig. 5.16 Reporter ion intensities relative to the maximum in each spectrum across samples in an experimental batch. All 22 peptides (20 unique sequences) show highest relative expression in both samples of individual 64, resulting in an average fold change of 5 between samples of individual 59 and 64, and 65 and 64, respectively.

The complete absence of CFHR5 in all but one patient is an extreme case of inter-individual variation of protein expression. Overall, 1,550 (32%) genes show a >2 fold change between individuals. Furthermore, 202 (4%) exhibit >5 fold individual-specific protein expression (see Table 5.7). For example, NOS2, which is involved in many immune system processes, is identified with protein sequence coverage of 24.5% (see Figure 5.17). The protein shows a >2 fold higher expression in three individuals while two others exhibit >5 fold higher expression. Figure 5.16 shows the extreme case of >5 fold higher expression in individual 64 over patients 59 and 65 in experiment 3. Overall, gene expression from RNA-sequencing confirms the difference between individuals (Figure 5.17). No evidence for genetic variation was found in the transcriptomic and genotyping data. However, the up-regulation of NOS2 has previously been associated with osteoarthritis in mouse models (van den Berg et al., 1999). The inhibition of matrix synthesis by up-regulation of NOS2 and the individual-specific expression indicate a subtype of osteoarthritis. Thus, sets of individually regulated proteins could lead to the identification of subtypes and potentially of novel drug targets in osteoarthritis. The results highlight that one third of all identified proteins exhibit inter-individual regulation; in a few cases (∼4%) the variability is even higher. This can have a significant effect on statistical analysis for comparison between disease and healthy samples. Furthermore, the results highlight that the interesting cases of inter-individual variation are commonly filtered out, although they may be used for quality control of samples.
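The classification of genes by inter-individual fold change can be sketched as follows. The sketch assumes that one summary intensity per individual is available for a gene and that the fold change is the ratio of the highest to the lowest non-zero value; this definition, the function names and the toy intensities are assumptions made for illustration only.

```python
def max_fold_change(intensity_by_individual):
    """Ratio of highest to lowest non-zero intensity across individuals."""
    values = [v for v in intensity_by_individual.values() if v > 0]
    if len(values) < 2:
        return None  # fold change undefined with fewer than two quantified individuals
    return max(values) / min(values)


def fold_change_class(fold_change):
    """Map a fold change to the categories used in Table 5.7."""
    if fold_change > 5:
        return ">5"
    if fold_change > 2:
        return ">2"
    return "<=2"


# Hypothetical summary intensities for one gene in individuals 59, 64 and 65.
fc = max_fold_change({"59": 10.0, "64": 55.0, "65": 11.0})
print(fc, fold_change_class(fc))
```

Note that in Table 5.7 the genes with a fold change >5 are a subset of those with a fold change >2, whereas the sketch reports only the most extreme category per gene.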

5.3.6 Differential regulation of protein and RNA

Fig. 5.17 Screenshot of RNA-seq coverage, gene annotation, and aligned peptide sequences for NOS2. While the genomic location is shown at the top, RNA-seq coverage for samples of 3 individuals is shown in grey (middle section). Coverage through RNA-sequencing reads is highest in both samples of individual 64. Annotation of transcripts of NOS2 is visualised in blue and peptides covering 24.5% of the protein sequence are shown at the bottom in red and black.

Integrative multi-omics studies have found only modest correlation between gene expression and protein abundance. A contributing factor to this low concordance is translational and post-translational regulation. With quantitative proteogenomics capable of identifying multiple isoforms of genes, the question arises to what extent different isoforms are regulated at the RNA and protein level. To test this, proteogenomic and RNA-sequencing data were combined. Differential expression of RNA and protein was assessed for unique alternative splicing events. This allowed the identification of two unique peptides characterising different isoforms of HAPLN1. The mass difference between both peptides was found to be 71 Da and was identified as an additional alanine in the sequence. Proteogenomic mapping to the reference genome using PoGo revealed the same start and end coordinates for both peptides spanning a single splice junction. The sole difference between the mappings lies in a 3 bp extension of the splice accepting exon for the peptide sequence with the additional alanine. A closer look at the splice sites in the Ensembl genome browser (Yates et al., 2016) exposed the genomic sequence surrounding the acceptor sites. The sequence of the 3’ splice site follows the NAGNAG motif, also known as a tandem acceptor site, which has previously been shown to contribute to proteome plasticity and tissue specificity (Busch and Hertel, 2012; Hiller et al., 2004; Hinzpeter et al., 2010). It introduces an additional codon in one of the transcripts, accounting for the additional alanine residue highlighted in Figure 5.18.
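Tandem acceptor sites of the NAGNAG type can be located in a nucleotide sequence with a simple pattern search, as sketched below. The sequence in the example is a made-up toy string and not the HAPLN1 locus; the function name `find_nagnag` is hypothetical. A lookahead is used so that overlapping occurrences are also reported.

```python
import re

# N = any nucleotide; the lookahead allows overlapping NAGNAG matches.
NAGNAG = re.compile(r"(?=([ACGT]AG[ACGT]AG))")


def find_nagnag(sequence):
    """Return (0-based start position, motif) for every NAGNAG occurrence."""
    return [(m.start(), m.group(1)) for m in NAGNAG.finditer(sequence.upper())]


# Hypothetical intron end (lower case) followed by exon start (upper case).
print(find_nagnag("ttttcagcagGCT"))
```

Splicing at the first or the second AG of such a motif shifts the exon start by 3 bp, which keeps the reading frame intact and adds or removes a single codon, in line with the single additional residue observed here.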

Fig. 5.18 IGV view of alternative splicing in gene HAPLN1 due to the NAGNAG motif. Genomic location is shown at the top with nucleotide sequence (coloured) and translation sequence (grey) of the reverse strand indicated underneath. Annotations of transcripts of HAPLN1 (blue) show alternative splice acceptor sites for the splice site after NAG (black boxes and arrows) and NAGNAG (red boxes and arrows). Splice donor sites (blue boxes and arrows) indicate a splice site within a codon. The alternative combination of splice acceptor sites identified through peptide sequences (bottom) results in the loss of the codon for alanine (red) compared to the annotation-preferred peptide isoform (black).

Fig. 5.19 Fold change comparison between control and disease samples for 10 individuals. A) Comparison of fold changes at the gene expression and protein abundance level. Gene expression fold changes show no regulation between control and disease samples of individuals, while protein is down-regulated in disease samples. B) Fold changes between isoforms (APPRIS principal over minor isoform). RNA-sequencing reads across splice junctions exhibit an average fold change of 17 while peptides spanning exons result in a mean fold change of 4.

The comparison of protein abundance between disease state samples revealed down-regulation of HAPLN1 in high-grade degenerated chondrocytes. In contrast, transcriptomics data showed no regulation of mRNA between disease states (see Figure 5.19 A). This discordance between gene and protein expression change can be attributed to metalloproteases and post-translational regulation. HAPLN1 is a known target of metalloproteases (Lambert et al., 2014; Nguyen et al., 1993), which have previously been shown to be active in osteoarthritis (Meszaros and Malemud, 2012; Woessner and Gunja-Smith, 1991). However, fold changes between counts of splice site spanning RNA-sequencing reads and between reporter ion intensities, used to compare isoforms at the transcript and protein level, respectively, reported a 20-fold dominance of the principal isoform at the RNA-sequencing read level. On the other hand, peptide level comparison on average resulted in a 2-fold dominance of the APPRIS predicted principal isoform (Figure 5.19 B). With the exception of samples from individuals number 51 and number 54, the fold changes at the peptide level are not affected by disease state, ruling out the explanation of higher protein stability of the non-principal isoform. A more likely explanation is a regulatory mechanism leading to differential translation rates.

5.3.6.1 Isoform switching

Perturbations in the complex machinery of transcription and translation can lead to disease. These perturbations are commonly identified by differential expression analysis at the gene and protein level, but they can also affect isoforms. To test whether different isoforms are differently regulated in health and disease, quantitation of peptides and RNA-sequencing reads across splice junctions uniquely identifying different isoforms was used. The gene actinin alpha 1 (ACTN1), encoding an actin-binding protein, was identified with 41 peptides (28 unique sequences) covering 43.7% of the longest isoform sequence. Median fold changes of 1 and 1.1 at the RNA and protein level, respectively, show no significant change from low- to high-grade degraded samples (see Figure 5.20 A). However, the gene ACTN1 is described with multiple roles in different cell types, prompting a closer look at the proteogenomic mappings. Actinin alpha 1 is found in the dataset with a unique identification of a single isoform consistent with the APPRIS predicted principal isoform. The uniquely identifying peptide sequence spans a splice junction, shown as the red peptide mapping (isoform 1) in Figure 5.21. Comparison with other peptide mappings to the gene exposed a peptide sequence with the same splice acceptor site but a differing splice donor site. However, this peptide maps to exons that are part of multiple transcripts and thus the mapping is coloured in black (isoform 2) in Figure 5.21. This peptide is the product of a transcript predicted as an alternative isoform by APPRIS. While both peptides are quantified in all low- and high-grade degraded chondrocyte samples, RNA-sequencing results solely support one of the splice junctions, mapping uniquely to isoform 1.

Fold changes calculated at the peptide and RNA splice read level for isoform 1 are consistent with the overall gene fold changes, indicating a slight, statistically non-significant up-regulation in high-grade damaged samples (see Figure 5.20 B). However, isoform 2 exhibits a non-significant down-regulation at the peptide level (median fold change 0.7). Although the fold changes are not statistically significant, the shift from one isoform to another between control and disease samples highlights the use of quantitative proteogenomics to capture such events. While RNA-sequencing reads support only a single isoform, quantitative proteomics hints at an isoform switching event of ACTN1 in osteoarthritic chondrocytes.
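The isoform-level comparison can be illustrated with a minimal sketch that computes the median of per-individual disease over control ratios for each isoform and checks whether two isoforms change in opposite directions. The intensity values are invented toy numbers chosen to resemble the reported medians, and the function names are hypothetical; no statistical testing is included.

```python
import statistics


def median_fold_change(disease, control):
    """Median of paired disease/control ratios, one pair per individual."""
    ratios = [d / c for d, c in zip(disease, control) if c > 0]
    return statistics.median(ratios)


def opposite_direction(fold_change_a, fold_change_b):
    """True if one isoform is up-regulated while the other is down-regulated."""
    return (fold_change_a > 1 and fold_change_b < 1) or (
        fold_change_a < 1 and fold_change_b > 1
    )


# Hypothetical reporter intensities for three individuals per isoform.
iso1 = median_fold_change(disease=[12.0, 11.0, 13.0], control=[10.0, 10.0, 10.0])
iso2 = median_fold_change(disease=[7.0, 8.0, 6.0], control=[10.0, 10.0, 10.0])
print(round(iso1, 2), round(iso2, 2), opposite_direction(iso1, iso2))
```

Opposite directions of the median fold changes are only a hint at isoform switching; as noted in the text, the changes observed for ACTN1 are not statistically significant.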

5.4 Conclusions

Fig. 5.20 Fold changes of ACTN1 between disease and control samples for 10 individuals. A) Fold changes at the whole gene and protein level. Mean fold changes of 1.04 and 1.11 show no regulation between disease and control at the mRNA and protein level, respectively. B) Breakdown of fold changes between disease and control for ACTN1 by isoform. RNA-sequencing reads support only isoform Iso1 and the average fold change of 0.96 indicates no regulation between disease states. Peptide support of two isoforms shows median fold changes of 1.18 and 0.78 for isoforms Iso1 and Iso2, respectively. Isoform Iso1 is more dominant in disease samples, while isoform Iso2 is prevalent in control samples.

Fig. 5.21 IGV view of genomic alignment of peptides identifying alternative isoforms in ACTN1. Genomic location is indicated at the top, while annotations of transcripts of ACTN1 are shown underneath (blue). Peptides identified and mapped against the reference genome are shown at the bottom. A peptide (red) uniquely identifies the APPRIS principal isoform skipping an exon, while another peptide (black) maps to multiple other isoforms.

In this chapter, I described the application of a high stringency proteogenomics workflow with isobaric quantitation to a pilot phase study of a small cohort of osteoarthritic individuals. This stringent workflow provides high quality peptide-to-spectrum assignments enabling confident identification and quantification of variation. The results show that strict filtering provides proteomic coverage of 29% of all known protein coding genes. Although the data were searched against a customised database to facilitate identification of novel coding sequences, only previously known protein coding genes could confidently be identified. Furthermore, comparison of identified peptides with sequences found in three large scale human proteome datasets spanning 59 adult and foetal tissues highlights peptides not previously encountered. However, none of the associated genes can be classified as specific to chondrocytes. The coverage of only 29% of the human proteome indicates that high proteome coverage per tissue would be required to identify tissue specific proteins. This, however, is based on the assumption that single proteins or protein groups are solely expressed in single tissues. Alternative isoforms and varying abundance of proteins could be a key to tissue specific biomarkers. Confident calling of different isoforms from proteomics data and protein quantification could therefore aid the identification of tissue specificity on a larger scale.

The integration of proteomics with matched RNA-sequencing data derived from the same cells, as demonstrated here, can confidently identify and quantify different types of variation between individuals, disease states and molecular levels. The results show that the multiplexed quantitative approach enables the identification of proteins solely expressed in single individuals. However, the example of CFHR5 also highlights the necessity for data acquisition to be performed on the same set of samples. Missing RNA-seq data for single individuals hinders the understanding of the underlying cause of this type of variation. More consistently, the results also show higher expression of proteins and mRNA for specific individuals. Here the example of NOS2, which has previously been associated with osteoarthritis in mouse models, highlights that smaller cohorts including replicates and multi-omics coverage might be sufficient to identify subclasses of the disease. The data also show that alternative splicing events can be detected through integration of quantitative proteogenomics and transcriptomics. Identification of unexpected translation of transcripts, such as nonsense mediated decay RNA, can aid annotation efforts to refine gene and transcript models. Furthermore, the results of identification and proteogenomic mapping highlight that proteomics can identify splicing events where alignment algorithms for RNA-sequencing reads fail to map short sequences.

The pilot study presented here further shows that by integrating different omics data, insight into regulatory processes between RNA and protein can be gained. The example of cartilage-link protein highlights differences in differential expression between isoforms at RNA and protein level, indicating increased protein stability of one isoform over the other or contrasting translation efficiency. Furthermore, the proteomics data unveiled a second distinct isoform of actinin alpha 1 missed by RNA-sequencing data. Quantitative analysis then even uncovered a switching of the dominant isoform between control and disease samples. This stresses the importance of integration of quantitative proteogenomics and other omics data.

Utilizing the nature of multiplexed proteomics data and mass tolerant spectrum pairing through MS SMiV enables the identification of variant peptides in combination with their reference sequences. The blind spectrum matching prior to database searching enables the use of a single database without integration of SNPs, thus increasing sensitivity. Furthermore, missing values in quantitation, if consistent between replicates of the same individuals, can improve confidence in the identification. This allows SNP detection in a de novo manner from proteomics data. The quantitative information can in future be used for analysis of allele specific expression, enabling true personal profiling and advancing personalized medicine and targeted treatment.

Proteome coverage is critical for the success of proteogenomic analysis to give insights into novel protein-coding genomic regions, aiding annotation efforts, and to identify regulation both pre and post translation. Disease specific changes affecting protein sequence and regulation, once identified, can serve as biomarkers for patient stratification and targeted treatment. Furthermore, proteogenomics will help to narrow global analysis for QTL identification and enable targeted approaches to associate genetic and non-genetic variants with phenotypic variation.

This pilot study, despite the limited coverage, highlights the potential of proteogenomics to identify inter-individual variation and provide insight into translational and post-translational regulation. Sequence based variation can be identified using proteogenomic mapping with PoGo, and integration with quantitation enables detection of novel genetic missense variants. Furthermore, integration with RNA-sequencing can identify variation at splice level between disease states, elucidating disease mechanisms. However, the results also show that personalized studies require a thoroughly planned experimental setup with reference samples and sufficient numbers of samples and replicates to statistically assess the results. I believe that the stringent identification workflow extended with quantification, proteogenomic mapping and blind PTM and variant detection presented in this chapter will prove valuable in personal variation and precision medicine studies.

5.5 Contributions and Publication Note

Parts of the work described in this chapter have previously been presented at the 64th ASMS Conference on Mass Spectrometry and Allied Topics (oral; San Antonio) as well as Genome Informatics 2015 (poster; Cold Spring Harbor) and 2016 (poster; Hinxton). Eleftheria Zeggini at the Wellcome Trust Sanger Institute initiated the study of osteoarthritis, J. Mark Wilkinson at the University of Sheffield acquired consent of all patients and extracted the samples, Christine Le Maitre at Sheffield Hallam University extracted DNA/RNA/protein from chondrocyte samples, and Graham Ritchie and Julia Steinberg at the Wellcome Trust Sanger Institute provided RNA-sequencing and genotyping data. At the Proteomics Mass Spectrometry group at the Wellcome Trust Sanger Institute, Theodoros Roumeliotis acquired isobaric labelled proteomic mass spectrometry data, and Hendrik Weisser and James Wright created the consensus OpenMS identification workflow. Jonathan Mudge and Jennifer Harrow from the HAVANA team at the Wellcome Trust Sanger Institute manually assessed identified splice peptides in relation to GENCODE transcript annotation. Sergio Santos and Alvis Brazma at the European Bioinformatics Institute compared predominant isoforms from RNA sequencing results to additional tissues. All other analysis described herein is work I performed myself, under supervision of Andreas Bender and Jyoti Choudhary.

C. N. Schlaffner, G. R. Ritchie, R. I. Roumeliotis, J. Steinberg, C. Le Maitre, M. Wilkinson, E. Zeggini, A. Bender, and J. S. Choudhary. Quantitative proteogenomics for personalised molecular profiling, 2015. Poster presented at Genome Informatics 2015, Cold Spring Harbor, NY, October 28-31

C. N. Schlaffner, R. I. Roumeliotis, W. H., J. C. Wright, J. Mudge, S. Santos, G. R. Ritchie, J. Steinberg, A. Bender, A. Brazma, J. Harrow, C. Le Maitre, M. Wilkinson, E. Zeggini, and J. S. Choudhary. Quantitative proteogenomics for personalised molecular profiling, 2016a. 282157, Wednesday, Proceedings of the 64th ASMS Conference on Mass Spectrometry and Allied Topics, San Antonio, TX, June 5-6

C. N. Schlaffner, R. I. Roumeliotis, W. H., J. C. Wright, J. Mudge, S. Santos, G. R. Ritchie, J. Steinberg, A. Bender, A. Brazma, J. Harrow, C. Le Maitre, M. Wilkinson, E. Zeggini, and J. S. Choudhary. Quantitative proteogenomics for personalised molecular profiling, 2016b. Genome Informatics 2016, Hinxton, UK, September 19-22

Chapter 6

Concluding Remarks

The ability to amplify the abundance of nucleotide sequences has led to the use of gene expression as a proxy for protein abundance in samples. Recently, experimental methods for the large scale collection of genomics and proteomics data have greatly advanced, enabling accurate identification and quantification of transcripts and proteins. However, understanding the complex molecular mechanisms in cells and their perturbations leading to disease remains challenging.

Proteogenomics, and more generally multi-omics integration, is the method of choice to combine genomics and proteomics data to validate translation, refine existing gene annotation, identify novel protein coding regions in genomes, and elucidate the interplay between molecular levels. With increasing efforts to sequence large numbers of individuals, high throughput multi-omics integration is required to address questions of personalised molecular profiling. However, the use of proteomics data for high-throughput multi-omics integration is hindered by the lack of suitable tools and methods to achieve accurate genomic visualisation of peptides and unbiased identification of sequence variants and post-translational modifications. The work presented in this thesis attempts to address some of these issues.

Multi-omics studies are dependent on the integration of proteomics and genomics data into a single coordinate system. While various tools for peptide to genomic mapping exist, they are not built for high throughput data analysis and lack aspects crucial to proteogenomics and personal proteomics such as quantitative visualisation, mapping of post-translational modifications, and allowing for non-synonymous single nucleotide variants. Therefore, I developed a new software tool, PoGo (Schlaffner et al., 2017), to overcome the limitations of mapping basic sequence identity onto reference genomes, and described its implementation and benchmarking in Chapter 2. I have shown that PoGo outperforms other mapping tools with regard to speed by a factor of 10 while also requiring 20% less random access memory. This is of significance for mapping of large scale datasets such as the recently published draft maps of the human proteome (Kim et al., 2014; Wilhelm et al., 2014). I have also highlighted PoGo’s unique features of mapping peptides identified with up to two amino acid substitutions to a reference genome and quantitative mapping for multiple samples enabling comparative visualisation. Furthermore, I provide an additional software tool to allow easy sharing and visualisation of large scale mapped proteomics datasets in online genome browsers.

The unique feature of PoGo enabling mapping of peptides to highly similar genomic loci introduces a novel application in identifying new protein coding genomic loci. In many cases novel coding loci present with high sequence similarity to known genes. Stratifying peptides purely mapping to the novel loci and removing peptides that falsely identify the novel locus is therefore crucial. Variant enabled mapping introduces mappings of peptides to highly similar genomic loci, allowing targeted genotyping to assess the origin of such peptides before claiming a novel protein coding locus, as highlighted in the reanalysis of the draft human proteome maps (Wright et al., 2016). The visually encoded uniqueness additionally highlights confident mappings (unique to single transcripts or genes) and less confident mappings (to multiple loci), assisting in the assessment. Upstream work is required to enable PoGo to map peptides to previously unannotated genomic loci. Protein sequence databases with corresponding coding sequence (CDS) annotation can be generated during the translation of predicted or RNA-sequencing based transcripts. While the sequence database can be used in the proteogenomic database search, the combination of the database and the GTF file then allows PoGo to map peptides identified as originating from the novel sequences onto their respective genomic loci. An example of this approach is given by the identification of novel protein coding loci in human testis samples and their subsequent mapping onto genomic coordinates using manually generated annotation not present in reference annotation at the time (Weisser et al., 2016).
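The role of the CDS annotation in this mapping can be illustrated with a minimal sketch. It is not PoGo's implementation; the function name and its arguments are hypothetical, and it assumes that the CDS segments of one transcript are given as 1-based inclusive genomic intervals (as in GTF) and that the peptide position within the translated protein is already known.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive genomic coordinates (GTF convention)


def peptide_to_genomic_blocks(cds_exons: List[Interval], strand: str,
                              aa_start: int, aa_length: int) -> List[Interval]:
    """Project a peptide onto the genome through the CDS segments of a transcript.

    cds_exons: CDS segments of one transcript (hypothetical input).
    aa_start:  0-based position of the first peptide residue in the protein.
    Returns the genomic blocks covered by the peptide, sorted by coordinate.
    """
    # Walk the segments in transcription order: ascending on '+', descending on '-'.
    ordered = sorted(cds_exons, reverse=(strand == "-"))
    nt_from = 3 * aa_start           # first CDS nucleotide of the peptide (0-based)
    nt_to = nt_from + 3 * aa_length  # one past the last CDS nucleotide
    blocks = []
    consumed = 0                     # CDS nucleotides covered by earlier segments
    for start, end in ordered:
        length = end - start + 1
        lo = max(nt_from, consumed)
        hi = min(nt_to, consumed + length)
        if lo < hi:                  # the peptide overlaps this segment
            if strand == "+":
                blocks.append((start + lo - consumed, start + hi - consumed - 1))
            else:
                blocks.append((end - (hi - consumed) + 1, end - (lo - consumed)))
        consumed += length
    return sorted(blocks)


# A peptide spanning a splice junction yields two blocks, one per exon.
exons = [(101, 109), (201, 212)]
print(peptide_to_genomic_blocks(exons, "+", aa_start=2, aa_length=3))
```

A peptide that crosses an exon boundary is returned as several blocks, which is the information a genome browser needs to draw it as a spliced feature.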

The use of PoGo and the selection of its options is based on the need to map peptides and variant peptides onto genomic coordinates. In addition, PoGo provides further information on the mappings, such as uniqueness, to enable filtering for peptides uniquely identifying single transcripts or genes. It may seem that the variant enabled mapping of peptides through PoGo introduces large numbers of false mappings; however, it is an unbiased way of mapping peptides with potential amino acid variants. PoGo could be extended in the future to provide informed variant mapping of peptides using genotype information provided in VCF format. This would restrict mappings to loci where a single nucleotide variant in fact causes an amino acid substitution. However, genotyping data is not available in every proteogenomics study. Variant peptides identified from custom sequence databases constructed by incorporating variants from databases such as dbSNP (Sherry et al., 1999) are commonly used due to the lack of sample specific genotype information. Therefore, these variant peptides can only be mapped onto genomic coordinates by exhaustively mapping to all possible loci. With the future increase of proteogenomics studies with mass spectrometry and RNA-sequencing data, a more sample focused approach, as described with integration of VCF genotype information into the mapping, will lead to fewer false mappings while introducing a sample specific mapping of variant sequences.
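The contrast between exhaustive and genotype-informed variant mapping can be sketched as follows. This is an illustrative simplification at the protein level, not PoGo's algorithm: the function names are hypothetical, and the set of known substitutions is assumed to have been derived beforehand from a VCF file (that derivation is not shown).

```python
from typing import Iterator, List, Set, Tuple


def find_matches(peptide: str, protein: str,
                 max_mismatches: int = 2) -> Iterator[Tuple[int, List[int]]]:
    """Exhaustive mapping: yield (start, mismatch_offsets) for every window of
    the protein that differs from the peptide by at most max_mismatches residues."""
    n = len(peptide)
    for start in range(len(protein) - n + 1):
        mismatches = [i for i in range(n) if protein[start + i] != peptide[i]]
        if len(mismatches) <= max_mismatches:
            yield start, mismatches


def informed_matches(peptide: str, protein: str,
                     known_substitutions: Set[Tuple[int, str]],
                     max_mismatches: int = 2) -> List[Tuple[int, List[int]]]:
    """Informed mapping: keep a match only if every mismatch is explained by a
    known substitution, given as (0-based protein position, alternative residue)."""
    kept = []
    for start, mismatches in find_matches(peptide, protein, max_mismatches):
        if all((start + i, peptide[i]) in known_substitutions for i in mismatches):
            kept.append((start, mismatches))
    return kept


# Toy protein with two near-identical regions (residue 5 is I, residue 17 is L).
protein = "MSTAYIAKQRGGMSTAYLAKQR"
peptide = "STAYVAK"
print(list(find_matches(peptide, protein)))            # both regions are candidates
print(informed_matches(peptide, protein, {(5, "V")}))  # only the genotyped site remains
```

Without genotype information both regions remain candidate origins of the variant peptide; with it, only the locus carrying the observed substitution is retained.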

PoGo’s ability for rapid and memory efficient mapping of peptides and post-translational modifications to genomic coordinates in a format ready for track-hub generation has already led to the integration of the tool into the proteomics repository PRIDE (Vizcaino et al., 2016). This will enable the use of proteomics data by the genomic research community in a common and transferable format. To further enhance the wide application of PoGo, full integration into other tool suites such as OpenMS (Rost et al., 2016) or MaxQuant (Cox and Mann, 2008) will be beneficial. Additionally, extending PoGo in a manner that allows mapping peptides to genomes of even more species will further boost its usability. This will require additional work to establish chromosome and plasmid orderings as well as standardised assessment of gene, transcript and exon identifiers consistent between protein sequence databases and genome annotation files. Those extensions of PoGo will promote more multi-omics studies across conditions and species.

One of the major aspects missing from current multi-omics integration studies is the identification of post-translational modifications. The large fraction of unidentified tandem mass spectra due to low abundance of peptides, PTMs, and sequence variants leads to an underestimation of regulation post translation. In Chapter 3, I described adaptations to the blind spectrum matching tool MS SMiV. These adaptations comprise charge dependent peak picking, intensity adjustment within clusters, and parallelization allowing execution on computational clusters. The parallel execution of MS SMiV reduced runtime by ∼85%. Furthermore, in this chapter I showed the application for unbiased discovery of post-translational modifications. However, benchmarking against a mass tolerant database search (Chick et al., 2015) highlighted limitations to the estimation of false discoveries for this approach. Results indicate that the use of peptide-spectrum matches from database searches to estimate the rate of false spectrum pairs is sufficient for pairs of spectra with equal precursor mass. However, this approach is insufficient for false discovery rate estimates of spectral pairs with a mass shift. As a solution I proposed the calculation of score thresholds at 1% FDR for MS SMiV spectrum pairs of the same precursor mass. These score cut-offs should then be used for mass tolerant MS SMiV application in the same dataset. Nevertheless, MS SMiV showed high sensitivity and was able to reproduce the identification of anticipated and unanticipated mass shifts of the open mass tolerant database search.
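The proposed transfer of a score cut-off can be sketched in a few lines. The sketch assumes that each equal-mass spectrum pair carries a similarity score and a flag stating whether the database search assigned the two spectra to different peptides (a false pair); the function name and the input layout are hypothetical and do not reflect the MS SMiV code.

```python
from typing import Iterable, Optional, Tuple


def score_threshold_at_fdr(pairs: Iterable[Tuple[float, bool]],
                           max_fdr: float = 0.01) -> Optional[float]:
    """Return the lowest score cut-off at which the fraction of false pairs among
    all pairs scoring at or above the cut-off does not exceed max_fdr.

    pairs: (score, is_false) for spectrum pairs of equal precursor mass.
    Returns None if no cut-off satisfies the limit.
    """
    ordered = sorted(pairs, key=lambda p: p[0], reverse=True)
    threshold = None
    false_count = 0
    i = 0
    while i < len(ordered):
        # Pairs with identical scores are accepted or rejected together.
        j = i
        while j < len(ordered) and ordered[j][0] == ordered[i][0]:
            false_count += ordered[j][1]
            j += 1
        if false_count / j <= max_fdr:
            threshold = ordered[i][0]
        i = j
    return threshold


# Cut-off derived from equal-mass pairs, then applied to mass-shifted pairs.
equal_mass = [(0.9, False), (0.8, False), (0.7, True), (0.6, False)]
cutoff = score_threshold_at_fdr(equal_mass, max_fdr=0.1)
shifted = [(0.85, 79.966), (0.5, 15.995)]  # (score, mass shift), toy values
accepted = [p for p in shifted if cutoff is not None and p[0] >= cutoff]
print(cutoff, accepted)
```

The cut-off is only as reliable as the assumption that scores of equal-mass and mass-shifted pairs follow comparable distributions, which is the limitation discussed above.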

In future work the FDR estimation based on database search results will require substitution with more suitable methods for assessing true and false matches. Methods to generate decoy spectra from high quality spectral libraries based on initial inversion or randomization of the peptide sequence have been proposed in the literature (Lam et al., 2010; Zhang et al., 2018). However, decoy spectrum generation without a priori knowledge of the peptide sequence still needs to be developed to enable FDR estimation for blind spectrum-pairing tools such as MS SMiV. Another way to assess the probability of false matching could be the exhaustive comparison of all spectra in a dataset against all others, or subsets thereof, to adequately establish a background score distribution to which spectrum pairs with a given mass shift of interest can be compared. In the future this approach can be used to estimate the probability of post-translational modification enrichment in a mass spectrometry dataset via application of MS SMiV. Confidently identified mass shifts can then be used to inform the selection of dynamic modifications to enhance identification rates in database searching without biasing the search by researcher expectations.
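One simple way to use such a background distribution is an empirical p-value, sketched below. This is an assumption about how the comparison could be made, not a method evaluated in this thesis; the function name is hypothetical and the background scores are taken as given.

```python
from typing import Sequence


def empirical_p_value(score: float, background: Sequence[float]) -> float:
    """Fraction of background scores at least as high as the observed score.

    One is added to numerator and denominator so that the estimate is never
    zero for a finite background sample.
    """
    exceed = sum(1 for b in background if b >= score)
    return (exceed + 1) / (len(background) + 1)


# Toy background from unrelated spectrum pairs, and one observed pair score.
background_scores = [0.1, 0.2, 0.3, 0.4]
print(empirical_p_value(0.35, background_scores))
```

The resolution of the estimate is bounded by the number of background comparisons, which is one reason an exhaustive all-against-all comparison would be attractive.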

Future work should also address MS SMiV’s ability to distinguish mixed spectra from random background. Mixed spectra occur when peptide ions have similar m/z ratios and are co-selected and subsequently co-fragmented, resulting in overlapping spectra of two distinct peptide sequences. A way to address this is by simulating mixed spectra through combining spectra of distinct sequences and allowing MS SMiV to score them against their constituent spectra. In addition, intensities of the component spectra could be adapted to simulate differing abundances and therefore assess the impact of abundance in mixed spectra on MS SMiV’s ability to pair them with the correct parent spectra.
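Such a simulation could look like the following sketch. Spectra are represented as dictionaries from binned m/z to intensity, and cosine similarity is used purely as a stand-in for the MS SMiV score; both choices, and all names, are assumptions made for illustration.

```python
import math
from typing import Dict

Spectrum = Dict[int, float]  # binned m/z -> intensity


def mix(spec_a: Spectrum, spec_b: Spectrum, fraction_a: float = 0.5) -> Spectrum:
    """Simulate a mixed spectrum as a weighted sum of two component spectra.
    fraction_a models the relative abundance of the first component."""
    mixed: Spectrum = {}
    for mz, intensity in spec_a.items():
        mixed[mz] = mixed.get(mz, 0.0) + fraction_a * intensity
    for mz, intensity in spec_b.items():
        mixed[mz] = mixed.get(mz, 0.0) + (1.0 - fraction_a) * intensity
    return mixed


def cosine(a: Spectrum, b: Spectrum) -> float:
    """Cosine similarity between two binned spectra (stand-in score)."""
    dot = sum(v * b.get(mz, 0.0) for mz, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Score the mixture against both parents at different simulated abundances.
parent_a = {100: 1.0, 200: 1.0}
parent_b = {300: 1.0, 400: 1.0}
for fraction in (0.5, 0.7, 0.9):
    mixture = mix(parent_a, parent_b, fraction)
    print(fraction, round(cosine(mixture, parent_a), 3), round(cosine(mixture, parent_b), 3))
```

Sweeping the abundance fraction shows at which mixing ratio the minor component would fall below a chosen score cut-off and could no longer be paired with its parent spectrum.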

Current proteogenomic methods to identify amino acid variants use customised sequence databases that incorporate all combinations of non-synonymous single nucleotide variants into protein sequences. These methods suffer from loss of sensitivity and high error rates. In the light of large scale multi-omics studies on cancer cell lines with numerous somatic mutations and advancing efforts towards personalised multi-omics studies, it was desirable to identify amino acid substitutions from proteomics data without any prior knowledge. This was achieved by utilizing MS SMiV in combination with spectrum level quantitation on a panel of 50 colorectal cancer cell lines (Roumeliotis et al., 2016), as discussed in Chapter 4. MS SMiV provided suitable candidate spectra of variant peptides to sequences identified in a standard search against canonical proteins. Spectra of reference and variant sequences exhibited alternative extreme log2-fold changes when correlating peptides of single proteins within samples against each other. This approach allowed the successful and previously infeasible discrimination of mass shifts caused by post-translational modifications and amino acid variants. De novo identification of somatic mutations using quantitative proteomics with MS SMiV demonstrated the value of quantitation and unbiased mass tolerant spectrum pairing without relying on sample specific sequence databases that commonly result in reduced sensitivity of database search algorithms in proteogenomics.

To truly assess the accuracy and recall of MS SMiV with respect to single amino acid variants, different methods incorporating known variant sequences could be employed. Using synthetic peptides with amino acid variants as spike-ins would provide a good benchmarking opportunity to assess the recall of MS SMiV and its specificity in combination with isobaric labelling quantitation. Using differing concentrations of synthetic variant peptides would further enable future work to evaluate the sensitivity of this approach. Furthermore, this can be extended to spike-in synthetic peptides with and without post-translational modifications such as phosphorylation, ubiquitinylation and others to determine sensitivity and recall for PTMs.

The ability to identify post-translational modifications and amino acid variants by mass shifts alone also drives the need for enhanced site localization algorithms and methods. While several tools for phosphorylation and PTMs in general have been developed, they suffer from the difficulty of estimating a false localization rate (FLR) and from low recall at a given low FLR (Fermin et al., 2013; Gutenbrunner, 2016). With the anticipated increase in novel identifications of modifications, robust and accurate methods to localize the mass shifts within peptide sequences are a cornerstone for downstream analysis and future association with genomic features and phenotypes.

Spectral pairing results from MS SMiV can also be used for network based representation of mass spectrometry data. Spectra, as nodes, are linked by their similarity score, as edges. The network can then be traversed to identify variant sequences and modified versions in addition to multiply modified peptides. This holds true as long as intermediate stages of modifications between unmodified and multiply modified spectra exist in the dataset. Adding database search based identification on top of the network may identify clusters of spectra that remain unidentified while being highly similar. More targeted approaches to identify the underlying sequences may then include de novo algorithms or proteogenomic search strategies containing potential novel protein coding genomic loci.
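A minimal sketch of such a network traversal is given below. The input layout (spectrum identifiers, score, and precursor mass difference per pair) and all names are assumptions; the sketch only illustrates how intermediate spectra allow the total mass shift of a multiply modified spectrum to be accumulated along a path.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Tuple

# (spectrum_a, spectrum_b, similarity score, mass(b) - mass(a)); hypothetical layout
Pair = Tuple[str, str, float, float]


def build_graph(pairs: Iterable[Pair], min_score: float) -> Dict[str, Dict[str, float]]:
    """Spectra are nodes; pairs scoring at least min_score become edges that
    store the signed mass difference in the direction of travel."""
    graph: Dict[str, Dict[str, float]] = defaultdict(dict)
    for a, b, score, delta in pairs:
        if score >= min_score:
            graph[a][b] = delta
            graph[b][a] = -delta
    return graph


def cumulative_shifts(graph: Dict[str, Dict[str, float]], start: str) -> Dict[str, float]:
    """Breadth-first traversal from start, returning the accumulated mass shift
    of every reachable spectrum relative to the start spectrum."""
    shifts = {start: 0.0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour, delta in graph.get(node, {}).items():
            if neighbour not in shifts:
                shifts[neighbour] = shifts[node] + delta
                queue.append(neighbour)
    return shifts


# s3 is linked to the unmodified spectrum s1 only through the intermediate s2.
pairs = [("s1", "s2", 0.9, 79.966), ("s2", "s3", 0.8, 79.966),
         ("s4", "s5", 0.95, 15.995), ("s1", "s4", 0.2, 1.0)]
network = build_graph(pairs, min_score=0.5)
print(cumulative_shifts(network, "s1"))
```

If the intermediate spectrum were absent from the dataset, the doubly shifted spectrum would not be reachable from the unmodified one, which mirrors the condition stated above.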

In Chapter 5, I brought together the components in a proteogenomic pipeline for high throughput analysis that includes peptide quantitation, de novo variant detection using MS SMiV, and proteogenomic mapping to a reference genome using PoGo. This comprehensive proteogenomic pipeline was applied in a pilot study of 12 osteoarthritic individuals (10 individuals after removal of samples with low RNA-seq quality) to assess variation on protein, splice isoform, and on an individual basis. I showed where peptides resulted in the identification of transcript isoforms while support from RNA-sequencing reads was missing. I also proposed a set of genes with protein evidence of alternative splicing. Moreover, I demonstrated the use of MS SMiV combined with quantitative information to identify amino acid variants de novo and validated the findings with genotype data of individuals. I also highlighted for the first time the value of isobaric quantification in proteogenomics for the identification of alternative regulation of splice isoforms between transcriptome and proteome. I demonstrated the use of the pipeline to identify the switching of isoforms between healthy and osteoarthritic cartilage samples. Overall, this study demonstrates the value of utilising proteomics data for personal molecular profiling, enabling the identification of inter-individual variation and variation due to translational and post-translational regulation.

Scaling up this pilot study to improve coverage on transcriptome and proteome level and increase the number of individuals will allow statistical assessment and verification of the examples of variation described in Chapter 5. Despite limited proteome coverage and a limited number of individuals, this pilot study highlights the potential of personal proteogenomics. Technological advancements in RNA-sequencing and mass spectrometry have led to significant improvements with regard to coverage and depth of analysis while reducing costs. This trend is likely to continue, making multi-omics integration available in clinical settings and enabling identification of variation between individuals and across different tissues in large cohorts, similar to the efforts by the GTEx consortium on the transcriptomics level, aiding precision medicine.

Bibliography

R. Aebersold and M. Mann. Mass spectrometry-based proteomics. Nature, 422(6928): 198–207, 2003. doi: 10.1038/nature01511.

R. Aebersold, A. L. Burlingame, and R. A. Bradshaw. Western blots versus selected reaction monitoring assays: time to turn the tables? Mol Cell Proteomics, 12(9): 2381–2, 2013. doi: 10.1074/mcp.E113.031658.

A. Agarwal, D. Koppstein, J. Rozowsky, A. Sboner, L. Habegger, L. W. Hillier, R. Sasidharan, V. Reinke, R. H. Waterston, and M. Gerstein. Comparison and calibration of transcriptome data from RNA-Seq and tiling arrays. BMC Genomics, 11:383, 2010. doi: 10.1186/1471-2164-11-383.

E. Ahrne, M. Muller, and F. Lisacek. Unrestricted identification of modified proteins using MS/MS. Proteomics, 10(4):671–86, 2010. doi: 10.1002/pmic.200900502.

C. P. Albuquerque, M. B. Smolka, S. H. Payne, V. Bafna, J. Eng, and H. Zhou. A multidimensional chromatography technology for in-depth phosphoproteome analysis. Mol Cell Proteomics, 7(7):1389–96, 2008. doi: 10.1074/mcp.M700468-MCP200.

J. A. Alfaro, A. Sinha, T. Kislinger, and P. C. Boutros. Onco-proteogenomics: cancer proteomics joins forces with genomics. Nat Methods, 11(11):1107–13, 2014. doi: 10.1038/nmeth.3138.

T. Alioto, E. Picardi, R. Guigo, and G. Pesole. ASPic-GeneID: a lightweight pipeline for gene prediction and alternative isoforms detection. Biomed Res Int, 2013:502827, 2013. doi: 10.1155/2013/502827.

M. J. Alvarez, Y. Shen, F. M. Giorgi, A. Lachmann, B. B. Ding, B. H. Ye, and A. Califano. Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nat Genet, 48(8):838–47, 2016. doi: 10.1038/ng.3593.

S. Anders, P. T. Pyl, and W. Huber. HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2):166–9, 2015. doi: 10.1093/bioinformatics/btu638.

M. Askenazi, K. V. Ruggles, and D. Fenyo. PGx: Putting Peptides to BED. J Proteome Res, 15(3):795–9, 2016. doi: 10.1021/acs.jproteome.5b00870.

O. T. Avery, C. M. Macleod, and M. McCarty. Studies on the Chemical Nature of the Substance Inducing Transformation of Pneumococcal Types: Induction of Transformation by a Desoxyribonucleic Acid Fraction Isolated from Pneumococcus Type III. J Exp Med, 79(2):137–58, 1944.

C. M. Bailey, S. M. Sweet, D. L. Cunningham, M. Zeller, J. K. Heath, and H. J. Cooper. SLoMo: automated site localization of modifications from ETD/ECD mass spectra. J Proteome Res, 8(4):1965–71, 2009. doi: 10.1021/pr800917p.

M. Bantscheff, M. Schirle, G. Sweetman, J. Rick, and B. Kuster. Quantitative mass spectrometry in proteomics: a critical review. Anal Bioanal Chem, 389(4):1017–31, 2007. doi: 10.1007/s00216-007-1486-6.

M. Bantscheff, S. Lemeer, M. M. Savitski, and B. Kuster. Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present. Anal Bioanal Chem, 404(4):939–65, 2012. doi: 10.1007/s00216-012-6203-4.

S. A. Beausoleil, J. Villen, S. A. Gerber, J. Rush, and S. P. Gygi. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat Biotechnol, 24(10):1285–92, 2006. doi: 10.1038/nbt1240.

E. S. Boja and H. Rodriguez. Proteogenomic convergence for understanding cancer pathways and networks. Clin Proteomics, 11(1):22, 2014. doi: 10.1186/1559-0275-11-22.

C. E. Bonferroni. Il calcolo delle assicurazioni su gruppi di teste. Studi in onore del professore Salvatore Ortu Carboni, pages 13–60, 1935.

N. L. Bray, H. Pimentel, P. Melsted, and L. Pachter. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol, 34(5):525–7, 2016. doi: 10.1038/nbt.3519.

M. Brosch, G. I. Saunders, A. Frankish, M. O. Collins, L. Yu, J. Wright, R. Verstraten, D. J. Adams, J. Harrow, J. S. Choudhary, and T. Hubbard. Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and "resurrected" pseudogenes in the mouse genome. Genome Res, 21(5):756–67, 2011. doi: 10.1101/gr.114272.110.

P. G. Buckley, K. K. Mantripragada, M. Benetkiewicz, I. Tapia-Paez, T. Diaz De Stahl, M. Rosenquist, H. Ali, C. Jarbo, C. De Bustos, C. Hirvela, B. Sinder Wilen, I. Fransson, C. Thyr, B. I. Johnsson, C. E. Bruder, U. Menzel, M. Hergersberg, N. Mandahl, E. Blennow, A. Wedell, D. M. Beare, J. E. Collins, I. Dunham, D. Albertson, D. Pinkel, B. C. Bastian, A. F. Faruqi, R. S. Lasken, K. Ichimura, V. P. Collins, and J. P. Dumanski. A full-coverage, high-resolution human chromosome 22 genomic microarray for clinical and research applications. Hum Mol Genet, 11(25):3221–9, 2002.

M. K. Bunger, B. J. Cargile, J. R. Sevinsky, E. Deyanova, N. A. Yates, R. C. Hendrickson, and J. Stephenson, J. L. Detection and validation of non-synonymous coding SNPs from orthogonal analysis of shotgun proteomics data. J Proteome Res, 6(6):2331–40, 2007. doi: 10.1021/pr0700908.

A. Busch and K. J. Hertel. Extensive regulation of NAGNAG alternative splicing: new tricks for the spliceosome? Genome Biol, 13(2):143, 2012. doi: 10.1186/gb-2012-13-2-143.

N. Cabezas-Wallscheid, D. Klimmeck, J. Hansson, D. B. Lipka, A. Reyes, Q. Wang, D. Weichenhan, A. Lier, L. von Paleske, S. Renders, P. Wunsche, P. Zeisberger, D. Brocks, L. Gu, C. Herrmann, S. Haas, M. A. Essers, B. Brors, R. Eils, W. Huber, M. D. Milsom, C. Plass, J. Krijgsveld, and A. Trumpp. Identification of regulatory networks in HSCs and their immediate progeny via integrated proteome, transcriptome, and DNA methylome analysis. Cell Stem Cell, 15(4):507–22, 2014. doi: 10.1016/j.stem.2014.07.005.

C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T. L. Madden. BLAST+: architecture and applications. BMC Bioinformatics, 10:421, 2009. doi: 10.1186/1471-2105-10-421.

N. E. Castellana, S. H. Payne, Z. Shen, M. Stanke, V. Bafna, and S. P. Briggs. Discovery and revision of Arabidopsis genes by proteogenomics. Proc Natl Acad Sci U S A, 105(52):21034–8, 2008. doi: 10.1073/pnas.0811066106.

N. E. Castellana, Z. Shen, Y. He, J. W. Walley, C. J. Cassidy, S. P. Briggs, and V. Bafna. An automated proteogenomic method uses mass spectrometry to reveal novel genes in Zea mays. Mol Cell Proteomics, 13(1):157–67, 2014. doi: 10.1074/mcp.M113.031260.

C. Cenik, E. S. Cenik, G. W. Byeon, F. Grubert, S. I. Candille, D. Spacek, B. Alsallakh, H. Tilgner, C. L. Araya, H. Tang, E. Ricci, and M. P. Snyder. Integrative analysis of RNA, translation, and protein levels reveals distinct regulatory variation across humans. Genome Res, 25(11):1610–21, 2015. doi: 10.1101/gr.193342.115.

A. J. Cesnik, M. R. Shortreed, G. M. Sheynkman, B. L. Frey, and L. M. Smith. Human Proteomic Variation Revealed by Combining RNA-Seq Proteogenomics and Global Post-Translational Modification (G-PTM) Search Strategy. J Proteome Res, 15(3):800–8, 2016. doi: 10.1021/acs.jproteome.5b00817.

R. Chen and M. Snyder. : personalized medicine for the future? Curr Opin Pharmacol, 12(5):623–8, 2012. doi: 10.1016/j.coph.2012.07.011.

R. Chen and M. Snyder. Promise of personalized omics to precision medicine. Wiley Interdiscip Rev Syst Biol Med, 5(1):73–82, 2013. doi: 10.1002/wsbm.1198.

R. Chen, G. I. Mias, J. Li-Pook-Than, L. Jiang, H. Y. Lam, R. Chen, E. Miriami, K. J. Karczewski, M. Hariharan, F. E. Dewey, Y. Cheng, M. J. Clark, H. Im, L. Habegger, S. Balasubramanian, M. O’Huallachain, J. T. Dudley, S. Hillenmeyer, R. Haraksingh, D. Sharon, G. Euskirchen, P. Lacroute, K. Bettinger, A. P. Boyle, M. Kasowski, F. Grubert, S. Seki, M. Garcia, M. Whirl-Carrillo, M. Gallardo, M. A. Blasco, P. L. Greenberg, P. Snyder, T. E. Klein, R. B. Altman, A. J. Butte, E. A. Ashley, M. Gerstein, K. C. Nadeau, H. Tang, and M. Snyder. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell, 148(6):1293–307, 2012. doi: 10.1016/j.cell.2012.02.009.

J. M. Chick, D. Kolippakkam, D. P. Nusinow, B. Zhai, R. Rad, E. L. Huttlin, and S. P. Gygi. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat Biotechnol, 33(7):743–9, 2015. doi: 10.1038/nbt.3267.

J. M. Chick, S. C. Munger, P. Simecek, E. L. Huttlin, K. Choi, D. M. Gatti, N. Raghupathy, K. L. Svenson, G. A. Churchill, and S. P. Gygi. Defining the consequences of genetic variation on a proteome-wide scale. Nature, 534(7608):500–5, 2016. doi: 10.1038/nature18270.

S. Choi, H. Kim, and E. Paek. ACTG: novel peptide mapping onto gene models. Bioinformatics, 2016. doi: 10.1093/bioinformatics/btw787.

P. J. Cock, C. J. Fields, N. Goto, M. L. Heuer, and P. M. Rice. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res, 38(6):1767–71, 2010. doi: 10.1093/nar/gkp1137.

J. Colinge, A. Masselot, M. Giron, T. Dessingy, and J. Magnin. OLAV: towards high-throughput tandem mass spectrometry data identification. Proteomics, 3(8): 1454–63, 2003. doi: 10.1002/pmic.200300485.

F. S. Collins, E. D. Green, A. E. Guttmacher, M. S. Guyer, and U. S. N. H. G. R. Institute. A vision for the future of genomics research. Nature, 422(6934):835–47, 2003. doi: 10.1038/nature01626.

A. Conesa, P. Madrigal, S. Tarazona, D. Gomez-Cabrero, A. Cervera, A. McPherson, M. W. Szczesniak, D. J. Gaffney, L. L. Elo, X. Zhang, and A. Mortazavi. A survey of best practices for RNA-seq data analysis. Genome Biol, 17:13, 2016. doi: 10.1186/s13059-016-0881-8.

J. Cox and M. Mann. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol, 26(12):1367–72, 2008. doi: 10.1038/nbt.1511.

R. Craig, J. C. Cortens, D. Fenyo, and R. C. Beavis. Using annotated peptide mass spectrum libraries for protein identification. J Proteome Res, 5(8):1843–9, 2006. doi: 10.1021/pr0602085.

D. M. Creasy and J. S. Cottrell. Error tolerant searching of uninterpreted tandem mass spectrometry data. Proteomics, 2(10):1426–34, 2002. doi: 10.1002/1615-9861(200210)2:10<1426::AID-PROT1426>3.0.CO;2-5.

D. M. Creasy and J. S. Cottrell. Unimod: Protein modifications for mass spectrometry. Proteomics, 4(6):1534–6, 2004. doi: 10.1002/pmic.200300744.

F. H. Crick. On protein synthesis. In Symp Soc Exp Biol, volume 12, page 8, 1958.

R. de Sousa Abreu, L. O. Penalva, E. M. Marcotte, and C. Vogel. Global signatures of protein and mRNA expression levels. Mol Biosyst, 5(12):1512–26, 2009. doi: 10.1039/b908315d.

F. Desiere, E. W. Deutsch, N. L. King, A. I. Nesvizhskii, P. Mallick, J. Eng, S. Chen, J. Eddes, S. N. Loevenich, and R. Aebersold. The PeptideAtlas project. Nucleic Acids Res, 34(Database issue):D655–8, 2006. doi: 10.1093/nar/gkj040.

L. V. DeSouza and K. W. Siu. Mass spectrometry-based quantification. Clin Biochem, 46(6):421–31, 2013. doi: 10.1016/j.clinbiochem.2012.10.025.

E. W. Deutsch. Tandem mass spectrometry spectral libraries and library searching. Methods Mol Biol, 696:225–32, 2011. doi: 10.1007/978-1-60761-987-1_13.

E. W. Deutsch, J. P. Albar, P. A. Binz, M. Eisenacher, A. R. Jones, G. Mayer, G. S. Omenn, S. Orchard, J. A. Vizcaino, and H. Hermjakob. Development of data representation standards by the human proteome organization proteomics standards initiative. J Am Med Inform Assoc, 22(3):495–506, 2015. doi: 10.1093/jamia/ocv001.

P. A. Dieppe and L. S. Lohmander. Pathogenesis and management of pain in osteoarthritis. Lancet, 365(9463):965–73, 2005. doi: 10.1016/S0140-6736(05)71086-2.

A. Dobin, C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, and T. R. Gingeras. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, 2013. doi: 10.1093/bioinformatics/bts635.

T. A. Down, M. Piipari, and T. J. Hubbard. Dalliance: interactive genome viewing on the web. Bioinformatics, 27(6):889–90, 2011. doi: 10.1093/bioinformatics/btr020.

J. K. Eng, A. L. McCormack, and J. R. Yates. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom, 5(11):976–89, 1994. doi: 10.1016/1044-0305(94)80016-2.

P. G. Engstrom, T. Steijger, B. Sipos, G. R. Grant, A. Kahles, G. Ratsch, N. Goldman, T. J. Hubbard, J. Harrow, R. Guigo, P. Bertone, and RGASP Consortium. Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods, 10(12):1185–91, 2013. doi: 10.1038/nmeth.2722.

J. B. Fenn, M. Mann, C. K. Meng, S. F. Wong, and C. M. Whitehouse. Electrospray ionization for mass spectrometry of large biomolecules. Science, 246(4926):64–71, 1989.

D. Fermin, S. J. Walmsley, A. C. Gingras, H. Choi, and A. I. Nesvizhskii. LuciPHOr: algorithm for phosphorylation site localization with false localization rate estimation using modified target-decoy approach. Mol Cell Proteomics, 12(11):3409–19, 2013. doi: 10.1074/mcp.M113.028928.

D. Fermin, D. Avtonomov, H. Choi, and A. I. Nesvizhskii. LuciPHOr2: site localization of generic post-translational modifications from tandem mass spectrometry data. Bioinformatics, 31(7):1141–3, 2015. doi: 10.1093/bioinformatics/btu788.

F. Finotello, E. Lavezzo, L. Bianco, L. Barzon, P. Mazzon, P. Fontana, S. Toppo, and B. Di Camillo. Reducing bias in RNA sequencing data: a novel approach to compute counts. BMC Bioinformatics, 15 Suppl 1:S7, 2014. doi: 10.1186/1471-2105-15-S1-S7.

P. Flicek, M. R. Amode, D. Barrell, K. Beal, K. Billis, S. Brent, D. Carvalho-Silva, P. Clapham, G. Coates, S. Fitzgerald, L. Gil, C. G. Giron, L. Gordon, T. Hourlier, S. Hunt, N. Johnson, T. Juettemann, A. K. Kahari, S. Keenan, E. Kulesha, F. J. Martin, T. Maurel, W. M. McLaren, D. N. Murphy, R. Nag, B. Overduin, M. Pignatelli, B. Pritchard, E. Pritchard, H. S. Riat, M. Ruffier, D. Sheppard, K. Taylor, A. Thormann, S. J. Trevanion, A. Vullo, S. P. Wilder, M. Wilson, A. Zadissa, B. L. Aken, E. Birney, F. Cunningham, J. Harrow, J. Herrero, T. J. Hubbard, R. Kinsella, M. Muffato, A. Parker, G. Spudich, A. Yates, D. R. Zerbino, and S. M. Searle. Ensembl 2014. Nucleic Acids Res, 42(Database issue):D749–55, 2014. doi: 10.1093/nar/gkt1196.

S. A. Forbes, D. Beare, P. Gunasekaran, K. Leung, N. Bindal, H. Boutselakis, M. Ding, S. Bamford, C. Cole, S. Ward, C. Y. Kok, M. Jia, T. De, J. W. Teague, M. R. Stratton, U. McDermott, and P. J. Campbell. COSMIC: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic Acids Res, 43(Database issue):D805–11, 2015. doi: 10.1093/nar/gku1075.

S. A. Forbes, D. Beare, H. Boutselakis, S. Bamford, N. Bindal, J. Tate, C. G. Cole, S. Ward, E. Dawson, L. Ponting, R. Stefancsik, B. Harsha, C. Y. Kok, M. Jia, H. Jubb, Z. Sondka, S. Thompson, T. De, and P. J. Campbell. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res, 45(D1):D777–D783, 2017. doi: 10.1093/nar/gkw1121.

M. L. Fournier, A. Paulson, N. Pavelka, A. L. Mosley, K. Gaudenz, W. D. Bradford, E. Glynn, H. Li, M. E. Sardiu, B. Fleharty, C. Seidel, L. Florens, and M. P. Washburn. Delayed correlation of mRNA and protein expression in rapamycin-treated cells and a role for Ggc1 in cellular sensitivity to rapamycin. Mol Cell Proteomics, 9(2):271–84, 2010. doi: 10.1074/mcp.M900415-MCP200.

A. M. Frank, M. M. Savitski, M. L. Nielsen, R. A. Zubarev, and P. A. Pevzner. De novo peptide sequencing and identification with precision mass spectrometry. J Proteome Res, 6(1):114–23, 2007. doi: 10.1021/pr060271u.

W. M. Freeman, S. J. Walker, and K. E. Vrana. Quantitative RT-PCR: pitfalls and potential. Biotechniques, 26(1):112–22, 124–5, 1999.

B. Frewen and M. J. MacCoss. Using BiblioSpec for creating and searching tandem MS peptide libraries. Curr Protoc Bioinformatics, Chapter 13:Unit 13 7, 2007. doi: 10.1002/0471250953.bi1307s20.

P. Fuchs, M. Zorer, G. A. Rezniczek, D. Spazierer, S. Oehler, M. J. Castanon, R. Hauptmann, and G. Wiche. Unusual 5’ transcript complexity of plectin isoforms: novel tissue-specific exons modulate actin binding activity. Hum Mol Genet, 8(13):2461–72, 1999.

M. Garber, M. G. Grabherr, M. Guttman, and C. Trapnell. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods, 8(6):469–77, 2011. doi: 10.1038/nmeth.1613.

D. Gawron, K. Gevaert, and P. Van Damme. The proteome under translational control. Proteomics, 14(23-24):2647–62, 2014. doi: 10.1002/pmic.201400165.

F. Ghali, R. Krishna, S. Perkins, A. Collins, D. Xia, J. Wastling, and A. R. Jones. ProteoAnnotator–open source proteogenomics annotation software supporting PSI standards. Proteomics, 14(23-24):2731–41, 2014. doi: 10.1002/pmic.201400265.

V. Granholm, S. Kim, J. C. Navarro, E. Sjolund, R. D. Smith, and L. Kall. Fast and accurate database searches with MS-GF+Percolator. J Proteome Res, 13(2):890–7, 2014. doi: 10.1021/pr400937n.

J. Greaves and L. H. Chamberlain. New links between S-acylation and cancer. J Pathol, 233(1):4–6, 2014. doi: 10.1002/path.4339.

D. Greenbaum, C. Colangelo, K. Williams, and M. Gerstein. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol, 4(9):117, 2003. doi: 10.1186/gb-2003-4-9-117.

J. Griss. Spectral library searching in proteomics. Proteomics, 16(5):729–40, 2016. doi: 10.1002/pmic.201500296.

J. Griss, Y. Perez-Riverol, S. Lewis, D. L. Tabb, J. A. Dianes, N. Del-Toro, M. Rurik, M. W. Walzer, O. Kohlbacher, H. Hermjakob, R. Wang, and J. A. Vizcaino. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat Methods, 13(8):651–656, 2016. doi: 10.1038/nmeth.3902.

R. Guigo, S. Knudsen, N. Drake, and T. Smith. Prediction of gene structure. J Mol Biol, 226(1):141–57, 1992.

Y. Guo, P. Xiao, S. Lei, F. Deng, G. G. Xiao, Y. Liu, X. Chen, L. Li, S. Wu, Y. Chen, H. Jiang, L. Tan, J. Xie, X. Zhu, S. Liang, and H. Deng. How is mRNA expression predictive for protein expression? A correlation study on human circulating monocytes. Acta Biochim Biophys Sin (Shanghai), 40(5):426–36, 2008.

P. Gutenbrunner. A Computational Proteomics Pipeline for Phosphorylation Site Localisation. Hagenberg, 2016.

S. Haider and R. Pal. Integrated analysis of transcriptomic and proteomic data. Curr Genomics, 14(2):91–110, 2013. doi: 10.2174/1389202911314020003.

A. Hamosh, A. F. Scott, J. Amberger, D. Valle, and V. A. McKusick. Online Mendelian Inheritance in Man (OMIM). Hum Mutat, 15(1):57–61, 2000. doi: 10.1002/(SICI)1098-1004(200001)15:1<57::AID-HUMU12>3.0.CO;2-G.

K. K. Han and A. Martinage. Post-translational chemical modification(s) of proteins. Int J Biochem, 24(1):19–28, 1992.

J. Harrow, F. Denoeud, A. Frankish, A. Reymond, C. K. Chen, J. Chrast, J. Lagarde, J. G. Gilbert, R. Storey, D. Swarbreck, C. Rossier, C. Ucla, T. Hubbard, S. E. Antonarakis, and R. Guigo. GENCODE: producing a reference annotation for ENCODE. Genome Biol, 7 Suppl 1:S4 1–9, 2006. doi: 10.1186/gb-2006-7-s1-s4.

J. Harrow, A. Frankish, J. M. Gonzalez, E. Tapanari, M. Diekhans, F. Kokocinski, B. L. Aken, D. Barrell, A. Zadissa, S. Searle, I. Barnes, A. Bignell, V. Boychenko, T. Hunt, M. Kay, G. Mukherjee, J. Rajan, G. Despacio-Reyes, G. Saunders, C. Steward, R. Harte, M. Lin, C. Howald, A. Tanzer, T. Derrien, J. Chrast, N. Walters, S. Balasubramanian, B. Pei, M. Tress, J. M. Rodriguez, I. Ezkurdia, J. van Baren, M. Brent, D. Haussler, M. Kellis, A. Valencia, A. Reymond, M. Gerstein, R. Guigo, and T. J. Hubbard. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res, 22(9):1760–74, 2012. doi: 10.1101/gr.135350.111.

C. Has, S. A. Lashin, A. V. Kochetov, and J. Allmer. PGMiner reloaded, fully automated proteogenomic annotation tool linking genomes to proteomes. J Integr Bioinform, 13(4):293, 2016. doi: 10.2390/biecoll-jib-2016-293.

Z. He, T. Huang, X. Liu, P. Zhu, B. Teng, and S. Deng. Protein inference: A protein quantification perspective. Comput Biol Chem, 63:21–9, 2016. doi: 10.1016/j.compbiolchem.2016.02.006.

F. Hillenkamp, M. Karas, R. C. Beavis, and B. T. Chait. Matrix-assisted laser desorption/ionization mass spectrometry of biopolymers. Anal Chem, 63(24):1193A–1203A, 1991.

M. Hiller, K. Huse, K. Szafranski, N. Jahn, J. Hampe, S. Schreiber, R. Backofen, and M. Platzer. Widespread occurrence of alternative splicing at NAGNAG acceptors contributes to proteome plasticity. Nat Genet, 36(12):1255–7, 2004. doi: 10.1038/ng1469.

A. S. Hinrichs, D. Karolchik, R. Baertsch, G. P. Barber, G. Bejerano, H. Clawson, M. Diekhans, T. S. Furey, R. A. Harte, F. Hsu, J. Hillman-Jackson, R. M. Kuhn, J. S. Pedersen, A. Pohl, B. J. Raney, K. R. Rosenbloom, A. Siepel, K. E. Smith, C. W. Sugnet, A. Sultan-Qurraie, D. J. Thomas, H. Trumbower, R. J. Weber, M. Weirauch, A. S. Zweig, D. Haussler, and W. J. Kent. The UCSC Genome Browser Database: update 2006. Nucleic Acids Res, 34(Database issue):D590–8, 2006. doi: 10.1093/nar/gkj144.

A. Hinzpeter, A. Aissat, E. Sondo, C. Costa, N. Arous, C. Gameiro, N. Martin, A. Tarze, L. Weiss, A. de Becdelievre, B. Costes, M. Goossens, L. J. Galietta, E. Girodon, and P. Fanen. Alternative splicing at a NAGNAG acceptor site as a novel phenotype modifier. PLoS Genet, 6(10), 2010. doi: 10.1371/journal.pgen.1001153.

J. D. Hoheisel. Microarray technology: beyond transcript profiling and genotype analysis. Nat Rev Genet, 7(3):200–10, 2006. doi: 10.1038/nrg1809.

Z. Hou, P. Jiang, S. A. Swanson, A. L. Elwell, B. K. Nguyen, J. M. Bolin, R. Stewart, and J. A. Thomson. A cost-effective RNA sequencing protocol for large-scale gene expression studies. Sci Rep, 5:9570, 2015. doi: 10.1038/srep09570.

Q. Huang, J. Chang, M. K. Cheung, W. Nong, L. Li, M. T. Lee, and H. S. Kwan. Human proteins with target sites of multiple post-translational modification types are more prone to be involved in disease. J Proteome Res, 13(6):2735–48, 2014. doi: 10.1021/pr401019d.

International Cancer Genome Consortium, T. J. Hudson, W. Anderson, A. Artez, A. D. Barker, C. Bell, R. R. Bernabe, M. K. Bhan, F. Calvo, I. Eerola, D. S. Gerhard, A. Guttmacher, M. Guyer, F. M. Hemsley, J. L. Jennings, D. Kerr, P. Klatt, P. Kolar, J. Kusada, D. P. Lane, F. Laplace, L. Youyong, G. Nettekoven, B. Ozenberger, J. Peterson, T. S. Rao, J. Remacle, A. J. Schafer, T. Shibata, M. R. Stratton, J. G. Vockley, K. Watanabe, H. Yang, M. M. Yuen, B. M. Knoppers, M. Bobrow, A. Cambon-Thomsen, L. G. Dressler, S. O. Dyke, Y. Joly, K. Kato, K. L. Kennedy, P. Nicolas, M. J. Parker, E. Rial-Sebbag, C. M. Romeo-Casabona, K. M. Shaw, S. Wallace, G. L. Wiesner, N. Zeps, P. Lichter, A. V. Biankin, C. Chabannon, L. Chin, B. Clement, E. de Alava, F. Degos, M. L. Ferguson, P. Geary, D. N. Hayes, T. J. Hudson, A. L. Johns, A. Kasprzyk, H. Nakagawa, R. Penny, M. A. Piris, R. Sarin, A. Scarpa, T. Shibata, M. van de Vijver, P. A. Futreal, H. Aburatani, M. Bayes, D. D. Botwell, P. J. Campbell, X. Estivill, D. S. Gerhard, S. M. Grimmond, I. Gut, M. Hirst, C. Lopez-Otin, P. Majumder, M. Marra, J. D. McPherson, H. Nakagawa, Z. Ning, X. S. Puente, Y. Ruan, T. Shibata, M. R. Stratton, H. G. Stunnenberg, H. Swerdlow, V. E. Velculescu, R. K. Wilson, H. H. Xue, L. Yang, P. T. Spellman, G. D. Bader, P. C. Boutros, P. J. Campbell, et al. International network of cancer genome projects. Nature, 464(7291):993–8, 2010. doi: 10.1038/nature08987.

International HapMap Consortium. The International HapMap Project. Nature, 426(6968):789–96, 2003. doi: 10.1038/nature02168.

International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931–45, 2004. doi: 10.1038/nature03001.

F. Iorio, T. A. Knijnenburg, D. J. Vis, G. R. Bignell, M. P. Menden, M. Schubert, N. Aben, E. Goncalves, S. Barthorpe, H. Lightfoot, T. Cokelaer, P. Greninger, E. van Dyk, H. Chang, H. de Silva, H. Heyn, X. Deng, R. K. Egan, Q. Liu, T. Mironenko, X. Mitropoulos, L. Richardson, J. Wang, T. Zhang, S. Moran, S. Sayols, M. Soleimani, D. Tamborero, N. Lopez-Bigas, P. Ross-Macdonald, M. Esteller, N. S. Gray, D. A. Haber, M. R. Stratton, C. H. Benes, L. F. Wessels, J. Saez-Rodriguez, U. McDermott, and M. J. Garnett. A Landscape of Pharmacogenomic Interactions in Cancer. Cell, 166(3):740–54, 2016. doi: 10.1016/j.cell.2016.06.017.

J. D. Jaffe, H. C. Berg, and G. M. Church. Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics, 4(1):59–77, 2004. doi: 10.1002/pmic.200300511.

O. N. Jensen. Interpreting the protein language using proteomics. Nat Rev Mol Cell Biol, 7(6):391–403, 2006. doi: 10.1038/nrm1939.

K. Jeong, S. Kim, and N. Bandeira. False discovery rates in spectral identification. BMC Bioinformatics, 13 Suppl 16:S2, 2012. doi: 10.1186/1471-2105-13-S16-S2.

C. Ji, R. J. Arnold, K. J. Sokoloski, R. W. Hardy, H. Tang, and P. Radivojac. Extending the coverage of spectral libraries: a neighbor-based approach to predicting intensities of peptide fragmentation spectra. Proteomics, 13(5):756–65, 2013. doi: 10.1002/pmic.201100670.

P. Kapranov, S. E. Cawley, J. Drenkow, S. Bekiranov, R. L. Strausberg, S. P. Fodor, and T. R. Gingeras. Large-scale transcriptional activity in chromosomes 21 and 22. Science, 296(5569):916–9, 2002. doi: 10.1126/science.1068597.

Y. V. Karpievitch, A. D. Polpitiya, G. A. Anderson, R. D. Smith, and A. R. Dabney. Liquid Chromatography Mass Spectrometry-Based Proteomics: Biological and Technological Aspects. Ann Appl Stat, 4(4):1797–1823, 2010. doi: 10.1214/10-AOAS341.

T. Kawabata, M. Ota, and K. Nishikawa. The Protein Mutant Database. Nucleic Acids Res, 27(1):355–7, 1999.

S. Keegan, J. P. Cortens, R. C. Beavis, and D. Fenyo. g2pDB: A Database Mapping Protein Post-Translational Modifications to Genomic Coordinates. J Proteome Res, 15(3):983–90, 2016. doi: 10.1021/acs.jproteome.5b01018.

O. Keller, M. Kollmar, M. Stanke, and S. Waack. A novel hybrid gene prediction method employing protein multiple sequence alignments. Bioinformatics, 27(6): 757–63, 2011. doi: 10.1093/bioinformatics/btr010.

S. R. Kennedy, L. A. Loeb, and A. J. Herr. Somatic mutations in aging, cancer and neurodegeneration. Mech Ageing Dev, 133(4):118–26, 2012. doi: 10.1016/j.mad.2011.10.009.

W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The human genome browser at UCSC. Genome Res, 12(6):996–1006, 2002. doi: 10.1101/gr.229102.

J. Khatun, Y. Yu, J. A. Wrobel, B. A. Risk, H. P. Gunawardena, A. Secrest, W. J. Spitzer, L. Xie, L. Wang, X. Chen, and M. C. Giddings. Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions. BMC Genomics, 14:141, 2013. doi: 10.1186/1471-2164-14-141.

D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol, 14(4):R36, 2013. doi: 10.1186/gb-2013-14-4-r36.

M. S. Kim, S. M. Pinto, D. Getnet, R. S. Nirujogi, S. S. Manda, R. Chaerkady, A. K. Madugundu, D. S. Kelkar, R. Isserlin, S. Jain, J. K. Thomas, B. Muthusamy, P. Leal-Rojas, P. Kumar, N. A. Sahasrabuddhe, L. Balakrishnan, J. Advani, B. George, S. Renuse, L. D. Selvan, A. H. Patil, V. Nanjappa, A. Radhakrishnan, S. Prasad, T. Subbannayya, R. Raju, M. Kumar, S. K. Sreenivasamurthy, A. Marimuthu, G. J. Sathe, S. Chavan, K. K. Datta, Y. Subbannayya, A. Sahu, S. D. Yelamanchi, S. Jayaram, P. Rajagopalan, J. Sharma, K. R. Murthy, N. Syed, R. Goel, A. A. Khan, S. Ahmad, G. Dey, K. Mudgal, A. Chatterjee, T. C. Huang, J. Zhong, X. Wu, P. G. Shaw, D. Freed, M. S. Zahari, K. K. Mukherjee, S. Shankar, A. Mahadevan, H. Lam, C. J. Mitchell, S. K. Shankar, P. Satishchandra, J. T. Schroeder, R. Sirdeshmukh, A. Maitra, S. D. Leach, C. G. Drake, M. K. Halushka, T. S. Prasad, R. H. Hruban, C. L. Kerr, G. D. Bader, C. A. Iacobuzio-Donahue, H. Gowda, and A. Pandey. A draft map of the human proteome. Nature, 509(7502):575–81, 2014. doi: 10.1038/nature13302.

S. Kim and P. A. Pevzner. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat Commun, 5:5277, 2014. doi: 10.1038/ncomms6277.

A. A. Klammer and M. J. MacCoss. Effects of modified digestion schemes on the identification of proteins from complex mixtures. J Proteome Res, 5(3):695–700, 2006. doi: 10.1021/pr050315j.

A. T. Kong, F. V. Leprevost, D. M. Avtonomov, D. Mellacheruvu, and A. I. Nesvizhskii. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat Methods, 14(5):513–520, 2017. doi: 10.1038/nmeth.4256.

K. Krug, S. Popic, A. Carpy, C. Taumer, and B. Macek. Construction and assessment of individualized proteogenomic databases for large-scale analysis of nonsynonymous single nucleotide variants. Proteomics, 14(23-24):2699–708, 2014. doi: 10.1002/pmic.201400219.

M. Kuhring and B. Y. Renard. iPiG: integrating peptide spectrum matches into genome browser visualizations. PLoS One, 7(12):e50246, 2012. doi: 10.1371/journal.pone.0050246.

H. Lam, E. W. Deutsch, J. S. Eddes, J. K. Eng, N. King, S. E. Stein, and R. Aebersold. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics, 7(5):655–67, 2007. doi: 10.1002/pmic.200600625.

H. Lam, E. W. Deutsch, J. S. Eddes, J. K. Eng, S. E. Stein, and R. Aebersold. Building consensus spectral libraries for peptide identification in proteomics. Nat Methods, 5(10):873–5, 2008. doi: 10.1038/nmeth.1254.

H. Lam, E. W. Deutsch, and R. Aebersold. Artificial decoy spectral libraries for false discovery rate estimation in spectral library searching in proteomics. J Proteome Res, 9(1):605–10, 2010. doi: 10.1021/pr900947u.

C. Lambert, J. E. Dubuc, E. Montell, J. Verges, C. Munaut, A. Noel, and Y. Henrotin. Gene expression pattern of cells from inflamed and normal areas of osteoarthritis synovial membrane. Arthritis Rheumatol, 66(4):960–8, 2014. doi: 10.1002/art.38315.

B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 10(3):R25, 2009. doi: 10.1186/gb-2009-10-3-r25.

R. C. Lawrence, C. G. Helmick, F. C. Arnett, R. A. Deyo, D. T. Felson, E. H. Giannini, S. P. Heyse, R. Hirsch, M. C. Hochberg, G. G. Hunder, M. H. Liang, S. R. Pillemer, V. D. Steen, and F. Wolfe. Estimates of the prevalence of arthritis and selected musculoskeletal disorders in the United States. Arthritis Rheum, 41(5):778–99, 1998. doi: 10.1002/1529-0131(199805)41:5<778::AID-ART4>3.0.CO;2-V.

R. T. Lawrence, B. C. Searle, A. Llovet, and J. Villen. Plug-and-play analysis of the human phosphoproteome by targeted high-resolution mass spectrometry. Nat Methods, 13(5):431–4, 2016. doi: 10.1038/nmeth.3811.

A. S. Lee, M. B. Ellman, D. Yan, J. S. Kroin, B. J. Cole, A. J. van Wijnen, and H. J. Im. A current review of molecular mechanisms regarding osteoarthritis and pain. Gene, 527(2):440–7, 2013. doi: 10.1016/j.gene.2013.05.069.

M. V. Lee, S. E. Topper, S. L. Hubler, J. Hose, C. D. Wenger, J. J. Coon, and A. P. Gasch. A dynamic model of proteome changes reveals new roles for transcript alteration in yeast. Mol Syst Biol, 7:514, 2011. doi: 10.1038/msb.2011.48.

H. Li, Y. S. Joh, H. Kim, E. Paek, S. W. Lee, and K. B. Hwang. Evaluating the effect of database inflation in proteogenomic search on sensitive and reliable peptide identification. BMC Genomics, 17(Suppl 13):1031, 2016. doi: 10.1186/s12864-016-3327-5.

J. Li, D. T. Duncan, and B. Zhang. CanProVar: a human cancer proteome variation database. Hum Mutat, 31(3):219–28, 2010. doi: 10.1002/humu.21176.

J. Li, Z. Su, Z. Q. Ma, R. J. Slebos, P. Halvey, D. L. Tabb, D. C. Liebler, W. Pao, and B. Zhang. A bioinformatics workflow for variant peptide detection in shotgun proteomics. Mol Cell Proteomics, 10(5):M110 006536, 2011. doi: 10.1074/mcp.M110.006536.

Y. F. Li and P. Radivojac. Computational approaches to protein inference in shotgun proteomics. BMC Bioinformatics, 13 Suppl 16:S4, 2012. doi: 10.1186/1471-2105-13-S16-S4.

J. Li-Pook-Than and M. Snyder. iPOP goes the world: integrated personalized Omics profiling and the road toward improved health care. Chem Biol, 20(5):660–6, 2013. doi: 10.1016/j.chembiol.2013.05.001.

Y. Liu, J. F. Ferguson, C. Xue, I. M. Silverman, B. Gregory, M. P. Reilly, and M. Li. Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. PLoS One, 8(6):e66883, 2013. doi: 10.1371/journal.pone.0066883.

Y. Liu, J. Zhou, and K. P. White. RNA-seq differential expression studies: more sequence or more replication? Bioinformatics, 30(3):301–4, 2014. doi: 10.1093/bioinformatics/btt688.

L. Luzzatto. Somatic mutations in cancer development. Environ Health, 10 Suppl 1:S12, 2011. doi: 10.1186/1476-069X-10-S1-S12.

T. Maier, M. Guell, and L. Serrano. Correlation of mRNA and protein in complex biological samples. FEBS Lett, 583(24):3966–73, 2009. doi: 10.1016/j.febslet.2009.10.036.

S. S. Manda, R. S. Nirujogi, S. M. Pinto, M. S. Kim, K. K. Datta, R. Sirdeshmukh, T. S. Prasad, V. Thongboonkerd, A. Pandey, and H. Gowda. Identification and characterization of proteins encoded by chromosome 12 as part of chromosome-centric human proteome project. J Proteome Res, 13(7):3166–77, 2014. doi: 10.1021/pr401123v.

H. J. Mankin, H. Dorfman, L. Lippiello, and A. Zarins. Biochemical and metabolic abnormalities in articular cartilage from osteo-arthritic human hips. II. Correlation of morphology with biochemical and metabolic data. J Bone Joint Surg Am, 53(3):523–37, 1971.

L. McHugh and J. W. Arthur. Computational methods for protein identification from mass spectrometry data. PLoS Comput Biol, 4(2):e12, 2008. doi: 10.1371/journal.pcbi.0040012.

W. McLaren, L. Gil, S. E. Hunt, H. S. Riat, G. R. Ritchie, A. Thormann, P. Flicek, and F. Cunningham. The Ensembl Variant Effect Predictor. Genome Biol, 17(1):122, 2016. doi: 10.1186/s13059-016-0974-4.

M. Mele, P. G. Ferreira, F. Reverter, D. S. DeLuca, J. Monlong, M. Sammeth, T. R. Young, J. M. Goldmann, D. D. Pervouchine, T. J. Sullivan, R. Johnson, A. V. Segre, S. Djebali, A. Niarchou, GTEx Consortium, F. A. Wright, T. Lappalainen, M. Calvo, G. Getz, E. T. Dermitzakis, K. G. Ardlie, and R. Guigo. Human genomics. The human transcriptome across tissues and individuals. Science, 348(6235):660–5, 2015. doi: 10.1126/science.aaa0355.

G. Mendel. Versuche über Pflanzen-Hybriden. Im Verlage des Vereines, Brünn, 1866.

P. Mertins, J. W. Qiao, J. Patel, N. D. Udeshi, K. R. Clauser, D. R. Mani, M. W. Burgess, M. A. Gillette, J. D. Jaffe, and S. A. Carr. Integrated proteomic analysis of post-translational modifications by serial enrichment. Nat Methods, 10(7):634–7, 2013. doi: 10.1038/nmeth.2518.

P. Mertins, D. R. Mani, K. V. Ruggles, M. A. Gillette, K. R. Clauser, P. Wang, X. Wang, J. W. Qiao, S. Cao, F. Petralia, E. Kawaler, F. Mundt, K. Krug, Z. Tu, J. T. Lei, M. L. Gatza, M. Wilkerson, C. M. Perou, V. Yellapantula, K. L. Huang, C. Lin, M. D. McLellan, P. Yan, S. R. Davies, R. R. Townsend, S. J. Skates, J. Wang, B. Zhang, C. R. Kinsinger, M. Mesri, H. Rodriguez, L. Ding, A. G. Paulovich, D. Fenyo, M. J. Ellis, S. A. Carr, and NCI CPTAC. Proteogenomics connects somatic mutations to signalling in breast cancer. Nature, 534(7605):55–62, 2016. doi: 10.1038/nature18003.

E. Meszaros and C. J. Malemud. Prospects for treating osteoarthritis: enzyme-protein interactions regulating matrix metalloproteinase activity. Ther Adv Chronic Dis, 3(5):219–29, 2012. doi: 10.1177/2040622312454157.

G. I. Mias and M. Snyder. Personal genomes, quantitative dynamic omics and personalized medicine. Quant Biol, 1(1):71–90, 2013. doi: 10.1007/s40484-013-0005-3.

R. E. Moore, M. K. Young, and T. D. Lee. Qscore: an algorithm for evaluating SEQUEST database search results. J Am Soc Mass Spectrom, 13(4):378–86, 2002. doi: 10.1016/S1044-0305(02)00352-5.

K. B. Mullis and F. A. Faloona. Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Methods Enzymol, 155:335–50, 1987.

S. Na, J. Jeong, H. Park, K. J. Lee, and E. Paek. Unrestrictive identification of multiple post-translational modifications from tandem mass spectrometry using an error-tolerant algorithm based on an extended sequence tag approach. Mol Cell Proteomics, 7(12):2452–63, 2008a. doi: 10.1074/mcp.M800101-MCP200.

S. Na, E. Paek, and C. Lee. CIFTER: automated charge-state determination for peptide tandem mass spectra. Anal Chem, 80(5):1520–8, 2008b. doi: 10.1021/ac702038q.

D. Nedelkov. Population proteomics: investigation of protein diversity in human populations. Proteomics, 8(4):779–86, 2008. doi: 10.1002/pmic.200700501.

D. Nedelkov, U. A. Kiernan, E. E. Niederkofler, K. A. Tubbs, and R. W. Nelson. Population proteomics: the concept, attributes, and potential for cancer biomarker research. Mol Cell Proteomics, 5(10):1811–8, 2006. doi: 10.1074/mcp.R600006-MCP200.

A. I. Nesvizhskii. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics, 73(11):2092–123, 2010. doi: 10.1016/j.jprot.2010.08.009.

A. I. Nesvizhskii. Proteogenomics: concepts, applications and computational strategies. Nat Methods, 11(11):1114–25, 2014. doi: 10.1038/nmeth.3144.

Q. Nguyen, G. Murphy, C. E. Hughes, J. S. Mort, and P. J. Roughley. Matrix metalloproteinases cleave at two distinct sites on human cartilage link protein. Biochem J, 295(Pt 2):595–8, 1993.

M. Nikolov, C. Schmidt, and H. Urlaub. Quantitative mass spectrometry-based proteomics: an overview. Methods Mol Biol, 893:85–100, 2012. doi: 10.1007/978-1-61779-885-6_7.

Z. Ning, X. Zhang, J. Mayne, and D. Figeys. Peptide-Centric Approaches Provide an Alternative Perspective To Re-Examine Quantitative Proteomic Data. Anal Chem, 88(4):1973–8, 2016. doi: 10.1021/acs.analchem.5b04148.

J. V. Olsen, B. Blagoev, F. Gnad, B. Macek, C. Kumar, P. Mortensen, and M. Mann. Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell, 127(3):635–48, 2006. doi: 10.1016/j.cell.2006.09.026.

F. Ozsolak and P. M. Milos. RNA sequencing: advances, challenges and opportunities. Nat Rev Genet, 12(2):87–98, 2011. doi: 10.1038/nrg2934.

C. N. Pang, A. P. Tay, C. Aya, N. A. Twine, L. Harkness, G. Hart-Smith, S. Z. Chia, Z. Chen, N. P. Deshpande, N. O. Kaakoush, H. M. Mitchell, M. Kassem, and M. R. Wilkins. Tools to covisualize and coanalyze proteomic data with genomes and transcriptomes: validation of genes and alternative mRNA splicing. J Proteome Res, 13(1):84–98, 2014. doi: 10.1021/pr400820p.

H. Park, J. Bae, H. Kim, S. Kim, H. Kim, D. G. Mun, Y. Joh, W. Lee, S. Chae, S. Lee, H. K. Kim, D. Hwang, S. W. Lee, and E. Paek. Compact variant-rich customized sequence database and a fast and sensitive database search for efficient proteogenomic analyses. Proteomics, 14(23-24):2742–9, 2014. doi: 10.1002/pmic.201400225.

C. E. Parker, V. Mocanu, M. Mocanu, N. Dicheva, and M. R. Warren. Mass Spectrometry for Post-Translational Modifications. Frontiers in Neuroscience. Boca Raton (FL), 2010.

R. Patro, S. M. Mount, and C. Kingsford. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol, 32(5):462–4, 2014. doi: 10.1038/nbt.2862.

J. A. Paulo, J. D. O’Connell, and S. P. Gygi. A Triple Knockout (TKO) Proteomics Standard for Diagnosing Ion Interference in Isobaric Labeling Experiments. J Am Soc Mass Spectrom, 27(10):1620–5, 2016. doi: 10.1007/s13361-016-1434-9.

R. G. Pearson, T. Kurien, K. S. Shu, and B. E. Scammell. Histopathology grading systems for characterisation of human knee osteoarthritis–reproducibility, variability, reliability, correlation, and validity. Osteoarthritis Cartilage, 19(3):324–31, 2011. doi: 10.1016/j.joca.2010.12.005.

Y. Perez-Riverol, J. Uszkoreit, A. Sanchez, T. Ternent, N. Del Toro, H. Hermjakob, J. A. Vizcaino, and R. Wang. ms-data-core-api: an open-source, metadata-oriented library for computational proteomics. Bioinformatics, 31(17):2903–5, 2015. doi: 10.1093/bioinformatics/btv250.

D. N. Perkins, D. J. Pappin, D. M. Creasy, and J. S. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551–67, 1999. doi: 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.

D. H. Phanstiel, J. Brumbaugh, C. D. Wenger, S. Tian, M. D. Probasco, D. J. Bailey, D. L. Swaney, M. A. Tervo, J. M. Bolin, V. Ruotti, R. Stewart, J. A. Thomson, and J. J. Coon. Proteomic and phosphoproteomic comparison of human ES and iPS cells. Nat Methods, 8(10):821–7, 2011. doi: 10.1038/nmeth.1699.

S. M. Pinto, S. S. Manda, M. S. Kim, K. Taylor, L. D. Selvan, L. Balakrishnan, T. Subbannayya, F. Yan, T. S. Prasad, H. Gowda, C. Lee, W. S. Hancock, and A. Pandey. Functional annotation of proteome encoded by human chromosome 22. J Proteome Res, 13(6):2749–60, 2014. doi: 10.1021/pr401169d.

Y. F. Ramos and I. Meulenbelt. The role of epigenetics in osteoarthritis: current perspective. Curr Opin Rheumatol, 29(1):119–129, 2017. doi: 10.1097/BOR.0000000000000355.

B. J. Raney, T. R. Dreszer, G. P. Barber, H. Clawson, P. A. Fujita, T. Wang, N. Nguyen, B. Paten, A. S. Zweig, D. Karolchik, and W. J. Kent. Track data hubs enable visualization of user-defined genome-wide annotations on the UCSC Genome Browser. Bioinformatics, 30(7):1003–5, 2014. doi: 10.1093/bioinformatics/btt637.

L. N. Reynard and J. Loughlin. The genetics and functional analysis of primary osteoarthritis susceptibility. Expert Rev Mol Med, 15:e2, 2013. doi: 10.1017/erm.2013.4.

J. L. Rinn, G. Euskirchen, P. Bertone, R. Martone, N. M. Luscombe, S. Hartman, P. M. Harrison, F. K. Nelson, P. Miller, M. Gerstein, S. Weissman, and M. Snyder. The transcriptional activity of human Chromosome 22. Genes Dev, 17(4):529–40, 2003. doi: 10.1101/gad.1055203.

B. A. Risk, W. J. Spitzer, and M. C. Giddings. Peppy: proteogenomic search software. J Proteome Res, 12(6):3019–25, 2013. doi: 10.1021/pr400208w.

M. D. Ritchie, E. R. Holzinger, R. Li, S. A. Pendergrass, and D. Kim. Methods of integrating data to uncover genotype-phenotype interactions. Nat Rev Genet, 16(2):85–97, 2015. doi: 10.1038/nrg3868.

M. D. Robinson, D. J. McCarthy, and G. K. Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–40, 2010. doi: 10.1093/bioinformatics/btp616.

J. M. Rodriguez, P. Maietta, I. Ezkurdia, A. Pietrelli, J. J. Wesselink, G. Lopez, A. Valencia, and M. L. Tress. APPRIS: annotation of principal and alternative splice isoforms. Nucleic Acids Res, 41(Database issue):D110–7, 2013. doi: 10.1093/nar/gks1058.

P. L. Ross, Y. N. Huang, J. N. Marchese, B. Williamson, K. Parker, S. Hattan, N. Khainovski, S. Pillai, S. Dey, S. Daniels, S. Purkayastha, P. Juhasz, S. Martin, M. Bartlet-Jones, F. He, A. Jacobson, and D. J. Pappin. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteomics, 3(12):1154–69, 2004. doi: 10.1074/mcp.M400129-MCP200.

H. L. Rost, T. Sachsenberg, S. Aiche, C. Bielow, H. Weisser, F. Aicheler, S. Andreotti, H. C. Ehrlich, P. Gutenbrunner, E. Kenar, X. Liang, S. Nahnsen, L. Nilse, J. Pfeuffer, G. Rosenberger, M. Rurik, U. Schmitt, J. Veit, M. Walzer, D. Wojnar, W. E. Wolski, O. Schilling, J. S. Choudhary, L. Malmstrom, R. Aebersold, K. Reinert, and O. Kohlbacher. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Methods, 13(9):741–8, 2016. doi: 10.1038/nmeth.3959.

T. I. Roumeliotis, S. P. Williams, E. Goncalves, F. Zamanzad Ghavidel, N. Aben, M. Michaut, M. Schubert, J. C. Wright, M. Yang, C. Alsinet, R. Dienstmann, J. Guinney, P. Beltrao, A. Brazma, O. Stegle, D. J. Adams, L. Wessels, J. Saez-Rodriguez, U. McDermott, and J. S. Choudhary. Genomic determinants of protein abundance variation in colorectal cancer cells. bioRxiv, 2016.

C. Ruiz-Romero, P. Fernandez-Puente, V. Calamia, and F. J. Blanco. Lessons from the proteomic study of osteoarthritis. Expert Rev Proteomics, 12(4):433–43, 2015. doi: 10.1586/14789450.2015.1065182.

S. Y. Ryu, W. J. Qian, D. G. Camp, R. D. Smith, R. G. Tompkins, R. W. Davis, and W. Xiao. Detecting differential protein expression in large-scale population proteomics. Bioinformatics, 30(19):2741–6, 2014. doi: 10.1093/bioinformatics/btu341.

W. S. Sanders, N. Wang, S. M. Bridges, B. M. Malone, Y. S. Dandass, F. M. McCarthy, B. Nanduri, M. L. Lawrence, and S. C. Burgess. The proteogenomic mapping tool. BMC Bioinformatics, 12:115, 2011. doi: 10.1186/1471-2105-12-115.

M. M. Savitski, M. L. Nielsen, and R. A. Zubarev. ModifiComb, a new proteomic tool for mapping substoichiometric post-translational modifications, finding novel types of modifications, and fingerprinting complex protein mixtures. Mol Cell Proteomics, 5(5):935–48, 2006. doi: 10.1074/mcp.T500034-MCP200.

E. Scheerlinck, M. Dhaenens, A. Van Soom, L. Peelman, P. De Sutter, K. Van Steendam, and D. Deforce. Minimizing technical variation during sample preparation prior to label-free quantitative mass spectrometry. Anal Biochem, 490:14–9, 2015. doi: 10.1016/j.ab.2015.08.018.

K. Scheubert, F. Hufsky, and S. Bocker. Computational mass spectrometry for small molecules. J Cheminform, 5(1):12, 2013. doi: 10.1186/1758-2946-5-12.

C. N. Schlaffner. MS SMiV: Mass Spectrometry Similarity Scoring for Modification and Variation Detection. Hagenberg, 2014.

C. N. Schlaffner, G. R. Ritchie, R. I. Roumeliotis, J. Steinberg, C. Le Maitre, M. Wilkinson, E. Zeggini, A. Bender, and J. S. Choudhary. Quantitative proteogenomics for personalised molecular profiling, 2015. Poster presented at Genome Informatics 2015, Cold Spring Harbor, NY, October 28-31.

C. N. Schlaffner, R. I. Roumeliotis, W. H., J. C. Wright, J. Mudge, S. Santos, G. R. Ritchie, J. Steinberg, A. Bender, A. Brazma, J. Harrow, C. Le Maitre, M. Wilkinson, E. Zeggini, and J. S. Choudhary. Quantitative proteogenomics for personalised molecular profiling, 2016a. 282157, Wednesday, Proceedings of the 64th ASMS Conference on Mass Spectrometry and Allied Topics, San Antonio, TX, June 5-6.

C. N. Schlaffner, R. I. Roumeliotis, W. H., J. C. Wright, J. Mudge, S. Santos, G. R. Ritchie, J. Steinberg, A. Bender, A. Brazma, J. Harrow, C. Le Maitre, M. Wilkinson, E. Zeggini, and J. S. Choudhary. Quantitative proteogenomics for personalised molecular profiling, 2016b. Genome Informatics 2016, Hinxton, UK, September 19-22.

C. N. Schlaffner, G. J. Pirklbauer, A. Bender, and J. S. Choudhary. Fast, Quantitative and Variant Enabled Mapping of Peptides to Genomes. Cell Systems, 5(2):152–156 e4, 2017. doi: 10.1016/j.cels.2017.07.007.

J. P. Shaffer. Multiple hypothesis testing. Annual review of psychology, 46(1):561–584, 1995.

C. Shao, Y. Zhang, and W. Sun. Statistical characterization of HCD fragmentation patterns of tryptic peptides on an LTQ Orbitrap Velos mass spectrometer. J Proteomics, 109:26–37, 2014. doi: 10.1016/j.jprot.2014.06.012.

H. Shen, J. Li, J. Zhang, C. Xu, Y. Jiang, Z. Wu, F. Zhao, L. Liao, J. Chen, Y. Lin, Q. Tian, C. J. Papasian, and H. W. Deng. Comprehensive characterization of human genome variation by high coverage whole-genome sequencing of forty four Caucasians. PLoS One, 8(4):e59494, 2013. doi: 10.1371/journal.pone.0059494.

S. T. Sherry, M. Ward, and K. Sirotkin. dbSNP-database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res, 9(8):677–9, 1999.

G. M. Sheynkman, J. E. Johnson, P. D. Jagtap, M. R. Shortreed, G. Onsongo, B. L. Frey, T. J. Griffin, and L. M. Smith. Using Galaxy-P to leverage RNA-Seq for the discovery of novel protein variations. BMC Genomics, 15:703, 2014. doi: 10.1186/1471-2164-15-703.

G. M. Sheynkman, M. R. Shortreed, A. J. Cesnik, and L. M. Smith. Proteogenomics: Integrating Next-Generation Sequencing and Mass Spectrometry to Characterize Human Proteomic Variation. Annu Rev Anal Chem (Palo Alto Calif), 9(1):521–45, 2016. doi: 10.1146/annurev-anchem-071015-041722.

M. R. Shortreed, C. D. Wenger, B. L. Frey, G. M. Sheynkman, M. Scalf, M. P. Keller, A. D. Attie, and L. M. Smith. Global Identification of Protein Post-translational Modifications in a Single-Pass Database Search. J Proteome Res, 14(11):4714–20, 2015. doi: 10.1021/acs.jproteome.5b00599.

A. M. Snijders, N. Nowak, R. Segraves, S. Blackwood, N. Brown, J. Conroy, G. Hamilton, A. K. Hindle, B. Huey, K. Kimura, S. Law, K. Myambo, J. Palmer, B. Ylstra, J. P. Yue, J. W. Gray, A. N. Jain, D. Pinkel, and D. G. Albertson. Assembly of microarrays for genome-wide measurement of DNA copy number. Nat Genet, 29(3):263–4, 2001. doi: 10.1038/ng754.

N. Sonenberg and A. G. Hinnebusch. Regulation of translation initiation in eukaryotes: mechanisms and biological targets. Cell, 136(4):731–45, 2009. doi: 10.1016/j.cell.2009.01.042.

M. Stanke and S. Waack. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics, 19 Suppl 2:ii215–25, 2003.

J. A. Stefely, N. W. Kwiecien, E. C. Freiberger, A. L. Richards, A. Jochem, M. J. Rush, A. Ulbrich, K. P. Robinson, P. D. Hutchins, M. T. Veling, X. Guo, Z. A. Kemmerer, K. J. Connors, E. A. Trujillo, J. Sokol, H. Marx, M. S. Westphall, A. S. Hebert, D. J. Pagliarini, and J. J. Coon. Mitochondrial protein functions elucidated by multi-omic mass spectrometry profiling. Nat Biotechnol, 34(11):1191–1197, 2016. doi: 10.1038/nbt.3683.

T. Steijger, J. F. Abril, P. G. Engstrom, F. Kokocinski, T. J. Hubbard, R. Guigo, J. Harrow, P. Bertone, and R. Consortium. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods, 10(12):1177–84, 2013. doi: 10.1038/nmeth.2714.

S. Stein. Mass spectral reference libraries: an ever-expanding resource for chemical identification. Anal Chem, 84(17):7274–82, 2012. doi: 10.1021/ac301205z.

S. E. Stein and D. R. Scott. Optimization and testing of mass spectral library search algorithms for compound identification. J Am Soc Mass Spectrom, 5(9):859–66, 1994. doi: 10.1016/1044-0305(94)87009-8.

J. Steinberg and E. Zeggini. Functional genomics in osteoarthritis: Past, present, and future. J Orthop Res, 34(7):1105–10, 2016. doi: 10.1002/jor.23296.

J. D. Storey and R. Tibshirani. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A, 100(16):9440–5, 2003. doi: 10.1073/pnas.1530509100.

J. Sun, G. L. Zhang, S. Li, A. R. Ivanov, D. Fenyo, F. Lisacek, S. K. Murthy, B. L. Karger, and V. Brusic. Pathway analysis and transcriptomics improve protein identification by shotgun proteomics from samples comprising small number of cells–a benchmarking study. BMC Genomics, 15 Suppl 9:S1, 2014. doi: 10.1186/1471-2164-15-S9-S1.

F. X. Sutandy, J. Qian, C. S. Chen, and H. Zhu. Overview of protein microarrays. Curr Protoc Protein Sci, Chapter 27:Unit 27 1, 2013. doi: 10.1002/0471140864.ps2701s72.

D. L. Tabb, L. Vega-Montoto, P. A. Rudnick, A. M. Variyath, A. J. Ham, D. M. Bunk, L. E. Kilpatrick, D. D. Billheimer, R. K. Blackman, H. L. Cardasis, S. A. Carr, K. R. Clauser, J. D. Jaffe, K. A. Kowalski, T. A. Neubert, F. E. Regnier, B. Schilling, T. J. Tegeler, M. Wang, P. Wang, J. R. Whiteaker, L. J. Zimmerman, S. J. Fisher, B. W. Gibson, C. R. Kinsinger, M. Mesri, H. Rodriguez, S. E. Stein, P. Tempst, A. G. Paulovich, D. C. Liebler, and C. Spiegelman. Repeatability and reproducibility in proteomic identifications by liquid chromatography-tandem mass spectrometry. J Proteome Res, 9(2):761–76, 2010. doi: 10.1021/pr9006365.

A. Tamarkin-Ben-Harush, E. Schechtman, and R. Dikstein. Co-occurrence of transcription and translation gene regulatory features underlies coordinated mRNA and protein synthesis. BMC Genomics, 15:688, 2014. doi: 10.1186/1471-2164-15-688.

S. Tanner, Z. Shen, J. Ng, L. Florea, R. Guigo, S. P. Briggs, and V. Bafna. Improving gene annotation using peptide mass spectrometry. Genome Res, 17(2):231–9, 2007. doi: 10.1101/gr.5646507.

F. E. Taub, J. M. DeLeo, and E. B. Thompson. Sequential comparative hybridizations analyzed by computerized image processing can identify and quantitate regulated RNAs. DNA, 2(4):309–27, 1983. doi: 10.1089/dna.1983.2.309.

T. Taus, T. Kocher, P. Pichler, C. Paschke, A. Schmidt, C. Henrich, and K. Mechtler. Universal and confident phosphorylation site localization using phosphoRS. J Proteome Res, 10(12):5354–62, 2011. doi: 10.1021/pr200611n.

M. The, M. J. MacCoss, W. S. Noble, and L. Kall. Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0. J Am Soc Mass Spectrom, 27(11):1719–1727, 2016. doi: 10.1007/s13361-016-1460-7.

The 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler, A. Auton, L. D. Brooks, R. M. Durbin, R. A. Gibbs, M. E. Hurles, and G. A. McVean. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061–73, 2010. doi: 10.1038/nature09534.

The GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science, 348(6235):648–60, 2015. doi: 10.1126/science.1262110.

A. Thompson, J. Schafer, K. Kuhn, S. Kienle, J. Schwarz, G. Schmidt, T. Neumann, R. Johnstone, A. K. Mohammed, and C. Hamon. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Anal Chem, 75(8):1895–904, 2003.

H. Thorvaldsdottir, J. T. Robinson, and J. P. Mesirov. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform, 14(2):178–92, 2013. doi: 10.1093/bib/bbs017.

P. J. Thul, L. Akesson, M. Wiking, D. Mahdessian, A. Geladaki, H. Ait Blal, T. Alm, A. Asplund, L. Bjork, L. M. Breckels, A. Backstrom, F. Danielsson, L. Fagerberg, J. Fall, L. Gatto, C. Gnann, S. Hober, M. Hjelmare, F. Johansson, S. Lee, C. Lindskog, J. Mulder, C. M. Mulvey, P. Nilsson, P. Oksvold, J. Rockberg, R. Schutten, J. M. Schwenk, A. Sivertsson, E. Sjostedt, M. Skogs, C. Stadler, D. P. Sullivan, H. Tegel, C. Winsnes, C. Zhang, M. Zwahlen, A. Mardinoglu, F. Ponten, K. von Feilitzen, K. S. Lilley, M. Uhlen, and E. Lundberg. A subcellular map of the human proteome. Science, 356(6340), 2017. doi: 10.1126/science.aal3321.

L. Ting, R. Rad, S. P. Gygi, and W. Haas. MS3 eliminates ratio distortion in isobaric multiplexed quantitative proteomics. Nat Methods, 8(11):937–40, 2011. doi: 10.1038/nmeth.1714.

K. Tomczak, P. Czerwinska, and M. Wiznerowicz. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol (Pozn), 19(1A):A68–77, 2015. doi: 10.5114/wo.2014.47136.

C. Trapnell, A. Roberts, L. Goff, G. Pertea, D. Kim, D. R. Kelley, H. Pimentel, S. L. Salzberg, J. L. Rinn, and L. Pachter. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc, 7(3):562–78, 2012. doi: 10.1038/nprot.2012.016.

V. Trevino, F. Falciani, and H. A. Barrera-Saldana. DNA microarrays: a powerful genomic tool for biomedical and clinical research. Mol Med, 13(9-10):527–41, 2007. doi: 10.2119/2006-00107.Trevino.

M. Uhlen, L. Fagerberg, B. M. Hallstrom, C. Lindskog, P. Oksvold, A. Mardinoglu, A. Sivertsson, C. Kampf, E. Sjostedt, A. Asplund, I. Olsson, K. Edlund, E. Lundberg, S. Navani, C. A. Szigyarto, J. Odeberg, D. Djureinovic, J. O. Takanen, S. Hober, T. Alm, P. H. Edqvist, H. Berling, H. Tegel, J. Mulder, J. Rockberg, P. Nilsson, J. M. Schwenk, M. Hamsten, K. von Feilitzen, M. Forsberg, L. Persson, F. Johansson, M. Zwahlen, G. von Heijne, J. Nielsen, and F. Ponten. Proteomics. Tissue-based map of the human proteome. Science, 347(6220):1260419, 2015. doi: 10.1126/science.1260419.

M. Uhlen, C. Zhang, S. Lee, E. Sjostedt, L. Fagerberg, G. Bidkhori, R. Benfeitas, M. Arif, Z. Liu, F. Edfors, K. Sanli, K. von Feilitzen, P. Oksvold, E. Lundberg, S. Hober, P. Nilsson, J. Mattsson, J. M. Schwenk, H. Brunnstrom, B. Glimelius, T. Sjoblom, P. H. Edqvist, D. Djureinovic, P. Micke, C. Lindskog, A. Mardinoglu, and F. Ponten. A pathology atlas of the human cancer transcriptome. Science, 357(6352), 2017. doi: 10.1126/science.aan2507.

A. M. Valdes and T. D. Spector. Genetic epidemiology of hip and knee osteoarthritis. Nat Rev Rheumatol, 7(1):23–32, 2011. doi: 10.1038/nrrheum.2010.191.

W. B. van den Berg, F. van de Loo, L. A. Joosten, and O. J. Arntz. Animal models of arthritis in NOS2-deficient mice. Osteoarthritis Cartilage, 7(4):413–5, 1999. doi: 10.1053/joca.1999.0228.

B. K. Van Weemen and A. H. Schuurs. Immunoassay using antigen-enzyme conjugates. FEBS Lett, 15(3):232–236, 1971.

J. A. Vizcaino, A. Csordas, N. del Toro, J. A. Dianes, J. Griss, I. Lavidas, G. Mayer, Y. Perez-Riverol, F. Reisinger, T. Ternent, Q. W. Xu, R. Wang, and H. Hermjakob. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res, 44(D1):D447–56, 2016. doi: 10.1093/nar/gkv1145.

C. Vogel and E. M. Marcotte. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat Rev Genet, 13(4):227–32, 2012. doi: 10.1038/nrg3185.

T. Vos, A. D. Flaxman, M. Naghavi, R. Lozano, C. Michaud, M. Ezzati, K. Shibuya, J. A. Salomon, S. Abdalla, V. Aboyans, J. Abraham, I. Ackerman, R. Aggarwal, S. Y. Ahn, M. K. Ali, M. Alvarado, H. R. Anderson, L. M. Anderson, K. G. Andrews, C. Atkinson, L. M. Baddour, A. N. Bahalim, S. Barker-Collo, L. H. Barrero, D. H. Bartels, M. G. Basanez, A. Baxter, M. L. Bell, E. J. Benjamin, D. Bennett, E. Bernabe, K. Bhalla, B. Bhandari, B. Bikbov, A. Bin Abdulhak, G. Birbeck, J. A. Black, H. Blencowe, J. D. Blore, F. Blyth, I. Bolliger, A. Bonaventure, S. Boufous, R. Bourne, M. Boussinesq, T. Braithwaite, C. Brayne, L. Bridgett, S. Brooker, P. Brooks, T. S. Brugha, C. Bryan-Hancock, C. Bucello, R. Buchbinder, G. Buckle, C. M. Budke, M. Burch, P. Burney, R. Burstein, B. Calabria, B. Campbell, C. E. Canter, H. Carabin, J. Carapetis, L. Carmona, C. Cella, F. Charlson, H. Chen, A. T. Cheng, D. Chou, S. S. Chugh, L. E. Coffeng, S. D. Colan, S. Colquhoun, K. E. Colson, J. Condon, M. D. Connor, L. T. Cooper, M. Corriere, M. Cortinovis, K. C. de Vaccaro, W. Couser, B. C. Cowie, M. H. Criqui, M. Cross, K. C. Dabhadkar, M. Dahiya, N. Dahodwala, J. Damsere-Derry, G. Danaei, A. Davis, D. De Leo, L. Degenhardt, R. Dellavalle, A. Delossantos, J. Denenberg, S. Derrett, D. C. Des Jarlais, S. D. Dharmaratne, M. Dherani, et al. Years lived with disability (YLDs) for 1160 sequelae of 289 diseases and injuries 1990-2010: a systematic analysis for the Global Burden of Disease Study 2010. Lancet, 380(9859):2163–96, 2012. doi: 10.1016/S0140-6736(12)61729-2.

X. Wang, Q. Liu, and B. Zhang. Leveraging the complementary nature of RNA-Seq and shotgun proteomics data. Proteomics, 14(23-24):2676–87, 2014a. doi: 10.1002/pmic.201400184.

X. Wang, R. J. Slebos, M. C. Chambers, D. L. Tabb, D. C. Liebler, and B. Zhang. proBAMsuite, a Bioinformatics Framework for Genome-Based Representation and Analysis of Proteomics Data. Mol Cell Proteomics, 15(3):1164–75, 2016. doi: 10.1074/mcp.M115.052860.

Y. C. Wang, S. E. Peterson, and J. F. Loring. Protein post-translational modifications and regulation of pluripotency in human stem cells. Cell Res, 24(2):143–60, 2014b. doi: 10.1038/cr.2013.151.

Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10(1):57–63, 2009. doi: 10.1038/nrg2484.

Z. Wang, X. Sun, Y. Zhao, X. Guo, H. Jiang, H. Li, and Z. Gu. Evolution of gene regulation during transcription and translation. Genome Biol Evol, 7(4):1155–67, 2015. doi: 10.1093/gbe/evv059.

T. Weirick, G. Militello, R. Muller, D. John, S. Dimmeler, and S. Uchida. The identification and characterization of novel transcripts from RNA-seq data. Brief Bioinform, 17(4):678–85, 2016. doi: 10.1093/bib/bbv067.

H. Weisser, J. C. Wright, J. M. Mudge, P. Gutenbrunner, and J. S. Choudhary. Flexible data analysis pipeline for high-confidence proteogenomics. J Proteome Res, 15(12):4686–4695, 2016. doi: 10.1021/acs.jproteome.6b00765.

M. Wilhelm, J. Schlegl, H. Hahne, A. M. Gholami, M. Lieberenz, M. M. Savitski, E. Ziegler, L. Butzmann, S. Gessulat, H. Marx, T. Mathieson, S. Lemeer, K. Schnatbaum, U. Reimer, H. Wenschuh, M. Mollenhauer, J. Slotta-Huspenina, J. H. Boese, M. Bantscheff, A. Gerstmair, F. Faerber, and B. Kuster. Mass-spectrometry-based draft of the human proteome. Nature, 509(7502):582–7, 2014. doi: 10.1038/nature13319.

E. S. Witze, W. M. Old, K. A. Resing, and N. G. Ahn. Mapping protein post-translational modifications with mass spectrometry. Nat Methods, 4(10):798–806, 2007. doi: 10.1038/nmeth1100.

J. F. Woessner Jr. and Z. Gunja-Smith. Role of metalloproteinases in human osteoarthritis. J Rheumatol Suppl, 27:99–101, 1991.

F. Wold. In vivo chemical modification of proteins (post-translational modification). Annu Rev Biochem, 50:783–814, 1981. doi: 10.1146/annurev.bi.50.070181.004031.

J. C. Wright and J. S. Choudhary. DecoyPyrat: Fast Non-redundant Hybrid Decoy Sequence Generation for Large Scale Proteomics. J Proteomics Bioinform, 9(6):176–180, 2016. doi: 10.4172/jpb.1000404.

J. C. Wright, M. O. Collins, L. Yu, L. Kall, M. Brosch, and J. S. Choudhary. Enhanced peptide identification by electron transfer dissociation using an improved Mascot Percolator. Mol Cell Proteomics, 11(8):478–91, 2012. doi: 10.1074/mcp.O111.014522.

J. C. Wright, J. Mudge, H. Weisser, M. P. Barzine, J. M. Gonzalez, A. Brazma, J. S. Choudhary, and J. Harrow. Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow. Nat Commun, 7:11778, 2016a. doi: 10.1038/ncomms11778.

J. C. Wright, J. Mudge, H. Weisser, M. P. Barzine, J. M. Gonzalez, A. Brazma, J. S. Choudhary, and J. Harrow. Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow. Nat Commun, 7:11778, 2016b. doi: 10.1038/ncomms11778.

L. Wu, S. I. Candille, Y. Choi, D. Xie, L. Jiang, J. Li-Pook-Than, H. Tang, and M. Snyder. Variation and genetic control of protein abundance in humans. Nature, 499(7456):79–82, 2013. doi: 10.1038/nature12223.

B. Xia, C. Di, J. Zhang, S. Hu, H. Jin, and P. Tong. Osteoarthritis pathogenesis: a review of molecular mechanisms. Calcif Tissue Int, 95(6):495–505, 2014. doi: 10.1007/s00223-014-9917-9.

M. Yamashita and J. B. Fenn. Electrospray ion source. Another variation on the free-jet theme. The Journal of Physical Chemistry, 88(20):4451–4459, 1984.

A. Yates, W. Akanni, M. R. Amode, D. Barrell, K. Billis, D. Carvalho-Silva, C. Cummins, P. Clapham, S. Fitzgerald, L. Gil, C. G. Giron, L. Gordon, T. Hourlier, S. E. Hunt, S. H. Janacek, N. Johnson, T. Juettemann, S. Keenan, I. Lavidas, F. J. Martin, T. Maurel, W. McLaren, D. N. Murphy, R. Nag, M. Nuhn, A. Parker, M. Patricio, M. Pignatelli, M. Rahtz, H. S. Riat, D. Sheppard, K. Taylor, A. Thormann, A. Vullo, S. P. Wilder, A. Zadissa, E. Birney, J. Harrow, M. Muffato, E. Perry, M. Ruffier, G. Spudich, S. J. Trevanion, F. Cunningham, B. L. Aken, D. R. Zerbino, and P. Flicek. Ensembl 2016. Nucleic Acids Res, 44(D1):D710–6, 2016. doi: 10.1093/nar/gkv1157.

D. Ye, Y. Fu, R. X. Sun, H. P. Wang, Z. F. Yuan, H. Chi, and S. M. He. Open MS/MS spectral library search to identify unanticipated post-translational modifications and increase spectral identification rate. Bioinformatics, 26(12):i399–406, 2010. doi: 10.1093/bioinformatics/btq185.

X. L. Yuan, H. Y. Meng, Y. C. Wang, J. Peng, Q. Y. Guo, A. Y. Wang, and S. B. Lu. Bone-cartilage interface crosstalk in osteoarthritis: potential pathways and future therapeutic strategies. Osteoarthritis Cartilage, 22(8):1077–89, 2014. doi: 10.1016/j.joca.2014.05.023.

B. Zhang, J. Wang, X. Wang, J. Zhu, Q. Liu, Z. Shi, M. C. Chambers, L. J. Zimmerman, K. F. Shaddox, S. Kim, S. R. Davies, S. Wang, P. Wang, C. R. Kinsinger, R. C. Rivers, H. Rodriguez, R. R. Townsend, M. J. Ellis, S. A. Carr, D. L. Tabb, R. J. Coffey, R. J. Slebos, D. C. Liebler, and C. Nci. Proteogenomic characterization of human colon and rectal cancer. Nature, 513(7518):382–7, 2014. doi: 10.1038/nature13438.

G. Zhang, B. M. Ueberheide, S. Waldemarson, S. Myung, K. Molloy, J. Eriksson, B. T. Chait, T. A. Neubert, and D. Fenyo. Protein quantitation using mass spectrometry. Methods Mol Biol, 673:211–22, 2010. doi: 10.1007/978-1-60761-842-3_13.

H. Zhang, T. Liu, Z. Zhang, S. H. Payne, B. Zhang, J. E. McDermott, J. Y. Zhou, V. A. Petyuk, L. Chen, D. Ray, S. Sun, F. Yang, L. Chen, J. Wang, P. Shah, S. W. Cha, P. Aiyetan, S. Woo, Y. Tian, M. A. Gritsenko, T. R. Clauss, C. Choi, M. E. Monroe, S. Thomas, S. Nie, C. Wu, R. J. Moore, K. H. Yu, D. L. Tabb, D. Fenyo, V. Bafna, Y. Wang, H. Rodriguez, E. S. Boja, T. Hiltke, R. C. Rivers, L. Sokoll, H. Zhu, M. Shih Ie, L. Cope, A. Pandey, B. Zhang, M. P. Snyder, D. A. Levine, R. D. Smith, D. W. Chan, K. D. Rodland, and C. Investigators. Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer. Cell, 166(3):755–65, 2016. doi: 10.1016/j.cell.2016.05.069.

Z. Zhang, M. Burke, Y. A. Mirokhin, D. V. Tchekhovskoi, S. P. Markey, W. Yu, R. Chaerkady, S. Hess, and S. E. Stein. Reverse and Random Decoy Methods for False Discovery Rate Estimation in High Mass Accuracy Peptide Spectral Library Searches. J Proteome Res, 2018. doi: 10.1021/acs.jproteome.7b00614.