Statistical and machine learning methods to analyze large-scale data

MATTHEW THE

Doctoral Thesis Stockholm, Sweden 2018 School of Engineering Sciences in Chemistry, Biotechnology and Health TRITA-CBH-FOU-2018:45 SE-171 21 Solna ISBN 978-91-7729-967-7 SWEDEN

Akademisk avhandling som med tillstånd av Kungl Tekniska högskolan framlägges till offentlig granskning för avläggande av Teknologie doktorsexamen i bioteknologi onsdagen den 24 oktober 2018 klockan 13.00 i Atrium, Entréplan i Wargentinhuset, Karolinska Institutet, Solna.

© Matthew The, October 2018

Tryck: Universitetsservice US AB iii

Abstract

Modern biology is faced with vast amounts of data that contain valuable in- formation yet to be extracted. , the study of proteins, has repos- itories with thousands of mass spectrometry experiments. These data gold mines could further our knowledge of proteins as the main actors in cell pro- cesses and signaling. Here, we explore methods to extract more information from this data using statistical and machine learning methods. First, we present advances for studies that aggregate hundreds of runs. We introduce MaRaCluster, which clusters mass spectra for large-scale datasets using statistical methods to assess similarity of spectra. It identified up to 40% more peptides than the state-of-the-art method, MS-Cluster. Further, we accommodated large-scale data analysis in Percolator, a popular post- processing tool for mass spectrometry data. This reduced the runtime for a draft human proteome study from a full day to 10 minutes. Second, we clarify and promote the contentious topic of protein false discovery rates (FDRs). Often, studies report lists of proteins but fail to report protein FDRs. We provide a framework to systematically discuss protein FDRs and take away hesitance. We also added protein FDRs to Percolator, opting for the best-peptide approach which proved superior in a benchmark of scalable protein inference methods. Third, we tackle the low sensitivity of protein quantification methods. Current methods lack proper control of error sources and propagation. To remedy this, we developed Triqler, which controls the protein quantification FDR through a Bayesian framework. We also introduce MaRaQuant, which proposes a quantification-first approach that applies clustering prior to identification. This reduced the number of spectra to be searched and allowed us to spot unidentified analytes of interest. Combining these tools outperformed the state-of-the-art method, MaxQuant/Perseus, and found enriched functional terms for datasets that had none before. iv

Sammanfattning

Modern molekylärbiologi genererar enorma mängder data, som kräver sofisti- kerade algoritmer för att tolkas. Inom området proteomik, det vill säga det storskaliga studier av proteiner, har det under årens lopp byggt upp stora, lättåtkomliga arkiv med data från masspektrometri-experiment. Vidare, finns det fortfarande många outforskade aspekter av proteiner, som är extra in- tressanta eftersom proteiner utgör de viktigaste aktörerna i cellprocesser och cellsignalering. I den här avhandlingen undersöker vi metoder för att utvinna mer information från tillgängliga data med hjälp av statistik och maskinin- lärningsmetoder. För det första, presenterar vi framsteg för att kunna hantera analyser av stu- dier av sammanslagna mängder med hundratals aggregerade körningar. Vi beskriver MaRaCluster, en ny metod för klustring av masspektra som använ- der statistiska metoder för att bedöma likheten mellan masspektra. Metoden identifierar upp till 40% fler peptider än den etablerade metoden, MS-Cluster. Dessutom har vi lagt till funktioner till Percolator, den populära mjukvaran för efterbehandling av masspektrometridata. Detta minskade exekveringsti- den för en storskalig studie av det mänskliga proteomet från ett dygn ner till 10 minuter på en standarddator. För det andra, förtydligar vi och förespråkar användandet av ett mått på osäkerhet i identifiering av protein från masspektrometridata, det så kallade False Discovery Rate (FDR) på proteinnivå. Många studier rapporterar inte FDR på proteinnivå, trots att dessa väljer att redovisa sina resultat i form av listor av proteiner. Här presenterar vi ett statistiskt ramverk för att kunna systematiskt diskutera FDR på proteinnivå och för att undanröja tveksam- heter. Vi jämför skalbara inferensmetoder för proteiner och inkluderade den bästa-peptidinferensmetoden i Percolator. För det tredje, tar vi upp bristen på känslighet för proteinkvantifieringsme- toder. Nuvarande metoder saknar ordentlig kontroll av felkällor och felfort- plantning. För att åtgärda detta utvecklade vi ett nytt verktyg, Triqler, som tillhandahåller adekvat falsk upptäcktshastighetsstyrning för proteinkvanti- fiering genom ett Bayesiska ramverk. Vi beskriver också MaRaQuant, där vi introducera kvantifiering-först-metoden som tillämpar klustring före iden- tifiering. Detta minskade antalet spektra som behövde matchas, men tillät oss också att upptäcka oidentifierade intressanta analyter. Kombinationen av dessa två verktyg överträffade den etablerade proteinkvantifieringsmetoden, MaxQuant/Perseus, och hittade anrikning av funktionella anteckningstermer för dataset där inget hittades tidigare. Contents

Contents v

Acknowledgments 1

List of publications 3

1 Introduction 5 1.1 Thesis overview ...... 5

2 Biological background 7 2.1 From DNA to proteins ...... 7 2.2 Proteins ...... 9 2.3 Current state of research ...... 10

3 Shotgun proteomics 13 3.1 Protein digestion ...... 13 3.2 Liquid chromatography ...... 14 3.3 Tandem mass spectrometry ...... 14

4 Data analysis pipeline 21 4.1 Feature detection ...... 21 4.2 Peptide identification ...... 22 4.3 Protein identification ...... 24 4.4 Protein quantification ...... 25

5 Statistical evaluation 27 5.1 Terminology ...... 27 5.2 False discovery rates ...... 28 5.3 The Percolator software package ...... 33

6 Mass spectrum clustering 35 6.1 Distance measures ...... 36 6.2 Clustering algorithms ...... 38

v vi CONTENTS

6.3 Evaluating clustering performance ...... 40

7 Label-free protein quantification 41 7.1 Feature detection ...... 41 7.2 Peptide quantification ...... 43 7.3 Matches-between-runs ...... 43 7.4 Missing value imputation ...... 44 7.5 Peptide to protein summarization ...... 44 7.6 Differential expression analysis ...... 44

8 Present investigations 47 8.1 Paper I: MaRaCluster ...... 47 8.2 Paper II: Protein-level false discovery rates ...... 48 8.3 Paper III: Percolator 3.0 ...... 48 8.4 Paper IV: Triqler ...... 49 8.5 Paper V: MaRaQuant ...... 49

9 Future perspectives 51

Bibliography 53 Acknowledgements

I owe many thanks to all the people who have made all the work in my Ph.D. and this thesis possible. First of all, none of this would have been possible without my marvelous supervisor, Lukas. Your support, optimism and never-ending stream of suggestions made this journey an immensely positive experience. I would also like to thank my collaborators, William Stafford Noble, Fredrik Edfors, Johannes Griss, Yasset Perez-Riverol and Juan Antonio Vizcaíno, as well as the OpenMS crew for their help and comradery, Timo Sachsenberg, Julianus Pfeuffer, Leon Bichmann and Oliver Alka. A special thanks to Oliver Serang, for all your fascinating stories and planting the Bayesian seed, and Wout Bittremieux, for your companionship all around the globe. All the people who have passed through the Käll lab over the years. Viktor and Luminita, for providing me with a warm welcome to the group, no matter how short it was. David, Heydar, Xuanbin, Ayesha, Sara, Daniel, Yunyi, Alicia, Revant, Aslıand Yang for all the good times in- and outside the lab. Jorrit, for injecting some “Hollandse nuchterheid” every so often. Yrin, for all the amusing discussions and board game evenings, go finish those degrees! Vital, for making the office an enjoyable place to come to. Richard, for sharing all your kooky ideas and stories. Gustavo, for your contagious good mood, stimulating conversations and for always being up for hot pot and a beer (or two). Patrick, for your infectious high-energy can-do attitude, go apply that to the mess I left you! To everyone at Gamma 6 for all the enjoyable lunches, fikas and after work activi- ties, Mattias, Linus, Pekka, Ikram, Owais, Mehmood, Auwn, Hashim, Mohammad, Hazal, Seong, Jovana, Andrei, Parul, Johannes, Ilaria, Daniel, Yue and Mandi. Also to all of you from SciLifeLab, you are simply too many to name, but in par- ticular thanks to Johannes, for being my TA buddy and all-round good company, and to Wenjing, Marcel and Fran, for providing me with a second group when our numbers were particularly low and all the times after that as well. To my friends that I was fortunate enough of meeting here in Stockholm, Steffen, Lisa, Lukas, Thomas, Nicole and Dmitry, and also my Berlin Master/Ph.D. buddies, Luis, Carlos and Xiao, for all the comradery and good times.

1 2 Acknowledgments

A big thanks as well to my family and friends in the Netherlands and extended family in Denmark. In particular, my parents, for all your continuous support and for always showing great interest in my work. And last but definitely not least, Ditte, for all your encouragements, patience and affection, I would not have come so far today if it wasn’t for you. List of publications

I have included the following articles in the thesis.

PAPER I: MaRaCluster: A fragment rarity metric for clustering frag- ment spectra in shotgun proteomics. Matthew The & Lukas Käll Journal of Proteome Research, 15(3), 713-720 (2016). DOI: 10.1021/acs.jproteome.5b00749 PAPER II: How to talk about protein-level false discovery rates in shot- gun proteomics. Matthew The, Ayesha Tasnim & Lukas Käll Proteomics, 16(18), 2461-2469 (2016). DOI: 10.1002/pmic.201500431 PAPER III: Fast and accurate protein false discovery rates on large- scale proteomics data sets with Percolator 3.0. Matthew The, Michael J. MacCoss, William S. Noble & Lukas Käll Journal of the American Society for Mass Spectrometry, 27(11), 1719-1727 (2016). DOI: 10.1007/s13361-016-1460-7 PAPER IV: Integrated identification and quantification error probabil- ities for shotgun proteomics. Matthew The & Lukas Käll Manuscript in preparation PAPER V: Distillation of label-free quantification data by clustering and Bayesian modeling. Matthew The & Lukas Käll Manuscript in preparation

The articles are used here with permission from the respective publishers.

3 4 LIST OF PUBLICATIONS

Other publications that I have not included in the thesis.

PAPER VI: Response to “Comparison and Evaluation of Clustering Al- gorithms for Tandem Mass Spectra”. Johannes Griss, Yasset Perez-Riverol, Matthew The, Lukas Käll & Juan Antonio Vizcaíno Journal of proteome research, 17(5), 1993-1996 (2018). DOI: 10.1021/acs.jproteome.7b00824 PAPER VII: Uncertainty estimation of predictions of peptides’ chromato- graphic retention times in shotgun proteomics. Heydar Maboudi Afkham, Xuanbin Qiu, Matthew The & Lukas Käll Bioinformatics, 33(4), 508–513 (2017). DOI: 10.1093/bioinformatics/btw619 PAPER VIII: A protein standard that emulates homology for the charac- terization of protein inference algorithms. Matthew The, Fredrik Edfors, Yasset Perez-Riverol, Samuel H Payne, Michael R Hoopmann, Magnus Palmblad, Björn Forsström & Lukas Käll Journal of proteome research, 17(5), 1879-1886 (2018). DOI: 10.1021/acs.jproteome.7b00899 PAPER IX: ABRF Proteome Informatics Research Group (iPRG) 2016 Study: Inferring Proteoforms from Bottom-up Proteomics Data. Joon-Yong Lee, Hyungwon Choi, Christopher Colangelo, Darryl Davis, Michael R. Hoopmann, Lukas Käll, Henry Lam, Samuel H. Payne, Yasset Perez-Riverol, Matthew The, Ryan Wilson, Susan T. Wein- traub & Magnus Palmblad Journal of Biomolecular Techniques, 29(2), 39-45 (2018). DOI: 10.7171/jbt.18-2902-003 Chapter 1

Introduction

We live in an era where data is generated in gigabytes, or sometimes even terabytes per hour. The field of biology forms no exception to this, where we use all this data to capture the complexity of life, most prominently in the forms of DNA, RNA and proteins. The field that deals with the analysis of this biological data is called bioinformatics. Of the 3 aforementioned primary building blocks of life, proteins are arguably the most important ones when it comes to biological processes. Ideally, one would want to know the expression levels of all proteins in a cell or tissue, and with advances in technology that allow deep sampling of, for example, the human proteome, we are well on our way. In the field of proteomics, the study of proteins, a major part of the data genera- tion comes from so-called shotgun proteomics experiments. In these experiments, we start with a mixture of unknown proteins. We then use several experimental procedures to break the proteins up into smaller, more easily identifiable fragments and subsequently detect and register these fragments. The challenge is then to put these shards of evidence together again and derive what the original proteins were that were present in the mixture. Several gigabytes of data from these shotgun proteomics experiments are deposited daily to openly accessible repositories, such as the PRIDE archive [77] (http:// www.ebi.ac.uk/pride/archive/) and PeptideAtlas [22] (http://www.peptideatlas. org/). In this thesis, we will look at novel methods we have developed which at- tempt to explore and summarize these gold mines of information, using techniques from machine learning and statistics.

1.1 Thesis overview

In order to make this thesis as accessible as possible to readers unfamiliar with the field of shotgun proteomics. Chapters 2 and 3 are devoted to the biological

5 6 CHAPTER 1. INTRODUCTION and experimental background respectively. Chapter 4 contains an overview of the analysis pipeline that is commonly used in shotgun proteomics and has also been employed at several instances in the included papers. Chapter 5 focuses on the statistical methods that are used to provide appropriate levels of confidence in the obtained results and are the subject of papers II and III. Chapter 6 examines several approaches to summarize the data of large-scale studies by clustering similar observations, providing an introduction to paper I. Chapter 7 gives an introduction into protein quantification, which is the subject of papers IV and V. Chapter 8 summarizes the main points of the included papers and explains where they fit into the pipeline. Finally, Chapter 9 provides some perspectives into the future of the field of shotgun proteomics and the subjects of the included papers in particular. Chapter 2

Biological background

Over the last century, molecular biology has fundamentally changed our under- standing of life and the way we think about it. Among other things, it has given us the tools to reason about evolution, heredity and disease. We take it for granted that we can check for risk factors for certain diseases through analysis of our genome and that we can test blood samples for protein disease markers. Here, a short back- ground in the field of molecular biology will be provided, with a focus on proteins, together with the relevant terminology used later in the thesis.

2.1 From DNA to proteins

The relationship between the 3 primary building blocks of life can be formulated in what is known as the central dogma of molecular biology. It can loosely be formulated as ”DNA is transcribed to RNA and RNA is translated into proteins“. DNA stands for deoxyribonucleic acid and is the famous double-stranded helix that stores our genetic information. Each strand is built up of a chain of nucleotides, of which there are 4 types: cytosine (C), guanine (G), adenine (A) and thymine (T). The two strands are complementary, in the sense that a C on one of the strands has to be coupled to a G at the same location on the other strand and vice versa, while the same relation holds for A and T. Each such pair of C-G or A-T is called a base pair and DNA molecules are in the order of millions of base pairs long. RNA stands for ribonucleic acid and is very similar to DNA, except that it only has a single strand and is typically only a few thousand base pairs long. Instead of thymine, RNA uses the nucleotide type uracil (U). RNA is copied from specific regions of the single DNA strands, called genes, through a process called transcrip- tion. In most eukaryotic genes, certain parts, called introns, of the RNA chains are subsequently removed in a process called RNA splicing. The remaining RNA chains, called exons, can be combined in different ways, giving rise to splice variants of a gene.

7 8 CHAPTER 2. BIOLOGICAL BACKGROUND

Residue Formula Mass (Da) Alanine Ala A C3H5NO 71.04 Arginine Arg R C6H12N4O 156.10 Asparagine Asn N C4H6N2O2 114.04 Aspartic acid Asp D C4H5NO3 115.03 Cysteine Cys C C3H5NOS 103.01 Glutamic acid Glu E C5H7NO3 129.04 Glutamine Gln Q C5H8N2O2 128.06 Glycine Gly G C2H3NO 57.02 Histidine His H C6H7N3O 137.06 Isoleucine Ile I C6H11NO 113.08 Leucine Leu L C6H11NO 113.08 Lysine Lys K C6H12N2O 128.09 Methionine Met M C5H9NOS 131.04 Phenylalanine Phe F C9H9NO 147.07 Proline Pro P C5H7NO 97.05 Serine Ser S C3H5NO2 87.03 Threonine Thr T C4H7NO2 101.05 Tryptophan Trp W C11H10N2O 186.08 Tyrosine Tyr Y C9H9NO2 163.06 Valine Val V C5H9NO 99.07

Table 2.1: Table of the 20 canonical amino acids. The second and third column are, respectively, the 3-letter code and 1-letter code commonly used as a shorter representation of the amino acid when given in a protein sequence. The given mass is the monoisotopic mass given in unified atomic mass units, also known as Daltons (Da).

Proteins consist of a chain of typically hundreds of amino acids, which are connected by amide bonds. The amino acids come in 20 types that all have the same basic structure of an amine (NH2) and carboxylic acid (COOH) group, but differ in their side chains (see Table 2.1). They are translated from RNA strands, where a triplet of nucleotides codes for a certain type of amino acid. As one can see from the math, there are 43 = 64 triplets and only 20 amino acid types, and consequently certain triplets code for the same amino acid. Also, there is one triplet, the start codon, that signals the start of the protein as well as three triplets, the stop codons, that mark the end of the protein. A protein is commonly represented as a string of amino acids, starting with the amino acid at the amino end, also called the N-terminus, and ending with the amino acid at the carboxyl end, also called the C-terminus. This is also the direction in which the protein is translated from RNA, i.e. new amino acids are added to the carboxyl end during translation. 2.2. PROTEINS 9

From the brief description above, one might get the impression that it would be enough to know the DNA sequences to know what proteins would be produced. However, there are several complicating factors such as splice variants, gene regu- lation, as well as RNA and protein degradation that prevent this conclusion from being correct. Knowing the DNA sequence does not tell us how much RNA will be produced from a specific gene and the correlation of protein abundance and RNA levels is generally low, although this is currently a controversial point of de- bate [47, 44, 75, 115, 25].

2.2 Proteins

In contrast to DNA strands, proteins can have many different structures, owing to the diversity of the amino acid types’ side chains. The side chains differ in, among other things, ion charge, size and hydrophilicity, i.e., to which degree it is attracted to water. The proteins typically folds into a conformation that minimizes its energy (Figure 2.1), which depends on the above-mentioned characteristics, as well as other chemical properties of the amino acid side chains. Proteins can also bind to one or more other proteins, forming so-called protein complexes.

Figure 2.1: The 3D structure of the protein myoglobin (blue). The amino acid chain folds into specific structures that give the protein its functionality. Myo- globin is involved in the transportation of oxygen to cells. Here, a heme group (gray) with a bound oxygen molecule (red) is bound to the myoglobin molecule. [7] 10 CHAPTER 2. BIOLOGICAL BACKGROUND

Exactly this diversity in structure allows proteins to act as the major provider of functionality on cell level, for example, catalyzing chemical reactions, transporting molecules, or receiving and reacting to signals in the cell’s environment. An im- portant mechanism that helps in this last task is post-translational modifications (PTMs), which are reversible chemical alterations to amino acids that can, for ex- ample, activate or deactivate the protein. In short, knowing the proteins that are present in a cell, ideally with their modifications and a functional annotation, can tell us a lot about what is happening in a cell. PTMs and splice variants result in variation in the population of proteins stemming from the same gene, known as the different proteoforms of a gene [103]. Humans have approximately 20 000 different protein-coding genes [36]. Taking splice vari- ants into account this results into around 70 000 protein species [116]. Accounting for PTMs as well as single nucleotide polymorphisms (SNPs), the most common form of genetic variation between people, could send the total number of distinct proteoforms into the millions [1]. Proteoforms from a single gene are all slightly different and may even have different functionality. However, as they often have significant parts of their amino acid chains in common, it can be quite challenging to distinguish them from one another. Therefore, it is common to report proteins on gene level, either with or without taking splice variants into account.

2.3 Current state of research

Nowadays, we can determine entire DNA sequences of humans and other organisms, their genomes, in a matter of hours. We can also get relatively complete images of the gene expression, the transcriptome, using RNA sequencing techniques. How- ever, getting a complete image of the protein expression, the proteome, is not yet feasible. Note that gene and protein expression are very much cell type dependent and therefore the transcriptome or proteome of an organism should be a collection of transcriptomes or proteomes from different tissues and cell types. Some notable attempts have been made at constructing a human proteome [115, 60], but both projects took a substantial amount of time and effort and are far from complete, each claiming a coverage of 70 − 80% of the known human genes. One of the main challenges in proteomics is the large dynamic range of protein copy numbers in a cell. Protein copy numbers span 7 orders of magnitude in a typical cell, from just a single copy to tens of millions [122]. In stark contrast, typical untargeted proteomics experiments can cover about 4 orders of magnitude and are unable to detect proteins with fewer than 100 copies per cell [122]. Another important cornerstone of molecular biological research is the functional annotation of genes and proteins. As biological processes are often highly complex and interdependent, it is often difficult to isolate the specific functionality of a single gene or protein. While there are definitely cases where we have a clear image of the function of a gene or protein, we more often have to satisfy ourselves to only 2.3. CURRENT STATE OF RESEARCH 11 knowing the pathway it is active in, or use the functionality of a protein that has a (partly) similar sequence or structure as a proxy. Databases that have up-to-date information on gene or protein functionality include Gene Ontology [6] (http:// geneontology.org/), Pfam [9] (http://pfam.xfam.org/) and Uniprot [4] (http: //www.uniprot.org/).

Chapter 3

Shotgun proteomics

In this chapter, we will take a look at one of the most common techniques to identify and quantify large numbers of protein species in a sample, shotgun proteomics. Direct measurement of proteins has proven to be difficult [59, 13]. Therefore, in shotgun proteomics, several steps are undertaken to reduce the complexity of the analyte by breaking up the proteins into progressively smaller pieces, hence the adjective shotgun. The following sections give a more detailed explanation of each of these steps. Note that, we will focus an untargeted proteomics techniques, where we attempt to identify and quantify thousands of proteins at a time. We will not delve into targeted proteomics techniques, such as selected reaction monitoring (SRM) [91] and multiple reaction monitoring (MRM) [62], as only tens of proteins can typically be detected in such experiments. Also, fractionation techniques, such as gel electrophoresis and strong cation exchange [82], will not be explored here, as this generally reduces the reproducibility between runs.

3.1 Protein digestion

Proteins typically have lengths of a few hundreds of amino acids, which make them too complex to detect directly. Therefore, the proteins are first digested, similar to the way that food is digested in our digestive track. Proteases cut the long protein amino acid chains into smaller pieces, ideally to a length of about 6 to 50 amino acids. These short amino acid chains are called peptides and are basically shorter forms of proteins. In the following, when we refer to a peptide, we actually mean the set of all peptides with the same amino acid chain, sometimes also referred to as a peptide species, rather than a single molecule. Common proteases include trypsin, Lys-C and chymotrypsin, and are often isolated from digestive systems of vertebrates, such as porcine and bovine trypsin, which are isolated from the pancreas. Proteases cleave the protein at specific places in

13 14 CHAPTER 3. SHOTGUN PROTEOMICS the amino acid chain, e.g. trypsin cleaves the amide bonds after arginine and lysine unless they are followed by proline. As an example, the hypothetical protein MKPATRDISILKILGAG would be split up into the peptides MKPATR, DISILK and ILGAG using trypsin as the protease. The predictability of this cleaving process makes it easier to identify the peptides in later stages. However, the cleaving process is not perfect and frequently some cleavage sites are missed, known as miscleavages. The opposite can also happen,proteins are split at other amide bonds than the cleavage site, known as partial digestion.

3.2 Liquid chromatography

The goal is now to identify the peptides that are present after the digestion step. This task will be easier if we can analyze only a single peptide at a time. Liquid chromatography is used as a first step to accomplish this separation of the peptides. As mentioned before, amino acid side chains have different chemical properties, for example, with respect to hydrophobicity and charge. Liquid chromatography (LC) uses this fact by letting the peptides, dissolved in a liquid solvent, pass through a column containing an adsorbent. The composition of the liquid solvent is gradually changed to include more organic solvent. The duration and strategy to perform this change is called the gradient. Through interactions between the adsorbent, the liquid solvent and the amino acid side chains, peptides are slowed down at different rates. This causes different peptides to elude at different times from the LC column, whereas peptides from the same species elude at approximately the same time. The time it takes for a particular peptide to pass through the LC column is known as the peptide’s retention time. The longer the gradient, the more peptides and proteins can be identified. As a rule of thumb, the number of identified proteins is 1000 + 1000 × h, with h the number of hours of the gradient [122]. Typical gradient times are around 2 hours, but short gradients of just 20 minutes [38] and long gradients of more than 40 hours [54] have been used.

3.3 Tandem mass spectrometry

While liquid chromatography creates a first level of separation, there are usually still many peptide species that elude at approximately the same time. As our goal is to analyze a single peptide species at a time, we need to apply a second level of separation. This is done with a mass spectrometer, an instrument that ionizes molecules and then separates molecules on the basis of their mass-to-charge (m/z) ratio1 (Figure 3.1). In shotgun proteomics, we often use a technique which consists of two steps of mass spectrometry, tandem mass spectrometry.

1Molecular masses are typically given in dalton (Da) units, whereas mass-to-charge ratios are Da given in thomson (Th) units, where 1 Th = 1 e , with e the elementary charge unit. 3.3. TANDEM MASS SPECTROMETRY 15

The instrument makes use of the fact that charged molecules will change direc- tion when passing through an electric or magnetic field, by the Lorentz force: F = q(E + v × B). The degree of deflection depends on the mass of the molecule by Newton’s second law of motion: F = m · a. Modern mass spectrometers work on the same principles but use different techniques to separate molecules than the one schematically depicted in Figure 3.1, for instance, time of flight (TOF), quadrupoles or ion traps. Some of these techniques can detect the mass-to-charge ratio of an ion without it needing to hit a detector, facilitating further analysis of this molecule.

Detection Faraday collectors {m/q} = 46 am {m/q} = 45 be {m/q} = 44 n io e current iv t i s o p

magnet amplifiers

ratio Ion source output beam focussing ion acelerator electron trap ion repeller legend: gas inflow (from behind) m ... ion mass ionizing filament q ... ion charge

Figure 3.1: A mass spectrometer ionizes the molecules of interest and then separates the charged molecules on the basis of their mass-to-charge ratio. [35]

The LC column is often directly connected to the mass spectrometer and peptides enter the mass spectrometer according to their retention time. Peptides are first ionized into precursor ions before the deflection in an electric or magnetic field. The precursor mass-to-charge ratios, i.e., the mass-to-charge ratios of the precursor ions, are recorded in a so-called MS1 spectrum. Several MS1 spectra are recorded, each at a different retention time. The MS1 spectrum consists of a list of mass-to-charge ratios with their correspond- ing signal intensity, which itself is a function of the precursor ions’ abundance. We often plot the signal intensity as a function of the mass-to-charge ratio to visualize 16 CHAPTER 3. SHOTGUN PROTEOMICS a spectrum, where we can identify the precursor peaks corresponding to certain pre- cursor ions as mass-to-charge ratios with high signal intensity (Figure 3.2, bottom). The MS1 spectrum furthermore records the retention time at which the spectrum was registered.

To visualize the precursor ions entering the mass spectrometer we can create a mass chromatogram, which plots the ion intensity against the retention time. For example, a plot can be made of the ion intensity summed over all precursor mass- to-charge ratios (Figure 3.2, top), known as a total ion chromatogram (TIC). Such a plot can be seen as a summary of all the MS1 spectra, which themselves can be thought of as slices along the retention time axis, by disregarding the recorded individual precursor ion mass-to-charge ratios. This type of plot mostly serves as a diagnostic plot to check how many analytes elude at which retention time. Chro- matograms of specific precursor mass-to-charge ratio, extracted-ion chromatograms (XICs), can be used to quantify peptides, by using the peak height or the area under the curve at a certain retention time interval as a proxy for the peptides’ abundance.

Total intensity (1e9) Retention time

Ion intensity (1e6) Precursor m/z

Figure 3.2: Examples of a TIC plot (top) and an MS1 spectrum (bottom). Top: in a TIC mass chromatogram we plot the retention time in minutes on the x axis and the total precursor ion intensity on the y axis. Bottom: the MS1 spectrum taken at a retention time of 85.2 minutes. It displays the precursor ion mass-to- charge ratio on the x axis and the precursor ion intensity on the y axis. 3.3. TANDEM MASS SPECTROMETRY 17

After obtaining the MS1 spectrum, two diverging approaches are available. Either we try to analyze single peptide species at a time by selecting a very narrow precur- sor mass-to-charge ratio window, known as data-dependent acquisition, or we take large windows of mass-to-charge ratios and capture multiple peptide species at a time, known as data-independent acquisition.

Data-dependent acquisition (DDA) For data-dependent acquisition (DDA), a narrow window called the isolation win- dow, of typically around 1 Th around the target precursor mass-to-charge ratio is selected based on high-intensity precursor peaks in the MS1 spectrum. This ide- ally leads to the selection of a single peptide species. The precursor ions within this range are fragmented into smaller ions, using techniques such as collision- induced dissociation (CID) [53], Higher-energy collisional dissociation (HCD) [88] and electron-transfer dissociation (ETD) [105, 121]. Most often, the peptide breaks at one of its amide bonds, thereby producing two fragment ions that are themselves short amino acid chains, or just a single amino acid. These fragment ions are then recorded in a so-called MS2 spectrum or fragment spectrum.

Several mass-to-charge ranges of the precursor ions are selected to produce numer- ous MS2 spectra. Usually, a list of recently selected mass-to-charge ranges is kept, so that one does not sample the same highly abundant precursor ions over and over again. This technique is known as dynamic exclusion and includes several parame- ters that govern how often one samples a particular range before excluding it and for how long it should be excluded.

An MS2 spectrum, like an MS1 spectrum, consists of a list of mass-to-charge ratios with their corresponding signal intensity. In the plots of the signal intensity as a function of the mass-to-charge ratio, we can now identify the fragment peaks as corresponding to certain fragment ions (Figure 3.3). The MS2 spectrum also retains information about the mass-to-charge range that was selected for the precursor ion, as well as the retention time.

The advantage of data-dependent acquisition is that the MS2 spectra usually only feature a single (dominant) peptide species, thereby preventing the more difficult case of having the fragment peaks of two or more peptide species in a single MS2 spectrum. In this case, we both need to figure out which peaks belonged to which peptide species, as well as identify which peptide species they came from. However, it is not at all uncommon for two peptide species to be fragmented together anyway, resulting in so-called chimeric spectra, with recent studies showing over 50% of the MS2 spectra featuring this behavior [72, 79].

The major issue with data-dependent acquisition is its poor reproducibility. Each time the experiment is run, a different set of precursor ions could be selected. This mainly is due to small variations in the MS1 spectrum that influence the precursor 18 CHAPTER 3. SHOTGUN PROTEOMICS

103111-Yeast-2hr-03.ms2:28023 (peptide = K.NFTPEQISSMVLGK.M, q-val = 0.0) 160000 K.NFTPEQISSMVLGK.M -K -GK NF- -LGKNFT- 140000

120000

100000

80000 intensity 60000

40000

20000

0 0 200 400 600 800 1000 1200 1400 m/z

Figure 3.3: Example of an MS2 spectrum. In an MS2 spectrum, we plot the observed fragment ion intensity (black bars) as a function of the fragment ion m/z. Overlayed, we plotted the predicted fragment ion peaks of its peptide identification (red lines, colored amino acid oligomer annotations), which frequently match up with the observed fragments. The precursor ion peak at a mass-to-charge ratio of 517.60 Th can be seen in Figure 3.2.

mass-to-charge ratio selection and thereby also the outcome of the dynamic exclu- sion. This would not be a problem if we were able to sample all or most of the peptide species at least once, but in practice, we are often considerably undersam- pling the pool of present peptide species due to limits of the instrumentation [79].

Data-independent acquisition (DIA) A recently popularized alternative to DDA is data-independent acquisition (DIA), also known as SWATH [110, 39]. Here, broad precursor mass-to-charge ratio win- dows are systematically selected for the fragmentation round, with often tens of peptide species in it. At the cost of more complex MS2 spectra, we now have a much deeper sampling of the pool of present peptide species. If the windows are taken broad enough, one could even sample all peptide species that elude from the LC column, which could greatly increase the reproducibility of shotgun proteomics 3.3. TANDEM MASS SPECTROMETRY 19 experiments. One of the main use cases of DIA is in protein quantification, where we can now use chromatograms of specific fragment ions in the MS2 spectra for quantification, instead of using the chromatograms from the precursor ions in the MS1 spectra.

Chapter 4

Data analysis pipeline

Bioinformaticians receive the MS1 and MS2 spectra that were produced in the shotgun proteomics experiment and are now faced with the task of putting these pieces of the puzzle back together. The end product is commonly a list of proteins that one believes to have been present in the original mixture, possibly with a quantification of the protein’s concentration. This chapter focuses on peptide and protein identification rather than quantification, which is the subject of Chapter 7. For a more elaborate review of the data analysis pipeline, see [86].

4.1 Feature detection

One of the most important pieces of information for matching an MS2 spectrum to a peptide sequence is the precursor mass of the peptide. By assigning an accurate precursor mass to a spectrum, the search space can be reduced to only the peptide sequences with the correct mass. Each MS2 spectrum does come with informa- tion regarding its precursor mass-to-charge ratio, but in fact, this only reflects the isolation window rather than the actual mass-to-charge ratio of the precursor ion of interest. To resolve this discrepancy, we can recalibrate the precursor mass- to-charge ratios of the precursor ion and make a prediction of its charge state by examination of the MS1 spectra.

Software packages that were developed for this purpose include Hardklör [50] and Bullseye [52]. Hardklör scouts the MS1 spectra for potential precursor ions, known as MS1 features, and Bullseye subsequently matches these to the corresponding MS2 spectrum, while propagating information extracted from the peaks in the MS1 spectra, such as precursor mass-to-charge ratio, charge, retention time and ion current. Note that multiple MS1 features can be assigned to a single MS2 spectrum and it is not uncommon to have multiple precursor ions within the isolation window, resulting in chimeric spectra. In the context of peptide and protein identification, we normally do not use the MS1 spectra from this point on and focus solely on the

21 22 CHAPTER 4. DATA ANALYSIS PIPELINE

MS2 spectra for the identification of peptides. For the role of feature detection in protein quantification, see Chapter 7.

4.2 Peptide identification

Several approaches are available for identifying the peptide from an MS2 spectrum. They can roughly be classified into three categories: database search engines, spec- tral library searching and de novo sequencing. Database search engines are by far the most commonly used and usually are capable of identifying the highest number of peptides, but the other two categories are gaining in popularity as the amount of data, the accuracy of the instruments and available computing power are increasing.

Database search engines As the number of possible peptides grows exponentially with their length n as 20n, identifying the amino acid sequence directly from the MS2 spectrum is a very difficult task. Instead, we usually work with databases containing sequences of known proteins for the organism of interest. We then do an in silico digestion of these proteins, i.e. we cleave the proteins based on the cleavage rules of the protease that was used. At this stage, we can also account for miscleavages, partial digestion and post-translational modifications. Having obtained a list of potentially present peptides, we can construct a theoreti- cal spectrum by considering the mass-to-charge ratios of the fragments that result from breaking the peptide at the amide bonds. Subsequently, a scoring function is applied to a pair of experimental and theoretical spectra, which assigns a score to the pair. Such a pair is called a peptide-spectrum match (PSM). We only compare experimental and theoretical spectra for which the precursor masses are sufficiently close, which is set depending on the accuracy of the instrumentation. On modern instruments, this mass search window can be as narrow as 20 parts per million (ppm) around the experimentally obtained precursor mass-to-charge ratio. Popular database search engines include SEQUEST [27], MASCOT [16], X!Tandem [19] and MS-GF+ [61].

Spectral library searching One of the problems with database search engines is that the theoretical spectrum is not necessarily a good template for the experimental spectrum of a peptide due to the complex processes governing the fragmentation. Therefore, in spectral library searching we instead choose to use previously identified spectra as a template for a peptide. This first of all requires a good database of template spectra, as well as an appropriate scoring function that compares a template spectrum to the experimental spectrum. 4.2. PEPTIDE IDENTIFICATION 23

Databases of template spectra are commonly constructed using the reliable identifi- cations from database search engines. Notable examples include, the NIST (http: //peptide.nist.gov), PeptideAtlas (http://www.peptideatlas.org/speclib/) and PRIDE (https://www.ebi.ac.uk/pride/cluster/) spectral libraries. As dif- ferent ionization and fragmentation techniques produce different spectra for the same peptide species, it is often beneficial to select a library created with the same or similar instrumentation. Spectral libraries are especially important in the context of DIA, as traditional search engines typically do not have sufficient sensitivity to identify more than 2 peptide species in a spectrum. Using the peak intensity information from previ- ous experimental spectra and reducing the search space of candidate peptides can provide the extra sensitivity needed [39]. Packages implementing a scoring function to identify peptides from spectral li- braries include SpectraST [67], X!Hunter [20] and BiblioSpec [34]. For a compre- hensive overview of spectral library searching, see [66].

De novo sequencing The most challenging class of algorithms for peptide identification is de novo se- quencing. Here, one attempts to identify the peptide sequence directly from the distances between fragment peaks, without making use of a database. De novo identification is most commonly used for experiments in which no good reference database is available, for example for organisms without a reference genome [24]. One of the challenges for de novo identification is that, frequently, not all amide bond breaks are recorded as fragment ions, resulting in a gap of two or more amino acids. By using combinations of amino acids one can bridge such gaps, but often several combinations of amino acids add up to an approximately equal mass that could explain such a gap. Additionally, we do not know the order of the amino acids in the gap. Therefore, de novo peptide identifications often contain ambiguous regions, i.e., multiple short amino acid chains are just as likely to have produced a part of the spectrum. Popular de novo sequencing packages include PEAKS [74], PepNovo [32], NovoHMM [29], pNovo [14] and Novor [73].

Post-translational modification searches The ability to identify post-translation modifications (PTMs) represents one of the main benefits of proteomics over transcriptomics. However, it comes with the challenge that the search space increases exponentially as a function of the number of modifications that are to be searched for. An increased search space not only increases the time for peptide identification but also reduces the sensitivity of the search as the probability of obtaining a false positive increases as well. 24 CHAPTER 4. DATA ANALYSIS PIPELINE

Typically, only a handful of common modifications are searched for, such as oxi- dation of methionine, N-term acetylation and deamidation of asparagine and glu- tamine. Unfortunately, these modifications typically do not have much biological relevance, but failure to account for them will result in many missed peptides. More interesting are studies of, for example, phosphorylation of protein kinases, that play an important role in cell signaling [28, 76]. However, these studies typically rely on the enrichment of these post-translationally modified peptides and proteins, as their expression levels are typically too low to be detected in normal experiments. All major database search engines allow users to specify the modifications they want to search for, but the choice of which modifications to search for is not a straight- forward task. Alternatively, so-called open modification searches do not require a predetermined set of modifications, but choose a modification based on unexplained gaps between fragment peaks [83] or on the precursor mass difference [64]. Popular open modification search engines are MODa [83] and MSFragger [64].

Retention time prediction The supplied precursor mass narrows down the number of candidate peptides for an experimental spectrum from the order of millions to several hundreds. We could narrow down the search space even further if we predict the retention time of a theoretical peptide and compare this to the retention time of the experimental spectrum. Unfortunately, the processes inside the LC column are quite complex and no simple expression can be found to predict a peptide’s retention time and are furthermore dependent on the type of LC column. A popular approach has been to calculate retention coefficients to the amino acids, reflecting their individual contribution to the retention time as used in, for exam- ple, SSRCalc [65]. Also, attempts have been made to predict retention times by modeling the physical interactions of the peptide with the LC column such as in BioLCCC [40]. Alternatively, machine learning algorithms, such as Elude [81] and RTPredict [90], can be applied that learn parameters from previously seen data and use these to make predictions of peptide retention times. For a review of the sub- ject, see [80]. Retention time prediction is the subject of Paper VII (not included in this thesis).

4.3 Protein identification

For certain studies, it is enough to have the peptide identifications, but often we are more interested in the actual proteins that those peptides came from. This task is made difficult by the fact that many peptides could stem from more than one protein, known as shared peptides, either due to multiple splice variants from the same gene or due to homologous genes, that is, genes with common ancestry. Therefore, we cannot simply combine evidence from all peptides that could have 4.4. PROTEIN QUANTIFICATION 25 come from a protein, as some of those peptides could have come from other proteins. This issue has been designated as the protein inference problem [84] and several protein inference methods have been developed to tackle it. There is currently a wide availability of schemes to infer proteins, based on different principles such as IDPicker [117], which uses the parsimony principle, and Protein- Prophet [85], MSBayes [69] and Fido [99], which all use probabilistic models. For a review on the subject see [100].

4.4 Protein quantification

Another entity of interest can be the concentration of each protein that was iden- tified and a wide variety of methods are employed for this. While most of these methods are quantifying peptides rather than proteins, the idea is that we can trans- fer these peptide quantifications to protein-level. Here, we again face the protein inference problem and care has to be taken in doing this correctly. By comparing the quantitative levels of peptides or proteins between different samples, for ex- ample, treated and untreated, we can get a better insight into the inner workings of the biological processes. For a review on the subject of protein quantification, see [8]. Some quantification methods rely on modification of the sample by introducing stable isotopes, for example, stable isotope labeling by amino acids (SILAC) [89] and isobaric labeling techniques, such as tandem mass tags (TMT) [108] and isobaric tags for relative and absolute quantitation (iTRAQ) [114]. Others do not require modification of the sample and are therefore cheaper, simpler and allow retrospective analysis of the data, often at the cost of higher noise levels and more missing values. These so-called label-free quantification strategies include spectral counting [70] or using the total ion current (TIC) of precursor ions [12]. Chapter 7 will provide more detailed information with regards to label-free quan- tification. Alternatively, DIA methods allow protein quantification by fragment ions and achieve high coverage and reproducibility [95].

Chapter 5

Statistical evaluation

Having obtained a list of peptides, proteins or even protein quantities from the data analysis pipeline, we are faced with the question: “How confident are we that these are correct?”. The field of statistics is essential in answering this question. Failure to apply these techniques or to apply them correctly could cost an experimenter valuable time and money. We start by providing some statistical terminology, fol- lowed by the most common approach to statistical evaluation in shotgun proteomics and finish with a section about a software package developed by our lab, Percola- tor, that performs statistical evaluation of shotgun proteomics data. This chapter serves as a detailed introduction to papers II and III.

5.1 Terminology1

In statistics, a null hypothesis usually represents the presupposition that the obser- vations are a result of random chance alone. For example, a null hypothesis for a PSM produced by a database search engine could be, “The peptide was spuriously matched to the spectrum”. We will call a PSM, peptide or protein for which we have found evidence through the data analysis pipeline an inference, as they have respectively been inferred from mass spectra, PSMs and peptides. These inferences can be correct or incorrect, which actually depends on the null hypothesis we are testing, as will be illustrated later in this chapter. Ideally, the list of inferences is sorted by some score, for instance, from the scor- ing function of a database search engine in the case of PSMs. Since we are only interested in the inferences we are very confident about, we typically only keep

1In some cases, other terms will be more common in the field of statistics than the ones used here. However, terms like “identification” or “feature” have confounding meanings due to its concurrent usage in the fields of proteomics and machine learning and were therefore avoided on purpose

27 28 CHAPTER 5. STATISTICAL EVALUATION the inferences scoring better than some kind of threshold. These inferences scoring above the threshold will be referred to here as discoveries 2. Discoveries, like in- ferences, can be correct or incorrect; correct discoveries are called true discoveries and incorrect discoveries are called false discoveries.

5.2 False discovery rates

Introduced by Benjamini and Hochberg [11], the false discovery rate (FDR) is a concept from multiple hypothesis testing aimed at controlling the number of false discoveries. When multiple hypotheses are tested at the same time, we cannot sim- ply use a significance level on p values, as we would when testing a single hypothesis. To see this, say that we use a significance level of 0.05 and test this hypothesis a 1000 times and let the null hypothesis be true all times. Then, by random chance alone, we would already expect 50 discoveries, without there being a true discovery in the whole set. The FDR is defined as the expected fraction of false discoveries in the group of discoveries:  false positives  FDR = E false pos. + true pos. The FDR allows for a few false discoveries, provided that their number is small compared to the number of true discoveries. A commonly employed threshold of 1% FDR would, for example, allow 10 false discoveries, provided that there are 990 true discoveries. This might not be acceptable for certain critical applications, but in shotgun proteomics, we are often dealing with exploratory data analysis and thus are more interested in many discoveries as candidates for further examination and less concerned about false discoveries. As the FDR is not a monotonic function of the score, we often prefer to use the q value of an inference (Figure 5.1), which is the multiple testing corrected analog of the widely used p value. The q value is defined as the minimum of the FDRs of all inferences with the same or a worse score than this inference. The q value is a monotonic function of the score and therefore it makes sense to set a threshold on it and call all inferences below a certain q value discoveries.

Decoy models For each inference, we test the selected null hypothesis by considering the distribu- tion under the null hypothesis, i.e., we look at the score distribution of inferences for which the null hypothesis is true and compare the score of an inference to this null distribution. However, if we simply search our spectra against a protein database,

2In statistical literature, discoveries are also frequently referred to as positives or significant features 5.2. FALSE DISCOVERY RATES 29 we obtain a mixture of correct and incorrect inferences and, therefore, do not know the null distribution. To obtain this null distribution, we artificially create a set of inferences for which we know the null hypothesis is true, known as the decoy model [56]. This is usually done by reversing or shuffling the original protein database, resulting in a decoy database containing decoy proteins, and subsequently, search the spectra against this database. As none of the decoy peptides, i.e., peptides derived from the decoy proteins, should be present in the sample, any match to such a decoy peptide, called a decoy PSM, is incorrect. The score distribution for the matches to decoy peptides is, therefore, a good estimation of the null distribution for the PSMs. The original protein database containing the proteins of interest is called the target database and correspondingly we have target PSMs, target peptides and target proteins. Although any database of non-present proteins would produce a score distribution for which the null hypothesis is true, using a reversed or appropriately shuffled ver- sion of the protein sequences of interest ensures that several peptide characteristics, like length and amino acid frequency, are preserved. This, in turn, ensures that the probability of obtaining a hit to the decoy database is approximately equal to the probability of obtaining an incorrect hit in the target database.

PSM-level false discovery rates To estimate the PSM-level FDR at a certain threshold t for a list of target PSMs, we can count the number of these decoy PSMs scoring better than t and divide that by the number of target PSMs scoring better than t (Figure 5.1). An extra correction should be applied to account for the fact that the number of incorrect target PSMs is lower than the number of decoy PSMs [104]. An alternative method to calculate the PSM-level FDR is to search the spectra against a concatenated database, i.e., a protein database consisting of the target and decoy proteins combined, also known as target-decoy competition [26] (TDC). In this case, we expect to get the same number of decoy PSMs as incorrect target PSMs at a threshold t and no correction factor is needed. This approach is especially beneficial if there is a bias in the scoring function, which causes some spectra to systematically score better than others [58]. If the scores are not biased the two methods give approximately the same performance, though TDC is to be preferred as some issues still remain with the first approach [57].

Peptide-level false discovery rates Frequently, many PSMs map to the same peptide and correctly inferred peptides usually have more PSMs mapped to them than incorrectly inferred peptides. There- fore, simply using the PSM-level FDR as the peptide-level PSM will lead to over- confident FDR rates. In contrast, inferring the peptide by taking the score of the best scoring PSM and subsequently recalculating the FDR based on this truncated 30 CHAPTER 5. STATISTICAL EVALUATION list will give an accurate peptide-level FDR [41]. Note that multiple PSMs with the same peptide sequence are not independent pieces of evidence, as multiple spectra of the same analyte could receive the same incorrect identification. Therefore, these PSMs cannot simply be combined by, for example, Fisher’s method for combining p values [30].

Protein-level false discovery rates While the FDR itself is well defined and PSM- and peptide-level are well accepted in the field, protein FDRs have caused some controversy due to its ambiguity in its definition and its difficulty to be estimated. We will first focus on the issue of ambiguity in its definition, which can be traced back to the different definitions of the (often implicit) null hypothesis different authors use. Two camps can be distinguished right off the bat. Loosely, their two differing null hypotheses can be formulated as “the evidence for the protein came from incorrectly interpreted mass spectra” [85, 93, 97] and “the protein is absent from the sample” [99, 117, 69]. Interestingly, while some consider the difference between the two views obvious, many actually have trouble distinguishing the two views. The different viewpoints can be illustrated using the following scenario (see also Figure 5.2):

Protein A is present in the sample and has 3 peptides a1, a2 and a3 that can be detected. Due to undersampling of the pool of present peptides in the mass spectrometry experiment, peptides a2 and a3 do not produce a mass spectrum. Furthermore, a1 produces a very low- quality mass spectrum that cannot be interpreted confidently. Protein B is also present in the sample and has a peptide b4 that has a pep- tide sequence similar to a2. The mass spectrum belonging to b4 gets erroneously inferred as originating with high confidence from peptide a2. Based on this high confident, albeit faulty, evidence, protein A gets called significant. The first camp bases their classifications on the observed mass spectra. They would classify this protein as a false discovery because the mass spectrum was erroneously interpreted. The second camp bases their classification on the actual presence of the protein in the sample, and would, therefore, classify this protein as a true discovery. Formulating a definition is also made harder by the shared peptide problem. A handful of protein inference methods choose to eliminate all shared peptides, thereby simplifying the problem considerably at the cost of reduced sensitivity. However, the majority of the available methods chooses not to throw out this preciously ob- tained data. This often leads to a solution called protein grouping, where proteins with the same peptide evidence are inferred together as a protein group. However, this solution comes with the immediate question: “What do we mean when we say we inferred a protein group?”. Several answers are possible: “one 5.2. FALSE DISCOVERY RATES 31

Figure 5.1: Estimation of false discovery rates (FDRs) and q values using a decoy model. Column 1 and 2 represent the idealized situation in which we know which inferences are correct and incorrect; columns 3-5 represent the situation in practice, where these labels are unknown and a decoy model is used to approximate the number of incorrect inferences. Ideally, column 2 and 4 should correspond.

>proteinA Peptide a1 a2 a3 FIKELRVIESGPHCANTEIIVKLSDGRELCLDPKENWVQRVVEKFLKRAENS

Missing Missing MS/MS MS/MS

No identification

>proteinB b1 b2 b3 b4 NVWGKVEADIPGHGQEVLIRLFKFDKFKHLKSEDEMKASEDLKKHLEDCLPKGA

Missing Missing MS/MS MS/MS

Correct id: Incorrect id: VEADIPGHGQEVLIR ELCLDPK → proteinB identified → proteinA identified

Figure 5.2: Should the identification of protein A be regarded as a false discovery or not? The figure depicts the dilemma presented in the text regarding the different null hypotheses that are used concurrently in the field. 32 CHAPTER 5. STATISTICAL EVALUATION protein in the group is present”, “all proteins in the group are present” or “at least one of the proteins in the group is present”. The first answer leaves the question which of the proteins to choose, the second is generally too liberal and the third is too unspecific. This might not be very problematic if the proteins in the group have the same function, but due to sequence similarity between different genes, this is certainly not always the case. Provided with a well-defined null hypothesis, there is still the difficulty of estimat- ing the corresponding protein-level FDR. These difficulties mostly seem to stem from the shared peptide problem. Correctly inferred shared peptides break the assumptions of the standard decoy model since decoy proteins cannot mimic the case of a correct peptide which is mapped to an absent protein. On the other hand, using only peptides unique to a protein does not break the decoy model, but does force us to throw out a large portion of the inferred peptides. Sometimes the difficulty in estimating the protein-level FDR stems from faulty assumptions rather than a fundamental problem. This was the case in one of the first draft human proteomes [115, 98], which was later resolved [97].

FDR Calibration Having obtained FDR estimates, one would like to verify that they provide good estimates of the true number of false discoveries. This can be done through calibra- tion using engineered datasets where we know the ground truth of each inference before doing the experiment. The most popular method to achieve this is to analyze mixtures of known proteins, such as the ISB 18 [63] or the Sigma UPS1/UPS2 protein mixtures [101]. Here, the entire shotgun proteomics pipeline is executed and we try to find the peptides and proteins that are known to be present against a database of peptides and proteins that are definitely not present. We can check the FDR estimates by comparing them to the fraction of non-present peptides or proteins reported by the method. These standards contain relatively few proteins, often in the order of 50, and often do not contain many shared peptides. Therefore, they are not very representative of the highly complex samples we typically analyze in shotgun proteomics [43]. Furthermore, an intrinsic problem is that we cannot verify the null hypothesis for a peptide or protein inference to have come from incorrect PSMs, as we are not in control over this part of the process. Alternatively, one can use simulations of shotgun proteomics experiments. Here, correct and incorrect inferences are drawn from their respective score distributions. For example, to test peptide and protein inference methods we can simulate correct and incorrect PSMs, or in the case of protein inference methods, we could even simulate correct and incorrect peptides. Simulations come with a certain set of assumptions about the shotgun proteomics pipeline which might be violated in practice. For example, in some simulations, 5.3. THE PERCOLATOR SOFTWARE PACKAGE 33 present peptides are randomly drawn, whereas in practice the sampling of present peptides is a function of their abundance. Because of assumptions like these, sim- ulations could be considered as giving less conclusive evidence than mixtures of known proteins. However, they can simulate results from samples with high com- plexity and can also verify the null hypothesis of peptide and protein inferences coming from incorrect PSMs.

5.3 The Percolator software package

A common issue with database search engines is that their scoring functions often contain biases, for example, scores are systematically higher for longer peptides or highly charged precursor ions. Initially, this was solved by, for instance, applying different score thresholds for different precursor charge states or having peptide length dependent thresholds. Käll et al. [55] recognized that a more optimal approach to achieve this was to use a support vector machine (SVM), which was implemented in the widely used software package, Percolator. Percolator is still being developed further in our group and we regularly release new features and bug fixes. An SVM is a classification algorithm, i.e., given a set of features for an inference it decides to assign it a certain class, or label, typically one out of two. In the context of shotgun proteomics, these labels are typically “correct” and “incorrect”. A feature can be any numeric value, for example, scoring function score, peptide length or precursor mass. SVMs belong to the class of supervised learning algorithms, which means that they are trained by feeding it inferences for which the labels are known. Based on this training set it will produce a high-dimensional decision boundary in the feature space which separates the inferences from the two labels in an optimal way (Figure 5.3). The problem in the analysis of shotgun proteomics context is that we do not know to which of the two classes an inference belongs. We can create a decoy model which will produce exclusively incorrect inferences, but, initially, we do not have a set of correct inferences. To overcome this, Percolator uses a semi-supervised learning approach, where the correct set is inferred during the training. In practice, this means that we start by assigning the label “correct” to a small set of highly confident PSMs, typically those with a very good database search engine score. Subsequently, we train an SVM using this small set as the correct inferences and the decoy inferences as incorrect inferences and based on the new decision boundary we select a new set of highly confident PSMs. This process can be repeated a fixed number of times or until convergence, i.e., the set of inferences labeled as correct does not change significantly anymore. To prevent overfitting to the data, Percolator uses a cross-validation scheme, that is, the PSMs are divided into 3 34 CHAPTER 5. STATISTICAL EVALUATION

H H H X2 1 2 3

X1 Figure 5.3: A support vector machine calculates the decision boundary H3 which has the largest margin between inferences of the two classes. Here, one can think of the black circles representing the incorrect inferences, while the white circles represent the correct ones. The X1 axis represents one feature, e.g., a PSM’s score, while the X2 axis represents a second feature, e.g., a PSM’s peptide length. The decision boundary H1 does not provide a separation between the two classes, H2 and H3 both separate the two classes, but H3 does this with the largest possible margin to the closest inference. [112]

validation bins and each bin is evaluated by the SVM trained on the other two bins [42]. Chapter 6

Mass spectrum clustering

In the field of shotgun proteomics, it is considered good practice to make data sets of tandem mass spectrometry experiments publicly available in repositories such as the PRIDE archive and PeptideAtlas. According to the latest statistics, the PRIDE archive contains 690 million mass spectra from 813 data sets [111], and the human part of the PeptideAtlas contains 470 million mass spectra from over a 1000 samples [23]. While the data sets are interesting by themselves for researchers interested in the particular topic of the original study, much more interesting out- comes can be expected by considering the entire database as a whole. In particular, we would like to investigate commonalities between studies and to analyze the huge amount of unidentified spectra that could, for example, unveil novel PTMs [51] or unexpected fragmentation patterns. This chapter serves as a detailed introduction to paper I. In a typical experiment, more than half of the MS2 spectra remain unidentified [15]. A part of these can be attributed to low-quality spectra or spurious fragmenta- tion events. However, also among high-quality spectra, a large proportion remains unidentified, even after accounting for common modifications and semi-tryptic or non-tryptic peptides [87]. It is quite a daunting task to consider hundreds of millions of MS2 spectra and find spectra of interest. A useful technique from the field of machine learning is to cluster the spectra, i.e., similar spectra are grouped together, ideally stemming from the same peptide species. By doing this, we can identify groups of interest, rather than single spectra of interest, which typically is a reduction in the number of spectra to be searched of at least one order of magnitude. Furthermore, large groups of unidentified spectra are natural candidates for investigation and one could construct spectral libraries for frequently recurring spectra [33]. By using open modification searches and de novo identification tools, we can shed more light on to what Skinner et al. [102] aptly called the “dark matter” of proteomics. The two key challenges in the clustering of mass spectra are how to measure the

35 36 CHAPTER 6. MASS SPECTRUM CLUSTERING similarity between two spectra and how to decide if the similarity is high enough to warrant with a high probability that two or more spectra came from the same peptide species. The first challenge has been met with several distance measures that take into account different characteristics of mass spectra, while the second part requires a combination of statistics and clustering algorithms from the field of machine learning. Several mass spectrum clustering packages have been developed, such as Pep- Miner [10], MS2Grouper [106], NorBel2 [31], MS-Cluster [33], PRIDE-Cluster [45, 46], CAMS-RS [96] and Liberator [51]. Some of these have specifically targeted large-scale datasets [33, 45, 96, 51]. So far, only PRIDE-Cluster has managed to cluster all spectra, including the unidentified spectra, of an entire repository. In the following sections, we will start by looking at the different aspects of mass spectrum clustering algorithms. In particular, we will look at approaches to measure similarities between mass spectra using a distance measure, how to subsequently group similar spectra together using clustering algorithms and how to evaluate the results of clustering.

6.1 Distance measures

Before we start, it is helpful to clarify the small difference between two frequently used terms. Similarity measures assign high scores to pairs of similar spectra, whereas distance measures assign low scores. This corresponds to our intuition that highly similar pairs receive high similarity scores but are close to each other when talking about distance. We will mostly use the term distance measure as it corresponds better to a visualization of clusters, where similar spectra are close to each other and dissimilar spectra are far away from each other. Spectrum distance measures are not only used for clustering but are also an integral part of spectral library search engines. Some measures were specifically designed for a certain type of experimental setting and are therefore not guaranteed to work when these conditions change. Here, we focus on the most generally applicable distance measures, that should perform reasonably well independent of the experi- mental settings. To determine how similar two spectra are, it is beneficial to take into account the variation in the peak patterns of spectra from the same peptide species. Typi- cally, the absolute peak intensities can vary over several orders of magnitude and therefore some type of normalization is often a key step when using peak inten- sity information. However, even after normalization, the relative peak intensity of a specific fragment peak will usually still show significant variance. Furthermore, electronic and chemical noise, as well as poorly calibrated instruments, can cause systematic and random noise, which needs to be dealt with appropriately. One of the most straightforward similarity measure is by first binning the peaks, for example, in 1 Th bins, and then applying the cosine similarity, of which adapted 6.1. DISTANCE MEASURES 37 versions have been employed by [10, 33, 51, 46]:

Pm IQ · IT C(IQ,IT ) = i=0 i i (6.1) r 2 q Pm  Q Pm T 2 i=0 Ii · i=0 Ii

Where IQ and IT are vectors of the summed or maximum peak intensity per bin. This normalized dot product takes values in the interval [0, 1], where pairs of very similar spectra have a score close to 1. Its main advantage is that it is fast to calculate, but it has the potential problem that a few high-intensity peaks will dominate the scores. In the most extreme case, two spectra with just a single peak in the same bin, for instance, at the precursor ion mass-to-charge ratio, could get a score of 1 even though we do not have much reason to believe they came from the same peptide species. Quite similar to the cosine similarity is the Euclidean distance, as used in [92]: v u m Q T uX Q T 2 E(I ,I ) = t (Ii − Ii ) (6.2) i=0

This distance measure puts less emphasis on the high-intensity peaks when they are in common, while increasing its emphasis when it is absent in one of the spectra. On the other hand, it is very sensitive to the natural variance of fragment peak intensity and it is not straightforward how to normalize the peak intensities, nor the scores themselves. To prevent a few peaks from overpowering all lower intensity peaks, we could instead ignore all peak intensities and use a 1 if the peak intensity is above a certain threshold and a 0 if it is below this threshold:

m Q T X Q T D(B ,B ) = Bi · Bi (6.3) i=0

Where BQ and BT are boolean vectors with 1s and 0s as specified above; a nor- malization factor can be added similarly to the cosine distance above. While this approach solves the problem of overpowering by a few peaks it introduces some new problems. First, an appropriate threshold needs to be chosen to set the 1s and 0s, or alternatively a fixed number k of 1s can be set corresponding to the most intense bins. Second, peaks in common as a result of low-intensity peaks from background noise cannot be distinguished from high-intensity peaks in common from actual fragments. The previous distance metrics do not specifically take advantage of the special characteristics of mass spectra, i.e., that they come from sequences of amino acids. This property can, for example, be exploited by looking at the mass-to-charge ratio 38 CHAPTER 6. MASS SPECTRUM CLUSTERING differences of consecutive peaks [96]. However, this relies on the ability to filter out noise peaks between actual fragments peaks, which is not a simple task. Some other distance measures that have been proposed are the Hertz similarity index [49], the accurate mass spectrum distance [48], using mass differences on aligned peak lists [2] and a sigmoid-based fragment scoring function [31].

6.2 Clustering algorithms

After calculating the pairwise distances, we are left with a large affinity matrix. This matrix is normally quite sparse, as we typically only consider pairs of spectra with similar precursor mass or mass-to-charge ratio. One of the issues which makes the mass spectrum clustering different from many other clustering problems in machine learning is that instead of a few clusters with many observations per cluster, we expect many clusters with relatively few spectra in each cluster. Furthermore, unless we analyze the data with a search engine first, we have no idea of how many clusters we expect to get. This makes the application of popular machine learning clustering algorithms such as k-means or Gaussian mixture models rather difficult. The class of clustering algorithm that does come quite natural to our problem are the hierarchical clustering algorithms. Initially, each observation forms its own cluster, a singleton cluster. We start by linking the most similar two clusters and calculate a new distance between this newly created cluster and all other clusters. This continues until only one cluster is remaining. In this way these algorithms construct a hierarchical tree of the observations (Figure 6.1), linking the most similar observations in the bottom and less similar observations higher up in the tree. Constructing the clusters then simply comes down to setting a threshold on the tree and ignoring all links above this threshold. There are three main ways to construct the linkages, which differ in the calculation of the distance between a newly formed cluster and all other clusters (Figure 6.2). In single-linkage clustering the new distance is the minimum of the distances of pairs with one observation in the newly formed cluster and one in the cluster we want to calculate the new distance to. This has, for example, been used by [31]. In average-linkage clustering the new distance is the average distance of the pairs, as described above, employed by [92]. In complete-linkage clustering the new distance is the maximum distance between the pairs as described above. While single-linkage clustering is the easiest version to implement, its greedy nature tends to produce impure clusters. Average- and complete-linkage clustering improve this by requiring more consensus, but are harder to implement for large matrices, as it requires the storage of all distances between clusters. For average-linkage a memory constrained algorithm is available that resolves this issue [71] and for complete-linkage, a similar scheme can be constructed. In the spirit of average-linkage clustering, one could recalculate the distances by first creating a consensus spectrum of the newly formed cluster and then apply the 6.2. CLUSTERING ALGORITHMS 39

0.87 1.4e-2 4 clusters 4.4e-5 6.9e-7 5 clusters

9.2e-14 7 clusters 3.2e-15 1.7e-18

S5 S8 S7 S1 S4 S6 S2 S3

Figure 6.1: Hierarchical clustering creates a tree of the observations; clus- ters can be constructed by choosing a cutoff threshold. Here, we have 8 spectra, marked s1, . . . , s8, which all start off as clusters of size 1. The distance at which two clusters were linked are given in red. Clusters with a small distance are linked at the bottom, e.g., s1 and s4, whereas clusters with a large distance are linked at the top, e.g., s7 and s8. Setting the cutoff threshold at a large distance creates few clusters (top line, 4 clusters), which might result in impure clusters. Setting the threshold at a small distance results in many clusters (bottom line, 7 clusters), which reduces the summarizing power. Setting the threshold at an inter- mediate distance (middle line, 5 clusters) trades off these two issues and might give the optimal result.

1 P P DU,V = min D(Ui,Vj ) DU,V = D(Ui,Vj ) DU,V = max D(Ui,Vj ) i,j |U||V | i,j i j (A) (B) (C) Figure 6.2: Calculation of the distance between two clusters, U and V , for single-linkage (A), average-linkage (B) and complete-linkage (C) clus- tering. 40 CHAPTER 6. MASS SPECTRUM CLUSTERING distance measure between this new consensus spectra and the consensus spectra of each of the other clusters [33, 46]. Instead of hierarchical clustering methods, one could employ graph-based algo- rithms. After first eliminating all links over a certain distance threshold, we could look for so-called cliques, i.e., groups of spectra for which each pair has a link in the graph [106].

6.3 Evaluating clustering performance

The goal of mass spectrum clustering is to cluster spectra belonging to the same peptide species together while preventing the formation of clusters with more than one peptide species. To evaluate the performance of a mass spectrum clustering algorithm we can use the peptide inferences as the true labels. With these labels, we can apply different cluster evaluation metrics. We shall call the list of clusters resulting from a mass spectrum clustering algorithm a clustering. However, the peptide inferences of the spectra themselves contain a certain level of uncertainty due to chimeric spectra and incorrect identifications. To account for incorrect identifications, one could cluster only the spectra with a PSM inference with a q value below 1% or even 0.1%. Still, this could give an inaccurate image of the actual performance of the mass spectrum clustering algorithm on the entire set, as the confident PSMs are typically from high-quality spectra which are more easily distinguished from each other than low-quality spectra. Alternatively, we could first cluster all spectra and then eliminate all spectra with only PSMs above a certain q value. In this case, the clusters are no longer accurately represented by their remaining spectra and care has to be taken to correctly interpret any statistics we calculate from them. Several general-purpose clustering metrics have been proposed such as the rand index, the adjusted rand index, the B-Cubed algorithm, the purity and normalized mutual information, see [3] for a review. Most of these metrics give us a score, typically between 0 and 1, and allow us to rank different clusterings by this score. Unfortunately, the value of these scores are hard to interpret, and more simplistic metrics might give more information. A commonly used example is the cluster purity, which is the average fraction of spectra belonging to the most common peptide species in its cluster. Alternatively, we can use the number of peptide discoveries from searching the consensus spectra with a database search engine. It is not hard to imagine that if the clusters are pure that the consensus spectra should be a good model for spectra from that peptide species. Unfortunately, this is certainly not always the case, as a cluster with one high-quality spectrum together with many lower-quality spectra might result in a consensus spectrum falling below the discovery threshold. Nevertheless, the number of peptide discoveries for a clustering can be considered as an informative metric regarding its purity. Chapter 7

Label-free protein quantification

Though it might seem trivial to add quantitative values to proteins once they have been identified, this is by no means the case. There are many possible sources of errors along the way from the raw data of the mass spectrometer to a list of quantified proteins. Firstly, it is not always evident which quantification signal belongs to a certain peptide nor if the quantification signal is accurate, as peptides of similar mass and retention time might act as confounding factors. Secondly, as was the case in protein identification, we are still faced with the problem that we are detecting peptides rather than the proteins themselves. This means that we will somehow have to summarize the peptide evidence and deal with possibly conflicting evidence. In this chapter, we will investigate the data analysis pipeline of label-free shotgun proteomics and the types of error sources present at each step. This chapter serves as a detailed introduction to papers IV and V. We will focus on quantification methods that use the precursor intensity signals from the MS1 spectra in favor of alternative techniques such as spectral counting and using fragment intensity signals from the MS2 spectra. Furthermore, we will mostly consider relative quantification methods, rather than absolute quantification methods.

7.1 Feature detection

As mentioned before, we can use the precursor ion intensity of the MS1 spectra as a proxy for a peptides’ abundance, relative to other runs. For a given peptide, the precursor ions will come in different isotopes due to the different isotopes of the constituting atoms, resulting in an isotopic envelope (see Figure 7.1). For a charge 1 precursor ion, this isotopic envelope starts with a peak at the monoisotopic mass, the sum of the atomic masses using the mass of the most abundant isotope, followed by peaks at 1 Da intervals. For a higher charged precursor ion of charge c, these masses will be divided by the charge and the shifts of the isotopes will, therefore, 1 be c Th.

41 42 CHAPTER 7. LABEL-FREE PROTEIN QUANTIFICATION

659 658 657 1 or 2 features? 656 Charge +2 feature 655 654 653 652 651 Charge +1 feature Precursor m/z (Th) 650

4510 4520 4530 4540 4550 Retention time (s)

Figure 7.1: Examples of isotopic envelopes. We plotted the retention time on the x-axis, the precursor m/z on the y-axis and the precursor ion intensity on the z axis, where purple represents the highest intensity, followed by increasingly lighter shades of red. We can see two clearly distinguishable MS1 features in green and blue, one of charge +1 and one of charge +2. We can also see a more ambiguous feature in red, which could also be two separate features. There are also several less intense features in the plot, which we have not highlighted but would typically be called as well.

Due to the differing atomic compositions between peptides, the distribution among the isotopes depends on the peptide. It is important to deisotope the MS1 spectra, such that the most relevant isotopes are taken into account, resulting in an MS1 feature that can be used as the quantitative representation of an analyte [107]. Notable MS1 feature detection software packages are Hardklör [50], MaxQuant [18] and Dinosaur [107].

Errors can be made during the deisotoping, due to isotopes being attributed to the wrong analyte. Also, false positives will arise during the calling of the MS1 features, resulting from random noise, as well as false negatives, truly present analytes that do not get an MS1 feature assigned. Furthermore, in some cases, MS1 features are erroneously split up into multiple smaller MS1 features. 7.2. PEPTIDE QUANTIFICATION 43

7.2 Peptide quantification

After feature detection, we can link the MS1 features to the MS2 spectra by consid- ering the precursor isolation window, the charge state and the retention time [52]. Some MS2 spectra can be linked to multiple MS1 features, for example in the case of chimeric spectra. We can now do peptide identification in the same manner as in the identification pipeline. We can then assign the peptide from the MS2 spectrum to one of its MS1 features. If multiple MS1 features are present with the charge state of the peptide identification, we have to select one, for example, based on predicted precursor mass and retention time. In this way, we obtain a quantity for a peptide with a certain charge state across several runs. We typically filter out peptides that have not been quantified in more than a certain number of runs. This will typically filter out false positives, as well as reduce problems with missing value imputation. It has been observed that the intensity is influenced differently at different retention times, therefore a retention time-dependent normalization factor helps to reduce this variance [118]. Errors are introduced during this stage due to incorrect PSMs and the resulting incorrect peptide and protein identifications, as well as matching the wrong MS1 feature to the peptide.

7.3 Matches-between-runs

Following the approach above, we are left with many missing values due to the random nature of MS2 acquisition of data-dependent acquisition. To alleviate this problem, we can search for MS1 features with similar retention time and precursor mass for identified peptides that did not have a (reliable) MS2 spectrum in that particular run. This strategy is known as matches-between-runs (MBR) [17, 118, 5]. In a particular case, the number of missing values could be reduced from 40% down to just a few percents using this technique [118]. To apply MBR effectively, we need a mapping between retention times of different runs. While linear mappings are sufficient if experimental conditions are kept very steady, one often has to resort to smooth approximations with splines or other interpolation methods. We can either select one of the runs as a reference run [113] or use pairwise alignments [94]. The same type of errors can be made in MBR as during feature detection, that is deisotoping errors, false negatives due to insufficient or incoherent signal, and false positives due to random noise. Furthermore, we are dependent on the retention time alignment and the mass accuracy of the precursors to select the right MS1 feature. To control this error rate, one can use a decoy model where we search for features in a shifted region of precursor m/z and/or retention time [118]. Depending on the implementation, a mapping has to be made between intensities obtained 44 CHAPTER 7. LABEL-FREE PROTEIN QUANTIFICATION by MBR and intensities obtained be regular feature detection, which introduces another source of uncertainty.

7.4 Missing value imputation

Even after MBR, many missing values remain and imputation strategies are com- monly applied. Many different strategies have been employed, each with their own sources of errors. A popular imputation method is to impute a very low value [17], under the assumption that the feature is missing due to low signal. Unfortunately, this assumption turns out to be frequently violated [68]. Other imputation meth- ods try to use quantification values from peptides with a similar expression pattern for the present values or use the very conservative approach of imputing the mean value of all present values for the peptide [109]. Each of these methods come with severe implications for the differential expression analysis. Aside from potentially producing false positives if the incorrect value was imputed, these imputation strategies also affect the variance that is used as input to typical differential expression tests. For example, imputation by some low value can reduce the sensitivity of the test by increasing the variance, whereas imputation by the mean value can artificially lower the variance and produce false positives in this way. Missing value imputation can also cause the p value distribution of the non-differentially expressed proteins to become non-uniform, which could produce an increased number of false positive and also affects the multiple testing correction.

7.5 Peptide to protein summarization

Having obtained multiple peptide quantifications of the same protein, one now has to combine these quantifications into a quantification of the protein. While it is possible to use shared peptides for protein quantification [37], it is often easier to discard these values. Different strategies can be employed for summarizing the peptide quantifications, often using the intuition that peptides with high ion intensity are the most reliable. For example, the top-3 approach takes the three peptides with the highest ion intensity and simply takes the average of these three. One problem with these approaches is that a single outlier can invalidate the entire protein quantification. To prevent this, one can use the coherence between the peptide quantifications and remove the peptides that show aberrant behavior [120, 119].

7.6 Differential expression analysis

Finally, one is interested in the proteins that show differential expression between two conditions. By far the most common way is to use a t-test or its variant 7.6. DIFFERENTIAL EXPRESSION ANALYSIS 45

Welch’s t-test. One problem with these tests is that they do not take the effect size into account, that is, proteins that have a small but significant difference between conditions are commonly not of interest. To remedy this, one could simply filter away these proteins, but this will invalidate the uniformity assumption of p-values, making multiple testing correction an issue [78]. Alternatively, one can make adap- tations to existing tests that penalize proteins with low effect sizes [78, 21]. Compared to RNA sequencing results, differential expression analysis on protein level often suffers from low sensitivity due to its lower coverage and higher noise level. It is not uncommon to have only a handful or even no significantly differen- tially expressed proteins. There are multiple causes for this, though much of it can likely be attributed to erroneous quantification outliers that increase the variance, which lowers the sensitivity of the statistical test.

Chapter 8

Present investigations

The papers included in this thesis center around the peptide and protein inference pipeline. In particular, the papers try to address different issues arising from the increase in available data. Paper I introduces a new mass spectrum clustering al- gorithm which uses a probabilistic framework to calculate a distance measure and applies this to a dataset containing millions of spectra. Paper II focuses on the discussion of protein-level false discovery rates, which has been criticized for not behaving properly on large datasets. The paper proposes a framework in which they can be discussed in a meaningful way to reopen the discussion on protein-level false discovery rates on large datasets. Paper III showcases two new features of the Percolator package which makes it ready for application on aggregated datasets comprising multiple runs such as the recent draft human proteome studies. Pa- per IV introduces a new framework for protein quantification based on Bayesian statistics that provides a rigorous way of treating error sources and uncertainties rather than hiding them as typically happens. Paper V proposes a paradigm shift for protein quantification in which the interpretation of the data with the search engine is postponed until the end, such that we can pinpoint unidentified analytes of interest.

8.1 Paper I: MaRaCluster: A fragment rarity metric for clustering fragment spectra in shotgun proteomics

This study proposes a novel mass spectrum distance measure, the fragment rarity metric, which uses the frequency of the observed peaks in the MS2 spectra to cal- culate a p value for a pair of spectra. It makes use of the intuition that fragment peaks shared by many MS2 spectra give less information about two spectra stem- ming from the same peptide species than peaks shared by few MS2 spectra. In contrast to existing distance measures, the score of the fragment rarity metric can be interpreted by itself and is robust to both systematic and random noise. This distance metric is then used as input to a complete-linkage clustering, from which

47 48 CHAPTER 8. PRESENT INVESTIGATIONS clusters are then derived. The complete-linkage clustering mitigates the problem of chimeric spectra connecting clusters of different peptide species. Using a dataset containing 7.8 million spectra, we demonstrate that our algorithm, MaRaCluster, outperforms alternatives that use a cosine distance or single-linkage clustering, as well as the state-of-the-art method MS-Cluster.

8.2 Paper II: How to talk about protein-level false discovery rates in shotgun proteomics

In this work, we discuss the controversy regarding protein-level false discovery rates that has resurfaced in the wake of the draft human proteome studies. In partic- ular, we propose a framework to have meaningful discussions about them, which mainly consists of separating its definition from its application. The definition is determined by the choice of null hypothesis, of which two significantly different versions are currently employed by protein inference methods. We show that inter- pretation of the results using the wrong null hypothesis can lead to a considerable overestimation of the confidence in the results. Furthermore, we recommend the employment of simulations of the digestion and matching step to calibrate and verify the protein-level FDR estimates reported by protein inference methods. We demonstrate this approach by verifying the FDR estimates of a very simple protein inference method, using decoy models for both of the null hypotheses.

8.3 Paper III: Fast and accurate protein false discovery rates on large-scale proteomics data sets with Percolator 3.0

This paper introduces two new features of the Percolator package that are aimed towards large-scale studies. The first feature accelerates the training algorithm by training the SVMs only on a small random subset of the available PSMs. Even going down to just 0.14% of the original 73 million PSMs does not significantly reduce the number of discovered PSMs and peptides at 1% FDR. For the second feature, we compared several scalable protein inference methods based on accuracy and number of discoveries. The protein inference methods we benchmarked were the best scoring peptide approach, the two peptide rule, multiplication of peptide- level posterior error probabilities and Fisher’s method for combining p values. Using a small yeast dataset with an entrapment database we show that it is absolutely necessary to remove shared peptides to have working decoy models. Furthermore, we demonstrate that the best scoring peptide approach gives the most significant proteins at 1% FDR on the data of a draft of the human proteome. This approach was subsequently implemented in the Percolator package, which together with the random subset training, allowed the analysis from 73 million PSMs up to the protein inferences in just 10 minutes. 8.4. PAPER IV: TRIQLER 49

8.4 Paper IV: Integrated identification and quantification error probabilities for shotgun proteomics.

In protein quantification, many error sources are present that could result in false positive findings. Almost all protein quantification methods use intermediate filters at several parts of the pipeline in an attempt to limit the number of false positives. However, they subsequently typically fail to propagate error probability informa- tion to subsequent steps and instead silently assume that the filtered list does not contain any false positives. Instead, we designed a probabilistic graphical model (PGM) that propagates the error probabilities through each of the steps in the protein quantification pipeline. Specifically, instead of imputing missing values, we integrate over a probability distribution for the missing value and propagate uncer- tainty all the way from the MS1 features, through peptide and protein expression levels unto a posterior distribution for the fold change difference between treatment groups. We implemented this PGM in a Python package called Triqler. Not only does Triqler achieve proper control of the false discovery rate, but it also increases the sensitivity on both engineered and clinical datasets. Specifically, Triqler man- ages to identify 35 differentially expressed proteins in a clinical dataset at a 5% FDR where previously none were discovered and also outperformed the state-of- the-art method, MaxQuant. Furthermore, the proteins discovered by Triqler show enrichment for several functional annotation terms, suggesting that these are in- deed relevant findings, whereas the proteins discovered by MaxQuant did not show any enrichment.

8.5 Paper V: Distillation of label-free quantification data by clustering and Bayesian modeling.

In conventional protein quantification pipelines, we start our quantification from the peptide identifications coming from a search engine. In this paper, we propose to reverse this process by first quantifying all possible analyte signals without in- terpretations and only in the very end use a search engine to assign identities to these signals. This can simply be formulated as the quantification-first approach, opposed to the identification-first philosophy. In this way, we do not lose any po- tentially interesting signals just because they could not initially be identified. By using spectrum clustering with MaRaCluster, we can further reduce the time spent on the identification process, by creating spectrum clusters and consensus spectra that can directly be associated with the discovered features. Additionally, we note that the successful technique of matches-between-runs (MBR) is an error source that is typically not accounted for in terms of error propagation. We adapted the MBR approach to work in the setting of the quantification first approach and re- tained the error probabilities associated with this process as an input to Triqler. We created a combined software package, MaRaQuant, that effectively recovers all signals of interest and provides the consensus spectra that belong to these signals, 50 CHAPTER 8. PRESENT INVESTIGATIONS which can then be searched by a search engine. We demonstrated that we could indeed recover new features of interest that could be identified by open modifica- tion searching and de novo identification. Moreover, we showed that by combining MaRaQuant with Triqler we could obtain a much higher sensitivity than the state- of-the-art method, MaxQuant, on one engineered as well as two clinical datasets. In both clinical datasets, we could recover significant functional annotation terms, whereas significant proteins from MaxQuant and from the original studies were unable to produce any such significant terms. Chapter 9

Future perspectives

The studies presented in this thesis were focused on summarizing data and isolat- ing objects of interest from large-scale proteomics studies. As shotgun proteomics studies become increasingly easy and cheap to perform, computational parts of the analysis pipeline will reach their limits and will require new solutions. Mass spectrum clustering has already started to get traction in the field of pro- teomics and I believe that we can expect several new applications for this type of analysis in the near future. Especially the increasing popularity of spectral li- braries, both in the DDA and DIA fields, appears to be a perfect match for spectrum clustering. However, more attention is needed in the statistical analysis of spec- tral library searching, specifically with regards to decoy spectra and chimericity of spectra. Furthermore, I expect that more effort will be directed towards identify- ing the unidentified spectra in for example spectral archives or retrieved through our “quantification first” approach. This can be achieved through advancements in instrumentation, improved algorithms for open modification searches and de novo identification, as well as large-scale analysis across datasets. In the same way that PSM and peptide-level false discoveries have cemented the reputation of shotgun proteomics experiments as a reliable tool, I think that report- ing identification and quantification protein-level false discovery rates will increase the applicability and popularity of shotgun proteomics even further. I also think that Bayesian statistics will become an integral part of protein quantification as it can deal with the high complexity of the experiments in a relatively straightforward and understandable way. Finally, there have already been some efforts in using deep learning in proteomics and I expect that these efforts will intensify over the next few years. In particular, fragment ion intensity, retention time and ionization efficiency prediction seem to be good fits for the deep learning framework. Considering the vast amounts of data the field has collected in the repositories, it should not take long before we are able to tackle these prediction tasks with high accuracy.

51

Bibliography

[1] , Jeffrey N Agar, I Jonathan Amster, Mark S Baker, Car- olyn R Bertozzi, Emily S Boja, Catherine E Costello, Benjamin F Cravatt, Catherine Fenselau, Benjamin A Garcia, et al. How many human proteoforms are there? Nature Chemical Biology, 14(3):206, 2018.

[2] Rikard Alm, Peter Johansson, Karin Hjernø, Cecilia Emanuelsson, Markus Ringner, and Jari Häkkinen. Detection and identification of protein isoforms using cluster analysis of MALDI-MS mass spectra. Journal of Proteome Re- search, 5(4):785–792, 2006.

[3] Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A com- parison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461–486, 2009.

[4] Rolf Apweiler, Amos Bairoch, Cathy H Wu, Winona C Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, et al. UniProt: The universal protein knowledge- base. Nucleic Acids Research, 32(suppl 1):D115–D119, 2004.

[5] Andrea Argentini, Ludger JE Goeminne, Kenneth Verheggen, Niels Hulstaert, An Staes, Lieven Clement, and Lennart Martens. moFF: A robust and auto- mated approach to extract peptide ion intensities. Nature Methods, 13(12): 964–966, 2016.

[6] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. Gene Ontology: Tool for the unification of biology. Nature Genetics, 25(1):25–29, 2000.

[7] AzaToth. Myoglobin 3D structure. https://upload.wikimedia.org/ wikipedia/commons/6/60/Myoglobin.png, 2012. Accessed: 2016-03-29.

[8] Marcus Bantscheff, Markus Schirle, Gavain Sweetman, Jens Rick, and Bern- hard Kuster. Quantitative mass spectrometry in proteomics: A critical re- view. Analytical and BioAnalytical Chemistry, 389(4):1017–1031, 2007.

53 54 BIBLIOGRAPHY

[9] Alex Bateman, Lachlan Coin, Richard Durbin, Robert D Finn, Volker Hollich, Sam Griffiths-Jones, Ajay Khanna, Mhairi Marshall, Simon Moxon, Erik LL Sonnhammer, et al. The Pfam protein families database. Nucleic Acids Research, 32(suppl 1):D138–D141, 2004.

[10] Ilan Beer, Eilon Barnea, Tamar Ziv, and Arie Admon. Improving large-scale proteomics by clustering of mass spectrometry data. Proteomics, 4(4):950– 960, 2004.

[11] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.

[12] Pavel V Bondarenko, Dirk Chelius, and Thomas A Shaler. Identification and relative quantitation of protein mixtures by enzymatic digestion followed by capillary reversed-phase liquid chromatography-tandem mass spectrometry. Analytical Chemistry, 74(18):4741–4749, 2002.

[13] Adam D Catherman, Owen S Skinner, and Neil L Kelleher. Top down pro- teomics: Facts and perspectives. Biochemical and Biophysical Research Com- munications, 445(4):683–693, 2014.

[14] Hao Chi, Rui-Xiang Sun, Bing Yang, Chun-Qing Song, Le-Heng Wang, Chao Liu, Yan Fu, Zuo-Fei Yuan, Hai-Peng Wang, Si-Min He, et al. pNovo: De novo peptide sequencing and identification using hcd spectra. Journal of Proteome Research, 9(5):2713–2724, 2010.

[15] Joel M Chick, Deepak Kolippakkam, David P Nusinow, Bo Zhai, Ramin Rad, Edward L Huttlin, and Steven P Gygi. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nature Biotechnology, 33(7):743, 2015.

[16] John S Cottrell and U London. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551–3567, 1999.

[17] Jürgen Cox, Marco Y Hein, Christian A Luber, Igor Paron, Nagarjuna Na- garaj, and . Accurate proteome-wide label-free quantifica- tion by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Molecular & Cellular Proteomics, 13(9):2513–2526, 2014.

[18] Jürgen Cox and Matthias Mann. MaxQuant enables high peptide identi- fication rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology, 26(12):1367, 2008.

[19] Robertson Craig and Ronald C Beavis. TANDEM: Matching proteins with tandem mass spectra. Bioinformatics, 20(9):1466–1467, 2004. BIBLIOGRAPHY 55

[20] Robertson Craig, JC Cortens, David Fenyo, and Ronald C Beavis. Using annotated peptide mass spectrum libraries for protein identification. Journal of Proteome Research, 5(8):1843–1849, 2006. [21] Xiangqin Cui and Gary A Churchill. Statistical tests for differential expression in cDNA microarray experiments. Genome Biology, 4(4):210, 2003. [22] Frank Desiere, Eric W Deutsch, Nichole L King, Alexey I Nesvizhskii, Parag Mallick, Jimmy Eng, Sharon Chen, James Eddes, Sandra N Loevenich, and Ruedi Aebersold. The PeptideAtlas project. Nucleic Acids Research, 34(suppl 1):D655–D658, 2006. [23] Eric W Deutsch, Zhi Sun, David Campbell, Ulrike Kusebauch, Caroline S Chu, Luis Mendoza, David Shteynberg, Gilbert S Omenn, and Robert L Moritz. State of the human proteome in 2014/2015 as viewed through Pep- tideAtlas: Enhancing accuracy and coverage through the AtlasProphet. Jour- nal of Proteome Research, 14(9):3461–3473, 2015. [24] Arun Devabhaktuni and Joshua E Elias. Application of de novo sequencing to large-scale complex proteomics data sets. Journal of Proteome Research, 15(3):732–742, 2016. [25] Fredrik Edfors, Frida Danielsson, Björn M Hallström, Lukas Käll, Emma Lundberg, Fredrik Pontén, Björn Forsström, and Mathias Uhlén. Gene- specific correlation of RNA and protein levels in human cells and tissues. Molecular , 12(10):883, 2016. [26] Joshua E Elias and Steven P Gygi. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods, 4(3):207–214, 2007. [27] Jimmy K Eng, Ashley L McCormack, and John R Yates. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5(11):976–989, 1994. [28] Scott B Ficarro, Mark L McCleland, P Todd Stukenberg, Daniel J Burke, Mark M Ross, Jeffrey Shabanowitz, Donald F Hunt, and Forest M White. Phosphoproteome analysis by mass spectrometry and its application to sac- charomyces cerevisiae. Nature Biotechnology, 20(3):301, 2002. [29] Bernd Fischer, Volker Roth, Franz Roos, Jonas Grossmann, Sacha Baginsky, Peter Widmayer, Wilhelm Gruissem, and Joachim M Buhmann. NovoHMM: A hidden markov model for de novo peptide sequencing. Analytical Chemistry, 77(22):7265–7273, 2005. [30] Ronald Aylmer Fisher. Statistical methods for research workers. Genesis Publishing Pvt Ltd, 1925. 56 BIBLIOGRAPHY

[31] Kristian Flikka, Jeroen Meukens, Kenny Helsens, Joël Vandekerckhove, In- gvar Eidhammer, Kris Gevaert, and Lennart Martens. Implementation and application of a versatile clustering tool for tandem mass spectrometry data. Proteomics, 7(18):3245–3258, 2007. [32] Ari Frank and Pavel Pevzner. PepNovo: De novo peptide sequencing via probabilistic network modeling. Analytical Chemistry, 77(4):964–973, 2005. [33] Ari M Frank, Matthew E Monroe, Anuj R Shah, Jeremy J Carver, Nuno Ban- deira, Ronald J Moore, Gordon A Anderson, Richard D Smith, and Pavel A Pevzner. Spectral archives: Extending spectral libraries to analyze both iden- tified and unidentified spectra. Nature Methods, 8(7):587–591, 2011.

[34] Barbara E Frewen, Gennifer E Merrihew, Christine C Wu, William Stafford Noble, and Michael J MacCoss. Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries. Analytical Chem- istry, 78(16):5678–5684, 2006.

[35] Devon Fyson. Schematic of a typical mass spectrometer. https://commons. wikimedia.org/wiki/File:Mass_Spectrometer_Schematic.svg, 2008. Ac- cessed: 2016-03-08. [36] Pascale Gaudet, Pierre-André Michel, Monique Zahn-Zabal, Aurore Britan, Isabelle Cusin, Marcin Domagalski, Paula D Duek, Alain Gateau, Anne Gleizes, Valérie Hinard, et al. The nextprot knowledgebase on human pro- teins: 2017 update. Nucleic Acids Research, 45(D1):D177–D182, 2016. [37] Sarah Gerster, Taejoon Kwon, Christina Ludwig, Mariette Matondo, Chris- tine Vogel, Edward M Marcotte, Ruedi Aebersold, and Peter Bühlmann. Sta- tistical approach to protein quantification. Molecular & Cellular Proteomics, 13(2):666–677, 2014.

[38] Philipp E Geyer, Nils A Kulak, Garwin Pichler, Lesca M Holdt, Daniel Teupser, and Matthias Mann. Plasma proteome profiling to assess human health and disease. Cell Systems, 2(3):185–195, 2016. [39] Ludovic C Gillet, Pedro Navarro, Stephen Tate, Hannes Röst, Nathalie Se- levsek, Lukas Reiter, Ron Bonner, and Ruedi Aebersold. Targeted data ex- traction of the MS/MS spectra generated by data-independent acquisition: A new concept for consistent and accurate proteome analysis. Molecular & Cellular Proteomics, 11(6):O111–016717, 2012. [40] Alexander V Gorshkov, Irina A Tarasova, Victor V Evreinov, Mikhail M Savitski, Michael L Nielsen, Roman A Zubarev, and Mikhail V Gorshkov. Liquid chromatography at critical conditions: Comprehensive approach to sequence-dependent retention time prediction. Analytical Chemistry, 78(22): 7770–7777, 2006. BIBLIOGRAPHY 57

[41] Viktor Granholm and Lukas Käll. Quality assessments of peptide–spectrum matches in shotgun proteomics. Proteomics, 11(6):1086–1093, 2011.

[42] Viktor Granholm, William S Noble, and Lukas Käll. A cross-validation scheme for machine learning algorithms in shotgun proteomics. BMC Bioin- formatics, 13(Suppl 16):S3, 2012.

[43] Viktor Granholm, William Stafford Noble, and Lukas Käll. On using samples of known protein content to assess the statistical calibration of scores assigned to peptide-spectrum matches in shotgun proteomics. Journal of Proteome Research, 10(5):2671–2678, 2011.

[44] Dov Greenbaum, Christopher Colangelo, Kenneth Williams, and Mark Ger- stein. Comparing protein abundance and mRNA expression levels on a ge- nomic scale. Genome Biology, 4(9):117, 2003.

[45] Johannes Griss, Joseph M Foster, Henning Hermjakob, and Juan Antonio Vizcaíno. PRIDE Cluster: Building a consensus of proteomics data. Nature Methods, 10(2):95–96, 2013.

[46] Johannes Griss, Yasset Perez-Riverol, Steve Lewis, David L Tabb, José A Dianes, Noemi del Toro, Marc Rurik, Mathias Walzer, Oliver Kohlbacher, Henning Hermjakob, et al. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nature Methods, 13 (8):651, 2016.

[47] Steven P Gygi, Yvan Rochon, B Robert Franza, and Ruedi Aebersold. Cor- relation between protein and mRNA abundance in yeast. Molecular and Cellular Biology, 19(3):1720–1730, 1999.

[48] Michael Edberg Hansen and Jørn Smedsgaard. A new matching algorithm for high resolution mass spectra. Journal of the American Society for Mass Spectrometry, 15(8):1173–1180, 2004.

[49] Harry S Hertz, Ronald A Hites, and . Identification of mass spectra by computer-searching a file of known spectra. Analytical Chemistry, 43(6):681–691, 1971.

[50] Michael R Hoopmann, Gregory L Finney, and Michael J MacCoss. High-speed data reduction, feature detection, and MS/MS spectrum quality assessment of shotgun proteomics data sets using high-resolution mass spectrometry. An- alytical Chemistry, 79(15):5620–5632, 2007.

[51] Oliver Horlacher, Frédérique Lisacek, and Markus Müller. Mining large scale MS/MS data for protein modifications using spectral libraries. Journal of Proteome Research, 2015. 58 BIBLIOGRAPHY

[52] Edward J Hsieh, Michael R Hoopmann, Brendan MacLean, and Michael J MacCoss. Comparison of database search strategies for high precursor mass accuracy MS/MS data. Journal of Proteome Research, 9(2):1138–1143, 2009.

[53] Donald F Hunt, John R Yates, Jeffrey Shabanowitz, Scott Winston, and Charles R Hauer. Protein sequencing by tandem mass spectrometry. Pro- ceedings of the National Academy of Sciences, 83(17):6233–6237, 1986.

[54] Mio Iwasaki, Shohei Miwa, Tohru Ikegami, Masaru Tomita, Nobuo Tanaka, and Yasushi Ishihama. One-dimensional capillary liquid chromatographic separation coupled with tandem mass spectrometry unveils the escherichia coli proteome on a microarray scale. Analytical Chemistry, 82(7):2616–2620, 2010.

[55] Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble, and Michael J MacCoss. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods, 4(11):923–925, 2007.

[56] Lukas Käll, John D Storey, Michael J MacCoss, and William Stafford Noble. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. Journal of Proteome Research, 7(01):29–34, 2007.

[57] Uri Keich, Attila Kertesz-Farkas, and William Stafford Noble. Improved false discovery rate estimation procedure for shotgun proteomics. Journal of Proteome Research, 14(8):3148–3161, 2015.

[58] Uri Keich and William Stafford Noble. On the importance of well-calibrated scores for identifying shotgun proteomics spectra. Journal of Proteome Re- search, 14(2):1147–1160, 2014.

[59] Neil L Kelleher. Peer reviewed: Top-down proteomics. Analytical Chemistry, 76(11):196–A, 2004.

[60] Min-Sik Kim, Sneha M Pinto, Derese Getnet, Raja Sekhar Niru- jogi, Srikanth S Manda, Raghothama Chaerkady, Anil K Madugundu, Dhanashree S Kelkar, Ruth Isserlin, Shobhit Jain, et al. A draft map of the human proteome. Nature, 509(7502):575–581, 2014.

[61] Sangtae Kim, Nitin Gupta, and Pavel A Pevzner. Spectral probabilities and generating functions of tandem mass spectra: A strike against decoy databases. Journal of Proteome Research, 7(8):3354–3363, 2008.

[62] Donald S Kirkpatrick, Scott A Gerber, and Steven P Gygi. The absolute quantification strategy: A general procedure for the quantification of proteins and post-translational modifications. Methods, 35(3):265–273, 2005. BIBLIOGRAPHY 59

[63] John Klimek, James S Eddes, Laura Hohmann, Jennifer Jackson, Amelia Peterson, Simon Letarte, Philip R Gafken, Jonathan E Katz, Parag Mallick, Hookeun Lee, et al. The standard protein mix database: A diverse data set to assist in the production of improved peptide and protein identification software tools. Journal of Proteome Research, 7(01):96–103, 2007.

[64] Andy T Kong, Felipe V Leprevost, Dmitry M Avtonomov, Dattatreya Mel- lacheruvu, and Alexey I Nesvizhskii. MSFragger: Ultrafast and comprehen- sive peptide identification in mass spectrometry–based proteomics. Nature Methods, 14(5):513, 2017. [65] Oleg V Krokhin, R Craig, V Spicer, W Ens, KG Standing, RC Beavis, and JA Wilkins. An improved model for prediction of retention times of tryptic peptides in ion pair reversed-phase HPLC its application to protein peptide mapping by off-line HPLC-MALDI MS. Molecular & Cellular Proteomics, 3 (9):908–919, 2004. [66] Henry Lam and Ruedi Aebersold. Spectral library searching for peptide identi- fication via tandem MS. In Proteome Bioinformatics, pages 95–103. Springer, 2010. [67] Henry Lam, Eric W Deutsch, James S Eddes, Jimmy K Eng, Stephen E Stein, and Ruedi Aebersold. Building consensus spectral libraries for peptide identification in proteomics. Nature Methods, 5(10):873–875, 2008.

[68] Cosmin Lazar, Laurent Gatto, Myriam Ferro, Christophe Bruley, and Thomas Burger. Accounting for the multiple natures of missing values in label-free quantitative proteomics data sets to compare imputation strategies. Journal of Proteome Research, 15(4):1116–1125, 2016. [69] Yong Fuga Li, Randy J Arnold, Yixue Li, Predrag Radivojac, Quanhu Sheng, and Haixu Tang. A bayesian approach to protein inference problem in shotgun proteomics. Journal of Computational Biology, 16(8):1183–1193, 2009. [70] Hongbin Liu, Rovshan G Sadygov, and John R Yates. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Analytical Chemistry, 76(14):4193–4201, 2004.

[71] Yaniv Loewenstein, Elon Portugaly, Menachem Fromer, and Michal Linial. Efficient algorithms for accurate hierarchical clustering of huge datasets: Tackling the entire protein space. Bioinformatics, 24(13):i41–i49, 2008. [72] Roland Luethy, Darren E Kessner, Jonathan E Katz, Brendan MacLean, Robert Grothe, Kian Kani, Vitor Faca, Sharon Pitteri, Samir Hanash, David B Agus, and Parag Mallick. Precursor-ion mass re-estimation improves peptide identification on hybrid instruments. Journal of Proteome Research, 7(9):4031–4039, 2008. 60 BIBLIOGRAPHY

[73] Bin Ma. Novor: Real-time peptide de novo sequencing software. Journal of the American Society for Mass Spectrometry, 26(11):1885–1894, 2015.

[74] Bin Ma, Kaizhong Zhang, Christopher Hendrie, Chengzhi Liang, Ming Li, Amanda Doherty-Kirby, and Gilles Lajoie. PEAKS: Powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Communi- cations in Mass Spectrometry, 17(20):2337–2342, 2003.

[75] Tobias Maier, Marc Güell, and Luis Serrano. Correlation of mRNA and protein in complex biological samples. FEBS Letters, 583(24):3966–3973, 2009.

[76] Matthias Mann, Shao-En Ong, Mads Grønborg, Hanno Steen, Ole N Jensen, and Akhilesh Pandey. Analysis of protein phosphorylation using mass spec- trometry: Deciphering the phosphoproteome. Trends in biotechnology, 20(6): 261–268, 2002.

[77] Lennart Martens, Henning Hermjakob, Philip Jones, Marcin Adamski, Chris Taylor, David States, Kris Gevaert, Joël Vandekerckhove, and Rolf Apweiler. PRIDE: The proteomics identifications database. Proteomics, 5(13):3537– 3545, 2005.

[78] Davis J McCarthy and Gordon K Smyth. Testing significance relative to a fold-change threshold is a treat. Bioinformatics, 25(6):765–771, 2009.

[79] Annette Michalski, Juergen Cox, and Matthias Mann. More than 100,000 detectable peptide species elute in single shotgun proteomics runs but the majority is inaccessible to data-dependent LC-MS/MS. Journal of Proteome Research, 10(4):1785–1793, 2011.

[80] Luminita Moruz and Lukas Käll. Peptide retention time prediction. Mass Spectrometry Reviews, 2016.

[81] Luminita Moruz, Daniela Tomazela, and Lukas Käll. Training, selection, and robust calibration of retention time models for targeted proteomics. Journal of Proteome Research, 9(10):5209–5216, 2010.

[82] Akira Motoyama and John R Yates III. Multidimensional LC separations in shotgun proteomics, 2008.

[83] Seungjin Na, Nuno Bandeira, and Eunok Paek. Fast multi-blind modification search through tandem mass spectrometry. Molecular & Cellular Proteomics, pages mcp–M111, 2011.

[84] Alexey I Nesvizhskii and Ruedi Aebersold. Interpretation of shotgun pro- teomic data the protein inference problem. Molecular & Cellular Proteomics, 4(10):1419–1440, 2005. BIBLIOGRAPHY 61

[85] Alexey I Nesvizhskii, Andrew Keller, Eugene Kolker, and Ruedi Aebersold. A statistical model for identifying proteins by tandem mass spectrometry. Analytical Chemistry, 75(17):4646–4658, 2003.

[86] Alexey I Nesvizhskii, Olga Vitek, and Ruedi Aebersold. Analysis and vali- dation of proteomic data generated by tandem mass spectrometry. Nature Methods, 4(10):787–797, 2007.

[87] Kang Ning, Damian Fermin, and Alexey I Nesvizhskii. Computational anal- ysis of unassigned high-quality MS/MS spectra in proteomic data sets. Pro- teomics, 10(14):2712–2718, 2010.

[88] Jesper V Olsen, Boris Macek, Oliver Lange, Alexander Makarov, Stevan Horning, and Matthias Mann. Higher-energy c-trap dissociation for peptide modification analysis. Nature Methods, 4(9):709, 2007.

[89] Shao-En Ong, Blagoy Blagoev, Irina Kratchmarova, Dan Bach Kristensen, Hanno Steen, Akhilesh Pandey, and Matthias Mann. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Molecular & Cellular Proteomics, 1(5):376–386, 2002.

[90] Nico Pfeifer, Andreas Leinenbach, Christian G Huber, and Oliver Kohlbacher. Statistical learning of peptide retention behavior in chromatographic separa- tions: A new kernel-based approach for computational proteomics. BMC Bioinformatics, 8(1):468, 2007.

[91] Paola Picotti and Ruedi Aebersold. Selected reaction monitoring–based pro- teomics: Workflows, potential, pitfalls and future directions. Nature Methods, 9(6):555, 2012.

[92] David W Powell, Connie M Weaver, Jennifer L Jennings, K Jill McAfee, Yue He, P Anthony Weil, and Andrew J Link. Cluster analysis of mass spec- trometry data reveals a novel component of SAGA. Molecular and Cellular Biology, 24(16):7249–7259, 2004.

[93] Lukas Reiter, Manfred Claassen, Sabine P Schrimpf, Marko Jovanovic, Alexander Schmidt, Joachim M Buhmann, Michael O Hengartner, and Ruedi Aebersold. Protein identification false discovery rates for very large pro- teomics data sets generated by tandem mass spectrometry. Molecular & Cel- lular Proteomics, 8(11):2405–2417, 2009.

[94] Hannes L Röst, Yansheng Liu, Giuseppe D’Agostino, Matteo Zanella, Pedro Navarro, George Rosenberger, Ben C Collins, Ludovic Gillet, Giuseppe Testa, Lars Malmström, et al. TRIC: An automated alignment strategy for repro- ducible protein quantification in targeted proteomics. Nature Methods, 13(9): 777, 2016. 62 BIBLIOGRAPHY

[95] Hannes L Röst, George Rosenberger, Pedro Navarro, Ludovic Gillet, Saša M Miladinović, Olga T Schubert, Witold Wolski, Ben C Collins, Johan Malm- ström, Lars Malmström, et al. OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data. Nature Biotechnology, 32 (3):219, 2014.

[96] Fahad Saeed, Jason D Hoffert, and Mark A Knepper. CAMS-RS: Cluster- ing algorithm for large-scale mass spectrometry data using restricted search space and intelligent random sampling. IEEE/ACM Transactions on Com- putational Biology and Bioinformatics (TCBB), 11(1):128–141, 2014.

[97] Mikhail M Savitski, Mathias Wilhelm, Hannes Hahne, Bernhard Kuster, and Marcus Bantscheff. A scalable approach for protein false discovery rate esti- mation in large proteomic data sets. Molecular & Cellular Proteomics, pages mcp–M114, 2015.

[98] Oliver Serang and Lukas Käll. Solution to statistical challenges in proteomics is more statistics, not less. Journal of Proteome Research, 2015.

[99] Oliver Serang, Michael J MacCoss, and William Stafford Noble. Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data. Journal of Proteome Research, 9(10):5346–5357, 2010.

[100] Oliver Serang and William Noble. A review of statistical methods for protein identification using tandem mass spectrometry. Statistics and its Interface, 5 (1):3, 2012.

[101] Sigma-Aldrich. UPS1 & UPS2 proteomic standards. http://www. sigmaaldrich.com/life-science/proteomics/mass-spectrometry/ ups1-and-ups2-proteomic.html. Accessed: 2016-03-15.

[102] Owen S Skinner and Neil L Kelleher. Illuminating the dark matter of shotgun proteomics. Nature Biotechnology, 33(7):717, 2015.

[103] Lloyd M Smith, Neil L Kelleher, et al. Proteoform: A single term describing protein complexity. Nature Methods, 10(3):186–187, 2013.

[104] John D Storey and Robert Tibshirani. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16):9440–9445, 2003.

[105] John EP Syka, Joshua J Coon, Melanie J Schroeder, Jeffrey Shabanowitz, and Donald F Hunt. Peptide and protein sequence analysis by electron trans- fer dissociation mass spectrometry. Proceedings of the National Academy of Sciences, 101(26):9528–9533, 2004. BIBLIOGRAPHY 63

[106] David L Tabb, Melissa R Thompson, Gurusahai Khalsa-Moyers, Nathan C VerBerkmoes, and W Hayes McDonald. MS2Grouper: Group assessment and synthetic replacement of duplicate proteomic tandem mass spectra. Journal of The American Society for Mass Spectrometry, 16(8):1250–1261, 2005. [107] Johan Teleman, Aakash Chawade, Marianne Sandin, Fredrik Levander, and Johan Malmström. Dinosaur: A refined open-source peptide MS feature de- tector. Journal of Proteome Research, 15(7):2143–2151, 2016. ISSN 15353907. [108] Andrew Thompson, Jürgen Schäfer, Karsten Kuhn, Stefan Kienle, Josef Schwarz, Günter Schmidt, Thomas Neumann, and Christian Hamon. Tandem mass tags: A novel quantification strategy for comparative analysis of com- plex protein mixtures by MS/MS. Analytical Chemistry, 75(8):1895–1904, 2003. [109] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520– 525, 2001. [110] John D Venable, Meng-Qiu Dong, James Wohlschlegel, Andrew Dillin, and John R Yates. Automated approach for quantitative analysis of complex peptide mixtures from tandem mass spectra. Nature Methods, 1(1):39–45, 2004. [111] Juan Antonio Vizcaíno, Attila Csordas, Noemi del Toro, José A Dianes, Jo- hannes Griss, Ilias Lavidas, Gerhard Mayer, Yasset Perez-Riverol, Florian Reisinger, Tobias Ternent, et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Research, 44(D1):D447–D456, 2016. [112] Zack Weinberg. SVM separating hyperplanes. https://commons. wikimedia.org/wiki/File:Svm_separating_hyperplanes_(SVG).svg, 2012. Accessed: 2016-03-16. [113] Hendrik Weisser, Sven Nahnsen, Jonas Grossmann, Lars Nilse, Andreas Quandt, Hendrik Brauer, Marc Sturm, Erhan Kenar, Oliver Kohlbacher, Ruedi Aebersold, et al. An automated pipeline for high-throughput label- free quantitative proteomics. Journal of Proteome Research, 12(4):1628–1644, 2013. [114] Sebastian Wiese, Kai A Reidegeld, Helmut E Meyer, and Bettina Warscheid. Protein labeling by iTRAQ: A new tool for quantitative mass spectrometry in proteome research. Proteomics, 7(3):340–350, 2007. [115] Mathias Wilhelm, Judith Schlegl, Hannes Hahne, Amin Moghaddas Gholami, Marcus Lieberenz, Mikhail M Savitski, Emanuel Ziegler, Lars Butzmann, Siegfried Gessulat, Harald Marx, et al. Mass-spectrometry-based draft of the human proteome. Nature, 509(7502):582–587, 2014. 64 BIBLIOGRAPHY

[116] Daniel R Zerbino, Premanand Achuthan, Wasiu Akanni, M Ridwan Amode, Daniel Barrell, Jyothish Bhai, Konstantinos Billis, Carla Cummins, Astrid Gall, Carlos García Girón, et al. Ensembl 2018. Nucleic Acids Research, 46 (D1):D754–D761, 2017. [117] Bing Zhang, Matthew C Chambers, and David L Tabb. Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. Journal of Proteome Research, 6(9):3549–3557, 2007. [118] Bo Zhang, Lukas Kall, and Roman A Zubarev. DeMix-Q: Quantification- centered data processing workflow. Molecular & Cellular Proteomics, pages mcp–O115, 2016. [119] Bo Zhang, Mohammad Pirmoradian, Roman Zubarev, and Lukas Käll. Co- variation of peptide abundances accurately reflects protein concentration dif- ferences. Molecular & Cellular Proteomics, 16(5):936–948, 2017. [120] Yafeng Zhu, Lina Hultin-Rosenberg, Jenny Forshed, Rui MM Branca, Lukas M Orre, and Janne Lehtiö. SpliceVista, a tool for splice variant identi- fication and visualization in shotgun proteomics data. Molecular & Cellular Proteomics, 13(6):1552–1562, 2014. [121] Roman A Zubarev. Electron-capture dissociation tandem mass spectrometry. Current Opinion in Biotechnology, 15(1):12–16, 2004. [122] Roman A Zubarev. The challenge of the proteome dynamic range and its implications for in-depth proteomics. Proteomics, 13(5):723–726, 2013.