Statistical and machine learning methods to analyze large-scale mass spectrometry data

MATTHEW THE

Licentiate Thesis
Stockholm, Sweden 2016

TRITA BIO Report 2016:3
ISSN 1654-2312
ISBN 978-91-7595-933-7

KTH School of Biotechnology
SE-171 21 Solna
SWEDEN

Academic thesis which, with the permission of KTH Royal Institute of Technology (Kungl Tekniska högskolan), is presented for public examination for the degree of Licentiate of Technology in Biotechnology on Tuesday 5 May 2016 at 13.00 in Pascal, floor 6 of the Gamma building, Science for Life Laboratory, Solna.

© Matthew The, May 2016

Printed by: Universitetsservice US AB

Abstract

As in many other fields, biology is faced with enormous amounts of data that contain valuable information yet to be extracted. The field of proteomics, the study of proteins, has the luxury of large repositories containing data from tandem mass spectrometry experiments, readily accessible to everyone who is interested. At the same time, there is still a lot to discover about proteins as the main actors in cell processes and cell signaling. In this thesis, we explore several methods to extract more information from the available data using methods from statistics and machine learning. In particular, we introduce MaRaCluster, a new method for clustering mass spectra on large-scale datasets. This method uses statistical methods to assess similarity between mass spectra, followed by the conservative complete-linkage clustering algorithm. The combination of the two resulted in up to 40% more peptide identifications on its consensus spectra compared to the state-of-the-art method. Second, we attempt to clarify and promote protein-level false discovery rates (FDRs). Studies frequently fail to report protein-level FDRs even though the proteins are actually the entities of interest. We provide a framework in which to discuss protein-level FDRs in a systematic manner, to open up the discussion and take away potential hesitance. We also benchmarked some scalable protein inference methods and included the best one in the Percolator package. Furthermore, we added functionality to the Percolator package to accommodate the analysis of studies in which many runs are aggregated. This reduced the run time for a recent study regarding a draft human proteome from almost a full day to just 10 minutes on a commodity computer, resulting in a list of proteins together with their corresponding protein-level FDRs.

Sammanfattning (Summary)

Modern molecular biology generates enormous amounts of data that require sophisticated algorithms to interpret. In the field of proteomics, that is, the large-scale study of proteins, large and easily accessible repositories of data from mass spectrometry experiments have been built up over the years. Furthermore, many aspects of proteins remain unexplored, which is all the more interesting as proteins are the main actors in cell processes and cell signaling. In this thesis we investigate methods to extract more information from the available data with the help of statistics and machine learning. First, we describe MaRaCluster, a new method for clustering mass spectra from large-scale datasets. The method uses statistical methods to assess the similarity between mass spectra, followed by hierarchical clustering, and identifies up to 40% more peptides than previously established methods. Second, we clarify and advocate the use of a measure of uncertainty in the identification of proteins from mass spectrometry data, the so-called protein-level false discovery rate (FDR). Many studies do not report protein-level FDRs even though they choose to present their results as lists of proteins. Here we present a statistical framework for discussing protein-level FDRs systematically and for removing potential hesitation. We compare a number of protein inference methods and included the best one in Percolator, a popular software package for post-processing of mass spectrometry data. In addition, we have added functionality to Percolator to handle analyses of studies in which a large number of runs are aggregated. This reduced the execution time for a large-scale study of the human proteome from almost a full day to just 10 minutes on a commodity computer.

Contents

List of publications

1 Introduction
  1.1 Thesis overview

2 Biological background
  2.1 From DNA to proteins
  2.2 Proteins
  2.3 Current state of research

3 Shotgun proteomics
  3.1 Protein digestion
  3.2 Liquid chromatography
  3.3 Tandem mass spectrometry
  3.4 Data analysis pipeline

4 Statistical evaluation
  4.1 Terminology
  4.2 False discovery rates
  4.3 The Percolator software package

5 Mass spectrum clustering
  5.1 Distance measures
  5.2 Clustering algorithms
  5.3 Evaluating clustering performance

6 Present investigations
  6.1 Paper I: MaRaCluster
  6.2 Paper II: Protein-level false discovery rates
  6.3 Paper III: Percolator 3.0

7 Future perspectives

Bibliography

List of publications

I have included the following articles in the thesis.

PAPER I: MaRaCluster: A fragment rarity metric for clustering fragment spectra in shotgun proteomics.
Matthew The & Lukas Käll
Journal of Proteome Research, 15(3), 713-720 (2016). DOI: 10.1021/acs.jproteome.5b00749

PAPER II: How to talk about protein-level false discovery rates in shotgun proteomics.
Matthew The, Ayesha Tasnim & Lukas Käll
Manuscript in preparation

PAPER III: Fast and accurate protein false discovery rates on large-scale proteomics data sets with Percolator 3.0.
Matthew The, Michael J. MacCoss, William S. Noble & Lukas Käll
Manuscript in preparation


Chapter 1

Introduction

We live in an era where data is generated in gigabytes, or sometimes even terabytes, per hour. The field of biology forms no exception to this rule, and we use all this data to capture the complexity of life, for example in the form of DNA, RNA and proteins. The field that deals with the analysis of this biological data is called bioinformatics. Of the three aforementioned primary building blocks of life, proteins are arguably the most important ones when it comes to biological processes. Ideally, one would want to know the expression levels of all proteins in a cell or tissue, and with advances in technology that allow deep sampling of, e.g., the human proteome, we are well on our way in this direction. In the field of proteomics, the study of proteins, a major part of the data generation comes from so-called shotgun proteomics experiments. In these experiments we start with a mixture of unknown proteins. We then use several methods to break the proteins up into smaller, more easily identifiable fragments and subsequently detect and register these fragments. The challenge is then to put these shards of evidence back together and derive which proteins were originally present in the mixture. Several gigabytes of data from these shotgun proteomics experiments are deposited regularly to openly accessible repositories, such as the PRIDE archive [57] (http://www.ebi.ac.uk/pride/archive/) and PeptideAtlas [16] (http://www.peptideatlas.org/). In this thesis, we will look at novel methods that we developed to explore and summarize these gold mines of information, using techniques from machine learning and statistics.

1.1 Thesis overview

In order to make this thesis as accessible as possible to readers unfamiliar with the field of shotgun proteomics, Chapters 2 and 3 are devoted to the biological and experimental background, respectively.

Chapter 3 also contains an overview of the analysis pipeline that is commonly used in shotgun proteomics and has been employed at several instances in the included papers. Chapter 4 focuses on the statistical methods that are used to provide appropriate levels of confidence in the obtained results. Chapter 5 examines several approaches to summarize the data of large-scale studies by clustering similar observations. Chapter 6 presents the main points of the included papers and explains where they fit into the pipeline. Finally, Chapter 7 provides some perspectives on the future of the field of shotgun proteomics and the subjects of the included papers in particular.

Chapter 2

Biological background

Over the last century, molecular biology has fundamentally changed our under- standing of life and the way we think about it. Among other things, it has given us the tools to reason about evolution, heredity and disease. We take it for granted that we can use DNA material as evidence in the courtroom and that we can check for risk factors for certain diseases through analysis of our genome. Still, most mere mortals only have a fleeting idea of what molecular biology actually is. Here, I will provide some background in the field of molecular biology, with a focus on proteins, and present the relevant terminology.

2.1 From DNA to proteins

Frequently mentioned in theses regarding this topic, and therefore unthinkable to leave out here as well, is the central dogma of molecular biology. It can loosely be formulated as "DNA is transcribed to RNA and RNA is translated into proteins". So what are DNA, RNA, proteins, transcription and translation exactly? DNA stands for deoxyribonucleic acid and is the famous double-stranded helix that stores our genetic information. Each strand is built up of a long chain of nucleotides, of which there are 4 types: cytosine (C), guanine (G), adenine (A) and thymine (T). The two strands are complementary, in the sense that a C on one of the strands has to be coupled to a G at the same location on the other strand and vice versa, while the same relation holds for A and T. RNA stands for ribonucleic acid and is very similar to DNA, except that it only has a single strand and uses the nucleotide type uracil (U) instead of thymine. It is copied from specific regions of the single DNA strands, called genes, through a process called transcription. In most eukaryotic genes, certain parts of the RNA chain, called introns, are subsequently removed in a process called RNA splicing. The remaining parts, called exons, can be combined in different ways, giving rise to splice variants of a gene.


Residue         3-letter code  1-letter code  Formula     Mass (Da)
Alanine         Ala            A              C3H5NO      71.04
Arginine        Arg            R              C6H12N4O    156.10
Asparagine      Asn            N              C4H6N2O2    114.04
Aspartic acid   Asp            D              C4H5NO3     115.03
Cysteine        Cys            C              C3H5NOS     103.01
Glutamic acid   Glu            E              C5H7NO3     129.04
Glutamine       Gln            Q              C5H8N2O2    128.06
Glycine         Gly            G              C2H3NO      57.02
Histidine       His            H              C6H7N3O     137.06
Isoleucine      Ile            I              C6H11NO     113.08
Leucine         Leu            L              C6H11NO     113.08
Lysine          Lys            K              C6H12N2O    128.09
Methionine      Met            M              C5H9NOS     131.04
Phenylalanine   Phe            F              C9H9NO      147.07
Proline         Pro            P              C5H7NO      97.05
Serine          Ser            S              C3H5NO2     87.03
Threonine       Thr            T              C4H7NO2     101.05
Tryptophan      Trp            W              C11H10N2O   186.08
Tyrosine        Tyr            Y              C9H9NO2     163.06
Valine          Val            V              C5H9NO      99.07

Table 2.1: The 20 canonical amino acids. The second and third columns give, respectively, the 3-letter and 1-letter codes commonly used as a shorter representation of the amino acid when giving a protein sequence. The given mass is the monoisotopic mass in unified atomic mass units, also known as Daltons (Da).

Proteins consist of chains of amino acids connected by amide bonds. The amino acids come in 20 types that all share the same basic structure of an amine (NH2) and carboxylic acid (COOH) group, but each have a different side chain (see Table 2.1). They are translated from RNA strands, where a triplet of nucleotides codes for a certain type of amino acid. As one can see from the math, there are 4^3 = 64 triplets and only 20 amino acid types, and consequently several triplets code for the same amino acid.

A protein is commonly represented as the string of amino acid types in its amino acid chain, starting with the amino acid at the amino end, also called the N- terminus, and ending with the amino acid at the carboxyl end, also called the C-terminus. This is also the direction in which the protein is translated from RNA, i.e. new amino acids are added to the carboxyl end during translation.

From the brief description above, one might get the impression that it would be enough to know the DNA sequences to know what proteins would be produced. However, there are several complicating factors, such as splice variants, gene regulation, and RNA and protein degradation, that prevent this conclusion from being correct. Knowing the DNA sequence does not tell us how much RNA will be produced from a specific gene, and the prediction of protein abundance from RNA levels has been a controversial point of debate [34, 32, 56, 82].

2.2 Proteins

Different from DNA strands, proteins all have different structures, owing to the diversity of the amino acid types' side chains. Certain side chains are hydrophilic, i.e., attracted to water, while others are hydrophobic, i.e., repelled by water. Some are positively charged, while others are negatively charged or do not carry any charge at all. Some are large, some are small. A protein typically folds into a conformation that minimizes its energy (Figure 2.1), which depends on the above-mentioned characteristics, as well as other chemical properties of the amino acids. Proteins can also consist of multiple amino acid chains, forming so-called protein complexes.

Figure 2.1: The 3D structure of the protein myoglobin (blue). The amino acid chain folds into specific structures that give the protein its functionality. Myoglobin is involved in the transportation of oxygen to cells. Here, a heme group (gray) with a bound oxygen molecule (red) is bound to the myoglobin molecule. [5]

Exactly this diversity in structure makes proteins the major provider of functionality at the cell level, e.g., catalyzing chemical reactions, transporting molecules, or receiving and reacting to signals in the cell's environment. An important mechanism that helps with this last task is post-translational modifications (PTMs), which are reversible chemical alterations to amino acids that can, for example, activate or deactivate the protein. In short, knowing the proteins that are present in a cell, ideally with their modifications and a functional annotation, can tell us a lot about what is happening in that cell. PTMs as well as splice variants result in variation in the population of proteins stemming from the same gene, known as the different proteoforms of a gene [74]. These proteoforms are all slightly different and may have different functionality. However, as they often have significant parts of their amino acid chains in common, it can be quite challenging to distinguish them from one another.

2.3 Current state of research

Nowadays, we can determine the entire DNA sequences of humans and other organisms, their genomes, in a matter of hours. We can also get relatively complete images of the gene expression, the transcriptome, using RNA sequencing techniques. However, getting a complete image of the protein expression, the proteome, is not yet feasible. Note that gene and protein expression are very much cell type dependent, and therefore the transcriptome or proteome of an organism should be a collection of transcriptomes or proteomes from different tissues and cell types. Some notable attempts have been made at constructing a human proteome [82, 45], but both projects took a substantial amount of time and effort and are far from complete, each claiming a coverage of 70-80% of the known human genes. Another important cornerstone of molecular biological research is the functional annotation of genes and proteins. As biological processes are often highly complex and interdependent, it is often difficult to isolate the specific functionality of a single gene or protein. While there are certainly exceptional cases where we have a clear image of the function of a gene or protein, we more often have to content ourselves with only knowing the pathway it is active in, or use the functionality of a protein with a (partly) similar sequence or structure as a proxy. Databases with up-to-date information on gene or protein functionality include Gene Ontology [4] (http://geneontology.org/), Pfam [7] (http://pfam.xfam.org/) and Uniprot [3] (http://www.uniprot.org/).

Chapter 3

Shotgun proteomics

In this chapter we will take a look at one of the most common techniques to identify proteins in a sample, shotgun proteomics. Direct measurement of proteins has proven to be difficult [44, 11]; therefore, in shotgun proteomics, several steps are undertaken to reduce the complexity of the analyte by breaking up the proteins into progressively smaller pieces, hence the adjective shotgun. During and after this disintegration of the proteins, several measurements are taken which together form a complex puzzle that needs to be solved by computational means. The following sections give a more detailed explanation of the above steps and introduce the relevant terminology that will be used later in the thesis. We start from a sample containing a complex mixture of proteins, go through the steps of breaking up the proteins into smaller pieces and the measurements on them, and finish with the data analysis on these measurements. The end product is commonly a list of proteins that we believe to have been present in the original mixture, possibly with a quantification of each protein's concentration.

3.1 Protein digestion

Proteins typically have lengths of a few hundred amino acids, which makes them too complex to detect directly. Therefore, the proteins are first digested, similar to the way that food is digested in our digestive tract. Proteases cut the long protein amino acid chains into smaller pieces, ideally to a length of about 6 to 50 amino acids. These short amino acid chains are called peptides and are basically the same as proteins, only shorter. In the following, when we refer to a peptide, we actually mean the set of all peptides with the same amino acid chain, sometimes also referred to as a peptide species, rather than a single molecule. Common proteases include trypsin, Lys-C and chymotrypsin, and are often isolated from the digestive systems of vertebrates, e.g. porcine or bovine trypsin, which are isolated from the pancreas. Proteases cleave the protein at specific places in the amino acid chain, e.g. trypsin cleaves the amide bonds after arginine and lysine, unless they are followed by proline.

For example, the hypothetical protein MKPATRDISILKILGAG would be split up into the peptides MKPATR, DISILK and ILGAG if trypsin were used as the protease. The predictability of this cleaving process makes it easier to identify the peptides in later stages. However, the cleaving process is not perfect and frequently some cleavage sites are missed, known as miscleavages. The opposite can also happen, when proteins are split at other amide bonds than the cleavage sites, known as partial digestion.
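As a rough illustration of this cleavage rule, here is a minimal sketch in Python (a simplified trypsin rule of my own: cut after K or R unless the next residue is P; miscleavages, partial digestion and modifications are ignored):

```python
def trypsin_digest(protein, min_len=1, max_len=50):
    """Cut a protein sequence after K or R, except when followed by P
    (simplified trypsin rule), and keep peptides of a plausible length."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        cleave = residue in "KR" and (i + 1 == len(protein) or protein[i + 1] != "P")
        if cleave:
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return [p for p in peptides if min_len <= len(p) <= max_len]

print(trypsin_digest("MKPATRDISILKILGAG"))  # ['MKPATR', 'DISILK', 'ILGAG']
```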

3.2 Liquid chromatography

The goal is now to identify the peptides that are present after the digestion step. This task will arguably be easier if we try to analyze only a single peptide at a time. Liquid chromatography is used as a first step to accomplish a separation of the peptides.

As mentioned before, amino acid side chains have different chemical properties, e.g., with respect to hydrophobicity and charge. Liquid chromatography (LC) exploits this fact by letting the peptides, dissolved in a liquid solvent, pass through a column containing an adsorbent. Through interactions between the adsorbent and the side chains, peptides are slowed down at different rates. This causes different peptides to elute at different times from the LC column, while identical peptides should elute at approximately the same time. The time it takes for a particular peptide to pass through the LC column is known as the peptide's retention time.

3.3 Tandem mass spectrometry

While liquid chromatography creates a first level of separation, there are usually still many distinct peptides that elute at approximately the same time. As our goal is to analyze a single peptide species at a time, we need to apply a second level of separation. This is done with a mass spectrometer, an instrument that can separate molecules on the basis of their mass-to-charge (m/z) ratio (Figure 3.1). In shotgun proteomics we often use a technique that consists of two steps of mass spectrometry, tandem mass spectrometry.

The instrument makes use of the fact that charged molecules will change direction when passing through an electric or magnetic field, i.e., the Lorentz force: F = q(E + v × B), and that the degree of deflection depends on the mass of the molecule, i.e., Newton's second law of motion: F = m · a. Modern mass spectrometers work on the same principles but often use different techniques to separate molecules than the one schematically depicted in Figure 3.1, e.g., time of flight (TOF), quadrupoles or ion traps. Some of these techniques can detect the mass-to-charge ratio of an ion without it needing to hit a detector, allowing further analysis of this molecule.


Figure 3.1: A mass spectrometer can separate charged molecules on the basis of their mass-to-charge ratio. [26]

The LC column is often directly connected to the mass spectrometer and peptides enter according to their retention time. Peptides are first ionized into precursor ions and then exposed to an electric or magnetic field. The precursor mass-to-charge ratios, i.e., the mass-to-charge ratios of the precursor ions, are recorded in a so-called MS1 spectrum. Several MS1 spectra are recorded, each at a different retention time. The MS1 spectrum consists of a list of mass-to-charge ratios with their corresponding signal intensity, which itself is a function of the precursor ions' abundance. We often plot the signal intensity as a function of the mass-to-charge ratio to visualize a spectrum, where we can identify the precursor peaks corresponding to certain precursor ions as mass-to-charge ratios with high signal intensity (Figure 3.2, bottom). The MS1 spectrum furthermore records the retention time at which the spectrum was registered. To visualize the precursor ions entering the mass spectrometer we can create a mass chromatogram, which plots the precursor ion intensity summed over all precursor mass-to-charge ratios against the retention time (Figure 3.2, top). Basically, such a plot can be seen as a summary of all the MS1 spectra, which themselves can be thought of as slices of the x axis, by disregarding the recorded precursor ion mass-to-charge ratios.

Figure 3.2: Top: An example of a mass chromatogram with the retention time in minutes on the x axis and the total precursor ion intensity on the y axis. Bottom: An example of an MS1 spectrum taken at a retention time of 85.2 minutes. It features the precursor ion mass-to-charge ratio on the x axis and the precursor ion intensity on the y axis.

After obtaining the MS1 spectrum, two diverging approaches are available. Either we try to analyze a single peptide species at a time by selecting a very narrow precursor mass-to-charge ratio window, known as data dependent acquisition, or we take large windows of mass-to-charge ratios and capture multiple peptide species at a time, known as data independent acquisition.

Data dependent acquisition (DDA)

For data dependent acquisition (DDA), a narrow window of the precursor mass-to-charge ratio is selected, based on the precursor peaks in the MS1 spectrum. Ideally, we only select a single peptide species at this point. The ions within this range are fragmented into smaller ions, using techniques such as collision-induced dissociation (CID) and electron-transfer dissociation (ETD). Most often, the peptide breaks at one of its amide bonds, thereby producing two fragment ions that are themselves short amino acid chains, or just a single amino acid. These fragment ions are then recorded in a so-called MS2 spectrum or fragment spectrum. Several mass-to-charge ranges of the precursor ions are selected to produce numerous MS2 spectra. Usually, a list of recently selected mass-to-charge ranges is kept, so that we do not sample the same precursor ions over and over again. This technique is known as dynamic exclusion and includes several parameters that govern how often we can sample a particular range before excluding it and for how long it should be excluded.

An MS2 spectrum, like an MS1 spectrum, consists of a list of mass-to-charge ratios with their corresponding signal intensity. In the plots of the signal intensity as a function of the mass-to-charge ratio, we can now identify the fragment peaks corresponding to certain fragment ions (Figure 3.3). The MS2 spectrum also retains information about the mass-to-charge range that was selected for the precursor ion, as well as the retention time.

[MS2 spectrum 103111-Yeast-2hr-03.ms2:28023, identified as peptide K.NFTPEQISSMVLGK.M (q value 0.0); fragment ion intensity plotted against m/z.]

Figure 3.3: An example of an MS2 spectrum (black bars), together with the predicted fragment ion peaks of its peptide identification (red lines). The precursor ion peak at a mass-to-charge ratio of 517.60 Th can be seen in Figure 3.2.

The advantage of data dependent acquisition is that the MS2 spectra usually only feature a single peptide species, thereby avoiding the more difficult case of having the fragment peaks of two or more peptide species in a single MS2 spectrum. In that case, we would need both to figure out which peaks belong to which peptide species and to identify the peptide species they came from. However, it is not at all uncommon for two peptide species to be fragmented together anyway, resulting in so-called chimeric spectra, with recent studies showing over 50% of the MS2 spectra featuring this behavior [54, 58].

The major issue with data dependent acquisition is its poor reproducibility. Each time the experiment is run, a different set of precursor ions could be selected. This is mainly due to small variations in the MS1 spectrum that influence the precursor mass-to-charge ratio selection and thereby also the outcome of the dynamic exclusion. This would not be a problem if we were able to sample all or most of the peptide species at least once, but in practice we are often considerably undersampling the pool of present peptide species due to limits of the instrumentation [58].

Data independent acquisition (DIA)

A recently popularized alternative to DDA is data independent acquisition (DIA), also known as SWATH [78, 27]. Here, a broad precursor mass-to-charge ratio window is selected for the fragmentation round, often with many peptide species in it. At the cost of more complex MS2 spectra, we now have a much deeper sampling of the pool of present peptide species. If the windows are taken broad enough, one could theoretically even sample all peptide species that elute from the LC column, which could greatly increase the reproducibility of shotgun proteomics experiments.

3.4 Data analysis pipeline

Receiving the MS1 and MS2 spectra that were produced in the shotgun proteomics experiment, bioinformaticians are now faced with the task of putting these pieces of the puzzle back together. For a more elaborate review of the data analysis pipeline, see [63].

First, as an optional but advisable step for DDA data, we can recalibrate the precursor mass-to-charge ratio of the precursor ion and make a prediction of its charge state. This is necessary because the precursor mass-to-charge ratio provided with the MS2 scan often reflects the selection range rather than the mass-to-charge ratio of the precursor ion of interest. Software packages developed for this include Hardklör [37] and Bullseye [39]. Hardklör scouts the MS1 spectra for potential precursor ions and Bullseye subsequently matches them to the corresponding MS2 spectrum, while propagating information extracted from the peaks in the MS1 spectra, such as mass-to-charge ratio, charge and retention time.

From this point on, we normally do not use the MS1 spectra any longer and focus solely on the MS2 spectra for the identification of peptides. However, there do exist algorithms that try to propagate identifications from the MS2 spectra back to precursor peaks in the MS1 spectra [13, 84]. This mitigates the effects of undersampling, i.e., precursor ions that were not selected for fragmentation can still be identified.

Peptide identification

Several approaches are available for identifying the peptide from an MS2 spectrum. They can roughly be classified into three categories: database search engines, spectral library searching and de novo sequencing. Database search engines are by far the most commonly used and usually identify the most peptides, but the other two categories are gaining in popularity as the amount of data and the available computing power grow.

Database search engines

As the number of possible peptides grows exponentially with their length (20^n), identifying the amino acid sequence directly from the MS2 spectrum is a very difficult task. Instead, we usually work with databases containing sequences of known proteins for the organism of interest. We then do an in silico digestion of these proteins, i.e. we cleave the proteins based on the cleavage rules of the protease that was used. At this stage, we can also account for miscleavages, partial digestion and post-translational modifications. Having obtained a list of potentially present peptides, we can construct a theoretical spectrum by considering the mass-to-charge ratios of the fragments that result from breaking the peptide at the amide bonds. Subsequently, a scoring function is applied to a pair of experimental and theoretical spectra, which assigns a score to the pair. Such a pair is called a peptide spectrum match (PSM). We only compare experimental and theoretical spectra for which the precursor masses are sufficiently close, with a tolerance set depending on the accuracy of the instrumentation. Well known database search engines include SEQUEST [19], MASCOT [12], X!Tandem [14] and MS-GF+ [46].
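To make the notion of a theoretical spectrum concrete, here is a minimal sketch (not any particular search engine's implementation) that computes singly charged b- and y-ion m/z values from the monoisotopic residue masses of Table 2.1; the proton and water masses are standard constants, and the example peptide is the DISILK from the digestion example above.

```python
# Monoisotopic residue masses (Da), cf. Table 2.1
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # mass of H2O (Da)

def theoretical_by_ions(peptide):
    """Singly charged b- and y-ion m/z values for breaking each amide bond."""
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_mass = sum(RESIDUE_MASS[aa] for aa in peptide[:i])            # N-terminal fragment
        y_mass = sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER    # C-terminal fragment
        b_ions.append(b_mass + PROTON)
        y_ions.append(y_mass + PROTON)
    return b_ions, y_ions

b_ions, y_ions = theoretical_by_ions("DISILK")
```

A scoring function would then, for example, count or weight the experimental peaks that fall within a small tolerance of these theoretical m/z values.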

Spectral library searching

One of the problems with database search engines is that the theoretical spectrum is not necessarily a good template for the experimental spectrum of a peptide, due to the complex processes governing the fragmentation. Therefore, in spectral library searching we instead choose to use previously identified spectra as templates for a peptide. This first of all requires a good database of template spectra, as well as a scoring function that compares a template spectrum to the experimental spectrum. Reliable databases of template spectra are commonly constructed using the reliable identifications from database search engines. Notable examples include the NIST (http://peptide.nist.gov) and PeptideAtlas (http://www.peptideatlas.org/speclib/) spectral libraries. As different ionization and fragmentation techniques produce different spectra for the same peptide species, it is often beneficial to select a library created with the same or similar instrumentation. Packages implementing a scoring function to identify peptides from spectral libraries include SpectraST [50], X!Hunter [15] and BiblioSpec [25]. For a comprehensive overview of spectral library searching, see [49].

De novo sequencing

There are also algorithms that do de novo sequencing, i.e., peptide identification without making use of a database. They typically make use of the mass-to-charge ratio differences between peaks to identify single amino acids or short chains of amino acids. One of the problems is that fragment peaks are often missing from MS2 spectra, and many combinations of amino acids add up to an equal mass that could explain such a gap. Therefore, de novo peptide identifications often contain ambiguous regions, i.e., two or more amino acid chains are just as likely to have produced a part of the spectrum. De novo sequencing packages include PEAKS [55], PepNovo [23] and NovoHMM [20].

Retention time prediction

The supplied precursor mass narrows down the number of candidate peptides for an experimental spectrum from the order of millions to several hundreds. We could narrow down the search space even further if we could predict the retention time of a theoretical peptide and compare this to the retention time of the experimental spectrum. Unfortunately, the processes inside the LC column are quite complex; no simple expression can be found to predict a peptide's retention time, which furthermore depends on the type of LC column. A popular approach has been to assign retention coefficients to the amino acids, reflecting their individual contributions to the retention time, as used in, e.g., SSRCalc [48]. Also, attempts have been made to predict retention times by modeling the physical interactions of the peptide with the LC column, such as in BioLCCC [28]. Alternatively, machine learning algorithms, such as Elude [60] and RTPredict [65], can be applied that learn parameters from previously seen data and use these to make predictions of peptide retention times. For a review on the subject, see [59].
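A minimal sketch of the retention-coefficient idea (an illustration, not SSRCalc itself): given peptides with observed retention times, fit one coefficient per amino acid by least squares, so that a peptide's predicted retention time is the sum of its residues' coefficients plus an intercept.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fit_retention_coefficients(peptides, retention_times):
    """Least-squares fit of per-residue retention coefficients plus an intercept."""
    X = np.zeros((len(peptides), len(AMINO_ACIDS) + 1))
    for row, pep in enumerate(peptides):
        for aa in pep:
            X[row, AMINO_ACIDS.index(aa)] += 1  # amino acid composition
        X[row, -1] = 1.0                         # intercept column
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(retention_times, dtype=float), rcond=None)
    return coeffs

def predict_retention_time(peptide, coeffs):
    return sum(coeffs[AMINO_ACIDS.index(aa)] for aa in peptide) + coeffs[-1]
```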

Protein identification

For certain studies it is enough to have the peptide identifications, but often we are more interested in the actual proteins that those peptides came from. This task is made difficult by the fact that many peptides could stem from more than one protein, known as shared peptides, either due to multiple splice variants of the same gene or due to homologies between genes. Therefore, we cannot simply combine evidence from all peptides that could have come from a protein, as some of those peptides could have come from other proteins. This issue has been designated the protein inference problem [61] and several protein inference methods have been developed to tackle it. There is currently a wide availability of schemes to infer proteins, based on different principles, such as IDPicker [83], which uses the parsimony principle, and ProteinProphet [62], MSBayes [51] and Fido [71], which all use probability theory. For a review on the subject see [72].

Peptide and protein quantification

Another entity of interest can be the concentration of each protein that was identified, and a wide variety of methods are employed for this. While most of these methods quantify peptides rather than proteins, the idea is that we can transfer these peptide quantifications to the protein level. Here, we again face the protein inference problem and care has to be taken to do this correctly. By comparing the quantitative levels of peptides or proteins between different samples, e.g., treated and untreated, we can get a better insight into the inner workings of the biological processes. Some of the quantification methods rely on modification of the sample by introducing stable isotopes, e.g., stable isotope labeling by amino acids (SILAC) [64] and isobaric labeling techniques such as tandem mass tags (TMT) [77] and isobaric tags for relative and absolute quantitation (iTRAQ) [81]. Others do not require modification of the sample and are therefore cheaper, simpler and allow retrospective analysis of the data. These so-called label-free quantification strategies include spectral counting [52] or using the total ion current (TIC) of precursor ions [10]. For a review on the subject, see [6].

Chapter 4

Statistical evaluation

Having obtained a list of peptides, proteins or even protein quantifications from the data analysis pipeline, we are faced with the question: “How much confidence do we have that they are correct?”. The field of statistics is essential in answering this question. Failure to apply these techniques or to apply them correctly could cost an experimenter valuable time and money. We start by providing some statistical terminology, followed by the most common approach to statistical evaluation in shotgun proteomics and finish with a section about a software package developed by our lab that does statistical evaluation of shotgun proteomics data.

4.1 Terminology¹

In statistics, a null hypothesis usually represents the presupposition that the observations are a result of random chance alone. For example, a null hypothesis for a PSM produced by a database search engine could be, "The peptide was spuriously matched to the spectrum". We will call a PSM, peptide or protein for which we have found evidence through the data analysis pipeline an inference, as they have respectively been inferred from mass spectra, PSMs and peptides. These inferences can be correct or incorrect, which actually depends on the null hypothesis we are testing, as will be illustrated later in this chapter. Ideally, the list of inferences is sorted by some score, e.g., from the scoring function of a database search engine. Since we are only interested in the inferences we are very confident about, we only keep the inferences scoring better than some kind of threshold. The inferences scoring above the threshold will be referred to here as discoveries². Discoveries, like inferences, can be correct or incorrect; correct discoveries are called true discoveries and incorrect discoveries are called false discoveries.

¹ In some cases, other terms are more common in the field of statistics than the ones used here. However, terms like "identification" or "feature" have confounding meanings due to their concurrent usage in the fields of proteomics and machine learning, and were therefore avoided on purpose.
² In statistical literature, discoveries are also frequently referred to as positives or significant features.


4.2 False discovery rates

Introduced by Benjamini and Hochberg [9], the false discovery rate (FDR) is a concept from multiple hypothesis testing aimed at controlling the number of false discoveries. In multiple hypothesis testing, we cannot simply use a significance level, as we would when testing a single hypothesis. To see this, say that we use a significance level of 0.05 and test a hypothesis 1000 times, with the null hypothesis being true every time. Then, by random chance alone, we would already expect 50 significant results, without there being a truly significant result in the whole set. The FDR is defined as the expected fraction of false discoveries in the group of discoveries. The FDR allows for a few false discoveries, provided that their number is small compared to the number of true discoveries. A commonly employed threshold of 1% FDR would, for example, allow 10 false discoveries, provided that there are 990 true discoveries. This might not be acceptable for certain critical applications, but in shotgun proteomics we are often dealing with exploratory data analysis and thus are more interested in many discoveries as candidates to examine further and less concerned about a few false discoveries. As the FDR is not a monotonic function of the score, we often prefer to talk about the q value of an inference (Figure 4.1), in analogy to the widely used p value. The q value is defined as the minimum of the FDRs of all inferences with the same or a lower score than this inference. The q value is a monotonic function of the score and therefore it makes sense to set a threshold on it and call all inferences below a certain q value discoveries.

Decoy models

For each inference, we test the selected null hypothesis by considering the distribution under the null hypothesis, i.e., we look at the score distribution of inferences for which the null hypothesis is true and compare the score of an inference to this null distribution. In practice, we have a mixture of correct and incorrect inferences and this score distribution is therefore not suitable to be used. Instead, we artificially create a set of inferences for which we know the null hypothesis is true, known as the decoy model [41]. This is usually done by searching the spectra against a reversed or shuffled version of the original protein database, creating a decoy database containing decoy proteins.

As none of the decoy peptides, i.e., peptides derived from the decoy proteins, should be present in the sample, any match to such a decoy peptide, called a decoy PSM, is incorrect. The score distribution of the matches to decoy peptides is therefore a good estimate of the null distribution for the PSMs. The original protein database containing the proteins of interest is called the target database, and correspondingly we have target PSMs, target peptides and target proteins. Although any database of non-present proteins would produce a score distribution for which the null hypothesis is true, using a reversed or appropriately shuffled version of the protein sequences of interest ensures that several peptide characteristics, like length and amino acid frequency, are preserved.

PSM-level false discovery rates

To estimate the PSM-level FDR at a certain threshold t, we can count the number of decoy PSMs scoring better than t and divide that by the number of target PSMs scoring better than t (Figure 4.1). An extra correction should be applied to account for the fact that the number of incorrect target PSMs is lower than the number of decoy PSMs [75]. An alternative method to calculate the PSM-level FDR is to search the spectra against a concatenated database, i.e., a protein database consisting of the target and decoy proteins combined, also known as target-decoy competition (TDC) [18]. In this case, we expect to get the same number of decoy PSMs as incorrect target PSMs at a threshold t and no correction factor is needed. This approach is especially beneficial if there is a bias in the scoring function, which causes some spectra to systematically score better than others [43]. If the scores are not biased the two methods give approximately the same performance, though TDC is to be preferred as some issues still remain with the first approach [42].
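A minimal sketch of how PSM-level FDRs and q values can be estimated after a target-decoy competition search, assuming one best match per spectrum and a boolean decoy flag; this illustrates the counting only, not any particular tool's implementation:

```python
def target_decoy_qvalues(scores, is_decoy):
    """Estimate q values for target PSMs after target-decoy competition.

    scores   -- search engine scores, higher is better
    is_decoy -- True if the PSM matched a decoy peptide
    Returns a dict mapping the index of each target PSM to its q value."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    decoys, targets, fdrs, target_idx = 0, 0, [], []
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
            target_idx.append(i)
            # with TDC, the decoy count estimates the number of incorrect targets
            fdrs.append(decoys / targets)
    # q value: minimum FDR over all thresholds at or below this PSM's score
    qvals, running_min = [0.0] * len(fdrs), 1.0
    for j in range(len(fdrs) - 1, -1, -1):
        running_min = min(running_min, fdrs[j])
        qvals[j] = running_min
    return dict(zip(target_idx, qvals))
```

At a 1% FDR threshold, the discoveries would then be the target PSMs with a q value of at most 0.01.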

Peptide-level false discovery rates

Frequently, many PSMs map to the same peptide, and correctly inferred peptides usually have more PSMs mapped to them than incorrectly inferred peptides. Therefore, simply using the PSM-level FDR as the peptide-level FDR will lead to overconfident FDR estimates. However, inferring the peptide by taking the score of its best scoring PSM and subsequently recalculating the FDR based on this truncated list will give an accurate peptide-level FDR [29]. Note that PSMs mapping to the same peptide are not independent pieces of evidence and cannot be combined by, e.g., Fisher's method for combining p values [21].
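Continuing the sketch above, peptide-level q values follow by keeping only the best scoring PSM per peptide before repeating the same counting (the helper below is hypothetical and reuses target_decoy_qvalues from the previous sketch):

```python
def best_psm_per_peptide(psms):
    """psms: iterable of (peptide, score, is_decoy); keep the best PSM per peptide."""
    best = {}
    for peptide, score, is_decoy in psms:
        if peptide not in best or score > best[peptide][0]:
            best[peptide] = (score, is_decoy)
    scores = [score for score, _ in best.values()]
    decoys = [decoy for _, decoy in best.values()]
    return list(best.keys()), scores, decoys

# peptides, scores, decoys = best_psm_per_peptide(psms)
# peptide_qvalues = target_decoy_qvalues(scores, decoys)
```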

Protein-level false discovery rates

While the FDR itself is well defined and PSM- and peptide-level FDRs are well accepted in the field, protein-level FDRs have caused some controversy due to ambiguity in their definition and the difficulty of estimating them.

Figure 4.1: Calculation of false discovery rates (FDRs) and q values to approximate the true fraction of incorrect inferences using a decoy model. Here, an inference can, for example, be a PSM, peptide or protein. The first two columns represent the idealized situation in which we know which of the inferences are correct and incorrect; the last three columns represent the true situation, where we do not know these labels and need to use a decoy model to approximate the number of incorrect inferences. As the FDR is not a monotonic function of the score, we prefer to use the q value, which takes the minimum of the FDRs over all inferences scoring worse than the current one.

We will first focus on the issue of ambiguity in definition, which can be traced back to the different definitions of the (often implicit) null hypothesis that different authors use. Two camps can be distinguished right off the bat. Loosely, their two differing null hypotheses can be formulated as "the evidence for the protein came from incorrectly interpreted mass spectra" [62, 67, 69] and "the protein is absent from the sample" [71, 83, 51]. Interestingly, while some consider the difference between the two views obvious, many actually have trouble distinguishing them. The different viewpoints can be illustrated using the following scenario:

Protein A is present in the sample and has three peptides a1, a2 and a3 that can be detected. Due to undersampling of the pool of present peptides in the mass spectrometry experiment, peptides a2 and a3 do not produce a mass spectrum. Furthermore, a1 produces a very low quality mass spectrum that cannot be interpreted confidently.

Protein B is also present in the sample and has a peptide b5 that is very similar to a2. The mass spectrum belonging to b5 gets erroneously inferred, with high confidence, as originating from peptide a2. Based on this highly confident, albeit faulty, evidence, protein A gets called significant.

The first camp bases their classifications on the observed mass spectra. They would classify this protein as a false discovery, because the mass spectrum was erroneously interpreted. The second camp bases their classification on the actual presence of the protein in the sample, and would therefore classify this protein as a true discovery. The definition is also made harder by the shared peptide problem. A handful of protein inference methods choose to eliminate all shared peptides, thereby simplifying the problem considerably at the cost of reduced sensitivity. However, the majority of the available methods choose not to throw out this preciously obtained data. This often leads to a solution called protein grouping, where proteins with the same peptide evidence are inferred together as a protein group. However, this solution comes with the immediate question: "What do we mean when we say we inferred a protein group?". Several answers are possible: "one protein in the group is present", "all proteins in the group are present" or "at least one of the proteins in the group is present". The first answer leaves open the question of which of the proteins to choose, the second is generally too liberal and the third is too unspecific. This might not be very problematic if the proteins have the same function, but due to homology between different genes this is certainly not always the case. One of the major difficulties in estimating the protein-level FDR, given a well-defined null hypothesis, seems to stem from the shared peptide problem. Correctly inferred shared peptides break the assumptions of the decoy model by potentially inferring proteins that do not fulfill the null hypothesis, i.e., there will not be a decoy protein inferred at a similar score that could represent the incorrect protein in the FDR calculation. Using only peptides that are unique to a protein does not break the decoy model, but does force us to throw out a large portion of the inferred peptides. Sometimes the difficulty in estimating the protein-level FDR stems from faulty assumptions rather than a fundamental problem. This was the case in one of the first draft human proteomes [82, 70], and was later resolved [69].

Calibration

We would like to verify that the FDR estimates reported by an inference method provide good estimates of the real number of false discoveries. This can be done through calibration using datasets where we know, for each inference, whether it is correct or not before doing the experiment.

The most popular method to achieve this goal is to analyze mixtures of known proteins, such as the ISB 18 [47] or the Sigma UPS1/UPS2 protein mixtures [73]. The entire shotgun proteomics pipeline is executed here, and we typically try to find the peptides and proteins that are known to be present against a background of peptides and proteins that are definitely not present. We can check the FDR estimates by using the non-present peptides or proteins as the false discoveries. These standards contain relatively few proteins, often in the order of 50, and often do not contain many shared peptides. Therefore, they are not very representative of the highly complex samples we typically analyze in shotgun proteomics [31]. Furthermore, an intrinsic problem is that we cannot verify the null hypothesis that a peptide or protein inference has come from incorrect PSMs, as we do not control this part of the process. Alternatively, one can use simulations of shotgun proteomics experiments. Here, correct and incorrect inferences are typically drawn from their respective (score) distributions. To test PSM inference methods we could simulate mass spectra, though this is generally tricky to get right. It is simpler to test peptide and protein inference methods, where we can simulate correct and incorrect PSMs or, in the case of protein inference methods, also correct and incorrect peptides. Simulations come with a certain set of assumptions about the shotgun proteomics pipeline which might be violated in practice. For example, in some simulations present peptides are randomly drawn, whereas in practice the sampling of present peptides is a function of their abundance. Because of assumptions like these, simulations could be considered to give less conclusive evidence than mixtures of known proteins. However, they can simulate results from samples with high complexity and can also verify the null hypothesis of peptide and protein inferences coming from incorrect PSMs.
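A minimal sketch of the simulation idea, with purely illustrative score distributions (the normal distributions and sample sizes below are assumptions, not taken from any real data): draw scores for correct and incorrect inferences, estimate the FDR with a decoy model drawn from the same null distribution, and compare it to the true fraction of incorrect inferences among the discoveries.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative score distributions: correct inferences score higher on average
correct = rng.normal(3.0, 1.0, 2000)
incorrect = rng.normal(0.0, 1.0, 8000)  # inferences for which the null hypothesis is true
decoys = rng.normal(0.0, 1.0, 8000)     # decoy model: an independent draw from the null

threshold = 2.0
targets = np.concatenate([correct, incorrect])
estimated_fdr = np.sum(decoys > threshold) / max(np.sum(targets > threshold), 1)
true_fdr = np.sum(incorrect > threshold) / max(np.sum(targets > threshold), 1)
print(f"estimated FDR: {estimated_fdr:.3f}, true fraction incorrect: {true_fdr:.3f}")
```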

4.3 The Percolator software package

A common issue with database search engines is that their scoring functions often contain biases, e.g., scores are systematically higher for longer peptides or highly charged precursor ions. Initially, this was solved by, e.g., applying different score thresholds for different precursor charge states or having peptide length dependent thresholds. Käll et al. [40] recognized that a better approach was to employ a support vector machine (SVM), which was implemented in their widely used software package called Percolator. Percolator is still being developed further in our group and we regularly release bug fixes and new features. An SVM is a classification algorithm, i.e., given a set of features for an inference it assigns it a certain class, or label, typically one out of two. In this case the labels are "correct" and "incorrect". A feature can be any numeric value; for a PSM it can, for example, be its score, peptide length or precursor mass.

SVMs belong to the class of supervised learning algorithms, which means that they are trained by feeding them inferences, together with their features, for which the labels are known. Based on this training set, an SVM will produce a high-dimensional decision boundary in the feature space which separates the inferences of the two labels in an optimal way (Figure 4.2).


Figure 4.2: A support vector machine calculates the decision boundary H3 which has the largest margin between inferences of the two classes. Here, one can think of the black circles as representing the incorrect inferences, while the white circles represent the correct ones. The X1 axis represents one feature, e.g., a PSM's score, while the X2 axis represents a second feature, e.g., a PSM's peptide length. The decision boundary H1 does not provide a separation between the two classes; H2 and H3 both separate the two classes, but H3 does this with the largest possible margin to the closest inference. [80]

The problem in the analysis of shotgun proteomics data is that we normally do not know to which of the two classes an inference belongs. We can create a decoy model which will produce only incorrect inferences, but from the start we do not have a set of correct inferences. To overcome this, Percolator uses a semi-supervised learning approach, where the correct set is inferred during the training. In practice, this means that we start by assigning the label "correct" to a small set of very confident PSMs, typically those with a very good database search engine score. Subsequently, we train an SVM using this small set as the correct inferences and the decoy inferences as incorrect inferences, and based on the new decision boundary we select a new set of very confident PSMs. This process can be repeated a fixed number of times or until convergence, i.e., until the set of inferences labeled as correct no longer changes significantly. To prevent overfitting to the data, Percolator uses a cross validation scheme [30].
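A minimal sketch of this semi-supervised iteration, using scikit-learn's LinearSVC as a stand-in for Percolator's SVM and a simple target-decoy q value helper; the feature matrix, decoy flags and initial search engine scores are assumed inputs, and the cross validation scheme is omitted:

```python
import numpy as np
from sklearn.svm import LinearSVC

def psm_qvalues(scores, is_decoy):
    """Per-PSM q values from target-decoy counts (simplified, no correction factor)."""
    order = np.argsort(-scores)
    fdr = np.cumsum(is_decoy[order]) / np.maximum(np.cumsum(~is_decoy[order]), 1)
    qvals_sorted = np.minimum.accumulate(fdr[::-1])[::-1]
    qvals = np.empty_like(scores, dtype=float)
    qvals[order] = qvals_sorted
    return qvals

def percolator_like_iteration(features, is_decoy, search_scores, n_iter=10, fdr=0.01):
    """Iteratively re-rank PSMs: take confident targets as positives and decoys as
    negatives, train a linear SVM, and re-score all PSMs.  A sketch, not Percolator."""
    scores = np.asarray(search_scores, dtype=float)
    features = np.asarray(features, dtype=float)
    is_decoy = np.asarray(is_decoy, dtype=bool)
    for _ in range(n_iter):
        positives = (~is_decoy) & (psm_qvalues(scores, is_decoy) <= fdr)
        X = np.vstack([features[positives], features[is_decoy]])
        y = np.concatenate([np.ones(positives.sum()), np.zeros(is_decoy.sum())])
        svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
        scores = svm.decision_function(features)  # new scores for all PSMs
    return scores
```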

Chapter 5

Mass spectrum clustering

In the field of shotgun proteomics it is considered good practice to make data sets of tandem mass spectrometry experiments publicly available in repositories such as the PRIDE archive and PeptideAtlas. According to the latest statistics, the PRIDE archive contains 690 million mass spectra from 813 data sets [79], and the human part of the PeptideAtlas contains 470 million mass spectra from over 1000 samples [17]. While the data sets are interesting by themselves for researchers interested in the particular topic of the original study, much more interesting outcomes can be expected by considering the entire database as a whole. In particular we would like to investigate commonalities between studies and to analyze the huge amount of unidentified spectra, which could, for example, unveil novel PTMs [38] or unexpected fragmentation patterns. It is quite a daunting task to consider hundreds of millions of mass spectra and find the spectra of interest. A useful technique from the field of machine learning is to cluster the spectra, i.e., similar spectra are grouped together, ideally stemming from the same peptide species. By doing this, we can identify groups of interest, rather than spectra of interest, which usually is a reduction of at least one order of magnitude. Furthermore, large groups of unidentified spectra are natural candidates for investigation and one could construct spectral libraries for frequently recurring spectra. The two key challenges in the clustering of mass spectra are how to measure the similarity between two spectra and how to decide whether the similarity is high enough to warrant that two or more spectra came from the same peptide species. The first challenge has been met with several distance measures that take into account different characteristics of mass spectra, while the second part requires a combination of clustering algorithms, often from the field of machine learning, and statistics. Several mass spectrum clustering packages have been developed, such as PepMiner [8], MS2Grouper [76], NorBel2 [22], MS-Cluster [24], PRIDE-Cluster [33], CAMS-RS [68] and Liberator [38]. Some of these have specifically targeted

large-scale datasets [24, 33, 68, 38], though currently none of them has attempted to cluster all spectra, including the unidentified spectra, of an entire repository. In the following sections, we will start by looking at the different aspects of mass spectrum clustering algorithms. In particular, we will look at approaches to measure similarities between mass spectra using a distance measure, how to subsequently group similar spectra together using clustering algorithms, and how to evaluate the results of clustering.

5.1 Distance measures

Before we start, it is helpful to clarify the small difference between two frequently used terms. Similarity measures assign high scores to pairs of similar spectra, whereas distance measures assign low scores. This corresponds to our intuition that highly similar pairs receive high similarity scores but are close to each other when talking about distance. We will mostly use the term distance measure as it corresponds better to a visualization of clusters, where similar spectra are close to each other and dissimilar spectra are far away from each other. Distance measures are not only used for clustering, but are also an integral part of spectral library search engines. Some measures were specifically designed for a certain type of experimental setting and are therefore not guaranteed to work when these preconditions change. Here, we focus on the most generally applicable distance measures, which should perform reasonably well independent of the experimental settings. To determine how similar two spectra are, it is beneficial to take into account the variation in the peak patterns of spectra from the same peptide species. Typically, the absolute peak intensities can vary over several orders of magnitude and therefore some type of normalization is often a key step when using peak intensity information. However, even after normalization, the relative peak intensity of a specific fragment peak will usually still show significant variance. Furthermore, electronic and chemical noise as well as poorly calibrated instruments can cause systematic and random noise, which needs to be dealt with appropriately. One of the most straightforward similarity measures is obtained by first binning the peaks, e.g., in 1 Th bins, and then applying the cosine similarity, of which adapted versions have been employed by [8, 24, 38]:

C(I^Q, I^T) = \frac{\sum_{i=0}^{m} I_i^Q \, I_i^T}{\sqrt{\sum_{i=0}^{m} \left(I_i^Q\right)^2} \, \sqrt{\sum_{i=0}^{m} \left(I_i^T\right)^2}} \qquad (5.1)

where I^Q and I^T are vectors of the summed or maximum peak intensity per bin. This normalized dot product takes values in the interval [0, 1], where pairs of very similar spectra have a score close to 1. Its main advantage is that it is fast to calculate, but it has the potential problem that a few high intensity peaks will dominate the scores. In the most extreme case, two spectra with just a single peak in the same bin, e.g., at the precursor ion mass-to-charge ratio, could get a score of 1 even though we do not have much reason to believe they came from the same peptide species. To prevent a few peaks from overpowering all lower intensity peaks, we could instead ignore all peak intensities and use a 1 if the peak intensity is above a certain threshold and a 0 if it is below this threshold:

D(B^Q, B^T) = \sum_{i=0}^{m} B_i^Q \, B_i^T \qquad (5.2)

where B^Q and B^T are boolean vectors with 1s and 0s as specified above; a normalization factor can be added similarly to the cosine distance above. While this approach solves the problem of overpowering by a few peaks, it introduces some new problems. First, an appropriate threshold needs to be chosen to set the 1s and 0s, or alternatively a fixed number k of 1s can be set, corresponding to the most intense bins. Second, peaks in common as a result of low intensity peaks from background noise cannot be distinguished from high intensity peaks in common from actual fragments.

Quite similar to the cosine similarity is the Euclidean distance, as used in [66]:

E(I^Q, I^T) = \sqrt{\sum_{i=0}^{m} \left(I_i^Q - I_i^T\right)^2} \qquad (5.3)

This distance measure puts less emphasis on high intensity peaks when they are shared, while increasing its emphasis when a high intensity peak is absent in one of the spectra. On the other hand, it is very sensitive to the natural variance of fragment peak intensities, and it is not straightforward how to normalize the peak intensities, nor the scores themselves.

The previous distance metrics do not specifically take advantage of the special characteristics of mass spectra, i.e., that they come from sequences of amino acids. This property can, for example, be exploited by looking at the mass-to-charge ratio differences of consecutive peaks [68]. However, this relies on the ability to filter out noise peaks between actual fragment peaks, which is not a simple task. Some other distance measures that have been proposed are the Hertz similarity index [36], the accurate mass spectrum distance [35], mass differences on aligned peak lists [1] and a sigmoid-based fragment scoring function [22].
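As a concrete illustration of these three measures, the sketch below computes the binned cosine similarity (5.1), the shared-peak count (5.2) and the Euclidean distance (5.3) for two toy peak lists. The 1 Th bin width, the maximum m/z, and the peak values are hypothetical choices made for illustration only.

```python
import numpy as np

def bin_spectrum(mz, intensity, max_mz=2000.0, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (here 1 Th)."""
    binned = np.zeros(int(max_mz / bin_width) + 1)
    for m, i in zip(mz, intensity):
        binned[int(m / bin_width)] += i
    return binned

def cosine_similarity(iq, it):
    """Normalized dot product (Eq. 5.1); 1.0 means identical bin patterns."""
    norm = np.linalg.norm(iq) * np.linalg.norm(it)
    return float(np.dot(iq, it) / norm) if norm > 0 else 0.0

def shared_peak_count(iq, it, threshold=0.0):
    """Dot product of boolean vectors (Eq. 5.2): bins occupied in both spectra."""
    return int(np.sum((iq > threshold) & (it > threshold)))

def euclidean_distance(iq, it):
    """Euclidean distance between binned intensity vectors (Eq. 5.3)."""
    return float(np.linalg.norm(iq - it))

# Toy example with two small peak lists (hypothetical m/z and intensity values).
spec_q = bin_spectrum([175.1, 347.2, 563.3], [120.0, 850.0, 430.0])
spec_t = bin_spectrum([175.0, 347.4, 699.5], [200.0, 900.0, 150.0])
print(cosine_similarity(spec_q, spec_t),
      shared_peak_count(spec_q, spec_t),
      euclidean_distance(spec_q, spec_t))
```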

5.2 Clustering algorithms

After calculating the pairwise distances, we are left with a large affinity matrix. This matrix is normally quite sparse, as we typically only calculate the distance between spectra with similar precursor mass or mass-to-charge ratio. One of the issues that makes mass spectrum clustering different from many other clustering problems in machine learning is that instead of a few clusters with many observations per cluster, we expect many clusters with relatively few spectra in each cluster. Furthermore, unless we analyze the data with a search engine first, we do not have any idea how many clusters to expect. This makes the application of popular machine learning clustering algorithms such as k-means or Gaussian mixture models rather difficult.

The class of clustering algorithms that comes quite naturally to our problem is that of hierarchical clustering algorithms. Initially, each observation forms its own cluster, a singleton cluster. We start by linking the two most similar clusters and calculate a new distance between this newly created cluster and all other clusters. This continues until only one cluster remains. In this way, these algorithms construct a hierarchical tree of the observations (Figure 5.1), linking the most similar observations at the bottom and less similar observations higher up in the tree. Constructing the clusters then simply comes down to setting a threshold on the tree and ignoring all links above this threshold.

Figure 5.1: Hierarchical clustering creates a tree of the observations; clusters can be constructed by choosing a cutoff threshold. Here, we have 8 spectra, marked s1, ..., s8, which all start off as clusters of size 1. The distances at which two clusters were linked are given in red. Clusters with a small distance are linked at the bottom, e.g., s1 and s4, whereas clusters with a large distance are linked at the top, e.g., s7 and s8. Setting the cutoff threshold at a large distance creates few clusters (top line, 4 clusters), which might result in impure clusters. Setting the threshold at a small distance results in many clusters (bottom line, 7 clusters), which reduces the summarizing power. Setting the threshold at an intermediate distance (middle line, 5 clusters) trades off these two issues and might give the optimal result.

There are three main ways to construct the linkages, which differ in the calculation of the distance between a newly formed cluster and all other clusters (Figure 5.2). In single-linkage clustering, the new distance is the minimum of the distances over all pairs with one observation in the newly formed cluster and one in the cluster we want to calculate the new distance to. This has, for example, been used by [22]. In average-linkage clustering, the new distance is the average distance over these pairs, as employed by [66]. In complete-linkage clustering, the new distance is the maximum distance over these pairs.

D_{U,V} = \min_{i,j} D(U_i, V_j) \quad (A) \qquad D_{U,V} = \frac{1}{|U||V|} \sum_{i} \sum_{j} D(U_i, V_j) \quad (B) \qquad D_{U,V} = \max_{i,j} D(U_i, V_j) \quad (C)

Figure 5.2: Calculation of the distance between two clusters, U and V, for single-linkage (A), average-linkage (B) and complete-linkage (C) clustering.

While single-linkage clustering is the easiest version to implement, its greedy nature tends to produce impure clusters. Average- and complete-linkage clustering improve on this by requiring more consensus, but are harder to implement for large matrices, as they require the storage of all distances between clusters. For average-linkage clustering a memory-constrained algorithm is available that resolves this issue [53], and a similar scheme can be constructed for complete-linkage clustering. In the spirit of average-linkage clustering, one could also recalculate the distances by first creating a consensus spectrum of the newly formed cluster and then applying the distance measure between this new consensus spectrum and the consensus spectra of each of the other clusters [24].

Instead of hierarchical clustering methods, one could employ graph-based algorithms. After first eliminating all links over a certain distance threshold, we could look for so-called cliques, i.e., groups of spectra for which each pair has a link in the graph [76].
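To make the thresholded hierarchical clustering concrete, here is a minimal sketch that builds a complete-linkage tree over a toy distance matrix with SciPy and cuts it at a fixed distance. The distance values and the cutoff of 0.5 are hypothetical stand-ins for the pairwise spectrum distances discussed above, not settings used by any of the cited tools.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix for 5 spectra (hypothetical values);
# in practice these would come from one of the distance measures above.
D = np.array([
    [0.00, 0.10, 0.90, 0.80, 0.85],
    [0.10, 0.00, 0.95, 0.90, 0.80],
    [0.90, 0.95, 0.00, 0.20, 0.30],
    [0.80, 0.90, 0.20, 0.00, 0.25],
    [0.85, 0.80, 0.30, 0.25, 0.00],
])

# SciPy expects the condensed (upper-triangular) form of the distance matrix.
condensed = squareform(D)

# 'complete' linkage requires all pairs between two clusters to be close before
# merging; 'single' and 'average' are the other rules shown in Figure 5.2.
tree = linkage(condensed, method="complete")

# Cutting the tree at a distance threshold yields the flat clustering.
labels = fcluster(tree, t=0.5, criterion="distance")
print(labels)  # e.g. [1 1 2 2 2]: spectra 0-1 and 2-4 end up in two clusters
```

Switching `method` to `"single"` or `"average"` reproduces the other linkage rules, which makes the trade-off between purity and cluster size easy to explore on small examples.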

5.3 Evaluating clustering performance

The goal of mass spectrum clustering is to cluster spectra belonging to the same peptide species together, while preventing the formation of clusters with more than one peptide species. To evaluate the performance of a mass spectrum clustering algorithm, we therefore need the peptide inferences as the true labels. With these labels, we can apply different cluster evaluation metrics. We shall call the list of clusters resulting from a mass spectrum clustering algorithm a clustering.

However, the peptide inferences of the spectra themselves contain a certain level of uncertainty. One solution could be to only cluster the spectra with a PSM inference with a q value below 1% or even 0.1%. This could, however, give an inaccurate image of the actual performance of the mass spectrum clustering algorithm on the entire set, as the confident PSMs are typically from high-quality spectra, which might be more easily distinguished from each other than lower-quality spectra. Alternatively, we could first cluster all spectra and then eliminate all spectra with only PSMs above a certain q value. In this case, the clusters are no longer accurately represented by their remaining spectra, and care has to be taken to correctly interpret any statistics we calculate from them.

Several clustering metrics have been proposed, such as the Rand index, the adjusted Rand index, the B-Cubed algorithm, purity and normalized mutual information. Most of these metrics give us a score, typically between 0 and 1, and allow us to rank different clusterings by this score. According to [2], the B-Cubed algorithm produces a ranking of clusterings that most closely follows our intuition. On the other hand, the value of the score is hard to interpret, as is the case for most of the other metrics, and more simplistic metrics might give more information, such as the fraction of spectra belonging to the most common peptide species in its cluster.

Alternatively, we can use the number of peptide discoveries from searching the consensus spectra with a database search engine. It is not hard to imagine that if the clusters are pure, the consensus spectra should be reasonably good models for the peptide species. Unfortunately, this is certainly not always the case, as a cluster with one high-quality spectrum, that would normally be confidently inferred, together with many lower-quality spectra might result in a consensus spectrum falling below the discovery threshold. Nevertheless, the number of peptide discoveries for a clustering can be considered an informative metric regarding its purity.
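As an example of such a simplistic metric, the sketch below computes, for each cluster, the fraction of identified spectra assigned to the most common peptide species in that cluster. The cluster-to-peptide mapping and the peptide strings are hypothetical; only identified spectra are assumed to be included.

```python
from collections import Counter

def per_cluster_purity(cluster_to_peptides):
    """For each cluster, the fraction of its identified spectra assigned to the
    most common peptide species in that cluster (unidentified spectra are
    assumed to have been removed beforehand)."""
    purities = {}
    for cluster_id, peptides in cluster_to_peptides.items():
        counts = Counter(peptides)
        purities[cluster_id] = counts.most_common(1)[0][1] / len(peptides)
    return purities

# Hypothetical clustering of identified spectra: cluster 1 is pure,
# cluster 2 mixes two peptide species.
clusters = {
    1: ["PEPTIDEK", "PEPTIDEK", "PEPTIDEK"],
    2: ["ELVISLIVESK", "ELVISLIVESK", "ANOTHERPEPK"],
}
print(per_cluster_purity(clusters))  # {1: 1.0, 2: 0.666...}
```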

Chapter 6

Present investigations

The papers included in this thesis center around the peptide and protein inference pipeline. In particular, all papers try to address different issues arising from the increase in available data. Paper I introduces a new mass spectrum clustering algorithm, which uses a probabilistic framework to calculate a distance measure, and applies this to a data set containing millions of spectra. Paper II focuses on the discussion of protein-level false discovery rates, which have been criticized for not behaving properly on large datasets. The paper proposes a framework in which they can be discussed in a meaningful way, to reopen the discussion on protein-level false discovery rates on large datasets. Paper III showcases two new features of the Percolator package which make it ready for application on aggregated datasets comprising multiple runs, such as the recent draft human proteome studies.

6.1 Paper I: MaRaCluster: A fragment rarity metric for clustering fragment spectra in shotgun proteomics

This study proposes a novel mass spectrum distance measure, the fragment rarity metric, which uses the frequency of the observed peaks in the MS2 spectra to calculate a p value for a pair of spectra. It makes use of the intuition that fragment peaks shared by many MS2 spectra give less information about whether the two spectra stem from the same peptide species than peaks shared by few MS2 spectra. In contrast to existing distance measures, the score of the fragment rarity metric can be interpreted by itself and is robust to both systematic and random noise. This distance metric is then used as input to complete-linkage clustering, from which clusters are derived. The complete-linkage clustering mitigates the problem of chimeric spectra connecting clusters of different peptide species. Using a dataset containing 7.8 million spectra, we demonstrate that our algorithm, MaRaCluster, outperforms alternatives that use a cosine distance or single-linkage clustering, as well as the state-of-the-art method MS-Cluster.
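The toy sketch below only illustrates the rarity intuition, not the actual MaRaCluster scoring: shared fragment bins are weighted by how rarely they occur across the whole dataset, so a match on a commonly observed fragment contributes little evidence. The bin frequencies, the -log weighting and all names are illustrative assumptions.

```python
import math
from collections import Counter

def rarity_score(bins_a, bins_b, bin_frequency, n_spectra):
    """Toy rarity-weighted match score (illustration only, not MaRaCluster):
    each shared fragment bin contributes -log of its relative frequency,
    so rare shared fragments dominate the score."""
    score = 0.0
    for b in bins_a & bins_b:
        freq = bin_frequency.get(b, 1) / n_spectra
        score += -math.log(freq)
    return score

# Hypothetical dataset-wide fragment bin frequencies over 1,000 spectra.
bin_frequency = Counter({175: 900, 347: 40, 563: 5})
spectrum_1 = {175, 347, 563}
spectrum_2 = {175, 347, 812}
print(rarity_score(spectrum_1, spectrum_2, bin_frequency, n_spectra=1000))
```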


6.2 Paper II: How to talk about protein-level false discovery rates in shotgun proteomics

In this work, we discuss the controversy regarding protein-level false discovery rates that has resurfaced in the wake of the draft human proteome studies. In particular, we propose a framework to have meaningful discussions about them, which mainly consists of separating their definition from their application. The definition is determined by the choice of null hypothesis, of which two significantly different versions are currently employed by protein inference methods. We show that interpretation of the results using the wrong null hypothesis can lead to a considerable overestimation of the confidence in the results. Furthermore, we recommend the employment of simulations of the digestion and matching step to calibrate and verify the protein-level FDR estimates reported by protein inference methods. We demonstrate this approach by verifying the FDR estimates of a very simple protein inference method, using decoy models for both of the null hypotheses.
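As a minimal, generic sketch of decoy-based FDR estimation, and not the specific calibration procedure of Paper II, the code below estimates protein-level FDRs by counting decoy proteins above each score threshold; the scores and decoy labels are hypothetical.

```python
def decoy_fdr_curve(proteins):
    """Generic target-decoy FDR sketch (not the specific method of Paper II):
    proteins is a list of (score, is_decoy) tuples; for each score threshold
    the FDR is estimated as #decoys / #targets above that threshold."""
    proteins = sorted(proteins, key=lambda p: p[0], reverse=True)
    n_targets, n_decoys, curve = 0, 0, []
    for score, is_decoy in proteins:
        if is_decoy:
            n_decoys += 1
        else:
            n_targets += 1
        if n_targets > 0:
            curve.append((score, n_decoys / n_targets))
    return curve

# Hypothetical scored protein list; True marks decoy proteins.
scored = [(9.1, False), (8.7, False), (7.9, True), (7.5, False), (6.2, True)]
for score, fdr in decoy_fdr_curve(scored):
    print(f"score >= {score}: estimated FDR {fdr:.2f}")
```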

6.3 Paper III: Fast and accurate protein false discovery rates on large-scale proteomics data sets with Percolator 3.0

This paper introduces two new features of the Percolator package that are aimed towards large-scale studies. The first feature accelerates the training algorithm by training the SVMs on only a small random subset of the available PSMs. Even going down to just 0.14% of the original 73 million PSMs does not significantly reduce the number of discovered PSMs and peptides at 1% FDR. For the second feature, we compared several scalable protein inference methods based on accuracy and number of discoveries. The protein inference methods we benchmarked were the best scoring peptide approach, the two-peptide rule, multiplication of peptide-level PEPs and Fisher's method for combining p values. Using a small yeast data set with an entrapment database, we show that it is absolutely necessary to remove shared peptides to obtain working decoy models. Furthermore, we demonstrate that the best scoring peptide approach gives the most significant proteins at 1% FDR on the data of one of the draft human proteomes. This approach was subsequently implemented in the Percolator package, which, together with the random subset training, allowed the analysis to go from 73 million PSMs to protein inferences in just 10 minutes.
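To make two of the compared inference strategies concrete, here is a small sketch of Fisher's method and the best-peptide approach applied to a hypothetical set of peptide-level p values for one protein; the actual Percolator implementation differs in many details.

```python
import numpy as np
from scipy.stats import chi2

def fisher_protein_p(peptide_p_values):
    """Fisher's method: combine the peptide-level p values of one protein
    into a single protein-level p value via a chi-squared statistic."""
    p = np.asarray(peptide_p_values, dtype=float)
    statistic = -2.0 * np.log(p).sum()
    return chi2.sf(statistic, df=2 * len(p))

def best_peptide_protein_p(peptide_p_values):
    """Best-peptide approach: the protein is scored by its single best
    (smallest) peptide-level p value."""
    return min(peptide_p_values)

# Hypothetical peptide-level p values for one protein.
peptide_p = [0.001, 0.04, 0.3]
print(fisher_protein_p(peptide_p), best_peptide_protein_p(peptide_p))
```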

Chapter 7

Future perspectives

The studies in this thesis were focused on summarizing data and isolating objects of interest from large-scale studies. As shotgun proteomics studies become increasingly easy and cheap to perform, different parts of the analysis pipeline will reach their limits and will require new solutions.

I believe that the mass spectrum clustering approach is only just starting to gain traction and that we can expect many new applications for this type of analysis, both in proteomics and in related fields such as metabolomics. One of the key challenges for MaRaCluster is to get it ready for repository-sized datasets, which still requires some adaptations and optimizations. We could, for example, try out several initial filtering steps, e.g., based on retention time or high intensity peaks, to reduce the number of pairwise comparisons that need to be performed. Other challenges are the construction of good consensus spectra and the development of new spectral library searching algorithms. One of the main expectations we initially had was that we would get more peptide discoveries by clustering first and then analyzing the consensus spectra. Using conventional database searching techniques, we have not achieved this, but there is a lot of information available in the clusters that we have not even begun to exploit.

I think that reporting protein-level false discovery rates will become more common in future studies, and the addition of this feature to Percolator will help in this quest. The employment of simulations of digestion and matching experiments that more closely reflect the experimental results will allow us to evaluate existing protein inference methods and point to possible weak points that need to be fixed. The results from the protein inference benchmark were still rather unsatisfying, as the best performing method was the one that threw out the most evidence. Now that we have a better understanding of why current attempts have failed to improve on this, I think it is likely that we can develop new protein inference methods that take into account more evidence than just the best peptide or PSM for a protein.


Bibliography

[1] Rikard Alm, Peter Johansson, Karin Hjernø, Cecilia Emanuelsson, Markus Ringner, and Jari Häkkinen. Detection and identification of protein isoforms using cluster analysis of MALDI-MS mass spectra. Journal of Proteome Research, 5(4):785–792, 2006.

[2] Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461–486, 2009.

[3] Rolf Apweiler, Amos Bairoch, Cathy H Wu, Winona C Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, et al. UniProt: The universal protein knowledgebase. Nucleic Acids Research, 32(suppl 1):D115–D119, 2004.

[4] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. Gene Ontology: Tool for the unification of biology. Nature Genetics, 25(1):25–29, 2000.

[5] AzaToth. Myoglobin 3D structure. https://upload.wikimedia.org/wikipedia/commons/6/60/Myoglobin.png, 2012. Accessed: 2016-03-29.

[6] Marcus Bantscheff, Markus Schirle, Gavain Sweetman, Jens Rick, and Bernhard Kuster. Quantitative mass spectrometry in proteomics: A critical review. Analytical and Bioanalytical Chemistry, 389(4):1017–1031, 2007.

[7] Alex Bateman, Lachlan Coin, Richard Durbin, Robert D Finn, Volker Hollich, Sam Griffiths-Jones, Ajay Khanna, Mhairi Marshall, Simon Moxon, Erik LL Sonnhammer, et al. The Pfam protein families database. Nucleic Acids Research, 32(suppl 1):D138–D141, 2004.

[8] Ilan Beer, Eilon Barnea, Tamar Ziv, and Arie Admon. Improving large-scale proteomics by clustering of mass spectrometry data. Proteomics, 4(4):950–960, 2004.


[9] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.

[10] Pavel V Bondarenko, Dirk Chelius, and Thomas A Shaler. Identification and relative quantitation of protein mixtures by enzymatic digestion followed by capillary reversed-phase liquid chromatography-tandem mass spectrometry. Analytical Chemistry, 74(18):4741–4749, 2002.

[11] Adam D Catherman, Owen S Skinner, and Neil L Kelleher. Top down proteomics: Facts and perspectives. Biochemical and Biophysical Research Communications, 445(4):683–693, 2014.

[12] David N Perkins, Darryl JC Pappin, David M Creasy, and John S Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551–3567, 1999.

[13] Jürgen Cox, Marco Y Hein, Christian A Luber, Igor Paron, Nagarjuna Nagaraj, and Matthias Mann. Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Molecular & Cellular Proteomics, 13(9):2513–2526, 2014.

[14] Robertson Craig and Ronald C Beavis. TANDEM: Matching proteins with tandem mass spectra. Bioinformatics, 20(9):1466–1467, 2004.

[15] Robertson Craig, JC Cortens, David Fenyo, and Ronald C Beavis. Using annotated peptide mass spectrum libraries for protein identification. Journal of Proteome Research, 5(8):1843–1849, 2006.

[16] Frank Desiere, Eric W Deutsch, Nichole L King, Alexey I Nesvizhskii, Parag Mallick, Jimmy Eng, Sharon Chen, James Eddes, Sandra N Loevenich, and Ruedi Aebersold. The PeptideAtlas project. Nucleic Acids Research, 34(suppl 1):D655–D658, 2006.

[17] Eric W Deutsch, Zhi Sun, David Campbell, Ulrike Kusebauch, Caroline S Chu, Luis Mendoza, David Shteynberg, Gilbert S Omenn, and Robert L Moritz. State of the human proteome in 2014/2015 as viewed through PeptideAtlas: Enhancing accuracy and coverage through the AtlasProphet. Journal of Proteome Research, 14(9):3461–3473, 2015.

[18] Joshua E Elias and Steven P Gygi. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods, 4(3):207–214, 2007.

[19] Jimmy K Eng, Ashley L McCormack, and John R Yates. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of The American Society for Mass Spectrometry, 5(11):976–989, 1994.

[20] Bernd Fischer, Volker Roth, Franz Roos, Jonas Grossmann, Sacha Baginsky, Peter Widmayer, Wilhelm Gruissem, and Joachim M Buhmann. NovoHMM: A hidden Markov model for de novo peptide sequencing. Analytical Chemistry, 77(22):7265–7273, 2005.

[21] Ronald Aylmer Fisher. Statistical methods for research workers. Genesis Publishing Pvt Ltd, 1925.

[22] Kristian Flikka, Jeroen Meukens, Kenny Helsens, Joël Vandekerckhove, Ingvar Eidhammer, Kris Gevaert, and Lennart Martens. Implementation and application of a versatile clustering tool for tandem mass spectrometry data. Proteomics, 7(18):3245–3258, 2007.

[23] Ari Frank and Pavel Pevzner. PepNovo: De novo peptide sequencing via probabilistic network modeling. Analytical Chemistry, 77(4):964–973, 2005.

[24] Ari M Frank, Matthew E Monroe, Anuj R Shah, Jeremy J Carver, Nuno Bandeira, Ronald J Moore, Gordon A Anderson, Richard D Smith, and Pavel A Pevzner. Spectral archives: Extending spectral libraries to analyze both identified and unidentified spectra. Nature Methods, 8(7):587–591, 2011.

[25] Barbara E Frewen, Gennifer E Merrihew, Christine C Wu, William Stafford Noble, and Michael J MacCoss. Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries. Analytical Chemistry, 78(16):5678–5684, 2006.

[26] Devon Fyson. Schematic of a typical mass spectrometer. https://commons.wikimedia.org/wiki/File:Mass_Spectrometer_Schematic.svg, 2008. Accessed: 2016-03-08.

[27] Ludovic C Gillet, Pedro Navarro, Stephen Tate, Hannes Röst, Nathalie Selevsek, Lukas Reiter, Ron Bonner, and Ruedi Aebersold. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: A new concept for consistent and accurate proteome analysis. Molecular & Cellular Proteomics, 11(6):O111–016717, 2012.

[28] Alexander V Gorshkov, Irina A Tarasova, Victor V Evreinov, Mikhail M Savitski, Michael L Nielsen, Roman A Zubarev, and Mikhail V Gorshkov. Liquid chromatography at critical conditions: Comprehensive approach to sequence-dependent retention time prediction. Analytical Chemistry, 78(22):7770–7777, 2006.

[29] Viktor Granholm and Lukas Käll. Quality assessments of peptide–spectrum matches in shotgun proteomics. Proteomics, 11(6):1086–1093, 2011.

[30] Viktor Granholm, William S Noble, and Lukas Käll. A cross-validation scheme for machine learning algorithms in shotgun proteomics. BMC Bioinformatics, 13(Suppl 16):S3, 2012.

[31] Viktor Granholm, William Stafford Noble, and Lukas Käll. On using samples of known protein content to assess the statistical calibration of scores assigned to peptide-spectrum matches in shotgun proteomics. Journal of Proteome Research, 10(5):2671–2678, 2011.

[32] Dov Greenbaum, Christopher Colangelo, Kenneth Williams, and Mark Gerstein. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biology, 4(9):117, 2003.

[33] Johannes Griss, Joseph M Foster, Henning Hermjakob, and Juan Antonio Vizcaíno. PRIDE Cluster: Building a consensus of proteomics data. Nature Methods, 10(2):95–96, 2013.

[34] Steven P Gygi, Yvan Rochon, B Robert Franza, and Ruedi Aebersold. Corre- lation between protein and mRNA abundance in yeast. Molecular and Cellular Biology, 19(3):1720–1730, 1999.

[35] Michael Edberg Hansen and Jørn Smedsgaard. A new matching algorithm for high resolution mass spectra. Journal of the American Society for Mass Spectrometry, 15(8):1173–1180, 2004.

[36] Harry S Hertz, Ronald A Hites, and Klaus Biemann. Identification of mass spectra by computer-searching a file of known spectra. Analytical Chemistry, 43(6):681–691, 1971.

[37] Michael R Hoopmann, Gregory L Finney, and Michael J MacCoss. High-speed data reduction, feature detection, and MS/MS spectrum quality assessment of shotgun proteomics data sets using high-resolution mass spectrometry. Analytical Chemistry, 79(15):5620–5632, 2007.

[38] Oliver Horlacher, Frédérique Lisacek, and Markus Müller. Mining large scale MS/MS data for protein modifications using spectral libraries. Journal of Proteome Research, 2015.

[39] Edward J Hsieh, Michael R Hoopmann, Brendan MacLean, and Michael J MacCoss. Comparison of database search strategies for high precursor mass accuracy MS/MS data. Journal of Proteome Research, 9(2):1138–1143, 2009.

[40] Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble, and Michael J MacCoss. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods, 4(11):923–925, 2007.

[41] Lukas Käll, John D Storey, Michael J MacCoss, and William Stafford Noble. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. Journal of Proteome Research, 7(01):29–34, 2007.

[42] Uri Keich, Attila Kertesz-Farkas, and William Stafford Noble. Improved false discovery rate estimation procedure for shotgun proteomics. Journal of Proteome Research, 14(8):3148–3161, 2015.

[43] Uri Keich and William Stafford Noble. On the importance of well-calibrated scores for identifying shotgun proteomics spectra. Journal of Proteome Research, 14(2):1147–1160, 2014.

[44] Neil L Kelleher. Peer reviewed: Top-down proteomics. Analytical Chemistry, 76(11):196–A, 2004.

[45] Min-Sik Kim, Sneha M Pinto, Derese Getnet, Raja Sekhar Nirujogi, Srikanth S Manda, Raghothama Chaerkady, Anil K Madugundu, Dhanashree S Kelkar, Ruth Isserlin, Shobhit Jain, et al. A draft map of the human proteome. Nature, 509(7502):575–581, 2014.

[46] Sangtae Kim, Nitin Gupta, and Pavel A Pevzner. Spectral probabilities and generating functions of tandem mass spectra: A strike against decoy databases. Journal of Proteome Research, 7(8):3354–3363, 2008.

[47] John Klimek, James S Eddes, Laura Hohmann, Jennifer Jackson, Amelia Peterson, Simon Letarte, Philip R Gafken, Jonathan E Katz, Parag Mallick, Hookeun Lee, et al. The standard protein mix database: A diverse data set to assist in the production of improved peptide and protein identification software tools. Journal of Proteome Research, 7(01):96–103, 2007.

[48] Oleg V Krokhin, R Craig, V Spicer, W Ens, KG Standing, RC Beavis, and JA Wilkins. An improved model for prediction of retention times of tryptic peptides in ion pair reversed-phase HPLC: Its application to protein peptide mapping by off-line HPLC-MALDI MS. Molecular & Cellular Proteomics, 3(9):908–919, 2004.

[49] Henry Lam and Ruedi Aebersold. Spectral library searching for peptide identification via tandem MS. In Proteome Bioinformatics, pages 95–103. Springer, 2010.

[50] Henry Lam, Eric W Deutsch, James S Eddes, Jimmy K Eng, Stephen E Stein, and Ruedi Aebersold. Building consensus spectral libraries for peptide identification in proteomics. Nature Methods, 5(10):873–875, 2008.

[51] Yong Fuga Li, Randy J Arnold, Yixue Li, Predrag Radivojac, Quanhu Sheng, and Haixu Tang. A Bayesian approach to protein inference problem in shotgun proteomics. Journal of Computational Biology, 16(8):1183–1193, 2009.

[52] Hongbin Liu, Rovshan G Sadygov, and John R Yates. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Analytical Chemistry, 76(14):4193–4201, 2004.

[53] Yaniv Loewenstein, Elon Portugaly, Menachem Fromer, and Michal Linial. Efficient algorithms for accurate hierarchical clustering of huge datasets: Tackling the entire protein space. Bioinformatics, 24(13):i41–i49, 2008.

[54] Roland Luethy, Darren E Kessner, Jonathan E Katz, Brendan MacLean, Robert Grothe, Kian Kani, Vitor Faca, Sharon Pitteri, Samir Hanash, David B Agus, and Parag Mallick. Precursor-ion mass re-estimation improves peptide identification on hybrid instruments. Journal of Proteome Research, 7(9):4031–4039, 2008.

[55] Bin Ma, Kaizhong Zhang, Christopher Hendrie, Chengzhi Liang, Ming Li, Amanda Doherty-Kirby, and Gilles Lajoie. PEAKS: Powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Communications in Mass Spectrometry, 17(20):2337–2342, 2003.

[56] Tobias Maier, Marc Güell, and Luis Serrano. Correlation of mRNA and protein in complex biological samples. FEBS Letters, 583(24):3966–3973, 2009.

[57] Lennart Martens, Henning Hermjakob, Philip Jones, Marcin Adamski, Chris Taylor, David States, Kris Gevaert, Joël Vandekerckhove, and Rolf Apweiler. PRIDE: The proteomics identifications database. Proteomics, 5(13):3537–3545, 2005.

[58] Annette Michalski, Juergen Cox, and Matthias Mann. More than 100,000 detectable peptide species elute in single shotgun proteomics runs but the majority is inaccessible to data-dependent LC-MS/MS. Journal of Proteome Research, 10(4):1785–1793, 2011.

[59] Luminita Moruz and Lukas Käll. Peptide retention time prediction. Mass Spectrometry Reviews, 2016.

[60] Luminita Moruz, Daniela Tomazela, and Lukas Käll. Training, selection, and robust calibration of retention time models for targeted proteomics. Journal of Proteome Research, 9(10):5209–5216, 2010.

[61] Alexey I Nesvizhskii and Ruedi Aebersold. Interpretation of shotgun proteomic data: The protein inference problem. Molecular & Cellular Proteomics, 4(10):1419–1440, 2005.

[62] Alexey I Nesvizhskii, Andrew Keller, Eugene Kolker, and Ruedi Aebersold. A statistical model for identifying proteins by tandem mass spectrometry. Analytical Chemistry, 75(17):4646–4658, 2003.

[63] Alexey I Nesvizhskii, Olga Vitek, and Ruedi Aebersold. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nature Methods, 4(10):787–797, 2007.

[64] Shao-En Ong, Blagoy Blagoev, Irina Kratchmarova, Dan Bach Kristensen, Hanno Steen, Akhilesh Pandey, and Matthias Mann. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Molecular & Cellular Proteomics, 1(5):376–386, 2002.

[65] Nico Pfeifer, Andreas Leinenbach, Christian G Huber, and Oliver Kohlbacher. Statistical learning of peptide retention behavior in chromatographic separations: A new kernel-based approach for computational proteomics. BMC Bioinformatics, 8(1):468, 2007.

[66] David W Powell, Connie M Weaver, Jennifer L Jennings, K Jill McAfee, Yue He, P Anthony Weil, and Andrew J Link. Cluster analysis of mass spectrometry data reveals a novel component of SAGA. Molecular and Cellular Biology, 24(16):7249–7259, 2004.

[67] Lukas Reiter, Manfred Claassen, Sabine P Schrimpf, Marko Jovanovic, Alexander Schmidt, Joachim M Buhmann, Michael O Hengartner, and Ruedi Aebersold. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Molecular & Cellular Proteomics, 8(11):2405–2417, 2009.

[68] Fahad Saeed, Jason D Hoffert, and Mark A Knepper. CAMS-RS: Clustering algorithm for large-scale mass spectrometry data using restricted search space and intelligent random sampling. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 11(1):128–141, 2014.

[69] Mikhail M Savitski, Mathias Wilhelm, Hannes Hahne, Bernhard Kuster, and Marcus Bantscheff. A scalable approach for protein false discovery rate estimation in large proteomic data sets. Molecular & Cellular Proteomics, pages mcp–M114, 2015.

[70] Oliver Serang and Lukas Käll. Solution to statistical challenges in proteomics is more statistics, not less. Journal of Proteome Research, 2015.

[71] Oliver Serang, Michael J MacCoss, and William Stafford Noble. Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data. Journal of Proteome Research, 9(10):5346–5357, 2010.

[72] Oliver Serang and William Noble. A review of statistical methods for protein identification using tandem mass spectrometry. Statistics and its Interface, 5(1):3, 2012.

[73] Sigma-Aldrich. UPS1 & UPS2 proteomic standards. http://www.sigmaaldrich.com/life-science/proteomics/mass-spectrometry/ups1-and-ups2-proteomic.html. Accessed: 2016-03-15.

[74] Lloyd M Smith, Neil L Kelleher, et al. Proteoform: A single term describing protein complexity. Nature Methods, 10(3):186–187, 2013.

[75] John D Storey and Robert Tibshirani. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16):9440–9445, 2003.

[76] David L Tabb, Melissa R Thompson, Gurusahai Khalsa-Moyers, Nathan C VerBerkmoes, and W Hayes McDonald. MS2Grouper: Group assessment and synthetic replacement of duplicate proteomic tandem mass spectra. Journal of The American Society for Mass Spectrometry, 16(8):1250–1261, 2005.

[77] Andrew Thompson, Jürgen Schäfer, Karsten Kuhn, Stefan Kienle, Josef Schwarz, Günter Schmidt, Thomas Neumann, and Christian Hamon. Tandem mass tags: A novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical Chemistry, 75(8):1895–1904, 2003.

[78] John D Venable, Meng-Qiu Dong, James Wohlschlegel, Andrew Dillin, and John R Yates. Automated approach for quantitative analysis of complex peptide mixtures from tandem mass spectra. Nature Methods, 1(1):39–45, 2004.

[79] Juan Antonio Vizcaíno, Attila Csordas, Noemi del Toro, José A Dianes, Johannes Griss, Ilias Lavidas, Gerhard Mayer, Yasset Perez-Riverol, Florian Reisinger, Tobias Ternent, et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Research, 44(D1):D447–D456, 2016.

[80] Zack Weinberg. SVM separating hyperplanes. https://commons.wikimedia.org/wiki/File:Svm_separating_hyperplanes_(SVG).svg, 2012. Accessed: 2016-03-16.

[81] Sebastian Wiese, Kai A Reidegeld, Helmut E Meyer, and Bettina Warscheid. Protein labeling by iTRAQ: A new tool for quantitative mass spectrometry in proteome research. Proteomics, 7(3):340–350, 2007.

[82] Mathias Wilhelm, Judith Schlegl, Hannes Hahne, Amin Moghaddas Gholami, Marcus Lieberenz, Mikhail M Savitski, Emanuel Ziegler, Lars Butzmann, Siegfried Gessulat, Harald Marx, et al. Mass-spectrometry-based draft of the human proteome. Nature, 509(7502):582–587, 2014.

[83] Bing Zhang, Matthew C Chambers, and David L Tabb. Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. Journal of Proteome Research, 6(9):3549–3557, 2007.

[84] Bo Zhang, Lukas Käll, and Roman A Zubarev. DeMix-Q: Quantification-centered data processing workflow. Molecular & Cellular Proteomics, pages mcp–O115, 2016.