Statistical Methods for Differential Proteomics at Peptide and Protein Level

Ir. Ludger Goeminne
Student number: 00802186
Supervisors: Prof. Dr. Ir. Lieven Clement, Prof. Dr. Kris Gevaert

A dissertation submitted to Ghent University in partial fulfilment of the requirements for the degree of Doctor of Statistical Data Analysis

Academic year: 2018 - 2019

SUMMARY

Proteins are very diverse biomolecules that facilitate nearly all cellular processes of life. They interact with each other in complex networks in which disruption of a single protein can severely impact an organism. Therefore, quantitative information on a proteome (i.e. the entire set of proteins present in an organism) is extremely important to gain insight into the functioning of an organism in both healthy and diseased states.

Mass spectrometry (MS)-based proteomics is the method of choice for the high-throughput identification and quantification of thousands of proteins in a single analysis. When deep proteome coverage on many samples is needed, the analysis is often performed label-free, i.e. without any stable isotope labels. Here, proteins are extracted and digested into peptides that are subsequently loaded onto a reverse-phase high-performance liquid chromatography (HPLC) column coupled to a mass spectrometer. The peptides are separated on the column and ionized, and their MS spectra are recorded. The intensity peaks in these MS spectra are proxies for peptide abundance. Subsequently, some peptides are targeted for fragmentation and the resulting MS² spectra enable their identification.

As a result, label-free proteomics data are hierarchical: the data are at the peptide ion level, while inference typically happens at the protein level. Importantly, signal intensities are strongly peptide-dependent, as some peptides ionize more efficiently than others. Furthermore, missing values are very common and a large fraction of this missingness is not at random.
Indeed, intensities of low-abundant and poorly ionizing peptides are more likely to go missing, and competition for ionization also makes missingness context-dependent.

Many ad hoc data analysis workflows for differential protein quantification do not handle label-free proteomics data in a statistically rigorous way, which leads to suboptimal ranking of differentially abundant proteins. Consequently, many biologically relevant proteins go unnoticed and valuable resources are wasted by needlessly trying to validate false positive hits.

In chapter 8, we demonstrate the necessity of properly taking peptide-specific effects into account in differential protein quantification analyses. Peptide-based models, which naturally account for these effects, perform better than methods that summarize peptide intensities to the protein level prior to the statistical analysis. We further illustrate that controlling the false discovery rate becomes problematic when highly abundant proteins are differentially abundant, because they suppress the intensities of the background proteome. Finally, we show that missing values should be handled with care, as imputing them under wrong assumptions leads to worse results than not imputing missing values at all.

Most peptide-based models suffer from overfitting, unstable estimates of residual variances and a disproportionate impact of outlying peptide intensities. To address these issues, I developed the versatile R package MSqRob, which adds three modular improvements to existing peptide-based models: ridge regression stabilizes fold change estimates, empirical Bayes variance estimation stabilizes the variances of the test statistics, and M-estimation with Huber weights reduces the impact of outliers.
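To make the third improvement more concrete, the following is a minimal sketch of M-estimation with Huber weights, reduced to estimating a single robust mean of log2 peptide intensities by iteratively reweighted averaging. It is not MSqRob's implementation (MSqRob is an R package that applies the weighting inside a peptide-based regression model). The function names, the tuning constant 1.345, the MAD-based scale estimate and the example intensities are all assumptions made for illustration.

```python
from statistics import mean, median


def huber_weight(residual, scale, k=1.345):
    """Huber weight: 1 for small standardized residuals, k/|u| for large ones.

    k = 1.345 is a conventional tuning constant (an assumption here).
    """
    u = abs(residual) / scale
    return 1.0 if u <= k else k / u


def huber_location(values, k=1.345, max_iter=100, tol=1e-10):
    """Robust mean via iteratively reweighted averaging with Huber weights."""
    values = [float(v) for v in values]
    mu = median(values)
    # Robust scale: median absolute deviation, rescaled for normal data.
    scale = 1.4826 * median([abs(v - mu) for v in values])
    if scale == 0:
        return mu  # no spread around the median; nothing to reweight
    for _ in range(max_iter):
        weights = [huber_weight(v - mu, scale, k) for v in values]
        mu_new = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        converged = abs(mu_new - mu) < tol
        mu = mu_new
        if converged:
            break
    return mu


# Hypothetical log2 intensities of one peptide; the last value is an outlier.
log2_intensities = [20.1, 20.3, 19.9, 20.2, 26.0]
robust = huber_location(log2_intensities)
plain = mean(log2_intensities)
print(f"plain mean: {plain:.2f}, Huber estimate: {robust:.2f}")
```

In this toy example the outlying intensity pulls the ordinary mean upwards, while the Huber estimate stays close to the bulk of the observations because the outlier receives a weight far below 1.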
MSqRob’s algorithm is described in detail in section 9.1. It improves not only the precision and accuracy of the fold change estimates, but also the protein ranking, leading to a better discrimination between true and false positives. MSqRob is freely available on GitHub (https://github.com/statOmics/MSqRob) and has a user-friendly graphical interface built with Shiny, an R package developed by RStudio that integrates the R programming language with an HTML interface.

In section 9.2, I pinpoint important aspects of both experimental design and data analysis. Furthermore, I provide a step-by-step guide on how to use the MSqRob graphical user interface for both simple and more complex experimental designs. I also provide well-documented scripts to run analyses in bash mode, enabling the integration of MSqRob in automated pipelines on cluster environments.

In my latest, unpublished work (chapter 10), I focus on the missing value problem. Missingness in label-free proteomics is a mix of missingness completely at random and missingness not at random. However, the exact contributions of both mechanisms are unknown and dataset-specific, and imputing under the wrong assumptions is detrimental to the downstream protein quantifications. Therefore, I developed a hurdle model that combines the power of MSqRob with the complementary information that is available in peptide counts, without having to rely on undeterminable assumptions. This enables MSqRob to quantify proteins that are completely missing in one condition in a statistically rigorous manner. Moreover, it opens new possibilities to detect the sudden appearance of post-translationally modified peptides in addition to traditional protein fold change estimation.

With the development of MSqRob, I have made an important contribution to enabling experimenters to get the most out of their proteomics data.
And, even though MSqRob is already one of the most versatile differential proteomics quantification tools, there are ample opportunities to broaden MSqRob’s scope, both towards new types of (prote)omics data and towards more complicated experimental designs.

ABBREVIATIONS

AUC: area under the curve
CID: collision-induced dissociation
CPTAC: Clinical Proteomic Technology Assessment for Cancer Network
DA: differential abundance
DE: differential expression
DDA: data-dependent acquisition
DIA: data-independent acquisition
ESI: electrospray ionization
ETD: electron-transfer dissociation
FC: fold change
FDR: false discovery rate
FN: false negatives
FP: false positives
GFP: green fluorescent protein
HCD: higher-energy collisional dissociation
HILIC: hydrophilic interaction liquid chromatography
HPLC: high-performance liquid chromatography
IMAC: immobilized metal affinity chromatography
IQR: interquartile range
iTRAQ: isobaric tag for relative and absolute quantitation
LC: liquid chromatography
LFQ: label-free quantification
kNN: k-nearest neighbors
KO: knock-out
MALDI: matrix-assisted laser desorption/ionization
MCAR: missingness completely at random
MCMC: Markov chain Monte Carlo
MDS: multidimensional scaling
MNAR: missingness not at random
MS: mass spectrometry
MOAC: metal-oxide affinity chromatography
NETD: negative electron-transfer dissociation
OR: odds ratio
pAUC: partial area under the curve
PPV: positive predictive value
PSM: peptide-to-spectrum match
QRILC: quantile regression imputation of left censored data
ROC: receiver operating characteristic
rpAUC: relative partial area under the curve
RP-HPLC: reverse-phase high-performance liquid chromatography
RR: robust ridge
SAX: strong anion exchange
SCX: strong cation exchange
SILAC: stable isotope labeling by amino acids in cell culture
SWATH-MS: sequential windowed acquisition of all theoretical fragment ion mass spectra
TMT: tandem mass tags
TN: true negatives
TOF: time-of-flight
TP: true positives
UPS: Universal Proteomics Standard
UPS1: Universal Proteomics Standard 1
WT: wild type

SHORT TABLE OF CONTENTS

Foreword – woord vooraf
Summary
Samenvatting
Abbreviations
Short table of contents
Long table of contents

PART I: INTRODUCTION
1. Biological context
2. Technical context
3. From spectra to data
4. Differential protein abundance analysis
5. Research hypothesis
6. Outline
7. References part I

PART II: RESEARCH PAPERS
8. Summarization vs Peptide-Based Models in Label-Free Quantitative Proteomics: Performance, Pitfalls, and Data Analysis Guidelines
