Introduction to MSstats
Meena Choi, Olga Vitek
College of Computer and Information Science

WHY STATISTICS?
• Variation and uncertainty are unavoidable
  • Technical variation: sampling, handling, storage, processing
  • Instrumental variation: matrix effects, ion suppression
  • Signal processing: peak boundaries, identity, intensity
  • Biological variation: variation in protein abundance
• Overall goal: effective, reproducible research
  • Experimental design: unbiased and efficient experiments
  • Data analysis: objective conclusions in presence of uncertainty
  • Statistical software: re-analysis, peer review

OUTLINE
• Motivating example
  • ABRF iPRG study
• MSstats
  • Statistical relative quantification of proteins and peptides
  • Methods evaluation
• Extensions to MSstats
  • Assay characterization
  • System suitability and quality control

ABRF iPRG STUDY 2015
Detection of differentially abundant proteins in a controlled mixture

     Name                  Origin                Molecular weight   Sample 1   Sample 2   Sample 3   Sample 4
A    Ovalbumin             Chicken egg white     45 kDa             65         55         15         2
B    Myoglobin             Equine heart          17 kDa             55         15         2          65
C    Phosphorylase b       Rabbit muscle         97 kDa             15         2          65         55
D    Beta-Galactosidase    Escherichia coli      116 kDa            2          65         55         15
E    Bovine Serum Albumin  Bovine serum          66 kDa             11         0.6        10         500
F    Carbonic Anhydrase    Bovine erythrocytes   29 kDa             10         500        11         0.6

Spiked into a constant background: tryptic digests of S. cerevisiae
◆ Three technical replicates per sample
◆ Thermo nLC 1000 system
◆ 110-min linear gradient
◆ DDA profile mode in Orbitrap
◆ Data processing with Skyline
Choi et al., Journal of Proteome Research, 2017.
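Because the spiked-in amounts are known, the expected fold change of each spiked protein between any two samples follows directly from the table above. The Python sketch below is illustrative only and is not part of the slides or of MSstats: the dictionary `SPIKED` and the function `expected_log2_fc` are hypothetical names, and the amounts are copied from the table without units because the slide does not state them.

```python
import math

# Spiked-in amounts for samples 1-4, copied from the ABRF iPRG 2015 table.
# Units are not given on the slide, so only ratios are meaningful here.
SPIKED = {
    "A": [65, 55, 15, 2],      # Ovalbumin
    "B": [55, 15, 2, 65],      # Myoglobin
    "C": [15, 2, 65, 55],      # Phosphorylase b
    "D": [2, 65, 55, 15],      # Beta-Galactosidase
    "E": [11, 0.6, 10, 500],   # Bovine Serum Albumin
    "F": [10, 500, 11, 0.6],   # Carbonic Anhydrase
}


def expected_log2_fc(protein, numerator, denominator):
    """Expected log2 fold change between two samples (1-based sample numbers)."""
    amounts = SPIKED[protein]
    return math.log2(amounts[numerator - 1] / amounts[denominator - 1])


if __name__ == "__main__":
    # Expected log2 fold change of sample 2 relative to sample 1.
    for protein in SPIKED:
        print(protein, round(expected_log2_fc(protein, 2, 1), 2))
```

Background S. cerevisiae proteins are constant across samples, so their expected log2 fold change is 0 in every comparison.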
DIVERSE SUBMISSIONS: NUMBER, PROTEIN INPUT, AND CHOICE OF QUANTIFICATION
[Figure 2A: number of reported proteins per submission. X axis: Study ID, ordered by number of proteins. Y axis: reported proteins. Submissions are marked by quantification type (spectral counting, intensity based, hybrid, NA) and by input data (peaks, peptide IDs, raw+check, raw, NA). Reference lines: provided # of proteins with spectral counts = 5766; provided # of proteins with peak intensities = 3766.]

DIVERSE SUBMISSIONS: ACCURACY OF DETECTING DIFFERENTIAL ABUNDANCE
[Figure 2B: differentially abundant proteins per submission, split into 'false positives' and 'true positives'. X axis: Study ID, ordered by number of false positives. Submissions are marked by quantification type (spectral counting, intensity based, hybrid, NA). Values printed above the panel: 447, 514, 734, 2002, 330.]

DIVERSE SUBMISSIONS: ACCURACY OF ESTIMATING FOLD CHANGE
[Figure 2C: absolute value of log2 fold-change in background proteins. X axis: Study ID, ordered by number of false positives. Submissions are marked by quantification type (spectral counting, intensity based, hybrid, NA).]

SUMMARY OF SUBMISSIONS: USER EXPERTISE IS KEY
[Table of submissions. Columns: Lab ID, Submission ID, Input, Identification, Quantification, Summarization, Statistical inference software, FDR, PPV, True positive, False positive. The categorical columns were color-coded on the slide and are not recoverable here; only their legend survives. Lab ID is shown only where the extracted text gives one.]

Legend
• Input: Peaks, Peptide IDs, Raw+check, Raw
• Identification: Skyline, MaxQuant, Progenesis, Others
• Quantification: Feature intensity, Spectral counting, Hybrid
• Summarization: Protein summarization / Protein-level inference; Peptide summarization / Protein-level inference; Peptide summarization / Peptide-level inference
• Statistical software: Perseus, Progenesis QI, Others (R, Excel, MATLAB, Python), In-house

PPV > 0.7
Lab ID   Submission ID   PPV     True positive   False positive
3        11              0.857   36              6
3        42              0.972   35              1
         20              0.821   32              7
         48              0.744   32              11
2        25              1.000   30              0
         13              1.000   30              0
         12              1.000   27              0
1        44              1.000   26              0
2        49              0.963   26              1
         36              0.897   26              3
         26              1.000   24              0
2        17              0.920   23              2
4        4               0.821   23              5
         40              0.767   23              7
1        6               0.955   21              1
5        3               1.000   19              0
         45              0.708   17              7
6        46              0.867   13              2
         37              1.000   10              0

PPV < 0.7
Lab ID   Submission ID   PPV     True positive   False positive
8        8               0.559   33              26
         1               0.580   29              21
5        14              0.359   28              50
7        28              0.295   28              67
         21              0.294   25              60
         43              0.294   25              60
7        19              0.511   24              23
7        31              0.575   23              17
7        23              0.500   23              23
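
The PPV values in the summary of submissions are consistent with the usual definition of positive predictive value, PPV = TP / (TP + FP), where TP and FP are the counts of true and false positives. The Python sketch below checks this on a few rows; the function name `ppv` and the tuple layout are illustrative assumptions rather than anything from the slides, and the rows are copied from the summary table.

```python
def ppv(true_positives, false_positives):
    """Positive predictive value: the share of reported positives that are true."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("PPV is undefined when nothing is reported as positive")
    return true_positives / total


# (submission ID, true positives, false positives), copied from the summary table.
ROWS = [
    (11, 36, 6),
    (42, 35, 1),
    (25, 30, 0),
    (28, 28, 67),
]

if __name__ == "__main__":
    for submission, tp, fp in ROWS:
        group = "PPV > 0.7" if ppv(tp, fp) > 0.7 else "PPV < 0.7"
        print(f"submission {submission}: PPV = {ppv(tp, fp):.3f} ({group})")
```

A submission can report many true positives and still have a low PPV if it reports even more false positives, as submission 28 does.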