Introduction to MSstats

INTRODUCTION TO MSSTATS
Meena Choi, Olga Vitek
College of Computer and Information Science

WHY STATISTICS?
• Variation and uncertainty are unavoidable
  • Technical variation: sample handling, storage, processing
  • Instrumental variation: matrix effects, ion suppression
  • Signal processing: peak boundaries, identity, intensity
  • Biological variation: variation in protein abundance
• Overall goal: effective, reproducible research
  • Experimental design: unbiased and efficient experiments
  • Data analysis: objective conclusions in presence of uncertainty
  • Statistical software: re-analysis, peer review

OUTLINE
• Motivating example
  • ABRF iPRG study
• MSstats
  • Statistical relative quantification of proteins and peptides
  • Methods evaluation
• Extensions to MSstats
  • Assay characterization
  • System suitability and quality control

ABRF IPRG STUDY 2015
Detection of differentially abundant proteins in controlled mixture

                                                                  Samples
    Name                  Origin               Molecular weight   1     2     3     4
A   Ovalbumin             Chicken Egg White    45 kDa             65    55    15    2
B   Myoglobin             Equine Heart         17 kDa             55    15    2     65
C   Phosphorylase b       Rabbit Muscle        97 kDa             15    2     65    55
D   Beta-Galactosidase    Escherichia coli     116 kDa            2     65    55    15
E   Bovine Serum Albumin  Bovine Serum         66 kDa             11    0.6   10    500
F   Carbonic Anhydrase    Bovine Erythrocytes  29 kDa             10    500   11    0.6

Spiked into a constant background: tryptic digests of S. cerevisiae
◆ Three technical replicates per sample
◆ Thermo nLC 1000 system
◆ 110-min linear gradient
◆ DDA profile mode in Orbitrap
◆ Data processing with Skyline
Choi et al., Journal of Proteome Research, 2017.
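Because the spiked amounts are known, the fold change a method should recover between any two samples can be computed from the design table. The following is a minimal sketch, not part of the original slides: the names `SPIKED`, `expected_log2_fc` and `expected_table` are hypothetical, the amounts are copied from the table (whose units the slide does not state, so only ratios are used), and the direction of the ratio (numerator sample over denominator sample) is an assumption.

```python
import math

# Spiked amounts for samples 1-4, copied from the design table.
# Units are not given on the slide, so only ratios are meaningful here.
SPIKED = {
    "A Ovalbumin": (65, 55, 15, 2),
    "B Myoglobin": (55, 15, 2, 65),
    "C Phosphorylase b": (15, 2, 65, 55),
    "D Beta-Galactosidase": (2, 65, 55, 15),
    "E Bovine Serum Albumin": (11, 0.6, 10, 500),
    "F Carbonic Anhydrase": (10, 500, 11, 0.6),
}


def expected_log2_fc(amounts, numerator, denominator):
    """Return log2(amount in sample `numerator` / amount in sample `denominator`).

    Sample indices are 1-based, matching the column headers of the table.
    """
    return math.log2(amounts[numerator - 1] / amounts[denominator - 1])


def expected_table(numerator, denominator):
    """Expected log2 fold change of every spiked protein for one comparison."""
    return {
        name: expected_log2_fc(amounts, numerator, denominator)
        for name, amounts in SPIKED.items()
    }


if __name__ == "__main__":
    # Example comparison (assumed direction): sample 2 relative to sample 1.
    for name, value in expected_table(2, 1).items():
        print(f"{name}: expected log2(sample 2 / sample 1) = {value:+.2f}")
```

Since the S. cerevisiae background is constant across samples, its expected log2 fold change is 0 in every comparison, so any non-zero estimate for a background protein reflects variation or error rather than a true change.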
DIVERSE SUBMISSIONS: INPUT, PROTEIN NUMBER, AND CHOICE OF QUANTIFICATION
[Figure 2A: number of reported proteins per study ID, ordered by number of proteins. Quantification: spectral counting, intensity based, hybrid, NA. Input data: peaks, peptide IDs, raw+check, raw, NA. Provided # of proteins with spectral counts = 5766; provided # of proteins with peak intensities = 3766.]

ACCURACY OF DETECTING DIFFERENTIAL ABUNDANCE
[Figure 2B: differentially abundant proteins ('false positives' and 'true positives') per study ID, ordered by number of false positives.]

ACCURACY OF ESTIMATING FOLD CHANGE
[Figure 2C: absolute value of log2 fold-change in background proteins per study ID, ordered by number of false positives.]

SUMMARY OF SUBMISSIONS: USER EXPERTISE IS KEY

Lab ID  Submission ID  PPV    True positive  False positive
PPV > 0.7
3       11             0.857  36             6
3       42             0.972  35             1
        20             0.821  32             7
        48             0.744  32             11
2       25             1.000  30             0
        13             1.000  30             0
        12             1.000  27             0
1       44             1.000  26             0
2       49             0.963  26             1
        36             0.897  26             3
        26             1.000  24             0
2       17             0.920  23             2
4       4              0.821  23             5
        40             0.767  23             7
1       6              0.955  21             1
5       3              1.000  19             0
        45             0.708  17             7
6       46             0.867  13             2
        37             1.000  10             0
PPV < 0.7
8       8              0.559  33             26
        1              0.580  29             21
5       14             0.359  28             50
7       28             0.295  28             67
        21             0.294  25             60
        43             0.294  25             60
7       19             0.511  24             23
7       31             0.575  23             17
7       23             0.500  23             23

Categories shown for each submission:
• Input: Peaks; Peptide IDs; Raw+check; Raw
• Identification: Skyline; MaxQuant; Progenesis; Others
• Quantification: Feature intensity; Spectral counting; Hybrid
• Summarization: Protein summarization / Protein-level inference; Peptide summarization / Protein-level inference; Peptide summarization / Peptide-level inference
• Statistical software: Perseus; Progenesis QI; Others; R, Excel, MATLAB, Python; In-house
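The PPV column of the summary can be checked against the true-positive and false-positive counts. The sketch below assumes PPV is the usual positive predictive value, TP / (TP + FP), which agrees with the reported values; the function name `ppv` and the choice of rows are illustrative and not taken from the slides.

```python
def ppv(true_positives: int, false_positives: int) -> float:
    """Positive predictive value, assumed here to be TP / (TP + FP)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("PPV is undefined when nothing is reported as changing")
    return true_positives / total


# (submission ID, true positives, false positives) copied from the summary table.
rows = [(11, 36, 6), (42, 35, 1), (14, 28, 50), (28, 28, 67)]

# Round to three decimals, the precision used in the table.
ppv_by_submission = {sub: round(ppv(tp, fp), 3) for sub, tp, fp in rows}

for sub, value in ppv_by_submission.items():
    group = "PPV > 0.7" if value > 0.7 else "PPV < 0.7"
    print(f"submission {sub}: PPV = {value:.3f} ({group})")
```

Submissions 11 and 14 illustrate the slide's point: both report a similar number of true positives (36 and 28), but 6 versus 50 false positives puts them in different PPV groups.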