Lecture 3 Differential Expression

Arthur Moseley
Genome Academy, April 2013
Quantitative of Peptides and

– Quantitative MS is easy to try, hard to do right

– Sets of “Light and Heavy” reagents can be used for relative quantitation

– Quantitative MS often relies on use of isotopically labeled authentic standards

– Spiking authentic stable-labeled molecules (peptides, drugs, pesticides, etc.) into samples provides for molar quantitation
• THE Gold standard approach for quantitative mass spectrometry

– Label-free quantitation is often very useful
• Used for relative quantitation and “Top-3” Mole Quantitation
• Ultimate flexibility in experimental design

“Old-School” Differential Expression Proteomics

The First Mass Spec Based Differential Expression Proteomics

Isotope-Coded Affinity Tags (ICAT), developed by Ruedi Aebersold (Nature Biotechnology, 17, 994, 1999)

ICAT Reagent and Strategy

Stable Isotope Labeling for : - Lots of Options

Goshe and Smith, Curr Op in Biotech (2003) 14:101

Analytical Challenges Associated with Performing Quantitative Proteomics Using Chemical Isotopic Labeling

• Bypassing gels avoids problems with membrane proteins, other special cases

• Sample loading issues contributing to poor dynamic range are reduced

• Not all proteins contain targeted amino acid (tag dependent consideration)

• Post-translational modifications can be missed (tag dependent)

• Quantitation from LC/MS: relative intensities of isotope clusters

• Qualitative Identification from LC/MS/MS: peptide sequencing (MS/MS)

• Analytical challenge - very complex mixtures (30,000+ peptides/sample) are made more complex by isotope labeling (doubles number of analytes)
– pre-fractionate samples
– Multidimensional analytical HPLC (capillary LC/LC/MS/MS)

Applied Biosystems iTRAQ reagents use isobaric tags

Multiple tags present with the same nominal mass in survey spectra

Quantitation is done during the MS/MS step, simultaneously with peptide identification

Only quantify peptides sequenced by MS/MS - A subset of all peptides present

Label-free methods quantitate all species regardless of identification

http://docs.appliedbiosystems.com/pebiodocs/00113379.pdf

Metabolic Stable Isotope Coding

Goshe and Smith, Curr Op in Biotech (2003) 14:101

SILAC generates a lot of data regarding 2 samples - Be aware of statistical limitations

Even when quantitative methods are used, most of the time, the focus is on function. There is little attention to the details of quantitation.

Such an approach is fundamentally flawed.

Forget not the basic principles of quantitative analyses.
– Replication; QCs; Validation

Rigorously use Quantitatively Reproducible Analytical Methods
Forget not the basics of analytical chemistry

• Highly reproducible chromatography is required

• A high sampling rate across the chromatographic peak is required for accurate quantitation
• Ideally want 15-20 sampling points across chromatographic profile
• Highly reproducible chromatography is required for sample-to-sample comparisons

• High resolution, accurate mass (precursor & products) technology needed

• For quantitative selectivity (near isobaric cross-talk)

• For accurate qualitative identifications 1% FPR at peptide level (Decoy DB; Peptide Prophet)

• No QCs = No Quantifiably Reliable Data

• No Replication = No Quantifiably Reliable Data

• No Common Standard = No Meaningful Comparison across Projects
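The sampling-rate requirement in the list above (ideally 15-20 points across the chromatographic profile) reduces to simple arithmetic. The sketch below is only an illustration: the function names and the 30-second peak width are assumptions, not values from the lecture.

```python
def points_across_peak(peak_width_s: float, cycle_time_s: float) -> int:
    """Number of complete acquisition cycles that fit inside one peak."""
    if peak_width_s <= 0 or cycle_time_s <= 0:
        raise ValueError("peak width and cycle time must be positive")
    return int(peak_width_s // cycle_time_s)


def max_cycle_time_s(peak_width_s: float, min_points: int = 15) -> float:
    """Longest cycle time that still gives `min_points` across the peak."""
    if min_points <= 0:
        raise ValueError("min_points must be positive")
    return peak_width_s / min_points


# Hypothetical 30 s wide peak sampled once every 2 s
n_points = points_across_peak(30.0, 2.0)
longest_cycle = max_cycle_time_s(30.0, min_points=15)
```

Under these assumed numbers a 2-second cycle just meets the lower end of the 15-20 point target; a longer cycle would undersample the peak.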

Overview of Label Free Quantitation

[Workflow diagram; box labels: Acquisition of LC Separation MS Data; Acquisition of Selected MS/MS Data via Targeted Analysis; Peptide Identification (Database Search Engine); Import Raw Data; Data Alignment & Feature Extraction; Statistical Analysis of Differences; Import Raw MS/MS Data; Annotation & Peptide/ Analysis]

(courtesy Rosetta Biosoftware)

Gel-Free Label Free Proteomics
High Resolution, Accurate Mass 3D Peptide Mass Map

X and Y coordinates identify the peptide
Y coordinate (mass-to-charge ratio) is fixed to <5 ppm error
X coordinate (LC Retention Time) has more variability (typically < 6 seconds)

[Figure: an isotope group of a peptide; axes: LC retention time and mass-to-charge ratio (m/z)]

• Intensity (AUC) of SIC of peptide is the quantitative measure

• Must be accurately measured across statistically significant sample cohort
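A minimal sketch of the two ideas above: matching a feature between runs by accurate mass and retention time, and integrating a selected ion chromatogram (SIC) to get the AUC. The 5 ppm and 6 second tolerances come from the slide; the function names, dictionary keys, and example values are hypothetical.

```python
def ppm_error(mz_observed: float, mz_reference: float) -> float:
    """Mass error in parts per million relative to the reference m/z."""
    return (mz_observed - mz_reference) / mz_reference * 1e6


def same_feature(a: dict, b: dict, ppm_tol: float = 5.0,
                 rt_tol_s: float = 6.0) -> bool:
    """True when two features agree within the mass and RT tolerances."""
    mass_ok = abs(ppm_error(a["mz"], b["mz"])) <= ppm_tol
    rt_ok = abs(a["rt_s"] - b["rt_s"]) <= rt_tol_s
    return mass_ok and rt_ok


def peak_area(times_s: list, intensities: list) -> float:
    """Trapezoidal area under a selected ion chromatogram."""
    if len(times_s) != len(intensities):
        raise ValueError("times and intensities must have equal length")
    area = 0.0
    for i in range(1, len(times_s)):
        width = times_s[i] - times_s[i - 1]
        area += 0.5 * (intensities[i] + intensities[i - 1]) * width
    return area


# Hypothetical features from two LC/MS runs
run1 = {"mz": 500.0000, "rt_s": 1200.0}
run2 = {"mz": 500.0020, "rt_s": 1204.0}   # about 4 ppm and 4 s away
run3 = {"mz": 500.0040, "rt_s": 1204.0}   # about 8 ppm away
```

Real alignment software also corrects retention-time drift before matching; this sketch assumes the runs are already aligned.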

Results of Data Alignment based on Accurate Mass and Retention Time

[Figure panels: Raw Data; Aligned Data; Aligned Data Combined by Biological Condition. 111,015 Features Aligned across 16 LC/MS Analyses of Cell Lines]

How to QC this vast Amount of Data?

QC of Individual Isotope Groups
pairwise t-tests of significance of peak area measurement

Rigorously use Quantitatively Reproducible Analytical Methods
Daily QC Checks of Data Acquisition Precision and Reproducibility

[Run-order diagram]
Day 1(+): Instrument Performance Checks; Column Conditioning; QCs; Preliminary database searches
Day 2, Day 3, ... Day X: Data Collection
Injection order: Column Condition, QC 1, Sample 1 to Sample 10, QC 2, Sample 11, Sample 12, Sample 13, ………, QC X-1, Sample X-5 to Sample X, QC X

• Want to maximize biological powering - analyzing as many samples as possible

• Must use robust LC-MS platform and singlicate analysis of each sample

• Data QC is performed by daily injections of a “standard” of the same biological sample (pool)
• Aliquots of same pool used in all projects – QC tracking across projects

Quantitatively Reproducible Analytical Methods
Forget not the basics of analytical chemistry

Assessing Quantitative Reproducibility with Daily QCs

QC Metric #1 = %CV (Anal. + Biol. Variability) - %CV (Anal. Variability)
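One way to compare the two variability sources named in QC Metric #1 is to compute a per-peptide %CV across the daily QC injections (analytical variability only) and across the study samples (analytical plus biological variability), then summarize each as the fraction of peptides under a CV threshold. The sketch below assumes that reading; the function names and intensities are made up for illustration.

```python
from statistics import mean, stdev


def percent_cv(values: list) -> float:
    """Coefficient of variation in percent (sample standard deviation)."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean intensity is zero; %CV is undefined")
    return stdev(values) / m * 100.0


def fraction_below(cvs: list, threshold: float) -> float:
    """Fraction of peptides whose %CV is under the threshold."""
    return sum(1 for cv in cvs if cv < threshold) / len(cvs)


# Hypothetical peak areas: peptide -> intensity in each injection
qc_runs = {"pepA": [90.0, 100.0, 110.0], "pepB": [100.0, 100.0, 100.0]}
samples = {"pepA": [50.0, 100.0, 150.0], "pepB": [95.0, 100.0, 105.0]}

qc_cvs = [percent_cv(v) for v in qc_runs.values()]      # analytical only
sample_cvs = [percent_cv(v) for v in samples.values()]  # analytical + biological

qc_ok = fraction_below(qc_cvs, 25.0)
sample_ok = fraction_below(sample_cvs, 25.0)
```

A peptide whose sample %CV is not clearly larger than its QC %CV cannot show biological differences above the measurement noise.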

[Figure: histograms of peptide %CV, ~ 35,000 peptides]
• Analytical Variability: Daily QC Sample (pool of QC plasma sample)
• Analytical + Biological Variability: Patient Samples

~ 40% peptides CV < 10%
~ 2% peptides CV < 25%
~ 70% peptides CV < 20%
~ 90% peptides CV < 25%

Note X-Axis Scale Differences: QC Samples 0 to 170% CV; Biological Samples 0 to 500% CV
Panel labels: 125% CV Plasma Peptides; 25% CV Plasma Peptides

- Alternating cycles (1 sec. each) of precursor / product scans provides high reproducibility via a high sampling rate across chromatographic peak
- Major attribute of MSE

Rigorously use Quantitatively Reproducible Analytical Methods
Assessing Quantitative Reproducibility at the Peptide Level with QCs

Reproducibility of Internal Standard Spiked into Each Sample

ADH1_YEAST (50 fmol/ug) Peptide Abundance across 60 patient clinical cohort

DDA Data Qual only

VVGLSTLPEYIEK, 12.8% CV across all samples

Label Free Intensity Plots
differential expression visualization

Cluster Analysis of Label Free Quantitation Datasets

[Figure: cluster heat map; axes: Proteins and Treatment Groups]

• Cluster Analyses
– Examine large data sets and determine if items behave similarly
– Data belonging to the same cluster are similar at some level
– Data sets in different clusters are less similar at some level
– Make a preliminary assessment of possible relationships between clusters and identify data sets for further investigation
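As a rough illustration of the cluster-analysis idea above, the sketch below groups proteins whose intensity profiles across samples are strongly correlated. It is a simple single-linkage grouping with a Pearson correlation cutoff, chosen here for brevity; it is not the clustering method used for the lecture figures, and the names and data are hypothetical.

```python
from math import sqrt


def pearson(x: list, y: list) -> float:
    """Pearson correlation; assumes neither profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cluster_profiles(profiles: dict, min_corr: float = 0.9) -> list:
    """Merge profiles linked by correlation >= min_corr (single linkage)."""
    names = list(profiles)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster representative
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= min_corr:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical protein intensities across four samples
profiles = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.0, 6.0, 8.5],   # rises with P1
    "P3": [4.0, 3.0, 2.0, 1.0],   # falls while P1 rises
}
clusters = cluster_profiles(profiles)
```

Here P1 and P2 end up in one cluster and P3 in its own, which is the kind of preliminary grouping that would then be examined further.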

Differential Protein Expression

• Differential protein expression studies are key for
– Identifying biomarkers of disease and treatment response
– Elucidating biological pathways
– Identifying and validating protein drug targets

• Essentially all differential proteomics studies have studied relative protein expression
– Isotope labeling methods
– Label free methods

• Differential proteomic expression studies based on “absolute” quantitation have yet to be fully exploited

Relative Protein Expression

• Provides data on protein expression changes between two or more samples within the same experiment
• Requires direct comparison of proteolytic peptides or marker ions from proteolytic peptides
– Provides relative abundance ratios of the same protein between different samples
– Data does not easily extrapolate beyond the experiment
• Experiments are isolated “islands of information”

One Exemplar Biomarker Discovery & Verification Project

Biomarkers to Predict Outcomes of Hepatitis C Patient Treatment in Serum of Treatment Naive Patients Jeanette McCarthy, Keyur Patel, Joe Lucas and John McHutchison

[Diagram: hepatitis C infection outcomes. Spontaneous clearance (~25%) or chronic infection; eligible for treatment; responders and non-responders (>50%). Conditions shown: hepatic fibrosis, steatosis, insulin resistance, dyslipidemia. Consequences shown: 20% cirrhosis, 3-5% cancer, increased risk of diabetes, unknown consequences]

Cohort Selection and Placement in the Pipeline (Guided by an “Unmet Clinical Need”, US HUPO 2009)

Biomarker Discovery: 10,000s of analytes, 10s of samples
Biomarker Verification: 100-1,000 analytes, 100-1,000 samples
Biomarker Validation: 10s of analytes, 1,000s of samples

First: Discover in Matched Cohorts to Focus on the Clinical Variable of Interest
Second: Verify in All-Comers Trials to Test Robustness

Biomarker Discovery Paradigm Challenge Hepatitis C Cohorts – all by UPLC/Q-Tof

Open Platform LC/MS LC/MS/MS (MRM) LC/MS/MS (MRM)

Duke Hepatology Biorepository - 3,169 patients

Discovery Cohort
- small discovery experiment
- well matched cohort from Biorepository
- n = 55 patients
- ‘omic LC/MS/MS

Verification Cohort 1
- well matched cohort from Biorepository
- n = 41 patients

Verification Cohort 2
- pediatric patients
- “all-comers” trial
- N = 50 patients

Verification / Validation Cohort 3
- “all-comers” trial (Australia)
- N = 243 patients

Ensure Professional Use of Statistical Tools Suitable for High Dimensional Data Analyses

Sparse Latent Factor Regression - Bayesian Factor Regression Modeling

[Diagram labels: 35,000 Isotope Groups; Predictive Factor; Factor Score; “Metaproteins”; “Expression Value”]

• Regression - Leads directly to prediction
• Sparsity – Most peptides are irrelevant for prediction
• Latent Factors – let data determine important relationships
• Resulting model for prediction:

• Initial Metaprotein Model - 650 Isotope Groups

Statistical Analysis: Joe Lucas, PhD, Duke IGSP
[Image: Pastor Thomas Bayes]

Key Features of Metaprotein Expression Modeling
Bayesian Factor Regression Model

• Allows correction of large scale correlational structure between proteins arising from technical rather than biological variability

• Casts a “wide net” initially for predictive peptides

• Models both identified and unidentified peptides

• Utilizes identifications while allowing for incorrect identifications

• Recognizes that some peptides from a protein may be post-translationally modified and the expression of these peptides may not be representative of the protein as a whole

• Can be used in the creation of predictive models based on multiple proteins, capturing “pathway” expression

Joe Lucas et al. “Metaprotein Expression Modeling for Label-Free Quantitative Proteomics”, J. Proteome Res, in review. Oral presentation, 2011 RECOMB Satellite Conference on Computational Proteomics.

Remember, A Metaprotein Model…

• A Metaprotein is a group of peptides which exhibit a similar expression pattern across the cohort(s)

• A Metaprotein may contain:
– All peptides from one protein
– A subset of peptides from a protein
– A collection of peptides from multiple proteins

• Model constructed with intensity measurements aggregated at the isotope group level
– Identified or unidentified peptides

Metaprotein expression modeling for label-free quantitative proteomics. Lucas JE, Thompson JW, Dubois LG, McCarthy J, Tillmann H, Thompson A, Shire N, Hendrickson R, Dieguez F, Goldman P, Schwarz K, Patel K, McHutchison J, Moseley MA. BMC Bioinformatics. 2012 May 4;13(1):74

Discovery and Initial Verification of SVR-Prediction Using Unbiased Data

[Figure panels: Discovery Data, Build Model; Use Model to Predict SVR (Blinded)]

Patel, et al. Hepatology 2011 Jun;53(6):1809-1818.

Reproducibility of Metaprotein Biosignatures

• Build predictive model with first three cohorts
• Predict NR / SVR in “Big Pharma” measured data
– different LC/MS/MS (LTQ-Orbi) system in different lab
– Metaprotein model maintained consistent results

Discovery Cohort: N = 55, Matched
Verification Cohort 1: N = 41, Matched
Verification Cohort 2: N = 50, All-Comers
Discovery Cohort: Measured by Big Pharma Lab

Relative Protein Expression

• Provides data on protein expression changes between two or more samples within the same experiment
• Requires direct comparison of proteolytic peptides or marker ions from proteolytic peptides
– Provides relative abundance ratios of the same protein between different samples
– Data does not extrapolate beyond the experiment
• Experiments are isolated “islands of information”

Absolute Protein Expression – ‘omic scale

• Calculation of the absolute amount of the proteins present (ng or fm) in a sample
– Permits determination of stoichiometry of proteins in macromolecular complexes
– Permits extrapolation of results to different experiments in different labs

• These workers made the notable and unexpected observation: – “the average MS signal response for the three most abundant peptides per mole of protein is constant within a coefficient of variation of less than 10%” – “Given an internal standard, this relationship is used to calculate a universal response factor (counts/mole)”
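The quoted relationship suggests a short calculation: average the three most intense peptides of a spiked internal standard, divide by its known molar amount to get a response factor, then apply that factor to the top-three average of any other protein. The sketch below assumes that reading of the quote; the function names and intensities are invented for illustration.

```python
def top3_mean(peptide_intensities: list) -> float:
    """Mean signal of the three most intense peptides of a protein."""
    top = sorted(peptide_intensities, reverse=True)[:3]
    if len(top) < 3:
        raise ValueError("need at least three quantified peptides")
    return sum(top) / 3.0


def response_factor(standard_intensities: list, standard_fmol: float) -> float:
    """Counts per fmol, derived from an internal standard of known amount."""
    return top3_mean(standard_intensities) / standard_fmol


def estimate_fmol(peptide_intensities: list, factor: float) -> float:
    """Molar amount of a protein from its top-3 signal and the factor."""
    return top3_mean(peptide_intensities) / factor


# Hypothetical numbers: 50 fmol of standard on column
standard = [9000.0, 12000.0, 9000.0, 3000.0]
factor = response_factor(standard, standard_fmol=50.0)   # counts per fmol
unknown = [4000.0, 5000.0, 3000.0, 500.0]
amount = estimate_fmol(unknown, factor)                  # fmol
```

Proteins with fewer than three quantified peptides raise an error here instead of returning a less reliable estimate.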

NOTE – “absolute” is a controversial description

Absolute Quantification of Proteins by LCMSE: A Virtue of Parallel MS Acquisition
Jeffrey C. Silva, Marc V. Gorenstein, Guo-Zhong Li, Johannes P.C. Vissers, Scott Geromanos
Molecular & Cellular Proteomics, 5:144-156, 2006.

Intensity Distribution of Peptides from One Protein

Response Per fmol for a Six Protein Mixture

Biological “Validation” by Determining Stoichiometric Ratios

Absolute Quantitation at the Protein Level - E. coli lysate spiked with 4 exogenous proteins

Absolute Quantitation for Measurement of Fold-Changes
E. coli spiking Experiment

Quantitative proteomics reveals metabolic and pathogenic properties of Chlamydia trachomatis developmental forms

Hector A. Saka, J. Will Thompson, Yi-Shan Chen, Yadunanda Kumar, Laura G. Dubois, M. Arthur Moseley, and Raphael H. Valdivia

Mol Microbiol. 2011 December; 82(5): 1185–1203. PMCID: PMC3225693


C. trachomatis is the most common bacterial STD, and exhibits a biphasic development cycle – EB infectious, RB non-infectious

Protein Identification Metrics (Swissprot Human, NCBI C. trachomatis, 1% FDR)

[Venn diagrams of identified proteins]
All proteins: EB (775), RB (1120); 355 EB only, 420 shared, 700 RB only
HUMAN (990, ~5%): EB (349), RB (851); 139 EB only, 210 shared, 641 RB only
CHLAMYDIA (485, 55%): EB (426), RB (269); 216 EB only, 210 shared, 59 RB only

>54% of C. trachomatis proteome
*Mass spectrometry allows us to distinctly isolate the signal from Chlamydia versus Human

Global Strategy for Using Mass Spectrometry to Deal with Mixed Proteomes using UPLC/UPLC/MS/MS

Search Against Both Human and Chlamydia DB
EB: 754 Human, 3916 Chlamydia Peptides
RB: 4025 Human, 1274 Chlamydia Peptides

Remove Peptide Matches Shared Between Species
EB: 14 Homologous Peptides
RB: 8 Homologous Peptides

Calculate fmol of Each Protein Using Method of Silva and Geromanos
EB: 339 Chlamydia Proteins; CT842, 200 ± 25 fmol
RB: 181 Chlamydia Proteins; CT842, 47 ± 9 fmol

Calculate sum ng of Chlamydia proteins, use this to normalize fmol values
EB: CT842, 80 ± 3 fmol/ug
RB: CT842, 84 ± 19 fmol/ug

Validation of Species-Specific Quantitation using a Model System

Sample composition | Sample 1 | Sample 2
Spiked E. Coli Lysate | 0.25 ug | 0.5 ug
Mouse Brain Lysate | 0.25 ug | 0 ug
Total Column Load | 0.5 ug | 0.5 ug

Quantitation (fmol/ug) and Ratios
Protein Name | Sample 1 (uncorrected) | Sample 2 (uncorrected) | Sample 1 (corrected) | Sample 2 (corrected) | Measured Ratio (uncorrected) | Measured Ratio (corrected) | Theoretical Ratio
ALBU_BOVIN | 217±4 | 118±35 | 561±9 | 132±39 | 1.8 | 4.3 | 4.0
ADH1_YEAST | 234±6 | 260±7 | 604±7 | 291±7 | 0.90 | 2.1 | 2.0
ENO1_YEAST | 50±3 | 112±7 | 129±5 | 125±7 | 0.50 | 1.0 | 2.0
PYGM_RABIT | 27.7±0.5 | 105±5 | 69±3 | 117±5 | 0.26 | 0.60 | 0.50
E. Coli Proteins (average) | 109±6 | 263±17 | 281±12 | 294±19 | 0.41 | 1.0 | 1.0

Reproducibility of Protein Quantitation

[Figures: Protein CV Distribution, EB; Protein CV Distribution, RB]

Species-Specific Correction Applied to Chlamydia Protein Quantitation
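The species-specific correction amounts to expressing each protein's fmol per microgram of protein from its own species, where that mass is the sum of fmol times molecular weight over the species' proteins. The sketch below is one hedged way to write it; the field names, molecular weights, and amounts are hypothetical and are not taken from the study.

```python
def protein_mass_ng(fmol: float, mw_kda: float) -> float:
    """Mass in ng: 1 fmol of a 1 kDa protein weighs 1 pg (0.001 ng)."""
    return fmol * mw_kda / 1000.0


def normalize_to_species_load(proteins: list, species: str) -> dict:
    """fmol of each protein per ug of measured protein from one species."""
    own = [p for p in proteins if p["species"] == species]
    total_ug = sum(protein_mass_ng(p["fmol"], p["mw_kda"]) for p in own) / 1000.0
    if total_ug == 0:
        raise ValueError("no measured protein mass for this species")
    return {p["name"]: p["fmol"] / total_ug for p in own}


# Hypothetical mixed-proteome measurement
measured = [
    {"name": "chlam_A", "species": "chlamydia", "fmol": 200.0, "mw_kda": 50.0},
    {"name": "chlam_B", "species": "chlamydia", "fmol": 1000.0, "mw_kda": 40.0},
    {"name": "human_H", "species": "human", "fmol": 500.0, "mw_kda": 60.0},
]
per_ug = normalize_to_species_load(measured, "chlamydia")
```

Because the denominator counts only one species, the result does not depend on how much host protein was co-loaded, which is why the corrected ratios in the table above track the theoretical ratios more closely than the uncorrected ones.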

EB vs RB Quantitation (with Species-Specific Scaling)
Select Proteins with Verification

Protein Classes and Relative Abundance in the Developmental Forms

- EBs are enriched in T3S-effectors and chaperones, as well as in enzymes involved in glucose catabolism.

- RBs are enriched for protein synthesis and assembly components, ATP generation and transport, and nutrient import.

Molecular Evidence of the Different Metabolic Properties of the two Developmental Stages

Proteomic results show the EB and RB proteomes are streamlined for their function - maximum infectivity for EB, replicative capacity for RBs

A Simple Explanation of Selected Reaction Monitoring for Quantitative Analysis

Mayya and Han, Expert Rev Proteomics 3(6), 597-610 (2006)

Figure 1. Fundamentals of isotope-dilution mass spectrometry for quantification. (A) Amount of the native or endogenous peptide in the sample is quantified using the ratio of the mass spectrometric response to the endogenous peptide and the SIS peptide and the initial amount of the SIS peptide spiked into the sample. (B) In SRM, only specific product ions from collision-induced dissociation events are recorded. The top panel illustrates the operations of an ion-trap mass spectrometer, whereas the bottom panel illustrates the operations of a triple quadrupole mass spectrometer for SRM. Note that operations in an ion-trap are 'sequential in time' for a given population of injected ions, whereas in a triple quadrupole, each quadrupole specializes in carrying out the three operations simultaneously on the ions that are continuously conveyed. Parent ion m/z, product ion m/z, and elution-time criteria from SRM enable selectivity and sensitivity for the detection of specific peptides in complex mixtures from biological sources. Recording multiple product ion transitions, as in multiple reaction monitoring, can further increase the selectivity. SIS: Stable isotope-labeled standard; SRM: Selected reaction monitoring.

(A) In the regular MRM mode of acquisition, the mass spectrometer records product ion transitions intended from the SIS peptide and the endogenous peptide in alternate scans. (B) The mass spectrometer can be instructed to record multiple product ion transitions from multiple SIS and endogenous peptide pairs and continue to do so in each acquisition cycle for the entire duration of chromatography. This allows multiplexed quantification. However, the reduced sampling frequency can compromise sensitivity, reproducibility and accuracy of quantification. (C) The chromatographic duration can be subdivided into time-segments or slices wherein different endogenous peptides are quantified using corresponding acquisition cycles. However, the method is limited by the peak capacity of the online chromatographic method and requires highly reproducible elution times. (D) It is practically difficult to achieve consistent elution times of peptides in complex mixtures on a routine basis. A hybrid 'staggered multiplexing' is an optimum strategy as it attempts to maximize the elution time-window for the peptides and also to minimize the number of MRMs in each acquisition cycle. The peptide-pairs in each acquisition cycle are indicated for illustrating the 'staggered' nature of acquisition.

Skyline Open Source Software for Targeted Method Development and Data Analysis

Anal Biochem. 2007 Mar 1;362(1):44-54. Epub 2006 Dec 20. Antibody-based enrichment of peptides on magnetic beads for mass-spectrometry-based quantification of serum biomarkers. Whiteaker JR, Zhao L, Zhang HY, Feng LC, Piening BD, Anderson L, Paulovich AG. Source Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N., PO Box 19024, Seattle, WA 98109-1024, USA.

A major bottleneck for validation of new clinical diagnostics is the development of highly sensitive and specific assays for quantifying proteins. We previously described a method, stable isotope standards with capture by antipeptide antibodies, wherein a specific tryptic peptide is selected as a stoichiometric representative of the protein from which it is cleaved, is enriched from biological samples using immobilized antibodies, and is quantitated using mass spectrometry against a spiked internal standard to yield a measure of protein concentration. In this study, we optimized a magnetic-bead-based platform amenable to high-throughput peptide capture and demonstrated that antibody capture followed by mass spectrometry can achieve ion signal enhancements on the order of 10^3, with precision (CVs <10%) and accuracy (relative error approximately 20%) sufficient for quantifying biomarkers in the physiologically relevant ng/mL range. These methods are generally applicable to any protein or biological fluid of interest and hold great potential for providing a desperately needed bridging technology between biomarker discovery and clinical application.
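Quantitation against a spiked stable-isotope-labeled standard, as described above, follows the isotope-dilution relationship: the endogenous amount equals the endogenous-to-standard response ratio times the amount of standard spiked. The sketch below assumes one peptide per protein molecule and complete digestion and recovery; the names and numbers are illustrative only.

```python
def endogenous_fmol(area_endogenous: float, area_sis: float,
                    sis_spiked_fmol: float) -> float:
    """Endogenous peptide amount from the light/heavy peak-area ratio."""
    if area_sis <= 0:
        raise ValueError("standard peak area must be positive")
    return area_endogenous / area_sis * sis_spiked_fmol


def protein_ng_per_ml(peptide_fmol: float, protein_mw_da: float,
                      sample_volume_ml: float) -> float:
    """Protein concentration, assuming 1 peptide per protein molecule.

    1 fmol of a 1 Da species weighs 1 fg, so fmol * Da / 1e6 gives ng.
    """
    if sample_volume_ml <= 0:
        raise ValueError("sample volume must be positive")
    return peptide_fmol * protein_mw_da / 1e6 / sample_volume_ml


# Hypothetical measurement: 100 fmol of heavy standard spiked into 0.1 mL
amount = endogenous_fmol(area_endogenous=2.0e5, area_sis=4.0e5,
                         sis_spiked_fmol=100.0)
conc = protein_ng_per_ml(amount, protein_mw_da=20000.0, sample_volume_ml=0.1)
```

Because the heavy and light peptides co-elute and ionize alike, losses that affect both equally cancel in the ratio.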

Fig. 2. Calibration curves for quantifying heavy-labeled pure AAC and TNFα peptides. The ion signals for different amounts of pure synthetic heavy peptides were measured using LC–MS and used to determine the linear range of quantification on the linear ion trap instrument. Duplicate analyses were performed for each amount of peptide injected. Error bars show the range for each measurement.

Acknowledgments Duke University Proteomics Core Facility http://www.genome.duke.edu/cores/proteomics/

Biostatistics Dr. Joseph Lucas

Funding NIH S10 grant Duke School of Medicine CTSA grant UL1RR024128