
Development of Sensitive High Performance Analytical Methods for the

Comprehensive Characterization of Proteins and Glycoproteins from Samples of

Clinical and Biopharmaceutical Importance

A dissertation presented

by

Dipak A. Thakur

to The Department of Chemistry and Chemical Biology

In partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in the field of Chemistry

Northeastern University Boston, Massachusetts June 2011

Development of Sensitive High Performance Analytical Methods for the

Comprehensive Characterization of Proteins and Glycoproteins from Samples of

Clinical and Biopharmaceutical Importance

by

Dipak A. Thakur

ABSTRACT OF DISSERTATION

Submitted in partial fulfillment of the requirements for the degree

of Doctor of Philosophy in Chemistry in the Graduate School of

Arts and Sciences of Northeastern University, June 2011

ABSTRACT

This thesis focuses on the development of ultrasensitive, high resolution analytical methods for the characterization of proteins and glycoproteins from samples of clinical and biopharmaceutical origin. The first portion describes laser capture microdissection (LCM) for the selective enrichment of homogeneous but low number cell populations, combined with downstream porous layer open tubular (PLOT) column liquid chromatography-mass spectrometry (LC-MS) using both one- and two-dimensional separations. The second portion of the thesis describes the ultrahigh performance analysis of intact recombinant α-human chorionic gonadotrophin glycoforms using capillary electrophoresis with accurate mass, high resolution Fourier transform ion cyclotron resonance mass spectrometry (CE-FTMS).

In Chapter 1, an overview of current analytical methods and technologies applied in the field of proteomics is presented. A critique of these technologies is also provided, laying the foundation for the developments and improvements to the current state of the art presented in the subsequent chapters.

In Chapter 2, the development of a micro-proteomic workflow for the comprehensive analysis of just 10,000 cells, collected by LCM, from invasive and metastatic epithelial cell types from a breast cancer patient is described. To minimize sample loss, the development of an efficient sample handling approach was necessary. To achieve this, separation and subsequent enzymatic digestion of the cell lysate were performed using short distance SDS-PAGE separation on tricine-PAGE gels. By combining this sample clean-up and fractionation approach with ultrasensitive 1D PLOT LC-MS, in excess of 1,000 proteins were identified following injection of just 1/10th of the digested lysate, or approximately 1,000 cells. The micro-proteomic workflow is highly suited to the comparative analysis of such small but highly informative LCM collected cell populations: more than 100 proteins were found to be differentially expressed, thereby facilitating a deeper understanding of the biological changes associated with the invasive to metastatic transition.

In Chapter 3, the application of an online 2D-RP/SCX/SPE/PLOT LC-FT-MS micro-proteomics platform is presented for the comparative proteomic analysis of LCM collected normal and triple negative breast cancer cell populations. Using the effective sample handling approach described in Chapter 2, followed by fractionation and ultrasensitive analysis of the tryptic digest corresponding to 4,000 cells on the 2D-RP/SCX/SPE/PLOT LC-FT-MS platform, in excess of 15,000 unique peptides corresponding to 4,259 proteins were identified. This deep proteome coverage further emphasizes the utility of the developed micro-proteomic platform for the analysis of trace quantities of proteins generated from small but biologically important LCM enriched cell populations.

In Chapter 4, the development and application of a high resolution CE-FTMS method for intact glycoform profiling of recombinant α-human chorionic gonadotrophin (r-αhCG) is described. The CE separation parameters used allowed for rapid analysis (<20 minutes) and high resolution of >60 different glycoforms bearing up to nine sialic acids, in addition to other glycoforms differing by the number and extent of uncharged monosaccharides. A low volume pressurized liquid junction, which preserves the high resolution of the CE separation, was used to interface the CE system with high resolution FTMS, thereby allowing accurate determination of the charge state and accurate mass of each intact glycoform following deconvolution. In addition to intact glycoform profiling, analysis of glycopeptides and glycans was also performed to determine and assign the population of oligosaccharides present at each individual glycosite, thereby facilitating complete and comprehensive characterization of r-αhCG. The methodology developed in Chapter 4 was further applied to the analysis of r-αhCG from different expression systems, CHO and murine cell based. The CE-FTMS method is readily applicable to the characterization of drug substance/product as well as to in-process monitoring of these complex glycoforms.


ACKNOWLEDGEMENT

I want to express my sincere and heartfelt gratitude to many people, teachers, colleagues and friends, who have helped me in reaching this milestone.

First, I would like to acknowledge my thesis advisor, Professor Barry L. Karger, for accepting me as his student and giving me an opportunity to work in his research group.

His guidance was constructive and aimed at bringing the best out of me as a scientist and a person. Importantly, I was inspired and motivated by his wisdom, enthusiasm and commitment to the highest standards.

I would like to thank Dr. Tomas Rejtar for devoting his time and energy while guiding me on various projects. I would also like to thank Dr. Marina Hincapie, Dr. Andras Guttman, Dr. Billy Wu, Dr. Shujia Dai, Dr. Sanwon Cha and Dr. Jonathan Bones for sharing their knowledge and expertise.

I would like to thank my dissertation committee members, Prof. Paul Vouros, Prof. Graham Jones and Prof. Roger Giese, for their time, suggestions and guidance.

Many thanks to Dr. Buffie Clodfelder-Miller (Cellular and Molecular Neuropathology Core, University of Alabama), Elizabeth Richardson, Shemeica Binns, Sonika Dahiya and Dennis Sgroi (Massachusetts General Hospital) for providing precious LCM samples. I would like to thank our collaborators N. Washburn, C.J. Bosques, N.S. Gunay, Z. Shriver and G. Venkataraman (Momenta Pharmaceuticals) for supporting the glycoform profiling project and for their full contribution towards the glycan analysis.

I would like to acknowledge the support and friendship of current and former researchers of the Barnett Institute, Dr. E. Moskovets, Dr. Vickor Andreev, Dr. Quanzhou Luo, Dr. Guihua Yue, Mr. Laxmi Manohar Akella, Dr. Claudia Donnet, Dr. Enrique Avarelo, Dr. Zoltan Sabo, Dr. Jim Glick, Somak Ray; and previous and current graduate students Lingyun Li, Ye Gu, Dongdong Wang, Majlinda Kulloli, Agnes Rafalko, Jonna Linholm-Ventola, Jack Liu, Chen Li, Peter Li, Chris Morgan, Vaneet Sharma, Rose Gathungu, Joshua Klaene and Fateme Tousi.

I would like to express my gratitude to Jeffrey Kesilman, Felicia Hopkins, Richard Pumphrey, Andrew Bean, Jana Volf and Bill O'Neil for their support.

I would like to acknowledge my wife, Vaishali, daughter Radhika, and son Hrishikesh for their love, support, sacrifice and compromise during 5 long years. Many many thanks to my parents, Sudha and Arjun Thakur, for their support, encouragement and care. I would like to thank my brother, Ganesh and his family, for supporting, guiding and encouraging me during my graduate studies. I would like to express my gratitude to my sister Jyoti and her family for their support and encouragement.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS AND CONVENTIONS

Chapter 1: Overview of Technologies and Methodologies for Proteomics Analysis
1.1 Introduction
1.1.1 Proteomics: An Overview
1.2 Shotgun Proteomics Methodologies
1.2.1 Samples
1.2.1.1 In Vitro Sample Source: Cell lines
1.2.1.2 In Vivo Sample Sources
1.2.2 Tissue Microdissection
1.2.2.1 Laser Capture Microdissection
1.2.2.2 Laser Microbeam Microdissection (LMM)
1.2.2.3 Comparison of LCM and LMM
1.2.3 Sample Preparation
1.2.3.1 SDS-Polyacrylamide Gel Electrophoresis (SDS-PAGE)
1.2.4 Separation Techniques
1.2.4.1 High Pressure Liquid Chromatography
1.2.5 Mass Spectrometry
1.2.5.1 Ionization Methods
1.2.5.2 Mass Analyzers
1.2.5.3 Database Searching Tools for Proteomics
1.3 Microproteomics
1.3.1 Alternative strategies for protein digestion
1.3.1.1 Solvents based approach
1.3.1.2 Cleavable
1.3.1.3 Filter-Aided Sample Preparation (FASP)
1.3.2 High Performance Liquid Chromatography for Microproteomics
1.3.2.1 Peak Capacity
1.3.2.2 Narrow-bore column and ESI-MS
1.3.2.3 Porous Layer Open Tubular (PLOT) Columns
1.4 Protein Glycosylation Analysis
1.4.1 Intact Glycoprotein Analysis
1.4.1.2 Capillary Electrophoresis
1.4.1.3 Capillary Electrophoresis Coupled to Mass Spectrometry
1.4.1.4 Application of CE-MS for Analysis of Intact Glycoforms
1.4.2 Glycan analysis
1.4.2.1 Glycan release methods
1.4.2.2 Enzymatic Sequencing of Oligosaccharides
1.4.2.3 HPLC analysis of glycans
1.5 References

Chapter 2: Proteomic Analysis of 10,000 Laser Captured Microdissected Breast Tumor Cells Using Short Migration on SDS-PAGE and Porous Layer Open Tubular (PLOT) LC-MS
ABSTRACT
2.1 Introduction
2.2 Experimental Section
2.2.1 Chemicals
2.2.2 Clinical Specimens
2.2.3 Laser Capture Microdissection
2.2.4 Cell Lysis, SDS-PAGE and In-Gel Digestion
2.2.5 Nano LC-ESI-MS with 10 µm i.d. PLOT Column
2.2.6 Protein Identification
2.2.7 Identification of Differentially Abundant Proteins by Spectral Counts
2.2.8 Reproducibility of Replicate Analyses of Metastatic and Invasive Breast Cancer Samples
2.2.9 Gene Ontology Annotation with DAVID (Database for Annotation, Visualization and Integrated Discovery)
2.3 Results and discussion
2.3.1 Overview of Proteomic Workflow
2.3.2 Cell Lysis and Protein Extraction from the LCM Cap
2.3.3 Short SDS-PAGE Run for In-Gel Digestion
2.3.4 Online PLOT/LC-ESI-MS
2.3.5 Proteomic Analysis of Three Replicates of 10,000 Breast Cancer Cells
2.3.6 Identification of Differentially Expressed Proteins
2.3.7 Gene Ontology Analysis
2.4 Conclusions

Addendum to Chapter 2
Evaluation of Short SDS-PAGE Separation Distance for Sample Preparation of Small Protein Amounts Prior to LC/MS Proteomic Analysis
2.1A Methods and Materials
2.1.1 Chemicals
2.1.2 SDS-PAGE Separation and In-Gel Digestion
2.1.3 LC-MS/MS Analysis
2.1.4 Protein Identification
2.2A Results
2.3 Reference

Chapter 3: Comparative Proteomic Analysis of 10,000 Triple Negative Breast Cancer and Normal Mammary Epithelial Laser Microdissected Cells Using On-line 2D RP-SCX/Porous Layer Open Tubular Column (PLOT) LC-MS
Abstract
Introduction
2. Materials and Methods
2.1. Chemicals and Materials
2.2. Laser Capture Microdissection
2.3. Protein Extraction and Digestion
2.4. Column Preparation and Two-Dimensional Separation
2.5. MS Analysis and Data Analysis
2.6. Spectral Index (SpI) for Identification of Differentially Abundant Proteins
2.7. Gene Ontology by DAVID (Database for Annotation, Visualization and Integrated Discovery), a Functional Annotation Clustering Tool
2.8 Gene Set Enrichment Analyses (GSEA) for Functional Significance of Differentially Abundant Proteins
3. Results and Discussion
3.1 Experimental and Bioinformatics Workflow for Proteomic Analysis of 10,000 LCM Collected Normal and Cancer Breast Epithelial Cells
3.2. Peptide and Proteins Identification
3.3. Spectral Index Analysis for Determination of Differentially Abundant Proteins
3.4 DAVID Functional Annotation Analysis of Differentially Abundant Proteins
3.5 Gene Set Enrichment Analyses (GSEA) for Canonical Pathway Analysis
Conclusions
References

Chapter 4: Characterization of the Intact α-Subunit of Recombinant Human Chorionic Gonadotropin Glycoforms by High Resolution CE-FT-MS*
Abstract
4.1 Introduction
4.2 Experimental
4.2.1 Recombinant r-αhCG
4.2.2 Chemicals
4.2.3 CE-MS System
4.2.4 Deglycosylation and Analysis of Released Glycans
4.2.5 Trypsin Digestion of r-αhCG Expressed in a Murine Cell Line
4.2.6 LC-MS Analysis of r-αhCG Tryptic Digest
4.2.7 Data Analysis
4.3 Results and Discussion
4.3.1 Intact Protein Analysis
4.3.2 Repeatability of the Intact Protein Separation
4.3.3 Analysis of the Released Glycans
4.3.4 Glycopeptide Analysis
4.3.5 Analysis of Combined Data
4.3.6 Analysis of r-αhCG Expressed in CHO Cell Culture
4.4 Conclusions
4.5 References

Chapter 5: Summary and Future Directions

LIST OF FIGURES

Chapter 1
Figure 1.1 Conceptual organization of proteomic experiments
Figure 1.2 Human islet protein reference map
Figure 1.3 The principles of laser capture microdissection (LCM)
Figure 1.4 Common matrices used in MALDI mass spectrometry
Figure 1.5 Operational principle of the FTICR
Figure 1.6 Cutaway view of the Orbitrap mass analyzer
Figure 1.7 Low energy collision induced dissociation of peptide
Figure 1.8 Mobile Proton Theory
Figure 1.9 Illustration of effect of concentration of analytes and flow rate on ESI processes
Figure 1.10 Comparison of normal flow rate electrospray vs. a lower flow rate electrospray
Figure 1.11 Schematic diagram of the low dead volume connections used to design 1D and 2D SPE-PLOT system
Figure 1.12 Diagram of the advanced on-line 2-D SCX/PLOT/MS system using a 3.2 m × 10 µm i.d. PLOT column and an online triphasic trapping column
Figure 1.13 Chemical diversity of glycans
Figure 1.14 Electric double layer at the capillary wall and creation of EOF
Figure 1.15 Different types of CE/MS interfaces
Figure 1.16 CZE-ESI-MS analysis of a recombinant human EPO
Figure 1.17 Exoglycosidases commonly used to determine the structure of the N-glycans

Chapter 2
Figure 1. Shotgun proteomic workflow for the analysis of 10,000 LCM collected breast cancer cells collected from breast tumor and lymph node tumor
Figure 2. Optimization of LC-MS parameters
Figure 3. Assessment of the variability in proteomic profiles associated with three replicate runs each of invasive and metastatic breast cancer samples (three samples of 10,000 cells each)
Figure S1. Selection of gel type and SDS-PAGE separation distance for proteomic analysis of small sample amounts

Chapter 3
Figure 1. Shotgun proteomics workflow to analyze breast epithelial cells collected from normal and triple negative breast tumor epithelium
Figure 2. Peptide and protein identifications from 6 salt steps
Figure 3. Peptide and protein identifications in the six samples
Figure 4. Participants of cell cycle (G1-S Phases) were significantly enriched in triple negative breast cancer (TNBC) cells
Figure 5. Structural molecular organization was significantly deficient in triple negative breast cancer (TNBE)

Chapter 4
Figure 1A. Diagram of CE-MS system for analysis of intact glycoproteins
Figure 1B. Photograph of CE system coupled to LTQ-FTMS for analysis of intact glycoproteins
Figure 2. Illustration of the separation resolution of CE-MS analysis of intact α-hCG derived from a murine cell line
Figure 3A. CE-MS separation of r-αhCG produced in a murine cell line
Figure 3B. CE-MS separation of r-αhCG produced in a murine cell line
Figure 4. Chromatograms and fragmentation spectra of glycan analysis
Figure 5. LC/MS/MS analysis of sulfated and α-galactose containing N-glycans
Figure 6. Exoglycosidase characterization of galactose-α-galactose-containing species
Figure 7. CE-MS separation of r-αhCG produced in a CHO cell line

LIST OF TABLES

Chapter 2
Table 1. Number of proteins identified per gel section per sample from three technical replicates of 10,000 mouse liver cells
Table 2. Number of proteins identified per gel section per sample from three replicates of 10,000 invasive breast cancer cells
Table 3. Enriched Gene-Ontology (GO) terms for with FDR less than 5% and P value less than 0.05 are shown in bold
Table S1. Peptides and proteins identified using three SDS-PAGE separation conditions

Chapter 3
Table 1. Details about normal breast specimens and triple negative breast cancer specimens
Table 2. List of differentially abundant proteins between TNBE and BNE
Table 3. Representative enriched, functional clusters with corresponding GO terms for differentially expressed proteins identified by DAVID
Table 4. List of the canonical pathways found to be overrepresented in TNBE samples
Table 5. List of the canonical pathways found to be overrepresented in NBE samples

Chapter 4
Table 1. Repeatability of peak area measurements for 20 glycoforms on r-αhCG
Table 2. Summary table N-linked glycans in r-αhCG
Table 3. Abundance of individual glycopeptides
Table 4. List of theoretical and observed glycoforms
Table 5. Abundance of r-αhCG glycoforms produced in CHO cells

LIST OF ABBREVIATIONS AND CONVENTIONS

2D GE Two-dimensional gel electrophoresis

2-AB 2-amino benzamide

CE Capillary electrophoresis

CID Collision Induced Dissociation

CPAS Computational proteomics analysis system

CTC Circulating tumor cells

CZE Capillary zone electrophoresis

DAVID Database for annotation, visualization and integrated discovery

DTA Sequest data files

DTT dithiothreitol

EIE Extracted ion electropherograms

EOF Electroosmotic flow

ESI Electrospray ionization

FASP Filter-aided sample preparation

FDR False discovery rate

FFPE Formalin-fixed paraffin-embedded

FTICR Fourier Transform Ion Cyclotron Resonance

GO Gene ontology

GSEA Gene set enrichment analyses

HILIC Hydrophilic interaction liquid chromatography

IAA Iodoacetamide

ICAT Isotope-Coded Affinity Tag

INV Invasive

IPG Immobilized pH gradient

IPI International Protein Index

IR Infra red

IT Ion Trap

iTRAQ Isobaric tags for relative and absolute quantitation

LCM Laser capture microdissection

LMM Laser Microbeam microdissection

LTQ Linear Ion Trap

MALDI Matrix-assisted laser desorption/ionization

MBE Invasive malignant breast epithelial

MCM Minichromosomal maintenance

MET Metastatic

MS Mass spectrometry

NBE Normal breast epithelial

NBE Non-cancerous breast epithelial

NCBI National Center for Biotechnology Information

PALM Pressure assisted Laser microdissection

PGC Porous graphitic carbon

PLOT Porous-layer open-tubular

ppb Parts per billion

ppm Parts per million

RPLC Reversed-phase liquid chromatography

PS-DVB Polystyrene-divinylbenzene

r-αhCG Recombinant human chorionic gonadotrophin

SCX Strong Cation Exchange

SDS-PAGE Sodium dodecyl sulfate polyacrylamide gel electrophoresis

SILAC Stable isotope labelling by amino acids in cell culture

SPE Solid phase extraction

SpI Spectral index

TNBC Triple negative breast cancer

TNBE Triple negative malignant breast epithelial

TOF Time-of-flight

UV Ultraviolet

Xcorr Cross-correlation score


Chapter 1: Overview of Technologies and Methodologies for Proteomics Analysis


1.1 Introduction

1.1.1 Proteomics: An Overview

Proteomics[1] offers a complementary approach to genomic technologies by investigating biological phenomena at the global protein level. The emergence of mass spectrometry-based proteomic technologies has advanced our understanding of the complexity and dynamic nature of proteomes, at the same time revealing that no 'one-size-fits-all' proteomic strategy can be used to solve all biological problems. Two technologies have been responsible for the recent, rapid advance of proteomics: first, the development of new strategies for peptide sequencing using mass spectrometry, including soft ionization techniques such as electrospray ionization (ESI) and matrix-assisted laser desorption/ionization (MALDI); and second, the miniaturization and automation of liquid chromatography. However, the high expectations for proteomics have been tempered by the discovery of the huge molecular complexity and dynamic nature of the proteome, which introduce difficulties greater than those encountered in either genome or transcriptome studies. In particular, complexities related to splice variants, post-translational modifications (PTMs), dynamic ranges of protein abundance covering ten orders of magnitude or more in plasma, protein stability, and dependence on cell type or physiological state have challenged our ability to characterize proteomes comprehensively in a reasonable time [2,3,4].

Despite the above challenges, proteomic technologies have already contributed significantly to the life sciences and are today an integral part of biological research efforts. Currently, the field of proteomics covers diverse research topics such as protein expression profiling, analysis of signaling pathways, and protein biomarker discovery, among others [4]. It is important to be aware that within each area, unique proteomic approaches need to be applied; these approaches differ widely in the skills they require, their difficulty and their expense. Based on their objectives, proteomic experiments are categorized as either discovery or assay experiments. Proteomic assay experiments investigate a quantitative change in a small, predefined set of proteins or peptides, whereas discovery experiments focus on the analysis of large, unbiased sets of proteins. The measurement of cardiac troponins in human plasma samples is one example of an assay experiment [3,4]. An example of a discovery proteomic experiment is the Human Proteome Organization Plasma Project, which aims to catalog all proteins and peptides in human plasma.

Discovery proteomics experiments are divided into comprehensive, broad-scale or focused approaches because these distinctions determine how a biological question is approached technically. Comprehensive approaches aim at enumerating as many components of a biological system as possible [5]. Broad-scale experiments target a selected fraction of the expressed proteome, for example, the phosphoproteome or the glycoproteome. Comprehensive and broad-scale experiments are used to profile qualitative and quantitative changes that take place as a result of a perturbation to a biological system or differences in genetic background [6,7]. Focused approaches, such as identification of the components of a protein complex, involve co-purification of relatively few interacting proteins and their analysis. Here, the aim is to identify the components of multiprotein complexes and their interaction mechanisms in order to understand physiological and pathogenic processes. Once the components of multiprotein complexes are determined, they are further monitored using assay methods to develop therapies [8].

Characterization of a single protein that is isolated from natural or recombinant sources involves determination of its mass, identity, post-translational modifications and purity. The comprehensive characterization task draws on decades of experience in protein chemistry [4]. Figure 1.1 presents a diagram of the various components of proteomics discovery and assay.

Figure 1.1 Conceptual organization of proteomic experiments. Reprinted from reference [4].


1.2 Shotgun Proteomics Methodologies

Figure 1.2 Human islet protein reference map. The proteins were loaded onto an IPG strip (pH 3-10) and subsequently separated by mass on a gradient (8-12%) SDS-PAGE gel. Reprinted from reference [9].

The combination of two-dimensional gel electrophoresis and mass spectrometry (2DE-MS) has traditionally been used to determine changes in protein identity and protein abundance in a complex protein mixture [10]. Using this combination, a protein mixture is first separated based on isoelectric point and then by molecular weight into almost single protein spots; this strategy is therefore sometimes called the "single protein" method [11]. To identify individual proteins separated by 2DE, the excised gel pieces are subjected to in-gel digestion and subsequent analysis using tandem mass spectrometry. As this method provides very high resolution, the visible image of a stained 2D gel is used to observe changes in protein abundance, protein isoform and protein modification [9]. Figure 1.2 shows an example of a complex 2D gel pattern from a proteome. While powerful, the method is difficult to automate, is slow to operate and does not work well with highly hydrophobic proteins [12].

In the past few years, shotgun proteomics, introduced by Yates et al. [10], has replaced conventional 2DE-MS due to its inherent high throughput capability and its ability to detect and quantitate more proteins than 2D gel electrophoresis. Shotgun proteomics is a method in which the total proteome is digested to peptides, and the resulting highly complex peptide mixture is separated by one-dimensional or two-dimensional liquid chromatography coupled to mass spectrometry (MS). The method consists of four steps: sample preparation, liquid chromatography, MS and data processing. The results are interpreted using bioinformatics tools that are rapidly developing [13]. Sample preparation for proteomic analysis involves multiple steps such as protein extraction, enrichment, digestion and peptide clean-up. The sample preparation step extracts the proteins from a biological specimen such as blood, cell lines or tissues. The extracted protein mixture may be further fractionated to reduce the protein complexity using chromatographic, electrophoretic or affinity purification procedures. To facilitate their identification, the proteins are digested with highly specific proteolytic enzymes, such as trypsin, to generate fragments of suitable mass for MS detection. The digested peptides are subsequently separated using high performance liquid chromatography coupled to ESI or MALDI mass spectrometry. Both the precursor mass and the MS/MS fragmentation spectra can be used to identify and quantitate the peptides. Generally, the tandem mass spectra, which provide peptide sequence data based on MS/MS fragmentation patterns, are searched against a specific protein database (e.g., NCBI and Swiss-Prot[14]) using various algorithms (e.g., Mascot[15] or SEQUEST[16]) to determine protein identity. The advantage of shotgun proteomics over the 2DE approach is that the former can analyze hydrophobic membrane proteins as well as proteins with a broad range of pI or size. In addition, the protein dynamic range covered by the shotgun method can be higher than that covered by the 2DE method [17].
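The digestion and precursor mass steps described above can be illustrated with a minimal Python sketch. It assumes the commonly used simplified trypsin rule (cleavage C-terminal to lysine or arginine, except when the next residue is proline) and standard monoisotopic residue masses. The function names and the toy sequence are hypothetical and chosen only for illustration; they are not taken from Mascot, SEQUEST or any other search engine, which apply considerably more elaborate scoring.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added for the free N- and C-termini
PROTON = 1.00728   # proton mass, used for [M + zH]z+ ions


def tryptic_digest(sequence, missed_cleavages=0):
    """In-silico digest: cleave after K or R unless the next residue is P."""
    if not sequence:
        return []
    # Positions at which the chain is cut (index of the first residue
    # of the following peptide).
    sites = [i + 1 for i, aa in enumerate(sequence[:-1])
             if aa in "KR" and sequence[i + 1] != "P"]
    bounds = [0] + sites + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        # Allow up to `missed_cleavages` skipped cleavage sites per peptide.
        last = min(start + 1 + missed_cleavages, len(bounds) - 1)
        for end in range(start + 1, last + 1):
            peptides.append(sequence[bounds[start]:bounds[end]])
    return peptides


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mz(peptide, charge):
    """m/z of the protonated peptide ion at the given charge state."""
    return (monoisotopic_mass(peptide) + charge * PROTON) / charge


if __name__ == "__main__":
    toy_protein = "AKRPGRC"  # hypothetical sequence, for illustration only
    for pep in tryptic_digest(toy_protein):
        print(pep, round(mz(pep, 2), 4))
```

In a database search, theoretical peptide masses computed in this way are compared with the measured precursor masses within a mass tolerance before the MS/MS fragmentation patterns are scored.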

1.2.1 Samples

Cancer is one of the leading causes of death worldwide. In order to develop treatments for cancer, protein biomarkers that can indicate (1) the presence of disease, (2) disease reduction or progression, and (3) response to treatment are highly desired. In biomarker discovery proteomic experiments, a variety of sample sources can be used, such as cell lines, tissues and body fluids.

1.2.1.1 In Vitro Sample Source: Cell lines

Cell lines are routinely used in proteomic studies as they may be easily manipulated with different chemical additives or physical conditions. Because the population of cells can be large (as many as 100,000,000 cells), there are no limitations with respect to the amount of sample available. Cancer cell lines are extensively studied using quantitative proteomics for:

1) identification of differentially abundant proteins between diseased and normal cells of the same type,

2) identification of pathways associated with a specific phenotype, e.g., cancer progression,

3) drug resistance studies, and analysis of proteins secreted by cancer cell lines for potential biomarker discovery [18].

One must, however, always keep in mind that a cell line is a model system that may or may not represent the in vivo condition [19].

1.2.1.2 In Vivo Sample Sources

Biofluids

In contrast to cell lines, body fluids such as serum[20], plasma[21], saliva[22], urine, nipple aspirate, cervical-vaginal fluid[23] and exhaled breath condensate[24] closely represent in vivo biological events. Compared to biopsied samples, biofluids are easy to collect at low cost using less invasive methods [25,26]. Among the body fluids, blood, the most common human sample used in diagnosis, is often the focus for the discovery of protein biomarkers for disease [26,27]. However, the challenges in the analysis of serum or plasma are the high complexity of the proteome, with a wide dynamic range (at least 10 orders of magnitude[28]), and the anticipated low relative abundance of many disease-specific biomarkers.

Compared to blood, proximal fluids, i.e., body fluids that are close to or in direct contact with the site of disease, can be an attractive alternative sample type for biomarker discovery. The proteins or peptides secreted, shed or leaked from diseased tissue are likely to be enriched in proximal fluids with respect to both blood and disease-free control fluid of the same type [29]. Examples of proximal fluids are urine for bladder and kidney disease, nipple aspirate or ductal lavage for breast cancer, and cerebrospinal fluid for intracranial processes [30]. Evidence of marker enrichment in proximal fluids was demonstrated in a study of ovarian cancer, where both ovarian cyst fluid and ascites fluid constituted proximal fluid [31].

Tissue Samples

Compared to blood and proximal fluids, analysis of tissue offers several important advantages. 1) During the biomarker discovery on tissue samples, the proteins are studied in their surroundings. 2) The possibility of identifying potential biomarkers is highest in damaged/diseased tissues as they are likely to be concentrated in those tissues. Therefore, it makes sense to look for markers in tissue samples due to their higher concentration and relatively narrower dynamic range of proteins. To perform the discovery studies, tissue samples can be used either from animal models or from human biopsied samples. Mouse [32-34] and rat [35,36] are two of the most widely used animal models for proteomic research, though human biopsies are the most appropriate samples to study human diseases. However, human biopsied samples are not as easily available as tissue samples from animal models, and controlled experiments are clearly much easier to perform on animal models. The biopsied samples require extra care during their processing and storage. That is, the tissue specimens are frozen immediately after their excision and stored at -80ºC.

Conventionally, in order to preserve all the biopsied samples and to maintain their morphology, the samples are fixed in formalin and embedded in paraffin [37]. The formalin fixation causes cross-linking of the proteins, and the paraffin limits water contact.

Huge collections of formalin fixed and paraffin embedded (FFPE) biopsied samples are preserved and last many years [38]. Such samples have a well documented clinical history of individual patients and are available for prospective analysis. To perform proteomic analysis of FFPE samples, decross-linking and efficient extraction of proteins are necessary. To remove paraffin from FFPE tissue blocks, the blocks are treated with xylene. Further, formalin fixed tissue blocks are boiled in a solution containing metal ions. This procedure is termed heat induced antigen retrieval [39,40].

High temperature, above 90°C, is found to be essential to decross-link methylene bridges between the proteins. The studies performed on FFPE samples, in order to obtain the comprehensive proteome, use two different approaches. The first is extraction of intact proteins with SDS and high temperature [41-43]. The commercialized product, called the Qproteome FFPE tissue kit (QIAGEN, Germantown, MD), uses proprietary chemistry to extract full length proteins for subsequent analysis.

The second approach, a novel approach, is to perform in-solution enzymatic digestion on FFPE samples, after heat induced decross-linking, to directly obtain peptides for shotgun proteomic analysis [25]. The commercialized product based on the latter extraction principle is called the Liquid Tissue-MS protein prep kit (Expression Pathology, Inc., Rockville, MD).

1.2.2 Tissue Microdissection

The microenvironment of a tumor tissue sample is highly heterogeneous[44].

The pathologist identifies malignant cells based on their differential staining and morphology. The malignant cells are surrounded by normal-related and other types of cells in the tissue matrix. In order to perform a detailed study of biopsied samples to gain information on proteomic changes between malignant and normal cells, the malignant cells need to be separated into a homogeneous population. Tissue microdissection is an indispensable tool to enrich distinct cell types from a heterogeneous tissue matrix in an efficient and accurate manner. Before the advent of microdissection, fluorescence-activated cell sorting (flow cytometry) [45] and magnetic-bead based cell sorting [46] were the methods of choice for cell separation.

However, these methods employed enzymes for breakage of tissue structure, which may alter or modify the cellular constituents in a number of ways. Microdissection techniques have the advantage over such enzymatic dissociation that they allow selection of individual cells under microscopic inspection of the intact tissue.

Microdissection techniques can be classified into two major classes, manual microdissection and laser assisted microdissection. Early efforts to dissect specific cell types from tissue sections used sharp tools such as scalpel blades and needles [47].

The other manual dissection technique, called “negative ablation”, as the name suggests, destroys the unwanted cells surrounding the cells of interest and collects the non-ablated cells using a needle [48].

Though manual microdissection techniques were useful in obtaining a homogeneous cell population, these methods were slow, tedious and required considerable expertise to perform. In addition, the manual microdissection techniques suffered from issues such as sample handling and contamination. To address these issues and to perform fast, clean and accurate microdissection, laser-based microdissection technology, which includes laser capture microdissection and laser microbeam microdissection, was developed [49]. Over the years, this technique has proved to be effective, as more than one thousand research articles have been published on samples procured using this technique [50].

1.2.2.1 Laser Capture Microdissection

Laser Capture Microdissection (LCM) is a laser based cell procurement method that was developed in the mid 1990s by Emmert-Buck, Liotta and colleagues at the National Institutes of Health (NIH) and designed to perform fast and accurate microdissection of tissue samples [49,51]. The earliest design of the LCM system was commercialized by Arcturus Biosciences Inc. (now part of Applied Biosystems), and later on Leica and PALM introduced non-contact systems based on a technique called laser microbeam microdissection.

Figure 1.3. The principles of laser capture microdissection (LCM). (a) The scheme of LCM. (b) Comparison of properly melted polymer spots and poor spots. Only cells lying within the dark ring of melted polymer will be targeted for LCM. (c) Physical forces involved in LCM. (d) A single cell bound to the thermolabile polymer. Reprinted from reference [52].

The principles of contact based LCM technology are shown in Figure 1.3. In brief, the tissue specimens are first processed by sectioning and staining, and then examined to identify cells of interest based on their staining and morphology. To selectively capture cells of interest from the tissue sections, an LCM cap with a thermolabile polymer membrane is placed on the tissue section. An infrared (IR) laser is focused through the transparent cap material, heating and melting the membrane, and thus causing the targeted cells to adhere to the membrane. The cells of interest are then dissected by lifting the LCM cap away from the tissue section. The thickness of the tissue sections (5-15 µm) used for microdissection is critical from an operational point of view. A section thinner than 5 µm is less than the size of a single cell, whereas a section thicker than 15 µm is too thick for specific cell collection. Normally, the IR laser spot diameters available for LCM are 7.5, 15.0 and 30.0 µm. The minimum IR laser diameter of 7.5 µm, which is close to the size of a single human epithelial cell (7.0 µm) [52], is capable of accurately capturing even a single cell (see Figure 1.3).

Importantly, laser capture microdissection is a contact based technique in which the laser heating of the membrane and the histological staining of the tissue can potentially result in changes in proteins and DNA. On the other hand, laser microbeam microdissection, a non-contact collection method, has been reported to cause a negligible effect on the quality of proteins and nucleic acids extracted from the microdissected cells [53,54].

1.2.2.2 Laser Microbeam Microdissection (LMM)

Laser microbeam microdissection employs a high intensity UV laser microbeam which is navigated around the edges of the targeted cells or around a chosen area of the tissue, in order to dissect the target cells/chosen area from the rest of the tissue by destroying the bonds between the cells and their surrounding tissue [55,56]. Subsequently, another laser pulse is shot under the selected cells/tissues to catapult the desired cells into the collection tube [57]. PALM (Pressure-Assisted Laser Microdissection) is a commercial instrument which uses a laser pulse to catapult the isolated cells/tissues to the collection tube, whereas Leica AS LMD (Laser Microdissection), another commercially available LMM system, simply allows the cells of interest to fall into the collection tube by gravity. Recently, Expression Pathology introduced “DIRECTOR” Microdissection slides, which are based on Laser Induced Forward Transfer (LIFT) Technology utilizing a thin layer energy transfer coating. Laser energy is transferred to the coating and thus results in evaporation of the coating. The evaporation of the transfer coating causes the selected feature of the tissue section to fall into the collection tube.

1.2.2.3 Comparison of LCM and LMM

With the older version of the Arcturus LCM instrument, any material adhering to the LCM cap was collected. This type of nonspecific collection of loose material from the tissue specimen is a potential source of contamination. To overcome this issue, Arcturus introduced a newer design of LCM caps in which the cap remains slightly away from the tissue specimen, allowing collection of only those cells which are in contact with the melted thermolabile membrane. In addition, to avoid contaminants such as keratin and loose tissue material, sticky “prep strips” can be used [12]. In contrast to LCM, the primary source of contamination in LMM is fine tissue material resulting from laser ablation of the edges of targeted cells/tissue areas [58].

In the case of LCM, the tissue preparation procedure, which includes slide selection, tissue staining and dehydration, and microscopic evaluation of the tissue specimen, has to be strictly followed in order to obtain effective microdissection. In the case of LMM, by contrast, the tissue preparation is less complicated, and parameters such as tissue thickness are more flexible.

LCM, the contact-based microdissection technique, is advantageous compared to LMM, since the cells collected on the thermolabile membrane can be easily viewed under the microscope for their homogeneity. In LMM, a thin polyethylene naphthalate (PEN) membrane is required between the glass slide and the tissue section; otherwise, the catapulted cells might pulverize to debris in the collection tube. Thus the collected cells remain relatively intact and can be visualized.

One example of an application of LCM is to investigate the molecular basis of breast tumor formation. This disease is not clearly understood due to difficulties encountered while studying the early stages of disease progression. Breast cancer progression is a multistep process, involving the premalignant stage of atypical ductal hyperplasia (ADH), the preinvasive stage of ductal carcinoma in situ (DCIS), and the potentially lethal stage of invasive ductal carcinoma (IDC) [59]. The obstacles in studying breast cancer lesions are the complexity and heterogeneity of the tissue, the microscopic size of the lesions (<500 µm) and the limited extent of ADH and DCIS (<1% of the total cellular population) in clinical specimens. LCM is effective in studying these sample types as it provides a homogeneous cell population, in contrast to the usual sample extraction methods which involve homogenization of bulk tissue samples [60]. Chapters 2 and 3 of this thesis describe the advanced proteomic analysis of only 10,000 LCM collected breast cancer cells.

1.2.3 Sample Preparation

In proteomics, two fundamental strategies, top-down and bottom-up, are primarily used for identification and characterization of proteins in complex mixtures. In the top-down approach, the intact protein molecules are introduced into the mass analyzer and subjected to gas phase fragmentation for their identification. For samples with only moderate complexity, e.g., glycoforms of biopharmaceuticals and variants of a single protein, this approach is suitable as it requires minimal sample preparation. Chapter 4 of this thesis describes glycoform profiling of recombinant α-human chorionic gonadotropin (α-hCG) using high resolution capillary electrophoresis coupled with high mass resolution FT-MS [61].

In the bottom-up approach, there are two ways to convert proteins extracted from biological specimens to peptides which are suitable for mass spectrometry (MS) based proteome analysis. The first solubilizes the proteins with detergents and separates the proteins by sodium dodecyl sulfate (SDS) polyacrylamide gel electrophoresis. The proteins trapped by the gel are subjected to enzymatic digestion, i.e., “in-gel” digestion. The second sample preparation method is detergent-free, as it uses strong chaotropic reagents such as urea and thiourea for cell lysis, protein extraction and solubilization. The enzymatic digestion of the proteins in the presence of denaturing reagents is termed “in-solution” digestion.

The in-gel digestion method is advantageous over in-solution digestion due to the absence of most impurities which could interfere with digestion; however, the gel may limit peptide recovery. On the other hand, in-solution digestion can be more readily automatable and can minimize losses associated with sample handling.

However, the use of chaotropes may result in incomplete solubilization of the proteome, and digestion may be impeded by interfering substances.


1.2.3.1 SDS-Polyacrylamide Gel Electrophoresis (SDS-PAGE)

The mobility of proteins during gel-electrophoresis depends upon the following factors:

1) electric field strength,

2) total charge on the molecule,

3) size and shape of the molecule and

4) ionic strength of the buffer and properties of the gel matrix through which the molecules are migrating.

The polyacrylamide gel matrix is in extensive use for protein prefractionation [62]. Gel matrices act like a molecular sieve, and their sieving function depends on the mesh size of the gel. The polyacrylamide gels are synthesized by the polymerization of acrylamide monomers into long chains and the reaction of these chains with bifunctional compounds such as N,N'-methylene-bisacrylamide (bis) to form a sieve-like structure. The mesh size of the gel is determined by the concentration of acrylamide and bisacrylamide (%T and %C).

%T = concentration of total monomer

%C = concentration of cross linker (as a percentage of the total monomer)

The higher the concentration of monomer (%T), the smaller the mesh size of the gel [63].
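The two gel parameters can be illustrated with a short calculation. This is a minimal sketch that assumes the conventional definitions (%T as grams of acrylamide plus bis per 100 mL of gel solution, %C as the bis fraction of the total monomer); the function name and the recipe values are hypothetical and are not taken from this work.

```python
def gel_composition(acrylamide_g, bis_g, volume_ml):
    """Return (%T, %C) for a polyacrylamide gel recipe.

    Assumed conventions: %T is grams of total monomer per 100 mL,
    %C is the cross-linker as a percentage of the total monomer.
    """
    total_monomer_g = acrylamide_g + bis_g
    percent_t = 100.0 * total_monomer_g / volume_ml
    percent_c = 100.0 * bis_g / total_monomer_g
    return percent_t, percent_c


# Hypothetical recipe: 11.68 g acrylamide and 0.32 g bis in 100 mL.
percent_t, percent_c = gel_composition(acrylamide_g=11.68, bis_g=0.32, volume_ml=100.0)
print(f"%T = {percent_t:.1f}, %C = {percent_c:.2f}")
```

Raising the acrylamide mass at fixed volume raises %T, which corresponds to the smaller mesh size described above.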

Gel electrophoresis is performed under either continuous or discontinuous buffer conditions. The running buffer and gel buffer are the same in the continuous buffer system, whereas the discontinuous buffer system has different gel and running buffers. The gel system contains two gel layers, the stacking and the separating layer. Electrophoresis with a discontinuous buffer system provides sample concentration and higher resolution. SDS-PAGE is performed under denaturing conditions, where the detergent denatures and opens the protein by wrapping around the peptide backbone of the protein. SDS binds to the protein at a ratio of approximately 1.4 g of SDS per gram of protein. The highly negative SDS-protein complexes are separated on the gel based on their molecular weight rather than their charge, as the protein acquires a net negative charge which is proportional to the length of the protein. The electrophoretic mobility of the proteins through the gel decreases approximately linearly with the logarithm of the protein molecular weight [64].
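The relationship between mobility and the logarithm of molecular weight is what allows a molecular weight to be estimated from a gel. The following is a minimal sketch of that calibration, assuming a straight-line fit of log10(MW) against relative mobility (Rf); the marker values and function names are made up for illustration and do not come from this work.

```python
import math

# Hypothetical marker proteins: (molecular weight in kDa, relative mobility Rf).
STANDARDS = [(200.0, 0.10), (100.0, 0.25), (50.0, 0.40), (25.0, 0.55), (12.5, 0.70)]


def fit_calibration(standards):
    """Least-squares fit of log10(MW) = slope * Rf + intercept."""
    xs = [rf for _, rf in standards]
    ys = [math.log10(mw) for mw, _ in standards]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_mw(rf, slope, intercept):
    """Estimate molecular weight (kDa) of a band from its relative mobility."""
    return 10 ** (slope * rf + intercept)


SLOPE, INTERCEPT = fit_calibration(STANDARDS)
unknown_mw = estimate_mw(0.475, SLOPE, INTERCEPT)
print(f"slope = {SLOPE:.3f}, estimated MW = {unknown_mw:.1f} kDa")
```

The negative slope reflects that larger proteins migrate more slowly. The estimate is only as good as the assumption that the unknown binds SDS like the markers do.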

Prefractionation of samples is required in proteomics, and gel electrophoresis is a versatile and reliable method to achieve such prefractionation. The discontinuous buffer system is frequently used as it provides higher protein resolution compared to the continuous buffer system. The discontinuous buffer system offers the ability to manipulate buffer systems to achieve “steady-state-stacking” or “isotachophoresis”, which is responsible for focusing of the proteins before their separation by PAGE.

Though the separation of proteins in SDS-PAGE is primarily based on the molecular weight, the molecular weight range that can be preferentially resolved depends upon the gel composition, the buffer system used and the pH of the buffer system. The presence of a post-translational modification, such as glycosylation, results in anomalous migration of the glycoprotein on SDS-PAGE. This anomalous electrophoretic migration of glycoproteins, resulting in inaccurate molecular weight determination, is due to little or no SDS binding to the sugar moieties.

1.2.4 Separation Techniques

Peptide mass spectrometry (shotgun proteomics) identifies proteins by measuring mass-to-charge ratios of peptides and their fragments in the MS spectra. In order to perform unambiguous identification of proteins and to achieve deep proteome coverage, mass spectrometry is highly dependent on separation to reduce the complexity of samples prior to their analysis. This facilitates the identification of low-abundance species that would otherwise be overshadowed by the high-abundance species, i.e., it increases the dynamic range.

1.2.4.1 High Pressure Liquid Chromatography

High-pressure liquid chromatography (HPLC) is often directly coupled to mass spectrometric instruments with electrospray ionization (ESI) source. The continuous separation of analytes using HPLC is physically compatible with an electrospray ionization source. Due to efficient coupling of HPLC and ESI source, the combination has become a standard sample introduction setup for peptide analysis. The most commonly used chromatographic materials for separation of analytes are: ion exchange (IEX), reverse phase, hydrophilic interaction chromatography (HILIC), affinity, and hybrid materials.

Reverse phase liquid chromatography (RPLC or RP) separates analytes based on their hydrophobicity, and a significant advantage of RPLC, when coupled with a mass spectrometer, is that the buffers used are generally compatible with ESI. The use of acidic pH and organic solvents (acetonitrile and methanol) is conducive to the analysis of peptides by ESI-MS. Due to its high resolution, efficiency, reproducibility, and mobile phase compatibility with ESI-MS, RPLC has emerged as a preferred separation phase for the analysis of proteins and peptides. Over the years, significant efforts have been made to increase the peak capacity, sensitivity, reproducibility, and analysis speed of reverse phase chromatography. It has been observed that packing long, narrow capillary RP columns results in significant improvement in the loading capacity, sensitivity, and dynamic range of RPLC. Shen et al. have reported the use of 50 µm i.d., 40-200 cm long, small-particle-size (1.4 μm) RPLC columns with high peak capacity (1000-1500, compared with an average of 400) operated in an ultrahigh pressure regime (20 kpsi) for proteomic and metabolomic analysis [65]. The use of a small diameter particle stationary phase (1.7 μm diameter) contributes significantly towards the efficiency of the separation. The efficiency is inversely proportional to the size of the particles used for packing the column. However, because columns packed with small diameter particles exhibit high back pressure, high pressure pumps (up to 15,000 psi) are required for their operation [66].

Multidimensional separation is a common way to increase the peak capacity of chromatographic analysis. This approach combines several separation techniques, such as ion exchange, high pH reverse phase separation, low pH reverse phase separation and so forth, to improve the resolving power. For effective performance of multidimensional separation, the individual separation methods should be as orthogonal as possible, with each dimension utilizing different molecular properties as the basis of separation. One of the first and most practiced two dimensional setups is the combination of strong cation exchange (SCX) chromatography with reverse phase chromatography, known as multidimensional protein identification technology (MudPIT) [67]. In this multidimensional separation, a highly complex peptide mixture is loaded onto an SCX column and eluted in a series of steps with increasing salt concentration. Each fraction is transferred onto an RP column either off-line or directly, and peptides are further separated and eluted into the MS.
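The gain from combining dimensions can be estimated with the product rule for ideal two-dimensional separations, in which the total peak capacity is the product of the peak capacities of the two dimensions. The sketch below is illustrative only: the `coverage` factor is an assumed, simplified way to account for incomplete orthogonality, and the number of SCX salt steps is hypothetical.

```python
def peak_capacity_2d(n_first, n_second, coverage=1.0):
    """Idealized peak capacity of a two-dimensional separation.

    n_first, n_second: peak capacities of the individual dimensions.
    coverage: assumed fraction (0-1) of the 2D separation space actually
              used; 1.0 corresponds to fully orthogonal dimensions.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return n_first * n_second * coverage


# Hypothetical MudPIT-like setup: 10 SCX salt steps, each followed by an
# RP gradient with a peak capacity of about 400.
ideal = peak_capacity_2d(10, 400)
partly_orthogonal = peak_capacity_2d(10, 400, coverage=0.5)
print(ideal, partly_orthogonal)
```

In practice the achievable value is lower than the ideal product, because peptides elute across more than one salt step and the two retention mechanisms are never fully independent.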

1.2.5 Mass Spectrometry

Mass spectrometry usually involves three parts: ion source and optics, mass analyzer and data processing software.

1.2.5.1 Ionization Methods

A rapid growth in mass spectrometry based proteomic analysis can be attributed to major contributions of experimental methods, instrumentation and data analysis. Among the most important developments in mass spectrometry related instrumentation is the invention of soft ionization methods i.e. matrix assisted laser desorption ionization (MALDI) and electrospray ionization (ESI), allowing peptides and proteins to be directly analyzed by MS.

MALDI

MALDI functions just as its name suggests: the matrix assists in desorption and ionization of the analyte. In this type of ionization technique, the incident laser energy is absorbed by the matrix and transferred to the acidified analyte. The rapid laser heating results in desorption of matrix and positively charged analyte into the gas phase. Singly charged ions are predominantly generated by MALDI, which makes it applicable for top-down analysis of high-molecular weight proteins [68]. However, low shot-to-shot reproducibility and strong dependence on sample preparation are the drawbacks of this technique. MALDI-TOFMS is suitable for high throughput analysis. However, the high ionization energy can be detrimental in the analysis of compounds with labile modifications [69].

Figure 1.4 Common matrices used in MALDI mass spectrometry. Reprinted from reference [69].

ESI

ESI, unlike MALDI, generates ions from solution. The electrospray is created by application of a high voltage between the emitter end of the separation column and the inlet of the mass spectrometer [68]. The physicochemical processes of ESI involve formation of a Taylor cone, i.e. an electrically charged spray of liquid eluting from the separation column, followed by generation and desolvation of eluent droplets.

The unique feature of ESI compared to other ionization methods is its ability to produce multiply charged ions from high molecular weight biological molecules like proteins, which enables the analysis of these molecules with instruments having a small mass-to-charge range (400-2000 m/z). One of the most important developments in the ESI technique, which led to sensitive proteomic analysis, is known as nano-ESI. In Chapters 2 and 3 of this dissertation, nano-ESI, operated at 20 nL/min, is the primary technique used for the analysis of 10,000 laser capture microdissected breast cancer cells. The diagram of the ESI process is discussed in the PLOT related section.
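The effect of multiple charging on the observed m/z can be sketched with the standard relation m/z = (M + z × 1.00728)/z for positive-mode protonated ions. The example below is illustrative; the 20,000 Da neutral mass and the helper names are assumptions, while the 400-2000 m/z window is the range quoted above.

```python
PROTON_MASS = 1.007276  # Da


def mz_of_charge_state(neutral_mass, charge):
    """m/z of an [M + zH]z+ ion for a given neutral mass (Da)."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def charge_states_in_window(neutral_mass, mz_min=400.0, mz_max=2000.0, max_charge=100):
    """Charge states whose m/z falls inside the instrument window."""
    return [
        z
        for z in range(1, max_charge + 1)
        if mz_min <= mz_of_charge_state(neutral_mass, z) <= mz_max
    ]


# Hypothetical 20 kDa protein: the singly charged ion is far outside the
# window, but a range of higher charge states falls inside it.
states = charge_states_in_window(20000.0)
print(states[0], states[-1])
```

Which of these charge states are actually populated depends on the protein and the spray conditions; the calculation only shows which ones would be observable.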

1.2.5.2 Mass Analyzers

Ion Trap

As the name suggests, an ion-trap mass spectrometer works by trapping the ions in a vacuum. The ion trap functions by repeating the steps of ion collection, ion storage and ejection of ions from the ion trap as flow from the LC column occurs. The unique feature of the ion trap lies in its ability to isolate and fragment peptide ions from complex mixtures; this operation is called tandem MS. Due to their fast scan rates, MSn scans, high sensitivity, high duty cycle, high ion storage capacity (compared to 2D and 3D traps), reasonable resolution and mass accuracy, linear ion traps (e.g. LTQ, Thermo Fisher) are considered the high-throughput workhorses of proteomic research. Therefore, for our initial development work, as mentioned in Chapters 2 and 3, we employed LTQ-MS for bottom-up 10 µm i.d. Porous Layer Open Tubular (PLOT) LC-MS analysis of 10,000 LCM cells. Furthermore, the LTQ is coupled with Orbitrap and FTICR analyzers as the front end of hybrid MS instruments to perform ion trapping, ion selection and high resolution ion analysis.

Mass spectrometry has been extensively used for determination of the molecular masses of intact proteins. Among the mass spectrometric techniques, ESI-high mass accuracy MS is preferred, as ESI generated multiply charged ions fall in the m/z range of most mass spectrometers. A variety of mass spectrometers can be used for this purpose, including ion trap (IT), orthogonal time-of-flight, time-of-flight, Fourier transform ion cyclotron resonance (FTICR) and Orbitrap instruments. However, mass spectrometers such as ion traps are not suitable for this purpose due to their low resolving power at full scan speed. In contrast, mass spectrometers such as TOF, FTICR and Orbitrap, due to their high mass resolution and high mass accuracy, have become the preferred instruments for accurate mass determination of intact proteins.

Quadrupole-Time of Flight Mass Spectrometer

Time-of-flight mass spectrometry (TOFMS) determines the mass-to-charge ratio of the ions using a time measurement. Ions are accelerated in the flight tube by an electric field. This acceleration provides the same kinetic energy to all the ions bearing the same charge. The velocity gained by the ion due to acceleration depends on the mass-to-charge ratio. Then, the time that an ion takes to travel to the detector is measured. The heavier ions will take a longer time to reach the detector compared to lighter ones. Based on the flight time of the ion and the known experimental parameters, the mass-to-charge ratio of the ion can be determined.
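The dependence of flight time on mass-to-charge ratio can be made concrete with the ideal linear TOF relation t = L × sqrt(m / (2 z e U)), which follows from equating the kinetic energy z e U with m v²/2. This is a minimal sketch that ignores initial energy spread, reflectrons and extraction delays; the 20 kV acceleration voltage and 1 m flight tube are hypothetical values.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # C
DALTON = 1.66053906660e-27  # kg


def flight_time(mz, accel_voltage, tube_length):
    """Ideal flight time (s) of an ion with the given m/z (Da per charge).

    Assumes all of the potential energy z*e*U becomes kinetic energy and
    the ion then drifts field-free over tube_length (m).
    """
    return tube_length * math.sqrt(mz * DALTON / (2.0 * ELEMENTARY_CHARGE * accel_voltage))


def mz_from_flight_time(time_s, accel_voltage, tube_length):
    """Invert the relation above to recover m/z from a measured flight time."""
    return 2.0 * ELEMENTARY_CHARGE * accel_voltage * (time_s / tube_length) ** 2 / DALTON


# Hypothetical instrument: 20 kV acceleration, 1 m field-free flight tube.
t_light = flight_time(1000.0, 20000.0, 1.0)
t_heavy = flight_time(4000.0, 20000.0, 1.0)
print(f"{t_light * 1e6:.1f} us, {t_heavy * 1e6:.1f} us")
```

Because the flight time scales with the square root of m/z, a fourfold heavier ion at the same charge takes twice as long to reach the detector.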

Fourier Transform Ion Cyclotron Resonance (FTICR)

The FTICR mass analyzer determines the mass-to-charge ratio of the ions based on their cyclotron frequency under the influence of a constant magnetic field. In the ICR mass analyzer, the ions are stored in a Penning trap under the influence of constant magnetic and electric fields. The ions are excited to a larger cyclotron radius by an oscillating electric field perpendicular to the magnetic field. The energy applied to the ions in the ICR cell can be tuned to excite, dissociate and eject ions. The detector plates on the opposite sides of the trap measure the cyclotron frequencies of all the ions simultaneously, and a Fourier transform converts these frequencies into m/z values (Figure 1.5). FTICR is a very high mass resolution technique contributing to accurate mass measurement [70]. The high mass resolution and high mass accuracy of FTICR are due to the following reasons. 1) The mass of the ion is calculated from the measurement of cyclotron frequency, a parameter that is more precisely measurable than any other parameter. 2) The ion cyclotron frequency is defined by the magnetic field. The better time stability of the magnetic field (1 ppb/hour), compared to the time stability of an rf voltage (100 ppb/hour), results in superior mass precision. 3) In a spatially uniform magnetic field, the cyclotron frequency of an ion is independent of the ion speed. 4) In order to attain high mass precision, ICR, unlike ion-beam-based mass measurement, does not require the use of narrow slits [71].
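The link between cyclotron frequency and m/z is given by the unperturbed relation f = z e B / (2 π m). The sketch below evaluates it for an idealized case without trapping-field or space-charge corrections; the 7 T field strength is a hypothetical value and not necessarily that of the instrument used in this work.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # C
DALTON = 1.66053906660e-27  # kg


def cyclotron_frequency(mz, field_tesla):
    """Unperturbed cyclotron frequency (Hz) for an ion of given m/z (Da per charge)."""
    return ELEMENTARY_CHARGE * field_tesla / (2.0 * math.pi * mz * DALTON)


def mz_from_frequency(freq_hz, field_tesla):
    """Recover m/z from a measured cyclotron frequency."""
    return ELEMENTARY_CHARGE * field_tesla / (2.0 * math.pi * freq_hz * DALTON)


# Hypothetical 7 T magnet: an ion at m/z 1000 orbits at roughly 107 kHz.
freq = cyclotron_frequency(1000.0, 7.0)
print(f"{freq / 1e3:.1f} kHz")
```

The ion velocity does not appear in the expression, which is the point made in reason 3) above, and any drift in B maps directly onto the measured m/z, which is why the field stability in reason 2) matters.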

Figure 1.5 Operational principle of the FTICR. Reprinted from reference [69].

Among the many applications of the FTICR, its high resolving power is useful for the study of large macromolecules such as proteins, which carry multiple charges generated by electrospray ionization. The FTICR instrument provides mass resolution in the range of 50,000-750,000 and mass accuracy of less than 2 ppm. However, FTICR suffers from relatively slow acquisition speed and low sensitivity of analysis. In order to obtain high sensitivity and improved acquisition time, we acquired MS scans over a limited mass window, corresponding to m/z values of the 9+ charge state of intact alpha-human chorionic gonadotropin (Chapter 4).


Orbitrap

In 1999, Makarov invented a new type of mass analyzer called the Orbitrap [72], which was applied to proteomic research in 2005 [73]. Among the high mass resolution FTMS instruments, the Orbitrap superseded the FT-ICR due to its low cost of operation, while providing equivalent high mass accuracy. The Orbitrap consists of two electrodes, an outer barrel-like electrode and a coaxial inner spindle-like electrode, with an electrostatic field formed between them (Figure 1.6). The ions are tangentially injected into the gap between the two electrodes and made to rotate around the inner electrode due to the electrostatic attraction by the inner electrode and the balancing centrifugal forces. While cycling around the central axis, the ions move back and forth along the central axis. The frequency of these harmonic oscillations is Fourier transformed to determine the mass-to-charge ratio of the ions. The Orbitrap offers a high resolving power of roughly 50,000 and a mass accuracy of less than 2 ppm, with proper standards. With an average acquisition speed of at least 6 MS/MS spectra per second in parallel with a single high-resolution spectrum (60,000 resolution), significantly improved protein coverage can be achieved.
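The resolving power and mass accuracy figures quoted above can be translated into peak widths and mass tolerances. This is a minimal sketch using the usual definitions (resolving power as m/Δm at full width at half maximum, mass error in parts per million); the m/z values are hypothetical.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Mass measurement error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def peak_width(mz, resolving_power):
    """Peak width (FWHM, in m/z units) implied by a given resolving power."""
    return mz / resolving_power


def tolerance_window(theoretical_mz, ppm):
    """Lower and upper m/z bounds for a given ppm tolerance."""
    delta = theoretical_mz * ppm / 1e6
    return theoretical_mz - delta, theoretical_mz + delta


# Hypothetical ion at m/z 1000.0000 measured at 1000.0015.
error = ppm_error(1000.0015, 1000.0)
width = peak_width(1000.0, 50000)
low, high = tolerance_window(1000.0, 2.0)
print(f"{error:.2f} ppm, FWHM {width:.3f}, window {low:.4f}-{high:.4f}")
```

At a resolving power of 50,000 the peak at m/z 1000 is about 0.02 m/z units wide, while a 2 ppm tolerance corresponds to a window of only ±0.002, so the accuracy refers to the peak centroid rather than the peak width.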

Figure 1.6 Cutaway view of the Orbitrap mass analyzer. Ions are injected into the Orbitrap at a point (arrow) offset from its equator and perpendicular to the z-axis, where they begin coherent axial oscillations without the need for any further excitation. Reprinted from reference [69].

1.2.5.3 Database Searching Tools for Proteomics

Database searching plays an important role in large-scale proteomics. Database searching tools enable the use of mass spectrometric data of peptides to identify proteins in sequence databases. Two mass spectrometry based database search principles are mainly used for identification of proteins. The first method uses the molecular weight fingerprint of the protein digest (peptides) obtained by a site-specific protease [74,75], and the second method uses the tandem mass spectra obtained on the individual peptides of a digested protein [16,76]. Since each tandem mass spectrum stands as a unique and verifiable piece of data, the second method has the ability to identify a wide range of proteins and thus provides a comprehensive approach to handle complex protein mixtures [77].

Tandem Mass-Spectrometry and Data Processing

Figure 1.7 Low energy collision induced dissociation of peptide. Reprinted from reference [78]

In tandem mass-spectrometry (MS/MS), the gas phase peptide ions undergo fragmentation due to processes such as collision-induced dissociation (CID). Gas phase CID is the most widely used technique in tandem mass-spectrometry. The dissociation pathways are strongly dependent on the collision energy. Low energy collisions (<100 eV) mainly produce fragmentation along the peptide backbone [79], whereas for high energy collisions fragmentation of amino acid side chains is also observed [80]; most typically, low energy fragmentation is used. The chemical and physical properties of the amino acids and the sequence of the peptide have a strong influence on the MS/MS fragmentation pattern [81,82]. Figure 1.7 describes the fragmentation performed under low energy conditions, such as those typically encountered in triple quadrupole, quadrupole time-of-flight and ion trap instruments [83]. The b-ions, y-ions and ions with neutral losses of water and ammonia are preferentially produced.
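The b- and y-ion series mentioned above can be computed directly from a peptide sequence. The following is a minimal sketch for singly charged, unmodified fragments using standard monoisotopic residue masses; the example peptide and function name are hypothetical, and neutral losses are not included.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return singly charged b- and y-ion m/z values keyed by ion number.

    b_i contains the first i residues; y_i contains the last i residues
    plus water. Both carry one ionizing proton.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b_ions = {i: sum(masses[:i]) + PROTON for i in range(1, n)}
    y_ions = {i: sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)}
    return b_ions, y_ions


# Hypothetical example peptide.
b_ions, y_ions = fragment_ions("PEPTIDE")
for i in sorted(b_ions):
    print(f"b{i} = {b_ions[i]:.4f}   y{i} = {y_ions[i]:.4f}")
```

Each complementary pair b_i and y_(n-i) sums to the mass of the intact peptide plus two protons, which is a convenient consistency check when annotating a spectrum.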

Figure 1.8 Mobile Proton Theory. Reprinted from reference [84].

To explain the intensity patterns observed in tandem mass spectra, a mobile proton model has been proposed[83]. The mobile proton model states that, to initiate the backbone cleavages that produce b and y ions, protons are transferred intramolecularly from basic side-chains to the heteroatoms along the backbone. Figure 1.8A shows that the proton exists in equilibrium between all possible basic sites. The energy required to mobilize the proton from a basic side-chain or from the amino terminus to the peptide backbone depends on the amino acid composition of the peptide. Therefore, the dissociation (fragmentation) energy for peptides containing amino acids of greater gas-phase basicity is higher than for peptides containing amino acids of lower gas-phase basicity. An example of a lysine-terminated peptide is shown in Figure 1.8B.

SEQUEST - Database Search Algorithm

Given the mass of the precursor ion (m/z of the peptide ion) and its fragment ions, the goal of the database search algorithm is to determine the peptide sequence and protein identity. SEQUEST [16] is a database search program that uses a descriptive model of peptide fragmentation and correlative matching to a tandem mass spectrum. To assess the quality of the match between the experimental spectrum and an amino acid sequence from the database, SEQUEST applies a two-tiered scoring scheme. It first calculates an empirically derived preliminary score (Sp) that restricts the number of sequences passed on to the correlation analysis. Sp is calculated by summing the peak intensities of the matched fragment ions while also accounting for the continuity of the fragment ion series and the length of the amino acid sequence. The second and decisive score is a cross-correlation score, referred to as XCorr, which correlates the experimental and theoretical spectra. The theoretical spectrum is generated from the predicted fragmentations, i.e. b- and y-ions, for each sequence in the database.

The similarity between the theoretical and experimental spectra is evaluated based on the cross-correlation of the two spectra. Apart from the preliminary and cross-correlation scores, SEQUEST calculates another important value, ΔCn, the normalized difference in XCorr between the best matched sequence and each of the other sequences. ΔCn is a useful indicator of the uniqueness of the match. If the value of ΔCn is greater than 0.1, the match is considered reasonably unique to a sequence. XCorr, which is independent of the database size, indicates the quality of the match between the spectrum and the sequence, whereas ΔCn, which depends on the size of the database, indicates the quality of the match relative to near misses.
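The two-tiered idea behind XCorr and ΔCn can be sketched in a few lines. This is a simplified illustration, not the SEQUEST implementation: the 1 m/z binning, the background window of ±75 bins (excluding zero offset) and all function names are assumptions drawn from common descriptions of cross-correlation scoring, and the spectrum preprocessing that real implementations apply is omitted.

```python
def bin_spectrum(peaks, n_bins):
    """Convert (m/z, intensity) pairs into a vector of unit-width m/z bins."""
    binned = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(round(mz))
        if 0 <= idx < n_bins:
            binned[idx] = max(binned[idx], intensity)
    return binned


def correlation(x, y, shift):
    """Dot product of x with y displaced by `shift` bins."""
    n = len(x)
    return sum(x[i] * y[i + shift] for i in range(n) if 0 <= i + shift < n)


def xcorr(experimental, theoretical, max_shift=75):
    """Correlation at zero offset minus the mean correlation at other offsets."""
    zero = correlation(experimental, theoretical, 0)
    shifts = [s for s in range(-max_shift, max_shift + 1) if s != 0]
    background = sum(correlation(experimental, theoretical, s) for s in shifts)
    return zero - background / len(shifts)


def delta_cn(xcorr_scores):
    """Normalized XCorr difference between the top hit and each other hit.

    Assumes the best score is positive.
    """
    ranked = sorted(xcorr_scores, reverse=True)
    best = ranked[0]
    return [(best - score) / best for score in ranked[1:]]
```

With candidate scores of 3.0, 2.7 and 1.5, for example, the second-ranked sequence has a ΔCn of about 0.1, which sits right at the uniqueness threshold mentioned above.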

Label Free Quantitative Microproteomics

Currently, a number of stable isotope labeling approaches are in use for "shotgun" quantitative proteomic analysis. These include Isotope-Coded Affinity Tag (ICAT), Stable Isotope Labeling by Amino Acids in Cell Culture (SILAC), 15N/14N metabolic labeling, 18O/16O enzymatic labeling, Isotope Coded Protein Labeling (ICPL), Tandem Mass Tags (TMT), Isobaric Tags for Relative and Absolute Quantification (iTRAQ) and other chemical labeling[85,86].

These stable isotope labeling methods have offered valuable flexibility while using quantitative proteomic methods to study protein abundance changes in complex samples. However, most labeling based quantification methods are limited in their application due to increased time and complexity of sample preparation, the requirement of higher sample concentration, high cost of the reagents and incomplete labeling. Therefore, for relative quantitation of small sample amounts, there is increased interest in label-free approaches in order to achieve more sensitive and simpler quantification results.

Label-free protein quantitation is generally based on two approaches. The first involves the measurement of ion intensity changes such as peptide peak areas or peak heights in chromatography (i.e. total or single ion analysis). The second approach is based on spectral counting of the identified peptides after MS/MS analysis. Peptide peak intensity and spectral counts are measured for individual LC-MS/MS runs, and changes in protein abundance are determined by direct comparison between different analyses.

Relative Quantitation by Peak Intensity

In this approach, relative quantitation of the peptides is achieved by direct comparison of the peak area of each peptide ion across multiple LC-MS datasets. However, applying this method to determine protein abundance changes in complex biological samples has some practical limitations. Differences in sample preparation and sample injection, in addition to experimental shifts in retention time and m/z value, significantly affect the direct and accurate comparison of multiple LC-MS datasets. Therefore, highly reproducible LC-MS performance and careful chromatographic peak alignment are critical for this quantitation approach[87].

Relative Quantitation by Spectral Count

In the spectral counting approach, the number of identified MS/MS spectra from the same protein (spectral count) is compared between multiple LC-MS/MS datasets. Protein sequence coverage, the number of identified unique peptides and the number of identified total MS/MS spectra (spectral count) all increase with protein abundance. However, of these three identification measures, only spectral count showed a strong linear correlation with relative protein abundance, with a dynamic range of over 2 orders of magnitude. Therefore, spectral counting is considered a simple and reliable index for relative protein quantification[88]. In comparison to the peak intensity approach, which relies on computer algorithms for automatic LC-MS peak selection, alignment and comparison, the spectral counting approach is much easier to implement.

However, for accurate and reliable detection of protein changes in complex mixtures, normalization and statistical analysis of spectral counting datasets is necessary. One simple normalization method, which accounts for run-to-run variability, uses total spectral counts[89]. Another approach to normalization, the calculation of a normalized spectral abundance factor (NSAF), was suggested to account for the effect of protein length on spectral count[90].
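Both normalization schemes reduce to a few lines of arithmetic. The sketch below assumes one dictionary of spectral counts per LC-MS/MS run and a dictionary of protein lengths in residues; the function names and example values are hypothetical.

```python
def normalize_by_total(counts):
    """Divide each protein's spectral count by the total count of the run."""
    total = sum(counts.values())
    return {protein: c / total for protein, c in counts.items()}


def normalized_saf(counts, lengths):
    """Normalized spectral abundance factor.

    Each count is first divided by protein length (spectral abundance
    factor), then by the sum of these factors over the run.
    """
    saf = {protein: counts[protein] / lengths[protein] for protein in counts}
    total = sum(saf.values())
    return {protein: value / total for protein, value in saf.items()}


# Hypothetical run: protein B yields more spectra but is six times longer.
counts = {"A": 10, "B": 30}
lengths = {"A": 100, "B": 600}
by_total = normalize_by_total(counts)
by_length = normalized_saf(counts, lengths)
```

In this example the total-count normalization ranks B above A, whereas the length-corrected factor ranks A above B, which is the protein-length effect the second approach is meant to remove.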

Zhang et al. compared five different statistical tests on spectral count data collected by analysis of yeast digests to evaluate the significance of comparative quantification by spectral counts[91]. These statistical tests were 1) Fisher's exact test, 2) goodness-of-fit test (G-test), 3) AC test, 4) Student's t-test and 5) Local-Pooled-Error (LPE) test. For datasets with three or more replicates, the Student's t-test was found to be the best, whereas for datasets with one or two replicates, Fisher's exact test, the G-test and the AC test can be used.
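For the single-replicate case, Fisher's exact test can be applied to a 2×2 table of spectral counts. The sketch below is one possible formulation, assuming the table is laid out as (spectra of the protein in run 1, spectra of the protein in run 2, all other spectra in run 1, all other spectra in run 2); this layout and the function name are illustrative assumptions, not the procedure of the cited study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )
    return min(1.0, p_value)


# Hypothetical protein: 12 of 1,000 spectra in run 1, 3 of 1,000 in run 2.
p = fisher_exact_two_sided(12, 3, 988, 997)
```

Because thousands of proteins are tested in a typical dataset, the resulting p-values would normally be corrected for multiple testing before proteins are called differentially abundant.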

Relative quantitation by spectral count has been successfully applied for different clinical applications[92], including analysis of normal and acute inflammation, biomarker discovery in human saliva proteome in type-2 diabetes[93], comparison of protein expression in mammalian and yeast cells under different culture conditions, distinguishing normal and diseased lung cancer samples[94,95], discovery of phosphotyrosine-binding proteins in mammalian cells and identification of differential plasma membrane proteins in terminally differentiated mouse cell lines[95].

Another label-free method, the spectral index, is used to analyze relative protein abundances in large-scale data sets obtained from biological samples by shotgun proteomics. The spectral index combines two biochemically plausible features: 1) spectral counts (indicative of relative protein abundance) and 2) the number of samples within a group with detectable peptides [96].

We used this method to assess differentially abundant proteins between 9 non-cancerous, normal breast epithelial (NBE) samples and 9 estrogen receptor (ER)-positive (luminal subtype), invasive malignant breast epithelial (MBE) samples [97].

However, for a low number of replicates of breast cancer samples (n=3), we used spectral counting (PatternLab software [98]) for determination of differentially abundant proteins between invasive breast cancer cells and metastatic breast cancer cells (Chapter 2).

1.3 Microproteomics

Mass spectrometry-based proteomic methods are extensively used to study global changes in protein expression caused by pathological stimuli in an organism. Current methods use total protein amounts in the range of micrograms or milligrams [99] and extensive protein- or peptide-level separations in order to achieve comprehensive proteomic analysis. However, in many cases, obtaining these sample amounts is challenging or practically impossible. There are several reasons for low sample availability, e.g. the rarity of the sample itself, the several hours or days needed to collect many thousands of cells using a technique such as laser capture microdissection, the need to perform multiple experiments on a homogeneous sample, and so forth.

One example of such a rare or limited sample type is brain tissue from patients with neurodegenerative diseases such as Parkinson's, Alzheimer's, and Huntington's disease. These neurodegenerative diseases are characterized by selective degeneration of particular types of neurons, while the tissue in the rest of the brain remains pathologically normal[100]. Researchers trying to understand the causes behind these diseases are using laser capture microdissection (LCM) to selectively collect degenerating neurons. However, obtaining even 10,000 to 50,000 neurons is impractical because the degenerating neurons are limited in number[101]. Another example of limited sample amount is malignant cells collected from a solid tumor. Solid tumors are heterogeneous in composition, i.e. they are made up of subpopulations of cancer cells, along with stromal elements that collectively form a microenvironment[41]. The subtypes of malignant cells differ among themselves in many properties, such as production and expression of cell surface markers, sensitivity to therapeutics, growth rate, etc. Studies aimed at determining the proteomic changes in these individual cell types are limited by the time and cost required to collect large cell numbers using the LCM procedure. The proteomic analysis of circulating tumor cells (CTCs), which can be an indicator of potential metastasis, is thought to provide a noninvasive way of determining tumor metastasis or the impact of treatment on the number of CTCs[102]. As the number of CTCs circulating in the blood is very low, advances in proteomics are required to analyze them.

To accomplish microproteomics of clinically relevant and limited amounts of sample, one must use a minimum number of steps in the proteomic platform, and each of these steps must limit sample losses[99]. Considerable sample losses during sample preparation and the limited dynamic range of the LC-MS/MS system are the two main obstacles in analyzing small protein amounts. In order to improve sample preparation, low protein binding tubes and the use of MS-friendly acid-labile detergents are suggested[99]. The use of MS-friendly detergents results in shorter extraction and digestion procedures.

One recent example of low-sample proteomics was the analysis of 500-5,000 CTCs, generating proteomic profiles of ~150-650 proteins[103]. The cells were lysed using NP-40 detergent, and the detergent was removed by precipitating the proteins from the cell lysate. The in-solution digests of these samples were subjected to nanoflow LC/Q-TOF analysis. In another approach to small sample amounts, a quantitative comparison was carried out of the proteomes of single LCM-collected pancreatic islets, each containing 2,000-4,000 cells, treated with high and low levels of glucose. The cells were lysed with an acid-labile detergent followed by in-solution digestion. Sensitive LC-MS/MS analysis was performed using a low column flow rate and a long chromatographic separation time. In Chapters 2 and 3, we present a sample handling step based on a short SDS-PAGE run, followed by sensitive LC-MS analysis using a PLOT column.

1.3.1 Alternative strategies for protein digestion

1.3.1.1 Solvents based approach

In 2007, Veenstra et al. introduced a membrane protein digestion method that uses 60% methanol, in place of chaotropes, as the membrane protein solubilizing solvent during trypsin digestion[44]. In this approach, the plasma membrane protein population was isolated from human epidermis and dispersed in 50 mM ammonium bicarbonate, pH 7.9. The proteins were reduced and alkylated using TCEP and iodoacetamide (IAA), respectively. The membrane proteins were separated using ultracentrifugation at 100,000 g. The protein pellet was further solubilized in 60% v/v methanol in 50 mM ammonium bicarbonate. The proteins were digested by trypsin (trypsin/protein ratio: 1/20) at 37 °C for 5 hours in the same solubilizing buffer. The acidified digest was analyzed using two-dimensional (SCX/RP) LC-MS.

This strategy was found to have advantages over detergent- and chaotrope-based solubilization: 1) the same methanol-based buffer conditions were used for solubilization, denaturation and proteolysis; 2) sample dilution and dialysis steps, which typically decrease solubilizing capacity and subsequent proteolytic efficiency, were completely eliminated; and 3) methanol and ammonium bicarbonate, both volatile water-soluble compounds, are removable by lyophilization after digestion, making the methanol-based buffer approach MS-friendly. Other solvents such as acetonitrile and trifluoroethanol are also used for solubilization and digestion of membrane proteins.

1.3.1.2 Cleavable surfactant

Surfactants are stable and strong solubilizing agents. Environmental concerns, such as the low biodegradability of surfactants, have become one of the main driving forces for the development of cleavable surfactants. Although cleavable surfactants were first synthesized many years ago, Norris et al. more recently applied nonacid cleavable detergents to MALDI mass spectrometry profiling of whole cells [104].

They showed that cleavable surfactants increase the number of proteins analyzed by increasing protein solubility. Cleavable surfactants are detergents that have a deliberately introduced weak bond in the surfactant structure that can be cleaved by changing reaction conditions such as pH or temperature. This is similar to the use of organic solvents, which are removed by evaporation. However, the benefit of cleavable surfactants over organic solvents for proteomic studies is their superior solubilizing capability. Cleavable surfactants are not only tolerated during the enzymatic digestion of proteins but were found to accelerate enzymatic digestion and improve the digestion yield. The cleavable surfactant disrupts cell membranes, extracts and solubilizes proteins, and then breaks down into non-surfactant hydrolytic cleavage products.

Waters introduced Rapigest® SF, a mild anionic surfactant that solubilizes and denatures the substrate proteins, thereby making them amenable to proteolysis, without denaturing or inhibiting proteolytic enzymes such as trypsin, Asp-N, Glu-C and Lys-C. In addition to MS compatibility, increased peptide yield and faster proteolytic digestion are two key benefits of Rapigest® SF and other cleavable surfactants over solvents and chaotropes.

In 2003, Protein Discovery introduced the cleavable zwitterionic surfactant PPS Silent®, 3-[3-(1,1-bisalkyloxyethyl)pyridin-1-yl]propane-1-sulfonate, for proteomic studies.

ProteaseMax is another cleavable detergent, commercialized by Promega. It decomposes during enzymatic digestion and generates a lipophilic byproduct, which requires a clean-up step before LC-MS analysis.

Yates et al. in 2007 performed a study[107] to compare trypsin digestion strategies for peptide/protein identification by shotgun proteomics with or without cleavable detergents in mixed organic-aqueous and aqueous systems [67]. It was shown that cleavable detergent based digestion protocol increased peptide/protein identifications. The peptides generated by the different cleavable detergents showed a significant difference in hydrophobicity. The study concluded that cleavable detergents produce increased protein identification and improve the chances of identifying low-abundant proteins.

In the next section, filter-aided sample preparation protocol, using a filter as a reactor surface, is described for the proteomic sample preparation.

1.3.1.3 Filter-Aided Sample Preparation (FASP)

Mann et al. developed a universal sample preparation method, filter-aided sample preparation (FASP), for MS-based proteomic analysis [108]. The method was described as "universal" because it can use aspects of both proteolytic digestion approaches, namely in-gel digestion and in-solution digestion. The in-gel digestion approach solubilizes the proteins using SDS, a reagent of choice for total cell lysis and protein solubilization. In-solution digestion approaches include minimum sample handling steps and can be automated for high throughput analysis.

The FASP method uses a common ultrafiltration device in combination with 8 M urea for detergent removal. The method facilitates solubilization of the sample in 4% SDS, followed by retention and concentration of the sample in microliter volumes. The filter unit then functions as a "proteomic reactor" for removal of detergent, exchange of buffer, chemical modification and enzymatic digestion of proteins. Importantly, after digestion, the ultrafiltration device separates the peptides from high molecular weight substances, avoiding their interference during peptide separation.

The efficiency and range of applicability of the method were evaluated using various amounts of bovine serum albumin and HeLa cell lysates. Analysis of HeLa cell lysates, equivalent to different numbers of HeLa cells down to a few thousand cells, showed no significant change in the number of peptides and proteins identified or quantitated. Almost 80% of the tandem mass spectra (MS/MS events) resulted in peptide identifications, which was attributed to the high purity of FASP-generated peptides. Gene Ontology analysis revealed that more than 40% of the proteins fell in the membrane category. This suggests the usefulness of the FASP protocol for the analysis and identification of both cytosolic and membrane proteins.

Liebler et al. compared the performance of the FASP method with a short SDS-PAGE run (in-gel digestion approach) and a TFE-based method (in-solution digestion approach) at two protein load levels (50 µg and 150 ng of total protein from colon carcinoma cells)[109]. At the high protein load, the FASP method yielded 8% more protein identifications than the other two methods; however, at the low protein load (150 ng of protein), the short SDS-PAGE run was the best of the three methods. It was concluded that at low protein load the FASP method suffers from substantial losses, probably due to binding of the proteins/peptides to the filter unit. In the next section, the principles of SDS-PAGE for protein pre-fractionation and subsequent sample preparation are discussed.

1.3.2 High Performance Liquid Chromatography for Microproteomics

The diverse physico-chemical properties of peptides (isoelectric point, charge, size, hydrophobicity) make their separation possible by nearly every liquid-based separation method. It was realized in the early years of HPLC that non-polar (reversed) stationary phases hold great potential for the separation of peptides [110-113]. The powerful separation capability of reversed-phase HPLC (RPLC) led to the use of this method for peptide mapping [114]. With the introduction of soft ionization techniques, such as electrospray, on-line hyphenation of RPLC with mass spectrometry has evolved into the principal analytical technique in the field of proteomics [115-118].

1.3.2.1 Peak Capacity

The challenges of proteomic analysis are the complexity of the proteome and its dynamic range of 10-12 orders of magnitude. Further, in shotgun proteomics, tryptic digestion of the proteins makes the resultant peptide mixture even more complex. Based on 52,816 protein entries in the Human IPI database (version 2.23), Aebersold et al. calculated the number of in silico tryptic peptides to be 892,584 [119]. These numbers suggest that protein digestion increases the complexity approximately 17-fold. It is generally accepted that no single chromatographic separation can resolve all components of a global proteolytic digest of a complex proteome. Peak capacity (Pc) is the most common metric for assessing the resolving power of an HPLC separation under gradient conditions[120]. The peak capacity is defined as the number of peaks that can be resolved within a chromatographic retention window at unit resolution. In LC, the peak capacity can be improved by increasing the column length (L) and by reducing the plate height (H) [121-123]. Another important parameter, the gradient slope, also affects the peak capacity. In general, longer gradient times produce higher peak capacities (and longer separation times), although peak capacity approaches a maximum value at long gradient times. In order to achieve efficient peptide separations, strategies such as long columns and long gradient times have been implemented. However, when analyzing limited amounts of sample, a balance between peak capacity and sensitivity has to be found, since peaks broaden as they elute at longer times.
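The two quantities discussed above can be checked with simple arithmetic. The sketch below assumes the commonly used gradient peak capacity estimate Pc = 1 + tg/w, where tg is the gradient time and w is the average peak width at base (4σ); this formula, the function names and the 120-minute example are illustrative assumptions rather than values taken from the cited work.

```python
def digestion_complexity(n_peptides, n_proteins):
    """Fold increase in the number of species after in silico digestion."""
    return n_peptides / n_proteins


def peak_capacity(gradient_time_min, mean_peak_width_min):
    """Gradient peak capacity estimate: Pc = 1 + tg / w (w = 4-sigma width)."""
    return 1 + gradient_time_min / mean_peak_width_min


# Numbers quoted in the text: 892,584 peptides from 52,816 proteins.
fold = digestion_complexity(892_584, 52_816)

# Hypothetical separation: 120 min gradient with 0.5 min average peak width.
pc = peak_capacity(120, 0.5)
```

Even a peak capacity of several hundred is far below the number of tryptic peptides, which is why multidimensional separations and the mass-resolving dimension of the spectrometer are both needed.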


Relationship between Flow rate and MS Response

Figure 1.9 Illustration of the effect of analyte concentration and flow rate on ESI processes. Reprinted from reference [124].

In 2001, Smith et al. demonstrated that long capillary columns packed with porous particles can handle large sample loading amounts, and that a 10-fold increase in sample amount (5-50 µg) produced a nearly seven-fold increase in peptide identifications[124]. Further, the same group showed that decreasing the inner diameter of the nanoscale column from 75 µm to 15 µm (operated at 20 nL/min) had an effect equivalent to increasing the sample loading amount, which is important for proteomics of limited sample amounts[65]. This was a direct result of the lower flow rates associated with narrow-bore columns (15 µm i.d.). Figures 1.9 and 1.10 explain the influence of flow rate on the sensitivity of ESI-MS detection. In ESI, the initial droplet size determines the average number of possible fission events. Ion formation is more rapid and efficient from smaller droplets, with less need for heating.

The smaller droplets move easily toward the periphery of the electrospray plume due to their low inertia and higher charge density. At conventional capillary flow rates (about 1 µL per minute), a large fraction of the analyte remains unutilized in the form of either charged clusters or residue particles. A second factor in conventional-flow ESI-MS is that a distance of about 1 cm is required between the electrospray emitter and the MS inlet for solvent evaporation, which causes expansion of the electrospray plume and limits the inlet efficiency. Therefore, the sensitivity of conventional ESI is limited by low ionization and inlet sampling efficiencies.

1.3.2.2 Narrow-bore column and ESI-MS

Electrospray ionization-mass spectrometry is routinely used in proteomics, mainly due to its sensitive detection, broad dynamic range and online coupling with capillary HPLC [67,125,126]. To achieve comprehensive analysis of complex mixtures, high-resolution separation prior to MS analysis is necessary so that a wide dynamic concentration range and low detection levels can be attained. The currently used 75 µm i.d. capillary columns provide advantages such as high resolving power, high mass sensitivity and low mobile phase consumption and thus are in wide use. However, to analyze microdissected cells or microbiopsies, improvements in the detection performance of the chromatographic column are desired.

Figure 1.10 Comparison of normal flow rate electrospray (top) vs. a lower flow rate electrospray (bottom). The lower flow rate produces smaller droplets and thus offers efficient ion introduction to the MS inlet. Reprinted from reference [124].

The use of a narrow-bore column reduces chromatographic band dilution, which results in analytes eluting in a smaller volume and at higher concentration.

Further, narrow-bore columns are operated at lower flow rates, which has a positive influence on ESI-MS sensitivity, as ESI at low flow rates produces very small electrospray droplets that facilitate efficient ionization of the analytes [127,128].

At conventional flow rates (1 µL per minute), the ESI-MS response is mass sensitive. However, at low nanoliter per minute flow rates, due to the high ionization efficiency, it exhibits concentration sensitive behavior[129].
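The reduction in band dilution can be estimated with a common rule of thumb: for the same injected mass, column length and column efficiency, the peak concentration scales with the inverse square of the column inner diameter. This scaling is an assumption introduced here for illustration and is not stated in the cited references; the function name is hypothetical.

```python
def concentration_gain(reference_id_um, narrow_id_um):
    """Approximate gain in peak concentration when reducing column i.d.

    Assumes equal injected mass, column length and efficiency, so that
    peak volume scales with the column cross-sectional area.
    """
    return (reference_id_um / narrow_id_um) ** 2


# Moving from a 75 um i.d. column to a 15 um i.d. column.
gain = concentration_gain(75, 15)
```

Under these assumptions the 75 µm to 15 µm change corresponds to a 25-fold higher peak concentration, before any additional gain from improved ionization efficiency at low flow rates.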

In order to achieve low detection limits and to perform sensitive proteomic analyses, efforts were made to pack narrow-bore columns (<20 µm i.d.) with conventional microparticles. The difficulties with operation of these columns are decreased ratio of column i.d. to particle size and low permeability of the columns.

Monolith capillary columns have successfully emerged as an alternative to microparticle packed narrow-bore columns due to their moderate back pressure and high resolution[130]. This laboratory demonstrated the low attomole detection capability of a 20 µm i.d. PS-DVB monolith column when operated at a flow rate of 20 nL/min[128]. Similarly, Smith et al. demonstrated the use of 20 and 10 µm i.d. silica-based monolithic columns with a flow rate of 10 nL/min for sensitive and quantitative proteomic analyses [131,132].

1.3.2.3 Porous Layer Open Tubular (PLOT) Columns

Over many years, researchers have tried to implement open tubular capillary liquid chromatographic columns in order to achieve the high separation efficiency obtained with open tubular capillary GC columns. In 1978, Tsuda et al. were the first to investigate open tubular 60 µm i.d. capillary columns for LC[133]. In 1983, Jorgenson and Guthrie reported the use of 15 µm i.d. open tubular capillary columns for efficient separation[134], but these columns suffered from low retention and low sample loading capability. Therefore, efforts were focused on the development of porous layer open tubular columns to increase the sample loading capacity. Further, the issues encountered in the successful application of PLOT LC columns were 1) the lack of a sensitive, universal and small dead volume detector, 2) difficulties with generating gradient elution at low nanoliter flow rates and 3) difficulties with constructing PLOT columns with a relatively uniform porous layer.

Figure 1.11 Schematic diagram of the low dead volume connections used to design the 1D and 2D SPE-PLOT systems. Reprinted from reference [135].


Figure 1.12 Diagram of the advanced on-line 2-D SCX/PLOT/MS system using a 3.2 m × 10 µm i.d. PLOT column and an online triphasic trapping column. Reprinted from reference [135].

In 2007, this laboratory demonstrated the use of PS-DVB polymer based 10 µm i.d. PLOT columns at a 20 nL/min flow rate for high resolution and sensitive proteomic analysis[135]. The columns were prepared by in-situ polymerization of styrene and divinylbenzene. The column preparation method combined the polymerization and coating procedures in a single step. A peak capacity of close to 400 was obtained on these columns (column length of 4.2 meters and separation time of 260 minutes), indicating the high resolving performance of these columns, with good column-to-column reproducibility. The initial characterization of the column was carried out by loading the sample by split flow injection (direct injection). In order to handle microliter volumes of sample solution, an assembly of a 50 µm i.d. micro SPE column (PS-DVB monolith) and the PLOT column was designed. The dead volume between the SPE and PLOT columns was minimized by the use of a PicoClear tee (Figure 1.11), and the separation performance of the SPE-PLOT assembly (Figure 1.12) was found to be similar to that obtained by direct injection on the PLOT column alone.

To further improve the resolving power of the micro SPE/PLOT set up (Figure 1.11), a 2D chromatography platform was built by combining the high resolving PS-DVB based reversed-phase PLOT column with strong cation-exchange chromatography (SCX)[136]. To make the triphasic column, as shown in the inset of Figure 1.12, first a sol-gel frit was made in 75 µm i.d. fused silica tubing. Then, particles of SCX resin (Polysulfoethyl A) were packed in the silica tubing, forming a bed of 2 cm length. Another bed of 2 cm length of C18 RP particles was packed. To assemble the triphasic RP/SCX/RP trapping column, the biphasic (SCX/RP) column was connected to a 4 cm by 50 µm i.d. micro SPE column (PS-DVB monolith). As described in Figure 1.12, a PicoClear tee was used to minimize the dead volume between the triphasic trapping column and the PLOT column and to perform fast sample loading. Microtee "a" was employed to split the mobile phase during SCX fractionation, whereas microtee "b" was used to split the mobile phase during analytical separation on the PLOT column.

Chapter 2 details the analysis of 10,000 LCM collected breast cancer cells performed using the 1D PLOT LC-MS platform. A short (2.5 cm) SDS-PAGE run was performed for protein fractionation and digestion. Chapter 3 presents the application of 2D PLOT LC-MS to the analysis of 10,000 LCM collected triple negative breast cancer cells. Here, sample preparation was performed using the procedure described in Chapter 2; however, the peptides, instead of the proteins, were fractionated before LC-MS analysis.


1.4 Protein Glycosylation Analysis

It is estimated that more than 50% of all proteins are glycosylated[137].

There has been a growing interest in glycans (N- and O-linked) due to their roles in many specific biological recognition events, which include cell growth and development[138], tumor growth and metastasis[139], anticoagulation[140], immune recognition/response[141], cell-cell interaction and communication[142], and microbial pathogenesis[143]. Interestingly, most biopharmaceutical molecules are glycoproteins in which the attached glycan moiety (oligosaccharide chain) can significantly affect the antigenicity, solubility, bioactivity, stability, pharmacokinetics and protease resistance of the protein therapeutic.

Glycoproteins can have a variety of different N-glycans on a particular Asn-X-Ser/Thr motif (X is any amino acid residue except proline), which confers heterogeneity on the glycoproteins. All N-glycans share a common core structure and are classified into three categories: high mannose, complex and hybrid (Figure 1.13).

On the other hand, there is no consensus motif for O-linked glycosylation. Two monosaccharide units, or a monosaccharide and an amino acid residue, can be joined together by a glycosidic bond. The glycosidic bond is formed between the anomeric carbon of one monosaccharide and a hydroxy group of another monosaccharide, and it exists in two stereoisomeric forms: α and β.
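Because N-glycosylation follows the Asn-X-Ser/Thr consensus motif, candidate sites can be located directly in a protein sequence. The sketch below is a minimal illustration using a regular expression with a lookahead so that overlapping motifs are all reported; the function name and the test sequence are hypothetical, and the presence of a motif only marks a potential site, not an occupied one.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr.
# The lookahead keeps matches zero-width so overlapping motifs are found.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return 1-based positions of Asn residues in N-X-S/T motifs (X != P)."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence)]


# Hypothetical sequence: NGS (site), NPT (excluded), NNS and NSS (overlapping).
sites = find_sequons("MNGSANPTNNSS")
```

No comparable scan is possible for O-linked sites, since, as noted above, they lack a consensus motif.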

Figure 1.13 Chemical diversity of glycans. Reprinted from reference [144].

A recent example of the analysis of highly diverse glycan structures is the identification of 58 different complex N-glycan structures on one N-glycan site in mouse zona pellucida glycoprotein[145]. Furthermore, when additional Asn-X-Ser/Thr motifs are present in a molecule, different protein molecules in the pool of glycoproteins may carry different subsets of N-glycans on different motifs. This is described as site-specific heterogeneity or micro-heterogeneity. Within this pool of glycoprotein molecules, an individual, homogeneous component is called a glycoform. In order to investigate and characterize the various glycoforms present in a glycoprotein, glycosylation analysis is performed on three levels: released glycans, glycopeptides and intact glycoproteins.


1.4.1 Intact Glycoprotein Analysis

Due to the very high complexity of the heterogeneous mixture of glycoforms, high resolution separation methods are typically coupled to high resolution mass spectrometers for accurate mass measurement of the glycoforms. One such high resolution separation technique, capillary electrophoresis, is described in the following section.

1.4.1.2 Capillary Electrophoresis

Basics of CE

Electrophoresis is a technique used to separate charged species under the influence of an applied electric field. The separation is due to the difference in mobility of analytes, which is a net effect of charge, size, and shape of the analyte and properties of the electrolyte solution used.

Capillary electrophoresis (CE) uses an electric field to separate charged analytes. When the electric field is applied along the separation capillary, the charged analytes migrate towards the electrode of opposite polarity. The migration velocity ($u_p$) of the analyte depends upon the electrophoretic mobility ($\mu_p$) of the analyte and the applied electric field ($E$), assuming no electroosmotic flow [146].

$u_p = \mu_p E$ (Equation 1)

As the applied electric field is held constant in CE, the analytes must possess different electrophoretic mobilities in order to be separated. The electrophoretic mobility of an analyte at a given pH depends on $\eta$ (viscosity of the electrophoresis buffer), $z$ (charge on the analyte) and $r$ (Stokes radius of the analyte) and is given by Equation 2.

$\mu_p = \dfrac{z}{6 \pi \eta r}$ (Equation 2)

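However, in practice, the electrophoretic mobility is determined experimentally from the migration time of the analyte ($t_r$), the applied voltage ($V$) and the total length of the capillary ($L$), as shown in Equation 3. Because the field strength is $E = V/L$ and the migration velocity is $l/t_r$, Equation 3 follows directly from Equation 1.

$\mu_p = \left(\dfrac{l}{t_r}\right)\left(\dfrac{L}{V}\right)$ (Equation 3)

In the above equation, $l$ is the migration distance of the analyte from the injection point to the detection point[146].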
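As a worked illustration of Equation 3, the sketch below computes a mobility from a measured migration time. The function name and all numerical values are hypothetical, and the calculation assumes no electroosmotic flow, as in the text above.

```python
def electrophoretic_mobility(l_detector_cm, l_total_cm, voltage_v, t_migration_s):
    """Mobility in cm^2 V^-1 s^-1 from Equation 3: (l / t_r) * (L / V).

    Assumes no electroosmotic flow; otherwise the result is the apparent mobility.
    """
    velocity_cm_per_s = l_detector_cm / t_migration_s   # l / t_r
    field_v_per_cm = voltage_v / l_total_cm             # applied voltage / total length
    return velocity_cm_per_s / field_v_per_cm


# Hypothetical run: 60 cm capillary, detection at 50 cm, 30 kV, peak at 300 s.
mu = electrophoretic_mobility(50.0, 60.0, 30000.0, 300.0)
print(f"{mu:.3e} cm^2/(V s)")
```

Keeping the two lengths separate matters in practice, because the detection window is usually located before the capillary outlet.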

Electroosmotic Flow (EOF)

EOF can be described as the movement of the electrolyte solution inside a capillary column under an applied electric field, caused by fixed charged groups on the capillary surface. The silica-based capillaries used for CE experiments contain silanol (Si-OH) groups on their inner surface. At pH values greater than 3, the silanol groups begin to dissociate to form negatively charged silanoate (Si-O⁻) groups.

These negatively charged groups attract positively charged counter ions from the electrophoresis buffer solution to form an electric double layer or diffuse double layer [147]. The first layer of cations is strongly attracted towards the capillary surface and remains static, whereas the second layer of cations is further away from the capillary surface. This outer layer moves towards the cathode under the influence of the applied electric field, causing bulk migration of the electrolyte in the separation capillary. The extent of EOF is controlled by the applied electric field and the charge density on the capillary surface, which is dependent on the pH of the buffer solution.

Generally, in a typical CE system, the EOF is directed toward the cathode. The positively charged analytes migrate with the EOF, whereas the negatively charged analytes migrate in the direction opposite to the EOF, which results in longer migration times for negatively charged analytes. However, in order to avoid interaction of peptides/proteins with the charged surface of the silica capillary, either neutral coated capillaries or capillaries with a charge opposite to that of the peptides/proteins are preferred.
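The migration order described above can be sketched numerically. The example below assumes the standard relation that the apparent mobility is the sum of the electrophoretic and electroosmotic mobilities; the function name and all mobility values are hypothetical and chosen only for illustration.

```python
def migration_time_s(mu_ep, mu_eof, field_v_per_cm, l_detector_cm):
    """Time to reach a detector at the cathode end, or None if never reached.

    mu_ep is signed: positive for cations, negative for anions (cm^2 V^-1 s^-1).
    mu_eof is the electroosmotic mobility, positive when directed to the cathode.
    """
    mu_apparent = mu_ep + mu_eof
    if mu_apparent <= 0:
        return None  # analyte moves away from (or never reaches) the detector
    return l_detector_cm / (mu_apparent * field_v_per_cm)


# Hypothetical values: EOF mobility 5e-4, field 500 V/cm, detector at 50 cm.
for label, mu_ep in [("cation", 2e-4), ("neutral", 0.0), ("anion", -2e-4)]:
    t = migration_time_s(mu_ep, 5e-4, 500.0, 50.0)
    print(f"{label}: {t:.0f} s")
```

With these illustrative numbers the cation arrives first, the neutral solute travels with the EOF, and the anion arrives last, which matches the qualitative picture in Figure 1.14.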

Figure 1.14 Electric double layer at the capillary wall and creation of EOF (N = neutral solute). Reprinted from reference [148]

Efficiency and Resolution in CE

The number of theoretical plates ($N$), used to evaluate the separation performance of CE, is given by Equation 4.

$N = \dfrac{\mu_p V}{2D}$ (Equation 4)

In this equation, $V$ is the applied voltage and $D$ represents the diffusion coefficient of the analyte.

The resolution is expressed as

$R = \dfrac{1}{4} \sqrt{N} \left[\dfrac{\Delta\mu}{\mu_{avg}}\right]$ (Equation 5)

where $\Delta\mu$ is the difference in the analyte mobilities and $\mu_{avg}$ is the average of their mobilities.
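To give a feel for the magnitudes involved, the sketch below evaluates Equations 4 and 5 for two closely related analytes. It takes the driving term in Equation 4 to be the applied voltage (an assumption of this sketch), considers longitudinal diffusion only, and uses hypothetical mobilities and a hypothetical diffusion coefficient.

```python
import math


def plate_number(mu, voltage_v, diffusion_cm2_per_s):
    """Theoretical plates from Equation 4, N = mu * V / (2 * D)."""
    return mu * voltage_v / (2.0 * diffusion_cm2_per_s)


def resolution(mu_1, mu_2, voltage_v, diffusion_cm2_per_s):
    """Resolution from Equation 5, using the average mobility to compute N."""
    mu_avg = (mu_1 + mu_2) / 2.0
    n = plate_number(mu_avg, voltage_v, diffusion_cm2_per_s)
    return 0.25 * math.sqrt(n) * abs(mu_1 - mu_2) / mu_avg


# Hypothetical analytes differing by 2% in mobility, 30 kV, D = 1e-6 cm^2/s.
print(plate_number(3.03e-4, 30000.0, 1e-6))
print(resolution(3.00e-4, 3.06e-4, 30000.0, 1e-6))
```

These are ideal, diffusion-limited values; real separations give lower plate numbers because of the additional broadening sources discussed in the next paragraph.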

The number of theoretical plates (N) and resolution, two main measures of the performance of CE, are influenced by the spread of the analyte zone in the separation capillary. In an ideal case, the longitudinal diffusion may be considered as the only source of zone broadening; however, in reality, broadening can occur due to, for example, electrostatic interaction between the analyte and the capillary wall, injection volume and Joule heating.

Capillary Zone Electrophoresis (CZE)

Among the various CE modes, capillary zone electrophoresis is the simplest and most frequently used technique. CZE takes place in an electrolyte buffer in open capillaries. The separation depends on the size to charge ratio of the analyte, i.e. on the different electrophoretic mobilities in the applied electric field. Basic amino acid residues such as lysine and arginine are responsible for the interaction of peptides and proteins with the negatively charged capillary wall. Because of their large size, this effect is significant for proteins. It was shown that the interaction can be reduced by modifying the pH[149] or ionic strength[150] of the electrolyte.

However, in order to reduce capillary-wall interactions, various capillary coating methods are widely used [151-153]. Two major types of capillary wall modification are covalent (static) and dynamic. In the covalent modification, the negatively charged silanol groups are covalently linked to the coating material. The static coatings include compounds such as polyacrylamide, polyethylene glycol, (3-aminopropyl)trimethoxysilane and poly(vinyl alcohol). Proprietary compounds such as EOTrol™ and UltraTrol™ have been in use as dynamic coatings[154].

1.4.1.3 Capillary Electrophoresis Coupled to Mass Spectrometry

The initial efforts to couple CE to mass spectrometry (online CE-ESI-MS) were made by Smith et al. in the late 1980s[155,156]. Since then, many groups have contributed to the development of various interfaces for CE-ESI-MS, and these interfaces can be broadly classified into sheath flow, sheathless and liquid junction [157].

The liquid junction is sometimes considered a type of sheath flow interface[158]. In a CE-ESI-MS interface, the electric contact with the electrolyte and sheath liquid serves two purposes: (1) to establish a potential difference between the ESI spray tip and the mass spectrometer inlet and (2) to complete the CE electric circuit.

Sheath Flow Interface

The sheath flow interfaces are designed in such a way that the separation capillary is concentrically positioned inside a spray needle and the sheath liquid mixes with the separated analytes, followed by electrospray ionization. The electric contact, for the ESI spray and the CE circuitry, is achieved through the sheath liquid by applying voltage on a metal spray needle[156] or by placing a metal wire between a nonconducting needle and the separation capillary[159]. Due to the dilution effect of the sheath liquid, the sensitivity of the sheath flow interface is lower than that of the sheathless interface. In order to obtain a stable ESI spray, the electrolyte or sheath liquid should be volatile and acidic for positive ESI mode. If the separation buffer system does not meet these requirements, they can be met by tuning the sheath liquid. Therefore, the sheath flow interface allows the flexibility of combining a separation buffer system suited to a good separation of a complex analyte mixture with a sheath liquid compatible with the ESI mass spectrometer.

The ESI spray obtained using the sheath flow interface is generally more stable compared to the sheathless interface [160,161].

Figure 1.15 Different types of CE/MS interfaces: (a) sheath flow; (b) sheathless flow; and (c) liquid junction. Reprinted from reference [148]

Sheathless Interface

This type of interface depends on the electroosmotic flow of the CE separation, which falls in the range of a few nanoliters per minute. Therefore, if the EOF is not fast enough, the ESI spray can be unstable and discontinuous. However, the EOF can be increased by adjusting the separation buffer composition and electrophoresis conditions. In order to reduce interactions between the analyte and the capillary surface, coated capillaries are employed in the CE separation. Coated capillaries reduce the EOF, and therefore they cannot be used with sheathless interfaces. A significant advantage of the sheathless interface is increased sensitivity due to the use of smaller spray needle orifices, which produce smaller droplets and allow efficient solvent evaporation and thus efficient analyte ionization[118,162].

Liquid Junction Interface

As the name suggests, this type of interface has the separation capillary and the ESI emitter separated by a gap or junction, which allows the sheath liquid to mix with the CE-separated analytes and introduce them into the mass spectrometer. Similar to the sheath flow interface, this interface offers the flexibility of using an MS-friendly solvent composition without influencing the CE separation. The advantages of liquid junctions are the flexibility in the choice of spraying solution and stable spraying conditions for ESI-MS detection. Also, the dilution of the analyte by the sheath liquid is low for a liquid junction compared to a sheath flow interface operating at a microliter per minute flow rate [163]. In Chapter 4, we present a home-made CE-MS design for short CE separations, incorporating a liquid junction interface to facilitate a stable ESI spray.

1.4.1.4 Application of CE-MS for Analysis of Intact Glycoforms

Capillary electrophoresis provides very high resolving power for intact glycoform analysis. Many groups have used CE as a tool for partial or complete separation of complex mixtures of glycoforms, for example, ribonuclease B, ovalbumin, α1-acid glycoprotein, horseradish peroxidase and recombinant human erythropoietin [164-167]. Analysis of the intact glycoforms provides information about the total and cumulative numbers of monosaccharide units and about other post-translational modifications, such as acetylation, sulfation, phosphorylation, γ-carboxylation and deamidation, present in the glycoform. This information can be used to monitor the consistency of production processes and the stability of therapeutic glycoprotein formulations. An example of the separation of glycoforms of recombinant human erythropoietin by Balaguer et al. is presented in Figure 1.16, in which glycan and intact protein level information was used to characterize individual isoforms.

Figure 1.16 A shows the separation of sialoforms (isoforms having different numbers of sialic acids) of the protein, whereas in Figure 1.16 B, the extracted ion electropherograms demonstrate the impressive separation power of CE in resolving different isoforms having the same number of sialic acids. The adjacent isoforms differ from each other by one Hex+HexNAc unit.
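The mass spacing between such adjacent isoforms can be estimated from the residue masses of the monosaccharides. The sketch below is illustrative only: it uses commonly tabulated monoisotopic residue masses, whereas average masses are usually more appropriate for deconvoluted spectra of intact proteins, and the function names are hypothetical.

```python
# Monoisotopic residue masses (Da) of common monosaccharides (anhydro form).
RESIDUE_MASS = {
    "Hex": 162.05282,
    "HexNAc": 203.07937,
    "Fuc": 146.05791,
    "Neu5Ac": 291.09542,
}


def composition_mass(composition):
    """Mass added to a glycoform by a composition such as {"Hex": 1, "HexNAc": 1}."""
    return sum(RESIDUE_MASS[name] * count for name, count in composition.items())


def mz_spacing(composition, charge):
    """Spacing on the m/z axis between two glycoforms at a given charge state."""
    return composition_mass(composition) / charge


unit = {"Hex": 1, "HexNAc": 1}
print(f"{composition_mass(unit):.4f} Da")
print(f"{mz_spacing(unit, 10):.4f} m/z at z = 10")
```

The second value shows why high resolving power is needed for intact glycoforms: at high charge states, the spacing between neighbouring isoforms on the m/z axis becomes small.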

Figure 1.16 CZE-ESI-MS analysis of a recombinant human EPO showing (A) different sialic acid (SiA) isoforms, (B) different HexHexNAc content of one specific SiA isoform, and (C) the original mass spectrum and the deconvoluted spectrum of one specific isoform, revealing acetylated (Ac) and oxidized (Ox) isoforms. Reprinted from reference [168]

1.4.2 Glycan analysis

Most protein-based biopharmaceuticals are glycosylated proteins. The presence of the glycan moiety directly affects the efficacy and safety of a glycosylated biopharmaceutical protein. Therefore, glycan analysis of biopharmaceuticals is essential for both pharmaceutical and regulatory reasons. The number and similarity of the possible structures of these glycans make their analysis technically challenging.

1.4.2.1 Glycan release methods

The N-linked glycans can be released from the glycopeptides or glycoprotein by either chemical or enzymatic methods. The methods used to release glycans from glycoprotein need to comply with the following criteria:

The most commonly used chemical method for the release of N-glycans, hydrazinolysis, was developed mainly by the group of Kobata[169] and has been widely used for the analysis of many glycoproteins. The enzymatic release of the N-glycan moiety can be performed with several commercially available endoglycosidases. The precise specificities of the endoglycosidases make it possible to determine the type of glycan attached to the asparagine (i.e. complex, high-mannose or hybrid). The most widely used enzyme, peptide N-glycosidase F (PNGase F), cleaves the glycosylamine linkage between the asparagine and the glycan moiety (N-linked glycans), and thus releases the glycan and converts the asparagine to aspartic acid. Use of this strategy allows the investigation of the glycosylation site in addition to the release of glycans.
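The conversion of asparagine to aspartic acid leaves a small mass signature on the deglycosylated peptide, which is what makes site identification possible. The following sketch computes that shift from monoisotopic residue masses; the function name and the example peptide are hypothetical.

```python
# Monoisotopic residue masses (Da).
ASN_RESIDUE = 114.042927
ASP_RESIDUE = 115.026943

# Mass gained by the peptide when PNGase F converts Asn to Asp (about +0.984 Da).
DEAMIDATION_SHIFT = ASP_RESIDUE - ASN_RESIDUE


def deglycosylated_sequence(sequence, site):
    """Return the sequence with the Asn at the 1-based position `site` replaced by Asp."""
    if sequence[site - 1] != "N":
        raise ValueError("position does not hold an asparagine")
    return sequence[: site - 1] + "D" + sequence[site:]


# Hypothetical glycopeptide with an occupied sequon at position 3.
print(deglycosylated_sequence("AKNGSR", 3))
print(f"{DEAMIDATION_SHIFT:.6f} Da")
```

Because spontaneous deamidation produces the same shift, this signature is usually interpreted together with the presence of the Asn-X-Ser/Thr motif.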

1.4.2.2 Enzymatic Sequencing of Oligosaccharides

The enzymatic analysis of oligosaccharides with the help of highly specific exoglycosidases is an important tool for determining the glycan sequence, the anomeric linkages and the linkage position of the respective monosaccharide[170]. To assess the susceptibility of a glycan to cleavage by exoglycosidases, the glycan digestion can be performed sequentially with a single enzyme or with enzyme arrays. The advantage of the reagent-array analysis method is that aliquots of oligosaccharides are digested simultaneously with defined mixtures of exoglycosidase enzymes, followed by a single analysis of the combined products[171,172]. However, problems such as the absence of specific enzymes for all types of glycosidic linkages, steric hindrance to the release of a terminal monosaccharide and the requirement for high purity exoglycosidases prevent this method from providing a comprehensive and exhaustive characterization of oligosaccharides.

Figure 1.19. Some of the exoglycosidases commonly used to determine the structure of the N-glycans. Symbol key: fucose, N-acetylglucosamine, mannose, galactose, sialic acid. Reprinted from www.sigmaaldrich.com/img/assets/15880/glycan_analysis.pdf.

1.4.2.3 HPLC analysis of glycans

The separation method used to analyze glycans depends upon the characteristics of the glycans. Three types of chromatography, i.e. normal phase, weak anion exchange and reversed-phase HPLC, have been used to separate fluorescently labeled sialylated and neutral N- and O-linked glycans [146]. To analyze native glycans (label-free approach), high pH anion exchange chromatography with pulsed amperometric detection can be used. The high pH is a major drawback of this method, as sodium hydroxide has to be removed to recover the glycans for further structural analysis.

HILIC

This mode of chromatography uses a stationary phase with amino, cyano, alpha-cyclodextrin, polyhydroxyethyl or aspartamide functional groups. The samples are introduced onto the column in a solution with a high organic solvent content. The glycans are eluted from the column with an increasing concentration of aqueous buffer solution. This technique utilizes the differences in hydrophilicity of the glycans and their interaction with the stationary phase to achieve high resolution and reproducible separation. Due to the ability of this technique to analyze sialylated and neutral glycans in a single chromatographic run, it is widely used for comparative profiling of glycan pools. Partially hydrolyzed dextran, as a source of glucose oligomers, is used as an external standard. The elution time of each peak is expressed in glucose units (GU), i.e. as the chain length of the dextran oligomer having the same retention time. Each eluting glycan in the chromatogram of an unknown sample is assigned a GU value based on its elution time compared to the standard dextran ladder. These values are generally used to predict the possible composition of the unknown glycans [173].
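The GU assignment can be sketched as an interpolation against the dextran ladder. The example below uses simple linear interpolation between neighbouring ladder peaks, which is an assumption of this sketch (a polynomial fit to the whole ladder is a common alternative); the ladder retention times and the function name are hypothetical.

```python
from bisect import bisect_right


def assign_gu(retention_time, ladder):
    """Interpolate a GU value from a dextran ladder.

    ladder: list of (retention_time_min, glucose_units) sorted by retention time.
    """
    times = [t for t, _ in ladder]
    if not times[0] <= retention_time <= times[-1]:
        raise ValueError("retention time outside the range of the ladder")
    # Index of the ladder peak just after the unknown, clamped to the last peak.
    i = min(bisect_right(times, retention_time), len(times) - 1)
    (t0, gu0), (t1, gu1) = ladder[i - 1], ladder[i]
    return gu0 + (gu1 - gu0) * (retention_time - t0) / (t1 - t0)


# Hypothetical dextran ladder (minutes, GU), for illustration only.
ladder = [(10.0, 1), (14.0, 2), (17.5, 3), (20.5, 4), (23.0, 5)]
print(assign_gu(19.0, ladder))
```

Values outside the ladder are rejected rather than extrapolated, since the relationship between retention time and chain length is not linear over the full range.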

Weak Anion Exchange Chromatography

Weak anion exchange chromatography separates glycans based on the relative binding affinities of the glycans to the weak anion exchange groups in the column [170]. This type of chromatography separates glycans based on the number of charged groups. The elution time of the glycan is affected by the number of sialic acid, sulphate, phosphate or uronic acid functional groups present on the glycan [170]. In addition to the charge on the glycan moiety, the size of the glycan has an influence on the elution profile of the glycans.

Porous Graphitic Carbon (PGC)

In this type of chromatography, the stationary phase consists solely of graphite-type carbon. It is very useful for glycan analysis, as it offers remarkable selectivity for isomeric glycans and increased retention, particularly for charged glycans. Compared to conventional reversed-phase stationary phases, PGC adsorbs polar analytes strongly and thus is very useful for the analysis of native and reduced glycans. We have used a PGC SPE cartridge for the removal of salts and other reagents from a pool of released glycans, as described in Chapter 4.


1.4.3 Glycopeptide Analysis

Glycopeptide analysis is particularly useful for obtaining site-specific glycosylation information. Endoproteinases, such as trypsin, chymotrypsin, Asp-N, Glu-C and Lys-C, are commonly used to isolate glycopeptides carrying individual glycosylation sites. The identification of glycopeptides within a complex protein digest using mass spectrometry is still a challenging task for several reasons. The first is that glycopeptides often represent a minor portion of the total peptide digest. The second is the relatively low signal intensity of glycopeptides compared to non-glycosylated peptides, due to glycan microheterogeneity and lower ionization efficiency. The third is glycopeptide signal suppression in the presence of other peptides[174], particularly in the case of glycans with a negatively charged sialic acid moiety. Therefore, several glycopeptide enrichment strategies, such as lectin affinity technology[175,176], size exclusion[177], hydrophilic interaction chromatography[178], boronic acid beads[179] and hydrazine resin[180] based capture techniques, have been developed.

Often, glycopeptide mixtures are separated by reversed-phase HPLC and detected by subsequent ESI-MS. Carr et al. [181] observed that glycopeptides can be specifically identified by the presence of diagnostic oxonium ions such as m/z 204 (HexNAc+), 366 (HexHexNAc+), 163 (Hex+) and 292 (Neu5Ac+). Generation of these low molecular weight marker ions from glycopeptides is observed in the source region of all types of mass spectrometer [182,183]. Therefore, the combination of on-line liquid chromatography separation with mass spectrometry detection is a powerful tool for the localization and identification of glycopeptides with high sensitivity. Subsequent to identification of the glycopeptide, the mass of the glycopeptide and the oligosaccharide moiety can be determined if the protein is known. As the glycosidic linkages are far more labile than the peptide bonds, collision induced fragmentation of glycopeptides predominantly provides information on the composition and the sequence of the oligosaccharide attached to the peptide moiety[184]. The fragment ions from the peptide backbone are observed at relatively low signal intensity, which can be increased by fine-tuning the collision energy used for fragmentation. However, other tandem mass spectrometry techniques, such as electron capture dissociation[185] and electron transfer dissociation[186], have been shown to preferentially cleave the peptide backbone, leaving the glycan structures intact. This is utilized for unambiguous assignment of the glycosylation site in glycopeptides.
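Screening spectra for the diagnostic oxonium ions can be sketched as a simple peak-list lookup. The example below is a minimal illustration, not the procedure of the cited work: the accurate m/z values are commonly tabulated monoisotopic values for the nominal masses given above, and the function name, the tolerance and the example peak list are hypothetical.

```python
# Diagnostic oxonium ions (singly charged, monoisotopic m/z).
OXONIUM_IONS = {
    "Hex+": 163.0601,
    "HexNAc+": 204.0867,
    "Neu5Ac+": 292.1027,
    "HexHexNAc+": 366.1395,
}


def find_oxonium_ions(peaks, tolerance=0.01):
    """Return {ion name: (m/z, intensity)} for marker ions present in a peak list.

    peaks: iterable of (m/z, intensity) pairs; tolerance is in m/z units.
    When several peaks fall inside the window, the most intense one is kept.
    """
    found = {}
    for name, reference_mz in OXONIUM_IONS.items():
        hits = [(mz, inten) for mz, inten in peaks if abs(mz - reference_mz) <= tolerance]
        if hits:
            found[name] = max(hits, key=lambda peak: peak[1])
    return found


# Hypothetical fragment spectrum, for illustration only.
spectrum = [(163.0603, 1200.0), (204.0869, 8500.0), (366.1391, 4300.0), (512.2700, 900.0)]
print(sorted(find_oxonium_ions(spectrum)))
```

A spectrum containing several of these markers is a candidate glycopeptide spectrum; the absence of the Neu5Ac marker in this example would suggest a non-sialylated glycan.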


1.5 References

[1] V.C. Wasinger, S.J. Cordwell, A. Cerpa-Poljak, J.X. Yan, A.A. Gooley, M.R. Wilkins, M.W. Duncan, R. Harris, K.L. Williams, I. Humphery-Smith, Electrophoresis 16 (1995) 1090.

[2] R. Aebersold, Nature 422 (2003) 115.

[3] H. Keshishian, T. Addona, M. Burgess, D.R. Mani, X. Shi, E. Kuhn, M.S. Sabatine, R.E. Gerszten, S.A. Carr, Molecular & Cellular Proteomics 8 (2009) 2339.

[4] P. Mallick, B. Kuster, Nature Biotechnology 28 (2010) 695.

[5] G.S. Omenn, D.J. States, M. Adamski, T.W. Blackwell, R. Menon, H. Hermjakob, R. Apweiler, B.B. Haab, R.J. Simpson, J.S. Eddes, E.A. Kapp, R.L. Moritz, D.W. Chan, A.J. Rai, A. Admon, R. Aebersold, J. Eng, W.S. Hancock, S.A. Hefta, H. Meyer, Y.K. Paik, J.S. Yoo, P.P. Ping, J. Pounds, J. Adkins, X.H. Qian, R. Wang, V. Wasinger, C.Y. Wu, X.H. Zhao, R. Zeng, A. Archakov, A. Tsugita, I. Beer, A. Pandey, M. Pisano, P. Andrews, H. Tammen, D.W. Speicher, S.M. Hanash, Proteomics 5 (2005) 3226.

[6] L.M.F. de Godoy, J.V. Olsen, J. Cox, M.L. Nielsen, N.C. Hubner, F. Frohlich, T.C. Walther, M. Mann, Nature 455 (2008) 1251.

[7] J. Rush, A. Moritz, K.A. Lee, A. Guo, V.L. Goss, E.J. Spek, H. Zhang, X.M. Zha, R.D. Polakiewicz, M.J. Comb, Nature Biotechnology 23 (2005) 94.

[8] R.N. Wine, J.M. Dial, K.B. Tomer, C.H. Borchers, Analytical Chemistry 74 (2002) 1939.

[9] M. Ahmed, J. Forsberg, P. Bergsten, Journal of Proteome Research 4 (2005) 931.

[10] S.M. Hanash, J.R. Strahler, J.V. Neel, N. Hailat, R. Melhem, D. Keim, X.X. Zhu, D. Wagner, D.A. Gage, J.T. Watson, Proceedings of the National Academy of Sciences of the United States of America 88 (1991) 5709.

[11] W.H. McDonald, J.R. Yates, Disease Markers 18 (2002) 99.

[12] A.J. Fischer, K.L. Goss, T.E. Scheetz, C.L. Wohlford-Lenane, J.M. Snyder, P.B. McCray, American Journal of Respiratory Cell and Molecular Biology 40 (2009) 189.

[13] L. Beretta, Nature Methods 4 (2007) 785.

[14] A. Bairoch, Bioinformatics 16 (2000) 48.

[15] D.N. Perkins, D.J.C. Pappin, D.M. Creasy, J.S. Cottrell, Electrophoresis 20 (1999) 3551.

[16] J.K. Eng, A.L. McCormack, J.R. Yates, Journal of the American Society for Mass Spectrometry 5 (1994) 976.

[17] D. Claire, Y.J.R. III, in ENCYCLOPEDIA OF LIFE SCIENCES, John Wiley & Sons, 2006.

[18] J.E.V. Eyk, Clinical proteomics: From Diagnosis to therapy Wiley-VCH, 2008.

[19] P.A. Everley, J. Krijgsveld, B.R. Zetter, S.P. Gygi, Molecular & Cellular Proteomics 3 (2004) 729.

[20] K.P. Rosenblatt, P. Bryant-Greenwood, J.K. Killian, A. Mehta, D. Geho, V. Espina, E.E. Petricoin, L.A. Liotta, Annual Review of Medicine 55 (2004) 97.

[21] S.M. Carlson, A. Najmi, H.J. Cohen, Proteomics 7 (2007) 1037.

[22] E. Cuadrado, A. Rosell, J. Alvarez-Sabin, J. Montaner, Revista De Neurologia 44 (2007) 551.

[23] S. Dasari, L. Pereira, A.P. Reddy, J.E.A. Michaels, X.F. Lu, T. Jacob, A. Thomas, M. Rodland, C.T. Roberts, M.G. Gravett, S.R. Nagalla, Journal of Proteome Research 6 (2007) 1258.

[24] A.S. Jackson, A. Sandrini, C. Campbell, S. Chow, P.S. Thomas, D.H. Yates, American Journal of Respiratory and Critical Care Medicine 175 (2007) 222.

[25] B.L. Hood, M.M. Darfler, T.G. Guiel, B. Furusato, D.A. Lucas, B.R. Ringeisen, I.A. Sesterhenn, T.P. Conrads, T.D. Veenstra, D.B. Krizman, Molecular & Cellular Proteomics 4 (2005) 1741.

[26] N.L. Anderson, N.G. Anderson, Molecular & Cellular Proteomics 1 (2002) 845.

[27] L.A. Liotta, M. Ferrari, E. Petricoin, Nature 425 (2003) 905.

[28] L. Anderson, Journal of Physiology-London 563 (2005) 23.

[29] C. Rosty, L. Christa, S. Kuzdzal, W.M. Baldwin, M.L. Zahurak, F. Carnot, D.W. Chan, M. Canto, K.D. Lillemoe, J.L. Cameron, C.J. Yeo, R.H. Hruban, M. Goggins, Cancer Research 62 (2002) 1868.

[30] N. Rifai, M.A. Gillette, S.A. Carr, Nature Biotechnology 24 (2006) 971.

[31] P. Sedlaczek, I. Frydecka, M. Gabrys, A. van Dalen, R. Einarsson, A. Harlozinska, Cancer 95 (2002) 1886.

[32] A. De Iuliis, J. Grigoletto, A. Recchia, P. Giusti, P. Arslan, Clinica Chimica Acta 357 (2005) 202.

[33] D. Sizova, E. Charbaut, F. Delalande, F. Poirier, A.A. High, F. Parker, A. Van Dorsselaer, M. Duchesne, A. Diu-Hercend, Neurobiology of Aging 28 (2007) 357.

[34] J.V. Olsen, P.A. Nielsen, J.R. Andersen, M. Mann, J.R. Wisniewski, Brain Research 1134 (2007) 95.

[35] F. Vivanco, Methods in Molecular Biology (2007).

[36] Y.M. Koen, N.V. Gogichaeva, M.A. Alterman, R.P. Hanzlik, Chemical Research in Toxicology 20 (2007) 511.

[37] H. Fraenkel-Conrat, Journal of Biological Chemistry 174 (1948) 827.

[38] S.R. Setlur, K.D. Mertz, Y. Hoshida, F. Demichelis, M. Lupien, S. Perner, A. Sboner, Y. Pawitan, O. Andren, L.A. Johnson, J. Tang, H.O. Adami, S. Calza, A.M. Chinnaiyan, D. Rhodes, S. Tomlins, K. Fall, L.A. Mucci, P.W. Kantoff, M.J. Stampfer, S.O. Andersson, E. Varenhorst, J.E. Johansson, M. Brown, T.R. Golub, M.A. Rubin, Journal of the National Cancer Institute 100 (2008) 815.

[39] S.R. Shi, M.E. Key, K.L. Kalra, Journal of Histochemistry & Cytochemistry 39 (1991) 741.

[40] N.J. Nirmalan, P. Harnden, P.J. Selby, R.E. Banks, Molecular Biosystems 4 (2008) 712.

[41] H. Sugimoto, T.M. Mundel, M.W. Kieran, R. Kalluri, Cancer Biology & Therapy 5 (2006) 1640.

[42] S.R. Shi, C. Liu, B.M. Balgley, C. Lee, C.R. Taylor, Journal of Histochemistry & Cytochemistry 54 (2006) 739.

[43] K.F. Becker, C. Schott, S. Hipp, V. Metzger, P. Porschewski, R. Beck, J. Nahrig, I. Becker, H. Hofler, Journal of Pathology 211 (2007) 370.

[44] J. Blonder, K.C. Chan, H.J. Issaq, T.D. Veenstra, Nature Protocols 1 (2006) 2784.

[45] W.A. Bonner, R.G. Sweet, H.R. Hulett, L.A. Herzenberg, Review of Scientific Instruments 43 (1972) 404.

[46] G.H. Luers, R. Hartig, H. Mohr, M. Hausmann, H.D. Fahimi, C. Cremer, A. Volkl, Electrophoresis 19 (1998) 1205.

[47] L. O.H., J.V. Passonneau, Academic Press, 1972.

[48] D. Shibata, D. Hawes, Z.H. Li, A.M. Hernandez, C.H. Spruck, P.W. Nichols, American Journal of Pathology 141 (1992) 539.

[49] M.R. Emmert-Buck, R.F. Bonner, P.D. Smith, R.F. Chuaqui, Z.P. Zhuang, S.R. Goldstein, R.A. Weiss, L.A. Liotta, Science 274 (1996) 998.

[50] G.I. Murray, Acta Histochemica 109 (2007) 171.

[51] N.L. Simone, R.F. Bonner, J.W. Gillespie, M.R. Emmert-Buck, L.A. Liotta, Trends in Genetics 14 (1998) 272.

[52] V. Espina, J.D. Wulfkuhle, V.S. Calvert, A. VanMeter, W.D. Zhou, G. Coukos, D.H. Geho, E.F. Petricoin, L.A. Liotta, Nature Protocols 1 (2006) 586.

[53] A.F. Okuducu, J.C. Hahne, A. Von Deimling, N. Wernert, International Journal of Molecular Medicine 15 (2005) 763.

[54] S. Curran, 2005.

[55] K. Schutze, I. Becker, K.F. Becker, S. Thalhammer, R.W. Stark, W.M. Heckl, M. Bohm, H. Posl, Genetic Analysis-Biomolecular Engineering 14 (1997) 1.

[56] K. Schutze, H. Posl, G. Lahr, Cellular and Molecular Biology 44 (1998) 735.

[57] L. Schermelleh, S. Thalhammer, W. Heckl, H. Posl, T. Cremer, K. Schutze, M. Cremer, Biotechniques 27 (1999) 362.

[58] A. Cornea, A. Mungenast, Laser Capture Microscopy and Microdissection 356 (2002) 3.

[59] D.C. Allred, S.K. Mohsin, S.A.W. Fuqua, Endocrine-Related Cancer 8 (2001) 47.

[60] A.P. Fuller, D. Palmer-Toy, M.G. Erlander, D.C. Sgroi, Journal of Mammary Gland Biology and Neoplasia 8 (2003) 335.

[61] D. Thakur, T. Rejtar, B.L. Karger, N.J. Washburn, C.J. Bosques, N.S. Gunay, Z. Shriver, G. Venkataraman, Analytical Chemistry 81 (2009) 8900.

[62] G. Zhang, D. Fenyo, T.A. Neubert, Journal of Proteome Research 7 (2008) 678.

[63] L.E. Bennett, W.N. Charman, D.B. Williams, S.A. Charman, Journal of Pharmaceutical and Biomedical Analysis 12 (1994) 1103.

[64] R.H. Garrett, , 2010.

[65] Y.F. Shen, R. Zhao, S.J. Berger, G.A. Anderson, N. Rodriguez, R.D. Smith, Analytical Chemistry 74 (2002) 4235.

[66] L. Novakova, L. Matysova, P. Solich, Talanta 68 (2006) 908.

[67] M.P. Washburn, D. Wolters, J.R. Yates, Nature Biotechnology 19 (2001) 242.

[68] X.M. Han, A. Aslanian, J.R. Yates, Current Opinion in Chemical Biology 12 (2008) 483.

[69] L.F. Marvin, M.A. Roberts, L.B. Fay, Clinica Chimica Acta 337 (2003) 11.

[70] A.G. Marshall, C.L. Hendrickson, G.S. Jackson, Mass Spectrometry Reviews 17 (1998) 1.

[71] S.D.H. Shi, J.J. Drader, M.A. Freitas, C.L. Hendrickson, A.G. Marshall, International Journal of Mass Spectrometry 195 (2000) 591.

[72] A. Makarov, Analytical Chemistry 72 (2000) 1156.

[73] Q.Z. Hu, R.J. Noll, H.Y. Li, A. Makarov, M. Hardman, R.G. Cooks, Journal of Mass Spectrometry 40 (2005) 430.

[74] W.J. Henzel, T.M. Billeci, J.T. Stults, S.C. Wong, C. Grimley, C. Watanabe, Proceedings of the National Academy of Sciences of the United States of America 90 (1993) 5011.

[75] P. James, M. Quadroni, E. Carafoli, G. Gonnet, Biochemical and Biophysical Research Communications 195 (1993) 58.

[76] M. Mann, M. Wilm, Analytical Chemistry 66 (1994) 4390.

[77] A.L. McCormack, D.M. Schieltz, B. Goode, S. Yang, G. Barnes, D. Drubin, J.R. Yates, Analytical Chemistry 69 (1997) 767.

[78] R.G. Sadygov, D. Cociorva, J.R. Yates, Nature Methods 1 (2004) 195.

[79] I.A. Papayannopoulos, Mass Spectrometry Reviews 14 (1995) 49.

[80] J.T. Stults, J.T. Watson, Biomedical and Environmental Mass Spectrometry 14 (1987) 583.

[81] D.L. Tabb, L.L. Smith, L.A. Breci, V.H. Wysocki, D. Lin, J.R. Yates, Analytical Chemistry 75 (2003) 1155.

[82] F. Schutz, E.A. Kapp, R.J. Simpson, T.P. Speed, Biochemical Society Transactions 31 (2003) 1479.

[83] V.H. Wysocki, G. Tsaprailis, L.L. Smith, L.A. Breci, Journal of Mass Spectrometry 35 (2000) 1399.

[84] A.R. Dongre, J.L. Jones, A. Somogyi, V.H. Wysocki, Journal of the American Chemical Society 118 (1996) 8365.

[85] E.I. Chen, J.R. Yates, Molecular Oncology 1 (2007) 144.

[86] T.D. Veenstra, Journal of Chromatography B-Analytical Technologies in the Biomedical and Life Sciences 847 (2007) 3.

[87] W.H. Zhu, J.W. Smith, C.M. Huang, Journal of Biomedicine and Biotechnology (2010).

[88] H.B. Liu, R.G. Sadygov, J.R. Yates, Analytical Chemistry 76 (2004) 4193.

[89] M.Q. Dong, J.D. Venable, N. Au, T. Xu, S.K. Park, D. Cociorva, J.R. Johnson, A. Dillin, J.R. Yates, Science 317 (2007) 660.

[90] B. Zybailov, A.L. Mosley, M.E. Sardiu, M.K. Coleman, L. Florens, M.P. Washburn, Journal of Proteome Research 5 (2006) 2339.

[91] B. Zhang, N.C. VerBerkmoes, M.A. Langston, E. Uberbacher, R.L. Hettich, N.F. Samatova, Journal of Proteome Research 5 (2006) 2909.

[92] J.X. Pang, N. Ginanni, A.R. Dongre, S.A. Hefta, G.J. Opiteck, Journal of Proteome Research 1 (2002) 161.

[93] P.V. Rao, A.P. Reddy, X. Lu, S. Dasari, A. Krishnaprasad, E. Biggs, C.T. Roberts, S.R. Nagalla, Journal of Proteome Research 8 (2009) 239.

[94] J. Pan, H.Q. Chen, Y.H. Sun, J.H. Zhang, X.Y. Luo, Lung 186 (2008) 255.

[95] N.T. Seyfried, L.C. Huysentruy, J.A. Atwood, Q.W. Xia, T.N. Seyfried, R. Orlando, Cancer Letters 263 (2008) 243.

[96] X. Fu, S.A. Gharib, P.S. Green, M.L. Aitken, D.A. Frazer, D.R. Park, T. Vaisar, J.W. Heinecke, Journal of Proteome Research 7 (2008) 845.

[97] K. Lenaerts, F.G. Bouwman, W.H. Lamers, J. Renes, E.C. Mariman, BMC Genomics 8 (2007).

[98] P.C. Carvalho, J.R. Yates III, V.C. Barbosa, Curr Protoc Bioinformatics Chapter 13 (2010) Unit 13.13.1.

[99] L.F. Waanders, K. Chwalek, M. Monetti, C. Kumar, E. Lammert, M. Mann, Proceedings of the National Academy of Sciences of the United States of America 106 (2009) 18902.

[100] D.G. Standaert, Archives of Neurology 62 (2005) 203.

[101] L. Mouledous, S. Hunt, R. Harcourt, J.L. Harry, K.L. Williams, H.B. Gutstein, Electrophoresis 24 (2003) 296.

[102] C.L. Sawyers, Nature 452 (2008) 548.

[103] N. Wang, M.G. Xu, P. Wang, L. Li, Analytical Chemistry 82 (2010) 2262.

[104] J.L. Norris, N.A. Porter, R.M. Caprioli, Analytical Chemistry 75 (2003) 6642.

[105] Y.Q. Yu, M. Gilar, P.J. Lee, E.S.P. Bouvier, J.C. Gebler, Analytical Chemistry 75 (2003) 6023.

[106] Protein Discovery, http://www.proteindiscovery.com.

[107] E.I. Chen, D. Cociorva, J.L. Norris, J.R. Yates, Journal of Proteome Research 6 (2007) 2529.

[108] J.R. Wisniewski, A. Zougman, N. Nagaraj, M. Mann, Nature Methods 6 (2009) 359.

[109] D.C. Liebler, A.J.L. Ham, Nature Methods 6 (2009) 785.

[110] K.A. Gruber, S. Stein, L. Brink, A. Radhakrishnan, S. Udenfriend, Proceedings of the National Academy of Sciences of the United States of America 73 (1976) 1314.

[111] R.W. Frei, L. Michel, W. Santi, Journal of Chromatography 126 (1976) 665.

[112] C. Horvath, W. Melander, I. Molnar, Journal of Chromatography 125 (1976) 129.

[113] I. Molnar, C. Horvath, Journal of Chromatography 142 (1977) 623.

[114] M.I. Aguilar, M.T.W. Hearn, High Resolution Separation and Analysis of Biological Macromolecules, Pt A 270 (1996) 3.

[115] M. Karas, F. Hillenkamp, Analytical Chemistry 60 (1988) 2299.

[116] F. Hillenkamp, M. Karas, R.C. Beavis, B.T. Chait, Analytical Chemistry 63 (1991) A1193.

[117] J.B. Fenn, M. Mann, C.K. Meng, S.F. Wong, C.M. Whitehouse, Science 246 (1989) 64.

[118] M. Wilm, M. Mann, Analytical Chemistry 68 (1996) 1.

[119] H. Zhang, W. Yan, R. Aebersold, Current Opinion in Chemical Biology 8 (2004) 66.

[120] U.D. Neue, Journal of Chromatography A 1079 (2005) 153.

[121] M. Gilar, A.E. Daly, M. Kele, U.D. Neue, J.C. Gebler, Journal of Chromatography A 1061 (2004) 183.

[122] J.R. Mazzeo, U.D. Neue, M. Kele, R.S. Plumb, Analytical Chemistry 77 (2005) 460A.

[123] X.L. Wang, D.R. Stoll, P.W. Carr, P.J. Schoenmakers, Journal of Chromatography A 1125 (2006) 177.

[124] Y.F. Shen, R. Zhao, M.E. Belov, T.P. Conrads, G.A. Anderson, K.Q. Tang, L. Pasa-Tolic, T.D. Veenstra, M.S. Lipton, H.R. Udseth, R.D. Smith, Analytical Chemistry 73 (2001) 1766.

[125] R. Aebersold, D.R. Goodlett, Chemical Reviews 101 (2001) 269.

[126] R.D. Smith, G.A. Anderson, M.S. Lipton, L. Pasa-Tolic, Y.F. Shen, T.P. Conrads, T.D. Veenstra, H.R. Udseth, Proteomics 2 (2002) 513.

[127] R.D. Smith, Y.F. Shen, K.Q. Tang, Accounts of Chemical Research 37 (2004) 269.

[128] A.R. Ivanov, L. Zang, B.L. Karger, Analytical Chemistry 75 (2003) 5306.

[129] R.D. Smith, J.H. Wahl, D.R. Goodlett, S.A. Hofstadler, Analytical Chemistry 65 (1993) A574.

[130] F. Svec, J.M.J. Frechet, Analytical Chemistry 64 (1992) 820.

[131] Q.Z. Luo, Y.F. Shen, K.K. Hixson, R. Zhao, F. Yang, R.J. Moore, H.M. Mottaz, R.D. Smith, Analytical Chemistry 77 (2005) 5028.

[132] Q.Z. Luo, K.Q. Tang, F. Yang, A. Elias, Y.F. Shen, R.J. Moore, R. Zhao, K.K. Hixson, S.S. Rossie, R.D. Smith, Journal of Proteome Research 5 (2006) 1091.

[133] T. T., N. G., Journal of Chromatography 268 (1983) 369.

[134] J.W. Jorgenson, E.J. Guthrie, Journal of Chromatography A 255 (1983) 335.

[135] G.H. Yue, Q.Z. Luo, J. Zhang, S.L. Wu, B.L. Karger, Analytical Chemistry 79 (2007) 938.

[136] Q. Luo, G. Yue, G.A. Valaskovic, Y. Gu, S.L. Wu, B.L. Karger, Analytical Chemistry 79 (2007) 6174.

[137] D. Vasiliu, N. Razi, Y.N. Zhang, N. Jacobsen, K. Allin, X.F. Liu, J. Hoffmann, O. Bohorov, O. Blixt, Carbohydrate Research 341 (2006) 1447.

[138] J.B. Lowe, J.D. Marth, Annual Review of Biochemistry 72 (2003) 643.

[139] R. Sasisekharan, Z. Shriver, G. Venkataraman, U. Narayanasami, Nature Reviews Cancer 2 (2002) 521.

[140] B. Casu, M. Guerrini, G. Torri, Current Pharmaceutical Design 10 (2004) 939.

[141] Y. Kinjo, D. Wu, G.S. Kim, G.W. Xing, M.A. Poles, D.D. Ho, M. Tsuji, K. Kawahara, C.H. Wong, M. Kronenberg, Nature 434 (2005) 520.

[142] B.E. Collins, J.C. Paulson, Current Opinion in Chemical Biology 8 (2004) 617.

[143] E.E. Fry, S.M. Lea, T. Jackson, J.W.I. Newman, F.M. Ellard, W.E. Blakemore, R. Abu-Ghazaleh, A. Samuel, A.M.Q. King, D.I. Stuart, EMBO Journal 18 (1999) 543.

[144] R. Raman, S. Raguram, G. Venkataraman, J.C. Paulson, R. Sasisekharan, Nature Methods 2 (2005) 817.

[145] A. Varki, Glycobiology 3 (1993) 97.

[146] D.A. Skoog, Principles of Instrumental Analysis, Thomson Brooks/Cole Publishing: Belmont, CA, 2007.

[147] K.R. Mitchelson, Capillary Electrophoresis of Nucleic Acids, Humana Press, 2001.

[148] A. Weston, P.R. Brown, 'Capillary Electrophoresis', HPLC and CE: Principles and Practice, Academic Press, 1997.

[149] R.G. Nielsen, G.S. Sittampalam, E.C. Rickard, Analytical Biochemistry 177 (1989) 20.

[150] M.M. Bushey, J.W. Jorgenson, Journal of Chromatography 480 (1989) 301.

[151] S. Hu, N.J. Dovichi, Analytical Chemistry 74 (2002) 2833.

[152] W.W.C. Quigley, N.J. Dovichi, Analytical Chemistry 76 (2004) 4645.

[153] J. Hernandez-Borges, C. Neususs, A. Cifuentes, M. Pelzing, Electrophoresis 25 (2004) 2257.

[154] W.W.P. Chang, C. Hobson, D.C. Bomberger, L.V. Schneider, Electrophoresis 26 (2005) 2179.

[155] O. J.A., Analytical Chemistry 59 (1987) 1230.

[156] S. R.D., Analytical Chemistry 60 (1988) 1948.

[157] E. Gelpi, Journal of Mass Spectrometry 37 (2002) 241.

[158] M. Moini, Analytical and Bioanalytical Chemistry 373 (2002) 466.

[159] F. Hsieh, E. Baronas, C. Muir, S.A. Martin, Rapid Communications in Mass Spectrometry 13 (1999) 67.

[160] C. Neususs, M. Pelzing, M. Macht, Electrophoresis 23 (2002) 3149.

[161] C.C. Liu, R. Jong, T. Covey, Journal of Chromatography A 1013 (2003) 9.

[162] M.R. Emmett, R.M. Caprioli, Journal of the American Society for Mass Spectrometry 5 (1994) 605.

[163] R. Wojcik, O.O. Dada, M. Sadilek, N.J. Dovichi, Rapid Communications in Mass Spectrometry 24 (2010) 2554.

[164] M.X. Huang, J. Plocek, M.V. Novotny, Electrophoresis 16 (1995) 396.

[165] V. Pacakova, S. Hubena, M. Ticha, M. Madera, K. Stulik, Electrophoresis 22 (2001) 459.

[166] M. Kinoshita, E. Murakami, Y. Oda, T. Funakubo, D. Kawakami, K. Kakehi, N. Kawasaki, K. Morimoto, T. Hayakawa, Journal of Chromatography A 866 (2000) 261.

[167] K. Kakehi, M. Kinoshita, D. Kawakami, J. Tanaka, K. Sei, K. Endo, Y. Oda, M. Iwaki, T. Masuko, Analytical Chemistry 73 (2001) 2640.

[168] E. Balaguer, C. Neususs, Analytical Chemistry 78 (2006) 5384.

[169] S. Takasaki, T. Mizuochi, A. Kobata, Methods in Enzymology 83 (1982) 263.

[170] H. Geyer, R. Geyer, Biochimica Et Biophysica Acta-Proteins and Proteomics 1764 (2006) 1853.

[171] G.R. Guile, P.M. Rudd, D.R. Wing, S.B. Prime, R.A. Dwek, Analytical Biochemistry 240 (1996) 210.

[172] P.M. Rudd, G.R. Guile, B. Kuster, D.J. Harvey, G. Opdenakker, R.A. Dwek, Nature 388 (1997) 205.

[173] S.A. Brooks, in G. Walsh (Editor), Post-translational Modification of Protein Biopharmaceuticals, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, 2009.

[174] T.M. Annesley, Clinical Chemistry 49 (2003) 1041.

[175] U.M. Demelbauer, M. Zehl, A. Plematl, G. Allmaier, A. Rizzi, Rapid Communications in Mass Spectrometry 18 (2004) 1575.

[176] H. Kaji, H. Saito, Y. Yamauchi, T. Shinkawa, M. Taoka, J. Hirabayashi, K. Kasai, N. Takahashi, T. Isobe, Nature Biotechnology 21 (2003) 667.

[177] G. Alvarez-Manilla, J. Atwood, Y. Guo, N.L. Warren, R. Orlando, M. Pierce, Journal of Proteome Research 5 (2006) 701.

[178] M. Tajiri, S. Yoshida, Y. Wada, Glycobiology 15 (2005) 1332.

[179] K. Sparbier, S. Koch, I. Kessler, T. Wenzel, M. Kostrzewa, J Biomol Tech 16 (2005) 407.

[180] H. Zhang, X.J. Li, D.B. Martin, R. Aebersold, Nature Biotechnology 21 (2003) 660.

[181] M.J. Huddleston, M.F. Bean, S.A. Carr, Analytical Chemistry 65 (1993) 877.

[182] J. Jebanathirajah, H. Steen, P. Roepstorff, Journal of the American Society for Mass Spectrometry 14 (2003) 777.

[183] H. Jiang, H. Desaire, V.Y. Butnev, G.R. Bousfield, Journal of the American Society for Mass Spectrometry 15 (2004) 750.

[184] M. Wuhrer, C.A.M. Koeleman, C.H. Hokke, A.M. Deelder, Analytical Chemistry 77 (2005) 886.

[185] M. Mormann, H. Paulsen, J. Peter-Katalinic, European Journal of Mass Spectrometry 11 (2005) 497.

[186] J.M. Hogan, S.J. Pitteri, P.A. Chrisman, S.A. McLuckey, Journal of Proteome Research 4 (2005) 628.

Chapter 2: Proteomic Analysis of 10,000 Laser Captured Microdissected Breast Tumor Cells Using Short Migration on SDS-PAGE and Porous Layer Open Tubular (PLOT) LC-MS

ABSTRACT

Reproducible proteomic profiling of small amounts of tissue is a challenging task. The amount of tissue sample can be limited by the extent of the biopsy and/or by the use of laser capture microdissection (LCM) to extract a homogeneous population of cells. Traditional proteomic sample processing and analysis techniques are often insufficient to comprehensively examine such small numbers of cells, due to adsorptive losses during the sample preparation steps and the limited sensitivity of the analytical LC/MS system. In this Chapter, we present a workflow to facilitate effective and reproducible proteomic profiling of only 10,000 LCM collected cells from tumor tissue samples. The workflow integrates selective sample procurement by LCM and sample processing by SDS-PAGE using a highly crosslinked gel and a short separation distance. This is followed by in-gel digestion and ultrasensitive analysis with a 10 µm i.d. porous layer open tubular (PLOT) LC column coupled to MS. LCM of 10,000 hepatocyte cells from mouse liver tissue was initially utilized to develop and validate the performance of the proteomic workflow. Importantly, only 10-20 percent of the in-gel digest, equivalent to 1000-2000 cells, was required for injection in a single LC/MS run. The workflow was then applied to the proteomic profiling of 10,000 LCM collected cells each obtained from invasive and metastatic breast cancer tissue. Each run identified more than 1100 proteins, and from three technical replicates, 1700 proteins (1103 with 2 or more peptides) were identified and quantitated by spectral counting, allowing a comparison between invasive and metastatic tumor cells. The results demonstrate the benefits of integrating sensitive microscale separation with effective sample preparation for the comprehensive analysis of small amounts of tissue samples.

2.1 Introduction

Laser capture microdissection (LCM) has emerged as an indispensable tool to isolate a homogeneous population of cells from tissue, especially tumor tissue [1-4]. However, collection of a large number of cells using LCM can be time-consuming and labor-intensive, and the number of cells of a specific type will likely be limited in any case. Therefore, there is a need for a specific workflow that allows comprehensive proteomic analysis of a small number of cells.

Sample preparation is a critical component in the analysis of a limited number of cells. One approach is to perform a protein separation prior to proteolytic digestion, followed by LC/MS. With a limited number of cells, the challenge is to achieve the protein separation with minimal protein loss. A typical protein separation/clean-up step is SDS-PAGE. The addition of SDS to cells and tissues aids in the lysis and total solubilization of proteins [5], with the resultant PAGE separation based on size. One way to minimize protein losses is to limit the separation distance on the gel [6]. In this way, the protein concentration in the gel plug used for in-gel digestion will be higher than if a longer SDS-PAGE separation distance were employed. Moreover, the short separation distance should reduce potential accumulation of keratin impurities.

Following sample preparation, the chromatographic separation, coupled on-line to the MS, needs to be optimized. Clearly, high sensitivity is necessary, along with high resolving power, to separate small amounts of highly complex peptide mixtures. Currently, 75 µm i.d. capillary columns, packed with C18 stationary phases, are widely used for analysis of in-gel digested proteins with mobile phase flow rates of 100 nL/min or higher. It is known that electrospray ionization efficiency increases as the mobile phase flow rate decreases [7-11]. In fact, operation at 20 nL/min or lower can greatly enhance electrospray signal [7,12]. Furthermore, due to the formation of very small droplets released from the electrospray tip at such low flow rates, the number of analytes per droplet decreases to a vanishingly small value, resulting in little or no ion suppression.

In order to operate efficiently at such low flow rates while maintaining high separation performance, significantly narrower LC column IDs are required. While packed beds have been used [13], we have chosen to use porous layer open tubular (PLOT) columns of 10 µm ID. As we have previously shown [12], such columns, due to their open tubular structure, allow column lengths of 3 meters or longer while using standard LC pumps (6000 psi). With such columns, we have demonstrated high sensitivity (attomole) and high resolving power (peak capacities of 400 or more) [12].

In this Chapter, we describe a platform with short run SDS-PAGE and PLOT-LC/MS for effective and reproducible proteomic analysis of 10,000 LCM malignant breast tumor cells from primary and metastatic sites (only a few hundred ng of total protein in each sample). Up to 1700 proteins have been identified per cell type (more than 1100 with 2 or more unique peptides), and many proteins have been quantitated with sample injection amounts equivalent to only 1000-2000 cells. Biologically relevant differences were observed between the two sample types. The results provide a basis for the proteomic analysis of LCM cells from disease tissue for biomarker discovery at low cell number.

2.2 Experimental Section

2.2.1 Chemicals

Styrene, divinylbenzene, ethanol, formic acid (HPLC grade), 3-(trimethoxysilyl)propyl methacrylate, 2,2-diphenyl-1-picrylhydrazyl (DPPH), anhydrous DMF, THF, azobisisobutyronitrile (AIBN), ammonium bicarbonate, dithiothreitol (DTT), and iodoacetamide (IAA) were obtained from Sigma–Aldrich (St. Louis, MO). Fused-silica capillary tubing was purchased from Polymicro Technologies (Phoenix, AZ). The Pico Clear tee and union were from New Objective (Woburn, MA). Sequencing grade trypsin was obtained from Promega (Madison, WI). Reducing agent (10X), Tricine gel (16%T), Tricine SDS sample buffer (2X), and Tricine SDS running buffer (10X) were from Invitrogen (Carlsbad, CA). HPLC grade water, formic acid and acetonitrile were obtained from Thermo Fisher Scientific (Fair Lawn, NJ).

2.2.2 Clinical Specimens

Tumor tissue samples from breast and axillary lymph node were obtained by surgical excision from a patient diagnosed with metastatic breast cancer. The samples were frozen fresh within 20 min of surgery and stored at -80 °C. The LCM tissue samples were obtained from Massachusetts General Hospital. This study was approved by the Massachusetts General Hospital human research committee in accordance with National Institutes of Health human research study guidelines. For method development and optimization purposes, 10,000 hepatocyte cells extracted by LCM from mouse liver tissues were used. The samples were provided by the laser microdissection facility at University of Alabama.

2.2.3 Laser Capture Microdissection

Both tissue sample types (human breast cancer and mouse liver tissue) were prepared as previously described [14], with the exception that the eosin staining step was excluded from the tissue staining protocol. Enrichment of malignant epithelial cells from breast (invasive) and lymph node tissue (metastatic) was performed using an IR (λ = 810 nm) laser capture method (PixCell IIe LCM, Molecular Devices, Mountain View, CA), as described previously [15]. Approximately 10,000 cells were collected per tissue sample onto a single LCM cap. Three individual samples of 10,000 cells were collected for both sample types and subjected to proteomic analysis. For method development and optimization, approximately 10,000 hepatocyte cells from mouse liver tissue sections were collected using the Arcturus Veritas LCM system (Molecular Devices).

2.2.4 Cell Lysis, SDS-PAGE and In-Gel Digestion

An ExtracSure device (MDS Analytical Technologies, Sunnyvale, CA) was placed on the LCM cap containing 10,000 cells, and cell lysis was performed within 3 minutes using 10 µL of lysis buffer (Novex Tricine SDS sample buffer (2X) : NuPAGE reducing agent : water, 5:1:4). The cell lysate was transferred to an Eppendorf tube. The LCM cap surface was rinsed with two aliquots of lysis buffer (7.5 µL each), and the combined cell lysate was incubated at 85 °C for 2 minutes. A similar procedure was performed on the three invasive and three metastatic samples to lyse the cells. The cell lysates from the six samples were loaded separately onto a 16% Tricine gel, and electrophoresis was performed at 125 volts for 20 minutes, by which time the running buffer front had migrated approximately 2.5 cm.

After staining the gel with Coomassie blue (SimplyBlue SafeStain, Invitrogen), each gel lane was cut into three equal sections, as shown in Figure S1 (Addendum). Each section was in-gel digested by the procedure described previously [16]. Briefly, the gel section was cut into cubes of approximately 1 mm³, and reduction was performed using dithiothreitol, followed by alkylation with iodoacetamide in the dark. The proteins were subjected to in-gel digestion using a 6 ng/µL trypsin solution (in 50 mM ammonium bicarbonate) at 37 °C for 12 hours. The supernatant was removed and preserved, and the digested peptides were extracted by hydration of the gel with 5% formic acid, followed by dehydration of the gel with acetonitrile. The extracts were combined and reduced to dryness under vacuum using a centrifugal evaporator.

2.2.5 Nano LC-ESI-MS with 10 µm i.d. PLOT Column

The chromatographic system (i.e., on-line micro-SPE-PLOT) was identical to that published previously [17]. Briefly, the column was prepared as follows: a degassed solution containing 5 mg of AIBN, 200 µL of styrene, 200 µL of divinylbenzene, and 600 µL of ethanol was filled into a 10 µm i.d. capillary that had been pretreated with 3-(trimethoxysilyl)propyl methacrylate. Both ends of the capillary were closed with septa, and polymerization was carried out for 16 hours at 74 °C. The column was washed overnight with acetonitrile and stored with water.

The in-gel digest of each of the SDS-PAGE gel sections was dissolved in 15 µL of mobile phase A (0.1% formic acid in water). A 1.5 µL aliquot of the in-gel digest (10% of the total volume) was loaded on the micro-SPE column (50 µm i.d. PS-DVB monolithic micro-SPE column) at a flow rate of approximately 1 µL/min and desalted using 100% mobile phase A. A nanoLC (Ultimate 3000, Dionex, Sunnyvale, CA) was programmed as follows: mobile phase B (0.1% formic acid in acetonitrile) was linearly increased from 2% to 30% over 180 minutes, followed by a linear increase to 60% over 30 minutes, and finally linearly increased to 80% over 10 minutes, followed by an isocratic hold for 10 minutes. The LTQ XL mass spectrometer (Thermo Fisher Scientific, San Jose, CA) was operated in the data-dependent mode using one full MS scan, followed by MS2 acquisition for the 8 most intense ions with normalized collision energy of 35%. Dynamic exclusion was employed with no repeat count for selected precursor ions and exclusion duration of 60 s. The LCM collected 10,000 hepatocyte cells (mouse liver cells) were processed as above; the in-gel digest was used for optimization of sample loading amount and LC-gradient time.
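The gradient program described above can be written out as a piecewise-linear function of time. The following is a minimal sketch: the breakpoints are taken from the stated program, with t = 0 assumed to be the start of the 2% B segment, and the names `GRADIENT` and `percent_b` are illustrative only and are not part of any instrument software.

```python
# Piecewise-linear gradient from the text, as (time in min, %B) breakpoints.
# Assumption: t = 0 corresponds to the start of the gradient at 2% B.
GRADIENT = [
    (0, 2.0),     # start at 2% B
    (180, 30.0),  # linear increase to 30% B over 180 min
    (210, 60.0),  # linear increase to 60% B over 30 min
    (220, 80.0),  # linear increase to 80% B over 10 min
    (230, 80.0),  # isocratic hold for 10 min
]


def percent_b(t_min):
    """Return %B at time t_min by linear interpolation between breakpoints."""
    if t_min <= GRADIENT[0][0]:
        return GRADIENT[0][1]
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    # After the last breakpoint, hold the final composition.
    return GRADIENT[-1][1]
```

Under these assumptions the complete program, including the final hold, spans 230 minutes, of which the first 180 minutes cover the shallow 2% to 30% B segment.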

2.2.6 Protein Identification

The acquired MS/MS scans were converted into DTA files by Extract-MSn (version 4.0, Thermo Fisher Scientific). The DTA files were searched against the human SwissProt database (release 2010_06, downloaded in July 2010) combined with a database containing reversed sequences by using the Sequest algorithm (Ver. 27, rev. 12, Thermo Fisher Scientific), and the results were stored in the CPAS system (ver. 9.10, LabKey, Seattle, WA) [18]. The peptide mass search tolerance was set to 1.4 Da, whereas the fragment ion mass tolerance was set to 1.0 Da. Full tryptic enzyme specificity was selected with up to 2 missed cleavage sites. Cysteine carbamidomethylation was considered as a fixed modification. The search results (identified peptides) were filtered by Xcorr ≥ 1.9 for charge state +1, ≥ 2.2 for charge state +2 and ≥ 3.8 for charge state +3 and by PeptideProphet (Institute for Systems Biology, Seattle, WA) using peptide probability ≥ 0.95. Since some peptides were assigned to multiple proteins, ProteinProphet (Institute for Systems Biology, Seattle, WA) was used to assign peptides to protein groups with acceptance criteria of ProteinProphet probability ≥ 0.9, resulting in a false discovery rate (FDR) of <2% at the protein level [19-21]. The procedure to identify proteins from hepatocyte cells (mouse liver cells) was the same as described above except that the mouse SwissProt database (release 2010_06, downloaded in July 2010) was used in place of the human SwissProt database.
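The peptide-level acceptance criteria above can be illustrated with a short sketch. The thresholds are those stated in the text; the function names, the handling of charge states that are not listed, and the simple decoy-based FDR estimate are assumptions made for illustration and do not reproduce the Sequest, PeptideProphet or ProteinProphet software.

```python
# Charge-dependent Xcorr cutoffs and the peptide probability cutoff from the text.
XCORR_MIN = {1: 1.9, 2: 2.2, 3: 3.8}
MIN_PEPTIDE_PROB = 0.95


def passes_filter(charge, xcorr, peptide_prob):
    """True if a peptide-spectrum match meets both acceptance criteria."""
    cutoff = XCORR_MIN.get(charge)
    if cutoff is None:
        # Assumption: charge states not listed in the text are rejected.
        return False
    return xcorr >= cutoff and peptide_prob >= MIN_PEPTIDE_PROB


def decoy_fdr(n_target, n_decoy):
    """Reversed-database FDR estimate as decoy hits / target hits.

    This is one common convention, not necessarily the one used in this work.
    """
    return n_decoy / n_target if n_target else 0.0
```

For example, a doubly charged match with Xcorr 2.5 and probability 0.97 is accepted, whereas a triply charged match with Xcorr 3.5 is rejected regardless of its probability.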

2.2.7 Identification of Differentially Abundant Proteins by Spectral Counts

A comparison of protein abundances between invasive and metastatic breast cancer cells was obtained based on spectral counts (the number of significant MS2 spectra assigned to a protein) for proteins identified by at least 2 peptides. The Sequest search results (three files corresponding to three gel sections) were processed by DTASelect to obtain a single result file per sample. The six resulting files from the invasive and metastatic breast cancer cells were compared using the TFold analysis tool of PatternLab software [22]. The differentially abundant proteins were determined by the t-test with a statistical significance threshold of p < 0.05 and a Benjamini-Hochberg false discovery rate (BH-FDR) [22] of less than 10%.
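The combination of a p-value threshold with a Benjamini-Hochberg FDR limit can be sketched as follows. This is a generic implementation of the Benjamini-Hochberg step-up procedure applied to a list of per-protein p-values; it does not reproduce PatternLab's TFold tool, and the function names and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return booleans marking hypotheses rejected at FDR level q (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


def differential(p_values, alpha=0.05, q=0.10):
    """Flag proteins that satisfy both p < alpha and the BH criterion at level q."""
    bh = benjamini_hochberg(p_values, q)
    return [p < alpha and r for p, r in zip(p_values, bh)]


# Hypothetical per-protein p-values, for illustration only.
example_p = [0.001, 0.04, 0.03, 0.20, 0.60]
flags = differential(example_p)
```

With these hypothetical values the three smallest p-values fall below their rank-dependent thresholds (0.02, 0.04 and 0.06), so the first three proteins are flagged and the last two are not.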


2.2.8 Reproducibility of Replicate Analyses of Metastatic and Invasive Breast Cancer Samples

Three replicates each of the metastatic and invasive breast cancer samples were analyzed, and the data were processed to identify proteins. For each identified protein, the average spectral count and the relative standard deviation were calculated from the three runs. From the list of total proteins, proteins with an average spectral count ≥ 2 were selected for reproducibility analysis. The selected proteins were divided into three bins, such that each bin contained an equal number of proteins. These bins were assigned as “low”, “middle” and “high” based on the average spectral count values. The protein groups obtained from replicates of metastatic and invasive breast cancer samples were compared. Statistical calculations to compare variability associated with replicates of invasive and metastatic breast cancer were performed using the R language (www.r-project.org).
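The binning procedure described above can be sketched in a few lines. The original calculations were performed in R; the Python version below is only an illustration, and the function names, the use of the sample standard deviation (n - 1), and the handling of protein numbers not divisible by three are assumptions.

```python
from statistics import mean, stdev


def summarize(counts_by_protein, min_avg=2.0):
    """Return (protein, average count, RSD %) for proteins with average >= min_avg.

    Rows are sorted by increasing average spectral count.
    """
    rows = []
    for protein, counts in counts_by_protein.items():
        avg = mean(counts)
        if avg >= min_avg:
            rsd = 100.0 * stdev(counts) / avg  # sample standard deviation (n - 1)
            rows.append((protein, avg, rsd))
    return sorted(rows, key=lambda r: r[1])


def tercile_bins(rows):
    """Split rows (sorted by average count) into low/middle/high bins of near-equal size."""
    n = len(rows)
    a, b = n // 3, 2 * n // 3
    return {"low": rows[:a], "middle": rows[a:b], "high": rows[b:]}


# Hypothetical spectral counts from three replicate runs, for illustration only.
example = {
    "P1": [2, 2, 2], "P2": [10, 12, 14], "P3": [1, 1, 1], "P4": [4, 6, 5],
    "P5": [30, 30, 30], "P6": [3, 3, 3], "P7": [8, 8, 8],
}
bins = tercile_bins(summarize(example))
```

In this hypothetical example P3 is excluded (average count below 2), and the remaining six proteins are split two per bin in order of increasing average spectral count.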

2.2.9 Gene Ontology Annotation with DAVID (Database for Annotation, Visualization and Integrated Discovery)

Functional annotation tools in DAVID [23,24] were used to extract gene ontology terms for identified proteins and to provide functional annotations for differentially abundant proteins. The lists of up- and down-regulated proteins in metastatic samples were processed separately against a background list of all the proteins that were submitted to PatternLab software. The annotation clusters with an enrichment score greater than 1.3 [24] and FDR less than 20% were selected to explore the biological differences between the two sample types.
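The enrichment score cutoff of 1.3 can be related to a conventional p-value. In DAVID's functional annotation clustering, the score of a cluster is the geometric mean of the member terms' p-values expressed on a -log10 scale, so a score of 1.3 corresponds to a geometric-mean p-value of about 0.05. The sketch below illustrates only this arithmetic; the function name and the example p-values are hypothetical, and it does not call DAVID.

```python
import math


def enrichment_score(p_values):
    """Mean of -log10(p) over the annotation terms in one cluster."""
    return sum(-math.log10(p) for p in p_values) / len(p_values)


# A single term with p = 0.05 gives a score of about 1.3.
score_at_cutoff = enrichment_score([0.05])

# Hypothetical cluster with two terms (p = 0.01 and p = 0.0001).
score_example = enrichment_score([0.01, 0.0001])
```

A cluster whose terms have p-values of 0.01 and 0.0001 would therefore score 3.0 and pass the cutoff used here.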

2.3 Results and Discussion

2.3.1 Overview of Proteomic Workflow

The proteomic workflow is shown in Figure 1. Given that just 10,000 LCM collected cells were targeted for proteomic analysis, the workflow had to be optimized to minimize sample losses. In this study, 4% SDS buffer was employed for rapid and complete lysis of the cells from the LCM cap. Further, a 2.5 cm SDS-PAGE run on a highly crosslinked gel (16% Tricine) was utilized for protein pre-fractionation and SDS clean-up. The LC-ESI-MS analysis of a small amount of the complex peptide mixture using on-line PLOT LC-MS provided not only ultrasensitive detection but also the potential for multiple analyses on the same sample. The optimized workflow was employed to determine proteomic profiles of three replicates of 10,000 LCM collected invasive and metastatic breast cancer cells. Due to limited sample amounts, label-free quantitation based on spectral counts was utilized.

[Figure 1 flowchart: 10,000 invasive and metastatic breast cancer cells collected by LCM -> on-the-cap cell lysis and protein extraction -> 2.5 cm SDS-PAGE on 16% Tricine gel and in-gel digestion -> online 1D PLOT-LC-ESI-MS -> protein identification and quantitative comparison]

Figure 1. Shotgun proteomic workflow for the analysis of 10,000 LCM collected breast cancer cells from breast tumor and lymph node tumor. Shown here are the important steps that facilitate the proteomic profiling and comparison of 10,000 invasive vs. metastatic breast cancer cells.

2.3.2 Cell Lysis and Protein Extraction from the LCM Cap

To collect a homogeneous population of cells from heterogeneous tissue, we employed an infrared (IR) laser-based LCM approach. The malignant cells were isolated with the help of a thermoplastic film attached to a cap. The film, when heated by the IR laser, melted and came into contact with the cells of interest [2]. The strong adhesion of the cells to the thermoplastic film resulted in selective extraction of a homogeneous population of the desired cells. To reduce the number of sample handling steps, the cells were lysed directly on the cap using 4% SDS buffer. This procedure facilitated cell lysis with only 5-10 µL of lysis buffer. Lysis was judged to be complete because no cells remained on the film when it was examined with a microscope.

2.3.3 Short SDS-PAGE Run for In-Gel Digestion

Prior to enzymatic digestion of proteins, SDS-PAGE is an effective tool for sample clean-up, protein separation, and SDS removal. However, application of this method to small protein loads can be challenging. Recently, a short SDS-PAGE run was shown to be an improved approach for sample clean-up, because the protein remains concentrated, which leads to lower losses than a longer gel run [6]. In the Addendum Section, preliminary experiments are described demonstrating that a highly crosslinked gel (16% Tricine) and a gel migration distance of only 2.5 cm result in a significant increase in peptides, and thus proteins, identified in comparison to 2.5 cm and 10 cm gel migration distances on a gradient gel. In this work, we used the short SDS-PAGE run with a highly crosslinked gel not only for sample clean-up and removal of SDS, but also for protein separation, as each 2.5 cm gel lane was divided into three sections for LC/MS analysis. As a result, a two-dimensional separation of each proteomic sample was achieved.

2.3.4 Online PLOT/LC-ESI-MS

Currently, for LC-ESI-MS proteomic analysis of a complex peptide mixture, capillary columns with an inner diameter of 75 µm or greater and mobile phase flow rates of 100 nL/min or higher are widely used. However, these columns consume microgram amounts of total protein per run. For limited amounts of sample where lower levels of injection are required, more sensitive approaches must be utilized. It has been demonstrated that narrower capillary columns with flow rates of 20 nL/min or lower can provide increased sensitivity of ESI-MS with a reduction in ion suppression [10,25]. Our laboratory has developed 10 µm i.d. porous layer open tubular (PLOT) columns of several meters in length that operate at flow rates of 20 nL/min or lower [12]. These columns provide ultrasensitive detection in the attomole range along with high resolving power (peak capacity of 400 or greater). The columns have been shown to be reproducible (~3% RSD in retention time from column to column) [12,26].

Figure 2. Optimization of LC-MS parameters. A. Optimization of sample loading amount. Shown is the plot of number of proteins identified vs. percent of in-gel digest (peptides) injected into PLOT LC-MS. The in-gel digest was prepared from 10,000 LCM collected mouse liver cells using the workflow shown in Figure 1. B. Optimization of chromatographic gradient time. Plot of number of proteins identified versus LC gradient time. Increasing the LC gradient time from 1 to 3 hours provides a balance between analysis time and number of proteins identified.

Prior to the study of the LCM breast cancer samples, we evaluated the complete workflow using a model system: 10,000 LCM collected hepatocyte cells of mouse liver. The cells were lysed as described above and then separated by SDS-PAGE on the 16% Tricine gel with a migration distance of 2.5 cm. The gel lane was divided into 3 equal size sections. The middle section was removed and in-gel digested. The digest was then suspended in aqueous 0.1% formic acid, and aliquots equivalent to 3%, 10% and 20% of the total sample (assuming no protein losses) were analyzed by PLOT LC/MS (LTQ XL linear ion trap and a gradient time of 3 hours) in order to determine the optimum sample load on the LC column. To facilitate desalting of the digested sample, a PS-DVB monolith SPE precolumn was employed prior to injection, as described previously [17]. Data processing, as described in the Experimental Section, provided the number of proteins identified for the different injection amounts (see Figure 2A). The results show that the number of proteins leveled off between 10% and 20% of the total sample (equivalent to 1000 to 2000 cells injected, assuming no sample loss), and this level was selected for the breast cancer sample proteomic analysis. Importantly, this allowed multiple injections of a given sample, despite the limited amount of protein.
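The cell-equivalent figures quoted above follow from simple proportionality. The sketch below uses the 15 µL reconstitution volume and 1.5 µL injection volume given in the Experimental Section; the function name is illustrative, and the calculation assumes no sample loss at any step.

```python
def cells_equivalent(total_cells, digest_volume_ul, injected_volume_ul):
    """Cell equivalents injected, assuming no sample loss during preparation."""
    return total_cells * injected_volume_ul / digest_volume_ul


# 10% of the digest (1.5 µL of 15 µL) from 10,000 cells.
cells_10_percent = cells_equivalent(10_000, 15.0, 1.5)

# 20% of the digest (3.0 µL of 15 µL) from 10,000 cells.
cells_20_percent = cells_equivalent(10_000, 15.0, 3.0)
```

Under these assumptions a 10% injection corresponds to 1000 cell equivalents and a 20% injection to 2000, which leaves enough digest for several further injections of the same gel section.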

Next, the LC gradient time was optimized, again using the in-gel digest of the mouse liver sample. We evaluated 1, 3, 5 and 7 hour gradient times with the above injected sample amount, and the results are shown in Figure 2B. A significant increase in protein identification was observed in going from 1 hour to 3 hour gradient times. However, the number of proteins identified leveled off at 812 and 814 for gradient times of 5 and 7 hours, respectively. A gradient time of 3 hours was selected to balance the number of proteins identified and the total analysis time of a given run.

Table 1. Number of proteins identified per gel section per sample from three technical replicates of 10,000 mouse liver cells. Numbers in brackets are the proteins identified with 2 or more unique peptides.

Number of Proteins Identified Per Run (three replicate LC-MS runs performed on 10,000 mouse liver cells)

Gel Section             Run 1        Run 2        Run 3        All Three Runs
1                       536          522          509
2                       528          721          694
3                       442          392          345
Total unique proteins   1035 (641)   1187 (739)   1138 (692)   1490 (1103)

Using the optimized workflow, the in-gel digests of all three sections of the short SDS-PAGE run for the 10,000 mouse liver LCM collected cells were analyzed by PLOT LC/MS, each section with three technical replicates. The number of proteins identified per gel section per sample is presented in Table 1, where it can be seen that more than 1000 proteins were consistently identified from each LC/MS replicate of the three SDS-PAGE sections. Approximately 700 proteins per replicate were identified by two or more unique peptides. The total number of proteins identified from the three replicate runs was close to 1500, with 1100 identified with two or more peptides. It is important to note that these results were obtained with an ion trap mass spectrometer, an instrument with low mass resolution. It was also found that the protein overlap of the three gel sections was on average less than 10%, suggesting that sufficient protein separation can be performed even using a 2.5 cm short run on the 16% Tricine gel. The number of proteins identified demonstrates that this workflow can be effective for proteomic profiling of a limited number of LCM collected cells.

2.3.5 Proteomic Analysis of Three Replicates of 10,000 Breast Cancer Cells

To test the applicability of the proteomic workflow to clinical samples, samples of 10,000 invasive or metastatic breast cancer cells collected by LCM were analyzed. Triplicate samples were independently analyzed and subjected to quantitative comparison based on spectral counts. Each replicate of the invasive breast cancer sample identified close to 1100 proteins (Table 2), and, taken together, the three invasive breast cancer samples yielded more than 1700 unique proteins. Among these 1700, more than 1100 were identified with two or more unique peptides. A similar number of proteins was identified for the metastatic LCM collected cells. The workflow was thus shown to yield a meaningful number of proteins for proteomic analysis.


Table 2. Number of proteins identified per gel section per sample from three replicates of 10,000 invasive breast cancer cells. Numbers in brackets are the proteins identified with 2 or more unique peptides.

Number of Proteins Identified Per Invasive Breast Cancer Sample

Gel Section              Sample 1      Sample 2      Sample 3      All Three Samples
1                        305           189           233
2                        550           563           596
3                        479           464           435
Total Unique proteins    1132 (558)    1050 (537)    1098 (523)    1708 (1123)

2.3.6 Identification of Differentially Expressed Proteins

Spectral counting was used to assess the quantitative differences between the invasive and metastatic breast cancer samples. Since only 10,000 LCM collected cells were processed in this study and only 10% of the sample amount was consumed in each PLOT LC-MS run, the technical variation attributed to sample preparation and analysis was compared between the sample types. To assess this variation, a simple statistical comparison was made based on the relative standard deviation (RSD) of spectral counts obtained from three replicates of the invasive and metastatic samples. In order to compare the data in detail, the proteins with an average spectral count ≥2 were selected and divided into three equal sized groups (see Experimental Section 2.7).

Figure 3. Assessment of the variability in proteomic profiles associated with three replicate runs each of invasive and metastatic breast cancer samples (three samples of 10,000 cells each). Box plots of relative standard deviation are shown for invasive (INV) and metastatic (MET) samples. The values of the relative standard deviation calculated on peptide counts obtained from triplicate runs are shown on the y-axis. The lower and upper sides of the box represent the 25th and 75th percentile values, respectively. The horizontal line inside the box represents the median value. The lines extending from the box indicate the spread (10th-90th percentile) of the data.

As shown in Fig. 3, the median values of the relative standard deviations for the protein groups (low, medium and high) were 0.27, 0.20 and 0.21, respectively, for the invasive samples, whereas the median values for the metastatic samples were 0.31, 0.22 and 0.20, respectively. In addition, the interquartile ranges for the protein groups reveal similar trends in spectral count variability between the two sample types. Importantly, this similarity was obtained over a period of 9 days, demonstrating stable performance of the PLOT LC-ESI-MS platform.
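The variability assessment described above can be sketched in a few lines. This is an illustration only: it assumes the three groups are formed by ranking proteins on their average spectral count (suggested by the low, medium and high labels), and the protein identifiers and counts in the example are hypothetical, not data from this study.

```python
from statistics import mean, median, stdev


def rsd(counts):
    """Relative standard deviation: sample standard deviation / mean."""
    return stdev(counts) / mean(counts)


def rsd_medians_by_abundance(proteins, min_avg=2.0):
    """Median RSD for low, medium and high abundance groups.

    proteins maps a protein identifier to its replicate spectral counts.
    Proteins with an average count below min_avg are discarded; the rest
    are sorted by average count and split into three near-equal groups.
    At least three proteins must pass the filter.
    """
    kept = sorted((c for c in proteins.values() if mean(c) >= min_avg), key=mean)
    n = len(kept)
    bounds = [0, n // 3, 2 * n // 3, n]
    groups = [kept[bounds[i]:bounds[i + 1]] for i in range(3)]
    return [median(rsd(c) for c in group) for group in groups]


# Hypothetical spectral counts from three replicate runs
example = {
    "P1": [2, 3, 2], "P2": [3, 2, 4], "P3": [5, 6, 7],
    "P4": [8, 10, 9], "P5": [20, 25, 22], "P6": [40, 38, 45],
    "P7": [1, 0, 1],  # average below 2, therefore excluded
}
low, medium, high = rsd_medians_by_abundance(example)
print(f"median RSD: low={low:.2f}, medium={medium:.2f}, high={high:.2f}")
```

Comparing the three medians between sample types, as in Fig. 3, then indicates whether technical variation differs between the invasive and metastatic preparations.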

Next, to determine differentially abundant proteins between invasive and metastatic samples, the TFold module of PatternLab software was employed. The differentially abundant proteins were determined based on a t-test with a statistical significance threshold of p < 0.05 and a Benjamini-Hochberg false discovery rate (BH-FDR) of less than 10% [22]. A total of 109 proteins were found to be differentially abundant between the two sample types. Eighty-five proteins were found to be up-regulated in the metastatic samples, whereas 24 proteins were found to be down-regulated. In the next section, we examine the annotation of the differentially abundant proteins.
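To make the multiple-testing step concrete, the following is a minimal sketch of the Benjamini-Hochberg adjustment combined with the two thresholds quoted above. It is not the PatternLab TFold implementation, which applies additional criteria; the function names and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never decrease with increasing rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_differential(p_values, alpha=0.05, fdr=0.10):
    """Indices of tests passing both the raw p-value and BH-FDR cutoffs."""
    q_values = benjamini_hochberg(p_values)
    return [i for i, (p, q) in enumerate(zip(p_values, q_values))
            if p < alpha and q < fdr]


# Hypothetical t-test p-values for four proteins
p_example = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(p_example))
print(select_differential(p_example))
```

In this toy example the first three proteins pass both cutoffs, while the fourth fails the raw p-value threshold.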

2.3.7 Gene Ontology Analysis

The 109 differentially abundant proteins in the breast cancer metastatic samples, relative to the invasive sample, were submitted to DAVID for gene ontology (GO) analysis. GO terms with enrichment scores >1.3 and an FDR less than 20% were selected as overrepresented functional categories [24]. Eighty-five proteins, up-regulated in metastatic breast cancer cells, were linked to 47 overrepresented GO terms and clustered into 4 functional categories (Table 3). A number of proteins up-regulated in the metastatic samples, such as coatomer subunits (gamma-2, epsilon, alpha and beta), transmembrane emp24 domain-containing protein 10 and Golgi-specific brefeldin A-resistance guanine nucleotide exchange factor 1, are associated with vesicle targeting/transport/localization functions. Vesicle transport, the basic communication mechanism between different membrane compartments within a cell and with the extracellular environment, consists of two membrane-trafficking networks, endocytosis and exocytosis. It is known that cancer cells implement various exocytic routes to communicate important information for growth, migration and matrix degradation [27]. Proteins belonging to the functional categories cytoplasmic vesicle and signal peptide include endoplasmic reticulum protein ERp29, endoplasmin, Ras-related protein Rab-14, cathepsin D, clusterin, heat shock cognate 71 kDa protein, mitochondrial dihydrolipoyl dehydrogenase and fibronectin. Proteins belonging to the cytoplasmic vesicle category are responsible for mediating vesicular transport between organelles of the secretory and endocytic systems. Importantly, the down-regulation of extracellular matrix proteins in invasive breast cancer tumors has been observed previously [28]. In this study, the up-regulation of proteins such as three chains of collagen VI (alpha-1, alpha-2 and alpha-3) and tenascin suggests remodeling of the extracellular matrix in metastatic breast cancer cells compared to invasive breast cancer cells. Glycolysis, an important biological function that has recently been linked to cancer metastasis [29], was found to be down-regulated in the metastatic samples. The proteins associated with this function are triosephosphate isomerase, phosphoglycerate kinase 1 and alpha-enolase. These findings are suggestive of changes in the extracellular matrix, vesicle pathways and cellular metabolism of metastatic cells in comparison to invasive breast cancer cells.
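For orientation, the enrichment score threshold of 1.3 can be related to a conventional p-value cutoff. To our understanding of the DAVID documentation, the enrichment score of an annotation cluster is the geometric mean of the member terms' p-values expressed on a minus log10 scale, so a score of 1.3 corresponds to a geometric mean p-value of about 0.05. The sketch below illustrates this relationship under that assumption; the p-values used are hypothetical.

```python
import math


def enrichment_score(p_values):
    """Minus log10 of the geometric mean of the member terms' p-values."""
    return -sum(math.log10(p) for p in p_values) / len(p_values)


# A single term at p = 0.05 sits right at the 1.3 threshold
print(round(enrichment_score([0.05]), 2))

# Hypothetical cluster of three GO terms
print(round(enrichment_score([1e-4, 5e-3, 2e-2]), 2))
```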

In summary, the gene ontology term enrichment analysis revealed important biological changes between the two disease states, which can be attributed to the specificity of the LCM procedure, the short SDS-PAGE sample preparation and the ultrasensitive detection capability of PLOT LC-MS. The results are in line with expectations and demonstrate that the technology platform can be successfully used for proteomic analysis of a limited number of cells.

Table 3. Enriched Gene-Ontology (GO) terms with FDR less than 5% and P value less than 0.05 are shown in bold. The differentially abundant proteins associated with each GO term are presented with their Swiss-Prot accession number.

Annotation   Accession   GO Terms / proteins belonging to each cluster            P value     % FDR
Cluster      Number

1                        Cytoplasmic vesicle (Enrichment Score: 4.45)             1.24E-05    0.02
             (P30040)    Endoplasmic reticulum protein ERp29
             (P61106)    Ras-related protein Rab-14
             (P07339)    Cathepsin D
             (P10909)    Clusterin
             (P53621)    Coatomer subunit alpha
             (O14579)    Coatomer subunit epsilon
             (Q9Y678)    Coatomer subunit gamma
             (P11142)    Heat shock cognate 71 kDa protein
             (P08133)    Annexin A6
             (Q8NEW0)    Zinc transporter 7
             (P49755)    Transmembrane emp24 domain-containing protein 10
             (P09622)    Dihydrolipoyl dehydrogenase, mitochondrial
             (Q9NZM1)    Myoferlin
             (P53618)    Coatomer subunit beta
             (Q15084)    Protein disulfide-isomerase A6
             (O95716)    Ras-related protein Rab-3D
             (P02751)    Fibronectin

2                        Vesicle targeting, to, from or within Golgi,             2.25E-04    0.03, 0.15, 0.24
                         Golgi vesicle transport, vesicle localization
                         (Enrichment Score: 2.19)
             (P55735)    Protein SEC13 homolog
             (Q92538)    Golgi-specific brefeldin A-resistance guanine nucleotide exchange factor 1
             (P05783)    Keratin, type I cytoskeletal 18
             (P54920)    Alpha-soluble NSF attachment protein

3                        Signal peptide (Enrichment Score: 1.84)                  2.52E-03    3.53
             (Q9Y3A6)    Transmembrane emp24 domain-containing protein 5
             (Q9Y3B3)    Transmembrane emp24 domain-containing protein 7
             (P12111)    Collagen alpha-3(VI) chain
             (P12110)    Collagen alpha-2(VI) chain
             (Q9BRX8)    Uncharacterized protein C10orf58
             (Q15392)    24-dehydrocholesterol reductase
             (P12109)    Collagen alpha-1(VI) chain
             (Q8TD06)    Anterior gradient protein 3 homolog
             (Q12907)    Vesicular integral-membrane protein VIP36
             (P24821)    Tenascin

4                        ECM-receptor interaction, Extracellular matrix           3.78E-03    1.69, 4.73
                         (Enrichment Score: 1.59)
             (P12111)    Collagen alpha-3(VI) chain
             (P12110)    Collagen alpha-2(VI) chain
             (P12109)    Collagen alpha-1(VI) chain
             (P24821)    Tenascin

5                        Glycolysis Pathway (Enrichment Score: 1.40)              6.20E-03    3.34
             (P60174)    Triosephosphate isomerase
             (P00558)    Phosphoglycerate kinase 1
             (P06733)    Alpha-enolase


2.4 Conclusions

A comparative proteomic analysis was performed on 10,000 LCM collected metastatic and invasive breast cancer cells using PLOT column-based LC/MS shotgun proteomics. The cells extracted onto the thermal polymer attached to the cap were directly lysed with an SDS solution. The lysate was cleaned up and the proteins were separated by molecular weight using SDS-PAGE. The use of a short SDS-PAGE run (2.5 cm) on a 16% Tricine gel as a sample preparation step resulted in improved in-gel digestion efficiency relative to a longer gel migration (10 cm). Another key factor in the analysis of the limited number of LCM generated cells was the use of a 10 µm i.d. PLOT column for LC-ESI-MS analysis. Optimization of the sample loading amount on the PLOT column was essential: an injection equivalent to 1000-2000 cells identified more than 1000 proteins, thus enabling multiple LC-MS analyses of the same limited sample. The optimized workflow showed good performance in the full scale proteomic experiment, resulting in the identification of more than 2,000 proteins. Importantly, changes in protein abundance related to biologically relevant categories (comparing metastatic to invasive breast cancer), such as extracellular matrix and glycolysis, were detected. These observations demonstrate that biological information can indeed be obtained from limited sample amounts. This study confirms the feasibility of the workflow in a highly relevant biological matrix for the determination of protein abundance changes, where the advantage of multiple LC-MS analyses on the same sample can be exploited. In our efforts to further improve the proteome coverage from 10,000 LCM cells, we are currently investigating a combination of short-run SDS-PAGE for sample preparation with 2D SCX/RP PLOT LC-MS for peptide separation.


Addendum to Chapter 2

Evaluation of Short SDS-PAGE Separation Distance for Sample Preparation of Small Protein Amounts Prior to LC/MS Proteomic Analysis

To evaluate the performance of a short SDS-PAGE separation for the digestion of 1 microgram of protein, two gel compositions (4-12% Bis-Tris and 16% Tricine) and two separation distances (2.5 and 10.0 cm) were tested.

2.1A Methods and Materials

2.1.1 Chemicals

Bis-Tris gel (4-12%T), LDS sample buffer (4X) and MES SDS running buffer (20X) were from Invitrogen (Carlsbad, CA). A431 cell lysate was obtained from Santa Cruz Biotechnology (Santa Cruz, CA).

2.1.2 SDS-PAGE Separation and In-Gel Digestion

1) Long SDS-PAGE Method with Gradient Gel

A 2.5 µL aliquot of A431 cell lysate (0.4 µg/µL) was mixed with 5 µL of NuPAGE lithium dodecyl sulfate buffer, 2 µL of NuPAGE reducing agent and 10.5 µL of water. The mixture was incubated at 70 °C for 10 minutes, cooled, and then loaded onto a 4-12% Bis-Tris gel. Gel electrophoresis was performed using MES-SDS running buffer over a distance of ~10 cm (Figure S1A). After staining the gel with Coomassie blue (SimplyBlue SafeStain, Invitrogen), the gel lane was cut into three equal sections. Each gel section was further sliced into 1 mm squares and in-gel digested. The gel pieces were washed in 0.1 M ammonium bicarbonate solution at 37 °C for 15 minutes. Acetonitrile was added to destain and dehydrate the gel pieces. Next, the gel pieces were suspended in 10 mM dithiothreitol and incubated for 30 minutes at 56 °C. The solution was cooled to room temperature, the supernatant was removed, and the gel pieces were incubated in the dark in 55 mM iodoacetamide for 60 minutes. The excess iodoacetamide solution was removed, and the gel pieces were washed with 0.1 M ammonium bicarbonate. Acetonitrile was added to dehydrate the gel pieces; the solution was removed, and the gel pieces were dried using a centrifugal evaporator. The gel pieces were then rehydrated with trypsin solution (6 ng/µL), and the trypsin digestion was performed at 37 °C overnight. After digestion, the peptides were extracted from the gel pieces with 5% formic acid in acetonitrile. Aliquots of the in-gel digests obtained from the three gel sections were combined and evaporated to dryness using a centrifugal evaporator (Labconco, Kansas City, MO). The peptides were resuspended in 0.1% formic acid and subjected to LC-MS analysis.

2) Short SDS-PAGE Method with Gradient Gel

Using the same sample amount as in method 1, gel electrophoresis was performed using MES SDS running buffer over a distance of ~2.5 cm. The in-gel digestion was performed as described above.

3) Short SDS-PAGE Method with 16% Constant Gel

A 2.5 µL aliquot of A431 cell lysate (0.4 µg/µL) was mixed with 5 µL of Novex Tricine SDS sample buffer, 1 µL of NuPAGE reducing agent and 4 µL of water. The mixture was incubated at 85 °C for 2 minutes, allowed to cool and then loaded onto a 16% Tricine gel. Gel electrophoresis was performed using Tricine running buffer over a distance of ~2.5 cm (Figure S1B). The in-gel digestion was performed as above for the long SDS-PAGE method.
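As a quick check of the loading amounts, the protein load and the total volume loaded per lane follow directly from the volumes stated in methods 1 and 3. The sketch below is plain arithmetic on those stated values; the helper names are illustrative.

```python
def protein_load_ug(lysate_volume_ul, concentration_ug_per_ul):
    """Protein mass in the lysate aliquot."""
    return lysate_volume_ul * concentration_ug_per_ul


def total_volume_ul(*component_volumes_ul):
    """Total volume of the mixture prepared for one gel lane."""
    return sum(component_volumes_ul)


# 2.5 uL of A431 lysate at 0.4 ug/uL, used in all three methods
load = protein_load_ug(2.5, 0.4)

# Method 1 (4-12% Bis-Tris): lysate, LDS buffer, reducing agent, water
bis_tris_mix = total_volume_ul(2.5, 5, 2, 10.5)

# Method 3 (16% Tricine): lysate, Tricine SDS sample buffer, reducing agent, water
tricine_mix = total_volume_ul(2.5, 5, 1, 4)

print(f"protein load: {load:.1f} ug")
print(f"Bis-Tris mixture: {bis_tris_mix:.1f} uL, Tricine mixture: {tricine_mix:.1f} uL")
```

Each lane therefore carries about 1 µg of protein, in agreement with the amount quoted for Figure S1.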

Figure S1. Selection of gel type and SDS-PAGE separation distance for proteomic analysis of small sample amounts. A) NuPAGE Bis-Tris gel (4-12%T) was used to separate 1 µg and 5 µg of A431 cell lysate over a length of 10 cm. Lane 1 in the Coomassie blue stained gel image corresponds to 1 µg of protein. For improved visualization, a second lane corresponding to 5 µg of protein is also shown. The third lane corresponds to the molecular weight markers. B) NuPAGE Tricine gel (16%T) was used to separate 1 µg (lane 1) and 5 µg (lane 2) of A431 cell lysate over a gel length of 2.5 cm. Note that the molecular weight markers (third lane) are separated even with the 2.5 cm separation distance.

2.1.3 LC-MS/MS Analysis

The in-gel digest was loaded on a 75 µm i.d. x 10 cm capillary column packed with Magic C18 material (3 µm, 200 Å pore size) (Michrom Bioresources, Auburn, CA). Sample loading was followed by desalting for 15 min using 2% mobile phase B (0.1% formic acid in acetonitrile) at a flow rate of 200 nL/min. A 50 min linear gradient from 2% to 30% mobile phase B was then delivered by a nanoLC instrument (Ultimate 3000, Dionex) at a flow rate of 200 nL/min. This step was followed by a steeper gradient to 80% B at 60 min, which was then maintained at 80% B for 20 min. The peptides were identified by tandem mass spectrometry using a Thermo Fisher (San Jose, CA) LTQ ion trap mass spectrometer equipped with a PicoView ESI source (New Objective, Woburn, MA). MS/MS spectra of the peptides were acquired in data-dependent mode, in which one full MS scan was followed by 7 MS2 acquisitions with a normalized collision energy of 35%. Dynamic exclusion was used with a repeat count of 2, a repeat duration of 30 s and an exclusion duration of 30 s.
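The gradient program described above can be written as a piecewise-linear function of time. The sketch below assumes that time zero is the start of the gradient (after the 15 min desalting step) and that the 60 min mark is measured from that point; these breakpoints are our reading of the text and are not taken from an instrument method file.

```python
# (time in min from gradient start, % mobile phase B); assumed breakpoints
GRADIENT = [(0, 2), (50, 30), (60, 80), (80, 80)]


def percent_b(t_min, program=GRADIENT):
    """Linearly interpolate the mobile phase B percentage at time t_min."""
    if not program[0][0] <= t_min <= program[-1][0]:
        raise ValueError("time outside the gradient program")
    for (t0, b0), (t1, b1) in zip(program, program[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


for t in (0, 25, 50, 55, 60, 80):
    print(f"{t:>3} min: {percent_b(t):.0f}% B")
```

With these assumptions the shallow segment changes by 0.56% B per minute and the steep segment by 5% B per minute.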

2.1.4 Protein Identification

See experimental section in Chapter 2.


2.2A Results

In order to select appropriate gel compositions, experiments were conducted using both a 4-12% Bis-Tris gradient gel and a 16% Tricine gel. In the case of the 4-12% Bis-Tris gradient gel, the SDS-PAGE separation was performed for distances of 2.5 and 10.0 cm, while only a 2.5 cm separation distance was used with the 16% Tricine gel. The numbers of peptides and proteins identified with the two gel types and separation distances are shown in Table S1.

Table S1. Peptides and proteins identified using three SDS-PAGE separation conditions.

Gel type             Gradient gel      Gradient gel      Constant gel
                     4-12% Bis-Tris    4-12% Bis-Tris    16% Tricine
Separation length    10 cm             2.5 cm            2.5 cm
Unique peptides      146               109               789
Total peptides       219               172               1315
Total proteins       48                41                181

The identification of proteins was based on one or more unique peptides. It can be seen that the number of proteins identified using the short run on the 16% Tricine gel was roughly four times higher, and the number of unique peptides more than five times higher, than that obtained on either the short or long run 4-12% gradient gel. It can be concluded from these results, in agreement with others [6], that concentrating proteins in a small gel volume provides improved proteomic coverage when sample amount is limited.
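The gains can be quantified directly from Table S1. The following sketch computes fold changes relative to either gradient gel condition; the dictionary keys are illustrative labels for the three conditions.

```python
# Values transcribed from Table S1
TABLE_S1 = {
    "Bis-Tris 10 cm":  {"unique_peptides": 146, "total_peptides": 219, "proteins": 48},
    "Bis-Tris 2.5 cm": {"unique_peptides": 109, "total_peptides": 172, "proteins": 41},
    "Tricine 2.5 cm":  {"unique_peptides": 789, "total_peptides": 1315, "proteins": 181},
}


def fold_change(metric, condition, reference, table=TABLE_S1):
    """Ratio of a metric in one condition to the same metric in a reference."""
    return table[condition][metric] / table[reference][metric]


for reference in ("Bis-Tris 10 cm", "Bis-Tris 2.5 cm"):
    for metric in ("proteins", "unique_peptides"):
        ratio = fold_change(metric, "Tricine 2.5 cm", reference)
        print(f"{metric} vs {reference}: {ratio:.1f}x")
```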


2.3 References

[1] R.F. Bonner, M. Emmert-Buck, K. Cole, T. Pohida, R. Chuaqui, S. Goldstein, L.A. Liotta, Science 278 (1997) 1481-&.

[2] M.R. Emmert-Buck, R.F. Bonner, P.D. Smith, R.F. Chuaqui, Z.P. Zhuang, S.R. Goldstein, R.A. Weiss, L.A. Liotta, Science 274 (1996) 998-1001.

[3] F. Fend, M. Raffeld, J Clin Pathol 53 (2000) 666-672.

[4] D.J. Johann, J. Rodriguez-Canales, S. Mukherjee, D.A. Prieto, J.C. Hanson, M. Emmert-Buck, J. Blonder, J. Proteome Res. 8 (2009) 2310-2318.

[5] N. Nagaraj, A.P. Lu, M. Mann, J.R. Wisniewski, J. Proteome Res. 7 (2008) 5028-5032.

[6] D.C. Liebler, A.J.L. Ham, Nature Methods 6 (2009) 785-785.

[7] A.R. Ivanov, L. Zang, B.L. Karger, Anal. Chem. 75 (2003) 5306-5316.

[8] Q.Z. Luo, K.Q. Tang, F. Yang, A. Elias, Y.F. Shen, R.J. Moore, R. Zhao, K.K. Hixson, S.S. Rossie, R.D. Smith, J. Proteome Res. 5 (2006) 1091-1097.

[9] Y.F. Shen, R.J. Moore, R. Zhao, J. Blonder, D.L. Auberry, C. Masselon, L. Pasa-Tolic, K.K. Hixson, K.J. Auberry, R.D. Smith, Anal. Chem. 75 (2003) 3596-3605.

[10] R.D. Smith, Y.F. Shen, K.Q. Tang, Acc. Chem. Res. 37 (2004) 269-278.

[11] L. Tang, P. Kebarle, Anal. Chem. 65 (1993) 3654-3668.

[12] G.H. Yue, Q.Z. Luo, J. Zhang, S.L. Wu, B.L. Karger, Anal. Chem. 79 (2007) 938-946.

[13] K.D. Patel, A.D. Jerkovich, J.C. Link, J.W. Jorgenson, Anal. Chem. 76 (2004) 5777-5786.

[14] D.C. Sgroi, S. Teng, G. Robinson, R. LeVangie, J.R. Hudson, A.G. Elkahloun, Cancer Res. 59 (1999) 5656-5661.

[15] X.J. Ma, R. Salunga, J.T. Tuggle, J. Gaudet, E. Enright, P. McQuary, T. Payette, M. Pistone, K. Stecker, B.M. Zhang, Y.X. Zhou, H. Varnholt, B. Smith, M. Gadd, E. Chatfield, J. Kessler, T.M. Baer, M.G. Erlander, D.C. Sgroi, Proc Natl Acad Sci U S A 100 (2003) 5974-5979.


[16] Y. Gu, S.L. Wu, J.L. Meyer, W.S. Hancock, L.J. Burg, J. Linder, D.W. Hanlon, B.L. Karger, J. Proteome Res. 6 (2007) 4256-4268.

[17] Q. Luo, G. Yue, G.A. Valaskovic, Y. Gu, S.L. Wu, B.L. Karger, Anal. Chem. 79 (2007) 6174-6181.

[18] A. Rauch, M. Bellew, J. Eng, M. Fitzgibbon, T. Holzman, P. Hussey, M. Igra, B. Maclean, C.W. Lin, A. Detter, R.H. Fang, V. Faca, P. Gafken, H.D. Zhang, J. Whitaker, D. States, S. Hanash, A. Paulovich, M.W. McIntosh, J. Proteome Res. 5 (2006) 112-121.

[19] J.E. Elias, W. Haas, B.K. Faherty, S.P. Gygi, Nature Methods 2 (2005) 667-675.

[20] J.E. Elias, S.P. Gygi, Nature Methods 4 (2007) 207-214.

[21] R. Higdon, J.M. Hogan, G. Van Belle, E. Kolker, Omics-a Journal of Integrative Biology 9 (2005) 364-379.

[22] P.C. Carvalho, J.S. Fischer, E.I. Chen, J.R. Yates, V.C. Barbosa, BMC Bioinformatics 9 (2008).

[23] G. Dennis, B.T. Sherman, D.A. Hosack, J. Yang, W. Gao, H.C. Lane, R.A. Lempicki, Genome Biology 4 (2003).

[24] D.W. Huang, B.T. Sherman, R.A. Lempicki, Nature Protocols 4 (2009) 44-57.

[25] S.E. Martin, J. Shabanowitz, D.F. Hunt, J.A. Marto, Anal. Chem. 72 (2000) 4266-4274.

[26] M. Rogeberg, S.R. Wilson, T. Greibrokk, E. Lundanes, J Chromatogr A 1217 (2010) 2782-2786.

[27] A. Hendrix, W. Westbroek, M. Bracke, O. De Wever, Cancer Res. 70 (2010) 9533-9537.

[28] L.A. Emery, A. Tripathi, C. King, M. Kavanah, J. Mendez, M.D. Stone, A. de las Morenas, P. Sebastiani, C.L. Rosenberg, Am J Pathol 175 (2009) 1292-1302.

[29] E.I. Chen, J. Hewel, J.S. Krueger, C. Tiraby, M.R. Weber, A. Kralli, K. Becker, J.R. Yates, B. Felding-Habermann, Cancer Res. 67 (2007) 1472-1486.


Chapter 3: Comparative Proteomic Analysis of 10,000 Triple Negative Breast Cancer and Normal Mammary Epithelial Laser Microdissected Cells Using On-line 2D RP-SCX/Porous Layer Open Tubular Column (PLOT) LC-MS


Abstract

Knowledge regarding the pathophysiology of triple negative breast cancer (TNBC) can be obtained by identifying protein expression patterns specific to malignant epithelial cells. Laser capture microdissection can be used to generate a homogeneous population of cells, facilitating subsequent proteomic studies and allowing a clearer understanding of the molecular signatures associated with these epithelial cells. Here, protein expression profiles of 3 non-cancerous breast epithelial (NBE) and 3 triple negative malignant breast epithelial (TNBE) samples were investigated following LCM collection of 10,000 epithelial cells and ultra-sensitive shotgun proteomics. We employed 2-dimensional peptide fractionation using an online triphasic (RP-SCX-SPE (PS-DVB))-PLOT LC platform coupled to a high mass resolution LTQ FTMS. A total of 15,406 unique peptides and 4,259 proteins were identified from 6 samples using a volume of protein digest equivalent to only 4000 cells per sample. Comparative proteomic analysis between TNBE and NBE, using the spectral index (SpI) as a measure of protein abundance, found 114 differentially abundant proteins at the 95% confidence level. Gene ontology analysis using the DAVID web resource identified enrichment of key functional categories such as DNA replication initiation and extracellular matrix. Complementary to these findings, the GSEA analysis revealed processes such as cell cycle, M-G1 transition, DNA replication pre-initiation, focal adhesion, ECM receptor interaction and regulation of the actin cytoskeleton. This proteomic platform has the potential to perform similarly extensive proteomic profiling of other cancer types from very small sample amounts.

Introduction

Breast cancer is the most common type of cancer in women. Over 1 million cases are diagnosed each year, making breast cancer the leading cause of cancer death among women worldwide [1]. There are 17 breast cancer subtypes, classified based on tumor size, histomorphological information (histological subtypes and grading), lymph node and distant metastasis information, and transcriptomic similarity. Due to the histologically, molecularly and clinically heterogeneous nature of breast cancer, different treatment approaches are necessary for each disease subtype [2]. Three different breast tissue cancer receptors are targeted for therapeutic treatment of breast cancer: estrogen receptor (ER), progesterone receptor (PR) and human epidermal growth factor receptor 2 (Her2/neu). Breast cancer tissues which do not express these three receptors (negative expression) are known as triple-negative breast cancer (TNBC). TNBC constitutes approximately 15% of all breast cancers [3,4]. TNBC is an aggressive subtype, associated with early relapse and a poor survival rate [5]. At present, chemotherapy is the only systemic treatment for this disease subtype; other effective targeted therapies have yet to be developed [2]. Understanding the biology of these specific tumor cells could have a significant impact on the identification of new and more efficacious therapeutic targets.

The application of comparative proteomics to study the breast cancer proteome is one approach that can potentially reveal differential protein signatures, providing insight into the cellular and molecular changes of the disease. However, to generate biologically relevant proteomic data, the tissue samples under investigation should be homogeneous and devoid of unwanted cells of other types.

As discussed in previous chapters, one approach for the selective enrichment of cellular features of interest from tissue sections is laser capture microdissection (LCM) [6]. Using LCM, it is possible to obtain homogeneous populations of cells, which allows for a more relevant proteomic comparison of diseased and normal tissue samples. However, because no protein amplification technique exists, comprehensive proteomic analysis of LCM collected specimens is hampered by the small sample amounts, which range from 10³ to 10⁴ cells, corresponding to 0.1-1 µg of total protein [7]. A typical tissue proteomic workflow is broadly divided into two steps: sample preparation, i.e. protein extraction followed by enzymatic digestion, and downstream analysis of the digest by liquid chromatography-mass spectrometry (LC-MS). To extensively analyze a small number of LCM collected cells (10³ to 10⁴ cells or 0.1-1 µg protein), effective sample preparation methods with minimal sample handling steps have been suggested, i.e. single pot cell lysis and digestion [8,9], use of low protein binding tubes, MS friendly detergents [10], and use of a strong detergent with a short SDS-PAGE run [11]. For LC-MS analysis of small sample amounts, current technology utilizes 50-75 µm i.d. packed columns for nano LC ESI-MS operating at 100-200 nL/min. Such column dimensions require complete injection of these small samples in a single run, resulting in low information content, i.e. the identification of only a few hundred proteins. Therefore, an analytical platform capable of high resolution and high ESI-MS sensitivity that can perform extensive peptide fractionation with minimum sample loss is a necessity for the comparative global proteomic analysis of such limited and precious clinical specimens.
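The cell numbers and protein amounts quoted above imply a rough per-cell protein content, which is convenient when converting between injected cell equivalents and protein mass. The sketch below assumes a linear scaling of about 0.1 ng (100 pg) of total protein per cell, derived only from the quoted range of 10³ to 10⁴ cells for 0.1-1 µg; it is an order-of-magnitude estimate, not a measured value.

```python
# Assumed from the quoted range: 10^4 cells correspond to about 1 ug of protein
PROTEIN_PER_CELL_NG = 0.1


def protein_ug(n_cells, per_cell_ng=PROTEIN_PER_CELL_NG):
    """Approximate total protein (in ug) contained in n_cells."""
    return n_cells * per_cell_ng / 1000.0


for cells in (1_000, 4_000, 10_000):
    print(f"{cells:>6} cells ~ {protein_ug(cells):.1f} ug protein")
```

On this basis, a digest equivalent to 4,000 cells corresponds to roughly 0.4 µg of protein.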

As described in Chapters 1 and 2, the utility of a 3-4 meter long, 10 µm i.d. porous layer open tubular (PLOT) chromatography column operating at a flow rate of 20 nL/min and capable of detecting attomole amounts of analyte when coupled with LTQ-MS has been demonstrated [12]. The sensitive performance of the PLOT based separation platform (PLOT-SPE LC-MS) is attributed to the high ESI-MS sensitivity at the associated low nanoliter operational flow rate, as at such a low flow rate (20 nL/min) the response of ESI-MS changes from concentration sensitive to mass sensitive. An additional feature of SPE-PLOT LC-MS is the preconcentration and clean up of the sample via online desalting using a PS-DVB bulk monolith SPE and a loading flow rate of 1 µL/min. The next significant development was the introduction of an online multidimensional SCX-RP-PLOT integrated platform for extensive peptide fractionation and analysis [13,14]. The improvement in peak capacity attributed to the multidimensional liquid chromatographic separation yielded enhanced peptide and protein identifications due to the increased dynamic range of the mass spectrometer and decreased ion suppression. Therefore, this multidimensional platform can be a valuable tool for sensitive and comprehensive analysis of limited sample amounts.

In this study, we present a proteomic workflow, comprising sample preparation using a short SDS-PAGE run and an online multidimensional SCX-RP-PLOT integrated platform coupled to an LTQ-FTMS, for the analysis of 10,000 LCM collected breast tumor or normal cells. The workflow enabled the proteomic comparison of 10,000 LCM collected normal breast epithelium cells (NBE), taken from non-cancerous human reduction mammoplasty specimens from three individuals, against 10,000 non-patient-matched LCM collected triple negative malignant breast epithelium cells (TNBC) from three human breast cancer patient specimens. Although 10,000 cells were dissected from each tissue specimen and processed, the in-gel digest equivalent to only 4,000 cells was sufficient to identify 5,000 unique peptides and close to 2,000 proteins using six multidimensional LC-MS analyses from a single sample. This allowed a second LC/MS injection of the same sample to further increase the comprehensiveness of, and confidence in, the identified proteome. When taken together, more than 4000 proteins were identified from the six specimens. In addition to thousands of high confidence protein identifications, we demonstrate the utility of this workflow for quantitative proteomic comparison of TNBC versus normal breast tissue samples.

Quantitation of protein expression changes in small sample amounts is a challenging task due to the multiple steps involved in the labeling procedure when isotopic or isobaric labeling reagents such as ICAT or iTRAQ are employed, leading to potential sample losses and variability in quantitation. The label free method based on spectral counts, representing the abundance of the proteins, is a proven method for relative quantitative comparative studies [15]. A total of 114 differentially abundant proteins were determined by calculating the spectral index (SpI) followed by statistical analysis [16], and were annotated using DAVID [17]. Proteins involved in DNA replication initiation were found to be up-regulated in TNBC, and extracellular matrix related proteins were down-regulated in TNBC.
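For readers unfamiliar with the spectral index, the sketch below shows one commonly cited formulation, in which the relative mean spectral count of a protein in each group is weighted by the fraction of replicates in which the protein was detected. This is offered as an assumption about the general form of SpI and is not necessarily the exact calculation used in this work; normalization of counts and the permutation-based significance threshold are omitted, and the example counts are hypothetical.

```python
from statistics import mean


def spectral_index(counts_a, counts_b):
    """Spectral index in [-1, 1] for one protein.

    counts_a and counts_b hold the replicate spectral counts in groups A
    and B. Positive values indicate higher abundance in A, negative in B.
    """
    mean_a, mean_b = mean(counts_a), mean(counts_b)
    total = mean_a + mean_b
    if total == 0:
        return 0.0
    detected_a = sum(c > 0 for c in counts_a) / len(counts_a)
    detected_b = sum(c > 0 for c in counts_b) / len(counts_b)
    return (mean_a / total) * detected_a - (mean_b / total) * detected_b


# Hypothetical counts: three TNBC replicates versus three normal replicates
print(spectral_index([4, 6, 5], [0, 0, 0]))  # detected only in group A
print(spectral_index([3, 3, 3], [3, 3, 3]))  # identical in both groups
```

A protein seen consistently in one group and never in the other reaches the extreme value of 1 or -1, whereas a protein with equal counts in both groups scores 0.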


2. Materials and Methods

2.1. Chemicals and Materials

Ammonium acetate, acetonitrile (HPLC grade), and water (HPLC grade) were obtained from Thermo Fisher Scientific (Pittsburgh, PA). Trypsin (sequencing grade) was purchased from Promega (Madison, WI). Styrene, divinylbenzene, ethanol, formic acid (HPLC grade), 3-(trimethoxysilyl)propyl methacrylate, 2,2-diphenyl-1-picrylhydrazyl (DPPH), anhydrous DMF, tetrahydrofuran (THF), azobisisobutyronitrile (AIBN), ammonium bicarbonate, dithiothreitol (DTT), and iodoacetamide (IAA) were obtained from Sigma-Aldrich (St. Louis, MO). Fused-silica capillary tubing was purchased from Polymicro Technologies (Phoenix, AZ). The PicoClear tee and union were from New Objective (Woburn, MA). Reducing agent (10X), Tricine gel (16%T), Tricine SDS sample buffer (2X) and Tricine SDS running buffer (10X) were purchased from Invitrogen (Carlsbad, CA).

2.2. Laser Capture Microdissection

The breast cancer specimens used in this study were fresh-frozen biopsies obtained from the Massachusetts General Hospital. The diagnostic criteria and grading of the tumor samples were as described previously [18]. Highly enriched populations of non-patient-matched normal epithelial cells and malignant epithelial breast cancer cells were procured by LCM using a PixCell IIe system (Molecular Devices, Mountain View, CA, USA) as previously described [19]. The enriched cells of interest were verified by microscopic examination of the LCM cap after microdissection. Information regarding patient and tumor characteristics is presented in Table 1. This study was approved by the Massachusetts General Hospital human research committee in accordance with National Institutes of Health human research study guidelines.

Table 1. Details about normal breast specimens and triple negative breast cancer specimens.

Normal Breast Specimen

Sample ID    Case ID        Number of LCM Caps Used    Number of Laser Pulses
N1           29-2T          1                          3585
N2           6-1T           1                          3500
N3           55-2T          1                          3530

Triple Negative Breast Cancer Specimen

Sample ID    Case ID        Number of LCM Caps Used    Number of Laser Pulses
TNBC1        297-1-M1/2     2                          6107
TNBC2        1195-2M        1                          6108
TNBC3        1265-1-M1/2    2                          6000

Number of Laser Pulses: the number of laser pulses used to collect approximately 10,000 cells. Number of LCM Caps Used: the number of LCM caps used to collect 10,000 cells; this is important as cell lysis was performed directly on the LCM cap.

2.3. Protein Extraction and Digestion

Protein extraction from normal epithelial cells (number of individual samples = 3) and breast carcinoma epithelial cells (number of individual samples = 3) was performed directly on the LCM cap without peeling off the LCM cap membrane. Cells were lysed on each cap using 10 µL of lysis buffer (Novex Tricine SDS sample buffer (2X): NuPAGE reducing agent: water, 5:1:4). After cell lysis, the lysate was transferred to an Eppendorf tube. The LCM cap was rinsed with a fresh 10 µL aliquot of lysis buffer, and the rinse was combined with the previous cell lysate in the Eppendorf tube. In the case of tissue samples collected on two LCM caps, cells on each cap were lysed using 10 µL of lysis buffer, and both cell lysates were combined.

The cell lysate from each sample was separated on a 16% Tricine gel for a distance of 2.5 cm with application of 125V. Gel staining, destaining and further processing of the gel cubes were performed as described in Chapter 2. Samples were reduced with 10 mM DTT for 30 min at 56 °C and then alkylated for 60 min at room temperature in the dark with 55 mM iodoacetamide. Proteins were digested at 37 °C with sequencing grade trypsin for 12 hours. Peptides were extracted from gel pieces in two stages; first, the gel pieces were dehydrated with acetonitrile (acetonitrile: digestion buffer 2:1), and the solution was collected in another Eppendorf tube. Next the dried gel pieces were hydrated with 5% v/v formic acid for 5 minutes and then dehydrated with acetonitrile. Both extracts were combined, dried using vacuum centrifugation, and stored at -80 °C until further analysis.

2.4. Column Preparation and Two-Dimensional Separation

A PS-DVB PLOT column (10 μm i.d./360 μm o.d., 4.0 m long) was prepared following the previously described protocol [12]. Briefly, a 10 μm i.d. fused silica capillary (5 m long) was treated overnight with 3-(trimethoxysilyl)propyl methacrylate. The treated capillary was filled with a degassed solution containing 5 mg of AIBN, 200 µL styrene, 200 µL divinylbenzene, and 600 µL ethanol, yielding a PS-DVB porous layer on the inner surface of the capillary. Polymerization was performed by sealing the capillary with septa at both ends and heating it at 74 °C for 16 h in a water bath. The capillary was washed with acetonitrile for 2 days before use. The PS-DVB monolithic micro-SPE column (50 µm i.d.) was prepared according to a published procedure [14], using a polymerization solution containing 5 mg of AIBN, 200 µL styrene, 200 µL divinylbenzene, 40 µL THF, and 550 µL of decanol. The use of a flow rate of 1 µL/min on the 4 cm monolithic column resulted in a backpressure of approximately 2900 psi, thus facilitating rapid sample loading. The diagram and operation of the on-line 2-D RP (reversed phase, C18)/SCX (strong cation exchange)/PLOT/MS system are described in Chapter 1.

The triphasic trapping column (RP/SCX/micro-SPE) allowed fast sample loading as well as on-line desalting and washing prior to the 2-D separation of a complex sample. To prepare the biphasic 2-D RP/SCX column, a frit was first constructed within the silica capillary (75 µm i.d./360 µm o.d., 15 cm long) by dipping the capillary into a 15% v/v solution of formamide in potassium silicate (Kasil #1, PQ, Valley Forge, PA) and placing the capillary in an oven at 100 °C for 3 min [14]. The end of the capillary was cut to reduce the length of the frit to 0.5 mm, and the frit was then rinsed with acetonitrile. First, 2 cm of 5 µm, 300 Å Polysulfoethyl A SCX resin (Nest Group, Southboro, MA) was slurry-packed, followed by 2 cm of 5 µm Magic C18 packing material (200 Å pore size, Michrom BioResources, Auburn, CA). A triphasic RP/SCX/micro-SPE trapping column was assembled by butt-to-butt connecting the 4 cm x 50 µm i.d. PS-DVB monolithic pre-column to the biphasic RP/SCX column with a PicoClear union (New Objective). A PicoClear tee (New Objective) was utilized to connect the triphasic trapping column and the PLOT column. PEEK microtees were employed to split the mobile phase to obtain a flow rate of 1 µL/min during sample loading and 20 nL/min during the separation mode. The downstream end of the PLOT column was butt-to-butt connected to a metalized ESI spray tip (360 µm o.d., 20 µm i.d. fused silica with a 5 µm i.d. tip, 2–3 cm length, New Objective) to which the electrospray voltage (~1.0 kV) was applied.

RP gradient elution was performed using an Ultimate 3000 LC System (Dionex, Sunnyvale, CA) with mobile phase A as 0.1% v/v formic acid in water and mobile phase B as 0.1% v/v formic acid in acetonitrile. Buffers for the salt gradient steps (2.5–50 mM) were prepared from a 1 M ammonium acetate stock solution and diluted in mobile phase A containing 2% ACN.

The two-dimensional separation and MS analysis of peptides involved the following three steps.

Step 1 involved loading the peptide mixture on the C18 column, desalting, and transfer of the peptides from the C18 stationary phase to the SCX stationary phase. A sharp reversed phase (RP) gradient, in which mobile phase B increased from 0 to 45% in 20 min at a flow rate of 1 µL/min, was used to elute peptides from the C18 column.

Step 2 involved sequential elution of the peptides from the SCX column to the PS-DVB micro-SPE column. Six salt steps (2.5, 5.0, 7.5, 10.0, 15.0, and 50.0 mM ammonium acetate) were used, with 2 µL of salt solution injected at each step to elute peptides onto the PS-DVB monolithic micro-SPE column. Subsequently, the SPE column was desalted with mobile phase A for 15 minutes. During steps 1 and 2, the 1 µL/min flow of the mobile phase was directed to waste (sample loading mode).

Step 3 involved elution and analysis of a subset of peptides from the PS-DVB monolithic micro-SPE column. After each salt plug, the mobile phase was directed through the PLOT column (analysis mode) at a flow rate of 20 nL/min for the analysis of the eluted peptides. A longer RP gradient, in which mobile phase B increased from 0 to 30% in 180 min, was used for the PLOT LC/LTQ-FTMS analysis.


2.5. MS Analysis and Data Analysis

Nano-ESI-MS was performed on a linear ion trap-FT mass spectrometer (LTQ-FT (7 Tesla), Thermo Fisher Scientific, San Jose, CA). Each MS spectral scan was collected in the data dependent mode over an m/z range of 400–2000 with a mass resolution of 100,000 at m/z 400. The number of ions transferred to the FT-MS was set to 1 × 10^6. Following each high mass accuracy FT-MS scan, ten MS/MS scans were performed in the linear ion trap. Dynamic exclusion was set to 60 s, and the normalized collision energy for collision induced dissociation was set to 35%. The acquired MS/MS scans were converted to DTA files by Extract-MSn (Version 4.0, Thermo Fisher Scientific), and these DTA files were searched with the Sequest algorithm (Ver. 27, rev. 12, Thermo Fisher Scientific) within the CPAS system (ver. 9.10, LabKey, Seattle, WA) [20] against the human SwissProt annotated database (release 2010_06, downloaded in July 2010, with entries for common contaminants included). Normal and reversed sequences were combined in the database to facilitate the estimation of the peptide false-positive rate. Carbamidomethylation of cysteines was selected as a fixed modification. Mass tolerances were set at 2.5 Da (precursor ions) and 1.0 Da (fragment ions). Full tryptic enzyme specificity was specified with a maximum of 2 missed cleavage sites. Identified peptides were filtered using Xcorr values (a measure of closeness between experimental and theoretical MS/MS spectra) of ≥ 1.9 for 1+ ions, ≥ 2.2 for 2+ ions, and ≥ 3.8 for 3+ ions, together with a PeptideProphet probability value ≥ 0.9 (Institute for Systems Biology, Seattle, WA).
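The peptide filtering criteria above can be expressed as a short predicate. The following is a minimal sketch rather than the CPAS implementation; the function name and the decision to reject charge states not listed in the text are assumptions made for illustration.

```python
# Charge-dependent Xcorr thresholds and PeptideProphet cutoff quoted in the text.
XCORR_MIN = {1: 1.9, 2: 2.2, 3: 3.8}
MIN_PROBABILITY = 0.9


def passes_filter(charge, xcorr, probability):
    """Return True if a peptide-spectrum match meets both filtering criteria."""
    threshold = XCORR_MIN.get(charge)
    if threshold is None:
        # Assumption: charge states without a stated threshold are rejected.
        return False
    return xcorr >= threshold and probability >= MIN_PROBABILITY


# Hypothetical peptide-spectrum matches: (charge, Xcorr, PeptideProphet probability).
matches = [(2, 2.5, 0.95), (3, 3.0, 0.99), (1, 2.0, 0.80)]
accepted = [m for m in matches if passes_filter(*m)]
print(accepted)
```

Only the first hypothetical match is retained: the second fails the 3+ Xcorr threshold and the third fails the probability cutoff.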

2.6. Spectral Index (SpI) for Identification of Differentially Abundant Proteins

Spectral counts i.e. the total count of MS/MS spectra responsible for identification of a specific protein or a group of proteins in an individual sample, were first normalized by the total identified spectral counts in that sample. To determine the differentially abundant proteins between TNBC and normal samples, the spectral index (SpI) was calculated using the following formula:

$$\mathrm{SpI} = \left(\frac{\mathrm{spc}_T}{\mathrm{spc}_T + \mathrm{spc}_N}\cdot\frac{N_T^{D}}{N_T^{T}}\right) - \left(\frac{\mathrm{spc}_N}{\mathrm{spc}_T + \mathrm{spc}_N}\cdot\frac{N_N^{D}}{N_N^{T}}\right)$$

spc_T = Mean spectral count for a particular protein in TNBC patients.
spc_N = Mean spectral count for a particular protein in normal individuals.
N_T^D and N_N^D = Numbers of TNBC and normal individuals in which the protein of interest was observed.
N_T^T and N_N^T = Total numbers of TNBC and normal individuals.
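As an illustration of the calculation, the sketch below normalizes spectral counts and evaluates the SpI for one protein. It assumes that the mean spectral count is taken over all individuals in a group (with zero for samples in which the protein was not observed) and that a protein absent from both groups is assigned SpI = 0; neither convention is stated in the text, and the function names are hypothetical.

```python
def normalize_counts(raw_counts, total_counts):
    """Divide each sample's spectral count by that sample's total identified spectra."""
    return [raw / total for raw, total in zip(raw_counts, total_counts)]


def spectral_index(counts_t, counts_n):
    """Spectral index for one protein from per-individual normalized counts."""
    n_t, n_n = len(counts_t), len(counts_n)
    spc_t = sum(counts_t) / n_t  # mean over all TNBC individuals (assumption)
    spc_n = sum(counts_n) / n_n  # mean over all normal individuals (assumption)
    total = spc_t + spc_n
    if total == 0:
        return 0.0  # assumption: protein not observed in either group
    d_t = sum(1 for c in counts_t if c > 0)  # TNBC individuals with the protein
    d_n = sum(1 for c in counts_n if c > 0)  # normal individuals with the protein
    return (spc_t / total) * (d_t / n_t) - (spc_n / total) * (d_n / n_n)


# Hypothetical protein seen in all three TNBC samples and one normal sample.
print(round(spectral_index([4, 4, 4], [3, 0, 0]), 3))
```

With these hypothetical counts, spc_T = 4 and spc_N = 1, so SpI = (4/5)(3/3) - (1/5)(1/3) ≈ 0.733. A protein observed in every TNBC sample and in no normal sample gives SpI = 1, and the reverse case gives SpI = -1.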

To determine the SpI cutoff value indicative of statistically significant differences in abundance, a random permutation analysis was performed. The proteomic profiles of the TNBC and normal samples were randomized 2000 times, and a SpI value was calculated for each protein in each permutation. The frequency distribution of the SpI values obtained by randomization was used as the null distribution of SpI values, from which the SpI cutoff values corresponding to a statistical significance of 95% were calculated.
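The permutation procedure can be sketched as follows. This is an illustrative reconstruction, not the code used in the study: it assumes that sample labels are shuffled once per permutation and applied to every protein, that the null |SpI| values from all proteins are pooled, and that the cutoff is read off as an empirical quantile. The function names, the seed, and the example profiles are hypothetical.

```python
import math
import random


def spectral_index(counts_t, counts_n):
    """Spectral index for one protein (means taken over all individuals)."""
    n_t, n_n = len(counts_t), len(counts_n)
    spc_t, spc_n = sum(counts_t) / n_t, sum(counts_n) / n_n
    total = spc_t + spc_n
    if total == 0:
        return 0.0
    d_t = sum(1 for c in counts_t if c > 0)
    d_n = sum(1 for c in counts_n if c > 0)
    return (spc_t / total) * (d_t / n_t) - (spc_n / total) * (d_n / n_n)


def permutation_cutoff(profiles, n_tumor, n_perm=2000, confidence=0.95, seed=0):
    """Estimate the |SpI| cutoff from label permutations.

    profiles: one list of per-sample counts per protein, all in the same
    sample order; n_tumor: number of samples assigned to the tumor group.
    """
    rng = random.Random(seed)
    order = list(range(len(profiles[0])))
    null = []
    for _ in range(n_perm):
        rng.shuffle(order)  # one random relabeling of the samples
        tumor_idx, normal_idx = order[:n_tumor], order[n_tumor:]
        for counts in profiles:
            spi = spectral_index([counts[i] for i in tumor_idx],
                                 [counts[i] for i in normal_idx])
            null.append(abs(spi))
    null.sort()
    # Smallest value that is >= the requested fraction of the null distribution.
    return null[math.ceil(confidence * len(null)) - 1]


# Hypothetical profiles for three proteins across six samples.
profiles = [[5, 4, 6, 0, 0, 1], [2, 2, 3, 2, 3, 2], [0, 0, 1, 4, 5, 3]]
cutoff = permutation_cutoff(profiles, n_tumor=3, n_perm=200)
print(cutoff)
```

Proteins whose observed |SpI| is at or above the returned cutoff would then be treated as differentially abundant at the chosen confidence level.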


2.7. Gene Ontology by DAVID (Database for Annotation, Visualization and Integrated Discovery), a Functional Annotation Clustering Tool

Proteins with statistically significant abundance changes, as determined by the SpI, were compared against the complete list of identified proteins using DAVID to determine significantly overrepresented Gene Ontology annotation terms. The UniProt protein accession numbers were used for DAVID analysis. To explore biological differences between two sample types, the annotation clusters having an enrichment score greater than 1.0 [21] and FDR less than 25% were considered.
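The idea of over-representation against a background list can be illustrated with a plain hypergeometric test. This sketch is not DAVID's own statistic (DAVID reports a more conservative modified Fisher exact p-value), and the background count of 48 annotated proteins used in the example is a hypothetical value chosen only for illustration.

```python
from math import comb


def fold_enrichment(k, n, K, N):
    """Ratio of the term frequency in the list (k/n) to that in the background (K/N)."""
    return (k / n) / (K / N)


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n proteins are drawn from N, of which K carry the term."""
    total = comb(N, n)
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


# 114 differentially abundant proteins out of a 1398-protein background;
# 10 list proteins carry the term, and 48 background proteins (hypothetical).
print(round(fold_enrichment(10, 114, 48, 1398), 2))
print(hypergeom_upper_tail(10, 114, 48, 1398))
```

Exact integer binomial coefficients are used so that the tail probability does not suffer from overflow for backgrounds of this size.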

2.8 Gene Set Enrichment Analyses (GSEA) for Functional Significance of Differentially Abundant Proteins

Enrichment of the gene sets contained in the Molecular Signatures Database (MSigDB v3.0, http://www.broadinstitute.org/gsea/msigdb/) was computed in our identified proteomic dataset (1398 proteins). Proteins in our dataset were ranked by their SpI values and submitted to GSEA with their corresponding gene names. To estimate the statistical significance of the enrichment score of each gene set, 1000 gene set permutations were performed. Gene sets with a q-value < 0.25 (FDR < 25%) were considered significant. Each of the 880 annotated gene sets in MSigDB (C2, canonical pathways) was matched against the 1398 proteins identified in our study, yielding 83 gene sets with at least 20 matches in our dataset.
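For orientation, the running-sum enrichment score at the core of a pre-ranked GSEA analysis can be sketched as below. This is a simplified illustration of the weighted Kolmogorov-Smirnov-like statistic, not the GSEA software itself: permutation testing and normalization are omitted, and the gene names and SpI values in the example are hypothetical.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Maximum deviation from zero of the GSEA-style running sum.

    ranked: list of (gene, score) pairs sorted from highest to lowest score.
    gene_set: set of gene names; weight: exponent applied to |score| for hits.
    """
    hit_weights = [abs(s) ** weight if g in gene_set else 0.0 for g, s in ranked]
    hit_total = sum(hit_weights)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must have weighted hits and the list must have misses")
    running, best = 0.0, 0.0
    for (gene, _), w in zip(ranked, hit_weights):
        if gene in gene_set:
            running += w / hit_total   # step up, proportional to |score|
        else:
            running -= 1.0 / n_miss    # step down by a constant amount
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list: two MCM genes at the positive end of the SpI range.
ranked = [("MCM2", 1.0), ("MCM3", 1.0), ("GENE_A", 0.1), ("GENE_B", -0.5), ("GENE_C", -1.0)]
print(enrichment_score(ranked, {"MCM2", "MCM3"}))
```

A gene set concentrated at the top of the ranked list yields a score close to +1, whereas a set concentrated at the bottom yields a score close to -1.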


3. Results and Discussion

3.1 Experimental and Bioinformatics Workflow for Proteomic Analysis of 10,000 LCM Collected Normal and Cancer Breast Epithelial Cells.

Figure 1. Shotgun proteomics workflow to analyze breast epithelial cells collected from normal and triple negative breast tumor epithelium. A) The sample preparation step includes laser capture microdissection of 10,000 normal and malignant cells, on-the-cap cell lysis using 4% SDS buffer, a short (2.5 cm) SDS-PAGE (16% Tricine gel) separation run and in-gel digestion. B) 2-dimensional separation of peptides on the online triphasic (RP/strong cation exchange (SCX)/micro-SPE)/PLOT set-up using a step gradient of six increasing concentrations of ammonium acetate salt solution, followed by identification using LTQ-FTMS.

The proteomic workflow used for the identification and comparison of protein signatures from normal and breast cancer epithelial cells is summarized in Figure 1. To obtain a highly pure population of epithelial cells from tissue specimens, infrared LCM was employed. Approximately 10,000 cells per biological replicate (3 patients per tissue type) were collected using 1-2 LCM caps. The cells were lysed using SDS-containing buffer, and a short SDS-PAGE run (2.5 cm) on a 16% Tricine gel was used to remove SDS and to perform in-gel digestion. The online 2D SCX-RP-SPE-PLOT LC-MS system was used to fractionate and analyze the peptide mixture by employing a 6 salt step elution. It is important to note that more than 5,000 unique peptides (close to 2,000 proteins) were identified from a single two-dimensional analysis.

As only 10,000 cells were processed, label-free quantitation based on spectral counting was performed to keep the number of sample handling steps to a minimum. Differentially abundant proteins were determined based on the SpI, which was calculated from spectral counts and statistical analysis. The gene ontology term allocation and enrichment analysis on the differentially abundant proteins was performed using DAVID. The GSEA analysis on the subset of proteins identified with a minimum of two spectral counts was performed to elucidate the important canonical pathways represented by these proteins.


3.2. Peptide and Protein Identification.

From 3 pairs of samples (normal and TNBC), a total of 15,406 unique peptides and 4,259 proteins were identified (2,902 proteins were found in two or more samples). The proteins and peptides identified by each salt step in the 2-dimensional analysis of the 6 samples are presented in Figure 2.

[Figure 2: horizontal bar charts, one panel per sample (29T, 55T, 6T, 1195M, 1265M and 279M), showing the number of proteins and peptides identified at each salt step (2.5, 5, 7.5, 10, 15 and 50 mM).]

Figure 2. Peptide and protein identifications from 6 salt steps. More than 1000 peptides were consistently identified from the first three salt steps for all 6 samples: 29T, 55T and 6T (TNBC epithelium samples), and 1195M, 1265M and 279M (normal breast epithelium samples).


Extensive peptide fractionation yielded identification of more than 5,000 unique peptides and more than 2,000 proteins (>1 peptide hit) from the protein digest, equivalent to 4000 cells of each sample. Of the total 4,259 identified proteins, 2397 proteins were found in both normal and TNBC samples, while 1015 and 846 proteins were detected exclusively in normal and TNBC samples respectively. The detailed information about the number of unique and total peptides, and proteins identified from each sample is presented in Figure 3.

[Figure 3: horizontal bar chart of proteins, unique peptides and total peptides identified per sample. 6T: 1908, 4989, 8872; 55T: 2219, 5904, 9405; 29T: 2009, 5518, 7991; 279M: 2267, 5816, 9937; 1265M: 2514, 6031, 9152; 1195M: 2153, 6045, 9320.]

Figure 3. Peptide and protein identifications in the six samples. The number of total peptide and unique peptide identifications per sample represent the complexity of proteome and sensitive proteomic analysis.


3.3. Spectral Index Analysis for Determination of Differentially Abundant Proteins.

To calculate SpI values, we selected the 1398 proteins identified with two or more unique peptides. To determine statistically significant SpI values, permutation analysis was performed [22]. Based on the SpI values of 2,000 randomly permuted sample phenotype classes, the range covering 95% of the calculated SpI values for the 1398 proteins was computed to be |SpI| < 0.667. Based on this SpI cutoff value, 114 proteins were found to be differentially abundant at the 95% confidence level (Table 2). Of these 114 proteins, 40 were enriched in TNBE samples and 74 were enriched in NBE samples.


Table 2. List of differentially abundant proteins between TNBE and NBE. Positive SpI values indicate overexpression of the protein in TNBE samples, whereas negative values indicate overexpression of the protein in NBE samples.

Protein List   SpI value    Protein List   SpI value    Protein List   SpI value
COPE_HUMAN      1.000       RL4_HUMAN       0.680       ACTH_HUMAN     -0.948
4F2_HUMAN       1.000       MK01_HUMAN      0.669       PEPL_HUMAN     -1.000
MCM3_HUMAN      1.000       PYGB_HUMAN     -0.674       UFL1_HUMAN     -1.000
DNJA1_HUMAN     1.000       K1C18_HUMAN    -0.680       PIGR_HUMAN     -1.000
MCM4_HUMAN      1.000       RAB1A_HUMAN    -0.694       HEMO_HUMAN     -1.000
MCM7_HUMAN      1.000       RO60_HUMAN     -0.695       FRIH_HUMAN     -1.000
MCM2_HUMAN      1.000       K2C7_HUMAN     -0.706       PGS2_HUMAN     -1.000
TPIS_HUMAN      1.000       QCR1_HUMAN     -0.720       ANXA3_HUMAN    -1.000
GFPT1_HUMAN     1.000       CADH1_HUMAN    -0.726       KCRU_HUMAN     -1.000
MFGM_HUMAN      1.000       TPP1_HUMAN     -0.743       CD59_HUMAN     -1.000
BZW1_HUMAN      1.000       EPIPL_HUMAN    -0.743       MIME_HUMAN     -1.000
PUF60_HUMAN     1.000       CD166_HUMAN    -0.750       RFA1_HUMAN     -1.000
NASP_HUMAN      0.880       PURA_HUMAN     -0.763       APEX1_HUMAN    -1.000
STAT1_HUMAN     0.876       PRP8_HUMAN     -0.764       NDUS1_HUMAN    -1.000
FILA2_HUMAN     0.867       RS8_HUMAN      -0.771       GSTT1_HUMAN    -1.000
FIBG_HUMAN      0.865       SYMC_HUMAN     -0.777       MYH11_HUMAN    -1.000
PDIA6_HUMAN     0.850       MATR3_HUMAN    -0.783       ACTY_HUMAN     -1.000
PGK1_HUMAN      0.848       IGKC_HUMAN     -0.783       RAB5C_HUMAN    -1.000
TBB2A_HUMAN     0.832       C1TC_HUMAN     -0.784       SYRC_HUMAN     -1.000
PSMD6_HUMAN     0.825       ECHB_HUMAN     -0.784       COEA1_HUMAN    -1.000
RS2_HUMAN       0.823       ITB4_HUMAN     -0.789       IQGA2_HUMAN    -1.000
RBBP7_HUMAN     0.819       CO4B_HUMAN     -0.796       CUL1_HUMAN     -1.000
IF2A_HUMAN      0.808       ARC1A_HUMAN    -0.797       LAMB3_HUMAN    -1.000
ADT3_HUMAN      0.800       GSTM3_HUMAN    -0.802       LAMC2_HUMAN    -1.000
GTF2I_HUMAN     0.798       CO6A2_HUMAN    -0.808       LAMA3_HUMAN    -1.000
FINC_HUMAN      0.792       KTN1_HUMAN     -0.822       NDUA9_HUMAN    -1.000
SYAC_HUMAN      0.778       QOR_HUMAN      -0.824       PDCD4_HUMAN    -1.000
P5CS_HUMAN      0.772       NDRG2_HUMAN    -0.829       SKT_HUMAN      -1.000
DNM1L_HUMAN     0.763       THIM_HUMAN     -0.830       SCMC1_HUMAN    -1.000
TBA1C_HUMAN     0.760       POSTN_HUMAN    -0.833       ACOT1_HUMAN    -1.000
HCFC1_HUMAN     0.728       RS5_HUMAN      -0.839       ESYT1_HUMAN    -1.000
CSN2_HUMAN      0.727       SYNE2_HUMAN    -0.847       TINAL_HUMAN    -1.000
SERPH_HUMAN     0.719       ISOC2_HUMAN    -0.863       MYOF_HUMAN     -1.000
RS7_HUMAN       0.703       STA5B_HUMAN    -0.871       EHD2_HUMAN     -1.000
TOIP1_HUMAN     0.703       ODO1_HUMAN     -0.875       HACD3_HUMAN    -1.000
GLYM_HUMAN      0.688       CO6A3_HUMAN    -0.923       SYLC_HUMAN     -1.000
PABP1_HUMAN     0.683       IGHA2_HUMAN    -0.926       PADI2_HUMAN    -1.000
IPO5_HUMAN      0.682       TAGL_HUMAN     -0.939       PKP3_HUMAN     -1.000

3.4 DAVID Functional Annotation Analysis of Differentially Abundant Proteins.

DAVID analysis, one of the most widely used functional Gene Ontology (GO) term annotation analysis tools, was employed on the differentially abundant proteins. DAVID requires a list of the proteins or genes which are up- or down regulated between two sample types i.e. normal and diseased. Using DAVID, the differentially abundant proteins at the 95% confidence level were assigned annotation terms, and these GO terms were compared against terms for the 1398 proteins. A total of 19 overrepresented GO terms (Table 3) associated with 114 differentially expressed proteins were clustered under 5 annotation categories (Table 4).

The results of the DAVID annotation analysis indicate that the majority of the proteins were related to the extracellular region. GO terms such as cell adhesion and extracellular matrix were down regulated in the TNBE specimens. Several other terms related to the extracellular region, e.g. extracellular matrix organization, extracellular structure organization, glycoprotein, ECM-receptor interaction and focal adhesion, were also found to be down regulated in the TNBE samples.

Proteins related to the DNA replication category were observed to be up regulated in TNBE. The important biological features/GO terms related to this category were DNA replication initiation, nucleic acid-binding, DNA-dependent ATPase MCM, MCM (minichromosome maintenance) and RNA biosynthetic process. The other significant GO terms are listed in Table 3.


Table 3. Representative enriched, functional clusters with corresponding GO terms for differentially expressed proteins identified by DAVID.

Annotation Cluster 1 (Enrichment Score: 1.83)
Category        Term                                           Count  P Value  Fold Enrichment  FDR
GOTERM_CC_FAT   Proteinaceous extracellular matrix             10     0.01     2.57             13.5
GOTERM_CC_FAT   Extracellular matrix                           10     0.01     2.52             15.4

Annotation Cluster 2 (Enrichment Score: 1.71)
Category        Term                                           Count  P Value  Fold Enrichment  FDR
GOTERM_BP_FAT   Extracellular matrix organization              6      0.01     4.01             17.8
GOTERM_BP_FAT   Extracellular structure organization           7      0.02     3.12             26.2

Annotation Cluster 3 (Enrichment Score: 1.47)
Category        Term                                           Count  P Value  Fold Enrichment  FDR
GOTERM_CC_FAT   Proteinaceous extracellular matrix             10     0.01     2.57             13.5
GOTERM_CC_FAT   Extracellular matrix                           10     0.01     2.52             15.4
GOTERM_MF_FAT   Polysaccharide binding                         5      0.02     4.33             25.5
GOTERM_MF_FAT   Pattern binding                                5      0.02     4.33             25.5

Annotation Cluster 4 (Enrichment Score: 1.15)
Category        Term                                           Count  P Value  Fold Enrichment  FDR
GOTERM_BP_FAT   DNA replication initiation                     5      0.00     8.59             1.9
GOTERM_BP_FAT   DNA-dependent DNA replication                  6      0.00     5.16             5.7
GOTERM_BP_FAT   DNA replication                                9      0.00     3.18             7.2
GOTERM_BP_FAT   Transcription                                  15     0.01     2.10             11.8
GOTERM_BP_FAT   DNA unwinding during replication               4      0.01     8.02             13.3
GOTERM_BP_FAT   DNA duplex unwinding                           4      0.01     6.87             21.0
GOTERM_BP_FAT   DNA geometric change                           4      0.01     6.87             21.0

Annotation Cluster 5 (Enrichment Score: 1.05)
Category        Term                                           Count  P Value  Fold Enrichment  FDR
GOTERM_BP_FAT   Transcription, DNA-dependent                   6      0.01     4.25             14.0
GOTERM_BP_FAT   RNA biosynthetic process                       6      0.01     4.25             14.0
GOTERM_BP_FAT   Transcription from RNA polymerase II promoter  5      0.02     4.63             24.1


3.5 Gene Set Enrichment Analyses (GSEA) for Canonical Pathway Analysis

Here, GSEA was used to study the canonical pathways associated with the global proteomic changes found in the comparison of NBE and TNBE specimens. Unlike DAVID, GSEA utilizes the complete list of 1398 proteins with their SpI values to study the biological features associated with the proteomic changes in the dataset. GSEA determines the importance of proteins and their corresponding pathways from the strength of their SpI values and the direction of their regulation, comparing them with proteins that have weaker SpI values and discordant expression changes [23].

Table 4. List of the canonical pathways found to be overrepresented in TNBE samples. The proteins that are members of cell cycle, M-G1 transition and DNA replication pre-initiation were highly enriched at the positive end of the SpI distribution.

NAME                               FDR q-val
CELL_CYCLE                         0.000
ORC1_REMOVAL_FROM_CHROMATIN        0.002
SYNTHESIS_OF_DNA                   0.002
M_G1_TRANSITION                    0.002
MITOTIC_M_M_G1_PHASES              0.001
DNA_REPLICATION_PRE_INITIATION     0.002
S_PHASE                            0.002
CELL_CYCLE_CHECKPOINTS             0.002
G1_S_TRANSITION                    0.002
CELL_CYCLE_MITOTIC                 0.014
PLATELET_DEGRANULATION             0.165

An interesting observation was that the majority of the proteins belonging to cell cycle, reactome G1-S transition and DNA replication were minichromosome maintenance (MCM) proteins. In eukaryotes, the members of the MCM family (MCM2-7) are known to function by forming a complex, and they are essential as replication initiation factors. It is important to note that Figure 4 shows significant enrichment of the MCM family proteins in TNBE samples. This is supported by similar observations from studies at the transcriptomic level [24]. The high level of MCM expression is expected to contribute to cancer cell growth, as these proteins facilitate overall replication and increase the transcription rate.

[Figure 4: three panels labeled FDR = 0%, FDR = 0.6% and FDR = 0.1%, with the minichromosome maintenance (MCM) proteins marked.]

Figure 4. Participants of cell cycle (G1-S Phases) were significantly enriched in triple negative breast cancer (TNBC) cells.

As observed in previous studies [25-27], the breast carcinoma samples showed down regulation of proteins related to the extracellular region. The canonical pathways which were found to be down regulated in TNBE are listed in Table 5. The down regulation of terms such as focal adhesion, ECM receptor interaction, and regulation of actin cytoskeleton are consistent with processes of de-differentiation and loss of cell adhesion that occur in cancer cells.


Table 5. List of the canonical pathways found to be overrepresented in NBE samples.

NAME                                          FDR q-val
TIGHT_JUNCTION                                0.045
FOCAL_ADHESION                                0.089
AXON_GUIDANCE                                 0.065
ALZHEIMERS_DISEASE                            0.080
ECM_RECEPTOR_INTERACTION                      0.073
VALINE_LEUCINE_AND_ISOLEUCINE_DEGRADATION     0.119
REGULATION_OF_ACTIN_CYTOSKELETON              0.174

The proteins belonging to focal adhesion and reactome cell junction organization were significantly down regulated in TNBE. DAVID analysis suggested GO terms such as cell adhesion and extracellular matrix associated with the proteins down regulated in TNBE, and these proteins were involved in the enriched terms such as focal adhesion and reactome cell junction organization (Fig.5).


[Figure 5: two panels labeled FDR = 0.1% and FDR = 15%.]

Figure 5. Structural molecular organization was significantly deficient in triple negative breast cancer (TNBE).

Conclusions

Application of 2-dimensional PLOT LC-MS based shotgun proteomics for comparative analysis of 10,000 LCM collected NBE and TNBE cells yielded high proteome coverage. The important features of this study included: 1) sample processing of individual samples of only 10,000 cells, 2) the possibility of a second MS measurement on the same samples, as each analysis consumed in-gel digest equivalent to only 4000 cells (not performed in this study), 3) completion of peptide level fractionation (6 salt steps) of 6 samples with subsequent PLOT LC-MS analysis in a reasonable time (12 days), and 4) high proteome coverage (4,259 proteins).

DAVID gene ontology analysis of the 114 differentially abundant proteins revealed important functional annotations such as cell adhesion, extracellular matrix, extracellular matrix organization and ECM-receptor interaction for proteins enriched in NBE, and MCM, DNA replication and DNA replication initiation for proteins enriched in TNBE. Complementary information about the canonical pathways covered by the proteins enriched in NBE and TNBE was obtained by GSEA analysis. The terms cell cycle, reactome G1-S transition and DNA replication were significantly enriched in TNBE, whereas the focal adhesion and ECM receptor interaction terms were significantly enriched in NBE. Overall, the comparative proteomic analysis using only 10,000 cells revealed substantial information about proteomic changes associated with important biological processes.

Further studies may include analysis of more pairs of NBE and TNBE samples, which should help strengthen confidence in the current findings. In addition, studies using replicate MS measurements would allow the reproducibility of the identified proteome, and any increase in proteome coverage, to be investigated.

References

[1] A. Jemal, R. Siegel, E. Ward, Y.P. Hao, J.Q. Xu, M.J. Thun, Ca-a Cancer Journal for Clinicians 59 (2009) 225-249.

[2] M. Lu, S.A. Whelan, J. He, R.E. Saxton, K.F. Faull, J.P. Whitelegge, H.R. Chang, Clin Proteomics 6 (2010) 93-103.

[3] S. Cleator, W. Heller, R.C. Coombes, Lancet Oncology 8 (2007) 235-244.

[4] S.P. Kang, M. Martel, L.N. Harris, Curr Opin Obstet Gynecol 20 (2008) 40-46.

[5] S.J. Dawson, E. Provenzano, C. Caldas, Eur J Cancer 45 Suppl 1 (2009) 27-40.

[6] B. Domazet, G.T. Maclennan, A. Lopez-Beltran, R. Montironi, L. Cheng, Int J Clin Exp Pathol 1 (2008) 475-488.

[7] R.F. Bonner, M. EmmertBuck, K. Cole, T. Pohida, R. Chuaqui, S. Goldstein, L.A. Liotta, Science 278 (1997) 1481-&.

[8] J.J. Hill, T.L. Tremblay, A. Pen, J. Li, A.C. Robotham, A.E. Lenferink, E. Wang, M. O'Connor-McCourt, J.F. Kelly, J Proteome Res 10 (2011) 2479-2493.

[9] L.F. Waanders, K. Chwalek, M. Monetti, C. Kumar, E. Lammert, M. Mann, Proc Natl Acad Sci U S A 106 (2009) 18902-18907.

[10] Y.Q. Yu, M. Gilar, J. Kaska, J.C. Gebler, Rapid Commun Mass Spectrom 19 (2005) 2331-2336.

[11] D.C. Liebler, A.J.L. Ham, Nature Methods 6 (2009) 785-785.

[12] G.H. Yue, Q.Z. Luo, J. Zhang, S.L. Wu, B.L. Karger, Anal. Chem. 79 (2007) 938-946.

[13] Q. Luo, G. Yue, G.A. Valaskovic, Y. Gu, S.L. Wu, B.L. Karger, Anal. Chem. 79 (2007) 6174-6181.

[14] Q.Z. Luo, Y. Gu, S.L. Wu, T. Rejtar, B.L. Karger, Electrophoresis 29 (2008) 1604-1611.

[15] W. Zhu, J.W. Smith, C.M. Huang, J Biomed Biotechnol 2010 (2010) 840518.

[16] P.C. Carvalho, J.S. Fischer, E.I. Chen, J.R. Yates, V.C. Barbosa, BMC Bioinformatics 9 (2008).

[17] G. Dennis, B.T. Sherman, D.A. Hosack, J. Yang, W. Gao, H.C. Lane, R.A. Lempicki, Genome Biology 4 (2003).

[18] X.J. Ma, R. Salunga, J.T. Tuggle, J. Gaudet, E. Enright, P. McQuary, T. Payette, M. Pistone, K. Stecker, B.M. Zhang, Y.X. Zhou, H. Varnholt, B. Smith, M. Gadd, E. Chatfield, J. Kessler, T.M. Baer, M.G. Erlander, D.C. Sgroi, Proc Natl Acad Sci U S A 100 (2003) 5974-5979.

[19] D.C. Sgroi, J. Gaudet, H. Varnholt, X. Ma, R. Salunga, M. Erlander, Clin Cancer Res 7 (2001) 3687s-3687s.

[20] A. Rauch, M. Bellew, J. Eng, M. Fitzgibbon, T. Holzman, P. Hussey, M. Igra, B. Maclean, C.W. Lin, A. Detter, R.H. Fang, V. Faca, P. Gafken, H.D. Zhang, J. Whitaker, D. States, S. Hanash, A. Paulovich, M.W. McIntosh, J. Proteome Res. 5 (2006) 112-121.

[21] D.W. Huang, B.T. Sherman, R.A. Lempicki, Nature Protocols 4 (2009) 44-57.

[22] X. Fu, S.A. Gharib, P.S. Green, M.L. Aitken, D.A. Frazer, D.R. Park, T. Vaisar, J.W. Heinecke, J Proteome Res 7 (2008) 845-854.

[23] S. Cha, M.B. Imielinski, T. Rejtar, E.A. Richardson, D. Thakur, D.C. Sgroi, B.L. Karger, Molecular & Cellular Proteomics 9 (2010) 2529-2544.

[24] R.M. Neve, K. Chin, J. Fridlyand, J. Yeh, F.L. Baehner, T. Fevr, L. Clark, N. Bayani, J.P. Coppe, F. Tong, T. Speed, P.T. Spellman, S. DeVries, A. Lapuk, N.J. Wang, W.L. Kuo, J.L. Stilwell, D. Pinkel, D.G. Albertson, F.M. Waldman, F. McCormick, R.B. Dickson, M.D. Johnson, M. Lippman, S. Ethier, A. Gazdar, J.W. Gray, Cancer Cell 10 (2006) 515-527.

[25] L.M. Bergstraesser, G. Srinivasan, J.C.R. Jones, S. Stahl, S.A. Weitzman, Am J Pathol 147 (1995) 1823-1839.

[26] A.J. D'Ardenne, P.I. Richman, M.A. Horton, A.E. McAulay, S. Jordan, J Pathol 165 (1991) 213-220.

[27] P.G. Natali, M.R. Nicotra, C. Botti, M. Mottolese, A. Bigotti, O. Segatto, Br J Cancer 66 (1992) 318-322.


Chapter 4: Characterization of the Intact α-Subunit of Recombinant Human Chorionic Gonadotropin Glycoforms by High Resolution CE-FT-MS*

*Reproduced with permission from Analytical Chemistry, 2009, 81 (21), 8900-8907. Copyright 2009 American Chemical Society.


Abstract

With the rapid growth of complex heterogeneous biological molecules entering the therapeutic market, effective techniques that are capable of comprehensively, rapidly and efficiently characterizing biologics are essential to ensure the desired product characteristics. To address this need, we have developed a rapid method for analysis of intact glycoproteins based on high resolution capillary electrophoretic separation coupled to a high mass resolution FT mass spectrometer. We evaluated the performance of this method on the alpha subunit of mouse cell line-derived recombinant human chorionic gonadotrophin (r-αhCG), a doubly glycosylated protein which is part of the clinically-relevant gonadotrophin family. Analysis of r-αhCG using capillary electrophoresis (CE) with a separation time under 20 minutes enables the identification of over 60 different forms with up to nine sialic acids. This high resolution allowed separation and analysis of not only intact glycoforms with different numbers of sialic acids but also glycoforms differing by the number and extent of neutral monosaccharides. The high mass resolution of the FT-MS enables targeting a reduced mass range for analysis of the protein isoforms, simplifying analysis without sacrificing accuracy. In addition, this analysis strategy results in an accelerated scan speed which positively impacts the reproducibility of isoform relative quantification.

Furthermore, the intact glycoprotein analysis is complemented with the analysis of glycopeptides and glycans to enable the assignment of glycans to individual glycosylation sites and to achieve a comprehensive characterization of r-αhCG.

Samples of r-αhCG obtained from different cell lines were also analyzed and compared. Taken together, the results presented here suggest that the CE-FT-MS method described could be useful for final product characterization as well as for in-process monitoring and thus could be of value in ensuring controlled manufacturing of clinically-relevant therapeutics.

4.1 Introduction

Biotherapeutics, a class of pharmaceuticals which is primarily comprised of glycoproteins, are used in a wide variety of clinical indications, including inflammatory and immunomodulatory disorders, such as rheumatoid arthritis, multiple sclerosis, and cancer. Because this class of therapeutics can affect targets that are not easily modulated by traditional small molecule approaches, many biotherapeutics are currently in various stages of development, including for hard-to-treat diseases such as Alzheimer's disease. Because of their clinical success, the absolute number of biotherapeutics available as medicinal products and their relative importance in treating many diseases are expected to increase dramatically in the coming years.

Unlike typical small molecules, biotherapeutics, especially glycoproteins, are difficult to characterize structurally, to manufacture and to control. First, glycoprotein pharmaceutical agents are significantly more structurally complex than small molecules. While a typical small molecule therapeutic is comprised of ≤ 50 atoms, a glycoprotein therapeutic is comprised of thousands of atoms. Second, and more importantly, while the active pharmaceutical ingredient in a typical small molecule therapeutic is one or, at most, several forms, a glycoprotein therapeutic can be a mixture of tens, hundreds or even thousands of individual isoforms. This diversity is mostly introduced by various post-translational modifications, such as phosphorylation and glycosylation, as well as process modifications to the peptide backbone such as methionine oxidation and deamidation.

The complexity inherent in biotherapeutics has generated increasing demand for the development of sophisticated analytical technologies that allow for the detailed characterization of the various components within the mixture. Development of such analytical tools is expected to impact a variety of areas in the manufacture and control of biotherapeutics. First, such analytical tools can provide the scientific foundation to develop quality control tests for batch-to-batch release. Second, such tools can prove valuable both in process development as well as in the comparison of a product produced after a manufacturing change to a reference product. Finally, high resolution analysis can lead to in-process controls which can be used to assess process integrity prior to end product characterization.

Among the different types of possible modifications, glycosylation introduces the highest diversity to a protein and contributes to the three dimensional structure, activity, biodistribution, and side effect profile of a biologic therapeutic. As an example, recent evidence indicates that IgE-mediated reactions to the glycoprotein therapeutic cetuximab (Erbitux®) are due to a particular glycoform or set of glycoforms on the protein which contain the galactose-α-1,3-galactose epitope[1]. Humans are known to contain circulating antibodies against galactose-α-1,3-galactose epitopes which are activated in the presence of the complement cascade, and therefore, immune reactions to this epitope have been well-documented in patients. Thus, the need to control and monitor glycosylation, and particular glycoforms (such as those bearing galactose-α-1,3-galactose), has generated a clear demand for robust methods that enable the rapid monitoring of the distribution of glycoprotein glycoforms[2,3].

Analysis of intact glycoproteins provides a survey of the individual isoforms present in the sample. The analysis of intact glycoproteins is typically performed using isoelectric focusing[4]. Since the introduction, in the late 1980s, of capillary electrophoresis coupled to mass spectrometry (CE-MS), advances in electrophoretic separations and the introduction of accurate high mass resolution mass spectrometers have greatly improved the characterization of complex biological molecules including intact glycoproteins[5]. In the case of glycoform analysis, the abundance of individual glycoforms or combinations of glycosylation with other modifications can be assessed.

Though it is typically possible to determine the glycan mass composition of individual glycoforms, analysis of the intact glycoprotein does not allow the determination of structures of individual glycans. Given the nature of the information arising from glycoform analysis, high resolution separations coupled to high resolution MS have the potential to characterize individual intact isoforms of glycoproteins. In addition, CE-MS allows the rapid characterization of glycoprotein properties without the need for laborious sample preparation [6-9]. For example, Neusüß and coworkers recently characterized different isoforms of erythropoietin, including oxidized and acetylated variants of the glycoforms, using CE coupled to a qTOF MS[10].

To move beyond compositions and to determine glycan structures, analysis of released glycans is preferred. Various techniques have been applied for the analysis of labeled glycans including normal phase HPLC[11] or capillary electrophoresis[12].

These separation techniques are suitable for quantitative analysis and can be performed in combination with exoglycosidase digestion leading to determination of exact linkages[13]. In addition, mass spectrometry can be applied directly on a mixture of glycans[14] or in conjunction with separation[15]. In the case of a glycoprotein with multiple glycosylation sites, the assignment of glycans to individual glycosylation sites is typically performed by analysis of glycopeptides after proteolytic digestion of the glycoprotein using, for example, tryptic digestion followed by LC-MS[16].

We sought to build upon these studies to (1) develop a robust and reproducible CE-FT MS methodology that offers higher mass resolution than the qTOF MS while maintaining high throughput analysis; (2) combine the accuracy and resolution of this technology with bottom-up approaches, including glycan and glycopeptide analysis; and (3) assess the sensitivity of the technology to determine differences between materials derived from different sources. For our studies, we used the alpha subunit of recombinant human chorionic gonadotrophin (r-αhCG) obtained from different cell lines as a representative example of a challenging glycoprotein therapeutic. The gonadotrophins are a family of four structurally-related hormones, comprising follicle stimulating hormone, chorionic gonadotrophin, luteinizing hormone, and thyroid-stimulating hormone, all of which are therapeutically important. This family of hormones is characterized by non-covalent heterodimeric structures composed of alpha and beta subunits. r-αhCG forms a core component of the gonadotrophin family, and thus, techniques to characterize r-αhCG are of pharmaceutical interest. In addition, r-αhCG has two glycosylation sites with a heterogeneous glycan population, thereby providing an analytical challenge that is both pertinent and representative of the issues facing biotherapeutics as a whole.

4.2 Experimental

4.2.1 Recombinant hCG

Recombinant αhCG, expressed in a mouse cell line, was obtained from Sigma-Aldrich (St. Louis, MO). The protein was dissolved in 50 mM ammonium bicarbonate (final protein concentration 4 μg/μL), aliquoted and stored at -80 °C until needed for analysis. Recombinant αhCG, expressed in a CHO cell line, was obtained from Feldan Bio (Hamilton, NJ). Due to the large amount of sucrose and phosphoric acid in the latter sample, the protein required purification. A 50 μg sample of r-αhCG from the CHO cell line was dissolved in 200 μL of water and desalted using a Microcon Ultracel YM-3 centrifugal filter device (Millipore, Billerica, MA). The filter was washed twice with water, followed by 50 mM ammonium bicarbonate solution. The solution was concentrated to a final volume of 20 μL, leading to an estimated concentration of 2.5 μg/μL, and stored at -80 °C until needed for analysis.

4.2.2 Chemicals

Glacial acetic acid was obtained from Acros Organics (Morris Plains, NJ). HPLC grade acetonitrile and water were purchased from Thermo Fisher (Fair Lawn, NJ). Trypsin (sequencing grade, modified) was from Promega (Madison, WI), and urea, ammonium bicarbonate, ammonium acetate, dithiothreitol and iodoacetamide were from Sigma-Aldrich. Peptide N-glycosidase F (PNGase F) was obtained from New England Biolabs. β(1-3,4,6)-Galactosidase, Sialidase A, β(1-2,3,4,6)-N-Acetylhexosaminidase and α(1-3,4,6)-Galactosidase were obtained from Prozyme (San Leandro, CA).

4.2.3 CE-MS System

Figure 1A. Diagram of CE-MS system for analysis of intact glycoproteins. a) BGE reservoir; b) separation capillary (15 cm, 50 μm i.d. PVA coated fused silica); c) liquid junction; d) ESI interface with metal tip, 50 μm i.d.; e) pressure valve and manometer; f) syringe for replacement of ESI solution. Labels in the diagram: N2 flow; 50 MΩ; ESI +2 kV; +15 kV; 0 V; 300 nL/min; LTQ-FT.

The CE-MS system, schematically shown in Figure 1A, consisted of a 20 cm long, 50 μm i.d., 360 μm o.d. PVA coated separation capillary (Agilent, Santa Clara, CA) attached to a pressurized liquid junction interface. The reservoir, containing background electrolyte at the injection end of the capillary, was fitted with an HPLC PEEK union (Upchurch, Oak Harbor, WA) to allow sample injection while providing an airtight connection during separation. The second reservoir was connected to one arm of a liquid junction cross made from a polypropylene block. The additional three arms (Figure 1A) were connected to (i) the separation capillary, (ii) a syringe to allow rapid flushing of the intercross volume and (iii) a 3.5 cm long, 50 μm i.d., 280 μm o.d. stainless steel ESI needle (New Objective, Woburn, MA). All machined plastic parts of the CE system that came in contact with liquids were made of highly chemical resistant polypropylene to prevent leaching. Reservoirs, made airtight using silicone O-rings, were connected to a chamber pressurized with nitrogen from a gas cylinder. A needle valve was used to control the pressure, which was monitored using a digital manometer (model DM8215, MSC Direct, Melville, NY). In a typical experiment, the pressure was maintained at 10 cm of H2O (~1 kPa), resulting in a flow rate in the ESI tip of roughly 200 nL/min. Since both reservoirs were maintained at the same pressure, there was no hydrodynamic flow in the separation capillary. The background electrolyte reservoir was equipped with a platinum electrode while the ESI voltage was applied directly to the ESI metal tip. To provide independent control of separation and ESI voltage, two high voltage power supplies (CZE1000A, Spellman, Hauppauge, NY) were employed. A 50 MΩ resistor was added to the ESI electrical circuit (Figure 1A) to provide sufficient current drain to maintain the ESI voltage constant.

The CE-MS system was coupled to an LTQ-FT MS (Thermo Fisher, San Jose, CA) using a PicoView interface (New Objective) that allowed precise positioning of the ESI tip in front of the MS orifice (Figure 1B). The LTQ-FT MS was operated in the MS only mode using a combination of a low resolution LTQ MS scan, followed by a high resolution FT scan. The tuning of the FT MS was performed using the 8+ charge state ion (m/z = 1788) of an acidified lysozyme solution infused under static nanospray conditions. The target number of ions in the FT cell was set at 10^6, and the mass resolution of the FT MS was estimated to be roughly 55,000 at m/z = 1800.

Intact r-αhCG was separated using 2% acetic acid (pH = 2.5) as the background electrolyte and 8 kV as the separation voltage. Samples were injected hydrodynamically with a height difference of 10 cm for 30 seconds. The injected amount was estimated to be roughly 100 ng, i.e. 5 pmol of the intact protein (4 μg/μL). The ESI solution consisted of 2% (v/v) acetic acid in a 20% (v/v) aqueous solution of acetonitrile.
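As a rough plausibility check of the injected amount, the hydrodynamic injection can be approximated with the Hagen-Poiseuille equation. The following is a minimal sketch and not part of the original method: it assumes that the sample has the viscosity and density of water (1.0 mPa·s, 1000 kg/m³), uses the 20 cm capillary length quoted in the text, and neglects any flow resistance outside the separation capillary. The function and parameter names are illustrative.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def injected_volume_nl(height_m, time_s, inner_diameter_m, length_m,
                       viscosity_pa_s=1.0e-3, density_kg_m3=1000.0):
    """Estimate the injected plug volume (nL) for siphoning injection.

    Hagen-Poiseuille: V = dP * pi * d^4 * t / (128 * eta * L),
    with dP = rho * g * h. Water-like viscosity and density are assumed.
    """
    delta_p = density_kg_m3 * G * height_m  # pressure difference, Pa
    volume_m3 = (delta_p * math.pi * inner_diameter_m ** 4 * time_s
                 / (128.0 * viscosity_pa_s * length_m))
    return volume_m3 * 1e12  # 1 m^3 = 1e12 nL


# Conditions taken from the text: 10 cm height difference, 30 s,
# 50 um i.d. capillary, 20 cm long, sample at 4 ug/uL (= 4 ng/nL).
volume_nl = injected_volume_nl(0.10, 30.0, 50e-6, 0.20)
mass_ng = volume_nl * 4.0

print(f"plug volume ~ {volume_nl:.1f} nL, injected mass ~ {mass_ng:.0f} ng")
```

Under these assumptions the plug is a little over 20 nL, i.e. on the order of 90 ng at 4 μg/μL, which is in line with the roughly 100 ng estimated above.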


Figure 1B. Photograph of CE system coupled to LTQ-FTMS for analysis of intact glycoproteins.


4.2.4 Deglycosylation and Analysis of Released Glycans

For glycan structural analysis, r-αhCG was deglycosylated using PNGase F. The released N-glycans were purified from the digestion buffer and protein using a porous graphitized carbon (PGC) solid phase extraction cartridge (Thermo Fisher). The glycan pool was then labeled with 2-aminobenzamide (2-AB). The fluorescently labeled glycans were purified from the reaction mixture using a GlycoClean G cartridge (Prozyme) by loading and washing the glycans with 96% acetonitrile. The glycans were eluted from the cartridge using LC-MS grade water prior to analysis by LC-MS/MS on an LTQ ion trap mass spectrometer (Thermo Fisher) using normal phase LC (Amide 80 column, 2 x 250 mm, Tosoh Biosciences). The elution profile was monitored using fluorescence detection with excitation and emission wavelengths of 330 nm and 420 nm, respectively. Glycans were fragmented by low energy CID using a normalized collision energy of 35. The LTQ mass spectrometer was operated in “triple play” mode with one full MS scan followed by 5 ultra zoom scans and 5 MS/MS scans.

The glycan composition was determined based on the intact mass from the full MS scan. Glycan structures were then further interrogated using low energy collision induced dissociation (CID). Fragments were assigned using the nomenclature proposed by Domon and Costello[17]. Glycan fragments were identified through GlycoWorkBench[18]. The proposed structures were further confirmed using an exoglycosidase enzyme array designed to confirm the nature of the non-reducing end structures. In short, the glycan pool was split into two aliquots and incubated overnight with an array of exoglycosidase enzymes; both aliquots were treated with β(1-3,4,6)-Galactosidase (Prozyme), Sialidase A (Prozyme) and β(1-2,3,4,6)-N-Acetylhexosaminidase (Prozyme). To determine the nature of the non-reducing end structures, one aliquot was incubated with α(1-3,4,6)-Galactosidase (Prozyme), which is specific for terminal α-galactose capping structures. The utility of the exoglycosidase enzyme array for N-glycan structural elucidation is well documented[13].

4.2.5 Trypsin Digestion of r-αhCG Expressed in a Murine Cell Line

r-αhCG obtained from Sigma-Aldrich (20 μg aliquot) was denatured in 100 μL of 8 M urea and 50 mM ammonium bicarbonate for 1 h at 37 °C. The protein was reduced by 20 mM dithiothreitol for 3 h at 37 °C. Subsequently, the protein was alkylated by 40 mM iodoacetamide for 90 minutes at room temperature. The solution was next transferred to a Microcon Ultracel YM-3 centrifugal filter device (Millipore) to remove urea. Trypsin (Promega, Madison, WI) was added to the protein solution (protein to enzyme ratio 20:1), and digestion was allowed to proceed for 12 h at 37 °C before stopping the reaction by addition of 0.1% formic acid.

4.2.6 LC-MS Analysis of hCG Tryptic Digest

The tryptic digest of r-αhCG was separated using a 75 μm i.d., 15 cm long column packed with Magic C18AQ, 3 μm, obtained from Michrom Bioresources (Auburn, CA) with mobile phase A as 0.1% (v/v) formic acid in water and mobile phase B as 0.1% (v/v) formic acid, 90% (v/v) acetonitrile in water. A linear gradient of mobile phase B from 5% to 80% in 90 min was used to elute the peptides. Both MS and MS/MS data were acquired on an LTQ MS (Thermo Fisher) using data dependent acquisition with every MS scan followed by MS/MS spectra for up to 8 precursors with dynamic exclusion set to 30 sec. Database searching was performed using Sequest within the Bioworks Browser ver. 3.3.1 (Thermo Fisher) with the Swiss Prot mouse protein database (ver. 55.1) appended with the sequence of r-αhCG. Tryptic digests of both native and deglycosylated r-αhCG were analyzed to facilitate identification of the chromatographic peaks of glycopeptides. MS/MS spectra of glycopeptides were interpreted manually.

4.2.7 Data Analysis

Accurate masses of intact glycoforms from CE-MS analysis were calculated as follows: First, the r-αhCG sequence was converted to its elemental composition and corrected for the presence of 5 disulfide bridges, i.e. subtraction of 10 hydrogens. Then, masses of glycoproteins were generated by adding the elemental composition of individual glycans, followed by calculation of the isotopic distribution using Protein Prospector [19]. The mass of the most intense isotope in the isotopic cluster was used to confirm the correct assignment of a particular structure to the experimental mass with a mass tolerance of ±50 ppm. Of note, average rather than monoisotopic masses are reported throughout the manuscript. A set of in-house developed Perl scripts was used to assign glycan compositions, match glycans to intact protein glycoforms and calculate theoretical combinations of glycans corresponding to a particular intact protein glycoform. Since glycan structures were not determined in this work, glycans are referred to as: HexNAc, Hexose, Sialic acid (NeuAc) and sulfated glycan (+SO4).
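The mass bookkeeping described above can be illustrated with a short calculation. This is a minimal Python sketch under stated assumptions, not the in-house Perl scripts: it works directly with average residue masses rather than elemental compositions and isotopic distributions, and it takes the theoretical average mass of the deglycosylated protein (10196 Da, the value quoted in Section 4.3.1) as the backbone mass. The function names are hypothetical.

```python
# Average masses (Da) of monosaccharide residues as incorporated in a glycan
RESIDUE_AVG = {"HexNAc": 203.1925, "Hex": 162.1406, "NeuAc": 291.2546}

# Assumption: theoretical average mass of the deglycosylated protein,
# rounded to the nearest Da as quoted in the text.
BACKBONE_AVG = 10196.0


def glycoform_mass(hexnac, hexose, neuac, backbone=BACKBONE_AVG):
    """Average mass of an intact glycoform for a summed glycan composition."""
    return (backbone
            + hexnac * RESIDUE_AVG["HexNAc"]
            + hexose * RESIDUE_AVG["Hex"]
            + neuac * RESIDUE_AVG["NeuAc"])


def ppm_error(measured, theoretical):
    """Relative mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def within_tolerance(measured, theoretical, tol_ppm=50.0):
    """True if the measured mass matches within the stated +/-50 ppm window."""
    return abs(ppm_error(measured, theoretical)) <= tol_ppm


# Summed composition HexNAc9 Hex14 NeuAc2 (glycoform A in Table 1)
mass_a = glycoform_mass(9, 14, 2)
print(f"{mass_a:.1f} Da")
```

With these inputs the compositions (9,14,2) and (9,13,3) give masses that round to 14877 and 15006 Da, matching the first two entries of Table 1. Because the backbone mass is rounded to the nearest dalton here, the ppm check is illustrative only.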

Abundances of individual intact protein glycoforms were estimated from the data acquired using the linear ion trap with Qualbrowser (Thermo Fisher) based on peak areas derived from extracted ion electropherograms on the 9+ charge states of calculated m/z of the average mass with the mass tolerance of ±0.5 Da. Similarly, abundances of glycopeptides were determined using the same method but with average masses of the 3+ and 4+ charge states.


4.3 Results and Discussion

4.3.1 Intact Protein Analysis

A CE-MS system was constructed, in which the CE system was coupled to a high mass resolution LTQ-FT MS using a pressurized liquid junction interface (Figure 1). Such an interface provided rapid analysis, preserved the high resolution of the CE and allowed independent tuning of separation and ESI conditions. The high mass resolution of the FT MS allowed direct determination of the ion charge states and accurate mass measurement of the intact protein glycoforms. To provide high resolution and reproducible separation, interaction of analytes with the capillary wall was prevented by using a capillary coated with polyvinyl alcohol (PVA); this permanent neutral coating eliminated the electroosmotic flow and, unlike dynamic coatings, did not require regeneration.

In preliminary experiments, using an LTQ MS in an m/z range of 1000-3000 with an ESI spraying solution of 2% acetic acid in 20% ACN, it was found that most r-αhCG protein isoforms were observed in the 8+, 9+ and 10+ charge states, with the 9+ charge state displaying the highest intensity. Based on the results from these initial experiments, the m/z range was restricted to 1400-2000 upon transfer of the method to an LTQ-FT MS. Because of the high resolving power of the FT MS, there was no loss of information since accurate intact protein mass could be determined without the requirement to observe multiple charge states. Additionally, restricting the m/z window had a number of benefits, including increased acquisition speed and increased sensitivity of analysis.
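The effect of the restricted m/z window can be checked with a small calculation of which charge states of a given glycoform fall between m/z 1400 and 2000. The sketch below is illustrative only; it assumes simple protonation, m/z = (M + z·1.00728)/z, and uses the lowest and highest average masses listed in Table 1 as examples. The function names are hypothetical.

```python
PROTON = 1.00728  # mass of a proton, Da


def mz(mass, charge):
    """m/z of an [M + zH]z+ ion for a neutral mass M."""
    return (mass + charge * PROTON) / charge


def charge_states_in_window(mass, low=1400.0, high=2000.0,
                            charges=range(5, 15)):
    """Charge states whose m/z falls inside the acquisition window."""
    return [z for z in charges if low <= mz(mass, z) <= high]


# Lightest and heaviest glycoforms listed in Table 1 (average masses, Da)
for mass in (14877.0, 16976.0):
    states = charge_states_in_window(mass)
    print(mass, [(z, round(mz(mass, z), 1)) for z in states])
```

For both masses the 9+ and 10+ charge states lie inside the 1400-2000 window, which is consistent with the use of the dominant 9+ charge state for assignment and quantitation.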


Figure 2. Illustration of the separation resolution of CE-MS analysis of intact r-αhCG derived from a murine cell line. (A) Total ion current; (B-D) selected extracted ion electropherograms, see text for more details.

Figure 2 shows the total ion electropherogram for r-αhCG derived from the murine cell line as well as examples of extracted ion electropherograms (EIE) for three different glycoforms. A background electrolyte consisting of 2% acetic acid was found to lead to high resolution by CE, allowing a broad separation window with an analysis time of under 20 minutes. The electrophoretic peak widths for EIE at half-height were only 12 sec, further demonstrating the resolving power of CE (approximately 100,000 theoretical plates).

Figure 3A. CE-MS separation of r-αhCG produced in a murine cell line. Annotation of mass differences between high intensity glycoforms is indicated. High intensity glycoforms are annotated with letters (Table 1) and average molecular weight.

Figure 3A presents the overall CE-MS separation pattern for r-αhCG as a heat map, demonstrating the high number of glycoforms that can be observed. The twenty most intense glycoforms are labeled with their corresponding masses, and the differences in glycan compositions are highlighted using color-coded arrows. Of note is the fact that, due to the broad dynamic range of abundances for individual glycoforms, only the most abundant forms can be clearly observed.

Figure 3B. CE-MS separation of r-αhCG produced in a murine cell line. Log-intensity plot of the same separation shown in Figure 3A. Numbers indicate the number of sialic acids. The dotted lines show approximate boundaries for the 8+, 9+ and 10+ charge states.

Figure 3B shows the same separation plotted on a log intensity scale, which further reveals the complexity of r-αhCG glycoforms. The individual bands corresponding to the glycoforms with the same number of sialic acids are connected with a dashed line.

The figure also shows the distribution of glycoforms in different charge states.

Mass measurement by FT MS allowed direct determination of the exact molecular mass of observed glycoforms. Table 1 summarizes the masses of the 20 most abundant peaks, labeled in Figure 3A, which correspond to the major glycoforms. Table 1 also provides a summary of the glycan compositions for each identified glycoform. To accurately assign glycan compositions to individual glycoforms, PNGase F deglycosylated r-αhCG was analyzed by CE-MS, and it was found that the experimental mass of the deglycosylated protein was similar to the theoretical value (MW = 10196) within expected mass error. Also of note is that these compositions correspond to the sum of glycan compositions on the two glycosylation sites.
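Because each intact-protein composition is the sum over two glycosylation sites, a given glycoform can correspond to several pairs of individual glycans. The enumeration below is a minimal Python sketch of that idea and is not the in-house Perl code: it uses the first fifteen N-glycan compositions of Table 2 as the candidate list, treats the two sites as interchangeable, and ignores sulfation. The names GLYCANS and glycan_pairs are hypothetical.

```python
from itertools import combinations_with_replacement

# (HexNAc, Hex, NeuAc) compositions of the first fifteen entries of Table 2
GLYCANS = [
    (4, 5, 2), (4, 6, 1), (6, 9, 2), (5, 8, 1), (5, 6, 3),
    (4, 5, 1), (6, 8, 3), (5, 7, 2), (5, 8, 0), (4, 6, 0),
    (6, 7, 4), (5, 6, 2), (6, 8, 2), (6, 7, 3), (6, 9, 1),
]


def glycan_pairs(target, glycans=GLYCANS):
    """Unordered pairs of glycans whose compositions sum to the target."""
    target = tuple(target)
    return [(a, b)
            for a, b in combinations_with_replacement(glycans, 2)
            if tuple(x + y for x, y in zip(a, b)) == target]


# Glycoform B of Table 1 has the summed composition HexNAc9 Hex13 NeuAc3
for first, second in glycan_pairs((9, 13, 3)):
    print(first, "+", second)
```

With this candidate list, glycoform B can be explained by two different glycan pairs, which illustrates why intact mass alone cannot assign glycans to individual sites and why the glycopeptide analysis is needed.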

Upon comparison of the calculated glycan compositions with the separation in Figure 3, it can be seen that there is a clear separation pattern reflecting the number of sialic acids per isoform. As expected, the additional negative charges provided by sialic acids lead to a significant change in mobility of individual glycoforms. For example, peaks labeled g and h in Fig. 3A, differing by an addition of a sialic acid and subtraction of a Hexose, i.e. ΔM = 129 Da, were baseline separated by CE, see Figures 2B and D. Moreover, forms with the same number of sialic acids, such as peaks g and l in Figure 3A with ΔM = 527 Da, corresponding to the addition of two Hexoses and one HexNAc, could be resolved as well, see Fig 2B and C.


4.3.2 Repeatability of the Intact Protein Separation

Consistency of analysis is an essential requirement for potential applications of this technique. We focused on evaluating the consistency of the relative abundance of individual isoforms, since individual forms could be easily matched based on their masses. To this end, we compared peak areas for selected glycoforms, determined from the extracted ion electropherograms of replicate runs. It should be noted that, due to potential differences in ionization efficiencies of individual glycoforms, the relative abundances do not generally correspond to in-solution abundances of glycoforms. In addition, the distribution of the electrospray charge states for different glycoforms may change with mass and/or the number of sialic acids, further complicating quantitative analysis. Finally, only repeatability, i.e. replicate analysis of the same sample by the same operator on the same instrument and the same day, rather than reproducibility, was evaluated.

The separation of the mouse derived r-αhCG obtained from Sigma was repeated 3 times, and peak areas of the 20 most intense peaks acquired using the linear ion trap were measured and compared, see Experimental Section for details. Since the samples were manually introduced using hydrodynamic injection, only a relative comparison of run-to-run reproducibility was performed, and the results are summarized in Table 1.


Table 1. Repeatability of peak area measurements for 20 glycoforms on r-αhCG (n=3).

Symbol^a  Average mass  Composition (HexNAc, Hex, NeuAc)  Rel. mig.^b  Rel. area 1  Rel. area 2  Rel. area 3  CV
A         14877         9,14,2                            1.00         54.1%        54.2%        61.5%        7.4%
B^c       15006         9,13,3                            1.09         100.0%       100.0%       100.0%       --
C         15135         9,12,4                            1.19         59.8%        56.6%        57.8%        2.8%
D         15265         9,11,5                            1.31         17.4%        17.2%        17.8%        1.6%
E         15405         10,16,2                           1.02         12.8%        12.8%        13.0%        1.0%
F         15534         10,15,3                           1.12         56.1%        57.8%        52.6%        4.8%
G         15663         10,14,4                           1.22         93.8%        86.9%        86.9%        4.5%
H         15792         10,13,5                           1.34         56.3%        57.8%        58.2%        1.7%
I         15899         11,16,3                           1.14         12.5%        11.6%        13.0%        5.5%
J         15921         10,12,6                           1.47         21.5%        18.7%        21.0%        7.4%
K         16028         11,15,4                           1.25         17.3%        16.4%        15.7%        4.7%
L         16190         11,16,4                           1.27         29.2%        26.8%        22.5%        11.0%
M         16319         11,15,5                           1.39         46.1%        44.9%        42.4%        4.3%
N         16449         11,14,6                           1.53         30.3%        30.9%        30.5%        1.0%
O         16578         11,13,7                           1.70         11.4%        12.5%        11.6%        4.8%
P         16427         12,18,3                           1.16         8.3%         7.0%         8.1%         8.7%
Q         16556         12,17,4                           1.27         13.6%        14.0%        13.6%        1.7%
R         16685         12,16,5                           1.40         16.8%        17.0%        15.9%        3.5%
S         16847         12,17,5                           1.41         12.4%        11.8%        11.4%        4.0%
T         16976         12,16,6                           1.55         15.6%        15.5%        14.3%        4.7%

The table shows glycoform mass, glycan composition, relative migration and peak areas normalized to that of the most intense glycoform. It can be seen that the coefficient of variation (CV) of the relative peak areas is less than 10% for all but one glycoform. It is expected that improvements in sample injection and automation of operation would increase the precision of analysis. In addition, inclusion of internal standards would further improve quantitation.
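The CV values in Table 1 can be approximated from the three normalized peak areas of each glycoform. The following is a small illustrative sketch, assuming that the CV is the sample standard deviation (n - 1 in the denominator) divided by the mean; because the table entries are rounded, recomputed values may differ slightly from those printed. The dictionary below contains only a few rows of Table 1 as examples.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation in percent (sample standard deviation / mean)."""
    return 100.0 * stdev(values) / mean(values)


# Relative peak areas (%) from three replicate runs, taken from Table 1
rel_areas = {
    "A": [54.1, 54.2, 61.5],
    "C": [59.8, 56.6, 57.8],
    "G": [93.8, 86.9, 86.9],
    "H": [56.3, 57.8, 58.2],
}

for symbol, areas in rel_areas.items():
    print(f"{symbol}: CV = {cv_percent(areas):.1f}%")
```

For these four glycoforms the recomputed values agree with the tabulated CVs to within about 0.1 percentage points.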

It was estimated that the 20 glycoforms used for the evaluation of repeatability account for over 90% of the total amount of the glycoprotein. The abundances of the additional roughly 40 glycoforms, representing the remaining approximately 10% of the total glycoprotein amount, were not determined. However, if the abundances of these isoforms are distributed evenly, the average abundance of each of these forms should be less than 1% of the mass of the total glycoprotein. To examine in greater detail the distribution of the less abundant forms, we completed analysis of the glycopeptides obtained upon protease treatment of r-αhCG, as well as of the glycan structures after enzymatic release.


4.3.3 Analysis of the Released Glycans

Previous analyses of recombinant hCG expressed in CHO cells have revealed a fairly simple N-glycosylation profile[20]. In contrast, normal phase separation and analysis of the 2-AB labeled N-glycan pool revealed the glycosylation complexity of r-αhCG expressed in a murine cell line (Figure 4A).

The compositions of many species were tentatively assigned as those commonly found in complex-type glycans, of the form HexNAc(N)Hex(N+1) with variable numbers of NeuAc. For example, a glycan with an observed m/z of 1026.9 [M+2H]2+ and a neutral mass of 2051.8 suggested a composition of HexNAc4Hex5NeuAc1. Low energy CID fragmentation of this species confirmed the structure of this glycan as a typical biantennary structure (Figure 4B). Many of the species required detailed structural studies by MS/MS as well as exoglycosidase digestion analysis. The most abundant of these glycans had an observed m/z of 1108.2 [M+2H]2+ and composition of HexNAc4Hex6NeuAc1.
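The composition assignment from the neutral mass can be illustrated with a short calculation. This is a hedged sketch rather than the procedure actually used: it assumes monoisotopic residue masses and that the reported neutral masses include the 2-AB label (a net addition of 120.069 Da after reductive amination), an assumption that is consistent with the values quoted above. The function names are hypothetical.

```python
# Monoisotopic residue masses (Da)
RESIDUE_MONO = {"HexNAc": 203.0794, "Hex": 162.0528, "NeuAc": 291.0954}
WATER = 18.0106    # water added for the free reducing-end glycan
TWO_AB = 120.0688  # net mass added by 2-aminobenzamide labeling (assumed present)
PROTON = 1.00728


def labeled_glycan_mass(hexnac, hexose, neuac):
    """Monoisotopic neutral mass of a 2-AB labeled glycan."""
    return (hexnac * RESIDUE_MONO["HexNAc"]
            + hexose * RESIDUE_MONO["Hex"]
            + neuac * RESIDUE_MONO["NeuAc"]
            + WATER + TWO_AB)


def neutral_from_mz(mz_value, charge):
    """Neutral mass from the m/z of an [M + zH]z+ ion."""
    return charge * (mz_value - PROTON)


observed = neutral_from_mz(1026.9, 2)          # ion quoted in the text
theoretical = labeled_glycan_mass(4, 5, 1)     # HexNAc4Hex5NeuAc1
print(f"observed {observed:.1f} Da, theoretical {theoretical:.2f} Da")
```

For the ion at m/z 1026.9 the observed neutral mass is about 2051.8 Da, and the theoretical mass of 2-AB labeled HexNAc4Hex5NeuAc1 is about 2051.76 Da.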

Figure 4: Chromatograms and fragmentation spectra of glycan analysis. (A) Normal phase separation with fluorescence detection for 2-AB labeled N-glycans derived from the α subunit of rhCG. Over 20 chromatographic peaks were resolved during the 110 minute run. N-glycan compositions were assigned based on the neutral mass of each species; the structure of the N-glycans was refined based on MS/MS fragmentation. (B) MS/MS fragmentation of a common N-glycan species as a representative example. N-glycan fragmentation is characterized by cleavage between the antennary N-acetylglucosamine and the trimannosyl chitobiose core. Complementary B/Y and B'/Y' ions (Y4β/B3β and Y4α/B3α at m/z 366.13/1687.47 and 657.19/1396.41) are the most prominent ions formed under these conditions and allow for structural characterization of this species. Of note is the fact that additional assignments exist for many of the fragments observed; fragments were assigned the most likely structure based on the minimum number of bond cleavages.


Low energy CID of this species suggested the presence of a Gal-α-Gal disaccharide as a terminal non-reducing capping structure (Figure 5b).

Figure 5: LC/MS/MS analysis of sulfated and α-galactose containing N-glycans. (a) Fragmentation analysis of an N-glycan species containing a sulfate with neutral mass of 2423.4 Da. Abundant B-ions are observed which help to confirm the antennary structure at m/z 657.27, corresponding to the sialylated lactosamine antenna, and at 737.38, corresponding to the sulfated sialylated lactosamine antenna. (b) Fragmentation analysis of an α-galactose containing N-glycan. Abundant B-ions are observed which help to confirm the antennary structure at m/z 657.27, corresponding to the sialylated lactosamine antenna, and at 528.18, corresponding to the α-galactose capped lactosamine antenna.

The presence of significant levels of N-glycans containing Gal-α-Gal was confirmed through the use of a targeted exoglycosidase enzyme array. Enzymatic digestion with α-galactosidase, in conjunction with sialidase A, β-galactosidase, and β-N-acetylhexosaminidase, resulted in all species collapsing down to the trimannosyl-chitobiose core (Figure 6a), whereas in the absence of the α-galactosidase enzyme, treatment with these same three enzymes resulted in resistant species, which contain non-reducing end Gal-α-Gal (Figure 6b).

Figure 6: Exoglycosidase characterization of galactose-α-galactose-containing species. (A) Normal phase HPLC analysis of N-glycan pool treated with sialidase A, β-galactosidase, α-galactosidase and β-N-acetylhexosaminidase. (B) Glycan pool treated with sialidase A, β-galactosidase and β-N-acetylhexosaminidase. Axes in both panels: fluorescence signal (mV) versus time (min).

Susceptibility to green coffee bean α-galactosidase has been extensively used to confirm the presence of α-galactose moieties at the non-reducing end of N-glycans[21]. Because α(1-3,4,6)-galactosidase can cleave terminal galactose residues linked in a 1-3, 1-4 or 1-6 linkage, the exact connectivity within the Gal-α-Gal structure cannot be unequivocally assigned. However, previous studies have identified Galα1-3Gal epitopes in mouse cell surface glycans[22,23]. The enzymatic activity necessary to form these structures has also been well documented in many murine cell types[24]. The combined enzymatic and mass spectrometric approach strongly suggests the presence of significant levels of glycans containing Gal-α-Gal in r-αhCG.

Combining the results of HPLC and MS analysis, we find that up to 60% of the glycans in r-αhCG contained this immunogenic epitope.

The masses of other species suggested the presence of low levels of sulfated N-glycans, such as the species with m/z 1212.7 [M+2H]2+ (a neutral mass of 2423.4), which corresponds to a composition of HexNAc4Hex5NeuAc2+sulfate. The presence of sulfated structures was confirmed by fragmentation analysis (Figure 5a). In addition, these species are resistant to digestion with β-galactosidase on the sulfated antennae, further confirming their presence. Interestingly, four species had masses which suggested compositions that contain more than four non-reducing end capping structures. These species contain from two to five N-acetylneuraminic acid moieties and from zero to three α-galactose capping structures. One of these was a species with a weak signal at m/z 1559.6 [M+3H]3+ and a neutral mass of 4676.5, potentially corresponding to a pentasialylated N-glycan with a composition of HexNAc8Hex9NeuAc5 (Table 2). Another example of a structure illustrating this unusual glycosylation was determined to have a composition of HexNAc8Hex10NeuAc4 based on an observed m/z of 1517.1 [M+3H]3+ and a neutral mass of 4547.6. Despite the low intensity of the parent ions, the fragment spectra for these two species suggest the presence of a disialylated antenna (Figure 5a). MS/MS fragmentation was able to identify the proposed structure as a likely candidate but was not sufficient to confirm the exact structure of the disialylated antenna. This type of glycosylation feature has been observed previously in murine transferrin[25] and in other murine urinary proteins. Pentasialylated glycans[26] have also been previously documented from the CE-MS analysis of EPO derived from CHO cells[27]. It is also possible that these structures could contain a disialic acid, as has been observed in recombinant murine ICAM-1 expressed in CHO cells[28] and in murine NCAM[29].
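The relation between an observed m/z and the quoted neutral mass can be sketched as follows. This is only an illustration, assuming the ions are purely protonated [M+zH]z+ species and using a standard proton mass; the function name is hypothetical and not part of the original workflow. It is applied here to the doubly charged sulfated species quoted above.

```python
PROTON_MASS = 1.007276  # Da; standard value, not taken from the thesis


def neutral_mass(mz, charge):
    """Neutral mass M of an [M+zH]z+ ion observed at the given m/z."""
    return charge * (mz - PROTON_MASS)


# Doubly charged sulfated species from the text: m/z 1212.7, [M+2H]2+
print(round(neutral_mass(1212.7, 2), 1))
```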

Finally, MS/MS fragmentation for multiple species also revealed the presence of N-acetyllactosamine extended antennae as determined by the characteristic B-ions (Figure S3).


Table 2. Summary table of N-linked glycans in r-αhCG. Relative abundance, composition and most probable structure are given for each N-glycan.

#    Approximate      Mass Observed /       Composition with most
     Abundance (%)    Neutral               probable structure
1    18               1172.46 / 2342.85     HexNAc4Hex5NeuAc2
2    16               1108.19 / 2213.81     HexNAc4Hex6NeuAc1
3    9                1133.26 / 3397.22     HexNAc6Hex9NeuAc2
4    7                914.26 / 2740.99      HexNAc5Hex8NeuAc1
5    6                1000.19 / 2999.08     HexNAc5Hex6NeuAc3
6    6                1027.10 / 2051.76     HexNAc4Hex5NeuAc1
7    5                1176.75 / 3526.26     HexNAc6Hex8NeuAc3
8    5                958.00 / 2870.04      HexNAc5Hex7NeuAc2
9    4                1225.82 / 2449.90     HexNAc5Hex8
10   3                962.34 / 1922.71      HexNAc4Hex6
11   3                1219.76 / 3655.78     HexNAc6Hex7NeuAc4
12   2                904.00 / 2707.98      HexNAc5Hex6NeuAc2
13   2                1079.26 / 3235.17     HexNAc6Hex8NeuAc2
14   1                1122.67 / 3364.21     HexNAc6Hex7NeuAc3
15   1                1036.78 / 3106.13     HexNAc6Hex9NeuAc1
16   1                1290.19 / 2578.94     HexNAc5Hex7NeuAc1
17   1                1255.48 / 3762.35     HexNAc7Hex10NeuAc2
18   1                1090.67 / 3268.18     HexNAc6Hex10NeuAc1
19   <1               1067.28 / 2131.71     HexNAc4Hex5NeuAc1+SO4
20   <1               1517.13 / 4547.62     HexNAc8Hex10NeuAc4
21   <1               1559.1 / 4676.14      HexNAc8Hex9NeuAc5
22   <1               1027.26 / 3079.06     HexNAc5Hex6NeuAc3+SO4
23   <1               881.40 / 1760.66      HexNAc4Hex5
24   <1               1341.21 / 4020.44     HexNAc7Hex8NeuAc4
25   <1               1298.16 / 3891.40     HexNAc7Hex9NeuAc3
26   <1               1212.81 / 2422.56     HexNAc4Hex5NeuAc2+SO4
27   <1               1160.78 / 3477.75     HexNAc6Hex9NeuAc2+SO4
28   <1               1043.66 / 2084.77     HexNAc4Hex7
29   <1               1246.25 / 3735.18     HexNAc6Hex7NeuAc3+SO4
30   <1               1377.14 / 4128.42     HexNAc8Hex11NeuAc2
31   <1               1430.65 / 4289.54     HexNAc8Hex12NeuAc2
32   <1               1473.67 / 4418.19     HexNAc8Hex11NeuAc3

The data from the LC-MS/MS analysis of glycans released from a murine cell line-derived r-αhCG were used to construct a list of glycan structures and their relative abundance in r-αhCG (Table 2). In Table 2, compositions are presented along with the most probable structure from MS/MS and exoglycosidase digestion. Taken together, this analysis highlights the fact that the N-glycosylation of murine-derived r-αhCG is complex, resulting from the variable degree of sialylation and wide range of antennarity. Overall, biantennary species constitute approximately 46% of the total glycan pool, triantennary species ~25% of the pool and tetraantennary glycans ~29%.

The tetraantennary species contain a small subset of glycans which contain more than 4 non-reducing end capping structures, as discussed above. These glycans constitute less than 2% of the total N-glycan pool. The pool contains primarily monosialylated, disialylated and trisialylated glycans, which constitute approximately 33%, 38% and 14% of the total glycan pool, respectively. Neutral glycans represent approximately 8% of the total pool. Sulfated, tetrasialylated and pentasialylated glycans constitute 3%, 4% and less than 1% of the glycan pool, respectively. Core fucosylation was not observed in any of the released N-glycans.
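The grouping by degree of sialylation can be approximated directly from Table 2. The sketch below is an illustration rather than the procedure used in the thesis: it sums the approximate abundances of entries 1-18 by number of NeuAc residues and, as an assumption, ignores entries 19-32 because they are reported only as "<1". The resulting sums are therefore lower bounds and fall slightly below the percentages quoted in the text.

```python
from collections import defaultdict

# (HexNAc, Hex, NeuAc, approximate abundance %) for Table 2 entries 1-18.
# Entries 19-32 are listed only as "<1" and are left out (assumption),
# so every sum below is a lower bound.
TABLE2_ROWS = [
    (4, 5, 2, 18), (4, 6, 1, 16), (6, 9, 2, 9), (5, 8, 1, 7),
    (5, 6, 3, 6), (4, 5, 1, 6), (6, 8, 3, 5), (5, 7, 2, 5),
    (5, 8, 0, 4), (4, 6, 0, 3), (6, 7, 4, 3), (5, 6, 2, 2),
    (6, 8, 2, 2), (6, 7, 3, 1), (6, 9, 1, 1), (5, 7, 1, 1),
    (7, 10, 2, 1), (6, 10, 1, 1),
]


def abundance_by_sialic_acids(rows):
    """Sum the abundance of all glycans sharing the same NeuAc count."""
    totals = defaultdict(int)
    for _hexnac, _hex, neuac, abundance in rows:
        totals[neuac] += abundance
    return dict(totals)


by_neuac = abundance_by_sialic_acids(TABLE2_ROWS)
for n in sorted(by_neuac):
    print(f"{n} NeuAc: at least {by_neuac[n]}%")
```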


4.3.4 Glycopeptide Analysis

Recombinant hCG is a glycoprotein with two glycosylation sites, and thus the analysis of intact glycoprotein determines the overall glycan composition, but not the glycan structures attached to individual sites. Analysis at the glycopeptide level is necessary to associate a particular glycan with a specific glycosylation site. A theoretical tryptic digest of r-αhCG showed that the glycosylation sites were found on two distinct tryptic peptides, allowing straightforward association of glycans with a particular site by peptide analysis.

A tryptic digest of r-αhCG was analyzed by nano LC-MS/MS in the data dependent mode. Elution times of glycopeptides were determined by analysis of the tryptic digest of the deglycosylated form of the protein, as the deglycosylated peptides generally have elution times similar to those of the glycosylated forms[16]. Compositions of glycans associated with a specific glycosylation site were determined based on glycopeptide molecular weight and also by comparison with the list of identified glycans (Table 3). In addition, MS/MS spectra of glycopeptides were analyzed to confirm the assignment of a given glycan structure; for example, the ion at m/z=657, corresponding to the Sial-HexNAc-Hex fragment, should be present only for glycopeptides which contain sialic acid.


Table 3. Abundance of individual glycopeptides. Normalized to 100% at each glycosylation site.

Glycan^a   Glycan       Glycan Composition     NVTSESTCCVAK    VENHTACHCSTCYYHK
           abundance                           Abundance %     Abundance %
1          18%          HexNAc4Hex5NeuAc2      7.4%            40.7%
2          16%          HexNAc4Hex6NeuAc1      4.5%            22.0%
3          9%           HexNAc6Hex9NeuAc2      9.8%            1.1%
4          7%           HexNAc5Hex8NeuAc1      30.0%           1.7%
5          6%           HexNAc5Hex6NeuAc3      3.7%            11.7%
6          6%           HexNAc4Hex5NeuAc1      1.0%            0.2%
7          5%           HexNAc5Hex7NeuAc2      17.4%           7.9%
8          5%           HexNAc6Hex8NeuAc3      4.4%            1.7%
9          4%           HexNAc5Hex8NeuAc0      0.4%            0.1%
10         3%           HexNAc4Hex6NeuAc0      ND              1.6%
11         3%           HexNAc6Hex7NeuAc4      1.0%            1.9%
12         2%           HexNAc6Hex8NeuAc2      3.7%            1.1%
13         2%           HexNAc5Hex6NeuAc2      2.1%            3.7%
14         1%           HexNAc5Hex7NeuAc1      5.6%            1.4%
15         1%           HexNAc6Hex9NeuAc1      4.1%            0.6%
16         1%           HexNAc6Hex7NeuAc3      1.0%            2.1%
17         1%           HexNAc7Hex10NeuAc2     0.8%            0.4%
18         1%           HexNAc6Hex10NeuAc1     3.2%            0.1%

Peak areas in Table 3 were calculated from extracted ion chromatograms for glycopeptide masses corresponding to all glycans with >1% abundance (i.e. 18 glycans) associated with either of the two glycosylation sites. The peak areas were normalized to represent the percent abundance of a specific glycopeptide. It should be noted again that peak areas represent an approximation of the real abundances of glycopeptides in solution, since ionization efficiencies and the distribution of charge states are likely to depend on the specific glycopeptide. By examining the data in Table 3, it can be seen that the majority of glycans can be found on both glycosylation sites, though the differences in abundance can be substantial. For example, the glycan with composition HexNAc5Hex8NeuAc1 represents roughly 30% of all forms at glycopeptide NVTSESTCCVAK (site N76) but only roughly 2% at VENHTACHCSTCYYHK (site N102). Interestingly, glycopeptides with sulfated glycoforms were not observed.


4.3.5 Analysis of Combined Data

Data from all three levels of analysis, i.e. intact protein, glycopeptide and released glycan analyses, were combined to provide a more detailed characterization of r-αhCG and to verify data consistency. It was assumed that glycans identified in the analysis of released glycans were the only structures that contributed to glycoprotein isoforms. Thus, using compositions of all observed glycans, a theoretical list of all potential protein glycoforms was generated. There are 32 glycans listed in Table 2. Assuming that both glycosylation sites are occupied, there are theoretically 528 combinations of these 32 glycans (32 × 33 / 2 unordered pairs, counting the cases in which both sites carry the same glycan), and all of these potential glycoforms were generated by a Perl script. In addition, when taking into account glycoforms with only one site occupied, there is a total of 560 (528+32) potential protein glycoforms. However, only 286 of these glycoforms have a unique glycan composition, which means that roughly 50% of the potential glycoforms are isomeric, and thus isobaric. See Supplemental Table 4 for a list of all theoretical glycoprotein compositions.
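The enumeration described above can be sketched as follows. The thesis used a Perl script that is not reproduced in the text; the Python below is only an illustration of the counting, with hypothetical function names, and with the 32 Table 2 compositions written as (HexNAc, Hex, NeuAc, sulfate) tuples.

```python
from itertools import combinations_with_replacement

# Table 2 compositions as (HexNAc, Hex, NeuAc, sulfate), entries 1-32 in order.
GLYCANS = [
    (4, 5, 2, 0), (4, 6, 1, 0), (6, 9, 2, 0), (5, 8, 1, 0),
    (5, 6, 3, 0), (4, 5, 1, 0), (6, 8, 3, 0), (5, 7, 2, 0),
    (5, 8, 0, 0), (4, 6, 0, 0), (6, 7, 4, 0), (5, 6, 2, 0),
    (6, 8, 2, 0), (6, 7, 3, 0), (6, 9, 1, 0), (5, 7, 1, 0),
    (7, 10, 2, 0), (6, 10, 1, 0), (4, 5, 1, 1), (8, 10, 4, 0),
    (8, 9, 5, 0), (5, 6, 3, 1), (4, 5, 0, 0), (7, 8, 4, 0),
    (7, 9, 3, 0), (4, 5, 2, 1), (6, 9, 2, 1), (4, 7, 0, 0),
    (6, 7, 3, 1), (8, 11, 2, 0), (8, 12, 2, 0), (8, 11, 3, 0),
]


def two_site_glycoforms(glycans):
    """Unordered glycan pairs: both sites occupied, same glycan allowed twice."""
    return list(combinations_with_replacement(glycans, 2))


def total_composition(glycans_on_protein):
    """Element-wise sum of the compositions attached to one protein."""
    return tuple(sum(counts) for counts in zip(*glycans_on_protein))


pairs = two_site_glycoforms(GLYCANS)
singles = [(g,) for g in GLYCANS]  # only one site occupied
all_glycoforms = pairs + singles
distinct = {total_composition(g) for g in all_glycoforms}

print(len(pairs), len(all_glycoforms), len(distinct))
```

The first two printed numbers correspond to the 528 and 560 quoted above. How the third compares with the 286 unique compositions depends on details the text does not spell out, for example whether singly occupied forms are included in that count.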

Next, the calculated compositions were compared with the glycan compositions of intact protein glycoforms derived from the CE-MS analysis. First, glycan compositions of the 60 most abundant intact protein glycoforms were determined (see Supplemental Table 4). Then, the calculated compositions were matched to the list of theoretical compositions derived from observed glycans (286 glycoforms), see Table 4. It was found that the majority of high intensity protein glycoforms could be matched to theoretical glycan compositions. For example, the intact protein glycoform labeled b in Figure 3A, with a molecular weight of 15,007, could be matched to two different combinations of glycans. On the other hand, the intact protein glycoform with mass 16,556 Da, peak q in Figure 3A, could be matched to 7 different combinations of observed glycans; again, all of these combinations have the same total glycan composition, HexNAc12Hex17NeuAc4. As can be seen in Table 4, with increasing mass and decreasing abundance (see Figure 3B), the number of glycoforms that could be matched to theoretical combinations of observed glycans decreased.
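The matching step can be illustrated with a short lookup: given the total glycan composition of an intact glycoform, list every pair of observed glycans that adds up to it. This is a sketch under the same assumptions as the matching described above (only Table 2 glycans contribute, both sites occupied); the function name is hypothetical. With the full 32-entry list, the example composition HexNAc9Hex14NeuAc2 returns the two pairings listed in Table 4 for mass 14877, one of which involves HexNAc4Hex7.

```python
from itertools import combinations_with_replacement

# Table 2 compositions as (HexNAc, Hex, NeuAc, sulfate), entries 1-32 in order.
GLYCANS = [
    (4, 5, 2, 0), (4, 6, 1, 0), (6, 9, 2, 0), (5, 8, 1, 0),
    (5, 6, 3, 0), (4, 5, 1, 0), (6, 8, 3, 0), (5, 7, 2, 0),
    (5, 8, 0, 0), (4, 6, 0, 0), (6, 7, 4, 0), (5, 6, 2, 0),
    (6, 8, 2, 0), (6, 7, 3, 0), (6, 9, 1, 0), (5, 7, 1, 0),
    (7, 10, 2, 0), (6, 10, 1, 0), (4, 5, 1, 1), (8, 10, 4, 0),
    (8, 9, 5, 0), (5, 6, 3, 1), (4, 5, 0, 0), (7, 8, 4, 0),
    (7, 9, 3, 0), (4, 5, 2, 1), (6, 9, 2, 1), (4, 7, 0, 0),
    (6, 7, 3, 1), (8, 11, 2, 0), (8, 12, 2, 0), (8, 11, 3, 0),
]


def matching_pairs(target, glycans=GLYCANS):
    """All unordered glycan pairs whose summed composition equals target."""
    return [
        pair
        for pair in combinations_with_replacement(glycans, 2)
        if tuple(a + b for a, b in zip(*pair)) == tuple(target)
    ]


# Total composition HexNAc9 Hex14 NeuAc2, no sulfate (mass 14877 in Table 4)
for first, second in matching_pairs((9, 14, 2, 0)):
    print(first, "+", second)
```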

Next, we combined data obtained from CE-MS analysis of the intact r-αhCG with site-specific glycan information derived from the glycopeptide analysis. For protein glycoforms that can be assigned to a unique combination of two glycans (see Table 4), the glycopeptide data indicate the site to which a particular glycan is attached. For example, the r-αhCG glycoform with glycan mass composition HexNAc9Hex14NeuAc2 (Figure 3A, band a, mass 14877) can be matched to a unique combination of two glycans, HexNAc4Hex6NeuAc and HexNAc5Hex8NeuAc. Since this glycoform (mass 14877) is one of the most abundant (see Table 1), the glycopeptides associated with the corresponding glycans should also be highly abundant. From Table 3 it can be seen that glycan HexNAc4Hex6NeuAc is likely present at site N102 (peptide VENHTACHCSTCYYHK), where its abundance is 22% compared to 4.5% at the other site. In addition, HexNAc5Hex8NeuAc is likely present at site N76 (peptide NVTSESTCCVAK), where its abundance is 30% compared to only 1.7% at site N102. However, determination of the site-specific glycosylation could be performed only for a limited number of glycoforms, because most of the remaining highly abundant glycoforms are associated with more than one combination of glycans, leading to a high redundancy in the site assignment.
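The site assignment reasoning above can be mimicked with a small comparison of the two possible orientations of a glycan pair. The scoring rule used here, the product of the Table 3 abundances at the two glycopeptides, is an assumption made for this sketch (it treats the two sites as independent) and is not a method stated in the thesis; the function and variable names are hypothetical. Only the two glycans from the worked example are included.

```python
# Percent abundance of each glycan at the two glycopeptides, copied from Table 3.
SITE_ABUNDANCE = {
    "HexNAc4Hex6NeuAc1": {"NVTSESTCCVAK": 4.5, "VENHTACHCSTCYYHK": 22.0},
    "HexNAc5Hex8NeuAc1": {"NVTSESTCCVAK": 30.0, "VENHTACHCSTCYYHK": 1.7},
}
PEPTIDES = ("NVTSESTCCVAK", "VENHTACHCSTCYYHK")


def orientation_scores(glycan_a, glycan_b, table=SITE_ABUNDANCE):
    """Score both ways of placing two glycans on the two glycopeptides.

    Keys are (glycan on first peptide, glycan on second peptide).
    The product score is an illustrative heuristic, not the thesis method.
    """
    first, second = PEPTIDES
    return {
        (glycan_a, glycan_b): table[glycan_a][first] * table[glycan_b][second],
        (glycan_b, glycan_a): table[glycan_b][first] * table[glycan_a][second],
    }


scores = orientation_scores("HexNAc4Hex6NeuAc1", "HexNAc5Hex8NeuAc1")
best = max(scores, key=scores.get)
print(dict(zip(PEPTIDES, best)))
```

Under this heuristic the favored orientation places HexNAc5Hex8NeuAc1 on NVTSESTCCVAK and HexNAc4Hex6NeuAc1 on VENHTACHCSTCYYHK, in line with the abundances in Table 3.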

Table 4. List of theoretical and observed glycoforms.

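Theoretical glycoforms. Each entry lists, in order: mass of the theoretical glycoform, Site 1 glycan composition, Site 2 glycan composition, total glycan mass composition. Compositions are given as HexNAc,Hex,NeuAc,Sulfate. In the listing below, entries from two side-by-side column blocks alternate.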

13443 4,5,0,0 4,5,0,0 8,10,0,0 14553 5,7,2,0 4,5,0,0 9,12,2,0 13605 4,6,0,0 4,5,0,0 8,11,0,0 14553 4,6,0,0 5,6,2,0 9,12,2,0 13734 4,5,1,0 4,5,0,0 8,10,1,0 14559 4,6,1,0 4,5,2,1 8,11,3,1 13767 4,6,0,0 4,6,0,0 8,12,0,0 14586 4,6,1,0 5,8,0,0 9,14,1,0 13767 4,5,0,0 4,7,0,0 8,12,0,0 14586 5,8,1,0 4,6,0,0 9,14,1,0 13814 4,5,1,1 4,5,0,0 8,10,1,1 14586 5,7,1,0 4,7,0,0 9,14,1,0 13896 4,6,1,0 4,5,0,0 8,11,1,0 14608 4,5,2,0 4,5,2,0 8,10,4,0 13896 4,5,1,0 4,6,0,0 8,11,1,0 14633 5,7,1,0 4,5,1,1 9,12,2,1 13929 4,6,0,0 4,7,0,0 8,13,0,0 14682 5,6,3,0 4,5,0,0 9,11,3,0 13976 4,6,0,0 4,5,1,1 8,11,1,1 14682 4,5,1,0 5,6,2,0 9,11,3,0 14025 4,5,2,0 4,5,0,0 8,10,2,0 14688 4,5,2,0 4,5,2,1 8,10,4,1 14025 4,5,1,0 4,5,1,0 8,10,2,0 14715 4,5,2,0 5,8,0,0 9,13,2,0 14058 4,6,1,0 4,6,0,0 8,12,1,0 14715 4,6,1,0 5,7,1,0 9,13,2,0 14058 4,5,1,0 4,7,0,0 8,12,1,0 14715 5,8,1,0 4,5,1,0 9,13,2,0 14091 4,7,0,0 4,7,0,0 8,14,0,0 14715 5,7,2,0 4,6,0,0 9,13,2,0 14105 4,5,1,0 4,5,1,1 8,10,2,1 14715 5,6,2,0 4,7,0,0 9,13,2,0 14105 4,5,0,0 4,5,2,1 8,10,2,1 14748 5,8,1,0 4,7,0,0 9,15,1,0 14133 5,8,0,0 4,5,0,0 9,13,0,0 14762 5,6,2,0 4,5,1,1 9,11,3,1 14138 4,5,1,1 4,7,0,0 8,12,1,1 14762 5,6,3,1 4,5,0,0 9,11,3,1 14185 4,5,1,1 4,5,1,1 8,10,2,2 14768 4,5,2,1 4,5,2,1 8,10,4,2 14188 4,5,2,0 4,6,0,0 8,11,2,0 14789 6,9,1,0 4,5,0,0 10,14,1,0 14188 4,6,1,0 4,5,1,0 8,11,2,0 14795 5,8,1,0 4,5,1,1 9,13,2,1 14221 4,6,1,0 4,7,0,0 8,13,1,0 14795 5,8,0,0 4,5,2,1 9,13,2,1 14262 5,7,1,0 4,5,0,0 9,12,1,0 14822 5,8,0,0 5,8,0,0 10,16,0,0 14268 4,6,1,0 4,5,1,1 8,11,2,1 14844 4,5,2,0 5,7,1,0 9,12,3,0 14268 4,6,0,0 4,5,2,1 8,11,2,1 14844 4,6,1,0 5,6,2,0 9,12,3,0 14295 5,8,0,0 4,6,0,0 9,14,0,0 14844 5,6,3,0 4,6,0,0 9,12,3,0 14317 4,5,2,0 4,5,1,0 8,10,3,0 14844 4,5,1,0 5,7,2,0 9,12,3,0 14350 4,5,2,0 4,7,0,0 8,12,2,0 14877 4,6,1,0 5,8,1,0 9,14,2,0 14350 4,6,1,0 4,6,1,0 8,12,2,0 14877 5,7,2,0 4,7,0,0 9,14,2,0 14391 5,6,2,0 4,5,0,0 9,11,2,0 14918 6,8,2,0 4,5,0,0 10,13,2,0 14397 4,5,2,0 4,5,1,1 8,10,3,1 14924 5,7,2,0 4,5,1,1 9,12,3,1 14397 
4,5,1,0 4,5,2,1 8,10,3,1 14924 4,6,0,0 5,6,3,1 9,12,3,1 14424 5,8,1,0 4,5,0,0 9,13,1,0 14924 5,7,1,0 4,5,2,1 9,12,3,1

14424 4,5,1,0 5,8,0,0 9,13,1,0 14951 5,8,0,0 5,7,1,0 10,15,1,0 14424 4,6,0,0 5,7,1,0 9,13,1,0 14951 4,6,0,0 6,9,1,0 10,15,1,0 14430 4,5,2,1 4,7,0,0 8,12,2,1 14951 6,10,1,0 4,5,0,0 10,15,1,0 14457 5,8,0,0 4,7,0,0 9,15,0,0 14973 4,5,2,0 5,6,2,0 9,11,4,0 14477 4,5,1,1 4,5,2,1 8,10,3,2 14973 5,6,3,0 4,5,1,0 9,11,4,0 14479 4,5,2,0 4,6,1,0 8,11,3,0 15006 4,5,2,0 5,8,1,0 9,13,3,0 14504 5,8,0,0 4,5,1,1 9,13,1,1 15006 4,6,1,0 5,7,2,0 9,13,3,0 14553 4,5,1,0 5,7,1,0 9,12,2,0 15006 5,6,3,0 4,7,0,0 9,13,3,0 15047 6,7,3,0 4,5,0,0 10,12,3,0 15345 4,5,2,0 5,6,3,1 9,11,5,1 15053 5,6,3,0 4,5,1,1 9,11,4,1 15345 5,6,3,0 4,5,2,1 9,11,5,1 15053 4,5,1,0 5,6,3,1 9,11,4,1 15372 4,5,2,0 6,9,1,0 10,14,3,0 15053 5,6,2,0 4,5,2,1 9,11,4,1 15372 4,6,1,0 6,8,2,0 10,14,3,0 15080 6,9,2,0 4,5,0,0 10,14,2,0 15372 6,9,2,0 4,5,1,0 10,14,3,0 15080 4,5,1,0 6,9,1,0 10,14,2,0 15372 5,8,1,0 5,6,2,0 10,14,3,0 15080 5,8,0,0 5,6,2,0 10,14,2,0 15372 5,6,3,0 5,8,0,0 10,14,3,0 15080 4,6,0,0 6,8,2,0 10,14,2,0 15372 6,8,3,0 4,6,0,0 10,14,3,0 15080 5,7,1,0 5,7,1,0 10,14,2,0 15372 5,7,2,0 5,7,1,0 10,14,3,0 15086 5,8,1,0 4,5,2,1 9,13,3,1 15372 6,7,3,0 4,7,0,0 10,14,3,0 15086 5,6,3,1 4,7,0,0 9,13,3,1 15405 4,6,1,0 6,10,1,0 10,16,2,0 15113 5,8,1,0 5,8,0,0 10,16,1,0 15405 6,9,2,0 4,7,0,0 10,16,2,0 15113 4,6,0,0 6,10,1,0 10,16,1,0 15405 5,8,1,0 5,8,1,0 10,16,2,0 15113 6,9,1,0 4,7,0,0 10,16,1,0 15419 4,5,1,0 6,7,3,1 10,12,4,1 15127 4,5,0,0 6,7,3,1 10,12,3,1 15419 6,7,3,0 4,5,1,1 10,12,4,1 15133 4,5,1,1 5,6,3,1 9,11,4,2 15425 5,6,3,1 4,5,2,1 9,11,5,2 15135 4,5,2,0 5,7,2,0 9,12,4,0 15446 7,10,2,0 4,5,0,0 11,15,2,0 15135 4,6,1,0 5,6,3,0 9,12,4,0 15452 6,9,2,0 4,5,1,1 10,14,3,1 15160 6,9,1,0 4,5,1,1 10,14,2,1 15452 4,5,1,0 6,9,2,1 10,14,3,1 15160 4,5,0,0 6,9,2,1 10,14,2,1 15452 5,8,0,0 5,6,3,1 10,14,3,1 15210 4,5,1,0 6,8,2,0 10,13,3,0 15452 6,9,1,0 4,5,2,1 10,14,3,1 15210 6,8,3,0 4,5,0,0 10,13,3,0 15452 4,7,0,0 6,7,3,1 10,14,3,1 15210 4,6,0,0 6,7,3,0 10,13,3,0 15479 5,8,0,0 6,9,1,0 11,17,1,0 15210 5,6,2,0 5,7,1,0 10,13,3,0 
15485 6,9,2,1 4,7,0,0 10,16,2,1 15215 4,6,1,0 5,6,3,1 9,12,4,1 15499 4,5,1,1 6,7,3,1 10,12,4,2 15215 5,7,2,0 4,5,2,1 9,12,4,1 15501 4,5,2,0 6,8,2,0 10,13,4,0 15243 4,6,1,0 6,9,1,0 10,15,2,0 15501 4,6,1,0 6,7,3,0 10,13,4,0 15243 6,9,2,0 4,6,0,0 10,15,2,0 15501 5,6,3,0 5,7,1,0 10,13,4,0 15243 5,8,1,0 5,7,1,0 10,15,2,0 15501 4,5,1,0 6,8,3,0 10,13,4,0 15243 4,5,1,0 6,10,1,0 10,15,2,0 15501 5,7,2,0 5,6,2,0 10,13,4,0 15243 5,7,2,0 5,8,0,0 10,15,2,0 15501 4,6,0,0 6,7,4,0 10,13,4,0 15243 6,8,2,0 4,7,0,0 10,15,2,0 15532 4,5,1,1 6,9,2,1 10,14,3,2 15265 4,5,2,0 5,6,3,0 9,11,5,0 15534 4,5,2,0 6,10,1,0 10,15,3,0

15276 6,10,1,0 4,7,0,0 10,17,1,0 15534 4,6,1,0 6,9,2,0 10,15,3,0 15290 4,6,0,0 6,7,3,1 10,13,3,1 15534 5,8,1,0 5,7,2,0 10,15,3,0 15290 6,8,2,0 4,5,1,1 10,13,3,1 15534 6,8,3,0 4,7,0,0 10,15,3,0 15323 4,6,0,0 6,9,2,1 10,15,2,1 15575 4,5,0,0 7,9,3,0 11,14,3,0 15323 6,10,1,0 4,5,1,1 10,15,2,1 15581 4,6,1,0 6,7,3,1 10,13,4,1 15339 4,5,1,0 6,7,3,0 10,12,4,0 15581 6,8,3,0 4,5,1,1 10,13,4,1 15339 6,7,4,0 4,5,0,0 10,12,4,0 15581 6,8,2,0 4,5,2,1 10,13,4,1 15339 5,6,2,0 5,6,2,0 10,12,4,0 15581 5,7,1,0 5,6,3,1 10,13,4,1 15608 5,8,0,0 6,8,2,0 11,16,2,0 15866 5,6,2,0 6,8,2,0 11,14,4,0 15608 4,6,0,0 7,10,2,0 11,16,2,0 15866 6,7,3,0 5,7,1,0 11,14,4,0 15608 6,9,1,0 5,7,1,0 11,16,2,0 15872 6,8,3,0 4,5,2,1 10,13,5,1 15614 4,6,1,0 6,9,2,1 10,15,3,1 15872 5,7,2,0 5,6,3,1 10,13,5,1 15614 6,10,1,0 4,5,2,1 10,15,3,1 15899 4,6,1,0 7,10,2,0 11,16,3,0 15630 4,5,2,0 6,7,3,0 10,12,5,0 15899 6,9,2,0 5,7,1,0 11,16,3,0 15630 5,6,3,0 5,6,2,0 10,12,5,0 15899 5,8,1,0 6,8,2,0 11,16,3,0 15630 4,5,1,0 6,7,4,0 10,12,5,0 15899 6,8,3,0 5,8,0,0 11,16,3,0 15641 5,8,0,0 6,10,1,0 11,18,1,0 15899 5,7,2,0 6,9,1,0 11,16,3,0 15663 4,5,2,0 6,9,2,0 10,14,4,0 15899 5,6,2,0 6,10,1,0 11,16,3,0 15663 4,6,1,0 6,8,3,0 10,14,4,0 15899 7,9,3,0 4,7,0,0 11,16,3,0 15663 5,8,1,0 5,6,3,0 10,14,4,0 15921 4,5,2,0 6,7,4,0 10,12,6,0 15663 5,7,2,0 5,7,2,0 10,14,4,0 15921 5,6,3,0 5,6,3,0 10,12,6,0 15663 6,7,4,0 4,7,0,0 10,14,4,0 15932 5,8,1,0 6,10,1,0 11,18,2,0 15704 4,5,0,0 7,8,4,0 11,13,4,0 15946 5,7,1,0 6,7,3,1 11,14,4,1 15710 4,5,2,0 6,7,3,1 10,12,5,1 15946 4,5,1,1 7,9,3,0 11,14,4,1 15710 6,7,4,0 4,5,1,1 10,12,5,1 15973 4,6,0,0 8,11,2,0 12,17,2,0 15710 5,6,2,0 5,6,3,1 10,12,5,1 15973 4,5,0,0 8,12,2,0 12,17,2,0 15710 6,7,3,0 4,5,2,1 10,12,5,1 15979 5,7,1,0 6,9,2,1 11,16,3,1 15737 4,5,1,0 7,10,2,0 11,15,3,0 15995 4,5,1,0 7,8,4,0 11,13,5,0 15737 5,8,0,0 6,7,3,0 11,15,3,0 15995 5,6,2,0 6,7,3,0 11,13,5,0 15737 4,6,0,0 7,9,3,0 11,15,3,0 16001 5,6,3,0 5,6,3,1 10,12,6,1 15737 5,6,2,0 6,9,1,0 11,15,3,0 16001 6,7,4,0 4,5,2,1 10,12,6,1 
15737 6,8,2,0 5,7,1,0 11,15,3,0 16028 4,5,2,0 7,10,2,0 11,15,4,0 15743 4,5,2,0 6,9,2,1 10,14,4,1 16028 4,6,1,0 7,9,3,0 11,15,4,0 15743 6,9,2,0 4,5,2,1 10,14,4,1 16028 6,9,2,0 5,6,2,0 11,15,4,0 15743 5,8,1,0 5,6,3,1 10,14,4,1 16028 5,8,1,0 6,7,3,0 11,15,4,0 15770 6,9,2,0 5,8,0,0 11,17,2,0 16028 5,6,3,0 6,9,1,0 11,15,4,0 15770 5,8,1,0 6,9,1,0 11,17,2,0 16028 6,8,3,0 5,7,1,0 11,15,4,0 15770 5,7,1,0 6,10,1,0 11,17,2,0 16028 5,7,2,0 6,8,2,0 11,15,4,0 15770 7,10,2,0 4,7,0,0 11,17,2,0 16028 5,8,0,0 6,7,4,0 11,15,4,0 15790 4,5,2,1 6,7,3,1 10,12,5,2 16028 7,8,4,0 4,7,0,0 11,15,4,0 15792 4,5,2,0 6,8,3,0 10,13,5,0 16061 6,9,2,0 5,8,1,0 11,17,3,0

15792 4,6,1,0 6,7,4,0 10,13,5,0 16061 5,7,2,0 6,10,1,0 11,17,3,0 15792 5,6,3,0 5,7,2,0 10,13,5,0 16075 5,6,2,0 6,7,3,1 11,13,5,1 15811 4,5,0,0 8,11,2,0 12,16,2,0 16075 4,5,1,1 7,8,4,0 11,13,5,1 15817 5,8,0,0 6,7,3,1 11,15,3,1 16081 5,6,3,1 5,6,3,1 10,12,6,2 15817 7,10,2,0 4,5,1,1 11,15,3,1 16102 4,5,1,0 8,11,2,0 12,16,3,0 15823 4,5,2,1 6,9,2,1 10,14,4,2 16102 4,5,0,0 8,11,3,0 12,16,3,0 15850 5,8,0,0 6,9,2,1 11,17,2,1 16108 5,8,1,0 6,7,3,1 11,15,4,1 15866 4,5,1,0 7,9,3,0 11,14,4,0 16108 5,6,2,0 6,9,2,1 11,15,4,1 15866 4,6,0,0 7,8,4,0 11,14,4,0 16108 6,9,1,0 5,6,3,1 11,15,4,1 16108 7,10,2,0 4,5,2,1 11,15,4,1 16394 4,5,1,0 8,11,3,0 12,16,4,0 16135 5,8,0,0 7,10,2,0 12,18,2,0 16394 5,8,0,0 7,8,4,0 12,16,4,0 16135 4,6,0,0 8,12,2,0 12,18,2,0 16394 4,6,0,0 8,10,4,0 12,16,4,0 16135 6,9,1,0 6,9,1,0 12,18,2,0 16394 5,6,2,0 7,10,2,0 12,16,4,0 16135 4,7,0,0 8,11,2,0 12,18,2,0 16394 6,8,2,0 6,8,2,0 12,16,4,0 16141 5,8,1,0 6,9,2,1 11,17,3,1 16394 6,7,3,0 6,9,1,0 12,16,4,0 16157 4,5,2,0 7,9,3,0 11,14,5,0 16394 5,7,1,0 7,9,3,0 12,16,4,0 16157 4,6,1,0 7,8,4,0 11,14,5,0 16399 6,9,2,0 5,6,3,1 11,15,5,1 16157 5,6,3,0 6,8,2,0 11,14,5,0 16399 5,6,3,0 6,9,2,1 11,15,5,1 16157 6,8,3,0 5,6,2,0 11,14,5,0 16427 4,6,1,0 8,12,2,0 12,18,3,0 16157 5,7,2,0 6,7,3,0 11,14,5,0 16427 6,9,2,0 6,9,1,0 12,18,3,0 16157 6,7,4,0 5,7,1,0 11,14,5,0 16427 5,8,1,0 7,10,2,0 12,18,3,0 16182 4,5,1,1 8,11,2,0 12,16,3,1 16427 6,8,2,0 6,10,1,0 12,18,3,0 16190 6,9,2,0 5,7,2,0 11,16,4,0 16427 4,7,0,0 8,11,3,0 12,18,3,0 16190 5,8,1,0 6,8,3,0 11,16,4,0 16446 5,6,3,1 6,7,3,1 11,13,6,2 16190 5,6,3,0 6,10,1,0 11,16,4,0 16449 5,6,3,0 6,8,3,0 11,14,6,0 16231 8,10,4,0 4,5,0,0 12,15,4,0 16449 5,7,2,0 6,7,4,0 11,14,6,0 16237 5,7,2,0 6,7,3,1 11,14,5,1 16460 6,10,1,0 6,10,1,0 12,20,2,0 16237 6,8,2,0 5,6,3,1 11,14,5,1 16474 6,9,1,0 6,7,3,1 12,16,4,1 16237 7,9,3,0 4,5,2,1 11,14,5,1 16474 4,5,1,1 8,11,3,0 12,16,4,1 16264 4,6,1,0 8,11,2,0 12,17,3,0 16474 4,5,2,1 8,11,2,0 12,16,4,1 16264 4,5,1,0 8,12,2,0 12,17,3,0 16479 5,6,3,1 6,9,2,1 
11,15,5,2 16264 5,8,0,0 7,9,3,0 12,17,3,0 16501 5,8,0,0 8,11,2,0 13,19,2,0 16264 4,6,0,0 8,11,3,0 12,17,3,0 16507 6,9,1,0 6,9,2,1 12,18,3,1 16264 6,8,2,0 6,9,1,0 12,17,3,0 16523 4,5,1,0 8,10,4,0 12,15,5,0 16264 5,7,1,0 7,10,2,0 12,17,3,0 16523 4,6,0,0 8,9,5,0 12,15,5,0 16270 5,7,2,0 6,9,2,1 11,16,4,1 16523 5,6,2,0 7,9,3,0 12,15,5,0 16270 6,10,1,0 5,6,3,1 11,16,4,1 16523 6,8,2,0 6,7,3,0 12,15,5,0 16286 4,5,2,0 7,8,4,0 11,13,6,0 16523 5,7,1,0 7,8,4,0 12,15,5,0 16286 5,6,3,0 6,7,3,0 11,13,6,0 16529 6,8,3,0 5,6,3,1 11,14,6,1 16286 6,7,4,0 5,6,2,0 11,13,6,0 16556 4,5,2,0 8,12,2,0 12,17,4,0 16297 6,9,1,0 6,10,1,0 12,19,2,0 16556 4,6,1,0 8,11,3,0 12,17,4,0

16297 4,7,0,0 8,12,2,0 12,19,2,0 16556 6,9,2,0 6,8,2,0 12,17,4,0 16319 6,9,2,0 5,6,3,0 11,15,5,0 16556 5,8,1,0 7,9,3,0 12,17,4,0 16319 5,8,1,0 6,7,4,0 11,15,5,0 16556 6,8,3,0 6,9,1,0 12,17,4,0 16319 6,8,3,0 5,7,2,0 11,15,5,0 16556 5,7,2,0 7,10,2,0 12,17,4,0 16344 4,5,1,1 8,12,2,0 12,17,3,1 16556 6,7,3,0 6,10,1,0 12,17,4,0 16361 8,9,5,0 4,5,0,0 12,14,5,0 16556 8,10,4,0 4,7,0,0 12,17,4,0 16366 5,6,3,0 6,7,3,1 11,13,6,1 16578 5,6,3,0 6,7,4,0 11,13,7,0 16366 6,7,3,0 5,6,3,1 11,13,6,1 16589 6,9,2,0 6,10,1,0 12,19,3,0 16366 7,8,4,0 4,5,2,1 11,13,6,1 16603 6,8,2,0 6,7,3,1 12,15,5,1 16394 4,5,2,0 8,11,2,0 12,16,4,0 16603 4,5,1,1 8,10,4,0 12,15,5,1 16630 5,7,1,0 8,11,2,0 13,18,3,0 16847 6,7,4,0 6,10,1,0 12,17,5,0 16636 6,8,2,0 6,9,2,1 12,17,4,1 16878 6,9,2,1 6,9,2,1 12,18,4,2 16636 6,10,1,0 6,7,3,1 12,17,4,1 16894 6,8,3,0 6,7,3,1 12,15,6,1 16636 4,5,2,1 8,12,2,0 12,17,4,1 16894 8,10,4,0 4,5,2,1 12,15,6,1 16652 4,5,1,0 8,9,5,0 12,14,6,0 16894 5,6,3,1 7,9,3,0 12,15,6,1 16652 5,6,2,0 7,8,4,0 12,14,6,0 16921 5,7,2,0 8,11,2,0 13,18,4,0 16652 6,7,3,0 6,7,3,0 12,14,6,0 16921 5,8,0,0 8,10,4,0 13,18,4,0 16658 6,7,4,0 5,6,3,1 11,13,7,1 16921 5,6,2,0 8,12,2,0 13,18,4,0 16663 5,8,0,0 8,12,2,0 13,20,2,0 16921 6,8,2,0 7,10,2,0 13,18,4,0 16669 6,10,1,0 6,9,2,1 12,19,3,1 16921 6,9,1,0 7,9,3,0 13,18,4,0 16685 4,5,2,0 8,11,3,0 12,16,5,0 16921 5,7,1,0 8,11,3,0 13,18,4,0 16685 4,6,1,0 8,10,4,0 12,16,5,0 16927 6,8,3,0 6,9,2,1 12,17,5,1 16685 6,9,2,0 6,7,3,0 12,16,5,0 16943 4,5,2,0 8,9,5,0 12,14,7,0 16685 5,8,1,0 7,8,4,0 12,16,5,0 16943 5,6,3,0 7,8,4,0 12,14,7,0 16685 5,6,3,0 7,10,2,0 12,16,5,0 16943 6,7,4,0 6,7,3,0 12,14,7,0 16685 6,8,3,0 6,8,2,0 12,16,5,0 16954 5,8,1,0 8,12,2,0 13,20,3,0 16685 5,7,2,0 7,9,3,0 12,16,5,0 16954 7,10,2,0 6,10,1,0 13,20,3,0 16685 6,7,4,0 6,9,1,0 12,16,5,0 16976 6,9,2,0 6,7,4,0 12,16,6,0 16685 8,9,5,0 4,7,0,0 12,16,5,0 16976 6,8,3,0 6,8,3,0 12,16,6,0 16718 6,9,2,0 6,9,2,0 12,18,4,0 17023 6,7,4,0 6,7,3,1 12,14,7,1 16718 6,8,3,0 6,10,1,0 12,18,4,0 17023 8,9,5,0 
4,5,2,1 12,14,7,1 16732 6,7,3,0 6,7,3,1 12,14,6,1 17023 5,6,3,1 7,8,4,0 12,14,7,1 16732 4,5,1,1 8,9,5,0 12,14,6,1 17050 5,6,3,0 8,11,2,0 13,17,5,0 16759 5,6,2,0 8,11,2,0 13,17,4,0 17050 5,8,0,0 8,9,5,0 13,17,5,0 16765 6,9,2,0 6,7,3,1 12,16,5,1 17050 5,6,2,0 8,11,3,0 13,17,5,0 16765 6,7,3,0 6,9,2,1 12,16,5,1 17050 6,8,2,0 7,9,3,0 13,17,5,0 16765 7,10,2,0 5,6,3,1 12,16,5,1 17050 6,7,3,0 7,10,2,0 13,17,5,0 16765 4,5,2,1 8,11,3,0 12,16,5,1 17050 6,9,1,0 7,8,4,0 13,17,5,0 16792 5,8,1,0 8,11,2,0 13,19,3,0 17050 5,7,1,0 8,10,4,0 13,17,5,0 16792 5,8,0,0 8,11,3,0 13,19,3,0 17056 6,7,4,0 6,9,2,1 12,16,6,1 16792 6,9,1,0 7,10,2,0 13,19,3,0 17083 6,9,2,0 7,10,2,0 13,19,4,0

16792 5,7,1,0 8,12,2,0 13,19,3,0 17083 5,8,1,0 8,11,3,0 13,19,4,0 16798 6,9,2,0 6,9,2,1 12,18,4,1 17083 5,7,2,0 8,12,2,0 13,19,4,0 16812 6,7,3,1 6,7,3,1 12,14,6,2 17083 6,10,1,0 7,9,3,0 13,19,4,0 16814 4,5,2,0 8,10,4,0 12,15,6,0 17105 6,8,3,0 6,7,4,0 12,15,7,0 16814 4,6,1,0 8,9,5,0 12,15,6,0 17130 7,10,2,0 6,7,3,1 13,17,5,1 16814 5,6,3,0 7,9,3,0 12,15,6,0 17130 5,6,3,1 8,11,2,0 13,17,5,1 16814 6,8,3,0 6,7,3,0 12,15,6,0 17157 6,9,1,0 8,11,2,0 14,20,3,0 16814 5,7,2,0 7,8,4,0 12,15,6,0 17163 7,10,2,0 6,9,2,1 13,19,4,1 16814 6,7,4,0 6,8,2,0 12,15,6,0 17179 5,6,2,0 8,10,4,0 13,16,6,0 16845 6,9,2,1 6,7,3,1 12,16,5,2 17179 6,8,2,0 7,8,4,0 13,16,6,0 16847 6,9,2,0 6,8,3,0 12,17,5,0 17179 6,7,3,0 7,9,3,0 13,16,6,0 17179 5,7,1,0 8,9,5,0 13,16,6,0 17578 7,10,2,0 7,9,3,0 14,19,5,0 17212 6,9,2,0 7,9,3,0 13,18,5,0 17600 5,6,3,0 8,9,5,0 13,15,8,0 17212 5,8,1,0 8,10,4,0 13,18,5,0 17600 6,7,4,0 7,8,4,0 13,15,8,0 17212 5,6,3,0 8,12,2,0 13,18,5,0 17611 6,9,2,0 8,12,2,0 14,21,4,0 17212 6,8,3,0 7,10,2,0 13,18,5,0 17611 6,10,1,0 8,11,3,0 14,21,4,0 17212 5,7,2,0 8,11,3,0 13,18,5,0 17658 6,7,3,1 8,12,2,0 14,19,5,1 17212 6,10,1,0 7,8,4,0 13,18,5,0 17680 8,9,5,0 5,6,3,1 13,15,8,1 17234 6,7,4,0 6,7,4,0 12,14,8,0 17691 6,9,2,1 8,12,2,0 14,21,4,1 17259 7,9,3,0 6,7,3,1 13,16,6,1 17707 6,7,4,0 8,11,2,0 14,18,6,0 17286 6,8,2,0 8,11,2,0 14,19,4,0 17707 6,8,2,0 8,10,4,0 14,18,6,0 17292 5,6,3,1 8,12,2,0 13,18,5,1 17707 6,7,3,0 8,11,3,0 14,18,6,0 17292 7,9,3,0 6,9,2,1 13,18,5,1 17707 6,9,1,0 8,9,5,0 14,18,6,0 17308 5,6,2,0 8,9,5,0 13,15,7,0 17707 7,10,2,0 7,8,4,0 14,18,6,0 17308 6,7,3,0 7,8,4,0 13,15,7,0 17707 7,9,3,0 7,9,3,0 14,18,6,0 17319 6,9,1,0 8,12,2,0 14,21,3,0 17740 6,9,2,0 8,11,3,0 14,20,5,0 17319 6,10,1,0 8,11,2,0 14,21,3,0 17740 6,8,3,0 8,12,2,0 14,20,5,0 17341 6,9,2,0 7,8,4,0 13,17,6,0 17740 6,10,1,0 8,10,4,0 14,20,5,0 17341 5,8,1,0 8,9,5,0 13,17,6,0 17787 6,7,3,1 8,11,3,0 14,18,6,1 17341 5,6,3,0 8,11,3,0 13,17,6,0 17814 7,10,2,0 8,11,2,0 15,21,4,0 17341 6,8,3,0 7,9,3,0 13,17,6,0 17820 
6,9,2,1 8,11,3,0 14,20,5,1 17341 5,7,2,0 8,10,4,0 13,17,6,0 17836 6,8,2,0 8,9,5,0 14,17,7,0 17341 6,7,4,0 7,10,2,0 13,17,6,0 17836 6,7,3,0 8,10,4,0 14,17,7,0 17388 7,8,4,0 6,7,3,1 13,15,7,1 17836 7,8,4,0 7,9,3,0 14,17,7,0 17415 6,7,3,0 8,11,2,0 14,18,5,0 17869 6,9,2,0 8,10,4,0 14,19,6,0 17421 5,6,3,1 8,11,3,0 13,17,6,1 17869 6,8,3,0 8,11,3,0 14,19,6,0 17421 7,8,4,0 6,9,2,1 13,17,6,1 17869 6,7,4,0 8,12,2,0 14,19,6,0 17449 6,9,2,0 8,11,2,0 14,20,4,0 17869 6,10,1,0 8,9,5,0 14,19,6,0 17449 6,8,2,0 8,12,2,0 14,20,4,0 17916 8,10,4,0 6,7,3,1 14,17,7,1 17449 6,9,1,0 8,11,3,0 14,20,4,0 17943 7,9,3,0 8,11,2,0 15,20,5,0 17449 7,10,2,0 7,10,2,0 14,20,4,0 17949 8,10,4,0 6,9,2,1 14,19,6,1

17471 5,6,3,0 8,10,4,0 13,16,7,0 17965 6,7,3,0 8,9,5,0 14,16,8,0 17471 6,8,3,0 7,8,4,0 13,16,7,0 17965 7,8,4,0 7,8,4,0 14,16,8,0 17471 5,7,2,0 8,9,5,0 13,16,7,0 17976 7,10,2,0 8,12,2,0 15,22,4,0 17471 6,7,4,0 7,9,3,0 13,16,7,0 17998 6,9,2,0 8,9,5,0 14,18,7,0 17482 6,10,1,0 8,12,2,0 14,22,3,0 17998 6,8,3,0 8,10,4,0 14,18,7,0 17495 6,7,3,1 8,11,2,0 14,18,5,1 17998 6,7,4,0 8,11,3,0 14,18,7,0 17529 6,9,2,1 8,11,2,0 14,20,4,1 18045 8,9,5,0 6,7,3,1 14,16,8,1 17551 8,10,4,0 5,6,3,1 13,16,7,1 18072 7,8,4,0 8,11,2,0 15,19,6,0 17578 6,8,3,0 8,11,2,0 14,19,5,0 18078 8,9,5,0 6,9,2,1 14,18,7,1 17578 6,8,2,0 8,11,3,0 14,19,5,0 18105 7,10,2,0 8,11,3,0 15,21,5,0 17578 6,7,3,0 8,12,2,0 14,19,5,0 18105 7,9,3,0 8,12,2,0 15,21,5,0 17578 6,9,1,0 8,10,4,0 14,19,5,0 18127 6,8,3,0 8,9,5,0 14,17,8,0 18127 6,7,4,0 8,10,4,0 14,17,8,0

18179 8,11,2,0 8,11,2,0 16,22,4,0

18234 7,10,2,0 8,10,4,0 15,20,6,0

18234 7,8,4,0 8,12,2,0 15,20,6,0

18234 7,9,3,0 8,11,3,0 15,20,6,0

18256 6,7,4,0 8,9,5,0 14,16,9,0

18341 8,11,2,0 8,12,2,0 16,23,4,0

18363 7,10,2,0 8,9,5,0 15,19,7,0

18363 8,10,4,0 7,9,3,0 15,19,7,0

18363 7,8,4,0 8,11,3,0 15,19,7,0

18470 8,11,2,0 8,11,3,0 16,22,5,0

18492 8,10,4,0 7,8,4,0 15,18,8,0

18492 8,9,5,0 7,9,3,0 15,18,8,0

18503 8,12,2,0 8,12,2,0 16,24,4,0

18600 8,10,4,0 8,11,2,0 16,21,6,0

18622 8,9,5,0 7,8,4,0 15,17,9,0

18633 8,12,2,0 8,11,3,0 16,23,5,0

18729 8,9,5,0 8,11,2,0 16,20,7,0

18762 8,10,4,0 8,12,2,0 16,22,6,0

18762 8,11,3,0 8,11,3,0 16,22,6,0

18891 8,10,4,0 8,11,3,0 16,21,7,0

18891 8,9,5,0 8,12,2,0 16,21,7,0

19020 8,10,4,0 8,10,4,0 16,20,8,0

19020 8,9,5,0 8,11,3,0 16,20,8,0

19149 8,10,4,0 8,9,5,0 16,19,9,0

19278 8,9,5,0 8,9,5,0 16,18,10,0


Observed Glycoforms

Mass     HexNAc   Hex   NeuAc   Sulf   No. of theor comp.   Symbol in Table 1
14350    8        12    2       0      3
14479    8        11    3       0      2
14715    9        13    2       0      5
14877    9        14    2       0      2                    a
15006    9        13    3       0      3                    b
15135    9        12    4       0      2                    c
15210    10       13    3       0      4
15265    9        11    5       0      1                    d
15372    10       14    3       0      8
15405    10       16    2       0      3                    e
15534    10       15    3       0      4                    f
15663    10       14    4       0      5                    g
15792    10       13    5       0      3                    h
15866    11       14    4       0      4
15899    11       16    3       0      7                    i
15921    10       12    6       0      2                    j
16028    11       15    4       0      9                    k
16061    11       17    3       0      2
16083    10       13    6       0      0
16190    11       16    4       0      3                    l
16286    11       13    6       0      3
16319    11       15    5       0      3                    m
16353    11       17    4       0      0
16427    12       18    3       0      5                    p
16449    11       14    6       0      2                    n
16556    12       17    4       0      8                    q
16578    11       13    7       0      1                    o
16611    11       15    6       0      0
16652    12       14    6       0      3
16685    12       16    5       0      9                    r
16740    11       14    7       0      0
16814    12       15    6       0      6
16847    12       17    5       0      2                    s
16869    11       13    8       0      0
16943    12       14    7       0      3
16976    12       16    6       0      2                    t
17072    12       13    8       0      0
17105    12       15    7       0      1
17234    12       14    8       0      1
17267    12       16    7       0      0
17396    12       15    8       0      0
17471    13       16    7       0      4
17526    12       14    9       0      0
17600    13       15    8       0      2
17633    13       17    7       0      0
17762    13       16    8       0      0
17795    13       18    7       0      0
17891    13       15    9       0      0
17924    13       17    8       0      0
17998    14       18    7       0      3
18053    13       16    9       0      0
18127    14       17    8       0      2
18160    14       19    7       0      0
18256    14       16    9       0      1
18289    14       18    8       0      0
18363    15       19    7       0      3
18418    14       17    9       0      0
18492    15       18    8       0      2
18525    15       20    7       0      0
18688    15       21    7       0      0
18891    16       21    7       0      2

Mass - Average mass of the glycoform.
Number of theor compositions - the number of different combinations of observed glycans that can form a particular glycoform. "0" indicates that the given composition cannot be matched to any combination of observed glycans.
Symbol in Table 1 - associates a particular glycoform with a peak in Table 1 and Figure 3A.

Glycan compositions of low abundance intact protein glycoforms migrating at later times suggested the presence of 9 sialic acids in the intact glycoproteins (Figure 3B and Table 4). Since there are only two glycosylation sites in r-αhCG, one of these sites must, presumably, contain at least 5 sialic acids. Importantly, as can be seen in Table 2, several structures with 5 sialic acids were observed in the analysis of released glycans. In addition, it should be mentioned that some low intensity intact protein masses could not be matched to any glycan compositions, suggesting additional modifications of the protein backbone such as truncation of the N-terminus, acetylation, etc.


4.3.6 Analysis of r-αhCG Expressed in CHO Cell Culture

Having defined the glycosylation of r-αhCG from a murine cell line, we applied the developed CE-MS method to characterize intact glycoforms of the same protein, r-αhCG, expressed in CHO cells. It is expected that different cell lines and cell culture conditions could impact the glycosylation of r-αhCG. Recombinant hCG obtained from CHO cells was supplied in a formulated form containing high concentrations of phosphoric acid and sucrose, which was found to be incompatible with CE analysis. To circumvent this, the protein was purified from other excipients using molecular weight cutoff filters (see Experimental Section).

Figure 7. CE-MS separation of r-αhCG produced in CHO cells. (A) Total ion electropherogram. (B)-(E) Extracted ion electropherograms for individual highly abundant glycoforms. Other conditions as in Figure 2. [Electropherograms omitted; axes: relative abundance versus time (min). Panel annotations: (A) TIC, 8.9 min, MW = 15266, HexNAc11Hex9NeuAc5; (B) 7.3 min, MW = 13737, HexNAc10Hex8NeuAc; (C) 7.8 min, MW = 14028, HexNAc10Hex8NeuAc2; (D) 8.9 min, MW = 14321, HexNAc10Hex8NeuAc3; (E) 10.3 min, MW = 14611, HexNAc10Hex8NeuAc4.]

The desalted CHO sample was separated and analyzed using the CE-MS system (Figure 5). We found that the glycosylation of r-αhCG produced in a CHO cell line differed significantly from that of the same protein expressed in mouse cells. Using the high mass accuracy of the FT MS, compositions of all major intact protein glycoforms were determined (see Table 5).

The glycosylation was found to be limited to several forms primarily differing by the number of sialic acids. Based on the separation pattern and high mass accuracy, glycoforms with up to 5 sialic acids were observed in the CHO-derived product, in contrast to glycoforms observed from the mouse cell line-derived product containing 9 sialic acids.

Comparison of the CE-MS results for CHO cell derived r-αhCG (Figure 7A) with those obtained for murine derived r-αhCG (Figure 2A) demonstrates that there are dramatic differences between the two. Interestingly, based on calculated compositions, one of the highly abundant glycoforms (MW = 12,404 Da) could be matched to the glycan composition HexNAc4Hex5NeuAc2, suggesting glycosylation of only one of the two available sites. Importantly, this observation is consistent with the migration position of this particular glycoform relative to the other intact protein glycoforms. As expected, this glycoform (MW = 12,404 Da, composition HexNAc4Hex5NeuAc2) migrated between the form with two glycans but only one sialic acid (13,737 Da, HexNAc8Hex10NeuAc) and the form with two glycans and two sialic acids (14,028 Da, HexNAc8Hex10NeuAc2).

Table 5. Abundance of r-αhCG glycoforms produced in CHO cells.

Glycoprotein MW (Da)   Glycan mass (Da)   HexNAc   Hex   NeuAc   Area
12404                  2208               4        5     2       10.2%
13446                  3250               8        10    0       2.5%
13737                  3541               8        10    1       11.5%
14028                  3832               8        10    2       25.1%
14320                  4124               8        10    3       32.1%
14611                  4415               8        10    4       10.1%
14685                  4489               9        11    3       2.6%
14976                  4780               9        11    4       3.8%
15266                  5070               9        11    5       2.1%
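
The internal consistency of the Table 5 assignments can be checked with simple mass arithmetic. The sketch below is illustrative only and is not the procedure used in this work. It confirms that every row implies the same backbone mass (MW minus glycan mass), and that the mass spacing between any two glycoforms matches the spacing predicted from their compositions. The average monosaccharide residue masses are standard literature values assumed here, not values taken from this thesis, and the function names are hypothetical.

```python
from itertools import combinations

# Average residue masses in Da (standard literature values; an assumption,
# not values reported in this thesis).
RESIDUE_MASS = {"HexNAc": 203.19, "Hex": 162.14, "NeuAc": 291.26}

# (glycoprotein MW, glycan mass, HexNAc, Hex, NeuAc), transcribed from Table 5.
TABLE_5 = [
    (12404, 2208, 4, 5, 2),
    (13446, 3250, 8, 10, 0),
    (13737, 3541, 8, 10, 1),
    (14028, 3832, 8, 10, 2),
    (14320, 4124, 8, 10, 3),
    (14611, 4415, 8, 10, 4),
    (14685, 4489, 9, 11, 3),
    (14976, 4780, 9, 11, 4),
    (15266, 5070, 9, 11, 5),
]


def backbone_masses(rows):
    """Return the set of backbone masses implied by MW minus glycan mass."""
    return {mw - glycan for mw, glycan, *_ in rows}


def composition_mass(hexnac, hexose, neuac):
    """Sum of average residue masses for a given glycan composition."""
    return (hexnac * RESIDUE_MASS["HexNAc"]
            + hexose * RESIDUE_MASS["Hex"]
            + neuac * RESIDUE_MASS["NeuAc"])


def max_pairwise_deviation(rows):
    """Largest |observed MW difference - composition-predicted difference| in Da."""
    worst = 0.0
    for a, b in combinations(rows, 2):
        observed = b[0] - a[0]
        predicted = composition_mass(*b[2:]) - composition_mass(*a[2:])
        worst = max(worst, abs(observed - predicted))
    return worst


if __name__ == "__main__":
    print("Implied backbone mass(es):", backbone_masses(TABLE_5))
    print(f"Max pairwise deviation: {max_pairwise_deviation(TABLE_5):.2f} Da")
```

With these assumed residue masses, all nine rows imply a single backbone mass of 10196 Da, and the spacing between any pair of glycoforms agrees with the assigned compositions to within 2 Da, which is consistent with the integer rounding of the tabulated masses.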

4.4 Conclusions

We report here the use of high resolution CE-FTMS for the analysis of r-αhCG. The studies demonstrate that the method can be rapid, accurate, and information rich. The studies presented here also suggest how CE with or without MS can be used to develop rapid tests to assess product quality either for release or for in-process control. In the context of future studies, larger glycoproteins than r-αhCG, such as monoclonal antibodies, can be assessed in a similar manner using newer MS instruments such as an Orbitrap. Nevertheless, the fundamental usefulness of a CE-MS technique, and its role in the larger context of glycoprotein characterization, has been demonstrated here.

Finally, techniques of the kind described here can play a role in assessing the variability of a glycoprotein therapeutic not only within a process but also between processes.

Defining methodologies to assess and handle the complexity of these types of therapeutics is a source of discussion around follow-on biologics [30]. In fact, one of the major points in the debate is whether the technology exists to characterize these complex molecules in a comprehensive manner. Over the last two decades, many novel developments and improvements in bioanalytical technologies have been introduced. Specifically, numerous publications have described technologies for the characterization of glycoproteins at different levels, using either “top-down” or “bottom-up” approaches. Although numerous reports on the analysis of glycans and peptides have been published, few studies have addressed the detailed structural analysis of intact glycoproteins. Even fewer reports discuss the characterization of the different components of a glycoprotein mixture and the correlation of this information between different sets of analyses. We believe that studies of the type presented here, which assess the application of a technology to a pertinent problem and how that technology complements and extends existing analyses, can help define the scientific issues involved in this important debate.

4.5 References

[1] C.H. Chung, B. Mirakhur, E. Chan, Q.T. Le, J. Berlin, M. Morse, B.A. Murphy, S.M. Satinover, J. Hosen, D. Mauro, R.J. Slebos, Q. Zhou, D. Gold, T. Hatley, D.J. Hicklin, T.A. Platts-Mills, N Engl J Med 358 (2008) 1109.

[2] U. Galili, E.A. Rachmilewitz, A. Peleg, I. Flechner, J Exp Med 160 (1984) 1519.

[3] A.H. Good, D.K. Cooper, A.J. Malcolm, R.M. Ippolito, E. Koren, F.A. Neethling, Y. Ye, N. Zuhdi, L.R. Lamontagne, Transplant Proc 24 (1992) 559.

[4] I.S. Krull, S. Kazmi, H. Zhong, L.C. Santora, Methods Mol Biol 213 (2003) 197.

[5] R.D. Smith, H. Udseth, Nature 331 (1988) 639.

[6] J.F. Kelly, S.J. Locke, L. Ramaley, P. Thibault, J Chromatogr A 720 (1996) 409.

[7] B. Yeung, T.J. Porter, J.E. Vath, Anal Chem 69 (1997) 2510.

[8] U.M. Demelbauer, A. Plematl, L. Kremser, G. Allmaier, D. Josic, A. Rizzi, Electrophoresis 25 (2004) 2026.

[9] S. Amon, A. Zamfir, A. Rizzi, Electrophoresis 29 (2008) 2485.

[10] C. Neusüß, U. Demelbauer, M. Pelzing, Electrophoresis 26 (2005) 1442.

[11] P.J. Domann, A.C. Pardos-Pardos, D.L. Fernandes, D.I. Spencer, C.M. Radcliffe, L. Royle, R.A. Dwek, P.M. Rudd, Proteomics 7 Suppl 1 (2007) 70.

[12] A. Guttman, Nature 380 (1996) 461.

[13] C.J. Edge, T.W. Rademacher, M.R. Wormald, R.B. Parekh, T.D. Butters, D.R. Wing, R.A. Dwek, Proc Natl Acad Sci USA 89 (1992) 6338.

[14] D.J. Harvey, J Mass Spectrom 40 (2005) 642.

[15] M. Ninonuevo, H. An, H. Yin, K. Killeen, R. Grimm, R. Ward, B. German, C. Lebrilla, Electrophoresis 26 (2005) 3641.

[16] M. Wuhrer, M.I. Catalina, A.M. Deelder, C.H. Hokke, J Chromatogr B Analyt Technol Biomed Life Sci 849 (2007) 115.

[17] B. Domon, C. Costello, Biochemistry 27 (1988) 1534.

[18] A. Ceroni, A. Dell, S. Haslam, Source Code Biol Med 2 (2007) 3.

[19] E. Van Rossen, S. Vander Borght, L. van Grunsven, H. Reynaert, V. Bruggeman, R. Blomhoff, T. Roskams, A. Geerts, Histochem Cell Biol 131 (2009) 313.

[20] A. Gervais, Y.A. Hammel, S. Pelloux, P. Lepage, G. Baer, N. Carte, O. Sorokine, J.M. Strub, R. Koerner, E. Leize, A. Van Dorselaer, Glycobiology 13 (2003) 179.

[21] Y.-G. Kim, S.-Y. Kim, Y.-M. Hur, H.-S. Joo, J. Chung, D.-S. Lee, L. Royle, P.M. Rudd, R.A. Dwek, D.J. Harvey, B.-G. Kim, Proteomics 6 (2006) 1133.

[22] T.W. Chung, K.S. Kim, S.K. Moon, J.W. Lee, E.Y. Song, T.H. Chung, Y.I. Yeom, C.H. Kim, Mol Cells 16 (2003) 343.

[23] R.L. Easton, M.S. Patankar, F.A.L. Lattanzio, T.H, H.R. Morris, G.F. Clark, A. Dell, J Biol Chem 275 (2000) 7731.

[24] D.H. Joziasse, N.L. Shaper, D. Kim, D.H. van den Eijnden, J.H. Shaper, J Biol Chem 267 (1992) 5534.

[25] B. Coddeville, E. Regoeczi, G. Strecker, Y. Plancke, G. Spik, Biochim Biophys Acta 1475 (2000) 321.

[26] Y. Mechref, P. Chen, M. Novotny, Glycobiology 9 (1999) 227.

[27] E. Balaguer, U. Demelbauer, M. Pelzing, V. Sanz-Nebot, J. Barbosa, C. Neusüß, Electrophoresis 27 (2006) 2638.

[28] V.I. Otto, E. Damoc, L.N. Cueni, T. Schurpf, R. Frei, S. Ali, N. Callewaert, A. Moise, J.A. Leary, G. Folkers, M. Przybylski, Glycobiology 16 (2006) 1033.

[29] N. Kojima, Y. Tachida, Y. Yoshida, S. Tsuji, J Biol Chem 271 (1996) 19457.

[30] A. Dove, Nature Biotechnology 19 (2001) 117.

CHAPTER 5 SUMMARY AND FUTURE DIRECTIONS

PART I

A consequence of the advent of more non-invasive biopsy techniques, small organ size, rare disease types and specific sample collection tools such as laser capture microdissection (LCM) is that the sample amount available for proteomic analysis is very small. To comprehensively profile the proteome from these limited sample amounts, such as 10,000 LCM cells, and to obtain meaningful biological information, miniaturization of the LC-MS analysis platform was required. The development of porous layer open tubular (PLOT) chromatography columns represents a major advance in the field of micro-proteomics because of the low sample consumption and the high sensitivity that result from operating the PLOT column at 20 nL/min. In Chapter 2, a sample handling technique was developed that reduces protein losses while effectively performing protein digestion. Consistent identification of more than 1000 proteins using a sample loading amount equivalent to 1000 cells, ~1 μg protein, demonstrated the sensitive and reproducible performance of the overall proteomic platform. The comparison of 10,000 LCM primary and metastatic breast cancer cells (three replicates of each type) in this study detected changes related to important functional categories such as glycolysis and the extracellular matrix.

The developed platform was further modified to yield an online 2D-RP/SCX/SPE/PLOT LC-FT-MS micro-proteomics platform for the comparative proteomic analysis of LCM collected normal and triple negative breast cancer (TNBC) cell populations. The extensive peptide level fractionation identified more than 2000 proteins while consuming sample equivalent to just 4000 cells or ~1 μg protein. Gene set enrichment analysis highlighted up- and down-regulation of proteins belonging to cell cycle and extracellular membrane pathways, respectively, in TNBC samples. The detection of all members of the minichromosome maintenance (MCM) protein family (MCM2-7), which are known to function as replication initiation factors, provides convincing evidence that comprehensive proteomic characterization of small samples using the 2D PLOT proteomic platform is feasible.

To consolidate the current findings and to mine the biological information further, more biological replicates will be analyzed. The analysis of biological specimens from other breast cancer subtypes is under consideration, as this may provide important information about their biological differences. Further modification of the PLOT platform, in particular its combination with microfluidic-based online sample preparation, is also an active research area that will build on the data presented in this thesis.

PART II

Another complex analytical problem is the characterization of protein biotherapeutics, owing to the macro- and microheterogeneity arising from the glycosylation present on the protein and the variety of glycans present at each site. The development and application of a high resolution CE-FTMS method for intact glycoform profiling of recombinant α-human chorionic gonadotrophin, which allows rapid analysis of samples for in-process and quality control checks, is presented in Chapter 4. In addition to the top-down approach, the bottom-up approach, i.e. profiling analysis of glycopeptides and glycans, determined and assigned the population of oligosaccharides present at each individual glycosite, thereby facilitating complete and comprehensive characterization of r-αhCG. Glycoproteins larger than r-αhCG, such as monoclonal antibodies, can be assessed in a similar manner using newer MS instruments such as an Orbitrap, which can maintain high resolution at larger m/z values than the FT-MS instrument. To assess more complex mixtures, or even to investigate isobaric glycoforms, CE-MS coupled with gas phase separation of co-migrating isobaric glycoforms by fast ion mobility separations could play a role in comprehensive glycoprotein characterization.