ADVANCES IN ANALYSIS OF NOVEL PROTEIN THERAPEUTICS AND CLINICAL

SAMPLES WITH LC-MS/MS METHODS

A Dissertation Presented

by

Fangfei Yan

to

The Department of Chemistry and Chemical Biology

In partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in the field of

Chemistry

Northeastern University

Boston, Massachusetts

April 9, 2013

ADVANCES IN ANALYSIS OF NOVEL PROTEIN THERAPEUTICS AND CLINICAL

SAMPLES WITH LC-MS/MS METHODS

by

Fangfei Yan

ABSTRACT OF DISSERTATION

Submitted in partial fulfillment of the requirements

for the Degree of Doctor of Philosophy in Chemistry

in the College of Science of

Northeastern University

April 9, 2013

Abstract

Analytical characterization is essential for ensuring drug efficacy, stability, and safety. Significant concerns about protein stability and potential immunogenicity require detailed characterization of protein therapeutics, so the ability to detect product variants from non-clonal cell lines is very important. LC-MS/MS is well suited to this task. This Ph.D. research discusses advances in the analysis of novel protein therapeutics and clinical samples with LC-MS/MS methods. Recombinant protein therapeutics are a major focus of my research work. In Chapter 1, the production of recombinant protein therapeutics and their potential heterogeneity, in particular post-translational modifications, are reviewed. Common techniques for protein characterization are discussed, including electrophoresis, chromatography, and mass spectrometry. Because the major focus of my research is mass spectrometry, further details about ionization methods, mass analyzers, hybrid mass spectrometers, fragmentation methods, and LC-MS/MS analysis are reviewed. In addition, common strategies for protein characterization are discussed. Furthermore, the biological background for the proteomic analysis is reviewed. ERBB2, an important oncogene, is of particular interest, and its association with cancer is included.

Chapters 2, 3, and 4 detail my research projects. Chapter 2 presents the comprehensive mass spectrometric characterization of recombinant phenylalanine ammonia lyase (rAvPAL). Full characterization of this protein, including identification of its active site, is discussed. The active sites of RtPAL and GFP, which have similar structures, were also analyzed. The characterization process and strategies developed here are applied in the work of Chapter 3. Chapter 3 discusses the characterization by LC-MS/MS of disulfide bond linkages and improper disulfide bridging in a variant of lysosomal acidic glucosidase (GAA). Here, details are given of the disulfide bond linkage analysis, in which ETD (electron transfer dissociation) is combined with CID (collision induced dissociation) for fragmentation. The workflow, including gel-based protein separation, enzymatic digestion, and LC-MS/MS analysis, was then applied to the proteomic studies in Chapter 4. Chapter 4 covers the proteomic and genomic analysis of gastric cancer patient tissues, including proteomic studies performed with LC-MS/MS. The gastric cancer study is part of the C-HPP Initiative, and the combined exploration of proteomics and transcriptomics has provided in-depth analysis in oncology.

ACKNOWLEDGEMENTS

First of all, I want to thank my advisor, Professor William S. Hancock, for giving me the great opportunity to be in his research group and for bringing me into the exciting world of research. Thank you so much for your great instruction and kind help all these years. Your insight into research advancement is of great value in my life. Thank you so much for all your support, patience, and encouragement. I feel so lucky to have you as my advisor. Also, I want to thank Professor Shiaw-lin (Billy) Wu for his great instruction on my projects in collaboration with industry. I have learned a lot from your valuable experience in mass spectrometry. Thank you so much for your patient guidance and support. Lastly, I want to thank Professor Barry Karger for providing a great environment for research at the Barnett Institute.

In addition, many thanks to the members of my Ph.D. committee: Professor Paul Vouros, Professor Zhaohui (Sunny) Zhou, Professor Michael Pollastri, and Professor Erno Pungor. Professor Vouros has given me great instruction regarding mass spectrometry and related techniques. He has also given me very helpful advice on my Ph.D. studies. Professor Zhou has given me a lot of advice on my Ph.D. studies, seminar talk, and thesis completion. Professor Pollastri also provided helpful suggestions and great support in my thesis defense. Last but not least, Professor Pungor provided a lot of support for my Ph.D. research projects and has given me much solid instruction on detailed experiments. Thank you all for being on my committee, and thank you for your great advice and instruction on my thesis.

Additionally, grateful thanks are given to the members, both former and present, of Professor Hancock’s group: Professor Marina Hincapie, Dr. Xiaoyang Zheng, Dr. Haitao Jiang, Dr. Qiaozhen Lu, Dr. Yi Wang, Dr. Zhi Zeng, Dr. Agnes Ralfako, Dr. Majlinda Kullolli, Xiaomei He, Suli Liu, Fan Zhang, Yue Zhang, Francisca Sekyere, and KyOnese Taylor. Thank you very much for your kind help and support in my Ph.D. research. Grateful thanks are also given to the members of Professor Karger’s group: Professor Shujia Dai, Dr. Sangwon Cha, Dr. Quanzhou Luo, Dr. Dongdong Wang, Chen Li, Wenqin Ni, Zhenke Liu, and Siyang Li. Thank you very much for your help and friendship.

Besides, I want to give my sincere thanks to Dr. Melinda Hull, Dr. Khrystal DeHate, and Ms. Sheila Magee Beare from the College of Science; Jean Harris and Andrew Bean at the Department of Chemistry and Chemical Biology; and Nancy Carbone, Jeffrey Kasilman, Jana Volf, and Felicia Hopkins at the Barnett Institute. Thank you for giving me your sincere help during my study and research at Northeastern University.

Last but not least, I want to thank my family members: my mom, my dad, and my grandparents. Thank you so much for your encouragement and support all these years. Although we are in different countries, your love gives me continuous strength at all times.

Finally, special thanks are given to my husband, Xiaowei Wang. I am very lucky and happy to be with you. Thank you so much for your love and support. Being with you, I am able to confront any barrier.

Table of Contents

ABSTRACT

ACKNOWLEDGEMENTS

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

LIST OF ABBREVIATIONS

Chapter 1

Overview of Industrial Protein Therapeutics and Proteomic Studies

1.1 Introduction

1.2 Overview of Industrial Protein Therapeutics

1.2.1 Production of recombinant proteins

1.2.2 Heterogeneity associated with recombinant protein therapeutics

1.2.3 PTMs of protein therapeutics

1.2.3.1 Deamidation

1.2.3.2 Oxidation

1.2.3.3 Glycosylation

1.2.3.4 Other common types of PTMs

1.3 Techniques for Protein Characterization

1.3.1 Electrophoresis

1.3.2 Chromatography

1.3.3 Mass spectrometry

1.3.3.1 Ionization methods

1.3.3.1.1 Chemical ionization

1.3.3.1.2 Fast atom bombardment ionization

1.3.3.1.3 Field ionization/field desorption

1.3.3.1.4 Matrix-assisted laser desorption ionization

1.3.3.1.5 Thermospray ionization

1.3.3.1.6 Electrospray ionization

1.3.3.2 Mass analyzer

1.3.3.2.1 Magnetic sector

1.3.3.2.2 Time of flight

1.3.3.2.3 Quadrupole

1.3.3.2.4 Ion trap

1.3.3.2.5 FT-ICR

1.3.3.2.6 Orbitrap

1.3.3.3 Tandem mass spectrometry

1.3.3.3.1 Fragmentation method

1.3.3.3.1.1 Collision-induced dissociation

1.3.3.3.1.2 Electron-transfer dissociation

1.3.3.3.1.3 Electron-capture dissociation

1.3.3.3.2 Types of tandem mass spectrometry

1.4 Characterization Strategies for Proteins by LC-MS/MS

1.4.1 LC-MS/MS analysis

1.4.2 Top-down analysis of proteins

1.4.3 Bottom-up analysis of proteins

1.4.4 Multi-sequential enzymatic digestion strategy

1.4.5 Labeling quantitation in LC-MS/MS analysis

1.4.6 Spectral counts in quantitative proteome

1.5 ERBB2-driven Studies

1.5.1 Cancers associated with ERBB2

1.5.2 SNPs and ASVs of ERBB2

1.5.3 ERBB2 amplicon

1.5.4 Proteomic analysis in combination of transcriptomics

1.6 Conclusion

1.7 References

Chapter 2

Comprehensive Mass Spectrometric Characterization of Recombinant Phenylalanine Ammonia Lyase

2.1 Abstract

2.2 Introduction

2.3 Experimental Section

2.3.1 Reagents and materials

2.3.2 In-solution enzymatic digestion

2.3.3 SDS-PAGE and in-gel digestion

2.3.4 LC-MS/MS analysis

2.3.5 Peptide assignment

2.4 Results and Discussion

2.4.1 SDS-PAGE separation of rAvPAL

2.4.2 Intact analysis of rAvPAL

2.4.3 LC-MS/MS analysis of digested peptides of rAvPAL

2.4.4 Analysis of RtPAL and GFP with similar ring structure

2.4.5 More findings about free radical associated degradation

2.4.6 Analysis of rAvPAL with addition of reducing reagent

2.5 Conclusion

2.6 References

2.7 Supplementary Figures and Tables

Chapter 3

Disulfide Bond Linkage and Improper Disulfide Bridging of a Variant of Lysosomal Acidic Glucosidase by LC-MS/MS

3.1 Abstract

3.2 Introduction

3.3 Experimental Section

3.3.1 Reagents and Materials

3.3.2 In-solution native digestion

3.3.3 SDS-PAGE and in-gel digestion

3.3.4 LC-MS/MS analysis

3.4 Results and Discussion

3.4.1 SDS-PAGE Separation of GAA

3.4.2 Combination of ETD with CID in LC-MS/MS analysis

3.4.3 Disulfide bond analysis of GILT tag

3.4.4 Characterization of other disulfide linkages

3.5 Conclusion

3.6 References

Chapter 4

Proteomic and Genomic Analysis of Gastric Cancer Patient Tissues

4.1 Abstract

4.2 Introduction

4.3 Experimental Section

4.3.1 Reagents and Materials

4.3.2 FISH and anti-ERBB2 reactivity

4.3.3 Protein extraction, separation, fractionation, and in-gel digestion

4.3.4 LC-MS analysis by LTQ-Orbitrap

4.3.5 Peptide identification

4.4 Results and Discussion

4.4.1 Anti-ERBB2 reactivity for sample selection

4.4.2 Proteomic analysis of gastric cancer patient samples

4.4.3 ERBB2 characterization

4.4.4 Strategy for ERBB2-driven gastric cancer analysis

4.4.5 Combination of proteomic analysis with transcriptomics

4.4.6 Tumor vs. control samples comparison

4.4.7 ERBB2 amplicon

4.4.8 Pathway analysis based on oncogene interactions

4.4.9 Sub-pathway analysis of ERBB2+ vs. ERBB2- sample sets

4.5 Conclusion

4.6 References

4.7 Supplementary Figures and Tables

LIST OF FIGURES

Figure 1-1. Deamidation of Asn to Asp and iso-Asp.

Figure 1-2. Structures of amino acids prone to be oxidized

Figure 1-3. Illustration of quadrupole mass analyzer with ion trajectory

Figure 1-4. General strategy for analysis of protein samples by LC-MS

Figure 1-5. Bottom-up vs. top-down analysis

Figure 1-6. ERBB2 sequence comparison between ENSP00000269571 and the other three ASVs from Ensembl

Figure 1-7. PTMs and SNPs of ERBB2 sequence (1255 aa, ENSP00000269571)

Figure 1-8. Genome vs. transcriptome vs. proteome

Figure 2-1. Full sequence of rAvPAL with highlights of the tryptic peptide containing ASG sequence

Figure 2-2. MIO group formation through double dehydration at -Ala-Ser-Gly-

Figure 2-3. SDS-PAGE separation of rAvPAL

Figure 2-4. Deconvoluted mass analysis of the peaks from top-down analysis of intact rAvPAL molecule

Figure 2-5. XIC of trypsin and trypsin plus Glu-C digested peptide fragment

Figure 2-6. XIC and MS2 of fragmented peptide sequence from Ser 158 to Lys 189 in the 45 kDa band

Figure 2-7. Proposed degradation process of MIO group

Figure 2-8. XIC extraction of the tryptic peptide containing SYG in GFP

Figure 2-9. MS2 extraction of the tryptic peptide containing SYG in GFP

Figure 2-10. SDS-PAGE of rAvPAL with and without ascorbic acid

Figure 2-11. Extracted precursor m/z and targeted MS2 for confirmation of the ASG containing peptide after reduction (-H2O+2H)

Figure 2-S1. SDS-PAGE separation of RtPAL with different loading amounts

Figure 2-S2. Full sequence of RtPAL

Figure 2-S3. XIC extraction of one peptide fragment (EALTNFLNHGITPIVPLRGTISA) of RtPAL

Figure 2-S4a. XIC extraction of the C-term part of the peptide fragment (SGDLSPLSYIAAAISGHPDSK) of RtPAL

Figure 2-S4b. MS2 spectra of C-term fragmented tryptic peptide from 45 K band containing the ASG in RtPAL

Figure 2-S5a. TIC spectra of intact protein analysis of RtPAL

Figure 2-S5b. Mass deconvolution results of intact analysis of RtPAL

Figure 2-S6. Sequence of green fluorescent protein (GFP)

Figure 2-S7. Formation of HBDI group in GFP from SYG

Figure 2-S8. SDS-PAGE of green fluorescent protein

Figure 3-1. Full sequence of a variant of lysosomal acidic glucosidase

Figure 3-2. SDS-PAGE separation of B89 and BIP under non-reduced and reduced conditions

Figure 3-3. Correct disulfide linkage of the three tryptic peptides in GILT tag (a), unexpected disulfide linkage of P2 with P3 (b), P1 has another disulfide bond (c)

Figure 3-4. LC-MS/MS analysis of GILT tag disulfide bond linkage [P1+P2+P3] of FIN sample. (a) exact precursor m/z match, (b) CID-MS2, (c) ETD-MS2, (d) CID-MS3, (e-g) targeted CID-MS3

Figure 3-5. LC-MS/MS analysis of GILT tag disulfide bond linkage [P2+P3] of FIN sample. (a) exact precursor m/z match, (b) CID-MS2, (c) ETD-MS2, (d) CID-MS3

Figure 3-6. (a) precursor m/z from native digestion after deglycosylation; (b) precursor m/z from reduced digestion after deglycosylation; (c) CID-MS2 spectra for confirmation from reduced digestion

Figure 3-7. (a) Accurate mass measurement from MS scan; (b) CID-MS2 with backbone b and y ions

Figure 4-1. Strategy for analysis of ERBB2-driven gastric cancer

Figure 4-2. Plot of number of proteins identified in proteomic study for illustration of activity in gastric cancer

Figure 4-3. Diagram for comparison of protein IDs from different sample sets

Figure 4-4. Illustration of 17q12 region on chromosome 17, a region with oncogene ERBB2

Figure 4-5. ERBB2 interacted oncogene expression and transcriptomic proteomic expression in ERBB2+ vs. ERBB2- gastric cancer

Figure 4-6a. ERBB2 interacted oncogene expression and transcriptomic proteomic expression in ERBB2+ vs. ERBB2- gastric cancer

Figure 4-6b. Modified development ERBB signaling pathway (ERBB2+ vs. ERBB2– patient samples) from KEGG

Figure 4-7. Regulation of pathway from KEGG for ERBB2- set

Figure 4-S1. SDS-PAGE separation of extracted proteins from gastric cancer patient tissues

Figure 4-S2. A peptide identified in ERBB2 (a) and EGFR (b)

Figure 4-S3. CNV detection of Nimblegen CGH 385K for gastric cancer patient samples

Figure 4-S4. Protein identification in gastric cancer around ERBB2 on chromosome 17

Figure 4-S5. Tumor unique proteins in ERBB2 positive gastric cancer which are at bands q12, q21.1, q21.2 on chromosome 17 with illustration of spectral counts

Figure 4-S6. identified around EGFR on chromosome 17

Figure 4-S7. Illustration of ERBB family signaling and EGFR development pathway fit with the proteomics and transcriptomic data: a. ERBB2 signaling; b. development EGFR signaling via small GTPases; c. development EGFR signaling; d. cytoskeleton remodeling integrin outside-in signaling

LIST OF TABLES

Table 1-1. Description and cancer association of genes in PPP1R1B-ERBB2-GRB7 region

Table 2-1. Fragments observed after enzymatic digestion in 15, 45 and 60 kDa gel bands

Table 3-1. Disulfide connection at N-term region, comparison of retention time, intensity and mass (a, B87; b, B89; c, FIN)

Table 3-2. (a) List of Cys containing tryptic peptides; (b) Confirmation of disulfide linkage in GAA protein

Table 4-1. Top 20 proteins in the 196 tumor unique proteins from ERBB2+ set

Table 4-2. Major driver oncogenes in gastric cancer sets as well as RNA-Seq data from cell lines

Table 4-S1. Clinical parameters of two study groups which were heterogeneous except for the expression of the oncogene

Table 4-S2. Catalog of ‘quality’ observed peptides for ERBB2_HUMAN

Table 4-S3. Top 20 most abundant proteins in ERBB2+ gastric cancer sample set

Table 4-S4a. Tumor unique proteins in ERBB2+ sample set

Table 4-S4b. Tumor unique proteins in ERBB2- sample set

Table 4-S5. Tumor unique proteins that have interactions with the 21 oncogenes

LIST OF ABBREVIATIONS

2-D DIGE two-dimensional difference gel electrophoresis

AA anthranilic acid

AC alternating current

AEX anion exchange chromatography

AGP α1-acid glycoprotein

Ala alanine

APCI atmospheric pressure chemical ionization

Asn asparagine

Asp aspartic acid

ASVs alternative splice variants

BRCA1 breast cancer 1

CD circular dichroism

CHCA α-cyano-4-hydroxycinnamic acid

CHO Chinese hamster ovary

CHPP Chromosome-Centric Human Proteome Project

CI chemical ionization

CID collision induced dissociation

CT cholera toxin

CTA cholera toxin A subunit

CTB pentamer cholera toxin B homopentamer

Cys cysteine

DC direct current

DNA deoxyribonucleic acid

DTNB 5,5'-dithiobis(2-nitrobenzoic acid)

DTT dithiothreitol

ECD electron capture dissociation

E.coli Escherichia coli

EGF epidermal growth factor

EGFR epidermal growth factor receptor

EndoH Endo-β-N-acetylglucosaminidase H

EPO erythropoietin

ER endoplasmic reticulum

ERBB2 v-erb-b2 erythroblastic leukemia viral oncogene homolog 2

ESI electrospray ionization

ETD electron transfer dissociation

FAB fast atom bombardment ionization

FDA Food and Drug Administration

FD field desorption

FDR False discovery rate

FFT fast Fourier transformation

FI field ionization

FoxO3 forkhead box O3

FT-ICR Fourier transform ion cyclotron resonance

FTIR Fourier transform infrared spectroscopy

FSH follicle stimulating hormone

Gal galactose

Gla γ-carboxyglutamate

GlcNAc N-acetylglucosamine

Gln glutamine

Glu glutamic acid

GLP good laboratory practice

GMP good manufacturing practice

GRB7 growth factor receptor-bound protein 7

HCCA α-cyano-4-hydroxycinnamic acid

HIC hydrophobic interaction chromatography

HILIC hydrophilic interaction chromatography

HPAEC-PAD high performance anion exchange chromatography with pulsed amperometric detection

HPLC high performance liquid chromatography

Hya β-hydroxyaspartate

IBC inflammatory breast cancer

ICAT isotope-coded affinity tag

ICH International Conference on Harmonization

IEF isoelectric focusing

IgG immunoglobulin G

IKZF3 IKAROS family zinc finger 3

IsoAsp isoaspartic acid

iTRAQ isobaric tag for relative and absolute quantitation

LC liquid chromatography

LC-MS liquid chromatography mass spectrometry

LC-MS/MS liquid chromatography with tandem mass spectrometry

LIT linear ion trap

m/z mass to charge ratio

mAb monoclonal antibody

MALDI matrix assisted laser desorption ionization

Met methionine

MIEN1 migration and invasion enhancer 1

MRM multiple reaction monitoring

MS mass spectrometry

MS/MS tandem mass spectrometry

NMR nuclear magnetic resonance

NOE nuclear Overhauser effect

NSAF normalized spectral abundance factor

PAGE polyacrylamide gel electrophoresis

PEG polyethylene glycol

PEI polyethylenimine

PGAP3 post-GPI attachment to proteins 3

PIMT protein isoaspartate methyltransferase

PNGaseF peptide-N4-(N-acetyl-β-glucosaminyl)asparagine amidase F

PNMT phenylethanolamine N-methyltransferase

PPP1R1B protein phosphatase 1, regulatory (inhibitor) subunit 1B

PTM posttranslational modification

QQQ triple quadrupole

QTOF quadrupole coupled with time-of-flight

RF radio frequency

RIA radioimmunoassay

RNA-Seq transcriptomic sequence

RP reversed phase

RP-HPLC reversed phase high performance liquid chromatography

SEC size exclusion chromatography

Ser serine

SDS sodium dodecyl sulfate

SDS-PAGE sodium dodecyl sulfate polyacrylamide gel electrophoresis

SILAC stable isotopic labeling of amino acids in cell culture

SNPs single-nucleotide polymorphisms

SPR surface plasmon resonance

STARD3 StAR-related lipid transfer (START) domain containing 3

TBHP tert-butyl hydroperoxide

TCAP titin-cap (telethonin)

Thr threonine

TIAF1 TGFB1-induced anti-apoptotic factor 1

TOF time-of-flight

TOP2A topoisomerase (DNA) II alpha 170kDa

TP53 tumor protein p53

Trp tryptophan

UDP-GNAc UDP-N-acetylhexosamine

UV-vis Ultraviolet–visible

Val valine

Chapter 1

Overview of Industrial Protein Therapeutics, Protein PTMs, and Various Techniques for Protein Characterization

1.1 Introduction

Protein therapeutics now account for a large part of the biotechnology industry. With the development of recombinant DNA technologies, recombinant protein therapeutics are becoming the major focus of biotech companies. The products generated with this technology are complex, heterogeneous mixtures, and many factors can affect the proteins produced. Briefly, a wide variety of post-translational modifications (PTMs) arising from physiological or chemical modification, variations in sequence due to proteolysis, and degradation products formed at each stage of purification or processing are of major concern. Therefore, comprehensive characterization is essential to reproducibility, to drug efficacy, and, most importantly, to drug safety.

Mass spectrometry (MS) is widely applied to protein characterization in the biotech industry. It is powerful because of its accuracy, sensitivity, and high resolution. With on-line coupling to liquid chromatography (LC), proteins or peptides can be separated before they enter the MS system. Mass spectrometry extends characterization down to the level of individual amino acids. This chapter discusses protein therapeutics, variations in PTMs, and the importance of protein characterization. Furthermore, techniques for protein characterization are reviewed, with detailed coverage of mass spectrometry, along with the major strategies (top-down vs. bottom-up, and multi-sequential enzymatic digestion) and common methods for quantitation.

Lastly, this chapter discusses ERBB2-associated cancer studies. ERBB2-driven cancers are introduced, together with the sequence variation and single-nucleotide polymorphisms of ERBB2. The amplification mechanism, which typically results in the over-expression of a group of genes adjacent to ERBB2 (known as the ERBB2 amplicon), is discussed as well. As part of the CHPP Initiative, the integration of proteomics with transcriptomic analysis is introduced.

1.2 Overview of Industrial Protein Therapeutics

Small molecules, the classic drugs, make up a large part of the pharmaceutical market. Their molecular weight can be as small as 50 Da or as large as 800 Da. Traditionally, they are synthesized by chemical reactions and purified step by step to homogeneity, and they can be characterized by various analytical techniques, such as FTIR and NMR. In contrast to small-molecule drugs, biopharmaceuticals for therapeutic use include proteins, antibodies, and nucleic acids. They are large molecules; for example, monoclonal antibodies can be as large as 150 kDa. Protein therapeutics approved by the US FDA (Food and Drug Administration) fall into four groups according to their functions: “Group I, protein therapeutics with enzymatic and regulatory activity; Group II, protein therapeutics with special targeting activity; Group III, protein vaccines; Group IV, protein diagnostics”1. The first recombinant protein approved for human use was recombinant DNA insulin, in 1982. It was a breakthrough in biotechnology history.2 Today, with advances in biotechnology, particularly in recombinant DNA technology, recombinant proteins are increasingly used to treat a wide variety of diseases, such as cancer and diabetes. Recombinant proteins can also be used as enzyme replacement therapy for people who lack sufficient amounts of an enzyme to support normal function. According to the 2011 La Merie report, the top classes of recombinant protein therapeutics were erythropoietin, “insulin and insulin analogs, coagulation factors, interferon alpha, interferon beta, granulocyte colony-stimulating factor, human growth hormone, enzyme replacement therapy and follicle stimulating hormone (FSH)”3.

1.2.1 Production of recombinant proteins

A typical industrial production process involves vector construction, cloning, transfection/transformation, adaptation, bioreactor production, product concentration, and further analysis. Briefly, a suitable vector and adaptor promoter are chosen, and the gene of interest is cloned into the vector. The host cells are then transformed, and proteins are produced under specific conditions through biosynthesis.4

Selecting the expression system is very important for producing the desired protein product, because the host cells must be able to utilize the target gene. If the vector and host interact unfavorably, further production will be impossible. Issues such as gene replication, transcription, protease activity, and post-translational modifications need to be considered. Among the various expression systems, E. coli (Escherichia coli) and mammalian cell lines are widely used. When choosing a system, the characteristics of the recombinant protein need to be examined, including its size, its stability, the desired post-translational modifications (PTMs), and production costs. Mammalian cells are more frequently selected for production of recombinant proteins because of the PTMs (e.g. glycosylation) that they add to the proteins. CHO (Chinese hamster ovary) cell lines are commonly used mammalian cell lines, especially for production of monoclonal antibodies.

In addition, protein engineering is essential for optimizing natural proteins for specified functions. Naturally occurring (wild-type) proteins can be mutated at the level of the amino acid sequence to obtain better performance. For example, mutating Met to Leu or Ile protects wild-type staphylococcal nuclease from oxidative damage5, which helps to stabilize the protein structure.

1.2.2 Heterogeneity associated with recombinant protein therapeutics

Recombinant proteins show a large degree of heterogeneity throughout the production process because they are produced in living cells of other species. Major concerns therefore include drug safety, efficacy, and reproducibility; immunogenicity is also a critical focus. When variations are present, macromolecules, especially those with complex structures, exist as a population of structures that are closely related but not identical to each other. This is caused by several factors, namely genetic variants and variations in cellular proteolytic activity and in the translation process. Moreover, variations can arise during production and manufacturing, and can be both product-linked and process-linked.6 Differential glycosylation, phosphorylation, deamidation, and oxidation, as well as proteolytic cleavage, create a substantial number of differences within a specific protein population, which increases the complexity of the product. Because of this intrinsic complexity, and because safety, potency, consistency, and efficacy are of great importance for protein therapeutics, it is vital to have specific tests and assessments to characterize them.
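As a rough illustration of why mass spectrometry can distinguish variants of this kind, the sketch below applies the monoisotopic mass shifts of three of these modifications to a peptide and converts the result to m/z. The peptide mass of 1500.0 Da, the function names, and the choice of modifications are assumptions made for illustration only; they are not taken from the work described in this dissertation.

```python
# Monoisotopic mass shifts in Da for three common modifications.
PTM_MASS_SHIFT_DA = {
    "deamidation": 0.98402,       # Asn -> Asp/isoAsp (NH2 replaced by OH)
    "oxidation": 15.99491,        # one added oxygen, e.g. Met -> Met sulfoxide
    "phosphorylation": 79.96633,  # added HPO3 on Ser/Thr/Tyr
}

PROTON_MASS_DA = 1.007276


def modified_mass(unmodified_mass, modifications):
    """Neutral monoisotopic mass after applying each listed modification once."""
    return unmodified_mass + sum(PTM_MASS_SHIFT_DA[name] for name in modifications)


def mz(neutral_mass, charge):
    """m/z of the protonated ion [M + zH]z+ for a positive charge state z."""
    return (neutral_mass + charge * PROTON_MASS_DA) / charge


if __name__ == "__main__":
    peptide_mass = 1500.0  # hypothetical peptide, not from the dissertation
    for variant in ([], ["deamidation"], ["oxidation"], ["deamidation", "oxidation"]):
        mass = modified_mass(peptide_mass, variant)
        label = "+".join(variant) or "unmodified"
        print(f"{label:25s} M = {mass:10.4f} Da   m/z (2+) = {mz(mass, 2):9.4f}")
```

Under these assumptions, a single deamidation moves a doubly charged peptide ion by only about 0.49 m/z units, which is one reason high mass accuracy and resolution matter for detecting this modification.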

Generally, there are two major types of heterogeneity: macro-heterogeneity and micro-heterogeneity. In the case of glycosylation, macro-heterogeneity describes variation in site occupancy: a protein has multiple glycosylation sites, and which sites are occupied differs from molecule to molecule. Micro-heterogeneity refers to variation, among different molecules, in the structure of the glycan occupying the same site.7 In the following discussion, macro- and micro-heterogeneity will not be differentiated specifically.
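The distinction can be made concrete with a small counting sketch. It assumes, purely for illustration, that each glycosylation site is either unoccupied or carries one glycan drawn from the same small set, independently of the other sites; the site and glycan names are hypothetical, and real glycoproteins will not follow such a simple model.

```python
from itertools import product


def enumerate_glycoforms(sites, glycans):
    """List every glycoform as a tuple with one entry per site.

    None marks an unoccupied site (macro-heterogeneity); a glycan name
    marks which structure sits on an occupied site (micro-heterogeneity).
    """
    options = [None] + list(glycans)
    return list(product(options, repeat=len(sites)))


def occupancy_patterns(glycoforms):
    """Collapse glycoforms to their site-occupancy patterns only."""
    return {tuple(g is not None for g in form) for form in glycoforms}


if __name__ == "__main__":
    sites = ["site_1", "site_2", "site_3"]          # hypothetical sites
    glycans = ["glycan_A", "glycan_B", "glycan_C"]  # hypothetical structures
    forms = enumerate_glycoforms(sites, glycans)
    print(len(forms), "glycoforms in total")
    print(len(occupancy_patterns(forms)), "distinct occupancy patterns")
```

In this toy model the number of glycoforms grows as (number of glycans + 1) raised to the number of sites, so even a few sites and structures produce a sizeable population of closely related molecules.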

Protein heterogeneity can arise from various sources; three major ones are considered here: (a) distinctly different protein contaminants, (b) derivatives produced by the preparative procedure, and (c) variations between protein molecules due to PTMs. Contaminants can account for part of protein heterogeneity. Studies have shown that fractionated albumin can be heterogeneous; for example, albumin fractionated with cold methanol can contain transferrin and IgG at measurable levels.8 Another major source of heterogeneity is derivatives generated during the preparative procedures. It has been reported that heterogeneity of recombinant human arginase resulted from pegylated derivatives. PEG (polyethylene glycol) was used for protein modification.9 The pegylation reactions attached different numbers of PEG chains to individual protein molecules (e.g. one, two, etc.), and thus generated heterogeneous products.

Aside from contaminants and derivatives, protein heterogeneity can be generated by variations among protein molecules resulting from post-translational modifications (PTMs).10 Common PTMs are discussed in Section 1.2.3. A high degree of intrinsic protein heterogeneity can be present due to PTMs. For example, there is a substantial amount of heterogeneity in recombinant erythropoietin (EPO) molecules, particularly because of their high glycan content.11 This makes characterization very difficult. For glycosylation specifically, there are various reasons for glycoform heterogeneity. One is UDP-monosaccharide metabolism and transport into the ER and Golgi. Yang et al. reported an association between the glycoform heterogeneity of EPO and UDP-GNAc (UDP-N-acetylhexosamine).11 In addition, the downstream cis-, medial-, and trans-Golgi location of glycosyltransferases, the glycosyltransferase expression level (promoters regulated by transcription factor binding elements), and the recombinant protein expression level can also influence the final products.12

PTMs generate more complexity than the genetic code alone provides, creating heterogeneity within a specific protein population.7 This is well illustrated by histones: although they are classified into a limited number of types, each with a fundamentally distinct amino acid composition and sequence, numerous sub-fractions are present because of acetylation, methylation, and phosphorylation of various residues.13 From genome to proteome, the complexity of proteins increases. Heterogeneity associated with PTMs has been widely reported. For example, the N-linked glycans of prosaposin show a high degree of heterogeneity.14 In human plasminogen, the N-glycans include sialic acid and N-acetyllactosamine, while the O-linked glycan consists of Gal-GalNAc disaccharides.15 Variations in the carbohydrate units are commonly invoked to explain differences in the structures of glycoproteins. In addition, two-dimensional immunoblots detected heterogeneous forms of heavy polypeptide that vary in their degree of phosphorylation.16 Oxidized serum albumin has also been reported to show considerable heterogeneity,17 as have fatty acid acylated lipoproteins.18

In addition, peptide structures, host-cell phenotypes, expression systems and culture environments can affect final products and generate heterogeneous protein forms even during a single batch of fermentation. Heterogeneity can be generated during product manufacturing and processing as well. Subtle chemical modifications found in recombinant proteins can be caused by infidelity of biological reactions, by in-process degradation, and by degradation during product/sample storage. Thus, protein heterogeneity is of great concern in the pharmaceutical industry.

Under most circumstances, when developing and manufacturing protein therapeutics, safety and efficacy are sensitive to detailed structure and composition. With the thriving discovery and development of protein therapeutics, both the regulatory authorities and the analytical community face new challenges. Standard synthesis methods are usually used for the manufacturing of small molecules. However, such synthetic reactions are not compatible with the production of protein therapeutics. For example, growth hormone is a protein drug with relatively small molecular weight, and bacterial cell lines could be a candidate for its production. The situation is similar for insulin, with a molecular weight of around 5.8 kDa.

Mammalian cell lines are mostly used for more complex molecules. For instance, the production of monoclonal antibodies relies closely on mammalian cell line expression. This is also critical for protein PTMs (post-translational modifications), which are closely connected to the activities of protein therapeutics. Consequently, instead of producing one single protein, cell lines produce heterogeneous products. This is a critical problem for the safety and efficacy of biopharmaceuticals, and it is why the FDA and other regulatory agencies pay close attention to protein heterogeneity. Biological and clinical research has revealed that such modifications can influence potency, efficacy, half-life, immunogenicity and other important properties of pharmaceuticals.19

A wide variety of methods have been applied to facilitate the characterization of protein therapeutics. Quality analysis and quality control (QA and QC) are essential for improving and maintaining batch-to-batch consistency of protein pharmaceuticals, and thus for further improving the drug discovery and development process. For example, a series of purification schemes and manipulations of expression systems have been performed. Rapid and sensitive analytical methods, such as immunoaffinity, are needed to monitor changes, e.g. in the course of cell culture. Formulation is also important because of the intrinsic instability of proteins. A promising formulation can improve drug properties such as bioavailability and homogeneity, and decrease chemical degradation. Besides, many protein products have structural heterogeneity due to biosynthetic processes, so a mixture of modified forms will be present. One method to overcome this problem is site-directed mutagenesis. For example, Asn is a potential site of deamidation, so Asn can be replaced by other residues such as Ser or His to avoid deamidation.20 Another method is chemical modification of the protein, such as coupling with polyethylene glycol (PEG).21

1.2.3 PTMs of protein therapeutics

Protein drugs can be produced from bacteria, yeast or mammalian cells, such as CHO cells. They are difficult to characterize due to the presence of a wide variety of variants.

Even when they originate from the same processing steps, protein products can have various molecular weights due to possible modifications. Proteins function in signal transduction, recognition and so on. Because of these processing steps, one gene does not yield just one protein.

The two categories of post-translational modification are covalent modification and cleavage of the protein backbone. Phosphorylation, acetylation, alkylation, glycosylation and oxidation are five major covalent modifications. Disulfide bond formation, hydroxylation, sulfation, carboxylation, and protein modification by bacterial toxins are included as well. For example, post-translational processing has been investigated for cholera toxin (CT). CT has a toxic-active A subunit (CTA), and its function is for toxin binding to cells. CTA undergoes post-translational modification by V. cholerae protease. Initially, CTA is a single peptide chain. After post-translational processing, it is cleaved into two fragment chains, CTA1 and CTA2, which are linked by a disulfide bond. Each fragment chain has its own function. CTA1 has the ADP ribosylation activity. CTA2 has the insertion function, as CTA is inserted in the circular B homopentamer (CTB pentamer). CT thus consists of the CTA subunit and the CTB pentamer.22 Three major PTMs, deamidation, oxidation and glycosylation, are discussed here.

1.2.3.1 Deamidation

Deamidation is a source of instability in peptides and proteins. Among the twenty amino acids, Asn and Gln, which account for about 8% of protein building blocks, are unstable under physiological conditions.23

Deamidation is a common post-translational modification. The degradation pathway of Asn and Asp proceeds through intramolecular rearrangement when the pH is higher than 5. The peptide-bond nitrogen acts as a nucleophile and attacks the side-chain carbonyl carbon of Asn or Asp. Figure 1-1 illustrates the deamidation of Asn to Asp and iso-Asp. This reaction favors Asn or Asp over Gln or Glu because it forms a stable five-membered ring, referred to as a succinimide intermediate or cyclic imide, compared to the six-membered ring formed by Gln or Glu. Besides, a Gly residue, followed by Ser, adjacent to Asn accelerates this reaction because of its small steric hindrance. The intermediate imide can be hydrolyzed at two positions, the α-carbon (L-cyclic imide) and the β-carbon (D-cyclic imide), to form L-iso-Asp and Asp respectively in an approximate ratio of 3:1. Another pathway to Asp is that, at low pH, the side chain of Asn can be directly hydrolyzed via a tetrahedral intermediate, from which Asp is then generated.
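
The conversion described above replaces a side-chain amide (-NH2) with a carboxyl (-OH), which is what makes deamidation visible by mass spectrometry. The following is a minimal sketch of the resulting mass shift and of the approximate 3:1 iso-Asp:Asp split. The atomic masses are standard monoisotopic values rather than values from this chapter, and the function names are hypothetical.

```python
# Monoisotopic atomic masses in Da (standard values, not taken from the text).
MONOISOTOPIC = {"H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def deamidation_mass_shift():
    """Mass change when a side-chain amide (-NH2) becomes a carboxyl (-OH).

    Net change in elemental composition: gain one O, lose one N and one H.
    """
    return MONOISOTOPIC["O"] - MONOISOTOPIC["N"] - MONOISOTOPIC["H"]


def mz_shift(charge):
    """Shift seen on the m/z axis for an ion carrying `charge` charges."""
    return deamidation_mass_shift() / charge


def split_hydrolysis_products(total_amount, iso_to_asp_ratio=3.0):
    """Divide hydrolyzed succinimide between iso-Asp and Asp.

    The default ratio of 3:1 is the approximate value quoted in the text.
    """
    iso_asp = total_amount * iso_to_asp_ratio / (iso_to_asp_ratio + 1.0)
    return iso_asp, total_amount - iso_asp


if __name__ == "__main__":
    print(f"mass shift: {deamidation_mass_shift():.4f} Da")
    print(f"m/z shift for a 2+ ion: {mz_shift(2):.4f}")
    print(split_hydrolysis_products(100.0))
```

Because the shift is just under 1 Da, a doubly charged peptide moves by only about half an m/z unit, which is one reason high resolution is helpful for this modification.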


Figure 1-1. Deamidation of Asn to Asp and iso-Asp.

Deamidation has a biological function. Non-enzymatic deamidation of Asn and Gln can occur under physiological conditions. This determines rates of protein turnover in vivo, serving as a molecular clock.24 Deamidation has also been shown to regulate time-dependent biological processes such as development and aging, and it is related to the mechanism of cell aging.

There are two aspects associated with such change. One is the inactivation of proteins: deamidation of Asn or Gln causes a conformational change that alters function, and activity is ultimately affected. Some enzyme functions can be lost due to the conformational change. The other aspect is apoptosis, the programmed death of cells.25 Deamidation invokes this mechanism, leading to cell death. Consequently, deamidation is associated with the accumulation of inactivated molecules as well as programmed cell death.


The sequence and structure of a peptide or protein influence the reaction rate, so the rate of deamidation can vary over a wide range. The rate also varies with changes in protein structure, buffer type and solvent conditions, such as pH, ionic strength, or an oxygen-rich environment, in vivo. Therefore, the extent and consequences of deamidation vary from protein to protein. These lead to structural and functional changes. Protein degradation can be caused by deamidation of Asn to Asp and iso-Asp residues, in particular during long-term storage.26 This is also potentially immunogenic, because iso-Asp is not a natural amino acid.

Biological activities of protein therapeutics such as IgG1 and other recombinant antibodies can be adversely affected by deamidation.27 Surface Asn residues are more susceptible because they are exposed to the alkaline pH environment in which non-enzymatic deamidation is accelerated. Besides Asn residues, Gln residues can also be deamidated. However, the reaction is much slower than that of Asn due to structural characteristics, and it is barely detected in recombinant protein therapeutics.

1.2.3.2 Oxidation

Among the twenty standard amino acids found in proteins, some can be oxidized under specific conditions. The structures of some amino acids that are prone to oxidation are shown in Figure 1-2 for clarity.28 Reactive oxygen species are a major source of oxidation. They are highly reactive and contribute to oxidative damage to proteins.29 They can be generated both endogenously and exogenously. Several conditions affect oxidation reactions. Oxidation can be affected by pH, temperature and the activity of solvents. Other reagents in the atmosphere can also accelerate or hinder oxidation. Moreover, oxidation in proteins can be influenced by the relative position of specific sensitive groups.30 If the groups are more accessible to reagents, oxidation will be much easier. If the groups cluster with others, and generally if they are buried inside the protein where reagents cannot readily reach, the residues will remain intact and oxidation will probably be avoided under such conditions.

Figure 1-2. Structures of amino acids prone to be oxidized.

Oxidation reactions have much influence on proteins. They can cause hydrophobic groups in proteins to be exposed to the environment. They can also cause changes in disulfide groups and secondary structures.31 Therefore, oxidation reactions influence the charge as well as the conformation of proteins and their three-dimensional structures, increasing hydrophobicity, the potential to form negative aggregates, and non-specific interactions.32 Oxidation leading to the formation of carbonyls can cause changes in tertiary structure and aggregation. Moreover, oxidation can lead to deactivation of proteins as well as inhibition of enzyme activity and loss of function.33 For example, some research suggests that Peroxiredoxin I (PrxI) activity might be an indicator of oxidation.34 Oxidation can also cause backbone cleavage of peptides, decrease cellular uptake, and even compromise cell viability and lead to apoptosis. In clinical trials, increased oxidation has been observed in the plasma of people who have diabetes.35 Consequently, it is essential to investigate oxidation so that therapeutic protein drugs reach optimal function and efficacy. Investigation of how to detect and identify oxidation in proteins is ongoing because of its importance. This has important applications in the pharmaceutical industry when researching and manufacturing proteins as therapeutic agents. However, post-translational oxidation can also introduce other chemical reactivity, which can be exploited for derivatization in detection.36 This can also modulate the structures and functions of some proteins.

1.2.3.3 Glycosylation

Glycosylation is critical to the activities and functions of protein therapeutics. It is one of the five major protein PTMs (post-translational modifications). Covalent glycosylation, mediated by specific enzymes, is quite common in eukaryotes but not in prokaryotes. N-linked glycosylation, in which the glycan is attached to Asn, is much more common and complex than O-linked glycosylation. Glycosylation begins in the cytoplasm, continues in the ER, and then proceeds to the Golgi for modification of the sugar. Glycosylation serves in signaling, adhesion, recognition and so on. Glycans are usually found on the cell surface, extending outside the membrane for recognition and immune function. Secreted proteins can be glycosylated, as can circulating proteins such as antibodies and albumin. This allows for recognition, binding and other functions. Glycosylated proteins are resistant to degradation and can have a longer circulating life.

Returning to CT (cholera toxin), it attacks intestinal epithelial cells through its B subunit. On the luminal surface of intestinal epithelial cells there is a glycosylation pattern, GM1 (monosialoganglioside), and it is the B subunit that binds to this pattern. The GM1 glycosylation pattern thus serves as the receptor for CT binding.22 Initially, there are two subunits, a light and a heavy subunit. L is responsible for the binding function, and H is for activity. CT binds to GM1 to a significant extent through CTB (cholera toxin B subunit). When GM1 is not present, there is CTB-CTB interaction, i.e. interaction between pentamers. GM1 has high affinity for CTB. In experiments, CT could bind to various other cell types that also have the GM1 glycosylation pattern, such as lymphocytes. However, such binding does not occur all the time because the expression of GM1 is cell-cycle dependent; it is more likely to happen in the G0/G1 phase. The activity of CT begins with binding to GM1. GM1 is composed of Gal, NeuAc and Glc; the exact pattern is Gal(β1–3)GalNAc(β1–4)(NeuAcα2–3)Gal(β1–4)Glc-ceramide.37 CTB is a pentamer, and each unit has a binding site for GM1. As a whole, CTB binds more tightly than each B monomer. There are three important residues for binding: Trp88, Gly33 and Tyr12.

GM1 is a glycolipid. CTB binds to it before endocytosis of CT. This occurs in cells with as well as without caveolae.37 Also, the distribution and localization of GM1 might be affected by binding to CT. Each B pentamer binds five GM1.38 Such binding as a whole is critical for CT endocytosis and its activities.


Glycosylation generally refers to the linking of carbohydrates to Asn, Ser or Thr residues. It is a site-specific process. There is N-linked as well as O-linked glycosylation. N-linked glycosylation is more common; here the sugar group is linked to an Asn residue, usually within the sequence “Asn-X-Ser/Thr”, where X represents any amino acid residue except Pro. There are three types of N-glycan: complex type, hybrid type and high-mannose type.39 In O-linked glycans, by contrast, the sugar is linked to the hydroxyl group of Ser or Thr. N-acetylgalactosaminyltransferase takes part in this process by adding N-acetylgalactosamine.
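
The “Asn-X-Ser/Thr” rule above lends itself to a simple sequence scan. The sketch below, with a hypothetical function name and a made-up test sequence, lists candidate N-glycosylation sites in a one-letter amino acid sequence. It only finds the motif; whether a site is actually occupied has to be determined experimentally.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping motifs (e.g. "NNSS") both be reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence):
    """Return 1-based positions of Asn residues inside Asn-X-Ser/Thr motifs."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    # Made-up sequence for illustration only.
    print(find_sequons("MNGSANPTNNSS"))
```

In this example the Asn at position 6 is skipped because it is followed by Pro.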

1.2.3.4 Other common types of PTMs

In addition to deamidation, oxidation and glycosylation, there are other common types of PTMs, such as phosphorylation, methylation and acetylation. Phosphorylation is associated with biological power and enzyme activity. It occurs on Ser, Thr and Tyr residues. Phosphate groups from ATP are transferred to the protein by a kinase. Once phosphorylated, the protein is activated. Conversely, a phosphatase can remove the phosphate group and thus deactivate the protein. Many biological processes involve phosphorylation, such as cell differentiation and motility, metabolism, and signal transduction between various cells.

Methylation, another type of post-translational modification, can happen at Lys and Arg residues. S-adenosylmethionine can provide the methyl group by forming esters. Methylation is also closely related to cellular activities. For example, methylation of protein Arg and Lys residues is associated with signaling events that regulate interactions between cellular proteins as well as sub-cellular localization.40 Besides Lys and Arg residues, methylation can happen at His and at the carboxyl groups of Asp and Glu. Activities of proteins such as cytochrome c and histones are related to methylation. For example, methylation of FoxO3 is essential to its cellular activities and stability, and thus strongly influences its tumor suppression activity.41

Acetylation is another type of protein PTM. Most of it happens at the N-terminus and at Lys side chains. The N-terminal Met residue is prone to acetylation. If the N-terminal Met is removed, the following residues Ala, Ser, Thr, Val and Cys can be acetylated.42 Acetylation can be co-translational or post-translational. A large number of eukaryotic proteins, including human proteins, are acetylated. Lys side-chain acetylation can happen to histones and to non-histone proteins, such as some regulators in the cytoplasm and nuclear proteins. This modification is very important because it can interact with other types of post-translational modification, such as methylation, phosphorylation and ubiquitination.43 This plays an essential role in signal transduction between cells under various environments.

1.3 Techniques for Protein Characterization

Multiple analytical techniques have been developed and applied for quality control and assessment, and for detection and relative quantitation. Electrophoresis, liquid chromatography, mass spectrometry and other biophysical techniques are introduced here. With particular focus on mass spectrometry, various ionization methods, mass analyzers and hybrid mass spectrometric instruments are included.


1.3.1 Electrophoresis

Electrophoresis is commonly used for protein separation and characterization. Protein migration in electrophoresis is based on charge differences. A gel is frequently used as the support in electrophoresis. For example, polyacrylamide is used for SDS-PAGE (sodium dodecyl sulfate-polyacrylamide gel electrophoresis). Protein analysis by gel electrophoresis is based on charge as well as size. SDS-PAGE is the most popular technique in protein electrophoresis. SDS serves as a detergent to denature proteins, and its negative charge masks the intrinsic charge of the proteins. This allows the separation of proteins based on size.

IEF (isoelectric focusing) is another type of PAGE. Analysis by IEF uses a pH gradient together with an electric field. Proteins migrate in the pH gradient and their charges change during migration. When the pH equals their pI (isoelectric point), migration stops because the net charge of the proteins is zero. Variation of α1-acid glycoprotein (AGP) can be detected by IEF.44 In addition, capillary electrophoresis is helpful in protein analysis because of its high resolution. Charge variation of protein samples can be determined by cIEF (capillary isoelectric focusing). The separation is quantitative. cIEF coupled to MS is informative for proteomic studies, allowing quick and easy analysis and automation. Besides, Berkowitz has developed high-resolution CZE, which is used to quantify various glycoforms and charge variants of glycoproteins from complex electropherograms after releasing sugars with different enzymes.45
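
The focusing principle described above, where migration stops once the pH equals the pI, can be illustrated numerically. The sketch below estimates the net charge of a peptide with the Henderson-Hasselbalch equation and finds the pH of zero net charge by bisection. The pKa values are an approximate, illustrative set chosen for this example (an assumption, not values from this work), and real proteins deviate from such a simple model.

```python
# Approximate pKa values for ionizable groups (illustrative assumption).
PKA_POSITIVE = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"Cterm": 2.0, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}


def net_charge(sequence, ph):
    """Net charge of a one-letter sequence at a given pH."""
    charge = 0.0
    for group, pka in PKA_POSITIVE.items():
        count = 1 if group == "Nterm" else sequence.count(group)
        charge += count / (1.0 + 10.0 ** (ph - pka))
    for group, pka in PKA_NEGATIVE.items():
        count = 1 if group == "Cterm" else sequence.count(group)
        charge -= count / (1.0 + 10.0 ** (pka - ph))
    return charge


def isoelectric_point(sequence, low=0.0, high=14.0, tolerance=1e-4):
    """Bisect for the pH at which the net charge crosses zero."""
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if net_charge(sequence, middle) > 0:
            low = middle   # still positive, so the pI lies at higher pH
        else:
            high = middle
    return (low + high) / 2.0


if __name__ == "__main__":
    # Hypothetical peptides: deamidation (N -> D) adds an acidic group.
    for peptide in ("GNGK", "GDGK"):
        print(peptide, round(isoelectric_point(peptide), 2))
```

Under this model, replacing Asn with Asp lowers the calculated pI, which is consistent with deamidated forms appearing as acidic charge variants.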

Electrophoresis can be one- or two-dimensional. Two-dimensional SDS-PAGE can separate differentially post-translationally modified forms of the same protein. However, it has limited accuracy, reproducibility and quantitation. A double-isotope procedure has been reported to examine sulfation and glycosylation.46 After separation by 2-D gel electrophoresis, isotope incorporation ratios of proteins were determined from individual gel spots. Another technique is fluorescence two-dimensional difference gel electrophoresis (2-D DIGE), which is advantageous because it is capable of relative quantitation. Fluorescent stains such as ProQ Emerald and ProQ Diamond are powerful for analyzing glycosylation and phosphorylation after 2D gels (SDS-PAGE after IEF).47 For example, ProQ Diamond stain, as a molecular probe, binds specifically to phosphate groups on residues. Its fluorescent nature gives high sensitivity. The stain intensity is related to the number of phosphate groups, which is the basis for quantitation. The dye binds to phosphate groups non-covalently, which is compatible with MS. For deamidation, the introduction of methylation, hydrazinolysis and derivatization by dansylation, combined with UV-vis or fluorescence detection coupled with HPLC or 2D gel, allows absolute detection.48

1.3.2 Chromatography

Chromatography includes analytical chromatography and preparative chromatography. Liquid chromatography (LC) is a type of analytical chromatography. Briefly, protein molecules in solution are separated in an LC system by their interaction with the mobile phase (liquid) and the stationary phase (liquid or solid). HPLC is widely used for separation, purification, identification and even quantification of protein samples. RP-HPLC (RP stands for reversed phase) has been reported for the investigation of HMG-14 and HMG-17 proteins, with 0.1% trifluoroacetic acid used as a weak ion-pairing agent.49 This can separate multiple species in order of hydrophobicity. HPLC is also very helpful in peptide mapping for identification and characterization of protein PTMs. For example, RP-HPLC of a tryptic digest of interferon after treatment with TBHP (tert-butyl hydroperoxide) and hydrogen peroxide gives information about the amount of oxidation.50 Derivatization is a common method in the application of HPLC to protein analysis. To quantitate thiols, DTNB (5,5'-dithiobis(2-nitrobenzoic acid)), also called Ellman’s reagent, can be conjugated to thiols as a derivatization step before HPLC separation, and quantitation is performed through the adduct formed.51 Moreover, fluorescent tags can be used for quantitation with the help of HPLC. Labeling with anthranilic acid (AA) provides sensitive and robust oligosaccharide analysis, and normal-phase HPLC gives a high-resolution, detailed profile of fluorescently labeled N-glycans.52 Glycoproteins are enzymatically or chemically deglycosylated. For example, neuraminidase can release sialic acid, while EndoH and PNGase F release N-linked oligosaccharides. Sometimes derivatization is necessary. For example, the attachment of 2-aminopyridine allows relative molar quantitation of individual glycans.53 Use of an immobilized binding agent in HPLC helps detect a specific glycoprotein, while another binding agent identifies the peptide or protein. For example, an affinity column is powerful for detecting and characterizing the heterogeneity of amyloid-β peptides.54 Specific binding agents are immobilized on the column to capture various proteins of interest. Lectins are also widely used for glycosylation analysis because of their specific binding. Antibodies are well known for their specificity and affinity, and they are widely used in biological assays as receptors for epitopes. For example, phosphoprotein-specific antibodies directed against distinct phosphorylation sites have been investigated.55 Another example is radioimmunoassay (RIA), which is powerful for quantitative analysis of amidation. RIA relies on competition between the sample and a radiolabeled peptide for recognition, and a standard curve is plotted from this relationship. Sulfation can also be analyzed by a combination of HPLC and specific assays, and this can be quantitative as well.


Also, protein isoaspartate methyltransferase (PIMT) can be applied for relative quantification of iso-Asp through the increase in protein methylation.56 However, this depends on the availability of the enzyme and on optimization of the procedure, as results can differ under different conditions. Labeled glycans can be separated by hydrophilic interaction chromatography (HILIC). For example, this can achieve further separation after isolation of recombinant interferon-γ.57 After separation, peaks need to be identified by MS so that the elution order of each glycopeptide can be determined. Relative quantitation is achieved from the percentage of each glycan based on peak area. Moreover, hydrophobic interaction chromatography (HIC) is good at analyzing surface variants through a “salting out” mechanism, in which a salt, usually ammonium sulfate, is used. Anion exchange chromatography (AEX) with a polyethylenimine (PEI) column can also be used to analyze charged glycosylation and to quantify the extent of phosphate on glycoproteins with better selectivity. For determining the content of γ-carboxyglutamate (Gla) and β-hydroxyaspartate (Hya), ion exchange HPLC can quantitate the amounts in hydrolysates by comparing ratios to standards.58 For example, the ratios Gla/Glu and Gla/Asp are used to obtain the number of Gla residues, with a standard curve used for quantitation. Besides, HPAEC-PAD (high performance anion exchange chromatography with pulsed amperometric detection) is used for analyzing the carbohydrate part.59 In addition, SEC (size exclusion chromatography) can be used to simplify the molecular composition of samples entering the mass spectrometer.60
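
The relative quantitation mentioned above, where each glycan is reported as a percentage of the total peak area, amounts to a simple normalization. A minimal sketch follows; the glycan names and areas are hypothetical placeholders, and the calculation assumes that all species respond equally in the detector.

```python
def relative_abundance(peak_areas):
    """Convert integrated peak areas into percentages of the total area."""
    total = sum(peak_areas.values())
    if total <= 0:
        raise ValueError("total peak area must be positive")
    return {name: 100.0 * area / total for name, area in peak_areas.items()}


if __name__ == "__main__":
    # Hypothetical integrated areas for three labeled glycans.
    areas = {"glycan_A": 50.0, "glycan_B": 30.0, "glycan_C": 20.0}
    for name, percent in relative_abundance(areas).items():
        print(f"{name}: {percent:.1f}%")
```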


1.3.3 Mass spectrometry

Mass spectrometry is becoming increasingly popular as an analytical technique for both qualitative and quantitative analysis. Generally, molecules are converted to positive or negative ions in the ionization source. These ions then pass through the mass analyzer, where they are separated based on their m/z (mass to charge) ratio. Ions with different m/z reach the detector differently. The relative signal abundance from the detector is recorded for analysis. Soft ionization for protein analysis is provided by electrospray ionization (ESI) and matrix-assisted laser desorption/ionization (MALDI). Tandem MS with various fragmentation methods is usually powerful for analysis and identification. Ionization methods, mass analyzers and hybrid mass spectrometers are introduced as follows.

1.3.3.1 Ionization methods

Ionization methods include a wide variety of types: chemical ionization (CI), fast atom bombardment ionization (FAB), field ionization (FI)/field desorption (FD), matrix-assisted laser desorption ionization (MALDI), thermospray ionization (TI), and electrospray ionization (ESI).

1.3.3.1.1 Chemical ionization

Chemical ionization relies on collisions between reagent ions and neutral analyte molecules. Charge exchange can also happen after a collision between an ionized gas (e.g. He) and the molecule. Formation of [M+H]+ is a feature of CI. There are two types of chemical ionization (CI), positive CI and negative CI. Atmospheric pressure chemical ionization (APCI) is a form of CI in which a pressure of 1 atm is maintained in the source, while high vacuum is maintained in the analyzer by pumping the chamber. Charge exchange can happen at low ionization potential. APCI is very robust and reproducible, and it can be used in two ion modes, positive mode and negative mode. APCI is typically used for analysis of small molecules. It is a less soft ionization method than ESI and MALDI, which are discussed later. The ionization process of APCI generally happens in the gas phase, while that of ESI happens in the liquid phase.

1.3.3.1.2 Fast atom bombardment ionization

High energy is applied in FAB, and desorption and ionization result from high-energy bombardment. Both matrix and analytes therefore appear in the spectra, as the mixture enters the mass analyzer directly after leaving the surface. Molecules for analysis are in solution before absorbing energy from the ion beam. Proton transfer happens after the energy is absorbed by the analytes, and ions are thus formed. This ionization method is especially popular for peptides with low volatility. The molecules for analysis need to be dissolved in glycerol or another viscous solution, which can be a limitation of this ionization method.

1.3.3.1.3 Field ionization/field desorption

Unlike CI, field ionization (FI)/field desorption (FD) applies an external electric field. This reduces the energy barrier for ionization, which is known as quantum mechanical tunneling. Thermal decomposition can happen during vaporization of the molecules. For FI, the molecules need to be vaporized and brought to the electrode, while for FD, the electrode is an emitter wire coated with the analyte. Because the analyte is coated on the electrode, FD bypasses the vaporization process and therefore the thermal decomposition.


1.3.3.1.4 Matrix-assisted laser desorption ionization

MALDI, matrix-assisted laser desorption ionization, is another major ionization method. The matrix plays a very important role. It absorbs the energy and transfers it to the analyte embedded in it, protecting the molecules from decomposition due to excessive energy. Like ESI, it generates ionized species without fragmenting them. However, it is performed off-line, which is a disadvantage compared to ESI’s direct interface with liquid chromatography for separation before mass analysis. For phosphorylation detection, dephosphorylation by alkaline phosphatase is needed, allowing quantitation of various phosphorylation states by MALDI-TOF (time-of-flight).55, 61 MALDI is also informative in disulfide bond analysis. Disulfide-linked peptides can be identified by observing specific product-plus-matrix ions after the disulfide bond is reduced. For this purpose, CHCA (α-cyano-4-hydroxycinnamic acid) and HCCA (α-cyano-3-hydroxycinnamic acid) are popular matrices in MALDI analysis.62 For example, with CHCA, 189 Da, the mass of CHCA, is added to each reduced Cys residue (sulfhydryl group). Therefore, the number of disulfide linkages can be determined from the mass shift in MALDI spectra. This is advantageous in high-throughput analysis of proteins or peptides that contain disulfide bonds. In addition, MALDI generates mostly 1+ ions, while ESI can generate multiply charged ions, 2+, 3+, 4+ or even higher. MALDI is good for MW measurement of peptides or proteins, while ESI is more advantageous for analysis of large intact proteins and is critical for further fragmentation for PTM detection.
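
The mass-shift reasoning above, with 189 Da added per reduced Cys, can be written as a short calculation. This is a simplified sketch under stated assumptions: it uses the nominal 189 Da value quoted in the text, ignores the small mass change caused by the reduction itself, and assumes every reduced Cys carries exactly one matrix adduct. The function names are hypothetical.

```python
CHCA_ADDUCT_DA = 189.0  # nominal adduct mass per reduced Cys, as quoted above


def tagged_cysteines(mass_shift, adduct_mass=CHCA_ADDUCT_DA):
    """Estimate how many Cys residues carry a matrix adduct."""
    return round(mass_shift / adduct_mass)


def disulfide_count(mass_shift, adduct_mass=CHCA_ADDUCT_DA):
    """Each disulfide bond yields two reduced Cys residues."""
    return tagged_cysteines(mass_shift, adduct_mass) // 2


if __name__ == "__main__":
    # Hypothetical observed shift between native and reduced spectra.
    observed_shift = 756.0
    print(tagged_cysteines(observed_shift), "tagged Cys")
    print(disulfide_count(observed_shift), "disulfide bonds")
```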


1.3.3.1.5 Thermospray ionization (TI)

An ion or atom beam is not needed for TI, in contrast to the FAB discussed previously. Solvent vaporization occurs in a heated capillary, through which the solution passes to form an aerosol. Because of the statistical distribution of charges, part of the droplets will not be neutral. Ions that carry a charge from proton transfer can be analyzed with the help of deflection, while the neutral part is removed from the mass spectrometer. The flow rate used in TI is usually at the level of mL/min.

1.3.3.1.6 Electrospray ionization

ESI, eletro-spray ionization, converts ions into gaseous phase with the help of strong electrical field. Unlike TI, flow rate of ESI is usually at the level of µL/min or even lower. In solution, ions can be emitted and analyzed directly through solvent evaporation by ESI, while for neutral, molecules, protonation is needed to generate ionic species before the analysis. Therefore, vaporization and ionization is critical for ESI source in mass spectrometry. For being a soft ionization, ESI can generate high charges for large proteins or peptides, thus making the mass detection of large-sized molecules maintain in a relatively small range. For example, characterization of glycosylation profiles of human Fc fragments is completed by LC/ESI-MS.63

Fc fragments retain the glycosylation information of the intact antibody, including the combinatorial information that is lost when intact antibodies are simply reduced. The same applies to other post-translational modifications. Fc fragments are therefore analyzed to resolve complex structures. Because antibody fragments (50 kDa) are much smaller than intact antibodies (150 kDa), their masses can be determined more accurately. In ESI, three parts are associated with the transfer of ions from solution into the gas phase: production of a well-dispersed spray of charged droplets, evaporation of the solvent, and ejection of the ionic species from a tube held at high voltage.64
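
The effect of multiple charging on the observed m/z range can be sketched in a few lines of Python. The example below assumes that every charge comes from an added proton (an [M + zH]z+ ion). The charge states are hypothetical values chosen for illustration, and the 50 kDa and 150 kDa masses are the nominal figures used above.

```python
PROTON_MASS_DA = 1.007276  # mass of a proton in Da


def esi_mz(neutral_mass_da, charge):
    """m/z of an [M + zH]z+ ion formed by adding `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass_da + charge * PROTON_MASS_DA) / charge


# Hypothetical charge states for a 50 kDa fragment and a 150 kDa antibody.
examples = [
    ("fragment, 50 kDa", 50_000.0, (20, 25, 30)),
    ("intact antibody, 150 kDa", 150_000.0, (45, 50, 55)),
]
for label, mass, charges in examples:
    for z in charges:
        print(f"{label}: z = {z:2d}, m/z = {esi_mz(mass, z):.2f}")
```

Although the two masses differ threefold, both sets of ions fall within a few thousand m/z units, which is the practical benefit of multiple charging described above.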

1.3.3.2 Mass analyzer

After ionization, ions are accelerated by an electric field into the mass analyzer. The five basic types of mass analyzer are magnetic sector, TOF (time-of-flight), quadrupole, ion trap and FT-ICR (Fourier transform ion cyclotron resonance).

1.3.3.2.1 Magnetic sector

The magnetic sector is the oldest type of mass analyzer and uses a magnetic field to separate ions. The radius of an ion's trajectory is determined by its mass-to-charge ratio, the magnetic field, and the ion velocity. By scanning the magnetic field, ions with different trajectories are detected in turn and recorded as a mass spectrum. It has limited resolution and a low scan speed, typically on the order of one second per scan.65
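
The dependence of the trajectory radius on mass-to-charge ratio, field and velocity can be written out as a short derivation. The symbols are introduced here for illustration and are not taken from the cited reference: m is the ion mass, z its charge number, e the elementary charge, v its velocity, B the magnetic field, r the radius of the trajectory, and V the accelerating potential.

```latex
\begin{align}
  z e v B &= \frac{m v^{2}}{r}
    &&\Rightarrow\quad r = \frac{m v}{z e B}, \\
  z e V &= \tfrac{1}{2} m v^{2}
    &&\Rightarrow\quad \frac{m}{z} = \frac{e B^{2} r^{2}}{2 V}.
\end{align}
```

For a fixed radius and accelerating potential, the transmitted m/z scales with the square of B, which is why scanning the magnetic field sweeps ions of different m/z across the detector.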

1.3.3.2.2 Time of flight

In theory, ions that start with the same kinetic energy travel at different velocities, and therefore reach the detector at different times, because of their different masses. A mass spectrometer with a TOF analyzer uses the relationship between m/z (mass-to-charge ratio) and the flight time of the ions. In this relationship, the acceleration potential, the length of the flight tube, and the mass and charge of the ions all influence the ion velocity and thus the flight time. However, ions with the same mass-to-charge ratio but slightly different kinetic energies arrive at slightly different times under the same TOF conditions, which is a disadvantage. Modern reflectron TOF instruments overcome this problem of ions that have the same mass-to-charge ratio but different kinetic energies. TOF is generally used with MALDI, but the laser pulse in MALDI can produce a wide spread of start times for the same ions. Delayed extraction helps with this problem by holding the ions for a short delay before they enter the region where the acceleration voltage is applied. TOF is also disadvantageous because its dynamic range is limited.66
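
The relationship between m/z and flight time can be sketched as follows. This is a minimal model of an ideal linear TOF with a single acceleration stage and a field-free drift tube. The tube length and acceleration voltage are hypothetical defaults, and effects such as the initial energy spread, the reflectron and delayed extraction are deliberately ignored.

```python
import math

ELEMENTARY_CHARGE_C = 1.602176634e-19  # C
DALTON_KG = 1.66053906660e-27          # kg


def flight_time_us(mz, tube_length_m=1.0, accel_voltage_v=20_000.0):
    """Flight time in microseconds through a field-free drift tube."""
    # Energy balance z*e*V = (1/2)*m*v**2 gives v = sqrt(2*e*V / (m/z)).
    mass_per_charge_kg = mz * DALTON_KG
    velocity = math.sqrt(
        2.0 * ELEMENTARY_CHARGE_C * accel_voltage_v / mass_per_charge_kg
    )
    return tube_length_m / velocity * 1e6


for mz in (500.0, 1000.0, 2000.0, 4000.0):
    print(f"m/z {mz:6.0f}: {flight_time_us(mz):.2f} microseconds")
```

Because the flight time scales with the square root of m/z, quadrupling m/z only doubles the flight time.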

1.3.3.2.3 Quadrupole

The quadrupole is one of the most popular mass analyzers. It is the standard analyzer for gas chromatography coupled to mass spectrometry and is also widely coupled with liquid chromatography. A quadrupole consists of four parallel rods of circular cross-section arranged in a square array; one opposing pair is held at a constant positive DC (direct current) potential and the other pair at a negative DC potential. Figure 1-3 shows the quadrupole rods with a possible ion trajectory.66 In addition to the constant DC potential, a superimposed AC potential is applied, so the electric field of the quadrupole is a combination of DC and AC. To scan a mass spectrum, the DC and AC amplitudes are ramped linearly while their ratio is held constant. Therefore, unlike the magnetic sector, the quadrupole produces a linear mass range, and its scan rate is fast. Serving as a mass filter, the quadrupole lets ions within a specific m/z range pass and reach the detector, while ions whose m/z lies outside that range, either above or below, are ejected. Compared to TOF, which requires high vacuum, the quadrupole scans quickly and with high efficiency.

Figure 1-3. Illustration of quadrupole mass analyzer with ion trajectory (adapted from reference 66).

1.3.3.2.4 Ion trap

An ion trap is composed of one ring-shaped electrode and two cap-shaped electrodes. This quadrupole ion trap is three-dimensional. A combination of DC and AC voltages is applied to the ring electrode, while the two caps are held at ground potential. The voltages applied to the electrodes form a potential well that traps and stores the ions. By scanning the applied AC radio frequency (RF) voltage, the trap makes ions of selected m/z unstable and ejects them, after which they reach the detector. This mode of operation is called the mass-selective instability scan.67 The other mode is called resonance ejection. In this mode, a supplementary voltage is applied in addition to the RF so that ions gain enough energy to exit the trap. Commonly, the RF frequency is held constant while the voltage is varied to change the RF amplitude. Resonance ejection is advantageous compared to the mass-selective instability scan because it offers a much broader mass range and higher sensitivity and resolution.68

The other type is the linear ion trap, which is two-dimensional. It is composed of front, center and back sections. The front section sends ions into the center section, where reactions take place. Selected potentials are applied to the back section to trap the ions in the center section. The linear ion trap can be used as a mass filter, with selected ions ejected axially. Fringe fields are avoided in this linear arrangement. Mass spectra are acquired in a manner similar to the quadrupole, by creating ion instability and ejection.69

1.3.3.2.5 FT-ICR

FT-ICR (Fourier transform ion cyclotron resonance) is known for its superior mass resolution and accuracy. An FT-ICR instrument uses a superconducting magnet, whose field forces the ions into cyclotron motion. Performance improves at higher magnetic field because the resolution depends directly on the field strength. Because FT-ICR is very sensitive, ultra-high vacuum must be maintained; otherwise, ion motion and detection are disturbed and the resolution is degraded. With FT-ICR, the m/z of an ion is determined from its cyclotron frequency. The image current of all ions is measured, and the signal is converted to cyclotron frequencies by Fourier transform. The analyzer cell is composed of a pair each of trapping plates, excitation plates, and detection plates.70 Ions enter the cell through the trapping plates, and because the cell allows a long detection time, resolution and sensitivity can be enhanced.
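
The link between cyclotron frequency and m/z can be sketched with the unperturbed cyclotron equation, f = zeB / (2πm). This is an idealized relationship that ignores the trapping field and space-charge effects, and the 7 T field strength in the example is a hypothetical value chosen for illustration.

```python
import math

ELEMENTARY_CHARGE_C = 1.602176634e-19  # C
DALTON_KG = 1.66053906660e-27          # kg


def cyclotron_frequency_hz(mz, field_tesla):
    """Unperturbed cyclotron frequency of an ion with the given m/z."""
    return ELEMENTARY_CHARGE_C * field_tesla / (2.0 * math.pi * mz * DALTON_KG)


def mz_from_frequency(freq_hz, field_tesla):
    """Invert the cyclotron equation to recover m/z from a frequency."""
    return ELEMENTARY_CHARGE_C * field_tesla / (2.0 * math.pi * freq_hz * DALTON_KG)


field = 7.0  # tesla, hypothetical magnet
for mz in (500.0, 1000.0, 2000.0):
    print(f"m/z {mz:6.0f}: {cyclotron_frequency_hz(mz, field) / 1e3:.1f} kHz")
```

Doubling the field doubles every frequency, which spreads neighboring m/z values further apart in the frequency domain and is consistent with the better performance at higher field noted above.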

1.3.3.2.6 Orbitrap

More recently, a novel type of mass analyzer, the Orbitrap, has become very popular. It is widely used because it offers resolving power similar to FT-ICR together with high mass accuracy and sensitivity. It is advantageous compared to FT-ICR because it does not need a superconducting magnet. It also resembles the quadrupole ion trap in that ions are trapped and stored. Ions are squeezed toward the center as new ions are introduced into the trap. However, the Orbitrap uses a static electric field, whereas the quadrupole ion trap uses a dynamic (RF) electric field.71 Energy independence is a major characteristic of the Orbitrap because the axial frequency is used for detection. The axial frequency is independent of the energy and spatial spread of the ions, which makes high mass accuracy and sensitivity possible. Mass spectra are obtained by fast Fourier transformation (FFT) of the signal from ions oscillating around the spindle-shaped central electrode.72

1.3.3.3 Tandem mass spectrometry

Tandem mass spectrometry is known as “MS/MS” or “MSn”. It uses more than one stage of mass analysis to provide structural information. Typically, a collision cell in which fragmentation happens sits between the analyzers. In tandem MS, the precursor ions are first isolated, and then the masses of the product ions (fragments) are determined. The m/z values of the precursor ions and product ions are thus determined independently.

1.3.3.3.1 Fragmentation method

1.3.3.3.1.1 Collision-induced dissociation

CID (collision-induced dissociation) activates selected ions through collisions with a neutral gas. Both low-energy and high-energy collisions can be applied. CID is the most commonly used fragmentation method in the analytical field. Precursor ions collide repeatedly with molecules of an inert gas (e.g. N2, He) in the collision cell. Internal energy builds up during the collisions, and when it reaches a threshold, fragmentation occurs and product ions are generated. For low-energy CID, the energy ranges from 1 to 100 eV, while for high-energy CID it is at the keV level. Instruments that apply low-energy CID include FT-ICR and QIT (quadrupole ion trap), whereas high-energy CID is preferred in magnetic sector and TOF instruments.73 The energy should be optimized, since too much energy can produce uncontrolled fragmentation. Low-energy CID is the fragmentation type applied in this thesis to generate mostly b ions (starting from the N-terminus) and y ions (starting from the C-terminus) for peptide sequencing. Sequential peptide backbone fragments are very helpful in obtaining the amino acid sequence for determination of primary structure. Ions can also undergo neutral losses at low energy, such as loss of water or ammonia.
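
To make the b and y ion series concrete, the sketch below computes singly charged monoisotopic b and y ion m/z values for an unmodified peptide. The residue masses are standard monoisotopic values; the function name and the example peptide are hypothetical, and modifications, higher charge states and neutral losses are not handled.

```python
# Monoisotopic residue masses (Da) of the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def by_ions(peptide):
    """Return (b, y): lists of singly charged b_i and y_i m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)           # N-terminal part
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)  # C-terminal part
    return b_ions, y_ions


b, y = by_ions("PEPTIDE")  # hypothetical example peptide
for i, (b_mz, y_mz) in enumerate(zip(b, y), start=1):
    print(f"b{i} = {b_mz:9.4f}   y{i} = {y_mz:9.4f}")
```

Consecutive members of either series differ by one residue mass, which is what allows the amino acid sequence to be read from the spectrum.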

1.3.3.3.1.2 Electron-transfer dissociation

Electron-transfer dissociation (ETD) is advantageous in the analysis of PTMs, such as glycosylation and phosphorylation, and of disulfide linkages. An electron is transferred from a radical anion to the peptide, which leads to cleavage of the N-Cα bond of the peptide backbone. Instead of the b and y ions generated by CID, ETD produces the corresponding c and z ions. Its fragmentation efficiency is commonly lower than that of CID. For example, in disulfide bond analysis, mass spectra from ETD fragmentation generally contain mostly charge-reduced species and unfragmented peptide ions. This is beneficial for determining the precursor mass and charge state.74 The complementary use of CID and ETD for analysis of disulfide linkages in a variant of lysosomal acidic glucosidase is described in detail in Chapter 4.

1.3.3.3.1.3 Electron-capture dissociation

ECD (electron-capture dissociation) is another fragmentation method, similar to ETD. ETD is primarily applied in linear ion traps, while ECD is mostly used with FT-ICR. Both ECD and ETD are helpful in the analysis of PTMs, such as phosphorylation and glycosylation, because the peptide backbone is cleaved to generate c and z ions while the modified side chain is left intact.75 For example, the diagnostic ions generated are very informative about preserved site-specific electrostatic interactions in the formation of salt-bridge-associated non-covalent peptide complexes, such as those of phosphorylated or sulfated peptides.76 In ECD, electrons are captured by peptides and proteins that carry multiple charges. Electron capture leads to cleavage of the peptide backbone, generating c and z fragment ions from the N-terminus and C-terminus, respectively. Post-translational modifications are therefore preserved on the fragment ions. Besides cleaving the peptide backbone, ECD can cleave the side chains of amino acid residues such as Asp. This characteristic makes ECD helpful in determining the relative abundance of isoAsp formed by deamidation, because ECD generates a specific diagnostic ion, zn-57. Researchers have found that the abundance of this diagnostic ion is linear with the abundance of isoAsp residues in synthetic peptides and can be used for relative quantitation of isoAsp in deamidated proteins.77 This linearity is also observed for various sequences.

1.3.3.3.2 Types of tandem mass spectrometry

There are three major types of tandem MS: triple quadrupole tandem MS, ion trap tandem MS and time-of-flight tandem MS. Tandem mass spectrometry commonly uses soft ionization such as ESI and MALDI. A hybrid mass spectrometer combines different types of mass analyzers in one instrument, bringing together their characteristics for better performance in mass determination. A brief introduction to QQQ, Q-TOF, ion trap and LTQ-Orbitrap tandem MS systems is included here.

QQQ (triple quadrupole) is a type of tandem MS built from three quadrupoles. It is advantageous for precursor ion and neutral loss scans, with better sensitivity and faster scan speed than Q-TOF.78 The second quadrupole (q2) is a collision cell, which the precursor ions from the first quadrupole (q1) enter and where they are fragmented by CID. The third quadrupole (q3) then traps and isolates the fragment ions for excitation and further analysis in linear ion trap mode.79 Thus, the M1 stage selects precursor ions and the M2 stage selects fragment ions. The precursor ion scan process is continuous. There are three major scan modes in QQQ, “precursor or parent ion scan, product ion scan, and constant neutral loss scan”.80 MRM (multiple reaction monitoring) on a QQQ mass spectrometer is a targeted analysis. It can be applied when the mass-to-charge ratios of the precursor ion and the fragment ion are known; together they define an MRM transition. Dependent acquisition of MS/MS can be triggered by MRM.81

Tandem MS in time-of-flight mass spectrometry uses metastable decomposition for fragmentation.82 Q-TOF, a hybrid of quadrupole and time-of-flight analyzers, allows sensitive and fast acquisition. Q-TOF can record mass spectra of precursor ions as well as of selected product ions from the quadrupole mass filter.78

In an ion trap tandem MS system, precursor ion isolation is not continuous as in QQQ; instead, this step is pulsed. Over a specific period of time, precursor ions are selectively accumulated. Precursor ions as well as fragment ions need to be kept within the trap. MSn is achieved by repeating the trapping and excitation process over several stages, which is helpful in elucidating ion structures. On the other hand, problems such as ion-molecule reactions, rearrangement of ions or even artifact formation can disrupt detection and interpretation of the real structure.83 These problems can arise from the multiple stages of reaction and ion trapping. The LTQ-Orbitrap is also a hybrid mass spectrometer, in which the LTQ typically functions as a mass filter in front of the Orbitrap. Briefly, ions are first collected in a linear ion trap, then ejected axially to the C-trap in the middle, and finally enter the Orbitrap for mass analysis.

1.4 Characterization Strategies for Protein by LC-MS/MS

The general strategy for protein characterization is illustrated in Figure 1-4. Complex protein samples are first separated by SDS-PAGE. Samples reduced with DTT (generally heated at 70 ºC for 10 min, or under other conditions depending on protein properties) or non-reduced samples (no DTT added) are loaded onto the gels (e.g. 4-12% Bis-Tris or Tris-Glycine). After separation, the protein bands of interest are cut into gel pieces, de-stained, reduced with DTT and alkylated with IAA (iodoacetamide) (for native digestion, no DTT is added, but IAA is still added for alkylation). The gel pieces are washed with ammonium bicarbonate (if digestion will be conducted under alkaline conditions) or hydrochloric acid buffer (if acidic conditions will be used for incubation). The shrunken gel pieces are then rehydrated in enzymatic digestion buffer at 4 ºC for 30 min to absorb the enzyme before being incubated in buffer (no enzyme) overnight (14-16 h). The whole process from protein separation to enzymatic digestion can vary depending on the specific characteristics of the protein samples. For example, if the protein has a rigid structure and is hard to denature, reduce, and digest, harsher heating conditions (e.g. 90 ºC for 30 min) can be used to improve enzyme access for digestion. The choice of enzyme can also vary. Trypsin, which cleaves after Lys and Arg, is the most commonly used general-purpose enzyme; other enzymes are selected depending on the number of Lys and Arg residues in the protein. If tryptic peptides are too short to be captured by the LC-MS system, or too hydrophilic for the C18 column in reversed-phase liquid chromatography, enzymes such as Lys-C, which cleaves after Lys residues only, can be used instead to generate longer peptides. On the other hand, if tryptic peptides are too hydrophobic to elute within the desired time range, Glu-C (cleaves mostly after Glu and also after Asp) or other enzymes such as pepsin (cleaves at the carboxyl side of aromatic residues such as Phe, Trp and Tyr) might work, since they can generate shorter peptides. Moreover, a multi-enzyme strategy offers advantages in LC-MS analysis, as discussed in 1.4.4. Once proteins are digested, the peptide samples are analyzed by the LC-MS/MS system. If the digestion was conducted under alkaline conditions (e.g. pH 8.0 for general trypsin digestion), 5% formic acid is added to stop the digestion, as the sample solution needs to be acidified for LC-MS injection. The peptides analyzed by the LC-MS system are enzyme specific, which is very helpful in protein identification and peptide mapping. However, non-specific cleavage happens frequently during enzymatic digestion, which complicates protein identification.
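
The enzyme specificities described above can be illustrated with a minimal in silico digestion. The cleavage rules follow the simplified descriptions in the text (trypsin after Lys and Arg, Lys-C after Lys, Glu-C after Glu and Asp). Missed cleavages, non-specific cleavage and other exceptions are not modeled, and the protein sequence is a hypothetical example.

```python
# Simplified rules: the enzyme cleaves C-terminal to the listed residues.
CLEAVE_AFTER = {
    "trypsin": "KR",
    "lys-c": "K",
    "glu-c": "ED",
}


def digest(sequence, enzyme):
    """Return the peptides from complete digestion with one enzyme."""
    residues = CLEAVE_AFTER[enzyme]
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in residues:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


protein = "MKTAYIAKQRQISFVK"  # hypothetical sequence
for enzyme in ("trypsin", "lys-c"):
    print(enzyme, digest(protein, enzyme))
```

With this sequence, Lys-C leaves the Arg site uncleaved and so yields fewer, longer peptides than trypsin, which is the behavior the text relies on when tryptic peptides are too short.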

Figure 1-4. General strategy for analysis of protein samples by LC-MS.

There are two fundamental approaches to protein identification and characterization by LC-MS/MS, top-down analysis and bottom-up analysis, and both are widely applied in proteomic studies. In bottom-up analysis, the peptides obtained by proteolytic cleavage are analyzed by LC-MS/MS, while in top-down analysis, intact protein molecules are fragmented and sent for MS analysis. Figure 1-5 gives a general picture of the two strategies. Here, LC-MS/MS analysis is introduced briefly, followed by reviews of top-down and bottom-up analysis. Then the multi-sequential enzymatic digestion strategy, which is applied extensively in this thesis, is discussed. Finally, the quantitation methods used in LC-MS/MS are introduced.

[Figure 1-5 panels: bottom-up analysis (peptides from proteolysis) and top-down analysis (fragments from dissociation in the gas phase), with PTMs marked along the protein from N-terminus to C-terminus.]

Figure 1-5. Bottom-up vs. top-down analysis (adapted from reference 92).

1.4.1 LC-MS/MS analysis

LC-MS/MS analysis refers to liquid chromatography for separation coupled to tandem mass spectrometry for mass detection, providing information on molecular weight and structure as well as quantitation. Its use has grown tremendously over the years, in particular in life science studies. Several interfaces are used to ionize the molecules separated by the LC system before they enter the mass spectrometer. The particle beam interface works with conventional ionization such as EI. A momentum separator and mechanical pumps are used to pump away the solvent vapor after the solute has been converted into a particle beam and an aerosol has formed in a high-temperature, high-vacuum desolvation chamber. This interface is quite helpful for smaller and lighter molecules, but it is disadvantageous for large non-volatile molecules because of the difficulty of forming an aerosol for subsequent mass spectrometric analysis. Direct liquid introduction by ESI or APCI (atmospheric pressure chemical ionization), in particular ESI, is a popular technique and is widely used in the analytical field. Nano-electrospray ionization operates at a lower flow rate than conventional ESI, so the droplets produced are smaller, which allows ions to be introduced into the MS inlet more efficiently.93 The droplets are generated much faster and their diameter is much smaller. In addition, with nano-ESI the spray needle is closer to the inlet, which helps more ions enter the MS system. This greatly increases the efficiency and sensitivity of the analysis.

For detailed measurements, LC-MS/MS is an excellent choice because of its superior sensitivity and selectivity. Instrumental developments in mass spectrometry, notably FTMS and the Orbitrap, have greatly strengthened characterization work. Briefly, protein PTMs such as glycosylation and phosphorylation are well characterized by nano-HPLC coupled to an LTQ-FT or Orbitrap with CID (collision-induced dissociation) in combination with ETD (electron-transfer dissociation) fragmentation. Degradation products, typically oxidation, deamidation and clipping products, can also be detected.

In the drug development process, full sequence coverage is very important for protein therapeutics, as it helps ensure reproducible products throughout manufacturing. Sample preparation is therefore very important. To detect and monitor product variants and their possible PTMs, separation from the major product may be needed. Our strategy includes SDS-PAGE gel separation and nano reversed-phase HPLC separation. In particular, SDS-PAGE separates the major product from variants, such as glycosylated forms and degradation products, in a direct and quick way, generally on the basis of MW (molecular weight) differences. It also helps remove detergents and salts for better compatibility with mass spectrometry. In top-down analysis, the intact protein is analyzed by the mass spectrometer and its mass is determined after deconvolution, which gives the molecular weight of the protein directly. Clipped protein products can be separated by the LC system and the mass of each determined. However, this does not reveal the specific cleavage sites. Therefore, bottom-up analysis of protein digests is required, in which the peptide analysis provides much more detail.

After the first separation step, enzymes are used to digest the proteins into peptides for detailed analysis by LC-MS/MS. A multi-enzyme strategy, which combines various enzymes, is advantageous for achieving full sequence coverage. Trypsin is commonly used because it generates peptides suitable for mass spectrometric detection with electrospray ionization (ESI). Lys-C is similar to trypsin but generates longer peptides because it cleaves only after Lys residues (trypsin cleaves after both Lys and Arg residues). In addition, Glu-C, Asp-N, pepsin, etc. are used to generate peptides of various lengths for full sequence coverage and for detecting specific peptides. For example, it is desirable for each peptide to carry one glycan moiety and as few other PTMs as possible. The multi-enzyme strategy therefore provides a general procedure for analyzing proteins in greater depth. The peptides from in-gel digestion with various enzymes are analyzed by LC-MS/MS (nano HPLC with a reversed-phase column coupled online to an ion trap mass spectrometer with multiple fragmentation methods). This helps to identify the structure of protein therapeutics as well as to characterize PTMs. In addition, targeted analysis by SRM (selected reaction monitoring) helps detect specific peptides with detailed fragmentation information, and the site of a single modification can be located by this method. In some cases, a combination of fragmentation types helps characterize structural features such as disulfide linkages. Electron transfer dissociation (ETD), in combination with collision-induced dissociation (CID), can help identify the linked peptides. ETD preferentially dissociates the disulfide bond rather than the peptide backbone because of the properties of sulfur. This generates the dissociated peptides and charge-reduced species, providing information on the peptide linkage. Further fragmentation with CID generates b and y ions from the peptide backbone to confirm the linkage site.

1.4.2 Top-down analysis of proteins

In top-down analysis, protein molecules ionized by ESI are introduced into the mass analyzer and characterized by tandem mass spectrometry. The intact protein molecules are fragmented in the gas phase.94 However, the variation in charge state complicates the interpretation of MS/MS spectra. The charge state can be purified by charge stripping of the precursor ion prior to dissociation, which is carried out in ion traps through ion-ion interactions. High mass measurement accuracy is also helpful in top-down analysis. With the high mass accuracy provided by instruments such as FT-ICR and Orbitrap, the isotope spacing, and thus the charge state, of a fragment ion can be determined.95 In ECD, protein ions capture low-energy electrons, and the precursor ions fragment rapidly by cleavage of the N-Cα bond. The c and z ions generated by ECD provide more sequence information. Information about intact PTMs can also be obtained site-specifically. For example, TOF MS2 coupled to an LC system with a monolithic column allows separation, fractionation and profiling of protein samples in plasma.96 Both precursor ions and fragment ions from ETD could be determined rapidly with high accuracy. MALDI-ISD (in-source decay) is also helpful in top-down analysis; protein fragments are generated through excitation and proton donation with a suitable MALDI matrix.97 This is advantageous for obtaining N- and C-terminal information from ISD spectra. In addition, top-down analysis is helpful in isoform mapping of intact proteins.
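
The use of isotope spacing to assign charge state, mentioned above for high-resolution instruments, can be sketched as follows. Adjacent isotope peaks of an ion with charge z are separated by roughly 1.00335/z on the m/z axis. The peak list is hypothetical, and the sketch assumes a resolved isotope cluster of a protonated ion.

```python
ISOTOPE_SPACING_DA = 1.003355  # mass difference between 13C and 12C
PROTON_MASS_DA = 1.007276


def charge_from_isotope_peaks(peaks_mz):
    """Estimate the charge state from consecutive isotope peak m/z values."""
    if len(peaks_mz) < 2:
        raise ValueError("need at least two isotope peaks")
    spacings = [b - a for a, b in zip(peaks_mz, peaks_mz[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    return round(ISOTOPE_SPACING_DA / mean_spacing)


def neutral_mass(mz, charge):
    """Neutral mass of a protonated ion observed at `mz` with `charge`."""
    return charge * (mz - PROTON_MASS_DA)


peaks = [1000.0000, 1000.2007, 1000.4013]  # hypothetical isotope cluster
z = charge_from_isotope_peaks(peaks)
print(z, round(neutral_mass(peaks[0], z), 4))
```

The same calculation applies to fragment ions and precursor ions alike, provided the instrument resolves the individual isotope peaks.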

1.4.3 Bottom-up analysis of proteins

The bottom-up approach involves extensive proteolytic digestion of protein molecules by enzymes followed by LC-MS/MS analysis. It is critical for sequence analysis and for evaluating impurities as well as PTMs in protein samples. Digested peptides are separated and analyzed to obtain precursor m/z values and fragment ions from the MS/MS spectra, which are compared with the peptide sequences expected from theoretical digestion. Sequence coverage is the portion of the protein sequence covered by the peptides identified in bottom-up analysis. In bottom-up proteomic studies, what is learned about the proteins in biological samples depends closely on the peptides identified. Briefly, proteins are extracted from cell lines or tissue samples and, if needed because of sample complexity, fractionated. The extracted proteins are then denatured, reduced, alkylated and digested with specific enzymes (e.g. trypsin). The digests are analyzed by the LC-MS/MS system, and the data are processed with software such as Proteome Discoverer or Bioworks. Manual extraction of the MS and MS2 spectra of peptides may also be necessary to confirm their presence. The bottom-up approach is also widely used in the analysis of protein PTMs. The presence of a modification can be verified from the precursor m/z of the specific peptide, and its site can be located from the b and y ions in the MS2 spectra.
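
Sequence coverage as described above can be computed by marking every residue that falls within at least one identified peptide. The sketch below is a minimal version that matches peptides by exact substring search. The protein sequence and peptide list are hypothetical, and it does not reproduce how any particular search software reports coverage.

```python
def sequence_coverage(protein, peptides):
    """Percentage of protein residues covered by the identified peptides."""
    covered = [False] * len(protein)
    for peptide in peptides:
        start = protein.find(peptide)
        while start != -1:  # mark every occurrence of the peptide
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return 100.0 * sum(covered) / len(protein)


protein = "MKTAYIAKQRQISFVK"        # hypothetical sequence
identified = ["TAYIAK", "QISFVK"]   # hypothetical identified peptides
print(f"{sequence_coverage(protein, identified):.1f}% coverage")
```

Overlapping peptides from different enzymes are counted only once per residue, so adding a second digest can only raise or maintain the coverage.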

1.4.4 Multi-sequential enzymatic digestion strategy

When enzymatic digestion is used for protein sequencing, a single enzyme is often not enough for full sequence coverage. Trypsin, which cleaves after Lys and Arg residues, is commonly used to digest proteins into peptides. However, because the length of tryptic peptides depends on the number of Lys and Arg residues, trypsin alone can generate very long peptides that are too hydrophobic for reversed-phase HPLC analysis. Conventional one-step digestion with a single enzyme thus produces too few detectable peptides to cover the sequence of the entire protein, which makes it difficult to identify proteins and to determine sequence variants and PTMs. In such cases, other enzymes such as Glu-C and Asp-N can be used together with tryptic digestion to cleave the peptide backbone further, providing peptides that are more suitable for analysis. For example, multiple enzymes and sequential digestion were used to characterize the BoNT sequence with 98.4% coverage and strong evidence for serotype determination.98 Chapters 2 and 4 describe the use of the multi-sequential enzymatic digestion strategy in detail.

Besides improving peptide sequence coverage, the multi-enzyme method is commonly used to remove glycans. PNGase F, an amidase, specifically releases N-linked oligosaccharides from peptides. When PNGase F is combined with enzymes that cleave the peptide backbone, glycopeptides can be identified because the Asn at the glycosylation site is converted to Asp by PNGase F. Further enzymes can be applied to analyze the structure of the released glycan. Endo-β-N-acetylglucosaminidase H (Endo H) is another enzyme that cleaves N-linked oligosaccharides; it cleaves between the two GlcNAc residues. The combined use of Endo H and other enzymes that cleave the glycan structure is useful for the study of high-mannose and hybrid-type oligosaccharides. For example, applying Endo H and PNGase F together with tryptic digestion helps elucidate the N-glycosylation sites carrying paucimannose and oligomannose structures.99
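
The Asn-to-Asp conversion left by PNGase F can be illustrated with a short sketch. It locates candidate N-glycosylation sites using the commonly cited N-X-S/T consensus motif (X not Pro), which is general background knowledge rather than something stated in the text above, and rewrites an occupied site as Asp. The peptide sequence is hypothetical; the mass shift of about +0.984 Da is the difference between the Asp and Asn residue masses.

```python
import re

ASN_TO_ASP_SHIFT_DA = 0.984016  # Asp residue mass minus Asn residue mass


def find_sequons(sequence):
    """Return 0-based positions of Asn in N-X-S/T motifs (X is not Pro)."""
    return [m.start() for m in re.finditer(r"N(?=[^P][ST])", sequence)]


def deglycosylated_sequence(sequence, occupied_sites):
    """Sequence expected after PNGase F: Asn to Asp at the occupied sites."""
    residues = list(sequence)
    for pos in occupied_sites:
        if residues[pos] != "N":
            raise ValueError(f"position {pos} is not Asn")
        residues[pos] = "D"
    return "".join(residues)


peptide = "AANGSKNPTLNVT"  # hypothetical sequence
sites = find_sequons(peptide)
print(sites, deglycosylated_sequence(peptide, sites))
print(f"expected mass shift: +{len(sites) * ASN_TO_ASP_SHIFT_DA:.3f} Da")
```

In practice the small mass shift at a motif position is what distinguishes a formerly glycosylated site from an unoccupied one in the peptide-level data.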

1.4.5 Labeling quantitation in LC-MS/MS analysis

Stable isotopes are commonly used for chemical labeling, specifically for relative quantification by LC-MS/MS. ICAT (isotope-coded affinity tag) is a stable-isotope labeling method for quantitative proteomics. Briefly, the intact proteins of two samples are labeled with heavy and light tags, respectively, and the samples are combined for proteolysis. After digestion, the tagged peptides are purified by affinity chromatography. Relative protein expression is measured from the intensities of the labeled peptides (heavy isotope vs. light isotope).

A newer cleavable ICAT reagent that employs 13C and an acid-cleavable biotin group is better suited to complex modification problems. For example, ICAT-based LC-MS/MS analysis has been used to identify and quantitate Cys thiols that are sensitive to oxidants. Free Cys can be measured because the iodoacetamide-based ICAT reagent labels free thiols.100 The sensitivity of the thiols to oxidants such as hydrogen peroxide can be determined by measuring the free Cys remaining in the sample.

In contrast to ICAT, which labels intact protein molecules, other quantitation strategies label proteolytic peptides. 18O labeling can be incorporated into the enzymatic digestion process. For example, labeled and unlabeled proteolytic peptides of adenovirus type 5 can be separated by HPLC, and the peak intensities of the isotope pairs can be measured from MALDI-TOF spectra.101 In addition, iTRAQ is a more recently developed method that uses isobaric tags for relative and absolute quantitation. Upon fragmentation in tandem MS, signature ions appear at m/z 114 to 117, and peak area integration of these signature ions is used for quantitation.102 The tag also contains a peptide-reactive group, and MS/MS fragmentation takes place between this reactive group and the peptide. While the reporter ions appear in the low mass range of the MS/MS spectra, backbone fragmentation of the peptide is used to determine the amino acid sequence. MS/MS of iTRAQ-labeled samples can therefore quantitate the variations. Because the iTRAQ reagents described here provide four distinct reporter ions, at most four samples can be analyzed and compared in a single run. This is helpful in determining variations in protein expression.
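
Reporter-ion quantitation of the kind described above can be sketched as follows. The example sums the intensity found near each nominal reporter m/z (114 to 117, as given in the text) and expresses every channel relative to a reference channel. The tolerance, the function names and the peak list are hypothetical, and summed centroid intensity is used here as a simple stand-in for peak area integration.

```python
REPORTER_CHANNELS = (114, 115, 116, 117)  # nominal m/z values from the text


def reporter_intensities(peaks, tol_da=0.2):
    """Sum the intensity of peaks within `tol_da` of each reporter m/z."""
    totals = {channel: 0.0 for channel in REPORTER_CHANNELS}
    for mz, intensity in peaks:
        for channel in REPORTER_CHANNELS:
            if abs(mz - channel) <= tol_da:
                totals[channel] += intensity
    return totals


def ratios_to_reference(totals, reference=114):
    """Express each channel relative to the reference channel."""
    ref = totals[reference]
    if ref == 0:
        raise ValueError("reference channel has no signal")
    return {channel: value / ref for channel, value in totals.items()}


# Hypothetical MS/MS peak list as (m/z, intensity) pairs.
spectrum = [(114.11, 1000.0), (115.11, 2000.0), (116.11, 500.0),
            (117.11, 1000.0), (175.12, 300.0)]
print(ratios_to_reference(reporter_intensities(spectrum)))
```

Peaks outside the reporter windows, such as backbone fragments, are ignored for quantitation and remain available for sequence assignment.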

1.4.6 Spectral counts in quantitative proteomics

Spectral counting is widely used to study protein expression in different cell types and tissue samples without the limitations of labeling, and it is very helpful in the proteomic analysis of complex protein samples. The spectral count of a protein is the total number of spectra of peptides identified for that protein by LC-MS/MS. Spectral counts have been reported to correlate well with protein abundance over a linear dynamic range.103 Spectral counts are widely used for the identification, profiling and quantitation of large, complex proteomes. The normalized spectral abundance factor, known as NSAF, is frequently used to determine differences in proteins and their expression without labeling. Although relative quantitation of a proteome can be based on peptide peak intensities or on protein spectral counts, both approaches have limitations. For example, sample preparation and injection into the LC-MS/MS system can generate differences even for the same sample. A normalization factor helps to account for such variation. NSAF is more than a single normalization factor: it considers protein size in addition to spectral counts. It therefore accounts for variations in protein length, since larger proteins tend to have more spectral counts detected and identified than smaller proteins.104
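The length correction can be made concrete with the commonly used definition of NSAF, in which the spectral count of each protein is divided by its length and the result is normalized to the sum over all proteins in the run: NSAF_i = (SpC_i / L_i) / sum_j (SpC_j / L_j). The sketch below uses hypothetical protein names, counts and lengths.

```python
def nsaf(spectral_counts, lengths):
    """Compute NSAF for each protein from spectral counts and lengths (in amino acids)."""
    # Spectral abundance factor: spectral count divided by protein length.
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        raise ValueError("no spectra were counted")
    # Normalize so that the NSAF values of one run sum to 1.
    return {p: value / total for p, value in saf.items()}

# Hypothetical example: proteins A and B have the same spectral count,
# but B is twice as long, so its NSAF is half that of A.
counts = {"protein_A": 50, "protein_B": 50, "protein_C": 10}
lengths = {"protein_A": 500, "protein_B": 1000, "protein_C": 200}

values = nsaf(counts, lengths)
for name in sorted(values):
    print(f"{name}: NSAF = {values[name]:.3f}")
```

Because the values of one run sum to 1, NSAF values can be compared between runs that differ in the total number of acquired spectra.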

1.5 ERBB2-driven Studies

1.5.1 Cancers associated with ERBB2

ERBB2, which is erythroblastic leukemia viral oncogene homolog 2 (from ), is an important oncogene in the development of cancer; it can form heterodimers with other epidermal growth factor (EGF) receptor family members and activate kinase-mediated downstream signaling pathways.105 ERBB2 (Her-2/neu) is a member of the EGFR (epidermal growth factor receptor) family. The gene is located in the q12 region of chromosome 17 and encodes a trans-membrane glycoprotein of around 185 kDa with tyrosine kinase activity. Dimerization of ERBB2 with all ERBB receptors results in signal amplification. Formation of the EGFR-ERBB2 heterodimer leads to increased tyrosine kinase activity and subsequent signal activation and cascading.106 ERBB2 is closely related to cancers. Studies have shown that over-expression of ERBB2 is associated with about 30% of breast cancers as well as other cancers, such as stomach and lung carcinomas.107 Significant over-expression of ERBB2 has been observed in inflammatory breast cancer (IBC) patients. Treatment of ERBB2 over-expressing IBC and non-IBC tumor models with trastuzumab, an ERBB2 monoclonal antibody, decreased cell viability.108 It has also been proposed that colorectal cancer development in is associated with ERBB2 over-expression; together with EGFR, ERBB2 is emerging as an essential biomarker and target for colorectal cancer treatment.109

1.5.2 SNPs and ASVs of ERBB2

ASVs stands for alternative splice variants. In the Ensembl110 database, there are fourteen ASVs for ERBB2, with the following lengths in amino acids (number of ASVs of that length in brackets): 1255 (1), 1240 (1), 1225 (3), 1055 (1), 979 (1), 603 (2), and less than 252 (5). In the CCDS (consensus CDS) database, four of the fourteen are listed: one of 1255 and three of 1225.111 In both the Genecards105a and GPMDB112 databases, there are six ASVs, one of 1255, one of 1240, two of 1225, and one of 979. Because the protein variants differ, their PTMs also differ. For example, among the four ASVs of ERBB2, three have phosphotyrosine in neXtProt. It must be noted, however, that fewer PTMs are probably reported for low-abundance ASVs because less experimental evidence is available. Figure 1-6 illustrates the four ASVs by showing the amino acid sequence of the ASV with 1255 aa, with the other three shown as alternative splice 1, 2 and 3 from the Ensembl database. Sequence variants are shown with square brackets, and variations in PTMs are shown as well.

[Figure 1-6 image: annotated ERBB2 sequence. Legend labels: observed in full sequence ENSP00000269571 (1255 aa); not observed in alternative splice 1; not observed in alternative splice 2 and 3.]

Figure 1-6. ERBB2 sequence comparison between ENSP00000269571 and the other three ASVs from Ensembl.110 (amino acids highlighted in blue: observed PTMs in ENSP00000269571, blue bracket: alternative splice 1, red bracket: alternative splice 2, green bracket: alternative splice 3, red circle: PTMs not observed in alternative splice 1, cross: PTMs not observed in alternative splice 2 and 3) (adapted from Ensembl website)

SNPs are single-nucleotide polymorphisms. Understanding the SNPs of ERBB2 is helpful in studying the relationship between amino acid changes and function. For example, various polymorphisms, such as I655V, located in the trans-membrane region, have been tested for an association between ERBB2 polymorphisms and cancer predisposition.113 Figure 1-7 illustrates information about SNPs. In this figure, residues with 2, 3, and 4 SNPs are underlined in red, blue and green, respectively. PTMs of amino acids in this sequence listed in GPMDB and neXtProt are boxed separately for phosphorylation (yellow and purple boxes) and glycosylation (blue boxes). The SNPs discussed here are non-synonymous, or coding, SNPs (cSNPs).

1 MELAALCRWG LLLALLPPGA ASTQVCTGTD MKLRLPASPE THLDMLRHLY 50

51 QGCQVVQGNL ELTYLPTNAS LSFLQDIQEV QGYVLIAHNQ VRQVPLQRLR 100

101 IVRGTQLFED NYALAVLDNG DPLNNTTPVT GASPGGLREL QLRSLTEILK 150

151 GGVLIQRNPQ LCYQDTILWK DIFHKNNQLA LTLIDTNRSR ACHPCSPMCK 200

201 GSRCWGESSE DCQSLTRTVC AGGCARCKGP LPTDCCHEQC AAGCTGPKHS 250

251 DCLACLHFNH SGICELHCPA LVTYNTDTFE SMPNPEGRYT FGASCVTACP 300

301 YNYLSTDVGS CTLVCPLHNQ EVTAEDGTQR CEKCSKPCAR VCYGLGMEHL 350

351 REVRAVTSAN IQEFAGCKKI FGSLAFLPES FDGDPASNTA PLQPEQLQVF 400

401 ETLEEITGYL YISAWPDSLP DLSVFQNLQV IRGRILHNGA YSLTLQGLGI 450

451 SWLGLRSLRE LGSGLALIHH NTHLCFVHTV PWDQLFRNPH QALLHTANRP 500

501 EDECVGEGLA CHQLCARGHC WGPGPTQCVN CSQFLRGQEC VEECRVLQGL 550

551 PREYVNARHC LPCHPECQPQ NGSVTCFGPE ADQCVACAHY KDPPFCVARC 600

601 PSGVKPDLSY MPIWKFPDEE GACQPCPINC THSCVDLDDK GCPAEQRASP 650

651 LTSIISAVVG ILLVVVLGVV FGILIKRRQQ KIRKYTMRRL LQETELVEPL 700

701 TPSGAMPNQA QMRILKETEL RKVKVLGSGA FGTVYKGIWI PDGENVKIPV 750

751 AIKVLRENTS PKANKEILDE AYVMAGVGSP YVSRLLGICL TSTVQLVTQL 800

801 MPYGCLLDHV RENRGRLGSQ DLLNWCMQIA KGMSYLEDVR LVHRDLAARN 850

851 VLVKSPNHVK ITDFGLARLL DIDETEYHAD GGKVPIKWMA LESILRRRFT 900

901 HQSDVWSYGV TVWELMTFGA KPYDGIPARE IPDLLEKGER LPQPPICTID 950

951 VYMIMVKCWM IDSECRPRFR ELVSEFSRMA RDPQRFVVIQ NEDLGPASPL 1000

1001 DSTFYRSLLE DDDMGDLVDA EEYLVPQQGF FCPDPAPGAG GMVHHRHRSS 1050

1051 STRSGGGDLT LGLEPSEEEA PRSPLAPSEG AGSDVFDGDL GMGAAKGLQS 1100

1101 LPTHDPSPLQ RYSEDPTVPL PSETDGYVAP LTCSPQPEYV NQPDVRPQPP 1150

1151 SPREGPLPAA RPAGATLERP KTLSPGKNGV VKDVFAFGGA VENPEYLTPQ 1200

1201 GGAAPQPHPP PAFSPAFDNL YYWDQDPPER GAPPSTFKGT PTAENPEYLG 1250

1251 LDVPV 1255

Figure 1-7. PTMs and SNPs of ERBB2 sequence (1255 aa, ENSP00000269571). (residues highlighted in yellow: phosphorylation from GPMDB, residues highlighted in purple: phosphorylation from GPMDB and neXtProt, residues highlighted in blue: glycosylation from neXtProt, underlined residues in red: 2 SNPs, underlined residues in blue: 3 SNPs, underlined residues in green: 4 SNPs.) (adapted from Ensembl website)

In neXtProt, the data are curated for solid experimental evidence such as well-annotated MS/MS spectra. Protein variants resulting from alternative splicing and SNPs add significant heterogeneity; ERBB2, for example, has 4 and 88 such variants, respectively. One could also expect PTMs to be altered in a protein variant, again as with ERBB2. At this stage of characterization of the proteogenome, there is little experimental evidence for PTMs in variants produced by cSNPs, and thus they will not be discussed here, although one could expect additional variation in these variants. Other examples of genes with significant numbers of variants include IKZF3 with 15 and 7 variants, TP53 with 9 and 1394 variants, and BRCA1 with 6 and 320 variants, respectively.

1.5.3 ERBB2 amplicon

Amplicon refers to the amplification of genes, in particular of oncogenes and of adjacent genes, in solid tumor development. Oncogene amplification has been reported over an extensive span of the region on . The ERBB2 amplicon has been observed in breast cancer and other cancers at region q12-q21 on chromosome 17. Co-amplification of the genes of the ERBB2 amplicon is related to the characteristics of ERBB2 over-expressing tumors and is closely associated with cancer progression. Factors other than ERBB2 itself clearly influence ERBB2-driven cancers, and the co-amplified genes around ERBB2 are the most promising candidates. In breast cancer, the q12-q21 region on chromosome 17 is critical because its amplification is closely related to activation of ERBB2 and disease pathogenesis.114 In both gastric and breast cancer, the PPP1R1B-ERBB2-GRB7 region is significantly amplified.115 This region is PPP1R1B-STARD3-TCAP-PNMT-PGAP3-ERBB2-MIEN1-GRB7. Table 1-1 lists the gene information (band location, gene start and end) and the description of these genes extracted from the Genecards database, together with the cancer associations obtained from the iHOP116, Uniprot105b and neXtProt117 databases. In addition to this region, genes over a larger span but still positioned within 1 Mb on each side of ERBB2, such as PSMB3, RPL19, and NR1D1, can also possibly be co-amplified. Besides, genes with related functions are co-expressed with ERBB2, which is also helpful in studies of disease etiology.118 For example, both GRB7 and TOP2A encode proteins related to ERBB2 nuclear transport as well as synthesis of DNA, and their expression is elevated when ERBB2 is co-expressed. This might be correlated with cancer progression and development. In studies of the chromosome 17 parts list, the ERBB2 amplicon region has been taken to span from TIAF1 to TOP2A.


Table 1-1. Description and cancer association of genes in the PPP1R1B-ERBB2-GRB7 region.

Gene name | Band location | Gene start | Gene end | Description (Genecards) | Cancer association
PPP1R1B | 17q12 | 37783179 | 37792879 | protein phosphatase 1, regulatory inhibitor, subunit 1B | Overexpressed in breast, colon, and gastric cancers.
STARD3 | 17q12 | 37793318 | 37820454 | StAR-related lipid transfer (START) domain containing 3 | Related to increased steroid hormone production in cancer; overexpressed in breast cancer and other hormone-responsive tumors.
PNMT | 17q12 | 37824234 | 37826728 | phenylethanolamine N-methyltransferase | Related to adrenaline through conversion of noradrenaline. Elevated expression in breast cancer.
PGAP3 | 17q12 | 37827375 | 37844310 | post-GPI attachment to proteins 3 | Associated with GPI maturation and lipid remodeling; critical in association with the ERBB2 lipid raft.
ERBB2 | 17q12 | 37844393 | 37884915 | v-erb-b2 erythroblastic leukemia viral oncogene homolog 2 | Strongly associated with a wide variety of cancers; significantly over-expressed in tumors.
MIEN1 | 17q12 | 37885409 | 37887040 | migration and invasion enhancer 1 (C17orf37) | Behaves as an enhancer for carcinomas and a regulator of apoptosis. Over-expressed in various breast cancers and highly significant in prostate cancer.
GRB7 | 17q12 | 37894187 | 37903545 | growth factor receptor-bound protein 7 | Associated with ERBB2 phosphorylation and promotion of HRAS activities. Related to invasive carcinomas.
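The coordinates in Table 1-1 can be used to check how compact this region is. The sketch below computes the length of each gene and the span of the listed region, assuming that the start and end values are inclusive base-pair positions on chromosome 17 (an assumption; the table does not state the coordinate convention).

```python
# (gene, start, end) copied from Table 1-1; coordinates assumed inclusive.
genes = [
    ("PPP1R1B", 37783179, 37792879),
    ("STARD3", 37793318, 37820454),
    ("PNMT", 37824234, 37826728),
    ("PGAP3", 37827375, 37844310),
    ("ERBB2", 37844393, 37884915),
    ("MIEN1", 37885409, 37887040),
    ("GRB7", 37894187, 37903545),
]

# Length of each gene in base pairs.
gene_lengths = {name: end - start + 1 for name, start, end in genes}

# Span from the first gene start to the last gene end.
region_start = min(start for _, start, _ in genes)
region_end = max(end for _, _, end in genes)
region_span = region_end - region_start + 1

for name, length in gene_lengths.items():
    print(f"{name}: {length} bp")
print(f"listed region spans {region_span} bp")
```

Under this assumption the listed genes span roughly 0.12 Mb, well inside the 1 Mb window on each side of ERBB2 mentioned in the text.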


1.5.4 Proteomic analysis in combination with transcriptomics

In the studies of chromosome 17, we combined proteome and genome knowledge derived from neXtProt, GPMDB, and Antibodypedia. Figure 1-8 summarizes this knowledge at the genomic, transcriptomic and proteomic levels; this is the current proteogenome of chromosome 17. The designation of protein-coding genes is based on the Ensembl genome browser, which shows approximately 1169 such genes: 861 with protein evidence, 269 with transcript evidence, and 40 with preliminary evidence. The number of high-probability mass spectrometric identifications in GPMDB is 824 (70%), the number of high-grade identifications in neXtProt is 601 (51%), and the number in PeptideAtlas at 1% FDR is 745 (64%). By comparison, an individual proteomics study would be expected to generate 200 to 400 identifications of proteins coded by chromosome 17 (not shown in this figure), depending on the number of replicates and the size of the sample set. The GPMDB data set is an aggregation of curated protein identifications by mass spectrometry that have been deposited in the public domain, while neXtProt is based on curation of the public literature. Thus the two databases represent two different snapshots of the information flow into proteomics. For example, the figure shows that the number of genes with protein evidence in the neXtProt database is 861, while the number with a high GPMDB value is 824.
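The percentages quoted above follow directly from the counts. As a small check, the sketch below recomputes each coverage value against the approximate total of 1169 protein-coding genes; the variable names are illustrative.

```python
# Counts quoted in the text for chromosome 17.
total_genes = 1169  # approximate number of protein-coding genes (Ensembl)
identified = {"GPMDB": 824, "neXtProt": 601, "PeptideAtlas": 745}

# Percentage of protein-coding genes with an identification in each resource.
coverage = {db: round(100 * n / total_genes) for db, n in identified.items()}

# The three evidence categories should add up to roughly the total gene count.
evidence = {"protein": 861, "transcript": 269, "preliminary": 40}
evidence_total = sum(evidence.values())

for db, pct in coverage.items():
    print(f"{db}: {identified[db]} of {total_genes} genes ({pct}%)")
print(f"sum of evidence categories: {evidence_total}")
```

The evidence categories sum to 1170, consistent with the text describing the total as approximately 1169.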

[Figure 1-8 image: bar chart with a vertical axis from 0 to 1200 and the categories Preliminary, Transcript, and Protein level.]

Figure 1-8. Genome vs. transcriptome vs. proteome.

While one may expect the depth of proteomic coverage to be similar across all chromosomes, subtle differences may be observed. For example, it has been noted that there is a reasonable correlation between the gene density of a given chromosomal region and the number of protein observations; the number of protein observations could thus be expected to be higher for chromosomes of high gene density than for chromosomes of low gene density. A major factor influencing protein expression is mRNA expression. Transcriptome sequencing (RNA-Seq) provides a relatively comprehensive profile of the transcriptome. Combining mass spectrometry-based proteomic studies with transcriptomic analysis gives a more complete and in-depth understanding of clinical samples, because the analysis covers both the proteomic and genomic levels. The Chromosome-Centric Human Proteome Project (CHPP) initiates the integrated analysis of multi-stage platforms of “omics” studies. Combined use of RNA-Seq data with proteomics results establishes in-depth analysis of the chromosome parts list and oncology investigation.119

1.6 Conclusion

Industrial protein therapies have been advanced by biotechnological development. It is critical to understand the potential problems associated with the heterogeneity of protein therapeutics. Therefore, the application of various techniques to characterize proteins comprehensively is essential. This includes, but is not limited to, peptide mapping for sequence determination, PTM analysis, disulfide bond mapping, and glycosylation analysis. A wide variety of techniques and strategies for LC-MS/MS analysis have been developed and applied for this purpose. In addition, ERBB2, a major oncogene, has been shown to be closely associated with cancer. With co-amplification, the expression of genes around ERBB2 on bands q12-q21 of chromosome 17 can be elevated, which could be critical in cancer diagnostics and treatment. As applied by the CHPP initiative, integration of proteomic and transcriptomic analysis allows for in-depth understanding of cancers.


1.7 References:

1. Leader, B.; Baca, Q. J.; Golan, D. E., Protein therapeutics: A summary and pharmacological classification. Nat Rev Drug Discov 2008, 7 (1), 21-39.
2. Pickup, J., Human Insulin. Brit Med J 1986, 292 (6514), 155-157.
3. TOP 30 Biologics 2011. http://www.pipelinereview.com/index.php/2011030447751/FREE-Reports/-TOP-30-Biologics-2011.html 2012.
4. Laura A. Palomares, F. K.-B., Octavio T. Ramirez, Industrial Recombinant Protein Production. Biotechnology V.
5. Kim, Y. H.; Berry, A. H.; Spencer, D. S.; Stites, W. E., Comparing the effect on protein stability of methionine oxidation versus mutagenesis: steps toward engineering oxidative resistance in proteins. Protein Eng 2001, 14 (5), 343-347.
6. FDA, Biotechnology Inspection Guide In BIOTECHNOLOGY INSPECTION GUIDE REFERENCE MATERIALS AND TRAINING AIDS 1991.
7. Wormald, M. R.; Rudd, P. M.; Harvey, D. J.; Chang, S. C.; Scragg, I. G.; Dwek, R. A., Variations in oligosaccharide-protein interactions in immunoglobulin G determine the site-specific glycosylation profiles and modulate the dynamic motion of the Fc oligosaccharides. Biochemistry-Us 1997, 36 (6), 1370-1380.
8. Frank J. Mannuzza, J. G. M., Is Bovine Albumin Too Complex to Be Just a Commodity? BioProcess International 2010, 8, 40-43.
9. Tsui, S. M.; Lam, W. M.; Lam, T. L.; Chong, H. C.; So, P. K.; Kwok, S. Y.; Arnold, S.; Cheng, P. N. M.; Wheatley, D. N.; Lo, W. H.; Leung, Y. C., Pegylated derivatives of recombinant human arginase (rhArg1) for sustained in vivo activity in cancer therapy: preparation, characterization and analysis of their pharmacodynamics in vivo and in vitro and action upon hepatocellular carcinoma cell (HCC). Cancer Cell Int 2009, 9.
10. Strupat, K., Molecular weight determination of peptides and proteins by ESI and MALDI. Method Enzymol 2005, 405, 1-36.
11. Yang, M.; Butler, M., Effects of ammonia and glucosamine on the heterogeneity of erythropoietin glycoforms. Biotechnol Progr 2002, 18 (1), 129-138.
12. Wildt, S.; Gerngross, T. U., The humanization of N-glycosylation pathways in yeast. Nat Rev Microbiol 2005, 3 (2), 119-128.
13. Lewis, S. E.; Nixon, R. A., Multiple Phosphorylated Variants of the High Molecular Mass Subunit of in Axons of Retinal Cell Neurons - Characterization and Evidence for Their Differential Association with Stationary and Moving Neurofilaments. J Cell Biol 1988, 107 (6), 2689-2701.
14. Leonova, T.; Qi, X. Y.; Bencosme, A.; Ponce, E.; Sun, Y.; Grabowski, G. A., Proteolytic processing patterns of prosaposin in insect and mammalian cells. J Biol Chem 1996, 271 (29), 17312-17320.
15. Rosenfeld, R.; Bangio, H.; Gerwig, G. J.; Rosenberg, R.; Aloni, R.; Cohen, Y.; Amor, Y.; Plaschkes, I.; Kamerling, J. P.; Maya, R. B. Y., A lectin array-based methodology for the analysis of protein glycosylation. J Biochem Bioph Meth 2007, 70 (3), 415-426.


16. Goldstein, M. E.; Sternberger, L. A.; Sternberger, N. H., Varying Degrees of Phosphorylation Determine Microheterogeneity of the Heavy Neurofilament Polypeptide (Nf-H). J Neuroimmunol 1987, 14 (2), 135-148.
17. Apweiler, R.; Hermjakob, H.; Sharon, N., On the frequency of protein glycosylation, as deduced from analysis of the SWISS-PROT database. Bba-Gen Subjects 1999, 1473 (1), 4-8.
18. Weinmann, W.; Maier, C.; Przybylski, M., Characterization of Primary Structure and Microheterogeneity of Fatty-Acid Acylated Lipoproteins by 252cf-Plasma Desorption and Electrospray Mass-Spectrometry. Fresen J Anal Chem 1992, 343 (1), 63-64.
19. Rao, S.; Bach, N., Monolithic columns for biomolecule analysis. Genet Eng Biotechn 2007, 27 (4), 20-20.
20. Volkin, D. B.; Verticelli, A. M.; Bruner, M. W.; Marfia, K. E.; Tsai, P. K.; Sardana, M. K.; Middaugh, C. R., Deamidation of Polyanion-Stabilized Acidic Fibroblast Growth-Factor. J Pharm Sci 1995, 84 (1), 7-11.
21. Cohen, O.; Kronman, C.; Chitlaru, T.; Ordentlich, A.; Velan, B.; Shafferman, A., Effect of chemical modification of recombinant human acetylcholinesterase by polyethylene glycol on its circulatory longevity. Biochem J 2001, 357, 795-802.
22. Sanchez, J.; Holmgren, J., Cholera toxin structure, gene regulation and pathophysiological and immunological aspects. Cell Mol Life Sci 2008, 65 (9), 1347-1360.
23. Wright, H. T., Nonenzymatic Deamidation of Asparaginyl and Glutaminyl Residues in Proteins. Crit Rev Biochem Mol 1991, 26 (1), 1-52.
24. Robinson, N. E.; Zabrouskov, V.; Zhang, J.; Lampi, K. J.; Robinson, A. B., Measurement of deamidation of intact proteins by isotopic envelope and mass defect with ion cyclotron resonance Fourier transform mass spectrometry. Rapid Commun Mass Sp 2006, 20 (23), 3535-3541.
25. Watanabe, A.; Takio, K.; Ihara, Y., Deamidation and isoaspartate formation in smeared tau in paired helical filaments - Unusual properties of the -binding domain of tau. J Biol Chem 1999, 274 (11), 7368-7378.
26. Chelius, D.; Rehder, D. S.; Bondarenko, P. V., Identification and characterization of deamidation sites in the conserved regions of human Immunoglobulin Gamma antibodies. Anal Chem 2005, 77 (18), 6004-6011.
27. Harris, R. J.; Kabakoff, B.; Macchi, F. D.; Shen, F. J.; Kwong, M.; Andya, J. D.; Shire, S. J.; Bjork, N.; Totpal, K.; Chen, A. B., Identification of multiple sources of charge heterogeneity in a recombinant antibody. J Chromatogr B 2001, 752 (2), 233-245.
28. Fennema's Food Chemistry, Fourth Edition (Food Science and Technology) CRC Press, Boca Raton, FL: 2008.
29. Stadtman, E. R.; Berlett, B. S., Reactive oxygen-mediated protein oxidation in aging and disease. Drug Metab Rev 1998, 30 (2), 225-243.
30. Korbashi, P.; Kohen, R.; Katzhendler, J.; Chevion, M., Iron Mediates Paraquat Toxicity in Escherichia-Coli. J Biol Chem 1986, 261 (27), 2472-2476.
31. Jones, J., Amino acid and peptide synthesis. Oxford Univ. Press: 2002.
32. Mirzaei, H.; Regnier, F., Protein : protein aggregation induced by protein oxidation. Journal of Chromatography B-Analytical Technologies in the Biomedical and Life Sciences 2008, 873 (1), 8-14.


33. Fucci, L.; Oliver, C. N.; Coon, M. J.; Stadtman, E. R., Inactivation of Key Metabolic Enzymes by Mixed-Function Oxidation Reactions - Possible Implication in Protein-Turnover and Aging. P Natl Acad Sci-Biol 1983, 80 (6), 1521-1525.
34. Singh, I.; Contreras, M. A.; Baatz, J. E., PPAR-ALPHA agonist induced multifunctional enzyme-2 posttranslational modification: Implications for peroxisomal proliferation. J Neurochem 2008, 104, 39-40.
35. Cakatay, U., Protein oxidation parameters in type 2 diabetic patients with good and poor glycaemic control. Diabetes Metab 2005, 31 (6), 551-557.
36. Poole, L. B.; Nelson, K. J., Discovering mechanisms of signaling-mediated cysteine oxidation. Curr Opin Chem Biol 2008, 12 (1), 18-24.
37. Sandvig, K.; van Deurs, B., Transport of protein toxins into cells: pathways used by ricin, cholera toxin and Shiga toxin. FEBS Letters 2002, 529 (1), 49-53.
38. Fujinaga, Y., Transport of bacterial toxins into target cells: Pathways followed by cholera toxin and botulinum progenitor toxin. J Biochem 2006, 140 (2), 155-160.
39. Balzarini, J., Targeting the glycans of glycoproteins: a novel paradigm for antiviral therapy. Nat Rev Microbiol 2007, 5 (8), 583-597.
40. Huq, M. D. M.; Ha, S. G.; Barcelona, H.; Wei, L. N., Lysine Methylation of Nuclear Co-Repressor Receptor Interacting Protein 140. J Proteome Res 2009, 8 (3), 1156-1167.
41. Calnan, D. R.; Webb, A. E.; White, J. L.; Stowe, T. R.; Goswami, T.; Shi, X. B.; Espejo, A.; Bedford, M. T.; Gozani, O.; Gygi, S. P.; Brunet, A., Methylation by Set9 modulates FoxO3 stability and transcriptional activity. Aging-Us 2012, 4 (7), 462-479.
42. Hwang, C. S.; Shemorry, A.; Varshavsky, A., N-Terminal Acetylation of Cellular Proteins Creates Specific Degradation Signals. Science 2010, 327 (5968), 973-977.
43. Yang, X. J.; Seto, E., Lysine acetylation: Codified crosstalk with other posttranslational modifications. Mol Cell 2008, 31 (4), 449-461.
44. Clack, J. W.; Juhl, M.; Rice, C. A.; Li, J. Y.; Witzmann, F. A., Proteomic analysis of transducin beta-subunit structural heterogeneity. Electrophoresis 2003, 24 (19-20), 3493-3499.
45. Berkowitz, S. A.; Zhong, H. J.; Berardino, M.; Sosic, Z.; Siemiatkoski, J.; Krull, I. S.; Mhatre, R., Rapid quantitative capillary zone electrophoresis method for monitoring the micro-heterogeneity of an intact recombinant glycoprotein. J Chromatogr A 2005, 1079 (1-2), 254-265.
46. Hammerschlag, R.; Stone, G. C.; Bolen, F. A., A Double-Isotope Procedure for Examining Protein Microheterogeneity - Multiple Forms of Fast-Transported Glycoproteins and Sulfoproteins Possess a Common Polypeptide-Chain. J Neurochem 1986, 46 (2), 569-573.
47. Ge, Y.; Rajkumar, L.; Guzman, R. C.; Nandi, S.; Patton, W. F.; Agnew, B. J., Multiplexed fluorescence detection of phosphorylation, glycosylation, and total protein in the proteomic analysis of breast cancer refractoriness. Proteomics 2004, 4 (11), 3464-3467.
48. Alfaro, J. F.; Gillies, L. A.; Sun, H. G.; Dai, S. J.; Zang, T. Z.; Klaene, J. J.; Kim, B. J.; Lowenson, J. D.; Clarke, S. G.; Karger, B. L.; Zhou, Z. S., Chemo-enzymatic detection of protein isoaspartate using protein Isoaspartate methyltransferase and hydrazine trapping. Anal Chem 2008, 80 (10), 3882-3889.
49. Elton, T. S.; Reeves, R., Microheterogeneity of the Mammalian High-Mobility Group Proteins 14 and 17 Investigated by Reverse-Phase High-Performance Liquid-Chromatography. Anal Biochem 1985, 146 (2), 448-460.


50. Liu, H.; Pan, H. C.; Peng, L.; Cai, S. X., RP-HPLC determination of recombinant human interferon omega in the Pichia pastoris fermentation broth. J Pharmaceut Biomed 2005, 38 (4), 734-737.
51. Chen, W.; Zhao, Y.; Seefeldt, T.; Guan, X. M., Determination of thiols and disulfides via HPLC quantification of 5-thio-2-nitrobenzoic acid. J Pharmaceut Biomed 2008, 48 (5), 1375-1380.
52. Guile, G. R.; Rudd, P. M.; Wing, D. R.; Prime, S. B.; Dwek, R. A., A rapid high-resolution high-performance liquid chromatographic method for separating glycan mixtures and analyzing oligosaccharide profiles. Anal Biochem 1996, 240 (2), 210-226.
53. Parekh, R. B.; Patel, T. P., Comparing the Glycosylation Patterns of Recombinant Glycoproteins. Trends Biotechnol 1992, 10 (8), 276-280.
54. Clarke, N. J.; Crow, F. W.; Younkin, S.; Naylor, S., Analysis of in vivo-derived amyloid-beta polypeptides by on-line two-dimensional chromatography-mass spectrometry. Anal Biochem 2001, 298 (1), 32-39.
55. Domon, B.; Aebersold, R., Review - Mass spectrometry and protein analysis. Science 2006, 312 (5771), 212-217.
56. Liu, D. T. Y., Deamidation - a Source of Microheterogeneity in Pharmaceutical Proteins. Trends Biotechnol 1992, 10 (10), 364-369.
57. Zhang, J. F.; Wang, D. I. C., Quantitative analysis and process monitoring of site-specific glycosylation microheterogeneity in recombinant human interferon-gamma from Chinese hamster ovary cell culture by hydrophilic interaction chromatography. J Chromatogr B 1998, 712 (1-2), 73-82.
58. Davies, M. J., Hounsell, E. F., HPLC and HPAEC of oligosaccharides and glycopeptides. Humana Press: 1998.
59. Townsend, R., Carbohydrate analysis: high performance liquid chromatography and capillary electrophoresis. Elsevier, New York: 1995.
60. Zaia, J.; Costello, C. E., Compositional analysis of glycosaminoglycans by electrospray mass spectrometry. Anal Chem 2001, 73 (2), 233-239.
61. Bonenfant, D.; Schmelzle, T.; Jacinto, E.; Crespo, J. L.; Mini, T.; Hall, M. N.; Jenoe, P., Quantitation of changes in protein phosphorylation: A simple method based on stable isotope labeling and mass spectrometry. P Natl Acad Sci USA 2003, 100 (3), 880-885.
62. Yang, H. M.; Liu, N.; Qiu, X. Y.; Liu, S. Y., A New Method for Analysis of Disulfide-Containing Proteins by Matrix-Assisted Laser Desorption Ionization (MALDI) Mass Spectrometry. J Am Soc Mass Spectr 2009, 20 (12), 2284-2293.
63. Gorman, J. J.; Wallis, T. P.; Pitt, J. J., Protein disulfide bond determination by mass spectrometry. Mass Spectrom Rev 2002, 21 (3), 183-216.
64. CS Ho, C. L., * MHM Chan, RCK Cheung, LK Law, LCW Lit, KF Ng, MWM Suen, and HL Tai, Electrospray Ionisation Mass Spectrometry: Principles and Clinical Applications. Clin Biochem Rev. 2003, 24 (1), 3-12.
65. McLuckey, S. A.; Wells, J. M., Mass analysis at the advent of the 21st century. Chem Rev 2001, 101 (2), 571-606.
66. Dietrich A. Volmer, L. S., Tutorial — Mass Analyzers: An Overview of Several Designs and Their Applications, Part I. Spectroscopy 2005.


67. Jonscher, K. R.; Yates, J. R., The quadrupole ion trap mass spectrometer - A small solution to a big challenge. Anal Biochem 1997, 244 (1), 1-15.
68. Burlingame, A. L., Methods in enzymology: Biological mass spectrometry. Gulf Professional Publishing: 2005.
69. Douglas, D. J.; Frank, A. J.; Mao, D. M., Linear ion traps in mass spectrometry. Mass Spectrom Rev 2005, 24 (1), 1-29.
70. Dietmar G. Schmid, P. G., Holger Bandel, Günther Jung, FTICR-Mass Spectrometry for High-Resolution Analysis in Combinatorial Chemistry. BIOTECHNOLOGY AND BIOENGINEERING (COMBINATORIAL CHEMISTRY) 2000, 71, 149-161.
71. Hu, Q. Z.; Noll, R. J.; Li, H. Y.; Makarov, A.; Hardman, M.; Cooks, R. G., The Orbitrap: a new mass spectrometer. J Mass Spectrom 2005, 40 (4), 430-443.
72. LTQ Orbitrap Hybrid FT mass spectrometer. In LTQ ORBITRAP: A NOVEL HYBRID MASS SPECTROMETER WITH SUPERIOR MASS ACCURACY, MASS RESOLUTION, DYNAMIC RANGE AND DETECTION POWER, Thermo electron corporation: 2006.
73. Sleno, L.; Volmer, D. A., Ion activation methods for tandem mass spectrometry. J Mass Spectrom 2004, 39 (10), 1091-1112.
74. Sobott, F.; Watt, S. J.; Smith, J.; Edelmann, M. J.; Kramer, H. B.; Kessler, B. M., Comparison of CID Versus ETD Based MS/MS Fragmentation for the Analysis of Protein Ubiquitination. J Am Soc Mass Spectr 2009, 20 (9), 1652-1659.
75. Coon, J. J.; Syka, J. E. P.; Shabanowitz, J.; Hunt, D. F., Tandem mass spectrometry for peptide and protein sequence analysis. Biotechniques 2005, 38 (4), 519-+.
76. Jackson, S. N.; Dutta, S.; Woods, A. S., The Use of ECD/ETD to Identify the Site of Electrostatic Interaction in Noncovalent Complexes. J Am Soc Mass Spectr 2009, 20 (2), 176-179.
77. Cournoyer, J. J.; Lin, C.; Bowman, M. J.; O'Connor, P. B., Quantitating the relative abundance of isoaspartyl residues in deamidated proteins by electron capture dissociation. J Am Soc Mass Spectr 2007, 18 (1), 48-56.
78. Bateman, R. H.; Carruthers, R.; Hoyes, J. B.; Jones, C.; Langridge, J. I.; Millar, A.; Vissers, J. P. C., A novel precursor ion discovery method on a hybrid quadrupole orthogonal acceleration time-of-flight (Q-TOF) mass spectrometer for studying protein phosphorylation. J Am Soc Mass Spectr 2002, 13 (7), 792-803.
79. Zhang, M. Y.; Pace, N.; Kerns, E. H.; Kleintop, T.; Kagan, N.; Sakuma, T., Hybrid triple quadrupole-linear ion trap mass spectrometry in fragmentation mechanism studies: application to structure elucidation of buspirone and one of its metabolites. J Mass Spectrom 2005, 40 (8), 1017-1029.
80. Shuguang Ma, S. K. C., Chapter 11. Application of Liquid Chromatography/Mass Spectrometry for Metabolite Identification. In Drug Metabolism in Drug Design and Development: Basic Concepts and Practice, D. Zhang, M. Z. a. W. G. H., Ed. John Wiley & Sons, Inc.: Hoboken, NJ, 2007.
81. Gao, H. Y.; Materne, O. L.; Howe, D. L.; Brummel, C. L., Method for rapid metabolite profiling of drug candidates in fresh hepatocytes using liquid chromatography coupled with a hybrid quadrupole linear ion trap. Rapid Commun Mass Sp 2007, 21 (22), 3683-3693.


82. Fabris, D.; Vestling, M. M.; Cordero, M. M.; Doroshenko, V. M.; Cotter, R. J.; Fenselau, C., Sequencing Electroblotted Proteins by Tandem Mass-Spectrometry. Rapid Commun Mass Sp 1995, 9 (11), 1051-1055.
83. . Mass spectrometry- essays and tutorials Tandem mass spectrometry (MS/MS) [Online], 2006.
84. Lundblad, R. L., Approaches to the Conformational Analysis of Biopharmaceuticals. Chapman and Hall/CRC: 2009.
85. Feng, C. J.; Li, H. J.; Li, J. N.; Lu, Y. J.; Lia, G. Q., Expression of Mcm7 and Cdc6 in Oral Squamous Cell Carcinoma and Precancerous Lesions. Anticancer Res 2008, 28 (6A), 3763-3769.
86. Wang, N.; Ye, L.; Yan, F. F.; Xu, R., Spectroscopic studies on the interaction of azelnidipine with bovine serum albumin. Int J Pharm 2008, 351 (1-2), 55-60.
87. Group, T. A. B., Protein chemical shifts, dipolar couplings, and molecular fragment replacement (MFR): An overview. 2012.
88. Roland Riek, S. H., Gerhard Wider, Rudi Glockshuber, Kurt Wuthrich, NMR characterization of the full-length recombinant murine prion protein, mPrP(23-231). FEBS Letters 1997, 413, 282-288.
89. Rush, R. S.; Derby, P. L.; Smith, D. M.; Merry, C.; Rogers, G.; Rohde, M. F.; Katta, V., Microheterogeneity of Erythropoietin Carbohydrate Structure. Anal Chem 1995, 67 (8), 1442-1452.
90. ICH Topic Q 6 B. In Specifications: Test Procedures and Acceptance Criteria for Biotechnological/Biological Products, European Medicines Agency: 1999.
91. Cohen, S. L.; Padovan, J. C.; Chait, B. T., Mass spectrometric analysis of mercury incorporation into proteins for X-ray diffraction phase determination. Anal Chem 2000, 72 (3), 574-579.
92. Kelleher, N. L., Top-down proteomics. Anal Chem 2004, 76 (11), 196a-203a.
93. Pennington, C. L., Bioanalytical Applications of Electrospray Ionization Mass Spectrometry for Proteome and Biomarker Analysis. ProQuest: 2006.
94. Reid, G. E.; McLuckey, S. A., 'Top down' protein characterization via tandem mass spectrometry. J Mass Spectrom 2002, 37 (7), 663-675.
95. Wehr, T., Top-down versus bottom-up approaches in proteomics. Lc Gc N Am 2006, 24 (9), 1004-+.
96. M. Lubeck, C. S., and R. Hartmer, Fast and Highly Accurate QTOF ETD MS/MS for Top-Down Sequencing. J Biomol Tech. 2010, 21 (3 Suppl), S50.
97. P. Liu, C. B., and W. Sandoval; Gan, Y., Optimizing In-source Decay as a High-Throughput Alternative to Edman Degradation for the Determination of Protein Termini. J Biomol Tech. 2010, 21 (3 Suppl), S49.
98. Wang, D. X.; Baudys, J.; Rees, J.; Marshall, K. M.; Kalb, S. R.; Parks, B. A.; Nowaczyk, L.; Pirkle, J. L.; Barr, J. R., Subtyping Botulinum Neurotoxins by Sequential Multiple Endoproteases In-Gel Digestion Coupled with Mass Spectrometry. Anal Chem 2012, 84 (11), 4652-4658.
99. Mehlert, A.; Wormald, M. R.; Ferguson, M. A. J., Modeling of the N-Glycosylated Transferrin Receptor Suggests How Transferrin Binding Can Occur within the Surface Coat of Trypanosoma brucei. Plos Pathog 2012, 8 (4).


100. Sethuraman, M.; McComb, M. E.; Huang, H.; Huang, S. Q.; Heibeck, T.; Costello, C. E.; Cohen, R. A., Isotope-coded affinity tag (ICAT) approach to redox proteomics: Identification and quantitation of oxidant-sensitive cysteine thiols in complex protein mixtures. J Proteome Res 2004, 3 (6), 1228-1233. 101. Fenselau, C.; Laine, O.; Swatkoski, S., Microwave assisted acid cleavage for denaturation and proteolysis of intact human adenovirus. Int J Mass Spectrom 2011, 301 (1-3), 7-11. 102. Wu, W. W.; Wang, G. H.; Baek, S. J.; Shen, R. F., Comparative study of three proteomic quantitative methods, DIGE, cICAT, and iTRAQ, using 2D gel- or LC-MALDI TOF/TOF. J Proteome Res 2006, 5 (3), 651-658. 103. Zhou, J. Y.; Schepmoes, A. A.; Zhang, X.; Moore, R. J.; Monroe, M. E.; Lee, J. H.; Camp, D. C.; Smith, R. D.; Qian, W. J., Improved LC-MS/MS Spectral Counting Statistics by Recovering Low-Scoring Spectra Matched to Confidently Identified Peptide Sequences. J Proteome Res 2010, 9 (11), 5698-5704. 104. Zhu, W. H.; Smith, J. W.; Huang, C. M., Mass Spectrometry-Based Label-Free Quantitative Proteomics. J Biomed Biotechnol 2010. 105. (a) http://genecards.org/. Database, T. G. H. G., Ed. Weizmann Institute of Science; (b) http://www.uniprot.org/. UniProt Consortium. 106. Rubin, I.; Yarden, Y., The basic biology of HER2. Ann Oncol 2001, 12, 3-8. 107. Tan, M.; Yu, D., Molecular mechanisms of ErbB2-Mediated breast cancer chemoresistance. Adv Exp Med Biol 2007, 608, 119-129. 108. Aird, K. M.; Ding, X. Y.; Baras, A.; Wei, J. P.; Morse, M. A.; Clay, T.; Lyerly, H. K.; Devi, G. R., Trastuzumab signaling in ErbB2-overexpressing inflammatory breast cancer correlates with X-linked inhibitor of apoptosis protein expression. Mol Cancer Ther 2008, 7 (1), 38-47. 109. Giannopoulou, E.; Antonacopoulou, A.; Floratou, K.; Papavassiliou, A. G.; Kalofonos, H. P., Dual targeting of EGFR and HER-2 in colon cancer cell lines. Cancer Chemoth Pharm 2009, 63 (6), 973-981. 110. 
http://useast.ensembl.org/index.html. 111. http://www.hsls.pitt.edu/obrc/index.php?page=URL1151339289. 112. http://gpmdb.thegpm.org/. Beavis Informatics Ltd. 113. Benusiglio, P. R.; Lesueur, F.; Luccarini, C.; Conroy, D. M.; Shah, M.; Easton, D. F.; Day, N. E.; Dunning, A. M.; Pharoah, P. D.; Ponder, B. A. J., Common ERBB2 polymorphisms and risk of breast cancer in a white British population: a case-control study. Breast Cancer Res 2005, 7 (2), R204-R209. 114. Kauraniemi, P.; Kallioniemi, A., Activation of multiple cancer-associated genes at the ERBB2 amplicon in breast cancer. Endocr-Relat Cancer 2006, 13 (1), 39-49. 115. Katoh, M.; Katoh, M., Evolutionary recombination hotspot around GSDML-GSDM is closely linked to the oncogenomic recombination hotspot around the PPP1R1B-ERBB2-GRB7 amplicon. Int J Oncol 2004, 24 (4), 757-763. 116. Hoffmann, R., Valencia, A, A gene network for navigating the literature. In Nature Genetics, 2004; Vol. 36. 117. http://www.nextprot.org/. 118. Bai, T.; Luoh, S. W., GRB-7 facilitates HER-2/Neu-mediated signal transduction and tumor formation. Carcinogenesis 2008, 29 (3), 473-479.


119. Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H. J.; Na, K.; Choi, E. Y.; Yan, F. F.; Zhang, F.; Zhang, Y.; Snyder, M.; Cheng, Y.; Chen, R.; Marko- Varga, G.; Deutsch, E. W.; Kim, H.; Kwon, J. Y.; Aebersold, R.; Bairoch, A.; Taylor, A. D.; Kim, K. Y.; Lee, E. Y.; Hochstrasser, D.; Legrain, P.; Hancock, W. S., The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat Biotechnol 2012, 30 (3), 221-223.


Chapter 2

Comprehensive Mass Spectroscopic Characterization of

Recombinant Phenylalanine Ammonia Lyase

Publication in preparation:

Yan, F., Wendt, D., Alferness, P., Hancock, W.S., Wu, S.L., Pungor, E., Comprehensive Characterization of Recombinant Phenylalanine Ammonia Lyase by LC-MS/MS (manuscript in preparation).


2.1 Abstract

Phenylketonuria (PKU) is an autosomal recessive disorder characterized by a deficiency in phenylalanine hydroxylase, the enzyme that converts phenylalanine to tyrosine.1 In PKU, phenylalanine accumulates in the blood and body tissues, causing neurological damage. A recombinant form of phenylalanine ammonia lyase from Anabaena variabilis (rAvPAL), which converts phenylalanine to trans-cinnamate, is now in clinical development to treat PKU. The active site of rAvPAL contains an ASG sequence that is spontaneously converted into an electrophilic 4-methylidene-imidazole-5-one (MIO) ring through a double dehydration.2 The MIO in the active site can decompose under certain conditions, ultimately leading to cleavage of the peptide backbone of the 60 kDa intact protein into a 15 kDa and a 45 kDa fragment. Because the MIO is not stable under denaturing conditions and/or in an acidic environment, traditional peptide mapping approaches could not confirm its presence in rAvPAL. Our studies, using multiple and sequential proteolytic digestions coupled with high-resolution (FT-ICR) mass spectrometry and LC-MS/MS analysis, identified the exact break-down sites and revealed that the decomposition of the MIO and the chain cleavage result from the formation of free radical peptides. Eliminating the double bond on the methylidene group by Michael-type additions stopped the decomposition of the MIO site and allowed direct LC-MS/MS identification of the Michael addition products. These results were then confirmed by direct mass spectrometric analysis of the whole protein.

2.2 Introduction

Phenylketonuria (PKU) is a genetic disorder caused by the inability to utilize the essential amino acid phenylalanine, which then accumulates in the blood and body tissues and damages the brain.1, 3 In PKU, the enzyme that breaks down phenylalanine, phenylalanine hydroxylase, is deficient.4 Recently, recombinant phenylalanine ammonia lyase (PAL) from A. variabilis (rAvPAL), which converts phenylalanine to trans-cinnamate and ammonia, has emerged as a promising therapy for PKU.2a Recombinant protein drugs are often complicated in nature because of their chemical and manufacturing complexity.5 Although comprehensive characterization is a great challenge given this complexity, determination of structural features is essential for drug safety and efficacy.6 Therefore, to produce consistent therapeutics, there is a great need to apply powerful analytical techniques to characterize protein structures comprehensively. Correct assembly of the protein structure is essential to assure drug quality. Although the PAL protein sequence is well defined, the structure of the ASG region at the active site is still difficult to characterize. Native PAL forms a cyclic 4-methylidene-imidazole-5-one (MIO) group, which is unstable and can break down into degradation products prior to analysis. The full sequence is shown in Figure 2-1, with the tryptic peptide of rAvPAL that contains the active site highlighted.

Highlighted tryptic peptide (residues 147-189; labeled positions 147, 167, 169 and 189): MEIFLNAGVTPYVYEFGSIGASGDLVPLSYITGSLIGLDPSFK

Figure 2-1. Full sequence of rAvPAL with highlights of the tryptic peptide containing ASG sequence.


To characterize the protein at the molecular level, online liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS), now a major analytical tool for protein characterization, was applied. Peptides associated with the active site were identified by LC-MS/MS with collision-induced dissociation (CID), which provides extensive information for identification and characterization with high sensitivity and confidence.

In this work, we performed a detailed characterization of this recombinant protein drug. We used multiple and sequential enzymatic digestions with sensitive online LC-MS/MS analysis to characterize intact rAvPAL and its break-down products. Accurate peptide masses were obtained by FT-ICR MS, and sequence assignments were completed with tandem MS measurements. This strategy was designed for this complex protein and produces fragment sizes that are suitable for further analysis. Direct LC-MS/MS helps to identify the Michael addition products of the MIO group without decomposition.

rAvPAL is the recombinant form of the PAL protein from A. variabilis. It has the same amino acid sequence as native PAL except for two engineered substitutions, Cys503 to Ser and Cys565 to Ser.7 The structure of the active site region is still inferred from homology.2b Currently, no direct experimental results describe this structure in detail. Combining the exact break-down sites with the observed modifications should suggest the mechanism by which the intermediate in the active ASG region forms and degrades. The ability to locate the site of degradation could help explain the activity and stability of rAvPAL. Figure 2-2 illustrates the formation of the active site from the Ala, Ser and Gly residues by double dehydration. The MIO group (4-methylidene-imidazole-5-one) is a ring structure formed at the active site after the loss of two water molecules.


Reaction scheme: -Ala-Ser-Gly- cyclizes with loss of 2 H2O to give the MIO group.

Figure 2-2. MIO group formation through double dehydration at -Ala-Ser-Gly-.

In addition to rAvPAL, we also analyzed the PAL enzyme from Rhodosporidium toruloides (RtPAL) and green fluorescent protein (GFP) from jellyfish using the same analytical method and process. RtPAL is a homolog of PAL from A. variabilis and has the same active site, generated by post-translational modification of ASG into the MIO group.2b It serves as a comparison to rAvPAL because the two enzymes share the double-dehydrated ring structure at the active site and have similar sequences. GFP serves as another comparison because its chromophore, dehydrated dehydrotyrosine, is formed from SYG rather than ASG.8 The HBDI group serves as the chromophore of GFP.9 It is formed by autocatalytic cyclization with the loss of one water molecule. It also has a double bond conjugated to the acrylamide ring structure, formed by the loss of two hydrogens (oxidation).10 This group has a dehydrated ring structure similar to that of the PAL enzymes but a different composition. Comparison of the three proteins, especially of the active site region, is therefore meaningful and informative for the analysis and characterization of the target enzyme.

2.3 Experimental Section

2.3.1 Reagent and materials

rAvPAL, RtPAL and green fluorescent protein were obtained from BioMarin Pharmaceutical Inc. (Novato, CA). Trypsin (sequencing grade) and Glu-C (sequencing grade) were purchased from Promega (Madison, WI). Ammonium bicarbonate, formic acid, dithiothreitol (DTT) and iodoacetamide (IAA) were purchased from Sigma-Aldrich (St. Louis, MO). LC-MS grade water was purchased from J.T. Baker (Phillipsburg, NJ), and HPLC-grade acetonitrile was from Thermo Fisher Scientific (Fair Lawn, NJ). NuPAGE® 4-12% Bis-Tris gels, Novex® Sharp pre-stained protein standard and SimplyBlue™ SafeStain were purchased from Invitrogen (Carlsbad, CA).

2.3.2 In-solution enzymatic digestion

The concentration of the rAvPAL solution was 2.02 mg/mL. The protein was denatured in 6 M guanidine hydrochloride with 100 mM ammonium bicarbonate. The sample was reduced with 5 mM dithiothreitol (DTT) at 37 °C and alkylated by adding 20 mM freshly prepared iodoacetamide (IAA) and incubating for 1 h, covered in aluminum foil. The protein solution was then buffer-exchanged into 100 mM ammonium bicarbonate (pH 8.0) using a spin column with a 10 kDa molecular weight cutoff, centrifuged at 11,000 rpm. For trypsin digestion, trypsin was added at 1:50 (w/w) and the solution was incubated for 14 h at 37 °C. The digestion was then stopped by adding 5% formic acid.
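The 1:50 (w/w) enzyme-to-protein ratio translates directly into an enzyme amount once the protein amount is known. The short Python sketch below illustrates the arithmetic; the 100 µL aliquot volume is an assumption chosen only for the example (the text does not state the volume digested), and the function name is illustrative.

```python
def enzyme_mass_ug(protein_ug: float, protein_per_enzyme: float = 50.0) -> float:
    """Mass of protease (µg) for a 1:protein_per_enzyme (w/w) enzyme-to-protein ratio."""
    if protein_per_enzyme <= 0:
        raise ValueError("protein_per_enzyme must be positive")
    return protein_ug / protein_per_enzyme


# 2.02 mg/mL is the same as 2.02 µg/µL (concentration stated in the text).
concentration_ug_per_ul = 2.02
aliquot_ul = 100.0  # ASSUMPTION: example volume, not given in the text

protein_ug = concentration_ug_per_ul * aliquot_ul
print(f"protein: {protein_ug:.1f} µg, trypsin at 1:50: {enzyme_mass_ug(protein_ug):.2f} µg")
```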


2.3.3 SDS-PAGE and in-gel digestion

rAvPAL products were first separated by SDS-PAGE. DTT (2 µL) and loading buffer (2 µL) were added to 40 µg of protein solution, and the mixture was heated at 70 °C for 10 min before loading onto the gel. A 4-12% Bis-Tris gel was used for separation, followed by Coomassie blue staining. Different amounts of the protein sample (20, 30 and 40 µg) were loaded onto the gel. In addition, RtPAL (17.2 mg/mL) and GFP (1 mg/mL) were also separated by SDS-PAGE for subsequent in-gel digestion.

The gel bands of each protein were excised individually and digested with trypsin or with trypsin plus Glu-C for subsequent LC-MS analysis. Briefly, after removal of the Coomassie stain, the gel bands were sliced and reduced with 10 mM DTT in 0.1 M ammonium bicarbonate for 30 min at 56 °C, then alkylated with 55 mM IAA for 1 h at room temperature in the dark. For trypsin digestion, the digestion buffer was 12.5 ng/µL trypsin in 50 mM ammonium bicarbonate. Gel slices were incubated in this trypsin-containing buffer at 4 °C for 35 min. The buffer was then replaced with buffer without trypsin, and the slices were incubated at 37 °C for 14 h before extraction.

For trypsin digestion followed by Glu-C digestion, after the gel slices had been incubated in digestion buffer for 14 h, the solution was changed to Glu-C digestion buffer and incubated for another 35 min at 4 °C. The Glu-C-containing buffer was then replaced with buffer without Glu-C, followed by incubation at 37 °C for 5 h before extraction.

The peptides after digestion were extracted from the gel slices by 25 mM ammonium bicarbonate and acetonitrile. Further extraction was done with 5% formic acid (37 °C, 5 min vortex). Supernatants were collected together and concentrated for further LC-MS analysis.


2.3.4 LC-MS/MS analysis

The analysis was conducted on an Ultimate 3000 nano-LC (Dionex, Mountain View, CA) coupled online to an LTQ-FT mass spectrometer (Thermo Fisher Scientific, San Jose, CA). A self-packed C18 column (Magic, 200 Å pore, 5 µm particle size, 18 cm length, 75 µm i.d.) was used. A nano-spray ion source (New Objective, Woburn, MA) coupled the column online to the mass spectrometer. Mobile phase A was 0.1% formic acid in water, and mobile phase B was 0.1% formic acid in acetonitrile. The sample was loaded at 2% B, followed by a linear gradient from 2% to 65% B over 60 min, a ramp from 65% to 80% B over the next 10 min, and an isocratic hold at 80% B for 10 min. The flow rate was maintained at 200 nL/min.
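The gradient program can be written as a piecewise-linear function of time, which is convenient when relating a retention time to the solvent composition at elution. The sketch below is illustrative only: the duration of the loading step is not stated in the text, so the 10 min default is an assumption (chosen so that the three gradient segments plus loading add up to a 90 min run), and the function name is hypothetical.

```python
def percent_b(t_min: float, load_min: float = 10.0) -> float:
    """Return %B at run time t_min for the gradient described in the text.

    load_min is an ASSUMED loading time at 2% B; the text does not state it.
    """
    if t_min < 0:
        raise ValueError("time must be non-negative")
    if t_min <= load_min:
        return 2.0  # isocratic sample loading
    t = t_min - load_min
    # (segment start, segment end, %B at start, %B at end), minutes after loading
    segments = [
        (0.0, 60.0, 2.0, 65.0),    # linear gradient, 60 min
        (60.0, 70.0, 65.0, 80.0),  # ramp, 10 min
        (70.0, 80.0, 80.0, 80.0),  # isocratic hold, 10 min
    ]
    for start, end, b_start, b_end in segments:
        if t <= end:
            return b_start + (b_end - b_start) * (t - start) / (end - start)
    return 80.0  # after the programmed segments


for t in (0, 40, 70, 75, 90):
    print(f"{t:>3} min: {percent_b(t):.1f} %B")
```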

Survey full-scan MS spectra from m/z 400 to 2000 were acquired in the FT-ICR cell at a mass resolution of 100,000 at m/z 400. The total acquisition time was 90 min. The method contained one segment with 11 scan events. Scan event 1 was the FTMS survey scan, acquired in profile mode with positive polarity. The next ten scan events were sequential data-dependent MS2 scans (MSn level 2) acquired in the LTQ ion trap in centroid mode, with each precursor ion isolated and fragmented by collision-induced dissociation (CID) at a normalized collision energy of 35.0.
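The one-survey-scan, ten-dependent-scan cycle can be illustrated with a small precursor-selection routine. This is a simplified sketch and not the instrument's acquisition logic: it assumes that the ten precursors are the most intense survey peaks within the scanned m/z range, which is the usual data-dependent rule but is not stated explicitly in the text, and it omits features such as dynamic exclusion. All names are illustrative.

```python
def select_precursors(peaks, top_n=10, mz_min=400.0, mz_max=2000.0):
    """Pick the top_n most intense survey peaks inside the scan range.

    peaks: iterable of (mz, intensity) tuples from one survey scan.
    ASSUMPTION: selection is by intensity only.
    """
    in_range = [p for p in peaks if mz_min <= p[0] <= mz_max]
    return sorted(in_range, key=lambda p: p[1], reverse=True)[:top_n]


# Hypothetical survey scan, for illustration only.
survey = [(350.2, 9.0e6), (743.4, 9.8e6), (1010.6, 1.2e5), (1132.1, 7.3e6), (1140.6, 1.7e6)]
for mz, intensity in select_precursors(survey, top_n=3):
    print(f"MS2 on m/z {mz:.1f} (intensity {intensity:.2e})")
```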

2.3.5 Peptide assignment

Results from the LTQ-FT MS were first analyzed and filtered with BioWorks software (version 3.3.1, Thermo Fisher Scientific), which uses the SEQUEST algorithm to assign the most probable sequence to each peptide from its fragmentation spectrum. The CID-MS2 spectra were searched against the theoretical fragmentation of the proteins under study, using theoretical b and y ions. Enzymatic peptides were initially filtered by Xcorr (precursor charge 1+, > 1.5; 2+, > 2.0; 3+, > 2.5; 4+ and above, > 3.0). Peptides consistent with the specificity of the digestion enzyme, as well as peptides arising from non-specific cleavage, were all considered for assignment against the PAL database. Peptides identified in the active site region were further inspected manually. The ions of interest were extracted for quantitative comparison between the intact and degraded products. Experiments were carried out in replicate to confirm peptide identification and characterization.
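The charge-dependent Xcorr thresholds amount to a simple lookup. The following Python sketch applies them to a list of peptide-spectrum matches; the record layout and function names are hypothetical and do not reflect the BioWorks export format.

```python
# Xcorr thresholds from the text: 1+ > 1.5, 2+ > 2.0, 3+ > 2.5, 4+ and above > 3.0
XCORR_MIN = {1: 1.5, 2: 2.0, 3: 2.5}
XCORR_MIN_HIGH_CHARGE = 3.0


def passes_xcorr(charge: int, xcorr: float) -> bool:
    """True if a match exceeds the Xcorr threshold for its precursor charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return xcorr > XCORR_MIN.get(charge, XCORR_MIN_HIGH_CHARGE)


def filter_matches(matches):
    """Keep (peptide, charge, xcorr) records that pass the initial filter."""
    return [m for m in matches if passes_xcorr(m[1], m[2])]


# Hypothetical scores, for illustration only.
example = [("IFLNAGVTPYVYE", 2, 3.4), ("LVPLSYITGSLIGLDPSFK", 2, 1.8)]
print(filter_matches(example))
```

Matches that pass this filter would still be subject to the manual inspection described above for peptides in the active site region.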

2.4 Results and Discussion

2.4.1 SDS-PAGE separation of rAvPAL

The intact rAvPAL was observed in the 60 kDa region. Besides this band, two truncated bands, one at 15 kDa and the other at 45 kDa, appeared on the gel image. The SDS-PAGE separation of the enzyme is shown in Figure 2-3.

Figure 2-3. SDS-PAGE separation of rAvPAL.


2.4.2 Intact analysis of rAvPAL

The protein was analyzed directly without reduction or enzymatic digestion. Intact analysis of 1 µg of protein was performed on a Q-TOF instrument. The result is consistent with the SDS-PAGE separation of the protein: the major masses observed after deconvolution are around 15 kDa, 45 kDa and 60 kDa, with minor peaks observed as well. The highest mass detected corresponds to the intact protein with the loss of two water molecules. A minor mass corresponding to the loss of a short peptide from the N-terminus was also detected. Figures 2-4A, 2-4B and 2-4C show the deconvoluted masses of the peaks separated by the system. Intact protein analysis confirmed the presence of the intact protein, but it could not identify the specific peptide carrying the MIO group. Therefore, further bottom-up analysis with different digestion enzymes is needed to identify the active site in this protein.

A. Deconvolution of peak A. Theoretical average mass (BioConfirm) of the 15 kDa fragment: M = 17834.4233 Da. Sequence shown: MEIFLNAGVTPYVYEFGSIGA. With loss of an N-terminal peptide (888.0603 Da): M(15K) = 16946.3630 Da.

B. Deconvolution of peak B. Theoretical average masses (BioConfirm): 60 kDa, M = 61709.6768 Da; 45 kDa, M' = 43894.2749 Da.

C. Deconvolution of peak C. Theoretical average masses (BioConfirm): 45 kDa, M' = 43894.2749 Da; 60 kDa, M = 61709.6768 Da.

Figure 2-4. Deconvoluted mass analysis of the peaks from top-down analysis of intact rAvPAL molecule (peak A-C).


2.4.3 LC-MS/MS analysis of digested peptides of rAvPAL

In the enzymatically digested 15 kDa sample, the peptide sequence could be identified from the N-terminus to Ala 167. For the tryptic digestion, the peptide from Met 147 to Lys 189, which includes the ASG active site, would be observed if the peptide were intact. However, only the fragment from Met 147 to Ala 167 (MEIFLNAGVTPYVYEFGSIGA) was observed, with a 1 Da difference between the observed and theoretical masses. The MS2 spectra confirmed that the observed peptide was the clipped fragment from Met 147 to Ala 167. When Glu-C was used after trypsin for further digestion, the corresponding peptide was cleaved after Glu. Once the portion of the sequence containing the Ala of the ASG region was removed, the intensity of the resulting peptide from Ile 149 to Glu 161 (IFLNAGVTPYVYE) was much higher, and no mass shift was observed for this further-digested peptide. Precursor mass extraction for both the trypsin and the trypsin plus Glu-C digestions is illustrated in Figure 2-5, where A is the XIC of the tryptic fragment from Met 147 to Ala 167 (MEIFLNAGVTPYVYEFGSIGA) and B is the XIC of the trypsin plus Glu-C peptide from Ile 149 to Glu 161 (IFLNAGVTPYVYE).

Figure 2-5. XIC of the peptide fragments from (A) trypsin digestion and (B) trypsin plus Glu-C digestion.

In the digested 45 kDa sample, the peptide sequence could be identified from the C-terminus back to Ser 168. In theory, the peptide from Met 147 to Lys 189, including the ASG active site, would be observed. However, the extracted peptide ran from Ser 168 to Lys 189 (SGDLVPLSYITGSLIGLDPSFK), a fragment of the intact tryptic peptide. In addition, this fragment, which begins with Ser 168 of the ASG region, did not have the exact theoretical precursor mass; it showed mass shifts such as M+1-17, M-17, M+2-17 and M+2. The XIC peaks for this fragment in the 45 kDa band and the corresponding MS2 spectra are shown in Figure 2-6. The MS2 spectra, with the corresponding b and y ions, confirmed that the observed peptide ran from Ser 168 to Lys 189. The mass shift was also observed in the b ions, which include Ser 168, whereas no mass shift was observed in the y ions.
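The shift labels used here are nominal mass differences; for a doubly charged precursor, each Da of mass shift moves the m/z by 0.5. The minimal Python sketch below converts between the two, using two m/z values reported in Table 2-1 for the 60 kDa band (species M and M-17) as input. The function names are illustrative and not part of the original data processing.

```python
NH3_MONO = 17.026549  # monoisotopic mass of NH3 (Da)


def mass_shift_da(mz_observed: float, mz_reference: float, charge: int = 2) -> float:
    """Convert an m/z difference between two precursors into a mass difference (Da)."""
    return (mz_observed - mz_reference) * charge


def shifted_mz(mz_reference: float, delta_da: float, charge: int = 2) -> float:
    """Expected m/z after adding delta_da (Da) to a precursor at the same charge."""
    return mz_reference + delta_da / charge


# Values from Table 2-1, 60 kDa band, both 2+: M = 1140.1344, M-17 = 1131.6202
shift = mass_shift_da(1131.6202, 1140.1344)
print(f"observed shift: {shift:.4f} Da (loss of NH3 would be {-NH3_MONO:.4f} Da)")
```

Comparing the measured shift with the exact mass of a candidate neutral loss in this way is one means of checking whether a nominal "-17" label is compatible with loss of NH3.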

When trypsin plus Glu-C was used to digest the 45 kDa sample, the corresponding peptides were cleaved further away from the ASG region. Once cleaved further away from ASG, the peptide from Leu 171 to Lys 189 (LVPLSYITGSLIGLDPSFK) showed an observed precursor mass that matched its theoretical m/z.
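The theoretical m/z used for such comparisons follows from the sum of the monoisotopic residue masses, one water, and one proton per charge. The sketch below is a generic calculation with standard monoisotopic masses, not the software used in this work; the `delta_da` parameter is an illustrative hook for modifications (for example, about -36.0211 Da for the loss of two water molecules), and cysteine is treated as unmodified, which is an assumption that does not affect the cysteine-free peptides discussed here.

```python
# Standard monoisotopic residue masses (Da)
RESIDUE_MONO = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
WATER_MONO = 18.010565  # H2O added for the free N- and C-termini
PROTON = 1.007276       # mass of a proton


def peptide_mz(sequence: str, charge: int = 2, delta_da: float = 0.0) -> float:
    """Theoretical monoisotopic m/z of an unmodified peptide at the given charge.

    delta_da adds an optional mass change (Da) for a modification.
    """
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    neutral = sum(RESIDUE_MONO[aa] for aa in sequence) + WATER_MONO + delta_da
    return (neutral + charge * PROTON) / charge


for seq in ("LVPLSYITGSLIGLDPSFK", "IFLNAGVTPYVYE"):
    print(f"{seq}: theoretical m/z (2+) = {peptide_mz(seq):.4f}")
```

The printed values can be set against the observed monoisotopic m/z values listed in Table 2-1 for the same peptides.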

Figure panels: XIC (relative abundance vs. time, 0-90 min) with peaks labeled a and b (retention times 45.81 and 46.98 min); precursor isotope clusters for peaks a and b (m/z 1132-1143); MS2 spectra for the two peaks (relative abundance vs. m/z, 400-2000).

Figure 2-6. XIC and MS2 of the peptide fragment from Ser 168 to Lys 189 in the 45 kDa band.

After analyzing the 15 and 45 kDa gel bands, we analyzed the single 60 kDa gel band, which appeared at the theoretical molecular weight of the protein. If this band were digested by trypsin, the tryptic peptide containing the ASG active site would therefore be expected in its double-dehydrated form. However, we observed neither this dehydrated form nor the intact tryptic peptide. It is thus possible that the tryptic peptide fragmented in the same way as in the 15 and 45 kDa bands, so we extracted ions using the theoretical precursor masses of the clipped fragments. As in the 45 kDa band, we extracted the latter part of the intact tryptic peptide, starting from Ser 168, but we did not observe the first part, starting from Met 147. When the sample was further digested with Glu-C, the fragment from Ile 149 to Glu 161 observed in the 15 kDa band also appeared in the 60 kDa band, indicating that the first part was indeed present in the 60 kDa band. Consequently, the 60 kDa sample gave results similar to those of the 15 and 45 kDa samples. Table 2-1 lists all the fragments generated from the tryptic peptide containing the ASG active site as well as those generated by the trypsin plus Glu-C digestion.


Table 2-1. Fragments observed in the 15, 45 and 60 kDa gel bands after enzymatic digestion.

Peptide (amino acid sequence)   Start  End  Gel band (MW)   Observed monoisotopic m/z (2+)  Intensity (XIC)
MEIFLNAGVTPYVYEFGSIGA1          147    167  15 kDa          1139.0809 (M-1)                 2.17E5
IFLNAGVTPYVYE2                  149    161  15 kDa, 60 kDa  743.3889 (M)                    9.78E6 (15 kDa); 9.53E6 (60 kDa)
SGDLVPLSYITGSLIGLDPSFK1         168    189  45 kDa          1140.6392 (M+1)                 1.69E6
                                                            1132.1205 (M+1-17)              7.26E6
                                                            1141.1150 (M+2)                 3.90E6
                                                            1132.5994 (M+2-17)              7.60E6
SGDLVPLSYITGSLIGLDPSFK1         168    189  60 kDa          1140.6315 (M+1)                 6.45E5
                                                            1132.1184 (M+1-17)              9.91E5
                                                            1140.1344 (M)                   1.30E6
                                                            1131.6202 (M-17)                2.15E6
GDLVPLSYITGSLIGLDPSFK1          169    189  45 kDa, 60 kDa  1096.6260 (M)                   4.50E6 (45 kDa); 1.91E6 (60 kDa)
LVPLSYITGSLIGLDPSFK2            171    189  45 kDa, 60 kDa  1010.5817 (M)                   1.21E5 (45 kDa); 5.76E6 (60 kDa)

In all of the previous experiments, we found neither the ASG sequence nor the ring structure formed from it. Therefore, based on the fragments observed in the tryptic digest, a degradation process originating from the MIO ring structure may be occurring. Figure 2-7 illustrates the possible degradation process. The double-dehydrated ring structure could be rehydrated, and this process and the subsequent peptide chain cleavage could be related to the formation of a free radical.

Scheme labels: MIO ring at -Ala-Ser*-Gly-; rehydration (+1 H2O, +1 NH3), rehydrated back to NH2 (not OH) at S168; H• or free radical formation and peptide chain cleavage; loss of NH3 (-17 Da).

Figure 2-7. Proposed degradation process of MIO group.

2.4.4 Analysis of RtPAL and GFP with similar ring structure

In addition to PAL from A. variabilis, we analyzed PAL from yeast (RtPAL) and green fluorescent protein (GFP) from jellyfish using the same method and process. RtPAL was separated by SDS-PAGE, which produced a pattern of truncated-fragment bands similar to that of rAvPAL (Supplementary Figure 2-S1). Intact RtPAL has a molecular weight of about 76 kDa. The sequence of RtPAL is shown in Supplementary Figure 2-S2. Its -Ala-Ser-Gly- sequence has also been reported to form an MIO group by double dehydration at the active site.11


Besides the 76 kDa band, two clipped fragments, one at 20 kDa and the other at 56 kDa, were also observed by SDS-PAGE. The clipped-fragment bands and the intact protein band were digested in-gel and analyzed by LC-MS/MS.

In the 20 kDa band, only the peptide sequence from the N-terminus to the Ala of the ASG active site was identified. The clipped fragment ALTNFLNHGITPIVPLRGTISA was extracted from the Lys-C digest of the 20 kDa band, as shown in Supplementary Figure 2-S3. The other part, the fragment SGDLSPLSYIAAAISGHPDSK, was found in the 56 kDa band after trypsin digestion, as illustrated in Supplementary Figure 2-S4 (the precursor m/z is listed in Figure 2-S4a and the extracted MS2 spectrum is shown in Figure 2-S4b). In the 56 kDa band, only the peptide sequence from the C-terminus back to the Ser of the ASG active site was found. Among the digested peptides of the 76 kDa band, both truncated fragment peptides were extracted. The intact tryptic peptide containing the ASG sequence, GTISASGDLSPLSYIAAAISGHPDSK, was observed with the exact precursor m/z and the correct MS2 spectrum; however, no double dehydration with water loss was observed. In addition, intact mass analysis of RtPAL was performed on a Q-TOF instrument. Supplementary Figures 2-S5a and 2-S5b show the TIC of the intact protein analysis together with the mass deconvolution result. In the deconvolution result, the major peak corresponds to the full protein with a probable loss of the two N-terminal amino acids, MA. Mass losses of 16 and 17 Da relative to the theoretical mass of the full sequence were observed. However, this does not guarantee that dehydration occurred at the ASG sequence, since such a mass loss could arise under various conditions. The main purpose of this work was to use a PAL protein from another species to determine whether fragments arising from the ASG sequence are observed. Overall, a fragmentation pattern similar to that of rAvPAL was observed in RtPAL.


Finally, green fluorescent protein (GFP) was analyzed as well. The sequence of GFP is shown in Supplementary Figure 2-S6. Its chromophore, the HBDI group, is formed from -Ser-Tyr-Gly- with the loss of one water molecule. Two hydrogens are also lost through oxidation to form a double bond.10 The resulting structure is 4'-hydroxybenzylidene-2,3-dimethylimidazolinone. Formation of the HBDI chromophore is illustrated in Supplementary Figure 2-S7. The ring structure of the HBDI group in GFP is similar to that of the MIO group in rAvPAL or RtPAL. However, the active site (chromophore) of this protein is formed from SYG, not ASG as in rAvPAL or RtPAL. SDS-PAGE analysis of GFP is shown in Supplementary Figure 2-S8. Only a single band appeared on the gel image, in clear contrast to rAvPAL and RtPAL, which showed truncated-fragment bands in addition to the intact protein band.

The protein band was digested with trypsin, and the resulting peptides were analyzed by LC-MS/MS. The XIC of the target peptide containing the SYG site could be extracted, as illustrated in Figure 2-8. The extracted mass was the exact mass of the peptide carrying the theoretical dehydrated dehydrotyrosine. The corresponding MS2 spectrum of the target peptide is shown in Figure 2-9. This differs from rAvPAL and RtPAL, for which the target peptide with the intact active site could not be extracted.


Figure 2-8. XIC extraction of the tryptic peptide containing SYG in GFP.


Figure 2-9. MS2 extraction of tryptic peptide containing SYG in GFP.

2.4.5 More findings about free radical associated degradation

If a free radical were involved in the reaction, scavengers such as ascorbic acid should reduce or inhibit the degradation process.12 rAvPAL with and without ascorbic acid was analyzed by SDS-PAGE. As illustrated in Figure 2-10, much less of the protein was degraded into the 15 and 45 kDa truncated fragments in the presence of ascorbic acid. Thus, with the addition of ascorbic acid as a radical-capture reagent, fewer cleavage products were present, which provides further evidence that a free radical is involved in the reaction.


Figure 2-10. SDS-PAGE of rAvPAL with and without ascorbic acid.

2.4.6 Analysis of rAvPAL with addition of reducing reagent

Borohydride was added to the protein solution to reduce the methylidene group, followed immediately by in-solution digestion with trypsin and Glu-C. We proposed that this would stop the decomposition and make it possible to observe the peptide containing the MIO group. Figure 2-11 shows the extracted precursor m/z, which differs by 0.86 ppm from the theoretical mass of the ASG-containing peptide with the loss of one H2O and the addition of 2 Da from reduction of the active site. Targeted MS2 helped to confirm this peptide through the corresponding b and y ions. The precursor m/z of the ASG-containing peptide with the loss of 2 H2O and the addition of 4 Da, from reduction of the two double bonds in the MIO group, was also observed. Eliminating the double bond on the methylidene group by Michael-type addition helped stop the decomposition of the active site and allowed direct identification of the addition products by LC-MS/MS.
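A mass accuracy quoted in ppm is the relative difference between the observed and theoretical m/z scaled by one million. The short Python sketch below shows the calculation together with the corresponding tolerance window; the input values in the example are round numbers chosen for illustration, not the measured values from Figure 2-11, and the function names are illustrative.

```python
def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def ppm_window(theoretical_mz: float, tolerance_ppm: float):
    """Lower and upper m/z bounds for a symmetric ppm tolerance."""
    half_width = theoretical_mz * tolerance_ppm / 1e6
    return theoretical_mz - half_width, theoretical_mz + half_width


# Illustrative numbers only: a 0.001 m/z deviation at m/z 1000 is 1 ppm.
print(f"{ppm_error(1000.001, 1000.0):.2f} ppm")
print(ppm_window(1000.0, 5.0))
```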

Figure panels: extracted precursor isotope cluster of FGSIGASGDLVPLSYITGSLIGLDPSFK (m/z 1397.2432 to 1399.7478; 0.86 ppm from the theoretical m/z) and targeted FT-MS2 spectrum (relative abundance vs. m/z, 400-2000).

Figure 2-11. Extracted precursor m/z and targeted MS2 for confirmation of the ASG containing peptide after reduction (-H2O+2H).

2.5 Conclusion

This work comprehensively characterizes the recombinant form of phenylalanine ammonia lyase (rAvPAL). The protein converts phenylalanine to trans-cinnamate and is intended for clinical use in PKU. Instability of the MIO group formed by the active site ASG causes the intact protein to cleave into two parts. Our method of sequential enzymatic digestion helps to identify the exact site of cleavage. In our proposed degradation mechanism, the decomposition is caused by free radicals. A reducing reagent that eliminates the double bond in the active site of the protein allowed direct observation of the addition products by LC-MS/MS.

In this method, SDS-PAGE allows direct separation of the cleaved forms in the protein solution. Multiple and sequential proteolytic digestion shortens the peptides for analysis and gives more confidence in identifying the cleavage site in the protein backbone. High resolution mass spectrometry coupled to the online LC system detected the exact mass of the active site containing peptide. The exact precursor m/z and the corresponding MS2 spectrum from CID fragmentation confirm the identity of the peptide from digestion, which agrees with the proposed degradation mechanism. This analysis system, combining multiple and sequential enzymatic digestions with high resolution LC-MS/MS, is very helpful for characterizing industrial protein pharmaceuticals and their associated degradation products. The process developed in Chapter 2 for recombinant protein therapeutics is applied in Chapter 3 to the characterization of recombinant acidic glucosidase.


2.6 References

1. Milward, K., Phenylketonuria - a brief outline. In Genefaith, Palmer, A. J., Ed. genefaith.org, 2001.
2. (a) Moffitt, M. C.; Louie, G. V.; Bowman, M. E.; Pence, J.; Noel, J. P.; Moore, B. S., Discovery of two cyanobacterial phenylalanine ammonia lyases: Kinetic and structural characterization. Biochemistry-Us 2007, 46 (4), 1004-1012; (b) Okhamafe, A. O.; Bell, S. M.; Zecherle, G. N.; Antonsen, K.; Zhang, Y.; Ly, K. Y.; Fitzpatrick, P. A.; Kakkis, E. D.; Vellard, M. C.; Wendt, D. J.; Muthalif, M., Compositions of prokaryotic phenylalanine ammonia-lyase variants and methods of using compositions thereof. 2011; (c) Ritter, H.; Schulz, G. E., Structural basis for the entrance into the phenylpropanoid metabolism catalyzed by phenylalanine ammonia-lyase. Plant Cell 2004, 16 (12), 3426-3436.
3. (a) Harding, C. O., Progress toward cell-directed therapy for phenylketonuria. Clin Genet 2008, 74 (2), 97-104; (b) Huether, G.; Kaus, R.; Neuhoff, V., Amino-Acid Depletion in the Blood and Brain-Tissue of Hyperphenylalaninemic Rats Is Abolished by the Administration of Additional Lysine - a Contribution to the Understanding of the Metabolic Defects in Phenylketonuria. Biochem Med Metab B 1985, 33 (3), 334-341.
4. (a) Black, H., Newborn screening report sparks debate in USA. Lancet 2005, 365 (9469), 1453-1454; (b) Langenbeck, U.; Baum, F.; Behbehani, A. W., Metabolic and Neurophysiologic Studies in Adult Patients with Phenylketonuria. Biol Chem H-S 1985, 366 (2), 139-139.
5. Barnes, C. A. S.; Lim, A., Applications of mass spectrometry for the structural characterization of recombinant protein pharmaceuticals. Mass Spectrom Rev 2007, 26 (3), 370-388.
6. (a) Frokjaer, S.; Otzen, D. E., Protein drug stability: A formulation challenge. Nat Rev Drug Discov 2005, 4 (4), 298-306; (b) Apfel, S. C.; Schwartz, S.; Adornato, B. T.; Freeman, R.; Biton, V.; Rendell, M.; Vinik, A.; Giuliani, M.; Stevens, J. C.; Barbano, R.; Dyck, P. J.; Grp, r. C. I., Efficacy and safety of recombinant human nerve growth factor in patients with diabetic polyneuropathy - A randomized controlled trial. Jama-J Am Med Assoc 2000, 284 (17), 2215-2221.
7. Wang, L.; Gamez, A.; Archer, H.; Abola, E. E.; Sarkissian, C. N.; Fitzpatrick, P.; Wendt, D.; Zhang, Y. H.; Vellard, M.; Bliesath, J.; Bell, S. M.; Lemontt, J. F.; Scriver, C. R.; Stevens, R. C., Structural and biochemical characterization of the therapeutic Anabaena variabilis phenylalanine ammonia lyase. J Mol Biol 2008, 380 (4), 623-635.
8. Reid, B. G.; Flynn, G. C., Chromophore formation in green fluorescent protein. Biochemistry-Us 1997, 36 (22), 6786-6791.
9. Konstantin Chingin, R. M. B., Vladimir Frankevich, Konstantin Barylyuk, Robert Nieckarz, Pavel Sagulenko, Renato Zenobi, Absorption of the green fluorescent protein chromophore anion in the gas phase studied by a combination of FTICR mass spectrometry with laser-induced photodissociation spectroscopy. International Journal of Mass Spectrometry 2001, 306, 241-245.
10. Lemay, N. P.; Zimmer, M., Chromophore formation in GFP: Computational modeling of the immature form of wild-type GFP. Proc Spie 2007, 6449.
11. http://www.rcsb.org/pdb/results/results.do?qrid=EAC116C8&tabtoshow=Current.
12. (a) Niki, E., Action of Ascorbic-Acid as a Scavenger of Active and Stable Oxygen Radicals. Am J Clin Nutr 1991, 54 (6), S1119-S1124; (b) Wagner, A. E.; Huebbe, P.; Konishi, T.; Rahman, M. M.; Nakahara, M.; Matsugo, S.; Rimbach, G., Free Radical Scavenging and Antioxidant Activity of Ascorbigen Versus Ascorbic Acid: Studies in Vitro and in Cultured Human Keratinocytes. J Agr Food Chem 2008, 56 (24), 11694-11699.
13. http://www.uniprot.org/. UniProt Consortium.


2.7 Supplementary Figures and Tables

[Gel image: lanes for the molecular weight standard (kDa; 260, 160, 110, 80, 60, 50, 40, 30, 20, 15, 10, 3.5) and RtPAL loaded at 17.2 µg, 34.4 µg, and 51.6 µg, with DTT.]

Supplementary Figure 2-S1. SDS-PAGE separation of RtPAL with different loading amount.


MAPSLDSISHSFANGVASAKQAVNGASTNLAVAGSHLPTTQVTQVDIVEKMLAAPT DSTLELDGYSLNLGDVVSAARKGRPVRVKDSDEIRSKIDKSVEFLRSQLSMSVYGVTT GFGGSADTRTEDAISLQKALLEHQLCGVLPSSFDSFRLGRGLENSLPLEVVRGAMTIR VNSLTRGHSAVRLVVLEALTNFLNHGITPIVPLRGTISASGDLSPLSYIAAAISGHPDS KVHVVHEGKEKILYAREAMALFNLEPVVLGPKEGLGLVNGTAVSASMATLALHDAH MLSLLSQSLTAMTVEAMVGHAGSFHPFLHDVTRPHPTQIEVAGNIRKLLEGSRFAV HHEEEVKVKDDEGILRQDRYPLRTSPQWLGPLVSDLIHAHAVLTIEAGQSTTDNPLID VENKTSHHGGNFQAAAVANTMEKTRLGLAQIGKLNFTQLTEMLNAGMNRGLPSC LAAEDPSLSYHCKGLDIAAAAYTSELGHLANPVTTHVQPAEMANQAVNSLALISARR TTESNDVLSLLLATHLYCVLQAIDLRAIEFEFKKQFGPAIVSLIDQHFGSAMTGSNLRD ELVEKVNKTLAKRLEQTNSYDLVPRWHDAFSFAAGTVVEVLSSTSLSLAAVNAWKV AAAESAISLTRQVRETFWSAASTSSPALSYLSPRTQILYAFVREELGVKARRGDVFLGK QEVTIGSN VSKIYEAIKSGRINNVLLKMLA

Supplementary Figure 2-S2. Full sequence of RtPAL.13


Supplementary Figure 2-S3. XIC extraction of one peptide fragment (ALTNFLNHGITPIVPLRGTISA) of RtPAL.


[Chromatogram panels: XIC traces for the 60 K, 45 K, and 15 K gel bands, with peaks at 30.15 min (NL: 2.94E7) and 30.20 min (NL: 2.61E7) in the 60 K and 45 K panels and the 15 K panel labeled "no"; inset isotope patterns near m/z 1035 to 1037 labeled M-16 and M-17; axes: time (min) vs. relative abundance.]

Figure 2-S4a. XIC extraction of the C-term part of the peptide fragment (SGDLSPLSYIAAAISGHPDSK) of RtPAL.


[MS2 spectrum with annotated b and y fragment ions; axes: m/z (400 to 2000) vs. relative abundance.]

Figure 2-S4b. MS2 spectra of C-term fragmented tryptic peptide from 45 kDa band containing the ASG in RtPAL.


[TIC trace: 17.2 µg/µL RtPAL intact protein, diluted 20-fold, 1 µL injected (about 0.86 µg); axes: acquisition time (min, 1 to 35) vs. counts (×10^8).]

Figure 2-S5a. TIC spectra of intact protein analysis of RtPAL.

Also lose N-term MA, and -16 & -17 Da

Lose N-term two amino acid MA

MAMAPSLDSISHSFANGVASAKQAVNGASTNLAVAGSHLPTTQVTQVDIVEKMLAAPTDSTLELDGYSLNLGDVVSAARKGRPVRVKDSDEIRSKIDK SVEFLRSQLSMSVYGVTTGFGGSADTRTEDAISLQKALLEHQLCGVLPSSFDSFRLGRGLENSLPLEVVRGAMTIRVNSLTRGHSAVRLVVLEALTNFLNH GITPIVPLRGTISA

Figure 2-S5b. Mass deconvolution result of intact analysis of RtPAL.


MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTL KFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSA MPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFK EDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDG SVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDH MVLLEFVTAAGITHGMDELYK

Supplementary Figure 2-S6. Sequence of Green Fluorescent Protein (GFP).13


[Reaction scheme: the -Ser-Tyr-Gly- peptide ([M+H]+ 2455.2479, [M+2H]2+ 1228.1276, [M+3H]3+ 819.0875) forms a ring and loses 2H to give the HBDI group (4'-hydroxybenzylidene-2,3-dimethylimidazolinone). Values listed for the products: [M+H]+ 2437.2373, [M+2H]2+ 1219.1223, [M+3H]3+ 813.0840; and [M+H]+ 2435.2373, [M+2H]2+ 1218.1186, [M+3H]3+ 812.4124.]

Supplementary Figure 2-S7. Formation of HBDI group in GFP from SYG.10
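The charge-state series listed for the -Ser-Tyr-Gly- peptide follows from the singly charged value by adding protons and dividing by the charge. A minimal sketch, assuming a proton mass of 1.007276 Da; the function name is illustrative:

```python
PROTON = 1.007276  # monoisotopic mass of a proton (Da)

def mz_from_singly_charged(mh_plus, z):
    """m/z of [M+zH]z+ given the m/z of [M+H]+."""
    neutral = mh_plus - PROTON
    return (neutral + z * PROTON) / z

# [M+H]+ of the unmodified -Ser-Tyr-Gly- peptide, taken from the figure.
for z in (1, 2, 3):
    print(z, round(mz_from_singly_charged(2455.2479, z), 4))
```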


Supplementary Figure 2-S8. SDS-PAGE of green fluorescent protein.


Chapter 3

Characterization of disulfide bond linkage and improper disulfide bridging of a variant of lysosomal acidic glucosidase by LC-MS/MS.


3.1 Abstract

In this work, we present full characterization of the disulfide bonds of a variant of lysosomal acidic glucosidase, known as GAA, by LC-MS/MS using a combination of collision induced dissociation (CID) and electron transfer dissociation (ETD) fragmentation. Lysosomal targeting of this GAA protein as a protein pharmaceutical can help restore normal lysosomal function for treating Pompe disease. Pompe disease is a disorder in which glycogen accumulates in the human body as a result of GAA gene mutation and the consequent deficiency of the enzyme for glycogen degradation.1 This specific variant contains an N-terminal fusion that has the majority of the human IGFII sequence, used as a GILT tag for glycosylation-independent lysosomal targeting.2 The disulfide arrangement on the IGFII segment is not homogeneous, and unexpected disulfide linkage, with scrambling among the 6 Cys residues, was deciphered from the MS/MS sequencing pattern. There were 3 peptides generated from trypsin digestion linked by 2 disulfides. There was also a swap between Cys40 and Cys41, detected with the help of targeted ETD-MS2 analysis. In addition, an unexpected disulfide linkage for this tag, with 2 disulfide linkages between two tryptic peptides and one disulfide linkage within the third peptide, was also observed. Beyond the mapping of the N-terminal GILT tag, the disulfide linkages involving the remaining 13 Cys residues in this protein were also identified, with one single (unpaired) Cys present. In ETD-MS2, we were able to observe ions of high abundance for the linked peptides.3 This is very helpful for obtaining information about disulfide linkage and scrambling. The methodology used here will be useful for determining scrambling patterns and for analyzing other unknown linkages.


3.2 Introduction

The enzyme acid α-glucosidase (lysosomal acidic glucosidase, known as GAA) is responsible for glycogen breakdown in the human body.4 A lack of this glycogen-degrading enzyme causes a metabolic disorder known as Pompe disease.5 Without sufficient GAA, a substantial amount of glycogen builds up in lysosomes, leading to accumulation of waste inside cells.6 In adults, Pompe disease is associated with chronic muscle disorders. With infantile onset, this rare disease can progress from muscle weakness to enlargement of organs such as the heart and liver, and can even lead to cardiac failure, which can be life threatening.7

A variant of recombinant lysosomal acidic glucosidase (GAA) is still under development for clinical use. This variant contains an N-terminal fusion that has the majority of the human IGFII sequence. The fused N-terminal IGFII is used as a glycosylation-independent lysosomal targeting (GILT) tag, with the idea that the molecule will be internalized upon intravenous administration through the IGFII clearance receptor, known as the cation-independent mannose-6-phosphate receptor (CI-M6PR).8 CI-M6PR has two affinity domains, one for IGFII and one for mannose-6-phosphate structures. The receptor delivers the recognized structures, together with the rest of the protein, to the lysosomes.9 This pathway is used for lysosomal targeting of the GAA protein to restore normal lysosomal function in patients.10 The full sequence of this recombinant protein is shown in Figure 3-1. The N-terminal signal peptide is removed during secretion from the production cell. The protein analyzed consists of the IGFII domain, followed by the linker GAP and the human GAA sequence.


N-term signal peptide was removed N-term fusion GILT tag MGIPMGKSMLVLLTFLAFASCCIAALCGGELVDTLQFVCGDRGFYFSRPASRVSRRSRGI VEECCFRSCDLALLETYCATPAKSEGAPAHPGRPRAVPTQCDVPPNSRFDCAPDKAITQ EQCEARGCCYIPAKQGLQGAQMGQPWCFFPPSYPSYKLENLSSSEMGYTATLTRTTPTF FPKDILTLRLDVMMETENRLHFTIKDPANRRYEVPLETPRVHSRAPSPLYSVEFSEEPFGVI VHRQLDGRVLLNTTVAPLFFADQFLQLSTSLPSQYITGLAEHLSPLMLSTSWTRITLWNR DLAPTPGANLYGSHPFYLALEDGGSAHGVFLLNSNAMDVVLQPSPALSWRSTGGILDV YIFLGPEPKSVVQQYLDVVGYPFMPPYWGLGFHLCRWGYSSTAITRQVVENMTRAHFP LDVQWNDLDYMDSRRDFTFNKDGFRDFPAMVQELHQGGRRYMMIVDPAISSSGPAG SYRPYDEGLRRGVFITNETGQPLIGKVWPGSTAFPDFTNPTALAWWEDMVAEFHDQV PFDGMWIDMNEPSNFIRGSEDGCPNNELENPPYVPGVVGGTLQAATICASSHQFLST HYNLHNLYGLTEAIASHRALVKARGTRPFVISRSTFAGHGRYAGHWTGDVWSSWEQLA SSVPEILQFNLLGVPLVGADVCGFLGNTSEELCVRWTQLGAFYPFMRNHNSLLSLPQEP YSFSEPAQQAMRKALTLRYALLPHLYTLFHQAHVAGETVARPLFLEFPKDSSTWTVDHQ LLWGEALLITPVLQAGKAEVTGYFPLGTWYDLQTVPIEALGSLPPPPAAPREPAIHSEGQ WVTLPAPLDTINVHLRAGYIIPLQGPGLTTTESRQQPMALAVALTKGGEARGELFWDDG ESLEVLERGAYTQVIFLARNNTIVNELVRVTSEGAGLQLQKVTVLGVATAPQQVLSNGVP VSNFTYSPDTKVLDICVSLLMGEQFLVSWC

Figure 3-1. Full sequence of a variant of lysosomal acidic glucosidase.

The disulfide arrangement on the IGFII segment is not homogeneous, and the protein can be clipped by furin in the IGFII segment.8a The purpose of this research is to characterize the disulfide linkages of this protein and identify improper and unexpected disulfide bridging. In the GILT tag, 3 tryptic peptides are connected by 6 Cys forming 3 disulfide bonds. Mathematically, these three peptides can be connected to each other in 8 different arrangements, including a swap between C40 and C41. The disulfide linkages could be pinpointed from the MS/MS sequencing pattern by combining two fragmentation methods, collision induced dissociation (CID) and electron transfer dissociation (ETD). This detailed characterization method can identify the site of disulfide scrambling. Besides the disulfide linkages in the N-terminal GILT tag, all other Cys in this protein, with the disulfide linkages they form and one single (unpaired) Cys, were also fully characterized.
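The count of 8 can be reproduced by enumeration. The sketch below assumes that the 8 arrangements are the pairings of the 6 Cys in which every disulfide joins two different peptides; that reading, and all names in the code, are assumptions for illustration rather than the author's stated derivation.

```python
# Cys positions of the GILT tag and the tryptic peptide each one belongs to.
PEPTIDE_OF = {3: "P1", 15: "P1", 40: "P2", 41: "P2", 45: "P3", 54: "P3"}

def pairings(items):
    """Yield every way to split `items` into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

all_pairings = list(pairings(sorted(PEPTIDE_OF)))
# Keep only pairings in which every disulfide joins two different peptides.
inter_only = [p for p in all_pairings
              if all(PEPTIDE_OF[a] != PEPTIDE_OF[b] for a, b in p)]

print(len(all_pairings), "pairings in total")
print(len(inter_only), "with inter-peptide bonds only")
```

Under this assumption, the expected linkage (C3-C41, C15-C54, C40-C45) and its C40/C41 swap (C3-C40, C15-C54, C41-C45) are both members of `inter_only`.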


3.3 Experimental Section

3.3.1 Materials and reagents

The recombinant protein therapeutic lysosomal acidic glucosidase, with the N-terminal fusion of the IGFII sequence as GILT tag and GAP as linker, was provided in liquid form by Biomarin, Inc. (Novato, CA). Four samples from different processing stages were analyzed: 0964 HICQ (B87), 0989 HIC (B89), 1-FIN-069 (FIN), and BIP9003DS (BIP). Sequencing grade trypsin was purchased from Promega (Madison, WI). Ammonium bicarbonate, dithiothreitol (DTT), and iodoacetamide (IAA) were purchased from Sigma-Aldrich (St. Louis, MO). HPLC grade acetonitrile and LC-MS grade water were purchased from J.T. Baker (Phillipsburg, NJ). NuPAGE® gels (4−12% Bis-Tris) for SDS-PAGE (sodium dodecyl sulfate polyacrylamide gel electrophoresis) and native PAGE were purchased from Invitrogen (Carlsbad, CA). LDS sample buffer (4X), MES SDS running buffer (20X), and SimplyBlue SafeStain®, all compatible with the gels used, were also purchased from Invitrogen (Carlsbad, CA).

3.3.2 In-solution native digestion

The three samples B87, B89, and FIN, in liquid form at concentrations of 0.8 mg/mL, 1.5 mg/mL, and 5.07 mg/mL, respectively, were analyzed under non-reduced conditions. 20 µg of each protein sample was used for direct trypsin digestion. The protein solution was buffer exchanged into ammonium bicarbonate (100 mM, pH 8.0) using a spin column with a 10 K molecular weight cutoff, centrifuged at 11,000 rpm. The 100 mM ammonium bicarbonate solution at pH 8 provided a basic environment for trypsin digestion. Trypsin was added at 1:20 (w/w), and the protein solution was incubated at 37 °C for 14 h. After incubation, the digestion was stopped by adding 5% formic acid. In addition, the same in-solution tryptic digestion was carried out in 100 mM ammonium acetate at pH 6.8 to provide a less basic environment for native tryptic digestion.
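The sample volumes and trypsin amount implied by these conditions follow from simple arithmetic. A minimal sketch, assuming that 1:20 (w/w) means one part trypsin to twenty parts protein and that volumes are taken from the stock solutions before buffer exchange; the helper names are illustrative:

```python
TARGET_UG = 20.0  # protein used per digestion (µg)

# Stock concentrations; mg/mL is numerically equal to µg/µL.
CONC_UG_PER_UL = {"B87": 0.8, "B89": 1.5, "FIN": 5.07}

def volume_ul(mass_ug, conc_ug_per_ul):
    """Volume of stock (µL) that contains the requested protein mass."""
    return mass_ug / conc_ug_per_ul

def trypsin_ug(protein_ug, ratio=20.0):
    """Trypsin mass (µg) for an enzyme:protein ratio of 1:ratio (w/w)."""
    return protein_ug / ratio

for name, conc in CONC_UG_PER_UL.items():
    print(f"{name}: {volume_ul(TARGET_UG, conc):.2f} µL stock, "
          f"{trypsin_ug(TARGET_UG):.1f} µg trypsin")
```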

3.3.3 SDS-PAGE and in-gel digestion

GAA protein samples were separated by SDS-PAGE before in-gel digestion. 20 µg of protein was loaded onto the gel for each sample. Under native digestion conditions, the sample was loaded directly onto the gel together with LDS sample buffer (4X). Under reduced conditions, the protein solution with LDS sample buffer was heated at 70 °C for 10 min before being loaded onto the gel (4-12% Bis-Tris gels).

The major gel bands at 110 kDa of each protein were excised for in-gel trypsin digestion. For native digestion, gel bands were sliced, destained, and alkylated with 55 mM IAA in the dark at room temperature for 1 hour without any reduction step. Digestion was then started by adding 12.5 ng/µL trypsin in 50 mM ammonium bicarbonate and allowing the gel pieces to absorb the trypsin digestion buffer at 4 °C for 35 min. Gel bands were then incubated at 37 °C for 14 h with digestion buffer without the enzyme. For reduced trypsin digestion, gel bands at 110 kDa were sliced and reduced at 56 °C for 30 min with 10 mM DTT in 0.1 M ammonium bicarbonate, then alkylated by incubation with 55 mM IAA for 1 hour. Trypsin digestion followed the same process as for native digestion: 12.5 ng/µL trypsin in 50 mM ammonium bicarbonate was used, followed by incubation at 37 °C for 14 h.

After digestion, the peptides in the gel pieces were extracted with ammonium bicarbonate (25 mM) and then acetonitrile, each with shaking at 37 °C for 15 min. Further extraction was done with 5% formic acid (37 °C, 10 min vortex) and acetonitrile. Supernatants were collected and concentrated to a final concentration of 1 µg/µL for LC-MS/MS analysis.


3.3.4 LC-MS/MS analysis

An Ultimate 3000 nano-LC (Dionex, Mountain View, CA) coupled on-line with an LTQ-Orbitrap mass spectrometer (Thermo Fisher Scientific, San Jose, CA) was used for analysis. Reversed-phase HPLC separation was performed on a self-packed C18 column (Magic, 200 Å pore, 5 µm particle size, 18 cm length, 75 µm i.d.). Mobile phase A was 0.1% formic acid in water, and mobile phase B was 0.1% formic acid in acetonitrile. The sample was loaded at 2% mobile phase B, followed by a linear gradient from 2% to 65% B over 60 min. Over the next 10 min, mobile phase B was increased from 65% to 80%, followed by an isocratic hold at 80% B for 10 min. The flow rate was maintained at 200 nL/min during the gradient and at 300 nL/min during sample loading. Before sample loading, the LC-MS system was equilibrated at 90% B for 5 min and at 2% B for 30 min.
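The gradient program described above can be written as a piecewise-linear function of time. This is a minimal sketch in which time zero is taken as the start of the linear gradient (after sample loading), which is an assumption; the names are illustrative.

```python
# (start_min, end_min, start_%B, end_%B); time zero = start of the gradient.
SEGMENTS = [
    (0, 60, 2, 65),    # linear gradient, 2% to 65% B over 60 min
    (60, 70, 65, 80),  # 65% to 80% B over the next 10 min
    (70, 80, 80, 80),  # isocratic hold at 80% B for 10 min
]

def percent_b(t_min):
    """Mobile phase B (%) at t_min minutes into the gradient program."""
    for t0, t1, b0, b1 in SEGMENTS:
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    raise ValueError("time outside the 80 min gradient program")

for t in (0, 30, 60, 65, 75):
    print(f"{t:>2} min: {percent_b(t):.1f}% B")
```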

The LTQ-Orbitrap acquired full-scan MS spectra from m/z 400 to 2000 at a mass resolution of 60,000. Total acquisition time was 90 min. The acquisition combined CID (collision induced dissociation) with ETD (electron transfer dissociation). There was one segment for MS analysis, and there were 11 events in this segment. In scan event 1, the data type was profile, and FTMS was the analyzer for measuring exact precursor masses. Normalized collision energy was set at 35. Each precursor ion was isolated in data-dependent mode. The second event was an MS2 dependent scan acquired in the LTQ on the most abundant ion from scan event 1, with CID as the fragmentation type (CID-MS2). The third event was also a dependent scan acquired in the LTQ on the same ion as scan event 2, with ETD as the fragmentation type (ETD-MS2). Sequentially, the fourth event was CID-MS2 of the ion with the second highest abundance, and the fifth event was ETD-MS2 of the same ion as the fourth event. In total, there were 9 scan events. The data type for them was centroid, and the analyzer was the ion trap. Sample preparation and LC-MS analysis were done in replicates to confirm the consistency of the experimental observations.
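The alternating CID/ETD event structure can be sketched as a simple list builder. This is an illustration only, not an instrument method file: the dictionary keys and the function name are hypothetical, and the number of precursors selected per cycle is left as a parameter.

```python
def build_scan_events(n_precursors):
    """Full FTMS survey scan followed by CID-MS2 and ETD-MS2 for each precursor."""
    events = [{"event": 1, "scan": "full MS", "analyzer": "FTMS",
               "data": "profile"}]
    for rank in range(1, n_precursors + 1):
        for activation in ("CID", "ETD"):
            events.append({"event": len(events) + 1,
                           "scan": f"{activation}-MS2",
                           "analyzer": "ion trap",
                           "data": "centroid",
                           "precursor_rank": rank})
    return events

for e in build_scan_events(2):
    print(e)
```

With this layout a cycle contains 1 + 2n events for n precursors, so events 2 and 3 target the most abundant ion and events 4 and 5 the second most abundant.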

3.4 Results and Discussion

3.4.1 SDS-PAGE Separation of GAA

B89 and BIP were separated by SDS-PAGE under reduced and non-reduced conditions, as shown in Figure 3-2. The major bands at around 110 kDa, representing the whole GAA protein, were excised for further analysis.

[Gel image: lanes for the molecular weight standard (kDa; 260, 160, 110, 80, 60, 50, 40, 30, 20, 15, 10), B89 non-reduced, B89 reduced, BIP 93000DS non-reduced, and BIP 93000DS reduced.]

Figure 3-2. SDS-PAGE separation of B89 and BIP under non-reduced and reduced condition.


3.4.2 Combination of CID with ETD in LC-MS/MS analysis

CID fragments the peptide backbone but leaves the disulfide bonds intact, so the peptides remain linked together by disulfide bonds. ETD fragmentation in LC-MS/MS analysis provides information on disulfide-linked peptides because ETD preferentially breaks disulfide bonds. This generates MS2 information for each peptide and for the charge-reduced species. Further CID-MS3 of the charge-reduced species from ETD-MS2, in targeted mode, can provide further backbone details. If necessary, CID-MS4 can be applied to the CID-MS3 products for additional backbone details. Previous publications by Wu et al. describe the application of ETD fragmentation in detail.11

3.4.3 Disulfide bond analysis of GILT tag

Figure 3-3 shows the disulfide linkages of the N-terminal GILT tag in this protein. The theoretical disulfide linkage of the tryptic peptides in the GILT tag, which is the desired structure, is shown in Figure 3-3a. Figures 3-3b and 3-3c show the unexpected structures of these three peptides. In the desired structure, the three peptides, each containing 2 Cys, are connected to each other through disulfide bonds. Briefly, three tryptic peptides, P1 (1-18), P2 (35-43), and P3 (44-59), are linked by three disulfide bonds: C3 to C41, C15 to C54, and C40 to C45. The purpose is to determine whether this correct disulfide linkage is present and whether there is unexpected disulfide formation in the protein samples. For the desired structure, the exact disulfide linkage positions were identified.
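The peptide boundaries P1 (1-18), P2 (35-43), and P3 (44-59) can be reproduced with an in-silico tryptic digest of the tag. The sketch below assumes the common rule of cleavage C-terminal to K or R except before P, and assumes that numbering starts at the N-terminal Ala of the mature protein; the 59-residue string is copied from Figure 3-1, and the function name is illustrative.

```python
# First 59 residues after the signal peptide, copied from Figure 3-1.
# Numbering from the N-terminal Ala is an assumption of this sketch.
GILT_TAG = "ALCGGELVDTLQFVCGDRGFYFSRPASRVSRRSRGIVEECCFRSCDLALLETYCATPAK"

def tryptic_peptides(seq):
    """Cleave C-terminal to K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(seq):
        next_residue = seq[i + 1] if i + 1 < len(seq) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append((start + 1, i + 1, seq[start:i + 1]))
            start = i + 1
    if start < len(seq):
        peptides.append((start + 1, len(seq), seq[start:]))
    return peptides

# Peptides that contain Cys, and the 1-based positions of all Cys residues.
cys_peptides = [p for p in tryptic_peptides(GILT_TAG) if "C" in p[2]]
cys_positions = [i + 1 for i, residue in enumerate(GILT_TAG) if residue == "C"]

for first, last, pep in cys_peptides:
    print(f"{first}-{last}: {pep}")
print("Cys positions:", cys_positions)
```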


(a) ALC3GGELVDTLQFVC15GDR (P1: 1-18)

GIVEEC40C41FR (P2: 35-43)

SC45DLALLETYC54ATPAK (P3: 44-59)

(b) GIVEEC40C41FR (P2: 35-43)

(c) ALCGGELVDTLQFVCGDR (P1: 1-18) SC45DLALLETYC54ATPAK (P3: 44-59)

Figure 3-3. (a) Correct disulfide linkage of the three tryptic peptides in the GILT tag; (b) unexpected disulfide linkage of P2 with P3; (c) P1 with another disulfide bond.

First, we characterized the disulfide linkages of the B87, B89, and FIN samples under non-reduced conditions. Under this native digestion, the disulfide linkages remained intact. As an example, consider the identification of the FIN N-terminal disulfide linkage between the three tryptic peptides; Figure 3-4 illustrates the elucidation process. Figure 3-4a shows the exact precursor m/z calculated from the theoretical disulfide linkage: the mass is the sum of the masses of the three peptides minus 6 Da for the three disulfide bonds. Under CID-MS2 (Figure 3-4b), the three disulfide bonds remained intact, and the corresponding b and y ions carried the mass of all three linked peptides. In ETD-MS2 (Figure 3-4c), by contrast, P1, P2, and P3 could be dissociated into [P1+P2], [P2+P3], and [P1+P3]. The high intensity of these dissociated peptides is likely because, in the electron-capture process, free electrons are captured more easily by the sulfurs than by the peptide backbone. Charge-reduced species and the corresponding c and z ions were also observed in ETD-MS2. The broken linkages with dissociated peptides could also be observed in CID-MS3, as illustrated in Figure 3-4d. However, with data-dependent CID-MS3, the exact disulfide positions could not be identified. Therefore, CID-MS3 in targeted mode was applied. For example, in Figure 3-4e, [P1+P3] was targeted for CID-MS3, and the linkage position in P1 and P3 could be identified; we could conclude that C15 was linked to C54. The same method was used for targeted analysis of [P2+P3] (Figure 3-4f) and [P1+P2] (Figure 3-4g). The targeted analysis gave more detailed b and y ions, from which we could pinpoint that C40 was linked to C45 and C41 was linked to C3. Without this technique, it would be much more difficult, or even impossible, to determine the exact disulfide linkage positions. The detailed mass spectra are shown here for the FIN sample; B87 and B89 were analyzed using the same method. A comparison of the disulfide linkages is shown later in the chapter.
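The theoretical precursor m/z for the three linked peptides can be computed from standard monoisotopic residue masses: sum the three peptide masses, subtract two hydrogens per disulfide bond, then add the charge-carrying protons. A minimal sketch, assuming unmodified residues; the function names are illustrative.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
H2O = 18.010565
H = 1.007825
PROTON = 1.007276

def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified linear peptide."""
    return sum(MONO[aa] for aa in seq) + H2O

def linked_mz(peptides, n_disulfides, z):
    """m/z of disulfide-linked peptides; each S-S bond removes two H."""
    neutral = sum(peptide_mass(p) for p in peptides) - 2 * n_disulfides * H
    return (neutral + z * PROTON) / z

P1, P2, P3 = "ALCGGELVDTLQFVCGDR", "GIVEECCFR", "SCDLALLETYCATPAK"
mz = linked_mz([P1, P2, P3], n_disulfides=3, z=5)
print(f"[P1+P2+P3], three disulfides, 5+: m/z {mz:.4f}")
```

For the 5+ ion of [P1+P2+P3] with three disulfides, this calculation gives a value close to m/z 929.2278.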

[Figure 3-4a: linked peptides ALC3GGELVDTLQFVC15GDR (1-18), GIVEEC40C41FR (35-43), and SC45DLALLETYC54ATPAK (44-59); theoretical monoisotopic m/z = 929.2278 (5+); observed isotope pattern from m/z 929.2326 to 931.0328 (NL: 7.81E7); axes: m/z vs. relative abundance. The monoisotopic mass matched the expected peptide mass with three disulfides (a loss of 6 H from the three backbone sequences).]


[Figure 3-4b: CID-MS2 of the precursor at m/z 929.93 (5+), with annotated b and y ions from the linked peptides P1, P2, and P3; axes: m/z vs. relative abundance.]

[Figure 3-4c: ETD-MS2 of the precursor at m/z 775.07 (6+), with charge-reduced species, dissociated P1, P2, and P3, the pairs [P1+P2], [P2+P3], and [P1+P3], and annotated c and z ions; axes: m/z vs. relative abundance.]

[Figure 3-4d: CID-MS3 of m/z 1549.22 (3+), with P1, P2, P3, and the pairs [P1+P2], [P2+P3], and [P1+P3]; axes: m/z vs. relative abundance.]

[Figure 3-4e: targeted CID-MS3 of [P1+P3] at m/z 1198.20, with annotated b and y ions of P1 and P3; axes: m/z vs. relative abundance.]

[Figure 3-4f: targeted CID-MS3 of [P2+P3] at m/z 1376.00 (2+), with annotated b and y ions of P2 and P3; axes: m/z vs. relative abundance.]

[Figure 3-4g: targeted CID-MS3 of [P1+P2] at m/z 1475.10 (2+), with annotated b and y ions of P1 and P2; axes: m/z vs. relative abundance.]

Figure 3-4. LC-MS/MS analysis of GILT tag disulfide bond linkage [P1+P2+P3] of FIN sample. (a) exact precursor m/z, (b) CID-MS2, (c) ETD-MS2, (d) CID-MS3, (e-g) targeted CID-MS3.

In addition to characterizing the correct disulfide linkage, we also identified an unexpected linkage in which P2 and P3 were linked together by two disulfide bonds, without P1. For this unexpected linkage, identifying the detailed linking positions at C40 and C41 was not necessary. For the analysis of P2 and P3, it is sufficient to establish that two disulfide bonds are present and that P2 and P3 are linked together. CID-MS2, ETD-MS2, and CID-MS3 were used for identification; targeted mode was not applied. Figures 3-5a to 3-5d show the exact precursor m/z extraction, CID-MS2, ETD-MS2, and CID-MS3, respectively. The extracted exact precursor mass (Figure 3-5a) matches the theoretical mass of P2 and P3 with two disulfides, that is, with a loss of 4 Da due to the two disulfide bonds. CID-MS2 (Figure 3-5b) generated the corresponding b and y ions with P2 and P3 still linked together. ETD-MS2 (Figure 3-5c) produced charge-reduced species in high intensity due to electron capture by the disulfides; dissociated P2 and P3 peptides were also observed. CID-MS3 (Figure 3-5d) provided detailed information about the b and y ions of the disulfide-linked [P2+P3].

[Figure 3-5a: theoretical monoisotopic m/z = 917.0909 (3+); observed isotope pattern from m/z 917.0843 to 919.4210 (NL: 2.28E6); axes: m/z vs. relative abundance.]

[Figure 3-5b: CID-MS2 of the precursor at m/z 917.29 (3+), with annotated b and y ions of the linked GIVEEC40C41FR (P2) and SC45DLALLETYC54ATPAK (P3); axes: m/z vs. relative abundance.]

[Figure 3-5c: ETD-MS2 of the precursor at m/z 917.10 (3+), with the charge-reduced species [M'+3H]2+• at m/z 1375.8483 and dissociated P2 and P3; axes: m/z vs. relative abundance.]

[Figure 3-5d: CID-MS3 of m/z 1376.82 (2+), with annotated b and y ions of P2 and P3; axes: m/z vs. relative abundance.]

Figure 3-5. LC-MS/MS analysis of the GILT tag disulfide bond linkage [P2+P3] of the FIN sample. (a) Exact precursor m/z; (b) CID-MS2; (c) ETD-MS2; (d) CID-MS3.

Disulfide bond linkages were characterized for three samples from different processing stages: B87, B89 and FIN. All samples were digested without reduction and alkylation to preserve information about the disulfide linkages. The retention time, intensity and observed mass of the disulfide-linked N-term GILT tag peptides were compared for the three samples under two digestion pH conditions, pH 6.8 and pH 8, in Tables 3-1a (B87), 3-1b (B89) and 3-1c (FIN).

Table 3-1. Disulfide connection at N-term region, comparison of retention time, intensity and mass (a, B87; b, B89; c, FIN). I.D. method for all entries: CID/ETD/CID-MS3.

Peptides: A(1-18) = ALC3GGELVDTLQFVC15GDR; A(35-43) = GIVEEC40C41FR; A(44-59) = SC45DLALLETYC54ATPAK.

a (B87)
Disulfides                  Linked peptides                  RT (intensity), pH 6.8   RT (intensity), pH 8   Theoretical mass   Observed mass (pH 6.8)
C3-C41, C15-C54, C40-C45    A(1-18) + A(35-43) + A(44-59)    45.25 min (3.07 E7)      45.78 min (2.61 E6)    4641.1716          4641.1620
C40-C45, C41-C54            A(35-43) + A(44-59)              40.13 min (1.33 E8)      41.31 min (4.29 E6)    2748.2726          2748.2632

b (B89), LTQ-Orbi
Disulfides                  Linked peptides                  RT (intensity), pH 6.8   RT (intensity), pH 8   Theoretical mass   Observed mass (pH 6.8)
C3-C41, C15-C54, C40-C45    A(1-18) + A(35-43) + A(44-59)    45.98 min (2.63 E6)      44.65 min (8.04 E6)    4641.1716          4641.1470
C40-C45, C41-C54            A(35-43) + A(44-59)              39.34 min (8.22 E6)      41.32 min (7.05 E6)    2748.2726          2748.2624

c (FIN), LTQ-Orbi
Disulfides                  Linked peptides                  RT (intensity), pH 6.8   RT (intensity), pH 8   Theoretical mass   Observed mass (pH 6.8)
C3-C41, C15-C54, C40-C45    A(1-18) + A(35-43) + A(44-59)    44.21 min (2.16 E7)      44.88 min (8.60 E6)    4641.1716          4641.1665
C40-C45, C41-C54            A(35-43) + A(44-59)              39.26 min (2.36 E6)      40.05 min (3.79 E6)    2748.2726          2748.2529
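The agreement between theoretical and observed masses in Table 3-1 can be expressed as a signed error in parts per million. The following is a minimal sketch of that arithmetic using the tabulated values; the species labels (with P1 taken to be the A(1-18) peptide) and the function names are illustrative assumptions, not part of the original analysis.

```python
# Mass accuracy check for the disulfide-linked species in Table 3-1.
# Theoretical and observed masses are copied from the table (pH 6.8 data).
# The labels "P1+P2+P3" and "P2+P3" are illustrative names for the two rows.

THEORETICAL = {
    "P1+P2+P3": 4641.1716,  # C3-C41, C15-C54, C40-C45
    "P2+P3": 2748.2726,     # C40-C45, C41-C54
}

OBSERVED = {
    "B87": {"P1+P2+P3": 4641.1620, "P2+P3": 2748.2632},
    "B89": {"P1+P2+P3": 4641.1470, "P2+P3": 2748.2624},
    "FIN": {"P1+P2+P3": 4641.1665, "P2+P3": 2748.2529},
}


def ppm_error(observed: float, theoretical: float) -> float:
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def mass_errors() -> dict:
    """Return {(sample, species): ppm error} for every table entry."""
    return {
        (sample, species): ppm_error(mass, THEORETICAL[species])
        for sample, masses in OBSERVED.items()
        for species, mass in masses.items()
    }


if __name__ == "__main__":
    for (sample, species), err in mass_errors().items():
        print(f"{sample:4s} {species:9s} {err:+.2f} ppm")
```

All six entries fall within 10 ppm of the theoretical value when computed this way.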

3.4.4 Characterization of other disulfide linkages

The GAA protein contains 19 Cys in total. Table 3-2 part (a) lists the Cys locations in the tryptic peptides and the theoretical monoisotopic masses of those peptides. The disulfide linkages at the N-term GILT tag, which contains 6 Cys, were characterized above, leaving 13 Cys whose linkages remained to be identified. Cys369 in tryptic peptide A344-370 was identified as a free Cys without a disulfide linkage in the major gel bands analyzed at 110 kDa. This was shown by native in-gel tryptic digestion, in which this Cys was observed alkylated by IAA, adding 57.02 Da. Native digestion was required because under reducing conditions any disulfide bond involving this Cys would already have been cleaved by DTT, so a free Cys could not be distinguished from a linked one.

Confirmation of the other disulfide linkages in the GAA protein followed a similar analytical process. Table 3-2 part (b) lists the confirmed disulfide linkages with their associated peptides and theoretical monoisotopic masses. CID-MS2 and ETD-MS2 were applied in combination as fragmentation types in the LC-MS/MS analysis, to obtain information about backbone cleavages with the disulfides intact and about the dissociated peptides.

Table 3-2. (a) List of Cys-containing tryptic peptides; (b) confirmation of disulfide linkages in the GAA protein. ([a] intramolecular disulfide; [b] Asn647→Asp647 conversion after PNGase-F treatment)

(a)
Sequence location   Feature                               Theoretical monoisotopic mass
A(1-18)             Cys3; Cys15                           2008.9347
A(35-43)            Cys40; Cys41                          1168.5005
A(44-59)            Cys45; Cys54                          1811.8434
A(72-84)            Cys77                                 1439.6827
A(85-91)            Cys87                                 851.3484
A(92-101)           Cys98                                 1204.5506
A(102-109)          Cys103; Cys104                        967.4256
A(110-132)          Cys122                                2673.2145
A(344-370)          Cys369                                3227.5725
A(523-580)          Cys528; Cys553                        6262.9502
A(604-655)          Cys642; Cys653; [a] Asn647→Asp647     5763.7654
A(929-947)          Cys933; Cys947                        2268.0993

(b)
Confirmed disulfide-linked peptides (sequence location)   Confirmed disulfide linkage   Theoretical monoisotopic mass (Da)
A(1-18) + A(35-43) + A(44-59)                             C3-C41, C40-C45, C15-C54      4641.1028
A(85-91) + A(72-84) + A(102-109)                          C87-C103, C104-C77            3026.3395
A(92-101) + A(110-132)                                    C98-C122                      3761.7065
A(604-655) [a, b]                                         C642-C653                     5647.7068 [b]
A(523-580) [a]                                            C528-C553                     6146.8916
A(929-947) [a]                                            C933-C947                     2268.0993

Among the disulfide linkages shown in Table 3-2B, two peptides were treated differently from the others: peptide A604-655 and peptide A929-947. Peptide A604-655 contains one N-glycosylation site at Asn647, and it could not be identified when only trypsin was used. Therefore, PNGase-F was used to remove the glycan from the peptide. PNGase-F treatment converts Asn647 to Asp647, with a mass addition of 1 Da. The analysis was more complicated here because, in addition to this mass addition, native digestion also gave a loss of 2 Da from the disulfide bond. Figure 3-6a shows the extracted precursor m/z measurement for this peptide after treatment with trypsin and PNGase-F under native digestion. Under reduced digestion with PNGase-F cleavage of the glycan, Asn was still converted to Asp, leading to the 1 Da mass addition. In contrast to native digestion, the two Cys residues were each alkylated with IAA, adding 57.02 Da per Cys. The accurate mass measurement is shown in Figure 3-6b. Besides the accurate precursor m/z, the b and y ions generated from the peptide backbone are critical for confirming the peptide. Figure 3-6c shows the CID-MS2 spectrum from reduced digestion. Under native digestion, little b and y ion information was obtained from the backbone because of the intact disulfide link, whereas reduced digestion gave more detailed b and y ions for confirmation of this peptide. Therefore, with accurate mass measurement under both native and reduced digestion, together with CID-MS2 cleavage of the peptide backbone, the deglycosylated peptide A604-655 was identified with one disulfide bond linking C642 to C653.
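The mass bookkeeping described above (57.02 Da added per alkylated Cys under reduced digestion, 2 Da lost per disulfide bond under native digestion) can be sketched as follows. This is a minimal illustration, assuming the masses in Table 3-2a refer to the fully alkylated peptides and using standard monoisotopic constants that are not stated in the text; the function names are hypothetical.

```python
# Monoisotopic constants (Da); standard values, assumed here, not from the text.
CARBAMIDOMETHYL = 57.021464  # IAA alkylation of one Cys
HYDROGEN = 1.00782503        # one H atom is lost per Cys when a disulfide forms
PROTON = 1.00727647          # charge carrier in positive-mode electrospray


def native_mass_from_alkylated(alkylated_mass: float, n_disulfides: int) -> float:
    """Mass with intact disulfides, given the fully alkylated peptide mass.

    Each disulfide involves two Cys, so two carbamidomethyl groups and
    two hydrogens are removed per bond.
    """
    per_bond = 2 * CARBAMIDOMETHYL + 2 * HYDROGEN
    return alkylated_mass - n_disulfides * per_bond


def mz(neutral_mass: float, charge: int) -> float:
    """m/z of the [M + zH]z+ ion for a neutral monoisotopic mass."""
    return (neutral_mass + charge * PROTON) / charge


if __name__ == "__main__":
    # Inputs are the alkylated-peptide masses listed in Table 3-2a.
    print(f"{native_mass_from_alkylated(6262.9502, 1):.4f}")  # A(523-580)
    print(f"{native_mass_from_alkylated(5763.7654, 1):.4f}")  # A(604-655)
    print(f"{mz(5763.7654, 4):.4f}")  # A(604-655), reduced digestion, 4+
```

With these constants the two converted masses agree with the Table 3-2b entries for A(523-580) and A(604-655) (6146.8916 and 5647.7068), and the 4+ m/z agrees with the theoretical value quoted in Figure 3-6b (1441.9486).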

[Figure 3-6, panels a-c: (a) precursor isotope distribution from native digestion, theoretical monoisotopic mass = 1412.9419 (4+), observed 1412.9437; (b) precursor isotope distribution from reduced digestion, theoretical monoisotopic mass = 1441.9486 (4+), observed 1441.9536; (c) CID-MS2 of m/z 1442.44 (4+) for YAGHWTGDVWSSWEQLASSVPEILQFNLLGVPLVGADVC642GFLGNTSEELC653VR, with annotated b and y ions. Axes: relative abundance versus m/z.]

Figure 3-6. (a) Precursor m/z from native digestion after deglycosylation; (b) precursor m/z from reduced digestion after deglycosylation; (c) CID-MS2 spectrum for confirmation from reduced digestion.

For the analysis of peptide A929-947, sequential multi-enzyme digestion was used. The tryptic peptide could not be observed, probably because of its high hydrophobicity. Therefore, trypsin plus Glu-C was used for both reduced and native digestion. With the addition of Glu-C, which cleaves after Asp (D) and Glu (E), peptide A929-947 was cut into A932-939 and A940-947. I was able to identify each of these two peptides separately from the reduced digestion. Under native digestion, accurate mass measurement showed a loss of 2 Da (Figure 3-7a), which gives partial evidence for a disulfide linkage between the two peptides. In the CID-MS2 spectrum (Figure 3-7b), the corresponding b and y ions generated from the backbones of the two linked peptides were observed in detail. Therefore, the disulfide linkage between C933 and C947 was identified.

[Figure 3-7, panels a-b: (a) precursor isotope distribution, theoretical monoisotopic mass = 979.4730 (2+), observed 979.4703; (b) CID-MS2 of m/z 979.97 (2+). Peptides: ICVSLLMGE (P1) and QFLVSWC (P2). Axes: relative abundance versus m/z.]

Figure 3-7. (a) Accurate mass measurement from MS scan; (b) CID-MS2 with backbone b and y ions.

3.5 Conclusion

The method successfully characterized the disulfide linkages at the N-term GILT tag as well as in the rest of the protein. The correct disulfide form of the N-term tag was observed together with an unexpected scrambled form. Disulfide scrambling is less likely at pH 6.8 than under the more basic digestion condition of pH 8.0. In addition, samples from various processing stages were compared for the intensity of the correct and scrambled disulfide forms, providing important information for drug development.

Here we used a combination of CID and ETD fragmentation for disulfide bond analysis. ETD provided the disulfide-linked peptides in high intensity, together with the charge-reduced species that retain the linkage. CID generated the corresponding b and y ions from the peptide backbone for confirmation of the peptide. Disulfide bonds, as part of the higher-order structure of a protein, are difficult to characterize because the linked peptides usually generate complicated ions. Our methodology is effective for higher-order structural characterization of protein samples, which is very difficult with conventional techniques. Moreover, the analysis process can be widely applied to characterizing disulfide linkages in other protein-related samples, such as comparisons between biosimilar therapeutics and innovator drugs. The same process of protein analysis, combining gel separation with mass spectrometric analysis coupled to liquid chromatography, is applied to the analysis of clinical samples in Chapter 4.

3.6 References

1. (a) Bali, D.; Goldstein, J.; Banugaria, S.; Dai, J.; Mackey, J.; Rehder, C.; Kishnani, P., Predicting Cross Reactive Immunological Material (CRIM) Status in Pompe Disease Using GAA Mutations: Lessons Learned from 10 Years of Clinical Laboratory Testing Experience. Molecular Genetics and Metabolism 2012, 105 (2), S20-S20; (b) Fukuda, T.; Roberts, A.; Plotz, P. H.; Raben, N., Acid alpha-glucosidase deficiency (Pompe disease). Current Neurology and Neuroscience Reports 2007, 7 (1), 71-77.

2. Maga JA, Z. J., Kambampati R, Peng S, Wang X, Bohnsack RN, Thomm A, Golata S, Tom P, Dahms NM, Byrne BJ, Lebowitz JH., Glycosylation-independent lysosomal targeting of acid α-glucosidase enhances muscle glycogen clearance in Pompe mice. J Biol Chem. 2012.

3. Wang, Y.; Lu, Q. Z.; Wu, S. L.; Karger, B. L.; Hancock, W. S., Characterization and Comparison of Disulfide Linkages and Scrambling Patterns in Therapeutic Monoclonal Antibodies: Using LC-MS with Electron Transfer Dissociation. Analytical Chemistry 2011, 83 (8), 3133-3140.

4. Vanderploeg, A. T.; Kroos, M.; Vandongen, J. M.; Visser, W. J.; Bolhuis, P. A.; Loonen, M. C. B.; Reuser, A. J. J., Breakdown of Lysosomal Glycogen in Cultured Fibroblasts from Glycogenosis Type-Ii Patients after Uptake of Acid Alpha-Glucosidase. Journal of the Neurological Sciences 1987, 79 (3), 327-336.

5. VanHove, J. L. K.; Yang, H. W.; Wu, J. Y.; Brady, R. O.; Chen, Y. T., High-level production of recombinant human lysosomal acid alpha-glucosidase in Chinese hamster ovary cells which targets to heart muscle and corrects glycogen accumulation in fibroblasts from patients with Pompe disease. Proceedings of the National Academy of Sciences of the United States of America 1996, 93 (1), 65-70.

6. Raben, N.; Danon, M. J.; Takikita, S.; Ralston, E.; Plotz, P. H., The role of autophagy in pathogenesis of Pompe disease. Annals of Neurology 2007, 62, S64-S64.

7. (a) Chien, Y. H.; Lee, N. C.; Thurberg, B. L.; Chiang, S. C.; Zhang, X. K.; Keutzer, J.; Huang, A. C.; Wu, M. H.; Huang, P. H.; Tsai, F. J.; Chen, Y. T.; Hwu, W. L., Pompe Disease in Infants: Improving the Prognosis by Newborn Screening and Early Treatment. Pediatrics 2009, 124 (6), E1116-E1125; (b) Kishnani, P. S.; Howell, R. R., Pompe disease in infants and children. Journal of Pediatrics 2004, 144 (5), S35-S43; (c) Kishnani PS, S. R., Bali D, Berger K, Byrne BJ, Case LE, Crowley JF, Downs S, Howell RR, Kravitz RM, Mackey J, Marsden D, Martins AM, Millington DS, Nicolino M, O'Grady G, Patterson MC, Rapoport DM, Slonim A, Spencer CT, Tifft CJ, Watson MS, Pompe disease diagnosis and management guideline. Genetics in Medicine 2006, 8 (5), 267-88.

8. (a) Jonathan LeBowitz, J. M. Methods for treating pompe disease; (b) Mori, J.; Ishihara, Y.; Matsuo, K.; Nakajima, H.; Terada, N.; Kosaka, K.; Kizaki, Z.; Sugimoto, T., Hematopoietic Contribution to Skeletal Muscle Regeneration in Acid alpha-Glucosidase Knockout Mice. Journal of Histochemistry & Cytochemistry 2008, 56 (9), 811-817; (c) Koeberl, D. D.; Luo, X. Y.; Sun, B. D.; McVie-Wylie, A.; Dai, J.; Li, S. T.; Banugaria, S. G.; Chen, Y. T.; Bali, D. S., Enhanced efficacy of enzyme replacement therapy in Pompe disease through mannose-6-phosphate receptor expression in skeletal muscle. Molecular Genetics and Metabolism 2011, 103 (2), 107-112.

9. Castonguay, A. C.; Lasanajak, Y.; Song, X. Z.; Olson, L. J.; Cummings, R. D.; Smith, D. F.; Dahms, N. M., The glycan-binding properties of the cation-independent mannose 6-phosphate receptor are evolutionary conserved in vertebrates. Glycobiology 2012, 22 (7), 983-996.

10. Chen, Y. T.; Amalfitano, A., Towards a molecular therapy for glycogen storage disease type II (Pompe disease). Molecular Medicine Today 2000, 6 (6), 245-251.

11. Wu, S. L.; Jiang, H.; Hancock, W. S.; Karger, B. L., Identification of the unpaired cysteine status and complete mapping of the 17 disulfides of recombinant tissue plasminogen activator using LC-MS with electron transfer dissociation/collision induced dissociation. Anal Chem 2010, 82 (12), 5296-303.

Chapter 4

Proteomic and Genomic Analysis of Gastric Cancer

Patient Tissues

Published papers and contribution:

1. Paik, Y.K., Jeong, S.K., Omenn, G.S., Uhlen, M., Hanash, S., Cho, S.Y., Lee, H.J., Na, K., Choi, E.Y., Yan, F., Zhang, F., Zhang, Y., Snyder, M., Cheng, Y., Chen, R., Marko-Varga, G., Deutsch, E.W., Kim, H., Kwon, J.Y., Aebersold, R., Bairoch, A., Taylor, A.D., Kim, K.Y., Lee, E.Y., Hochstrasser, D., Legrain, P., Hancock, W.S., The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat Biotechnol 2012, 30 (3), 221-3. (Contribute to the world-wide CHPP project with work on chromosome 13 and 17, PTMs, and web of bioinformatics.)

2. Liu, S., Im, H., Bairoch, A., Cristofanilli, M., Chen, R., Deutsch, E., Dalton, S., Fenyo, D., Fanayan, S., Gates, C., Gaudet, P., Hincapie, M., Hanash, S., Kim, H., Jeong, S.K., Lundberg, E., Mias, G., Menon, R., Mu, Z., Nice, E., Paik, Y.K., Uhlen, M., Wells, L., Wu, S.L., Yan, F., Zhang, F., Zhang, Y., Snyder, M., Omenn, G.S., Beavis, R.C., Hancock, W.S., A Chromosome- centric Human Proteome Project (C-HPP) to Characterize the Sets of Proteins Encoded in Chromosome 17. J Proteome Res 2013, 12 (1), 45-57. (Contribute to chromosome 17 parts list, studies of PTMs, ERBB2 amplicon and draft arrangement.)

3. Integrated proteomic and transcriptomic analysis of the human colon carcinoma cell lines LIM1215, LIM1899 and LIM2405. J Proteome Res Just Accepted Manuscript DOI: 10.1021/pr3010869 Publication Date (Web): March 4, 2013. (Contribute to pathway analysis, application of normalization factor and identification of colorectal cancer-associated proteins.)

4. Yan, F., Wu, S.L., Kim, H., Im, H., Paik, Y.K., Hancock, W.S., Proteomic and genomic analysis of gastric cancer patient tissues. (Manuscript in preparation, contribute to all proteomic measurement with protein identification/spectral counts, oncogene interaction analysis, combined analysis with transcriptomics and pathway/sub-pathway analysis.)


4.1 Abstract

Erythroblastic leukemia viral oncogene homolog 2, known as ERBB2, is an important oncogene in the development of cancer.1 It can form heterodimers with other members of the epidermal growth factor (EGF) receptor family and activate kinase-mediated downstream signaling pathways.2 The ERBB2 gene is located on chromosome 17 and is over-expressed in a subset of cancers, such as breast, gastric and colon cancer.3 Of particular interest to the Chromosome-Centric Human Proteome Project (CHPP) initiative is the amplification mechanism, which typically results in the over-expression of a set of genes adjacent to ERBB2, including GRB7, and provides evidence for a linkage between gene location and expression.4 In this chapter, we present a combined study of the ERBB2 positive gastric cancer cell lines KATOIII and SNU16, together with gastric cancer tumor samples and adjacent non-tumor tissue as a control. In addition, a set of non-ERBB2 expressing patient samples was studied as a control. As part of the CHPP initiative, we explored approaches for the effective integration of transcriptomics data, such as RNA-Seq, with proteomic data sets. We detected 196 tumor unique proteins in the ERBB2 patient samples, which had minimal overlap (29 proteins) with the non-ERBB2 tumor samples. We then constructed a list of 21 oncogenes from the RNA-Seq and proteomics data sets and selected ERBB2, EGFR, MYC and GRB2 as genes of high relevance. Interaction and pathway analysis then determined that the extracellular signal regulated kinase (ERK) cascade was of primary importance in the gastric cancer samples.

4.2 Introduction

Gastric cancers can be divided into intestinal type, diffuse type, and mixed type cancers according to the Lauren classification.5 Carcinogenesis mechanisms as well as stromal compositions differ significantly between the two main types, intestinal and diffuse.5-6 In this study all the cases are intestinal type cancer.

ERBB2 (also known as Her-2) is a member of the epidermal growth factor receptor (EGFR) super-family of receptor tyrosine kinases.7 Although only 5% of gastric cancer patients are Her-2 positive, the EGF family is one of the best-studied growth factor receptor systems and is frequently over-expressed in human tumors.8

In this study, we describe the proteomic analysis of two gastric cancer sample sets, an ERBB2 positive set and a non-ERBB2 expressing set. Tissue samples were selected by FISH (fluorescence in situ hybridization) with analysis of anti-ERBB2 reactivity.9 This gives two levels of comparison: tumor versus its control, and ERBB2 positive versus non-ERBB2 expressing samples. We also examined transcriptomic data from two ERBB2 expressing gastric cancer cell lines, SNU16 and KATOIII.10 ERBB2 has been widely studied in invasive breast cancer, where it is predominantly used as a biomarker for diagnosis and investigation.11 The ERBB2 amplicon is also significant in cancer studies, in particular for breast cancer.12 Gastric cancer has been associated with a variety of oncogenes, not limited to ERBB2 or EGFR.13 Researchers have reported that this genomic alteration may be responsible for gastric cancer and could serve as a potential treatment target.10b Here we expanded the set of driver oncogenes of interest beyond ERBB2 in order to study gastric cancer in more depth.

4.3 Experimental Section

4.3.1 Reagents and materials

Trypsin (sequencing grade) was purchased from Promega (Madison, WI). Guanidine hydrochloride, formic acid (FA), ammonium bicarbonate, dithiothreitol (DTT), iodoacetamide (IAA), tris, urea, thiourea and CHAPS were purchased from Sigma-Aldrich (St. Louis, MO). cOmplete, EDTA-free protease inhibitor cocktail tablets were obtained from Roche (Indianapolis, IN). LC-MS grade water was purchased from J.T. Baker (Phillipsburg, NJ). HPLC grade acetonitrile was purchased from Thermo Fisher Scientific (Fair Lawn, NJ). Novex® Sharp unstained protein standard and NuPAGE® 4-12% Bis-Tris gels were from Invitrogen (Carlsbad, CA).

4.3.2 FISH and antiERBB2 reactivity

Anti-ERBB2 reactivity analyzed by FISH (fluorescence in situ hybridization) was used to select the ERBB2 positive and the ERBB2 negative (non-ERBB2 expressing) gastric cancer samples. A total of 150 samples were examined; 8 samples were selected as the ERBB2 positive sample set, and 9 were selected as the non-ERBB2 expressing sample set. After selection, the ERBB2 positive tumor tissues and the control tissues were used for further analysis (control tissues are healthy tissues surrounding the cancer tissues).

4.3.3 Protein extraction, separation, fractionation, and in-gel digestion

Proteins were extracted from each tumor and control tissue sample as follows. First, approximately 50 mg of frozen tissue was sonicated in 250 μL of lysis buffer (30 mM tris, 7 M urea, 2 M thiourea, 65 mM DTT, 4% CHAPS, pH 8.8) containing protease inhibitor. Sonication was repeated seven to ten times. Solid parts of the tissue that could not be lysed were discarded during sonication. Each sample was then centrifuged at 14,000 rpm for 30 min at 4 °C, and the supernatant was carefully recovered.

The extracted proteins were separated by SDS-PAGE for proteomic analysis. Around 40 μg of protein was loaded in each gel lane for the control or tumor sample. Each gel lane of separated proteins was cut into five equal portions, and a series of sample preparation steps was carried out prior to LC-MS/MS analysis. Briefly, after removal of the Coomassie stain, the gel bands were sliced and reduced with 10 mM DTT in 0.1 M ammonium bicarbonate for 30 min at 56 °C, followed by alkylation with 55 mM IAA for 1 h in the dark at room temperature. For tryptic digestion, the digestion buffer was prepared as 12.5 ng/µL trypsin in 50 mM ammonium bicarbonate. Gel slices were incubated in the trypsin-containing digestion buffer at 4 °C for 35 min. The digestion buffer was then replaced by buffer without trypsin, followed by incubation at 37 °C for 14 h before extraction. After digestion, the peptides were extracted from the gel slices with 25 mM ammonium bicarbonate and acetonitrile. Further extraction was done with 5% formic acid (37 °C, 5 min vortex), and the supernatants were pooled. The digested peptide solution was concentrated into pellets for storage. The pellets were then reconstituted in 10 µL of 0.1% formic acid in water. Each control and each disease sample therefore has five portions of peptide solution for LC-MS/MS analysis.

4.3.4 LC-MS analysis by LTQ-Orbitrap

The analysis was conducted on an Ultimate 3000 nano-LC (Dionex, Mountain View, CA) coupled on-line with an LTQ-Orbitrap XL mass spectrometer (Thermo Fisher Scientific, San Jose, CA). A self-packed C18 column (Magic, 200 Å pore, 5 µm particle size, 18 cm length, 75 µm i.d.) was used. A nano-spray ion source (New Objective, Woburn, MA) was used to couple the column online to the mass spectrometer. Of the 10 µL of reconstituted peptide solution, 5 µL was injected for each run. Consequently, each control or disease sample gave 5 runs, one for each of the five gel portions prepared by in-gel tryptic digestion. Mobile phase A was 0.1% formic acid in water, and mobile phase B was 0.1% formic acid in acetonitrile. The gradient started at 2% mobile phase B for sample loading. At 2 min, mobile phase B increased to 5%. A linear gradient was then applied from 5% to 50% B from 5 to 85 min. Over the next 3 min, mobile phase B was raised from 50% to 90%, followed by an isocratic hold at 90% B for 5 min. Mobile phase B then decreased back to 2% in 1 min and was maintained for another minute. The flow rate was maintained at 200 nL/min.
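The gradient program above can be written as a table of breakpoints and interpolated to give the mobile phase composition at any time. The sketch below is only one reading of the description: it assumes that B ramps from 2% to 5% between 2 and 5 min and that the later steps follow directly after 85 min. These breakpoints are assumptions, not instrument method values.

```python
from bisect import bisect_right

# (time in min, %B) breakpoints. The 2-5 min ramp and the exact end times are
# an assumed reading of the text, not values exported from the instrument.
GRADIENT = [
    (0.0, 2.0),    # sample loading at 2% B
    (2.0, 2.0),
    (5.0, 5.0),    # assumed ramp to 5% B before the main gradient
    (85.0, 50.0),  # linear 5% -> 50% B
    (88.0, 90.0),  # 3 min ramp to 90% B
    (93.0, 90.0),  # 5 min isocratic hold
    (94.0, 2.0),   # return to 2% B in 1 min
    (95.0, 2.0),   # held for another minute
]


def percent_b(t_min: float, program=GRADIENT) -> float:
    """Linearly interpolate %B at time t_min; clamp outside the program."""
    times = [t for t, _ in program]
    if t_min <= times[0]:
        return program[0][1]
    if t_min >= times[-1]:
        return program[-1][1]
    i = bisect_right(times, t_min)
    (t0, b0), (t1, b1) = program[i - 1], program[i]
    return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


if __name__ == "__main__":
    for t in (0, 5, 45, 85, 90, 95):
        print(f"{t:3d} min: {percent_b(t):.1f}% B")
```

Keeping the program as data makes it easy to change a breakpoint and re-check the composition at a given retention time.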

The LTQ-Orbitrap was used for LC-MS/MS analysis. Survey full-scan MS spectra were acquired from m/z 400 to 2000. Total acquisition time was 90 min. The mass spectrometric method was the same as that used in Chapter 2. Briefly, there were 9 scan events in a single segment. Scan event 1 was the precursor ion (MS) scan, acquired in the FTMS analyzer at a resolution of 30,000. It was followed by eight sequential data-dependent MS2 scans acquired in the linear ion trap (LTQ), with CID (collision induced dissociation) as the fragmentation type.

4.3.5 Protein identification

Data from the LTQ-Orbitrap mass spectrometer were first analyzed and filtered with Proteome Discoverer software (Thermo Fisher Scientific). The five portions from each gel lane were searched together as one workflow. Therefore, the ten control and ten disease samples each have one workflow search result. The search assigns the most probable peptide sequences to the fragmentation spectra using the Sequest algorithm. The CID-MS2 spectra were searched against the theoretical fragmentation of the proteins under study with the following parameters. Precursor mass tolerance was set to 20 ppm and fragment mass tolerance to 0.8 Da as the initial filter for enzymatic peptides. A strict FDR of 0.01 was applied, giving more confident peptide matching than the relaxed target FDR. Theoretical b and y ions were matched against the CID-MS2 spectra. Peptide fragments consistent with the specificity of the digestion enzyme were considered for assignment, using the SP. Human_56.5 database for matching. The maximum number of missed tryptic cleavage sites was set to two. Carbamidomethylation of Cys, with a mass addition of 57.021 Da, was applied as a static modification. RNA-Seq data were contributed by collaborators in triplicate.

Proteomics experiments were performed once because access to patient tissues was limited.
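To make the search parameters concrete, the sketch below generates theoretical singly charged b and y ions for a linear peptide with carbamidomethylated Cys and applies the two tolerance filters named above (20 ppm for the precursor, 0.8 Da for fragments). It is an illustration of the tolerances only and is not the Sequest scoring algorithm; the residue masses are standard monoisotopic values, and all function names are hypothetical.

```python
# Standard monoisotopic residue masses (Da) for the residues used here.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "D": 115.02694, "E": 129.04259, "K": 128.09496,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333,
}
WATER = 18.010565
PROTON = 1.007276
CARBAMIDOMETHYL = 57.021464  # static modification on Cys


def residue_mass(aa: str) -> float:
    """Residue mass, with the static carbamidomethyl modification on Cys."""
    return RESIDUE[aa] + (CARBAMIDOMETHYL if aa == "C" else 0.0)


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of a linear peptide."""
    return sum(residue_mass(aa) for aa in peptide) + WATER


def by_ions(peptide: str) -> dict:
    """Singly charged b and y fragment m/z values for a linear peptide."""
    masses = [residue_mass(aa) for aa in peptide]
    ions = {}
    for i in range(1, len(peptide)):
        ions[f"b{i}"] = sum(masses[:i]) + PROTON
        ions[f"y{i}"] = sum(masses[-i:]) + WATER + PROTON
    return ions


def match_fragments(observed_mz, theoretical, tol_da=0.8):
    """Return {ion: observed m/z} for the closest peak within tol_da."""
    hits = {}
    for name, mz in theoretical.items():
        best = min(observed_mz, key=lambda o: abs(o - mz), default=None)
        if best is not None and abs(best - mz) <= tol_da:
            hits[name] = best
    return hits


def precursor_ok(observed: float, theoretical: float, tol_ppm=20.0) -> bool:
    """True if the precursor mass error is within tol_ppm."""
    return abs(observed - theoretical) / theoretical * 1e6 <= tol_ppm


if __name__ == "__main__":
    peptide = "GIVEECCFR"  # tryptic peptide A(35-43) from Chapter 3
    print(f"M = {peptide_mass(peptide):.4f}")
    for name, mz in sorted(by_ions(peptide).items()):
        print(f"{name}: {mz:.3f}")
```

For the fully alkylated A(35-43) peptide, the computed neutral mass agrees with the value listed in Table 3-2a (1168.5005), which gives a quick sanity check on the residue table.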

4.4 Results and Discussion

4.4.1 Anti-ERBB2 reactivity for sample selection

In this study we examine the proteomics of a subset of gastric cancer patients who express significant levels of ERBB2. As stated in the introduction, Her-2 positivity is present in only 5% of GI cancer patients. Individual patients vary in cancer percentage, although all are at or above 50% (range 50% to 99%). This variation may account for the lack of detection of Her-2 by proteomics in some Her-2 positive cases even though their immuno-affinity and FISH assays were positive. There is also tissue area variation, because the extracted tissues came from different areas along the GI tract. In total, 150 samples were examined by FISH for anti-ERBB2 reactivity and 8 samples were selected as the study set, together with adjacent non-tumor tissue as a control.

Additionally, a second set of gastric cancer samples and controls was collected in the same manner but selected on the basis of the absence of ERBB2 expression. Supplementary Table 4-S1 lists some of the clinical parameters of the two study groups, which are heterogeneous except for the expression of the oncogene.

4.4.2 Proteomics analysis of gastric cancer patient samples

For proteomic analysis, 8 gastric cancer samples, each with its control, were selected because of their over-expression of ERBB2. Similarly, 9 samples with controls were selected as non-ERBB2 expressing (ERBB2 negative). Proteins were extracted from the selected tissue samples and separated by SDS-PAGE. Supplementary Figure 4-S1 illustrates the SDS-PAGE separation of proteins extracted from one gastric cancer tumor sample and its control.

After gel analysis, each gel lane was cut into five fractions, digested with trypsin, and analyzed by LC-MS/MS. Here, we required high protein confidence and high peptide rank, with an FDR of less than 1%, for protein identification from replicate analysis.

4.4.3 ERBB2 Characterization

In our study, it was of interest that ERBB2 and EGFR expression was observed only in the tumor samples of the ERBB2 positive gastric cancer set (not in the control set), as well as in the two gastric cancer cell lines, SNU16 and KATOIII. As shown in Supplementary Table 4-S2, ERBB2 was identified with 7 peptides in the ERBB2 positive gastric cancer tumor samples. In contrast, no ERBB2 peptides were identified in the control set. Therefore, ERBB2 is unique to the tumor samples. The quality of the peptides observed for this protein was assessed by comparing their ranking in our study with the ranking of the corresponding peptides in GPMDB.14 Supplementary Figure 4-S2 shows MS/MS spectra of a diagnostic peptide for ERBB2 and for EGFR.

4.4.4 Strategy for ERBB2-driven gastric cancer analysis

The strategy used for the analysis of gastric cancer is shown in Figure 4-1. In-depth analysis of the gastric cancer patient tissue samples provided around 3000 protein IDs. This reflects the protein identification obtained from a general proteomics workflow, including protein extraction from the selected tissue samples, SDS-PAGE for protein separation and protein identification by LC-MS/MS. Among these, 196 and 159 are tumor unique proteins in the ERBB2 positive (ERBB2+) and ERBB2 negative (ERBB2-) sets, respectively. 48 tumor unique proteins interact with the selected 21 driver oncogenes. Transcriptomic data were used in combination with the proteomics data. Further pathway analysis can give details about the differences between the ERBB2+ and ERBB2- sample sets.
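The comparisons in this strategy are set operations on lists of protein identifications. The sketch below shows the idea with toy inputs; the identifiers, the pooling of identifications across patients, and the function names are all illustrative assumptions and do not reproduce the actual data sets.

```python
def tumor_unique(tumor_ids, control_ids):
    """Proteins identified in tumor tissue but not in the matched control."""
    return set(tumor_ids) - set(control_ids)


def compare_sets(erbb2_pos_unique, erbb2_neg_unique, oncogene_partners):
    """Summarize overlap between the two tumor-unique lists.

    oncogene_partners is a collection of proteins reported to interact with
    the selected driver oncogenes (for example from an interaction database).
    """
    pos, neg = set(erbb2_pos_unique), set(erbb2_neg_unique)
    return {
        "shared": pos & neg,
        "erbb2_pos_only": pos - neg,
        "oncogene_interacting": pos & set(oncogene_partners),
    }


if __name__ == "__main__":
    # Toy identifiers only; PROT_A to PROT_C are hypothetical placeholders.
    pos_unique = tumor_unique(
        ["ERBB2", "EGFR", "GAPDH", "KRT1", "PROT_A"], ["GAPDH", "KRT1"]
    )
    neg_unique = tumor_unique(["GAPDH", "PROT_A", "PROT_B"], ["GAPDH"])
    summary = compare_sets(pos_unique, neg_unique, ["EGFR", "PROT_C"])
    for key, value in summary.items():
        print(key, sorted(value))
```

The sizes of such sets correspond to the kinds of counts quoted in the text (tumor unique proteins per sample set, their overlap, and the oncogene-interacting subset).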


[Figure 4-1 schematic. Gastric cancer tumor proteomics (mass spectrometry data: 1D-SDS-PAGE, HPLC, LTQ-FTMS, Orbitrap; ~3000 protein IDs; ERBB2+, 8 cases and 8 controls; ERBB2-, 9 cases and 9 controls; controls are adjacent non-tumor tissue) is combined with an RNA-Seq (transcriptome) data set from the cell lines SNU16 and KATOIII. Tumor specific proteins: 196 (ERBB2+) and 159 (ERBB2-). 21 oncogenes, including ERBB2, EGFR, GRB2 and MYC. 48 oncogene interacting proteins, identified using public databases (i.e. STRING or I2D, Cytoscape). Pathway analysis (GeneGo, KEGG): development EGFR signaling pathway; development ERBB signaling; development EGFR signaling via small GTPases; cytoskeleton remodeling integrin outside-in signaling.]

Figure 4-1. Strategy for analysis of ERBB2-driven gastric cancer. (red star: ERBB2 expressing sets)

4.4.5 Combination of proteomic analysis with transcriptomics

To support the proteomic study we measured the transcriptome of two gastric cell lines by RNA-Seq as described by Snyder et al.15 The analysis corresponded to deep reads, with approximately 11,000 transcripts measured in each analysis. While the proteomic study was of lesser depth (approximately 3,000 proteins identified), we and others have shown that proteomics can aid the identification of significant pathways and expression events.

In Figure 4-2 we plot the number of proteins identified on each chromosome in the ERBB2 positive proteomic study, together with the percentage of identified proteins relative to the number of protein-coding genes on that chromosome. We hypothesize that the higher percentage on some chromosomes results from a combination of factors, such as the constitutive expression of housekeeping proteins and gene amplification related to oncogene expression. The former is illustrated in Supplementary Table 4-S3, for example by the expression of the genes GAPDH, TUBA1A, TUBA1B, TUBA1C and KRT1 on chromosome 12, and the latter by the copy number variation (CNV) of a gastric tumor sample expressing a high level of ERBB2 on chromosome 17 (see Supplementary Figure 4-S3). Protein names were taken from Proteome Discoverer, which was used for protein identification from the raw data. The genomic information for each gene listed in Supplementary Table 4-S3 was obtained with the GeneALaCart tool of GeneCards.2a The information included gene symbol, genomic location (chromosome number, base number of gene start and end, and gene size), and Ensembl cytoband.16 In addition, the spectral counts for identification as well as the RPKM values from RNA-Seq for the two cell lines were included. It has been noted that cancer, as an evolutionary process, exploits regions of high transcription activity.17
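The RPKM values referred to above follow the standard definition: reads mapped to a gene, normalized per kilobase of gene length and per million mapped reads. The following is a minimal sketch of that formula only; the function name and the example numbers are illustrative assumptions and are not taken from the RNA-Seq pipeline used in this work.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene length per million mapped reads.

    RPKM = 1e9 * reads mapped to the gene / (gene length in bp * total mapped reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: 100 reads on a 2,000 bp gene in a library of 10 million mapped reads.
example_value = rpkm(100, 2000, 10_000_000)
```

Because the value is normalized for both gene length and sequencing depth, RPKM values for different genes and for the two cell lines can be placed side by side with the spectral counts.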


Figure 4-2. Plot of number of proteins identified in proteomic study for illustration of chromosome activity in gastric cancer. (vertical bars: number of proteins on each chromosome; lines with squares: ratio of proteins identified on each chromosome to total number of proteins for each chromosome)
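The ratio plotted as lines with squares in Figure 4-2 can be sketched as a simple per-chromosome calculation. This is an illustrative sketch only: the function name is hypothetical and the counts below are placeholders, not values from this study.

```python
def chromosome_coverage(identified: dict, coding_genes: dict) -> dict:
    """Map chromosome -> percentage of protein-coding genes with an identified protein."""
    return {
        chrom: 100.0 * identified.get(chrom, 0) / total
        for chrom, total in coding_genes.items()
        if total > 0
    }


# Placeholder counts for illustration only (not data from this work).
identified = {"12": 120, "17": 150}
coding_genes = {"12": 1000, "17": 1200, "Y": 50}

coverage = chromosome_coverage(identified, coding_genes)
```

Chromosomes with no identified proteins are reported as 0% rather than omitted, so every chromosome keeps a point on the line plot.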

4.4.6 Tumor vs. control samples comparison

In the proteomic analysis, around 3000 proteins in total were identified from the gastric cancer tissue samples. For the ERBB2+ sample set, there are 196 tumor unique proteins relative to the adjacent normal tissues used as control. These proteins, with their chromosome location, base start and end, band location, spectral counts and RNA-Seq values in the two gastric cancer cell lines, are shown in Supplementary Table 4-S4a. The top 20 (by spectral counts) of these 196 tumor unique proteins are listed in Table 4-1. Genes encoding gastric cancer associated proteins, cancer associated proteins and structural proteins are indicated by boxes of different colors in the table. From this table, we can see that most of them are tumor related and some of them are gastric cancer associated. ERBB2 was identified with high ranking.

Table 4-1. Top 20 Proteins in the 196 tumor unique proteins from ERBB2+ set. (green box: gastric cancer associated proteins, yellow box: cancer associated proteins, red box: structural proteins, +: RPKM value of gastric cancer cell lines higher than 1, ++: RPKM value higher than 10)

The same comparison was done for the ERBB2- sample set. The 159 tumor unique proteins are listed in Supplementary Table 4-S4b. We concentrated on proteins that were expressed only in the tumor sample relative to the control. To reduce individual variability caused by factors such as tumor grade, tumor location and other patient variables, we pooled observations for the ERBB2 positive and negative groups separately, and likewise pooled the corresponding control samples into separate groups. The number of proteins unique to ERBB2 positive and negative tumors was 196 and 159 respectively, while only 29 were common to these two data sets. The corresponding Venn diagram is shown in Figure 4-3.

[Figure 4-3 Venn diagrams. ERBB2+ gastric tumor samples vs. control (adjacent normal tissue): 2282 shared, 89 control only, 196 tumor only. ERBB2- gastric tumor samples vs. control: 1229 shared, 124 control only, 159 tumor only. Gastric cancer unique proteins (tumor only), ERBB2+ vs. ERBB2-: 29 shared, 167 ERBB2+ only, 130 ERBB2- only.]

Figure 4-3. Diagram for comparison of protein IDs in different sample sets.
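The comparison summarized in Figure 4-3 amounts to set operations on pooled protein identifiers. The sketch below is illustrative: the function names and the toy identifiers (P1, P2, ...) are hypothetical, and only the counts 196, 159 and 29 are taken from the text.

```python
def tumor_unique(tumor_ids, control_ids):
    """Proteins identified in the pooled tumor set but not in the pooled control set."""
    return set(tumor_ids) - set(control_ids)


def venn_counts(a, b):
    """Return (only in a, shared, only in b) for two collections of identifiers."""
    a, b = set(a), set(b)
    return len(a - b), len(a & b), len(b - a)


# Toy identifiers for illustration only.
pos_unique = tumor_unique({"P1", "P2", "P3", "P4"}, {"P4"})  # ERBB2+ tumor vs. control
neg_unique = tumor_unique({"P3", "P5", "P4"}, {"P4"})        # ERBB2- tumor vs. control
toy_counts = venn_counts(pos_unique, neg_unique)

# Arithmetic check of the reported counts: 196 and 159 tumor unique proteins, 29 shared.
pos_only = 196 - 29
neg_only = 159 - 29
```

With the reported numbers, the two differences give 167 proteins unique to the ERBB2+ tumors and 130 unique to the ERBB2- tumors.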

4.4.7 ERBB2 amplicon

This study has illustrated one important part of our recently outlined CHPP strategy, namely the goal of effectively integrating transcriptomic and proteomic data.4 Another aspect of this informatics approach is that proteomic researchers will better understand the genomic context of their observations. Thus, in addition to the study of specific ERBB2 pathways as potential cancer markers, we also investigated the ERBB2 amplicon, which had previously been identified by genomic tools in breast cancer.12, 18 In Supplementary Figure 4-S4 we plotted the protein identifications in gastric cancer for each of the bands on chromosome 17, which is the focus of a US-led team.4 The ERBB2 amplicon can span 3 bands: q12, q21.2 and q21.3. The numbers of intervening protein-coding genes as well as the base numbers in between are also shown.

This figure clearly shows that 3 successive genes, C17orf37, ERBB2 and GRB7, were identified only in the gastric cancer tumor samples.19 In Supplementary Figure 4-S5, we show the heat map of the tumor unique proteins that were identified in this ERBB2 amplicon region (three bands on chromosome 17), with the gene names and the spectral counts in the 8 tumor samples (T1 to T8).

In Figure 4-4 we have plotted the proteomic values for the 17q12 region of chromosome 17, the region containing ERBB2, as a heat map for ERBB2 and non-ERBB2 gastric tumor samples. We have also captured the corresponding transcriptomic values as measured in the ERBB2 expressing cell lines KATOIII and SNU16. The comparison between the ERBB2+ and ERBB2- sample sets shows significant differences in protein expression across the ERBB2 amplicon. The plot demonstrates that in gastric cancer there is a group of genes which are co-amplified with the oncogene ERBB2 by a mechanism such as chromosomal breakage and resynthesis.20 Similar amplicons have been reported for EGFR and Myc (c-Myc) and represent a mixture of passenger genes as well as genes which are functionally linked to the oncogenic process.21 In the case of ERBB2, the adjacent genes GRB7 (nuclear transport of ERBB2),22 PGAP3 (synthesis of the lipid raft involved in the hetero-dimerization of ERBB2 and EGFR)23 and TOP2A (DNA replication and synthesis)24 have been functionally linked. This observation is consistent with the suggestion that genes with related functions may be located in the same genomic region and exhibit coordinated expression in certain situations.25 Other genes such as PPP1R1B and GSDMA may be passenger genes but may have some diagnostic potential. In conclusion, the combination of genomic/transcriptomic data can illuminate proteomic results.

Figure 4-4. Illustration of 17q12 region on chromosome 17, a region with oncogene ERBB2.


4.4.8 Pathway analysis based on oncogene interactions

Our study focused on the role of driver oncogenes in the gastric tumor samples, and for this purpose we combined the proteomics data from the two tumor sample sets with RNA-Seq data for the two cell lines SNU16 and KATOIII to construct Table 4-2. ERBB2 and EGFR were observed with significant transcript values in both cell lines (RPKM>4), which establishes the relevance of these two cell lines for this study. For SNU16, MYC, ERBB2, EGFR, ERBB3, TP53, TOP2A, and GRB2 have RPKM values higher than 10. For the KATOIII cell line, MYC, TOP2A, and PPP1R1B have RPKM values higher than 10.

Table 4-2. Proteomic observance of selected major oncogenes in gastric cancer sets as well as RPKM value of RNA-Seq from gastric cancer cell lines.

Gene      Spectral counts in      Spectral counts in non-    RPKM value of two
symbol    ERBB2 expressing set    ERBB2 expressing set       gastric cancer cell lines
          tumor     control       tumor     control          SNU16     KATOIII
ERBB3     0         0             0         0                17.03     3.56
EGFR      6         0             0         0                11.34     4.04
GRB7      16        1             0         0                2.99      1.84
GRB2      11        0             1         5                10.25     8.57
ERBB2IP   0         0             0         0                3.83      ND
KRAS      1         0             4         1                0.82      2.1
PIK3R1    0         0             0         0                1.2       2.8
MYC       0         0             0         0                141.3     11.62
PIK3R2    0         0             0         0                8.35      0.97
JAK1      0         0             0         0                7.97      5.08
PIK3R3    0         0             0         0                1.71      ND
SRC       10        1             3         1                7.12      4.75
STAT1     33        7             19        0                4.02      6.6
CRKL      0         1             0         0                7.95      4.08
TP53      0         0             0         0                21.06     ND
TOP2A     11        0             0         0                17.02     15.98
PPP1R1B   9         0             0         0                0.55      21.14
ERBB2     38        0             0         0                12.63     5.85
BRCA1     0         0             0         0                4.15      3.49
ACOT8     0         4             0         0                3.08      2.69
RB1       0         0             0         0                1.37      3.80
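The selections made from Table 4-2 (genes with RPKM higher than 10 in a cell line, and genes with spectral counts in the ERBB2 expressing tumors but none in the matched controls) can be reproduced with a short filter. This is a sketch under stated assumptions: the values are transcribed from Table 4-2, ND is encoded as None, and the function names are illustrative rather than part of the analysis software used in this work.

```python
# (gene, ERBB2+ tumor, ERBB2+ control, ERBB2- tumor, ERBB2- control, SNU16 RPKM, KATOIII RPKM)
# Values transcribed from Table 4-2; None stands for ND (not detected).
TABLE_4_2 = [
    ("ERBB3", 0, 0, 0, 0, 17.03, 3.56),
    ("EGFR", 6, 0, 0, 0, 11.34, 4.04),
    ("GRB7", 16, 1, 0, 0, 2.99, 1.84),
    ("GRB2", 11, 0, 1, 5, 10.25, 8.57),
    ("ERBB2IP", 0, 0, 0, 0, 3.83, None),
    ("KRAS", 1, 0, 4, 1, 0.82, 2.1),
    ("PIK3R1", 0, 0, 0, 0, 1.2, 2.8),
    ("MYC", 0, 0, 0, 0, 141.3, 11.62),
    ("PIK3R2", 0, 0, 0, 0, 8.35, 0.97),
    ("JAK1", 0, 0, 0, 0, 7.97, 5.08),
    ("PIK3R3", 0, 0, 0, 0, 1.71, None),
    ("SRC", 10, 1, 3, 1, 7.12, 4.75),
    ("STAT1", 33, 7, 19, 0, 4.02, 6.6),
    ("CRKL", 0, 1, 0, 0, 7.95, 4.08),
    ("TP53", 0, 0, 0, 0, 21.06, None),
    ("TOP2A", 11, 0, 0, 0, 17.02, 15.98),
    ("PPP1R1B", 9, 0, 0, 0, 0.55, 21.14),
    ("ERBB2", 38, 0, 0, 0, 12.63, 5.85),
    ("BRCA1", 0, 0, 0, 0, 4.15, 3.49),
    ("ACOT8", 0, 4, 0, 0, 3.08, 2.69),
    ("RB1", 0, 0, 0, 0, 1.37, 3.80),
]


def high_rpkm(rows, column, threshold=10.0):
    """Genes whose RPKM in the given column exceeds the threshold (ND is skipped)."""
    return [r[0] for r in rows if r[column] is not None and r[column] > threshold]


def tumor_specific(rows, tumor_col=1, control_col=2):
    """Genes with spectral counts in tumor and zero counts in the matched control."""
    return [r[0] for r in rows if r[tumor_col] > 0 and r[control_col] == 0]


snu16_high = high_rpkm(TABLE_4_2, column=5)
katoiii_high = high_rpkm(TABLE_4_2, column=6)
erbb2_pos_specific = tumor_specific(TABLE_4_2)
```

Keeping ND as None rather than 0 avoids treating a missing transcript measurement as a measured zero.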


In this table we listed the major oncogenes that were observed in the gastric cancer samples or cell lines by either proteomic or transcriptomic measurements. The value of the proteomic study for identifying major phenotypes is shown by the high level of ERBB2 in the proteomic measurement (spectral count 38 vs. 0 in tumor vs. control, and 0 for the non-ERBB2 tumor samples). Other significant proteomic values with ERBB2 tumor specificity were GRB7 and TOP2A, which were discussed above as part of the ERBB2 amplicon (Section 4.4.7). EGFR, the heterodimeric partner of ERBB2 in the ERBB family, is also observed in the patient samples and cell lines, with specificity for the ERBB2 tumor samples. Supplementary Figure 4-S6 shows the genes identified around ERBB2 on chromosome 17. The value of the transcriptomic information is shown by the observation of high levels of MYC (RPKM values of 141 and 12) in the two cell lines. The absence of MYC in the proteomic data could be either technology or sample related but, as will be discussed later, it was helpful to include MYC in subsequent interaction studies. In addition, MYC is a well documented partner for the ERBB2 and EGFR signaling pathways.26

Our hypothesis in this study is that ERBB2 driven cancers have a common set of pathways and that a more detailed knowledge of individual pathways would aid in the management of this disease. Since cancer is the result of complex perturbations between different oncogene-driven pathways, we examined the interaction profile of the tumor specific proteins in the ERBB2 expressing cancers. The 51 out of 199 tumor unique proteins that have interactions with the oncogenes listed in Table 4-2 are listed in Supplementary Table 4-S5. Information regarding interactions was obtained from STRING and I2D through GeneCards.2a, 27

In Figure 4-5 we have mapped the oncogenes that interact with ERBB2, together with their transcriptomic and proteomic expression in gastric cancer. For each oncogene, the number of interacting tumor unique proteins in the ERBB2+ set was compared with the number in the ERBB2- set. We decided to focus on three interaction sets, namely ERBB2 (8 interactors) and EGFR (12 interactors), which are the focus of this study, and MYC (37 interactors), which had by far the largest set.

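[Figure 4-5 network diagram. Oncogene nodes: PIK3R3, JAK1, ERBB3, MYC, EGFR, PIK3R1, KRAS, ERBB2, PIK3R2, GRB2, ERBB2IP, GRB7, SRC and STAT1. Each node is labeled with the number of interacting tumor unique proteins in the ERBB2+ vs. ERBB2- sets, e.g. MYC (37 vs. 25), EGFR (12 vs. 9), PIK3R1 (9 vs. 6), ERBB2 (8 vs. 2), STAT1 (6 vs. 4) and KRAS (0 vs. 1).]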

Figure 4-5. ERBB2 interacted oncogene expression and transcriptomic proteomic expression in ERBB2+ vs. ERBB2- gastric cancer. (line length: interaction score; circle size: spectral counts, larger size, higher counts; black circle: RPKM value of RNA-Seq higher than 10; number in brackets: number of tumor unique proteins in ERBB2+ vs. ERBB2- gastric cancer data sets that have interaction with this oncogene)
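The bracketed numbers in Figure 4-5 are counts of tumor unique proteins that have a recorded interaction with each oncogene. A minimal sketch of that counting step is given below; the function name, the pair-list input format and the toy identifiers are assumptions for illustration and do not reproduce the STRING or I2D export formats.

```python
from collections import defaultdict


def interactor_counts(interactions, tumor_unique_ids):
    """Count distinct tumor unique proteins interacting with each oncogene.

    interactions: iterable of (oncogene, partner protein) pairs, e.g. collected
    from a public interaction database (input format assumed here).
    """
    partners = defaultdict(set)
    for oncogene, protein in interactions:
        if protein in tumor_unique_ids:
            partners[oncogene].add(protein)
    return {oncogene: len(found) for oncogene, found in partners.items()}


# Toy data for illustration only.
interactions = [
    ("ERBB2", "P1"), ("ERBB2", "P2"),
    ("MYC", "P2"), ("MYC", "P3"), ("MYC", "P9"),
]
pos_counts = interactor_counts(interactions, {"P1", "P2", "P3"})  # ERBB2+ tumor unique
neg_counts = interactor_counts(interactions, {"P2", "P9"})        # ERBB2- tumor unique
```

Running the same function on the ERBB2+ and ERBB2- tumor unique lists gives the paired "x vs. y" values for each oncogene.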

Based on the oncogene list, four pathways from the GeneGo database28 were examined for the ERBB2+ gastric cancer set for their fit with the proteomic and transcriptomic data: ERBB family signaling, development EGFR signaling via small GTPases, development EGFR signaling, and cytoskeleton remodeling integrin outside-in signaling. These are illustrated in Supplementary Figure 4-S7 (a-d). In addition to the proteomic and transcriptomic data, oncogene interaction information (ERBB2, EGFR and MYC, plus GRB2 and GRB7 if present in the pathway) was also included in the figure. Encoded proteins that were up-regulated in the disease set are indicated with a red star. In the selected pathways a high percentage of genes were observed, with the exception of NEUR, which are expressed in specialized tissues unrelated to gastric cancer.

Proteomic analysis of the non-ERBB2 sample set gave measurable levels with tumor specificity only for STAT1, KRAS and SRC. In contrast, GRB2, which was present only in the tumor samples of the ERBB2+ set (19 in tumor vs. 0 in control), was found predominantly in the control samples of the non-ERBB2 sample set (1 in tumor vs. 5 in control).

4.4.9 Sub-pathway analysis of ERBB2+ vs. ERBB2- sample sets

After investigating the pathways for the two sample sets, we analyzed the sub-pathways of each in depth. Figure 4-6a shows the development ERBB signaling pathway from KEGG29 with the transcriptomic and proteomic expression of the gastric cancer sample sets. This pathway is over-expressed in the ERBB2+ set compared to the ERBB2- set, especially for the chromosome 17 region with ERBB2 expression. Figure 4-6b shows a modified and trimmed ERBB signaling pathway from KEGG, illustrating the differences between ERBB2+ and ERBB2- expression. The MAPK signaling pathway is over-expressed in the ERBB2+ set, indicating that it may be a sub-pathway for ERBB2+ samples.


Figure 4-6a. Development ERBB signaling pathway from KEGG. (red circle: RNA-Seq detected, : no RNA-Seq detected or smaller than 0.3, blue star: proteomics observed, yellow box: tumor unique)


Figure 4-6b. Modified development ERBB signaling pathway (ERBB2+ vs. ERBB2– patient samples) from KEGG. (all genes illustrated have RPKM value of RNA-Seq equal to or higher than 0.3, green star: up-regulated in ERBB2+ set)

For comparison, a pathway highly expressed in the ERBB2- set, regulation of actin cytoskeleton from KEGG, is shown in Figure 4-7. Genes that are up-regulated in the ERBB2- patient set are marked with a red star. This indicates a possible sub-pathway, namely actin polymerization and actomyosin assembly contraction.


Figure 4-7. Regulation of actin cytoskeleton pathway from KEGG for ERBB2- set. (red star: up-regulated in ERBB2- set).

4.5 Conclusion

In this work, two sets of samples, ERBB2 positive (ERBB2+) and non-ERBB2 expressing (ERBB2-), were analyzed. FISH experiments for anti-ERBB2 reactivity were applied for selection of ERBB2 expressing and non-ERBB2 expressing patient tissue samples. Gel loading of extracted protein samples, protein digestion and LC-MS/MS analysis by LTQ-Orbitrap provided in-depth analysis of the patient samples. The tumor and the control (adjacent normal tissue) samples were also compared. Two cell lines, SNU16 and KATOIII, were used to provide RPKM values from RNA-Seq data for integration with the proteomics data for in-depth analysis of the clinical samples. As part of the CHPP initiative, oncogene interaction and pathway analysis based on the proteomics and transcriptomics data then identified the pathways and associated sub-pathways of primary importance in the ERBB2 positive and ERBB2 negative gastric cancer samples, respectively.


4.6 References

1. Yu, D. H.; Hung, M. C., Overexpression of ErbB2 in cancer and ErbB2-targeting strategies. Oncogene 2000, 19 (53), 6115-6121.
2. (a) http://genecards.org/. Database, T. G. H. G., Ed. Weizmann Institute of Science; (b) http://www.uniprot.org/. UniProt Consortium.
3. (a) Kancha, R. K.; von Bubnoff, N.; Bartosch, N.; Peschel, C.; Engh, R. A.; Duyster, J., Differential Sensitivity of ERBB2 Kinase Domain Mutations towards Lapatinib. Plos One 2011, 6 (10); (b) Wu, W. K. K.; Tse, T. T. M.; Sung, J. J. Y.; Li, Z. J.; Yu, L.; Ch, C. H., Expression of ErbB Receptors and their Cognate Ligands in Gastric and Colon Cancer Cell Lines. Anticancer Res 2009, 29 (1), 229-234.
4. Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H. J.; Na, K.; Choi, E. Y.; Yan, F. F.; Zhang, F.; Zhang, Y.; Snyder, M.; Cheng, Y.; Chen, R.; Marko-Varga, G.; Deutsch, E. W.; Kim, H.; Kwon, J. Y.; Aebersold, R.; Bairoch, A.; Taylor, A. D.; Kim, K. Y.; Lee, E. Y.; Hochstrasser, D.; Legrain, P.; Hancock, W. S., The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat Biotechnol 2012, 30 (3), 221-223.
5. Dicken, B. J.; Bigam, D. L.; Cass, C.; Mackey, J. R.; Joy, A. A.; Hamilton, S. M., Gastric adenocarcinoma - Review and considerations for future directions. Ann Surg 2005, 241 (1), 27-39.
6. Textbook of Gastroenterology. Yamada, T., Ed. John Wiley & Sons: 2011.
7. Sharma, S. V.; Bell, D. W.; Settleman, J.; Haber, D. A., Epidermal growth factor receptor mutations in lung cancer. Nat Rev Cancer 2007, 7 (3), 169-181.
8. (a) Cathy B. Moelans, P. J. v. D., Anya N. A. Milne, and G. Johan A. Offerhaus, HER-2/neu Testing and Therapy in Gastroesophageal Adenocarcinoma. Pathology Research International 2011; (b) Bizari, L.; Borim, A. A.; Leite, K. R. M.; Goncalves, F. D.; Cury, P. M.; Tajara, E. H.; Silva, A. E., Alterations of the CCND1 and HER-2/neu (ERBB2) proteins in esophageal and gastric cancers. Cancer Genet Cytogen 2006, 165 (1), 41-50; (c) Kawano, S.; Ikeda, W.; Kishimoto, M.; Ogita, H.; Takai, Y., Silencing of ErbB3/ErbB2 Signaling by Immunoglobulin-like Necl-2. J Biol Chem 2009, 284 (35), 23793-23805; (d) Celina Ang, Y. Y. J., Ali Shamseddine, Ayman Tawil, Maeve A. Lowery, Andrew Intlekofer, Walid Faraj, Ashwaq Al-Olayan, Laura Tang, Eileen M. O'Reilly, Fady Geara, Aghiad Al-Kutoubi, David P. Kelsen, and Ghassan K. Abou-Alfa, A Case of Advanced Gastric Cancer. Gastrointest Cancer Res. 2012, 5 (2), 59-63.
9. Ginestier, C.; Charafe-Jauffret, E.; Penault-Llorca, F.; Geneix, J. N.; Adelaide, J.; Chaffanet, M.; Mozziconacci, M. J.; Hassoun, J.; Viens, P.; Birnbaum, D.; Jacquemier, J., Comparative multi-methodological measurement of ERBB2 status in breast cancer. J Pathol 2004, 202 (3), 286-298.
10. (a) Dahlberg, P. S.; Jacobson, B. A.; Dahal, G.; Fink, J. M.; Kratzke, R. A.; Maddaus, M. A.; Ferrin, L. J., ERBB2 amplifications in esophageal adenocarcinoma. Ann Thorac Surg 2004, 78 (5), 1790-1800; (b) Deng, N. T.; Goh, L. K.; Wang, H. N.; Das, K.; Tao, J.; Tan, I. B.; Zhang, S. L.; Lee, M. H.; Wu, J. N.; Lim, K. H.; Lei, Z. D.; Goh, G.; Lim, Q. Y.; Tan, A. L. K.; Poh, D. Y. S.; Riahi, S.; Bell, S.; Shi, M. M.; Linnartz, R.; Zhu, F.; Yeoh, K. G.; Toh, H. C.; Yong, W. P.; Cheong, H. C.; Rha, S. Y.; Boussioutas, A.; Grabsch, H.; Rozen, S.; Tan, P., A comprehensive survey of genomic alterations in gastric cancer reveals systematic patterns of molecular exclusivity and co-occurrence among distinct therapeutic targets. Gut 2012, 61 (5), 673-684.
11. Brooks, J. D., Translational genomics: The challenge of developing cancer biomarkers. Genome Res 2012, 22 (2), 183-187.
12. Kauraniemi, P.; Barlund, M.; Monni, O.; Kallioniemi, A., New amplified and highly expressed genes discovered in the ERBB2 amplicon in breast cancer by cDNA microarrays. Cancer Res 2001, 61 (22), 8235-8240.
13. Wang, Y. K.; Gao, C. F.; Yun, T.; Chen, Z.; Zhang, X. W.; Lv, X. X.; Meng, N. L.; Zhao, W. Z., Assessment of ERBB2 and EGFR gene amplification and protein expression in gastric carcinoma by immunohistochemistry and fluorescence in situ hybridization. Mol Cytogenet 2011, 4.
14. Craig, R.; Cortens, J. C.; Fenyo, D.; Beavis, R. C., Using annotated peptide mass spectrum libraries for protein identification. J Proteome Res 2006, 5 (8), 1843-1849.
15. Wang, Z.; Gerstein, M.; Snyder, M., RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009, 10 (1), 57-63.
16. http://useast.ensembl.org/index.html.
17. (a) Loop, T.; Leemans, R.; Stiefel, U.; Hermida, L.; Egger, B.; Xie, F. K.; Primig, M.; Certa, U.; Fischbach, K. F.; Reichert, H.; Hirth, F., Transcriptional signature of an adult brain tumor in Drosophila. Bmc Genomics 2004, 5; (b) Doloff, J. C.; Waxman, D. J.; Jounaidi, Y., Human Telomerase Reverse Transcriptase Promoter-Driven Oncolytic Adenovirus with E1B-19 kDa and E1B-55 kDa Gene Deletions. Hum Gene Ther 2008, 19 (12), 1383-1399.
18. Sircoulomb, F.; Bekhouche, I.; Finetti, P.; Adelaide, J.; Ben Hamida, A.; Bonansea, J.; Raynaud, S.; Innocenti, C.; Charafe-Jauffret, E.; Tarpin, C.; Ben Ayed, F.; Viens, P.; Jacquemier, J.; Bertucci, F.; Birnbaum, D.; Chaffanet, M., Genome profiling of ERBB2-amplified breast cancers. Bmc Cancer 2010, 10.
19. Bai, T.; Luoh, S. W., GRB-7 facilitates HER-2/Neu-mediated signal transduction and tumor formation. Carcinogenesis 2008, 29 (3), 473-479.
20. Dowhan, D. D. M. a. D., CHAPTER 2 Preparation and Analysis of DNA. In Current Protocols in Molecular Biology John Wiley & Sons, Inc.: 2002.
21. Kim, H. K.; Choi, I. J.; Kim, C. G.; Kim, H. S.; Oshima, A.; Yamada, Y.; Arao, T.; Nishio, K.; Michalowski, A.; Green, J. E., Three-gene predictor of clinical outcome for gastric cancer patients treated with chemotherapy. Pharmacogenomics J 2012, 12 (2), 119-127.
22. http://genecards.org/. Weizmann Institute of Science.
23. http://refgene.com/browse/id/9606/human?p=228.
24. Srikantan, S.; Abdelmohsen, K.; Lee, E. K.; Tominaga, K.; Subaran, S. S.; Kuwano, Y.; Kulshrestha, R.; Panchakshari, R.; Kim, H. H.; Yang, X. L.; Martindale, J. L.; Marasa, B. S.; Kim, M. M.; Wersto, R. P.; Indig, F. E.; Chowdhury, D.; Gorospe, M., Translational Control of TOP2A Influences Doxorubicin Efficacy. Mol Cell Biol 2011, 31 (18), 3790-3801.
25. (a) Claverie, J. M., Computational methods for the identification of differential and coordinated gene expression. Hum Mol Genet 1999, 8 (10), 1821-1832; (b) van Driel, R.; Fransz, P. F.; Verschure, P. J., The eukaryotic genome: a system regulated at different hierarchical levels. J Cell Sci 2003, 116 (20), 4067-4075.
26. (a) Hynes, N. E.; Lane, H. A., Myc and mammary cancer: Myc is a downstream effector of the ErbB2 receptor tyrosine kinase. J Mammary Gland Biol 2001, 6 (1), 141-150; (b) Chou, Y. T.; Lin, H. H.; Lien, Y. C.; Wang, Y. H.; Hong, C. F.; Kao, Y. R.; Lin, S. C.; Chang, Y. C.; Lin, S. Y.; Chen, S. J.; Chen, H. C.; Yeh, S. D.; Wu, C. W., EGFR Promotes Lung Tumorigenesis by Activating miR-7 through a Ras/ERK/Myc Pathway That Targets the Ets2 Transcriptional Repressor ERF. Cancer Res 2010, 70 (21), 8822-8831.
27. (a) Weinmann, W.; Maier, C.; Przybylski, M., Characterization of Primary Structure and Microheterogeneity of Fatty-Acid Acylated Lipoproteins by 252cf-Plasma Desorption and Electrospray Mass-Spectrometry. Fresen J Anal Chem 1992, 343 (1), 63-64; (b) Rosenfeld, R.; Bangio, H.; Gerwig, G. J.; Rosenberg, R.; Aloni, R.; Cohen, Y.; Amor, Y.; Plaschkes, I.; Kamerling, J. P.; Maya, R. B. Y., A lectin array-based methodology for the analysis of protein glycosylation. J Biochem Bioph Meth 2007, 70 (3), 415-426.
28. http://www.genego.com/.
29. http://www.genome.jp/kegg/.


4.7 Supplementary Figures and Tables

Supplementary Figure 4-S1. SDS‐PAGE separation of extracted proteins from gastric cancer patient tissues.


a. A peptide from ERBB2 was identified: L-L-D-I-D-E-T-E-Y-H-A-D-G-G-K, RT 26.29 min (labeled m/z values: 838.3945, 838.8953, 838.8988, 839.3973).

b. A peptide from EGFR was identified: E-L-V-E-P-L-T-P-S-G-E-A-P-N-Q-A-L-L-R, RT 46.02 min (labeled m/z values: 1017.5485, 1018.0494, 1018.5506, 1109.0526).

Supplementary Figure 4-S2. A peptide identified in ERBB2 (a) and EGFR (b).


1x window (raw probe signal)

Supplementary Figure 4-S3. CNV detection of Nimblegen human CGH 385K for gastric cancer patient samples.


Supplementary Figure 4-S4. Protein identification in gastric cancer around ERBB2 on chromosome 17.


[Heat map. Rows (genes): ACACA, MRPL45, PPP1R1B, ERBB2, GRB7, TOP2A, KRTAP2-1, KRT32 and FKBP10. Columns: tumor samples T1 to T8. Cells show spectral counts.]

Supplementary Figure 4-S5. Tumor unique proteins in ERBB2 positive gastric cancer with illustration of spectral counts. (bands: q12, q21.1, q21.2 on chromosome 17, red: spectral counts equal or more than 5, green: spectral counts less than 5, black: none)


[Gene map along chromosome 17. Genes shown in order: AP2B1, ACACA, RPL23, LASP1, C17orf37, ERBB2, PPP1R1B, RPL19, GRB7, GSDMB, GSDMA, PSMD3, KRT38, KRT37, KRT34 and TOP2A, with the distances between neighboring genes (in Kb, or "Adjacent") and, for each gene, the spectral counts and RPKM values.]

Supplementary Figure 4-S6. Genes identified around ERBB2 on chromosome 17.

(green boxes: gene ID in gastric cancer control set; red boxes: gene ID in gastric cancer disease set; left number in the circle: number of missing interval genes; right number in the circle: number of genes that have an RPKM value greater than or equal to 0.3; square with yellow border: spectral counts of disease and control as well as RPKM values of SNU16 and KATOIII)

[Panels a to d: GeneGo pathway maps.]

Supplementary Figure 4-S7. Illustration of ERBB family signaling and EGFR development pathway fit with the proteomics and transcriptomic data: a. ERBB2 signaling; b. development EGFR signaling via small GTPases; c. development EGFR signaling; d. cytoskeleton remodeling integrin outside-in signaling. (from GeneGo database, blue star: oncogenes, red star: up-regulated in disease/tumor set in ERBB2+, highlighted in yellow: tumor unique proteins, blue brackets: co-localization, red brackets: co-expression)


Supplementary Table 4-S1. Clinical parameters of the ERBB2 positive patient tissue samples, which were heterogeneous except for the expression of the oncogene.

Case   HER2 FISH   Age   Sex   Invasion depth         N stage
1      Positive    60    F     Subserosa (T2b)        N2
2      Positive    67    M     Subserosa (T2b)        N2
3      Positive    78    F     Serosa exposure (T3)   N2
4      Positive    69    M     Subserosa (T2b)        N2
5      Positive    64    M     Serosa exposure (T3)   N1
6      Positive    69    M     Serosa exposure (T3)   N1
7      Positive    59    M     Serosa exposure (T3)   N1
8      Positive    75    M     Serosa exposure (T3)   N1


Supplementary Table 4-S2. Catalog of ‘quality’ observed peptides for ERBB2_HUMAN.

Columns: (1) peptide sequence; (2) charge state observed in GI cancer samples; (3) rank observed in GI cancer samples (z=2); (4) number of sets observed in GI cancer samples (spectral counts); (5) rank in GPMDB; (6) charge in GPMDB; (7-9) observed in GPMDB at z=1, z=2, z=3.

Peptide sequence        Charge   Rank   Sets (counts)   GPMDB rank   GPMDB charge   z=1   z=2    z=3
FVVIQNEDLGPASPLDSTFYR   2        1      6 (10)          3            2,3            -     124    51
LLDIDETEYHADGGK         2        2      5 (12)          9            2,3            -     54     27
AVTSANIQEFAGCK          2        2      5 (7)           10           2              -     53     -
NPQLCYQDTILWK           2        4      3 (4)           6            2              -     79     -
VLGSGAFGTVYK            2        5      2 (2)           1            1,2            129   1419   -
LGSQDLLNWCMQIAK         2        5      2 (2)           20           2              -     18     -
VCYGLGMEHLR             2        7      1 (1)           25           2,3            -     11     42


Supplementary Table 4-S3. Top 20 most abundant proteins in ERBB2+ gastric cancer sample set.

Ranking columns give the rank with spectral counts in parentheses, for gastric cancer and for control; RPKM columns give the values for the two gastric cancer cell lines.

Protein   Gene        Chr   Start             End               Size      Band       Rank, gastric cancer   Rank, control   RPKM SNU16   RPKM KATOIII
ENOA      ENO1        1     8,921,061         8,939,308         18,247    1p36.23    12 (1011)              23 (561)        343.3        137.6
A26CA     POTEE       2     131,975,924       132,022,416       46,493    2q21.1     6 (1431)               20 (574)        ND           ND
DESM      DES         2     220,283,099       220,291,461       8,363     2q35       20 (730)               62 (307)        ND           ND
CO6A3     COL6A3      2     238,232,646       238,323,018       90,373    2q37.3     3 (1864)               3 (1793)        ND           ND
EF1A1     EEF1A1      6     74,225,473        74,233,520        8,048     6q13       20 (736)               38 (471)        0.91         ND
ACTB      ACTB        7     5,566,779         5,603,415         36,636    7p22.1     2 (2744)               2 (2044)        423.2        359.5
VIME      VIM         10    17,270,258        17,279,592        9,335     10p13      19 (748)               35 (489)        ND           ND
ACTA      ACTA2       10    90,694,831        90,751,147        56,317    10q23.31   5 (1532)               22 (571)        ND           ND
G3P       GAPDH       12    6,643,093         6,647,537         4,444     12p13.31   15 (888)               18 (588)        329.5        332.6
TBA1B     TUBA1B      12    49,521,565        49,525,304        3,739     12q13.12   7 (1270)               10 (751)        177.7        102.1
TBA1A     TUBA1A      12    49,578,579        49,583,107        4,529     12q13.12   16 (887)               26 (533)        ND           ND
TBA1C     TUBA1C      12    49,582,519        49,667,116        84,598    12q13.12   9 (1196)               13 (691)        70.3         36.77
K2C1      KRT1        12    53,068,520        53,074,191        5,672     12q13.13   10 (1107)              4 (1198)        ND           ND
ACTC      ACTC1       15    35,080,296        35,087,927        7,632     15q14      4 (1718)               5 (1006)        ND           ND
MYH11     MYH11       16    15,796,992        15,950,890        153,899   16p13.11   17 (880)               16 (634)        ND           ND
HBA       HBA1/HBA2   16    226,679/222,846   227,521/223,709   843/864   16p13.3    13 (969)               6 (1002)        ND           ND
K1C10     KRT10       17    38,974,369        38,978,863        4,495     17q21.2    18 (852)               8 (814)         0.66         0.58
TBA8      TUBA8       22    18,593,097        18,614,498        21,402    22q11.21   14 (967)               58 (327)        ND           ND
MYH9      MYH9        22    36,677,323        36,784,063        106,740   22q12.3    8 (1210)               14 (684)        41           29.99
FLNA      FLNA        X     153,576,892       153,603,006       26,114    Xq28       11 (1079)              21 (574)        26.25        41.88


Supplementary Table 4-S4a. Tumor unique proteins in ERBB2+ sample set.

Protein Gene Ch Start End Size Band Sp RPKM value name symbol r ect of two gastric ral cancer cell co lines unt SNU KAT s 16 OIII MXRA MXRA X 3226606 3264684 38078 Xp22.33 3 ND ND 5 5 DDX3 DDX3 Y 15016019 15032390 16371 Yq11.21 11 ND ND Y Y TIMP1 TIMP1 X 47441690 47446190 4500 Xp11.23 4 1.51 5.13 MAGA MAGE X 151080981 151093642 12661 Xq28 10 ND ND 4 A4 MAGA MAGE X 151301782 151307033 5251 Xq28 6 ND ND A A10 TKTL1 TKTL1 X 153524024 153558713 34689 Xq28 31 ND ND AGRIN AGRN 1 955503 991496 35993 1p36.33 3 9.16 13.22 PLOD1 PLOD1 1 11994262 12035599 41337 1p36.22 5 10.24 6.63 NECP2 NECA 1 16767167 16786585 19418 1p36.13 4 5.43 2.15 P2 MFAP2 MFAP2 1 17300997 17308081 7084 1p36.13 7 0.55 ND ARGA ARHG 1 17866330 18024370 15804 1p36.13 3 14.77 0.70 L EF10L 0 KHDR KHDR 1 32479430 32526451 47021 1p35.1 9 17.29 6.16 1 BS1 FHL3 FHL3 1 38462442 38471278 8836 1p34.3 4 0.41 ND PABP4 PABPC 1 40026485 40042521 16036 1p34.3 7 11.39 3.22 4 PYRG1 CTPS 1 41445007 41478235 33228 1p34.2 14 4.226 1.64 P3H1 LEPRE 1 43212006 43232755 20749 1p34.2 6 1.88 1.09 1 FUBP1 FUBP1 1 78409740 78444794 35054 1p31.1 5 9.80 4.28 GBP5 GBP5 1 89724633 89738544 13911 1p22.2 3 ND ND LASS2 CERS2 1 150933059 150947479 14420 1q21.3 5 ND ND HDGF HDGF 1 156711899 156736717 24818 1q23.1 3 46.46 21.8 FHR1 CFHR1 1 196788861 196801319 12458 1q31.3 3 ND ND KCC1G CAMK 1 209757045 209787284 30239 1q32.2 3 ND ND 1G P5CR2 PYCR2 1 226107578 226111978 4400 1q42.12 4 13.15 4.04 GREM GREM 1 240652873 240775462 12258 1q43 3 ND ND 2 2 9 CEBPZ CEBPZ 2 37428755 37458856 30101 2p22.2 3 7.39 3.84 MSH2 MSH2 2 47630108 47789450 15934 2p21 6 7.45 4.80


2 MSH6 MSH6 2 48010221 48034092 23871 2p16.3 3 6.41 4.27 FBLN3 EFEMP 2 56093102 56151274 58172 2p16.1 25 ND ND 1 SNUT2 USP39 2 85829979 85876406 46427 2p11.2 4 9.74 3.61 PTCD3 PTCD3 2 86333305 86369280 35975 2p11.2 4 7.42 2.54 RPIA RPIA 2 88991162 89050452 59290 2p11.2 8 10.78 2.86 KV312 IGKV3 2 89442057 89442643 586 2p11.2 20 ND ND -20 FHL2 FHL2 2 105974169 106055230 81061 2q12.2 6 20.12 5.86 RBP2 RANB 2 109335937 109402267 66330 2q12.3 4 8.08 1.97 P2 DDX18 DDX18 2 118572226 118589955 17729 2q14.1 5 11.32 3.66 MCM6 MCM6 2 136597196 136634011 36815 2q21.3 3 9.58 2.65 HAT1 HAT1 2 172778935 172848600 69665 2q31.1 3 5.86 4.03 ITAV ITGAV 2 187454790 187545628 90838 2q32.1 5 2.66 1.90 TOP2B TOP2B 3 25639475 25706398 66923 3p24.2 8 7.25 2.39 CO7A1 COL7A 3 48601506 48632700 31194 3p21.31 10 0.73 ND 1 RL29 RPL29 3 52027644 52029958 2314 3p21.2 4 165 61.25 UBA3 UBA3 3 69103881 69129559 25678 3p14.1 3 3.08 2.45 DTX3L DTX3L 3 122283085 122294050 10965 3q21.1 3 2.63 2.98 PROF2 PFN2 3 149682691 149768575 85884 3q25.1 5 10.05 ND DHX36 DHX36 3 153990335 154042286 51951 3q25.2 3 2.68 2.66 LXN LXN 3 158363611 158390482 26871 3q25.32 6 ND ND FND3B FNDC3 3 171757418 172119455 36203 3q26.31 4 2.66 3.92 B 7 GBB4 GNB4 3 179113876 179169378 55502 3q26.33 11 0.35 0.41 LPP LPP 3 187871072 188608460 73738 3q27.3 6 3.8 3.13 8 PALLD PALLD 4 169418217 169849608 43139 4q32.3 10 1.24 3.24 1 WWC2 WWC2 4 184020446 184241930 22148 4q35.1 3 ND 0.66 4 HMCS HMGC 5 43289493 43313614 24121 5p12 3 25.42 11.28 1 S1 SK2L2 SKIV2 5 54603576 54721409 11783 5q11.2 4 6.31 2.61 L2 3 ARSB ARSB 5 78073032 78282357 20932 5q14.1 6 0.51 0.33 5 TSP4 THBS4 5 79287134 79379110 91976 5q14.1 6 ND ND CSPG2 VCAN 5 82767284 82878122 11083 5q14.2 26 ND ND 8 NUDC NUDC 5 162880494 162887146 6652 5q34 3 4.86 2.37 2 D2

191

HXK3  HK3  5  176307870  176326333  18463  5q35.2  5  ND  ND
DREB  DBN1  5  176883609  176901402  17793  5q35.3  4  0.40  2.33
PDLI7  PDLIM7  5  176910395  176924607  14212  5q35.3  10  0.75  1.56
1A68  ENSG00000235657  6  n/a  n/a  4640  HSCHR6_MHC_DBB  9  ND  ND
1A23  ENSG00000223980  6  n/a  n/a  4641  HSCHR6_MHC_SSTO  6  ND  ND
TBB2B  TUBB2B  6  3224495  3231964  7469  6p25.2  71  ND  ND
H12  HIST1H1C  6  26055915  26056699  784  6p22.2  5  191.8  237
H2B1J  HIST1H2BJ  6  27094058  27100575  6517  6p22.1  12  349.2  281.8
H2B1K  HIST1H2BK  6  27106072  27114637  8565  6p22.1  15  366.1  157.6
1C01  HLA-C  6  31236526  31239907  3381  6p21.33  7  13.15  25.54
1B51  HLA-B  6  31321649  31324989  3340  6p21.33  14  6.45  29.37
1B59  HLA-B  6  31321649  31324989  3340  6p21.33  11  6.45  29.37
1B07  HLA-B  6  31321649  31324989  3340  6p21.33  9  6.45  29.37
1B41  HLA-B  6  31321649  31324989  3340  6p21.33  7  6.45  29.37
HA24  HLA-DQA1  6  32595956  32614839  18884  6q21.32  5  ND  ND
MEP1A  MEP1A  6  46761094  46807519  46425  6p12.3  16  6.49  ND
MCM3  MCM3  6  52128807  52149582  20775  6p12.2  6  32.64  12.04
RAB23  RAB23  6  57053581  57087078  33497  6p11.2  3  0.88  0.57
HDAC2  HDAC2  6  114254192  114332472  78280  6q21  9  3.46  2.02
E41L2  EPB41L2  6  131160487  131384462  223975  6q23.2  7  9.90  3.41
TSP2  THBS2  6  169615875  169654139  38264  6q27  3  ND  2.66
AGR3  AGR3  7  16899029  16921613  22584  7p21.1  21  14.05  18.54
IF2B3  IGF2BP3  7  23349828  23510086  160258  7p15.3  9  7.95  0.83
CPVL  CPVL  7  29034847  29235067  200220  7p14.3  3  ND  2.71
AEBP1  AEBP1  7  44143960  44154161  10201  7p13  7  ND  ND
ABCAD  ABCA13  7  48211055  48687092  476037  7p12.3  3  ND  ND
EGFR  EGFR  7  55086714  55324313  237599  7p11.2  6  11.34  4.04
LANC2  LANCL2  7  55433141  55501435  68294  7p11.2  8  3.18  1.02
CLD3  CLDN3  7  73183327  73184600  1273  7q11.23  6  18.71  1.723
PON2  PON2  7  95034174  95064510  30336  7q21.3  3  5.03  0.95
POP7  POP7  7  100303676  100305123  1447  7q22.1  3  13.67  5.64
MPPB  PMPCB  7  102937869  102969958  32089  7q22.1  5  6.91  2.83
MEST  MEST  7  130126046  130146133  20087  7q32.2  4  23.79  8.45
COPG2  COPG2  7  130146079  130353598  207519  7q32.2  10  6.51  2.10
ABP1  ABP1  7  150521715  150558592  36877  7q36.1  7  2.85  0.34
DEF4  DEFA4  8  6793344  6795860  2516  8p23.1  28  ND  ND
GGH  GGH  8  63927638  63951730  24092  8q12.3  8  3.13  0.67
HNF4G  HNF4G  8  76320149  76479078  158929  8q21.11  4  2.82  1.25
PUF60  PUF60  8  144898514  144912029  13515  8q24.3  5  19.14  11.39
BOP1  BOP1  8  145486055  145515120  29065  8q24.3  4  11.37  6.12
RRAGA  RRAGA  9  19049372  19051023  1651  9p22.1  3  2.34  3.26
CD2A1  CDKN2A  9  21967751  21995300  27549  9p21.3  4  ND  ND
TF3C4  GTF3C4  9  135545422  135570342  24920  9q34.13  7  4.16  3.02
EF1A3  EEF1A1P5  9  135894816  135896562  1746  9q34.13  147  ND  ND
VAV2  VAV2  9  136627016  136857726  230711  9q34.2  5  1.77  3.57
TBB8  TUBB8  10  92828  96053  3225  10p15.3  22  ND  ND
CDC2  CDK1  10  62538089  62554610  16521  10q21.2  7  7.28  4.87
P4HA1  P4HA1  10  74766975  74856732  89757  10q22.1  5  5.797  4.29
BTAF1  BTAF1  10  93683736  93790082  106346  10q23.32  4  4.65  2.26
ERLN1  ERLIN1  10  101909847  101948091  38244  10q24.31  6  5.64  2.25
NPM3  NPM3  10  103541082  103543170  2088  10q24.32  3  26.36  6.35
RRP5  PDCD11  10  105156405  105206049  49644  10q24.33  3  9.00  2.35
ADDG  ADD3  10  111756126  111895323  139197  10q25.1  4  6.45  7.41
BCCIP  BCCIP  10  127512104  127542264  30160  10q26.2  4  7.612  3.52
GLRX3  GLRX3  10  131934663  131982785  48122  10q26.3  5  8.25  2.64
RIC8A  RIC8A  11  207511  215113  7602  11p15.5  3  5.50  2.08
PKP3  PKP3  11  392614  404908  12294  11p15.5  9  27.55  6.72
GHC1  SLC25A22  11  790475  798316  7841  11p15.5  4  3.40  0.81
NAT10  NAT10  11  34127111  34169217  42106  11p13  11  10.34  1.63
DDB1  DDB1  11  61066919  61110068  43149  11q12.2  6  32.55  14.96
NC2A  DRAP1  11  65686728  65689048  2320  11q13.1  3  4.76  4.50
RBM4  RBM4  11  66406088  66434153  28065  11q13.2  3  12.67  6.38
HEM3  HMBS  11  118955576  118964259  8683  11q23.3  4  5.10  4.35
NOL1  NOP2  12  6666029  6677857  11828  12p13.31  4  5.87  2.82
DDX47  DDX47  12  12966250  12982915  16665  12p13.1  7  4.53  5.37
RT35  MRPS35  12  27863706  27909237  45531  12p11.22  3  31.17  9.78
K6PF  PFKM  12  48498922  48540187  41265  12q13.11  8  13.38  3.06
ARF3  ARF3  12  49297286  49351334  54048  12q13.12  126  11.58  7.1
KRT81  KRT81  12  52679697  52685318  5621  12q13.13  162  ND  ND
K2C79  KRT79  12  53215194  53228079  12885  12q13.13  10  ND  ND
ROA1  HNRNPA1  12  54673977  54680872  6895  12q13.13  37  0.89  1.18
SMRC2  SMARCC2  12  56555636  56583351  27715  12q13.2  4  5.38  1.96
IKIP  IKBIP  12  99007182  99038732  31550  12q23.1  4  ND  0.40
RAB35  RAB35  12  120532899  120555306  22407  12q24.23  19  3.67  1.99
TRUA  PUS1  12  132413745  132428406  14661  12q24.33  4  11.05  1.85
HS105  HSPH1  13  31710762  31736525  25763  13q12.3  17  8.12  6.49
IMA3  KPNA3  13  50273443  50367057  93614  13q14.2  3  5.18  2.16
RRP44  DIS3  13  73329540  73356344  26804  13q22.1  3  2.75  1.68
DHRS2  DHRS2  14  24099378  24114848  15470  14q11.2  134  0.89  1.18
HSP72  HSPA2  14  65002623  65009955  7332  14q23.3  46  4.50  ND
FBLN5  FBLN5  14  92335755  92414046  78291  14q32.12  4  ND  ND
TRM61  TRMT61A  14  103995509  104003410  7901  14q32.32  3  7.42  2.21
TSP1  THBS1  15  39873127  39891119  17992  15q14  19  ND  ND
GCNT3  GCNT3  15  59903982  59913023  9041  15q22.2  5  18.01  15.94
TLN2  TLN2  15  62939477  63136830  197353  15q22.2  6  10.79  4.84
RB11A  RAB11A  15  66161796  66184329  22533  15q22.31  6  11.11  8.44
MP2K1  MAP2K1  15  66679155  66783882  104727  15q22.31  3  5.99  3.78
PML  PML  15  74287014  74340160  53146  15q24.1  4  2.366  1.02
CSK  CSK  15  75074425  75095539  21114  15q24.1  5  14.68  4.72
IMP3  IMP3  15  75931426  75941047  9621  15q24.2  4  22.82  8.82
DMN  SYNM  15  99645286  99675800  30514  15q26.3  36  0.31  ND
TRYD  TPSD1  16  1306060  1308494  2434  16p13.3  3  ND  ND
ELOB  TCEB2  16  2821415  2827297  5882  16p13.3  3  34.3  22.34
RL1D1  RSL1D1  16  11928053  11945442  17389  16p13.13  9  11.38  4.11
THUM1  THUMPD1  16  20744986  20753406  8420  16p12.3  4  3.28  2.65
XTP3A  DCTPP1  16  30435019  30441396  6377  16p11.2  3  8.00  6.80
MMP2  MMP2  16  55512883  55540603  27720  16q12.2  4  ND  ND
EST1  CES1  16  55836763  55867075  30312  16q12.2  27  ND  ND
GNAO  GNAO1  16  56225251  56391356  166105  16q12.2  3  ND  ND
ULA1  NAE1  16  66836778  66864900  28122  16q22.1  5  16.88  5.75
TPPP3  TPPP3  16  67423712  67427438  3726  16q22.1  3  0.64  2.05
HPTR  HPR  16  72088508  72111145  22637  16q22.2  3  ND  ND
DPEP1  DPEP1  16  89679716  89704864  25148  16q24.3  20  ND  ND
NXN  NXN  17  702581  883010  180429  17p13.3  9  2.28  5.49
GLYC  SHMT1  17  18231187  18266856  35669  17p11.2  3  4.04  3.88
RAB34  RAB34  17  27041299  27045447  4148  17q11.2  5  2.72  ND
BLMH  BLMH  17  28575223  28619074  43851  17q11.2  3  10.58  ND
ACACA  ACACA  17  35441923  35766902  324979  17q12  12  12.4  5.47
RM45  MRPL45  17  36452990  36479101  26111  17q12  4  7.80  3.33
PPR1B  PPP1R1B  17  37783179  37792879  9700  17q12  9  0.55  21.14
ERBB2  ERBB2  17  37844393  37884915  40522  17q12  38  12.63  5.85
GRB7  GRB7  17  37894187  37903545  9359  17q12  16  2.99  1.84
TOP2A  TOP2A  17  38544768  38574408  29640  17q21.2  11  17.02  15.98
KRA21  KRTAP2-1  17  39202793  39203568  775  17q21.2  5  ND  ND
K1H2  KRT32  17  39615765  39623681  7916  17q21.2  15  ND  ND
FKB10  FKBP10  17  39968962  39979469  10507  17q21.2  50  ND  4.49
U5S1  EFTUD2  17  42927655  42976993  49338  17q21.31  16  19.23  2.18
NMT1  NMT1  17  43138680  43186384  47704  17q21.31  5  6.93  5.52
CBX1  CBX1  17  46147414  46178883  31469  17q21.32  4  6.50  3.32
MRC2  MRC2  17  60704762  60770956  66194  17q23.2  3  ND  0.72
DDX42  DDX42  17  61851567  61896677  45110  17q23.3  3  8.19  3.73
IMA2  KPNA2  17  66031848  66042970  11122  17q24.2  10  22.01  17.17
GALK1  GALK1  17  73754018  73761307  7289  17q25.1  3  5.32  2.82
RAB31  RAB31  18  9708228  9862553  154325  18p11.22  8  2.92  ND
SMCA4  SMARCA4  19  11071598  11172958  101360  19p13.2  4  21.11  3.26
LEG7  LGALS7  19  39261608  39264157  2549  19q13.2  6  ND  ND
CEAM5  CEACAM5  19  42212504  42234437  21933  19q13.2  23  ND  ND
CEAM6  CEACAM6  19  42259329  42276113  16784  19q13.2  10  24.53  113.8
CEAM1  CEACAM1  19  43011458  43032661  21203  19q13.2  3  1.07  2.82
AP2A1  AP2A1  19  50270180  50310369  40189  19q13.33  14  9.20  4.04
TNNI3  TNNI3  19  55663135  55669100  5965  19q13.42  3  0.36  ND
RT26  MRPS26  20  3026591  3028900  2309  20p13  5  14.58  5.31
OFUT1  POFUT1  20  30795683  30826470  30787  20q11.21  5  6.36  3.23
CHM4B  CHMP4B  20  32399110  32442173  43063  20q11.22  4  8.70  5.51
RALY  RALY  20  32581452  32696114  114662  20q11.22  12  23.09  17.08
MYH7B  MYH7B  20  33543704  33590240  46536  20q11.22  5  ND  ND
SRC  SRC  20  35973088  36034453  61366  20q11.23  10  7.12  4.75
ACOT8  ACOT8  20  44470360  44486045  15685  20q13.12  4  3.08  2.69
MMP9  MMP9  20  44637547  44645200  7653  20q13.12  7  ND  ND
ATP9A  ATP9A  20  50213053  50385173  172120  20q13.2  3  7.46  3.74
PFD4  PFDN4  20  52824386  52844591  20205  20q13.2  5  2.13  1.57
U2AF1  U2AF1  21  44513066  44527697  14631  21q22.3  3  9.23  2.93

Supplementary Table 4-S4b. Tumor unique proteins in ERBB2- sample set.

Gene symbol  Chr  Start  End  Size  Band  Gastric cancer tumor only spectral counts
TPM1  15  63334831  63364114  29283  15q22.2  82
COL12A1 (isoform 4)  6  75794042  75915767  121725  6q14.1  58
TNC (isoform 4)  9  117782805  117880536  97731  9q33.1  56
SPTAN1  9  131314866  131395941  81075  9q34.11  56
TUBB3  16  89989687  90002505  12818  16q24.3  49
ACTA1  1  229566992  229569845  2853  1q42.13  38
HSPA2  14  65002623  65009955  7332  14q23.3  36
PKM2  15  72491370  72524163  32793  15q23  35
TRIM28  19  59055836  59062082  6246  19q13.43  30
TNC (isoform 5)  9  117782805  117880536  97731  9q33.1  28
OLFM4  13  53602830  53626196  23366  13q14.3  25
TFRC  3  195754054  195809060  55006  3q29  21
SSB  2  170648443  170668575  20132  2q31.1  21
PGM5  9  70971815  71145977  174162  9q21.11  21
CEACAM5  19  42212504  42234437  21933  19q13.2  20
STAT1  2  191829084  191885686  56602  2q32.2  19
PALLD  4  169418217  169849608  431391  4q32.3  19
SFPQ  1  35641979  35658749  16770  1p34.3  18
SLMAP  3  57741177  57914895  173718  3p14.3  18
COL12A1  6  75794042  75915767  121725  6q14.1  18
NES  1  156638555  156647189  8634  1q23.1  17
PDLIM3  4  186421814  186456766  34952  4q35.1  17
ASS1  9  133320094  133376661  56567  9q34.11  16
PSMC2  7  102984701  103009842  25141  7q22.1  16
HLA-A (A-33 alpha chain)  6  29909037  29913661  4624  6p22.1  15
MYLK  3  123328896  123603178  274282  3q21.1  15
CTSB  8  11700033  11726957  26924  8p23.1  14
RDX  11  110045605  110167447  121842  11q22.3  14
MUC5B  11  1244295  1284402  40107  11p15.5  14
HLA-A (A-69 alpha chain)  6  n/a  n/a  4640  HSCHR6_MHC_DBB  13
TPM2  9  35681989  35691017  9028  9p13.3  12
HLA-A (A-11 alpha chain)  6  n/a  n/a  4640  HSCHR6_MHC_DBB  12
STMN1  1  26210672  26233482  22810  1p36.11  12
HSPH1  13  31710762  31736525  25763  13q12.3  12
TNC  9  117782805  117880536  97731  9q33.1  11
EIF2S3  X  24072833  24096927  24094  Xp22.11  11
RBBP7  X  16857406  16888537  31131  Xp22.2  11
PDLIM5  4  95373037  95589377  216340  4q22.3  11
HNRNPA1  12  54673977  54680872  6895  12q13.13  10
DDX6  11  118618472  118661972  43500  11q23.3  10
RPL13  16  89627056  89630950  3894  16q24.3  10
SRSF9  12  120899471  120907596  8125  12q24.31  10
RCC2  1  17733251  17766220  32969  1p36.13  10
HNRNPCL1  1  12907261  12908578  1317  1p36.21  9
C4B  6  31982539  32003195  20656  6p21.33  9
MCM4  8  48872745  48890720  17975  8q11.21  9
MCM7  7  99690351  99699563  9212  7q22.1  9
TMPO  12  98909290  98944157  34867  12q23.1  9
STAT1  2  191829084  191885686  56602  2q32.2  9
ACTN3  11  66313866  66330800  16934  11q13.2  9
HNRNPD  4  83273651  83295656  22005  4q21.22  9
PRAF2  X  48928818  48931733  2915  Xp11.23  8
MMP9  20  44637547  44645200  7653  20q13.12  8
DPEP1  16  89679716  89704864  25148  16q24.3  8
HLA-B  6  31321649  31324989  3340  6p21.33  8
KPNA2  17  66031848  66042970  11122  17q24.2  8
AK2  1  33473585  33546597  73012  1p35.1  8
OXCT1  5  41730167  41870791  140624  5p13.1  8
CBX1  17  46147414  46178883  31469  17q21.32  8
EFEMP1  2  56093102  56151274  58172  2p16.1  8
PMVK  1  154897208  154909484  12276  1q21.3  8
CSRP2  12  77252495  77272840  20345  12q21.2  8
TBCB  19  36605888  36616849  10961  19q13.12  8
PDLIM7  5  176910395  176924607  14212  5q35.3  8
GLS  2  191745547  191830278  84731  2q32.2  7
DMD  X  31132808  33357726  2224918  Xp21.1  7
HLA-A  6  29909037  29913661  4624  6p22.1  7
STOM  9  124101353  124132545  31192  9q33.2  7
MAP4  3  47892180  48130769  238589  3p21.31  7
MAP1B  5  71403061  71505397  102336  5q13.2  7
ALDH7A1  5  125877533  125931110  53577  5q23.2  7
HIST1H2BD  6  26158349  26171577  13228  6p22.2  7
PPP1CB  2  28974506  29025806  51300  2p23.2  7
PLOD1  1  11994262  12035599  41337  1p36.22  7
RBBP4  1  33116743  33151812  35069  1p35.1  7
DDX39B  6  31497996  31510225  12229  6p21.33  7
SAFB2  19  5587010  5622938  35928  19p13.3  7
CNN3  1  95362507  95392834  30327  1p21.3  7
TOMM34  20  43570771  43589127  18356  20q13.12  7
KRT79  12  53215194  53228079  12885  12q13.13  7
ABHD11  7  73150424  73153197  2773  7q11.23  7
SEPT8  5  132086509  132142933  56424  5q31.1  7
COL12A1  6  75794042  75915767  121725  6q14.1  7
PUF60  8  144898514  144912029  13515  8q24.3  7
HLA-A (A-28 alpha chain)  6  29909037  29913661  4624  6p22.1  6
CES1  16  55836763  55867075  30312  16q12.2  6
LMOD1  1  201865580  201915716  50136  1q32.1  6
RPL3  22  39708887  39716394  7507  22q13.1  6
RPL6  12  112842994  112856642  13648  12q24.13  6
GBE1  3  81538850  81811312  272462  3p12.2  6
PAK2  3  196466728  196559518  92790  3q29  6
RBM39  20  34291531  34330234  38703  20q11.22  6
AARS2  6  44266463  44281063  14600  6p21.1  6
HP1BP3  1  21069154  21113816  44662  1p36.12  6
C20orf3  20  24943561  24973615  30054  20p11.21  6
RAB23  6  57053581  57087078  33497  6p11.2  6
GGCT  7  30536237  30544460  8223  7p14.3  5
PMPCB  7  102937869  102969958  32089  7q22.1  5
LGALS1  22  38071613  38075813  4200  22q13.1  5
RPL14  3  40498783  40506549  7766  3p22.1  5
RAD23B  9  110045418  110094475  49057  9q31.2  5
H3F3C  12  31944119  31945175  1056  12p11.21  5
GLT25D1  19  17666460  17693965  27505  19p13.11  5
RCN3  19  50030875  50046890  16015  19q13.33  5
FUBP3  9  133454352  133513739  59387  9q34.11  5
LXN  3  158363611  158390482  26871  3q25.32  5
IGF2BP1  17  47074774  47133507  58733  17q21.32  5
UBA2  19  34919264  34960798  41534  19q13.11  5
SUN2  22  39130730  39190148  59418  22q13.1  5
SUGT1  13  53226831  53262433  35602  13q14.3  5
THY1  11  119288090  119295695  7605  11q23.3  4
RRAS  19  50138552  50143400  4848  19q13.33  4
CEACAM1  19  43011458  43032661  21203  19q13.2  4
AK4  1  65613232  65697828  84596  1p31.3  4
AQP1  7  30951415  30965131  13716  7p14.3  4
DUT  15  48623621  48635570  11949  15q21.1  4
PPP5C  19  46850251  46896238  45987  19q13.32  4
RAB11A  15  66161796  66184329  22533  15q22.31  4
FBLN2  3  13573824  13679922  106098  3p25.1  4
FMOD  1  203309752  203320617  10865  1q32.1  4
MCM6  2  136597196  136634011  36815  2q21.3  4
PDK3  X  24483338  24568583  85245  Xp22.11  4
TSN  2  122494679  122525429  30750  2q14.3  4
IGFBP7  4  57896939  57976551  79612  4q12  4
PRPF8  17  1553923  1588176  34253  17p13.3  4
GOLM1  9  88641056  88715116  74060  9q21.33  4
PGM2  4  37828255  37864559  36304  4p14  4
PGAM5  12  133287393  133299323  11930  12q24.33  4
CNPY3  6  42896860  42907025  10165  6p21.1  4
NHP2  5  177576461  177580968  4507  5q35.3  4
LUC7L2  7  139025105  139108200  83095  7q34  4
CLIC4  1  25071760  25170815  99055  1p36.11  4
BAG2  6  57037104  57050013  12909  6p11.2  3
C1QC  1  22970118  22974603  4485  1p36.12  3
S100A9  1  153330330  153333503  3173  1q21.3  3
SERPINF2  17  1646130  1658560  12430  17p13.3  3
SRM  1  11114641  11120091  5450  1p36.22  3
TUBG1  17  40761358  40767256  5898  17q21.2  3
CEACAM6  19  42259329  42276113  16784  19q13.2  3
BCAM  19  45312338  45324678  12340  19q13.32  3
ADAR  1  154554533  154600475  45942  1q21.3  3
ABCE1  4  146019084  146050676  31592  4q31.21  3
SKP1  5  133492082  133561762  69680  5q31.1  3
LTBP2  14  74964886  75079034  114148  14q24.3  3
PSMD6  3  63996225  64009658  13433  3p14.1  3
BROX  1  222885895  222908538  22643  1q41  3
DARS2  1  173793641  173827684  34043  1q25.1  3
MTDH  8  98656407  98742488  86081  8q22.1  3
LEMD2  6  33738979  33756913  17934  6p21.31  3
NPLOC4  17  79523913  79616376  92463  17q25.3  3
KRT12  17  39017195  39023462  6267  17q21.2  3
DNAJC7  17  40128439  40169715  41276  17q21.2  3
HNRNPUL1  19  41768391  41813811  45420  19q13.2  3
MXRA5  X  3226606  3264684  38078  Xp22.33  3
BCLAF1  6  136578001  136610989  32988  6q23.3  3
SEPT10  2  110300376  110371783  71407  2q13  3
PFDN2  1  161070346  161087901  17555  1q23.3  3
SNX6  14  35030300  35099389  69089  14q13.1  3
DYNC1LI1  3  32567463  32612366  44903  3p22.3  3

Supplementary Table 4-S5. Tumor unique proteins that have interactions with the 21 oncogenes. (Genes marked with * are present in the selected pathways for gastric cancer in Supplementary Figures S6 and S7.)

(RPKM values are for two gastric cancer cell lines, SNU16 and KATOIII.)
Protein name  Gene symbol  Chr  Start  End  Size  Band  Spectral counts  RPKM (SNU16)  RPKM (KATOIII)
KHDR1  KHDRBS1  1  32479430  32526451  47021  1p35.1  9  17.29  6.16
P3H1  LEPRE1  1  43212006  43232755  20749  1p34.2  6  1.88  1.09
CEBPZ  CEBPZ  2  37428755  37458856  30101  2p22.2  3  7.39  3.84
MSH2  MSH2  2  47630108  47789450  159342  2p21  6  7.45  4.8
MSH6  MSH6  2  48010221  48034092  23871  2p16.3  3  6.41  4.27
SNUT2  USP39  2  85829979  85876406  46427  2p11.2  4  9.74  3.61
FHL2  FHL2  2  105974169  106055230  81061  2q12.2  6  20.12  5.86
RBP2  RANBP2  2  109335937  109402267  66330  2q12.3  4  8.08  1.97
ITAV  ITGAV  2  187454790  187545628  90838  2q32.1  5  2.66  1.9
TOP2B  TOP2B  3  25639475  25706398  66923  3p24.2  8  7.25  2.39
GBB4  GNB4  3  179113876  179169378  55502  3q26.33  11  0.35  0.41
ARSB  ARSB  5  78073032  78282357  209325  5q14.1  6  0.51  0.33
CSPG2  VCAN  5  82767284  82878122  110838  5q14.2  26  ND  ND
PDLI7  PDLIM7  5  176910395  176924607  14212  5q35.3  10  0.75  1.56
MCM3  MCM3  6  52128807  52149582  20775  6p12.2  6  32.64  12.04
HDAC2  HDAC2  6  114254192  114332472  78280  6q21  9  3.46  2.02
IF2B3  IGF2BP3  7  23349828  23510086  160258  7p15.3  9  7.95  0.83
EGFR  EGFR  7  55086714  55324313  237599  7p11.2  6  11.34  4.04
CLD3  CLDN3  7  73183327  73184600  1273  7q11.23  6  18.71  1.723
COPG2  COPG2  7  130146079  130353598  207519  7q32.2  10  6.51  2.1
CD2A1  CDKN2A  9  21967751  21995300  27549  9p21.3  4  ND  ND
TF3C4  GTF3C4  9  135545422  135570342  24920  9q34.13  7  4.16  3.02
VAV2*  VAV2*  9  136627016  136857726  230711  9q34.2  5  1.77  3.57
P4HA1  P4HA1  10  74766975  74856732  89757  10q22.1  5  5.797  4.29
BTAF1  BTAF1  10  93683736  93790082  106346  10q23.32  4  4.65  2.26
RRP5  PDCD11  10  105156405  105206049  49644  10q24.33  3  9  2.35
DDB1  DDB1  11  61066919  61110068  43149  11q12.2  6  32.55  14.96
DDX47  DDX47  12  12966250  12982915  16665  12p13.1  7  4.53  5.37
HS105  HSPH1  13  31710762  31736525  25763  13q12.3  17  8.12  6.49
RRP44  DIS3  13  73329540  73356344  26804  13q22.1  3  2.75  1.68
MP2K1*  MAP2K1*  15  66679155  66783882  104727  15q22.31  3  5.99  3.78
PML  PML  15  74287014  74340160  53146  15q24.1  4  2.366  1.02
CSK  CSK  15  75074425  75095539  21114  15q24.1  5  14.68  4.72
MMP2*  MMP2*  16  55512883  55540603  27720  16q12.2  4  ND  ND
GNAO  GNAO1  16  56225251  56391356  166105  16q12.2  3  ND  ND
ACACA  ACACA  17  35441923  35766902  324979  17q12  12  12.4  5.47
PPR1B  PPP1R1B  17  37783179  37792879  9700  17q12  9  0.55  21.14
ERBB2*  ERBB2*  17  37844393  37884915  40522  17q12  38  12.63  5.85
GRB7*  GRB7*  17  37894187  37903545  9359  17q12  16  2.99  1.84
TOP2A  TOP2A  17  38544768  38574408  29640  17q21.2  11  17.02  15.98
U5S1  EFTUD2  17  42927655  42976993  49338  17q21.31  16  19.23  2.18
NMT1  NMT1  17  43138680  43186384  47704  17q21.31  5  6.93  5.52
IMA2  KPNA2  17  66031848  66042970  11122  17q24.2  10  22.01  17.17
SMCA4  SMARCA4  19  11071598  11172958  101360  19p13.2  4  21.11  3.26
CEAM1  CEACAM1  19  43011458  43032661  21203  19q13.2  3  1.07  2.82
AP2A1  AP2A1  19  50270180  50310369  40189  19q13.33  14  9.2  4.04
SRC*  SRC*  20  35973088  36034453  61366  20q11.23  10  7.12  4.75
ACOT8  ACOT8  20  44470360  44486045  15685  20q13.12  4  3.08  2.69
MMP9*  MMP9*  20  44637547  44645200  7653  20q13.12  7  ND  ND