<<

LESSONS FROM THE :

CHARACTERIZATION OF CHINESE HAMSTER OVARY (CHO) CELL LINES AND GRISEUS TISSUES VIA PROTEOMICS AND GLYCOPROTEOMICS

by Kelley Marie Heffner

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy

Baltimore, Maryland

August, 2017

© 2017 Kelley Marie Heffner All Rights Reserved

Abstract

Chinese hamster ovary (CHO) cells were isolated in the late 1950’s and have

been the workhorse of biotherapeutics production for decades. While previous

efforts compared CHO cell lines by proteomics, research into the original Chinese

hamster (Cricetulus griseus) host has not been conducted. Thus, we sought to

understand proteomic differences across CHO-S and CHO DG44 cell lines in relation

to brain, heart, kidney, liver, lung, ovary, and spleen tissues. As glycosylation is

critical for recombinant protein quality, we additionally performed a

glycoproteomics and sialoproteomics analysis of wild-type and mutant CHO cell

lines that differ in glycosylation capacity.

First, wild-type CHO was compared with tunicamycin-treated CHO and

Lec9.4a cells, a mutant CHO cell line which shows 50% of wild-type glycosylation

levels. A total of 381 glycoproteins were identified, including heavily-glycosylated

membrane proteins and transporters. Proteins related to glycosylation

downregulated in Lec9.4a include alpha-(1,3)-fucosyltransferase and dolichyl- diphosphooligosaccharide-protein glycosyltransferase subunit 1. Next, wild-type

Pro-5 CHO was compared with Lec2 cells, which have a mutation in CMP-sialic acid transporter that reduces sialylation. A total of 272 sialylated proteins were identified. Downregulated sialoproteins, including dolichyl- diphosphooligosaccharide-protein glycosyltransferase subunit STT3A and beta-1,4- galactosyltransferase 3, detect glycosylation defects.

ii Next, a label-free quantitative proteomics analysis of CHO-S and CHO DG44

cell lines and liver and ovary tissue identified 11801 proteins, including 9359

proteins specifically in the cell lines, representing a 56% increase over previous

work. Additionally, 6663 proteins were identified across liver and ovary tissues

providing the first Chinese hamster tissue proteome. Overall, both gene ontology

and KEGG pathway analysis revealed enrichment of cell cycle activity in cells. In

contrast, upregulated molecular functions in tissue include glycosylation and lipid

transport.

Finally, we used labeled proteomics to compare CHO-S and CHO DG44 cell

lines with brain, heart, kidney, liver, lung, ovary, and spleen tissues to identify 8464

proteins. After protein expression and functional analyses, we combined the

proteomics with transcriptomic data obtained by RNAseq in order to correlate

mRNA and protein expression. Over 65% of genes show agreement between

transcriptome and proteome. The remaining genes were categorized as stable or

unstable and related to metabolic pathways. In conclusion, this large-scale

proteomics analysis delineates specific changes across cell lines and tissues, which

can help explain tissue function and the adaptations cells incur as biotherapeutics

production hosts.

Advisor: Dr. Michael J. Betenbaugh

Readers: Dr. Marc Donohue, Dr. Sharon Krag

iii Preface

Proteomics serves an important role in the elucidation of cell metabolism, with various applications for the biotechnology industry. Identification and quantification of thousands of cellular proteins, as well as their functional pathway interactions, can aid efforts to optimize genetic engineering and bioprocess development. In this work, we conducted label-free, labeled, and glyco- proteomics analyses of Chinese hamster ovary (CHO) cell lines and Cricetulus griseus tissues to expand knowledge of biotech’s top recombinant protein production host.

Chapter 1 provides an introduction to the field of proteomics and its increasing importance in the biotechnology industry. Applications of proteomics, such as media formulation, bioprocess development, and cell line engineering, are highlighted. A summary of developments for analysis, such as mass spectrometry and database development are also discussed.

Chapter 2 describes a glycoproteomics effort to identify the glycosylated proteins in wild-type CHO and mutant cell lines that lack the capability to glycosylate or sialylate proteins. An enrichment method coupled to identification by mass spectrometry was developed to identify proteins known to be glycosylated and sialylated. We used bioinformatics to relate changes in glycoprotein abundance to cell pathways in order to characterize metabolism of the cell lines.

In Chapter 3, label-free proteomics was used to characterize CHO cell lines at both exponential and stationary phases and Chinese hamster tissues. This represents the first liver and ovary proteomes developed for Cricetulus griseus. After

iv comparing the overall differences in protein abundance, both gene ontology and

KEGG pathway analyses are presented to show functional significance.

Chapter 4 describes a larger effort to map the Chinese hamster tissue

proteome. Along with CHO cell lines, we analyzed brain, heart, kidney, liver, lung,

ovary, and spleen tissue from the hamster using a labeled proteomics approach.

Overall protein intensity differences were compared between each permutation of

samples and we developed a high-abundance analysis for each tissue type. Next, the

outliers for each comparison were functionally analyzed using gene ontology and

ingenuity pathway analysis to identify up- and down-regulated pathways. Finally, a comparison between mRNA and protein expression was used from transcriptomics and proteomics combination.

Finally, Chapter 5 concludes the dissertation by drawing conclusions from the work and suggesting applications and next steps of research. We specifically summarize the proteins identified using label-free and labeled proteomics, as well as sialoproteomics and glycoproteomics. The sum total of Chinese hamster and CHO cell lines proteins is the largest to date and will be publicly available to aid further research by the CHO community.

v Acknowledgements

The completion of a doctoral program requires the support of many people

over the course of many years. Primarily, I wish to thank my advisor Dr. Michael J.

Betenbaugh for his guidance. At the start of the program, we had vague and general

ideas about how to improve our understanding of CHO cell metabolism. Over the

course of my work, we became interested in how to apply multiple ‘omics strategies

to identify novel insights about CHO. Dr. Betenbaugh has helped me to expand the

way I approach research by not limiting the possibilities. It is advantageous to try

addressing as many problems as possible and then the combination of work can

string together a story. I was also fortunate to use Dr. Betenbaugh’s extensive

relationship with the biotechnology industry to attend conferences and participate

in off-site industrial internships. Through the internships, I gained a real-world understanding of how research can be applied to improve bioprocess development.

Next, I am grateful for the opportunity to learn from a former Betenbaugh

Lab member who now heads research and development of biologics in Turkey, Dr.

Deniz Baycin Hizal. Deniz established the original CHO proteome during her time at

Johns Hopkins University and this laid the groundwork for my proteomics and

glycoproteomics analyses of CHO cell lines and hamster tissues. Her attention to

detail has instilled in me a desire to ask more questions and think analytically about

how to design and perform bioinformatics analyses.

In addition to Deniz, I also thank Dr. George Yerganian, Dr. Raghothama

Chaerkady, Dr. Robert O’Meally, Dr. Robert Cole, Dr. Amit Kumar, and Dr. Joseph

vi Priola for guidance related to ‘omics. Specifically, Dr. Yerganian shared invaluable

insight into the development of CHO cell lines from hamster tissue and provided the

female for tissue harvest. I thank Nadine Forbes-McBean and Dr. Cory

Brayton for guidance in tissue preparation. Raghothoma Chaerkady and Robert

O’Meally ran our mass spectrometry at the core facility of Dr. Cole.

Numerous people at Johns Hopkins University have also aided in my

progression as a graduate student. Members of the Betenbaugh Lab, such as Dr.

Andrew Chung, Dr. Yue Zhang, Dr. Bojiao Yin, Jimmy Kirsch, Chien-Ting Li, and

Qiong Wang, were helpful in sharing their experiences in the lab and providing

feedback about my work. I also had the pleasure of working with visiting, Master’s,

and undergraduate students.

Finally, my family has been an unwavering source of support along the

graduate school process. I look up to my parents Cheryl and Michael with the

greatest respect. They offer me guidance and set an example of what it means to

show unconditional love. I also am grateful for the support of my younger sister

Erica and have been thrilled to spend time close to her over the course of my

research. She is a strong-willed and independent woman that I hope to stay best

friends with for many years and she finally knows the difference between a

Bachelors, Masters, and Doctoral degree. In addition to my family, I value the

support of my friends as I have grown immensely during graduate school.

vii Table of Contents

Title Page i

Abstract ii

Preface iv

Acknowledgements vi

List of Tables ix

List of Figures ix

Chapter 1: Introduction to Proteomics 1

Chapter 2: Glycoproteomics Analysis of Chinese Hamster Ovary Cell Lines 30

Chapter 3: Proteomics Analysis of Chinese Hamster Cell Lines and Tissues 66

Chapter 4: Comparative Proteomics of Chinese Hamster Cell Lines and Tissues 106

Chapter 5: Conclusions 145

References 149

Curriculum Vitae 163

viii List of Tables

Table 1.1: Top Selling Biologics of 2012 24 Table 1.2: Summary of CHO Proteomics for Bioprocess Development 25

Table 2.1: Experimental Conditions 60 Table 3.1: Number of Peptides and Proteins 98

Table 4.1: TMT Doublet Information 137 Table 4.2: Top 10 Most Enriched Biological Processes for CHO-S and Tissues 138

Table 4.3: Top 10 Most Enriched Biological Processes for CHO DG44 and Tissues 139

List of Figures

Figure 1.1: Overview of Proteomics Experiment 28

Figure 1.2: Overview of Comparative Proteomics 29 Figure 2.1: Overview of Experimental Design 61

Figure 2.2: Distribution of Protein Intensity 62 Figure 2.3: Protein Intensity Heat Maps 63

Figure 2.4: Lec2 KEGG Pathway N-glycan Biosynthesis 64 Figure 2.5: Pathway Analysis Maps 65

Figure 3.1: Experimental Workflow 99

Figure 3.2: Proteomic Comparison of Samples 100 Figure 3.3: Protein Intensity Comparison of Samples 101

Figure 3.4: The Top 10 Enriched Biological Processes 102 Figure 3.5: K-Means Clustering of Molecular Functions 103

Figure 3.6: Cellular Components Enriched in Tissues 104 Figure 3.7: Pathway Map of Cell Cycle 105

ix Figure 4.1: Overview of Experimental Design and Workflow 140

Figure 4.2: Protein Intensity Analysis 141 Figure 4.3: Overall Distribution of Protein Intensity in Hamster Tissues 142

Figure 4.4: Highly Expressed Proteins 143 Figure 4.5: Comparison of Proteome and Transcriptome 144

x Chapter 1: Introduction to Proteomics

Abbreviations

CHO – Chinese hamster ovary, mAb – monoclonal antibody, MS – mass spectrometry,

DTT – dithiothreitol, CHAPS – 3-[(3-Cholamidopropyl)dimethylammonio]-1- propanesulfonate, SDS – sodium dodecyl sulfate, FASP – filter aided sample preparation, HCP – host cell protein, SILAC – stable isotope labeling with amino acids in culture, iTRAQ – isobaric tags for relative and absolute quantification, TMT

– tandem mass tags, MALDI TOF – matrix-assisted laser desorption/ionization time of flight, AEC – adenylate energy charge, VCP – valosin-containing protein

1.1 Summary

The genetic sequencing of Chinese hamster ovary (CHO) cells initiated the

systems biology era for biotechnology applications. In addition to genomics, now the

critical ‘omics data sets also include proteomics, transcriptomics, and metabolomics.

In recent years, the use of proteomics in cell lines for recombinant protein

production has increased significantly because proteomics is the study of all

proteins in an organism and their changes over time, and thus it is important for

applications in cell culture such as bioprocess development. Specifically,

identification of proteins that affect cell culture processes can aid efforts in cell line

engineering to improve growth or productivity, delay onset of apoptosis, or utilize

nutrients efficiently. Optimizations of sample preparations and database

development have improved the quantity and accuracy of identified CHO proteins.

1 The applications are widespread and suggest further applications of proteomics or

combined ‘omics experiments in the future.

1.2 CHO Hosts for Biotherapeutics Production

CHO cells are the desired recombinant protein production hosts. Their

growth rate in cell culture is scalable for high-density production of biotherapeutics.

As mammalian cells, CHO cells provide similarity in glycosylation pathways that allow for human biotherapeutic compatibility. In 2012, five out of the top five best- selling biologics were produced in CHO cells as shown in table 1.2 [1]. The top biotherapeutic is Humira, produced using the CHO expression system for the treatment of rheumatoid arthritis [1]. Characteristics such as human-compatibility

and manufacturing scalability help explain CHO cells’ widespread dominating use in

biotechnology applications.

Improved understanding of CHO cell physiology resulted from the recently

completed genome sequence [2] [3]. From this information, a variety of methods

have been used to quantify the genome, transcriptome, proteome, and metabolome.

These data sets offer new insights into cell physiology. Baycin-Hizal et al. completed

the proteome of the CHO cell line and included information on intracellular, secreted,

and glyco- proteins [4]. This study complemented the results from the CHO genome

[3] using a codon frequency analysis; the differences between CHO and human cells

may aid development of biocompatible therapeutics [4]. Additionally, this study

combined proteomic and transcriptomic data to analyze pathway changes; for

example, pathways such protein processing and apoptosis are enriched at the

protein level whereas steroid hormone and glycosphingolipid metabolism are

2 depleted at the protein level [4]. The complete CHO proteome offers new insights into bioprocess development by increasing knowledge of the most-widely used production host.

Proteins have diverse functions in the cell and are involved in growth, signaling, regulation, and metabolism. Their rapidly changing levels provide important information about subtle changes in the cell that may not be detected by differences in the transcriptome or genome. A variety of methods are used to generate large, complex data sets, which require processing and analysis to reveal useful information about cellular phenotype. Quantification of protein levels provides clearer understanding of cell physiology and can lead to development of improved cell culture for biotechnology applications.

Production of monoclonal antibodies (mAbs) for therapeutic use requires development of a scalable and consistent bioprocess, ensuring high yield and purity of the biotherapeutic. Proteomics can also be used during development in order to identify proteins that affect cell culture growth, apoptosis, recombinant protein productivity, and product quality. This information can suggest methods to improve the bioprocess through cell line engineering and rational media formulation, for example. Recently, combined ‘omics approaches were utilized. These approaches provide useful insight because direct 1:1 correlation between approaches rarely exists. The combination improves the reliability and accuracy of the results from either approach alone.

This chapter highlights how proteomics can be used for cell culture applications. Proteomics can provide identification and quantification of thousands

3 of cellular proteins and can be used to increase understanding of production hosts

such as CHO cells. In recent years, the number of published proteomic data sets has

expanded due to the availability of the CHO genome and new techniques have been

introduced to increase protein identification and accuracy.

The use of proteomics in cell culture applications is widespread. Increasing

application of proteomics for cell culture requires advanced methods, such as the

optimization of sample preparation, digestion, labeling and mass spectrometry (MS).

Proteomics is increasingly applied to cell culture in order to understand cell lines

and aid in cell line engineering or process development efforts to increase cell

growth, increase recombinant protein productivity, or maintain product quality, for

example. In the following sections, proteomics will be examined in great detail with

respect to the biotechnology industry.

1.3 Proteomics Method Optimization

Following the initial proteomics experiments in CHO cells, there have been widespread refinements in methods in order to improve the recovery of cell proteins and aid in their identification. The ability to identify increasing numbers of proteins with high accuracy is dependent on optimized sample preparation methods.

Proteomics methods include extraction, reduction, alkylation, digestion, and peptide

fractionation prior to quantification with MS. After cell culture extracts are obtained, the cells are lysed are extracted. Following the reduction, alkylation, digestion, and fractionation, the peptides can be individually identified. This requires use of software and databases to match the mass spectra to specific sequences and thus proteins. An overview of the proteomic workflow is shown in figure 1.1. In recent

4 years, there have been numerous developments to improve proteomic methods

including the optimization of sample preparation, digestion, optional labeling, and

MS.

1.3.1 Sample Preparation

Prior to advancements in data analysis, it was necessary to improve the

preparation of high-quality peptide samples by sample preparation using different

extraction and digestion techniques.

In order to improve protein recovery from CHO cells for two-dimensional

(2D) gel electrophoresis, the concentration of solubilizers such as urea, dithiothreitol (DTT), 3-((3-cholamidopropyl) dimethylammonio)-1- propanesulfonate (CHAPS), and sodium dodecyl sulfate (SDS) was optimized [5].

Maintaining solubility enables recovery of proteins with diverse physical and chemical properties. The optimum solubilizing factors for CHO cells were studied using a design of experiments approach; DTT, urea-DTT cross-interaction, and urea-

CHAPS cross-interaction were positive factors that improved protein recovery [5]. A final solution composition of 8M urea, at least 32.5mM DTT, and at least 2% CHAPS was selected for CHO cell lysates [5].

Besides the extraction efficiency, digestion efficiency is another criterion for increasing correct protein identifications. Two methods of in-gel and in-solution digestion methods can have separate advantages. In-gel digestion involves

solubilizing proteins with detergent and separating proteins by gel electrophoresis.

Separated proteins are then digested from the gel and quantified by MS. In-solution

digestion involves extracting proteins with strong reagents and digesting in the

5 solution. There are advantages and disadvantages to both methods. In-gel digestion

can protect against impurities but the protein recovery is poor in comparison to in-

solution digestion. On the other hand, in-solution digestion is efficient to implement but there is greater risk of impurities or incomplete solubilization. SDS is one of the

major detergents, which can be used for full extraction of the cell lysates including

insoluble membrane proteins; however, SDS has to be removed prior to MS analysis.

Filter-aided sample preparation (FASP) was developed to remove SDS prior to

digestion and MS [6]. This method improvement allows for more complete coverage of the proteome. The recent development of FASP was used for the CHO proteome prior to trypsin digestion in order to maximize the protein recovery [4].

Both in-gel and in-solution digestions continue to be used for biotechnology applications and there are various examples. In-gel digestion was used to identify important proteins for cell line engineering efforts [7] [8] [9]. Another application of in-gel digestion aided in quantifying differences between cell lines [10] [11]. In-gel digestion was used to identify phosphorylated proteins from CHO-K1 cell culture

[12] and was also used to elucidate the proteome of the CHO DG44 cell line [13].

In-solution digestion overcomes some of the limitations of in-gel digestion such as difficulties in separating proteins with low molecular weight, high molecular weight, or hydrophobic properties. Meleady used in-solution digestion to profile protein levels between cell lines with or without miR-7 overexpression [14]. In- solution digestion was also used to identify secreted proteins from the CHO-S and

CHO DG44 cell lines [15].

6 In other experiments, a combination of both in-gel and in-solution digestions was used [4]. Both in-gel and in-solution digestions were also used recently to analyze the CHO proteome [14]. Digestion methods vary but are important for preparing peptide samples for MS. An optional step that is now widely used for comparative proteomics is labeling of digested peptides.

Following digestion, fractionation can be used to increase protein identification. In one example, basic reversed phase liquid chromatography was used to separate samples into 96 fractions that were combined into 48 fractions for

MS analysis [4]. Such method refinements have increased the quantity of identified proteins.

1.3.2 Targeted Proteomics

In some applications, it is useful to target the identification of proteins related to a specific organelle or compare intracellular and extracellular compartments. Two recent examples include identification of the secretome and mitotic spindle, respectively.

The secretome consists of host cell proteins (HCPs) that must be removed prior to formulation of the final drug product. Identification of secreted proteins is limited by their low abundance. Design of experiments was used to optimize sample preparation methods to increase protein recovery for HCP identification [16].

Precipitation parameters such as the precipitant chemical, precipitant concentration, and incubation length, were evaluated for both gel-based and shotgun proteomics

[16]. The results were used to optimize a method for identification of HCPs, which differ in physical and chemical properties, as well as their physiological function;

7 this method used methanol precipitation for one hour to identify 178 HCPs, such as

clusterin, beta-actin, glyceraldehyde-3-phosphate dehydrogenase, and

immunoglobulin superfamily member 8 [16]. Optimization of sample preparation is critical for the secretome to maximize the recovery of low abundance proteins.

The mitotic spindle proteins were identified in CHO cells [17]. Cell division is

an important event to study, as it relates to increasing the growth of cell lines

expressing recombinant protein. Isolation of the spindle aids in identification of

factors affecting division and thus growth. After synchronizing all cells in metaphase,

over 1100 proteins were identified, of which 239 proteins were cell division factors

and 841 proteins were involved in early stages of division [17]. Of the proteome, 11

percent of proteins localized to the membrane, 7% were associated with

microtubules, and 3% were associated with actin [17]. Identification of cell division

factors can aid bioprocess development due to their importance in cell growth.

1.3.3 Protein Labeling

Labeling strategies are used in proteomics approaches to provide the relative

quantification in addition to the identification of proteins. Recent methods for

labeling proteins include stable isotope labeling of amino acids in culture (SILAC),

isobaric tags for relative and absolute quantification (iTRAQ), and tandem mass tags

(TMT). Various labeling methods as well as label-free methods are used for relative

quantification.

In SILAC, proteins are labeled when amino acids from the culture medium

are incorporated into the cell. Incorporation of amino acid labeling into proteins can

be detected by MS. SILAC experiments involve adapting cells to labeled media, cell

8 growth, and protein identification by MS and data analysis [18]. Both in-gel and in-

solution digestion can be used to extract proteins prior to MS. SILAC labeling

experiments provide information on protein dynamics because they quantitate the

incorporation of labeled amino acids. This method introduces experimental errors,

such as incomplete amino acid isotope incorporation and sample mixing errors [19].

To reduce error, label-swap replication experiments were used to average ratios of

individual replicates and validated with a triplet experiment [19]. The results

corrected for incomplete labeling and arginine to proline conversion, thus obtaining

consistent experimental ratios [19].

An alternative labeling method is iTRAQ, which is used to label proteins in a cell extract before proteins are run in the MS. In this method, Tag-specific reporter ions represent relative protein levels. The benefit of this method is that cells do not require adaptation to labeled medium because the labeling is performed on protein extracts. It is important to equalize masses of iTRAQ reagent added across samples because the quantification is relative.

A comparison of iTRAQ versus nonisobaric labeling (mTRAQ) revealed that iTRAQ labeling identifies over double the total number of proteins and approximately triple the phosphopeptides as compared to mTRAQ [20].

Additionally, kinase identification was significantly increased by using iTRAQ, thus aiding knowledge of protein-protein interactions involved in recombinant protein production [20]. All of these proteomics experiments can elucidate new information about CHO cells that leads to advancements in their use as biopharmaceutical production hosts.

9 1.3.4 MS

High-quality digested peptides are injected for MS. Components of the MS

include the source, which produces gas phase ions from the sample; mass analyzer,

which resolves the ions based on mass to charge ratio; and detector, which detects

ions that have been resolved by the analyzer. Both tandem MS and matrix-assisted

laser desorption/ionization time of flight (MALDI TOF MS) have been used to

identify and quantify proteins in cell lines with different growth or productivity

characteristics [7-9, 11, 12, 21].

Following MS, peaks are identified and attributed to specific proteins. Proper identification requires the use of search engines and databases. Examples of search engines include TagRecon and MyriMatch [4]. Examples of databases include NCBI

[7, 8, 12, 13], Swiss-Prot [9, 10, 22], OWL [9], and PANTHER [23]. These databases

increase confidence of successful protein identifications.

1.4 Proteomics for Bioprocess Development

Bioprocess development involves the steps necessary to produce high

quantity and quality of recombinant proteins such as mAbs. A major bottleneck in

the manufacture of biotherapeutics is efficient production. There is widespread

need for cell lines stably expressing recombinant protein at high yield and quality.

Recently, proteomics has been used to identify proteins that play an important role

in recombinant protein production for bioprocess optimization. Comparison of high

and low expressers can provide significant insights and understanding to the

recombinant protein production process, which aids developments for increasing

cell growth and optimizing the medium formulation. A labeling coupled comparative

10 proteomics strategy, which can be applied for comparing high and low expressers is

shown in figure 1.2.

Recently, comparative proteomics has been used in many studies to gain

insights for increasing the productivity of mAbs. Table 1.2 summarizes the recent

publications in this area.

1.4.1 Increase Cell Growth Rate and Viable Cell Density

A critical bioprocess development goal is to maximize growth rate in order to increase the produced drug quantity and decrease the cost of biotherapeutics. A combined approach using transcriptomics and proteomics was enacted to identify candidates for a high growth phenotype [24]. The proteomics results showed that

285 proteins were differentially expressed between cell lines with fast and slow growth rates, highlighting a correlation between protein synthesis and cell growth

[24]. Benefits of combining the ‘omics approaches include accounting for low abundance protein expression and identifying mRNA post-translational processing, thus reducing error [24]. Analysis showed that the most significantly enriched gene ontology terms were translational elongation, translation, generation of precursor metabolites and energy, oxidation-reduction, and aerobic respiration [24].

Combination of ‘omics strategies improves the confidence of either data set alone and results can help to generate cell lines or culture conditions that improve growth.

Comparative proteomics relates protein levels between cell lines. Commonly, cell lines with different growth and productivity characteristics are compared to identify differences in protein levels that correspond to these differences. A combined approach using transcriptomics and proteomics was used to compare cell

11 lines with fast and slow growth [21]. Valosin-containing protein (VCP) was shown to have a significant effect on cell growth and viability. This was confirmed by silencing

VCP expression, which resulted in decreased viable cell density and viability [21].

Proteomics was beneficial in the analysis of protein levels to reveal differences in

VCP between cell lines.

Proteomics was used to compare CHO cell lines with and without the cMyc gene, which had been previously shown to improve cell growth rate and concentration [10]. Over 100 proteins were differentially expressed between cultures. Culture performance improved as measured by the increase in nucleolin

(important for growth, preventing apoptosis, protein productivity, and energy

utilization) and decrease in regulation of adhesion proteins between cells and cell-

matrix [10]. Specifically, components of ATP synthetase were up-regulated which

may indicate a change in energy utilization to release more ATP in cMyc culture [10].

From the quantification of protein levels, it was possible to identify critical proteins

that may be further investigated through hypothesis-driven research.

One way to understand cell death is to elucidate what happens at the

transition of cell culture from exponential growth to stationary phase. Fast cell

growth initially occurs during the exponential phase; then cells transition into a

period of high protein production during the stationary phase. Comparative

proteomics, using iTRAQ labeling, identified 59 proteins with significantly different

protein expression levels between the exponential and stationary phases [25]. Some

of these proteins included binding immunoglobulin protein, protein disulfide

isomerase, DNA replication licensing factors MCM2 and MCM5, transglutaminase-2,

12 and clusterin [25]. These time points were compared in order to identify proteins associated with cell growth and apoptosis. After identifying and classifying differentially expressed proteins, it was found that growth and apoptotic proteins

are expressed highly during the stationary phase [25]. Results from this study aid in

understanding dynamic cell culture changes and can be used to identify protein

markers for prolonged exponential phase and improved process development.

In a time-course experiment, protein samples collected throughout cell culture were analyzed by proteomics [26]. This experiment identified 40 differentially expressed proteins between exponential and stationary phases [26].

These results show that over time, apoptosis occurs due to the initiation of the unfolded protein response [26]. Thus, delaying apoptosis can increase protein production. This was validated in another study, where coexpression of the anti- apoptotis gene Bcl-xL with membrane proteins increased membrane protein expression in CHO cells [27].

1.4.2 Increase Recombinant Protein Productivity

An important goal in bioprocess development is to maximize the

recombinant protein productivity. In one experiment, proteomics was used to

identify proteins related to the improved protein productivity following

supplementation of butyrate and zinc sulfate to culture medium [9]. Increased

expression was observed for metabolic and chaperone proteins including GRP75,

enolase, and thioredoxin [9]. These proteins were identified as possible cell

engineering targets for improving recombinant protein productivity.

13 In another experiment, cell line comparison between high and low producing

cell culture was conducted. Overexpression of Bcl-xl reduced apoptosis and

increased growth. Proteomics analysis showed that over 30 proteins were

differentially expressed between high and low producing cultures [28]. In the high

producing cell line, eukaryotic translation initiation factor 3 and ribosome 40S were upregulated, whereas vimentin, annexin, and histone H1.2/H2A were downregulated [28]. Additionally, chaperone BiP was upregulated in the high producing cell line, from which it was concluded that the unfolded protein response

occurs as a consequence of endoplasmic reticulum stress [28].

Combined proteomics and transcriptomics were used to determine the effect

of low temperature and sodium butyrate on recombinant protein productivity [29].

This approach relied on the transcriptomic information to identify hundreds of

differentially expressed genes between treatments. From this data, proteomics was

used to further identify different protein levels. Butyrate treatment and low

temperature were shown to improve recombinant protein productivity by

improving cell secretory capacity [29]. Enriched pathways included Golgi processing, cytoskeleton binding, and GTPase mediated signal transduction [29].

In more recent years, the development of the CHO genome database has led to improvements in protein identifications. Different cell culture conditions were used for proteomics experiments in order to compare high and low producing cell lines [30]. From 180 differentially expressed proteins, there were 12 proteins that showed different expression levels over culture duration in bioreactors, including

ADP-ribosylation factor protein, V-type proton ATPase, colony stimulating factor 1,

14 and angiopoietin 4 [30]. The important proteins had functions related to growth, metabolism, organization, and protein synthesis [30], which can be used for process optimization by genetical or media supplement technologies.

Meleady also investigated differences in protein levels between high and low producing cultures [22]. There were 89 differentially expressed proteins between the high and low producing cultures [22]. Proteins were identified that contributed to the high producing phenotype [22]. This suggests that future applications overexpressing these proteins could improve cell line productivity. There were 12 proteins identified that differed in expression level in the same direction, including aldose reductase-related protein 2, annexin, eukaryotic translation initiation factor, glucose-6-phosphate 1-dehydrogenase, endoplasmin, and nuclear migration protein

[22]. Examples of proteins that were expressed at higher levels in the high- producing cell line include proteins involved in translation and folding [22] which are associated with recombinant protein production.

Overexpression of micro-RNAs may improve process development by regulating protein productivity. Proteomic comparison of a microRNA overexpressing cell line with a control cell line revealed differences in protein expression contributing to higher productivity in the miR-7 cell line [14].

Overexpression of miR-7 resulted in 93 downregulated proteins and 74 upregulated proteins [14]. Proteins with decreased levels were related to protein translation and

DNA/RNA processing and with increased levels were related to protein folding and secretion [14]. Upregulated proteins involved in protein folding and secretion may serve as cell line engineering targets for increasing recombinant protein

15 productivity [14]. The results show how miR-7 overexpression changes protein levels, which can aid in future cell line engineering to improve process performance.

In addition to the increased protein productivity, overexpression of microRNAs can affect other processes such as cell growth and apoptosis.

The genetic factors attributed to high productivity of a cell line were studied by maintaining the same process conditions [31]. Results from proteomics indicated positive correlation between productivity and the expression of dihydrofolate reductase, adaptor protein complex subunits, DNA repair proteins, and the ER translocation complex components [31]. Both transcriptomics and proteomics data was combined in order to improve understanding of the high production phenotype

[31]. Protein expression differences suggest important roles for these proteins in productivity. Such results can be used to generate high producing clones through cell line engineering and improved clone selection using biomarkers.

1.4.3 Optimize Media Formulation

The growth and productivity of a cell line is highly dependent on the media formulation. Media formulation is a key component of process development and methods to reduce development time are highly sought. The formulation of cell culture medium can have a significant effect on cell growth, viability, and protein productivity. Proteomics provides detailed information about cell protein levels that can aid in medium formulation. This drives hypothesis-based research to further optimize process development.

Recently, proteomics has been used to profile differences between medium formulations in order to correlate improved cell growth with protein levels.

16 Proteomics helps identify proteins for adaptation from serum-bearing to serum-free

media [7]. Results indicated that two molecular chaperones and four de novo

nucleotide synthesis related proteins were significantly increased in the serum-free

cell culture [7]. Subsequently, two of the chaperones identified (HSP60 and HSC70)

were overexpressed and this increased the cell growth rate up to 15% and

decreased adaptation time up to 33% [7]. Thus, quantification of protein levels aids

in cell line engineering strategies to improve cell growth and adaptation. It is

possible to identify candidate proteins through proteomics that play essential roles

in CHO cell culture.

The secretome consists of extracellular proteins processed through the

secretory pathway. These extracellular molecules are in low abundance, but

involved in diverse biological processes. In one approach, secreted proteins from

conditioned media samples were identified in order to develop a serum-free media

formulation [32]. Supplementation of identified growth factors, such as fibroblast

growth factor 8, growth regulated alpha protein, hepatocyte growth factor, and

macrophage colony stimulating factor 1, to the serum-free cloning media

formulation led to increased cell growth [32].

Analysis of the secretome can aid in the identification of proteins that have

important roles in cell growth and protein productivity. These proteins could be effective medium supplements for an optimized process. N-azido-galactosamine labeling was used to tag the mucin-type O-linked glycans of secreted proteins in order to enable their identification in cell-conditioned media [15]. This method helps to identify secreted proteins in low abundance and minimize background

17 proteins [15]. The secretome of CHO-S and CHO DG44 cell lines were compared and

it was found that 171 proteins were identified in both cell lines [15]. There were

also 96 proteins unique to CHO DG44 and 85 proteins unique to CHO-S [15].

Differences in protein levels were related to adhesion, cell growth, and proteases

[15]. Approximately 70 percent of the proteins identified were the same between

the CHO-S and CHO DG44 cell lines [15]. Important secreted proteins may be

investigated as medium supplements. Identification of important secreted proteins

aids in the development of media formulations to improve growth; it is also critical

that secreted proteins are identified and removed from the final biotherapeutic drug

product.

In another experiment, label-free comparative proteomics was used to identify differences between cells cultivated in serum-free medium formulations with or without hydrolysates [33]. The changes in protein expression upon addition of hydrolysates, containing blends of peptides, free amino acids, vitamins, and trace elements, helped to explain the increased recombinant protein productivity [33].

Proliferative proteins were upregulated whereas pro-apoptotic proteins were

downregulated in the culture containing hydrolysates [33].

Proteomics can thus be used to identify differences in protein expression

between media formulation and help to optimize formulations for high growth and

productivity. These studies highlight the applications of proteomics to understand

the roles of media supplements on cell growth and recombinant protein production

as well as to identify novel media supplements for the optimization of media

formulation.

18 1.4.4 Systems Biology Full characterization of the proteome significantly aids efforts to understand

cell physiology by analysis of metabolic pathways. The complete proteome was

recently elucidated for the CHO-K1 cell line [4]. Analysis of genomics, transcriptomics, and proteomics data at the systems biology level using Kyoto

Encyclopedia of Genes and Genomes (KEGG) pathway analysis revealed that pathways for protein processing and apoptosis were enriched in CHO-K1, whereas

pathways for steroid hormone and glycosphingolipid metabolism were depleted [4].

The glycoproteomics study aided the identification of the major cell adhesion and

membrane proteins, which are difficult to identify with intracellular proteomics

approaches. Importantly, Baycin-Hizal et al. used both genomics and proteomics

information to generate the codon usage table for CHO for the first time. Codon

optimization is the main preferred method for the increased expression of drugs for

that reason CHO specific codon usage table will help biopharmaceutical companies

to increase the recombinant protein expression [4]. Proteomics was thus able to

elucidate new information about CHO protein levels that was not detectable by

genomics or transcriptomics.

In another experiment, various databases were used to improve the

confidence of identified CHO cell proteins [34]. By using multiple databases, it was possible to increase the number of identified proteins by over 40% [34]. Different

methods were used across locations and the results were compared to improve

confidence. Thus, proteomic data is reproducible and is improved by comparing

results to various CHO cell specific databases.

19 1.5 Database Development

In recent years, CHO-specific databases were developed to improve sequence

information for protein identification. Prior to this, CHO-specific sequences were not

publicly available requiring researchers to rely on cross species information.

Following the completed genome of the CHOK1 cell line, the draft Chinese hamster

genome was published [2, 35]. The published data sets from these genomes provide

unifying information about the genes and variations between CHO cell lines. Such

genomic data sets are now available online and have been compiled into databases

such as www.chogenome.org. Similar efforts have been made to establish proteomic

databases for CHO cells, including www.gofant.com/CHOproteomemainimage.html.

An evaluation of CHO databases increased protein identification by 282 proteins, a

40-50% increase in the total number of proteins identified [36]. CHO sequence

information was combined from the SwissProt database, CHOK1 draft genome [35],

and the Bielefeld-BOKU-CHO database [36]. Because more peptides matched, there is increased confidence in the results. The increased proteins identified were related to protein translation and energy metabolism, both important for bioprocess development [36]. The compilation of data sets from proteomics experiments around the world allows for data sharing and database generation. Multiple groups can use data to improve proteomics results.

1.6 Combined ‘Omics

The CHO genome was recently elucidated, providing high-level information

about cell function. Moving from the genome to transcriptome provides information

about the mRNA transcribed from DNA; subsequently the proteome provides

20 information about the proteins translated. The metabolome is the final step toward

understanding cell function. Small changes in metabolite levels can provide

information that would not have been previously possible with only a genomic,

transcriptomic, or proteomic approach.

Combined proteomics and transcriptomics was used to relate the CHO

proteome, secretome, and glycoproteome to mRNA data sets in order to provide

new information about amino acid utilization pathways, highly expressed proteins,

and depleted or enriched pathways [37]. The combined approach allowed for

relationship between mRNA and protein levels. Stable proteins were identified as

those with high protein intensity and low mRNA intensity, such as tubulin, myosin,

and binding proteins [4]. A combination of proteomic and transcriptomic data with

metabolomics could further increase understanding of CHO cell pathways relevant

for cell growth, recombinant protein production, and glycosylation allowing for new

engineering strategies.

A combination of proteomics and metabolomics was used to increase

understanding of CHO cells during prolonged cultivation [11]. The observation that

prolonged culture results in increased cell growth was related to an increase in

adenylate energy charge (AEC) and differential protein expression [11]. The increased AEC relates to a high energetic state of the cell. Additionally, 43 differentially expressed proteins were identified [11]. Some of the proteins, including phosphoglycerate mutase, phosphoglycerate kinase and pyruvate kinase isozymes, were associated with the improved growth rate [11]. Other differentially expressed endoplasmic reticulum stress proteins were related to cell robustness to

21 changing environmental conditions [11]. The combined approach was useful for

determining correlations between the different protein and metabolite levels and

the observed cell phenotype.

Recent approaches have used metabolomics in combination with other – omics approaches. Transcriptomics, metabolomics, and fluxomics were used to identify differences between a recombinant protein producing cell culture and the parental Hek-293 cell line [38]. This combined approach elucidated new information about cell line differences and indicated possible targets for metabolic engineering. The producer culture consumed less glucose and produced less lactate than the parental cell line. Nucleotides and nucleotide sugars were less abundant in the producer culture, as well as most glycolytic and TCA cycle intermediates [38].

With a combined ‘omics approach, Dietmair showed that the lower metabolite levels corresponded to downregulation of transcription and translation in the producer culture. The producer culture was shown to have an increased uptake of amino acids and increased expression of transporters needed for amino acid transport to growing proteins

1.7 Summary of Proteomics

Current developments in biotechnology rely on the wealth of information provided by the genome, transcriptome, proteome, and metabolome. Protein levels provide clear information about cell physiology, thus, proteomic methods are important for hypothesis-driven research and development. A variety of different digestion techniques and proteomics techniques have been developed for CHO proteomics. In addition to this, a variety of labeling techniques such as SILAC, iTRAQ

22 and TMT have been used to study differences between cell lines and analyze important pathways such as apoptosis, growth, and protein production. Not only does proteomics aid the identification of novel biomarkers for targeted mAb development, but it also aids in identifying protein-protein interactions during disease progression. In bioprocess development, applications of proteomics include identification of accumulating or depleting protein levels important in recombinant protein production. From this information, cell line engineering approaches or medium formulation developments help in the prevention of apoptosis, improvement of productivity, or utilization of nutrients. Thus, ‘omics approaches enable research that delves deeper into the workings of the cell.

In summary, proteomics is increasingly used in bioproduction to improve development of current biotherapeutics by increasing cell growth, minimizing cell death, and increasing recombinant protein production. This helps in the manufacture of biotherapeutics at large scale and with high product quality, reducing the development time. In recent years, sample preparation methods for proteomics have been optimized in order to obtain the maximum number of correct protein identifications. This has enabled more developed databases and identification of proteins localized to cell membranes, the cytoplasm, or other organelles. Future refinements in the preparation and analysis of proteomics samples will allow for in-depth information about critical proteins in biotechnology research and development.

23 Tables

Table 1.1: Top Selling Biologics of 2012

24 Table 1.2: Summary of CHO Proteomics for Bioprocess Development

Reference Purpose Method Conclusion

Identified increased levels of GRP78 To determine differences in Used in-gel digestion for and peroxiredoxin following proteins between CHO cell proteins. Used MALDI TOF treatment with sodium butyrate. Baik 2008 cultures with or without MS and MS/MS to identify Phosphopyruvate hydratase levels sodium butyrate proteins decreased with sodium butyrate supplementation treatment

Identified increased levels of HSP60 To determine differences in Used in-gel digestion for and HSC70 in serum-free media. protein levels during proteins. Used MALDI TOF Subsequent overexpression Baik 2011 adaptation to serum-free MS and MS/MS to identify resulted in improved cell medium proteins concentration during serum-free adaptation

Used SPEG to generate glycoprotein fractions and To identify the proteome, Identified important proteins in in-gel and in-solution Baycin-Hizal secretome, and CHOK1 cell line and combined digestion for proteins. 2012 glycoproteome of CHOK1 proteomic and transcriptomic data Used MS/MS to measure cell line for improved analysis proteome, secretome, and glycoproteome

Found differentially expressed proteins between To compare the proteomes cultures.Eukaryotic translation Proteins were digested Carlage 2009 of high and low producing initiation factor 3 and ribosome 40S and identified by LC-MS cell cultures over time were upregulated and vimentin, annexin, and histones were downregulated in the high producer

Identified proteins with changing To determine the changes levels over time from exponential to in the proteome over time Used iTRAQ labeling and Carlage 2012 stationary transition and related the for a CHO cell culture LC-MS to identify proteins differences to cell growth and overexpressing Bcl-Xl apoptosis

Compared miRNA, mRNA, and To determine important protein expression levels and proteins that elucidate how Used LC-MS to identify Clarke 2012 identified processes regulating cell miRNAs affect CHO cell proteins growth, such as ribosome synthesis, growth translation, and mRNA processing

To combine Identified valosin containing protein transcriptomics and Used in-gel digestion for as important regulator of cell Doolan 2010 proteomics to identify proteins. Used MALDI TOF growth. Overexpression resulted in important proteins that MS to identify proteins improved growth with no decrease relate to high growth rate in viability

Collected spent media Comparison of high and low samples for host cell expressing clones revealed 180 To identify proteins that proteins. Used in-gel differentially expressed proteins. Dorai 2013 affect high productivity in digestion for proteins. Identified proteins related to bioreactor cell cultures Used LC-MS to identify cytoskeletal organization, protein proteins synthesis, metabolism, and growth

25 Identified proteins with positive Used LC-MS/MS shotgun correlation to productivity, To identify differences proteomics to identify including DHFR, adaptor protein Kang 2013 between cell lines that protein expression complex subunits AP3D1 and affect antibody production differences between cell AP2B2, DNA repair protein DDB1, lines and ER translocation subunit SRPR

Found significant changes in protein To determine the effect of Used 2D gel expression in serum-free medium hydosylate electrophoresis combined supplemented with hydrosylates, Kim 2011 supplementation on with nano LC-ESI-QTOF such as upregulation of metabolic, antibody productivity in MS/MS to identify cytoskeletal organization, and serum-free cell cultures proteins growth regulated proteins

Found increase in nucleolin and decreased in regulation of proteins Used in-gel digestion for Kuystermans To determine the effect of related to matrix and cell adhesion. proteins. Used MS/MS to 2010 cMyc on proteome Also found increased ATP identify proteins synthetase and mitrochondrial protein levels

Improved protein identification by Used in-gel digestion for enrichment of medium and low To identify the proteome of proteins. Used MALDI TOF Lee 2010 abundance proteins. Most CHO DG44 cell line MS and MS/MS to identify identified proteins function in proteins energy metabolism

To identify growth factors Identified 290 secreted proteins that can be used as Used MS shotgun from CHO cell culture including 8 Lim 2013 supplements in serum-free proteomics to identify novel growth factors. Used growth cell cultures to improve proteins in spent medium factors as medium supplements to antibody production increase cell growth rate

To compare the proteomes Used in-gel digestion for Idetified 89 proteins with different Meleady of high and low producing proteins. Used LC-MS to expression between high and low 2011 cell cultures over time identify proteins producer cell lines

Identified 93 decreased proteins and 74 increased proteins resulting Used in-solution digestion from miR-7 overexpression. To determine the effect of Meleady for proteins. Used label- Decreased proteins related to miR-7 overexpression on 2012 free LC-MS to identify protein translation and DNA/RNA proteome proteins processing. Increased proteins related to protein folding and secretion

Used in-gel and in- solution digestion for To improve identification Improved protein identification by Meleady proteins. Used MALDI TOF of proteome by using 40-50% through mulptiple CHO- 2012 MS and electrospray ion multiple databases specific databases trap MS to identify proteins

Cell culture medium Identified differences between To develop a method for supplemented with CHO-S and CHO DG44 secreted Slade 2012 identifying secreted GalNAz and enriched by proteins. Found 70% similarity proteins copper catalyzed click between cell lines chemistry. Used iTRAQ labeling and MS to

26 identify proteins

To identify important Used in-gel digestion for Identified increased expression of Van Dyk proteins that affect protein proteins. Proteins GRP75, enolase, and thioredoxin in 2003 productivity in butyrate identified with MALDI-TOF response to media supplements and zinc treated culture MS

After prolonged cultivation, Used in-gel digestion for To identify the proteome of identified 40 proteins with different proteins. Electrospray Wei 2011 CHO cells during prolonged expression levels related to ionization tandem MS cultivation cytoskeletal proteins, chaperones, used to identify proteins and metabolic enzymes

27 Figures

Figure 1.1: Overview of Proteomics Experiment. After proteins are extracted from cell culture, sample preparation involves reduction, alkylation, filter aided sample preparation, and fractionation. Peptides are injected into the mass spectrometer and the resultant peaks are analyzed. Proteins are identified and quantified using CHO-specific databases and statistical pathway analysis.

28 Figure 1.2: Overview of Comparative Proteomics. Comparative proteomics can be used to identify differences in protein expression levels between culture conditions. Following sample preparation, peptides are labeled separately with unique tags and then mixed in equal amounts prior to mass spectrometer injection. Peptides are identified and quantified using CHO-specific databases.

29 Chapter 2: Glycoproteomics Analysis of Chinese Hamster Ovary

Cell Lines

Abbreviations

CHO – Chinese hamster ovary, ER – endoplasmic reticulum, CDG – congenital disorders of glycosylation, MS – mass spectrometry, iTRAQ – isobaric tags for relative and absolute quantification, SPE – solid phase extraction, FBS – fetal bovine serum, DMEM – Dulbecco’s modified essential medium, TCEP – tris (2-carboxyethyl) phosphine, SDS – sodium dodecyl sulfate, PAGE – polyacrylamide gel electrophoresis, SCX – strong cation exchange, LC – liquid chromatography, GO – gene ontology, KEGG – Kyoto encyclopedia of genes and genomes, IPA – Ingenuity pathway analysis, PCA – principal component analysis, ANOVA – analysis of variance,

MGAT2 – mannosyl (alpha-1,6-)-glycoprotein beta-1,2-N- acetylglucosaminyltransferase, B4GALT3 – beta-1,4-galactosyltransferase 3,

HSP90B1 – heat shock protein 90 beta family member 1, ADAM10 – a disintegrin and metalloproteinase domain 10, CTSF – cathepsin F, ICAM1 – intracellular adhesion molecule 1, MAN2B1 – mannosidase alpha class 2B member 1, BMP1 – bone morphogenetic protein 1, HEXA – hexosaminidase subunit alpha, ST3GAL1 –

ST3 beta-galactoside alpha-2,3-sialyltransferase 1, LGMN – legumain, 1ADGRG1 – adhesion G protein-coupled receptor G, HSPA5 – 78 kDa glucose-regulated protein,

HYOU1 – hypoxia up-regulated 1, GAA – glucosidase alpha, HSPA8 – heat shock cognate 71 kDa protein, GBA – glucosylceramidase, CSPG4 –chondroitin sulfate

30 proteoglycan 4, RPN1 – ribophorin 1, OST –oligylsaccharyltransferase, SILAC –

stable isotope labeling with amino acids in cell culture, TMT – tandem mass tags

2.1 Introduction

Biotherapeutics expressed in Chinese hamster ovary (CHO) cell lines

generate billions of dollars in revenue due to the similarities in glycosylation

patterns to humans and their applicability for therapeutic drug synthesis to treat

human diseases and conditions. An important characteristic that CHO cells share in

common with humans is that they are able to modify their glycoproteins by adding

sialic acid sugar groups. CHO contains different sialylation patterns, however, than

those in humans due to alpha-2,3-sialyltransferase activity in CHO cells which

differs from the alpha-2,6 sialic acid linkages found in humans [39]. Variations in

glycosylation patterns affect proteins’ biological recognition, including the ability to

remain in the human body for circulation and the stability of the protein’s structure.

These characteristics are crucial to the actions of protein drugs in terms of their

function, half-life, and immune response in the body. Therefore, investigating the

effects of different glycosylation patterns and mutations in CHO cells are essential to

their proteins’ potential therapeutic applicability.

N-Glycosylation is a process that begins in the endoplasmic reticulum (ER)

when the core unit of a sugar is transferred to a newly synthesized protein from a

lipid dolichol carrier. The dolichol molecule serves as a membrane anchor where the

sugar moiety (Glc3Man9GlcNAc2) is formed and then transferred to an asparagine

residue marked by the recognition sequences N-X-S or N-X-T in the nascent protein.

Dolichol is a product of the HmG-CoA, or mevalonate, pathway. The mevalonate

31 pathway produces dehydrodolichyl diphosphate, which loses both of its phosphates

to become polyprenol through a currently unknown enzyme. Polyprenol is alpha-

isoprenoid-saturated by polyprenol reductase and becomes dolichol. Finally,

dolichol kinase adds a phosphate to dolichol to make it into dolichyl phosphate, the

actual sugar carrier [40]. After the initial transfer of the oligosaccharide to the

protein, the sugar group undergoes a series of modifications in the Golgi before it

reaches its maturity. These modifications include the removal and addition of sugars

to the budding oligosaccharide by glycosyltransferases. Additionally, nucleotide

sugar transporters bring activated sugar substrates, like sialic acid, galactose,

GlcNAc, and fucose, to be added to the oligosaccharide.

Many CHO cell lines that have mutations pertaining to glycosylation have

been characterized [41]. Initially, the cells were used by the glycobiology

community to understand how loss-of-function mutations contribute to congenital disorders of glycosylation (CDG) in humans [41]. The high spontaneous rate of mutations in CHO was advantageous in order to study diverse effects of altered glycosylation [41]. Using molecular biology approaches, Stanley studied the specific genes responsible for glycosylation defects [41]. Next, mutagenesis was used to select for rare glycosylation mutants using plant lectins, which are specific to different sugar configurations [41]. Stanley subsequently yielded cell lines such as

Lec32, Lec2, Lec8, Lec13, and Lec9.4a. The creation of diverse cell lines aids both research of CDG and development of glycoengineered recombinant therapeutics.

CDG refers to the various glycosylation disorders in humans as a result of defects at different stages of N- and O-glycosylation pathways. For example,

32 SRD5A3-CDG (CDG-Iq) refers to the polyprenol reductase defect. Patients exhibit

symptoms such as ocular coloboma, optic nerve hypoplasia, visual loss, nystagmus,

hypotonia, developmental delay, and cerebellar ataxia [42] [43]. This CDG can be

studied using the Lec9.4a mutant. In contrast, SLC35A1-CDG (CDG-IIf) refers to a

defect in sialic acid transporter, which mimics the phenotype of the Lec2 mutant. A patient with this disorder experienced macrothrombocytopenia, neutropenia, and immunodeficiency prior to death as a toddler [44]. CDGs are the result of diverse

mutations across the glycosylation pathway.

One of the mutants, Lec9.4a, is known for having a mutation in the

polyprenol reductase enzyme that allows the conversion of polyprenol to dolichol.

This mutation causes a decrease in glycosylation levels, as it is dependent on the

dolichol carrier in order to carry and deliver the sugar moiety. However, the cells

retain the capacity to glycosylate at about a 50% rate because the cell uses

precursor polyprenol molecules for saccharide unit formation to compensate for the

lack of dolichol [45]. Thus, the Lec9.4a cell line utilizes polyprenol as opposed to

dolichol, which affects the first step of N-glycosylation that assembles sugars onto a dolichol-phosphate carrier. In the mutant cells, the concentration of polyisoprenyl lipid phosphate is decreased which increases the amounts of monoglycosylated and phosphorylated lipids [45]. Overall, the ratio of active prenols is higher in wild-type cells than mutants [45]. Glycosylation is altered in Lec9.4a as a result of the inefficieny in which cells use polyprenol as opposed to dolichol. This inefficiency may result from the preference of flippases to use dolichol when translocating the growing polypeptide to the ER lumen [45]. In addition, the differential usage of

33 polyprenyl phosphate affects oligosaccharide lipid levels and thus decreases the

glycosylation levels [45]. Overall, the 50% reduction in Lec9.4a refers to a decreased glycosylation rate and lower total levels of glycosylated protein. A similar effect in blocking the transfer of the core sugars from dolichol can be achieved by treating

CHO-K1 cells, the original wild-type CHO cell line, with tunicamycin (TM) [46]. TM is an antibiotic that inhibits GlcNAc phosphotransferase, which catalyzes the addition of N-acetylglucosamine-1-phosphate to dolichol [47]. This inhibits N- glycosylation in the ER and provides an interesting basis of comparison with the effects of the Lec9 mutation. In addition, the comparison of wild-type and mutant cells yields insights in glycolsylation differences between dolichol and polyprenol as the lipid carrier.

The Lec2 mutant cell line possesses a deletion in the gene responsible for expressing the CMP sialic acid transporter, thus expressing a non-functional transporter that renders the cell unable to transport sialic acid across Golgi vesicles and limiting localization of sialic acid to the target proteins. This mutation severely decreases the cell’s ability to sialylate and to add N-linked glycosylation to its

proteins, yielding a 70-90% reduction in sialylation level [48] [49]. Comparison of

Lec2 with wild-type cells reveals differences between siaylated and asiaylated

proteins. In functional analyses, proteins in the Lec2 cell line exhibited altered

reactivity in enzymatic assays suggesting the protein structure is altered as a result

of the different glycoform [50]. Interestingly, the proteins did not have significantly

reduced in vivo half-life [50]. In contrast to Lec9.4a, the Lec2 cell line has a mutation

at the last step of glycosylation and is still expected to glycosylate proteins to the

34 same extent as wild-type cells. Pro-5 CHO is a parent cell line of the Lec2 line [51]. It is also a mutant of wild-type CHO cells that is deficient in proline synthesis.

Therefore, these cells cannot grow in the absence of proline in the culture medium.

They contain only 21 chromosomes, and a gene on the missing chromosome is thought to be responsible for the cell line’s inability to convert glutamic acid to glutamic- -semialdehyde, which blocks the synthesis of proline [52].

Acrossγ the cell lines, research efforts have focused on identifying the genes that cause the mutated phenotypes. One application of this led to CDG classification.

Similarly, glycomics approaches have been used to identify the glycan structures produced in different mutants. Specifically, Lec9.4a mutant cells show increased branching of beta-1,6 carbohydrates and overall reduced glycosylation of proteins

[53]. North used mass spectrometry (MS) to identify the N- and O-glycans of nine different Lec-based CHO mutants [54]. Surprisingly, there was higher glycan complexity than expected for the mutants, so North suggested that further analysis is required for cell line applications [54]. In other applications, the glycosylation mutants suggest targets for genetic engineering. As a result of identifying the point mutation for sialic acid transporter in Lec2 cells, Wong overexpressed the functional gene in CHO cells to increase recombinant protein sialylation [55]. This strategy achieved up to 20-fold increased gene expression and a 4-16% increase in sialylation levels [55].

Glycomics is not the only ‘omics method that can yield insights into cellular physiology. Proteomics aims to identify and quantify proteins within biological samples. In order to study the structure and glycosylation patterns of the proteins

35 produced by these cells lines, certain proteomic analytical techniques, such as isobaric tags for relative and absolute quantification (iTRAQ) and solid phase protein extraction (SPE), are necessary. iTRAQ is used to simultaneously analyze a variety of peptide samples and quantify proteins from different sources into a single experiment. Isotope coded covalent tags, 4-plex and 8-plex, label the peptides at their N-terminus and at lysine amino acid side chains [56]. The protein samples are extracted and digested with one or multiple enzymes to reduce them into their component peptides, which are then labeled with amine specific markers.

Afterward, the samples are pooled and fractionated for analysis in a mass spectrometer [57]. Through the use of tandem mass spectrometry (MS/MS) and these covalent tags, signature ions with a mass to charge ratio between 114 and 117 can be separated, observed, and quantified with greater sensitivity than other techniques, including Isotope Coded Affinity Tags and 2D gel electrophoresis [58]

[56]. The sensitivity of iTRAQ can be further increased through sub cellular fractionation [59]. Even with its many advantages, it is more susceptible to errors in precursor ion isolation.

SPE, a valuable tool for glycoprotein enrichment, can capture proteins with high fidelity when the protein is present in small amounts, which is very difficult with other methods. Desired proteins are dissolved in a fluid substance and then passed through a solid layer with high affinity for the solute. These are known as the liquid and solid phases, respectively. SPE quickly and efficiently separates the protein and undesired components, allowing for further analysis such as by MS [60].

Historically used in analyzing pesticides in soil samples [61], SPE has been used for

36 capturing glycoproteins on both a clinical [62] and laboratory setting [63]. In particular, laboratory applications of SPE can isolate N-linked glycosylated proteins using hydrazide beads. Because hydrazine forms hydrazone bonds with sugar- linked peptides and proteins, glycoproteins can be captured with high efficiency and then freed from the hydrazine with glycosidases for the specific proteins desired

[64]. Further digestion separates the sugar markers from proteins so that conventional MS can be used to derive further data.

A comparison between CHO and the glycosylation mutant cell lines, Lec9.4a and Lec2, can provide further insight into the initiation of glycosylation and to how the lack of a large part of the cell’s glycosylation processes impacts its ability to function. It can also shed light onto the importance of glycosylation to protein folding and availability within the cell. Currently, no clear picture exists into the specific differences between these cell lines at the proteomic level. The following study aims to establish the glycoproteome and sialylated proteome of CHO-K1 and

Pro-5 CHO as negative controls and compare the parental cell lines with Lec9.4a and

Lec2 cell lines, respectively. Specifically, glycosylated proteins will be compared

between CHO-K1 and Lec9.4a, whereas sialylated proteins will be compared between Pro-5 CHO and Lec2. These differences will shed light on differences

between the dolichol and polypropenol pathway, as well as increase our

understanding of pathway changes in CDG. Through this proteomic analysis, we

identify and quantify the differences that glycosylation and sialylation deficiencies

cause in global proteomic expression.

2.2 Materials and Methods

37 Briefly, the experiment section is divided into three parts: sample

preparation, MS peptide identification, and bioinformatics analysis.

2.2.1 Materials

The wild-type CHO-K1 (CCL-61) and wild-type Pro-5 CHO (CRL-1781) cell

lines were obtained from ATCC (Manassas, VA) and cultured by Deniz Baycin as

described in a previous experiment [4]. Two glycosylation mutant cell lines (Lec2

and Lec9.4a) were graciously provided by the lab of Dr. Pamela Stanley. Cell culture

supplies, including fetal bovine serum (FBS), L-Glutamine, non-essential amino acids,

DPBS, F-12K and Dulbecco’s modified essential medium (DMEM) media were obtained from Gibco (Grand Island, NY). Sequencing grade trypsin was purchased from Promega (Madison, WI). The BCA protein assay kit and Tris (2-carboxyethyl) phosphine (TCEP) were obtained from Thermo Scientific Pierce (Rockford, IL). The

sodium dodecyl sulfate – polyacrylamide gel electrophoresis (SDS-PAGE) gels4−12% were purchased from Invitrogen (Grand Island, NY). Affi-prep hydrazide resin beads and sodium periodate were purchased from Bio-Rad (Hercules, CA). PNGaseF was purchased from New England Biolabs (Ipswich, MA). The 1 mL C18 Sep-Pak cartridges were obtained from Millipore (Billerica, MA). All other chemicals used in this study were purchased from Sigma-Aldrich (St. Louis, MO).

2.2.2 Cell Culture and Preparation of Protein Lysate

Cell lines were cultured in the media specified by the manufacturer or cell line supplier. Wild-type CHO-K1 and the mutant Lec9.4a cell lines were cultured in

F-12K media supplemented with 10% FBS, 1% nonessential amino acids, and 2mM

L-glutamine. Wild-type Pro-5 CHO and mutant Lec2 cell lines were cultured using

38 DMEM, supplemented with 10% FBS, 1% non-essential amino acids, and 2mM L-

% confluence, they were lysed for

glycoproteomicglutamine. When analysis. cells reached 70−80

2.2.3 In-solution Digestion

The sample preparation prior to MS was performed by Baycin as decribed

previously [4]. Cells were lysed in 9M urea with protease inhibitor, pH 8.0, and

sonicated on ice three times for 20s using a probe sonicator. This step breaks open

the cells to expose the proteins. The protein concentration of the lysate was estimated by BCA assay. An equivalent amount of lysate from each cell line was first reduced with TCEP and subsequently alkylated with iodoacetamide before trypsin digestion. TCEP is used to reduce disulfide bonds. The free sulfhedryl groups on cysteine residues are then alkylated to prevent reformation of disulfide bonds.

2.2.4 Glycopeptide Capture with SPEG

Following tryptic digestion, which breaks the peptide bonds, the samples were desalted using SepPak C18 columns and oxidized with 10mM sodium periodate in the dark at room temperature for 1hr. Next, samples were mixed with

temperature50μL of hydrazide for the resins coupling containing reaction 50% for slurrythe glycoproteom and incubatede analysis. overnight For at the room sialylated proteome, the peptides were treated with sodium periodate on ice before incubating with hydrazide resins. The nonglycosylated peptides were subsequently

hloride andremoved water. by The washing N-glycosylated the immobilized peptides resins were with released 800μL from of 1.5M the immobilized sodium c

support by suspending the resins in 50μL of G7 reaction buffer, containing 50mM

39 platform.sodium phosphate Next, glycopeptides at pH 7.5, with released 2μL PNGaseFfrom the overnightsupport were at 37°C transferred on a shaking into a

solutionsglass vial andfrom the each resins sample were were washed added twice to C18 with SepPak 100μL columns. of water. Eluted The combined samples

were dried by Speedvac and labeled prior to MS.

2.2.5 iTRAQ Labeling

Each sample was labeled using iTRAQ reagents. A 4-plex was used to compare wild-type CHO, TM-treated CHO, and Lec9.4a CHO cell lines. Two 4-plex experiments were used to compare wild-type Pro-5 CHO and Lec2 CHO cell lines.

Each iTRAQ set was fractionated by strong cation exchange (SCX) into 8 fractions and each fraction was analyzed by liquid chromatography (LC) MS. Samples were submitted to the proteomics core at Johns Hopkins Medical School− run by Dr. Robert

Cole. LC-MS/MS analysis and data analysis were done as explained previously in the label-free CHO paper [4]. Briefly, the iTRAQ method labels the N-terminus and

lysine side chains. Each reagent has a reporter-, reactive-, and balance- group to

maintain the same mass. When the peptide is fragmented, the reporter breaks off

and produces a unique ion. The relative intensity of the reporter is proportional to

the abundance of each peptide in the sample.

2.2.6 Bioinformatics Analysis

Presumed protein identifications were made using Proteome Discoverer 1.4

software

(https://www.thermofisher.com/order/catalog/product/IQLAAEGABSFAKJMAUH).

The protein intensities were determined by fold change over the respective wild-

40 type CHO cell line (CHO-K1 for iTRAQ1 and Pro-5 CHO for iTRAQ2 and iTRAQ3).

Fold changes less than 0.7 and greater than 1.5 were used to compare differential

protein expression. For the functional analysis, protein accession numbers were

mapped to gene symbols using the biological database network (http://biodbnet-

abcc.ncicrf.gov) for gene ontology (GO) and Kyoto Encyclopedia of Genes and

Genomes (KEGG). For GO annotation, gene symbols were mapped to molecular

function, biological process, and cellular component using the GOCHO platform

(http://ebdrup.biosustain.dtu.dk/gocho). For KEGG annotation, database pathways

for Cricetulus grisueus were downloaded on December 31, 2015

(http://www.genome.jp/kegg). All programming for the hypergeometric test were

calculated in MATLAB version 2015vB

(http://www.mathworks.com/products/matlab) and RStudio

(https://www.rstudio.com). Enrichment and depletion p-values were calculated using the hygecdf and hygepdf functions in MATLAB. These values were used to generate heat maps and k-means clustering in Genesis software version 1.7.6 [65].

KEGG pathways of interest were plotted using the search and color feature on the

KEGG website (http://www.genome.jp/kegg/tool/map_pathway2.html). Finally,

Ingenuity Pathway Analysis (IPA) software

(http://www.ingenuity.com/products/ipa) was used to visualize upregulated and downregulated genes and pathways of interest.

2.3 Results and Discussion

In this study, wild-type and glycosylation-deficient CHO cell lines were compared by comparative glycoproteomics for the first time. Shown in Figure 2.1A

41 are the glycosylation steps affected for the mutant and treated cell lines. The overall workflow of sample preparation, MS, and bioinformatics is shown in Figure 2.1B.

Briefly, lysed cells were alkylated and digested followed by oxidation for glycopeptide enrichment. Following hydrazide bead coupling, the coupled glycopeptides were digested by PNGaseF and labeled using iTRAQ reagents. Three separate iTRAQ kits were used; the first set was used to compare Lec9.4a with wild- type and TM-treated CHO, the second set was used to compare Lec2 with wild-type

Pro-5 CHO, and the third set was used to compare Lec2 with wild-type Pro-5 CHO.

Each iTRAQ was fractionated and run on MS/MS, and the resulting spectra were

matched to the CHO database using Proteome Discoverer software. Bioinformatics

analysis was used to determine functional differences between cell lines.

2.3.1 Lec9.4a Glycoproteome Distribution A summary of glycopeptide and glycoprotein identifications is shown in

Table 2.1. First, wild-type CHO was compared with TM-treated CHO and Lec9.4a cells. A total of 973 unique peptides representing 381 different proteins were identified in this iTRAQ analysis and a stringent false discovery rate of 1%. Of these,

187 glycoproteins were identified containing at least 2 unique peptides. Shown in

Figure 2.2 is a histogram distribution of protein intensities for each sample. This highlights the spread of glycoprotein intensity across different samples. The bin

sizes represent the fold change of the sample over the wild-type CHO cell line and the frequency shows the number of identified glycoproteins with that specific fold change. The average fold change is lowest for the comparison between wild-type and TM-treated CHO cells (Figure 2.2B, purple), indicating that the majority of

42 glycoproteins identified in the TM-treated cells are detected at lower levels compared to the wild-type CHO cell line. This is in contrast to the histogram of

Lec9.4a, which shows the greatest spread (Figure 2.2A, green), with many

glycoproteins detected at both higher and lower levels relative to wild-type CHO.

The difference between Lec9.4a and CHO TM may be due to the transient treatment

of wild-type CHO with TM as opposed to cell line adaptation to dysfunctional

glycosylation in the Lec9.4a cells.

Principal component analysis (PCA), which is commonly used for various

proteomics applications, is used to compare cell lines in each iTRAQ experiment further [66]. Each axis represents a component, and the components are ordered in terms of decreasing variability within the data set [67]. In Figure 2.2C, all cell lines

are clustered independently and show variability in glycoprotein detection.

2.3.2 Lec2 Sialoproteome Distribution

Next, wild-type Pro-5 CHO was compared with Lec2 cells, which cannot fully

sialylate glycoproteins. More than 2100 peptides were elucidated representing at

least 500 unique proteins. Of these, a total of 272 unique sialylated proteins were

identified in the comparison (combination of proteins identified in both iTRAQ 2

and iTRAQ 3) containing at least 2 unique peptides and a stringent false discovery

rate of 1%. The histogram comparison in Figure 2.2 shows the average fold change

distribution. Lec2 shows a wide distribution in overall sialylated protein intensity

with many sialylated proteins identified at both lower and higher levels than Pro-5

CHO (Figure 2.2D, pink). The sharpest peak is at a fold change of 1 compared to Pro-

43 5 CHO, which indicates many proteins show similar expression between the cell lines. Identification of the up- and down-regulated glycoproteins is detailed further by functional analysis.

The PCA plot in Figure 2.2E shows wild-type Pro-5 CHO cluster separately from Lec2, a sialylation-mutant CHO cell line. The independent clustering show differences in the expression of sialylated proteins. Following the PCA, protein intensity differences were characterized functionally using KEGG and IPA.

2.3.3 Lec9.4a K-means Clustering One method for grouping proteins by expression levels between samples is

K-means clustering through Genesis software

(http://genome.tugraz.at/genesisclient/genesisclient_description.shtml). Similar to the PCA, this is an unsupervised method that groups proteins by levels rather than supervised tests such as analysis of variance [68] or t-tests [67]. Clustering is useful because data is organized into multiple groups with comparable expression patterns. As shown in Figure 3, heat maps were generated as a result of K-means clustering into groups of proteins with low levels in CHO but high levels in Lec9.4a and TM-treated CHO (Figure 2.3A) and high levels in CHO but low in Lec9.4a and

TM-treated CHO (Figure 2.3B). Clusters provide clues of proteins appropriate for further analysis.

Some of the glycoproteins at higher levels in Lec9.4a and TM-treated CHO shown in Figure 2.3A as compared to wild-type CHO include CD63 antigen, glyceraldehyde-3-phosphate dehydrogenase, vimentin, and ATP synthase. ATP-

44 synthase is critical for energy generation and when the a2 subunit is mutated, there

is a corresponding abnormal glycosylation of serum proteins that affects Golgi

trafficking [69]. On the other hand, proteins related to glycosylation are generally

downregulated in both Lec9.4a and TM-treated CHO as compared to wild-type CHO, shown in Figure 2.5B. For example, alpha-(1,3)-fucosyltransferase, dolichyl- diphosphooligosaccharide-protein glycosyltransferase subunit 1, uncharacterized glycosyltransferase AER61-like, and uncharacterized glycosyltransferase AGO61- like are all expressed very highly in wild-type CHO but not in the glycosylation- deficient cell lines. Interestingly, the majority of these proteins show more significant downregulation in TM-treated CHO as compared to Lec9.4a. The glycosylation-deficient Lec9.4a cell line has a mutation in polyprenol reductase the affects N-glycosylation, resulting in approximately 50% of expected glycosylation.

TM blocks glycosylation and also arrests the cell cycle at G1 [70]. Therefore, both of these modifications served to either downregulate or limit glycosylation on key glycosylating proteins. The correlation in expression profile indicates that the impact of TM treatment which inhibits glycosylation chemically can affect some of the same genes as the Lec9.4a mutation that minimizes glycosylation. The impact of both are clearly observed on a number of enzymes involved in the N-glycosylation pathway.

2.3.4 Lec2 K-means Clustering Figure 2.3 shows heat maps generated as a result of K-means clustering for the sialylated iTRAQs into groups of proteins with low expression in Pro-5 CHO but

45 high expression in Lec2 and high expression in Pro-5 CHO but low expression in

Lec2 (Figure 2.3D).

The Lec2 mutant cell line, with a deletion in the gene for sialic acid

transporter, can be used to study the effect of sialylation on glycoproteins. Sample

sialylated proteins that are higher in Lec2 cell line include membrane proteins and

transporters, such as cadherin-17-like, CD63 antigen-like, neural cell adhesion

molecule 1, CD97 antigen, and voltage-dependent calcium channel subunit alpha-

2/delta-1. In contrast, the sialylated proteins identified at high levels in Pro-5 CHO but low levels in Lec2 (Figure 3.3D) include legumain, clusterin-like,

transmembrane protein 2, dolichyl-diphosphooligosaccharide-protein

glycosyltransferase subunit STT3A, and beta-1,4-galactosyltransferase 3. This

highlights the sialylation enrichment method can identify sialylated proteins in

wild-type cells that are less sialylated in cells with defects in the sialylation pathway.

In the next section, bioinformatics methods such as GO, KEGG, and IPA were used to

identify the interaction between protein intensity levels and the relevance to

specific biological pathways.

2.3.5 Lec 9.4a KEGG analysis

Proteins from each iTRAQ were mapped to gene symbols for functional

analysis. The GO-CHO (http://ebdrup.biosustain.dtu.dk/gocho) and bioDBnet

(http://biodbnet-abcc.ncicrf.gov) databases were used to maximize the percentage

of Chinese hamster proteins annotated. The dual use of databases ensured that

increased annotations were made using those available for the updated 2014

46 Chinese hamster genome [71]. Each gene symbol was mapped to KEGG orthologs to

determine up- and down-regulated KEGG pathways. The hypergeometric

distribution is used to determine what terms are over- or under-represented in a

local sample list compared to the global population [72]. For iTRAQ 1, the global

population was chosen to be wild-type CHO. Therefore, enriched in Lec9.4a from

iTRAQ 1 means overrepresented in this sample compared to the wild-type CHO

control. The goal of this functional analysis was to determine how the glycosylation

mutant cell lines alter particular cellular processing pathways as a result of

glycosylation deficiencies.

In the comparison between Lec9.4a and CHO, 73 glycoproteins are detected

at significantly lower levels in Lec9.4a cells and 114 show higher levels in Lec9.4a

cells. The subsequent KEGG analysis revealed significant enrichment (p < 0.05) for

three pathways in Lec9.4a cells and significant depletion (p < 0.05) for six pathways

in Lec9.4a cells. This is an important reason why functional analyses are prevalent

in proteomics. Although the overall protein levels in Lec9.4a includes both up- and

down- regulated proteins (Figure 2.2A), the functional analysis that maps proteins

and genes within pathways shows more depletion than enrichment of pathways in

Lec9.4 cells.

2.3.6 Lec2 KEGG analysis

For the KEGG analysis in iTRAQ 2/3, the global population was represented

by the wild-type Pro-5 CHO. Thus, enriched in Lec2 from iTRAQ 2/3 refers to proteins that are overrepresented in the sample compared to the wild-type Pro-5

47 CHO control. In the comparison between Lec2 and Pro-5 CHO, 128 sialoproteins are significantly lower in Lec2 cells and 144 show higher levels in Lec2 cells. The KEGG analysis revealed significant enrichment (p < 0.05) for five pathways in Lec2 cells and significant depletion (p < 0.05) for seven pathways in Lec2 cells. Thus, similar to the comparison for Lec9.4a and CHO, even with overall protein levels upregulated,

the functional analysis shows that biological pathways are depleted in glycosylation-

and sialylation- deficient cell lines.

The top three pathways depleted in Lec2 cells as compared to wild-type Pro-

5 CHO cells are N-glycan biosynthesis (p = 0.00749), glycosaminoglycan biosynthesis (p = 0.00195), and glycosphingolipid biosynthesis (p = 0.00101). For the N-glycan biosynthesis pathways, genes that are significantly enriched in the wild-type Pro-5 CHO cells and concomitantly depleted in the Lec2 sialoproteome include MGAT2, which codes for mannosyl (alpha-1,6-)-glycoprotein beta-1,2-N- acetylglucosaminyltransferase, and B4GALT3, which codes for beta-1,4- galactosyltransferase 3. The depletion of these genes in the Lec2 sialoproteome significantly affects glycosylation as shown in Figure 2.4. In addition, the amino sugar and nucleotide sugar metabolism pathway is identified in the CHO sialoproteome but not identified in Lec2 sialoproteome. This confirms the Lec2 phenotype affects sialylation and prior glycosylation activities. Other cellular pathways, including mucin type O-glycan biosynthesis, RNA transport, and other types of O-glycan biosynthesis, are not significantly enriched or depleted in Lec2 as compared to wild-type Pro-5 CHO cells.

48 2.3.7 Lec9.4a IPA Pathway Analysis

Following KEGG, IPA software was used to further understand up- and down-

regulated cellular functions and pathways of interest. The results of this experiment

can suggest the effect of glycosylation deficiencies on cell metabolism and provide

novel targets for cell line engineering efforts to control the glycosylation of

recombinant protein therapeutics.

Figure 2.5A shows the pathway map for Lec9.4a, where green represents

genes mapped from proteins were downregulated from wild-type CHO and red

represents genes upregulated from wild-type CHO. The upregulated functions for

Lec9.4a include metabolism of protein, catabolism of protein, ER stress response, and glycosylation of protein. This is related to proteins such as heat shock protein

90 beta family member 1 (HSP90B1), a disintegrin and metalloproteinase domain

10 (ADAM10), and cathepsin F (CTSF). In contrast, the functions for protein and carbohydrate metabolism show downregulation. Example proteins include CD44 molecule, intercellular adhesion molecule 1 (ICAM1), and mannosidase alpha class

2B member 1 (MAN2B1). The Lec9.4a mutant is still capable of glycosylation at approximately 50% of wild-type levels. Interestingly, protein glycosylation glycoproteins are upregulated in Lec9.4a as a response to the dysfunctional pathway.

2.3.8 Lec2 IPA Pathway Analysis

Figure 2.5B shows the pathway map for Lec2, where green represents genes

mapped from proteins downregulated from wild-type Pro-5 CHO and red represents genes upregulated from wild-type Pro-5 CHO. Many genes related to protein

49 glycosylation are upregulated in Lec2, corresponding to proteins such as cathepsin

(CTSB, CTSD, CTSF), HSP90B1, SEL1L ERAD E3 ligase adaptor subunit (SEL1L), and bone morphogenetic protein 1 (BMP1). Other genes corresponding to functions in synthesis of carbohydrate, metabolism of carbohydrate, and glycosylation of protein were downregulated in Lec2 compared to Pro-5 CHO. Examples include ICAM1,

CD44, MAN2B1, hexosaminidase subunit alpha (HEXA), ST3GAL1, and STT3A. Lec2 cells show downregulation of other glycosylation-related enzymes, including ST3 beta-galactoside alpha-2,3-sialyltransferase 1 (ST3GAL1), and dolichyl-

diphosphooligosaccharide protein glycotransferase oligosaccharyl transferase

subunit STT3A (STT3A). The genes associated with low protein levels in Lec2 and

concomitantly high levels in Pro-5 CHO cells represent sialylated proteins.

Specifically, sialoproteins identified at high levels in Pro-5 CHO but low levels in Lec

2, M include legumain (LGMN), and adhesion G protein-coupled receptor G

(1ADGRG1).

There is close agreement between the KEGG and IPA analyses. Specifically, N- glyan biosynthesis was identified as depleted in Lec2 cells compared to wild-type

Pro-5 CHO (Figure 2.4). The IPA map for Lec2 also shows altered glycosylation functions, including lower levels of MGAT2. Whereas KEGG is useful to identify specific pathway enrichment and depletion using the hypergeometric distribution,

IPA is useful for comparing interrelated pathways through fold change analysis.

Overall, the glycosylation-deficient cell lines show corresponding changes in pathways especially related to glycosylation, protein synthesis and processing, and metabolism.

50 2.4 Conclusions

Glycosylation and sialylation can play crucial roles in the activity and

circulatory lifetime of numerous biological glycoproteins. As the major producer of

biologics with numerous mutant cell lines available, CHO cells represent an

excellent model mammalian cell line in which to evaluate the impact of changes in

the glycosylation processing capabilities [45, 51]. Efforts to improve the

identification and understanding of glycoproteins and sialoproteins in cell culture

and its impact on protein expression and processing can help us to decipher which

biological processes are particularly sensitive to changes in these glycosylation

events. In this study, we analyzed glycoproteomic data to compare wild-type CHO

cell lines with Lec9.4a and Lec2 glycosylation mutant cell lines.

SPE enrichment can be used to identify various glycosylated proteins,

including membrane proteins, transporters, and secretory pathway enzymes. 381

different proteins were identified in the glycoproteome of iTRAQ1 comparing wild-

type CHO, TM-treated CHO, and Lec9.4a cells. Of the 187 glycoproteins, there were

36 glycoproteins not previously identified in the CHO-K1 proteome [4]. The

combination of ITRAQ2 and ITRAQ3 comparing wild-type Pro-5 CHO with Lec2 cells resulted in 272 sialylated proteins. When the data was compared with the previously published CHO-K1 proteome, there were 60 sialylated proteins not identified earlier [4]. The data sets were analyzed for overall differences in fold change and the gene symbol lists were used for functional analysis. Overall, there was no decrease in protein expression for either Lec9.4a or Lec2 as compared to the

51 wild-type CHO cells. This likely indicates the cellular adaptation and genetic drift of

the mutant cell lines in order to compensate for dysfunctional glycosylation.

Interestingly, the Lec9.4a cells show greater spread from the histogram than Lec2

(Figure 2.2) which may indicate that the mutation in dolichol transport at the very

beginning of glycosylation has a different effect than sialic acid transporter. Both cell

lines have at least some ability to glycosylate and sialylate proteins, respectively.

While both Lec9.4a and TM-treated CHO cells lack the ability to properly glycosylate, there are differences at the glycoproteomic level. Figure 2.2 shows the

average fold change of Lec9.4a is 2.0 (Figure 2.2A) whereas the average fold change

of TM-treated CHO is 0.7 (Figure 2.2B). This suggests that the changes manifested in

the Lec9.4a cells likely extend beyond an inhibition of glycosylation alone. TM

works through chemical inhibition of the transferase that synthesizes N-

acetylglucosaminyl(pyro)phosphoryldolichol from UDP-N-acetylglucosamine and

dolichol phosphate [73]. In contrast, although Lec9.4a cells lack the ability to fully glycosylate, there is not a similar decrease in overall protein intensity of the glycoproteins that are identified as with TM treatment. Thus, the mutant cell line has likely adapted to compensate for glycosylation deficiencies by upregulating and downregulating expression of glycosylation-related genes. In the sialoproteome comparison, the average fold change of Lec2 is 1.1 (Figure 2.2D) versus wild-type

Pro-5 CHO. Similar to Lec9.4a, Lec2 also shows upregulation, in this case of some sialoproteins. One important difference is between the protein expression level and the degree of glycosylation or sialylation, which suggests that transcriptomic data would improve the differentiation between transcription and post-transcriptional

52 modifications. In addition to histograms, Figure 2.2 shows the PCA analysis of

iTRAQ1(Figure 2.2C) and the combination of iTRAQ2/3 (Figure 2.2E). This clustering shows no overlap of samples and a distinct cluster for each sample. This indicates variance in samples, which show up- and down- regulation.

The combination of k-means clustering, KEGG, and IPA elucidate specific differences in glycoprotein and sialoprotein expression as well as affected cellular functions and pathways. In Lec9.4a, cell pathways such as protein metabolism, protein catabolism, carbohydrate metabolism, protein glycosylation, and ER stress response, were mapped with up- and down- regulated gene interactions (Figure

2.5A). One example of a gene upregulated and glycosylated in Lec9.4a is HSP90B1, a heat shock protein chaperone that assists in processing, folding, and secreting proteins. A reduction in glycosylation present in Lec9.4a is likely to challenge the ER processing capabilities of CHO cells including folding. Indeed, previous reports suggested that inhibition of glycosylation through TM treatment can trigger the ER stress and unfolded protein response that activates a number of chaperones and heat shock proteins [74]. Previous proteomics efforts have revealed that partial glycosylation of HSP90B1 occurs during TM treatment and the increased concentration of improperly glycosylated HSP90B1 serves to increase cell survival

[75]. Other ER stress response genes upregulated in the Lec9.4a cells include CTSD,

78 kDa glucose-regulated protein (HSPA5), MGAT2, hypoxia up-regulated 1

(HYOU1), and glucosidase alpha, acid (GAA). Similarly, GAA is highly glycosylated, with seven potential N-glycosylation sites [76]. When the majority of glycosylation sites were eliminated by site-directed mutagenesis, Hermans observed no effect on

53 structure or function [76]. However, Herman found one glycosylation site shows a

significant effect leading to synthesis of a mutant GAA precursor that accumulates in

the ER [76]. Indeed, GAA deficiency results in glycogen storage disease type II, also

known as Pompe disease [77]. Thus, the Lec9.4a cells appear to have incorporated a

number of adaptations that enable survival in a low glycosylation environment.

Interesting, the upregulation of ER stress in both Lec9.4a and TM-treated CHO

shows that TM treatment shows similar pathway effects and is a good model for

Lec9.4a.

In addition to the known changes to the dolichol lipid in Lec9.4a, pathway

analysis (Figure 2.5A) revealed altered protein metabolism, carbohydrate

metabolism, and protein catabolism. Specifically, protein catabolism showed

upregulated of genes including ADAM10, heat shock cognate 71 kDa protein

(HSPA8), BMP1, and glucosylceramidase beta (GBA). An example of protein

degradation enzyme is ADAM10, a transmembrane metalloprotease with four N-

glycosylation sites [78]. Escrevente studied glycosylation site mutants and showed that the non-processed precursor to ADAM10 accumulates in the ER as a result of the improper glycosylation [78]. Interestingly, defects in ADAM10 play a role in

Alzheimer’s disease pathogenesis in humans [79]. Upregulation of ADAM10, which

we observed in Lec9.4a cells, may be reveal an accumulation of improperly

glycosylated, less active, glycoprotein. Other upregulated genes in Lec9.4a cells

require activation and targeting for proper function. CTSF is synthesized inactively

and requires processing in order for the enzyme to function [80] [81]. Reduced

54 glycosylation will lower targeting and thus upregulation may be essential to

facilitate transport to the lysosome for proper function.

Many glycosylation-related genes are downregulated in Lec9.4a cells relative

to wild-type CHO including CD44, a highly glycosylated protein involved in cell to cell interactions and migration. CD44 glycosylation variants are related to tumorigenesis [82, 83]. In other research, treatment of human cell lines with TM to inhibit glycosylation resulted in decreased ability of CD44 to bind hyaluronan [84].

Bartolazzi showed that all five potential N-glycosylation sites are critical for CD44

binding [84]. Generation of CD44 glycosylation mutants revealed diverse effects

with respect to hyaluronan binding; terminal sialic acid inhibited binding, and the

first N-linked GlcNAc residue increased binding, Similarly, ICAM1 is a highly-

glycosylated protein and N-linked glycosylation is critical for the conformation and

immune function of the protein [85]. ICAM1 is observed lower in Lec9.4a cells, also

deficient in complete glycosylation and suggests the importance of glycosylation to

glycoprotein activity. The plasma membrane glycoprotein plays a role in human

glycosylation disorders, and ICAM1 was identified in congenital disorders of

glycosylation patient fibroblasts by He [86]. Thus, He suggested that ICAM1 can

represent a hypoglycosylation biomarker [86], and appears to be playing such a role

in our research as well.

In Lec2 cells, the pathway analysis showed changes in carbohydrate

synthesis, carbohydrate metabolism, protein glycosylation, and protein catabolism

(Figure 2.5B). The greatest upregulation in Lec2 is protein catabolism. BMP1 is one

55 gene upregulated in both Lec2 and Lec9.4a cells. In order for this extracellular

matrix protein to be activated, it must cleave the prodomain in the trans Golgi prior

to secretion [87]. Structural analyses have shown that BMP1 has six N-glycosylation sites, including one on the prodomain, which contain complex oligosaccharides with sialic acid residues [88]. Glycosylation and sialylation are essential modifications to

BMP1 because only properly-glycosylated BMP1 is secreted. Structural analyses have identified that five of six N-glycosylation sites contain sialic acid, including the prodomain [88]. Recombinant BMP1 lacking glycosylation sites were not secreted and trafficked to the proteasome for degradation [88]. The upregulation of protein catabolism may be important in recycling proteins without proper sialylation.

Another upregulated gene, SEL1L, is essential for degrading misfolded proteins [89].

SEL1L has been implicated in some Alzheimer patients as a polymorphism leading

to this phenotype [90]. The chondroitin sulfate proteoglycan 4 (CSPG4) also plays a

role in protein catabolism. During brain injury, CSPGs, including CSPG4 observed

here, are upregulated and glycosylation patterns were compared between the

recovering and recovered environment [91]. Kilcoyne showed that alpha-2,6-

sialylation activities were increased and inhibitory in the injured state [91].

Some genes related to glycosylation are upregulated in Lec2 cells, as well as

Lec9.4 cells, possibly suggesting cellular adaptation to sialic acid transporter

deletion in this phenotype and glycosylation defects in Lec9.4a. However, most

glycosylation genes in Lec2 are downregulated, including B3GNT2, MGAT2,

ST3GAL1, STT3A, and ribophorin 1 (RPN1). This differentiates the Lec2 and Lec9.4a

phenotype, which shows overall upregulation of glycosylation. STT3A is reduced in

56 Lec2; reduced expression may be one approach the cells can use to adapt to a

reduced sialylation potential. In essence, dispensing with glycosylation completely

is one approach for addressing sialylation deficiencies in Lec2. Also, ST3GAL1 is

lower in Lec2 as compared to wild-type Pro-5 CHO. ST3GAL1 most commonly adds sialic acid to O-glycans [92]. Specifically, ST3GAL1 has four N-glycan attachment sites with potential sialylation [93]. Reducing its activity would have been a reasonable adaptation since there was limited sialylation present due to deficiencies in the transporter. STT3A is involved in the oligylsaccharyltransferase (OST) transfer of oligosaccharides to asparagine residues with high specificity [94].

Hypoglycosylation of OST (with STT3A as a subunit) results in CDG type I, leading to mental retardation [94]. Mutations in STT3A cause a distinct CDG due to changes in glycosylation of the subunit [95]. Sialyltransferases thus are frequently used as targets for cell line engineering efforts in order to increase the sialic acid content of recombinant proteins. The downregulation of glycosylation also revealed depletion of MGAT2 in Lec2 cells. Studies of congenital disorders of glycosylation type IIa exhibit deficiencies in MGAT2, leading to deficiencies in N-glycan biosynthesis [96]

[97]. In mice with MGAT2 deletion, the N-glycans produced showed a hybrid bisected N-glycan structure with the bisecting GlcNAc residue exchanged for beta-

1,4-linked galactose or Lewisx [96]. Thus, glycosylation-deficient cells can produce

N-glycans, notably with changed structure and altered function. In general, there is a strong relationship between glycosylation and sialylation activities in cells. Lec2 cells may exhibit adaptations in their glycosylation machinery to adapt to under sialylation due to the transporter defect.

57 New advances in glycoproteomics will improve glycoprotein identification

and enable novel investigations into cell culture. In addition to iTRAQ used in this

experiment, other labeling methods such as stable isotope labeling (SILAC) and

tandem mass tags (TMT) can be used to compare glycoproteins between cell lines

and healthy or diseased tissue samples [98] [99]. Additionally, methods such as SPE

can be used to enrich for glycoproteins and identify more low-abundance proteins.

Recently, Yang et al. evaluated how glycoproteomic profiles change between

dyssynchronous heart failure and cardiac resynchronization therapy [100]. Relative

changes in the level of glycoproteins were determined by SPE enrichment, followed

by iTRAQ labeling [100]. Similarly, Tian used SPE to enrich glycoproteins from both

ovarian tumors and adjacent normal ovarian tissue [101]. Various tumor types were

labeled with iTRAQ reagents and relative protein abundance measured by MS [101].

Up- and down-regulated glycoproteins were identified and further verified by

Western blot; specifically, CEA5 and CEA6 showed significant upregulation in

ovarian mucinous carcinoma compared to healthy ovarian tissue [101]. These

examples highlight how glycoproteomics can be used to compare samples of clinical

relevance in addition to biotherapeutic development applications.

Our enrichment of glycoproteins allowed for the comparison between

Lec9.4a and CHO cells, and the enrichment of sialylated proteins enabled

comparison between Lec2 cells and Pro-5 CHO. The enrichment of glycoproteins in the wild-type CHO but not in the glycosylation-deficient Lec9.4a cells indicates

success of the glycoprotein enrichment protocol, whereas enrichment of

sialoproteins in wild-type Pro-5 CHO but not in Lec2 suggests successful

58 sialoprotein enrichment methods. In the iTRAQ1 comparison, we showed up- and down- regulation of glycoproteins involved in protein metabolism, protein catabolism, carbohydrate metabolism, protein glycosylation, and ER stress response.

In the iTRAQ 2/3 comparison, we identified differences in carbohydrate synthesis, carbohydrate metabolism, protein glycosylation, and protein catabolism. Higher expression of glycoproteins and sialoproteins suggests differences in genetic expression and post-translational modifications. For example, proteins with insufficient glycosylation or sialylation may accumulate intracellularly and require increased transcription to yield functional activity. In contrast, low expression of glycoproteins and sialoproteins in the deficient cell lines confirms that the mutation directly affects cell metabolism. Overall, results from the analysis of glycosylation mutant cell lines in this experiment validate the need for advanced glycoproteomic and sialoproteomic methods to yield new insight into cellular metabolism for CHO cell lines, as well as disease models. This serves to increase knowledge for physiological changes in response to defects in the cellular processing pathways.

59

Tables

Table 2.1: Experimental Conditions

Peptide Unique Unique iTRAQ Name Spectrum Peptides (#) Proteins (#) Matches (#)

1 CHO, CHO TM, Lec9.4a iTRAQ 973 381 14106

2 Pro-5, Lec2 iTRAQ 2188 500 22221

3 Pro-5, Lec2 iTRAQ 2723 613 18538

60 Figures

Figure 2.1: Overview of Experimental Design. A: Affected glycosylation steps for mutant and treated cell lines. Tunicamycin-treated and Lec9.4a mutant cells show deficient glycosylation. Lec2 mutant cells show deficient sialylation. B: Flow diagram of glycoproteomics sample preparation and bioinformatics analysis. Solid-phase extraction method was used which involves oxidation, hydrazide bead coupling to enrich specifically for glycopeptides, and PNGaseF digestion to cleave off N-glycans. Following labeling, samples are fractionated and subjected to mass spectrometry. Spectra are matched to CHO databases and bioinformatics is used to determine functional differences between samples. Abbreviations: TM – tunicamycin.

61 Figure 2.2: Distribution of Protein Intensity. A-C: Histograms of Lec9.4a compared to wild-type CHO and TM-treated CHO. Bin sizes are 0.2 and range from 0-6 fold change. Frequency indicates the number of proteins within each bin size. D: Principal component analysis of CHO, Lec9.4a, and TM-treated CHO showing the first two principal components on the axis. E-H: Histograms of Lec2 compared to wild-type Pro-5 CHO. I: Principal component analysis of Pro-5 CHO and Lec2, showing the first two principal components on the axis. Abbreviations: TM – tunicamycin, PC – principal component.

62 Figure 2.3: Protein Intensity Heat Maps. K means clustering was calculated for the cell line comparisons. A-D: glycoprotein groups with low levels in wild-type CHO but high in Lec9.4a and TM-treated CHO, glycoprotein groups with high levels in wild-type CHO but low in Lec9.4a and TM-treated CHO, glycoprotein groups with low levels in wild-type Pro-5 CHO but high in Lec2, and glycoprotein groups with high levels in wild-type Pro-5 CHO but low in Lec2. Protein intensity heat maps show distribution from low [43] to high (red). Abbreviations: TM – tunicamycin.

63 Figure 2.4: Lec2 KEGG Pathway N-glycan Biosynthesis. Pathway is significantly depleted in Lec2 cell line as compared to wild-type Pro-5 CHO. Specifically, MGAT2, MGAT4, and B4GALT3 are genes that are involved in pathway depletion (shown in pink).

64 Figure 2.5: Pathway Analysis Maps. A: Lec2 pathway map. IPA software (www.ingenuity.com/products/ipa) was used to evaluate biological functions that are up- and down- regulated in Lec2 as compared to wild-type Pro-5 CHO cells. Green represents predicted downregulated genes in Lec2 and red represents predicted upregulated genes in Lec2. Identified genes play a role in proliferation of cells, hydrolysis of protein fragment, metabolism of oligosaccharide, metabolism of carbohydrate, glycosylation of protein, biosynthesis of poly-N-acetyllactosamine, and synthesis of polysaccharide. B: Lec9.4a pathway map. IPA software (www.ingenuity.com/products/ipa) was used to evaluate biological functions that are up- and down- regulated in Lec9.4a as compared to wild-type CHO cells. Green represents predicted downregulated genes in Lec9.4a and red represents predicted upregulated genes in Lec9.4a. Identified genes play a role in apoptosis, binding of carbohydrate, metabolism of carbohydrate, glycosylation of protein, metabolism of protein, and endoplasmic reticulum stress response of cells.

65 Chapter 3: Proteomics Analysis of Chinese Hamster Cell Lines

and Tissues

Abbreviations

CHO – Chinese hamster ovary, mAb – monoclonal antibody, HEK – human

embryonic kidney, DHFR – dihydrofolate reductase, HCP – host cell protein, MS –

mass spectrometry, SDS – sodium dodecyl sulfate, EDTA –

ethylenediaminetetraacetic acid, PMSF – phenylmethylsulfonyl fluoride, BCA – bicinchoninic acid, TCEP – tris(2-carboxyethyl)phosphine, NSAF – normalized spectral abundance factor, GO – gene ontology, KEGG – Kyoto encyclopedia of genes and genomes, FDR – false discovery rate, iTRAQ – isobaric tags for relative and absolute quantification, PCA – principal component analysis, MCM –

minichromosome maintenance complex, BiP – binding immunoglobulin protein, PDi

– protein disulfide isomerase (PDi), GOLPH3 – golgi phosphoprotein 3, HPLC – high

performance liquid chromatography, IGF1 – insulin-like growth factor 1, ZP – zona

pellucida, ORC – origin recognition complex, Cdc25A – cell division cycle 25

homolog A, E2F – E2 transcription factor, Abl – Abelson murine leukemia viral

oncogene

3.1 Introduction

In 1957, Theodore Puck and colleagues established the Chinese hamster

ovary (CHO) cell line that is used today across the biotechnology industry to

generate therapeutic monoclonal antibodies (mAbs) and recombinant proteins that

generate billions of dollars in sales [102]. Biologics increasingly constitute a large

66 percentage of top drugs and play important roles in research and development

pipelines. Over the past three years there have been 11, 12, and 13 biological license

application approvals in 2014, 2015, and 2016 respectively [103-105]. While there

are many possible production hosts including human embryonic kidney [106],

mouse myeloma NS0, Saccharomyces cerevisiae, and Escherichia coli, CHO represents the most widely used biotechnology host for commercial proteins due to its wide regulatory acceptance, scalability and adaptability to manufacturing, and glycosylation capabilities among other characteristics [107]. CHO cell lines exhibit relatively high growth rates, are easy to genetically engineer, and exhibit post- translational modifications that are compatible with humans. Increasing our knowledge about CHO cellular capabilities as well as the original Chinese hamster can provide insights that biotechnologists can use to improve production of therapeutic proteins through cell line engineering and bioprocessing modifications.

Originally, ovarian tissues were isolated and used to establish a fibroblast- like cell type with nearly diploid chromosomes that became spontaneously immortal after months in culture [108]. Although commonly considered ovary cells,

CHO actually grew from the surrounding ovarian connective tissue [109]. Over time,

CHO use for industrial applications grew, as the cells were adapted to low-serum and eventually serum-free suspension culture. The first product made by CHO cells used the CHO-DXB11 cell line established at Columbia University to study dihydrofolate reductase (DHFR) activity [110]. Chasin and Urlaub established the

CHO DG44 cell line following deletion of both loci for DHFR on chromosome 2 [111].

The origin of CHO-S from the Puck CHO ancestor is less clear and may have resulted

67 from adaptations to grow the cells in suspension simultaneously in academia and

industry [2, 109]. As a result, CHO cell lines represent a ‘quasi-species’ due to the

high variability across parental and clonal cell lines [109]. Today, derivatives of the

original CHO cells exist across the biopharmaceutical industry and academia, having

undergone numerous genetic modifications and adaptations, yet they serve as the

dominant production platform cell line for biotherapeutics.

Given the wide application and variability of the CHO cells in biotechnology

as well as its origin from Chinese hamsters, burgeoning efforts are under way to

begin to characterize the production host from different sources and in comparison,

with the Chinese hamster host organism. Recently, a range of ‘omics efforts have

been initiated to characterize and differentiate CHO hosts from each other and the

Chinese hamster [37]. A draft genome of CHO-K1 was completed through a

collaborative effort in 2011 [3] followed by the Cricetulus griseus genome in 2013

[2] and several other cell lines. More recently, the CHO-DXB11 cell line was

sequenced in order to aid efforts at understanding the DHFR negative phenotype

and drift from other CHO cell lines [112]. In addition, genomic BAC libraries were constructed for CHO-K1 and CHO-DG44 cells to study the distributions and rearrangements of hamster chromosomes [113]. These initial efforts have paved the way for characterizing the diversity of the Chinese hamster against CHO parental and clonal cell lines at least at the genome level.

Equally important will be efforts to decipher the characteristics of these cell lines and the host organism at the RNA, protein, and metabolite level based on other

‘omics tools including transcriptomics, lipidomics, and proteomics among others.

68 Recently, diverse transcriptomics data sets for CHO cell lines under various culture

conditions were compiled into an online transcript database [71]. The approach

enabled the assembly of over 65000 CHO cell line transcripts [71]. In another study,

transcriptomics was used to compare CHO cell lines and hamster tissue by RNA-seq

[114]. Variability in gene expression was attributed to differences in glycolysis and

glycosylation thus suggesting the existence of metabolic engineering targets that

could be exploited [114]. Recently, a comparison of cell lines by lipidomics revealed

that phosphatidylethanolamine and phosphatidylcholine represent the major lipid

classes in CHO, SP2/0, and HEK cells [115]. HEK cells show increased lipid content,

suggesting that differences in cell lipids can contribute to cell secretory capacity

[115]. These previous studies highlight how increased ‘omics data sets for CHO can

be used for improved understanding of the production host.

Proteomics can now identify and quantify thousands of cellular proteins for

understanding cell physiology. These efforts, like those with transcriptomics and

lipidomics, can help assist biotechnologists in their efforts to characterize distinct

cell lines in developing stable host cell lines with desirable phenotypes. An initial

proteomic analysis of the CHO-K1 cell line by our group identified CHO host cell proteins and pathways that are enriched and depleted compared to the CHO genome and transcriptome [4]. This large-scale profiling identified approximately

6000 cellular proteins and, together with functional analysis, revealed enrichment

in genes related to protein processing and apoptosis at the proteomic level [4]. In

contrast, steroid hormone and glycosphingolipid metabolism were depleted [4].

Additionally, bioinformatics tools were developed to enable the identification of

69 CHO secreted proteins by detecting the presence of signal peptides, transmembrane

domains, and cellular localization features [116]. Over 1000 proteins were

determined to represent the CHO supernatant or superome, which enables further

research into high abundance host cell proteins (HCPs) for bioprocess development

applications [116]. These HCPs can accumulate in the production matrix and

represent a challenge to bioprocessing development, especially for biotherapeutic

purification steps. In another study, proteomics was used to identify host cell

proteins (HCPs) that accumulate over cell culture duration and can negatively affect

the process by interacting with recombinant protein [117]. Proteomics was also

used to evaluate changes in HCPs over 500 days in culture; results showed 92 HCPs

with significant changes in expression with numerous HCPs that are known as

difficult to purify [118]. While many cell lines and different bioprocess conditions

have been studied via transcriptomics and proteomics, comparative studies are still

in their infancy particularly in relation to the original hamster host.

As the relationship between derived cell lines and tissues of the original

Cricetulus griseus is not well-documented, our study was undertaken to characterize

the proteome of multiple CHO cell lines in the context of the Chinese hamster

organism from which these lines were derived. Liver and ovarian tissues were

chosen to represent biologically relevant tissues for this comparison. CHO cell lines

have been created from isolated ovarian connective tissue, and the liver has been

used for genome sequencing [2, 109]. Furthermore, one of the principal functions of the liver is to detoxify and metabolize chemicals; thus, the organ is highly

metabolically active. Specifically, the liver plays a role in bile production, cholesterol

70 production, protein synthesis, glucose storage and release, and substance clearance.

In contrast, the ovary serves the female reproductive system by producing eggs and reproductive hormones. The organ contains extensive loose and dense connective tissue that acts as a covering and bed for vasculature, respectively. As CHO was established, it was sequentially adapted from an adherent to suspension culture. It remains unclear what other physiological changes occurred during this process.

This research aims to begin to elucidate the different cellular characteristics at the proteomic level for representative CHO cell lines and hamster tissues. Consequently, two cell lines, CHO-S and CHO DG44, during both the exponential and stationary phases, were subjected to mass spectrometry proteomics analysis along with liver and ovary tissues. Peptide identification, protein quantification, and functional analysis were performed to characterize similarities and differences at the protein level between cell lines and tissues in order to explain physiological changes from tissue to cell line. While similar protein expression patterns were observed between

CHO-S and CHO DG44 in both growth phases, these expression characteristics were distinct from those observed in liver and ovary tissues. These differences likely reflect the diverse physiological adaptations that these specific cell lines have undergone as well as the complex cellular makeup of tissues or organs. In particular, differences were detected between cell lines during processes including the cell cycle, DNA replication, and membrane transport. Between exponential and stationary phases, both CHO-S and CHO DG44 also shift physiological activities from replication activities to metabolic processes. In addition, liver and ovary exhibit enhanced metabolism compared to cell lines as determined by an examination of

71 enriched pathways, including secretion, and in the case of ovary, extracellular

matrix and adhesion. Overall, the results indicate distinct variability across

physiological pathways of interest that are either upregulated or downregulated in

tissues as compared to cells. These differences highlight unique aspects of cellular

performance in organisms as compared to immortal cell lines in a tissue culture environment. Our findings will help us to better understand the changes that are manifested in adapting cells from tissue for different requirements, such as growth in suspension cultures and production of biotherapeutics for pharmaceutical applications.

3.2 Materials and Methods

Briefly, the experiment section is divided into three parts: sample

preparation, MS peptide identification, and bioinformatics analysis.

3.2.1 Tissue Isolation

Chinese hamsters were generously provided by the lab of Dr. George

Yerganian (Cytogen Research, Roxbury, MA). Euthanization was performed by CO2

and verified by puncture by veterinarians at Johns Hopkins Medical School.

Harvested tissues, including liver and ovary, were immediately frozen on dry ice

and stored at -80oC until analysis.

3.2.2 Cell Culture

CHO-S and CHO DG44 were the two suspension CHO cell lines we grew in

shake-flask batch culture to collect samples at both exponential and stationary

phases. CHO-S cells were cultured in CD-CHO medium supplemented with 8mM

glutamine (ThermoFisher Scientific, Waltham, MA). After achieving the growth

72 curves (data not shown), samples were collected on day 2 for exponential phase and

day 4 or 5 for stationary phase. CHO DG44 cells were culture in DG44 medium

supplemented with 2mM glutamine (ThermoFisher Scientific, Waltham, MA). Cells

were maintained at 37oC, 8% CO2, and 120RPM; viable cell density was determined

by hemocytometer and trypan blue. For sample collection, approximately 3 million

cells were spun down, washed with PBS on ice, frozen rapidly on dry ice, and stored

at -80oC until analysis.

3.2.3 Sample Preparation

Samples for proteomics were thawed on ice and suspended in 2% sodium dodecyl sulfate (SDS) supplemented with 0.1mM phenylmethane sulfonyl fluoride

(PMSF) and 1mM ethylenediaminetetraacetic acid (EDTA), pH 7-8. Samples were lysed by sonicating thrice for 60 seconds at 20% amplitude followed by 90 seconds pause in order to expose proteins. Protein concentration was measured by bicinchoninic acid (BCA) protein assay after briefly spinning to remove cell debris.

Three hundred micrograms of each sample was reduced in 10mM tris(2- carboxyethyl) phosphine (TCEP), pH 7-8, at 60oC for 1hr on a shaking platform.

After bringing the sample to room temperature, approximately 17mM iodacetamide

was added to alkylate the sample to final concentration for 30 minutes, thus

preventing the reformation of disulfide bonds. Next, samples were cleaned using

10kDa filters to reduce the SDS concentration as suggested by the filter aided

sample preparation (FASP) protocol [6]. The samples were digested by

trypsin/LysC (Promega V507A), overnight at 37oC on a shaking platform.

3.2.4 2D LC-MS

73 The MS was completed by the lab of Dr. Robert Cole at the Johns Hopkins

Medical School. Digested peptides were fractionated on basic reversed phase column (XBridge C18 Guard Column, 5µm, 2.1 x 10mm XBridge C18 Column, 5µm,

2.1 x 100mm). Fractions were concatenated into 48 samples prior to second dimension LC and mass spectrometry analysis. The use of fractionation with equal peptides in each was designed to mimic biological replicates for each sample.

Tandem mass spectrometry analysis of the peptides was carried out on the LTQ

Orbitrap Velos (www.thermoscientific.com) MS interfaced to Eksigent

(www.eksigent.com) nanoflow liquid chromatography system with the Agilent 1100 auto sampler. Peptides were enriched on a 2cm trap column (YMC gel ODS-A S-

10µm), fractionated on Magic C18 AQ, 5µm, 100Å (Michrom Bioresources), 75µm x

15cm column and electrosprayed through a 15µm emitter (PF3360-75-15-N-5, New

Objective). Reversed phase solvent gradient consisted of solvent A (0.1% formic acid) with increasing levels of solvent B (0.1% formic acid, 90% acetonitrile) over a period of 90 minutes. LTQ Orbitrap Velos parameters included 2.0kV spray voltage, full MS survey scan range of 350-1800m/z, data dependent HCD MS/MS analysis of top 10 precursors with minimum signal of 2000, isolation width of 1.9, 30s dynamic exclusion limit and normalized collision energy of 35. Precursor and fragment ions were analyzed at 60000 and 7500 resolutions, respectively. In summary, the peptides were ionized and dissociated to obtain fragment masses that are matched to the predicted masses for peptide sequences.

3.2.5 Bioinformatics

74 Peptide sequences were identified from isotopically resolved masses in MS

and MS/MS spectra extracted with and without deconvolution using Thermo

Scientific MS2 processor and Xtract software nodes. Data was searched against all

entries in Cricetulus griseus database for CHO cell lines. Oxidation on methionine, deamidation NQ and phosphoSTY were set as variable modifications, and carbamidomethyl on cysteine was set as fixed modifications in Mascot search node used in Proteome Discoverer (Version 1.4, http://portal.thermo-brims.com) workflow. Mass tolerances on precursor and fragment masses were 15 ppm and

0.03 Da, respectively.

Protein identifications were made using Proteome Discoverer software

(Thermo) with stringent high confidence cutoff (<1%FDR). The protein intensities were determined using NSAF, a method to generate reproducible counts and high linearity across biological and technical replicates [119]. It measures the number of spectra matched to a specific protein divided by its length over the sum of all proteins in the sample [119]. NSAF values were plotted for each sample comparison, and fold change values were used to identify significant differences in protein intensity. Protein accession numbers were mapped to gene symbols using the biological database network (http://biodbnet-abcc.ncicrf.gov) for functional analysis by GO and KEGG. For GO annotation, gene symbols were mapped to molecular function, biological process, and cellular component using the GOCHO platform (http://ebdrup.biosustain.dtu.dk/gocho). For KEGG annotation, latest

(Dec.2015) database pathways for Cricetulus grisueus were downloaded.

(http://www.genome.jp/kegg). All programming for the hypergeometric test were

75 calculated in MATLAB version 2015vB

(http://www.mathworks.com/products/matlab) and RStudio

(https://www.rstudio.com). Enrichment and depletion p-values were calculated using the hygecdf and hygepdf functions in MATLAB. These values were used to generate heat maps and k-means clustering in Genesis software version 1.7.6 [65]

KEGG pathways of interest were plotted using the search and color feature on the

KEGG website (http://www.genome.jp/kegg/tool/map_pathway2.html).

3.3 Results and Discussion

Our study was designed to identify differences in expression of key proteins between relevant tissues from the Chinese hamster and CHO cell lines for potential cell engineering applications. A proteomic evaluation and comparison of CHO cell lines and hamster tissues using proteomics will provide valuable insights into the similarities and differences between production cell lines and the host organisms.

Such a comparison is particularly relevant given the importance of CHO cell lines for production of numerous proteins of biotechnology research and clinical therapeutic interest. An overview of the workflow of the sample preparation and analysis is shown in Figure 3.1. Initially, cells and tissues were subjected to sonication, reduction, alkylation, and digestion, and then the peptides were fractionated in order to increase the number of identifications followed by LC/MS/MS profiling.

Previous studies by Baycin have demonstrated that increased fractionation improves number of protein identifications since more peptides can be identified for each protein [4].

76 In a recent comparison by Le, the 2012- and 2014- annotated genomes were used for ‘omics applications and it was observed that the newest annotation was preferred because it contained higher mapped rates for intra-genic regions [120].

Therefore, the recently annotated Chinese hamster genome [2] by Lewis was used in

this study to provide for the most accurate method for protein identification. The

protein accession numbers were converted to gene symbols for functional analysis.

For missing gene symbols, accession numbers were searched against mouse and

human databases using the biological database network bioDBnet [121]. The

website includes db2db which allows users to input protein accession numbers to

inter-convert between various databases and increase the percentage of protein

accession numbers and gene symbols matching.

Both GO and KEGG tools were used for functional analysis. In GO, gene

symbols are mapped to molecular functions, cellular components, and biological

processes in which the gene plays a role [122]. A related approach is pathway

analysis through a KEGG tool. In this method, gene symbols are mapped to all

involved KEGG pathways in order to identify significant pathway differences and the

presence or absence specific genes [123]. Both functional analyses use the

hypergeometric distribution in order to identify enriched and depleted p-values. A level of p < 0.05 was set for evaluating significance.

3.3.1 Protein Identifications

The numbers of spectra, peptides, and proteins from the different sources are shown in Table 3.1. With a FDR of 1% and at least 2 unique peptides per protein, at least 4400 proteins were identified in tissues, and over 5500 proteins were

77 identified in cell lines. Combining the unique proteins from each sample yielded a

proteome of 11801 proteins. As shown in Figure 3.2A, there were 2283 proteins

common to all samples while 5138, 1058, and 894 proteins were unique to cell lines,

ovary tissue, and liver tissue only, respectively. In addition, 801 proteins were

common between cell lines and liver tissue, whereas 1137 proteins were common

between cell lines and ovary tissue. Regarding CHO-S and CHO DG44 (Figure 3.2B),

5307 proteins were identified in both as well as 2176 proteins unique to CHO-S and

1876 proteins unique to CHO DG44. For the CHO cell lines alone, this represents a

huge improvement in protein identifications compared to the 2012 CHO-K1 proteome [4]. Between CHO-S and CHO DG44, there were over 9300 unique proteins identified compared to approximately 6000 CHO-K1 proteins, representing

a 56% increase in proteins elucidated in our previous study [4]. Also of note, this

experiment set the FDR at 1% whereas the CHO-K1 proteome used a slightly less

stringent 2% [4]. Therefore, the total number of proteins identified in this

experiment surpasses previous CHO proteomes and offers the first hamster tissue

proteomes for comparison. A comparison of 8-plex iTRAQ and label-free proteomics

revealed that label-free proteomics yields higher sequence coverage and is able to elucidate proteins that are differentially expressed [124]. Similarly, a comparison between iTRAQ and label-free proteomics suggests that label-free approaches can provide more values closer to the actual differentially expressed values as shown by

Western blotting [125]. Therefore, for this study, a label-free approach was used to maximize the proteome coverage and provide a thorough comparative analysis of specific proteins in different cells and tissues.

78 Proteins found only in cells often relate to processes, such as DNA replication

and transcription, which are most active in the growing cells. Likewise, some

proteins are tissue specific such as ovary-specific proteins including ovarian cancer

G-protein coupled receptor, oviduct-specific glycoprotein, polycystin-1, estradiol

17- -dehydrogenase 1, and zona pellucida sperm-binding protein. Many proteins

identifiedβ only in the ovary suggest roles in extracellular matrix maintenance and

reproductive development. In contrast, liver-specific proteins identified include

liver carboxylesterase 1, apolipoprotein M, bile salt sulfotransferase-like protein,

and glycolipid transfer protein domain-containing protein 2. These relate to liver-

specific functions, such as detoxification and secretion. Finally, cell-specific

identified proteins are related predominately to the cell cycle. Examples include cell

cycle checkpoint protein RAD17, DNA topoisomerase II-binding protein 1,

DNA/RNA-binding protein KIN17, and mitotic-spindle organizing protein 1. These

differences highlight the difference between actively-growing cells in culture and

differentiated tissues.

For each sample, the protein intensity was determined by dividing the

number of peptide spectrum matches by the number of amino acids to yield the

spectral abundance factor (SAF) [126]. The SAF was normalized for each sample to

yield the normalized SAF (NSAF) value. These values were used to generate a

principal component analysis (PCA) for shared proteins in the tissues as well as the

two cell lines in distinct growth phases, as shown in Figure 3.2C. As expected, the

different cell lines cluster together and are distinct from liver and ovary tissues.

Specifically, CHO-S exponential and CHO-S stationary are grouped together and CHO

79 DG44 exponential and CHO DG44 stationary are grouped together. Also, the cell lines cluster more closely to ovary than to liver. While PCA groupings can elucidate general similarities and differences, GO and KEGG can be used to identify functional differences as described later.

Next, a histogram of NSAF values compiled for all the proteins for each cell line and the two tissues is shown in Figure 3.2D. The histogram is used to represent the distribution of proteins from low to high abundance. In general, the total number of proteins identified is greatest in the cell lines, but the distribution of protein intensity is wider for the organs, which would be expected for complex tissues. Specifically, the most abundant proteins are observed in liver and ovary tissue, as shown by the increased levels at higher proteins intensities in Figure 3.2D.

Example proteins include actin, myosin, heat shock protein, fatty acid binding protein, and glyceraldehyde-3-phosphate dehydrogenase. This highlights that many metabolic activities are upregulated in tissues relative to cells. This finding is in general agreement with a proteomic comparison done between liver tissue and primary liver hepatocytes that showed overall downregulation of proteins in the cell line [127]. Furthermore, the peak for number of proteins shown in Figure 3.2D is larger in both cell lines during the exponential phase than the stationary phase. This may correspond to the high DNA replication activity during the growth phase and a requirement for higher numbers of cellular proteins during exponential growth.

3.3.2 Protein Intensity Comparison

The NSAF values were plotted for each sample combination including between exponential and stationary phases of CHO cell lines and between cell lines

80 and tissues. Linearity was measured using R2 to determine how close the points align to a regression fit. The R2 value ranges were determined to give a linearity estimate and are plotted from highest at 0.89 (Figure 3.3A) to lowest at 0.48 (Figure

3.3O). The highest degree of linearity is observed between CHO-S exponential and

CHO-S stationary (Figure 3.3A) and CHO DG44 exponential versus CHO DG44 stationary (Figure 3.3B). Interestingly, more linearity is observed within a cell line instead of between different cell lines at the exponential or stationary phase. This correlation suggests significant diversity has evolved between these cell lines that may affect protein expression and cellular characteristics. Indeed, the genomes of

CHO-K1, CHO-S, and CHO DG44 have diverged from Cricetulus griseus as indicated by the three million single nucleotide polymorphisms across various CHO cell lines

[2]. Likewise, the values between CHO cell lines and tissues vary from a high of 0.66

(between ovary and CHO-S stationary, Figure 3.3G) and a low of 0.48 (between liver and CHO-S stationary, Figure 3.3O).

CHO cells, which were derived from fibroblast connective tissue surrounding the ovary [109], show levels of protein expression that are more similar to the ovary than the liver proteome. For example, the R2 ranged from 0.61-0.66 for ovary versus cells (Figure 3.3G-J) versus R2 of 0.48-0.54 for the liver versus cells (Figure 3.3L-O), respectively. For the ovary comparison, there was a slightly higher linearity with the stationary phase data to suggest greater similarity of the CHO stationary phase to tissue characteristics (Figure 3.3G-H). The tissue to cell line comparison with the highest R2 value is CHO-S stationary versus ovary (Figure 3.3G). Proteins are colored in Figure 3.3 to show differential expression between samples, as measured

81 by a fold change of less than 0.5 or greater than 2 for each comparison. Example

proteins that are significantly higher in ovary include transthyretin, biglycan, fatty

acid binding protein, and actin. These proteins correlate well with results from the

human ovary-specific proteome, which identified 145 proteins with elevated

expression in ovary compared to other tissues [128]. The ovary mainly functions to

develop oocytes and produce reproductive hormones; it contains an extracellular

matrix of granulosa cells surrounding the follicle [128]. Many of the ovarian

upregulated proteins are related to extracellular matrix, which is not observed in

suspension CHO cells. This likely also highlights the loss of adhesion properties for

cells, especially those adapted to suspension growth, compared to tissues.

Liver versus both CHO lines in exponential and stationary phases (Figure

3.3L-O) show the greatest difference for the tissue to cell line comparisons at R2 from 0.48 to 0.54. Between the two cell lines, CHO DG44 has higher linearity than

CHO-S. Examples of proteins that are significantly higher in liver include sulfotransferase 1A1, liver fatty acid binding protein, cytochrome P450 2E1, and alcohol dehydrogenase, highlighting the tissue specificity of these highly expressed

proteins. Indeed, cytochrome P450 represents a family of enzymes with the highest

levels of expression in the human liver [128]. Many of the high-expressing liver

proteins correspond to its function as the largest gland in the human body; the liver

manages hormone and bile synthesis, glycogen storage, detoxification, and plasma

protein synthesis [128].

In general, the proteins expressed at significantly higher levels in cells over

tissues are related to DNA replication and delay of apoptosis as may be expected to

82 accommodate the cells’ rapid growth [129]. Furthermore, proteins involved in

apoptosis that play a role in extending growth are often upregulated in cells as

compared to tissues. One example is bcl-2, an anti-apoptotic gene that has been

commonly overexpressed in CHO cells to delay apoptosis and extend culture

duration [130-132]. In the current study, bcl-2-associated transcription factor 1

isoform X1 was identified in all cell samples, but not in either tissue. Other examples

of growth-related proteins that were identified in cells only include DNA helicase B,

DNA ligase 4, and DNA replication licensing factor MCM4. In the following sections,

we explore the functional significance of the differential protein expression between

cell lines and between cells and tissues in greater depth.

3.3.3 GO Functional Analysis

For functional analyses, each protein accession number was converted to a gene symbol using online tools. The GO-CHO

(http://ebdrup.biosustain.dtu.dk/gocho) and bioDBnet (http://biodbnet- abcc.ncicrf.gov) databases were used to maximize the percent annotated by mapping to the Chinese hamster genome and related species. This was required to account for the increased annotations made available for the 2014 Chinese hamster genome [71]. For GO, these gene symbols were matched to, biological processes, molecular function, and cellular component. Each GO category was analyzed to determine enrichment and depletion p-values. The top 10 most enriched terms

(lowest p-values) for biological processes are shown for all samples in Figure 3.4.

The slice size represents the numbers of genes identified within the sample that are

associated with the GO term. For all samples, the most enriched process is gene

83 expression. Also, the slices are ordered clockwise in decreasing order of p-value so that the 10th most enriched term for the different samples are mitotic cell cycle

(Figure 3.4A-D), translation elongation (Figure 3.4E), or mRNA splicing (Figure

3.4F).

In GO, biological processes represent a biological function involving the gene or gene product [122]. Interestingly, small molecule metabolic processes represent the second most enriched biological process for both cell lines during the stationary

phase (Figure 3.4B,D) and both tissues (Figure 3.4E-F), but are surprisingly absent

from the top 10 for both cell lines during the exponential phase (Figure 3.4A,C). This

process is enriched for the exponential growth samples, although not among the top

10 most enriched processes. Small molecule metabolic process encompasses

thousands of genes and gene products of low molecular weight [122]. This can

include activities such as fatty acid synthesis, amino acid metabolism, protein synthesis, and ATP conversion. Increased small molecule metabolic processes correlate with the increased metabolic activity of tissues such as the detoxification and metabolite production in the liver. The ovary also functions in production and secretion of hormones and thus shows higher metabolic activity than cell lines. For

CHO-S and CHO DG44 stationary, the small molecule metabolic processes also

include metabolite synthesis, conversion, and degradation. Some examples include

UDP-N-acetylglucosamine-dolichyl-phosphate N-

acetylglucosaminephosphotransferase, galactosylgalactosylxylosylprotein 3-beta-

glucoronosyltransferase 3, CTP synthase, and dipeptidase. Thus, glycosylation

enzymatic activities are upregulated in both cell lines at the stationary phase. This is

84 an important consideration for CHO engineering efforts to control glycosylation of

recombinant proteins, an important strategy in bioproduction [133, 134].

The second most enriched biological processes for CHO-S exponential phase

(Figure 3.4A) and CHO DG44 exponential phase (Figure 3.4C) are mRNA processing and mRNA splicing, respectively. This difference may represent increased requirements for mRNA synthesis and processing during growth. Interestingly, the dynamics of mRNA levels change during the cell cycle; peaks of mRNA degradation subsequently follow peaks of mRNA synthesis [135]. In CHO, a positive correlation exists among growth rate, specific productivity, and heavy chain mRNA [136]. As the cells transition to stationary phase, there is a concomitant shift in protein levels as cells shift to other metabolic activities. Previously, a comparative proteomics analysis was applied to understand what changes take place in protein expression between exponential and stationary phases of CHO cells [137]. This studied identified differential expression of proteins such as binding immunoglobulin protein (BiP), protein disulfide isomerase (PDi), and DNA replication licensing factors MCM2 and MCM5 during the exponential phase [137]. Similarly, these proteins were also upregulated in both cell lines in this study during exponential phase. In contrast, growth-regulating proteins such as transglutaminase-2 and clusterin, were upregulated during stationary phase concomitant with apoptosis induction [137]. Clusterin is a glycoprotein that plays a role in apoptosis through lipoprotein transport, complement-mediated cell lysis, and cell adhesion [138]. It is expressed at higher levels in both cell lines at stationary phase and in both tissues in the current study. Furthermore, previous proteomic comparisons between

85 stationary and exponential phases indicated the greatest differential expression was

related to growth regulation and apoptosis, suggesting that control of growth and

prevention of apoptosis are important priorities during stationary phase [137]

[139].

Often the differences in enriched proteins between cells and tissues relate to growth versus senescence. While the mitotic cell cycle term is present in cells

(Figure 3.4A-D), the tenth most enriched term is instead replaced by translational elongation in liver (Figure 3.4E) and mRNA splicing in ovary (Figure 3.4F). The mitotic process is active in rapidly growing cells and includes genome replication and separation of chromosomes into daughter cells. Thus, cells have increased replication activity, whereas liver tissues have increased activity related to addition of amino acid residues to growing polypeptide chains during protein synthesis. This increased processing activity is reasonable when one considers that one of the primary roles of the liver is protein synthesis. Outside of these differences, the majority of GO terms for biological processes are the same for all samples; these include gene expression, mRNA metabolic processing, RNA metabolic process, protein transport, translation, and RNA splicing (Figure 3.4). These processes encompass general processes in cells with thousands of genes and gene products that are essential for cellular activities [122]. In order to further characterize the

specific proteins that play a role in key molecular functions, further analysis was

completed with k-means clustering.

GO terms were clustered into categories using k-means in order to identify

the terms enriched in cells only or enriched in tissues only (p < 0.05). The resulting

86 heat maps are shown for different molecular functions in Figure 3.5. Among the

molecular functions enriched in cells (Figure 3.5A) are DNA replication origin

binding, DNA helicase activity, and RNA polymerase activity. MCM5 is one gene involved in the molecular functions for DNA helicase activity (Figure 3.5A) and DNA replication initiation that was identified in this study as well as in a proteomics comparison of CHO cells undergoing treatment with sodium butyrate [140].

Previous research found MCM5 was the most strongly downregulated protein as a result of sodium butyrate treatment, which results in a large reduction in growth rate accounts and significant increases in recombinant protein productivity [140].

MCM5 was also identified as a growth marker with significant differences in protein expression associated with the transition from exponential to stationary phase in

CHO cells [25]. Finally, recent efforts to differentiate high-producing CHO cell lines revealed significant downregulation of DNA replication initiation (MCM2, MCM5,

MCM6) in the high-producers corresponding to decreased cell growth but increased production [141]. Thus, efforts to knockdown genes involved in DNA replication may enable growth rate reductions that shift cell metabolism to an emphasis on protein production.

Among the molecular functions enriched in tissues (Figure 3.5B) are AMP binding, glucose-6-phosphate dehydrogenase activity, glycoprotein binding, lipid transporter activity, and mannose binding. These activities are expected to be important for protein processing and secretion. AMP binding, for example, has been studied in agricultural biotechnology applications for its relevance in synthesis and modification of lipids [142], and lipids comprise essential organelle membranes

87 involved in the secretory pathway. In addition, lipid transporter activity was

enriched only in tissues; this GO term groups genes responsible for transporting

lipids within and between cells. Example genes include glycolipid transfer protein

and apolipoprotein. In rat liver, glycolipid transfer protein was found to control

translocation of phospholipids across intracellular membranes [143]. In addition,

apolipoproteins are known to be synthesized in the liver, including apoB, apoA-1,

and apoE [144]. Their regulation is dictated post-translationally depending on the

lipid status with the cells. Overall changes in metabolism occur as immortal cell lines

are derived from tissue; similarly, we observe differences between CHO cell lines

and hamster tissues.

Adhesion is an important characteristic of tissues and generally comprises

various cell adhesion molecules and receptors, extracellular matrix proteins, and

peripheral membrane proteins. Figure 3.5B highlights tissue-enriched functions

such as cell adhesion molecule binding, glycoprotein binding, integrin binding,

laminin receptor activity, polysaccharide binding, and receptor binding that are not

enriched in cells. This emphasizes the adaption of cell lines to suspension culture

occurred with a loss of adhesion activity. This has been observed in a previous study

that adapted adherent Madin Darby canine kidney cells to suspension culture [145].

Using proteomics, the study found that many changes in protein expression related

to cytoskeletal architecture; specifically, myosin proteins were differentially

expressed and played a role in cellular detachment [145]. In our results, myosin

family proteins were expressed only in tissues, verifying the inhibition of myosin occurs as cells lose adhesion properties in suspension culture. Laminin is another

88 extracellular matrix component that is highly expressed in the outer layer of ovarian

tissue and plays a role in follicle development [146]. Overall, different laminin

family members were upregulated in ovary tissue in our results. Finally, integrins

play a role in cell attachment and show enriched activity in tissues compared to cells.

To illustrate our findings, Figure 3.6 highlights the cellular components

enriched in tissues, including secretory machinery, such as Golgi cisterna and trans-

Golgi network transport vesicles, and extracellular matrix proteins. For example,

Golgi phosphoprotein 3 (GOLPH3) plays an important role in cancer metabolism, as

its depletion results in increased apoptosis following DNA damage [147].

Subsequent overexpression of GOLPH3 can enhance cell survival as part of the DNA

damage response [147]. GOLPH3 also functions to control the localization of core 2

N-acetylglucosamine- -2,6-sialyltransferase 1 to transport vesicles [148]. This cantransferase affect the 1utilization and α of these enzymes for glycosylation activities. In another experiment, GOLPH3 knockdown suppressed cell migration and decreased N-glycan sialylation as measured by HPLC and LC/MS [149] -

2,6-sialyltransferase 1 was overexpressed in the knockdown cell line, cell migration. When α

and signaling were restored [149]. GOLPH3 isoform X2 was identified in both liver

and ovary tissue, but not in cells. Another gene enriched in tissues is insulin-like

growth factor 1 (IGF1), which is important for promotion of growth and prevention of apoptosis. In our results, IGF1 isoform X2 was identified in liver, which is the organ primarily responsible for IGF1 production. While IGF1 expression was confirmed in CHO-K1 cells through kinetic and growth curves [150], it also serves as

an important component of serum for dependent cell lines and can be important for

89 media development in serum-free conditions [151]. Finally, extracellular matrix proteins were enriched specifically in ovary tissue. This highlights that a difference between the structure of liver and ovary contributes to different organ functions.

The extensive ovary matrix helps to stabilize tissue structure and cell to cell communication. Collagen and laminin comprise the basal lamina, whereas integrins span the membrane to transmit signals within the cell. Collagens are highly abundant in ovary and their rearrangement is a major event in ovulation and follicle rupture [152]. In addition, zona pellucida proteins, such as ZP1 and ZP2 identified in our results, are unique to ovary tissue. The zona pellucida is a layer of glycoproteins that encases oocytes. Similar to the human zona pellucida, mass spectrometry analysis of the demonstrates the expression of all four of the ZP glycoproteins (1-4) in a targeted glycoproteomics study [153], The ZP

proteins were identified in ovary tissue but not cells. Overall, these clustering

results demonstrate significant differences in the proteome between cell lines and

tissues, especially involving proteins that are components of the extracellular matrix

and secretory apparatus.

3.3.4 KEGG Functional Analysis

The GO functional analysis identifies molecular functions, biological processes, and cellular components that are enriched in samples at the proteomics level. Another functional analysis is KEGG, which also uses the hypergeometric distribution to identify enrichment and depletion values. In this case, the values represent pathways that are significantly upregulated or downregulated at the proteome level (p < 0.05). Analyses revealed between 117 and 125 enriched

90 pathways in CHO cells and about 140 enriched pathways in tissues. Depletion

analyses revealed between 32 and 46 depleted pathways in CHO cells as compared

to 30 depleted pathways in tissues to confirm more pathways are depleted in cells

as compared to tissues. These finding suggest that tissues are more metabolically

active as measured by an increase in the number of enriched pathways and a

decrease in the number of depleted pathways.

While tissues show more enriched pathways than cells, cells show slightly

more enrichment during the stationary phases versus exponential phases

Consequently, there are more pathways depleted during exponential phases than

stationary phases. One example is glycosphingolipid metabolism; this pathway is

depleted in CHO-S cells during the exponential phase (p = 0.038) but not at

stationary phase (p > 0.05). However, it was not found as a depleted pathway in

CHO DG44 cells. As another example, the MAPK signaling pathway is not enriched in

CHO DG44 cells at exponential phase (p > 0.05) but is enriched at stationary phase

(p = 0.045). While there are more enriched pathways in tissues than cells, some of

the most enriched pathways are similar between samples, including carbon metabolism, biosynthesis of amino acids, protein processing in the endoplasmic reticulum, and aminoacyl-tRNA biosynthesis. This suggests common metabolic

activities that play a significant role in cells and tissues.

Shown in Figure 3.7 is the KEGG cell cycle pathway with colors representing

genes that were identified at the proteome level. The cell cycle pathway is enriched

for CHO-S and CHO DG44 at both exponential and stationary phases (p < 0.05) but it

is not an enriched pathway for both liver and ovary tissues. Many cell cycle genes

91 are observed in all samples; however, some are unique to cells or tissues only. In

agreement with the GO analysis, many cell cycle functions are increased in CHO cell

lines but not in tissues (genes colored magenta in Figure 3.7). In our results, CDK4,6

is highlighted in blue (Figure 3.7) which means it was identified in both CHO cell

lines and tissues. One effort to shift cell lines from growth to production targeted

the cell cycle G1-checkpoint through inhibition of cyclin-dependent kinase 4/6

(CDK4,6) [154]. The inhibition precisely controlled cell growth while increasing

productivity to 110 pg/cell/day and decreasing the percentage of high-mannose

glycoforms [154]. Another class of DNA replication molecules included the MCM

complex, most of which were observed in both cells and tissues. The complex unwinds double stranded DNA, recruits DNA polymerase, and initiates DNA synthesis, thus important to the survival of both cells and tissues. Both cells and tissues show MCM2, MCM3, MCM5, MCM6, and MCM7 at the proteome level.

Interesting, MCM4 was identified in all the cells but not in either tissue.

Identification may be masked by the high abundance of other metabolic proteins in tissue that decrease the significance of DNA replication.

Another important class of cell cycle proteins includes cyclins, such as CycA,

CycB, and CycD, which were identified in the cell proteomes but not in liver or ovary

tissue. Specifically, CycA concentration peaks during the G2 phase, CycB

concentration peaks during the initiation of mitosis, and CycD maintains a high

concentration throughout the cell cycle. CycA is involved in both the S phase and the

G2 to M transition and may play a role in cancer progression [155]. This is in

agreement with the immortalized CHO cell lines. A study of DNA synthesis inhibition

92 showed that CHO cells could accumulate protein and continue to grow as CycB

levels accumulate [156]. This aberrant growth and protein accumulation are

determinants of cytotoxicity and cell death [156]. In addition to cyclins, there are

many proteins essential for regulating the cell cycle that are upregulated in cells

rather than tissues. These highlight key differences between immortalized CHO cell

lines and hamster tissues.

The origin recognition complex [157] represents another class of proteins identified in cell but not tissue proteomes. The first step in DNA replication initiation is the assembly of the ORC; the ORC subunits then undergo changes in affinity for chromatin as the cell cycle progresses. Orc1 and Orc2 concentrations are constant throughout, but Orc1 can selectively detach from chromatin during M and

S phases [158]. This could explain the identification of detached Orc1 in mitotic CHO cells. Other detached ORC subunits exist during M phase and the complete ORC loses its function. Unlike in yeast, the Orc1 and Orc2 subunits are not tightly bound to the

ORC in hamster [158]. In another study, Orc1, Orc2, and Orc4 were identified as metabolically stable proteins in CHO cells [159]. Orc6 also plays a role in cancer and its downregulation can lead to cell cycle arrest at G1 [160]. Thus, the ORC is important in actively growing cells. Interestingly, when comparing primary and transformed cell lines, it was observed that protein levels of several ORC subunits are upregulated in transformed cell lines [161]. This is in agreement with our results that show enrichment of cell cycle proteins in CHO cell lines but not in liver or ovary tissue.

93 Another cell cycle protein, p107, was identified in CHO cells only. p107

regulates the cell cycle through its phosphorylation at the S to M transition. This is

in agreement with a proteome comparison between primary liver cells and

immortalized tumor cells [127]. Growth-promoting activities were upregulated in the Hepa1-6 tumor cells, including increased expression of p107 [127]. Related to

p107, E2F transcription factor (E2F)4/5 was identified at the proteome level in cells

[162]only; it. Anti interacts-apoptotic with propertiesp107 to transduce of E2F4/5 TGFβ and receptor other E2F signals homologs upstream have ofresulted CDK in

in their application as a cell line engineering target for prolonging culture [163, 164].

This highlight how an ‘omics approach can identify worthwhile candidates for

controlling the cell cycle and cell death in culture.

There are few cell cycle genes identified at the proteome level in liver and

ovary tissues but not in cell lines. One important example is Cdc25A, which

regulates genomic stability by controlling cell cycle progression. Normal expression

controls the G1 to S transition; overexpression leads to tumor growth [165] [166].

Interestingly, Cdc25A has been targeted in CHO cell line engineering efforts. Lee

evaluated wild-type and overexpressing Cdc25A cell lines by transgene copy

number, recombinant protein productivity, and cell cycle progression, and found

that overexpression of Cdc25A increased the specific productivity up to 2.9-fold and

increased the G2 to M phase transition by 1.5-fold [167]. Our inability to identify

Cdc25A in any of the cell samples suggests that mRNA is not efficiently translated.

Another example of a protein identified in tissues is Abelson murine

leukemia viral oncogene [168], which has various functions related to cell growth,

94 differentiation, and migration. Interestingly, the attachment of fibroblasts to

extracellular matrix directs the nuclear export of Abl, which may explain why it is

identified only in the tissue proteomes [169]. Abl kinases also play a role in cancer

development, as increased expression is known to activate solid tumors, alter cell

polarity, and induce invasion [170]. Finally, stratifin (14-3- in the tissue proteomes. The protein is a downstream target3 ofσ) p53 was that identified enables only tumor cell proliferation. Its methylation status and expression can be modified in cancers of the liver, breast, stomach, and colon, for example [171]. In the ovary, the methylation of the stratifin gene reduces mRNA and protein expression. In turn, reduction or elimination of the stratifin methylation leads to changes in the expression levels in ovarian cancer tissue [172]. Thus, these genes can be intimately related with tissue cell cycle and cancer metabolism.

3.4 Conclusions

This study considerably expands on our current proteomics catalog for CHO cells by identifying proteins in two cell lines (CHO-S and CHO DG44) as well as hamster liver and ovary tissues. The total number of unique proteins with at least two unique peptides for CHO-S and CHO DG44 is 9359, representing a 56% improvement over previous work [4]. Additionally, we established the first tissue proteome with 6663 unique proteins. Combining the results yielded a CHO cell line and tissue proteome of 11801 proteins at a 1% false discovery rate (FDR) with at least two unique peptides per protein. Protein intensities were compared between cell lines in exponential and stationary phase and tissues using NSAF, and functional analysis was performed using GO and KEGG hypergeometric distributions.

95 Similarities and differences were observed at protein intensity level and in the analysis of genes and pathways.

Interestingly, for both CHO-S and CHO DG44 cell lines, greater linearity in protein expression was observed within cell lines even in the two different growth stages than across cell lines, suggesting large genetic differences across cell lines.

Also, the cell lines exhibited greater similarities in expression to the ovary tissues than to the liver.

The GO and KEGG analyses also demonstrated significant differences in cell function and pathways between cell lines and tissues. There is a shift in activities from replication and transcription events, such as RNA splicing, to small molecule metabolic processes when comparing cells in the exponential phase to cells in the stationary phases, and to tissues. Furthermore, cellular components such as Golgi cisternae and the trans Golgi network were enriched in both liver and ovary tissues, but not in cells, which may suggest an increased secretory capacity in tissues that could be harnessed for CHO cell line engineering approaches. Additionally, cellular components such as extracellular matrix and endomembrane system are enriched specifically in the ovary. For example, zona pellucida proteins 1 and 2 are membranous proteins upregulated specifically in the ovary. KEGG analyses indicated upregulation of the cell cycle pathway proteins in cells as compared to tissues. Thus, the functional analysis complements protein expression analyses to identify metabolic changes across samples.

In summary, our results elucidate differences between cell lines and the

Cricetulus griseus host from which these cell lines were originally derived. Overall,

96 this proteomics analysis provides new insights on a greater number of protein and the overall protein intensity differences, as well as functional physiological differences related to genes and pathways, between CHO cells at different growth stages and the tissues of the Chinese hamster host. This information will enable us both to better understand this important production host at the level of proteins and pathways and to harness this knowledge in order to make more effective bio- production platforms in the future.

97 Tables

Table 3.1: Number of Peptides and Proteins

Peptide Unique Unique Samples Type Spectrum Peptides (#) Proteins (#) Matches (#)

CHO DG44 Cell line day 2 58128 5950 461989 exponential

CHO DG44 stationary Cell line day 4 50194 5593 371379

CHO-S exponential Cell line day 2 53958 6089 501439

CHO-S stationary Cell line day 5 48205 5538 439336

Liver Flash frozen tissue 33902 4468 461007

Ovary Flash frozen tissue 32954 4968 342141

98 Figures

Figure 3.1: Experimental Workflow. The proteomics workflow is divided into three main phases: sample preparation, MS peptide identification, and bioinformatics analysis. During sample preparation, proteins were extracted from cells or tissues and subjected to reduction, alkylation, and filter-aided sample preparation (FASP). Next, proteins were digested into peptides that are fractionated prior to LC/MS/MS analysis. Data was searched using Proteome discoverer software configured with Mascot search engine to identify proteins and MATLAB, R, gene ontology, and KEGG pathway analysis tools were used to identify functional pathways involved in cell line engineering and bioprocessing.

99 Figure 3.2: Proteomic Comparison of Samples. A. Number of proteins identified in all or some of samples. ‘Cells’ represent a combination of all cell culture samples (CHO-S exponential, CHO-S stationary, DG44 exponential, and DG44 stationary). B. Number of proteins identified in each or both cell lines. ‘CHO-S’ represents the total number of unique proteins in CHO-S exponential and stationary samples. ‘CHO DG44’ represents the total number of unique proteins in CHO DG44 exponential and stationary samples. C. Principal component analysis of samples plotted on the first and second principal components. D. Histogram of protein intensities using normalized spectral abundance factor as bin values.

100 Figure 3.3: Protein Intensity Comparison of Samples. Protein intensity comparison of samples. Plots show NSAF values for all proteins identified in each comparison. Proteins highly expressed in the sample on vertical axis are colored yellow. Proteins highly expressed in the sample on horizontal axis are colored pink. A-O: CHO-S exponential vs. CHO-S stationary, CHO DG44 exponential vs. CHO DG44 stationary, CHO DG44 exponential vs. CHO-S exponential, CHO DG44 exponential vs. CHO-S stationary, CHO DG44 stationary vs. CHO-S exponential, CHO DG44 stationary vs. CHO-S stationary, CHO-S stationary vs. ovary, CHO DG44 stationary vs. ovary, CHO DG44 exponential vs. ovary, CHO-S exponential vs. ovary, liver vs. ovary, CHO DG44 stationary vs. liver, CHO DG44 exponential vs. liver, CHO-S exponential vs. liver, CHO-S stationary vs. liver.

101 Figure 3.4: The Top 10 Enriched Biological Processes. Gene ontology analysis was used to identify the biological processes with most significant enrichment (p- value < 0.05). The size of the slice represents the number of genes annotated to that process. The order starts from gene expression (most enriched) and rotates clockwise to the 10th most enriched process. A: CHO-S exponential, B: CHO-S stationary, C: DG44 exponential, D: DG44 stationary, E: liver, F: ovary.

102 Figure 3.5: K-Means Clustering of Molecular Functions. Following gene ontology analysis, the resulting p-values were used to determine molecular functions that were similar or different in CHO cells vs. tissues. The color scale ranges from low p- value in green (significant enrichment) to high p-value in red (not enriched). A: Selected molecular functions enriched in cells (p < 0.05) but not tissues. B: Selected molecular functions enriched in tissues (p < 0.05) but not cells.

103 Figure 3.6: Cellular Components Enriched in Tissues. The final result of gene ontology analysis was to identify cellular compartments enriched in tissues vs. CHO cells. The components are labeled and corresponding gene symbols are shown that correlated to the cellular component. Green: enrichment in both liver and ovary tissues, but not in CHO-S or CHO DG44 cells. Blue: enrichment in ovary tissue, but not in CHO-S or CHO DG44 cells or liver tissue.

104 Figure 3.7: Pathway Map of Cell Cycle. The Kyoto Encyclopedia of Genes and Genomes (KEGG) search and color tool was used to highlight genes that were identified at the proteome level for CHO cells and tissues. Blue represents genes identified at the proteome level in all samples, magenta represents genes identified at the proteome level in cells only (present in CHO-S exponential and CHO-S stationary and CHO DG44 exponential and CHO DG44 stationary), and orange represents genes identified at the proteome level in tissues only (present in liver and ovary but not in any CHO cells).

105 Chapter 4: Comparative Proteomics Analysis of Chinese

Hamster Cell Lines and Tissues

Abbreviations

CHO – Chinese hamster ovary, DHFR – dihydrofolate reductase, VCP – valosin

containing protein, MS – mass spectrometry, SDS – sodium dodecyl sulfate, EDTA – ethylenediaminetetraacetic acid, PMSF – phenylmethylsulfonyl fluoride, BCA – bicinchoninic acid, TCEP – tris(2-carboxyethyl)phosphine, BCA – bicinchoninic acid,

TCEP – tris (2-carboxyethyl) phosphine, TMT – tandem mass tags, LC – liquid chromatography, FPKM – fragments per kilobase of transcript per million mapped reads, FDR – false discovery rate, NSAF – normalized spectral abundance factor, GO

– gene ontology, KEGG – Kyoto encyclopedia of genes and genomes, IPA – Ingenuity pathway analysis, SILAC – stable isotope labeling with amino acids in cell culture, iTRAQ – isobaric tags for relative and absolute quantification, PCA – principal component analysis, BiP – binding immunoglobulin protein, PDi – protein disulfide isomerase, MCM – minichromosome maintenance, ROCK2 – Rho associated coiled- coil containing protein kinase 2, ARFGEF2 – ADP ribosylation factor guanine nucleotide exchange factor 2, ZEB1 – zinc finger E-box binding homeobox 1,

ANKFY1 – ankryin repeat and FYVE domain containing 1, AP2A2 – adaptor related protein complex 2 alpha 2 subunit, ROS – reactive oxygen species (ROS), SPARC – secreted protein acidic and rich in cysteine, Igbp1 – immunoglobulin CD79A binding protein 1, VASP – vasodilator-stimulated phosphoprotein, Rapgef1 – Rap guanine nucleotide exchange factor 1, Itgb1 – integrin beta-1, SIRT1 – sirtuin 1, CIRP – cold-

106 inducible RNA-binding protein, Lrp1 – low density lipoprotein receptor-related

protein, Thbs1 – thrombospondin, Mapk14 – mitogen-activated protein kinase 14,

Stim1 – stromal interaction molecule 1, Clcn3 – chloride voltage-gated channel 3,

ERAP1 – endoplasmic reticulum aminopeptidase 1

4.1 Introduction

The history of Cricetulus griseus dates back to the 1950s when Theodore T.

Puck established the CHO-K1 cell line [102]. Today, Chinese hamster ovary (CHO) cell lines generated from the Chinese hamster dominate the biopharmaceutical industry and their products generate billions of dollars in revenue annually [109].

The success of CHO cell lines for recombinant protein production can be attributed to their high growth rate, ease of genetic modification, and ability to post- translationally modify proteins via glycosylation and sialylation.

CHO is actually a misnomer as the cell line established by Puck grew out of the connective tissue surrounding the ovary. The cells became immortal spontaneously in culture and grew with nearly diploid chromosomes and fibroblast character [108]. Prior to their use in the biotechnology industry, the cells were frequently used in studies for induced and spontaneous mutation research because the chromosomes were easily observed [109]. After over ten months in culture, the cells morphed from a fibroblast type to epitheliod marking the origin of the CHO culture used today [109]. As early as the 1960s, CHO was adapted to serum free media and it was discovered the cells required proline supplementation in the culture medium [109]. Subsequently, CHO use became more widespread in industry.

107 The original CHO cell line was adapted and modified by various researchers

to create cell lines, such as CHO-DXB11, CHO DG44, and CHO-S, for example. The first CHO-generated product was made in the CHO-DXB11 cell line established at

Columbia University to study dihydrofolate reductase (DHFR) activity [173]. A related cell line is CHO DG44, which was generated in the same lab by fully deleting both loci for DHFR on chromosome 2 [111]. Finally, CHO-S is widely used but has a more complicated origin because it was simultaneously grown in both academia and industry as a suspension-based CHO cell line [109]. There are thus significant differences across CHO parental cell lines, as well as clonal- and process- dependent variations. This yields transcriptomes and proteomes with significant differences. It is widely recognized that variations in the CHO cell lines exhibit significant differences, so a single CHO-ome is not relevant for most applications. In addition to cell line differences, variations in the bioprocess including media formulation and bioreactor operation can alter the transcriptome and proteome.

Initial efforts to understand CHO include the sequencing of both the CHO and

Cricetulus griseus genomes. The draft CHO-K1 genome was established in 2011 [3] and the Chinese hamster genome followed two years later [2]. In addition, the CHO-

DXB11 genome was sequenced to understand the DHFR negative phenotype and the cell line drift [112]. Other efforts aimed to create bacterial artificial chromosome libraries for CHO-K1 and CHO-DG44 cell lines to visualize hamster chromosome rearrangements [113].

108 Finally, a combined approach using genomics and epigenetics was recently

used to study genetic responses as a result of cell line differences and bioprocess

perturbations, such as process conditions, media formulation, and cell specific

productivity [174]. Results indicated that genome sequence variation between the cell lines was high, resulting from single nucleotide polymorphisms, insertion/deletions, and structural variants [174]. Combined, genomic efforts play a

significant role in comparing CHO parental and clonal cell lines.

Similar efforts sought to understand mRNA and protein expression in CHO

using transcriptomics and proteomics, respectively. Through advances in sample

preparation and mass spectrometry (MS) technology, it is possible to identify and quantify thousands of cellular proteins. In addition, library preparation and annotation tools make RNAseq a sophisticated method for transcriptomic analysis.

Both methods aid biotherapeutics research and development because they can be used for target identification, monoclonal antibody discovery, and bioprocess optimization. To highlight one application, a proteomic comparison between pulmonary adenocarcinoma and surrounding tissue identified over 30 differentially expressed proteins, suggesting that cancer causes an upregulation of PKM2 and cofilin-1 [175]. Following the identification, a targeted RNA interference knockdown of PKM2 resulted in decreased cell growth and induction of apoptosis in a cancer cell line [175]. In other applications, proteomics is used to identify proteins and pathways that are significantly changed as a result of bioprocess differences. These studies seek to develop stable cell lines and bioprocess parameters that increase

recombinant protein yields and maintain consistent product quality. The initial

109 proteomic analysis was performed for the CHO-K1 cell line to identify pathways that

are enriched and depleted compared to the CHO genome and transcriptome [4].

Functional analysis revealed enrichment in protein processing and apoptosis

pathways at the proteomic level concomitant with depletion in steroid hormone and

glycosphingolipid metabolic pathways [4].

Finally, the combination of transcriptomic and proteomic data serves to

validate significantly changing pathways. To date, many cell lines and different

bioprocess conditions have been studied via transcriptomics and proteomics to

yield insights into protein production, cell growth, reduced apoptosis, favorable

glycosylation, and optimized medium formulations [176]. In a combined approach,

Clarke showed that 285 proteins were differentially expressed between cell lines

with fast and slow growth rates, and functional analysis identified translational

elongation, translation, generation of precursor metabolites and energy, oxidation-

reduction, and aerobic respiration as most significantly enriched gene ontology

terms [24]. Combination of proteomic and transcriptomic data enabled the

validation of low abundance protein information and identification of mRNA post-

translational processing [24]. Thus, results indicated a correlation between protein

synthesis and cell growth. In another approach, the combination of transcriptomics

and proteomics was used to compare cell lines with fast and slow growth rates [21].

Researchers identified valosin-containing protein (VCP) as a significant contributor

to the variation in cell growth and viability [21]. Following the ‘omics approach, VCP expression was knocked out which resulted in decreased viable cell density and

110 viability [21]. Thus, proteomics is important for identifying differential expression that can be confirmed or negated by mRNA expression using transcriptomics.

In this thesis, we combine both transcriptomics and proteomics to generate comparisons between CHO cell lines and the original Chinese hamster host.

Comparative proteomics data was established using TMT labeling and comparative transcriptomics data was established by RNAseq. Briefly, proteome samples were sonicated and subjected to reduction, alkylation, digestion, labeling, fractionation, and MS identification. After RNA isolation and library preparation, transcriptome sequencing data was generated. Different organs, including the brain, heart, kidney, liver, lung, ovary, and spleen, were used to generate the tissue proteome and identify characteristics of each that are highly abundant. In addition, the CHO-S and

CHO DG44 cell lines were compared to the tissues to suggest functions and pathways with significant differential expression. Finally, we combine the proteome data with the transcriptome to identify functions that confirm or negate each other.

Overall, tissues show upregulation in metabolic pathways such as protein processing, secretion, and amino acid metabolism. The results highlight tissue characteristics and genes of interest that may be used for future genetic engineering approaches of CHO host cell lines. In addition, depleted functions identified in CHO highlight areas of improvement that could increase cell capacity for biotherapeutics production using cell line engineering or optimization of media formulation, transfection method, bioreactor operation, and process scale-up.

4.2 Materials and Methods

111 For both proteomics and transcriptomics, the experiment involves sample

preparation, data collection, and bioinformatics analysis.

4.2.1 Tissue Isolation

Tissues from various organs, including brain, heart, kidney, liver, lung, ovary, and spleen, were harvested from female Chinese hamsters generously provided by the lab of Dr. George Yerganian (Cytogen Research, Roxbury, MA). Euthanization was performed by CO2 and verified by abdominal puncture. Upon harvest, each organ was divided into small pieces, rapidly frozen on dry ice, and subsequently stored at -80oC until analysis.

4.2.2 Batch Cell Culture

Two commercial suspension CHO cell lines, CHO-S and CHO DG44, were

grown in batch culture to collect samples during the mid-exponential phase. CHO-S

cells were cultured in CD-CHO medium supplemented with 8mM glutamine (both

from ThermoFisher Scientific, Waltham, MA). CHO DG44 cells were culture in DG44

medium supplemented with 2mM glutamine (both from ThermoFisher Scientific,

Waltham, MA). After preparing growth curves (data not shown), samples from both

cell lines were collected on day 2 which corresponded to mid-exponential phase.

Cells were incubated at 37oC with 8% CO2 shaking at 120RPM; viable cell density

was determined by hemocytometer and trypan blue. For sample collection,

approximately 3 million cells were spun down, washed with PBS on ice, frozen

rapidly on dry ice, and stored at -80oC until analysis.

112 4.2.3 Proteomics Sample Preparation

Samples for proteomics were thawed on ice and lysed in 2% sodium dodecyl sulfate (SDS) supplemented with 0.1mM phenylmethylsulfonyl fluoride (PMSF) and

1mM ethylenediaminetetraacetic acid (EDTA), pH 7-8. Lysates were sonicated thrice for 60 seconds at 20% amplitude followed by 90 seconds pause in order to expose proteins. Protein concentration was measured by bicinchoninic acid (BCA) protein assay. Three hundred micrograms of each sample were reduced in 10mM tris (2- carboxyethyl) phosphine (TCEP), pH 7-8, at 60oC for 1hr on a shaking platform.

After bringing the sample to room temperature, approximately 17mM iodacetamide was added to alkylate the sample to final concentration for 30 minutes in order to

prevent reformation of disulfide bonds. Next, samples were cleaned using 10kDa

filters to reduce the SDS concentration [6]. The samples were finally digested using trypsin/LysC enzyme mix (Promega V507A), overnight at 37oC on a shaking

platform.

4.2.4 Proteomics Labeling

In order to compare protein expression, peptide samples were labeled in

duplicate using two tandem mass tag (TMT)-10plex (ThermoFisher Scientific,

Waltham, MA). Each of the 10 reagents has the same nominal mass and chemical

structure so that for each sample, a unique reporter mass is used to relate protein

expression level. Triplicates were used for ovary tissue and CHO-S. In addition, CHO-

S included biological and technical replicates in order to compare samples across

each TMT. Following the digestion, TMT reagents were thawed and acetonitrile was

113 used to dissolve the reagents. One reagent tube was added to each sample and then

incubated at room temperature for 1hr. Hydroxylamine was subsequently added to

quench the reaction before the tubes were combined. One tube for each TMT was

submitted for MS analysis.

4.2.5 2D Liquid Chromatography (LC)-MS

Samples were submitted to the proteomics core run by Dr. Robery Cole.

Briefly, digested peptides were fractionated on basic reversed phase column

(XBridge C18 Guard Column, 5µm, 2.1 x 10mm XBridge C18 Column, 5µm, 2.1 x

100mm). Fractions were concatenated into final 24 prior to second dimension LC

and MS analysis. MS/MS analysis of the peptides were carried out on the LTQ-

Orbitrap Velos (www.thermoscientific.com) MS interfaced to Eksigent

(www.eksigent.com) nanoflow LC system with the Agilent 1100 auto sampler.

Peptides were enriched on a 2cm trap column (YMC gel ODS-A S-10µm), fractionated on Magic C18 AQ, 5µm, 100Å (Michrom Bioresources), 75µm x 15cm column and electrosprayed through a 15µm emitter (PF3360-75-15-N-5, New

Objective). Reversed-phase solvent gradient consisted of solvent A (0.1% formic acid) with increasing levels of solvent B (0.1% formic acid, 90% acetonitrile) over a period of 90 minutes. LTQ Orbitrap Velos parameters included 2.0kV spray voltage, full MS survey scan range of 350-1800 m/z, data dependent MS/MS analysis of top

10 precursors with minimum signal of 2000, isolation width of 1.9, 30s dynamic exclusion limit and normalized collision energy of 35. Precursor and the fragment ions were analyzed at 60000 and 7500 resolutions, respectively.

114 4.2.6 Transcriptomics

Tissues and cells were also analyzed by Clarke using RNAseq to prepare

transcriptomics data, similar to previously published analyses [177]. Transcriptome

data, provided as fragments per kilobase of transcript per million mapped reads

(FPKM) values was provided to compare RNA expression across replicated samples.

In this method, RNA is isolated from the cells and tissues using commercial kits

(Qiagen, Germantown, MD). Isolation by 3’ polyadenylated tails yields only coding

mRNA, which is reversed transcribed into cDNA. Fragmentation and size selection

are used to select for high-quality sequences. Challenges with transcriptome

analysis include the assembly of raw sequence reads into the complete

transcriptome. The de novo assembly by Dr. Colin Clarke was provided for our study.

4.2.7 Bioinformatics

Peptide sequences were identified from isotopically resolved masses in MS

and MS/MS spectra extracted with and without deconvolution using Thermo

Scientific MS2 processor and Xtract software nodes. Data was searched against all

entries in Cricetulus griseus database for CHO cell lines. Oxidation on methionine, deamidation NQ and phosphoSTY were set as variable modifications, and carbamidomethyl on cysteine was set as fixed modifications in Mascot search node used in Proteome Discoverer (Version 1.4, http://portal.thermo-brims.com) workflow. Mass tolerances on precursor and fragment masses were 15 ppm and

0.03Da, respectively.

115 Protein identifications were made using Proteome Discoverer software

(Thermo) with stringent high confidence cutoff (<1% false discovery rate, FDR) and

provided as an excel file by the lab of Dr. Robert Cole. The protein intensities were

determined using normalized spectral abundance factor (NSAF), a method to

generate reproducible counts and high linearity across biological and technical

replicates [119]. It measures the number of spectra matched to a specific protein

divided by its length over the sum of all proteins in the sample [119]. NSAF values

were plotted for each sample comparison, and fold change values were used to

identify significant differences in protein intensity. Protein accession numbers were

mapped to gene symbols using the biological database network (http://biodbnet-

abcc.ncicrf.gov) for functional analysis by gene ontology (GO) and Kyoto

Encyclopedia of Genes and Genomes (KEGG). For GO annotation, gene symbols were

mapped to molecular function, biological process, and cellular component using the

GOCHO platform (http://ebdrup.biosustain.dtu.dk/gocho). For KEGG annotation,

latest (Dec.2015) database pathways for Cricetulus grisueus were downloaded.

(http://www.genome.jp/kegg). All programming for the hypergeometric test were

calculated in MATLAB version 2015vB

(http://www.mathworks.com/products/matlab) and RStudio

(https://www.rstudio.com). Enrichment and depletion p-values were calculated using the hygecdf and hygepdf functions in MATLAB. These values were used to generate heat maps and k-means clustering in Genesis software version 1.7.6 [65].

Pathways of interest were plotted using Ingenuity pathway analysis (IPA) software

(http://www.ingenuity.com/).

116 4.3 Results and Discussion

Our work compares protein and mRNA expression of various CHO cell lines and hamster tissues, creating the first global map of the Cricetulus griseus proteome and transcriptome. The ‘omics analysis aims to improve our understanding of CHO as the dominant biopharmaceutical production host and identify significant differences between cells and tissues. As proteomics is the main focus of this paper, an overview of the workflow is shown in Figure 4.1. Following sonication, reduction, alkylation, and digestion, peptides were labeled by TMT and combined into two

TMT-10plex to determine relative abundance. Each TMT was fractionated prior to

LC/MS/MS in order to increase the number of identifications.

Following MS identification, the protein accession numbers were determined using the 2014-annotated CHO genome. In a recent comparison between the 2012- and 2014-annotated CHO genomes, it was determined that the 2014 version showed higher mapped rates for intra-genic regions, making it the preferred reference for future applications [120]. Protein accession numbers were converted to gene symbols for functional analysis. For missing gene symbols, the accession numbers were searched against mouse and human databases using the online database, bioDBnet [121]. In this experiment, GO and IPA were used for functional analysis of the differentially expressed proteins. The purpose of GO is to convert gene symbols to molecular function, cellular component, and biological process in order to evaluate the relationship between the molecular activities of gene products, location of activity, and pathways comprising the activity of multiple gene products,

117 respectively [122]. In another functional analysis, each gene symbol is mapped to

the IPA pathways it is involved in, thus suggesting enrichment and depletion

between different comparisons [123]. In both GO and IPA, the enrichment and

depletion are quantified by p-values that are calculated via the hypergeometric distribution. In this case, p < 0.05 was set for evaluating significance.

4.3.1 Protein Identification and Total Protein Intensity Differences

In order to reduce costs and multiplex samples, we chose a labeled approach for comparative proteomics. TMT was not the only option, as isobaric tags for relative and absolute quantification (iTRAQ) and stable isotope labeling with amino acids in cell culture (SILAC) are alternative labeling methods [178] [179]. The main difference between iTRAQ and TMT is the number of samples that are multiplexed.

With TMT, it is possible to analyze 2, 6, or 10 samples in a set thus providing greater flexibility for sample design [180]. For this experiment, TMT was selected in order to ensure replicates for each of the conditions. Labeled approaches show some disadvantages as compared to label-free, including lower sequence coverage and fewer differentially expressed proteins [124]. In addition, comparison between iTRAQ and label-free proteomics suggests that label-free approaches are more accurate [125]. In conclusion, labeled proteomics has both advantages and disadvantages necessitating thorough analysis in order to confidently identify points of interest.

For each of the TMT samples, the number of unique proteins, unique peptides, and spectra is shown in Table 4.1. Over 8000 proteins were identified in

118 each TMT with at least 2 unique peptides per protein; this corresponded to a total of

8464 unique proteins with a false discovery rate (FDR) of 1% for protein

identification. Protein intensity fold change ratios were initially evaluated through

the principal component analysis (PCA) as shown in Figure 4.2 [66]. The cell

samples cluster together and are distinct from all tissues as shown by PCA. For the

tissues, spleen, liver, and heart cluster together while lung and kidney cluster together. Ovary and brain are clustered separately from the other organs involved in the circulatory, digestive, and respiratory systems, which is consistent with the plots and outliers identified. Next, replicates for each tissue were plotted. Fitting a curve to the scatter plot and calculating the R2 value assessed the linearity of the

replicates. Shown in Figure 2B to 2E are the plots for the tissues with the highest R2

value for each cluster, specifically brain, ovary, lung from the lung/kidney cluster,

and heart from the heart/kidney/spleen cluster.

To compare between samples, protein fold change ratios were averaged for

biological replicates and plotted in Figure 4.3. Blue and yellow represent points of

differential expression (outliers) between the samples compared. More consistency

is observed between CHO-S and CHO DG44 rather than between cells and tissues.

The highest degree of consistency (as measured by the fewest number of outlier

points) is observed between CHO-S and CHO DG44 (Figure 4.3A). There are 247

proteins with significantly higher expression in CHO-S and 236 proteins with significantly higher expression in CHO DG44. For the cell to tissue comparison, the highest consistency is for CHO-S to ovary (Figure 4.3E) or CHO DG44 to ovary

(Figure 4.3I). This agrees with the fact that CHO cells were derived from ovary tissue,

119 even though cells grew out of a mixture of ovary and surrounding connective tissue.

Between CHO-S and ovary, there are 1587 proteins with significantly higher

expression in CHO-S and 832 proteins with significantly higher expression in ovary.

Similarly, between CHO DG44 and ovary, there are 2216 proteins with significantly

higher expression in CHO DG44 and 962 proteins with significantly higher

expression in ovary. After ovary, CHO cells show the greatest consistency to heart

(Figure 4.3C and 4.3G). Between CHO-S and heart, there are 603 proteins with significantly higher expression in CHO-S and 1073 proteins with significantly higher expression in heart. Similarly, between CHO DG44 and heart, there are 864 proteins with significantly higher expression in CHO DG44 and 1557 proteins with significantly higher expression in heart. Unlike the other organs, heart is predominately comprised of muscle cells so this metabolic function may account for significant differences in protein intensity [181]. Over 50% of the cells in this organ are cardiac fibroblasts. In addition, the heart has endothelial, smooth muscle, and pacemaker cells. This high degree of specialization is likely important for differences between CHO and heart. Next, both CHO-S and CHO DG44 have low consistency for the comparison against lung tissue (Figure 4.3D and Figure 4.3H, respectively), containing a total of 2130 and 2536 proteins with differential expression, respectively. Finally, the total number of outliers, as measured by the number of blue and yellow points in Figure 4.3, is highest between cells and brain (Figure 4.3B

and Figure 4.3F). There are 2373 and 2618 differentially expressed proteins in CHO-

S and CHO DG44, respectively, as compared to brain. Brain and ovary show distinct

separation in terms of embryonic development from the other organs. Brain

120 originates from the ectoderm whereas ovary develops from the mesoderm [182]. In

terms of tissue to tissue comparisons, brain shows the most outliers compared to

other tissues; it is the only organ derived from the ectoderm and thus is highly

specialized for its function [182]. Ovary is also highly specialized for its role in

reproduction, containing extensive extracellular matrix and epithelioid cells [183].

The remaining organs derive from mesoderm and endoderm, including kidney, liver,

heart, and lung.

Similar to other efforts at whole organism proteomes, we evaluated some of

the most highly expressed proteins in each tissue [128, 184, 185]. Hierarchical

clustering of protein expression is shown in Figure 4.4. Pink represents areas of high

expression, for which we observed clusters in each sample. These groups represent

highly expressed proteins with tissue specificity. For each sample, the top 200

proteins, corresponding to approximately 3% of the total proteins, were identified

in order to highlight tissue specificity. Example proteins significantly high in ovary

include disintegrin and metalloproteinase domain-containing protein,

methyltransferase, serine/threonine-protein phosphatase, and sodium/glucose costransporter 2. This agrees with results from the human ovary-specific proteome,

which identified 145 genes with elevated expression in ovary compared to other

tissues [128]. Interestingly, the metalloproteinases were observed in the mouse

proteome, but only within lung and placental organs [185]. The ovary controls

oocyte development and reproductive hormone production; it contains extracellular

matrix of granulosa cells surrounding the follicle [128]. Similarly, the placenta is

part of the reproductive system and develops in order to provide oxygen and

121 nutrients to a fetus growing in the uterus. CHO cells, which were established from

fibroblast connective tissue surrounding the ovary [109], show protein expression levels that are more similar to the ovary than other tissues.

Next, we identified highly expressed proteins in lung tissue (Figure 4.4). The human proteome identified 193 genes elevated in the lung and saw the greatest similarity to protein expression in the testis [128]. The lung is responsible for breathing, controlling gas exchange. Some proteins highly expressed in the hamster lung include branched chain amino acid aminotransferase, glycine N-

methyltransferase, integrin alpha-6, and vesicle-associated membrane protein.

These proteins likely account for the increased metabolic and secretory functions of

the lung. The heart is a strong muscle that must contract nonstop and is

predominately comprised of cardiomyocytes. 201 genes were shown to have

elevated expression in the human proteome [128], including retinal dehydrogenase

which was also significantly upregulated in our hamster heart proteome. In addition,

we observed high expression for proteins including actin-related protein,

glutathione S-transferase, protein O-glycosyltransferase, and Ras GTPase- activvating protein. Finally, brain tissue clustered separately in the PCA analysis and shows distinct embryonic development from other tissues. Highly expressed brain proteins include amyloid beta A4 protein, calcium and integrin binding protein, neuron navigator 1, and serine/threonine protein kinase. As expected, many of the highly expressed proteins in brain relate to signaling and transport. In general, the proteins expressed at significantly higher levels in cells over tissues include proteins related to DNA replication and transcription, as expected to the cells’ rapid growth.

122 Among the most highly expressed proteins in either CHO-S or CHO DG44 are 60S ribosomal protein, apoptosis inhibitor 5, DNA-directed RNA polymerase II, and eukaryotic translation initiation factor. Thus, analysis of the most abundant proteins

for each tissue or cell line highlights each sample’s function.

4.3.2 IPA Pathway Analysis

Next, the proteins were annotated with gene symbols in order to perform

pathway analysis. Fold change values of less than 0.5 or greater than 2.0 were used

for each comparison to determine downregulation and upregulation of pathways in

IPA software. As shown in Figure 4.4, protein intensity hierarchical clustering can be

used to identify proteins with significantly higher expression in a specific tissue.

Protein functions identified in IPA that are enriched in brain include

internalization of protein, organization of Golgi apparatus, and translation of

proteins. The Golgi likely plays a role in cholesterol metabolism; almost 25% of the human body’s unesterified cholesterol is present in brain [186]. The input of cholesterol into the central nervous system comes almost entirely from in situ synthesis, involving the endomembrane system [186]. Additionally, glycosphingolipids are abundant in the nervous system, and are synthesized in the endoplasmic reticulum and completed in the Golgi apparatus [187]. Protein functions enriched in lung include metabolism of amino acids and trafficking of vesicles. Lung shows high secretory capacity and thus trafficking of the secretory apparatus will be important. In recent research, the secretome of lung cancer cell lines was profiled to identify differential expression of secreted proteins [188].

123 Protein functions enriched in heart include metabolism of alpha-amino acid, sorting of protein, and translation. Amino acid metabolism was studied in rat heart; as amino acid levels rose up to 5x plasma levels, there was a 40% increase in whole heart protein and myosin synthesis [189]. A prior analysis of human heart mitochondria similarly showed upregulation of metabolic proteins, including those involved in signaling, protein synthesis, ion transport, and lipid metabolism [190].

Protein functions enriched in ovary include development of extracellular matrix, docking of vesicles, and exocytosis. The matrix is important for all metabolic activities, including growth, migration, and differentiation [183]. When the matrix is remodeled, it can lead to ovarian cancer [191]. The extensive extracellular matrix is in agreement with previous research in our lab quantifying the hamster ovary tissue proteome.

4.3.3 GO Functional Analysis

For GO functional analysis, gene symbols were annotated to molecular function, biological process, and cellular component. Each GO category was analyzed to determine enrichment and depletion p-values for the cell-to-tissue or tissue-to- tissue comparisons. Tables 4.2 and 4.3 list the top 10 most enriched biological processes for CHO-S (Table 4.2) and CHO DG44 (Table 4.3) to tissue comparisons.

Enrichment is determined by hypergeometric distribution, with a p-value of < 0.5 used for significance.

Biological processes represent a biological function involving the gene or gene product. This is used to complement the pathway analysis in IPA. A process is

124 accomplished using at least one molecular function assembly [122]. In both CHO-S

and CHO DG44, the most common processes enriched involve DNA, mRNA, and RNA.

Some of the most common enriched processes in CHO-S include RNA metabolic process, mRNA processing, and transcription from RNA polymerase II promoter.

Overall, differences between cells and tissues relate to growth and senescence.

Growth activities are high in cells, whereas signaling and transport processes are significantly enriched in tissues. A previous proteomics comparison of CHO cells undergoing treatment with sodium butyrate identified minichromosome maintenance deficient 5 (MCM5) as a differentially expressed gene involved in DNA replication [140]. Of the MCM family members we identified between CHO-S and

CHO DG44, MCM4, MCM6, and MCM3 were all identified at nearly equal levels (fold change near 1.0). In addition, the levels in both cell lines were higher than in tissues.

In comparison, signaling and transport are significantly enriched across different hamster organs. For example, enriched brain biological processes include axon guidance, ion transport, metabolic process, and synaptic transmission. Eight of the ten top processes are equivalent in the comparison against CHO-S or CHO DG44.

The brain comprises the central nervous system that regulates signaling throughout the organism. The GO term for axon guidance describes the chemotaxis process that directs axon migration to a specific target in response to attractive and repulsive cues [122]. It involves genes such as homeobox protein, neural cell adhesion molecule, plexin, laminin, and ephrin receptor.

125 Next, the identification of enriched biological processes in heart shows

functions related to circulation and metabolism. Complement activation and protein heterotrimerization are significantly enriched in heart tissue compared to either

CHO-S or CHO DG44. Other processes enriched in heart include regulation of blood coagulation, positive regulation of vasoconstriction, phospholipid efflux, and sodium-independent organic ion transport. For example, complement activation involves mannose-binding lectin and complement components, C3, C4, CD46, and

CD59. Additionally, the activities for vasoconstriction, phospholipid efflux, and sodium-independent organic ion transport are enriched in heart. These results are indicative of heart specialization, such as the relationship between complement activity and heart failure [192].

Similar to the pancreas, thymus, and spleen, the lung is a highly secretory organ. It performs metabolic functions similar to other tissues, but has selectivity that differentiates the lung from liver [193]. Biological processes we observed that are enriched in lung as compared to either CHO cell line include G-protein coupled receptor signaling pathway, vesicle-mediated transport, and intracellular signal transduction. Indeed, vesicle transport is an important component of secretory pathway machinery. Genes related to this GO term are transferrin receptor protein,

Golgi SNAP receptor, growth arrest-specific protein, and vesicle-fusing ATPase.

Additional lung-enriched processes include transport and signaling. Unraveling the complexities of lung secretions may yield new insight into the control of CHO secretory pathway engineering.

126 Finally, enriched biological processes in ovary highlight differences between

the cells and tissue from which CHO cells were derived. Overall, there is high

agreement between CHO-S and CHO DG44 related to ovary. Ovary biological

processes enriched as compared to both cell lines include protein transport,

transport, transmembrane transport, and vesicle-mediated transport.

Transmembrane transport is associated with genes such as sodium/hydrogen

exchanger, transferrin receptor protein, peroxisomal targeting signal 2 receptor,

and growth arrest-specific protein. Interestingly, the ovary tissue has high transport

capacity for proteins and other transmembrane materials that may highlight

reasons for the dominance of CHO cells in recombinant protein production.

Specifically, vesicle production rate was quantified in the ovary and the turnover

indicated high vesicle recycling across the endomembrane system [194]. However,

the same processes are not enriched in the cell line, highlighting possibilities for

metabolic improvements.

4.3.4 Transcriptome Analysis

Transcriptomics was used to complement the proteomics study and compare mRNA expression across CHO cell and tissue samples. RNAseq fold change ratios were used in combination with fold change ratios from TMT labeling in order to compare and contrast the data sets. The results of the transcriptome to proteome comparison are shown in Figure 4.5A for cell lines to tissue comparisons. For each

of the cell line to tissue comparisons studied, over 65% of gene symbols are in

agreement, indicating that mRNA and protein expression are either both low or both

127 high. The minority of gene symbols in red represent situations where mRNA and

protein expression differ, with one of the fold changes greater than 1 and the other

fold change less than 1. These situations represent stable (high protein expression

and low mRNA expression) and unstable (high mRNA expression and low protein

expression) mRNA.

For the conditions tested, the greatest agreement is for the cell line to cell

line comparison (Figure 4.5A). A total of 1427 gene symbols express the same fold change whereas 349 proteins have conflicting fold changes between mRNA and

protein, representing over 80% correlation. This shows that mRNA and protein

expression correlate much more for cells than complex tissues. Across the tissues,

the greatest agreement is between brain mRNA and protein versus CHO-S.

Specifically, in brain there are 1741 gene symbols with fold change in agreement and 539 gene symbols that differ in fold change between mRNA and protein. Lung and ovary both show approximately 65-70% correlation between the mRNA and protein fold changes in relation to CHO-S. The greatest disagreement is between heart mRNA and protein versus CHO-S. For the heart, there are 1399 gene symbols in agreement and 637 in disagreement between mRNA and protein.

4.4 Conclusions

The results from the Chinese hamster proteome provide new insight into the global protein expression across tissues. Comparative proteomics by TMT labeling was used to identify differential expression between CHO cell lines and hamster tissues. A total of 8464 unique proteins with a false discovery rate of 1% and

128 containing at least 2 unique peptides per protein were identified by combining the

TMT datasets. This represents the most extensive Chinese hamster proteome to-

date. Initially, we compared protein expression by comparing the fold change for

each sample on the basis of the CHO-S technical replicates (Figure 4.3). Overall,

there is high similarity between CHO-S and CHO DG44. When incorporating tissue

comparisons, there is greater consistency between CHO-S or CHO DG44 cell lines and ovary tissue and lower consistency between CHO-S or CHO DG44 cell lines and other tissues. Next, the highly expressed proteins in each tissue were further characterized to identify tissue-specific functions.

The top 200 proteins were identified in each tissue cluster. Due to the high reproducibility of the replicates, we specifically chose to analyze brain, ovary, heart, and lung in greatest detail. Interestingly, the metalloproteinases (ADAM7, 10, 15) were highly enriched in ovary (Figure 4.4). ADAMs were observed in the mouse proteome, but only within placental organs and lung [185]. The ovary controls oocyte development and reproductive hormone production; it contains extracellular matrix of granulosa cells surrounding the follicle [128]. Similarly, the placenta is part of the reproductive system and develops in order to provide oxygen and nutrients to a fetus growing in the uterus. CHO cells, which were established from fibroblast connective tissue surrounding the ovary [109], show protein expression levels that are more similar to the ovary than other tissues. Next, the lung highly- expressed proteins may participate in the critical metabolic and secretory functions.

The human proteome identified 183 genes highly expressed in the lung including similar membrane and secretory proteins [128]. Specifically, branched chain

129 aminotransferase shows high expression in our results and in the human lung as

compared to the tissues we evaluated. In the heart, 201 genes were shown to have

elevated expression in the human proteome [128], including retinol dehydrogenase

(RDH1) which was also significantly upregulated in our hamster heart proteome.

The knock-down of RDH1 led to abnormal neural crest cell migration and an

abnormal heart loop in mutant embryos, showing that it plays an important role in

heart tissue [195]. Finally, many of the highly expressed proteins in brain relate to signaling and transport similar to the 1437 genes that were identified at high levels in the human brain proteome [128], such as amyloid beta A4 protein and neuron

navigator 1 (NAV1). NAV1 is specialized specifically to the nervous system [196].

Following the protein intensity analysis, we further investigated protein

functions by IPA and GO. As shown in Figure 4.4, tissues show increased metabolic

and secretory activity as compared to cells and there are tissue-specific functions

identified as clusters in each grouping. Some of the differences between CHO cells

and ovary tissue may be attributed partly to the mixture of ovary cell types. In an in-

depth proteome analysis of ovarian cell lines and tissues, it was possible to

differentiate the proteomes between epithelial, clear cell, and mesenchymal tissues

[68]. We identified various highly active pathways in ovary such as exocytosis and

extracellular matrix development. There are similarities of protein functions across

tissues with related functions, notably the spleen, liver, and heart which cluster

together in the PCA from Figure 4.2. The liver and spleen share many proteins with

near equivalent protein expression levels, including vacuolar protein sorting-

associated protein 4B, proteasome maturation protein, and alpha-mannosidase

130 (fold change is equal to 1). Between the spleen and heart, proteins with near

equivalent expression include spartin, adenylate kinase isoenzyme 4, and receptor- type tyrosine-protein phosphatase alpha (fold change is equal to 1). Finally, between the liver and heart, proteins with a fold change equal to 1 include apolipoprotein A-1 binding protein, plexin, and transforming growth factor beta receptor type 3. Overall, these organs are critically important and share an extensive blood supply. Heart, liver, and spleen were used to compare oxidative stress in rats following DNA damage due to their systemic relationship [197].

Next, the GO analysis is used to highlight specific differences in functions that dominate cell and tissue metabolism (Table 4.2 and Table 4.3). Cells show enrichment of transcription machinery whereas tissues show increased transport and signaling as measured by significant p-values. Specifically, there are 82 and 109 enriched biological processes in CHO-S and CHO DG44, respectively, and 194-229 enriched biological processes in brain in the comparison with the two CHO cell lines.

In the cell to lung comparison, there are 126-137 enriched biological processes in the cell lines and 240-253 enriched biological processes in lung. Finally, In the cell to ovary comparison, there are 89-113 enriched biological processes in CHO-S and

CHO-DG44 and 110-160 enriched biological processes in ovary. Overall, there are more biological processes that are enriched in tissues as compared to either cell line with more enriched processes in the non-ovary tissues. The next section further investigates the role of genes, mRNA, aby comparing the protein expression fold changes to mRNA expression fold changes between CHO cells and hamster tissues.

131 For the transcriptome analysis, we identified stable and unstable functions

for each of the samples as indicated in Figure 4.5B. In tissues, stable and unstable

functions are often tissue- and organ-specific, relative to those in other tissues. For

example, in brain stable functions include protein synthesis, vesicle formation,

neuron branching, and neuron development whereas unstable functions include death, interphase, cell attachment, and neuron differentiation. Identification of brain-specific functions such as neuron branching and development need high protein levels. Interestingly, neuronal differentiation requires proteins with relatively rapid turnovers [198, 199]. Another interesting stable function was

vesicle-related. In the 2D mouse brain proteome, approximately 25% of proteins

were membrane-associated and involved transport, a function associated with

vesicle transport [200]. For vesicle formation, we identified genes including Rho

associated coiled-coil containing protein kinase 2 (ROCK2), ADP ribosylation factor

guanine nucleotide exchange factor 2 (ARFGEF2), zinc finger E-box binding

homeobox 1 (ZEB1), SRC proto-oncogene, clathrin heavy chain, ankyrin repeat and

FYVE domain containing 1 (ANKFY1), and adaptor related protein complex 2 alpha

2 subunit (AP2A2). Many of these genes, including AP2A2, were also identified in a

model of Huntington’s disease in the human brain [201]. ARFGEF2 is critical for the

brain; mutations causes movement disorder and neuronal migration because vesicle

trafficking, growth, and neurotransmitter receptor function are affected [202].

Additionally, ZEB1 is a biomarker for glioblastoma that plays a role in invasion,

chemoresistance, and tumorigenesis; it shows high expression in the brain [168].

132 The unstable brain functions represent transient activities, where high mRNA levels

are important such as interphase and neuronal differentiation.

For heart tissue, the stable functions include protein hydrolysis, connective

tissue shape change, cell homeostasis, and vesicle formation whereas unstable

functions include RNA transcription, production of reactive oxygen species (ROS),

and cell spreading. Genes associated with connective tissue morphology

upregulated in heart are secreted protein acidic and rich in cysteine (SPARC),

Immunoglobulin (CD79A) Binding Protein 1 (Igbp1), vasodilator-stimulated

phosphoprotein (Vasp), Rap Guanine Nucleotide Exchange Factor 1 (Rapgef1),

catenin CTNND1, and Integrin beta-1 (Itgb1). SPARC is an extracellular matrix component with high expression in the heart of mice [203]. In SPARC-null mice, the response to myocardial infarction involves increased cardiac rupture, dysfunction, and death as a result of collagen disorganization [204]. Catenins also play an important role in the heart, as the cadherin-catenin interaction compartmentalize cardiac cells during early vertebrate development [205]. Finally, Vasp shows localization to heart tissue. VASP and mammalian enabled (Mena) are associated with blood vessels and the intercalated discs of cardiac myocytes, and they co- localize with connexin-43 [206]. Overall, important activities for the heart show agreement between transcriptomic and proteomic data. Vesicle formation shows high protein expression in both brain and heart tissue with low expression in cells indicating that tissues likely have increased capacity for protein secretion. Unstable heart functions include RNA transcription and reactive oxygen species (ROS) production. Generation of ROS correlates with cardiac injury [207]. These oxidative

133 stress markers are precursors to heart failure [157]. RNA transcription results in

the generation of mRNA thus increasing the pool of mRNA relative to protein. These

activities are expected to show high mRNA expression.

In ovary tissue, stable functions include protein hydrolysis, protein

metabolism, microtubule dynamics, and cell viability whereas unstable functions

include nucleotide metabolism, ATP synthesis, cell homeostasis, and glycolysis.

Protein metabolism is upregulated in ovary and includes genes such as CAPN2,

POLDIP3, Snx1, ITCH, sirtuin 1 (SIRT1), CASP1, CRBN, Cold-inducible RNA-binding protein (CIRP), and Naglu. SIRT1 is overexpressed in ovarian cancer and inactivates p53 [208]. As an NAD+-dependent histone deacetylase, SIRT1 can affect glucose homeostasis and insulin sensitivity. The protein was observed by immunohistochemistry in human ovarian tissues and human luteinized granulosa cells [209]. Expression changes as a result of calorie restriction [209]. CIRP, identified at high levels in the ovary, has been a candidate for study in CHO cells.

CIRP was downregulated in order to evaluate the effect on cell growth and erythropoietin [155] production in CHO cells at low culture temperature [210].

Results showed that reduced CIRP expression did not recover the growth arrest

[210]. However, in another experiment, CIRP was overexpressed at standard culture temperature and improved titer by 25-40% in adherent and suspension cell lines

[211]. In this case, identification of proteins high in the ovary correlated with genetic engineering targets. Overall, the upregulation of protein metabolism in ovary could be important for recombinant protein production, one of the chief applications in CHO. The unstable functions represent activities that are

134 continuously active and highly metabolic such as glycolysis and ATP synthesis.

Because of the rapid turnover, it is expected that mRNA expression should be high.

Finally, the stable functions in lung tissue include death, shape change, and cytoskeletal rearrangement whereas unstable functions include cell membrane formation, endocytosis, molecular transport, and RNA expression. In differentiated lung tissue, the shape and cytoskeleton are maintained by structural proteins such as actin, dynactin, and alpha-actinin. Actin reorganization was correlated with cytoskeleton organization during surfactant secretion [212]. Intersectin (ITSN)-1 and ITSN-2 also show high protein levels in lung. In models of lung cancer progression, ITSN-1s contributes to proliferation, migration, and metastasis [213].

In contrast, endocytosis and molecular transport show high mRNA levels but low protein levels. Example genes include Vav2, low density lipoprotein receptor- related protein (LRP1), thrombospondin (THBS1), Cltc, mitogen-activated protein kinase 14 (Mapk14), stromal interaction molecule 1 (Stim1), KRAS, chloride voltage- gated channel 3 (Clcn3), and ERAP1. High expression of LRP1 is associated with high secretory capacity. When LRP1 expression is low, there is reduced migration, survival, proliferation, and differentiation [214]. Thbs1 is an extracellular matrix glycoprotein that shows inverse correlation with metastatic potential. At high expression levels, there is less metastatic potential [215]. Mapk14 is a kinase that is upregulated in lung cancer and affects inflammation. Signaling pathways triggered by cytokines upregulate Mapk14 expression and trigger the inflammatory response

[216]. Mutations in chloride voltage-gated channels, including Clcn3, affect secretion and cell volume; in the lung specifically, mutation leads to cystic fibrosis [217].

135 Another channel (Clc2) is expressed on the epithelium of respiratory cells and regulates fluid production and lung morphogenesis [218]. Finally, KRAS is a well- studied oncogene in lung cancer [219]. The lung in this experiment represents an organ with high secretory capacity that may help to elucidate secretion inefficiencies in CHO cells. The stable functions represent activities that are established within tissues. Molecular transport and RNA expression are continuously-occurring activities with high mRNA expression detected.

Our experiment generates the most-extensive tissue map of the Chinese hamster proteome, including over 8000 total proteins. Because of the hamster’s relevance to the biotechnology industry, we further compare the tissue proteome to

CHO cell lines in order to identify functional areas of differential expression.

Specifically, we observed enrichment of many metabolic pathways in all tissues that were not enriched in cells, such as protein processing, vesicle transport, and carbohydrate metabolism. Correlation between transcriptomic and proteomic data was in agreement for the most part, with at least 65% agreement for all samples.

Differences between mRNA and protein abundance reveal stable and unstable expression levels that further increase our knowledge about CHO gene and protein expression. In conclusion, this experiment expands on CHO knowledge by establishing a catalogue to differentiate between cells and tissues, thus creating the most extensive Chinese hamster proteome to-date.

136 Tables

Table 4.1: TMT Doublet Information

Peptide Unique Unique Experiment Spectrum Proteins (#) Peptides (#) Matches (#)

TMT 1 9,152 74,334 665,867

TMT 2 8,507 73,313 685,479

11,454 total

proteins

137 Table 4.2: Top 10 Most Enriched Biological Processes for CHO-S and Tissues

138 Table 4.3: Top 10 Most Enriched Biological Processes for CHO DG44 and Tissues

139 Figures

Figure 4.1: Overview of Experimental Design and Workflow. A: Proteome experiment compares hamster tissues (including brain, heart, lung, kidney, spleen, liver, and ovary) and CHO cell lines (including CHO-S and CHO DG44). B: Each TMT 10-plex contains a mix of tissues and cell line samples. Sample preparation involves protein extraction from tissues and cells, followed by reduction, alkylation, FASP, and digestion. Next, peptides are labeled and combined into two TMT 10-plex experiments for fractionation and mass spectrometry identification. In the bioinformatics analysis, peptides are matched to proteins and gene symbols for functional analysis in order to understand cell and tissue metabolism. Abbreviations: TMT (tandem mass tags), FASP (filter aided sample preparation), RPLC (strong cation exchange).

140 Figure 4.2: Protein Intensity Analysis. A: principal component analysis of CHO cells and hamster tissues. Clustering of overall protein intensity fold changes is calculated using the first and third principal components to show variation. B-E: normalized protein intensity is plotted for replicates of each sample corresponding to a different cluster from principal component analysis. B: heart, C: brain, D: lung, E: ovary.

TMT1 TMT2

141 Figure 4.3: Overall Distribution of Protein Intensity in Hamster Tissues. Protein intensity plotted as normalized fold change ratio. Outliers represented in yellow for proteins highly expressed for y-axis. Outliers represented in blue for proteins highly expressed for x-axis.

142 Figure 4.4: Highly Expressed Proteins. Center: heat map of highly expressed proteins. The top 200 proteins are plotted for each sample. Coloring is shown from low [43] to high (pink) abundance. Distinct clusters are shown for each sample. Clockwise from top: clusters of protein functions specific to tissues. Proteins were mapped to gene symbols for functional analysis. The protein functions exhibiting high expression for each tissue are shown relative to other tissues for brain, lung, heart, and ovary.

143 Figure 4.5: Comparison of Proteome and Transcriptome. A. Fold change ratio comparison between transcriptome and proteome. Green represents fold change ratio is in agreement between transcriptome and proteome, whereas red represents fold change ratio is not in agreement between transcriptome and proteome. B. Evaluation of stable and unstable protein expression for brain, heart, ovary, and lung all relative to CHO-S (clockwise from left). Stable represents proteome fold change >1 and transcriptome fold change <1. Unstable represents proteome fold change < 1 and transcriptome fold change >1. Functions were identified using IPA.

144 Chapter 5: Conclusions

CHO cells represent the dominant host for therapeutic recombinant protein

production. However, few large-scale datasets have been generated to characterize this host organism and derived CHO cell lines at the proteomics level. In this thesis,

numerous proteomics experiments using both label-free and labeled approaches

were used to characterize the Chinese hamster and CHO cell lines. The results

represent the largest CHO proteomics datasets to date and expand our knowledge of

the capabilities of CHO as a production host.

In the glycoproteomics comparison of CHO cell lines and glycosylation

mutants, a labeled proteomics experiment was designed to identify glycosylated and

siaylated proteins. We combined glycoprotein and sialylated protein enrichment

with iTRAQ peptide labeling to compare samples from cell lines differing in

glycosylation capabilities. Specifically, wild-type CHO was compared with

tunicamycin-treated CHO and Lec9.4a cells, which show approximately 50% of wild-

type glycosylation levels. A total of 381 unique glycoproteins were identified in the

comparison, with 187 glycoproteins identified with at least 2 unique peptides (194

glycoproteins with 1 unique peptide) and a stringent false discovery rate of 1%.

Results from the glycoproteomic analysis identified heavily-glycosylated proteins, such as membrane proteins and transporters. Upregulated glycoproteins in Lec9.4a cells include CD63 antigen, glyceraldehyde-3-phosphate dephydrogenase, vimentin, and ATP synthase. Proteins related to glycosylation are downregulated in Lec9.4a, such as alpha-(1,3)-fucosyltransferase, dolichyl-diphosphooligosaccharide-protein

145 glycosyltransferase subunit 1, uncharacterized glycosyltransferase AER61-like, and

uncharacterized glycosyltransferase AGO61-like. Next, wild-type Pro-5 CHO was compared with Lec2 cells, which have a mutation in CMP-sialic acid transporter causing significant reduction in sialylation. A total of 272 unique sialylated proteins were identified across 2 experiments, with at least 2 unique peptides and a stringent false discovery rate of 1%. Upregulated sialylated proteins include cadherin-17-like,

CD63 antigen-like, neural cell adhesion molecule 1, and CD97 antigen. The

downregulated sialoproteins, such as legumain, clusterin-like, transmembrane

protein 2, dolichyl-diphosphooligosaccharide-protein glycosyltransferase subunit

STT3A, and beta-1,4-galactosyltransferase 3, detect defects in glycosylation. Thus,

results from the glycoproteomic and sialoproteomic analysis of glycosylation

mutant cell lines demonstrate differences in cell metabolism.

Next, an extensive label-free quantitative proteomics analysis of two cell lines (CHO-S and CHO DG44) and two Chinese hamster tissues (liver and ovary) was used to identify a total of 11801 unique proteins containing at least two unique peptides. 9359 unique proteins were identified specifically in the cell lines, representing a 56% increase over previous work. Additionally, 6663 unique proteins were identified across liver and ovary tissues providing the first Chinese hamster tissue proteome. Protein expression was more conserved within cell lines during both growth phases than across cell lines, suggesting large genetic differences across cell lines. Overall, both gene ontology and KEGG pathway analysis revealed enrichment of cell cycle activity in cells. In contrast, upregulated molecular functions in tissue include glycosylation and lipid transporter activity. Furthermore,

146 cellular components including Golgi apparatus are upregulated in both tissues. In

conclusion, this large-scale proteomics analysis enables us to delineate specific

changes between tissues and cells derived from these tissues, which can help

explain specific tissue function and the adaptations cells incur for applications in

biopharmaceutical productions.

Finally, TMT labeling was used to compare protein expression across CHO

cell lines and eight hamster tissues including the brain, heart, kidney, liver, lung,

ovary, and spleen organs. In this experiment, we sought to compare various tissues

from the Chinese hamster (Cricetulus griseus) from which CHO cells were derived in order to improve our knowledge of engineering the CHO host for biotherapeutics production. Over 8500 proteins were identified in each of two TMT experiments; this corresponded to a total of 11454 unique proteins with a false discovery rate of

1%. Protein intensity plots were generated between tissues and cell lines to identify proteins with significant differential expression. In addition, the top 200 (3%) of proteins highly expressed by each sample were determined to compare cell and tissue specificity. Finally, we combined the proteomics with transcriptomic data using RNAseq in order to correlate mRNA and protein expression for each of the samples. Over 65% of genes show agreement between transcriptome and proteome fold changes.

Detailed comparisons of hamster tissues and CHO cell lines further improve our knowledge of Chinese hamster physiology. One caveat of proteomics is that the protein expression and functional differences need future work in order to be validated and proven. We envision that our predictions will suggest future

147 directions of research in the CHO community. In this work, we have greatly

expanded knowledge of the CHO host by thorough label-free and labeled proteomics

approaches. Specifically, we identified both glycosylated and sialylated proteins in

CHO cell lines, created the largest CHO proteome to-date, and established the first

Chinese hamster tissue proteome. We hope this work can be used across the CHO community to design new experiments by harnessing our knowledge of the Chinese hamster.

148 References

1. Huggett, B., Public biotech 2012--the numbers. Nat Biotechnol, 2013. 31(8): p. 697-703. 2. Lewis, N.E., et al., Genomic landscapes of Chinese hamster ovary cell lines as revealed by the Cricetulus griseus draft genome. Nat Biotechnol, 2013. 31(8): p. 759-65. 3. Xu, X., et al., The genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line. Nat Biotech, 2011. 29(8): p. 735-741. 4. Baycin-Hizal, D., et al., Proteomic analysis of Chinese hamster ovary cells. J Proteome Res, 2012. 11(11): p. 5265-76. 5. Valente, K.N., et al., Optimization of protein sample preparation for two- dimensional electrophoresis. Electrophoresis, 2012. 33(13): p. 1947-57. 6. Wisniewski, J.R., et al., Universal sample preparation method for proteome analysis. Nat Methods, 2009. 6(5): p. 359-62. 7. Baik, J.Y., et al., Proteomic understanding of intracellular responses of recombinant chinese hamster ovary cells adapted to grow in serum-free suspension culture. Biotechnology Progress, 2011. 27(6): p. 1680-1688. 8. Baik, J.Y., et al., Limitations to the comparative proteomic analysis of thrombopoietin producing Chinese hamster ovary cells treated with sodium butyrate. J Biotechnol, 2008. 133(4): p. 461-8. 9. Van Dyk, D.D., et al., Identification of cellular changes associated with increased production of human growth hormone in a recombinant Chinese hamster ovary cell line. Proteomics, 2003. 3(2): p. 147-56. 10. Kuystermans, D., M.J. Dunn, and M. Al-Rubeai, A proteomic study of cMyc improvement of CHO culture. BMC Biotechnology, 2010. 10(1): p. 25. 11. Beckmann, T.F., et al., Effects of high passage cultivation on CHO cells: a global analysis. Appl Microbiol Biotechnol, 2012. 94(3): p. 659-71. 12. Hayduk, E.J., L.H. Choe, and K.H. Lee, A two-dimensional electrophoresis map of Chinese hamster ovary cell proteins based on fluorescence staining. ELECTROPHORESIS, 2004. 25(15): p. 2545-2556. 13. Lee, J.S., et al., Protein reference mapping of dihydrofolate reductase-deficient CHO DG44 cell lines using 2-dimensional electrophoresis. Proteomics, 2010. 10(12): p. 2292-302. 14. Meleady, P., et al., Impact of miR-7 over-expression on the proteome of Chinese hamster ovary cells. J Biotechnol, 2012. 160(3-4): p. 251-62. 15. Slade, P.G., et al., Identifying the CHO Secretome using Mucin-type O-Linked Glycosylation and Click-chemistry. Journal of Proteome Research, 2012. 11(12): p. 6175-6186. 16. Valente, K.N., et al., Recovery of Chinese hamster ovary host cell proteins for proteomic analysis. Biotechnol J, 2014. 9(1): p. 87-99. 17. Bonner, M.K., et al., Mitotic Spindle Proteomics in Chinese Hamster Ovary Cells. PLoS ONE, 2011. 6(5): p. e20489.

149 18. Harsha, H.C., H. Molina, and A. Pandey, Quantitative proteomics using stable isotope labeling with amino acids in cell culture. Nat Protoc, 2008. 3(3): p. 505-16. 19. Park, S.S., et al., Effective correction of experimental errors in quantitative proteomics using stable isotope labeling by amino acids in cell culture (SILAC). J Proteomics, 2012. 75(12): p. 3720-32. 20. Mertins, P., et al., iTRAQ Labeling is Superior to mTRAQ for Quantitative Global Proteomics and Phosphoproteomics. Molecular & Cellular Proteomics : MCP, 2012. 11(6): p. M111.014423. 21. Doolan, P., et al., Microarray and proteomics expression profiling identifies several candidates, including the valosin-containing protein (VCP), involved in regulating high cellular growth rate in production CHO cell lines. Biotechnol Bioeng, 2010. 106(1): p. 42-56. 22. Meleady, P., et al., Sustained productivity in recombinant Chinese hamster ovary (CHO) cell lines: proteome analysis of the molecular basis for a process- related phenotype. BMC Biotechnol, 2011. 11: p. 78. 23. Mi, H., et al., The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 2005. 33(suppl 1): p. D284-D288. 24. Clarke, C., et al., Integrated miRNA, mRNA and protein expression analysis reveals the role of post-transcriptional regulation in controlling CHO cell growth rate. BMC Genomics, 2012. 13(1): p. 656. 25. Carlage, T., et al., Analysis of dynamic changes in the proteome of a Bcl-XL overexpressing Chinese hamster ovary cell culture during exponential and stationary phases. Biotechnology Progress, 2012. 28(3): p. 814-823. 26. Wei, Y.Y.C., et al., Proteomics analysis of chinese hamster ovary cells undergoing apoptosis during prolonged cultivation. Cytotechnology, 2011. 63(6): p. 663-77. 27. Ohsfeldt, E., et al., Increased expression of the integral membrane proteins EGFR and FGFR3 in anti-apoptotic Chinese hamster ovary cell lines. Biotechnol Appl Biochem, 2012. 59(3): p. 155-62. 28. Carlage, T., et al., Proteomic profiling of a high-producing Chinese hamster ovary cell culture. Anal Chem, 2009. 81(17): p. 7357-62. 29. Kantardjieff, A., et al., Transcriptome and proteome analysis of Chinese hamster ovary cells under low temperature and butyrate treatment. J Biotechnol, 2010. 145(2): p. 143-59. 30. Dorai, H., et al., Proteomic analysis of bioreactor cultures of an antibody expressing CHOGS cell line that promotes high productivity. Journal of Proteomics & Bioinformatics, 2013. 2013. 31. Kang, S., et al., Cell line profiling to improve monoclonal antibody production. Biotechnol Bioeng, 2014. 111(4): p. 748-60. 32. Lim, U.M., et al., Identification of autocrine growth factors secreted by CHO cells for applications in single-cell cloning media. J Proteome Res, 2013. 12(7): p. 3496-510. 33. Kim, J.Y., et al., Proteomic understanding of intracellular responses of recombinant Chinese hamster ovary cells cultivated in serum-free medium

150 supplemented with hydrolysates. Appl Microbiol Biotechnol, 2011. 89(6): p. 1917-28. 34. Meleady, P., et al., Utilization and evaluation of CHO-specific sequence databases for mass spectrometry based proteomics. Biotechnol Bioeng, 2012. 109(6): p. 1386-94. 35. Xu, X., et al., The genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line. Nat Biotechnol, 2011. 29. 36. Meleady, P., et al., Utilization and evaluation of CHO-specific sequence databases for mass spectrometry based proteomics. Biotechnol Bioeng, 2012. 109. 37. Kildegaard, H.F., et al., The emerging CHO systems biology era: harnessing the 'omics revolution for biotechnology. Curr Opin Biotechnol, 2013. 24(6): p. 1102-7. 38. Dietmair, S., et al., A Multi-Omics Analysis of Recombinant Protein Production in Hek293 Cells. PLoS One, 2012. 7(8). 39. Minch, S.L., P.T. Kallio, and J.E. Bailey, Tissue plasminogen activator coexpressed in Chinese hamster ovary cells with alpha(2,6)-sialyltransferase contains NeuAc alpha(2,6)Gal beta(1,4)Glc-N-AcR linkages. Biotechnol Prog, 1995. 11(3): p. 348-51. 40. Burda, P. and M. Aebi, The dolichol pathway of N-linked glycosylation. Biochim Biophys Acta, 1999. 1426(2): p. 239-57. 41. Patnaik, S.K. and P. Stanley, Lectin-resistant CHO glycosylation mutants. Methods Enzymol, 2006. 416: p. 159-82. 42. Cantagrel, V., et al., SRD5A3 is required for the conversion of polyprenol to dolichol, essential for N-linked protein glycosylation. Cell, 2010. 142(2): p. 203-217. 43. Morava, E., et al., ALG6-CDG: a recognizable phenotype with epilepsy, proximal muscle weakness, ataxia and behavioral and limb anomalies. J Inherit Metab Dis, 2016. 39(5): p. 713-23. 44. Foulquier, F., et al., Conserved oligomeric Golgi complex subunit 1 deficiency reveals a previously uncharacterized congenital disorder of glycosylation type II. Proceedings of the National Academy of Sciences of the United States of America, 2006. 103(10): p. 3764-3769. 45. Rosenwald, A.G. and S.S. Krag, Lec9 CHO glycosylation mutants are defective in the synthesis of dolichol. J Lipid Res, 1990. 31(3): p. 523-33. 46. Senkal, C.E., et al., Alteration of Ceramide Synthase 6/C16-Ceramide Induces Activating Transcription Factor 6-mediated Endoplasmic Reticulum (ER) Stress and Apoptosis via Perturbation of Cellular Ca2+ and ER/Golgi Membrane Network. Journal of Biological Chemistry, 2011. 286(49): p. 42446-42458. 47. Olden, K., R.M. Pratt, and K.M. Yamada, Selective cytotoxicity of tunicamycin for transformed cells. Int J Cancer, 1979. 24(1): p. 60-6. 48. Eckhardt, M., B. Gotza, and R. Gerardy-Schahn, Mutants of the CMP-sialic acid transporter causing the Lec2 phenotype. J Biol Chem, 1998. 273(32): p. 20189-95. 49. Lim, S.F., et al., The Golgi CMP-sialic acid transporter: A new CHO mutant provides functional insights. Glycobiology, 2008. 18(11): p. 851-60.

151 50. Wright, A. and S.L. Morrison, Effect of C2-Associated Carbohydrate Structure on Ig Effector Function: Studies with Chimeric Mouse-Human IgG1 Antibodies in Glycosylation Mutants of Chinese Hamster Ovary Cells. The Journal of Immunology, 1998. 160(7): p. 3393. 51. Stanley, P., Selection of specific wheat germ agglutinin-resistant (WgaR) phenotypes from Chinese hamster ovary cell populations containing numerous lecR genotypes. Mol Cell Biol, 1981. 1(8): p. 687-96. 52. Kao, F.T. and T.T. Puck, Genetics of Somatic Mammalian Cells. IV. Properties of Chinese Hamster Cell Mutants with Respect to the Requirement for Proline. Genetics, 1967. 55(3): p. 513-24. 53. Rosenwald, A.G., P. Stanley, and S.S. Krag, Control of carbohydrate processing: increased beta-1,6 branching in N-linked carbohydrates of Lec9 CHO mutants appears to arise from a defect in oligosaccharide-dolichol biosynthesis. Molecular and Cellular Biology, 1989. 9(3): p. 914-924. 54. North, S.J., et al., Glycomics profiling of Chinese hamster ovary cell glycosylation mutants reveals N-glycans of a novel size and complexity. J Biol Chem, 2010. 285(8): p. 5759-75. 55. Wong, N.S.C., M.G.S. Yap, and D.I.C. Wang, Enhancing recombinant glycoprotein sialylation through CMP-sialic acid transporter over expression in Chinese hamster ovary cells. Biotechnology and Bioengineering, 2006. 93(5): p. 1005-1016. 56. Zeng, G.Q., et al., Comparative proteome analysis of human lung squamous carcinoma using two different methods: two-dimensional gel electrophoresis and iTRAQ analysis. Technol Cancer Res Treat, 2012. 11(4): p. 395-408. 57. Pottiez, G., et al., Comparison of 4-plex to 8-plex iTRAQ quantitative measurements of proteins in human plasma samples. J Proteome Res, 2012. 11(7): p. 3774-81. 58. Wu, W.W., et al., Comparative study of three proteomic quantitative methods, DIGE, cICAT, and iTRAQ, using 2D gel- or LC-MALDI TOF/TOF. J Proteome Res, 2006. 5(3): p. 651-8. 59. Leong, S., et al., iTRAQ-based proteomic profiling of breast cancer cell response to doxorubicin and TRAIL. J Proteome Res, 2012. 11(7): p. 3561-72. 60. Poole, C.F., New trends in solid-phase extraction. TrAC Trends in Analytical Chemistry, 2003. 22(6): p. 362-373. 61. Solid-phase extraction clean-up of soil and sediment extracts for the determination of various types of pollutants in a single run. JournalDąbrowska, of Chromatography H., et al., A, 2003. 1003(1–2): p. 29-42. 62. Zhou, Y., R. Aebersold, and H. Zhang, Isolation of N-Linked Glycopeptides from Plasma. Analytical Chemistry, 2007. 79(15): p. 5826-5837. 63. Peterson, D.S., et al., Dual-Function Microanalytical Device by In Situ Photolithographic Grafting of Porous Polymer Monolith: Integrating Solid- Phase Extraction and Enzymatic Digestion for Peptide Mass Mapping. Analytical Chemistry, 2003. 75(20): p. 5328-5335. 64. Zhang, H., et al., Identification and quantification of N-linked glycoproteins using hydrazide chemistry, stable isotope labeling and mass spectrometry. Nat Biotechnol, 2003. 21(6): p. 660-6.

152 65. Sturn, A., J. Quackenbush, and Z. Trajanoski, Genesis: cluster analysis of microarray data. Bioinformatics, 2002. 18(1): p. 207-8. 66. Alonso-Gutierrez, J., et al., Principal component analysis of proteomics (PCAP) as a tool to direct metabolic engineering. Metabolic Engineering, 2015. 28: p. 123-133. 67. Gulmann, C., et al., Array-based proteomics: mapping of protein circuitries for diagnostics, prognostics, and therapy guidance in cancer. The Journal of Pathology, 2006. 208(5): p. 595-606. 68. Coscia, F., et al., Integrative proteomic profiling of ovarian cancer cell lines reveals precursor cell associated proteins and functional status. Nature Communications, 2016. 7: p. 12645. 69. Kornak, U., et al., Impaired glycosylation and cutis laxa caused by mutations in the vesicular H+-ATPase subunit ATP6V0A2. Nat Genet, 2008. 40(1): p. 32-34. 70. Esko, J.D. and C.R. Bertozzi, Chemical Tools for Inhibiting Glycosylation, in Essentials of Glycobiology, A. Varki, et al., Editors. 2009, The Consortium of Glycobiology Editors, La Jolla, California.: Cold Spring Harbor NY. 71. Rupp, O., et al., Construction of a Public CHO Cell Line Transcript Database Using Versatile Bioinformatics Analysis Pipelines. PLoS ONE, 2014. 9(1): p. e85568. 72. Bessarabova, M., et al., Knowledge-based analysis of proteomics data. BMC Bioinformatics, 2012. 13(Suppl 16): p. S13-S13. 73. Schneider, E.G., H.T. Nguyen, and W.J. Lennarz, The effect of tunicamycin, an inhibitor of protein glycosylation, on embryonic development in the sea urchin. Journal of Biological Chemistry, 1978. 253(7): p. 2348-55. 74. Nair, S., et al., Toxicogenomics of endoplasmic reticulum stress inducer tunicamycin in the small intestine and liver of Nrf2 knockout and C57BL/6J mice. Toxicology Letters, 2007. 168(1): p. 21-39. 75. Sun, S. and H. Zhang, Large-Scale Measurement of Absolute Protein Glycosylation Stoichiometry. Analytical Chemistry, 2015. 87(13): p. 6479- 6482. 76. Hermans, M.M.P., et al., Human lysosomal α-glucosidase: functional characterization of the glycosylation sites. Biochemical Journal, 1993. 289(3): p. 681-686. 77. Hers, H.G., α-Glucosidase deficiency in generalized glycogen-storage disease (Pompe's disease). Biochemical Journal, 1963. 86(1): p. 11-16. 78. Escrevente, C., et al., Functional role of N-glycosylation from ADAM10 in processing, localization and activity of the enzyme. Biochimica et Biophysica Acta (BBA) - General Subjects, 2008. 1780(6): p. 905-913. 79. Endres, K. and F. Fahrenholz, Regulation of alpha-secretase ADAM10 expression and activity. Experimental Brain Research, 2012. 217(3): p. 343- 352. 80. Sulpizio, S., et al., Cathepsins and pancreatic cancer: The 2012 update. Pancreatology, 2012. 12(5): p. 395-401. 81. Wang, B., et al., Human Cathepsin F: MOLECULAR CLONING, FUNCTIONAL EXPRESSION, TISSUE LOCALIZATION, AND ENZYMATIC CHARACTERIZATION. Journal of Biological Chemistry, 1998. 273(48): p. 32000-32008.

153 82. Okamoto, I., et al., Proteolytic Cleavage of the CD44 Adhesion Molecule in Multiple Human Tumors. The American Journal of Pathology, 2002. 160(2): p. 441-447. 83. Skelton, T.P., et al., Glycosylation Provides Both Stimulatory and Inhibitory Effects on Cell Surface and Soluble CD44 Binding to Hyaluronan. The Journal of Cell Biology, 1998. 140(2): p. 431-446. 84. Bartolazzi, A., et al., Glycosylation of CD44 is implicated in CD44-mediated cell adhesion to hyaluronan. The Journal of Cell Biology, 1996. 132(6): p. 1199- 1208. 85. Jiménez, D., et al., Contribution of N-Linked Glycans to the Conformation and Function of Intercellular Adhesion Molecules (ICAMs). Journal of Biological Chemistry, 2005. 280(7): p. 5854-5861. 86. He, P., et al., Identification of Intercellular Cell Adhesion Molecule 1 (ICAM-1) as a Hypoglycosylation Marker in Congenital Disorders of Glycosylation Cells. Journal of Biological Chemistry, 2012. 287(22): p. 18210-18217. 87. Leighton, M. and K.E. Kadler, Paired Basic/Furin-like Proprotein Convertase Cleavage of Pro-BMP-1 in the trans-Golgi Network. Journal of Biological Chemistry, 2003. 278(20): p. 18478-18484. 88. Garrigue-Antar, L., N. Hartigan, and K.E. Kadler, Post-translational Modification of Bone Morphogenetic Protein-1 Is Required for Secretion and Stability of the Protein. Journal of Biological Chemistry, 2002. 277(45): p. 43327-43334. 89. Mueller, B., et al., SEL1L nucleates a protein complex required for dislocation of misfolded glycoproteins. Proceedings of the National Academy of Sciences, 2008. 105(34): p. 12325-12330. 90. Saltini, G., et al., A novel polymorphism in SEL1L confers susceptibility to Alzheimer's disease. Neuroscience Letters, 2006. 398(1): p. 53-58. 91. Kilcoyne, M., et al., Neuronal glycosylation differentials in normal, injured and chondroitinase-treated environments. Biochemical and Biophysical Research Communications, 2012. 420(3): p. 616-622. 92. Angata, K. and M. Fukuda, ST3 Beta-Galactoside Alpha-2,3-Sialyltransferase 1 (ST3GAL1), in Handbook of Glycosyltransferases and Related Genes, N. Taniguchi, et al., Editors. 2014, Springer Japan: Tokyo. p. 637-644. 93. Jeanneau, C., et al., Structure-Function Analysis of the Human Sialyltransferase ST3Gal I: ROLE OF N-GLYCOSYLATION AND A NOVEL CONSERVED SIALYLMOTIF. Journal of Biological Chemistry, 2004. 279(14): p. 13461- 13468. 94. Mohorko, E., R. Glockshuber, and M. Aebi, Oligosaccharyltransferase: the central enzyme of N-linked protein glycosylation. Journal of Inherited Metabolic Disease, 2011. 34(4): p. 869-878. 95. Shrimal, S., et al., Mutations in STT3A and STT3B cause two congenital disorders of glycosylation. Human Molecular Genetics, 2013. 22(22): p. 4638- 4645. 96. Wang, Y., H. Schachter, and J.D. Marth, Mice with a homozygous deletion of the Mgat2 gene encoding UDP-N-acetylglucosamine:α-6-d-mannoside β1,2-N- acetylglucosaminyltransferase II: a model for congenital disorder of

154 glycosylation type IIa. Biochimica et Biophysica Acta (BBA) - General Subjects, 2002. 1573(3): p. 301-311. 97. Tan, J., et al., Mutations in the MGAT2 gene controlling complex N-glycan synthesis cause carbohydrate-deficient glycoprotein syndrome type II, an autosomal recessive disease with defective brain development. American Journal of Human Genetics, 1996. 59(4): p. 810-817. 98. Kashyap, M.K., et al., SILAC-based quantitative proteomic approach to identify potential biomarkers from the esophageal squamous cell carcinoma secretome. Cancer Biol Ther, 2010. 10(8): p. 796-810. 99. Raso, C., et al., Characterization of breast cancer interstitial fluids by TmT labeling, LTQ-Orbitrap Velos mass spectrometry, and pathway analysis. J Proteome Res, 2012. 11(6): p. 3199-210. 100. Yang, S., et al., Glycoproteins identified from heart failure and treatment models. Proteomics, 2015. 15(2-3): p. 567-79. 101. Tian, Y., et al., Identification of glycoproteins associated with different histological subtypes of ovarian tumors using quantitative glycoproteomics. Proteomics, 2011. 11(24): p. 4677-87. 102. Puck, T.T., P. Sanders, and D. Petersen, Life Cycle Analysis of Mammalian Cells: II. Cells from the Chinese Hamster Ovary Grown in Suspension Culture. Biophysical Journal, 1964. 4(6): p. 441-450. 103. 2014 Biological Approvals. 2014; Available from: http://www.fda.gov/BiologicsBloodVaccines/DevelopmentApprovalProcess /BiologicalApprovalsbyYear/ucm385842.htm. 104. 2015 Biological Approvals. 2015; Available from: http://www.fda.gov/BiologicsBloodVaccines/DevelopmentApprovalProcess /BiologicalApprovalsbyYear/ucm434959.htm. 105. 2016 Biological Approvals. 2016; Available from: http://www.fda.gov/BiologicsBloodVaccines/DevelopmentApprovalProcess /BiologicalApprovalsbyYear/ucm482392.htm. 106. Wang, J., et al., Targeting the NG2/CSPG4 Proteoglycan Retards Tumour Growth and Angiogenesis in Preclinical Models of GBM and Melanoma. PLOS ONE, 2011. 6(7): p. e23062. 107. Kildegaard, H.F., et al., The emerging CHO systems biology era: harnessing the ‘omics revolution for biotechnology. Current Opinion in Biotechnology, 2013. 24(6): p. 1102-1107. 108. Tjio, J.H. and T.T. Puck, GENETICS OF SOMATIC MAMMALIAN CELLS : II. CHROMOSOMAL CONSTITUTION OF CELLS IN TISSUE CULTURE. The Journal of Experimental Medicine, 1958. 108(2): p. 259-268. 109. Wurm, M.F., CHO Quasispecies—Implications for Manufacturing Processes. Processes, 2013. 1(3). 110. Kaufman, R.J., et al., Coamplification and co-expression of human tissue-type plasminogen activator and murine dihydrofolate reductase sequences in Chinese hamster ovary cells. Mol Cell Biol, 1985. 5: p. 1750-59. 111. Urlaub, G., et al., Deletion of the diploid dihydrofolate reductase locus from cultured mammalian cells. Cell, 1983. 33(2): p. 405-12.

155 112. Kaas, C.S., et al., Sequencing the CHO DXB11 genome reveals regional variations in genomic stability and haploidy. BMC Genomics, 2015. 16(1): p. 1- 9. 113. Cao, Y., et al., Construction of BAC-based physical map and analysis of chromosome rearrangement in Chinese hamster ovary cell lines. Biotechnol Bioeng, 2012. 109(6): p. 1357-67. 114. Vishwanathan, N., et al., Global insights into the Chinese hamster and CHO cell transcriptomes. Biotechnology and Bioengineering, 2015. 112(5): p. 965-976. 115. Zhang, Y., et al., High-Throughput Lipidomic and Transcriptomic Analysis To Compare SP2/0, CHO, and HEK-293 Mammalian Cell Lines. Analytical Chemistry, 2016. 116. Kumar, A., et al., Elucidation of the CHO Super-Ome (CHO-SO) by Proteoinformatics. Journal of proteome research, 2015. 14(11): p. 4687-4703. 117. Levy, N.E., et al., Identification and characterization of host cell protein product-associated impurities in monoclonal antibody bioprocessing. Biotechnology and Bioengineering, 2014. 111(5): p. 904-912. 118. Valente, K.N., A.M. Lenhoff, and K.H. Lee, Expression of difficult-to-remove host cell protein impurities during extended Chinese hamster ovary cell culture and their impact on continuous bioprocessing. Biotechnology and Bioengineering, 2015. 112(6): p. 1232-1242. 119. McIlwain, S., et al., Estimating relative abundances of proteins from shotgun proteomics data. BMC Bioinformatics, 2012. 13: p. 308. 120. Le, H., C. Chen, and C.T. Goudar, An evaluation of public genomic references for mapping RNA-Seq data from Chinese hamster ovary cells. Biotechnol Bioeng, 2015. 112(11): p. 2412-6. 121. Mudunuri, U., et al., bioDBnet: the biological database network. Bioinformatics, 2009. 25(4): p. 555-6. 122. Ashburner, M., et al., Gene Ontology: tool for the unification of biology. Nat Genet, 2000. 25(1): p. 25-9. 123. Ogata, H., et al., KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 1999. 27(1): p. 29-34. 124. Latosinska, A., et al., Comparative Analysis of Label-Free and 8-Plex iTRAQ Approach for Quantitative Tissue Proteomic Analysis. PLoS ONE, 2015. 10(9): p. e0137048. 125. Trinh, H.V., et al., iTRAQ-Based and Label-Free Proteomics Approaches for Studies of Human Adenovirus Infections. International Journal of Proteomics, 2013. 2013: p. 16. 126. Lauc, G., et al., Genomics Meets Glycomics—The First GWAS Study of Human N- Glycome Identifies HNF1α as a Master Regulator of Plasma Protein Fucosylation. PLoS Genetics, 2010. 6(12): p. e1001256. 127. Pan, C., et al., Comparative Proteomic Phenotyping of Cell Lines and Primary Cells to Assess Preservation of Cell Type-specific Functions. Molecular & Cellular Proteomics : MCP, 2009. 8(3): p. 443-450. 128. Uhlén, M., et al., Tissue-based map of the human proteome. Science, 2015. 347(6220).

156 129. Tuffy, K.M. and S.L. Planey, Cytoskeleton-Associated Protein 4: Functions Beyond the Endoplasmic Reticulum in Physiology and Disease. ISRN Cell Biology, 2012. 2012: p. 11. 130. Mastrangelo, A.J., et al., Part I. Bcl-2 and bcl-xL limit apoptosis upon infection with alphavirus vectors. Biotechnology and Bioengineering, 2000. 67(5): p. 544-554. 131. Mastrangelo, A.J., et al., Part II. Overexpression of bcl-2 family members enhances survival of mammalian cells in response to various culture insults. Biotechnology and Bioengineering, 2000. 67(5): p. 555-564. 132. Templeton, N., et al., The impact of anti-apoptotic gene Bcl-2 expression on CHO central metabolism. Metab Eng, 2014. 25: p. 92-102. 133. Hossler, P., S.F. Khattak, and Z.J. Li, Optimal and consistent protein glycosylation in mammalian cell culture. Glycobiology, 2009. 19(9): p. 936-49. 134. Yin, B., et al., Glycoengineering of Chinese hamster ovary cells for enhanced erythropoietin N-glycan branching and sialylation. Biotechnology and Bioengineering, 2015. 112(11): p. 2343-2351. 135. Eser, P., et al., Periodic mRNA synthesis and degradation co‐operate during cell cycle gene expression. Molecular Systems Biology, 2014. 10(1): p. 717. 136. Edros, R.Z., S. McDonnell, and M. Al-Rubeai, Using Molecular Markers to Characterize Productivity in Chinese Hamster Ovary Cell Lines. PLOS ONE, 2013. 8(10): p. e75935. 137. Carlage, T., et al., Analysis of dynamic changes in the proteome of a Bcl-XL overexpressing Chinese hamster ovary cell culture during exponential and stationary phases. Biotechnol Prog, 2012. 28(3): p. 814-23. 138. Koltai, T., Clusterin: a key player in cancer chemoresistance and its inhibition. Onco Targets Ther, 2014. 7: p. 447-56. 139. Gundemir, S., et al., Transglutaminase 2: A molecular Swiss army knife. Biochimica et Biophysica Acta (BBA) - Molecular Cell Research, 2012. 1823(2): p. 406-419. 140. Kantardjieff, A., et al., Transcriptome and proteome analysis of Chinese hamster ovary cells under low temperature and butyrate treatment. Journal of Biotechnology, 2010. 145(2): p. 143-159. 141. Orellana, C.A., et al., High-Antibody-Producing Chinese Hamster Ovary Cells Up- Regulate Intracellular Protein Transport and Glutathione Synthesis. Journal of Proteome Research, 2015. 14(2): p. 609-618. 142. Shockey, J., J. Schnurr, and J. Browse, Characterization of the AMP-binding protein gene family in Arabidopsis thaliana: will the real acyl-CoA synthetases please stand up? Biochem Soc Trans, 2000. 28(6): p. 955-7. 143. Yamada, K. and T. Sasaki, Rat Liver Glycolipid Transfer Protein. A Protein Which Facilitates the Translocation of Mono- and Dihexosylcera-mides from Donor to Acceptor Liposomes. The Journal of Biochemistry, 1982. 92(2): p. 457-464. 144. Dixon, J.L. and H.N. Ginsberg, Hepatic synthesis of lipoproteins and apolipoproteins. Semin Liver Dis, 1992. 12(4): p. 364-72.

157 145. Chan, C.J., et al., Myosin II Activity Softens Cells in Suspension. Biophys J, 2015. 108(8): p. 1856-69. 146. Lee, V.H., J.H. Britt, and B.S. Dunbar, Localization of laminin proteins during early follicular development in pig and rabbit ovaries. J Reprod Fertil, 1996. 108(1): p. 115-22. 147. Buschman, M.D., J. Rahajeng, and S.J. Field, GOLPH3 Links the Golgi, DNA Damage, and Cancer. Cancer Research, 2015. 75(4): p. 624-627. 148. Eckert, E.S.P., et al., Golgi Phosphoprotein 3 Triggers Signal-mediated Incorporation of Glycosyltransferases into Coatomer-coated (COPI) Vesicles. Journal of Biological Chemistry, 2014. 289(45): p. 31319-31329. 149. Isaji, T., et al., An Oncogenic Protein Golgi Phosphoprotein 3 Up-regulates Cell Migration via Sialylation. Journal of Biological Chemistry, 2014. 289(30): p. 20694-20705. 150. Mohamed, V.P., et al., Chinese hamster ovary (CHO-K1) cells expressed native insulin-like growth factor-1 (IGF-1) gene towards efficient mammalian cell culture host system. African Journal of Biotechnology, 2011. 10(81): p. 18716- 18721. 151. Mohamed, V.P., et al., Effects of glutamine and serum on IGF-dependent CHO- K1 cell growth in T-flask. Advances in Environmental Biology, 2014: p. 831. 152. Lind, A.K., et al., Collagens in the human ovary and their changes in the perifollicular stroma during ovulation. Acta Obstet Gynecol Scand, 2006. 85(12): p. 1476-84. 153. Izquierdo-Rico, M.J., et al., Hamster zona pellucida is formed by four glycoproteins: ZP1, ZP2, ZP3, and ZP4. J Proteome Res, 2009. 8(2): p. 926-41. 154. Du, Z., et al., Use of a small molecule cell cycle inhibitor to control cell growth and improve specific productivity and product quality of recombinant proteins in CHO cell cultures. Biotechnology and Bioengineering, 2015. 112(1): p. 141- 155. 155. Desdouets, C., et al., Cyclin A: function and expression during cell proliferation. Prog Cell Cycle Res, 1995. 1: p. 115-23. 156. Kung, A.L., S.W. Sherwood, and R.T. Schimke, Differences in the regulation of protein synthesis, cyclin B accumulation, and cellular growth in response to the inhibition of DNA synthesis in Chinese hamster ovary and HeLa S3 cells. Journal of Biological Chemistry, 1993. 268(31): p. 23072-80. 157. Borchi, E., et al., Enhanced ROS production by NADPH oxidase is correlated to changes in antioxidant enzyme activity in human heart failure. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease, 2010. 1802(3): p. 331- 338. 158. Li, C.J. and M.L. DePamphilis, Mammalian Orc1 protein is selectively released from chromatin and ubiquitinated during the S-to-M transition in the cell division cycle. Mol Cell Biol, 2002. 22(1): p. 105-16. 159. McNairn, A.J., et al., Chinese hamster ORC subunits dynamically associate with chromatin throughout the cell-cycle. Experimental cell research, 2005. 308(2): p. 345-356. 160. Gavin, E.J., et al., Reduction of Orc6 Expression Sensitizes Human Colon Cancer Cells to 5-Fluorouracil and Cisplatin. PLOS ONE, 2009. 3(12): p. e4054.

158 161. McNairn, A.J. and D.M. Gilbert, Overexpression of ORC subunits and increased ORC-chromatin association in transformed mammalian cells. J Cell Biochem, 2005. 96(5): p. 879-87. 162. Chen, C.-R., et al., E2F4/5 and p107 as Smad Cofactors Linking the TGFβ Receptor to c-myc Repression. Cell, 2002. 110(1): p. 19-32. 163. Krampe, B. and M. Al-Rubeai, Cell death in mammalian cell culture: molecular mechanisms and cell line engineering strategies. Cytotechnology, 2010. 62(3): p. 175-188. 164. Majors, B.S., et al., E2F-1 overexpression increases viable cell density in batch cultures of Chinese hamster ovary cells. J Biotechnol, 2008. 138(3-4): p. 103-6. 165. Blomberg, I. and I. Hoffmann, Ectopic expression of Cdc25A accelerates the G(1)/S transition and leads to premature activation of cyclin E- and cyclin A- dependent kinases. Mol Cell Biol, 1999. 19(9): p. 6183-94. 166. Ray, D. and H. Kiyokawa, CDC25A phosphatase: a rate-limiting oncogene that determines genomic stability. Cancer Res, 2008. 68(5): p. 1251-3. 167. Lee, K.H., et al., Generation of high-producing cell lines by overexpression of cell division cycle 25 homolog A in Chinese hamster ovary cells. J Biosci Bioeng, 2013. 116(6): p. 754-60. 168. Siebzehnrubl, F.A., et al., The ZEB1 pathway links glioblastoma initiation, invasion and chemoresistance. EMBO Molecular Medicine, 2013. 5(8): p. 1196. 169. Lewis, J.M., et al., Integrin regulation of c-Abl tyrosine kinase activity and cytoplasmic–nuclear transport. Proceedings of the National Academy of Sciences, 1996. 93(26): p. 15174-15179. 170. Greuber, E.K., et al., Role of ABL Family Kinases in Cancer: from Leukemia to Solid Tumors. Nature reviews. Cancer, 2013. 13(8): p. 559-571. 171. Shiba-Ishii, A. and M. Noguchi, Aberrant Stratifin Overexpression Is Regulated by Tumor-Associated CpG Demethylation in Lung Adenocarcinoma. The American Journal of Pathology, 2012. 180(4): p. 1653-1662. 172. Sarno, J., et al., Stratifin expression in epithelial ovarian cancer: the role of methylation in vitro and in vivo. Cancer Research, 2007. 67(9 Supplement): p. 1057-1057. 173. Urlaub, G. and L.A. Chasin, Isolation of Chinese hamster cell mutants deficient in dihydrofolate reductase activity. Proceedings of the National Academy of Sciences of the United States of America, 1980. 77(7): p. 4216-4220. 174. Feichtinger, J., et al., Comprehensive genome and epigenome characterization of CHO cells in response to evolutionary pressures and over time. Biotechnology and Bioengineering, 2016. 113(10): p. 2241-2253. 175. Peng, X.-c., et al., Comparative Proteomic Approach Identifies Pkm2 and Cofilin- 1 as Potential Diagnostic, Prognostic and Therapeutic Targets for Pulmonary Adenocarcinoma. PLoS ONE, 2011. 6(11): p. e27309. 176. Heffner, K., et al., Proteomics in Cell Culture: From Genomics to Combined ‘Omics for Cell Line Engineering and Bioprocess Development, in Cell Culture, M. Al-Rubeai, Editor. 2015, Springer International Publishing: Cham. p. 591-614. 177. Monger, C., et al., A Bioinformatics Pipeline for the Identification of CHO Cell Differential Gene Expression from RNA-Seq Data. (1940-6029 (Electronic)).

159 178. Ross, P.L., et al., Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteomics, 2004. 3(12): p. 1154-69. 179. Ong, S.E., et al., Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics, 2002. 1(5): p. 376-86. 180. Thompson, A., et al., Tandem Mass Tags: A Novel Quantification Strategy for Comparative Analysis of Complex Protein Mixtures by MS/MS. Analytical Chemistry, 2003. 75(8): p. 1895-1904. 181. Xin, M., E.N. Olson, and R. Bassel-Duby, Mending broken hearts: cardiac development as a basis for adult heart regeneration and repair. Nat Rev Mol Cell Biol, 2013. 14(8): p. 529-541. 182. Itskovitz-Eldor, J., et al., Differentiation of human embryonic stem cells into embryoid bodies compromising the three embryonic germ layers. Molecular Medicine, 2000. 6(2): p. 88-95. 183. Rodgers, R.J., H.F. Irving-Rodgers, and D.L. Russell, Extracellular matrix of the developing ovarian follicle. Reproduction, 2003. 126(4): p. 415-424. 184. Lundby, A., et al., Proteomic Analysis of Lysine Acetylation Sites in Rat Tissues Reveals Organ Specificity and Subcellular Patterns. Cell Reports. 2(2): p. 419- 431. 185. Geiger, T., et al., Initial quantitative proteomic map of 28 mouse tissues using the SILAC mouse. Mol Cell Proteomics, 2013. 12(6): p. 1709-22. 186. Dietschy, J.M. and S.D. Turley, Cholesterol metabolism in the brain. Current Opinion in Lipidology, 2001. 12(2). 187. Yu, R.K., Y. Nakatani, and M. Yanagisawa, The role of glycosphingolipid metabolism in the developing brain. Journal of Lipid Research, 2009. 50(Supplement): p. S440-S445. 188. Bosse, K., et al., Mass spectrometry-based secretome analysis of non-small cell lung cancer cell lines. Proteomics, 2016. 16(21): p. 2801-2814. 189. Morgan, H.E., et al., Regulation of Protein Synthesis in Heart Muscle: I. EFFECT OF AMINO ACID LEVELS ON PROTEIN SYNTHESIS. Journal of Biological Chemistry, 1971. 246(7): p. 2152-2162. 190. Taylor, S.W., et al., Characterization of the human heart mitochondrial proteome. Nat Biotech, 2003. 21(3): p. 281-286. 191. Nadiarnykh, O., et al., Alterations of the extracellular matrix in ovarian cancer studied by Second Harmonic Generation imaging microscopy. BMC Cancer, 2010. 10(1): p. 94. 192. Chakraborti, T., et al., Complement activation in heart diseases. Role of oxidants. (0898-6568 (Print)). 193. Kahn, A., Jr., The lung--an endocrine organ. J Ark Med Soc, 1978. 74(11): p. 462-3. 194. Kristen, U. and J. Lockhausen, Estimation of Golgi membrane flow rates in ovary glands of aptenia cordifolia using cytochalasin B. European journal of cell biology, 1983. 29(2): p. 262-267. 195. Wang, W., et al., Retinol dehydrogenase, RDH1l, is essential for the heart development and cardiac performance in zebrafish. (0366-6999 (Print)).

160 196. Martinez-Lopez, M.J., et al., Mouse neuron navigator 1, a novel microtubule- associated protein involved in neuronal migration. Mol Cell Neurosci, 2005. 28(4): p. 599-612. 197. Baldissera, M.D., et al., Relationship between DNA damage in liver, heart, spleen and total blood cells and disease pathogenesis of infected rats by Trypanosoma evansi. (1090-2449 (Electronic)). 198. Frese, C.K., et al., Quantitative Map of Proteome Dynamics during Neuronal Differentiation. Cell Reports, 2017. 18(6): p. 1527-1542. 199. Sabri, M.I., A.H. Bone, and A.N. Davison, Turnover of myelin and other structural proteins in the developing rat brain. Biochemical Journal, 1974. 142(3): p. 499. 200. Wang, H., et al., Characterization of the Mouse Brain Proteome Using Global Proteomic Analysis Complemented with Cysteinyl-Peptide Enrichment. Journal of Proteome Research, 2006. 5(2): p. 361-369. 201. Shirasaki, Dyna I., et al., Network Organization of the Huntingtin Proteomic Interactome in Mammalian Brain. Neuron. 75(1): p. 41-57. 202. de Wit, M.C.Y., et al., Movement disorder and neuronal migration disorder due to ARFGEF2 mutation. neurogenetics, 2009. 10(4): p. 333-336. 203. Sage, H., et al., Distribution of the calcium-binding protein SPARC in tissues of embryonic and adult mice. Journal of Histochemistry & Cytochemistry, 1989. 37(6): p. 819-829. 204. McCurdy, S., et al., Cardiac extracellular matrix remodeling: Fibrillar collagens and Secreted Protein Acidic and Rich in Cysteine (SPARC). Journal of Molecular and Cellular Cardiology. 48(3): p. 544-549. 205. Linask, K.K., K.A. Knudsen, and Y.-H. Gui, N-Cadherin–Catenin Interaction: Necessary Component of Cardiac Cell Compartmentalization during Early Vertebrate Heart Development. Developmental Biology, 1997. 185(2): p. 148- 164. 206. Gambaryan, S., et al., Distribution, cellular localization, and postnatal development of VASP and Mena expression in mouse tissues. Histochemistry and Cell Biology, 2001. 116(6): p. 535-543. 207. Chen, Q., et al., Ischemic defects in the electron transport chain increase the production of reactive oxygen species from isolated rat heart mitochondria. American Journal of Physiology - Cell Physiology, 2008. 294(2): p. C460. 208. Expression and prognostic significance of SIRT1 in ovarian epithelial tumours. Pathology, 2009. 41(4): p. 366-371. 209. Morita, Y., et al., Resveratrol promotes expression of SIRT1 and StAR in rat ovarian granulosa cells: an implicative role of SIRT1 in the ovary. Reproductive Biology and Endocrinology, 2012. 10(1): p. 14. 210. Hong, J.K., et al., Down-regulation of cold-inducible RNA-binding protein does not improve hypothermic growth of Chinese hamster ovary cells producing erythropoietin. Metabolic Engineering, 2007. 9(2): p. 208-216. 211. Tan, H.K., et al., Overexpression of cold-inducible RNA-binding protein increases interferon-γ production in Chinese-hamster ovary cells. Biotechnology and Applied Biochemistry, 2008. 49(4): p. 247-257.

161 212. Singh, T.K., et al., Reorganization of cytoskeleton during surfactant secretion in lung type II cells: a role of annexin II. (0898-6568 (Print)). 213. Jeganathan, N., et al., Rac1-mediated cytoskeleton rearrangements induced by intersectin-1s deficiency promotes lung cancer cell proliferation, migration and metastasis. Molecular Cancer, 2016. 15(1): p. 59. 214. Wujak, L., P. Markart, and M. Wygrecka, The low density lipoprotein receptor- related protein (LRP) 1 and its function in lung diseases. (1699-5848 (Electronic)). 215. Zabrenetzky, V., et al., Expression of the extracellular matrix molecule thrombospondin inversely correlates with malignant progression in melanoma, lung and breast carcinoma cell lines. International Journal of Cancer, 1994. 59(2): p. 191-195. 216. Facchinetti, F., et al., CHF-6297, a Novel Selective p38α (MAPK14) Inhibitor Designed for Inhalation Delivery, Effectively Counteracts IL-1?-Evoked Acute Lung Inflammation, in C74. ADVANCES IN TRANSLATIONAL COPD. 2017, American Thoracic Society. p. A6330-A6330. 217. Lamb, F.S., et al., Ontogeny of CLCN3 chloride channel gene expression in human pulmonary epithelium. (1044-1549 (Print)). 218. Blaisdell, C.J., et al., Inhibition of CLC-2 chloride channel expression interrupts expansion of fetal lung cysts. American Journal of Physiology - Lung Cellular and Molecular Physiology, 2004. 286(2): p. L420. 219. Riely, G.J., J. Marks, and W. Pao, KRAS Mutations in Non–Small Cell Lung Cancer. Proceedings of the American Thoracic Society, 2009. 6(2): p. 201-205.

162 Curriculum Vitae Kelley Marie Heffner

23115 Robin Song Drive • Clarksburg, MD 20871 • 301-518-2042 • [email protected]

EDUCATION Johns Hopkins University, Baltimore, MD, September 2012–August 2017 PhD in Chemical and Biomolecular Engineering University of Maryland, College Park, MD, August 2008–May 2012 Bachelor of Science in Bioengineering Summa Cum Laude (4.0 GPA), Honors College Citation, Gemstone Research Program Citation EXPERIENCE Michael J. Betenbaugh Lab, Baltimore, MD, September 2012–June 2017 Graduate Researcher . Analyzed growth of Chinese hamster ovary (CHO) cell lines and harvested hamster organs. . Established CHO cell line and first hamster tissue proteomes with over 10,000 unique proteins. . Compared statistical differences at transcriptomic and proteomic scale for CHO cell lines and hamster tissues. . Applied bioinformatics to identify enriched and depleted pathways in glycosylation- deficient CHO cell lines.

MedImmune/AstraZeneca, Gaithersburg, MD, July 2009–September 2016 Antibody Discovery and Protein Engineering Intern (June 2016–September 2016) . Designed proteomics workflow to increase throughput and robustness using automated liquid handling steps. . Established proteomics plasma depletion method to remove high-abundance proteins and compare expression between healthy and disease samples for clinical biomarker identification. Antibody Discovery and Protein Engineering Intern (June 2013–September 2013) . Developed immunoprecipitation-comparative proteomics method to identify secretory pathway proteins. . Evaluated phosphoproteomics, glycoproteomics, membrane enrichment, and iTRAQ comparative proteomics to analyze cell and tissue protein expression. Process Cell Culture Co-op, (March 2011–August 2012) . Established alternating tangential flow perfusion process to improve protein productivity and quality. . Developed high-throughput phenotypic microarray to investigate effect of media components on cell growth. . Dipeptide supplementation was implemented during tech transfer to improve antibody productivity and reduce lot-to-lot variability at manufacturing scale. Process Cell Culture Intern (June 2010–August 2010) . Increased recombinant glycoprotein sialylation using DOE to alter process parameters and media formulation. Process Cell Culture Intern (July 2009–August 2009) . Optimized pilot facility Pod-filtration for bioreactor harvests by reducing rinse water and chase buffer volumes.

Adam Hsieh Lab, College Park, MD, December 2010–May 2012 Undergraduate Researcher

163 . Investigated the proliferation of nucleus pulposus cell populations on plastic and collagen thin films. . Researched the effect of fluid shear stress on intervertebral disc cells to understand cellular mechanobiology.

Gemstone Research Program, College Park, MD, August 2008–May 2012 Web Liaison, Team Member of ‘Ligament Elasticity post-Graft Surgery (LEGS)’ . Part of 14-member team that sectioned, stained, and imaged cadaveric ACL grafts to correlate collagen organization and tension loss in order to advocate for ACL graft repair material. . Managed the team website by designing and uploading relevant project information.

SKILLS

Mammalian cell culture, CHO suspension culture, adherent culture, media formulation, bioreactors, data analysis, Microsoft Office, MatLab, R, proteomics, IPA, JMP, HPLC, MS, lab automation, DOE, process development, tissue culture, perfusion culture, AssayMap robot, Pod filtration, technical writing, literature review, time management

LEADERSHIP

Fleet Feet Sports, Gaithersburg, MD, June 2016–present Marathon Run Pacing Instructor . Schedule track and long runs and coach runners of various paces to complete marathons.

Cellular and Molecular Biotechnology Course, Baltimore, MD, Fall 2013, Fall 2014 Teaching Assistant . Led grading of homework, projects, and exams. . Taught weekly 3-hour course and managed office hours for students.

Graduate Student Liaison Committee, Baltimore, MD, August 2012–present Student Volunteer . Establish link between graduate students and elementary schools through STEM activities. . Led organization of departmental Toys for Tots drive and collected holiday donations.

Women in Engineering Program, College Park, MD, August 2009–May 2011 Peer Mentor . Advised freshmen engineering females on academics, campus life, and career development.

Engineers Without Borders, College Park, MD, August 2008–May 2011 Executive Board Secretary, Ethiopia Team Member . Traveled to Ethiopia to develop sustainable engineering technology for youth center. . Elected to Executive Board and maintained written notes of meetings.

AWARDS

. Walter L. Robb, Whiting School of Engineering Graduate Fellowship, 2012–2017 . National Science Foundation Graduate Research Fellowship, 2012–2017 . Fischell Department of Bioengineering Outstanding Senior Award, 2012 . Howard Hughes Medical Institute Undergraduate Research Fellowship, 2011 . Howard Hughes Medical Institute Gemstone Research Fellowship, 2010, 2009 . Fischell Department of Bioengineering Outstanding Junior Award, 2011 . Society of Women Engineers Life Technologies Scholarship, 2011 . Armed Forces Communications and Electronics Association Scholarship, 2011, 2008 . Suez Energy Generation Scholarship, 2011, 2010 . Baltimore Post American Society of Military Engineers Scholarship, 2010 . A. James Clark School of Engineering Knust Memorial Scholarship, 2009

164 . Association for Women in Science Scholarship, 2009 . Armed Forces Communications and Electronics Association Scholarship, 2009, 2008 . Clarksburg Parent-Teacher-Student Association Scholarship, 2008 . Clarksburg Athletic Boosters Scholarship, 2008 . Maryland Distinguished Scholarship, 2008–2012 . University of Maryland Banneker Key Scholar, 2008–2012

PRESENTATIONS 1) K. Heffner, D. Baycin Hizal, G. Yerganian, M. Betenbaugh, “Lessons from the hamster: A combined ‘omics approach to analyze CHO cell lines and Cricetulus griseus tissues”, ACS National Meeting, San Francisco CA, oral presentation April 2017. 2) K. Heffner, K. Lekstrom, D. Baycin Hizal, R. Chaerkady, R. Woods, M. Bowen, “HiT ME UP: High throughput method evaluation using proteomics”, MedImmune Intern Scientific Poster Session, Gaithersburg MD, August 2016. 3) K. Heffner, D. Baycin Hizal, M. Betenbaugh, H. Zhang, S. Krag, “Applying iTRAQ proteomics and systems biology approach to understand the glycosylated and sialylated pathways of CHO mutants”, American Society for Mass Spectrometry, San Antonio TX, June 2016. 4) K. Heffner, D. Baycin Hizal, M. Bowen, G. Yerganian, M. Betenbaugh, “Lessons from the hamster: Cricetulus griseus tissue and CHO cell line proteomes inform bioprocessing”, Cell Culture Engineering, La Quinta CA, May 2016. 5) K. Heffner, J. Lee, R. Venkat, M. Betenbaugh, “High throughput cell culture media component screening”, Cell Culture Engineering, Quebec City Canada, May 2014. 6) K. Heffner, D. Baycin Hizal, M. Bowen, “You belong with me: Development of an immunoprecipitation coupled mass spectrometry-based proteomics approach to identify interacting proteins involved in monoclonal and bispecific antibody production”, MedImmune Intern Scientific Poster Session, Gaithersburg MD, August 2013. 2nd Place Award 7) K. Heffner, A. Hsieh, “Effect of fluid shear stress on nucleus pulposus cell mechanotransduction”, Bioengineering Departmental Honors Thesis Presentation, College Park MD, May 2012. 8) A. Chandramani, M. Costales, B. Garbus, J. Hartstein, K. Heffner, R. Jain, K. Klein, A. McDonnell, P. Patel, V. Stefanelli, J. Wang, J. Weinberg, T. Zhang, A. Hsieh, H. Kim, C. Bennett, “Effect of tensile strength loss on collagen organization in ACL grafts post-reconstruction surgery”, Biomedical Engineering Society Annual Meeting, Hartford CT, October 2011. 9) K. Heffner, J. Lee, “High-throughput cell culture media component screening”, MedImmune Intern Scientific Poster Session, Gaithersburg MD, August 2011. 1st Place Award 10) J. Lee, K. Heffner, C. Barton, B. Holman, “Controlling sialylation of recombinant glycoprotein in cell culture process”, European Society for Animal Cell Technology Meeting, Vienna Austria, May 2011. 11) K. Heffner, J. Lee, B. Holman, J. Reier, “High-throughput methods for cell culture media investigation”, University of Maryland Undergraduate Research Day, College Park MD, April 2011. 12) K. Heffner, J. Lee, B. Holman, R. Heckathorn, “Improving recombinant blood factor II process development and sialylation”, MedImmune Intern Presentation Day, Gaithersburg MD, August 2010. 13) K. Heffner, A. Groves, “Optimization of pod harvest”, MedImmune Intern Presentation Day, Gaithersburg MD, August 2009. 3rd Place Award

PUBLICATIONS 1) K. Heffner, D. Baycin Hizal, M. Bowen, M. Betenbaugh, “Lessons from the hamster: Cricetulus grisues tissue and CHO cell line proteomes inform biotherapeutics research and development”, submission in progress. 2) K. Heffner, D. Baycin Hizal, M. Betenbaugh, “GlycoCHO: characterization of Chinese hamster ovary (CHO) glycosylation-deficient cell lines using a glycoproteomics approach”, submission in progress. 3) K. Heffner, D. Baycin Hizal, Q. Zhang, M. Betenbaugh, “Glycoengineering of mammalian expression systems on a cellular level”. In: E. Rapp, U. Reichl, Advances in Biochemical Engineering, Springer, submitted.

165 4) J. Lee, J. Reier, K. Heffner, C. Barton, D. Spencer, A. Schmelzer, R. Venkat, “Production and characterization of active recombinant human factor II with consistent sialic acid”, 2017, Biotechnology and Bioengineering, online DOI: 10.1002/bit.26317. 5) Y. Zhang, D. Baycin Hizal, A. Kumar, J. Priola, M. Bahri, K. Heffner, M. Wang, X. Han, M. Bowen, M. Betenbaugh, “High-throughput lipidomic and transcriptomic analysis to compare SP2/0, CHO, and HEK-293 mammalian cell lines”, 2016, Analytical Chemistry, 89(3), p. 1477-1485. 6) S. Bennum, D. Baycin Hizal, K. Heffner, O. Can, H. Zhang, M. Betenbaugh, “Systems glycobiology: integrating glycogenomics, glycoproteomics, glycomics, and other ‘omics data sets to characterize cellular glycosylation processes”, 2016, Journal of Molecular Biology, 428(16), p. 3337-3352. 7) A. Kumar, D. Baycin Hizal, D. Wolozny, L. Pedersen, N. Lewis, K. Heffner, R. Chaerkady, R. Cole, J. Shiloach, H. Zhang, M. Bowen, M. Betenbaugh, “Elucidation of the CHO super-ome (CHO-SO) by proteoinformatics”, 2015, Journal of Proteome Research, 14(11), p. 4687-4703. 8) K. Heffner, C. Schroeder Kaas, A. Kumar, D. Baycin Hizal, M. Betenbaugh, “Proteomics in cell culture: from genomics to combined ‘omics for cell line engineering and bioprocess development”. In: M. Al-Rubeai, Animal Cell Culture, Dublin, Ireland, Springer, 2015, p. 591-614. 9) A. Kumar, K. Heffner, J. Shiloach, M. Betenbaugh, D. Baycin Hizal, “Harnessing Chinese hamster ovary cell proteomics for biopharmaceutical processing”, 2014, Pharmaceutical Bioprocessing, 2(5), p. 421-435. 10) S. Bennun, K. Heffner, E. Blake, A. Chung, M. Betenbaugh, “Control of biotherapeutics glycosylation”. In: H. Hauser, R. Wagner, Animal Cell Biotechnology, Berlin, Germany, De Gruyter, 2014, p. 247-279. 11) K. Heffner, D. Baycin Hizal, A. Kumar, J. Shiloach, J. Zhu, M. Bowen, M. Betenbaugh, “Exploiting the proteomics revolution in biotechnology: from disease and antibody targets to optimizing bioprocess development”, 2014, Current Opinion in Biotechnology, 30, p. 80-86. 12) V. Jadhav, M. Hackl, A. Druz, S. Shridhar, C. Chung, K. Heffner, D. Kreil, M. Betenbaugh, J. Shiloach, N. Barron, J. Grillari, N. Borth, “CHO microRNA engineering is growing up: Recent successes and future challenges”, 2013, Biotechnology Advances, 31(8), p. 1501-1513.

166