UNDERSTANDING PHYSIOLOGY OF DISEASES AND CELL LINES USING OMICS BASED APPROACHES

by Amit Kumar

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy

Baltimore, Maryland May, 2015

© 2015 Amit Kumar All Rights Reserved

Abstract

This thesis focuses on understanding the physiology of diseases and cell lines using OMICS-based approaches such as microarray-based expression analysis and mass spectrometry-based analysis. It includes extensive work on functionally characterizing mass spectrometry data with bioinformatics tools in order to identify secreted proteins. This dissertation also includes work on using omics-based techniques coupled with bioinformatics tools to elucidate the pathophysiology of diseases such as Type 2 Diabetes (T2D).

Although the well-known characteristic of T2D is hyperglycemia, there are multiple other metabolic abnormalities that occur in T2D, including insulin resistance and dyslipidemia.

To better understand the alterations in metabolic tissues associated with T2D, microarray analysis was performed on metabolic tissues from a mouse model of pre-diabetes and T2D, with the aim of identifying the metabolic abnormalities that may contribute to T2D. This study also uncovered novel pathways regulated by the insulin-sensitizing agent CL-316,243, identifying key pathways and target genes in metabolic tissues that can reverse the diabetic phenotype.

Specifically, the study found significant decreases in the expression of mitochondrial and peroxisomal oxidation genes in the skeletal muscle and adipose tissue of adult MKR mice, and in the liver of pre-diabetic MKR mice, compared to healthy mice. In addition, this study explained the lower free fatty acid levels in MKR mice after treatment with CL-316,243 and provided biomarker genes such as ACAA1 and HSD17b4.


Using results from the T2D microarray studies, a multi-tissue computational model was created from metabolic reconstructions for in silico simulation of T2D, with the goal of better understanding the disease pathophysiology. A time-efficient algorithm for generating tissue-specific metabolic models is presented in this study. Flux balance analysis using the multi-tissue model showed that the branched-chain amino acid degradation and fatty acid oxidation pathways were significantly downregulated in MKR mice versus healthy mice. The T2D multi-tissue model was able to explain the high levels of branched-chain amino acids and free fatty acids in the plasma of T2D subjects from a systems-level metabolic flux perspective.

In addition to the T2D studies, this dissertation reports the identification of the complete collection of proteins that make up the Chinese hamster ovary (CHO) cell proteome, which has been an invaluable source of information for scientists, allowing them to engineer their cell lines to increase the efficiency of therapeutics production. Proteomics has been especially attractive for biotechnology applications since it can provide an understanding of disease states and aid drug discovery and development. Moreover, CHO cells are the preferred host cell line for manufacturing a variety of biologicals, including monoclonal antibodies. A proteomics and bioinformatics analysis of the spent medium from CHO cells was performed. From the analysis of the post-centrifugation supernatant of CHO cells, thousands of unique proteins that are potentially secreted from CHO cells were identified. In order to categorize these proteins functionally, multiple bioinformatics tools including SignalP, TargetP, SecretomeP, TMHMM, WoLF PSORT, and Phobius were applied. This analysis provided information on the cellular localization of the proteins found in the supernatant, including the presence of transmembrane domains and signal peptides. Proteins were shown to be localized to the secretory pathway, including those playing a role in cell growth, proliferation, and folding as well as those involved in the degradation and removal of other proteins. As a part of this effort, a publicly accessible web-based tool called GO-CHO (http://ebdrup.biosustain.dtu.dk/gocho/) was created to functionally categorize the proteins. This work and database will enable the CHO community to rapidly identify high-abundance host cell proteins in their cultures in order to facilitate processing and purification efforts in the future. Moreover, the compartmentalization strategies presented in this work will help the CHO community in understanding the CHO secretory machinery.

Advisors: Michael J. Betenbaugh and Joseph Shiloach


Preface

The emergence of OMICS technologies in recent years has enabled relatively fast analysis of the complex physiology of biological systems, including diseases and cell lines. Their main advantage is that a large amount of information can be obtained at relatively low cost and effort and converted into biologically meaningful results.

To analyze cells or tissues by an OMICS approach, various technologies are employed, such as genomics, transcriptomics, proteomics, and metabolomics. Genomics uses technologies such as fluorescence in situ hybridization, comparative genome hybridization arrays, and single nucleotide polymorphism arrays to decipher the physiology of biological systems. Transcriptomics, in turn, uses mRNA microarrays and real time polymerase chain reaction (RT-PCR) to achieve similar goals. Proteomics technologies include separation techniques such as one-dimensional sodium dodecyl-sulfate polyacrylamide gel electrophoresis (1D-SDS-PAGE), two-dimensional (2D) PAGE, high pressure liquid chromatography (HPLC), and ultra-pressure liquid chromatography (UPLC), as well as reverse-phase liquid chromatography tandem mass spectrometry (RP-LC-MS/MS), arrays, matrix-assisted laser desorption ionization time-of-flight mass spectrometry, and bioinformatics methods to study biological systems. Metabolomics employs techniques such as gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), HPLC, and ¹H nuclear magnetic resonance (¹H-NMR) to study the events and interactions of cellular structures and processes, from DNA and genes to metabolites, in a complex and global way. Using OMICS platforms, all classes of biological compounds can be analyzed: epigenetic markers, genes, messenger ribonucleic acid (mRNA), proteins, and metabolites. In other words, genomics and transcriptomics methods enable assessment of genetic information, proteomics permits assessment of the proteins actually translated, and metabolomics displays the results after the preceding three are executed.

T2D is a complex disease of epidemic proportions. It is a public health, economic, scientific, and ethical issue, and it requires proactive and preventive approaches to the individual and public health burden caused by diabetes and its co-morbidities. The complexity of the T2D phenotype has challenged fragmented scientific approaches, which typically focus on genetic, environmental (diet, lifestyle), or socio-economic conditions in isolation rather than on multi-scale, longitudinal, systems-level studies.

The focus of this dissertation is to present an emerging strategy that uses computational methods to study the pathophysiology of Type 2 Diabetes (T2D). In addition, OMICS technologies were applied to study the physiology of Chinese Hamster Ovary (CHO) cells and E. coli.

This dissertation consists of seven chapters and is mainly focused on applying OMICS technologies such as transcriptomics and proteomics to improve the understanding of the physiology of T2D and of cell lines such as CHO cells and E. coli. The first chapter, which was published in PLoS One (PMID: 25029527), introduces a computational strategy based on metabolic reconstruction to study metabolic fluxes in the T2D condition. The second chapter, which was published in Nutrition and Metabolism (PMID: 25784953), discusses the effect of T2D in terms of differences in gene expression measured by microarrays and discusses the effect of a drug (CL 316,243) on T2D. In the process of studying the effect of T2D and the aforementioned drug, we have also characterized the metabolic characteristics of a T2D animal model, the MKR mouse. The third chapter, which was published in Proteomics Clinical Applications (Reuse license number: 3632651416772), discusses proteomics and its application in understanding the physiology of cell lines and diseases. Chapter four discusses the application of transcriptomics and proteomics in deciphering differences between two strains of E. coli. The fifth chapter, published in Pharmaceutical Bioprocessing, is dedicated to advances in proteomics technology specifically related to CHO cells. Chapter six discusses the application of proteomics in identifying secreted proteins in CHO cells and introduces novel bioinformatics strategies. Finally, chapter seven concludes the dissertation and discusses future work to extend the efforts presented in this study.

Acknowledgements

I believe that getting a PhD is a process of evolution, development, learning, and growth. I owe this progress to my family, especially my parents and my wife, Olivia Franken, for their continuous support in my journey to pursue a doctoral degree. I am deeply thankful to my parents for instilling in me the enthusiasm to pursue higher education. I would also like to express my sincere gratitude to my academic advisor Dr. Michael J. Betenbaugh and NIH advisor Dr. Joseph Shiloach for their wisdom, mentorship, and invaluable guidance. It was an honor and privilege to work with them. I would also like to thank Dr. Deniz Baycin-Hizal for mentoring me throughout the Ph.D. program and helping me learn the nuances of proteomics techniques and analysis of the data.


I would also like to thank all my colleagues at both Johns Hopkins University and the NIDDK Biotechnology Unit, especially Dr. Alex Druz, for their advice, guidance, and assistance, which allowed me to learn and grow professionally and personally. I would like to thank our close collaborators at the Mt. Sinai School of Medicine, especially Dr. Derek LeRoith and Dr. Emily Gallagher, for their commitment and technical assistance. It has been a truly rewarding experience working with all of them.

Finally, I would like to acknowledge that all funding for the work presented here was provided by the Intramural Research Program at the National Institute of Diabetes and Digestive and Kidney Diseases at the National Institutes of Health and the Department of Chemical and Biomolecular Engineering at Johns Hopkins University.


Table of Contents

Title page

Abstract

Preface

Acknowledgements

List of Tables

List of Figures

Chapter 1: Multi-tissue computational modeling analyzes pathophysiology of Type 2 Diabetes in MKR mice

Chapter 2: The beta-3 adrenergic agonist (CL-316,243) restores the expression of down-regulated fatty acid oxidation genes in Type 2 diabetic mice

Chapter 3: Coupling Enrichment and Proteomics Methods for Understanding and Treating Disease

Chapter 4: Global gene transcription and translation in E. coli B and K under minimal media conditions

Chapter 5: Harnessing Chinese hamster ovary (CHO) cell proteomics for biopharmaceutical processing

Chapter 6: Elucidation of the CHO Super-Ome (CHO-SO) by BioInfo-Proteomics

Chapter 7: Conclusions and future work

References

Curriculum Vitae


List of Tables

Chapter 1
Table 1. A comparison of different models
Table 2. Simplified depiction of the validation results
Table 3. Summarized results of validation based on T2DM gene expression change
Table 4. Microarray results for BCAA and FA degradation pathways

Chapter 2
Table 1. KEGG pathways enrichment P values in the differentially expressed gene sets
Table 2. Gene expression data for fatty acid oxidation pathway genes in adipose tissue from MKR vs WT mice
Table 3. Gene expression data for fatty acid oxidation pathway genes in skeletal muscle from MKR vs WT mice
Table 4. Gene expression data for fatty acid oxidation pathway genes in liver from MKR vs WT mice
Table 5. Gene expression data for fatty acid oxidation pathway genes in liver from pre-diabetic MKR vs age-matched WT mice
Table 6. Gene expression data for fatty acid oxidation pathway genes in adipose from CL-316,243 treated MKR vs vehicle treated MKR mice

Chapter 3
Table 1. A summary of various methods used in membrane proteome enrichment

Chapter 4
Table 1. Concentrations of biomass, [...], and acetate and glucose uptake rate (qs), acetate formation rate (qA), and yield on glucose (Yx/s)
Table 2. Results of hypergeometric test showing top 10 overrepresented pathways in each study type

Chapter 5
Table 1. A summary of various studies using in-gel and in-solution digestion methods
Table 2. Advantages and disadvantages of different methods used in mass spectrometry

Chapter 6
Table 1. Results summary from proteins identified from the CHO supernatant
Table 2. T-cell epitopes identified from high abundance novel CHO-SO proteins


List of Figures

Chapter 1
Figure 1. Multi-tissue model building workflow
Figure 2. OMIM data used for model validation
Figure 3. ROC plot for validation using OMIM data
Figure 4. ROC plot for exchange flux changes using Zucker fatty rat data
Figure 5. ROC plot for transport reaction flux changes using Zucker fatty rat data
Figure 6. Canonical biosynthesis pathway of fatty acids in liver using microarray data
Figure 7. MCL multi-tissue model predictions on different pathways

Chapter 2
Figure 1. Results from differential change analyses
Figure 2. Gene network analysis of the fatty acid oxidation pathway in adipose tissue from MKR vs WT mice
Figure 3. Gene network analysis of the fatty acid oxidation pathway in skeletal muscle from MKR vs WT mice
Figure 4. Gene network analysis of the fatty acid oxidation pathway in liver from MKR vs WT mice
Figure 5. Gene network analysis of the fatty acid oxidation pathway in adipose tissue from CL-316,243-treated MKR vs vehicle-treated MKR mice

Chapter 3
Figure 1. Progression of OMICS applications
Figure 2. Applications of mass spectrometry proteomics methodologies
Figure 3. Different techniques used in proteome profiling

Chapter 4
Figure 1. Graph showing flow rate, dO2 level, air flow rate, and agitation rate during the chemostat cultures
Figure 2. Venn diagrams showing comparison between transcriptomics data and proteomics data
Figure 3. TCA cycle and associated pathways in JM109
Figure 4. Acetate formation biochemical pathway and associated genes' differential expression in JM109
Figure 5. TCA cycle and associated pathways in BL21
Figure 6. Acetate formation biochemical pathway and associated genes' differential expression in BL21
Figure 7. Enrichment analysis results for biological processes category

Chapter 5
Figure 1. A general scheme showing application of proteomics
Figure 2. Labeling methods
Figure 3. CHO proteomics database

Chapter 6
Figure 1. Overview of the process of obtaining functionally categorized CHO-SO
Figure 2. Analysis of all CHO-SO genes
Figure 3. Results from bioinformatics analyses
Figure 4. An example output from the GO CHO website
Figure 5. Results from Gene Ontology (GO) hypergeometric distribution analysis
Figure 6. Functional network categories
Figure 7. KEGG's focal adhesion pathway


Chapter 1: Multi-tissue computational modeling analyzes pathophysiology of Type 2 Diabetes in MKR mice

1. Summary

Computational models using metabolic reconstructions for in silico simulation of metabolic disorders such as type 2 diabetes mellitus (T2DM) can provide a better understanding of disease pathophysiology and avoid high experimentation costs. Only a limited amount of computational work using metabolic reconstructions has been performed in this field to better understand T2DM. In this study, a new algorithm for generating tissue-specific metabolic models is presented, along with the resulting multi-confidence level (MCL) multi-tissue model. The effect of T2DM on liver, muscle, and fat in MKR mice was first studied by microarray analysis, and the changes in gene expression of frank T2DM MKR mice versus healthy mice were subsequently applied to the multi-tissue model to test the effect. Using the first multi-tissue genome-scale model of all metabolic pathways in T2DM, we found that the branched-chain amino acid degradation and fatty acid oxidation pathways are downregulated in T2DM MKR mice.

Microarray data showed low expression of genes in MKR mice versus healthy mice in the degradation of branched-chain amino acids and fatty-acid oxidation pathways. In addition, the flux balance analysis using the MCL multi-tissue model showed that the degradation pathways of branched-chain amino acid and fatty acid oxidation were significantly downregulated in MKR mice versus healthy mice.

Validation of the model was performed using data derived from the literature regarding T2DM. Microarray data was used in conjunction with the model to predict fluxes of various other metabolic pathways in the T2DM mouse model, and alterations in a number of pathways were detected.

The Type 2 Diabetes MCL multi-tissue model may explain the high level of branched-chain amino acids and free fatty acids in plasma of Type 2 Diabetic subjects from a metabolic fluxes perspective.

2. Introduction

Type 2 Diabetes Mellitus (T2DM), the most common form of diabetes in America, is becoming a global pandemic, with the greatest increase in cases occurring in many developing countries. The pathophysiology of T2DM primarily involves defects in three organ systems: liver, peripheral target tissues (skeletal muscle and fat), and pancreatic β-cells [1]. Insulin resistance in the peripheral target tissues, primarily skeletal muscle, is considered the primary reason for insulin resistance in T2DM [2].

In patients with T2DM, withdrawal of insulin treatment has been shown to be associated with increased levels of branched-chain amino acids (BCAAs) in the plasma [3]. Moreover, metabolite profiling of plasma from T2DM patients [4] revealed BCAAs as key biomarkers during the progression of T2DM. The concentrations of BCAAs in plasma, liver, and skeletal muscle have been shown to be higher in T2DM conditions such as in the Zucker diabetic rat [5]. Another study, performed on hyperglycemic/T2DM Finnish males, also revealed high plasma levels of BCAAs [6]. Additionally, it has been shown that high levels of BCAAs in the plasma of T2DM subjects are associated with conditions of insulin resistance [7,8,9,10,11,12].

It is also known that elevated free fatty acid (FFA) levels in plasma are linked to T2DM in patients [13]. One study [14] on the effect of high plasma FFA levels pointed out the contribution of high FFA levels in plasma to the impaired insulin response of T2DM subjects.

In contrast to the wealth of knowledge available on the concentrations of circulating BCAAs and FFAs in T2DM patients, the actual mechanisms leading to these changes at the metabolic and genetic levels are less well understood. With the emergence of systems biology tools associated with high-throughput data, it is now feasible to create in silico genome-scale metabolic reconstruction models to study the causes of various metabolic disorders [15]. After the completion of a global metabolic network, Recon1 [16], constraint-based modeling became a feasible way to study metabolic disorders in silico.

Accounting for more than 3000 human metabolic reactions, Recon1 provides a firm basis for studying human metabolism and metabolic disorders such as cancer, diabetes, obesity, and inherited gene and deficiencies [15].
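For reference, the constraint-based (flux balance) formulation that reconstructions such as Recon1 support is usually written as a linear program. The notation below is the standard textbook form rather than one taken from this chapter: S is the stoichiometric matrix, v the vector of reaction fluxes, c the objective weights, and l and u the lower and upper flux bounds.

```latex
\begin{aligned}
\max_{v}\quad & c^{\mathsf{T}} v \\
\text{subject to}\quad & S\,v = 0, \\
& l \le v \le u .
\end{aligned}
```

The steady-state constraint S v = 0 states that no internal metabolite accumulates, and changing l or u for selected reactions is how gene deletions or expression changes can be imposed on such a model.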

Following the introduction of Recon1, several tissue-specific models have also been generated [17,18,19,20,21,22]. Recon1 by itself is not sufficient for modeling specific tissues, as different tissues exhibit different physiological and hence metabolic behaviors. Increasing efforts are being made to create multi-tissue metabolic models to study the pathophysiology of human metabolic disorders [23].

Insulin resistance leading to Type 2 Diabetes Mellitus (T2DM) is regulated by more than one tissue system and therefore requires analysis at the multi-tissue level. The major tissues involved in T2DM are skeletal muscle, liver, adipose, pancreas, brain, and gastrointestinal tract [24]. As far as metabolite levels are concerned, skeletal muscle, liver, and adipose tissues are the major players in secreting these metabolites into the blood through various biochemical pathways. Metabolite concentration levels (such as amino acid levels) from the blood of T2DM subjects are readily available in the literature, allowing comparison of an in silico model to the T2DM phenotype. The current study introduces a comprehensive multi-tissue-specific model to study the interdependence of hepatocytes, myocytes, and adipocytes in the T2DM condition. Table 1 summarizes several different models, including the model for the current study (Kumar et al.) [17,18,20,25,26].

Table 1. A comparison of different models

                          Kumar   Bordbar   Jerby   Gille   Mardinoglu   Recon1
Intracellular Reactions   2202    518       1056    1081    6160*        2180
Genes                     1496*   931       -       -       1809         1496
Unique Metabolites        610     413       729     777     2497         1509
Compartments              4       4         6       6       8            7

Each model provides specific advantages and limitations for specific applications. Models can vary in size, scope, and tissue distribution. For example, while some models are specific to certain tissues, the current Kumar model expands the scope to include three tissues. Moreover, the microarray data used in our study contextualize the reconstruction according to the three different tissues (liver, WAT, and skeletal muscle). Furthermore, we used microarray data for three different tissues from the same animals. While we recognize that the current approach incorporates data from a different organism (mice) into a model based on human metabolism, the similarity in physiological responses across species makes such an approach reasonable in the absence of fully validated models and data sets for each species. Most studies involving T2DM and insulin resistance utilize data from insulin-resistant or diabetic animal models, such as Zucker fatty (ZF) and Zucker diabetic fatty (ZDF) rats [27], high-fat-fed mice [28], muscle IGF-1 receptor-lysine-arginine (MKR) mice [29,30], and lep/lep mice [31]. A common feature of these animal models is that all of them manifest insulin resistance and often exhibit islet dysfunction, as occurs in the early stages of type 2 diabetes. It is, in our opinion, valid to use MKR mouse and Zucker diabetic fatty rat data for making predictions about human T2DM phenotypes. To validate the model, its capability to predict expected phenotypes from known genotypes was tested. For this purpose, a publicly available comprehensive database of genes and genetic phenotypes, the Online Mendelian Inheritance in Man (OMIM) database, was used. This was followed by a comparison between constraint-based simulation results and the levels of plasma amino acids in the Zucker diabetic rat, after gene expression data for the three tissues from the diabetic MKR mouse compared to normal control mice had been applied to the model [32].

3. Results

3.1 Multi-confidence level (MCL) multi-tissue model

Before generating the model, an algorithm was applied to generate lists of high confidence, medium confidence, and low confidence reactions, based on the source of each reaction. The high confidence list comprised reactions from published literature and was not altered by the algorithm; the medium confidence list comprised reactions from online databases such as HPRD; and the low confidence list comprised all remaining reactions that were not present in the high and medium confidence lists, as shown in Figure 1 (detailed in Materials and Methods). The final high confidence reaction list (Ch) consists of 1110 reactions and 1593 metabolites. The final medium confidence reaction list (Cm) consists of 1099 reactions and 1643 metabolites. The final low confidence reaction list (Cx) consists of 4433 reactions and 3679 metabolites. The algorithm required six iterations to complete, yielding the multi-confidence level (MCL) multi-tissue model with 4704 reactions and 3131 metabolites. Completion is indicated by the fact that the objective function for the whole model is at a maximum for a particular set of flux distributions and no more reactions can be added in subsequent iterations (detailed in Materials and Methods).

Figure 1. Multi-tissue model building workflow. Recon1, downloaded from the BiGG database, is the basis for building this multi-tissue model of liver, skeletal muscle, and adipose tissues. From the Recon1 individual compartment models, the algorithm first loads a randomized flux distribution matrix representing randomization of linear programming (LP) problems. A Boolean vector describing the activity (1 = active, 0 = inactive) of each reaction is then created, and each column of the flux distribution is scored. If the objective score is greater than 0, the corresponding reaction is added to the reconstruction. This process is repeated until the highest objective score is greater than 0, implying that the objective function is at its maximum. This generates a partial model. The whole process is then repeated based on the partial model, until there is no change in model size from one iteration to the next.
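The outer loop of the workflow in Figure 1 (score candidates, add those with a positive score, and stop when the model size no longer changes) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the algorithm used in this study: the function name `expand_model`, the toy network, and in particular `toy_score` are hypothetical stand-ins, and the LP-based objective score of the actual method is replaced here by a simple substrate-availability check.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

# Hypothetical toy network: reaction -> (substrates, products).
TOY_REACTIONS: Dict[str, Tuple[Set[str], Set[str]]] = {
    "R1": ({"glc"}, {"g6p"}),
    "R2": ({"g6p"}, {"f6p"}),
    "R3": ({"f6p"}, {"pyr"}),
    "R4": ({"xyz"}, {"abc"}),  # substrate is never produced, so never added
}


def toy_score(model: Set[str], reaction: str) -> float:
    """Stand-in for the LP objective score (an assumption, not the real score).

    Returns 1.0 when every substrate of `reaction` is either the supplied
    nutrient ("glc") or a product of a reaction already in `model`.
    """
    available = {"glc"}
    for r in model:
        available |= TOY_REACTIONS[r][1]
    return 1.0 if TOY_REACTIONS[reaction][0] <= available else 0.0


def expand_model(
    core: Iterable[str],
    candidates: Iterable[str],
    score: Callable[[Set[str], str], float],
    max_iterations: int = 100,
) -> Tuple[Set[str], int]:
    """Grow a reaction set until a full pass adds no new reaction."""
    model = set(core)               # high confidence reactions are always kept
    pool = set(candidates) - model  # medium/low confidence candidates
    iterations = 0
    for _ in range(max_iterations):
        iterations += 1
        # Score each candidate against the model as it stood at the start of the pass.
        added = {r for r in pool if score(model, r) > 0}
        if not added:               # model size unchanged: stop
            break
        model |= added
        pool -= added
    return model, iterations


if __name__ == "__main__":
    final_model, n_passes = expand_model({"R1"}, {"R2", "R3", "R4"}, toy_score)
    print(sorted(final_model), n_passes)
```

In this toy run the loop needs three passes: one pass each to add R2 and R3, and a final pass that adds nothing and terminates the expansion, mirroring the "no change in model size" stopping rule.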

3.2 Validation using Online Mendelian Inheritance in Man (OMIM) Database

From the OMIM database [33], 17 disorders were chosen and characterized with regard to increases or decreases in metabolite levels in the blood, as shown in Figure 2 [22]. Each of the disorders involves changes in amino acid levels, and the phenotype of each disorder is outlined in Figure 2. Each disorder has a set of associated genes, as listed in Figure 2, which can be used to map to reactions in silico. S-Adenosylhomocysteine is associated with AHCY gene deficiency [34]. Alkaptonuria is associated with HGD gene deficiency [35]. Argininemia is associated with mutation in the ARG1 gene [36]. Cystinuria is associated with mutations in the SLC3A1 and SLC7A9 genes [37]. Lysinuric protein intolerance is associated with mutation in the SLC7A7 gene [38]. Formiminotransferase deficiency is associated with mutation in the FTCD gene [39]. Histidinemia is associated with mutation in the HAL gene [40]. Homocystinuria is associated with mutation in the CBS gene [41]. Hyperprolinemia is associated with mutation in the PRODH gene [42]. Maple syrup urine disease is associated with mutation in the DBT, BCKDHB, and BCKDHA genes [43]. Methionine adenosyltransferase deficiency is caused by mutation in the MAT1A gene [44]. Methylmalonic aciduria is caused by mutation in the MUT gene [45]. Phenylketonuria is caused by mutation in the PAH gene [46]. Hyperphenylalaninemia is associated with mutation in QDPR [47]. Tyrosinemia type I is caused by mutation in the FAH gene [48]. Tyrosinemia type III is caused by mutation in the HPD gene [49]. Glycine encephalopathy is associated with mutation in the AMT, GLDC, and GCSH genes [50]. In this validation, the reactions associated with these genes were removed and the in silico model was simulated to predict the phenotype of the disease associated with removing the particular gene(s). The predicted phenotype was then compared to the actual phenotype from the OMIM database. Exchange reactions report on metabolite concentrations: the exchange flux balances indicate the increase or decrease in the concentration of a metabolite in the blood/extracellular compartment of the model. This analysis was performed on the MCL multi-tissue model, Recon1, and the multi-tissue version of Recon1 (modeling the three tissues: adipose, liver, and skeletal muscle). The results from simulations of Recon1 and the multi-tissue version of Recon1 served as benchmarks against which to compare the simulations of the MCL multi-tissue model, as shown in the receiver operating characteristic (ROC) curves in Figure 3. ROC curves portray the trade-off between the true positive rate (TPR) and the false positive rate (FPR) of a model as the decision threshold of a parameter is varied.
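The gene-deletion step described above, removing the reactions associated with a gene before simulating, can be sketched as a simple bound manipulation. The following is a minimal sketch under stated assumptions rather than the implementation used here: the function name `knock_out`, the reaction identifiers, and the bound values are hypothetical, and each gene is assumed to be individually sufficient to disable its reactions (real reconstructions use Boolean gene-protein-reaction rules that can account for isozymes).

```python
from typing import Dict, Iterable, Tuple

Bounds = Dict[str, Tuple[float, float]]  # reaction id -> (lower, upper) flux bound


def knock_out(
    bounds: Bounds,
    gene_to_reactions: Dict[str, Iterable[str]],
    genes: Iterable[str],
) -> Bounds:
    """Return a copy of `bounds` in which every reaction tied to `genes` carries no flux."""
    blocked = {rxn for gene in genes for rxn in gene_to_reactions.get(gene, ())}
    return {
        rxn: ((0.0, 0.0) if rxn in blocked else limits)
        for rxn, limits in bounds.items()
    }


if __name__ == "__main__":
    # Hypothetical reaction ids and bounds, for illustration only.
    wild_type: Bounds = {"RXN_A": (0.0, 1000.0), "RXN_B": (-1000.0, 1000.0)}
    gene_map = {"PAH": ["RXN_A"]}
    mutant = knock_out(wild_type, gene_map, ["PAH"])
    print(mutant)
```

The mutant bounds would then be passed to the flux balance simulation, and the resulting exchange fluxes compared with those of the unmodified model to predict whether a blood metabolite rises or falls.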

Figure 2. OMIM data used for model validation. Blue and red squares depict an increase and decrease in concentration, respectively. White squares represent unchanged concentration levels.

Recon1 and the multi-tissue version of Recon1 have an identical curve which is understandable as a multi-tissue version of Recon1 is just a combination of Recon1 models of the three tissues and the external exchange compartment; however, it is necessary to test the differences between the two because the number of type III cycles can increase in the multi-tissue version of Recon1. A type III cycle is one of three types of extreme pathways that can exist in a reaction network (type I, type II, and type III) and could cause the behavior of the multi-tissue version to be different than the single tissue version of Recon1. Figure 3 illustrates that the MCL multi-tissue model reaches higher true positive rates as compared to Recon1.


Figure 3. ROC plot for validation using OMIM data. ROC plot comparing different models using gene-deletion study. This figure demonstrates the true positive and false positive rates at various threshold values.

Also, all data points other than the last one at (1,1) are clustered in an area with a very low false positive rate. This indicates that all models evaluated in this experiment performed as expected. The area under the curve (AUC) provides a quantitative measure of the performance of each model. The AUC of the MCL multi-tissue model is 0.7151, the AUC of both Multi-Recon1 and Recon1 is 0.6719, and the AUC of a random selector (not shown) is approximately 0.3866.
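To make the ROC and AUC quantities concrete, the sketch below computes (FPR, TPR) points by sweeping a decision threshold and integrates them with the trapezoidal rule. It is a generic illustration, not the code used to produce Figure 3; the function names and the example scores and labels are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) pairs as the threshold is lowered from the highest score.

    `labels` holds 1 for a true positive case and 0 for a true negative case;
    a case is predicted positive when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    # Hypothetical scores and labels, for illustration only.
    pts = roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
    print(pts, auc(pts))
```

Because the lowest threshold classifies every case as positive, the last point is always (1, 1), which matches the final data point noted in the text.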

3.3 Validation using Type 2 Diabetes Gene Expression Change

Subsequent validation analysis involved the application of microarray data from MKR T2DM mice to the aforementioned in silico models. The microarray data of the two sets (MKR and healthy mice) were compared, and genes that had statistically significant fold changes (p-value < 0.05) were tabulated. These differentially expressed genes were then mapped to Recon1 and the MCL multi-tissue model to determine sets of upregulated and downregulated reactions. The bounds of these reactions were changed according to the procedure defined elsewhere (Materials and Methods). The resulting differences in the flux bounds of the exchange reactions in both models were compared with amino acid data for the Zucker diabetic fatty rat from the literature [5], as shown in Table 2. Relative to the reference model with no change, any positive value corresponds to upregulation and any negative value corresponds to downregulation.
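The mapping from differentially expressed genes to altered reaction bounds can be sketched as below. The actual bound-adjustment procedure is the one given in Materials and Methods and is not reproduced here; the scaling rule (multiplying or dividing both bounds by a fixed `factor`), the function names, the gene and reaction identifiers, and the numerical values are all assumptions made purely for illustration. Only the significance cutoff of p < 0.05 is taken from the text.

```python
from typing import Dict, Iterable, Set, Tuple

Bounds = Dict[str, Tuple[float, float]]  # reaction id -> (lower, upper) flux bound


def classify_genes(
    results: Dict[str, Tuple[float, float]],
    alpha: float = 0.05,
) -> Tuple[Set[str], Set[str]]:
    """Split genes into (upregulated, downregulated) sets.

    `results` maps gene -> (log2 fold change, p-value); only genes with
    p-value < alpha are considered differentially expressed.
    """
    up = {g for g, (lfc, p) in results.items() if p < alpha and lfc > 0}
    down = {g for g, (lfc, p) in results.items() if p < alpha and lfc < 0}
    return up, down


def adjust_bounds(
    bounds: Bounds,
    gene_to_reactions: Dict[str, Iterable[str]],
    up: Set[str],
    down: Set[str],
    factor: float = 2.0,
) -> Bounds:
    """Hypothetical rule: widen bounds for upregulated genes, tighten for downregulated."""
    new_bounds = dict(bounds)
    for genes, scale in ((up, factor), (down, 1.0 / factor)):
        for gene in genes:
            for rxn in gene_to_reactions.get(gene, ()):
                if rxn in new_bounds:
                    lower, upper = new_bounds[rxn]
                    new_bounds[rxn] = (lower * scale, upper * scale)
    return new_bounds


if __name__ == "__main__":
    # Hypothetical genes, fold changes, and p-values, for illustration only.
    results = {"geneA": (-1.2, 0.01), "geneB": (0.8, 0.20), "geneC": (1.0, 0.03)}
    up, down = classify_genes(results)
    bounds = {"R1": (0.0, 100.0), "R2": (-50.0, 50.0)}
    mapping = {"geneA": ["R1"], "geneC": ["R2"]}
    print(adjust_bounds(bounds, mapping, up, down))
```

Under this assumed rule, a gene that fails the significance cutoff (geneB above) leaves its reactions at the reference bounds, so only significant changes propagate into the simulated flux differences.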

Table 2. Simplified depiction of the validation results.

Amino acid      Zucker diabetic fatty rat   Ex-MCL multi-tissue model   Trans-MCL multi-tissue model   Ex-Recon1   Trans-Recon1
Arginine        ↓                           ↑                           -                              -           ↓
Leucine         ↑                           -                           -                              ↑           -
Phenylalanine   -                           ↑                           -                              -           -
Cysteine        -                           -                           -                              ↓           -
Glutamine       ↓                           -                           ↑                              ↑           -
Serine          ↓                           ↑                           -                              ↑           -
Asparagine      ↓                           -                           -                              ↑           -
Tryptophan      ↓                           -                           -                              -           ↓
Proline         -                           ↓                           -                              -           -
Threonine       ↓                           -                           -                              ↑           -
Aspartate       -                           ↑                           ↑                              -           -
Glycine         ↓                           -                           -                              ↓           -
Glutamate       ↑                           -                           -                              ↓           -
Isoleucine      ↑                           -                           ↑                              -           ↑
Lysine          ↓                           -                           -                              -           ↑
Valine          ↑                           -                           -                              ↑           ↑
Methionine      ↓                           ↑                           ↑                              -           ↑
Tyrosine        ↓                           -                           ↓                              ↓           ↓
Alanine         -                           ↓                           ↑                              -           -
[...]           ↓                           ↑                           ↓                              ↓           ↑

The upward arrow (↑) depicts a higher amino acid level in the Zucker diabetic fatty rat versus the healthy rat. The downward arrow (↓) depicts a lower amino acid level in the Zucker diabetic fatty rat versus the healthy rat. A dash (-) depicts no difference in the amino acid level between the two types of rats.

The two pairs of columns in Table 2 (for Recon1 and for the MCL multi-tissue model) represent two different analyses, which differ in the reactions used to calculate increases and decreases in the plasma concentration of the corresponding amino acids. The columns labeled “Ex-Recon1” and “Ex-MCL Multi-tissue Model” represent analyses in which only the exchange reactions were used to determine the increases and decreases in the concentration of amino acids in the blood/plasma. The columns labeled “Trans-Recon1” and “Trans-MCL Multi-tissue Model” represent analyses in which transport reactions were used to determine the increases and decreases in the fluxes of different pathways. In this instance, a transport reaction is defined as a reaction that moves a particular amino acid from the cytosol to the extracellular/blood compartment. An exchange reaction is a special type of reaction that exists only in these types of computational models and represents the flow of metabolites across a system boundary [51].

The first two rows in Table 3 represent the analyses using exchange reactions, and as expected the MCL multi-tissue model outperforms Recon1 in every category (a lower FPR is better). The last two rows in Table 3 represent the analyses using the transport reactions. Table 3 shows that the MCL multi-tissue model significantly outperformed Recon1 when using exchange reactions, but the results are much closer when using transport reactions, where the MCL multi-tissue model outperformed Recon1 in every category except recall (recall is the same as the true positive rate). ROC curves for the differences in flux bounds of the exchange reactions and transport reactions are shown in Figure 4 and Figure 5, respectively.

Figure 4. ROC plot for exchange flux changes using Zucker fatty rat data. ROC plot of the results generated by comparing the exchange flux changes determined by applying diabetic and wild-type gene expression data to the MCL multi-tissue model and Recon1 to available literature data on amino acid concentration changes in the blood/plasma.

These ROC curves demonstrate the differences between Recon1 and the MCL multi-tissue model through the difference in area under the curve (AUC) between the two models. In Figure 4, the MCL multi-tissue model outperformed Recon1, which is not surprising given the results in Table 3. In Figures 4 and 5, there are more points for Recon1 because Recon1 contains a greater number of reactions than the multi-tissue model, and therefore downregulated genes were mapped to more reactions.

Table 3. Summarized results of validation based on T2DM gene expression change.

Model | TP | FP | TN | FN | Precision | Recall | TNR | FPR | Accuracy
Ex-Recon1 | 0 | 8 | 1 | 11 | 0.00 | 0.00 | 0.11 | 0.89 | 0.05
Ex-MCL multi-tissue Model | 3 | 4 | 3 | 10 | 0.43 | 0.23 | 0.43 | 0.57 | 0.30
Trans-Recon1 | 5 | 6 | 4 | 5 | 0.45 | 0.50 | 0.40 | 0.60 | 0.45
Trans-MCL multi-tissue Model | 5 | 3 | 5 | 7 | 0.63 | 0.42 | 0.63 | 0.38 | 0.50

TP stands for True Positive, FP stands for False Positive, TN stands for True Negative, FN stands for False Negative, TNR stands for True Negative Rate, and FPR stands for False Positive Rate. As seen from the Accuracy column, the MCL multi-tissue model outperforms Recon1 for both exchange and transport reactions.
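The counts in Table 3 can be recomputed from the direction calls in Table 2. The sketch below is illustrative only: the counting convention (a non-zero prediction that matches the observed direction is a TP, any other non-zero prediction is an FP, a "no change" prediction is a TN when the rat data also show no change and an FN otherwise) is an assumption that is not stated in the text, as is the column order used when transcribing Table 2. Under these assumptions the counts agree with Table 3. The names `TABLE2`, `confusion_counts`, and `summary` are hypothetical.

```python
# Direction calls transcribed from Table 2: "u" = higher, "d" = lower, "-" = no difference.
# Each string lists, in order: observed (Zucker diabetic fatty rat), Ex-Recon1,
# Ex-MCL, Trans-Recon1, Trans-MCL. This column order is an assumption.
TABLE2 = {
    "Arginine": "du--d", "Leucine": "u--u-", "Phenylalanine": "-u---",
    "Cysteine": "---d-", "Glutamine": "d-uu-", "Serine": "du-u-",
    "Asparagine": "d--u-", "Tryptophan": "d---d", "Proline": "-d---",
    "Threonine": "d--u-", "Aspartate": "-uu--", "Glycine": "d--d-",
    "Glutamate": "u--d-", "Isoleucine": "u-u-u", "Lysine": "d---u",
    "Valine": "u--uu", "Methionine": "duu-u", "Tyrosine": "d-ddd",
    "Alanine": "-du--",
    "(row label missing)": "duddu",  # the final row has no amino acid name in the table
}
COLUMNS = {"Ex-Recon1": 1, "Ex-MCL": 2, "Trans-Recon1": 3, "Trans-MCL": 4}


def confusion_counts(column):
    """Return (TP, FP, TN, FN) for one prediction column (assumed convention)."""
    tp = fp = tn = fn = 0
    for calls in TABLE2.values():
        observed, predicted = calls[0], calls[column]
        if predicted == "-":
            if observed == "-":
                tn += 1  # no change predicted, no change observed
            else:
                fn += 1  # a real change was missed
        elif predicted == observed:
            tp += 1      # change predicted in the observed direction
        else:
            fp += 1      # change predicted but absent or in the opposite direction
    return tp, fp, tn, fn


def summary(column):
    """Derive the rate columns of Table 3 from the four counts."""
    tp, fp, tn, fn = confusion_counts(column)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "tnr": tn / (tn + fp) if tn + fp else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


for name, column in COLUMNS.items():
    print(name, confusion_counts(column), summary(column))
```

Note that under this convention a wrong-direction prediction is penalized as a false positive, which is why Ex-Recon1 has eight false positives and no true positives.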


Figure 5. ROC plot for transport reaction flux changes using Zucker fatty rat data.

ROC plot of the results generated by the flux changes in transport reactions determined by applying T2DM and wild-type gene expression data to the MCL multi-tissue model and Recon1 and then comparing the results to available literature data on amino acid concentration changes in the plasma of the T2DM Zucker diabetic fatty rat.

The AUC of the fold change simulations of the MCL multi-tissue model using exchange reactions in Figure 4 is 0.3100; the AUC of Recon1 in Figure 4 is 0.0412, and the AUC of the random selector (not shown) is 0.3866. The AUCs of the MCL multi-tissue model and Recon1 using transport reactions in Figure 5 are 0.5048 and 0.3998, respectively, which means that both Recon1 and the MCL multi-tissue model are better than a random selector for transport reactions. Thus, the MCL multi-tissue model is consistently above Recon1 for both transport and exchange reactions and exceeds the random selector curve for transport reactions but not for exchange reactions. The random selector curve is based on a set of reactions generated by random permutation from the entire set of reactions.

Recon1 was highly inaccurate for exchange reactions at each threshold, which demonstrates the difficulty of predicting increases or decreases of exchange from gene expression fold changes. The fact that the MCL multi-tissue model displays an AUC for exchange reactions (0.3100) that is roughly 7.5 times greater than the AUC produced by Recon1 (0.0412) provides further validation for the MCL multi-tissue model in these analyses, at least compared to other systems available.

3.4 MCL multi-tissue model’s application on MKR mice microarray data

After the two extensive validations outlined in sections 2.1 and 2.2, the model was used to study physiological behaviors in the T2DM condition: the higher plasma concentrations of branched-chain amino acids (BCAAs) and free fatty acids (FFAs) in T2DM subjects. Branched-chain amino acid metabolism has been studied in the T2DM condition previously [10], [52], [53], but none of these groups determined metabolic fluxes through the relevant pathways, as is done in this study using a robust computational model. Past studies [18], [17] have focused on a single tissue (such as liver) for investigating metabolic disorders. However, this does not give a complete picture of the metabolic disorder. In another study [20], the investigators studied T2DM in fat, liver, and muscle tissues using an in silico model that was less comprehensive, in terms of the number of metabolic reactions, than the one presented in this study.

In order to use the MCL multi-tissue model for predicting physiological changes in fully T2DM MKR mice as compared to healthy mice, gene expression data was first generated for liver, skeletal muscle, and fat tissues for healthy and diabetic mice. Statistical tests on gene expression data suggested that several genes were differentially regulated between MKR and healthy mice. Fisher’s exact test, which provides the significance of the association between the experimental gene expression values and the IPA canonical pathways, was performed using the Ingenuity software (www.ingenuity.com). From gene expression fold change analysis in the three tissues, it was found that for the BCAA degradation and FA oxidation pathways, some genes were upregulated and some were downregulated in liver, while the genes were predominantly downregulated in the fat and muscle tissues of T2DM MKR mice when compared to their healthy euglycemic littermates, as shown in Table 4.

Table 4. Microarray results for BCAA and FA degradation pathways.

Tissue | Valine (%) | Isoleucine (%) | Leucine (%) | Tyrosine (%) | Phenylalanine (%) | Fatty-acid Pathway (%)
Fat | 81↓/0↑ | 79↓/0↑ | 55↓/0↑ | 30↓/30↑ | 10↓/10↑ | 50↓/10↑
Liver | 15↓/0↑ | 10↓/0↑ | No change | 0↓/20↑ | 0↓/20↑ | 5↓/5↑
Skeletal Muscle | 35↓/0↑ | 30↓/0↑ | 0↓/15↑ | 50↓/0↑ | 50↓/20↑ | 40↓/0↑

Percentage of genes downregulated (↓)/upregulated (↑) in BCAA degradation and FA oxidation pathways in the three tissues.

The FA degradation pathway genes in liver tissue did not provide a clear explanation of the high concentration of FFAs in the plasma of T2DM subjects. However, the IPA biosynthesis pathway for fatty acids in the liver tissues of MKR mice (Figure 6) does provide an explanation of the differences in plasma FFA levels between healthy and diseased subjects. There is a clear increase in the gene expression patterns for biosynthesis of fatty acids in liver and a decrease in fatty acid oxidation in skeletal muscle and fat tissues, which leads to accumulation of more free fatty acids in the plasma of T2DM subjects.

Figure 6. Canonical biosynthesis pathway of fatty acids in liver using microarray data. Mapping of gene expression for liver tissues on biosynthesis of fatty acids, using Ingenuity Pathway Analysis software. Red color shows upregulated gene expression in MKR mice versus healthy mice.

To perform a systems-level analysis of the gene expression data and achieve a better understanding of BCAA and FA metabolism, we next used the gene expression data along with the MCL multi-tissue model to generate flux predictions. The gene expression data was used to generate a context-specific MCL multi-tissue network with an algorithm that removes the reactions associated with no gene expression, thereby creating a context-specific metabolic network for the in silico simulation of the T2DM condition. Flux variability analysis was then performed to identify the reactions that cannot carry flux, which were then removed. The remaining reactions represent an active context-specific metabolic network. This active model was used to find the differences in biochemical reaction activity between the T2DM and healthy states.

Figure 7 (A) – (F) represent the data corresponding to each subsystem in the three tissues compared to a hypothetical dataset represented by a random selector. T-scores were obtained for each subsystem to determine the statistical significance of the degree of differential expression. The purpose of organizing the data in this way is to identify particular subsystems affected more greatly by T2DM. This organization scheme helps filter out small subsystems that contain all downregulated reactions but have very few reactions, and therefore do not control the overall behavior of the model.


Figure 7. MCL multi-tissue model predictions on different pathways. (A), (B), and (C) represent downregulated pathways (subsystems) for adipose, liver, and skeletal muscle tissues, respectively. (D), (E), and (F) represent upregulated pathways for adipose, liver, and skeletal muscle tissues, respectively. They are organized by t-score; the more negative the t-score, the more downregulated the subsystem, and the more positive the t-score, the more upregulated the subsystem.

Cholesterol metabolism is upregulated in adipose tissue (Figure 7 – D) and downregulated in liver tissue (Figure 7 – B) and muscle tissue (Figure 7 – C). This can lead to a higher cholesterol level in the plasma of T2DM subjects, as also observed through actual measurements [54]. The carnitine shuttle is downregulated in adipose tissues (Figure 7 – A) and muscle tissues (Figure 7 – C), and is upregulated in liver tissues (Figure 7 – E). As reported in the literature [14], a downregulated carnitine shuttle hampers beta-oxidation, which leads to triglyceride accumulation in the cytosol of the cells and thereby to insulin resistance. Another interesting observation is the upregulated N-glycan degradation in adipose tissues (Figure 7 – D) and downregulated N-glycan biosynthesis in liver tissues (Figure 7 – B). The behavior of these pathways leads to altered N-glycan structure in the plasma of T2DM subjects [55,56]. Sphingolipid metabolism is upregulated in both adipose tissues (Figure 7 – D) and muscle tissues (Figure 7 – F), which matches expectations, as ceramide levels in skeletal muscle of the T2DM Zucker fatty rat were found to be normal [57,58]. However, there is no known relation between T2DM and skeletal muscle ceramide level [59].

It is important to note that most of the research on the behavior of these subsystems deals with differences in gene expression between the two physiological conditions; in this study, however, we present differences in fluxes through the biochemical reactions in the two physiological conditions (T2DM vs healthy). In the flux-based approach, if genes associated with a few reactions in a subsystem are downregulated, then most of the subsystem will be downregulated because of the steady-state condition. In subsystems with a combination of upregulated and downregulated genes, the behavior of the whole subsystem cannot be predicted from gene expression alone, so the flux results can help resolve such ambiguous situations.

4. Discussion

In the present study, microarray technology was used to elucidate the relation between free fatty acid and branched-chain amino acid levels under T2DM and normal conditions and the gene expression profiles in three different tissues (liver, muscle, and adipose) of diabetic MKR mice. The findings from microarray analysis show that MKR mice have a downregulated fatty acid oxidation profile in muscle and adipose tissues. In liver, the fatty acid oxidation profile is not downregulated as a whole. On the contrary, fatty acid biosynthesis in liver tissues is considerably upregulated, as shown in Figure 6. Therefore, a possible explanation for the higher circulating free fatty acid levels in the diabetic MKR mice is the upregulated biosynthesis of fatty acids in liver combined with considerably lower oxidation of fatty acids in the muscle and adipose tissues of the animals. Gene expression for branched-chain amino acid metabolism is significantly downregulated too. However, mere downregulation of gene expression does not imply downregulation of the flux through these pathways. Moreover, gene expression data alone does not clearly predict how fast or slow a biochemical reaction proceeds. Incorporating metabolic flux results from such computational models greatly enhances the understanding of reaction fluxes.

Wang et al. [60] pointed out that comparisons using only enrichment statistics with gene expression data provide far fewer predictions than model-based approaches. A larger number of pathways can be identified using the model-based approach than using gene expression data alone.

Subsequently, an algorithm for building a multi-tissue model is presented in the current paper. The algorithm uses as few linear programming solutions as possible to achieve an optimal solution, so that the solution can be found in a reasonable amount of time. The model provides users with the ability to predict changes in metabolite levels in the medium after deleting specific metabolic genes, as well as the ability to predict changes in metabolite concentrations in the plasma/medium after applying fold changes to specific genes. Other capabilities include using a quadratic programming solver to apply experimentally derived fluxes to specific reactions in the model and determining the remaining fluxes to examine the behavior of the overall model under the conditions given by the experiment [17].

The positive aspect of this approach is that metabolic fluxes can be predicted; this is important because metabolic fluxes are difficult to measure in mammalian tissues, and metabolic fluxes provide essential information for characterizing phenotypes of cellular systems. Flux information can help predict metabolic biomarkers in blood/plasma [22], and can potentially help predict causes of certain metabolic disorders. However, due to their qualitative nature, results from these steady-state models should be thought of as supplements to experiments; they are best used to help provide potential targets for biological experiments.

As mentioned before, it is easier to find out approximately which subsystems are upregulated or downregulated by microarray analysis, but that does not fully explain the physical effects of the fold changes in these genes in terms of changes in the metabolic fluxes. The results show a significant downregulation of branched-chain amino acid metabolism and fatty acid oxidation in muscle and adipose tissues (there is some downregulation of branched-chain amino acid metabolism in the liver tissue, but it is less apparent than in the other two tissues). There are more downregulated reactions than upregulated reactions. Changes in gene expression can only be assumed to affect relative enzyme levels, which can only affect upper and lower flux bounds; therefore, we can assume that more downregulated reactions are affected than upregulated reactions because a decrease in the flux bounds represents more restriction than an increase in flux bounds. This is because reactions with increased flux bounds can still be forced to have lower fluxes by adjacent downregulated reactions, whereas reactions with decreased flux bounds cannot be forced to have higher fluxes by adjacent upregulated reactions.

Statistical significance of the differential regulation of the subsystems was calculated using a two-sample t-test (assuming the second sample is a randomly generated set of upregulated, downregulated, and unchanged reaction predictions of the same size as the subsystem) to determine the statistical significance of the subsystem’s deviation from random behavior. For example, if a subsystem with two reactions is 100% downregulated, then, intuitively, this is not very statistically relevant because a random selector makes this selection 1/9 of the time (assuming the random selector chooses downregulated, upregulated, and unchanged each 1/3 of the time). Thus, sorting by the t-score of each subsystem provides a better understanding of the extent of differential regulation of the subsystems. As expected, branched-chain amino acid metabolism and fatty acid oxidation have t-scores of very high magnitude in muscle and adipose tissue. The results regarding branched-chain amino acid metabolism are consistent with the results available in the literature [61]. The downregulation of BCAA metabolism in adipose tissue causes an increase in BCAAs in the circulation, which is shown by the amino acid transport fluxes. This increase in circulating BCAAs causes a decrease in BCAA metabolism in muscle tissue as well [53], which is consistent with the results in this study.
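The effect of subsystem size on the t-score can be illustrated with a small sketch. It is not the procedure used in this study: the coding of reactions as -1 (downregulated), 0 (unchanged), and +1 (upregulated) is an assumption, and the randomly generated second sample is replaced by its expected mean (0) and variance (2/3) under a uniform random selector so that the result is deterministic. The function names are hypothetical.

```python
import math


def random_all_down_probability(n_reactions, p_down=1.0 / 3.0):
    """Chance that a uniform random selector calls every reaction downregulated."""
    return p_down ** n_reactions


def subsystem_t_score(calls):
    """Welch-style t statistic of a subsystem against a uniform random selector.

    `calls` holds -1 (down), 0 (unchanged) or +1 (up) for each reaction.
    The random comparison sample has the same size as the subsystem and is
    represented by its expected mean and variance (an assumption of this sketch).
    """
    n = len(calls)
    mean = sum(calls) / n
    variance = sum((c - mean) ** 2 for c in calls) / (n - 1) if n > 1 else 0.0
    random_mean = 0.0
    random_variance = 2.0 / 3.0  # variance of a uniform draw from {-1, 0, +1}
    return (mean - random_mean) / math.sqrt(variance / n + random_variance / n)


small = subsystem_t_score([-1, -1])   # two reactions, both downregulated
large = subsystem_t_score([-1] * 20)  # twenty reactions, all downregulated
print(random_all_down_probability(2), small, large)
```

Both subsystems are 100% downregulated, but the larger one receives a much more negative score, which is the behavior the sorting by t-score relies on.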

Newgard et al. have reported an increased concentration of BCAAs in the plasma of T2DM subjects [10]. Although we did not have access to in vivo methods of metabolic flux measurement, we did find evidence supporting our results. For example, it is reported that gluconeogenesis flux in humans is decreased in Type 2 Diabetes [62], which is also demonstrated in our model. In skeletal muscle, free fatty acid oxidation flux is reduced, implying abnormal mitochondrial function [63,64,65,66]; this is also suggested by our results, which show reduced fatty acid oxidation in skeletal muscle as well as a reduced carnitine shuttle, which is responsible for fatty acid transport into mitochondria. In order to compare our findings with the above reported findings, we have reported predictions on the BCAA exchange metabolites. Apart from BCAAs, predictions for other exchange fluxes also resulted from the analysis, but the in vivo validity of those predictions with respect to T2DM is not yet confirmed.

The most negative t-score in adipose and muscle tissues is for the carnitine shuttle subsystem. The ‘Nucleotides’ subsystem is also strongly downregulated in both muscle and adipose tissue. Interestingly, liver tissue displays more upregulation than the other tissues, with the carnitine shuttle subsystem showing upregulation. Thus, the carnitine shuttle subsystem appears to be the subsystem most affected in T2DM. The carnitine shuttle is responsible for transport of long-chain fatty acids into mitochondria for oxidation, so a downregulated carnitine shuttle in adipose and muscle tissues implies lower metabolism of FFAs in those two tissues. Even though FFA metabolism is higher in liver because of the upregulated carnitine shuttle, the biosynthesis pathway of fatty acids is also elevated (Figure 6). This explains the overall high plasma FFA level in the T2DM condition. Other subsystems that appear to be strongly affected include most metabolic subsystems of bile acid metabolism, cholesterol metabolism, and branched-chain amino acid metabolism. These subsystems should be studied in more depth in future studies.

In summary, the integration of microarray data and in silico predictions via constraint-based modeling has facilitated a better understanding of the reasons behind high BCAA and FFA levels in the plasma of T2DM subjects. Prior systems biology studies have shown the ability of constraint-based models to predict metabolic biomarkers [22]. Similarly, this model provides the complete set of biomarker predictions generated by the genetic fold change study. These metabolites represent potential biomarkers that can facilitate T2DM studies.

Type 2 diabetes is characterized by two major defects: beta-cell dysfunction and insulin resistance in peripheral tissues. The exact alterations in molecular pathways associated with beta-cell dysfunction in insulin-resistant and diabetic states are not clearly understood. Most of the studies involving T2DM and insulin resistance utilize data from insulin-resistant or diabetic animal models, such as Zucker fatty (ZF) and Zucker diabetic fatty (ZDF) rats [27], high-fat-fed mice [28], muscle IGF-1 receptor–lysine–arginine (MKR) mice [29,30], and lep/lep mice [31]. A common feature of all these animal models is that they manifest insulin resistance and often exhibit islet dysfunction, as occurs in the early stages of type 2 diabetes in humans. Therefore, it is, in our opinion, valid to use MKR mouse and Zucker diabetic fatty rat data for making predictions on T2DM phenotypes in humans.

In a mammalian system, the specific objective of a particular function of a cell, at a given time and under certain conditions, is difficult to identify, whereas in unicellular systems, the objective function can be assumed to be the maximization of cell growth. This is because mammalian systems are much more complex, especially when attempting to define models for different tissues, because each tissue has a different objective. One may consider that there is an over-arching objective function when considering a system that defines the overall human body as a linear combination of partial Recon1’s associated with each tissue, but this again is difficult to define, because the objective of maximizing biomass in unicellular organisms only applies during the exponential growth phase. Thus, maximizing biomass in mammalian systems may be applicable, but only during the growth phase of the mammal.

5. Materials and Methods

5.1 Animal Studies

All animal study protocols were approved by the Mount Sinai School of Medicine Institutional Animal Care and Use Committee (IACUC). Mice were housed in the Mount Sinai School of Medicine Center for Comparative Medicine and Surgery, an Association for Assessment and Accreditation of Laboratory Animal Care (AAALAC) and Office of Laboratory Animal Welfare (OLAW) accredited facility, where animal care and maintenance were provided.

Male 10-week-old FVB/N – MKR mice were used for the microarray studies. Generation and characterization of MKR mice have been described elsewhere [32]. Mice were kept on a 12-h light/dark cycle and were allowed free access to diet (Picolab rodent diet #5053) and fresh water. Mice were euthanized by exposure to CO2. Liver, skeletal muscle, and fat tissues were flash frozen in liquid nitrogen, stored at -80 °C, and shipped on dry ice to the NIDDK/NIH facility for further processing.

5.2 RNA Sampling

Total RNA from the liver, fat, and skeletal muscle tissues was isolated for three biological replicates for both diseased and healthy animal subjects. Qiagen’s RNeasy® Microarray Tissue Mini Kit (Qiagen GmbH, Germany) was used for RNA isolation according to the manufacturer’s instructions. Purified total RNA was quantified using a NanoDrop 2000 spectrophotometer (Thermo Scientific Ltd). The absorbance values at 260 and 280 nm were used to assess the quality of the sample. Only samples with a 260/280 absorbance ratio greater than 1.80 were used for microarray analysis [67].

5.3 Microarrays procedure, Statistical analysis and Biological Inference

RNA quality was tested using a Bioanalyzer (RNA Nano assay in the Expert 2100 software, Agilent Technologies, CA), and RIN (RNA Integrity Number) values were above 8.0 for all the samples. 100 ng of RNA from each sample was amplified to generate cDNA using the NuGEN Applause WT-Amp ST system (NuGEN Technologies, CA), according to the manufacturer’s instructions. 2.5 µg of cDNA was fragmented and biotinylated using the Encore Biotin module (NuGEN Technologies, CA). The resultant sample was mixed with hybridization reagents (Affymetrix Inc., CA), injected into Affymetrix Mouse Gene 1.0 ST arrays, and incubated for 18 + 2 hours in a hybridization oven (Affymetrix Inc., CA) rotating at 60 rpm at 45 °C. Arrays were processed on Affymetrix 450 Fluidic stations using a wash and stain kit (Affymetrix Inc.). Chips were scanned using an Affymetrix GeneChip Scanner 3000 operated by GeneChip Operating Software, version 1.4 (GCOS 1.4), which generated .CEL, .CHP, and .RPT files. To assess the efficiency of cDNA synthesis, Poly A controls (dap, lys, phe, thr; Affymetrix Inc.) were spiked into the samples, and hybridization controls (bioB, bioD, bioC, and Cre; Affymetrix Inc.) were added to monitor labeling efficiency according to the manufacturer’s instructions.

The raw microarray data was analyzed using Partek software, version 6.3 (Copyright 2008 Partek Inc., St. Louis, MO, USA). Raw data were subjected to Robust Multichip Average (RMA) quantile normalization to remove biases introduced by technical and experimental effects. All expression data were log base 2 transformed to obtain a near-normal distribution for accurate statistical inference. Quality control by visualizing the data in a Principal Component Analysis cluster plot ensured that no outliers were included in the analysis. Next, two-way ANOVA was performed to obtain a set of differentially expressed genes. A filter of P-value < 0.05 and fold change > 1.5 was applied to obtain the list of significantly differentially expressed genes. The results of the microarray analysis have been deposited in the National Center for Biotechnology Information (NCBI) repository and can be accessed with Gene Expression Omnibus accession GSE51866.
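The significance filter described above can be sketched as follows. This is only an illustration and not the Partek workflow: it assumes that fold changes are available on the log base 2 scale and that the 1.5-fold threshold applies in either direction (up or down). The gene names and values are invented for the example.

```python
import math


def is_significant(p_value, log2_fold_change, p_cutoff=0.05, fold_cutoff=1.5):
    """Keep a gene if p < cutoff and the fold change exceeds the cutoff in either direction."""
    return p_value < p_cutoff and abs(log2_fold_change) > math.log2(fold_cutoff)


# Hypothetical (p-value, log2 fold change) pairs, for illustration only.
genes = {
    "gene_a": (0.01, -1.2),  # significant and strongly downregulated
    "gene_b": (0.20, 2.0),   # large change but not significant
    "gene_c": (0.03, 0.3),   # significant but below the fold-change cutoff
}

selected = [name for name, (p, lfc) in genes.items() if is_significant(p, lfc)]
print(selected)
```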

The list of significantly differentially expressed genes was exported to Ingenuity Pathway Analysis (IPA; Ingenuity Systems, www.ingenuity.com) for biological inference. IPA is online software used to study relationships between genes, proteins, and biological reactions. More technical details about the IPA capabilities can be accessed from the IPA website. Statistically significant genes from the Partek analysis were overlaid on the IPA global molecular network, which is based on information from other databases such as KEGG, HumanCyc, etc. in the IPA knowledge base, after applying a filter on species type (mouse) and tissue type (e.g., adipose).

5.4 Tissue-specific Model Building Algorithm

A published, detailed reconstruction of human metabolism (Recon1) was downloaded from the BiGG database in SBML format [68,69]. The models were created, maintained, and altered using the COBRA toolbox in MATLAB version 2010a [70]. A three-tissue version of Recon1 was generated by adding the prefixes ‘A:’, ‘H:’, and ‘M:’ to each reaction name and the suffixes [Adp], [Hep], and [Msc] to each metabolite for adipocytes, hepatocytes, and skeletal muscle tissue, respectively. The extracellular compartment is shared between all three tissues, and only one set of reactions comprising only extracellular metabolites was maintained in the MCL multi-tissue model. After removing all reactions associated with dead-end metabolites from the model, the three-tissue version of Recon1 contained 6644 reactions and 4103 metabolites.
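The naming scheme for the three-tissue model can be illustrated on a toy network. The sketch below is written in Python rather than with the COBRA toolbox, and several details are assumptions made for the example: the dictionary representation of reactions, the `[e]` tag used to recognize extracellular metabolites, and the reaction and metabolite names themselves.

```python
# Tissue prefixes for reaction names and suffixes for metabolite names (from the text).
TISSUES = {"A:": "[Adp]", "H:": "[Hep]", "M:": "[Msc]"}


def is_extracellular(metabolite):
    """Assumed convention: extracellular metabolites end with the tag "[e]"."""
    return metabolite.endswith("[e]")


def build_three_tissue(reactions):
    """Copy every reaction once per tissue while sharing the extracellular compartment."""
    merged = {}
    for name, stoichiometry in reactions.items():
        if all(is_extracellular(m) for m in stoichiometry):
            # Purely extracellular reactions are kept once, without a tissue prefix.
            merged[name] = dict(stoichiometry)
            continue
        for prefix, suffix in TISSUES.items():
            merged[prefix + name] = {
                (m if is_extracellular(m) else m + suffix): coefficient
                for m, coefficient in stoichiometry.items()
            }
    return merged


# Hypothetical toy network: exchange, transport, and one intracellular reaction.
toy = {
    "EX_glc": {"glc[e]": -1},
    "GLCt": {"glc[e]": -1, "glc[c]": 1},
    "HEX1": {"glc[c]": -1, "g6p[c]": 1},
}
multi = build_three_tissue(toy)
print(sorted(multi))
```

In the toy example the exchange reaction appears once, while the transport and intracellular reactions appear three times, each linked to the single shared extracellular glucose pool.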

The algorithm for trimming down the general three-tissue version of Recon1 is built to minimize the number of linear programming problems needed to trim the reconstruction [17,22]. The algorithm uses three sets of reactions: a high-confidence set, a medium-confidence set, and a low-confidence set. The high confidence set of reactions was created from literature results containing confirmed protein expression in the three tissues, which were mapped to specific reactions in Recon1 for specific tissue types [71-101]. The medium confidence set of reactions was obtained from tissue-specific data in the publicly available databases HPRD [102,103,104], UniProt [105], and Brenda [106] and their online databases with tissue-specific data [107,108,109]. The low confidence set of reactions was the list of remaining reactions that were not in either the high or medium confidence sets.

The overall goal of this algorithm is: (1) maintain all high confidence reactions and (2) maximize the number of medium confidence reactions minus 0.5 multiplied by the number of low confidence reactions in the final trimmed version of the reconstruction.

Statement (2) can be considered the objective function of the algorithm:

Maximize $|C_M \cap R_P| - 0.5\,|C_X \cap R_P|$

The value 0.5 reflects equal probability of obtaining the most parsimonious model and of including a maximal number of moderate probability reactions in the partial model using the above algorithm [17].

$R_P$ is a partial subset of reactions from Recon1 that defines the solution space for the tissue-specific reconstruction, and $C_M$ and $C_X$ are the medium and low confidence sets of reactions, respectively. The algorithm makes efficient use of all of the information gained from a single linear programming (LP) solution. Every reconstruction can be broken down into elementary flux modes; these can be thought of as the simplest possible flux distributions. Therefore, every reconstruction is a linear combination of elementary flux modes, and thus each linear programming solution is a smaller linear combination of elementary flux modes. Ideally, the algorithm would identify all elementary flux modes in a reconstruction, or even in an LP solution, and find the optimal combination of elementary flux modes, but as of now this is computationally infeasible due to combinatorial explosion [110]. Elementary flux modes have been determined for smaller networks, but cannot be found for more complex networks like Recon1. However, LP solutions represent feasible flux distributions that are typically much smaller than the final networks. Final networks can also be thought of as linear combinations of all of these possible linear programming solutions.

Therefore, the algorithm pre-loads a very large matrix with information on around 10,000 to 20,000 randomly generated flux distributions represented by linear programming solutions. The solutions are randomized by randomizing the objective functions of the associated linear programming problems. The algorithm randomly selects a few high confidence and medium confidence reactions to be maximized in the objective function; low confidence reactions are never in the objective function. Then, a Boolean vector is created that describes the activity of each reaction in the network (1 for active and 0 for inactive). This vector represents one flux distribution and also represents one column in the large matrix of random flux distributions (named fdMatrix). The rows of each Boolean column vector, and consequently of the matrix, represent distinct reactions in the general network. The creation of this matrix represents the second step of the algorithm.
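The construction of fdMatrix can be sketched as follows. The sketch does not solve any linear programs: the flux vectors are made-up numbers standing in for LP solutions, and the activity tolerance `tol`, the helper names, and the number of reactions drawn for each random objective are assumptions made for illustration.

```python
import random


def random_objective(high, medium, k, rng):
    """Pick k reactions to maximize; low confidence reactions are never candidates."""
    candidates = sorted(set(high) | set(medium))
    return rng.sample(candidates, k)


def activity_column(fluxes, tol=1e-9):
    """Boolean activity of each reaction in one flux distribution (1 = active)."""
    return [1 if abs(v) > tol else 0 for v in fluxes]


def build_fd_matrix(flux_distributions, tol=1e-9):
    """Rows are reactions, columns are flux distributions."""
    columns = [activity_column(f, tol) for f in flux_distributions]
    n_reactions = len(columns[0])
    return [[column[i] for column in columns] for i in range(n_reactions)]


rng = random.Random(0)
objective = random_objective(high=[0, 1], medium=[2, 3], k=2, rng=rng)

# Made-up flux vectors standing in for two LP solutions over three reactions.
flux_distributions = [[1.0, 0.0, 2.5], [0.0, 0.0, -1.0]]
fd_matrix = build_fd_matrix(flux_distributions)
print(objective, fd_matrix)
```

A reaction with a negative flux is still counted as active, since only the presence of flux matters for the Boolean vector.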

The next step of the algorithm consists of providing scores for each flux distribution. The scores are calculated by subtracting half the number of active low confidence reactions from the number of active medium confidence reactions. If a score is positive, it means that adding the reactions associated with this flux distribution will add to the value of the objective function of the algorithm. Scores are calculated for each column in the matrix.
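The scoring rule can be written down directly. The sketch below assumes that fdMatrix is stored as a list of rows (one row per reaction, one column per flux distribution) and that the medium and low confidence sets are given as row indices; the function name and the example matrix are hypothetical.

```python
def column_scores(fd_matrix, medium_rows, low_rows):
    """Score = active medium confidence reactions - 0.5 * active low confidence reactions."""
    n_columns = len(fd_matrix[0])
    scores = []
    for j in range(n_columns):
        n_medium = sum(fd_matrix[i][j] for i in medium_rows)
        n_low = sum(fd_matrix[i][j] for i in low_rows)
        scores.append(n_medium - 0.5 * n_low)
    return scores


# Hypothetical matrix: rows 1 and 2 are medium confidence, row 3 is low confidence.
example_matrix = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]
scores = column_scores(example_matrix, medium_rows=[1, 2], low_rows=[3])
print(scores)  # [0.5, 2.0, -0.5]
```

Only the second column would clearly improve the objective; the third column would lower it because it contributes a low confidence reaction and no medium confidence reactions.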

This marks the beginning of the third step of the algorithm. This step involves addition of flux distributions associated with the high confidence set of reactions. It starts by ordering the list of scores associated with the flux distributions from high to low values.

Then, the flux distribution with the highest score is checked for any active high confidence reactions; if there are no high confidence reactions, then the flux distribution with the second highest score is checked for high confidence reactions. This procedure continues until the highest scoring flux distribution containing at least one high confidence reaction is found. All of the active reactions within this flux distribution are added to a final list of reactions. Also, any high confidence reactions added to the final list of reactions are removed from the list of remaining high confidence reactions. Then, the rows of the fdMatrix which are associated with the set of active reactions that were recently added to the final list of reactions are set to rows filled with zeros. This is done because these reactions are now in the final list of reactions and therefore do not contribute to the objective score anymore. Changing the values in the rows changes the scores associated with each column/flux distribution in fdMatrix; thus, the scores must be re-calculated to reflect the changes in the rows. These re-calculated scores are sorted again to reveal the columns/flux distributions with the highest scores. Then, the highest scoring column/flux distribution with at least one reaction that is in the list of remaining high confidence reactions is selected, and the whole cycle repeats itself. This cycle is repeated until there are no reactions left in the list of remaining high confidence reactions, which means that all of the high confidence reactions were added to the list of final reactions, which is one of the requirements of the algorithm. It is also important to note that the high confidence reactions were added in a way that maximizes the score/objective function of the algorithm.
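The greedy cycle described for the high confidence reactions can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `add_high_confidence`, the list-of-columns matrix layout, and the tie-breaking rule (first column wins) are hypothetical and do not reproduce the code used in this work.

```python
def add_high_confidence(fd_matrix, high, medium, low):
    """Greedily add flux distributions until every high-confidence reaction is covered.

    Returns (final, remaining): the set of added reaction indices and any
    high-confidence reactions that no sampled distribution could cover.
    """
    matrix = [list(col) for col in fd_matrix]  # work on a copy
    remaining = set(high)
    final = set()

    def score(col):
        return sum(col[r] for r in medium) - 0.5 * sum(col[r] for r in low)

    while remaining:
        # Only columns with at least one remaining high-confidence reaction qualify.
        candidates = [col for col in matrix if any(col[r] for r in remaining)]
        if not candidates:
            break
        best = max(candidates, key=score)  # scores are recomputed every cycle
        active = {r for r, flag in enumerate(best) if flag}
        final |= active
        remaining -= active
        # Zero the rows of the reactions just added so they stop contributing.
        for col in matrix:
            for r in active:
                col[r] = 0
    return final, remaining


# Toy network (assumed): rows 0-1 high, rows 2-3 medium, row 4 low.
columns = [
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
]
final, remaining = add_high_confidence(columns, high={0, 1}, medium={2, 3}, low={4})
```

In this toy case the first column is chosen first because it scores highest, and a second column is then needed only to bring in the remaining high confidence reaction.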

The next step of the algorithm involves adding other flux distributions that increase the value of the objective function. After all of the high confidence reactions have been added, the number of medium confidence reactions added to the reconstruction is optimized. The process for adding flux distributions that add a positive value to the objective function is very similar to the process that added all of the high confidence reactions. Each of the remaining flux distributions is assigned an objective score based on the number of medium confidence reactions and low confidence reactions remaining within that distribution. These scores are sorted to find the distribution with the highest objective value. If the highest objective score is above zero, then the reactions of that distribution are added to the reconstruction. Then, the rows that correspond to the reactions that were just added are filled with zeros to prevent double counting toward the objective function.

This process is repeated as long as the highest objective score is greater than zero. This means that reactions are added only if the corresponding distributions increase the objective function, and reactions are no longer added when the objective function cannot be increased by the randomly created distributions, meaning that the objective function is at a maximum for this particular set of flux distributions.
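The stopping rule for the medium confidence step can be sketched in the same style. The function name `add_medium_confidence` and the toy data are hypothetical; the sketch only illustrates the "add while the best score is positive" logic described above.

```python
def add_medium_confidence(fd_matrix, medium, low, already_added=()):
    """Keep adding the best-scoring distribution while its score is positive."""
    matrix = [list(col) for col in fd_matrix]  # work on a copy
    final = set(already_added)
    # Reactions already in the model must not be counted again.
    for col in matrix:
        for r in final:
            col[r] = 0

    def score(col):
        return sum(col[r] for r in medium) - 0.5 * sum(col[r] for r in low)

    while matrix:
        best = max(matrix, key=score)
        if score(best) <= 0:
            break  # no remaining distribution increases the objective
        active = {r for r, flag in enumerate(best) if flag}
        final |= active
        for col in matrix:
            for r in active:
                col[r] = 0  # prevent double counting
    return final


# Toy network (assumed): rows 0-1 medium, rows 2-3 low.
columns = [
    [1, 0, 0, 0],  # score 1.0
    [0, 1, 1, 1],  # score 0.0
    [1, 1, 1, 0],  # score 1.5
]
model = add_medium_confidence(columns, medium={0, 1}, low={2, 3})
```

Here the third column is added first; afterwards no column has a positive score, so the low confidence reaction in row 3 is never added.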

Then, this whole process is repeated; a new set of random flux distributions is created based on the partial model this time (the distributions were created from the general model the first time), reactions from the high confidence list are added along with their associated reactions determined by the flux distributions, and medium confidence reactions are optimized. As part of the model building algorithm during the reconstruction process, maximizing the flux to ascertain activity/inactivity of a reaction is needed. A random flux distribution is used only in the reconstruction process, where it is subjected to our algorithm to find out whether each reaction adds biological meaning or not. This process repeats itself until the model does not change in size from one iteration to the next. The time required to solve a problem associated with a three tissue system varies from 0.05 to 0.1 seconds. The number of reactions maximized in each iteration is variable; in the first iteration, all reactions are maximized. In the second iteration, all reactions that had zero flux in the solution to the first linear programming iteration are maximized, and so on. This continues until either all reactions are proven to carry a flux (have a non-zero solution to any linear programming problem), or some set of reactions is proven to be unable to carry flux (the set of reactions is maximized/minimized and no flux profiles result). The latter is explicitly proven by performing distinct linear programming problems and maximizing/minimizing each reaction in the set individually, as explained in the tissue-specific model building algorithm.

5.5 Validation Procedure

Robustness analysis of the algorithm was performed by generating 5 replicate models, each starting from a different randomized stoichiometric matrix. All 5 of the replicate models were similar in size, thereby validating the robustness of the algorithm.

Further validations of the model are described in the following subsections.

5.5.1 OMIM gene deletion analysis

The first validation procedure uses an approach similar to that used in studies by Shlomi et al. The validation tests the effect of deleting genes in silico versus experimental data. The experimental phenotypic effect of deleting genes is available in the public database OMIM [111]. The known biomarkers of the amino acid-associated disorders compiled above were manually extracted from the disease description field in the OMIM database.

This set of disorders was further filtered to include only the disorders that were reported to show a concentration change in at least one of the model's boundary metabolites. This resulted in a final set of 17 disorders that composed the validation set. In this validation, disorders that are known to increase or decrease amino acid levels in the blood were considered. This is because the calculated steady state models can only allow accumulation in the extracellular (blood) compartment.

After mapping OMIM disorder identifiers to specific human genes, reactions associated with the affected genes were found. Then, those reactions were artificially “turned on” by equating the lower bounds of the associated reactions to 1. This means that a flux with zero value through the affected reactions is not permitted. The purpose of forcing the reactions to be active is to model a situation in which these reactions are used; this will provide a greater contrast for when the reactions are removed (if zero flux is still allowed, then the model may not use the affected reactions and therefore the comparison would be between reactions that may not be used and reactions that are deactivated). After the lower bounds are changed, flux variability analysis is performed on the model [112], providing a minimum and maximum flux allowed by the solution space of the model.

Then, the two sets of flux bounds are compared to determine if the reactions are upregulated, downregulated or unchanged. The reactions are considered either upregulated or downregulated if the change in flux bounds is greater than a threshold, or less than the negative of that same threshold. The change between the sets of flux bounds is calculated by:

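For reaction i:

Disease reaction flux < wild-type reaction flux if

\[
\big(\mathrm{diseaseMin}(i) + \varepsilon M(i) < \mathrm{refMin}(i) \ \text{AND}\ \mathrm{diseaseMax}(i) \le \mathrm{refMax}(i)\big) \ \text{OR}\ \big(\mathrm{diseaseMin}(i) \le \mathrm{refMin}(i) \ \text{AND}\ \mathrm{diseaseMax}(i) + \varepsilon M(i) < \mathrm{refMax}(i)\big)
\]

Disease reaction flux > wild-type reaction flux if

\[
\big(\mathrm{diseaseMin}(i) > \mathrm{refMin}(i) + \varepsilon M(i) \ \text{AND}\ \mathrm{diseaseMax}(i) \ge \mathrm{refMax}(i)\big) \ \text{OR}\ \big(\mathrm{diseaseMin}(i) \ge \mathrm{refMin}(i) \ \text{AND}\ \mathrm{diseaseMax}(i) > \mathrm{refMax}(i) + \varepsilon M(i)\big)
\]

Where:

\[
M(i) = \mathrm{mean}\big(\mathrm{refMin}(i),\ \mathrm{refMax}(i),\ \mathrm{diseaseMin}(i),\ \mathrm{diseaseMax}(i)\big)
\]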

The set of reactions represented by the variable ‘i’ are the set of exchange reactions of metabolites that have been identified in the phenotype descriptions of the OMIM database. In this validation, they are amino acids. The variables diseaseMax and diseaseMin are the flux bounds of the amino acids in the disease version of the model with particular genes knocked out. The variables refMin and refMax are the flux bounds of the amino acids in the healthy model that has no genes knocked out. The variable ε represents the threshold value used for the receiver-operator characteristic (ROC) curves.
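The comparison of the two sets of flux bounds can be illustrated with a short Python sketch. It is written under explicit assumptions: the threshold ε is taken to scale M(i), the inequalities are taken to be strict on the shifted bound and non-strict on the other, and M(i) is taken as the absolute value of the mean of the four bounds. The function name `classify_exchange_flux` and the example numbers are hypothetical.

```python
def classify_exchange_flux(ref_min, ref_max, dis_min, dis_max, eps):
    """Label an exchange reaction as 'down', 'up' or 'unchanged' in the disease model.

    Assumption: the threshold eps scales M(i), the absolute mean of the four bounds.
    """
    m = abs((ref_min + ref_max + dis_min + dis_max) / 4.0)
    shift = eps * m
    lower = ((dis_min + shift < ref_min and dis_max <= ref_max) or
             (dis_min <= ref_min and dis_max + shift < ref_max))
    higher = ((dis_min > ref_min + shift and dis_max >= ref_max) or
              (dis_min >= ref_min and dis_max > ref_max + shift))
    if lower and not higher:
        return "down"
    if higher and not lower:
        return "up"
    return "unchanged"


# Hypothetical flux bounds for three amino acid exchange reactions.
call_a = classify_exchange_flux(0, 10, 0, 4, eps=0.1)   # upper bound shrinks
call_b = classify_exchange_flux(0, 10, 2, 12, eps=0.1)  # both bounds shift upward
call_c = classify_exchange_flux(0, 10, 0, 10, eps=0.1)  # identical bounds
```

Raising `eps` makes the classifier more conservative, which is the behavior exploited when the threshold is swept to build the ROC curve.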

The results of this validation are depicted via ROC curves in the Results section. In this simulation, the threshold value is varied to distinguish between very large changes in flux bounds and very small changes in flux bounds. Thus, for a given threshold value and disease, upregulated and downregulated predictions are made for each amino acid exchange reaction. These predictions are checked against the known phenotype of that specific disease. This generates a list of true positive, false positive, true negative, and false negative predictions for the specific disease at a particular threshold. This analysis is repeated using the same threshold value for each disease in the validation. All of the predictions from each disease are added together, to provide a number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for that particular threshold value. Then, the false positive rate (FPR) and the true positive rate (TPR) are calculated as below.

1 FPR  1

1 TPR  1 

1 Po Where,   ; Po is proportion of unchanged metabolites in the data;  is the 1 Po threshold, defined above

TPR = true positive rate; (number of true positives)/(number of true positives + number of false negatives)

FPR = false positive rate; (number of false positives)/(number of false positives + number of true negatives)

These values for FPR and TPR are calculated for a range of threshold values, typically from 0 to 1000, and plotted with FPR on the x-axis and TPR on the y-axis. The area under the curve (AUC) of this plot is the indicator of the quality of the reconstructed network. An AUC of 0.5 represents a random classifier; any AUC over 0.5 represents a system that is better than random guessing.
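As an illustration of how the pooled counts translate into a ROC curve, the sketch below computes TPR and FPR from TP, FP, TN and FN and integrates the curve with the trapezoidal rule. The counts are invented for the example, and the helper names `rates` and `auc` are hypothetical.

```python
def rates(tp, fp, tn, fn):
    """Return (FPR, TPR) from pooled prediction counts."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fpr, tpr


def auc(points):
    """Trapezoidal area under a curve given as (FPR, TPR) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Invented (TP, FP, TN, FN) counts at three threshold values.
counts = [(0, 0, 10, 10), (6, 2, 8, 4), (10, 10, 0, 0)]
roc_points = [rates(*c) for c in counts]
area = auc(roc_points)
```

With these invented counts the curve passes through (0, 0), (0.2, 0.6) and (1, 1), giving an area of 0.70, which lies above the 0.5 expected from a random classifier.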

5.5.2 T2DM gene expression fold change analysis

The gene expression of adipose, liver, and muscle tissue in the MKR mouse model was examined. The MKR mouse model was developed by over-expressing a dominant-negative IGF-I receptor in skeletal muscle [113]. Statistical tests were used to identify genes that were significantly upregulated or downregulated (p-value < 0.05). These genes were mapped to Entrez gene IDs, and subsequently mapped to the steady-state model. This identified specific reactions that were either upregulated or downregulated due to differential gene transcription in type II diabetes. For reactions that had both upregulated and downregulated genes mapped to them, the regulation status of the reaction was determined by counting the upregulated and the downregulated genes that map to the reaction in question and comparing the two counts. This is only done when multiple gene IDs from the microarray data map to the same Entrez gene ID; it is not performed in the case where multiple distinct gene IDs map to the same reaction through gene-to-reaction maps. In the latter case, the gene-to-reaction logical mapping is used to determine the regulation status.

Transcription levels of genes do not directly correspond to reaction fluxes, but a few assumptions can be made to demonstrate the effect of the levels. The main assumption is that relative gene transcription levels correlate with relative protein concentrations. With this assumption, the effect of increased or decreased protein concentration manifests itself in the upper and lower bounds of the reactions that are associated with the protein. This is because enzyme concentration only affects Vmax in Michaelis-Menten rate equations. Vmax is equal to kcat multiplied by enzyme concentration; therefore, a lower enzyme concentration decreases the maximum reaction velocity (and minimum reaction velocity, if the reaction is reversible). Reactions that are determined to be affected by gene regulation have upper bounds initially set to 100 and lower bounds initially set to -100 or 0 depending on reversibility. Unaffected reactions have normal reaction bounds (upper bounds are 1000 and lower bounds are -1000 or 0). Initial FVA is then performed.

Then, upregulated reactions have bounds doubled and downregulated reactions have bounds halved. FVA is performed again, and the resulting bounds are compared to the initial FVA results. The reasoning for initially changing affected reaction bounds from 1000 to 100 was to magnify the effect of upregulating reactions. If reaction bounds of affected reactions are kept at the normal value of 1000, then doubling the reaction bounds due to upregulation may not display downstream upregulation or the far-reaching effects of gene upregulation. The lower bound and upper bound for exchange reactions are 1 and 1000 respectively. The first step is to identify all upregulated and downregulated reactions, and change the associated bounds to one-tenth of the initial value. This is done so that the effect of increases in reaction bounds due to upregulation will have an effect in a steady-state model. Then, FVA is performed on the model to establish a control set of bounds. After that, the bounds of upregulated reactions are doubled and downregulated reactions are halved. Then, FVA is performed again to generate a disease-state set of bounds. Then, the difference between the two sets of bounds is determined by the following:

For a given reaction i:

\[
\mathrm{change}(i) = \frac{\big(\mathrm{diseaseMin}(i) - \mathrm{refMin}(i)\big) + \big(\mathrm{diseaseMax}(i) - \mathrm{refMax}(i)\big)}{M(i)}
\]

diseaseMin and diseaseMax represent the lower and upper bounds of the disease set; refMin and refMax represent the lower and upper bounds of the control set.

M(i) is the absolute value of the mean of all four bounds for reaction i.
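The bound manipulation and the normalized change can be illustrated with a small Python sketch. It assumes that the change is the sum of the two bound differences divided by M(i); the function names `scale_bounds` and `relative_change`, and the example bounds, are hypothetical. No FVA solver is invoked here; the numbers simply stand in for FVA output.

```python
def scale_bounds(lower, upper, status):
    """Double the bounds of an upregulated reaction, halve those of a downregulated one."""
    factor = {"up": 2.0, "down": 0.5}.get(status, 1.0)
    return lower * factor, upper * factor


def relative_change(ref_min, ref_max, dis_min, dis_max):
    """Sum of the shifts in both bounds, normalized by the absolute mean of all four bounds."""
    m = abs((ref_min + ref_max + dis_min + dis_max) / 4.0)
    if m == 0:
        return 0.0  # all bounds cancel out; treated as no change in this sketch
    return ((dis_min - ref_min) + (dis_max - ref_max)) / m


up_bounds = scale_bounds(-100, 100, "up")      # reversible, upregulated
down_bounds = scale_bounds(0, 100, "down")     # irreversible, downregulated
change = relative_change(0, 100, 0, 50)        # stand-in FVA bounds
```

A negative value of `change` indicates that the allowed flux range moved downward in the disease state relative to the control state.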

Literature results were obtained for specific concentrations of amino acids in the plasma of T2DM rats versus healthy rats. The effects on each of the transport fluxes of each amino acid were considered to determine an increase or decrease in concentration in the plasma. This was done instead of simply inspecting the exchange reactions for each amino acid because the full effect of the gene expression changes may not be observed on an exchange reaction if an unaffected pathway exists that involves the exchange reaction in question. Taking all transport reactions into account generates a more complete picture of the uptake or secretion of metabolites.

Chapter 2: The beta-3 adrenergic agonist (CL-316,243) restores the expression of down-regulated fatty acid oxidation genes in Type 2 diabetic mice

Summary

Background: The hallmark of Type 2 diabetes (T2D) is hyperglycemia, although there are multiple other metabolic abnormalities that occur with T2D, including insulin resistance and dyslipidemia. To advance T2D prevention and develop targeted therapies for its treatment, a greater understanding of the alterations in metabolic tissues associated with T2D is necessary. The aim of this study was to use microarray analysis of gene expression in metabolic tissues from a mouse model of pre-diabetes and T2D to further understand the metabolic abnormalities that may contribute to T2D. We also aimed to uncover the novel genes and pathways regulated by the insulin sensitizing agent (CL-316,243) to identify key pathways and target genes in metabolic tissues that can reverse the diabetic phenotype.

Methods: Male MKR mice on an FVB/n background and age matched wild-type (WT) FVB/n mice were used in all experiments. Skeletal muscle, liver and fat were isolated from prediabetic (3 week old) and diabetic (8 week old) MKR mice. Male MKR mice were treated with CL-316,243. Skeletal muscle, liver and fat were isolated after the treatment period. RNA was isolated from the metabolic tissues and subjected to microarray and KEGG database analysis.

Results: Significant decreases in the expression of mitochondrial and peroxisomal fatty acid oxidation genes were found in the skeletal muscle and adipose tissue of adult MKR mice, and the liver of pre-diabetic MKR mice, compared to WT controls. After treatment with CL-316,243, the circulating glucose and insulin concentrations in the MKR mice improved, and an increase in the expression of peroxisomal fatty acid oxidation genes was observed in addition to an increase in the expression of retinaldehyde dehydrogenases. These genes were not previously known to be regulated by CL-316,243 treatment.

Conclusions: This study uncovers novel genes that may contribute to pharmacological reversal of insulin resistance and T2D and may be targets for treatment. In addition, it explains the lower free fatty acid levels in MKR mice after treatment with CL-316,243 and furthermore, it provides biomarker genes such as ACAA1 and HSD17b4 which could be further probed in a future study.

Introduction

The global prevalence of diabetes is rising (1). As a result of population ageing, increasing rates of obesity and a sedentary lifestyle, there is an increasing incidence of Type 2 diabetes (T2D), which comprises 90% of diabetes cases worldwide (1, 2). There is an interplay between genetic susceptibility and environmental influences in this epidemic (3), and studies such as the Diabetes Prevention Program (DPP) have demonstrated that lifestyle intervention, particularly weight loss, leads to a significant reduction in the risk of diabetes (4, 5). It is generally believed that insulin resistance in metabolic tissues (liver, fat, and skeletal muscle) is a key factor in the development of T2D (6). What causes the development of insulin resistance in metabolic tissues is unclear. However, reducing insulin resistance by using medications, such as thiazolidinediones and beta-3 adrenergic agonists, decreases the incidence of diabetes and improves insulin resistance in animals and in short-term human studies, with evidence of browning of white adipose tissue (5, 7); the mechanisms through which this occurs are incompletely understood.

Inter-tissue cross talk between skeletal muscle, adipose tissue, the liver, brain, and pancreatic beta cells may contribute to insulin resistance. Increased circulating levels of free fatty acids (FFAs) and triglycerides (TG) from adipose tissue lipolysis are frequently observed in pre-diabetes and T2D and contribute to insulin resistance. Abnormalities in fatty acid metabolism contribute to the accumulation of lipids in tissues such as skeletal muscle and liver and to the development or worsening of insulin resistance (8). Other molecules, such as myokines and adipokines, may contribute to inter-tissue cross talk. Skeletal muscle, for example, releases a number of myokines that can act in an autocrine, paracrine, or endocrine manner to regulate metabolic processes (9, 10). Adipose tissue releases multiple adipokines, including leptin, adiponectin and resistin, that act through their receptors on many other organs to regulate metabolism.

In this study we have used an animal model of insulin resistance and T2D to understand the gene expression changes that occur in pre-diabetes, T2D and pharmacological treatment of T2D. The LeRoith lab developed a transgenic mouse model of T2D that overexpresses a dominant-negative IGF-1R (KR-IGF-1R) specifically in the skeletal muscle under the muscle creatine kinase (MCK) promoter (11). The male MKR mice exhibit hyperinsulinemia by 2-3 weeks of age, and hyperglycemia by 5-6 weeks of age with decreased whole body glucose uptake, failure to suppress hepatic gluconeogenesis in response to insulin and pancreatic beta cell dysfunction (11, 12). In addition, the male MKR mice have elevated circulating free fatty acids, triglycerides, and hepatic and muscle triglyceride content, compared to wild-type mice (11, 13). Therefore, the male MKR mouse is an excellent mouse model for studying the metabolic derangements associated with the pre-diabetic insulin resistant state and the diabetic hyperglycemic condition.

Previous gene expression studies on T2D have focused on single metabolic tissues (14-20), have studied target genes or proteins (11, 21-25), or have been performed in vitro in cell culture studies (26-29). None of these studies have explored the global gene expression changes in multiple tissues, or those caused by the administration of the insulin sensitizing beta-3 adrenergic agonist (CL-316,243). In this study, we aimed to uncover novel changes in metabolic tissues (skeletal muscle, liver and adipose tissue) between pre-diabetic and diabetic MKR mice, using microarrays and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database analysis. The KEGG database is a comprehensive database constructed from well-known molecular interaction networks and is extensively used to study biological pathways (30). The enrichment of KEGG pathways was used to encode all significantly differentially expressed genes in this study. The study of extracted KEGG pathways related to T2D indicates that they may help in building effective computational tools for the study of T2D. In addition to examining differences in the metabolic tissues in the pre-diabetic and diabetic models, we also aimed to uncover novel genes and pathways that were altered by the pharmacological treatment with CL-316,243 (24). Using the same methods, we uncovered novel genes and pathways that could be targets for future therapeutics for insulin resistance and T2D.

Methods

Animals

All animal studies were approved by the Mount Sinai School of Medicine Institutional Animal Care and Use Committee. Mice were housed in The Mount Sinai School of Medicine Center for Comparative Medicine and Surgery, an Association for Assessment and Accreditation of Laboratory Animal Care International (AAALAC) and Office of Laboratory Animal Welfare (OLAW) accredited facility, where animal care and maintenance were provided. Mice were kept on a 12 hour light/dark cycle and had free access to diet (Picolab Rodent Diet 20, 5053) and fresh water. All MKR and WT mice used in these studies were male, on the FVB/N background and were 3-12 weeks of age. The generation and characterization of the MKR mice have been previously described (11). Mice were euthanized at the end of each experiment; liver, gonadal fat and skeletal muscle (quadriceps) were collected and flash frozen in liquid nitrogen for subsequent RNA extraction.

Treatment with CL-316,243.

We have previously demonstrated that the beta-3 agonist CL-316,243 improves insulin sensitivity and lowers glucose levels in the MKR mice (24, 31). Nine to ten week old male WT and MKR mice were injected intraperitoneally with CL-316,243 (1 mg/kg BW/day) or with an equivalent volume of vehicle (sterile water and phosphate buffered saline) for three weeks. Body weight was measured before treatment and twice weekly during treatment. Body composition analysis was performed using the EchoMRI 3-in-1 NMR system (Echo Medical Systems, Houston, TX, USA) before treatment, and 6 and 13 days after the start of treatment. Fed blood glucose measurements were performed on tail vein blood during tumor studies using a Bayer Contour Glucometer (Bayer Healthcare, Mishawaka, IN, USA), prior to commencing treatment and weekly thereafter. Plasma fed insulin levels were measured at the end of the studies using the Sensitive Rat Insulin RIA kit (Millipore, St. Charles, MO, USA).

RNA Isolation and Microarray Analysis procedures

Total RNA was isolated from the liver, fat, and skeletal muscle tissues using the RNeasy Microarray Tissue Mini kit (Qiagen, Valencia, CA, USA), according to the manufacturer’s instructions. The concentration and quality of RNA were determined using the NanoDrop 2000 Spectrophotometer (Thermo Scientific, Wilmington, DE, USA), the Agilent 2100 Bioanalyzer (Agilent, Santa Clara, CA, USA), and RNA Integrity Number (RIN, Agilent, Santa Clara, CA, USA). All samples used for reverse transcription and microarray analysis had 260/280 ratios greater than 1.8 on the NanoDrop Spectrophotometer and RIN values greater than 8.0. 100 ng of RNA from each sample was reverse transcribed to cDNA and amplified using the NuGEN Applause WT-Amp ST system (NuGEN Technologies, San Carlos, CA, USA), according to the manufacturer’s protocol. 2.5 µg of cDNA was fragmented and 3’-biotinylated using the Encore Biotin Module (NuGEN Technologies, San Carlos, CA, USA). The resultant sample mix with hybridization reagents was loaded into the GeneChip Mouse Gene 1.0 ST array and incubated for 16-20 hours in a hybridization oven rotating at 60 rpm at 45 °C (Affymetrix, Santa Clara, CA, USA). Arrays were processed using the GeneChip Fluidics Station 450 (Affymetrix, Santa Clara, CA, USA). Chips were scanned using the GeneChip Scanner 3000 (Affymetrix, Santa Clara, CA, USA), operated by GeneChip Operating Software, version 1.4, which generated .CEL, .CHP and .RPT files. Poly-A controls (dap, lys, phe, thr) and hybridization controls (bioB, bioC, bioD and cre) were used as spike controls for cDNA synthesis and hybridization, respectively, using methods described in the manufacturer’s instructions (Affymetrix, Santa Clara, CA, USA).

The microarray raw data were analyzed using Partek software, version 6.3 (Partek, St. Louis, MO, USA). Raw data were subjected to Robust Multichip Average (RMA) quantile normalization to remove biases introduced by technical and experimental effects. All expression data were log base 2 transformed to obtain a near normal distribution for accurate statistical inference. Quality control by visualizing the data using a Principal Component Analysis cluster plot ensured that no outliers were included in the analysis. Next, two-way ANOVA analysis was performed to obtain a set of differentially expressed genes. A filter of p value < 0.05 and fold-change > 1.5 times was applied to identify the significantly differentially expressed genes.
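The final filtering step can be illustrated with a minimal Python sketch. It assumes that the fold change is available as a log2 ratio and that the 1.5-fold cutoff is applied in both directions; the function name `is_differentially_expressed` and the example values are hypothetical, and the ANOVA itself (performed in Partek) is not reproduced.

```python
import math


def is_differentially_expressed(log2_fold_change, p_value,
                                fold_cutoff=1.5, p_cutoff=0.05):
    """Apply the p-value and fold-change filter to one gene.

    Assumption: a gene passes if it changes by more than fold_cutoff
    in either direction (up or down).
    """
    return (p_value < p_cutoff and
            abs(log2_fold_change) > math.log2(fold_cutoff))


# Invented (log2 fold change, p value) pairs for four genes.
genes = {
    "gene_a": (1.0, 0.01),    # 2-fold up, significant
    "gene_b": (0.5, 0.01),    # about 1.41-fold, below the cutoff
    "gene_c": (-1.0, 0.20),   # 2-fold down, not significant
    "gene_d": (-1.0, 0.01),   # 2-fold down, significant
}
selected = sorted(name for name, (lfc, p) in genes.items()
                  if is_differentially_expressed(lfc, p))
```

Working on the log2 scale keeps up- and downregulation symmetric: a 1.5-fold decrease and a 1.5-fold increase are the same distance from zero.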

The list of significantly differentially expressed genes was exported to Ingenuity Pathway Analysis (Ingenuity Systems, Redwood City, CA) for biological inference. Statistically significant genes from the Partek analysis were overlaid on the Ingenuity Pathway Analysis (IPA) global molecular network after applying a filter on species (mouse) and tissue type.

KEGG pathways analysis

Kyoto Encyclopedia of Genes and Genomes (KEGG) database pathways and their corresponding genes were downloaded from the KEGG website (http://www.genome.jp/kegg/) on 11 June 2014 (32, 33). Differentially expressed genes for each study type, i.e. MKR vs WT (fat, skeletal muscle, and liver tissues) and CL-316,243 treated MKR vs vehicle treated MKR fat tissues, and genes from the entire mouse genome were mapped to each KEGG pathway. This provided a count of genes in each of the above four datasets and in the entire mouse genome annotated to each KEGG pathway. Calculation of enrichment P values is based on the following mathematical approach: in a dataset of N genes from the entire genome, where M genes are annotated to a particular KEGG pathway and n genes are differentially expressed, the probability of having k genes that are differentially expressed and also included in the above KEGG pathway is given by a hypergeometric distribution as described by (34):

\[
P = \sum_{x=1}^{k} \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}
\]

Based on this approach, calculations of enrichment P values and programming tasks were performed using MATLAB version 2010a [Natick, Massachusetts: The MathWorks Inc., 2010]. Enrichment P values were calculated using MATLAB’s hygecdf and hygepdf functions. The enrichment P value cutoff for KEGG analysis was 0.0001, based on a correction applied for the testing of multiple pathways.
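For readers without MATLAB, the hypergeometric terms can be evaluated with the Python standard library. The sketch below is an illustration only: it computes the upper tail P(X ≥ k), which is the usual form of an over-representation test and is an assumption here rather than a statement of how hygecdf and hygepdf were combined in this work. The function names and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, N, M, n):
    """Probability of exactly x pathway genes among n differentially expressed genes.

    N: genes in the genome, M: genes annotated to the pathway,
    n: differentially expressed genes.
    """
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)


def enrichment_p(k, N, M, n):
    """Upper-tail probability of observing k or more pathway genes (assumed test form)."""
    upper = min(M, n)
    return sum(hypergeom_pmf(x, N, M, n) for x in range(k, upper + 1))


# Toy numbers: 10 genes in total, 4 in the pathway, 3 differentially expressed,
# of which 2 fall in the pathway.
p_value = enrichment_p(k=2, N=10, M=4, n=3)
total = sum(hypergeom_pmf(x, 10, 4, 3) for x in range(0, 4))
```

For the toy numbers the upper tail is 1/3; with genome-scale values of N the same functions still work because Python integers do not overflow.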

Results

KEGG pathways analysis of the microarray results

The results of the microarray analysis are summarized in the Venn diagrams shown in Figure 1a and 1b. KEGG pathway analysis was then performed to determine enriched pathways (those containing a high number of differentially expressed genes). The results of the KEGG pathway analysis of lipid metabolism are shown in Table 1. The following pathways were found to be significantly overrepresented in fat tissues of MKR vs WT mice: fatty acid degradation, glycerolipid metabolism, and glycerophospholipid metabolism. When fat tissues of CL-316,243 treated MKR mice were compared with non-treated MKR mice, the glycerolipid metabolism, glycerophospholipid metabolism, fat digestion and absorption, and lipid metabolism were found to be significantly overrepresented. Ether lipid levels have previously been found to be significantly higher in obese and diabetic subjects (35). As shown in Figure 1c, when fat tissues of treated MKR mice were compared with non-treated MKR mice, a difference in dysregulation of ether lipid metabolism was observed. Most of the genes in fat tissues related to ether lipid metabolism were found to be downregulated in MKR compared with WT mice. Following treatment this trend was reversed. In addition to the above, pathways related to carbohydrate metabolism such as TCA cycle, pentose phosphate pathway, and metabolism, pyruvate metabolism, and propanoate metabolism were found to be overrepresented only in fat tissues of MKR vs WT mice and not in the fat tissues of CL-316,243 treated MKR compared with non-treated MKR. Moreover, the carbohydrate metabolism pathway amino sugar and nucleotide sugar metabolism was found to be overrepresented only after treatment with CL-316,243 in fat tissues of MKR mice.


Figure 1. Results from differential change analyses. a) Venn diagram showing significant genes (<0.05 P value) with 1.5X differential fold change between different datasets. b) Venn diagram showing overlap of significant genes with more than 1.5X fold change difference in CL-316,243 treated (Treated) vs untreated MKR (MKR) fat and untreated MKR vs WT fat tissues datasets c) Ether lipid metabolism KEGG pathway – color scheme is corresponding to the color scheme in Venn diagram. Other colors correspond to genes found in mouse genome but not in either datasets.

Table 1. KEGG pathway enrichment P values in the differentially expressed gene sets.

P Value P Value P Value P Value P Value Pathway Name MKR vs MKR vs CL treated MKR vs Prediabetic WT Fat WT Liver MKR vs

WT Skeletal MKR vs Vehicle- Muscle WT Liver treated MKR Fat

Fatty acid 0.037 0.006 0.259 0.007 0.676 biosynthesis

Fatty acid 0.026 0.201 0.016 1.61E-05 0.731 elongation

Fatty acid 0.012 degradation 1.79E-08 0.319 1.17E-05 3.31E-14

Synthesis and degradation of 0.381 0.422 0.103 5.27E-05 0.183 ketone bodies

Steroid 0.166 biosynthesis 0.060 0.081 0.002 9.23E-08

Primary bile acid 0.073 biosynthesis 0.081 0.036 0.027 0.001

Steroid hormone 0.114 biosynthesis 0.341 0.0008 0.097 0.001

Glycerolipid 8.45E-06 2.90E-05 1.98E-08 2.84E-08 7.49E-06 metabolism

Fat digestion and 0.181 0.244 5.40E-03 2.57E-05 6.86E-03 absorption

Glycerophospholi 3.27E-12 9.99E-06 5.75E-08 2.22E-08 7.57E-09 pid metabolism

Ether lipid 0.0004 0.002 0.002 0.0006 5.61E-07 metabolism

Sphingolipid 0.079 0.0003 0.007 0.002 0.0003 metabolism

Arachidonic acid 0.014 0.001 0.002 0.007 0.002 metabolism

Linoleic acid 0.054 0.248 0.025 0.065 0.010 metabolism

alpha-Linolenic 0.012 0.050 0.055 0.015 0.001 acid metabolism

Biosynthesis of unsaturated fatty 0.085 0.006 0.055 3.35E-06 0.153 acids

Fatty acid oxidation genes are dysregulated in the adipose tissues of the diabetic MKR mice compared with WT mice

Although the diabetic MKR mice were generated by overexpression of a dominant-negative human IGF-1R in skeletal muscle, it has previously been found that these mice also develop insulin resistance in liver and adipose tissue as well as a decrease in adipose tissue mass. Pathway analysis of the gene expression in the adipose tissue of MKR and WT mice by microarrays revealed dysregulated expression of the TCA cycle, branched chain amino acid (valine) degradation, tryptophan degradation, N-glycan biosynthesis, pyrimidine metabolism, ether lipid metabolism, and fatty acid oxidation. As the MKR mice have been previously found to have changes in fatty acid metabolism in the skeletal muscle, with increased adipose tissue and hepatic insulin resistance, we further investigated the specific alterations in the expression of genes in the fatty acid oxidation pathway in the adipose tissue of the MKR mice, compared to WT mice. We found downregulation of a number of genes involved in fatty acid oxidation in adipose tissue (Figure 2, Table 2), with a few notable exceptions, including the aldehyde dehydrogenases ALDH3A2 and ALDH7A1, the acyl CoA synthetases ACSL4 and ACSBG2 and the peroxisomal fatty acid oxidase ACOX3, that were upregulated in the adipose tissue of the MKR mice.

Figure 2. Gene network analysis of the fatty acid oxidation pathway in adipose tissue from MKR vs WT mice. Green represents down-regulation, red represents up-regulation, white symbols denote neighboring genes. The intensity of color represents the average of fold changes in the tissue from the MKR mice vs WT mice. The numbers below the symbols denote the fold change in gene expression of MKR vs WT mice.

Table 2. Gene expression data for Fatty Acid Oxidation pathway genes in adipose tissue from MKR vs WT mice.

Gene Name   MKR vs WT   Gene Name   MKR vs WT
CYP2E1      -2.7887     HADHA       -1.9032
ACADVL      -2.64       HADHB       -1.8385
HSD17B8     -2.428      ECHS1       -1.8378
ADHFE1      -2.3643     SLC27A4     -1.7369
ACSBG1      -2.3277     ACSL6       -1.736
HSD17B4     -2.3023     ACADL       -1.6304
ACAA1       -2.2625     ACSL5       -1.6065
ACOX1       -2.175      SLC27A1     -1.572
ACSL1       -2.174      ACOX3       1.54147
ACADM       -2.1376     ACSBG2      1.60749
CPT2        -2.1299     ALDH7A1     1.78087
HSD17B1     -2.1022     ALDH3A2     1.80530
EHHADH      -2.0911     ACSL4       3.1309

*NA indicates that gene expression data were not available in the cut-off dataset (fold change > 1.5X and p-value < 0.05).
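The fold changes in Table 2 are reported as signed values. The following minimal sketch shows how such values could be converted to log2 ratios and checked against the fold-change cut-off quoted in the footnote. It assumes the common convention that a value of -x denotes an x-fold decrease (ratio 1/x); this convention, the gene subset, and the helper names are illustrative assumptions and are not stated in the text.

```python
import math

# Subset of Table 2 (adipose tissue, MKR vs WT); values copied from the table.
TABLE2_SUBSET = {
    "CYP2E1": -2.7887,
    "HSD17B4": -2.3023,
    "ACAA1": -2.2625,
    "ACOX3": 1.54147,
    "ACSL4": 3.1309,
}

FOLD_CHANGE_CUTOFF = 1.5  # cut-off quoted in the table footnote


def signed_fold_change_to_log2(fc):
    """Convert a signed fold change to a log2 ratio.

    Assumed convention (not stated in the source): +x is an x-fold
    increase and -x is an x-fold decrease, so |fc| >= 1 always holds.
    """
    if abs(fc) < 1:
        raise ValueError("signed fold changes are expected to satisfy |fc| >= 1")
    return math.copysign(math.log2(abs(fc)), fc)


def passes_fold_change_cutoff(fc, cutoff=FOLD_CHANGE_CUTOFF):
    """Return True if the magnitude of the fold change exceeds the cut-off."""
    return abs(fc) > cutoff


for gene, fc in TABLE2_SUBSET.items():
    print(gene, round(signed_fold_change_to_log2(fc), 3), passes_fold_change_cutoff(fc))
```

Working on the log2 scale makes up- and down-regulation symmetric around zero, which is convenient when fold changes from several tissues are compared.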

Insulin resistance leads to downregulation of fatty acid oxidation genes in the skeletal muscle of MKR mice.

We next examined the expression of fatty acid oxidation genes in the skeletal muscle of the MKR and WT mice by microarray analysis. Significant differences were found in the fatty acid oxidation pathway between MKR and WT mice. The MKR mice had a significant downregulation of many of the genes involved in fatty acid beta-oxidation (Figure 3, Table 3). Notable exceptions to the general downregulation of fatty acid oxidation genes in the skeletal muscle of the MKR mice include carnitine palmitoyltransferase 1a (CPT1A), the peroxisomal acyl CoA oxidase ACOX3, the alcohol dehydrogenase ADH6A, the dicarbonyl/L-xylulose reductase DCXR, and the aldehyde/retinaldehyde dehydrogenase ALDH1A2, which were upregulated approximately 2-fold (Table 3). ACOX3 was upregulated in both the adipose tissue and the skeletal muscle of the MKR mice. It is one of the three acyl CoA oxidases that perform the first step of fatty acid oxidation in mouse peroxisomes, specifically the oxidation of branched-chain fatty acids (36-43). ACOX1, which was downregulated in the skeletal muscle and adipose tissue of MKR mice, metabolizes very long-chain fatty acids and long-chain dicarboxylic acids (DCAs). The primary genetic defect in the MKR mice is in the skeletal muscle, and the muscle of the MKR mice has previously been shown to have greater accumulation of fatty acid intermediates compared to WT mice. Therefore, the results of our microarray analysis of the skeletal muscle were consistent with our previously published data showing a decrease in fatty acid oxidation, although the microarray data also revealed previously unidentified changes in gene expression in this pathway in the skeletal muscle of the MKR mice.


Figure 3. Gene network analysis of the fatty acid oxidation pathway in skeletal muscle from MKR vs WT mice. Green represents down-regulation, red represents up-regulation, and white symbols denote neighboring genes. The intensity of color represents the average fold change in the tissue from the MKR mice vs WT mice. The numbers below the symbols denote the fold change in gene expression in MKR vs WT mice.

Table 3. Gene expression data for Fatty Acid oxidation pathway genes in skeletal muscle from MKR vs WT mice.

Gene Name   MKR vs WT   Gene Name   MKR vs WT
ALDH1A1     -4.19187    ACADVL      -1.7823
HSD17B4     -3.58868    CPT2        -1.73571
ACAA1       -3.11098    ALDH2       -1.73348
ACSL3       -2.40998    HSD17B10    -1.67733
IVD         -2.25629    ACSBG2      -1.60441
ACOX1       -2.14926    CYP2E1      -1.56151
ACSL4       -2.12389    HADHA       -1.54366
ACSL6       -2.10271    ADH5        -1.54339
ECHS1       -2.02957    ALDH1A2     1.72927
ALDH3A2     -1.97016    CPT1A       1.95779
SLC27A4     -1.95574    DCXR        2.01625
ACAA2       -1.82191    ACOX3       2.02906
ACAT1       -1.78422    ADH6A       2.19604

*NA indicates that gene expression data were not available in the cut-off dataset (fold change > 1.5X and p-value < 0.05).

No changes in hepatic fatty acid oxidation gene expression were found in the liver of adult diabetic MKR mice

The liver in the MKR mice is known to have increased triglyceride deposits compared to WT mice (11) and to demonstrate hepatic insulin resistance with failure to suppress gluconeogenesis (11). Therefore, we hypothesized that we would also find a downregulation of fatty acid oxidation genes in the liver of the MKR mice. However, analysis of hepatic gene expression revealed no significant changes in fatty acid oxidation genes in the adult MKR mouse compared to WT mice (Figure 4, Table 4). Decreased expression was found for only two genes in this pathway, the acyl CoA synthetase ACSBG1 and serine dehydratase (SDS), while, notably, increased expression was found for acetyl CoA acyltransferase 1 (ACAA1) and the acyl CoA synthetase ACSL6, both of which had decreased expression in the skeletal muscle and adipose tissue of MKR mice. Therefore, despite the insulin resistance in the liver of the MKR mice and the known hepatic TG accumulation, no significant decrease in the expression of hepatic FA oxidation genes was found.

Figure 4. Gene network analysis of the fatty acid oxidation pathway in liver from MKR vs WT mice. Green represents down-regulation, red represents up-regulation, and white symbols denote neighboring genes. The intensity of color represents the average fold change in the tissue from the MKR mice vs WT mice. The numbers below the symbols denote the fold change in gene expression in MKR vs WT mice.

Table 4. Gene expression data for Fatty Acid oxidation pathway genes in liver from MKR vs WT mice.

Gene Name   MKR vs WT   Gene Name   MKR vs WT
SDS         -3.28557    CPT1A       NA
ACSBG1      -1.72659    CPT2        NA
ALDH3A2     1.55345     ACOX1       NA
ACSBG2      1.74991     ACOX3       NA
ALDH1A2     1.90081     ACADVL      NA
ACSL6       2.4039      EHHADH      NA
ACAA1       2.47745     ECHS1       NA
ACSL4       2.94002     HADHA       NA
ACSL5       NA          HADHB       NA
SLC27A1     NA          HSD17B10    NA
SLC27A4     NA          HSD17B4     NA
ACSL1       NA          ACAT1       NA

*NA indicates that gene expression data were not available in the cut-off dataset (fold change > 1.5X and p-value < 0.05).

Hepatic fatty acid oxidation genes were significantly downregulated in the pre-diabetic MKR mouse

We then examined hepatic gene expression in 3-week-old pre-diabetic MKR mice compared to age-matched WT mice. The 3-week-old MKR mice have insulin resistance and hyperinsulinemia but are not hyperglycemic, and are therefore a model of pre-diabetes. Our previous studies have demonstrated that the MKR mice at 3 weeks of age have significant increases in hepatic triglyceride content when compared to WT mice (11). Using the same methods, our microarray analysis revealed that the FA oxidation pathway was the most significantly dysregulated pathway in the liver of the pre-diabetic MKR mice, with significant and substantial decreases in the expression of a large number of FA oxidation genes compared to the age-matched control mice (Table 5).

Table 5. Gene expression data for Fatty Acid oxidation pathway genes in liver from pre-diabetic MKR vs age-matched WT mice.

Gene        PD vs WT    Gene        PD vs WT
EHHADH      -7.26916    ACOX3       -2.51966
ALDH2       -6.37787    HSD17B4     -2.50286
ALDH1A1     -5.94915    CPT2        -2.42769
SLC27A5     -5.0653     ACAA2       -2.32958
ALDH7A1     -4.70793    ACADL       -2.2378
ACADVL      -4.33359    ACSL4       -2.20483
ACOX1       -3.99613    ECHS1       -2.17225
SLC27A4     -3.15579    HSD17B10    -2.08251
ACSL3       -2.98288    ADH5        -2.02934
DCXR        -2.67591    ACAT1       -2.6477
ADHFE1      -2.67386    IVD         -2.5505

*NA indicates that gene expression data were not available in the cut-off dataset (fold change > 1.5X and p-value < 0.05). PD denotes pre-diabetic (3-week-old MKR mice).
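The overlap between tissues can be checked directly from the tables. The sketch below takes the genes with negative fold changes in Tables 2, 3 and 5 and intersects the three sets; the variable names are illustrative, and only the gene symbols listed in the tables are used.

```python
# Genes with negative fold changes (downregulated vs WT), copied from the tables.
DOWN_ADIPOSE = {  # Table 2, adult MKR adipose tissue
    "CYP2E1", "ACADVL", "HSD17B8", "ADHFE1", "ACSBG1", "HSD17B4", "ACAA1",
    "ACOX1", "ACSL1", "ACADM", "CPT2", "HSD17B1", "EHHADH", "HADHA", "HADHB",
    "ECHS1", "SLC27A4", "ACSL6", "ACADL", "ACSL5", "SLC27A1",
}
DOWN_MUSCLE = {  # Table 3, adult MKR skeletal muscle
    "ALDH1A1", "HSD17B4", "ACAA1", "ACSL3", "IVD", "ACOX1", "ACSL4", "ACSL6",
    "ECHS1", "ALDH3A2", "SLC27A4", "ACAA2", "ACAT1", "ACADVL", "CPT2", "ALDH2",
    "HSD17B10", "ACSBG2", "CYP2E1", "HADHA", "ADH5",
}
DOWN_PREDIABETIC_LIVER = {  # Table 5, 3-week-old MKR liver
    "EHHADH", "ALDH2", "ALDH1A1", "SLC27A5", "ALDH7A1", "ACADVL", "ACOX1",
    "SLC27A4", "ACSL3", "DCXR", "ADHFE1", "ACOX3", "HSD17B4", "CPT2", "ACAA2",
    "ACADL", "ACSL4", "ECHS1", "HSD17B10", "ADH5", "ACAT1", "IVD",
}

# Genes downregulated in all three tissues.
shared_down = sorted(DOWN_ADIPOSE & DOWN_MUSCLE & DOWN_PREDIABETIC_LIVER)
print(shared_down)
# ['ACADVL', 'ACOX1', 'CPT2', 'ECHS1', 'HSD17B4', 'SLC27A4']
```

The six shared genes include HSD17B4, ECHS1 and SLC27A4, which are discussed individually in the Discussion.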

Treatment of adult MKR mice with the beta 3-adrenergic receptor agonist CL-316,243 led to improvement in metabolic parameters and altered expression of fatty acid oxidation genes

WT and MKR mice were treated for two weeks with CL-316,243. Consistent with our previous studies, a decrease in random fed blood glucose, fed plasma insulin concentration and body fat was observed in the MKR mice after one and two weeks of treatment (Figure 5). Microarray analysis of adipose tissue from MKR mice treated with CL-316,243 revealed a significant change in the expression of a number of the genes in the FA oxidation pathway that were differentially regulated in the MKR compared to WT mice (Figure 5, Table 6). Genes that were downregulated in the MKR adipose tissue but were upregulated after CL-316,243 treatment included acetyl CoA acyltransferase 1 (ACAA1), the acyl CoA synthetase ACSL6, the alcohol dehydrogenase ADHFE1, and the hydroxysteroid dehydrogenase HSD17B4. Genes that were upregulated in the adipose tissue of MKR mice compared to WT mice and were downregulated after CL-316,243 treatment included the peroxisomal fatty acid oxidation enzyme ACOX3 and the aldehyde dehydrogenases ALDH3A2 and ALDH7A1. Chronic CL-316,243 treatment led to a significant downregulation of retinaldehyde dehydrogenases, including Aldh1a1, in the adipose tissue of MKR mice; Aldh1a1 deficiency has previously been associated with browning of adipose tissue, a phenomenon observed with CL-316,243 treatment. As shown in Table 6, a number of genes in the FA oxidation pathway were altered by CL-316,243 treatment but were not differentially regulated between the WT and MKR mice, and many genes were further downregulated by CL-316,243 treatment. These genes may be downregulated after chronic administration of CL-316,243 but increased in the acute setting, or their altered expression may be related to the browning of white adipose tissue observed after CL-316,243 treatment.

Figure 5. Gene network analysis of the fatty acid oxidation pathway in adipose tissue from CL-316,243-treated MKR vs vehicle-treated MKR mice. Green represents down-regulation, red represents up-regulation, and white symbols denote neighboring genes. The intensity of color represents the average fold change in the tissue from the CL-316,243-treated MKR mice vs vehicle-treated MKR mice. The numbers below the symbols denote the fold change in gene expression in CL-316,243-treated vs vehicle-treated MKR mice.

Table 6. Gene expression data for Fatty Acid oxidation pathway genes in adipose from CL-316,243 treated MKR vs vehicle treated MKR mice.

Gene Name   TR vs MKR   Gene Name   TR vs MKR
ALDH1A2     -5.1818     DCXR        -1.6946
ALDH1A1     -4.1157     ACSL1       -1.5251
SLC27A4     -3.8033     ADHFE1      1.52237
ALDH2       -3.7982     HSD17B4     1.54675
CPT1A       -3.6059     SLC27A6     1.60681
ACSL5       -3.0073     ACAT1       1.89588
ACOX3       -2.7208     ACSBG2      1.89683
ALDH7A1     -2.5909     ADH6A       2.02657
ALDH3A2     -2.3444     ACAA1       2.02752
IVD         -2.2947     ACSL6       2.10495
ACOX1       -2.0718     SLC27A5     2.51732
ADH5        -1.9237     SDS         2.65876
CYP2E1      -1.7096     ACSBG1      NA

*NA indicates that gene expression data were not available in the cut-off dataset (fold change > 1.5X and p-value < 0.05). TR denotes CL-316,243-treated MKR mice; MKR denotes vehicle-treated MKR mice.
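The reversal pattern described for the treated adipose tissue can be reproduced from Tables 2 and 6. The sketch below is a minimal illustration: the dictionaries hold the signed fold changes copied from the two tables (the NA entry for ACSBG1 is omitted), and the function name is hypothetical.

```python
# Signed fold changes copied from Table 2 (adipose tissue, MKR vs WT).
MKR_VS_WT = {
    "CYP2E1": -2.7887, "ACADVL": -2.64, "HSD17B8": -2.428, "ADHFE1": -2.3643,
    "ACSBG1": -2.3277, "HSD17B4": -2.3023, "ACAA1": -2.2625, "ACOX1": -2.175,
    "ACSL1": -2.174, "ACADM": -2.1376, "CPT2": -2.1299, "HSD17B1": -2.1022,
    "EHHADH": -2.0911, "HADHA": -1.9032, "HADHB": -1.8385, "ECHS1": -1.8378,
    "SLC27A4": -1.7369, "ACSL6": -1.736, "ACADL": -1.6304, "ACSL5": -1.6065,
    "SLC27A1": -1.572, "ACOX3": 1.54147, "ACSBG2": 1.60749, "ALDH7A1": 1.78087,
    "ALDH3A2": 1.8053, "ACSL4": 3.1309,
}

# Signed fold changes copied from Table 6 (adipose tissue, treated vs vehicle MKR).
TREATED_VS_MKR = {
    "ALDH1A2": -5.1818, "ALDH1A1": -4.1157, "SLC27A4": -3.8033, "ALDH2": -3.7982,
    "CPT1A": -3.6059, "ACSL5": -3.0073, "ACOX3": -2.7208, "ALDH7A1": -2.5909,
    "ALDH3A2": -2.3444, "IVD": -2.2947, "ACOX1": -2.0718, "ADH5": -1.9237,
    "CYP2E1": -1.7096, "DCXR": -1.6946, "ACSL1": -1.5251, "ADHFE1": 1.52237,
    "HSD17B4": 1.54675, "SLC27A6": 1.60681, "ACAT1": 1.89588, "ACSBG2": 1.89683,
    "ADH6A": 2.02657, "ACAA1": 2.02752, "ACSL6": 2.10495, "SLC27A5": 2.51732,
    "SDS": 2.65876,
}


def genes_reversed(disease, treatment, disease_sign):
    """Genes whose change in the disease state has the given sign (+1 or -1)
    and whose change after treatment has the opposite sign."""
    shared = disease.keys() & treatment.keys()
    return sorted(
        g for g in shared
        if disease[g] * disease_sign > 0 and treatment[g] * disease_sign < 0
    )


restored_up = genes_reversed(MKR_VS_WT, TREATED_VS_MKR, disease_sign=-1)
restored_down = genes_reversed(MKR_VS_WT, TREATED_VS_MKR, disease_sign=+1)
print(restored_up)    # ['ACAA1', 'ACSL6', 'ADHFE1', 'HSD17B4']
print(restored_down)  # ['ACOX3', 'ALDH3A2', 'ALDH7A1']
```

Both lists match the genes named in the text, which gives a quick consistency check between the tables and the narrative.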

Discussion

The key finding in this study is the significant down-regulation of FA oxidation genes in the skeletal muscle and adipose tissues of the MKR mouse model of Type 2 diabetes, and in the pre-diabetic insulin-resistant liver of these mice. The finding shows dysregulation of FA oxidation in the tested tissues in the insulin-resistant, pre-diabetic and diabetic states, and altered expression of genes that regulate peroxisomal beta-oxidation of fatty acids. Treatment of the mice with a beta-3 adrenergic agonist improved the diabetic state and led to the differential regulation of genes involved in peroxisomal fatty acid oxidation and of genes that have previously been reported to regulate browning of white adipose tissue in genetic knockout mice, but have not previously been shown to be regulated by a pharmacological agent such as the one used in this study. This study of changes in gene expression after treatment with a beta 3-adrenergic receptor agonist highlights novel findings for a better understanding of the pathophysiology of T2D in MKR mice, and identifies potential treatment targets.

Computational efforts to understand diseases such as T2D, whose enormous complexity affects more than one organ, have accelerated the understanding of the perturbed metabolism in such pathophysiological conditions. To understand the mechanism of action of CL-316,243 in diabetic mice, a thorough computational analysis was performed using KEGG pathway analysis based on the hypergeometric test and Ingenuity Pathway Analysis (IPA; www.ingenuity.com) based on Fisher’s exact test. The results revealed that treatment with CL-316,243 led to dysregulation of ether lipid metabolism as well as pathways such as fat digestion and absorption, glycerolipid metabolism, and glycerophospholipid metabolism. Additionally, the TCA cycle, which was highly overrepresented in MKR fat tissues, returned to the level of the healthy animals following CL-316,243 treatment. In short, the overrepresentation of the TCA cycle during T2D was found to be reversed after treatment with the drug.
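As an illustration of the pathway overrepresentation test mentioned above, the following sketch computes a hypergeometric upper-tail p-value using only the Python standard library. It is not the KEGG or IPA implementation; the function name and the gene counts in the example are hypothetical. The same tail probability is obtained from a one-sided Fisher's exact test on the corresponding 2x2 table.

```python
from math import comb


def hypergeom_enrichment_p(population, pathway_size, selected, overlap):
    """P(X >= overlap), where X is the number of pathway genes among
    `selected` genes drawn without replacement from `population` genes,
    of which `pathway_size` belong to the pathway."""
    max_overlap = min(pathway_size, selected)
    total = comb(population, selected)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers for illustration only (not taken from this study):
# 20000 genes on the array, 40 genes in the pathway, 600 differentially
# expressed genes, 6 of which fall in the pathway.
p_value = hypergeom_enrichment_p(population=20000, pathway_size=40,
                                 selected=600, overlap=6)
print(f"enrichment p-value = {p_value:.4g}")
```

In practice the p-values from many pathways would also be corrected for multiple testing; that step is omitted here.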

Defects in fatty acid oxidation may cause the accrual of fatty acyl CoAs and diacylglycerol in the skeletal muscle of obese subjects, resulting in inhibition of insulin-mediated glucose uptake and insulin resistance (44, 45). In the setting of hyperinsulinemia in insulin-treated rodents, hepatic expression of sterol regulatory element-binding protein 1c (SREBP-1c) increased, along with increased expression of acetyl CoA carboxylase (ACC) and other lipogenic genes, suggesting that insulin stimulates lipid synthesis in the liver (46). Individuals with T2D have increased adipose tissue lipolysis that leads to increased circulating free fatty acids (47). While initial reports stated that white adipose tissue is not an important tissue for fatty acid oxidation (48), later studies have shown that although the rate of fatty acid oxidation is low in white adipose tissue, normal fatty acid oxidation in adipocytes is important to regulate circulating free fatty acid concentrations and to prevent the development of fatty liver and beta cell dysfunction (49).

Previous genome-wide association studies have identified subsets of candidate genes for T2D (50). Of the genes identified in that study, some were involved in the metabolism of fatty acids and amino acids and were also found in our present study to be differentially regulated in the skeletal muscle (ACAA2, ECHS1) and adipose tissue (ECHS1) of the adult MKR mice, and in the liver of pre-diabetic MKR mice (ACAA2, ECHS1). ACAA2 encodes acetyl CoA acyltransferase 2, which catalyzes the final step of mitochondrial fatty acid oxidation along with acetyl CoA thiolase (ACAT1), which is also downregulated in the skeletal muscle of adult MKR mice and in the liver of pre-diabetic mice. ECHS1 encodes enoyl CoA hydratase short chain 1, which catalyzes the second step of mitochondrial FA oxidation.

Beta oxidation of fatty acids occurs in the peroxisomes and the mitochondria. The mitochondria oxidize the majority of long-chain fatty acids (LCFAs) from diet and fat stores, while the peroxisomes oxidize specific carboxylic acids, including very long-chain fatty acids (VLCFAs), branched-chain fatty acids (BCFAs), bile acids and fatty dicarboxylic acids. Mouse peroxisomes have three acyl CoA oxidases that perform the first step of beta-oxidation: ACOX1 oxidizes straight-chain substrates such as very long-chain fatty acids and long-chain dicarboxylic acids (DCAs), ACOX2 is active with the bile acid intermediates and the branched-chain fatty acids, and ACOX3 accepts the branched-chain fatty acids (36-43). The second and third steps of peroxisomal beta oxidation involve HSD17B4 (also known as D-peroxisomal bifunctional enzyme) and EHHADH (also known as L-peroxisomal bifunctional enzyme, L-PBE). The peroxisomal beta-oxidation genes that are differentially regulated in the tissues of the MKR mice include the fatty acid transporters SLC27A1 (also known as FATP1 or ACSVL4) in adipose tissue and SLC27A4 (FATP4, ACSVL5) in adipose tissue, skeletal muscle and pre-diabetic liver. In human studies, links between both SLC27A1 and SLC27A4 and obesity and T2D have been proposed (51, 52). SLC27A1 overexpression in primary human myocytes and HEK293 cells has been reported to trigger increased incorporation of FAs into triacylglycerol and an increase in 1,2-diacylglycerol acyltransferase activity (53, 54).

Reduced triacylglycerol deposition has been observed in 3T3-L1 adipocytes with shRNA-induced knockdown of SLC27A1 (55). Insulin has been shown to have varying effects on SLC27A1 expression in adipocytes, with an early repression, followed by stimulation during differentiation and a subsequent repression in mature adipocytes (56). Insulin may also stimulate the translocation of SLC27A1 to the cell surface in adipocytes; however, SLC27A4 is not recruited to the plasma membrane by insulin (57, 58). Of all the FA oxidation genes found to have lower expression in the pre-diabetic MKR mouse liver, EHHADH showed the largest decrease, with expression more than seven-fold lower than in the WT mice. Adult MKR adipose tissue, skeletal muscle, and the liver of pre-diabetic MKR mice also showed a significant decrease in the expression of HSD17B4, which appears to be able to handle all of the peroxisomal beta oxidation substrates (42). Recent studies have shown that L-PBE is protective against the over-accumulation of dicarboxylic fatty acids (DCAs) in mice fed medium-chain fatty acids (MCFAs) in the form of coconut oil (59). L-PBE knockout mice fed a diet high in MCFAs developed DCA accumulation in the liver, inflammation, hepatic fibrosis and death (59). HSD17B4-deficient patients and Hsd17b4 knockout mice accumulate VLCFAs, BCFAs and bile acid intermediates (60, 61). Our microarray analysis thus demonstrates significant differences in peroxisomal FA oxidation genes between the MKR and WT mice in different tissues, which may indicate that abnormalities in peroxisomal beta-oxidation contribute to the accumulation of DCAs and play a role in the development of insulin resistance.

The relative lack of difference in fatty acid oxidation gene expression in the liver of adult MKR compared with WT mice was an unexpected finding, as the MKR mice have increased hepatic TG content compared to WT mice (21). In addition, the FA oxidation pathway was the most significantly down-regulated pathway in the liver of the pre-diabetic MKR mice. One possible explanation for the normalization of FA oxidation gene expression in the adult MKR mouse is that the described abnormalities in peroxisomal FA oxidation lead to accumulation of DCAs. DCAs induce PPARalpha-regulated FA oxidation genes (59), and therefore the accumulation of DCAs in the pre-diabetic liver may lead to an up-regulation (or normalization) of FA oxidation genes in the adult mice. Alternatively, if FA oxidation is impaired, then gluconeogenesis will not occur and fasting hypoglycemia will result. Therefore, the lack of down-regulated fatty acid oxidation in the adult MKR mice may provide the fuel for gluconeogenesis and the maintenance of hyperglycemia (62). This hypothesis is supported by the up-regulation of the fatty acid transporter genes ACSBG2 and ACSL4, as well as acetyl CoA acyltransferase 1 (ACAA1) and retinaldehyde dehydrogenase 2 (ALDH1A2), which may be involved in hepatic gluconeogenesis (63).

Treatment with CL-316,243 showed enrichment of many biochemical pathways and an increase in the expression of the peroxisomal FA oxidation enzymes ACAA1 and HSD17B4 in the adipose tissue of MKR mice. Interestingly, chronic CL-316,243 treatment led to a significant down-regulation of several retinaldehyde dehydrogenases, including Aldh1a1, in the white adipose tissue of MKR mice. Aldehyde dehydrogenases convert retinaldehyde to retinoic acid. Aldh1a1 deficiency induces a brown adipose tissue-like transcriptional program in white adipose tissue, with uncoupling of respiration and adaptive thermogenesis (63). In obese mice, Aldh1a1 knockdown led to a reduction in weight gain and improvement in glucose homeostasis (63). No previous studies have demonstrated that beta 3-adrenergic receptor agonists decrease the expression of aldehyde dehydrogenases, and the function of the Aldh family of enzymes is still being elucidated (64, 65). These changes may contribute to the systemic metabolic improvements that occur with CL-316,243 treatment, and may be targets for future T2D treatments.

Conclusions

Overall, this study uses microarray and computational analysis to demonstrate that FA oxidation genes are significantly altered in the metabolic tissues of diabetic and pre-diabetic MKR mice. In addition, it uncovers biomarker genes such as ACAA1 and HSD17B4, which could be further probed in future studies of pre-diabetes and T2D. Furthermore, it identifies novel changes that occur upon treatment with CL-316,243, which not only explain the lower free fatty acid levels in MKR mice after treatment but may also provide targets for future therapies for T2D.

Chapter 3: Coupling Enrichment and Proteomics Methods for Understanding and Treating Disease

Summary

Owing to recent advances in proteomics analytical methods coupled with bioinformatics capabilities, there is a growing trend towards using these capabilities for the development of drugs to treat human disease, including target and drug evaluation, understanding mechanisms of action, and biomarker discovery.

Currently, the genetic sequences of many major organisms are available, which has helped greatly in characterizing proteomes in model animal systems and in humans. Through proteomics, global profiles of different disease states can be characterized (e.g., changes in protein types and relative levels as well as post-translational modifications such as glycosylation or phosphorylation). Although intracellular proteomics can provide a broad overview of the physiology of cells and tissues, it has been difficult to quantify the low-abundance proteins, which can be important for understanding diseased states and treatment progression. For this reason, there is increasing interest in coupling comparative proteomics methods with subcellular fractionation and enrichment techniques for membranes, the nucleus, the phosphoproteome, and the glycoproteome, as well as low-abundance serum proteins. In this review, we will provide examples of where the utilization of different proteomics-coupled enrichment techniques has aided target and biomarker discovery, understanding of the targeting mechanism, and mAb discovery. Taken together, these improvements will help to provide a better understanding of the pathophysiology of various diseases, including cancer, autoimmunity, inflammation, and cardiovascular and neurological conditions, and will aid the design and development of better medicines for treating these afflictions.

Introduction

With enhancements in proteomics techniques, there has been a surge in world-wide efforts aimed at large-scale protein analysis of biological samples. A global human proteome project is now underway to facilitate understanding of the relationship between physiological changes in organisms and protein abundance, subcellular localization, and functions in different tissues and environments [1]. Indeed, a large number of biomarkers for different diseases, including cancer, have been discovered in the past two decades using mass spectrometry-based proteomics approaches [2,3]. Proteomics has proven to be useful in cancer research because tumorigenesis, cancer progression, tumor relapse, and metastasis may involve complex protein networks. The recent application of proteomics in cancer-related research is evident from the steadily increasing number of publications from 2000 to 2010, followed by sustained numbers through 2013, as shown in Figure 1 (http://www.ncbi.nlm.nih.gov/pubmed).


Figure 1. Progression of OMICS applications. Number of PubMed publications per year from 2000 to 2013 containing the keywords “All kind of Omics and Cancer”, “Genomics and Transcriptomics and Cancer”, “Proteomics and Cancer”, “Metabolomics and Cancer”, “Glycomics and Cancer”, and “Lipidomics and Cancer”.
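Yearly publication counts of the kind summarized in Figure 1 could be retrieved programmatically from PubMed. The sketch below only builds request URLs for the NCBI E-utilities esearch endpoint and makes no network call. The search terms are assumptions modeled on the keywords in the caption, since the exact queries used for the figure are not given, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumed search terms, modeled on the keywords listed in the caption.
QUERIES = {
    "Proteomics": "proteomics AND cancer",
    "Metabolomics": "metabolomics AND cancer",
    "Glycomics": "glycomics AND cancer",
    "Lipidomics": "lipidomics AND cancer",
}


def build_count_url(term, year):
    """Build an esearch URL that asks only for the number of PubMed
    records matching `term` with a publication date in `year`."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",   # restrict by publication date
        "mindate": str(year),
        "maxdate": str(year),
        "rettype": "count",   # return the hit count only
    }
    return ESEARCH_URL + "?" + urlencode(params)


# One URL per keyword set and year, 2000 through 2013 inclusive.
urls = {
    (label, year): build_count_url(term, year)
    for label, term in QUERIES.items()
    for year in range(2000, 2014)
}
print(urls[("Proteomics", 2000)])
```

Each URL can then be fetched with any HTTP client and the count read from the returned XML; NCBI's usage guidelines on request rates should be respected when doing so.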

Genomics and transcriptomics have seen accelerated application to cancer research, owing to improvements in technology. As seen in Figure 1, proteomics is still applied far more often than other omics technologies, such as metabolomics, lipidomics, and glycomics, but is not yet as mature as genomics and transcriptomics. Recent advancements in mass spectrometry technologies will hopefully accelerate the application of proteomics in the coming years.

The number of publications has increased over the past 10 years in part because of the many areas and opportunities available, as illustrated in Figure 2.

Figure 2. Applications of mass spectrometry proteomics methodologies include understanding mechanisms of drug action, identifying novel biomarkers, elucidating drug targets, and monoclonal antibody discovery.

Mass spectrometry-based proteomics techniques involving different ionization methods, such as EI (electron ionization), ESI (electrospray ionization), and MALDI (matrix-assisted laser desorption/ionization), with quantification power can now be used to understand the mechanism of action of various drugs and to discover novel biomarkers, drug targets, and monoclonal antibodies. Within proteomics research, there is a growing focus on specific categories of proteins in order to elucidate changes in the levels of proteins that may be present at relatively low abundance. For example, secreted proteins, also referred to as the secretome, can play important roles in cell-cell communication and growth and, as they can reflect the various stages of pathophysiological conditions [4], represent a useful source of biomarkers and potential targets to treat disease. Various proteomics approaches have been used to discover biomarkers in blood/plasma, body fluids and tissues [5,6], exosomes [7], and conditioned media from cultured and primary cells in vitro [8].

In addition to the secretome, post-translational modifications including glycosylation [9,10] and phosphorylation represent another subclass of targets since they are involved in regulating biological processes. Dysregulation of kinase signaling pathways is commonly associated with various cancers [11] and gastrointestinal stromal tumors [12]. Phosphoproteomic approaches can also assist in identifying appropriate therapeutic targets as well as elucidating drug mechanisms in pathophysiological pathways. In addition to mass spectrometry [13], other high-throughput proteomic technologies have been developed for studying phosphorylation, including peptide arrays [14,15], reverse-phase protein microarrays [16], and antibody arrays [17].

Here we will explore the multiple identification and enrichment proteomics methods used for target discovery, biomarker discovery and understanding the mechanism of action of drugs, as shown in Figure 2.

1. Target Discovery

Comparative quantification of low-abundance and complex proteins is becoming an attractive field, since it can provide novel drug targets for therapeutic intervention, and many new methods are being implemented in this field. As shown in Figure 3, the serum depletion method reduces the complexity of a sample by removing high-abundance proteins, thereby enriching the low-abundance proteins. Low-abundance phosphopeptides can be enriched using titanium dioxide (TiO2) incubation owing to the high affinity of TiO2 for the negatively charged phosphopeptides. Similarly, hydrazide chemistry can be used to capture and enrich glycopeptides, and binding to cationic nanoparticles such as colloidal silica can highly enrich low-abundance membrane proteins.

Figure 3. Different techniques used in proteome profiling: (a) serum depletion method, (b) titanium dioxide incubation for phosphopeptide isolation, (c) glycopeptide capture, (d) colloidal silica coupling for membrane proteins, (e) exosome isolation, and (f) secretome isolation.


In particular, development of an effective strategy for membrane protein analysis is highly relevant to diseases where cell surface receptor signaling plays a vital role. For example, cell surface proteins have attracted substantial interest from cancer researchers in areas ranging from molecular diagnosis to therapy.

However, comparison of membrane proteins from normal and diseased tissues is difficult when using standard extraction methods, which also isolate whole-cell components. For this reason, recent advances in coupling proteomics methods with membrane protein enrichment protocols [18,19,20] are especially useful for novel target discovery.

1.1. Membrane Proteomics

Membrane proteins play important roles in various diseases and can be potential drug targets. Various methods, as discussed below and in Table 1, have recently been developed which could potentially overcome challenges in studying membrane proteomics.

Table 1. A summary of various methods used in membrane proteome enrichment

Membrane enrichment method   Principle                                                            Membrane proteins percentage   Transmembrane protein percentage based on GO annotation   Ref.
Colloidal silica             Electrostatic coupling between plasma membrane and cationic silica   65%                            44% (UniProt and Human Protein Reference database)        [50]
Biotinylation                Affinity between biotin and avidin homologous proteins               51%                            NA                                                        [25]
Ultracentrifugation          Separation based on density                                          70%                            NA                                                        [22]
Glycoproteomics              Chemical coupling                                                    41%                            34% (KEGG pathways database)                              [45]

1.1.1. Ultracentrifugation techniques for membrane proteome enrichment

Membrane proteins can be enriched by the removal of cytosolic proteins under high pH conditions using ultracentrifugation. For example, the membrane proteomes of post-mortem brain tissues from normal and Alzheimer’s disease cases were compared using two simultaneous ultracentrifugation procedures. The simultaneous centrifugation and treatment with high pH removed the loosely associated proteins to improve the membrane proteome quality [21].

In another application, proteome profiles of plasma membrane-enriched microdomains (MD) involved in cell signaling, transport, and proliferation from human renal cell carcinoma (RCC) and adjacent normal kidney (ANK) were compared using gradient ultracentrifugation [22]. Ninety-three proteins were identified in MD isolated from RCC and 98 proteins in MD isolated from ANK. Analysis indicated differentially expressed proteins such as Thy1, VDAC, and DPEP, which may potentially represent biomarkers of cancer progression [22]. In another study, the apical plasma membrane proteome of the multinuclear syncytiotrophoblast was enriched using ultracentrifugation and sucrose gradient centrifugation in conjunction with 1-D SDS-PAGE and ESI-MS [23]. Two hundred ninety-six non-redundant proteins were identified, of which about 60% were integral and peripheral membrane proteins [23].

1.1.2. Biotinylation coupled proteomics for cell surface proteome enrichment

Biotinylation of whole cells can be used to enrich low-abundance cell surface proteins by taking advantage of the very high affinity of avidin, streptavidin, or neutravidin for biotin. This method works by conjugation of multiple biotin molecules to target proteins of interest to form a biotin complex. Surface biotinylation with reactive chemical derivatives of biotin followed by purification on avidin beads allows selective cell surface proteome profiling. The method often involves resuspending a whole cell mixture in a suitable lysis buffer followed by a reaction between a solution of a biotin derivative such as sulfo-NHS-SS-biotin and the solubilized cell surface proteins. After the capture of the biotin complex on streptavidin- or avidin-based supports, the biotinylated proteins can be released by cleavage with reducing reagents [24]. Combining cell-surface protein enrichment using biotinylation with isobaric tags for relative and absolute quantitation (iTRAQ) technology can be used to detect differentially expressed proteins in endometrial cancer and healthy cells [25]. Out of 272 overexpressed proteins, overexpression of bone marrow stromal antigen 2 (BST2) on cancer cells was investigated further. Administering a monoclonal antibody targeting BST2 (anti-BST2) resulted in growth reduction of BST2-positive endometrial cancer cells in SCID mice, identifying BST2 as a possible therapeutic target [25]. Biotinylation also aids purification of low-abundance plasma membrane proteins in a workflow using microsomal fractionation, chemical cross-linking, solubilization, and a single-step affinity purification using magnetic streptavidin beads, followed by protein identification using mass spectrometry [26].

1.1.3. Glycoproteome enrichment

Post-translational modifications including glycosylation offer high diagnostic and therapeutic potential, and thus glycoproteomics is another area for targeted analysis [27,28,29,30,31,32,33,34,35,36,37]. However, efficient analysis of the complex glycoproteome, which is present over wide concentration ranges in complex biological media, remains challenging because of the severe masking effects of highly abundant serological proteins and alterations in the stoichiometric ratios of glycosylated forms of proteins. To overcome these obstacles, various enrichment strategies for glycoproteins have been developed [38]. As glycoproteins are localized on the cell surface or are secreted, N-glycoproteome enrichment methods facilitate identification and expression comparisons of membrane proteins. There are multiple glycoproteome enrichment methodologies, some of which are:

1) Gravity-flow columns use a glycoprotein enrichment resin (such as a phenylboronic acid based resin containing a ligand attached to agarose beads, which binds to sugar residues of glycoprotein molecules [39]) to capture the glycoproteins.

2) Cell surface capture (CSC) involves treating glycoproteins of the plasma membrane by oxidizing cells with sodium-meta-periodate followed by covalent chemical labeling with biocytin hydrazide (BH), which is a biotin containing hydrazide. The BH-labeled glycopeptides can then be analyzed by LC-MS/MS for peptide identification [40].

3) For enrichment of N-linked glycopeptides, solid phase extraction of N-linked glycopeptides (SPEG) is highly preferred for both identification and quantification [41]. The SPEG method involves proteolysis of proteins into peptides followed by oxidation of glycosylated peptides and coupling to hydrazide beads, which allows removal of non-glycosylated peptides. The amino-termini of glycopeptides can be labeled with succinic anhydride or other reagents, followed by treatment with peptide-N-glycosidase F (PNGase F) to release N-linked glycopeptides, which can be identified and quantified by mass spectrometry [42].

A highly selective glycopeptide enrichment method combining several of these glycoproteomics methods, including cell surface capture and hydrazide derivatization, has been applied to isolate cell surface proteins [43]. This method has several advantages over other methods as it provides complete solubilization of membrane proteins, enriches glycopeptides instead of glycoproteins, which eliminates potential steric hindrance during capture, ensures robustness and tolerance to stringent washes through hydrazide chemistry, and offers lower sample loss and shorter sample processing times [43]. Processing steps include cell lysis, microsomal fractionation, denaturation and digestion, glycopeptide capture, LC-MS, and bioinformatics analysis. The success of this method, particularly suited to in vitro cell cultures, relies on complete digestion of the samples [43].

In another study, a modified cell surface capture strategy was used by oxidizing the oligosaccharides on cell surface proteins followed by cell lysis and coupling oxidized proteins onto hydrazide resin followed by mass spectrometry analysis for quantification of glycoproteins and other cell surface proteins [44]. This strategy was tested on two biological replicates of Chang Liver (CL) and HepG2 human cell lines. Out of 341 identified glycoproteins, 33 exhibited significant changes in expression between the two cell lines. Western blot was used to validate the higher expression of extracellular matrix metalloproteinase inducer (EMMPRIN) and basal cell adhesion molecule (BCAM) in HepG2 cells, both of which are associated with the malignant potential of tumor and liver cancer [44].

Recently, a detailed large-scale LC-MS/MS data set on the membrane proteome and N-glycoproteome of the BV-2 microglia line was reported [45]. This study used multiple strategies employing crude membrane fractionation, filter-aided sample preparation (FASP)-based differential sample preparation, and N-glyco-FASP-based glycopeptide enrichment. A total of 6928 unique protein groups, 1450 unique N-glycosites, and 760 unique glycoproteins were identified [45]. Among the identified glycoproteins, it was found that receptors, transporters, peptidases, and ion binding proteins were enriched [45].

Quantitative glycoproteomics based on hydrazide chemistry was used to study cell surface and serum proteins in human serum and a prostate cancer epithelial cell line [46]. This method was found to be useful in reducing serum sample complexity, with detection of 2.5 peptides per protein on average. For the prostate cancer epithelial cell line, this method could identify proteins such as HSPG2 and SSRA, which had not previously been identified by the 2-DE gel method [46].

Targeted proteomics approaches involving galactose oxidase and aniline-catalyzed ligation (GAL) and periodate oxidation and aniline-catalyzed oxime ligation (PAL) were used for chemical tagging to identify glycosylation sites of proteins on cells with an altered sialylation status [47]. Immobilized streptavidin was used to pull down the PAL- and GAL-labeled, biotinylated glycoproteome, and the glycopeptides were released by N-glycosidase treatment for analysis by LC/MS/MS [47]. One hundred and eight non-redundant glycoproteins were identified with both methods combined [47].

1.1.4. Enrichment of plasma membrane proteome with cationic nanoparticles

Isolation and study of the plasma membrane proteome is especially challenging due to the low abundance and hydrophobicity of the proteins. In addition, a variety of post-translational and chemical modifications cause a dynamic population on the cell surface. Cationic nanoparticles synthesized from Fe3O4 (magnetite) coated with Al2O3 were used to enrich the low-abundance plasma membrane proteome of human multiple myeloma RPMI 8226 cells for analysis of integral proteins and proteins from both the inner and outer surfaces [48]. Aluminosilicate particles of Al2O3-coated SiO2 pellicles were used for comparison purposes. Plasma membrane proteome enrichment provided by the Fe3O4/Al2O3 pellicles was statistically significant, and all three pellicles showed much stronger enrichment compared to whole cell lysate, with proteins classified by UniProt annotation and the web-based TMHMM tool [48]. Plasma membrane proteins represented 27.6% of the sample detected by the Fe3O4/Al2O3 pellicle method and 12% were transmembrane proteins, the highest among the methods compared in that study [48].

Colloidal silica coupling relies on interactions between the negatively charged membranes and cationic colloidal particles. The method involves washing the cells with ice-cold 2-(N-morpholino)ethanesulfonic acid (Mes)-buffered saline followed by adding an ice-cold silica bead solution and incubating (for adherent cells) or gentle rocking (for suspension cells) prior to cell lysis, density gradient centrifugation, and protein purification [49].

Cationic colloidal silica (CCS) particle-based proteomics techniques have been utilized for the isolation of plasma membranes from intact syncytiotrophoblast (STB) of human placenta [50], resulting in detection of 340 non-redundant proteins in the apical plasma membrane fraction. Proteins not previously known to be in the plasma membrane in the STB of human placenta include proteins (11, 14, Ib, Id, Ie), Nicalin, Flotillin-1, and receptor expression enhancing protein 5 [50].

A cationic silica-polyanion (CSP) strategy was used to enrich plasma membrane from freshly isolated C57 mouse hepatocytes [51]. CSP was shown to provide a better yield compared with the biotin-avidin (BA) method. The CSP method yielded 185 non-redundant proteins while BA isolated 49 non-redundant proteins, with 45 proteins identified by both methods [51].
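The overlap between the two methods can be made explicit with simple set arithmetic. The following minimal sketch uses only the three counts given above (185, 49, and 45); the variable names are illustrative and are not taken from the cited study.

```python
# Counts reported for mouse hepatocyte plasma membrane enrichment [51].
csp_total = 185   # non-redundant proteins identified by the CSP method
ba_total = 49     # non-redundant proteins identified by the BA method
shared = 45       # proteins identified by both methods

csp_only = csp_total - shared            # proteins found only by CSP
ba_only = ba_total - shared              # proteins found only by BA
union = csp_total + ba_total - shared    # distinct proteins across both methods

# Fraction of the BA identifications that CSP also recovered.
ba_recovered_by_csp = shared / ba_total

print(f"CSP only: {csp_only}, BA only: {ba_only}, combined: {union}")
print(f"BA proteins also found by CSP: {ba_recovered_by_csp:.1%}")
```

Under these counts, nearly all proteins isolated by BA were also recovered by CSP, which is consistent with the better yield reported for the CSP method.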

A density perturbation technique modified from the silica nanoparticles sub-fraction strategy for in vivo membranes in combination with mass spectrometry was used to identify the normal surface proteome of liver sinusoidal endothelial cells (LSEC) [52]. The proteins were separated by SDS-PAGE and identified by LC-MS/MS (GeLC-MS/MS). A total of 837 different proteins were found, including 450 of membrane origin in addition to contaminants from mitochondria, ribosomes, and the nucleus [52].

Despite the tremendous progress and potential use of proteomics for target discovery, there are serious challenges in the biopharmaceutical area for novel drug development. A lack of coordinated efforts to follow up on the proteomic discovery of novel therapeutic targets largely accounts for the current gap between the new targets emerging from numerous proteomics studies of cancer and the drug development (design, synthesis, and optimization) phases. Most of the potential candidates identified in proteomic analyses have not been validated for their specific roles in diseases and their potential use in clinics. Functional studies involving in vitro manipulations of gene expression using specific pharmacological inhibitors, antisense RNA, RNA interference, or gene knockout experiments should be an integral part of a proteomic study for target discovery. Clinical validations of candidate biomarkers should also be included in retrospective and prospective studies. Future investigations should place more emphasis on functional studies and on the clinical validation of novel targets so that they can be translated into clinical trials with greater speed and higher success rates. Proteomics is also expected to play a major role in the preclinical and clinical research of targeted and combinatorial therapies. Recent advancements in targeted proteomics approaches such as PSAQ (Protein Standard Absolute Quantification) and MRM (Multiple Reaction Monitoring) will hopefully accelerate both the validation and the use of proteomics targets and biomarkers in drug discovery, development, and clinical studies.

2. Biomarker Discovery

Identification of both cell surface and serum markers is important for diagnostic and prognostic purposes when treating a disease. With the development of improved mass spectrometric techniques [53,54,55,56,57,58], cell surface and secreted proteins are more accessible for characterization. Analysis of cell surface proteins is often made difficult by their low abundance; however, the recent application of plasma membrane enrichment [59] and glycoproteome enrichment [42,60,61] techniques has been helpful for the identification of greater numbers of potential biomarkers.

Serum biomarker discovery can be enhanced by the removal of high abundance proteins, such as albumin and immunoglobulins, so that low abundance proteins can be detected. One approach for serum depletion is to use commercially available kits such as Multiple Affinity Removal System (MARS; Agilent, Santa Clara, CA) spin columns or the Amersham (GE Healthcare, Piscataway, NJ) albumin and IgG removal kit [62].

Holewinski et al. tested a commercially available affinity capture reagent from Protea Biosciences and compared its efficiency and reproducibility to four other commercially available albumin depletion methods [63]. Two methods showed an albumin depletion efficiency of more than 97% for both serum and cerebrospinal fluid (CSF). Using LC-MS/MS analyses, they subsequently found 45 novel proteins in serum and also showed that albumin isolation from serum can eliminate the need for fractionation, which reduces the sample processing time [63].

Another study reported a method that coupled denaturing ultrafiltration with reverse-phase fractionation and mass spectrometry for characterization of low-molecular-weight proteins and peptides in serum and plasma samples [64]. The enriched peptides were analyzed by reverse-phase liquid chromatography combined with MALDI-MS/MS characterization of the analytes, resulting in identification of 250 native peptides from 50 different proteins [64].

A study exploring protein biomarkers in serum from rheumatoid arthritis (RA) patients treated with infliximab, an anti-tumor necrosis factor monoclonal antibody, employed a quantitative proteomics approach using 8-plex iTRAQ labeling [65]. A combination of depletion of the most abundant serum proteins, two-dimensional LC fractionation, protein identification, and relative quantification with a hybrid Orbitrap mass spectrometer was used to identify 235 proteins with high confidence [65]. Fourteen proteins that were significantly more abundant in non-responder patients than in responder patients were identified as potential biomarkers. Some of the proteins showing significant changes and thus representing potential biomarkers were C4B-alpha chain, complement factor H-related protein 4, mannan-binding lectin serine protease 2, and inter alpha trypsin inhibitor heavy chain H1 and H2 [65].

A plasma proteomics approach employing the 8-plex iTRAQ technique was used to study the association between nutrients and proteins in the plasma of 500 subjects after immunodepletion of six high abundance proteins, including albumin and transferrin, using an affinity removal liquid chromatography column [66]. More than 4700 non-redundant proteins were identified, including more than 450 proteins identified as extracellular, secreted, membrane, or lipoprotein-associated. A strong correlation between vitamin A and retinol binding protein 4 (RBP4) was observed. Overall, the method demonstrated an approach for elucidating and quantifying protein biomarkers related to the nutrition environment [66].

In another study, Miltenyi Biotec's MACS LS column was used for proteomic analysis of the human prostate cancer cell line DU145 in order to identify potential biomarkers [67]. The DU145 prostate cancer cell line was separated into CD44+ and CD44- cells using MACS, and subsequent analysis of the differential expression of proteins between CD44+ and CD44- cells was carried out using two-dimensional gel electrophoresis and LC-MS/MS. The proteins cofilin and annexin A5, associated with proliferation or metastasis in cancer, were found to be positively correlated with CD44 expression [67].

Development of glycoproteome enrichment techniques has been very fruitful for biomarker discovery through comparison of biologically significant glycoproteins between diseased and normal serum samples; however, there is still much work to do. Although traditional proteomics identification strategies can provide information on novel sites of glycosylation, software and bioinformatics methods still need to be developed to improve glycan composition and structural analysis. In addition, to realize effective biomarker screening, glycosylation site occupancy must be quantified accurately in both relative and absolute terms, and the relationship between changes in glycosylation site occupancy and protein expression must be analyzed. With the progress of glycoproteomics research, we believe that the growth in the development of methods for high-throughput glycoproteomics will shed new light on the discovery of new biomarkers.

3. Understanding the mechanism of action of drugs and disease

A substantial number of biological drugs and monoclonal antibodies affect signaling and phosphorylation events where they engage cell surface receptors. For this reason, the identification and quantification of protein phosphorylation sites are important areas of study in the treatment of cancer and other diseases, such as diabetes, with biologic drugs. For example, identifying phosphotyrosine site occupancy can help in the diagnosis of these diseases [68,69,70,71,72] and provide a better understanding of the behavior of target cells upon treatment with drugs [73]. Accordingly, a variety of phosphoproteomics techniques such as anti-phosphotyrosine affinity chromatography, immobilized metal affinity chromatography, and TiO2 enrichment columns [70,74] have been used to study changes in phospho-peptide and phospho-protein states in cells.

Protein phosphorylation facilitates information transfer in cells, and changes in phosphorylation can contribute to the pathophysiology of diseases including cancer, inflammation, and metabolic disorders [11,75,76,77]. One study investigated phosphorylation of serine, threonine, tyrosine, and histidine residues by protein kinases and reported a strategy using peptide arrays and motif-specific antibodies to identify and characterize sequences for protein kinase A (PKA) [78]. It was found that protein kinase D and microtubule-associated protein-regulating kinase 3 can both be regulated by PKA. The study also showed that the adaptor protein RIL is a PKA substrate that is phosphorylated on serine, which was predicted to regulate cell growth [78].
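As a computational complement to motif-based substrate characterization, candidate PKA sites can be flagged by scanning protein sequences for the commonly cited PKA consensus motif Arg-Arg-X-Ser/Thr. The sketch below is only an illustration of such a scan and is not the peptide array and antibody strategy used in the cited study [78]; the function name and the negative-control sequence are hypothetical, and a motif match by itself does not demonstrate phosphorylation.

```python
import re

# Commonly cited PKA consensus motif: Arg-Arg-X-Ser/Thr (R-R-x-[S/T]).
# The lookahead allows overlapping motif occurrences to be reported.
PKA_MOTIF = re.compile(r"(?=(RR.[ST]))")


def find_pka_sites(sequence):
    """Return (position, motif) pairs for each R-R-x-[S/T] match.

    The position is the 1-based index of the candidate Ser/Thr residue,
    which is the fourth residue of the matched motif.
    """
    sites = []
    for match in PKA_MOTIF.finditer(sequence.upper()):
        sites.append((match.start() + 4, match.group(1)))
    return sites


# Kemptide (LRRASLG) is a classic synthetic PKA substrate; the second
# sequence is a made-up negative control with no R-R-x-[S/T] motif.
for seq in ("LRRASLG", "MKLAVSGT"):
    print(seq, find_pka_sites(seq))
```

A scan of this kind only narrows the list of candidates; experimental evidence such as the approaches described above is still required to confirm a substrate.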

In order to understand the relationship between pancreatic cancer and alterations in proteomics and signaling pathways, a SILAC-based quantitative proteomics approach was employed to compare cells from three sites of metastasis (liver, lung, and peritoneum) in order to study the proteome and tyrosine phosphoproteome of these tumors [79]. Approximately 42% of the proteome was found to be highly variable when compared across any two sites of metastasis in terms of receptor and activity, suggesting that regulation of these tumors was different depending upon anatomical location [79].

Using a computational approach, in silico motif sequences were generated and compared with an experimental database, the Human Protein Reference Database (HPRD). Two hundred seven phosphotyrosine and 960 phosphoserine/threonine motifs were determined by this method [80]. HPRD [81,82] and Human Proteinpedia [83,84] are centralized resources for phosphorylation site analyses and can be applied in a systems biology approach to determine the role of protein phosphorylation in protein function, cell signaling, and biological processes, and their implications in human diseases [85]. HPRD contains experimental human phosphoproteome data from various methods, including 32P-labeled ATP followed by SDS-PAGE or high performance liquid chromatography followed by Edman sequencing to determine the site of phosphorylation [86]; phosphoproteome enrichment and LC-MS [68,87,88,89,90,91,92]; high throughput analysis such as SILAC combined with titanium dioxide chromatography or immobilized metal ion affinity chromatography microspheres (IMAC) for phosphoproteome enrichment [93,94,95,96]; and the use of pTyr-100 and 4G10 phosphotyrosine antibodies to enrich the phosphotyrosine proteome [97]. More work is needed on quantification of the phosphoproteome, as current work is mostly focused on identifying phosphosites.

4. Exosomal proteome profiling

With the improvement in exosome purification strategies together with mass spectrometry based proteomics tools with higher sensitivity and mass accuracy, exosomal profiling has proven to be a useful technique in studying disease states including cancer [98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115], AIDS [116,117], Leishmaniasis [118], age-related macular degeneration [119], and neurodegenerative diseases [120]. Interestingly, the exosomal proteome comprises proteins found in both the membrane and the cytosol as well as proteins with a distinct function in different cells. Proteomics has been very useful in cataloging proteins in exosomes (extracellular membrane vesicles), which are versatile mediators of intercellular communication and pathogenesis and might be useful for drug and gene delivery to a variety of tissues including cerebrospinal fluid [121], hepatocytes [122], and glioblastomas [123].

Exosomes derived from several types of tumor cells such as breast adenocarcinoma [124], colorectal cancer [125,126], mammary adenocarcinoma [127], melanoma [128,129], mesothelioma [130], and brain tumor [99,131] have been characterized by isolation strategies including differential centrifugation, filtration, sucrose density gradient, and immunobeads, combined with mass spectrometry-based proteomic strategies such as one- and two-dimensional liquid chromatography-tandem mass spectrometry and MALDI-TOF/TOF mass spectrometry [132]. A comprehensive review of studies that have used proteomic methods to characterize exosomes derived from in vitro sources and biological fluids has been previously reported [133].

Various methods for exosome isolation [134] include:

1) Ultracentrifugation: Requires centrifugation at 100,000 g followed by resuspension in PBS and recentrifugation.

2) Size exclusion chromatography: Involves application of samples to a 2% agarose based gel column and eluting isocratically with PBS.

3) Magnetic beads: Requires adsorption to anti-EpCAM antibodies coupled to magnetic microbeads followed by magnetic separation, which results in the exosomes attached to the magnetic beads. The magnetic beads can be washed with TBS and exosomes can be extracted using a Trizol extraction protocol.

ExoQuick precipitation is another widely used exosome isolation method, which involves addition of ExoQuick precipitation solution and centrifugation. After the supernatant is aspirated, the exosomal pellet can be extracted using Trizol extraction procedures. While the mechanism of action of ExoQuick is not disclosed by the manufacturer, it is commonly used by researchers [135] and is reported to provide higher yield and purity when compared to other methods [136].

In one study, ultracentrifugation and filtration techniques were used for isolating exosomes from human neuroblastoma cell lines for proteomic analysis [137]. A discrete set of molecules including tetraspanins, prominin-1 (CD133), and basigin (CD147) were found to be expressed in the exosomes which suggests the important role of exosomes in the modulation of the tumor microenvironment and indicates their potential utilization as a tumor biomarker [137].

An ultracentrifugation technique was used to explore urinary exosomes in Zucker diabetic fatty (ZDF) rats to study diabetic nephropathy and related renal disease [138]. Two hundred eighty-six identified proteins comprised mainly membrane proteins, which are mostly associated with functions such as transport, cellular communication, and cellular adhesion. A renal protein, Xaa-Pro dipeptidase, which is linked with collagen breakdown, was found to be upregulated and major urinary protein 1 (MUP1) was downregulated. These differentially regulated proteins might serve as biomarkers for understanding metabolic changes during progression of diabetes in obese diabetic rat models [138]. In another study, the proteome of exosomes shed by myeloid-derived suppressor cells (MDSC) was explored using ultracentrifugation [139]. Under increased inflammatory response, GTP and ATP binding proteins (ATP-citrate synthase, ADP-ribosylation factor 1) were found to be in high abundance by mass spectrometry analysis. Additionally, the pro-inflammatory proteins S100A8 and S100A9 were found to be abundant in MDSC-derived exosomes [139].

One of the challenges in isolating low-density exosomes by ultracentrifugation is variability between runs. Although relatively large starting volumes can be used [140], this still poses challenges for actual clinical samples, which are low in volume. Exosomes can also be trapped on antibody-coated magnetic beads, which has the advantage of allowing immunoblotting, flow cytometry, and electron microscopy analysis of bead-exosome complexes, but this method is not preferred for large sample volumes [141]. Size-exclusion chromatography, ultracentrifugation, and ExoQuick precipitation are not useful for selectively isolating particular exosome populations, such as tumor-derived exosomes, which require immunoaffinity-based approaches [136]. In coming years, improvements in mass spectrometry-based proteomics and in exosome isolation and enrichment methods can shed more light on the fundamental signaling pathways involved in different types of diseases.


5. Secretome profiling

The secretome can be isolated from serum-starved cells, or SILAC labeling can be used to quantitatively compare the secreted proteins of different cells [142]. A study on the secretome of differentiating primary adipocytes resulted in identification of 420 differentially expressed proteins including collagen triple helix repeat containing 1, cytokine receptor-like factor 1, glypican-1, hepatoma-derived growth factor, SPARC related modular calcium binding protein 1, SPOCK 1, and sushi repeat-containing protein [143].

In another report, a SILAC-based quantitative proteomics approach was employed to elucidate the differences in the secretome of neoplastic and non-neoplastic gastric epithelial cells [142]. Out of 263 overexpressed proteins in cancer-derived cell lines, three novel protein candidates, proprotein convertase subtilisin/kexin type 9 (PCSK9) with 13-fold overexpression, lectin mannose binding 2 (LMAN2) with 6-fold overexpression, and PDGFA associated protein 1 (PDAP1) with 5.2-fold overexpression, could represent possible biomarkers for the progression of gastric cancer [142].
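Fold changes from SILAC experiments are often compared on a log2 scale, where a value of 0 indicates no change between conditions. The sketch below converts the three fold changes reported above to log2 ratios; the cutoff of 2-fold is a hypothetical value chosen for illustration, since the threshold used in the cited study [142] is not stated here.

```python
import math

# Fold changes (neoplastic relative to non-neoplastic) reported in the text [142].
reported_fold_changes = {"PCSK9": 13.0, "LMAN2": 6.0, "PDAP1": 5.2}

# Hypothetical cutoff for calling a protein overexpressed (assumption).
FOLD_CUTOFF = 2.0


def log2_ratio(fold_change):
    """Convert a SILAC intensity ratio to log2 scale (0 means no change)."""
    return math.log2(fold_change)


# Report candidates from the largest to the smallest fold change.
ranked = sorted(reported_fold_changes.items(), key=lambda item: item[1], reverse=True)
for protein, fold in ranked:
    status = "overexpressed" if fold >= FOLD_CUTOFF else "unchanged"
    print(f"{protein}: {fold:g}-fold, log2 ratio {log2_ratio(fold):.2f} ({status})")
```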

6. Monoclonal Antibody (mAb) Discovery

Proteomics methods are now being extensively incorporated into the mAb discovery process as a supplement to traditional hybridoma and phage display technologies. Indeed, a proteomics approach based on LC-MS/MS for identifying an antigen-specific antibody from sera and B-cell sources of immunized rabbits and mice was coupled with next-generation sequencing approaches [144]. This approach coupled proteomics with enrichment of polyclonal antibodies via an affinity purification method. The enriched polyclonal antibodies were subsequently analyzed by LC-MS/MS. This combined approach provided identification of 4 rabbit antibodies and 1 mouse antibody [144]. Wine et al. also used antigen-affinity chromatography for the enrichment of antibodies, followed by liquid chromatography-high resolution tandem mass spectrometry, to determine the antigen-specific antibody composition of a polyclonal serum response after immunization [145].

7. Bioinformatics and Proteomics

With growing data volumes generated by proteomics studies, software demands and opportunities are increasing to ease various aspects of data analysis including: a) compatibility with MS technology to support the data format and fragmentation patterns; b) identification power of peptides and post-translationally modified peptides; c) de novo sequencing; d) use of experimental peptide libraries; e) high speed and quality of results; and f) cost effectiveness and user-friendliness. Some of the most commonly used search engines for peptide and protein identification from MS-based data are Mascot [146], X!Tandem [147], SEQUEST [148], MyriMatch [149], and TagRecon [150]. For instance, TagRecon provides the opportunity to identify proteins even with unexpected mutations and post-translational modifications, which are commonly observed in disease states [151]. After data analysis, results visualization is an important step towards interpretation of the results. Most large software packages offer a variety of visual ways of displaying the data input and output. For example, spectromania [152] allows 2D, stacked chromatographic display of input spectra. It also allows a set of spectra from different packages to be merged into new and averaged spectra (with less noise). Other such visualization tools are MSight [153] and MassView [154].

A variety of bioinformatics tools are also available for biological interpretation of these high-throughput data. Tools such as KEGG (Kyoto Encyclopedia of Genes and Genomes) [155,156], IPA (Ingenuity Pathway Analysis, www.ingenuity.com), GO (Gene Ontology) [157], Cytoscape [158], and DAVID (Database for Annotation, Visualization and Integrated Discovery) [159,160] are commonly used for interpretation of quantitative proteomics data. In addition to these, various publicly available software tools address the localization of proteins in a cell. For instance, TMHMM (http://www.cbs.dtu.dk/services/TMHMM-2.0/), developed by the Center for Biological Sequence Analysis in Denmark, uses a hidden Markov model to predict transmembrane helices and thereby confirm membrane proteins [161]. Phobius (http://phobius.sbc.su.se/) is similar software developed at the Center for Genomics and Bioinformatics at the Karolinska Institute in Stockholm, Sweden [162]; it predicts signal peptides and transmembrane regions of a protein sequence. In addition to Phobius, SignalP and TargetP are two other bioinformatics tools that predict the presence of signal peptides in a protein sequence. SignalP (http://www.cbs.dtu.dk/services/SignalP/) was developed at the Center for Biological Sequence Analysis in Denmark and is based on neural network algorithms. TargetP (http://www.cbs.dtu.dk/services/TargetP/) predicts signal peptides in protein sequences as well as their subcellular location with a success rate of 90% [163]. For finding secreted proteins, the SecretomeP 2.0 server (http://www.cbs.dtu.dk/services/SecretomeP/) uses neural networks to identify proteins secreted by the non-classical pathway, i.e., proteins not containing an N-terminal signal peptide [164]. WoLF PSORT (http://wolfpsort.org/) predicts protein subcellular localization using PSORT features [165,166]. The Secreted Protein Database (SPD, http://spd.cbi.pku.edu.cn/) contains secreted proteins from human, mouse, and rat proteomes, including sequences from SwissProt, TrEMBL, Ensembl, and RefSeq [167].
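The outputs of several of these predictors can be combined into a simple decision rule for assigning a coarse localization label to each identified protein. The sketch below assumes that the predictions have already been obtained and parsed into one record per protein; the record layout, the label names, the example proteins, and the score cutoff are all hypothetical and are not part of any of the tools named above. It is a simplification: for example, a signal peptide can be mistaken for a transmembrane helix, so real workflows usually reconcile the predictors more carefully.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Pre-parsed prediction results for one protein (hypothetical layout)."""
    protein_id: str
    has_signal_peptide: bool   # e.g. taken from SignalP or Phobius output
    tm_helices: int            # e.g. predicted helix count from TMHMM
    secretomep_score: float    # e.g. neural network score from SecretomeP


# Illustrative cutoff only (assumption); the SecretomeP documentation should
# be consulted for the value recommended for the organism under study.
SECRETOMEP_CUTOFF = 0.5


def classify(p: Prediction) -> str:
    """Assign a coarse localization label from the combined predictions."""
    if p.tm_helices > 0:
        return "membrane"
    if p.has_signal_peptide:
        return "classically secreted"
    if p.secretomep_score >= SECRETOMEP_CUTOFF:
        return "non-classically secreted"
    return "intracellular/other"


# Made-up example records, one for each outcome of the rule.
examples = [
    Prediction("protein_A", True, 0, 0.30),
    Prediction("protein_B", False, 7, 0.10),
    Prediction("protein_C", False, 0, 0.80),
    Prediction("protein_D", False, 0, 0.20),
]
for p in examples:
    print(p.protein_id, "->", classify(p))
```

Checking for transmembrane helices first keeps integral membrane proteins out of the secreted classes; whether that ordering is appropriate depends on the question being asked of the data.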

Despite the tremendous progress in software generation, bioinformatics for MS-based proteomics still faces challenges such as the need for user-friendly databases and programs, reducing protein inference for quantification, and peptide-level statistical analysis tools [168] (instead of using a third-party package such as R (R Development Core Team 2008) [169]). With the development of cloud-based computational technology, more proteomic data is likely to become publicly available, thereby giving open-source platforms a boost. In the future, MS instruments are likely to become faster, more robust, and more accurate, thereby requiring more powerful bioinformatics tools.

Conclusions

The potential of proteomics is now being widely applied to decipher the pathophysiology of various diseases and help in the design of novel treatment strategies. Due to significant improvements in technology, powerful proteomics capabilities are emerging to provide opportunities for the discovery of new biomarkers and targets for drug discovery and development. One of the critical barriers to discovery is the challenge associated with identifying biomarkers or targets that are present in low abundance. In this review, we describe a variety of the emerging technologies which can be applied to detect and characterize lower abundance proteins which may be important in biomedicine. Different analytical techniques such as ultracentrifugation, biotinylation, colloidal silica, size exclusion chromatography, ExoQuick precipitation, hydrazide chemistry, and titanium dioxide incubation have been coupled with LC/MS/MS in order to achieve greater coverage of the subcellular proteome. These approaches often facilitate enrichment of the membrane proteome, glycoproteome, phosphoproteome, secretome, and exosome.

However, there is still much work needed in characterizing the enormous amounts of data generated from these improved techniques and in the development of easy-to-use software and robust databases. With the increasing interest and growth of new high-throughput technologies, tremendous opportunities exist to expand our capabilities for discovery of new and important biomarkers and targets, all with the ultimate goal of better understanding disease and developing better medicines to treat patients.

Chapter 4: Global gene transcription and translation in E. coli B and K under minimal media conditions

1. Summary

The E. coli K strain JM109 and the E. coli B strain BL21 differ markedly from each other in their acetate production levels. Moreover, the differences are even more noticeable when casamino acids (CAA) are present in the growth media.

The aim of this study was to understand the differences in the acetate production pathways of JM109 and BL21 at the transcriptomics and proteomics levels at steady-state growth rate using chemostats. By implementing microarrays and dimethyl-labeling-based proteomics, we found marked overexpression of acetyl-CoA synthetase (acs), which activates acetate to acetyl-CoA, in BL21, allowing the simultaneous consumption of glucose and acetate in this strain in the absence of casamino acids in the media.

Moreover, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway and Gene Ontology (GO) enrichment analyses revealed an enriched TCA cycle in BL21, leading to the conclusion that in the absence of casamino acids in the media, this particular strain increases TCA cycle activity. On the other hand, JM109 showed downregulated genes at both the transcriptomics and proteomics levels in acetate-producing pathways in the absence of casamino acids in the medium. Moreover, KEGG pathway and GO enrichment analyses revealed enriched amino acid biosynthesis pathways and related GO terms, leading to the conclusion that in the absence of casamino acids in the media, JM109 increases the biosynthesis of amino acids. Overall, these differences contribute to the lower acetate overflow and the improved ability of E. coli BL21 to consume this metabolite in the presence of glucose as compared to E. coli JM109.

2. Introduction

Escherichia coli is one of the most widely used organisms for recombinant protein production in research and industrial processes owing to its ability to utilize a variety of carbohydrates and organic acids as sole carbon and energy sources. In recent years, E. coli has also gained significant importance for the production of heterologous low molecular weight compounds [1,2,3] and metabolites such as amino acids [4], and also for biotransformation processes as a whole-cell biocatalyst [5]. Therefore, in order to understand and improve the production processes and also to gain a better understanding of bacterial physiology, there is a greater need for a complete comprehension of the metabolic potential of E. coli.

Based on the research of the last 30 years, it is now accepted that the efficiency of the growth and production process in E. coli depends mainly on the bacterial strain, on the composition of the media, and on the growth conditions. The selection of one or more variables such as the strain (E. coli K or E. coli B), the media composition (minimal or complex), and the concentration and type of the carbon source can significantly affect the growth and production process. It is known that the aerobic growth of E. coli on media consisting of high glucose and casamino acids is characterized by the formation and excretion of acetate [6,7,8,9,10,11,12], or acetate overflow. Acetate overflow in bacterial cell cultures is known to reduce biomass formation and to inhibit the production of recombinant proteins and low molecular weight molecules [13,14,15].

Acetate overflow is generally believed to be caused by an imbalance between substrate uptake and the metabolic throughput of downstream pathways, and numerous studies have attempted to uncover different factors affecting acetate overflow by evaluating the effect of media compositions, strain modification, and growth strategies. A few such studies have hinted at the role of limitations in the tricarboxylic acid (TCA) cycle [16,17], the respiratory chain [18,19,20], glyoxylate shunt activity [21,22], and competition for membrane space [23] as factors in acetate overflow, but so far none of the theories or process/genetic studies have been able to unequivocally explain the main reasons behind the differences in acetate overflow in the presence of casamino acids.

In recent years, several process-level studies involving gene disruption to understand acetate overflow have been published [24]. In one such study, it was shown that disruption of the phosphotransacetylase-acetate kinase (PTA-ACKA) node, which is involved in the acetate production pathway, reduced acetate excretion but also led to a reduced specific growth rate and biomass yield [25,26,27,28,29].

Moreover, it was found that deletion of the pyruvate oxidase (PoxB) gene, which is involved in acetate production through a different route, impaired growth efficiency [30]. Therefore, simply deleting genes involved in acetate overflow cannot reduce acetate overflow without negative side effects.

A few other studies have attempted to address the issue of acetate overflow by analyzing the effect of downregulation of the acetyl-CoA synthetase (ACS) gene, which converts acetate to acetyl-CoA [31]. It was also shown that acetate overflow in aerobic E. coli cultivations can be significantly decreased by synchronized activation of the PTA-ACS node and the TCA cycle [32].

In this study, we attempt to understand the regulation of ACS and the PTA-ACS node, which seems to have an important role in acetate overflow, and to gain further understanding of the role of the ACS gene and TCA cycle activation in acetate overflow metabolism in bacterial cultures containing casamino acids. For this purpose, we studied two strains of E. coli, JM109 and BL21, in a chemostat under aerobic conditions. In this work, we show that acetate overflow in aerobic E. coli B depends on ACS gene and TCA cycle activation, whereas in the E. coli K strain it does not.

3. Materials and Methods

3.1 Bacterial strains, inoculum preparation, and culture media

Escherichia coli BL21(DE3) (F-, ompT, hsdSB (rB-, mB+), dcm, gal, (DE3), Cmr) and JM109(DE3) (endA1, recA1, gyrA96, thi, hsdR17 (rk−,mk+), relA1, supE44, Δλ−, Δ(lac-proAB), [F', traD36, proAB, lacIqZΔM15], λDE3) were grown in continuous cultures in defined medium with the following composition: KH2PO4, 13 g/L; (NH4)2HPO4, 4 g/L; citric acid, 1.5 g/L. Where indicated, 20 g/L of casamino acids was added. The pH of the medium was adjusted to 7.0 with 5 M NaOH prior to sterilization. After sterilization, the medium was aseptically supplemented with 1 mL/L trace metal solution [3], 5 mM MgSO4, 4.5 mg/L thiamine-HCl, 8.4 mg/L EDTA, and 40 g/L glucose. Frozen stocks were used to inoculate overnight precultures grown at 37 °C in 100 mL of defined medium with 5 g/L of glucose.

3.2 Bacterial growth

Bioreactor continuous cultures were carried out at a dilution rate of 0.35 h-1 in 1.5 L laboratory fermenters (New Brunswick) at a stirrer speed of 650 rpm. The temperature was maintained at 37 °C and the pH at 7 by the addition of 2 M NaOH. The dissolved oxygen (dO2) concentration was continuously measured with an oxygen electrode (Mettler Toledo, Columbus, OH) and remained above 20% of air saturation, as shown in Figure 1. The working volume of the cultures was kept at 1.0 L by coupling two peristaltic pumps, one for effluent and the other for feeding. After five residence times, when steady state was confirmed, fermentation samples were taken for RNA isolation and determination of dry weight and metabolites. After withdrawal from the fermenter, samples for glucose and acetate assays were immediately filtered (0.2 µm pore size filter, Millipore) and kept at -20 °C until analysis. Fermentation samples for RNA and protein analysis were centrifuged at 20,800 x g for 5 min at 4 °C, and the pellet was quickly frozen and stored at -80 °C.
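As a quick check of the timing implied by these settings, the following minimal sketch applies the standard chemostat relations F = D·V (feed rate) and τ = 1/D (mean residence time) to the dilution rate and working volume stated above. The function and variable names are illustrative only and are not taken from the study.

```python
def chemostat_timing(dilution_rate_per_h, working_volume_l, n_residence_times=5):
    """Return (feed rate in L/h, residence time in h, wait before sampling in h)."""
    feed_rate_l_per_h = dilution_rate_per_h * working_volume_l  # F = D * V
    residence_time_h = 1.0 / dilution_rate_per_h                # tau = 1 / D
    wait_h = n_residence_times * residence_time_h               # e.g. five residence times
    return feed_rate_l_per_h, residence_time_h, wait_h


# Values from the text: D = 0.35 1/h, working volume = 1.0 L
feed, tau, wait = chemostat_timing(0.35, 1.0)
print(f"feed rate:        {feed:.2f} L/h")
print(f"residence time:   {tau:.2f} h")
print(f"time to sampling: {wait:.1f} h")
```

Under these assumptions, one residence time is a little under 3 h, so waiting five residence times corresponds to roughly 14 h of continuous operation before sampling.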

Figure 1. Graph showing oxygen flow rate, dO2 level, air flow rate, and agitation rate during the chemostat cultures

3.3 Analytical methods

Cell growth was followed by measuring the OD at 600 nm (Ultrospec 3000 UV/Visible spectrophotometer, Pharmacia Biotec); measurements were converted to dry cell weight per liter using a calibration curve (1 OD600 unit = 0.347 and 0.409 gDCW/L for BL21 with and without casamino acids, respectively; 1 OD600 unit = 0.373 and 0.403 gDCW/L for JM109 with and without casamino acids, respectively). Glucose concentration was determined with a YSI 2700 Biochemistry Analyzer (YSI Instruments, Yellow Springs, OH). Acetic acid concentration was determined using the R-Biopharm AG kit (Cat. No. 10 148 261 035); the determination is based on the formation of NADH, measured by the increase in light absorbance at 340 nm.
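The OD-to-biomass conversion described above can be sketched as a simple lookup, assuming the calibration is linear through the origin and that the two values quoted for each strain correspond to media with and without casamino acids, in that order. The function name and the example OD reading are hypothetical.

```python
# gDCW/L per OD600 unit, keyed by (strain, casamino acids present),
# using the calibration values quoted in the text
CALIBRATION = {
    ("BL21", True): 0.347,
    ("BL21", False): 0.409,
    ("JM109", True): 0.373,
    ("JM109", False): 0.403,
}


def od600_to_dcw(od600, strain, with_caa):
    """Convert an OD600 reading to dry cell weight (g/L), assuming linearity."""
    return od600 * CALIBRATION[(strain, with_caa)]


example_od = 20.0  # hypothetical reading, not a measurement from the study
dcw = od600_to_dcw(example_od, "JM109", with_caa=False)
print(f"{dcw:.2f} gDCW/L")
```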

3.4 Total RNA preparation

Total RNA was isolated from the cell pellets in triplicate for each strain and each media condition using the hot phenol method. Cell pellets were resuspended in 0.5% SDS, 20 mM NaAc, and 10 mM EDTA and extracted twice with a hot acid phenol:chloroform mixture, followed by two extractions with a phenol:chloroform:isoamyl alcohol mixture. Subsequently, absolute ethanol was added to the extract, which was kept at -80 °C for 15 minutes. After centrifugation at 14,000 g for 15 min, the pellets were washed in 70% ethanol, resulting in purified RNA pellets. The RNA pellets were air dried and the RNA was resuspended in ultra-pure water (KD Medical, USA).

The purified total RNA was then subjected to mRNA enrichment using the MICROBExpress bacterial mRNA enrichment kit (Ambion), as per the manufacturer’s instructions. The concentration and quality of the purified RNA were then determined using the NanoDrop 2000 Spectrophotometer (Thermo Scientific, Wilmington, DE, USA), the Agilent 2100 Bioanalyzer (Agilent, Santa Clara, CA, USA), and the RNA Integrity Number (RIN, Agilent, Santa Clara, CA, USA). All samples used for reverse transcription and microarray analysis had 260/280 ratios greater than 1.8 on the NanoDrop Spectrophotometer and RIN values greater than 8.0. For each sample, 100 ng of RNA was reverse transcribed to cDNA and amplified using the NuGEN Applause WT-Amp ST system (NuGEN Technologies, San Carlos, CA, USA), according to the manufacturer’s protocol. Then, 2.5 μg of cDNA was fragmented and 3’-biotinylated using the Encore Biotin Module (NuGEN Technologies, San Carlos, CA, USA). The resultant sample mix with hybridization reagents was loaded into the GeneChip Mouse Gene 1.0 ST array and incubated for 16–20 hours in a hybridization oven rotating at 60 rpm at 45 °C (Affymetrix, Santa Clara, CA, USA). Arrays were processed using the GeneChip Fluidics Station 450 (Affymetrix, Santa Clara, CA, USA). Chips were scanned using the GeneChip Scanner 3000 (Affymetrix, Santa Clara, CA, USA), operated by GeneChip Operating Software, version 1.4, which generated .CEL, .CHP, and .RPT files. Poly-A controls (dap, lys, phe, thr) and hybridization controls (bioB, bioC, bioD, and cre) were used as spike controls for cDNA synthesis and hybridization, respectively, using methods described in the manufacturer’s instructions (Affymetrix, Santa Clara, CA, USA).

3.5 Protein sample preparation

3.5.1 Digestion, reductive methylation and AIX fractionation

Three biological replicates of both the complex and defined media conditions were prepared for further analysis. Serial on-column reductive methylation was adapted from reference [33]. A defined pool and a complex pool sample were made by mixing equal parts of each of the samples (which had nearly the same overall cell density upon harvesting). These samples were processed using the FASP procedure [34] with the original Lys-C/trypsin serial digestion. After harvesting peptides from the concentration device, each sample was placed on a C18 Stage tip with a large additional bed of POROS R2 resin calculated to have a loading capacity 5 times greater than needed for the combined protein mass of the two samples. The sample to be labeled with normal methyl groups was applied and treated essentially as described in [33] but scaled down to use serial loading of C18 Stage tips. Two serial reductive methylation treatments were used for each condition to ensure useful levels of incorporation, which was confirmed using internal controls. After both light and heavy reactions were carried out, the column was washed extensively and eluted onto an SCX Stage tip that had an additional 30 µL bed of Source S resin (GE). Once loaded, this was washed extensively and eluted first with 500 mM ammonium acetate, 20% acetonitrile in water (4 column volumes) and next with 200 mM ammonium acetate, 50% acetonitrile in water (4 column volumes). The sample was then nearly evaporated and resuspended in 50% acetonitrile twice to remove residual ammonium acetate in preparation for AIX-based fractionation. Three technical replicates were performed for the dimethyl labeling.

The AIX fractionation scheme described in [35] was used, except that larger-bore AIX Stage tips were fashioned from the bottom simple cone portion of 1 mL Axygen Maxymum Recovery pipette tips, only 2 Empore AIX disks were placed in the bottom of each tip, and Source Q (GE) was applied to make a 50 µL bed volume. Capture C18 Stage tips were each additionally loaded with 20 µL POROS R2, and centrifugal forces were lowered to 1000 x g, with loading at 400 x g. Additionally, a fourth AIX separation was performed in which each of the AIX separations was performed with 20% acetonitrile added, with pH-eluted samples being nearly evaporated and resuspended in 1.6% formic acid prior to application to their respective capture tips. After capture, the reversed-phase Stage tips were washed extensively and eluted first with 100 µL of 1.6% formic acid, 40% acetonitrile and then with 100 µL of 1.6% formic acid, 80% acetonitrile. The eluent was then evaporated under nitrogen and resuspended in 60 µL of 1.6% formic acid.

3.5.2 Mass Spec Data collection

Samples were next submitted to the queue of an LC/MS/MS system comprising a nanoscale HPLC (Waters nanoACQUITY in trapping configuration with a 180 µm x 20 mm Symmetry C18 trap column and a 75 µm x 250 mm BEH130 C18 analytical column) interfaced to a hybrid mass spectrometer (Thermo Orbitrap Elite). Samples were applied in the trapping configuration at 10 µL/minute for 10 minutes prior to the vent valve being closed and the analytical separation being performed at 0.15 µL/minute, starting at 1% acetonitrile with gradient inflections of 99% A at 5 minutes, 95% at 25, 65% at 160, 30% at 200, 40% at 205, and 99% at 210 minutes. Data were collected for 240 minutes, with solution A being 0.1% formic acid in water and solution B being 0.1% formic acid in acetonitrile.

The mass spectrometer was configured for data dependent triggering with a parent scan with resolution setting 240K (1e6 target, 200ms max injection time) in the orbitrap analyzer followed by up to 20 parent ion fragmentations (2e4 target, 100ms max injection time) in the ion trap part of the instrument. The chromatography feature was enabled with phase detection set at 30%. Charge state dependence excluded unknown and single charge states from fragmentation. The preview feature was enabled to improve sampling efficiency.

3.6 Data analysis

The microarray raw data was analyzed using Partek software, version 6.3 (Partek, St. Louis, MO, USA). Raw data was subjected to Robust Multichip Average (RMA) quantile normalization to remove biases introduced by technical and experimental effects. All expression data were log2-transformed to obtain a near-normal distribution for accurate statistical inference. Quality control by visualizing the data in a Principal Component Analysis cluster plot ensured that no outliers were included in the analysis. Next, two-way ANOVA was performed to obtain a set of differentially expressed genes. A filter of p-value < 0.05 and fold change > 1.5 was applied to identify the significantly differentially expressed genes.
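The filtering step can be illustrated with a minimal sketch. It assumes a signed fold-change convention (a log2 ratio of -1 is reported as -2.0), which matches the ">1.5X" and "<-1.5X" notation used in the Results; it does not reproduce Partek's internal calculations, and the gene labels and values below are made up for illustration.

```python
def signed_fold_change(log2_ratio):
    """Convert a log2 ratio to a signed fold change (e.g. -1.0 -> -2.0)."""
    if log2_ratio >= 0:
        return 2.0 ** log2_ratio
    return -(2.0 ** (-log2_ratio))


def is_differentially_expressed(log2_ratio, p_value, fc_cutoff=1.5, p_cutoff=0.05):
    """Apply the p < 0.05 and |fold change| > 1.5 filter described in the text."""
    return p_value < p_cutoff and abs(signed_fold_change(log2_ratio)) > fc_cutoff


# Hypothetical records: (gene label, log2 ratio, ANOVA p-value)
records = [
    ("gene_a", 1.0, 0.01),   # 2.0-fold up, significant
    ("gene_b", -0.8, 0.03),  # about -1.74-fold, significant
    ("gene_c", 0.3, 0.001),  # about 1.23-fold, below the fold-change cutoff
    ("gene_d", 2.0, 0.20),   # 4.0-fold, but the p-value is too high
]

selected = [name for name, lr, p in records if is_differentially_expressed(lr, p)]
print(selected)  # ['gene_a', 'gene_b']
```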

Protein raw data was analyzed using MaxQuant version 1.3.0.5 using a binary setting and requiring either light or heavy dimethylation of N-termini and lysines. The data was searched using protein sequences from E. coli as well as known contaminants and a list of control proteins used to monitor reaction efficiency. Default values were used, except that match between runs was enabled with a 0.5-minute allowable difference in elution time (which was applied after spline fitting the elution patterns).

4. Results

To evaluate the effect of amino acids on the growth characteristics and acetate production of E. coli K and B grown at high glucose concentration in minimal media, the bacteria were cultivated in chemostats, and gene transcription and protein expression were analyzed.

4.1 Bacterial growth in defined media with and without CAA

The growth parameters of the two bacterial strains in chemostats supplied with minimal media containing 40 g/L glucose with and without casamino acids at a dilution rate of 0.35 h-1 are shown in Table 1. The major effects of the casamino acids were related to acetate production rate, acetate production yield, and acetate concentration in the growth media. In JM109, the presence of casamino acids increased the specific acetate production rate by 9.1-fold and the glucose-based biomass yield by 1.27-fold; at the same time, it decreased the specific glucose consumption rate by 21%. Similar effects, although somewhat smaller, were observed in E. coli BL21: the specific acetate production rate increased by 5.6-fold and the biomass yield on glucose by 1.17-fold, while the glucose uptake rate decreased by 15%. The results also demonstrated that, compared with JM109, E. coli BL21 is a high performer, since both the biomass concentration and the glucose-based biomass yield were higher in both media conditions. Both strains showed a higher specific acetate formation rate (qA) and higher yields on glucose in media containing CAA compared with media without CAA. Comparing the two strains, it is clear that BL21, relative to JM109, produces approximately half the amount of acetate in media without casamino acids, has a lower specific acetate formation rate, and has a lower acetate yield on glucose.

Table 1. Concentrations of biomass, glucose, and acetate and glucose uptake rate (qs), acetate formation rate (qA), and yield on glucose (Yx/s) during steady-state growth of E. coli JM109 and BL21 in continuous cultures performed with 40 g/L glucose feeding at a dilution rate of 0.35 h-1 with and without casamino acids.

                        E. coli JM109              E. coli BL21
                        With CAA    Without CAA    With CAA    Without CAA
Biomass conc. [g/L]     9.26        7.90           16.07       18.60
Glucose conc. [g/L]     19.84       19.23          8.07        0.14
Acetate conc. [g/L]     7.58        0.68           4.48        0.77
qs [g/g-h]              0.951       1.24           0.813       0.879
qA [g/g-h]              0.286       0.032          0.098       0.016
Yx/s [g/g]              0.368       0.282          0.431       0.398
YA/x [g/g]              0.82        0.085          0.279       0.042
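As a rough consistency check on Table 1, the sketch below recomputes the specific acetate formation rate and the acetate-per-biomass ratio from the tabulated concentrations. It assumes the standard steady-state chemostat balance qA = D·A/X with no acetate in the feed, and it reads YA/x as A/X; the source does not state how these columns were calculated, so both formulas are assumptions.

```python
D = 0.35  # dilution rate in 1/h, as stated in the Table 1 caption

# Steady-state (biomass g/L, acetate g/L) pairs copied from Table 1
TABLE1 = {
    "JM109 with CAA": (9.26, 7.58),
    "JM109 without CAA": (7.90, 0.68),
    "BL21 with CAA": (16.07, 4.48),
    "BL21 without CAA": (18.60, 0.77),
}


def specific_acetate_rate(dilution_rate, acetate, biomass):
    """qA = D * A / X, assuming no acetate enters with the feed."""
    return dilution_rate * acetate / biomass


def acetate_per_biomass(acetate, biomass):
    """YA/x = A / X, grams of acetate per gram of dry cell weight."""
    return acetate / biomass


results = {
    label: (specific_acetate_rate(D, a, x), acetate_per_biomass(a, x))
    for label, (x, a) in TABLE1.items()
}

for label, (q_a, y_ax) in results.items():
    print(f"{label:18s} qA = {q_a:.3f} g/g-h   YA/x = {y_ax:.3f} g/g")
```

Under these assumptions the recomputed values land close to, though not always exactly on, the tabulated qA and YA/x entries; the small differences may reflect rounding of the reported concentrations or a different calculation in the original analysis.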

4.2 Proteomics and Transcriptomics analysis of E. coli JM109 and BL21 growing with and without CAA in culture medium

To characterize the gene and protein expression patterns of E. coli JM109 and E. coli BL21, microarray-based gene expression analysis and dimethyl-labeling-based proteomics analysis were implemented. The results of the gene and protein expression comparisons of bacteria grown without casamino acids to bacteria grown with casamino acids are summarized in the Venn diagrams (Figure 2). The microarray analysis in E. coli K (JM109) revealed statistically significant (p<0.05) differential expression of 330 upregulated genes (>1.5X fold change) and 529 down-regulated genes (<-1.5X fold change) when the culture medium did not contain CAA. In E. coli B (BL21), 1121 genes were found to be significantly (p<0.05) upregulated (>1.5X fold change) and 1373 were found to be down-regulated (<-1.5X fold change) when the media did not contain CAA.

The proteomics analysis in E. coli K revealed 537 high-abundance proteins, with 176 upregulated proteins (>1.5X fold change) and 356 down-regulated proteins (<-1.5X fold change) in medium not containing CAA. In BL21, the proteomics analysis identified 380 high-abundance proteins, with 242 upregulated proteins (>1.5X fold change) and 138 down-regulated proteins (<-1.5X fold change) in medium not containing CAA.

The significant observations from this comparison are: 1) in both bacterial strains, growth without casamino acids yields a higher number of down-regulated genes than upregulated genes, i.e., 60% more genes were down-regulated than upregulated in JM109 and 23% more in BL21; 2) the differences in JM109 between the two culture conditions are more pronounced at the proteomics level than at the gene expression level, i.e., the difference between upregulated and down-regulated genes is 1.6-fold at the microarray level, whereas it is more than 2-fold at the protein level; and 3) the differences at the proteomics and microarray levels between the two culture conditions are less pronounced in the BL21 strain, i.e., the differences between upregulated and down-regulated genes were 1.2-fold and 1.7-fold at the microarray and proteomics levels, respectively.
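The fold differences quoted in this comparison can be reproduced from the counts reported above. The sketch below takes the ratio of the larger to the smaller group in each data set, which is one plausible reading of "difference between upregulated and down-regulated genes"; the helper name is illustrative.

```python
# (upregulated, down-regulated) counts reported in the text
COUNTS = {
    ("JM109", "transcript"): (330, 529),
    ("JM109", "protein"): (176, 356),
    ("BL21", "transcript"): (1121, 1373),
    ("BL21", "protein"): (242, 138),
}


def imbalance(up, down):
    """Ratio of the larger group to the smaller group."""
    return max(up, down) / min(up, down)


ratios = {key: imbalance(up, down) for key, (up, down) in COUNTS.items()}

for (strain, level), ratio in ratios.items():
    up, down = COUNTS[(strain, level)]
    larger = "down-regulated" if down > up else "upregulated"
    print(f"{strain:6s} {level:10s} {ratio:.2f}-fold, larger group: {larger}")
```

Read this way, the ratios come out at about 1.6 and 2.0 for JM109 and about 1.2 and 1.75 for BL21. In the BL21 proteomics data the larger group is the upregulated one, unlike the other three comparisons.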

Figure 2. Venn diagrams showing comparison between transcriptomics data and proteomics data. (a) Comparison between upregulated genes at transcriptomics and proteomics level in JM109 chemostats without casamino acids compared to with casamino acids in the culture media, (b) Comparison between downregulated genes at transcriptomics and proteomics level in JM109 chemostats without casamino acids compared to with casamino acids in the culture media, (c) Comparison between upregulated genes at transcriptomics and proteomics level in BL21 chemostats without casamino acids compared to with casamino acids in the culture media, (d) Comparison between downregulated genes at transcriptomics and proteomics level in BL21 chemostats without casamino acids compared to with casamino acids in the culture media.

4.3 Enrichment analysis of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways

To decipher the omics data obtained from the two bacterial strains under the two media compositions, hypergeometric tests were carried out to identify overrepresented (enriched) biochemical pathways using the KEGG database. The omics data were segregated into upregulated (>1.5X fold change) and down-regulated (<-1.5X fold change) genes/proteins, comparing cultures not containing CAA to cultures containing CAA.
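A hypergeometric enrichment test of this kind can be sketched with the Python standard library alone. The function below computes the upper-tail probability of observing at least the given overlap between a pathway and a list of regulated genes; the parameter names are illustrative, and the exact background set and software used in the study are not specified in the text.

```python
from math import comb


def hypergeom_enrichment_p(population, pathway_size, selected, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    population   -- number of genes in the background set
    pathway_size -- number of background genes annotated to the pathway
    selected     -- number of regulated genes drawn from the background
    overlap      -- number of regulated genes that fall in the pathway
    """
    max_overlap = min(pathway_size, selected)
    total = comb(population, selected)
    tail = sum(
        comb(pathway_size, i) * comb(population - pathway_size, selected - i)
        for i in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy example that can be checked by hand: 10 genes in total, 4 in the pathway,
# 3 regulated genes, 2 of which are in the pathway.
# P(X >= 2) = (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40/120
p_toy = hypergeom_enrichment_p(population=10, pathway_size=4, selected=3, overlap=2)
print(f"{p_toy:.4f}")  # 0.3333
```

In practice, the test would be run once per KEGG pathway for the upregulated and down-regulated lists separately, and the resulting p-values ranked as in Table 2.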

4.3.1 E. coli JM109

Performing the hypergeometric test to identify pathway enrichment in JM109 revealed differences in gene and protein expression between bacteria grown in media without CAA and bacteria grown in media with CAA (Figure 3). There was upregulation of genes and proteins involved in amino acid biosynthesis pathways and both upregulation and downregulation of genes and proteins related to acetate metabolism. Among the significantly enriched pathways, there were upregulated genes related to fatty acid metabolism and to purine and pyrimidine metabolism.

Figure 3. TCA cycle and associated pathways in JM109. Red color corresponds to genes with >1.5X fold change, green color corresponds to genes with <-1.5X fold change, and black color corresponds to no change in gene expression. All fold changes correspond to culture media without casamino acids versus culture media with casamino acids.

Concerning amino acid biosynthesis, the following pathways were upregulated: a) alanine, aspartate, and glutamate; b) glycine, threonine, and aspartate; c) cysteine and methionine; d) lysine; e) arginine and proline; and f) tyrosine. P-values of the enriched pathways are shown in Table 2.

Table 2. Results of hypergeometric test showing top 10 overrepresented pathways in each study type.

JM109                                                    BL21
Pathway                                      P-Value     Pathway                                    P-Value
Biosynthesis of amino acids                  2.40E-18    Citrate cycle (TCA cycle)                  6.25E-11
Alanine, aspartate and glutamate metabolism  9.78E-10    Flagellar assembly                         4.28E-09
Glycine, serine and threonine metabolism     1.14E-08    Carbon metabolism                          4.34E-09
Pyrimidine metabolism                        1.53E-07    metabolism                                 7.14E-09
Carbon metabolism                            2.73E-07    Starch and sucrose metabolism              6.28E-07
2-Oxocarboxylic acid metabolism              3.70E-07    Pyruvate metabolism                        2.97E-06
Arginine and proline metabolism              0.00015     Galactose metabolism                       5.84E-05
Base excision repair                         0.00016     Glyoxylate and dicarboxylate metabolism    9.88E-05
Methane metabolism                           0.00019     Two-component system                       0.00011
Nucleotide excision repair                   0.00034     Tryptophan metabolism                      0.00073

With regard to acetate metabolism, as seen in Figure 4, the genes involved in acetate production, Pta and AckA, are significantly downregulated at both the transcriptomics and proteomics levels, with Pta showing a downregulation of -1.77-fold at the transcriptomics level and -1.89-fold at the proteomics level. AckA shows a downregulation of -1.65-fold at the transcriptomics level and -2.38-fold at the proteomics level. Moreover, the genes responsible for the conversion of pyruvate to acetyl-CoA, AceE and AceF, are both downregulated at the protein level, by -1.86-fold and -1.92-fold respectively, thereby decreasing the amount of acetyl-CoA available for the production of acetate. At the same time, the GltA gene, which converts acetyl-CoA to citrate, is upregulated by 2.73-fold at the protein level, thereby further reducing the availability of acetyl-CoA to form acetate.

Figure 4. Acetate formation biochemical pathway and differential expression of associated genes in JM109 (media devoid of casamino acids versus media containing casamino acids). Microarray fold change result is shown in black, proteomics fold change result is shown in red, and all genes are highlighted with a green background.

Genes related to fatty acid metabolism such as fadJ, fadB, and fadI were found to be upregulated by 1.61-, 1.58-, and 1.55-fold, respectively, at the transcriptomics level. Purine metabolism genes such as xdhA and guaD were found to be upregulated by 1.51- and 2.08-fold, respectively, at the transcriptomics level. Pyrimidine metabolism genes such as rutA, rutB, and rutE were upregulated by 9.21-, 11.89-, and 15.45-fold, respectively, at the transcriptomics level, as well as by 51.55-, 103.72-, and 126.01-fold at the proteomics level.

4.3.2 E. coli BL21

Performing the hypergeometric test to identify pathway enrichment in BL21 revealed differences in gene and protein expression between bacteria grown in media without CAA and bacteria grown in media with CAA (Figure 5). As evident from Table 2, there was upregulation of genes and proteins involved in the TCA cycle and both upregulation and downregulation of genes and proteins related to acetate metabolism. In addition, among the significantly enriched pathways there were upregulated genes related to galactose metabolism, cysteine and methionine metabolism, valine, leucine, and isoleucine degradation, and lysine biosynthesis. Interestingly, the only significantly enriched downregulated pathway was flagellar assembly.

Figure 5. TCA cycle and associated pathways in BL21. Red color corresponds to genes with >1.5X fold change, green color corresponds to genes with <-1.5X fold change, and black color corresponds to no change in gene expression. All fold changes correspond to culture media without casamino acids versus culture media with casamino acids.

As seen in Figure 5, the conversion of acetate into acetyl-CoA is significantly upregulated, by many fold, in the absence of CAA in the media, thereby reducing the quantity of acetate produced in the media devoid of CAA. Moreover, as previously mentioned, the TCA cycle is significantly upregulated in BL21, thereby reducing any flow of metabolites towards the formation of acetate.

Looking in more detail at the related biochemical pathways, we see that in BL21 expression at the protein level is consistent with the lower acetate level, as shown in Figure 6. BL21 cells growing without casamino acids in the culture media, contrary to JM109 cells, exhibit higher activity of enzymes involved in the TCA cycle, as well as of enzymes involved in the conversion of acetate into acetyl-CoA. Specifically, the conversion of pyruvate to acetyl-CoA is downregulated due to downregulation of both genes involved, AceE and AceF, by -1.53-fold and -1.67-fold respectively at the protein level. Moreover, there is a significant upregulation of the conversion of acetate to acetyl-CoA, by 31.51-fold and 4.52-fold at the transcriptomics and proteomics levels respectively, thereby further reducing the level of acetate. The availability of acetyl-CoA for conversion into acetate is further diminished by increased expression of GltA at both the transcriptomics (1.71-fold) and proteomics (2.99-fold) levels, thereby increasing the flux in the TCA cycle and reducing it in acetate production.

Figure 6. Acetate formation biochemical pathway and differential expression of associated genes in BL21 (media devoid of casamino acids versus media containing casamino acids). Microarray fold change result is shown in black, proteomics fold change result is shown in red, and all genes are highlighted with a green background.

Moreover, fatty acid oxidation enzymes converting fatty acids into acetyl-CoA are also present at higher levels, thereby increasing the levels of acetyl-CoA that is utilized in the TCA cycle. Specifically, genes such as fadA, fadB, fadD, and fadE were upregulated by 8.72-, 18.75-, 6.73-, and 8.57-fold respectively at the transcriptomics level, and fadA and fadB were upregulated by 2.91- and 4.38-fold respectively at the proteomics level in cultures devoid of CAAs.

4.4 Gene Ontology (GO) Enrichment Analysis

A subsequent GO enrichment analysis was carried out to identify enriched GO terms in the biological processes category, as shown in Figure 7. The results are in line with those of the KEGG enrichment analysis, showing higher activity of genes associated with the GO terms for amino acid biosynthesis and transport as well as fatty acid biosynthesis in the JM109 strain, indicating loss of metabolites from the TCA cycle and thereby less acetate production in the absence of casamino acids in the media. Interestingly, the biological process GO term “acetyl-CoA biosynthetic process from acetate” has an enrichment p-value of 0.0082 and “acetate biosynthetic process” has an enrichment p-value of 0.032, suggesting from the gene ontology analysis that acetate production is less favored than its consumption to produce acetyl-CoA in JM109 culture devoid of CAAs.

Figure 7. Gene ontology enrichment analysis results for biological processes category for JM109 (part a) and BL21 (part b) based on omics data.

Genes related to the GO term GO:0003333 (amino acid transmembrane transport) such as lysP, gadC, and cycA are upregulated by 3.64-, 2.95-, and 9.44-fold respectively at the proteomics level, showing that JM109 growing in culture devoid of CAA not only has higher amino acid biosynthesis but also transports amino acids across membranes at an increased rate.

Additionally, the purine biosynthesis process was also among the top enriched GO terms, containing 9% of the genes of the top enriched biological processes, showing correlation with the KEGG analysis.

In the BL21 strain, by contrast, genes associated with the enriched “TCA cycle” biological process GO term are active and, as confirmed by the previously discussed KEGG enrichment analysis, they are also upregulated. An active TCA cycle in BL21 and increased conversion of acetate to acetyl-CoA lead to a very small amount of residual acetate in media devoid of CAAs. Similar to JM109, amino acid transport is also enriched in BL21, comprising 14% of the genes belonging to the top enriched GO terms.

5. Discussion

In this study, we employed steady-state conditions to study the effect of casamino acids in the cell cultures of two different bacterial strains. One of the main objectives of this study was to shed light on the metabolic differences between E. coli K and B strains, especially focusing on the differences in acetate metabolism. Over many years, numerous studies have tried to unravel differences in acetate metabolism in the E. coli K and B strains, which cannot be directly related to the observed genomic differences [36,37,38,39,40,41,42,43,44]. In the present study, we looked in more detail at the related biochemical pathways using chemostats, which allow us to study the effect of casamino acids while keeping all other variables constant at steady state. Chemostats provide the unique advantage of steady state and have been used in a number of other studies involving E. coli for transcriptomics [17,46,47,48] and proteomics [49].

We investigated major metabolic pathways related to acetate overflow, including the acetate-producing and acetate-consuming routes such as the irreversible oxidative decarboxylation of pyruvate, catalyzed by pyruvate oxidase (PoxB), and the high-capacity reversible phosphotransacetylase-acetate kinase pathway (Pta-AckA) [45].

We observed that in JM109 protein expression relates to the lower acetate level in the absence of casamino acids, as shown in Figure 3. In the absence of casamino acids in the culture media, the observed low levels of enzymes involved in the TCA cycle and in acetyl-CoA to acetate conversion, along with increased enzyme activity in biosynthesis of amino acids, pave the way for increased conversion of acetyl-CoA into oxaloacetate and subsequently into amino acids. Moreover, the upregulation and high enrichment of the biosynthesis of amino acids and the alanine, aspartate, and glutamate metabolism pathways imply increased flux out of the TCA cycle and less flux of metabolites into acetate production.

As discussed above and also seen from Figures 4 and 6, acetate can be formed from pyruvate by two different routes. One of the routes includes generation of acetyl-CoA, which is converted into acetyl-phosphate and in turn into acetate; this route involves phosphotransacetylase (Pta) and acetate kinase (AckA), along with acetyl coenzyme A synthetase (Acs). In JM109, Pta and AckA are downregulated in the absence of casamino acids in the cell culture, leading to a reduction in acetate overflow.

However, in E. coli BL21, Acs was found to be highly upregulated, leading to an increase in acetate consumption. Moreover, an enriched, upregulated valine, leucine, and isoleucine degradation pathway is observed for E. coli BL21, as seen from Figure 7, implying increased flux out of the enriched, upregulated TCA cycle and thereby a decrease in the flux of metabolites toward acetate production in BL21 in the absence of casamino acids in the medium.

Another route is the direct conversion of pyruvate into acetate by the enzyme PoxB, which mainly dominates during stationary phase growth and under environmental stress conditions [32]. In our chemostat cultures, the cells do not enter stationary phase and are under no induced environmental stress, which decreases the possibility of increased acetate production from pyruvate.

The Venn diagrams in Figure 2 show differences between the gene transcription and protein translation levels. Such differences can arise from several biological parameters that influence the correlation between mRNA and protein levels, such as small metabolites that can change the RNA secondary structure [50], regulatory proteins and small RNAs that act as translation modulators [51], codon biases that result in variable correlation with highly expressed genes [52,53] and proteins [54,55,60], ribosomal density and ribosomal occupancy, protein half-lives, and posttranslational processing [56].

6. Conclusions

In this study, we were able to demonstrate the differences between the E. coli K and B strains at the transcriptomic and proteomic levels. In the presence of casamino acids, lax of acetate metabolism could be the reason for lower acetate levels in BL21 as compared to JM109. However, in the absence of casamino acids, JM109 overexpresses pathways that increase the flux of metabolites out of acetate metabolism pathways, resulting in significantly lower acetate levels. Altogether, this study has provided deeper insight into the differences in acetate metabolism in E. coli K and B strains, demonstrating differential catabolite repression and amino acid biosynthesis.

Chapter 5: Harnessing Chinese hamster ovary (CHO) cell proteomics for biopharmaceutical processing

Summary

Chinese hamster ovary (CHO) cell lines are the preferred host for manufacturing therapeutic proteins. The use of proteomics for recombinant protein production in CHO cells has expanded recently because of the ease of identifying protein levels and its potential to elucidate targets for improved growth and productivity. Coupling advanced analytical methods, labeling strategies, and bioinformatics with mass spectrometry-based proteomics facilitates identification and quantification of large numbers of proteins. CHO proteomics studies have increased the knowledge of proteins that affect cell culture events, thereby facilitating media development and cell line engineering to increase growth and production, delay apoptosis, and utilize nutrients more effectively. The application of proteomics offers numerous ways of creating superior hosts and improving biotherapeutics production in CHO cells.

Introduction

Chinese hamster ovary (CHO) cells are used worldwide for production of biotherapeutics. Despite the availability of many other mammalian cell lines, such as baby hamster kidney (BHK) and human embryonic kidney (HEK)-293, nearly 70% of all recombinant therapeutic proteins are produced by host CHO cells [1]. CHO cells offer numerous advantages, such as producing recombinant proteins with post-translational modifications compatible to humans, growing rapidly in suspension as compared to other mammalian cell lines, and scaling for well-controlled manufacturing processes [1-3].

Another advantage of using CHO cells is easy adaptation from serum-bearing to serum-free media (SFM), a desirable characteristic for large-scale culture in bioreactors.

DUXB11 and DG44 cells are the parental host cell lines derived from the CHO-K1 cell line and they are widely used for biologics production [4, 5]. Other cell lines used for production of recombinant proteins and monoclonal antibodies (mAbs) in the pharmaceutical industry include CHO-K1 ATCC, CHO-K1 ECACC, CHO-S, and CHO-K1/SF [5].

There is an increased demand for improving the robustness and productivity of CHO cells in order to offer affordable drugs for patients. One implemented method involves isolating highly productive host cells by single cell cloning of the parent host cells or through iterative bioreactor evolution [6-8]. Another recent approach which has been implemented extensively is genetic engineering of CHO cells. Over-expression or knock-out studies for genes involved in cellular metabolism, apoptosis, cell cycle, and protein secretion facilitate creation of desirable cell lines with high production or growth properties [9-12]. Despite these advances, more work is needed to improve stability and productivity of CHO cell lines. The demands of a highly competitive market require cells to maintain high productivity and grow in bioreactors at high densities under rigorous optimization schemes. Recently, proteomics has proven to be an efficient tool to understand the CHO cell physiology for the purpose of creating a superior production host.


Figure 1. A general scheme showing application of proteomics in media optimization, bioprocess optimization, and cell line engineering

As shown in Figure 1, in recent years, researchers have started using proteomics techniques in cell culture applications, owing to advances in methods such as digestion, labeling, mass spectrometry (MS), and bioinformatics. Advanced proteomics techniques can be used to quantitatively measure thousands of proteins that change between high and low producers or between fast and slow growing cell lines, to better understand cell physiology. This can help identify targets for genetic engineering of CHO cell lines for increased cell growth, increased recombinant protein productivity, and consistent high quality, as discussed in the following sections.

1. Optimization of proteomics methods for CHO cells

In order to understand CHO cell physiology, the proteome of CHO cells must be examined in addition to its transcriptome and metabolome. Accurate proteome identification necessitates use of optimized sample preparation methods along with versatile bioinformatics tools and robust databases to match the mass spectra to specific peptide sequences. In the following subsections, several developments in proteomics methods, such as optimization of sample preparation, enhancement of digestion and labeling, and improved MS analysis, are discussed in the context of CHO cell lines.

1.1 Sample preparation methods and improvements

Sample preparation is a crucial step to ensure maximum protein recovery from complex protein mixtures; for consistent results, optimization of sample preparation methods is essential.

For sample preparation, researchers mainly use either in-gel or in-solution digestion techniques. Although in-gel digestion reduces the complexity of biological samples and removes MS-interfering impurities from the sample, it has its shortcomings. Apart from being a lengthy and laborious procedure, it also incurs extensive loss of peptides, which decreases protein coverage and prevents identification of low abundance proteins. In contrast, the automation of in-solution digestion reduces intensive labor, but there is a greater risk of incomplete solubilization and interfering impurities. Efforts are in progress to address this issue. For instance, the in-solution digestion technique involves protein solubilization with detergents and separation of proteins by sodium dodecyl sulfate (SDS). For efficient analysis, SDS needs to be completely removed prior to analysis in the mass spectrometer. A recently developed filter-aided sample preparation (FASP) technique, which uses ultra-filtration to remove SDS prior to digestion and MS, can increase the solubilization of proteins significantly [13]. More recently, the FASP technique was shown to maximize the proteome coverage of CHO-K1 cells, enabling the identification of 6163 proteins [14].


In another study concerning SDS solubilization, Filter-Aided N-Glycan Separation (FANGS) was developed and used to study N-glycan profiles of wild-type CHO cells and of mutant CHO cells deficient in Cog1, a subunit of the COG complex. It was found that the mutant CHO cells do not produce the biantennary glycans that are produced by the wild-type CHO cells [15].

Cell surface biotinylation is another method used with solubilized proteins in CHO cells. It was applied to study conserved cysteines in the human organic cation transporter 2 (hOCT2). A more than 20-fold increase in the influx of tetraethylammonium (TEA) and 1-methyl-4-phenylpyridinium (MPP) was observed in CHO cells expressing hOCT2 [16].

In another experiment, sample preparation techniques were optimized for in-gel digestion of CHO cell lysates using design of experiments (DOE) tools. Protein recovery from 2-D gels was maximized by trying different concentrations of solubility enhancing agents (urea, DTT, CHAPS, and SDS) in the initial suspension solution for CHO cells [17]. This experiment showed that protein recovery from the CHO cell lysate can be maximized with 8 M urea, 32.5 mM DTT, and 2% CHAPS [17].

The in-gel digestion technique was also used for comparing proteomic profiles of CHO cells treated with different supplementations of sodium butyrate (NaBu) [18]. Increased levels of GRP78 and peroxiredoxin and decreased levels of phosphopyruvate hydratase were identified following treatment with NaBu [18].

In-solution digestion was used to identify and quantify CHO DG44 and CHO-S secreted proteins following GalNAz supplementation and enrichment by copper-catalyzed click chemistry. iTRAQ labeling was used to quantify the protein amount. The differences between the CHO-S and CHO DG44 secretomes were identified and 70% similarity between these cell lines was found [19].

Additionally, both in-gel and in-solution digestion techniques have been coupled with fractionation techniques prior to LC/MS or direct LC/MS injections [14, 20]. Baycin-Hizal et al. identified 5694 proteins using the in-solution digestion technique coupled with the basic pH reversed-phase liquid chromatography (bRPLC) fractionation method and 5006 proteins using the in-gel digestion technique. In-solution digestion coupled with bRPLC fractionation was thus found to be more efficient than in-gel digestion for protein identification [14]. A whole KEGG pathway mapping of genome, transcriptome, and proteome indicated enrichment of the protein processing in endoplasmic reticulum and apoptosis pathways [14]. Table 1 provides a summary of the studies done using in-gel and in-solution digestion methods.

Table 1. A summary of various studies using in-gel and in-solution digestion methods

Aim of the study | Analysis technique | Conclusion | Reference
To study the effect of NaBu supplementation in CHO cell cultures | In-gel digestion; MALDI TOF MS and MS/MS | Upregulated GRP78 and peroxiredoxin, and downregulated phosphopyruvate hydratase levels due to NaBu supplementation | [18]
To identify the proteome, secretome, and glycoproteome of the CHO-K1 cell line | SPEG for glycoprotein fractions; in-gel and in-solution digestion; MS/MS | Identified more than 6100 proteins in the CHO proteome | [14]
To study growth rate-related proteins using a combined transcriptomics and proteomics approach | In-gel digestion; MALDI TOF MS | Valosin-containing protein identified as a key regulator of cell growth; its overexpression showed increased growth without any viability reduction | [21]
To find proteins affecting high productivity in bioreactor cell cultures | In-gel digestion; LCMS | Identified 180 differentially expressed proteins, which were mainly associated with cytoskeletal organization, protein synthesis, metabolism, and growth | [22]
To find out how cMyc affects the CHO proteome | In-gel digestion; MS/MS | Increase in nucleolin, ATP synthetase, and mitochondrial protein levels. Decrease in matrix and cell adhesion related proteins | [23]
To identify the CHO DG44 cell line proteome | In-gel digestion; MALDI TOF MS and MS/MS | Identified 106 proteins, which are mainly related to energy metabolism | [24]
To compare the proteomes of high and low producing cell cultures over time | In-gel digestion; LCMS | Identified 89 proteins with differential expression between high and low producer cell lines | [25]
To determine the effect of miR-7 overexpression on the proteome | In-solution digestion; label-free LCMS | Identified 93 decreased proteins and 74 increased proteins resulting from miR-7 overexpression. Decreased proteins related to protein translation and DNA/RNA processing. Increased proteins related to protein folding and secretion | [26]
To identify important proteins that affect protein productivity in butyrate- and zinc-treated culture | In-gel digestion; MALDI-TOF MS | Identified increased expression of GRP75, enolase, and thioredoxin in response to media supplements | [27]
To identify the proteome of CHO cells during prolonged cultivation | In-gel digestion; ESI tandem MS | 40 differentially regulated proteins identified after prolonged cultivation, such as cytoskeletal proteins, chaperones, and metabolic enzymes | [28]

Enhanced digestion of proteins can be achieved by using LysC/trypsin enzymes to eliminate missed cleavages. Endoproteinase Lys-C digestion along with trypsin digestion, used for absolute protein quantification, resulted in a 76% peptide yield and subsequent identification of 19000 unique peptides per sample [29]. It was found that trypsin digestion was not effective for cleavage of 2.4% of the identified peptides, whereas Lys-C digestion was not effective for cleavage of a mere 1.2% of the identified peptides. This multi-enzyme digestion approach could be useful in increasing digestion efficiency in quantitative proteome analysis [29].

Chymotrypsin, another protein digestion enzyme, is known to be structurally similar to trypsin, but recognizes different substrates. Trypsin acts on lysine and arginine residues, while chymotrypsin acts on large hydrophobic residues such as tryptophan, tyrosine, and phenylalanine, with a high catalytic efficiency. A total of 154 and 31 N-glycosylated proteins were identified when trypsin and chymotrypsin enzymes were used, respectively, demonstrating the higher efficiency of tryptic cleavage [30].
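The difference in substrate specificity between the two enzymes can be illustrated with a simple in silico digestion. This is a minimal sketch under simplifying assumptions: cleavage occurs C-terminal to every listed residue, missed cleavages and sequence-context exceptions are ignored, and the protein sequence is hypothetical.

```python
def digest(sequence, cleave_after):
    """Split a protein sequence C-terminal to every residue in cleave_after."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        if residue in cleave_after:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal remainder that does not end in a cleavage residue
        peptides.append(sequence[start:])
    return peptides


TRYPSIN = set("KR")        # lysine and arginine
CHYMOTRYPSIN = set("FWY")  # phenylalanine, tryptophan, tyrosine

protein = "MKWVTFISLLRAYSGK"  # hypothetical sequence for illustration
tryptic = digest(protein, TRYPSIN)
chymotryptic = digest(protein, CHYMOTRYPSIN)
print(tryptic)
print(chymotryptic)
```

The same sequence yields different peptide sets with each enzyme, which is why the choice of protease changes which proteins and modification sites can be observed.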

Undesirable disulfide shuffling during proteolytic digestion in sample preparation can be avoided by using pepsin. Pepsin digestion was used to study protein disulfide bridges, resulting in the identification of 31 unique Cys-Cys bonds out of the 43 disulfide bonds present. Pepsin digestion operates in the acidic pH range, as opposed to the neutral pH range required by trypsin digestion. Therefore, pepsin digestion can be a great tool for low pH processes [31].

Fractionation of enzymatically digested proteins or peptides using LC reduces the complexity of protein samples. One of the recent applications of CHO proteomics is the identification of secreted proteins. Because of their low abundance in conditioned media, these proteins require techniques such as multidimensional protein identification technology (MudPIT), which combines the benefits of strong cation exchange chromatography and LC-tandem MS [32]. This strategy was used to identify autocrine growth factors in conditioned media for chemically-defined media formulations [33] and to elucidate the CHO secretome [19]. In this approach, low collision energy acquires precursor peptide data and high collision energy acquires fragment ion data [33]. There have also been a number of efforts to increase protein identification by using different liquid chromatography based systems. For example, a concatenated low pH (pH 3) and high pH (pH 10) reversed-phase liquid chromatography (RPLC) strategy was compared with the traditional strong cation exchange method. It was found that the use of concatenated high pH RPLC as a first-dimension fractionation strategy resulted in 1.8- and 1.6-fold increases in the number of peptide and protein identifications (with two or more unique peptides), respectively. This method could be an attractive alternative to strong cation exchange for 2-D shotgun proteomic analysis by offering improved protein sequence coverage, simplified sample processing, and reduced sample losses [34]. The aforementioned 2-D LC was used for obtaining high separation rates and identifying a larger number of proteins in CHO cell lines. An increased number of proteins was obtained after separating the digested peptides into 96 fractions by bRPLC and combining them into 48 prior to MS analyses [14]. Furthermore, N-glycopeptides were enriched using solid phase extraction of glycopeptides (SPEG) followed by in-gel and in-solution digestion for the glycoproteome of CHO-K1 cells [14].
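The combination of 96 first-dimension fractions into 48 samples can be sketched as follows. The exact pooling scheme used in [14] is not described here; this example assumes the common concatenation pattern in which fractions that elute far apart are pooled together (fraction 1 with 49, 2 with 50, and so on), and the function name is hypothetical.

```python
def concatenate_fractions(n_fractions=96, n_pools=48):
    """Assign each fraction to a pool so that pooled fractions are far apart.

    Fraction i (0-based) goes to pool i % n_pools. Fractions are reported
    with 1-based labels, as they would be numbered on a collection plate.
    """
    pools = [[] for _ in range(n_pools)]
    for fraction in range(n_fractions):
        pools[fraction % n_pools].append(fraction + 1)
    return pools


pools = concatenate_fractions()
print(pools[0])   # first pool
print(pools[-1])  # last pool
```

Pooling early and late fractions keeps the peptides within each pool spread across the second-dimension gradient while halving the number of MS runs.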

One challenge in recombinant protein purification is host cell protein (HCP) contamination of the supernatant from the CHO cells. HCP identification and removal in extracellular or secreted proteins can be difficult due to the low abundance of the proteins and complex media compositions. Researchers studied the impact of different precipitation parameters on secreted CHO HCPs for both gel-based and shotgun proteomics and subsequently improved proteome identification for CHO HCPs, which led to the identification of 178 unique extracellular HCPs [35].

Identification of mitotic spindle proteins can help in understanding cell division and subsequently bioprocess optimization. Mitotic spindle proteins were identified in CHO cells using MudPIT, tandem MS, and bioinformatics tools [36]. A total of 1155 mitotic spindle proteins were reported, of which 11% were membrane-localized, 7% were microtubule-localized, and 3% were associated with actin [36].


1.2. Protein labeling

Stable isotope labeling with amino acids in culture (SILAC), as shown in Figure 2a, isobaric tags for relative and absolute quantification (iTRAQ), isotope-coded affinity tags (ICATs), and tandem mass tags (TMT), as shown in Figure 2b, have been used to label proteins and measure the quantitative changes between samples, also known as comparative proteomics. Various labeling methods as well as label-free methods are used for relative quantification.

The SILAC method comprises adaptation of cells to labeled media and labeling of intracellular proteins with amino acids in the cell culture medium. The labeled intracellular proteins can be studied using in-gel digestion or in-solution digestion prior to detection by MS [37]. A limitation of this method is quantification error caused by incomplete amino acid isotope incorporation and arginine-to-proline conversion [38]. To reduce quantification errors, a label-swap replication experiment was done by averaging ratios of individual replicates and validated with SILAC doublet and triplet experiments [38]. It was found that errors induced by arginine-to-proline conversion are larger than those induced by incomplete amino acid isotope incorporation. By averaging ratios of label-swap replicates, errors were reduced by 97% in a SILAC H/L doublet experiment and overall SILAC experimental errors were corrected [38].
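The idea behind averaging label-swap replicates can be sketched as follows. This is a minimal illustration and not the procedure of [38]: it assumes the two replicates are combined as a geometric mean (an average in log space) and that the labeling error acts as the same multiplicative bias on the measured heavy/light (H/L) ratio in both replicates. All numbers are hypothetical.

```python
from math import sqrt


def label_swap_ratio(hl_forward, hl_reverse):
    """Combine a label-swap pair into one treated/control ratio.

    hl_forward: measured H/L when the treated sample carries the heavy label.
    hl_reverse: measured H/L when the treated sample carries the light label,
                so the treated/control ratio in this replicate is 1 / hl_reverse.
    Returns the geometric mean of the two treated/control estimates.
    """
    return sqrt(hl_forward / hl_reverse)


# Hypothetical example: true treated/control ratio of 2.0, with a labeling
# bias that scales every measured H/L ratio by 0.8 (assumed error model).
true_ratio, bias = 2.0, 0.8
hl_forward = true_ratio * bias   # treated = heavy
hl_reverse = bias / true_ratio   # treated = light
corrected = label_swap_ratio(hl_forward, hl_reverse)
print(hl_forward, corrected)
```

Under this error model the bias appears in both the numerator and the denominator and cancels, so the combined estimate recovers the true ratio while the forward replicate alone underestimates it.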

A super-SILAC methodology was recently developed for a broad spectrum of applications [39]. The super-SILAC standard is a mixture of labeled cell lines differing in origin, stage, and subtype. Key advantages of super-SILAC include no limitation on the number of samples to be analyzed and a common standard for all experimental samples that can be stored for years [39, 40]. Although super-SILAC has not been used for CHO cells yet, it could greatly accelerate the advancement and accuracy of CHO proteomics.

iTRAQ (isobaric tags for relative and absolute quantification) is another protein labeling method in which the N-terminus and primary amino groups of digested peptides are covalently labeled, resulting in tandem mass spectra with tag-specific reporter ions [41].

Unlike SILAC, iTRAQ does not require adaptation to labeled media. Another advantage of the iTRAQ method is that it allows multiplexing of up to 8 samples. Apart from the iTRAQ method, TMT (tandem mass tag) is a separate isobaric labeling technique, as shown in Figure 2b. In comparison, TMT provides intensity measurement at the peptide fragmentation level and can provide multiplexing of up to 10 samples [42]. In general, isobaric labeling allows for relative quantitation of peptides based on reporter ions in the low m/z region of MS2 spectra. To obtain precursors with different masses that can be identified in MS1 spectra, non-isobaric labeling such as mTRAQ is also useful [43]. However, a comparison between mTRAQ and iTRAQ shows the superiority of iTRAQ over mTRAQ, which identified about double the number of proteins and triple the number of phosphopeptides [43]. Each of these labeling methods suits specific applications and is useful in providing novel insights into CHO cell physiology.

Figure 2. Labeling methods: a) SILAC labeling method: Cell populations are labeled with medium containing either heavy or light amino acids, mixed in a 1:1 ratio, and analyzed together by LC-MS/MS; b) TMT labeling method: Digested peptides are labeled with TMT reagents and then mixed before sample fractionation and MS analysis are performed. Quantitation can be done by bioinformatics tools.

Comparative proteomics methods such as iTRAQ and SILAC enable identification and relative quantification of numerous novel proteins. However, absolute quantitation of desired proteins with high accuracy is challenging despite the availability of these methods. For such requirements, targeted proteomics methods such as multiple reaction monitoring (MRM) [44, 45] and protein standard absolute quantification (PSAQ) [46] are useful. The MRM method coupled with stable isotope dilution MS can provide accurate quantitative measurements of target proteins. After the initial discovery stage, these methods can help establish high-throughput assays for assessing the changes in proteins of interest in various cell lines.
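The principle of stable isotope dilution in an MRM assay can be sketched with a single-point calculation. This is a simplified illustration, assuming that the endogenous (light) peptide and the spiked isotope-labeled (heavy) standard have identical instrument response and that no calibration curve is needed; the peak areas and spike amount are hypothetical.

```python
def absolute_amount(light_area, heavy_area, heavy_spike_fmol):
    """Estimate the endogenous peptide amount from a light/heavy peak area ratio.

    light_area: integrated MRM peak area of the endogenous peptide.
    heavy_area: integrated MRM peak area of the heavy-labeled standard.
    heavy_spike_fmol: known amount of heavy standard added to the sample.
    """
    if heavy_area <= 0:
        raise ValueError("heavy standard peak area must be positive")
    return light_area / heavy_area * heavy_spike_fmol


# Hypothetical peak areas and spike amount
amount_fmol = absolute_amount(light_area=3.0e6, heavy_area=1.5e6, heavy_spike_fmol=50.0)
print(f"{amount_fmol:.1f} fmol")
```

Because the standard and the analyte co-elute and fragment in the same way, the area ratio is largely insensitive to run-to-run variation in ionization.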

A variety of different mass spectrometers can be used after preparation of the digested peptides. Peptide fragments are ionized using electrospray ionization (ESI) or matrix-assisted laser desorption/ionization (MALDI). Subsequently, peptides are analyzed using MALDI or ESI in combination with TOF or Q-TOF.

For many experiments, both MALDI-TOF and tandem MS are used to increase result confidence [14, 20, 47, 48]. Examples of MS systems used for CHO proteomics identification and quantification include Thermo Linear Ion Trap – Fourier Transform Ion Cyclotron Resonance [48], Thermo nano-LC coupled to LTQ-FT [49], Micromass Q-TOF [28, 50], Thermo LTQ Orbitrap XL [20, 25, 26, 51], Thermo Orbitrap Discovery [19], Thermo LTQ Orbitrap Velos [14, 33, 48], Applied Biosystems QTRAP 2000 [28], Bruker Daltonik ultrafleXtreme MALDI-TOF-MS/MS [20], AB SCIEX 4800 MALDI-TOF/TOF [35], SYNAPT G2 HDMS [33], Waters QTOF Premier [19], and QSTAR-XL hybrid quadrupole-TOF [52, 53].

A number of different search engines have been used to map CHO mass spectra to the CHO genome, such as SEQUEST [20, 33, 54], Mascot [19, 20, 26, 28, 35, 47, 48, 50-52], X! Tandem [19], TagRecon [14], and MyriMatch [14]. Baycin-Hizal et al. improved protein identification through the use of TagRecon, a mutation-tolerant search engine, and the MyriMatch search engine, which can work in conjunction with IDPicker software to determine the false discovery rate during protein identification [14]. Over 90 percent of proteins were identified by both methods [14], confirming the identification of these proteins.

Other examples of software used in CHO proteomics studies are Proteome Discoverer [33, 54], ProteinLynxGlobalSERVER [33], TurboSEQUEST [20, 25], ProQuant [52], ProGroup [52, 53], ProteinPilot [35, 53], Progenesis [26], PEAKS [28], ProteinScape [20], GPS Explorer [35], Scaffold [19], IDPicker [14], and Mascot Distiller [19, 48]. These software tools are used for quantitative and qualitative analysis of proteomics data and are specific to the type of labeling and MS instrument used for the experiment.

Following protein identification, it is equally important to quantify protein expression levels. In label-free experiments, quantification is made by relating the peptide peak intensity to the total peptides in the sample [55]. Another method uses the peptide spectral count to determine the relative protein expression [56]. Differential labeling with SILAC, iTRAQ, TMT, and ICAT between samples enables quantification of protein expression levels. Labeling overcomes quantification barriers inherent in MS, such as differences in ionization and detection. Table 2 provides a comparison between labeling and ionization techniques commonly used in MS.

Table 2. Advantages and disadvantages of different methods used in mass spectrometry

Technique | Pros | Cons
SILAC | Delivers greater proteome coverage; Provides less quantitation error as compared to chemical labeling; Offers increased sensitivity than chemical labeling methods [57] | Renders inapplicability in tissue samples; Causes delays in cell adaptation to dialyzed serum; Costs more compared to chemical labeling methods [57]
iTRAQ | Supports relative quantitation in any cell type, including tissues; Offers wide-ranging data by labeling all peptides from tryptic proteolysis [58] | Needs longer MS and data exploration time; Demands small margin of error in sample preparation [59]
2-DE | Offers concurrent visualization of big amount of proteins; Increases visualization through differential display layout [60] | Exhausts more time compared to other techniques and is occasionally semi-quantitative; Produces difficulties for detecting hydrophobic proteins [60]
Matrix-assisted laser desorption ionization (MALDI) | Offers precise analysis of high-molecular weight proteins and small sample amounts; Has no effect from salts [61] | Delivers low shot to shot reproducibility; Relies strongly on sample preparation methods and has short sample life [61]
Electrospray ionization (ESI) | Provides high accuracy and a large mass range; Is relatively fast; Allows coupling with liquid chromatography for enhanced separation [61] | Presents relatively complex spectra; Is sensitive to salts [61]
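The label-free quantification described before Table 2, in which a protein's signal is related to the total signal in the sample, can be sketched with spectral counts. This is a minimal illustration, assuming simple total-count normalization; the protein names are taken from elsewhere in this chapter, but the counts are hypothetical.

```python
def normalized_abundance(values):
    """Express each protein's spectral count as a fraction of the sample total."""
    total = sum(values.values())
    return {protein: value / total for protein, value in values.items()}


def relative_expression(sample_a, sample_b):
    """Ratio of normalized abundances (sample_a / sample_b) for shared proteins."""
    norm_a = normalized_abundance(sample_a)
    norm_b = normalized_abundance(sample_b)
    shared = norm_a.keys() & norm_b.keys()
    return {protein: norm_a[protein] / norm_b[protein] for protein in sorted(shared)}


# Hypothetical spectral counts for two cultures
high_producer = {"HSPA5": 30, "ENO1": 50, "VCP": 20}
low_producer = {"HSPA5": 10, "ENO1": 70, "VCP": 20}
ratios = relative_expression(high_producer, low_producer)
print(ratios)
```

Normalizing to the sample total compensates for differences in the amount of material analyzed in each run, so the ratios reflect changes in relative rather than absolute abundance.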

It is important to minimize false positive rates, and for that reason, a minimum of two distinct matching peptides and 95% confidence can be required, a criterion that has enabled identification of over 300 differentially expressed proteins in the past [52]. Similarly, parsimony filtering can be applied to count only proteins with two distinct peptide sequences and at least six spectra [14]. Common minimum requirements are set in order to increase confidence levels of protein identifications and quantifications, as well as to minimize the false discovery rate.
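The minimum-evidence thresholds mentioned above can be sketched as a simple filter over peptide-spectrum matches (PSMs). This is an illustration only: it applies the two-distinct-peptide and six-spectra thresholds, but it does not perform the protein grouping step of a full parsimony analysis, and the function name and input data are hypothetical.

```python
from collections import defaultdict


def filter_proteins(psms, min_peptides=2, min_spectra=6):
    """Keep proteins supported by enough distinct peptides and enough spectra.

    psms: iterable of (protein, peptide_sequence) pairs, one per matched spectrum.
    """
    peptides = defaultdict(set)
    spectra = defaultdict(int)
    for protein, peptide in psms:
        peptides[protein].add(peptide)
        spectra[protein] += 1
    return sorted(
        protein
        for protein in peptides
        if len(peptides[protein]) >= min_peptides and spectra[protein] >= min_spectra
    )


# Hypothetical PSM list
psms = (
    [("A", "PEPTIDEK")] * 4 + [("A", "SAMPLER")] * 2   # 2 peptides, 6 spectra
    + [("B", "SINGLEK")] * 7                           # 1 peptide, 7 spectra
    + [("C", "FIRSTK"), ("C", "SECONDR")]              # 2 peptides, 2 spectra
)
print(filter_proteins(psms))
```

Only protein A satisfies both thresholds; B is supported by a single peptide sequence and C by too few spectra.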

2. Proteomics Applications in Biotechnology

2.1. Bioprocess development with Proteomics

In recent years, proteomics has been used extensively to gain insights in bioprocess development. Quantitative proteomics can be used to compare different CHO cell cultures and enable identification and quantification of differentially expressed proteins.

In order to identify proteins responsible for increased growth and understand the effect of transfection on the CHO cell proteome compared to non-transfected CHO cells, a shotgun proteomics approach was used [62]. Between the high-producing CHO cell line, which was transfected with the Bcl-xl gene, and the low-producing control CHO cell line, 32 differentially expressed proteins were identified, including those involved in protein metabolism, cytoskeletal structure, and cell cycle control [62]. Moreover, the molecular chaperone BiP, which is associated with protein folding in the endoplasmic reticulum (ER), was found to be upregulated in high-producing CHO cells, which suggests an increased unfolded protein response due to ER stress [62].

Recently, the CHO DG44 proteome was identified and a protein reference map was created to better understand its physiology for improved therapeutic protein production [24]. Identified proteins were involved in energy metabolism pathways such as glycolysis, cell redox homeostasis, TCA cycle, and ATP synthesis [24].

To compare high and low producing cell cultures over time, Meleady et al. used LC/MS-based proteomics. They identified 89 proteins with differential expression between high and low producing cell lines [25]. In particular, 12 proteins differed in expression levels in the same direction, including aldose reductase-related protein 2, annexin, eukaryotic translation initiation factor, glucose-6-phosphate 1-dehydrogenase, endoplasmin, and nuclear migration protein [25]. Proteins that were expressed at higher levels in the high producing cell line were involved in translation and folding [25].

2.2 Proteomics analysis to increase cell growth rate and viable cell density

Maximizing cell growth rate is one of the critical bioprocess development goals. Cell lines with different growth profiles have been investigated using proteomics to identify significantly changing proteins between slow and fast growing cultures.

An integrated transcriptomics, microRNAomics and proteomics method was used to study CHO cell clones with high and low growth rates to reduce false positive and negative identifications. The combined approach identified a correlation between cell growth rate and mRNA processing or protein synthesis. A total of 51 microRNAs were identified as associated with growth rate in this study, including downregulated miR-451, which is linked to growth inhibition and apoptosis induction, in high growth rate CHO cells [63]. This integrated omics approach was also used to compare four cell lines with fast and slow growth rates [21]. This study identified 118 gene transcripts and 58 proteins that were differentially expressed between the cell lines. Following overlap comparison of the datasets and functional analysis, valosin-containing protein (VCP) was determined to significantly affect cell growth and viability. This finding was confirmed by silencing VCP expression, which resulted in decreased viable cell density and viability [21].

Comparative proteomics using iTRAQ labeling was used to understand dynamic changes in CHO cell cultures and to identify proteins associated with cell growth and apoptosis [54]. Fifty-nine differentially expressed proteins were identified, related to cell metabolism (e.g. increasing mitochondrial ALDH2), protein folding (e.g. increasing heat shock proteins HSPE1, HSPA5, and HSPA9), and nucleic acid binding [54]. Pathway analysis revealed 29 proteins with functional associations related to hematological function, development, hematopoiesis, and apoptosis (e.g. Bcl-xl, an apoptosis inhibitor, which was expressed more in stationary phase than in exponential phase) [54].

In another study, Meleady et al. used in-solution protein digestion and label-free LC-MS to determine the effect of miR-7 overexpression on the proteome. They identified 93 downregulated proteins and 74 upregulated proteins resulting from overexpression of the microRNA. Potential cell engineering targets were identified which were related to protein translation, DNA/RNA processing, protein folding, and secretion [26]. In addition to protein production, overexpression of microRNAs can affect processes such as cell growth and apoptosis [64].

Differential protein expression in CHO-K1 cells over-expressing c-Myc was compared to that of control CHO-K1 cells [23]. The proteomics experiments showed an increase in nucleolin (important for growth, preventing apoptosis, protein productivity, and energy utilization) and a decrease in the regulation of adhesion proteins, which helps explain the improved culture performance. An increase in the alpha and beta subunits of ATP synthetase was also found, which may indicate a change in energy utilization through changes in the glycolytic pathway to release more ATP in c-Myc cultures [23].

Recently, proteomics was used to identify apoptosis-related proteins in anti-Rhesus D factor-producing CHO cells during non-induced apoptosis in prolonged cell culture [28]. Forty differentially expressed proteins, such as cytoskeletal proteins (e.g. downregulated vimentin/α-), chaperone and folding proteins (e.g. upregulated calreticulin), and glycolytic and metabolic enzymes (e.g. upregulated phosphoglycerate mutase) were identified between exponential and stationary phases. Additionally, changes in energy metabolism with the advancement of apoptosis were detected [28].

2.3 Proteomics analysis to increase recombinant protein production

Proteomics has also been applied to identify proteins that increase recombinant protein production. In one such study, recombinant mAb production in high and low producing CHO-GS cells was examined [49]. A comparison of the proteomes resulted in 180 differentially expressed proteins, related to cytoskeleton rearrangement, protein synthesis, cell metabolism, and cell proliferation, including proteins such as ADP-ribosylation factor protein, V-type proton ATPase, colony stimulating factor 1, angiopoietin 4, 60S ribosomal proteins, and NADH dehydrogenases [49].

Comparative proteome profiling of rCHO cells exposed to media containing 0.5 mM butyrate and 80 µM zinc sulphate [27] identified increased expression of metabolic and chaperone proteins including GRP75, enolase, and thioredoxin. These proteins were suggested as potential cell engineering targets for increasing recombinant protein production [27].

An integrated transcriptomics and proteomics method was used to identify differentially expressed proteins in mAb-producing CHO cell lines [51]. A positive correlation between productivity and DHFR, adaptor subunits, DNA repair protein DDB1, and ER translocation subunit SRPR, and a negative correlation between productivity and subunits of molecular chaperone T-complex protein 1 and mitochondrial metabolism regulator MTHFD2 were discovered [51].

In another example of the integrated transcriptomics and proteomics approach, a research group studied the effect of low temperature and NaBu treatment on immunoglobulin G (IgG) secretion in CHO cells [53]. Cells growing at low temperature (33 °C) with NaBu treatment had improved secretory capacity and therefore improved IgG secretion, which could have resulted from enrichment of elements of the secretory pathway such as the Golgi apparatus, cytoskeleton protein binding, and small GTPase-mediated signal transduction [53].

3. Proteomics to optimize media formulations

Proteomics serves as a useful method for optimizing media formulation by identifying differences in the proteomes of CHO cells growing in different media. Baik et al. tested proteomics as a screening tool for rCHO cells by identifying over-expressed protein candidates in serum-bearing media as compared to SFM [47]. They selected two molecular chaperones, HSP60 and HSC70, for overexpression and observed up to a 15% increase in cell concentration and up to a 33% decrease in adaptation time during serum-free adaptation [47]. A proteomics approach can help identify candidates for cell line engineering in different media conditions to increase cell density and decrease adaptation time.

A shotgun proteomics approach was used to develop serum and animal-component free single-cell cloning media [33]. Eight growth factors were identified as potential supplements, including leukemia inhibitory factor, vascular endothelial growth factor C, and fibroblast growth factor 8. Supplementation in SFM resulted in up to a 30% improvement in CHO cell cloning efficiency [33].

Hydrolysates were optimized in SFM to study their effect on the rCHO cell growth rate and mAb production [50]. Proteins differentially expressed in SFM with and without hydrolysates were identified, including upregulated metabolism proteins (e.g. PGK and ENO, which are involved in glycolysis), upregulated proliferation proteins, upregulated cytoskeleton-associated proteins, downregulated anti-proliferative proteins (e.g. TCTP, associated with reduction in cell growth and alteration in cell morphology), and downregulated pro-apoptotic proteins (e.g. ASPP2, associated with induction of apoptosis in COS-1 cells) [50].

4. Systems biology and proteomics database development

Since the CHO-K1 genome was released [65] and sequence information from other cell lines was made available [66], systems biology approaches have been used in conjunction with the CHO proteome and other ‘omics data to better understand the CHO cell host.

With increasingly available ‘omics resources, including genomics, proteomics, glycomics, transcriptomics, and metabolomics, an integration of ‘omics data can increase understanding of CHO cell physiology at a systems biology level. For example, when this integrated data was overlaid on metabolic pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG), it significantly increased the CHO knowledge base compared to data from a single ‘omics platform [22]. An improved CHO metabolic reconstruction would result in high-confidence in silico flux predictions.

Baycin-Hizal et al. used CHO genomic information along with the CHO-K1 proteome to determine the codon frequency in CHO cells. In conjunction with transcriptomic data and KEGG pathway analysis, they identified highly expressed as well as depleted pathways in CHO-K1 cells [14]. The CHO-K1 draft genome [65], together with the Bielefeld-BOKU-CHO [20], SwissProt, and NCBI databases, was used to identify CHO proteins, resulting in a 50% improvement in protein identification, with most related to protein translation and energy metabolism, and improved statistical confidence [20].

For added specificity, a CHO-specific genome database has been implemented based on increasing sequence information for protein identification. This database is based on the CHO and Chinese hamster genome sequencing [5, 65]. The CHO-specific protein database builds on genomics information and can be referenced at www.chogenome.org [66]. This database is being built as a one-stop solution for CHO related information and genome scale data from CHO cells, including microarray, transcriptomics, proteomics, microRNA, and metabolomics. A large-scale proteomic database for CHO cells can be accessed at www.chogenome.org/proteome.php, as shown in Figure 3. This database can be used to obtain information regarding 6163 detected proteins [14]. The detected proteins and their accession numbers are listed on the proteome search page of the website. The protein of interest can be searched either by protein name or accession number. After the proteins that meet the entered search criteria are listed, one can click on the accession numbers to get more detailed information. Each protein entry lists general information such as its SwissProt annotation, gene ontology annotation, and KEGG annotation, and more specific information, such as identified peptide sequences and false discovery rate.

Figure 3. CHO Proteome Database. Screenshots showing navigation steps on the CHO Proteome webpage. After clicking the proteome icon on the www.chogenome.org webpage, users are redirected to the CHO proteome database webpage. To look into shotgun proteomics results, users can click on the relevant icon. On the next page, users can enter either a protein name or a GI accession name to search for a protein. Upon clicking the hyperlinked accession number, users are redirected to a Proteome View page where information such as protein name, identified peptide sequence, number of identified spectra, etc. is available for the particular protein.

An example of a recently developed proteomics software tool is Diffprot, which uses resampling statistical tests and local variance estimates to improve statistical analysis and reduce false positives [48]. After importing data using various quantification methods, the data is filtered based on tandem MS identification and quantification quality. The values are processed by log transformation, normalization, impurity correction, peptide to protein ratio, and statistics [48].

Conclusions

The use of proteomics as a tool for improving biotherapeutics production from mammalian, and especially CHO, cells was described in this review. Platforms used for increased protein coverage such as MUDPIT and 2D-LC, and labeling techniques used for quantitative proteomics such as SILAC and iTRAQ, were discussed extensively. These methods are currently used to optimize media formulations, increase cell growth, and increase recombinant protein production. Continued improvements in the generation of high-producing cell lines are expected to increase both the specific and volumetric productivities of biotherapeutics. To complement the data produced by proteomics and other ‘omics platforms, more bioinformatics tools and databases are being developed, such as Diffprot and the CHO proteome database. Such advancements will have an economic impact on manufacturing by discovering novel candidates for cell line engineering or novel media components for increased growth.

Chapter 6: Elucidation of the CHO Super-Ome (CHO-SO) by BioInfo-Proteomics

Summary

Chinese hamster ovary (CHO) cells are the preferred host cell line for manufacturing a variety of biologicals including monoclonal antibodies. We performed a proteomics and bioinformatics analysis on the spent medium from CHO cells. Supernatant from CHO-K1 culture was collected following centrifugation and subjected to in-solution digestion followed by LC-LC/MS/MS analysis. From the analysis of the post-centrifugation supernatant, we identified 3281 unique host cell proteins from the CHO cells. These identified proteins include both secreted and intracellular proteins, the latter likely released due to the high speed centrifugation. In order to categorize these proteins functionally, we applied multiple bioinformatics tools including SignalP, TargetP, SecretomeP, TMHMM, WoLF PSORT, and Phobius. This analysis provided information on the cellular localization of the proteins found in the “superome”, including the presence of transmembrane domains and signal peptides. Proteins were shown to be localized to the secretory pathway, including ones playing roles in cell growth, proliferation, and folding as well as those involved in degradation and removal of other proteins. After combining predicted secreted proteins and the proteins predicted to have a signal peptide, we identified more than 1000 proteins, which we termed the CHO Super-Ome (CHO-SO). As a part of this effort, we also created a publicly accessible web-based tool called GO-CHO (http://ebdrup.biosustain.dtu.dk/gocho/) to functionally categorize the proteins found in CHO-SO and to identify enriched molecular functions, biological processes, and cellular components. Among enriched molecular functions were catalytic activity and structural constituents of cytoskeleton. Various transport related biological processes such as vesicle mediated transport were found to be highly enriched. Extracellular space and vesicular exosome were found to be the most enriched cellular components. CHO-SO also included proteins secreted by both classical and non-classical secretory pathways. This work and database will enable the CHO community to rapidly identify high abundance HCPs in their cultures in order to facilitate processing and purification efforts in the future. This analysis will help the CHO community in understanding both host cell proteins released by CHO cells as well as the critical components of the secretory apparatus.

1. Introduction

Mammalian cell lines are the preferred hosts for the production of numerous recombinant proteins, especially those of biotherapeutic interest, due to their ability to secrete complex proteins with post-translational modifications accepted by humans. Due to the adaptability of mammalian cells to protein synthesis and secretion, thirty-two biotherapeutic products from mammalian cells were approved by regulatory authorities between 2006 and 2010 [1], and current trends project that monoclonal antibody production in mammalian-based systems will double the 2010 value by 2016 [2]. Among mammalian cells, Chinese Hamster Ovary (CHO) cells have been the most widely used cell line due to their ease of cultivation in suspension culture, adaptability to different media compositions, and capability to produce proteins with post-translational modifications compatible with humans. Proteomics can serve as a useful tool to identify and quantify secreted proteins and to provide general insights into the physiology of cell lines. This information may lead to further improvement of the CHO cells’ production capabilities and assist in eliminating undesirable proteins during the purification process. CHO cell proteomics has previously been used to identify and quantify the proteins involved in growth, protein bioprocessing, metabolism, glycosylation, and apoptosis [3,4,5,6,7,8,9,10,11,12,13,14,15,16]. However, there has only been a limited analysis of the CHO proteins in the extracellular environment [17].

By using mass spectrometry-based proteomics methods, we have characterized the supernatant of CHO cell culture post-centrifugation. The protein groups identified in this supernatant, described here as CHO-SO, may influence cell growth, development, differentiation, and many other cell features. Computational analysis was used to characterize the CHO-SO and to filter potentially secreted proteins using different bioinformatics tools including (1) SignalP (http://www.cbs.dtu.dk/services/SignalP/) which predicts proteins secreted by the classical pathway, (2) SecretomeP (http://www.cbs.dtu.dk/services/SecretomeP/) which predicts proteins secreted by the non-classical pathways [18], (3) TMHMM (http://www.cbs.dtu.dk/services/TMHMM-2.0/) which predicts transmembrane helices [19], (4) Phobius (http://phobius.sbc.su.se/) which predicts signal peptides and various regions of a transmembrane protein sequence [20], (5) TargetP (http://www.cbs.dtu.dk/services/TargetP/) which predicts signal peptides in protein sequences as well as their subcellular locations [21], (6) WoLF PSORT (http://wolfpsort.org/) which predicts protein subcellular localization [22,23], (7) Secreted Protein Database or SPD (http://spd.cbi.pku.edu.cn/) which contains information on secreted proteins from human, mouse, and rat proteomes, including sequences from SwissProt, Trembl, Ensembl, and RefSeq [24], and (8) Signal Peptide Database (http://www.signalpeptide.de/) which provides signal peptide sequences for mammals containing more than 2000 confirmed sequences. From these tools, we were able to identify proteins that represent the most likely candidates for the secretome. In addition to the above bioinformatics tools, we have implemented a publicly available gene ontology (GO) web-based tool (http://ebdrup.biosustain.dtu.dk/gocho) for annotating gene products from our dataset.

CHO cells are widely used for the production of monoclonal antibodies (mAbs) and other heterologous proteins. While these mAbs can be purified using protein A and other methods, the purified proteins are often accompanied by a number of additional CHO host cell proteins (HCPs). These HCPs represent contaminants in the product mAb and must be removed during one of the purification steps. However, not all CHO host cell proteins may be removed from the end product. Unfortunately, these HCP impurities [17,25] can have immunogenic effects [26] and can also affect product quality and stability. They can cause formation of undesired product variants due to the enzymatic activities of some HCP species such as proteases and disulfide reductases [27,28,29,30,31,32]. For this reason, it is very important to characterize the HCPs and, if possible, develop methodologies for eliminating them from the product mix [17]. The targeted removal of HCPs from CHO cell cultures will require greater knowledge of the proteins’ identity and characteristics and also represents a regulatory requirement [33,34,35]. The current methods for quantitative estimation of HCPs lack detailed information regarding the properties or composition of the HCPs from CHO cells [36].

The bioinformatics strategies presented in this study, together with the developed CHO gene ontology tool, will help to better identify and characterize known and possibly unknown CHO HCPs. In addition, this study will serve as a basis for understanding the CHO secretory machinery, since it categorizes the proteins containing N-terminal signal peptides and transmembrane domains for compartmentalization and provides GO information for translocation, protein folding, O-glycosylation and N-glycosylation in the endoplasmic reticulum. A more complete understanding of the CHO secretome will facilitate current bioprocessing methodologies and provide insights into how to enhance secretory processes from CHO cells in the future.

2. Materials and Methods

2.1 CHO Cell Samples and Isolation Materials

The CHO-K1 (CCL-61) cell line was purchased from ATCC (Manassas, VA). F-12K medium was obtained from GIBCO (Grand Island, NY) along with fetal bovine serum (FBS), L-glutamine, non-essential amino acids, and DPBS. Sequencing grade trypsin was purchased from Promega (Madison, WI). The BCA protein assay kit was purchased from Thermo Scientific Pierce (Rockford, IL). Other reagents used were Tris (2-carboxyethyl) phosphine (TCEP) (Pierce, Rockford, IL), trifluoroethanol (TFE) (Sigma-Aldrich, Milwaukee, WI), and ultra filters (Waters, Milford, MA). All other chemicals used in this study were purchased from Sigma-Aldrich (St. Louis, MO).

2.2 Cell Culture and Protein Lysate Preparation

CHO-K1 cells were cultured in F-12K medium supplemented with 10% FBS, 1% nonessential amino acids, and 2 mM L-glutamine. After the cells reached 80% confluency, the medium was decanted and the cells were washed six times with PBS. Subsequently, the cells were starved for 12 h in serum-free medium. The supernatant was then collected and the proteins were concentrated by centrifugation with 3 kDa ultrafilters.

2.3 In-solution Digestion

The protein concentration was determined using the BCA assay. The filter aided sample preparation (FASP) method [37] was used prior to digesting the proteins with trypsin (1:50 ratio) at 37 °C overnight. The digested samples were separated into 96 fractions with a bRPLC method adapted from Wang et al. [38]. The 96 fractions were collected and concatenated into 12 fractions by merging the samples [3]. The experiment was replicated with two CHO cell cultures.
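The text does not state which of the 96 bRPLC fractions were merged into each of the 12 pooled fractions. The sketch below assumes an interleaved scheme (every twelfth fraction goes to the same pool), which is a common way to concatenate bRPLC fractions; the function name and the scheme itself are illustrative assumptions, not the protocol used in this study.

```python
def concatenate_fractions(n_fractions=96, n_pools=12):
    """Assign 1-based fraction numbers to pools in an interleaved fashion.

    ASSUMPTION: fraction f goes to pool ((f - 1) mod n_pools) + 1. The source
    only says the 96 fractions were merged into 12.
    """
    pools = {pool: [] for pool in range(1, n_pools + 1)}
    for fraction in range(1, n_fractions + 1):
        pool = (fraction - 1) % n_pools + 1
        pools[pool].append(fraction)
    return pools


pools = concatenate_fractions()
for pool, members in pools.items():
    print(pool, members)
```

Under this assumed scheme each pool combines eight fractions that are spread evenly across the first-dimension gradient.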

2.4 LC − MS/MS Analysis

In order to analyze the fractions from CHO cell protein digests, twelve LC−MS/MS analyses were performed on an LTQ-Orbitrap Velos (Thermo Electron, Bremen, Germany) mass spectrometer coupled to an Eksigent 2D nanoflow LC system. The reverse phase-LC system consisted of two parts: a peptide trap column (75 μm × 2 cm) and an analytical column (75 μm × 10 cm), both packed with Magic AQ C18 material (5 μm, 120 Å, www.michrom.com). After elution, the peptides were sprayed directly into the LTQ Orbitrap Velos at 2.0 kV, with a flow rate of 300 nL/min, using an electrospray emitter tip (internal diameter: 8 μm; New Objective, Woburn, MA), with a capillary temperature of 200 °C. The entire tandem MS analysis was carried out in the Orbitrap at 60000 and 7500 resolution (measured at m/z 400) for precursor and fragment ions, respectively. FTMS full MS and MSn AGC targets were set to 1 million and 50000 ions, respectively. Survey scans were acquired from m/z 350 − 1800, with up to 15 peptide masses (precursor ions) individually isolated with a 1.9 Da window and fragmented (MS/MS) using a collision energy of 35% in a higher-energy collisional dissociation (HCD) cell, with 30 second dynamic exclusion. The minimum signal requirement for triggering an MS2 scan was set to 2000 and the first mass value was fixed at m/z 140. An ambient air lock mass was set at m/z 371.10123 for real time calibration [39]. Monoisotopic precursor mass selection and rejection of singly charged ions were enabled for the MS/MS analysis.

2.5 Database Searching and MS/MS Data Analysis

To analyze the MS/MS data, the RefSeq annotation of CHO cells derived from the CHO genomic sequence was used [40]. The Mascot search engine was used for the searches with two False Discovery Rate (FDR) cut-offs: 1% and 5%. Semitryptic enzyme specificity was chosen, allowing a maximum of 2 missed cleavages, with precursor ions required to fall within 10 ppm of projected m/z values and a fragment ion mass tolerance of 0.5 m/z. The variable modifications included both oxidation (M +15.996) and pyroglutamine (N-terminal Q − 17.027). Moreover, a fixed modification of carbamidomethylation (C +57.021) was specified.
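To make the 10 ppm precursor tolerance concrete, the following minimal sketch computes the parts-per-million deviation between an observed and a projected m/z value. The function names and the example m/z values are hypothetical, and this is not how Mascot implements its tolerance check internally.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tol_ppm=10.0):
    """True if the observed m/z lies within +/- tol_ppm of the projected m/z."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


# Hypothetical precursor: projected m/z 500.000, observed m/z 500.004.
print(ppm_error(500.004, 500.000))
print(within_tolerance(500.004, 500.000))
```

At m/z 500, a 10 ppm window corresponds to an absolute deviation of 0.005 m/z, which is far narrower than the 0.5 m/z tolerance applied to fragment ions.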

2.6 Gene Ontology (GO) Annotation

To find the GO annotation of the secreted proteins, GO Cross Homology was obtained using the GO-CHO platform. GO-CHO is built using the Python-based Django web framework (https://www.djangoproject.com/) and utilizes a MySQL database to store datasets of GO terms related to gene products of interest. It uses up-to-date GO annotation from http://geneontology.org/ [41].

2.7 Subcellular Localization and Protein Sequence Analysis

For determining subcellular localization of the identified protein sequences, we implemented a coupled use of the amino-acid sequence-based predictors TargetP, SignalP, SecretomeP, TMHMM, Phobius, and WoLF PSORT, to increase our confidence in classifying secreted proteins [42]. Default D-cutoff values were chosen to optimize the performance of the search in SignalP. In order to increase specificity, the default cutoff was used in TargetP. The normal prediction method was used in Phobius to predict subcellular localization of the proteins. Along with these predictors, the open access Secreted Proteins Database [43] was also used to identify secreted proteins from other eukaryotes. Additionally, mammalian signal peptides were obtained from an online database, the Signal Peptide website (http://www.signalpeptide.com/).

2.8 GO and KEGG Enrichment analyses

GO terms and corresponding genes were found as described above. KEGG pathways and corresponding genes were downloaded from the KEGG website (http://www.genome.jp/kegg/). Programming tasks were performed using MATLAB version 2010a [Natick, Massachusetts: The MathWorks Inc., 2010.]. Enrichment P values were calculated from the hypergeometric distribution using MATLAB’s hygecdf and hygepdf functions.
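The hypergeometric calculation can be sketched as follows. This is a standard-library Python illustration rather than the MATLAB code used in the study, and the exact tail definitions (upper tail for enrichment, lower tail for depletion) are assumptions, since the text only names the hygecdf and hygepdf functions. All function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_pmf(hits, population, annotated, sample):
    """P(X = hits) when drawing `sample` proteins without replacement from a
    `population` in which `annotated` proteins carry the term of interest."""
    return (comb(annotated, hits) * comb(population - annotated, sample - hits)
            / comb(population, sample))


def enrichment_p(hits, population, annotated, sample):
    """Upper-tail probability P(X >= hits) (assumed enrichment definition)."""
    upper = min(annotated, sample)
    return sum(hypergeom_pmf(i, population, annotated, sample)
               for i in range(hits, upper + 1))


def depletion_p(hits, population, annotated, sample):
    """Lower-tail probability P(X <= hits) (assumed depletion definition)."""
    lower = max(0, sample - (population - annotated))
    upper = min(hits, annotated, sample)
    return sum(hypergeom_pmf(i, population, annotated, sample)
               for i in range(lower, upper + 1))


# Toy example: 10 proteins in the background, 5 annotated with a term,
# 5 proteins in the set of interest, all 5 of which carry the term.
print(enrichment_p(5, 10, 5, 5))
```

In MATLAB terms, the upper tail corresponds to 1 - hygecdf(hits - 1, ...) and the lower tail to hygecdf(hits, ...), with hygepdf giving the individual point probabilities.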

2.9 Immunogenicity Prediction

A publicly available tool (http://tools.immuneepitope.org/immunogenicity/) was used for predicting immunogenicity of the proteins based on peptide sequences.

3. Results

In order to characterize the CHO supernatant proteome, proteins were isolated from the cell culture broth, analyzed by mass spectrometry, and subjected to multiple bioinformatics analyses as outlined in Figure 1. During the bioinformatics analysis, a number of filters were implemented, including SignalP, TargetP, SecretomeP, WoLF PSORT, TMHMM, and Phobius, along with databases such as the Secreted Protein Database (SPD) and the Signal Peptide Database, in order to identify proteins that are present in the plasma membrane and extracellular environment of CHO cells. To further analyze these predicted secreted and transmembrane proteins, the positive results from the above tools were combined into three subcategories: 1) proteins with signal peptides, 2) secreted proteins, and 3) proteins with a transmembrane domain, as shown in Figure 1. Proteins from these subcategories were then combined, resulting in 2660 candidates after removal of duplicate entries, that is, unique proteins which may be secreted and/or contain a signal peptide or transmembrane domain (Figure 1). The duplicate entries arose from proteins with positive predictions in more than one of the above subcategories, e.g., proteins both containing a signal peptide and predicted to be secreted. Next, gene ontology (GO) annotation of the combined proteins was obtained using our newly established GO-CHO website, described in detail below. GO annotation provided cellular localization for each protein, which was used as a filter to remove proteins annotated exclusively with intracellular cellular components. After removing the intracellular proteins, we obtained a protein dataset that we termed CHO-SO (CHO SupernatantOme or CHO SuperOme). These 1015 proteins were then subjected to GO enrichment analysis, KEGG pathway enrichment analysis, and Ingenuity pathway analysis (IPA) to obtain enriched and depleted GO terms and KEGG pathways as well as IPA knowledgebase functional groups, which assisted us in understanding the biology and secretion mechanism of CHO cells (Figure 1).
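The combination and filtering logic described above can be sketched with set operations. The protein identifiers, GO term labels, and function names below are hypothetical, and keeping proteins that have no cellular-component annotation is an assumption; the text specifies only that proteins with exclusively intracellular components were removed.

```python
def combine_subcategories(signal_peptide, secreted, transmembrane):
    """Union of the three prediction subcategories; a set union removes the
    duplicates that arise when a protein is positive in several subcategories."""
    return set(signal_peptide) | set(secreted) | set(transmembrane)


def remove_exclusively_intracellular(candidates, go_components, intracellular_terms):
    """Drop proteins whose GO cellular-component terms are all intracellular.

    ASSUMPTION: proteins without any cellular-component annotation are kept.
    """
    kept = set()
    for protein in candidates:
        terms = go_components.get(protein, set())
        if terms and terms <= intracellular_terms:
            continue  # annotated, and every annotation is intracellular
        kept.add(protein)
    return kept


# Toy data with made-up identifiers and term labels.
signal_peptide = {"P1", "P2"}
secreted = {"P2", "P3"}
transmembrane = {"P4"}
go_components = {
    "P1": {"extracellular space"},
    "P2": {"nucleus", "extracellular space"},
    "P3": {"nucleus"},
}
intracellular_terms = {"nucleus", "cytosol"}

candidates = combine_subcategories(signal_peptide, secreted, transmembrane)
superome = remove_exclusively_intracellular(candidates, go_components, intracellular_terms)
print(sorted(candidates), sorted(superome))
```

In the study, the analogous two steps reduced the 3281 identified proteins to 2660 candidates and then to the 1015 proteins of CHO-SO.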

These filtering steps, described in detail in subsequent sections, reduced the number of proteins to 2660 and further to 1015 after analysis of cellular compartmentalization. In addition to functionally categorizing the CHO supernatant, relative quantification of the proteins based on normalized spectral abundance factor (NSAF) values was performed to identify and characterize high abundance proteins which could be potential host cell proteins. In conjunction with a literature search for previously reported CHO host cell proteins and immunogenic proteins, novel proteins from this study were identified and their immunogenicity was predicted in order to identify novel immunogenic CHO proteins. Each of these steps and the results of this analysis are described in greater detail in the following sections.


Figure 1. Overview of the process of obtaining functionally categorized CHO-SO through various filtering strategies and analysis techniques along with process of obtaining high abundance immunogenic CHO proteins.

3.1 CHO Superome (CHO-SO) protein extraction, mass spectrometry experiment, and data analysis

Common approaches for identifying secreted proteins involve proteomic analysis of conditioned culture medium from the cell type of interest. In one approach, cells are grown in serum bearing medium. However, this method usually necessitates extensive fractionation of proteins/peptides in order to detect low-abundance secreted proteins among thousands of high-abundance serum proteins [44,45]. An alternative to this approach, used in this study, was to deplete the serum after growing the cells, thereby reducing analytical interference significantly and also increasing the ability to detect relatively low-abundance secreted proteins [46,47].

To characterize the supernatant of CHO-K1 cell culture, the cells were grown in duplicates in serum bearing media. After 2 days of growth, the serum was depleted and the supernatant was collected 12 hours later. As shown in Figure 1, the collected CHO supernatant was concentrated by vacuum centrifuge and ultrafiltration prior to trypsin digestion. Two dimensional liquid chromatography was used to fractionate the proteins prior to LC/MS/MS. A total of 24 fractions from the two replicates of CHO supernatant were analyzed in the mass spectrometer and the data was analyzed using the Mascot search engine using the CHO genome for peptide and protein identification [40].

In order to ensure high quality of data, a variety of filtering strategies were applied – A) a stringent 1 % (False Discovery Rate) FDR cutoff with less than 2 peptides and less than 6 peptide spectrum matches (PSMs), B) Cutoff of 1% FDR regardless of the number of peptides and PSMs, C) Cutoff of 5% FDR with less than 2 peptides and less than 6 peptide spectrum matches (PSMs), and D) Cutoff of 5% FDR regardless of the number of peptides. These filters resulted in 3281 CHO proteins being identified from the superome based on all criteria combined [48]. A summary of the filtering results along with categorization based on number of peptides and peptide spectrum matches (PSMs) is tabulated in Table 1. Of the total 3281 grouped proteins identified, 2718 exhibited at or below a 1% FDR which is, to our knowledge, the highest number of proteins reported in the supernatant of CHO cells so far.

Table 1. Results summary from proteins identified from the CHO supernatant

MS Results   Protein Identification        Protein Identification        Total Proteins
             (Peptides >2 and PSMs >6)     (Peptides <2 and PSMs <6)     Identified
1% FDR       1340                          1378                          2718
5% FDR       8                             555                           563
Total        1348                          1933                          3281

3.2 Relative Quantification of the proteins in the CHO supernatant

In order to elucidate the high abundance proteins in CHO supernatant and their properties, normalized spectral abundance factor (NSAF) values of each protein were calculated using a method previously described [49]. A histogram of NSAF values of all 3281 proteins reported in section 3.1 is shown in Figure 2a. The proteins in the supernatant show a wide range of abundance, with log2 NSAF values from -21.33 to -6.33. Ninety-two proteins showed log2 NSAF values higher than -8.94, more than two standard deviations above the mean, and were thus considered to be high abundance proteins. These proteins were analyzed with IPA (Ingenuity Pathway Analysis) software to evaluate the related functional networks; an example of one such network is shown in Figure 2b. From the IPA analysis, it was found that proteins such as SPARC (secreted protein acidic and rich in cysteine) and CLU (Clusterin), which are both related to folding of proteins, cell survival functions, binding, and cell growth, are high in abundance. The extracellular matrix glycoprotein SPARC, secreted by many different cell types such as osteoblasts, fibroblasts, endothelial cells, and platelets [50,51], is involved in (a) disruption of cell adhesion [52], (b) changes in cell shape [53], (c) inactivation of cellular responses to certain growth factors such as PDGF [54], and (d) extracellular matrix synthesis, developmental processes, angiogenesis, and binding to growth factors.

Clusterin, a heavily glycosylated protein [55] ubiquitously present in many tissues, functions as an extracellular chaperone that prevents aggregation of nonnative proteins and maintains partially unfolded proteins in a state appropriate for subsequent refolding by other chaperones, such as HSPA8/HSC70.

Other high-abundance proteins in CHO-SO, PpiA (NSAF: 0.01619) and PpiB (NSAF: 0.00638), are thought to accelerate the folding of proteins critical to protein export and secretion. PpiA has an N-terminal uncleavable hydrophobic domain and is predicted to be an N-in C-out transmembrane domain protein [56].

Figure 2. Analysis of all CHO-SO genes. a) A histogram of the number of proteins in each normalized spectral abundance factor (NSAF) bin, with an overlaid normal distribution curve. Two standard deviations (one standard deviation = 2.4689) above the mean (-13.8317) correspond to the right-hand bound of the 95% confidence interval, a Log2NSAF value of -8.8939. b) The ninety-two proteins with Log2NSAF values greater than -8.8939 were used to build the pathway network in IPA. The figure shows the network of genes corresponding to the functions cell survival and folding of proteins.
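The cutoff quoted in the caption can be reproduced from the reported mean and spread, reading 2.4689 as a single standard deviation (an interpretation, since the caption wording is ambiguous). The sketch also applies the cutoff to the two NSAF values quoted in the text for PpiA and PpiB; the function name is illustrative.

```python
import math

# Values reported in the Figure 2 caption.
mean_log2_nsaf = -13.8317
sd_log2_nsaf = 2.4689  # assumed to be one standard deviation

# Upper bound of the mean + 2 SD interval.
threshold = mean_log2_nsaf + 2 * sd_log2_nsaf

def is_high_abundance(nsaf_value):
    """Illustrative helper: True if log2(NSAF) exceeds the 2 SD cutoff."""
    return math.log2(nsaf_value) > threshold

print(round(threshold, 4))
print(is_high_abundance(0.01619))  # PpiA NSAF quoted in the text
print(is_high_abundance(0.00638))  # PpiB NSAF quoted in the text
```

Both quoted values lie above the cutoff, in line with their description as high-abundance proteins.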

Both SPARC and Clusterin have been identified as difficult-to-remove host cell impurities and are known to interact strongly with different monoclonal antibodies [25]. Other high-abundance proteins identified as HCPs by Valente et al. [25] are Igfbp4 (insulin-like growth factor-binding protein 4), Vim (vimentin), Enoa (enolase 1), Tpm1 (tropomyosin 1), and Ldha (lactate dehydrogenase A). Overall, 56 of the 92 high-abundance proteins in our data set are known host cell protein impurities [57].

One main concern with host cell protein contamination is potential immunogenicity. By comparing our data with previously published [58] immunogenic and host cell proteins [25], we identified 12 high-abundance proteins not previously reported as CHO HCPs or immunogenic CHO proteins. To identify the T-cell epitopes of these 12 proteins, we explored their immunogenicity using a publicly available tool (http://tools.immuneepitope.org/immunogenicity/) [59]. Of these 12 proteins, 8 were found to contain peptides in the top 20% of potentially most immunogenic peptides, with predicted immunogenicity scores higher than 0.15. Table 2 lists the results for all 10 immunogenic proteins.

Table 2. T-cell epitopes identified from high abundance novel CHO-SO proteins.

Gene Symbol   Protein Name                                       Peptide      Score
Srsf1         serine/arginine-rich splicing factor 1             PFAFVEFED    0.42225
Prdx6         peroxiredoxin-6                                    RVVFIFDPD    0.38324
Rps21         40S ribosomal protein S21                          ASNRIIGAK    0.35632
Hnrnpa1       heterogeneous nuclear ribonucleoprotein A1         KRGFAFVTF    0.33462
Hist1h3a      histone H3.1                                       LARRIRGER    0.3343
Tpm1          tropomyosin alpha-1 chain                          KQVEEELTH    0.31922
Hist1h3c      histone H3.2                                       QRLVREIAQ    0.31777
EIF5AL1       eukaryotic translation initiation factor 5A-1      YDCGEEILI    0.31419
Rps27a        ubiquitin-40S ribosomal protein S27a               PSDTIENVK    0.28877
Hspe1         Heat Shock 10kDa Protein 1                         LPLFDRVLV    0.21288

Srsf1 contains 12 epitopes within only 198 amino acids and belongs to a class of intrinsically disordered proteins [60], making it a likely candidate for screening by T-cells that detect aberrancies and generate an immunogenic response. Removing host cell proteins that generate an immunogenic response would therefore be desirable in the therapeutic protein manufacturing process.

3.3 Addressing the subcellular localization of the proteins

To functionally categorize the 3281 proteins reported in section 3.1, we used a number of publicly available bioinformatics tools, including SignalP, TargetP, SecretomeP, WoLF PSORT, TMHMM, and Phobius, and also searched for the proteins in the Secreted Protein Database (SPD), as shown previously in Figure 1. Each of these tools focuses on identifying either a signal peptide in a given protein sequence, to predict whether a protein is secreted, or a transmembrane domain. This process allowed us to categorize proteins into three subcategories: 1) proteins with signal peptides, 2) proteins with transmembrane domains, and 3) proteins in the extracellular domain. This categorization provided a second filter for the data, retaining for further analysis only those proteins assigned to the above three categories.

A summary of the results from all of the above search engines is provided in Figure 3a. Many proteins were detected in multiple categories (containing signal peptides, secreted, or containing a transmembrane domain), resulting in a large overlap among the datasets; a number of the positive results identified by one bioinformatics tool were also identified by another. The simple six-way Venn diagram in Figure 3a shows results from the different bioinformatics tools: 1) SignalP and Signal peptide database positive results combined, giving proteins with signal peptides, 2) SecretomeP positive results of secreted proteins, 3) Phobius and TMHMM positive results with proteins containing transmembrane domains, 4) TargetP positive results, also showing proteins containing a signal peptide, 5) WoLF PSORT positive results showing proteins in the extracellular space, and 6) Secreted Protein Database (SPD) positive results with secreted proteins. SignalP predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms based on a combination of several artificial neural networks [61]. TargetP, in contrast, predicts the subcellular location of eukaryotic proteins based on the predicted presence of any of the N-terminal presequences: chloroplast transit peptide (cTP), mitochondrial targeting peptide (mTP), or secretory pathway signal peptide (SP) [21]. For the sake of simplicity, all these results are provided without the overlap details of positive results from the different tools. For example, a number of the 2398 proteins identified by SecretomeP were also identified by another search tool such as SignalP or TargetP. Overall, 66 proteins were found by all analysis tools and 621 proteins were not identified by any of the aforementioned tools.

To identify proteins containing signal peptides, one approach was to use the signal peptide website built on the UniProt Knowledgebase (http://www.signalpeptide.de/) to perform a stand-alone BLAST (basic local alignment search tool) analysis [62]. This database, which contains sequences of predicted signal peptides from various species, was aligned against the sequences of our 3281 identified proteins using BLAST to identify proteins containing signal peptides. The positive results from this analysis were then combined with the positive results from the SignalP and TargetP analyses to further improve the prediction of signal peptide-containing proteins.
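A minimal sketch of how hits from such a BLAST run might be collected is given below. It assumes the alignment was exported in the standard 12-column tabular format (BLAST+ `-outfmt 6`) with the identified proteins as queries; the source does not state the output format, the query/subject orientation, or any thresholds, so the E-value cutoff, the N-terminal position requirement, and the sequence identifiers are all hypothetical.

```python
def signal_peptide_hits(blast_lines, max_evalue=1e-5, max_qstart=5):
    """Collect query IDs with a significant hit near the N-terminus.

    Expects tab-separated lines with the 12 default tabular columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    Both thresholds are illustrative assumptions, not values from the study.
    """
    hits = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        qseqid = fields[0]
        qstart = int(fields[6])
        evalue = float(fields[10])
        # Signal peptides sit at the N-terminus, so require the match to start early.
        if evalue <= max_evalue and qstart <= max_qstart:
            hits.add(qseqid)
    return hits

# Hypothetical example records.
example = [
    "protA\tSP0001\t95.0\t20\t1\t0\t1\t20\t1\t20\t1e-08\t45.1",
    "protB\tSP0002\t90.0\t18\t2\t0\t140\t157\t1\t18\t1e-06\t40.0",
    "protC\tSP0003\t80.0\t15\t3\t0\t2\t16\t1\t15\t0.5\t20.0",
]
blast_positive = signal_peptide_hits(example)
print(sorted(blast_positive))
```

The resulting set of identifiers could then be merged with the positive results from the other signal peptide predictors.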

Subsequently, the datasets from the different search engines were grouped into categories as shown in Figure 3b. First, the positive hits from the SecretomeP, SPD, and WoLF PSORT tools were combined to create a secreted proteins dataset (yellow circle in Figure 3b). Second, the positive results from SignalP, TargetP, and the Signal Peptide database were combined to create a dataset of proteins containing signal peptides (purple circle in Figure 3b). Third, the positive results from TMHMM and Phobius were combined to create a dataset of proteins containing transmembrane domains (green circle in Figure 3b). In this integrated Venn diagram, the overlaps of proteins in different categories were noted, including 447 proteins that were classified in all three categories. An additional 262 proteins were classified in at least two of the different categories.
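The grouping described above amounts to set unions within each category followed by intersections across categories. The sketch below illustrates this with Python sets; the protein identifiers and per-tool hit lists are hypothetical toy data, and only the tool-to-category assignment follows the text.

```python
from collections import Counter

# Hypothetical positive hits per tool (toy identifiers, not study data).
tool_hits = {
    "SecretomeP": {"P1", "P5", "P6"},
    "SPD": {"P1"},
    "WoLF PSORT": {"P6"},
    "SignalP": {"P1", "P2", "P3"},
    "TargetP": {"P2", "P4"},
    "Signal Peptide database": {"P1"},
    "TMHMM": {"P1", "P4", "P7"},
    "Phobius": {"P1", "P7"},
}

# Tool-to-category assignment as described for Figure 3b.
categories = {
    "secreted": ("SecretomeP", "SPD", "WoLF PSORT"),
    "signal_peptide": ("SignalP", "TargetP", "Signal Peptide database"),
    "membrane": ("TMHMM", "Phobius"),
}

all_detected = {"P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"}

# Union of tool hits within each category.
grouped = {
    name: set().union(*(tool_hits[tool] for tool in tools))
    for name, tools in categories.items()
}

# Count how many categories each protein falls into.
membership = Counter(p for members in grouped.values() for p in members)
in_all_three = {p for p, n in membership.items() if n == 3}
in_exactly_two = {p for p, n in membership.items() if n == 2}

# Candidates are proteins flagged by at least one tool; the rest are set aside.
candidates = set().union(*grouped.values())
unflagged = all_detected - candidates

print(in_all_three, in_exactly_two, sorted(unflagged))
```

Applied to the real data, the same logic would separate the 3281 detected proteins into flagged candidates and the 621 proteins not identified by any tool.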

Eliminating the 621 MS-identified proteins that were not flagged by any tool resulted in a list of 2660 potentially secreted proteins. Secretory proteins containing signal peptides included transferrin, lipoproteins, immunoglobulin domain proteins (e.g., SEMA3B, SEMA3C, and SEMA3E), collagens (e.g., COL12A1, COL16A1, and COL9A2), fibronectins (FN1), and proteoglycans (e.g., HSPG2, LEPRE1, and VCAN). These proteins are sorted in the trans-Golgi network into transport vesicles that immediately move to and fuse with the plasma membrane, releasing their contents by exocytosis.

In addition, proteins such as LRP6 (low density lipoprotein receptor-related protein 6), APP (amyloid beta A4 precursor protein), and PECAM1 (platelet/endothelial cell adhesion molecule 1), associated with receptor, cell adhesion, and binding functions, respectively, were found in the membrane protein dataset obtained using the TMHMM and Phobius TM tools. Extracellular proteins such as TGFB1 (transforming growth factor, beta 1), SEPT8 (septin 8), and PDGFA (platelet-derived growth factor alpha polypeptide), which have roles in growth factor activity, cytokinesis regulation, and cell migration, respectively, were categorized in the secreted protein dataset. Other proteins, such as SRPR (signal recognition particle receptor), the cytoskeleton remodeling and organization protein ENAH (enabled homolog), and the cytokinesis protein SEPT6 (septin 6), were grouped in the signal peptide category.

The 447 proteins found in all three categories by the search tools included ADAM17 (ADAM metallopeptidase domain 17), ALCAM (activated leukocyte cell adhesion molecule), and ERP29 (endoplasmic reticulum protein 29). Another protein in this category was SDC1 (Syndecan-1), a single-pass, integral membrane heparan sulfate proteoglycan known to contain a transmembrane domain shed under certain conditions [63,64]. This group also included the 66 experimentally identified proteins picked by all of the search engines, including SPARC (Secreted Protein, Acidic and Cysteine-Rich) and SERPINC1 (also known as antithrombin-III). SERPINC1 contains 465 amino acids preceded by a signal peptide of 32 amino acids [65]. This class of multifunctional proteins is involved in processes such as protease inhibition, regulation of cell-matrix interactions, and inhibition of cellular proliferation [66], possibly explaining the presence of a signal peptide and their location both at the membrane and in the extracellular space.

Six hundred twenty-one of the 3281 proteins detected by LC-MS were not detected by any of the search engines; they are shown in the left-hand corner of Figures 3a and 3b. An examination of these proteins reveals nuclear and cytoplasmic proteins such as BOD1L1 (Biorientation of Chromosomes in Cell Division 1-Like 1) and CCAR1 (Cell Division Cycle and Apoptosis Regulator 1), which have a role in cell division. The elimination of these candidates shows that established bioinformatics methods are a helpful tool for removing likely intracellular proteins from a secretome. Indeed, intracellular proteins are often detected in the supernatant of cell cultures as a result of cell bursting or lysis caused by the culture process or by subsequent processing steps, including centrifugation [17].


Figure 3. Results from bioinformatics analyses. a) Number of positive hits identified by the different bioinformatics tools; a protein may also be a positive hit in another tool. b) Protein groups after combining positive results from SecretomeP, WoLF PSORT, and Secreted Protein Database (SPD) as ‘Secreted Proteins’; positive results from SignalP, TargetP, and BLAST analysis on the signal peptide database as ‘Proteins with Signal peptides’; and positive results from TMHMM and Phobius as ‘Membrane Proteins’.

3.4 Gene Ontology (GO) Cross-species HOmology (GO-CHO) database

To further analyze these predicted secreted and transmembrane proteins, we combined the 2660 proteins in the aforementioned three subcategories to obtain a first cut of the CHO secretome (Figure 1). For an even stricter list of potential secretome proteins, a third filter was implemented using gene ontology (GO) annotation and categorization of the 2660 candidates with our newly established GO-CHO website, discussed below.

The genomes of human, mouse, rat, fly, and a few other model organisms are very well annotated in the current literature. CHO genes, however, lack such thorough annotation, which leaves a large gap in understanding their functions and cellular localization. Because of the common ancestry between the above species and CHO cells, a substantial percentage of gene products exist as homologs in the different species. This knowledge can be used for GO annotation of CHO genes, since a gene product may have similar or the same function and characteristics across related species. On this basis, the whole CHO genome has been functionally annotated and a web-based interface for finding CHO-specific gene ontologies has been established. The GO-CHO database can be reached at http://ebdrup.biosustain.dtu.dk/gocho/. In the current study, the proteins were searched against the GO-CHO database and all related terms from well-annotated species were extracted [67].

An example output from the GO-CHO website is displayed in Figure 4. The homepage of the website contains a link, “create a new dataset”, which can be used to create user datasets. Upon clicking “create a new dataset,” users are redirected to another webpage where they can find GO annotation for their datasets in four simple steps: 1) provide a name for the dataset, 2) paste the list of full gene/protein names for which annotation is required, 3) search for the species specific to the dataset and choose the species for annotation (e.g., Cricetulus griseus for Chinese hamster), and 4) submit the entries. Upon clicking submit, users are provided with an output containing the gene symbol, gene/protein name, molecular functions, biological processes, and cellular components of the gene/protein list.

For the current study, the 2660 candidate proteins obtained after filtering for signal peptides, secreted proteins, and membrane proteins were input into the GO-CHO website to obtain their gene ontology. Cellular component GO terms, which provide information about cellular location (e.g., endoplasmic reticulum or Golgi apparatus), were used to filter out the proteins containing only intracellular GO terms such as nucleus and mitochondria.
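The cellular component filter can be sketched as a simple rule: a protein is retained if at least one of its cellular component terms lies outside a set of strictly intracellular terms. In the sketch below, the contents of that set are an assumption (the text names only nucleus and mitochondria as examples), the function name is illustrative, and all entries other than ALAD are hypothetical.

```python
# Terms treated as strictly intracellular; this set is an assumption, since the
# text gives only nucleus and mitochondria as examples.
INTRACELLULAR_TERMS = {"nucleus", "mitochondrion"}

def has_non_intracellular_term(cc_terms):
    """True if any cellular component term falls outside the intracellular set.

    Proteins with no cellular component terms return False and are dropped.
    """
    return any(term not in INTRACELLULAR_TERMS for term in cc_terms)

annotations = {
    # Terms for ALAD as described in the text.
    "ALAD": {"cytosol", "nucleus", "extracellular vesicular exosome"},
    # Hypothetical entries for illustration.
    "hypothetical_nuclear_protein": {"nucleus"},
    "hypothetical_mito_protein": {"mitochondrion", "nucleus"},
    "hypothetical_secreted_protein": {"extracellular space"},
}

retained = sorted(
    name for name, terms in annotations.items() if has_non_intracellular_term(terms)
)
print(retained)
```

Under this rule a protein such as ALAD is kept, because it carries a term beyond the strictly intracellular ones, while proteins annotated only to the nucleus or mitochondria are removed.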

More than 90% of CHO-SO proteins, such as Dkk2 (Dickkopf-related protein 2), Klkb1 (plasma kallikrein), and Stk10 (serine/threonine-protein kinase 10), carried GO terms related to extracellular space and plasma membrane. However, intracellular proteins lacking signal peptides are known to also accumulate in the extracellular space through cell lysis and unconventional secretion, such as secretion through extracellular vesicular exosomes [68]. Overall, 368 proteins from the final filtered dataset of 1015 proteins carried cytoplasm/cytosol GO terms, while 52 proteins were assigned to the endoplasmic reticulum and 71 proteins to the Golgi apparatus according to GO terms. On tracing these proteins back to the previous bioinformatics filter, more than 98% of the cytosolic proteins had been predicted to be secreted by SecretomeP and only 15% had been predicted as either secreted or containing a transmembrane domain by any other tool, which indicates the limited prediction capability of these tools. It is important to note, however, that a number of these proteins, such as ALAD (delta-aminolevulinic acid dehydratase), carry “extracellular vesicular exosome” GO terms in addition to the “cytosol” and “nucleus” cellular components, which could explain an unconventional secretion route and their subsequent inclusion in the final filtered list.


Figure 4. An example output from the GO-CHO website used to find the gene ontology of Chinese hamster and Chinese hamster ovary cells. (a) The homepage of the website is accessible at http://ebdrup.biosustain.dtu.dk/gocho/. Clicking on ‘create a new dataset’ takes the user to a new page. (b) On this page, users can input the gene names for which GO annotation is needed, along with the species type, and submit the information. (c) The output shows the gene symbol, gene name, and GO description as well as accession numbers for the molecular function, biological process, and cellular component categories.


3.5 Gene Ontology (GO) Enrichment Analysis

To gain a better understanding of the biological roles of the 1015 functionally annotated proteins in CHO-SO, an enrichment/overrepresentation and depletion/underrepresentation analysis of GO terms was performed using a hypergeometric distribution test. As a background control for the GO-CHO annotation of the 1015 proteins, GO annotation was performed on both the whole CHO transcriptome and proteome using the aforementioned GO-CHO website, followed by the hypergeometric test. CHO transcriptome data were obtained from the study by Xu et al. [40] and proteome data from the study by Baycin-Hizal et al. [48]. A total of 9429 integrated genes from the transcriptomics and proteomics data were used as background for finding enriched GO terms among the 1015 CHO-SO proteins with the hypergeometric distribution test, as previously described [69].
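The hypergeometric test for a single GO term can be sketched with exact integer arithmetic from the Python standard library. The background size (9429) and sample size (1015) are taken from the text; the per-term counts are hypothetical, and the function names are illustrative rather than those of the cited implementation [69].

```python
from math import comb

def enrichment_p(N, K, n, k):
    """Upper-tail p-value P(X >= k) for X ~ Hypergeometric.

    N: background genes, K: background genes carrying the term,
    n: genes in the sample, k: sample genes carrying the term.
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def depletion_p(N, K, n, k):
    """Lower-tail p-value P(X <= k), used for underrepresentation."""
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(0, k + 1))
    return favourable / comb(N, n)

N_BACKGROUND = 9429   # integrated transcriptome/proteome genes (from the text)
N_SAMPLE = 1015       # CHO-SO proteins (from the text)

# Hypothetical term: 200 background genes carry it, 60 of them are in CHO-SO.
p_enriched = enrichment_p(N_BACKGROUND, 200, N_SAMPLE, 60)
expected = N_SAMPLE * 200 / N_BACKGROUND

print(f"expected about {expected:.1f} genes by chance, observed 60, p = {p_enriched:.3g}")
```

A term is called enriched when the upper-tail probability is small and depleted when the lower-tail probability is small. In practice, testing many GO terms at once usually calls for a multiple-testing correction, which the text does not describe and which is therefore omitted here.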

The results of the hypergeometric distribution tests are summarized in Figure 5 as the top 15 enriched GO terms in different categories for different CHO samples. Figures 5a, 5b, and 5c show the percentages of genes corresponding to the top 15 enriched GO terms in molecular function, biological process, and cellular component, respectively, for the 1015 CHO-SO proteins.

Moreover, to compare CHO-SO cellular compartmentalization (Figure 5c) with the whole cell proteome, the hypergeometric distribution test [70] was performed on the previously published CHO-K1 intracellular proteome, containing 4391 proteins with GO cellular component terms [48], using the CHO genome, with 13984 genes with GO cellular component terms, as background [40] to obtain enriched cellular component GO terms [69]. The result of this test is shown in Figure 5d and was compared with the CHO-SO cellular component GO terms in Figure 5c, revealing differences between the superome and the intracellular proteome. For example, most of the enriched cellular component GO terms in the intracellular proteome are related to the cytoplasmic space, whereas most of those in CHO-SO are related to the extracellular space and plasma membrane.

GO enrichment of the filtered supernatant protein data helps to focus more directly on specific classes of enriched proteins. For example, tubulin class proteins associated with the plasma membrane, such as Tuba1b, Tubb4a, and Tubg1, which fall under the enriched “structural constituent of cytoskeleton” molecular function (Figure 5a), are enriched in the CHO secretome [71]. Another highly enriched molecular function in CHO-SO is “calcium ion binding”, which includes the Tgfb1 (transforming growth factor, beta 1) protein. Tgfb1, together with SMAD signaling proteins, is known to be involved in IgA (Immunoglobulin A) secretion [72]. The CHO intracellular proteome is known to be rich in Smad1, Smad2, Smad3, Smad4, and Smad5 proteins [3]. Another enriched molecular function in CHO-SO, “catalytic activity”, includes plasma membrane related proteins such as ILK (Integrin-Linked Kinase), which is associated with cell junction signaling, cell adhesion, and integrin activation [73,74,75] and with caveolae formation [76]. Interestingly, actin filament binding was also among the top 15 enriched GO molecular functions and includes proteins such as Myh9, which is associated with secretion [77].

“Protein transport” and “vesicle mediated transport”, which are associated with proteins such as Sec31a and Sec23b, were among the enriched biological processes in CHO-SO (Figure 5b) [78]. The enriched “intracellular protein transport” category for CHO-SO included sorting nexin family proteins such as Snx1, Snx2, and Snx3, which regulate the cell surface trafficking of growth factor receptors as well as other cytoplasmic and membrane-based proteins [79].

Among the enriched cellular components in CHO-SO (Figure 5c) was the “extracellular space,” which included heat shock proteins such as Hspd1, Hspe1, and Hspa13 that are involved in protein folding as part of the overall protein secretion process [80]. Proteins associated with the extracellular vesicular exosome include Tgfb1 (transforming growth factor, beta 1), which is secreted by CHO-K1 cells and is difficult to eliminate during downstream purification. Tgfb1 can play a key role in the modulation of cellular growth, maturation and differentiation, extracellular matrix formation, homeostasis, apoptosis, and angiogenesis [81,82]. Contamination of clinical protein products with even minor amounts of Tgfb1 can lead to significant adverse effects [83]. For example, minute amounts of contaminating Tgfb1 can exert profound immunosuppressive effects in patients given human therapeutic blood products such as intravenous immunoglobulin [84,85].


Figure 5. Results from Gene Ontology (GO) hypergeometric distribution analysis. The percentages shown are the percentage of genes in each GO function relative to the total genes in the GO functions. a) Top 15 molecular functions enriched in the 1015 filtered proteins. b) Top 15 biological processes enriched in this same list. c) Top 15 cellular components enriched. d) Top 15 cellular components enriched in the CHO whole cell proteome [3].

To contrast the intracellular proteome with CHO-SO, we compared the two datasets and found that 369 proteins (approx. 36% of the 1015 proteins) in CHO-SO were also found in the CHO intracellular proteome. On closer comparison, many of these 369 proteins carried cellular component terms such as “extracellular vesicular exosome” and “plasma membrane”. Moreover, the enriched cellular component GO terms of the intracellular proteomics dataset [3] (Figure 5d) differ significantly from those of the CHO-SO dataset (Figure 5c): the three most common cellular components in the intracellular proteome were nucleus, cytoplasm, and mitochondria, while those in CHO-SO were plasma membrane, extracellular space, and vesicular exosomes (Figure 5c).

MEA (Male Enhanced Antigen) is an example of a protein that was not identified in the intracellular proteome analysis but was found in the CHO supernatant. This integral membrane protein is associated with cytoskeleton organization and has its N-terminus on the extracellular side and its C-terminus on the cytoplasmic side. Secreted proteins identified only in CHO-SO include Plat (0.0004/0.00003), Col5a2 (0.0005/0.00003), Tinagl1 (0.0008/0.00004), Csf1 (0.001/0.00001), Dag1 (0.001/0.00006), Clstn1 (0.001/0.00001), Mmp9 (0.002/0.00001), and C1ra (0.003/0.00001) [values in parentheses show the Normalized Spectral Abundance Factor (NSAF) value in CHO-SO, followed by the corresponding NSAF value in the intracellular proteome [3]]. The very low abundance values in the intracellular proteome suggest that many of these proteins do not accumulate significantly inside the cell, which makes them difficult to identify with whole cell proteomics methods.
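The contrast between the two NSAF values quoted for each protein can be made explicit by taking their ratio. The sketch below uses the value pairs listed in the text; treating the ratio as a rough fold difference is an interpretation added here, since the two values come from separate experiments and are each normalized within their own sample.

```python
import math

# (NSAF in CHO-SO, NSAF in intracellular proteome), as quoted in the text.
nsaf_pairs = {
    "Plat": (0.0004, 0.00003),
    "Col5a2": (0.0005, 0.00003),
    "Tinagl1": (0.0008, 0.00004),
    "Csf1": (0.001, 0.00001),
    "Dag1": (0.001, 0.00006),
    "Clstn1": (0.001, 0.00001),
    "Mmp9": (0.002, 0.00001),
    "C1ra": (0.003, 0.00001),
}

# Ratio of supernatant to intracellular abundance, and its log2 transform.
fold = {gene: so / intra for gene, (so, intra) in nsaf_pairs.items()}
log2_fold = {gene: math.log2(ratio) for gene, ratio in fold.items()}

# Genes ordered from the largest to the smallest ratio.
ranked = sorted(fold, key=fold.get, reverse=True)

for gene in ranked:
    print(f"{gene}: {fold[gene]:.1f}-fold (log2 = {log2_fold[gene]:.2f})")
```

Every listed protein is more than ten times as abundant in the supernatant as inside the cell by this measure, with C1ra showing the largest ratio.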

3.6 Ingenuity Pathway Analysis (IPA)

To improve our understanding of the biological functions of the 1015 proteins in CHO-SO, IPA software (www.ingenuity.com) was used to identify key functional networks. Chemotaxis of cells, exocytosis, and cell spreading, shown in Figure 6, were some of the enriched network functions found by the IPA analysis.

Figure 6. Functional network categories overrepresented by the extracellular space and extracellular vesicular exosome proteins. Proteins related to “cell spreading”, “chemotaxis of cells”, and “exocytosis” functions are shown in the figure.

Importantly, many of the proteins in the above networks associated with “exocytosis” are also associated with secretory and signaling pathways. For example, Thioredoxin (TXN), which catalyzes disulfide bond formation, is widely distributed and actively secreted by a variety of tissues [86]. Although TXN lacks a signal peptide, it follows a leaderless secretory pathway, an alternative to the classical ER-Golgi secretion route, and is hypothesized to translocate directly through the plasma membrane [86]. Another protein, N-Ethylmaleimide-Sensitive Factor Attachment Protein, Alpha (NAPA), is a member of the soluble NSF attachment proteins (SNAPs), which are involved in diverse transport events in the secretory pathway. NAPA is functionally important for protein trafficking in the secretory pathway and may act as a SNARE (SNAP receptor) for vesicle-mediated transport events [87]. N-ethylmaleimide-sensitive factor (NSF), together with α-SNAP, dissociates the SNARE complexes that promote association and fusion of cellular membranes [88]. Another example is Perforin (PFN), which is secreted to aid the intracellular delivery of proteases that initiate apoptosis, through invagination at the plasma membrane and by promoting endocytosis of vesicles to allow membrane-bound molecules into the target cells [89].

ADP ribosylation factor 6 (ARF6), from the exocytosis network, is believed to mediate cytoskeletal remodeling and vesicular trafficking along the secretory pathway at the plasma membrane [90,91,92]. In normal rat kidney (NRK) cells, endogenous and overexpressed ARF6 localizes to the plasma membrane and may play a role in remodeling the plasma membrane to facilitate protein secretory mechanisms [93]. Other proteins from the exocytosis network shown in Figure 6 that are associated with signal transduction and vesicle trafficking are EXOC2, EXOC8, and RHOA. The exocyst complex components EXOC2 and EXOC8 are involved in transporting proteins from the Golgi apparatus to the plasma membrane via the docking of exocytic vesicles at fusion sites on the plasma membrane [94]. RHOA (ras homolog family member A), in contrast, regulates signal transduction pathways linking plasma membrane receptors to the assembly of focal adhesions and actin stress fibers, which is required for the apical junction formation of keratinocyte cell-cell adhesion [95]. SNAP23, an essential component of the high affinity receptor for the general membrane fusion machinery and a regulator of transport vesicle docking and fusion [96], is also known to be required for integrin signaling through focal adhesion turnover in CHO cells [97].

A protein related to the chemotaxis of cells and cell spreading networks is RAC1 (ras related protein), a plasma membrane-associated small GTPase that binds to a variety of effector proteins to regulate cellular responses such as secretory processes, phagocytosis of apoptotic cells, epithelial cell polarization, and growth factor-induced formation of membrane ruffles [98]. Other exocytosis network signal transduction proteins serving essential cellular functions are AXL and B4GALT1, which are also associated with the chemotaxis of cells and cell spreading networks. The cell surface form of B4GALT1 functions as a recognition molecule during a variety of cell-to-cell and cell-to-matrix interactions by binding to specific oligosaccharide ligands on opposing cells or in the extracellular matrix [99].

Several previously reported host cell protein impurities are associated with the exocytosis and plasma membrane projection formation networks shown above, including Galectin-3 (LGALS3), Vimentin (Vim), Annexin A1 (Anxa1), Peptidyl-prolyl cis-trans isomerase A (PPIA), Transforming growth factor beta-1 (Tgfb1), Laminin subunit beta-1 (Lamb1), and Lactadherin (Mfge8). LAMB1 is known to show increased expression with cell age and is considered one of the difficult-to-remove host cell impurities [57].

3.7 Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway Enrichment Analysis of CHO-SO

Translating knowledge of these proteins into overrepresented/enriched or underrepresented/depleted biochemical pathways can be particularly meaningful. For this purpose, the KEGG database [100,101], which arranges genes/proteins into specific pathways, was used to map proteins to their corresponding pathways. To find enriched and depleted KEGG pathways in the filtered dataset, hypergeometric analysis was performed against the known transcriptomic and proteomic datasets [40]. The analysis revealed 111 pathways, including focal adhesion, tight junction, and synaptic vesicle cycle, to be significantly overrepresented (p-value < 0.05) in CHO-SO.

Adherence of cells to the extracellular matrix through the focal adhesion pathway is known to modulate many cellular functions such as cell survival, cell proliferation, and cell motility. It is also known to take part in regulating protein secretion by modulating cytoskeletal rearrangements and cell spreading during secretion [102].

Figure 7 shows the focal adhesion pathway, with the genes/proteins found in CHO-SO and in the combined transcriptome and proteome data highlighted in different colors. Genes such as GF (growth factor) and Grb2 (growth factor receptor-bound protein 2), which are growth factor receptor bound proteins, were found in CHO-SO. Moreover, Rac (ras-related C3 botulinum toxin substrate) and MEK1 (mitogen-activated protein kinase), which are connected by PAK (p21 protein-activated kinase), were also activated in the CHO-SO dataset. MEK1 is known to be a regulator of focal adhesion dynamics and cell migration [103]. We hypothesize that activation of Rac is necessary for activation of MEK1 and might regulate adhesion-dependent cell growth [104]. Myosin light chain kinase (MLCK), known to be a main regulator of permeability in tight junctions, was also found to be associated primarily with the CHO-SO data [105].

Figure 7. KEGG’s focal adhesion pathway displaying proteins in CHO-SO in green color, proteins in CHO supernatant but not in CHO-SO in orange color, proteins in CHO transcriptome-proteome data (but not in either CHO supernatant proteins or CHO-SO) in magenta color, and genes not found in any of the above datasets in yellow color.

4. Discussion

The present study addresses the various types of secreted proteins present in CHO supernatant, including both high-abundance and low-abundance proteins. Low-abundance proteins can be difficult to discover and characterize because of technical limitations: their detection in biological samples is hampered by masking, for instance when non-secreted proteins are released from cells after lysis or death. However, using the power of two-dimensional liquid chromatography based mass spectrometry, we were able to identify low-abundance proteins in CHO supernatant along with high-abundance proteins that can modify cell growth, including ligands, receptors, waste products, and endogenous growth factors [106,107]. This qualitative and quantitative information on secreted proteins in CHO cell culture may aid bioprocess development of therapeutics, since these proteins can play a role in aggregation, proteolysis, and other processes. An analysis of the relative abundance of proteins, as described in the materials and methods section, reveals that high-abundance supernatant proteins are involved in protein folding, binding, and cellular growth. Furthermore, secreted proteins, both high and low in abundance, that are carried along as host cell proteins (HCPs) are one of the major classes of impurities that must be removed during downstream purification, as they have the potential to cause antigenic effects in human patients [108]. For example, as summarized previously, heat shock proteins (HSPs), which act as chaperones for protein synthesis and degradation, are present in high abundance in the CHO supernatant. HSPs are recognized as major immunogens in the innate response and have a dual role in the initiation of both innate and acquired immunity. Specifically, HSPs can transport antigens into antigen presenting cells, thereby enabling the initiation of specific acquired immunity responses [109].

It is crucial to identify the immunogenic proteins that may be present in the supernatant and to understand their source. We believe that identifying the whole set of CHO supernatant proteins by deep sequencing proteomics methods can help us to explore and possibly eliminate the potential HCPs.

Furthermore, using various bioinformatics tools, we developed a computational strategy for identifying and characterizing the secreted proteins in the supernatant of CHO cells, a set we termed the Supernatant-Ome or CHO-SO. For the first time, two independent bioinformatics approaches were implemented on the whole supernatant CHO data containing 3281 proteins. In the first approach, prediction algorithms from bioinformatics tools such as SignalP, TargetP, SecretomeP, Phobius, WoLF PSORT, and TMHMM were applied to filter the dataset down to predicted secreted proteins.

This filtered dataset was then subjected to the second approach, based on the newly established GO-CHO annotation tool, to determine the cellular components of these proteins and to eliminate exclusively intracellular proteins that are released through cell burst or lysis.
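As an illustration only, the two-stage filtering described above can be sketched in Python. The record layout, the compartment labels, and the function names are hypothetical and are not taken from GO-CHO or from the prediction tools themselves. The sketch assumes that each tool's output has already been reduced to a yes/no secretion call and that the all-tools agreement rule stated for Figure 3b is applied in the first stage.

```python
# Hypothetical cellular-component labels treated as purely intracellular.
# The real GO-CHO vocabulary and the study's exclusion list may differ.
INTRACELLULAR_ONLY = {"nucleus", "cytosol", "mitochondrion", "ribosome"}


def predicted_secreted(tool_calls):
    """Stage 1: keep a protein only if every prediction tool calls it secreted."""
    return bool(tool_calls) and all(tool_calls.values())


def exclusively_intracellular(go_components):
    """Stage 2: True when every annotated compartment is an intracellular one.

    Proteins without any annotation return False and are therefore retained.
    """
    return bool(go_components) and set(go_components) <= INTRACELLULAR_ONLY


def filter_supernatant(proteins):
    """Return ids of proteins that pass both stages, in input order."""
    kept = []
    for protein in proteins:
        if not predicted_secreted(protein["tool_calls"]):
            continue
        if exclusively_intracellular(protein["go_components"]):
            continue
        kept.append(protein["id"])
    return kept


# Invented toy records; not data from the study.
example = [
    {"id": "P1", "tool_calls": {"SignalP": True, "Phobius": True},
     "go_components": ["extracellular space"]},
    {"id": "P2", "tool_calls": {"SignalP": True, "Phobius": False},
     "go_components": ["extracellular space"]},
    {"id": "P3", "tool_calls": {"SignalP": True, "Phobius": True},
     "go_components": ["nucleus", "cytosol"]},
    {"id": "P4", "tool_calls": {"SignalP": True, "Phobius": True},
     "go_components": ["cytosol", "Golgi apparatus"]},
]
print(filter_supernatant(example))
```

In this toy input, P2 fails the first stage because one tool disagrees, and P3 fails the second stage because all of its annotations are intracellular. P4 is retained because one of its annotations lies on the secretory route.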

These two approaches are useful in the compartmentalization of supernatant proteins and have not been reported before for CHO cells. In particular, WoLF PSORT assigns proteins to compartments within the cell, such as the endoplasmic reticulum (ER) and Golgi apparatus, in addition to the cytoplasm, plasma membrane, and extracellular domain. The ER and Golgi apparatus are part of the protein secretory machinery, which transports many secretory and transmembrane proteins and carries out post-translational modifications. Moreover, gene ontology provides all the GO terms associated with a protein, which helps in evaluating the pathway of protein secretion. This strategy of compartmentalizing proteins can form the basis for building a model of the secretory machinery of CHO cells.

Prediction accuracies of the various tools used in this study have been evaluated before [42], and it was found that, in terms of true positives, Phobius performs better than WoLF PSORT, which in turn performs better than SignalP and TargetP. SignalP and TargetP have specificities of only 78.4% and 71.6%, respectively. It is reported that combining the results of Phobius, WoLF PSORT, TMHMM, and TargetP increases the specificity of predictions to 97.3% [42]. In that report, specificity was calculated as the percentage of true negatives among the total number of true negatives and false positives. When multiple tools were used for identifying secreted proteins, only the entries predicted to be positive by all tools were taken as true positives, as shown in Figure 3b.
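The specificity definition and the all-tools agreement rule can be made concrete with a short Python sketch. The function names and the toy labels below are invented for illustration and do not reproduce the evaluation in [42]. The example only shows why requiring agreement between tools tends to remove false positives and so raise specificity.

```python
def specificity(predicted, actual):
    """Specificity = TN / (TN + FP), computed over the truly negative entries."""
    tn = sum(1 for p, a in zip(predicted, actual) if not a and not p)
    fp = sum(1 for p, a in zip(predicted, actual) if not a and p)
    if tn + fp == 0:
        raise ValueError("no negative entries to evaluate")
    return tn / (tn + fp)


def consensus(*tool_predictions):
    """Call an entry positive only when every tool calls it positive."""
    return [all(calls) for calls in zip(*tool_predictions)]


# Invented toy labels (True = secreted); not data from the study.
actual = [True, True, False, False, False, False]
tool_a = [True, True, True, False, False, False]   # one false positive
tool_b = [True, True, False, True, False, False]   # a different false positive
combined = consensus(tool_a, tool_b)

print(specificity(tool_a, actual))    # 0.75
print(specificity(tool_b, actual))    # 0.75
print(specificity(combined, actual))  # 1.0
```

Because the two hypothetical tools make their false positive calls on different entries, the intersection discards both. The cost of this rule, which the toy data do not show, is that true positives missed by any single tool are also discarded.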

The proteins in our dataset that were identified in the extracellular space but lack a signal peptide are probably released into the supernatant through various mechanisms frequently referred to as “unconventional protein secretion” [110]. Examples of such proteins in CHO-SO include galectins [111], fibroblast growth factor receptor, and macrophage migration inhibitory factor (MIF). Unconventional secretion has previously been shown to be externally influenced in most cases [111,112]; for example, MIF secretion from monocytes is triggered by lipopolysaccharides [113,114,115,116,117,118]. MIF is probably present in the CHO-SO because of cell stress.

Moreover, CHO-SO was found to be enriched in GO terms related to protein transport, exocytosis, phagocytosis, formation of plasma membrane projections, and cell spreading; these terms contain proteins related to cell surface trafficking and protein secretion. IPA software analysis revealed overrepresented functions related to chemotaxis of cells, cell spreading, and exocytosis. CHO-SO proteins related to these functions were found to be associated with diverse transport events in the secretory and signaling pathways, along with proteins related to cytoskeletal remodeling and vesicular trafficking through exocytic vesicles, which provides insight into the CHO secretory mechanism. It is known that exocytic membrane trafficking pathways are mediated by a series of vesicular intermediates.

Additionally, each step in vesicular transport is facilitated by specific events such as targeting, docking, and fusion, which probably involve protein components localized to the transport vesicle. Some of these protein components define the specificity of a certain transport step, and other components are likely to be common to multiple transport steps leading to secretion of proteins [119]. Finally, KEGG pathway analysis showed that one of the overrepresented KEGG pathways corresponding to CHO-SO proteins was focal adhesion, which plays a role in directed cell migration, cell proliferation, cell adhesion, and cell survival.
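This section does not state which statistic underlies the overrepresentation calls for GO terms and KEGG pathways. A one-sided hypergeometric test is a common choice for such analyses, and the Python sketch below shows it under that assumption. The function name and all counts are hypothetical and are not values from this study.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: proteins in the background set
    K: background proteins annotated to the pathway
    n: proteins in the list of interest (for example, a secretome)
    k: proteins in that list annotated to the pathway
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    start = max(k, 0)
    stop = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible combinations contribute nothing to the sum.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(start, stop + 1))
    return hits / total


# Hypothetical counts, chosen only to exercise the function.
p_value = hypergeom_upper_tail(k=12, n=200, K=60, N=3000)
print(p_value)
```

A small p-value indicates that the pathway contains more proteins from the list than random sampling from the background would be expected to give. In practice such p-values are usually corrected for testing many pathways at once.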

In summary, the combination of quantitative LC-LC/MS/MS results with gene ontology and protein sequence analyses using the aforementioned bioinformatics tools emphasizes the high confidence of our dataset.

5. Conclusion

In this study, we have demonstrated an extensive strategy that can be used to interpret the secretory proteins of CHO cells, which have not been fully understood until now. Along with identifying immunogenic proteins and using various bioinformatics tools for subcellular localization of supernatant proteins, we have also established a publicly available web-based tool, GO-CHO, for determining the gene ontology of CHO cells. The CHO-SO provides the most comprehensive information available to date on the assumed microenvironment and subcellular compartmentalization of CHO cells. The antibody-producing nature of CHO cells, attributable to their high growth rate, is mirrored by the enrichment of the catalytic activity GO term and of proteins such as growth factors involved in the process.

Secreted proteins are involved in key signaling pathways, and CHO cells are able to achieve higher growth rates through the interaction of secreted proteins with the surrounding cells.

Studying these interactions may lead to a better understanding of CHO cell physiology as well as of the microenvironment of CHO cells, which includes the host cell proteins (HCPs). Knowledge of the repertoire of proteins secreted by CHO cells now offers an unprecedented opportunity to understand the secretory mechanism of CHO cells and to develop new strategies for removing HCPs from the final biologics.


Chapter 7. Conclusions and Future Work

The preceding chapters in this dissertation described the utilization of transcriptomics and proteomics in conjunction with computational modeling and bioinformatics tools to decipher the underlying physiology of Type 2 Diabetes Mellitus, Chinese Hamster Ovary cells, and E. coli cells.

From the T2D computational modeling perspective, it may be worthwhile to extend the methods developed in this study to build a high-confidence model by updating the gene-protein-reaction mapping based on the recently published RECON 2, which maximizes the mapping of every reaction to a gene and/or protein. This step would also minimize the number of dead-end reactions, thereby improving the confidence of the model. The next step should be to incorporate drug inputs such as CL-316,243 into the computational model and predict the metabolic fluxes in different pathways. Ultimately, we would like signaling pathways, which are crucial to T2D pathogenesis, to be included in the model.
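To illustrate what a dead-end check involves, the following Python sketch flags metabolites that can be produced but never consumed, or consumed but never produced, together with the reactions that touch them. The data layout, function names, and toy network are hypothetical. This is not the COBRA Toolbox interface, and it is not the procedure used to build the models in this dissertation.

```python
def dead_end_metabolites(reactions):
    """Return metabolites lacking either a producing or a consuming reaction.

    `reactions` maps a reaction id to (stoichiometry, reversible), where
    stoichiometry maps metabolite -> coefficient (negative means consumed).
    A reversible reaction counts as both producing and consuming.
    """
    produced, consumed = set(), set()
    for stoichiometry, reversible in reactions.values():
        for metabolite, coefficient in stoichiometry.items():
            if coefficient > 0 or reversible:
                produced.add(metabolite)
            if coefficient < 0 or reversible:
                consumed.add(metabolite)
    return sorted((produced | consumed) - (produced & consumed))


def dead_end_reactions(reactions):
    """Return ids of reactions that involve at least one dead-end metabolite."""
    dead = set(dead_end_metabolites(reactions))
    return sorted(rid for rid, (stoichiometry, _) in reactions.items()
                  if dead & set(stoichiometry))


# Invented toy network: C is produced by R2 but nothing consumes it.
toy_network = {
    "uptake": ({"A": 1}, False),
    "R1": ({"A": -1, "B": 1}, False),
    "R2": ({"B": -1, "C": 1}, False),
    "R3": ({"B": -1, "D": 1}, False),
    "export": ({"D": -1}, False),
}
print(dead_end_metabolites(toy_network))  # ['C']
print(dead_end_reactions(toy_network))    # ['R2']
```

At steady state a reaction such as R2 cannot carry flux, because its product would accumulate. Adding a gene-supported reaction or transport step that consumes C is the kind of gap a more complete gene-protein-reaction mapping would be expected to close.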

As far as characterization of the metabolite profile of MKR mice is concerned, a metabolomics study outlining the differences in metabolite levels in MKR mice under healthy, T2D, and treated conditions would be very effective for comparing the results of the computational model with the observed physiology. The metabolomics data can also be used to further improve the confidence of the model.

The E. coli study revealed differences in the metabolism of JM109 and BL21 and explained the differences in acetate levels between the two strains. The results from the E. coli study can be extended to build robust computational models in a way similar to the T2D study presented in this dissertation, with the advantage of having proteomics data in conjunction with transcriptomics data. The resulting model could be used to study the effect of different substrates on metabolic fluxes and biomass growth. Several E. coli metabolic reconstructions are available in the literature, but the compartmentalization of metabolites in these models can be improved further, e.g. for metabolites located in the lipid bilayers. Using the proteomics data presented in this study along with protein subcellular localization tools such as TMHMM and WoLF PSORT, metabolic reactions in different compartments can be ascertained and used to increase the robustness of existing models.

This dissertation also presents emerging proteomics techniques and their use in CHO cells, along with a detailed analysis of the supernatant of CHO cells to predict secreted proteins based on bioinformatics methods. The CHO community will benefit greatly from the expanding number of analytical tools and innovations available in the field of proteomics. New tools are being implemented that can enhance labeling of cellular proteins and improve sample preparation. These methodologies will help to reveal new approaches to increase cell growth and achieve higher yields of recombinant proteins.

While proteomics is paving the way for improving cell performance, refinements are needed in proteomics tools to boost performance and reproducibility before they can be used routinely at large scale in the pharmaceutical industry. Concerns regarding pre-diagnostic variables, analytical inconsistency, and culture-to-culture variability must be addressed. Many proteomics researchers around the world, including the CHO proteomics community, have begun to work on these issues, which points to even more extensive use of proteomics in forthcoming years. Proteomics studies exhibiting the benefits of labeling technologies such as iTRAQ, TMT, and SILAC foreshadow even more output from these methods in coming decades. Developments in multiplexing technologies and mass spectrometry instruments will hopefully make it possible to combine and integrate a large number of samples in a single run. These advances will in turn accelerate industrial-scale analysis and enable high-throughput scanning of large numbers of clones in order to specify proteomic-based differences from clone to clone. Linking advances in proteomics with subsequent secretome analysis will provide insights that may overcome production bottlenecks, in order to eliminate host cell protein impurities and create a superior CHO host for manufacturing affordable biopharmaceuticals.

References

Chapter 1:

1. Saltiel AR (2000) Series introduction: the molecular and physiological basis of insulin resistance: emerging implications for metabolic and cardiovascular diseases. J Clin Invest 106: 163-164. 2. DeFronzo RA, Tripathy D (2009) Skeletal muscle insulin resistance is the primary defect in type 2 diabetes. Diabetes Care 32 Suppl 2: S157-163. 3. Halvatsiotis PG, Turk D, Alzaid A, Dinneen S, Rizza RA, et al. (2002) Insulin effect on leucine kinetics in type 2 diabetes mellitus. Diabetes Nutr Metab 15: 136-142.

196 4. Wang TJ, Larson MG, Vasan RS, Cheng S, Rhee EP, et al. Metabolite profiles and the risk of developing diabetes. Nat Med 17: 448-453. 5. Wijekoon EP, Skinner C, Brosnan ME, Brosnan JT (2004) Amino acid metabolism in the Zucker diabetic fatty rat: effects of insulin resistance and of type 2 diabetes. Can J Physiol Pharmacol 82: 506-514. 6. Stancakova A, Civelek M, Saleem NK, Soininen P, Kangas AJ, et al. Hyperglycemia and a common variant of GCKR are associated with the levels of eight amino acids in 9,369 Finnish men. Diabetes 61: 1895-1902. 7. Pisters PW, Restifo NP, Cersosimo E, Brennan MF (1991) The effects of euglycemic hyperinsulinemia and amino acid infusion on regional and whole body glucose disposal in man. Metabolism 40: 59-65. 8. Huffman KM, Shah SH, Stevens RD, Bain JR, Muehlbauer M, et al. (2009) Relationships between circulating metabolic intermediates and insulin action in overweight to obese, inactive men and women. Diabetes Care 32: 1678-1683. 9. Macotela Y, Emanuelli B, Bang AM, Espinoza DO, Boucher J, et al. Dietary leucine-- an environmental modifier of insulin resistance acting on multiple levels of metabolism. PLoS One 6: e21187. 10. Newgard CB, An J, Bain JR, Muehlbauer MJ, Stevens RD, et al. (2009) A branched- chain amino acid-related metabolic signature that differentiates obese and lean humans and contributes to insulin resistance. Cell Metab 9: 311-326. 11. Menge BA, Schrader H, Ritter PR, Ellrichmann M, Uhl W, et al. Selective amino acid deficiency in patients with impaired glucose tolerance and type 2 diabetes. Regul Pept 160: 75-80. 12. Fiehn O, Garvey WT, Newman JW, Lok KH, Hoppel CL, et al. Plasma metabolomic profiles reflective of glucose homeostasis in non-diabetic and type 2 diabetic obese African-American women. PLoS One 5: e15234. 13. Boden G (1999) Free fatty acids, insulin resistance, and type 2 diabetes mellitus. Proc Assoc Am Physicians 111: 241-248. 14. Mingrone G (2004) Carnitine in type 2 diabetes. 
Ann N Y Acad Sci 1033: 99-107. 15. Bordbar A, Palsson BO Using the reconstructed genome-scale human metabolic network to study physiology and pathology. J Intern Med 271: 131-141. 16. Duarte NC, Becker SA, Jamshidi N, Thiele I, Mo ML, et al. (2007) Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proc Natl Acad Sci U S A 104: 1777-1782. 17. Jerby L, Shlomi T, Ruppin E Computational reconstruction of tissue-specific metabolic models: application to human liver metabolism. Mol Syst Biol 6: 401.

197 18. Gille C, Bolling C, Hoppe A, Bulik S, Hoffmann S, et al. HepatoNet1: a comprehensive metabolic reconstruction of the human hepatocyte for the analysis of liver physiology. Mol Syst Biol 6: 411. 19. Lewis NE, Schramm G, Bordbar A, Schellenberger J, Andersen MP, et al. Large- scale in silico modeling of metabolic interactions between cell types in the human brain. Nat Biotechnol 28: 1279-1285. 20. Bordbar A, Feist AM, Usaite-Black R, Woodcock J, Palsson BO, et al. A multi-tissue type genome-scale metabolic network for analysis of whole-body systems physiology. BMC Syst Biol 5: 180. 21. Bordbar A, Lewis NE, Schellenberger J, Palsson BO, Jamshidi N Insight into human alveolar macrophage and M. tuberculosis interactions via metabolic reconstructions. Mol Syst Biol 6: 422. 22. Shlomi T, Cabili MN, Ruppin E (2009) Predicting metabolic biomarkers of human inborn errors of metabolism. Mol Syst Biol 5: 263. 23. Bordbar A, Jamshidi N, Palsson BO iAB-RBC-283: A proteomically derived knowledge-base of erythrocyte metabolism that can be used to simulate its physiological and patho-physiological states. BMC Syst Biol 5: 110. 24. O'Rahilly S (2009) Human genetics illuminates the paths to metabolic disease. Nature 462: 307-314. 25. Mardinoglu A, Agren R, Kampf C, Asplund A, Nookaew I, et al. Integration of clinical data with a genome-scale metabolic model of the human adipocyte. Mol Syst Biol 9: 649. 26. Schellenberger J, Park JO, Conrad TM, Palsson BO BiGG: a Biochemical Genetic and Genomic knowledgebase of large scale metabolic reconstructions. BMC Bioinformatics 11: 213. 27. Han D, Moon S, Kim H, Choi SE, Lee SJ, et al. Detection of differential proteomes associated with the development of type 2 diabetes in the Zucker rat model using the iTRAQ technique. J Proteome Res 10: 564-577. 28. Qiu L, List EO, Kopchick JJ (2005) Differentially expressed proteins in the pancreas of diet-induced diabetic mice. Mol Cell Proteomics 4: 1311-1318. 29. 
Lu H, Yang Y, Allister EM, Wijesekara N, Wheeler MB (2008) The identification of potential factors associated with the development of type 2 diabetes: a quantitative proteomics approach. Mol Cell Proteomics 7: 1434-1451. 30. Lu H, Koshkin V, Allister EM, Gyulkhandanyan AV, Wheeler MB Molecular and metabolic evidence for mitochondrial defects associated with beta-cell dysfunction in a mouse model of type 2 diabetes. Diabetes 59: 448-459.

198 31. Sanchez JC, Converset V, Nolan A, Schmid G, Wang S, et al. (2002) Effect of rosiglitazone on the differential expression of diabetes-associated proteins in pancreatic islets of C57Bl/6 lep/lep mice. Mol Cell Proteomics 1: 509-516. 32. Fernandez AM, Kim JK, Yakar S, Dupont J, Hernandez-Sanchez C, et al. (2001) Functional inactivation of the IGF-I and insulin receptors in skeletal muscle causes type 2 diabetes. Genes Dev 15: 1926-1934. 33. McKusick VA (2007) Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet 80: 588-604. 34. Baric I, Fumic K, Glenn B, Cuk M, Schulze A, et al. (2004) S-adenosylhomocysteine hydrolase deficiency in a human: a genetic disorder of methionine metabolism. Proc Natl Acad Sci U S A 101: 4234-4239. 35. Vilboux T, Kayser M, Introne W, Suwannarat P, Bernardini I, et al. (2009) Mutation spectrum of homogentisic acid oxidase (HGD) in alkaptonuria. Hum Mutat 30: 1611- 1619. 36. Iyer RK, Yoo PK, Kern RM, Rozengurt N, Tsoa R, et al. (2002) Mouse model for human arginase deficiency. Mol Cell Biol 22: 4491-4498. 37. Barbosa M, Lopes A, Mota C, Martins E, Oliveira J, et al. Clinical, biochemical and molecular characterization of cystinuria in a cohort of 12 patients. Clin Genet 81: 47-55. 38. Borsani G, Bassi MT, Sperandeo MP, De Grandi A, Buoninconti A, et al. (1999) SLC7A7, encoding a putative permease-related protein, is mutated in patients with lysinuric protein intolerance. Nat Genet 21: 297-301. 39. Hilton JF, Christensen KE, Watkins D, Raby BA, Renaud Y, et al. (2003) The molecular basis of glutamate formiminotransferase deficiency. Hum Mutat 22: 67-73. 40. Ishikawa M (1987) Developmental disorders in histidinemia--follow-up study of language development in histidinemia. Acta Paediatr Jpn 29: 224-228. 41. 
Reish O, Townsend D, Berry SA, Tsai MY, King RA (1995) Tyrosinase inhibition due to interaction of homocyst(e)ine with copper: the mechanism for reversible hypopigmentation in homocystinuria due to cystathionine beta-synthase deficiency. Am J Hum Genet 57: 127-132. 42. Phang JM, Chien-an, A. H., Valle, D. (2001) Disorders of proline and hydroxyproline metabolism.In: Scriver, C. R.; Beaudet, A. L.; Sly, W. S.; Valle, D. (eds.) : The Metabolic and Molecular Bases of Inherited Disease. 43. Chuang DT, Shih, V. E. (2001) Maple syrup urine disease (branched-chain ketoaciduria).In: Scriver, C. R.; Beaudet, A. L.; Sly, W. S.; Valle, D. (eds.) : The Metabolic and Molecular Bases of Inherited Disease.

199 44. Mudd SH, Tangerman A, Stabler SP, Allen RH, Wagner C, et al. (2003) Maternal methionine adenosyltransferase I/III deficiency: reproductive outcomes in a woman with four pregnancies. J Inherit Metab Dis 26: 443-458. 45. Ledley FD (1990) Perspectives on methylmalonic acidemia resulting from molecular cloning of methylmalonyl CoA mutase. Bioessays 12: 335-340. 46. Blau N, van Spronsen FJ, Levy HL Phenylketonuria. Lancet 376: 1417-1427. 47. Dianzani I, de Sanctis L, Smooker PM, Gough TJ, Alliaudi C, et al. (1998) Dihydropteridine reductase deficiency: physical structure of the QDPR gene, identification of two new mutations and genotype-phenotype correlations. Hum Mutat 12: 267-273. 48. Bliksrud YT, Brodtkorb E, Andresen PA, van den Berg IET, Kvittingen EA (2005) Tyrosinaemia type I - de novo mutation in liver tissue suppressing an inborn splicing defect. Journal of Molecular Medicine-Jmm 83: 406-410. 49. Tomoeda K, Awata H, Matsuura T, Matsuda I, Ploechl E, et al. (2000) Mutations in the 4-hydroxyphenylpyruvic acid dioxygenase gene are responsible for tyrosinemia type III and Hawkinsinuria. Molecular Genetics and Metabolism 71: 506-510. 50. Hamosh A, Johnston, M. V. (2001) Nonketotic hyperglycinemia.In: Scriver, C. R.; Beaudet, A. L.; Sly, W. S.; Valle, D. : The Metabolic and Molecular Bases of Inherited Disease: New York: McGraw-Hill. 51. Orth JD, Thiele I, Palsson BO What is flux balance analysis? Nat Biotechnol 28: 245- 248. 52. Connor SC, Hansen MK, Corner A, Smith RF, Ryan TE Integration of metabolomics and transcriptomics data to aid biomarker discovery in type 2 diabetes. Mol Biosyst 6: 909-921. 53. Newgard CB Interplay between lipids and branched-chain amino acids in development of insulin resistance. Cell Metab 15: 606-614. 54. Brunham LR, Kruit JK, Pape TD, Timmins JM, Reuwer AQ, et al. (2007) Beta-cell ABCA1 influences insulin secretion, glucose homeostasis and response to thiazolidinedione treatment. Nat Med 13: 340-347. 55. 
Higai K, Azuma Y, Aoki Y, Matsumoto K (2003) Altered glycosylation of alpha1- acid glycoprotein in patients with inflammation and diabetes mellitus. Clin Chim Acta 329: 117-125. 56. Itoh N, Sakaue S, Nakagawa H, Kurogochi M, Ohira H, et al. (2007) Analysis of N- glycan in serum glycoproteins from db/db mice and humans with type 2 diabetes. Am J Physiol Endocrinol Metab 293: E1069-1077.

200 57. Aerts JM, Ottenhoff R, Powlson AS, Grefhorst A, van Eijk M, et al. (2007) Pharmacological inhibition of glucosylceramide synthase enhances insulin sensitivity. Diabetes 56: 1341-1349. 58. Zhao H, Przybylska M, Wu IH, Zhang J, Siegel C, et al. (2007) Inhibiting glycosphingolipid synthesis improves glycemic control and insulin sensitivity in animal models of type 2 diabetes. Diabetes 56: 1210-1218. 59. Skovbro M, Baranowski M, Skov-Jensen C, Flint A, Dela F, et al. (2008) Human skeletal muscle ceramide content is not a major factor in muscle insulin sensitivity. Diabetologia 51: 1253-1260. 60. Wang Y, Eddy JA, Price ND Reconstruction of genome-scale metabolic models for 126 human tissues using mCADRE. BMC Syst Biol 6: 153. 61. Newgard CB (2012) Interplay between lipids and branched-chain amino acids in development of insulin resistance. Cell Metab 15: 606-614. 62. Jones JG, Solomon MA, Sherry AD, Jeffrey FM, Malloy CR (1998) 13C NMR measurements of human gluconeogenic fluxes after ingestion of [U-13C]propionate, phenylacetate, and acetaminophen. Am J Physiol 275: E843-852. 63. Kelley DE, Goodpaster B, Wing RR, Simoneau JA (1999) Skeletal muscle fatty acid metabolism in association with insulin resistance, obesity, and weight loss. Am J Physiol 277: E1130-1141. 64. Cha BS, Ciaraldi TP, Park KS, Carter L, Mudaliar SR, et al. (2005) Impaired fatty acid metabolism in type 2 diabetic skeletal muscle cells is reversed by PPARgamma agonists. Am J Physiol Endocrinol Metab 289: E151-159. 65. Thyfault JP, Kraus RM, Hickner RC, Howell AW, Wolfe RR, et al. (2004) Impaired plasma fatty acid oxidation in extremely obese women. Am J Physiol Endocrinol Metab 287: E1076-1081. 66. Kim JY, Hickner RC, Cortright RL, Dohm GL, Houmard JA (2000) Lipid oxidation is reduced in obese human skeletal muscle. Am J Physiol Endocrinol Metab 279: E1039- 1044. 67. Butte A (2002) The use and analysis of microarray data. Nat Rev Drug Discov 1: 951-960. 68. 
Schellenberger J, Park JO, Conrad TM, Palsson BO (2010) BiGG: a Biochemical Genetic and Genomic knowledgebase of large scale metabolic reconstructions. BMC Bioinformatics 11: 213. 69. Keating SM, Bornstein BJ, Finney A, Hucka M (2006) SBMLToolbox: an SBML toolbox for MATLAB users. Bioinformatics 22: 1275-1277.

201 70. Schellenberger J, Que R, Fleming RM, Thiele I, Orth JD, et al. (2011) Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox v2.0. Nat Protoc 6: 1290-1307. 71. Adachi J, Kumar C, Zhang Y, Mann M (2007) In-depth analysis of the adipocyte proteome by mass spectrometry and bioinformatics. Mol Cell Proteomics 6: 1257-1273. 72. Hojlund K, Yi Z, Hwang H, Bowen B, Lefort N, et al. (2008) Characterization of the human skeletal muscle proteome by one-dimensional gel electrophoresis and HPLC-ESI- MS/MS. Mol Cell Proteomics 7: 257-267. 73. Kislinger T, Cox B, Kannan A, Chung C, Hu P, et al. (2006) Global survey of organ and organelle protein expression in mouse: combined proteomic and transcriptomic profiling. Cell 125: 173-186. 74. Ohgami M, Takahashi N, Yamasaki M, Fukui T (2003) Expression of acetoacetyl- CoA synthetase, a novel cytosolic ketone body-utilizing enzyme, in human brain. Biochem Pharmacol 65: 989-994. 75. Luong A, Hannah VC, Brown MS, Goldstein JL (2000) Molecular characterization of human acetyl-CoA synthetase, an enzyme regulated by sterol regulatory element-binding proteins. J Biol Chem 275: 26458-26466. 76. Scholte HR, Wit-Peeters EM, Bakker JC (1971) The intracellular and intramitochondrial distribution of short-chain acyl-CoA synthetases in guinea-pig heart. Biochim Biophys Acta 231: 479-486. 77. Fujino T, Kondo J, Ishikawa M, Morikawa K, Yamamoto TT (2001) Acetyl-CoA synthetase 2, a mitochondrial matrix enzyme involved in the oxidation of acetate. J Biol Chem 276: 11420-11426. 78. Harvey JW, Pate MG, Kivipelto J, Asquith RL (2005) Clinical biochemistry of pregnant and nursing mares. Vet Clin Pathol 34: 248-254. 79. Ramchandani VA, Bosron WF, Li TK (2001) Research advances in ethanol metabolism. Pathol Biol (Paris) 49: 676-682. 80. Deng Y, Wang Z, Gu S, Ji C, Ying K, et al. (2002) Cloning and characterization of a novel human alcohol dehydrogenase gene (ADHFe1). DNA Seq 13: 301-306. 81. 
Sladek NE (2003) Human aldehyde dehydrogenases: potential pathological, pharmacological, and toxicological impact. J Biochem Mol Toxicol 17: 7-23. 82. Tillmann H, Eschrich K (1998) Isolation and characterization of an allelic cDNA for human muscle fructose-1,6-bisphosphatase. Gene 212: 295-304. 83. Ferrer J, Aoki M, Behn P, Nestorowicz A, Riggs A, et al. (1996) Mitochondrial glycerol-3-phosphate dehydrogenase. Cloning of an alternatively spliced human islet-cell cDNA, tissue distribution, physical mapping, and identification of a polymorphic genetic marker. Diabetes 45: 262-266.

202 84. Brisson D, Vohl MC, St-Pierre J, Hudson TJ, Gaudet D (2001) Glycerol: a neglected variable in metabolic processes? Bioessays 23: 534-542. 85. Arden SD, Zahn T, Steegers S, Webb S, Bergman B, et al. (1999) Molecular cloning of a pancreatic islet-specific glucose-6-phosphatase catalytic subunit-related protein. Diabetes 48: 531-542. 86. Pan CJ, Lei KJ, Annabi B, Hemrika W, Chou JY (1998) Transmembrane topology of glucose-6-phosphatase. J Biol Chem 273: 6144-6148. 87. Shieh JJ, Pan CJ, Mansfield BC, Chou JY (2003) A glucose-6-phosphate hydrolase, widely expressed outside the liver, can explain age-dependent resolution of hypoglycemia in glycogen storage disease type Ia. J Biol Chem 278: 47098-47103. 88. Martin CC, Bischof LJ, Bergman B, Hornbuckle LA, Hilliker C, et al. (2001) Cloning and characterization of the human and rat islet-specific glucose-6-phosphatase catalytic subunit-related protein (IGRP) genes. J Biol Chem 276: 25197-25207. 89. Martin CC, Oeser JK, Svitek CA, Hunter SI, Hutton JC, et al. (2002) Identification and characterization of a human cDNA and gene encoding a ubiquitously expressed glucose-6-phosphatase catalytic subunit-related protein. J Mol Endocrinol 29: 205-222. 90. Shelly LL, Lei KJ, Pan CJ, Sakata SF, Ruppert S, et al. (1993) Isolation of the gene for murine glucose-6-phosphatase, the enzyme deficient in glycogen storage disease type 1A. J Biol Chem 268: 21482-21485. 91. Millan JL, Driscoll CE, LeVan KM, Goldberg E (1987) Epitopes of human testis- specific lactate dehydrogenase deduced from a cDNA sequence. Proc Natl Acad Sci U S A 84: 5311-5315. 92. Yu Y, Deck JA, Hunsaker LA, Deck LM, Royer RE, et al. (2001) Selective inhibitors of human lactate dehydrogenases A4, B4, and C4. Biochem Pharmacol 62: 81- 89. 93. Modaressi S, Christ B, Bratke J, Zahn S, Heise T, et al. (1996) Molecular cloning, sequencing and expression of the cDNA of the mitochondrial form of phosphoenolpyruvate carboxykinase from human liver. 
Biochem J 315 ( Pt 3): 807-814. 94. Eto K, Sakura H, Yasuda K, Hayakawa T, Kawasaki E, et al. (1994) Cloning of a complete protein-coding sequence of human platelet-type phosphofructokinase isozyme from pancreatic islet. Biochem Biophys Res Commun 198: 990-998. 95. Marks DB, Marks AD, Smith CM (1996) Basic medical biochemistry : a clinical approach. Baltimore: Williams & Wilkins. xi, 806 p. p. 96. Li X, Qin C, Burghardt R, Safe S (2004) Hormonal regulation of lactate dehydrogenase-A through activation of protein kinase C pathways in MCF-7 breast cancer cells. Biochem Biophys Res Commun 320: 625-634. 97. Champe PC, Harvey, R.A., Ferrier D.R. (2005) Biochemistry.

203 98. Salway J (1999) Metabolism at a glance. 99. Nordlie RC, Sukalski, K.A. (1985) The enzymes of biological membranes. 100. Orten JM NO (1975) Human Biochemistry. 101. TM D (2001) Textbook of Biochemistry with Clinical Correlations. 102. Prasad TS, Kandasamy K, Pandey A (2009) Human Protein Reference Database and Human Proteinpedia as discovery tools for systems biology. Methods Mol Biol 577: 67- 79. 103. Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, et al. (2006) Human protein reference database--2006 update. Nucleic Acids Res 34: D411-414. 104. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, et al. (2003) Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 13: 2363-2371. 105. Apweiler R, Martin MJ, O'Donovan C, Magrane M, Alam-Faruque Y, et al. (2013) Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Research 41: D43-D47. 106. Schomburg I, Chang A, Placzek S, Sohngen C, Rother M, et al. (2013) BRENDA in 2013: integrated reactions, kinetic data, enzyme function data, improved disease classification: new options and contents in BRENDA. Nucleic Acids Research 41: D764- D772. 107. Prasad TSK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, et al. (2009) Human Protein Reference Database-2009 update. Nucleic Acids Research 37: D767-D772. 108. Schomburg I, Chang A, Ebeling C, Gremse M, Heldt C, et al. (2004) BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Research 32: D431-D433. 109. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, et al. (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Research 32: D115-D119. 110. Pfeiffer T, Sanchez-Valdenebro I, Nuno JC, Montero F, Schuster S (1999) METATOOL: for studying metabolic networks. Bioinformatics 15: 251-257. 111. 
Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33: D514-517. 112. Mahadevan R, Schilling CH (2003) The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metab Eng 5: 264-276. 113. Kim H, Pennisi PA, Gavrilova O, Pack S, Jou W, et al. (2006) Effect of adipocyte beta3-adrenergic receptor activation on the type 2 diabetic MKR mice. Am J Physiol Endocrinol Metab 290: E1227-1236.


Chapter 2: 1. Whiting DR, Guariguata L, Weil C, Shaw J. IDF diabetes atlas: global estimates of the prevalence of diabetes for 2011 and 2030. Diabetes research and clinical practice. 2011 Dec;94(3):311-21. 2. Type 2 diabetes epidemic: a global education. Lancet. 2009 Nov 14;374(9702):1654. 3. Tong Y, Lin Y, Zhang Y, Yang J, Zhang Y, Liu H, et al. Association between TCF7L2 gene polymorphisms and susceptibility to type 2 diabetes mellitus: a large Epidemiology (HuGE) review and meta-analysis. BMC medical genetics. 2009;10:15. 4. Hamman RF, Wing RR, Edelstein SL, Lachin JM, Bray GA, Delahanty L, et al. Effect of weight loss with lifestyle intervention on risk of diabetes. Diabetes care. 2006 Sep;29(9):2102-7. 5. Diabetes Prevention Program Research G, Knowler WC, Fowler SE, Hamman RF, Christophi CA, Hoffman HJ, et al. 10-year follow-up of diabetes incidence and weight loss in the Diabetes Prevention Program Outcomes Study. Lancet. 2009 Nov 14;374(9702):1677-86. 6. LeRoith D. Beta-cell dysfunction and insulin resistance in type 2 diabetes: role of metabolic and genetic abnormalities. The American journal of medicine. 2002 Oct 28;113 Suppl 6A:3S-11S. 7. Investigators DT, Gerstein HC, Yusuf S, Bosch J, Pogue J, Sheridan P, et al. Effect of rosiglitazone on the frequency of diabetes in patients with impaired glucose tolerance or impaired fasting glucose: a randomised controlled trial. Lancet. 2006 Sep 23;368(9541):1096-105. 8. McGarry JD. Banting lecture 2001: dysregulation of fatty acid metabolism in the etiology of type 2 diabetes. Diabetes. 2002 Jan;51(1):7-18. 9. Norheim F, Raastad T, Thiede B, Rustan AC, Drevon CA, Haugen F. Proteomic identification of secreted proteins from human skeletal muscle cells and expression in response to strength training. American journal of physiology Endocrinology and metabolism. 2011 Nov;301(5):E1013-21. 10. Henningsen J, Rigbolt KT, Blagoev B, Pedersen BK, Kratchmarova I. 
Dynamics of the skeletal muscle secretome during myoblast differentiation. Molecular & cellular proteomics : MCP. 2010 Nov;9(11):2482-96. 11. Fernandez AM, Kim JK, Yakar S, Dupont J, Hernandez-Sanchez C, Castle AL, et al. Functional inactivation of the IGF-I and insulin receptors in skeletal muscle causes type 2 diabetes. Genes & development. 2001 Aug 1;15(15):1926-34.

205 12. Asghar Z, Yau D, Chan F, Leroith D, Chan CB, Wheeler MB. Insulin resistance causes increased beta-cell mass but defective glucose-stimulated insulin secretion in a murine model of type 2 diabetes. Diabetologia. 2006 Jan;49(1):90-9. 13. Vaitheesvaran B, LeRoith D, Kurland IJ. MKR mice have increased dynamic glucose disposal despite metabolic inflexibility, and hepatic and peripheral insulin insensitivity. Diabetologia. 2010 Oct;53(10):2224-32. 14. Coort SL, van Iersel MP, van Erk M, Kooistra T, Kleemann R, Evelo CT. Bioinformatics for the NuGO proof of principle study: analysis of gene expression in muscle of ApoE3*Leiden mice on a high-fat diet using PathVisio. Genes Nutr. 2008 Dec;3(3-4):185-91. 15. Seda O, Sedova L, Oliyarnyk O, Kazdova L, Krenova D, Corbeil G, et al. Pharmacogenomics of metabolic effects of rosiglitazone. Pharmacogenomics. 2008 Feb;9(2):141-55. 16. Skov V, Glintborg D, Knudsen S, Jensen T, Kruse TA, Tan Q, et al. Reduced expression of nuclear-encoded genes involved in mitochondrial oxidative metabolism in skeletal muscle of insulin-resistant women with polycystic ovary syndrome. Diabetes. 2007 Sep;56(9):2349-55. 17. Luo TH, Zhao Y, Li G, Zhang HL, Li WY, Ldu M. [Identification of genes that are differentially expressed in omental fat of normal weight subjects, obese subjects and obese diabetic patients]. Zhongguo Ying Yong Sheng Li Xue Za Zhi. 2007 May;23(2):229-34. 18. Hulver MW, Berggren JR, Carper MJ, Miyazaki M, Ntambi JM, Hoffman EP, et al. Elevated stearoyl-CoA desaturase-1 expression in skeletal muscle contributes to abnormal fatty acid partitioning in obese humans. Cell Metab. 2005 Oct;2(4):251-61. 19. Fu WJ, Haynes TE, Kohli R, Hu J, Shi W, Spencer TE, et al. Dietary L-arginine supplementation reduces fat mass in Zucker diabetic fatty rats. J Nutr. 2005 Apr;135(4):714-21. 20. Corominola H, Conner LJ, Beavers LS, Gadski RA, Johnson D, Caro JF, et al. 
Identification of novel genes differentially expressed in omental fat of obese subjects and obese type 2 diabetic patients. Diabetes. 2001 Dec;50(12):2822-30. 21. Kim H, Haluzik M, Asghar Z, Yau D, Joseph JW, Fernandez AM, et al. Peroxisome proliferator-activated receptor-alpha agonist treatment in a transgenic model of type 2 diabetes reverses the lipotoxic state and improves glucose homeostasis. Diabetes. 2003 Jul;52(7):1770-8. 22. Heron-Milhavet L, Haluzik M, Yakar S, Gavrilova O, Pack S, Jou WC, et al. Muscle- specific overexpression of CD36 reverses the insulin resistance and diabetes of MKR mice. Endocrinology. 2004 Oct;145(10):4667-76.

206 23. Kim H, Haluzik M, Gavrilova O, Yakar S, Portas J, Sun H, et al. Thiazolidinediones improve insulin sensitivity in adipose tissue and reduce the hyperlipidaemia without affecting the hyperglycaemia in a transgenic model of type 2 diabetes. Diabetologia. 2004 Dec;47(12):2215-25. 24. Kim H, Pennisi PA, Gavrilova O, Pack S, Jou W, Setser-Portas J, et al. Effect of adipocyte beta3-adrenergic receptor activation on the type 2 diabetic MKR mice. American journal of physiology Endocrinology and metabolism. 2006 Jun;290(6):E1227- 36. 25. Zhao H, Yakar S, Gavrilova O, Sun H, Zhang Y, Kim H, et al. Phloridzin improves hyperglycemia but not hepatic insulin resistance in a transgenic mouse model of type 2 diabetes. Diabetes. 2004 Nov;53(11):2901-9. 26. Malmgren S, Spegel P, Danielsson AP, Nagorny CL, Andersson LE, Nitert MD, et al. Coordinate changes in histone modifications, mRNA levels, and metabolite profiles in clonal INS-1 832/13 beta-cells accompany functional adaptations to lipotoxicity. J Biol Chem. Apr 26;288(17):11973-87. 27. Sparks LM, Moro C, Ukropcova B, Bajpeyi S, Civitarese AE, Hulver MW, et al. Remodeling lipid metabolism and improving insulin responsiveness in human primary myotubes. PLoS One.6(7):e21068. 28. Bikopoulos G, da Silva Pimenta A, Lee SC, Lakey JR, Der SD, Chan CB, et al. Ex vivo transcriptional profiling of human pancreatic islets following chronic exposure to monounsaturated fatty acids. J Endocrinol. 2008 Mar;196(3):455-64. 29. Swagell CD, Henly DC, Morris CP. Expression analysis of a human hepatic cell line in response to palmitate. Biochem Biophys Res Commun. 2005 Mar 11;328(2):432-41. 30. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. Jan;40(Database issue):D109-14. 31. Fierz Y, Novosyadlyy R, Vijayakumar A, Yakar S, LeRoith D. Insulin-sensitizing therapy attenuates type 2 diabetes-mediated mammary tumor progression. Diabetes. 
2010 Mar;59(3):686-93. 32. Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. Jan;42(Database issue):D199-205. 33. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000 Jan 1;28(1):27-30. 34. Wang K, Hu F, Xu K, Cheng H, Jiang M, Feng R, et al. CASCADE_SCAN: mining signal transduction network from high-throughput data based on steepest descent method. BMC Bioinformatics.12:164.

207 35. Donovan EL, Pettine SM, Hickey MS, Hamilton KL, Miller BF. Lipidomic analysis of human plasma reveals ether-linked lipids that are elevated in morbidly obese humans compared to lean. Diabetol Metab Syndr.5(1):24. 36. Poirier Y, Antonenkov VD, Glumoff T, Hiltunen JK. Peroxisomal beta-oxidation--a metabolic pathway with multiple functions. Biochimica et biophysica acta. 2006 Dec;1763(12):1413-26. 37. Van Veldhoven PP. Biochemistry and genetics of inherited disorders of peroxisomal fatty acid metabolism. Journal of lipid research. 2010 Oct;51(10):2863-95. 38. Wanders RJ, Waterham HR. Biochemistry of mammalian peroxisomes revisited. Annual review of biochemistry. 2006;75:295-332. 39. Van Veldhoven PP, Vanhove G, Assselberghs S, Eyssen HJ, Mannaerts GP. Substrate specificities of rat liver peroxisomal acyl-CoA oxidases: palmitoyl-CoA oxidase (inducible acyl-CoA oxidase), pristanoyl-CoA oxidase (non-inducible acyl-CoA oxidase), and trihydroxycoprostanoyl-CoA oxidase. The Journal of biological chemistry. 1992 Oct 5;267(28):20065-74. 40. Antonenkov VD, Van Veldhoven PP, Waelkens E, Mannaerts GP. Substrate specificities of 3-oxoacyl-CoA thiolase A and sterol carrier protein 2/3-oxoacyl-CoA thiolase purified from normal rat liver peroxisomes. Sterol carrier protein 2/3-oxoacyl- CoA thiolase is involved in the metabolism of 2-methyl-branched fatty acids and bile acid intermediates. The Journal of biological chemistry. 1997 Oct 10;272(41):26023-31. 41. Antonenkov VD, Van Veldhoven PP, Waelkens E, Mannaerts GP. Comparison of the stability and substrate specificity of purified peroxisomal 3-oxoacyl-CoA thiolases A and B from rat liver. Biochimica et biophysica acta. 1999 Feb 25;1437(2):136-41. 42. Qi C, Zhu Y, Pan J, Usuda N, Maeda N, Yeldandi AV, et al. Absence of spontaneous peroxisome proliferation in enoyl-CoA Hydratase/L-3-hydroxyacyl-CoA dehydrogenase- deficient mouse liver. Further support for the role of fatty acyl CoA oxidase in PPARalpha ligand metabolism. 
The Journal of biological chemistry. 1999 May 28;274(22):15775-80. 43. Dieuaide-Noubhani M, Novikov D, Baumgart E, Vanhooren JC, Fransen M, Goethals M, et al. Further characterization of the peroxisomal 3-hydroxyacyl-CoA dehydrogenases from rat liver. Relationship between the different dehydrogenases and evidence that fatty acids and the C27 bile acids di- and tri-hydroxycoprostanic acids are metabolized by separate multifunctional proteins. European journal of biochemistry / FEBS. 1996 Sep 15;240(3):660-6. 44. Lowell BB, Shulman GI. Mitochondrial dysfunction and type 2 diabetes. Science. 2005 Jan 21;307(5708):384-7. 45. Simoneau JA, Veerkamp JH, Turcotte LP, Kelley DE. Markers of capacity to utilize fatty acids in human skeletal muscle: relation to insulin resistance and obesity and effects

208 of weight loss. FASEB journal : official publication of the Federation of American Societies for Experimental Biology. 1999 Nov;13(14):2051-60. 46. Shimomura I, Bashmakov Y, Ikemoto S, Horton JD, Brown MS, Goldstein JL. Insulin selectively increases SREBP-1c mRNA in the of rats with streptozotocin- induced diabetes. Proceedings of the National Academy of Sciences of the United States of America. 1999 Nov 23;96(24):13656-61. 47. Coppack SW, Jensen MD, Miles JM. In vivo regulation of lipolysis in humans. Journal of lipid research. 1994 Feb;35(2):177-93. 48. Coppack SW, Fisher RM, Gibbons GF, Humphreys SM, McDonough MJ, Potts JL, et al. Postprandial substrate deposition in human forearm and adipose tissues in vivo. Clinical science. 1990 Oct;79(4):339-48. 49. Maassen JA, Romijn JA, Heine RJ. Fatty acid-induced mitochondrial uncoupling in adipocytes as a key protective factor against insulin resistance and beta cell dysfunction: a new concept in the pathogenesis of obesity-associated type 2 diabetes mellitus. Diabetologia. 2007 Oct;50(10):2036-41. 50. Tiffin N, Adie E, Turner F, Brunner HG, van Driel MA, Oti M, et al. Computational disease gene identification: a concert of methods prioritizes type 2 diabetes and obesity candidate genes. Nucleic acids research. 2006;34(10):3067-81. 51. Gertow K, Pietilainen KH, Yki-Jarvinen H, Kaprio J, Rissanen A, Eriksson P, et al. Expression of fatty-acid-handling proteins in human adipose tissue in relation to obesity and insulin resistance. Diabetologia. 2004 Jun;47(6):1118-25. 52. Meirhaeghe A, Martin G, Nemoto M, Deeb S, Cottel D, Auwerx J, et al. Intronic polymorphism in the fatty acid transport protein 1 gene is associated with increased plasma triglyceride levels in a French population. Arteriosclerosis, thrombosis, and vascular biology. 2000 May;20(5):1330-4. 53. Garcia-Martinez C, Marotta M, Moore-Carrasco R, Guitart M, Camps M, Busquets S, et al. 
Impact on fatty acid metabolism and differential localization of FATP1 and FAT/CD36 proteins delivered in cultured human muscle cells. American journal of physiology Cell physiology. 2005 Jun;288(6):C1264-72. 54. Hatch GM, Smith AJ, Xu FY, Hall AM, Bernlohr DA. FATP1 channels exogenous FA into 1,2,3-triacyl-sn-glycerol and down-regulates sphingomyelin and cholesterol metabolism in growing 293 cells. Journal of lipid research. 2002 Sep;43(9):1380-9. 55. Lobo S, Wiczer BM, Smith AJ, Hall AM, Bernlohr DA. Fatty acid metabolism in adipocytes: functional analysis of fatty acid transport proteins 1 and 4. Journal of lipid research. 2007 Mar;48(3):609-20. 56. Man MZ, Hui TY, Schaffer JE, Lodish HF, Bernlohr DA. Regulation of the murine adipocyte fatty acid transporter gene by insulin. Molecular endocrinology. 1996 Aug;10(8):1021-8.

209 57. Stahl A, Evans JG, Pattel S, Hirsch D, Lodish HF. Insulin causes fatty acid transport protein translocation and enhanced fatty acid uptake in adipocytes. Developmental cell. 2002 Apr;2(4):477-88. 58. Wu Q, Ortegon AM, Tsang B, Doege H, Feingold KR, Stahl A. FATP1 is an insulin- sensitive fatty acid transporter involved in diet-induced obesity. Molecular and cellular biology. 2006 May;26(9):3455-67. 59. Ding J, Loizides-Mangold U, Rando G, Zoete V, Michielin O, Reddy JK, et al. The Peroxisomal Enzyme L-PBE Is Required to Prevent the Dietary Toxicity of Medium- Chain Fatty Acids. Cell reports. 2013 Oct 17;5(1):248-58. 60. Ferdinandusse S, Denis S, Mooyer PA, Dekker C, Duran M, Soorani-Lunsing RJ, et al. Clinical and biochemical spectrum of D-bifunctional protein deficiency. Annals of neurology. 2006 Jan;59(1):92-104. 61. Baes M, Huyghe S, Carmeliet P, Declercq PE, Collen D, Mannaerts GP, et al. Inactivation of the peroxisomal multifunctional protein-2 in mice impedes the degradation of not only 2-methyl-branched fatty acids and bile acid intermediates but also of very long chain fatty acids. The Journal of biological chemistry. 2000 May 26;275(21):16329-36. 62. Sun Z, Lazar MA. Dissociating fatty liver and diabetes. Trends in endocrinology and metabolism: TEM. 2013 Jan;24(1):4-12. 63. Kiefer FW, Orasanu G, Nallamshetty S, Brown JD, Wang H, Luger P, et al. Retinaldehyde dehydrogenase 1 coordinates hepatic gluconeogenesis and lipid metabolism. Endocrinology. 2012 Jul;153(7):3089-99. 64. Reichert B, Yasmeen R, Jeyakumar SM, Yang F, Thomou T, Alder H, et al. Concerted action of aldehyde dehydrogenases influences depot-specific fat formation. Molecular endocrinology. 2011 May;25(5):799-809. 65. Demozay D, Rocchi S, Mas JC, Grillo S, Pirola L, Chavey C, et al. Fatty : potential role in oxidative stress protection and regulation of its gene expression by insulin. The Journal of biological chemistry. 2004 Feb 20;279(8):6261-70.

Chapter 3: 1. Legrain P, Aebersold R, Archakov A, Bairoch A, Bala K, et al. (2011) The human proteome project: current state and future direction. Mol Cell Proteomics 10: M111 009993. 2. Mermelekas G, Zoidakis J (2014) Mass spectrometry-based membrane proteomics in cancer biomarker discovery. Expert Rev Mol Diagn 14: 549-563.

210 3. Jimenez CR, Verheul HM (2014) Mass spectrometry-based proteomics: from cancer biology to protein biomarkers, drug targets, and clinical applications. Am Soc Clin Oncol Educ Book 34: e504-510. 4. Doroudgar S, Glembotski CC (2011) The cardiokine story unfolds: ischemic stress- induced protein secretion in the heart. Trends Mol Med 17: 207-214. 5. Celis JE, Gromov P, Cabezon T, Moreira JM, Ambartsumian N, et al. (2004) Proteomic characterization of the interstitial fluid perfusing the breast tumor microenvironment: a novel resource for biomarker and therapeutic target discovery. Mol Cell Proteomics 3: 327-344. 6. Gromov P, Gromova I, Bunkenborg J, Cabezon T, Moreira JM, et al. (2010) Up- regulated proteins in the fluid bathing the tumour cell microenvironment as potential serological markers for early detection of cancer of the breast. Mol Oncol 4: 65-89. 7. Raimondo F, Morosi L, Chinello C, Magni F, Pitto M (2011) Advances in membranous vesicle and exosome proteomics improving biological understanding and biomarker discovery. Proteomics 11: 709-720. 8. Dowling P, Clynes M (2011) Conditioned media from cell lines: a complementary model to clinical specimens for the discovery of disease-specific biomarkers. Proteomics 11: 794-804. 9. Kolarich D, Lepenies B, Seeberger PH (2012) Glycomics, glycoproteomics and the immune system. Curr Opin Chem Biol 16: 214-220. 10. Kim EH, Misek DE (2011) Glycoproteomics-based identification of cancer biomarkers. Int J Proteomics 2011: 601937. 11. Hanahan D, Weinberg RA (2000) The hallmarks of cancer. Cell 100: 57-70. 12. Corless CL, Fletcher JA, Heinrich MC (2004) Biology of gastrointestinal stromal tumors. J Clin Oncol 22: 3813-3825. 13. Choudhary C, Olsen JV, Brandts C, Cox J, Reddy PN, et al. (2009) Mislocalized activation of oncogenic RTKs switches downstream signaling outcomes. Mol Cell 36: 326-339. 14. 
Amanchy R, Kalume DE, Iwahori A, Zhong J, Pandey A (2005) Phosphoproteome analysis of HeLa cells using stable isotope labeling with amino acids in cell culture (SILAC). J Proteome Res 4: 1661-1671. 15. Amanchy R, Zhong J, Molina H, Chaerkady R, Iwahori A, et al. (2008) Identification of c-Src tyrosine kinase substrates using mass spectrometry and peptide microarrays. J Proteome Res 7: 3900-3910. 16. Gulmann C, Sheehan KM, Conroy RM, Wulfkuhle JD, Espina V, et al. (2009) Quantitative cell signalling analysis reveals down-regulation of MAPK pathway activation in colorectal cancer. J Pathol 218: 514-519.

211 17. Zhong D, Liu X, Khuri FR, Sun SY, Vertino PM, et al. (2008) LKB1 is necessary for Akt-mediated phosphorylation of proapoptotic proteins. Cancer Res 68: 7270-7277. 18. Schindler J, Jung S, Niedner-Schatteburg G, Friauf E, Nothwang HG (2006) Enrichment of integral membrane proteins from small amounts of brain tissue. J Neural Transm 113: 995-1013. 19. Kohnke PL, Mulligan SP, Christopherson RI (2009) Membrane proteomics for leukemia classification and drug target identification. Curr Opin Mol Ther 11: 603-610. 20. Lehner I, Niehof M, Borlak J (2003) An optimized method for the isolation and identification of membrane proteins. Electrophoresis 24: 1795-1808. 21. Donovan LE, Higginbotham L, Dammer EB, Gearing M, Rees HD, et al. (2012) Analysis of a membrane-enriched proteome from postmortem human brain tissue in Alzheimer's disease. Proteomics Clin Appl 6: 201-211. 22. Raimondo F, Morosi L, Chinello C, Perego R, Bianchi C, et al. (2012) Protein profiling of microdomains purified from renal cell carcinoma and normal kidney tissue samples. Mol Biosyst 8: 1007-1016. 23. Zhang Q, Schulenborg T, Tan T, Lang B, Friauf E, et al. (2010) Proteome analysis of a plasma membrane-enriched fraction at the placental feto-maternal barrier. Proteomics Clin Appl 4: 538-549. 24. Elia G (2012) Cell surface protein biotinylation for SDS-PAGE analysis. Methods Mol Biol 869: 361-372. 25. Yokoyama T, Enomoto T, Serada S, Morimoto A, Matsuzaki S, et al. (2013) Plasma membrane proteomics identifies bone marrow stromal antigen 2 as a potential therapeutic target in endometrial cancer. Int J Cancer 132: 472-484. 26. Qi Y, Katagiri F (2009) Purification of low-abundance Arabidopsis plasma- membrane protein complexes and identification of candidate components. Plant J 57: 932-944. 27. Tian Y, Koganti T, Yao Z, Cannon P, Shah P, et al. (2014) Characterization of cardiac extracellular proteome and membrane topology using glycoproteomics. Proteomics Clin Appl. 28. 
Liu Y, Chen J, Sethi A, Li QK, Chen L, et al. (2014) Glycoproteomic analysis of prostate cancer tissues by SWATH mass spectrometry discovers N-acylethanolamine acid amidase and protein tyrosine kinase 7 as signatures for tumor aggressiveness. Mol Cell Proteomics. 29. Baycin Hizal D, Wolozny D, Colao J, Jacobson E, Tian Y, et al. (2014) Glycoproteomic and glycomic databases. Clin Proteomics 11: 15.

212 30. Yang W, Zhou JY, Chen L, Ao M, Sun S, et al. (2014) Glycoproteomic analysis identifies human glycoproteins secreted from HIV latently infected T cells and reveals their presence in HIV+ plasma. Clin Proteomics 11: 9. 31. Chen J, Xi J, Tian Y, Bova GS, Zhang H (2013) Identification, prioritization, and evaluation of glycoproteins for aggressive prostate cancer using quantitative glycoproteomics and antibody-based assays on tissue specimens. Proteomics 13: 2268- 2277. 32. Tian Y, Zhang H (2013) Characterization of disease-associated N-linked glycoproteins. Proteomics 13: 504-511. 33. Li QK, Gabrielson E, Zhang H (2012) Application of glycoproteomics for the discovery of biomarkers in lung cancer. Proteomics Clin Appl 6: 244-256. 34. Almaraz RT, Tian Y, Bhattarcharya R, Tan E, Chen SH, et al. (2012) Metabolic flux increases glycoprotein sialylation: implications for cell adhesion and cancer metastasis. Mol Cell Proteomics 11: M112 017558. 35. Liu Y, Huttenhain R, Surinova S, Gillet LC, Mouritsen J, et al. (2013) Quantitative measurements of N-linked glycoproteins in human plasma by SWATH-MS. Proteomics 13: 1247-1256. 36. Huttenhain R, Surinova S, Ossola R, Sun Z, Campbell D, et al. (2013) N-glycoprotein SRMAtlas: a resource of mass spectrometric assays for N-glycosites enabling consistent and multiplexed protein quantification for clinical applications. Mol Cell Proteomics 12: 1005-1016. 37. Cerciello F, Choi M, Nicastri A, Bausch-Fluck D, Ziegler A, et al. (2013) Identification of a seven glycopeptide signature for malignant pleural mesothelioma in human serum by selected reaction monitoring. Clin Proteomics 10: 16. 38. Ahn YH, Kim JY, Yoo JS Quantitative mass spectrometric analysis of glycoproteins combined with enrichment methods. Mass Spectrom Rev. 39. 
Xu Y, Bailey UM, Punyadeera C, Schulz BL (2014) Identification of salivary N- glycoproteins and measurement of glycosylation site occupancy by boronate glycoprotein enrichment and liquid chromatography/electrospray ionization tandem mass spectrometry. Rapid Commun Mass Spectrom 28: 471-482. 40. Wollscheid B, Bausch-Fluck D, Henderson C, O'Brien R, Bibel M, et al. (2009) Mass-spectrometric identification and relative quantification of N-linked cell surface glycoproteins. Nat Biotechnol 27: 378-386. 41. Baycin-Hizal D, Tian Y, Akan I, Jacobson E, Clark D, et al. (2011) GlycoFly: a database of Drosophila N-linked glycoproteins identified using SPEG--MS techniques. J Proteome Res 10: 2777-2784. 42. Tian Y, Zhou Y, Elliott S, Aebersold R, Zhang H (2007) Solid-phase extraction of N- linked glycopeptides. Nat Protoc 2: 334-339.

213 43. Lee MC, Sun B (2014) Glycopeptide capture for cell surface proteomics. J Vis Exp. 44. Sun Z, Chen R, Cheng K, Liu H, Qin H, et al. (2012) A new method for quantitative analysis of cell surface glycoproteome. Proteomics 12: 3328-3337. 45. Han D, Moon S, Kim Y, Min H, Kim Y (2014) Characterization of the membrane proteome and N-glycoproteome in BV-2 mouse microglia by liquid chromatography- tandem mass spectrometry. BMC Genomics 15: 95. 46. Zhang H, Li XJ, Martin DB, Aebersold R (2003) Identification and quantification of N-linked glycoproteins using hydrazide chemistry, stable isotope labeling and mass spectrometry. Nat Biotechnol 21: 660-666. 47. Ramya TN, Weerapana E, Cravatt BF, Paulson JC Glycoproteomics enabled by tagging sialic acid- or galactose-terminated glycans. Glycobiology 23: 211-221. 48. Choksawangkarn W, Kim SK, Cannon JR, Edwards NJ, Lee SB, et al. (2013) Enrichment of plasma membrane proteins using nanoparticle pellicles: comparison between silica and higher density nanoparticles. J Proteome Res 12: 1134-1141. 49. Kim Y, Elschenbroich S, Sharma P, Sepiashvili L, Gramolini AO, et al. (2011) Use of colloidal silica-beads for the isolation of cell-surface proteins for mass spectrometry- based proteomics. Methods Mol Biol 748: 227-241. 50. Vandre DD, Ackerman WEt, Tewari A, Kniss DA, Robinson JM (2012) A placental sub-proteome: the apical plasma membrane of the syncytiotrophoblast. Placenta 33: 207- 213. 51. Li X, Jin Q, Cao J, Xie C, Cao R, et al. (2009) Evaluation of two cell surface modification methods for proteomic analysis of plasma membrane from isolated mouse hepatocytes. Biochim Biophys Acta 1794: 32-41. 52. Li X, Xie C, Cao J, He Q, Cao R, et al. (2009) An in vivo membrane density perturbation strategy for identification of liver sinusoidal surface proteome accessible from the vasculature. J Proteome Res 8: 123-132. 53. Schiess R, Mueller LN, Schmidt A, Mueller M, Wollscheid B, et al. 
(2009) Analysis of cell surface proteome changes via label-free, quantitative mass spectrometry. Mol Cell Proteomics 8: 624-638. 54. Zhang H, Liu AY, Loriaux P, Wollscheid B, Zhou Y, et al. (2007) Mass spectrometric detection of tissue proteins in plasma. Mol Cell Proteomics 6: 64-71. 55. Gillet LC, Navarro P, Tate S, Rost H, Selevsek N, et al. (2012) Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol Cell Proteomics 11: O111 016717. 56. Liu Y, Huttenhain R, Collins B, Aebersold R (2013) Mass spectrometric protein maps for biomarker discovery and clinical research. Expert Rev Mol Diagn 13: 811-825.

214 57. Zhang H, Yi EC, Li XJ, Mallick P, Kelly-Spratt KS, et al. (2005) High throughput quantitative analysis of serum proteins using glycopeptide capture and liquid chromatography mass spectrometry. Mol Cell Proteomics 4: 144-155. 58. Herbrich SM, Cole RN, West KP, Jr., Schulze K, Yager JD, et al. (2013) Statistical inference from multiple iTRAQ experiments without using common reference standards. J Proteome Res 12: 594-604. 59. Yan M, Yang X, Wang L, Clark D, Zuo H, et al. (2013) Plasma membrane proteomics of tumor spheres identify CD166 as a novel marker for cancer stem-like cells in head and neck squamous cell carcinoma. Mol Cell Proteomics 12: 3271-3284. 60. Kaji H, Saito H, Yamauchi Y, Shinkawa T, Taoka M, et al. (2003) Lectin affinity capture, isotope-coded tagging and mass spectrometry to identify N-linked glycoproteins. Nat Biotechnol 21: 667-672. 61. Zhang W, Zhou G, Zhao Y, White MA, Zhao Y (2003) Affinity enrichment of plasma membrane for proteomics analysis. Electrophoresis 24: 2855-2863. 62. Warder SE, Tucker LA, Strelitzer TJ, McKeegan EM, Meuth JL, et al. (2009) Reducing agent-mediated precipitation of high-abundance plasma proteins. Anal Biochem 387: 184-193. 63. Holewinski RJ, Jin ZC, Powell MJ, Maust MD, Van Eyk JE (2013) A fast and reproducible method for albumin isolation and depletion from serum and cerebrospinal fluid. Proteomics 13: 743-750. 64. An Y, Goldman R (2013) Analysis of peptides by denaturing ultrafiltration and LC- MALDI-TOF-MS. Methods Mol Biol 1023: 13-19. 65. Ortea I, Roschitzki B, Ovalles JG, Longo JL, de la Torre I, et al. (2012) Discovery of serum proteomic biomarkers for prediction of response to infliximab (a monoclonal anti- TNF antibody) treatment in rheumatoid arthritis: an exploratory analysis. J Proteomics 77: 372-382. 66. Cole RN, Ruczinski I, Schulze K, Christian P, Herbrich S, et al. 
(2013) The Plasma Proteome Identifies Expected and Novel Proteins Correlated with Micronutrient Status in Undernourished Nepalese Children. Journal of Nutrition 143: 1540-1548. 67. Lee EK, Cho H, Kim CW (2011) Proteomic analysis of cancer stem cells in human prostate cancer cells. Biochem Biophys Res Commun 412: 279-285. 68. Rush J, Moritz A, Lee KA, Guo A, Goss VL, et al. (2005) Immunoaffinity profiling of tyrosine phosphorylation in cancer cells. Nat Biotechnol 23: 94-101. 69. Luo W, Slebos RJ, Hill S, Li M, Brabek J, et al. (2008) Global impact of oncogenic Src on a phosphotyrosine proteome. J Proteome Res 7: 3447-3460.

215 70. Iwai LK, Benoist C, Mathis D, White FM (2010) Quantitative phosphoproteomic analysis of T cell receptor signaling in diabetes prone and resistant mice. J Proteome Res 9: 3135-3145. 71. Watkin EE, Arbez N, Waldron-Roby E, O'Meally R, Ratovitski T, et al. (2014) Phosphorylation of mutant huntingtin at serine 116 modulates neuronal toxicity. PLoS One 9: e88284. 72. Zhong J, Kim MS, Chaerkady R, Wu X, Huang TC, et al. (2012) TSLP signaling network revealed by SILAC-based phosphoproteomics. Mol Cell Proteomics 11: M112 017764. 73. Mitsos A, Melas IN, Siminelakis P, Chairakaki AD, Saez-Rodriguez J, et al. (2009) Identifying drug effects via pathway alterations using an integer linear programming optimization formulation on phosphoproteomic data. PLoS Comput Biol 5: e1000591. 74. Brill LM, Xiong W, Lee KB, Ficarro SB, Crain A, et al. (2009) Phosphoproteomic analysis of human embryonic stem cells. Cell Stem Cell 5: 204-213. 75. Kaminska B (2005) MAPK signalling pathways as molecular targets for anti- inflammatory therapy--from molecular mechanisms to therapeutic benefits. Biochim Biophys Acta 1754: 253-262. 76. Peifer C, Wagner G, Laufer S (2006) New approaches to the treatment of inflammatory disorders small molecule inhibitors of p38 MAP kinase. Curr Top Med Chem 6: 113-149. 77. White MF (2006) Regulating insulin signaling and beta-cell function through IRS proteins. Can J Physiol Pharmacol 84: 725-737. 78. Smith FD, Samelson BK, Scott JD (2011) Discovery of cellular substrates for protein kinase A using a peptide array screening protocol. Biochem J 438: 103-110. 79. Min-Sik Kim YZ, Shinchi Yachida, N.V. Rajeshkumar, Melissa L. Abel, Arivusudar Marimuthu, Keshav Mudgal, Ralph H. Hruban, Justin S. Poling, Jeffrey W. Tyner, Anirban Maitra, Christine A. Iacobuzio-Donahue, and Akhilesh Pandey (2014) Heterogeneity of pancreatic cancer metastases in a single patient revealed by quantitative proteomics. The American Society for Biochemistry and Molecular Biology. 80. 
Amanchy R, Kandasamy K, Mathivanan S, Periaswamy B, Reddy R, et al. (2011) Identification of Novel Phosphorylation Motifs Through an Integrative Computational and Experimental Analysis of the Human Phosphoproteome. J Proteomics Bioinform 4: 22-35. 81. Peri S, Navarro JD, Kristiansen TZ, Amanchy R, Surendranath V, et al. (2004) Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res 32: D497-501.

216 82. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, et al. (2003) Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 13: 2363-2371. 83. Mathivanan S, Ahmed M, Ahn NG, Alexandre H, Amanchy R, et al. (2008) Human Proteinpedia enables sharing of human protein data. Nat Biotechnol 26: 164-167. 84. Kandasamy K, Keerthikumar S, Goel R, Mathivanan S, Patankar N, et al. (2009) Human Proteinpedia: a unified discovery resource for proteomics research. Nucleic Acids Res 37: D773-781. 85. Goel R, Harsha HC, Pandey A, Prasad TS (2012) Human Protein Reference Database and Human Proteinpedia as resources for phosphoproteome analysis. Mol Biosyst 8: 453- 463. 86. McLachlin DT, Chait BT (2001) Analysis of phosphorylated proteins and peptides by mass spectrometry. Curr Opin Chem Biol 5: 591-602. 87. Aebersold R, Mann M (2003) Mass spectrometry-based proteomics. Nature 422: 198- 207. 88. Maguire PB, Wynne KJ, Harney DF, O'Donoghue NM, Stephens G, et al. (2002) Identification of the phosphotyrosine proteome from thrombin activated platelets. Proteomics 2: 642-648. 89. Zheng H, Hu P, Quinn DF, Wang YK (2005) Phosphotyrosine proteomic study of interferon alpha signaling pathway using a combination of and immobilized metal affinity chromatography. Mol Cell Proteomics 4: 721-730. 90. Molina H, Horn DM, Tang N, Mathivanan S, Pandey A (2007) Global proteomic profiling of phosphopeptides using electron transfer dissociation tandem mass spectrometry. Proc Natl Acad Sci U S A 104: 2199-2204. 91. Bose R, Molina H, Patterson AS, Bitok JK, Periaswamy B, et al. (2006) Phosphoproteomic analysis of Her2/neu signaling and inhibition. Proc Natl Acad Sci U S A 103: 9773-9778. 92. Harsha HC, Molina H, Pandey A (2008) Quantitative proteomics using stable isotope labeling with amino acids in cell culture. Nat Protoc 3: 505-516. 93. Olsen JV, Blagoev B, Gnad F, Macek B, Kumar C, et al. 
(2006) Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell 127: 635-648. 94. Nagaraj N, D'Souza RC, Cox J, Olsen JV, Mann M (2010) Feasibility of large-scale phosphoproteomics with higher energy collisional dissociation fragmentation. J Proteome Res 9: 6786-6794. 95. Han G, Ye M, Liu H, Song C, Sun D, et al. (2010) Phosphoproteome analysis of human liver tissue by long-gradient nanoflow LC coupled with multiple stage MS analysis. Electrophoresis 31: 1080-1089.


232

Chapter 6: 1. Walsh G Biopharmaceutical benchmarks 2010. Nat Biotechnol 28: 917-924. 2. Walsh G Biopharmaceutical benchmarks 2014. Nat Biotechnol 32: 992-1000. 3. Baycin-Hizal D, Tabb DL, Chaerkady R, Chen L, Lewis NE, et al. (2012) Proteomic analysis of Chinese hamster ovary cells. J Proteome Res 11: 5265-5276. 4. Wong NS, Wati L, Nissom PM, Feng HT, Lee MM, et al. An investigation of intracellular glycosylation activities in CHO cells: effects of nucleotide sugar precursor feeding. Biotechnol Bioeng 107: 321-336. 5. Doolan P, Meleady P, Barron N, Henry M, Gallagher R, et al. Microarray and proteomics expression profiling identifies several candidates, including the valosin- containing protein (VCP), involved in regulating high cellular growth rate in production CHO cell lines. Biotechnol Bioeng 106: 42-56. 6. Kantardjieff A, Jacob NM, Yee JC, Epstein E, Kok YJ, et al. Transcriptome and proteome analysis of Chinese hamster ovary cells under low temperature and butyrate treatment. J Biotechnol 145: 143-159. 7. Meleady P, Gallagher M, Clarke C, Henry M, Sanchez N, et al. Impact of miR-7 over- expression on the proteome of Chinese hamster ovary cells. J Biotechnol 160: 251-262. 8. Bonner MK, Poole DS, Xu T, Sarkeshik A, Yates JR, 3rd, et al. Mitotic spindle proteomics in Chinese hamster ovary cells. PLoS One 6: e20489. 9. Beckmann TF, Kramer O, Klausing S, Heinrich C, Thute T, et al. Effects of high passage cultivation on CHO cells: a global analysis. Appl Microbiol Biotechnol 94: 659- 671. 10. Feng HT, Sim LC, Wan C, Wong NS, Yang Y Rapid characterization of protein productivity and production stability of CHO cells by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Rapid Commun Mass Spectrom 25: 1407-1412. 11. Feng HT, Wong NS, Sim LC, Wati L, Ho Y, et al. Rapid characterization of high/low producer CHO cells using matrix-assisted laser desorption/ionization time-of-flight. Rapid Commun Mass Spectrom 24: 1226-1230. 12. 
Hayduk EJ, Choe LH, Lee KH (2004) A two-dimensional electrophoresis map of Chinese hamster ovary cell proteins based on fluorescence staining. Electrophoresis 25: 2545-2556. 13. Meleady P, Doolan P, Henry M, Barron N, Keenan J, et al. Sustained productivity in recombinant Chinese hamster ovary (CHO) cell lines: proteome analysis of the molecular basis for a process-related phenotype. BMC Biotechnol 11: 78.

233 14. Carlage T, Kshirsagar R, Zang L, Janakiraman V, Hincapie M, et al. Analysis of dynamic changes in the proteome of a Bcl-XL overexpressing Chinese hamster ovary cell culture during exponential and stationary phases. Biotechnol Prog 28: 814-823. 15. Slade PG, Hajivandi M, Bartel CM, Gorfien SF Identifying the CHO secretome using mucin-type O-linked glycosylation and click-chemistry. J Proteome Res 11: 6175-6186. 16. Evans VC, Barker G, Heesom KJ, Fan J, Bessant C, et al. De novo derivation of proteomes from transcriptomes for transcript and protein identification. Nat Methods 9: 1207-1211. 17. Valente KN, Schaefer AK, Kempton HR, Lenhoff AM, Lee KH (2014) Recovery of Chinese hamster ovary host cell proteins for proteomic analysis. Biotechnology Journal 9: 87-99. 18. Bendtsen JD, Jensen LJ, Blom N, von Heijne G, Brunak S (2004) Feature-based prediction of non-classical and leaderless protein secretion. Protein Engineering Design & Selection 17: 349-356. 19. Krogh A, Larsson B, von Heijne G, Sonnhammer ELL (2001) Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. Journal of Molecular Biology 305: 567-580. 20. Kall L, Krogh A, Sonnhammer ELL (2004) A combined transmembrane topology and signal peptide prediction method. Journal of Molecular Biology 338: 1027-1036. 21. Emanuelsson O, Nielsen H, Brunak S, von Heijne G (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. Journal of Molecular Biology 300: 1005-1016. 22. Nakai K, Kanehisa M (1992) A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells. Genomics 14: 897-911. 23. Horton P, Park KJ, Obayashi T, Fujita N, Harada H, et al. (2007) WoLF PSORT: protein localization predictor. Nucleic Acids Research 35: W585-W587. 24. Chen YJ, Zhang Y, Yin YB, Gao G, Li SG, et al. (2005) SPD - a web-based secreted protein database. Nucleic Acids Research 33: D169-D173. 25. 
Valente KN, Lenhoff AM, Lee KH Expression of Difficult-to-Remove Host Cell Protein Impurities During Extended Chinese Hamster Ovary Cell Culture and Their Impact on Continuous Bioprocessing. Biotechnol Bioeng. 26. Singh SK (2011) Impact of Product-Related Factors on Immunogenicity of Biotherapeutics. Journal of Pharmaceutical Sciences 100: 354-387. 27. Trexler-Schmidt M, Sargis S, Chiu J, Sze-Khoo S, Mun M, et al. Identification and prevention of antibody disulfide bond reduction during cell culture manufacturing. Biotechnol Bioeng 106: 452-461.

234 28. Kao YH, Hewitt DP, Trexler-Schmidt M, Laird MW Mechanism of antibody reduction in cell culture production processes. Biotechnol Bioeng 107: 622-632. 29. Gao SX, Zhang Y, Stansberry-Perkins K, Buko A, Bai S, et al. Fragmentation of a highly purified monoclonal antibody attributed to residual CHO cell protease activity. Biotechnol Bioeng 108: 977-982. 30. Wang X, Hunter AK, Mozier NM (2009) Host cell proteins in biologics development: Identification, quantitation and risk assessment. Biotechnol Bioeng 103: 446-458. 31. Valente KN, Schaefer AK, Kempton HR, Lenhoff AM, Lee KH Recovery of Chinese hamster ovary host cell proteins for proteomic analysis. Biotechnol J 9: 87-99. 32. Tait AS, Hogwood CE, Smales CM, Bracewell DG Host cell protein dynamics in the supernatant of a mAb producing CHO cell line. Biotechnol Bioeng 109: 971-982. 33. (1997) Points to consider in the manufacture and testing of monoclonal antibody products for human use (1997). U.S. Food and Drug Administration Center for Biologics Evaluation and Research. J Immunother 20: 214-243. 34. Aboulaich N, Chung WK, Thompson JH, Larkin C, Robbins D, et al. A novel approach to monitor clearance of host cell proteins associated with monoclonal antibodies. Biotechnol Prog 30: 1114-1124. 35. Levy NE, Valente KN, Choe LH, Lee KH, Lenhoff AM Identification and characterization of host cell protein product-associated impurities in monoclonal antibody bioprocessing. Biotechnol Bioeng 111: 904-912. 36. Jin M, Szapiel N, Zhang J, Hickey J, Ghose S Profiling of host cell proteins by two- dimensional difference gel electrophoresis (2D-DIGE): Implications for downstream process development. Biotechnol Bioeng 105: 306-316. 37. Wisniewski JR, Zougman A, Nagaraj N, Mann M (2009) Universal sample preparation method for proteome analysis. Nat Methods 6: 359-362. 38. Wang Y, Yang F, Gritsenko MA, Clauss T, Liu T, et al. 
(2011) Reversed-phase chromatography with multiple fraction concatenation strategy for proteome profiling of human MCF10A cells. Proteomics 11: 2019-2026. 39. Olsen JV, de Godoy LM, Li G, Macek B, Mortensen P, et al. (2005) Parts per million mass accuracy on an Orbitrap mass spectrometer via lock mass injection into a C-trap. Mol Cell Proteomics 4: 2010-2021. 40. Xu X, Nagarajan H, Lewis NE, Pan S, Cai Z, et al. (2011) The genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line. Nat Biotechnol 29: 735-741. 41. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25-29.

235 42. Min XJ (2010) Evaluation of computational methods for secreted protein prediction in different eukaryotes. J Proteomics Bioinform 3: 143-147. 43. Chen Y, Zhang Y, Yin Y, Gao G, Li S, et al. (2005) SPD--a web-based secreted protein database. Nucleic Acids Res 33: D169-173. 44. Colzani M, Waridel P, Laurent J, Faes E, Ruegg C, et al. (2009) Metabolic labeling and protein linearization technology allow the study of proteins secreted by cultured cells in serum-containing media. J Proteome Res 8: 4779-4788. 45. Finoulst I, Vink P, Rovers E, Pieterse M, Pinkse M, et al. Identification of low abundant secreted proteins and peptides from primary culture supernatants of human T- cells. J Proteomics 75: 23-33. 46. Yarwood SJ, Woodgett JR (2001) Extracellular matrix composition determines the transcriptional response to epidermal growth factor receptor activation. Proc Natl Acad Sci U S A 98: 4472-4477. 47. Mbeunkui F, Fodstad O, Pannell LK (2006) Secretory protein enrichment and analysis: an optimized approach applied on cancer cell lines using 2D LC-MS/MS. J Proteome Res 5: 899-906. 48. Baycin-Hizal D, Tabb DL, Chaerkady R, Chen L, Lewis NE, et al. Proteomic analysis of Chinese hamster ovary cells. J Proteome Res 11: 5265-5276. 49. Paoletti AC, Parmely TJ, Tomomori-Sato C, Sato S, Zhu D, et al. (2006) Quantitative proteomic analysis of distinct mammalian Mediator complexes using normalized spectral abundance factors. Proc Natl Acad Sci U S A 103: 18928-18933. 50. Termine JD, Kleinman HK, Whitson SW, Conn KM, McGarvey ML, et al. (1981) Osteonectin, a bone-specific protein linking mineral to collagen. Cell 26: 99-105. 51. Brekken RA, Sage EH (2000) SPARC, a matricellular protein: at the crossroads of cell-matrix. Matrix Biol 19: 569-580. 52. Murphy-Ullrich JE, Lane TF, Pallero MA, Sage EH (1995) SPARC mediates focal adhesion disassembly in endothelial cells through a follistatin-like region and the Ca(2+)- binding EF-hand. J Cell Biochem 57: 341-350. 53. 
Sage H, Vernon RB, Funk SE, Everitt EA, Angello J (1989) SPARC, a secreted protein associated with cellular proliferation, inhibits cell spreading in vitro and exhibits Ca+2-dependent binding to the extracellular matrix. J Cell Biol 109: 341-356. 54. Raines EW, Lane TF, Iruela-Arispe ML, Ross R, Sage EH (1992) The extracellular glycoprotein SPARC interacts with platelet-derived growth factor (PDGF)-AB and -BB and inhibits the binding of PDGF to its receptors. Proc Natl Acad Sci U S A 89: 1281- 1285.

236 55. Aguilar-Mahecha A, Cantin C, O'Connor-McCourt M, Nantel A, Basik M (2009) Development of reverse phase protein microarrays for the validation of clusterin, a mid- abundant blood biomarker. Proteome Sci 7: 15. 56. Tremillon N, Morello E, Llull D, Mazmouz R, Gratadoux JJ, et al. PpiA, a surface PPIase of the cyclophilin family in Lactococcus lactis. PLoS One 7: e33516. 57. Valente KN, Lenhoff AM, Lee KH (2015) Expression of difficult-to-remove host cell protein impurities during extended Chinese hamster ovary cell culture and their impact on continuous bioprocessing. Biotechnol Bioeng. 58. Bailey-Kellogg C, Gutierrez AH, Moise L, Terry F, Martin WD, et al. CHOPPI: a web tool for the analysis of immunogenicity risk from host cell proteins in CHO-based protein production. Biotechnol Bioeng 111: 2170-2182. 59. Calis JJA, Maybeno M, Greenbaum JA, Weiskopf D, De Silva AD, et al. (2013) Properties of MHC Class I Presented Peptides That Enhance Immunogenicity. Plos Computational Biology 9. 60. Haynes C, Iakoucheva LM (2006) Serine/arginine-rich splicing factors belong to a class of intrinsically disordered proteins. Nucleic Acids Research 34: 305-312. 61. Nielsen H, Engelbrecht J, Brunak S, vonHeijne G (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Engineering 10: 1-6. 62. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology 215: 403-410. 63. Su G, Blaine SA, Qiao D, Friedl A (2008) Membrane type 1 matrix metalloproteinase-mediated stromal syndecan-1 shedding stimulates breast carcinoma cell proliferation. Cancer Res 68: 9558-9565. 64. Stepp MA, Liu Y, Pal-Ghosh S, Jurjus RA, Tadvalkar G, et al. (2007) Reduced migration, altered matrix and enhanced TGFbeta1 signaling are signatures of mouse keratinocytes lacking Sdc1. J Cell Sci 120: 2851-2863. 65. 
Zettlmeissl G, Conradt HS, Nimtz M, Karges HE (1989) Characterization of recombinant human antithrombin III synthesized in Chinese hamster ovary cells. J Biol Chem 264: 21153-21159. 66. Wollaston-Hayden EE, Harris RB, Liu B, Bridger R, Xu Y, et al. Global O-GlcNAc Levels Modulate Transcription of the Adipocyte Secretome during Chronic Insulin Resistance. Front Endocrinol (Lausanne) 5: 223. 67. Sanner MF (1999) Python: A programming language for software integration and development. Journal of Molecular Graphics & Modelling 17: 57-61.

237 68. Villarreal L, Mendez O, Salvans C, Gregori J, Baselga J, et al. (2013) Unconventional Secretion is a Major Contributor of Cancer Cell Line Secretomes. Molecular & Cellular Proteomics 12: 1046-1060. 69. Olivares-Hernandez R, Bordel S, Nielsen J (2011) Codon usage variability determines the correlation between proteome and transcriptome fold changes. Bmc Systems Biology 5. 70. Rivals I, Personnaz L, Taing L, Potier MC (2007) Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics 23: 401-407. 71. Moritz M, Agard DA (2001) Gamma-tubulin complexes and microtubule nucleation. Curr Opin Struct Biol 11: 174-181. 72. Li XZ, Feng JT, Hu CP, Chen ZQ Does Arkadia contribute to TGF-beta1-induced IgA expression through up-regulation of Smad signaling in IgA nephropathy? Int Urol Nephrol 42: 719-722. 73. Honda S, Shirotani-Ikejima H, Tadokoro S, Tomiyama Y, Miyata T The integrin- linked kinase-PINCH-parvin complex supports integrin alphaIIbbeta3 activation. PLoS One 8: e85498. 74. Friedrich EB, Liu E, Sinha S, Cook S, Milstone DS, et al. (2004) Integrin-linked kinase regulates endothelial cell survival and vascular development. Mol Cell Biol 24: 8134-8144. 75. Tucker KL, Sage T, Stevens JM, Jordan PA, Jones S, et al. (2008) A dual role for integrin-linked kinase in platelets: regulating integrin function and alpha-granule secretion. Blood 112: 4523-4531. 76. Wickstrom SA, Lange A, Hess MW, Polleux J, Spatz JP, et al. Integrin-linked kinase controls microtubule dynamics required for plasma membrane targeting of caveolae. Dev Cell 19: 574-588. 77. Nurden AT, Nurden P (2011) Advances in our understanding of the molecular basis of disorders of platelet function. Journal of Thrombosis and Haemostasis 9: 253-253. 78. Renston RH, Jones AL, Christiansen WD, Hradek GT, Underdown BJ (1980) Evidence for a vesicular transport mechanism in hepatocytes for biliary secretion of immunoglobulin A. Science 208: 1276-1278. 79. 
Worby CA, Dixon JE (2002) Sorting out the cellular functions of sorting nexins. Nat Rev Mol Cell Biol 3: 919-931. 80. Bie AS, Palmfeldt J, Hansen J, Christensen R, Gregersen N, et al. A cell model to study different degrees of Hsp60 deficiency in HEK293 cells. Cell Stress Chaperones 16: 633-640.

238 81. Heldin CH, Landstrom M, Moustakas A (2009) Mechanism of TGF-beta signaling to growth arrest, apoptosis, and epithelial-mesenchymal transition. Curr Opin Cell Biol 21: 166-176. 82. Ikushima H, Miyazono K Biology of transforming growth factor-beta signaling. Curr Pharm Biotechnol 12: 2099-2107. 83. Kekow J, Reinhold D, Pap T, Ansorge S (1998) Intravenous immunoglobulins and transforming growth factor beta. Lancet 351: 184-185. 84. Kekow J, Reinhold D, Pap T, Ansorge S (1998) Intravenous immunoglobulins and transforming growth factor beta. Lancet 351: 184-185. 85. Reinhold D, Perlov E, Schrecke K, Kekow J, Brune T, et al. (2004) Increased blood plasma concentrations of TGF-beta isoforms after treatment with intravenous immunoglobulins (i.v.IG) in patients with multiple sclerosis. J Neuroimmunol 152: 191- 194. 86. Rubartelli A, Bajetto A, Allavena G, Wollman E, Sitia R (1992) Secretion of thioredoxin by normal and neoplastic cells through a leaderless secretory pathway. J Biol Chem 267: 24161-24164. 87. Xu Y, Wong SH, Tang BL, Subramaniam VN, Zhang T, et al. (1998) A 29-kilodalton Golgi soluble N-ethylmaleimide-sensitive factor attachment protein receptor (Vti1-rp2) implicated in protein trafficking in the secretory pathway. J Biol Chem 273: 21783- 21789. 88. Dalal S, Rosser MF, Cyr DM, Hanson PI (2004) Distinct roles for the AAA NSF and p97 in the secretory pathway. Mol Biol Cell 15: 637-648. 89. Praper T, Sonnen AF, Kladnik A, Andrighetti AO, Viero G, et al. Perforin activity at membranes leads to invaginations and vesicle formation. Proc Natl Acad Sci U S A 108: 21016-21021. 90. Yang CZ, Mueckler M (1999) ADP-ribosylation factor 6 (ARF6) defines two insulin- regulated secretory pathways in adipocytes. J Biol Chem 274: 25297-25300. 91. Caumont AS, Galas MC, Vitale N, Aunis D, Bader MF (1998) Regulated exocytosis in chromaffin cells. Translocation of ARF6 stimulates a plasma membrane-associated phospholipase D. J Biol Chem 273: 1373-1379. 92. 
Galas MC, Helms JB, Vitale N, Thierse D, Aunis D, et al. (1997) Regulated exocytosis in chromaffin cells. A potential role for a secretory granule-associated ARF6 protein. J Biol Chem 272: 2788-2793. 93. Song J, Khachikian Z, Radhakrishna H, Donaldson JG (1998) Localization of endogenous ARF6 to sites of cortical actin rearrangement and involvement of ARF6 in cell spreading. J Cell Sci 111 ( Pt 15): 2257-2267.

239 94. Ota T, Suzuki Y, Nishikawa T, Otsuki T, Sugiyama T, et al. (2004) Complete sequencing and characterization of 21,243 full-length human cDNAs. Nature Genetics 36: 40-45. 95. Moscow JA, He R, Gudas JM, Cowan KH (1994) Utilization of Multiple Polyadenylation Signals in the Human Rhoa Protooncogene. Gene 144: 229-236. 96. Mollinedo F, Lazo PA (1997) Identification of two isoforms of the vesicle-membrane fusion protein SNAP-23 in human neutrophils and HL-60 cells. Biochemical and Biophysical Research Communications 231: 808-812. 97. Skalski M, Sharma N, Williams K, Kruspe A, Coppolino MG (2011) SNARE- mediated membrane traffic is required for focal adhesion kinase signaling and Src- regulated focal adhesion turnover. Biochimica Et Biophysica Acta-Molecular Cell Research 1813: 148-158. 98. Chaturvedi LS, Marsh HM, Shang X, Zheng Y, Basscon MD (2007) Repetitive deformation activates focal adhesion kinase and ERK mitogenic signals in human caco-2 intestinal epithelial cells through Src and Rac1. Journal of Biological Chemistry 282: 14- 28. 99. Amado M, Almeida R, Schwientek T, Clausen H (1999) Identification and characterization of large galactosyltransferase gene families: galactosyltransferases for all functions. Biochimica Et Biophysica Acta-General Subjects 1473: 35-53. 100. Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res 42: D199- 205. 101. Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28: 27-30. 102. Kawasugi K, French PW, Penny R, Ludowyke RI (1995) Focal adhesion formation is associated with secretion of allergic mediators. Cell Motil Cytoskeleton 31: 215-224. 103. Slack-Davis JK, Eblen ST, Zecevic M, Boerner SA, Tarcsafalvi A, et al. (2003) PAK1 phosphorylation of MEK1 regulates fibronectin-stimulated MAPK activation. J Cell Biol 162: 281-291. 104. 
Assoian RK, Schwartz MA (2001) Coordinate signaling by integrins and receptor tyrosine kinases in the regulation of G1 phase cell-cycle progression. Curr Opin Genet Dev 11: 48-53. 105. Cunningham KE, Turner JR Myosin light chain kinase: pulling the strings of epithelial tight junction function. Ann N Y Acad Sci 1258: 34-42. 106. Ahram M, Strittmatter EF, Monroe ME, Adkins JN, Hunter JC, et al. (2005) Identification of shed proteins from Chinese hamster ovary cells: application of statistical confidence using human and mouse protein databases. Proteomics 5: 1815-1826.

240 107. Kumar N, Maurya P, Gammell P, Dowling P, Clynes M, et al. (2008) Proteomic profiling of secreted proteins from CHO cells using Surface-Enhanced Laser desorption ionization time-of-flight mass spectrometry. Biotechnol Prog 24: 273-278. 108. Singh SK Impact of product-related factors on immunogenicity of biotherapeutics. J Pharm Sci 100: 354-387. 109. Colaco CA, Bailey CR, Walker KB, Keeble J (2013) Heat Shock Proteins: Stimulators of Innate and Acquired Immunity. Biomed Research International. 110. Nickel W, Rabouille C (2009) Mechanisms of regulated unconventional protein secretion. Nat Rev Mol Cell Biol 10: 148-155. 111. Hughes RC (1999) Secretion of the galectin family of mammalian carbohydrate- binding proteins. Biochim Biophys Acta 1473: 172-185. 112. Nickel W (2003) The mystery of nonclassical protein secretion. A current view on cargo proteins and potential export routes. Eur J Biochem 270: 2109-2119. 113. Flieger O, Engling A, Bucala R, Lue H, Nickel W, et al. (2003) Regulated secretion of macrophage migration inhibitory factor is mediated by a non-classical pathway involving an ABC transporter. FEBS Lett 551: 78-86. 114. Calandra T, Bernhagen J, Mitchell RA, Bucala R (1994) The macrophage is an important and previously unrecognized source of macrophage migration inhibitory factor. J Exp Med 179: 1895-1902. 115. Cooper DNW (1997) Galectin-1: Secretion and modulation of cell interactions with laminin. Trends in Glycoscience and Glycotechnology 9: 57-67. 116. Cleves AE, Cooper DNW, Barondes SH, Kelly RB (1996) A new pathway for protein export in Saccharomyces cerevisiae. Journal of Cell Biology 133: 1017-1026. 117. Lutomski D, Fouillit M, Bourin P, Mellottee D, Denize N, et al. (1997) Externalization and binding of galectin-1 on cell surface of K562 cells upon erythroid differentiation. Glycobiology 7: 1193-1199. 118. Jackson A, Friedman S, Zhan X, Engleka KA, Forough R, et al. 
(1992) Heat-Shock Induces the Release of Fibroblast Growth Factor-I from Nih-3t3 Cells. Proceedings of the National Academy of Sciences of the United States of America 89: 10691-10695. 119. Bennett MK, Scheller RH (1993) The Molecular Machinery for Secretion Is Conserved from Yeast to Neurons. Proceedings of the National Academy of Sciences of the United States of America 90: 2559-2563.

AMIT KUMAR
3003 Van Ness ST NW, Apt W1023, Washington, DC 20008
[email protected]; (443) 850-6787

SUMMARY OF RESEARCH AND PROFESSIONAL EXPERIENCE

Bioinformatics and Systems Biology: Background in statistics and bioinformatics tools, applied to microarray, RNA-Seq, quantitative proteomics, and metabolomics data analysis for studying Type 2 Diabetes, cancer, and the physiology of Vero cells, immune cells, and Chinese Hamster Ovary (CHO) cells. Expert in principal component analysis, differential expression analysis, pathway enrichment analysis, gene ontology (GO) analysis, and gene set enrichment analysis (GSEA). Extensive exposure to Gaussian graphical modeling applied to metabolomics datasets. Broad experience with pathway analysis and data visualization tools such as Ingenuity Pathway Analysis (IPA) and Cytoscape. Specialist in statistical analysis tools such as Partek, Genesis, TIBCO Spotfire, R, and MATLAB. Proficient in building multicomponent in silico models to perform flux balance analysis for studying biological systems and diseases such as Type 2 Diabetes.

Proteomics and Quantitative Proteomics: Extensive experience in dimethyl-labeling-based proteomics for E. coli strains and expertise in analyzing shotgun proteomics data from CHO cells using MyriMatch and Proteome Discoverer software. Experience in data analysis of label-free, iTRAQ, and TMT-labeled proteomics data.

Database design and implementation: Broad experience in database design using SQL. Built the proteome page of the CHO genome website (http://chogenome.org/proteome_new.php).

Target and Biomarker Discovery: Extensive exposure to using genomics, transcriptomics, and proteomics data with pathway integration databases and a systems biology approach for novel target discovery, biomarker discovery, target validation, and understanding the mechanism of action of therapeutic drugs for cancer, Type 2 Diabetes, and immunological diseases.

Mammalian and bacterial cell culture: Adept in using aseptic techniques with extensive experience in working with mammalian and bacterial cell cultures (CHO-S, Vero, and E. coli cell lines) for growth studies and microarrays. Practiced in running and monitoring bacterial batch and chemostat cell cultures with experience in working with Cedex and YSI instruments.

RNA and protein extraction and purification: Skilled in RNA extraction and characterization at the mammalian (mouse tissues and Vero cells) and bacterial (E. coli) levels and in subsequent microarray analysis. Exposed to protein extraction at the bacterial level, including digestion, reductive methylation, and AIX fractionation. Experienced in quality monitoring of purified RNA using gel electrophoresis, spectrophotometer (Nanodrop), and bioanalyzer (Agilent).

Lab instruments and techniques: Extensive exposure to using a biomolecular imager (ImageQuant LAS4000), LC-MS (Agilent), GC-MS (Agilent), TLC, HPLC, UPLC, Western blotting, and an acetate assay kit with a UV-visible spectrophotometer.

Secretome characterization and analysis: Expert in identifying secreted proteins and identifying subcellular protein localization for large proteome datasets using bioinformatics tools such as SignalP, SecretomeP, TMHMM, TargetP, WoLF PSORT, Phobius, and standalone-BLAST (NCBI).

Supervisory: Excellent interpersonal, problem-solving, organizational, multitasking, and time-management skills. Creative and goal-oriented; works well independently and as part of a team. Supervised multiple graduate and undergraduate students. Exceptional troubleshooting skills and proven ability to implement new techniques and equipment.

EDUCATION

2010-2015 (Expected) Ph.D., Chemical Engineering, Johns Hopkins University, Baltimore, MD, USA
GPA: 3.7/4
PhD thesis title: “Understanding physiology of diseases and cell lines using OMICS based approaches”
Advisor: Dr. Michael J. Betenbaugh, Professor, Department of Chemical and Biomolecular Engineering

2004-2008 Bachelor of Technology, Chemical Engineering, Indian Institute of Technology Madras, India
GPA: 7.83/10
Senior year project title: “Simulation and experimental studies of hydrodynamics and mass transfer in an unbaffled vessel”
Advisor: Dr. A. Kannan, Professor, Department of Chemical Engineering

PROFESSIONAL EXPERIENCE

2014-2015: Intern – MedImmune LLC, Gaithersburg, MD
Antibody Discovery & Protein Engineering, Cell Line Engineering and Proteomics
Supervisor – Dr. Deniz Baycin-Hizal

Proteomics
• Contributed to proteomics-bioinformatics efforts in ADPE, MedImmune
• Participated in the proteomics forum together with scientists from various other departments
• Predicted novel biomarkers and targets while working with scientists from the Translational Science, Cell Line Engineering, Oncology, and RIA departments, along with biostatisticians, on MedImmune’s various proteomics projects
• Identified potential antigen targets for a BBB-crossing antibody using statistical analysis and data interpretation techniques, including ANOVA and Fisher’s exact test
• Performed proteomics data analysis to investigate possible biomarkers for the MedImmune DLL4 antibody for cancer
• Implemented bioinformatics analysis on proteomics data of cancer tumor cells for novel target discovery
• Utilized proteomics data for biomarker discovery for systemic lupus erythematosus (SLE)
• Investigated differentially expressed proteins based on TMT-labeled proteomics data to find antigens for MedImmune’s proprietary antibodies
• Carried out comparative transcriptomics analyses to compare high-secretion cell lines with MedImmune’s proprietary CHO cell lines at the microarray and RNA-Seq levels
• Worked on understanding physiological pathways of MedImmune’s proprietary CHO cell lines using bioinformatics techniques
• Worked with biostatisticians to find the optimal normalization method for proteomics data using data comparison methods such as volcano plots, PCA plots, and ANOVA

Web-based Bioinformatics Tool
• Developed a proteomics pipeline for finding secreted and membrane proteins for any cell line
• Contributed to developing a bioinformatics tool for predicting signal peptides, transmembrane domains, and the cellular location of a protein based on its sequence
• Assisted in testing and streamlining the tool for predicting overrepresented gene ontology annotations and biochemical pathways based on the KEGG database

2011 – Present: Visiting Fellow – National Institutes of Health, Bethesda, MD
Biotechnology Core Laboratory – National Institute of Diabetes and Digestive and Kidney Diseases
Supervisor – Dr. Joseph Shiloach

• Created a MATLAB-based computational model to simulate Type 2 Diabetes (T2D) in silico and determined fluxes in various biochemical pathways
• Characterized MKR mice – a T2D animal model – in terms of gene expression between pre-diabetic, fully diabetic, and healthy conditions
• Elucidated the effects of a beta-3 adrenergic agonist administered to fully diabetic MKR mice using gene expression studies with microarrays
• Implemented metabolomics studies to understand the differences in metabolic profiles of MKR mice, healthy mice, and beta-3 adrenergic agonist treated MKR mice
• Shed light, at the genetic and proteomic levels, on the reasons behind differences in acetate production profiles of two E. coli strains growing in defined media with and without supplements
• Carried out microarray analysis to study differences between anchorage-dependent and suspension strains of Vero cells
• Analyzed metabolic flux data from experiments involving 13C labeling to characterize metabolic pathways in mammalian cells using software such as METRAN (Antoniewicz Lab)
• Maintained the laboratory; calibrated and performed preventive maintenance of analytical instruments
• Wrote and reviewed scientific documents (peer-reviewed papers, reviews, book chapters, posters, and presentation slides)

2010 – Present: Research Assistant – Johns Hopkins University, Baltimore, MD
Department of Chemical and Biomolecular Engineering
Supervisor – Dr. Michael J. Betenbaugh

Thesis Title: ‘UNDERSTANDING PHYSIOLOGY OF DISEASES AND CELL LINES USING OMICS BASED APPROACHES’

• Identified and characterized the whole CHO proteome using bioinformatics resources such as KEGG and GO
• Elucidated and quantified the CHO secretome using Myrimatch and other software
• Built a bioinformatics pipeline for analyzing membrane and secreted proteins of the CHO secretome (this pipeline is applicable to all proteomics data)
• Tested the now publicly available gene ontology annotation manager for CHO cells – GO CHO (http://ebdrup.biosustain.dtu.dk/gocho/)
• Built the publicly available proteomics platform for the CHO genome website (http://chogenome.org/proteome_new.php)

2011-2012, Teaching Assistant, Johns Hopkins University Department of Chemical and Biomolecular Engineering

• Served as a teaching assistant for Modeling, Dynamics, and Control of Chemical and Biological Systems (Undergraduate Level – Fall 2011) and Cellular and Molecular Biotechnology (Graduate Level – Fall 2012)
• Supervised graduate and undergraduate students on process control laboratory experiments, cell engineering, and flux balance analysis using a systems biology approach

Sep 2008-June 2010, Executive-Production, Biocon Limited, Bangalore, India

• Implemented technology transfer of various API products from R&D scale to production scale
• Incorporated a scale-up software platform – DynoChem – for ease of modeling and scaling up various unit operations
• Predicted the root cause of significantly low yields of a major product at the manufacturing facility
• Contributed to a 50% yield improvement of a major product at the manufacturing facility
• Increased productivity by reducing the cycle time of a major product by 75%
• Successfully scaled up and validated two API manufacturing processes
• Performed the role of project manager for scaling up new products from R&D to production
• Initiated a new process engineering team for the chemical synthesis block, leading a team of 150 people
• Mentored summer interns on various projects
• Recruited and trained several junior executives in the production department
• Received a rating of 1 during appraisals, indicating exceptional performance, in the financial years 2008-09 and 2009-10

Jul 2008-Sep 2008, Software Consultant, Exeter Group, Bangalore, India

• Developed software in accordance with client requirements
• Underwent training in web development technologies and platforms such as SQL, HTML, C#, and .NET

May 2007-Jul 2007, Intern, Tata Chemicals Limited, Babrala, India

Supervisor: A.K. Gupta, Head of Department, Department

• Implemented Aspen Plus based models for various heat exchangers
• Analyzed the working efficiency of heat exchangers in terms of fouling factors
• Suggested measures to reduce steam consumption in the ammonia plant

May 2006-Jun 2006, Intern, National Chemical Laboratory, Pune, India

Supervisor: Dr. Imran Rahman, Scientist, Chemical Engineering and Process Development Division

• Developed and simulated a FORTRAN based microkinetic model of the reaction of nitric oxide and carbon monoxide on a palladium-coated solid surface
• Verified that the simulation parameters of the model closely matched the experimental data

PUBLICATIONS

Kumar A, Shiloach J, Betenbaugh MJ, Gallagher EJ (2015). “The beta-3 adrenergic agonist (CL-316,243) restores the expression of down-regulated fatty acid oxidation genes in Type 2 diabetic mice”, Nutr Metab (Accepted)

Kumar A, Baycin-Hizal D, Shiloach J, Bowen M, Betenbaugh MJ (2014). “Coupling Enrichment and Proteomics Methods for Understanding and Treating Disease”, Proteomics Clin Appl., Dec 19, doi: 10.1002/prca.201400097

Kumar A, Heffner K, Shiloach J, Betenbaugh MJ, Baycin-Hizal D (2014). “Harnessing proteomics to characterize CHO cell physiology”, Pharmaceutical Bioprocessing, 2(5), 421–435, doi: 10.4155/PBP.14.49

Kumar A, Harrelson T, Lewis NE, Gallagher EJ, LeRoith D, Shiloach J, Betenbaugh MJ (2014) “Multi-Tissue Computational Modeling Analyzes Pathophysiology of Type 2 Diabetes in MKR Mice.” PLoS ONE 9(7): e102319. doi: 10.1371/journal.pone.0102319

Heffner K, Kumar A, Kaas C, Baycin-Hizal D, Betenbaugh MJ (2014). “Proteomics in Cell Culture: From genomics to combined ‘omics for cell line engineering and bioprocess development”, Cell Engineering: Animal Cell Culture, Al Rubeai M. (editor), Springer, doi: 10.1007/978-3-319-10320-4

Heffner K, Baycin-Hizal D, Kumar A, Zhu J, Bowen M, Shiloach J, Betenbaugh MJ (2014). “Exploiting the proteomics revolution in biotechnology: from disease and antibody targets to optimizing bioprocess development”, Current Opinion in Biotechnology 30: Pharmaceutical biotechnology, Dec;30:80-6, doi: 10.1016/j.copbio.2014.06.006.

Baycin-Hizal D, Tabb DL, Chaerkady R, Chen L, Lewis N, Nagarajan H, Sarkaria V, Kumar A, Wolozny D, Colao J, Jacobson E, Tian Y, O’Malley RN, Krag S, Cole RN, Palsson BO, Zhang H, Betenbaugh MJ (2012), “Proteomic Analysis of Chinese Hamster Ovary (CHO) Cells”, J Proteome Res., Nov 2;11(11):5265-76. doi: 10.1021/pr300476w.

In-Preparation

Kumar A, Baycin-Hizal D, Wolozny D, Pedersen LE, Lewis NE, Chaerkady R, Cole RN, Zhang H, Bowen M, Shiloach J, Betenbaugh MJ (2015), “Elucidation of the CHO Super-Ome (CHO-SO) by BioInfo-Proteomics” (In submission).

Baycin-Hizal D, Kumar A, Sebastian Y, Zhu W, Dong H, Zhuang L, Chaerkady R, Cole RN, Zhu J, Betenbaugh MJ, Bowen M (2015), “Comparative Systeomics to Elucidate the Physiology of CHO and SP2/0 Cell Lines” (In preparation).

Kumar A, Baez A, Betenbaugh MJ, Shiloach J (2015), “Global gene transcription and translation in E. coli B and K under minimal media conditions” (In preparation)

Kumar A, Gallagher EJ, LeRoith D, Betenbaugh MJ, Shiloach J (2015), “Metabolic profiling of muscle and liver tissues of MKR mice treated with β3-adrenergic agonist” (In preparation)

ORAL & POSTER PRESENTATIONS

Kumar A, Shiloach J, Betenbaugh MJ, Gallagher EJ (January, 2015) “The Beta-3 Adrenergic Agonist (CL-316,243) Restores the Expression of Down-Regulated Fatty Acid Oxidation Genes in Type 2 Diabetic Mice”, NIH graduate student symposium, NIH, Bethesda, USA (Poster)

Kumar A, Zhuang L, Sebastian Y, Zhu W, Dong H, Xing C, Yamagata R, Yang H, Zhu J, Bowen M, Baycin-Hizal D (August, 2014), “Exploiting the omics revolution by using a comparative bioinformatics approach to study the physiology of protein production by CHO”, Summer Internship, MedImmune LLC, Gaithersburg, USA (Poster)

Kumar A, Harrelson T, Lewis NE, Gallagher EJ, LeRoith D, Shiloach J, Betenbaugh MJ (March, 2014) “Understanding pathophysiology of Type 2 Diabetes using multi-tissue computational modeling”, Graduate Student Seminar, NIH, Bethesda, USA (Oral)

Betenbaugh MJ, Kumar A (March, 2014) “Systems and synthetic biotechnology platform for characterizing and designing CHO cells”, CHOgenome workshop 2014, University of Natural Resources and Life Sciences, Vienna, Austria (Oral)

Kumar A, Harrelson T, Lewis NE, Gallagher EJ, LeRoith D, Shiloach J, Betenbaugh MJ (January, 2014) “Understanding pathophysiology of Type 2 Diabetes using iMB1496”, NIH graduate student symposium, NIH, Bethesda, USA (Poster)

Kumar A, Harrelson T, Lewis NE, Gallagher EJ, LeRoith D, Shiloach J, Betenbaugh MJ (November, 2013) “Multi-tissue modeling analyzes pathophysiology of Type 2 Diabetes in MKR mice”, NIH research festival, NIH, Bethesda, USA (Poster)

Kumar A, Harrelson T, Lewis NE, Gallagher EJ, LeRoith D, Shiloach J, Betenbaugh MJ (March, 2013) “Understanding pathophysiology of Type 2 Diabetes”, NIDDK Fellows retreat, NIH, Bethesda, USA (Poster)

Gallagher EJ, Kumar A, Betenbaugh MJ, Shiloach J, LeRoith D (June, 2012) “Altered fatty acid oxidation and branched-chain amino acid metabolism in pre-diabetic and diabetic mice”, American Diabetes Association meeting, Philadelphia, USA (Poster)

COMPUTER SKILLS

Software: Partek, JMP, Cytoscape, Myrimatch, Qlucore, Genesis, TIBCO Spotfire

Programming Languages: C, Java, JavaScript, FORTRAN, HTML, SQL

Computing Environments: MATLAB, R

Bioinformatics Tools: COBRA, SignalP, SecretomeP, TMHMM, TargetP, Phobius, WoLF PSORT, DAVID, IPA, NCBI BLAST, GO CHO, MaxQuant, GSEA

Databases: CHO genome database (www.chogenome.org), KEGG, BRENDA, UniProt, Flybase, HPRD, OMIM, SPD

Operating Systems: Windows, Linux, and Macintosh

EXTRA-CURRICULAR ACTIVITIES AND ACHIEVEMENTS

• Conducted an open house while serving as Vice President – Public Relations (2013-14), NIH Evening Speakers Club, Toastmasters International, Bethesda
• Volunteered at Manna Food Center, Gaithersburg, MD
• Participated actively in Rickey Meyers Service Day, Baltimore, MD
• Represented Biocon Limited in Bangalore’s (India) biggest cycle marathon – CYCLOTHON (2009)
• Underwent training and drills as a cadet in the National Cadet Corps (Air-Wing) (TN AIR SQN [4] NCC) (2005)
• Performed with distinction in the 42nd annual all-India test organized by UNO (1999)
• Led school house as house captain in senior year of high school
