
QUANTITATIVE GLYCOMICS USING SIMULATION OPTIMIZATION

by

Jun Han

(Under the direction of John A. Miller)

Abstract

Simulation optimization is attracting increasing interest within the modeling and simulation research community. Although much research effort has focused on applying a variety of simulation optimization techniques to diverse practical and research problems, researchers find that existing optimization routines are difficult to extend or integrate, and often must develop their own optimization methods because the existing ones are problem-specific and not designed for reuse. A Semantically Enriched Environment for Simulation Optimization (SEESO) is being developed to address these issues. By implementing generalized semantic descriptions of the optimization process, SEESO facilitates reuse of the available optimization routines and more effectively captures the essence of different simulation optimization techniques. This enrichment is based on the existing Discrete-event Modeling Ontology (DeMO) and the emerging Simulation oPTimization (SoPT) ontologies. SoPT includes concepts from both conventional optimization/mathematical programming and simulation optimization. Represented in ontological form, optimization routines can also be transformed into actual executable application code (e.g., targeting JSIM or ScalaTion). As illustrative examples, SEESO is being applied to several simulation optimization problems. Mass spectrometry (MS) has emerged as the preeminent tool for performing quantitative glycomics analysis. However, the accuracy of these analyses is often compromised by instrumental artifacts, such as low signal-to-noise ratios and mass-dependent differential ion responses. Methods have been developed to address some of these issues by introducing stable isotopes into the glycans under study, but they require robust computational techniques to determine the abundances of the various isotopic forms derived from different experimental sources.
An automated simulation framework for MS-based quantitative glycomics, GlycoQuant, is proposed and implemented to address these issues. Instead of manipulating the experimental data directly, GlycoQuant simulates the experimental data based on a glycan's theoretical isotopic distribution and takes various forms of error sources into consideration. It has been applied to analyze MS raw data generated from IDAWG™ experiments and obtained satisfactory results in the estimation of (1) the ratio of relative abundances of 15N-enriched and natural abundance glycans in a mixture and (2) the 50% degradation time of a 15N-enriched glycan and its "remodeling coefficient" at this time point.

Index words: Quantitative Glycomics, Modeling & Simulation, Simulation Optimization, Mass Spectrometry, Ontology

by

Jun Han

B.E., Beihang University, China, 2002
M.E., Institute of Software, Chinese Academy of Sciences, China, 2007

A Dissertation Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment of the Requirements for the Degree

Doctor of Philosophy

Athens, Georgia

2012

© 2012 Jun Han
All Rights Reserved

by

Jun Han

Approved:

Major Professor: John A. Miller

Committee: William S. York
           Krys J. Kochut
           Maria Hybinette

Electronic Version Approved:

Maureen Grasso
Dean of the Graduate School
The University of Georgia
August 2012

DEDICATION

To my beloved fiancée and parents for their endless love, support and encouragement.

ACKNOWLEDGEMENTS

First of all, I would like to express my sincere appreciation to my major professor, Dr. John A. Miller, for the patient guidance, encouragement and advice that he has provided throughout my time as his student. I have learned numerous things from him, from his diligent and dedicated working attitude, his approach to conducting research, and his programming styles and habits, to paper writing and presentation skills. I would like to thank him for making my graduation possible and enjoyable. I must express my gratitude to the members of my doctoral committee, Dr. William S. York, Dr. Krys J. Kochut and Dr. Maria Hybinette, for their input, valuable discussions and accessibility. I am especially grateful to Dr. York for his direction and help on the development of GlycoQuant. I would also like to thank Dr. Lance Wells at the CCRC for his suggestions in discussions. I would also like to mention my friends and colleagues. I thank Meng Fang for providing the raw experimental data, which made my work possible, and Gregory Silver and Michael Cotterell for their inspiration during discussions and their contributions to the papers. Last but not least, I would like to thank my fiancée, my parents and my relatives, who have always believed in me and given me constant love and support.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS

LIST OF FIGURES

LIST OF TABLES

CHAPTER

1 INTRODUCTION AND LITERATURE REVIEW
  1.1 Systems Biology Overview
  1.2 Glycomics
  1.3 Mass Analysis
  1.4 Metabolic Pathway
  1.5 Modeling & Simulation
  1.6 Simulation Optimization

2 GLYCOQUANT: AN AUTOMATED SIMULATION FRAMEWORK TARGETING ISOTOPIC LABELING STRATEGIES IN MS-BASED QUANTITATIVE GLYCOMICS
  2.1 INTRODUCTION
  2.2 METHODOLOGIES
  2.3 GLYCOQUANT SOFTWARE PLATFORM
  2.4 EVALUATION
  2.5 RELATED WORK
  2.6 CONCLUSIONS

3 SEESO: A SEMANTICALLY ENRICHED ENVIRONMENT FOR SIMULATION OPTIMIZATION
  3.1 Introduction
  3.2 Simulation Optimization Overview
  3.3 Modeling with DeMO, JSIM and ScalaTion
  3.4 Simulation Optimization with ScalaTion, SoPT and Rules
  3.5 SEESO: A Semantically Enriched Environment for Simulation Optimization
  3.6 Case Studies
  3.7 Conclusions and Future Work

4 CONCLUSIONS

REFERENCES

APPENDICES
  A GLYCOQUANT USER GUIDE
  B GLYCOQUANT RESULTS
  C SIMULATION OPTIMIZATION ONTOLOGY (SoPT)

LIST OF FIGURES

1.1 OMICS Overview [1]
1.2 Generation of Protein
1.3 Relationship Between Gene Regulatory Pathways and Metabolic Pathways
1.4 Overview of Molecule and Glycans
1.5 Overview of Mass Spectrometer
1.6 Metabolic History of glycans using IDAWG™ [39, 37]
1.7 Interaction between Simulation Model and Simulation Optimization

2.1 GlycoQuant Workflow
2.2 Comparison of experimental mass spectra and mass spectra simulated by GlycoQuant. Spectra are drawn using the GlycoQuant user interface. (a) Experimental data with high S/N and little ion contamination. (b) Experimental spectrum with moderate noise and ion contamination. (c) Experimental spectrum with low S/N. (d) Experimental spectrum with significant ion contamination.
2.3 Analysis of dynamic IDAWG™ experiments. Isotopologue abundances corresponding to a high remodeling coefficient (a) for (NeuAc)2(Hex)1(HexNAc)1 and a low remodeling coefficient (b) for (Hex)2(HexNAc)2. Numbers [0] - [3] represent the number of nitrogen atoms in the glycan that are derived from the heavy precursor pool. Fully labeled (heavy) glycans contain n nitrogen atoms derived from the heavy precursor pool and correspond to the isotopologue distribution labeled [3] in panel (a) and [2] in panel (b). Glycans undergoing active remodeling contain at least 1 and fewer than n nitrogen atoms derived from the heavy precursor pool.

3.1 Loosely-coupled Software Architecture for Simulation Optimization
3.2 Schematic Diagram for an Urgent Care Facility (UCF) Model
3.3 General Workflow for Simulation Optimization
3.4 Top-level Abstract Classes for SoPT Ontology
3.5 Schematic Representation of Optimization Component in SoPT Ontology
3.6 Schematic Representation of Optimization Problem in SoPT Ontology
3.7 Schematic Representation of Optimization Method in SoPT Ontology
3.8 Screenshot of UCF simulation in ScalaTion
3.9 Mass Spectrometry Model: elemental composition → isotopic distribution → simulated mass spectrum. Cartoon representation comes from the CFG glycan structure database.
3.10 A Sample O-Glycan Metabolic Pathway. Substrates are glycans shown in graphical representation and enzymes are put above the arrows. CMP-Neu5Ac acts as a sugar donor to add one sugar residue (Neu5Ac in this case) to the glycan. Graphical representation follows the specifications in [70].

4.1 GlycoQuant Architecture
4.2 GlycoQuant Home Page
4.3 GlycoQuant Create a New User
4.4 GlycoQuant User Login
4.5 GlycoQuant Configuration Page (upper part)
4.6 GlycoQuant Configuration Page (bottom part)
4.7 GlycoQuant Upload CSV file
4.8 GlycoQuant Upload CSV file (successful)
4.9 GlycoQuant Upload CSV file to Server
4.10 GlycoQuant Configure Experiment Parameters (for O-Glycan)
4.11 GlycoQuant Upload mzXML files
4.12 GlycoQuant Set up parameters for mzXML files (static IDAWG™)
4.13 GlycoQuant Set up parameters for mzXML files (dynamic IDAWG™)
4.14 GlycoQuant Fetch Results
4.15 GlycoQuant Browse Results
4.16 GlycoQuant Browse Results
4.17 GlycoQuant Browse Results (Zoom In)
4.18 GlycoQuant Analyze Results (static IDAWG™)
4.19 GlycoQuant Analyze Results (dynamic IDAWG™)
4.20 GlycoQuant Analyze Results (dynamic IDAWG™, partial data set)
4.21 GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 heavy
4.22 GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 mixture
4.23 GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 heavy
4.24 GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 mixture
4.25 GlycoQuant Results. (Hex)7(HexNAc)2 heavy
4.26 GlycoQuant Results. (Hex)7(HexNAc)2 mixture
4.27 GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) heavy
4.28 GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) mixture
4.29 GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 0hr
4.30 GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 6hr
4.31 GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 12hr
4.32 GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 24hr
4.33 GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 36hr
4.34 GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 0hr
4.35 GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 6hr
4.36 GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 12hr
4.37 GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 24hr
4.38 GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 36hr
4.39 GlycoQuant Results. (Hex)7(HexNAc)2 0hr
4.40 GlycoQuant Results. (Hex)7(HexNAc)2 6hr
4.41 GlycoQuant Results. (Hex)7(HexNAc)2 12hr
4.42 GlycoQuant Results. (Hex)7(HexNAc)2 24hr
4.43 GlycoQuant Results. (Hex)7(HexNAc)2 36hr
4.44 GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) 0hr
4.45 GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) 6hr
4.46 GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) 12hr
4.47 GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) 24hr
4.48 GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) 36hr

LIST OF TABLES

2.1 Glycomics Tools for Identification
2.2 Optimization Results for Static IDAWG™

3.1 Classification of conventional optimization problems
3.2 Examples of optimization components, where x, b and c are the vector of input/decision variables, the constant vector, and the cost coefficient vector, respectively, and A and Q are the coefficient matrices for the constraint and objective functions, respectively. If Q is zero, Equation (3.5) reduces to Equation (3.3).
3.3 Scala Code for UCFModel
3.4 Performance Metric for Optimization Algorithm
3.5 Sample Rule using RIF-BLD Presentation Syntax
3.6 UCF Optimization Results

4.1 Parameter Configuration for O-Glycan and N-Glycan in IDAWG™ experiments
4.2 Sub-projects in GlycoQuant

CHAPTER 1

INTRODUCTION AND LITERATURE REVIEW

Since the beginning of the 21st century, enormous progress has been achieved in the OMICS areas. As summarized by the journal OMICS: A Journal of Integrative Biology1, the areas of OMICS research include Genomics, Transcriptomics, Proteomics, Glycomics and Metabolomics. A more detailed overview of the OMICS research areas and their relationships is shown in Figure 1.1. The definition of the gene has changed with the advancement of research in genetics, biochemistry and molecular biology [2, 3]. The concept of the gene started as an abstract unit of inheritance in genetics. As researchers from various disciplines achieved deeper understanding, genes became "real physical entities – sequences of DNA which when converted into strands of so-called messenger RNA could be used as the basis for building their associated protein piece by piece" [2]. Gene expression is about how genetic information

(genetic code) is interpreted to carry out its biological functions. The central dogma of molecular biology [4],

    DNA --(transcription)--> RNA --(translation)--> Protein,

defines the process of genetic information flow.

1http://www.liebertpub.com/OMI

Figure 1.1: OMICS Overview [1]

As shown in Figure 1.2, within a cell, DNA forms chromosomes, and the genetic information encoded in DNA is transcribed from DNA into pre-mRNA. Pre-mRNA includes two types of segments, introns and exons. After RNA splicing, the pre-mRNA is completely processed: introns within a gene are removed and exons are joined together to generate the final product, mRNA. A sequence of mRNA is translated into protein, which is initially a linear chain of amino acids. During the protein folding process, the linear protein folds into a functional three-dimensional structure and carries out active cellular functions. These regulatory interactions are organized into gene regulatory pathways and construct whole gene regulatory networks.

Figure 1.2: Generation of Protein

After the translation phase, a protein consists of a chain of amino acids defined by the gene sequence; proteins are of vital importance to almost every biological process within a cell. Post-translational modification (PTM) of a protein occurs after its translation. Glycosylation is one of the post-translational modification processes and adds a glycan to the peptide sequence of a protein. Glycans are composed of monosaccharide residues through carbohydrate metabolism and can affect the function of proteins. To sum up, gene regulation, signaling, and metabolic reactions are among the most fundamental life processes. In order to understand the relationship between gene expression and metabolic pathways and how gene expression regulates metabolic pathways, a gene regulatory pathway and a metabolic pathway are put side by side in Figure 1.3. The part connecting the two pathways is the protein: on the one hand, it serves as the end product of the gene regulatory pathway; on the other hand, it acts as an enzyme and catalyzes metabolic reactions. Proteins often act as enzymes and thus participate in both types of pathways. Therefore, regulating the gene expression levels of enzymes will affect the metabolic pathways. Some research has shown that regulation in the glycolysis pathway is accomplished by inhibiting

Figure 1.3: Relationship Between Gene Regulatory Pathways and Metabolic Pathways.

or activating the involved enzymes. For example, the increased uptake and metabolism of glucose by tumour cells is considered to be mainly due to enhanced production of glycolytic enzymes (hexokinase) [5]. Comparative genomics has been integrated with flux balance analysis (FBA), and the effect of gene deletions on the reconstructed metabolic network (E. coli K-12) was examined [6]. The authors found that metabolic networks in bacteria evolve in response to changing environments, including changes in enzyme kinetics and in the expression levels of associated genes. Therefore, they concluded that systems biology cannot stop at the boundaries of the metabolic network. As biochemical pathway models become more and more complicated, researchers try to address the involved biomolecules' structure-function relationships from the perspective of systems biology rather than from an isolated view. The next section gives a brief overview of Systems Biology.

1.1 Systems Biology Overview

Research in Systems Biology started in 1968. However, due to the lack of relevant experimental data and the large demand for computational power, it did not initially gain much notice. Since 2003, it has become an active research area, driven by the fast evolution of experimental techniques and huge progress in experimental instruments and computing power. The history of Systems Biology is reviewed in [7], which addresses its extensive foundations and examines how systems approaches evolved over time. The paper explains the hierarchical structure of the systems and briefly discusses metabolic pathways. Research in Systems Biology can be divided into three phases according to the scope of the target system under study: (1) to begin with, the research aimed at the functions and behaviors of small and isolated biochemical entities, (2) researchers then tried to model the pathways within a particular discipline (e.g., regulatory vs. metabolic) or network, and (3) the current integrative research considers various types of networks and investigates how the behavior of the whole organism is affected by changes in single or multiple system variables.

1. (1968 – 1990) Elementary stages
The general term systems theory can be traced back to [8, 9], in which the general system theory was defined. Research in Systems Biology applied systems theory to the area of biology and was started by M. D. Mesarović in 1968 at the international symposium "Systems Theory and Biology" [10]. Most of the research during this period was limited in scope to a single biochemical entity, because high quality experimental data were hard to obtain and the relationships among molecules and the external environment were unclear to scientists.

2. (1990 – 2002) Expansion to small independent systems
Starting in the 1990s, the availability of experimental data grew rapidly, which, together with a great boost in computational power, led to wider recognition of Systems Biology. Although the processes of metabolism, signaling networks and gene regulation are highly coupled from the current point of view, the research from 1990 to 2002 was more focused on individual processes and concentrated on the mechanisms that drive biological processes. A review of the network-based pathway paradigm is given in [11]. The different aspects of biochemical pathway modeling and analysis include (but are not limited to) network modeling, network structure and robustness.

(a) Gene Regulatory Pathway
Gene regulatory pathways/networks describe the regulatory interactions between transcription factors and their target genes, and have a four-level structure (from top to bottom): transcriptional regulatory network, modules, motifs, and basic units such as transcription factors, target genes and binding sites [12]. Much research has been done to understand the functions of genes through gene expression analysis and gene regulatory pathways. The level of gene expression can be measured from the abundance of messenger RNA (mRNA). A review of how to measure the level of gene expression using high-density DNA arrays is given in [13]. Other techniques have been proposed to monitor the abundance of messenger RNA as well, e.g., qRT-PCR [14] and differential display [15].

(b) Signaling Pathway
Signaling pathways/networks are largely based on protein interactions and implement a variety of cellular functions (e.g., signal transduction, cellular rhythms and intercellular signaling) [16].

(c) Metabolic Pathway
A metabolic pathway is defined as "a series of reactions where the product of one reaction becomes the substrate for the next reaction" [17] or a "series of consecutive enzymatic reactions that produce specific products" [18]. A detailed simulation study of metabolic networks and a metabolic pathway analysis that concentrates on the stoichiometry rather than the kinetic properties of metabolic networks is discussed in [19].

3. (2003 – present) Integrated Systems Biology
The main purpose of Systems Biology is to "examine the structure and dynamics of cellular and organismal function at the system level" [20]. Starting in 2003, more researchers tried to explain the mechanisms of organism functions from a systems approach. Genomic and proteomic analyses have been integrated into metabolic networks [21]. Based on Flux Balance Analysis (FBA), integrated dynamic FBA (idFBA), an extension of the FBA approach, has been proposed to integrate signaling, metabolic, and regulatory networks [22]. Current research on the integration of metabolic reactions and gene regulation is reviewed in [23].

1.2 Glycomics

In order to better understand glycans and glycomics, the first step is to restrict the scope of the problem under investigation. As shown in Figure 1.4, a glycan is an oligosaccharide or polysaccharide, which consists of a few or many monosaccharide units linked and bound together. The research area of Glycomics is reviewed in [24, 25]. The title of [25], "Glycomics: an integrated systems approach to structure-function relationships of glycans", suggests the key characteristics of glycomics. Post-translational modifications (PTMs) that regulate protein function play an important role in determining cell phenotype, and glycosylation is the most common form of protein PTM. In the process of glycosylation, glycans are produced by linking saccharides and are then attached to proteins.

Figure 1.4: Overview of Molecule and Glycans.

Due to the extensive and complex forms of glycosylation, research on glycans needs to take an integrated systems approach and consider the various roles of glycans that are attributed to their interactions with the attached proteins. The technologies enabling this systems approach include the data analysis of gene microarrays and mass spectrometry (MS). The relationship between glycomics and mass spectrometry is discussed in [26, 27, 28, 29]. In order to understand the relationship between the structure and function of glycans and proteins, several large-scale Glycomics research initiatives have been established:

• NIH Consortium for Functional Glycomics (CFG2) created the first version of the glycan structures database comprising structures, molecules and glycosyltransferases.

• The Glycomics research group in the Complex Carbohydrate Research Center (CCRC) at the University of Georgia3 is developing a database as well as computational tools to facilitate the acquisition, description, analysis and sharing of glycomics data. A set of workflow, visualization, ontology, semantic query and browser tools have been and are being developed.

2http://www.functionalglycomics.org

• EuroCarbDB4 includes (1) databases for Nuclear Magnetic Resonance (NMR) and MS data, and (2) applications for building structures, annotating MS data and identifying structures.

• The Human Disease Glycomics/Proteome Initiative (HGPI5) aims at performing disease-related glycomics using two complementary approaches, (1) functional glycomics and (2) high-sensitivity and high-throughput mass spectrometry [30].

1.3 Mass Spectrum Analysis

The principles and progress of mass spectrometry, since its first studies starting in 1912, are summarized in [31]. One of the newer instruments is the Orbitrap mass spectrometer [32]. An overview of mass spectrometry is given in [33], which also outlines the working process of mass spectrometers. As shown in Figure 1.5, the working process of a mass spectrometer can be summarized as follows: (1) a sample is prepared and fed into the mass spectrometer, (2) ions are generated by ionizing the sample and accelerated to possess a given kinetic energy, (3) one or more mass analyzers are applied to separate and sort the ions according to their individual m/z values (mass-to-charge ratio), (4) an ion detector is used to count the number of ions detected, and (5) the processed signal data are output as mass spectra in various formats. The key processes involved in mass spectrometry are the following:

3http://glycomics.ccrc.uga.edu
4http://www.eurocarbdb.org
5http://www.hupo.org/research/hgpi/

Figure 1.5: Overview of Mass Spectrometer.

• Ionization Source
In the process of ionization, particles from the source sample interact with light, electrons or other molecules so that positively or negatively charged ions (+1, +2 or higher) are formed, illustrated as M + e- → M^n+ + (n+1)e-, where n is the charge state of the ion M. Two widely used techniques are ElectroSpray Ionization (ESI) and Matrix-Assisted Laser Desorption Ionization (MALDI).

• Ion Analysis
The goal of ion analysis is to separate the various ions. When positively or negatively charged ions travel through an electromagnetic field at the same speed, they are deflected by the external force of the magnetic or electric field. The degree of deflection is closely correlated with the m/z value and follows Newton's second law of motion, F = ma. If the force (F) is constant, only a portion of the ions will be detected; therefore, the magnetic force can be controlled and adjusted to let more ions pass through. Ion analysis is performed by mass analyzer(s), and different techniques have been developed, e.g., quadrupole, ion trap, Time Of Flight (TOF) and Fourier Transform-Ion Cyclotron Resonance (FT-ICR).

• Ion Detection

When the positively charged ions eventually hit a metal surface, the positive charge is neutralized by the electrons in the surface. The signals recorded by an ion detector are produced either by generating secondary electrons, which are subsequently amplified, or by inducing currents caused by the flow of electrons. These signals are processed by the instrument and computer, and the final output is called a mass spectrum.
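The deflection behavior described above can be illustrated with a small calculation. The following sketch (in Python, purely for illustration; the function name and instrument parameters are hypothetical and not taken from this dissertation) computes the radius of curvature of singly charged ions in an idealized magnetic sector, assuming classical, nonrelativistic motion:

```python
import math

E = 1.602176634e-19      # elementary charge (C)
AMU = 1.66053906660e-27  # atomic mass unit (kg)

def sector_radius(mz: float, accel_voltage: float, b_field: float) -> float:
    """Radius of curvature (m) of a singly charged ion of the given m/z in
    a magnetic sector: qV = m*v^2/2 gives the speed v, then r = m*v/(q*B)."""
    m = mz * AMU                                  # ion mass for charge state 1
    q = E
    v = math.sqrt(2 * q * accel_voltage / m)      # speed after acceleration
    return m * v / (q * b_field)                  # Lorentz force = centripetal force

# Heavier ions curve more gently, so scanning B (or the detected r) separates m/z.
for mz in (500.0, 1000.0, 2000.0):
    r = sector_radius(mz, accel_voltage=5000.0, b_field=0.5)
    print(f"m/z {mz:7.1f} -> radius {r:.3f} m")
```

Since r grows with the square root of m/z at a fixed field strength, scanning the magnetic field brings ions of different m/z onto the detector in turn, which is the basis of the separation step described above.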

The issues encountered in the current quantitative analysis of glycans are summarized in [34, 35]. Two main types of quantitative methods for mass spectral analysis, namely, absolute quantitative glycomics and relative quantitative glycomics, are outlined:

• Absolute Quantification
Absolute quantification determines the quantity of each individual glycan in a sample. It is difficult to achieve at the current time because the quantity determined for each individual glycan depends on numerous factors that differ between samples, such as ionization efficiency, sample preparation and matrix effects in the MS instrument.

• Relative Quantification
The techniques of relative quantitative glycomics focus on determining how the levels of individual glycans change between samples. Strategies for relative quantification using MS analysis address the factors that hinder absolute quantification and introduce less error between different experiments by placing and comparing the glycans of interest within the same spectrum.
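The idea of comparing glycans of interest within the same spectrum can be sketched as a naive intensity-ratio computation. This is an illustrative simplification only (GlycoQuant itself fits simulated spectra to the data rather than summing raw peak intensities), and the function name, peak list and m/z values below are hypothetical toy data:

```python
def envelope_ratio(spectrum, light_mzs, heavy_mzs, tol=0.01):
    """Naive light/heavy abundance ratio: sum the peak intensities within
    `tol` of each expected isotopologue m/z and divide the two totals.
    `spectrum` is a list of (mz, intensity) pairs from one scan."""
    def total(mzs):
        return sum(i for mz, i in spectrum
                   if any(abs(mz - t) <= tol for t in mzs))
    heavy = total(heavy_mzs)
    if heavy == 0:
        raise ValueError("no heavy-envelope signal found")
    return total(light_mzs) / heavy

# Toy spectrum: a light envelope near m/z 675 and a 15N-shifted heavy envelope.
scan = [(675.25, 800.0), (676.25, 300.0),   # light (natural abundance)
        (678.24, 400.0), (679.24, 150.0),   # heavy (15N-enriched)
        (700.00, 50.0)]                     # unrelated contaminant ion
ratio = envelope_ratio(scan, [675.25, 676.25], [678.24, 679.24])
print(f"light/heavy ratio = {ratio:.2f}")   # 1100/550 = 2.00
```

Because both envelopes come from the same scan, instrument-wide factors such as ionization efficiency cancel out of the ratio, which is exactly the advantage of relative over absolute quantification noted above.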

1.4 Metabolic Pathway

Metabolism is defined as "the entire network of chemical reactions carried out by living cells" [17] and is responsible for the changes in a cell and an organism. It is classified into catabolism and anabolism [36]: the former breaks down complex organic molecules to release energy, while the latter captures energy to build up complex organic molecules from simple units. Glycans, as a type of complex carbohydrate molecule, go through various biological processes of synthesis and breakdown during their metabolic history. The concepts related to the different glycan modifications are listed below:

• De novo synthesis of a glycan is the biosynthetic process of constructing complex oligosaccharides/polysaccharides from simple sugars (monosaccharides).

• Degradation of a glycan, sitting on the opposite side of de novo synthesis, is the breaking down of the glycan into simple sugars (or monosaccharides).

• Recycling of a glycan is the process of reusing the individual monosaccharides that are produced by the breaking down of glycans.

• Remodeling of a glycan is the process of changing and replacing residues within the glycan with other residues.

• Turnover of a glycan is the balance between the synthesis and degradation processes. During turnover, glycans are degraded and re-synthesized at various rates.

Because many types of glycan modifications occur during the cultivation of cell samples, it is difficult to perform in vivo experiments to grow cells. One of the in vivo approaches, Isotopic Detection of Aminosugars With Glutamine (IDAWG™) [37, 38], is used here to illustrate the procedures of cell cultivation, isotopic labeling, and the synthesis and degradation of glycans over time, as shown in Figure 1.6.

Figure 1.6: Metabolic History of glycans using IDAWG™ [39, 37]

In the beginning, the cells are grown in a pure heavy medium, which contains a source enriched with heavy nitrogen (15N) (Amide-15N-Glutamine). Following the hexosamine biosynthetic pathway, amino sugars containing nitrogen (e.g., GlcNAc, GalNAc, and NeuAc) are synthesized and labeled with heavy nitrogen, so that a 15N-enriched precursor pool is formed. Then the heavy nitrogen (15N) source is removed and the 15N-enriched cells continue to grow in natural abundance media. When mixed with protein powder, the glycosylation process starts, where amino sugars are synthesized into glycans which are then attached to the proteins. With both types of glycans present, labeled in 15N-enriched and natural abundance media, complex glycan modifications occur within the cells: (1) glycans are generated using the natural abundance precursor; (2) 15N-labeled glycans are degraded and residues containing 15N are released; (3) residues from degraded 15N-labeled glycans are recycled and combined with natural abundance residues to construct new glycans; and (4) 15N-labeled glycans are remodeled with natural abundance residues.
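As a simplified illustration of why isotopologue classes such as [0] - [3] arise, suppose each of a glycan's n nitrogen atoms is drawn independently from a precursor pool in which a fraction p is still 15N-enriched; the class abundances then follow a binomial distribution. This independence assumption is a sketch for intuition only, not the dissertation's actual model, which accounts for additional error sources:

```python
from math import comb

def nitrogen_isotopologue_fractions(n: int, p_heavy: float) -> list[float]:
    """Fraction of glycans containing k = 0..n nitrogen atoms drawn from the
    15N-enriched precursor pool, assuming each of the n nitrogen atoms is
    drawn independently with heavy probability p_heavy (binomial model)."""
    return [comb(n, k) * p_heavy**k * (1 - p_heavy)**(n - k)
            for k in range(n + 1)]

# A glycan with 3 nitrogen atoms, e.g. (NeuAc)2(Hex)1(HexNAc)1 (one nitrogen
# per NeuAc and one per HexNAc), when 70% of the pool is still 15N-enriched:
fractions = nitrogen_isotopologue_fractions(3, 0.7)
for k, f in enumerate(fractions):
    print(f"[{k}] heavy nitrogens: {f:.3f}")
```

Under this toy model, the fully labeled class [n] shrinks and the intermediate classes grow as the heavy fraction p decays over time, which is qualitatively the trend that the dynamic IDAWG™ analysis tracks.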

1.5 Modeling & Simulation

When scientists try to gain a deeper understanding of how real-world systems function, modeling and simulation is often used. System models for the real world can be classified into physical and mathematical models. A physical model is a tangible representation of the targeted system, such as a miniature created to represent a large statue. Mathematical models use symbolic notations and mathematical equations. Simulation is a particular type of mathematical model. Based on its features and attributes, a simulation can be further classified as (1) deterministic or stochastic, (2) discrete or continuous, and (3) static or dynamic [40]. The importance of carrying out modeling & simulation is evident, but it is difficult to transform a conceptual model specification into an executable simulation program due to the large gap between them. A model specification is often expressed in natural language and/or mathematics, while an executable simulation program is written in a programming language, either a Simulation Programming Language (SPL) or a General-Purpose Programming Language (GPL). Several attempts have been made to narrow the gap:

• Ontology
Ontologies are often used to describe the meaning of concepts and the relationships between these concepts [41], and have been widely used to collect domain knowledge and facilitate the construction of domain models. The Discrete-event Modeling Ontology (DeMO) [42, 43], the Simulation oPTimization ontology (SoPT) [44], and other ontologies have been created to share model designs and increase collaboration within the Modeling and Simulation (M&S) community.

• Domain-specific Language (DSL)
A domain-specific language can be used to write clear, concise, and intuitive simulation programs for domain modelers. Utilizing the features of Scala, a modern object-oriented functional programming language, ScalaTion [45] was developed to support several popular modeling paradigms.

1.6 Simulation Optimization

The need for combining simulation and optimization is summarized in [46]. On the one hand, simulation is an approximation of the real world, and in most cases it is impossible to enumerate all scenarios to identify a good enough solution due to the complexity of the problem domain; therefore, simulation needs optimization techniques to provide guidance toward a good solution (usually a global optimum). On the other hand, without the help of simulation, many real-world problems are too complicated to be modeled by explicit mathematical formulations, so traditional optimization techniques (gradient-based approaches and random walk methods) cannot work. This poses a major dilemma for researchers who want to approximate the real world as closely as possible and find a good enough solution at the same time. Simulation optimization addresses this problem by combining the two methods. The term simulation optimization (SO) is widespread in the simulation community and is defined as “optimization of performance measures based on outputs from stochastic (primarily discrete-event) simulations” in [47]. Integrating optimization approaches into simulation tool kits is a necessity for discrete-event simulation software [48].

Surveys and reviews of simulation optimization from various perspectives are given in [49, 50, 51, 47, 52, 53, 54, 55]. A review focusing on both gradient-based techniques (for continuous parameter estimation) and random search methods (for discrete parameter estimation) is presented in [49, 50]. A review of optimization techniques for discrete-event simulation is given in [51], covering both continuous and discrete input parameters. Discrete simulation optimization is reviewed in [52]. Techniques and applications of engineering optimization are addressed in [56] from a broader perspective. Convex optimization and numerical optimization techniques are discussed in [57, 58]. According to the nature of the target problem domains, optimization techniques can be classified into different groups, such as linear and non-linear optimization, local and global optimization, single-objective and multi-objective optimization, unconstrained and constrained optimization, deterministic and stochastic optimization, continuous and discrete optimization, etc. General classifications are listed as follows:

• Continuous parameter estimation
Continuous parameter methods can in general be classified into gradient-based and non-gradient-based.

– Gradient-based techniques and Stochastic Approximation (SA) [59] are the two most widely used optimization algorithms for continuous parameter estimation problems. Many variants of gradient-based approaches have been proposed, which are reviewed in [60].

– Non-gradient approaches include the Nelder-Mead (simplex) method [61], the Rosenbrock method, and the Hooke and Jeeves method. This family of methods is gradient-free and can solve target problems for which explicit derivatives are costly to compute but the fitness function itself is relatively efficient to calculate.

• Discrete parameter estimation
When it is possible to enumerate the possible combinations of the parameters, statistical selection methods, i.e., subset selection, indifference-zone ranking and selection (R&S), and multiple comparison procedures (MCP), can be utilized to find a small feasible region (via subset selection) or the optimal solution itself. However, in most cases the search space is so huge that enumerating and evaluating each candidate solution is impossible. Therefore, due to this combinatorial explosion and the limitations of computing power, random walk methods such as simulated annealing (SA) and tabu search (TS) are used.

• Stochastic optimization
In stochastic optimization, response surface methodology (RSM) [62] has been applied to address stochastic optimization problems. For example, RSM is used in simulation of hedging and trading strategies [63]. RSM seeks to identify the relationship between the response and the control factors. Response surface designs involve (1) determining the number of control factors, (2) ensuring adequate coverage of the region of interest, and (3) conducting experiments and obtaining the accompanying results. The regression coefficients can be optimized and the relationship between the control factors and the response is established.

• Heuristic-based optimization
Heuristics, including the genetic algorithm (GA) and particle swarm optimization (PSO), have been applied to continuous optimization problems. Ant colony optimization (ACO) is a popular heuristic strategy, often applied to combinatorial optimization or discrete problems.
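As a concrete illustration of the response surface methodology steps outlined above, the following sketch fits a second-order surface to noisy simulated responses and locates its stationary point. The simulation function, noise level, and all numeric values here are hypothetical, chosen only to show the fit-then-optimize pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_simulation(x):
    # stand-in for an expensive stochastic simulation run; the true response
    # surface (x - 2)^2 and the noise level are purely hypothetical
    return (x - 2.0) ** 2 + rng.normal(0.0, 0.1)

# (1)-(2) choose design points covering the region of interest and run the
# simulation at each one
design = np.linspace(0.0, 4.0, 21)
responses = np.array([run_simulation(x) for x in design])

# (3) fit a second-order response surface by least squares, then optimize it
b2, b1, b0 = np.polyfit(design, responses, 2)
x_star = -b1 / (2.0 * b2)   # stationary point of the fitted quadratic
```

Because the fitted surface is an explicit quadratic, its optimum is available in closed form; in higher dimensions the same idea applies with a multivariate regression model.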

In most cases, the approaches listed above are combined with each other in order to achieve better performance. Research has been done on combining robust heuristics so that better simulation optimization methods can be established. For example, GA combined with tabu search was applied in [64] to automatically dock peptides and proteins. A global guidance system, a selection-of-the-best procedure, and local improvement are combined to find the global optimum of a stochastic discrete-event simulation with decision variables subject to linear integer constraints [65]. The general interaction between a simulation model and a simulation optimization engine is shown in Figure 1.7. It illustrates how optimization techniques can guide the simulation toward a good enough solution through an iterative process.

[Figure: the simulation model returns a function evaluation to the optimization engine, which proposes the next possible solution.]

Figure 1.7: Interaction between Simulation Model and Simulation Optimization
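The iterative interaction depicted in Figure 1.7 can be sketched as a simple random-search loop, in which a stand-in stochastic simulation supplies function evaluations and the optimizer proposes the next candidate solution. The simulation function and its noise model are hypothetical:

```python
import random

random.seed(7)

def run_simulation(x):
    # stand-in for a stochastic discrete-event simulation returning a noisy
    # performance measure; the response (x - 3)^2 and noise are hypothetical
    return (x - 3.0) ** 2 + random.gauss(0.0, 0.05)

def optimize(n_iter=200, lo=0.0, hi=6.0):
    """Random search: the optimizer proposes a candidate, the simulation
    returns a function evaluation, and the best solution seen is kept."""
    best_x, best_y = None, float("inf")
    for _ in range(n_iter):
        x = random.uniform(lo, hi)   # next possible solution
        y = run_simulation(x)        # function evaluation
        if y < best_y:               # feedback guides the search
            best_x, best_y = x, y
    return best_x

best = optimize()
```

Any of the techniques surveyed above (SA, TS, GA, etc.) can replace the proposal step while the overall evaluate-and-propose loop stays the same.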

CHAPTER 2

GLYCOQUANT: AN AUTOMATED SIMULATION FRAMEWORK TARGETING ISOTOPIC LABELING STRATEGIES IN MS-BASED QUANTITATIVE GLYCOMICS

Jun Han1 John A. Miller1,2 Meng Fang3,4 Lance Wells2,3,4,5 René Ranzinger3 William S. York1,2,3,5

To be submitted to the Journal of Proteome Research.
1Department of Computer Science, University of Georgia, Athens, Georgia, USA
2Institute of Bioinformatics, University of Georgia, Athens, Georgia, USA
3Complex Carbohydrate Research Center, University of Georgia, Athens, Georgia, USA
4Department of Chemistry, University of Georgia, Athens, Georgia, USA
5Department of Biochemistry & Molecular Biology, University of Georgia, Athens, Georgia, USA

Abstract

Mass spectrometry (MS) has emerged as the preeminent tool for performing quantitative glycomics analysis. However, the accuracy of these analyses is often compromised by instrumental artifacts, such as low signal-to-noise ratios and mass-dependent differential ion responses. Methods have been developed to address some of these issues by introducing stable isotopes to the glycans under study, but these methods require robust computational methods to determine the abundances of various isotopic forms derived from different experimental sources. An automated simulation framework for MS-based quantitative glycomics, GlycoQuant, is proposed and implemented to address these issues. Instead of manipulating the experimental data directly, GlycoQuant simulates the experimental data based on a glycan’s theoretical isotopic distribution and takes various forms of error sources into consideration. It has been applied to analyze raw MS data generated from IDAWG™ experiments and obtained satisfactory results in the estimation of (1) the ratio of relative abundances of 15N-enriched and natural abundance glycans in a mixture and (2) the 50% degradation time of 15N-enriched glycans and their “remodeling coefficient” at this time point.

2.1 INTRODUCTION

2.1.1 Glycan and MS-based Glycomics

Complex carbohydrates, such as the O-linked and N-linked glycans of glycoproteins, have important biological functions, from regulating gene expression [66] to mediating cellular interactions [67]. Glycomics is an emerging discipline that focuses on the structures, biosynthesis and biological functions of glycans. Although many of the experimental and quantitative approaches that have been developed for proteomics can be applied to glycomics, the additional challenges of glycomics require the development of new analytical and data processing methods. These challenges stem from three interrelated aspects of glycomics analysis: (1) the diversity of the chemical structures of glycans, (2) the complexity of glycan biosynthesis, and (3) the complexity of the functional information encoded in glycan structures [25]. In order to cope with these complexities, integration of knowledge and information from various branches of biology (including genetics and proteomics) as well as other disciplines (computer science, statistics, etc.) is necessary. This has led one group to define glycomics as “an integrated systems approach to structure-function relationships of glycans” [25]. Mass spectrometry (MS) is a powerful tool that is widely utilized in quantitative analysis of chemical samples due to its capacity for high throughput, high precision and high sensitivity. An MS experiment involves the ionization and gas-phase analysis of molecules to generate data in the form of a mass spectrum, which is generally represented as a graph of ion intensity versus mass-to-charge ratio (m/z), providing information regarding the molecular mass of the molecules in the sample. MS is a well-established method for protein identification and quantification in the field of proteomics [34] and is well suited to provide the same types of information in analytical glycomics.
The science of glycomics has advanced in no small part due to the development of new mass spectrometry techniques specifically designed for glycan analysis [27, 28, 29, 38].

Table 2.1: Glycomics Tools for Identification

GlycoMod [68]: de novo calculation of glycan monosaccharide compositions from experimental masses and possible monosaccharide residues. Optional database search for structures with that composition in GlycoSuiteDB [69].
Cartoonist [70]: identification from archetype cartoons and automatic annotation of glycan structures.
GlycoPepDB [71]: identification of glycan structures based on database search.
Glyco-Peakfinder [72]: de novo analysis of glycan compositions. Optional database search for structures with that composition in www.glycosciences.de.
GlycoWorkbench [73]: manual annotation of glycan structures. There is an option to run a search against several databases.
SimGlycan™ [74]: glycan structure identification and annotation based on database search.

Although glycan identification and quantification are important issues in glycomics analysis, most of the currently available software tools for MS-based glycomics focus on glycan identification and annotation. MS deconvolution tools identify the signals comprising each isotopic multiplet in the spectrum and process these data to establish the monoisotopic mass and ionic charge of the corresponding analyte ion. The resulting lists of mass/abundance pairs are frequently used as the input to software tools that implement either database searches or de novo structure matching methods to assign glycosyl compositions or, in some cases, chemical structures to the glycans being analyzed. The structures identified by these methods can be used to manually or automatically annotate the spectrum, thereby facilitating its interpretation.

2.1.2 MS-based Quantitative Glycomics

Quantitative glycomics methods can provide estimates of the absolute or relative amount of each glycan in a sample. Although absolute quantification methods have been widely applied in proteomics [34, 75], such methods have not been used extensively in glycomics [35]. Methods for relative quantification often focus on the changes in the abundances of glycans that occur during the development of disease or other biological processes. These methods can use label-free approaches or isotopic labeling strategies. As label-free approaches are comparatively simple to devise and do not require selective preparation of isotopically enriched standards, they are widely used in proteomics [76] and glycomics [77, 78]. Label-free quantitation often serves as a screening process to select glycans of interest for further profiling and quantitative analysis [35], while isotopic labeling strategies are often employed for more accurate glycan quantitation. Isotopic labeling strategies rely on the incorporation of stable isotopes (e.g., 2H, 13C, 15N and 18O) into glycans. Various strategies for labeling biopolymers with stable isotopes have been proposed, including both metabolic (in vivo) and in vitro labeling methods. In vivo methods, which involve growing cells or tissues in isotopically enriched media, include SILAC [79, 80] for proteomics and IDAWG™ [37, 38] for glycomics. In vitro labeling methods, which use isotopically enriched chemical reagents to generate derivative molecules, include iTRAQ [81] and QUIBL [82]. Many of these methods are implemented by mixing two samples that are differentially labeled with isotopic tags. For example, a “heavy” (e.g., 15N-enriched) sample may be mixed with a “light” (e.g., natural isotopic abundance) sample before mass spectral analysis. In order to interpret MS data obtained by analysis of mixtures, the isotopologues of each compound of interest must be considered.
Isotopologues are molecules that differ only in their isotopic composition (http://goldbook.iupac.org/I03351.html). The stable isotopes of hydrogen are 1H (99.985% natural abundance on earth) and 2H (0.015%), and the stable isotopes of carbon are 12C (98.9%) and 13C (1.1%). Radioactive (unstable) isotopes of these elements have extremely low natural abundances and are usually neglected in quantitative glycomics analysis. The ten stable isotopologues of methane are ¹²C¹H₄, ¹²C¹H₃²H, ¹²C¹H₂²H₂, ¹²C¹H²H₃, ¹²C²H₄, ¹³C¹H₄, ¹³C¹H₃²H, ¹³C¹H₂²H₂, ¹³C¹H²H₃, and ¹³C²H₄. The chemical formula CH4 implies an isotopologue mixture whose composition depends on the natural abundance of each element. The most abundant isotopologue in natural abundance methane is ¹²C¹H₄ (98.84%), although other isotopologues (especially ¹³C¹H₄ at 1.099%) are also present. Molecules that are enriched in one or more of these isotopologues (e.g., 12C-enriched methane or 13C-enriched methane) can be purchased or synthesized in the laboratory. The isotopologue composition of the material depends on its degree of enrichment. For example, 95% 13C-enriched methane contains 94.94% ¹³C¹H₄ and 4.997% ¹²C¹H₄ along with other, less abundant isotopologues.

Although several software tools are available for automated glycan structure prediction and annotation, very little software has been developed for quantitative glycomics or validation of these tools [35]. Most of the available MS-based quantitative software tools that have been developed are designed for proteomics [83], including MSQuant [84], which

supports the SILAC method, RelEx [85], the Mascot quantification module [86] (http://www.matrixscience.com), SEQUEST (http://fields.scripps.edu/sequest/index.html), and the quantification modules XPRESS [87], ASAPRatio [88] and Libra that are available as part of the Trans-Proteomic Pipeline (http://tools.proteomecenter.org/) [89]. Several major issues confound MS-based quantitative glycomics, including the requirement to fit noisy or otherwise non-ideal experimental data to theoretical isotopic distributions. It is often necessary to process very large volumes of such MS data for which major sources of non-ideality are unknown. Interpretation of these data (e.g., structural annotation of the major peaks) requires highly specialized expertise that is frequently limited to

scientists within the area of MS-based glycomics. Faced with high volumes of raw MS data, even highly trained glycoanalysts have difficulty in manually annotating raw MS data and identifying non-idealities or their theoretical underpinnings.

Properly defined models are required to calculate theoretical isotopic distributions that provide quantitatively accurate results. Simplified models that do not correspond well to the chemical system under study will yield inaccurate results when calculating mass and relative abundance. This inaccuracy may propagate to later computations and lead to larger discrepancies in the final results. For example, theoretical isotopic distributions for proteomics MS data are often calculated using the so-called averagine model [90], which uses the average elemental composition of the 20 commonly found amino acids to calculate the theoretical isotope distribution pattern of a peptide ion at a given ion mass. The theoretical pattern is calculated by approximating the ion as a sum of several (identical) average amino acids. One source of error in averagine-based methods is the practice of rounding the numbers of hydrogen (H), carbon (C), nitrogen (N), oxygen (O) and sulfur (S) atoms to the nearest integer and adjusting the number of H atoms to obtain an ion corresponding to an experimentally observed molecular mass. Simulations based on this approach can have significant deviations from the observed patterns, leading to fundamental errors such as incorrectly identifying the monoisotopic ion peak in a high resolution mass spectrum and misinterpretation of analytical and quantitative results. (The monoisotopic ion, which is composed exclusively of the most abundant isotope of each element in the structure, is the feature most often used to identify the specific structure giving rise to the observed isotopic pattern.) Such errors can result in incorrect assignment of ion structure and significant quantitation inaccuracies.
Averagine-based approaches are clearly inappropriate for quantitative glycomics. At a minimum, it is necessary to modify the model to account for the average elemental composition of a monosaccharide residue rather than that of an amino acid residue. We have chosen to implement an alternative quantitative glycomics approach that employs MIDA (Mass Isotopomer

Distribution Analysis [91]). This involves calculating the masses and relative abundances of the isotopologues of specific glycan compositions identified in the sample by other methods. The isotopic distribution simulations are thus based on well-defined elemental compositions, allowing exact masses to be calculated without resorting to arbitrary rounding operations. This provides, for example, more confident assignment of the monoisotopic peaks in a mass spectrum, because each experimentally observed isotopic pattern is evaluated by comparison to one or more ideal patterns based on explicit monosaccharide residue compositions rather than arbitrarily defined average compositions. Several sources of error that are introduced by the physical limitations of the experimental process and analytical instrumentation are often ignored in quantitative proteomics [85]. This is also the case for quantitative glycomics. Although significant improvements in many aspects of MS instrumentation (precision, sensitivity, etc.) are continually being achieved, several inherent sources of error continue to be highly problematic. These include ion suppression effects, mass-dependent differential ion responses, performance inconsistencies among different instruments, and process variation during sample preparation [35]. Many quantitative MS-based methods involve techniques designed to address these errors. For instance, simultaneous, comparative mass analysis of the heavy and light components of mixtures can provide data with internal compensation for disparities in the performance of different instruments (e.g., SILAC and IDAWG™). However, development of data models and processing methods that minimize such errors in quantitative MS analysis remains a major challenge for experimentalists and computer scientists.
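As a sketch of how a theoretical isotopic distribution can be computed from an explicit elemental composition (rather than an averagine-style approximation), the following code convolves per-atom isotope patterns binned by nominal mass shift. The abundance values are standard natural abundances; the function name and encoding are illustrative, not part of GlycoQuant:

```python
import numpy as np

# natural isotope abundances, indexed by nominal mass shift
# from the lightest isotope of each element
ISOTOPES = {"C": [0.989, 0.011],
            "H": [0.99985, 0.00015],
            "N": [0.99636, 0.00364],
            "O": [0.99757, 0.00038, 0.00205]}

def isotope_pattern(formula):
    """Aggregated isotopic distribution of a chemical formula, computed by
    repeatedly convolving the per-atom isotope patterns."""
    dist = np.array([1.0])
    for element, count in formula.items():
        for _ in range(count):
            dist = np.convolve(dist, ISOTOPES[element])
    return dist

pattern = isotope_pattern({"C": 1, "H": 4})   # natural abundance methane
```

For methane this reproduces the figures quoted earlier: the monoisotopic bin is about 98.84% and the +1 bin about 1.16% (dominated by ¹³C¹H₄).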
In order to address these issues, we have proposed and implemented an automated simulation framework called GlycoQuant, aimed specifically at isotopic labeling strategies that we have developed for MS-based quantitative glycomics. GlycoQuant provides quantitative functions that implement the following steps: (1) generation of theoretical isotopic distributions for previously identified glycan compositions, (2) processing of raw experimental MS

data to extract spectral regions of interest and identify the monoisotopic peak, (3) simulation of mass spectral patterns for comparison to experimentally observed patterns, (4) systematic modification of the simulated patterns to incorporate the effects of known sources of error, (5) estimation of the relative abundance levels of isotopologues by optimizing the correspondence between simulated and experimentally observed spectral patterns, and (6) estimation of the degradation time and “remodeling coefficient” (defined in Section 2.6) of glycans using MS data obtained by sampling at various times after changing the isotopic composition of precursor molecules added to the cell cultures producing the glycans.

2.2 METHODOLOGIES

2.2.1 Isotopic Labeling Strategy

In vivo (metabolic) isotopic labeling involves the growth of cells or tissues in the presence of isotopically enriched precursor molecules that are incorporated into the biopolymer of interest. For example, growth of animal cells in the presence of 15N-enriched glutamine results in the production of N- and O-glycans that contain 15N-enriched amino sugars [37]. Analysis of the resulting isotopologue distributions provides information regarding the source (isotopically enriched or natural-abundance precursors) from which each glycan was ultimately derived. The isotopologue distribution of a molecule or ion can be represented as a vector D = (d1, d2, . . . , dm) whose elements each correspond to the normalized abundance of a specific isotopologue and where m is the number of possible isotopologues for that molecule. It should be noted that it would be impractical to completely specify D, as m is very large for molecules having more than a few atoms and many of the isotopologues are present in vanishingly small amounts. For example, the isotopologue ¹³C²H₄ has a normalized abundance of only 5.6 × 10⁻¹⁸ in natural abundance methane. Each isotopologue is composed of multiple isotopomers, each having the same isotopic

composition but differing in the positions of the various isotopes within the structure (http://goldbook.iupac.org/I03352.html). Each of the isotopomers comprising a (randomly labeled) isotopologue has the same probability A_isotopomer, which depends on its isotopic

composition and the abundance of each isotope. That is, A_isotopomer = Π_j a_j^(n_j), where n_j is the number of atoms of isotope j in the isotopomer and a_j is the abundance of isotope j as a fraction of all isotopes of the same element. The probability A_isotopologue of each isotopologue corresponds to the number of its constituent isotopomers times the probability of each isotopomer (computed above). A_isotopologue is thus given by a probability mass function of the multinomial distribution:

    A_isotopologue = [ Π_k (n_k)! / Π_j (n_j)! ] × Π_j a_j^(n_j)        (2.1)

where n_k = Σ_{j ∈ isotopes(k)} n_j is the number of atoms of element k in the isotopologue.

For example, there are four possible isotopomers of ¹²C¹H₃²H, because the ²H isotope can be present at any one of four different positions, and the abundance of ¹²C¹H₃²H is

    A_isotopologue = (1!/(1! 0!)) × (4!/(3! 1!)) × a₁₂¹ a₁₃⁰ a₁³ a₂¹ = 4 × 0.0001483 = 0.000593,

where the number subscripted to each abundance refers to the nominal mass of an isotope. The mass M_isotopologue of each isotopologue is

    M_isotopologue = Σ_j n_j × m_j        (2.2)

Equation 2.1 can be used to calculate the isotopologue populations of an isotopically enriched molecule if one specifies the isotopically enriched components of the molecule as pseudoelements. For example, a glutamine molecule contains two nitrogen atoms; in amide-15N-enriched glutamine, the isotopic composition of one of these nitrogen atoms is modified. The chemical formula of
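Equations 2.1 and 2.2 can be implemented directly. The sketch below reproduces the ¹²C¹H₃²H example; the function names and the dictionary encoding of isotope counts are illustrative, not part of GlycoQuant:

```python
from math import factorial, prod

# isotope abundances a_j, keyed by (element, nominal mass); illustrative subset
ABUNDANCE = {("C", 12): 0.989, ("C", 13): 0.011,
             ("H", 1): 0.99985, ("H", 2): 0.00015}

def isotopologue_abundance(counts):
    """Equation 2.1: counts maps (element, nominal mass) -> n_j."""
    n_elem = {}                                   # n_k: atoms per element k
    for (element, _), n in counts.items():
        n_elem[element] = n_elem.get(element, 0) + n
    isotopomers = prod(factorial(n) for n in n_elem.values()) // \
                  prod(factorial(n) for n in counts.values())
    per_isotopomer = prod(ABUNDANCE[j] ** n for j, n in counts.items())
    return isotopomers * per_isotopomer

def isotopologue_mass(counts, masses):
    """Equation 2.2: M = sum over isotopes j of n_j * m_j."""
    return sum(n * masses[j] for j, n in counts.items())

# the 12C 1H3 2H isotopologue of methane: 4 isotopomers
abundance = isotopologue_abundance({("C", 12): 1, ("H", 1): 3, ("H", 2): 1})
```

The multinomial prefactor (here 4!/(3! 1!) = 4) counts the isotopomers, matching the worked example in the text.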

glutamine is C5H10N2O3, but we define the pseudochemical formula of amide-15N-enriched glutamine to be C5H10NO3Ñ, where Ñ is a pseudoelement comprised of the same isotopes (15Ñ and 14Ñ) as nitrogen, but with different abundances. Thus, 98% amide-15N-enriched glutamine has an amide containing the pseudoelement Ñ, in which 98% of the atoms are 15Ñ

and 2% are 14Ñ. When Equation 2.1 is applied, pseudoelements are treated the same way as natural abundance elements. Specification of pseudoelements allows the isotopologue populations of molecules that are derived from a combination of natural abundance and isotopically enriched precursors to be fully defined. For example, certain culture conditions may generate glycan molecules having two amino sugars, one containing N (derived from natural abundance glutamine) and the other containing Ñ (derived from 15N-enriched glutamine). The pseudochemical formula of this glycan molecule is thus CxHyNOzÑ, where x, y and z are the numbers of C, H and O atoms, respectively. Quantitative analysis of the isotopic composition of such partially enriched glycans is critical in order to understand the data generated by in vivo labeling methods. Switching from natural-abundance media to isotopically enriched medium (containing precursors such as amide-15N-enriched glutamine) alters the isotopic composition of the glycans in the culture in a time-dependent manner. Typically, “heavy” glycans are prepared from cells that have been grown for several days in isotopically enriched medium. At that time, most of the glycan molecules are enriched to nearly the same degree as the enriched precursors that have been added. However, some of the glycan molecules are more incompletely labeled, containing atoms derived from the original natural abundance precursors that were present before the medium was changed. We model this situation by defining certain atoms from the isotopically enriched precursors as pseudoelements. Thus, for a glycan containing n nitrogen atoms that is produced by a cell grown in the presence of amide-15N-enriched glutamine, we define n + 1 different pseudochemical formulae, each corresponding to a different number of Ñ atoms. For example, four formulae (CxHyN3Oz, CxHyN2OzÑ, CxHyNOzÑ2, and CxHyOzÑ3) are defined for glycans containing three nitrogen atoms.
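Because a pseudoelement behaves like any other element in Equation 2.1, its effect can be sketched as one extra convolution term. The snippet below computes the combined nitrogen-isotope pattern (binned by mass shift) for a molecule containing one natural N and one Ñ, as in amide-15N-enriched glutamine; the code is illustrative, not part of GlycoQuant:

```python
import numpy as np

# abundance vectors indexed by mass shift (+0, +1); N_TILDE is the
# pseudoelement for the 98% 15N-enriched amide nitrogen
N_NATURAL = np.array([0.99636, 0.00364])   # 14N, 15N
N_TILDE   = np.array([0.02, 0.98])         # 14N~, 15N~

# nitrogen-isotope pattern of a molecule with one N and one N~
pattern = np.convolve(N_NATURAL, N_TILDE)
```

The +1 bin dominates, reflecting that almost every molecule carries exactly one heavy nitrogen from the enriched amide.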
Each of these formulae corresponds to a unique population of isotopologues, whose abundances can be predicted if the isotopic purity of

the enriched precursor is specified when defining Ñ. (See Figure S1 and Table S1 in the supplemental material.) The isotopologues of each glycan can thus be grouped into n + 1 sets, each corresponding to a different pseudochemical formula. The total abundance of molecules in each of these isotopologue sets can be described by an abundance distribution vector T = (t0, t1, . . . , tn), with Σ_f t_f = 1, where t0 is the total abundance of molecules in the isotopologue set that contains zero Ñ atoms, etc. It is important to note that T describes the contribution of the heavy and light precursor pools to the glycan. That is, a molecule that is derived solely from the light (natural abundance) precursor pool has the formula CxHyNnOz; glycan populations that consist entirely of such molecules are described by t0 = 1.0, even though some of these molecules contain 15N atoms. Similarly, a molecule that is derived

solely from the heavy (15N-enriched) precursor pool has the formula CxHyOzÑn; glycan populations that consist entirely of such molecules are described by tn = 1.0, even though some of these glycan molecules have fewer than n 15N atoms. For glycan populations that are derived from some combination of the light and heavy precursor pools, both t0 and tn are non-zero. Unless cells have been grown in the presence of isotopically enriched (heavy) precursor pools for an extremely long time (which is expensive and impractical), each glycan in the resulting isotopically-enriched sample will have an isotopologue population corresponding to the last scenario. Real mass spectral data include noise and signals from contaminating molecules. GlycoQuant compares isotopic distribution patterns within a real mass spectrum to patterns simulated using a priori knowledge of the molecular compositions of molecules that are represented in the spectrum. For a glycan containing n nitrogen atoms, the isotopologue abundance pattern for each of the n + 1 pseudochemical formulae is calculated. (See Figure S1 in the supplemental material.) Then, GlycoQuant simulates complete spectral patterns corresponding to linear combinations (specified by the coefficients T) of these isotopologue abundance patterns. The resulting spectral patterns are compared to real mass spectra and

the elements of T are optimized by maximizing the objective function, the coefficient of determination R² (square of the Pearson correlation coefficient) between the two spectral patterns. “Heavy” glycans prepared from cells grown in the presence of isotopic label can be used as standards for quantitation of “light” glycans produced by cells grown in natural abundance medium (i.e., the analyte). In this case (which we call static IDAWG™), the mass spectrum is recorded for a sample prepared by mixing the analyte and standard in known proportions. For each glycan in the sample, GlycoQuant calculates the ratio of ion signals derived from the analyte to ion signals derived from the standard. This provides an estimation of the molar ratio of the glycan in the two samples. When cells are grown in the absence of isotopically enriched precursors, the isotopologue set abundance distribution for each glycan is T = (1.0, 0.0, . . . , 0.0). As cells grow in the presence of isotopically enriched precursors, t0 → 0 and tn → 1. Although the other elements of T (i.e., tf, where f ≠ 0 and f ≠ n) may increase transiently, ultimately tf → 0 for all f ≠ n. For static IDAWG™, GlycoQuant independently optimizes the elements of two vectors,

TH and TM , which correspond to the isotopologue set abundance distributions for the glycan in the standard (heavy) sample and in the analyte/standard mixture, respectively. The abun-

dance distribution TL of the analyte (light) glycan is assumed to be TL = (1.0, 0.0,..., 0.0).

The abundance distribution TH of the standard glycan is determined by analysis of the (heavy) standard before mixing. The mixture is then analyzed to determine its abundance distribution TM . Ideally (in the absence of noise and contaminating ions), the observed dis- tribution TM = l × Tl + h × Th is a linear combination of Tl and Th where l + h = 1.0. The ratio of L to H corresponds to the abundance ratio for the glycan in the mixture. GlycoQuant also provides quantitative analysis of dynamic IDAWGTMdata. In this case, cells are grown for several days in the presence of the isotopically enriched precursor 15N- enriched glutamine. Then the 15N-enriched glutamine is replaced with natural abundance

glutamine. Glycan samples are prepared immediately before changing the precursor and at various times thereafter. GlycoQuant provides estimates of the populations of each of the n + 1 sets of isotopologues that contain from 0 to n nitrogen atoms derived from the isotopically enriched precursor. This corresponds to the elements of the vector T described above. Thus, GlycoQuant provides information regarding the metabolic history of each glycan analyte.

By comparing simulated and observed spectral features for each glycan, the GlycoQuant algorithm minimizes the effects of contaminating molecules and provides statistics that can be used to judge the quality of the quantification. Simply comparing the sum of all ion signals within a specified region of the standard spectrum to the sum of all ion signals in the corresponding region of the analyte spectrum can lead to errors, as some of these signals may correspond to irrelevant "contaminating" molecules that are present in the standard or analyte. By fitting observed spectral peaks to predictable isotopologue distributions, the GlycoQuant algorithm provides quantification based solely on ions that correspond to isotopologues of the glycan of interest. When the optimization procedure for a glycan produces low correlation, it typically indicates low signal to noise or the presence of contaminating molecules, which can prompt further evaluation of the results for that glycan.
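The static IDAWG™ mixture model described above (TM = l × TL + h × TH with l + h = 1.0) reduces to a one-parameter least-squares fit once TL and TH are known. The following Python sketch illustrates the idea only; it is not the GlycoQuant implementation (which fits full spectral patterns), and the function name is ours:

```python
def fit_mixture(t_light, t_heavy, t_mix):
    """Least-squares fit of t_mix ~ l*t_light + h*t_heavy subject to l + h = 1.

    Substituting l = 1 - h turns this into a 1-D problem with the closed-form
    solution below.  Returns (l, h).
    """
    d = [th - tl for th, tl in zip(t_heavy, t_light)]  # direction TH - TL
    r = [tm - tl for tm, tl in zip(t_mix, t_light)]    # residual  TM - TL
    h = sum(di * ri for di, ri in zip(d, r)) / sum(di * di for di in d)
    return 1.0 - h, h

# Example: a mixture containing 30% light and 70% heavy glycan.
l, h = fit_mixture([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.3, 0.0, 0.7])
```

The light-to-heavy molar ratio for the glycan is then l/h.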

2.3 GLYCOQUANT SOFTWARE PLATFORM

A general workflow illustrating the GlycoQuant process is shown in Figure 2.1. The major modules in GlycoQuant are: (1) raw data processing, (2) isotopic distribution calculation, (3) mass spectrum simulation, (4) model fitting, and (5) quantification and visualization. Initially, the theoretical isotopic distribution patterns for the n + 1 formulae for each glycan are computed. These remain unchanged in different experiments if the isotopically enriched reagents used have the same abundance ratio for 15Ñ and 14Ñ. The simulation and

[Workflow diagram: glycan structures and MS raw data enter raw data processing, followed by isotopic distribution calculation, simulation of the mass spectrum, and model fitting, which loops until optimized, then quantification and visualization.]

Figure 2.1: GlycoQuant Workflow

model fitting modules are computationally intensive, involving numerical optimization (i.e., maximization of the coefficient of determination between the simulated and experimental spectral patterns) to calculate a vector T of abundances for the n + 1 isotopologue sets for each glycan.
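The objective function named here, the coefficient of determination computed as the square of the Pearson correlation between the simulated and experimental intensities, can be written as a small stand-alone function. This is a simplified illustration, not the GlycoQuant source:

```python
def r_squared(sim, exp):
    """Square of the Pearson correlation between two equal-length intensity arrays."""
    n = len(sim)
    mean_s, mean_e = sum(sim) / n, sum(exp) / n
    cov = sum((s - mean_s) * (e - mean_e) for s, e in zip(sim, exp))
    var_s = sum((s - mean_s) ** 2 for s in sim)
    var_e = sum((e - mean_e) ** 2 for e in exp)
    return (cov * cov) / (var_s * var_e)
```

A perfect linear relationship between the two patterns yields R² = 1, while contaminating peaks or noise drive R² toward 0.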

2.3.1 MS Raw Data Processing

The initial steps for processing raw MS data involve extracting the relevant spectral segment for each glycan ion and identifying the monoisotopic peak therein. Although other approaches for deconvoluting mass spectra often involve data transformation as the initial step, GlycoQuant maintains experimental data integrity and does not impose any additional operations when extracting raw spectral patterns. GlycoQuant data segmentation is based on the theoretical ion isotopologue distribution for each glycan in a predefined list to define

an m/z range [(minMass − 1)/z, (maxMass + 1)/z], where minMass and maxMass are the lower and upper bounds of the exact mass values of significantly populated isotopologues of the quasi-molecular ion ([M + Na_z]^(z+)) of the glycan and z is the ionic charge. Although a huge number of possible isotopologues exist for each glycan, the size of the actual isotopologue set under

consideration is limited by neglecting isotopologues with vanishing Aisotopologue. If the spectral segment for a particular glycan structure is represented in several different scans in the raw data file, the segment is extracted from the scan with the highest abundance of ions corresponding to the glycan.

Both manual and automated methods such as spectral deconvolution [92] have been used previously to identify monoisotopic peaks in mass spectra. However, GlycoQuant uses a priori knowledge of the molecular composition of each glycan to compute its exact monoisotopic mass and thereby assign signals with appropriate m/z values as monoisotopic ions. When analyzing the spectrum of a mixed population of heavy (labeled) and light (natural abundance) glycans, two distinct monoisotopic peaks should be considered for each glycan. The first is the classically defined monoisotopic peak corresponding to a glycan isotopologue

with the formula 12Cx 1Hy 14Nn 16Oz. The second is a "pseudomonoisotopic" peak corresponding to a glycan isotopologue with the formula 12Cx 1Hy 16Oz 15Ñn. (As per convention, 15Ñ is the most abundant isotope of Ñ.) At least one of these peaks must be identified in order to calibrate the m/z axis of the spectrum and determine the peak width used for subsequent spectral simulation. As the classical monoisotopic peak in fact corresponds to a single isotopologue, it is always used if its abundance is above a preset threshold (e.g., 10% of the largest peak in the segment). Otherwise, the pseudomonoisotopic peak is identified and used. This occurs most frequently for fully labeled ions, for which the pseudomonoisotopic peak is dominated by a single isotopologue. No human intervention is required for monoisotopic peak picking.
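The selection rule just described (use the classical monoisotopic peak when it is sufficiently abundant, otherwise fall back to the pseudomonoisotopic peak) is compact enough to express directly. This is an illustrative sketch with names of our choosing, not GlycoQuant code:

```python
def pick_monoisotopic(mono_abund, max_abund, threshold=0.10):
    """Return which reference peak to use for calibration.

    mono_abund : abundance of the classical monoisotopic peak
    max_abund  : abundance of the largest peak in the spectral segment
    threshold  : preset fraction of the largest peak (e.g., 10%)
    """
    if mono_abund >= threshold * max_abund:
        return "classical"
    return "pseudo"
```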

2.3.2 Simulation of Mass Spectrum

For each glycan expected to be present in the sample, a "profile mode" spectral segment (ion current as a function of m/z) is calculated using the n + 1 discrete isotopologue abundance distributions [Mi, Ai] corresponding to the n + 1 formulae (see Figure S1 in the supplementary material) and the vector T (relative abundances corresponding to each formula). Here,

each isotopologue is specified by a subscript i. The simulated ion current corresponds to a sum of standard probability distributions (e.g., Gaussian, Lorentzian or a combination of the

two). Inputs to this calculation are (1) the Gaussian peak width pwG and Lorentzian peak

width pwL required to calculate the standard probability distributions, (2) the charge state z of the ion, (3) the experimental spectrum segment E spanning all significantly abundant ionized isotopologues of the glycan, specified as an array [x, y] where x and y are the m/z value and intensity value, respectively, (4) the global isotopologue abundance distribution D, calculated as a linear combination of n + 1 isotopologue abundance distributions, weighted by the elements of T and expressed as a global list of mass/abundance pairs [Mi,Ai], and (5) the m/z calibration offset δ. The mass of the ionization adduct (e.g., Na+) is added to each

Mi to obtain the mass of each ionized isotopologue i. The theoretical mass-to-charge ratio Xi = Mi/z is then calculated for each ionized isotopologue. The calibration offset δ is added to each x value in the experimental spectrum segment, and the theoretical ion current is then simulated for each (corrected) x′ = x + δ value. The Gaussian probability distribution is widely used in MS-based deconvolution approaches, in which the spectral intensity SG is simulated as a function of the mass-to-charge ratio x′.

SG(x′) = Σ_{i∈D} Ai/(σ√(2π)) × e^(−(x′ − Xi)²/(2σ²))    (2.3)

The Gaussian parameter σ is closely related to the peak width at half maximum pwG [93]

σ = 0.4247 × pwG

The Lorentzian probability distribution is often appropriate to model data obtained by Fast Fourier Transform (FFT) of periodic data, such as those recorded by ion cyclotron

mass spectrometers. In this case, the spectral intensity SL is simulated as

SL(x′) = Σ_{i∈D} (Ai/π) × [γ / ((x′ − Xi)² + γ²)]    (2.4)

The Lorentzian shape parameter γ is closely related to pwL, the peak width at half maximum.

γ = 0.5 × pwL

Since a Gaussian or Lorentzian distribution alone may not accurately replicate the experimental spectrum, a linear combination SGL of Gaussian and Lorentzian distributions is calculated.

SGL(x′) = p × SG(x′) + (1 − p) × SL(x′)    (2.5)

where p is the fraction contributed by the Gaussian component.
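Equations 2.3-2.5 translate directly into code. The sketch below is our own illustration, with peaks passed as (Xi, Ai) pairs; it also encodes the peak-width relations σ = 0.4247 × pwG and γ = 0.5 × pwL:

```python
import math

def s_gauss(x, peaks, pw_g):
    """Gaussian line shape (Eq. 2.3); pw_g is the full width at half maximum."""
    sigma = 0.4247 * pw_g
    return sum(a / (sigma * math.sqrt(2 * math.pi))
               * math.exp(-((x - xi) ** 2) / (2 * sigma ** 2))
               for xi, a in peaks)

def s_lorentz(x, peaks, pw_l):
    """Lorentzian line shape (Eq. 2.4); pw_l is the full width at half maximum."""
    gamma = 0.5 * pw_l
    return sum((a / math.pi) * gamma / ((x - xi) ** 2 + gamma ** 2)
               for xi, a in peaks)

def s_gl(x, peaks, pw_g, pw_l, p):
    """Linear combination of the two shapes (Eq. 2.5)."""
    return p * s_gauss(x, peaks, pw_g) + (1 - p) * s_lorentz(x, peaks, pw_l)
```

A quick sanity check on either shape: the intensity one half-width away from a peak center is half the intensity at the center, confirming that pwG and pwL are indeed widths at half maximum.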

2.3.3 Spectrum Calibration and Noise Filtering

In order to compare the experimental and simulated spectra, it is first necessary to optimize the parameters used to calibrate the m/z axis of the experimental spectrum and to reproduce the low-level noise filtering that is often applied when the mass spectrum is recorded. The optimization methods used by GlycoQuant require very accurate mass calibration to match the observed and theoretical masses of ions in the spectrum. Therefore, a calibration parameter δ is applied to the experimental spectrum such that x′ = x + δ. The values of δ and the peak width parameters (pwG and/or pwL) are simultaneously optimized based on comparison of the monoisotopic (or pseudomonoisotopic) peaks in the simulated and experimental spectra and are used later for simulation of the complete isotopologue pattern.

Low-level noise filtering simulates basic filtering operations performed by many mass spectrometers. GlycoQuant reproduces this operation by zeroing the abundance of each

spectral point in the simulated spectrum whose simulated abundance is lower than a specified threshold (τ).
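The calibration offset and the threshold filter amount to two small transformations of the spectra. A hedged sketch (spectra as lists of (x, y) pairs; names are ours):

```python
def calibrate(spectrum, delta):
    """Shift each experimental m/z value: x' = x + delta."""
    return [(x + delta, y) for x, y in spectrum]

def noise_filter(spectrum, tau):
    """Zero every simulated point whose abundance falls below the threshold tau."""
    return [(x, y if y >= tau else 0.0) for x, y in spectrum]
```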

2.3.4 Model Fitting

The two most widely used model fitting methods in MS-based quantification are (1) curve shape fitting and (2) area under the curve. Area under the curve methods are easily confounded by the presence of contaminating peaks. Therefore, GlycoQuant utilizes a curve shape fitting approach using the coefficient of determination between the simulated (S) and experimental (E) spectra as the objective function. As multiple parameters are involved and each parameter has its own bound constraints, a constrained numerical optimization routine, limited memory Broyden-Fletcher-Goldfarb-Shanno or L-BFGS [94], is implemented. As a quasi-Newton method, L-BFGS is very competitive in performance (e.g., time efficiency and memory usage) compared with other numerical optimization routines (e.g., steepest descent and conjugate gradient methods).

The overall goal of model fitting is to minimize the difference between the simulated and experimental spectra by maximizing the coefficient of determination between the two, which leads to a multi-dimensional optimization problem. Attempts were made to include all the parameters in a single optimization; however, this led to convergence to inferior solutions and long running times. In order to achieve better performance and more accurate results, the optimization process is executed in two phases and the aforementioned parameters are divided into two corresponding subsets. The parameters in phase 1 are related to the spectral pattern calculation and are optimized using a small region of the whole spectral data (e.g., the monoisotopic peak), while those in phase 2 are related to isotopologue sets and are estimated against the whole experimental spectrum.

Phase 1 optimizes the following parameters: the peak width of the Gaussian curve

(pwG) and/or Lorentzian curve (pwL), the percentage of Gaussian curve (p), the calibration

offset (δ), solely on the basis of the monoisotopic or pseudomonoisotopic peak. Because the number of data points included in the monoisotopic peak is limited, the phase 1 optimization process is fast and efficient. Phase 2 optimizes the following parameters: the noise filtering threshold (τ) and the n + 1 dimensional abundance distribution vector (T). As noise filtering only removes abundance values less than the threshold and does not affect abundance values higher than τ, it has no impact on the optimization in phase 1, which only considers data points with high intensities. The optimized parameter values from phase 1 are used to compute the theoretical isotopic distribution patterns for each of the n + 1 pseudochemical formulae, and the simulated spectrum is generated from the linear combination of isotopic distribution patterns based on their abundance vector. R² is calculated as the fitness value for the optimization routine.
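GlycoQuant itself uses bound-constrained L-BFGS for both phases. As a self-contained illustration of bound-constrained search over a single parameter (a stand-in only, not the actual routine), a golden-section search suffices; to maximize R², one minimizes its negation:

```python
def golden_min(f, lo, hi, iters=60):
    """Golden-section search for the minimum of a unimodal f on [lo, hi].

    A simple stand-in for the bound-constrained L-BFGS optimizer used by
    GlycoQuant; each iteration shrinks the bracketing interval by ~0.618.
    """
    phi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:            # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - phi * (b - a)
            fc = f(c)
        else:                  # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)
```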

2.4 EVALUATION

GlycoQuant is evaluated using the MS raw data generated from IDAWG™ experiments, which provide quantitative information on the relative abundance of heavy and light forms of each glycan (static IDAWG™) or the time-dependent distributions of nitrogen atoms derived from the heavy and light precursor pools in each glycan (dynamic IDAWG™).

For static IDAWG™ experiments, GlycoQuant calculates the heavy-to-light ratio. Four typical conditions are shown in Figure 2.2, namely, (a) spectral data with high S/N and little ion contamination, (b) spectral data with noise and ion contamination, (c) spectral data with low S/N, and (d) spectral data with significant ion contamination. The figures show that the simulated spectrum can fit the experimental data very well for the glycans with high intensities. Even with significant ion contamination, as shown in Figure 2.2(d), GlycoQuant is still able to identify the relevant peaks in noisy experimental data. The corresponding optimization statistics are given in Table 2.2, including two coefficients of determination R²: (1)

Table 2.2: Optimization Results for Static IDAWG™

    Structure                          Monoisotopic mass   R²        Cleaned R²
(a) (NeuAc)2(Hex)1(HexNAc)1            1256.6364           0.99579   0.99662
(b) (Hex)2(Deoxyhexose)1(HexNAc)2      1157.6043           0.81725   0.97758
(c) (Hex)3(HexNAc)3                    1432.7412           0.11214   0.75251
(d) (Hex)3(HexNAc)3                    1432.7412           0.34185   0.46246

based on the original experimental data and (2) based on "cleaned" data (the experimental data are cleaned by filtering out signals in the real spectrum where the simulated value is zero).

In dynamic IDAWG™ experiments, MS raw data are collected at various time points during the 36-hour period after changing from "heavy" precursors to "light" precursors, when the biological processes of degradation, remodeling and new synthesis occur simultaneously in the cell culture. For each glycan, GlycoQuant calculates a theoretical "remodeling coefficient" that reveals metabolic processes that modify the glycan structure by removing and/or adding glycosyl residues to pre-existing glycan molecules, and provides estimates of the scale and kinetics of these processes. Such information is very difficult to obtain by other available methods.

GlycoQuant calculates the relative abundance vector Ti(t) of the n + 1 isotopologue sets for each glycan i at various time points. The absolute abundance of each isotopologue set is then calculated by multiplying the optimized relative abundances by an exponential growth factor based on the cell doubling time, the period of time required for cells to double their number, estimated to be 30 hours for hESC glycans [38]. Based on the changes of absolute abundances over time, two parameters are estimated: (1) the 50% degradation time of the glycan and (2) the remodeling coefficient at the 50% degradation time. The 50% degradation time is the (estimated) time when the abundance of the 15N-labeled glycan decreases to half of its initial

Figure 2.2: Comparison of experimental mass spectra and mass spectra simulated by GlycoQuant. Spectra are drawn using the GlycoQuant user interface. (a) Experimental data with high S/N and little ion contamination. (b) Experimental spectrum with moderate noise and ion contamination. (c) Experimental spectrum with low S/N. (d) Experimental spectrum with significant ion contamination.

value at time 0. The remodeling coefficient at the 50% degradation time t* is calculated as the ratio of the combined abundances of the partially enriched isotopologue sets (containing at least one and less than n nitrogen atoms from the isotopically enriched precursor) to the total isotopologue abundance for the glycan.

remodeling = ( Σ_{i=1}^{n−1} Ti(t*) ) / ( Σ_{i=0}^{n} Ti(t*) )    (2.6)
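Equation 2.6 is a simple ratio over the fitted abundance vector; in code (our sketch, with T supplied as a list T[0..n]):

```python
def remodeling_coefficient(t):
    """Eq. 2.6: fraction contributed by partially enriched isotopologue sets
    (indices 1..n-1, i.e., at least 1 and fewer than n labeled nitrogens)
    relative to the total abundance, evaluated at the 50% degradation time."""
    partial = sum(t[1:-1])   # exclude t[0] (unlabeled) and t[n] (fully labeled)
    total = sum(t)
    return partial / total
```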

Two cases of remodeling are shown in Figure 2.3, one with a high remodeling coefficient and the other with a very low remodeling coefficient. The remodeling of sialylated glycan structures is revealed by dynamic IDAWG™ experiments [95].

2.5 RELATED WORK

SysBioWare [92] is a software platform developed for high-throughput glycomics. It is based on the average elemental composition, uses Gaussian or Lorentzian shapes in its wavelet analysis, and applies the Fast Fourier Transform (FFT) to detect the monoisotopic peak by fitting the area under the curve. As discussed earlier, use of the average elemental composition may introduce errors in calculations for high precision MS data, and the accuracy of the area under the curve method can be greatly affected by noise in the real data, even after applying their de-noising operation. Our work instead is based on MIDA [91] to calculate the exact mass and fits the model by curve fitting using a Gaussian curve, a Lorentzian curve or a combination of the two. Furthermore, SysBioWare focuses on mass spectrum deconvolution and monoisotopic peak assignment, and does not address the quantitative aspect of glycomics.

MSQuant [96, 84] is widely used for quantitative proteomics and supports quantification using the SILAC strategy. It requires a peak list from a Mascot module as input along with the real experimental data. The ratios between the MS peaks of "heavy" and "light" forms of the

Figure 2.3: Analysis of dynamic IDAWG™ experiments. Isotopologue abundances corresponding to a high remodeling coefficient (a) for (NeuAc)2(Hex)1(HexNAc)1 and a low remodeling coefficient (b) for (Hex)2(HexNAc)2. The labels [0]-[3] indicate the number of nitrogen atoms in the glycan that are derived from the heavy precursor pool. Fully labeled (heavy) glycans contain n nitrogen atoms derived from the heavy precursor pool and correspond to the isotopologue distribution labeled [3] in panel (a) and [2] in panel (b). Glycans undergoing active remodeling contain at least 1 and less than n nitrogen atoms derived from the heavy precursor pool.

peptide are calculated by summing the ion intensities around the centroids of the abundant peaks. Instead of relying on the centroid data, GlycoQuant considers the profile data of mass spectra, so more data points can be utilized to provide quantitative information. Summing ion intensities can produce inaccurate results and can be affected by contaminating peaks. GlycoQuant simulates only predictable isotopologue distributions for comparison to real data.

RelEx [85] is an automated quantitative proteomics software tool for determination of the ratios of labeled (heavy) and unlabeled (light) samples in a mixture. RelEx addresses the systematic error in MS-derived quantitative proteomics data, which has often been ignored. It introduces the Pearson correlation coefficient as the objective function for fitting simulated data and adds a normalization step to compensate for the systematic error and shifting of the mass spectrum for better curve fitting. Our framework takes a similar approach and integrates steps of spectrum calibration and noise filtering. RelEx uses a seven-point Savitzky-Golay filter to smooth the experimental data and calculates the linear least-squares correlation between the data points of unlabeled and labeled data. However, smoothing the experimental data may cause the loss of important information; therefore, our approach does not manipulate the experimental data directly and instead simulates the experimental data based on theoretical isotopic distributions. In our work, the quality of the simulation is evaluated by calculating the coefficient of determination between the simulated and experimental spectral data.

XPRESS [87] picks peptide peaks from the heavy- and light-labeled peptide profiles, determines the area of each peptide peak, and calculates the abundance ratio based on the areas.
In addition to providing the quantification function, ASAPRatio [88] uses statistical tools for distinguishing protein abundance changes and calculates the peptide ratio considering all the peaks in the mass spectrum where signals of identified peptides carrying different charge states may be detected. Similar to XPRESS and ASAPRatio, the Mascot quantification module and the Libra module in TPP perform quantification on MS/MS spectra. All four

tools are focused more on the identification of proteins based on MS/MS spectra and do not provide quantification of MS data. Therefore, they cannot be compared directly with our approach.

2.6 CONCLUSIONS

Rapid advancement of experimental techniques in isotopic labeling strategies requires new quantitative and analytical methodologies. Therefore, it is urgent to develop quantitative software tools capable of processing and interpreting MS raw data generated from these isotopic labeling strategies. GlycoQuant is the first quantitative software tool for isotopic labeling strategies in quantitative glycomics. The estimated changes in the distribution of isotopic label in glycans produced by cells grown in the presence of different isotopically labeled precursors can provide key information regarding diverse biological processes such as glycan metabolism and cell differentiation. This information is not provided by other currently available methods. Evaluation of GlycoQuant for the processing of IDAWG™ data has shown that it achieves high accuracy in detecting and quantifying mass spectral patterns even in cases where high levels of signal noise are present. Furthermore, data analysis using GlycoQuant is fully automated and requires little human intervention. Additional information about GlycoQuant is available at http://glycomics.ccrc.uga.edu/idawg/.

CHAPTER 3

SEESO: A SEMANTICALLY ENRICHED ENVIRONMENT FOR SIMULATION OPTIMIZATION

Jun Han1 John A. Miller1,2 Michael E. Cotterell1 Krys J. Kochut1,2 William S. York1,2,3,4

To be submitted to "Simulation Modelling Practice and Theory"
1 Department of Computer Science, University of Georgia, Athens, Georgia, USA
2 Institute of Bioinformatics, University of Georgia, Athens, Georgia, USA
3 Complex Carbohydrate Research Center, University of Georgia, Athens, Georgia, USA
4 Department of Biochemistry & Molecular Biology, University of Georgia, Athens, Georgia, USA

Abstract

Simulation optimization is attracting increasing interest within the modeling and simulation research community. Although much research effort has focused on how to apply a variety of simulation optimization techniques to solve diverse practical and research problems, researchers find that existing optimization routines are difficult to extend or integrate and often require one to develop their own optimization methods because the existing ones are problem-specific and not designed for reuse. A Semantically Enriched Environment for Simulation Optimization (SEESO) is being developed to address these issues. By implementing generalized semantic descriptions of the optimization process, SEESO facilitates reuse of the available optimization routines and more effectively captures the essence of different simulation optimization techniques. This enrichment is based on the existing Discrete-event Modeling Ontology (DeMO) and the emerging Simulation oPTimization (SoPT) ontologies. SoPT includes concepts from both conventional optimization/mathematical programming and simulation optimization. Represented in ontological form, optimization routines can also be transformed into actual executable application code (e.g., targeting JSIM or ScalaTion). As illustrative examples, SEESO is being applied to several simulation optimization problems.

3.1 Introduction

Simulation optimization is a very resource-intensive undertaking. Complex models by themselves can be resource-intensive. Given that, for many models, some input domains may be integer-valued and output is likely to be both nonlinear and stochastic, the optimization problems can be quite challenging. In our Semantically Enriched Environment for Simulation Optimization (SEESO) project, we are addressing this complexity on three fronts: The first involves efficiency of the simulation in terms of both the simulation engine and the replication strategies adopted. The second relates to choosing appropriate optimization goals and selecting/customizing optimization algorithms based on these goals, captured in ontological form. The third is the use of domain-specific languages (DSLs) where appropriate for the creation of concise, expressive, configurable and easy-to-read source code for the executable simulation models based on the ontological models. This work is being bootstrapped by multiple prior and ongoing research and development projects.

• JSIM [97]. JSIM was one of the first Web-Based Simulation environments that allowed simulation models to be run on the Web as Java Applets. It was later extended to support the assembly of model elements as well as larger simulation components as Java beans. JSIM supports three simulation world-views (modeling paradigms): event-scheduling, activity-based and process-interaction.

• DeMO [98]. The Discrete-event Modeling Ontology (DeMO) specifies the structure of many popular simulation modeling paradigms including event, activity, process and state oriented models using the Web Ontology Language (OWL [41]). Individual sim- ulation models may be stored as instance data within the ontology/knowledge base.

• ScalaTion [45]. Scala is a relatively new object-oriented functional programming language that runs on Java Virtual Machines (JVMs) [99]. As an advanced programming language, it supports the creation of embedded Domain Specific Languages (eDSLs), which greatly facilitates the coding of simulation models. ScalaTion supports many of the modeling paradigms represented in DeMO as well as several optimization algorithms. The supported modeling paradigms are event-scheduling, activity-scanning, process-interaction, event graphs, state-based and systems dynamics.

• DeMOforge [43]. The purpose of DeMOforge is to exploit the ever-growing amount of problem/domain knowledge that is being created as part of the Semantic Web. For example, ontologies on metabolic pathways are being created which can provide important information for the construction of models for systems biology. The interactive DeMOforge tool assists a model developer in producing such models from existing domain knowledge available in the Semantic Web.

As an analog to the DeMO project, the Simulation oPTimization (SoPT) ontology [44] project represents common optimization problems and algorithms using OWL. However, since the requisite knowledge in this area goes well beyond representation of ontological knowledge, as it is important to know when to apply and how to customize an optimization algorithm, much of the development involves the use of the Rule Interchange Format (RIF [100]) to maintain a rule base. After SoPT is populated with optimization problems, reasoning is performed using a rule-based inference engine (e.g., RIF4J [101]) to determine which optimization algorithms are suitable for solving each problem. Beyond our efforts in developing the SoPT ontology and building a rule base, additional work has begun on completing the SoPT ontology and on selecting suitable optimizers for a given optimization problem. The goal of making SEESO functional is to integrate all of these projects into an efficient and easy-to-use simulation and optimization environment for domain modelers.

SEESO consists of multiple types of simulators (three from JSIM and six from ScalaTion). It also contains several optimizers (e.g., steepest descent, quasi-Newton, tabu search, genetic algorithms, etc.). SEESO is intended to be able to mix and match simulators with optimizers in a loosely-coupled fashion. The two must interact, as the optimizer needs to use the simulator for function evaluations f(x), while the simulator needs input parameters from the optimizer to know what scenario to simulate. The optimizer computes f(x) by invoking fs(x, y), which creates a simulation model, passing in x along with any fixed parameters y to the model. Given a simulation model such as MyModel, in Scala, two functions may be defined to establish this linkage.

def f_s(x: VectorD, y: VectorD): Double = (new MyModel(x, y)).simulate()
def f(x: VectorD): Double = f_s(x, getFixedParameters())

To improve performance, several steps may be taken: (1) cache function values so they are not re-evaluated/simulated, (2) use interpolation where appropriate, (3) use meta-modeling methods when applicable, (4) use reduced simulation precision requirements when far from optima, and (5) exploit parallel processing. For simplicity in this paper, we assume the inputs to a simulation model are deterministic and the outputs from a model are stochastic (although deterministic outputs will work as well). If the outputs are stochastic, it is usually necessary to replicate the runs (or use a long run length) of a simulation model to obtain statistical estimates (e.g., means, variances and confidence intervals).

In this paper, we consider how optimization can be used in simulation, as well as the types of optimization problems that are encountered. We illustrate the approach by considering three simulation optimization problems: two from the bioinformatics domain (Mass Spectrometry and Metabolic Pathways) and a third, more conventional discrete-event simulation optimization problem (optimizing staffing/inventory for an Urgent Care

Facility). Based on these case studies, we develop an ontology for simulation optimization (SoPT) to complement the DeMO ontology, which serves as a knowledge repository for simulation optimization.

The rest of this paper is organized as follows. Section 3.2 provides an overview of simulation optimization, classifies common optimization problems, and describes the general application of simulation optimization and how to couple a simulator with an optimizer. Section 3.3 describes how to model and construct an application using the DeMO ontology and the DSLs provided in ScalaTion. Section 3.4 first outlines the utility packages and classes provided in ScalaTion for optimization, summarizes the key aspects of the SoPT ontology, and then discusses how to leverage rule-based inference to facilitate effective algorithm selection for later code generation and execution. Section 3.5 describes the semantically enriched computing environment that serves as a platform for dynamic allocation of resources and parallel processing capabilities for simulation optimization. Section 3.6 introduces three case studies, one for a conventional discrete-event simulation optimization problem and two in the bioinformatics domain, Mass Spectrometry and Metabolic Pathways. Section 3.7 concludes the paper and discusses possible future work.

Notational conventions: (1) boldface letters for vectors, (2) capitalization for matrices and random variables.

3.2 Simulation Optimization Overview

In 1987, optimization for simulation was regarded as “an art, not a science” [102]. After ten years of exploration and inspiration from early practitioners integrating optimization into simulation, the research field of simulation optimization was systematically surveyed in [103, 49, 50]. Both random search methods for discrete decision variables and gradient-based optimization techniques for continuous decision variables were discussed. Although huge

progress has been achieved since then, simulation optimization remains a big challenge, as it requires a deep understanding of two fields: modeling and optimization. To date, numerous efforts have focused on optimization techniques and the application of specific mathematical programs to practical simulation problems. Surveys and reviews of simulation optimization from various perspectives are given in [51, 48, 47, 104, 105, 52, 53, 54, 55, 46, 106].

In general, the various simulation optimization strategies can be grouped based on the type of decision parameters (e.g., continuous or discrete) used. General iterative search methods can utilize strategies of global or local scope and sample either the solution space or a sample path. Furthermore, when it is not feasible to optimize a simulation model that is close enough to the real-world problem due to money and time limitations, meta-modeling methods can be considered.

For continuous decision parameters, gradient approaches (e.g., steepest descent and Stochastic Approximation (SA)) and gradient-free methods (e.g., the Nelder-Mead Simplex method) can be applied, while for discrete ones, ranking and selection (R&S), random search, tabu search, simulated annealing, and genetic algorithms may be considered. Gradient-based simulation optimization methods for continuous problems are reviewed in [60], where two gradient-based methods, SA and the sample average approximation (SAA) method, are discussed. SA is similar to the steepest descent method in deterministic optimization, with gradients approximately estimated from measurements. SAA approximates the original simulation optimization problem with a deterministic optimization problem constructed from the sample points. Gradient-based methods have a long history in both theory and practice and are the most thoroughly investigated. Techniques and applications of engineering optimization are addressed in [56] from a broad perspective.
Convex optimization and numerical optimization techniques are discussed in [57, 58]. Gradient-based methods work well when the input parameters and the objective function are well formulated and first (and in some cases second) derivatives can be directly computed or numerically estimated. However, in the area of Modeling and Simulation (M&S), a common situation faced by simulation researchers is that the model of interest cannot easily be formulated mathematically, and simulation is the only feasible approach to exploring the problem space. Treating the simulation as a black box, heuristic methods, inspired by natural social behaviors and evolutionary strategies, can often be applied as a global optimization approach. Scatter search and Genetic Algorithms (GAs) are widely adopted in commercial simulation software [53], while other heuristic methods, such as Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO), have been applied in M&S research and shown to be efficient and robust. Some heuristic methods maintain a large population of candidate solutions instead of keeping a single instance and sampling its neighbors. Therefore, larger search spaces with irregular response surfaces can be explored, leading to improved robustness compared to traditional optimization techniques on higher-dimensional problems. For example, PSO has been embedded within a Monte Carlo simulator to solve the Reliable Server Assignment (RSA) problem [107], and ACO has been used to solve buffer capacity allocation [108] and cooperative transportation planning [109] problems. Instead of sampling the solution space, sample path optimization [110, 111, 112] (or the sample average approximation method [113]) often uses Monte Carlo based simulation to generate a set of random samples and approximates the expected value function with a nonlinear deterministic function (e.g., a quadratic function) by sampling the independently and identically distributed variables and taking the average. When designing a model or experiment with little knowledge about the model under analysis, meta-modeling can be applied to investigate the unknown problem space. Based on the scope of optimization, meta-modeling based methods for simulation optimization can utilize either a global strategy or a local strategy [106].
In our work, a local strategy for meta-modeling is implemented using quadratic fitting of the response surface in a region around a given center point. Although global meta-modeling is briefly discussed in this chapter, we plan to implement and test such techniques as part of our future work.
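The local meta-modeling strategy just described can be illustrated with a minimal sketch: fit a parabola to responses sampled around a center point and move the center to the parabola's vertex. This Python fragment is a simplification of the idea, not the actual implementation, and the response surface is a hypothetical stand-in.

```python
def quadratic_refit_step(f, center, h):
    """One local meta-modeling step: fit a parabola through three responses
    sampled at center-h, center, center+h and move to the parabola's vertex."""
    f_minus, f_0, f_plus = f(center - h), f(center), f(center + h)
    curvature = f_plus - 2.0 * f_0 + f_minus
    if curvature <= 0.0:                     # not locally convex: step downhill
        return center - h if f_minus < f_plus else center + h
    return center - 0.5 * h * (f_plus - f_minus) / curvature

# Hypothetical response surface; for a quadratic the fit is exact, so a
# single step recovers the minimizer at x = 1.7.
response = lambda x: 0.5 * (x - 1.7) ** 2 + 3.0
x_new = quadratic_refit_step(response, center=0.0, h=0.5)
print(round(x_new, 6))    # approximately 1.7
```

On a general surface the step is repeated with a shrinking sampling region around each new center.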

As simulation optimization has been receiving research attention from the M&S community, “Optimization via Simulation” was added as an entirely new section to the third editions of Simulation Modeling and Analysis [114] and Discrete-Event System Simulation [115] in 2000 (cf. Sections 12.6 and 12.4, respectively) and continues to be refined in later editions [116, 40]. A full track dedicated to frontier research and applications in simulation optimization was added to the 2011 Winter Simulation Conference. The integration of simulation and optimization has become ubiquitous when facing complex system design and modeling [48]. On the one hand, simulation is an approximation to the real world, and in most cases it is impossible to find a good enough solution by enumerating the possible scenarios in huge search spaces; therefore, simulation needs optimization techniques to provide guidance toward a globally optimal solution. On the other hand, without the help of simulation, many real-world problems are too complicated to be modeled by explicit mathematical formulations, and traditional optimization techniques (e.g., gradient-based approaches) may not achieve satisfactory results. This presents a major dilemma for researchers who want to approximate the real world as closely as possible and find a good enough solution at the same time. In general, simulation optimization is divided into two classes: optimization for simulation and simulation for optimization. The former refers to adding optimization routines to a stochastic discrete-event simulator, while the latter uses Monte Carlo simulation to generate candidates for a mathematical programming formulation [104].

3.2.1 Classification of Optimization Problems

To facilitate component reuse, we are primarily interested in systems where simulators and optimizers are loosely coupled. In addition, there may be a third component, a cost analyzer. In general, the simulator takes a set of input/parameter vectors {x} and produces a set of output/response vectors {Y} = R({x}), where x is regarded as deterministic and Y is often stochastic. The cost analyzer takes the response vectors Y, along with a quality vector q, and provides a set of triples {(x, Y, Z)}, using a cost function to compute Z = c(Y, q). These triples are then fed into the optimizer, which compares these values with those it has stored from previous iterations. After applying a stopping rule, the optimizer will either report a solution or ask the simulator for more response data, as shown in Figure 3.1. Denoting the composition of the cost and response functions as F = c ◦ R, optimizations may be formulated as in Equation (3.1).

[Diagram: the Simulator receives an initial input set {x}_0 and produces response pairs {(x, Y = R(x))}_i; the Cost Analyzer extends these to triples {(x, Y = R(x), Z = c(Y, q))}_i; the Optimizer consumes the triples, feeds the next input set {x}_{i+1} back to the Simulator, and finally reports the solution (x*, Z*).]

Figure 3.1: Loosely-coupled Software Architecture for Simulation Optimization.
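The loop in Figure 3.1 can be sketched in a few lines. This Python fragment is purely illustrative: the simulator, cost function, and exhaustive-candidate optimizer are hypothetical stand-ins for the loosely coupled components.

```python
import random

def simulate(x, replications=20):
    """Hypothetical stochastic simulator R: a list of noisy responses for x."""
    base = (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2
    return [base + random.gauss(0.0, 0.05) for _ in range(replications)]

def cost(Y, q=1.0):
    """Cost analyzer c: reduce the response vector Y to a scalar Z = c(Y, q)."""
    return q * sum(Y) / len(Y)

def optimize(candidates):
    """Optimizer: evaluate F = c . R on each candidate, keep the best seen,
    and stop when the candidate set is exhausted (a trivial stopping rule)."""
    best_x, best_z = None, float("inf")
    for x in candidates:
        z = cost(simulate(x))
        if z < best_z:
            best_x, best_z = x, z
    return best_x, best_z

random.seed(0)
grid = [(i * 0.5, j * 0.5) for i in range(-8, 9) for j in range(-8, 9)]
x_star, z_star = optimize(grid)
print(x_star)    # the grid point nearest the true optimum (2, -1)
```

A real optimizer would propose {x}_{i+1} adaptively rather than exhausting a fixed grid, but the division of labor among the three components is the same.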

min  E[F(x)]
s.t. V[F(x)] ≤ t
     g(x) ≥ 0
     x ∈ D                                                    (3.1)

where F(x) is a stochastic objective function, g is a constraint function, t is the threshold for variance, and the domain D (often a vector space) can be a subset of R^n, Z^n or B^n, for real, integer or binary vector spaces, respectively. Due to the stochastic nature of the objective function, one must focus on characteristics of F(x), such as moments or quantiles. Typically, the mean (E) of F(x) is used. Other characteristics can be used in constraints (e.g., limiting the variance (V) to be within a threshold or specifying acceptable relative precision on confidence intervals). However, if R is deterministic, the simulator may be thought of as a function evaluator (using either analytic or numerical evaluation). Then F becomes deterministic, so that E[F(x)] = F(x) and V[F(x)] = 0, leading to conventional optimization. For efficiency, the same evaluations can be produced by a meta-model, which typically provides more rapid evaluation and produces deterministic responses (i.e., {Y} is deterministic). Replacing F with f, this may be formulated as follows:

min  f(x)
s.t. g(x) ≥ 0
     x ∈ D                                                    (3.2)

In conventional optimization, the problem may be defined as finding the minimum value of an objective function f(x) over a domain D, subject to the constraint function g(x) being non-negative. We use minimization for simplicity (maximization could be handled analogously). The classification of an optimization problem is primarily determined by the characteristics of f, g and D. The objective function f can be linear, quadratic or non-linear. The constraint function may be non-existent (unconstrained), linear, quadratic or non-linear. Note that, in this chapter, non-linear more precisely means whatever is not covered by the special cases of linear (and, where applicable, quadratic). D is typically a subset of R^n, Z^n or B^n, but may also be mixed. For example, IntegerProg.scala supports Mixed Integer Linear Programming (MILP) by allowing a set of indexes in the input vector to be excluded from the integer requirement.

A secondary classification is based on solution quality, in terms of finding a global minimum versus a local minimum, as well as whether an exact, approximate or heuristic solution is acceptable. An exact solution involves convergence to the true optimum, an approximate solution typically guarantees a solution within a relative error bound, and heuristics offer no guaranteed error bounds. A portion of our classification of optimization methods is listed in Table 3.1. To save space, the Constraint column (unconstrained, linear, quadratic and nonlinear constraints) is omitted; these distinctions are saved for the discussion below. Typically, generalized solvers (e.g., for nonlinear programming) can be applied to the more specific forms (e.g., linear programming), although less efficiently. One might hope for a one-size-fits-all solver for various practical optimization problems. However, since generalized solvers (e.g., for nonlinear programming) may get trapped in local optima and take much longer than more specific solvers, there is a need for several more specific solvers (e.g., the Simplex method for linear programming). Therefore, in order to determine the most suitable and efficient solvers, the first and crucial step in optimization is always to determine the classification of a particular optimization problem. Linearly constrained linear programming (Equation (3.3)), integer linear programming (Equation (3.4)) and linearly constrained quadratic programming (Equation (3.5)) are three types of problems that are simple to express in canonical forms. Therefore, they are taken as examples to illustrate the basic components involved in optimization, as listed in Table 3.2.

3.2.2 Use of Ontology in Simulation Optimization

In the M&S community, ontologies have been proposed and constructed to facilitate the sharing of domain knowledge. DeMO [98, 43] represents the domain of discrete-event modeling.

Table 3.1: Classification of conventional optimization problems.

Objective    Restriction    Problem                               Methods
Function
Linear       Real           Linear Programming                    Simplex Method, Interior Point Methods
                                                                  (Khachiyan and Karmarkar)
             Integer        Integer Linear Programming            Branch and Bound
             Real/Integer   Mixed Integer Programming             Branch and Bound
             Binary         Binary Integer Programming            Balas Additive Algorithm
Quadratic    Real           Quadratic Programming                 Calculus and Quadratic Simplex Method
             Integer        Quadratic Integer Programming         Special Branch and Bound
             Real/Integer   Mixed Integer Quadratic Programming   Generalized Benders Decomposition (GBD)
             Binary         Binary Quadratic Programming          Heuristic Methods
Nonlinear    Real           Nonlinear Programming                 Gradient-based Methods
             Integer        Nonlinear Integer Programming         Branch and Bound, Outer-Approximation
             Real/Integer   Mixed Integer Nonlinear Programming   Branch and Bound, Outer-Approximation
             Binary         Nonlinear Binary Integer Programming  Heuristic Methods

Table 3.2: Examples of optimization components, where x, b and c are the vector of input/decision variables, the constant vector, and the cost coefficient vector, respectively, and A and Q are the coefficient matrices for the constraint and objective functions, respectively. If Q is zero, Equation (3.5) reduces to Equation (3.3).

LP (3.3)                    ILP (3.4)                   QP (3.5)

min  c^T x                  min  c^T x                  min  (1/2) x^T Q x + c^T x
s.t. Ax ≥ b                 s.t. Ax ≥ b                 s.t. Ax ≥ b
     x ∈ R^n_+                   x ∈ Z^n_+                   x ∈ R^n
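To make the LP canonical form of Equation (3.3) concrete, the following Python sketch solves a tiny two-variable instance by enumerating the vertices of the feasible region (intersections of constraint boundaries). This brute-force approach is only for illustration; the Simplex method walks these vertices far more efficiently, and the particular A, b and c below are made-up example data.

```python
from itertools import combinations

def solve_lp_vertices(A, b, c):
    """Solve a two-variable LP in the canonical form of Equation (3.3),
    min c^T x s.t. Ax >= b (nonnegativity folded into A and b as extra
    rows), by intersecting every pair of constraint boundaries and
    keeping the cheapest feasible vertex."""
    best = None
    for i, j in combinations(range(len(A)), 2):
        (a11, a12), (a21, a22) = A[i], A[j]
        det = a11 * a22 - a12 * a21
        if abs(det) < 1e-12:
            continue                           # parallel boundaries: no vertex
        x1 = (b[i] * a22 - a12 * b[j]) / det   # Cramer's rule
        x2 = (a11 * b[j] - b[i] * a21) / det
        if all(r1 * x1 + r2 * x2 >= rb - 1e-9 for (r1, r2), rb in zip(A, b)):
            z = c[0] * x1 + c[1] * x2
            if best is None or z < best[1]:
                best = ((x1, x2), z)
    return best

# min 3x1 + 2x2  s.t.  x1 + x2 >= 4,  x1 + 3x2 >= 6,  x1 >= 0,  x2 >= 0
A = [(1.0, 1.0), (1.0, 3.0), (1.0, 0.0), (0.0, 1.0)]
b = [4.0, 6.0, 0.0, 0.0]
vertex, z_min = solve_lp_vertices(A, b, (3.0, 2.0))
print(round(z_min, 6))    # minimum cost 8.0, attained at the vertex (0, 4)
```

The fact that an optimum always lies at a vertex of the feasible polyhedron is exactly what the Simplex method exploits.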

The COmponent-oriented Simulation and Modeling Ontology (COSMO) [117] describes simulation components and compositions from a component-oriented perspective. The Process Interaction Modeling Ontology for Discrete-event Simulations (PIMODES) [118] facilitates the exchange of model information between various simulation software packages. The Discrete-event Simulation Ontology (DeSO) [119] is a small prototype ontology for discrete-event simulation.

Little work, however, has been done to gather domain knowledge on optimization techniques into an ontology that can be shared with others in the M&S community. ONTOP [120] is an ontology for engineering design optimization. It defines the class Optimization Model, which includes the objective function, input and output variables, constraints, etc. For our purposes, the hierarchical structure of the class Optimization Model is not organized suitably, as optimization problems (e.g., linear programming) and optimization methods (e.g., the Simplex method) are defined in the same place. This brings two disadvantages: (1) it fails to address the situation where one method can be applied to solve multiple problems (e.g., Simplex and its revisions/extensions can be used to solve both linear programming and quadratic programming problems); and (2) adding new mathematical programs can be difficult. For example, quadratically constrained quadratic programming (QCQP) has no natural position, because it could be placed either under Nonlinearly Constrained or under Quadratic Programming. In order to facilitate the reuse of optimization algorithms and the sharing of knowledge, SoPT has been developed as a complementary ontology to DeMO and serves as a knowledge repository for simulation optimization. SoPT is discussed in detail in Section 3.4.3.

3.2.3 Linking the Optimizer with the Simulator

We divide input parameters into three categories: regular input parameters, control parameters and model parameters. Control parameters influence the behavior of a model and are in some sense controllable. An example for certain metabolic pathway simulations is the concentrations of enzymes that catalyze reactions in pathways. Experiments may be conducted where enzyme levels are up-regulated, down-regulated or even reduced to zero by knocking out the responsible gene. Metabolic pathway models can explore these possibilities. In such simulations, one may wish to adjust the enzyme concentrations to optimize the production of certain bio-molecules. Model parameters are intrinsically part of a model, but may be unknown and hence need to be estimated from empirical data or calibrated by comparing output results to empirical data. Kinetic rate constants in metabolic pathways are good examples of model parameters. Many of these are not accurately known in the literature and are difficult to measure directly. They may, however, be estimated by performing an optimization that adjusts the reaction rate constants with the objective of minimizing the differences (e.g., in the least squares sense) between empirical pathway time series data and the time series data generated by a metabolic pathway model.

As discussed, model parameters may be optimized to improve the accuracy of a simulation model, while control parameters may be optimized to improve the outcome of the model (e.g., a greater yield of bio-molecules). Simulation optimization can be accomplished in a couple of ways. One is to simply explore the input parameter space over a sufficiently detailed grid to produce a response surface. Then, for example, a polynomial surface can be fit to the sample points produced by the simulation. Using this response surface, a global optimization method can then be used to find a global optimum. This is the essence of meta-modeling strategies and is considered in our work when appropriate. Our work focuses on simulation optimization techniques that loosely couple a simulator with an optimizer, having them work together in an iterative fashion: the optimizer steers the exploration of the simulator's input space.
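Calibrating a model parameter against empirical time series data, as described above, can be sketched as follows. The one-reaction decay model and the grid-refinement search are hypothetical simplifications: a real study would run a pathway simulator and plug in one of the optimizers discussed later.

```python
import math

def model_series(k, s0=10.0, times=range(10)):
    """Hypothetical one-reaction pathway model: substrate decays as s0*e^(-kt)."""
    return [s0 * math.exp(-k * t) for t in times]

# Stand-in "empirical" time series, generated here with a known rate constant
# so that the calibration below has a recoverable answer.
empirical = model_series(k=0.35)

def sse(k):
    """Objective: sum of squared differences between model output and data."""
    return sum((m - e) ** 2 for m, e in zip(model_series(k), empirical))

# Calibrate k by simple grid refinement (any optimizer could be plugged in).
lo, hi, k_hat = 0.0, 2.0, 0.0
for _ in range(6):
    step = (hi - lo) / 20.0
    k_hat = min((lo + i * step for i in range(21)), key=sse)
    lo, hi = max(k_hat - step, 0.0), k_hat + step
print(round(k_hat, 3))    # recovers the rate constant, about 0.35
```

The same least-squares objective applies unchanged when the "model" is a full metabolic pathway simulation and k is a vector of rate constants.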

3.3 Modeling with DeMO, JSIM and ScalaTion

DeMO and DeMOforge allow domain modelers to represent several types of simulation models and to connect high-level domain models represented in a specialized domain ontology to DeMO instances. For example, the schematic diagram of a simplified process-interaction model for an Urgent Care Facility (UCF) simulation (a more complex and realistic model is presented in Section 3.6.1) is shown in Figure 3.2.

[Diagram: patients flow from entry (Source) via toNurseQ (Transport) to nurseQ (WaitQueue) and nurse (Resource), then via toDoctorQ (Transport) to doctorQ (WaitQueue) and doctor (Resource), and finally via toDoor (Transport) to door (Sink).]

Figure 3.2: Schematic Diagram for an Urgent Care Facility (UCF) Model.

Code generators may be utilized to translate the models represented in the DeMO ontology into executable simulations. Currently, generators exist for both JSIM and ScalaTion.

Coded in Java, JSIM was one of the first simulation environments supporting Web-based simulation. Utilizing Scala's capabilities for creating embedded Domain-Specific Languages (DSLs), ScalaTion supports a more concise and domain-customized language for specifying executable simulation models. For example, a simplified model for a UCF simulation is shown in Table 3.3. This makes coding from scratch, as well as code generation from DeMO, easier. Because of the consistent and uniform way in which both JSIM and ScalaTion invoke models, they can be loosely coupled to an optimizer. Having the simulation represented at a high level (a form of conceptual model) in the DeMO ontology also facilitates simulation optimization in the following ways: (1) the essence of a conceptual model can be captured in an ontology, (2) model reuse and composition can be implemented through ontological relationships, and (3) various domain-specific models can be easily extended or developed via DSLs, and executable model code can be obtained from code generators.

3.4 Simulation Optimization with ScalaTion, SoPT and Rules

In order to make simulation optimization more intuitive for modelers, our approach leverages ontologies, rules, and domain-specific languages. The whole workflow is illustrated in Figure 3.3. To begin with, the optimization problem and the optimization algorithm are represented as top-level ontological concepts. Taxonomies are created for each as subclasses in the ontology. Concrete real-world problems and optimization algorithms/implementations are treated as instances of one of the sub-concepts in the ontology. An optimization problem can be defined as an instance of an appropriate subclass of optimization problem. Upon creation of this instance, a set of rules can be utilized to determine which particular algorithms are suitable for this type of problem. In other words, when a new practical problem arises, running a rule-based inference engine over both the existing rules and the SoPT knowledge base can

Table 3.3: Scala Code for UCFModel.

class UCFModel (name: String, nArrivals: Int, iArrivalRV: Variate,
                nurses: Int, doctors: Int, nurseRV: Variate,
                doctorRV: Variate, moveRV: Variate)
      extends Model (name)
{
    val entry     = new Source ("entry", this, Patient, nArrivals, iArrivalRV, (70., 185.))
    val nurseQ    = new WaitQueue ("nurseQ", (200., 190.))
    val nurse     = new Resource ("nurse", nurseQ, nurses, nurseRV, (270., 185.))
    val doctorQ   = new WaitQueue ("doctorQ", (410., 190.))
    val doctor    = new Resource ("doctor", doctorQ, doctors, doctorRV, (480., 185.))
    val door      = new Sink ("door", (620., 185.))
    val toNurseQ  = new Transport ("toNurseQ", moveRV, entry, nurseQ)
    val toDoctorQ = new Transport ("toDoctorQ", moveRV, nurse, doctorQ)
    val toDoor    = new Transport ("toDoor", moveRV, doctor, door)

    addComponents (List (entry, nurseQ, nurse, doctorQ, doctor, door,
                         toNurseQ, toDoctorQ, toDoor))

    case class Patient () extends SimActor ("p", this)
    {
        def act ()
        {
            toNurseQ.move ()
            if (nurse.busy) nurseQ.waitIn ()
            nurse.utilize ()
            nurse.release ()
            toDoctorQ.move ()
            if (doctor.busy) doctorQ.waitIn ()
            doctor.utilize ()
            doctor.release ()
            toDoor.move ()
            door.leave ()
        } // act
    } // Patient

} // UCFModel class

guide the selection of particular algorithms. After particular optimization algorithms have been determined, a DSL included in ScalaTion for simulation optimization, SimOptDSL, can be used to achieve: (1) adjustment of the parameters of the optimization algorithms to fit the needs of the practical problem (e.g., step size, population size, number of generations, etc.), and (2) execution of the simulation model and optimization routine to obtain the optimized parameters and to collect the statistics.

[Diagram: the ontology and the optimization-algorithm rules feed a rule inference engine, which performs algorithm selection; the selected algorithm is then configured (parameter selection) and drives the optimized simulation.]

Figure 3.3: General Workflow for Simulation Optimization.
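The rule-based selection step in this workflow can be illustrated with a toy rule base. The feature names and the mapping below are simplified assumptions; a real system would run an inference engine over the SoPT knowledge base rather than hard-coded lookups.

```python
def select_solvers(problem):
    """Toy rule base mapping problem features to candidate solver classes
    (names from the scalation.minima package in Section 3.4.1)."""
    obj, dom = problem["objective"], problem["domain"]
    rules = [
        (("linear", "real"),     ["Simplex2P"]),
        (("linear", "integer"),  ["IntegerLP"]),
        (("quadratic", "real"),  ["QuadraticSimplex"]),
        (("nonlinear", "real"),  ["SteepestDescent", "ConjGradient", "QuasiNewton"]),
        (("nonlinear", "mixed"), ["IntegerNLP", "IntegerLocalSearch", "IntegerTabuSearch"]),
    ]
    # Collect the solvers whose rule matches the problem's features.
    return [s for key, solvers in rules if key == (obj, dom) for s in solvers]

# The UCF problem: nonlinear response surface over mixed integer/real inputs.
ucf = {"objective": "nonlinear", "domain": "mixed"}
print(select_solvers(ucf))    # ['IntegerNLP', 'IntegerLocalSearch', 'IntegerTabuSearch']
```

Encoding such rules over SoPT instances, instead of in code, is what allows new problems and solvers to be added without touching the selection logic.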

3.4.1 Optimization Package in ScalaTion

SoPT supports both optimization and simulation optimization. From one point of view, the two areas are similar: each optimizes a function subject to satisfying a set of constraints on variables defined over various domains (e.g., reals, integers or binary). Simulation optimization is, however, more challenging for the following reasons: (1) a function evaluation involves the execution of one or more runs of a simulation model, and (2) the results are stochastic. In other words, a function evaluation usually takes a long time and the results are uncertain. To make matters worse, the domains may be mixed, with some variables being real and others integer. For instance, in the UCF problem, the numbers of rooms, nurses and doctors are integers, while various supplies, such as the amount of saline solution in inventory, can be real numbers. The response surface is likely to be non-linear and may or may not be convex. Hence, many simulation optimization problems fall into the hard category of Mixed Integer NonLinear Programming (MINLP).

Currently, ScalaTion has two packages dedicated to maximization (scalation.maxima) and minimization (scalation.minima), respectively. The packages contain roughly equivalent classes; the scalation.minima package contains the following classes:

• Simplex2P.scala implements the two-phase simplex algorithm. This algorithm optimizes linear functions subject to a set of linear constraints, i.e., Linear Programming (LP) problems. Phase I is used to find a basic feasible solution, while phase II is used to find an optimal solution.

• IntegerLP.scala implements a recursive branch and bound algorithm to solve Integer Linear Programming (ILP) problems. It solves the corresponding relaxed linear program and checks the solution vector x for non-integral values. Upon finding such a value, it solves two sub-problems, one adding the constraint xi ≤ ⌊value⌋ and the other adding xi ≥ ⌈value⌉. The recursion continues until the solution consists of only integer values.

• QuadraticSimplex.scala solves Quadratic Programming (QP) problems where the objective function is quadratic and the constraints are linear.

• SteepestDescent.scala solves Non-Linear Programming (NLP) problems, either constrained or unconstrained, by iteratively computing a gradient and moving in the opposite direction (d = −∇f(x)) using a line search algorithm.

• ConjGradient.scala solves NLP problems by using conjugate gradient techniques that determine a search direction as a combination of the steepest descent direction (opposite the gradient) and the previous search direction.

• QuasiNewton.scala uses the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm to solve NLP problems. BFGS determines a search direction by deflecting the steepest descent direction vector (opposite the gradient), multiplying it by a matrix that approximates the inverse Hessian.

• IntegerNLP.scala adds branch and bound capabilities to BFGS in order to solve Mixed Integer Non-Linear Programming (MINLP) problems.

• IntegerLocalSearch.scala moves from point to point in the search space by evaluating the function in the neighborhood of the current point and moving to the lowest point in the neighborhood. It is also used to solve MINLP problems.

• IntegerTabuSearch.scala enhances local search by keeping track of the points that have been visited before and not considering any such points that are less optimal than the current point.

• GeneticAlgorithm.scala incorporates tournament selection, one-point crossover, and mutation operators that mimic the process of natural evolution over a large population of individuals/chromosomes. The individuals with the better fitness values (low for minimization/high for maximization) will survive.

• GoldenSectionLS.scala searches for an optimal solution to a line search problem (λ∗ = argmin{f(x + λd) | λ ≥ 0}) by using the Golden Section line search algorithm.

• WolfeLS.scala searches along the search direction d for a point of sufficient decrease with a sufficiently reduced slope (the Wolfe conditions).
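The Golden Section line search listed above can be sketched compactly. This Python version mirrors the idea behind GoldenSectionLS.scala (it is not the ScalaTion code); the example objective is a hypothetical line-search function φ(λ) = f(x + λd).

```python
import math

def golden_section_ls(f, a=0.0, b=1.0, tol=1e-6):
    """Golden Section line search for argmin { f(lam) | a <= lam <= b },
    assuming f is unimodal on [a, b]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0               # golden ratio conjugate
    l, u = a + (1.0 - g) * (b - a), a + g * (b - a)
    fl, fu = f(l), f(u)
    while b - a > tol:
        if fl < fu:                    # minimum lies in [a, u]: reuse l as new u
            b, u, fu = u, l, fl
            l = a + (1.0 - g) * (b - a)
            fl = f(l)
        else:                          # minimum lies in [l, b]: reuse u as new l
            a, l, fl = l, u, fu
            u = a + g * (b - a)
            fu = f(u)
    return 0.5 * (a + b)

# Line search along d = -grad f(x) for f(x) = x^2, from x = 1 (so d = -2).
phi = lambda lam: (1.0 - 2.0 * lam) ** 2
lam_star = golden_section_ls(phi)
print(round(lam_star, 4))    # optimal step size 0.5
```

Because the golden ratio satisfies g² = 1 − g, one of the two interior evaluations can be reused at every iteration, so each shrink of the interval costs only one new function evaluation.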

3.4.2 Optimization Algorithm Evaluation and Selection

Optimization problems and their corresponding optimization solvers are closely related to each other. While algorithm designers attempt to address a wide range of real-world scenarios with a general-purpose algorithm, practical problems vary case by case. Hence, when choosing an effective optimization algorithm for a specific problem, a common issue is captured by the No Free Lunch (NFL) theorem [121]: “If algorithm A outperforms algorithm B on some cost functions, then loosely speaking there must exist exactly as many other functions where B outperforms A.” In order to select a better algorithm, the foremost step is to capture various aspects of an algorithm and measure its performance. The general-purpose Algorithm Selection Problem was theoretically studied by John R. Rice [122]. Based on features identified in the problem space, algorithm space and performance measure space, general-purpose algorithm selection can be carried out by measuring a function for algorithm performance. Although, according to Rice's theorem, an automatic algorithm selection program based only on the description of the input instance does not exist, a machine learning-based inductive approach is proposed in [123] and used to select an algorithm for the sorting problem and for a more complicated problem called the Most Probable Explanation problem. Besides machine learning approaches, rules are often utilized in algorithm selection as well. For example, learning algorithm selection for classification was studied in [124], and rule selection in fuzzy rule-based systems has been applied to diabetes treatment and fraud detection [125]. In many circumstances, selecting an appropriate optimization algorithm is very challenging, and more research effort is needed in algorithm selection. An approach based on modern machine learning techniques for automatically selecting an optimization algorithm to solve a given problem is proposed in [126], where various optimization algorithms are evaluated based on their characteristics and performance.
Regarding optimization algorithm selection, some general guidelines from empirical experience have been provided on how to select a specific optimization algorithm for a family of mathematical programming problems.

• For LP problems, the Simplex algorithm is recommended for small problems, as it reaches an exact solution, although it may be less efficient than interior point algorithms. Interior point algorithms are more suitable for larger-scale LP problems, as they approach the optimal solution very quickly, although they may not reach it with complete accuracy.

• For real NLP problems, gradient-based optimization methods are the natural choice if the problem surface is convex. The steepest descent method is easy to implement, but it may get trapped in local optima or converge very slowly when approaching the optimal solution. Conjugate gradient methods work best on problems with a nearly quadratic objective function. Newton methods have proven to be accurate; however, they have several disadvantages: (1) they require more function evaluations, (2) they are not applicable when the Hessian matrix is not available, and (3) they may not find a solution when the starting point is far from the optimum. Quasi-Newton methods overcome some of these disadvantages by generating a numerical approximation (B) of the Hessian matrix (H).
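The remark that conjugate gradient methods suit nearly quadratic objectives can be made concrete: on an exactly quadratic function of n variables, conjugate gradients reach the minimizer in at most n iterations, whereas steepest descent zigzags on ill-conditioned problems. A minimal Python sketch for a 2-D quadratic (illustrative only; the Q and b below are made-up example data):

```python
def cg_quadratic(Q, b, x):
    """Conjugate gradient on f(x) = 1/2 x^T Q x - b^T x for a 2x2 symmetric
    positive definite Q; on an exactly quadratic function it terminates in
    at most 2 iterations."""
    mul = lambda v: [Q[0][0] * v[0] + Q[0][1] * v[1],
                     Q[1][0] * v[0] + Q[1][1] * v[1]]
    Qx = mul(x)
    r = [b[0] - Qx[0], b[1] - Qx[1]]               # residual = -gradient
    d = r[:]
    for _ in range(2):
        Qd = mul(d)
        alpha = (r[0] * r[0] + r[1] * r[1]) / (d[0] * Qd[0] + d[1] * Qd[1])
        x = [x[0] + alpha * d[0], x[1] + alpha * d[1]]
        r_new = [r[0] - alpha * Qd[0], r[1] - alpha * Qd[1]]
        beta = (r_new[0] ** 2 + r_new[1] ** 2) / (r[0] ** 2 + r[1] ** 2)
        d = [r_new[0] + beta * d[0], r_new[1] + beta * d[1]]
        r = r_new
    return x

# Ill-conditioned quadratic: the minimizer solves Qx = b, here x* = (1, 1).
Q = [[10.0, 0.0], [0.0, 1.0]]
b = [10.0, 1.0]
x_min = cg_quadratic(Q, b, [0.0, 0.0])
print([round(v, 6) for v in x_min])    # [1.0, 1.0]
```

On this same problem, steepest descent with exact line search would need many iterations because the condition number of Q is 10; the conjugate direction correction (the beta term) is what removes the zigzagging.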

Although evaluations and comparisons of different optimization methods are available, it is still difficult to pick an appropriate optimizer for a practical problem without extensive trials. In the M&S community, the optimization algorithms used to solve specific simulation optimization problems are usually chosen according to empirical experience and in an ad-hoc manner. Three major issues make matters worse: (1) domain modelers may not know how to formulate the exact optimization problem (e.g., input parameters, constraints, objective function) and then optimize it; (2) researchers find that existing optimization routines are difficult to extend or integrate and often must develop their own optimization methods, because the existing ones are problem-specific and not designed for reuse; and (3) domain modelers may not care about the details of how a specific optimizer is selected; they may just search for an optimizer that works and disregard the quality of the final solution.

Our long-term goal for SEESO is to provide a computing environment where optimization results are presented to modelers after they specify the model design and (possibly) the characteristics of the optimization problems of interest. All this convenience depends on the automatic selection and configuration of an optimization algorithm, automatic code generation, and automatic execution of the simulation model and optimization routines. A metric has been designed for selecting optimization algorithms based on the evaluation of optimization algorithms, as shown in Table 3.4. As the project develops, these performance metrics will eventually be incorporated into the SoPT ontology, and various optimization algorithms will be evaluated for performance metric data, which will be stored in SoPT as instances as well.

Table 3.4: Performance Metric for Optimization Algorithm.

Metric                Description
Accuracy              indication of error bound: exact (none), approximate (formula) or heuristic (unknown)
Robustness            can the algorithm work generally, including special cases (e.g., stiff curves)
Time Efficiency       the average running time of the algorithm
Space Efficiency      the average space usage of the algorithm
Local/Global Optima   will the algorithm reach a local or a global optimum
Multi-objective       can the algorithm handle multiple objective functions

3.4.3 Simulation Optimization (SoPT) Ontology

How to establish connections between numerous real-world problems and various optimization algorithms remains a big challenge, and significant research effort has been put into this area. A repository (www.simopt.org) has been maintained since 2006 and improved over time [127, 128]. Serving as a testbed for simulation optimization, simopt.org includes various problem definitions, classifies them according to parameter type and constraints, and provides optimization algorithms and results for some of them. However, the categorization is based only on the variable classes (e.g., continuous, integer, and categorical) and constraints (deterministic, unconstrained, and stochastic).

In order to better describe both practical problems and optimization algorithms, some building-block components and a taxonomy need to be defined first and then used to characterize particular optimization problems and optimization algorithms/methods. For this purpose, SoPT is designed to capture the essential knowledge in the domain of simulation optimization. Influenced by DeMO, the top-level concepts related to simulation optimization are defined as abstract classes, and new classes are added as subclasses using common OWL constructs. The top-level abstract classes of SoPT are Optimization Component, Optimization Problem and Optimization Method. The relationships among them are represented by two object properties, has-component and can-solve, as shown in Figure 3.4.

[Diagram: Optimization Method is linked to Optimization Problem by the can-solve property, and Optimization Problem is linked to Optimization Component by the has-component property.]

Figure 3.4: Top-level Abstract Classes for SoPT Ontology.

Optimization Component

An optimization problem consists of several important parts, including the objective function, the set of constraints, the domain over which optimization is to be performed, and the goal of the optimization (e.g., min f(x) or max E[F(x)]). The top-level concept, Optimization Component, is designed to describe these parts.

The sub-concepts of Optimization Component, as shown in Figure 3.5, include the data types and restrictions of input variables, the objective function, constraints, the solution, solution quality and the optimization goal:

• Data Types: input variables may be represented in the form of a single element, vector or matrix, which can be used as the initial values and part of a Solution.

• Restriction: input variables may have restrictions on the types of numbers, e.g., real, integer, binary or mixed (real/integer/binary).

• Objective Function: it can be a single objective or multiple objectives (indicated by a data property called has-MultiObjectives) and needs to be represented as specifically as possible, so that more specific methods for this particular problem can be found.

• Constraints: they may take the form of inequalities or equalities, and may be linear, quadratic or nonlinear.

• Solution: it contains the optimized parameters and the value of the objective function.

• Solution Quality: it is used by both the Optimization Problem and Optimization Method concepts to specify which solution quality is desired for certain optimization problems or which solution quality can be achieved by certain optimization methods. In this way, a connection between algorithms and problems can be established.

• Optimization Goal: it specifies the type of optimization desired on the objective function, e.g., minimization or maximization.
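The component taxonomy above can be mirrored in ordinary code. The following is a minimal, illustrative Python sketch; the class and field names are our own simplification for exposition, not the actual SoPT vocabulary:

```python
from dataclasses import dataclass, field

# Hypothetical, simplified mirror of the SoPT component taxonomy.
@dataclass
class ObjectiveFunction:
    form: str                 # "linear", "quadratic" or "nonlinear"
    goal: str                 # "minimization" or "maximization"
    multi_objective: bool = False

@dataclass
class Constraint:
    form: str                 # "linear", "quadratic" or "nonlinear"
    kind: str                 # "equality" or "inequality"

@dataclass
class OptimizationProblem:
    objective: ObjectiveFunction
    constraints: list = field(default_factory=list)   # has-Constraint
    restriction: str = "real"                         # real/integer/binary/mixed
    solution_quality: str = "local-optimal"           # requires-SolutionQuality

# an LP instance shares the same component structure as a QP instance
lp = OptimizationProblem(
    objective=ObjectiveFunction(form="linear", goal="minimization"),
    constraints=[Constraint(form="linear", kind="inequality")],
)
print(lp.objective.form, lp.restriction)
```

Because LP is a special case of QP, the two instances can share the same constraint and restriction components and differ only in the form of the objective.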

Optimization Problem

The sub-concepts within the Optimization Problem taxonomy (as shown in Figure 3.6), Linear Programming (LP), Quadratic Programming (QP) and Nonlinear Programming (NLP),


Figure 3.5: Schematic Representation of Optimization Component in SoPT Ontology.

have object properties that link to the sub-concepts of both Optimization Component and Optimization Method.

Based on the available optimization components, the aforementioned optimization problems (LP and QP, represented in Equations (3.3) and (3.5), respectively) can be treated as instances of the ontological class Optimization Problem. Various optimization components, such as the objective function, constraints, restrictions and optimization goal, can be shared between both instances using object properties (e.g., has-ObjectiveFunction, has-Constraint, has-Restriction, has-OptimizationGoal, requires-SolutionQuality). Although LP and QP have different forms of objective functions, both can also share the same input variable vector and cost coefficient vector, as LP is a special case of QP. The difference is that LP has a zero matrix for its quadratic coefficient matrix Q, while QP has a non-zero Q.

Compared with deterministic mathematical programming, stochastic mathematical programming [129] involves uncertain parameters that are often described by probability distributions and thus is another major category of optimization problems. Stochastic programming can be further classified into stochastic LP, QP and NLP.


Figure 3.6: Schematic Representation of Optimization Problem in SoPT Ontology.

Optimization Method

Our primary interest is the iterative interaction between the optimizer and the simulator, and providing connections between optimization algorithms and real-world problems. As shown in Figure 3.7, the top-level class in SoPT, Optimization Method, classifies popular algorithms into groups and serves as a knowledge base for further rule-based reasoning.

Due to the wide variety of methods used for simulation optimization, it is difficult to determine where to start and to cover every aspect of the algorithms in detail. Because we are focusing on iterative interaction between the simulator and the optimizer, we concentrate on what the optimizer needs from the simulator (a set of response vectors estimating a small portion of the response surface) and on how the optimizer explores the parameter space using various search techniques. Gradient-based optimization methods (e.g., steepest descent or conjugate gradient) can be used to illustrate our approach. The key components in gradient-based methods are gradient calculation (or numerical estimation if the gradient cannot be computed directly) and line search. Gradient calculation determines the search direction, while line search decides how far to move in that direction.

Optimization methods based on meta-modeling can have strategies for either global or local optimization. Because global strategies are often integrated with other heuristic methods and thus are difficult to categorize, we currently only include local meta-modeling methods in the SoPT ontology. We plan to refine this in the future, after we establish some cross linkages between SoPT and DeMO (e.g., a meta-model could be viewed as an alternative path in Figure 3.1).

3.4.4 Optimization Algorithm Selection Based on Ontological Rules

Given the plethora of optimization algorithms implemented and the rules available to indicate the choices of algorithms for well-investigated practical problems, specific algorithms can be selected for a new problem through rule inference. This task is executed by a rule-based reasoning engine, which extracts and fires rules from a rule repository.

Common rule-based reasoners are implemented based on a Datalog rule engine [130] or the Rete algorithm [131]. A Datalog engine is included and extended in many rule engine implementations. IRIS is an open-source Datalog system written in Java. The Semantic Web


Figure 3.7: Schematic Representation of Optimization Method in SoPT Ontology.

toolkit, Jena [132], also includes a Datalog implementation for its general-purpose rule-based inference engine. The Rete algorithm is widely adopted in the rule-based inference engines of open-source and commercial software (e.g., CLIPS, Drools, JESS, Bossam, etc.). Considering the existence of different rule languages and systems, the Rule Interchange Format (RIF) was designed to facilitate the exchange of rule semantics among various rule systems. Three dialects are defined, namely, the RIF Core Dialect (RIF-Core), the Basic Logic Dialect (RIF-BLD) and the Production Rule Dialect (RIF-PRD). RIF-BLD and RIF-PRD share a common subset, RIF-Core. RIF-BLD, which extends RIF-Core with features such as function symbols and logical entailment, covers many existing rule systems. RIF-PRD defines the rules-with-actions dialect; a production rule has a condition part as well as an action part.

RIF became a W3C Recommendation on June 22, 2010 [133]. However, the publicly available inference engines (e.g., Jena and Drools) have not yet provided direct support for the RIF format. In our project, RIF4J [101] is chosen to perform inference over a set of rules represented in RIF format, because it builds upon the IRIS library and enables semantic reasoning over RIF by providing both a RIF parser and a translation from RIF-BLD to equivalent Datalog programs. When representing rules in the rule repository, the primary normative syntax for RIF-BLD is a concrete XML syntax; the RIF-BLD presentation syntax provides a more concise view following EBNF notation, but it is not intended to be used in the exchange of rules. In this paper, for conciseness and clarity, the presentation syntax is used, and a sample rule in the SEESO rule-base is listed in Table 3.5; the first part is the rule and the second part contains several facts. Other rules are being created to handle other cases; for example, steepest descent can work for different types of constraints.

Table 3.5: Sample Rule using RIF-BLD Presentation Syntax.

Document( Group(
    Forall ?optProb (
        ...(?optProb) :- And( ...(?optProb)
                              ...(?optProb)
                              ...(?optProb)
                              ...(?optProb)
                              ...(?optProb) ) )
    ...( )
    ...( )
    ...( )
    ...( )
    ...( )
) )

Within the SEESO computing environment, ontological rules and ontologies (e.g., DeMO and SoPT) can collaborate seamlessly, as both are built upon the same Semantic Web platform. Concepts, relationships and instances can be shared between RIF and the ontologies without further translation or conversion. The rules are intended to apply to the general case, while specific facts about a particular optimization problem can be extracted from the SoPT ontology.

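To make the rule-firing step concrete, the following toy Python sketch mimics (very loosely) how a rule engine matches facts about a problem against rule conditions. The rules and facts shown are illustrative placeholders, not actual SEESO rules or SoPT predicates:

```python
# Toy rule-based algorithm selection: facts describing a problem are
# matched against rule conditions, loosely imitating what a Datalog/RIF
# engine does over SoPT facts. All rule contents are illustrative only.
facts = {"objective": "nonlinear", "constraints": "linear",
         "restriction": "real", "gradient-available": True}

rules = [
    ({"objective": "linear", "constraints": "linear"}, "SimplexMethod"),
    ({"objective": "quadratic", "constraints": "linear"}, "QuadraticSimplex"),
    ({"objective": "nonlinear", "gradient-available": True}, "QuasiNewton"),
    ({"objective": "nonlinear", "gradient-available": False}, "NelderMead"),
]

def select(facts, rules):
    # fire every rule whose conditions are all satisfied by the facts
    return [method for cond, method in rules
            if all(facts.get(k) == v for k, v in cond.items())]

print(select(facts, rules))  # → ['QuasiNewton']
```

A real engine such as RIF4J performs the same kind of condition matching, but over RIF-BLD rules translated to Datalog and over facts drawn from the ontology rather than a Python dictionary.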
3.4.5 Surface Analysis and Meta-modeling

As a single data point cannot provide enough useful information regarding a huge search space, surface sampling is crucial to determining where to simulate, i.e., the input vectors x_i. Various approaches can be taken based on different conditions.

If the cost of large-scale simulation is affordable, one common solution is to use the sample path optimization or sample average approximation method. It generates a large number of random samples via Monte Carlo based simulation, constructs a deterministic function that approximates the expected value by the sample average, and repeats the procedure of solving this optimization problem until convergence. If the objective function is smooth enough, gradient-based Newton and quasi-Newton methods estimate first-order derivatives and/or second-order derivatives for d parameters, either directly or numerically.

When the objective function becomes non-differentiable or discontinuous, derivative-free methods may be helpful. Typically, they utilize points within an n-dimensional problem space, e.g., the Nelder-Mead downhill simplex method maintains n + 1 points, pattern search maintains a set of points called a pattern, and random search samples from a hypersphere around the current point. Heuristic methods often maintain a population of candidates, as is the case for Genetic Algorithms and Particle Swarm Optimization. Although Simulated Annealing only keeps track of a single point, it chooses from a number of its neighbors.

When objective function evaluations become expensive or impossible, meta-modeling can be used to approximate the actual problem surface. It includes various methods, such as classic response surface methodology (RSM) [105] and Kriging RSM [134, 135]. Classic RSM uses first-order and/or second-order polynomial models to represent the meta-models

and important design parameters are identified via polynomial regression. Kriging RSM combines the polynomial model with a random field built from the sample input variance σ̂² and can provide better approximations to a large variety of functions when highly nonlinear behavior cannot be modeled by classic RSM. Presently, we only provide local meta-modeling using QuadraticFit.scala, which collects response values around a point (an n-dimensional vector) to fit a quadratic function (e.g., b_0 + b_1 x_0 + b_2 x_0^2 + b_3 x_1 + b_4 x_1 x_0 + b_5 x_1^2 for n = 2) by using multiple regression.
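The local fitting step can also be sketched outside of ScalaTion. The following illustrative Python fragment fits the same six-term quadratic by solving the least-squares normal equations directly; it is a stand-in for QuadraticFit.scala, and the sampled "response surface" is made up for the example:

```python
# Local meta-modeling sketch: fit b0 + b1*x0 + b2*x0^2 + b3*x1 + b4*x1*x0
# + b5*x1^2 to sampled responses via least squares (normal equations
# solved by Gaussian elimination). Illustrative stand-in only.

def basis(x0, x1):
    return [1.0, x0, x0 * x0, x1, x1 * x0, x1 * x1]

def solve(A, b):
    # Gaussian elimination with partial pivoting
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def quad_fit(points, responses):
    X = [basis(*p) for p in points]
    # normal equations: (X^T X) b = X^T y
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(6)] for i in range(6)]
    Xty = [sum(X[k][i] * responses[k] for k in range(len(X))) for i in range(6)]
    return solve(XtX, Xty)

# sample a made-up response surface f(x0, x1) = 3 + 2*x0 - x0^2 + x1^2
pts = [(a * 0.5, b * 0.5) for a in range(-2, 3) for b in range(-2, 3)]
f = lambda x0, x1: 3 + 2 * x0 - x0 * x0 + x1 * x1
coeffs = quad_fit(pts, [f(*p) for p in pts])
print([round(c, 6) for c in coeffs])  # ≈ [3, 2, -1, 0, 0, 1]
```

Because the sampled surface is itself quadratic, the regression recovers its coefficients exactly (up to floating-point error); on a noisy simulation response, the same fit yields a smoothed local meta-model instead.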

3.4.6 Search Techniques

Many optimization algorithms work in the following fashion: iteratively establish a search direction (e.g., the gradient for Steepest Descent) and then move in this direction by an amount determined by a line search algorithm that either optimizes or provides sufficient decrease of the objective function. An alternative is to use trust regions instead of line search.

Meta-modeling fits a surface to a portion of the domain D and finds the optimal solution, e.g., using QP: minimize (1/2) xᵀQx + cᵀx subject to Ax ≥ b. It replaces a complex, stochastic simulation with a simpler, likely deterministic meta-model.
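A minimal sketch of the direction/line-search loop described above, using steepest descent with an Armijo backtracking line search and a numerically estimated gradient; the test function, step constants and iteration count are illustrative:

```python
# Steepest descent with backtracking line search: the (numerical) gradient
# sets the search direction; the line search decides how far to move
# (Armijo sufficient-decrease condition). Minimal illustrative sketch.

def grad(f, x, h=1e-6):
    # central-difference gradient, as when derivatives of a simulation
    # response cannot be computed directly
    g = []
    for i in range(len(x)):
        xp = x[:]; xp[i] += h
        xm = x[:]; xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def steepest_descent(f, x, iters=200):
    for _ in range(iters):
        g = grad(f, x)
        d = [-gi for gi in g]                       # search direction
        t, fx = 1.0, f(x)
        gTd = sum(gi * di for gi, di in zip(g, d))
        # backtrack until sufficient decrease is achieved
        while f([xi + t * di for xi, di in zip(x, d)]) > fx + 1e-4 * t * gTd:
            t *= 0.5
        x = [xi + t * di for xi, di in zip(x, d)]
    return x

f = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2   # toy smooth surface
xstar = steepest_descent(f, [0.0, 0.0])
print([round(v, 3) for v in xstar])  # ≈ [1.0, -2.0]
```

In the simulation optimization setting, each call to f would be an expensive (and noisy) simulation run, which is exactly why the caching, interpolation and meta-modeling techniques of the next section matter.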

3.5 SEESO: A Semantically Enriched Environment for Simulation Optimization

In developing a computing environment for simulation optimization, we are concerned with efficient allocation of resources as well as techniques for speeding up the process. As already mentioned, there are several ways to speed up simulation optimization.

The first is caching. Some algorithms may require a functional evaluation at point x more than once. Therefore, a hash table is maintained with the vector x serving as the key and f(x) as the value for that key. If x is not found in the hash table, then unless another technique is used, f(x) will be estimated via simulation. The next technique for speed-up is to find

two points in a δ-neighborhood of x, call them y and z, such that for each i either y_i ≤ x_i ≤ z_i or z_i ≤ x_i ≤ y_i. If points y and z can be found in the hash table, then the value of f(x) is estimated using linear interpolation as follows:

f(x) = (d_xz · f(y) + d_yx · f(z)) / (d_yx + d_xz)

where d_yx is the distance from y to x and d_xz is the distance from x to z. To make finding points y and z efficient, we plan to use a proximity index. The next technique is to use a meta-model. For example, our UCF simulation can be approximated by a queueing network model, which can be solved more quickly. Finally, if simulation is required, fewer replications are taken when far from an optimum. When it is difficult to know how far one is from an optimum, we apply a couple of heuristics. First, at the start of optimization, approximations are utilized more heavily than later in the optimization, i.e., precision increases throughout the optimization. Second, measures of improvement of the objective function are maintained, and as improvement begins to slow, precision is increased.

In simulation optimization, there are many opportunities for speed-up via parallel execution, including parallel simulation, parallel replications and parallelization of optimization algorithms. Parallelization of simulation, running different parts of a simulation concurrently on different processing elements, is a well-studied topic. In the simulation literature [136], parallel simulation is usually concerned with the execution of a parallelized simulation on a multi-processor system, while its execution on a multi-computer system (distributed nodes) is called distributed simulation. The speedup and scalability of a parallelized simulation depend on its serial fraction (according to Amdahl's law) and the granularity of the parallel program, as well as the runtime environment [137]. In general, finer granularity indicates higher potential for parallelism, and faster communication between processes leads to higher scalability. Parallel replication, on the other hand, runs several replications of a sequential simulation on different processors [138]. When this approach is applicable, it can considerably accelerate the simulation part of the simulation-optimization process. Parallelization of an optimization algorithm depends on the selected algorithm; for example, in [139] a parallel Genetic Algorithm is used. In general, several instances of an optimization algorithm can be executed in parallel and the best of their results taken. Computing as a service on the Internet, which has emerged as part of cloud computing in recent years, provides new possibilities for parallel and distributed simulation optimization. As a result, this has created new momentum among researchers to adapt old algorithms to the new environment, as reported in [140, 141].

Although the speed-up techniques can be quite useful, it is important to select the best optimization algorithm for the job. For example, when the objective function and the constraints are linear, an LP algorithm such as the Simplex method or an Interior Point method can be used. With such constraints and a quadratic objective function, the Quadratic Simplex method may be used. These algorithms are typically very fast, but often will not suffice unless the model, or more likely its meta-model, is sufficiently simple. In Section 3.6, we evaluate how useful these relatively fast algorithms are. More commonly, simulation optimization will require Nonlinear Programming algorithms. In the NLP domain, there are four basic approaches: Derivative-Free, Steepest-Descent, Conjugate-Gradient and Quasi-Newton methods. If it is possible to reliably estimate first-order derivatives, then Quasi-Newton methods, such as the BFGS method, tend to be robust.
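The caching and interpolation speed-ups described above can be sketched as follows; this is a one-dimensional Python illustration (the design discussed earlier stores vectors and uses a proximity index, and the simulated function here is a stand-in):

```python
# Function-evaluation caching with an interpolation fallback: cached
# neighbors within a delta-neighborhood are linearly interpolated instead
# of re-running the simulation. 1-D illustrative sketch.
import math

cache = {}                       # x -> f(x), filled by earlier "runs"

def expensive_sim(x):
    return math.sin(x)           # stand-in for a simulation run

def f_cached(x, delta=0.5):
    if x in cache:               # exact hit: no simulation needed
        return cache[x]
    lower = [y for y in cache if x - delta <= y <= x]
    upper = [z for z in cache if x <= z <= x + delta]
    if lower and upper:
        y, z = max(lower), min(upper)
        dyx, dxz = x - y, z - x
        # linear interpolation: each endpoint weighted by the distance
        # to the opposite endpoint
        return (dxz * cache[y] + dyx * cache[z]) / (dyx + dxz)
    cache[x] = expensive_sim(x)  # fall back to simulation
    return cache[x]

f_cached(0.0); f_cached(0.4)     # simulate and cache two points
approx = f_cached(0.2)           # interpolated, no simulation call
print(round(approx, 4), round(math.sin(0.2), 4))
```

For vector-valued x, the neighbor search over the cache keys is the expensive part, which is what the planned proximity index would accelerate.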
If the surface is nearly quadratic, then Conjugate-Gradient methods, such as the Polak-Ribiere (PR) algorithm, provide solutions similar to those obtained with BFGS, but do so more quickly. Finally, for simulation optimization, one must deal with cases where surfaces are not convex and not all domains are real-valued. There are two basic approaches to deal with such complexity. The first is to adapt and extend NLP algorithms with techniques such as

branch and bound, random restarts, etc. The other approach is to use heuristic methods, either those based on local search (e.g., local search, tabu search, and simulated annealing) or those inspired by nature (e.g., particle swarm optimization, ant colony optimization, genetic algorithms, and bacterial foraging optimization algorithms [142]). It is also possible to create hybrid heuristics, for example by combining genetic algorithms with local search [143]. In this paper, an Engineered Condition (EC) operator is introduced that allows some individuals in the GA population to be improved using limited local search, which produces better results than either technique by itself.

3.6 Case Studies

3.6.1 Scenario 1: Urgent Care Facility

The first case study models an Urgent Care Facility (UCF), roughly patterned on the emergency room model developed by OptTek Systems, Inc. and presented at the 2006 Winter Simulation Conference [144]. Our Process Interaction model for the UCF makes use of two Sources for generating Poisson arrivals of patients and a Sink for patients leaving the UCF: (1) the Ambulance source generates ambulance patients at rate λ_A, (2) the Walk-in source generates walk-in patients at rate λ_W, and (3) the Exit-Door sink is used by patients to leave the UCF. Patients are divided into two different levels according to the severity of their condition. For the purposes of this simulation, patients are assigned a severity level of 0 (low) or 1 (high) based on a Bernoulli distribution where the probability of receiving a 1 is 25%. Five types of resources exist in the model, as indicated below:

• Triage Nurses (TN) determine the severity level of incoming walk-in patients.

• Registered Nurses (RN) see all patients before treatment. They also determine the severity level of incoming ambulance patients.

• Medical Doctors (MD) see only the patients with a high severity level.

• Nurse Practitioners (NP) see only the patients with a low severity level.

• Administrative Clerks (AC) handle patient billing and release.

Upon entering the UCF, a walk-in patient proceeds to the triage nurse, followed by a registered nurse, and then depending on the severity, either sees a nurse practitioner or a doctor. A patient arriving by ambulance is routed in a similar fashion, but starts immediately by visiting a registered nurse. All patients must see an administrative clerk before leaving. The simulation model is shown in Figure 3.8.

Figure 3.8: Screenshot of UCF simulation in ScalaTion.

The effective hourly profit for the UCF would correspond to the revenue received by treating both ambulance and walk-in patients, minus the hourly rates of the various employees, minus an opportunity cost proportional to the average time patients have to wait, and

minus fixed costs such as rent and utilities. The hourly revenue generated (r(x)) and the hourly payroll cost (c(x)) are then defined as follows:

r(x) = ($400 × N_0 + $750 × N_1) / h    (3.6)

c(x) = (κ · x) + δ × (1 · w_Q(x)) × θ(x)    (3.7)

where $400 and $750 are the amounts charged to the N_0 low-severity and N_1 high-severity patients over the entire simulation run, respectively, h is the number of hours simulated, κ is the vector of hourly pay rates for each type of employee, δ is a scalar penalty (in dollars per hour) for waiting, 1 is the one vector, w_Q(x) is the vector of waiting times and θ(x) is the throughput in patients per hour. For simplicity, fixed costs are not considered because they only cause the results to differ by a constant. Also, δ has been given a default value of $150. The hourly profit (p(x)) earned from operating the UCF for h hours is calculated by subtracting the hourly payroll cost from the hourly revenue:

p(x) = r(x) − c(x)    (3.8)
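Equations (3.6)-(3.8) can be evaluated directly once the simulation outputs are available. In the following Python sketch all numeric values (pay rates, patient counts, waiting times, throughput) are made-up placeholders, not simulation results:

```python
# Hourly profit p(x) = r(x) - c(x) per Equations (3.6)-(3.8); every input
# number below is a hypothetical placeholder for illustration.

def revenue(n_low, n_high, hours):
    return (400.0 * n_low + 750.0 * n_high) / hours          # r(x), Eq. 3.6

def payroll(kappa, x, delta, wait_q, theta):
    staff_cost = sum(k * s for k, s in zip(kappa, x))        # kappa . x
    wait_cost = delta * sum(wait_q) * theta                  # waiting penalty
    return staff_cost + wait_cost                            # c(x), Eq. 3.7

kappa = [25.0, 35.0, 90.0, 55.0, 20.0]  # hourly pay: TN, RN, MD, NP, AC (hypothetical)
x = [2, 2, 2, 2, 3]                     # staffing vector

r = revenue(n_low=300, n_high=100, hours=40)
c = payroll(kappa, x, delta=150.0,
            wait_q=[0.02, 0.05, 0.01, 0.03, 0.01], theta=10.0)
print(round(r - c, 2))                  # p(x), Eq. 3.8
```

In the actual study, N_0, N_1, w_Q(x) and θ(x) come out of the ScalaTion simulation run, and p(x) is the quantity the optimizer maximizes.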

Given the simulation model of the UCF, the simulation optimization problem is to find the staffing level that yields an optimal profit (p(x)). It is an example of a 5-dimensional Integer Nonlinear Programming (INLP) problem. The optimization problem for the UCF is a variant of the general maximization formula that keeps the relative precision of each simulation run below a desired threshold while maximizing the grand mean (f̄, the mean of the batch means). Letting f(x) estimate p(x), the following optimization formula is

defined:

max   f̄(x)
s.t.  γ = I_HW / f̄ ≤ t
      I_HW = t_{n−1, α/2} · σ̂_f̄ / √n
      x ≥ 0
      x ∈ N^n    (3.9)

where γ is the relative precision, I_HW is the 1 − α = 95% confidence-interval half-width and t is the maximum allowed value for γ. For evaluation purposes, t was set to 0.1.
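The stopping test in Equation (3.9) amounts to comparing a confidence-interval half-width against the grand mean. A Python sketch follows, using the Normal quantile as a large-sample stand-in for the t-quantile t_{n−1,α/2}; the batch means are illustrative numbers:

```python
# Relative-precision check per Equation (3.9): the confidence-interval
# half-width must be small relative to the grand mean. Uses the Normal
# quantile as a large-sample approximation of t_{n-1, alpha/2}.
from statistics import mean, stdev, NormalDist

def relative_precision(batch_means, alpha=0.05):
    n = len(batch_means)
    f_bar = mean(batch_means)                        # grand mean f-bar
    z = NormalDist().inv_cdf(1 - alpha / 2)          # ~1.96 for 95%
    half_width = z * stdev(batch_means) / n ** 0.5   # I_HW
    return half_width / f_bar                        # gamma

batches = [2780.0, 2815.0, 2840.0, 2825.0, 2810.0, 2850.0]  # illustrative
gamma = relative_precision(batches)
print(gamma < 0.1)   # precision requirement t = 0.1 is met
```

When γ exceeds t, more replications (batches) are taken, shrinking I_HW by roughly 1/√n, until the constraint is satisfied.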

Table 3.6: UCF Optimization Results.

TN   RN   MD   NP   AC   Profit
 2    2    2    2    3   $2827.32

As the UCF is an INLP problem with an integer restriction, the IntegerLocalSearch and IntegerTabuSearch optimization algorithms were chosen. Optimizers for the Integer

domain Z^n have been extensively researched in the past [145, 146, 147]. The UCF maximization problem was solved by using these optimizers with the max method provided by the SimOptDSL environment in ScalaTion, as shown below.

val x0     = new VectorI (1, 1, 1, 1, 1)
val result = max (f _ using IntegerLocalSearch)(x0, 0.1)

When evaluating the performance of these optimizers, it is important to note that caching the steady-state results for evaluations of an objective function is the default behavior in

SimOptDSL. In this experiment, caching was disabled for IntegerTabuSearch (in order to reduce the caching overhead) because it never visits the same vector input more than once.

In both cases, the optimizers produced the same result for the maximization problem, as indicated in Table 3.6. The results of the optimization routines suggest that the expected maximum profit of running the UCF modeled in this paper is around $2827.32.

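For illustration, the neighborhood-search idea behind an integer local search can be sketched as follows; this is a simplified stand-in for ScalaTion's IntegerLocalSearch, maximizing a toy objective whose optimum is placed at the staffing level from Table 3.6:

```python
# Sketch of integer local search over Z^n: evaluate the +/-1 neighbors of
# the current point in each coordinate and move while improvement is
# possible. Illustrative stand-in; the real optimizer evaluates f via
# simulation (with caching).

def integer_local_search(f, x0, lower=0):
    x, fx = list(x0), f(x0)
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            for step in (-1, 1):
                y = x[:]; y[i] += step
                if y[i] < lower:        # respect x >= lower bound
                    continue
                fy = f(y)
                if fy > fx:             # greedy improving move
                    x, fx, improved = y, fy, True
    return x, fx

# toy concave "profit" surface with its maximum at (2, 2, 2, 2, 3)
target = [2, 2, 2, 2, 3]
f = lambda x: -sum((xi - ti) ** 2 for xi, ti in zip(x, target))
best, val = integer_local_search(f, [1, 1, 1, 1, 1])
print(best, val)  # → [2, 2, 2, 2, 3] 0
```

On a real, noisy profit surface such greedy moves can stall at local optima, which is why IntegerTabuSearch, with its memory of recently visited points, was run alongside it.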
3.6.2 Scenario 2: Mass Spectrometry Modeling and Simulation

Mass spectrometers (MS) are powerful tools widely utilized in the identification and quantitative analysis of chemical samples due to their capacity for high throughput, high precision and high sensitivity [29]. An MS experiment involves the ionization and gas-phase analysis of molecules to generate data in the form of a mass spectrum, which is generally represented as an array of ion intensities versus mass-to-charge ratios (m/z), providing information regarding the molecular mass of the molecules in the sample.

Given the elemental composition of a molecule (e.g., C_x H_y O_z N_u Na), the mass spectrum for this molecule is simulated in two major steps: (1) calculate the probability of each isotopic configuration for this molecule via Multinomial distributions and (2) simulate the intensity by summing Normal distributions over all possible isotopic configurations, as shown in Figure 3.9.

For example, light water (1H2 16O), semi-heavy water (1H 2H 16O), heavy water (2H2 16O) and heavy-oxygen water (1H2 18O) are four of the six possible stable isotopic configurations for water, whose elemental composition is H2O. Now, suppose there are K atoms of a given element in a molecule. Each such atom may be one of l different isotopes, and therefore the element has a vector k = (k_1, ..., k_l) indicating the number of atoms for each of the l isotopes, where k_1 + ··· + k_l = K. Given a molecule with K atoms of a certain element (e.g., oxygen (O)), the probability that the molecule has k_j such atoms of isotope j (j = 1, ..., l) is determined by the probability mass function (pmf) of the Multinomial

Figure 3.9: Mass Spectrometry Model: elemental composition → isotopic distribution → simulated mass spectrum. Cartoon representation comes from the CFG glycan structure database [91].

distribution:

p(k) = p(k_1, ..., k_l) = ( K! / ∏_{j=1}^{l} k_j! ) × ∏_{j=1}^{l} p_j^{k_j}    (3.10)

where p_j is the relative abundance of each of the element's isotopes (e.g., for the element Hydrogen (H), the relative abundances for 1H and 2H are 0.99985 and 0.00015, respectively). As a molecule is made up of E elements, to determine the probability for all of the elements in the molecule, the product of E such pmfs is required. The exact molecular mass for this isotopic configuration (k_1, ..., k_E) can also be calculated.

p(k_1, ..., k_E) = ∏_{i=1}^{E} p(k_i)    (3.11)

m(k_1, ..., k_E) = ∑_{i=1}^{E} k_i · m_i    (3.12)

where i is the index of the i-th element and m_i is the mass vector for the l isotopes of element i (e.g., for the element Hydrogen (H), m = (1.007825, 2.014101), representing the isotopic masses for 1H and 2H, respectively). These probabilities and masses for various isotopic configurations correspond to the intensity and m/z readings obtained from the mass spectrometer, where z is the charge of the ion.

Because masses are clustered, and due to inherent properties of mass spectrometers, intensity peaks (see Figure 3.9) will typically have Gaussian shapes (Normal distribution). Each isotopic configuration affects the whole simulated mass spectrum through a Normal distribution; therefore, the actual intensity at each ζ = m/z in the mass spectrum is the sum of the intensities of all the possible isotopic configurations at that m/z. Given experimental data represented as an array of [m/z, intensity], a peak width (w) and a charge state (z), the intensity value contributed by the j-th isotopic configuration is simulated as a Normal distribution (for further details, see [148]).

f_j(ζ) = (1 / (σ √(2π))) × e^{−(ζ − ζ_j)² / (2σ²)}    (3.13)

Weighting this by the configuration probability (the pmf from Equation (3.11)), we obtain:

f_G(ζ) = ∑_j f_j(ζ) · p_j    (3.14)

where σ = 0.4247 × w [93]. In summary, the input parameters to a mass spectrometry model are listed as follows:

• Input Parameters: (1) molecular composition, (2) experimental spectrum.

• Control Parameters: (1) spectrum calibration, (2) low-level noise filtering. Both parameters are controlled by the mass spectrometer, which changes the intensity response of the experimental mass spectrum.

• Model Parameters: (1) peak width, (2) charge state, and (3) relative abundance vector for each structure.
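The two-step model (multinomial isotopic distribution, then a pmf-weighted sum of Gaussian peaks) can be sketched for a single element. In the Python fragment below, the 16O/18O abundances and masses are standard published values, 17O is dropped for brevity (so the probabilities are slightly approximate), and the peak width is an illustrative number:

```python
# Two-step mass spectrum sketch: (1) multinomial isotopic distribution
# (Eqs. 3.10-3.12), (2) spectrum as a pmf-weighted sum of Gaussian peaks
# (Eqs. 3.13-3.14). Single element (oxygen, K = 2) for brevity; 17O is
# dropped, so probabilities sum to slightly less than 1.
import math
from math import factorial

p_iso = [0.99757, 0.00205]        # relative abundances of 16O, 18O
m_iso = [15.994915, 17.999160]    # isotopic masses
K = 2                             # two oxygen atoms (as in O2)

def multinomial_pmf(k, p):        # Eq. 3.10
    coef = factorial(sum(k)) / math.prod(factorial(kj) for kj in k)
    return coef * math.prod(pj ** kj for pj, kj in zip(p, k))

configs = [(K - j, j) for j in range(K + 1)]                 # (n16, n18)
probs = [multinomial_pmf(k, p_iso) for k in configs]
masses = [sum(kj * mj for kj, mj in zip(k, m_iso)) for k in configs]  # Eq. 3.12

def spectrum(zeta, w=0.05):       # Eqs. 3.13-3.14, charge z = 1
    sigma = 0.4247 * w
    return sum(pj / (sigma * math.sqrt(2 * math.pi))
               * math.exp(-(zeta - mj) ** 2 / (2 * sigma ** 2))
               for pj, mj in zip(probs, masses))

print(round(sum(probs), 5), spectrum(masses[0]) > spectrum(masses[0] + 1.0))
```

For a full molecule, the configuration probabilities are products of such per-element pmfs (Eq. 3.11), and the model parameters (peak width, charge state, relative abundances) are the quantities that the calibration described below optimizes.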

All the inputs are real numbers. Constraints for the input parameters are in linear form with lower and upper bounds. The goal is to minimize the errors between the simulated and experimental spectra and to maximize the Pearson correlation coefficient R = √(R²), where the coefficient of determination R² is computed by comparing the normalized intensity of the simulated spectrum with the experimental one. Multinomial distributions and Normal distributions are used in the isotopic distribution calculation and the mass spectrum simulation, respectively; therefore, the overall objective function is in nonlinear form. To sum up, the problem of mass spectrometry model calibration belongs to linearly constrained, nonlinear programming.

Because it is a linearly constrained NLP problem with real parameters, heuristic methods (e.g., Genetic Algorithms) and gradient-based methods (e.g., steepest descent, conjugate gradient and quasi-Newton methods) are reasonable suggestions. Initially, a Genetic Algorithm was applied to this optimization problem. However, it failed to converge after running for a week due to a huge search space of real numbers. Then a steepest descent algorithm and the conjugate gradient method were tested. They can reach an optimal solution in some cases, but when the experimental data contain many noisy data points, they fail to produce acceptable results. Ultimately, the quasi-Newton method (L-BFGS [149]) was chosen with carefully selected initial parameter values, because it numerically estimates the gradient and Hessian matrix and proves more robust when processing noisy data.

Furthermore, as simulations and optimizations of different glycan structures do not interfere with each other, they are executed in parallel (16 Java threads) on a server equipped with 8 hyper-threaded cores and 32GB of memory. The running time was reduced from several days to less than three hours for five datasets containing more than 200 glycan structures.

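Since the per-structure calibrations are independent, the parallel execution pattern is simple; a stdlib Python sketch follows (the actual system used 16 Java threads, and the calibration body below is a placeholder for one L-BFGS run):

```python
# Independent per-structure optimizations run in parallel via a thread
# pool; calibrate() is a placeholder for one real calibration run.
from concurrent.futures import ThreadPoolExecutor

def calibrate(structure_id):
    # stand-in for one L-BFGS calibration of a glycan structure
    return structure_id, sum(i * i for i in range(1000))

structures = list(range(20))
with ThreadPoolExecutor(max_workers=16) as pool:
    results = dict(pool.map(calibrate, structures))

print(len(results) == len(structures))  # every structure processed
```

Because each task is dominated by its own objective-function evaluations and no state is shared, near-linear speed-up is possible up to the number of available hardware threads.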
3.6.3 Scenario 3: Metabolic Pathway Simulation

A metabolic pathway consists of a series of chemical reactions, each of which modifies the structure of a biomolecule. Figure 3.10 represents a portion of an O-linked glycan pathway (http://www.genome.jp/kegg-bin/show_pathway?mcc00512+698004) consisting of three substrates (molecules involved in enzyme catalyzed reactions), three enzymes, and two metabolic reactions. An enzyme catalyzes a reaction, which transforms substrates into products. One of the products becomes a substrate for the next reaction.


Figure 3.10: A Sample O-Glycan Metabolic Pathway. Substrates are glycans shown in graphical representation and enzymes are put above the arrows. CMP-Neu5Ac acts as a sugar donor to add one sugar residue (Neu5Ac in this case) to the glycan. Graphical representation follows specifications in [70].

Our objective is to build a simulation model which represents both the qualitative (structural/visual) and quantitative (mathematical) aspects of a metabolic pathway. Two models are supported in our work, namely, Petri Nets and systems of ordinary differential equations. Other simulation models have been proposed as well, such as π-Calculus and Markov chain

models. π-Calculus is utilized to model Systems Biology from the perspective of discrete-event simulation in [150, 151]. An analytical approach via continuous-time Markov chains for signal transduction networks is described in [152].

A Petri Net model can be used to provide visual and quantitative models. Pioneering qualitative analysis of biochemical reaction systems was explored in [153, 154]. An executable Petri Net model is proposed in [155]. Hybrid Petri nets and glycomics ontologies are integrated to simulate biochemical pathways in [156]. In our previous work, Hybrid Functional Petri Nets (HFPNs) [157] have been chosen to model biochemical pathways [43]. Several of the model's input parameters for the pathway, such as initial concentrations of substrates and enzymes, can be estimated from experimental data, but accurate reaction rate constants are more difficult to obtain. What makes matters worse is that the biological system is very complex, considering all of the possible aspects (e.g., gene expression level, enzyme concentration, compartmentalization, etc.) that affect the metabolic pathway reactions. These parameters are very difficult to measure in vivo, although methods to do so are being intensively investigated. Systems of differential equations have been used to develop biochemical pathway models [158, 159]. However, much of the previous work focused on theoretical simulation, and the results are sometimes not sufficiently verified with real-world biological experimental data.

Enzymatic reactions can be divided into single-substrate and multiple-substrate reactions. Enzyme kinetics in single-substrate reactions is discussed in [156] for pathway simulation, where Michaelis-Menten kinetics, which makes use of the kinetic constants k_cat and K_m, is discussed in detail. Multiple-substrate reactions can proceed by different mechanisms (such as Random Bi Bi, Ordered Bi Bi, Ordered Ping Pong) [160].
Each of these mechanisms is described by a different set of rate equations, which are more complex than those for single-substrate reactions. The two reactions in Figure 3.10 are glycosyltransferase reactions involving two substrates and can proceed by the Ordered Bi Bi mechanism, which is typical for many glycosyltransferases [161]. Some of the differential equations describing a pathway composed of the reactions in Figure 3.10 that use this mechanism are listed in Equation (3.15).

$$
\begin{aligned}
\dot{[X_1]} &= -\,k_{\mathrm{cat}_1}\,[E_1]\,\frac{[X_1][X_4]}{K_{X_1}K_{X_4} + K_{X_4}[X_1] + [X_1][X_4]}\\[4pt]
\dot{[X_2]} &= k_{\mathrm{cat}_1}\,[E_1]\,\frac{[X_1][X_4]}{K_{X_1}K_{X_4} + K_{X_4}[X_1] + [X_1][X_4]} \;-\; k_{\mathrm{cat}_{23}}\,[E_{23}]\,\frac{[X_2][X_4]}{K_{X_2}K_{X_4} + K_{X_4}[X_2] + [X_2][X_4]}\\[4pt]
\dot{[X_3]} &= k_{\mathrm{cat}_{23}}\,[E_{23}]\,\frac{[X_2][X_4]}{K_{X_2}K_{X_4} + K_{X_4}[X_2] + [X_2][X_4]}
\end{aligned}
\tag{3.15}
$$

where X1 is Gal1GalNAc1, X2 is NeuAc1Gal1GalNAc1, X3 is NeuAc2Gal1GalNAc1, X4 is CMP-Neu5Ac, X5 is CMP, E1 is St3Gal1, E2 is St6GalNAc2 and E3 is St6GalNAc1. Note that kcat23 represents an effective rate constant based upon kcat2 and kcat3, and [E23] represents an effective aggregate enzyme concentration of [E2] and [E3].

Simulation of dynamic systems (metabolism) described by such rate equations requires estimates of the kinetic constants, as precise measurement of all of these rate constants is very difficult using current technologies. Simplification of the rate equations may be necessary, because it will be very difficult to reliably find or numerically estimate all of the relevant kinetic constants. In order to estimate reaction rate constants as accurately as possible, we will use simulation optimization to calibrate our model parameters and optimize these constants. The biomolecule and enzyme concentrations for the pathway can be estimated from experimental data, while the simulated results will be produced using systems of differential equations or an HFPN model. The output of the simulation will consist of the biomolecule concentrations produced by each of the model's two reactions at various time points over a certain time period. Our optimization will use experimental results along with simulated results in order to develop accurate estimates of the kcat and KX constants for the pathway's reaction rates. The experimental and simulated results will be used as input for a least squares model that will quantify the difference between the two. The output of the least squares model, along with the experimental results, will be used by an optimizer to adjust the kcat and KX constants, thus modifying the reaction rate constants, which will be used in the next set of simulation runs.
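The calibration loop just described can be sketched in a few lines. The following is a minimal illustration in Python rather than the JSIM/ScalaTion implementation: the rate terms follow Equation (3.15), but the balance for X4 (CMP-Neu5Ac, consumed by both reactions) is an assumption added here to close the system, and the parameter values, function names and simple forward-Euler integrator are all illustrative.

```python
import numpy as np

def rates(x, p):
    """Right-hand side of the Ordered Bi Bi system in Equation (3.15)."""
    x1, x2, x3, x4 = x
    kcat1, kcat23, kx1, kx2, kx4, e1, e23 = p
    v1 = kcat1 * e1 * x1 * x4 / (kx1 * kx4 + kx4 * x1 + x1 * x4)
    v2 = kcat23 * e23 * x2 * x4 / (kx2 * kx4 + kx4 * x2 + x2 * x4)
    # Assumed closure: the donor X4 is consumed by both reactions.
    return np.array([-v1, v1 - v2, v2, -(v1 + v2)])

def simulate(p, x0, t_end=10.0, dt=0.01):
    """Forward-Euler integration; returns sampled concentrations over time."""
    x, traj = np.array(x0, dtype=float), []
    for _ in range(int(t_end / dt)):
        x = np.maximum(x + dt * rates(x, p), 0.0)  # keep concentrations nonnegative
        traj.append(x.copy())
    return np.array(traj)

def sse(p, x0, observed, idx):
    """Least squares objective comparing simulated and experimental concentrations."""
    traj = simulate(p, x0)
    return float(np.sum((traj[idx] - observed) ** 2))

# A derivative-free optimizer (e.g., Nelder-Mead [61]) would repeatedly adjust
# p = (kcat1, kcat23, KX1, KX2, KX4, [E1], [E23]) to minimize sse(p, ...),
# rerunning the simulation with the updated constants at each iteration.
```

In our setting, the simulation step would be carried out by the ODE or HFPN model and the optimizer selected with the help of SoPT, but the structure of the loop is the same.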

3.7 Conclusions and Future Work

SEESO is being developed in order to make it easier and more efficient to carry out simulation optimization. The focus at this point is on developing the software base, including JSIM, ScalaTion and DeMOforge, and establishing the relationships between conceptual models, ontological models and executable code. The use of this software is guided by the DeMO and SoPT ontologies, with the former facilitating the creation of simulation models and the latter guiding users carrying out simulation optimization. From our preliminary evaluation based on the three case studies, we can conclude:

• Quantitative glycomics needs simulation optimization.

• Integration of ontologies and DSLs can facilitate modeling, simulation and the application of simulation optimization for domain modelers.

• Given a specific problem, different algorithms can be evaluated, facilitating comparison of and search among various algorithms.

• A rule-based inference system can effectively improve the selection of optimization algorithms.

In our future work, we plan to support global meta-modeling and Kriging methods, refine the SoPT rule knowledge repository, and quantitatively evaluate several available modeling and optimization techniques to be used individually and in combination. We also plan to investigate more fully Automatic Algorithm Configuration [162, 163], an active research area focusing on tuning algorithm configurations automatically. In the ScalaTion project, some preliminary work has started on designing a small DSL to automatically explore the search space and determine the surface shape of the problem of interest, so that parameters in the chosen optimization algorithm can be configured according to the estimated surface shape. Efficient allocation of distributed resources and exploitation of parallelism will be addressed in more detail as we scale up our distributed computational resources.

CHAPTER 4

CONCLUSIONS

Glycomics, as a fast-growing and interdisciplinary research area, is attracting more and more attention from Biology, Chemistry, Computer Science and Statistics. Due to its vast complexity, Modeling & Simulation approaches are often utilized to tackle the problems in Glycomics. The rapid advent of experimental methods (e.g., innovative isotopic labeling strategies) requires new quantitative software and analytical methodologies. Therefore, it is urgent to develop quantitative software tools capable of processing and interpreting raw MS data generated by these high-throughput instruments.

An automated simulation software tool, GlycoQuant, has been developed for quantitative glycomics. It is the first quantitative software tool for isotopic labeling strategies in quantitative glycomics. Evaluation of GlycoQuant for processing IDAWG™ data has shown that it achieves high accuracy in detecting and quantifying mass spectral patterns even in cases where high levels of signal noise are present. Furthermore, data analysis using GlycoQuant is fully automated and requires little human intervention. With extensive use of Web 2.0 technologies, GlycoQuant provides scientists with the capability to compare the theoretical simulated and experimental spectra side by side in the browser. Additional information about GlycoQuant is available at http://glycomics.ccrc.uga.edu/idawg/.

In order to better promote simulation optimization in the area of Modeling & Simulation, ontologies, rules and domain-specific languages can be utilized to bridge the gap between numerous optimization algorithms and practical problems. For this purpose, SEESO is being developed to make it easier and more efficient for domain modelers to carry out simulation optimization. Based on existing projects (JSIM, DeMO, DeMOforge and ScalaTion) and the ongoing development of the SoPT ontology and small DSLs (e.g., SimOptDSL), SEESO aims at establishing the relationships among conceptual models, ontological models and executable code. Evaluations on three case studies (Urgent Care Facility, Mass Spectrometry and Metabolic Pathway) show that integration of ontology and DSL can facilitate modeling, simulation and application of simulation optimization for domain modelers. Especially in optimization algorithm selection, a family of optimization algorithms can be chosen for a specific simulation problem based on (1) inference over general ontological rules and (2) performance metrics for optimization algorithms applied to related problems.

REFERENCES

[1] R. Wu, X. Zhao, Z. Wang, M. Zhou, and Q. Chen, “Novel molecular events in oral carcinogenesis via integrative approaches,” Journal of Dental Research, vol. 90, no. 5, pp. 561–572, 2011.

[2] H. Pearson, “Genetics: What is a gene?” Nature, vol. 441, no. 7092, pp. 398–401, 2006.

[3] M. B. Gerstein, C. Bruce, J. S. Rozowsky, D. Zheng, J. Du, J. O. Korbel, O. Emanuels- son, Z. D. Zhang, S. Weissman, and M. Snyder, “What is a gene, post-ENCODE? History and updated definition.” Genome Research, vol. 17, no. 6, pp. 669–681, 2007.

[4] F. H. C. Crick, “Central Dogma of Molecular Biology.” Nature, vol. 227, no. 5258, pp. 561–563, 1970.

[5] E. Pauwels, E. Sturm, E. Bombardieri, F. Cleton, and M. Stokkel, “Positron-emission tomography with [18F]fluorodeoxyglucose. Part I. Biochemical uptake mechanism and its implication for clinical studies.” J Cancer Res Clin Oncol., vol. 126, no. 10, pp. 549–559, 2000.

[6] C. Pál, B. Papp, and M. J. Lercher, “Adaptive evolution of bacterial metabolic networks by horizontal gene transfer.” Nature Genetics, vol. 37, no. 12, pp. 1372–1375, 2005.

[7] A. Trewavas, “A Brief History of Systems Biology,” The Plant Cell, vol. 18, pp. 2420–2430, 2006.

[8] L. v. Bertalanffy, “An Outline of General System Theory.” The British Journal for the Philosophy of Science, vol. 1, no. 2, pp. 134–165, 1950.

[9] K. E. Boulding, “General Systems Theory-The Skeleton of Science.” Management Sci- ence, vol. 2, no. 3, pp. 197–208, 1956.

[10] M. D. Mesarović, Ed., Systems Theory and Biology. Cleveland, Ohio: Springer Verlag, 1968.

[11] J. A. Papin, N. D. Price, S. J. Wiback, D. A. Fell, and B. O. Palsson, “Metabolic pathways in the post-genome era.” Trends Biochem Sci, vol. 28, no. 5, pp. 250–258, 2003.

[12] M. M. Babu, N. M. Luscombe, L. Aravind, M. Gerstein, and S. A. Teichmann, “Struc- ture and evolution of transcriptional regulatory networks.” Curr Opin Struct Biol, vol. 14, no. 3, pp. 283–291, 2004.

[13] D. J. Lockhart and E. A. Winzeler, “Genomics, gene expression and DNA arrays.” Nature, vol. 405, no. 6788, pp. 827–836, 2000.

[14] C. A. Heid, J. Stevens, K. J. Livak, and P. M. Williams, “Real Time Quantitative PCR,” Genome Research, vol. 6, no. 10, pp. 986–994, 1996.

[15] P. Liang and A. B. Pardee, “Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction.” Science, vol. 257, no. 5072, pp. 967–971, 1992.

[16] D. Gilbert, H. Fuss, X. Gu, R. Orton, S. Robinson, V. Vyshemirsky, M. J. Kurth, C. S. Downes, and W. Dubitzky, “Computational methodologies for modelling, analysis and simulation of signalling networks.” Brief Bioinform, vol. 7, no. 4, pp. 339–353, 2006.

[17] R. Horton, L. A. Moran, G. Scrimgeour, M. Perry, and D. Rawn, Principles of Biochemistry, 4th ed. Prentice Hall, 2005.

[18] J. G. Voet and D. Voet, Biochemistry., 2nd ed. John Wiley & Sons Inc, 1997.

[19] C. H. Schilling, S. Schuster, B. O. Palsson, and R. Heinrich, “Metabolic pathway anal- ysis: basic concepts and scientific applications in the post-genomic era.” Biotechnology progress, vol. 15, no. 3, pp. 296–303, 1999.

[20] H. Kitano, “Systems Biology: A Brief Overview.” Science (New York, N.Y.), vol. 295, no. 5560, pp. 1662–1664, 2002.

[21] T. Ideker, V. Thorsson, J. A. Ranish, R. Christmas, J. Buhler, J. K. Eng, R. Bumgar- ner, D. R. Goodlett, R. Aebersold, and L. Hood, “Integrated Genomic and Proteomic Analyses of a Systematically Perturbed Metabolic Network.” Science, vol. 292, no. 5518, pp. 929–934, 2001.

[22] J. M. Lee, E. P. Gianchandani, J. A. Eddy, and J. A. Papin, “Dynamic Analysis of Integrated Signaling, Metabolic, and Regulatory Networks.” PLoS Comput Biol, vol. 4, no. 5, p. e1000086, 2008.

[23] C.-H. Yeang, “Integration of Metabolic Reactions and Gene Regulation.” in Plant Systems Biology, ser. Methods in Molecular Biology. Humana Press, 2009, vol. 533, ch. 13, pp. 265–285.

[24] Z. Shriver, S. Raguram, and R. Sasisekharan, “Glycomics: a pathway to a class of new and improved therapeutics,” Nature Reviews Drug Discovery, vol. 3, no. 10, pp. 863–873, 2004.

[25] R. Raman, S. Raguram, G. Venkataraman, J. C. Paulson, and R. Sasisekharan, “Glycomics: an integrated systems approach to structure-function relationships of glycans,” Nature Methods, vol. 2, no. 11, pp. 817–824, 2005.

[26] W. Morelle and J.-C. Michalski, “Glycomics and Mass Spectrometry,” Current Pharmaceutical Design, vol. 11, no. 20, pp. 2615–2645(31), 2005.

[27] J. Zaia, “Mass spectrometry of oligosaccharides,” Mass Spectrometry Reviews, vol. 23, no. 3, pp. 161–227, May/June 2004.

[28] ——, “Mass Spectrometry and the Emerging Field of Glycomics,” Chemistry & Biology, vol. 15, no. 9, pp. 881–892, 2008.

[29] ——, “Mass Spectrometry and Glycomics,” OMICS: A Journal of Integrative Biology, vol. 14, no. 4, pp. 401–418, 2010.

[30] N. Taniguchi, “Human Disease Glycomics/Proteome Initiative (HGPI).” Molecular & Cellular Proteomics, vol. 7, pp. 626–627, 2008.

[31] E. de Hoffmann and V. Stroobant, Mass Spectrometry: Principles and Applications, 3rd ed. Wiley, 2007.

[32] R. H. Perry, R. G. Cooks, and R. J. Noll, “Orbitrap mass spectrometry: Instrumentation, ion motion and applications,” Mass Spectrometry Reviews, vol. 27, no. 6, pp. 661–699, 2008.

[33] F. Forner, L. J. Foster, and S. Toppo, “Mass Spectrometry Data Analysis in the Proteomics Era,” Current Bioinformatics, vol. 2, pp. 63–93(31), 2007.

[34] M. Bantscheff, M. Schirle, G. Sweetman, J. Rick, and B. Kuster, “Quantitative mass spectrometry in proteomics: a critical review.” Analytical and bioanalytical chemistry, vol. 389, no. 4, pp. 1017–1031, 2007.

[35] R. Orlando, “Quantitative Glycomics,” in Functional Glycomics, J. Li, Ed. Humana Press, 2010, pp. 31–49.

[36] P. H. Raven and G. B. Johnson, Biology, 6th ed. McGraw-Hill Science/Engineering/Math, 2001.

[37] R. Orlando, J.-M. Lim, J. A. Atwood, P. M. Angel, M. Fang, K. Aoki, G. Alvarez-Manilla, K. W. Moremen, W. S. York, M. Tiemeyer, M. Pierce, S. Dalton, and L. Wells, “IDAWG: Metabolic incorporation of stable isotope labels for quantitative glycomics of cultured cells,” Journal of Proteome Research, vol. 8, no. 8, pp. 3816–3823, 2009.

[38] M. Fang, J.-M. Lim, and L. Wells, “Quantitative glycomics of cultured cells using isotopic detection of aminosugars with glutamine (IDAWG),” Current Protocols in Chemical Biology, vol. 2, pp. 55–69, 2010.

[39] M. Fang, “Applications of the IDAWG technique to quantitative glycomics of human embryonic stem cells,” Ph.D. dissertation, University of Georgia, Athens, GA, 2011.

[40] J. Banks, J. S. Carson, B. L. Nelson, and D. M. Nicol, Discrete-Event System Simula- tion, 5th ed. Prentice Hall, 2009.

[41] D. L. McGuinness and F. van Harmelen, “OWL Web Ontology Language Overview,” http://www.w3.org/TR/owl-features/ [online Feb 10 2004], 2004.

[42] G. A. Silver, K. R. Bellipady, J. A. Miller, W. S. York, and K. J. Kochut, “Supporting Interoperability Using the Discrete-event Modeling Ontology (DeMO),” in WSC ’09: Proceedings of the 41st Conference on Winter Simulation. IEEE, Piscataway, NJ, 2009, pp. 1399–1410.

[43] G. A. Silver, J. A. Miller, M. Hybinette, G. Baramidze, and W. S. York, “DeMO: An Ontology for Discrete-event Modeling and Simulation,” SIMULATION: Transactions of The Society for Modeling and Simulation International, vol. 87, no. 9, pp. 747–773, 2011.

[44] J. Han, J. A. Miller, and G. A. Silver, “SoPT: Ontology for Simulation Optimization for Scientific Experiments,” in Proceedings of the 2011 Winter Simulation Conference, S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2011.

[45] J. A. Miller, J. Han, and M. Hybinette, “Using Domain Specific Languages for modeling and simulation: ScalaTion as a case study,” in Proceedings of the 2010 Winter Simulation Conference, B. Johansson, S. Jain, J. Montoya-Torres, J. Hugan, and E. Yücesan, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2010, pp. 741–752.

[46] M. Better, F. Glover, G. Kochenberger, and H. Wang, “Simulation Optimization: Applications in Risk Management,” International Journal of Information Technology & Decision Making, vol. 7, no. 4, pp. 571–581, 2008.

[47] M. C. Fu, “Simulation Optimization,” in Proceedings of the 2001 Winter Simulation Conference, B. A. Peters, J. S. Smith, D. J. Medeiros, and M. W. Rohrer, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2001, pp. 53–61.

[48] M. C. Fu, S. Andradóttir, J. S. Carson, F. Glover, C. R. Harrell, Y.-C. Ho, J. P. Kelly, and S. M. Robinson, “Integrating Optimization and Simulation: Research and Practice,” in Proceedings of the 2000 Winter Simulation Conference, J. A. Joines, R. R. Barton, K. Kang, and P. A. Fishwick, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2000, pp. 610–616.

[49] S. Andradóttir, “A Review of Simulation Optimization Techniques,” in Proceedings of the 1998 Winter Simulation Conference, D. J. Medeiros, E. F. Watson, J. S. Carson, and M. S. Manivannan, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 1998, pp. 151–158.

[50] ——, Simulation Optimization, ser. Handbook of Simulation: Principles, Methodology, Advances, Applications, and Practice. John Wiley & Sons, Inc., 1998, ch. 9, pp. 307–333.

[51] J. R. Swisher, P. D. Hyden, S. H. Jacobson, and L. W. Schruben, “A Survey of Simulation Optimization Techniques and Procedures,” in Proceedings of the 2000 Winter Simulation Conference, J. A. Joines, R. R. Barton, K. Kang, and P. A. Fishwick, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2000, pp. 119–128.

[52] S. Ólafsson and J. Kim, “Simulation optimization,” in Proceedings of the 2002 Winter Simulation Conference, E. Yücesan, C. H. Chen, J. L. Snowdon, and J. M. Charnes, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2002, pp. 79–84.

[53] J. April, F. Glover, J. P. Kelly, and M. Laguna, “Practical Introduction to Simulation Optimization,” in Proceedings of the 2003 Winter Simulation Conference, S. Chick, P. J. Sánchez, D. Ferrin, and D. J. Morrice, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2003, pp. 71–78.

[54] M. C. Fu, F. W. Glover, and J. April, “Simulation Optimization: a Review, New Developments, and Applications,” in Proceedings of the 2005 Winter Simulation Conference, M. E. Kuhl, N. M. Steiger, F. B. Armstrong, and J. A. Joines, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2005, pp. 83–95.

[55] M. C. Fu, C.-H. Chen, and L. Shi, “Some Topics for Simulation Optimization,” in Proceedings of the 2008 Winter Simulation Conference, S. J. Mason, R. R. Hill, L. Moench, O. Rose, T. Jefferson, and J. W. Fowler, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2008, pp. 27–38.

[56] S. S. Rao, Engineering Optimization: Theory and Practice, 4th ed. New York: Wiley, 2009.

[57] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[58] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed. New York, USA: Springer, 2006.

[59] H. Robbins and S. Monro, “A stochastic approximation method,” Ann. Math. Statist., vol. 22, no. 3, pp. 400–407, 1951.

[60] S. Kim, “Gradient-based simulation optimization,” in Proceedings of the 2006 Winter Simulation Conference, L. F. Perrone, F. P. Wieland, J. Liu, B. G. Lawson, D. M. Nicol, and R. M. Fujimoto, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2006, pp. 159–167.

[61] J. A. Nelder and R. Mead, “A Simplex Method for Function Minimization,” The Computer Journal, vol. 7, no. 4, pp. 308–313, 1965.

[62] R. H. Myers, D. C. Montgomery, and C. M. Anderson-Cook, Response Surface Methodology: Process and Product Optimization Using Designed Experiments, 3rd ed. New York: Wiley, 2009.

[63] R. E. Baysal, B. L. Nelson, and J. Staum, “Response surface methodology for simulating hedging and trading strategies,” in WSC ’08: Proceedings of the 40th Conference on Winter Simulation. Winter Simulation Conference, 2008, pp. 629–637.

[64] T. Hou, J. Wang, L. Chen, and X. Xu, “Automated docking of peptides and proteins by using a genetic algorithm combined with a tabu search,” Protein Engineering, vol. 12, no. 8, pp. 639–648, 1999.

[65] J. Pichitlamken and B. L. Nelson, “A combined procedure for optimization via simulation,” in WSC ’02: Proceedings of the 34th Conference on Winter Simulation. Winter Simulation Conference, 2002, pp. 292–300.

[66] S. Brimble, E. E. Wollaston-Hayden, C. F. Teo, A. C. Morris, and L. Wells, “The role of the O-GlcNAc modification in regulating eukaryotic gene expression,” Current Signal Transduction Therapy, vol. 5, pp. 12–24(13), 2010.

[67] R. Kleene and M. Schachner, “Glycans and neural cell interactions,” Nature reviews Neuroscience, vol. 5, no. 3, pp. 195–208, March 2004.

[68] C. A. Cooper, E. Gasteiger, and N. H. Packer, “GlycoMod–a software tool for determining glycosylation compositions from mass spectrometric data.” Proteomics, vol. 1, no. 2, pp. 340–349, February 2001.

[69] C. A. Cooper, H. J. Joshi, M. J. Harrison, M. R. Wilkins, and N. H. Packer, “GlycoSuiteDB: a curated relational database of glycoprotein glycan structures and their biological sources. 2003 update,” Nucleic Acids Research, vol. 31, no. 1, pp. 511–513, 2003.

[70] D. Goldberg, M. Sutton-Smith, J. Paulson, and A. Dell, “Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra,” PROTEOMICS, vol. 5, no. 4, pp. 865–875, 2005.

[71] E. P. Go, K. R. Rebecchi, D. S. Dalpathado, M. L. Bandu, Y. Zhang, and H. Desaire, “GlycoPep DB: A tool for glycopeptide analysis using a ‘smart search’,” Analytical Chemistry, vol. 79, no. 4, pp. 1708–1713, 2007.

[72] K. Maass, R. Ranzinger, H. Geyer, C.-W. von der Lieth, and R. Geyer, “‘Glyco-Peakfinder’ – de novo composition analysis of glycoconjugates,” Proteomics, vol. 7, no. 24, pp. 4435–4444, 2007.

[73] A. Ceroni, K. Maass, H. Geyer, R. Geyer, A. Dell, and S. M. Haslam, “GlycoWorkbench: A Tool for the Computer-Assisted Annotation of Mass Spectra of Glycans,” Journal of Proteome Research, vol. 7, no. 4, pp. 1650–1659, 2008.

[74] A. Apte and N. S. Meitei, “Bioinformatics in Glycomics: Glycan Characterization with Mass Spectrometric Data Using SimGlycan™,” in Functional Glycomics, J. Li, Ed. Totowa, NJ: Humana Press, 2010, vol. 600, ch. 19, pp. 269–281.

[75] M. H. Elliott, D. S. Smith, C. E. Parker, and C. Borchers, “Current trends in quantitative proteomics,” Journal of Mass Spectrometry, vol. 44, no. 12, pp. 1637–1660, 2009.

[76] W. M. Old, K. Meyer-Arendt, L. Aveline-Wolf, K. G. Pierce, A. Mendoza, J. R. Sevinsky, K. A. Resing, and N. G. Ahn, “Comparison of Label-free Methods for Quantifying Human Proteins by Shotgun Proteomics,” Molecular Cellular Proteomics, vol. 4, no. 10, pp. 1487–1502, 2005.

[77] Y. Wada, P. Azadi, C. E. Costello, A. Dell, R. A. Dwek, H. Geyer, R. Geyer, K. Kakehi, N. G. Karlsson, K. Kato, N. Kawasaki, K.-H. Khoo, S. Kim, A. Kondo, E. Lattova, Y. Mechref, E. Miyoshi, K. Nakamura, H. Narimatsu, M. V. Novotny, N. H. Packer, H. Perreault, J. Peter-Katalinić, G. Pohlentz, V. N. Reinhold, P. M. Rudd, A. Suzuki, and N. Taniguchi, “Comparison of the methods for profiling glycoprotein glycans - HUPO Human Disease Glycomics/Proteome Initiative multi-institutional study,” Glycobiology, vol. 17, no. 4, pp. 411–422, 2007.

[78] K. R. Rebecchi, J. L. Wenke, E. P. Go, and H. Desaire, “Label-free quantitation: A new glycoproteomics approach,” Journal of the American Society for Mass Spectrometry, vol. 20, no. 6, pp. 1048–1059, 2009.

[79] S.-E. Ong, B. Blagoev, I. Kratchmarova, D. B. Kristensen, H. Steen, A. Pandey, and M. Mann, “Stable Isotope Labeling by Amino Acids in Cell Culture, SILAC, as a Simple and Accurate Approach to Expression Proteomics,” Molecular & Cellular Proteomics, vol. 1, no. 5, pp. 376–386, 2002.

[80] S.-E. Ong and M. Mann, “A practical recipe for stable isotope labeling by amino acids in cell culture (SILAC),” Nature Protocols, vol. 1, no. 6, pp. 2650–2660, January 2007.

[81] P. L. Ross, Y. N. Huang, J. N. Marchese, B. Williamson, K. Parker, S. Hattan, N. Khainovski, S. Pillai, S. Dey, S. Daniels, S. Purkayastha, P. Juhasz, S. Martin, M. Bartlet-Jones, F. He, A. Jacobson, and D. J. Pappin, “Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents,” Molecular & Cellular Proteomics, vol. 3, no. 12, pp. 1154–1169, 2004.

[82] J. A. Atwood, L. Cheng, G. Alvarez-Manilla, N. L. Warren, W. S. York, and R. Orlando, “Quantitation by isobaric labeling: Applications to glycomics,” Journal of Proteome Research, vol. 7, no. 1, pp. 367–374, 2008.

[83] A. Panchaud, M. Affolter, P. Moreillon, and M. Kussmann, “Experimental and computational approaches to quantitative proteomics: Status quo and outlook,” Journal of Proteomics, vol. 71, no. 1, pp. 19–33, 2008.

[84] P. Mortensen, J. W. Gouw, J. V. Olsen, S.-E. Ong, K. T. G. Rigbolt, J. Bunkenborg, J. Cox, L. J. Foster, A. J. R. Heck, B. Blagoev, J. S. Andersen, and M. Mann, “MSQuant, an open source platform for mass spectrometry-based quantitative proteomics,” Journal of Proteome Research, vol. 9, no. 1, pp. 393–403, 2010.

[85] M. J. MacCoss, C. C. Wu, H. Liu, R. Sadygov, and J. R. Yates, “A correlation algorithm for the automated quantitative analysis of shotgun proteomics data,” Analytical Chemistry, vol. 75, no. 24, pp. 6912–6921, 2003.

[86] D. N. Perkins, D. J. C. Pappin, D. M. Creasy, and J. S. Cottrell, “Probability-based protein identification by searching sequence databases using mass spectrometry data,” ELECTROPHORESIS, vol. 20, no. 18, pp. 3551–3567, 1999.

[87] D. K. Han, J. Eng, H. Zhou, and R. Aebersold, “Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry,” Nature Biotech, vol. 19, no. 10, pp. 946–951, 2001.

[88] X.-j. Li, H. Zhang, J. A. Ranish, and R. Aebersold, “Automated statistical analysis of protein abundance ratios from data generated by stable-isotope dilution and tandem mass spectrometry,” Analytical Chemistry, vol. 75, no. 23, pp. 6648–6657, 2003.

[89] P. G. A. Pedrioli, “Trans-proteomic pipeline: A pipeline for proteomic analysis,” in Proteome Bioinformatics, ser. Methods in Molecular Biology™, J. M. Walker, S. J. Hubbard, and A. R. Jones, Eds. Humana Press, 2010, vol. 604, pp. 213–238.

[90] M. W. Senko, S. C. Beu, and F. W. McLafferty, “Determination of monoisotopic masses and ion populations for large biomolecules from resolved isotopic distributions,” Journal of the American Society for Mass Spectrometry, vol. 6, no. 4, pp. 229–233, 1995.

[91] M. K. Hellerstein and R. A. Neese, “Mass isotopomer distribution analysis at eight years: theoretical, analytic, and experimental considerations,” American Journal of Physiology - Endocrinology And Metabolism, vol. 276, no. 6, pp. E1146–E1170, 1999.

[92] S. Y. Vakhrushev, D. Dadimov, and J. Peter-Katalinić, “Software platform for high-throughput glycomics.” Analytical Chemistry, vol. 81, no. 9, pp. 3252–3260, 2009.

[93] J. Inczédy, T. Lengyel, and M. A. Ure, “Compendium of analytical nomenclature: definitive rules 1997,” Available via http://old.iupac.org/publications/analytical compendium/TOC cha9.html [online 14 August 2002], 1998, section 9.2.3.3.

[94] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, “A limited memory algorithm for bound constrained optimization,” SIAM Journal on Scientific Computing, vol. 16, no. 5, pp. 1190–1208, September 1995.

[95] M. Fang, M. Kulik, J. Han, S. Brimble, J.-M. Lim, S. Dalton, J. A. Miller, W. S. York, and L. Wells, “Assessing the dynamics of individual glycans released from human ES cells using the IDAWG™ technique reveals the remodeling of sialylated glycan structures,” Journal of Proteome Research, 2012, to be submitted.

[96] W. X. Schulze and M. Mann, “A novel proteomic screen for peptide-protein interactions,” Journal of Biological Chemistry, vol. 279, no. 11, pp. 10 756–10 764, 2004.

[97] R. S. Nair, J. A. Miller, and Z. Zhang, “Java-based query driven simulation environment,” in Proceedings of the 1996 Winter Simulation Conference, J. M. Charnes, D. J. Morrice, D. T. Brunner, and J. J. Swain, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 1996, pp. 786–793.

[98] J. A. Miller, G. T. Baramidze, A. P. Sheth, and P. A. Fishwick, “Investigating Ontologies for Simulation Modeling,” in Proceedings of the 37th Annual Simulation Symposium, ser. ANSS ’04, H. Karatza, Ed. Washington, DC, USA: IEEE Computer Society, 2004, pp. 55–63.

[99] M. Odersky, P. Altherr, V. Cremet, I. Dragos, G. Dubochet, B. Emir, S. McDirmid, S. Micheloud, N. Mihaylov, M. Schinz, E. Stenman, L. Spoon, and M. Zenger, “An Overview of the Scala Programming Language,” École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, Tech. Rep. IC/2004/64, 2004.

[100] M. Kifer and H. Boley, “RIF Overview,” http://www.w3.org/TR/2010/NOTE-rif-overview-20100622/ [online June 2010], 2010.

[101] A. Marte, “RIF4J - a reasoning engine for RIF-BLD,” Master’s thesis, Semantic Technologies Institute (STI), Innsbruck, 2011.

[102] M. Meketon, “Optimization in Simulation: A Survey of Recent Results,” in Proceedings of the 1987 Winter Simulation Conference, H. Grant, W. D. Kelton, and A. Thesen, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 1987, pp. 58–68.

[103] M. C. Fu, “Optimization via simulation: A review,” Annals of Operations Research, vol. 53, no. 1, pp. 199–247, 1994.

[104] ——, “Optimization for simulation: Theory vs. Practice,” INFORMS Journal on Computing, vol. 14, no. 3, pp. 192–215, 2002.

[105] E. Angün, J. P. C. Kleijnen, D. D. Hertog, and G. Gürkan, “Recent advances in simulation optimization: response surface methodology revisited,” in Proceedings of the 2002 Winter Simulation Conference, E. Yücesan, C.-H. Chen, J. L. Snowdon, and J. M. Charnes, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2002, pp. 377–383.

[106] R. R. Barton, “Simulation Optimization Using Metamodels,” in Proceedings of the 2009 Winter Simulation Conference, M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, and R. G. Ingalls, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2009, pp. 230–238.

[107] S. Kulturel-Konak and A. Konak, “Simulation optimization embedded particle swarm optimization for Reliable Server Assignment,” in Proceedings of the 2010 Winter Simulation Conference, B. Johansson, S. Jain, J. Montoya-Torres, J. Hugan, and E. Yücesan, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2010, pp. 2897–2906.

[108] I. Vitanov, V. Vitanov, and D. Harrison, “Buffer capacity allocation using ant colony optimisation algorithm,” in Proceedings of the 2009 Winter Simulation Conference, M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, and R. G. Ingalls, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2009, pp. 3158–3168.

[109] R. Sprenger and L. Mönch, “An Ant Colony optimization approach to solve cooperative transportation planning problems,” in Proceedings of the 2009 Winter Simulation Conference, M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, and R. G. Ingalls, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2009, pp. 2488–2495.

[110] G. Gürkan, A. Özge, and T. Robinson, “Sample-path Optimization in Simulation,” in Proceedings of the 1994 Winter Simulation Conference, D. A. Sadowski, A. F. Seila, J. D. Tew, and S. Manivannan, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 1994, pp. 247–254.

[111] G. Gürkan, A. Y. Özge, and S. M. Robinson, “Sample-path solution of stochastic variational inequalities, with applications to option pricing,” in Proceedings of the 1996 Winter Simulation Conference, J. M. Charnes, D. J. Morrice, D. T. Brunner, and J. J. Swain, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 1996, pp. 337–344.

[112] M. C. Ferris, T. S. Munson, and K. Sinapiromsaran, “A practical approach to sample-path simulation optimization,” in Proceedings of the 2000 Winter Simulation Conference, J. A. Joines, R. R. Barton, K. Kang, and P. A. Fishwick, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2000, pp. 795–804.

[113] A. J. Kleywegt, A. Shapiro, and T. Homem-de Mello, “The Sample Average Approximation Method for Stochastic Discrete Optimization,” SIAM Journal on Optimization, vol. 12, no. 2, pp. 479–502, February 2002.

[114] A. M. Law and W. D. Kelton, Simulation Modeling and Analysis, 3rd ed. The McGraw-Hill Companies, 2000.

[115] J. Banks, J. S. Carson II, B. L. Nelson, and D. M. Nicol, Discrete-event system simulation, 3rd ed., ser. Prentice-Hall international series in industrial and systems engineering. Prentice Hall, 2000.

[116] A. M. Law, Simulation Modeling and Analysis, 4th ed. The McGraw-Hill Companies, 2007.

[117] Y. M. Teo and C. Szabo, “CODES: An Integrated Approach to Composable Modeling and Simulation,” in Proceedings of the 41st Annual Simulation Symposium, H. Karatza, Ed. Washington, DC, USA: IEEE Computer Society, 2008, pp. 103–110.

[118] L. W. Lacy, “Interchanging Discrete Event Simulation Process Interaction Models Using the Web Ontology Language — OWL,” Ph.D. dissertation, University of Central Florida, Orlando, FL, USA, 2006, AAI3242447.

[119] J. A. Miller, C. He, and J. L. Couto, Impact of the Semantic Web on Modeling and Simulation, ser. Handbook of Dynamic System Modeling. Chapman & Hall/CRC Press, 2007, pp. 3–1–3–22.

[120] P. Witherell, S. Krishnamurty, and I. R. Grosse, “Ontologies for Supporting Engineering Design Optimization,” Journal of Computing and Information Science in Engineering, vol. 7, no. 2, pp. 141–150, 2007.

[121] D. H. Wolpert and W. G. Macready, “No free lunch theorems for optimization,” IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67–82, 1997.

[122] J. R. Rice, “The algorithm selection problem,” ser. Advances in Computers, M. Rubinoff and M. C. Yovits, Eds. Elsevier, 1976, vol. 15, pp. 65–118.

[123] H. Guo, “Algorithm selection for sorting and probabilistic inference: a machine learning-based approach,” Ph.D. dissertation, Kansas State University, Manhattan, KS, USA, 2003, AAI3100557.

[124] S. Ali and K. A. Smith, “On learning algorithm selection for classification,” Applied Soft Computing, vol. 6, no. 2, pp. 119–138, 2006.

[125] F. Benmakrouha, C. Hespel, and E. Monnier, “An algorithm for rule selection on fuzzy rule-based systems applied to the treatment of diabetics and detection of fraud in electronic payment,” in 2010 IEEE International Conference on Fuzzy Systems (FUZZ), July 2010, pp. 1–5.

[126] P. D. Hough and P. J. Williams, “Modern Machine Learning for Automatic Optimization Algorithm Selection,” in Proceedings of the INFORMS Artificial Intelligence and Data Mining Workshop, 2006.

[127] R. Pasupathy and S. G. Henderson, “A Testbed of Simulation-Optimization Problems,” in Proceedings of the 2006 Winter Simulation Conference, L. F. Perrone, F. P. Wieland, J. Liu, B. G. Lawson, D. M. Nicol, and R. M. Fujimoto, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2006, pp. 255–263.

[128] ——, “SIMOPT: A Library of Simulation Optimization Problems,” in Proceedings of the 2011 Winter Simulation Conference, S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2011, pp. 4080–4090.

[129] A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on Stochastic Programming: Modeling and Theory, ser. MOS-SIAM Series on Optimization. Philadelphia, PA: Society for Industrial and Applied Mathematics, 2009.

[130] S. Abiteboul, R. Hull, and V. Vianu, Foundations of Databases. Addison-Wesley, 1995.

[131] C. L. Forgy, “Rete: A fast algorithm for the many pattern/many object pattern match problem,” Artificial Intelligence, vol. 19, no. 1, pp. 17–37, 1982.

[132] J. J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, and K. Wilkinson, “Jena: implementing the semantic web recommendations,” in Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, ser. WWW Alt. ’04. New York, NY, USA: ACM, 2004, pp. 74–83.

[133] H. Boley, G. Hallmark, M. Kifer, A. Paschke, A. Polleres, and D. Reynolds, “RIF core dialect,” W3C, W3C Recommendation, June 2010, http://www.w3.org/TR/2010/REC-rif-core-20100622/ [online June 2010].

[134] D. Huang, T. Allen, W. Notz, and N. Zeng, “Global Optimization of Stochastic Black- Box Systems via Sequential Kriging Meta-Models,” Journal of Global Optimization, vol. 34, no. 3, pp. 441–466, 2006.

[135] M. Zakerifar, W. Biles, and G. Evans, “Kriging metamodeling in multi-objective simulation optimization,” in Proceedings of the 2009 Winter Simulation Conference, M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, and R. G. Ingalls, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2009, pp. 2115–2122.

[136] R. M. Fujimoto, “Parallel and distributed simulation,” in Proceedings of the 1999 Winter Simulation Conference, P. A. Farrington, H. B. Nembhard, D. T. Sturrock, and G. W. Evans, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 1999, pp. 122–131.

[137] E. Mota, A. Wolisz, and K. Pawlikowski, “A perspective of batching methods in a simulation environment of multiple replications in parallel,” in Proceedings of the 2000 Winter Simulation Conference, J. A. Joines, R. R. Barton, K. Kang, and P. A. Fishwick, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2000, pp. 761–766.

[138] P. Heidelberger, “Discrete event simulations and parallel processing: statistical properties,” SIAM Journal on Scientific and Statistical Computing, vol. 9, no. 6, pp. 1114–1132, November 1988.

[139] B. Gehlsen and B. Page, “A framework for distributed simulation optimization,” in Proceedings of the 2001 Winter Simulation Conference, B. A. Peters, J. S. Smith, D. J. Medeiros, and M. W. Rohrer, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2001, pp. 508–514.

[140] R. M. Fujimoto, A. W. Malik, and A. J. Park, “Parallel and Distributed Simulation in the Cloud,” SCS M&S Magazine, vol. 1, no. 3, pp. 1–10, 2010.

[141] G. D’Angelo, “Parallel and Distributed Simulation from Many Cores to the Public Cloud (Extended Version),” in Proceedings of the 2011 International Conference on High Performance Computing and Simulation (HPCS 2011). Institute of Electrical and Electronics Engineers, Inc., 2011, pp. 14–23.

[142] K. Passino, “Biomimicry of bacterial foraging for distributed optimization and control,” Control Systems, IEEE, vol. 22, no. 3, pp. 52–67, 2002.

[143] J. A. Miller, W. D. Potter, R. V. Gandham, and C. N. Lapena, “An Evaluation of Local Improvement Operators for Genetic Algorithms,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 23, pp. 1340–1351, 1993.

[144] J. April, M. Better, F. Glover, J. Kelly, and M. Laguna, “Enhancing Business Process Management with Simulation Optimization,” in Proceedings of the 2006 Winter Simulation Conference, L. F. Perrone, F. P. Wieland, J. Liu, B. G. Lawson, D. M. Nicol, and R. M. Fujimoto, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2006, pp. 642–649.

[145] D. Bertsimas and R. Weismantel, Optimization over Integers. Dynamic Ideas, 2005, vol. 13.

[146] C. Floudas, Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications. Oxford University Press, USA, 1995.

[147] F. Glover and M. Laguna, Tabu Search. Kluwer Academic Pub., 1998, vol. 1.

[148] J. Han, J. A. Miller, M. Fang, L. Wells, K. J. Kochut, R. Ranzinger, and W. S. York, “GlycoQuant: An Automated Simulation Framework Targeting Isotopic Labeling Strategies in MS-Based Quantitative Glycomics,” Journal of Proteome Research, 2012, to be submitted.

[149] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization,” ACM Transactions on Mathematical Software, vol. 23, no. 4, pp. 550–560, December 1997.

[150] A. M. Uhrmacher and C. Priami, “Discrete event systems specification in systems biology - a discussion of stochastic π-calculus and DEVS,” in Proceedings of the 2005 Winter Simulation Conference, M. E. Kuhl, N. M. Steiger, F. B. Armstrong, and J. A. Joines, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2005, pp. 317–326.

[151] O. Mazemondet, M. John, C. Maus, A. M. Uhrmacher, and A. Rolfs, “Integrating Diverse Reaction Types Into Stochastic Models - A Signaling Pathway Case Study in the Imperative Pi-Calculus,” in Proceedings of the 2009 Winter Simulation Conference, M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, and R. G. Ingalls, Eds. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc., 2009, pp. 932–943.

[152] M. Calder, V. Vyshemirsky, D. Gilbert, and R. Orton, “Analysis of signalling pathways using continuous time Markov chains,” in Transactions on Computational Systems Biology VI, ser. Lecture Notes in Computer Science, C. Priami and G. Plotkin, Eds. Springer Berlin / Heidelberg, 2006, vol. 4220, pp. 44–67.

[153] V. N. Reddy, M. L. Mavrovouniotis, and M. N. Liebman, “Petri Net Representations in Metabolic Pathways,” in Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology. AAAI Press, 1993, pp. 328–336.

[154] V. N. Reddy, M. N. Liebman, and M. L. Mavrovouniotis, “Qualitative analysis of biochemical reaction systems,” Computers in Biology and Medicine, vol. 26, no. 1, pp. 9–24, 1996.

[155] H. Genrich, R. Küffner, and K. Voss, “Executable Petri net models for the analysis of metabolic pathways,” International Journal on Software Tools for Technology Transfer (STTT), vol. 3, no. 4, pp. 394–404, 2001.

[156] K. Nimmagadda, “Ontology Driven Simulation of Biochemical Pathways Using Hybrid Petri Nets,” Master’s thesis, University of Georgia, Athens, GA, USA, 2008.

[157] H. Matsuno, Y. Tanaka, H. Aoshima, A. Doi, M. Matsui, and S. Miyano, “Biopathways Representation and Simulation on Hybrid Functional Petri Net,” In Silico Biology, vol. 3, no. 3, pp. 389–404, 2003.

[158] R. Heinrich and S. Schuster, “The Modelling of Metabolic Systems. Structure, Control and Optimality,” Biosystems, vol. 47, no. 1-2, pp. 61–77, 1998.

[159] W. Wiechert, “Modeling and simulation: tools for metabolic engineering,” Journal of Biotechnology, vol. 94, no. 1, pp. 37–63, 2002, the Molecular Key for Biotechnology.

[160] I. Segel, Biochemical calculations: how to solve mathematical problems in general bio- chemistry. Wiley, 1976.

[161] B. Bendiak and H. Schachter, “Control of glycoprotein synthesis. Kinetic mechanism, substrate specificity, and inhibition characteristics of udp-n-acetylglucosamine:alpha-d-mannoside beta 1-2 n-acetylglucosaminyltransferase ii from rat liver,” Journal of Biological Chemistry, vol. 262, no. 12, pp. 5784–5790, 1987.

[162] C. Ansótegui, M. Sellmann, and K. Tierney, “A Gender-based Genetic Algorithm for the Automatic Configuration of Algorithms,” in Proceedings of the 15th international conference on Principles and practice of constraint programming, ser. CP’09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 142–157.

[163] S. Kadioglu, Y. Malitsky, M. Sellmann, and K. Tierney, “ISAC – Instance-Specific Algorithm Configuration,” in Proceedings of the 2010 conference on ECAI 2010: 19th European Conference on Artificial Intelligence. Amsterdam, The Netherlands: IOS Press, 2010, pp. 751–756.

APPENDICES

• Part A is the user guide for GlycoQuant. It describes the general architecture and how to build, compile, and deploy the software. A step-by-step tutorial is also included.

• Part B contains optimization results for typical O-linked and N-linked glycans for both static IDAWG™ and dynamic IDAWG™, demonstrating GlycoQuant’s capability to correctly perform the simulation and optimization against the experimental data.

• Part C contains the current version of the SoPT ontology.

Appendix A

GLYCOQUANT USER GUIDE

Goal

The goal of GlycoQuant is to provide a simulation environment for quantitative glycomics based on mass spectrometry.

GlycoQuant Architecture

The development of GlycoQuant follows the Model-View-Controller (MVC) framework and utilizes Web 2.0 techniques, as shown in Figure 4.1.


Figure 4.1: GlycoQuant Architecture.

• Presentation layer: Web pages are constructed in Javascript, and AJAX calls are exchanged between the browser and the RESTful Web services hosted on the server side.

• Service layer: RESTful web services are deployed on the server side to invoke computational tasks and return JSON data to the presentation layer.

• Logic layer: Computation-intensive simulation and optimization are the core of the whole project. Because these tasks take a long time and consume all the CPU capacity by starting multiple threads to perform work in parallel, it is advised to deploy this layer on a dedicated server (e.g., 4-core CPU + 8GB memory or, better, 8-core CPU + 16GB memory) or in a distributed computing environment.

• Persistence layer: XML is used as storage for optimization results. XML/JSON are constructed dynamically per user request and sent back to the browser for visualization purposes.
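The logic layer's parallel execution can be sketched with a standard thread pool. The class below is illustrative only (it is not one of GlycoQuant's actual classes), and a placeholder computation stands in for a real simulation/optimization run:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of server-side task queueing: submitted jobs are
// queued and executed in parallel on a fixed-size thread pool.
public class TaskQueue {
    private final ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    /** Queue one job; the Future is a handle that a result page could poll. */
    public Future<Double> submit(final double input) {
        return pool.submit(new Callable<Double>() {
            public Double call() {
                // placeholder for a real simulation/optimization run
                return input * 0.5;
            }
        });
    }

    /** Convenience wrapper that blocks until the queued task finishes. */
    public double submitAndWait(double input) {
        try {
            return submit(input).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void shutdown() { pool.shutdown(); }
}
```

Sizing the pool to the number of available processors matches the deployment advice above: the layer saturates the CPUs of whatever dedicated server it runs on.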

Tutorial

The URL to access GlycoQuant is http://localhost:8080/idawg-client/index.html. Replace localhost with the host URI where the idawg-client web application is deployed. Tested browsers include Google Chrome (v16.0), Mozilla Firefox (v9.0.1), and Safari (v5.1.2) on Windows 7. A step-by-step tutorial is given in screenshots.

• Enter home page (Figure 4.2)

Figure 4.2: GlycoQuant Home Page.

• User registration (Figure 4.3). Click Create a New User, type in a username and password, and some validation will be performed (e.g., the length of the username should be within 3 to 16 characters).

Figure 4.3: GlycoQuant Create a New User.

• User login (Figure 4.4). Type in the username and password, click Login, and some validation will be performed.

Figure 4.4: GlycoQuant User Login.

• Set up input parameters in the configuration page (Figure 4.5 and Figure 4.6). If a user wants to upload experimental data and run their own optimization, they need to take four steps. If they just need to browse the existing result sets, they can click the Go to result page link directly.

Figure 4.5: GlycoQuant Configuration Page (upper part).

Figure 4.6: GlycoQuant Configuration Page (bottom part).

1. Upload a CSV file containing all the structures to be examined (Figure 4.7, Figure 4.8, and Figure 4.9). Two structure files (hES_O-glycan_IDAWG and hES_N-glycan_IDAWG) have been provided as examples. Users can download one of these files, make changes according to their needs, and then upload it again. If the upload is successful, the uploaded file will show up with a red cross button, which can be used to delete the uploaded file with a single click. Clicking the Upload CSV file link will upload the file to the remote server.

Figure 4.7: GlycoQuant Upload CSV file.

Figure 4.8: GlycoQuant Upload CSV file (successful).

Figure 4.9: GlycoQuant Upload CSV file to Server.

2. Set up parameters related to the biological experiments, such as charge state, adduct, etc. (Figure 4.10). Specific configurations for O-Glycan and N-Glycan are shown in Table 4.1. The checkbox for IDAWG data is checked by default. If the data are from a static IDAWG experiment, do not check Time Dependent. If the data are from a dynamic IDAWG experiment, make sure to check Time Dependent before uploading the mzXML files.

Figure 4.10: GlycoQuant Configure Experiment Parameters (for O-Glycan).

3. Upload mzXML files containing mass spectral data (Figure 4.11). File names must conform to the naming convention experimentName_repNo_sampleType.mzXML, which is separated by “_” (e.g., O-linked_rep1_6hr.mzXML, O-linked-01052011_rep2_Mixture.mzXML). Clicking the Upload RAW or mzXML data link will upload the files to the remote server. When uploading is finished, the files will show up in the right panel with some parameters that need to be set up.

Table 4.1: Parameter Configuration for O-Glycan and N-Glycan in IDAWG™ experiments.

Parameter                      O-Glycan   N-Glycan
Derivative                     methyl     methyl
End structure                  alditol    derivatized
Adducts                        Na+        Na+
Precursor Enrichment           0.98       0.98
Charge State (min and max)     1 and 3    1 and 3

Figure 4.11: GlycoQuant Upload mzXML files.

4. Set up parameters for the uploaded mzXML files (Figure 4.12 and Figure 4.13). Sample types are based on the type of experiment: (1) for static IDAWG, heavy or mixture; (2) for dynamic IDAWG, different time points.

• Clicking the GO button beside the Run Optimization label will send an HTTP request to the server. The request contains all the input parameters. Computational tasks will be queued on the server side and do not require further human intervention.

Figure 4.12: GlycoQuant Set up parameters for mzXML files (static IDAWG).

Figure 4.13: GlycoQuant Set up parameters for mzXML files (dynamic IDAWG).

• Optimization and simulation tasks are performed on the server side automatically, and users may continue to browse the currently available results by clicking the Go to result page link.

• The user selects the experiments that have been submitted before (Figure 4.14).

Figure 4.14: GlycoQuant Fetch Results.

• Browsing and filtering the results. Typing in the search box filters the results based on string match. Clicking the green “+” icon expands the result set.

Figure 4.15: GlycoQuant Browse Results.

• View Results in static IDAWG™ and dynamic IDAWG™ experiments

– Optimization Results Panel (Figure 4.16 and 4.17). This panel is the same for both experiments. It contains: (1) visualization of the simulated and experimental spectra, and (2) the optimized parameters for the mass spectrum. The user can zoom in to see the details of the curve fitting by dragging the mouse over the plot area.

– Analysis Panel. Clicking the Analyze button will generate the analysis result based on the type of experiment.

∗ static IDAWG™ experiments (Figure 4.18).

Figure 4.16: GlycoQuant Browse Results.

∗ dynamic IDAWG™ experiments (Figure 4.19 and 4.20). Clicking one of the time points (e.g., 6hr, 12hr, 24hr) toggles whether that time point is added to or removed from the data sets under consideration; therefore, the calculated results for 50% degradation time and Proportion of remodeling at 50% degradation time (%) will change accordingly.

Source Code and Requirements

Source code is located at http://ra.cs.uga.edu/svn/dcon/GlycoQuant-idawg/ and http://bird.cs.uga.edu/svn/junbackup/GlycoQuant-idawg/. Use any svn client (e.g., TortoiseSVN or the svn command line) to download the whole folder. The requirements are listed below:

Figure 4.17: GlycoQuant Browse Results (Zoom In).

Figure 4.18: GlycoQuant Analyze Results (static IDAWG™).

• Programming language: Java Development Kit 6.0 (Sun JDK java version 1.6.0_24, Java™ SE Runtime Environment build 1.6.0_24-b07) [1].

• Build tool: Apache Maven 3.0.3 [2].

[1] http://www.oracle.com/technetwork/java/javase/downloads/index.html
[2] http://maven.apache.org/download.html

Figure 4.19: GlycoQuant Analyze Results (dynamic IDAWG™).

• Web Server: Tomcat 6.0.35 [3]. The Tomcat server needs to be running before executing the build scripts.

• IDE (optional): Eclipse IDE for Java EE Developers SR1 (v3.7 Indigo Service Release 1) [4]. For better Maven integration with Eclipse, m2e (a Maven plugin for Eclipse) [5] can be installed.

Build and Deployment

Figure 4.20: GlycoQuant Analyze Results (dynamic IDAWG™, partial data set).

[3] http://archive.apache.org/dist/tomcat/tomcat-6/v6.0.35/bin/apache-tomcat-6.0.35.zip
[4] http://www.eclipse.org/downloads/packages/eclipse-ide-java-ee-developers/indigosr1
[5] http://maven.apache.org/download.html

The whole GlycoQuant project can be built from the command line via mvn or imported as a Maven project into the Eclipse IDE workspace. The README file, the build scripts (build-client.sh / build-server.sh / build-javadoc.sh), and the corresponding pom-client.xml, pom-server.xml, and pom-javadoc.xml in the root folder are self-explanatory given some knowledge of the Maven build tool. The user is required to manually specify (1) the installation directory of the Tomcat server in both GlycoQuant-idawg/pom-client.xml and GlycoQuant-idawg/pom-server.xml,

/home/jhan/java/apache-tomcat-6.0.35

(2) the remote server hostname and port number in GlycoQuant-idawg/GlycoQuant-idawg-client/pom.xml and GlycoQuant-idawg/GlycoQuant-idawg-server/pom.xml to indicate where the HTTP requests should be sent.

In GlycoQuant-idawg/GlycoQuant-idawg-client/pom.xml:

wedman.cs.uga.edu:8080

In GlycoQuant-idawg/GlycoQuant-idawg-server/pom.xml:

wedman.cs.uga.edu:8080

The two build scripts (build-server.sh and build-client.sh) perform the following tasks automatically:

• pre-compile: install a jar file (jrap_StAX_v5.1.jar) to local maven repository for future use;

• compile: compile GlycoQuant-idawg-core, GlycoQuant-idawg-server, GlycoQuant-idawg-client, and GlycoQuant-idawg-test;

• install: two web applications (.war) are assembled and generated in the folders GlycoQuant-idawg-server/target and GlycoQuant-idawg-client/target, respectively;

• deploy client web app in build-client.sh: deploy GlycoQuant-idawg-client web app to tomcat server via cargo:redeploy command

• deploy server web app in build-server.sh: deploy GlycoQuant-idawg-server web app to tomcat server via cargo:redeploy command

The build-javadoc.sh script generates Javadoc for all the modules and aggregates them into one site; all the apidocs can be found in GlycoQuant-idawg/target/site/apidocs. Type the following commands in the terminal to run the build scripts.

chmod +x *.sh
./build-server.sh
./build-client.sh
./build-javadoc.sh

The whole GlycoQuant project is divided into four sub-projects. Each project and its description is listed in Table 4.2.

Table 4.2: Sub-projects in GlycoQuant

Name                      Description
GlycoQuant-idawg-core     contains the foundation classes for simulation and optimization
GlycoQuant-idawg-server   provides RESTful web services including file upload, optimization, and mass spectrum visualization
GlycoQuant-idawg-client   provides the web page interface to invoke web services on the server
GlycoQuant-idawg-test     contains test cases for GlycoQuant-idawg-core and GlycoQuant-idawg-server
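GlycoQuant-idawg-server returns JSON data to the client for visualization. As a minimal sketch of such a payload (the field names here are hypothetical, not GlycoQuant's actual API):

```java
import java.util.Locale;

// Hypothetical result payload; toJson() produces the kind of JSON string
// the server-side services hand back to the browser for visualization.
public class ResultPayload {
    private final String glycan;
    private final double objectiveValue;

    public ResultPayload(String glycan, double objectiveValue) {
        this.glycan = glycan;
        this.objectiveValue = objectiveValue;
    }

    /** Serialize to a JSON object the browser-side Javascript can consume. */
    public String toJson() {
        return String.format(Locale.ROOT,
                "{\"glycan\":\"%s\",\"objectiveValue\":%.4f}",
                glycan, objectiveValue);
    }
}
```

Locale.ROOT keeps the decimal separator a period regardless of the server's locale, so the JSON stays parseable on the client.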

Third party software installation

This section can be safely skipped if GlycoQuant does not process raw MS data in .raw format. The easiest way is to run the conversion program on the Windows operating system and upload the generated mzXML files to the GlycoQuant web application. In Windows, search for a RAW-to-mzXML converter and pick one that works for you, e.g., ProteoWizard. To enable GlycoQuant itself to process raw MS data in .raw format, some third-party software must be installed, which can be found in the folder GlycoQuant/thirdPartySoftware. In Linux, install Wine first, which can run Windows applications on the Linux platform. For example, in Fedora, use the following command for installation:

yum install wine

Download winetricks from http://www.kegel.com/wine/winetricks and run the following commands:

chmod +x winetricks
./winetricks winxp
./winetricks vcrun2008
./winetricks dotnet20

Troubleshooting: when Wine installs dotnet20, it may fail and complain “cannot find cabextract”. After installing the cabextract package, Wine will work fine:

yum install cabextract

Install msfilereader from Thermo by installing the msi file contained in msfilereader_ver-13_04-14-2009_Setup.zip. Run the following commands:

unzip msfilereader_Setup.zip
cd msfilereader_Setup/setup_msi
wine msiexec /q /i thermo\ install.msi

msconvert.exe is provided in pwiz-bin-windows-x86-vc90-release-2_0_1937.tar.bz2 from ProteoWizard; download the latest Windows version of pwiz [6]. It can convert the RAW format to the mzXML format. After unzipping the file, we can use msconvert.exe to convert a RAW file to mzXML in Windows. However, in Linux, use the command below instead.

[6] http://proteowizard.sourceforge.net/

wine msconvert.exe --32 --mzXML test.RAW

Unzip the zip file under the directory $user.home/apps/pwiz/; the Java program will try to find msconvert.exe in the path relative to the $user.home directory.

Post-installation. A short shell script called msconvert is placed in the src folder of edu.uga.cs.lsdis.glycomics.jmass.data.converter. By default, it will look for msconvert.exe at ~/apps/pwiz/msconvert.exe. Therefore, the ProteoWizard program must be unzipped to the correct path.

#!/bin/sh
echo convert RAW file: $1 in directory: $2
wine ~/apps/pwiz/msconvert.exe --32 --mzXML $1 -o $2
echo done
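On the Java side, the converter can assemble the same command line and run it through ProcessBuilder. The class below is an illustrative sketch, not the actual converter class:

```java
import java.io.File;
import java.io.IOException;

// Sketch of invoking the wine/msconvert wrapper from Java; class and
// method names are illustrative, not GlycoQuant's converter API.
public class RawToMzXml {
    /** Build the command line used by the wrapper script. */
    public static String[] command(String rawFile, String outDir) {
        String exe = new File(System.getProperty("user.home"),
                "apps/pwiz/msconvert.exe").getPath();
        return new String[] { "wine", exe, "--32", "--mzXML", rawFile, "-o", outDir };
    }

    /** Run the conversion and return the process exit code. */
    public static int convert(String rawFile, String outDir)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command(rawFile, outDir))
                .redirectErrorStream(true)
                .start();
        return p.waitFor();
    }
}
```

Resolving the executable against System.getProperty("user.home") mirrors the script's ~/apps/pwiz/msconvert.exe default, so both paths must agree.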

(Incomplete) Development History

A brief development history is listed below:

• version 0.1 The source code was first developed by Dr. Will York, and it implemented:

1. isotope distribution

2. the simulation of mass spectrum via Gaussian line shape

3. find the residue composition from monoisotopic mass

4. find matching residues according to mass.

• version 0.2

– generate the simulated elemental composition spectrum

– use the jquery.flot plugin to draw charts in the web page

– add support for residue composition

• version 0.3

– rewrite javascript code and modify web UI

– add IDAWG support and generate spectrum based on IDAWG configuration

– add function to generate centroid data as well as profile data

• version 0.4

– simulate the mixture for different combinations of N15 and N14

– add Lorentzian line shape and multinomial distribution

– add simulationEngine

– add zoom in/zoom out in the plot via flot jquery plugin

– use the gradient descent algorithm to optimize the parameters for peak width, threshold, and shift calibration

• version 0.5

– spectrum optimization using hybrid Genetic algorithm

– spectrum optimization using SimulationEfficient algorithm

– integrate the backend optimization code with the Web page

– spectrum simulation using Pseudo Nitrogen

• version 0.6

– add support of converting RAW to mzXML format.

– modify the web page UI for Thermo company demo

• version 0.7 (current)

– add user registration, user login, submit job and retrieve optimization results

– divide the whole project into several small Maven projects to facilitate easy building and deployment

– separate the server-side code and client-side code so that the two web applications can be deployed on different PCs

Appendix B

GLYCOQUANT RESULTS

O-Glycan in static IDAWG™

Figure 4.21: GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 heavy.

137 Figure 4.22: GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 mixture.

Figure 4.23: GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 heavy.

Figure 4.24: GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 mixture.

N-Glycan in static IDAWG™

Figure 4.25: GlycoQuant Results. (Hex)7(HexNAc)2 heavy.

Figure 4.26: GlycoQuant Results. (Hex)7(HexNAc)2 mixture.

O-Glycan in dynamic IDAWG™

Figure 4.27: GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) heavy.

Figure 4.28: GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) mixture.

Figure 4.29: GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 0hr.

Figure 4.30: GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 6hr.

Figure 4.31: GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 12hr.

Figure 4.32: GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 24hr.

Figure 4.33: GlycoQuant Results. (NeuAc)1(Hex)1(HexNAc)1 36hr.

Figure 4.34: GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 0hr.

Figure 4.35: GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 6hr.

Figure 4.36: GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 12hr.

Figure 4.37: GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 24hr.

Figure 4.38: GlycoQuant Results. (NeuAc)2(Hex)1(HexNAc)1 36hr.

N-Glycan in dynamic IDAWG™

Figure 4.39: GlycoQuant Results. (Hex)7(HexNAc)2 0hr.

Figure 4.40: GlycoQuant Results. (Hex)7(HexNAc)2 6hr.

Figure 4.41: GlycoQuant Results. (Hex)7(HexNAc)2 12hr.

Figure 4.42: GlycoQuant Results. (Hex)7(HexNAc)2 24hr.

Figure 4.43: GlycoQuant Results. (Hex)7(HexNAc)2 36hr.

Figure 4.44: GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) 0hr.

Figure 4.45: GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) 6hr.

Figure 4.46: GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) 12hr.

Figure 4.47: GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) 24hr.

Figure 4.48: GlycoQuant Results. (NeuAc)(Hex)2(HexNAc)2(Hex)3(HexNAc)2(DeoxyHexose) 36hr.

Appendix C

SIMULATION OPTIMIZATION ONTOLOGY (SoPT)

Prefix: xsd: <http://www.w3.org/2001/XMLSchema#>

Prefix: swrlb: <http://www.w3.org/2003/11/swrlb#>

Prefix: owl: <http://www.w3.org/2002/07/owl#>

Prefix: protege: <http://protege.stanford.edu/plugins/owl/protege#>

Prefix: Nelder:

Prefix: :

Prefix: xsp: <http://www.owl-ontologies.com/2005/08/07/xsp.owl#>

Prefix: xml: <http://www.w3.org/XML/1998/namespace>

Prefix: rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

Prefix: swrl: <http://www.w3.org/2003/11/swrl#>

Prefix: rdfs: <http://www.w3.org/2000/01/rdf-schema#>

Ontology:

Datatype: xsd:boolean

Datatype: xsd:string

Datatype: xsd:float

ObjectProperty: has-Vector

Domain:

Matrix

Range:

has-Vector some Vector

ObjectProperty: can_Achieve

Domain:

Optimization_Method

Range:

Solution_Quality

ObjectProperty: has-QuadraticCoefficientMatrix

Characteristics:

Functional

Domain:

Quadratic_Objective_Function

Range:

has-QuadraticCoefficientMatrix only Matrix

ObjectProperty: produces

Characteristics:

Functional

Domain:

Operator

Range:

produces some Data_Type

ObjectProperty: has-SolutionQuality

Domain:

Optimization_Method

or Solution

Range:

has-SolutionQuality some Solution_Quality

ObjectProperty: has-CoefficientMatrix

Domain:

Linear_Constraint

Range:

has-CoefficientMatrix only Matrix

ObjectProperty: has-Restriction

Domain:

Optimization_Problem

Range:

has-Restriction some Restriction

ObjectProperty: has-LinearCoefficientVector

Domain:

Quadratic_Objective_Function

Range:

has-LinearCoefficientVector only Constant_Vector

ObjectProperty: has-Element

Domain:

Vector

Range:

has-Element some Element

ObjectProperty: has-Component

Domain:

Optimization_Problem

Range:

has-Component some Optimization_Component

ObjectProperty: has-Constraint

Domain:

Optimization_Problem

Range:

has-Constraint some Constraint

ObjectProperty: can-Solve

Characteristics:

Functional

Domain:

Optimization_Method

Range:

can-Solve some Optimization_Problem

ObjectProperty: has-Goal

Domain:

Optimization_Problem

Range:

has-Goal some Solution_Quality

ObjectProperty: has-VariableVectorValue

Domain:

Solution

Range:

has-VariableVectorValue only Constant_Vector

ObjectProperty: has-VariableVector

Domain:

Objective_Function

Range:

has-VariableVector only Variable_Vector

ObjectProperty: has_OptimizationGoal

Domain:

Optimization_Problem

Range:

has_OptimizationGoal some Optimization_Goal

ObjectProperty: requires_SolutionQuality

Domain:

Optimization_Problem

Range:

requires_SolutionQuality some Solution_Quality

ObjectProperty: has-ConstantVector

Domain:

Linear_Constraint

Range:

has-ConstantVector only Constant_Vector

ObjectProperty: has-ObjectiveFunction

Domain:

Optimization_Problem

Range:

has-ObjectiveFunction some Objective_Function

DataProperty: has-MultiObjectives

Domain:

Objective_Function

Range:

xsd:boolean

DataProperty: is-Constant

Domain:

Element

Range:

xsd:boolean

DataProperty: is-Global

Domain:

Optimization_Method

Range:

xsd:boolean

DataProperty: has-ObjectiveFunctionValue

Domain:

Solution

Range:

xsd:float

DataProperty: has-Value

Domain:

Element

DataProperty: has-Type

Domain:

Element

Range:

{"binary"^^xsd:string , "boolean"^^xsd:string ,

"integer"^^xsd:string , "real"^^xsd:string}

Class: Derivative_Free

SubClassOf:

Optimization_Method

Class: Random_Search_Method

SubClassOf:

Optimization_Method

Class: Bacterial_Foraging_Optimization_Algorithm

SubClassOf:

Heuristic_Method

Class: Simplex_Algorithm

SubClassOf:

Simplex_Method

Class: Approximate_Solution

SubClassOf:

Solution_Quality

Class: Operator

SubClassOf:

SoPT

Class: Steepest_Descent

SubClassOf:

Gradient_Based

Class: Quasi_Newton_Method

SubClassOf:

Gradient_Based

Class: Stochastic_Linear_Programming

SubClassOf:

Stochastic_Quadratic_Programming

Class: Deterministic_Programming

SubClassOf:

Optimization_Problem

Class: L-BFGS_Method

SubClassOf:

Quasi_Newton_Method

Class: Constant_Vector

SubClassOf:

Vector

Class: Linear_Objective_Function

SubClassOf:

Quadratic_Objective_Function

Class: Nonlinear_Programming

SubClassOf:

Deterministic_Programming

Class: Linear_Constraint

SubClassOf:

Quadratic_Constraint

Class: Exact_Solution

SubClassOf:

Solution_Quality

Class: Optimization_Problem

SubClassOf:

SoPT

Class: Genetic_Algorithm

SubClassOf:

Heuristic_Method

Class: Matrix

SubClassOf:

Data_Type

Class: Constraint

SubClassOf:

Optimization_Component

Class: Gradient_Based

SubClassOf:

Optimization_Method

Class: Hessian

SubClassOf:

produces only Matrix,

Operator

Class: Particle_Swarm_Optimization

SubClassOf:

Heuristic_Method

Class: Linear_Programming

SubClassOf:

Quadratic_Programming,

has-ObjectiveFunction only Linear_Objective_Function

Class: Optimization_Component

SubClassOf:

SoPT

Class: Heuristic_Solution

SubClassOf:

Solution_Quality

Class: Restriction

SubClassOf:

Optimization_Component,

has-Restriction only

(Binary_Restriction

or Integer_Restriction

or Mixed_Restriction

or Real_Restriction)

Class: Solution

SubClassOf:

Optimization_Component

Class: Response_Surface_Methodology

SubClassOf:

Meta_Modeling_Method

Class: Interior_Point_Method

SubClassOf:

Gradient_Based

Class: Integer_Restriction

SubClassOf:

Restriction

Class: Variable_Vector

SubClassOf:

Vector

Class: Binary_Restriction

SubClassOf:

Restriction

Class: Simplex_Method

SubClassOf:

Derivative_Free

Class: Quadratic_Constraint

SubClassOf:

Constraint

Class: Quadratic_Fit

SubClassOf:

Response_Surface_Methodology

Class: SoPT

Class: Stochastic_Programming

SubClassOf:

Optimization_Problem

Class: Meta_Modeling_Method

SubClassOf:

Optimization_Method

Class: Local_Optimal

SubClassOf:

Optimization_Goal

Class: Polak-Ribiere_Conjugate_Gradient

SubClassOf:

Conjugate_Gradient_Descent

Class: Mixed_Restriction

SubClassOf:

Restriction

Class: Stochastic_Nonlinear_Programming

SubClassOf:

Stochastic_Programming

Class: Stochastic_Quadratic_Programming

SubClassOf:

Stochastic_Programming

Class: Global_Optimal

SubClassOf:

Optimization_Goal

Class: Nonlinear_Constraint

SubClassOf:

Constraint

Class: Conjugate_Gradient_Descent

SubClassOf:

Gradient_Based

Class: Minimization

SubClassOf:

Optimization_Goal

Class: Gradient

SubClassOf:

Operator,

produces only Vector

Class: Tabu_Search

SubClassOf:

Random_Search_Method

Class: Simulated_Annealing

SubClassOf:

Random_Search_Method

Class: Newton_Method

SubClassOf:

Gradient_Based

Class: Hooke_and_Jeeves_Direct_Search

SubClassOf:

Derivative_Free

Class: Heuristic_Method

SubClassOf:

Optimization_Method

Class: Quadratic_Objective_Function

SubClassOf:

Objective_Function

Class: Local_Search

SubClassOf:

Random_Search_Method

Class: Vector

SubClassOf:

Data_Type

Class: Sample_Path_Optimization

SubClassOf:

Random_Search_Method

Class: Maximization

SubClassOf:

Optimization_Goal

Class: Real_Restriction

SubClassOf:

Restriction

Class: Nelder-Mead_Method

SubClassOf:

Simplex_Method

Class: Optimization_Method

SubClassOf:

SoPT

Class: Optimization_Goal

SubClassOf:

Optimization_Component

Class: Nonlinear_Objective_Function

SubClassOf:

Objective_Function

Class: Data_Type

SubClassOf:

SoPT

Class: Quadratic_Programming

SubClassOf:

Deterministic_Programming,

has-ObjectiveFunction only Quadratic_Objective_Function

Class: Objective_Function

SubClassOf:

Optimization_Component

Class: BFGS_Method

SubClassOf:

Quasi_Newton_Method

Class: Kriging_RSM

SubClassOf:

Meta_Modeling_Method

Class: Element

SubClassOf:

Data_Type

Class: Solution_Quality

SubClassOf:

Optimization_Component,

has-SolutionQuality only

(Approximate_Solution

or Exact_Solution

or Heuristic_Solution)

Class: Ant_Colony_Optimization

SubClassOf:

Heuristic_Method

Individual: integer_1

Types:

Integer_Restriction

Individual: variable_vec1

Types:

Variable_Vector

Individual: linear_coefficient_vec1

Types:

Constant_Vector

Individual: quadratic_objective_function1

Types:

Quadratic_Objective_Function

Facts:

has-QuadraticCoefficientMatrix matrix1,

has-VariableVector variable_vec1,

has-LinearCoefficientVector linear_coefficient_vec1,

has-MultiObjectives false

Individual: UCF_Problem

Types:

Stochastic_Nonlinear_Programming

Facts:

has_OptimizationGoal maximization_1,

has-ObjectiveFunction ucf_objective_function,

has-Restriction integer_1,

requires_SolutionQuality heuristic_solution_1,

has-Constraint ucf_linear_constraint

Individual: matrix1

Types:

Matrix

Facts:

has-Vector row_vec4,

has-Vector row_vec5,

has-Vector row_vec3,

has-Vector row_vec2,

has-Vector row_vec1
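
Taken together, the individuals quadratic_objective_function1, variable_vec1, linear_coefficient_vec1, and matrix1 describe a standard quadratic objective. Assuming the conventional one-half scaling (the listing itself fixes no scaling), the encoded function is

\[ f(\mathbf{x}) \;=\; \tfrac{1}{2}\,\mathbf{x}^{\mathsf{T}} Q\,\mathbf{x} \;+\; \mathbf{c}^{\mathsf{T}}\mathbf{x} \]

where \(\mathbf{x}\) is variable_vec1, \(Q\) is matrix1 (its rows row_vec1 through row_vec5), and \(\mathbf{c}\) is linear_coefficient_vec1; the fact has-MultiObjectives false marks the function as single-objective.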

Individual: real_1

Types:

Real_Restriction

Individual: zero_Matrix_1

Types:

Matrix

Individual: row_vec1

Types:

Constant_Vector

Individual: maximization_1

Types:

Maximization

Individual: exact_solution_1

Types:

Exact_Solution

Individual: LP_problem1

Types:

Linear_Programming

Facts:

has-Constraint linear_constraint1,

has-Goal exact_solution_1,

has-Restriction real_1,

has-ObjectiveFunction linear_objective_function1

Individual: constant_vec1

Types:

Constant_Vector

Individual: Simplex_Algorithm

Types:

Derivative_Free

Facts:

can-Solve LP_problem1

Individual: linear_objective_function1

Types:

Linear_Objective_Function

Facts:

has-VariableVector variable_vec1,

has-LinearCoefficientVector linear_coefficient_vec1,

has-QuadraticCoefficientMatrix zero_Matrix_1,

has-MultiObjectives false

Individual: heuristic_solution_1

Types:

Heuristic_Solution

Individual: linear_constraint1

Types:

Linear_Constraint

Facts:

has-CoefficientMatrix matrix1,

has-ConstantVector constant_vec1
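
The individuals LP_problem1, linear_objective_function1 (whose quadratic coefficient matrix is zero_Matrix_1), and linear_constraint1 together encode a linear program in the usual matrix form. Assuming minimization and inequality constraints (neither is pinned down by the facts above), the problem reads

\[ \min_{\mathbf{x}\in\mathbb{R}^{n}} \; \mathbf{c}^{\mathsf{T}}\mathbf{x} \quad \text{subject to} \quad A\mathbf{x} \le \mathbf{b} \]

with \(\mathbf{c}\) = linear_coefficient_vec1, \(A\) = matrix1, and \(\mathbf{b}\) = constant_vec1. The restriction real_1 keeps \(\mathbf{x}\) real-valued, exact_solution_1 demands an exact optimum, and the can-Solve fact on the Simplex_Algorithm individual records that the simplex algorithm is an admissible method for this problem.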

Individual: ucf_linear_constraint

Types:

Linear_Constraint

Individual: row_vec4

Types:

Constant_Vector

Individual: row_vec5

Types:

Constant_Vector

Individual: ucf_objective_function

Types:

Nonlinear_Objective_Function

Individual: row_vec2

Types:

Constant_Vector

Individual: QP_problem1

Types:

Quadratic_Programming

Facts:

has-ObjectiveFunction quadratic_objective_function1,

has-Restriction real_1,

has-Constraint linear_constraint1,

has-Goal exact_solution_1

Individual: row_vec3

Types:

Constant_Vector
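
As a sketch of how such ontological descriptions can be turned into executable code (the thesis targets JSIM and ScalaTion; plain Python is used here only for illustration), the following applies Steepest_Descent, one of the Gradient_Based methods in the listing, to a quadratic objective of the kind quadratic_objective_function1 describes. The matrix Q and vector c stand in for matrix1 and linear_coefficient_vec1; their numeric values are invented for the example and do not come from the ontology.

```python
# Hypothetical rendering of a Quadratic_Objective_Function individual as code.
# Q plays the role of has-QuadraticCoefficientMatrix (matrix1) and c the role
# of has-LinearCoefficientVector (linear_coefficient_vec1).

def steepest_descent(Q, c, x0, lr=0.1, steps=500):
    """Minimize f(x) = 1/2 x^T Q x + c^T x by stepping along -grad f(x)."""
    n = len(x0)
    x = list(x0)
    for _ in range(steps):
        # The Gradient operator (produces only Vector): g = Q x + c
        g = [sum(Q[i][j] * x[j] for j in range(n)) + c[i] for i in range(n)]
        x = [x[i] - lr * g[i] for i in range(n)]
    return x

# Illustrative instance: Q = diag(2, 4), c = (-2, -4).
# The minimizer solves Q x = -c, i.e. x = (1, 1).
Q = [[2.0, 0.0], [0.0, 4.0]]
c = [-2.0, -4.0]
x_star = steepest_descent(Q, c, [0.0, 0.0])
```

The loop body mirrors the axioms earlier in the listing: the Gradient operator produces a Vector, which a Gradient_Based method consumes to update the variable vector of the Solution.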
