LOGOS EX MACHINA: A REASONED APPROACH TOWARD CANCER

by

Andrew Avila, B. S., M. S.

A Dissertation

In

Biological Sciences

Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

Approved

Lauren Gollahon
Chairperson of the Committee

Rich Strauss

Sean Rice

Boyd Butler

Richard Watson

Peggy Gordon Miller
Dean of the Graduate School

May, 2012

© 2012, Andrew Avila

Texas Tech University, Andrew Avila, May 2012

ACKNOWLEDGEMENTS

I wish to acknowledge the incredible support given to me by my major adviser, Dr. Lauren Gollahon. Without your guidance I surely would not have made it as far as I have. Furthermore, the intellectual exchange I have shared with my advisory committee these long years has propelled me to new heights of inquiry I had not dreamed of even in the most lucid of my imaginings. That their continual intellectual challenges have provoked and evoked a subtle sense of natural wisdom is a testament to their efficacy in guiding the aspirant to the well of knowledge. For this initiation into the mysteries of nature I cannot thank my advisory committee enough. I also wish to thank the Vice President of Research for the fellowship that sustained the first couple of years of my residency at Texas Tech. Furthermore, my appreciation of the support provided to me by the Biology Department, financial and otherwise, cannot be overstated. Finally, I wish to acknowledge the individuals working at the High Performance Computing Center; without your tireless support in maintaining the cluster I would not have completed the sheer amount of research that I have. To my parents, there is nothing I can say that would adequately express my feelings of appreciation. Your ongoing love and support are remarkable, if not extraordinary. Truly, I am thankful to have you.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
I. INTRODUCTION
    Background and Significance
    Hypotheses
    Chapter Summaries
    Bibliography
II. DEVELOPMENT OF AN ABSTRACT LOGICAL MODEL FOR EXPRESSING GENETIC RELATIONSHIPS
    Introduction
        Boolean Networks
        Bayesian Networks
        Knowledge Based Approaches
        Answer Set Programming and Action Languages
        Summary
    Materials and Methods
    Results
    Discussion
    Bibliography
    Appendix A
    Appendix B
III. AUTOMATION OF LOGICAL MODEL GENERATION AND ANALYSIS FOR HOMO SAPIENS
    Introduction
        Public Genetic and Biochemical Databases
        Representation Formats
        Summary
    Materials and Methods
    Results
    Discussion
    Bibliography
    Appendix A
IV. ANALYSIS OF DIFFERENTIALLY EXPRESSED GENES IN HUMAN CANCER CELL LINES
    Introduction
        Microarray Meta-Analyses of Cancer
        Summary
    Materials and Methods
    Results
    Discussion
    Bibliography
    Appendix A
V. ANALYSIS OF DIFFERENTIALLY EXPRESSED GENES IN TISSUE SAMPLES OF BREAST CANCER
    Introduction
        Breast Cancer Initiation per the SMT
        Oncogenesis Via Chromothripsis
        Tumor Virology
        The Tissue Organization Field Theory of Cancer
        Summary
    Materials and Methods
    Results
    Discussion
    Bibliography
VI. CONCLUSIONS


ABSTRACT

Limitations in our current ability to integrate a diverse spectrum of genetic information in an effort to elucidate the underlying causes of cancer have created the need for a novel cancer modeling approach. Public repositories of biological pathways and expression experiments were combined in order to provide a systems biology approach toward cancer. Furthermore, by unifying these sources of knowledge, the ability to predict expression levels of unmeasured genes was developed. This technique was then applied to a variety of cancer types in order to resolve commonalities between heretofore disparate cancers. The results generated in this manner revealed characteristics that challenge the prevailing paradigm of cancer. Specifically, the significant upregulation of oncogenes and significant downregulation of tumor suppressor genes predicted by the Somatic Mutation Theory of Cancer were not found. In contrast, oncogenes were found to be significantly downregulated and tumor suppressor genes upregulated among the cancers examined. Furthermore, the results demonstrate the differential expression, in cancer cells, of genes involved in the cellular differentiation and wound healing processes. These results were used as a springboard to develop a novel oncogenesis hypothesis, named Umbracesis. In short, the Umbracesis hypothesis proposes that disruption of the wound healing process by carcinogens occurs in such a way as to prevent organismic homeostasis from being recovered or to prevent full re-differentiation of dedifferentiated cells. The former concept is implicated in inflammatory cancers, whereas the latter is implicated in cancers that show characteristics associated with embryonic tissues. It was concluded that the modeling approach developed within this study has implications beyond cancer and may be of use within other areas of biomedical concern.


LIST OF TABLES

2.1 The individual relationships of an instance of a model that was implemented in Appendix B.
2.2 The application of a variety of initial conditions to the model listed in Appendix B and their solutions.
4.1 Genes that are significantly underexpressed in cancer (cervical, breast, and mesothelioma); sorted from most significant to least significant based on distribution mean.
4.2 Genes that are significantly overexpressed in cancer (cervical, breast, and mesothelioma); sorted from most significant to least significant based on distribution mean.
4.3 Genes that are significantly underexpressed in cervical cancer (GEO Dataset GDS3233); sorted from most significant to least significant based on distribution mean.
4.4 Genes that are significantly overexpressed in cervical cancer (GEO Dataset GDS3233); sorted from most significant to least significant based on distribution mean.
4.5 Genes that are significantly underexpressed in breast cancer (GEO Dataset GDS820); sorted from most significant to least significant based on distribution mean.
4.6 Genes that are significantly overexpressed in breast cancer (GEO Dataset GDS820); sorted from most significant to least significant based on distribution mean.
4.7 Genes that are significantly underexpressed in mesothelioma (GEO Dataset GDS1220); sorted from most significant to least significant based on distribution mean.
4.8 Genes that are significantly overexpressed in mesothelioma (GEO Dataset GDS1220); sorted from most significant to least significant based on distribution mean.
5.1 Genes that are significantly underexpressed in breast cancer (GEO Dataset GDS3324); sorted from most significant to least significant based on distribution mean.
5.2 Genes that are significantly overexpressed in breast cancer (GEO Dataset GDS3324); sorted from most significant to least significant based on distribution mean.


LIST OF FIGURES

2.1 The network representing the causal relationship between the variables (a through i). Different arrows are used for edges leading to e in order to signify that they uniquely imply e and are not co-dependent. The relationship between a and i is of a non-direct, inverse type.
3.1 A gene cluster focusing on the gene IL1R1 and related elements. This is from the HeLa cervical cancer cell line (GEO dataset GDS3233).
3.2 An expansion upon Figure 3.1 with a recursive depth of one. Any gene (or related element) that connects to a gene (or related element) in Figure 3.1 has been added to the figure.
3.3 An expansion upon Figure 3.1 with a recursive depth of two. Any gene (or related element) that connects to a gene (or related element) in Figure 3.2 has been added to the figure.
3.4 A network demonstrating the ability to deduce the expression level of complexes from the expression levels of genes. Specifically, this network demonstrates the binding of FASL to the FAS Receptor in order to produce the FASL FAS Receptor Monomer. The deduction of the expression level of the FASL FAS Receptor Trimer is deduced from the FASL FAS Receptor Monomer.
4.1 A graph showing the differential expression of genes in cervical cancer (GEO Dataset GDS3233); specifically focusing on Semaphorin 5A and connected elements.
4.2 A graph showing the differential expression of genes in cervical cancer (GEO Dataset GDS3233); specifically focusing on Semaphorin 6D and connected elements.
4.3 A graph showing the differential expression of genes in breast cancer (GEO Dataset GDS820); specifically focusing on MED12 and connected elements.
4.4 A graph showing the differential expression of genes in mesothelioma (GEO Dataset GDS1220); specifically focusing on Semaphorin 4D and connected elements.
4.5 A graph showing the differential expression of genes in mesothelioma (GEO Dataset GDS1220); specifically focusing on Semaphorin 6A and connected elements.
4.6 A graph showing the differential expression of genes in mesothelioma (GEO Dataset GDS1220); specifically focusing on Semaphorin 7A and connected elements.
5.1 A graph showing the differential expression of genes in breast cancer (GEO Dataset GDS3324); specifically focusing on IGF2BP3 and connected elements.
5.2 A graph showing the differential expression of genes in breast cancer (GEO Dataset GDS3324); specifically focusing on Thrombopoietin Receptor and connected elements.
5.3 A graph showing the differential expression of genes in breast cancer (GEO Dataset GDS3324); specifically focusing on Metastin and connected elements.
5.4 A graph showing the differential expression of genes in breast cancer (GEO Dataset GDS3324); specifically focusing on NOD1 and connected elements.
5.5 A graph showing the differential expression of genes in breast cancer (GEO Dataset GDS3324); specifically focusing on UBF and connected elements.


LIST OF ABBREVIATIONS

2D Two Dimensional

ACC Adenoid Cystic Carcinoma

ATP Adenosine triphosphate

BID BH3 Interacting Domain Death Agonist

BioPAX Pathway exchange language for Biological pathway data

BRCA1 Breast Cancer 1, Early Onset

CCT4 T-complex Protein 1 Subunit Delta

CDKN3 Cyclin-Dependent Kinase Inhibitor 3

CKS2 Cyclin-Dependent Kinases Regulatory Subunit 2

CPU Central Processing Unit

DNA Deoxyribonucleic acid

E2F5 Transcription Factor E2F5

EGR1 Early Growth Response Protein 1

ERCC1 DNA Excision Repair Protein ERCC-1

FAS Tumor Necrosis Factor Receptor Superfamily, Member 6

FASL Tumor Necrosis Factor Receptor Superfamily, Member 6 Ligand


FGFR1 Fibroblast Growth Factor Receptor 1

FGFR3 Fibroblast Growth Factor Receptor 3

GEO Gene Expression Omnibus

HOXC6 Homeobox Protein Hox-C6

HTML HyperText Markup Language

IGF2BP3 Insulin-like Growth Factor II mRNA-binding Protein 3

IGFBP2 Insulin-like Growth Factor Binding Protein 2

IGFBP4 Insulin-like Growth Factor Binding Protein 4

IGFBP5 Insulin-like Growth Factor Binding Protein 5

IGFBP6 Insulin-like Growth Factor Binding Protein 6

IL1R1 Interleukin 1 Receptor Type 1

IL6 Interleukin-6

IPTR3 Inositol Triphosphate Receptor 3

MED12 Mediator Complex Subunit 12

NFKB Nuclear Factor Kappa-light-chain-enhancer of Activated B Cells

NIH National Institutes of Health

NOD1 Nucleotide-binding Oligomerization Domain Containing 1

NR4A1 Nuclear Receptor Subfamily 4, Group A, Member 1


P53 Tumor Protein 53

PANTHER Protein ANalysis THrough Evolutionary Relationships

PDGFRβ Platelet-Derived Growth Factor Receptor Beta

PLK Serine/Threonine Kinase

POL-I Polymerase alpha 1

POLB DNA Polymerase β

PSA Prostate-Specific Antigen

PTMA Prothymosin α

PUMA P53 Upregulated Modulator of Apoptosis

RAD51 Recombination Protein A

RAM Random Access Memory

RNA Ribonucleic acid

SBML Systems Biology Markup Language

SEMA4D Semaphorin-4D

SEMA5A Semaphorin-5A

SEMA6A Semaphorin-6A

SEMA6D Semaphorin-6D

SEMA7A Semaphorin-7A


SMT Somatic Mutation Theory of Cancer

SOFT Simple Omnibus Format in Text

SOX4 Sry-related HMG Box 4

TNF Tumor Necrosis Factor

TNF-R1 Tumor Necrosis Factor Receptor 1

TOFT Tissue Organization Field Theory of Cancer

TPOR Thrombopoietin Receptor

TRAF2 TNF Receptor Associated Factor 2

UBF Upstream Binding Factor

VEGFR2 Vascular Endothelial Growth Factor Receptor 2

VEGFR3 Vascular Endothelial Growth Factor Receptor 3

XML Extensible Markup Language

XRCC4 X-ray Repair Cross-Complementing Protein 4


CHAPTER I

INTRODUCTION

Background and Significance

Integrating and interpreting information generated by numerous independent studies is a common problem facing the modern researcher. The problem is further complicated by impediments originating from methodological and, at the extreme, ideological differences between studies. Furthermore, the sheer volume of information available, approximately 50 million scholarly articles, requires a novel approach for the integration of knowledge at a fundamental level (Jinha, 2010). Clearly, tackling information of such magnitude is outside the purview of this study. Therefore, this study has been limited to the realm of cancer biology and specifically attempts to arrive at a fundamental mode of expressing genetic knowledge for the purpose of modeling cancer through automated means.

In general terms, cancer is characterized by the uncontrolled growth of cells. The epiphenomena of uncontrolled cell growth are explainable by molecular mechanisms that vary widely depending on the type of cancer and the specific biochemistry of the afflicted individual. This variance in mechanism proves to be a challenge to current treatment techniques (e.g. chemotherapeutic approaches that target cells with an elevated mitotic rate) that operate by targeting common mechanisms (Deisboeck, 2009). Furthermore, current diagnostic methods (e.g. histological screening) invariably apply the law of large numbers to inductively reason which method of treatment should be used (Deisboeck, 2009). This approach is by necessity inefficient at treating individuals who have rare types of cancer, particularly those of a sporadic nature. Additionally, there are few analytical measures that allow the clinician to predict the success of a specific treatment regimen. Therefore, a prerequisite of the methodology developed within this study was a sufficient tolerance for instances of cancer that lie outside the mean; stated more simply, the methodology must be capable of being applied toward personalized medicine.

The need for personalized treatment has been reiterated in the literature a number of times. Liakakos and Roukos (2008) reviewed treatments for gastric cancer. They noted that despite improvements in treatment methods, the cure rates for patients in the later stages of gastric cancer remained poor. Specifically, a 20 percent recurrence rate occurred three years postoperatively. Approximately 35 percent of patients were cured with surgery only, improving to 45 percent with the addition of chemotherapy. These are claimed to be the best rates currently achievable. The authors then described the urgent need to develop new therapeutic strategies. This need has arisen due to the relatively poor success of single-gene-based traditional research; the authors note that “No robust prognostic or predictive marker has been validated for clinical use despite multiple reports and hundreds of proposals” (Liakakos and Roukos, 2008). In contrast, the authors cite several reports that show success toward personalized medicine for various cancer types through the use of genetic markers identified by traditional research. Essentially, the authors state that markers are useful on an individual basis but do not necessarily apply to the disease as a whole. The use of markers to personalize treatment falls apart when dealing with sporadic cancers, due to the multiple genes and oncogenic pathways that are involved in carcinogenesis (Liakakos and Roukos, 2008). A key point is made that supports the direction taken by this study: “Single-gene-based traditional research cannot reveal the global signaling pathways network... methodologies and research directions should be altered for a faster clinical success” (Liakakos and Roukos, 2008).

In a review by Duffy and Crown (2008), the necessity for a more personalized method of treatment was emphasized. Specifically, the authors focused on the use of biomarkers for determining prognosis, predicting possible response to therapy, and pre-empting severe toxicity related to treatment of various cancers. They noted that the current approach of systemic cancer therapy can be less effective due to variations in a patient’s instance of the disease (Duffy and Crown, 2008). For example, patients with aggressive disease are sometimes undertreated, while patients with indolent disease can be overtreated (Duffy and Crown, 2008). This issue is in large part due to current treatment strategies revolving around a “one size fits all” mentality. The authors support the advancement of personalized treatment by stating that it “has the potential to increase efficacy and decrease toxicity” (Duffy and Crown, 2008).

A move to more personalized treatment of cancer will have a wide variety of implications (Duffy and Crown, 2008). It is anticipated that the time to market for new drugs will decrease through the use of predictive markers, due to the enriched population for participation in clinical trials (Duffy and Crown, 2008). Methods of diagnosis will also have to change due to more personalized approaches. A move from serum-based screening and monitoring to an expanded test involving prognostic, predictive, and toxicity markers will be necessary (Duffy and Crown, 2008). The greatest implication of personalized medicine will be for the patient. The probability of a positive response to personalized treatment is predicted to be better than with current non-personalized methodologies (Duffy and Crown, 2008). Lastly, the authors claim that personalized medicine will result in an overall reduction in costs for health care providers. This is not to diminish the importance of the collective knowledge gained from the integration of many studies; such knowledge is certainly important for guiding the direction of further inquiry even if it is not necessarily applicable to a single instance of the disease.

As to the mode of reasoning used within this study, a novel approach based on logic was applied. Specifically, beginning from empirical observations and following a rational approach based on reason, instances of cancer were modeled. This is a considerable divergence from current approaches based on mathematical models (Sargent, 2001). The principal difference lies in the phenomenal level at which the modeling occurs. The models reviewed by Sargent (2001) largely attempt to predict medical outcomes from the epiphenomena of the disease. In cancer this is typically done through use of the Classification of Malignant Tumours (TNM) staging system (Sobin et al., 2009). Although predicting medical outcomes is not the goal of this study, in practice the method developed here could be used for such purposes through application of the Knudson hypothesis (Nordling, 1953). The rationale for this modeling method is based on the understanding that cancer is the result of the accumulation of mutations within a cell (Nordling, 1953). It was later elaborated that the development of cancer depends upon the activation of oncogenes and the deactivation of tumor suppressor genes (Knudson, 1971). Therefore, of principal interest are those genetic elements that are known to be differentially expressed relative to normal expression. It should be remembered that this is a snapshot in time based on tissue samples or cells in culture; the landscape of that snapshot may change dramatically over time, potentially yielding a different dataset.

The models generated in this study attempt to explain cancer at the molecular level by deduction of genetic expression levels. The method developed within this study can be summarized as follows. A public repository of genetic information provides the abstract genetic relationships of the species of interest without reference to the expression levels of genes in any particular instance. This abstract information is then formatted into logical expressions that connect genetic elements together in a causal manner. This is the abstract logical model. From here, two instances (normal vis-à-vis cancer) describing the expression levels of as many genes as possible, discovered through empirical means, are compared, and each gene is labeled with one of three categories: overexpressed, normal, or underexpressed. This information is then imported into the abstract logical model. Finally, a solution to the logical model deduces the status of any genetic elements that were not described empirically. Any genetic elements that cannot be deduced are labeled unknown.

Armed with this method, several significant possibilities open. Treatments may be developed for a particular instance of the disease, to be used in a regimen of personalized medicine. Comparison of multiple studies could reveal common molecular mechanisms in a variety of cancer types. Optimization of existing treatment methods could be accomplished by studying particular types of cancer. The methodology alone could be applied to other diseases that have a high incidence of variation, being co-opted as a general model of the disease. The methodology could be used as a diagnostic standard, reducing the risk of misdiagnosis. By expanding the grammar of the methodology to encompass non-genetic factors (lifestyle, age, race, sex, etc.), it may be possible to predict a patient’s risk of developing cancer or the success of a treatment. Furthermore, the methodology can be used to verify our current understanding of cancer initiation and progression. In this study, after the method had been developed, two such applications were examined. First, a meta-analysis was performed across a series of different cancer types in order to reveal common functional mechanisms. Then a particular type of cancer (breast cancer) was examined in order to compare and contrast the results with our current knowledge of the disease.
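The workflow summarized in this section (label each measured gene, import the labels into the abstract model, then deduce what was not measured) can be illustrated with a short Python sketch. It is an illustration only and not the implementation used in this study: the numeric tolerance, the element names, and the simplified rule form (a target element takes on the level of its source) are all assumptions made for the example.

```python
from enum import Enum


class Level(Enum):
    UNDER = "underexpressed"
    NORMAL = "normal"
    OVER = "overexpressed"
    UNKNOWN = "unknown"


def label(normal: float, cancer: float, tolerance: float = 0.5) -> Level:
    """Compare a cancer measurement with its normal counterpart.

    The tolerance is an assumed cut-off chosen only for illustration.
    """
    difference = cancer - normal
    if difference > tolerance:
        return Level.OVER
    if difference < -tolerance:
        return Level.UNDER
    return Level.NORMAL


def solve(observed: dict[str, Level], rules: list[tuple[str, str]]) -> dict[str, Level]:
    """Deduce levels of unobserved elements from (source, target) rules.

    Assumed rule semantics: the target takes on the level of its source.
    Elements that can never be deduced remain UNKNOWN.
    """
    elements = set(observed)
    for source, target in rules:
        elements.update((source, target))
    state = {name: observed.get(name, Level.UNKNOWN) for name in elements}

    changed = True
    while changed:  # repeat until no rule can deduce anything new
        changed = False
        for source, target in rules:
            if state[target] is Level.UNKNOWN and state[source] is not Level.UNKNOWN:
                state[target] = state[source]
                changed = True
    return state


# Hypothetical abstract model and hypothetical measurements (normal, cancer).
RULES = [("geneA", "complexAB"), ("complexAB", "geneC"), ("geneE", "geneD")]
measurements = {"geneA": (1.0, 3.0), "geneB": (2.0, 2.1)}

observed = {gene: label(n, c) for gene, (n, c) in measurements.items()}
result = solve(observed, RULES)
for name in sorted(result):
    print(name, result[name].value)
```

In this toy model, geneC is never measured but is deduced through complexAB, whereas geneD stays unknown because nothing is known about its only source, geneE.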

Hypotheses

The hypotheses, per chapter, of this dissertation were as follows:

• Chapter II: A modeling approach can be defined that describes genetic relationships in a logic-based manner and allows unobserved gene expression information to be deduced from a priori observations.

• Chapter III: A program can be developed that combines sources of genetic knowledge and produces a model that conforms to the definition given in Chapter II.

• Chapter IV: Models, generated through application of the program developed in Chapter III, of established cancer cell lines will have a common set of significantly differentially expressed genes.


• Chapter V: Models, generated through application of the program developed in Chapter III, of breast cancer from in vivo sources will have a common set of significantly differentially expressed genes.

Chapter Summaries

Chapter II explores the theoretical conception of modeling genetic relationships in terms of logic. Chapter III then focuses upon building the technical bridge that allows the applied use of the theoretical methodology. Chapter IV attempts to derive mechanisms common to a variety of cancers through the applied use of the modeling methodology. Chapter V then narrows the scope of inquiry to a single type of cancer (breast cancer) and reconciles the results with our current knowledge of the progression of cancer. The results and conclusions of the previous chapters are then summarized in the final chapter.


Bibliography

Deisboeck, T. S. 2009. Personalizing medicine: a systems biology perspective. Molecular Systems Biology, 5.

Duffy, M. J. and J. Crown. 2008. A personalized approach to cancer treatment: How biomarkers can help. Clinical Chemistry, 54(11):1770–1779.

Jinha, A. 2010. Article 50 million: an estimate of the number of scholarly articles in existence. Learned Publishing, 23(3):258–263.

Knudson, A. 1971. Mutation and cancer: statistical study of retinoblastoma. Proceedings of the National Academy of Sciences, 68(4):820.

Liakakos, T. and D. H. Roukos. 2008. More controversy than ever – challenges and promises towards personalized treatment of gastric cancer. Annals of Surgical Oncology, 15(4):956–960.

Nordling, C. 1953. A new theory on the cancer-inducing mechanism. British Journal of Cancer, 7(1):68.

Sargent, D. 2001. Comparison of artificial neural networks with other statistical approaches. Cancer, 91(S8):1636–1642.

Sobin, L., M. Gospodarowicz, and E. Wittekind. 2009. TNM Classification of Malignant Tumors. Wiley-Blackwell, Oxford, 7th edition.


CHAPTER II

DEVELOPMENT OF AN ABSTRACT LOGICAL MODEL FOR EXPRESSING GENETIC RELATIONSHIPS

Introduction

It is generally understood that the function of a cell may be described by the continuous interactions of a variety of biochemical molecules. Furthermore, genetic elements serve the purpose of encoding and preserving the functional characteristics of these molecules. This ensures identical function, with respect to maintaining homeostasis, among subsequent generations of the same cell (barring mutations and extracellular modifiers). The purpose of the study detailed in this chapter was to develop a logical method for the description of genetic relationships in such a way as to allow the deduction of unobserved biochemical activity from a few a priori observations. The real utility of this methodology was exploited in the studies described in Chapters IV and V. Of principal concern during this study were the theoretical considerations of the abstract method without relation to a particular physical instance; that is to say, the pure rationale of the method. Before this study’s methodology was developed, a review of current methods was performed in order to justify the need for a novel direction. Given the multitude of methods that exist, only a few particularly relevant methods were reviewed. It should also be noted that the reviews performed were not exhaustive but focused principally on the general principles of each method and its potential applications.


Boolean Networks

Boolean networks, as described by Kauffman (1969), were the first method to be reviewed. The author begins by drawing an analogy between an automaton and a cell. The operation of the former is described by the program encoded in the wiring of the discrete logic that composes it. Similarly, the operation of a cell is described by the program encoded by the genome. The interaction of individual genetic elements (or, more accurately, their products after transcription and translation) is proposed by the author to operate in a fashion similar to discrete logic (e.g. AND, OR, NOR, etc.). The state of a gene, approximated by a Boolean value (on or off), is likewise analogous to the energized state of a wire connected to a logic gate. These logic gates (interacting genes) can be further aggregated to form networks encoding complex functions. The output of these networks can then be calculated via truth table analysis as described in the article (Kauffman, 1969). In principle, this allows all possible states of a cell to be enumerated. When this method was reviewed, several shortcomings were identified that made it unsuitable for use within this study. When referring to irregular function, it is insufficient to describe a gene as simply being expressed or not expressed. At a minimum, an ordinal qualification of the expression level is required because it delineates the differences between a normal cell and an abnormal cell (e.g. a cancer cell). The difference between a normal cell and a cancer cell may be as simple as the overexpression of an oncogene (Knudson, 1971). This alone was sufficient to disqualify this method from further investigation.
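The truth table analysis described above can be made concrete with a minimal Python sketch. The three-gene network and its gate assignments are hypothetical and were chosen only to show one AND, one OR, and one NOR interaction; they are not taken from Kauffman (1969) or from the models of this study.

```python
from itertools import product

# Hypothetical three-gene network. Each gene's next state is a Boolean
# function of the current states (a, b, c) of all three genes.
UPDATE_RULES = {
    "a": lambda a, b, c: b and c,        # AND gate
    "b": lambda a, b, c: a or c,         # OR gate
    "c": lambda a, b, c: not (a or b),   # NOR gate
}
GENES = ("a", "b", "c")


def step(state):
    """Apply every update rule to the same current state (synchronous update)."""
    return tuple(bool(UPDATE_RULES[gene](*state)) for gene in GENES)


def truth_table():
    """Enumerate all 2**3 network states together with their successors."""
    return {state: step(state) for state in product((False, True), repeat=len(GENES))}


for current, successor in truth_table().items():
    print(current, "->", successor)
```

Because each gene can only be on or off, the table has exactly eight rows. The limitation noted above is visible here: there is no way to express that a gene is on but overexpressed.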


Bayesian Networks

The next method reviewed was Bayesian networks as formalized by Friedman et al. (2000). This method, when used within the context of molecular biology, begins with the creation of a directed acyclic graph representing the interaction of genes. Specifically, the nodes of the graph represent genes and the edges represent the causal relationships between the genes. Genes that are not connected by an edge are said to be conditionally independent of each other. The expression level of each gene is determined by a probability function that is conditionally dependent on the parent nodes. This method has some particularly attractive features. First, it allows one to overcome the stochastic nature of gene expression and the issue of measurement error (Friedman et al., 2000). Second, the learning aspect of the Bayesian approach toward statistics allows incomplete knowledge to be expressed and handled in a reasonable manner (Friedman et al., 2000). This implies that as additional information is acquired, the veracity of the predicted outcomes would be expected to increase. Unfortunately, this method has one defect that prevents it from being of use within the goals of this study: the requirement for a directed acyclic graph. Cycles occur quite often within molecular biology; the Krebs cycle is one example.
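The acyclicity requirement can be checked mechanically, as the following Python sketch suggests. It uses the standard library's graphlib module (Python 3.9 or later), which refuses to order a graph that contains a cycle. The gene names and the two small graphs are hypothetical and serve only to contrast a linear cascade with a feedback loop.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(parents):
    """Return True if the graph can serve as the structure of a Bayesian network.

    `parents` maps each node to the set of nodes it depends on.
    """
    try:
        # Consuming the order forces the cycle check to run.
        tuple(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


# Hypothetical linear cascade: geneA -> geneB -> geneC.
cascade = {"geneB": {"geneA"}, "geneC": {"geneB"}}

# The same cascade with a hypothetical feedback edge geneC -> geneA.
feedback = {"geneB": {"geneA"}, "geneC": {"geneB"}, "geneA": {"geneC"}}

print("cascade acyclic:", is_acyclic(cascade))
print("feedback acyclic:", is_acyclic(feedback))
```

Any pathway containing a loop of this kind would have to be altered or unrolled before it could be encoded as a Bayesian network.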

Knowledge Based Approaches

Another method reviewed was a knowledge-based approach as developed by Meyers and Friedland (1984). This approach depends on the creation of a database of knowledge containing facts and rules. A fact, within molecular biology, may be a statement regarding an object. An example would be the concept of a gene, which is then defined as being composed of other facts such as expression level and sequence.


These sub-objects can then be further decomposed into other objects through reductive reasoning. The process of defining facts thus leads naturally to the creation of a web of knowledge. Rules, on the other hand, consist of two statements: a condition statement and an action statement. The condition statement expresses a condition dependent upon the properties of objects (e.g. the expression level of a gene being within a certain range). The action statement then determines the modification that occurs to a property of another object. A complete example of a rule is: if gene X is at expression level A, then gene Y is at expression level A. This approach has numerous advantages. By including time as a fact, it is possible to use this approach to simulate changes that occur over time, as demonstrated by Meyers and Friedland (1984) in their simulation of the growth control mechanisms in λ phage. Furthermore, the intuitive way in which knowledge is defined lends itself to a more natural way of dealing with biological information (Hofestadt, 1995). Such knowledge-based approaches can also be used to unify symbolic and quantitative information, as demonstrated by the genetic grammar developed by Hofestadt (1995). The greatest challenge to this approach is the lack of formality in defining knowledge databases. As a result, there tends to be a lack of consistency among the different authors who have taken this approach. This general modality of reasoning seemed particularly promising, especially when combined with a logical grammar to express knowledge.
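The structure of facts and condition/action rules, including time as a fact, can be sketched in Python as follows. This is a minimal illustration under stated assumptions and does not reproduce the system of Meyers and Friedland (1984): the fact names, the two rules, and the choice to evaluate all rules against the same snapshot before advancing time are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Facts = dict[str, object]  # property name -> value, e.g. "geneX.level" -> "A"


@dataclass(frozen=True)
class Rule:
    condition: Callable[[Facts], bool]  # condition statement
    action: Callable[[Facts], Facts]    # action statement: properties to modify


def tick(facts: Facts, rules: list[Rule]) -> Facts:
    """Advance the knowledge base by one time step.

    Every rule is evaluated against the same snapshot of the facts, and the
    resulting modifications are applied together with an incremented time.
    """
    updates: Facts = {}
    for rule in rules:
        if rule.condition(facts):
            updates.update(rule.action(facts))
    return {**facts, **updates, "time": facts["time"] + 1}


# Hypothetical rules: if gene X is at level A then gene Y is at level A,
# and if gene Y is at level A then gene Z is at level A.
RULES = [
    Rule(lambda f: f["geneX.level"] == "A", lambda f: {"geneY.level": "A"}),
    Rule(lambda f: f["geneY.level"] == "A", lambda f: {"geneZ.level": "A"}),
]

history = [{"time": 0, "geneX.level": "A", "geneY.level": "B", "geneZ.level": "B"}]
for _ in range(2):
    history.append(tick(history[-1], RULES))

for snapshot in history:
    print(snapshot)
```

The effect of gene X reaches gene Y after one step and gene Z after two, which is the kind of change over time that the inclusion of a time fact makes expressible.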

Answer Set Programming and Action Languages

Of the logical grammars, answer set programming was particularly attractive for its perceived ability to model uncertainty through the use of the soft logical “NOT”. The concept of modeling biological networks using answer set programming is discussed by Dworschak et al. (2008). Specifically, an intermediate language known as an action language was used, although action languages are ultimately translated to answer set programming. Briefly, answer set programming is a method of logic programming in which a problem is described in terms of Boolean relationships. An answer set solver then calculates solutions that satisfy the described problem (a problem is termed “satisfiable” if it has solutions). Action languages are designed to provide a declarative syntactical means for describing causal relationships and the transitions that occur in these relationships (Dworschak et al., 2008). In essence, action languages simplify answer set programming by providing syntax to describe a certain type of problem (namely, problems where transitions occur). Various kinds of reasoning can be implemented to plan and support experiments, thereby reducing the number of experiments that must be performed (Dworschak et al., 2008). It is also possible to design reasoning modes that allow for prediction of consequences and explanation of observations (Dworschak et al., 2008). Previously known information can be included in the model through the use of static causal laws (Dworschak et al., 2008). Lastly, the approach allows for easy expansion of a part of the model without requiring change to the rest of it (Dworschak et al., 2008).
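The notion of a problem being "satisfiable" can be illustrated with a small Python sketch that enumerates truth assignments by brute force. This is only an analogy: it does not implement answer set (stable model) semantics or the soft "NOT", and the variable names and constraints are hypothetical.

```python
from itertools import product


def solutions(variables, constraints):
    """Enumerate every truth assignment that satisfies all constraints."""
    found = []
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(check(assignment) for check in constraints):
            found.append(assignment)
    return found


variables = ["x_active", "y_active"]

# Hypothetical relationships: x implies y, and x is observed to be active.
constraints = [
    lambda a: (not a["x_active"]) or a["y_active"],
    lambda a: a["x_active"],
]

satisfiable = solutions(variables, constraints)

# Adding a contradictory observation (y inactive) leaves no solution.
unsatisfiable = solutions(variables, constraints + [lambda a: not a["y_active"]])

print(satisfiable, unsatisfiable)
```

A dedicated solver reaches the same kind of answer without enumerating every assignment, which is what makes the approach practical for larger networks.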

Dworschak et al. (2008) developed an action language known as C_TAID for modeling biological networks, and also provided a formal specification of the language. This language was successfully used to model the sulfur starvation response pathway of Arabidopsis thaliana (Dworschak et al., 2008). If the plant is unable to receive the required amount of sulfur, it will follow a complex
strategy in order to normalize its sulfur levels (Dworschak et al., 2008). First, the plant will attempt to form lateral roots to access additional sources of sulfur. If unsuccessful, the plant will use all remaining resources to form seeds (Dworschak et al., 2008). This specific behavior is reflected in the authors’ language in terms of the genes involved in the signal transduction pathway (Dworschak et al., 2008). By inputting a variety of environmental factors, potential responses from the modeled plant could then be computed (Dworschak et al., 2008). These responses were verified in vivo and closely resembled the responses evoked from the plant (Dworschak et al., 2008). This ability to model the “reasoning” used by a plant is unique to this method of modeling (Dworschak et al., 2008).

Another system to represent pathway information in terms of answer set programming was explored by Baral et al. (2004). The goal of this system, known as BioSigNet-RR, is to represent the qualitative information of biological networks and to emulate the reasoning done by biologists when analyzing such information (Baral et al., 2004). This approach is contrasted with more quantitative methods that rely on simulation and perturbation in order to explain phenomena. The foundation of the authors’ approach is to treat the biological network as a knowledge base to which many different kinds of queries can be made (Baral et al., 2004). This approach allows the inclusion of mechanisms that can handle incomplete or partial information, as is often the case with signaling networks and pathways (Baral et al., 2004). Three principal modes of reasoning were developed with BioSigNet-RR (Baral et al., 2004). Prediction is accomplished by first describing the biological facts of the pathway (Baral et al., 2004). Then observations are made over time,
such as a particular receptor being bound at the initial time point (Baral et al., 2004). From there it is possible to predict what will occur at a future time point by inferring the actions that will occur given the observations made (Baral et al., 2004). Explanation, the second reasoning mode, is accomplished by providing alternative facts that must be true in order to explain the observation. A hypothetical example is that the binding of TRAF2 to the TNF-R1 signaling complex will result in NFKB activation (Baral et al., 2004). However, if no activation of NFKB is observed upon TRAF2-TNF-R1 binding, the system will explain this observation by stating that TRAF2 is mutated in such a way that it will not activate NFKB (Baral et al., 2004). Planning, the final mode of reasoning, is implemented by initially defining a goal. BioSigNet-RR will then decide which actions to take in order to accomplish the goal (Baral et al., 2004).
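The prediction and explanation modes can be sketched in a few lines of Python using the TRAF2 example above. This is not BioSigNet-RR code; the single rule, the function names, and the `traf2_functional` flag are assumptions introduced only to make the two modes concrete.

```python
def nfkb_activated(traf2_bound, traf2_functional):
    """Hypothetical rule: bound, functional TRAF2 leads to NFKB activation."""
    return traf2_bound and traf2_functional


def predict(traf2_bound, traf2_functional=True):
    """Prediction: infer the outcome from what is known at the initial time point."""
    return nfkb_activated(traf2_bound, traf2_functional)


def explain(traf2_bound, observed_activation):
    """Explanation: list values of the unobserved fact that fit the observation."""
    return [
        functional
        for functional in (True, False)
        if nfkb_activated(traf2_bound, functional) == observed_activation
    ]


print(predict(True))
print(explain(True, False))
```

When binding is observed but activation is not, the only consistent value of the unobserved fact is a non-functional (e.g. mutated) TRAF2, which mirrors the explanation described in the text.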

Summary

Based on the successful use of answer set programming in the experiments performed by the previous authors, this was the logical grammar chosen for implementation within this study. However, the novel formalism created did not focus on the simulation of the genetic relationships of a cell over time. The rationale for not using this approach lies in the observation that the choice rules of action languages can often lead to multiple answer sets that are qualitatively identical and differ only in the ordering of actions. These qualitatively identical answer sets tend toward obfuscation as the number of answer sets increases. Instead, the focus of the formalism developed was upon the nominal characteristics of the genetic relationships of a cell. This was developed in such a way that only one answer set to any particular instance of a model existed.
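The ordering problem described above can be illustrated with a short Python sketch. It is not an action language; the three action names are hypothetical and are assumed to be independent of one another.

```python
from itertools import permutations


def final_state(order):
    """Apply independent actions in the given order and return the end state."""
    state = set()
    for action in order:
        state.add(action + "_done")
    return frozenset(state)


actions = ["express_x", "express_y", "express_z"]  # hypothetical, independent
orderings = list(permutations(actions))
distinct_end_states = {final_state(order) for order in orderings}

print(len(orderings), len(distinct_end_states))
```

Three independent actions already give six orderings that all reach the same end state, and the count grows factorially with the number of actions, which is the kind of redundancy the formalism was designed to avoid.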

Materials and Methods

The focus was to develop a general method of modeling genetic relationships without reference to a specific pathway or organism. One may say that this method was developed a priori through pure abstract reasoning. The two specific programs that were used in the development of this method were:

1. gringo (Version 3.0.2)

2. clasp (Version 1.3.5)

Both of these programs are available from the Potassco sourceforge project (http://potassco.sourceforge.net/). Briefly, gringo is a grounder that computes an equivalent variable-free program from an answer set program with variables. This is required in order to generate the appropriate input for the second program, clasp, which then attempts to solve the answer set program.
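What "computing an equivalent variable-free program" means can be shown with a naive Python sketch. This is not how gringo is implemented (a real grounder also simplifies and checks rules); the `ground` helper and the brace-delimited template syntax are assumptions for illustration only.

```python
from itertools import product


def ground(rule_template, variables, domains):
    """Substitute every combination of constants for the rule's variables."""
    grounded = []
    for combo in product(*(domains[v] for v in variables)):
        binding = dict(zip(variables, combo))
        grounded.append(rule_template.format(**binding))
    return grounded


# One rule with a variable X, grounded over a two-gene domain.
template = "exists({X}, unknown) :- not exists({X}, normal)."
ground_rules = ground(template, ["X"], {"X": ["a", "b"]})

for rule in ground_rules:
    print(rule)
```

Each variable is replaced by every constant in its domain, so a single rule over nine genes and several expression levels expands into many ground rules before the solver sees it.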

Results

The abstract conception of the syntax of the model was the first result to be generated. This was done without a semantic context in order to avoid any bias toward a specific instance of the model. The precise definition that a model conforms to is detailed in Appendix A. This was presented in a formal language in order to facilitate its realization in a variety of languages, not simply answer set programming. Fortunately, the formalism lends itself to easy implementation in answer set programming. An instance of a model was then generated in order to explore some of the capabilities of this modeling method. The answer set program of this model is listed in Appendix B. The set of individual relationships that compose it is tabulated in Table 2.1. The network representing the combination of the individual relationships is shown in Figure 2.1. The results for a few notable cases, after running the program through the grounder and solver (gringo and clasp respectively), are shown in Table 2.2.

Discussion

The modeling methodology produced conforms to a series of definitions designed to allow the rational deduction of activity in a cell that may not be directly measured (Appendix A). Note that the syntax used in the formalism (Appendix A) is to be taken in the nominal sense, not the quantitative sense. The first definition sets the stage for the fundamental physical objects that are to be reasoned upon, in this case all the genes within a cell. This set is limited to the current knowledge of the genes within a cell. This is an advantage, as the set can be freely modified over time, maintaining the relevance of the definition as new knowledge is acquired. Definition 2 then deepens the physical object by attributing a mutable character, or attribute, to it: its expression level. The set of expression levels is by definition bounded by what is possible in the set of all genes within a cell. In a global sense this implies a range of expression values (e.g. underexpressed, normal, and overexpressed) that can be attributed to individual genes. In a local sense an individual gene is bounded by those genes that are directly causally related to it; in terms of a graph, each node (i.e. gene) is bounded by the nodes connected to it. The third definition serves the purpose of grounding the abstract gene to a particular instance by binding it to an expression level. These three definitions alone are
sufficient to describe observed genes, although this does little more than formalize a set of empirical observations and does not generate new knowledge. In order to generate new knowledge, at least one more definition is necessary. Definition 4 connects individual genes together in a causal manner by creating an assertion between two subsets of genes (i.e. cause and effect). If the cause subset of genes is known to be at a particular expression level, then the effect subset of genes can be deduced to be at an equivalent expression level. It is through this form of deduction that new knowledge can be generated. The equivalency between the subsets of genes is dependent on the characterization of the causal relationship. For example, if gene X causes gene Y and one knows that gene X is at expression level normal, then one can deduce that gene Y is also at expression level normal. Naturally this codifies a form of relativism, in that the real value of what is normal depends upon the particular gene about which one is reasoning. This has the advantage of avoiding quantitative difficulties when the specific reaction mechanics between two subsets of genes are unknown. The fifth Definition extends the previous definition by establishing the reverse assertion in a specific case. If the effect subset of genes is observed and there is only one known causal relationship that ends in that effect subset of genes, then the cause subset of genes can be deduced to be at an equivalent expression level. This definition avoids issues of disjunction that complicate and weaken the verifiability of what is deduced; in other words, one is only interested in what can safely be asserted given our current knowledge. The sixth and seventh Definitions allow non-direct relationships (e.g. repressors) to be defined.
These two Definitions are similar in scope and limitation to Definitions 4 and 5, which define direct relationships. The eighth
Definition is a constraint preventing a physical impossibility from occurring. For example, within a cell, it is impossible for a gene to be simultaneously expressed at normal and overexpressed expression levels. The final Definition, 9, simply states that if the expression level of a gene cannot be deduced then its expression level is unknown, essentially codifying the deduction of not being able to deduce anything regarding the expression level of a gene.

A model was generated from the formalism in such a way as to satisfy a few desired criteria. The model was to be as simple as possible yet demonstrate complexity that may be encountered on a larger scale. This was accomplished by integrating circular relationships as well as disjunctive relationships. These features are visually demonstrated (Figure 2.1) and encoded in the answer set program (Appendix B). The results of solving this model were surprising and a number of these results were tabulated (Table 2.2). When no initial conditions are given, nothing can be deduced regarding the expression levels of any of the genes (Trial 1, Table 2.2). This was to be expected and, as such, was merely a confirmation of what one could logically deduce. The second and third trials (Table 2.2) specifically test how tightly deductions can be made. For example, if the network (Figure 2.1) were tightly coupled, then theoretically the network could be predicted from as little as one initial condition, through forward and backward deduction (Definitions 4 and 5, Appendix A). This was confirmed by the second trial, in which one initial condition led to the deduction of most of the network. However, by switching the one initial condition to a gene which lies outside of the circular subnetwork (bgfc), the solution becomes terminal (i.e. nothing additional could be deduced; Trial 3, Table 2.2). This
illustrates the well-known issue of impredicativity, recognized by logicians in the late 19th century (Frege, 1879). Modifications to Definition 8 were performed, by loosening the constraints upon the formalism, in an attempt to test for a solution (Appendix A). However, due to the increasing number of answer sets generated, these results were not considered practical for this application. Since this looser formalism was not adopted for the subsequent study, the results were not reported. Although the current stricter formalism contained issues of impredicativity, it still remained useful, as demonstrated by the decoupling of the network in trial 4 (Table 2.2). These results suggested that cycles could serve to decouple segments of a network from each other, allowing a certain amount of autonomy between different segments. The fifth trial demonstrates that when the end effect of a disjunction is known, it is impossible to deduce the cause without further information. The sixth trial is similar to the third trial except that it demonstrates a non-direct, inverse relationship. Specifically, the solution demonstrates that when a is overexpressed, i is underexpressed (Table 2.2). Finally, the last trial shows that setting contrary facts as initial conditions results in a model that is unsatisfiable (i.e. no answer set). This is a positive feature, as it prevents incorrect deductions and reveals places where decoupling is necessary. In such situations it is reasonable to believe that an individual’s current knowledge of the genetic network is insufficient to verify the truth of the model. Therefore further empirical investigation is necessary to elaborate upon what is currently known.

In summary, a method of modeling the genetic relationships within a cell was developed and formalized. This development is fundamentally a knowledge-based approach. However, it extends beyond similar current approaches by applying the
rules of logic in order to reason new information from prior known information. A model based on the developed formalism was then tested and found capable of generating new knowledge, beginning with only a few initial assertions. However, the model was found to suffer from the classical logic issue of impredicativity in certain situations. Nonetheless, if these issues are found to exist in models generated for real phenomena, this serves as a reminder that our current knowledge is insufficient to resolve the inadequacy. In this regard, the formalism may also be used to guide future research efforts. Finally, there is a larger question that begs to be asked: Does life conform to the rules of logic? Although this study has been performed in the spirit of instrumentalism, the question remains as to the applicability of this modeling method to realistically portraying biological systems.

Bibliography

Baral, C., K. Chancellor, N. Tran, N. L. Tran, A. Joy, and M. Berens. 2004. A knowledge based approach for representing and reasoning about signaling networks. Bioinformatics, 20(Suppl 1):i15.

Dworschak, S., S. Grell, V. J. Nikiforova, T. Schaub, and J. Selbig. 2008. Modeling biological networks by action languages via answer set programming. Constraints, 13(1-2):21–65.

Frege, G. 1879. Begriffsschrift: eine der arithmetischen nachgebildete Formelsprache des reinen Denkens. Halle.

Friedman, N., M. Linial, I. Nachman, and D. Pe’er. 2000. Using Bayesian networks to analyze expression data. Journal of computational biology, 7(3-4):601–620.

Hofestadt, R. 1995. Interactive modelling and simulation of biochemical networks. Computers in Biology and Medicine, 25:321–334.

Kauffman, S. 1969. Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of theoretical biology, 22(3):437–467.

Knudson, A. 1971. Mutation and cancer: statistical study of retinoblastoma. Proceedings of the National Academy of Sciences, 68(4):820.

Meyers, S. and P. Friedland. 1984. Knowledge-based simulation of genetic regulation in bacteriophage lambda. Nucleic acids research, 12(1Part1):1.

Table 2.1: The individual relationships of an instance of a model that was implemented in Appendix B.

Identifier  Relationship
1           {h} → {e}
2           {d} → {e}
3           {c} → {d}
4           {f, g} → {c}
5           {g} → {f}
6           {c} → {b}
7           {b} → {g}
8           {b} → {a}
9           {a} ⇒ {i}

Table 2.2: The application of a variety of initial conditions to the model listed in Appendix B and their solutions.

Trial  Initial Conditions  Solution
1      None                exists(i,unknown) exists(h,unknown) exists(g,unknown)
                           exists(f,unknown) exists(e,unknown) exists(d,unknown)
                           exists(c,unknown) exists(b,unknown) exists(a,unknown)
2      exists(g,normal)    exists(g,normal) exists(i,normal) exists(h,unknown)
                           exists(f,normal) exists(e,normal) exists(d,normal)
                           exists(c,normal) exists(b,normal) exists(a,normal)
3      exists(a,normal)    exists(a,normal) exists(i,normal) exists(h,unknown)
                           exists(g,unknown) exists(f,unknown) exists(e,unknown)
                           exists(d,unknown) exists(c,unknown) exists(b,unknown)
4      exists(a,normal)    exists(a,normal) exists(d,low) exists(i,normal)
       exists(d,low)       exists(h,unknown) exists(g,unknown) exists(f,unknown)
                           exists(e,unknown) exists(c,unknown) exists(b,unknown)
5      exists(e,normal)    exists(e,normal) exists(i,unknown) exists(h,unknown)
                           exists(g,unknown) exists(f,unknown) exists(a,unknown)
                           exists(d,unknown) exists(c,unknown) exists(b,unknown)
6      exists(a,high)      exists(a,high) exists(i,low) exists(h,unknown)
                           exists(g,unknown) exists(f,unknown) exists(e,unknown)
                           exists(d,unknown) exists(c,unknown) exists(b,unknown)
7      exists(g,normal)    UNSATISFIABLE
       exists(c,low)

Figure 2.1: The network representing the causal relationship between the variables (a through i). Different arrows are used for edges leading to e in order to signify that they uniquely imply e and are not co-dependent. The relationship between a and i is of a non-direct, inverse type.

[Figure 2.1: network diagram with nodes a through i; the relationships drawn are those listed in Table 2.1.]

Appendix A

A model conforms to the following definitions:

Definition 1. Let $X$ be the set of all genes within a cell such that $X_i$ is a specific gene and where $i$ is the unique identifier of said gene.

Definition 2. Let $Y$ be the set of all possible expression levels of $X$ such that $Y_a$ is a specific expression level.

Definition 3. Let the product of $X_i$ and $Y_a$, represented as $Z_{ia}$, be the assertion that the gene $X_i$ exists at the expression level $Y_a$.

Definition 4. Let a direct causal relationship be the orderly assertion between the sets $\{X_i,\ldots,X_j\} \rightarrow \{X_k,\ldots,X_l\}$ such that $\{Z_{ia},\ldots,Z_{ja}\}$ implies $\{Z_{ka},\ldots,Z_{la}\}$ when one has no reason to believe $a$ is unknown.

Definition 5. Let the form $\{X_i,\ldots,X_j\} \rightarrow \{X_k,\ldots,X_l\}$ also assert that $\{Z_{ka},\ldots,Z_{la}\}$ implies $\{Z_{ia},\ldots,Z_{ja}\}$ if and only if $\{X_i,\ldots,X_j\} \rightarrow \{X_k,\ldots,X_l\}$ is the only direct causal relationship in the direction of $\{X_k,\ldots,X_l\}$ and one has no reason to believe $a$ is unknown.

Definition 6. Let a non-direct causal relationship be the orderly assertion between the sets $\{X_i,\ldots,X_j\} \Rightarrow \{X_k,\ldots,X_l\}$ such that $\{Z_{ia},\ldots,Z_{jb}\}$ implies $\{Z_{kc},\ldots,Z_{ld}\}$ when one has no reason to believe $a$, $b$, $c$, or $d$ is unknown.

Definition 7. Let the form $\{X_i,\ldots,X_j\} \Rightarrow \{X_k,\ldots,X_l\}$ also assert that $\{Z_{kc},\ldots,Z_{ld}\}$ implies $\{Z_{ia},\ldots,Z_{jb}\}$ if and only if $\{X_i,\ldots,X_j\} \Rightarrow \{X_k,\ldots,X_l\}$ is the only non-direct causal relationship in the direction of $\{X_k,\ldots,X_l\}$ and one has no reason to believe $a$, $b$, $c$, or $d$ is unknown.

Definition 8. It is impossible to assert that $\{Z_{ia},\ldots,Z_{ib}\}$ exists.

Definition 9. Let $Z = Z \cup \{Z_{ic}\}$, where $c$ is unknown, when one has no reason to believe $X_i$ is at any $Y_a$.

Appendix B

A short but relatively complex answer set program representing the network in Figure 2.1:

%genes that can exist
species(a). species(b). species(c). species(d). species(e).
species(f). species(g). species(h). species(i).
#domain species(X).

%possible expression levels
state(normal). state(low). state(high). state(unknown).
#domain state(Y).

%unknown when not any other expression level
exists(X, unknown) :- not exists(X, normal), not exists(X, low), not exists(X, high).

%constraints
:- exists(X, normal), exists(X, low).
:- exists(X, normal), exists(X, high).
:- exists(X, low), exists(X, high).

%reverse direct rules to determine if a gene exists at an expression level
exists(b, Y) :- exists(a, Y), exists(g, Y), not exists(a, unknown), not exists(g, unknown).
exists(f, Y) :- exists(c, Y), not exists(c, unknown).
exists(c, Y) :- exists(d, Y), exists(b, Y), not exists(d, unknown), not exists(b, unknown).
exists(f, Y) :- exists(c, Y), not exists(c, unknown).
exists(g, Y) :- exists(c, Y), exists(f, Y), not exists(c, unknown), not exists(f, unknown).

%direct rules to determine if a gene exists at an expression level
exists(a, Y) :- exists(b, Y), not exists(b, unknown).
exists(g, Y) :- exists(b, Y), not exists(b, unknown).
exists(b, Y) :- exists(c, Y), not exists(c, unknown).
exists(d, Y) :- exists(c, Y), not exists(c, unknown).
exists(e, Y) :- exists(d, Y), not exists(d, unknown).
exists(c, Y) :- exists(f, Y), exists(g, Y), not exists(f, unknown), not exists(g, unknown).
exists(f, Y) :- exists(g, Y), not exists(g, unknown).
exists(e, Y) :- exists(h, Y), not exists(h, unknown).

%reverse non-direct rules to determine if a gene exists at an expression level
exists(a, normal) :- exists(i, normal), not exists(i, unknown).
exists(a, high) :- exists(i, low), not exists(i, unknown).
exists(a, low) :- exists(i, high), not exists(i, unknown).

%non-direct rules to determine if a gene exists at an expression level
exists(i, normal) :- exists(a, normal), not exists(a, unknown).
exists(i, high) :- exists(a, low), not exists(a, unknown).
exists(i, low) :- exists(a, high), not exists(a, unknown).

%show only what exists after solving
#hide.
#show exists(X, Y).
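For readers without access to gringo and clasp, the deductions encoded by the program above can be approximated with a plain forward-chaining loop in Python. This is a sketch, not the method used in this study: it assumes that, for this particular program, repeatedly firing the rules until nothing changes and then labelling the remaining genes as unknown yields the same result as the answer set solver. The names `SAME_LEVEL_RULES`, `INVERSE`, and `solve` are hypothetical.

```python
LEVELS = ("normal", "low", "high")
GENES = "abcdefghi"

# (head, body): the head gene takes level Y when every body gene is at level Y.
SAME_LEVEL_RULES = [
    # reverse direct rules, copied from the listing above
    ("b", ("a", "g")), ("f", ("c",)), ("c", ("d", "b")), ("g", ("c", "f")),
    # direct rules, copied from the listing above
    ("a", ("b",)), ("g", ("b",)), ("b", ("c",)), ("d", ("c",)),
    ("e", ("d",)), ("c", ("f", "g")), ("f", ("g",)), ("e", ("h",)),
]

# Non-direct (inverse) relationship between a and i, in both directions.
INVERSE = {"normal": "normal", "low": "high", "high": "low"}
INVERSE_RULES = [("i", "a"), ("a", "i")]  # (head, body)


def solve(initial):
    """Return {gene: level} for the given initial conditions, or None if contradictory."""
    facts = set(initial.items())
    changed = True
    while changed:
        derived = set()
        for head, body in SAME_LEVEL_RULES:
            for level in LEVELS:
                if all((gene, level) in facts for gene in body):
                    derived.add((head, level))
        for head, body in INVERSE_RULES:
            for level in LEVELS:
                if (body, level) in facts:
                    derived.add((head, INVERSE[level]))
        changed = not derived <= facts
        facts |= derived
    # Constraint: a gene cannot be at two expression levels at once.
    for gene in GENES:
        if sum((gene, level) in facts for level in LEVELS) > 1:
            return None
    return {
        gene: next((lv for lv in LEVELS if (gene, lv) in facts), "unknown")
        for gene in GENES
    }


print(solve({"g": "normal"}))
print(solve({"g": "normal", "c": "low"}))
```

Called with the initial conditions of trials 1, 2, 3, 5, and 6, the sketch returns the solutions listed for those trials in Table 2.2, and the contradictory conditions of trial 7 return `None`, corresponding to UNSATISFIABLE.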

CHAPTER III AUTOMATION OF LOGICAL MODEL GENERATION AND ANALYSIS FOR HOMO SAPIENS

Introduction

Having previously established (Chapter II) the conceptual framework for the methodology of expressing genetic relationship information in a logical manner, this study focused on creating the technical bridge from the theoretical methodology to its applied use. Briefly, in the previous study a modeling method was developed based on the treatment of genetic relationships in a cell as logical expressions. These expressions can then be operated upon in order to deduce expression levels for genes that are unobserved. It was then demonstrated that, in order to deduce any new information, typically at least one observed expression level must be asserted. In order for this modeling approach to be of instrumental use, it must be able to express the accumulated genetic knowledge that has been discovered to date. Furthermore, the results generated from the model, through analysis, must be intelligible. This task, by necessity, requires a technical approach in order to build a software solution that can import prior genetic knowledge, build a corresponding logical model, analyze the solution, and visualize the results.

Public Genetic and Biochemical Databases

In order to choose the appropriate technologies and modalities of their use, it was critical to review relevant literature concerning these technologies. High priority was given to identifying a publicly available repository of genetic knowledge that supported the exportation of the genetic network of a species (specifically Homo sapiens)
in a standardized format while concurrently demonstrating the capability to support cross-referencing identifiers to integrate with other databases. Furthermore, the database would ideally be curated to ensure the accuracy of the genetic networks. Reactome (http://www.reactome.org) was found to contain all the desired features of a public genetic knowledge repository. Reactome is a publicly accessible database created by Joshi-Tope et al. (2005) in an effort to combine published, peer-reviewed information regarding biological processes into a consolidated form that could be readily accessible to researchers. Reactome provides a unified view of these biological processes by displaying the links between gene products (Joshi-Tope et al., 2005). The components of the biological networks in Reactome were selected from well-established information regarding the processes, while contentious information was deferred for inclusion at a later date (Joshi-Tope et al., 2005). Archived information is largely from Homo sapiens, though other models are used in places where gaps in knowledge exist (Joshi-Tope et al., 2005). Reactome supports exportation of genetic networks in SBML and BioPAX formats. In addition, Reactome provides raw MySQL database dumps which include cross-referencing identifiers with other genetic databases such as the US National Institutes of Health (NIH) GenBank database (http://www.ncbi.nlm.nih.gov/genbank/).

PANTHER (http://www.pantherdb.org) and BioCyc (http://www.biocyc.org) were also identified as possible databases. However, each had weaknesses that precluded their use in this study. PANTHER is a browsable database of gene products in Homo sapiens, among others, created by Thomas (2003). Briefly, this database is organized by the functional characteristics of protein groups identified through one of two means. Either the functional
characteristics of a protein group are asserted by the curators of the database or are automatically classified by statistical models (Thomas, 2003). Furthermore, pathways are contributed to communally, thus ensuring continued relevance of the information in the database (Thomas, 2003). However, no mention is made of the accuracy of contributed information. PANTHER supports the exportation of pathway information in both SBML and BioPAX formats and may be integrated with other databases through the use of the GenBank identifier system. The principal reason why PANTHER was not adopted is that not all of its information is backed by peer-reviewed literature.

BioCyc is another publicly accessible database containing genetic network information, created by Karp et al. (2005). BioCyc attempts to be a comprehensive solution by containing genetic information for virtually every sequenced species (Karp et al., 2005). Furthermore, this database provides inferential features enabling the determination of metabolic networks from primary research compiled from other species (Karp et al., 2005). The information in the database is curated, although the authors mention a shortage in the number of working curators. As a result, not all the information in the database has been annotated. Various tools are provided to allow users to visualize genomic and proteomic information derived from the database. In addition, BioCyc can export its information in SBML and BioPAX formats. BioCyc also provides dumps of its database through a dual licensing model. This database was not chosen due to the limitations of the academic license.

The next requirement, a source for the initial conditions of a generated model (used in order to instantiate the genetic expression of a particular cell type), must be
reviewed. Since Reactome is readily compatible with the NIH identifier system for genes, the NIH Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) was chosen as the source to fulfill this requirement. The Gene Expression Omnibus (GEO) was initially created by Edgar et al. (2002) for the purpose of establishing a public repository of high-throughput gene expression data. The principal concern of the authors was to establish a system to catalog the minimum amount of information from gene expression experiments in a manner that would allow the results from heterogeneous experimental methods to be compared (Edgar et al., 2002). This was accomplished by segregating the information recorded by GEO into three distinct components stored in a relational database (Edgar et al., 2002). The first component, platform, defines the set of molecules that may be detected by the list of probes used in the experiment (Edgar et al., 2002). The second component, sample, defines the molecular abundance detected by a probe; this may be recorded in relative or absolute terms depending on the experimental method employed (Edgar et al., 2002). Lastly, series serves the purpose of binding together, in a meaningful manner, the sample sets that make up an experiment (Edgar et al., 2002). It is also worth mentioning that the information stored by GEO may be exported in a number of formats, including a computer-readable, tab-delimited format (SOFT file) for further processing. This feature, coupled with the use of the NIH gene identifier system, made it ideal for use in this study.

Finally, PubChem (http://pubchem.ncbi.nlm.nih.gov) was belatedly reviewed. PubChem, first published by Wang et al. (2009), is hosted by the U.S. National Institutes of Health (NIH). The project began in 2004 as an effort to discover chemical probes that could modify the activity of gene products through
high-throughput screening of small molecules (Wang et al., 2009). PubChem is subdivided into three related databases: Substance, Compound, and BioAssay (Wang et al., 2009). The Substance database contains descriptions of the samples collected (Wang et al., 2009). The Compound database is focused on the chemical structures derived from the samples collected (Wang et al., 2009). The third database, BioAssay, contains the bioactivity screens of the chemical substances deposited in PubChem (Wang et al., 2009). These bioactivity screens are contributed by a variety of research organizations, private companies, and the NIH Molecular Library Program (Wang et al., 2009). The specific information available from the BioAssay database includes biological activity descriptions, test results (such as percent inhibition), and the assay protocol used (Wang et al., 2009). This database was chosen because it contains information on small molecules known to interact with gene products and can be integrated with Reactome through the use of cross-referencing identifiers. This information could be used to identify compounds that may translate to treatments for cancer in future studies. However, it is important to note that the interface between Reactome, the logical model, and PubChem was built during this study.

Representation Formats

Reactome, the genetic database used in this study, offers two export formats (SBML and BioPAX). To determine which export format was more appropriate for use in this study, each format was reviewed. Hucka et al. (2003) described the Systems Biology Markup Language (SBML) as a “...free, open, XML-based format for representing biochemical reaction networks”. The authors state that this format was developed from the interdisciplinary needs of systems biologists, specifically the need to integrate theory, computational modeling, and experimentation. The language was also developed to alleviate concerns of incompatibilities between emerging computational tools within the biological sciences (Hucka et al., 2003). These concerns are typified by the need to work with multiple tools, unusable models from old systems, and proprietary model definitions (Hucka et al., 2003).

The SBML format is subdivided into a number of elements: compartment, species, reaction, parameter, unit definition, and rule (Hucka et al., 2003). The compartment element is the prescribed location for substances and reactions. Species elements represent the chemical substances that are used in a reaction. Reaction elements are descriptions of chemical reactions that transform, transport, or bind species. These reaction elements also have rate laws describing the quantitative portion of the reaction. Parameter elements are modifiers that apply to either the whole model or to a single reaction. The unit definition element is used to express the quantities in a model. Lastly, the rule element is a mathematical equation used to establish quantitative relationships among other elements.

SBML has been widely accepted and used within the computational biology community. Several notable tools, including Cellerator, DBsolve, E-CELL, Gepasi, Jarnac, NetBuilder, ProMoT/DIVA, StochSim, and Virtual Cell, have reported using SBML (Hucka et al., 2003). Furthermore, given the community-driven nature of the format, it is expected that systems using this format will remain applicable in the future. Therefore, in order to leverage modeling information already present within the community, SBML may be a useful format to support within the system proposed.
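As a minimal sketch of how the species and reaction elements described above can be read, the following Python fragment parses a small, hand-written SBML Level 2 document. It uses the standard library module xml.etree.ElementTree rather than the lxml library used in this study, so that it has no external dependencies. The model content, the identifiers (fasl, fas, complex, r1), and the helper name read_species_and_reactions are hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET

# SBML Level 2 namespace, written in ElementTree's "{uri}tag" form.
NS = "{http://www.sbml.org/sbml/level2}"

# Hypothetical three-species, one-reaction model (not taken from Reactome).
SBML_TEXT = """
<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="toy">
    <listOfSpecies>
      <species id="fasl" compartment="c"/>
      <species id="fas" compartment="c"/>
      <species id="complex" compartment="c"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="r1">
        <listOfReactants>
          <speciesReference species="fasl"/>
          <speciesReference species="fas"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="complex"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def read_species_and_reactions(text):
    """Return (species ids, {reaction id: (reactant ids, product ids)})."""
    root = ET.fromstring(text.strip())
    species = [s.get("id") for s in root.iter(NS + "species")]
    reactions = {}
    for rxn in root.iter(NS + "reaction"):
        reactants = [
            ref.get("species")
            for ref in rxn.findall(NS + "listOfReactants/" + NS + "speciesReference")
        ]
        products = [
            ref.get("species")
            for ref in rxn.findall(NS + "listOfProducts/" + NS + "speciesReference")
        ]
        reactions[rxn.get("id")] = (reactants, products)
    return species, reactions


species, reactions = read_species_and_reactions(SBML_TEXT)
```

Only the species and reaction elements are read here; compartments, parameters, unit definitions, and rules are ignored because the qualitative model does not appear to require them.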


The other Reactome-supported format is the pathway exchange language for biological pathway data (BioPAX). This format is described by Luciano (2005) as another attempt to consolidate biochemical pathway information into a standardized format for regular use within the scientific community. A distinctive feature of BioPAX is that it contains support for previously existing pathway formats (Luciano, 2005). Specifically, the Proteomics Standards Initiative Molecular Interaction (PSI-MI) and SBML formats are supported by BioPAX (Luciano, 2005). BioPAX, much like SBML, is based on XML, which allows it to be readily rendered and supported by web-enabled technologies. A major difference between BioPAX and SBML is in the data organization. BioPAX is arranged into four levels, each representing a higher degree of molecular detail. These levels were developed by experts in the specific domain that a particular level is designed to represent (Luciano, 2005). Level 1 provides an abstraction specifically tailored for representing metabolic pathways (Luciano, 2005). Level 2 is an additional level of abstraction that provides a framework to describe protein-protein, protein-DNA, and protein-RNA interactions (Luciano, 2005). Level 3 builds on Level 2 by providing a method for describing genetic interactions, i.e., DNA-DNA or DNA-RNA (Luciano, 2005). The final level provides all the benefits of the previous levels and incorporates a generalized model for describing any type of molecular interaction (Luciano, 2005). Ultimately, SBML was selected for use in this study because of its strong community support and continued development.


Summary

In summary, this study focused upon developing the technology required to enable the applied use of the modeling method described in Chapter II. In essence, a series of interrelated tasks needed to be addressed. First, the abstract genetic network within a cell was imported, the source of which was the information in Reactome. This information was then parsed to generate a corresponding logical model. With the abstract conception of the cell’s genetic network available, the model then had to be instantiated, or realized, for a particular cell. To this end, the gene expression information for a particular cell was imported into the logical model from a sample in the GEO database. This complete model could then be solved in order to reveal the expression levels of any unobserved genes. Finally, the solution to the model could be analyzed and visualized through tools that were developed within this study. Information revealed in this way can then be used for applied ends. However, specific applied uses were outside the scope of this study, which focused on the realization of the technological means.
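The instantiation step summarized above can be sketched in a few lines of Python. The sketch assumes that an observed expression level is asserted as a fact of the form exists(identifier, level), which matches the exists predicate used by the abstract model in Appendix A: Listing 3.1; the exact fact format written by the study's scripts is not shown in this section, so this is an assumption. The function names expression_facts and instantiate and the identifiers in the example are hypothetical.

```python
# Qualitative levels that may be asserted for an observed gene.
ALLOWED_LEVELS = {"low", "normal", "high"}


def expression_facts(levels):
    """Render {species_id: level} as one ASP fact per line (assumed format)."""
    lines = []
    for species_id in sorted(levels):
        level = levels[species_id]
        if level not in ALLOWED_LEVELS:
            raise ValueError("unexpected expression level: %r" % (level,))
        lines.append("exists(%s, %s)." % (species_id, level))
    return "\n".join(lines) + "\n"


def instantiate(abstract_model, levels):
    """Concatenate the abstract model text with the observed expression facts."""
    return abstract_model.rstrip("\n") + "\n" + expression_facts(levels)


# Hypothetical abstract model fragment and observations.
model_text = instantiate("species(a).\nspecies(b).\n", {"b": "high", "a": "low"})
```

The resulting text would then be passed to the grounder and solver; genes without an asserted level are left for the solver to deduce or mark as unknown.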

Materials and Methods

In order to complete this study, the following materials were required. A server environment was required as a local repository for the information gathered and generated by the software developed to run on said server. In order to develop the software solution for this study, certain development software was also required. Lastly, the selected public databases were also used to develop the software solution. The software was developed and deployed on a custom-built server with the following specifications:


1. Motherboard: ASUS KFN32-D SLI/SAS Server Motherboard

2. CPUs: Dual AMD Opteron 8346 HE Quad Core Processors

3. RAM: 8 GB DDR2 667 MHz

4. Hard Drives: 120 GB RAID 0 Array (4 Drives), 400 GB RAID 1 Array (2 Drives)

5. Video Card: NVIDIA GeForce 6800 Ultra

The developmental software used within this study is listed along with a short description thereof:

1. Ubuntu Linux (Version 11.04): A Unix-like operating system.

2. MySQL (Version 5.1.54): A relational database management system.

3. gringo (Version 3.0.2): An Answer Set Programming grounder.

4. clasp (Version 1.3.5): An Answer Set Programming solver.

5. Python (Version 2.7.1): A general purpose programming language.

6. lxml (Version 2.3.3): A Python library for processing XML and HTML.

7. MySQL-Python (Version 1.2.3): A Python library for interacting with a MySQL database.

8. NetworkX (Version 1.6): A Python library for creating and manipulating graphs.

9. matplotlib (Version 1.1.0): A Python library for plotting 2D figures.


Lastly, the public databases used within this study, with a short description of each, are as follows:

1. Reactome (http://www.reactome.org): A database of genetic relationships.

2. Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo): A database of high-throughput gene expression information.

3. PubChem (http://pubchem.ncbi.nlm.nih.gov): A database of chemical molecules and their activity against biological targets.

In order to test and debug the software solution developed, a dataset from the NIH Gene Expression Omnibus was required. A requirement for the dataset was that it contain gene expression information for both normal and cancer cell types. The dataset chosen for use within this study was a microarray analysis performed on a variety of cervical cancer cell lines by Scotto et al. (2008). This dataset was published on GEO and has the identifier GDS3233. The full SOFT file of this dataset, containing the gene expression information derived from the microarray analysis, was used in this study.

The methods employed within this study are best described in terms of the programming paradigms used. The three paradigms used were imperative, functional, and logical programming. Briefly, imperative programming describes computation as a series of directives that change program state; simply stated, a program operates upon mutable variables. This is contrasted with functional programming, in which computation is treated as the evaluation of mathematical functions without state or mutable variables. Lastly, logical programming can be summarized as treating computation as the search for solutions to a set of logical formulas.
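The contrast among the three paradigms can be illustrated with a short Python sketch. The example values and formulas are hypothetical. The logical portion uses a brute-force enumeration of truth assignments purely for illustration; this is a stand-in and not how an Answer Set Programming solver such as clasp operates internally.

```python
from functools import reduce
from itertools import product

values = [3, 1, 4]

# Imperative style: a mutable accumulator is updated by a sequence of directives.
total_imperative = 0
for v in values:
    total_imperative += v

# Functional style: the same sum as the evaluation of a function, with no mutation.
total_functional = reduce(lambda acc, v: acc + v, values, 0)


# Logical style: state the formulas and search for assignments that satisfy them.
# Hypothetical formulas: (a or b) and ((not a) or c).
def satisfying_assignments():
    return [
        (a, b, c)
        for a, b, c in product([False, True], repeat=3)
        if (a or b) and ((not a) or c)
    ]


solutions = satisfying_assignments()
```

In the logical style the program states what must hold and leaves the search procedure to the solver, which is the role gringo and clasp play for the models in this study.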

Results

A variety of scripts were written to accomplish the goals of this study; these scripts are the principal results reported. The first script, Appendix A: Listing 3.1, reads an SBML file from Reactome and generates a corresponding logical model. Note that the identifiers in the SBML file are unique to Reactome. The next major step involved generating the logical equivalent of the gene expression profile of the desired cell line. This was broken into a series of function libraries. Appendix A: Listing 3.2 is the library that processes a GEO full SOFT file into an array. The SOFT file from GEO does not contain the Reactome identifiers; therefore, a cross-reference library was written to map Reactome identifiers to GEO-compatible identifiers (Appendix A: Listing 3.3). These two libraries were then used in a script, Appendix A: Listing 3.4, that produced the logical equivalent of the gene expression profile of the desired cell line.

It was then necessary, for performance reasons, to build a local database containing minimal subsets of the NIH GenBank and PubChem databases. The structure of the local MySQL database is given in Appendix A: Listing 3.12. The web address for the required database file is listed in the header section of each script. A script was written to upload a subset of the NIH gene_info database, Appendix A: Listing 3.5. In order to map genes to gene products, the gene2accession database from NIH GenBank was used. The script that uploads a subset of this database is given in Appendix A: Listing 3.6. In order to upload the required subset from PubChem, two scripts were written, Appendix A: Listings 3.7 and 3.8. The first script uploads the target molecule information of an assay; the second script uploads the activity results of tested ligands against the target molecule. The library that queries this information and locates treatments is given in Appendix A: Listing 3.9. One script was written to simplify the work flow and store the results in a local MySQL database. The structure of the database is given in Appendix A: Listing 3.12. The script that performs the analysis and uploads the results is given in Appendix A: Listing 3.10. Finally, an interactive library was written to explore and visualize the results gathered (Appendix A: Listing 3.11).

Note that Reactome also contains non-genetic elements (e.g. protein-protein complexes), although during instantiation of the model only genetic expression levels can be asserted. The expression levels of non-genetic elements are found deductively by solving the logical model. An example of this is given in Figure 3.4, where the expression level of the FASL FAS Receptor Trimer is deduced from the expression level of the FASL FAS Receptor Monomer, which is in turn deduced from the expression levels of the FAS Receptor and FASL genes. Graphical output from exploring a gene cluster (IL1R1 and related elements) from a cervical cancer dataset (GDS3233), using the HeLa cell line (GSM246123) as the experimental input and normal cervical cells (GSM246422) as the normal input at a significance factor of 0.10, is shown in Figure 3.1. The recursive depth function allows the user to see the neighboring genes of a subset of the genetic expression network. The output from using the recursive depth function at a level of one is shown in Figure 3.2. The output from using the recursive depth function at a level of two is shown in Figure 3.3. The complete results, expression levels for 1,523 genes, are too lengthy for inclusion in this publication, though they can be regenerated with the scripts provided.
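The idea behind the recursive depth function can be sketched as follows. This is a dependency-free illustration, not the NetworkX-based code of Appendix A: Listing 3.11. It assumes that a node counts as a neighbor when an edge connects it to the cluster in either direction. The node names and the function names build_neighbours and expand_cluster are hypothetical.

```python
def build_neighbours(edges):
    """Map every node to the set of nodes it shares an edge with (direction ignored)."""
    neighbours = {}
    for source, target in edges:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)
    return neighbours


def expand_cluster(neighbours, cluster, depth):
    """Return the cluster plus every node reachable within `depth` edges of it."""
    grown = set(cluster)
    if depth <= 0:
        return grown
    # Add the direct neighbours of the current cluster, then recurse one level less.
    for node in cluster:
        grown |= neighbours.get(node, set())
    return expand_cluster(neighbours, grown, depth - 1)


# Hypothetical network: a -> b <- c -> d
NEIGHBOURS = build_neighbours([("a", "b"), ("c", "b"), ("c", "d")])
depth_one = expand_cluster(NEIGHBOURS, {"a"}, 1)
depth_two = expand_cluster(NEIGHBOURS, {"a"}, 2)
```

Each additional level of depth adds the nodes adjacent to the previous result, which corresponds to the progression from Figure 3.1 to Figure 3.2 to Figure 3.3.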


Discussion

The purpose of this study was to create a technical implementation of the logical modeling formalism, culminating in the ability to import prior genetic knowledge, build a corresponding logical model, analyze the solution, and visualize the results. This was accomplished by a series of scripts that were written during this study. The manner in which each of these scripts accomplishes its task is discussed below, along with any difficulties encountered and their respective workarounds.

The first script written, Appendix A: Listing 3.1, performs the translation from the SBML format to a logical model. Because an SBML file is XML, a standard XML parsing library can be used to parse it. Once the SBML file has been parsed, it is necessary to separate the resulting object into species and reactions (species as in a biomolecule and not in the organismic sense). The species object contains a list of all the reactive molecules within an organism. The most notable of these are gene products, although other small molecules (e.g. ATP) may also be included in the list. Likewise, the reactions object contains a list of all the reactions that can occur among the species. The following step reads through the list of species and outputs a string that defines the species of the logical model. Similarly, the list of reactions is parsed, and the output is a string that defines the possible reactions of the logical model. Lastly, these strings are concatenated (along with a short string containing directives for showing what exists upon solving in clasp) and output to a text file. In this manner the abstract logical model is produced.

In order to instantiate the model, information from the Gene Expression Omnibus must be translated to a logical form. Appendix A: Listings 3.2, 3.3, and 3.4 are the libraries and scripts that perform this task. At this point it is necessary to define precisely how the quantitative information in a GEO SOFT file is converted to qualitative form (e.g. underexpressed, normal, overexpressed). The first terms that must be defined are the experimental cell line and the normal cell line. The normal cell line is the set of quantitative gene expression information in the GEO SOFT file against which all other states are normalized (i.e. it is considered the baseline). The experimental cell line is the set of gene expression information that is being compared. A significance factor, a decimal between zero and one treated as a percentage, defines the zone of normality. For example, if a gene from the normal cell line has an expression level of 200 and is compared at a significance factor of 0.10, the zone of normality is from 180 through 220. If the same gene from the experimental cell line has an expression level of 240, it is considered overexpressed. Likewise, if said gene had an expression level of 179, it would be considered underexpressed. In this manner, information from GEO is converted to an ordinal qualitative form usable by the logical grammar employed (i.e. Answer Set Programming). This is necessary due to the poor handling of continuous variables by the logical grammar.

With the conversion of quantitative variables to an ordinal form defined, it is now possible to discuss the operation of the script that performs the conversion. The first step involves retrieving the list of species from the SBML file. The genetic expression information for all cell lines in the GEO SOFT file is parsed to an array. A hash table is then used to map column names (i.e. hash table keys) to column numbers (i.e. hash table values). The user defines which gene expression information will be used by specifying the experimental and normal cell lines by their sample identifiers (e.g. GSM246123 for HeLa cells). The appropriate columns are located, compared, and discretized by the process described above. The Reactome database is then queried for other identifiers (e.g. NIH GenBank identifiers) for each gene in the species list. A simple search is performed to match each Reactome identifier to each gene in the GenBank identifier column in the array read from the GEO SOFT file; this information is stored in a hash table. At this point a hash table has been generated that contains Reactome identifiers mapped to gene expression levels. This information can then be written to a string in logical format, and the string is output to a file. This genetic expression information can be concatenated to the abstract logical model in order to instantiate a model. Ultimately, this complete logical model can be solved using gringo and clasp. At this point, an automated gene expression inference engine has been developed.

The following step allows compounds that may be useful in treatment to be determined from the solution to the logical model. The first step was to create a local repository (Appendix A: Listing 3.12) to store a subset of the information from the NIH GenBank and PubChem databases. From the NIH GenBank database, a list of known genes from Homo sapiens was needed. NIH provides a compressed tab-delimited file (ftp://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz) containing the subset of information required: GenBank identifier, symbol, and a short description of the gene. This information is parsed by a script (Appendix A: Listing 3.5) and uploaded to a local MySQL database. NIH provides another tab-delimited file (ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz) that maps genes to gene products. A script (Appendix A: Listing 3.6) was written to load the contents of this file into a local MySQL database.

The information available from PubChem required a slightly different approach than the GenBank database. PubChem does not provide a single tab-delimited dump of its database. Instead, PubChem provides individual files in comma-separated and XML formats (ftp://ftp.ncbi.nih.gov/pubchem/Bioassay/). Each individual file contains the information of a single Bioassay submission. This information is broken into two parts: one contains descriptive information regarding the target molecule of the Bioassay, and the other contains the results of the Bioassay. The descriptive information is parsed by a script (Appendix A: Listing 3.8) in its XML format due to the irregular structure of each Bioassay. This is a result of varying experimental methods that can be used to determine gene expression information. The results of the assay are in a standard comma-separated form and contain activity information of tested ligands against a target protein. A script was written to upload this information (Appendix A: Listing 3.7). An issue was identified at this point: the activity information recorded by PubChem did not indicate whether the ligands inhibited or activated the target protein. Furthermore, the scoring information (relative activity of the ligand against the target) is specific to each Bioassay and does not lend itself to easy comparison across Bioassays. Nonetheless, the possibility that a compound has the potential for development as a treatment for a disease makes the information in PubChem a valuable inclusion.

With a local repository of information from GenBank and PubChem in place, a library for mining the local PubChem database for active compounds was produced (Appendix A: Listing 3.9). This library operates through two methods, contingent upon the mapping of Reactome identifiers to NIH gene identifiers or the mapping of Reactome identifiers to NIH protein identifiers. This mapping process is mediated through automated MySQL query generation performed by the library. It is also worth noting that the logical models make use of the Reactome identifier system. In the first method, a Reactome identifier is mapped directly to an NIH gene identifier. Since PubChem deals specifically with gene products as targets for Bioassays, the NIH gene identifier must then be mapped to an NIH protein identifier. This is performed by cross-referencing the gene2accession database. In the second method, the NIH protein identifiers referenced directly by Reactome are used. It was found in practice that the first method was preferable, as Reactome does not update its cross-reference database in real time. However, both methods are available to the user.

At this point a convenience script (Appendix A: Listing 3.10) was written to perform the complete work flow with the invocation of a single command line. This script operates by treating all the previous scripts as libraries and invoking their associated functions in order. Instead of being written to a file, the output is redirected to a local MySQL database where it can be explored through the use of queries. Any issues of impredicativity in the generated model will prohibit the results from being uploaded. This helps ensure that only logically sound models are used in future data analysis.

Lastly, the topic of visualizing results must be discussed. An interactive library (Appendix A: Listing 3.11) was written to facilitate visualization of results. Currently, the library reads in the SBML file and the solution to the logical model (i.e. the output from clasp). The genetic relationship information in the SBML file is converted to a directed graph object. The nodes of the graph represent individual genes, whereas the edges represent directional relationships between the genes. Information regarding the expression level of each gene is attached to the corresponding node as an attribute. The library has functions for finding clusters of genes that have a specified attribute. For example, the user can find the gene cluster with the largest number of connected nodes that have the attribute of overexpression. This is useful for finding the most impacted pathway in a cell. Furthermore, this can be expanded through the use of a recursive search function, allowing related genes to be progressively investigated. This recursive function takes a cluster of genes and finds nodes that are directly connected to the cluster but may not have the same attributes as the cluster. This can be done recursively, centering on a cluster of genes in order to expand the graph. This is demonstrated at a depth of one (Figure 3.2) and a depth of two (Figure 3.3). It is important to note that at the time of this study the graph library used, NetworkX, did not correctly render arrows showing directionality in a graph. This is a known problem acknowledged by the developers (https://networkx.lanl.gov/trac/ticket/423). The implications of these specific results will not be discussed here, as this study focuses upon the technical aspects of the software implementation.

In summary, a software solution was developed that allows for importation of prior genetic knowledge, building of a corresponding logical model, analysis of the solution, and visualization of the results. This development is unique as it is the first to provide a comprehensive solution for accomplishing the tasks described. With regard to analysis, PubChem may be mined for potential treatment compounds. This has an application in personalized medicine, with the aim of improving patient response to treatment. Furthermore, the clustering algorithms developed allow for the visualization of genetic pathways altered by disease. This empowers researchers to focus their inquiry on specific genes that may have large downstream effects. Finally, a future development for this software solution might be the integration of the NIH PubMed database for lexical analysis in order to automatically determine the function of gene clusters. This would allow researchers the flexibility of studying a gene based on its associated genes or, alternatively, based on its known function.
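The discretization rule defined in this Discussion can be written out as a short worked example. This is a minimal sketch under the stated definition (a zone of normality spanning the normal value plus or minus the significance factor treated as a percentage of that value), not a transcription of the study's script. The function name discretize and the returned labels are assumptions chosen to match the states used by the logical model.

```python
def discretize(normal_level, experimental_level, significance):
    """Classify an experimental value relative to the zone of normality.

    The zone is [normal * (1 - significance), normal * (1 + significance)];
    values inside the zone (bounds included) are "normal".
    Assumes a positive normal_level; other cases are not handled here.
    """
    if not 0.0 <= significance <= 1.0:
        raise ValueError("significance must be between zero and one")
    lower = normal_level * (1.0 - significance)
    upper = normal_level * (1.0 + significance)
    if experimental_level < lower:
        return "low"
    if experimental_level > upper:
        return "high"
    return "normal"


# Worked example from the text: normal level 200, significance factor 0.10,
# giving a zone of normality from 180 through 220.
overexpressed = discretize(200, 240, 0.10)
underexpressed = discretize(200, 179, 0.10)
unchanged = discretize(200, 205, 0.10)
```

Applying the same comparison to every gene shared by the normal and experimental sample columns would yield the ordinal values that are asserted in the logical model.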


Bibliography

Edgar, R., M. Domrachev, and A. E. Lash. 2002. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1):207–210.

Hucka, M., A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, and others. 2003. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524.

Joshi-Tope, G., M. Gillespie, I. Vastrik, P. D’Eustachio, E. Schmidt, B. de Bono, B. Jassal, G. R. Gopinath, G. R. Wu, L. Matthews, and others. 2005. Reactome: a knowledgebase of biological pathways. Nucleic Acids Research, 33(Database Issue):D428.

Karp, P., C. Ouzounis, C. Moore-Kochlacs, L. Goldovsky, P. Kaipa, D. Ahrén, S. Tsoka, N. Darzentas, V. Kunin, and N. López-Bigas. 2005. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Research, 33(19):6083–6089.

Luciano, J. S. 2005. PAX of mind for pathway researchers. Drug Discovery Today, 10(13):937–942.

Scotto, L., G. Narayan, S. V. Nandula, H. Arias-Pulido, S. Subramaniyam, A. Schneider, A. M. Kaufmann, J. D. Wright, B. Pothuri, M. Mansukhani, and V. V. Murty. 2008. Identification of copy number gain and overexpressed genes on chromosome arm 20q by an integrative genomic approach in cervical cancer: potential role in progression. Genes, Chromosomes & Cancer, 47(9):755–765. PMID: 18506748.

Thomas, P. D. 2003. PANTHER: a browsable database of gene products organized by biological function, using curated and subfamily classification. Nucleic Acids Research, 31(1):334–341.

Wang, Y., J. Xiao, T. O. Suzek, J. Zhang, J. Wang, and S. H. Bryant. 2009. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Research, 37(Web Server issue):W623.

Figure 3.1: A gene cluster focusing on the gene IL1R1 and related elements. This is from the HeLa cervical cancer cell line (GEO dataset GDS3233).

Expression legend: Low, High, Unknown, Normal

Figure 3.2: An expansion upon Figure 3.1 with a recursive depth of one. Any gene (or related element) that connects to a gene (or related element) in Figure 3.1 has been added to the figure.

Expression legend: Low, High, Unknown, Normal

Figure 3.3: An expansion upon Figure 3.1 with a recursive depth of two. Any gene (or related element) that connects to a gene (or related element) in Figure 3.2 has been added to the figure.

Expression legend: Low, High, Unknown, Normal

Figure 3.4: A network demonstrating the ability to deduce the expression level of protein complexes from the expression levels of genes. Specifically, this network demonstrates the binding of FASL to the FAS Receptor in order to produce the FASL FAS Receptor Monomer. The expression level of the FASL FAS Receptor Trimer is deduced from that of the FASL FAS Receptor Monomer.

Expression legend: Low, High, Unknown, Normal

Appendix A

Listing 3.1: Logical Model Generation Script #!/usr/bin/env python #modelGen. py #Author: Andrew Avila #Description: This program will take a sbml model describing various biological reactions and transform it to an equivalent logical model . Initial conditions must be added manually but once added the solver can figure out where abnormalities exist within the model. Be sure to unzip the input file. #Input File: http://www.reactome.org/download/current/homo sapiens . 2 . sbml . gz #Usage: python modelGen.py −i s b m l f i l e −o a s p o u t p u t

#imported libraries import sys from lxml import e t r e e from i t e r t o o l s import ∗

#The header of the sbml tags tag header = ”{ http://www.sbml. org/sbml/level2 }”

#Function that removes header from a string def ch(dirtyStr): return dirtyStr.replace(tag header , ’ ’ )

#Function that takes a array and outputs a comma seperated string of the array elements def arStr ( ar ) : genStr = ”” for s in ar : genStr += s + ”,” return genStr.rstrip(’,’)

#Function that calculates the maximum number of products and reactants def getMaxs(tElem) : maxReac = 0 maxProd = 0 for subElem in tElem : for i in subElem : i f (i.tag == tag header + ”listOfReactants”): i f ( l e n ( i ) > maxReac ) : maxReac = len(i) i f (i.tag == tag header + ”listOfProducts”): i f ( l e n ( i ) > maxProd ) : maxProd = len(i)

55 Texas Tech University, Andrew Avila, May 2012

return [maxReac, maxProd]

#Generates the asp fluents def genFluents(tElem): genStr = ”” #Generate the species from the sbml reactants for subElem in tElem : genStr += ”species(” + subElem.attrib[’id’] + ”). \ n”

#Generates the species domain variable genStr += ”#domain species(X). \ n”

#Generate possible state conditions and rules for identifying unknowns genStr += ”state(normal). \ nstate(low). \ nstate(high). \ nstate(unknown) . \ n#domain state(C). \ n” genStr += ”exists(X, unknown) :− not exists(X, normal), not exists(X , low), not exists(X, high). \ n” genStr += ”:− exists(X, normal), exists(X, low). \ n” genStr += ”:− exists(X, normal), exists(X, high). \ n” genStr += ”:− exists(X, low), exists(X, high). \ n” return genStr

#Generates the asp possible reactions from the sbml reaction list def genReactions(tElem): genStr = ”” t L i s t = [ ] for subElem in tElem : reactList = [] prodList = [] for i in subElem : i f (i.tag == tag header + ”listOfReactants”): for j in i : reactList.append(j.attrib[ ’species ’]) i f (i.tag == tag header + ”listOfProducts”): for j in i : prodList.append(j.attrib[ ’species ’]) i f len(prodList) > 0 and len(reactList) > 0 : rS = reduce (lambda x,y: x + ”exists(” + y + ”, C), not exists(” + y + ”, unknown), ”,reactList , ””).rstrip(” ”) .rstrip(”,”) + ”. \ n” pS = map(lambda x: ”exists(” + x + ”, C) :− ” + rS, prodList ) tList.append([prodList , reactList]) for k in pS : genStr += k genStr += ”\n” + genIndReactions(tList) return genStr

56 Texas Tech University, Andrew Avila, May 2012

#Generate the indirect inference rules def genIndReactions(tList): genStr = ”” bigPList = [] for i in t L i s t : bigPList.append(i [0]) for i in t L i s t : reactList = i[0] prodList = i[1] for i in r e a c t L i s t : i f recurCount(bigPList , i) == 1: rS = reduce (lambda x,y: x + ”exists(” + y + ”, C), not exists(” + y + ”, unknown), ”,reactList , ””).rstrip( ” ”).rstrip(”,”) + ”. \ n” pS = map(lambda x: ”exists(” + x + ”, C) :− ” + rS , prodList ) for k in pS : genStr += k return genStr

#A recursive depth instance counter def recurCount(arr , item, num = 0): i f type ( arr ) . name == ” l i s t ” : for i in arr : num = recurCount(i , item, num) else : i f arr == item : return num + 1 return num

#Generates the species output def genOutSpecies(tElem) : genList = [ ]

#generate the species from the sbml reactants for subElem in tElem : genList += [subElem.attrib[ ’id’]]

return genList

#Main function that processes command line arguments and generates the asp file
def main():

    #Default do not gen species file
    species = False
    tree = False
    out_name = False

    #Pull command line arguments and generate tree from sbml file and set output file name
    for x in range(0, len(sys.argv)):
        if (sys.argv[x] == '-i'):
            tree = etree.parse(sys.argv[x+1])
        if (sys.argv[x] == '-o'):
            out_name = sys.argv[x+1]
        if (sys.argv[x] == '-s'):
            species = sys.argv[x+1]

    if tree:
        #Separate tree into species and reactions and find max number of each
        speciesTree = tree.find(".//" + tag_header + "listOfSpecies")
        reactionTree = tree.find(".//" + tag_header + "listOfReactions")

    if tree and out_name:
        #Produce the output string
        out_gen = genFluents(speciesTree)
        out_gen += genReactions(reactionTree)
        out_gen += "#hide.\n#show exists(X, C)."

        #Write the output string
        f = open(out_name, "w")
        f.write(out_gen)
        f.close()

    #Write species to file if arg passed
    if species and tree:
        genStr = ""
        spcList = genOutSpecies(speciesTree)

        for i in spcList:
            genStr += i + "\n"
        f = open(species, "w")
        f.write(genStr)
        f.close()

#Wrapper for when script is run stand-alone
if __name__ == '__main__':
    main()
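
As an illustration of the rule text that genReactions assembles, the following is a minimal Python 3 sketch for a single toy reaction. The helper name reaction_rules and the species names a, b and c are assumptions introduced here for illustration; they do not appear in the listing, which reads its species from the SBML reaction list.

```python
from functools import reduce

def reaction_rules(reactants, products):
    """Build one ASP rule per product, mirroring the string assembly in genReactions."""
    body = reduce(
        lambda acc, r: acc + "exists(" + r + ", C), not exists(" + r + ", unknown), ",
        reactants, "").rstrip(" ").rstrip(",") + "."
    return ["exists(" + p + ", C) :- " + body for p in products]

# Toy reaction a + b -> c; the species names are illustrative only.
for rule in reaction_rules(["a", "b"], ["c"]):
    print(rule)
```

Each product receives its own rule whose body requires every reactant to exist at level C and not to be unknown.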


Listing 3.2: GEO Processing Library
#!/usr/bin/env python
#geoReader.py
#Author: Andrew Avila
#Description: This is a library for reading SOFT files from NCBI's GEO Database and outputting a dictionary with GenBank IDs mapped to qualitative gene levels based on the comparison of a control and experimental sample to a significance factor (default = 0.25). The significance factor is based on the percent difference between the control and experimental sample.
#Usage: geoReader.py -i geoFile -o tab_delimited_output -c control -e experimental -s significance

#Libraries
import sys

#Extracts the relevant lines from the SOFT file
def ParseSOFT(filePath, splitChar = "\t"):
    return reduce(lambda x, y: x + ["start"] if y == "!dataset_table_begin\n" else \
        (x + [y.split(splitChar)] if len(x) > 0 and y != "!dataset_table_end\n" else x), open(filePath, 'r').readlines(), [])[1:]

#Generates a hashtable mapping column names to specific positions from the array passed to the function
def MapElements(geoArray):
    mapDict = {}
    for i in range(len(geoArray[0])):
        if ((i > 1 and not mapDict.has_key('GenBank Accession')) or geoArray[0][i] == "GenBank Accession") \
            and geoArray[0][i] != "Gene symbol" and geoArray[0][i] != "Gene title" \
            and geoArray[0][i] != "Gene ID" and geoArray[0][i] != "Nucleotide Title" \
            and geoArray[0][i] != "UniGene ID" and geoArray[0][i] != "UniGene symbol" \
            and geoArray[0][i] != "UniGene title":
            mapDict.update({geoArray[0][i] : i})
    return mapDict

#Creates a dictionary mapping genes to qualitative expression levels after checking if experimental and control values are sufficiently different
def GeneLevels(geoArray, mapDict, control, experimental, sigLevel = 0.25):
    geneDict = {}
    for i in geoArray[1:]:
        if i[mapDict[control]] != "null" and i[mapDict[experimental]] != "null":
            if float(i[mapDict[control]]) >= float(i[mapDict[experimental]]):
                if 1.0 - (float(i[mapDict[experimental]]) / float(i[mapDict[control]])) > sigLevel:
                    geneDict.update({i[mapDict['GenBank Accession']] : "low"})
                else:
                    geneDict.update({i[mapDict['GenBank Accession']] : "normal"})
            else:
                if 1.0 - (float(i[mapDict[control]]) / float(i[mapDict[experimental]])) > sigLevel:
                    geneDict.update({i[mapDict['GenBank Accession']] : "high"})
                else:
                    geneDict.update({i[mapDict['GenBank Accession']] : "normal"})
    return geneDict

#Main function that processes command line arguments
def main():

    #Default do not gen species file
    geoFile = False
    control = False
    experimental = False
    sigLevel = False
    outFile = False
    outStr = ""

    #Pull command line arguments
    for x in range(0, len(sys.argv)):
        if (sys.argv[x] == '-i'):
            geoFile = sys.argv[x+1]
        if (sys.argv[x] == '-o'):
            outFile = sys.argv[x+1]
        if (sys.argv[x] == '-c'):
            control = sys.argv[x+1]
        if (sys.argv[x] == '-e'):
            experimental = sys.argv[x+1]
        if (sys.argv[x] == '-s'):
            sigLevel = float(sys.argv[x+1])

    #If valid argument run the appropriate processing function
    if geoFile and control and experimental and outFile:
        geoArray = ParseSOFT(geoFile)
        mapDict = MapElements(geoArray)
        geneDict = GeneLevels(geoArray, mapDict, control, experimental, sigLevel) if sigLevel else GeneLevels(geoArray, mapDict, control, experimental)

        for i, j in geneDict.items():
            if i != "" and j != "":
                outStr += i + "\t" + j + "\n"

        #Write the output string
        f = open(outFile, "w")
        f.write(outStr)
        f.close()

#Wrapper for when script is run stand-alone
if __name__ == '__main__':
    main()
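
The threshold test inside GeneLevels can be read in isolation as follows. This is a Python 3 sketch under stated assumptions: the function name classify is hypothetical, and the guard for two zero readings is an addition made here, since the listing as written would divide by zero in that case.

```python
def classify(control, experimental, sig_level=0.25):
    """Return 'low', 'high', or 'normal' using the relative-difference test in GeneLevels."""
    control, experimental = float(control), float(experimental)
    larger, smaller = max(control, experimental), min(control, experimental)
    if larger == 0.0:
        # Assumption: a zero denominator is treated as unchanged (not in the listing).
        return "normal"
    if 1.0 - smaller / larger <= sig_level:
        return "normal"
    return "low" if control >= experimental else "high"

print(classify(100.0, 70.0))   # relative drop of about 0.30
print(classify(100.0, 80.0))   # relative drop of about 0.20
print(classify(50.0, 100.0))   # relative rise of 0.50, measured against the larger value
```

The difference is always divided by the larger of the two readings, so a doubling and a halving both yield a relative difference of 0.5.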


Listing 3.3: Gene Identifier Cross Reference Library
#otherID.py
#Author: Andrew Avila
#Description: A library for retrieving the other identifiers that map to Reactome identifiers.

#Libraries
import MySQLdb

#MySQL Database Parameters
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'reactome'
MYSQL_PASSWD = 'reactome'
MYSQL_DB = 'reactome'

#Generates an anonymous function that takes in three variables to be applied to the four variable function "f". The first variable of "f" will be a dynamically managed mysql connection.
def DBManageConn(f):
    return lambda x, y, z: f(MySQLdb.connect(MYSQL_HOST, MYSQL_USER, MYSQL_PASSWD, MYSQL_DB), x, y, z)

#Generates an anonymous function that takes in three variables to be applied to the three variable function "f". The first variable to the anonymous function will be the mysql connection, whose cursor is passed to "f"; after evaluation, the fetchall() method of the object returned by "f" is run.
def DBFetch(f):
    return lambda x, y, z: f(x.cursor(), y, z).fetchall()

#A decorated function that takes in four variables: a mysql connection, a query to be executed, the parameters of the query, and a function to be applied to the results of the executed query (the results of which are returned). Note that the connection is dynamically managed by the decorator function.
@DBManageConn
def DBQueryDataFunc(conn, query, params, datafunc = lambda x, y: y):
    print query
    data = datafunc(conn, DBExecuteQuery(conn, query, params))
    conn.close()
    return data

#A decorated function that takes in three variables: a mysql cursor, a query to be executed, and the parameters to the query. Returns the cursor after query execution.
@DBFetch
def DBExecuteQuery(cursor, query, params):
    cursor.execute(query, params)
    return cursor

#A function that inserts into a database table a series of values stored in a hashtable and applies a function to the result of the insertion. The hashtable has the format of the key being the column name and the value being the value to be inserted. By default the function applied returns the value of the primary key of the last row inserted.
def DBInsertFunc(table, nvpairs, datafunc = lambda x, y: x.insert_id()):
    return DBQueryDataFunc("INSERT INTO %s (%s) VALUES (%s)" % (table, reduce(lambda x, y: x + ', ' + y, nvpairs.iterkeys()), MultiplyStr("%s", len(map(lambda x: x, nvpairs.itervalues())))), map(lambda x: x, nvpairs.itervalues()), datafunc)

#A function that multiplies a string a number of times and inserts commas in between.
def MultiplyStr(string, number):
    return (((string + ',') * number).strip(',')) #.rsplit(',')

#Takes an array with two columns and produces an equivalent hashtable
def arrToDict(arr):
    newDict = {}
    for i in range(len(arr)):
        newDict.update({arr[i][1] : arr[i][0]})
    return newDict

#Parses the SBML identifiers to retrieve the Reactome Stable Identifiers
def parseSpecies(rawList, breakChar = "_", idColumn = 1):
    return reduce(lambda x, y: x + [[y.strip("\n"), "REACT_" + y.split(breakChar)[idColumn]]], rawList, [])

#Reads the species file
def readRawList(speciesFile):
    return open(speciesFile, 'r').readlines()

#Queries the Reactome database to retrieve the other identifiers for each gene in the sbml file
def getOtherIds(parList):
    return reduce(lambda x, y: x + [[y[1], DBQueryDataFunc(
        "SELECT ReferenceEntity_2_otherIdentifier.otherIdentifier "
        "FROM StableIdentifier "
        "LEFT JOIN DatabaseObject ON DatabaseObject.stableIdentifier = StableIdentifier.DB_ID "
        "LEFT JOIN EntityWithAccessionedSequence ON EntityWithAccessionedSequence.DB_ID = DatabaseObject.DB_ID "
        "LEFT JOIN ReferenceEntity_2_otherIdentifier ON ReferenceEntity_2_otherIdentifier.DB_ID = EntityWithAccessionedSequence.referenceEntity "
        "WHERE StableIdentifier.identifier = '%s' AND DatabaseObject._class = 'EntityWithAccessionedSequence'" % y[1],
        [], lambda j, k: k)]], parList, [])
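
The connection-managing decorator used throughout this library can be sketched in a self-contained form. The following Python 3 example is an illustration only: it substitutes the standard-library sqlite3 module for MySQLdb so that it runs without a server, the table ids and its two rows are invented, and the names manage_conn and query_rows are hypothetical. It also binds the looked-up value through a driver placeholder (`?` in sqlite3) rather than formatting it into the SQL text.

```python
import sqlite3

def manage_conn(f):
    """Open a connection, pass it as the first argument, and always close it."""
    def wrapper(query, params=()):
        conn = sqlite3.connect(":memory:")  # stand-in for MySQLdb.connect(...)
        try:
            return f(conn, query, params)
        finally:
            conn.close()
    return wrapper

@manage_conn
def query_rows(conn, query, params):
    # Hypothetical two-row table so the example is self-contained.
    conn.execute("CREATE TABLE ids (stable_id TEXT, other_id TEXT)")
    conn.executemany("INSERT INTO ids VALUES (?, ?)",
                     [("REACT_1", "EntrezGene:10"), ("REACT_2", "EntrezGene:20")])
    # The value is bound by the driver rather than spliced into the SQL text.
    return conn.execute(query, params).fetchall()

rows = query_rows("SELECT other_id FROM ids WHERE stable_id = ?", ("REACT_1",))
print(rows)
```

Closing the connection in a finally clause means it is released even when the query raises, which the lambda-based decorator in the listing does not guarantee.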


Listing 3.4: ASP Condition Generator Script
#!/usr/bin/env python
#condGen.py
#Author: Andrew Avila
#Description: This is a program for generating the initial asp conditions from a geo soft file. This program operates by building a bridge between the reactome stable identifiers in the sbml file and the genbank accession identifiers in the geo file through the reactome sql database.
#Usage: condGen.py -g geoFile -i sbmlFile -o asp_conditions_out -c control -e experimental -s significance

#Libraries
from geoReader import *
from otherID import *
from modelGen import *
import sys

#Generates a hashtable mapping the sbml gene identifiers to the geo gene identifiers
def GenConds(geneDict, otherList):
    condDict = {}
    for i in otherList:
        if len(i[1]):
            for j in i[1]:
                if geneDict.has_key(j[0]):
                    condDict.update({i[0] : geneDict[j[0]]})
    return condDict

#Generates an array mapping each gene in the sbml file to an expression level
def SbmlConds(condDict, sbmlDict):
    outArr = []
    for k, v in sbmlDict.items():
        if condDict.has_key(k):
            outArr += [[v, condDict[k]]]
    return outArr

#Generates the asp string for each gene and its expression level
def AspConds(condArr):
    outStr = ""
    for i in condArr:
        outStr += "exists(" + i[0] + ", " + i[1] + ").\n"
    return outStr

#Main function that processes command line arguments
def main():

    #default params
    geoFile = False
    sbmlFile = False
    outFile = False
    control = False
    experimental = False
    sigLevel = False

    #Pull command line arguments
    for x in range(0, len(sys.argv)):
        if (sys.argv[x] == '-g'):
            geoFile = sys.argv[x+1]
        if (sys.argv[x] == '-i'):
            sbmlFile = etree.parse(sys.argv[x+1])
        if (sys.argv[x] == '-o'):
            outFile = sys.argv[x+1]
        if (sys.argv[x] == '-c'):
            control = sys.argv[x+1]
        if (sys.argv[x] == '-e'):
            experimental = sys.argv[x+1]
        if (sys.argv[x] == '-s'):
            sigLevel = float(sys.argv[x+1])

    #Checks if all parameters are specified, generates initial variables, parses the GEO file and outputs the asp equivalent
    if geoFile and sbmlFile and outFile and control and experimental:
        speciesTree = sbmlFile.find(".//" + tag_header + "listOfSpecies")
        reactionTree = sbmlFile.find(".//" + tag_header + "listOfReactions")
        modelGen = genFluents(speciesTree)
        modelGen += genReactions(reactionTree)
        modelGen += "#hide.\n#show exists(X, C)."

        spcList = genOutSpecies(sbmlFile.find(".//" + tag_header + "listOfSpecies"))
        geoArray = ParseSOFT(geoFile)
        mapDict = MapElements(geoArray)
        geneDict = GeneLevels(geoArray, mapDict, control, experimental, sigLevel) if sigLevel else GeneLevels(geoArray, mapDict, control, experimental)
        otherList = getOtherIds(parseSpecies(spcList))
        condDict = GenConds(geneDict, otherList)
        aspStr = AspConds(SbmlConds(condDict, arrToDict(parseSpecies(spcList))))

        f = open(outFile, "w")
        f.write(aspStr)
        f.close()

#Wrapper for when script is run stand-alone
if __name__ == '__main__':
    main()
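
The bridging performed by GenConds, SbmlConds and AspConds can be traced on toy data. The following Python 3 sketch condenses the three steps into two hypothetical helpers, gen_conds and asp_conds; every identifier in the sample data is invented for illustration, and the one-element tuples imitate the rows returned by a database fetch.

```python
def gen_conds(gene_levels, other_ids):
    """Map each stable identifier to the level of a cross-referenced accession found in the expression data."""
    conds = {}
    for stable_id, accessions in other_ids:
        for (accession,) in accessions:
            if accession in gene_levels:
                conds[stable_id] = gene_levels[accession]
    return conds

def asp_conds(conds, species_by_stable_id):
    """Emit one exists(species, level). fact per species with a known level."""
    return "".join(
        "exists(%s, %s).\n" % (species_by_stable_id[sid], level)
        for sid, level in sorted(conds.items())
        if sid in species_by_stable_id)

# All identifiers below are invented for illustration.
gene_levels = {"NM_000001": "high", "NM_000002": "low"}
other_ids = [["REACT_10", [("NM_000001",)]],
             ["REACT_20", []],
             ["REACT_30", [("NM_000002",)]]]
species = {"REACT_10": "species_10", "REACT_30": "species_30"}

print(asp_conds(gen_conds(gene_levels, other_ids), species), end="")
```

A species with no cross-referenced accession in the expression data, such as REACT_20 here, produces no fact, so the solver later derives it as unknown.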


Listing 3.5: Gene Information Upload Script
#!/usr/bin/env python
#gene_info.py
#Author: Andrew Avila
#Description: This program will take an NCBI gene_info file and upload a reduced version for use in logosexmachina. Be sure to unzip the file.
#Input File: ftp://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz
#Usage: python gene_info.py -i gene_info -s tax_id [9606 for homo sapiens]

#Libraries
import MySQLdb
import sys

#MySQL Database Parameters
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'logosexmachina'
MYSQL_PASSWD = 'logosexmachina'
MYSQL_DB = 'logosexmachina'

#Generates an anonymous function that takes in three variables to be applied to the four variable function "f". The first variable of "f" will be a dynamically managed mysql connection.
def DBManageConn(f):
    return lambda x, y, z: f(MySQLdb.connect(MYSQL_HOST, MYSQL_USER, MYSQL_PASSWD, MYSQL_DB), x, y, z)

#Generates an anonymous function that takes in three variables to be applied to the three variable function "f". The first variable to the anonymous function will be the mysql connection, whose cursor is passed to "f"; after evaluation, the fetchall() method of the object returned by "f" is run.
def DBFetch(f):
    return lambda x, y, z: f(x.cursor(), y, z).fetchall()

#A decorated function that takes in four variables: a mysql connection, a query to be executed, the parameters of the query, and a function to be applied to the results of the executed query (the results of which are returned). Note that the connection is dynamically managed by the decorator function.
@DBManageConn
def DBQueryDataFunc(conn, query, params, datafunc = lambda x, y: y):
    print query
    data = datafunc(conn, DBExecuteQuery(conn, query, params))
    conn.close()
    return data

#A decorated function that takes in three variables: a mysql cursor, a query to be executed, and the parameters to the query. Returns the cursor after query execution.
@DBFetch
def DBExecuteQuery(cursor, query, params):
    cursor.execute(query, params)
    return cursor

#A function that inserts into a database table a series of values stored in a hashtable and applies a function to the result of the insertion. The hashtable has the format of the key being the column name and the value being the value to be inserted. By default the function applied returns the value of the primary key of the last row inserted.
def DBInsertFunc(table, nvpairs, datafunc = lambda x, y: x.insert_id()):
    return DBQueryDataFunc("INSERT INTO %s (%s) VALUES (%s)" % (table, reduce(lambda x, y: x + ', ' + y, nvpairs.iterkeys()), MultiplyStr("%s", len(map(lambda x: x, nvpairs.itervalues())))), map(lambda x: x, nvpairs.itervalues()), datafunc)

#A function that multiplies a string a number of times and inserts commas in between.
def MultiplyStr(string, number):
    return (((string + ',') * number).strip(',')) #.rsplit(',')

#Reads the contents of a gene_info file from NIH and constructs an array with only the gene information for a particular species
def readToArray(filepath, splitchar = '\t', taxid = '9606'):
    return reduce(lambda x, y: x + [y.split(splitchar)] if y[0:len(taxid)] == taxid else x, open(filepath, 'r'), [])

#Reads the contents of a gene_info file from NIH and uploads the information to a database for a particular species
def readNUpload(filepath, splitchar = '\t', taxid = '9606'):
    return reduce(lambda x, y: uploadInfo(y.split(splitchar)) if y[0:len(taxid)] == taxid else x, open(filepath, 'r'))

#The upload function that performs the task of uploading the information
def uploadInfo(geneArr, valLength = 15):
    return DBInsertFunc('gene_info', {'tax_id': geneArr[0], 'GeneID': geneArr[1], 'Symbol': geneArr[2], 'Description': geneArr[8].upper()}) \
        if len(geneArr) == valLength else 0

#Main function that processes command line arguments
def main():

    #By default no arguments
    iFile = False
    species = False

    #Retrieves command line arguments and parses them
    for x in range(0, len(sys.argv)):
        if (sys.argv[x] == '-i'):
            iFile = sys.argv[x+1]
        if (sys.argv[x] == '-s'):
            species = sys.argv[x+1]

    #If valid argument run the appropriate processing function
    if iFile:
        readNUpload(iFile, '\t', species) if species else readNUpload(iFile)

#Wrapper for when script is run stand-alone
if __name__ == '__main__':
    main()
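
The species filter in readToArray and readNUpload keeps a line when its leading characters equal the taxonomy identifier. The following Python 3 sketch shows a variant that compares the whole first column instead; this exact-match behaviour is an assumption about the intent rather than what the listing does, the helper name rows_for_taxon is hypothetical, and the sample rows are invented and shorter than real gene_info rows.

```python
import io

def rows_for_taxon(lines, taxid="9606", sep="\t"):
    """Yield split rows whose first column equals taxid exactly."""
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if fields[0] == taxid:
            yield fields

# Invented three-column rows; real gene_info rows carry more fields.
sample = io.StringIO("9606\t1\tA1BG\n96061\t2\tFAKE\n10090\t3\tPzp\n")
human = list(rows_for_taxon(sample))
print(human)
```

With a prefix comparison, the second sample row (taxonomy identifier 96061) would also be accepted as human; comparing the full column avoids that.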


Listing 3.6: Gene Accession Information Upload Script
#!/usr/bin/env python
#gene2accesion.py
#Author: Andrew Avila
#Description: This program will take an NCBI gene2accession file from Entrez and upload a reduced version for use in logosexmachina. Be sure to unzip the file.
#Input File: ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz
#Usage: python gene2accesion.py -i gene2accesion_file -s tax_id [9606 for homo sapiens]

#Libraries
import MySQLdb
import sys

#MySQL Database Parameters
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'logosexmachina'
MYSQL_PASSWD = 'logosexmachina'
MYSQL_DB = 'logosexmachina'

#Generates an anonymous function that takes in three variables to be applied to the four variable function "f". The first variable of "f" will be a dynamically managed mysql connection.
def DBManageConn(f):
    return lambda x, y, z: f(MySQLdb.connect(MYSQL_HOST, MYSQL_USER, MYSQL_PASSWD, MYSQL_DB), x, y, z)

#Generates an anonymous function that takes in three variables to be applied to the three variable function "f". The first variable to the anonymous function will be the mysql connection, whose cursor is passed to "f"; after evaluation, the fetchall() method of the object returned by "f" is run.
def DBFetch(f):
    return lambda x, y, z: f(x.cursor(), y, z).fetchall()

#A decorated function that takes in four variables: a mysql connection, a query to be executed, the parameters of the query, and a function to be applied to the results of the executed query (the results of which are returned). Note that the connection is dynamically managed by the decorator function.
@DBManageConn
def DBQueryDataFunc(conn, query, params, datafunc = lambda x, y: y):
    print query
    data = datafunc(conn, DBExecuteQuery(conn, query, params))
    conn.close()
    return data

#A decorated function that takes in three variables: a mysql cursor, a query to be executed, and the parameters to the query. Returns the cursor after query execution.
@DBFetch
def DBExecuteQuery(cursor, query, params):
    cursor.execute(query, params)
    return cursor

#A function that inserts into a database table a series of values stored in a hashtable and applies a function to the result of the insertion. The hashtable has the format of the key being the column name and the value being the value to be inserted. By default the function applied returns the value of the primary key of the last row inserted.
def DBInsertFunc(table, nvpairs, datafunc = lambda x, y: x.insert_id()):
    return DBQueryDataFunc("INSERT INTO %s (%s) VALUES (%s)" % (table, reduce(lambda x, y: x + ', ' + y, nvpairs.iterkeys()), MultiplyStr("%s", len(map(lambda x: x, nvpairs.itervalues())))), map(lambda x: x, nvpairs.itervalues()), datafunc)

#A function that multiplies a string a number of times and inserts commas in between.
def MultiplyStr(string, number):
    return (((string + ',') * number).strip(',')) #.rsplit(',')

#Reads the contents of a gene2accession file from NIH and constructs an array with only the gene information for a particular species
def readToArray(filepath, splitchar = '\t', taxid = '9606'):
    return reduce(lambda x, y: x + [y.split(splitchar)] if y[0:len(taxid)] == taxid else x, open(filepath, 'r'), [])

#Reads the contents of a gene2accession file from NIH and uploads the information to a database for a particular species
def readNUpload(filepath, splitchar = '\t', taxid = '9606'):
    return reduce(lambda x, y: uploadInfo(y.split(splitchar)) if y[0:len(taxid)] == taxid else x, open(filepath, 'r'))

#The upload function that performs the task of uploading the information
def uploadInfo(geneArr, valLength = 13):
    #and len(DBQueryDataFunc("SELECT ID FROM gene2accession WHERE protein_gi = '%s'" % geneArr[6], [], lambda x, y: y)) == 0
    return DBInsertFunc('gene2accession', {'tax_id': geneArr[0], 'GeneID': geneArr[1], 'protein_accession': geneArr[5], 'protein_gi': geneArr[6], 'gene_nucleotide_accession': geneArr[7], 'gene_nucleotide_gi': geneArr[8]}) \
        if len(geneArr) == valLength and (geneArr[5] != '-' or geneArr[6] != '-' or geneArr[7] != '-' or geneArr[8] != '-') else 0


#Main function that processes command line arguments
def main():

    #By default no arguments
    iFile = False
    species = False

    #Retrieves command line arguments and parses them
    for x in range(0, len(sys.argv)):
        if (sys.argv[x] == '-i'):
            iFile = sys.argv[x+1]
        if (sys.argv[x] == '-s'):
            species = sys.argv[x+1]

    #If valid argument run the appropriate processing function
    if iFile:
        readNUpload(iFile, '\t', species) if species else readNUpload(iFile)

#Wrapper for when script is run stand-alone
if __name__ == '__main__':
    main()


Listing 3.7: Assay Data Upload Script
#!/usr/bin/env python
#assayData.py
#Author: Andrew Avila
#Description: This program will take pubchem bioassay data files (csv format) and upload the compound information to the BioassayData sql table. Be sure to unzip the files.
#Directory Input: ftp://ftp.ncbi.nih.gov/pubchem/Bioassay/CSV/Data/
#Usage: python assayData.py -i bioassay_file_or_dir

#Libraries
import MySQLdb
import sys
import os

#MySQL Database Parameters
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'logosexmachina'
MYSQL_PASSWD = 'logosexmachina'
MYSQL_DB = 'logosexmachina'

#Generates an anonymous function that takes in three variables to be applied to the four variable function "f". The first variable of "f" will be a dynamically managed mysql connection.
def DBManageConn(f):
    return lambda x, y, z: f(MySQLdb.connect(MYSQL_HOST, MYSQL_USER, MYSQL_PASSWD, MYSQL_DB), x, y, z)

#Generates an anonymous function that takes in three variables to be applied to the three variable function "f". The first variable to the anonymous function will be the mysql connection, whose cursor is passed to "f"; after evaluation, the fetchall() method of the object returned by "f" is run.
def DBFetch(f):
    return lambda x, y, z: f(x.cursor(), y, z).fetchall()

#A decorated function that takes in four variables: a mysql connection, a query to be executed, the parameters of the query, and a function to be applied to the results of the executed query (the results of which are returned). Note that the connection is dynamically managed by the decorator function.
@DBManageConn
def DBQueryDataFunc(conn, query, params, datafunc = lambda x, y: y):
    print query
    data = datafunc(conn, DBExecuteQuery(conn, query, params))
    conn.close()
    return data

#A decorated function that takes in three variables: a mysql cursor, a query to be executed, and the parameters to the query. Returns the cursor after query execution.
@DBFetch
def DBExecuteQuery(cursor, query, params):
    cursor.execute(query, params)
    return cursor

#A function that inserts into a database table a series of values stored in a hashtable and applies a function to the result of the insertion. The hashtable has the format of the key being the column name and the value being the value to be inserted. By default the function applied returns the value of the primary key of the last row inserted.
def DBInsertFunc(table, nvpairs, datafunc = lambda x, y: x.insert_id()):
    return DBQueryDataFunc("INSERT INTO %s (%s) VALUES (%s)" % (table, reduce(lambda x, y: x + ', ' + y, nvpairs.iterkeys()), MultiplyStr("%s", len(map(lambda x: x, nvpairs.itervalues())))), map(lambda x: x, nvpairs.itervalues()), datafunc)

#A function that multiplies a string a number of times and inserts commas in between.
def MultiplyStr(string, number):
    return (((string + ',') * number).strip(',')) #.rsplit(',')

#A function that recursively searches a directory for assay data to upload.
def RecursiveAssayData(dir):
    map((lambda x: RecursiveAssayData(dir + "/" + x) if os.path.isdir(dir + "/" + x) else assayData(dir + "/" + x)), os.listdir(dir))

#A function that opens a file, checks if it is compatible, parses its contents, and uploads the parsed information to the database.
def assayData(filePath):
    return reduce(lambda x, y: uploadAssayData(y.split(','), os.path.splitext(os.path.basename(filePath))[0]) if y.split(',')[0] != '"PUBCHEM_SID"' else x, open(filePath), []) \
        if len(DBQueryDataFunc("SELECT ID FROM BioassayDesc WHERE aid = '%s'" % os.path.splitext(os.path.basename(filePath))[0], [], lambda x, y: y)) > 0 else 0

#A function that uploads assay information to the database.
def uploadAssayData(rawArr, aidStr):
    return DBInsertFunc('BioassayData', {'aid': aidStr, 'sid': rawArr[0], 'cid': rawArr[2], 'activity': rawArr[3], 'score': rawArr[4]})

#Main function that processes command line arguments
def main():

    #By default no arguments
    iFile = False

    #Retrieves command line arguments and parses them
    for x in range(0, len(sys.argv)):
        if (sys.argv[x] == '-i'):
            iFile = sys.argv[x+1]

    #If valid argument run the appropriate processing function
    if iFile:
        if os.path.isdir(iFile) == True:
            RecursiveAssayData(iFile)
        else:
            assayData(iFile)

#Wrapper for when script is run stand-alone
if __name__ == '__main__':
    main()
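
The row handling in assayData splits each line on commas, which would misalign columns if a quoted field contained a comma. The following Python 3 sketch shows one alternative using the standard csv module. The helper name assay_rows is hypothetical and the sample rows, including every header name other than PUBCHEM_SID, are invented; only the column positions 0, 2, 3 and 4 are taken from the listing.

```python
import csv
import io

def assay_rows(handle):
    """Yield (sid, cid, activity, score) tuples, skipping the header row."""
    for row in csv.reader(handle):
        if not row or row[0] == "PUBCHEM_SID":
            continue
        yield row[0], row[2], row[3], row[4]

# Invented rows laid out like the columns the listing reads (0, 2, 3, 4).
sample = io.StringIO(
    'PUBCHEM_SID,other,cid,activity,score\n'
    '101,"reg,with,commas",202,2,85\n')
print(list(assay_rows(sample)))
```

Because csv.reader strips the surrounding quotes, the header test compares against PUBCHEM_SID without the embedded quote characters used in the listing.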


Listing 3.8: Assay Description Upload Script
#!/usr/bin/env python
#assayDescription.py
#Author: Andrew Avila
#Description: This program will take pubchem bioassay description files (xml format) and upload the target information associated with a given assay. Be sure to unzip the files.
#Directory Input: ftp://ftp.ncbi.nih.gov/pubchem/Bioassay/XML/Description/
#Usage: python assayDescription.py -i bioassay_file_or_dir

#Libraries
import MySQLdb
import sys
import os
from lxml import etree

#The expected header if the xml file is valid
tag_header = "{http://www.ncbi.nlm.nih.gov}"

#MySQL Database Parameters
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'logosexmachina'
MYSQL_PASSWD = 'logosexmachina'
MYSQL_DB = 'logosexmachina'

#Generates an anonymous function that takes in three variables to be applied to the four variable function "f". The first variable of "f" will be a dynamically managed mysql connection.
def DBManageConn(f):
    return lambda x, y, z: f(MySQLdb.connect(MYSQL_HOST, MYSQL_USER, MYSQL_PASSWD, MYSQL_DB), x, y, z)

#Generates an anonymous function that takes in three variables to be applied to the three variable function "f". The first variable to the anonymous function will be the mysql connection, whose cursor is passed to "f"; after evaluation, the fetchall() method of the object returned by "f" is run.
def DBFetch(f):
    return lambda x, y, z: f(x.cursor(), y, z).fetchall()

#A decorated function that takes in four variables: a mysql connection, a query to be executed, the parameters of the query, and a function to be applied to the results of the executed query (the results of which are returned). Note that the connection is dynamically managed by the decorator function.
@DBManageConn
def DBQueryDataFunc(conn, query, params, datafunc = lambda x, y: y):
    print query
    data = datafunc(conn, DBExecuteQuery(conn, query, params))
    conn.close()
    return data

#A decorated function that takes in three variables: a mysql cursor, a query to be executed, and the parameters to the query. Returns the cursor after query execution.
@DBFetch
def DBExecuteQuery(cursor, query, params):
    cursor.execute(query, params)
    return cursor

#A function that inserts into a database table a series of values stored in a hashtable and applies a function to the result of the insertion. The hashtable has the format of the key being the column name and the value being the value to be inserted. By default the function applied returns the value of the primary key of the last row inserted.
def DBInsertFunc(table, nvpairs, datafunc = lambda x, y: x.insert_id()):
    return DBQueryDataFunc("INSERT INTO %s (%s) VALUES (%s)" % (table, reduce(lambda x, y: x + ', ' + y, nvpairs.iterkeys()), MultiplyStr("%s", len(map(lambda x: x, nvpairs.itervalues())))), map(lambda x: x, nvpairs.itervalues()), datafunc)

#A function that multiplies a string a number of times and inserts commas in between.
def MultiplyStr(string, number):
    return (((string + ',') * number).strip(',')) #.rsplit(',')

#A function that recursively searches a directory for assay description to upload
def RecursiveTargetMol(dir):
    map((lambda x: RecursiveTargetMol(dir + "/" + x) if os.path.isdir(dir + "/" + x) else getTargetMol(dir + "/" + x)), os.listdir(dir))

#Gets the target molecule of an assay and uploads its identifier to the database
def getTargetMol(filePath):
    return reduce(lambda x, y: DBInsertFunc('BioassayDesc', {'aid': os.path.splitext(os.path.basename(filePath))[0], 'protein_gi': y.text}) \
        if y.tag == tag_header + "PC-AssayTargetInfo_mol-id" and len(DBQueryDataFunc("SELECT ID FROM gene2accession WHERE protein_gi = '%s'" % y.text, [], lambda x, y: y)) > 0 else x, etree.parse(filePath).iter(), [])

#Main function that processes command line arguments
def main():

    #By default no arguments
    iFile = False

    #Retrieves command line arguments and parses them
    for x in range(0, len(sys.argv)):
        if (sys.argv[x] == '-i'):
            iFile = sys.argv[x+1]

    #If valid argument run the appropriate processing function
    if iFile:
        if os.path.isdir(iFile) == True:
            RecursiveTargetMol(iFile)
        else:
            getTargetMol(iFile)

#Wrapper for when script is run stand-alone
if __name__ == '__main__':
    main()


Listing 3.9: Treatment Discovery Library
#genTreatments.py
#Author: Andrew Avila
#Description: A library with functions to find treatments for over/under expressed genes

#Libraries
from inspectoris import *
import otherID as reactome
import MySQLdb

#MySQL Database Parameters
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'logosexmachina'
MYSQL_PASSWD = 'logosexmachina'
MYSQL_DB = 'logosexmachina'

#Generates an anonymous function that takes in three variables to be applied to the four variable function "f". The first variable of "f" will be a dynamically managed mysql connection.
def DBManageConn(f):
    return lambda x, y, z: f(MySQLdb.connect(MYSQL_HOST, MYSQL_USER, MYSQL_PASSWD, MYSQL_DB), x, y, z)

#Generates an anonymous function that takes in three variables to be applied to the three variable function "f". The first variable to the anonymous function will be the mysql connection, whose cursor is passed to "f"; after evaluation, the fetchall() method of the object returned by "f" is run.
def DBFetch(f):
    return lambda x, y, z: f(x.cursor(), y, z).fetchall()

#A decorated function that takes in four variables: a mysql connection, a query to be executed, the parameters of the query, and a function to be applied to the results of the executed query (the results of which are returned). Note that the connection is dynamically managed by the decorator function.
@DBManageConn
def DBQueryDataFunc(conn, query, params, datafunc = lambda x, y: y):
    print query
    data = datafunc(conn, DBExecuteQuery(conn, query, params))
    conn.close()
    return data

#A decorated function that takes in three variables: a mysql cursor, a query to be executed, and the parameters to the query. Returns the cursor after query execution.
@DBFetch
def DBExecuteQuery(cursor, query, params):
    cursor.execute(query, params)
    return cursor

#A function that inserts into a database table a series of values stored in a hashtable and applies a function to the result of the insertion. The hashtable has the format of the key being the column name and the value being the value to be inserted. By default the function applied returns the value of the primary key of the last row inserted.
def DBInsertFunc(table, nvpairs, datafunc = lambda x, y: x.insert_id()):
    return DBQueryDataFunc("INSERT INTO %s (%s) VALUES (%s)" % (table, reduce(lambda x, y: x + ', ' + y, nvpairs.iterkeys()), MultiplyStr("%s", len(map(lambda x: x, nvpairs.itervalues())))), map(lambda x: x, nvpairs.itervalues()), datafunc)

#A function that multiplies a string a number of times and inserts commas in between.
def MultiplyStr(string, number):
    return (((string + ',') * number).strip(',')) #.rsplit(',')

#Finds compounds in pubchem for each gene in the current graph.
def genTreatmentsGene(curGraph, level = "high"):
    return reduce(lambda x, y: x + [[y[0], getCompoundsGene(getEntrezGene(reactome.getOtherIds([y])[0][1]))]]
                  if len(reactome.getOtherIds([y])[0][1]) > 0 else x,
                  reactome.parseSpecies(getNodeswithAttribute(curGraph, 'level', level)), [])

#Gets the NIH identifier in a list of other identifiers from Reactome
def getEntrezGene(rawList):
    for i in rawList:
        if type(i).__name__ == "str" and len(i[0]) > 11 and i[0][0:11] == "EntrezGene:":
            return i[0][11:]

#Gets the information for molecules that are active against a specific gene
def getCompoundsGene(geneID):
    return DBQueryDataFunc("SELECT DISTINCT BioassayData.aid, BioassayData.sid, BioassayData.cid, BioassayData.score FROM gene2accession LEFT JOIN BioassayDesc ON gene2accession.protein_gi = BioassayDesc.protein_gi LEFT JOIN BioassayData ON BioassayDesc.aid = BioassayData.aid WHERE gene2accession.GeneID = '%s' and BioassayData.activity = 2 and BioassayData.score > 0" % geneID, [], lambda j, k: k)

#Finds compounds in pubchem for the gene product in the current graph.
def genTreatmentsProtein(curGraph, level = "high"):
    return reduce(lambda x, y: x + [[y[0], getCompoundsProteinGI(getProteinGI(reactome.getOtherIds([y])[0][1]))]]
                  if len(reactome.getOtherIds([y])[0][1]) > 0 else x,
                  reactome.parseSpecies(getNodeswithAttribute(curGraph, 'level', level)), [])

#Gets the protein gi identifier from the protein accession identifier.
def getProteinGI(rawList):
    for i in rawList:
        protein_gi = DBQueryDataFunc("SELECT DISTINCT protein_gi FROM gene2accession WHERE protein_accession = '%s'" % i[0], [], lambda j, k: k)
        if len(protein_gi) > 0:
            return protein_gi[0]

#Gets the compounds that are active against a gene product.
def getCompoundsProteinGI(proteinGI):
    return DBQueryDataFunc("SELECT DISTINCT BioassayData.aid, BioassayData.sid, BioassayData.cid, BioassayData.score FROM gene2accession LEFT JOIN BioassayDesc ON gene2accession.protein_gi = BioassayDesc.protein_gi LEFT JOIN BioassayData ON BioassayDesc.aid = BioassayData.aid WHERE gene2accession.protein_gi = '%s' and BioassayData.activity = 2 and BioassayData.score > 0" % proteinGI, [], lambda j, k: k)

#Checks if the GEO dataset has been uploaded and if not uploads it; returns the identifier for the dataset
def insertGEODataset(gds, description):
    gds_id = DBQueryDataFunc("SELECT ID FROM GEODATASET WHERE GDS = '%s'" % gds, [], lambda j, k: k)
    return gds_id[0][0] if len(gds_id) > 0 else DBInsertFunc('GEODATASET', {'GDS': gds, 'DESCRIPTION': description})

#Checks if the GEO sample has been uploaded and if not uploads it; returns the identifier for the sample
def insertGEOSample(geodataset_id, gsm, cellline, description):
    gsm_id = DBQueryDataFunc("SELECT ID FROM GEOSAMPLE WHERE GSM = '%s'" % gsm, [], lambda j, k: k)
    return gsm_id[0][0] if len(gsm_id) > 0 else DBInsertFunc('GEOSAMPLE', {'GEODATASET_ID': geodataset_id, 'GSM': gsm, 'CELLLINE': cellline, 'DESCRIPTION': description})

#Checks if the analysis description has been uploaded and if not uploads it; returns the identifier for the analysis
def insertAnalysis(gds_id, normal_id, experimental_id, filterfactor):
    analysis_id = DBQueryDataFunc("SELECT ID FROM ANALYSIS WHERE GEODATASET_ID = '%s' AND NORMAL_ID = '%s' AND EXPERIMENTAL_ID = '%s' AND FILTERFACTOR = '%s'" % (gds_id, normal_id, experimental_id, filterfactor), [], lambda j, k: k)
    return analysis_id[0][0] if len(analysis_id) > 0 else DBInsertFunc('ANALYSIS', {'GEODATASET_ID': gds_id, 'NORMAL_ID': normal_id, 'EXPERIMENTAL_ID': experimental_id, 'FILTERFACTOR': str(filterfactor)})

#Checks if the analysis result has been uploaded and if not uploads it; returns the identifier for the result
def insertAnalysisResult(analysis_id, stableidentifier_id, aid, lowhigh, geneprotein):
    result_id = DBQueryDataFunc("SELECT ID FROM ANALYSISRESULT WHERE ANALYSIS_ID = '%s' AND StableIdentifier_ID = '%s' AND aid = '%s' AND LOWHIGH = '%s' AND GENEPROTEIN = '%s'" % (analysis_id, stableidentifier_id, aid, lowhigh, geneprotein), [], lambda j, k: k)
    return result_id[0][0] if len(result_id) > 0 else DBInsertFunc('ANALYSISRESULT', {'ANALYSIS_ID': analysis_id, 'StableIdentifier_id': stableidentifier_id, 'aid': aid, 'LOWHIGH': lowhigh, 'GENEPROTEIN': geneprotein})

#Checks if the Reactome identifier has been uploaded and if not uploads it; returns the identifier for the reactome identifier
def insertStableIdentifier(identifier, name):
    react_id = DBQueryDataFunc("SELECT ID FROM StableIdentifier WHERE identifier = '%s'" % identifier, [], lambda j, k: k)
    return react_id[0][0] if len(react_id) > 0 else DBInsertFunc('StableIdentifier', {'identifier': identifier, 'name': name})

#Uploads any compounds that have been determined to be active through analysis
def uploadTreatment(curGraph, analysis_id, lowhigh, geneprotein):
    return map(lambda x: map(lambda p: insertAnalysisResult(analysis_id,
                                 insertStableIdentifier("REACT_" + x[0].split('_')[1],
                                     ''.join(i + ' ' for i in x[0].split('_')[3:]).rstrip(' ')),
                                 p, lowhigh, "PROTEIN"), \
                             reduce(lambda y, z: y + [z[0]] if z[0] not in y else y, x[1], [])) \
                   if len(x[1]) > 0 else insertAnalysisResult(analysis_id,
                       insertStableIdentifier("REACT_" + x[0].split('_')[1],
                           ''.join(i + ' ' for i in x[0].split('_')[3:]).rstrip(' ')),
                       '', lowhigh, "PROTEIN"),
               genTreatmentsProtein(curGraph, lowhigh)) \
        if geneprotein == "PROTEIN" else \
        map(lambda x: map(lambda p: insertAnalysisResult(analysis_id,
                              insertStableIdentifier("REACT_" + x[0].split('_')[1],
                                  ''.join(i + ' ' for i in x[0].split('_')[3:]).rstrip(' ')),
                              p, lowhigh, "GENE"), \
                          reduce(lambda y, z: y + [z[0]] if z[0] not in y else y, x[1], [])) \
                if len(x[1]) > 0 else insertAnalysisResult(analysis_id,
                    insertStableIdentifier("REACT_" + x[0].split('_')[1],
                        ''.join(i + ' ' for i in x[0].split('_')[3:]).rstrip(' ')),
                    '', lowhigh, "GENE"),
            genTreatmentsGene(curGraph, lowhigh))

#Generates and uploads the appropriate descriptive information, performs a treatment analysis and uploads any treatments found.
def doFullAnalysis(curGraph, gds, gds_description, normal_gsm, experimental_gsm, normal_desc, experimental_desc, normal_cellline, experimental_cellline, filterfactor):
    gds_id = insertGEODataset(gds, gds_description)
    normal_id = insertGEOSample(gds_id, normal_gsm, normal_cellline, normal_desc)
    experimental_id = insertGEOSample(gds_id, experimental_gsm, experimental_cellline, experimental_desc)
    analysis_id = insertAnalysis(gds_id, normal_id, experimental_id, filterfactor)
    uploadTreatment(curGraph, analysis_id, "low", "GENE")
    uploadTreatment(curGraph, analysis_id, "high", "GENE")
    uploadTreatment(curGraph, analysis_id, "normal", "GENE")
    uploadTreatment(curGraph, analysis_id, "unknown", "GENE")
    #uploadTreatment(curGraph, analysis_id, "low", "PROTEIN")
    #uploadTreatment(curGraph, analysis_id, "high", "PROTEIN")
    return analysis_id
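
The insert helpers in Listing 3.9 look a row up and insert it only when it is missing, and they build the SELECT statement by interpolating values into the SQL text with `%`. The same pattern can also be written with driver-side parameter binding, which avoids quoting problems when a description contains an apostrophe. The following is a minimal sketch under stated assumptions and is not part of the Logos Ex Machina code: it is written for Python 3, it uses the standard-library `sqlite3` module in place of MySQLdb so that it runs without a server, and the names `demo_connection` and `get_or_insert` as well as the sample values are hypothetical.

```python
import sqlite3


def demo_connection():
    """Return an in-memory database holding a reduced GEODATASET table (assumed subset)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE GEODATASET ("
        "ID INTEGER PRIMARY KEY AUTOINCREMENT, "
        "GDS TEXT NOT NULL, DESCRIPTION TEXT NOT NULL)"
    )
    return conn


def get_or_insert(conn, table, key_column, values):
    """Return the ID of the row matching key_column, inserting the row first if absent.

    Table and column names cannot be bound as parameters, so they must come
    from trusted code; only the values are bound through '?' placeholders.
    """
    row = conn.execute(
        "SELECT ID FROM %s WHERE %s = ?" % (table, key_column),
        (values[key_column],),
    ).fetchone()
    if row is not None:
        return row[0]
    columns = list(values)
    placeholders = ", ".join("?" for _ in columns)
    cursor = conn.execute(
        "INSERT INTO %s (%s) VALUES (%s)" % (table, ", ".join(columns), placeholders),
        [values[c] for c in columns],
    )
    conn.commit()
    return cursor.lastrowid


conn = demo_connection()
record = {"GDS": "1234", "DESCRIPTION": "patient's sample"}  # invented example values
first = get_or_insert(conn, "GEODATASET", "GDS", record)
second = get_or_insert(conn, "GEODATASET", "GDS", record)  # finds the existing row
```

With MySQLdb the placeholder is `%s` rather than `?`, and the values would be passed as the second argument of `cursor.execute`, which the `params` argument of `DBExecuteQuery` already allows for.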


Listing 3.10: Complete Analysis Program

#!/usr/bin/env python
#logosexmachina.py
#Author: Andrew Avila
#Description: The complete program Logos Ex Machina; takes in a series of files, builds a logical model, adds the initial conditions, solves it with gringo and clasp, and looks for possible treatments. Uploads results to the database. Look at the usage line below for the specific files needed for input and their format.
#Usage: logosexmachina.py -s sbml_file -g geo_file -h gds_desc -a normal_gsm -b normal_cellline -c normal_desc -i experimental_gsm -j experimental_cellline -k experimental_desc -z filterfactor

#Libraries
import genTreatments as treat
import condGen as cond
import modelGen as model
import sys
import subprocess
import os

#Main function that processes command line arguments
def main():
    #Default parameters
    geoFile = False
    sbmlFile = False
    gdsDesc = False
    nGsm = False
    nLine = False
    nDesc = False
    eGsm = False
    eLine = False
    eDesc = False
    filterFactor = False

    #Pull command line arguments
    for x in range(0, len(sys.argv)):
        if (sys.argv[x] == '-g'):
            geoFile = sys.argv[x+1]
        if (sys.argv[x] == '-s'):
            sbmlFile = cond.etree.parse(sys.argv[x+1])
        if (sys.argv[x] == '-h'):
            gdsDesc = sys.argv[x+1]
        if (sys.argv[x] == '-a'):
            nGsm = sys.argv[x+1]
        if (sys.argv[x] == '-b'):
            nLine = sys.argv[x+1]
        if (sys.argv[x] == '-c'):
            nDesc = sys.argv[x+1]
        if (sys.argv[x] == '-i'):
            eGsm = sys.argv[x+1]
        if (sys.argv[x] == '-j'):
            eLine = sys.argv[x+1]
        if (sys.argv[x] == '-k'):
            eDesc = sys.argv[x+1]
        if (sys.argv[x] == '-z'):
            filterFactor = float(sys.argv[x+1])

    #If valid argument run the appropriate processing function
    if geoFile and sbmlFile and gdsDesc and nGsm and nLine and nDesc and eGsm and eLine and eDesc:
        speciesTree = sbmlFile.find(".//" + model.tag_header + "listOfSpecies")
        reactionTree = sbmlFile.find(".//" + model.tag_header + "listOfReactions")
        modelStr = model.genFluents(speciesTree)
        modelStr += model.genReactions(reactionTree)
        modelStr += "#hide.\n#show exists(X, C)."

        spcList = cond.genOutSpecies(speciesTree)
        geoArray = cond.ParseSOFT(geoFile)
        mapDict = cond.MapElements(geoArray)
        geneDict = cond.GeneLevels(geoArray, mapDict, nGsm, eGsm, filterFactor) if filterFactor else cond.GeneLevels(geoArray, mapDict, nGsm, eGsm)
        otherList = cond.getOtherIds(cond.parseSpecies(spcList))
        condDict = cond.GenConds(geneDict, otherList)
        aspStr = cond.AspConds(cond.SbmlConds(condDict, cond.arrToDict(cond.parseSpecies(spcList))))

        gringoOut = subprocess.Popen(["gringo"], stdin=subprocess.PIPE, stdout=subprocess.PIPE).communicate(aspStr + '\n' + modelStr)[0]
        claspOut = subprocess.Popen(["clasp"], stdin=subprocess.PIPE, stdout=subprocess.PIPE).communicate(gringoOut)[0]

        curGraph = treat.genReacGraph(reactionTree)
        curGraph = treat.genColorGraph(claspOut, curGraph)
        treat.doFullAnalysis(curGraph, os.path.splitext(os.path.basename(geoFile))[0].split('_')[0][3:], gdsDesc, nGsm, eGsm, nDesc, eDesc, nLine, eLine, filterFactor)

#Wrapper for when script is run stand-alone
if __name__ == '__main__':
    main()
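
Listing 3.10 reads its flags by scanning `sys.argv` by hand, so a flag given as the last argument would fail with an IndexError on `sys.argv[x+1]`. As a hedged alternative, the same usage line could be expressed with the standard-library `argparse` module. This is a sketch for Python 3 only; the function name `build_parser`, the destination names, and the sample file names and values are assumptions made for illustration and do not come from the original program.

```python
import argparse


def build_parser():
    # add_help=False frees "-h", which the usage line of Listing 3.10 assigns
    # to the dataset description rather than to a help message.
    parser = argparse.ArgumentParser(prog="logosexmachina.py", add_help=False)
    required = [
        ("-s", "sbml_file"), ("-g", "geo_file"), ("-h", "gds_desc"),
        ("-a", "normal_gsm"), ("-b", "normal_cellline"), ("-c", "normal_desc"),
        ("-i", "experimental_gsm"), ("-j", "experimental_cellline"),
        ("-k", "experimental_desc"),
    ]
    for flag, dest in required:
        parser.add_argument(flag, dest=dest, required=True)
    # The filter factor is optional in the listing, so it defaults to None here.
    parser.add_argument("-z", dest="filter_factor", type=float, default=None)
    return parser


# Hypothetical invocation; every value below is invented for the example.
args = build_parser().parse_args([
    "-s", "model.sbml", "-g", "GDS1234_full.soft", "-h", "demo dataset",
    "-a", "GSM1", "-b", "normal line", "-c", "normal tissue",
    "-i", "GSM2", "-j", "tumor line", "-k", "tumor tissue", "-z", "1.5",
])
```

Marking the nine flags as required reproduces the all-or-nothing check in `main()`, with the difference that a missing flag produces an error message instead of a silent exit.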


Listing 3.11: Interactive Graph Analysis Library

#!/usr/bin/env python
#inspectoris.py
#Author: Andrew Avila
#Description: A dual function tool for analysis of results from Logos Ex Machina. Primary purpose is as an interactive library for working with results. Secondary purpose is for use as a script to facilitate output of analysis information in a tab delimited format. To get the list of analyses, do not pass any parameters to the script.
#Usage: inspectoris.py -e analysisID -t geneProtein

#Libraries
import MySQLdb
import sys
import networkx as nx
import matplotlib.pyplot as plt
from lxml import etree
from itertools import *

#Expected header of sbml file
tag_header = "{http://www.sbml.org/sbml/level2}"

#MySQL Database Parameters
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'logosexmachina'
MYSQL_PASSWD = 'logosexmachina'
MYSQL_DB = 'logosexmachina'

#Generates an anonymous function that takes in three variables to be applied to the four variable function "f". The first variable of "f" will be a dynamically managed mysql connection.
def DBManageConn(f):
    return lambda x, y, z: f(MySQLdb.connect(MYSQL_HOST, MYSQL_USER, MYSQL_PASSWD, MYSQL_DB), x, y, z)

#Generates an anonymous function that takes in three variables to be applied to the three variable function "f". The first variable to the anonymous function will be the mysql connection, from which a cursor is created; after evaluation the fetchall() method of the object returned by "f" is run.
def DBFetch(f):
    return lambda x, y, z: f(x.cursor(), y, z).fetchall()

#A decorated function that takes in four variables; a mysql connection, a query to be executed, the parameters of the query, and a function to be applied to the results of the executed query (the results of which are returned). It is noted that the connection is dynamically managed by the decorator function.
@DBManageConn
def DBQueryDataFunc(conn, query, params, datafunc = lambda x, y: y):
    #print query
    data = datafunc(conn, DBExecuteQuery(conn, query, params))
    conn.close()
    return data

#A decorated function that takes in three variables; a mysql cursor, a query to be executed, and the parameters to the query. Returns the cursor after query execution.
@DBFetch
def DBExecuteQuery(cursor, query, params):
    cursor.execute(query, params)
    return cursor

#A function that inserts into a database table a series of values stored in a hashtable and applies a function to the result of the insertion. The hashtable has the format of the key being the column name and the value being the value to be inserted. By default the function applied returns the value of the primary key of the last row inserted.
def DBInsertFunc(table, nvpairs, datafunc = lambda x, y: x.insert_id()):
    return DBQueryDataFunc("INSERT INTO %s (%s) VALUES (%s)" % (table,
                               reduce(lambda x, y: x + ', ' + y, nvpairs.iterkeys()),
                               MultiplyStr("%s", len(map(lambda x: x, nvpairs.itervalues())))),
                           map(lambda x: x, nvpairs.itervalues()), datafunc)

#A function that multiplies a string a number of times and inserts commas in between.
def MultiplyStr(string, number):
    return (((string + ',') * number).strip(',')) #.rsplit(',')

#Generates a reaction graph from the sbml imported object
def genReacGraph(tElem):
    reactGraph = nx.DiGraph()
    for subElem in tElem:
        reactList = []
        prodList = []
        for i in subElem:
            if (i.tag == tag_header + "listOfReactants"):
                for j in i:
                    reactList.append(j.attrib['species'])
            if (i.tag == tag_header + "listOfProducts"):
                for j in i:
                    prodList.append(j.attrib['species'])
        reactGraph.add_nodes_from(reactList)
        reactGraph.add_nodes_from(prodList)
        for j in reactList:
            for k in prodList:
                reactGraph.add_edge(j, k)

    return reactGraph

#Takes in the output of clasp and a reaction graph and adds gene expression information to the existing reaction graph
def genColorGraph(filePath, reactGraph, splitChar = " "):
    if filePath[0:5] == "clasp":
        rawList = reduce(lambda x, y: x + ["start"] if y == "Answer: 1" else \
                         (x + [y.split(splitChar)] if len(x) > 0 and y[0:6] == "exists" else x),
                         filePath.split('\n'), [])[1:][0]
    else:
        rawList = reduce(lambda x, y: x + ["start"] if y == "Answer: 1\n" else \
                         (x + [y.split(splitChar)] if len(x) > 0 and y[0:6] == "exists" else x),
                         open(filePath, 'r').readlines(), [])[1:][0]
    ordList = map(lambda x: x.replace("\n", "")[7:len(x) - 1].split(","), rawList[0:len(rawList) - 1])

    for i in ordList:
        if i[0] in reactGraph:
            if i[1] == "normal":
                reactGraph.node[i[0]]['level'] = 'normal'
            elif i[1] == "high":
                reactGraph.node[i[0]]['level'] = 'high'
            elif i[1] == "low":
                reactGraph.node[i[0]]['level'] = 'low'
            else:
                reactGraph.node[i[0]]['level'] = 'unknown'
    return reactGraph

#Takes in the output of stats.py and a reaction graph and adds gene expression information to the existing reaction graph
def genColorGraphStats(filePath, reactGraph, splitChar = " "):
    rawList = reduce(lambda x, y: x + [y.replace("\n", "").split('\t')], open(filePath, 'r'), [])
    rawDict = {}
    for i in rawList:
        rawDict[i[0]] = i[1:4]
    for i in reactGraph.nodes():
        if "REACT_" + i.split('_')[1] in rawDict:
            if rawDict["REACT_" + i.split('_')[1]][2] == "Significant High":
                reactGraph.node[i]['level'] = 'high'
            elif rawDict["REACT_" + i.split('_')[1]][2] == "Significant Low":
                reactGraph.node[i]['level'] = 'low'
            elif rawDict["REACT_" + i.split('_')[1]][2] == "Not Significant":
                reactGraph.node[i]['level'] = 'normal'
            else:
                reactGraph.node[i]['level'] = 'unknown'
    return reactGraph

#Processes a graph to find nodes with a particular expression
def getNodeswithAttribute(reactGraph, attribute = 'level', cond = 'high'):
    return reduce(lambda x, y: x + [y] if reactGraph.node[y].has_key(attribute) and reactGraph.node[y][attribute] == cond else x,
                  reactGraph.nodes(), [])

#Generates an array of graphs centering on isolated clusters of genes containing a particular attribute
def genAttributeTrees(reactGraph, attribute = 'level', cond = 'high'):
    attNodes = getNodeswithAttribute(reactGraph, attribute, cond)

    attDict = {}
    for i in attNodes:
        attDict[i] = False

    treeArr = []
    for i in attNodes:
        if attDict[i] == False:
            xGraph = nx.DiGraph()
            xGraph.add_node(i)
            xGraph.node[i]['level'] = reactGraph.node[i]['level']
            xGraph = accumTravel(xGraph, i, reactGraph, attribute, cond)
            treeArr += [xGraph]
            for j in xGraph.nodes():
                if attDict.has_key(j):
                    attDict[j] = True

    return treeArr

#Travels a graph, continually searching for nodes with the same attribute as the parent node until no new nodes are found
def accumTravel(curGraph, node, reactGraph, attribute = 'level', cond = 'high'):
    newGraph = recurTravel(curGraph, node, reactGraph, attribute, cond)
    while True:
        curGraph = newGraph.copy()
        for i in newGraph.nodes():
            newGraph = recurTravel(newGraph, i, reactGraph, attribute, cond)
        if len(curGraph.nodes()) == len(newGraph.nodes()):
            break
    return curGraph

#Note: initially written in a recursive manner though had issues with stack so rewritten with loops. Function that travels a graph beginning at a node searching for connected nodes with the same expression level as the parent node
def recurTravel(curGraph, node, reactGraph, attribute = 'level', cond = 'high'):
    for i in reactGraph.out_edges(node):
        if reactGraph.node[i[1]][attribute] == cond:
            curGraph.add_node(i[1])
            curGraph.node[i[1]][attribute] = reactGraph.node[i[1]][attribute]
            curGraph.add_edge(node, i[1])
    for i in reactGraph.in_edges(node):
        #For an incoming edge the neighboring node is the source, i[0]
        if reactGraph.node[i[0]][attribute] == cond:
            curGraph.add_node(i[0])
            curGraph.node[i[0]][attribute] = reactGraph.node[i[0]][attribute]
            curGraph.add_edge(i[0], node)
    return curGraph

#Produces a graph with one gene, pass Reactome identifier
def getGene(curGraph, reactID):
    xGraph = nx.DiGraph()
    for i in curGraph.nodes():
        if i[1:len(reactID)+1] == reactID:
            xGraph.add_node(i)
    return xGraph

#Finds the graph with the most nodes among an array of graphs
def getLongestTree(treeArr):
    return reduce(lambda x, y: y if len(y.nodes()) > len(x.nodes()) else x, treeArr)

#Gets the nearest neighbors to all the nodes in a graph to a specified depth
def getNearestNeighbors(reactGraph, treeGraph, recurDepth):
    newGraph = treeGraph.copy()
    for i in treeGraph.nodes():
        for j in reactGraph.out_edges(i):
            newGraph.add_node(j[1])
            newGraph.node[j[1]]['level'] = reactGraph.node[j[1]]['level']
            newGraph.add_edge(i, j[1])
        for j in reactGraph.in_edges(i):
            newGraph.add_node(j[0])
            newGraph.node[j[0]]['level'] = reactGraph.node[j[0]]['level']
            newGraph.add_edge(j[0], i)
    if recurDepth > 0:
        newGraph = getNearestNeighbors(reactGraph, newGraph, recurDepth - 1)
    return newGraph

#Produces a graph plot in new window
def drawGraph(curGraph, writeDot = False):
    pos = nx.graphviz_layout(curGraph, prog="neato")
    nx.draw_networkx_nodes(curGraph, pos, nodelist = getNodeswithAttribute(curGraph, 'level', 'high'), node_color = 'r')
    nx.draw_networkx_nodes(curGraph, pos, nodelist = getNodeswithAttribute(curGraph, 'level', 'low'), node_color = 'b')
    nx.draw_networkx_nodes(curGraph, pos, nodelist = getNodeswithAttribute(curGraph, 'level', 'normal'), node_color = 'g')
    nx.draw_networkx_nodes(curGraph, pos, nodelist = getNodeswithAttribute(curGraph, 'level', 'unknown'), node_color = 'y')
    nx.draw_networkx_edges(curGraph, pos, arrows = False)

    labels = {}
    for i in curGraph.nodes():
        labels[i] = " ".join(i.split('_')[3:])
    nx.draw_networkx_labels(curGraph, pos, labels, font_size = 8)

    plt.rcParams['figure.facecolor'] = 'w'
    plt.axis('off')
    plt.show()
    if writeDot:
        nx.write_dot(curGraph, writeDot)

#Produce a figure centering on a particular gene
def drawGene(curGraph, reactID, depth):
    drawGraph(getNearestNeighbors(curGraph, getGene(curGraph, reactID), depth))

#Performs the complete graph analysis for a specific expression level then draws a plot of the longest tree with its nearest neighbors up to a depth of 2
def metaGraph(sbmlFile, aspFile, level = "high"):
    tree = etree.parse(sbmlFile)
    reactionTree = tree.find(".//" + tag_header + "listOfReactions")
    curGraph = genReacGraph(reactionTree)
    curGraph = genColorGraph(aspFile, curGraph)
    curTrees = genAttributeTrees(curGraph, 'level', level)
    print len(curTrees)
    print len(getNodeswithAttribute(curGraph, 'level', level))
    longTree = getLongestTree(curTrees)
    neighborGraph = getNearestNeighbors(curGraph, longTree, 1)
    #nx.draw(longTree)
    #plt.show()
    drawGraph(neighborGraph)
    return curTrees

#Performs the complete analysis for a specific expression level though does not produce a plot
def getLevelTrees(sbmlFile, aspFile, level = "high"):
    tree = etree.parse(sbmlFile)
    reactionTree = tree.find(".//" + tag_header + "listOfReactions")
    curGraph = genReacGraph(reactionTree)
    curGraph = genColorGraph(aspFile, curGraph)
    curTrees = genAttributeTrees(curGraph, 'level', level)
    return [curGraph, curTrees]

#Performs the complete analysis for a specific expression level though does not produce a plot (with stats)
def getLevelTreesStats(sbmlFile, statFile, level = "high"):
    tree = etree.parse(sbmlFile)
    reactionTree = tree.find(".//" + tag_header + "listOfReactions")
    curGraph = genReacGraph(reactionTree)
    curGraph = genColorGraphStats(statFile, curGraph)
    curTrees = genAttributeTrees(curGraph, 'level', level)
    return [curGraph, curTrees]

#Gets a list of all the experiments in database
def getExperiments():
    return DBQueryDataFunc("SELECT ANALYSIS.ID, FILTERFACTOR, GEO1.GSM, GEO2.GSM, GEO1.CELLLINE, GEO2.CELLLINE FROM ANALYSIS LEFT JOIN GEOSAMPLE GEO1 ON ANALYSIS.NORMAL_ID = GEO1.ID LEFT JOIN GEOSAMPLE GEO2 ON ANALYSIS.EXPERIMENTAL_ID = GEO2.ID", [], lambda j, k: k)

#Gets the results of a specific experiment
def getTreatments(analysisID, geneProtein):
    return DBQueryDataFunc("SELECT DISTINCT LOWHIGH, aid, identifier, name FROM ANALYSISRESULT LEFT JOIN StableIdentifier ON StableIdentifier.ID = ANALYSISRESULT.StableIdentifier_ID WHERE ANALYSISRESULT.ANALYSIS_ID = '%s' AND GENEPROTEIN = '%s' ORDER BY LOWHIGH" % (analysisID, geneProtein), [], lambda j, k: k)

#Main function that processes command line arguments
def main():

    #By default no arguments
    analysisID = False
    geneProtein = False

    #Retrieves command line arguments and parses them
    for x in range(0, len(sys.argv)):
        if (sys.argv[x] == '-e'):
            analysisID = sys.argv[x+1]
        if (sys.argv[x] == '-t'):
            geneProtein = sys.argv[x+1]

    #If no argument output list of experiments
    if not (analysisID and geneProtein):
        for i in getExperiments():
            print "%s\t%s\t%s\t%s\t%s\t%s" % (i[0], i[1], i[2], i[4], i[3], i[5])

    #Retrieve experiment specified
    if analysisID and geneProtein:
        for i in (getTreatments(analysisID, "GENE") if geneProtein[0] == 'g' else getTreatments(analysisID, "PROTEIN")):
            print "%s\t%s\t%s\t%s" % (i[0], i[1], i[2], i[3])

#Wrapper for when script is run stand-alone
if __name__ == '__main__':
    main()
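
The core of Listing 3.11 is a two-step idea: read the `exists(X, C)` atoms of an answer set into per-species expression levels, then grow clusters of reaction-connected species that share one level (`genColorGraph`, `genAttributeTrees`, `getLongestTree`). The sketch below restates that idea without networkx so that it runs on a bare interpreter. It is an illustration under stated assumptions, not the author's implementation: it targets Python 3, it replaces the repeated `recurTravel` passes with a breadth-first search, and the names `parse_levels` and `clusters` and the toy species `a` to `d` are hypothetical.

```python
import re
from collections import deque

# One atom looks like "exists(species,level)"; whitespace after the comma is tolerated.
ATOM = re.compile(r"exists\(([^,()]+),\s*([^,()]+)\)")


def parse_levels(answer_line):
    """Map each species named in an answer-set line to its expression level."""
    return {name: level for name, level in ATOM.findall(answer_line)}


def clusters(edges, levels, wanted="high"):
    """Group species at the wanted level into clusters connected through reactions.

    Edges are followed in both directions, mirroring the use of both
    out-edges and in-edges in the listing.
    """
    neighbours = {}
    for source, target in edges:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)
    seen, found = set(), []
    for start in sorted(n for n, lvl in levels.items() if lvl == wanted):
        if start in seen:
            continue
        seen.add(start)
        group, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            group.add(node)
            for nxt in neighbours.get(node, ()):
                if nxt not in seen and levels.get(nxt) == wanted:
                    seen.add(nxt)
                    queue.append(nxt)
        found.append(group)
    return found


# Toy reaction chain a -> b -> c -> d, with c at a different level than the rest.
levels = parse_levels("exists(a,high) exists(b,high) exists(c,low) exists(d,high)")
groups = clusters([("a", "b"), ("b", "c"), ("c", "d")], levels)
largest = max(groups, key=len)  # counterpart of getLongestTree
```

In this toy chain the low-level species `c` separates `d` from `a` and `b`, so two clusters are found and the larger one contains `a` and `b`.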


Listing 3.12: MySQL Database Structure

-- MySQL Structure for Logos Ex Machina

SET SQL_MODE="NO_AUTO_VALUE_ON_ZERO";

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;

--
-- Database: `logosexmachina`
--

-- --------------------------------------------------------

--
-- Table structure for table `ANALYSIS`
--

CREATE TABLE IF NOT EXISTS `ANALYSIS` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `GEODATASET_ID` int(11) NOT NULL,
  `NORMAL_ID` int(11) NOT NULL,
  `EXPERIMENTAL_ID` int(11) NOT NULL,
  `FILTERFACTOR` decimal(6,4) NOT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=22 ;

-- --------------------------------------------------------

--
-- Table structure for table `ANALYSISRESULT`
--

CREATE TABLE IF NOT EXISTS `ANALYSISRESULT` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `ANALYSIS_ID` int(11) NOT NULL,
  `StableIdentifier_ID` int(11) NOT NULL,
  `aid` int(11) NOT NULL,
  `LOWHIGH` varchar(8) NOT NULL,
  `GENEPROTEIN` varchar(16) NOT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=42645 ;

-- --------------------------------------------------------

--
-- Table structure for table `BioassayData`
--

CREATE TABLE IF NOT EXISTS `BioassayData` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `aid` int(11) NOT NULL,
  `sid` int(11) DEFAULT NULL,
  `cid` int(11) DEFAULT NULL,
  `activity` int(11) DEFAULT NULL,
  `score` int(11) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=75178685 ;

-- --------------------------------------------------------

--
-- Table structure for table `BioassayDesc`
--

CREATE TABLE IF NOT EXISTS `BioassayDesc` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `aid` int(11) NOT NULL,
  `protein_gi` int(11) NOT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=47063 ;

-- --------------------------------------------------------

--
-- Table structure for table `gene2accession`
--

CREATE TABLE IF NOT EXISTS `gene2accession` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `tax_id` int(11) DEFAULT NULL,
  `GeneID` int(11) DEFAULT NULL,
  `protein_accession` varchar(20) DEFAULT NULL,
  `protein_gi` int(11) DEFAULT NULL,
  `gene_nucleotide_accession` varchar(20) DEFAULT NULL,
  `gene_nucleotide_gi` int(11) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=717152 ;

-- --------------------------------------------------------

--
-- Table structure for table `gene_info`
--

CREATE TABLE IF NOT EXISTS `gene_info` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `tax_id` int(11) DEFAULT NULL,
  `GeneID` int(11) DEFAULT NULL,
  `Symbol` varchar(20) DEFAULT NULL,
  `Description` text,
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=45448 ;

-- --------------------------------------------------------

--
-- Table structure for table `GEODATASET`
--

CREATE TABLE IF NOT EXISTS `GEODATASET` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `GDS` varchar(16) NOT NULL,
  `DESCRIPTION` varchar(255) NOT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=6 ;

-- --------------------------------------------------------

--
-- Table structure for table `GEOSAMPLE`
--

CREATE TABLE IF NOT EXISTS `GEOSAMPLE` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `GEODATASET_ID` int(11) NOT NULL,
  `GSM` varchar(16) NOT NULL,
  `CELLLINE` varchar(128) NOT NULL,
  `DESCRIPTION` varchar(255) NOT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=28 ;

-- --------------------------------------------------------

--
-- Table structure for table `StableIdentifier`
--

CREATE TABLE IF NOT EXISTS `StableIdentifier` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `identifier` varchar(20) NOT NULL,
  `name` varchar(255) NOT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1529 ;
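
The compound look-up in `getCompoundsGene` (Listing 3.9) chains three of the tables above: `gene2accession` links a gene to its protein GI numbers, `BioassayDesc` links a protein GI to assay identifiers, and `BioassayData` holds the per-compound outcomes, of which only rows with `activity = 2` and a positive score are kept. The following sketch replays that join on toy data. It is illustrative only: it assumes Python 3 and the standard-library `sqlite3` module rather than MySQL, it creates only the columns the join needs, the function name `compounds_for_gene` is hypothetical, and every row is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene2accession (GeneID INTEGER, protein_gi INTEGER);
CREATE TABLE BioassayDesc (aid INTEGER, protein_gi INTEGER);
CREATE TABLE BioassayData (aid INTEGER, sid INTEGER, cid INTEGER,
                           activity INTEGER, score INTEGER);
-- Toy rows, invented for illustration only
INSERT INTO gene2accession VALUES (7157, 100), (7157, 101), (672, 200);
INSERT INTO BioassayDesc VALUES (1, 100), (2, 200);
INSERT INTO BioassayData VALUES (1, 11, 21, 2, 80), (1, 12, 22, 1, 0),
                                (2, 13, 23, 2, 90);
""")


def compounds_for_gene(conn, gene_id):
    """Return (aid, sid, cid, score) rows for compounds recorded as active against a gene."""
    query = """
        SELECT DISTINCT d.aid, d.sid, d.cid, d.score
        FROM gene2accession g
        LEFT JOIN BioassayDesc b ON g.protein_gi = b.protein_gi
        LEFT JOIN BioassayData d ON b.aid = d.aid
        WHERE g.GeneID = ? AND d.activity = 2 AND d.score > 0
    """
    return conn.execute(query, (gene_id,)).fetchall()


hits = compounds_for_gene(conn, 7157)
```

Because the WHERE clause tests columns of `BioassayData`, protein GI numbers without any assay (such as 101 in the toy data) drop out of the result even though the joins are written as LEFT JOINs.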


CHAPTER IV ANALYSIS OF DIFFERENTIALLY EXPRESSED GENES IN CANCER CELL LINES

Introduction

To direct cancer research efforts most effectively, it is important to identify promising genetic targets for further investigation. This task is complicated by the often sporadic nature of cancer, which can produce a wide variety of mutations. However, cancer has at least one defining characteristic: the gross proliferation of cells. It is reasonable, through reductive reasoning, to suspect that this characteristic may be caused by the deregulation of certain genes that remain to be determined. Furthermore, other commonly deregulated genes may contribute to a cancer's aggressive nature. Therefore, the goal of this study was to determine which genes are expressed at levels significantly different from normal, first among sub-populations of cancer and then in cancer as a whole. A review of the literature shows that a variety of similar analyses have been performed in the past. The current status of the field, as it pertains to identifying potential genetic targets through analysis of microarray datasets, is discussed in the following section.

Microarray Meta-Analyses of Cancer

The first study reviewed was by Singh et al. (2002), who analyzed the gene expression pattern in prostate cancer. Specifically, 52 tumor and 50 normal prostate samples were studied using oligonucleotide microarrays containing probes for approximately 12,600 genes (Singh et al., 2002). To determine whether genes in tumor samples were significantly differentially expressed relative to normal tissue, a variation of a signal-to-noise metric was applied (Singh et al., 2002). Furthermore, a supervised machine learning algorithm was developed to allow unknown prostate samples to be identified as tumorigenic or not (Singh et al., 2002). Of the 317 genes that had higher expression in tumor samples, 5 genes (chromogranin A, PDGFRβ, HOXC6, ITPR3, and sialyltransferase-1) were determined to be useful predictors of outcome (Singh et al., 2002). It is important to note that the samples used by the authors were from primary tissue and not from established in vitro cell models. Finally, the authors found that traditional diagnostic methods such as PSA screening and tissue inversion were not correlated with a particular gene expression pattern (Singh et al., 2002).

A study by Frierson Jr et al. (2002) investigated differentially expressed genes in salivary adenoid cystic carcinoma (ACC). The authors analyzed 15 samples from individuals with ACC (primary samples), one established ACC cell line, and five normal tissue samples (Frierson Jr et al., 2002). Samples were compared for divergent expression levels between groups using a combination of methods including: “the difference of hybridization intensities, the quotient of hybridization intensities, and an unpaired t-test between expression levels in tumor and normal tissue” (Frierson Jr et al., 2002). A series of 30 genes was identified as significantly overexpressed in both the primary samples and the ACC cell line (Frierson Jr et al., 2002). A different series of 30 genes was identified as significantly underexpressed in the same sample sets (Frierson Jr et al., 2002). Of the significantly overexpressed genes, SOX4 showed the highest expression level (Frierson Jr et al., 2002). This gene is involved in a variety of pathways, although the authors note that the connection to cancer is not clear (Frierson Jr et al., 2002). Finally, the authors note that the expression patterns differ between the primary samples and the ACC cell line (Frierson Jr et al., 2002), although they did share approximately 60 percent of the 100 genes overexpressed in primary samples (Frierson Jr et al., 2002).

The next study reviewed was a meta-analysis performed by Rhodes et al. (2004) on a variety of different cancers. The authors collected and analyzed 40 published microarray datasets, composed of over 3,700 cancer samples (Rhodes et al., 2004). Although several analyses were performed by the authors, the most relevant was the comparison of cancer versus respective normal tissue. Of the 40 datasets, 36 contained information for both cancer and its respective normal tissue; these datasets were used in this specific analysis. They span a variety of cancer types including: breast, prostate, colon, lung, liver, brain, ovary, pancreas, uterus, salivary gland, bladder, and B lymphocytes (Rhodes et al., 2004). Given the diverse sources of the original datasets, the authors developed a unique statistical method, comparative metaprofiling, to derive the set of genes that were significantly differentially expressed (Rhodes et al., 2004). The authors note that it is generally not acceptable to compare different microarray datasets directly because of differences in experimental platform; therefore, a series of data transformations was required so that statistical parameters could be compared across studies (Rhodes et al., 2004). No universal differential gene expression pattern (i.e. cancer vs. normal) was found across all cancers analyzed (Rhodes et al., 2004). However, a subset of the cancers did contain a gene expression pattern that was found to be significant.
Specifically, the genes (CDKN3, CKS2, E2F5, PTMA, PLK, and CCT4) were found to be involved in cell cycle regulation (Rhodes et al., 2004). The authors

101 Texas Tech University, Andrew Avila, May 2012 state that these genes may serve as valuable targets for future pharmaceutical research (Rhodes et al., 2004).

Summary

The previous studies have emphasized the need to find common gene expression patterns in cancer. Furthermore, these studies suggest that it is unlikely that a single signature gene expression profile exists that is common to cancer as a whole. Nonetheless, it has been shown that subpopulations of cancer do indeed have their own signature gene expression profiles. This is important to note, as the direction of cancer research and pharmaceutical development should be guided by those patterns that are most common in cancer. This will hopefully lead to developments that aid the greatest number of individuals afflicted by this disease. Therefore, through the application of previously developed analysis tools (Chapter III) and the use of an appropriate statistical method, an analysis scanning for common gene expression patterns in cancer was performed in this study.

Materials and Methods

The resources required for this study comprised two main categories. The first resource was publicly available microarray data that fit certain requirements. The first requirement was the inclusion of gene expression data for both cancer and normal cells. Secondly, the datasets needed to originate from established in-vitro cell lines. Lastly, all of the datasets needed to make use of the same microarray platform. The datasets used were downloaded from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo). The following are the specific datasets used and a short description thereof:

1. GDS3233 (Cervical Cancer - Carcinoma)

• Platform: Affymetrix Human Genome U133.

• Normal Cell Line: Ambion Normal Cervix

• Cancer Cell Lines: C4-I, CaSki, C-33A, HT-3, SiHa, SW756, MS751, ME-180, HeLa

2. GDS820 (Breast Cancer - Carcinoma)

• Platform: Affymetrix Human Genome U133.

• Normal Cell Line: Cambrex Bio Normal Epithelial

• Cancer Cell Lines: MDA-MB-436, HCC1954

3. GDS1220 (Mesothelioma - Sarcoma)

• Platform: Affymetrix Human Genome U133.

• Normal Cell Line: Met-5A

• Cancer Cell Lines: MSTO-211H, MS589, MS428, JMN1B
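
For readers who wish to work with these datasets programmatically, the list above can be transcribed into a small data structure. The following is only an illustrative sketch: the dictionary layout and the names `DATASETS` and `n_cancer_lines` are hypothetical and are not part of the software package described in Chapter III.

```python
# Dataset registry transcribed from the list above (layout is an assumption).
DATASETS = {
    "GDS3233": {
        "cancer": "cervical",
        "normal_line": "Ambion Normal Cervix",
        "cancer_lines": ["C4-I", "CaSki", "C-33A", "HT-3", "SiHa",
                         "SW756", "MS751", "ME-180", "HeLa"],
    },
    "GDS820": {
        "cancer": "breast",
        "normal_line": "Cambrex Bio Normal Epithelial",
        "cancer_lines": ["MDA-MB-436", "HCC1954"],
    },
    "GDS1220": {
        "cancer": "mesothelioma",
        "normal_line": "Met-5A",
        "cancer_lines": ["MSTO-211H", "MS589", "MS428", "JMN1B"],
    },
}


def n_cancer_lines(datasets):
    """Count the cancer cell lines listed for each GEO dataset."""
    return {gds: len(entry["cancer_lines"]) for gds, entry in datasets.items()}
```

Counting the entries gives the number of cancer cell lines listed per dataset; whether each line contributes exactly one sample to the analysis is not stated here and should not be assumed from this sketch.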

The second resource requirement involved the software used to examine the microarray datasets. The comparison between normal and cancer cells was facilitated by the previously developed software package (Chapter III). Briefly, this software package makes use of a logical model in order to deduce gene expression levels and attribute a qualitative measure (underexpressed, normal, overexpressed). Also, through deductive reasoning the software can make predictions of the gene expression levels of those genes that have not been observed empirically. Furthermore, the logical nature of the model prevents contradictory gene expression levels by checking whether a model is solvable. Solutions to the model are then stored in a database for further processing, both within the software package and outside it. It is noted that the cut-off value for demarcating the difference between underexpressed/normal and normal/overexpressed was set at 0.10, based on the work of Rhodes et al. (2004). Essentially, if the expression level of a gene in a cancer cell was 10 percent above or below the level in the normal cell, the gene was labeled as overexpressed or underexpressed, respectively.

In order to discover gene expression patterns in cancer, the ordinal data generated by the software package had to be statistically analyzed through a novel method. This statistical method was of non-parametric design and made use of random resampling with replacement in order to estimate the distribution of the expression levels of a gene from a population of cancer cells. To perform the analysis, the discretized ordinal values must be given numerical values in logical order (e.g. underexpressed: 1, normal: 2, overexpressed: 3). Statistical significance was determined by whether the mean of the distribution ($\bar{x}$) fell within a specified distance of either extremity of the ordinal scale. This essentially presented as a two-tailed design, although each tail has a specific meaning (i.e. significantly underexpressed vis-a-vis significantly overexpressed). The formula to calculate the lower significance boundary was determined to be $B_{low} = E_{low} + R \cdot S$ (where $E$ is the value of the extremity, $R$ is the range, and $S$ is the significance level). Similarly, the upper significance boundary formula was derived as $B_{high} = E_{high} - R \cdot S$. This necessarily involves testing multiple hypotheses for each gene. The general form of the hypotheses is as follows:

• H0: A gene is not significantly differentially expressed in a population of cancer cells ($B_{low} < \bar{x} < B_{high}$).

• H1: A gene is significantly underexpressed in a population of cancer cells ($\bar{x} \le B_{low}$).

• H2: A gene is significantly overexpressed in a population of cancer cells ($\bar{x} \ge B_{high}$).
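
The decision rule expressed by these hypotheses can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation in Appendix A: the function names `significance_bounds` and `classify_mean` are hypothetical, and the sketch assumes that the range R is the distance between the two extremities of the ordinal scale.

```python
def significance_bounds(e_low, e_high, s):
    """Return (B_low, B_high) for extremities e_low, e_high and level s.

    Assumption: the range R is taken as e_high - e_low.
    """
    r = e_high - e_low
    return e_low + r * s, e_high - r * s


def classify_mean(x_bar, b_low, b_high):
    """Map a distribution mean onto one of the three hypotheses."""
    if x_bar <= b_low:
        return "H1"  # significantly underexpressed
    if x_bar >= b_high:
        return "H2"  # significantly overexpressed
    return "H0"      # not significantly differentially expressed
```

A gene is thus assigned to exactly one hypothesis, since the three conditions are mutually exclusive whenever the lower boundary lies below the upper boundary.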

The statistical method was implemented in Python (Appendix A). For this study, a significance level of 0.10 was used with a resampling number of 100,000. This resampling number was chosen to reduce the variance of the test statistic ($\bar{x}$). The significance level of 0.10 was chosen to avoid the bias that might have arisen had only the genes nearest the extremities of the sample space been considered, while remaining stringent enough to avoid superfluous inferences.
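
To make the procedure concrete, the following is a minimal sketch of one possible reading of the method; it is not the code of Appendix A. The names `discretize` and `bootstrap_mean` are hypothetical, the thresholds are assumed to be inclusive, the normal expression level is assumed to be positive, and the test statistic is assumed to be the mean of the resampled sample means.

```python
import random

UNDER, NORMAL, OVER = 1, 2, 3  # ordinal codes, following the example in the text


def discretize(cancer, normal, cutoff=0.10):
    """Label a cancer expression level relative to a positive normal level."""
    change = (cancer - normal) / normal  # relative difference from normal
    if change >= cutoff:
        return OVER
    if change <= -cutoff:
        return UNDER
    return NORMAL


def bootstrap_mean(codes, n_resamples=100_000, seed=None):
    """Mean of the bootstrap distribution of the sample mean of `codes`."""
    rng = random.Random(seed)
    k = len(codes)
    total = 0.0
    for _ in range(n_resamples):
        resample = rng.choices(codes, k=k)  # draw k values with replacement
        total += sum(resample) / k
    return total / n_resamples
```

For example, `bootstrap_mean([UNDER] * 8 + [NORMAL], seed=0)` estimates the mean for a gene that is underexpressed in eight of nine cell lines; a larger `n_resamples` tightens the estimate at the cost of run time.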

Results

The first result generated was the software implementation of the statistical method (Appendix A). This software was then used to statistically analyze the solutions to the logical models generated for the GEO datasets mentioned previously: cervical cancer (GDS3233), breast cancer (GDS820), and mesothelioma (GDS1220). All the datasets were analyzed at a significance level of 0.10, a range of 2, and a resampling number of 100,000. The genes found to be significantly underexpressed in cervical cancer are listed in Table 4.3, and those found to be significantly overexpressed in cervical cancer are listed in Table 4.4. Genes found to be underexpressed and overexpressed in breast cancer are listed in Tables 4.5 and 4.6 respectively. The significantly underexpressed and overexpressed genes for the mesothelioma dataset are tabulated in Tables 4.7 and 4.8 respectively.

The meta-analysis combining all the previously mentioned cancer datasets revealed a few genes that were significantly underexpressed (Table 4.1) and overexpressed (Table 4.2). It is noted that these tables also contain genes whose products may undergo posttranslational modification (e.g. phosphorylation); this is due to the structural organization of Reactome. Common functional patterns were also identified, a posteriori, by examining the functions of the genes found to be significantly differentially expressed per cancer type (Tables 4.3 through 4.8). The functional patterns identified were those of cellular differentiation and pathways involved in healing. Figures 4.1 through 4.6 illustrate cancer-type-specific genetic networks centering on genes that are involved in cellular differentiation. For more information on any of the genes listed in the Tables or Figures, refer to: http://www.genecards.org
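
As a worked check of the parameters stated above, and assuming the ordinal coding given in the Materials and Methods (extremities $E_{low} = 1$ and $E_{high} = 3$, hence $R = 2$), the significance boundaries at $S = 0.10$ evaluate to:

```latex
\begin{align*}
B_{low}  &= E_{low}  + R \cdot S = 1 + 2(0.10) = 1.2 \\
B_{high} &= E_{high} - R \cdot S = 3 - 2(0.10) = 2.8
\end{align*}
```

Under this reading, every distribution mean reported in Table 4.1 (at most 1.1992) lies below 1.2, and every mean reported in Table 4.2 (at least 2.8009) lies above 2.8.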

Discussion

The primary objective of this study was to determine genes that were significantly differentially expressed in cancer. This was accomplished by defining subpopulations of cancer based on the tissue from which the cancer was derived. The attempt to identify globally differentially expressed genes, in the cancer types examined, was successful (Tables 4.1 and 4.2). From this meta-analysis one gene (IGF2BP3) stood out due to its highly researched connection to a variety of cancers (Jeng et al., 2008; Ikenberg et al., 2010; Schaeffer et al., 2010). Furthermore, analyzing the resulting significant genes from individual cancer types revealed a heretofore unobserved functional pattern in cancer that challenges the current prevailing paradigm.

The current cancer paradigm, the Somatic Mutation Theory of Cancer (SMT), first developed by Nordling (1953) and also known as the Knudson hypothesis, states that cancer is the result of mutations to a cell’s genetic material. In its simplest form, the development of cancer requires two offenses: one that immortalizes the cell and a second that triggers uncontrolled proliferation. This idea is well supported in cases of hereditary cancers such as retinoblastoma (Knudson, 1971), hereditary breast-ovarian cancer syndromes (Jolly et al., 1994), hereditary nonpolyposis colorectal cancer (Thompson et al., 2004), and Fanconi anemia (Alan and D’Andrea, 2010), among others. The evidence for this is the documented loss of one wildtype allele (generally of a tumor suppressor gene) in the germline, which increases the likelihood of cancer development should a second mutation occur in the remaining wildtype allele. Furthermore, the concept of upregulated oncogenes and downregulated tumor suppressor genes serving as drivers in cancer is inherent in modern studies of the SMT (Min et al., 2010).

With regard to the results of this study, it would be expected (per the SMT) that oncogenes would be upregulated and tumor suppressor genes would be downregulated. The results do not agree with this prediction. In the cancers examined, genes involved in growth factor signaling (e.g. VEGFR2, VEGFR3, IGFBP4, IGFBP5, FGFR3, etc.) were significantly underexpressed. This challenges the idea that cancer cells produce and respond to their own growth factors (Sporn and Roberts, 1985). Furthermore, tumor suppressor and DNA repair genes were upregulated in cervical cancer and mesothelioma (e.g. POLB, XRCC4, ERCC1, BRCA1, P53, etc.). One possible explanation for this paradoxical situation lies in invoking a counter-intuitive use of the SMT. If we consider the reasonable assumption that the majority of the genes in a cancer cell are operating normally (i.e. not mutated), then the results of this study would seem to point to the idea that a cancer cell is actually trying to limit its proliferation and repair its mutations. The question then becomes: why does a cancer cell fail to regulate itself? The naive explanation lies in invoking the SMT itself: the cell has mutated in such a way as to lose its ability for self-regulation. This is, of course, a logical fallacy (circular reasoning), because the premise that led to the question is that heritable gene mutation is a relatively rare event.

Therefore, a new approach to cancer development is postulated, and results from this study are used to support the position. Initially, only sporadic cancers (i.e. not hereditary and not caused by infections) will be considered, as the causal mechanism behind these cancers is the least understood. It is then assumed that mutation is a rare event; as such, the genetic differences between normal tissue and cancer tissue are reasoned to be negligible. This is supported by the idea that if cancer were due to mutagenesis, and individuals from a population all received roughly the same amount of mutagenic exposure at any one instance (i.e. a uniform distribution of mutation rate over time), then it would be expected that cancer incidence would be roughly equivalent among all age groups. Older individuals may accumulate more mutations over time due to the decrease in DNA repair capacity associated with aging (Goukassian et al., 2000). However, since this is an argument in terms of oncogenesis initiation, and not cancer progression (where mutation may have a role), the age-related decrease in DNA repair capacity is not considered a factor. It is evident that cancer incidence is not uniform across age groups: over 75 percent of cases occur in individuals over the age of 55 (Howlader et al., 2011). This is an argument against the SMT and in support of the idea that mutation does not play a significant role in sporadic cancer.

However, age is correlated with an increase in injury rate (Aschkenasy and Rothenhaus, 2006). The process of wound healing involves dedifferentiation (Stappenbeck and Miyoshi, 2009; Ueda et al., 1995; Ertl and Frantz, 2005). Briefly, dedifferentiation is the process by which a terminally differentiated cell reverts to a pluripotent stage capable of “re-differentiating” into different cell types. The results of this study show that in all the cancer types analyzed, genes associated with cellular differentiation are significantly differentially expressed (SEMA5A, SEMA6D, SEMA4D, SEMA6A, SEMA7A, MED12, etc.). Semaphorin pathways in general are associated with a variety of developmental processes, including cellular differentiation (Tamagnone and Giordano, 2006). MED12 has a role in endoderm development (Shin et al., 2008). Furthermore, IGF2BP3 (found to be significantly overexpressed in the meta-analysis) is normally found in embryonic tissues (Nielsen et al., 1999; Jeng et al., 2008). These results imply that the cancer cells are in a dedifferentiated state and are undergoing cellular differentiation; this is concordant with the literature concerning cancer stem cells (Jordan et al., 2006). It is understood, within the context of the SMT, that at the initiation of cancer a mutagenic event will cause the birth of a cancer stem cell; these cancer stem cells give rise to the variety of cell types found in a tumor and are said to be tumorigenic (Jordan et al., 2006). However, as mutagenesis has been assumed to be negligible, an alternative mechanism of cancer stem cell initiation is needed.

Therefore, I postulate that during the wound healing process, if the dedifferentiated cells become located in an inappropriate tissue, this gives rise to a neoplasm. This is supported in the literature by models showing that inert objects (e.g. asbestos, polyurethanes, etc.) inserted into an in-vivo model will cause tumors at the location of the lesion (Bischoff and Bryson, 1964; Autian et al., 1975; Peto et al., 1999). This is also supported by evidence showing an increased risk of cancer after surgery (Caygill et al., 1987). In addition, the results of this study reveal several significantly overexpressed genes (Factor V, Factor VII, Factor IX, Factor XII, etc.) that are involved in the healing process. Furthermore, if the idea that the dermis is the most likely location of a lesion (given its size and exposure) is provisionally accepted, then a clear correlation is drawn with the most common type of cancer, skin cancer (Howlader et al., 2011). Additionally, when a global view of cancers is considered, carcinomas (i.e. those cancers derived from epithelial cells) have the most frequent incidence (Howlader et al., 2011). Epithelial cells are known to line the organs of the systems that interact with the external environment (i.e. the digestive, respiratory, and urinary systems), to be regularly subjected to harsh conditions, and to have an active stem cell population (Lommel, 2003; Slack, 2000). This serves to add support to the approach to cancer development that I have postulated.

Lastly, if there is any validity to this proposed hypothesis of cancer, then it follows that treatments of cancer should focus upon forced, localized, cellular differentiation. The success of radiation therapy may be attributed to its ability to force differentiation (Liu et al., 1994). Naturally, this will require further research, although there is evidence to support the involvement of cellular differentiation in the spontaneous regression of cancer (Haas et al., 1988; Elston, 2004).
In summary, lists of genes found to be significantly differentially expressed in different types of cancer were compiled using the logical modeling method previously developed (Chapters II and III) and a novel statistical approach developed within this study. The expression levels of the oncogenes and tumor suppressor genes examined were not found to be in agreement with the expectations given by the current prevailing paradigm of cancer (SMT). Furthermore, the common functional characterization of genes involved in cellular differentiation and the healing process, in the cancer cell lines tested, led to the development of a novel hypothesis regarding the origin of cancer. This hypothesis essentially proposes that cancer originates from a wound healing process gone awry. Furthermore, it is proposed that treatments involving the enhancement of the cellular differentiation process could be used to treat cancer patients. However, further investigation is needed to add support to this oncogenesis hypothesis.

Bibliography

Alan, D. and M. D’Andrea. 2010. The fanconi anemia and breast cancer susceptibility pathways. The New England journal of medicine, 362(20):1909.

Aschkenasy, M. and T. Rothenhaus. 2006. Trauma and falls in the elderly. Emergency medicine clinics of North America, 24(2):413–432.

Autian, J., A. Singh., J. Turner., G. Hung., L. Nunez., and W. Lawrence. 1975. Carcinogenesis from polyurethans. Cancer research, 35(6):1591.

Bischoff, F. and G. Bryson. 1964. Carcinogenesis through solid state surfaces. Progress in Experimental Tumor Research, 5:85–133. PMID: 14317768.

Caygill, C., M. Hill., C. Hall., J. Kirkham., and T. Northfield. 1987. Increased risk of cancer at multiple sites after gastric surgery for peptic ulcer. Gut, 28(8):924.

Elston, D. 2004. Mechanisms of regression. Clinical medicine & research, 2(2):85–88.

Ertl, G. and S. Frantz. 2005. Healing after myocardial infarction. Cardiovascular Research, 66(1):22–32.

Frierson Jr, H., A. El-Naggar., J. Welsh., L. Sapinoso., A. Su., J. Cheng., T. Saku., C. Moskaluk., and G. Hampton. 2002. Large scale molecular analysis identifies genes with altered expression in salivary adenoid cystic carcinoma. The American journal of pathology, 161(4):1315.

Goukassian, D., F. Gad., M. Yaar., M. Eller., U. Nehal., and B. Gilchrest. 2000. Mechanisms and implications of the age-associated decrease in DNA repair capacity. The FASEB journal, 14(10):1325–1334.

Haas, D., A. R. Ablin., C. Miller., S. Zoger., and K. K. Matthay. 1988. Complete pathologic maturation and regression of stage ivs neuroblastoma without treatment. Cancer, 62(4):818–825.

Howlader, N., A. Noone., M. Krapcho., R. Aminou., W. Waldron., S. Altekruse., C. Kosary., J. Ruhl., Z. Tatalovich., H. Cho., A. Mariotto., M. Eisner., D. Lewis., H. Chen., E. Feuer., K. Cronin., and B. Edwards. 2011. SEER cancer statistics review, 1975-2008.

Ikenberg, K., F. R. Fritzsche., U. Zuerrer-Haerdi., I. Hofmann., T. Hermanns., H. Seifert., M. Muntener., M. Provenzano., T. Sulser., S. Behnke., J. Gerhardt., A. Mortezavi., P. J. Wild., F. Hofstadter., M. Burger., H. Moch., and G. Kristiansen. 2010. Insulin-like growth factor II mRNA binding protein 3 (IMP3) is overexpressed in prostate cancer and correlates with higher gleason scores. BMC Cancer, 10(1):341.

Jeng, Y., C. Chang., F. Hu., H. E. Chou., H. Kao., T. Wang., and H. Hsu. 2008. RNA-binding protein insulin-like growth factor II mRNA-binding protein 3 expression promotes tumor invasion and predicts early recurrence and poor prognosis in hepatocellular carcinoma. Hepatology (Baltimore, Md.), 48(4):1118–1127. PMID: 18802962.

Jolly, K. W., D. Malkin., E. C. Douglass., T. F. Brown., A. E. Sinclair., and A. T. Look. 1994. Splice-site mutation of the p53 gene in a family with hereditary breast-ovarian cancer. Oncogene, 9(1):97–102. PMID: 8302608.

Jordan, C. T., M. L. Guzman., and M. Noble. 2006. Cancer stem cells. The New England Journal of Medicine, 355(12):1253–1261. PMID: 16990388.

Knudson, A. 1971. Mutation and cancer: statistical study of retinoblastoma. Proceedings of the National Academy of Sciences, 68(4):820.

Liu, S. Z., SuXu., Y. C. Zhang., and Y. Zhao. 1994. Signal transduction in lymphocytes after low dose radiation. Chinese Medical Journal, 107(6):431–436. PMID: 7956482.

Lommel, A. T. L. v. 2003. From cells to organs : a histology textbook and atlas. Kluwer Academic Pub., Boston.

Min, J., A. Zaslavsky., G. Fedele., S. McLaughlin., E. Reczek., T. De Raedt., I. Guney., D. Strochlic., L. MacConaill., R. Beroukhim., and others. 2010. An oncogene-tumor suppressor cascade drives metastatic prostate cancer by coordinately activating ras and nuclear factor-[kappa] b. Nature medicine, 16(3):286–294.

Nielsen, J., J. Christiansen., J. Lykke-Andersen., A. Johnsen., U. Wewer., and F. Nielsen. 1999. A family of insulin-like growth factor II mRNA-binding proteins represses translation in late development. Molecular and cellular biology, 19(2):1262.

Nordling, C. 1953. A new theory on the cancer-inducing mechanism. British journal of cancer, 7(1):68.

Peto, J., A. Decarli., C. La Vecchia., F. Levi., and E. Negri. 1999. The european mesothelioma epidemic. British Journal of Cancer, 79(3/4):666.

Rhodes, D., J. Yu., K. Shanker., N. Deshpande., R. Varambally., D. Ghosh., T. Barrette., A. Pandey., and A. Chinnaiyan. 2004. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proceedings of the national academy of sciences of the United States of America, 101(25):9309.

Schaeffer, D. F., D. R. Owen., H. J. Lim., A. K. Buczkowski., S. W. Chung., C. H. Scudamore., D. G. Huntsman., S. S. Ng., and D. A. Owen. 2010. Insulin-like growth factor 2 mRNA binding protein 3 (IGF2BP3) overexpression in pancreatic ductal adenocarcinoma correlates with poor survival. BMC Cancer, 10(1):59.

Shin, C., W. Chung., S. Hong., E. Ober., H. Verkade., H. Field., J. Huisken., and D. YR Stainier. 2008. Multiple roles for med12 in vertebrate endoderm development. Developmental biology, 317(2):467–479.

Singh, D., P. Febbo., K. Ross., D. Jackson., J. Manola., C. Ladd., P. Tamayo., A. Renshaw., A. D’Amico., J. Richie., and others. 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer cell, 1(2):203–209.

Slack, J. M. 2000. Stem cells in epithelial tissues. Science, 287(5457):1431–1433.

Sporn, M. B. and A. B. Roberts. 1985. Autocrine growth factors and cancer. Nature, 313(6005):745–747.

Stappenbeck, T. S. and H. Miyoshi. 2009. The role of stromal stem cells in tissue regeneration and wound repair. Science, 324(5935):1666–1669.

Tamagnone, L. and S. Giordano. 2006. Semaphorin pathways orchestrate osteogenesis. Nature Cell Biology, 8(6):545–547.

Thompson, E., C. Meldrum., R. Crooks., M. McPhillips., L. Thomas., A. Spigelman., and R. Scott. 2004. Hereditary non-polyposis colorectal cancer and the role of hPMS2 and hEXO1 mutations. Clinical Genetics, 65(3):215–225.

Ueda, M., A. E. Becker., T. Naruko., and A. Kojima. 1995. Smooth muscle cell de-differentiation is a fundamental change preceding wound healing after percutaneous transluminal coronary angioplasty in . Coronary Artery Disease, 6(1):71–81. PMID: 7767506.

Table 4.1: Genes that are significantly underexpressed in cancer (cervical, breast, and mesothelioma); sorted from most significant to least significant based on distribution mean.

Gene Name | Distribution Mean
C1Inh | 1.0000
OX 2 membrane glycoprotein precursor | 1.0502
ASC | 1.0994
Phospho Forkhead box protein O1A T24 S256 S319 | 1.1980
IL1R1 | 1.1982
Desmoplakin | 1.1989
proteolytically cleaved Desmoplakin | 1.1992

Table 4.2: Genes that are significantly overexpressed in cancer (cervical, breast, and mesothelioma); sorted from most significant to least significant based on distribution mean.

Gene Name | Distribution Mean
Kinesin like protein KIFC1 | 2.9008
Insulin like growth factor 2 mRNA binding protein 3 | 2.8009

Table 4.3: Genes that are significantly underexpressed in cervical cancer (GEO Dataset GDS3233 ); sorted from most significant to least significant based on distribution mean.

Gene Name Distribution Mean Adducin alpha 1.0000 Adenosine A2a receptor 1.0000 Alpha adducin fragment 1 633 1.0000 Alpha adducin fragment 634 737 1.0000 Alpha Pix 1.0000 Angiopoietin 1 1.0000 Angiopoietin 1 receptor precursor 1.0000 Angiopoietin 2 1.0000

117 Texas Tech University, Andrew Avila, May 2012

Table 4.3: Continued Gene Name Distribution Mean Angiotensinogen II 1.0000 Apelin receptor 1.0000 apoE 1.0000 APRIL with phosphothreonine 244 1.0000 B lymphocyte activation marker BLAST 1 precursor 1.0000 Bcl 2 protein 1.0000 Beta 2 microglobulin 1.0000 Beta TrCP 1.0000 Bone sialoprotein 2 1.0000 BoNT B cleaved VAMP2 fragment 1.0000 c FOS 1.0000 c FOS P 1.0000 c Jun 1.0000 c Jun P 1.0000 C1Inh 1.0000 C2a 1.0000 C2b 1.0000 C5a anaphylatoxin chemotactic receptor 1.0000 C7 1.0000 cAMP specific 3 5 cyclic phosphodiesterase 4B 1.0000 CCL5 1.0000 CCR6 1.0000 CCR7 1.0000 CD160 antigen precursor 1.0000 CD4 1.0000 Chemokine XC receptor 1 1.0000 CHL1 1.0000 Co SMAD 1.0000 Complement factor 2 1.0000 CREB 1.0000 Cryptochrome 2 1.0000 CX3CL1 1.0000 CXCL12 1 1.0000 CXCR2 1.0000 CXCR6 1.0000 Cytotoxic and regulatory T cell molecule 1.0000 DAP12 1.0000

118 Texas Tech University, Andrew Avila, May 2012

Table 4.3: Continued Gene Name Distribution Mean Desmoglein 1 1.0000 Desmoglein 1 fragment 50 888 1.0000 Desmoglein 1 fragment 889 1049 1.0000 Desmoglein 3 1.0000 Desmoglein 3 fragment 50 781 1.0000 Desmoglein 3 fragment 782 999 1.0000 Desmoplakin 1.0000 Dicer 1.0000 DOCK2 1.0000 DP1 receptor 1.0000 Duffy antigen chemokine receptor 1.0000 ELMO1 1.0000 Endofin 1.0000 EP1 receptor 1.0000 FABP4 1.0000 factor X activation peptide 1.0000 factor X light chain propeptide 1.0000 factor XIII A chain activation peptide 1.0000 factor XIII B chain 1.0000 FES 1.0000 Fibrillin 1 1.0000 Forkhead box protein O1A 1.0000 FP receptor 1.0000 FPRL2 1.0000 FYN 1.0000 G protein coupled estrogen receptor 1 1.0000 G protein coupled receptor kinase 5 1.0000 GAB1 1.0000 GBP2 1.0000 Gelsolin 1.0000 Gelsolin 27 403 fragment 1.0000 Gelsolin 404 782 fragment 1.0000 GPI anchored form of CD14 1.0000 HCK 1.0000 Hemoglobin subunit gamma 1 1.0000 Hemoglobin subunit gamma 2 1.0000 HNF4G 1.0000

119 Texas Tech University, Andrew Avila, May 2012

Table 4.3: Continued Gene Name Distribution Mean HREV3 1.0000 HSP27 1.0000 HVEM 1.0000 ICOS 1.0000 IGFBP 4 157 237 fragment 1.0000 IGFBP 4 22 156 fragment 1.0000 IGFBP 5 164 272 fragment 1.0000 IGFBP 5 21 163 fragment 1.0000 IL1R1 1.0000 IL1R2 1.0000 IL2RB 1.0000 IL2RG 1.0000 IL5RA 1.0000 ILK 1.0000 Insulin like Growth Factor Binding Protein 4 1.0000 Insulin like Growth Factor Binding Protein 5 1.0000 IP receptor 1.0000 IQGAP1 1.0000 ITK 1.0000 JAM2 1.0000 L selectin 1.0000 Lck 1.0000 Leukocyte common antigen precursor 1.0000 MADCAM1 1.0000 MED25 1.0000 MIG 2 1.0000 N arachidonyl glycine receptor 1.0000 Necl 1 1.0000 Necl 2 1.0000 NEUROD1 1.0000 NLRP1 1.0000 NOD2 1.0000 Osteopontin 1.0000 OX 2 membrane glycoprotein precursor 1.0000 Oxytocin 1.0000 P2RY14 1.0000 P2RY5 1.0000

120 Texas Tech University, Andrew Avila, May 2012

Table 4.3: Continued Gene Name Distribution Mean PACS 1 1.0000 PARVA 1.0000 PECAM 1 1.0000 Period circadian protein homolog 2 1.0000 pFyn Y531 1.0000 phospho CREB 1.0000 Phospho Forkhead box protein O1A T24 S256 S319 1.0000 Phosphorylated cAMP specific 3 5 cyclic 1.0000 phosphodiesterase 4B Phosphorylated ICOS 1.0000 phosphorylated PECAM 1 1.0000 phosphorylated SLP 76 1.0000 PI3 kinase p110 subunit alpha 1.0000 Platelet glycoprotein IV 1.0000 Plexin C1 1.0000 Plexin D1 precursor 1.0000 prekallikrein 1.0000 pro protein S 1.0000 pro protein S uncarboxylated 1.0000 Proactivator polypeptide Contains Saposin A 1.0000 protein S 1.0000 protein S propeptide 1.0000 Protein transport protein Sec31A 1.0000 Protein tyrosine kinase 2 beta PYK2 1.0000 proteolytically cleaved Desmoplakin 1.0000 Relaxin 3 receptor 1 1.0000 RNF125 1.0000 Satb1 1.0000 Satb1 fragment 1 254 1.0000 Satb1 fragment 255 763 1.0000 Secreted form of CD14 1.0000 SEMA5A 1.0000 Semaphorin 6D 1.0000 Serum albumin 1.0000 SIAH2 1.0000 SLP 76 1.0000 SUMO conjugating UBC9 1.0000

121 Texas Tech University, Andrew Avila, May 2012

Table 4.3: Continued Gene Name Distribution Mean SUN domain containing protein 1 1.0000 Syntenin 1 1.0000 T Cell Receptor zeta chain 1.0000 T cell surface antigen CD2 precursor 1.0000 Talin 1 1.0000 Tenascin 1.0000 Thrombopoietin receptor 1.0000 TLR5 1.0000 Triggering receptor expressed on myeloid cells 1 1.0000 precursor Tripeptidyl peptidase 1 1.0000 Tristetraproline 1.0000 Tumor necrosis factor ligand superfamily member 5 1.0000 TXNIP 1.0000 VAMP2 1.0000 VAMP2 Synaptobrevin2 1.0000 VCAM1 1.0000 VEGFR2 1.0000 VEGFR3 1.0000 Von Willebrand factor precursor 1.0000 WASP 1.0000 XAF1 1.0000 Caspase 10 precursor 1.1088 Neurokinin A peptide 1.1089 Insulin like Growth Factor Binding Protein 3 1.1091 CCBP2 1.1096 Transmembrane remnant 3 1.1096 Plexin B3 1.1096 SIRP alpha 1.1097 Notch 3 receptor precursor 1.1099 CD86 monomer 1.1099 Casp 10 p10 1.1100 Filamin A 1.1102 IGFBP 3 28 124 fragment 1.1102 Death inducer obliterator 1 1.1102 Graf 2 Arhgap10 1.1103 IGFBP 3 227 291 fragment 1.1103

122 Texas Tech University, Andrew Avila, May 2012

Table 4.3: Continued Gene Name Distribution Mean RBM 5 1.1103 MC2R 1.1103 Proto oncogene tyrosine protein kinase MER precursor 1.1103 P2X7 1.1103 IGFBP 3 125 233 fragment 1.1104 Y55 phospho Sprouty 2 1.1105 Notch 3 receptor precursor 1.1106 IGFBP 3 28 168 fragment 1.1107 CD97 antigen 1.1107 Killer cell lectin like receptor subfamily G member 1 1.1108 Synaptonemal complex protein 1 1.1108 Casp 10 p23 p17 1.1109 TRIF 1.1109 Lutropin subunit beta 1.1110 Killer cell immunoglobulin like receptor 3DL2 precursor 1.1110 Formyl peptide receptor 1.1111 Notch 3 NEXT fragment 1.1111 IGFBP 3 187 291 fragment 1.1113 Interferon induced transmembrane protein 1 1.1114 NICD 3 fragment 1.1114 STAT2 1.1115 Leucine zipper protein FKSG13 1.1115 Kinesin associated protein 3 1.1115 beta catenin 1 1 115 1.1115 Neurocan 1.1115 Rrn3 1.1116 Vasopressin receptor type 2 1.1116 Corticotropin releasing hormone 1.1117 NICD 3 fragment 1.1118 IGFBP 3 234 291 fragment 1.1118 Lamin A 1.1119 apoC III 1.1119 Lamin A fragment 231 664 1.1120 Semaphorin 6A 1.1120 IGFBP 3 188 291 fragment 1.1120 Y55 phospho Sprouty 2 1.1121 IGFBP 3 169 226 fragment 1.1121


Table 4.3: Continued

Gene Name  Distribution Mean
GAS2  1.1121
Lamin A fragment 1 230  1.1121
IGFBP 3 28 186 fragment  1.1122
beta catenin 1 116 376  1.1125
IGFBP 3 125 187 fragment  1.1127
IGFBP 3 127 291 fragment  1.1127
Substance P peptide  1.1128
Lamin C  1.1129
GAS2 280 313  1.1137
IGFBP 3 28 126 fragment  1.1139
GAS2 1 279  1.1142

Table 4.4: Genes that are significantly overexpressed in cervical cancer (GEO Dataset GDS3233); sorted from most significant to least significant based on distribution mean.

Gene Name  Distribution Mean
4E BP  3.0000
4E BP1 P  3.0000
Activated BAK protein  3.0000
Ankyrin G  3.0000
BAF protein  3.0000
BIM  3.0000
BLM  3.0000
BRCA1  3.0000
CBL  3.0000
CBP80  3.0000
Cdc2  3.0000
Cdc20  3.0000
Cdc25A  3.0000
Cdc25B  3.0000
Cdc25C  3.0000
Cdc45  3.0000
CDC6  3.0000
Cdk4  3.0000


Table 4.4: Continued

Gene Name  Distribution Mean
Cdt1  3.0000
CENP E  3.0000
Chk1  3.0000
citrate lyase monomer  3.0000
Cks1  3.0000
CPSF  3.0000
Crm1  3.0000
DDB1 DNA damage binding protein 1  3.0000
Dihydroxyacetone kinase DAK  3.0000
dimethylated Sm protein D3  3.0000
DNA dependent protein kinase catalytic subunit  3.0000
DNA directed RNA polymerase I 40 kDa polypeptide  3.0000
DNA directed RNA polymerases I II and III 17 1 kDa polypeptide  3.0000
DNA ligase I  3.0000
DNA polymerase beta EC 2 7 7 7  3.0000
DNA repair protein XRCC4  3.0000
dna2 endonuclease  3.0000
Drp1  3.0000
eIF2 alpha  3.0000
eIF2 alpha phosphorylated at Ser52  3.0000
Elongin A1 protein  3.0000
Emi1  3.0000
ERCC1 DNA excision repair protein  3.0000
FACT 140 kDa subunit  3.0000
FACT 80 kDa subunit  3.0000
FCP1P protein  3.0000
Flap endonuclease 1  3.0000
gamma H2AX  3.0000
geminin  3.0000
GRB2  3.0000
hBUBR1  3.0000
histidine rich glycoprotein  3.0000
Histone deacetylase 3  3.0000
Histone H2A x  3.0000
Histone RNA hairpin binding protein  3.0000
hnRNP I  3.0000


Table 4.4: Continued

Gene Name  Distribution Mean
hnRNP K  3.0000
hnRNP L  3.0000
HsMAD2  3.0000
hSMUG1 glycosylase  3.0000
hUpf3B  3.0000
Importin alpha  3.0000
Importin beta 1  3.0000
Importin beta 2  3.0000
Inactive BAK protein  3.0000
Insulin like growth factor 2 mRNA binding protein 3  3.0000
IRAK1  3.0000
IRE1  3.0000
ITCH  3.0000
Karyopherin alpha  3.0000
KIF18A  3.0000
Kinesin 5  3.0000
Kinesin like protein KIF15  3.0000
Kinesin like protein KIFC1  3.0000
Lamin B1  3.0000
Lamin B1 fragment 1 231  3.0000
Lamin B1 fragment 232 586  3.0000
MAD2  3.0000
Mcm10  3.0000
Mcm2  3.0000
Mcm3  3.0000
Mcm5  3.0000
MED14  3.0000
MED20  3.0000
Metastin  3.0000
Mitotic checkpoint protein BUB3  3.0000
MKLP1  3.0000
MKLP2  3.0000
mLst8  3.0000
mTERF  3.0000
Myt1 kinase  3.0000
Nek2A  3.0000
NELF B protein  3.0000


Table 4.4: Continued

Gene Name  Distribution Mean
NOD1  3.0000
NOSIP  3.0000
Nuclear pore complex protein Nup214  3.0000
Orc1  3.0000
Orc3  3.0000
Orc6  3.0000
p VAV2 Y172  3.0000
p Y700 731 774 CBL  3.0000
p53 protein  3.0000
p53 ser 15 phosphorylated  3.0000
PALB2  3.0000
PALS1 MPP5  3.0000
PCNA  3.0000
phosph Cdc25C Ser 216  3.0000
phospho BRCA1  3.0000
phospho Cdc2 Thr 161  3.0000
phospho Cdc25A  3.0000
phospho Cdc25C  3.0000
phospho Chk1  3.0000
phospho DNA PKcs  3.0000
Phospho Emi1  3.0000
phospho MKLP 1  3.0000
phospho MKLP 2  3.0000
phospho Myt1  3.0000
phosphorylated Cdc6  3.0000
phosphorylated Orc1  3.0000
PIKE L  3.0000
Plexin B1  3.0000
PLK1  3.0000
Poly A specific ribonuclease PARN  3.0000
pro prothrombin factor II  3.0000
pro prothrombin factor II uncarboxylated  3.0000
Protein kinase C iota type  3.0000
Protein kinase C zeta type  3.0000
Protein SET  3.0000
prothrombin factor II  3.0000
prothrombin factor II propeptide  3.0000


Table 4.4: Continued

Gene Name  Distribution Mean
PSF1p  3.0000
PSF2p  3.0000
RAD50  3.0000
RAD51  3.0000
RAD51C  3.0000
RanBP1  3.0000
RNASEL  3.0000
RNPS1  3.0000
Secretin receptor  3.0000
Securin  3.0000
Serine threonine protein phosphatase PP1 alpha 1 catalytic subunit  3.0000
SLD5P  3.0000
Sm Protein B  3.0000
Sm protein D1  3.0000
Sm protein D2  3.0000
Sm protein E  3.0000
Sm protein F  3.0000
Sm protein G  3.0000
SMAC  3.0000
Smac protein mitochondrial precursor  3.0000
SR2 SC35  3.0000
SR9 SRp30  3.0000
thrombin activation peptide  3.0000
Thrombopoietin  3.0000
TRAF2  3.0000
TRAIL receptor 2  3.0000
Transcription factor NF E2 45 kDa subunit  3.0000
Tribbles homolog 3  3.0000
Tubulin folding cofactor E  3.0000
U2AF 35 kDa subunit  3.0000
UNG2  3.0000
VAV2  3.0000
Vesicular inhibitory amino acid transporter  3.0000
VGLUT1  3.0000
XPF protein  3.0000
hnRNP A0  2.8904


Table 4.4: Continued

Gene Name  Distribution Mean
CGI 58  2.8902
MASK fragment 1 305  2.8901
tBID p15  2.8901
MASK  2.8900
tBID  2.8898
BID  2.8897
hnRNP R  2.8896
cleaved TFAM  2.8895
Phospho IKK1 T23  2.8895
SRp55  2.8894
tBID  2.8893
DDB2 DNA damage binding protein 2  2.8892
CDK5  2.8892
ERK2  2.8892
Serine threonine protein kinase ATR  2.8892
Activated ERK2  2.8891
DNA mismatch repair protein Mlh1  2.8890
HSP90  2.8890
Caspase 9  2.8890
MBD4 glycosylase  2.8889
TRBP  2.8886
factor XII  2.8885
CycC  2.8885
Phospho Caspase 9  2.8885
IKKA  2.8884
U2AF 65 kDa subunit  2.8884
YB 1  2.8883
ATP dependent DNA helicase II 70 kDa subunit  2.8878
Orc5  2.8876
Ezrin  2.8876
MASK fragment 306 416  2.8876
Caspase 9 precursor  2.8871


Table 4.5: Genes that are significantly underexpressed in breast cancer (GEO Dataset GDS820); sorted from most significant to least significant based on distribution mean.

Gene Name  Distribution Mean
Activated PAR1  1.0000
ADAR2 protein  1.0000
ASC  1.0000
B cell linker protein  1.0000
BP180  1.0000
c FOS  1.0000
c FOS P  1.0000
C1Inh  1.0000
CASP1  1.0000
CASP1 internal fragment  1.0000
CASP1 p10  1.0000
CASP1 p20  1.0000
caveolin 1  1.0000
Connexin 43  1.0000
DDB2 DNA damage binding protein 2  1.0000
Desmoglein 3  1.0000
Desmoglein 3 fragment 50 781  1.0000
Desmoglein 3 fragment 782 999  1.0000
DNA ligase IV  1.0000
ERCC1 DNA excision repair protein  1.0000
FAS Receptor  1.0000
Fibroblast growth factor receptor 3b  1.0000
Fibroblast growth factor receptor 3c  1.0000
Fibronectin  1.0000
FYN  1.0000
GAS6  1.0000
GAS6 propeptide  1.0000
Gelsolin  1.0000
Gelsolin 27 403 fragment  1.0000
Gelsolin 404 782 fragment  1.0000
Granulocyte macrophage colony stimulating factor  1.0000
Grb14  1.0000
IL1R2  1.0000
IL1RN  1.0000
NrCAM  1.0000
OX 2 membrane glycoprotein precursor  1.0000


Table 4.5: Continued

Gene Name  Distribution Mean
p120 catenin  1.0000
p21  1.0000
P2RY5  1.0000
PEX3  1.0000
pFyn Y531  1.0000
Phosphorylated platelet SEC1 protein  1.0000
plasminogen activator inhibitor 2  1.0000
Platelet SEC1 protein  1.0000
pro GAS6  1.0000
pro GAS6 uncarboxylated  1.0000
Procaspase 1  1.0000
Ras GTPase activating protein 1  1.0000
Thrombospondin 1  1.0000
tissue plasminogen activator one chain  1.0000

Table 4.6: Genes that are significantly overexpressed in breast cancer (GEO Dataset GDS820); sorted from most significant to least significant based on distribution mean.

Gene Name  Distribution Mean
ADAR1 protein  3.0000
ApoER2  3.0000
beta catenin 1 1 115  3.0000
beta catenin 1 116 376  3.0000
cAMP specific 3 5 cyclic phosphodiesterase 4B  3.0000
CCL5  3.0000
Cdc20  3.0000
Cdc25B  3.0000
Cdt1  3.0000
Complement Factor B  3.0000
Complement Factor Ba  3.0000
Cyclic AMP dependent transcription factor ATF 3  3.0000
Glutaredoxin  3.0000
glutaredoxin thioltransferase oxidized  3.0000
GRB10  3.0000
HsMad1  3.0000


Table 4.6: Continued

Gene Name  Distribution Mean
HSP70  3.0000
IFIT1  3.0000
IFIT3  3.0000
Interferon induced 35 kDa protein  3.0000
IRF9  3.0000
ISG15  3.0000
ISG20  3.0000
MED12  3.0000
NQO1  3.0000
Orc1  3.0000
Phosphorylated cAMP specific 3 5 cyclic phosphodiesterase 4B  3.0000
phosphorylated Orc1  3.0000
PLK1  3.0000
Proteasome subunit beta type 8  3.0000
RIG I  3.0000
UBP43  3.0000

Table 4.7: Genes that are significantly underexpressed in mesothelioma (GEO Dataset GDS1220); sorted from most significant to least significant based on distribution mean.

Gene Name  Distribution Mean
53BP1  1.0000
Acinus  1.0000
Acinus fragment 1 1093  1.0000
Acinus fragment 1094 1341  1.0000
Activated BAK protein  1.0000
ADAR1 protein  1.0000
ADAR2 protein  1.0000
AGPAT1  1.0000
ASC  1.0000
BAD protein  1.0000
BAF protein  1.0000
BoNT B cleaved VAMP2 fragment  1.0000


Table 4.7: Continued

Gene Name  Distribution Mean
C1Inh  1.0000
Cadherin 1  1.0000
Calmodulin  1.0000
cAMP dependent protein kinase alpha catalytic subunit  1.0000
cAMP specific 3 5 cyclic phosphodiesterase 4B  1.0000
Caspase 7 precursor  1.0000
CCAAT enhancer binding protein delta  1.0000
CCK  1.0000
CD86 monomer  1.0000
Cdc20  1.0000
Cdc25B  1.0000
Cdc45  1.0000
CDK5  1.0000
Co SMAD  1.0000
Cofilin 1  1.0000
Complement Factor B  1.0000
Complement Factor Ba  1.0000
Connexin 32  1.0000
CPSF  1.0000
CREB  1.0000
Cryptochrome 1  1.0000
DCC interacting protein 13 alpha  1.0000
Death inducer obliterator 1  1.0000
dimethylated Sm protein D3  1.0000
DNA ligase I  1.0000
DNA topoisomerase 3 alpha  1.0000
dna2 endonuclease  1.0000
Double strand break repair protein MRE11A  1.0000
Dynamin 2 1  1.0000
E1B AP5 nucleocytoplasmic transport protein  1.0000
eIF4E  1.0000
Elongin A1 protein  1.0000
Epithelial cadherin precursor fragment 155 750  1.0000
Epithelial cadherin precursor fragment 751 882  1.0000
ERCC1 DNA excision repair protein  1.0000
ERK5  1.0000
Fibroblast growth factor receptor 3b  1.0000


Table 4.7: Continued

Gene Name  Distribution Mean
Fibroblast growth factor receptor 3c  1.0000
Flap endonuclease 1  1.0000
Forkhead box protein O1A  1.0000
G protein beta1 GBB1 subunit  1.0000
gamma H2AX  1.0000
GAP  1.0000
geminin  1.0000
GPI anchored form of CD14  1.0000
hBUBR1  1.0000
Heat shock related 70 kDa protein 2  1.0000
Heterochromatin protein 1 homolog alpha  1.0000
Histone H2A x  1.0000
Histone RNA hairpin binding protein  1.0000
hnRNP H  1.0000
hnRNP I  1.0000
hnRNP L  1.0000
hnRNP M  1.0000
hNTH1  1.0000
hPrp16  1.0000
HRH1  1.0000
HsMAD2  1.0000
HSP70  1.0000
IFIT1  1.0000
IFIT3  1.0000
IGFBP 5 164 272 fragment  1.0000
IGFBP 5 21 163 fragment  1.0000
IKKA  1.0000
IL1R1  1.0000
Inactive BAK protein  1.0000
Insulin like Growth Factor Binding Protein 5  1.0000
Interferon alpha inducible protein 27 mitochondrial  1.0000
Interferon alpha inducible protein 6  1.0000
Interferon induced transmembrane protein 1  1.0000
Interferon induced transmembrane protein 2  1.0000
Interferon regulatory factor 2  1.0000
IRF9  1.0000
ISG15  1.0000


Table 4.7: Continued

Gene Name  Distribution Mean
ISG20  1.0000
KIF18A  1.0000
Kinesin 5  1.0000
Kinesin like protein KIF15  1.0000
Lamin B1  1.0000
Lamin B1 fragment 1 231  1.0000
Lamin B1 fragment 232 586  1.0000
LAT  1.0000
LAT with phospho TBSMs  1.0000
LDL receptor  1.0000
MAD2  1.0000
Mcm10  1.0000
Mcm2  1.0000
Mcm3  1.0000
Mcm5  1.0000
MED4  1.0000
MED6  1.0000
MGMT  1.0000
MGMT containing S methyl L cysteine  1.0000
MKLP1  1.0000
MKLP2  1.0000
mRNA decapping enzyme 1A  1.0000
MyD88  1.0000
Nek2A  1.0000
Nesprin 2  1.0000
Neuromedin K receptor  1.0000
NuMA  1.0000
Orc6  1.0000
OX 2 membrane glycoprotein precursor  1.0000
p LAT Y200 Y220  1.0000
p Raf1 S259 S621  1.0000
p Ser 93 1130 Thr 1462 TSC2 1  1.0000
p120 catenin  1.0000
p27  1.0000
p27Kip1  1.0000
p53 protein  1.0000
p53 ser 15 phosphorylated  1.0000


Table 4.7: Continued

Gene Name  Distribution Mean
PCNA  1.0000
Period circadian protein homolog 1  1.0000
PERK  1.0000
perlecan  1.0000
PEX3  1.0000
Phospho 2 sites RAF1  1.0000
Phospho AKT1 T308 S473  1.0000
Phospho BAD  1.0000
Phospho BAD protein  1.0000
Phospho BAD protein at S136  1.0000
phospho CREB  1.0000
Phospho ERK5  1.0000
Phospho Forkhead box protein O1A T24 S256 S319  1.0000
Phospho IKK1 T23  1.0000
phospho MEK2  1.0000
phospho MKLP 1  1.0000
phospho MKLP 2  1.0000
phospho Nlp  1.0000
Phospho Phospholipase C gamma 1  1.0000
phospho Retinoblastoma protein  1.0000
Phospho TSC2 1  1.0000
Phospholipase C gamma 1  1.0000
Phosphorylated cAMP specific 3 5 cyclic phosphodiesterase 4B  1.0000
phosphorylated NuMA  1.0000
PIN1  1.0000
PKC delta  1.0000
PKC delta fragment 1 329  1.0000
PKC delta fragment 330 676  1.0000
PLK1  1.0000
Proteasome subunit beta type 8  1.0000
Protein kinase C zeta type  1.0000
Protein SMG5  1.0000
Protein tyrosine kinase 2 beta PYK2  1.0000
proteolytically cleaved ZO 2  1.0000
PSF1p  1.0000
Pyruvate kinase L isozyme  1.0000


Table 4.7: Continued

Gene Name  Distribution Mean
Pyruvate kinase R isozyme  1.0000
RAD51  1.0000
RAD51C  1.0000
RAD52  1.0000
Raf 1  1.0000
RalGDS  1.0000
RBM 5  1.0000
Receptor type tyrosine protein phosphatase alpha  1.0000
Retinoblastoma Protein 1  1.0000
RIG I  1.0000
ROBO3A 1  1.0000
SAR1B GTP binding protein  1.0000
Secreted form of CD14  1.0000
Semaphorin 4D  1.0000
SIAH2  1.0000
SIRP beta  1.0000
Sm Protein B  1.0000
SNAP 25 variant1 active  1.0000
SP100  1.0000
spectrin alpha chain alpha II Fodrin fragment 1 1185  1.0000
spectrin alpha chain alpha II Fodrin fragment 1186 2472  1.0000
spectrin alpha chain brain alpha II fodrin  1.0000
SPF31 Dna J  1.0000
SPT4H1 protein  1.0000
SR4 SRp75  1.0000
SRm160  1.0000
STAT1  1.0000
STAT1 alpha  1.0000
SURP 2 G patch protein  1.0000
Synaptonemal complex protein 2  1.0000
TAF 1 110  1.0000
Talin 1  1.0000
TAP  1.0000
TESK1  1.0000
thrombomodulin  1.0000
TLR5  1.0000
TPL2  1.0000


Table 4.7: Continued

Gene Name  Distribution Mean
TRAF2  1.0000
TSC2 1  1.0000
Tubulin folding cofactor B  1.0000
Tubulin folding cofactor C  1.0000
Tubulin folding cofactor D  1.0000
UBP43  1.0000
VAMP2  1.0000
VAMP2 Synaptobrevin2  1.0000
VEGFR3  1.0000
Vitamin D binding protein DBP  1.0000
XPA binding protein 2  1.0000
Y55 phospho Sprouty 2  1.0000
ZO 2  1.0000

Table 4.8: Genes that are significantly overexpressed in mesothelioma (GEO Dataset GDS1220); sorted from most significant to least significant based on distribution mean.

Gene Name  Distribution Mean
14 3 3 zeta  3.0000
40S small ribosomal protein 6  3.0000
60S ribosomal protein L13a  3.0000
78 kDa glucose regulated protein  3.0000
ABIN2  3.0000
Activated BAX  3.0000
Activated IRAK4  3.0000
Acyl NAT2 intermediate  3.0000
Adducin alpha  3.0000
Adiponectin  3.0000
AGER  3.0000
ALOX5  3.0000
Alpha 2 antiplasmin  3.0000
Alpha adducin fragment 1 633  3.0000
Alpha adducin fragment 634 737  3.0000
Angiopoietin 1  3.0000


Table 4.8: Continued

Gene Name  Distribution Mean
Angiopoietin 1 receptor precursor  3.0000
antithrombin III  3.0000
Apelin receptor  3.0000
APOBEC 1 Apolipoprotein B mRNA editing enzyme  3.0000
Asparagine synthetase glutamine hydrolyzing  3.0000
ATF 2  3.0000
ATF 2 P  3.0000
ATF4  3.0000
ATF6 alpha  3.0000
ATF6 alpha C terminal cleavage product of S1P  3.0000
ATF6 alpha C terminal cleavage product of S2P  3.0000
ATF6 alpha N terminal cleavage product of S1P  3.0000
ATF6 alpha N terminal cleavage product of S2P  3.0000
Bcl 2 protein  3.0000
Bcl10  3.0000
beta actin  3.0000
Beta arrestin 1  3.0000
Bone sialoprotein 2  3.0000
BP180  3.0000
Bradykinin  3.0000
c FOS  3.0000
c FOS P  3.0000
C3G isoform 1  3.0000
C6  3.0000
C7  3.0000
C9  3.0000
caveolin 1  3.0000
CBP80  3.0000
CCL5  3.0000
CCR11  3.0000
CCR6  3.0000
CDO  3.0000
Cep250 C Nap1  3.0000
CERT  3.0000
CHE1  3.0000
CHL1  3.0000
CK1alpha  3.0000


Table 4.8: Continued

Gene Name  Distribution Mean
Corticotropin  3.0000
Corticotropin lipotropin precursor  3.0000
CSA protein  3.0000
CSB protein  3.0000
CXCL13  3.0000
CXCR5  3.0000
CycC  3.0000
Cyclic AMP dependent transcription factor ATF 3  3.0000
Cyclin T2  3.0000
CypA protein  3.0000
Cytotoxic and regulatory T cell molecule  3.0000
D site binding protein  3.0000
DAP12  3.0000
Disks large homolog 1  3.0000
DNA polymerase beta EC 2 7 7 7  3.0000
Doublecortin  3.0000
DP1 receptor  3.0000
Drp1  3.0000
eEF1A  3.0000
eEF2  3.0000
EF 1 beta  3.0000
EF 1 gamma  3.0000
eIF4B  3.0000
eIF4B P  3.0000
Endophilin  3.0000
EP3 receptor  3.0000
ERp57  3.0000
FABP4  3.0000
factor IX  3.0000
factor IX activation peptide  3.0000
factor IX propeptide  3.0000
factor V  3.0000
factor V activation peptide  3.0000
factor VII  3.0000
factor VII propeptide  3.0000
FAS Receptor  3.0000
FASL  3.0000


Table 4.8: Continued

Gene Name  Distribution Mean
fibrinopeptide A  3.0000
Fibroblast growth factor receptor 1c  3.0000
FPRL1  3.0000
FPRL2  3.0000
Frs2  3.0000
FSH receptor  3.0000
GLP2  3.0000
Glucagon  3.0000
Glucagon like peptide 2 receptor  3.0000
GLUT2  3.0000
Glutaredoxin  3.0000
glutaredoxin thioltransferase oxidized  3.0000
GM CSF receptor alpha subunit  3.0000
GRB2 related adapter protein 2  3.0000
Grb7  3.0000
hnRNP E2  3.0000
ICOSL  3.0000
IFNB1  3.0000
IGFBP 3 125 187 fragment  3.0000
IGFBP 3 125 233 fragment  3.0000
IGFBP 3 127 291 fragment  3.0000
IGFBP 3 169 226 fragment  3.0000
IGFBP 3 187 291 fragment  3.0000
IGFBP 3 188 291 fragment  3.0000
IGFBP 3 227 291 fragment  3.0000
IGFBP 3 234 291 fragment  3.0000
IGFBP 3 28 124 fragment  3.0000
IGFBP 3 28 126 fragment  3.0000
IGFBP 3 28 168 fragment  3.0000
IGFBP 3 28 186 fragment  3.0000
IL1R2  3.0000
IL2RG  3.0000
IL5  3.0000
IL5RA  3.0000
Importin beta 1  3.0000
Importin beta 2  3.0000
Inactive Bax alpha protein  3.0000


Table 4.8: Continued

Gene Name  Distribution Mean
INSM1  3.0000
Insulin like Growth Factor Binding Protein 3  3.0000
IPS 1  3.0000
IRAK4  3.0000
JAK3  3.0000
JNK1  3.0000
Killer cell immunoglobulin like receptor 2DL4 precursor  3.0000
Killer cell immunoglobulin like receptor 3DL1 precursor  3.0000
Killer cell immunoglobulin like receptor 3DL2 precursor  3.0000
Kinesin associated protein 3  3.0000
Kinesin like protein KIFC1  3.0000
kininogen  3.0000
Krueppel like factor 5  3.0000
Leptin  3.0000
Leucine rich repeat containing protein 16A  3.0000
Leucine zipper protein FKSG13  3.0000
Leukocyte common antigen precursor  3.0000
Leukosialin precursor  3.0000
lipoprotein lipase monomer  3.0000
Lutropin subunit beta  3.0000
MADCAM1  3.0000
MASK  3.0000
MASK fragment 1 305  3.0000
MASK fragment 306 416  3.0000
Matrix metalloproteinase 1  3.0000
MDM2  3.0000
MED1  3.0000
MED14  3.0000
Metastin  3.0000
monophospho CERT  3.0000
multiphospho CERT  3.0000
NAT2  3.0000
Necl 1  3.0000
Necl 2  3.0000
NELF A protein  3.0000
NEMO  3.0000
Nephrin  3.0000


Table 4.8: Continued

Gene Name  Distribution Mean
Nesprin 1  3.0000
Neuropilin 1  3.0000
Neuropilin 2 NP2  3.0000
NLRP1  3.0000
Nociceptin  3.0000
NR1D1 HUMAN  3.0000
ODC  3.0000
Orc5  3.0000
Orexin receptor 1  3.0000
Orexin receptor 2  3.0000
Osteopontin  3.0000
p VAV2 Y172  3.0000
P2RY1  3.0000
P2RY2  3.0000
P2RY6  3.0000
PABPN1  3.0000
Pannexin 1  3.0000
PCBP2  3.0000
PD 1  3.0000
PDX1  3.0000
perilipin  3.0000
phospho ALOX5  3.0000
Phospho JNK1  3.0000
phospho L13a 60S ribosomal protein  3.0000
phospho MDM2  3.0000
Phospholipase C gamma 2  3.0000
Phospholipase C gamma 2 phosphorylated  3.0000
Phosphorylated 40S small ribosomal protein 6  3.0000
phosphorylated Bcl10  3.0000
phosphorylated perilipin  3.0000
phosphorylated SLP 76  3.0000
Pituitary adenylate cyclase activating polypeptide precursor  3.0000
Pituitary adenylate cyclase activating polypeptide type I receptor  3.0000
Platelet glycoprotein IV  3.0000
Plexin D1 precursor  3.0000


Table 4.8: Continued

Gene Name  Distribution Mean
Poly A specific ribonuclease PARN  3.0000
Preproglucagon  3.0000
pro factor IX  3.0000
pro factor IX uncarboxylated  3.0000
pro factor VII  3.0000
pro factor VII uncarboxylated  3.0000
Proglucagon  3.0000
protein C heavy chain activation peptide  3.0000
protein C light chain propeptide  3.0000
Protein SMG7  3.0000
Proto oncogene tyrosine protein kinase MER precursor  3.0000
RAC1  3.0000
RAD50  3.0000
RAPGEF1  3.0000
Ras GTPase activating protein 1  3.0000
RIP2  3.0000
RNASEL  3.0000
SAP 97  3.0000
Secretin  3.0000
SEMA7A  3.0000
Semaphorin 6A  3.0000
SHP2  3.0000
SLP 76  3.0000
Smoothened homolog  3.0000
SOCS3  3.0000
Src kinase associated phosphoprotein 2 SCAP2  3.0000
SRp55  3.0000
Staf SPH binding factor  3.0000
Substance K receptor  3.0000
SUMO conjugating enzyme UBC9  3.0000
Synaptonemal complex protein 1  3.0000
Syntenin 1  3.0000
T cell surface protein tactile precursor  3.0000
Tapasin  3.0000
TBL1XR1  3.0000
TEC  3.0000
Tenascin  3.0000


Table 4.8: Continued

Gene Name  Distribution Mean
TERT  3.0000
Testis expressed sequence 12 protein  3.0000
TFIIB  3.0000
TFIIS protein  3.0000
TGF beta 1  3.0000
Thrombopoietin receptor  3.0000
TNF alpha  3.0000
TRAF6  3.0000
TRAIL  3.0000
Transcription factor NF E2 45 kDa subunit  3.0000
Transforming growth factor beta 1 precursor  3.0000
TRHR  3.0000
TRIF  3.0000
Triggering receptor expressed on myeloid cells 1 precursor  3.0000
Tumor necrosis factor ligand superfamily member 5  3.0000
Tumor necrosis factor receptor type 1 associated DEATH domain protein  3.0000
VAV2  3.0000
VEGF A variant4 VEGF165  3.0000
Vesicular acetylcholine transporter  3.0000
VPS24  3.0000
XPC protein  3.0000


Figure 4.1: A graph showing the differential expression of genes in cervical cancer (GEO Dataset GDS3233), focusing specifically on Semaphorin 5A and connected elements.

Expression legend: Underexpressed; Overexpressed; Not Differentially Expressed; Unknown or Not Gene


Figure 4.2: A graph showing the differential expression of genes in cervical cancer (GEO Dataset GDS3233), focusing specifically on Semaphorin 6D and connected elements.

Expression legend: Underexpressed; Overexpressed; Not Differentially Expressed; Unknown or Not Gene


Figure 4.3: A graph showing the differential expression of genes in breast cancer (GEO Dataset GDS820), focusing specifically on MED12 and connected elements.

Expression legend: Underexpressed; Overexpressed; Not Differentially Expressed; Unknown or Not Gene


Figure 4.4: A graph showing the differential expression of genes in mesothelioma (GEO Dataset GDS1220), focusing specifically on Semaphorin 4D and connected elements.

Expression legend: Underexpressed; Overexpressed; Not Differentially Expressed; Unknown or Not Gene


Figure 4.5: A graph showing the differential expression of genes in mesothelioma (GEO Dataset GDS1220), focusing specifically on Semaphorin 6A and connected elements.

Expression legend: Underexpressed; Overexpressed; Not Differentially Expressed; Unknown or Not Gene


Figure 4.6: A graph showing the differential expression of genes in mesothelioma (GEO Dataset GDS1220), focusing specifically on Semaphorin 7A and connected elements.

Expression legend: Underexpressed; Overexpressed; Not Differentially Expressed; Unknown or Not Gene


Appendix A

The implementation of the statistical analysis used in this study.

#!/usr/bin/env python
# stats.py
# Author: Andrew Avila
# Description: An implementation of a non-parametric statistical method for Logos Ex Machina.
# Usage: python stats.py -e "experiment list" -r num_resamples -s sigLevel -a All_results

# Libraries
import MySQLdb
import itertools
import random
import sys

# MySQL Database Parameters
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'logosexmachina'
MYSQL_PASSWD = 'logosexmachina'
MYSQL_DB = 'logosexmachina'

# Generates an anonymous function that takes in three variables to be applied
# to the four-variable function "f"; the first variable of "f" will be a
# dynamically managed MySQL connection.
def DBManageConn(f):
    return lambda x, y, z: f(MySQLdb.connect(MYSQL_HOST, MYSQL_USER, MYSQL_PASSWD, MYSQL_DB), x, y, z)

# Generates an anonymous function that takes in three variables to be applied
# to the three-variable function "f". The first variable passed to the
# anonymous function supplies the MySQL cursor; after evaluation, the
# fetchall() method is run on the object returned by "f".
def DBFetch(f):
    return lambda x, y, z: f(x.cursor(), y, z).fetchall()

# A decorated function that takes in four variables: a MySQL connection, a
# query to be executed, the parameters of the query, and a function to be
# applied to the results of the executed query (the results of which are
# returned). It is noted that the connection is dynamically managed by the
# decorator function.
@DBManageConn
def DBQueryDataFunc(conn, query, params, datafunc = lambda x, y: y):
    # print query
    data = datafunc(conn, DBExecuteQuery(conn, query, params))


    conn.close()
    return data

# A decorated function that takes in three variables: a MySQL cursor, a query
# to be executed, and the parameters to the query. Returns the cursor after
# query execution.
@DBFetch
def DBExecuteQuery(cursor, query, params):
    cursor.execute(query, params)
    return cursor

# A function that inserts into a database table a series of values stored in a
# hashtable and applies a function to the result of the insertion. The
# hashtable has the format of the key being the column name and the value
# being the value to be inserted. By default the function applied returns the
# value of the primary key of the last row inserted.
def DBInsertFunc(table, nvpairs, datafunc = lambda x, y: x.insert_id()):
    return DBQueryDataFunc(
        "INSERT INTO %s (%s) VALUES (%s)" % (
            table,
            reduce(lambda x, y: x + ', ' + y, nvpairs.iterkeys()),
            MultiplyStr("%s", len(map(lambda x: x, nvpairs.itervalues())))),
        map(lambda x: x, nvpairs.itervalues()),
        datafunc)

# A function that multiplies a string a number of times and inserts commas in between.
def MultiplyStr(string, number):
    return (((string + ',') * number).strip(','))  #.rsplit(',')

# Get the common list of genes that have expression information among a list of experiments
def GetGeneXSection(expArr):
    return DBQueryDataFunc(
        "SELECT name FROM "
        "(SELECT StableIdentifier_ID AS S1, COUNT(*) AS NUMPRESENT "
        "FROM ANALYSISRESULT "
        "WHERE GENEPROTEIN = 'GENE' AND (%s) AND LOWHIGH != 'unknown' "
        "GROUP BY StableIdentifier_ID) AS T1 "
        "LEFT JOIN StableIdentifier ON StableIdentifier.ID = S1 "
        "WHERE NUMPRESENT = '%s'" % (
            " OR ".join(["ANALYSIS_ID = '%s'" % str(i) for i in expArr]),
            len(expArr)),
        [], lambda j, k: k)

# Get the common list of genes and count each expression level
def GetGeneXSectionLevels(expArr):
    return DBQueryDataFunc(
        "SELECT StableIdentifier.identifier, name, low, normal, high FROM "
        "(SELECT StableIdentifier_ID AS S1, COUNT(*) AS NUMPRESENT, "
        "COUNT(IF(LOWHIGH = 'low', 1, NULL)) AS low, "
        "COUNT(IF(LOWHIGH = 'normal', 1, NULL)) AS normal, "
        "COUNT(IF(LOWHIGH = 'high', 1, NULL)) AS high "
        "FROM ANALYSISRESULT "
        "WHERE GENEPROTEIN = 'GENE' AND (%s) AND LOWHIGH != 'unknown' "
        "GROUP BY StableIdentifier_ID) AS T1 "
        "LEFT JOIN StableIdentifier ON StableIdentifier.ID = S1 "
        "WHERE NUMPRESENT = '%s'" % (
            " OR ".join(["ANALYSIS_ID = '%s'" % str(i) for i in expArr]),
            len(expArr)),
        [], lambda j, k: k)

# Generates a list with one entry per sample; each position holds that
# sample's expression level (1 = low, 2 = normal, 3 = high)
def GenSampleList(levelArr):
    return [] + [1] * levelArr[0] + [2] * levelArr[1] + [3] * levelArr[2]

# Generates an array of pseudo-random resamples pulled uniformly from a list of samples with replacement
def BootstrapSamples(samList, resampNum):
    return map(lambda x: random.choice(samList), range(resampNum))

# Gets the mean for an array of samples
def GetDistMean(samples):
    return sum(samples, 0.0) / len(samples)

# Performs the bootstrap procedure followed by the mean calculation for a list of genes
def MeanStrapGenes(expArr, resampNum):
    return map(
        lambda x: list(x[0:2]) + [GetDistMean(BootstrapSamples(GenSampleList(x[2:5]), resampNum))],
        GetGeneXSectionLevels(expArr))

# Assign significance to array of genes and sort low to high
def AssignSig(geneArr, sigLevel = 0.10):
    return sorted(
        map(lambda x: x + ["Significant Low"] if x[2] <= (1.0 + (2.0 * sigLevel)) else
            (x + ["Significant High"] if x[2] >= (3.0 - (2.0 * sigLevel)) else x + ["Not Significant"]),
            geneArr),
        key = lambda x: x[2])

# Main function that processes command line arguments
def main():

    # By default no arguments
    iExpArr = False
    iSams = False
    iAll = False
    iSig = False

    # Retrieves command line arguments and parses them
    for x in range(0, len(sys.argv)):
        if (sys.argv[x] == '-e'):
            iExpArr = sys.argv[x+1].split(' ')
        if (sys.argv[x] == '-r'):
            iSams = int(sys.argv[x+1])


        if (sys.argv[x] == '-a'):
            iAll = True
        if (sys.argv[x] == '-s'):
            iSig = float(sys.argv[x+1])

    # If valid argument run the appropriate processing function
    if iExpArr and iSams:
        results = AssignSig(MeanStrapGenes(iExpArr, iSams), iSig) if iSig else AssignSig(MeanStrapGenes(iExpArr, iSams))
        for i in results:
            if iAll:
                print "%s\t%s\t%s\t%s" % (i[0], i[1], i[2], i[3])
            else:
                if i[3] != "Not Significant":
                    print "%s\t%s\t%s\t%s" % (i[0], i[1], i[2], i[3])

# Wrapper for when script is run stand-alone
if __name__ == '__main__':
    main()
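
The resampling and thresholding steps of the listing above can be illustrated without a database connection. The following is a minimal sketch, written in Python 3 rather than the Python 2 of the original script, and it is not the code used in this study. The function names (gen_sample_list, bootstrap_mean, classify, analyse), the seed parameter, and the example gene names are hypothetical. Only the level coding (1 = low, 2 = normal, 3 = high) and the cut-offs 1 + 2(sigLevel) and 3 - 2(sigLevel) are taken from the listing.

```python
import random


def gen_sample_list(low, normal, high):
    """Expand per-level counts into one list of levels (1=low, 2=normal, 3=high)."""
    return [1] * low + [2] * normal + [3] * high


def bootstrap_mean(samples, n_resamples, rng):
    """Mean of n_resamples draws taken uniformly, with replacement, from samples."""
    draws = [rng.choice(samples) for _ in range(n_resamples)]
    return sum(draws) / len(draws)


def classify(mean, sig_level=0.10):
    """Apply the cut-offs from the listing to a bootstrap mean."""
    if mean <= 1.0 + 2.0 * sig_level:
        return "Significant Low"
    if mean >= 3.0 - 2.0 * sig_level:
        return "Significant High"
    return "Not Significant"


def analyse(level_counts, n_resamples=10000, sig_level=0.10, seed=0):
    """Return (name, bootstrap mean, label) tuples sorted from low to high mean.

    level_counts maps a gene name to its (low, normal, high) sample counts.
    The seed argument is an assumption added here for reproducibility.
    """
    rng = random.Random(seed)
    rows = []
    for name, (low, normal, high) in level_counts.items():
        samples = gen_sample_list(low, normal, high)
        mean = bootstrap_mean(samples, n_resamples, rng)
        rows.append((name, mean, classify(mean, sig_level)))
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    # Hypothetical counts: geneA is low in every sample, geneB high in every sample.
    example = {"geneA": (5, 0, 0), "geneB": (0, 0, 5), "geneC": (1, 3, 1)}
    for name, mean, label in analyse(example, n_resamples=1000):
        print("%s\t%.4f\t%s" % (name, mean, label))
```

With the default significance level of 0.10, a gene is labeled significantly low when its bootstrap mean is at most 1.2 and significantly high when it is at least 2.8. The distribution means listed in Tables 4.3 through 4.8 above fall within these bounds.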


CHAPTER V

ANALYSIS OF DIFFERENTIALLY EXPRESSED GENES IN TISSUE SAMPLES OF BREAST CANCER

Introduction

In the previous chapter (IV) it was demonstrated that a variety of cancer types, in vitro, have significantly differentially expressed genes compared to their normal phenotype. Furthermore, numerous genes that were found to be differentially expressed did not conform to the expectations of the currently prevailing cancer paradigm, the Somatic Mutation Theory of Cancer (SMT). Briefly, the SMT posits that cancer originates from the mutagenesis of normal cells, leading to the upregulation of oncogenes and the downregulation of tumor suppressor genes. Therefore, a novel hypothesis of the origin of cancer was posited, based on the idea that cancer originates from normal cells that have undergone dedifferentiation due to tissue injury and dysregulation of the healing/repair process. The aim of this study was to determine whether genes involved in the cellular differentiation process are significantly differentially expressed in breast cancer tissue samples.

Breast Cancer Initiation per the SMT

For the purposes of this study, breast cancer was used to demonstrate the power of this model and to reduce the number of inter-tissue differences. The trend that was observed in the previous chapter for tissue healing and repair serves as a basis to search for similar correlations on a more focused scale. The molecular basis of cancer as understood by the SMT is as follows: changes at the genetic level, specifically the loss of heterozygosity (LOH) and the amplification of DNA, are

crucial to the initiation and progression of breast cancer (Winchester, 2006). A variety of factors lead to the malignant, uncontrolled growth of breast tissue. The hormonal influences include the effects of estrogen and progesterone (Winchester, 2006). Growth factors (EGF, TGF-β, and IGF) also play an important role in the progression of breast cancer (Winchester, 2006). Two genes important in familial hereditary cases (i.e., BRCA1 and BRCA2) have been shown to play a role in the emergence of sporadic breast cancers as well (Winchester, 2006).

Estrogen (E2) is a mammary epithelial cell carcinogen, the effects of which vary depending on the exposure, age, and lifetime dose of E2 (Winchester, 2006). Estrogen is known to control several G1 cell-cycle regulators that promote cellular proliferation (Winchester, 2006). This proliferative effect is believed to place ductal cells at risk for carcinogenesis by allowing the accumulation of genetic changes (Winchester, 2006). Progesterone, along with its receptors, is similar in structure to estrogen (Winchester, 2006). Progesterone (PR) is believed to enable cells to reach the G1 cell cycle point while simultaneously preventing senescence (Winchester, 2006). PR positive cells can therefore proliferate further, because PR facilitates the cell's response to other growth factors such as TGF-β, maintaining the cell cycle in G1 (Winchester, 2006).

Various growth factors have also been shown to play a role in the progression of breast cancer. EGF is a paracrine mediator of E2 and exerts its action on ductal epithelium via mammary stromal regulatory elements (Winchester, 2006). TGF-β has been shown to be important in maintaining the spiral patterning in mammary gland development (Winchester, 2006). In tumorigenesis, it has a primary role in post-lactational mammary gland involution (Winchester, 2006). Lastly, IGF

157 Texas Tech University, Andrew Avila, May 2012 acts upon the mammary stroma in the presence of growth hormone to facilitate mammary gland involution (Winchester, 2006). Two genetic elements, BRCA1 and BRCA2, are well known to have a role in breast cancer. BRCA1 is involved in DNA repair mechanisms used in homologous recombination and transcriptional regulation (Winchester, 2006). BRCA1 is important in facilitating the repair of DNA double strand breaks caused by ionizing radiation, enabling the dephosphorylation of the retinoblastoma protein, E2F binding, and possible cdk2 repression (Winchester, 2006). BRCA1 mutations may inhibit p53 transcription, allowing for the persistence of cells with DNA mutations and inhibition of damaged cells from undergoing apoptosis (Winchester, 2006). BRCA2 may also inhibit p53 when overexpressed (Winchester, 2006). BRCA2 is also associated with DNA repair mechanisms, specifically those related to meiotic and mitotic recombination as well as double stranded break repair (Winchester, 2006). BRCA2 may also be associated with proliferating nuclear cell antigen (PCNA), which is important to DNA repair and replication (Winchester, 2006).

Oncogenesis Via Chromothripsis

A few alternative theories concerning the origin of cancer were then reviewed for comparison with the hypothesis previously proposed. The first alternative theory to be examined is that of chromothripsis, the idea that cancer occurs during a single catastrophic cellular event. This theory of cancer was originally published by Stephens et al. (2011) in response to results gathered using “next-generation sequencing.” The authors begin by restating the SMT and the age-related nature of cancer development in that mutations are accumulated over many years (Stephens et al., 2011). Furthermore, the authors emphasize that this perspective of cancer shares the classical view of evolution, which states that phenotypic changes occur gradually over a long period of time (Stephens et al., 2011). The authors then propose that a model of punctuated equilibrium may be more appropriate based on the genome-wide telomere attrition in somatic cells potentially initiated during mitosis (Stephens et al., 2011). The authors also posit that a pulse of a mutagenic source, such as radiation, could potentially cause such a widespread genomic change (Stephens et al., 2011). This mechanism is argued to allow a “burst” of mutation during a relatively short period of time (Stephens et al., 2011). To support their idea, the authors performed paired-end sequencing, a technique that allows detection of genomic rearrangements, on a set of lymphocytic leukemia patients and a diverse spread of different cancer cell lines (Stephens et al., 2011). Widespread genome rearrangements were found among 2 to 3 percent of all cancers and in approximately 25 percent of bone cancers (Stephens et al., 2011). Given the statistical improbability of encountering such a widespread disruption of the genome, the authors argue that this hypothesis on the origination of cancer may provide a greater degree of accuracy in explaining some genomic features of cancer (Stephens et al., 2011). However, the authors offer no suggestion as to how this theory of cancer may be used to optimize the treatment of cancer.

Tumor Virology

Another intriguing theory of the origin of cancer is that of tumor virology. The essential idea is that cancer is caused by an infectious agent. This idea was first proposed by Peyton Rous in 1911, through work on an avian virus that induced tumors in chickens (Javier and Butel, 2008). Later it was discovered that Epstein-Barr Virus, Hepatitis B, and most recently HPV, could induce tumors (Javier and Butel, 2008). The earliest studies showed that these viruses induced tumorigenesis by expression of viral oncogenes (Javier and Butel, 2008). For viruses lacking viral oncogenes, the mechanism by which they could potentially cause a tumor is known as insertional mutagenesis, whereby a virus integrates near an oncogene and modifies its expression (Javier and Butel, 2008). Finally, it was found that HPV could bind and inactivate tumor suppressor genes in order to initiate tumorigenesis (Javier and Butel, 2008). It is notable that this paradigm of oncogenesis integrates concepts from the SMT (e.g. oncogenes and tumor suppressor genes), although the initiating factor is not a random accumulation of genomic changes but an active agent. Furthermore, this paradigm is attractive for explaining certain cancers, as it has been implicated in over 90 percent of cervical cancer cases (Walboomers et al., 1999).

The Tissue Organization Field Theory of Cancer

The final alternative theory of cancer reviewed is referred to as The Tissue Organization Field Theory of Cancer (TOFT). This theory posits that cancer is derived from a disorganization of the tissues. The idea was first proposed by Soto and Sonnenschein (2005) as a response to experimental results that indicated that the default state of cells in metazoa was one of proliferation and not quiescence as had been previously assumed. Furthermore, cancer is treated as a disease that originates at the tissue level (Soto and Sonnenschein, 2005). Before continuing, it should be noted that the authors do not approach the subject from the typical reductive materialism view of molecular biology but from a position of materialistic holism, also known as organicism. Essentially the principal difference, for the purposes of this review, is one of primary causation, where the reductive approach reduces all higher level entities (tissues, organs, etc.) to the action of genes and their associated products. On the other hand, the holistic view posits that causality may begin not only from the interaction of genes but also from the interaction of higher level entities. An example that embraces the holistic view is one of polyphenism, where a single genotype gives rise to multiple phenotypes (Soto and Sonnenschein, 2005).

The authors give several examples where the SMT has failed to explain phenomena of carcinogenesis; the most provocative example is the transformation of chicken cells by the Rous sarcoma virus (Soto and Sonnenschein, 2005). In this example the virus transforms all the cells of a chicken by integration of the src oncogene into the cell’s genome, although neoplasms only arise at locations where wounds are inflicted (Soto and Sonnenschein, 2005). This is interesting in light of the discussion from Chapter IV in which I suggest that the dedifferentiation process during wound healing serves as the initiator of oncogenesis. The authors argue that carcinogens disrupt the organization of tissues (Soto and Sonnenschein, 2005). This tissue-tissue interaction is important in preventing neoplasms from forming (Soto and Sonnenschein, 2005). As proliferation is the default in this view, this disruption allows cells to proliferate without the suppressive/regulatory effects of other tissue (Soto and Sonnenschein, 2005). The authors’ views present an interesting perspective, not only on the issue of carcinogenesis, but also on the ontology of biology.

Summary

The purpose of the study in this chapter was to determine whether genes are significantly differentially expressed in tissue samples of breast cancer. Several theories of carcinogenesis were reviewed in order to better interpret the meaning of the specific genes found to be differentially expressed. Although most theories of cancer are based upon a variation of the SMT, one was found to provide a distinct view of cancer as a disease of tissue organization. Furthermore, the merit of the hypothesis concerning cancer development proposed in Chapter IV was evaluated based on the results gathered in this study.

Materials and Methods

The resources required for this study comprised two main categories. The first resource was publicly available microarray data that fit certain requirements. The first requirement to be satisfied was the inclusion of gene expression data for both cancer and normal cells. Secondly, the specific origin of the cells was to be patients with breast cancer. The dataset used was downloaded from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo). The following is the specific dataset used and a short description thereof:

1. GDS3324 (Breast Cancer - Carcinoma)

• Platform: Affymetrix Human Genome U133.

• Cell Lines: Normal epithelial tissue from patients who underwent reduction mammoplasty.

• Cancer Cell Lines: Invasive breast cancer epithelial tissues from patients.

• Note: Further patient information regarding age, tumor size, node status, tumor grade, tumor type, estrogen receptor status, progesterone receptor status, lymphovascular invasion, and stage of the cancer may be found in Casey et al. (2008).

The second resource requirement involved the software used to examine the microarray datasets. The comparison between normal and cancer cells was facilitated by the previously developed software package (Chapter III). Briefly, this software package makes use of a logical model in order to deduce gene expression levels and attribute a qualitative measure (underexpressed, normal, overexpressed). Also, through deductive reasoning the software can make predictions of the gene expression levels of those genes that have not been observed empirically. Furthermore, the logical nature of the model prevents contradictory gene expression levels by checking if a model is solvable. Solutions to the model are then stored in a database for further processing, within the software package and without. It is noted that the cut-off value for demarcating the difference between underexpressed/normal and normal/overexpressed was set at 0.10, based on the work of Rhodes et al. (2004). Essentially, if the gene expression level in a cancer cell was more than 10 percent above or below normal, the gene was labeled as overexpressed or underexpressed, respectively.

In order to discover gene expression patterns in cancer, the ordinal data generated by the previous software package had to be statistically analyzed through a novel method previously created (Chapter IV). This statistical method was of non-parametric design and made use of random resampling with replacement in order to estimate the distribution of the expression levels of a gene from a population of cancer cells. To perform the analysis, the discretized ordinal values must be given numerical values in logical order (e.g. underexpressed: 1, normal: 2, overexpressed: 3). Statistical significance was determined by whether the mean of the distribution ($\bar{x}$) fell within a range of the extremities. This essentially presented as a two-tailed design, although each tail has a specific meaning (i.e. significantly underexpressed vis-a-vis significantly overexpressed). The formula to calculate the lower significance boundary was determined to be $B_{low} = E_{low} + R \cdot S$ (where $E$ is the value of the extremity, $R$ is the range, and $S$ is the significance level). Similarly, the upper significance boundary formula was derived as $B_{high} = E_{high} - R \cdot S$. This necessarily involves multiple hypothesis testing for each gene respectively. The general form of the hypotheses is as follows:

• H0: A gene is not significantly differentially expressed in a population of cancer cells ($B_{low} < \bar{x} < B_{high}$).

• H1: A gene is significantly underexpressed in a population of cancer cells ($\bar{x} \le B_{low}$).

• H2: A gene is significantly overexpressed in a population of cancer cells ($\bar{x} \ge B_{high}$).
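The boundary formulas and the three hypotheses above can be sketched in a few lines of Python. This is only an illustration and not the implementation from Chapter IV: the function names `significance_bounds` and `classify` are hypothetical, and the sketch assumes that the range R equals the distance between the two extremities, which gives R = 2 for the ordinal coding of 1 to 3 described in the text.

```python
def significance_bounds(e_low, e_high, s):
    """Return (B_low, B_high) for the given extremities and significance level.

    Assumption: R is taken as e_high - e_low; the text only calls R "the range".
    """
    r = e_high - e_low
    return e_low + r * s, e_high - r * s


def classify(x_bar, e_low=1.0, e_high=3.0, s=0.10):
    """Assign a distribution mean to H0, H1, or H2 as stated in the text."""
    b_low, b_high = significance_bounds(e_low, e_high, s)
    if x_bar <= b_low:
        return "H1: significantly underexpressed"
    if x_bar >= b_high:
        return "H2: significantly overexpressed"
    return "H0: not significantly differentially expressed"


# Example means: the two largest/smallest tabulated values and a neutral mean.
for mean in (1.1748, 2.0, 2.8290):
    print(mean, "->", classify(mean))
```

Under these assumptions a significance level of 0.10 places the boundaries at roughly 1.2 and 2.8, which is consistent with the largest mean listed in Table 5.1 (1.1748) and the smallest mean listed in the visible portion of Table 5.2 (2.8290).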

The implementation of this statistical method (Chapter IV: Appendix A) was realized in Python. For this study, a significance level of 0.10 was used with a resampling number of 100,000. The reasoning behind this choice of resampling number was to reduce the variance of the test statistic ($\bar{x}$). The significance level of 0.10 was chosen to avoid the bias that may have occurred if only the genes closer to the extremities of the sample space had been considered, while remaining stringent enough to avoid superfluous inferences.
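As a rough illustration of the discretization and resampling steps described in this section, the following sketch uses only the Python standard library. It is not the code from Chapter IV: Appendix A. The names `discretize` and `bootstrap_mean` are hypothetical, the strict inequalities at the 10 percent cut-off are an assumption, and the test statistic is assumed to be the mean of the resample means. In the actual pipeline the qualitative labels are deduced by the logical model rather than computed from a single ratio.

```python
import random

# Ordinal coding given in the text.
CODES = {"underexpressed": 1, "normal": 2, "overexpressed": 3}


def discretize(cancer_level, normal_level, cutoff=0.10):
    """Label a gene by its relative difference from the normal level.

    Assumption: a change strictly greater than the cutoff counts as different.
    """
    change = (cancer_level - normal_level) / normal_level
    if change > cutoff:
        return "overexpressed"
    if change < -cutoff:
        return "underexpressed"
    return "normal"


def bootstrap_mean(labels, n_resamples=100_000, seed=0):
    """Estimate x-bar by resampling the coded labels with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    values = [CODES[label] for label in labels]
    total = 0.0
    for _ in range(n_resamples):
        resample = rng.choices(values, k=len(values))
        total += sum(resample) / len(resample)
    return total / n_resamples


# Hypothetical labels for one gene across 20 tumor samples.
labels = ["underexpressed"] * 18 + ["normal"] * 2
print(bootstrap_mean(labels, n_resamples=2_000))
```

A larger `n_resamples` reduces the variance of the estimate at the cost of run time, which matches the stated reason for choosing 100,000 resamples.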

Results

The analysis was carried out upon the GEO dataset (GDS3324) with a significance level of 0.10 and a resampling number of 100,000. These are the same parameters as used in the previous study (Chapter IV). The genes that were found to be significantly underexpressed are listed in Table 5.1. Genes that were found to be significantly overexpressed are listed in Table 5.2. It is noted that these tables also contain genes whose products may undergo posttranslational modification (e.g. phosphorylation); this is due to the structural organization of Reactome. The first specific result of note regards a gene (IGF2BP3) found to be overexpressed in the meta-analysis from the previous study (Chapter IV). This gene is notable because it is normally expressed in embryonic tissues. A similar result was found in this study, that of significant overexpression, and a graph displaying the gene’s network is shown in Figure 5.1. The table of underexpressed genes (Table 5.1) reveals a pattern similar to that observed in the previous study (Chapter IV). It is observed that a variety of growth factors (Angiopoietin, IGFBP6, IGFBP2, TPOR, etc.) are being underexpressed, although the specific growth factors are not identical to the ones observed in the previous study. A graph of TPOR and connected elements is shown in Figure 5.2. It was also found that a variety of tumor suppressor genes (Metastin, POL-I, PUMA, RAD51, BID, etc.) were being overexpressed. A graph of Metastin and connected elements is shown in Figure 5.3. Furthermore, genes associated with the wound healing process (e.g. Fibronectin, Factor XII, and Factor VII) are being overexpressed. Lastly, genes associated with the cellular differentiation process (Karyopherin-α, Importin-α, p38-α, NR4A1, NOD1, UBF, and EGR1) were differentially expressed. Illustrative graphs of a couple of these genes, NOD1 and UBF, are shown in Figures 5.4 and 5.5, respectively. For more information on any of the genes listed in the Tables or Figures, refer to: http://www.genecards.org

Discussion

The purpose of this study was to determine genes that are differentially expressed in breast cancer within an in-vivo system. This study was designed as a follow up to the study performed previously on in-vitro cell lines (Chapter IV). As the results from the previous study did not conform to the expectations given by the SMT, it was of interest to see whether results from an in-vivo system would conform to those expectations. Furthermore, if indeed the results of this study do not conform to the SMT, what do the results of this study tell us in terms of the carcinogenesis process?

When comparing the variations in underexpressed growth factors between Chapters IV and V, the disparity between the listed genes challenges the paradigm that cancer cells produce and respond to their own growth factors (Sporn and Roberts, 1985; Turner and Grose, 2010). It should be noted that one growth factor receptor, FGFR1, is being overexpressed. In general, the results of this chapter are concordant with the results found in the previous chapter (i.e. tumor suppressor genes were overexpressed and oncogenes were underexpressed). According to the SMT, tumor suppressor genes are expected to be underexpressed and oncogenes are expected to be overexpressed. When considering the expected behavior of gene expression in cancer development with regards to the precepts of the SMT, the results of this study do not reflect the prediction of an upregulation of oncogenes and a downregulation of tumor suppressor genes. Furthermore, the overexpression of genes associated with the wound healing process (e.g. Fibronectin, Factor XII, and Factor VII) adds weight to the oncogenesis hypothesis proposed in the previous study (Chapter IV). Additionally, genes associated with the cellular differentiation process (Karyopherin-α, Importin-α, p38-α, NR4A1, NOD1, UBF, and EGR1) were differentially expressed. This is important in implying a pluripotent state of the cancer cells which may have originated during the wound healing process (Stappenbeck and Miyoshi, 2009; Ueda et al., 1995; Ertl and Frantz, 2005). These results suggest a strong correlation between carcinogenesis and the wound healing process.

Briefly restated, the hypothesis of the origin of cancer within the previous study, henceforth named Umbracesis, introduces the idea that the wound healing process involves the dedifferentiation of cells to a stem cell state. It is argued that these stem cells can give rise to neoplasms through inappropriate differentiation during the wound healing process. These stem cells are what recent research has termed cancer stem cells (Jordan et al., 2006). It is noted that Umbracesis is distinct from a similar hypothesis stating that tumors are “wounds that do not heal” (Dvorak, 1986). That hypothesis relies on the observation of chronic inflammation, an observation not consistent with the results of this study (e.g. IL1R1 and IL6 are underexpressed).

The difficulty with Umbracesis arises in attempting to explain what causes a stem cell to undergo an inappropriate differentiation process. In order to address this issue, the origin of cancerous lesions should be discussed. Briefly, a lesion is simply an abnormality in a tissue typically caused by disease or trauma. A precancerous lesion is an abnormality in tissue prior to the formation of cancer. Furthermore, a cancerous lesion is an abnormality in a tissue caused by cancer. Cancerous lesions are a well documented histological characteristic of cancer (Thomas, 2011). The Umbracesis hypothesis lies at the interface between a precancerous lesion and a cancerous lesion. It is noted that this hypothesis makes the reasonable assumption that all cancerous lesions derive from precancerous lesions. Therefore, the cause of the precancerous lesion is of little concern here, and it may arise through different mechanisms (e.g. trauma, viral infection, etc.). The relevant question is: Why do some precancerous lesions resolve themselves to become cancerous lesions? Proponents of the SMT may argue natural selection for cells with mutations that have the most aggressive characteristics. This position would imply that all precancerous lesions, given enough time to accumulate mutations, will resolve themselves to become cancerous lesions. This position is countered by the observation that some precancerous lesions will spontaneously regress (Zahl et al., 2008).

To answer this question, an organism will be considered as a monadic unit. This is to imply that each functional aspect of the monad is limited in its degrees of freedom (i.e. not autonomous). This can be mediated by a number of mechanisms (e.g. growth factors, epigenetic signaling, etc.). However, the important idea is the maintenance of organismic homeostasis (i.e. unity of the monad). A wound represents an infraction against organismic homeostasis, such that it must be resolved in order to recover homeostasis. A precancerous lesion, therefore, is a wound that has not resolved itself to either full regression or to becoming cancerous. Note that the time-frame of a precancerous lesion can be sufficiently lengthy, chronologically speaking, that it never resolves itself within the life-time of the individual. The dedifferentiation process during wound healing represents an increase in the degrees of freedom given to any one function of the monad, such that a carcinogen (i.e. any agent that causes cancer including, but not restricted to, chemical agents) may either block organismic homeostasis from being reached or prevent the re-establishment of the limited autonomy given to any one functional aspect of the monad. Examples of the former are inflammatory cancers (Coussens and Werb, 2002). An example of the latter is a carcinogen that acts as a one-way valve allowing dedifferentiation but not full re-differentiation (Blagosklonny, 2005). The results of this study imply the latter mechanism, for the tissue samples used in this study, due to the overexpression of a gene normally associated with embryonic tissues (IGF2BP3) and the underexpression of inflammatory genes. Furthermore, if the degrees of freedom are sufficiently raised, a predicted outcome would be the “birth” of a monad (e.g. cloning of an organism or speciation). An example of this phenomenon is the complex structure a teratoma might have (Sergi et al., 1999). Teratomas are also closely related to the phenomenon of a parasitic twin (Hoeffel et al., 2000). Oncogenesis as a form of speciation has been previously suggested by other authors, although the authors use the SMT paradigm of oncogenesis (Duesberg et al., 2011).

Lastly, it is noted that a carcinogen within the Umbracesis hypothesis need not be a mutagenic substance but anything that disrupts organismic homeostasis (i.e. that which disrupts the unity of the monad). This allows cancers caused by chemically inert objects (i.e. cases unaccounted for by the SMT) to be explained within the context of Umbracesis. A monad does not necessarily have to be multi-cellular; cancer can also be conceived of as occurring in unicellular organisms, per Umbracesis. Prion plaques in yeast can be viewed as examples of molecular tumors (King and Diaz-Avalos, 2004). It is proposed that there exist mechanisms for “healing” protein lesions; perhaps protein refolding mechanisms (“dedifferentiation” then “redifferentiation” of a protein) serve this purpose (Wickner, 1999). An interruption to this process, via “carcinogens,” may result in oncogenesis of the molecular tumor. Identification of these “carcinogens” is a topic for a future research endeavor. Via Umbracesis, a proposed solution to treat molecular tumors lies in the forceful “differentiation” of the protein, a process likely mediated via protein folding. This concept may have applications in treating Alzheimer’s disease, where amyloid plaques are composed of prion aggregates (Eikelenboom et al., 2002).

In summary, a series of differentially expressed genes were found in breast cancer derived from tissue samples of patients. Furthermore, the results of this study were not in agreement with the SMT per the expectation of the upregulation of oncogenes and the downregulation of tumor suppressor genes. The Umbracesis hypothesis was further refined within this study in order to provide a mechanism by which oncogenesis may occur. The mechanism proposed is disruption of organismic homeostasis via carcinogens during the wound healing process, either by preventing organismic homeostasis from being recovered or by preventing full re-differentiation of dedifferentiated cells. The synthetic idea of an organism as a monadic unit was useful in elaborating this mechanism. It is anticipated that future research can elaborate this idea further into other areas of biomedical concern besides cancer. For example, a future research endeavor may explore transplant rejection as a failure of assimilation into the monad. Conversely, a genetic chimera may be viewed as a successful monadic amalgamation.

Bibliography

Blagosklonny, M. V. 2005. Carcinogenesis, cancer therapy and chemoprevention. Cell Death and Differentiation, 12(6):592–602.

Casey, T., J. Bond., S. Tighe., T. Hunter., L. Lintault., O. Patel., J. Eneman., A. Crocker., J. White., J. Tessitore., M. Stanley., S. Harlow., D. Weaver., H. Muss., and K. Plaut. 2008. Molecular signatures suggest a major role for stromal cells in development of invasive breast cancer. Breast Cancer Research and Treatment, 114(1):47–62.

Coussens, L. M. and Z. Werb. 2002. Inflammation and cancer. Nature, 420(6917):860–867. PMID: 12490959.

Duesberg, P., D. Mandrioli., A. McCormack., and J. M. Nicholson. 2011. Is carcinogenesis a form of speciation? Cell Cycle, 10(13):2100–2114.

Dvorak, H. F. 1986. Tumors: wounds that do not heal. Similarities between tumor stroma generation and wound healing. The New England Journal of Medicine, 315(26):1650–1659. PMID: 3537791.

Eikelenboom, P., C. Bate., W. Van Gool., J. Hoozemans., J. Rozemuller., R. Veerhuis., and A. Williams. 2002. Neuroinflammation in Alzheimer’s disease and prion disease. Glia, 40(2):232–239.

Ertl, G. and S. Frantz. 2005. Healing after myocardial infarction. Cardiovascular Research, 66(1):22–32.

Hoeffel, C. C., K. Q. Nguyen., H. T. Phan., N. H. Truong., T. S. Nguyen., T. T. Tran., and P. Fornes. 2000. Fetus in fetu: A case report and literature review. PEDIATRICS, 105(6):1335–1344.

Javier, R. and J. Butel. 2008. The history of tumor virology. Cancer Research, 68(19):7693.

Jordan, C. T., M. L. Guzman., and M. Noble. 2006. Cancer stem cells. The New England Journal of Medicine, 355(12):1253–1261. PMID: 16990388.

King, C. and R. Diaz-Avalos. 2004. Protein-only transmission of three yeast prion strains. Nature, 428(6980):319–323.

Rhodes, D., J. Yu., K. Shanker., N. Deshpande., R. Varambally., D. Ghosh., T. Barrette., A. Pandey., and A. Chinnaiyan. 2004. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proceedings of the National Academy of Sciences of the United States of America, 101(25):9309.

Sergi, C., V. Ehemann., B. Beedgen., O. Linderkamp., and H. F. Otto. 1999. Huge fetal sacrococcygeal teratoma with a completely formed eye and intratumoral DNA ploidy heterogeneity. Pediatric and Developmental Pathology: The Official Journal of the Society for Pediatric Pathology and the Paediatric Pathology Society, 2(1):50–57. PMID: 9841706.

Soto, A. and C. Sonnenschein. 2005. Emergentism as a default: cancer as a problem of tissue organization. Journal of Biosciences, 30(1):103–118.

Sporn, M. B. and A. B. Roberts. 1985. Autocrine growth factors and cancer. Nature, 313(6005):745–747.

Stappenbeck, T. S. and H. Miyoshi. 2009. The role of stromal stem cells in tissue regeneration and wound repair. Science, 324(5935):1666–1669.

Stephens, P., C. Greenman., B. Fu., F. Yang., G. Bignell., L. Mudie., E. Pleasance., K. Lau., D. Beare., L. Stebbings., and others. 2011. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell, 144(1):27–40.

Thomas, P. A. 2011. Breast cancer and its precursor lesions making sense and making it early. Springer, New York.

Turner, N. and R. Grose. 2010. Fibroblast growth factor signalling: from development to cancer. Nature Reviews Cancer, 10(2):116–129.

Ueda, M., A. E. Becker., T. Naruko., and A. Kojima. 1995. Smooth muscle cell de-differentiation is a fundamental change preceding wound healing after percutaneous transluminal coronary angioplasty in humans. Coronary Artery Disease, 6(1):71–81. PMID: 7767506.

Walboomers, J. M. M., M. V. Jacobs., M. M. Manos., F. X. Bosch., J. A. Kummer., K. V. Shah., P. J. F. Snijders., J. Peto., C. J. L. M. Meijer., and N. Muñoz. 1999. Human papillomavirus is a necessary cause of invasive cervical cancer worldwide. The Journal of Pathology, 189(1):12–19.

Wickner, S. 1999. Posttranslational quality control: Folding, refolding, and degrading proteins. Science, 286(5446):1888–1893.

Winchester, D. J. 2006. Breast cancer. B.C. Decker, Hamilton.

Zahl, P., J. Maehlen., and H. Welch. 2008. The natural history of invasive breast cancers detected by screening mammography. Archives of Internal Medicine, 168(21):2311.

Table 5.1: Genes that are significantly underexpressed in breast cancer (GEO Dataset GDS3324 ); sorted from most significant to least significant based on distribution mean.

Gene Name Distribution Mean BoNT B cleaved VAMP2 fragment 1.0000 c FOS 1.0000 c FOS P 1.0000 c Jun 1.0000 c Jun P 1.0000 CX3CL1 1.0000 Cyclic AMP dependent transcription factor ATF 3 1.0000 Gelsolin 1.0000 Gelsolin 27 403 fragment 1.0000 Gelsolin 404 782 fragment 1.0000 IL1R1 1.0000 Insulin like Growth Factor Binding Protein 6 1.0000 Kinesin like protein KIF3A 1.0000 Leucine zipper protein FKSG13 1.0000 NR1D1 HUMAN 1.0000 OX 2 membrane glycoprotein precursor 1.0000 P2RY5 1.0000 Period circadian protein homolog 1 1.0000 Proto oncogene tyrosine protein kinase MER precursor 1.0000 Semaphorin 3E precursor 1.0000 thrombomodulin 1.0000 Thrombopoietin receptor 1.0000 Thrombospondin 1 1.0000 TPL2 1.0000 Tristetraproline 1.0000 VAMP2 1.0000 VAMP2 Synaptobrevin2 1.0000 Y55 phospho Sprouty 2 1.0000 phosphorylated perilipin 1.0335 Period circadian protein homolog 2 1.0336 NR4A1 Human 1.0337 SIRP alpha 1.0338 Adiponectin 1.0340 Phosphorylated cAMP specific 3 5 cyclic 1.0341 phosphodiesterase 4B

175 Texas Tech University, Andrew Avila, May 2012

Table 5.1: Continued Gene Name Distribution Mean APRIL with phosphothreonine 244 1.0341 cAMP specific 3 5 cyclic phosphodiesterase 4B 1.0341 perilipin 1.0343 p Y813 JAK2 1.0343 Insulin like Growth Factor Binding Protein 2 1.0343 FGR 1.0344 Oxidized low density lipoprotein receptor 1 1.0345 Desmoglein 2 70 Kd fragment 1.0346 Filamin A 1.0346 Secreted form of CD14 1.0348 GPI anchored form of CD14 1.0349 APRIL with phosphothreonine 244 1.0350 JAK2 1.0350 pL1 Y1176 1.0350 L1 1.0350 Retinoblastoma Protein 1 1.0351 Phospho Orphan nuclear receptor NR4A1 1.0352 Desmoglein 2 1.0356 NOD1 1.0356 phospho Retinoblastoma protein 1.0358 Inactive BAK protein 1.0671 pro protein S 1.0673 CD55 1.0679 PAIP1 1.0679 Phospho IkBA 1.0681 IKKA 1.0682 TANK 1.0682 protein S 1.0682 Activated IRAK4 1.0683 hnRNP E2 1.0683 TRAIL 1.0683 Caspase 7 precursor 1.0686 UBF 1.0686 Activated BAK protein 1.0687 Platelet SEC1 protein 1.0687 Platelet glycoprotein IV 1.0687 Interferon regulatory factor 2 1.0688

176 Texas Tech University, Andrew Avila, May 2012

Table 5.1: Continued Gene Name Distribution Mean pro protein S uncarboxylated 1.0690 BNIP2 1.0690 PERK 1.0690 Alpha actinin 1 1.0690 Major prion protein 1.0691 Phospho Forkhead box protein O1A T24 S256 S319 1.0692 Larg variant1 1.0692 CD160 antigen precursor 1.0695 DNA mismatch repair protein Mlh3 1.0695 PCBP2 1.0695 Phosphorylated platelet SEC1 protein 1.0697 EGR1 1.0698 Phospho IkBA 1.0699 IL6 1.0699 Phospho IKK1 T23 1.0699 mIL1RAP 1.0700 IRAK4 1.0701 Oxytocin receptor 1.0702 pro protein S 1.0702 protein S propeptide 1.0702 Forkhead box protein O1A 1.0703 Leukocyte common antigen precursor 1.0706 Phospho Forkhead box protein O1A T24 S256 S319 1.0709 GAS2 280 313 1.1016 calreticulin 1.1024 Notch 3 receptor precursor 1.1026 Krueppel like factor 5 1.1026 CASP1 1.1029 DNA ligase I 1.1029 NICD 3 fragment 1.1029 phosphorylated Bcl10 1.1030 Apaf 1 1.1030 MED4 1.1031 ABL1 1.1032 Procaspase 1 1.1032 GBP2 1.1033 CCAAT enhancer binding protein delta 1.1033

177 Texas Tech University, Andrew Avila, May 2012

Table 5.1: Continued Gene Name Distribution Mean Nesprin 2 1.1033 CASP1 internal fragment 1.1034 CASP1 p10 1.1034 Transmembrane remnant 3 1.1034 Notch 3 NEXT fragment 1.1035 PABPN1 1.1035 CASP1 p20 1.1035 Notch 3 receptor precursor 1.1036 NELF A protein 1.1037 NICD 3 fragment 1.1037 IFNAR2c 1.1039 Bcl10 1.1041 G alpha olf 1.1043 GAS2 1.1045 ROBO3A 1 1.1047 Cryptochrome 2 1.1047 CHL1 1.1052 Semaphorin 4D 1.1053 GAS2 1 279 1.1060 DDB2 DNA damage binding protein 2 1.1357 MIG 2 1.1365 hnRNP L 1.1375 Class E basic helix loop helix protein 40 1.1376 Sphingosine kinase 1 1.1377 Sm Protein B 1.1377 Phospho MSK1 1.1378 Dynamin 2 1 1.1378 lipoprotein lipase monomer 1.1381 MSK1 1.1384 Co SMAD 1.1388 SUN domain containing protein 2 1.1392 DFF40 DFFB CAD 1.1696 Class E basic helix loop helix protein 41 1.1701 TTF 1 1.1706 ODC 1.1712 Lymphocyte function associated antigen 3 precursor 1.1713 Phospho BAD 1.1717


Table 5.1: Continued

Gene Name    Distribution Mean
DAP12    1.1717
phosphorylated NuMA    1.1721
CCL5    1.1723
Biogenesis of lysosome related organelles complex 1 subunit 1    1.1724
Snurportin    1.1727
EGFR    1.1727
DFF40 DFFB CAD    1.1728
Histone RNA hairpin binding protein    1.1728
Phospho BAD protein    1.1731
NuMA    1.1733
BAD protein    1.1735
BAD protein    1.1739
hnRNP H    1.1742
Phospho BAD protein at S136    1.1748

Table 5.2: Genes that are significantly overexpressed in breast cancer (GEO Dataset GDS3324), sorted from most significant to least significant based on distribution mean.

Gene Name    Distribution Mean
CCR11    3.0000
Fibronectin    3.0000
Insulin like growth factor 2 mRNA binding protein 3    3.0000
Nek2A    3.0000
P2RY9    3.0000
Plexin C1    3.0000
Securin    3.0000
Synaptonemal complex protein 1    3.0000
Synaptonemal complex protein 2    3.0000
CDC20    2.9667
Leukosialin precursor    2.9665
Endophilin    2.9664
phospho Cdc2 Thr 161    2.9664
PML    2.9664


Table 5.2: Continued

Gene Name    Distribution Mean
Cdc2    2.9663
C9    2.9656
AIM2    2.9655
RAD51B    2.9654
Endophilin    2.9653
CENP E    2.9651
Cdc20    2.9645
TP receptor    2.9320
Natural killer cells antigen CD94 1    2.9320
Neuropilin 2 NP2    2.9317
Neurokinin B peptide    2.9316
RNASEL    2.9315
Protein SMG7    2.9314
factor XII    2.9313
Karyopherin alpha    2.9313
Importin alpha    2.9306
DNA directed RNA polymerase I 135 kDa polypeptide    2.9301
RIM Rab3A interacting    2.9297
factor VII propeptide    2.8986
pro factor VII uncarboxylated    2.8981
SIKE1    2.8976
Grb14    2.8974
MAP kinase p38 alpha    2.8974
Tau fragment 422 758    2.8973
phospho GRK2    2.8973
Rrn3    2.8973
factor VII    2.8972
Proglucagon    2.8968
Proglucagon    2.8968
Tau fragment 2 421    2.8967
GLP2    2.8967
pro factor VII    2.8966
Tau    2.8964
Preproglucagon    2.8961
hPrp17    2.8960
Lutropin subunit beta    2.8960
hnRNP D0    2.8959


Table 5.2: Continued

Gene Name    Distribution Mean
Glucagon    2.8958
factor VII    2.8958
Phospho MAP kinase p38 alpha    2.8956
Fibroblast growth factor receptor 1c    2.8952
CD28    2.8950
Contactin 6    2.8950
CCL16    2.8950
pro factor VII    2.8948
GRK2    2.8947
Protein transport protein Sec23A    2.8645
Son of sevenless protein homolog 1    2.8641
ProGIP    2.8633
Glucose dependent Insulinotropic Polypeptide    2.8631
Glucose dependent Insulinotropic Polypeptide    2.8630
Protein transport protein Sec23A    2.8627
HNF6    2.8626
Glycoprotein hormones alpha chain    2.8619
TRAF2    2.8618
phospho SOS    2.8616
Metastin    2.8614
G protein gamma 1 GBG1 subunit    2.8613
Glucose dependent Insulinotropic Polypeptide Cleaved at N terminus    2.8613
PreproGIP    2.8610
Thyrotropin subunit beta    2.8610
ProGIP    2.8607
MADCAM1    2.8606
Vacuolar protein sorting associated protein 45    2.8600
SHP2    2.8301
MutS protein homolog 4    2.8300
RNF125    2.8298
PUMA protein    2.8296
INSM1    2.8295
FABP6    2.8293
tBID    2.8291
VIP    2.8290
NAT1    2.8290


Table 5.2: Continued

Gene Name    Distribution Mean
tBID    2.8288
RAD51    2.8284
ATM serine protein kinase    2.8283
PUMA protein    2.8280
tBID p15    2.8279
FASL    2.8275
CSB protein    2.8273
NOD2    2.8270
Orc6    2.8269
Vasopressin receptor type 2    2.8268
BID    2.8266
Acyl NAT1 intermediate    2.8266
phospho ATM Ser 1981    2.8255


Figure 5.1: A graph showing the differential expression of genes in breast cancer (GEO Dataset GDS3324), focusing on IGF2BP3 and connected elements.

Expression legend: Underexpressed, Overexpressed, Not Differentially Expressed, Unknown or Not Gene.


Figure 5.2: A graph showing the differential expression of genes in breast cancer (GEO Dataset GDS3324), focusing on Thrombopoietin Receptor and connected elements.

Expression legend: Underexpressed, Overexpressed, Not Differentially Expressed, Unknown or Not Gene.


Figure 5.3: A graph showing the differential expression of genes in breast cancer (GEO Dataset GDS3324), focusing on Metastin and connected elements.

Expression legend: Underexpressed, Overexpressed, Not Differentially Expressed, Unknown or Not Gene.


Figure 5.4: A graph showing the differential expression of genes in breast cancer (GEO Dataset GDS3324), focusing on NOD1 and connected elements.

Expression legend: Underexpressed, Overexpressed, Not Differentially Expressed, Unknown or Not Gene.


Figure 5.5: A graph showing the differential expression of genes in breast cancer (GEO Dataset GDS3324), focusing on UBF and connected elements.

Expression legend: Underexpressed, Overexpressed, Not Differentially Expressed, Unknown or Not Gene.


CHAPTER VI CONCLUSIONS

The principal goal of this dissertation was to develop an epistemological approach toward modeling cancer through automated means. Each chapter served a constructive purpose in developing and testing the methodology. Furthermore, the later chapters applied the modeling method to a series of real-world datasets with the goal of uncovering the underlying genetic mechanisms behind cancer. The results generated by the methodology on these datasets demonstrated the robustness of the synthesized models by allowing their application to a variety of theoretical models of cancer.

Chapter II focused on the theoretical conception of modeling genetic relationships as logical formulas. Through the application of an epistemological logic, answer set programming, a formalism was created to describe, in a qualitative manner, how the expression level of one gene influences that of another. This development naturally adopted the view of instrumentalism in order to deduce predictions of the expression levels of genes that were not observed empirically. It is by reason that such deductions can be made. Given the theoretical nature of this chapter, the most prominent result was the precise definition of the formalism. Furthermore, an example implementation of a small but complex network in answer set programming elucidated some of the properties that a logical approach toward genetic relationship modeling can produce.

Chapter III described an applied study that created a software solution for the automated generation of logical models. This was accomplished through the integration of a variety of public databases, including Reactome, Gene Expression Omnibus, and PubChem. From Reactome, the abstract genetic network could be translated from SBML to the logical formalism of Chapter II. The gene expression information of the Gene Expression Omnibus could then be used to ground the abstract genetic network into an actual instance. The results of solving the complete logical model were then stored in a relational database that can provide a variety of individuals with access to the results of the model. The importance of data sharing cannot be overstated, as it allows other researchers to corroborate the results of this study and to derive further experiments. Furthermore, the integration of PubChem allows a researcher to determine which biologically active molecules are known to interact with gene products. However, this capability was not exploited in this dissertation because PubChem provides only limited information on the type of interaction the molecules have with the gene products. Lastly, a tool for interactively analyzing and visualizing the results of a model solution was also developed.

Chapter IV demonstrated the application of the logical modeling to a meta-analysis of a variety of in-vitro cancers. During the course of the study, a novel statistical method was developed to address the question of whether a gene is significantly differentially expressed. The genes that were found to be differentially expressed did not conform to the expectations given by the prevailing oncogenesis paradigm (the Somatic Mutation Theory of Cancer). Thus, the results were interpreted through a novel perspective based on the idea that cancer originates during the wound healing process. During wound healing, the process of dedifferentiation gives rise to a cancer stem cell population. Evidence is cited from the literature and from the results of the study to support the position posited.

Chapter V described the next logical application of the model: testing for differentially expressed genes in an in-vivo system. In order to increase the complexity of the results, more closely simulating in-situ conditions, while maintaining manageable output, the in-vivo system tested was constrained to breast cancer tissue samples. The results again supported the idea that the current prevailing paradigm is insufficient to account for the initiation of oncogenesis. The hypothesis proposed in Chapter IV, named Umbracesis, was further refined by placing it at the transition from a precancerous lesion to a cancerous lesion. The refined mechanism that the Umbracesis hypothesis proposes is disruption of the wound healing process by carcinogens in such a way as to prevent organismic homeostasis from being recovered or to prevent full re-differentiation of dedifferentiated cells.

Looking to the future of research in cancer, the direction is clear based on the results of this dissertation. Future treatments should focus upon removal of the carcinogen and/or localized forced differentiation of cancer cells in order to re-establish their limited functional autonomy. This may be enacted through a variety of mechanisms, including but not limited to chemical therapies and mechanical approaches. Furthermore, in terms of the modeling method that was created in this dissertation, future work should focus upon integration of knowledge at multiple levels of being (..., genes, proteins, cells, tissues, organs, ...) in order to better elucidate the causal mechanisms that lead to cancer. In essence, this means attempting to recapitulate the complex interactions occurring in-situ in order to generate a multi-dimensional understanding of the relational importance between gene expression and its ultimate physiological manifestation(s). This will naturally require a rich grammar, although the demonstrated logical approach certainly seems promising.

As a final remark, a greater depth of inquiry is required in order to fully understand cancer; the currently prevailing paradigm is simply not sufficient. By necessity, this will require a multidisciplinary approach in order to prevent premature conclusions. It will also require a unification of biology, for which the reductive epistemological approach may be insufficient owing to the multiply realizable patterns in which biological phenomena may occur.
