Copyright by John Oates Woods III 2013 The Dissertation Committee for John Oates Woods III certifies that this is the approved version of the following dissertation:

Predicting gene–phenotype associations in humans and other species from orthologous and paralogous phenotypes

Committee:

Edward M Marcotte, Supervisor

Andrew D Ellington

Thomas Juenger

Steven A Vokes

Claus O Wilke Predicting gene–phenotype associations in humans and other species from orthologous and paralogous phenotypes

by

John Oates Woods III, B.S.

DISSERTATION Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

THE UNIVERSITY OF TEXAS AT AUSTIN December 2013 Dedicated to Maxine Shelly Turner, who should have beat me to this.

Tiger got to hunt, bird got to fly; Man got to sit and wonder ‘why, why, why?’ Tiger got to sleep, bird got to land; Man got to tell himself he understand.

— Kurt Vonnegut Jr., Cat’s Cradle Acknowledgements

First and foremost, I wish to thank Edward. In any other lab, the too- frequent breaks I took from research to deal with personal/political struggles would likely not have been tolerated — and allowed me to acquire a much broader skill set than would have been possible in any other lab. With great humility and honesty, Edward set an example of what a scientist should look like. Additionally, my fellow lab members, current and past, have offered both guidance and friendship. It’s hard to imagine my life without them.

Secondly, I wish to thank my phenolog co-authors: Tae Joo Park, Hye Ji Cha, John Wallingford, and of course Kris McGary, who developed the phe- nolog method; and Martin Singh-Blom, without whom there would be no Chapter 3. Special thanks also go to Matthew Tien, who demonstrated that the method in Chapter 4 would work.

The Ruby Science Foundation and SciRuby were built upon the hard work of too many people to name. However, thanks are owed to John Prince, who sold me on Ruby, and provided a great deal of vision and encouragement. Claudio Bustos provided a number of key libraries. For NMatrix, which also had a large number of contributors, gratitude is owed to Chris Wailes — who made some early design decisions and translated the code base to C++ from C — and Colin Fuller, who helped get our first beta out the door. Adam

v Hughes offered math advice in exchange for lunch (for which he never even asked). Gratitude is also owed the Google Open Source Programs Office, our four Summer of Code students, and our many mentors, as well as Pjotr Prins for selflessly volunteering his experience, wisdom, and leadership.

For providing me with my initial experiences and training in , I wish to thank Claus Wilke, Sara Sawyer, and Andy Ellington — the first two for teaching me about and the third for recruiting me to UT and inspiring me. Additional thanks go to Claus, Andy, and both Steve Vokes and Tom Juenger for providing their expertise on developmental biology and biology, respectively.

I’ve been blessed with a number of good roommates, most of whom were great for “idea basketball.” In particular, Liz Baykina taught me things about ears that helped with phenologs, though she claims now that she did not. Lisa Piccirillo and Ben Scholl offered helpful discussions about both mathematics and neuroscience; though Ben expressed apprehension that me thanking him in my dissertation would come to reflect badly upon him in his own academic career. (Ben, this is not a political manifesto. You are likely safe for now.)

I want to thank Susan, Paul, and Anthony Turner for occupying the role of second family, and Susan especially for being like a fifth parent (in a good way). Maxine was here for me perhaps more than anyone else in my second senior year of college, and no small amount of credit for my success today is owed to her.

Lastly, I want to thank my first four parents — John the Elder, Patti,

vi Marsha, and Donna — and my sister, Mollie. Dad has been my role model and my hero, and Mom (Patti) is simultaneously one of the smartest and most humble people I know (with wonderful science and career advice to boot). I may well have the best family on the planet.

vii Predicting gene–phenotype associations in humans and other species from orthologous and paralogous phenotypes

Publication No.

John Oates Woods III, Ph.D. The University of Texas at Austin, 2013

Supervisor: Edward M Marcotte

Phenotypes and diseases may be related to seemingly dissimilar pheno- types in other species by means of the orthology of underlying genes. Such “or- thologous phenotypes,” or “phenologs,” are examples of deep homology, and one member of the orthology relationship may be used to predict candidate genes for its counterpart. (There exists evidence of “paralogous phenotypes” as well, but validation is non-trivial.)

In Chapter 2, I demonstrate the utility of including plant phenotypes in our database, and provide as an example the prediction of mammalian neural crest defects from an Arabidopsis phenotype, negative gravitropism defective.

In the third chapter, I describe the incorporation of additional pheno- types into our database (including chicken, zebrafish, E. coli, and new C. el- egans datasets). I present a method, developed in coordination with Mar- tin Singh-Blom, for ranking predicted candidate genes by way of a k nearest

viii neighbors naïve Bayes classifier drawing phenolog information from a variety of species.

The fourth chapter relates to a computational method and application for identifying shared and overlapping pathways which contribute to pheno- types. I describe a method for rapidly querying a database of phenotype–gene associations for Boolean combinations of phenotypes which yields improved predictions. This method offers insight into the divergence of orthologous pathways in evolution. I demonstrate connections between breast cancer and zebrafish methylmercury response (through oxidative stress and apoptosis); human myopathy and plant red light response genes, minus those involved in water deprivation response (via autophagy); and holoprosencephaly and an array of zebrafish phenotypes.

In the first appendix, I present the SciRuby Project, which I co-founded in order to bring scientific libraries to the Ruby programming language. I describe the motivation behind SciRuby and my role in its creation. Finally in Appendix B, I discuss the first beta release of NMatrix, a dense and sparse matrix library for the Ruby language, which I developed in part to facilitate and validate rapid phenolog searches.

In this work, I describe the concept of phenologs as well as the develop- ment of the necessary computational tools for discovering phenotype orthology relationships, for predicting associated genes, and for statistically validating the discovered relationships and predicted associations.

ix Table of Contents

Acknowledgements v

Abstract viii

List of Tables xiv

List of Figures xv

Chapter 1. Introduction 1 1.1 Phenotype and genotype ...... 1 1.2 Some molecular biology ...... 4 1.3 Mutant phenotypes ...... 5 1.4 Homology, paralogy, and orthology ...... 7 1.5 Deep homology ...... 7 1.6 Phenologs ...... 8 1.7 Hypotheses ...... 11

Chapter 2. Predicting human disease genes using plant gene– phenotype associations 12 2.1 Introduction ...... 12 2.1.1 Contributions and modifications ...... 12 2.1.1.1 Reciprocal best hit strategy ...... 13 2.1.1.2 Phenologs reveal deeply homologous modular sub- networks ...... 13 2.1.1.3 Other modifications and abridgments ...... 13 2.2 Background ...... 14 2.3 Results and discussion ...... 17 2.3.1 Systematic discovery of phenologs ...... 20 2.3.2 Yeast phenologs predict unique angiogenesis genes . . . 26

x 2.3.3 Human/Arabidopsis phenologs predict vertebrate regula- tors of craniofacial development ...... 28 2.3.4 Phenologs reveal deeply homologous modular subnetworks 31 2.4 Conclusions ...... 33 2.5 Materials and methods ...... 34 2.5.1 Collection of phenotypes ...... 34 2.5.2 Orthologs ...... 37 2.5.2.1 Proteomes ...... 37 2.5.2.2 INPARANOID calculation ...... 38 2.5.2.3 Calculation of phenologs ...... 38 2.5.2.4 Cross-validated prediction of disease genes . . . 40 2.5.2.5 Tests of deep paralogy ...... 43

Chapter 3. Ranking of predictions using multiple phenologs from a variety of species; and the addition of zebrafish, chicken, and new worm gene–phenotype association data 45 3.1 Introduction ...... 45 3.2 Background ...... 46 3.3 Results and discussion ...... 51 3.3.1 A matrix-based formalism for comparing gene–phenotype associations between species ...... 51 3.3.2 Integrating information from multiple phenologs . . . . 56 3.3.3 Epilepsy ...... 65 3.3.4 Predicting from E. coli — pharmacologically-induced seiz- ures ...... 73 3.3.5 Atrial fibrillation ...... 76 3.3.6 Plant phenotypes — response to vernalization ...... 79 3.3.7 Datasets ...... 84 3.4 Conclusions ...... 88 3.5 Methods ...... 89 3.5.1 Cross-validation ...... 89 3.5.2 Additional phenotype data ...... 90 3.6 Acknowledgements ...... 92

xi Chapter 4. Improved prediction of phenologs using Boolean com- binations of phenotypes 93 4.1 Introduction ...... 93 4.2 Boolean phenologs ...... 95 4.2.1 Correcting for multiple hypothesis testing ...... 98 4.3 Results ...... 98 4.3.1 Increased brown adipose tissue ...... 98 4.3.1.1 Predictions from Arabidopsis ...... 98 4.3.2 Set difference: methylmercury and breast cancer . . . . 102 4.3.3 Myopathy from Arabidopsis ...... 106 4.3.4 Holoprosencephaly ...... 108 4.4 Discussion ...... 109 4.5 Conclusion ...... 115 4.6 Availability of source code and data ...... 116

Chapter 5. Some concluding remarks 117 5.1 Phenologs ...... 117 5.1.1 Paralogous phenotypes ...... 119 5.1.2 The importance of basic research ...... 120 5.1.3 Future directions ...... 121

Appendices 122

Appendix A. SciRuby: Tools for scientific computing in the Ruby language 123 A.1 Foreword ...... 123 A.2 Introduction ...... 125 A.2.1 Why Ruby? ...... 125 A.2.2 Open source, open data ...... 126 A.3 SciRuby ...... 127 A.3.1 A note about Ruby implementations ...... 127 A.3.2 Statistics ...... 128 A.3.3 Data visualization: Rubyvis and Plotrb ...... 129 A.3.4 Semantic web: PubliSci ...... 130

xii A.3.5 Data mining: Ruby Band ...... 130 A.3.6 Matrices and linear algebra: NMatrix ...... 130 A.3.7 Multi-dimensional arrays ...... 131 A.4 Other Ruby science tools ...... 131 A.5 Conclusion ...... 132 A.5.1 Acknowledgements ...... 133 A.5.2 Funding sources ...... 133

Appendix B. NMatrix: A fast matrix library for the Ruby lan- guage, with dense and sparse data structures 135 B.1 Introduction ...... 135 B.2 Features ...... 136 B.2.1 Storage ...... 136 B.2.1.1 Dense ...... 138 B.2.1.2 Compressed row (Yale) ...... 138 B.2.1.3 Nested linked list ...... 140 B.2.2 Data Types ...... 141 B.2.2.1 Initialization values ...... 142 B.2.3 Input and output ...... 144 B.2.4 ATLAS and LAPACK ...... 144 B.2.5 Experimental features ...... 146 B.2.5.1 Set operations ...... 146 B.2.5.2 Data structure ...... 148 B.3 Conclusion ...... 148 B.4 Acknowledgements ...... 149

Index 156

Bibliography 167

Vita 201

xiii List of Tables

2.1 Examples of significant phenologs ...... 24 2.2 Proteome FASTA sequence download locations and dates. . . 37

4.1 Holoprosencephaly genes are predicted by many intersection phenologs at the same p value...... 110 4.2 Holoprosencephaly candidate genes may be ranked by evidence. 112

B.1 ATLAS and LAPACK functions available for dense matrices in NMatrix...... 139

xiv List of Figures

1.1 The number of gene–phenotype associations has increased sub- stantially in model organisms as compared to humans. . . . . 2

2.1 Identification of phenologs...... 16 2.2 An example of a phenolog mapping high incidence of male C. el- egans progeny to human breast/ovarian cancers...... 19 2.3 Systematic identification of phenologs...... 22 2.4 Ten-fold cross-validation of a preliminary k nearest neighbors classifier...... 25 2.5 Phenologs reveal plant models of human disease, including a model of Waardenburg syndrome neural crest defects...... 29

3.1 Prediction of disease genes from orthologous phenotypes. . . . 48 3.2 The matrix formalism for calculating phenolog overlaps . . . . 54 3.3 Effect of distance measure choice for ordering and weighting phenotypes...... 60 3.4 Predictive performance of the orthogroup-based matrix approach. 62 3.5 Effect of k on predictiveness...... 63 3.6 Contributions by individual species to the prioritization of can- didate genes...... 66 3.7 Phenologs predict candidate genes substantially better than random...... 68 3.8 A Venn diagram showing predictions for epilepsy based on the 40 most genetically similar phenotypes...... 70 3.9 Top candidate genes predicted for epilepsy...... 72 3.10 Predicting mouse seizure genes from E. coli phenotypes. . . . 75 3.11 The top candidate genes predicted for atrial fibrillation. . . . . 78 3.12 Predicting performance of phenologs for plant phenotypes. . . 82 3.13 The top candidate genes predicted for Arabidopsis response to vernalization...... 83

xv 3.14 Measuring the effect of additional datasets on predictive per- formance...... 86

4.1 Process for the identification of Boolean phenologs...... 96 4.2 Real and null distributions based on a permutation test of Boolean phenologs...... 99 4.3 Mouse increased brown adipose tissue genes may be predicted by Arabidopsis Boolean phenotype response to salt stress ∩ re- sponse to sucrose stimulus...... 100 4.4 Breast cancer is predicted from genes involved in zebrafish methyl- mercury response but not response to metal ion...... 105 4.5 Myopathy is predicted from plant genes involved in red light response but not response to water deprivation...... 108

xvi List of Code Listings

B.1 Different ways to create an NMatrix...... 150 B.2 Syntax for modifying or getting individual entries of a matrix or vector...... 151 B.3 Syntax for modifying or getting slices of a matrix...... 152

B.4 General strategy for creating a Yale sparse matrix (via :list). 153 B.5 Contiguous block and repeating pattern strategies for creating a Yale sparse matrix...... 154 B.6 Experimental strategy for optimal Yale construction...... 155

xvii Chapter 1

Introduction

There has been a diagnostic disconnect between symptoms and underly- ing biology through much of the history of Western medicine. Since the advent of modern genetics, the gap between phenotype and genotype has dwindled; but it continues to profoundly influence our view of human disease. The effect lingers especially in psychiatry [1], where diagnostic criteria are based in many cases on behavior rather than than biology or etiology.

With the advent of sequencing technology, the quantity of direct map- pings between genotype and phenotype has increased substantially for model organisms relative to humans (Figure 1.1) [2]. The study of model organisms is important in no small part for its ability to elucidate aspects of human biol- ogy that are, for ethical and practical reasons, inaccessible to study. Methods which map conserved features from model organisms to Homo sapiens are useful, if nothing else for their impact on human health.

1.1 Phenotype and genotype

The terms genotype and phenotype were coined by Wilhelm Johanssen, who was the first to draw a distinction between heritable Mendelian elements

1 160,000 mouse 140,000

120,000

100,000

80,000 yeast

60,000 worm

40,000

20,000 human 0

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 Year

Figure 1.1: The number of gene–phenotype associations has increased substantially in model organisms as compared to humans. This figure, generously provided by Greg Weiss (originally in [2]), illustrates the substan- tial increase in gene–phenotype associations since the turn of the century. Mappings between genes and diseases in humans have also increased, but not at anywhere near the pace.

2 — so-called genes, collectively referred to as genotype — and the description of morphology, the appearance of which is referred to as phenotype.1 He wrote:

All “types” of organisms, distinguishable by direct inspection or only by finer methods of measuring or description, may be char- acterized as “phenotypes.” Certainly phenotypes are real things; the appearing (not only apparent) “types” or “sorts” of organisms are again and again the objects for scientific research. All typi- cal phenomena in the organic world are co ipso phenotypical, and the description of the myriads of phenotypes as to forms, struc- tures, sizes, colors and other characters of the living organisms has been the chief aim of natural history, which was ever a science of essentially morphological-descriptive character. [3]

I use the term “phenotype” to describe not only the collection of mor- phological elements, but the elements themselves, insofar as those elements can be distinguished from one another. In this work, trait is used more-or- less interchangeably with phenotype, and disease refers to the combination in humans of some number of distinguishable, largely detrimental phenotypes.

While some biologists study those contributions to phenotype which are quantifiable, such as quantitative trait loci, our lab focuses on the largely qualitative aspects. As an example, rather than asking how tall an individual is in inches or centimeters — where many genes contribute subtly and syn-

1Phenotype comes from the Greek word for to show, phene.

3 thetically — we would prefer to look at such relevant phenotypes as abnormal height, proportionate dwarfism, disproportionate dwarfism, and gigantism.

1.2 Some molecular biology

A central principle of the field is that proteins — composed of amino acids — are translated from triplets of RNA bases, and that RNA molecules are transcribed from DNA. DNA bases, which are adenine (A), thymine (T), cytosine (C), and guanine (G), form base-pairs A–T and C–G. RNA bases differ only in that thymine is replaced with uracil (U), and in that the ribose sugar has lost an –OH group.

The three-base triplet sequences of nucleic acids which make up DNA and RNA are known as codons. There are three stop codons (UAA, UAG, and

UGA). Codon sequences are degenerate; only tryptophan and methionine have a single possible triplet sequence each (UGG and AUG, respectively), and most others have two or four.

The codon sequences are arranged in such a way that a change in the third base is likely to produce the same amino acid (e.g., GGG to GGA still produces glycine, which is four-fold degenerate), and a change in any one base is likely to result in an amino acid which is chemically similar to the one that should have been. In this way, and in others beyond the scope of this work, the codons are adapted such that the most common defects exhibit relatively little impact upon an organism’s fitness or phenotype.

4 If bases are deleted, inserted, or substituted, mistranslation may occur

— or no translation at all, if a stop codon is the result — and the cellular machinery constructed from that protein may function less efficiently (hypo- morphs), more efficiently (hypermorphs), or not at all (amorphs, null muta- tions, or knock-outs). Proteins may also gain new function (neomorphs), or act in opposition to normal gene activity (antimorphs).

The change need not be to the protein-coding portion of the gene se- quence. In some cases, non-coding sequence modifications may reduce or in- crease transcription, or result in the incorporation of pieces of the gene that aren’t normally included — or conversely, the non-incorporation of pieces which are usually present.

The different versions of genes resulting from changes in nucleotide sequence are known as alleles.

1.3 Mutant phenotypes

Mutant phenotypes are most frequently observed in a wild-type back- ground. Today, in the major model organisms, wild-type refers not to the set of alleles found in a wild animal plucked out of the forest at random, but to a well-characterized set of alleles found in a lineage that has been kept for study for a number of decades.

Most phenotypes discussed in this work arise from perturbation of a single gene in a wild-type background — that is, a new allele might be pro-

5 duced in a gamete either via some genetic technology like gene trapping; or by mutagen, chemical or radioactive; or by inhibition of the message through

RNA interference.2 While mutant phenotypes may arise from synthetic inter- actions of two or more alleles, the phenotypes used as evidence for this work are primarily produced by single-gene perturbations.

Muller’s discovery of X-ray mutagenesis at the University of Texas in

1926 produced the first controlled studies of mutant phenotypes [4], but until sequencing became available, mapping phenotypes to specific genetic loci was a complicated process.

It is worth noting that multiple genes may, when mutated, give rise to the same or similar phenotypes. Indeed, this work looks only at those phenotypes with multiple associated genes.

It is also true that some genes give rise to multiple phenotypes. Some- times different mutations to the same gene give rise to distinct mutant char- acters, and this can be used to determine the functions of the various parts of a protein or protein complex. It may also happen that a single mutation gives rise to multiple seemingly distinct mutant phenotypes. In either case, genes are said to be pleiotropic; that is, one gene affects more than one trait.

2RNA interference technology, or RNAi, is also called post-transcriptional gene silencing. The messenger RNA transcript may be targeted for knock-down using a complementary RNA molecule, which causes that message to be mostly silenced, resulting in what amounts to an incomplete knock-out of that gene. Targeted RNAi knock-down technology is available only in a few model organisms, and may involve off-target effects (genes which are knocked down accidentally in the process of knocking down a desired target).

6 1.4 Homology, paralogy, and orthology

As species diverge from a common ancestor [5], genes too may diverge. Genes which perform similar functions or with similar sequences are generally referred to as homologs (and typically descend from common ancestor genes), just as structures like arms, wings, and legs may be homologous.

In 1970, Walter Fitch coined the terms “orthologs” and “paralogs,” which distinguish between two types of homologs. Orthologs are those homol- ogous genes in a pair of species which diverged from a single gene in the most recent common ancestor; and paralogs are homologous genes within a species which were created through a duplication event from a single gene [6].

Remm et al. further categorized paralogs into in-paralogs and out- paralogs; which refer, with respect to some divergent species, to paralogs cre- ated since the speciation event and prior to it. In addition, Remm et al. distinguished between out-paralogs in the divergent species — that is, genes which were duplicated and then diverged in the common ancestor — and or- thologs, which were a single gene in the common ancestor. [7]

1.5 Deep homology

Shubin et al. suggest that cases of structural homology (e.g., wings and arms) arise from the “sharing of the genetic regulatory apparatus that is used to build morphologically and phylogenetically disparate animal features” [8], which they call deep homology [9]. Crudely, if one can identify those genes

7 which cause people to develop arms, the orthologs (and out-paralogs) of those genes in, say, a chicken should largely be involved in the development of wings.

It turns out that deep homology may also occur in structures which did not directly evolve from a common ancestor structure. Eyes, for example, appear to have evolved in a number of different species [10], and may have wildly different structures. Yet Shubin et al. point out that the genes employed in eye development are remarkably similar across these species:

Taken together, the available data suggest that the eyes of jellyfish, squid, arthropods, and vertebrates are not the product of rampant convergent evolution, as once thought, but of parallel evolution that is based on a shared history of generative mechanisms and cell types — deep homologies — established early in animal evolution. [8]

1.6 Phenologs

The observations of Shubin et al. show that even in the absence of any obvious common structures, underlying genetic programs may be conserved.

Phenologs (homologous phenotypes) are a direct extension of deep ho- mology. As defined by McGary et al., phenologs are pairs of phenotypes in separate organisms which derive from a common set of underlying genes. Thus eyes in separate organisms — or eye development — could be considered phe- nologs, and so might limb development. Yet the meaningful innovation with phenologs was the possibility to identify non-obvious examples. [2]

8 McGary et al. showed that phenologs could be used to predict genes involved in human diseases based on their involvement in model organism phenotypes. I identified the phenotype negative gravitropism in (growth of structures in opposition to gravity) as a phenolog of human Waardenburg syndrome, which is a neural crest disorder, causing pigmentation defects and hearing impairment; and on that basis, we identified a new neural crest development gene [2].

Phenologs resemble pleiotropy. In pleiotropy, one gene affects multiple traits. In phenologs, a gene and its orthologs affect multiple traits, typically in multiple organisms.

While “phenolog” was originally a portmanteau of “phenotype” and “ortholog” (with ortholog being used here to represent both true orthologs and out-paralogs) it is worth noting that defects in underlying genetic programs contribute to in-paralogous phenotypes as well. On the structural phenolog level, development of hands and feet (for example) involve related genetic programs.

The distinction between orthologous and out-paralogous phenotypes relates to the relation of the genetic programs in the most recent ancestor. If the phenotypes diverged from a single phenotype, they are orthologs; but if they diverged from more than one phenotype (with each phenotype arising from the same set of pleiotropic genes), the phenotypes should be thought of as out-paralogous; and a mixture is also possible. The distinction is rather a moot point, however, since without access to the genome of the common

9 ancestor it is impossible to distinguish between the two cases.

In-paralogous phenotypes need not be structural. As an example, mul- tiple sclerosis (MS) and type 1 diabetes (T1D) are phenotypically dissimi- lar, but the observation that they shared similar genes [11] led the Interna- tional Multiple Sclerosis Genetics Consortium to examine known T1D single- nucleotide polymorphisms for association with MS; they identified CLEC16A and CD226 as MS-susceptibility genes [12]. Similar associations have been made between celiac disease and T1D [13], as well as rheumatoid arthritis, ulcerative colitis, Crohn’s disease, and others [11, 14–17].

In-paralogous phenotypes are also related to pleiotropy. A distinction should be made, for example, between the pleiotropic case where numerous phenotypes in an organism have exactly or almost exactly the same set of genes involved (as in autoimmunity), and the case where the genes which give rise to one phenotype are in-paralogous to the genes which affect a second phenotype (which is discussed in the fourth chapter). Indeed, it is likely that most real examples do not fall cleanly into one category or the other, and the latter case may reflect ancestral pleiotropy (i.e., prior to the duplication and divergence of the ancestor genes into paralogs).

For the purposes of this work, in-paralogous phenotypes are primarily a curiosity. However, they do demonstrate conceptually the symmetry of the extension from deep homology to phenologs.

10 1.7 Hypotheses

First, this dissertation seeks to examine a hypothesis that gene–pheno- type associations in vertebrates may be predicted by phenotype orthology between H. sapiens diseases and A. thaliana phenotypes. It builds on existing work by Kriston McGary and Edward Marcotte.

A second key hypothesis is that candidate gene prediction may be im- proved, through ranking, by use of a classifier incorporating gene–phenotype associations from a variety of species simultaneously. I consider the impact on predictions of different types of phenotype data, including Gene Ontol- ogy (GO) biological processes for Danio rerio (zebrafish) and A. thaliana and in situ hybridization annotations for Gallus gallus (chicken).

Finally, I demonstrate the use of new scientific computational tools, specifically SciRuby and NMatrix (described in great detail in the appendices), to test a hypothesis that insights may be gained into deeply conserved cellular and organismal processes by predicting disease genes using Boolean combina- tions of phenotypes. I demonstrate synthetic interactions between processes using set union operations; and rescues of one phenotype by another using set difference.

11 Chapter 2

Predicting human disease genes using plant gene–phenotype associations

2.1 Introduction

This chapter, the first major part of my work, introduces the concept of phenologs. It is adapted from the paper “Systematic discovery of non-obvious human disease models through orthologous phenotypes.” Kriston McGary and

Tae Joo Park wrote the paper together, with contributions in the introduction and methods from me. Kris provided the computational methods, as well as the gene–phenotype associations for human, mouse, yeast, and worm. I obtained and processed the Arabidopsis gene–phenotype associations. Tae Joo

Park and Hye Ji Cha performed the wet-lab experiments. Edward Marcotte and John Wallingford supervised and advised on all elements. 2.1.1 Contributions and modifications

I include some sections relevant to my hypotheses in which I played little or no role, but abridge some others which are only tangentially related.

12 2.1.1.1 Reciprocal best hit strategy

The last paragraph of Section 2.3.1 describes an alternate method by which orthologous phenotypes may be identified (the reciprocal best hit, or

RBH, strategy), and elucidates the extent of the analogy between homologous genes and homologous phenotypes.1

2.1.1.2 Phenologs reveal deeply homologous modular subnetworks

I include Section 2.3.4, which is not my work, because it involves an exploration of phenologs in the context of functional networks and sets the stage for the rest of this work. Additionally, this section explores the proba- bility that a given prediction is correct, which is important for Chapter 3. I exclude the methods relating to the tests of subnetwork modularity.

2.1.1.3 Other modifications and abridgments

I exclude methods sections relating primarily to experimental meth- ods in Xenopus. Additionally, I abridge or summarize some portions of 2.3.2, which discusses a yeast model for angiogenesis. Those parts which are re- tained illustrate the distinction between the concept of deep homology and the framework of phenologs. That is, phenologs are examples of deep homology, but the phenolog framework is a useful innovation for its ability to identify

“non-obvious” disease models.

1Specifically, phenotypes may be either orthologous or paralogous, just as genes may be orthologous or paralogous.

13 2.2 Background

Biochemical and molecular functions of a given protein are generally conserved between organisms; this observation is fundamental to biological research. For example, in X-ray crystallography, in order to facilitate the study of a protein’s biochemical function, one may select, from an array of organismal orthologs, the version of that protein most easily crystallized. On the other hand, even with a conserved gene, disruption of a function may give rise to radically different phenotypic outcomes in different species.

For example, mutating the human RB1 gene leads to retinoblastoma, a cancer of the retina, yet disrupting the nematode ortholog contributes to ectopic vulvae [18, 19]. Thus, although a gene’s “molecular” functions are conserved, the “organism-level” functions need not be. When a conserved gene is mutated, the resulting organism-level phenotype is an emergent property of the system. This bedrock principle underlying the use of model organisms not only allows us to study important aspects of human biology using mice or frogs, but also permits exploration of inherently multicellular processes, such as cancer, using unicellular organisms like yeast.

Within this paradigm, once a molecular function has been discovered in one organism, it should be predictable in other organisms: GSK3 homologs in yeast are kinases, and as such GSK3 homologs in every other organism will generally be kinases. In contrast, the emergent organism-level phenotypes are far less predictable between organisms, in part because relationships between genes and phenotypes are many-to-many. Manipulation of GSK3 perturbs

14 nutrient and stress signaling in yeast, anteroposterior patterning and segmen- tation in insects, dorsoventral patterning in frogs, and craniofacial morpho- genesis in mice [20–22]. Recognizing functionally equivalent organism-level phenotypes between model organisms can therefore be non-obvious, especially across large evolutionary distances.

However, the ability to recognize equivalent phenotypes between dif- ferent model organisms is important for the study of human diseases. Given the success of studies in model systems (genes and phenotypes have been as- sociated in model organisms at a far higher rate than for humans; see Figure 1.1), it seems likely that useful and tractable models for human disease await discovery, currently hidden by differences in the emergent appearance of phe- notype in diverse model organisms. Although a framework exists for discussing complex gene–phenotype relationships across evolution, we lack simple — and importantly, quantifiable — methods for discovering new gene–phenotype re- lationships from existing data.

As a foundation for a quantifiable approach to identifying equivalent phenotypes, we introduce the notion of orthologous phenotypes (phenologs), defined as phenotypes related by the orthology of the associated genes in two organisms. Phenologs are the phenotype-level equivalent of gene orthologs.2 Two phenotypes are thus said to be orthologous if they share a significantly larger set of common orthologous genes than would be expected at random

2In fact, it is more accurate to say that phenologs are the phenotype-level equivalent of gene homologs.

15 Orthologous genes

Orthologous phenotypes (phenologs)

Figure 2.1: Identification of phenologs. Phenologs can be identified based on significantly overlapping sets of orthologous genes (gene A is orthologous to A’, B to B’, etc.), such that each gene in a given set (green box or cyan box) gives rise to the same phenotype in that organism. The phenotypes may differ in appearance between organisms because of differing organismal contexts. As gene–phenotype associations are often incompletely mapped, genes currently linked to only one of the orthologous phenotypes become candidate genes for the other phenotype; that is, the gene A’ is a new candidate for phenotype 2.

(i.e., are enriched for the same orthologous genes) (Figure 2.1), regardless of how dissimilar the phenotypes appear.

Phenologs, therefore, are evolutionarily conserved outputs arising from disruption of any of a set of conserved genes (Figure 2.1, green and blue). These outputs manifest as different traits or defects in different organisms because of the organism-specific roles played by each set of genes. One example, noted above, is the human retinoblastoma eye cancer and the Caenorhabditis elegans ectopic vulvae. These phenotypes are orthologous, as failure of equivalent

16 genes (the Rb pathway) — undertaking conserved molecular functions but in different contexts — leads to different phenotypes in the different organisms [18, 19].

2.3 Results and discussion

Phenologs are identified by assembling known gene–phenotype associa- tions for two organisms — considering only genes that are orthologous between the two organisms — and searching for inter-organism phenotype pairs with significantly overlapping sets of genes. Significance is derived from three ob- servations:

1. the number of orthologs in organism 1 that give rise to phenotype 1;

2. the number of orthologs in organism 2 that give rise to phenotype 2; and

3. the number of orthologs shared between these two sets.3

Formally, significance of a phenolog is calculated from the hypergeometric probability of observing at least that many shared orthologs by chance (see Methods).

3This definition is incomplete; see Materials and Methods. In short, “orthologs” must be read here as unique; if organisms 1 and 2 have each experienced a gene expansion post- divergence, only the orthologs at the time of divergence should be counted. Since that cannot be known without a time machine, I apply a term used by Remm et al., “orthology groups” or simply “orthogroups,” which describes the one-to-one correspondence between a set of in-paralogs in one species and their orthologs in another (and vice versa). Thus, an orthogroup is the set of in-paralogs which are mutually orthologous between two species [7].

17 Figure 2.2 shows an example of these observations: the set of human– worm genes associated with X-linked breast/ovarian cancer in human signifi- cantly overlaps the set of genes whose mutations lead to a high frequency of male progeny in C. elegans. Male C. elegans are determined by a single X chro- mosome, hermaphrodites by two copies; thus, X chromosome nondisjunction leads to higher frequencies of males [23]. Human breast/ovarian cancers may derive from X chromosome abnormalities [24], supporting the notion that this phenolog is identifying a useful disease model. Human orthologs of the thirteen additional genes associated with this worm trait are thus reasonable candidate genes for involvement in breast/ovarian cancers. Nine of these genes are not yet linked to breast cancer in the databases we employed, but could be con-

firmed as such in the primary literature (Table S1 of [2]). The remaining four genes (GCC2, PIGA, WDHD1, SEH1L) are thus implicated as breast cancer candidate genes. This rate of literature confirmation is 268-fold higher than that expected based on the current annotation rate in the Online Mendelian

Inheritance in Man database [25], providing significant support for the utility of the worm phenotype to predict and suggest additional genes relevant to human breast cancer.4

4The estimated fold-improvements presented here depend upon consistent literature cu- ration and, thus, should be interpreted cautiously.

18 Human/Worm Ortholog Linked to breastLinked to more cancer in humans males in worms ATM/atm-1 Y BRIP1/dog-1 Y KRAS/let-60 Y PHB/phb-1 Y 4,649 orthologs total PIK3CA/age-1 Y Human Worm RAD51/rad-51 Y breast/ovarian high incidence RAD54L/rad-54 Y cancer male progeny SLC22A18/C53B4.3 Y TSG101/tsg-1010 Y BARD1/brd-1 Y Y BRCA1/brc-1 Y Y CHEK2/chk-2 Y Y 9313 FAM82B/F33H2.6 Y GCC2/hcp-1,hcp-2 Y HMG20A,B/W02D9.3 Y HORMAD2,1/him-3,htp-1,2 Y KIF15/klp-10,18 Y MRE11A/mre-11 Y -6 PIGA/D2085.6 Y p ≤ 7.2x10 RAD1/mrt-2 Y RAD21/coh-1 Y SEH1L/npp-18 Y SVIL/viln-1 Y TSPO,BZRPL1/C41G7.3 Y WDHD1/F17C11.10 Y

Figure 2.2: An example of a phenolog mapping high incidence of male C. elegans progeny to human breast/ovarian cancers. The left-hand table shows the human–worm orthologs that participate in the two phenotypes, as well as whether they have been associated with the respective breast cancer and increased male progeny phenotypes in humans and worms. Three genes are associated with both (out of twelve for breast cancer and sixteen for increased male progeny). An intersection of that size should happen by chance at a frequency of 7.2 × 10−6 given 4,649 genes shared between the species.

19 2.3.1 Systematic discovery of phenologs

To systematically discover phenolog relationships, Kriston McGary col- lected from the literature a set of 1,923 human disease–gene associations [25], 74,250 mouse gene–phenotype associations [26], 27,065 C. elegans gene– phenotype associations [27], and 86,383 yeast gene–phenotype associations [28–31]. The dataset spans approximately three hundred human diseases and over six thousand model organism phenotypes. With these data and the sets of orthologous gene relationships between each pair of organisms [7], we quantita- tively examined the overlap of each interorganism phenotype pair, measuring their significance (Figure 2.3A).

To correct for testing multiple hypotheses, all analyses were repeated 1,000 times with randomly permuted gene–phenotype associations, then calcu- lated a false-discovery rate (FDR) based upon the observed null distribution of scores (Figure 2.3B; see also Figure S1 from [2]). We observed 154 sig- nificant phenologs (5% FDR) between human diseases and yeast mutational phenotypes; 3,755 between human and mouse; 147 between mouse and worm; 119 between mouse and yeast; 206 between yeast and worm; and nine between human and worm (the low number stems from limited mutational data in both species) (Figure 2.3C).

Many specific, intuitively obvious phenologs were revealed by this anal- ysis, especially for the comparison of mouse and human phenotypes. Our anal- ysis recapitulates many known mouse models of disease, providing an impor- tant positive control for our approach; Table 2.1 lists other specific examples

20 of both known and previously undescribed equivalences.

For example, one of the most significant phenologs identified between human disease and mouse mutational phenotypes is that linking Bardet–Biedl syndrome with four mouse traits, each of which relates to the disruption of ciliary function (abnormal brain ventricle, choroid plexus morphology, small hippocampus, enlarged third ventricle, absent sperm flagella; all p ≤ 10−11, consistent with the apparent molecular defects in Bardet–Biedl syndrome [32]. These phenologs thus suggest mouse ciliary defects provide a powerful model for studying human Bardet–Biedl syndrome, consistent with its recognized utility in this regard.

Similarly, human cataracts are observed to be phenologous to mouse cataracts (p ≤ 10−24), human deafness to mouse deafness (p ≤ 10−29), hu- man retinitis to mouse retinal degeneration (p ≤ 10−26), and human goiter to mouse enlarged thyroid glands (p ≤ 10−8). Thus, the calculation of phenologs correctly identifies many known mouse models of human diseases and there- fore has the potential to identify new models. More generally, cross-validated tests of phenologs confirm their strong predictability for genes associated with a substantial portion of human diseases (Figure 2.4).

Much of the powerful conceptual framework established for determin- ing homology and orthology for gene sequences may also be applicable to phenologs. For example, many of the algorithmic approaches used to identify orthologous genes might also be applied to the identification of phenologs. We explored this notion for one effective and easily automated approach to iden-

21 Figure 2.3: Systematic identification of phenologs. (A) For a pair of organisms, sets of genes known to be associated with mutational phenotypes are assembled, considering only orthologous genes between the two organisms. Pairs of mutational phenotypes — one phenotype from each organism, each associated with a set of genes — are then compared to determine the extent of the overlap of the associated gene sets, calculating the significance of overlap by the hypergeometric probability. Comparison of the distribution of observed probabilities with those derived from the same analysis following permutation of gene–phenotype associations reveals that many more orthologous pheno- types are observed than expected by chance, as shown in B for the case of the human–yeast comparison (see also Figure S1 from [2]), and summarized for each organism pair in C.

22 1 A Human Yeast B disease phenotype 10-1

-2 N 10 Real orthologs k 10-3

n n -4 1 2 10 p k nnN 1 2 Permuted 10-5 Human-Yeast 10-6 -2 -3 -4 -5 -6 -7 -8 -9 10 10 10

1.00 C Human- Yeast Mouse- Worm 0.99

0.98

0.97

0.96 Human- Mouse- Worm- Human- Worm Yeast Yeast Mouse

0.95 1 10 100 1000

23 Phenotype 1 Phenotype 2 n1 n2 k p-value PPV Hs Cataracts Mm Cataracts 19 47 11 6 × 10−24 1.00 Hs X-linked conductive Mm Circling 47 50 12 2 × 10−20 1.00 deafness Hs Bardet–Biedl syn- Mm Absent sperm flag- 11 5 4 8 × 10−13 1.00 drome ella Mm Lymphoma Sc CANR mutator 14 11 6 1 × 10−11 1.00 high Hs Zellweger syndrome Sc Reduced number of 8 6 4 1 × 10−9 1.00 peroxisomes Hs Susceptible to Mm Abnormal social in- 5 16 3 1 × 10−8 1.00 autism vestigation Mm Abnormal heart de- At Defective response 25 9 4 3 × 10−7 1.00 velopment to red light Hs Refsum disease At Defective protein im- 4 5 2 1 × 10−5 1.00 port into peroxisomal matrix Mm Absent posterior At Shade avoidance de- 2 4 2 1 × 10−6 0.99 semicircular canal fect Mm Spleen hypoplasia Sc Uge (enlarged cells) 5 16 3 3 × 10−6 0.99 Mm Gastrointestinal hem- Ce Abnormal body wall 6 3 2 4 × 10−6 0.98 orrhage muscle cell polarization Hs Mental retardation At Cotyledon develop- 13 5 2 1 × 10−4 0.98 ment defects Hs Congenital disorder Sc CID 6045865 sensi- 10 25 3 2 × 10−4 0.98 of glycosylation tive Hs Hemolytic anemia Sc Hydroxyurea sensitive 11 23 3 2 × 10−4 0.98 Hs Amyotrophic lateral Sc Increased resistance 2 34 2 2 × 10−4 0.97 sclerosis to wortmannin

Table 2.1: Examples of significant phenologs. n1 indicates the number of orthologs in organism 1 with phenotype 1, n2 the number in organism 2 with phenotype 2, and k the number in both sets. The significance of each phenolog is assessed by the hypergeometric probability (p-value), the positive predictive value (PPV) when considering multiple testing (1 − FDR), and the reciprocal best-hit criterion (bold text). At, Arabidopsis thaliana; Ce, worm; Hs, human; Mm, mouse; Sc, yeast.

24 1 k=40 0.9 k=1 0.8 Results of shuffling 0.7 disease gene sets 0.6 Random expectation 0.5 0.4 0.3 0.2 Area under ROC curve (AUC) 0.1 0 0 50 100 150 200 250 Human disease (from OMIM, rank-ordered by AUC)

Figure 2.4: Ten-fold cross-validation of a preliminary k nearest neigh- bors classifier. Ten-fold cross-validated tests show strong disease gene pre- diction by single phenologs for approximately one-sixth to one-fifth of tested diseases; simple weighted combinations of phenologs (e.g., evaluating the k = 40 best phenologs) provide strong predictability for approximately one- third to one-half of the tested diseases. Predictability is measured as the area under a receiver–operator characteristic (ROC) curve as described in Mate- rials and Methods and evaluated separately for each human genetic disease with at least two associated genes. An area under the ROC curve (AUC) of 1 indicates perfect prediction of known disease genes in a cross-validated test; an AUC of 0.5 indicates performance no better than chance. Error bars indicate first quartile, median, and third quartile of predictions of shuffled disease gene sets from the k = 1 test; score distributions from shuffling tests are similar for both k = 1 and k = 40 and center around AUC = 0.5, as expected by chance. OMIM, Online Mendelian Inheritance in Man.

25 tify orthologous sequences, the reciprocal best hit (RBH) strategy. The RBH criterion holds that genes X and Y are orthologs if genes X and Y are the most similar to each other (reflexively) when searched genome-wide. We adapted the RBH criterion to the identification of phenologs to identify the most equiv- alent phenotypes between two organisms from among those assayed, by asking if the phenotypes have the most significant (lowest p value) gene overlaps with each other when searched against all phenotypes in their respective organisms. Such analysis gives a second criterion for identifying phenologs, useful for le- gitimate phenologs with poor p values because of limited phenotypic data sets. Examples of such RBH phenologs are indicated in Table 2.1.

2.3.2 Yeast phenologs predict unique angiogenesis genes

The power of the phenolog framework lies in discovery of non-obvious disease models. We identified just such a non-obvious model (p ≤ 10−6) in the form of yeast lovastatin sensitivity and the mouse phenotype angiogen- esis defective. Lovastatin is a hypercholesterolemia drug. We wanted to know whether vertebrate vasculature formation, a clearly multicellular pro- cess, might be modeled in budding yeast, a single-cell organism; and asked whether the model could predict additional angiogenesis genes.

Phenologs are most important for their ability to map gene–phenotype associations from one organism to another. In other words, phenologs suggest unobserved associations in one organism from observed associations in another (Figure 2.1, gene E and gene A’). Thus, examination of phenologs should

26 substantially improve the rate of discovery of new genes over the rate expected by chance (here about one in 267, as estimated from the frequency of known angiogenesis genes). We might a priori expect a discovery rate comparable to the fraction of genes already verified by the phenolog; for the yeast angiogenesis model, a likely discovery rate of about one in thirteen, for an around 21-fold higher discovery rate than random expectation.

Our model therefore asserts that some of the sixty-two additional genes associated with lovastatin sensitivity in yeast would be predicted to be in- volved in angiogenesis. In fact, literature confirms three of the corresponding mouse genes to function in angiogenesis: the known target of lovastatin, HMG– CoA reductase, whose role in angiogenesis has been previously observed [33], the sirtuin SIRT1, whose disruption in zebrafish and mice caused defective blood vessel formation and blunted ischemia-induced neovascularization [34], and the casein kinase CSNK2A1, inhibitors of which block mouse retinal neo- vascularization [35].

For phenologs to be useful, they must be able to predict entirely new gene–phenotype associations, which we attempt to demonstrate first by in situ hybridization of the predicted genes in Xenopus laevis. Five out of fifty- nine were expressed in vasculature. Next, we verified the role of one of these genes, sox13, by morpholino knock-down.6 Finally, we confirmed SOX13’s role in human angiogenesis using siRNA-induced knockdown. Thus, SOX13 is a

6Morpholino knock-down is similar to RNAi-based knock-down in practice, though the principles differ. To date, RNAi technology does not exist in Xenopus.

27 unique regulator of angiogenesis, discovered in the absence of any previous functional data linking it to angiogenesis, on the basis of orthology between mouse angiogenesis defects and yeast lovastatin sensitivity.

2.3.3 Human/Arabidopsis phenologs predict vertebrate regulators of craniofacial development

Phenologs provide a quantitative framework for identifying cases of extremely distance homology (deep homology [9]) of functionally coherent gene systems. This creates an opportunity to use very distantly related species as human disease models. I tested this approach by systematically searching for plant models of human disease.

I collected 22,921 gene–phenotype associations — spanning 1,711 unique phenotypes — for the mustard and cabbage relative Arabidopsis thaliana and analyzed these for phenologs with yeast and animal phenotypes. Hundreds of orthologous phenotypes were evident (Figure 2.5A; see also Figure S6 from [2]), including 897, 733, 172, and 48 between Arabidopsis and yeast, mice, worms, and humans, respectively (5% FDR). The human–plant phenologs suggest mappings between specific plant mutational phenotypes and diverse human diseases, including cancers, peroxisomal disorders such as Refsum disease and

Zellweger syndrome, and a variety of birth defects (Table 2.1).

I observed a striking plant–human phenolog relating negative gravit- ropism to Waardenburg syndrome (Figure 2.5B). This congenital syndrome stems from defects in the developing neural crest and is characterized by cran-

28 Waardenburg Gravitropism 1.00 B A syndrome defects in humans 1 2 3 -6 0.99 p Yeast- < 10 Arabidopsis

0.98 Human- Arabidopsis C 0.97 Mouse- Arab.

0.96 Worm- Arabidopsis Migrating 0.95 1 10 100 1000 SEC23IP D E E’ Ctl Exp

Migrating

Slug

Ctl Exp development E’’

Slug Slug

Figure 2.5: Phenologs reveal plant models of human disease, including a model of Waardenburg syndrome (WS) neural crest defects. (A) Many orthologous phenotypes are observed between Arabidopsis and each of worms, yeast, mouse, and humans, with hundreds more than expected by chance. Many mam- malian/plant phenologs relate to vertebrate developmental defects, including models for WS and other birth defects. (B) Considering only human/Arabidopsis orthologs, the three known WS genes significantly overlap the five genes associated with neg- ative gravitropism defects in Arabidopsis. The plant gene set suggests unique can- didate WS genes. (C) In situ hybridization versus candidate sec23ip induces (E) defects in neural crest cell migration on the side with the knock-down (E”) but not the control side (E’), measured using in situ hybridization versus two independent markers of neural crest cells, snai2-a (defects observed in 23 of 35 animals tested) and twist (eight of fourteen animals tested) (Figure S7 of [2], not included). Such defects are rare in untreated control animals and off-target morpholino (OM) knock-downs (0 of 21 control animals tested with snai2-a; one of fourteen OM animals tested with snai2-a; zero of fourteen OM animals tested with twist).

29 iofacial dysmorphology, abnormal pigmentation, and hearing loss.7 In partic- ular, this phenolog suggested that a set of three vesicle trafficking genes — involved in directing plant growth in response to gravitational cues — might also serve to direct neural crest cell migration and differentiation in developing animal embryos.

Encouragingly, one of the identified proteins (STX12) is known to in- teract with the protein encoded by the pallid gene in mice [37], whose mu- tational phenotypes, including pigmentation and ear defects, are consistent with Waardenburg syndrome [38]. The remaining two proteins had no sup- port in the literature, and we selected their three mammalian orthologs for further testing. Repeating the procedure used to examine the angiogenesis de- fect genes, Tae Joo first attempted to determine whether the candidates were expressed in Xenopus neural crest, using in situ hybridization; sec23ip was positive. Next, he used morpholinos to knock-down the gene unilaterally in the neural crest, and observed defects in neural crest cell migration specifically on the injected side (Figure 2.5E; see also Figure S7 from [2]), confirming its role in neural crest development.

Notably, SEC23IP physically associates with SEC23, a component of the COPII complex that controls ER-to-Golgi trafficking, and mutations in SEC23 underlie cranio-lenticular-sutural dysplasia, another congenital human disease related to neural crest development [39]. Thus, SEC23IP, identified

7In fact, it accounts for between two and five percent of cases of human deafness. [36]

30 here on the basis of orthology to plant negative gravitropism, is both a promis- ing candidate gene for Waardenburg syndrome and also provides insights into the emerging link between COPII function and craniofacial development in vertebrates [40]. Our success rate of one in two for finding Waardenburg- relevant genes represents a 550-fold improvement over the current annotation rate of around one in 1,100 genes. Notably, in spite of the extremely dissimilar associated phenotypes, phenologs can identify functionally coherent gene sets that predate the divergence of and animals.

2.3.4 Phenologs reveal deeply homologous modular subnetworks

Phenologs imply that although phenotypes diverge, the orthology of the underlying gene networks — and the networks’ immediate functional output — is conserved. We might therefore expect genes involved in a given phenolog to represent a coherent biological module, and thus be highly interconnected in gene networks. Moreover, we might expect that the genes already confirmed to show the signature phenotypes in both organisms (e.g., the intersection labeled by k in Figure 2.3A would be even more highly interconnected than the genes associated with the signature phenotype in only one organism; these latter genes might or might not belong to this subnetwork, as multiple mechanisms might give rise to the phenotype. Evidence in current gene networks of more linkages among the genes in each intersection would support this notion of phenologs recapitulating modular subnetworks.

We therefore systematically tested all significant phenologs involving

31 yeast and worm genes for the genes’ connectivity in available functional net- works [41]. We find the network connectivity of genes in phenolog intersections to be significantly higher (p < 0.00001; Wilcoxon signed-rank) than the phe- nolog genes outside of the intersections, which nonetheless show significantly higher network connectivity than random size-matched gene sets (p < 0.0001; see Figure S8 of [2]).

Additional tests confirm that genes in phenolog intersections are no more enriched for homologous genes than are phenolog genes outside of the intersections (see Figure S9 in [2]), ruling out trivial discovery of “deep par- alogs.”8 These observations indicate that phenologs identify evolutionarily con- served subnetworks of genes relevant to particular phenotypes or diseases, yet still predicting as yet undiscovered candidate genes significantly better than random expectation. Phenologs may inherently identify systems such as those found by aligning protein interaction networks across species [42]. Indeed, direct searches for evolutionarily conserved subnetworks composed largely of genes with similar phenotypes might provide an alternate strategy for phenolog discovery.

8The key word here is “trivial.” Part of the problem lies in the definition of “pheno- type.” Referring back to the example in the first chapter of this dissertation, it may be that rheumatoid arthritis and celiac disease are distinct diseases — or they may simply be variations of a single phenotype. In-paralogous phenotypes must exist, but where the line is drawn between paralogy and identity is subjective.

32 2.4 Conclusions

Phenologs reflect the innate modularity of gene systems and identify adaptive reuse of those systems, creating a rich framework for comparing mu- tational phenotypes with potential for finding non-obvious models of human disease. Cross-validated tests indicate phenologs show utility for roughly one- third to one-half of tested human genetic diseases (Figure 2.4). Given a phe- nolog for a human disease, any approach for associating more genes with the model organism trait (e.g., a genetic screen) will suggest additional new human disease gene candidates. In addition to associating unique genes with mod- eled diseases, such models can provide mechanistic understanding in simplified model organisms for understanding aspects of more complex human diseases.

Phenologs thus bridge the molecular definition of homologous and or- thologous genes [6] with classic definitions of homologous structures from Owen [43] and Darwin [5], deriving from considerations both of gene heredity and of the traits/structures affected by perturbing the genes, concepts falling within the general field of evolutionary developmental biology (evo–devo) [44]. The conserved gene systems revealed by the plant–vertebrate phenologs illus- trate a more ancient homology than the “deep homology” of metazoans that is currently a focus of evolutionary developmental biology [8]. These phenologs should bring attention to the potentially extensive molecular toolkit within the last common eukaryotic ancestor, which facilitated the parallel evolution of complex multicellular organisms. This comparative approach provides a simultaneously deeper and wider view of the evolution of life and points the

33 way to a greater synthesis of evolutionary developmental biology and modern medicine.

2.5 Materials and methods 2.5.1 Collection of phenotypes

We collected gene–phenotype associations from the literature for five species (worm, yeast, mouse, human, and Arabidopsis).

For human phenotypes, we used human diseases from the Online Mende- lian Inheritance in Man (OMIM) database [25], using the compressed OMIM disease categories previously described in McGary et al. [30], such that multi- ple variants of a disease were grouped together.9

Mouse gene–phenotype associations were downloaded from Mouse Gen- ome Informatics (MGI) [26].10 Gene–phenotype associations involving more than one locus or that could not be linked to an Entrez gene were removed.

MGI identifiers were converted to Entrez gene IDs using MGI_Coordinate.rpt (downloaded 25 April 2008). MGI mouse phenotype descriptions were from

VOC_MammalianPhenotype.rpt, downloaded 7 May 2008.11 MGI associations were supplemented with a small number of broadly defined mouse pheno- types,12 but which are ultimately derived from MGI data.

9For example, “Corneal dystrophy, hereditary polymorphous posterior” and “Corneal dystrophy, lattice type I,” reduce to a single category of “corneal dystrophies.” 10MGI_PhenoGenoMP.rpt, downloaded 21 April 2008. 11All MGI data were downloaded from ftp://ftp.informatics.jax.org/pub/reports/ index.html. 12http://hugheslab.met.utoronto.ca/supplementary-data/mouseFunc_I/MGI_

34 Worm gene–phenotype associations were assembled from the literature- reported RNAi studies assembled in Lee et al. [45] supplemented by additional phenotype data13 from WormBase 188 [27]. Worm gene–phenotype association data come from phenotype_association.Ws188.obo, phenotype descriptions from phenotype_ontology.WS188.obo, and gene information from geneIDs.

WS188.gz, accessed 26 March 2008. WormBase phenotypes were filtered for positive associations only. All allelic variants and RNAi data were reduced to gene–phenotype pairs. Gene IDs (e.g., WBGene00044645) were translated to sequence names (e.g., Y51H7BR.8) using geneIDs.WS188.gz. Of around 22,000 gene–phenotype pairs, 384 could not be linked to a sequence name. These derived primarily from uncloned genes and were omitted from further analysis.

Yeast gene–phenotype associations were obtained from McGary et al. [30],14 supplemented with associations from a recent set of genome-wide screens of drug sensitivity [31].15 All gene–phenotype associations from the drug screens were filtered using the authors’ recommended cutoff of p < 1 × 10−5.

I gathered Arabidopsis gene–phenotype associations from The Arabidop- sis Information Resource (TAIR)16 on 9 December 2008 [46]. Most gene on- phenotype.txt 13ftp://ftp.wormbase.org/pub/wormbase/acedb/WS188/ 14essentially a literature compilation plus Saccharomyces Genome Database (SGD) [28] 15homozygous and heterozygous screens, het.z_tdist_pval_nm.goodbatch.pub, hom.z_ tdist_pval_nm.pub, downloaded from http://chemogenomics.stanford.edu/supplements/ global/download/data/) 16ftp://ftp.arabidopsis.org/home/tair/Ontologies/Gene_Ontology/ATH_GO_GOSLIM. txt

35 tology (GO) terms are not phenotypes, so I retained only GO biological pro- cesses. I obtained a mapping between symbol and locus from TAIR17 on the same date. I extracted other symbol–locus mappings and a set of descriptions for the genes from the proteome file (see below). This mapping was used to convert from symbols to loci in the gene–phenotype association list. The gene– description mapping was enhanced by inclusion of alternate gene symbols and names.

To minimize the number of redundant comparisons performed, I com- bined those Arabidopsis phenotype pairs whose sets of associated genes over- lapped by greater than ninety percent, provided that the phenotypes each had more than one associated gene.

Kris McGary did the same for the other four organisms, but set his threshold differently. Phenotype-associated gene sets within a given organism were tested for significant overlap and non-redundant sets were selected for subsequent analyses. Within each organism, phenotypes were identified that reciprocally covered at least eighty percent of each other’s genes; for each such pair of phenotypes, only the phenotype with the greater number of genes was retained. (For example, in mouse, genes associated with defects in the small petrosal ganglion and small nodose ganglion overlap considerably. The former has nine associated genes, of which a subset of eight is also associated with the latter phenotype; only the former was retained.)18

17ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20080716 18Note that the standard collapsing procedure employed by McGary et al. presents an

36 Species URL Download Date Human ftp.ncbi.nih.gov/genomes/H_sapiens/protein/protein.fa.gz 7 Feb. 2008 Mouse ftp.ncbi.nih.gov/genomes/M_musculus/protein/protein.fa.gz 13 Oct. 2007 Worm ftp.wormbase.org/pub/wormbase/data_freezes/WS170/sequences/ 19 Feb. 2007 wormpep170.tar.gz Yeast genome-ftp.stanford.edu/pub/yeast/sequence/genomic_sequence/orf_ 19 Feb. 2007 protein/orf_trans.fasta.gz Arabidopsis ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR8_ 10 Dec. 2008 blastsets/TAIR8_pep_20080412 Table 2.2: Proteome FASTA sequence download locations and dates.

Additionally, for the purposes of calculating phenologs from mouse, worm, yeast, and humans, we considered only a subset of the gene–phenotype associations plotted in Figure 1.1, analyzing only those implicating single genes

(i.e., not genetic interactions or traits requiring simultaneous mutation of mul- tiple loci), and only those phenotypes in which a defect was observed (i.e., omitting genes associated with the phenotypes “normal,” “wild-type,”, “no effect,” or other such cases). All gene–phenotype sets are available from the supporting website, phenologs.org.

2.5.2 Orthologs 2.5.2.1 Proteomes

Orthologs between species were calculated using translated genomes in

Table 2.2.

For human and mouse proteomes, we analyzed only sequences with protein RefSeq identifiers (NP_ only). For humans, forty-three genes with- out gene IDs were removed (mostly hypothetical proteins). For mouse, three obstacle for the identification of in-paralogous phenotypes from our datasets.

37 proteins without current records were removed.

2.5.2.2 INPARANOID calculation

To identify orthologous genes in different species, orthologs were cal- culated using INPARANOID v. 1.35 [7] and BLASTALL v. 2.2.15, both with default parameters. All genes assigned as orthologs (strictly speaking, ortholog groups or orthogroups because of inclusion of in-paralogs) by INPARANOID were kept regardless of their INPARANOID score. Using orthogroups, rather than bidirectional best hits, captures the many-to-many relationships that ex- ist for gene duplicates found in more than one copy in one or both species.

To prevent isoform variations from resulting in skewed BLAST results, mouse and human sequences with the same Entrez gene ID but separate Ref- Seq IDs were treated separately in INPARANOID. Following INPARANOID analysis, orthologs sharing gene IDs were combined so that gene variants would be considered together in subsequent analyses.19

2.5.2.3 Calculation of phenologs

For each pair of species, we first converted gene–phenotype associa- tions to ortholog–phenotype associations using the orthologs calculated by INPARANOID. In cases where paralogous genes within an organism result in the same phenotype, multiple gene–phenotype associations thus collapse

19For example, if gene 1001 had three isoforms, they would be included as separate genes 1001-1, 1001-2, and 1001-3. Orthogroups containing these isoforms would be combined if necessary.

38 to a single ortholog–phenotype association, which eliminates artificial infla- tion of the significance of ortholog overlap. Second, we compared the set of orthologs associated with a given phenotype in one species (species 1) to the set of orthologs associated with a given phenotype in the second species (species 2), repeating this process for all pair-wise comparisons of phenotypes from species 1 and species 2. For each pair of phenotypes in which the ortholog sets overlapped (shared members), we calculated the probability of the overlap occurring by chance using the cumulative hypergeometric distribution, where N is the total number of orthologs shared between the two species; n and m are the number of orthologs linked to the species 1 and species 2 phenotypes, respectively; and c is the number of common orthologs (that is, those linked to both phenotypes). The probability is given by:

( )( ) min∑(m,n) m N−m k ( n)−k N (2.1) k=c n

The hypergeometric probability does not correct for multiple compar- isons, so we estimated the false-discovery rate (FDR) with an empirical per- mutation test [47]. We performed 1,000 random permutations of the ortholog– phenotype associations, for each permutation repeating the all-versus-all phe- notype comparison using ortholog set sizes identical to those associated with the actual phenotypes (i.e., shuffling ortholog identities on a per phenotype basis, thus maintaining the phenotype set size distribution). Significant phe- nologs were identified at a false discovery rate of 0.05 by ranking real and

39 permuted phenologs on the basis of the associated hypergeometric probabili- ties and selecting a threshold of probability where the proportion of permuted phenologs above the cutoff accounted for five percent of the phenologs.

2.5.2.4 Cross-validated prediction of disease genes20

For the set of human genetic diseases, I predicted specific genes as- sociated with each disease using ten-fold cross-validation, evaluating perfor- mance by standard receiver–operator characteristic (ROC) analysis (Figure 2.4). These tests employed an alternate formalism from that described above to discover significant phenologs, and I performed them as follows.

I generated a binary gene–disease association matrix for each species, where the columns represent phenotypes. The rows in the human (or pre- diction) matrix each represent a single human gene; a true value in cell (i, j) indicates an association has been observed between gene i and phenotype or disease j. Genes that have no identifiable orthologs in any species are excluded. False values in cells indicate that no association has been observed.

The rows in other species’ matrices (the source matrices) are also de- scribed in terms of human genes: if the human gene has no ortholog in that species, the row is absent; but if the human gene has one or more orthologs in that species, a single row represents the whole set of orthologs. The presence of a true value in cell (i, j) indicates that a species-specific ortholog of human

20This subsection describes an early (flawed) prototype of the formalism which is the focus of the next chapter.

40 gene i is observed as associated with species-specific phenotype j. False values indicate no observed association.

Phenologs correspond to mappings between a prediction matrix col- umn and the most similar source matrix columns. To compute inter-column distances, a submatrix of the prediction matrix is generated, its rows limited to those shared by the source matrix. Treating each phenotype or disease as a column vector, a distance is computed between each of the phenotypes in the source matrix and each of the diseases in the prediction matrix.

As for the calculation of phenologs described above, I defined our dis- tance function as the hypergeometric probability of observing c or more com- mon genes between source phenotype u and prediction disease v, with n total observations in one and m total observations in the other. The cardinality of the vectors u and v is n, the total number of human genes with orthologs in the source species. Thus, the probability is calculated as in the equation given above.

For each prediction disease v, we selected the source phenotype with the smallest distance as the top hit (best performing phenolog), then predicted genes’ associations with the human disease according to their associations (true or false) with the source phenotype.

I evaluated predictive accuracy by ten-fold cross-validation, omitting ten percent of the prediction matrix rows for each of ten successive tests, and only evaluating predictions on the withheld ten-percent test set of genes,

41 repeating for ten unique test sets, and measuring true and false-positive pre- diction rates using ROC analysis.

I observed that those phenologs ranked just below the best (smallest distance) hit often provided additional valuable information about a disease. One simple method for integrating predictions across phenologs is to combine information from the k nearest neighbors (the top hit would be k = 1). In some cases, distance to the kth neighbor is equal to that of additional neighbors, representing a tie; in which case we include all neighbors tied with item k.

I used a simple weighting scheme to integrate evidence from the k (and tied with kth) nearest neighbors, calculating a score for each human gene (row) as:

p(gene ∈ disease|k disease phenologs) = ∏k 1 − (1 − p(gene ∈ disease|phenolog i is correct) i=1 × p(phenolog i is correct) (2.2)

I define the probability that the phenolog is correct (the final term) as one minus the hypergeometric probability given previously. For the probability of the gene being associated with the disease given that the phenolog i is correct, we use the following empirical score: for a true source observation, as the ration of the phenolog intersection (the size of set u ∩ v, defined above) to the size of set u; for a false source observation, as zero. Thus, although observations are binary (true or false), predictions are represented by scores

42 (between 0 and 1) which are essentially weighted averages of the predictions of the k nearest orthologous phenotypes.

I calculated null distributions by repeating the cross-validated analysis with ten randomizations of the prediction matrix. Randomization was accom- plished by shuffling the true values in each prediction matrix column, to ensure that the phenotype gene set size distribution was maintained.

2.5.2.5 Tests of deep paralogy21

In principle, deep paralogs or gene families could be responsible for significant phenologs, rather than modules of non-sequence related genes. We reasoned that the deep paralog hypothesis predicts that the overlapping in- tersection of orthogroups involved in a phenolog should have more significant pair-wise BLAST E-values than the non-intersecting genes that are involved in the same phenotype.

To test this hypothesis, we compared genes in each phenolog set — either intersection (I) or unique (D1or2) (Figure S9 of [2], not included) in pair-wise fashion using default BLASTP settings to all other genes in the set. The most significant BLAST E-values were collected for each orthogroup pair in each set.22

21This usage of the term “deep paralogy” does not relate to in-paralogous phenotypes, which are a type of deep in-paralogy. This section deals with out-paralogy. 22BLAST E-values between genes within the same orthogroup were omitted, as genes from the same orthogroup do not contribute separately to calculating the phenolog. When multiple genes from the same orthogroup belong to a set, only the single most significant BLAST E-value to a gene outside the orthogroup was included for that orthogroup.

43 For the significant (5% FDR) phenologs of each species pair, we sepa- rately collected the BLAST values for the orthogroups in either the intersecting sets or the unique sets, comparing the E-value distributions on a species-by- species basis (Figure S9, McGary et al.). BLAST E-values less significant than 10−3 were truncated at 10−3; BLAST E-values more significant than 5×10−313 were truncated at 5 × 10−313.

44 Chapter 3

Ranking of predictions using multiple phenologs from a variety of species; and the addition of zebrafish, chicken, and new worm gene–phenotype association data

3.1 Introduction

In this chapter, I demonstrate the ranking of candidate genes for human diseases on the basis of phenologs drawn from multiple species. It is adapted from the paper “Prediction of gene–phenotype associations in humans, mice, and plants using phenologs,” an older draft of which appears in Martin Singh- Blom’s dissertation (Martin and I are joint first authors).

For this work, I created the naïve Bayes classifier, while Martin demon- strated the additive model. Each of us implemented the distance and weight- ing functions separately for our respective classifiers and compared results. We each worked on separate cross-validation schemes (ten-fold for me and three- fold for Martin). It eventually became clear that the gene–phenotype matrix framework had difficulties with predicting plant phenotypes and those from some other species as well; so I designed the orthogroup–phenotype frame- work and re-implemented both classifiers, all distance functions, and Martin’s cross-validation scheme. Using this new implementation, I ran the analyses

45 discussed in this chapter, and produced the original versions of the figures, which Martin edited.

It was Martin who first noticed the connection to expert recommenda- tion systems. E. coli gene–phenotype data were assembled by Jon Laurent; and except for Arabidopsis, which I gathered, data from the previous chapter were assembled by Kriston McGary (as discussed previously). I obtained and processed the gene–phenotype data for chicken and zebrafish, as well as the new nematode datasets.

I researched the gene–phenotype predictions for the selected examples.

I provided the initial draft of the paper, and Edward and Martin and I worked together to produce the final version.

3.2 Background

Computational prediction of complex phenotypes from underlying genes has largely involved increasingly complex in silico simulations of cells and cel- lular processes. Last year, for example, Karr et al. published a whole-cell computational model made up of twenty-eight sub-models, each a simulation of a specific cellular process [48]. Most methods are variations on flux–balance analysis for predicting metabolic phenotypes [49], in most cases including tran- scriptional regulatory information [50–53], and yield primarily quantitative data.

In contrast, a number of qualitative methods make use of guilt-by-

46 association in functional networks to predict gene–phenotype associations (re- viewed in [54, 55]). Like the quantitative methods, these network-based tech- niques are species-specific, though they may incorporate data from additional species. While quantitative methods are limited to unicellular organisms, or at least to unicellular phenotypes of multicellular organisms, the qualitative methods can provide insight into whole-organism traits.

In 2010, McGary et al. described a separate qualitative method which relies on orthology rather than gene networks. Specifically, human traits, dis- eases, and phenotypes may have orthologous properties in other organisms, and such properties — typically phenotypes — are identifiable based on or- thology of the underlying genes. Such orthologous phenotypes, or phenologs, can be used to predict novel disease-causing genes as in the manner sum- marized in Figure 3.1. For example, McGary et al. identified SEC23IP as a neural crest effector, potentially involved in Waardenburg syndrome, based on its association with negative gravitropism defects in Arabidopsis [2].

Phenologs are a natural extension of the concept of deep homology: as a bird’s wing and a human hand arose from a common ancestor structure with a common complement of genes and a similar developmental program [9], so also might less obviously related phenotypes derive from a common ancestor phenotype affiliated with an underlying conserved gene module. To take the above example of Waardenburg syndrome, certain mammalian neural crest defects and plant gravitropism defects share and partly arise from an ancient, highly conserved vesicle trafficking system.

47 Figure 3.1: Prediction of disease genes from orthologous phenotypes. (a) Two phenotypes are said to be orthologous (“phenologs”) if the sets of underlying genes for those phenotypes have a statistically significant intersec- tion, as determined using gene orthology. Statistical significance is calculated as the probability of seeing an intersection of v or greater given m genes with phenotype A and n with phenotype B, out of N total genes with orthologs in both species. Genes associated with A but not B are said to be predicted to be involved with B, and vice versa. McGary et al. observed that approximately v/m of the predictions tended to be true positives for B, and v/n to be true positives for A.(b) illustrates a validated example from McGary et al. pre- dicting genes involved in a human neural crest defect, Waardenburg syndrome, using the Arabidopsis negative gravitropism defect phenotype. In this example, the overlap between gene sets affiliated with Waardenburg and gravitropism is highly statistically significant (p ≤ 10−6). In the right-hand circle and intersec- tion, the human orthologs of the gravitropism genes are shown, for simplicity (VAM3 corresponding to STX7, STX12; SGR2 to DDHD2, SEC23IP; and GRV2 to DNAJC13). (c) In this paper, we extend the phenolog formalism to consider additional gene–phenotype associations from multiple model organ- isms to develop a quantitative ranking scheme for phenolog-based predictions. Those genes predicted by a single phenolog, as in (a), are weakly predicted for A; whereas those predicted by two phenologs are strongly predicted for A. In general, the addition of a third phenolog contributing to a predicted asso- ciation will cause that gene to be ranked higher than if only two phenologs predict it. However, not all phenologs are equal; phenologs derived from less similar gene sets exert less influence over predictions than phenotypes with highly overlapping sets of affiliated genes.

48 (a) (b)

human mouse human Arabidopsis negative trait A trait B Waardenburg gravitropism syndrome SNAI2 defective MITF predicted for PAX3 DDHD2/SEC23IP human trait A STX7/STX12 DNAJC13

(c) weakly predicted human for A trait A Arabidopsis trait B

strongly predicted for A

weakly predicted for A mouse trait C

49 We set out to improve upon the original phenolog algorithm, which relies on identifying pairs of matching phenotypes across species, with a goal of ranking candidate genes relevant to specific traits and diseases by way of an unsupervised search for similar phenotypes (Figure 3.1). We reasoned that gene–phenotype association predictions coming from multiple “nearby” (or high similarity) phenologs, preferably across multiple species, should provide more predictive power than those from single phenologs. Our method ranks candidate genes based on both the number and similarity of cognate pheno- types which involve those genes, which might be used as a prioritization for wet lab experiments (Figure 3.1C).

Additionally, we expanded upon the original phenolog study — which included gene–phenotype data from human, mouse, worm (C. elegans), baker’s yeast, and Arabidopsis thaliana — by adding data from chicken, zebrafish, and even E. coli, as well as additional human and worm datasets. We show that phenotype data may come from a variety of sources, including GO biological processes and gene tissue expression annotations, and that the integration of signal from multiple phenologs markedly improves the predictive power of the method.

A key advantage to a neighborhood-based approach for predicting gene– phenotype associations is the ease with which non-obvious — and thus inter- esting — biological stories may be teased out. We demonstrate the process with epilepsy, a human syndrome; mouse susceptibility to pharmacologically- induced seizures, a related phenotype, using only E. coli data; and atrial

50 fibrillation, the leading cause of arrhythmia in humans.

In addition to offering concrete predictions, we compared two classi- fiers for integrating phenologs (additive and naïve Bayes), across a variety of similarity or distance functions, and with different numbers of neighboring phenotypes (k). We also experimented with changing the weighting function used to assign prediction scores, and we tested two frameworks for translating gene–phenotype associations between species, evaluating all of these methods within a consistent cross-validation scheme.

3.3 Results and discussion 3.3.1 A matrix-based formalism for comparing gene–phenotype as- sociations between species

The phenolog approach, developed by McGary et al. in [2], identifies pairs of homologous phenotypes in different organisms by counting the overlap between the sets of genes associated with them. McGary et al. hypothesize that pairs of phenotypes with a greater than expected number of shared genes derive from a shared evolutionary past, and further hypothesize that genes as- sociated with one might therefore be good candidates for the other. To extend this conceptual framework to make predictions based on multiple phenotypes from multiple species, we developed a matrix-based formalism for integrating phenotypic information.

For a given species, the set of all gene–phenotype associations can be thought of as a matrix where rows correspond to genes and columns to pheno-

51 types, and the matrix has a 1 in position i, j if the ith gene has been observed to be associated with the jth phenotype. This formalism works well when all the genes and phenotypes studied are from the same organism, but leads to some complications when extended to pairs or groups of species.

In particular, since generating these gene–phenotype matrices involves translation via gene orthology, we investigated whether expansions and con- tractions of gene families (e.g., in Arabidopsis, which frequently has large par- alogous gene expansions relative to other eukaryotes) might produce enough noise to obscure signal from other gene–phenotype associations of interest.

In order to address this question, we developed two different frameworks within our matrix formalism for translating gene–phenotype associations be- tween species (Figure 3.2). In the first, the “gene-based” approach, we let the rows correspond to genes in the species that we wish to make predictions for, and translated the gene–phenotype interactions from a number of species by orthology. This gave us a number of species-specific gene–phenotype associa- tion matrices ΦS, for S ∈ {human, mouse, yeast, nematode, plant, zebrafish,

fly, chicken}, where each ΦS is defined by   1 if any ortholog in S of gene i (Φ ) = is associated with phenotype j, S ij  0 otherwise.

We used the INPARANOID algorithm to determine which genes in different organisms are orthologs of each other. The INPARANOID algorithm discovers orthology relationships in the form of orthogroups (Figure 3.2A). [7]

52 For the method described above, we simply translated other species’ gene–phenotype associations into the target (e.g., into human genes when predicting human gene–disease associations) gene–phenotype matrix by or- thogroup, and compared the phenotype columns in terms of human genes (as in Figure 3.2B–C).

This gene-based approach works very well for closely-related species, where genes often have one-to-one equivalents between species. However, when large orthogroups are involved, the predictive performance of this approach deteriorates.

To mitigate the decrease in performance caused by paralogous gene expansions we devised an “orthogroup-based” matrix framework, in which rows corresponded to INPARANOID orthogroups (Figure 3.2A and D) rather than actual genes (Figure 3.2B–C),   1 if any gene in the orthogroup i (Φ ) = is associated with phenotype j, S ij  0 otherwise. Notably, the use of orthogroups can dramatically simplify the relationships.

Consider orthogroup OA (Figure 3.2A): one gene from each species is involved in the phenolog; whereas in Figure 3.2B, a matrix with each row representing a human gene, a single mouse gene–phenotype association translates into three human gene associations because of the paralogous expansion of this gene fam- ily in humans. Similarly, in Figure 3.2C, in which each row is a mouse gene, a single human gene–phenotype association translates into two mouse gene associations, due to a mouse-specific paralogous expansion. In contrast, the

53 Figure 3.2: The matrix formalism for calculating phenolog overlaps is especially important when predicting between species where large gene fam- ily expansions have occurred since species divergence, such as between Ara- bidopsis and humans. The example uses human and mouse to illustrate the orthogroup-based matrix formalism. (a) Phenotype associations (colors) are plotted as graphs for genes from human (left nodes, sub-scripted h) and mouse (right nodes, sub-scripted m), showing genes’ orthology relationships (edges radiating from orthogroups — middle nodes, labeled O). The orthologies (from INPARANOID), are used to “translate” phenotype associations between species (in the case of the gene-based matrix framework in panels (b, c)) or into an intermediate collection of orthogroup–phenotype associations (for the orthogroup-based matrix framework in (d)). Orthogroup vertices (e.g., OA) ′ ′′ connect human and mouse orthologs (such as Ah, Ah, and Ah, which are par- alogs of one another relative to the human–mouse divergence, with Am and ′ Am. Red vertices within a species are genes associated with the phenotype of interest (ϕh for human and ϕm for the mouse phenotype); orthogroup colors reflect the species data. These associations can alternately be captured by rep- resenting the graphs as matrices (b–d), with bullets indicating an association between a given genetic element and a phenotype. Specifically, (b) and (c) represent the gene-based formalism, and (d) illustrates the orthogroup-based formalism. Human and mouse phenotype columns are indicated by ϕh and ϕm, respectively.

54 A h φm φh O A A m A m • A’h A’m A’m • • A’’h B m • • φh φm C m • O B A • • C’ B h m h B m A’h • D m • • O C C m A’’h • E m C h B h • • E’ C’m m C h • F m D h O D D h • • G m

D’ D D’ • • . h m h . D’’ • D’’ h h (c) By mouse gene E h

E h O E E m E’h F h φh φm E’h E’m G h A • • O F G’h B • • F F h m G’’h C • G’’’ D • • G h h . E G’ O . h G F G m (b) By human gene G’’h G . G’’’h .

(a) Orthogroups (d) By orthogroup

55 orthogroup-based matrix in Figure 3.2D permits a symmetric comparison of the phenotypes, reducing paralogs in each species to a single orthogroup. Fur- thermore, a hypergeometric CDF test of the intersection between phenotypes

ϕh and ϕm will produce different values for the matrices described in Figures

3.2B–C. The consequence of asymmetric distances is that ϕh may have ϕm as its closest neighbor when the search is performed in one direction, but ϕm may not have ϕh as its closest neighbor in the reverse search.

The orthogroup-based matrix (Figure 3.2D) has the advantage of pro- ducing consistent, symmetric similarity scores regardless of the direction of prediction; furthermore, these scores are not inflated by the co-occurrence of multiple phenotype observations in a single orthogroup. Unless otherwise noted, we use this framework for the analyses that follow.

3.3.2 Integrating information from multiple phenologs

Given this basic formalism — a matrix of gene–disease associations incorporating phenotypic data from multiple species — we next evaluated methods for ranking genes on the basis of their tendency to be involved in a phenotype of interest. In other words, we wanted to construct a set of predictions X for gene–phenotype associations such that Xij is higher for pairs where the gene i is actually associated with the phenotype j.

One way to incorporate information from multiple phenotypes is by measuring the similarity — in terms of associated genes — between pairs of phenotypes, and integrating the information from different phenotypes in such

56 a way such that more similar phenotypes get more weight than less similar phenotypes. We tested two different ways of integrating this information — one multiplicative naïve Bayes scheme, and one additive method.

The naïve Bayes scheme we used was first described in the original phenolog paper [2], and can be written as follows:

∏k Xij = P (gene i ∈ disease j|k phenologs) = 1 − (1 − fijlwjl) (3.1) l=1 where

fijl = P (gene i ∈ disease j|phenotypes j and l are phenologs) (3.2)

wjl = P (phenotypes j and l are phenologs) (3.3)

We tested a wide variety of measures for the weighting function wjl that cal- culates a similarity or distance between two sets. Pearson sample correlation is a particularly popular option for expert recommendation systems, such as those used in online retail for generating recommendations from past pur- chase history. McGary et al. used the hypergeometric CDF, which gives the probability of seeing an intersection of size v or greater between phenotypes containing m and n genetic elements, with N total elements in the species pair (Figure 3.1A–B).

For fijl we used v/n, the fraction of the number of genes common to both phenotypes j and l over the number of genes known to be involved in phenotype j, which empirically appears to be a good approximation of the

57 probability that a candidate gene from a single phenolog will turn out to be a true positive [2].

While the naïve Bayes method multiplies distances or similarities as if they were probabilities, for the additive method, Xij is calculated for each gene–phenotype pair (i, j) by taking the sum over all nearest neighbor pheno- types k, weighted by the similarity between phenotypes j and k, so ∑ T Xij = wjkΦik = (Φw )ij, (3.4) k where Φ is a phenotype matrix and w is a weight matrix of phenotype– phenotype similarity scores.

In addition to the hypergeometric CDF and Pearson sample correlation, we tested Euclidean distance, taxicab (Manhattan) distance, cosine distance and Tanimoto coefficient as measures of phenotypic similarity, both for find- ing the k nearest neighbors and as weighting functions. We expected that orthologous phenotypes from closely related species might show more similar gene sets than those from more distantly related species. In turn, various distance functions might account for this bias to a greater or lesser extent; we thus compared different distance functions using cross-validation, as noted later. Euclidean and Manhattan distance performed extremely poorly in the gene framework, using five-fold cross-validation, so we excluded them from analyses in the orthogroup framework. Overall, the Pearson coefficient and hypergeometric test appear to have the most power for identifying nearby predictive phenologs (Figure 3.3A).

58 We also repeated the analysis while varying the distance function (used for searching) and holding the recommendation function (w) the same, and vice versa (Figure 3.3B). Pearson sample correlation showed the best performance among the distance functions; however, we found that the hypergeometric CDF was the best weighting function for assigning prediction scores to genes.

We compared the naïve Bayes and additive classifiers, with the results shown in Figure 3.4. The performance in cross-validation is quite similar be- tween the two classifiers, with the best version of the naïve Bayes classifier (using Pearson sample correlation for distance and hypergeometric CDF for weighting) performing slightly better than the best additive one (using Pear- son sample correlation and cosine similarity). However, the additive classifier allows us to visualize and deconstruct the predictions into component pheno- types. We therefore chose to use the additive classifier for most predictions.

Varying the maximum number of neighbors (k) tends to affect lower- ordered predictions (e.g., the thousandth gene predicted for a disease) to a larger extent than top predictions. Figure 3.5 shows that even including the k = 5 nearest neighbors improves the results modestly — raising the number of diseases for which the withheld genes can be predicted at a top-100 median rank from around 50 to 80. Searching for the k = 40 nearest neighbors seems to offer no meaningful improvement over k = 10 at relevant ranks. Thus, while a higher value of k may not always provide the best predictor, it is more likely, on average, to be useful than only considering the single closest phenotype.

Some phenotypes were intrinsically unpredictable; notably, several of

59 Figure 3.3: Effect of distance measure choice for ordering and weight- ing phenotypes. Here we plot for how many diseases the median rank of the gene withheld during leave-one-out cross-validation stays at a certain level, using all available species, and integrating the results using the naïve Bayes scheme. In (a), we vary the distance and weighting function (using the same measure for both). In (b), we show the effect of varying the distance function independently from the weighting function. Here the first function in the leg- end is the distance function used for computing the k nearest neighbors, and the second is the weighting function wij from Equations 3.1 and 3.4. As can be seen from the figure, a good distance function has more effect on perfor- mance than a good weighting function, but that the results can be improved slightly by using a combination: hypergeometric for distance, and Pearson for integration.

60 0

50

100

Median rank 150

200 Tanimoto Cosine Pearson I Pearson II Hypergeometric 250 20 40 60 80 100 120 (a) Phenotypes (ordered by best median rank)

0

50

100

Median rank 150

200 Pearson – Hypergeometric Hypergeometric – Pearson Pearson – Pearson Hypergeometric – Hypergeometric

250 20 40 60 80 100 120 (b) Phenotypes (ordered by best median rank)

61 0

50

100

Median rank 150

200 Additive Pearson cos Naïve Bayes cos cos Naïve Bayes Pearson hypg Naïve Bayes hypg hypg Additive cos cos 250 20 40 60 80 100 120 Phenotypes (ordered by best median rank)

Figure 3.4: Predictive performance of the orthogroup-based matrix approach. Here we show a comparison of naïve Bayes and additive classifier predictions, which seem to have similar performance, using leave-one-out cross- validation. As in Figure 3.3B, the first function in the legend is the distance function used for computing the k nearest neighbors, and the second is the weighting function wij from Equations 3.1 and 3.4.

62 1 k=1 k=5 k=10 k=30 k=40 k=60 10 Median rank 100

1000 50 100 150 200 Phenotypes (ordered by best median rank)

Figure 3.5: Effect of k on predictiveness. Using the same cross-validation setup as in Figure 3.4, we compare different k-values in the neighborhood search for phenologs. Any k greater than 1 gives a great improvement in the high-precision regime. However, as k increases further, the improvements in the recovery affect successively less important ranks, with diminishing returns as k approaches 30.

63 these were revealed to be combinations of unrelated diseases that were over- collapsed into the same entity in the initial version of our OMIM database (two such examples were achromatopsia with achondroplasia, and the combination of all blood type genes), thus serving inadvertently as negative controls.

The best similarity functions produced highly correlated predictions. Further, the best predictions of the worst classifiers were highly correlated with the best predictions of the top-performing classifiers. We thus concluded that the potential benefits of a fusion or blending classifier, a model that draws the best characteristics from simple classifiers via optimization, would be modest at best. At worst, any improvement would be difficult to mea- sure; optimization of such a blending model would require an additional layer of cross-validation, and many phenotypes would need to be dropped due to relatively small associated gene sets.

While similarity functions produced remarkably similar results, predic- tions coming from different species were much less strongly correlated (Fig- ure 3.6), suggesting that weighting phenotypes by species in some manner may offer additional improvement.1 While each species provides uncorrelated prediction information, the human disease predictions are nearly always dom- inated by mouse whenever that species is included — likely because of the

1Indeed, this lack of correlation seems to hold for every pair of species; we looked at prediction of both human and Arabidopsis features from other species, and in each there is little, if any, correlation. The likely reason for this lack of correlation is that predictions are made on the basis of shared genes, and the shared genes tend to vary widely between pairs of species. In other words, genes shared in human–mouse tend not to be the same genes shared between human and plants.

64 highly correlated nature of the exploration of gene–phenotype associations in mouse and human.

Finally, we measured the extent to which our predictive performance was improved compared to random trials. To measure this enrichment, we gen- erated a series of random gene-based matrices. For each phenotype-column of cardinality p, we marked p randomly-drawn genes as observed. We attempted to predict phenotypes-of-interest from these randomized matrices using our regular classifiers (Figure 3.7). (Note that we did not repeat the randomized matrix control for the orthogroup-based matrices, primarily because random- ization of gene–phenotype associations eliminates the type of structure which made orthogroup-based matrices necessary.)

Importantly, we see a strong improvement in predictive performance on actual gene–phenotype associations as compared to randomized data. The method is able to recover all known genes for several real diseases — but is unable to recover withheld genes for any of the randomized diseases.

3.3.3 Epilepsy

In addition to evaluating our method’s overall performance, we wished to take a closer look at its prediction of individual diseases in our database. We chose epilepsy because, despite offering ostensibly correct predictions, it actually scores somewhat poorly in cross-validation. In our initial three-fold leave-one-out test, only one of the three separately withheld genes was recov- ered at a reasonably testable rank (twelve, in this case).

65 Figure 3.6: Contributions by individual species to the prioritization of candidate genes. (a) Each phenotype offers some sort of information for prediction of human disease genes. Mouse data seem to offer the most informa- tion about human diseases, as one would expect from the quality of the data and the proximity of the species in the phylogenetic tree. Arabidopsis, which is the furthest species from human in our database, unexpectedly provides as much information as mouse on top predictions, and is second at higher ranks. (b) This scatter plot, examining predictions of human gene–phenotype associ- ations, demonstrates that the information offered by each species (in this case mouse and Arabidopsis) is highly independent, and suggests that integrating data from multiple species may be useful.

66 (b) (a)

Median rank, k=40, only Arabidopsis features Median rank

1 10 100 1000 10000 250 200 150 100 50 0 1 20 Median rank, 10 Phenotypes (orderedbymedianrank) 40 k=40, onlymousefeatures 100 67 60 1000 80 G. gallus E. coli C. elegans A. thaliana D. rerio S. cerevisiae M. musculus 10000 100 1.0 1.0

0.8 0.1

0.6

0.01

0.4 Area under ROC

0.001 0.2 Real data Randomized data Area under Precision-Recall Curve (PRC)

0.0 0.0001 0 45 90 135 180 0 45 90 135 180 Phenotypes (sorted by area under ROC) Phenotypes (sorted by area under PRC) (a) (b)

Figure 3.7: Phenologs predict candidate genes substantially better than random. Shown are (a) ROC and (b) precision–recall plots for k = 100 naïve Bayes using the hypergeometric weighting function, predicting human (OMIM) gene–disease associations from human, mouse, worm, fruit fly, yeast, and plant gene–phenotype association data. We restrict the evaluation to only those phenotypes with four or more known genes. The solid line shows the actual data, and the dashed line shows the result on similarly sized random gene sets. Thus, integrating phenologs across multiple species successfully prioritizes candidate genes to an extent far greater than random chance.

68 Our method successfully identifies the genes GABBR1, GABBR2 [56], and KCNA1 [57], which were absent from our database but known to be as- sociated with the disorder. These were predicted primarily due to mouse phenotypes that resemble epilepsy (clonic seizures and abnormal brain wave pattern; Figures 3.8 and 3.9).

Top epilepsy predictions include PAX6, PRRX1, and RAX2 (of which

PAX6 has been associated with seizures); and PAX3, PAX7, HESX1, and

NKX2-1, NKX2-4, NKX2-6, and NKX2-8 (Figure 3.9). Notably, NKX2-1 is involved in mouse epilepsy [58], and PAX3 appears in a region linked to the human version of the disease [59]; neither of these genes were in our database.

Interestingly, these predictions come from the Arabidopsis phenotypes regulation of gene expression by genetic imprinting, cotyledon development, epidermal cell differentiation, and gene silencing by RNA, as well as the yeast phenotype annotation for sensitivity to trichlormethine (nitrogen mustard, or tris(2-chloroethyl)amine).

To learn more about the general predictability of the epilepsy pheno- type, we ran an expanded cross-validation, withholding each of the full set of

51 epilepsy genes in our database, and found that six genes could be predicted back — all within the top 120 ranks. We note that even when a phenotype performs poorly in cross-validation, it seems that our method still provides useful predictions.

We wanted to know the extent to which predictions could be attributed

69 Figure 3.8: A Venn diagram showing predictions for epilepsy based on the 40 most genetically similar phenotypes. The analysis is primar- ily derived from Arabidopsis, yeast, worm, and mouse, based on the Pearson sample correlation, and using cosine similarity as the weighting function. The twenty closest phenotypes are each displayed separately, and the remaining twenty are aggregated into the category “below top-20 phenotypes.” Paralogs are grouped together when they coincide at a prediction score. Genes in bold represent the orthogroups used in the search — that is, those groups of orthol- ogous genes where one or more paralog was already associated with epilepsy in our database. Colors correspond to those in Figure 3.9.

70                                                  

  

                                   

71 ARX

PAX6 · PRRX1 · RAX2 Gene silencing by RNA (At) Malate metabolic process (At) PAX3 · PAX7 · NKX2-1 · HESX1 Papuamide (Sc) NKX2-8 · NKX2-6 · NKX2-4 Pentose-phosphate shunt, oxidative branch (At) CHRNA4 Anthocyanin accum. in tissues in response to UV light (At) Epidermal cell differentiation (At) ME1 · ME3 · ME2 Fatty acid biosynthetic process (At) Trichlormethine (Sc) KCNMA1 Acetylcholinesterase inhibitor resistant (Ce) Binned Rank Body levamisole resistant (Ce) SLC25A12 · SLC25A13 · SLC25A18 · SLC25A22 Hypoosmotic shock hypersensitive (Ce) Aldicarb hypersensitive (Ce) CHRNA3 · CHRNA6 · CHRNA2 Cotyledon development (At) Abnormal brain wave pattern (Mm) SCN1A Regulation of gene expression by genetic imprinting (At) Clonic seizures (Mm) GABBR2 Borrelidin (Sc) below top-20 phenotypes 0.0 1.0 2.0 3.0 Prediction Score

Figure 3.9: Top candidate genes predicted for epilepsy. Each row of this chart represents a set of genes predicted with the same score. If a gene symbol is printed in bold, it or a member of its orthogroup is already known to be involved. Rows with plain-text labels are novel predictions. The depicted search makes predictions based on the k = 40 nearest neighbor phenotypes (from human, mouse, chicken, zebrafish, worm, yeast, and plant), and color codes the twenty nearest neighbor phenotypes’ contributions to each prediction (the remaining twenty-one are grouped in blue, as “below top-20 phenotypes”). The top scoring gene, ARX, is predicted primarily by Proud syndrome, hy- dranencephaly, and Partington’s syndrome, all of which are human diseases characterized partially by seizures; but information is also drawn from a vari- ety of plant phenotypes. These predictions were generated using an additive classifier for ease of visualization. The distance function is Pearson sample correlation, using cosine similarity as the weighting function w.

72 to paralogy (shared orthogroup membership) with genes already associated in our database with epilepsy. GABBR1 and GABBR2 are each singleton ortho- group members, and are thus independently predicted. KCNA1 and KCNA2 emerged as paralogs following the human–worm divergence, but are predicted from non-worm phenotypes — and are therefore also independent predictions.

PAX6’s plant–human paralogs make up the top three rank bins in Figure 3.9. We suggest that even non-independent predictions are of use, provided they are accompanied by independent predictions — since, as men- tioned, PAX3, NKX2-1, and PAX6 are all associated to some degree with seizures and/orthologous epilepsy. Indeed, the inclusion of species in which these genes are not paralogs offers additional resolution on predictions and demonstrates the utility of our method.

3.3.4 Predicting from E. coli — pharmacologically-induced seizures

We then turned to a similar mouse phenotype, pharmacologically in- duced seizures, to determine whether E. coli gene–phenotype associations could be used to make predictions about mammalian associations without ad- ditional information. We found that mouse genes linked to pharmacologically- induced seizures could be predicted extraordinarily well from E. coli alone in cross-validated tests: eight of the forty-eight genes associated with this mouse phenotype could be predicted back when withheld. These results are particu- larly impressive because they represent all six of the mouse–E. coli orthogroups associated with this seizure phenotype. Two of the orthogroups (Grik2/Grik5

73 and Slc1a2/Slc1a3) are in the top prediction ranking bin; additionally, Faim2 is in the top hundred ranks (Figure 3.10).

Next, we examined the predictions for promising new candidate genes.

One of the most intriguing candidates was α-adducin, which is known to be reduced in the brains of rats experiencing kainate-induced seizures [60].

Another interesting prediction is Sv2a (synaptic vesicle glycoprotein). It was recently reported that a mutation in chicken SV2A leads to photosensitive reflex epilepsy [61]. Mouse Sv2a is a known binding site for levetiracetam, an anti-epileptic drug [62], and Sv2a−/− mice experience seizures and die within three weeks of birth [63, 64].

We also examined the compounds associated with the source E. coli phenotypes to see if these were associated with seizures. The compounds in- cluded ethanol — alcohol poisoning and alcohol withdrawal symptoms include seizures — as well as paraquat, which causes seizures and brain damage in rats [65]; and aztreonam, which is a convulsant [66]. While many compounds might cause seizures if given in sufficient amounts, a control PubMed search for ten randomly chosen compounds associated with E. coli phenotypes in our database failed to turn up such clear associations.

Thus, both at the level of predicting candidate genes and affiliated compounds, the E. coli phenotypes appear to be relevant. Both of the genes discussed above may indeed represent reasonable new candidates for affect- ing pharmacologically-induced seizures and might warrant follow-up experi-

74 Gria1 · Gria2 · Gria4 · Grid1 · Grid2 · Grik1 · Grik3 · Grin1 · Grin2a · Grin2b · Grin2c · Grin2d Slc1a1 · Slc1a6 · Slc1a5 · Gria3 · Slc1a4 · Grik4 · Slc1a7 · Grik2 · Grik5 · Slc1a2 · Slc1a3

Grina · Tmbim4 · Tmbim1 · 4930511M11Rik · Faim2

Mfsd2a · Mfsd2b

Kl · Klb · Lct · Lctl

Slc15a1 · Slc15a2

Dmpk · Rock1 · Rock2 · Cdc42bpb · Cdc42bpa · Cit Binned Rank Upp1 · Upp2 Tobramycin (0.05 µg/ml) Ethanol (6.0%) Ercc2 · Brip1 · Rtel1 · Ddx11 Succinate (0.30%) Methotrexate (1.25 µg/ml) Add1 · Add2 · Add3 SDS (1.0%) + EDTA (0.5 mM) Spiramycin (20 µg/ml) Paraquat dichloride (10.0 µM) Adhfe1 Aztreonam (0.02 µg/ml) Phenazine methosulfate (0.05 mM) 0.0 0.5 1.0 1.5 2.0 Cholate (0.1%) Prediction Score

Figure 3.10: Predicting mouse seizure genes from E. coli phenotypes. These mouse phenotype predictions are constructed from the k = 10 near- est neighbor E. coli phenotypes, using no other species. Predicting eukary- otic phenotype-linked genes from a prokaryote is necessarily coarse-grained, due firstly to evolutionary expansions of ancestral orthologs into larger or- thogroups, and secondly to the tendency for some orthologs to vanish from certain species or become unrecognizable. Nevertheless, the probability of see- ing an intersection of six or more orthogroups by chance, such as that between sensitivity to tobramycin at 0.05 µg/ml and the seizure phenotype, is 1.7×10−4 (without correction for multiple testing).

75 ments. Finally, it is particularly striking that mammalian seizures — which are distinctly neurological phenomena — could be derived from processes so fundamental as to exist even in bacteria.

3.3.5 Atrial fibrillation

We looked next at the human heart phenotype atrial fibrillation (AF), expecting to find that AF, like pharmacologically-induced seizures, was rooted in highly-conserved signaling defects. Instead, we found that the most predic- tive phenotypes were from mouse and chicken — quite unlike epilepsy, for which plants, worms, and yeast offered a great deal more information than mouse or chicken.

The AF phenotype performed well cross-validation in the gene-based configuration method. However, in the orthogroup-based cross-validation, only three of the eight genes associated could be predicted after being withheld. The removed genes were predicted at ranks 3–4, 15–16, and 81–94. Nonetheless, the novel predictions for this phenotype are worth noting.

The top-ranked new prediction for atrial fibrillation (AF) is the his- tamine receptor H2 (HRH2), largely contributed by gastrointestinal phenologs in mouse (Figure 3.11). Histamine has been known to act on heart cadence for over a hundred years [67]. However, an empirical link between heart and gastrointestinal function was established by the recent observation that his- tamine increases the heart rate in pythons during digestion [68] — regulation which both occurs via the H2 receptor [69–71] and is apparently ubiquitous in

76 vertebrates.

Similarly predicted are ATP4A and, further down the list, ATP4B, which are the α and β subunits of the H+/K+ ATPase. This proton pump is responsible for gastric acid secretion during digestion.

A somewhat speculative connection is offered by recent work, which showed cigarette smoke extracts cause an increase in the amount of H+/K+

ATPase in the stomach [72]. It is unclear — and worth testing — whether

ATP4A and ATP4B are expressed in the heart. These genes could offer an additional route by which smoking contributes to heart problems.

Following HRH2 and ATP4A is HOPX, or homeodomain only protein x, which is down-regulated during heart failure in humans [73]. It is not clear that HOPX is involved in AF per se, but again worth exploring in future experiments, as is the gene ranked next, KCNE1, based on orthologous phe- notype prolonged QT interval — and seemingly also a factor in rare cases of atrial fibrillation [74–76].

GJA1 (gap junction protein, α1, also known as connexin 43 or Cx43) is one of the two most abundantly expressed connexins in the heart [77–79].

The other is GJA5 (connexin 40), already associated in our database with

AF. Cx40 and Cx43 seem to form heteromeric channels with different prop- erties from homomeric channels [80]. Cx43, unlike Cx40, is essential for heart development and cardiac impulse conductance in mice [81]. Tuomi et al. ob- served that a dominant negative Cx43 mutant causes severe AF [82]. Finally,

77 KCNQ1 NPPA

KCNE2 bifid atrial appendage (Mm) abnormal vascular resistance (Mm) GJA5 syndromic hearing impairment (Mm) HRH2 hypochlorhydria (Mm) increased circulating gastrin level (Mm) KCNE1 Torsades de pointes, drug-associated (3) (Hs) ATP4A abnormal canal morphology (Mm) Atria at stage 28 (Gg) POU4F3 abnormal frontal plane axis (Mm) pulmonary hypertension (Mm) SCN5A mitral valve stenosis (Mm)

Binned Rank ATP4B absent vestibular hair cell stereocilia (Mm) detached otolithic membrane (Mm) SLC9A4 abnormal renin activity (Mm) SST abnormal gastric gland (Mm) prolonged QT interval (Mm) HOPX enlarged stomach (Mm) increased circulating adrenaline level (Mm) GJA1 achlorhydria (Mm) SLC9A2 small scala media (Mm) below top-20 phenotypes MYH7B

0.0 2.0 4.0 6.0 8.0 Prediction Score

Figure 3.11: The top candidate genes predicted for atrial fibrillation. These predictions are constructed in the same manner as those in Figure 3.9. Limiting the search to k = 40 neighbors in this case means that all predictive phenotypes come from mouse and chicken, though other species were included in the analysis. Interestingly, few of the informative mouse and chicken phe- notypes are related to the heart in any obvious manner.

78 atrial fibrillation was observed in a somatic mutation in human GJA1 [83]. Notably, another channel similarly implicated was SCN5A (human cardiac sodium channel, voltage-gated, type V, α subunit). This sodium channel com- ponent has been associated with atrial fibrillation [84–86] but was missing from our database.

In terms of the predictor phenotypes themselves, the top AF phenologs can be grouped into three basic categories: cardiac, gastric, and auditory. We have explored the first two categories, but have not considered genes from the third. We note that while Jervell and Lange-Nielsen syndrome (i.e., long QT syndrome) has been associated with deafness for half a century [87–90] via alleles of KCNQ1 [91] and KCNE1 [92], other genes may yet be involved [93]. Further, Belmont et al. write of “a growing appreciation for conditions that affect hearing and which are accompanied by significant cardiovascular disorders” [94].

Given the success with which our method was able to predict AF genes — and with which it was able to identify potentially related disorders — exploration of additional candidates (e.g., ATP4A/B, POU4F3, and S1PR2) from Figure 3.11 may be warranted.

3.3.6 Plant phenotypes — response to vernalization

Finally, having focused primarily on predicting mammalian phenotype- and disease-genes, we asked whether plant gene–phenotype associations could be predicted from the other species in our database.

79 Plants represented a particular challenge, since a number of factors reduce the specificity of predictions for plant phenotypes. Firstly, while human phenotypes are predicted at least in part from other mammals and even other vertebrates — which are phylogenetically similar — there are no close neighbor species to Arabidopsis in our database.

Secondly, while 19,439 of the 28,002 human genes in our database have orthologs in other species, the ratio is less promising for A. thaliana phenolog predictions: 12,668 of 27,325 have orthologs. The cause is likely again the lack of other plants in our database, compared to the several vertebrates from which to draw information for H. sapiens.

Third and finally, the Arabidopsis genome contains a great deal of re- dundancy, as observed in [95]: 37.4% of proteins belong to families of more than five members, compared to 12.1% in fruit fly and 24.0% in worm. In pre- dictions that rely on gene orthology, as with phenologs, there is often no way to distinguish which of the plant paralogs is most relevant — except perhaps by relying on paralogous phenotypes.

Our orthogroup-based matrix formalism (Figure 3.2D) was thus crit- ical for addressing the extensive divergence of gene families between distant species. In particular and as previously described, when attempting to predict Arabidopsis phenotypes, we noticed that the gene-based formalism resulted in asymmetric scores and unwarranted improvements in rank of certain predic- tions, particularly those where large orthogroups were involved (Figure 3.2A– C). The genes-as-rows configuration also inflated performance, as measured

80 by ROC plots, during cross-validation — primarily due to the high frequency with which plant gene expansions co-participate in a biological process.

Given this formalism, we then determined phenotypes which may be predictable by cross-validating candidate genes produced from all non-plant species in the database (Figure 3.12), and found a large number (greater than

50) of the plant phenotypes to be reasonably well-predicted based on non-plant phenotypes. We describe final predictions for the response to vernalization phenotype (Figure 3.13).

We selected this phenotype because it scores better than most other plant phenotypes in cross-validation; seven of the fifteen genes in this plant phenotype can be predicted back at low rank when withheld, representing two or three orthogroups (about half of the total number of orthogroups) depending upon the source species considered.

Although we cannot easily cross-validate predictions from paralogous phenotypes, since they are not sufficiently independent, we speculate that the inclusion of paralogous phenotype data may help to improve the specificity of the predictions at no perceivable cost.

Among the new candidate genes implicated, two are particularly no- table. One of these, EMF2 — which appears to be associated with vernaliza- tion-mediated flowering by its interaction with CLF [96] — is predicted based on seemingly unrelated orthologous mouse and human phenotypes (abnormal chorion morphology and endometrial cancer, respectively). EMF2 is paral-

81 otueu niiulseisfrpeitn ln phenotypes. plant predicting for performance species except combined individual the species useful shows most all line using solid predictions red of The species). individual from Arabidopsis types. 3.12: Figure

Median rank 100 80 60 40 20 0 hsfiuemrosFgr .A u eosrtstepeito of prediction the demonstrates but 3.6A, Figure mirrors figure This hntpsfo niiulseis(ahrta ua diseases human than (rather species individual from phenotypes rdcigpromneo hnlg o ln pheno- plant for phenologs of performance Predicting 20 Phenotypes (orderedbymedianrank) 40 82 Arabidopsis 60 es per ob the be to appears Yeast . S. cerevisiae H. sapiens C. elegans G. gallus D. rerio E. coli M. musculus 80 AGL31

SVP · MAF4 · MAF1 · FLC · MAF3 · MAF5 SEP4 · AGL6 · SEP2 · AGL79 · AGL13 · AGL14 · AGL21 · AGL15 SEP1 · TT16 · AT5G51860 · AGL8 · AGL42 · SEP3 · CAL · AP1 AP3 · PI · GOA · AGL12

AGL44 · AGL17 · AGL16 · AGL18 · AGL24 · AGL71

SHP2 · AGL20 · SHP1 · STK · AG · AGL19

AGL30 · AGL65 · AGL94 GO:0009910: negative regulation of flower development (At) GO:0006349: regulation of gene expression by genetic imprinting (At) FWA abnormal sinus venosus (Mm) dilated dorsal aorta (Mm) HDG7 · HDG11 GO:0010628: positive regulation of gene expression (Dr) GO:0021782: glial cell development (Dr) Binned Rank TFL2 GO:0030241: skeletal muscle myosin thick filament assembly (Dr) VIP4 GO:0050936: xanthophore differentiation (Dr) GO:0010221: negative regulation of vernalization response (At) HDG3 · HDG1 · PDF2 · ANL2 · ATML1 · HDG12 · HDG2 abnormal branchial arch artery morphology (Mm) Ectoderm at stage 4 (Gg) HDG8 · HDG9 · HDG10 Kohler's Sickle (Gg) Epiblast at stage 2 (Gg) HDG4 · AT5G07260 · HB-7 abnormal endocardium morphology (Mm) Neural Plate/Tube at stage 44 (Gg) GO:0048701: embryonic cranial skeleton morphogenesis (Dr) abnormal sinoatrial node morphology (Mm) 0.0 2.0 4.0 6.0 below top-20 phenotypes Prediction Score

Figure 3.13: The top candidate genes predicted for Arabidopsis re- sponse to vernalization. Here, we demonstrate predictions for a plant phenotype, response to vernalization, while also demonstrating how includ- ing paralogous phenotypes may slightly enhance resolution. These predictions are drawn from phenotype data from each species in the database, with a neighborhood cutoff of k = 40. Due to the large gene expansions in plants, as well as the relatively large distance of Arabidopsis from other species in our database, paralogs are often ranked together. In the first two bins, a large gene expansion is split into separate ranks by information from an Arabidopsis phenotype (which is paralogous rather than orthologous). Those ranks labeled with green text include at least one previously known vernalization response gene (that is, a gene that was already linked with vernalization response in our database).

83 ogous with known vernalization gene VRN2; however, it is ranked ahead of VRN2 by its association with the related plant phenotype negative regulation of flower development. That EMF2 was boosted by a potential paralogous phenotype supports the hypothesis that paralogous phenotypes are similarly useful to orthologous phenotypes in predicting gene function.

The second interesting prediction is FWA and its several paralogs (HB- 7, HDG1–4, HDG7–12, PDF2, ANL2, ATML1, and AT5G07260). Certain FWA mutants produce a vernalization-insensitivity phenotype [97,98]. Candi- dates ANL2 and PDF2 both have late flowering phenotypes [99,100] markedly similar to that of FWA [101]. That discovery lends additional support for par- alogous phenotypes, as neither FWA nor PDF2 were associated in our database with regulation of flower development — but our method successfully identified negative regulation of flower development as a potential phenolog.

Empirically, we believe that the strength of our predictions for plant phenotypes are limited primarily by the quantity of gene–phenotype infor- mation available for plants. Notably, the addition of associations from other plant species would prove exceptionally useful for predicting not only Ara- bidopsis phenotypes, but crop species as well, and — as demonstrated earlier — even animal phenotypes.

3.3.7 Datasets

We sought to determine whether the improvement in our method over the original Phenologs method could be attributed in part to the additional

84 species information or arose exclusively from incorporating phenotypes beyond the nearest neighbor (as demonstrated in Figure 3.5).

We began by plotting the performance of those predictions drawn from the species used by McGary et al. (mouse, nematode, yeast, and plant, short- hand mcgary+green, notably including the additional phenotypes from Green et al.). As expected, we found that increasing k from 0 to 40 markedly im- proved the results (Figure 3.14).

Next, we tried adding the new species (chicken, E. coli, and zebrafish) individually, and found improvements at relevant ranks for E. coli and D. rerio.

Surprisingly, inclusion of chicken data hurt the predictive performance. We also tried adding E. coli and D. rerio at the same time, and saw an additional increase in performance.

Given the decrease in performance resulting from the inclusion of chicken phenotypes, we sought to determine whether the additional C. elegans datasets were negatively affecting performance. In Figure 3.14B, we tried leaving out the broad and specific components of the green dataset. We found that either component alone performed worse than the combination. We also tried leaving out the green datasets altogether, and found that their inclusion moderately improved performance at relevant ranks, but decreased performance beyond around rank 45.

These mixed results with the datasets are somewhat surprising. We expected that the in situ hybridization expression annotations from GEISHA

85 Figure 3.14: Measuring the effect of additional datasets on predictive performance. Here, we used our best classifier (naïve Bayes with Pearson sample correlation for a distance function, weighted by hypergeometric CDF), and subtract out datasets in order to determine their relative contributions. (a) demonstrates that for the original species used by McGary et al. (also in- cluding the new phenotypes from Green et al.), the k nearest neighbors method performs substantially better than the original Phenologs method (approxi- mated by k = 1, which includes not only the nearest phenotype but also any with the same distance). The datasets are labeled mcgary (mouse, worm, ne- matode, yeast, and plant), green (nematode), Dr for zebrafish, Ec for E. coli, and Gg for chicken. The best performing analysis was repeated (labeled “(1)” and “(2),” with different random test genes withheld) to demonstrate that performance is robust under cross-validation. (b) presents a test of whether specific phenotypes are more useful than broad phenotypes, by breaking down the green dataset into its components, green–specific and green–broad. We found that including both green datasets yielded the best results at relevant ranks, but that they both hurt results at less relevant ranks (beyond 45). Also shown is a comparison between the original datasets (mcgary alone) and the best-performing collection from (a), with all datasets except chicken (repre- sented by the solid cyan line).

86 (a) (b 0

50 ank ank Median r Median r 100 mcgary + green + Dr + Ec (1) mcgary + green + Dr + Ec (2) mcgary + green + Ec mcgary + green + Dr mcgary + green + Gg mcgary + green mcgary + green (k=1) 150 20 40 60 80 100 Phenotype (ordered by median rank) Phenotype (ordered by median rank)

(b) 0

50 ank Median r 100

mcgary + Dr + Ec mcgary + green + Dr + Ec (1) mcgary + green-specific + Dr + Ec mcgary + green-broad + Dr + Ec mcgary mcgary + green (k=1) 150 20 40 60 80 100 Phenotype (ordered by median rank) Phenotype (ordered by median rank)

87 would be useful for human predictions not only because gene expression stage and location should correlate highly with phenotype, but also because chicken — like mouse — is more closely related to human compared to other species in our database.

The distributions of genes per phenotype for human, mouse, Arabidop- sis, and chicken were similar (not pictured). Only E. coli differed substantially, with most phenotypes involving between 800 and 1,000 genes. However, in gen- eral, the counts of E. coli–human orthologs involved in bacterial phenotypes are much smaller due to the relatively small fraction of genes with orthologs between the two species.

3.4 Conclusions

In summary, we set out to improve upon the results of the original phenolog project by unifying information from a “neighborhood” of phenotypes surrounding the phenotype or disease of interest. Our method produces ranked predictions for a large percentage of human diseases in OMIM, as well as for plant biological process-based phenotypes.

Notably, we were able to demonstrate the correct prediction of at least one gene associated with the mouse phenotype pharmacologically-induced seizures using only phenologs from E. coli. While McGary et al. demonstrated the existence of deep homology between mice and single-celled eukaryotes, our work suggests that examples of deep homology exist — and may even offer useful predictions — between prokaryotes and eukaryotes.

88 We also demonstrate that the term “phenotype” may be interpreted broadly when incorporating gene-association data for phenolog-based predic- tions. Gene Ontology biological processes are one potential source. Another potential source is annotations for in situ hybridization experiments, such as GEISHA, but it may be necessary to refine such a phenotype database by hand.

Finally, we give a number of concrete gene predictions for the human diseases atrial fibrillation and epilepsy, and show how phenologs may be used to generate hypotheses and a biological context that correctly connect categories of diseases, such as disorders of the heart, stomach, and sensorineural system.

3.5 Methods 3.5.1 Cross-validation

For the gene-based matrix, we compared classifiers and metrics using n-fold cross-validation, and calculated receiver operating characteristic (ROC) and precision–recall curves for each disease or phenotype to be predicted. Clas- sifiers could be represented by arrays of area-under-the-curve measurements.

With the orthogroup-based matrix we chose a simpler and faster “leave- one-out” cross-validation scheme, where one observed gene association was hidden for each disease. Noting that some orthogroups have multiple genes associated with the same phenotype, we also hid any orthogroups associated with hidden genes. Since a gene may be part of one orthogroup for each species included in the search, we measured the rank of predicted genes rather than

89 predicted orthogroups. When multiple genes were predicted with the same score, the mean rank was used.

The leave-one-out procedure was repeated three times for each pheno- type, taking the median hidden gene rank to be representative of the classifier– phenotype performance.

3.5.2 Additional phenotype data

In addition to those databases described in [2], we incorporated or- thology and gene–phenotype data from a variety of additional species. We excluded any phenotypes with fewer than three associated genes.

Our choice of species was driven primarily by availability of data in a useful format, namely that phenotype annotations could be expressed quali- tatively, and that we we could link those annotations to a protein sequence; for example, we wished to incorporate phenotype data from the agricultural plant database Gramene, but most or all phenotype-associated genes lacked sequences.

Human diseases came from OMIM as for [2], but updated on August 17, 2011. Additional C. elegans phenotypes came from Green et al. [102] and were broken down into two datasets, green–broad and green–specific, according to the classifications given by the authors in the second supplemental table, “Broad Phenotypic Category” and the more specific subcategories into which they were divided. We chose to maintain the dual categorization primarily because a number of the more specific phenotypic classes were monogenic,

90 and hence would have been ignored altogether in our analysis.

E. coli phenotypes were taken on May 20, 2011, from the file coli_

FinalData2.txt [103]. Each gene’s phenomic profile was sorted by score, as- signing both the top and bottom forty conditions to the gene. Thus, each condition was considered to be a phenotype, and the genes associated with that phenotype were those genes whose growth was most affected — either positively or negatively — in the corresponding condition.

We considered fruit fly phenotypes from FlyBase [104]. Unfortunately, FlyBase’s dataset — while extensive — makes use of a controlled vocabulary optimized for manual searching rather than high-throughput analysis. The only way to connect a phenotypic class annotation to an anatomical location or developmental stage is by allele and literature reference — if these are given at all. We attempted to match anatomical annotations for mutant phenotypes to annotations from the phenotypic class ontology, joining on allele and pub- lication. While it was possible to predict some human diseases based on fruit fly phenotypes from FlyBase, the results were noisy and difficult to interpret, and we ultimately chose to exclude fruit fly results.

Zebrafish phenotypes consisted of gene ontology (GO) biological pro- cesses from ZFIN [105], keeping only those annotations with evidence types of IMP, IDA, IPI, IGI, TAS, NAS, IC, and IEP — the same procedure used for Arabidopsis phenotypes, obtained from TAIR [46]. These evidence types were selected so as to avoid the inclusion of annotations that originated directly from knowledge of other model organisms.

91 For chicken (Gallus gallus) phenotypes, we utilized in situ hybridization annotations from GEISHA [106], kindly provided in XML format on June 24, 2011. If there were more than fifty genes associated with a specific location and more than three at a specific stage at that location, a new phenotype was created (“anatomical location at stage x”); and regardless, each location became an independent phenotype. We defined phenotypes as gene–expression associations in specific anatomical locations. For those locations with more than fifty genes annotated, we created additional phenotypes for each stage with greater than three associated genes.

3.6 Acknowledgements

This work was supported by grants from the Texas Advanced Research Program, the National Science Foundation, the National Institutes of Health, U.S. Army Research (58343–MA), and the Welch Foundation (F–1515); a Packard Fellowship (to E.M.M.); and a National Science Foundation Graduate Research Fellowship (to J.O.W.).

92 Chapter 4

Improved prediction of phenologs using Boolean combinations of phenotypes

4.1 Introduction

The phenolog and deep homology hypotheses are based on the postulate that some or all aspects of the protein machinery of the cell may be concep- tualized as consisting of functional modules — parts in a complex machine which interact to make that machine work.

The modules are not atomic; they may be subdivided into additional modules down to the protein level.1 Thus modules may share components — whether those components are submodules consisting of multiple or single proteins.

Modularity need not be determined by physical interactions between proteins. For example, the host of proteins responsible for releasing dopamine from a neurotransmitter may never interact directly with receptors for that neurotransmitter on a separate cell; yet receptor and exocytosis docking appa- ratus, which are independent submodules, may be thought of as participating

1And perhaps beyond, in cases where a gene has multiple protein isoforms. For simplicity, however, we consider proteins to be atomic.

93 in the same module.

I hoped, when writing the k nearest neighbors search, that the process of incorporating data from multiple phenologs might reveal readily identifiable modules at the intersection of phenotypes drawn from multiple organisms. In this chapter, I demonstrate a much cleaner approach.

The method described in this chapter was originally conceived by Ed- ward Marcotte. I mentored Matthew Tien, an undergraduate researcher, who cleverly demonstrated a proof of concept in Python and Perl, using Kriston

McGary’s original phenolog software. However, Matthew graduated without any in-depth study of the results, and Kris’ software — written in Perl — was for some species pairs too slow for the requisite computations on the larger set of phenotypes produced by this method.

In the following pages, I describe the Boolean phenologs concept and its re-implementation.2 I use it to predict genes associated with oxidative stress-related apoptosis in breast cancer from zebrafish, and increased brown adipose tissue from plants. I demonstrate its utility in identifying a functional module relating to carcinogenesis — particularly recurrence of cancer — using breast cancer predictions from zebrafish.

2Some details of the implementation, in Ruby, are provided in Appendix B.

94 4.2 Boolean phenologs

Diseases may consist of multiple phenotypes; such diseases are called syndromes. I wondered if it might be helpful to divide phenotypes into what might be called subphenotypes. The original definition of “phenotype” requires that real phenotypes be observable, and in practice — except for diseases — phenotypes are atomic. Phenotypes describe one altered aspect of an organism.

So a subphenotype might be thought of as a functional module or submodule.

I hypothesized that if phenotypes reflect functional modules then it might be possible to identify such submodules by performing various math- ematical set operations (and, or, xor, not) on phenotypes whose associated gene sets overlap. Reasoning that or (union) and xor would produce super- modules rather than submodules, I elected to consider and (intersection) and not (subtraction) most carefully (Figure 4.1).

It is also possible to view intersection phenotypes as the automatically generated non-human equivalent of syndromes. If a syndrome consists of mul- tiple human phenotypes which tend to co-occur (since co-occurrence likely increases the odds of medical intervention or diagnosis), then an intersection phenotype, properly constructed, should consist of those genes with pleiotropic roles in the same pair of phenotypes. Understanding the intersection between

Boolean phenotypes which are phenologous with a human syndrome could fa- cilitate a clearer understanding of the component processes of that syndrome.

95 Figure 4.1: Process for the identification of Boolean phenologs. Prior to identifying Boolean phenologs, gene–phenotype associations are put into a matrix which represents phenotypes as rows and genes as columns (1 rep- resents an association, and 0 means none has been observed). The matrix is then translated based on the target species into an orthogroup–phenotype ma- trix, where the columns represent orthogroups between the target and source species. Phenotypes with exactly the same sets of orthogroups are merged so that every row is unique.

Within the resulting matrix, an all-versus-all search is performed, identifying all pairs of phenotypes which overlap. The selected operation (and or not) is carried out upon each overlapping pair. For the latter operation, it is carried out twice — e.g., A − B and B − A. If a resulting composite phenotype is the same as either of the inputs, or consists of fewer than three orthogroups, it is discarded. Phenotypes with the exact same orthogroup sets are merged, as before.

Lastly, phenologs are identified between the target species and the computed matrix. A filter is imposed, throwing away all phenologs with an intersection size of less than two. In addition, phenologs which predict no new genes (the intersection consists of the entire Boolean phenotype) are discarded. The or- thogroups within the Boolean phenotype but outside the overlap are translated into target-species gene identifiers, and are considered to be candidates.

96 Make boolean phenotypes by looking for overlapping phenotypes within a species

NOT AND

Find phenologs between new phenotypes and species-of-interest phenotypes

97 4.2.1 Correcting for multiple hypothesis testing

To address the problem of multiple hypothesis testing, I perform an empirical permutation test similar to that used in [2]. I experimented with two different schemes and found that they produced basically identical results.

In both schemes, the Boolean phenotype –orthogroup matrices from

Figure 4.1 are used. In one, a single permutation is generated for an entire matrix and each row (phenotype) has its contents rearranged according to that same permutation. In the other scheme, each row is permuted separately.

(Since the results were similar, I elected to use the former strategy, as it preserves the joint distribution of genes and phenotypes.)

In each scheme, phenologs are calculated for the matrix permutation; and the process is repeated 1,000 times. The resulting distributions are plotted as in Figure 4.2, and I calculate a positive predictive value [47] for every phenolog found as in McGary et al. (2010).

4.3 Results 4.3.1 Increased brown adipose tissue 4.3.1.1 Predictions from Arabidopsis

To demonstrate the power of the method, I picked the very first pheno- type prediction listed in the output of the very first test run (predicting mouse from Arabidopsis, and operation). This was the mouse phenotype increased brown adipose tissue amount, and its single nearest Boolean plant phenotype

98 1/10

1/100

1/1000

real 1/104

1/105

1/106 Fraction of phenologs (log scale) Fraction 1/107 random

1/108

1/109 ) ) ) ) ) ) 6 11 16 21 26 , 1/10 1, 1/10 [ 5 , 1/10 , 1/10 , 1/10 , 1/10 10 15 20 25 1/10 [ 1/10 1/10 1/10 1/10 [ [ [ [ Hypergeometric frequency of observing overlap by chance

Figure 4.2: Real and null distributions based on a permutation test of Boolean phenologs. The solid blue line represents the real distribution of p values between phenotypes in a target species and Boolean phenotypes in a source species. The null distribution is based on 1,000 independent runs, permuting each matrix as a unit rather than each row independently. The distributions shown are for predicting human from zebrafish using the not operation.

99 N=3,383 PPV=0.980

0.0176 response to salt stress increased brown adipose & 2.66E-5 tissue (BAT) response to sucrose 6 known stimulation 2.06E-4 3 predicted overlap 2

Figure 4.3: Mouse increased brown adipose tissue genes may be pre- dicted by Arabidopsis Boolean phenotype response to salt stress ∩ response to sucrose stimulus. This figure represents the first mouse result outputted by the search of Arabidopsis phenotype intersections. The given in- tersection is the single nearest neighbor. The probability of seeing such an overlap (or larger) by chance is p = 2.66 × 10−5 (FDR = 0.980), smaller by several orders of magnitude than the probability of seeing either of the individ- ual phenologs separately (p = 0.0176 for salt stress alone and 2.06 × 10−4 for sucrose stimulus, neither of which meets our significance threshold of 0.0001). The intersection size is two, with set sizes of six and five; thus, three or- thogroups are predicted for the increased brown adipose tissue phenotype. neighbor was the intersection of the plant GO biological processes response to salt stress and response to sucrose stimulus (Figure 4.3).

The most consequential difference between brown and white adipose tissue (BAT and WAT) is that the former dissipates energy as heat while the latter stores it. The uncoupling of mitochondrial oxidative phosphorylation in BAT is accomplished by thermogenesisUCP1), the molecular site of non- shivering thermogenesis [107]. Drug-induced uncoupling has been pursued as a treatment for obesity, sometimes with lethal consequences [108,109]. A better

100 understanding of BAT versus WAT physiology could be useful for addressing the obesity epidemic.

The three genes predicted for increased BAT were Pgd (6-phospho- gluconate dehydrogenase), Psmd4 (part of the 26S proteasome), and a large orthogroup of homeobox proteins (including Pitx1–3, Isx, Pax2–8, Rax, Alx3, Esx1, Crx, Otx1/2, Phox2a/b, and Sebox).

Phosphogluconate dehydrogenase is a lipogenic enzyme in the pentose phosphate pathway, and seems to be expressed in WAT, BAT, and liver, with activity varying according to sex and tissue [110]. Pankiewicz et al. also found that oestradiol regulates liver Pgd expression; whereas Puerta et al. found that oestradiol decreased BAT thermogenesis only in cold-acclimated rats [111]. While the full story would require a full literature review to elucidate, Pgd seems to be an interesting candidate.

The case for Psmd4 is only slightly clearer. UCP1 ubiquitinylation is associated with BAT in cold-acclimated animals, and ubiquitinylation seems to control the rate of UCP1 turnover by the proteasome [112]. It is therefore possible that Psmd4 plays a direct role in BAT thermogenesis.

Homeobox genes are involved in adipogenesis, but it’s unclear whether any of the candidates play a role; these two may be worthy of additional exploration.

In short, the candidate genes predicted from plant may be worthy of additional exploration.

101 4.3.2 Set difference: methylmercury and breast cancer

Next, I looked at Boolean combinations consisting of set differences, reasoning that these would identify submodules rather than the modules in which those submodules play a role.

Perhaps the most striking example of the applicability of this technique, and of Boolean phenologs in general, is the case of human breast cancer being predicted from response to methylmercury less response to metal ion (Fig- ure 4.4). I expected that the genes predicted would by DNA repair-related, as with many cancer phenologs.3 Instead, these genes highlight two other pathways by which organisms (or individual cells) suppress cancer growth: apoptosis and oxidative stress response.

Methylmercury (MeHg+ or just MeHg) is an organometallic cation, so this appears at first glance to be a strange combination. However, the response to methylmercury GO annotation is a child node of response to organic substance in the tree, not response to metal ion. A just-published article by McElwee et al. examines the effects of organic (MeHgCl) and inorganic mercury (HgCl2) on C. elegans, finding eighteen genes which were important to mercurial exposure response — and only two which responded to both types of mercury [113]. The mechanisms for mercury toxicity are incompletely understood, but it seems clear that even if the two mechanisms are the same

3For example, breast cancer has a phenolog in plants with the intersection of DNA repair and response to gamma radiation, p = 4.70 × 10−5, with two predicted orthogroups: ATR and ERCC6, which both appear to be DNA repair genes.

102 or similar (as argued by Clarkson et al. [114]), the organismal and cellular responses differ. Thus, this phenolog is in agreement with the findings of McElwee et al. [113].

Five genes are predicted: MT-COI, JUN (c-Jun), SOD2, GADD45B, and BAX. All of these genes — MT-COI especially — appear to be involved in the apoptotic response to reactive oxygen species (ROS). The mechanisms by which methylmercury generates ROS are not entirely clear, as previously mentioned, but are reviewed by Farina et al. [115]. Generally, MeHg+ forms a complex with lower-weight thiol and selenol groups, especially glutathione (re- viewed separately in [116]). When such groups occur in mitochondrial creatine kinase or respiratory chain proteins, the compound can inhibit mitochondrial function, leading to depolarization of that organelle’s membrane and overpro- duction of ROS [115].

Without any one of the predicted genes, the effects of ROS are mag- nified. Mitochondria rely on SOD2 for conversion of extremely toxic superox- ide into hydrogen peroxide (which can be further processed by catalase) and molecular oxygen. Several of the genes (JUN, GADD45B, and BAX) play well-characterized roles in stress-induced apoptosis, triggered in the event the cell is overwhelmed by free radicals (or other agents which cause damage).

MT-COI, part of complex IV of the oxidative phosphorylation pathway, is the site of an extremely common germ line mutation in cancer patients, which seems to predispose those individuals toward developing cancer [117]. This may indicate a role for this cytochrome C oxidase component in the

103 pro-apoptotic pathway.

Induction of apoptosis is a key route by which chemotherapy targets cancers — and by which cancers circumvent chemotherapy. The roles of Bcl-2 and Bax are reviewed in [118]. Notably, Bcl-2 binds Bax (the protein products of BCL-2 and our prediction BAX, respectively), and BCL-2 over-expression confers chemotherapy resistance.4 When Bcl-2 is low or absent, however, Bax homodimerizes, leading to cell death. Bax is also thought to interact with the mitochondrial voltage-dependent ion channel, and is directly induced by p53 in response to DNA damage.

Interestingly, a model already exists that might explain these breast cancer gene predictions. Martinez-Outschoorn et al. proposed, in 2010, the “autophagic tumor stroma model of cancer.” Essentially, the model hypoth- esized that tumors induce oxidative stress in the tumor microenvironment in order to cause stromal cells to release nutrients which are used for can- cer growth [119, 120]. Indeed, Trimmer et al. found that loss of Caveolin-1 (Cav-1) — whose stromal presence is a strong predictor of survival — dra- matically promotes breast cancer growth. Loss of Cav-1 may be rescued by over-expression of SOD2, another tumor suppressor, which relieves oxidative stress by processing mitochondrial superoxide radicals. [120]

GADD45B also plays a pro-apoptotic role, and was found to be down- regulated in at least two cases of hepatocellular carcinomas (reviewed in [121]).

4Chemotherapy resistance here and in the literature means that the cells fail to undergo apoptosis in the presence of damage, not that the chemotherapy fails to cause damage.

104 N=12,270 PPV=0.992

9.97E-5 response to methylmercury breast cancer NOT 5.83E-5 response to metal ion 21 known unity 5 predicted overlap 2

Figure 4.4: Breast cancer is predicted from genes involved in ze- brafish methylmercury response but not response to metal ion (p = 5.83 × 10−5, PPV = 0.992). Five genes are predicted for breast can- cer. The phenolog overlap was of size two, and nineteen known breast cancer genes were missed. The probability of seeing an overlap between breast cancer and defective methylmercury response alone is 9.97 × 10−5.

Both GADD45B and to a lesser extent SOD2 were observed to be up-regulated in inflammatory breast cancer (IBC); while the authors suggest that these genes contribute to the aggressiveness of IBC [122], another possibility is that up-regulation of SOD2 and GADD45B is part of a transcriptional program which protects against cancer.

While this model offers no especially novel predictions for breast cancer, it does suggest possible further investigations for MT-COI’s role in apoptosis and oxidative stress response — and hints that the predicted genes may be involved in a tumor suppression transcriptional program.

105 4.3.3 Myopathy from Arabidopsis

Myopathy5 is a broad category of disease categorized by muscular weak- ness with twenty-one associated human genes; three of these genes (DYSF, ACTA1, and FHL1) are members of human–plant orthogroups.

This phenolog, response to red light less response to water deprivation, predicts the involvement of five orthogroups in myopathy. The first of these orthogroups is comprised of members of the SWI/SNF complex and Mediator, including MED25, ARID1A (BAF250A), ARID1B (BAF250B), and PTOV1. MED25 has been observed in a family with Charcot–Marie–Tooth (CMT) syndrome, including childhood onset distal muscle weakness [123].

ARID1A and ARID1B are both muscle-related. In knockouts of the former, relatively fewer skeletal muscle cells differentiate from embryonic stem cells compared to other differentiation products [124]. ARID1B is associated with Coffin–Siris syndrome, which includes hypotonia among its symptoms [125].

In the second orthogroup, two of the three genes are involved in muscle. CREM is expressed in ventricular myocytes [126] and ATF1 is a hypoxia- responsive transcriptional activator of skeletal muscle via mitochondrial UCP3 [127]; the third, CREB1, is expressed in fibroblasts and not in myocytes [126].

5There are two human myopathy phenotypes in our database. The first includes two genes and consists of myopathy (1) due to CPT II deficiency, (2) due to phosphoglycerate mutase deficiency, and (3) with exercise intolerance (Swedish type). The second, discussed in this section, is collapsed from the remaining myopathy annotations in OMIM.

106 The third set of predictions is TFEB, TFEC, TFE3, and MITF. TFEC is a transcriptional activator of non-muscle myosin, and so is ruled out — but is nevertheless an interesting prediction. Over-expression of TFEB re- lieves Pompe disease, a disability of heart and skeletal muscles, by stimulat- ing autophagy [128]. MITF has a known autophagy role resembling that of TFEB [129] and is involved in a variety of developmental processes, but no evidence exists suggesting it or TFE3 are involved in myopathy.

An autophagy link for these predictions is further supported by the fourth predicted orthogroup, consisting of a number of cytochrome P450 (fam- ily 2) genes: CYP2A6–17/13, CYP2B6, CYP2C8–9/18–19, and CYP2D6/E1/ F1/J2/S1/U1/R1/W1. Cytochrome P450 is the site of a number of drug inter- actions — notably with grapefruit, cranberry, and pomegranate juice, which inhibit CYP3A4, a metabolizer of statins. CYP2C8, CYP2C9, and CYP2C19 are involved in various statin-induced myopathies [130–132]. At least with the last of these, the mechanism is autophagy related [133]. Similarly, predicted gene CYP2D6 increased statin efficacy and is a predicted drug interaction site with 3A4 [134]. Finally, CYP2E1 metabolizes ethanol — which also causes myopathy — and inhibits autophagy [135–137].

Two genes for which I found no literature support are FAM50A and FAM50B, predicted as a single orthogroup; neither appears to be particularly well researched. These may be good candidates for autophagy genes.

There are at least two other Boolean phenologs of myopathy (both intersectional rather than subtractive) at p = 1.57 × 10−6 (PPV = 0.987).

107 N=3,383 PPV=0.991

2.4E-5 defective response to red light myopathy NOT 1.1E-5 3 known defective response to water deprivation unity 5 predicted overlap 2

Figure 4.5: Myopathy is predicted from plant genes involved in red light response but not response to water deprivation. Among 3,383 human–plant orthogroups, three are involved in myopathy; five in the Boolean phenolog defective response to red light less defective response to water depri- vation; and two in the intersection (p = 1.10 × 10−5, PPV = 0.987). The sub-phenolog myopathy / red light has a p value of 2.36 × 10−5.

The first of these is response to light stimulus ∩ response to red light (1.97 × 10−4 and 2.36 × 10−5 for the individual components), which predicts the cytochrome P450 orthogroup. The second is response to auxin stimulus ∩ response to light stimulus (the former component has p = 3.7 × 10−4), pre- dicting GHDC. I found no literature support for GHDC, a gene about which little is known; it may be a good candidate for myopathy and autophagy.

4.3.4 Holoprosencephaly

To demonstrate predictions from multiple phenologs (as discussed in the previous chapter), and to show the utility of intersection phenologs, I looked for a disease with many phenologs among the and (intersection) pseudo- phenotypes.

108 The concept of k nearest neighbors-based ranking discussed in the pre- vious chapter is analogous to the k = 1 case of Boolean or (set union) phe- nologs. I present here an example of and (set intersection) phenologs for k = 1, with candidate genes for holoprosencephaly (HPE) predicted from D. rerio in- tersection phenologs (see Table 4.1). In this case, many phenologs fall in the k = 1 bin by p value, and predict a variety of genes (see Table 4.2 for kNN- based rankings). It may also be worth noting that no single component’s p value is better than the Boolean p value.

The most highly predicted gene is SMO (Smoothened), which is indeed an HPE gene — as determined by Rosenfeld et al. in 2010 [138], subsequent to the creation of our database. Rosenfeld et al. also identified DISP1, ranked third. NODAL, one of the bottom-ranked genes, has been observed as promot- ing an HPE-like phenotype in chick embryos [139] and mice (reviewed in [140]). Another gene ranked with NODAL, LHX2 (Lim1/Lhx2), is required for mouse head formation [141] and seems to regulate the development of the midline of the brain in that species [142].

The second-ranked orthogroup, consisting of SCUBE1 and SCUBE3, is suggested for HPE candidacy by its role in mouse brain formation (specifically SCUBE1) [143], but does not appear to be directly associated with HPE.

4.4 Discussion

Boolean phenologs offer a substantial improvement over the k nearest neighbors approach described in the previous chapter [144]. The basis for the

109 Table 4.1: Holoprosencephaly genes are predicted by many intersec- tion phenologs at the same p value. I include holoprosencephaly as an example because it has many Boolean and phenologs from zebrafish at the same p value (5.98 × 10−7, which is the lowest p value meeting the filtering criteria). Phenotypes A and B are arbitrarily labeled and indistinguishable overall (but not within individual pairs of table rules). Consider the first entry: brain development and axon guidance predicts gene LHX2; but brain devel- opment and muscle organ development predicts PRKCI. Nineteen different Boolean phenologs predict Smoothened. Rows are grouped first by candidate genes, but also by phenotype A when candidate genes differ. PPV = 0.999.

110 phenotype A phenotype B candidates axon guidance LHX2 brain development muscle organ development PRKCI floor plate formation NODAL dorsal/ventral pattern formation adenohypophysis development FGF3 nervous system development NEO1 somitogenesis endocrine pancreas development WNT5A/B muscle cell fate specification muscle organ development blood circulation somitogenesis SCUBE1/3 blood circulation somitogenesis muscle cell fate specification blood circulation adenohypophysis development DISP1 muscle cell fate specification axon guidance blood circulation CXCL12 striated muscle cell development nervous system development floor plate formation axon guidance muscle organ development endocrine pancreas development embryonic digestive tract morphogenesis floor plate formation muscle organ development nervous system development endocrine pancreas development embryonic digestive tract morphogenesis SMO muscle organ development floor plate formation adenohypophysis development embryonic digestive tract morphogenesis endocrine pancreas development muscle organ development adenohypophysis development embryonic digestive tract morphogenesis adenohypophysis development endocrine pancreas development embryonic digestive tract morphogenesis adenohypophysis development embryonic digestive tract morphogenesis diencephalon development neural plate morphogenesis

111 orthogroup members magnitude SMO 19 SCUBE1/3 5 DISP1 2 CXCL12 2 NEO1 1 WNT5A/B 1 FGF3 1 NODAL 1 LHX2 1 PRKCI 1

Table 4.2: Holoprosencephaly candidate genes may be ranked by evi- dence. I ranked by hand the genes suggested for testing from the phenotypes in Table 4.1 as they would be ranked by either the naïve Bayes or additive classifiers in the previous Chapter. I included the magnitude (the number of times the gene is predicted by a phenolog in Table 4.1). Genes within ranks are predicted with the same score (e.g., NEO1 and LHX2) and are thus equally likely under this model). The best predicted gene is SMO, also known as Smoothened.

112 improvement is not entirely clear, but may be partially subjective: I selected preferentially those Boolean phenologs where the p value is less than both the components. Notably, for the not phenologs, all Boolean p values are more significant at our maximum p threshold6 of 1 × 10−4.

A minor contributor is the way in which the genes in the single-species matrices are translated into orthogroups. For example, plant phenotypes reg- ulation of telomere maintenance and regulation of chromosome organization have differing sets of genes; but when the genes are translated into human– plant orthogroups, and those without orthologs are removed, the two pheno- types collapse into one.7

The filtering procedure may also play a role. The original phenolog method looked for any intersection at all; I require an intersection of size two or larger, and additionally that at least one orthogroup is predicted. For and phenotypes, these last two criteria imply that the intersection between the Boolean components must be at least size three.

The probability of finding such a three-way intersection by chance is quite low, and would ordinarily be described by the multivariate hypergeo- metric distribution (thus, the given probabilities may be conservative over- estimates). The frequency of the three-way intersections — higher than would

6This property is made more likely, but not guaranteed, by the mathematics of set intersection probabilities. If the subtracted phenotype intersects with the query, the Boolean phenotype will have a smaller intersection as a result of the subtraction; and thus the p value is more likely to rise above our threshold. The p value of two sets with no intersection is within ϵ of unity. 7These are in-paralogs in A. thaliana with respect to H. sapiens.

113 be expected at random — is a product of the way in which gene–phenotype associations are discovered by biologists.

It follows, then, that the standard hypergeometric distribution may not be conservative enough for phenologs that are “circular”. I define circu- lar by way of an example. Bardet–Biedl syndrome candidate genes which were identified subsequent to the assembly of our gene–phenotype associ- ation database are perfectly predicted by the zebrafish Boolean phenotype melanosome transport less embryonic specification defect (p = 2.06 × 10−25). However, upon review of the literature on the known genes, I discovered that zebrafish melanosome transport is studied at least in part for its role in Bardet– Biedl syndrome. As such, melanosome transport-associated genes at the time of database construction were already only one patient validation away from being Bardet–Biedl-associated genes.

An additional concern is over-training. One might argue, for example, that by subtracting every phenotype from every other phenotype, and then looking for phenologs, one is simply trimming away some of the less imme- diately relevant portions of phenotype gene sets. However, less relevant here could also be read as less studied, which is a contradiction: the genes are present in both phenotypes precisely because they have been well-studied for those characteristics. Even so, it seems to be a moot point. The predictions are either worth testing or they aren’t. They are either circular or they aren’t, which can be determined via literature review. Boolean phenologs simply pro- vide good hints about which genes and processes should be studied for a deep

114 homology role. Some genes may be missed, but the advantage is an observably lower false positive rate.

A final issue is confirmation bias, particularly when pursuing literature validation. Ioannidis argued in 2005 that most published research findings are false positives, in part due to the failure of researchers to share negative re- sults (which are more likely to be correct) [145]. Indeed, our gene–phenotype association matrices only contain “positive” and “unobserved,” and lack neg- atives. Negatives would be extremely useful for evaluation of results as well. Particularly when predicting extremely well-studied diseases like cancer, it’s unlikely that one will find a well-characterized gene which is not in some way associated if one looks long enough. One approach is to silently insert an additional random prediction in some set fraction of phenologs selected for lit- erature validation, and then determine how frequently the random prediction is marked as true by the researcher.8

4.5 Conclusion

Boolean phenologs combine the best elements from previous work by our lab on phenologs [2, 144] to offer a clearer picture of the processes which underlie diseases and phenotypes, as well as the non-obvious homologies which exist between organisms. Furthermore, they offer support for my overarching

8This exact thing happened for the breast cancer phenolog as a result of a bug in my code. SHOX2 was predicted, and the literature suggested that it was involved in cancer. Despite the gene being well-characterized, there was no observation of association with apoptosis or oxidative stress.

115 hypothesis: that deep homology extends beyond structure.

Unfortunately, phenologs are limited. They offer insight only into those elements of diseases, traits, or processes which are conserved and well-studied. A human–zebrafish phenolog is informative about breast cancer only insofar as cancer is affected by well-characterized processes shared between the two organisms. Phenologs will never tell us about the uniquely human compo- nents of breast cancer, but can give us information about oxidative stress and apoptosis, or about DNA repair genes.

In this chapter, I presented Boolean phenolog models for human dis- eases such as myopathy and breast cancer, as well as increased brown adipose tissue in mice. I demonstrate a number of predictions worthy of further testing, including FAM50A and FAM50B for autophagy and MT-COI for oxidative stress and apoptosis in cancer.

4.6 Availability of source code and data

Gene–phenotype associations are available at the phenologs website, phenologs.org. The source code and documentation are available on GitHub at http://github.com/marcottelab/boolean.

116 Chapter 5

Some concluding remarks

5.1 Phenologs

Deep homology, as defined by Shubin et al., is a model of evolution and development which explains how the apparently independent evolution of complex structures in multiple distantly related species is an emergent prop- erty of a single earlier adaptation [8,9]. The evolutionary innovation is not the eye itself — which has evolved several dozen times, Shubin argues — but the conserved underlying program which facilitated the vision adaptation in all of those species, and which is quite likely a unique innovation.

Not only do phenologs offer a way to identify non-obvious cases of deep homology, but they provide a strategy for quantifying it. The claim that the genetic program underlying vision is a unique adaptation is a claim for which, previously, no evidence could be gathered without a time machine. With phenologs, however, a p value can be assigned to that claim. Moreover, since the phenolog strategy predicts gene–phenotype associations on the basis of that p value, that quantifiable hypothesis can be tested both through literature validation and experimentally, it is possible to develop an argument for or against the previously untestable hypothesis that the genetic program evolved

117 but once.

The use of model organisms to learn about human disease is not new. Mice and rats would not be such a common model system if scientists had not already recognized their utility in understanding obviously parallel processes in humans. But phenologs have provided a method for identifying non-obvious cases; in Chapter 2, for example, I provided the example of plant gravitropism gene associations predicting neural crest migration genes. The identification of additional disease-causing genes is a mere curiosity next to the discovery that the two phenotypes are both facilitated by an underlying process.

In Chapters 3–4, I examine two methods for better understanding the genetic programs which form the bases for diseases and phenotypes in an array of organisms. The first of these — the k nearest neighbors ranking approach — casts a broad net and offers a glimpse at the myriad phenotypes and diseases that arise from deeply conserved underlying genetic programs. The kNN approach also offers a prioritization for the testing of candidates genes for diseases and traits, including one mouse disease-like phenotype, and an extremely non-disease-like Arabidopsis trait.

The other (in Chapter 4) is a much more targeted approach; instead of raking in all phenologs which might be predictive of some disease or trait, the “Boolean” method focuses in on a few at a time. While it does not explicitly rank predictions, it suggests that specific phenotypes lie at the intersection of conserved processes (or that one process rescues another, in the case of the subtractive examples). Anecdotally, this method seems to offer much

118 more relevant predictions than either the original phenologs approach or the k nearest neighbors approach.

5.1.1 Paralogous phenotypes

While phenologs which are in-paralogous rather than orthologous are not quantifiable, they are important conceptually. They have been used in the same way as orthologous phenotypes — namely to identify additional can- didates for autoimmune disorders, as discussed in Chapter 1. In Chapter 3, I identified a case where in-paralogous phenotypes may have (correctly) im- proved the ranking of one gene relative to its in-paralogs.

In the fourth chapter, I mentioned a consequence of the conversion from the gene–phenotype to orthogroup–phenotype matrix framework: that pheno- types which were represented by separate rows in the matrix, involving differ- ent genes, become equivalent when described in terms of their orthogroups. These are in-paralogous phenotypes.

Thus, in this work, I extend the theoretical framework for phenologs to paralogous phenotypes. A future area for exploration might involve finding some way to quantify the degree of paralogy of such phenotypes.

For example, two human phenotypes X and Y might involve the gene sets A, B, C, D, E and F, G, H, I, J, respectively. When those genes are trans- lated to human–mouse orthogroups, they may still be distinct. But when translated to human–chicken orthogroups, it might turn out that D is or- thologous with I,J. If instead we look at them in terms of human–plant

119 orthogroups, we might see that in fact what appeared to be two sets of five unique genes actually make up one set of four unique orthogroups.

If it were possible to quantify these paralogous phenotype relationships, it could help us to examine duplication and divergence — for example, in the evolution of color vision.

5.1.2 The importance of basic research

Biologists tend to look at mammalian biological processes and genes because what is learned tends to apply to humans as well. Yet the first people to study yeast weren’t looking for a cure for cancer; they were trying to make better beer.1 Even so, many people study yeast today with the hope that basic biology lessons taken from this single-celled organism can be applied, eventually, to humans. Some also question this research focus from time to time.

Yet phenologs place an economic value on basic research. In Chapter 2, we point out that the rate of gene–phenotype association discovery is 550-fold higher with phenologs than without. Phenologs provide a rationale for basic science: any process could serve as a quantifiably useful model for some human affliction. Indeed, while phenologs themselves produce useful predictions which may be directly applicable, the phenolog strategy is itself an example of basic research.

1This claim is verifiable simply by looking at the name, Saccharomyces cerevisiae, which means Sugar-eating beer-maker.

120 5.1.3 Future directions

There are a number of possibilities for future courses of study on phe- nologs. I believe that the most promising is the combination of the k nearest neighbors method with the Boolean phenologs method, as with the holopros- encephaly example described in Chapter 4, but approached systematically.

It is also my belief that we under-utilize visualization strategies in at- tempting to understand deep homology and phenologs. I attempted, early in the formulation of Chapter 3, to develop a Rails application for allowing a scientist to exclude certain noisy or redundant phenotypes from a k nearest neighbors search, but eventually decided it was a dissertation project in and of itself.

Another possible direction is the eventual incorporation of fruit fly Gene Ontology biological processes into our database. I continue to believe there is a wide array of useful models in Drosophila, but the nature of the FlyBase database offered too many obstacles.

121 Appendices

122 Appendix A

SciRuby: Tools for scientific computing in the Ruby language

A.1 Foreword

When I began my doctoral studies, I knew only C++. In the course of rotations, I learned Perl and Python, and upon arrival in the Marcotte Lab, I tried out Ruby — which in short order became my favorite programming language.

While a BioRuby project existed, Ruby lacked any kind of cohesive community of general scientific programmers. Some libraries existed, but it was unclear what was being actively developed and where assistance or sup- port could be had. With encouragement and assistance from former Marcotte Lab member John Prince, I organized the Ruby Science Foundation, and the SciRuby community and libraries, beginning in December of 2010.

In time, it became clear that one unifying factor still absent from our nascent project was a fast library for matrices, vectors, and linear algebra. So a year after SciRuby’s formation, I sat down and began work on NMatrix, a package designed to fill that niche. Other early code contributions came from Chris Wailes under a small grant from a start-up (now defunct) called Brighter

123 Planet, and from a handful of fellows hired using a grant from the Ruby Association, which is to the Ruby language what the Ruby Science Foundation is to the SciRuby Project. This support in particular offered a great deal of motivation, as it meant that my work had the support of Ruby’s creator.

On SciRuby’s behalf, I applied for a total of four grants, and received three (the fourth was a long-shot). Perhaps the most useful to us was Google’s Summer of Code, which offered not only four motivated students to work on SciRuby, but also a great deal of publicity and attention for our project, and networking opportunities with other open source endeavors.

SciRuby has been an integral part of this dissertation; nearly all of the figures in Chapter 3 were generated using Rubyvis, a SciRuby plotting and visualization library. Additionally, NMatrix (discussed in more detail in Appendix B) made the methods presented in Chapter 4 feasible.

The decision to present certain elements of my methodology in these appendices is based upon the recognition that the computational tools cre- ated for the fourth chapter are not biological nor even bioinformatic in nature. NMatrix itself includes no especially innovative algorithms or data structures, but the synthesis of its parts may represent an innovation. Additionally, the process of writing NMatrix provided me with understanding of the principles of matrix storage structures, which in turn enabled some algorithmic innova- tions critical to the methodology of Chapter 4. SciRuby and NMatrix are not necessary to the presented research, strictly speaking; the work could have been done with existing tools, albeit over a longer period of time.

124 Finally, I include these appendices because infrastructure tools are of- ten neglected in the pursuit of computational research. Researchers frequently seem to re-invent the wheel in the course of informatics projects, developing useful algorithmic innovations which never see the light of day — or they de- pend upon kluges and hacks which slow progress. I present these appendices to represent my intent to continue to promote SciRuby and NMatrix as com- putational science infrastructure, and the SciRuby community as a source of support and guidance.

A.2 Introduction A.2.1 Why Ruby?

Ruby is a completely free (gratis and libre) scripting language created by Yukihiro “Matz” Matsumoto, inspired by Perl, Smalltalk, Eiffel, Ada, and Lisp [146]. In recent years, this fully object-oriented language has become a favorite among high-tech startups; Twitter, Groupon, and GitHub are three well-known companies which chose Ruby and its powerful Rails framework for quick prototyping of web applications [147, 148]. Yet the language has lagged in other areas, particularly scientific computing, as it for some time lacked interconnected tools for statistics, linear algebra, and data visualization.

Ruby is an ideal language for scientific computing. Nearly everything in Ruby can be easily re-defined (even basics like the + method of an object), and its object model is cleverly designed to promote use of mix-ins in addition to standard inheritance. The language can be easily extended to incorporate

125 existing libraries written in C/C++ or Java, including ATLAS [149], Apache Commons Math [150], and the GNU Scientific Library [151]. Furthermore, Ruby has in Rails [148] a cloud-ready framework that can be utilized to avail both the public and other scientists of a variety of computational methods.

A.2.2 Open source, open data

Perhaps just as importantly for science, the Ruby community is sharing- oriented. It is common for Rubyists to create their libraries on the public source code repository GitHub — and only then download them onto their local machines. While a non-scientific study by the Software Freedom Law Center (SFLC) suggested that licensing terms are not clearly included for a majority GitHub projects, Ruby projects tended to have the most permissive licensing. Indeed, the unclear licensing terms for Ruby projects — which were exceeded in number only by JavaScript projects in the SFLC sample — may stem from such projects’ tendency to simply reference “the Ruby license,” which has changed over the years,1 but is currently considered permissive. [152]

This public and permissive culture bodes well for sciences that make heavy use of computer code for formulation and testing of hypotheses and analysis of data. Studies suggest that data is rarely made available even when

1The full history of Ruby licenses can be viewed on GitHub at https://github.com/ ruby/ruby/commits/trunk/COPYING. The record shows that Ruby was originally released under the GPL, which is considered less permissive. In 2001, it shifted to a dual custom license (the so-called “Ruby license”) with GPL, and with version 1.9.3 traded GPL for a permissive BSD license. The Ruby license would likely be classified as permissive.

126 the journal mandates it [153–155]. Fortunately, unwillingness to share data may be a product of habit and inconvenience more than ideology [156]. Wel- coming scientists into a successful open source community — with open source being very much an ideology — may be one way to promote not only open source, but open data and open access as well. Last but not least, reproduction of results is difficult or impossible in the face of unpublished, undocumented methods (namely source code).

A.3 SciRuby

SciRuby is both a community and a set of libraries, or — as they’re called in Ruby terms — gems. The Ruby Science Foundation, the SciRuby umbrella organization, was founded in 2010 by myself, John Prince, and Clau- dio Bustos. What follows is an accounting of the tools which the organization publishes, as well as other tools not currently included which are nevertheless useful for scientific computing.

A.3.1 A note about Ruby implementations

The original Ruby virtual machine implementation is called MRI, for Matz’s Ruby Interpreter, and was written in C. More recently, YARV [157] (also in C) superseded MRI as the standard interpretation. The C interpreters allow for relatively easy extension in the form of compiled static libraries linked into gems, which can dramatically improve the speed over standard Ruby.

There also exists a Java Ruby interpreter called JRuby. Like YARV

127 and MRI, JRuby can be extended using bytecode-compiled Java code, which can dramatically improve the speed of complex algorithms — or allow incor- poration of existing code, such as Apache Commons Math.

In theory, it is possible to extend Ruby modules with both Java and C/C++ simultaneously, but this is not recommended.

A.3.2 Statistics

The main statistics tool, authored by Claudio Bustos, is known as Stat- sample. Subsequent to its incorporation into SciRuby, three additional gems were abstracted from Statsample: Distribution, Integration, and Minimiza- tion.

The last two of these primarily provide algorithms needed for the prob- ability distributions in the Distribution gem, including general purpose inte- gration and minimization methods.

The goal with the Statsample gems is to provide a standard Ruby implementation of every necessary statistical tool. Additionally, a gem called statsample-optimization includes C and Java linkages for MRI/YARV and JRuby, respectively, allowing the gems to make use of external libraries such as Commons Math, GSL, and the C Ruby gem Statistics2 (see [158]).

This last summer, Google funded four SciRuby students as part of its annual Summer of Code. Under Claudio’s mentorship, one of these students, Ankur Goel, contributed Statsample extensions for general linear models and time series (statsample-glm and statsample-timeseries).

128 A full list of features for Statsample is available at https://github. com/sciruby/statsample/.

A.3.3 Data visualization: Rubyvis and Plotrb

SciRuby includes two separate data visualization tools. The first of these, Rubyvis, existed prior to SciRuby, and like Statsample was written by Claudio Bustos [159]. Rubyvis is a Ruby port of Protovis [160], a JavaScript plotting toolkit.2

Our eventual goal with the SciRuby visualization tools is web-enabled interactivity. Interactivity is a difficult task, as plots are created client-side based on server-side specifications — and thus usually in JavaScript. While a scalable vector graphic (SVG) may be generated using Rubyvis, which contains no JavaScript, JavaScript provides access to the document object model, or DOM, which enables modification of the SVG code on the fly. D3 (or D3) is a sort of successor project to Protovis; it is a DOM representation specifically designed to facilitate dynamic data visualizations (D3 stands for data-driven documents) [161].

In order to reach toward this goal of improved interactivity, Wan Zuhao, another Google Summer of Code student, wrote SciRuby’s second plotting tool, Plotrb [162]. Plotrb utilizes the Vega plotting grammar [163] for D3, and may eventually allow interactivity via a Ruby-to-JavaScript compiler called

2Many of the plots in this work use Rubyvis, especially those in Chapters 3 and 4. Plotrb became available only after the illustrations had been completed.

129 Opal [164].3

A.3.4 Semantic web: PubliSci

Will Strinz, our third Summer of Code student, wrote PubliSci, which provides a domain-specific language (DSL) for describing and publishing datasets.

PubliSci allows SPARQL4 queries to be run on loaded datasets, and allows ex- porting to other formats. It also includes a basic parser which can be extended to include other data types. [166]

A.3.5 Data mining: Ruby Band

Our fourth Summer of Code student, Alberto Arrigoni, wrote an API tying together various Java-based data mining tools including Weka [167] and

Apache Commons. This gem, called Ruby Band, supports data preprocessing

(filtering), clustering, classification, regression, and feature selection [168].

A.3.6 Matrices and linear algebra: NMatrix

NMatrix is a library of data structures for dense and sparse matrices as well as algorithms that work on those data structures. I wrote NMatrix in

C and C++, and it includes two sparse formats as well as and standard dense matrices, with row ordering. NMatrix allows manipulation of matrices with

3Plotrb is likely to be the subject of a future manuscript. A great deal of information, including several demonstration plots, can be found at wanzuhao.com. 4SPARQL is a recursive acronym which stands for SPARQL Protocol and RDF Query Language. It is similar to SQL. [165]

130 C++-templated data types for easy extensibility; standard data types include integers, floating point, complex floating point, and rational. Also permissible is a Ruby object data type, which allows even greater flexibility. In addition to providing an API for ATLAS and LAPACK, NMatrix implements a number of algorithms from these libraries in templated C++, for rational computation support. NMatrix’s features and design choices are more thoroughly discussed in Appendix B.

NMatrix development has been catalyzed by coding fellowship funding from Brighter Planet and the Ruby Association.

A.3.7 Multi-dimensional arrays

Finally, SciRuby is in the process of incorporating a NetCDF Java- based JRuby multi-dimensional array library called MDArray. MDArray uses Parallel Colt (see [169]) for many of its algorithms — meaning that where NMatrix focuses on versatility of storage, MDArray (which provides only dense storage) works toward multi-threading and parallelization. MDArray was writ- ten by Rodrigo Botafogo. [170]

A.4 Other Ruby science tools

Quite a few scientific and mathematical tools are available to Rubyists outside of SciRuby. BioRuby [171], part of the Open Founda- tion, is one successful project, having just passed 50,000 downloads. BioRuby includes over ninety gems, totaling around 400,000 downloads. A number of

131 BioRuby contributors are also involved in SciRuby.

NArray, Masahiro Tanaka’s dense array gem, has been a model for both NMatrix and MDArray. By my own error, standard NArray is not compatible with the NMatrix gem (since it contains its own NMatrix class); a compatible version may be obtained under the name narray-nmatrix. The author is currently working on a complete rewrite. [172]

Similarly, a GNU Scientific Library API is available in the Ruby/GSL gem, which was written by Yoshiki Tsunesada in 2004, and is maintained by David MacMahon [173]. GSL works with NArray by default, but an exper- imental NMatrix-compatible version, gsl-nmatrix, is available through the RubyGems repository [174].

Linalg, by James Lawrence, is another (older) option for linear algebra; it allows any LAPACK function to be called from Ruby [175]; however, it has not been maintained in several years.

A.5 Conclusion

Many of the libraries included in SciRuby existed in some shape or form prior to the SciRuby Project and the Ruby Science Foundation. The clear catalyst has been the formation of a community around a shared goal: open source software for science, engineering, and mathematics.

It is, of course, not possible to discuss SciRuby without mention of the immensely successful Python libraries SciPy [176], NumPy [177], Matplotlib

132 [178], and SymPy [179], which help provide the Python language with science and numeric libraries — libraries which have been a source of inspiration for

SciRuby.

Like SciPy, SciRuby is an infrastructure project. Most scientific pro- gramming, and even open source science, revolves around the short-term needs of those who are currently either publishing or perishing — violating the pro- gramming precept of “Don’t Repeat Yourself” (DRY). SciRuby is attempting to provide general-purpose libraries that enable researchers to quickly proto- type scientific code, and moreover, to give back to the community and accel- erate science.

A.5.1 Acknowledgements

I wish to thank Edward Marcotte, my graduate advisor, for funding my work on this project, and to John Prince (a former Marcotte lab member and now professor at Brigham Young University) for providing vision, enthusiasm, feedback, and innumerable other types of support. Thanks also go to Yukihiro

“Matz” Matsumoto and the Ruby Association for endorsing our work, and to the Google Open Source Programs Office for providing us with visibility and summer coders.

A.5.2 Funding sources

This work was supported by grants from Brighter Planet, the Ruby

Association, and Google Summer of Code (the last jointly to the Ruby Science

133 Foundation and to Wan Zuhao, Will Strinz, Alberto Arrigoni, and Ankur Goel). My contributions were supported by a National Science Foundation Graduate Research Fellowship.

134 Appendix B

NMatrix: A fast matrix library for the Ruby language, with dense and sparse data structures

B.1 Introduction

A key obstacle for the adoption of Ruby for scientific computing has been the absence of a unified interface for fast numerical computing. While

Ruby contains a basic Matrix class, that class is implemented in Ruby rather than C (in contrast to Ruby’s Array and Hash types, which are written in pure C for speed). Several Ruby libraries (or gems) provide dense matrix or array functionality, including Linalg [175], NArray [172], Ruby/GSL [173], and most recently the JRuby-compatible MDArray [170].

Ruby is an excellent language for prototyping code for the testing and generation of scientific hypotheses. As the scientific method involves continual refinement of approaches and methodologies, it makes sense to be able to quickly produce code which can be easily modified — even at the cost of some speed. C code is wonderful — but is incredibly time-consuming to write and debug.

I present the first beta release candidate of NMatrix, a unified dense

135 and sparse matrix library which lies at the core of the SciRuby Project. A Ruby library, NMatrix is implemented primarily in C and C++ for efficiency, with bindings for a number of functions in LAPACK and ATLAS’ CBLAS and CLAPACK implementations (see [149]). While most linear algebra libraries support only floating point and complex floating point values, NMatrix also includes native support for integers and rationals. It can be easily extended either through C++ templates or through its support for Ruby object matrices (which can contain nearly any Ruby data structure).

B.2 Features B.2.1 Storage

NMatrix uses row-order matrices (stored left to right, row by row, start- ing at the top). By default, matrices are dense, but a single Ruby class

(Matrix) provides a common interface for two types of sparse storage, and also enables specification of the type of data stored in each cell. Certain storage types allow other arguments, for which some examples are included (Listing B.1).

Note that, so as to avoid namespace pollution, the N shortcut is not provided by default in NMatrix.

While exact implemented functionalities differ for each storage type

(stype), certain operations are guaranteed to be available for all:

• Efficient conversion to any other storage format and/or data type (dtype),

136 simultaneously or separately

• Element-wise operations (arithmetic, trigonometric functions, boolean comparisons, etc.)

• Whole-matrix comparisons (equality and inequality), including between

matrices of different stype

• Slice copying: getting individual pieces of the matrix and returning them as new matrices

• Slice referencing: creating views of the matrix which, when edited, also modify the original matrix

• Ruby enumerable functionalities for efficient iteration: map, each, and

corollaries guaranteed by inclusion of the Ruby core Enumerable module

• Ruby-like specialized variations on each:

– each_with_indices

– each_stored_with_indices

– each_ordered_stored_with_indices

• Chainable referencing and assigning of individual entries (see Listing B.2)

• Copying and assigning of matrix slices (ranges; many entries simultane- ously) (see Listing B.3)

137 B.2.1.1 Dense

The dense matrix format is the best-supported, due to its simplicity and the availability of pre-existing libraries and algorithms which work on dense matrices. Dense matrices may be n-dimensional with n ≥ 1, and are represented by the Ruby symbol1 :dense. Internally, dense matrices are stored as a single contiguous array, and thus can be passed directly to ATLAS or BLAS and LAPACK implementations (Table B.1).

B.2.1.2 Compressed row (Yale)

NMatrix includes a compressed sparse row storage format which is col- loquially called “new” Yale, and in code is represented by the symbol :yale. It differs slightly from the “old” Yale format [180], using two vectors instead of three, and with the diagonal values stored separately from the non-diagonal ( ) ( ) non-default values (for O 1 diagonal access instead of O log n ).2 Addition- ally, a default value other than zero may be set; whereas Yale matrices typically only store non-diagonal non-zeros, one might specify at matrix creation a value other than zero.

Because NMatrix stores data row by row, varying the shape away from

1Symbols are a concept which are perhaps unique to Ruby. They are somewhat like strings, but are represented by a single integer value; and thus may be copied and compared much more efficiently. They are similar to constant #defines in C, but the Ruby interpreter assigns their values automatically so the programmer need not worry about it. 2Bank & Douglas (1993) cited [180] for a description of the “new” Yale format. That reference appears to contain only the “old” format, but does mention that the diagonal can be stored separately. For descriptions of the “new” format, see either [181] or the 1992 edition of Numerical Recipes in C, pp. 78–79.

138 Method Library Rational C++ Description rot CBLAS 1 undefined yes plane rotation rotg CBLAS 1 undefined yes givens plane rotation asum CBLAS 1 Z, Q yes sum of absolute values of a vector nrm2 CBLAS 1 undefined yes vector 2-norm gemv CBLAS 2 Z, Q yes general matrix–vector multiplication gemm CBLAS 3 Z, Q yes general matrix–matrix multiplication trsm CBLAS 3 Q yes triangular solve getrf CLAPACK Q yes Gaussian elimination (LU factorization) getri CLAPACK no no inversion of LU-factored matrix getrs CLAPACK Q yes LU-factored matrix solve gesv CLAPACK Q yes Gaussian elimination square matrix solve potrf CLAPACK no no positive definite matrix factorization (Cholesky) potri CLAPACK no no inversion of Cholesky-factored matrix potrs CLAPACK Q yes Cholesky-factored matrix solve posv CLAPACK no no positive definite matrix solve (Cholesky) laswp CLAPACK Z, Q yes apply a permutation scal CLAPACK Z, Q yes scale a vector by a constant lauum CLAPACK Q yes compute UU H or LH L for complex, or UU T or LT L for others (U = upper triangular,L = lower) gesvd LAPACK undefined no real m×n singular value decomposition (SVD) gesdd LAPACK undefined no divide & conquer real m × n SVD geev LAPACK undefined no real n × n asymmetric eigenvalue decomposi- tion

Table B.1: ATLAS and LAPACK functions available for dense ma- trices in NMatrix. In addition to providing an API for ATLAS (CBLAS and CLAPACK) functions, NMatrix strives to also offer a native implementa- tion that allows for rationals and Ruby objects. In many cases, the operations are undefined in the rational field Q; in a few, the operations are defined not only for rationals but for integers Z. The C++ column indicates whether the function is implemented natively (using C++ templates) in such a way that it could be used with Ruby objects — which, with some caution and careful testing, could permit nearly any user-defined data structure to be included in an NMatrix (such as rational complex numbers). CBLAS and CLAPACK are the C APIs defined by ATLAS which reflect Netlib’s BLAS and LAPACK reference implementations.

139 square leads to a trade-off between storage and access complexity. For ex- ample, Yale column vectors are inefficient for storage but allow constant-time access. Similarly, Yale row vectors are efficient for storage but have logarithmic access requirements.

Matrix–matrix multiplication and matrix–vector multiplication are im- plemented for Yale (per [181]).

Yale is currently the recommended NMatrix storage format for expert recommendation systems. The method described in the last chapter of this work stores phenotypes as rows and genes as columns in a Yale matrix; and makes extremely efficient comparisons between rows using set operations.

In general, while it is more efficient to convert a Yale matrix from a linked list matrix than to create it directly (which often forces the storage vectors to resize repeatedly), if the size of the final matrix is known in advance it is possible to fill it in storage order in a manner which is nearly optimal.

For both of the last two features, specialized functions are available which don’t exist for dense or linked list storage types. Yale is also distinct from the other storage types in that it must be two-dimensional.

B.2.1.3 Nested linked list

NMatrix’s other sparse storage format is referred to by the Ruby symbol

:list, and consists of nested singly-linked lists. Like the dense format, list matrices may be n-dimensional. Only the non-default values are stored, with no special treatment for the diagonals.

140 List matrices are the least complete, in part because they are intended as an intermediate storage format. There are four paradigms for efficient worst- case creation of large sparse matrices in NMatrix, and the two most common involve a linked list matrix intermediate:

1. If it is impossible to order insertion, the matrix should be created as a list and initialized in whatever order is possible (Listing B.4).

2. If no reasonable upper bound can be provided for the off-diagonal ca- pacity, it is preferable to create the matrix as a list and initialize it from right to left, starting at the bottom (Listing B.4).

3. If the values can be grouped in contiguous blocks, or can be described as a repeating pattern, it is trivial to insert them directly in a Yale matrix (and faster when done in storage order). See Listing B.5.

4. If the number of non-diagonal non-default values (usually given by ndnz,

standing for non-diagonal non-zeros) is known, or the ndnz’s upper bound is known, a Yale matrix should be created and initialized row by row from left to right then top to bottom. This option is somewhat more compli- cated to implement, and the API for it is experimental. (Listing B.6)

B.2.2 Data Types

NMatrix includes, by default, the following data types or dtypes: byte (unsigned character), integer (8, 16, 32, and 64-bit), floating point (32 and 64- bit), complex floating point (64 and 128-bit), rational (32, 64, and 128-bit),

141 and Ruby objects. These are represented by symbols, such as :byte for byte,

:int8 for 8-bit integer, :rational128 for 128-bit rational, and :object for Ruby objects.

As much as possible, these data types behave like their Ruby equiva- lents, and differ from the standard C arithmetic behaviors with one notable caveat: overflow and underflow are not handled unless using :object (where a Ruby Fixnum will magically be converted to a Bignum when necessary, for example).3

NMatrix will try to guess the appropriate dtype based on default values supplied, if there are any. For example, while NMatrix.new(3) produces a square dense matrix of Ruby objects initialized to nil,4 NMatrix.new(3, 0) creates an :int32 matrix initialized to 0.

B.2.2.1 Initialization values

It is worth noting that regardless of data type, Yale and linked list matrices are always initialized to 0 (unless another option is given). However, dense matrices are only initialized by default when requested — or when the data type mandates it.

To understand why a dtype would need to require a default, consider

3Fixnum is Ruby’s integer type. When the stored value exceeds the maximum storage capacity of the operating system’s integer type, Ruby automatically changes to storing the value as a Bignum. 4Ruby’s nil object is an undefined or null value which evaluates to false in boolean expressions; but it is equal to neither false nor 0.

142 momentarily a string of 64 bits set to a random sequence of 1 and 0. When cast to a 64-bit integer, a 64-bit floating point, or two 32-bit floating points in a 64-bit complex value, the sequence is guaranteed to translate into a value.

But with 64-bit rationals (two 32-bit integers), we might end up with a non- sensical number, where both the numerator and denominator are negative, or the denominator is zero, or the numerator divides the denominator. With

Ruby objects, the situation may be quite a bit less predictable.

So why not simply initialize floating point numbers and complex values to 0 also? The simple answer is speed. It may be desirable to perform some operation on matrix B and store the result in A, and so when we create A, we don’t want to spend the extra time initializing it. While this may seem like an insignificant amount of time, consider that some n-dimensional matrices may use quite a bit of storage. This strategy is utilized frequently in NMatrix’s

ATLAS helper functions, since ATLAS expects the user to allocate (but not initialize) its outputs.

So, while the initialization parameter is not required, it is strongly recommended — unless one is planning on immediately initializing the result in some more meaningful way.

143 B.2.3 Input and output

NMatrix has its own input and output formats5 for its Yale and dense storage types — but none for linked list storage, which should be converted to

Yale prior to saving. The functions are NMatrix#write and NMatrix::read; the former is called on an instance, and the latter on the class directly (as is done with new).

NMatrix also supports reading and writing the Matrix Market for- mat [182]. This can be accessed using the methods load and save on the

NMatrix::IO::Market module.

Finally, NMatrix is capable of reading most MATLAB .mat version 5 files [183]. These files tend to contain workspaces rather than single matrices, so NMatrix does the best it can to return the included values in an appropriate manner. The .mat file reader can be accessed using NMatrix::IO::Matlab:: load.

B.2.4 ATLAS and LAPACK

As previously mentioned, some ATLAS and LAPACK functions are exposed in NMatrix in at least one of three ways. The first two of these are generally in the NMatrix::LAPACK and NMatrix::BLAS modules, and the third is usually found in NMatrix proper:

5The format documentation is available at https://github.com/SciRuby/nmatrix/blob/ master/ext/nmatrix/binary_format.txt.

144 1. Prefixed (unfriendly): All exposed ATLAS and LAPACK functions

are available with a prefix, one of lapack_, clapack_, or cblas_. The prefix indicates that the C or Fortran library function is called as directly as possible with little or no argument checking (other than is provided by the library itself), and also tells which library the function is called upon. These functions may cause segmentation faults or other undefined behavior if called improperly, but are kept intact for power users and for testing — and are as faithful as possible to the ATLAS function prototypes in terms of how parameters should be specified. An example

is cblas_gemm, which requires fourteen arguments. For consistency, this function may also be used to call the rational, integer, and Ruby object versions implemented natively in NMatrix.

2. Friendly: These are “safe” versions of the prefixed functions. They provide default parameters or try to infer them from what is provided. They also check, as best as possible, that those parameters which are

provided are correctly specified. The friendly version of cblas_gemm,

called simply gemm, requires only two arguments (instead of fourteen).

3. Instance methods: In some cases, ATLAS and LAPACK functions can

be called as instance methods on matrix objects (such as NMatrix#getrf!,

which modifies an NMatrix directly by calling the prefixed version of the

function, NMatrix::LAPACK::clapack_getrf; or NMatrix#getrf which

makes a copy and then modifies that copy using #getrf!). Additionally,

145 NMatrix#factorize_lu does the same thing, but transposes before and

after, and returns a copy. The cblas_gemm and cblas_gemv algorithms

may both be called using NMatrix#dot with one argument (and are ap- propriately dispatched according to the shape of the matrix argument

which is passed to the dot method).

One obstacle with interfacing with LAPACK functions is that LAPACK is generally column-major, whereas NMatrix is row-major. Where possible, the appropriate adjustments are made.

B.2.5 Experimental features

The Yale matrix type includes a number of experimental features, which can be enabled either with NMatrix.include(NMatrix::YaleFunctions) (if only working with Yale matrices) or by using extend() on a matrix object instance (also with the YaleFunctions module).

None of these functions can be expected to work properly on reference slices of Yale matrices, since they operate directly on the underlying storage.

B.2.5.1 Set operations

For my work on boolean phenologs, I provide some useful functions for performing set operations on Yale functions. In such a scheme, one would store the recommendable vectors as matrix rows. Understanding how they work requires some knowledge of the Yale storage format’s internals, which I touch upon in this section.

146 ( ) While finding an off-diagonal element in a row is O log n , returning the ( ) entire non-diagonal portion of the row is O n , making it preferable to get the whole row at once rather than one element at a time. Stored in parallel with the entries are the corresponding column indices, and these can be retrieved using a function, yale_nd_row, which complements the __yale_vector_set__ method demonstrated in Listing B.6.

The nd_row function (which stands for non-diagonal row) requires to arguments: a row index i and a symbol, either :hash or :keys; the former returns the indices mapped to the stored values in a hash, and the latter returns only the column indices.

There are several Ruby helper functions which call nd_row as well:

• yale_ja_at(i): returns the non-diagonal column indices for row i as an

Array

• yale_ja_set_at(i), yale_ja_sorted_set_at(i): same as above, but as

a Set or SortedSet, respectively

• yale_ja_d_keys_at(i), yale_ja_d_keys_set_at(i), yale_ja_d_keys_

sorted_set_at(i): return non-diagonal and diagonal elements

• yale_row_as_hash(i): returns the non-diagonal and diagonal elements

of a row as a Hash

Since many recommendation functions require a set intersection op- eration, this is implemented in the same module as a C function, called

147 yale_row_keys_intersection. This function expects that the matrices upon which it is called have the same default values; it only checks those values which are stored.

In a future version, these experimental set-related functions are likely to be moved to a separate gem.

B.2.5.2 Data structure

Yale is a complicated data structure to implement efficiently, and for that reason a number of methods are included which enable simple viewing of the underlying data structure:

• yale_a: the A array which stores the diagonal values, the default value,

and the non-diagonal stored values (also, yale_d and yale_lu for lower- and-upper non-diagonals). Individual indices/pointers may be gotten by

passing an argument, k.

• yale_ija: the IJA array which stores the pointers to rows and the col- umn indices. As with the A accessor, individual values may be obtained by passing an argument. The pointers alone may be returned using

yale_ia, and the column indices alone using yale_ja.

B.3 Conclusion

Development on NMatrix began in late December 2011, and the first alpha (0.0.1) was released in April of 2012. This work marks the first beta

148 release candidate, 0.1.0.rc1.

B.4 Acknowledgements

I wrote the core library. Chris Wailes, funded by Brighter Planet, de- signed the original C++ templating facilities. Aleksej Timin designed the slicing, and he and I implemented it together. Carlos Agarie provided some documentation and shortcuts, Ryan Taylor worked on singular value decompo- sition, and Masaomi Hatakeyama, who worked on altering Ruby/GSL to work with NMatrix (all funded by the Ruby Association). John Prince provided direction and ideas; and Colin Fuller designed a solution to a complicated garbage collection problem, and has offered a large number of other patches as well. I wish to thank Yukihiro “Matz” Matsumoto for inventing a won- derful language, and the National Science Foundation (GRFP) and Edward Marcotte for funding me while I worked on it.

149 # dense 3x4 matrix, all zeros NMatrix.new([3,4], 0)

# dense 4x5 matrix with repeating integers 1,2,3 NMatrix.new([4,5], [1,2,3])

# 4x4 yale matrix containing 2.0 every other cell NMatrix.new(4, [0,2], stype: :yale, dtype: :float64)

# 4x4 list matrix with default 1/1 (with nothing stored) NMatrix.new([4,4], stype: :list, dtype: :rational128, default: 1)

# 500x1000 yale matrix, with 100 cells reserved for as-of- # yet not-inserted non-diagonal non-zeros NMatrix.new([500,1000], stype: :yale, capacity: 600)

# Shorthand for initializing an NMatrix NMatrix[[1,2,3],[4,5,6]]

# Shorthand also accepts options like NMatrix#initialize NMatrix[[1,2],[3,4], dtype: :rational128]

N = NMatrix # Enable even shorter shorthand N[[1,2],[3,4]]

# Some other familiar shorthands are enabled for MATLAB users: NMatrix.ones(3) # 3x3 matrix of 1s NMatrix.zeros([3,4]) # 3x4 matrix of 0s NMatrix.eye(4) # 4x4 identity matrix NMatrix.diagonals([1,2,3,4]) # 4x4 matrix with diagonal: 1,2,3,4 NMatrix.random(2) # 2x2 random matrix NMatrix.seq(2) # NMatrix[[0,1],[2,3]] NMatrix.indgen(2) # same as above # Others: findgen/dindgen for floats/doubles, bindgen for bytes, # cindgen/zindgen for complex/double complex, # rindgen for rationals, rbindgen for Ruby objects

Listing B.1: Different ways to create an NMatrix. 150 n = NMatrix[[1,2],[3,4]] n[0,1] # returns 2 n[0,1] = 5 # matrix is [[1,5],[3,4]]; returns 5 n[0,1] # returns 5

# Chained operation: n[0,1] = n[0,0] = 0 # matrix contains [[0,0],[3,4]]

# Vectors n = NMatrix[[1,2,3,4,5,6]] # row m = NMatrix[[1],[2],[3],[4],[5],[6]] # column n[1] == m[1] # => true n[0,1] == m[1,0] # => true (long form)

Listing B.2: Syntax for modifying or getting individual entries of a matrix or vector. Here, n can be an NMatrix of any storage for- mat. Lines 10–13: While dense and list vectors can be dimensionless (e.g., NMatrix[1,2,3,4,5,6], it is recommended that they be declared with shape [k,1] or [1,k] rather than [k] so that they have an orientation. In fact, ad- ditional dimensions may be added (except for Yale) if routinely dealing with matrices of higher dimensionality. In these cases, NMatrix does not require explicit indexing of those dimensions with size 1.

151 1 n = NMatrix[[1,2,3],[4,5,6],[7,8,9]]

2

3 # reference to elements 0 through 2 inclusive 4 ref = n[0,0..2] # => NMatrix[[1,2,3]] 5 ref[0] = 5 # Changes both! 6 # n = NMatrix[[5,2,3],[4,5,6],[7,8,9]] 7 # ref = NMatrix[[5,2,3]] 8 ref[0] = 1 # Changes both back

9

10 # copy of elements 0 through 2 inclusive 11 copy = n.slice(0,0..2) # => NMatrix[[1,2,3]] 12 copy[0] = 4 # only changes the copy

13

14 # Shorthands for getting entire ranks: 15 n.row(0) # reference by default 16 n.column(1, :copy) # returns column 1 as a copy 17 n[2,:*] # same as n[2,0..2]

18

19 # Set multiple entries simultaneously 20 n[1,:*] = [0,1,0] # => [0,1,0] 21 # n = NMatrix[[1,2,3],[0,1,0],[7,8,9]]

22

23 # Set multiple entries with automatic repetition 24 n[0..1,:*] = [7,8] #=> [7,8] 25 # n = NMatrix[[7,8,7],[8,7,8],[7,8,9]]

Listing B.3: Syntax for modifying or getting slices of a matrix. Again, n can be an NMatrix of any storage format (:dense, the default, is used as a demonstration).

152 # Strategies 1 and 2 (general) list_matrix = NMatrix.new([n,m], stype: :list, dtype: :float64)

# Reading from a space-separated file called table.txt File.open("table.txt", "w") do |f| while line = f.gets i,j,val = line.chomp.split

# Convert indices from strings to integers, and value to float list_matrix[i.to_i, j.to_i] = val.to_f end end

# Convert to Yale. yale_matrix = list_matrix.cast(stype: :yale)

Listing B.4: General strategy for creating a Yale sparse matrix (via texttt:list). This strategy is most efficient when insertion is in reverse order — that is, when values which will be stored last in the linked lists are added first.

153 1 # Strategy 3: 2 reserve = contig_vals3to9.size + contig_vals20to24.size + n+1 3 y = NMatrix.new([n,m], stype: :yale, 4 dtype: :float64, 5 capacity: reserve)

6

7 # Strategy 3A (contiguous blocks): 8 # Insert the two contiguous blocks in row i: 9 y[i,3...10] = contig_vals3to9 10 y[i,20...25] = contig_vals20to24

11

12 # Strategy 3B (pattern on a contiguous block): 13 # (Example is for repeating values [0,0,1]) 14 y[15..20,4...20] = [0,0,1]

Listing B.5: Contiguous block and repeating pattern strategies for creating a Yale sparse matrix. These two methods may be combined when appropriate. Where possible, insertion should occur in storage order for Yale matrices for optimal performance. It is also possible to insert elements individually — however, it is likely that a binary search will be performed for each new insertion in a row. For empty rows, insertion is trivial. Lines 2–5: Reserving space is optional, but may prevent (or delay) an expensive resize operation. In every case, the minimum reserve size must be n + 1, where n is the number of rows (the length of the diagonal). This particular example also makes room for both of the contiguous blocks to be inserted (which won’t be enough for Strategy 3B, though the code can be easily modified). Lines 7–10 (Strategy 3A): It is more efficient to insert the contiguous blocks in storage order. Switching the order essentially doubles the amount of time necessary to insert the second block. Lines 12–14 (Strategy 3B): This repeats the sequence 0, 0, 1 over the contiguous sub-matrix specified by 15..20 (rows 15 through 20 inclusive) and 4...20 (columns from 4 to 19, but not 20). Note that in Ruby .. is the equivalent of the : operator in MATLAB, and can be thought of as a colon laid out horizontally; three dots (...) indicates the range should exclude the last value. Although the given pattern suggests that zeros should be inserted, NMatrix will only store the non-zero values — and will remove from storage any values replaced by zeros.

154 # Strategy 4 (ndnz is known) reserve = n + 1 + ndnz y = NMatrix.new([n,m], 0, stype: :yale, capacity: n + 1 + ndnz)

# Assume we have the non-diagonal non-default values and their column # indices in the arrays values and indices, respectively. y.__yale_vector_set__(i, indices, values)

# For insertion of diagonals, the normal method is optimal: y[i,i] = diagonal_value

Listing B.6: Experimental strategy for optimal Yale construction. This strategy is utilized by boolean phenologs, which only needs to store binary values (1 or 0).

155 Index

α-adducin, 74 Arrigoni, Alberto, 130, 134 ARX, 72 abnormal body wall muscle cell ATF1, see CREB1/CREM/ATF1 polarization, 24 ATM, 19 abnormal brain ventricle, 21 ATP4A/B, 77, 79 abnormal brain wave pattern (mouse ATR, 102 phenotype), 69 atrial fibrillation, 76–79 abnormal chorion morphology, 81 categories of phenologs (car- abnormal heart development, 24 diac, gastric, and audi- abnormal pigmentation, 30 tory), 79 abnormal social investigation, 24 cross-validation of predictions, absent posterior semicircular canal, 76 24 autism susceptibility, 24 absent sperm flagella, 21, 24 autoimmunity, 10 Abstract, viii autophagic tumor stroma model of Acknowledgements, v, 92, 133, 149 cancer, 104 ACTA1, 106 autophagy, 107 additive classifier, see classifiers auxin stimulus response, see re- adducin, see α-adducin sponse to auxin stimulus adenohypophysis development, 111 axon guidance, 111 adipogenesis, 101 aztreonam, 74 adipose tissue, see increased brown adipose tissue amount BARD1, 19 Agarie, Carlos, 149 Bardet–Biedl syndrome, 21, 24, alleles, 5, 35 114 Alx3, 101 BAX, 103 amyotrophic lateral sclerosis, 24 BCL-2, 104 angiogenesis, 26–28 bidirectional best hits apoptosis, 102 in lieu of orthogroups, 38 Appendices, 122 BioRuby, 132 Arabidopsis Information Resource, blood circulation, 111 The, 36 body wall muscle cell polarization, ARID1A/B, 106 see abnormal body wall

156 muscle cell polarization Caveolin-1, 104 Boolean phenologs, 11, 93–116 CD226, 10 as artificial syndromes, 95 Cha Hye Ji, 12 filtering procedure, 96, 113 CHEK2, 19 performance, 109 chorion, see abnormal chorion mor- systematic identification, 96 phology Botafogo, Rodrigo, 131 choroid plexus morphology, 21 brain development, 111 ciliary function, disruption in mouse, brain ventricle, see abnormal brain 21 ventricle circling (mouse phenotype), 24 BRCA1, 19 classifiers breast cancer, inflammatory, 105 additive, 57–58 breast/ovarian cancer, 18, 19 comparison, 59, 62 Brighter Planet, 124, 131, 134, 149 naïve Bayes, 57–58 BRIP1, 19 CLEC16A, 10 brown adipose tissue, see increased clonic seizures (mouse phenotype), brown adipose tissue amount 69 Bustos, Claudio, 127, 128 codon sequence, 4 Coffin–Sirris syndrome, 106 c-Jun, see JUN confirmation bias, 115 cancer, see also retinoblastoma, congenital disorder of glycosyla- 24, 28 tion, 24 autophagic tumor stroma model connexin, see Cx40/43 of, see autophagic tumor controls stroma model of cancer negative, 64, 65, 74, 115 breast, inflammatory, see breast positive, 20–21, 69, 79, 83 cancer, inflammatory corneal dystrophy, 34 breast/ovarian, see breast/ovarian cotyledon development, 24, 69 cancer cranio-lenticular-sutural dysplasia, endometrial, 81 30 candidate genes, ranking, 45, 48, craniofacial dysmorphology, 30 70 CREB1/CREM/ATF1, 106 CANR mutator high, 24 cross-validation catalase, 103 leave-one-out (orthogroup- cataracts, 21, 24 based framework), 90

157 ten-fold, 40, 89 DYSF, 106 Crx, 101 CSNK2A1, 27 ectopic vulvae, 16 CXCL12, 111, 112 embryonic digestive tract morpho- cytochrome P450 genesis, 111 CYP3A4 (family 3), 107 embryonic specification defect, 114 family 2 genes (CYP2), 107 EMF2, 81 cytochrome C oxidase, 104 endocrine pancreas development, 111 D3 (Data-Driven Documents), see enlarged third ventricle, 21 Plotrb (gem) enlarged thyroid glands, 21 data, availability of, 116 epidermal cell differentiation (plant datasets, species, 90 phenotype), 69 contributions of, 84–88 epilepsy, 65–69, 73, see also phar- in situ hybridization, 88 macologically-induced sei- DDHD2, 48 zures (mouse phenotype) deafness, 21, see also hearing im- cross-validation of predictions, pairment 69 Dedication, iv genes involved in mouse, 69 deep paralogy hypothesis, 43 photosensitive reflex epilepsy, diencephalon development, see 74 brain development predictions, 70, 72 digestive tract, see embryonic di- ERCC6, 102 gestive tract morphogene- Esx1, 101 sis ethanol, 74, 107 disease evo–devo, see evolutionary devel- diagnostic disconnect between opmental biology symptoms and biology, 1 evolution, 8, 117 DISP1, 109, 111, 112 convergent, 8 distance functions, 58–59 of genes, 7 comparison, 59 evolutionary developmental biol- correlation of, 64 ogy, 33, 117 Distribution (gem), 128 eyes, see also retinal degeneration, DNA, 4 see also retinitis, see also DNAJC13, 48 retinoblastoma

158 evolution and development of, gene expansions, 53, 80 117 gene silencing by RNA (plant phe- notype), 69 Faim2, 74 gene–phenotype associations, 11, false discovery rate calculation, see 34 permutation test collection and breadth, 20 FAM50A/B, 116 discovery, 15 FAM82B, 19 positive only, 35 FGF3, 111, 112 prior work, 46 FHL1, 106 translation between species, Fitch, Walter, 7 53–56 floor plate formation, 111 general linear models, Statsample flower development, see negative and, 128 regulation of flower devel- GHDC, 108 opment GitHub, 125, 126 flux–balance analysis, 46 GJA1/5, see Cx40/43 FlyBase, 91 glycosylation, congenital disorder, Fuller, Colin, 149 see congenital disorder of functional modules, see modules, glycosylation functional Goel, Ankur, 128, 134 Future directions, 121 goiter, 21 Google Summer of Code, 128–130, GABBR1/2, 69, 73 134 GADD45B, 103, 104 Gramene, 90 gamma radiation, response to, 102 gravitropism, negative, see nega- gap junction protein, see Cx40/43 tive gravitropism gastrointestinal hemorrhage, 24 Grik2/5, 74 GCC2, 18, 19 GRV2, 48 GEISHA, 88, 92 GSK3, 14 gene, 3 families, influence on pheno- Hatakeyama, Masaomi, 149 logs, see deep paralogy hy- hearing impairment, 9, 24, 30, see pothesis also long QT syndrome function, molecular versus org- heart development, abnormal, see anism-level, 14 abnormal heart develop- networks, 31 ment

159 hemolytic anemia, 24 inflammatory breast cancer, see HESX1, 69 breast cancer, inflamma- high-frequency male progeny, 18, tory 19 INPARANOID algorithm, 52 hippocampus size, see small hip- Integration (gem), 128 pocampus Introduction, 1, 125 histamine H2 receptor, 76 Isx, 101 HMG–CoA reductase, role in an- Jervell and Lange-Nielsen syn- giogenesis, 27 drome, see long QT syn- HMG20A/B, 19 drome holoprosencephaly, 108–109, 112 Johanssen, Wilhelm, 1 homology JUN, 103 deep, 7–8, 47, 117 deep out-paralogy, testing for, k (neighborhood size), effect of, see deep paralogy hypothe- 59, 63, 85 sis KCNA1/2, 69, 73 definition, 7 KCNE1, 77, 79 gene, 7 KCNQ1, 79 orthology, 7 KIF15, 19 paralogy, 7 KRAS, 19 structure, 7 Laurent, Jon, 46 HOPX, 77 levetiracetam, see synaptic vesicle HORMAD1/2, 19 glycoprotein hydranencephaly, 72 LHX2, 111, 112 hydroxyurea sensitive, 24 light stimulus response, see re- hypergeometric CDF, see probabil- sponse to light stimulus ity of phenolog occurring Lim1, see LHX2 by chance Linalg (gem), 132 Hypotheses, 11 long QT syndrome, 79 hypotonia, 106 lovastatin sensitivity, 26–28 lymphoma, 24 increased brown adipose tissue amount, 100 MacMahon, David, 132 increased resistance to wortman- Marcotte, Edward, 11, 46, 94, 133, nin, 24 149

160 matrix formalism, phenolog, 48, body wall muscle cell po- 51–56 larization flawed prototype, 40–43, 52– organ development, 111 53, 81 weakness, see myopathy orthogroup-based, 53–56 mutations, 5–6 Matsumoto, Yukihiro, 133, 149 generation of, 6 McGary, Kriston, 11, 12, 46 types of, 5 MDArray (gem), 131 myopathy, 106–108 MED25, 106 melanosome transport, see Bardet– naïve Bayes classifier, 42, see clas- sifiers Biedl syndrome NArray (gem), 132 mental retardation, 24 negative gravitropism, 9, 47, 48 mercury toxicity, see methylmer- negative regulation of flower devel- cury opment, 84 metabolic phenotypes, 46 NEO1, 111, 112 metal ion response, see response to nervous system development, 111 metal ion networks methylmercury, 102 functional, 47 and reactive oxygen species, protein interaction, 32 103 neural crest, see also Waardenburg Minimization (gem), 128 syndrome, 9, 30 MITF, 107 nitrogen mustard, see trichlorme- modules, functional, 93, 95 thine sensitivity (yeast morpholino knock-down, 27, 30 phenotype) Mouse Genome Informatics, 34 NKX2 genes, 69 MRE11A, 19 NMatrix, 11, 131, 135–149 MT-COI, 103, 105, 116 NODAL, 109, 111, 112 multiple hypothesis testing, correc- tion for, 98, 99 oestradiol, see increased brown muscle, see also myopathy adipose tissue amount cell development, see striated Online Mendelian Inheritance in muscle cell development Man, 34 cell fate specification, 111 open (source, data, science, and cell polarization, see abnormal access), 127

161 organic substance response, see candidate genes, 74 response to organic sub- cross-validation of predictions, stance 73 orthology, see homology PHB, 19 Otx1/2, 101 phenologs, 8 ovarian cancer, see breast/ovarian circular, 114 cancer E. coli, see pharmacologically- oxidative phosphorylation path- induced seizures (mouse way, 103 phenotype) oxidative stress, 102, 104 identification, 16, 48 orthologous, 15 p53, 104 paralogous, 9–10, 32, 81, 84, pallid, 30 119–120 pancreas development, see endo- telomere maintenance and crine pancreas develop- chromosome organization, ment 113 Parallel Colt, 131 plant, 28–31, see also response paralogy, see homology to vernalization paraquat, 74 contributions by species, 82 Park Tae Joo, 12, 30 prediction from E. coli, 76 Partington’s syndrome, 72 Phenotype and genotype, 1 pattern formation, 111 phenotypes Pax2–8, 101 as an emergent property, 15 PAX3/6/7, 69, 73, 101 as emergent properties, 14 Perl, too slow, 94 collapsing, 36 permutation test, 20, 39 collection of, 11, 34–37, 90 peroxisomal matrix, protein im- definition, 3 port into, see protein im- distance, 56 port into peroxisomal ma- distinction from diseases, 3 trix homology, 10 peroxisomes, reduced number, see mapping between genotype reduced number of peroxi- and, 1, 6, 14 somes quantitative versus qualitative, pharmacologically-induced seizures 4 (mouse phenotype), 73–76 synthetic, 3

162 unpredictable, 64 RAX2, 69 phosphogluconate dehydrogenase, RB1, 14 see Pgd reciprocal best hit, 21–26 Phox2a/b, 101 recommendation functions, see dis- PIGA, 18, 19 tance functions pigmentation, abnormal, see ab- red light, response to, see response normal pigmentation to red light PIK3CA, 19 reduced number of peroxisomes, 24 Pitx1–3, 101 Refsum disease, 24, 28 pleiotropy, 6, 9, 10 regulation of flower development, Plotrb, 129 see negative regulation of posterior semicircular canal, see flower development absent posterior semicircu- regulation of gene expression by lar canal genetic imprinting (plant POU4F3, 79 phenotype), 69 Prince, John, 123, 127, 133, 149 response to auxin stimulus, 108 PRKCI, 111, 112 response to light stimulus, see also probability of phenolog occurring response to red light, 108 by chance, 20, 39 response to metal ion, 103, 105 programming languages, see Ruby response to methylmercury, 105 prolonged QT interval (mouse phe- response to organic substance, 103 notype), 77 response to red light, 24, 106 protein import into peroxisomal response to salt stress, 100 matrix, 24 response to sucrose stimulus, 100 Protovis, see Rubyvis (gem) response to vernalization, 79–84 Proud syndrome, 72 candidate genes, 83 PRRX1, 69 cross-validation of predictions, Psmd4, 101 81 PTOV1, 106 response to water deprivation, 106 Python science libraries, 133 retinal degeneration, 21 retinitis, 21 RAD1/21/51/54L, 19 retinoblastoma, 14, 16 radiation response, see gamma ra- RNA, 4 diation, response to RNA interference Rax, 101 data, collection of, 35

163 Ruby SEH1L, 18, 19 Apache Commons Math and, seizures, see also epilepsy 126 kainate-induced, 74 as free software, 125, 126 pharmacologically-induced, see ATLAS and, 126, 131 pharmacologically-induced C/C++ and, 126 seizures (mouse pheno- extending, 128 type) GSL and, 126, 132 semicircular canal, see absent pos- high-tech startups using, 125 terior semicircular canal Java and, 126 SGR2, 48 LAPACK and, 131, 132 shade avoidance, 24 license history, 126 similarity functions, see distance virtual machine implementa- functions tions, 127–128 Singh-Blom, Ulf Martin, 45 Ruby Association, 131, 133, 149 SIRT1, 27 Ruby projects, permissive licensing Slc1a2/3, 74 of, 126 SLC22A, 19 Ruby Science Foundation, 127 small hippocampus, 21 Ruby/GSL, 149 SMO, 109, 111, 112 Ruby/GSL (gem), 132 Smoothened, see SMO Rubyvis (gem), 129 snai2-a, 29 S1PR2, 79 social investigation, abnormal, see Saccharomyces Genome Database, abnormal social investiga- 35 tion salt stress, response to, see response SOD2, 103, 105 to salt stress sodium channel, voltage-gated, see SciPy, see Python science libraries SCN5A SciRuby, 11, 123–133 somitogenesis, 111 statistics, 128 source code, availability of, 116 SCN5A, 79 SOX13, 27 SCUBE1/3, 109, 111, 112 species data, correlation, 64 Sebox, 101 sperm flagella, absent, see absent SEC23, 30 sperm flagella SEC23IP, 29, 30, 47, 48 spleen hypoplasia, 24

164 statistics tools, Ruby, see Statsam- tumor autophagy, see autophagic ple (gem) tumor stroma model of Statsample (gem), 128 cancer striated muscle cell development, twist, 29 111 Strinz, Will, 130, 134 UCP1, 100, 101 STX7/12, 30, 48 UCP3, 106 uge, 24 sucrose stimulus, response to, see response to sucrose stimu- VAM3, 48 lus vasculature formation, see angio- SV2A, 74 genesis SVIL, 19 ventricle size, see enlarged third synaptic vesicle glycoprotein, see ventricle SV2A vernalization, 84 Systematic discovery of phenologs, vernalization-mediated flowering, 20 see also response to vernal- ization, 81 TAIR, see Arabidopsis Information visualization tools, Ruby, 129–130 Resource, The, 91 VRN2, 84 Taylor, Ryan, 149 vulvae, see ectopic vulvae TFE genes (B/C/3 and MITF), 107 Waardenburg syndrome, see also thermogenesis, 101 hearing impairment, 9, 28, thyroid gland size, see enlarged 47, 48 thyroid glands Wailes, Chris, 123, 149 Tien, Matthew, 94 Wallingford, John, 12 time series, Statsample and, 128 Wan Zuhao, 134 Timin, Aleksej, 149 Wan Zuhao, 130 trichlormethine sensitivity (yeast water deprivation response, see re- phenotype), 69 sponse to water depriva- tris(2-chloroethyl)amine, see trichlorme- tion thine sensitivity (yeast WDHD1, 18, 19 phenotype) white adipose tissue, see increased TSG101, 19 brown adipose tissue amount TSPO/BRZPL1, 19 wild-type, 5

165 WNT5A/B, 111, 112 WormBase, 35 wortmannin, see increased resis- tance to wortmannin

X-linked conductive deafness, see also hearing impairment, 24

Zellweger syndrome, 24, 28

166 Bibliography

[1] Thomas R Insel. Translating scientific opportunity into public health impact: a strategic plan for research on mental illness. Archives of general psychiatry, 66(2):128–33, February 2009.

[2] Kriston L McGary, Tae Joo Park, John O Woods, Hye Ji Cha, John B Wallingford, and Edward M Marcotte. Systematic discovery of nonobvi- ous human disease models through orthologous phenotypes. Proceedings of the National Academy of Sciences of the United States of America, 107(14):6544–9, April 2010.

[3] Wilhelm Johannsen. The genotype conception of heredity. The Ameri- can naturalist, 45(531):129–159, 1911.

[4] H J Muller. The production of mutations by X-rays. Proceedings of the National Academy of Sciences of the United States of America, 14:714– 726, 1928.

[5] Charles Darwin. On the Origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. John Murray, London, 1859.

[6] Walter M Fitch. Distinguishing homologous from analogous proteins. Systematic Zoology, 19(2):99–113, June 1970.

167 [7] Maido Remm, C E Storm, and E L Sonnhammer. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. Journal of molecular biology, 314(5):1041–52, December 2001.

[8] Neil Shubin, Cliff Tabin, and Sean Carroll. Deep homology and the origins of evolutionary novelty. Nature, 457(7231):818–23, February 2009.

[9] Neil Shubin, Cliff Tabin, and Sean Carroll. Fossils, genes and the evo- lution of animal limbs. Nature, 388(6643):639–48, August 1997.

[10] L von Salvini-Plawen and E Mayr. On the evolution of photoreceptors and eyes. Evolutionary Biology, 10, 1977.

[11] K G Becker. The common genetic hypothesis of autoimmune/inflam- matory disease. Current opinion in allergy and clinical immunology, 1(5):399–405, October 2001.

[12] International Multiple Sclerosis Genetics Consortium. The expanding genetic overlap between multiple sclerosis and type I diabetes. Genes and immunity, 10(1):11–4, January 2009.

[13] Deborah J Smyth, Vincent Plagnol, Neil M Walker, Jason D Cooper, Kate Downes, Jennie H M Yang, Joanna M M Howson, Helen Stevens, Ross McManus, Cisca Wijmenga, Graham A Heap, Patrick C Dubois, David G Clayton, Karen A Hunt, David A van Heel, and John A Todd.

168 Shared and distinct genetic variants in type 1 diabetes and celiac disease. The New England journal of medicine, 359(26):2767–77, December 2008.

[14] V Binder. Genetic epidemiology in inflammatory bowel disease. Diges- tive diseases (Basel, Switzerland), 16(6):351–5, 1998.

[15] J Hampe, S Schreiber, S H Shaw, K F Lau, S Bridger, a J Macpherson, L R Cardon, H Sakul, T J Harris, a Buckler, J Hall, P Stokkers, S J van Deventer, P Nürnberg, M M Mirza, J C Lee, J E Lennard-Jones, C G Mathew, and M E Curran. A genomewide analysis provides evidence for novel linkages in inflammatory bowel disease in a large European cohort. American journal of human genetics, 64(3):808–16, March 1999.

[16] M M Griffiths, J A Encinas, E F Remmers, V K Kuchroo, and R L Wilder. Mapping autoimmunity genes. Current opinion in immunology, 11(6):689–700, December 1999.

[17] R L Wilder, E F Remmers, Y Kawahito, P S Gulko, G W Cannon, and M M Griffiths. Genetic factors regulating experimental arthritis in mice and rats. Current directions in autoimmunity, 1:121–65, January 1999.

[18] T P Dryja, W Cavenee, R White, J M Rapaport, R Petersen, D M Albert, and G A Bruns. Homozygosity of chromosome 13 in retino- blastoma. The New England journal of medicine, 310(9):550–3, March 1984.

169 [19] X Lu and H R Horvitz. lin-35 and lin-53, two genes that antagonize a C. elegans Ras pathway, encode proteins similar to Rb and its binding protein RbAp48. Cell, 95(7):981–91, December 1998.

[20] Yona Kassir, Ifat Rubin-Bejerano, and Yael Mandel-Gutfreund. The Saccharomyces cerevisiae GSK-3 beta homologs. Current drug targets, 7(11):1455–65, November 2006.

[21] Leung Kim and Alan R Kimmel. GSK3 at the edge: regulation of developmental specification and cell polarization. Current drug targets, 7(11):1411–9, November 2006.

[22] Karen J Liu, Joseph R Arron, Kryn Stankunas, Gerald R Crabtree, and Michael T Longaker. Chemical rescue of cleft palate and midline defects in conditional GSK-3 beta mice. Nature, 446(7131):79–82, March 2007.

[23] J Hodgkin, H R Horvitz, and S Brenner. Nondisjunction mutants of the nematode Caenorhabditis elegant. Genetics, 91(1):67–94, January 1979.

[24] Andrea L Richardson, Zhigang C Wang, Arcangela De Nicolo, Xin Lu, Myles Brown, Alexander Miron, Xiaodong Liao, J Dirk Iglehart, David M Livingston, and Shridar Ganesan. X chromosomal abnormalities in basal-like human breast cancer. Cancer cell, 9(2):121–32, February 2006.

[25] Joanna Amberger, Carol A Bocchini, Alan F Scott, and Ada Hamosh. McKusick’s Online Mendelian Inheritance in Man (OMIM). Nucleic

170 acids research, 37:D793–6, January 2009.

[26] Janan T Eppig, Judith A Blake, Carol J Bult, James A Kadin, and

Joel E Richardson. The mouse genome database (MGD): new features

facilitating a model system. Nucleic acids research, 35:D630–7, January

2007.

[27] Nansheng Chen, Todd W Harris, Igor Antoshechkin, Carol Bastiani,

Tamberlyn Bieri, Darin Blasiar, Keith Bradnam, Payan Canaran, Juan-

carlos Chan, Chao-Kung Chen, Wen J Chen, Fiona Cunningham, Paul

Davis, Eimear Kenny, Ranjana Kishore, Daniel Lawson, Raymond Lee,

Hans-Michael Muller, Cecilia Nakamura, Shraddha Pai, Philip Ozersky,

Andrei Petcherski, Anthony Rogers, Aniko Sabo, Erich M Schwarz, Kim-

berly Van Auken, Qinghua Wang, Richard Durbin, John Spieth, Paul W

Sternberg, and Lincoln D Stein. WormBase: a comprehensive data re-

source for Caenorhabditis biology and genomics. Nucleic acids research,

33:D383–9, January 2005.

[28] Selina S Dwight, Midori A Harris, Kara Dolinski, Catherine A Ball, Gail

Binkley, Karen R Christie, Dianna G Fisk, Laurie Issel-Tarver, Mark

Schroeder, Gavin Sherlock, Anand Sethuraman, Shuai Weng, David Bot-

stein, and J Michael Cherry. Saccharomyces Genome Database (SGD)

provides secondary gene annotation using the Gene Ontology (GO). Nu-

cleic acids research, 30(1):69–72, January 2002.

171 [29] Taro L Saito, Miwaka Ohtani, Hiroshi Sawai, Fumi Sano, Ayaka Saka,

Daisuke Watanabe, Masashi Yukawa, Yoshikazu Ohya, and Shinichi

Morishita. SCMD: Saccharomyces cerevisiae Morphological Database.

Nucleic acids research, 32:D319–22, January 2004.

[30] Kriston L McGary, Insuk Lee, and Edward M Marcotte. Broad network-

based predictability of Saccharomyces cerevisiae gene loss-of-function

phenotypes. Genome biology, 8(12):R258, January 2007.

[31] Maureen E Hillenmeyer, Eula Fung, Jan Wildenhain, Sarah E Pierce,

Shawn Hoon, William Lee, Michael Proctor, Robert P St Onge, Mike

Tyers, Daphne Koller, Russ B Altman, Ronald W Davis, Corey Nislow,

and Guri Giaever. The chemical genomic portrait of yeast: uncovering

a phenotype for all genes. Science, 320(5874):362–5, April 2008.

[32] Alison J Ross, Helen May-Simera, Erica R Eichers, Masatake Kai, Jose-

phine Hill, Daniel J Jagger, Carmen C Leitch, J Paul Chapple, Peter M

Munro, Shannon Fisher, Perciliz L Tan, Helen M Phillips, Michel R

Leroux, Deborah J Henderson, Jennifer N Murdoch, Andrew J Copp,

Marie-Madeleine Eliot, James R Lupski, David T Kemp, Hélène Dollfus,

Masazumi Tada, Nicholas Katsanis, Andrew Forge, and Philip L Beales.

Disruption of Bardet–Biedl syndrome ciliary proteins perturbs planar

cell polarity in vertebrates. Nature genetics, 37(10):1135–40, October

2005.

172 [33] Marie-France Demierre, Peter D R Higgins, Stephen B Gruber, Ernest Hawk, and Scott M Lippman. Statins and cancer prevention. Nature reviews. Cancer, 5(12):930–42, December 2005.

[34] Michael Potente, Laleh Ghaeni, Danila Baldessari, Raul Mostoslavsky, Lothar Rossig, Franck Dequiedt, Judith Haendeler, Marina Mione, Elis- abetta Dejana, Frederick W Alt, Andreas M Zeiher, and Stefanie Dim- meler. SIRT1 controls endothelial angiogenic functions during vascular growth. Genes & development, 21(20):2644–58, October 2007.

[35] Alexander V Ljubimov, Sergio Caballero, Annette M Aoki, Lorenzo A Pinna, Maria B Grant, and Raquel Castellon. Involvement of protein kinase CK2 in angiogenesis and retinal neovascularization. Investigative ophthalmology & visual science, 45(12):4583–91, December 2004.

[36] Chetan S Nayak and Glenn Isaacson. Worldwide distribution of Waar- denburg syndrome. The Annals of otology, rhinology, and laryngology, 112(9 Pt 1):817–20, September 2003.

[37] L Huang, Y M Kuo, and J Gitschier. The ?itpallid gene encodes a novel, syntaxin 13-interacting protein involved in platelet storage pool deficiency. Nature genetics, 23(3):329–32, November 1999.

[38] L L Theriault and L S Hurley. Ultrastructure of developing melanosomes in C57 black and pallid mice. Developmental biology, 23(2):261–75, October 1970.

173 [39] Simeon A Boyadjiev, J Christopher Fromme, Jin Ben, Samuel S Chong, Christopher Nauta, David J Hur, George Zhang, Susan Hamamoto, Randy Schekman, Mariella Ravazzola, Lelio Orci, and Wafaa Eyaid. Cranio-lenticulo-sutural dysplasia is caused by a SEC23A mutation lead- ing to abnormal endoplasmic-reticulum-to-Golgi trafficking. Nature ge- netics, 38(10):1192–7, October 2006.

[40] J Christopher Fromme, Mariella Ravazzola, Susan Hamamoto, Moha- mmed Al-Balwi, Wafaa Eyaid, Simeon A Boyadjiev, Pierre Cosson, Randy Schekman, and Lelio Orci. The genetic basis of a craniofacial disease provides insight into COPII coat assembly. Developmental cell, 13(5):623–34, November 2007.

[41] Insuk Lee, Zhihua Li, and Edward M Marcotte. An improved, bias- reduced probabilistic functional gene network of baker’s yeast, Saccha- romyces cerevisiae. PloS one, 2(10):e988, January 2007.

[42] Brian P Kelley, Roded Sharan, Richard M Karp, Taylor Sittler, David E Root, Brent R Stockwell, and Trey Ideker. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Pro- ceedings of the National Academy of Sciences of the United States of America, 100(20):11394–9, September 2003.

[43] Richard Owen. Lectures on comparative anatomy and physiology of the invertebrate animals. Longmans, Brown, Green and Longmans, London, 1843.

174 [44] Wallace Arthur. The emerging conceptual framework of evolutionary developmental biology. Nature, 415(6873):757–64, February 2002.

[45] Insuk Lee, Ben Lehner, Catriona Crombie, Wendy Wong, Andrew G Fraser, and Edward M Marcotte. A single gene network accurately pre- dicts phenotypic effects of gene perturbation in Caenorhabditis elegans. Nature genetics, 40(2):181–8, February 2008.

[46] David Swarbreck, Christopher Wilks, Philippe Lamesch, Tanya Z Berar- dini, Margarita Garcia-Hernandez, Hartmut Foerster, Donghui Li, Tom Meyer, Robert Muller, Larry Ploetz, Amie Radenbaugh, Shanker Singh, Vanessa Swing, Christophe Tissier, Peifen Zhang, and Eva Huala. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic acids research, 36:D1009–14, January 2008.

[47] Y Benjamini and Y Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: series B (methodological), 57(1):289–300, 1995.

[48] Jonathan R Karr, Jayodita C Sanghvi, Derek N Macklin, Miriam V Gutschow, Jared M Jacobs, Benjamin Bolival, Nacyra Assad-Garcia, John I Glass, and Markus W Covert. A whole-cell computational model predicts phenotype from genotype. Cell, 150(2):389–401, July 2012.

[49] A Varma and BO Palsson. Metabolic flux balancing: basic concepts, scientific and practical use. Bio/technology, 12:994–998, 1994.

175 [50] M W Covert, C H Schilling, and B Palsson. Regulation of gene ex- pression in flux balance models of metabolism. Journal of theoretical biology, 213(1):73–88, November 2001.

[51] MW Covert, EM Knight, JL Reed, MJ Herrgard, and BO Palsson. In- tegrating high-throughput and computational data elucidates bacterial networks. Nature, 429(May):92–96, 2004.

[52] Markus W Covert, Nan Xiao, Tiffany J Chen, and Jonathan R Karr. Integrating metabolic, transcriptional regulatory and signal transduction models in Escherichia coli. Bioinformatics, 24(18):2044–50, September 2008.

[53] Sriram Chandrasekaran and Nathan D Price. Probabilistic integrative modeling of genome-scale metabolic and regulatory networks in Esche- richia coli and Mycobacterium tuberculosis. Proceedings of the National Academy of Sciences of the United States of America, 107(41):17845–50, October 2010.

[54] Kai Wang, Samuel P Dickson, Catherine A Stolle, Ian D Krantz, David B Goldstein, and Hakon Hakonarson. Interpretation of association signals and identification of causal variants from genome-wide association stud- ies. American journal of human genetics, 86(5):730–42, May 2010.

[55] Casey S Greene and Olga G Troyanskaya. Chapter 2: Data-driven view of disease biology. PLoS computational biology, 8(12):e1002816, January 2012.

176 [56] Xin Wang, Wei Sun, Xilin Zhu, Liping Li, Xiaopan Wu, Hua Lin, Shuy- ing Zhu, Aihua Liu, Te Du, Yang Liu, Nifang Niu, Yuping Wang, and Ying Liu. Association between the gamma-aminobutyric acid type B re- ceptor 1 and 2 gene polymorphisms and mesial temporal lobe epilepsy in a Han Chinese population. Epilepsy research, 81(2-3):198–203, October 2008.

[57] Edward Glasscock, Jong W Yoo, Tim T Chen, Tara L Klassen, and Jef- frey L Noebels. Kv1.1 potassium channel deficiency reveals brain-driven cardiac dysfunction as a candidate mechanism for sudden unexplained death in epilepsy. The Journal of neuroscience : the official journal of the Society for Neuroscience, 30(15):5167–75, April 2010.

[58] Simon J B Butt, Vitor H Sousa, Marc V Fuccillo, Jens Hjerling-Leffler, Goichi Miyoshi, Shioko Kimura, and Gordon Fishell. The requirement of Nkx2-1 in the temporal specification of cortical interneuron subtypes. Neuron, 59(5):722–32, September 2008.

[59] Rinki Ratnapriya, Joseph Vijai, Jayaram S Kadandale, Rajesh S Iyer, Kurupath Radhakrishnan, and Anuranjan Anand. A locus for juvenile myoclonic epilepsy maps to 2q33–q36. Human genetics, 128(2):123–30, August 2010.

[60] U Wyneken, K H Smalla, J J Marengo, D Soto, A de la Cerda, W Tis- chmeyer, R Grimm, T M Boeckers, G Wolf, F Orrego, and E D Gun- delfinger. Kainate-induced seizures alter protein composition and N-

177 methyl-D-aspartate receptor function of rat forebrain postsynaptic den- sities. Neuroscience, 102(1):65–74, January 2001.

[61] Marine Douaud, Katia Feve, Fabienne Pituello, David Gourichon, Simon Boitard, Eric Leguern, Gérard Coquerelle, Agathe Vieaud, Cesira Batini, Robert Naquet, Alain Vignal, Michèle Tixier-Boichard, and Frédérique Pitel. Epilepsy caused by an abnormal alternative splicing with dosage effect of the SV2A gene in a chicken model. PloS one, 6(10):e26932, January 2011.

[62] Rafal M Kaminski, Michel Gillard, Karine Leclercq, Etienne Hanon, Geneviève Lorent, Donald Dassesse, Alain Matagne, and Henrik Klit- gaard. Proepileptic phenotype of SV2A-deficient mice is associated with reduced anticonvulsant efficacy of levetiracetam. Epilepsia, 50(7):1729– 40, July 2009.

[63] R Janz, Y Goda, M Geppert, M Missler, and T C Südhof. SV2A and SV2B function as redundant Ca2+ regulators in neurotransmitter release. Neuron, 24(4):1003–16, December 1999.

[64] K M Crowder, J M Gunther, T a Jones, B D Hale, H Z Zhang, M R Peterson, R H Scheller, C Chavkin, and S M Bajjalieh. Abnormal neurotransmission in mice lacking synaptic vesicle protein 2A (SV2A). Proceedings of the National Academy of Sciences of the United States of America, 96(26):15268–73, December 1999.

178 [65] G Bagetta, M T Corasaniti, M Iannone, G Nisticò, and J D Stephenson. Production of limbic motor seizures and brain damage by systemic and intracerebral injections of paraquat in rats. Pharmacology & toxicology, 71(6):443–8, December 1992.

[66] A De Sarro, D Ammendola, M Zappala, S Grasso, and G B De Sarro. Relationship between structure and convulsant properties of some beta- lactam antibiotics following intracerebroventricular microinjection in rats. Antimicrobial agents and chemotherapy, 39(1):232–7, January 1995.

[67] G Barger and H H Dale. Chemical structure and sympathomimetic action of amines. The Journal of physiology, 41(1-2):19–59, October 1910.

[68] Nini Skovgaard, Kate Mø ller, Hans Gesser, and Tobias Wang. His- tamine induces postprandial tachycardia through a direct effect on car-

diac H2-receptors in pythons. American journal of physiology: regu- latory, integrative and comparative physiology, 296(3):R774–85, March 2009.

[69] K Shigenobu, H Tatsuno, N Matsuki, T Oshima, and Y Kasuya. Elec- trophysiological and mechanical studies on the cardiac effects of a his-

tamine H2 receptor antagonist, cimetidine, in the isolated guinea pig myocardium. J Pharm Dyn, 2:141–150, 1979.

[70] Roberto Levi, James R Malm, F O Bowman, and Michael R Rosen. The

179 arrhythmogenic actions of histamine on human atrial fibers. Circulation research, 49(2):545–50, August 1981.

[71] Gonghao He, Jing Hu, Teng Li, Xue Ma, Jingru Meng, Min Jia, Jun Lu, Hiroshi Ohtsu, Zhong Chen, and Xiaoxing Luo. The arrhythmo- genic effect of sympathetic histamine in mouse hearts subjected to acute ischemia. Molecular medicine, October 2011.

[72] Muna Hammadi, Mohamed Adi, Rony John, Ghalia A K Khoder, and Sherif M Karam. Dysregulation of gastric H+,K+-ATPase by cigarette smoke extract. World journal of gastroenterology : WJG, 15(32):4016– 22, August 2009.

[73] Chinmay M Trivedi, Thomas P Cappola, Kenneth B Margulies, and Jonathan A Epstein. Homeodomain only protein x is down-regulated in human heart failure. Journal of molecular and cellular cardiology, 50(6):1056–8, June 2011.

[74] Patrick T Ellinor, Vadim I Petrov-Kondratov, Elena Zakharova, Ed- win G Nam, and Calum A MacRae. Potassium channel gene mutations rarely cause atrial fibrillation. BMC medical genetics, 7:70, January 2006.

[75] Li-Xin Xu, Wei-Yu Yang, Huai-Qin Zhang, Zhi-Hua Tao, and Cheng- Cheng Duan. [Study on the correlation between CETP TaqIB, KCNE1

180 S38G and eNOS T-786C gene polymorphisms for predisposition and non- valvular atrial fibrillation]. Zhonghua liu xing bing xue za zhi, 29(5):486– 92, May 2008.

[76] Juan Yao, Yi-tong Ma, Xiang Xie, Fen Liu, Bang-dang Chen, and Yong An. [Association of rs1805127 polymorphism of KCNE1 gene with atrial fibrillation in Uigur population of Xinjiang]. Chinese journal of medical genetics, 28(4):436–40, August 2011.

[77] E C Beyer, D L Paul, and D a Goodenough. Connexin43: a protein from rat heart homologous to a gap junction protein from liver. The Journal of cell biology, 105(6 Pt 1):2621–9, December 1987.

[78] B Delorme, E Dahl, T Jarry-Guichard, I Marics, J P Briand, K Willecke, D Gros, and M Théveniau-Ruissy. Developmental regulation of connexin 40 gene expression in mouse heart correlates with the differentiation of the conduction system. Developmental dynamics : an official publication of the American Association of Anatomists, 204(4):358–71, December 1995.

[79] Xianming Lin, Joanna Gemel, Aaron Glass, Christian W Zemlin, Eric C Beyer, and Richard D Veenstra. Connexin40 and connexin43 determine gating properties of atrial gap junction channels. Journal of molecular and cellular cardiology, 48(1):238–45, January 2010.

[80] G Trevor Cottrell and Janis M Burt. Heterotypic gap junction channel

181 formation between heteromeric and homomeric Cx40 and Cx43 connex- ins. American journal of physiology: cell physiology, 281(5):C1559–67, November 2001.

[81] A G Reaume, P A de Sousa, S Kulkarni, B L Langille, D Zhu, T C Davies, S C Juneja, G M Kidder, and J Rossant. Cardiac malformation in neonatal mice lacking connexin43. Science, 267(5205):1831–4, March 1995.

[82] Jari M Tuomi, Karel Tyml, and Douglas L Jones. Atrial tachycar- dia/fibrillation in the connexin 43 G60S mutant (Oculodentodigital dys- plasia) mouse. American journal of physiology: heart and circulatory physiology, 300(4):H1402–11, April 2011.

[83] Isabelle L Thibodeau, Ji Xu, Qiuju Li, Gele Liu, Khanh Lam, John P Veinot, David H Birnie, Douglas L Jones, Andrew D Krahn, Robert Lemery, Bruce J Nicholson, and Michael H Gollob. Paradigm of genetic mosaicism and lone atrial fibrillation: physiological characterization of a connexin 43-deletion mutant identified from atrial tissue. Circulation, 122(3):236–44, July 2010.

[84] Päivi J Laitinen-Forsblom, Pekka Mäkynen, Heikki Mäkynen, Sinikka Yli-Mäyry, Vesa Virtanen, Kimmo Kontula, and Katriina Aalto-Setälä. SCN5A mutation associated with cardiac conduction defect and atrial arrhythmias. Journal of cardiovascular electrophysiology, 17(5):480–5, May 2006.

182 [85] Dawood Darbar, Prince J Kannankeril, Brian S Donahue, Gayle Kucera, Tanya Stubblefield, Jonathan L Haines, Alfred L George, and Dan M Ro- den. Cardiac sodium channel (SCN5A) variants associated with atrial fibrillation. Circulation, 117(15):1927–35, April 2008.

[86] L Chen, W Zhang, C Fang, S Jiang, C Shu, H Cheng, F Li, and H Li. Polymorphism H558R in the human cardiac sodium channel SCN5A gene is associated with atrial fibrillation. The Journal of international medical research, 39(5):1908–16, January 2011.

[87] G R Fraser, P Froggatt, and T N James. Congenital deafness associ- ated with electrocardiographic abnormalities, fainting attacks and sud- den death: a recessive syndrome. The Quarterly journal of medicine, 33:361–85, July 1964.

[88] G R Fraser, P Froggatt, and T Murphy. Genetical aspects of teh cardio- auditory syndrome of Jervell and Lange–Nielsen (Congenital deafness and electrocardiographic abnormalities). Annals of human genetics, 28:133–57, November 1964.

[89] Peter J Schwartz. The long QT syndrome. Current problems in cardi- ology, 22(6):297–351, June 1997.

[90] Peter J Schwartz, Carla Spazzolini, Lia Crotti, Jørn Bathen, Jan P Amlie, Katherine Timothy, Maria Shkolnikova, Charles I Berul, Maria Bitner-Glindzicz, Lauri Toivonen, Minoru Horie, Eric Schulze-Bahr, and

183 Isabelle Denjoy. The Jervell and Lange–Nielsen syndrome: natural his- tory, molecular basis, and clinical outcome. Circulation, 113(6):783–90, February 2006.

[91] N Neyroud, F Tesson, I Denjoy, M Leibovici, C Donger, J Barhanin, S Fauré, F Gary, P Coumel, C Petit, K Schwartz, and P Guicheney. A novel mutation in the potassium channel gene KVLQT1 causes the Jervell and Lange–Nielsen cardioauditory syndrome. Nature genetics, 15(2):186–9, February 1997.

[92] E Schulze-Bahr, Q Wang, H Wedekind, W Haverkamp, Q Chen, Y Sun, C Rubie, M Hördt, J A Towbin, M Borggrefe, G Assmann, X Qu, J C Somberg, G Breithardt, C Oberti, and H Funke. KCNE1 mutations cause Jervell and Lange–Nielsen syndrome. Nature genetics, 17(3):267– 8, November 1997.

[93] Sami Gritli, Mamia Ben Salah, Abdessalem Shili, Caroline D Robson, Mohamed Ferjaoui, Lotfi Hendaoui, Ali Belhani, Sarrah Baltagi-Ben Ji- lani, James F Gusella, and Calum A Macrae. Association of the long QT syndrome with goiter and deafness. The American journal of cardiology, 105(5):681–6, March 2010.

[94] John W Belmont, William Craigen, Hugo Martinez, and John Lynn Jefferies. Genetic disorders with both hearing loss and cardiovascular abnormalities. Advances in oto-rhino-laryngology, 70:66–74, January 2011.

184 [95] The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408(6814):796–815, December 2000.

[96] Yindee Chanvivattana, Anthony Bishopp, Daniel Schubert, Christine Stock, Yong-Hwan Moon, Z Renee Sung, and Justin Goodrich. Inter- action of Polycomb-group proteins controlling flowering in Arabidopsis. Development (Cambridge, England), 131(21):5263–76, November 2004.

[97] M Koornneef, C J Hanhart, and J H van der Veen. A genetic and physiological analysis of late flowering mutants in Arabidopsis thaliana. Molecular & general genetics : MGG, 229(1):57–66, September 1991.

[98] Ruth K Genger, W James Peacock, Elizabeth S Dennis, and E Jean Finnegan. Opposing effects of reduced DNA methylation on flowering time in Arabidopsis thaliana. Planta, 216(3):461–6, January 2003.

[99] Detlef Weigel, J H Ahn, M A Blázquez, J O Borevitz, S K Christensen, C Fankhauser, C Ferrándiz, I Kardailsky, E J Malancharuvil, M M Neff, J T Nguyen, S Sato, Z Y Wang, Y Xia, R A Dixon, M J Harrison, C J Lamb, M F Yanofsky, and J Chory. Activation tagging in Arabidopsis. Plant physiology, 122(4):1003–13, April 2000.

[100] Mitsutomo Abe, Hiroshi Katsumata, Yoshibumi Komeda, and Taku Takahashi. Regulation of shoot epidermal cell differentiation by a pair of homeodomain proteins in Arabidopsis. Development, 130(4):635–43, February 2003.

185 [101] Yoko Ikeda, Yasushi Kobayashi, Ayako Yamaguchi, Mitsutomo Abe, and

Takashi Araki. Molecular basis of late-flowering phenotype caused by

dominant epi-alleles of the FWA locus in Arabidopsis. Plant & cell

physiology, 48(2):205–20, February 2007.

[102] Rebecca A Green, Huey-Ling Kao, Anjon Audhya, Swathi Arur, Jona-

than R Mayers, Heidi N Fridolfsson, Monty Schulman, Siegfried Schloiss-

nig, Sherry Niessen, Kimberley Laband, Shaohe Wang, Daniel A Starr,

Anthony A Hyman, Tim Schedl, Arshad Desai, Fabio Piano, Kristin C

Gunsalus, and Karen Oegema. A high-resolution C. elegans essential

gene network based on phenotypic profiling of a complex tissue. Cell,

145(3):470–82, April 2011.

[103] Robert J Nichols, Saunak Sen, Yoe Jin Choo, Pedro Beltrao, Matylda

Zietek, Rachna Chaba, Sueyoung Lee, Krystyna M Kazmierczak, Karis J

Lee, Angela Wong, Michael Shales, Susan Lovett, Malcolm E Winkler,

Nevan J Krogan, Athanasios Typas, and Carol A Gross. Phenotypic

landscape of a bacterial cell. Cell, 144(1):143–56, January 2011.

[104] Susan Tweedie, Michael Ashburner, Kathleen Falls, Paul Leyland, Peter

McQuilton, Steven Marygold, Gillian Millburn, David Osumi-Sutherland,

Andrew Schroeder, Ruth Seal, and Haiyan Zhang. FlyBase: enhanc-

ing Drosophila Gene Ontology annotations. Nucleic acids research,

37:D555–9, January 2009.

186 [105] Judy Sprague, Leyla Bayraktaroglu, Dave Clements, Tom Conlin, David Fashena, Ken Frazer, Melissa Haendel, Douglas G Howe, Prita Mani, Sridhar Ramachandran, Kevin Schaper, Erik Segerdell, Peiran Song, Brock Sprunger, Sierra Taylor, Ceri E Van Slyke, and Monte Wester- field. The Zebrafish Information Network: the zebrafish model organism database. Nucleic acids research, 34:D581–5, January 2006.

[106] George W Bell, Tatiana a Yatskievych, and Parker B Antin. GEISHA, a whole-mount in situ hybridization gene expression screen in chicken em- bryos. Developmental dynamics : an official publication of the American Association of Anatomists, 229(3):677–87, March 2004.

[107] D G Nicholls, V S Bernson, and G M Heaton. The identification of the component in the inner membrane of brown adipose tissue mitochondria responsible for regulating energy dissipation. Experientia. Supplemen- tum, 32:89–93, January 1978.

[108] R B McFee, T R Caraccio, M A McGuigan, S A Reynolds, and P Bel- langer. Dying to be thin: a dinitrophenol related fatality. Veterinary and human toxicology, 46(5):251–4, October 2004.

[109] Estuardo J Miranda, Iain M McIntyre, Dawn R Parker, Ray D Gary, and Barry K Logan. Two deaths attributed to the use of 2,4-dinitrophenol. Journal of analytical toxicology, 30(3):219–22, April 2006.

[110] Areta Pankiewicz, Tomasz Sledzinski, Anna Nogalska, and Julian Swier- czynski. Tissue specific, sex and age-related differences in the 6-phospho-

187 gluconate dehydrogenase gene expression. The International Journal of Biochemistry & Cell Biology, 35(2):235–45, March 2003.

[111] M L Puerta, M P Nava, M Abelenda, and A Fernández. Inactivation of brown adipose tissue thermogenesis by oestradiol treatment in cold- acclimated rats. Pflügers Archiv : European journal of physiology, 416(6):659–62, August 1990.

[112] Kieran J Clarke, Alison E Adams, Lars H Manzke, Terry W Pearson, Christoph H Borchers, and Richard K Porter. A role for ubiquitinylation and the cytosolic proteasome in turnover of mitochondrial uncoupling protein 1 (UCP1). Biochimica et biophysica acta, 1817(10):1759–67, October 2012.

[113] Matthew K McElwee, Lindsey A Ho, Jeff W Chou, Marjolein V Smith, and Jonathan H Freedman. Comparative toxicogenomic responses of mercuric and methyl-mercury. BMC genomics, 14(1):698, October 2013.

[114] Thomas W Clarkson and Laszlo Magos. The toxicology of mercury and its chemical compounds. Critical reviews in toxicology, 36(8):609–62, September 2006.

[115] Marcelo Farina, Michael Aschner, and João B T Rocha. Oxidative stress in MeHg-induced neurotoxicity. Toxicology and applied pharmacology, 256(3):405–17, November 2011.

188 [116] Marcelo Farina, João B T Rocha, and Michael Aschner. Oxidative Stress and Methylmercury-induced Neurotoxicity. In Cheng Wang and William Slikker, editors, Developmental Neurotoxicology Research: Prin- ciples, Models, Techniques, Strategies, and Mechanisms, chapter 18, pages 357–385. John Wiley & Sons, Hoboken, New Jersey, 2011.

[117] M Esther Gallardo, Raquel Moreno-Loshuertos, Celia López, Mercedes Casqueiro, Javier Silva, Felix Bonilla, Santiago Rodríguez de Córdoba, and José Antonio Enríquez. m.6267G>A: a recurrent mutation in the human mitochondrial DNA that reduces cytochrome c oxidase activity and is associated with tumors. Human mutation, 27(6):575–82, June 2006.

[118] J C Reed, T Miyashita, S Takayama, H G Wang, T Sato, Stanislaw Krajewski, C Aimé-Sempé, Sharon Bodrug, Shinichi Kitada, and Motoi Hanada. BCL-2 family proteins: regulators of cell death involved in the pathogenesis of cancer and resistance to therapy. Journal of cellular biochemistry, 60(1):23–32, January 1996.

[119] Ubaldo E. Martinez-Outschoorn, Diana Whitaker-Menezes, Stephanos Pavlides, Barbara Chiavarina, Gloria Bonuccelli, Trimmer Casey, Aris- totelis Tsirigos, Gemma Migneco, Agnieszka Witkiewicz, Renee Balliet, Isabelle Mercier, Chengwang Wang, Neal Flomenberg, Anthony How- ell, Zhao Lin, Jaime Caro, Richard G. Pestell, Federica Sotgia, and Michael P. Lisanti. The autophagic tumor stroma model of cancer or

189 “battery-operated tumor growth:” A simple solution to the autophagy paradox. Cell cycle, 9(21):4297–306, November 2010.

[120] Casey Trimmer, Federica Sotgia, Diana Whitaker-Menezes, Renee M. Balliet, Gregory Eaton, Ubaldo E. Martinez-Outschoorn, Stephanos Pav- lides, Anthony Howell, Renato V. Iozzo, Richard G. Pestell, Philipp E. Scherer, Franco Capozza, and Michael P. Lisanti. Caveolin–1 and mito- chondrial SOD2 (MnSOD) function as tumor suppressors in the stromal microenvironment: A new genetically tractable model for human can- cer associated fibroblasts. Cancer biology & therapy, 11(4):383–394, February 2011.

[121] Jesús M Salvador, Joshua D Brown-Clay, and Albert J Fornace. Gadd45 in stress signaling, cell cycle control, and apoptosis. Advances in exper- imental medicine and biology, 793:1–19, January 2013.

[122] Florence Lerebours, Sophie Vacher, Catherine Andrieu, Marc Espie, Michel Marty, Rosette Lidereau, and Ivan Bieche. NF-kappa B genes have a major role in inflammatory breast cancer. BMC cancer, 8:41, January 2008.

[123] Alejandro Leal, Kathrin Huehne, Finn Bauer, Arif Ekici, Francesca Pa- sutto, and Sabine Endele. Identification of the variant Ala335Val of MED25 as responsible for CMT2B2: molecular data , functional studies of the SH3 recognition motif and correlation between wild-type MED25

190 and PMP22 RNA levels in CMT1A animal models. Neurogenetics, 10:275–287, 2009.

[124] Xiaolin Gao, Peri Tate, Ping Hu, Robert Tjian, William C Skarnes, and Zhong Wang. ES cell pluripotency and germ-layer formation require the SWI/SNF chromatin remodeling component BAF250a. Proceedings of the National Academy of Sciences of the United States of America, 105(18):6656–61, May 2008.

[125] Gijs W E Santen, Emmelien Aten, Anneke T Vulto-van Silfhout, Car- oline Pottinger, Bregje W M van Bon, Ivonne J H M van Minderhout, Ronelle Snowdowne, Christian a C van der Lans, Merel Boogaard, Mar- got M L Linssen, Linda Vijfhuizen, Michiel J R van der Wielen, M J Ellen Vollebregt, Martijn H Breuning, Marjolein Kriek, Arie van Haeringen, Johan T den Dunnen, Alexander Hoischen, Jill Clayton-Smith, Bert B A de Vries, Raoul C M Hennekam, Martine J van Belzen, Mariam Almur- eikhi, Anwar Baban, Mafalda Barbosa, Tawfeg Ben-Omran, Katherine Berry, Stefania Bigoni, Odile Boute, Louise Brueton, Ineke van der Burgt, Natalie Canham, Kate E Chandler, Krystyna Chrzanowska, Ama- nda L Collins, Teresa de Toni, John Dean, Nicolette S den Hollander, Leigh Anne Flore, Alan Fryer, Alice Gardham, John M Graham, Victo- ria Harrison, Denise Horn, Marjolijn C Jongmans, Dragana Josifova, Sa- rina G Kant, Seema Kapoor, Helen Kingston, Usha Kini, Tjitske Kleef- stra, Małgorzata Krajewska-Walasek, Nancy Kramer, Saskia M Maas,

191 Patricia Maciel, Grazia M S Mancini, Isabelle Maystadt, Shane McKee, Jeff M Milunsky, Sheela Nampoothiri, Ruth Newbury-Ecob, Sarah M Nikkel, Michael J Parker, Luis a Pérez-Jurado, Stephen P Robertson, Caroline Rooryck, Debbie Shears, Margherita Silengo, Ankur Singh, Robert Smigiel, Gabriela Soares, Miranda Splitt, Helen Stewart, Eliza- beth Sweeney, May Tassabehji, Beyhan Tuysuz, Albertien M van Eerde, Catherine Vincent-Delorme, Louise C Wilson, and Gozde Yesil. Coffin- Siris Syndrome and the BAF Complex: Genotype–Phenotype Study in 63 Patients. Human mutation, August 2013.

[126] B Husse and G Isenberg. CREB expression in cardiac fibroblasts and CREM expression in ventricular myocytes. Biochemical and biophysical research communications, 334(4):1260–5, September 2005.

[127] Zhongping Lu and Michael N Sack. ATF-1 is a hypoxia-responsive tran- scriptional activator of skeletal muscle mitochondrial-uncoupling protein 3. The Journal of biological , 283(34):23410–8, August 2008.

[128] Carmine Spampanato, Erin Feeney, Lishu Li, Monica Cardone, Jeong-A Lim, Fabio Annunziata, Hossein Zare, Roman Polishchuk, Rosa Puertol- lano, Giancarlo Parenti, Andrea Ballabio, and Nina Raben. Transcrip- tion factor EB (TFEB) is a new therapeutic target for Pompe disease. EMBO molecular medicine, 5(5):691–706, May 2013.

[129] Jose A Martina and Rosa Puertollano. RRAG GTPases link nutri- ent availability to gene expression, autophagy and lysosomal biogenesis.

192 Autophagy, 9(6):928–30, June 2013.

[130] L Gallelli, M Ferraro, V Spagnuolo, P Rende, G F Mauro, and G De Sarro. Rosuvastatin-induced rhabdomyolysis probably via CYP2C9 sat- uration. Drug metabolism and drug interactions, 24(1):83–7, January 2009.

[131] Chikako Ishikawa, Hiroshi Ozaki, Toshiaki Nakajima, Toshihiro Ishii, Saburo Kanai, Saeko Anjo, Kohji Shirai, and Ituro Inoue. A frameshift variant of CYP2C8 was identified in a patient who suffered from rhab- domyolysis after administration of cerivastatin. Journal of human ge- netics, 49(10):582–5, January 2004.

[132] Srecko Marusic, Ante Lisicic, Ivica Horvatic, Vesna Bacic-Vrca, and Nada Bozina. Atorvastatin-related rhabdomyolysis and acute renal fail- ure in a genetically predisposed patient with potential drug-drug interac- tion. International journal of clinical pharmacy, 34(6):825–7, December 2012.

[133] Makoto Araki, Masatomo Maeda, and Kiyoto Motojima. Hydropho- bic statins induce autophagy and cell death in human rhabdomyosar- coma cells by depleting geranylgeranyl diphosphate. European journal of pharmacology, 674(2-3):95–103, January 2012.

[134] Jon D Duke, Xu Han, Zhiping Wang, Abhinita Subhadarshini, Shreyas D Karnik, Xiaochun Li, Stephen D Hall, Yan Jin, J Thomas Callaghan,

193 Marcus J Overhage, David A Flockhart, R Matthew Strother, Sara K Quinney, and Lang Li. Literature based drug interaction prediction with clinical assessment using electronic medical records: novel myopathy as- sociated drug interactions. PLoS computational biology, 8(8):e1002614, January 2012.

[135] Defeng Wu, Xiaodong Wang, Richard Zhou, and Arthur Cederbaum. CYP2E1 enhances ethanol-induced lipid accumulation but impairs au- tophagy in HepG2 E47 cells. Biochemical and biophysical research com- munications, 402(1):116–22, November 2010.

[136] Defeng Wu, Xiaodong Wang, Richard Zhou, Lili Yang, and Arthur I Cederbaum. Alcohol steatosis and cytotoxicity: the role of cytochrome P4502E1 and autophagy. Free radical biology & medicine, 53(6):1346– 57, September 2012.

[137] Toshihiko Aki, Takeshi Funakoshi, Kana Unuma, and Koichi Uemura. Impairment of autophagy: From hereditary disorder to drug intoxica- tion. Toxicology, 311(3):205–15, September 2013.

[138] Jill A Rosenfeld, Blake C Ballif, Donna M Martin, Arthur S Aylsworth, Bassem A Bejjani, Beth S Torchia, and Lisa G Shaffer. Clinical char- acterization of individuals with deletions of genes in holoprosencephaly pathways by aCGH refines the phenotypic spectrum of HPE. Human genetics, 127(4):421–40, April 2010.

194 [139] Sandra Mercier, Véronique David, Leslie Ratié, Isabelle Gicquel, Sylvie Odent, and Valérie Dupé. NODAL and SHH dose-dependent double inhibition promotes an HPE-like phenotype in chick embryos. Disease models & mechanisms, 6(2):537–43, March 2013.

[140] David M McKean and Lee Niswander. Defects in GPI biosynthesis perturb Cripto signaling during forebrain development in two new mouse models of holoprosencephaly. Biology open, 1(9):874–83, September 2012.

[141] W Shawlot, M Wakamiya, K M Kwan, A Kania, T M Jessell, and R R Behringer. Lim1 is required in both primitive streak-derived tissues and visceral endoderm for head formation in the mouse. Development (Cambridge, England), 126(22):4925–32, November 1999.

[142] Achira Roy, Miriam Gonzalez-Gomez, Alessandra Pierani, Gundela Meyer, and Shubha Tole. Lhx2 regulates the development of the forebrain hem system. Cerebral cortex, pages 1–12, January 2013.

[143] Cheng-Fen Tu, Yu-Ting Yan, Szu-Yao Wu, Bambang Djoko, Ming-Tzu Tsai, Chien-Jui Cheng, and Ruey-Bing Yang. Domain and functional analysis of a novel platelet-endothelial cell surface protein, SCUBE1. The Journal of biological chemistry, 283(18):12478–88, May 2008.

[144] John O Woods, Ulf Martin Singh-Blom, Jon M Laurent, Kriston L Mc- Gary, and Edward M Marcotte. Prediction of gene–phenotype associa-

195 tions in humans, mice, and plants using phenologs. BMC bioinformatics, 14(203):1–17, January 2013.

[145] John P a Ioannidis. Why most published research findings are false. PLoS medicine, 2(8):e124, August 2005.

[146] About Ruby. Retrieved from https://www.ruby-lang.org/en/about/ on 4 November 2013.

[147] Ruby on Rails. Retrieved from http://rubyonrails.org on 4 November 2013.

[148] David H Hansson. Ruby on Rails, 2004.

[149] R Clint Whaley and Jack J Dongarra. Automatically Tuned Lin- ear Algebra Software. Supercomputing ’98: Proceedings of the 1998 ACM/IIEEE conference on Supercomputing, pages 1–27, 1998.

[150] Commons Math Developers. Apache Commons Math. Available from

http://commons.apache.org/math/download_math.cgi.

[151] M Galassi, J Davies, J Theiler, B Gough, G Jungman, P Alken, M Booth, and F Rossi. GNU Scientific Library Reference Manual. Network Theory Limited, 3 edition, 2009.

[152] Aaron Williamson. Licensing of software on github: A quantitative

analysis. Available from http://www.softwarefreedom.org/resources/

2013/lcs-slides-aaronw/#/rm-chart, 2013.

196 [153] Caroline J Savage and Andrew J Vickers. Empirical study of data sharing by authors publishing in PLoS journals. PLoS one, 4(9):e7078, January 2009.

[154] Alawi A Alsheikh-Ali, Waqas Qureshi, Mouaz H Al-Mallah, and John P A Ioannidis. Public Availability of Published Research Data in High- Impact Journals. PLoS one, 6(9), September 2011.

[155] Timothy H Vines, Rose L Andrew, Dan G Bock, Michelle T Franklin, Kimberly J Gilbert, Nolan C Kane, Jean-Sebastien Moore, Brook T Moyers, Sebastien Renaut, Diana J Rennison, Thor Veen, and Sam Yea- man. Mandated data archiving greatly improves access to research data. FASEB journal, 27(4):1304–1308, April 2013.

[156] Carol Tenopir, Suzie Allard, Kimberly Douglass, Arsev Umur Aydinoglu, Lei Wu, Eleanor Read, Maribeth Manoff, and Mike Frame. Data sharing by scientists: practices and perceptions. PLoS one, 6(6), June 2011.

[157] Koichi Sasada. YARV: Yet Another RubyVM: Innovating the Ruby Interpreter. In Object-oriented programming, systems, languages, and applications. 2005.

[158] Shin-ichiro Hara. Statistics2: Statistical distributions for Ruby, 2009.

[159] Claudio Bustos. Rubyvis: Ruby port of protovis. Available from https:

//github.com/clbustos/rubyvis, 2010.

197 [160] Michael Bostock and Jeffrey Heer. Protovis: a graphical toolkit for vi- sualization. IEEE transactions on visualization and computer graphics, 15(6):1121–8, 2009.

[161] Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D3: Data-Driven Documents. IEEE transactions on visualization and computer graphics, 17(12):2301–9, December 2011.

[162] Wan Zuhao. Plotrb: Vega/d3-based plotting gem for ruby. Available

from https://github.com/SciRuby/plotrb, 2013.

[163] Vega: a visualization grammar. Available from https://github.com/

trifacta/vega, 2013.

[164] Adam Beynon. Opal. Available from http://opalrb.org, 2013.

[165] Sparql 1.1 overview: W3c recommendation. Available from http://www.

w3.org/TR/2013/REC-sparql11-overview-20130321/, March 2013.

[166] Will Strinz. Publisci: A toolkit for publishing scientific results to the

semantic web. Available from https://github.com/SciRuby/publisci, 2013.

[167] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The WEKA data mining software: an update. SIGKDD explorations, 11, 2009.

198 [168] Alberto Arrigoni. Gsoc 2013: Data mining in jruby with ruby band.

Available from http://sciruby.com/blog/2013/09/24/gsoc-2013-data-

mining-in-jruby-with-ruby-band/, September 2013.

[169] Piotr Wendykier and James G Nagy. Parallel colt: a high-performance Java library for scientific computing and image processing. ACM trans- actions on mathematical software (TOMS), 37(3):1–22, 2010.

[170] Rodrigo Botafogo. Github: Mdarray. Available from https://github.

com/rbotafogo/mdarray, 2013.

[171] Naohisa Goto, Pjotr Prins, Mitsuteru Nakao, Raoul Bonnal, Jan Aerts, and Toshiaki Katayama. BioRuby: bioinformatics software for the Ruby programming language. Bioinformatics, 26(20):2617–9, October 2010.

[172] Masahiro Tanaka. Numerical Ruby: NArray. Available from http:

//narray.rubyforge.org, 2005.

[173] Yoshiki Tsunesada and David MacMahon. Ruby/GSL. Available from

http://rb-gsl.rubyforge.org/, 2010.

[174] Nick Quaranto. RubyGems.org: Your community gem host. Available

from http://rubygems.org, 2009.

[175] James M Lawrence. Linalg: a Ruby linear algebra library. Available

from http://linalg.rubyforge.org, 2004.

199 [176] Eric Jones, Travis Oliphant, and Pearu Peterson. SciPy: open source

scientific tools for Python. Available from http://www.scipy.org, 2001.

[177] Travis E Oliphant. Python for scientific computing. Computing in science & engineering, 9(90), 2007.

[178] John D Hunter. Matplotlib: A 2D graphics environment. Computing in science & engineering, 9(90), 2007.

[179] David Joyner, Aaron Meurer, Brian E Granger, and Vladimir Peri. SymPy. ACM communications in computer algebra, 45.4(178):225–234, 2011.

[180] Stanley C Eisenstat, Howard C Elman, Martin H Schultz, and Andrew H Sherman. The (New) Yale Sparse Matrix Package. In Garrett Birkhoff and Arthur L Schoenstadt, editors, Elliptic Problem Solvers II, pages 45–52. Academic Press, Inc., 1984.

[181] Randolph E Bank and Craig C Douglas. Sparse matrix multiplication package (SMMP). Advances in computational mathematics, 1(1):127– 137, 1993.

[182] Mathematical and Computational Science Division of the Information Technology Laboratory of the National Institute of Standards & Tech- nology. Text file formats: Matrix market exchange format, 2013.

[183] MATLAB MAT-File Format. MathWorks, Inc., 3 Apple Hill Drive, Natick, Mass., 01760–2098, r2013b edition, 2013.

200 Vita

John Oates Woods III was born in Fairfax County, Virginia, in 1984, the son of John O Woods Jr and Patti S Lieblich. He grew up in nearby Alexandria and Arlington. A National Merit Scholar, he graduated from Vir- ginia Polytechnic Institute & State University in Blacksburg, Virginia, in May 2007, magna cum laude; with the Bachelor of Science degree in Computer Sci- ence, and minors in Mathematics, Philosophy, and Russian. At Virginia Tech, John was a member of Hillcrest Honors Community and ΥΠE, and worked as a martial arts instructor; he also earned his black belt in tae kwon do in De- cember 2005, and studied in Moscow, Russia, in the summer of 2006. Woods moved to Austin to begin doctoral studies in Cell & Molecular Biology at The University of Texas at Austin in June 2007. In 2009, he earned the Na- tional Science Foundation Graduate Research Fellowship, and was inducted into the Friar Society in 2011. John serves on the boards of the Austin Swing Syndicate, Texas Gun Sense, and the Ruby Science Foundation.

Permanent address: 5105 Evans Avenue Austin, Texas 78751

This dissertation was typeset with LATEX† by the author.

†LATEX is a document preparation system developed by Leslie Lamport as a special version of Donald Knuth’s TEX Program.

201