
GENOTYPE-PHENOTYPE CORRELATION USING PHYLOGENETIC TREES

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Farhat A Habib, M.S., B.Tech.

*****

The Ohio State University

2007

Dissertation Committee:

Professor Ralf Bundschuh, Adviser
Professor Daniel Janies
Professor Dongping Zhong
Professor Evan Sugarbaker

Approved by

Adviser
Graduate Program in Physics

© Copyright by

Farhat A Habib

2007

ABSTRACT

Recent years have seen an exponential growth in publicly available genetic data for many organisms. To be scientifically or medically useful, the genetic data must be mapped to the physical traits that the genes in the genotype code for. In this dissertation, we describe methods to find correlations between genotypes and phenotypes using phylogenetic trees that can be applied on a genome-wide scale. We first describe Felsenstein’s argument showing the necessity of using phylogenetic trees when a genotype-phenotype correlation is calculated. Then, we propose a method using a modified version of Maddison’s Concentrated Changes Test (CCT) to find correlations between a binary phenotype and a binary genotype. The applicability of this method is demonstrated by its use to find genes correlated with susceptibility to anthrax in inbred mouse strains.

As our programs can be used to correlate any two binary variables that can be optimized on a phylogenetic tree, they were also used to find correlations between avian influenza strains and various traits of the virus or organisms affected. In particular, we find correlations between the spread of influenza and particular mutations in the influenza virus. We demonstrate the applicability of the method to a continuous phenotype that has been suitably binarized by finding genes correlated with cholesterol and lipid levels in inbred mice, and we report the results.

The limitation of CCT to binary phenotypes is significant, as most phenotypes are not binary in nature. We develop a method that can be used to find correlations between a continuous phenotype and a binary genotype using a phylogenetic tree.

Randomization testing is used to assess the significance of the correlation between the genotype and the phenotype. We test our methods by correlating lipid levels in inbred mice with their genotype. Comparison of our results with literature surveys of previous in silico methods, as well as with experimental results, shows that our method performs favorably.

Dedicated to my mother and father

ACKNOWLEDGMENTS

Through the course of this dissertation, I have become indebted to many people with whom I interacted. My deepest thanks go to my advisers, Dr. Ralf Bundschuh and Dr. Daniel Janies. I would like to thank them not only for their guidance but also for the way they have supported me. I was very lucky to have two advisers with whom I had frequent contact. My principal adviser, Dr. Bundschuh, let me have complete freedom in my choice of topic for this project. Both were always brimming with ideas and helpful in getting me motivated. I have learned and received more from them than I can acknowledge here. It has been an honor and privilege to work with them.

I would like to thank the members of my dissertation committee for their acceptance of this task and their helpful comments and suggestions.

I would also like to thank Andrew Johnson for many helpful and stimulating discussions and his help navigating biological databases. Also thanks go to Diego Pol for many discussions and his help with TNT.

On the personal side, I would like to thank Dedra Demaree for being a source of friendship, love, and encouragement. Her emotional support was invaluable at times in keeping me going. I would also like to thank Nandini Ganguly for her love, friendship, impetuousness, and passion for . I would also like to thank Kshitiz Garg and Dhananjay Adhikari, who have been my closest friends since my undergraduate days. Their constant “When’re you defending?” helped keep my eyes on the goal.

Finally, I would like to thank my parents, who encouraged me and provided support in all ways. They have always expressed how proud they are of me. Special thanks go to my brothers, Ashfaque and Firoz, for their love and friendship all through my life.

VITA

September 5, 1976 ...... Born - Udaipur, Rajasthan, India

1999 ...... B.Tech., Engineering Physics, Indian Institute of Technology, Bombay

2004 ...... M.S. Physics, The Ohio State University

2000-2007 ...... Graduate Teaching Associate, The Ohio State University

PUBLICATIONS

Research Publications

Habib, F., Johnson, A.D., Bundschuh, R., and Janies, D., Large scale genotype-phenotype correlation analysis based on phylogenetic trees. Bioinformatics, 2007; 23(7): 785–788.

Janies, D., Hill, A.W., Guralnick, R., Habib, F., Waltari, E., and Wheeler, W., Genomic Analysis and Geographic Visualization of the Spread of Avian Influenza. Systematic Biology, Apr 2007; 56(2): 321–329.

Kurc, T., Janies, D.A., Johnson, A.D., Langella, S., Oster, S., Hastings, S., Habib, F., Camerlengo, T., Ervin, D., Catalyurek, U.V., and Saltz, J.H., An XML-based system for synthesis of data from disparate databases. Journal of the American Medical Informatics Association, May-Jun 2006; 13(3): 289–301.

Habib, F. and Bundschuh, R., Modeling DNA unzipping in the presence of bound proteins. Physical Review E, Statistical, Nonlinear, and Soft Matter Physics, Sep 2005; 72(3 Pt 1): 031906.

FIELDS OF STUDY

Major Field: Physics

TABLE OF CONTENTS

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vii

List of Tables ...... xi

List of Figures ...... xii

Chapters:

1. Introduction ...... 1

1.1 Genotypes and Phenotypes ...... 2
1.2 Mapping Genotype To Phenotype ...... 5
1.2.1 DNA-protein relations ...... 6
1.2.2 Relations between genes ...... 6
1.2.3 Genes and environment ...... 7
1.2.4 Stochastic effects ...... 7
1.3 Organization ...... 8

2. Background ...... 10

2.1 Random mutagenesis ...... 11
2.2 Site-directed mutagenesis ...... 12
2.3 Linkage analysis ...... 12
2.4 Multifactorial Traits ...... 14
2.4.1 Quantitative Trait Analysis ...... 15
2.4.2 Quantitative complementation tests ...... 17
2.5 Regression Analysis ...... 18
2.6 Need for automated methods ...... 20
2.7 In silico methods ...... 21
2.7.1 Grupe’s method ...... 22
2.7.2 Association Mapping ...... 23
2.7.3 Functional Mapping ...... 25
2.8 Felsenstein’s Argument ...... 25
2.8.1 The Problem ...... 26
2.8.2 Solution ...... 26

3. VENN ...... 29

3.1 Approach ...... 30
3.2 Implementation ...... 30
3.2.1 POY apomorphy list ...... 31
3.2.2 TNT apomorphy list ...... 32
3.2.3 PAUP apomorphy list ...... 34
3.2.4 VENN Algorithm ...... 36
3.3 Case Study ...... 36
3.4 Limitations ...... 37

4. CCTSWEEP ...... 40

4.1 Concentrated Changes Test ...... 40
4.2 Algorithm and Implementation ...... 43
4.3 Reconstruction ...... 45
4.3.1 DELTRAN and ACCTRAN ...... 46
4.3.2 Taking reversals into account ...... 47
4.4 Case Study ...... 49
4.4.1 Comparison to non-tree based methods ...... 51
4.4.2 Anthrax susceptibility candidates ...... 53
4.5 Controlling for multiple testing ...... 55
4.5.1 Statistical power and false discovery rates ...... 55
4.5.2 Family-wise error rate (FWER) ...... 55
4.5.3 False Discovery Rate ...... 56
4.6 Conclusion ...... 57

5. Applications of CCTSWEEP ...... 59

5.1 CCTSWEEP used as part of Mobius ...... 59
5.2 Case study: Lipid traits in mice ...... 60
5.3 Spread of Avian Influenza ...... 63
5.3.1 Genotypes Associated with Various Hosts ...... 64
5.3.2 Spread of Various Genotypes over Time and Space ...... 67

6. Correlation of continuous characters and genotypes ...... 69

6.1 Background ...... 70
6.1.1 Continuous correlation using trees ...... 70
6.2 Optimizing characters on a tree ...... 71
6.2.1 Optimization algorithms ...... 72
6.2.2 Choosing a particular optimization ...... 72
6.3 Implementation ...... 74
6.4 Case Study: HDLC levels in inbred mice ...... 75
6.5 Discussion ...... 78
6.6 Conclusion ...... 79

7. Discussion and Future Directions ...... 81

7.1 Correlating Discrete Characters ...... 82
7.2 Correlating Continuous characters ...... 85
7.3 Future Directions ...... 86
7.4 Conclusion ...... 88

Appendices:

A. VENN code ...... 89

A.1 poyvenn.pl ...... 90
A.2 tntvenn.pl ...... 92
A.3 paupvenn.pl ...... 93

B. CCTSWEEP ...... 96

B.1 Script ...... 96

Bibliography ...... 106

LIST OF TABLES

Table Page

3.1 All SNPs completely penetrant with susceptibility as identified by VENN...... 37

4.1 High ranking SNPs within chromosome 11 obtained using CCTSWEEP. Phi-rank is the rank of the SNP using the phi-coefficient for correlation. The last column indicates the percentage of mouse strains (out of 21) with data inferred for that SNP...... 50

5.1 The correlation between phenotypes and various genotypes calculated using CCT. To correct for multiple testing we set the significance level at CCT ≤ 0.0125. Significant associations are in bold, and nearly significant (0.0125

LIST OF FIGURES

Figure Page

2.1 Genetic approaches to identifying genes that regulate chemical processes: Forward genetics entails introducing random mutations into cells, screening mutant cells for a phenotype of interest and identifying mutated genes in affected cells. In the above example, Escherichia coli cells are randomly mutated, cells that show antibiotic resistance phenotype are selected, and mutated genes are identified. Reverse genetics entails introducing a mutation into a specific gene of interest and observing the phenotypic changes due to the mutation. In the example shown, a single mutated gene is introduced into yeast cells and an antibiotic resistance phenotype is observed...... 13

2.2 Growth in identification of genes underlying genetically in and other species. Complex trait genes were identified by the whole-genome screen approach and denote cumulative year-on-year data...... 21

2.3 A phylogenetic tree illustrating Felsenstein’s argument. Changes in a character are indicated by a cross on the branch on which the change is occurring. Here a genetic marker is present in 8 of the 16 taxa under consideration. A phenotype is present in 9 of the 16 taxa, and the genetic marker is present in 8 of those 9 taxa. The correlation between these two characters is markedly different depending on whether we consider the 16 taxa as independent or related according to the tree shown...... 27

3.1 Three branches in a phylogenetic tree, identified with different colors, are chosen where there is a change in phenotype. Each circle shows the set of all genotypic changes optimized to that branch. VENN identifies the intersection set of changes correlated with the phenotypic character...... 31

3.2 Sample apomorphy list output by POY ...... 33

3.3 Sample apomorphy list output by TNT ...... 33

3.4 Sample apomorphy list output by PAUP ...... 34

3.5 A mirror tree illustrating the correlation between the SNP rs4223417 identified by VENN and Bacillus anthracis susceptibility for a 15 taxa tree...... 38

4.1 An illustration of the correlation between SNP rs3142843 and Bacillus anthracis susceptibility ...... 52

5.1 Mirrored phylogenetic trees of females of mouse strains displaying correlated changes of a phenotype and a genotype across 15 mouse strains. The right tree depicts phenotypic change in non-high-density lipoprotein (non-HDL) cholesterol plasma levels in female mice after six weeks of atherogenic diet. Black branches indicate strains (C57BL/6J and CAST/EiJ) with non-HDL levels greater than one standard deviation (sd) above the mean after treatment. Genotype observations for each strain for the SNP of interest (rs3023213; T or C) are indicated on the left tree. Boxes at the terminal branches of the trees indicate genotype or phenotype observations in databases for those strains. CCT results for this phenotype-genotype correlation differ for females (p = 0.004) and males (p = 0.088) (not shown)...... 62

5.2 (top) Screenshot of a phylogenetic tree for 351 isolates projected on Earth. Branches of the tree are traced with color to represent the optimization of a character for taxonomic order of hosts. (bottom) A view of avian influenza spread from East Asia on the 291-taxa tree, showing Lysine-627 position in PB2 character optimization as colored branches...... 66

6.1 A normal quantile plot of the change on a branch of the phylogenetic tree...... 77

6.2 A plot of $-\log_{10}(p)$ for each SNP (approximately 12000 each) plotted against position on chromosomes 1 and 2 of mice. Lines indicating p = 0.01 and 0.0001 are shown...... 78

6.3 Results of the continuous correlation method for HDLC. The top bar chart shows $-\log_{10}(p)$ for the 40 best correlated candidates from the whole genome, and the bottom bar chart shows the peak LOD scores and significant QTL intervals described previously for HDLC. Of the twenty loci with the highest correlation scores, 16 intersect previously known QTL intervals. In cases where 2 or more bars are too close to be resolved visually, the number above the bar shows the number of bars at that location...... 80

CHAPTER 1

INTRODUCTION

Biological science has undergone a revolution in the past few decades. The successes of molecular and structural biology, , and genetics have yielded large amounts of data that are increasingly quantitative in nature. The quantitative analysis of these data has attracted the use of techniques from applied mathematics, informatics, statistics, and computer science to bring new insights into biological systems and to understand the interrelationships between them. This, in turn, has brought together researchers from these different areas to solve these complex problems [1, 2].

Some of the major research efforts in the field include sequence alignment [3], genome assembly [4], protein structure prediction [5], prediction of gene expression and protein-protein interactions [6], and modeling of evolution [7], with a major goal being the linking of genotypes to phenotypes [8]. While there is a tight coupling of developments and knowledge within these subfields, this dissertation will focus on algorithms and methodology for finding correlations between genotypes and phenotypes. Statistical correlation between a genotype and a phenotype is often an important first step in finding the causal link between them.

1.1 Genotypes and Phenotypes

The genotype of an organism is the description of the , in the form of DNA (deoxyribonucleic acid), or in some cases RNA (ribonucleic acid). For sexually reproducing organisms, the DNA is contributed to the fertilized egg by the sperm and egg of its two parents. For asexually reproducing organisms, the inherited material is a direct copy (though not necessarily exact) of the DNA of its parent. The phenotype of an organism is the description of the physical and behavioral characteristics of the organism, for example its size and shape, its metabolic activities, susceptibility to pathogens, or response to stress [9].

It is necessary to make the distinction between genotype and phenotype because of the separation of the causal pathways that lead, on the one hand, to the passage of information about organisms between successive generations and, on the other, to the growth and development of an organism within a generation. The inheritance mechanism passes from genomes in one generation to genomes in the next, ideally without any influence on the genome of the events that occur in the development of the phenome during the life history of the organism. A phenome is the set of all phenotypes expressed by a cell, tissue, organ, organism, or species. While the genome is an essential element in the path from the first stage in the life of the organism to the final individual, it is largely isolated from changes in the phenome of the developed organism [10].

The distinction between genotype and phenotype was first made by August Weismann at the end of the nineteenth century, who differentiated between the germplasm of an organism, the tissue that forms the to produce the next generation, and the somatoplasm, the tissues of the rest of the body. Wilhelm Johannsen realized in 1908 that this distinction was a consequence of the hereditary and developmental pathways being separate. According to Weismann, the somatoplasm developed and was influenced by the environment, whereas the germplasm was segregated early in development and was not susceptible to environmental influences. Thus, there could be no inheritance of acquired characteristics [11].

Earlier work on the study of inheritance in pea plants by Mendel showed that inherited traits are passed from one generation to the next in discrete units that interact in well-defined ways. For the first half of the twentieth century very little progress was made in identifying the physical basis of the hereditary units. The major advance was the discovery that these discrete units, named “genes” by Wilhelm Johannsen, were linearly arranged along bodies in the nucleus of cells called chromosomes [12]. Alterations at specific places on chromosomes could be associated with specific alterations in phenotype, and heritable alterations in phenotype could be produced by bombarding organisms with high-energy ionizing radiation. Genes nevertheless remained abstract entities whose existence as the elements of heredity and the causes of development depended entirely on inferences from the phenotypes of organisms involved in various breeding experiments. That is, at this stage, the genotype had to be inferred from its effect on the phenotype.

Schrödinger, in his book What is Life? [13], introduced the idea of an “aperiodic crystal” that contained genetic information in its configuration of covalent chemical bonds. cites What is Life? as the best theoretical description, before the actual discovery of DNA, of how genetic storage would work [14, 15]. The definitive development of began with the identification of deoxyribonucleic acid (DNA) as the material basis of genes in the late 1940s and early 1950s. This was followed by the rapid discoveries of the chemical and physical structure of DNA, the molecular mechanism of its replication, and a detailed description of the molecular machinery by which cells convert the information in the DNA of genes into the molecules of physiological and developmental . The DNA of the genome consists of long strings made up of a succession of only 4 kinds of nucleotides, Adenine (A), Guanine (G), Thymine (T), and Cytosine (C). The differences among genes come from differences in the way these 4 kinds of nucleotides are arranged, similar to how different words can convey different information while being composed of the same small set of letters.

DNA is usually found as a double-stranded molecule with the strands consisting of the paired nucleotides. Adenine only pairs with Thymine, while Guanine only pairs with Cytosine. The DNA is replicated by copying the DNA into more DNA molecules utilizing this complementary base pairing property of the nucleotides. DNA replication, while having a very high fidelity, is not completely error free. In complex organisms, the replication error rate of DNA is on the order of $10^{-9}$ [16]. On the other hand, the transcription of the genotypic information to produce proteins that underlie development of the characteristics of the phenotype is carried out by a different pathway which has been called “The central dogma of molecular biology” [17]. The DNA is first copied into messenger RNA (mRNA) during transcription, and the mRNA migrates from the nucleus to the cytoplasm. Then ribosomes translate the information from the mRNA and use it for protein synthesis. Information about which genes are to be transcribed in which cells, at which times in development, and in what amounts is contained in stretches of DNA called controlling or regulatory elements. It is the transcription of the genomic DNA into RNA, which, in turn, carries the genotypic information into the metabolic apparatus of the cell, that is the critical element in the separation of the hereditary and the developmental functions of the genome. This mechanism allows the genome to be a cause of the phenotype but, at the same time, isolates the genome from the influence of the phenome, preventing the inheritance of any characteristics acquired during development.
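The base pairing rule and the copying of DNA into mRNA described in this section can be sketched in a few lines of Python. This is only an illustration: the function names and the example sequence are hypothetical, and the sketch ignores strand direction and everything else about the real molecular machinery.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand):
    """Return the partner bases of a DNA strand, position by position."""
    return "".join(COMPLEMENT[base] for base in strand)


def transcribe(coding_strand):
    """Return the mRNA for a coding strand (RNA uses U in place of T)."""
    return coding_strand.replace("T", "U")


example = "ATGGCCTAA"  # hypothetical sequence, for illustration only
print(complement_strand(example))  # TACCGGATT
print(transcribe(example))         # AUGGCCUAA
```

Because each base has exactly one partner, taking the complement twice returns the original strand, which is what makes faithful copying possible.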

1.2 Mapping Genotype To Phenotype

If the development mechanisms were such that there was a one-to-one correspondence between changes in genotype and changes in phenotype, that is, every change in genotype resulted in a unique difference in phenotype and every different phenotype was the consequence of a unique difference in genotype, the task of mapping genotype to phenotype would be greatly simplified. Given a knowledge of the phenotype, the underlying causal genotype could be unambiguously inferred and vice versa. However, the actual correspondence between genotype and phenotype is a many-to-many relationship in which any given genotype plays a role in the development of many different phenotypes and different genotypes come together to develop a given phenotype.

The many-to-many mapping between genotype and phenotype arises from four sources:

1. the relation between the DNA sequence and the amino acid sequence that makes up a protein;

2. relations between the products of the transcription and translation of the information coded in the genome;

3. the dependence of development and on both the genotype of the organism and the environment in which the organism develops and functions;

4. stochastic variations of molecular processes within cells.

There is also a temporal component, as many genes have different roles in a developing organism and in an adult organism.

1.2.1 DNA-protein relations

A protein is a macromolecule made of smaller molecules called amino acids arranged in a linear chain and joined together by peptide bonds between the carboxyl and amino groups of adjacent amino acid residues. Each amino acid is coded for by a triplet of nucleotides in the string of DNA constituting a gene. As there are 4 nucleotides, there are $4^3 = 64$ possible triplets, but since only 20 amino acids are commonly found in proteins, the coding between DNA and proteins is many-to-one. This is the most common form of many-to-one mapping of genotype onto phenotype. Thus, just from observing a change in the phenotype or a lack of physiological activity of the protein, it is not possible to conclude what change in genotype has occurred.
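The many-to-one character of the code can be checked directly. The following sketch builds the standard genetic code from its usual compact representation (one amino acid letter per codon, with bases ordered T, C, A, G and `*` marking a stop codon) and counts how many codons map to each amino acid. The variable names are ours and purely illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet (TTT, TTC, TTA, ...) to its amino acid letter.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of distinct codons that encode each amino acid (stop codons excluded).
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(CODON_TABLE))            # 64
print(len(codons_per_amino_acid))  # 20
print(codons_per_amino_acid["L"])  # 6
```

Leucine (L), for instance, is encoded by six different codons, so a change from one of them to another leaves the protein, and hence the phenotype, unchanged.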

1.2.2 Relations between genes

One of the complications in finding relationships between genotypes and phenotypes is that in organisms carrying more than one form of a gene for a phenotype, the effects of one form may dominate those of the other. Mendel observed that plants carrying one member of a gene pair specifying red flowers and one member specifying white flowers were indistinguishable from plants carrying two copies of the red form of the gene. While this effect is not universal by any means, it is sufficiently common that a large fraction of the genetic variation present in populations of organisms is hidden at the level of phenotype and requires further experimental techniques to reveal it [18, 19].

Another interaction that is extremely common is that which occurs between the different genes in the genome. If the products of different genes work together to produce a phenotype, then alterations in any one of the genes will cause a change in the phenotype. Such interactions occur when the phenotype is the outcome of a chain of chemical steps, each step being mediated by a product of a different gene.

On the other hand, some essential phenotypes may have redundant pathways and a change in one gene may not affect the final phenotype [20].

1.2.3 Genes and environment

While the complete DNA sequence of an organism contains all the information necessary to specify the organism, the organism will not come into existence unless it is in the right environment. To give an example, guinea fowl eggs need to be incubated at a temperature between 36 and 39 degrees Celsius to hatch [21]. Outside this range, they will fail to develop. Thus, the development of phenotypes requires that the genes interact with the environment, which means that they have to find themselves in a favorable environment. Also, the mapping of different genotypes into phenotypes in one environment can be completely unpredictable from their mapping in another environment.

1.2.4 Stochastic effects

Even if we get past the earlier hurdles and manage to have a specification of both the genotype and the developmental environment, we may still not have enough information to completely predict the phenotype. Humans, for example, do not have the same fingerprints on their left and right hands, and the differences in the patterns can be very large. Yet the genes of the left and right sides are the same and the developmental environment in the womb is the same for either hand. Gene expression is the process whereby the genetic information in a gene is made available to the cell. When a gene is expressed it is said to be “turned on”.

A major source of these stochastic variations is the very low number of certain intermediary molecules, such as messenger RNA, in each cell. The usual rules of chemistry and physics that we use to predict how such systems behave are based on statistical averaging over very large numbers of molecules, and this averaging does not apply when there are only a few copies of a molecule undergoing the reaction. As a consequence of the stochastic variation in number, spatial location, and reactivity of each kind of molecule, there can be considerable random variation from cell to cell in the products of a gene.
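To give a rough sense of scale, the sketch below assumes that the copy number of a molecule in a cell follows a Poisson distribution. That model is an assumption made here for illustration and is not taken from the text above. For a Poisson distribution the standard deviation is the square root of the mean, so the relative size of the fluctuations shrinks as the mean copy number grows. The function name and the copy numbers are hypothetical.

```python
import math


def relative_fluctuation(mean_copies):
    """Standard deviation divided by the mean for a Poisson-distributed count."""
    return math.sqrt(mean_copies) / mean_copies


# Hypothetical mean copy numbers per cell, from very few to very many.
for mean_copies in (4, 100, 1_000_000):
    print(mean_copies, relative_fluctuation(mean_copies))
```

Under this assumption a molecule present at about four copies per cell fluctuates by roughly half its mean, whereas one present at a million copies fluctuates by about a tenth of a percent, which is why averaging arguments work for bulk chemistry but not for rare intermediates.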

1.3 Organization

The organization of this dissertation is as follows. The second chapter reviews the background literature on resolving genotype-phenotype correlations and Felsenstein’s reasoning on the pitfalls of not taking the phylogeny of the organisms into account in statistical inference. In Chapter 3, we discuss VENN, an algorithm and its implementation to detect genotype changes at the same position in different branches of a phylogenetic tree distinguished by a change in phenotype. In Chapter 4, we discuss CCTSWEEP, software developed to find genotypic positions correlated with phenotypic changes and rank them by significance. We also discuss results from the use of VENN and CCTSWEEP on real data. In Chapter 5, we introduce other ways in which CCTSWEEP has been applied beyond genotype-phenotype correlations. Finally, we wrap up the dissertation with a summary and possible future directions for this research.

More detailed introductions for the relevant subject matter can be found at the start of each chapter. The chapters end with a short summary and the results obtained.

CHAPTER 2

BACKGROUND

In this chapter, we examine methods historically used for finding connecting links between genotypes and phenotypes. Even before the basis of genes had been identified, Mendel had studied inheritance in pea plants. Mendel’s work had focused on discrete characters such as green versus yellow peas and tall versus dwarf plants [22].

Continuous characters, such as height in humans, were first studied by Galton, a contemporary of Mendel. Galton established the principle of what he termed “regression to mediocrity”. Galton noticed that extremely tall fathers tended to have sons shorter than themselves, and extremely short fathers tended to have sons taller than themselves; thus, the offspring seemed to regress to the median, or “mediocrity” [23].

As the discrete nature of genes had not been identified at that time, the argument raged between the Mendelians and the Galtonians as to which of the two paradigms was the correct one for inheritance. Mendelian inheritance was obviously correct for some traits, but these were rare and were considered trivial by the Galtonians. On the other hand, the inheritance of continuous traits could not be used to predict outcomes, only average estimates measured in large population studies. Mendelians considered the study of continuous traits to be trivial because they had little predictive value, while Galtonians considered Mendelian traits simplistic. R. A. Fisher then reconciled the two camps by showing that the inheritance of continuous traits can be reduced to Mendelian inheritance at many loci. Discrete changes at many loci affecting a single trait, termed polygenic inheritance, could produce a distribution very close to normal [24, 25, 26].
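Fisher’s point can be illustrated with a little arithmetic. The sketch below rests on simplifying assumptions of our own: n loci, two equally frequent alleles per locus, and purely additive effects. Under those assumptions each individual carries 2n allele copies, and the relative frequency of individuals carrying k copies of one allele is the binomial coefficient C(2n, k). The function name `class_ratios` is hypothetical.

```python
from math import comb


def class_ratios(n_loci):
    """Relative frequencies of phenotype classes for n additive two-allele loci."""
    copies = 2 * n_loci  # a diploid individual carries two alleles per locus
    return [comb(copies, k) for k in range(copies + 1)]


print(class_ratios(1))  # [1, 2, 1]
print(class_ratios(2))  # [1, 4, 6, 4, 1]
print(class_ratios(5))
```

One locus gives three classes in a 1:2:1 ratio and two loci give five classes in a 1:4:6:4:1 ratio. As the number of loci grows, the ratios take on the bell shape of a normal distribution.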

2.1 Random mutagenesis

One of the earliest techniques to directly observe the relationship between genotype and phenotype is to induce random mutations in a population of organisms. The mutations may be introduced by chemical mutagens like ENU (a highly potent mutagen for mice), or by exposure to radiation [27, 28]. The organisms that show a change in a particular phenotype are then isolated from the general population. One might screen for phenotypes such as fruit flies with no wings or a colony that is noninfectious. Sequence comparison between wild type and mutant DNA is then required to locate the DNA mutation that causes the phenotypic difference. This type of screening for genes is often referred to as forward genetics as opposed to reverse genetics, the term for identifying mutant alleles in genes that are already known. An allele is any one of a number of viable DNA codings that occupies a given position on a chromosome.

Random mutagenesis is a powerful method for identifying genes that regulate biological processes in simple organisms and has been used to find the genetic basis of processes including cell division in the yeast Saccharomyces cerevisiae, programmed cell death in the nematode Caenorhabditis elegans [29], and embryonic pattern formation in the fly Drosophila melanogaster [30]. Its utility is limited for more complex organisms like mammals because of their slow rate of reproduction, large physical size, and large, diploid¹ genome. Also, in more complex organisms, most traits are polygenic; thus, a random variation in a gene is harder to link to a phenotype.

2.2 Site-directed mutagenesis

Unlike random mutagenesis, where the location and type of mutation are not under control, site-directed mutagenesis is a molecular biology technique in which a mutation is created at a defined site in a DNA molecule. This technique was first described in 1978 by Michael Smith, who was awarded a Nobel prize for it, and has become central to biochemistry and molecular biology [31]. A major use of site-directed mutagenesis is the study of protein structure and function. A specific change in the DNA induces a change in the amino acid sequence, creating a mutant protein with altered function. The method can also be used to study the complex cellular regulation of genes and to increase understanding of the mechanisms behind genetic and infectious diseases [32]. This technique belongs to the class of reverse genetics techniques, where a gene is known but its function has yet to be identified.

2.3 Linkage analysis

Genetic linkage analysis refers to the ordering of genetic loci on the chromosome and to estimating the genetic distances among them. Linkage analysis proceeds by tracking patterns of coinheritance of the trait of interest and genetic markers, relying on the varying degree of recombination between trait and marker location to map the loci relative to one another.

¹ Ploidy is the number of sets of chromosomes in a biological cell. Diploid refers to each mammalian cell having two sets of chromosomes. As each set can carry a different allele, genotype-phenotype mapping is more complex.

Figure 2.1: Genetic approaches to identifying genes that regulate chemical processes: Forward genetics entails introducing random mutations into cells, screening mutant cells for a phenotype of interest and identifying mutated genes in affected cells. In the above example, Escherichia coli cells are randomly mutated, cells that show antibiotic resistance phenotype are selected, and mutated genes are identified. Reverse genetics entails introducing a mutation into a specific gene of interest and observing the phenotypic changes due to the mutation. In the example shown, a single mutated gene is introduced into yeast cells and an antibiotic resistance phenotype is observed.

During meiosis, the process that produces the gametic cells (egg and sperm), homologous chromosomes exchange genetic material through crossing over. According to Mendel's second law of inheritance, different genes segregate to gametes (egg or sperm) independently. In reality, independent assortment of gene pairs occurs only when the genes are on different chromosomes or are so far apart on the same chromosome that recombination and nonrecombination are equally likely. Such pairs of genes are said to be unlinked. Genes that do not segregate independently are said to be linked, and the degree of linkage is given by the recombination fraction θ, the chance of recombination occurring between two loci [33, 34].

If we have two genes, the first with alleles A and a and the second with alleles B and b, and AB and ab are the parental combinations, then the recombination fraction is the fraction of recombinants (Ab + aB) out of the total (AB + ab + Ab + aB). The recombination fraction can vary between 0 (no recombination) and 0.5 (free recombination).
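The definition above reduces to a one-line computation. The following is a minimal sketch, assuming that AB and ab are the parental (nonrecombinant) combinations; the function name and the gamete counts are hypothetical and serve only as an illustration.

```python
def recombination_fraction(counts):
    """Estimate the recombination fraction from gamete counts.

    counts maps the four two-locus combinations 'AB', 'ab', 'Ab', 'aB'
    to observed numbers. Assumption: 'AB' and 'ab' are the parental types,
    so 'Ab' and 'aB' are the recombinants.
    """
    recombinant = counts["Ab"] + counts["aB"]
    total = recombinant + counts["AB"] + counts["ab"]
    if total == 0:
        raise ValueError("no gametes counted")
    return recombinant / total


# Hypothetical counts, for illustration only.
example_counts = {"AB": 45, "ab": 40, "Ab": 8, "aB": 7}
theta = recombination_fraction(example_counts)
print(theta)
```

With these made-up counts, 15 of 100 gametes are recombinant, so the estimate lies between the limits of 0 and 0.5 given in the text.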

Data for linkage analysis consist of sets of related individuals (pedigrees) and information on the genetic marker and/or trait genotypes, usually selected on the basis of phenotype (e.g. a disease, or a quantitative trait, such as cholesterol levels).

2.4 Multifactorial Traits

As mentioned earlier, continuous traits can be explained by Mendelian inheritance at many loci, resulting in a trait that is normally distributed. If $n$ is the number of involved loci, then the coefficients of the binomial expansion of $(a + b)^{2n}$ give the frequency distribution of the allele combinations. This can be illustrated by considering a trait such as height. If height were determined by two equally frequent alleles, t (tall) and s (short), at a single locus, then this would result in a discontinuous phenotype with three groups in a ratio of 1 (tall, tt) to 2 (average, ts/st) to 1 (short, ss). If the same trait were determined by two alleles at each of two loci interacting in a simple additive way, then this would lead to a phenotypic distribution of five groups in a ratio of 1 (4 tall genes) to 4 (3 tall + 1 short) to 6 (2 tall + 2 short) to 4 (1 tall + 3 short) to 1 (4 short). For a system with three loci, each with two alleles, the phenotypic ratio would be 1:6:15:20:15:6:1. As $n$ increases, this binomial distribution rapidly approaches a normal distribution [35].
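The ratios quoted above are the binomial coefficients of $(a + b)^{2n}$ and can be checked directly. This is a small sketch for the additive, equal-frequency model described in the text; the function name is an illustrative choice.

```python
from math import comb


def phenotype_ratios(n_loci):
    """Binomial coefficients of (a + b)^(2n) for n biallelic loci.

    Entry i is the relative frequency of genotypes carrying i copies of one
    allele (for example 'short') out of the 2n allele copies.
    """
    copies = 2 * n_loci
    return [comb(copies, i) for i in range(copies + 1)]


for n in (1, 2, 3):
    print(n, phenotype_ratios(n))
```

The number of phenotype classes is 2n + 1, so the classes become finer and the distribution smoother as more loci are added.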

Many traits in humans are polygenic in nature, such as blood pressure, head circumference, height, intelligence, and skin color. Many genes (along with the environment) factor into the development of these traits, so a modification in a single gene changes the trait only slightly. As most phenotypic characteristics are the result of the interaction of multiple genes, the disorders in those traits are also polygenic in nature.

2.4.1 Quantitative Trait Analysis

Beginning in the late 1980s, techniques for identifying Quantitative Trait Loci (QTLs) were developed, although the basic idea behind them goes back much farther [36]. QTLs are stretches of DNA that are closely linked to the genes that underlie the trait being examined. QTLs can help map regions of the genome that contain genes involved in specifying a quantitative trait. Knowing the number of QTLs that explain variation in the trait tells us about the genetic architecture of the trait. It may tell us that a particular trait is controlled by many genes of small effect, or by a few genes of large effect [37].

Another use of QTLs is to identify candidate genes underlying a trait. Once a region of DNA is identified as contributing to a phenotype, it can be sequenced. The genes in this region can then be compared to a database of genes whose function is already known. QTLs are typically shown as intervals across a chromosome, where the probability of association is plotted for each marker used in the mapping experiment.

All QTL mapping approaches have three common components: a population of individuals with phenotypic diversity, a set of genetic markers present in that population, and a statistical method to assess the association between the phenotype and genotype. Over recent decades, much focus has been directed toward QTL mapping techniques in the mouse, as many quantitative phenotypes of biomedical interest can be modeled in mice. These methods use phenotypic and genotypic diversity generated using a cross between two inbred strains differing substantially in a quantitative trait and an interval mapping method introduced by Lander and Botstein [38].

The classical approach for detecting a QTL near a genetic marker involves comparing the phenotypic means for two classes of progeny: those with marker genotype AB, and those with marker genotype AA. The difference between the means provides an estimate of the phenotypic effect of substituting a B allele for an A allele at the QTL. While the traditional approach is simple to implement, it has a number of shortcomings. It underestimates the phenotypic effect if the QTL does not lie at the genetic marker locus. It also does not define the likely position of the QTL. In particular, it fails to distinguish between tight linkage to a QTL with small effect and loose linkage to a QTL with large effect. These difficulties stem from analyzing the markers one at a time [39].

Lander and Botstein generalized the approach so that the intervals between the markers could be included as well. This method allowed efficient detection of QTLs while limiting the overall occurrence of false positives, more accurate estimation of phenotypic effects of QTLs, and better localization of QTLs to specific regions while significantly (7-fold) reducing the number of progeny that must be genotyped in order to detect a QTL [38].

This approach has been successfully used to map thousands of QTL in rodents for a wide range of phenotypes, ranging from taste preference to disease susceptibility [40]. However, because this approach uses mouse crosses to generate phenotypic and genotypic diversity, genetic replicates of the intercrossed mouse population cannot be easily produced. Therefore, genotyping of each intercrossed animal is necessary after the initial breeding step, which makes traditional QTL mapping both expensive and time-consuming, requiring months or years to complete. Furthermore, of the thousands of QTL that have been identified, only a small percentage have been characterized at the molecular level, in part because of the large size of QTL intervals [40].

2.4.2 Quantitative complementation tests

The basic idea of a quantitative complementation test is relatively straightforward. A mutant allele of the candidate gene is tested in association with alleles derived from a natural population. The mutation is usually one that results in a non-functional or low-activity gene product (a loss-of-function mutation). A quantitative complementation test provides a systematic way to examine whether, and which, candidate locus or loci contribute to the QTL. The method requires a mutant (null) allele, a wild-type allele, and a minimum of two QTL alleles. The phenotypes of the hybrids of the QTL alleles with both the mutant and the wild-type allele are measured to compare the effects of the two or more QTL alleles across the mutant versus wild-type genetic background. Wild-type refers to the most common phenotype or genotype in the natural population.

This method has been used for assessing the variation in genes affecting Drosophila lifespan.

2.5 Regression Analysis

The general purpose of multiple regression is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable. Francis Galton and Karl Pearson developed linear regression during Galton's work on inherited characteristics of sweet peas. Subsequent efforts by Galton and Pearson brought about the more general techniques of multiple regression and the product-moment correlation coefficient [41].

While many types of regression are used in the natural sciences, and more specifically the biological sciences, to correlate variables, logistic regression is preferentially used in the modeling of genotype-phenotype correlations, as genotype is a discrete, often binary, variable. Before the discrete nature of genes had been identified, linear regression was used by Galton to correlate genotypes and phenotypes in population studies. The term “regression” itself comes from “regression to the mean”, used by Galton to describe the correlation in height between fathers and sons.

Logistic regression is part of a category of statistical models called generalized linear models. In logistic regression, the dependent variable is a logit, which is the natural log of the odds:

$$\log(\text{odds}) = \operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x_1 + \cdots + b_n x_n \qquad (2.1)$$

where the $b_i$ are the respective parameters of the independent variables $x_i$ (with $b_0$ the intercept), and $n$ is the number of independent variables in the logistic regression. The goal of logistic regression is to correctly predict the category of outcome for individual cases using the most parsimonious model. To accomplish this goal, a model is created that includes all predictor variables that are useful in predicting the response variable.
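Equation 2.1 can be read in both directions: from a probability to its logit, and from a linear predictor back to a probability. The sketch below illustrates that relationship only; it does not fit a model, and the function names and coefficient values are hypothetical.

```python
import math


def logit(p):
    """Natural log of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def predicted_probability(b, x):
    """Invert Eq. 2.1 for one case.

    b = [b0, b1, ..., bn] are the (assumed already estimated) parameters,
    x = [x1, ..., xn] are the independent variables for this case.
    """
    linear = b[0] + sum(b_i * x_i for b_i, x_i in zip(b[1:], x))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients and a single binary genotype x1 = 1.
p = predicted_probability([-1.0, 2.0], [1])
print(p, logit(p))
```

Applying `logit` to the predicted probability recovers the linear predictor, which is the sense in which the logit is the dependent variable of the model.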

Although logistic regression finds a “best fitting” equation just as linear regression does, the principles on which it does so are rather different. Instead of using a least-squared deviations criterion for the best fit, it uses a maximum likelihood method, which maximises the probability of getting the observed results given the fitted regression coefficients. A consequence of this is that the goodness of fit and overall significance statistics used in logistic regression are different from those used in linear regression.

Logistic regression is used extensively in the medical and biological sciences and its use has seen a great increase in recent years as this technique is easy to implement and is included in a wide range of statistical packages. It has been used to correlate genetic polymorphisms with susceptibility to influenza [42], peripheral arterial disease and ischaemic heart disease [43], risk of coronary artery disease [44], and many others [45].

Since logistic regression does not take the relationships between the organisms into account, it can overestimate the significance of the correlation between genotype and phenotype.

2.6 Need for automated methods

As genetic sequencing has gotten progressively faster and cheaper, there has been major growth in the availability of genetic sequence data. GenBank® is a comprehensive database that contains publicly available nucleotide sequences and has been doubling in size every 18 months. It currently contains over 65 billion nucleotide bases from more than 61 million individual sequences, with 15 million new sequences added in the past year [46].

It was expected that the availability of the human genome sequence and the completed genome sequences of other organisms would expand our understanding of human diseases, both those caused by mutations in a single gene and those where many genes and multiple factors are involved. With individual drug response profiling, the human genome sequence was expected to lead to improved diagnostic testing for disease susceptibility genes and individually tailored treatment regimens for those who have already developed disease symptoms [47]. These expectations have been slow in being realized.

As can be seen in Figure 2.2, genes that contribute to complex traits, which comprise the vast majority of traits in humans, have been relatively slow in being discovered. In contrast, the number of genes known to cause human Mendelian disorders stood at 1336 in 2000 [48]. Many factors make finding complex trait genes more challenging, including locus heterogeneity, epistasis (gene-gene interactions), low penetrance², variable expressivity, and limited statistical power.

² Penetrance describes the extent to which the properties controlled by a gene, its phenotype, will be expressed.

[Plot for Figure 2.2: cumulative number of complex trait genes (0 to 35) versus year (1980 to 2000), with one series for human complex traits and one for all complex traits.]

Figure 2.2: Growth in identification of genes underlying genetically complex traits in humans and other species. Complex trait genes were identified by the whole-genome screen approach and denote cumulative year-on-year data.

2.7 In silico methods

The vastly increased availability of genetic sequence data for many organisms has opened another new way to link genotype to phenotype using in silico mapping.

In silico is an expression used to mean “performed on computer or via computer simulation”. With increasing computational power and the ability to integrate data from multiple online databases, researchers can analyze genetic and phenotypic information to shortlist candidate genes without having to spend time and resources on animal colonies and breeding experiments. While in silico methods have not replaced traditional experiments, they are rapidly growing as a means to manage and find insights and patterns in the vast amount of biological information being compiled [49].

2.7.1 Grupe’s method

Grupe et al. developed one of the first computational methods for predicting chromosomal regions regulating phenotypic traits from a database of mouse single nucleotide polymorphisms [50]. A single nucleotide polymorphism or SNP is present at a particular nucleotide site if the DNA molecules in the population differ in the identity of the nucleotide pair that occupies the site. A SNP does not need to be in the coding sequence of a gene.

In this method, SNP information from 15 inbred strains³ of mice is used. Using the allelic distributions across inbred strains contained in the mSNP database, their computational method calculates genotypic distances between loci for a pair of mouse strains. These genotypic distances are then compared with phenotypic differences between the two mouse strains. The process is repeated for all mouse strain pairs for which phenotypic information is available. Lastly, a correlation value is derived using linear regression on the phenotypic and genotypic distances for each genomic locus [50].

To demonstrate the utility of this method, they compared experimentally identified QTL intervals with computationally predicted chromosomal regions for 10 phenotypic traits. These traits included phenotypes such as alcohol preference, bone mineral density, eye weight, and ganglion cell count. The percentage of correct predictions was characterized as a function of the percentage of the mouse genome contained within the predicted chromosomal regions. If predicted regions contained 10% of the mouse genome (by selecting 10% of the peaks with the highest correlation), then 15 of the 26 experimentally verified QTL intervals were correctly identified. As the threshold was raised, limiting the number of predicted candidate regions, more experimentally verified QTL intervals were missed. At cutoff values ranging from 2 to 16%, 38 computationally predicted regions were identified, out of which 19 overlapped 26 experimentally verified QTL intervals [50].

³ Inbred strains are homozygous, that is, both alleles at each position from the two sets of chromosomes are the same, thus eliminating a complicating factor in genotype-phenotype correlation.

2.7.2 Haplotype Association Mapping

Pletcher et al. developed methods for haplotype association mapping [51]. A haplotype is a set of SNPs on a single chromosome that are statistically associated. It is thought that these associations, and the identification of a few alleles of a haplotype block, can unambiguously identify all other polymorphic sites in its region.

For evaluating the haplotype association mapping algorithms, they considered two phenotypes for which the genetic determinants are relatively well-characterized: sweet taste preference and HDLC. Sweet taste preference is a relatively simple quantitative trait for which several QTL have been identified. HDLC is a complex quantitative trait for which many QTL have been identified using traditional cross-based QTL mapping [52]. Forty-two percent of the mouse genome falls within a known QTL confidence interval for this trait.

Single-marker Mapping

The simplest method of computing associations between genotype and phenotype is single marker mapping (SMM), in which each SNP position is considered independently. As each SNP is biallelic across inbred strains, a t-test is used to measure the strength of association between genotype and phenotype. They find that SMM successfully maps all previously known sweet taste preference loci. For the HDLC phenotype, of the top twenty peaks identified by SMM, eleven intersected a previously known QTL interval and nine did not.
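The per-SNP test described above can be sketched as a two-sample t statistic computed from the strains carrying each allele. This is an illustrative sketch only: it assumes the pooled-variance (Student) form of the test, which the text does not specify, and the function name and data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(phenotypes, alleles):
    """Pooled-variance two-sample t statistic for one biallelic SNP.

    phenotypes[i] is the trait value of strain i and alleles[i] its allele.
    Assumption: Student's t with equal variances; other variants exist.
    """
    groups = {}
    for value, allele in zip(phenotypes, alleles):
        groups.setdefault(allele, []).append(value)
    if len(groups) != 2:
        raise ValueError("SNP must be biallelic across the strains")
    (_, first), (_, second) = sorted(groups.items())
    if len(first) < 2 or len(second) < 2:
        raise ValueError("each allele needs at least two strains")
    dof = len(first) + len(second) - 2
    pooled = ((len(first) - 1) * variance(first)
              + (len(second) - 1) * variance(second)) / dof
    if pooled == 0:
        raise ValueError("no within-group variation")
    return (mean(first) - mean(second)) / sqrt(
        pooled * (1 / len(first) + 1 / len(second)))


# Hypothetical trait values for six strains at one SNP.
t = t_statistic([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], "AAAGGG")
print(t)
```

Repeating this at every SNP position and ranking the resulting statistics gives the kind of peak list that the text compares against known QTL intervals.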

Mapping by inferred haplotype structure, parametric model

The biallelic structure of inbred strains at a single SNP locus allows only two genetic groups to be modeled. Inspection of allele patterns across multiple loci suggests that genetic structure may be more complex [53]. To take this into account, they define an inferred haplotype group as a set of strains with an identical genotype pattern over a local window of SNPs. The window of SNPs used to define inferred haplotype groups is based on three contiguous SNP loci. Two strains are defined to be in the same inferred haplotype group if and only if their genetic pattern across three adjacent SNPs is identical.
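The grouping rule above amounts to keying each strain by its alleles in a three-SNP window. The following is a minimal sketch, assuming genotypes are stored as one string of alleles per strain with no missing data; the strain names and sequences are hypothetical.

```python
from collections import defaultdict


def inferred_haplotype_groups(genotypes, start, window=3):
    """Group strains by identical allele pattern over a window of SNPs.

    genotypes maps a strain name to a string with one allele per SNP locus.
    Strains fall in the same group if and only if their alleles agree at
    every locus in genotypes[start : start + window].
    """
    groups = defaultdict(list)
    for strain, alleles in genotypes.items():
        groups[alleles[start:start + window]].append(strain)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical strains typed at four SNP loci.
strains = {"s1": "ACGT", "s2": "ACGA", "s3": "ATGT", "s4": "ACGT"}
print(inferred_haplotype_groups(strains, start=0))
print(inferred_haplotype_groups(strains, start=1))
```

Sliding the window by one locus can change the grouping, which is how the method captures more than the two groups available at a single biallelic SNP.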

Based on these groupings of inferred haplotype, the F-statistic from analysis of variance (ANOVA) is used to test the significance of the genotype-phenotype association at a given locus. The locus for sweet taste preference is again correctly mapped, and of the top twenty peaks identified by this inferred haplotype parametric (IH-P) method for HDLC, thirteen intersect a previously known QTL interval and seven do not.

Mapping by inferred haplotype structure, Kruskal-Wallis model

Using the same inferred haplotype structure, they use a rank-based test statistic. The Kruskal-Wallis test statistic is computed at each locus, and the significance is calculated using a bootstrap distribution. By using 1,000,000 bootstrap samples, the background distribution of the test statistic is modeled at a given locus and used to assess the p-value of the test statistic calculated from the true phenotype values. The locus controlling sweet taste preference is again correctly identified, and of the top twenty peaks identified by this method for HDLC, twelve intersect a previously known QTL interval and eight do not.

Mapping by inferred haplotype structure, bootstrap model

Lastly, they examined the use of a nonparametric inferred haplotype bootstrap method to calculate association scores. At each three-SNP window, they compute the modified F-statistic used in the inferred haplotype parametric approach and use the above-mentioned bootstrap protocol to calculate its significance. Of the top twenty peaks identified by this method for HDLC, fourteen intersect a previously known QTL interval and six do not.

2.7.3 Functional Mapping

A general statistical mapping framework, called functional mapping, has been proposed to characterize the quantitative trait loci (QTLs) or nucleotides (QTNs) that underlie a complex dynamic trait. Functional mapping estimates mathematical parameters that describe the developmental mechanisms of trait formation and expression for each QTL or QTN. The approach provides a useful quantitative and testable framework for assessing the interplay between gene actions or interactions and developmental changes [54].

2.8 Felsenstein’s Argument

Felsenstein points out an important limitation of the above-mentioned techniques. The techniques above, especially statistically based ones like QTL mapping, regression analysis, and in silico models, suffer from the drawback that they consider the populations, organisms, or taxa under study as essentially independent. In reality the taxa are part of a hierarchically arranged phylogeny and thus cannot be regarded for statistical purposes as if drawn independently from the same distribution [55]. This problem had previously been studied by Ridley, but Felsenstein was the first to propose a solution, the independent contrasts method [56]. Phylogeny (or phylogenesis) is the origin and evolution of a set of organisms. The interrelationships between the organisms are illustrated as a tree called the phylogenetic tree.

2.8.1 The Problem

The problem can be illustrated by means of a simplified example. Suppose we have 16 organisms and 9 of them have a particular binary phenotypic character. Let there also be a genetic marker, such as a Single Nucleotide Polymorphism (SNP), present in 8 of the organisms, which coincides with the phenotypic character in all but one of the organisms. If the SNP and the phenotype were randomly distributed among the organisms, the probability that they would be present together in this way is $9/\binom{16}{8} = 0.0006993$, which is statistically highly significant. On the other hand, if we assume that the organisms have a phylogeny as shown in Fig. 2.3, then we can optimize the phenotypic and genotypic characters on the phylogeny, and we find that there is only one change occurring on the tree. The probability that a change in the SNP and a change in the phenotype, if randomly distributed on the branches, would occur together is 2/30 = 0.067, which is not statistically significant. Thus, ignoring the phylogeny can lead to overestimating the statistical significance of a genotype-phenotype correlation.
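The two probabilities in this example can be reproduced with a few lines of arithmetic. In this sketch the numerator 9 is read as the number of ways to place all 8 marker carriers among the 9 taxa with the phenotype, and the denominator 30 as the number of branches of a rooted binary tree with 16 tips; both readings are assumptions made here to match the quoted values, and the numerator 2 is taken directly from the text.

```python
from math import comb

n_taxa, n_phenotype, n_marker = 16, 9, 8

# Taxa treated as independent: placements of the 8 marker carriers that fall
# entirely within the 9 phenotype-positive taxa, out of all placements.
p_independent = comb(n_phenotype, n_marker) / comb(n_taxa, n_marker)

# Taxa related by a tree: a rooted binary tree with n tips has 2n - 2 branches.
n_branches = 2 * n_taxa - 2
p_on_tree = 2 / n_branches  # numerator as quoted in the text

print(p_independent, p_on_tree)
```

The roughly hundredfold difference between the two values is the overestimate of significance that results from ignoring the phylogeny.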

2.8.2 Solution

The problem of nonindependent data points translates statistically into a question concerning the appropriate degrees of freedom to be used in tests of significance [55]. Hierarchical phylogenetic relationships between species effectively decrease the available degrees of freedom by some unknown quantity. The independent contrasts method computes (weighted) differences (“contrasts”) between the character values of pairs of species and/or nodes, as indicated by a phylogenetic topology, working down the tree from its tips. This procedure results in n − 1 contrasts from n original tip species. As long as the ancestral nodes are correctly determined, each of these contrasts is independent of the others in terms of the evolutionary changes that have occurred to produce differences between the two members of a single contrast. Because the n − 1 contrasts are statistically independent, they can be employed in standard statistical analyses.

Figure 2.3: A phylogenetic tree illustrating Felsenstein’s argument. Changes in a character are indicated by a cross on the branch on which the change is occurring. Here a genetic marker is present in 8 of the 16 taxa under consideration. A phenotype is present in 9 of the 16 taxa, and the genetic marker is present in 8 of those 9 taxa. The correlation between these two characters is markedly different depending on whether we consider the 16 taxa as independent or related according to the tree shown.
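The contrast computation described above can be sketched recursively for a fully resolved tree with branch lengths. The sketch follows the standard form of the independent contrasts procedure (standardized differences, weighted ancestral values, and lengthened ancestral branches); the tuple encoding of the tree and the example values are hypothetical choices made for illustration.

```python
from math import sqrt


def contrasts(node):
    """Independent contrasts for a binary tree with positive branch lengths.

    A node is ('tip', value, branch_length) or
    ('node', left, right, branch_length).
    Returns (estimated value, adjusted branch length, list of contrasts).
    """
    if node[0] == "tip":
        _, value, length = node
        return value, length, []
    _, left, right, length = node
    x1, v1, c1 = contrasts(left)
    x2, v2, c2 = contrasts(right)
    # Difference between the two descendants, scaled by its expected spread.
    contrast = (x1 - x2) / sqrt(v1 + v2)
    # Ancestral value: average weighted by the inverse branch lengths.
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    # Lengthen the ancestral branch to reflect uncertainty in that estimate.
    adjusted = length + v1 * v2 / (v1 + v2)
    return value, adjusted, c1 + c2 + [contrast]


# Hypothetical four-tip tree ((A,B),(C,D)) with unit branch lengths.
tree = ("node",
        ("node", ("tip", 1.0, 1.0), ("tip", 3.0, 1.0), 1.0),
        ("node", ("tip", 6.0, 1.0), ("tip", 10.0, 1.0), 1.0),
        0.0)
root_value, _, result = contrasts(tree)
print(root_value, result)
```

Four tips yield three contrasts, matching the n − 1 count given in the text; computing the contrasts for two characters on the same tree and correlating them is the usual next step.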

The non-independence of the various species can be taken into account while finding the correlations, provided we know the phylogeny of the taxa under consideration. In recent years, with the growth in available genomic sequences for many organisms, the increase in computational power, and improvements in the algorithms and software available (PAUP, TNT, POY, etc.) for inferring the phylogeny, the phylogeny can be calculated in reasonable times even for large datasets.

Once the phylogeny is known, though, methods for correlating the genotypes and phenotypes are still limited. Moreover, with the explosive growth in genotype data, in silico methods for inferring correlations on a large scale are essential to allow researchers to focus their efforts on a manageable number of candidate regions for finding the causative mechanism behind the correlation, if there is one.

CHAPTER 3

VENN

VENN is a program that allows us to identify SNPs that are completely penetrant with a phenotype. Completely penetrant indicates that a change in the genotype always occurs concurrently with a change in the phenotype. The idea behind VENN is that once a phylogenetic tree has been identified, any discrete character, genotypic or phenotypic, can be optimized on the tree, and we can glean information about possible relationships between a genotypic character and a phenotypic character by observing the branches on which these characters undergo changes.

In tracing a character over the tree, each node of the tree is assigned a state or a set of states such that the number of changes in the state of the character when going from the root to the tips of the tree is minimized. That distribution is called the most parsimonious distribution of the character over that tree. Using the distribution, we can identify branches on the tree where the phenotype is undergoing a change. Then VENN, provided those branches, can identify the genotype that is also changing on those branches, thus restricting the candidate loci responsible for the phenotype.
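The character tracing described above can be illustrated with Fitch parsimony on a small binary tree. This is a sketch of one standard optimization, not the procedure used by PAUP, TNT, or POY; the tree encoding, the tie-breaking rule, and the example data are assumptions made for illustration.

```python
def downpass(node, tip_states):
    """Fitch down-pass. A node is a tip name or a (left, right) tuple.

    Returns ((state_set, payload), minimum_number_of_changes), where payload
    is the tip name or the pair of annotated children.
    """
    if isinstance(node, str):
        return (frozenset([tip_states[node]]), node), 0
    left, n_left = downpass(node[0], tip_states)
    right, n_right = downpass(node[1], tip_states)
    shared = left[0] & right[0]
    if shared:
        return (shared, (left, right)), n_left + n_right
    return (left[0] | right[0], (left, right)), n_left + n_right + 1


def tips_below(annotated):
    _, payload = annotated
    if isinstance(payload, str):
        return [payload]
    return tips_below(payload[0]) + tips_below(payload[1])


def changed_branches(annotated, parent_state=None):
    """Up-pass: fix one most parsimonious state per node, list the changes.

    A branch is identified by the sorted tips below its descendant node.
    Ties are broken by taking the smallest state (an arbitrary choice).
    """
    state_set, payload = annotated
    state = parent_state if parent_state in state_set else min(state_set)
    found = []
    if parent_state is not None and state != parent_state:
        found.append((tuple(sorted(tips_below(annotated))), parent_state, state))
    if not isinstance(payload, str):
        for child in payload:
            found.extend(changed_branches(child, state))
    return found


# Hypothetical five-taxon tree and binary phenotype.
tree = ((("a", "b"), "c"), ("d", "e"))
phenotype = {"a": 1, "b": 1, "c": 0, "d": 0, "e": 0}
annotated, n_changes = downpass(tree, phenotype)
print(n_changes, changed_branches(annotated))
```

The branches returned for the phenotype are the ones on which a genotype would also have to change to be a candidate, which is the information VENN takes as input.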

3.1 Approach

For a modest number of branches containing change, set theory and Venn diagrams provide the logical basis we employ to find genotype-phenotype relationships.

To illustrate, consider a phylogenetic tree on which there are three branches with significant change in the phenotypic character. The three circles in Fig. 3.1 represent all genotypic characters with change optimized to that branch. The intersection set of these three sets contains SNPs that have the potential to be functionally related to the phenotype. This type of analysis can be extended to any number of branches in the tree.

To further filter the candidate loci, it could be useful to exclude SNPs that also change in branches where no change in the phenotype is observed. These candidate SNPs would lie in the relative complement of the intersection. In Fig. 3.1, if only branches A and B exhibit a change in the phenotype while branch C does not contain that change, then the candidate SNPs would be contained in the white region at the intersection of the purple and green regions.
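The intersection and relative complement described above map directly onto set operations. The sketch below assumes that the SNPs changing on each branch are already available as sets; the branch contents and SNP names are hypothetical.

```python
def candidate_snps(include, exclude=()):
    """SNPs changing on every included branch and on no excluded branch.

    include and exclude are iterables of sets of SNP identifiers, one set
    per branch.
    """
    include = [set(branch) for branch in include]
    if not include:
        return set()
    shared = set.intersection(*include)
    for branch in exclude:
        shared -= set(branch)
    return shared


# Hypothetical SNP changes on three branches A, B, and C.
branch_a = {"snp1", "snp2", "snp3"}
branch_b = {"snp2", "snp3", "snp4"}
branch_c = {"snp3", "snp5"}

print(candidate_snps([branch_a, branch_b, branch_c]))       # phenotype changes on A, B, C
print(candidate_snps([branch_a, branch_b], [branch_c]))     # phenotype changes on A, B only
```

The first call corresponds to the three-way intersection, the second to the region shared by A and B but lying outside C.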

In applying these filters, missing nucleotide data due to incomplete sequencing is inferred using the tree. To avoid artifacts, we require that SNP nucleotides be known for at least 50% of the strains for a potential candidate SNP. Increases in quality and quantity of SNP data are making this cutoff irrelevant for most regions of the mouse genome [57].

3.2 Implementation

VENN operates on apomorphy lists output by the popular phylogenetic software PAUP [58], TNT [59], or POY [60]. An apomorphy list is a table containing all changes on all branches of the tree, organized into tables defined by branch. Figures 3.2-3.4 show the organization of a typical apomorphy list from POY, TNT, and PAUP, respectively. It should be noted that only the apomorphy lists need to be output from the above-mentioned software. The trees themselves could be calculated or inferred using other means.

Figure 3.1: Three branches in a phylogenetic tree, identified with different colors, are chosen where there is a change in phenotype. Each circle shows the set of all genotypic changes optimized to that branch. VENN identifies the intersection set of changes correlated with the phenotypic character.

3.2.1 POY apomorphy list

An apomorphy list from POY is a tab-delimited file with a header that describes the tree and other information, which is ignored by VENN. The data that VENN operates on begin with “Character Change List”. The first column is the ancestor of a branch, denoted by HTU and a number. HTU stands for Hypothetical Taxonomic Unit, as each internal node is an inferred ancestor. The tips are termed OTUs (Operational Taxonomic Units) and are the observed taxa sequences. The second column is the descendant in a branch and the third column is the character. In POY each character can be composed of positions, and the position is the fifth column in the file. The character and position number taken together uniquely identify a nucleotide or amino acid position in a dataset. The sixth column gives the state of the character at the ancestral node and the column after that gives the state of the character in the descendant node. Gaps are indicated by ‘-’. The 9th column indicates the type of change, such as transition, transversion, insertion, or deletion. In molecular biology, a transition is a change of a purine to a purine (A↔G) or a pyrimidine to a pyrimidine (C↔T). A transversion refers to the substitution of a purine for a pyrimidine or vice versa. Pyrimidine bases have a 6-membered ring with two nitrogens and four carbons, whereas purine bases have a 9-membered double-ring system with four nitrogens and five carbons. As a transversion changes the chemical structure dramatically, transversions tend to have more severe consequences and to be less common than transitions. The last column indicates a definite change by a *; absence of a * denotes an ambiguous change. A change is ambiguous if the descendant state is not unique and may depend on the optimization being used.
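The change types in the Type column follow from the ancestral and descendant states alone. The following sketch classifies a single nucleotide change using the purine and pyrimidine definitions given above; the function name and the returned labels are illustrative choices and do not reproduce POY's own code.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def change_type(ancestor, descendant):
    """Classify a change between two nucleotide states ('-' denotes a gap)."""
    a, d = ancestor.upper(), descendant.upper()
    if a == d:
        return "none"
    if a == "-":
        return "insertion"
    if d == "-":
        return "deletion"
    bases = PURINES | PYRIMIDINES
    if a not in bases or d not in bases:
        raise ValueError("states must be A, C, G, T, or '-'")
    # Same chemical class on both sides means a transition.
    same_class = (a in PURINES) == (d in PURINES)
    return "transition" if same_class else "transversion"


for anc, desc in [("C", "G"), ("G", "A"), ("T", "C"), ("A", "-")]:
    print(anc, desc, change_type(anc, desc))
```

The four example changes are taken from the first branch of the sample list in Figure 3.2, where they are labeled Tv, Ti, Ti, and Del.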

3.2.2 TNT apomorphy list

A TNT apomorphy list is organized in a much simpler way. As shown in Figure 3.3, at the beginning of the list of changes in each branch is the name of the descendant node. Each character is identified by Char. followed by a number, and the change in that character is given at the end in the form ‘ancestor state --> descendant state’. The equivalent of HTU in TNT is Node. Internal nodes in the tree are labeled as Node followed by a number.
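Reading such a list into a lookup table keyed by descendant node takes only a few lines. The sketch below is not the preprocessing used by VENN; it assumes one entry per line, a node name followed by a colon on its own line, and change lines of the form shown in Figure 3.3. The regular expressions and function name are hypothetical.

```python
import re

# Assumed line shapes: "Node 22 :" and "Char. 525: G --> A".
CHANGE = re.compile(r"^\s*Char\.\s*(\d+):\s*(\S+)\s*-->\s*(\S+)\s*$")
HEADER = re.compile(r"^\s*(.+?)\s*:\s*$")


def parse_tnt_apomorphies(text):
    """Return {descendant name: {character number: (ancestor, descendant)}}."""
    branches, current = {}, None
    for line in text.splitlines():
        change = CHANGE.match(line)
        if change:
            if current is None:
                raise ValueError("change listed before any node name")
            number, ancestor, descendant = change.groups()
            branches[current][int(number)] = (ancestor, descendant)
            continue
        header = HEADER.match(line)
        # Lines such as "Tree 0 :" name the tree, not a branch.
        if header and not header.group(1).startswith("Tree"):
            current = header.group(1)
            branches.setdefault(current, {})
    return branches


sample = """Tree 0 :
  Node 22 :
    Char. 525: G --> A
    Char. 585: G --> T
  Node 23 :
    Char. 524: G --> A
"""
parsed = parse_tnt_apomorphies(sample)
print(parsed)
```

A nested dictionary of this shape supports the constant-time membership checks that the intersection step relies on.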


Character Change List:
Anc    Desc           Character  Pos     AncS  DescS  Type  Definite
______________________________________________________________________
HTU0   Shikokuobius   [8]        [35]    C     G      Tv    *
                                 [37]    G     A      Ti    *
                                 [44]    T     C      Ti    *
                                 [46]    A     C      Tv    *
                                 [152]   A     -      Del   *

                      [9]        [21]    C     -      Del   *
                                 [24]    G     -      Del   *
                                 [27]    G     -      Del   *
                                 [32]    A     -      Del   *
                                 [41]    G     C      Tv    *

HTU0   HTU1           [8]        [35]    C     G      Tv    *
                                 [36]    C     T      Ti    *
                                 [41]    G     A      Ti    *
                                 [44]    T     C      Ti    *
                                 [49]    A     C      Tv    *
                                 [155]   A     -      Del   *

                      [9]        [23]    C     -      Del   *
                                 [24]    G     -      Del   *
                                 [25]    G     -      Del   *
                                 [27]    A     -      Del   *
                                 [41]    G     C      Tv    *

Figure 3.2: Sample apomorphy list output by POY

Tree 0 :
  DQ010921 :
    No
  Node 22 :
    Char. 525: G --> A
    Char. 585: G --> T
    Char. 605: T --> C
    Char. 646: C --> T
    Char. 830: C --> T
    Char. 863: A --> T
    Char. 1277: T --> C
    Char. 1288: C --> T
  Node 23 :
    Char. 524: G --> A
    Char. 546: G --> T
    Char. 607: T --> C
    Char. 643: C --> T
    Char. 723: C --> T
    Char. 833: A --> T
    Char. 977: T --> C
    Char. 1388: C --> T

Figure 3.3: Sample apomorphy list output by TNT

Apomorphy lists:

Branch                            Character  Steps  CI     Change
------------------------------------------------------------------
node_2382 --> Peptostreptococc    10         1      0.333  T ==> C
                                  21         1      0.500  G ==> A
                                  23         1      0.500  G ==> A
                                  25         1      0.750  G ==> T
                                  31         1      1.000  C ==> T
                                  45         1      0.500  G ==> A
                                  47         1      0.250  T ==> C
                                  95         1      0.143  C ==> T
                                  99         1      1.000  C ==> T
                                  103        1      0.273  A ==> T
                                  108        1      0.667  A ==> T
node_2381 --> node_2380           497        1      0.333  A --> G
                                  498        1      1.000  C ==> T
                                  519        1      0.400  A --> G
                                  535        1      0.143  T ==> G
                                  546        1      0.111  G ==> A
                                  558        1      1.000  C --> G
                                  563        1      0.300  C ==> T
                                  661        1      0.500  G --> T
                                  666        1      0.429  C ==> G

Figure 3.4: Sample apomorphy list output by PAUP

3.2.3 PAUP apomorphy list

A PAUP apomorphy list is shown in Figure 3.4. This is a space-delimited file in which the first column gives the ancestor and descendant node for each branch, followed by the character number and then the number of steps for that character over the tree. The number of steps is the number of times the character changes its state on the tree. The next column is the Consistency Index (CI) of the character. CI is a measure of the parsimony fit of a character to a tree. CI varies from 1.0 (perfect fit) to a value asymptotically approaching zero (poorest fit). The last column is the change in the character from ancestral state to descendant state.

Input: A: List of descendant nodes of branches to be included (nodesinc)
Input: B: List of descendant nodes of branches to be excluded (nodesexc)
Input: File containing apomorphy list
Output: File containing SNPs present in all branches in A but not in B

foreach descendant node do
    Initialize hash ’name’ with list of nodes;
end
/* reading data into hashes */
foreach Line in apomorphy list do
    if nodename in line then
        name = descendant;
    end
    if name in list nodesinc or nodesexc then
        data(name, uniqchar) = line;
        char(uniqchar) = 0;
    end
end
/* calculating intersections */
foreach key in char do
    common = 1; out = “ ”;
    foreach key in names do
        while common = 1 do
            if exists data(name, char) then
                if name in nodesinc then
                    out = out + data(name, char);
                end
                else
                    common = 0;
                end
            else if name in nodesexc then
            end
            else
                common = 0;
            end
        end
    end
    if common = 1 then
        print out;
    end
end

Algorithm 1: Algorithm used by VENN

3.2.4 VENN Algorithm

Algorithm 1 calculates the set of positions that change along every branch selected for inclusion and do not change along any branch selected for exclusion. The part of the algorithm that calculates the intersections or relative complements is the same regardless of the format of the apomorphy list. Only the preprocessing of the data fed to this part of the algorithm differs, because different programs write apomorphy lists with different file structures.

The algorithm is implemented in AWK and Perl and makes use of associative arrays (also called hashes) provided by these languages. Associative arrays have a fast O(1) lookup time for checking whether a particular key exists or not. This allows for very fast runtime for the above algorithm. In Algorithm 1, name is a hash that contains a list of the descendant nodes of the branches that are being considered. In a tree, a branch can be uniquely identified by its descendant. Data is a hash that contains information about all the lines within all the branches being considered. Char is a hash that contains a unique position identifier of all the characters that are changing on all the branches being considered. The implementation of VENN can be seen in Appendix A.
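The intersection and relative-complement step of Algorithm 1 can be illustrated with Python sets, which offer the same constant-time membership test as AWK or Perl hashes. This is a sketch under stated assumptions rather than the code of Appendix A: the function name `venn`, the `(descendant, character)` input pairs, and the toy node names are all hypothetical.

```python
from collections import defaultdict


def venn(records, include, exclude):
    """Return characters that change on every branch in `include`
    and on no branch in `exclude`.

    `records` is an iterable of (descendant_node, character_number) pairs,
    one per line of the apomorphy list.
    """
    changed = defaultdict(set)  # descendant node -> characters changing on its branch
    for branch, char in records:
        changed[branch].add(char)
    if not include:
        return set()
    # Characters shared by all included branches ...
    common = set.intersection(*(changed[b] for b in include))
    # ... minus anything that also changes on an excluded branch.
    for b in exclude:
        common -= changed[b]
    return common


toy_records = [("n1", 10), ("n1", 21), ("n2", 10), ("n2", 21), ("n3", 21)]
result = venn(toy_records, include=["n1", "n2"], exclude=["n3"])
```

In this toy input, character 21 changes on both included branches but also on the excluded branch `n3`, so only character 10 is reported.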

3.3 Case Study

We used our tool VENN, which implements the approach described above, on the variable phenotype of susceptibility to Bacillus anthracis among 15 strains of inbred mice. The data on Bacillus susceptibility [61] were obtained from the Mouse Phenome Database [62]. The tree was calculated with TNT using default parameters and the apomorphy list was then calculated using POY. Then, using VENN, we identified SNPs that change only on branches with a transition in phenotype from

SNP id     Chr  Annotation
---------  ---  ----------------------------------------------------------------
rs4223417  2    EH-domain containing 4 (Ehd4)
rs4223418  2
rs4223864  3    6 kb from Peroxin 5-related protein (Pex2)
rs4226421  7    50 kb from CEA-related cell adhesion molecule 9 (Ceacam9)
rs4226424  7
rs4226435  7    Excision repair cross-complementing rodent repair deficiency, complementation group 2 (Ercc2)
rs4226436  7
rs4226437  7
rs4226439  7    Ercc2; opposite strand overlaps kinesin light chain 3 (Klc3)
rs4226441  7    Trafficking protein particle complex 6A (Trappc6a)
rs3694522  11   Kinesin family member 1C (Kif1c)

Table 3.1: All SNPs completely penetrant with Bacillus anthracis susceptibility as identified by VENN.

susceptible to resistant. All SNPs located by VENN along with the chromosome and their annotation are shown in Table 3.1. A mirror tree illustrating the correlation between one of the SNPs found and the phenotype can be seen in Figure 3.5. As discussed in the next chapter, the last of these SNPs is known to be associated with anthrax susceptibility, leaving the other identified SNPs as potential candidate SNPs.

3.4 Limitations

The limitations of VENN are two-fold. The first is that as the number of branches over which the phenotype changes increases, the set of common nucleotides or amino acids identified by VENN decreases until eventually no changes are identified. The reason for this is that completely penetrant genotypes, especially in complex organisms, are very rare. Thus, we need a way to find partial matches between the genotypes and phenotypes. While it would not be too difficult to extend VENN to report a change even if there is a mismatch on some given number of branches, this does not fix a second problem: VENN does not provide a way to report how statistically significant the identified correlation is. Thus, it is non-trivial to identify a cut-off for how many mismatches can be tolerated before deciding that a candidate SNP is not worth pursuing further. A method that improves on these limitations of VENN is given in the next chapter.

[Figure 3.5 image: mirror tree of the strains SPRET/EiJ, C57BL/6J, A/J, C3H/HeJ, DBA/2J, BALB/cByJ, 129S1/SvImJ, 129X1/SvJ, BALB/cJ, A/HeJ, AKR/J, CZECHII/EiJ, NZB/BINJ, NZW/LacJ and CAST/EiJ. Characters shown: rs4223417 (states G, A) and Bacillus susceptibility (states Susceptible, Resistant).]

Figure 3.5: A mirror tree illustrating the correlation between the SNP rs4223417 identified by VENN and Bacillus anthracis susceptibility for a 15 taxa tree.

CHAPTER 4

CCTSWEEP

While VENN is a useful tool for rapidly finding genotypic changes concurrent with phenotypic changes, a problem with its application is that with an increase in the number of intersecting branches, the number of genetic changes within the intersecting region decreases rapidly. Thus, a more sophisticated approach is necessary when few or no SNPs are completely penetrant with the phenotype. Also, as most phenotypes are a product of interactions of multiple genes, completely penetrant genotypes are rarely encountered. Additionally, there is no mechanism to assess the statistical significance of the nucleotide or amino acid changes located by VENN. In this chapter, we develop a method termed CCTSWEEP that overcomes these limitations. A modification of Maddison's Concentrated Changes Test (CCT) is used to find the statistical correlation between the genotype and phenotype [63].

4.1 Concentrated Changes Test

The CCT as originally described is a method for determining whether change in a binary character on a phylogenetic tree is correlated with the state of another binary character. In our application, one of the characters is phenotypic while the other character is the genotype, such as a nucleotide or an amino acid. If the phenotypic character is not binary, then it will have to be mapped to a suitable binary representation.

The CCT tests whether changes in one character are associated phylogenetically with the state of another character. For this, the character's evolution on the phylogeny has to be first reconstructed. The reconstruction can be done in many ways and a more detailed discussion of it is presented later. The CCT calculates the probability that a given number of gains and losses, in this case a change in the reconstructed state of a SNP on a branch, fall on the distinguished branches of the tree. A gain is a character changing from 0 → 1 and a loss is a change from 1 → 0.

For our purposes, distinguished branches are those on which a change is observed in the phenotype. We account for changes that are not on branches with the observed change in phenotype by summing over all cases where the distribution of gains and losses is as good as or better than the observed distribution to obtain the CCT,

$$
P(p, q \mid n, m) = \sum_{u=p}^{n} \sum_{r=0}^{q} \frac{B(u, r \mid n, m)}{W(n, m)} \tag{4.1}
$$

for a case where p gains and q losses are observed on branches of the tree having a significant change in the phenotype and n gains and m losses over the entire tree.

W(n, m) is the number of ways in which n gains and m losses can be distributed over the entire tree and can be calculated using the recursion relation [63]

$$
\begin{aligned}
W_F(r, s \mid 0) ={}& \sum_{i=0}^{r} \sum_{j=0}^{s} W_G(i, j \mid 0) \cdot W_H(r-i,\, s-j \mid 0) \\
&+ \sum_{i=0}^{r-1} \sum_{j=0}^{s} W_G(i, j \mid 0) \cdot W_H(r-i-1,\, s-j \mid 1) \\
&+ \sum_{i=0}^{r-1} \sum_{j=0}^{s} W_G(i, j \mid 1) \cdot W_H(r-i-1,\, s-j \mid 0) \\
&+ \sum_{i=0}^{r-2} \sum_{j=0}^{s} W_G(i, j \mid 1) \cdot W_H(r-i-2,\, s-j \mid 1)
\end{aligned}
\tag{4.2}
$$

where W_F(r, s|0) is the number of ways of distributing r gains and s losses over the tree at node F given that the state at F is 0. A gain is a change of state from 0 → 1 while a loss is a change of state from 1 → 0. In Equation 4.1, we assume that the state of the variable at the root of the tree is always 0. 1 and 0 are symbols for the binary states of a genotypic or phenotypic character. Thus, whatever state the reconstructed character has at the root node can be designated as 0, and the complementary state then becomes 1, without any loss of generality. G and H are the daughter nodes of F. Thus, starting from the tips, we can calculate our way up to the root of the tree given that for a terminal node W(0, 0|0) = W(0, 0|1) = 1, as there is only one way to have 0 gains and 0 losses within a terminal taxon, and W(r, s|0) = W(r, s|1) = 0 for all other values of r and s.
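The recursion for W can be checked on a small example. The sketch below is written in Python rather than in the TNT scripting language of Appendix B, and uses a hypothetical four-taxon tree ((a,b)X,(c,d)Y)R. Equation 4.2 covers only the case in which node F is in state 0; the treatment of state 1, where a change along a daughter branch is counted as a loss instead of a gain, is our assumption by symmetry. A brute-force enumeration of all node states is included as an independent check.

```python
from functools import lru_cache
from itertools import product

# Hypothetical toy tree: each internal node maps to its two daughter nodes.
CHILDREN = {"R": ("X", "Y"), "X": ("a", "b"), "Y": ("c", "d")}


@lru_cache(maxsize=None)
def ways(node, state, gains, losses):
    """Ways to place `gains` (0->1) and `losses` (1->0) on the branches
    below `node`, given that `node` itself is in `state`."""
    if node not in CHILDREN:  # terminal taxon: only (0, 0) is possible
        return 1 if (gains, losses) == (0, 0) else 0
    g_node, h_node = CHILDREN[node]
    total = 0
    # Each daughter branch either keeps the parent state or flips it.
    for flip_g, flip_h in product((0, 1), repeat=2):
        used = flip_g + flip_h
        # A flip out of state 0 is a gain; a flip out of state 1 is a loss.
        rem_gains = gains - used if state == 0 else gains
        rem_losses = losses if state == 0 else losses - used
        if rem_gains < 0 or rem_losses < 0:
            continue
        for i in range(rem_gains + 1):
            for j in range(rem_losses + 1):
                total += (ways(g_node, state ^ flip_g, i, j)
                          * ways(h_node, state ^ flip_h,
                                 rem_gains - i, rem_losses - j))
    return total


def brute_force(gains, losses, root_state=0):
    """Count the same quantity by enumerating every assignment of node states."""
    parent = {c: p for p, pair in CHILDREN.items() for c in pair}
    nodes = list(parent)
    count = 0
    for states in product((0, 1), repeat=len(nodes)):
        st = dict(zip(nodes, states))
        st["R"] = root_state
        g = sum(1 for n in nodes if st[parent[n]] == 0 and st[n] == 1)
        l = sum(1 for n in nodes if st[parent[n]] == 1 and st[n] == 0)
        count += (g, l) == (gains, losses)
    return count
```

For state 0 the four combinations of `flip_g` and `flip_h` correspond one-to-one to the four sums of Equation 4.2. Memoizing on `(node, state, gains, losses)` plays the role of the precalculated W matrix.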

B(t, u|r, s, 0) ≡ B(t, u|r, s) is the number of ways in which t gains and u losses can be distributed over the distinguished branches, given r gains and s losses over the entire tree and given that the state of the genotypic character at the root node is 0. It can be calculated by a recursion relation similar to Equation 4.2 [63]

$$
\begin{aligned}
B_F(t, u \mid r, s, 0) ={}& \sum_{i=0}^{r} \sum_{j=0}^{s} \sum_{x=0}^{t} \sum_{y=0}^{u} B_G(x, y \mid i, j, 0) \cdot B_H(t-x,\, u-y \mid r-i,\, s-j,\, 0) \\
&+ \sum_{i=0}^{r-1} \sum_{j=0}^{s} \sum_{x=0}^{t-Z_H} \sum_{y=0}^{u} B_G(x, y \mid i, j, 0) \cdot B_H(t-x-Z_H,\, u-y \mid r-i-1,\, s-j,\, 1) \\
&+ \sum_{i=0}^{r-1} \sum_{j=0}^{s} \sum_{x=0}^{t-Z_G} \sum_{y=0}^{u} B_G(x, y \mid i, j, 1) \cdot B_H(t-x-Z_G,\, u-y \mid r-i-1,\, s-j,\, 0) \\
&+ \sum_{i=0}^{r-2} \sum_{j=0}^{s} \sum_{x=0}^{t-Z_G-Z_H} \sum_{y=0}^{u} B_G(x, y \mid i, j, 1) \cdot B_H(t-x-Z_G-Z_H,\, u-y \mid r-i-2,\, s-j,\, 1)
\end{aligned}
\tag{4.3}
$$

where G and H are daughter nodes of node F. Z_G is 1 if the state of the phenotypic character at node G is changing from 0 → 1 and 0 otherwise, and the same holds for Z_H. Thus, similar to the calculation for W, starting from the tips, we can calculate our way up to the root of the tree given that for a terminal node B(0, 0|0, 0, 0) = B(0, 0|0, 0, 1) = 1, as there is only one way to have 0 gains and 0 losses within a terminal taxon, and B(t, u|r, s, 0) = B(t, u|r, s, 1) = 0 for all values of t, u, r and s not equal to 0. The case where a phenotype undergoes a reversal (1 → 0) needs to be considered more carefully and is dealt with in Section 4.3.2.
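To make the quantities in Equation 4.1 concrete, B and W can be tabulated by exhaustive enumeration on a tree small enough to list every reconstruction. The Python sketch below does this by brute force and deliberately does not use the recursion of Equation 4.3, so it is only practical for toy trees. The tree ((a,b)X,(c,d)Y)R, the choice of distinguished branch, and the function names are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Hypothetical toy tree, stored as descendant -> ancestor; "R" is the root.
PARENT = {"X": "R", "Y": "R", "a": "X", "b": "X", "c": "Y", "d": "Y"}


def tabulate(distinguished, root_state=0):
    """Count reconstructions by (u, r, n, m): u gains and r losses on the
    distinguished branches, n gains and m losses over the whole tree."""
    nodes = list(PARENT)
    table = Counter()
    for states in product((0, 1), repeat=len(nodes)):
        st = dict(zip(nodes, states))
        st["R"] = root_state
        u = r = n = m = 0
        for node in nodes:
            before, after = st[PARENT[node]], st[node]
            if before == after:
                continue
            gain = after == 1
            n += gain
            m += not gain
            if node in distinguished:  # a branch is named by its descendant
                u += gain
                r += not gain
        table[(u, r, n, m)] += 1
    return table


def cct(p, q, n, m, table):
    """Equation 4.1: probability of at least p gains and at most q losses
    on the distinguished branches, given n gains and m losses in total."""
    w = sum(c for (u, r, nn, mm), c in table.items() if (nn, mm) == (n, m))
    if w == 0:
        raise ValueError("no reconstruction has this number of gains and losses")
    b = sum(c for (u, r, nn, mm), c in table.items()
            if (nn, mm) == (n, m) and u >= p and r <= q)
    return Fraction(b, w)


table = tabulate(distinguished={"X"})
p_value = cct(1, 0, 1, 0, table)  # one gain in total, and it falls on branch X
```

With a single gain and six branches, exactly one of the six equally counted placements falls on the distinguished branch, so the value is 1/6. Tabulating once and then looking up each genotype's counts mirrors the precalculation strategy used by CCTSWEEP.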

A major limitation of the CCT is that it calculates the correlation between binary variables only. This is not a concern in the case study presented, as the mouse SNPs are overwhelmingly biallelic (having two states) and susceptibility to Bacillus anthracis, the causative agent of anthrax, is a binary variable (susceptible or resistant). A strain is considered resistant if > 90% of the macrophage cells are viable in the toxin produced by Bacillus anthracis (for experimental details see [64]). For continuous characters this means that a suitable threshold would have to be chosen to make the variable binary. An example of the use of CCT with continuous characters is presented in the next chapter. For discrete characters (for example, amino acids), a set of characters could be designated as 0 and the complementary set as 1.

4.2 Algorithm and Implementation

CCTSWEEP can be divided into two major components: reconstructing the ancestral states of the characters, and calculating the CCT between the genotypic character and the phenotypic character for those ancestral states. For calculating the CCT, we need a binary tree, knowledge of the branches over which the phenotype is changing, and the number and type of changes (gain or loss) in the genotype. Once we have that, we can use Equation 4.1 to find the p-value of a given genotype and a phenotype. For calculating P, we need to know the values for W and B. As CCTSWEEP is intended to be used for large scale correlations, considerable time savings are achieved by precalculating the matrices for W and B for a phenotype, and performing a lookup for the number and type of change for each genotypic character. As the matrix for B is a 4-dimensional array indexed by the numbers of gains and losses, calculation of the B-matrix is the most computationally intensive step in running CCTSWEEP.

CCTSWEEP is implemented in the scripting language provided by the phylogenetic package TNT as it provides a number of ways to simplify the reconstruction of characters on a phylogeny. The phylogeny itself can be calculated within TNT though that is not necessary and a binary tree determined by other means can be imported into TNT for use. The implementation can be seen in Appendix B.

Input: Set of binary characters in TNT or NEXUS format
Input: Maximum number of expected changes for a genotypic character
Input: Binary tree
Output: File containing p-values for each genotypic character and the phenotype

Reconstruct first character (phenotype) on the tree;
Calculate W matrix;
Calculate B matrix;
foreach genotypic character do
    Reconstruct character across tree;
    Count gains and losses in character;
    Calculate P;
end

Algorithm 2: Algorithm used by CCTSWEEP

The implementation of CCTSWEEP is separated into two parts, as once the W and B matrices have been calculated they can be saved and reused or copied to other machines to parallelize the operation of finding p-values for the SNPs.

4.3 Reconstruction

For calculating the CCT we need a reconstruction of each character over the phylogeny. Several methods have been developed for finding the reconstruction of a character under the principle of maximum parsimony [65]. Under maximum parsimony, we aim to minimize the total amount of evolutionary change needed to explain the variation in the given data. How evolutionary change is measured depends on the particular variant of parsimony employed and on the type of data. In the "Wagner method" for reconstructing character states, characters are measured on an interval scale and no a priori restrictions are imposed either on the reversibility of character states or on the number of times character-state changes may occur [66, 67]. These methods have achieved widespread popularity in phylogenetic analysis because of their presumed freedom from assumptions about the evolutionary process.

In its simplest form, the binary character is represented by 0 or 1, and we assume that the character can change both forward (0 → 1) and backward (1 → 0). The total amount of change is measured by the number of changes required over the tree. Under these conditions, Farris described an algorithm to assign the optimal character states to each of the interior nodes of the tree, which minimizes the total amount of evolutionary change in the character [67].

Farris's method for assigning internal node character states so as to obtain the minimum tree length required for a given topology consists of an initial pass during which state sets are computed for all internal nodes of a tree and a final pass in which nonsingleton states are replaced by singletons. Farris optimization can be used for any ordered set of characters. Swofford and Maddison provided a formal proof of Farris's method and also provided an algorithm for enumerating all possible maximum parsimony reconstructions [68].

The parsimony algorithm for reconstructing the states of a character on a phylogeny provided by Farris will always yield a most parsimonious reconstruction. However, it does not eliminate the possibility of other equally parsimonious solutions. Swofford and Maddison also provide algorithms for an exhaustive enumeration of all possible most parsimonious reconstructions. Since we are trying to calculate the correlation between changes in two characters on the tree, they must be optimized in such a way that the optimization method chosen is the same for both. Also, given many reconstructions, trying to work with all of them is computationally prohibitive and may not yield any additional information for the effort expended.

4.3.1 DELTRAN and ACCTRAN

There are two commonly used ways to obtain a unique optimization when faced with multiple reconstructions. In the first, called DELTRAN (for DELayed TRANsformation), we delay any changes toward the tips of the tree. In the other, called ACCTRAN (for ACCelerated TRANsformation), we accelerate changes toward the root of the tree. For DELTRAN, we compute the set of all possible states at each interior node and set the state of the root node using the state of the outgroup of the tree. An outgroup is a group that lies outside the group whose phylogeny is being analyzed. Ideally, an outgroup is close to the group being analyzed and it roots the tree.

We then move from the root to the tips, selecting the state from the set of all possible states at a node such that the amount of change on a branch is minimized. For a binary character, this reduces to selecting the state at the ancestral node if the descendant node is not a singleton set. Thus, all changes are moved toward the tips of the tree and, wherever possible, parallelisms are chosen over reversals. A variation on the above can be used to calculate an ACCTRAN reconstruction. For consistency, both the phenotype and genotype must be optimized using the same method.
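The two passes described above can be sketched for a binary character as follows. This is an illustration in Python, not the TNT routines used by CCTSWEEP. The first pass is a Fitch-style set computation (intersection if non-empty, otherwise union), which we substitute here for Farris's initial pass on the assumption that the two agree for a two-state character. The top-down pass keeps the ancestral state whenever the descendant's set allows it. Using these preliminary sets rather than the full sets of most parsimonious states is a simplification, so the result should be read as a delayed-change reconstruction in the spirit of DELTRAN, not as a guaranteed DELTRAN optimization. The tree and the names `down_pass` and `resolve` are hypothetical.

```python
# Hypothetical toy tree: each internal node maps to its two daughter nodes.
CHILDREN = {"R": ("X", "Y"), "X": ("a", "b"), "Y": ("c", "d")}


def down_pass(node, tips, sets):
    """Initial pass (tips to root): compute a state set for every node."""
    if node not in CHILDREN:
        sets[node] = {tips[node]}
        return
    left, right = CHILDREN[node]
    down_pass(left, tips, sets)
    down_pass(right, tips, sets)
    common = sets[left] & sets[right]
    sets[node] = common if common else sets[left] | sets[right]


def resolve(tips, root_state):
    """Final pass (root to tips): keep the ancestral state when allowed.

    `root_state` is assumed to come from the outgroup.
    Returns (states, changes); each change is (descendant, old, new).
    """
    sets = {}
    down_pass("R", tips, sets)
    states = {"R": root_state}
    changes = []
    stack = ["R"]
    while stack:
        node = stack.pop()
        for child in CHILDREN.get(node, ()):
            if states[node] in sets[child]:
                states[child] = states[node]  # no change on this branch
            else:
                # For a binary character the set must be the other singleton.
                states[child] = next(iter(sets[child]))
                changes.append((child, states[node], states[child]))
            stack.append(child)
    return states, changes


# Tip a differs from the rest: the change is delayed to the terminal branch.
states_1, changes_1 = resolve({"a": 1, "b": 0, "c": 0, "d": 0}, root_state=0)
# Both tips under X share state 1: the single gain sits on the branch to X.
states_2, changes_2 = resolve({"a": 1, "b": 1, "c": 0, "d": 0}, root_state=0)
```

The list of changes is exactly what the CCT needs: the branches, identified by their descendant nodes, on which a gain (0 → 1) or a loss (1 → 0) is reconstructed.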

4.3.2 Taking reversals into account

As the CCT considers only two states for the independent character, in our case, no change or a change from 0 to 1, we preferentially used DELTRAN in our analyses as it minimizes and in favorable situations avoids reversals on the branches. If a reversal (change from 1 to 0) is present, then alternate means have to be utilized to take that into account. A logically straightforward, though computationally expensive, method is to extend the recursion equations given earlier to include another state.

Let P_F(w, v, t, u|r, s, 0) ≡ P_F(w, v, t, u|r, s) be the number of ways in which w gains and v losses can be distributed over branches of the tree having a 0 → 1 change in the phenotype, and t gains and u losses can be distributed over branches of the tree having a 1 → 0 change in the phenotype, given r gains and s losses over the whole tree and the state of the genotype at the root node being denoted by 0. This can be calculated using the recursion relation given in Equation 4.4.

$$
\begin{aligned}
P_F(w, v, t, u \mid r, s, 0) ={}& \sum_{i=0}^{r} \sum_{j=0}^{s} \sum_{x=0}^{t} \sum_{y=0}^{u} \sum_{k=0}^{w} \sum_{l=0}^{v} P_G(k, l, x, y \mid i, j, 0) \\
&\qquad \times P_H(w-k,\, v-l,\, t-x,\, u-y \mid r-i,\, s-j,\, 0) \\
&+ \sum_{i=0}^{r-1} \sum_{j=0}^{s} \sum_{x=0}^{t-Z_H} \sum_{y=0}^{u} \sum_{k=0}^{w-A_H} \sum_{l=0}^{v} P_G(k, l, x, y \mid i, j, 0) \\
&\qquad \times P_H(w-k-A_H,\, v-l,\, t-x-Z_H,\, u-y \mid r-i-1,\, s-j,\, 1) \\
&+ \sum_{i=0}^{r-1} \sum_{j=0}^{s} \sum_{x=0}^{t-Z_G} \sum_{y=0}^{u} \sum_{k=0}^{w-A_G} \sum_{l=0}^{v} P_G(k, l, x, y \mid i, j, 1) \\
&\qquad \times P_H(w-k-A_G,\, v-l,\, t-x-Z_G,\, u-y \mid r-i-1,\, s-j,\, 0) \\
&+ \sum_{i=0}^{r-2} \sum_{j=0}^{s} \sum_{x=0}^{t-Z_G-Z_H} \sum_{y=0}^{u} \sum_{k=0}^{w-A_G-A_H} \sum_{l=0}^{v} P_G(k, l, x, y \mid i, j, 1) \\
&\qquad \times P_H(w-k-A_G-A_H,\, v-l,\, t-x-Z_G-Z_H,\, u-y \mid r-i-2,\, s-j,\, 1)
\end{aligned}
\tag{4.4}
$$

where G and H are daughter nodes of node F. Z_G is 1 if the state of the phenotypic character at node G is changing from 0 → 1 and 0 otherwise, and the same holds for Z_H. A_G is 1 if the state of the phenotypic character at node G is changing from 1 → 0 and 0 otherwise, and the same holds for A_H. As indicated before, if the state at the root node is 1, all that needs to be done is to switch the designations of the states denoted by 0 and 1.

From Equation 4.4 it can be noticed that the calculation of P at each node requires us to fill a 6-dimensional matrix, which is both memory and computation intensive but is tractable for trees in which the number of branches with changes is limited. The limits are set by the amount of memory and time available to the researcher. CCTSWEEP itself does not have any limits.

In many studies, certain assumptions regarding the nature of character evolution may be reasonable. For example, the transformation 0 → 1 may be more probable than the transformation 1 → 0 for a binary character. In such a case, an optimization with two 0 → 1 changes may be preferred over an optimization involving a 0 → 1 change and a 1 → 0 change even though they are "equally parsimonious" in requiring 2 units of change.

Another possible method, though with loss of information, is to ignore the direction of change and only consider change as a whole. Thus, 0 would be a state of no change, whereas 1 would be a change in either direction.

4.4 Case Study

In this case study, we extend the case study for Bacillus anthracis susceptibility in inbred mice described in Chapter 3 to a larger number of taxa and a denser SNP map. Here we focus only on SNPs on chromosome 11 as literature surveys indicated a higher possibility of finding correlated SNPs within this region. While there may be other regions in the genome (e.g. chromosomes 2, 3 and 7) with a functional relation to anthrax susceptibility, we would not be able to corroborate our findings as the literature is not sufficiently advanced regarding other contributing loci. We recalculate the tree using genotype data from chromosome 11 only as different parts of the genome could have different lineages [69], and a globally optimized tree may not reflect the changes happening in a smaller region completely. We use 21 taxa and a smaller number of SNPs, ∼600 as compared to the previous ∼13,000 after the 50% cutoff. The 50% cutoff avoids artifacts in reconstruction from too much missing data. The high ranking SNPs obtained using CCTSWEEP are shown in Table 4.1. A mirror tree visualizing the mouse strain patterns for one of the identified SNPs and anthrax susceptibility is shown in Fig. 4.1.

A number of markers on chromosome 11 were identified as significantly associated with mouse susceptibility to Bacillus anthracis. No completely penetrant markers were observed for this dataset. These markers were analyzed for their strain distribution patterns and for candidate genes near their genomic locations. Highly ranked markers included four SNPs in Nalp1b, one 2.3 Mb away from Dock2, one about 0.9 Mb from a locus with 5 CC chemokine genes (CCL3-6,9), one in proximity to IL12b (rs3654344), one 0.3 Mb from Nalp1b and one in Kif1c. One of these was also identified by VENN in the genomewide case (see Table 3.1). Since linkage disequilibrium blocks are typically large among inbred mouse strains, the Kif1c marker (rs3694522), located only 0.42 Mb away from Nalp1b, and another marker (rs3726991) quite possibly reflect the same QTL. Linkage disequilibrium is a term for the non-random association of alleles at two or more loci, not necessarily on the same chromosome. It is generally caused by interactions between genes, random drift or non-random mating, and population structure. In contrast, rs3654344, which was nearly as penetrant as the best Nalp1b marker, is located on chromosome 11 rather distantly (26.9 Mb proximal from Nalp1b), suggesting the possibility of additional QTLs for anthrax susceptibility.

SNP        CCT     Phi   Phi-rank  %   Annotation
---------  ------  ----  --------  --  ----------------------------------------
rs3142843  0.0032  1     1         43  NACHT, leucine rich repeat and PYD containing 1 (Nalp1b)
rs4228580  0.0032  0.54  12        10  Receptor activity modifying protein 3 (Ramp3)
rs3668244  0.0134  0.47  21        0   RNA polymerase II largest subunit (Polr2a), 1.3 Mb from Nalp1b
rs3690160  0.0134  0.61  8         10  Ten-m2 (Odz2), 2.3 Mb from Dock2
rs6268529  0.0134  0.46  22        10  0.9 Mb from CC chemokine gene locus CCL3-6,9
rs3142842  0.0158  0.84  3         43  Nalp1b
rs3148131  0.0158  0.84  3         43  Nalp1b
rs3148189  0.0158  0.78  5         14  Nalp1b
rs3654344  0.0158  0.79  4         5   0.2 Mb from Interleukin-12 beta chain precursor (Il12b)
rs3694522  0.0158  0.70  6         10  Kinesin family member 1C (Kif1c)
rs3713702  0.0158  0.26  80        5   0.19 Mb from Protoheme IX farnesyltransferase, mitochondrial (Cox10)
rs3726991  0.0158  0.46  22        10  0.3 Mb from Nalp1b

Table 4.1: High ranking SNPs within chromosome 11 obtained using CCTSWEEP. Phi-rank is the rank of the SNP using the phi-coefficient for correlation. The % column indicates the percentage of mouse strains (out of 21) with data inferred for that SNP.

4.4.1 Comparison to non-tree based methods

To compare CCTSWEEP with a non-phylogeny based method we rank the SNPs using the phi-coefficient [70] and display the results in Table 4.1. The phi-coefficient is a measure of the degree of association between two binary variables. The formula for the phi-coefficient is

$$
\phi = \frac{ad - bc}{\sqrt{efgh}} \tag{4.5}
$$

where a is the number of cases where both genotype and phenotype are 0, d is the number of cases where both genotype and phenotype are 1, b is the number of cases where the genotype is 1 while the phenotype is 0, and c is the number of cases where the genotype is 0 and the phenotype is 1. The terms in the denominator are e = a + b, f = c + d, g = a + c, and h = b + d. The term in the denominator keeps the phi-coefficient between -1 and 1.

[Figure 4.1 image: mirror tree of the strains SPRET/EiJ, CAST/EiJ, NZB/BINJ, DBA/2J, SWR/J, FVB/NJ, ST/bJ, C57BL/6J, C57L/J, BALB/cJ, PL/J, AKR/J, MRL/MpJ, C3H/HeJ, A/J, CBA/J, LP/J, 129S1/SvImJ, 129X1/SvJ, SM/J and I/LnJ. Characters shown: Bacillus susceptibility (states Susceptible, Resistant) and rs3142843 (states C, T).]

Figure 4.1: An illustration of the correlation between SNP rs3142843 and Bacillus anthracis susceptibility
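The phi-coefficient of Equation 4.5 can be computed directly from two binary vectors. The Python sketch below is illustrative only. The function name `phi_coefficient` is hypothetical, and skipping strains with missing data (encoded here as `None`) is an assumption that mirrors the way non-tree methods are described as ignoring missing data.

```python
import math


def phi_coefficient(genotype, phenotype):
    """Phi-coefficient between two binary (0/1) sequences of equal length."""
    a = b = c = d = 0
    for g, p in zip(genotype, phenotype):
        if g is None or p is None:
            continue  # missing data are ignored (assumption)
        if g == 0 and p == 0:
            a += 1
        elif g == 1 and p == 0:
            b += 1
        elif g == 0 and p == 1:
            c += 1
        else:
            d += 1
    e, f, g_, h = a + b, c + d, a + c, b + d
    denominator = math.sqrt(e * f * g_ * h)
    if denominator == 0:
        raise ValueError("phi is undefined when a marginal total is zero")
    return (a * d - b * c) / denominator


perfect = phi_coefficient([0, 0, 1, 1], [0, 0, 1, 1])
inverse = phi_coefficient([0, 0, 1, 1], [1, 1, 0, 0])
```

Unlike the CCT, this calculation treats every strain as an independent observation and makes no use of the tree.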

Although many SNPs that are strongly correlated according to CCTSWEEP are also strongly correlated according to the phi-coefficient, there are several SNPs that rank much higher with CCTSWEEP than with the phi-coefficient and vice versa. For instance, rs3726991, which is functionally related, appears at rank 22 using the phi-coefficient but at rank 6 using CCTSWEEP. Thus, our methods provide distinctly different and thus complementary results to non-tree based methods. In addition, CCTSWEEP and VENN handle missing data through character optimization whereas other methods simply ignore missing data.

4.4.2 Anthrax susceptibility candidates

In order to evaluate our candidates, we rely on the literature on Bacillus anthracis susceptibility. Inbred mouse strains have various responses to Bacillus anthracis, the causative agent of anthrax [71]. Specifically, cultured macrophages of various mouse strains exhibit differences in their susceptibility to cytolysis due to exposure to anthrax lethal toxin (LeTx). Watters et al. present in vitro and phylogenetic studies suggesting that the variability of resistance and susceptibility among mice is due to a single locus, Kif1C, a kinesin-like motor protein controlling macrophage [72]. This locus was identified both by VENN and by CCTSWEEP.

Other groups have found support for a multigenic nature of susceptibility to the anthrax lethal toxin. McAllister et al. reported that at least three QTLs on chromosome 11 control susceptibility to anthrax lethal toxin [73]. Macrophage cytolysis has been found to play a minor role in anthrax pathology, with experiments indicating that LeTx primarily alters signaling cascades in immune cells and blunts immune upregulation, thus reducing bactericidal potential against the pathogen [74, 75, 76]. At the same time, Boyden et al. recently demonstrated that a major determinant in mediating anthrax lethality is Nalp1b, a member of the inflammasome located near Kif1c on chromosome 11 and again one of the genes identified in our case study [77]. Expression of a Nalp1b allele from susceptible mice in resistant mouse strain macrophages conferred the susceptibility trait. They conclude that previous results regarding Kif1c are either artifactual due to linkage or indicate only a minor role for that candidate [72]. Other highly ranked SNPs are close to regions with genes important in lymphocyte chemotaxis (Dock2) [78] and T cell activation upon [79, 80, 81].

Thus, despite a demonstrated role for Nalp1b it remains possible that multiple loci mediate anthrax susceptibility in mice, and that there is considerable variation among the strains in the profile of their response to the toxin [82]. A promising marker identified in our case study without previously known implication for Bacillus susceptibility is rs3654344. The nearest gene to rs3654344, IL12b, is located 0.2 Mb distal. IL12b expression is fairly restricted to monocytes, macrophages, and dendritic cells where it plays a role in the TH1 immune response. Humans with defects in IL12b expression show decreased production of IFNγ and increased susceptibility to mycobacterial infections [83]. Further studies have associated human polymorphisms in IL12b with phenotypic stratification of individuals infected with [84]. Notably, differences in and inducible expression of IL12b have also been observed among mouse strains and associated with polymorphisms in that gene [85]. However, these data are incomplete for all strains and do not strictly follow the pattern of LeTx susceptibility. Nonetheless, the strain variation, tissue expression pattern, biological function and prior genetic associations do suggest this as a potential novel candidate locus for anthrax susceptibility, along with others in regions highly correlated here (Dock2 and CC chemokine locus).

4.5 Controlling for multiple testing

4.5.1 Statistical power and false discovery rates

It is important to note that correlations between genotype and phenotype only indicate candidate genomic regions that may contain a gene that influences a given phenotype. The causal relationship between a specific gene and a phenotype can only be confirmed using experimental techniques (e.g., mouse knockouts, complementation experiments). A method of assessing power, the false discovery rate (FDR), and/or the false positive rate of our methods would therefore be of practical value for evaluating how much confidence can be placed in any results from VENN.

Methods proposed to estimate the FDR [86, 87] have used the distribution of p-values over all hypothesis tests to estimate significance cutoffs. These methods assume that each test is independent of the others. As SNPs are found to be linked to each other in haplotype blocks, they are not independent and the above methods result in overly conservative significance cutoffs.

4.5.2 Family-wise error rate (FWER)

FWER is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple tests. While the assessment of false positive rate is less desirable than estimates of power or FDR, it has the advantage of being backed by a mathematically justifiable approach.

It has been understood for some time that the FWER is an extremely strict method of defining the critical value for the rejection region [88, 89]. A Bonferroni correction states that if an experimenter is testing n independent hypotheses on a set of data, then the statistical significance level that should be used for each hypothesis separately is 1/n times what it would be if only one hypothesis were tested [90]. In our case study, we examine 600 SNPs for association with susceptibility to Bacillus anthracis. With α = 0.05, we find that after correction we would need a CCT value less than 0.000083 for an association to be considered significant. Thus, despite the observation that most of the top loci overlap with previously known regions implicated for this trait, we have no significant association between any SNP and the phenotype.
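The arithmetic behind this threshold is short enough to spell out. The Python snippet below simply restates the numbers given in the text: 600 tests, α = 0.05, and the smallest CCT value reported in Table 4.1 (0.0032). The function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(alpha=0.05, n_tests=600)
best_cct = 0.0032  # smallest CCT value listed in Table 4.1
is_significant = best_cct < threshold
```

The per-test level is 0.05/600 ≈ 0.000083, and even the best-ranked SNPs miss it by more than an order of magnitude.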

In the process of protecting against false positives (ensuring absolute specificity), FWER sacrifices sensitivity and effectively eliminates the power of this association method.

4.5.3 False Discovery Rate

To generate significance thresholds that report practically useful associations, the rejection region must be relaxed by increasing the tolerance of false positives. In the context of multiple testing, the generalized family wise error rate (gFWER) is the probability of at least k + 1 type I errors occurring among any of the hypothesis tests [91]. False discovery rate controls the expected proportion of incorrectly rejected null hypotheses (type I errors) in a list of rejected hypotheses [92]. It is a less conservative comparison procedure with greater power than FWER control, at a cost of increasing the likelihood of obtaining type I errors [93].

One multiple hypothesis testing error measure is the false discovery rate (FDR), which is loosely defined to be the expected proportion of false positives among all significant hypotheses. The FDR is especially appropriate for exploratory analyses in which one is interested in finding several significant results among many tests [94]. The q-value is defined to be the FDR analogue of the p-value. The q-value of an individual hypothesis test is the minimum FDR at which the test may be called significant. The percentage of false positives that can be tolerated will generally depend on the type of follow-up study to be done on the resulting candidates.
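As an illustration of how a q-value-like quantity can be obtained from a list of p-values, the Python sketch below applies the Benjamini-Hochberg step-up adjustment. This is one common FDR-controlling procedure and is named here as our choice for illustration. Whether it matches the procedures of the cited references is an assumption, and, as noted earlier, it presumes independent tests, which linked SNPs do not satisfy. The p-values are toy numbers, not results from the case study.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


toy_p = [0.01, 0.04, 0.03, 0.005]
toy_q = bh_adjust(toy_p)
```

A test is then called significant at a chosen FDR level, say 5%, when its adjusted value does not exceed that level. The tolerated level would depend on the follow-up study planned for the candidates.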

4.6 Conclusion

In this chapter, we described the methodology behind the software termed CCTSWEEP. CCTSWEEP uses a modification of Maddison’s Concentrated Changes Test (CCT) to find correlations between two binary variables that are optimized onto the nodes of a phylogenetic tree. The two variables are the genotype and the phenotype. If the phenotype is not naturally binary, it has to be binarized suitably, for example by thresholding. We demonstrated the applicability of CCTSWEEP by finding genes in mice correlated with susceptibility to Bacillus anthracis, the causative agent of anthrax. A review of the Bacillus anthracis literature reveals that Nalp1b, the strongest candidate identified by VENN and CCTSWEEP, as well as another gene identified, Kif1c, have been specifically studied and implicated in resistance. This is a good indication of the validity of these methods, and supports the conclusion that other SNPs identified by these tools are potential candidates for further investigation.

Thus, the correlation of genetic and phenotypic changes in phylogenetic reconstructions on a large scale may significantly aid in identifying candidate genes for disease-related traits.

CHAPTER 5

APPLICATIONS OF CCTSWEEP

CCTSWEEP was developed as a way to correlate a binary phenotype with a binary genotype, but it can also be used to find correlations between any binary variables that can be optimized on a phylogenetic tree. Thus, it can be used to find correlations among a large number of phenotypes, which may lead to insights about their causative factors. In this chapter, we illustrate results from studies where CCTSWEEP was used to find, or in some cases confirm, whether apparent correlations between genotypic and phenotypic traits were statistically significant.

5.1 CCTSWEEP used as part of Mobius

Mobius is a framework that supports distributed creation, versioning, management, and semantic discovery of data models and data instances, on-demand creation of databases, federation of existing databases, and querying of data in a distributed environment [95, 96]. We used datasets from the Mouse Phenome Database (MPD) [97] and GNF2. To facilitate the study of complex genetic diseases in mouse models, the Jackson Labs have compiled an extensive Mouse Phenome Database. Phenotype data from many mouse strains is collected from the literature and via collaboration with experts through consistently applied protocols. In addition, a large number of SNP datasets are now publicly available in the MPD. We analyze SNP data for 15 strains represented in mpd146, for which there was accompanying phenotype data in the MPD.

5.2 Case study: Lipid traits in mice

Coronary artery disease is a widespread affliction in the first world, and it has long been known that there is a strong genetic factor in the pathogenesis of this disorder [98]. It is also known that the progression of this disease is controlled by the interactions of a large number of genes and the environment. Genetic studies in humans are hampered by long lifespan and generation length, among other factors. Over the past century, the mouse has therefore been developed into the premier mammalian model system for genetic research. Scientists from a wide range of biomedical fields have gravitated to the mouse because of its close genetic and physiological similarities to humans, as well as the ease with which its genome can be manipulated and analyzed.

Mouse models currently available for genetic research include thousands of unique inbred strains and genetically engineered mutants. An inbred strain is one that has been maintained by sibling (sister × brother) mating for 20 or more consecutive generations. Except for the sex difference, mice of an inbred strain are as genetically alike as possible, being homozygous at virtually all of their loci. An organism is referred to as being homozygous (of the same alleles) at a specific locus when it carries two identical copies of the gene affecting a given trait on the two corresponding homologous chromosomes. This eliminates a potential complicating factor in finding correlations, namely effects that come about due to different copies of a gene at a given locus. An inbred strain has a unique set of characteristics that sets it apart from all other inbred strains. Many traits do not vary from generation to generation. Other traits are easily influenced by diet and environmental conditions and therefore may vary from one generation to the next.

Using mice prone to developing characteristics of coronary artery disease such as elevated lipid levels when fed with a fatty diet (atherogenic), and other mice that are not prone to it, we can find locations of SNPs where a change in the allele is correlated with a change in indicators of coronary artery disease. These indicators include homocysteine levels, high-density and low-density lipoprotein (cholesterol) levels (HDL and LDL) and triglyceride levels.

The genotype data was obtained from the MPD. The mpd146 dataset available in the MPD is a compilation of 439,942 single and multiple nucleotide polymorphisms genotyped by many research groups for 17 mouse strains. The GNF2 database contains 8944 SNPs [51]. Although the GNF2 data has fewer SNPs, the datasets are more uniform in strain coverage, include a larger number of strains (a total of 48 strains), and are more evenly distributed along the genome. Phenotype data was available in the MPD for 39 of the strains represented in the GNF2.

Using the GNF2 datasets, we construct phylogenetic trees of the mouse strains with the strain SPRET/EiJ as the outgroup using TNT. Since the CCT works only on binary data values, the phenotype data is binarized using a threshold, such as whether a value lies more than one standard deviation above or below the mean (e.g., the mean and standard deviation of non-HDL cholesterol levels). The SNPs are often naturally biallelic, and thus binary. CCTSWEEP was then run on the datasets and high-ranking SNPs were identified.
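The thresholding step can be sketched as follows. The function name `binarize_above`, the use of the sample standard deviation, and the strain values are all assumptions made for illustration; they are not taken from the MPD or from the CCTSWEEP code.

```python
from statistics import mean, stdev


def binarize_above(values, n_sd=1.0):
    """Map each value to 1 if it exceeds mean + n_sd * sd, else to 0."""
    cutoff = mean(values) + n_sd * stdev(values)  # sample sd (assumption)
    return [1 if v > cutoff else 0 for v in values]


# Hypothetical non-HDL cholesterol levels for six strains.
levels = [60.0, 62.0, 58.0, 61.0, 120.0, 59.0]
states = binarize_above(levels)
print(states)  # [0, 0, 0, 0, 1, 0]
```

The same pattern with `v < mean - n_sd * sd` gives the binarization for unusually low values. Strains whose values lie close to the cutoff can switch state under small changes of `n_sd`, which is one reason a method for continuous phenotypes is desirable.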

Figure 5.1: Mirrored phylogenetic trees of females of mouse strains displaying correlated changes of a phenotype and a genotype across 15 mouse strains. The right tree depicts phenotypic change in non-high-density lipoprotein (non-HDL) cholesterol plasma levels in female mice after six weeks of atherogenic diet. Black branches indicate strains (C57BL/6J and CAST/EiJ) with non-HDL levels greater than one standard deviation (sd) above the mean after treatment. Genotype observations for each strain for the SNP of interest (rs3023213; T or C) are indicated on the left tree. Boxes at the terminal branches of the trees indicate genotype or phenotype observations in databases for those strains. CCT results for this phenotype-genotype correlation differ for females (p = 0.004) and males (p = 0.088) (not shown).

An illustration of the types of results obtained can be seen in Figure 5.1. An example query (rs3023213) identified NNMT (nicotinamide N-methyltransferase) as a candidate gene for high non-HDL levels in female mice of strains C57BL/6J and CAST/EiJ, within a block on mouse chromosome 9. NNMT is highly expressed in liver tissue and is known to exhibit large differences, in level and activity, between mouse strains and genders [99], and among humans [100]. N-methyltransferases (e.g., NNMT) are involved in the biochemical synthesis of homocysteine, a cardiovascular disease risk factor. NNMT was recently implicated as a genetic factor for plasma homocysteine levels in a genome-wide linkage study in humans [101]. The potential link between N-methyltransferases, homocysteine and cholesterol levels is supported by findings in a knockout mouse model [102].

5.3 Spread of Avian Influenza

Avian influenza is not only complex and multidimensional in terms of biology but also raises several social and political issues. Pandemic influenza would have severe implications for public health, economic security, food safety, and wildlife conservation. Wild birds are known to carry all strains of influenza and, in theory, any of these strains could be the source of the next human pandemic.

The influenza virus carries several different surface proteins such as hemagglutinin (HA) and neuraminidase (NA). Various mutations in these proteins can cause a virus to shift host or become more infectious. An important question is whether we see distinct temporal and spatial patterns in putatively key mutations in H5N1’s proteins. The mutations we chose to track are thought to be important to infection and replication of H5N1 in various hosts such as mammals, anseriform birds (wild aquatic birds), or galliform birds (domestic birds). Using CCT, we analyzed the correlation between various mutations in these proteins, indicated by amino acid and position, and phenotypes in the birds carrying the virus. The correlations are listed in Table 5.1.

To examine avian influenza evolution, we performed a phylogenetic analysis of the H5N1 genome from 291 isolates, 259 of which were complete genomes. As more data on influenza genomes was released, a second data set of 351 complete genomes was constructed later. Multiple-sequence alignment of nucleotide and amino acid data was performed with MUSCLE under default parameters [103].

5.3.1 Genotypes Associated with Various Hosts

We see a strongly supported association between the genotype Lysine-627 in PB2 and mammalian hosts in the 291- and 351-isolate data sets (Table 5.1). This genotype does not occur exclusively in mammals but is of interest because it is experimentally associated with increased replication and virulence of the H5N1 virus in laboratory mice [104, 105]. In the 351-isolate data set, the association between Lysine-627 in PB2 and anseriform hosts is marginally nonsignificant under the conservative (CCT ≤ 0.0125) significance level that we have set. Within the surface proteins HA and NA, no genotypes are significantly associated with certain host types in the 291- or 351-isolate data sets. Mutations of HA in amino acid positions 226 and 228, which mediate a shift from avian to human specificity in seasonal influenza strains of subtype H3 [106], are virtually invariant at Gln-226 and Gly-228 among the 291 and 351 isolates of H5N1 that we considered. Although Arg-110 in NA was proposed as

Isolates in  Genotype                                  In or west of East       Isolated in  Anseriform  Galliform  Mammalian
dataset                                                Asian-Australian flyway  2005-2006    host        host       host
291          Isoleucine-99 in hemagglutinin            1.00                     0.19         0.27        1.00       0.15
291          Asparagine-268 in hemagglutinin           0.014                    0.061        0.105       1.00       1.00
291          Arginine-110 in neuraminidase             0.007                    0.035        0.122       1.00       1.00
291          Lysine-627 in polymerase basic protein 2  1.00                     1.00         1.00        0.48       < 6 × 10^−5
351          Isoleucine-99 in hemagglutinin            0.073                    1.00         0.258       0.193      0.170
351          Asparagine-268 in hemagglutinin           0.042                    1.00         0.098       0.079      1.00
351          Arginine-110 in neuraminidase             0.034                    1.00         0.084       0.0624     1.00
351          Lysine-627 in polymerase basic protein 2  0.184                    1.00         0.0164      0.108      < 1 × 10^−5

Table 5.1: The correlation between phenotypes and various genotypes calculated using CCT. To correct for multiple testing we set the significance level at CCT ≤ 0.0125. Significant associations are in bold, and nearly significant (0.0125

Figure 5.2: (top) Screenshot of a phylogenetic tree for 351 isolates projected on Earth. Branches of the tree are traced with color to represent the optimization of a character for taxonomic order of hosts. (bottom) A view of avian influenza spread from East Asia on the 291-taxa tree, showing Lysine-627 position in PB2 character optimization as colored branches.

a signature for H5N1 to migratory waterfowl [107], this genotype is not significantly correlated with any particular host (Table 5.1).

5.3.2 Spread of Various Genotypes over Time and Space

In genotypes of HA amino acid positions 99 and 268, we see virtually no variation from the Ala-99 Tyr-268 genotype within the East Asian-Australian flyway in the 291- and 351-isolate data sets, except that a few isolates have Thr or Val at position 99. To the west of the East Asian-Australian flyway, however, the HA genotype Ile-99 Asn-268 is prevalent, with the sole exception of a single branch of the tree representing isolates of H5N1 from eagles smuggled from Thailand to Belgium in 2004 [108]. The bias of Asn-268 in the west is statistically significant at the CCT ≤ 0.05 level but is marginally nonsignificant at the CCT ≤ 0.0125 level (Table 5.1). We found no significant correlation in the 291- or 351-isolate data sets between HA amino acid positions 99 and 268 and the dependent characters of time, anseriform host, galliform host, and mammalian or avian host (Table 5.1). Genotype Arg-110 of the surface protein NA is significantly correlated with viruses isolated west of the East Asian-Australian flyway for at least the 291-isolate data set (CCT ≤ 0.0125 in the 291-isolate data set but only CCT ≤ 0.05 in the 351-isolate data set; Table 5.1). The correlation of the genotype Arg-110 of NA with isolation in 2005-2006 is nearly significant for the 291-isolate data set but nonsignificant for the 351-isolate data set (Table 5.1). Despite the visual appeal of a potential correlation (Figure 5.2), the CCT does not indicate a strong correlation between Lysine-627 in PB2 and the 2005-2006 date of isolation or isolation west of the East Asian-Australian flyway.

Suggestions of key genotypes for the spread of H5N1 to various hosts based on experimental mutagenesis represent a prognostic inference about which mutations we should be tracking. Our study of the actual geographic variation of H5N1 mutations and host shifts puts these experimental inferences in a real-world context and checks them against data derived from isolates actually circulating in the field.

CHAPTER 6

CORRELATION OF CONTINUOUS CHARACTERS AND GENOTYPES

In the last 3 chapters we have examined 2 methods for finding genetic changes correlated with phenotypic changes. The case studies performed demonstrate that they are useful for narrowing down the number of genomic regions that need to be considered while searching for a causative link to the correlation identified. Both of them suffer from a flaw that restricts their real-life applications. VENN requires that both characters be discrete, and CCTSWEEP requires that they be not only discrete but binary as well. While the genotype is naturally discrete, this is not true for most phenotypes. Most phenotypes, especially in complex animals, are continuous.

To overcome this, it is possible to discretize or binarize continuous characters, but the methods always involve loss of information, and can be arbitrary. We may set a threshold, e.g. a standard deviation above or below the mean to discretize a continuous character, but we lose information on variation within the category. Also, if a character is just around the threshold, then small changes in the threshold can sometimes significantly alter the correlation results.

6.1 Background

The first general category of comparative methods comprises those that are explicitly nonphylogenetic. Usually this is a correlation across the tips of a phylogeny, simply a Pearson product-moment correlation between the raw values of two traits for a series of species. This type of procedure has been dubbed the “nonphylogenetic approach” [109] and “naive species regression” [110]. It is still popular as it is easy to implement and does not need any knowledge of the phylogeny.

6.1.1 Continuous correlation using trees

Previous attempts at correlating continuous characters using phylogenetic trees have focused on cases where both the characters being compared are continuous [111]. Felsenstein proposed the method of phylogenetically independent contrasts. Species themselves are not statistically independent, but the differences between them are. Thus, for any group with a known phylogeny, character values can be subtracted from one another for each terminal species pair and for each ancestral node. Pairs of contrasts can then be used in correlations and regressions forced through the origin [112]. Felsenstein’s method has some technical limitations: it requires a known phylogeny and branch lengths, it assumes a Brownian motion model of evolution, and it still provides only a correlation between characters. However, it has proven robust over a number of studies and simulations [112, 113]. The method that we propose uses changes in a character over all branches of the tree instead of just the terminal branches. This allows use of all the evolutionary data represented by the tree. Also, our implementation allows for checking a large number of genotypic characters against a phenotype, whereas we are not aware of any implementation of the independent contrasts method that does that.

Felsenstein’s independent contrasts method computes (weighted) differences (“contrasts”) between the character values of pairs of species and/or nodes, as indicated by a phylogenetic topology, working down the tree from its tips. This procedure results in n − 1 contrasts from n original tip species. As long as the ancestral nodes are correctly determined, each of these contrasts is independent of the others in terms of the evolutionary changes that have occurred to produce differences between the two members of a single contrast. Because the n − 1 contrasts are statistically independent, they can be employed in standard statistical analyses.
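To make the procedure concrete, the following is a minimal sketch of the standard contrast computation on a rooted binary tree with branch lengths, under the Brownian motion model. The tree encoding (a tip is a name, an internal node is a tuple `(left, left_length, right, right_length)`), the function name `contrasts`, and the example values are assumptions made for illustration and do not correspond to any particular software package.

```python
import math


def contrasts(node, values):
    """Return (node value, extra branch length, list of contrasts).

    A tip is a name (str); an internal node is a tuple
    (left, left_length, right, right_length).
    """
    if isinstance(node, str):
        return values[node], 0.0, []
    left, len_left, right, len_right = node
    x_l, extra_l, c_l = contrasts(left, values)
    x_r, extra_r, c_r = contrasts(right, values)
    # Branches leading to estimated (internal) nodes are lengthened to
    # account for the uncertainty of the estimated ancestral value.
    v_l = len_left + extra_l
    v_r = len_right + extra_r
    # Standardized contrast between the two daughters.
    contrast = (x_l - x_r) / math.sqrt(v_l + v_r)
    # Ancestral value: average weighted by inverse branch length.
    x_node = (x_l / v_l + x_r / v_r) / (1.0 / v_l + 1.0 / v_r)
    extra = v_l * v_r / (v_l + v_r)
    return x_node, extra, c_l + c_r + [contrast]


# Hypothetical three-species tree ((A:1, B:1):1, C:2) and trait values.
tree = (("A", 1.0, "B", 1.0), 1.0, "C", 2.0)
trait = {"A": 1.0, "B": 3.0, "C": 5.0}
root_value, _, cs = contrasts(tree, trait)
print(len(cs))  # 2
```

Three tips yield two contrasts, in line with the n − 1 count above. Computing the contrasts of a second trait on the same tree and regressing one set on the other through the origin gives the phylogenetically corrected correlation.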

The independent contrasts method and its modifications have found their way into many software packages [114], but relatively few of them allow one to correlate a discrete character with a continuous character. Also, none of these implementations are geared towards comparing large numbers of characters, as is necessary when doing genome-wide associations.

In the next section, we address these shortcomings with a new method which correlates a continuous character with a binary character (or an ordered discrete character) and is usable for correlating many thousands of characters with minimal user interaction.

6.2 Optimizing characters on a tree

In Chapter 4 we briefly discussed the different schemes by which a binary character can be optimized on a tree such that the total change in the character over the tree is minimized. The algorithm used for optimizing discrete characters can be generalized to a continuous character; instead of an individual character state or a set of states at each internal node, we obtain a range of values for each internal node.

6.2.1 Optimization algorithms

In the context of phylogenetic trees, optimizing a character on the tree means minimizing the amount of evolutionary change in the character. Another way of understanding parsimony methods is that they are cost minimization procedures, where the cost is a measure of the amount of evolutionary change. In previous cases, we were dealing with discrete or binary characters and the natural measure of evolutionary change was simply the number of changes between states. In cases where the character is an ordered character, Farris [67] described an algorithm for assigning optimal character states to each of the interior nodes on a tree so as to minimize the total change in the character on the tree. This model is known as Wagner parsimony, after Wagner, whose groundplan divergence method [115] helped stimulate work on phylogeny algorithms.
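The first (tip-to-root) pass of this kind of optimization can be sketched for a continuous character as an interval calculation: each node receives a range of values, and the cost grows by the size of the gap whenever the ranges of two daughters do not overlap. The code below is an illustrative sketch of that pass only; the nested-pair tree encoding and the function name `farris_downpass` are assumptions, and the intervals it returns are preliminary ranges rather than the final reconstruction, which requires a second pass.

```python
def farris_downpass(node, values):
    """Return (low, high, cost) for the subtree rooted at node.

    A tip is a name (str); an internal node is a pair (left, right).
    """
    if isinstance(node, str):
        v = values[node]
        return v, v, 0.0
    lo1, hi1, cost1 = farris_downpass(node[0], values)
    lo2, hi2, cost2 = farris_downpass(node[1], values)
    lo, hi = max(lo1, lo2), min(hi1, hi2)
    if lo <= hi:
        # The daughter intervals overlap: keep the intersection at no cost.
        return lo, hi, cost1 + cost2
    # Disjoint intervals: the node takes the gap between them, and the
    # width of the gap is added to the total amount of change.
    return hi, lo, cost1 + cost2 + (lo - hi)


# Hypothetical tree ((A, B), C) with a continuous character at the tips.
tree = (("A", "B"), "C")
tip_values = {"A": 1.0, "B": 3.0, "C": 5.0}
print(farris_downpass(tree, tip_values))  # (3.0, 5.0, 4.0)
```

Here the node joining A and B receives the range [1, 3] at a cost of 2, and the root receives [3, 5] with a total cost of 4, which is the minimum total change for these tip values on this tree.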

6.2.2 Choosing a particular optimization

As unique character reconstructions are rare, we have to work out ways to deal with ambiguous states for internal nodes on the tree. Two commonly used ways for obtaining a unique reconstruction are ACCTRAN (ACCelerated TRANsformation) and DELTRAN (DELayed TRANsformation). ACCTRAN preferentially optimizes all excess changes toward the root of the tree, while DELTRAN does the opposite and pushes all excess change toward the tips of the tree [68].

While the algorithm provided in Swofford and Maddison’s paper [68] is for optimizing a discrete ordered character, it can be generalized to continuous characters [116].

To correlate the genotype and phenotype we optimize both the genotypic character and the phenotypic character over the tree. We use the same optimization (ACCTRAN or DELTRAN) for both characters to allow for a consistent comparison of whether or not the two characters are correlated.

Rather than make specific distributional assumptions, a permutation test reshuffles the data samples at hand to construct the distribution of the test statistic under the null hypothesis. If the value of the test statistic based on the original samples is extreme relative to this distribution (i.e., if it falls far into the tail of the distribution), then the null hypothesis of “no difference between the populations” from which the data samples were drawn is rejected. Permutation tests are applicable under a much broader range of data and research conditions than most parametric tests. In addition, they often have as much statistical power as their parametric counterparts, and sometimes more, and unlike many parametric and other nonparametric tests, the results of permutation tests (the p-values) are unbiased.

The basic insight upon which permutation testing rests is that the extremity of a test statistic can be judged by comparison to its distribution with the relationship between data and model suitably permuted. Because permuted data has no expected relationship to the model (or vice versa), permutations can be viewed effectively as replications of the experiment with no expected relationship between model and data, that is, with a null hypothesis that is true. As the number of possible orderings of a set grows factorially with the size of the set, complete enumeration is not computationally feasible. Instead of taking all possible permutations, we generate a reference distribution by Monte Carlo sampling, which takes a small random sample of the possible replicates. This type of permutation test is also known as an approximate permutation test. The only difference between a Monte Carlo and an exact test is the precision to which p is calculated [117, 118, 119, 120].
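A generic Monte Carlo permutation test can be sketched in a few lines. This is an illustration of the general idea only and ignores the phylogeny. The function names, the choice of a centered cross product as the test statistic, the (hits + 1)/(n + 1) estimate of p, and the example data are all assumptions.

```python
import random


def centered_cross_product(x, y):
    """Sum of products of deviations from the means of x and y."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))


def permutation_p_value(x, y, statistic, n_perm=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value for statistic(x, y)."""
    rng = random.Random(seed)
    observed = abs(statistic(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any relationship between x and y
        if abs(statistic(x, y_perm)) >= observed:
            hits += 1
    # Counting the observed arrangement itself keeps p strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical binary genotype and continuous phenotype for eight taxa.
genotype = [0, 0, 0, 0, 1, 1, 1, 1]
phenotype = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
p = permutation_p_value(genotype, phenotype, centered_cross_product)
print(p)
```

With n_perm random shuffles the smallest attainable p is 1/(n_perm + 1), which is why resolving very small p-values by permutation alone requires many replicates.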

Permutation tests have been applied to genotype-phenotype correlations by various researchers, though without consideration for phylogeny [121]. They are especially useful with complex traits where a phenotype is affected by genotypes across multiple loci, as they do not rely on a specific model for the interaction of the loci [122, 123].

6.3 Implementation

This approach is implemented in 3 steps. In the first, we write the phenotypic values for the terminal nodes, together with a sufficiently large number of permutations (shufflings) of them, to a TNT-format file. The shuffling is done using Knuth’s shuffling algorithm [124]. The true phenotypic values and the randomized ones are then used to find the values at the internal nodes of the tree under DELTRAN or ACCTRAN optimization. Then, under the chosen optimization, we calculate the change on each branch for the phenotype and its permutations.
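The shuffling algorithm referred to here is commonly known as the Fisher-Yates or Knuth shuffle. The sketch below shows it in Python for illustration; the function name `knuth_shuffle` and the example values are assumptions and are unrelated to the AWK code used in the actual implementation.

```python
import random


def knuth_shuffle(items, rng):
    """Return a uniformly random permutation of items."""
    out = list(items)
    # Walk from the last position down, swapping each element with a
    # randomly chosen element at or before its own position.
    for i in range(len(out) - 1, 0, -1):
        j = rng.randint(0, i)  # randint includes both endpoints
        out[i], out[j] = out[j], out[i]
    return out


# Hypothetical phenotype values at the tips, permuted five times.
rng = random.Random(42)
tip_phenotypes = [71.0, 64.5, 80.2, 55.1, 90.3]
permutations = [knuth_shuffle(tip_phenotypes, rng) for _ in range(5)]
```

Each permutation reassigns the observed phenotype values to the tips at random, so the set of values is preserved while any association with the tree and the genotype is destroyed.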

For the genotype the same analysis is done, and we find the change in the genotype on each branch. The genotype is chosen such that it is binary. Thus, the change for the genotype on each branch is +1 (0 → 1), −1 (1 → 0) or 0 (no change). Having these two sets, we can find the average change for each SNP on the tree by multiplying the change in the SNP by the change in the phenotype on each branch and adding the products together.
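The branch-wise statistic can be sketched as follows, assuming that the states at all nodes (tips and reconstructed internal nodes) are already available. The edge-list encoding of the tree, the function names, and the example states are assumptions for illustration. The sketch returns the plain sum of products; dividing by the number of branches would only rescale it.

```python
def branch_changes(states, edges):
    """Change along each (parent, child) edge: child minus parent."""
    return [states[child] - states[parent] for parent, child in edges]


def correlated_change(geno_states, pheno_states, edges):
    """Sum over branches of genotype change times phenotype change."""
    dg = branch_changes(geno_states, edges)
    dp = branch_changes(pheno_states, edges)
    return sum(g * p for g, p in zip(dg, dp))


# Hypothetical tree: root -> n1 -> {A, B}, root -> C.
edges = [("root", "n1"), ("root", "C"), ("n1", "A"), ("n1", "B")]
geno = {"root": 0, "n1": 1, "A": 1, "B": 1, "C": 0}
pheno = {"root": 2.0, "n1": 4.0, "A": 4.5, "B": 4.0, "C": 2.0}
print(correlated_change(geno, pheno, edges))  # 2.0
```

Only branches on which the genotype changes contribute, and a 0 → 1 change that coincides with an increase in the phenotype contributes positively, as does a 1 → 0 change that coincides with a decrease.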

Repeating this analysis for each of the permutations allows us to compare the average change for a particular SNP with the distribution of the average change on permutations of the phenotypes for that SNP. From the distance between the mean of the distribution and the actual value, we can calculate the p-value for the correlation between a SNP and the phenotype. As can be seen in Figure 6.1, the distribution turns out to be close to a Gaussian or normal distribution, and hence we decided to use the computationally more efficient parametric testing despite the above-mentioned advantages of nonparametric methods. To obtain the same power from nonparametric methods we would have to compute many more than the 2000 permutations that were used for the case study below.

For a normal distribution, we can calculate the p-value by

\[ p = 1 - \operatorname{erf}\left( \frac{|k - \langle k \rangle|}{\sigma \sqrt{2}} \right) \tag{6.1} \]

where k is the value, ⟨k⟩ is the mean, and σ is the standard deviation of the distribution. erf(z) is the “error function” encountered in integrating the normal distribution and is defined by

\[ \operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} \, dt \tag{6.2} \]
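Equation 6.1 can be evaluated directly with the error function in Python's standard library. In this sketch, the mean and standard deviation are estimated from the values obtained on the permuted phenotypes; the function name `normal_p_value`, the use of the sample standard deviation, and the example numbers are assumptions for illustration.

```python
import math
from statistics import mean, stdev


def normal_p_value(k, null_values):
    """Two-sided p-value of k under a normal fit to null_values."""
    mu = mean(null_values)
    sigma = stdev(null_values)  # sample standard deviation (assumption)
    z = abs(k - mu) / (sigma * math.sqrt(2.0))
    # erfc(z) equals 1 - erf(z) but stays accurate when p is very small.
    return math.erfc(z)


# Hypothetical null values with mean 0 and standard deviation 1.
null_values = [-1.0, 0.0, 1.0]
print(normal_p_value(1.96, null_values))
```

Using `math.erfc` instead of computing `1 - math.erf(z)` avoids the loss of precision that occurs when erf(z) is very close to 1, which matters for the small p-values of interest in a genome-wide scan.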

Algorithm 3 illustrates the implementation of this method for finding genotype-phenotype correlations.

This algorithm was implemented in AWK and TNT scripting language. The parts related to optimizing the phenotype and its permutations were implemented in TNT whereas the rest was implemented in AWK.

6.4 Case Study: HDLC levels in inbred mice

To test our method we use it to find genetic regions in mice correlated to lipid levels. Atherosclerosis is a leading cause of coronary artery disease and has been extensively studied using mice as a model organism. Using mice prone to developing

Input: Set of continuous phenotypic characters in TNT or NEXUS format
Input: Set of genotypic characters in TNT or NEXUS format
Input: Binary tree
Output: File containing p-values for each genotypic character and the phenotype

Compute permutations of phenotypic character sets;
Reconstruct phenotypic character on tree;
Apply DELTRAN algorithm;
Compute change on each branch;
foreach permutation of phenotypic character do
    Compute reconstruction of character on tree;
    Apply DELTRAN algorithm;
    Compute change on each branch for permuted character;
end
foreach genotypic character do
    Reconstruct character across tree;
    Apply DELTRAN algorithm;
    Compute change on each branch for the genotype;
    Calculate average change in phenotype on each branch weighted by change in genotype;
    Calculate average change in permuted phenotype on each branch weighted by change in genotype;
    Calculate standard deviation for the genotype;
    Calculate p-value for the genotype;
end

Algorithm 3: Algorithm used by continuous correlation algorithm. Using DELTRAN is one way of producing a unique reconstruction at each node. ACCTRAN may be used as well.

characteristics of coronary artery disease such as elevated lipid levels when fed with an atherogenic diet, and other mice that are not prone to it, we can find locations of SNPs where a change in the allele is correlated with a change in indicators of coronary artery disease. These indicators include homocysteine levels, high-density and low-density lipoprotein (cholesterol) levels (HDL and LDL) and triglyceride levels. This trait has also been the focus of researchers applying computational methods toward

Figure 6.1: A normal quantile plot of the change on a branch of the phylogenetic tree.

identifying genes that may be related to this trait which allows for comparison of our results with other in silico studies and experimental results.

We used a dataset with 38 strains of mice for which HDLC levels were known. The phenotype data of HDLC levels in the mice was obtained from the Mouse Phenome Database (MPD). The SNP data for the genotype came from the Jackson Labs [97, 62]. The total number of SNPs in the genome was approximately 184,000, which was reduced to 129,000 after applying the 50% cutoff. The 50% cutoff is necessary to prevent artifacts during internal state reconstruction due to missing data at the tips. We have applied the same cutoff for all case studies in this dissertation.

We calculated the tree using TNT and applied our algorithm to this data. To be able to calculate an accurate p-value we should know the distribution of the test

Figure 6.2: A plot of −log10(p) for each SNP (approximately 12000 each) plotted against position on chromosomes 1 and 2 of mice. Lines indicating p = 0.01 and p = 0.0001 are shown.

statistic. We find that this distribution closely matches a normal distribution. Figure 6.1 shows a normal quantile plot for the distribution of the change on one branch of the tree. The plot shows that the distribution is close to a normal distribution, as the points fall close to the straight line except around the extremes. In this case study, we consider 2000 permutations, and the results obtained for chromosomes 1 and 2 can be seen in Figure 6.2. In this figure, each SNP in the chromosome is represented by a bar whose height indicates −log10(p), where p is the p-value. The darker, longer lines indicate positions of known genes for this trait.

6.5 Discussion

HDLC is a complex quantitative trait for which many QTL have been identified using traditional cross-based QTL mapping, which allows us to compare our method with previously known results. Forty-two percent of the mouse genome falls within a known QTL confidence interval. As one of the most well-studied multigenic and quantitative traits, HDLC levels are a good benchmark for evaluating our method.

Figure 6.3 shows the results of the correlation scores for the entire mouse genome. The upper bar chart shows the computed HDLC phenotype correlation scores for the top 40 SNPs. The lower bar chart shows the maximum LOD scores at previously known QTL intervals (95% confidence intervals shown as red rectangles) [52]. The x-axis indicates the genomic axis, where chromosomal boundaries are indicated by the center bar. The maximum LOD scores are cut off at 12. Correlation scores below 3 and LOD scores below 3.3 are not shown.

We find that 16 of the top 20 SNPs are located within previously known QTLs. This compares favorably with the results of McClurg’s Single Marker Mapping method (see Section 2.7.2), where it was reported that out of the 20 markers with the highest association scores, 11 intersected a previously known QTL [125].

6.6 Conclusion

In this chapter, we described a method for correlating a continuous phenotype with a binary genotype using phylogenetic trees. This method reconstructs both the genotypic character (typically a SNP, although it could be an amino acid) and the phenotypic character over a phylogenetic tree and calculates the correlation between the two. The significance of the correlation is assessed using a randomization test. This method is implemented in the TNT scripting language and AWK. We tested the method on the phenotype of High Density Lipoprotein Cholesterol (HDLC) levels in inbred mice, which is an important and well-studied trait. We find that our method performs well compared to existing in silico methods.

Figure 6.3: Results of the continuous correlation method for HDLC. The top bar chart shows −log10(p) for the top 40 best correlated candidates from the whole genome, and the bottom bar chart shows the peak LOD scores and significant QTL intervals described previously for HDLC. Of the twenty loci with the highest correlation scores, 16 intersect previously known QTL intervals. In cases where 2 or more bars are too close to be resolved visually, the number above the bar shows the number of bars at that location.

CHAPTER 7

DISCUSSION AND FUTURE DIRECTIONS

There has been a rapid growth in biological sequence data in recent years and it is expected that this growth will continue in the foreseeable future. To utilize this sequence data in meaningful ways, there is a need to find connections between the sequences and the features that the sequences code for in an organism.

As there are thousands of genes or more in any organism, it is a daunting task to find the biochemical pathway of a gene’s effect on the phenotype or to find all the genes that interact to produce a particular phenotype. Given the many ways in which genes interact with other genes, the environment, and stochastic factors, deducing all the relationships between them is a time-intensive and expensive task.

This task can be considerably simplified by the identification of candidate genes for a trait. A candidate gene is a gene located in a chromosome region suspected of being involved in the expression of a trait such as a disease.

The problem then reduces to working with the candidate genes and finding whether or not there is a biochemical pathway between the expression of the gene and the trait.

Finding candidate genes which may be associated with a trait is a complex question as well. In this dissertation, we have described several ways in which regions associated with a trait can be identified. In particular, we have described methods that utilize the phylogenetic tree to find genotype-phenotype correlations.

Most genotype-phenotype correlation studies treat all organisms in the study as independent data points. This assumption is not completely justified, as all organisms have evolved from a common ancestor and the interrelationships between the organisms can be shown by means of a phylogenetic tree. Felsenstein showed that taking the organisms as independent when they are not could cause an overestimation of significance, and also described the method of independent contrasts, which takes the phylogeny into account. Since then, other methods such as Maddison’s Concentrated Changes Test and Pagel’s Discrete have been described which can find the correlation between two characters while taking the phylogeny into account.

A major shortcoming of the existing methods is that their implementations are not suited for a genome-wide analysis. In the pre-genomic era, when a researcher was testing only a few chosen characters against a phenotype, this was not a major hindrance, but when faced with thousands of markers such as SNPs, a script-based method requiring minimal interaction is essential. There was also no described method for correlating a genotype, which is necessarily discrete, with a continuous phenotype.

7.1 Correlating Discrete Characters

The first software that we described was termed VENN. VENN allows us to identify SNPs that are completely penetrant with a phenotype. The idea behind VENN is that sets of changes in genotype and sets of changes in phenotype that occur together may have a functional relationship to each other. Once a phylogenetic tree has been identified, any discrete character, genotypic or phenotypic, can be optimized on the tree such that the total change in that character is minimized.

Using this optimized distribution, we can identify branches on the tree where the phenotype is undergoing a change. Then VENN, provided with those branches, can identify the SNPs that are also changing on those branches, and the candidate genes can be inferred from the regions in which those SNPs are located. We located regions in strains of inbred mice that are correlated with susceptibility to Bacillus anthracis, the causative agent of anthrax, and found that our results are corroborated by existing literature.
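The branch-set logic described above can be sketched in a few lines. This is a minimal illustration in Python, not the Perl implementation given in Appendix A; the function name `venn`, the branch labels, and the SNP identifiers are all hypothetical, and the apomorphy list is assumed to have already been parsed into a mapping from branch to the set of characters changing on it.

```python
def venn(changes_by_branch, include, exclude=()):
    """Return characters that change on every included branch
    and on none of the excluded branches (sorted for stable output)."""
    if not include:
        raise ValueError("at least one branch must be included")
    # Intersection over all branches where the phenotype changes.
    common = set.intersection(
        *(set(changes_by_branch.get(branch, ())) for branch in include)
    )
    # Remove anything that also changes on an excluded branch.
    for branch in exclude:
        common -= set(changes_by_branch.get(branch, ()))
    return sorted(common)


# Hypothetical parsed apomorphy list: branch -> characters changing on it.
changes = {
    "node_1": {"snp_10", "snp_42", "snp_77"},
    "node_2": {"snp_42", "snp_77", "snp_90"},
    "node_3": {"snp_77"},
}
candidates = venn(changes, include=["node_1", "node_2"], exclude=["node_3"])
```

Because the result is an intersection, every additional included branch can only shrink the candidate set, which is the first shortcoming discussed in this section.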

VENN was implemented in Perl, making it available on all major platforms. VENN has a few shortcomings. The first is that as the number of branches over which the phenotype is changing increases, the set of genotypic changes in the intersection of all branches shrinks, and eventually few or no matches for a query are returned. As has been noted, genes interact with other genes and the environment in many ways to produce a phenotype. Thus, complete penetrance is rarely observed in nature. Secondly, VENN provides no way to calculate the p-value of a returned SNP. Thus, a researcher is unable to conclude whether a returned result is truly significant or not.

To address these shortcomings, we developed another method, termed CCTSWEEP. Maddison’s concentrated changes test (CCT) calculates the probability that changes in a binary character are distributed randomly on the branches of a phylogenetic tree. This test is used to examine hypotheses of correlated evolution, especially cases where changes in the state of one character influence changes in the state of another character. If we have two binary characters and a unique optimization, we can find the probability that the observed concentration of genotypic changes on branches distinguished by the phenotype would arise by chance, and this probability gives us the p-value of the association between the genotypic and phenotypic character.
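To make the quantity being computed concrete, the following is a brute-force sketch of a concentrated-changes probability on a very small tree. It is written in Python for illustration and is not the dynamic program used by CCTSWEEP (Appendix B). It assumes, as our reading of the test, that the p-value is the fraction of all histories with the observed numbers of gains and losses that place at least as many gains and at most as many losses on the distinguished ("black") branches. The function name, the tree, and the branch labels are hypothetical, and the enumeration is exponential in the number of nodes, so it is only usable on toy examples.

```python
from itertools import product


def cct_brute_force(parent, root_state, black, gains, losses,
                    black_gains, black_losses):
    """parent maps each non-root node to its parent; a branch is named
    by its descendant node. black is the set of distinguished branches."""
    nodes = list(parent)
    roots = set(parent.values()) - set(parent)
    if len(roots) != 1:
        raise ValueError("tree must have exactly one root")
    root = roots.pop()

    total = 0    # histories with the observed numbers of gains and losses
    extreme = 0  # those at least as concentrated on black branches
    # Every assignment of 0/1 to the non-root nodes is one possible history.
    for assignment in product((0, 1), repeat=len(nodes)):
        state = dict(zip(nodes, assignment))
        state[root] = root_state
        n_gain = n_loss = b_gain = b_loss = 0
        for child, par in parent.items():
            if state[par] == 0 and state[child] == 1:
                n_gain += 1
                b_gain += child in black
            elif state[par] == 1 and state[child] == 0:
                n_loss += 1
                b_loss += child in black
        if n_gain == gains and n_loss == losses:
            total += 1
            if b_gain >= black_gains and b_loss <= black_losses:
                extreme += 1
    if total == 0:
        raise ValueError("no history has the requested gains and losses")
    return extreme / total


# Hypothetical four-taxon tree: root R, internal nodes A and B.
tree = {"A": "R", "B": "R", "a1": "A", "a2": "A", "b1": "B", "b2": "B"}
# Two gains, no losses, both gains on the black branches a1 and a2.
p = cct_brute_force(tree, 0, {"a1", "a2"}, 2, 0, 2, 0)
```

On this toy tree there are 11 ways to place two gains and no losses, of which one puts both gains on the black branches, so `p` is 1/11.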

CCTSWEEP was tested with the same trait as VENN, though with a larger number of mouse strains, which made for a larger tree. We focused our attention on a smaller region of the mouse genome, chromosome 11, as literature surveys indicated that genes on this chromosome were implicated in this trait. We found that our results matched what we found with VENN and that 8 of the 12 best correlated SNPs were in regions (genes called Nalp1b and Kif1c) where previous experimental studies had located genes that control susceptibility to Bacillus anthracis.

CCTSWEEP was also used in other studies to find genes correlated with the medically important phenotype of coronary artery disease. Using mice prone to developing characteristics of coronary artery disease, such as elevated lipid levels when fed a fatty (atherogenic) diet, and other mice that are not prone to it, we found locations of SNPs where a change in the allele is correlated with a change in indicators of coronary artery disease. Again, among the candidate genes correlated with these traits, we found one, NNMT, which had previously been associated with this trait in other studies.

In recent years, avian flu has been a major concern as there have been indications that the flu virus could mutate into a more virulent form. Wild birds are known to carry all strains of influenza and any of these strains could be the source of the next human pandemic. The influenza virus is composed of several different surface proteins, changes in which affect whether the flu virus can infect a particular host or its virulence. Using CCT, we analyzed correlation between various amino acid positions in these proteins and phenotypes in the birds carrying the virus. We found that genotype Lysine-627 in the protein PB2 and mammalian hosts are strongly correlated. This genotype is of interest because it is experimentally associated with increased replication and virulence of the H5N1 virus in laboratory mice. We also found other amino acid locations that are potential candidates for further study in the spread of avian flu.

7.2 Correlating Continuous characters

All of the studies performed above were with phenotypic characters that are discrete. Even in cases where the characters were continuous, such as plasma lipid levels in mice, they were binarized by setting a threshold. As most phenotypic characters in complex organisms are continuous, making them binary or discrete results in a loss of information, and the choice of threshold can alter the results obtained.
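A small sketch shows how the threshold choice changes the binary character that is then optimized on the tree. The strain names, the measurements, and both cut-offs below are invented for illustration only and are not data from this study.

```python
def binarize(values, threshold):
    """Code a continuous phenotype as 1 above the threshold and 0 otherwise."""
    return {strain: int(value > threshold) for strain, value in values.items()}


# Hypothetical phenotype measurements for four strains.
levels = {"strain_a": 42.0, "strain_b": 55.0, "strain_c": 61.0, "strain_d": 78.0}

low_cut = binarize(levels, 50.0)
high_cut = binarize(levels, 65.0)
```

With the lower cut-off three strains are coded as 1, while with the higher cut-off only one is, so the two codings can imply phenotypic changes on different branches of the same tree.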

For our method of correlating continuous characters, we optimize both the genotype and phenotype across the phylogenetic tree. Then, to find the p-value for correlation between the genotype and the phenotype, we use randomization testing. Randomization tests are a type of nonparametric test and they maintain a wide applicability under a much broader range of data and research conditions than most parametric tests. Permutation testing is based on the fact that the extremity of a test statistic can be judged by comparison to its distribution with the relationship between data and model suitably permuted. Since permuted data has no expected relationship to the model (or vice versa), permutations can be viewed effectively as replications of the experiment with no expected relationship between model and data.

Thus, we find the average change in the phenotype on branches where the genotype is also changing under a chosen reconstruction method. Then, this process is repeated with the phenotypic values at the tips of the tree randomly permuted. The number of repetitions is set by the precision to which the p-value is desired. If the average change obtained with the actual phenotype lies in the tail of the distribution obtained with the shuffled phenotypic values, then the two might be correlated, and the p-value of the association can be judged from how far into the tail the actual value is found.
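The randomization procedure can be sketched as follows. This is an illustrative Python version, not the TNT and AWK implementation used in this work, and it makes several simplifying assumptions that are ours: ancestral phenotype values are reconstructed as the mean of the children (a stand-in for whichever reconstruction method is chosen), the statistic is the mean absolute change, and the branches on which the genotype changes are supplied directly since the genotype is not permuted. All names and values are hypothetical.

```python
import random


def reconstruct_means(children, tip_values):
    """Give each internal node the mean of its children (stand-in method)."""
    values = dict(tip_values)

    def visit(node):
        if node not in values:
            kids = children[node]
            values[node] = sum(visit(kid) for kid in kids) / len(kids)
        return values[node]

    for node in children:
        visit(node)
    return values


def mean_change(children, tip_values, changing_branches):
    """Mean absolute phenotype change on the given branches (named by child)."""
    values = reconstruct_means(children, tip_values)
    parent_of = {kid: par for par, kids in children.items() for kid in kids}
    deltas = [abs(values[b] - values[parent_of[b]]) for b in changing_branches]
    return sum(deltas) / len(deltas)


def permutation_p_value(children, tip_values, changing_branches,
                        n_perm=1000, seed=0):
    """Return the observed statistic and a one-sided permutation p-value."""
    rng = random.Random(seed)
    observed = mean_change(children, tip_values, changing_branches)
    tips = sorted(tip_values)
    shuffled = [tip_values[t] for t in tips]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute phenotype values among the tips
        stat = mean_change(children, dict(zip(tips, shuffled)), changing_branches)
        if stat >= observed:
            hits += 1
    # Counting the observed arrangement itself keeps the estimate above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical tree and phenotype; the genotype changes on the branch to n2.
tree = {"root": ["n1", "n2"], "n1": ["t1", "t2"], "n2": ["t3", "t4"]}
phenotype = {"t1": 1.0, "t2": 1.0, "t3": 5.0, "t4": 5.0}
observed, p_value = permutation_p_value(tree, phenotype, ["n2"], n_perm=2000)
```

The smallest p-value this scheme can report is 1/(n_perm + 1), which is why the number of repetitions has to be chosen according to how small a p-value needs to be resolved.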

This method was implemented in the TNT scripting language and the AWK programming language. We performed a case study in which we found genetic regions associated with High Density Lipoprotein (HDL) cholesterol levels in strains of inbred mice. HDLC levels play a demonstrable role in the progression of atherosclerosis. We find that a large majority of top candidates from our association method lie within Quantitative Trait Loci previously implicated in this trait. Also, we compare our method with results from previous in silico studies for this trait and our method compares favorably with others.

7.3 Future Directions

Many studies have noted that a large fraction of the phenotypes in organisms are a result of the interactions of multiple genes. Glazier et al. show that while the molecular basis of a large number of monogenic traits in humans has been identified and the number has grown at a steady pace, the identification of the molecular basis for complex traits has significantly lagged behind [48].

Experimental methods have been used to probe two-locus interactions; these methods include yeast two-hybrid screening [126]. Two-hybrid screening is a technique which facilitates the study of protein-protein interactions and protein-DNA interactions by testing for physical interactions (such as binding) between two proteins or a single protein and a DNA molecule, respectively. The premise behind the test is the activation of downstream reporter gene(s) by the binding of a transcription factor onto an upstream activating sequence. With increasing genomic data available, genetic studies are becoming a powerful way to identify these interactions [127]. In silico genetic studies hold great promise here in narrowing down candidate genes for further research. It should be noted that going from single-locus to even two-locus interactions involves a significant increase in complexity.

Li and Reich showed that there are 512 two-locus, two-allele, two-phenotype, fully penetrant disease models [128]. Using the permutation between two alleles, between two loci, and between being affected and unaffected, one model can be considered to be equivalent to another model under the corresponding permutation. These permutations greatly reduce the number of two-locus models in the analysis of complex diseases. Even after these reductions, it can be seen that the problem of correlating two-locus genotypes to a phenotype is significantly more computationally taxing than that of a one-locus genotype.
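The count of 512 follows from the nine two-locus genotypes (three genotypes at each of two biallelic loci), each of which is either affected or unaffected, giving 2^9 models. The Python sketch below enumerates them and groups together models related by the permutations named above. It is our own illustration of the counting argument, not code from [128]; a model is represented as a 3 x 3 grid of 0/1 flattened into a tuple, and the function names are hypothetical.

```python
from itertools import product


def genotype_images(model):
    """Images of a 3x3 model under allele swaps at either locus
    (row or column reversal) and a swap of the two loci (transpose)."""
    grid = [list(model[r * 3:r * 3 + 3]) for r in range(3)]
    for flip_rows, flip_cols, transpose in product((False, True), repeat=3):
        g = grid[::-1] if flip_rows else grid
        g = [row[::-1] for row in g] if flip_cols else g
        g = [list(col) for col in zip(*g)] if transpose else g
        yield tuple(x for row in g for x in row)


def canonical(model, swap_phenotype=True):
    """Smallest image of the model, used as the label of its class."""
    images = list(genotype_images(model))
    if swap_phenotype:
        # Exchanging affected and unaffected complements every cell.
        images += [tuple(1 - x for x in image) for image in images]
    return min(images)


models = list(product((0, 1), repeat=9))
classes_without_phenotype_swap = {canonical(m, swap_phenotype=False) for m in models}
classes_all_permutations = {canonical(m) for m in models}
```

Under these assumptions the 512 models fall into 102 classes when only the allele and locus permutations are used, and 51 classes when the affected/unaffected exchange is included as well. These counts still include models that depend on one locus or on neither.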

As $n$ genomic locations result in $\sim n^2$ possible two-locus combinations, enumerating all of them is computationally taxing, and some heuristic is needed to reduce this number. We could use the methods described in this dissertation for correlating characters to reduce this to a smaller set of characters that show some minimum level of association with the phenotype. Then a compound character, such as a Boolean combination of presence or absence of each genetic marker, could be used as a character and tested for association as if it were a single character. The characters highly ranked together could indicate a possible relationship which has to be satisfied for the phenotype to be realized.
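As a sketch of how such compound characters might be built from a shortlist of markers, the Python fragment below forms AND, OR, and XOR combinations for every pair. The choice of these three Boolean operations, the function name, and the marker data are our assumptions for illustration; each resulting character could then be passed to a single-character test as described above.

```python
from itertools import combinations


def compound_characters(markers):
    """markers maps a marker name to a tuple of 0/1 states, one per strain.
    Returns Boolean compound characters for every pair of markers."""
    compound = {}
    for (a, states_a), (b, states_b) in combinations(sorted(markers.items()), 2):
        pairs = list(zip(states_a, states_b))
        compound[f"{a} AND {b}"] = tuple(x & y for x, y in pairs)
        compound[f"{a} OR {b}"] = tuple(x | y for x, y in pairs)
        compound[f"{a} XOR {b}"] = tuple(x ^ y for x, y in pairs)
    return compound


# Hypothetical shortlist of three markers scored in four strains.
shortlist = {
    "m1": (1, 1, 0, 0),
    "m2": (1, 0, 1, 0),
    "m3": (0, 0, 1, 1),
}
compound = compound_characters(shortlist)
```

A shortlist of k markers yields k(k-1)/2 pairs, so filtering to a small k first keeps the number of compound characters manageable.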

7.4 Conclusion

In conclusion, we have described three methods which can be used for performing genome-wide association studies between genotypes and phenotypes. The first two, VENN and CCTSWEEP, correlate discrete and binary characters respectively, while the third method can correlate continuous phenotypes with binary genotypes. These methods were tested with biological data, and we found that genomic regions indicated by our methods are in agreement with previously known results, which is a good indicator of the validity of our methods. The rapid increase in genetic data being released will increase the importance of in silico studies to reduce the number of candidate regions that will be analyzed experimentally, and our methods could play an important role in this.

APPENDIX A

VENN CODE

VENN was implemented in Perl in three different versions to handle PAUP, POY, or TNT apomorphy lists. The usage for all of them is

poyvenn.pl apofile "+node_1" "+node_2" "-node_3" ... > outfile

where poyvenn.pl could be replaced by tntvenn.pl or paupvenn.pl depending on the type of apomorphy list being analyzed. apofile is the name of the file containing the apomorphy list. outfile is the file in which the output is stored. If this is omitted, the output will be directed to stdout, which is usually the screen.

Branches on which intersection or exclusion is to be performed are identified by the descendant node, and a ‘+’ or ‘-’ is prepended to indicate inclusion or exclusion respectively. A missing sign is assumed to be +. Quotes are optional if node names do not contain spaces. The first branch in the list must be a branch being included. Taxon names beginning with a + or - will cause ambiguities and unpredictable behavior and should not be used. Also, as PAUP apomorphy lists are space delimited, taxon names containing spaces should be avoided when using PAUP. These programs have been tested on Linux and Windows operating systems and should run on a range of platforms.
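The sign convention for the branch arguments can be summarized with a short sketch. It is written in Python for illustration and is not part of the Perl scripts listed below; the function name `parse_branch_args` is hypothetical, and the check on the first argument reflects the usage rule stated above rather than behavior enforced by the scripts.

```python
def parse_branch_args(args):
    """Map each node name to +1 (include) or -1 (exclude).
    An argument without a sign is treated as included."""
    if not args:
        raise ValueError("at least one branch is required")
    branches = {}
    for index, arg in enumerate(args):
        if arg.startswith("-"):
            sign, name = -1, arg[1:]
        elif arg.startswith("+"):
            sign, name = 1, arg[1:]
        else:
            sign, name = 1, arg
        if index == 0 and sign == -1:
            raise ValueError("the first branch must be an included branch")
        branches[name] = sign
    return branches


parsed = parse_branch_args(["+node_1", "node_2", "-node_3"])
```

Because only the leading character is inspected, a taxon whose own name starts with + or - would be misread, which is the ambiguity warned about above.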

A.1 poyvenn.pl

Given below is the implementation of poyvenn.pl

#!/usr/bin/env perl

# Can do intersections over unlimited number of branches
# gives output sorted over character position
# now doing exclusions as well

# Record each branch argument as included (+1) or excluded (-1).
for ($i=1; $i<=$#ARGV; $i++){
    if($ARGV[$i] =~ /^-/){
        $ARGV[$i] =~ s/-//;
        $incexc{$ARGV[$i]} = -1;
    }
    else {
        if($ARGV[$i] =~ /^\+/){
            $ARGV[$i] =~ s/\+//;
        }
        $incexc{$ARGV[$i]} = 1;
    }
    $names{$ARGV[$i]}=0;
}
$INFILE=$ARGV[0];
open (INPUT, $INFILE);
$pass=1;
while (<INPUT>) {
    # Skip everything up to and including the header line.
    if($pass){
        if (/Anc .*Desc/) {
            $pass=0;
            next;
        }
        next;
    }
    chomp;
    $templine = $_;
    if($templine =~ /^\t+$/) {next;} # ignore empty lines
    @chars = split(/\t/, $templine);
    if($chars[2] ne ""){
        $name = $chars[2];
        $name =~ s/ //g;
        $anc = $chars[1];
        next;
    }
    $chars[1]=$anc;
    $chars[2]=$name;

    if (exists $names{$name}) {
        if ($chars[5] ne ""){
            $chars[5] =~ tr/\[\]//d;
            $chr = $chars[5];
            $position{$chr}=0;
        }
        $chars[5] = $chr;
        $chars[7] =~ s/\[//;
        $chars[7] =~ s/\]//;
        $data{$name."|".$chr."|".$chars[7]} = join("\t",@chars);
        $char{$chr}{$chars[7]} = 0;
    }
} # closing brace from while ....
close (INPUT);
# calculating intersections
foreach my $key (sort{$a <=> $b}(keys %position)){
    foreach my $pos (sort{$a <=> $b}(keys %{$char{$key}})) {
        $common = 1;
        $out = "";
        foreach my $name (sort{$a cmp $b}(keys %names)){
            if (exists $data{$name."|".$key."|".$pos}){
                if ($incexc{$name}==1){
                    $out = $out.$data{$name."|".$key."|".$pos}."\n";
                }
                else {
                    $common = 0;
                    next;
                }
            }
            elsif ($incexc{$name}==-1){}
            else{
                $common=0;
                next;
            }
        }
        print "$out" if ($common);
    }
}

A.2 tntvenn.pl

Given below is the implementation of tntvenn.pl

#!/usr/bin/env perl

#Can do intersections over unlimited number of branches
#gives output sorted over character position
# now doing exclusions as well

# Record each branch argument as included (+1) or excluded (-1).
for ($i=1; $i<=$#ARGV; $i++){
    if($ARGV[$i] =~ /^-/){
        $ARGV[$i] =~ s/-//;
        $incexc{$ARGV[$i]} = -1;
    }
    else {
        if($ARGV[$i] =~ /^\+/){
            $ARGV[$i] =~ s/\+//;
        }
        $incexc{$ARGV[$i]} = 1;
    }
    $names{$ARGV[$i]}=0;
}
$INFILE=$ARGV[0];
open (INPUT, $INFILE);
while (<INPUT>) {
    chomp;
    $templine = $_;
    $templine =~ s/^[ \t]+|[ \t]+$//;
    @chars = split(/ *:/, $templine);
    # A line of the form "name : " starts the list for a new branch.
    if($chars[1] eq " "){
        $name = $chars[0];
        next;
    }
    $chars[0] =~ s/Char. //;
    if (exists $names{$name}) {
        $data{$name,$chars[0]}=$templine;
        $char{$chars[0]}=0;
    }
} # closing brace from while ....
close (INPUT);
# calculating intersections
foreach my $key (sort{$a <=> $b}(keys %char)){
    $common = 1;
    $out = "";
    foreach my $name (sort{$a <=> $b}(keys %names)){
        if (exists $data{$name,$key}){
            if ($incexc{$name}==1){
                $out = $out.$data{$name,$key}."\n";
            }
            else {
                $common = 0;
            }
        }
        elsif ($incexc{$name}==-1){}
        else{
            $common=0;
        }
    }
    print "$out" if ($common);
}

A.3 paupvenn.pl

And finally the implementation of paupvenn.pl

#!/usr/bin/env perl

#Can do intersections over unlimited number of branches
#gives output sorted over character position
# now doing exclusions as well

# Record each branch argument as included (+1) or excluded (-1).
for ($i=1; $i<=$#ARGV; $i++){
    if($ARGV[$i] =~ /^-/){
        $ARGV[$i] =~ s/-//;
        $incexc{$ARGV[$i]} = -1;
    }
    else {
        if($ARGV[$i] =~ /^\+/){
            $ARGV[$i] =~ s/\+//;
        }
        $incexc{$ARGV[$i]} = 1;
    }
    $names{$ARGV[$i]}=0;
}
$INFILE=$ARGV[0];
open (INPUT, $INFILE);
$pass=1;
while (<INPUT>) {
    $templine = $_;
    $templine =~ s/^[ \t]+|[ \t]+$//;
    # Skip everything up to and including the header line.
    if($pass){
        if ($templine =~ /Branch .*Character .*Steps/) {
            $pass=0;
            next;
        }
        next;
    }
    chomp;
    @chars = split(/ +/, $templine);
    # A line with more than seven fields starts the list for a new branch.
    if($#chars >6){
        $name = $chars[2];
        next;
    }
    if (exists $names{$name}) {
        $data{$name,$chars[0]}=$templine;
        $char{$chars[0]}=0;
    }
} # closing brace from while ....
close (INPUT);
# calculating intersections
foreach my $key (sort{$a <=> $b}(keys %char)){
    $common = 1;
    $out = "";
    foreach my $name (sort{$a <=> $b}(keys %names)){
        if (exists $data{$name,$key}){
            if ($incexc{$name}==1){
                $out = $out.$name."\t".$data{$name,$key};
            }
            else {
                $common = 0;
            }
        }
        elsif ($incexc{$name}==-1){}
        else{
            $common=0;
        }
    }
    print "$out" if ($common);
}

APPENDIX B

CCTSWEEP

CCTSWEEP is a set of 4 scripts implemented in the TNT scripting language which perform initialization, calculation of the W matrix (Equation 4.2), calculation of the B matrix (Equation 4.3), and calculation of the final CCT values (Equation 4.1). This allows one to save intermediate results and perform the subsequent step on another computer or at a later time with a different dataset. The input files are character matrices in TNT format that code for the character as 0 or 1. The first character in the dataset must be the phenotype of interest. This is the character that is used as the independent character. The rest of the characters are genotypic characters.

B.1 Script

Initialization:

var = 0 temp
+charstanc
+charstdec
+Bcal[(nnodes[0]+1)]
+B[(nnodes[0]+1) tg tg bg bg]
+lide[2]
+Wcal[(nnodes[0]+1)]
+W[(nnodes[0]+1) tgain tgain 2]
+numb[(nnodes[0]+1)]
+wloss
+wgain
+bloss
+bgain
+stroot
+ancstate
+decstat1
+decstat2
+isblack1
+isblack2
+gainloss
+bgainloss
+btot
+cct
+root;
loop 0 (nnodes[0])
  set Bcal[#1] (-1);
  set Wcal[#1] (-1);
stop;

tg and bg should be replaced with the maximum number of changes over the whole tree and the maximum number of changes on branches where the phenotype is changing, respectively. The time complexity of calculating the B matrix is $O(n^4)$ in the number of changes over the tree. Hence large values of tg or bg will lead to long run times, and they should be chosen conservatively.

Script to calculate the W matrix, to be run as calw node tg tg 2, where tg is the same number as above and node is the number of the node at which CCT is to be calculated. Typically this will be the number of the root node.

set temp 0;
set lide deslist[0 %1];
if ('Wcal['lide[0]']' <0)
  if('lide[0]'<(ntax+1))
    set W['lide[0]' 0 0 0] 1;
    set W['lide[0]' 0 0 1] 1;
    set Wcal['lide[0]'] 1;
  else
    recurse 'lide[0]' %2 %3;
    set lide deslist[0 %1];
  end;
end;
if ('Wcal['lide[1]']' <0)
  if('lide[1]'<(ntax+1))
    set W['lide[1]' 0 0 0] 1;
    set W['lide[1]' 0 0 1] 1;
    set Wcal['lide[1]'] 1;
  else
    recurse 'lide[1]' %2 %3;
    set lide deslist[0 %1];
  end;
end;
if(%1 > ntax)
  loop 0 %2
    loop 0 %3
      loop 0 #1
        loop 0 #2
          set temp 'temp'+('W['lide[0]' #3 #4 0]'*'W['lide[1]'\
            (#1-#3) (#2-#4) 0]');
        stop;
      stop;
      if ((#1-1)>=0)
        loop 0 (#1-1)
          loop 0 #2
            set temp 'temp'+('W['lide[0]' #3 #4 0]'*'W['lide[1]'\
              (#1-#3-1) (#2-#4) 1]');
          stop;
        stop;
      end;
      if((#1-1)>=0)
        loop 0 (#1-1)
          loop 0 #2
            set temp 'temp'+('W['lide[0]' #3 #4 1]'*'W['lide[1]'\
              (#1-#3-1) (#2-#4) 0]');
          stop;
        stop;
      end;
      if ((#1-2)>=0)
        loop 0 (#1-2)
          loop 0 #2
            set temp 'temp'+('W['lide[0]' #3 #4 1]'*'W['lide[1]'\
              (#1-#3-2) (#2-#4) 1]');
          stop;
        stop;
      end;
      set W[%1 #1 #2 0] 'temp';
      set W[%1 #2 #1 1] 'temp';
      set temp 0;
    stop;
  stop;
end;
set Wcal[%1] 1;
proc/;

Script to calculate the B matrix, to be run as calb node tg tg bg bg, where bg is the same as in the initialization.

set lide deslist[0 %1];
if ('Bcal['lide[0]']' <0)
  if('lide[0]'<(ntax+1))
    set B['lide[0]' 0 0 0 0] 1;
    set Bcal['lide[0]'] 1;
  else
    recurse 'lide[0]' %2 %3 %4 %5 %6;
    set lide deslist[0 %1];
  end;
end;

if ('Bcal['lide[1]']' <0)
  if('lide[1]'<(ntax+1))
    set B['lide[1]' 0 0 0 0] 1;
    set Bcal['lide[1]'] 1;
  else
    recurse 'lide[1]' %2 %3 %4 %5 %6;
    set lide deslist[0 %1];
  end;
end;
set ancstate states[%6 %1 0];
if ('ancstate'>2)
  set ancstate 1;
end;
set decstat1 states[%6 'lide[0]' 0];
if ('decstat1'>2)
  set decstat1 1;
end;
set decstat2 states[%6 'lide[1]' 0];
if ('decstat2'>2)
  set decstat2 1;
end;
set isblack1 'decstat1'-'ancstate';
set isblack2 'decstat2'-'ancstate';
if(%1 > ntax)
  loop 0 %2
    loop 0 %3
      loop 0 %4
        loop 0 %5
          loop 0 #3
            loop 0 #4
              loop 0 #1
                loop 0 #2
                  set temp 'temp'+('B['lide[0]' #7 #8\
                    #5 #6 ]'*'B['lide[1]' (#1-#7) (#2-#8)\
                    (#3-#5) (#4-#6)]');
                stop;
              stop;
            stop;
          stop;
          if ((#3-1)>=0)
            loop 0 (#3-1)
              loop 0 #4
                if ((#1-'isblack2')>=0)
                  loop 0 (#1-'isblack2')
                    loop 0 #2
                      set temp 'temp'+('B['lide[0]' #7 #8\
                        #5 #6 ]'*'B['lide[1]' (#2-#8) ((#1-\
                        'isblack2')-#7) (#4-#6) (#3-#5-1)]');
                    stop;
                  stop;
                end;
              stop;
            stop;
          end;
          if((#3-1)>=0)
            loop 0 (#3-1)
              loop 0 #4
                if ((#1-'isblack1')>=0)
                  loop 0 (#1-'isblack1')
                    loop 0 #2
                      set temp 'temp'+('B['lide[0]' #8 #7\
                        #6 #5 ]'*'B['lide[1]' ((#1-'isblack1')\
                        -#7) (#2-#8) (#3-#5-1) (#4-#6)]');
                    stop;
                  stop;
                end;
              stop;
            stop;
          end;
          if ((#3-2)>=0)
            loop 0 (#3-2)
              loop 0 #4
                if ((#1-'isblack1'-'isblack2')>=0)
                  loop 0 (#1-'isblack1'-'isblack2')
                    loop 0 #2
                      set temp 'temp'+('B['lide[0]' #8 #7\
                        #6 #5 ]'*'B['lide[1]' (#2-#8) ((#1\
                        -'isblack1'-'isblack2')-#7) (#4-#6)\
                        (#3-#5-2)]');
                    stop;
                  stop;
                end;
              stop;
            stop;
          end;
          set B[%1 #1 #2 #3 #4] 'temp';
          set temp 0;
        stop;
      stop;
    stop;
  stop;
end;
set Bcal[%1] 1;
proc/;

Once the W and B have been calculated, the CCT values for each genotype can be calculated using the following script.

loop 1 nchar
  set wgain 0;
  set wloss 0;
  set bgain 0;
  set bloss 0;
  set root states[#1 (ntax+1) 0];
  if ('root' > 2)
    set root 1;
  end;

  loop (ntax+1) nnodes[0]
    set lide deslist[0 #2];
    set charstanc states[#1 #2 0];

    if('charstanc' > 2)
      set charstanc 'root';
    end;

    set charstdec states[#1 'lide[0]' 0];
    if ('charstdec' > 3)
      set charstdec 'charstanc';
    else
      if ('charstdec' > 2)
        set charstdec 'root';
      end;
    end;
    set ancstate states[0 #2 0];
    if ('ancstate'>2)
      set ancstate 1;
    end;
    set decstat1 states[0 'lide[0]' 0];
    if ('decstat1'>2)
      set decstat1 1;
    end;
    set decstat2 states[0 'lide[1]' 0];
    if ('decstat2'>2)
      set decstat2 1;
    end;
    set isblack1 'decstat1'-'ancstate';
    set isblack2 'decstat2'-'ancstate';
    if('charstanc'<'charstdec')
      set wgain 'wgain'+1;
      if ('isblack1'==1)
        set bgain 'bgain'+1;
      end;
    else
      if('charstanc'>'charstdec')
        set wloss 'wloss'+1;
        if ('isblack1'==1)
          set bloss 'bloss'+1;
        end;
      end;
    end;
    set charstdec states[#1 'lide[1]' 0];
    if ('charstdec' > 3)
      set charstdec 'charstanc';
    else
      if ('charstdec' > 2)
        set charstdec 'root';
      end;
    end;
    if('charstanc'<'charstdec')
      set wgain 'wgain'+1;
      if ('isblack2'==1)
        set bgain 'bgain'+1;
      end;
    else
      if('charstanc'>'charstdec')
        set wloss 'wloss'+1;
        if ('isblack2'==1)
          set bloss 'bloss'+1;
        end;
      end;
    end;
  stop;
  set btot 0;
  set temp 'btot';
  if (states[#1 (ntax+1) 0]==2)
    if (('bgain'-'bloss')>=0)
      loop ('bgain'-'bloss') 'wgain'
        loop 0 (#2-('bgain'-'bloss'))
          set btot 'btot'+'B[(ntax+1) #3 #2 'wloss' 'wgain']';
        stop;
      stop;
    else
      loop ('bloss'-'bgain') 'wloss'
        loop 0 (#2-('bloss'-'bgain'))
          set btot 'btot'+'B[(ntax+1) #2 #3 'wloss' 'wgain']';
        stop;
      stop;
    end;
  else
    if (('bgain'-'bloss')>=0)
      loop ('bgain'-'bloss') 'wgain'
        loop 0 (#2-('bgain'-'bloss'))
          set btot 'btot'+'B[(ntax+1) #2 #3 'wgain' 'wloss']';
        stop;
      stop;
    else
      loop ('bloss'-'bgain') 'wloss'
        loop 0 (#2-('bloss'-'bgain'))
          set btot 'btot'+'B[(ntax+1) #3 #2 'wgain' 'wloss']';
        stop;
      stop;
    end;
  end;
  set temp 'btot';
  if (states[#1 (ntax+1) 0]==2)
    set cct 'btot'/'W[(ntax+1) 'wgain' 'wloss' 1]';
    quote #1 1 'wgain' 'wloss' 'bgain' 'bloss'\
      'W[(ntax+1) 'wgain' 'wloss' 1]' 'btot' 'cct';
  else
    set cct 'btot'/'W[(ntax+1) 'wgain' 'wloss' 0]';
    quote #1 0 'wgain' 'wloss' 'bgain' 'bloss'\
      'W[(ntax+1) 'wgain' 'wloss' 0]' 'btot' 'cct';
  end;
stop;

BIBLIOGRAPHY

[1] R. Phillips and S. R. Quake, “The biological frontier of physics,” Physics Today 59 (2006) 38–43.

[2] M. W. Deem, “Mathematical adventures in biology,” Physics Today 60 (2007) 42–47.

[3] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis : Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, July, 1999.

[4] L. Hartwell, L. Hood, M. L. Goldberg, A. Reynolds, L. M. Silver, and R. C. Veres, Genetics: From genes to genomes. McGraw-Hill, 2006.

[5] C.-I. Branden and J. Tooze, Introduction to Protein Structure. Garland Publishing, 2 ed., 1999.

[6] S. Jones and J. Thornton, “Principles of protein-protein interactions,” PNAS 93 (1996), no. 1, 13–20.

[7] M. A. Nowak, Evolutionary Dynamics: Exploring the Equations of Life. Harvard University Press, 2006.

[8] J. N. Hirschhorn and M. J. Daly, “Genome-wide association studies for common diseases and complex traits,” 6 (February, 2005) 95–108.

[9] S. Rose and R. Mileusnic, The Chemistry of Life. Penguin, UK, 4 ed., 1999.

[10] D. L. Hartl and E. W. Jones, Genetics: Analysis of genes and genomes. Jones and Bartlett Publishers, 6 ed., 2004.

[11] H. E. Walter, Genetics: An Introduction to the Study of Heredity. The Macmillan Co., 1913.

[12] O. Winge, “Wilhelm Johannsen: The creator of the terms gene, genotype, phenotype and pure line,” J Hered 49 (1958), no. 2, 83–88.

[13] E. Schrödinger, What Is Life? The Physical Aspect of the Living Cell. Cambridge University Press, Cambridge, 1944.

[14] N. Symonds, “What is life?: Schrödinger’s influence on biology,” The Quarterly Review of Biology 61 (1986), no. 2, 221–226.

[15] F. Crick, What Mad Pursuit: A Personal View of Scientific Discovery. Basic Books, New York, 1988.

[16] T. A. Kunkel, “DNA Replication Fidelity,” J. Biol. Chem. 279 (2004), no. 17, 16895–16898.

[17] F. Crick, “Central dogma of molecular biology.,” Nature 227 (Aug, 1970) 561–563.

[18] A. Wilkie, “The molecular basis of genetic ,” J Med Genet 31 (1994), no. 2, 89–98.

[19] P. D. Keightley, “A Metabolic Basis for Dominance and Recessivity,” Genetics 143 (1996), no. 2, 621–625.

[20] S. W. Omholt, E. Plahte, L. Oyehaug, and K. Xiang, “Gene Regulatory Networks Generating the Phenomena of Additivity, Dominance and Epistasis,” Genetics 155 (2000), no. 2, 969–980.

[21] A. Ancel, J. Armand, and H. Girard, “Optimum incubation conditions of the domestic guinea fowl egg.,” Br Poult Sci 35 (May, 1994) 227–240.

[22] W. Bateson, Mendel’s Principles of Heredity, a Defense. Cambridge University Press, London, 1 ed., 1902.

[23] R. S. Cowan, “Francis galton’s contribution to genetics,” Journal of the 5 (Sept., 1972) 389–412.

[24] W. Johannsen, Elemente der exakten Erblichkeitslehre. Gustav Fischer, 1909.

[25] H. Nilsson-Ehle, Kreuzungsuntersuchungen an hafer und weizen. Lund, 1909.

[26] R. A. Fisher, “The correlation between relatives on the supposition of mendelian inheritance,” Philosophical Transactions of the Royal Society of Edinburgh 52 (1918) 399–433.

[27] H. Muller, “Artificial transmutation of the gene,” Science 66 (July, 1927) 84–87.

[28] C. Auerbach and J. Robson, “The production of mutations by chemical substances,” Proceedings of the Royal Society of Edinburgh B 62 (1947) 279.

[29] M. M. Metzstein, G. M. Stanfield, and H. R. Horvitz, “Genetics of in C. elegans: past, present and future.,” Trends in genetics 14 (1998), no. 10, 410–416.

[30] C. Nusslein-Volhard and E. Wieschaus, “Mutations affecting segment number and polarity in Drosophila,” Nature 287 (1980) 795–801.

[31] N. Kresge, R. D. Simoni, and R. L. Hill, “The Development of Site-directed Mutagenesis by Michael Smith,” J. Biol. Chem. 281 (2006), no. 39, e31–.

[32] D. Botstein and D. Shortle, “Strategies and applications of in vitro mutagenesis,” Science 229 (1985), no. 4719, 1193–1201.

[33] A. J. F. Griffiths, J. H. Miller, D. T.Suzuki, R. C. Lewontin, and W. M. Gelbart, An introduction to . W. H. Freeman and Company, 1996.

[34] J. Ott, Analysis of Human . Johns Hopkins University Press, 1999.

[35] P. Turnpenny and S. Ellard, Emery’s Elements of . Elsevier, 12 ed., 2004.

[36] K. Sax, “The association of size differences with -coat pattern and pigmentation in Phaseolus vulgaris,” Genetics 8 (1923), no. 6, 552–560.

[37] B. H. Liu, Statistical : linkage, mapping, and QTL analysis. CRC Press, Boca Raton, 1998.

[38] E. S. Lander and D. Botstein, “Mapping Mendelian Factors Underlying Quantitative Traits Using RFLP Linkage Maps,” Genetics 121 (1989), no. 1, 185–199.

[39] M. Soller, T. Brody, and A. Genizi, “On the power of experimental designs for the detection of linkage between marker loci and quantitative loci in crosses between inbred lines,” TAG Theoretical and Applied Genetics 47 (Jan., 1976) 35–39.

[40] J. Flint, W. Valdar, S. Shifman, and R. Mott, “Strategies for mapping and quantitative trait genes in rodents.,” Nature reviews. Genetics 6 (2005), no. 4, 271–286.

[41] J. Stanton, “Galton, Pearson, and the peas: A brief history of linear regression for statistics instructors,” Journal of Statistics Education 9 (2001), no. 3.

[42] N. Hizawa, Y. Maeda, S. Konno, Y. Fukui, D. Takahashi, and M. Nishimura, “Genetic polymorphisms at fcer1b and pai-1 and asthma susceptibility,” Clinical & Experimental Allergy 36 (2006), no. 7, 872–876.

[43] F. B. Smith, J. M. Connor, A. J. Lee, A. Cooke, G. D. O. Lowe, A. Rumley, and F. G. Fowkes, “Relationship of the platelet glycoprotein pla and fibrinogen t/g+1689 polymorphisms with peripheral arterial disease and ischaemic heart disease,” Thrombosis Research 112 (2003), no. 4, 209–216.

[44] M. Murata, Y. Matsubara, K. Kawano, T. Zama, N. Aoki, H. Yoshino, G. Watanabe, K. Ishikawa, and Y. Ikeda, “Coronary artery disease and polymorphisms in a receptor mediating shear stress-dependent platelet activation,” Circulation 96 (1997), no. 10, 3281–3286.

[45] H. T. Lynch and T. Hirayama, Genetic Epidemiology of . CRC Press, 1989.

[46] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, “GenBank,” Nucl. Acids Res. 35 (2007), no. 1, 21–25.

[47] L. Peltonen and V. A. McKusick, “GENOMICS AND MEDICINE: Dissecting Human Disease in the Postgenomic Era,” Science 291 (2001), no. 5507, 1224–1229.

[48] A. M. Glazier, J. H. Nadeau, and T. J. Aitman, “Finding genes that underlie complex traits,” Science 298 (2002), no. 5602, 2345–2349.

[49] N. J. Risch, “Searching for genetic determinants in the new millennium,” Nature 405 (June, 2000) 847–856.

[50] A. Grupe, S. Germer, J. Usuka, D. Aud, J. K. Belknap, R. F. Klein, M. K. Ahluwalia, R. Higuchi, and G. Peltz, “In Silico Mapping of Complex Disease-Related Traits in Mice,” Science 292 (2001), no. 5523, 1915–1918.

[51] M. T. Pletcher, P. McClurg, S. Batalov, A. I. Su, S. W. Barnes, E. Lagler, R. Korstanje, X. Wang, D. Nusskern, M. A. Bogue, R. J. Mural, B. Paigen, and T. Wiltshire, “Use of a dense single nucleotide polymorphism map for in silico mapping in the mouse,” PLoS Biol. 2 (2004), no. 12, 2159–2169.

[52] X. Wang and B. Paigen, “Quantitative trait loci and candidate genes regulating HDL cholesterol: A murine chromosome map,” Arteriosclerosis, Thrombosis, and Vascular Biology 22 (2002), no. 9, 1390–1401.

[53] T. Wiltshire, M. T. Pletcher, S. Batalov, S. W. Barnes, L. M. Tarantino, M. P. Cooke, H. Wu, K. Smylie, A. Santrosyan, N. G. Copeland, N. A. Jenkins, F. Kalush, R. J. Mural, R. J. Glynne, S. A. Kay, M. D. Adams, and C. F. Fletcher, “Genome-wide single-nucleotide polymorphism analysis defines haplotype patterns in mouse,” PNAS 100 (2003), no. 6, 3380–3385.

[54] R. Wu and M. Lin, “Functional mapping [mdash] how to map and study the genetic architecture of dynamic complex traits,” Nature Reviews Genetics 7 (Mar., 2006) 229–237.

[55] J. Felsenstein, “Phylogenies and the comparative method,” American Naturalist 125 (1985), no. 1, 1–15.

[56] M. Ridley, The explanation of organic diversity. The comparative method and of mating. Oxford University Press, 1983.

[57] K. A. Frazer, C. M. Wade, D. A. Hinds, N. Patil, D. R. Cox, and M. J. Daly, “Segmental Phylogenetic Relationships of Inbred Mouse Strains Revealed by Fine-Scale Analysis of Sequence Variation Across 4.6 Mb of Mouse Genome,” Genome Res. 14 (2004), no. 8, 1493–1500.

[58] D. L. Swofford, PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts, 1989-2002.

[59] P. Goloboff, J. Farris, and K. Nixon, “Tree analysis using new technology. http://www.zmuc.dk/public/phylogeny/tnt,” 2003.

[60] W. C. Wheeler, D. S. Gladstein, and J. de Laet, “POY version 3.0, Documentation by Daniel Janies and Ward Wheeler. Commandline documentation by J. De Laet and W. C. Wheeler,” tech. rep., 2005.

[61] W. F. Dietrich, J. Roberts, J. Watters, J. Ballard, K. Dewar, J. Lehoczky, and V. Boyartchuk, Mouse Phenome Database Web Site. The Jackson Laboratory, Bar Harbor, Maine USA, 1998.

[62] S. C. Grubb, G. A. Churchill, and M. A. Bogue, “A collaborative database of inbred mouse strain characteristics,” Bioinformatics 20 (2004), no. 16, 2857–2859.

[63] W. P. Maddison, “A method for testing the correlated evolution of two binary characters: Are gains or losses concentrated on certain branches of a phylogenetic tree?,” Evolution 44 (1990), no. 3, 539–557.

[64] J. E. Roberts, J. W. Watters, J. D. Ballard, and W. F. Dietrich, “Ltx1, a mouse locus that influences the susceptibility of macrophages to cytolysis caused by intoxication with Bacillus anthracis lethal factor, maps to chromosome 11,” Mol Microbiol 29 (Jul, 1998) 581–591.

[65] J. Felsenstein, “Parsimony in : Biological and statistical issues,” Annual Review of and Systematics 14 (1983) 313–333.

[66] A. Kluge and J. Farris, “Quantitative phyletics and the evolution of anurans,” Systematic 18 (1969) 1–32.

[67] J. Farris, “Methods for computing Wagner trees,” Systematic Zoology 19 (1970) 83–92.

[68] D. Swofford and W. Maddison, “Reconstructing ancestral character states under Wagner parsimony,” Mathematical Biosciences 87 (1987) 199–229.

[69] C. M. Wade, E. J. Kulbokas, A. W. Kirby, M. C. Zody, J. C. Mullikin, E. S. Lander, K. Lindblad-Toh, and M. J. Daly, “The structure of variation in the genome,” Nature 420 (December, 2002) 574–578.

[70] A. H. Cheetham and J. E. Hazel, “Binary (presence-absence) similarity coefficients,” Journal of 43 (1969), no. 5, 1130–1136.

[71] S. L. Welkos, T. J. Keener, and P. H. Gibbs, “Differences in susceptibility of inbred mice to Bacillus anthracis,” Infect. Immun. 51 (March, 1986) 795–800.

[72] J. W. Watters, K. Dewar, J. Lehoczky, V. Boyartchuk, and W. F. Dietrich, “Kif1C, a kinesin-like motor protein, mediates mouse macrophage resistance to anthrax lethal factor,” Curr. Biol. 11 (October, 2001) 1503–1511.

[73] R. D. McAllister, Y. Singh, W. D. du Bois, M. Potter, T. Boehm, N. D. Meeker, P. D. Fillmore, L. M. Anderson, M. E. Poynter, and C. Teuscher, “Susceptibility to Anthrax Lethal Toxin Is Controlled by Three Linked Quantitative Trait Loci,” Am. J. Pathol. 163 (2003), no. 5, 1735–1741.

[74] S. G. Popov, R. Villasmil, J. Bernardi, E. Grene, J. Cardwell, T. Popova, A. Wu, D. Alibek, C. Bailey, and K. Alibek, “Effect of Bacillus anthracis lethal toxin on human peripheral blood mononuclear cells,” FEBS Letters 527 (Sep, 2002) 211–215.

[75] M. Moayeri, D. Haines, H. A. Young, and S. H. Leppla, “Bacillus anthracis lethal toxin induces TNF-alpha-independent hypoxia-mediated toxicity in mice,” J. Clin. Invest. 112 (2003), no. 5, 670–682.

[76] A. Agrawal, J. Lingappa, S. Leppla, S. Agrawal, A. Jabbar, C. Quinn, and B. Pulendran, “Impairment of dendritic cells and adaptive immunity by anthrax lethal toxin,” Nature 424 (2003) 329–334.

[77] E. D. Boyden and W. F. Dietrich, “Nalp1b controls mouse macrophage susceptibility to anthrax lethal toxin,” Nat. Genet. 38 (2006) 240–244.

[78] A. Janardhan, T. Swigut, B. Hill, M. P. Myers, and J. Skowronski, “HIV-1 Nef Binds the DOCK2-ELMO1 Complex to Activate Rac and Inhibit Lymphocyte Chemotaxis,” PLoS Biol. 2 (January, 2004) 65–76.

[79] S. G. Popov, T. G. Popova, E. Grene, F. Klotz, J. Cardwell, C. Bradburne, Y. Jama, M. Maland, J. Wells, A. Nalca, T. Voss, C. Bailey, and K. Alibek, “Systemic cytokine response in murine anthrax,” Cell. Microbiol. 6 (2004), no. 3, 225–233.

[80] N. H. Bergman, K. D. Passalacqua, R. Gaspard, L. M. Shetron-Rama, J. Quackenbush, and P. C. Hanna, “Murine Macrophage Transcriptional Responses to Bacillus anthracis Infection and Intoxication,” Infect. Immun. 73 (2005), no. 2, 1069–1080.

[81] C. K. Cote, N. Van Rooijen, and S. L. Welkos, “Roles of Macrophages and Neutrophils in the Early Host Response to Bacillus anthracis Spores in a Mouse Model of Infection,” Infect. Immun. 74 (2006), no. 1, 469–480.

[82] M. Moayeri, N. W. Martinez, J. Wiggins, H. A. Young, and S. H. Leppla, “Mouse susceptibility to anthrax lethal toxin is influenced by genetic factors in addition to those controlling macrophage sensitivity,” Infect. Immun. 72 (2004), no. 8, 4439–4447.

[83] N. Remus, J. Reichenbach, C. Picard, C. Rietschel, P. Wood, D. Lammas, D. S. Kumararatne, and J.-L. Casanova, “Impaired Interferon Gamma-Mediated Immunity and Susceptibility to Mycobacterial Infection in Childhood,” Pediatric Research 50 (2001), no. 1, 8–13.

[84] T. Mueller, A. Mas-Marques, C. Sarrazin, M. Wiese, J. Halangk, H. Witt, G. Ahlenstiel, U. Spengler, U. Goebel, and B. Wiedenmann, “Influence of interleukin 12B (IL12B) polymorphisms on spontaneous and treatment-induced recovery from hepatitis C virus infection,” Journal of Hepatology 41 (2004), no. 4, 652–658.

[85] S. I. Ymer, D. Huang, G. Penna, S. Gregori, K. Branson, L. Adorini, and G. Morahan, “Polymorphisms in the Il12b gene affect structure and expression of IL-12 in NOD and other autoimmune-prone mouse strains,” Genes and Immunity 3 (May, 2002) 151–157.

[86] J. D. Storey and R. Tibshirani, “Statistical significance for genomewide studies,” Proc Natl Acad Sci U S A 100 (Aug, 2003) 9440–9445.

[87] Y. Benjamini and D. Yekutieli, “The control of the false discovery rate in multiple testing under dependency,” The Annals of Statistics 29 (2001), no. 4, 1165–1188.

[88] E. Lander and L. Kruglyak, “Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results,” Nat Genet 11 (Nov, 1995) 241–247.

[89] T. V. Perneger, “What’s wrong with Bonferroni adjustments,” BMJ 316 (Apr, 1998) 1236–1238.

[90] C. Bonferroni, “Teoria statistica delle classi e calcolo delle probabilità,” Pubblicazioni del Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8 (1936) 3–62.

[91] S. Dudoit, M. J. van der Laan, and K. S. Pollard, “Multiple testing. Part I. Single-step procedures for control of general Type I error rates,” Statistical Applications in Genetics and Molecular Biology 3 (2004) Article 13.

[92] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” J. Roy. Statist. Soc. Ser. B 57 (1995), no. 1, 289–300.

[93] Y. Benjamini and D. Yekutieli, “The control of the false discovery rate in multiple testing under dependency,” Ann. Statist. 29 (2001), no. 4, 1165–1188.

[94] J. D. Storey, “The positive false discovery rate: A Bayesian interpretation and the q-value,” The Annals of Statistics 31 (2003), no. 6, 2013–2035.

[95] S. Langella, S. Hastings, S. Oster, T. Kurc, U. Catalyurek, and J. Saltz, “A distributed data management middleware for data-driven application systems,” in CLUSTER ’04: Proceedings of the 2004 IEEE International Conference on Cluster Computing, pp. 267–276. IEEE Computer Society, Washington, DC, USA, 2004.

[96] S. L. Hastings, S. Langella, S. Oster, and J. H. Saltz, “Distributed data management and integration framework: The Mobius project,” Proceedings of the Global Grid Forum 11 (GGF11) Semantic Grid Applications Workshop (Dec, 2004) 20–38.

[97] “The mouse phenome databases.” http://www.jax.org/phenome/.

[98] W. Osler, Lectures on Angina Pectoris and Allied States. Appleton-Century-Crofts, 1897.

[99] T. Scheller, H. Orgacka, C. Szumlanski, and R. M. Weinshilboum, “Mouse liver nicotinamide N-methyltransferase pharmacogenetics: biochemical properties and variation in activity among inbred strains,” Pharmacogenetics 6 (Feb, 1996) 43–53.

[100] J. Rini, C. Szumlanski, R. Guerciolini, and R. Weinshilboum, “Human liver nicotinamide N-methyltransferase: ion-pairing radiochemical , biochemical properties and individual variation,” Clinica Chimica Acta 186 (1990), no. 3, 359–374.

[101] J. C. Souto, F. Blanco-Vaca, J. M. Soria, A. Buil, L. Almasy, J. Ordóñez-Llanos, J. M. Martín-Campos, M. Lathrop, W. Stone, J. Blangero, and J. Fontcuberta, “A genomewide exploration suggests a new candidate gene at chromosome 11q23 as the major determinant of plasma homocysteine levels: Results from the GAIT project,” Am J Hum Genet 76 (2005), no. 6, 925–933.

[102] A. A. Noga, Y. Zhao, and D. E. Vance, “An unexpected requirement for phosphatidylethanolamine N-methyltransferase in the secretion of very low density lipoproteins,” Journal of Biological Chemistry 277 (2002), no. 44, 42358–42365.

[103] R. C. Edgar, “MUSCLE: multiple sequence alignment with high accuracy and high throughput,” Nucleic Acids Res 32 (2004), no. 5, 1792–1797.

[104] E. Subbarao, “A single amino acid in the PB2 gene of influenza A virus is a determinant of host range,” Journal of 67 (1993), no. 4, 1761–.

[105] K. Shinya, S. Hamm, M. Hatta, H. Ito, T. Ito, and Y. Kawaoka, “PB2 amino acid at position 627 affects replicative efficiency, but not cell tropism, of Hong Kong H5N1 influenza A viruses in mice,” Virology 320 (Mar., 2004) 258–266.

[106] J. Stevens, O. Blixt, T. M. Tumpey, J. K. Taubenberger, J. C. Paulson, and I. A. Wilson, “Structure and Receptor Specificity of the Hemagglutinin from an H5N1 Influenza Virus,” Science (2006) 1124513.

[107] H. Chen, G. J. D. Smith, K. S. Li, J. Wang, X. H. Fan, J. M. Rayner, D. Vijaykrishna, J. X. Zhang, L. J. Zhang, C. T. Guo, C. L. Cheung, K. M. Xu, L. Duan, K. Huang, K. Qin, Y. H. C. Leung, W. L. Wu, H. R. Lu, Y. Chen, N. S. Xia, T. S. P. Naipospos, K. Y. Yuen, S. S. Hassan, S. Bahri, T. D. Nguyen, R. G. Webster, J. S. M. Peiris, and Y. Guan, “Establishment of multiple sublineages of H5N1 influenza virus in Asia: Implications for pandemic control,” PNAS 103 (2006), no. 8, 2845–2850.

[108] S. Van Borm, I. Thomas, G. Hanquet, B. Lambrecht, M. Boschmans, G. Dupont, M. Decaestecker, R. Snacken, and T. van den Berg, “Highly pathogenic H5N1 influenza virus in smuggled Thai eagles, Belgium,” Emerging Infectious Diseases 11 (2005), no. 5, 702–705.

[109] J. Felsenstein, “Phylogenies and quantitative characters,” Annual Review of Ecology and Systematics 19 (1988), no. 1, 445–471.

[110] A. Grafen, “The phylogenetic regression,” Philosophical Transactions of the Royal Society of London 326 (1989), no. 1233, 119–157.

[111] E. Martins and T. Garland Jr, “Phylogenetic analyses of the correlated evolution of continuous characters: A simulation study,” Evolution 45 (1991), no. 3, 534–557.

[112] T. Garland Jr, P. Harvey, and A. R. Ives, “Procedures for the analysis of comparative data using phylogenetically independent contrasts,” Systematic Biology 41 (1992), no. 1, 18–32.

[113] T. H. Oakley and C. W. Cunningham, “Independent contrasts succeed where ancestor reconstruction fails in a known bacteriophage phylogeny,” Evolution 54 (Apr., 2000) 397–405.

[114] A. Purvis and A. Rambaut, “Comparative analysis by independent contrasts (CAIC): an Apple Macintosh application for analysing comparative data,” Comput. Appl. Biosci. 11 (1995), no. 3, 247–251.

[115] W. J. Wagner, Recent Advances in , vol. 1. University of Toronto Press, 1961.

[116] J. Felsenstein, Inferring Phylogenies. Sinauer Associates, September, 2003.

[117] B. Efron, “Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods,” Biometrika 68 (1981), no. 3, 589–599.

[118] E. S. Edgington, Randomization Tests. Marcel Dekker, New York, 3 ed., 1995.

[119] F. Pesarin, Multivariate Permutation Tests: With Applications in . Wiley, 2001.

[120] C. Lunneborg, Data Analysis by Resampling. Duxbury Press, 1999.

[121] Y. Cho, M. Ritchie, J. Moore, J. Park, K.-U. Lee, H. Shin, H. Lee, and K. Park, “Multifactor-dimensionality reduction shows a two-locus interaction associated with type 2 diabetes mellitus,” Diabetologia 47 (Mar., 2004) 549–554.

[122] L. Bastone, M. Reilly, D. Rader, and A. Foulkes, “MDR and PRP: A Comparison of Methods for High-Order Genotype-Phenotype Associations,” Human Heredity 58 (2004) 82–92.

[123] M. D. Ritchie, L. W. Hahn, and J. H. Moore, “Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity,” Genetic Epidemiology 24 (2003) 150–157.

[124] D. E. Knuth, The Art of Computer Programming, Volume II: Seminumerical Algorithms, 2nd Edition. Addison-Wesley, 1981.

[125] P. McClurg, M. Pletcher, T. Wiltshire, and A. Su, “Comparative analysis of haplotype association mapping algorithms,” BMC Bioinformatics 7 (2006), no. 1, 61.

[126] S. Fields and R. Sternglanz, “The two-hybrid system: an assay for protein-protein interaction,” Trends in Genetics 10 (1994) 286–292.

[127] J. H. Nadeau, “Modifier genes in mice and humans,” Nature Reviews Genetics 2 (2001), no. 3, 165–174.

[128] W. Li and J. Reich, “A complete enumeration and classification of two-locus disease models,” Human Heredity 50 (2000), no. 6, 334–349.
