
The Pennsylvania State University

The Graduate School

Department of Biology

MOLECULAR EVOLUTIONARY ANALYSIS OF PLASTID GENOMES IN NONPHOTOSYNTHETIC ANGIOSPERMS AND CANCER CELL LINES

A Dissertation in

Biology

by

Yan Zhang

© 2012 Yan Zhang

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

Dec 2012

The Dissertation of Yan Zhang was reviewed and approved* by the following:

Schaeffer, Stephen W. Professor of Biology Chair of Committee

Ma, Hong Professor of Biology

Altman, Naomi Professor of Statistics

dePamphilis, Claude W Professor of Biology Dissertation Adviser

Douglas Cavener Professor of Biology Head of Department of Biology

*Signatures are on file in the Graduate School

ABSTRACT

This thesis explores the application of evolutionary theory and methods to understanding the plastid genomes of nonphotosynthetic parasitic plants and the role of mutations in tumor proliferation. We explore plastid genome evolution in parasitic angiosperm lineages that have given up the primary function of the plastid genome, photosynthesis. Genome structure, content, and evolutionary dynamics were analyzed and compared in both independent and related parasitic lineages. Our studies revealed striking similarities in the changes of gene content and evolutionary dynamics that accompany the loss of photosynthetic ability in independent nonphotosynthetic plant lineages. Evolutionary analysis suggests accelerated evolution in the plastid genomes of the nonphotosynthetic plants.

This thesis also explores the application of phylogenetic and evolutionary analysis in cancer biology. Although cancer has often been likened to a Darwinian process, molecular evolutionary analysis has seen very little application in cancer biology research. In our study, phylogenetic approaches were used to explore the relationships of several hundred established cancer cell lines based on multiple sequence alignments constructed with variant codons and residues across 494 and 523 . Phylogenetic analysis revealed that phylogenetic clustering of cancer cell lines might reflect functional similarities in cellular pathways.

Molecular evolutionary analysis was applied to identify potential driver mutations that are causally implicated in oncogenesis. Our study indicates that phylogenetic and molecular evolutionary analysis can provide important insights into tumor classification and the development of novel anticancer strategies, and suggests a potential role for evolutionary history and methods in cancer diagnostics and therapeutics.

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGEMENTS

Chapter 1 Introduction
  References

Chapter 2 Striking parallelism of plastid genome structure and dynamics in independent nonphotosynthetic lineages
  Abstract
  Introduction
  Materials and Methods
    Plastome sequencing
    Indel analysis in inverted repeats and pseudogenes
    Phylogenetic analysis
    Molecular evolutionary analysis
  Results
    Plastid genome size and structure
    Gene content
    Analysis of indels
    Genes for gene expression
    Pholisma and Epifagus are independently derived
    Molecular evolutionary analyses
    Relaxed clock analysis
  Discussion
    Plastome phylogeny
    Plastome structure and evolutionary dynamics
    Holoparasitism occurred more recently in Pholisma than in Epifagus
    Parallel polymorphism of rbcL in independent holoparasitic lineages
    Deterministic and stochastic evolution model of plastomes in parasitic plants
    Is this a convergence or parallelism?
  References

Chapter 3 Plastid genome sequence and analysis of Conopholis americana, a holoparasitic relative of Epifagus virginiana
  Abstract
  Introduction
  Materials and Methods
    DNA extraction and library
    Procedure for fosmid shotgun sequencing and sequence assembly
    Genome annotation
    Molecular evolutionary analysis
  Results
  Discussion
  References

Chapter 4 Molecular Evolutionary Analysis of Cancer Cell Lines
  Abstract
  Introduction
  Materials and Methods
    DNA sequencing
    Construction of multiple sequence alignments
    Phylogenetic analysis
    mRNA expression and pathway analysis
    Selection pressure analysis
  Results and Discussion
  References

Chapter 5 Future Work
  References

Appendix A Flowchart of molecular evolutionary analysis in chapter 2

Appendix B Gene summary by function category in Pholisma, Epifagus, and Mimulus

Appendix C Comparative plastome summary of Pholisma, Epifagus, Ehretia and Mimulus

Appendix D Indel position, length and analysis

Appendix E Multiple sequence alignment for pseudogenes in Pholisma, Epifagus, Ehretia, and Mimulus

Appendix F Cancer cell lines nucleotide phylogenetic tree shown in Fig. 4-2 in phylogram format

Appendix G Cancer cell line tree

Appendix H Cell lines with identical nucleotide sequences

Appendix I 398 probeset (PLS-DA gene set) with VIP values > 2.5 from PLS-DA SIMCA


LIST OF FIGURES

Figure 2-1: Plastid genome of Pholisma arenarium

Figure 2-2: Plastid genome of Ehretia acuminata

Figure 2-3: Plastid genome of Mimulus guttatus

Figure 2-4: Plastid genome structure and content comparison between Pholisma and Ehretia

Figure 2-5: Dot plot of the complete plastid genomes of Pholisma arenarium and Ehretia acuminata from Mulan analysis

Figure 2-6: MultiPipMaker analyses of 6 sequenced angiosperms using Panax ginseng as the reference genome, illustrating parallel losses in holoparasites Pholisma and Epifagus

Figure 2-7: Histogram of indels in Pholisma, Epifagus, Ehretia and Mimulus

Figure 2-8: The total substitution rates of the holoparasite and photosynthetic relative

Figure 2-9: The synonymous rates of the holoparasite and photosynthetic relative

Figure 2-10: The nonsynonymous rates of the holoparasite and photosynthetic relative

Figure 2-11: The Omega of the holoparasite and background Omega

Figure 2-12: Evolutionary analysis and relaxed clock analysis

Figure 3-1: Plastid genome of Conopholis americana

Figure 3-2: Dot plots of the complete plastid genomes of Conopholis americana and Epifagus and their close green relative Mimulus guttatus from Mulan analysis

Figure 3-3: MultiPipMaker genome coverage map including Conopholis, Epifagus and Mimulus

Figure 3-4: MultiPipMaker plot details

Figure 4-1: Flowchart of cancer cell line gene variant analysis

Figure 4-2: Phylogenetic tree of tumor cell lines based on DNA sequences

Figure 4-3: Best maximum likelihood phylogenetic tree of Clade A cell lines, based on nucleotide sequences


LIST OF TABLES

Table 2-1: Genome content of Pholisma and Epifagus compared with their green relatives Ehretia and Mimulus

Table 2-2: Relative synonymous codon usage (RSCU) of the coding sequences across the plastid genomes

Table 2-3: One-rate, two-rate or multiple-rate model

Table 3-1: Genome content of Conopholis and Epifagus compared with their green relative Mimulus guttatus

Table 3-2: dN/dS ratio (ω) in Epifagus and Conopholis

Table 4-1: Variant codons occurring in ≥ 10 Clade A cell lines at frequencies ≥ 0.5

Table 4-2: Pathways enriched in Metacore GeneGo for both Clade A genes (Variants) and PLS-DA gene set (Expression) based on hypergeometric P-value < 0.05

Table 4-3: Genes with amino acids under positive selection (PS) in Clade A and all tumor cell lines, based on PAML likelihood scores

ACKNOWLEDGEMENTS

I would like to take this opportunity to express my sincere appreciation to all the people who have supported and inspired me during my entire doctoral study.

I owe my deepest gratitude to my dissertation advisor, Dr. Claude dePamphilis, for his continuous guidance, support and encouragement, without which the eventual completion of my Ph.D. study would have been impossible. I would also like to thank my thesis committee members, Dr. Schaeffer, Dr. Hong, and Dr. Altman, for their thorough and thoughtful comments and revisions, which will undoubtedly aid in finalizing the chapters for journal publication.

I would also like to thank all my coauthors, Kai Muller, James Brown, Joanna Holbrook, Susann Wicke, Michael Moore, John Willis, Jennifer Kuehl, Alan Smith, Doug Soltis, Jeff Boore, Michale Italia, Michal Magid-Slav, Wendy Halsey, Stephanie Van Horn and their lab members and colleagues. I also appreciate all my colleagues in the lab, especially Joel McNeal, Kerr Wall, Lena Landherr, Paula Ralph, Liying Cui, Jim Leebens-Mack, Yuannian Jiao, Jill Duarte, Norman Wickett, Yeting Zhang, Yuchen Zhang, Loren Honaas, and Eric Wafula.

I want to thank my husband, Xin Zhou, for his continuous support, consideration and encouragement. Finally, I would like to thank my parents for their boundless love and support through all these years. To them, I dedicate this dissertation.

Chapter 1

Introduction

Evolutionary biology is an essential basic science, and evolutionary theory and methods have provided valuable insights in many scientific fields. This thesis applies phylogenetic and molecular evolutionary analysis to two problems: the evolution of plastid genomes in parasitic plants, and tumor classification and the development of novel anticancer strategies.

Parasitic plants

Parasitic plants are plants that obtain some or all of their carbon, nutrients and water from an autotrophic host (Perry and Wolfe 2002). While most plants are autotrophic, producing fixed carbon through photosynthesis, parasitic plants form a connection with the host by penetrating the host xylem and/or phloem. This connection, known as the haustorium (Kuijt 1969), is a structure found in all parasitic plants, and it is through the haustorium that parasites acquire nutrients from their hosts (Perry and Wolfe 2002).

Parasitic plants are distributed in many natural ecosystems, from tropical rain forests to the arctic (Yi, Lee et al.). They represent about 1% of angiosperm species and are included in more than 20 angiosperm families (Braukmann and Stefanović 2012, Haberle, Fourcade et al. 2008) that represent at least 11 independent origins of the parasitic lifestyle (Barkman et al. 2007). Parasitic plants can be categorized into facultative parasites and obligate parasites (Palmer, Osorio et al. 1987). Obligate parasites rely completely on their host to complete their life cycle. Epifagus (beechdrops), Conopholis (squawroot), Cuscuta (dodder), and Rafflesia are all examples of obligate parasites. Facultative parasites do not rely completely on their hosts and can complete their life cycle without a host. Facultative parasitic lineages within Orobanchaceae include Triphysaria, Agalinis, and Rhinanthus.

Depending on the location of the connection between host and parasite, parasitic plants can be categorized into root parasites and stem parasites (Kuijt 1969). Root parasites form haustoria on the roots of their host plants, while stem parasites form haustoria on the host plant stems. Epifagus and Conopholis are both root parasites, feeding on the roots of beech trees and of multiple members of Fagaceae, respectively, while mistletoe and Cuscuta are examples of stem parasites.

Plastid Genome Evolution in Parasitic Angiosperms

Plastids are essential plant organelles where photosynthesis and other important metabolic processes are performed (Maple and Moller 2007), and they can be considered biosynthetic powerhouses within plant cells. Besides photosynthesis, the plastid is involved in many other biosynthetic pathways in plants, including starch synthesis, fatty acid synthesis, nitrogen assimilation and amino acid synthesis (Neuhaus and Emes 2000). Plastids, as far as is known, are present in all plant and algal cells.

Plastids can be categorized into several types based on color, structure and developmental stage, including chloroplasts, chromoplasts, proplastids, etioplastids, and leucoplasts. Proplastids are colorless, undifferentiated plastids that develop into the other, more mature plastid types; they occur in shoots, roots, embryos, endosperm, and meristematic cells. Chloroplasts are green plastids and are usually associated with photosynthesis. Chromoplasts are usually red, yellow or orange because they contain carotenoids, and they are responsible for pigment synthesis and storage. Chromoplasts are usually found in fruits, flowers, roots and aging leaves. Etioplastids are plastids that developed in the dark; they are found in white, light-deprived stem and tissue (Wise and Hoober 2006). Leucoplasts are colorless plastids that can be further categorized into amyloplasts, elaioplasts, and proteinoplasts based on their distinct functions. Amyloplasts are responsible for starch synthesis and storage, elaioplasts are specialized for lipid storage, and proteinoplasts are used for protein storage.

Plastids, which contain their own DNA, are derived from one or more endosymbiotic events between oxygenic photosynthetic cyanobacteria and eukaryotic cells, which date back over a billion years (Wu, Wang et al., Millen, Olmstead et al. 2001). The acquisition of plastids as “little workers, green slaves” within the cell endowed the cell with compartmentalized bioenergetic functions (Martin and Kowallik 1999).

In most multicellular , plastid DNA is maternally-inherited. Compared with the gigantic nuclear genome, plastid genomes (also known as plastomes, ptDNA, or cpDNA) are circular, double-stranded, tiny “genomes” that are present in high copy number per cell.

Plastomes retain only a small set of core genes from their cyanobacterial ancestor; the majority of the genes have been either lost or transferred to the nuclear genome during endosymbiotic evolution (Martin, Rujan et al. 2002, Wise and Hoober 2006). Because of the small size, conserved structure and gene organization, amenability to genetic transformation, and important functions of plastid genomes, vast effort has been invested in exploring them.

The encoding capacity of contemporary plastid genomes is only about 5-10% of that of free-living cyanobacteria, although plastids are derived from ancient cyanobacteria (Timmis, Ayliffe et al. 2004). The contemporary plastid genome is reduced to approximately 120-250 genes from an ancestral cyanobacterial genome that has been estimated to contain over 3000 genes (Martin, Rujan et al. 2002, Timmis, Ayliffe et al. 2004). Genome reduction is a major step in the transition from autonomous free-living cyanobacteria to organelles: genetic autonomy is lost, resulting in massive loss of genes with redundant functions and in gene transfer to the nucleus (Dyall, Brown et al. 2004). As a result, the encoding capacity of contemporary plastids in plants and algae is greatly reduced compared to cyanobacteria. A typical plastid genome of photosynthetic plants and algae contains about 120-160 genes, the majority of which encode photosynthesis-related proteins or components of the plastid gene expression apparatus (Palmer 1985, Raubeson and Jansen 2005). In photosynthetic green plants specifically, the plastid genome encodes only about 90 proteins. Only about 5% of the proteins involved in plastid metabolic activity are encoded in the plastid genome; most of the proteins present in plastids are encoded by nuclear genes and then imported into the organelle (Abdallah, Salamini et al. 2000, Krause 2008).
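The gene counts quoted above allow a rough check on the extent of genome reduction. The following is a minimal Python sketch and not part of the original analysis; the constant and function names are illustrative assumptions. Because the ancestral count is given only as "over 3000", the computed fractions should be read as upper bounds on the share of ancestral genes still encoded in the plastid.

```python
# Figures quoted in the text; the names below are illustrative assumptions.
ANCESTRAL_GENES = 3000       # "over 3000 genes", treated here as a lower bound
PLASTID_GENES = (120, 250)   # approximate range for contemporary plastid genomes


def retained_fraction(plastid_genes: int, ancestral_genes: int) -> float:
    """Fraction of the ancestral gene count still encoded in the plastid."""
    return plastid_genes / ancestral_genes


low, high = (retained_fraction(n, ANCESTRAL_GENES) for n in PLASTID_GENES)
print(f"retained share of ancestral genes: {low:.1%} to {high:.1%}")
```

Under these assumptions the retained share is a few percent, the same order of magnitude as the 5-10% encoding capacity cited in the text, which may be measured on a different basis than raw gene counts.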

The high copy number of plastid genomes in the cell and their relatively small size make them much easier to sequence than the nuclear genome. Overall, nearly 300 plastid genomes have been sequenced to date (NCBI 2012). Among them, complete plastid genomes have been reported for more than 250 species of land plants and green algae (NCBI 2012). Other sequenced plastid genomes are from varied groups, including Rhizaria, red algae and other algal lineages. The first complete plastid genomes, those of Nicotiana tabacum and the liverwort Marchantia polymorpha, were reported in 1986 (Ohyama, Fukuzawa et al. 1986, Shinozaki, Ohme et al. 1986). Since then, many more plastid genomes have been sequenced and made publicly available, mainly through GenBank. In the last few years, genomic sequencing strategies and bioinformatics tools have improved significantly. As a result, more than 70% of the complete genomes in GenBank were submitted within three years of a recent publication (Khan, Khan et al. 2010).

Plastid genome structure and content are generally highly conserved in higher plants. The circular plastid genome is characterized by a quadripartite structure comprising two identical copies of an inverted repeat (IR) that separate a large single copy (LSC) region from a small single copy (SSC) region. Gene content and order are generally quite conserved, with a few exceptions. The boundaries and size of the IR vary among plastomes, which contributes to the variation in plastome size. In general, the plastomes of higher plants include genes encoding components of the photosynthetic apparatus and the gene expression apparatus, as well as hypothetical open reading frames (ycfs). Photosynthesis-related genes encode subunits of photosystem I and photosystem II, the cytochrome b6f complex, ATP synthase, the RuBisCO large subunit, and NADH dehydrogenase. Gene expression related genes encode plastid RNA polymerase subunits, transfer RNAs, ribosomal RNAs, and ribosomal proteins. Conserved open reading frames are defined as ycfs. Plastome size ranges from 37 kb in the parasitic green alga Helicosporidium to the exceptionally large 521 kb plastome of the green alga Floydiella (deKoning and Keeling 2006, Brouard, Otis et al. 2010). Overall, variation in genome size is caused by the expansion and contraction of inverted and other repeats, increases or decreases in the number of repeated sequences, and the loss or deletion of genes.

Plastid genes have been widely used for phylogenetic studies because of the plastome’s small, relatively constant size and conservative evolution (Palmer 1985). With the recent availability of a large number of complete plastid genomes, genome-scale phylogenetic studies have resolved several enigmatic questions about angiosperm relationships (Leebens-Mack, Raubeson et al. 2005, Jansen, Cai et al. 2007, The Angiosperm Phylogeny Group 2009, Moore, Soltis et al. 2010).

Inverted Repeats:

An outstanding feature of plastomes is a pair of identical inverted repeats (IR). The inverted repeat regions of the plastome are generally more conserved in nucleotide substitution rate, gene content, and gene order than the single copy regions (Palmer 1991, Perry and Wolfe 2002, Raubeson and Jansen 2005). The inverted repeats are typically about 20-30 kb in size, but may be much smaller or larger (Park, Manen et al. 2008). Shifts in the boundaries between the IR regions and the large and small single copy regions are also an important cause of expansion or contraction of plastid genomes: when IR regions extend into neighboring single copy regions, the size of the IRs changes accordingly. An exceptional example is the gymnosperm Pinus thunbergii, whose inverted repeats are contracted to only 496 bp; consequently, the length of its plastome is reduced to only 119 kb. On the other hand, the inverted repeats in Pelargonium hortorum are expanded to about 76 kb, and correspondingly the size of its plastome is 217 kb.
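Because a quadripartite plastome carries two copies of the IR, its total length can be written as LSC + SSC + 2 × IR. The short Python sketch below applies that relation to the two examples just given; it is an illustration only, the function names are assumptions, and the input sizes are the rounded values quoted in the text rather than exact genome lengths.

```python
def single_copy_total(plastome_bp: int, ir_bp: int) -> int:
    """LSC + SSC length implied by: plastome = LSC + SSC + 2 * IR."""
    return plastome_bp - 2 * ir_bp


def ir_share(plastome_bp: int, ir_bp: int) -> float:
    """Fraction of the plastome occupied by the two IR copies together."""
    return 2 * ir_bp / plastome_bp


# Rounded sizes quoted in the text: (plastome length, length of one IR copy), in bp.
EXAMPLES = {
    "Pinus thunbergii": (119_000, 496),
    "Pelargonium hortorum": (217_000, 76_000),
}

for species, (plastome_bp, ir_bp) in EXAMPLES.items():
    sc = single_copy_total(plastome_bp, ir_bp)
    share = ir_share(plastome_bp, ir_bp)
    print(f"{species}: single copy regions ~{sc} bp, IR share {share:.1%}")
```

With these rounded inputs, the two IR copies account for under 1% of the Pinus plastome but roughly 70% of the Pelargonium plastome, which illustrates how strongly IR contraction and expansion can drive plastome size.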

It is hypothesized either that inverted repeats provide an insulation mechanism that stabilizes genome structure, or that the retention of two identical inverted repeats provides corrective properties that confer stability. The latter hypothesis is supported by the much slower substitution rates in inverted repeat regions compared with single copy regions (Wolfe, Li et al. 1987, Birky 1989, Gaut 1998, Cui, Yue et al. 2006). Generally, inverted repeats evolve two to three times more slowly than single copy regions. In legume plastomes, synonymous rates in IRs are 2.3-fold lower than in single copy regions, but substitution rates are increased in plastomes lacking IRs. In addition, gene rearrangements are observed to be more frequent in plastomes without IRs (Palmer and Thompson 1982).

Genes contained in inverted repeat regions usually include ribosomal RNA genes, transfer RNA genes, ndhB, ycf1, and rps15 (Palmer 1985). The IR/SC borders often involve different genes in eudicots and monocots (Palmer 1985, Ravi, Khurana et al. 2008). In eudicots, the IR/SC borders commonly lie in tRNA-His, rps19, ycf1 and ndhF (Ravi, Khurana et al. 2008). Correspondingly, in monocots the borders usually lie in psbA, rpl22, ndhH and ndhF (Ravi, Khurana et al. 2008).

Single copy regions

The large single copy region is usually the largest component of the quadripartite plastome structure, and it is also the least conserved region when compared to the IRs and the small single copy region. The length of the LSC varies depending on the extent to which the IR regions expand into the SC region, but it is generally around 80 kb and contains the majority of the plastid-encoded protein genes. In contrast, the small single copy region is generally less than 20 kb and encodes a few ndh, tRNA, and ycf genes (Palmer 1985, Wicke, Schneeweiss et al. 2011).

Cancer Genomics

Recent advances in sequencing technology have provided an excellent opportunity for genome-scale studies of cancer cells and tissues. An explosion of genomic information has become available for cancer bioinformatics research to explore the genomic mechanisms of cancer progression.

Cancers are clonal proliferations that arise following mutations that confer a selective advantage on cells (Greenman, Stephens et al. 2007). Cancer genomes carry two types of mutations: driver mutations and passenger mutations. Driver mutations are causally implicated in cancer progression; in other words, they confer a growth advantage on the cells and contribute to tumorigenesis, so they are positively selected (Pleasance, Cheetham et al. 2010). Passenger mutations, on the other hand, do not confer any selective advantage on the cells and thus are not subject to selection (Pleasance, Cheetham et al. 2010). The majority of somatic mutations are passenger mutations. For example, studies have shown that as few as three mutations may be sufficient for developing colorectal cancer, while colorectal cancers carry about 100 nonsynonymous mutations (Beerenwinkel et al. 2007). One central task is to discriminate between driver and passenger mutations.

Although cancer progression is often likened to a Darwinian process, very few studies have applied evolutionary approaches to understanding cancers. This thesis also explores the potential role of molecular evolutionary analyses in tumor classification and the development of novel anticancer strategies.

References

Abdallah, F., F. Salamini, and D. Leister. (2000). A prediction of the size and evolutionary origin of the proteome of chloroplasts of Arabidopsis. Trends Plant Sci 5: 141-142.

Birky, C. W. (1989). Organelle evolution. Genome 31: 1095-1097.

Blankenship, R. E. (2010). Early evolution of photosynthesis. Plant Physiol 154: 434-438.

Braukmann, T. and S. Stefanović (2012). Plastid genome evolution in mycoheterotrophic Ericaceae. Plant Mol Biol 79(1): 5-20.

Brouard, J.-S., C. Otis, et al. (2010). The exceptionally large chloroplast genome of the green alga Floydiella terrestris illuminates the evolutionary history of the Chlorophyceae. Genome Biol and Evol 2: 240-256.

Cui, L., F. Yue, C. W. dePamphilis, B. M. E. Moret, and J. Tang. (2006). Inferring ancestral chloroplast genomes with inverted repeat. Proceedings of the 2006 International Conference on Bioinformatics and Comput Biol: 75-81.

deKoning, A. P. and P. J. Keeling (2006). The complete plastid genome sequence of the parasitic green alga Helicosporidium sp. is highly reduced and structured. BMC Biol 4.

Dyall, S. D., M. T. Brown, and P. J. Johnson. (2004). Ancient invasions: from endosymbionts to organelles. Science 304: 253-257.

Gao, L., Y.-J. Su, et al. (2010). Plastid genome sequencing, comparative genomics, and phylogenomics: Current status and prospects. J Sys Evol 48: 77-93.

Gaut, B. S. (1998). Molecular clocks and nucleotide substitution rates in higher plants. Evol Biol 30: 93-120.

Khan, A., I. A. Khan, H. Asif, and M. K. Azim. (2010). Current trends in chloroplast genome research. Afr J Biotechnol 9: 3494-3500.

Krause, K. (2008). From chloroplasts to “cryptic” plastids: evolution of plastid genomes in parasitic plants. Curr Genet 54: 111-121.

Kuijt, J. (1969). The biology of parasitic flowering plants. Berkeley, University of California Press.

Leebens-Mack, J., L. A. Raubeson, et al. (2005). Identifying the basal angiosperm node in chloroplast genome phylogenies: Sampling one's way out of the Felsenstein zone. Mol Biol Evol 22: 1948-1963.

Maple, J., and S. G. Moller. (2007). Plastid division: evolution, mechanism and complexity. Ann Bot 99: 565-579.

Martin, W., and K. V. Kowallik. (1999). Annotated English translation of Mereschkowsky's 1905 paper.

Martin, W., T. Rujan, et al. (2002). Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus. Proc Natl Acad Sci U S A 99: 12246-51.

Moore, M. J., P. S. Soltis, et al. (2010). Phylogenetic analysis of 83 plastid genes further resolves the early diversification of eudicots. Proc Natl Acad Sci 107: 4623-4628.

Moreira, D., H. Le Guyader, et al. (2000). The origin of red algae and the evolution of chloroplasts. Nature 405: 69-72.

Neuhaus, H. E. and M. J. Emes (2000). Nonphotosynthetic metabolism in plastids. Annu Rev Plant Physiol Plant Mol Biol 51: 111-140.

Nickrent, D., R. Duff, et al., Eds. (1998). Molecular phylogenetic and evolutionary studies of parasitic plants. Molecular Systematics of Plants II DNA Sequencing. Boston, USA, Kluwer Academic.

Ohyama, K., H. Fukuzawa, T. Kohchi, H. Shirai, T. Sano, S. Sano, K. Umesono, Y. Shiki, M. Takeuchi, Z. Chang, S.-i. Aota, H. Inokuchi, and H. Ozeki. (1986). Chloroplast gene organization deduced from complete sequence of liverwort Marchantia polymorpha chloroplast DNA. Nature 322: 572.

Palmer, J. D. (1985). Chloroplast DNA and molecular phylogeny. BioEssays 2: 263-267.

Palmer, J. D. (2000). A single birth of all plastids? Nature 405: 32-33.

Palmer, J. D., and W. F. Thompson. (1982). Chloroplast DNA rearrangements are more frequent when a large inverted repeat sequence is lost. Cell 29: 537-550.

Perry, A. S. and K. H. Wolfe (2002). Nucleotide substitution rates in legume chloroplast DNA depend on the presence of the inverted repeat. J Mol Evol 55: 501-508.

Press, M. C. and G. K. Phoenix (2005). Impacts of parasitic plants on natural communities. New Phytol 166: 737-51.

Press, M., J. Scholes, et al., Eds. (1999). Parasitic plants: physiological and ecological interactions with their hosts. Physiological Plant Ecology. Oxford, UK, Blackwell Science.

Press, M. C. (1998). Dracula or Robin Hood? A functional role for root hemiparasites in nutrient poor ecosystems. Oikos 82: 609-611.

Shinozaki, K., M. Ohme, M. Tanaka, T. Wakasugi, N. Hayashida, T. Matsubayashi, N. Zaita, J. Chunwongse, J. Obokata, K. Yamaguchi-Shinozaki, C. Ohto, K. Torazawa, B. Y. Meng, M. Sugita, H. Deno, T. Kamogashira, K. Yamada, J. Kusuda, F. Takaiwa, A. Kato, N. Tohdoh, H. Shimada, and M. Sugiura. (1986). The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression. Embo J 5: 2043-2049.

The Angiosperm Phylogeny Group (2009). An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG III. Bot J Linnean Soc 161: 105-121.

Timmis, J. N., M. A. Ayliffe, et al. (2004). Endosymbiotic gene transfer: organelle genomes forge eukaryotic chromosomes. Nat Rev Genet 5: 123-35.

Westwood, J. H., J. I. Yoder, et al. (2010). The evolution of parasitism in plants. Trends Plant Sci 15: 227-35.

Wicke, S., G. M. Schneeweiss, et al. (2011). The evolution of the plastid chromosome in land plants: gene content, gene order, gene function. Plant Mol Biol 76: 273-97.

Wise, R. R. and J. K. Hoober (2006). The Structure and Function of Plastids, Springer.

Wolfe, K. H., W. H. Li, et al. (1987). Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proc Natl Acad Sci U S A 84: 9054-8.

Chapter 2

Striking parallelism of plastid genome structure and dynamics in independent nonphotosynthetic lineages of parasitic plants

Yan Zhang1, Kai Müller2,1, Michael Moore4, Joel R. McNeal1, Alan R. Smith3, John Willis6, Jennifer V. Kuehl7, Pam Soltis4, Douglas Soltis7, Jeff L. Boore8, Claude W. dePamphilis1*

1 Department of Biology, The Pennsylvania State University, University Park, PA 16801 USA

2 Institute for Evolution and Biodiversity, University of Muenster, Huefferstr. 1, D-48149 Muenster, Germany

3 Department of Biology, Vanderbilt University, Nashville, TN USA

4 Department of Biology, University of Florida, Gainesville, Florida 32611-8525 USA

5 Department of Biology, Oberlin College, Oberlin, Ohio 44074 USA

6 Department of Biology, Duke University, Durham, NC 27708 USA

7 Florida Museum of Natural History, University of Florida, Gainesville, Florida 32611-7800 USA

8 DOE Joint Genome Institute and Lawrence Berkeley National Laboratory, Walnut Creek, California 94598 USA; Genome Project Solutions, Hercules, CA 94547, USA

Keywords

Pholisma arenarium, nonphotosynthetic plant, plastid genome, evolution, purifying selection, parasitism.

Abstract:

Parasitic plants are valuable models for studying gene and genome evolution. Under relaxed functional constraints, the plastid genomes of some parasitic plants have undergone extensive reduction in gene content and exhibit accelerated rates of evolution for many of the remaining genes. Recent complete plastid genome sequences of several species of Cuscuta, a lineage of parasitic plants in the Convolvulaceae, reveal plastid genomes distinctly different from that of an independently derived holoparasite, Epifagus (Orobanchaceae); however, species of the former genus retain minimal photosynthetic ability, while Epifagus is entirely nonphotosynthetic. To understand whether genome evolution is similar in other independent lineages of nonphotosynthetic angiosperms, the plastid genome of Pholisma arenarium (Lennoaceae/) has been fully sequenced. Additionally, the plastid genomes of Ehretia acuminata (Boraginaceae), a photosynthetic relative of Pholisma, and Mimulus guttatus (Phrymaceae), a photosynthetic relative of Epifagus, were sequenced. All three plastomes were annotated and compared to that of Epifagus and to each other, allowing detailed comparison of plastid genome evolution in two independent nonphotosynthetic lineages. The plastid genome of Pholisma shows a pattern of gene loss strikingly similar to that observed in Epifagus. All of the photosynthetic genes (with the notable exception of rbcL and psaI) and ndh genes are lost, as are the RNA polymerase genes and some components of the translation apparatus. Additionally, extensive pseudogenization is apparent in Pholisma. However, Pholisma still retains the full complement of plastid ribosomal protein genes, while about one third of the ribosomal protein genes have been either deleted or degraded to pseudogenes in Epifagus. Furthermore, Pholisma still retains all but one tRNA gene, whereas Epifagus displays extensive tRNA gene losses. The plastid genome of Pholisma is reduced to 81,186 bp, compared to 156,554 bp in Ehretia. The plastid genome of Mimulus is 153,219 bp, vs. 70,028 bp in Epifagus. All four sequenced plastid genomes have the typical quadripartite structure. Phylogenetic analyses of 81 concatenated genes from 68 fully sequenced land plant chloroplast genomes identify Pholisma and Ehretia (Boraginales) as sister to Lamiales, and Lamiales + Boraginales as sister to a clade comprising Gentianales and Solanales. The gene losses in Pholisma are a perfect subset of those observed in Epifagus, and the genes retained by Pholisma are evolving under purifying selection, but at accelerated rates, though not as accelerated as in Epifagus. Synonymous rates are significantly lower in the SC than in the IR in all four plants, but there is no evidence that individual genes have distinct synonymous rates in the green plants or in Epifagus. In contrast, a model with a distinct synonymous rate for each gene receives significantly more support in Pholisma. A greatly increased frequency of small (<30 bp) insertions and deletions, and of large deletion events, characterizes both parasite lineages in comparison to their photosynthetic relatives.

A relaxed clock analysis suggests that the loss of photosynthesis in Lennoaceae occurred approximately 80 Mya, while Epifagus’ nonphotosynthetic lineage dates back ca. 90 Mya. The results indicate parallel evolution of the two independent lineages of nonphotosynthetic plants and also suggest that photosynthesis has been lost more recently in the lineage including Pholisma.

INTRODUCTION

Photosynthesis is among the most fundamental biological processes on earth, and autotrophic capability must convey major fitness advantages in most plants. However, the heterotrophic life history, and complete loss of photosynthetic ability, has evolved repeatedly and independently in several major groups of land plants, algae, apicomplexa and dinoflagellates (dePamphilis and Palmer 1990, Leake 1994, Wilson, Denny et al. 1996, Gockel, Hachtel et al. 2002, deKoning and Keeling 2006, Barkman, McNeal et al. 2007, Wickett, Zhang et al. 2008).

Because photosynthesis is arguably the primary physiological function of the plastid genome in plants and algae, the loss of photosynthesis is expected to impose profound effects on the plastid genome content and its evolutionary dynamics.

Significant changes in genome content and metabolic repertoire resulting from loss of photosynthesis have been seen in many lineages, such as parasitic plants, apicomplexan parasites, heterotrophic flagellates, and green algal parasites (Wolfe, Morden et al. 1992, Gockel and Hachtel 2000, Keeling 2004, deKoning and Keeling 2006, Wickett, Zhang et al. 2008). Genome degradation and evolutionary changes appear to be associated with the symbiotic lifestyle.

Parasitic plants, which form a parasitic relationship with their plant hosts and switch to a heterotrophic lifestyle, offer an excellent opportunity to study the genome evolution that accompanies the symbiotic lifestyle.

Parasitic symbiotic relationships are commonly seen in land plants, and parasitism has evolved independently at least 11 times (Barkman, McNeal et al. 2007). Some parasitic plants have completely lost all photosynthetic ability, and thus obtain all their nutrients from their photosynthetic hosts through a specialized structure, the haustorium. Such plants are referred to as holoparasites. Many lineages within angiosperms have holoparasitic members, such as Rafflesiaceae, Lennoaceae, Orobanchaceae, Hydnoraceae, among others (Westwood et al., 2010).

Epifagus virginiana (beechdrops, Orobanchaceae) is the only nonphotosynthetic parasitic angiosperm to date for which a plastid genome sequence has been published (Wolfe, Morden et al. 1992). In contrast, other parasitic plants still retain the ability to photosynthesize and are less dependent on their hosts, such as several species of Cuscuta (Convolvulaceae; dodder), mistletoes (Santalales), Cassytha (Lauraceae), and photosynthetic members of Orobanchaceae.

Those are generally categorized as hemiparasites (Westwood et al. 2010).

A large number of studies have focused on the plastid genomes of photosynthetic plants.

Plastids originated from once free-living cyanobacteria, and the majority of the ancestral genes have since been either transferred to the nucleus or lost (Wicke et al., 2011). The plastid genome (plastome) structure, gene order and content are generally very well conserved within green land plants. A typical plastome has a quadripartite structure, with a large single copy region (LSC) and a small single copy region (SSC) separated by two identical inverted repeat (IR) regions. The plastome of land plants usually comprises 110–130 unique genes, and the majority of these genes encode proteins involved in photosynthesis and plastid gene expression (Raubeson and Jansen 2005). However, substantial differences from this common pattern have been observed in plants that have evolved a parasitic or myco-heterotrophic lifestyle, particularly those that have lost photosynthetic capacity (Delannoy, Fujii et al. 2011). Such changes in plastid genome structure and content are expected because, once photosynthesis is lost, the genome escapes the functional constraint of photosynthesis and consequently accumulates pseudogenes and gene losses. In this way, parasitic and myco-heterotrophic plants have provided excellent models for studying genome evolution under relaxed functional constraints.

Plastid genome content tends to be similar in lineages derived from a shared evolutionary transition to heterotrophy or, more generally, in parasites that share a relatively recent common ancestor. For example, the plastomes of the holoparasite Epifagus and the partially characterized Conopholis have many shared characteristics because of their common ancestry: they are sister genera in the Orobanchaceae that most likely had a nonphotosynthetic ancestor (dePamphilis and Palmer 1990, Colwell 1994, Young and dePamphilis 2005). However, plastome evolution in different parasitic lineages has exhibited both similarities and much diversity. The plastomes of the two parasites, Epifagus and Cuscuta, share certain characteristics, such as genome reduction, extensive gene losses, accumulation of pseudogenes, and an overall increase in substitution rates (Wolfe, Morden et al. 1992, Funk, Berg et al. 2007, McNeal, Kuehl et al. 2007). Both Epifagus and Cuscuta species have undergone substantial downsizing of their plastid genomes with significant gene losses. Accelerated evolution of the translational apparatus is observed in the remaining genes in Epifagus, in comparison to Nicotiana (Wolfe, Katz-Downie et al. 1992, Wolfe, Morden et al. 1992). Four species of Cuscuta have been sequenced: C. reflexa, C. exaltata, C. gronovii and C. obtusiflora (Funk, Berg et al. 2007, McNeal, Kuehl et al. 2007).

Sequenced Cuscuta members have exhibited a gradient of photosynthetic abilities. Cuscuta reflexa and C. exaltata still possess , and studies have shown photosynthetic carbon fixation in C. reflexa (van der Kooij, Krause et al. 2000). On the other hand, Cuscuta gronovii has more restricted photosynthetic activity than C. reflexa (van der Kooij, Krause et al. 2000), and C. obtusiflora exhibits green pigmentation only in certain structures such as and fruits, and is presumed to be photosynthetic (Barkman, McNeal et al. 2007). In C. obtusiflora, nonsynonymous rates and synonymous rates are also greatly accelerated in the remaining genes, relative to other photosynthetic plants.

Despite all the similarities, plastome evolution is distinct for Epifagus and Cuscuta in several respects. First, the detailed pattern of gene loss is very different. While Epifagus has lost all of its photosynthetic and RNA polymerase genes, the plastomes of Cuscuta still retain the photosynthetic and RNA polymerase genes, and those protein-coding genes are evolving under strong selective constraints (McNeal, Kuehl et al. 2007). Second, the plastomes of C. obtusiflora and C. gronovii lack pseudogene sequences (Funk, Berg et al. 2007, McNeal, Kuehl et al. 2007). In comparison, Epifagus retains a number of pseudogenes in its plastid genome. Additionally, Cuscuta has exhibited much greater condensation in the noncoding intergenic region (Funk, Berg et al. 2007, McNeal, Kuehl et al. 2007).

Whereas parasitic plants exploit their host plants directly through plant-plant haustorial connections (Kuijt, 1969), myco-heterotrophic plants exploit green plants indirectly through mycorrhizal fungi (Brundrett 2009). Recently, the plastid genome of the myco-heterotrophic underground orchid Rhizanthella gardneri has also been sequenced (Delannoy, Fujii et al. 2011).

Because of its subterranean lifestyle, this myco-heterotrophic underground orchid is nonphotosynthetic as well. Its plastid genome retains genes for only 20 proteins, 4 rRNAs and 9 tRNAs in a total size of 59,190 bp, making it one of the smallest plastid genomes sequenced to date (Delannoy, Fujii et al. 2011). A comparative study of mycoheterotrophic Ericaceae and their autotrophic relatives, based on a slot-blot Southern hybridization approach, also indicates extensive loss of genes related to photosynthetic function and plastid gene expression in the holo-mycoheterotrophic Ericaceae (Braukmann and Stefanovic, 2012).

In terms of molecular evolutionary studies, the most comprehensive study to date sampled three plastid genes (rbcL, matK, and rps2) from 38 taxa of Orobanchaceae and their close relatives, and demonstrated that many nonphotosynthetic members have experienced increases in both synonymous and nonsynonymous substitution rates. However, there appears to be a poor correlation between synonymous and nonsynonymous rates (Young and dePamphilis 2005). The lineage-specific pattern of synonymous rates has been shown to be very similar across genes in this study, while nonsynonymous rates exhibit both gene-specific and lineage-specific patterns (Young and dePamphilis 2005). This study also suggests that purifying selection is relaxed for all three genes studied in at least some of the parasites. Similarly, in Cuscuta, the plastid genes appear to be somewhat more relaxed, though those genes are still evolving under strong purifying selection (McNeal, Kuehl et al. 2007).

Wickett et al. sequenced and analyzed the plastome of Aneura, a myco-heterotrophic liverwort that exploits the mycorrhizal association between a basidiomycete and its host tree, such as pine or birch, and that is completely nonphotosynthetic. The plastid genome of Aneura is reduced to 108,007 bp (Wickett, Zhang et al. 2008). Gene deletions in Aneura are significantly less extensive than in Epifagus, and the deleted genes and pseudogenes in Aneura are a subset of those in Epifagus. The majority of lost or nonfunctional genes are involved in photosynthesis and chlororespiration. However, Aneura retains all of its plastid-encoded RNA polymerase genes, its ATP synthase genes, the RuBisCO large subunit gene rbcL, and a complete set of translation apparatus genes (Wickett, Zhang et al. 2008).

Evolutionary studies have also revealed relaxation in purifying selection in six out of eight pseudogenes and three of the six ORFs studied (Wickett, Zhang et al. 2008).

Here, we further investigate the patterns of plastome evolution in nonphotosynthetic plants by conducting comparative analysis for Epifagus and a second independent nonphotosynthetic angiosperm lineage. In this study, we have sequenced and annotated the complete plastid genome from the nonphotosynthetic holoparasite, Pholisma arenarium, and a representative from a very closely related photosynthetic lineage, Ehretia acuminata. To conduct comparative analysis for Epifagus, we also assembled and annotated the plastome of Mimulus guttatus, a close photosynthetic relative of Epifagus.

Pholisma is a genus of nonphotosynthetic parasitic plants from deserts in the southwestern United States and Mexico. Pholisma species parasitize a diversity of shrubs including members of the eudicot families Boraginaceae, Polygonaceae, and Asteraceae (Cothrun, 1969). Its morphology is highly reduced to a fleshy stem with scale-like leaves.

Pholisma is a member of the subfamily Lennooideae of family Boraginaceae, previously treated as family Lennoaceae, a small group of nonphotosynthetic holoparasites. Lennoaceae and Orobanchaceae are both members of the Euasterid I lineage of Asteridae, which is in turn a large monophyletic clade of angiosperms, comprising about one third of all flowering plants. The phylogenetic relationships of the major groups within Euasterids I remain unresolved even with the sequencing and detailed analysis of whole plastid genomes (Moore, Soltis et al. 2010).

Within the Euasterid I, the Gentianales, Lamiales, and Solanales form a strongly supported clade, along with Boraginaceae and Vahliaceae (Albach et al., 2001; Bremer et al., 2002; reviewed in Soltis et al., 2005; Moore MJ et al. 2010), although relationships within this clade are unclear.

Pholisma is the second holoparasitic angiosperm whose plastid genome has been fully sequenced. This study provides an excellent opportunity to investigate plastid genome evolution in independent lineages of holoparasites, and contrast both holoparasites with closely related photosynthetic taxa. We generated complete plastid genome sequences of close photosynthetic relatives of Pholisma and Epifagus, and a phylogenetic analysis based on 83 concatenated genes from these and a total of 68 land plant plastomes. We then performed a detailed comparison of plastid genome structure and evolution in the two nonphotosynthetic lineages. Shared and distinct patterns of plastid genome evolution under relaxed constraints were identified, revealing extreme parallelism in independent heterotrophic groups.

MATERIALS AND METHODS

Plastome sequencing

Fresh plant material of Pholisma was collected in Anza-Borrego Desert State Park, Arizona, snap-frozen in liquid nitrogen, and maintained in a -80 °C freezer for further work. A voucher specimen (RASmith 120) was also deposited at the Pennsylvania State University Herbarium. Total DNA was extracted from one gram of frozen tissue of Pholisma using a modified CTAB method (McNeal, Leebens-Mack et al. 2006). A partial fosmid genomic library was constructed from the isolated DNA using the CopyControl Fosmid Library Production Kit (Epicentre). The fosmid library was then screened for positive plastid clones, using labeled PCR products of rps2, rps12, rps7 and rpl16 amplified from the Pholisma DNA sample. Positive plastid clones were end-sequenced, and a minimally overlapping set of clones was selected for sequencing to obtain the complete plastid genome. The methods of DNA isolation, fosmid library construction, clone selection and preparation, and clone sequencing are discussed in detail in McNeal et al. (2006). Individual were assembled using Consed and Sequencher.

Fresh plant material of Ehretia acuminata was collected on the campus of the University of Florida in Gainesville, FL, USA; a voucher specimen (M. J. Moore 317) is housed in the herbarium of the Florida Museum of Natural History (FLAS). Purified chloroplast DNA for genome sequencing was isolated using sucrose gradient ultracentrifugation and was amplified via rolling circle amplification (RCA) following the protocols of Moore et al. (Moore, Dhingra et al. 2006). The RCA product was sequenced at the University of Florida using the Genome Sequencer 20 System (GS 20; 454 Life Sciences Corp., Branford, CT, USA) following the protocols in Moore et al. (Moore, Dhingra et al. 2006), with the exception that the sequencing run was conducted in a single region of a 70 × 75 mm PicoTiterPlate equipped with a four-region gasket. Gaps between the contigs derived from 454 sequence assembly were bridged by designing custom primers near the ends of the GS 20 contigs for PCR and conventional capillary-based sequencing. Several frame-shift errors in protein-coding sequence that were observed in the 454 sequence assembly were also corrected using custom PCR and sequencing.

The Mimulus guttatus genome was sequenced by the Joint Genome Institute (JGI, http://www.jgi.doe.gov/sequencing/why/3062.html) from a high generation inbred line. Whole genome shotgun sequencing was used. An initial assembly was performed to identify high copy contigs corresponding to the plastid genome, and then contigs were further assembled in CONSED (Gordon, Abajian et al. 1998, Gordon 2004).

For each new plastid genome sequence, the four IR/SC boundaries were tested by PCR amplification followed by Sanger sequencing. All three complete plastid genome sequences were annotated using DOGMA (Wyman, Jansen et al. 2004) and are available in GenBank as accession # XXXXX (Pholisma), YYYYY (Ehretia) and ZZZZZZ (Mimulus guttatus) (will be submitted).

Gene content was visualized using MultiPipMaker (Schwartz, Zhang et al. 2000) and compared between the nonphotosynthetic plants Pholisma and Epifagus and their respective photosynthetic relatives, Ehretia and Mimulus, as well as Nicotiana (Kunnimalaiyaan and Nielsen 1997) (GenBank accession number: NC_001879). A more distantly related Asterid, Panax ginseng (Araliaceae) (Kim and Lee 2004) (GenBank accession number: NC_006290) was used as the reference genome in this analysis. To illustrate the genome structure and gene order conservation in Pholisma, we used Mulan (Ovcharenko, Loots et al. 2005) to create dot plot comparisons between Pholisma and Ehretia.

Indel analysis in inverted repeats and pseudogenes

The regions containing the sequences of 9 genes (rpl2, rpl23, rps7, rrn23, rrn16, rrn5, rrn4.5, ycf2, ndhB) that lie in the slowly evolving inverted repeat region of 5 taxa (Pholisma, Ehretia, Epifagus, Mimulus, Nicotiana) were extracted and used for analysis of insertion and deletion (indel) events. The sequences were separated into genic, intergenic, and intron sequence components, and each component was aligned using MUSCLE v3.6 (Edgar 2004). The intergenic and intron sequence alignments were carefully examined to identify indels in light of the phylogenetic relationships of Pholisma, Ehretia, Epifagus and Mimulus (see below).

Nicotiana was used as an outgroup in this study. Histograms of the indel length were plotted in R 2.14.2.
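The tallying step behind such a histogram can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study; the function names and the toy sequences are hypothetical, and the sketch only counts gap runs between two aligned rows without using the outgroup to decide whether each event was an insertion or a deletion.

```python
import re
from collections import Counter


def gap_run_lengths(aligned_seq):
    """Return the length of every run of '-' characters in one aligned row."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]


def indel_length_histogram(aligned_a, aligned_b):
    """Tally indel lengths between two rows taken from the same alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Columns that are gaps in both rows carry no information about this
    # pair (they arise when the rows come from a larger alignment).
    cols = [(a, b) for a, b in zip(aligned_a, aligned_b)
            if not (a == "-" and b == "-")]
    row_a = "".join(a for a, _ in cols)
    row_b = "".join(b for _, b in cols)
    return Counter(gap_run_lengths(row_a) + gap_run_lengths(row_b))


# Toy example (made-up sequences): one 2 bp gap and one 3 bp gap.
hist = indel_length_histogram("ACGT--ACGTTTGA", "ACGTAAAC---TGA")
for length in sorted(hist):
    print(length, hist[length])
```

Polarizing each event as an insertion or a deletion would additionally require comparing the gap pattern with the outgroup row, which this sketch leaves out.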

Pseudogenes were identified by the presence of out of frame indels or stop codons, or truncated sequences where only a remnant of the coding sequences remained. Insertions and deletions were identified by aligning the Pholisma and Epifagus pseudogene sequences with those from photosynthetic relatives. Pseudogene sequences of parasites were aligned at the DNA level using MUSCLE v3.6 (Edgar 2004) and analyzed.
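The criteria above can be expressed as a simple screen. The following Python sketch is illustrative only: the function name, the `min_fraction` threshold of 0.8, and the example sequences are assumptions, and checking that the length is a multiple of three is a simplification of an alignment-based search for out-of-frame indels.

```python
# Stop codons of the standard genetic code, which plastid genes share.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def pseudogene_flags(cds, reference_length=None, min_fraction=0.8):
    """Return a list of reasons why a coding sequence looks nonfunctional.

    cds              -- candidate coding sequence (alignment gaps are ignored)
    reference_length -- length of the intact ortholog, if known
    min_fraction     -- hypothetical cutoff for calling a sequence truncated
    """
    seq = cds.upper().replace("-", "")
    flags = []
    if len(seq) % 3 != 0:
        flags.append("length not a multiple of 3 (possible frameshift)")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # A stop codon anywhere before the final codon interrupts the frame.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        flags.append("internal stop codon")
    if reference_length is not None and len(seq) < min_fraction * reference_length:
        flags.append("truncated relative to reference")
    return flags


print(pseudogene_flags("ATGAAATAA"))        # intact toy ORF
print(pseudogene_flags("ATGTAAAAATAA"))     # premature stop
print(pseudogene_flags("ATGAAAT"))          # frame broken
```

Any sequence flagged this way would still need to be checked by eye against the alignment with the photosynthetic relatives.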

Phylogenetic analysis

Plastid genome sequences of the selected 65 angiosperm and gymnosperm species not newly sequenced for the present study were downloaded from the Chloroplast Genome Database (Cui, Veeraraghavan et al. 2006). A total of 83 genes, including protein-coding genes, ycfs, and ribosomal RNAs, were extracted and aligned with those of the newly sequenced Pholisma, Ehretia and Mimulus using the AlignMate program (Muller and Wall 2006), followed by manual adjustment. Phylogenetic analyses were performed on the alignment using maximum parsimony and maximum likelihood.

Maximum parsimony analysis was performed in PAUP* 4.0b (Swofford 2002) using PRAP2 (Müller 2004) to construct batch files for running the parsimony ratchet with 200 ratchet cycles and 10 random addition supercycles. Maximum likelihood analysis was also performed with PAUP*, using the likelihood ratchet (Morrison 2007) as implemented in PRAP2. Bayesian analysis was performed in MrBayes v3.1.2.

A relaxed molecular clock analysis was performed using BEAST v1.4 (Drummond and Rambaut 2007). Constraints were imposed by dates from a phylogenetic dating study based on 111 asterid taxa (Moore MJ, Soltis PS et al. 2009).

Molecular evolutionary analysis

Molecular evolutionary analyses were focused on the plastomes of 14 sequenced members of Asteridae, which include the newly sequenced parasite and the photosynthetic sister groups in this study (Pholisma, Ehretia, and Mimulus), plus eleven other parasitic and nonparasitic taxa (Epifagus virginiana [parasite, NC_001568], Nicotiana sylvestris [NC_007500], Spinacia oleracea [NC_002202], Panax ginseng [NC_006290], Helianthus annuus [NC_007977], Lactuca sativa [NC_007578], Jasminum nudiflorum [NC_008407], Coffea arabica [NC_008535], Ipomoea purpurea [NC_009808], Solanum tuberosum [NC_008096], Atropa belladonna [NC_004561]). Maximum likelihood estimates of total evolutionary rate and of synonymous and nonsynonymous substitution rates were calculated in HYPHY 2.1 beta (Pond, Frost et al. 2005).

The MG94 model was used with parameters estimated locally. The ratio of nonsynonymous to synonymous rates (dN/dS = ω) was estimated for each gene on the nonphotosynthetic plant branch and its photosynthetic sister group branch while constraining the background ω.

Another set of analyses addressed potentially divergent evolutionary patterns for synonymous rates across genes, contrasting parasitic and non-parasitic lineages. Overall, synonymous rates have repeatedly been shown to be elevated across the plastid genomes of parasites, while the (non-)uniformity of the acceleration across the plastome has rarely been addressed. Specifically, we wanted to test whether there is a uniform synonymous rate as opposed to multiple synonymous rate classes, one for each gene. As genes in IR regions usually have lower substitution rates than the genes in the SC regions (Wolfe, Li et al. 1987), a model with two rate classes for the synonymous substitution rate across the whole genome was also tested.

Rate computation and hypothesis tests for this were set up in HyPhy Batch Language (HBL) files (batch files) for HYPHY, which were in turn compiled via a custom Perl script. For each taxon of interest, the script determined which IR genes and which SC genes are retained and, to allow later computation of a corrected Akaike information criterion score (AICc), how many alignment sites went into the final dataset filters applied in HYPHY.

HYPHY was then instructed to optimize pairs of constrained and unconstrained models by means of joint likelihood functions encompassing one pair of a dataset filter and a corresponding tree model per character set (i.e., gene or structural region). Resulting likelihood scores, parameter numbers, and numbers of alignment sites were used to compute AICc scores for each hypothesis, as well as likelihood ratio test (LRT) statistics and corresponding p-values.
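The two model-comparison quantities can be written out explicitly. The sketch below uses the standard definitions, AICc = 2k - 2 lnL + 2k(k + 1)/(n - k - 1) and LRT = 2(lnL_alt - lnL_null), and assumes the LRT statistic is compared with a chi-square distribution whose degrees of freedom equal the difference in parameter numbers. The function names and all numerical inputs are hypothetical and are not values from this study.

```python
import math


def aicc(log_likelihood, n_params, n_sites):
    """Corrected Akaike information criterion for a fitted model."""
    if n_sites - n_params - 1 <= 0:
        raise ValueError("AICc requires n_sites > n_params + 1")
    aic = 2 * n_params - 2 * log_likelihood
    return aic + (2 * n_params * (n_params + 1)) / (n_sites - n_params - 1)


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Closed form for even degrees of freedom.
        return math.exp(-half) * sum(half ** i / math.factorial(i)
                                     for i in range(df // 2))
    # Closed form for odd degrees of freedom.
    extra = sum(half ** (i - 0.5) / math.gamma(i + 0.5)
                for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * extra


def likelihood_ratio_test(lnl_null, k_null, lnl_alt, k_alt):
    """Return (statistic, degrees of freedom, p-value) for nested models."""
    df = k_alt - k_null
    if df < 1:
        raise ValueError("alternative model must have more parameters")
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, df, chi2_sf(stat, df)


# Hypothetical example: one synonymous rate versus separate IR and SC rates.
stat, df, p = likelihood_ratio_test(lnl_null=-1000.0, k_null=1,
                                    lnl_alt=-995.0, k_alt=2)
print(stat, df, p)
print(aicc(-1000.0, 1, 5000), aicc(-995.0, 2, 5000))
```

The model with the lower AICc is preferred, and a small LRT p-value indicates that the extra rate parameters improve the fit more than expected by chance.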

The hypotheses tested were whether a single synonymous rate is a better explanation of the data than individual rates for the IR and SC partitions, respectively, and whether such a two-rate model is a better explanation than assuming an individual rate per gene. This was tested for both pairs of parasitic and non-parasitic sister taxa, i.e., Epifagus / Mimulus and Pholisma / Ehretia.

The automatic generation of the corresponding HBL scripts guaranteed that only those (IR and SC) genes that are actually retained in the respective genome were incorporated into the likelihood functions, and that sites exclusively composed of missing data or alignment gaps were removed prior to analysis. HYPHY does not prohibit the inclusion of such sites or testing across genes when some of them are represented by missing data, but results will clearly be biased unless they are excluded.

Synonymous codon usage of all coding sequences for each plastome was calculated in GCUA (McInerney 1998) to determine whether changes in codon bias accompany the loss of plastid-encoded tRNAs in the nonphotosynthetic species Pholisma and Epifagus.
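One common summary of synonymous codon usage is relative synonymous codon usage (RSCU), the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The Python sketch below illustrates that calculation; it is not the GCUA implementation, the choice of RSCU as the measure is an assumption, and the function name and example sequence are hypothetical.

```python
from collections import Counter, defaultdict

# Standard genetic code laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def rscu(cds_list):
    """Relative synonymous codon usage pooled over in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1
    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        families[amino_acid].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid not observed; RSCU undefined
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values


# Toy example: four alanine codons, two of them GCT.
usage = rscu(["GCTGCTGCAGCG"])
print(usage["GCT"], usage["GCA"], usage["GCC"])
```

Comparing such values between a parasite and its photosynthetic relative, codon family by codon family, is one way to look for shifts in codon bias after tRNA loss.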


RESULTS

Plastid genome size and structure:

The plastid genomes of Pholisma, Ehretia, and Mimulus all have a typical quadripartite structure, in which a pair of identical inverted repeat (IR) sequences are separated by a large single copy region (LSC) and a small single copy region (SSC) (Figure 2-1, Suppl 1, Suppl 2).

Both the gene content and the order of genes are identical for the plastid genomes of Ehretia, Mimulus, and Nicotiana. The plastid genome of Ehretia is 156,554 bp, with a pair of IRs of 25,803 bp, an LSC region of 86,786 bp and an SSC region of 18,162 bp. The size of the plastid genome of Mimulus is 153,219 bp, including an LSC region of 84,296 bp, an SSC region of 17,907 bp and an IR region of 25,508 bp.

Compared with Ehretia, the size of the plastid genome of Pholisma is greatly reduced to 81,186 bp, with inverted repeats of 22,280 bp separated by an LSC of 30,167 bp and an SSC of 6,459 bp. All the genes are in the same order and transcribed in the same direction as in Ehretia and Nicotiana, except for what appear to be five small inversions (Figure 2-4, 2-5).
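Because a quadripartite plastome consists of one LSC, one SSC and two copies of the IR, the reported region sizes should sum to the reported genome size. The short Python check below uses only the sizes stated above; the dictionary layout and function name are illustrative choices.

```python
# Region sizes (bp) as reported in the text.
PLASTOMES = {
    "Ehretia":  {"LSC": 86786, "SSC": 18162, "IR": 25803, "total": 156554},
    "Mimulus":  {"LSC": 84296, "SSC": 17907, "IR": 25508, "total": 153219},
    "Pholisma": {"LSC": 30167, "SSC": 6459,  "IR": 22280, "total": 81186},
}


def quadripartite_length(lsc, ssc, ir):
    """Genome length implied by one LSC, one SSC and two IR copies."""
    return lsc + ssc + 2 * ir


for name, p in PLASTOMES.items():
    computed = quadripartite_length(p["LSC"], p["SSC"], p["IR"])
    print(name, computed, computed == p["total"])
```

For all three newly sequenced plastomes the computed length equals the reported total.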

The IR boundaries of both Ehretia and Mimulus are located at the 5’ portion of rps19 (IRb-LSC), between trnN-GUU and ndhF (IRb-SSC), between rpl2 and trnH-GUG (IRa-LSC), and at the 5’ portion of ycf1 (IRa-SSC). The IR boundaries of Pholisma are also conserved. The size of the IR region is 22,280 bp, which is slightly less than the 25,803 bp of the IR in Ehretia.

The IR boundaries of Pholisma are located at the 3’ portion of rps19 (IRb-LSC), between trnN-GUU and trnL-UAG (IRb-SSC), between rpl2 and trnH-GUG (IRa-LSC), and at the 5’ end of trnL-UAG (IRa-SSC).


Figure 2-1. Plastid genome of Pholisma arenarium. The genes shown inside the circle are transcribed clockwise; genes on the outside are transcribed counterclockwise. Structural components of the plastid genome are labeled in the inner circle as LSC, SSC, IRa, and IRb. Intron-containing genes are represented by “*”. Pseudogenes are notated with a Ψ. Genes are color coded by function, as shown at bottom.


Figure 2-2. Plastid genome of Ehretia acuminata. The genes shown inside the circle are transcribed clockwise; genes on the outside are transcribed counterclockwise. Structure components of the plastid genome are labeled in the inner circle as LSC, SSC, IRa, and IRb. Intron-containing genes are represented by “*”. Genes are color coded by function, as shown at bottom.


Figure 2-3. Plastid genome of Mimulus guttatus. The genes shown inside the circle are transcribed clockwise; genes on the outside are transcribed counterclockwise. Structure components of the plastid genome are labeled in the inner circle as LSC, SSC, IRa, and IRb. Intron-containing genes are represented by *. Genes are color coded by function, as shown at bottom.


Figure 2-4. Plastid genome structure and content comparison between Pholisma and Ehretia. Colored lines connect the genes from the two plastid genomes. The colors are coded as follows: green, presence of the gene in both genomes; yellow, inversions in the Pholisma genome; purple, pseudogene in the Pholisma genome. (Genes on IRa are not connected because they are identical to those on IRb.)

Table 2-1. Genome content of Pholisma and Epifagus compared with their green relatives Ehretia and Mimulus. Pseudogenes are indicated by Ψ. Highlighted genes indicate genes still retained in Pholisma, while lost in Epifagus.

| Functional category | Pholisma arenarium vs. Ehretia acuminata: Genes Present | Pholisma arenarium vs. Ehretia acuminata: Genes Absent or Pseudogenes | Epifagus virginiana vs. Nicotiana tabacum: Genes Present | Epifagus virginiana vs. Nicotiana tabacum: Genes Absent or Pseudogenes |
| --- | --- | --- | --- | --- |
| Photosynthesis | | | | |
| Photosystem I | psaI | psaA (Ψ), B (Ψ), C, J | | psaA, B, C, I, J |
| Photosystem II | | psbA, B, C, D, E, F, H, I, J, K (Ψ), L, M, N | | psbA (Ψ), B (Ψ), C, D, E, F, H, I, J, K, L, M, N |
| Cytochrome b6f | | petA, B, D, G, N (Ψ) | | petA, B, D, G |
| ATP synthase | | atpA (Ψ), B (Ψ), E, F, H, I (Ψ) | | atpA (Ψ), B (Ψ), E, F, H, I |
| Rubisco | rbcL | | | rbcL (Ψ) |
| Chlororespiration | | ndhA, B (Ψ), C, D, E, F, G, H, I, J, K | | ndhA, B (Ψ), C, D, E, F, G, H, I, J, K |
| Gene Expression | | | | |
| rRNA | 16S, 23S, 4.5S, 5S | | 16S, 23S, 4.5S, 5S | |
| Ribosomal proteins | rps2, 3, 4, 7, 8, 11, 12, 14, 15, 16, 18, 19; rpl2, 14, 16, 20, 22, 23, 32, 33, 36 | | rps2, 3, 4, 7, 8, 11, 12, 14, 18, 19; rpl2, 16, 20, 33, 36 | rps15, 16; rpl14 (Ψ), 22, 23 (Ψ), 32 |
| Transfer RNA genes | trnD-GUC, E-UUC, F-GAA, H-GUG, I-CAU, L-CAA, L-UAG, M-CAU, N-GUU, P-UGG, Q-UUG, R-ACG, S-UGA, S-GCU, W-CCA, Y-GUA, fM-CAU, A-UGC, C-GCA, G-GCC, G-UCC, I-GAU, K-UUU, L-UAA, R-UCU, S-GGA, T-GGU, T-UGU, V-GAC | trnV-UAC | trnD-GUC, E-UUC, F-GAA, H-GUG, I-CAU, L-CAA, L-UAG, M-CAU, N-GUU, P-UGG, Q-UUG, R-ACG, S-UGA, S-GCU, W-CCA, Y-GUA, fM-CAU | trnA-UGC (Ψ), C-GCA (Ψ), G-GCC, G-UCC, I-GAU (Ψ), K-UUU, L-UAA, R-UCU (Ψ), S-GGA (Ψ), T-GGU, T-UGU, V-GAC, V-UAC |
| RNA polymerase and maturase genes | matK | rpoA (Ψ), B (Ψ), C1, C2 | matK | rpoA (Ψ), B, C1, C2 |
| Initiation factor | infA | | infA | |
| Other protein genes | clpP, accD, ycf1, ycf2 | | clpP, accD, ycf1, ycf2, ycf15 | orf26, orf31, orf34, orf62, orf168, orf184, orf229, orf313 |

Gene content

There are 69 unique genes in the plastid genome of Pholisma (considering only one inverted repeat, Table 2-1), including 29 intact transfer RNA genes, 4 ribosomal RNA genes, 2 photosynthetic genes, 21 ribosomal protein genes, 9 genes coding for proteins of other known functions, and finally 4 genes coding for proteins of unknown function.

The plastid genome of Pholisma has lost all the RNA polymerase genes and all the photosynthetic and chlororespiratory genes, with the exception of rbcL and the very small (96 bp) PSI protein-coding gene, psaI (Table 2-1). Unlike Epifagus, Pholisma retains the full complement of plastid ribosomal protein genes. In contrast to the extensive loss of tRNA genes in Epifagus, Pholisma still retains nearly all the tRNA genes, with the exception of trnV-UAC.

As in the Epifagus plastome, extensive pseudogene content is observed in Pholisma.

The plastome of Pholisma includes 11 pseudogenes (atpA, atpB, atpI, rpoA, rpoB, psaA, psaB, psbB, psbK, ndhB, petN), which are genes mainly associated with photosynthetic processes, plus RNA polymerase (Table 2-1). More than half of the pseudogenes are common to both Pholisma and Epifagus, including psbA, psbB, rpoA, atpA, atpB and ndhB, while nearly all the rest of the pseudogenes in Pholisma have no homologs remaining in Epifagus. However, other pseudogenes in Epifagus that are not shared by Pholisma, such as psbA and rbcL, are still present as complete ORFs in Pholisma.

Pholisma has retained more pseudogene sequences than Epifagus. Most of the pseudogenes that are retained in Pholisma are truncated and remain only as remnants. Multiple deletions and insertions are found in each of the pseudogenes with the exception of petN. A 5 bp insertion of a exists in the petN, while the pseudogene sequence aligns perfectly well in other regions. All the shared pseudogenes appear to be much more reduced in Epifagus, with significantly more deletions or insertions than seen for the orthologous sequence in Pholisma. Furthermore, Epifagus and Pholisma share common large deletions in some pseudogenes. For example, Pholisma and Epifagus share a 540 bp deletion at the 5’ end of the atpA sequence and a 267 bp deletion at the 3’ end of the atpA sequence. But many independent deletions or insertions are evident in pseudogene sequences for Epifagus and Pholisma.

The photosynthetic gene psbA appears to be an exception. Epifagus retains remnants of psbA, but homologous psbA sequences cannot be detected in Pholisma.

Fig. 2-4 illustrates the genome reductions and the regions of gene loss, as well as the functional categories of the deleted genes in Pholisma. Gene losses in Pholisma are primarily concentrated in the LSC and SSC regions, and a majority of the deleted genes encode photosynthetic and chlororespiratory proteins. The LSC and SSC regions in Pholisma are truncated to only 30.2 and 6.5 kb, respectively, which is only about 34.8% and 35.6% of the size of their Ehretia counterparts. In comparison, the IR region is about 22.3 kb, about 86.3% of the size of the IR in Ehretia.

As with Epifagus, the reductions in the plastid genome of Pholisma are also primarily located in the LSC and SSC regions. Suppl. Table shows the percentage of unique sequences for each region. The percentages of unique sequences in Pholisma are only 37% in the LSC region and 8% in the SSC region, compared with 55% and 12%, respectively, in Ehretia. Similarly, in Epifagus, the percentages are 28% and 7%, compared with 55% and 12% in Mimulus. The LSC share of unique sequence is thus lower by 18 percentage points in Pholisma and by 27 percentage points in Epifagus than in their photosynthetic relatives. However, the IR regions are not much reduced. Conversely, the IR share of the unique sequence of the whole genome increases because of the significant reduction in the other regions.

Significant shrinkage in coding and intergenic regions is observed in both Pholisma and Epifagus. The length of intron sequences is reduced in both parasites. In Epifagus in particular, the length of intron sequences is reduced to only about 20% of that in Ehretia, so the percentage of intron sequences is also reduced in Epifagus. In contrast, the intron sequences in Pholisma were not as extensively reduced, and their proportion of the unique genome sequence actually increased.

The pattern of gene loss in Pholisma is also very similar to that of Epifagus. Both plastid genomes have lost all the RNA polymerase, photosynthetic, and chlororespiratory genes. The only exceptions are rbcL and psaI, which are still retained in Pholisma with complete open reading frames.

The evolutionary constraints on protein-coding genes were measured by the ratio of nonsynonymous to synonymous substitution rates (ω). The ω values and 95% confidence intervals for each parasite and the background taxa are presented in Figure 2-5. The majority of the remaining protein-coding genes are evolving under strong purifying selection, including the photosynthetic gene rbcL in Pholisma (ω = 0.174, confidence interval (0.107, 0.264)). However, psaI is evolving under relaxed constraint or may even reflect adaptive evolution in Pholisma (ω = 1.95, confidence interval (1.018, 3.490)). More transfer RNAs are retained in the Pholisma plastid genome than in Epifagus. Gene loss in Pholisma is less extensive than in Epifagus, and no genes that are lost in Pholisma are still retained in Epifagus. The only exception may be psbB; this gene is lost in Pholisma, but it still remains as a trace of pseudogene sequence in Epifagus.
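As an illustration of how an ω estimate and its confidence interval can be read, the following Python sketch classifies a gene by whether its interval lies entirely below, entirely above, or spans ω = 1. This is only a rule-of-thumb reading of the intervals reported here; the function name `classify_omega` and the category labels are hypothetical, and the formal tests in this study may have relied on likelihood ratio tests instead.

```python
def classify_omega(ci_lower, ci_upper):
    """Classify constraint from a confidence interval on omega (dN/dS).

    Hypothetical helper: the labels are illustrative, not the study's terms.
    """
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if ci_upper < 1.0:
        return "purifying"          # whole interval below 1
    if ci_lower > 1.0:
        return "above neutral"      # whole interval above 1
    return "not distinguishable from neutral"


# Intervals quoted in the text for Pholisma
print("rbcL:", classify_omega(0.107, 0.264))
print("psaI:", classify_omega(1.018, 3.490))
```

An interval that spans 1 does not by itself demonstrate neutrality; it only means that ω = 1 cannot be excluded at the chosen confidence level.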


Figure 2-5. Dot plot of the complete plastid genomes of Pholisma arenarium and Ehretia acuminata from Mulan analysis. Structural features are as follows: the main diagonal indicates alignment in the same orientation in both genomes; points along the negative slope indicate alignment in opposite orientations; the two major groups of points along the negative slope in the upper right corner represent the inverted repeats; and the smaller groups of points along the negative slope represent inversions in Pholisma.


Table 2-2. Relative synonymous codon usage (RSCU) of the coding sequences across the plastid genomes. This table includes only the RSCU of the codons that correspond to the deleted or pseudogenized transfer RNAs.

Amino acid   Codon   Epifagus   Mimulus
Leu          UUA     1.97       1.95
Ile          AUC     0.41       0.58
Val          GUC     0.49       0.47
Val          GUA     1.48       1.51
Ser          UCC     0.87       0.88
Thr          ACC     0.74       0.75
Thr          ACA     1.31       1.17
Ala          GCA     1.44       1.13
Lys          AAA     1.58       1.53
Cys          UGC     0.52       0.47
Arg          AGA     2.21       1.90
Gly          GGC     0.29       0.43
Gly          GGA     1.62       1.57

Amino acid   Codon   Pholisma   Ehretia
Val          GUA     1.53       1.53

Although one or more transfer RNA genes have been deleted from the plastid genomes of Pholisma and Epifagus, codon usage is biased in virtually the same way as in Ehretia and Mimulus (Table 2-2). There is no evidence that the loss of a specific transfer RNA gene is in any way related to a reduction in the frequency with which the corresponding codon is used in the plastid gene sequences.
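For reference, the RSCU of a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally, so a value of 1 indicates no bias. The Python sketch below illustrates the calculation. It is not the software used in this study: the function name `rscu` is hypothetical, and the sketch assumes the standard codon assignments and in-frame coding sequences.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with U, C, A, G at each position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(coding_sequences):
    """Return {codon: RSCU} pooled over in-frame coding sequences."""
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper().replace("T", "U")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:      # skips codons with ambiguous bases
                counts[codon] += 1

    # Group sense codons into synonymous families by amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    result = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue                       # amino acid absent from the input
        expected = total / len(codons)     # equal usage within the family
        for c in codons:
            result[c] = counts[c] / expected
    return result


# Toy input: three GUA and one GUC among the four valine codons.
values = rscu(["GTAGTAGTAGTC"])
print(values["GUA"], values["GUC"])
```

Comparing the RSCU of the codons read by a lost transfer RNA between a parasite and its photosynthetic relative, as in Table 2-2, then amounts to looking up the same codon in two such dictionaries.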

Fig. 2-6 (a) highlights the deleted and conserved regions in the genomes of the holoparasites Pholisma and Epifagus, compared with photosynthetic plants. The parallel losses of genes in the two plastomes are illustrated in Fig. 2-6 (b), including the ATP synthase genes atpF and atpH, the RNA polymerase genes rpoC1 and rpoC2, and the photosystem II genes psbM, psbD and psbC. Several genes deleted from Epifagus, such as atpI, are still retained as pseudogenes in Pholisma. Gene atpA is a pseudogene in both holoparasites, but it appears to be less diverged in Pholisma than in Epifagus, compared with Nicotiana and the reference plastome Panax.

Analysis of indels

Deletions and insertions have occurred throughout the genomes, but are most readily interpreted in the relatively slowly evolving IR regions. Indel lengths and analysis details are presented in Appendix C. The numbers of insertions and deletions of varying sizes are shown in Figure 2-7. Short indels are the most prevalent in both Epifagus and Pholisma, and small indels appear to be approximately normally distributed. Insertions and deletions are nearly “balanced” in size and frequency, especially for indels shorter than 20 bp. In contrast, the distribution of larger indel events in both parasites is strongly skewed to the left, with a long tail of large deletions. No large insertions were inferred for either Epifagus or Pholisma. Skewness and kurtosis of the indel length distributions were also measured to describe these asymmetric patterns. The skewness for Epifagus (-6.40) is larger in magnitude than that for Pholisma (-4.16). The kurtosis values for Epifagus (55.55) and Pholisma (22.22) are both positive, and that of Epifagus is much larger; both distributions are therefore leptokurtic, with a high peak around the mean.
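The two shape statistics can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the procedure used in this study: indel lengths are taken as signed values (deletions negative, insertions positive), population moment estimators are used, and kurtosis is reported as excess kurtosis (0 for a normal distribution). The estimator actually used for the values above is not specified in the text, and the function name is hypothetical.

```python
def skewness_and_excess_kurtosis(signed_lengths):
    """Moment-based skewness and excess kurtosis of signed indel lengths.

    Assumption: deletions are negative and insertions are positive.
    """
    n = len(signed_lengths)
    if n == 0:
        raise ValueError("at least one indel is required")
    mean = sum(signed_lengths) / n
    m2 = sum((x - mean) ** 2 for x in signed_lengths) / n
    m3 = sum((x - mean) ** 3 for x in signed_lengths) / n
    m4 = sum((x - mean) ** 4 for x in signed_lengths) / n
    if m2 == 0:
        raise ValueError("all indels have the same length")
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Toy data: balanced short indels plus one large deletion.
toy = [-200, -3, -2, -1, -1, 1, 1, 2, 3]
print(skewness_and_excess_kurtosis(toy))
```

With this sign convention, a long tail of large deletions gives negative skewness, and a sharp peak of short indels with rare extreme events gives large positive kurtosis, which is the pattern described for both parasites.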

Figure 2-6. MultiPipMaker analyses of 6 sequenced angiosperms using Panax ginseng as the reference genome, illustrating parallel losses in the holoparasites Pholisma and Epifagus.

a) The holoparasite Pholisma was compared with the holoparasite Epifagus, and their respective photosynthetic relatives, Ehretia and Mimulus, as well as tobacco and the reference genome Panax. Positions on the linearized plastid genome are shown across the bottom. Regions that align with the Panax genome are represented as vertical red (75-100% identity) and green (50-75% identity) bars.

b) Selected regions were extracted from a) to illustrate the parallel losses in the holoparasites Pholisma and Epifagus.

Figure 2-7. Histograms of indels in Pholisma, Epifagus, Ehretia and Mimulus (panels a and b). The x-axis indicates the length of indels and the y-axis indicates the number of events.


Genes for gene expression

Pholisma still retains a full complement of plastid-encoded ribosomal protein genes, and a complete set of the transfer RNA genes with the exception of trnV-UAC (Table 2-1). However, the plastome lacks intact genes for the plastid-encoded RNA polymerase. The molecular evolutionary analysis of the remaining genes suggests that the plastid genome is still evolving under purifying selection and therefore appears to encode functional proteins. The plastid genome of Epifagus is still transcribed, but at highly reduced levels in comparison to tobacco, a photosynthetic relative (Ems, Morden et al. 1995).

Pholisma and Epifagus are independently derived

Phylogenetic analyses were performed on an alignment of 91,505 bp resulting from 83 concatenated plastid genes from 68 taxa, using maximum likelihood, maximum parsimony, and Bayesian inference methods.
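Concatenating per-gene alignments into a single matrix can be sketched as below. This Python example is illustrative only and is not the pipeline used in this study: the function name `concatenate_alignments` is hypothetical, and it assumes that a taxon lacking a gene (for example, a gene deleted from a parasite plastome) is filled with gap characters for that partition.

```python
def concatenate_alignments(gene_alignments):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping taxon name to aligned sequence;
    within one gene, all sequences must have the same length.
    Assumption: taxa missing from a gene are padded with '-'.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must be equal in length")
        gene_length = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * gene_length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy example: the second gene is absent from Epifagus.
genes = [
    {"Pholisma": "ATGA", "Epifagus": "ATGC"},
    {"Pholisma": "GGT"},
]
print(concatenate_alignments(genes))
```

Every taxon ends up with a sequence of the same total length, which is what tree-building programs expect from a concatenated matrix.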

All the phylogenetic results consistently support the monophyly of the clade containing Pholisma, Ehretia, Epifagus, Mimulus, and Jasminum. However, the branching order of the major clades, including Boraginales, Lamiales, Solanales, and Gentianales, cannot be consistently resolved and tends to vary with taxon and character selection.

Phylogenetic analyses give 97% bootstrap support for the clade comprising Boraginales, including Pholisma and Ehretia, and Lamiales, which is represented by Jasminum, Mimulus and Epifagus. This clade is sister to Gentianales and Solanales with 65% bootstrap support.

The relationships among Boraginales (represented by Ehretia), Gentianales (represented by Coffea and Nerium), Lamiales (represented by Epifagus, Antirrhinum and Jasminum) and Solanales (represented by Atropa, Solanum, Nicotiana, Cuscuta and Ipomoea) were generally supported by ML bootstrap (BS) values of less than 55% in a recent study by Moore et al. (Moore, Soltis et al. 2010).

Molecular evolutionary analyses

Analysis of total substitution, synonymous and nonsynonymous substitution

Overall, genes remaining in both Epifagus and Pholisma display increased total substitution rates (Fig. 2-8(a)). The total substitution rate for each gene also shows the Epifagus genes are more accelerated and the substitution rates are generally 2.1x – 10.8x faster than for its photosynthetic relative Mimulus (Fig. 2-8). Genes in Pholisma are generally 2.3x – 19.5x faster than genes in nonparasites.

Synonymous rates are accelerated in both parasites. In general, synonymous rates in

Epifagus are 2.1x – 8.5x faster and rates in Pholisma are generally 1.1x – 22.5x faster compared to nonparasites. Significant increases are observed in 15 (out of 21) protein-coding genes in

Epifagus (Fig. 2-9) and 7 (out of 28) protein-coding genes in Pholisma. Among genes with significantly increased synonymous rates, accD, matK, rps4, rps7, ycf1 and ycf2 are common in both parasites. In addition, rbcL and rpl14 appear to be significantly more accelerated in

Pholisma, while both of these genes have been deleted in Epifagus.

Nonsynonymous rates are also generally accelerated in the parasites. Nonsynonymous rates of the remaining genes are generally increased by a factor of 3.6 – 12.3 in Epifagus and by a factor of 1.5 – 13.3 in Pholisma. Significantly increased nonsynonymous rates have been observed in accD, clpP, rps14, rps18, rps2, rps3, rps4, rps8, ycf1 and ycf2 in both parasites. All the remaining genes in Epifagus exhibit significantly accelerated nonsynonymous rates except infA, matK, rpl2 and rpl33, whereas these four genes do show significantly increased nonsynonymous rates in Pholisma (Fig. 2-10).

Fig. 2-11 presents the comparison of the ω values of both holoparasites with the background ω. The majority of the genes retained in the parasites are still evolving under purifying selection. There is a poor correlation between the ω values of the two parasites. Three genes, matK, rps3 and rps4, exhibit significantly greater ω values in Epifagus, suggesting that they are evolving under weaker purifying selection in Epifagus than in photosynthetic plants. In Pholisma, rps2, psaI, and rps16 are significantly less constrained than in photosynthetic plants; both psaI and rps16 are deleted from the Epifagus plastid genome. Surprisingly, some genes appear to be evolving under stronger purifying selection in the parasites than in photosynthetic plants. For example, clpP has become significantly more constrained in Epifagus, and rps7 significantly more constrained in Pholisma. Because both genes have increased synonymous rates, the observation of stronger purifying selection (a lower ω) in the parasites does not mean that the proteins have evolved more slowly in the parasites.

Among the 63 deleted genes or pseudogenes in Epifagus, 7 genes still retain an intact open reading frame in Pholisma. Of these seven genes, rbcL and psaI encode proteins involved in photosynthesis in normal green plants. The other five genes, including rpl14, rpl32, rps15 and rps16, encode ribosomal proteins. psaI and rps16 appear to have evolved under significantly more relaxed constraint in Pholisma than in photosynthetic plants. Furthermore, the ω values for both genes are not significantly different from 1, which suggests that both genes are evolving neutrally.

Figure 2-8. The total substitution rates of the holoparasite (red) vs. its photosynthetic relative (green). a) The total substitution rates of the genes in Epifagus compared with genes in Mimulus. b) The total substitution rates of the genes in Pholisma compared with genes in Ehretia.

Figure 2-9. The synonymous rates of the holoparasite (red) vs. its photosynthetic relative (green). a) The synonymous rates of the genes in Epifagus compared with genes in Mimulus. b) The synonymous rates of the genes in Pholisma compared with genes in Ehretia. Genes that show significantly accelerated synonymous rates in parasites are indicated with “*”.

Figure 2-10. The nonsynonymous rates of the holoparasite (red) vs. its photosynthetic relative (green). a) The nonsynonymous rates of the genes in Epifagus compared with genes in Mimulus. b) The nonsynonymous rates of the genes in Pholisma compared with genes in Ehretia. Genes that show significantly accelerated nonsynonymous rates in parasites are indicated with “*”.

Figure 2-11. The Omega of the holoparasite (red) vs. background Omega (green). a) The Omega of the genes in Epifagus compared with background Omega. b) The Omega of the genes in Pholisma compared with background Omega. The 95% confidence interval of the Omega for the parasite is shown with error bars. Genes that exhibit significantly lower Omega in parasites are indicated with a blue “*”, and genes that exhibit significantly greater Omega in parasites are indicated with a gray “*”.

Analysis of one-rate, two-rate or multiple-rate model in parasite genomes.

Table 2-3 One-rate, two-rate or multiple-rate model.

A model with separate synonymous rates for the SC and IR regions fits the data significantly better than a uniform rate; this was found to be true for all four taxa, parasites and autotrophs alike. By AICc, the null model (a single rate across partitions on the lineage in question) was never more than 10^-08 times as probable as the alternative model (two rates) to minimize the information loss, and likelihood ratio tests rejected the null hypothesis with very high significance (p < 0.0001).

The situation is less clear for the comparison of a two-synonymous-rates model versus one-synonymous-rate-per-gene model on the autotrophic and parasitic sister branches.

For the autotrophs, the two-rates model cannot be rejected when the multi-rate model is the alternative, either by AICc (the SC/IR model is at least 3.5*10^29 times as probable as the alternative model to minimize information loss) or by LRT (p ~1.0). Interestingly, this failure to reject is much less pronounced for the parasitic Epifagus (in fact, 6.7*10^25 times less). Finally, for Pholisma, the two-rates model can be rejected in favor of a one-synonymous-rate-per-gene model (the former is 13% as probable to minimize the information loss as the latter; LRT p=0.0001).
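The two model-comparison criteria used above can be sketched in Python as follows. This is a generic illustration, not the code used in this study: the function names are hypothetical, the log-likelihoods and parameter counts in the example are made-up placeholders, and the chi-square tail probability is computed with a closed-form series that assumes integer degrees of freedom.

```python
import math


def aicc(log_likelihood, n_params, n_obs):
    """Akaike information criterion with small-sample correction."""
    aic = 2 * n_params - 2 * log_likelihood
    return aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)


def evidence_ratio(aicc_a, aicc_b):
    """How many times as probable model A is as model B to minimize information loss."""
    return math.exp((aicc_b - aicc_a) / 2)


def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution (integer df >= 1)."""
    if x <= 0:
        return 1.0
    y = x / 2
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= y / k
            total += term
        return math.exp(-y) * total
    total = math.erfc(math.sqrt(y))
    term = math.sqrt(y) * math.exp(-y) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= y / (k + 0.5)
    return total


def likelihood_ratio_test(lnl_null, lnl_alt, extra_params):
    """Return (test statistic, p-value) for nested models."""
    stat = 2 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, extra_params)


# Placeholder values only: one-rate (null) vs. two-rate (alternative) model.
stat, p = likelihood_ratio_test(lnl_null=-1050.0, lnl_alt=-1030.0, extra_params=1)
ratio = evidence_ratio(aicc(-1050.0, 1, 500), aicc(-1030.0, 2, 500))
print(stat, p, ratio)
```

An evidence ratio far below 1 for the null model, together with a small LRT p-value, corresponds to the situation described for the one-rate versus two-rate comparison; a ratio far above 1 for the simpler model corresponds to the autotroph result in the two-rate versus per-gene comparison.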

Table 2-3 shows that in both parasites and green plants (Epifagus, Pholisma, Mimulus, Ehretia), the single copy regions have significantly different evolutionary rates from the IR regions, which is congruent with earlier studies. Furthermore, in Pholisma each retained gene appears to be evolving at a different rate.

Analysis of pseudogene sequences.

Analysis of the pseudogene sequences suggests both common and different indels in parasites. Large, shared deletions are observed in the two parasites. Although identical, they are not seen in the photosynthetic sister taxa Mimulus or Ehretia, so it can be inferred that they have occurred independently in different lineages. Many distinct deletions or insertions are also frequently observed in each parasite species.

Relaxed clock analysis

The estimated divergence time of Pholisma and Ehretia is about 80 Mya, which suggests that the loss of photosynthesis occurred in the Pholisma lineage no more than 80 Mya. Similarly, the divergence time of Epifagus and Mimulus is estimated to be about 90 Mya, which indicates that the loss of photosynthesis could have occurred in the lineage leading to Epifagus approximately 90 Mya (Figure 2-12). Therefore, our phylogenetic dating results suggest that Epifagus may have become nonphotosynthetic ~10 million years earlier than Pholisma.

Figure 2-12. Evolutionary analysis and relaxed clock analysis. a) Tree branches are scaled by the total substitution rates (all genes concatenated). The analysis suggests that the total substitution rates in the nonphotosynthetic plants Pholisma and Epifagus are greatly accelerated relative to their respective photosynthetic relatives. b) Chronogram showing the divergence times between the parasites and their respective green relatives.

DISCUSSION

Plastome phylogeny supports rapid radiation of Euasterid I lineages and distinct origin of holoparasitism in Boraginales and Lamiales

Relationships among the major Euasterid I clades have long been difficult to resolve. In our study, 83 genes and an alignment of over 90,000 bp resolve nearly every node in a large-scale phylogeny of angiosperm plastid genomes. With the addition of four new plastomes to the alignment of 64 plastomes published by Jansen et al. (2007), the analysis supports a global angiosperm phylogeny consistent with other recent papers (Jansen et al. 2007).

In contrast, the relationships among the major groups of Euasterid I clades are at best weakly resolved. Since these relationships were unstable in the face of small changes in taxon sampling and character choice, we consider them to be unresolved. A much larger matrix assembled by Moore et al. shows that additional taxon sampling with 83 genes still does not resolve the relationships in Euasterid I (Moore MJ, Soltis PS et al. 2009). Because phylograms of alternative branching orders show extremely short internodes separating these major lineages, the absence of resolution appears to be a result of a very rapid radiation of the major lineages of Euasterid I (Moore MJ, Soltis PS et al. 2009).

Although the phylogeny of the major groups of Euasterid I does not resolve the branching order for the major orders, including Lamiales, Solanales, Boraginales, and Gentianales, what is clear from the phylogeny is that Pholisma and Epifagus represent independent parasitic lineages that have progressed to holoparasitism. This phylogeny also strongly supports the grouping of

Pholisma and its close relative Ehretia, Epifagus and its close relative Mimulus respectively, which provides the framework for the comparative evolutionary studies.

Striking similarity in plastome structure (reduction) and evolutionary dynamics (acceleration) in independent holoparasitic lineages

The observed composition and substitutional patterns of the plastid DNAs of Pholisma and Epifagus represent a striking case of independent evolution resulting in highly similar genomes not only on genomic scale, but also on the individual gene level. The plastomes of

Epifagus and Pholisma are among the smallest among all the sequenced parasitic plants to date, and both are characterized by extreme genome reduction and gene deletions, accumulation of pseudogenes, and accelerated evolution of the retained genes relative to their inferred ancestors with Mimulus and Ehretia, respectively.

The patterns of gene losses and pseudogenes formation are strikingly similar in the two holoparasites. First of all, gene losses in both parasites are concentrated in genes associated with the bioenergetic process of photosynthesis: all the photosynthetic and chlororespiratory genes have been deleted with the only exceptions that Pholisma still retains intact open reading frames for photosynthetic genes rbcL and psaI. In addition, all the RNA polymerase genes, as well as ndh genes have been functionally or physically lost. Similarly in both parasites, the greatest extent of deletion occurs in LSC and SSC, while the IR regions do not have much contraction.

Despite the significant reduction in size, both Pholisma and Epifagus have retained large amounts of dysfunctional pseudogene sequence. In contrast, Cuscuta lacks pseudogenes and has extensive deletions in both SC and IR regions.

Furthermore, the remaining protein coding genes in both nonphotosynthetic species have increased synonymous rates, nonsynonymous rates and total substitution rates, when compared to the respective photosynthetic sister group.

In addition, the patterns of indels in Epifagus and Pholisma also show striking similarities. Figure 2-7 presents the distribution of indel sizes, which is very similar in the two parasites: small deletions and insertions are much more frequent than larger ones in both distributions. Also, both distributions are strongly skewed to the left, which suggests that large deletions are more frequent than large insertions. For smaller indels, however, insertions and deletions seem to be quite balanced.

Similarities in certain features of plastomes with complete or partial loss of photosynthesis have been observed in various lineages. Plastome diminution, including deletion or degradation of photosynthetic genes has been found in apicomplexan parasites, nonphotosynthetic dinoflagellates, parasitic alga and liverworts, etc. (dePamphilis and Palmer

1990, Leake 1994, Wilson, Denny et al. 1996, Gockel, Hachtel et al. 2002, deKoning and Keeling

2006, Barkman, McNeal et al. 2007, Wickett, Zhang et al. 2008, Delannoy, Fujii et al. 2011).

Specifically in plants, eight parasitic or myco-heterotrophic species that have fully or partially lost photosynthetic ability have been sequenced to date: the nonphotosynthetic angiosperms Epifagus and Pholisma, the nonphotosynthetic myco-heterotrophic liverwort Aneura, a nonphotosynthetic myco-heterotrophic orchid, and four Cuscuta species that still retain at least minimal photosynthetic ability (dePamphilis and Palmer 1990, Funk, Berg et al. 2007, McNeal, Kuehl et al. 2007, Wickett, Zhang et al. 2008, Delannoy, Fujii et al. 2011). Additionally, many studies have investigated specific aspects of the plastomes of other parasites such as Conopholis and Orobanche (Wimpee, Wrobel et al. 1991, Wimpee, Morgan et al. 1992, Wimpee, Morgan et al. 1992, Colwell 1994, Lohan and Wolfe 1998, McNeal, Kuehl et al. 2009). The plastomes of the aforementioned parasites have shown similarities in reduction and gene deletions, although the pattern of gene loss varies between the nonphotosynthetic species and the photosynthetic Cuscuta species. In all of the sequenced nonphotosynthetic species (Epifagus, Pholisma, the liverwort Aneura and the orchid Rhizanthella), loss of photosynthetic and chlororespiratory genes has occurred. However, the gene deletions in Aneura are clearly much less extensive than in the other three nonphotosynthetic species; Aneura may have lost its photosynthetic ability very recently. In sum, although certain similarities in genome changes are observed in these lineages, the striking resemblance of the plastid genomes of Epifagus and Pholisma is without precedent.

Although reduction in plastomes is the focus in this study, it should be noted that nuclear genome reduction is also seen in some parasitic lineages. There appears to be a correlation between the genome degradation and the symbiotic relationship. This genome reduction is observed in many obligate intracellular parasites and symbionts (Gilson and McFadden 1996,

Gilson and McFadden 1997, Gil, Sabater-Munoz et al. 2002, van Ham, Kamerbeek et al. 2003,

Baumann 2005, Loftus, Anderson et al. 2005, Nakabachi, Yamashita et al. 2006, Kuwahara,

Yoshida et al. 2007, McCutcheon and Moran 2007, McCutcheon, McDonald et al. 2009). The bacterial lineages that established an obligate symbiotic relationship with insects have been intensely studied (Moran and Mira 2001, Ochman and Moran 2001, Wernegreen, Richardson et al. 2001, van Ham, Kamerbeek et al. 2003, Nakabachi, Yamashita et al. 2006, McCutcheon and

Moran 2007). Among those lineages, shared features were revealed, including genome reduction, biased GC content, increased substitution rates, and complementarity in biosynthetic abilities resulting from the host-dependent lifestyle. In addition, convergence in metabolic roles has been suggested: gene sets involved in similar amino acid biosynthetic processes, complementary to the biosynthetic pathways of their co-symbionts, were retained in insect symbionts (McCutcheon, McDonald et al. 2009).

There are many similarities between plant plastome reduction and bacterial systems associated with symbiotic lifestyle. But this study also provides an excellent opportunity to compare unambiguously orthologous sequences throughout the genomes of Epifagus and

Pholisma. The extreme resemblance of patterns of gene loss, gene retention and substitutional dynamics and selection across two independent lineages constitutes a remarkable case of parallel evolution.

Genome reduction appears to be a common evolutionary process occurring in many major groups of organisms. Extreme reduction in nuclear and organelle genomes with massive gene loss has been observed in bacteria, land plants, algae, protists, and fungi (Merhej and Raoult, 2011; Tachezy and Šmíd, 2007; Jedelský et al, 2010; Keeling et al, 2010; Koning and Keeling, 2006; Moran and Mira 2001; Ochman and Moran 2001; Wernegreen, Richardson, and Moran 2001; van Ham et al. 2003; Nakabachi et al. 2006; McCutcheon and Moran 2007). Our study provides valuable insights by investigating the genome evolution of parasites compared with each other and with their respective photosynthetic relatives.

The plastomes of holoparasites Pholisma and Epifagus are evolving under stringent, but distinct, functional constraints.

The plastome of Epifagus remains the smallest parasitic plant plastome sequenced to date. Besides all photosynthetic genes, Epifagus has functionally lost many genes involved in the gene expression apparatus, including nearly one third of the transfer RNA genes and half of the ribosomal protein genes. But the distribution of genes retained in Epifagus is heavily skewed, in that 38 out of the 42 genes retained in the Epifagus plastome encode components of the gene expression apparatus (Wolfe, Morden et al. 1992).

Despite the significant reduction of the gene repertoire, the plastome of Epifagus remains transcriptionally active (dePamphilis and Palmer 1990, Ems, Morden et al. 1995) and several lines of evidence are consistent with the reduced plastome of Epifagus still encoding functional proteins (Morden, Wolfe et al. 1991, Wolfe, Morden et al. 1992, dePamphilis, Young et al. 1997, Lohan and Wolfe 1998). Similarly in Pholisma, a nearly complete set of genes for the expression apparatus is retained in the plastome, except for the RNA polymerase genes and a single transfer RNA gene. The evolutionary analyses in this study also suggest rate increases in the plastome genes of Epifagus and Pholisma while functional constraints are maintained.

The majority of the plastid genes retained in Pholisma and Epifagus have significantly accelerated synonymous rates compared to their closest photosynthetic relatives. The synonymous rates generally increase by 2-8 fold in Epifagus and 2-22 fold in Pholisma. Several short genes show large proportional increases because of the small synonymous rates in the photosynthetic relatives. Although genes in the IR regions are generally evolving at much slower synonymous rates than genes in the single copy regions, these genes are accelerated by similar or even greater proportions relative to their counterparts in the photosynthetic relatives. Similarly, the synonymous rates in Epifagus are generally higher than those in Pholisma, but the proportional increases relative to the respective photosynthetic relatives are greater in Pholisma for most of the genes. It is worth mentioning, however, that rate heterogeneity is relatively large across euasterids.

In addition, although there is a universal increase in synonymous rates across the plastomes, the rate increases display wide variation among genes. This universal increase indicates that the whole plastome may be similarly affected, while the variation in the degree of increase reflects chance mutational events or different sets of processes affecting individual genes. Furthermore, both Epifagus and Pholisma show acceleration in nonsynonymous rates and total substitution rates. The majority of ribosomal protein genes remaining in Epifagus show similar proportions of acceleration in nonsynonymous rates. Elevated nonsynonymous and synonymous rates are also observed in many nonphotosynthetic and photosynthetic lineages in Orobanchaceae (Young and dePamphilis 2005).

Nearly all of the protein-coding plastid genes are evolving under stringent functional constraints even in these nonphotosynthetic plants. In contrast to the universal increase of substitution rates in Epifagus, the ratio of nonsynonymous to synonymous rates was not significantly affected in the majority of the genes in Epifagus, which suggests an increase in the underlying rates or a decrease in DNA repair efficiency. Surprisingly, although the plastomes of both nonphotosynthetic plants are still functionally constrained, Epifagus and Pholisma show distinct patterns of selective constraint on their retained plastid genes. Some of the retained genes have evolved under even more stringent constraint relative to photosynthetic plants, but most are less constrained.

In Epifagus, significant changes in the ratio of nonsynonymous to synonymous rates were observed for clpP, matK, rps3 and rps4, although the changes are in different directions. Gene clpP appears to be significantly more constrained, whereas the other three genes are significantly less constrained than in photosynthetic plants. Similarly in Pholisma, some genes are more constrained and other genes are less constrained. It is worth mentioning that psaI and rps16, both lost in Epifagus, are significantly relaxed and lack constraint in Pholisma, which is consistent with the possibility that psaI and rps16 are “on their way” to becoming pseudogenes or being deleted. As the only unconstrained photosynthetic gene retained in the plastome of Pholisma, psaI is a very short gene of only 96 bp, and its retention can be explained by its “escape” from deletion or mutation due to its small size. Lohan and Wolfe have shown that some of the tRNA genes remain intact in the plastome because they are small and so have not been hit by a deletion or mutation (Lohan and Wolfe 1998).

Furthermore, our one-rate, two-rate (SC/IR) and multi-rate analysis indicates that the SC and IR regions have significantly different rates in both parasites and their green relatives. Wolfe et al. estimated substitution rates based on pairwise comparisons of tobacco, spinach, soybean and Spirodela, and showed that the IR evolves at lower rates than the SC regions in both noncoding and protein-coding sequences. Our analysis also suggests that each gene in the Pholisma plastome is evolving at a significantly different rate. This is not observed in the other three plants.

In sum, the distinct patterns seen in these nonphotosynthetic plastomes raise the interesting possibility that the functional constraints affecting the plastid genome may vary among nonphotosynthetic lineages.

Synonymous rates clearly seem to differ between the SC and IR regions across all plants tested here. For green autotrophic plants, the additional assumption of an individual synonymous rate per gene (regardless of its location in either SC or IR) is not supported, in that it is rejected based on information theory and likelihood ratio tests. For the two parasitic plants included here and directly compared to their closest non-parasitic relatives, a separate synonymous rate per gene receives greater support (Epifagus, Pholisma) and is selected as a significantly better model via AICc and LRT (Pholisma).

As discussed, evolutionary analysis suggests the presence of selection pressure in the plastomes of nonphotosynthetic plants. In Epifagus, 38 out of the 42 genes retained are involved in transcription and translation, and only four genes with other functions, accD, clpP, ycf1 and ycf2, were retained in the plastome. It has been proposed that one or more of these four genes provide the raison d’être for the retention of plastomes in holoparasites (Krause 2008). In Pholisma, all four genes are also retained as intact copies in the plastome. In addition, all four genes are still evolving under stringent purifying selection in both holoparasites. In Epifagus, clpP is significantly more constrained, but no significant changes were identified in accD, ycf1, and ycf2 compared with photosynthetic plants. Retention of clpP has been observed in nearly all of the parasitic plants sequenced to date, including Epifagus, Cuscuta, and Aneura. ClpP is a protease, and it has been suggested to be essential for shoot development in tobacco (Kuroda and Maliga 2003).

In Pholisma, all four genes are still evolving under constraints similar to those in photosynthetic plants. Thus, the evolutionary results are consistent with the hypothesis that the four genes may play essential roles in nonphotosynthetic plants.

Similar to these four genes, the transcription-related gene matK is also found in both photosynthetic and nonphotosynthetic plants. matK encodes the plastid intron maturase that splices the group IIA introns in plastomes, and it is located within the intron of the transfer RNA gene trnK. Among all the sequenced land plants and streptophyte algae, the complete loss of matK has been observed only in the plastomes of C. gronovii and C. obtusiflora, where it is associated with the loss of 7 of the 8 group IIA introns that depend on the matK maturase (Funk, Berg et al. 2007, McNeal, Kuehl et al. 2007, McNeal, Kuehl et al. 2009). Instead of an intron-encoded gene, as seen in most other species, a free-standing matK gene is observed in Epifagus, C. reflexa, C. exaltata, Adiantum, and the streptophyte alga Zygnema circumcarinatum (Wolfe, Morden et al. 1992, Ems, Morden et al. 1995, Wolf, Rowe et al. 2003, Turmel, Otis et al. 2005, Funk, Berg et al. 2007, McNeal, Kuehl et al. 2007, McNeal, Kuehl et al. 2009). Evolutionary analysis also indicates that the constraint on matK has been significantly relaxed in Epifagus. Although matK is still retained in the majority of the sequenced parasitic plants, it could be deleted from the plastome following the loss of group IIA introns in parasites (McNeal et al. 2009). Unlike matK, none of the four genes discussed above has been found to be deleted in any parasitic plant sequenced to date, which is consistent with their proposed essential roles.

Taken together, the plastomes of the nonphotosynthetic plants Epifagus and Pholisma are still constrained, but the functional constraints seem to be distinct for the two holoparasites and their retained gene sets. Considering the loss of photosynthesis in both plants, the functional constraints on the plastid genome must extend beyond the bioenergetic processes of photosynthesis and chlororespiration, and the four nontranscriptional and nontranslational genes accD, clpP, ycf1, and ycf2 seem to be essential for plastome function in nonphotosynthetic plants.

Holoparasitism occurred more recently in Pholisma than in Epifagus

Divergence time estimates indicate that the split between Pholisma and Ehretia is approximately 10 my older than that between Epifagus and Mimulus; accordingly, the loss of photosynthesis in Pholisma may have happened up to 10 million years later than that in Epifagus.

This is congruent with the plastome changes in the two holoparasites. The pattern of plastome reduction and gene loss is clearly less extensive in Pholisma than in Epifagus. The genes lost in Pholisma are a subset of the genes lost in Epifagus (Table 1). Unlike Epifagus, where the translation apparatus is greatly reduced, Pholisma still retains a nearly intact translation apparatus. As for the photosynthesis-related genes, Pholisma still retains intact open reading frames for the photosynthetic genes rbcL and psaI. The evolutionary rates of the remaining genes in the Pholisma plastome are generally not as accelerated as those in Epifagus. Taken together, these observations support a more recent loss of photosynthesis in Pholisma than in Epifagus.

Parallel polymorphism of rbcL in independent holoparasitic lineages

Unlike the other photosynthetic genes, which are generally lost from the plastomes of nonphotosynthetic plants, rbcL shows polymorphic retention among nonphotosynthetic lineages.

Despite the many similarities shared by Epifagus and Pholisma, they differ in the retention of rbcL. The plastid gene rbcL encodes the large subunit of Rubisco, the most common enzyme on earth, which has a key role in photosynthesis. Although Pholisma has lost photosynthetic capacity and consequently lost nearly all of its photosynthetic genes, it still retains an intact copy of rbcL under stringent functional constraint. Epifagus retains only a remnant of the rbcL sequence. This is the first time that retention of rbcL has been observed together with the absence of nearly all other photosynthetic genes in parasitic plants. The only similar case occurs in the nonphotosynthetic euglenoid flagellate Astasia longa, where rbcL is the only photosynthetic gene retained in the plastome.

The changes in rbcL associated with the loss of photosynthetic capacity vary among lineages of nonphotosynthetic plants. Polymorphism of rbcL has also been observed within a single lineage, Orobanchaceae, where holoparasitism appears to have evolved independently five times: in the Hyobanche/Harveya clade and in the clades representing Orobanche, Alectra, Striga, and Lathraea (Wolfe and dePamphilis 1998). In this lineage, rbcL has been lost or degraded in some holoparasites but retained as an intact open reading frame in others (Wolfe and dePamphilis 1998). For example, Epifagus, Boschniakia, Hyobanche, and Orobanche retain truncated rbcL pseudogenes, and Conopholis has lost rbcL entirely (Colwell 1994, dePamphilis, Young et al. 1997, Wolfe and dePamphilis 1997). Conversely, intact open reading frames have been found in each of the five clades, for example in Alectra orobanchoides, Orobanche corymbosa, O. fasciculata, O. hederae, all but one species of Harveya, Lathraea clandestina, and Striga gesnerioides (Wolfe and dePamphilis 1998, Randle and Wolfe 2005). Furthermore, strong purifying selection, transcription, and expression of the rbcL copy have been observed in some of the lineages. For example, an rbcL transcript has been observed in the heterotrophic euglenoid Astasia longa. Rubisco activity has been detected in Lathraea clandestina, although this rbcL copy is not functionally constrained (Leebens-Mack and dePamphilis 2002). Moreover, rbcL in Lathraea is transcribed by a nucleus-encoded polymerase instead of a plastid-encoded polymerase (Lusson, Delavault et al. 1998).

Interestingly, a similar polymorphism of rbcL may exist in Lennoaceae. As discussed, the plastome of Pholisma retains a seemingly functional copy of rbcL. However, rbcL could not be detected by either Southern blot or PCR in Lennoa, a holoparasitic relative of Pholisma within Lennoaceae (Smith 2003).

This parallel polymorphism of the retention of rbcL in Orobanchaceae and Lennoaceae raises interesting questions such as why the retention of rbcL appears to be independent of the photosynthetic capacity of the plants even within the same lineage.

Taken together, the retention of rbcL in many holoparasitic lineages, its transcription and translation in some holoparasites, and the functional constraints suggested by evolutionary studies all imply a function other than photosynthesis. One hypothesis is suggested by the alternative function of rbcL in lipid biosynthesis in the green seeds of Brassica (Schwender, Goffman et al. 2004). This hypothesis is corroborated by the observation that accD, the gene that encodes a subunit of acetyl-CoA carboxylase and is involved in fatty acid biosynthesis, is also retained in all the parasite plastomes sequenced so far (Krause 2008).

Deterministic and stochastic evolution model of plastomes in parasitic plants

Genome contents were compared among the two nonphotosynthetic angiosperms, Pholisma and Epifagus, and the recently sequenced parasitic liverwort Aneura mirabilis. As mentioned above, the gene losses in Pholisma are a perfect subset of those observed in Epifagus.

In addition, plastid genome evolution in the parasitic liverwort Aneura shares many similarities with that in Epifagus and Pholisma, such as massive deletion of photosynthetic genes. Of the 25 genes lost from Aneura, 21 are associated with the bioenergetic function of photosynthesis, which suggests that plastid genome evolution may be similar in nonphotosynthetic angiosperms and the nonphotosynthetic liverwort (Wickett, Zhang et al. 2008). However, the gene losses in Aneura are far less extensive than those seen in Epifagus or Pholisma, and they appear to be a perfect subset of those found in Pholisma. Gene losses in Aneura are concentrated in photosynthetic and NADH dehydrogenase genes, whereas in Pholisma plastid-encoded RNA polymerase genes, tRNA genes, and more photosynthetic genes have additionally been deleted or have become pseudogenes. In Epifagus, all the photosynthetic genes, including rbcL, are lost, and translational genes and more transfer RNA genes are lost as well. This hierarchical order of gene losses in the three nonphotosynthetic species probably reflects the different lengths of time since each lineage lost photosynthesis.
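The nested ("perfect subset") relationship among the three sets of lost genes can be checked mechanically. The following is a minimal sketch that assumes each species is represented by a set of lost gene names. The labels used here (photo_1, ndh_1, and so on) are hypothetical placeholders, not the actual loss lists, and the function names are illustrative.

```python
def is_nested_chain(loss_sets):
    """Return True if every set in the list is a subset of the next one."""
    return all(a <= b for a, b in zip(loss_sets, loss_sets[1:]))

def stepwise_losses(loss_sets):
    """Return the genes newly lost at each step of a nested chain."""
    steps = [set(loss_sets[0])]
    for prev, cur in zip(loss_sets, loss_sets[1:]):
        steps.append(set(cur) - set(prev))
    return steps

# Placeholder gene labels for illustration only (NOT the real loss lists).
aneura_lost = {"photo_1", "ndh_1"}
pholisma_lost = aneura_lost | {"rpo_1", "trn_1", "photo_2"}
epifagus_lost = pholisma_lost | {"photo_3", "rp_1", "trn_2"}

chain = [aneura_lost, pholisma_lost, epifagus_lost]
nested = is_nested_chain(chain)
steps = stepwise_losses(chain)
```

Ordering the species from the least to the most extensive losses and reading off `steps` gives the additional losses at each stage, which is the hierarchy described in the paragraph above.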

Therefore, a common deterministic directionality of gene loss from the plastid genome may exist during the transition from autotrophy to heterotrophy in plants lacking photosynthetic ability. We thus propose a deterministic evolution model of the plastid genome for nonphotosynthetic plants after their transition from an autotrophic to a heterotrophic lifestyle. According to the gene loss pattern, photosynthetic genes and NADH dehydrogenase genes appear to be the first set of genes removed from the genome, followed by transcription genes and transfer RNAs. Translational genes are the last set of genes to be removed in the currently sequenced plastid genomes.

However, this pattern does not apply to plastome evolution in hemiparasites, where photosynthetic capacity is still retained. This distinction is clearly seen in the retention of a seemingly functional set of photosynthetic genes in Cuscuta plastomes (McNeal, Kuehl et al. 2007).

Is this convergence or parallelism?

“Convergent and parallel evolution both result in the independent evolution of the same feature in two unrelated sequences; the difference between the two lies in whether the similarity was acquired from the same (parallelism) or a different (convergence) ancestral state” (Page and Holmes 1998).

Many studies have referred to similarities in genome evolution within the same lineage as “parallel evolution” and to similarities between relatively diverged lineages as “convergent evolution.” However, the boundary between parallel evolution and convergent evolution is not always clear-cut. “Parallel evolution occurs when a feature evolves independently in closely related species, but how closely related they need be before it is parallelism rather than convergence is unclear and probably immaterial.” (Futuyma 1997)

In this case, the striking similarities in plastome evolution between the independent parasitic lineages Orobanchaceae and Lennoaceae do appear to represent parallel evolution of genome structure under the definition of Page and Holmes above, because the two lineages share the same ancestral plastid genome state and gene content. However, pseudogene sequence analysis shows that many independent deletions and insertions are present in the pseudogene alignments of Epifagus and Pholisma, although quite a few common deletions are also seen. The IR indel analysis likewise reveals many distinct indel events, despite the generally similar distribution of indels in Pholisma and Epifagus. Therefore, when these independent steps in plastome evolution are taken into account, the evolutionary process can also be explained as convergent evolution.
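The distinction between common and independent deletions in an alignment can be made concrete with a small sketch. It assumes that two pseudogene sequences have been aligned to the same reference, so that they have equal length and mark deletions with '-', and it treats a deletion as shared only when its start and end coordinates are identical in both sequences. The sequences and function names are hypothetical and serve only to illustrate the comparison; this is not the procedure used in the analysis above.

```python
def gap_runs(aligned_seq):
    """Return the set of half-open (start, end) intervals of '-' runs."""
    runs, start = set(), None
    for i, ch in enumerate(aligned_seq):
        if ch == "-" and start is None:
            start = i                      # a gap run begins
        elif ch != "-" and start is not None:
            runs.add((start, i))           # a gap run ends
            start = None
    if start is not None:
        runs.add((start, len(aligned_seq)))
    return runs

def compare_deletions(seq_a, seq_b):
    """Classify gap runs as shared or specific to one aligned sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    a, b = gap_runs(seq_a), gap_runs(seq_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}

# Toy aligned sequences for illustration only (not real pseudogene data).
toy_a = "ATG---CCGTA--GGT"
toy_b = "ATG---CC-TAAAGGT"
result = compare_deletions(toy_a, toy_b)
```

Requiring identical coordinates is a deliberately strict choice. Overlapping deletions with different boundaries are counted as independent events here, and a different criterion would change how many events appear to be shared.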

Therefore, we think the striking similarities in the plastomes of the nonphotosynthetic plants Epifagus and Pholisma could be explained by both parallel evolution and convergent evolution.

References:

Barkman, T., J. McNeal, S.-H. Lim, G. Coat, H. Croom, N. Young, and C. dePamphilis. 2007. Mitochondrial DNA suggests at least 11 origins of parasitism in angiosperms and reveals genomic chimerism in parasitic plants. BMC Evol Biol 7:248.

Baumann, P. 2005. Biology of bacteriocyte-associated endosymbionts of plant sap-sucking insects. Annu Rev Microbiol 59:155-189.

Braukmann, T., and S. Stefanović. 2012. Plastid genome evolution in mycoheterotrophic Ericaceae. Plant Mol Biol 79:5-20.

Bremer, K., E. M. Friis, and B. Bremer. 2004. Molecular phylogenetic dating of asterid flowering plants shows early Cretaceous diversification. Syst Biol 53:496-505.

Brundrett, M. C. 2009. Mycorrhizal associations and other means of nutrition of vascular plants: understanding the global diversity of host plants by resolving conflicting information and developing reliable means of diagnosis. Plant and Soil 320:37-77.

Colwell, A. E. L. 1994. Genome evolution in a non-photosynthetic plant. Washington University, St Louis, MO.

Cui, L., N. Veeraraghavan, A. Richter, K. Wall, R. K. Jansen, J. Leebens-Mack, I. Makalowska, and C. W. dePamphilis. 2006. ChloroplastDB: the Chloroplast Genome Database. Nucleic Acids Res 34:D692-696.

deKoning, A. P., and P. J. Keeling. 2006. The complete plastid genome sequence of the parasitic green alga Helicosporidium sp. is highly reduced and structured. BMC Biol 4.

Delannoy, E., S. Fujii, C. Colas des Francs-Small, M. Brundrett, and I. Small. 2011. Rampant gene loss in the underground orchid Rhizanthella gardneri highlights evolutionary constraints on plastid genomes. Mol Biol Evol 28:2077-2086.

Delavault, P. M., N. M. Russo, N. A. Lusson, and P. A. Thalouarn. 1996. Organization of the reduced plastid genome of Lathraea clandestina, an achlorophyllous parasitic plant. Physiologia Plantarum 96:674-682.

dePamphilis, C. W., and J. D. Palmer. 1990. Loss of photosynthetic and chlororespiratory genes from the plastid genome of a parasitic flowering plant. Nature 348:337-339.

dePamphilis, C. W., N. D. Young, and A. D. Wolfe. 1997. Evolution of plastid gene rps2 in a lineage of hemiparasitic and holoparasitic plants: many losses of photosynthesis and complex patterns of rate variation. Proc Natl Acad Sci USA 94:7367-7372.

Drummond, A. J., and A. Rambaut. 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol 7:214.

Edgar, R. C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792-1797.

Ems, S. C., C. W. Morden, C. K. Dixon, K. H. Wolfe, C. W. dePamphilis, and J. D. Palmer. 1995. Transcription, splicing and editing of plastid RNAs in the nonphotosynthetic plant Epifagus virginiana. Plant Mol Biol 29:721-733.

Funk, H. T., S. Berg, K. Krupinska, U. G. Maier, and K. Krause. 2007. Complete DNA sequences of the plastid genomes of two parasitic flowering plant species, Cuscuta reflexa and Cuscuta gronovii. BMC Plant Biol 7:45.

Futuyma, D. J. 1997. Evolutionary Biology. Sinauer Associates.

Gil, R., B. Sabater-Munoz, A. Latorre, F. J. Silva, and A. Moya. 2002. Extreme genome reduction in Buchnera spp.: Toward the minimal genome needed for symbiotic life. Proc Natl Acad Sci USA 99:4454-4458.

Gilson, P. R., and G. I. McFadden. 1997. Good things in small packages: the tiny genomes of chlorarachniophyte endosymbionts. Bioessays 19:167-173.

Gilson, P. R., and G. I. McFadden. 1996. The miniaturized nuclear genome of eukaryotic endosymbiont contains genes that overlap, genes that are cotranscribed, and the smallest known spliceosomal introns. Proc Natl Acad Sci USA 93:7737-7742.

Gockel, G., and W. Hachtel. 2000. Complete gene map of the plastid genome of the nonphotosynthetic euglenoid flagellate Astasia longa. Protist 151:347-351.

Gordon, D. 2004. Viewing and Editing Assembled Sequences Using Consed. John Wiley & Co., New York.

Gordon, D., C. Abajian, and P. Green. 1998. Consed: a graphical tool for sequence finishing. Genome Res 8:195-202.

Jedelský, P. L., et al. 2011. The minimal proteome in the reduced mitochondrion of the parasitic protist Giardia intestinalis. PLoS One 6:e17285.

Keeling, P. J. 2004. Reduction and compaction in the genome of the apicomplexan parasite Cryptosporidium parvum. Dev Cell 6:614-616.

Kim, K. J., and H. L. Lee. 2004. Complete chloroplast genome sequences from Korean ginseng (Panax schinseng Nees) and comparative analysis of sequence evolution among 17 vascular plants. DNA Res 11:247-261.

Krause, K. 2008. From chloroplasts to "cryptic" plastids: evolution of plastid genomes in parasitic plants. Curr Genet 54:111-121.

Kunnimalaiyaan, M., and B. L. Nielsen. 1997. Fine mapping of replication origins (ori A and ori B) in Nicotiana tabacum chloroplast DNA. Nucleic Acids Res 25:3681-3686.

Kuroda, H., and P. Maliga. 2003. The plastid clpP1 protease gene is essential for plant development. Nature 425:86-89.

Kuwahara, H., T. Yoshida, Y. Takaki, S. Shimamura, S. Nishi, M. Harada, K. Matsuyama, K. Takishita, M. Kawato, K. Uematsu, Y. Fujiwara, T. Sato, C. Kato, M. Kitagawa, I. Kato, and T. Maruyama. 2007. Reduced genome of the thioautotrophic intracellular symbiont in a deep-sea clam, Calyptogena okutanii. Curr Biol 17:881-886.

Leake, J. R. 1994. The biology of myco-heterotrophic (Saprophytic) Plants. New Phytol 127:171-216.

Leebens-Mack, J., and C. dePamphilis. 2002. Power analysis of tests for loss of selective constraint in cave crayfish and nonphotosynthetic plant lineages. Mol Biol Evol 19:1292-1302.

Loftus, B., I. Anderson, R. Davies, U. C. Alsmark, J. Samuelson, P. Amedeo, P. Roncaglia, M. Berriman, R. P. Hirt, B. J. Mann, T. Nozaki, B. Suh, M. Pop, M. Duchene, J. Ackers, E. Tannich, M. Leippe, M. Hofer, I. Bruchhaus, U. Willhoeft, A. Bhattacharya, T. Chillingworth, C. Churcher, Z. Hance, B. Harris, D. Harris, K. Jagels, S. Moule, K. Mungall, D. Ormond, R. Squares, S. Whitehead, M. A. Quail, E. Rabbinowitsch, H. Norbertczak, C. Price, Z. Wang, N. Guillen, C. Gilchrist, S. E. Stroup, S. Bhattacharya, A. Lohia, P. G. Foster, T. Sicheritz-Ponten, C. Weber, U. Singh, C. Mukherjee, N. M. El-Sayed, W. A. Petri, Jr., C. G. Clark, T. M. Embley, B. Barrell, C. M. Fraser, and N. Hall. 2005. The genome of the protist parasite Entamoeba histolytica. Nature 433:865-868.

Lohan, A. J., and K. H. Wolfe. 1998. A subset of conserved tRNA genes in plastid DNA of nongreen plants. Genetics 150:425-433.

Lusson, N. A., P. M. Delavault, and P. A. Thalouarn. 1998. The rbcL gene from the non-photosynthetic parasite Lathraea clandestina is not transcribed by a plastid-encoded RNA polymerase. Curr Genet 34:212-215.

Merhej, V., and D. Raoult. 2011. Rickettsial evolution in the light of comparative genomics. Biol Rev 86:379-405.

McCutcheon, J. P., B. R. McDonald, and N. A. Moran. 2009. Convergent evolution of metabolic roles in bacterial co-symbionts of insects. Proc Natl Acad Sci USA 106:15394-15399.

McCutcheon, J. P., and N. A. Moran. 2007. Parallel genomic evolution and metabolic interdependence in an ancient symbiosis. Proc Natl Acad Sci USA 104:19392-19397.

McInerney, J. O. 1998. GCUA (General Codon Usage Analysis). Bioinformatics 14:372-373.

McNeal, J., J. Kuehl, J. Boore, and C. dePamphilis. 2007a. Complete plastid genome sequences suggest strong selection for retention of photosynthetic genes in the parasitic plant genus Cuscuta. BMC Plant Biol 7:57.

McNeal, J. R., J. V. Kuehl, J. L. Boore, and C. W. dePamphilis. 2007b. Complete plastid genome sequences suggest strong selection for retention of photosynthetic genes in the parasitic plant genus Cuscuta. BMC Plant Biol 7:57.

McNeal, J. R., J. V. Kuehl, J. L. Boore, J. Leebens-Mack, and C. W. dePamphilis. 2009. Parallel loss of plastid introns and their maturase in the genus Cuscuta. PLoS One 4:e5982.

McNeal, J. R., J. H. Leebens-Mack, K. Arumuganathan, J. V. Kuehl, J. L. Boore, and C. W. dePamphilis. 2006. Using partial genomic fosmid libraries for sequencing complete organellar genomes. Biotechniques 41:69-73.

Moore, M. J., P. S. Soltis, et al. 2010. Phylogenetic analysis of 83 plastid genes further resolves the early diversification of eudicots. Proc Natl Acad Sci USA 107:4623-4628.

Moore, M. J., A. Dhingra, P. S. Soltis, R. Shaw, W. G. Farmerie, K. M. Folta, and D. E. Soltis. 2006. Rapid and accurate pyrosequencing of angiosperm plastid genomes. BMC Plant Biol 6:17.

Moran, N. A., and A. Mira. 2001. The process of genome shrinkage in the obligate symbiont Buchnera aphidicola. Genome Biol 2:RESEARCH0054.

Morden, C. W., K. H. Wolfe, C. W. dePamphilis, and J. D. Palmer. 1991. Plastid translation and transcription genes in a non-photosynthetic plant: intact, missing and pseudo genes. EMBO J 10:3281-3288.

Morrison, D. A. 2007. Increasing the efficiency of searches for the maximum likelihood tree in a phylogenetic analysis of up to 150 nucleotide sequences. Syst Biol 56:988-1010.

Müller, K. F. 2004. PRAP - computation of Bremer support for large data sets. Mol Phylogenet Evol 31:780-782.

Müller, K. F., and P. K. Wall. 2006. AlignMate - A Perl pipeline for comparative analysis of organellar genes and genomes.

Nakabachi, A., A. Yamashita, H. Toh, H. Ishikawa, H. E. Dunbar, N. A. Moran, and M. Hattori. 2006. The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science 314:267.

Ochman, H., and N. A. Moran. 2001. Genes lost and genes found: evolution of bacterial pathogenesis and symbiosis. Science 292:1096-1099.

Ovcharenko, I., G. G. Loots, B. M. Giardine, M. Hou, J. Ma, R. C. Hardison, L. Stubbs, and W. Miller. 2005. Mulan: multiple-sequence local alignment and visualization for studying function and evolution. Genome Res 15:184-194.

Page, R. D. M., and E. C. Holmes. 1998. Molecular evolution: a phylogenetic approach. Wiley-Blackwell.

Pond, S. L., S. D. Frost, and S. V. Muse. 2005. HyPhy: hypothesis testing using phylogenies. Bioinformatics 21:676-679.

Randle, C. P., and A. D. Wolfe. 2005. The evolution and expression of rbcL in holoparasitic sister-genera Harveya and Hyobanche (Orobanchaceae). Amer J Bot 92:1575-1585.

Raubeson, L., and R. Jansen. 2005. Chloroplast genomes of plants. Pp. 45-68 in R. J. Henry, ed. Plant diversity and evolution: genotypic and phenotypic variation in higher plants. CAB International, Wallingford (UK).

Schwartz, S., Z. Zhang, K. A. Frazer, A. Smit, C. Riemer, J. Bouck, R. Gibbs, R. Hardison, and W. Miller. 2000. PipMaker--a web server for aligning two genomic DNA sequences. Genome Res 10:577-586.

Schwender, J., F. Goffman, J. B. Ohlrogge, and Y. Shachar-Hill. 2004. Rubisco without the Calvin cycle improves the carbon efficiency of developing green seeds. Nature 432:779-782.

Smith, A. 2003. The Systematics and Molecular Evolution of Lennoaceae. Diss. Vanderbilt University.

Swofford, D. L. 2002. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods), Version 4.0b10.

Tachezy, J., and O. Šmíd. 2008. Mitosomes in parasitic protists. In: Tachezy, J. (ed.), Hydrogenosomes and Mitosomes: Mitochondria of Anaerobic Eukaryotes, Microbial Monographs Vol. 9, Berlin, Heidelberg, Springer-Verlag, pp. 201-230. ISBN: 978-3-540-76732-9.

Turmel, M., C. Otis, and C. Lemieux. 2005. The complete chloroplast DNA sequences of the charophycean green algae Staurastrum and Zygnema reveal that the chloroplast genome underwent extensive changes during the evolution of the Zygnematales. BMC Biol 3:22.

van der Kooij, T. A., K. Krause, I. Dorr, and K. Krupinska. 2000. Molecular, functional and ultrastructural characterisation of plastids from six species of the parasitic flowering plant genus Cuscuta. Planta 210:701-707.

van Ham, R. C., J. Kamerbeek, C. Palacios, C. Rausell, F. Abascal, U. Bastolla, J. M. Fernandez, L. Jimenez, M. Postigo, F. J. Silva, J. Tamames, E. Viguera, A. Latorre, A. Valencia, F. Moran, and A. Moya. 2003. Reductive genome evolution in Buchnera aphidicola. Proc Natl Acad Sci USA 100:581-586.

Wernegreen, J. J., A. O. Richardson, and N. A. Moran. 2001. Parallel acceleration of evolutionary rates in symbiont genes underlying host nutrition. Mol Phylogenet Evol 19:479-485.

Wickett, N. J., Y. Zhang, S. K. Hansen, J. M. Roper, J. V. Kuehl, S. A. Plock, P. G. Wolf, C. W. dePamphilis, J. L. Boore, and B. Goffinet. 2008. Functional gene losses occur with minimal size reduction in the plastid genome of the parasitic liverwort Aneura mirabilis. Mol Biol Evol 25:393-401.

Wilson, R. J. M., P. W. Denny, P. R. Preiser, K. Rangachari, K. Roberts, A. Roy, A. Whyte, M. Strath, D. J. Moore, P. W. Moore, and D. H. Williamson. 1996. Complete plastid gene map of the plastid-like DNA of the malaria parasite Plasmodium falciparum. J Mol Biol 261:155-172.

Wimpee, C. F., R. Morgan, and R. Wrobel. 1992a. An aberrant plastid ribosomal RNA gene cluster in the root parasite Conopholis americana. Plant Mol Biol 18:275-285.

Wimpee, C. F., R. Morgan, and R. L. Wrobel. 1992b. Loss of transfer RNA genes from the plastid 16S-23S ribosomal RNA gene spacer in a parasitic plant. Curr Genet 21:417-422.

Wimpee, C. F., R. L. Wrobel, and D. K. Garvin. 1991. A divergent plastid genome in Conopholis americana, an achlorophyllous parasitic plant. Plant Mol Biol 17:161-166.

Wolf, P. G., C. A. Rowe, R. B. Sinclair, and M. Hasebe. 2003. Complete nucleotide sequence of the chloroplast genome from a leptosporangiate fern, Adiantum capillus-veneris L. DNA Res 10:59-65.

Wolfe, A. D., and C. W. dePamphilis. 1998. The effect of relaxed functional constraints on the photosynthetic gene rbcL in photosynthetic and nonphotosynthetic parasitic plants. Mol Biol Evol 15:1243-1258.

Wolfe, A. D., and C. W. dePamphilis. 1997. Alternate paths of evolution for the photosynthetic gene rbcL in four nonphotosynthetic species of Orobanche. Plant Mol Biol 33:965-977.

Wolfe, K. H., D. S. Katz-Downie, C. W. Morden, and J. D. Palmer. 1992a. Evolution of the plastid ribosomal RNA operon in a nongreen parasitic plant: accelerated sequence evolution, altered promoter structure, and tRNA pseudogenes. Plant Mol Biol 18:1037-1048.

Wolfe, K. H., W. H. Li, and P. M. Sharp. 1987. Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proc Natl Acad Sci USA 84:9054-9058.

Wolfe, K. H., C. W. Morden, S. C. Ems, and J. D. Palmer. 1992b. Rapid evolution of the plastid translational apparatus in a nonphotosynthetic plant: loss or accelerated sequence evolution of tRNA and ribosomal protein genes. J Mol Evol 35:304-317.

Wolfe, K. H., C. W. Morden, and J. D. Palmer. 1992. Function and evolution of a minimal plastid genome from a nonphotosynthetic parasitic plant. Proc Natl Acad Sci USA 89:10648-10652.

Wyman, S. K., R. K. Jansen, and J. L. Boore. 2004. Automatic annotation of organellar genomes with DOGMA. Bioinformatics 20:3252-3255.

Young, N. D., and C. W. dePamphilis. 2005. Rate variation in parasitic plants: correlated and uncorrelated patterns among plastid genes of different function. BMC Evol Biol 5.

Chapter 3

Plastid Genome Sequence and Analysis of Conopholis americana, a Holoparasitic Relative of Epifagus virginiana

Yan Zhang, Susann Wicke, Norman J. Wickett, Claude W. dePamphilis

Abstract:

Conopholis americana (squaw root) is a completely nonphotosynthetic holoparasitic plant in Orobanchaceae. Its plastid genome (plastome) was sequenced, assembled, annotated, and compared to that of Epifagus virginiana (beechdrops), a member of its sister genus. This represents the first comparative study of the complete plastome sequences of two related holoparasites that shared a nonphotosynthetic ancestor. The plastome of Conopholis is highly reduced, at 45,762 bp, and is the smallest angiosperm plastid genome sequenced to date. Structurally, this extremely compact plastid genome has entirely lost one copy of the inverted repeat (IR). However, no rearrangements have been observed in Conopholis, in contrast with some photosynthetic lineages that also lack the inverted repeat. As in Epifagus, the Conopholis plastome has lost all the photosynthetic genes and RNA polymerase genes, as well as some of the tRNA genes and ribosomal genes. Evolutionary analysis suggests that most protein genes retained in the Conopholis plastome are evolving under relaxed constraint.

Introduction

The plastid is the photosynthetic organelle of plants and algae, and all plants and algae characterized to date retain a plastid chromosome. Nearly all living organisms rely, directly or indirectly, on photosynthesis for energy. In addition, plastids play important roles in other essential biochemical processes, such as amino acid, starch, fatty acid, and pigment synthesis, and nitrogen and sulfate assimilation (Neuhaus and Emes 2000).

Plastids originated from ancestral photosynthetic cyanobacteria through endosymbiosis.

Plastid genomes still retain many features of their ancestral cyanobacterial genome, including a circular genome and prokaryotic gene expression (Park, Manen et al. 2008), but the majority of the genes from the endosymbiotic bacterium have either been deleted or transferred to the nuclear genome. A typical plastid genome in higher plants contains only about 120 genes, including photosynthetic genes, ribosomal genes, RNA polymerase genes, and other protein genes. Within land plants, the structure and contents of the plastid genome have been remarkably conserved.

The typical plastid genome in higher plants is characterized by a quadripartite structure with two identical inverted repeat (IR) sequences, which contain the ribosomal RNA and other genes and are separated by a large single copy (LSC) and a small single copy (SSC) region. This pair of identical inverted repeats is one distinguishing feature of most plastid genomes. It was suggested early in the history of comparative plastid genomics that the presence of the IR provides a stabilizing effect on genome organization (Palmer 1991; Palmer and Thompson 1982). Therefore, loss of the IR may lead to plastid genomic changes.

Although the IR has been observed in the plastomes of nearly all land plants, exceptions to this conservation have been observed in several lineages, including legumes and conifers. In the sequenced plastid genomes of these lineages (Kolodner and Tewari 1979; Koller and Delius 1980; Palmer and Thompson 1982; Lidholm et al. 1988; Palmer, Osorio, and Thompson 1988; Strauss et al. 1988; Lidholm and Gustafsson 1991; Perry and Wolfe 2002; Hirao et al. 2008), multiple genomic rearrangements, including inversions and gene or intron losses, have been observed. Loss of the IR has also been found in green algae (Tachezy, Šmíd et al. 2008). Among the parasitic angiosperms sequenced to date, Conopholis is the only one whose plastome is known to lack an IR.

We have also observed similar genome reduction and gene losses in the plastid genomes of other parasitic plants from independent lineages, such as Cuscuta species (Convolvulaceae), Aneura (Aneuraceae) (McNeal, Kuehl et al. 2007, Wickett, Zhang et al. 2008), and Pholisma (Lennoaceae; Chapter 2).

Parasitic plants can be categorized as hemiparasites and holoparasites, depending on their photosynthetic capability. Hemiparasites obtain some of their carbon and nutrients from their host plants, but they still retain the ability to photosynthesize. Holoparasites, in contrast, rely completely on their host plants for photosynthates, and thus become fully heterotrophic and nonphotosynthetic. As photosynthesis is the dominant metabolic function of the plastid, loss of photosynthesis represents a major metabolic functional shift in nonphotosynthetic plants. With the loss of the plastid’s primary function, relaxation of the constraints associated with photosynthesis is expected for parasitic plants. Therefore, the plastid genomes of the parasitic plants are expected to undergo massive alterations due to the relaxation of the evolutionary constraints on genes for photosynthesis and closely related processes (Park, Manen et al. 2008).

The holoparasitic angiosperms Epifagus virginiana (Orobanchaceae) and Pholisma arenarium (Lennoaceae) are such examples (see Chapter 2, this dissertation). Their plastid genomes are reduced to about half the size of normal plastid genomes. All the photosynthetic and bioenergetic genes have been deleted or have become pseudogenes due to indels or premature stop codons. Some of the ribosomal protein genes and transfer RNA genes have also been deleted or have become pseudogenes. These genomes offer an opportunity to study the significant genome changes following a dramatic functional shift. But Epifagus and Pholisma are the only sequenced holoparasitic angiosperms to date, and they represent independent holoparasitic lineages.

In addition, the plastid genomes of the myco-heterotrophic Aneura, a myco-heterotrophic underground orchid, the parasitic green alga Helicosporidium, the heterotrophic euglenid Euglena longa, and several apicomplexan parasites have been sequenced (Wickett et al. 2008; Koning and Keeling, 2006; Delannoy, et al., 2011). Their plastid genomes have all undergone massive changes and share similarities in their genome evolution. But there are also differences in the genome changes among these lineages.

Conopholis americana (squaw root, Orobanchaceae) is a nonphotosynthetic holoparasite that feeds on the roots of several species of Fagaceae (mainly Quercus (oak trees), but also Fagus (beech)). Conopholis is the sister genus of Epifagus, also a root holoparasite that completely lacks photosynthesis. Like Epifagus, Conopholis has a highly reduced plastid genome, estimated to be only about 42 kb, which is among the smallest plastid genomes known in any plant (Colwell 1994). As in Epifagus, evidence has shown that the plastome of Conopholis is transcribed and expressed (Wimpee, Wrobel et al. 1991, Colwell 1994). However, the structure of the plastid genome of Conopholis is distinct from that of other members of Orobanchaceae in that genome mapping and size estimations suggest that it has lost one copy of the large inverted repeat (IR) (Colwell 1994). Two identical copies of the IR are a typical feature of plastid genomes; they are observed in nearly all of the plants sequenced so far, with only a few exceptions, such as some conifers and legumes (Park, Manen et al. 2008). In legumes, loss of the IR has also been used as a phylogenetic marker to resolve the relationships among deep nodes, and it identified the monophyly of the "IR-lacking clade", or IRLC (Palmer and Thompson 1982). But generally, loss of the IR is rarely seen in plants.

Conserved noncoding sequences (CNSs) can provide valuable insights into the evolution of gene regulation. CNS functions include transcription factor binding sites, chromosome-level regulatory regions, enhancers, etc. Generally, plant CNSs are enriched in regulatory genes such as transcription factors, and other cis-acting binding sites. Compared with CNSs in , they are usually shorter and less conserved. Enormous research efforts have been directed to the identification of CNSs such as plant regulatory elements, enhancers and repressors in nuclear genomes (Kaplinsky, Braun et al. 2002, Lescot, Dehais et al. 2002, Freeling, Rapaka et al. 2007, Salvi, Sponza et al. 2007, Thomas, Rapaka et al. 2007). Salvi et al. identified a conserved noncoding sequence (vgt1) among maize, rice and sorghum that functions as a key cis-element associated with flowering time control. An upstream conserved noncoding sequence has been found to function in the regulation of SHOOT MERISTEMLESS (STM) expression in developing leaves (Uchida, Townsley et al. 2007).

Studies to date have focused on comparison of holoparasites from independent lineages (Wimpee, Wrobel et al. 1991, Wimpee, Morgan et al. 1992, dePamphilis 1995). Here we present the first comparative study of two closely related holoparasitic plants that share a nonphotosynthetic ancestor. This provides a distinct opportunity to study the divergence of the plastid genome after the complete loss of photosynthesis and to identify conserved noncoding sequences that are potentially involved in plastid gene expression and regulation.

We sequenced the plastid genome of Conopholis, a member of Orobanchaceae. Orobanchaceae (also known as the broomrape family) has the greatest range in parasitic specialization, including nonparasites, facultative and obligate hemiparasites, and holoparasites (Palmer, Osorio et al. 1987, Young, Steiner et al. 1999).

In this paper, we present the complete plastid genome of Conopholis, and comparative analysis of genome contents and structure with Epifagus and other holoparasites.

Materials and Methods

DNA extraction and Fosmid library

Fresh plant material of Conopholis americana was collected in State College, PA. Total DNA was extracted from one gram of frozen Conopholis tissue using a modified CTAB method (McNeal, Leebens-Mack et al. 2006). A partial fosmid genomic library was constructed from the isolated DNA using the CopyControl Fosmid Library Production Kit (Epicentre). The fosmid library was then screened for positive plastid clones using labeled PCR products of rps2, rps7, rps12, and rpl16 amplified from Conopholis DNA samples. Positive plastid clones were end-sequenced, and three clones were selected for sequencing. The methods of DNA isolation, fosmid library construction, clone selection and preparation, and clone sequencing were described in detail by McNeal et al. (2006).

Procedure for fosmid shotgun sequencing and sequence assembly

Fosmid DNA was extracted from 200 ml of overnight culture using the NucleoBond Xtra Midi Prep Kit (Macherey-Nagel) following the manufacturer's instructions. Approximately five micrograms of purified fosmid DNA was mixed in 1x shearing buffer (10 mM Tris, pH 8.3, + 10% glycerol) and nebulized for 2 minutes at 6 bar with pressurized air. DNA was precipitated by adding 1/10 volume (V) of 5 M NaCl and 2.5 V of absolute ethanol. After 30 minutes of incubation on ice, DNA was pelleted at 16,000 x g for 10 minutes at 4°C and washed twice with 70% ethanol. The DNA pellet was air-dried and resuspended completely in 100 µl of RNase-free water. DNA fragments were end-repaired using the NEBNext™ End Repair Module (New England Biolabs). DNA purification and size selection were carried out on a 1% agarose gel by excising fragments running between 2 and 3 kb. Purified and size-selected fragments were subsequently cloned into the pGEM-T Easy vector (Promega Inc.) and introduced into E. coli DH1-Alpha by electroporation at 2,500 V. Positive transformants were picked into 96-well plates and grown overnight in LB broth + 8% glycerol, supplemented with 100 µg of ampicillin. A total of 480 clones were chosen for bidirectional sequencing using the vector primers T7promoter and M13R (-20). DNA extraction and sequencing were performed at Macrogen Inc. (South Korea).

All sequences obtained were trimmed under high stringency conditions in SeqMan I (DNA Star, Lasergene). Before assembly, 50 bp were additionally clipped from both ends of each read to ensure complete removal of contaminating vector sequence. Sequence assembly was carried out with a minimum match percentage of 95%. The minimal sequence length for assembly was set to 75 bp, allowing a maximum of 25 gaps per 1000 bp in a contig sequence and a maximum of 20 gaps per kb in any sequence entering a contig. Word size (W = 12), gap penalties and gap length penalties were used at default values (gap penalty = 0.00; gap length penalty = 0.70).
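The clipping and minimum-length thresholds stated above can be illustrated with a short sketch. This is only an illustration of the two numbers given in the text (50 bp removed from each end, 75 bp minimum length); it is not SeqMan's trimming routine, and the helper name `clip_read` and the example reads are hypothetical.

```python
def clip_read(seq, clip=50, min_len=75):
    """Remove `clip` bases from both ends of a read.

    Returns the clipped sequence, or None when what remains is shorter
    than `min_len` and would therefore not enter the assembly.
    """
    trimmed = seq[clip:len(seq) - clip] if len(seq) > 2 * clip else ""
    return trimmed if len(trimmed) >= min_len else None


# Hypothetical reads: one long enough to survive clipping, one not.
reads = {"long_read": "A" * 600, "short_read": "C" * 160}
kept = {name: clip_read(seq) for name, seq in reads.items()}

for name, seq in kept.items():
    status = f"{len(seq)} bp kept" if seq is not None else "discarded"
    print(name, status)
```

Under these thresholds, any read shorter than 175 bp before clipping is dropped.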

Genome Annotation

The complete genome was annotated using DOGMA (Wyman, Jansen et al. 2004), followed by manual adjustment. The annotation was verified by alignment with other species.

Molecular evolutionary analysis

The ratio of nonsynonymous to synonymous substitution rates (dN/dS = ω) and its 95% confidence interval were estimated for each gene on the nonphotosynthetic plant branch while constraining the background ω in HYPHY 2.1 beta (Pond, Frost et al. 2005). Conopholis and Epifagus were compared to Striga and Mimulus.

Results

The plastome of Conopholis is the smallest plastome sequenced to date in land plants: 45,762 bp, compared with 70,028 bp for the plastome of Epifagus (Figure 3-1). This is somewhat smaller than the 47,294 bp of unique sequence in the plastome of Epifagus. Conopholis lacks the characteristic inverted repeat (IR) found in the plastomes of most land plants, which accounts for most of its reduction in size in comparison to Epifagus.
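The size comparison above can be checked with a few lines of arithmetic. The sketch below uses only the three lengths reported in the text and assumes the usual quadripartite layout, in which the total length counts the IR twice and the unique sequence counts it once; the variable names are illustrative.

```python
CONOPHOLIS_TOTAL = 45_762  # bp; only one IR copy, so total equals unique sequence
EPIFAGUS_TOTAL = 70_028    # bp; includes both IR copies
EPIFAGUS_UNIQUE = 47_294   # bp; IR counted once

# Assumption: total = LSC + SSC + 2*IR and unique = LSC + SSC + IR,
# so one IR copy is the difference between the two Epifagus figures.
epifagus_ir = EPIFAGUS_TOTAL - EPIFAGUS_UNIQUE

size_gap = EPIFAGUS_TOTAL - CONOPHOLIS_TOTAL     # overall size difference
unique_gap = EPIFAGUS_UNIQUE - CONOPHOLIS_TOTAL  # difference in unique sequence
ir_share = epifagus_ir / size_gap                # part of the gap due to the second IR copy

print(f"Implied Epifagus IR copy: {epifagus_ir:,} bp")
print(f"Size gap: {size_gap:,} bp, of which {ir_share:.1%} is one IR copy")
print(f"Difference in unique sequence: {unique_gap:,} bp")
```

On these figures, the second IR copy of Epifagus accounts for roughly 94% of the size difference, and the two genomes differ by only about 1.5 kb of unique sequence.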

The percentage of coding sequence in the nonrepeated sequence is about 56%, compared with about 50% in Mimulus and 44% in Epifagus. Thus, despite the massive gene deletions, the percentage of coding sequence in Conopholis is actually higher than in its photosynthetic relative Mimulus and its nonphotosynthetic sister group Epifagus.


Figure 3-1. Plastid genome of Conopholis americana. The genes shown inside the circle are transcribed clockwise; genes on the outside are transcribed counterclockwise. Structural components of the plastid genome are labeled in the inner circle as LSC, SSC, and one copy of the IR. Intron-containing genes are marked with *. Pseudogenes are marked with Ψ. Genes are color coded by function, as shown at bottom.

Table 3-1. Genome content of Conopholis and Epifagus compared with their green relative Mimulus guttatus. Pseudogenes are indicated by Ψ.

| Function | Conopholis americana vs. Mimulus guttatus: genes present | Conopholis americana vs. Mimulus guttatus: genes absent or pseudogenes | Epifagus virginiana vs. Mimulus guttatus: genes present | Epifagus virginiana vs. Mimulus guttatus: genes absent or pseudogenes |
|---|---|---|---|---|
| Photosynthesis | | | | |
| Photosystem I | | psaA, B, C, J, I | | psaA, B, C, I, J |
| Photosystem II | | psbA(Ψ), B(Ψ), C, D, E, F, H, I(Ψ), J, K, L, M, N | | psbA(Ψ), B(Ψ), C, D, E, F, H, I, J, K, L, M, N |
| Cytochrome b6f | | petA, B, D, G, N | | petA, B, D, G, N |
| ATP synthase | | atpA(Ψ), B(Ψ), E, F, H, I | | atpA(Ψ), B(Ψ), E, F, H, I |
| Rubisco | | rbcL | | rbcL(Ψ) |
| Chlororespiration | | ndhA, B(Ψ), C, D, E, F, G, H, I, J, K | | ndhA, B(Ψ), C, D, E, F, G, H, I, J, K |
| Gene Expression | | | | |
| rRNA | 16S, 23S, 4.5S, 5S | | 16S, 23S, 4.5S, 5S | |
| Ribosomal proteins | rps2, 3, 4, 7, 8, 11, 12, 14, 18, 19; rpl2, 14, 16, 20, 22, 33, 36 | rps15, 16(Ψ); rpl23(Ψ), 32 | rps2, 3, 4, 7, 8, 11, 12, 14, 18, 19; rpl2, 16, 20, 33, 36 | rps15, 16; rpl14(Ψ), 22, 23(Ψ), 32 |
| Transfer RNA genes | DGUC, EUUC, FGAA, HGUG, ICAU, LCAA, MCAU, NGUU, PUGG, QUUG, SUGA, SGCU, YGUA, fMCAU | AUGC, CGCA(Ψ), GGCC, GUCC, IGAU, KUUU, WCCA(Ψ), LUAG, RACG(Ψ), LUAA, TGGU, TUGU, CGCA, VUAC, RUCU, SGGA, VGAC | DGUC, EUUC, FGAA, HGUG, ICAU, LCAA, LUAG, MCAU, NGUU, PUGG, QUUG, RACG, SUGA, SGCU, WCCA, YGUA, fMCAU | AUGC(Ψ), CGCA(Ψ), GGCC, GUCC, IGAU(Ψ), KUUU, LUAA, RUCU(Ψ), SGGA(Ψ), TGGU, TUGU, VGAC, VUAC |
| RNA polymerase and maturase genes | matK | rpoA(Ψ), B, C1, C2 | matK | rpoA(Ψ), B, C1, C2 |
| Initiation factor | | infA(Ψ) | infA | |
| Other protein genes | clpP, accD, ycf1, ycf2, ycf15 | | clpP, accD, ycf1, ycf2 | orf26, orf31, orf34, orf62, orf168, orf184, orf229, orf313 |


Figure 3-2. Dot plots of the complete plastid genomes of Conopholis americana and Epifagus and their close green relative Mimulus guttatus from Mulan analysis. Structural features are as follows: the main diagonal indicates the alignment in the same orientation in both genomes; the points along the negative slope indicate alignment in opposite orientation in both genomes, the two major groups of points along the negative slope in the upper right corner represents the inverted repeats (the smaller groups of points along the negative slope represent inversions in Conopholis). The dot plots do not resemble typical dot plots of chloroplast genomes due to the absence of one of the inverted repeats.
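The way same-orientation and opposite-orientation matches fall on the diagonal and the anti-diagonal of a dot plot can be illustrated with a toy example. The sketch below is not the Mulan algorithm; it is a minimal exact k-mer matcher, and the function names, the word size `k=4`, and the 10 bp test sequence are all illustrative assumptions.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def dot_matches(a, b, k=4):
    """Return (forward, inverted) lists of (i, j) positions where the k-mer
    at a[i] matches b[j] directly, or matches it as a reverse complement."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    forward, inverted = [], []
    for i in range(len(a) - k + 1):
        word = a[i:i + k]
        forward += [(i, j) for j in index.get(word, [])]
        inverted += [(i, j) for j in index.get(revcomp(word), [])]
    return forward, inverted


unit = "AAACCGTGTA"  # toy sequence with no internal repeats at k = 4

# Same orientation: matches lie on the main diagonal (i == j).
same_fwd, same_inv = dot_matches(unit, unit)

# Opposite orientation: matches lie on the anti-diagonal (i + j constant),
# which is how an inverted segment or an IR copy appears in a dot plot.
flip_fwd, flip_inv = dot_matches(unit, revcomp(unit))

print("same orientation:", same_fwd)
print("inverted:", flip_inv)
```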


Figure 3-3. Multipipmaker genome coverage map including Conopholis, Epifagus and Mimulus. The results show a large number of shared deletions and a few independent deletions in the plastid genomes of Conopholis and Epifagus.


Figure 3-4. Multipipmaker plot details for Conopholis, Epifagus and Mimulus.

Similar to the plastomes of other nonphotosynthetic parasitic plants, the Conopholis plastome is characterized by a large number of gene deletions, pseudogenes, and changes in evolutionary dynamics (see Chapter 2, this dissertation). Like other plastomes in nonphotosynthetic lineages, Conopholis lacks all the photosynthetic genes, RNA polymerase genes and many of the ribosomal protein and transfer RNA genes (Table 3-1). Conopholis still retains intact copies of clpP, matK, and accD.

As a closely related nonphotosynthetic member of Orobanchaceae, Conopholis shows many close similarities with Epifagus in plastome content. Conopholis is only the second example, in addition to Epifagus, of an angiosperm plastome in which all the photosynthetic genes are physically or functionally lost. All the photosynthetic genes have been either lost or degraded in both parasites. An interesting example is the photosynthetic gene rbcL. Some parasitic lineages within Orobanchaceae still retain an intact copy of rbcL, but others have lost it or retain only a remnant (Lusson, Delavault et al. 1998). Conopholis has completely lost rbcL, and Epifagus retains only a small remnant of it. As for genes involved in gene expression, 4 ribosomal protein genes and 9 transfer RNA genes are deleted or degraded into pseudogenes. As in Epifagus, all the RNA polymerase genes are lost.

The ribosomal protein genes and transfer RNA genes lost in Conopholis are a perfect subset of those genes lost in Epifagus. In other words, none of the deleted ribosomal protein genes or transfer RNA genes in Conopholis are still retained in Epifagus. In contrast, infA gene is retained as a pseudogene in Conopholis while Epifagus still retains an intact copy.

Intact copies of matK, clpP, and accD are found, which is also the case for all the nonphotosynthetic angiosperms sequenced to date.
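The gene-content comparisons in this section can be expressed as set operations. The sketch below transcribes the protein-coding entries of Table 3-1 as read from the flattened table, so the lists themselves are an assumption (in particular, placing ycf15 among the genes present in Conopholis); the helper `expand` is hypothetical.

```python
def expand(prefix, numbers):
    """Build gene names such as rps2, rps3 from a prefix and a number string."""
    return {f"{prefix}{n}" for n in numbers.split()}


# Intact protein-coding genes, as read from Table 3-1 (assumed transcription).
conopholis_intact = (
    expand("rps", "2 3 4 7 8 11 12 14 18 19")
    | expand("rpl", "2 14 16 20 22 33 36")
    | {"matK", "clpP", "accD", "ycf1", "ycf2", "ycf15"}
)
epifagus_intact = (
    expand("rps", "2 3 4 7 8 11 12 14 18 19")
    | expand("rpl", "2 16 20 33 36")
    | {"matK", "infA", "clpP", "accD", "ycf1", "ycf2"}
)

# Ribosomal protein genes deleted or reduced to pseudogenes.
conopholis_lost_rp = {"rps15", "rps16", "rpl23", "rpl32"}
epifagus_lost_rp = {"rps15", "rps16", "rpl14", "rpl22", "rpl23", "rpl32"}

shared_intact = conopholis_intact & epifagus_intact
print("Shared intact protein-coding genes:", len(shared_intact))
print("Conopholis losses are a subset of Epifagus losses:",
      conopholis_lost_rp <= epifagus_lost_rp)
print("Intact only in Epifagus:", sorted(epifagus_intact - conopholis_intact))
print("Intact only in Conopholis:", sorted(conopholis_intact - epifagus_intact))
```

With this transcription, the intersection contains 20 genes, and infA is the only protein-coding gene intact in Epifagus but not in Conopholis.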

Several pseudogene sequences are found in the Conopholis plastome. Pseudogenes are identified for atpA, infA, psbA, psbI, rpoA, ndhB, rps16, rpl23, trnC-GCA, trnW-CCA and trnR-ACG. Overall, pseudogene sequences are less common in Conopholis than in Epifagus. Four pseudogenes are common to both parasites: psbA, rpoA, ndhB, and rpl23. But pseudogenes unique to Conopholis are also identified, such as atpA, psbI and rps16. All three of these protein-coding genes have been completely deleted from Epifagus.

Conopholis has completely lost the rbcL (Rubisco, large subunit) gene, whereas Epifagus still retains remnants of rbcL sequence in its plastome.

Whereas infA is still retained as an intact open reading frame in Epifagus, it is degraded as a pseudogene in the plastid genome of Conopholis. But in Conopholis, the degraded pseudogene still retains an open reading frame of 102 bp and the remnant sequences show great conservation with Epifagus gene sequences.

Having found a set of 20 shared, intact, protein-coding sequences in the plastomes of both holoparasites, we sought to determine whether there was evidence that these genes had evolved under functional constraint in these nonphotosynthetic lineages. In Epifagus, the ratio of nonsynonymous to synonymous rates (ω) is less than 1 in most genes, indicating purifying selection. Exceptions are matK and rpl20, for which ω is 1.17 (95% CI: 0.92-1.46) and 0.72 (95% CI: 0.43-1.11), respectively. In Conopholis, however, more genes have a rate ratio (ω) significantly greater than or close to 1, and most genes display a relaxation of selection relative to Epifagus.


Table 3-2. dN/dS ratio (ω) in Epifagus and Conopholis

| Gene | Epifagus ω | Epifagus 95% CI lower bound | Epifagus 95% CI upper bound | Conopholis ω | Conopholis 95% CI lower bound | Conopholis 95% CI upper bound |
|---|---|---|---|---|---|---|
| rps8 | 0.26 | 0.13 | 0.46 | 0.74 | 0.41 | 1.22 |
| clpP | 0.14 | 0.05 | 0.3 | 0.37 | 0.2 | 0.61 |
| rps19 | 0.34 | 0.13 | 0.7 | 0.55 | 0.24 | 1.03 |
| rps11 | 0.35 | 0.18 | 0.59 | 0.76 | 0.38 | 1.32 |
| rps7 | 0.21 | 0.08 | 0.43 | 0.31 | 0.14 | 0.59 |
| rps4 | 0.65 | 0.44 | 0.92 | 0.23 | 0.1 | 0.44 |
| rps14 | 0.38 | 0.17 | 0.71 | 0.92 | 0.31 | 2.03 |
| rps3 | 0.48 | 0.31 | 0.7 | 0.46 | 0.24 | 0.77 |
| rps12 | 0.30 | 0.08 | 0.69 | 0.57 | 0.2 | 1.21 |
| rps18 | 0.24 | 0.06 | 0.56 | 7.70 | 3.79 | 13.66 |
| rpl2 | 0.49 | 0.22 | 0.92 | 0.18 | 0.07 | 0.37 |
| rpl20 | 0.72 | 0.43 | 1.11 | 2.64 | 1.38 | 4.46 |
| matK | 1.17 | 0.92 | 1.46 | 0.83 | 0.57 | 1.16 |
| rpl36 | 0.13 | 0 | 0.62 | 0.41 | 0.07 | 1.31 |
| rpl16 | 0.31 | 0.18 | 0.51 | 0.14 | 0.03 | 0.36 |
| rpl33 | 0.11 | 0.02 | 0.33 | 0.4 | 0.15 | 0.82 |
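One simple way to read Table 3-2 is to ask, for each gene, whether the 95% confidence interval of ω lies entirely below 1, entirely above 1, or spans 1. The sketch below applies that screen to the values in the table (the repeated rps11 row is entered once). The three category labels and the function names are illustrative choices, not part of the HYPHY analysis.

```python
from collections import Counter

# gene: ((Epifagus lower, upper), (Conopholis lower, upper)), 95% CI from Table 3-2
CI = {
    "rps8":  ((0.13, 0.46), (0.41, 1.22)),
    "clpP":  ((0.05, 0.30), (0.20, 0.61)),
    "rps19": ((0.13, 0.70), (0.24, 1.03)),
    "rps11": ((0.18, 0.59), (0.38, 1.32)),
    "rps7":  ((0.08, 0.43), (0.14, 0.59)),
    "rps4":  ((0.44, 0.92), (0.10, 0.44)),
    "rps14": ((0.17, 0.71), (0.31, 2.03)),
    "rps3":  ((0.31, 0.70), (0.24, 0.77)),
    "rps12": ((0.08, 0.69), (0.20, 1.21)),
    "rps18": ((0.06, 0.56), (3.79, 13.66)),
    "rpl2":  ((0.22, 0.92), (0.07, 0.37)),
    "rpl20": ((0.43, 1.11), (1.38, 4.46)),
    "matK":  ((0.92, 1.46), (0.57, 1.16)),
    "rpl36": ((0.00, 0.62), (0.07, 1.31)),
    "rpl16": ((0.18, 0.51), (0.03, 0.36)),
    "rpl33": ((0.02, 0.33), (0.15, 0.82)),
}


def classify(lower, upper):
    """Place a confidence interval relative to omega = 1."""
    if upper < 1:
        return "below 1"     # consistent with purifying selection
    if lower > 1:
        return "above 1"     # omega significantly greater than 1
    return "includes 1"      # cannot be distinguished from omega = 1


def tally(species_index):
    """Count genes per category; 0 = Epifagus, 1 = Conopholis."""
    return Counter(classify(*bounds[species_index]) for bounds in CI.values())


epifagus_counts = tally(0)
conopholis_counts = tally(1)
print("Epifagus:  ", dict(epifagus_counts))
print("Conopholis:", dict(conopholis_counts))
```

By this screen, 14 of the 16 genes in Epifagus have intervals entirely below 1, against 7 of 16 in Conopholis, where rps18 and rpl20 have intervals entirely above 1.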


Discussion

The plastome of nonphotosynthetic Conopholis is reduced to only about 45 kb, making it the smallest angiosperm plastome sequenced to date. This extreme reduction in genome size results primarily from massive gene loss and the absence of one copy of the inverted repeat.

The plastome of Conopholis is much smaller than that of its nonphotosynthetic relative Epifagus, largely because of the loss of one IR copy. The length of unique sequence in Epifagus is about 47 kb, which is comparable with that of Conopholis.

The percentage of coding regions in Conopholis is about 56%, which is greater than that of Mimulus (50%) and Epifagus (44%). This suggests that the noncoding sequences and intron regions are more reduced in Conopholis.

Gene content reduction in Conopholis is very similar to that in Epifagus. Conopholis is the second case, after Epifagus, in which all the photosynthetic genes are physically or functionally lost. All the photosynthetic genes and RNA polymerase genes are deleted from the plastomes of both Conopholis and Epifagus. Some ribosomal protein genes are deleted. The ribosomal protein genes that are deleted or reduced to pseudogenes in Conopholis are a subset of those in Epifagus. But this pattern is not observed in transfer RNA genes. Although most of the lost transfer RNA genes are common to Epifagus and Conopholis, several deleted genes are distinct. This suggests that the photosynthetic genes, the RNA polymerase genes, and most of the lost transfer RNA genes were already lost prior to the divergence of Conopholis and Epifagus. Those gene losses probably occurred in their nonphotosynthetic ancestor.

Conopholis has lost the entire rbcL copy, but Epifagus still retains remnants of rbcL sequence in its plastome. This is the only protein-coding gene for which Epifagus still retains some remnant while Conopholis has already lost the entire copy.

Similarly, Epifagus still retains an intact copy of infA, the gene that encodes translation initiation factor 1, but Conopholis lacks an intact copy, and infA is degraded to a pseudogene in the Conopholis plastome. This difference between Epifagus and Conopholis indicates that the loss of infA likely occurred after the divergence of the two genera. In fact, parallel loss of infA has been observed in many independent lineages during angiosperm evolution (Millen, Olmstead, et al., 2001). Studies have revealed 24 independent losses of chloroplast infA among the 309 angiosperms examined by DNA sequencing and gel blot surveys (Millen, Olmstead, et al., 2001). Moreover, a transferred and expressed infA gene was found in the nucleus in four species in which the chloroplast infA is defunct (Millen, Olmstead, et al., 2001).

Within asterids, a pseudogene of infA was seen in all the 17 Solanaceae species examined in the study. Convolvulaceae, a sister group of Solanaceae, also lacks an intact infA, which indicates the loss of infA may have occurred before the divergence of the two lineages.

Orobanchaceae is distinct from these two groups in that infA is polymorphic: the holoparasite Epifagus retains an intact infA, but its close relative, the holoparasite Conopholis, has only a degraded pseudogene sequence. Evolutionary analysis has indicated that infA in Epifagus may be evolving under even more stringent constraint than in green plants.

Land plant plastomes are generally highly conserved in genome size, structure, and gene content. Although extensive rearrangements have been found in some lineages, such as Trachelium, Trifolium, and Geraniaceae, gene synteny is conserved in the majority of land plant lineages (Cai et al., 2008; Haberle et al., 2008; Guisinger et al., 2011).

A distinguishing feature of the plastid genome in land plants is the pair of large, identical inverted repeats, which comprise duplicated sequences in opposite orientations and are separated from each other by a large single-copy region and a small single-copy region (Palmer 1983). Retention of the duplicated IRs has been seen in the plastomes of nearly all land plants. The relative size of the IR has remained constant in angiosperms, but varies a great deal in gymnosperms.

It has been suggested that the retention of two completely identical IRs provides corrective properties that confer stability, which is supported by the observation that substitution rates in IR genes are generally much lower than those in genes in single-copy regions (Wolfe, Li et al. 1987, Birky 1989, Gaut 1998).

Whereas the retention of IRs is a remarkable feature of the plastomes in land plants, exceptions have been documented in a few lineages and the loss of an inverted repeat occurred in a number of conifers and legumes (Kolodner and Tewari 1979, Koller and Delius 1980, Palmer and Thompson 1982, Lidholm, Szmidt et al. 1988, Palmer, Osorio et al. 1988, Strauss, Palmer et al. 1988, Lidholm and Gustafsson 1991, Perry and Wolfe 2002, Hirao, Watanabe et al. 2008).

Conopholis is another case in which loss of an inverted repeat is observed in an independent lineage. The only other sequenced member of Orobanchaceae (Epifagus) retains both copies of the IR. Therefore, the loss of an IR in Conopholis may represent a unique event in Orobanchaceae (Colwell 1994).

Although these exceptions suggest that the retention of a second copy of the IR is not obligatory for plastid function, more frequent sequence rearrangements and accelerated evolution of the remaining IR copy have been observed after the loss of an IR (Palmer and Thompson 1982, Perry and Wolfe 2002). There are generally more rearrangements in legumes than in other angiosperms, but much more extensive sequence rearrangements are reported in those genomes that have lost one copy of the inverted repeat, such as pea and broad bean, compared to spinach, petunia, cucumber, etc. (Palmer and Thompson 1982). The chloroplast genomes of two conifers that lack a large inverted repeat, Douglas-fir and radiata pine, are also extensively rearranged (Strauss et al. 1988). All of these observations support the hypothesis that the presence of the inverted repeat stabilizes the chloroplast genome against major structural rearrangements (Palmer and Thompson 1982; Strauss et al. 1988).

However, the Conopholis plastome does not follow this pattern. No gene rearrangement is observed in its minimal plastome, and the gene order appears to be collinear with that of its green relative Mimulus, as shown in Figure 3-2. Thus, there may not be any causal relationship between loss of the inverted repeat and increased rearrangements in plastomes. Alternatively, the IR may have been lost so recently from Conopholis that any instability induced by IR loss cannot yet be seen.

Epifagus and Conopholis share many pseudogenes and deletions. All of the photosynthetic genes and polymerase genes are deleted or reduced to pseudogenes in both parasites. More ribosomal protein genes and transfer RNA genes are degraded in Epifagus than in Conopholis. But generally, similar numbers of pseudogenes are identified in the two parasites. Epifagus has retained 14 pseudogenes and Conopholis has 13 pseudogenes (Table 3-1).

Conopholis and Epifagus share many pseudogenes, including atpA, atpB, psbA, psbB, ndhB, and trnC-GCA, which suggests that those genes may have already been degraded before the divergence of Epifagus and Conopholis. On the other hand, some pseudogenes exist in only one of the parasites. For example, psbI and rps16 are already deleted from Epifagus, but remnant sequences are retained in Conopholis. For the photosynthetic gene rbcL, Epifagus still retains remnant sequences, but Conopholis has lost the entire locus. The degradation of these genes may also have started before the divergence of the two parasites.

Several genes that are pseudogenes in Epifagus, including rpl14, rpl23, and several transfer RNA genes, are still retained in Conopholis. On the other hand, genes that are intact in Epifagus, including infA, trnW-CCA and trnR-ACG, are already degraded into pseudogenes in Conopholis. In contrast to the shared pseudogenes, these distinct changes likely occurred after the divergence of the two holoparasites.

Evolutionary analysis indicates that fewer than half of the protein-coding genes retained in the Conopholis plastid genome are still evolving under stringent constraint. Most of the genes seem to be evolving under relaxed constraint. Compared with Epifagus, significantly more genes are evolving under relaxed constraint. These results suggest that while the Conopholis and Epifagus plastomes share many fundamental similarities, the two genomes may be beginning to diverge in subtle ways that could be consistent with at least some differences in plastome function between the two closely related holoparasite lineages. Transcriptomes from these two related species would be a valuable tool for gaining further clues to the evolution of plastome function in heterotrophic lineages.


References

Barkman, T., J. McNeal, et al. (2007). "Mitochondrial DNA suggests at least 11 origins of parasitism in angiosperms and reveals genomic chimerism in parasitic plants." BMC Evol Biol 7: 248.
Baumann, P. (2005). "Biology of bacteriocyte-associated endosymbionts of plant sap-sucking insects." Annu Rev Microbiol 59: 155-89.
Birky, C. W. (1989). "Organelle evolution." Genome 31: 1095-1097.
Brouard, J.-S., C. Otis, et al. (2010). "The exceptionally large chloroplast genome of the green alga Floydiella terrestris illuminates the evolutionary history of the Chlorophyceae." Genome Biol Evol 2: 240-256.
Brundrett, M. C. (2009). "Mycorrhizal associations and other means of nutrition of vascular plants: understanding the global diversity of host plants by resolving conflicting information and developing reliable means of diagnosis." Plant and Soil 320: 37-77.
Colwell, A. E. L. (1994). Genome evolution in a non-photosynthetic plant. St Louis, MO, Washington University. Ph.D.
Cui, L., N. Veeraraghavan, et al. (2006). "ChloroplastDB: the Chloroplast Genome Database." Nucleic Acids Res 34(Database issue): D692-6.
Cui, L., F. Yue, et al. (2006). "Inferring ancestral chloroplast genomes with inverted repeat." Proceedings of the 2006 International Conference on Bioinformatics and Computational Biology: 75-81.
deKoning, A. P. and P. J. Keeling (2006). "The complete plastid genome sequence of the parasitic green alga Helicosporidium sp. is highly reduced and structured." BMC Biol 4.
Delannoy, E., S. Fujii, et al. (2011). "Rampant gene loss in the underground orchid Rhizanthella gardneri highlights evolutionary constraints on plastid genomes." Mol Biol Evol 28: 2077-86.
dePamphilis, C. (1995). Genes and Genomes. London, Chapman & Hall.
dePamphilis, C. W. and J. D. Palmer (1990). "Loss of photosynthetic and chlororespiratory genes from the plastid genome of a parasitic flowering plant." Nature 348: 337-9.
dePamphilis, C. W., N. D. Young, et al. (1997). "Evolution of plastid gene rps2 in a lineage of hemiparasitic and holoparasitic plants: many losses of photosynthesis and complex patterns of rate variation." Proc Natl Acad Sci USA 94: 7367-7372.
Drummond, A. J. and A. Rambaut (2007). "BEAST: Bayesian evolutionary analysis by sampling trees." BMC Evol Biol 7: 214.
Dyall, S. D., M. T. Brown, et al. (2004). "Ancient invasions: from endosymbionts to organelles." Science 304: 253-7.
Edgar, R. C. (2004). "MUSCLE: multiple sequence alignment with high accuracy and high throughput." Nucleic Acids Res 32: 1792-7.
Ems, S. C., C. W. Morden, et al. (1995). "Transcription, splicing and editing of plastid RNAs in the nonphotosynthetic plant Epifagus virginiana." Plant Mol Biol 29: 721-33.
Freeling, M., L. Rapaka, et al. (2007). "G-boxes, bigfoot genes, and environmental response: characterization of intragenomic conserved noncoding sequences in Arabidopsis." 19: 1441-57.
Funk, H. T., S. Berg, et al. (2007). "Complete DNA sequences of the plastid genomes of two parasitic flowering plant species, Cuscuta reflexa and Cuscuta gronovii." BMC Plant Biol 7: 45.
Futuyma, D. J. (1997). Evolutionary Biology. Sinauer Associates.

Gao, L., Y.-J. Su, et al. (2010). "Plastid genome sequencing, comparative genomics, and phylogenomics: Current status and prospects." J Sys Evol 48: 77-93.
Gaut, B. S. (1998). "Molecular clocks and nucleotide substitution rates in higher plants." Evol Biol 30: 93-120.
Gil, R., B. Sabater-Munoz, et al. (2002). "Extreme genome reduction in Buchnera spp.: Toward the minimal genome needed for symbiotic life." Proc Natl Acad Sci U S A 99: 4454-4458.
Gilson, P. R. and G. I. McFadden (1996). "The miniaturized nuclear genome of eukaryotic endosymbiont contains genes that overlap, genes that are cotranscribed, and the smallest known spliceosomal introns." Proc Natl Acad Sci U S A 93: 7737-42.
Gilson, P. R. and G. I. McFadden (1997). "Good things in small packages: the tiny genomes of chlorarachniophyte endosymbionts." Bioessays 19: 167-73.
Gockel, G. and W. Hachtel (2000). "Complete gene map of the plastid genome of the nonphotosynthetic euglenoid flagellate Astasia longa." Protist 151: 347-51.
Gordon, D. (2004). Viewing and Editing Assembled Sequences Using Consed. New York, John Wiley & Co.
Gordon, D., C. Abajian, et al. (1998). "Consed: a graphical tool for sequence finishing." Genome Res 8: 195-202.
Greenman, C., P. Stephens, et al. (2007). "Patterns of somatic mutation in cancer genomes." Nature 446: 153-8.
Hirao, T., A. Watanabe, et al. (2008). "Complete nucleotide sequence of the Cryptomeria japonica D. Don. chloroplast genome and comparative chloroplast genomics: diversified genomic structure of coniferous species." BMC Plant Biol 8: 70.
Jansen, R. K., Z. Cai, et al. (2007). "Analysis of 81 genes from 64 plastid genomes resolves relationships in angiosperms and identifies genome-scale evolutionary patterns." Proc Natl Acad Sci U S A 104: 19369-19374.
Jansen, R. K., M. F. Wojciechowski, et al. (2008). "Complete plastid genome sequence of the chickpea (Cicer arietinum) and the phylogenetic distribution of rps12 and clpP intron losses among legumes (Leguminosae)." Mol Phylogenet Evol 48: 1204-17.
Kaplinsky, N. J., D. M. Braun, et al. (2002). "Utility and distribution of conserved noncoding sequences in the grasses." Proc Natl Acad Sci U S A 99: 6147-51.
Keeling, P. J. (2004). "Reduction and compaction in the genome of the apicomplexan parasite Cryptosporidium parvum." Dev Cell 6: 614-6.
Khan, A., I. A. Khan, et al. (2010). "Current trends in chloroplast genome research." Afr J Biotechnol 9: 3494-3500.
Kim, K. J. and H. L. Lee (2004). "Complete chloroplast genome sequences from Korean ginseng (Panax schinseng Nees) and comparative analysis of sequence evolution among 17 vascular plants." DNA Res 11: 247-61.
Koller, B. and H. Delius (1980). "Vicia faba chloroplast DNA has only one set of ribosomal-RNA genes as shown by partial denaturation mapping and R-loop analysis." Mol Gen Genet 178: 261-269.
Kolodner, R. and K. K. Tewari (1979). "Inverted repeats in chloroplast DNA from higher-plants." Proc Natl Acad Sci U S A 76: 41-45.
Krause, K. (2008). "From chloroplasts to "cryptic" plastids: evolution of plastid genomes in parasitic plants." Curr Genet 54: 111-21.
Kuijt, J. (1969). The biology of parasitic flowering plants. Berkeley, University of California Press.
Kunnimalaiyaan, M. and B. L. Nielsen (1997). "Fine mapping of replication origins (ori A and ori B) in Nicotiana tabacum chloroplast DNA." Nucleic Acids Res 25: 3681-6.

Kuroda, H. and P. Maliga (2003). "The plastid clpP1 protease gene is essential for plant development." Nature 425: 86-89.
Kuwahara, H., T. Yoshida, et al. (2007). "Reduced genome of the thioautotrophic intracellular symbiont in a deep-sea clam, Calyptogena okutanii." Curr Biol 17: 881-6.
Leake, J. R. (1994). "The biology of myco-heterotrophic (saprophytic) plants." New Phytol 127: 171-216.
Leebens-Mack, J. and C. dePamphilis (2002). "Power analysis of tests for loss of selective constraint in cave crayfish and nonphotosynthetic plant lineages." Mol Biol Evol 19: 1292-302.
Leebens-Mack, J., L. A. Raubeson, et al. (2005). "Identifying the basal angiosperm node in chloroplast genome phylogenies: Sampling one's way out of the Felsenstein zone." Mol Biol Evol 22: 1948-1963.
Lescot, M., P. Dehais, et al. (2002). "PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences." Nucleic Acids Res 30: 325-7.
Lidholm, J. and P. Gustafsson (1991). "The chloroplast genome of the gymnosperm Pinus contorta - a physical map and a complete collection of overlapping clones." Curr Genet 20: 161-166.
Lidholm, J., A. E. Szmidt, et al. (1988). "The chloroplast genomes of conifers lack one of the rRNA-encoding inverted repeats." Mol Gen Gen 212: 6-10.
Loftus, B., I. Anderson, et al. (2005). "The genome of the protist parasite Entamoeba histolytica." Nature 433: 865-8.
Lohan, A. J. and K. H. Wolfe (1998). "A subset of conserved tRNA genes in plastid DNA of nongreen plants." Genetics 150: 425-33.
Lusson, N. A., P. M. Delavault, et al. (1998). "The rbcL gene from the non-photosynthetic parasite Lathraea clandestina is not transcribed by a plastid-encoded RNA polymerase." Curr Genet 34: 212-5.
Martin, W. and K. V. Kowallik (1999). "Annotated English translation of Mereschkowsky's 1905 paper
Martin, W., T. Rujan, et al. (2002). "Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus." Proc Natl Acad Sci U S A 99: 12246-51.
McCutcheon, J. P., B. R. McDonald, et al. (2009). "Convergent evolution of metabolic roles in bacterial co-symbionts of insects." Proc Natl Acad Sci U S A 106: 15394-9.
McCutcheon, J. P. and N. A. Moran (2007). "Parallel genomic evolution and metabolic interdependence in an ancient symbiosis." Proc Natl Acad Sci U S A 104: 19392-7.
McInerney, J. O. (1998). "GCUA (General Codon Usage Analysis)." Bioinformatics 14: 372-373.
McNeal, J., J. Kuehl, et al. (2007). "Complete plastid genome sequences suggest strong selection for retention of photosynthetic genes in the parasitic plant genus Cuscuta." BMC Plant Biol 7: 57.
McNeal, J. R., J. V. Kuehl, et al. (2009). "Parallel loss of plastid introns and their maturase in the genus Cuscuta." PLoS One 4: e5982.
McNeal, J. R., J. H. Leebens-Mack, et al. (2006). "Using partial genomic fosmid libraries for sequencing complete organellar genomes." Biotechniques 41: 69-73.
Moore, M. J., A. Dhingra, et al. (2006). "Rapid and accurate pyrosequencing of angiosperm plastid genomes." BMC Plant Biol 6: 17.
Moore, M. J., P. S. Soltis, et al. (2010). "Phylogenetic analysis of 83 plastid genes further resolves the early diversification of eudicots." Proc Natl Acad Sci 107: 4623-4628.

98 Moran, N. A. and A. Mira (2001). "The process of genome shrinkage in the obligate symbiont Buchnera aphidicola." Genome Biol 2(12): RESEARCH0054. Morden, C. W., K. H. Wolfe, et al. (1991). "Plastid translation and transcription genes in a non- photosynthetic plant: intact, missing and pseudo genes." Embo J 10: 3281-8. Moreira, D., H. Le Guyader, et al. (2000). "The origin of red algae and the evolution of chloroplasts." Nature 405: 69-72. Morrison, D. A. (2007). "Increasing the efficiency of searches for the maximum likelihood tree in a phylogenetic analysis of up to 150 nucleotide sequences." Syst Biol 56: 988-1010. Müller, K. F. (2004). "PRAP - computation of Bremer support for large data sets." Mol Phylogenet Evol 31: 780-782. Muller, K. F. and K. P. Wall (2006). "AlignMate - A Perl pipeline for comparative analysis of organellar genes and genomes." Nakabachi, A., A. Yamashita, et al. (2006). "The 160-kilobase genome of the bacterial endosymbiont Carsonella." Science 314: 267. Neuhaus, H. E. and M. J. Emes (2000). "Nonphotosynthetic Metabolism in Plastids." Annu Rev Plant Physiol Plant Mol Biol 51: 111-140. Nickrent, D., R. Duff, et al., Eds. (1998). Molecular phylogenetic and evolutionary studies of parasitic plants. Molecular Systematics of Plants II DNA Sequencing. Boston, USA, Kluwer Academic. Ochman, H. and N. A. Moran (2001). "Genes lost and genes found: evolution of bacterial pathogenesis and symbiosis." Science 292: 1096-9. Ohyama, K., H. Fukuzawa, et al. (1986). "Chloroplast gene organization deduced from complete sequence of liverwort Marchantia polymorpha chloroplast DNA." Nature 322: 572. Ovcharenko, I., G. G. Loots, et al. (2005). "Mulan: multiple-sequence local alignment and visualization for studying function and evolution." Genome Res 15: 184-94. Page, R. D. M. and E. C. Holmes (1998). Molecular evolution: a phylogenetic approach, Wiley- Blackwell. Palmer, J. D. (1983). "Chloroplast DNA exists in two orientations." Nature 301: 92-93. 
Palmer, J. D. (1985). "Chloroplast DNA and molecular phylogeny." BioEssays 2(6): 263-267. Palmer, J. D. (2000). "A single birth of all plastids?" Nature 405: 32-33. Palmer, J. D., B. Osorio, et al. (1988). "Evolutionary significance of inversions in legume chloroplast DNAs." Curr Genet 14: 65-74. Palmer, J. D. and W. F. Thompson (1982). "Chloroplast DNA rearrangements are more frequent when a large inverted repeat sequence is lost." Cell 29: 537-550. Perry, A. S. and K. H. Wolfe (2002). "Nucleotide substitution rates in legume chloroplast DNA depend on the presence of the inverted repeat." J Mol Evol 55: 501-508. Pond, S. L., S. D. Frost, et al. (2005). "HyPhy: hypothesis testing using phylogenies." Bioinformatics 21: 676 - 679. Press, M., J. Scholes, et al., Eds. (1999). Parasitic plants: physiological and ecological interactions with their hosts. Physiological Plant Ecology. Oxford, UK, Blackwell Science. Press, M. C. (1998). "Dracula or Robin Hood? A functional role for root hemiparasites in nutrient poor ecosystems." Oikos 82: 609-611. Press, M. C. and G. K. Phoenix (2005). "Impacts of parasitic plants on natural communities." New Phytol 166: 737-51. Randle, C. P. and A. D. Wolfe (2005). "The evolution and expression of rbcL in holoparasitic sister-genera Harveya and Hyobanche (Orobanchaceae)." Amer J Bot 92: 1575-1585.

99 Raubeson, L. and R. Jansen, Eds. (2005). Chloroplast genomes of plants. Plant diversity and evolution: genotypic and phenotypic variation in higher plants. Wallingford (UK), CAB International. Salvi, S., G. Sponza, et al. (2007). "Conserved noncoding genomic sequences associated with a flowering-time quantitative trait locus in maize." Proc Natl Acad Sci U S A 104: 11376- 81. Schwartz, S., Z. Zhang, et al. (2000). "PipMaker--a web server for aligning two genomic DNA sequences." Genome Res 10(4): 577-86. Schwender, J., F. Goffman, et al. (2004). "Rubisco without the Calvin cycle improves the carbon efficiency of developing green seeds." Nature 432: 779-82. Shinozaki, K., M. Ohme, et al. (1986). "The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression." Embo J 5: 2043-2049. Smith, A. (2003). "The Systematics and Molecular Evolution of Lennoaceae." Diss. Vanderbilt University. Strauss, S. H., J. D. Palmer, et al. (1988). "Chloroplast genomes of two conifers lack a large inverted repeat and are extensively rearranged." Proc Natl Acad Sci U S A 85: 3898-902. Swofford, D. L. (2002). "PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods), Version 4.0b10." Tachezy, J. and O. Šmíd. (2008). Mitosomes in parasitic protists. In: Tachezy, J. Hydrogenosomes and Mitosomes: Mitochondria of Anaerobic Eukaryotes, Microbial Monographs Vol. 9, Berlin, Heidelberg, Springer-Verlag, pp. 201-230. ISBN: 978-3-540- 76732-9. The Angiosperm Phylogeny, G. (2009). "An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG III." Bot J Linnean Soc 161: 105-121. Thomas, B. C., L. Rapaka, et al. (2007). "Arabidopsis intragenomic conserved noncoding sequence." Proc Natl Acad Sci U S A 104: 3348-53. Timmis, J. N., M. A. Ayliffe, et al. (2004). "Endosymbiotic gene transfer: organelle genomes forge eukaryotic chromosomes." Nat Rev Genet 5: 123-35. Turmel, M., C. Otis, et al. (2005). 
"The complete chloroplast DNA sequences of the charophycean green algae Staurastrum and Zygnema reveal that the chloroplast genome underwent extensive changes during the evolution of the Zygnematales." BMC Biol 3: 22. Uchida, N., B. Townsley, et al. (2007). "Regulation of SHOOT MERISTEMLESS genes via an upstream-conserved noncoding sequence coordinates leaf development." Proc Natl Acad Sci U S A 104: 15953-8. van der Kooij, T. A., K. Krause, et al. (2000). "Molecular, functional and ultrastructural characterisation of plastids from six species of the parasitic flowering plant genus Cuscuta." Planta 210: 701-7. van Ham, R. C., J. Kamerbeek, et al. (2003). "Reductive genome evolution in Buchnera aphidicola." Proc Natl Acad Sci U S A 100: 581-6. Wernegreen, J. J., A. O. Richardson, et al. (2001). "Parallel acceleration of evolutionary rates in symbiont genes underlying host nutrition." Mol Phylogenet Evol 19: 479-85. Westwood, J. H., J. I. Yoder, et al. (2010). "The evolution of parasitism in plants." Trends Plant Sci 15: 227-35. Wicke, S., G. M. Schneeweiss, et al. (2011). "The evolution of the plastid chromosome in land plants: gene content, gene order, gene function." Plant Mol Biol 76: 273-97.

100 Wickett, N. J., Y. Zhang, et al. (2008). "Functional gene losses occur with minimal size reduction in the plastid genome of the parasitic liverwort Aneura mirabilis." Mol Biol Evol 25: 393-401. Wilson, R. J. M., P. W. Denny, et al. (1996). "Complete plastid gene map of the plastid-like DNA of the malaria parasite Plasmodium falciparum." J Mol Biol 261: 155 - 172. Wimpee, C. F., R. Morgan, et al. (1992). "An aberrant plastid ribosomal RNA gene cluster in the root parasite Conopholis americana." Plant Mol Biol 18: 275-85. Wimpee, C. F., R. Morgan, et al. (1992). "Loss of transfer RNA genes from the plastid 16S-23S ribosomal RNA gene spacer in a parasitic plant." Curr Genet 21: 417-22. Wimpee, C. F., R. L. Wrobel, et al. (1991). "A divergent plastid genome in Conopholis americana, an achlorophyllous parasitic plant." Plant Mol Biol 17: 161-6. Wolf, P. G., C. A. Rowe, et al. (2003). "Complete nucleotide sequence of the chloroplast genome from a leptosporangiate fern, Adiantum capillus-veneris L." DNA Res 10: 59-65. Wolfe, A. D. and C. W. dePamphilis (1997). "Alternate paths of evolution for the photosynthetic gene rbcL in four nonphotosynthetic species of Orobanche." Plant Mol Biol 33: 965-77. Wolfe, A. D. and C. W. dePamphilis (1998). "The effect of relaxed functional constraints on the photosynthetic gene rbcL in photosynthetic and nonphotosynthetic parasitic plants." Mol Biol Evol 15: 1243-58. Wolfe, K. H., D. S. Katz-Downie, et al. (1992). "Evolution of the plastid ribosomal RNA operon in a nongreen parasitic plant: accelerated sequence evolution, altered promoter structure, and tRNA pseudogenes." Plant Mol Biol 18: 1037-48. Wolfe, K. H., W. H. Li, et al. (1987). "Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs." Proc Natl Acad Sci USA 84: 9054-9058. Wolfe, K. H., C. W. Morden, et al. (1992). 
"Rapid evolution of the plastid translational apparatus in a nonphotosynthetic plant: loss or accelerated sequence evolution of tRNA and ribosomal protein genes." J Mol Evol 35: 304-17. Wolfe, K. H., C. W. Morden, et al. (1992). "Function and evolution of a minimal plastid genome from a nonphotosynthetic parasitic plant." Proc Natl Acad Sci U S A 89: 10648-52. Wyman, S. K., R. K. Jansen, et al. (2004). "Automatic annotation of organellar genomes with DOGMA." Bioinformatics 20: 3252-3255. Young, N. D. and C. W. dePamphilis (2005). "Rate variation in parasitic plants: correlated and uncorrelated paterns among plastid genes of different function." BMC Evol Biol 5(16. Young, N. D., K. E. Steiner, et al. (1999). "The evolution of parasitism in Scrophulariaceae/Orobanchaceae: Plastid gene sequence refute an evolutionary transition series." Ann. Missouri Bot. Gard. 86: 876-893.

Chapter 4

Molecular Evolutionary Analysis of Cancer Cell Lines

Zhang, Y., M. J. Italia, et al. "Molecular evolutionary analysis of cancer cell lines." Mol Cancer Ther 9(2): 279-91.

Yan Zhang1, Michael J. Italia2†, Kurt R. Auger4, Wendy S. Halsey3, Stephanie F. Van Horn3, Ganesh M. Sathe3, Michal Magid-Slav2, James R. Brown2, Joanna D. Holbrook4

1 Department of Biology, Pennsylvania State University, University Park PA 16802, USA

2 Computational Biology, 3 Molecular and Cellular Technology, and 4 Oncology, Research and Development, GlaxoSmithKline, 1250 South Collegeville Road, Collegeville PA 19426, USA

† Present address: Center for Biomedical Informatics, The Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA

Running Title: Cancer Cell Line Evolution

Key words: cancer cell lines; tumor classification; mutations; evolution; phylogeny


Abstract

With genome-wide cancer studies producing large DNA sequence datasets, novel computational approaches towards better understanding the role of mutations in tumor survival and proliferation are greatly needed. Tumors are widely viewed to be influenced by Darwinian processes, yet molecular evolutionary analysis, invaluable in other DNA sequence studies, has seen little application in cancer biology. Here, we describe the phylogenetic analysis of 353 cancer cell lines based on multiple sequence alignments of 3252 nucleotides and 1170 amino acids built from the concatenation of variant codons and residues across 494 and 523 genes, respectively. Reconstructed phylogenetic trees cluster cell lines by shared DNA variant patterns rather than by cancer tissue type, suggesting that tumors originating from diverse tissues have similar oncogenic pathways. A well-supported clade of 91 cancer cell lines representing multiple tumor types also had significantly different gene expression profiles from the remaining cell lines according to statistical analyses of mRNA microarray data. This suggests that phylogenetic clustering of tumor cell lines based on DNA variants might reflect functional similarities in cellular pathways. Positive selection analysis revealed specific DNA variants that might be potential driver mutations. Our study shows the potential role of molecular evolutionary analyses in tumor classification and the development of novel anti-cancer strategies.

Introduction

The advent of lower cost, higher throughput DNA sequencing technologies is ushering in a new era of cancer genomics or oncogenomics (1-4). Recent large-scale mutational analyses of cancer cell lines and clinical samples have revealed complex and highly heterogeneous variation even among tumors derived from the same tissue type (5). Such findings have important clinical implications for patients. For example, a large-scale sequencing project of cDNAs from 90 tyrosine kinase genes across 254 cell lines found that colon cancer cell lines carried EGFR kinase domain mutations similar to those that render non-small-cell lung cancer (NSCLC) patients responsive to the inhibitor gefitinib, thus suggesting another potential indication for this anti-cancer drug (6). While the cost of generating DNA sequence data has greatly decreased, these datasets present formidable bioinformatics challenges, creating a critical need for new analytical approaches to help understand the dependence of the cancer cell on mutations and the relationships among different tumor types.

One significant objective of molecular cancer genomic studies is to attempt to distinguish driver mutations responsible for tumor proliferation and survival from coincidental passenger mutations resulting from relaxed DNA repair and replication fidelity (5). Another frequent study goal is the classification of tumors based on various molecular data including DNA variants, transcriptional profiling, epigenetic signatures or combinations thereof. Interestingly, although tumorigenesis has been viewed as an evolutionary process influenced by natural selection processes (7), there has been little application of molecular evolutionary approaches in the analysis of cancer mutations and understanding the relationships of diverse tumors at the DNA sequence level.

Here, we describe the evolutionary relationships of several hundred cancer cell lines and characterize potential mutational patterns using both phylogenetic and positive selection analytical approaches. We owe much of our current understanding of cancer biology to the study of established cell lines and it is expected that further functional validation of discoveries from the oncogenome will rely on many of the same cell lines. For example, the NCI-60 panel of cell lines has been extensively characterized both at the phenotypic and genotypic level (8, 9).

Recently, Lorenzi et al. (10) reported on DNA fingerprinting of the NCI-60 cell line panel which suggested that a few cell lines might have common donor origins. However, phylogenetic classification based on DNA mutational patterns of any collection of cell lines has not been previously reported.

The basis for any molecular evolutionary analysis is a consistent and biologically relevant multiple sequence alignment of nucleotide or protein sequences from the relevant taxa. Although larger scale genome-level rearrangements are often associated with tumorigenesis, we focused on single nucleotide synonymous and nonsynonymous substitutions or point mutations as the most tractable genetic variant for phylogenetic analyses. Point mutations have been shown to be highly important in the modulation of many different tumorigenic pathways, either by activating or inhibiting enzymatic activity of specific proteins, thereby causing clinical resistance to new cancer drugs targeting specific proteins such as kinases (11, 12). Since for any particular tumor gene only one or a few DNA variants may occur, we constructed concatenated multiple sequence alignments comprised of DNA variant codons from several hundred genes collected across hundreds of tumor cell lines. Phylogenetic analysis of concatenated datasets has been used previously to study other complex evolutionary questions such as the origin of eukaryotes (13) and the universal tree of life (14). To our knowledge, this is the first application of rigorous phylogenetic analyses to cancer mutation datasets.

Figure 4-1. Flow chart of cancer cell line gene variant analysis. An in-house sequencing project determined a total of 2777 variant nucleotides (compared to GenBank RefSeq) in 55 genes across 353 human tumor cell lines. Additional mutations for those cell lines were obtained from two public data sources, COSMIC (15) (nucleotides) and Tykiva (6) (amino acids only). A single multiple sequence alignment was constructed from concatenated mutated codons that differed from RefSeq. An amino acid multiple sequence alignment was also constructed from translated codons and additional amino acid mutations identified in the Tykiva database. Both nucleotide and amino acid sequence alignments were used for subsequent phylogenetic and positive selection analysis (see Materials and Methods).

Materials and Methods

DNA sequencing

Our core dataset for phylogenetic analyses was derived from novel DNA sequencing of 55 selected oncogenes, tumor-suppressor genes, or genes otherwise implicated in tumor proliferation, from 353 cancer cell lines. Supplemental Table S1 lists the cell lines and their tissue of origin along with the genes sequenced in this study. All cell lines were obtained from either ATCC or other public repositories. Both cDNAs and genomic DNA were sequenced. Total RNA from cancer cell lines was isolated using a modified Qiagen RNeasy kit (QIAGEN Inc, Valencia, CA) and converted into cDNA with the Roche First Strand cDNA Synthesis Kit and oligo dT primers (Roche Diagnostics, Mannheim, Germany). Genomic DNA was prepared using the Promega Maxwell-16 DNA Purification Kit (Promega, Madison, WI). For genomic DNA, all exons were PCR amplified using oligonucleotide primers designed within flanking intronic regions. Approximately 2 kb of 5’ and 1 kb of 3’ UTR were also covered by sequencing. PCR primers were tailed with M13 universal sequencing primer sequences (UF and UR). All primers were quality-control tested on Promega Human Genomic DNA (Promega, Madison, WI) and BD qPCR Human Reference cDNA, random primed (BD Biosciences, Palo Alto, CA), respectively, before use on cell line samples. PCR reactions were carried out using HotStarTaq DNA polymerase (QIAGEN Inc, Valencia, CA). DNA was amplified for 35 cycles of 95 °C for 20 seconds, 55 °C for 30 seconds and 72 °C for 45 seconds, followed by a 7-minute extension at 72 °C.

Prior to DNA sequencing, all PCR products were purified using Agencourt AMPure (Agencourt Bioscience Corporation, Beverly, MA). Direct sequencing of purified PCR products was performed with the AB v1.1 BigDye-terminator cycle sequencing kit (Applied Biosystems, Foster City, CA) followed by purification using Agencourt CleanSEQ (Agencourt Bioscience Corporation, Beverly, MA). Sequencing was performed using an AB Genetic Analyzer 3730XL. All sequence data were assembled and analyzed using CodonCode Aligner software (CodonCode Corporation, Dedham, MA), and sequence variants were confirmed by independent PCR amplifications and sequencing.

Construction of Multiple Sequence Alignments

Fig. 4-1 shows a flow diagram of the DNA sequence analysis pipeline. For the cell lines we sequenced, comparisons to the NCBI Human RefSeq (August 2008) revealed 2777 different nucleotide variations. No distinction was made between germline single nucleotide polymorphisms (SNPs) and somatic tumor-specific mutations (herein collectively called “variants”). To augment our sample with additional gene sequences, overlapping cell line collections were identified in two public repositories of cancer mutations. Of the 353 cell lines sequenced, 229 were also found to have nucleotide-level mutations recorded in the database COSMIC (15), a comprehensive source of cancer mutations maintained by the Sanger Centre (http://www.sanger.ac.uk/genetics/CGP/cosmic/), from which we extracted an additional 922 mutations across 452 genes. The Tykiva Database (6) of the Bioinformatics Institute of Singapore (http://tykiva.bii.a-star.edu.sg/SOGdb/cgi-bin/sogweb.pl) provides amino acid sequences of cancer cell line mutations (DNA sequences are unavailable). A total of 133 amino acid variants in 59 genes across 60 common cell lines were obtained from the Tykiva Database.

Since each of the databases used different wild-type human reference sequences to determine tumor cell line variants (called here dbRef), we adopted a protocol to standardize variants against a common wild-type human reference. As our standard wild type, or wtRef, for new DNA sequences generated in this study as well as sequences imported from cancer mutation databases, we used human gene sequences from the NCBI Reference Sequence collection (August 2008). From the COSMIC database, we extracted the exact position of each reported point mutation, its corresponding dbRef nucleotide and its position in the wtRef sequences. To ensure the correctness of the data extraction, we compared the extracted dbRef nucleotide with the corresponding wtRef nucleotide. Furthermore, we confirmed that the extracted three-nucleotide codons were identical in both the wtRef and dbRef sequences. Any differences were investigated and the variant was rejected if they could not be resolved. If the wtRef and dbRef codons were identical, then the variant codon or amino acid recorded in that database was retained. For a codon to be included in our multiple sequence alignments, at least one variant had to be observed in one of the cell lines. For heterozygous genes, we used the variant allele in order to increase phylogenetic signal.
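The two reference checks described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in the study: the function name `standardize_variant`, the 0-based coordinates, and the assumption that both coding sequences start in frame at position 0 are all hypothetical.

```python
def standardize_variant(wt_ref_cds, db_ref_cds, wt_pos, db_pos, variant_base):
    """Return the variant codon in wtRef terms, or None if the references disagree.

    Hypothetical helper: wt_pos and db_pos are 0-based positions of the mutated
    nucleotide in the wtRef and dbRef coding sequences, both assumed in frame.
    """
    # Check 1: the reference nucleotide at the mutated position must agree.
    if wt_ref_cds[wt_pos] != db_ref_cds[db_pos]:
        return None
    # Check 2: the whole three-nucleotide codon must agree.
    wt_start = wt_pos - wt_pos % 3
    db_start = db_pos - db_pos % 3
    wt_codon = wt_ref_cds[wt_start:wt_start + 3]
    db_codon = db_ref_cds[db_start:db_start + 3]
    if len(wt_codon) != 3 or wt_codon != db_codon:
        return None
    # Both checks passed: substitute the variant base into the wtRef codon.
    offset = wt_pos - wt_start
    return wt_codon[:offset] + variant_base + wt_codon[offset + 1:]


# Toy sequences (not real gene data): the second codon GGT mutates to GAT.
print(standardize_variant("ATGGGTGCA", "ATGGGTGCA", 4, 4, "A"))
```

A `None` result corresponds to the case in the text where a difference between references has to be investigated before the variant can be retained.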

To build the concatenated sequence alignments, we added the variant codons in sequential order such that each aligned nucleotide position was homologous across all cell lines. For those genes with missing data, the wtRef codon was used in order to build a complete multiple sequence alignment without gaps. As the outgroup, a complete wtRef codon sequence was added to the cell line multiple sequence alignment. This produced a multiple sequence alignment of 3252 nucleotides for 353 cell lines plus the nucleotide wtRef. Prior to phylogenetic analysis, cell lines with 100% identity in nucleotide sequence were reduced to a single representative (see Results and Discussion for a list of identical cell lines). Thus phylogenetic analyses were performed on 321 distinct sequences, including the wtRef sequence.
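As a rough illustration of the concatenation step, the sketch below keeps only codon sites where at least one cell line carries a variant, fills every other cell line with the wtRef codon, and adds wtRef as an extra row. The gene names, codon positions and function names are hypothetical placeholders, not data from this study. On the real data the same procedure would give 1084 codon columns, since 3252 / 3 = 1084.

```python
def variable_sites(variants):
    """Codon sites (gene, codon_index) with a variant in at least one cell line."""
    return sorted({site for observed in variants.values() for site in observed})


def build_alignment(variants, wt_ref):
    """Concatenate codons site by site, using the wtRef codon where none is recorded."""
    sites = variable_sites(variants)
    # The outgroup row is simply the reference codon at every retained site.
    alignment = {"wtRef": "".join(wt_ref[site] for site in sites)}
    for cell_line, observed in variants.items():
        alignment[cell_line] = "".join(observed.get(site, wt_ref[site]) for site in sites)
    return alignment


# Hypothetical toy input: reference codons and per-cell-line variant codons.
wt_ref = {("GENE_A", 12): "GGT", ("GENE_A", 61): "CAA", ("GENE_B", 600): "GTG"}
variants = {
    "line_1": {("GENE_A", 12): "GAT"},
    "line_2": {("GENE_B", 600): "GAG"},
    "line_3": {},  # no recorded variants, so it matches wtRef everywhere
}
alignment = build_alignment(variants, wt_ref)
for name, sequence in alignment.items():
    print(name, sequence)
```

Because every row is built from the same ordered list of sites, each column is homologous across cell lines and the alignment contains no gaps.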


A similar approach was used to construct the amino acid sequence alignment with some additional steps. First, all variant codon positions from our DNA sequence data as well as that of the COSMIC database were translated to the corresponding amino acid. An additional 133 amino acid variants found in 59 genes were added from the Tykiva resource. The wtRef nucleotide sequence was also translated into amino acids. After consolidating cell lines with identical amino acid sequences, the final protein multiple sequence alignment was 1170 amino acids for 292 cell lines including an amino acid sequence version of wtRef.
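The translation step can be illustrated with the standard genetic code. The sketch below is an assumption-laden example rather than the study's code: the function names are hypothetical, and the codons shown are toy values chosen only to contrast a nonsynonymous change with a synonymous one.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(codons):
    """Translate a list of codons into a one-letter amino acid string."""
    return "".join(CODON_TABLE[codon] for codon in codons)


def is_nonsynonymous(ref_codon, variant_codon):
    """True when the variant codon encodes a different residue than the reference."""
    return CODON_TABLE[ref_codon] != CODON_TABLE[variant_codon]


print(translate(["ATG", "GGT", "GAT"]))
print(is_nonsynonymous("GGT", "GAT"), is_nonsynonymous("GGT", "GGC"))
```

Synonymous variant codons collapse to the reference residue after translation, which is one reason the amino acid alignment can contain more identical cell lines than the nucleotide alignment.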

Phylogenetic Analysis

For the nucleotide sequence alignments, maximum likelihood (ML) tree topologies were constructed using the software GARLI v0.96 (16, 17). Rate heterogeneity was estimated with the gamma distribution model. Sixty initial runs of GARLI were carried out to ensure that similar likelihood scores were reached. Subsequently, another 60 runs were made to evaluate the consistency of tree topologies, with parameters adjusted to increase the intensity of the searches (attachmentspertaxon=200, genthreshfortopoterm=80000, numberofprecreductions=40, other parameters default). Nucleotide sequence phylogenies were also reconstructed using Bayesian posterior probabilities (BP) as implemented by the software MrBayes v3.0B4 (18, 19). Bayesian analysis also used the gamma-distributed rate model with 6 discrete rate categories. Markov chains were run for 10^6 generations, burn-in values were set for 10^4 generations, and trees were sampled every 100 generations.
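As a quick worked example of what these Bayesian settings imply, the arithmetic below assumes a chain of 10^6 generations, a burn-in of 10^4 generations and one tree sampled every 100 generations. It ignores any extra sample taken at generation zero, so the counts are approximate.

```python
generations = 10**6          # total Markov chain length
sample_every = 100           # one tree saved every 100 generations
burnin_generations = 10**4   # generations discarded as burn-in

sampled_trees = generations // sample_every
burnin_trees = burnin_generations // sample_every
retained_trees = sampled_trees - burnin_trees
burnin_fraction = burnin_generations / generations

print(sampled_trees, burnin_trees, retained_trees, burnin_fraction)
```

Under these assumptions roughly 9900 sampled trees would contribute to the posterior probabilities, with the first 1% of the chain discarded.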

We constructed amino acid based phylogenetic trees using BP and distance neighbor-joining (NJ) methods. BP was performed as described for nucleotide alignments but with the mixed model for the amino acid rate matrix. NJ trees were based on pairwise distances between amino acid sequences using the programs NEIGHBOR and PROTDIST (Dayhoff option) of the PHYLIP 3.6 package (20). All reconstructed phylogenetic trees were visualized using the programs TREEVIEW v1.6.6 (21) and Dendroscope v2.2.2 (22).

mRNA Expression and Pathway Analysis

In order to determine the relative differences in mRNA expression between Clade A (see Results and Discussion) and other cell lines, we analyzed mRNA microarray profiles previously generated for these cell lines by GlaxoSmithKline, which are available from the public repository caBIG (23). Transcript profiling data as well as background information for each cell line microarray experiment sample are available at: https://cabig.nci.nih.gov/caArray_GSKdata/. Since large numbers of peripheral blood cell lines fall outside of Clade A, inclusion of peripheral blood cell lines could bias results towards genes that simply differentiate solid tumor from peripheral blood cancer cell lines. Thus we focused our gene expression analysis exclusively on solid tumor cell lines. For the statistical analysis of gene expression data, we used partial least squares discriminant analysis (PLS-DA) (24) for class comparison (25) in order to identify genes with the most discriminating probesets between Clade A and non-Clade A solid tumor cell lines. PLS-DA has been widely used for biomarker discovery and for elucidating biological processes, particularly in cancer transcriptomic studies (26-29). PLS-DA analysis was performed using SIMCA-P+ 11.5 (Umetrics, Umeå, Sweden). Cross-validation was done according to default software settings except for increasing the number of permutations from 7 to 100. A total of 398 probesets (PLS-DA gene set) were selected based on their high variable importance for the projection (VIP) values. As a cross-check of PLS-DA results, the data were also analyzed using Student's t-test. P-values were corrected using the Benjamini-Hochberg (BH) procedure, with a false discovery rate (FDR) < 0.05 as the significance threshold (30). Pathway enrichment analyses were conducted with MetaCore (GeneGo Inc., St. Joseph, MI) (31).
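For readers unfamiliar with the Benjamini-Hochberg correction applied to the t-test results, a minimal sketch of the standard step-up adjustment follows. It is a generic textbook implementation with made-up P-values, not the code or data used in this study.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted P-values in the same order as the input."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up P-values for four probesets.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
significant = [q < 0.05 for q in adjusted]
print(adjusted, significant)
```

A probeset is called significant when its adjusted value falls below the chosen FDR threshold, here 0.05.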

Selection Pressure Analysis

The branch-site model (32, 33) implemented in the CODEML program from the PAML package (34, 35) was used to test for positive selection. We tested each of the branches on the cell line phylogeny, treating each in turn as the foreground branch, with all the other branches specified as background branches. Likelihood-ratio tests were performed with the Bonferroni correction for multiple testing (36). The alternative branch-site model has four codon site categories, the first two for sites evolving under purifying selection and neutral selection on all the lineages and the additional two categories for sites under positive selection on the foreground branch. The null model restricts sites on the foreground lineage to be undergoing neutral evolution. Each branch-site model was run three times. At least two of three replicate runs of each model should converge at or within 0.001 of the same log-likelihood value. Runs that did not converge indicated problems with the data and were rerun until convergence was obtained or else reported as a convergence problem.
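The testing and convergence criteria above can be sketched as follows. This is a hedged illustration, not the study's scripts: the function names and log-likelihood values are hypothetical, and the use of a chi-square distribution with one degree of freedom for the branch-site likelihood-ratio test is an assumption stated here rather than taken from the text.

```python
import itertools
import math


def lrt_p_value(lnl_null, lnl_alt):
    """Likelihood-ratio statistic and its P-value under chi-square with 1 df (assumed)."""
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    return statistic, math.erfc(math.sqrt(statistic / 2.0))


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


def converged(log_likelihoods, tolerance=0.001):
    """True if at least two replicate runs agree within the tolerance."""
    pairs = itertools.combinations(log_likelihoods, 2)
    return any(abs(a - b) <= tolerance for a, b in pairs)


# Hypothetical log-likelihoods for one foreground branch.
statistic, p = lrt_p_value(lnl_null=-1005.0, lnl_alt=-1000.0)
print(statistic, p, bonferroni(p, n_tests=20))
print(converged([-1000.0, -1000.0005, -998.2]))
```

Because every branch is tested in turn as the foreground branch, the number of tests passed to the Bonferroni correction grows with the size of the tree, which makes the correction conservative for large phylogenies.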

Results and Discussion

We constructed an initial dataset from an in-house DNA sequencing effort of 55 known oncogenes and tumor-suppressor genes from 353 cancer cell lines (Fig. 4-1). The selected cell lines represent a significant proportion of tumor cell lines commonly used in cancer research, including the National Cancer Institute (NCI) collection. Multiple tissue sources were represented in this cancer cell line sample, with the top four tumor types being breast (n = 32), colon (n = 26), lung (n = 38) and liquid tumors (n = 88; classified as peripheral blood in Supplemental Table S1).

Among the cell lines we sequenced, comparisons to Human RefSeq (retrieved from NCBI build 36.3, August 2008) revealed 2777 different nucleotide variations that were either germline single nucleotide polymorphisms (SNPs) or somatic tumor-specific mutations, herein collectively called “variants”. To augment our sample with further gene sequences, overlapping cell line collections were identified in two public repositories of cancer mutations. The Sanger Centre COSMIC (15) resource, a comprehensive database of cancer mutations, had 229 of the 353 cell lines, from which we extracted an additional 922 mutations. The Tykiva Database (6) provided amino acid sequences with 133 additional variants for those cancer cell lines (DNA sequences are unavailable).

For every gene, codons or amino acid residues that differed from the respective “wild-type” Human RefSeq (herein called wtRef) were included in the nucleotide or protein multiple sequence alignments, respectively. In order to build a complete data matrix, wherever the variant codon sequence was unavailable for a particular cell line, the wtRef codon sequence was assumed. Codons were then concatenated into a single multiple sequence alignment comprised of 3252 nucleotides for 353 cell lines. For proteins, the codons were translated from the nucleotide alignment, with additional amino acid variants from the Tykiva database, to give a final alignment of 1170 amino acids. Among all cancer cell lines, the proportion of nucleotides and amino acids that differed from wtRef ranged from 0-7.9% and 0-6.6%, respectively.
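The proportion of differing positions reported above is a simple per-site comparison against wtRef. The sketch below uses toy sequences and a hypothetical function name. For scale, a 7.9% difference over 3252 aligned nucleotides corresponds to roughly 257 differing positions.

```python
def percent_different(sequence, reference):
    """Percentage of aligned positions at which a sequence differs from the reference."""
    if len(sequence) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(sequence, reference))
    return 100.0 * differences / len(reference)


# Toy example: one difference in six aligned nucleotides.
print(round(percent_different("ATGGAT", "ATGGGT"), 2))
# Approximate number of differing sites implied by 7.9% of 3252 nucleotides.
print(round(0.079 * 3252))
```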

Six cell lines had nucleotide sequences that were 100% identical to the wtRef sequence. There were another 16 identity groups, ranging from two to six cell lines, several of which included multiple tumor types (Supplemental Table S2). For example, tumor cell lines originating from patients with Hodgkin’s disease (L-428, RPMI 6666), chronic myeloid leukaemia (MEG-01), squamous cell cervical carcinoma (SiHa), breast ductal carcinoma (UACC-812) and retinoblastoma (Y79) all share identical nucleotide sequences. In our cell line sample, MDA-MB-435 and M14, as well as U251 and SNB-19, were also previously reported to have identical DNA fingerprints and thus could have common donor origins (10). For those occurrences of two or more identical cell lines, a single representative was selected for the multiple sequence alignments, which resulted in phylogenetic datasets of 321 unique nucleotide and 292 unique protein sequences, each including the respective nucleotide or amino acid wtRef outgroup (Supplemental Tables S3 and S4, respectively).
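Identity groups of this kind can be found by grouping alignment rows on their exact sequence. The following sketch uses toy data and hypothetical function names. Keeping wtRef as the representative of its own group is a choice made here for illustration, not something stated in the text.

```python
from collections import defaultdict


def identity_groups(alignment):
    """Group row names by identical sequence, listing wtRef first within its group."""
    groups = defaultdict(list)
    for name, sequence in alignment.items():
        groups[sequence].append(name)
    return sorted(
        sorted(names, key=lambda n: (n != "wtRef", n)) for names in groups.values()
    )


def collapse_identical(alignment):
    """Keep one representative row (the first name) per identity group."""
    return {names[0]: alignment[names[0]] for names in identity_groups(alignment)}


# Toy alignment with two identity groups and one unique sequence.
toy = {
    "wtRef": "GGTGTG",
    "line_1": "GATGTG",
    "line_2": "GATGTG",
    "line_3": "GGTGTG",
    "line_4": "GGTGAG",
}
print(identity_groups(toy))
print(sorted(collapse_identical(toy)))
```

Collapsing identical rows shortens tree searches without discarding information, because identical sequences would otherwise be joined by zero-length branches.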

Phylogenetic analysis of the cancer cell line dataset is challenging because many gene variants were represented in low frequencies in the overall dataset. The pattern of nucleotide variation and the large number of cell lines meant certain phylogenetic reconstruction methods were less suitable for determining tree topologies, in part due to their sensitivity to errors involving unequal rates of evolutionary changes among lineages. Therefore we used the maximum likelihood (ML) method as implemented by the software GARLI (16) which has a rapid algorithm allowing for multiple ML reiterations to optimize parameters and assess tree topology robustness.

The best phylogenetic tree (lowest -ln likelihood value) for the combined new DNA sequence and public nucleotide data is shown in Fig. 4-2 (shown in Supplemental Fig. S1 as a vertical phylogram with branch lengths). A similar tree topology was recovered in separate phylogenetic analyses restricted to the new DNA sequences generated in this study alone (data not shown). The most striking aspect of the tree is the overall lack of clustering with respect to tissue of origin or cancer type, which suggests that common nucleotide variants can occur in multiple tumor types. In several clusters, cell lines derived from liquid tumors, such as lymphomas and leukaemias, co-occurred with solid tumors, suggesting that both of these broad tumor types might be dependent upon common mutations for tumorigenesis and . For example, one statistically well-supported subcluster (indicated in Fig. 4-2 by a solid blue bar) is comprised of 11 tumor cell lines including seven lymphomas (lymphocytes, lymphoblast), two ovarian tumors (adenocarcinoma of endometrium ovarian tissue) and two myeloid leukaemias ().

Extensive DNA sequence heterogeneity in common tumor types sampled from different patients or cell lines has been previously observed in large scale cancer genome surveys. Such studies have highlighted pathways apparently important for the transformed phenotype. For example, a large scale gene survey of glioblastomas found higher than expected incidences of ERBB2 mutations, which had previously been thought to be more commonly associated with lung, gastric and colon cancers (4). Other profiling methods such as DNA fingerprints have often been unable to distinguish cell lines by tissue of origin (10). Our study shows that phylogenetic analysis can assist in quantifying and visualizing the complex variability of the cancer genome, thus facilitating further exploration of relationships between particular collections of tumor samples.

Figure 4-2. Phylogenetic tree of tumor cell lines based on DNA sequences. Best maximum likelihood (ML value = -ln19311.744) tree of 320 unique tumor cell lines outgroup rooted with the human RefSeq (wtRef). Based on a multiple sequence alignment of 3252 nucleotides derived from the concatenation of variant codons, the tree was reconstructed using the GARLI package (16) (see Materials and Methods). Cell lines are color coded by tumor tissue type as listed, along with legend abbreviations, in Supplemental Table S1. Identical cell lines are listed at terminal nodes separated by commas. Solid red bar indicates the terminal node for the Clade A subtree which was consistently obtained in 60 randomized, replicate ML analyses. Red arc line shows the range of cell lines included in Clade A. Also indicated are other nodes supported in 50%-69% (+) and 70%-100% (*) replications. Those nodes supported by 0.8 – 1.0 probabilities in Bayesian tree reconstruction using the software MrBayes (19) are marked with “!”. Solid blue bar indicates an example of a well-supported subcluster of cell lines from diverse tissues of origin (discussed in text). Supplemental Figure S1 is a PDF version of the same phylogenetic tree in vertical phylogram format.

An exceptional phylogenetic structure in all ML runs was a cluster comprised of up to 91 cell lines, which we will refer to as Clade A (indicated in Fig. 4-2; the best ML scored Clade A subtree is shown in Fig. 4-3). The clustering of 83 of these cell lines was consistently supported in more than 80% of 60 separate ML replicates (Supplemental Table S5). The existence of the Clade A node, as well as several other internal nodes, was also significantly supported in BP phylogenetic reconstructions. All main tumor types as well as several rare types are represented in Clade A. In order to determine if the nucleotide tree topology reflected nonsynonymous changes, we also reconstructed a protein-based phylogenetic tree (Supplemental Figure S2) for all cell lines with amino acid sequences (320 cell lines and 1170 amino acids). Similar to the nucleotide tree, very few clusters in the protein tree were comprised of cell lines with common tissue types. Although some associations among cell lines in the nucleotide tree were not recapitulated in the protein tree, the majority of Clade A cell lines clustered together, which suggests that most variants defining this group of cell lines are nonsynonymous changes and thereby could reflect potentially functional changes at the protein level.
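Support for a grouping across replicate ML searches, as reported for Clade A, amounts to counting the fraction of replicate trees in which the cell lines form an exclusive clade. The sketch below assumes each replicate tree is rooted on the outgroup and is available as nested tuples of tip names; that representation and the function names are hypothetical conveniences, not GARLI output formats.

```python
def clades(tree):
    """Return every clade in a nested-tuple tree as a frozenset of tip names."""
    found = set()

    def walk(node):
        if isinstance(node, str):  # a tip
            return frozenset([node])
        tips = frozenset().union(*(walk(child) for child in node))
        found.add(tips)
        return tips

    walk(tree)
    return found


def clade_support(replicates, members):
    """Fraction of replicate trees in which `members` form an exclusive clade."""
    target = frozenset(members)
    hits = sum(target in clades(tree) for tree in replicates)
    return hits / len(replicates)


# Four made-up replicate topologies over four tips plus a reference tip.
replicates = [
    (("a", "b"), ("c", "d"), "ref"),
    (("a", "b"), ("c", "d"), "ref"),
    (("a", "c"), ("b", "d"), "ref"),
    ((("a", "b"), "c"), "d", "ref"),
]
support_ab = clade_support(replicates, {"a", "b"})
```

In this toy example the pair a and b is recovered in three of the four replicates, so it would fall in the 70%-100% support class used in Fig. 4-2.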

As shown in the Clade A tree, several cell lines, in particular those derived from lung, breast and colon cancers, have especially long branches reflecting more numerous DNA variants. Higher levels of variation can result from either elevated mutational events in those tumor cell lines or sampling bias towards particular tumor types. We feel that the latter explanation is less likely since gene sequence data were available across a wide spectrum of tumor tissue types. Thus the longer branches leading to various lung cancers, for example, reflect the higher mutation rate in that particular tumor type for the genes studied. This does not exclude the possibility that there are other genes yet to be sequenced which are commonly mutated in other cancers. Previous studies suggest that early and late stage tumors can accumulate different mutations, which could also account for differences in mutation rates (37, 38). However, in our study we examined cell lines which, for the most part, likely best model late stage cancers.

Additionally, some methods of phylogenetic tree reconstruction are highly sensitive to taxa with unequal mutation rates, and exceptionally long branches will tend to co-cluster as tree artefacts (39). By employing ML methods, which are less sensitive to long-branch effects, we have minimized the introduction of this bias in our phylogeny.

Figure 4-3. Best maximum likelihood phylogenetic tree of Clade A cell lines, based on nucleotide sequences (ML value = -ln11135.2836). Phylogenetic methods, outgroup rooting and tree labeling are as given in Fig. 4-2. The tissue of origin for each cell line is indicated by color and suffix abbreviation.

Clade A cell lines had several unique variant combinations distinguishable from other cell lines (Table 4-1). As an example, the gene RPS6KB2 encodes ribosomal protein S6 kinase (70kDa, polypeptide 2 isoform), which functions downstream of the rapamycin-sensitive mTOR kinase signalling pathway involved in cell growth and proliferation (40). RPS6KB2 is also one of the most variable genes in our analysis, having both synonymous and nonsynonymous nucleotide variants. Clade A included all 58 cell lines with one particular RPS6KB2 variation, A420V. This variant, a known human polymorphism (RS13859), is also suggested to be under positive selection pressure from our analysis (described below). Another example is the gene BIRC5, a member of the inhibitor of apoptosis (IAP) gene family, which negatively regulates apoptotic cell death. There are multiple splice variants of BIRC5, including survivin-2B, which is highly expressed during fetal development and in many tumors but not expressed in normal adult tissue (41). The BIRC5 variant E152E/K, which occurred in 42 cell lines, including 40 Clade A members, also tested significant for positive selection. Many nonsynonymous variants that occurred in 10 or more cell lines (across all 353 cell lines) were disproportionately represented (over 50%) in Clade A cell lines. These variants occurred in genes involved with chromosome segregation and cell proliferation (AURKA, BUB1B, INCENP, CENPE, CDKN1A, CHFR), PI3K/mTOR pathway genes (TSC1, PIK3CA, PIK3CG, PIK3R1, PIK3R2), the KRAS, and the tumor suppressor TP53.

Table 4-1. Variant codons occurring in ≥ 10 Clade A cell lines at frequencies ≥ 0.5.

                               Variant Occurrence
Gene      Amino Acid   Codon   No. of Clade   Total No. of   Clade:Total
          Position             Cell Lines     Cell Lines     Cell Lines
AKT1      242          GAG     35             35             1.00
AURKA     31           TTT     16             16             1.00
AURKA     57           ATT     10             10             1.00
AURKB     295          TCC     20             20             1.00
BAD       96           CGC     18             18             1.00
BIRC5     152          GAG     40             42             0.95
BUB1B     349          CAA     17             17             1.00
BUB1B     388          GCG     17             17             1.00
CDKN1A    31           AGC     12             12             1.00
CENPE     81           ACT     12             12             1.00
CENPE     338          GTA     12             12             1.00
CENPE     675          CAG     24             24             1.00
CENPE     1535         TTT     11             11             1.00
CENPE     1911         AGC     11             11             1.00
CENPE     2090         ACG     10             10             1.00
CHFR      254          CCA     19             19             1.00
CHFR      528          TTG     20             20             1.00
CHFR      539          GTG     14             14             1.00
FOXO3A    53           GCC     19             20             0.95
FRAP1     479          GAT     64             64             1.00
FRAP1     999          AAC     65             65             1.00
FRAP1     1577         GCG     61             61             1.00
FRAP1     1851         AGC     27             27             1.00
FRAP1     2303         CTG     25             25             1.00
INCENP    36           GAG     27             27             1.00
INCENP    120          GTT     28             28             1.00
INCENP    506          ATG     19             19             1.00
MAP2K2    64           GTC     10             10             1.00
MAP2K2    151          GAC     27             27             1.00
MAP2K2    220          ATA     51             52             0.98
PIK3CA    1047         CAT     9              15             0.60
PIK3CD    936          TAC     21             21             1.00
PIK3CG    327          GAT     11             11             1.00
PIK3CG    442          TCC     18             18             1.00
PIK3CG    675          AGC     35             35             1.00
PIK3R1    73           TAC     35             35             1.00
PIK3R1    326          ATG     24             24             1.00
PIK3R2    313          CCC     10             10             1.00
PIK3R2    637          AGT     45             45             1.00
RPS6KB2   269          TTC     54             54             1.00
RPS6KB2   420          GCC     58             58             1.00
TSC1      322          ATG     20             20             1.00
TSC1      445          GAA     21             21             1.00
TSC2      860          TTT     10             10             1.00
TSC2      1667         GAT     33             33             1.00
TSC2      1732         TCG     12             12             1.00
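The Clade:Total column of Table 4-1 is a simple ratio, and the filter described in the text (variants seen in 10 or more cell lines overall, with more than half of them in Clade A) can be sketched as below. The function name and thresholds as parameters are illustrative assumptions; the first three entries use counts from Table 4-1, and the last entry is a made-up variant included only to show a row that is filtered out.

```python
def clade_enriched(counts, min_total=10, min_ratio=0.5):
    """Keep variants seen in >= min_total cell lines whose Clade A share is >= min_ratio.

    counts maps (gene, amino_acid_position) to (clade_a_lines, total_lines).
    """
    rows = []
    for (gene, position), (in_clade, total) in counts.items():
        ratio = in_clade / total
        if total >= min_total and ratio >= min_ratio:
            rows.append((gene, position, in_clade, total, round(ratio, 2)))
    return rows


counts = {
    ("BIRC5", 152): (40, 42),
    ("MAP2K2", 220): (51, 52),
    ("PIK3CA", 1047): (9, 15),
    ("GENE_X", 1): (3, 12),  # hypothetical variant, not from the study
}
rows = clade_enriched(counts)
```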

We looked for significant differences in mRNA expression between Clade A and other cell lines using data previously released to the CaBIG database (23) by GlaxoSmithKline (https://cabig.nci.nih.gov/caArray_GSKdata/). Partial least squares discriminant analysis (PLS-DA) was used to identify genes with the most discriminating probesets between Clade A and non-Clade A solid tumor cell lines. A total of 398 probesets (PLS-DA gene set) were selected based on their high variable importance for the projection (VIP) values. Pathway enrichment analyses on the PLS-DA gene set and Clade A commonly variant genes returned many of the same pathways (Table 4-2). In addition, regulation of translation initiation, which is a downstream pathway for a number of the variant genes, was significantly enriched in the PLS-DA gene expression set (P = 2.706e-03).

Student’s t-test (FDR of 5%) confirmed that 394 of the 398 PLS-DA genes had an adjusted P-Value ≤ 0.05. Furthermore, 192 of 398 top PLS-DA VIP genes were also among the top 398 t-test genes sorted by adjusted P-Value. Pathway enrichment analysis of 398 top t-test genes based on adjusted P-Values yielded similar results to PLS-DA VIP pathways, although the latter were in general better agreement with mutation based pathways.
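Adjusted P-values at a 5% FDR are commonly obtained with the Benjamini-Hochberg step-up procedure (reference 30 in this chapter's list). The sketch below assumes that procedure was the adjustment applied here, which this section does not state explicitly; the function name and the example P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for four probesets.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adj]
```

A probeset is then called significant when its adjusted value is at or below 0.05, the threshold quoted above.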

The differential mRNA expression of genes in Clade A cell lines relative to other cell lines suggests that phylogenetic clustering of tumor cell lines based on DNA variants might recapitulate functional similarities in cellular pathways. We can also exclude the possibility that the genes commonly mutated in Clade A are substantially perturbed by other mutational changes (e.g. large scale genomic aberrations such as deletions, amplifications or re-arrangements) in the wild-type cell lines. If this were the case, we would not expect differential pathway gene expression patterns between the two groups.

Table 4-2. Pathways enriched in Metacore GeneGo for both Clade A genes (Variants) and PLS-DA gene set (Expression) based on Hypergeometric P-Value < 0.05.

Pathway                                                                 P-Value (Variants)   P-Value (Expression)
Apoptosis and survival_BAD phosphorylation                              1.32E-09             0.01061
Cell cycle_The metaphase checkpoint                                     5.76E-10             0.007398
Cell cycle_Chromosome condensation in prometaphase                      0.002715             0.04408
Cell cycle_Role of APC in cell cycle regulation                         0.01052              0.03767
Signal transduction_AKT signaling                                       1.47E-15             0.01209
Development_PIP3 signaling in cardiac myocytes                          3.53E-15             0.001808
Development_IGF-RI signaling                                            7.83E-15             0.031189
G-protein signaling_EDG5 signaling                                      0.0368               0.04996
G-protein signaling_G-Protein alpha-12 signaling pathway                0.01146              0.04157
Development_EDG3 signaling pathway                                      6.37E-06             0.000108
Development_Mu-type opioid receptor signaling                           3.85E-06             0.008602
Immune response_Role of DAP12 receptors in NK cells                     0.03142              0.000531
Neurophysiological process_HTR1A receptor signaling in neuronal cells   0.007976             0.02727
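A hypergeometric enrichment P-value of the kind reported in Table 4-2 is the upper tail probability of drawing at least the observed number of pathway genes in a gene set sampled without replacement from a background. The sketch below is a generic implementation, not the Metacore GeneGo algorithm; the function name, argument names, and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, pathway_size, sample_size, overlap):
    """P(X >= overlap) for a gene set of sample_size drawn without replacement.

    population:   number of background genes
    pathway_size: background genes annotated to the pathway
    sample_size:  genes in the set being tested
    overlap:      tested genes that fall in the pathway
    """
    upper = min(pathway_size, sample_size)
    total = comb(population, sample_size)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 4 in the pathway, 3 tested, 2 overlapping.
p_toy = hypergeom_enrichment_p(10, 4, 3, 2)
```

For the toy numbers the tail sums to (6 x 6 + 4 x 1) / 120, i.e. one third, so this made-up gene set would not pass the P < 0.05 threshold.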

Determining the critical mutations which contribute to the malignant cellular phenotype is challenging. Beerenwinkel et al. (42) estimated that colorectal cancers may carry about 100 nonsynonymous mutations, of which perhaps as few as three may be sufficient for developing cancer. These critical mutations are commonly termed driver mutations. Conversely those mutations that are accumulated by the tumor cell but do not appear to confer a growth advantage are called passenger mutations (these can occur in an accelerated manner as a consequence of impaired DNA repair and apoptosis processes). Evolutionary methods may have the capability to discriminate between passenger and driver mutations (43). In an effort to highlight putative driver mutations within the phylogeny, we classified cell lines by tissue of origin. Nucleotide alignments for each tissue group were analysed to detect evidence for evolution influenced by positive selection pressures using the PAML algorithms (34, 35). We found significant evidence for the influence of positive selection pressure acting on several sites in the concatenated alignment for most tissue groups.
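Tests for positive selection of this kind are usually likelihood ratio tests between nested codon models, where twice the difference in log-likelihood is compared to a chi-square distribution. The sketch below assumes that general scheme; the specific model pair, the degrees of freedom and the log-likelihood values are hypothetical, since the supplementary methods are not reproduced in this section.

```python
from math import erfc, exp, sqrt


def lrt_pvalue(lnl_null, lnl_alt, df):
    """Likelihood ratio test between nested models.

    Returns (statistic, p_value), where statistic = 2 * (lnL_alt - lnL_null)
    is compared to a chi-square distribution with df degrees of freedom.
    Only df = 1 and df = 2 are handled, using closed-form tail probabilities.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if df == 1:
        return stat, erfc(sqrt(stat / 2.0))
    if df == 2:
        return stat, exp(-stat / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


# Made-up log-likelihoods for a null model and a model allowing selection.
stat, p = lrt_pvalue(lnl_null=-1000.0, lnl_alt=-995.0, df=2)
```

With these made-up values the statistic is 10 and the P-value is exp(-5), about 0.0067, so the model allowing positively selected sites would be preferred at the 5% level.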

For those tissue groups which returned a significant test for positive selection, a Bayesian analysis was used to identify the sites responsible for the signal with posterior probability > 0.95. These sites might be driver mutations conferring growth or survival advantage on cancer cells (Table 4-3). Included are known oncogenic mutations which are highly represented in COSMIC (15), such as activating mutations at the G12 position of the KRAS gene in 73% of primary pancreatic ductal and, activating mutations at the V600 position of the BRAF gene occurring in 33% of primary malignant melanomas.

Table 4-3. Genes with amino acids under positive selection (PS) in Clade A and all tumor cell lines. Based on PAML likelihood scores (see Supplementary Methods File 1), all tumor tissue types where genes had amino acid sites under significant PS (Bayesian posterior probabilities > 0.95) are shown. Bladder and liver tumor derived cell lines were also tested but do not have any significant PS sites. PS sites which are also sites of known SNPs and their corresponding SNP ID are indicated by *. For NRAS, PS site Q61& is a site of a known SNP but mutations at this site were not of the SNP variant (Q61R).

Interestingly, 12 of the 20 positively selected mutations are known germline SNPs as identified in the NCBI dbSNP database (Table 4-3). These SNPs are present in normal tissues, although our analysis suggests they are selected for in cell culture and possibly confer some cellular growth advantage. Therefore, we hypothesise that they may also confer cancer susceptibility in polymorphic individuals, a hypothesis supported by two of the SNP variants. The variant Q349R in the gene BUB1B is associated with chronic lymphocytic leukaemia incidence (44). The P72R polymorphism in the gene TP53 has been shown to affect the risk of cervical cancer (45), and a meta-analysis of all types of cancer estimated a pooled weak cancer risk in p53-P72 homozygotes (46, 47). The frequencies of some of the variants in the cancer cell lines are higher than expected for human populations according to HapMap. This could be due to the cancer cells having somatically mutated to the other allele during cell line culture, or to the cell lines originating from individuals with germline SNPs who are more likely to develop tumors. Further caution should be used in the interpretation of positive selection by the branch-site method used by PAML, which has recently been challenged as overestimating sites with potential functional changes (48). Thus further sequencing of the positively selected, polymorphic sites in tumors seems warranted to determine the clinical relevance of these variants.

Our phylogenetic and positive selection analysis approaches to the study of DNA variants in human cancer cell lines have several important caveats. First, it is optimal to have a complete data matrix with sequence variant data available across all tumor samples. Given the low frequency of individual gene mutations, characterization of variants across hundreds of genes is required for robust phylogenetic trees. Fortunately, such genome-wide DNA sequence data are becoming available through several large scale cancer genome projects now underway. Second, although point mutations are very important in tumorigenesis, larger scale DNA aberrations are also key mutational events in many cancers. Potentially, reconstruction of comprehensive phylogenies of tumor genomes might be possible using combined data models of both point mutations and genomic aberrations, as has been done for mixed DNA sequence and morphological datasets (49). Third, sequenced genes in our dataset are biased towards pathways known to be perturbed in cancer. Broader DNA sequencing surveys involving additional genes may well alter the tree topology. Finally, the genetic variation between cancer cell lines, as well as reflecting variation in human tumors, could also be the result of germline differences between the cell line donors and changes acquired during cell culture.

Here we show that evolutionary analyses can illuminate commonalities in tumorigenic mechanisms among diverse cancers – this finding has several important implications for future cancer research. First, phylogenetic trees provide a direct and visual means for understanding the relationships between tumor types based on complex data sets as well as a statistical basis for their classification. Previous results have been reported in tabular formats (i.e. mutated protein kinases (6)) which are less tractable for identifying tumor type relationships across large multiple gene variant datasets. Second, tumor cell lines are key tools in the pre-clinical screening of potential compounds and biological agents for anti-cancer activity. By knowing the homologous relationships of tumor cell lines based on genome-wide variant data, more rational selection of cell lines can be made for drug screening campaigns. Furthermore, by screening cell lines from different tissues of origin that might cluster together because of similar DNA variation patterns, new indications for novel drugs can be discovered pre-clinically. Finally, sophisticated DNA- based diagnostic tests to identify specific target gene mutations are being increasingly used to develop clinical treatment regimes for cancer patients (50). As these data are collected across more genes and patients, evolutionary approaches can be used to classify clinical tumor samples based on molecular similarities and potentially guide therapeutic decisions.

Acknowledgments

We thank C. Traini, E. Thomas, K. Allen, and A. Hughes for DNA sequencing; D. Zwickl, N.J. Wickett, P.K. Wall and Jeremy M. Martin for computational consulting; C.W. dePamphilis for support of Y.Z.; and R. Wooster and three anonymous reviewers for their comments on this manuscript.

Reference List

1. Sjoblom T, Jones S, Wood LD, et al. The consensus coding sequences of human breast and colorectal cancers. Science 2006;314:268-74.

2. Jones S, Zhang X, Parsons DW, et al. Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science 2008;321:1801-6.

3. Parsons DW, Jones S, Zhang X, et al. An integrated genomic analysis of human glioblastoma multiforme. Science 2008;321:1807-12.

4. The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008;455:1061-8.

5. Sjoblom T. Systematic analyses of the cancer genome: lessons learned from sequencing most of the annotated human protein-coding genes. Curr Opin Oncol 2008;20:66-71.

6. Ruhe JE, Streit S, Hart S, et al. Genetic alterations in the tyrosine kinase transcriptome of human cancer cell lines. Cancer Res 2007;67:11368-76.

7. Greaves M. Darwinian medicine: a case for cancer. Nat Rev Cancer 2007;7:213-21.

8. Nishizuka S, Charboneau L, Young L, et al. Proteomic profiling of the NCI-60 cancer cell lines using new high-density reverse-phase lysate microarrays. Proc Natl Acad Sci U S A 2003;100:14229-34.

9. Shankavaram UT, Reinhold WC, Nishizuka S, et al. Transcript and protein expression profiles of the NCI-60 cancer cell panel: an integromic microarray study. Mol Cancer Ther 2007;6:820-32.

10. Lorenzi PL, Reinhold WC, Varma S, et al. DNA fingerprinting of the NCI-60 cell line panel. Mol Cancer Ther 2009;8:713-24.

11. Suda K, Onozato R, Yatabe Y, Mitsudomi T. EGFR T790M mutation: a double role in lung cancer cell survival? J Thorac Oncol 2009;4:1-4.

12. Snead JL, O'Hare T, Eide CA, Deininger MW. New strategies for the first-line treatment of chronic myeloid leukemia: can resistance be avoided? Clin Lymphoma Myeloma 2008;8 Suppl 3:S107-S117.

13. Baldauf SL, Roger AJ, Wenk-Siefert I, Doolittle WF. A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 2000;290:972-7.

14. Brown JR, Douady CJ, Italia MJ, Marshall WE, Stanhope MJ. Universal trees based on large combined protein sequence data sets. Nat Genet 2001;28:281-5.

15. Bamford S, Dawson E, Forbes S, et al. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website. Br J Cancer 2004;91:355-8.

16. Zwickl DJ. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Ph.D. dissertation, The University of Texas at Austin; 2006.

17. Lewis PO. A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data. Mol Biol Evol 1998;15:277-83.

18. Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 2001;17:754-5.

19. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 2003;19:1572-4.

20. Felsenstein J. PHYLIP (Phylogenetic Inference Package). [3.6]. 2000. Seattle, Department of Genetics, University of Washington.

21. Page RD. TreeView: an application to display phylogenetic trees on personal computers. Comput Appl Biosci 1996;12:357-8.

22. Huson DH, Richter DC, Rausch C, Dezulian T, Franz M, Rupp R. Dendroscope: An interactive viewer for large phylogenetic trees. BMC Bioinformatics 2007;8:460.

23. Fenstermacher D, Street C, McSherry T, Nayak V, Overby C, Feldman M. The Cancer Biomedical Informatics Grid (caBIG). Conf Proc IEEE Eng Med Biol Soc 2005;1:743-6.

24. Musumarra G, Barresi V, Condorelli DF, Fortuna CG, Scire S. Genome-based identification of diagnostic molecular markers for human lung carcinomas by PLS-DA. Comput Biol Chem 2005;29:183-95.

25. Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003;95:14-8.

26. Musumarra G, Barresi V, Condorelli DF, Scire S. A bioinformatic approach to the identification of candidate genes for the development of new cancer diagnostics. Biol Chem 2003;384:321-7.

27. Perez-Enciso M, Tenenhaus M. Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach. Hum Genet 2003;112:581-92.

28. Modlich O, Prisack HB, Munnes M, Audretsch W, Bojar H. Predictors of primary breast cancers responsiveness to preoperative epirubicin/cyclophosphamide-based chemotherapy: translation of microarray data into clinically useful predictive signatures. J Transl Med 2005;3:32.

29. Man MZ, Dyson G, Johnson K, Liao B. Evaluating methods for classifying expression data. J Biopharm Stat 2004;14:1065-84.

30. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 1995;57:289-300.

31. Ekins S, Nikolsky Y, Bugrim A, Kirillov E, Nikolskaya T. Pathway mapping tools for analysis of high content data. Methods Mol Biol 2007;356:319-50.

32. Yang Z, Nielsen R. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol Biol Evol 2002;19:908-17.

33. Zhang J, Nielsen R, Yang Z. Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Mol Biol Evol 2005;22:2472-9.

34. Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 1997;13:555-6.

35. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 2007;24:1586-91.

36. Anisimova M, Yang Z. Multiple hypothesis testing to detect lineages under positive selection that affects only a few sites. Mol Biol Evol 2007;24:1219-28.

37. Spencer SL, Gerety RA, Pienta KJ, Forrest S. Modeling somatic evolution in tumorigenesis. PLoS Comput Biol 2006;2:e108.

38. Bielas JH, Loeb KR, Rubin BP, True LD, Loeb LA. Human cancers express a mutator phenotype. Proc Natl Acad Sci U S A 2006;103:18238-42.

39. Kuhner MK, Felsenstein J. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol 1994;11:459-68.

40. Boyer D, Quintanilla R, Lee-Fruman KK. Regulation of catalytic activity of S6 kinase 2 during cell cycle. Mol Cell Biochem 2008;307:59-64.

41. Li F, Ling X. Survivin study: an update of "what is the next wave"? J Cell Physiol 2006;208:476-86.

42. Beerenwinkel N, Antal T, Dingli D, et al. Genetic progression and the waiting time to cancer. PLoS Comput Biol 2007;3:e225.

43. Greenman C, Stephens P, Smith R, et al. Patterns of somatic mutation in human cancer genomes. Nature 2007;446:153-8.

44. Rudd MF, Sellick GS, Webb EL, Catovsky D, Houlston RS. Variants in the ATM-BRCA2-CHEK2 axis predispose to chronic lymphocytic leukemia. Blood 2006;108:638-44.

45. Storey A, Thomas M, Kalita A, et al. Role of a p53 polymorphism in the development of human papillomavirus-associated cancer. Nature 1998;393:229-34.

46. van Heemst D, Mooijaart SP, Beekman M, et al. Variation in the human TP53 gene affects old age survival and cancer mortality. Exp Gerontol 2005;40:11-5.

47. Whibley C, Pharoah PD, Hollstein M. p53 polymorphisms: cancer implications. Nat Rev Cancer 2009;9:95-107.

48. Nozawa M, Suzuki Y, Nei M. Reliabilities of identifying positive selection by the branch-site and the site-prediction methods. Proc Natl Acad Sci U S A 2009;106:6700-5.

49. Ragan MA. Matrix representation in reconstructing phylogenetic relationships among the eukaryotes. Biosystems 1992;28:47-55.

50. Varley KE, Mitra RD. Nested Patch PCR enables highly multiplexed mutation discovery in candidate genes. Genome Res 2008;18:1844-50.

Chapter 5

Future Work

In this thesis, we compared the plastomes from independent lineages of nonphotosynthetic plants, Pholisma arenarium and Epifagus virginiana, and also plastomes from the same nonphotosynthetic lineage, Conopholis americana and Epifagus virginiana. We observed striking similarities in both plastome content and evolutionary dynamics in the nonphotosynthetic plants. We also compared the plastomes of the nonphotosynthetic plants with those of their green relatives and found a high degree of reduction in plastome content in the nonphotosynthetic plants, along with accelerated evolutionary rates in the remaining genes. Furthermore, evolutionary analysis shows that most of the remaining genes are still evolving under strong constraints. These results raise several questions that can be addressed in future studies, either through functional analysis or with a denser collection of plastid genomes. These include: 1) After loss of the plastid’s major function, are the remaining genes still functional, or are they silenced and likely on their way to being lost? 2) Do the protein structures remain the same, or have they already begun to change significantly in photosynthetic and nonphotosynthetic parasites? 3) What are the likely mechanisms driving the deletion of genes from plastomes?

Transcriptome sequencing, RT-PCR, or northern blotting will be applied to further explore the functions of the remaining plastome genes in the nonphotosynthetic plants. The photosynthetic gene rbcL would be one of the more interesting genes to study, in terms of whether it is still functional in the plastomes of some holoparasites that retain intact copies of the gene.

We would also like to use WebLogo to explore whether the conservation of protein sequences has changed in parasites, by comparing multiple sequence alignments of green plants with the same alignments after adding parasites (Crooks, Hon et al. 2004). This would address whether the enhanced rate of nonsynonymous change detected in some protein coding sequences in the holoparasite transcriptomes is associated with changes to otherwise conserved aspects of the protein sequence.
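The comparison proposed here rests on the per-column information content that a sequence logo displays. A minimal sketch of that quantity, and of its change when parasite sequences are added to an alignment, is given below. It omits the small-sample correction that logo tools typically apply, and the function names and toy alignments are hypothetical.

```python
from collections import Counter
from math import log2


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    return log2(alphabet_size) - entropy


def conservation_profile(alignment):
    """Per-column information content for a list of equal-length sequences."""
    return [column_information(col) for col in zip(*alignment)]


def conservation_change(before, after):
    """Per-column change in information content (after minus before)."""
    return [b - a for a, b in zip(conservation_profile(before),
                                  conservation_profile(after))]


# Made-up protein alignments: green plants only, then with one parasite added.
green = ["MKVL", "MKVL", "MKIL"]
with_parasite = green + ["MRTL"]
delta = conservation_change(green, with_parasite)
```

Columns with a negative change are positions where adding the parasite sequence lowers conservation, which is the pattern the proposed analysis would look for.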

Our preliminary examination of the plastomes of the nonphotosynthetic sister species Conopholis and Epifagus revealed several interesting patterns that would bear further analysis and experimental research. Although many of the gene losses observed were common to both species, some were not; it would therefore be interesting to identify the total length of intact genes and conserved noncoding sequence common to the two species, providing some insight into how much smaller a plastome might become while remaining compatible with the functions encoded by these genomes. The shared noncoding sequences observed between the two nonphotosynthetic plants are of special interest, because these can provide some insights into potential promoters needed by the nonphotosynthetic plastome.

This thesis has focused on nonphotosynthetic holoparasites, but we would also like to further explore plastome evolution in related hemiparasites, such as Striga. This will lead to a greater understanding of the initial steps of plastome evolution in parasites that still rely to an extent on their own photosynthate. Because autotrophic capability has been lost on at least five occasions in the family, Orobanchaceae provides multiple such opportunities for comparison of related hemi- and holoparasitic lineages (dePamphilis et al., 1997).

Tumorigenesis is influenced by natural selection, thus we believe molecular evolutionary methodologies are useful tools for understanding cancer progression mechanisms. This thesis applies evolutionary analysis in the framework of cancer cell lines. We constructed a phylogenetic analysis of 353 cancer cell lines based on multiple sequence alignments of 3252 variant nucleotides from 494 genes. The relationship among cancer cell lines was studied using phylogenetic approaches for the first time. Positive selection analysis was performed to identify potential driver mutations in cancer genes. Liu and Tozeren further tested the SNPs reported in various cancer cell lines, and found that one out of the 13 unique SNP IDs presented in our Table 4-3 was found in domain regions and corresponded to a domain-altering SNP associated with pancreatic cancer (Liu and Tozeren 2010). We think further analysis should be conducted to test whether protein domains have been altered by the presence of positively selected variants identified with the evolutionary approach. Reva et al. have recently developed a new functional impact score (FIS) for amino acid residue changes using evolutionary conservation patterns; this method assigns high scores to mutations that are more likely to be functional, i.e. driver mutations (Reva, Antipin et al. 2011). We could test whether positively selected mutations identified with our approach are also assigned high scores by their method.

Furthermore, with the explosion of genome wide DNA sequence data available through large scale cancer genome projects, we could reconstruct a more complete data matrix for a more robust phylogenetic study. Broader sequencing surveys including additional genes could also reduce the bias caused by the heavy sampling in signal transduction pathways known to be perturbed in cancer.

References:

Crooks, G. E., et al. (2004). "WebLogo: a sequence logo generator." Genome Res 14: 1188-1190.

Liu, Y. and A. Tozeren (2010). "Domain altering SNPs in the human proteome and their impact on signaling pathways." PLoS One 5: e12890.

Reva, B., et al. (2011). "Predicting the functional impact of protein mutations: application to cancer genomics." Nucleic Acids Res 39: e118.

Appendix A. Flowchart of molecular evolutionary analysis in chapter 2.


Appendix B. Gene summary by functional category in Pholisma, Epifagus, Ehretia and Mimulus. Data shown are lengths of potentially functional sequences (in black) and pseudogene sequences (in red).

Gene                                Ehretia   Pholisma  Mimulus   Epifagus

Photosynthesis

Photosystem I
psaA                                2253      141       2253      0
psaB                                2205      241       2205      0
psaC                                246       0         531       0
psaI                                111       96        111       0
psaJ                                135       0         135       0
All - Photosystem I                 4950      478       5235      0

Photosystem II
psbA                                1062      0         1059      336
psbB                                1527      327       1527      40
psbC                                1386      0         1419      0
psbD                                1062      0         1062      0
psbE                                252       0         252       0
psbF                                120       0         120       0
psbG                                0         0         738       0
psbH                                222       0         240       0
psbI                                111       0         111       0
psbJ                                123       0         123       0
psbK                                180       98        180       0
psbL                                117       0         117       0
psbM                                105       0         105       0
psbN                                132       0         132       0
psbT                                108       0         108       0
psbZ                                189       0         189       0
All - Photosystem II                6696      425       7482      376

Cytochrome b6f
petA                                963       0         963       0
petB*                               648       0         648       0
petD*                               483       0         483       0
petG                                114       0         114       0
petL                                96        0         96        0
petN                                90        95        90        0
All - Cytochrome b6f                2394      95        2394      0

ATP synthase
atpA                                1524      631       1524      263
atpB                                1497      673       1497      150
atpE                                405       0         402       0
atpF*                               555       0         555       0
atpH                                246       0         246       0
atpI                                744       466       744       0
All - ATP synthase                  4971      1770      4968      413

Rubisco
rbcL                                1443      1437      1455      435

Chlororespiration
ndhA*                               1092      0         1092      0
ndhB*                               1533      685       1533      572
ndhC                                363       0         363       0
ndhD                                1503      0         1503      0
ndhE                                306       0         306       0
ndhF                                2226      0         2250      0
ndhG                                531       0         531       0
ndhH                                1182      0         1182      80
ndhI                                504       0         507       0
ndhJ                                477       0         477       0
ndhK                                855       0         852       0
All - Chlororespiration             10572     685       10596     652

All - Photosynthesis                31026     4890      32130     1876

Gene Expression

rRNA
rrn16                               1491      1496      1491      1492
rrn23                               2810      2833      2811      2804
rrn4.5                              103       103       103       103
rrn5                                121       121       121       121
All - rRNA                          4525      4553      4526      4520

Ribosomal Proteins
rps2                                711       711       711       714
rps3                                657       657       663       663
rps4                                606       606       606       609
rps7                                468       468       468       468
rps8                                405       402       405       405
rps11                               417       417       417       411
rps12_3end*                         258       258       258       261
rps12_5end                          114       114       114       114
rps14                               303       303       303       303
rps15                               273       228       273       0
rps16*                              267       265       267       0
rps18                               306       246       312       279
rps19                               279       279       279       285
rpl2*                               825       825       825       825
rpl14                               369       369       369       337
rpl16*                              408       408       408       408
rpl20                               387       387       378       387
rpl22                               468       396       465       0
rpl23                               282       147       282       289
rpl32                               162       162       177       0
rpl33                               201       195       201       201
rpl36                               114       114       114       114
All - Ribosomal Proteins            8280      7957      8295      7073

Transfer RNAs
trnA-UGC*                           73        73        73        65
trnC-GCA                            71        81        81        69
trnD-GUC                            74        74        74        74
trnE-UUC                            73        73        73        73
trnF-GAA                            73        73        73        73
trnfM-CAU                           74        71        74        74
trnG-GCC                            71        71        71        0
trnG-UCC*                           72        71        71        0
trnH-GUG                            75        75        75        75
trnI-CAU                            74        74        74        74
trnI-GAU*                           72        72        72        33
trnK-UUU*                           72        72        72        0
trnL-CAA                            81        81        81        81
trnL-UAA*                           85        85        85        0
trnL-UAG                            80        80        1182      0
trnM-CAU                            73        73        73        73
trnN-GUU                            72        72        72        72
trnP-UGG                            74        74        74        74
trnQ-UUG                            72        72        72        72
trnR-ACG                            74        76        74        74
trnR-UCU                            72        72        72        68
trnS-GCU                            88        87        88        88
trnS-GGA                            87        86        87        89
trnS-UGA                            92        91        93        93
trnT-GGU                            72        72        72        0
trnT-UGU                            73        73        73        0
trnV-GAC                            72        72        72        0
trnV-UAC*                           73        0         73        0
trnW-CCA                            74        74        74        75
trnY-GUA                            84        85        84        85
All - Transfer RNAs                 2272      2205      3384      1554

RNA polymerase and maturase genes
rpoA                                1020      245       1008      158
rpoB                                3213      133       3213      0
rpoC1*                              2064      0         2037      0
rpoC2                               4146      0         4131      0
matK                                1509      1506      1539      1320
All - Polymerase and maturase genes 11952     1884      11928     1478

Initiation Factors
infA                                234       234       234       234

All - Gene Expression               27263     16833     28367     14859

Other Protein Genes
accD                                1551      1494      1533      1482
ccsA                                966       0         975       0
cemA                                690       0         690       0
clpP**                              591       588       591       591
ycf1                                5445      5289      5586      5217
ycf2                                6855      6552      6830      6651
ycf3**                              507       0         507       0
ycf4                                555       555       0
ycf15                               258       324       150       0
All - Other Protein Genes           17418     14247     17417     13941

All Unique Sequences                75707     35970     77914     30676

Appendix C. Comparative Genome Summary for Pholisma, Epifagus, Ehretia and Mimulus

Table includes the sequence length and percentage of coding, intron, and intergenic regions, pseudogene sequences, and the LSC, SSC, and IR regions. Only one IR repeat is counted in the coding, intron, intergenic, and pseudogene sequence totals.

                      Mimulus         Epifagus        Ehretia         Pholisma
Coding                76,524 (50%)    30,676 (44%)    76,761 (49%)    35,974 (44%)
Intron                15,454 (10%)    3,207 (5%)      15,658 (10%)    9,369 (12%)
Intergenic Region     35,733 (23%)    11,132 (16%)    38,332 (24%)    9,828 (12%)
Pseudogene sequences  0 (0%)          2,278 (3%)      0 (0%)          3,735 (5%)
LSC                   84,296 (55%)    19,799 (28%)    86,786 (55%)    30,167 (37%)
SSC                   17,907 (12%)    4,759 (7%)      18,162 (12%)    6,459 (8%)
Single IR             25,508 (17%)    22,734 (32%)    25,803 (16%)    22,280 (27%)
Total Length          153,219         70,028          156,554         81,186
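The percentages in this table appear to be each length divided by the total plastome length, rounded to the nearest whole percent. The short Python sketch below reproduces that arithmetic from the tabulated values and also sums LSC + SSC + 2 x IR for comparison with the total length. The dictionary layout and function names are our own illustrative choices, and the rounding rule is an assumption inferred from the numbers rather than a documented procedure.

```python
# Lengths (bp) copied from Appendix C; the dictionary layout is illustrative.
genomes = {
    "Mimulus":  {"coding": 76524, "LSC": 84296, "SSC": 17907, "IR": 25508, "total": 153219},
    "Epifagus": {"coding": 30676, "LSC": 19799, "SSC": 4759,  "IR": 22734, "total": 70028},
    "Ehretia":  {"coding": 76761, "LSC": 86786, "SSC": 18162, "IR": 25803, "total": 156554},
    "Pholisma": {"coding": 35974, "LSC": 30167, "SSC": 6459,  "IR": 22280, "total": 81186},
}

def percent(part, total):
    """Share of the total length, rounded to a whole percent (assumed rule)."""
    return round(100 * part / total)

def quadripartite_length(genome):
    """LSC + SSC + two copies of the inverted repeat."""
    return genome["LSC"] + genome["SSC"] + 2 * genome["IR"]

if __name__ == "__main__":
    for name, g in genomes.items():
        print(
            name,
            f"coding={percent(g['coding'], g['total'])}%",
            f"single IR={percent(g['IR'], g['total'])}%",
            f"LSC+SSC+2*IR={quadripartite_length(g)}",
            f"tabulated total={g['total']}",
        )
```

For Mimulus, Ehretia, and Pholisma the region sum equals the tabulated total; for Epifagus the sum is 70,026, two bases short of the tabulated 70,028.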


Appendix D. Indel Position, Length and Analysis.

Table includes the length of insertions (positive numbers) and deletions (negative numbers) inferred from multiple sequence alignments of the IR regions of Pholisma, Epifagus, and photosynthetic relatives. Insertions and deletions are scored relative to their inferred ancestral state from the photosynthetic relatives.

Indel  IndelPos     Length  Epifagus  Pholisma  Mimulus  Ehretia  Nicotiana
1      234-239      6       0         0         0        0        1
2      235-239      5       0         1         0        0        ?
3      239-239      1       0         ?         1        0        ?
4      257-259      -3      1         0         0        0        0
5      271-271      -1      0         1         0        0        0
6      274-274      -1      0         0         1        0        0
7      732-732      1       0         0         0        0        1
8      743-745      -3      1         0         0        0        0
9      744-745      2       ?         0         1        1        1
10     831-836      -6      0         1         0        1        0
11     844-849      -6      1         0         0        0        0
12     849-867      -19     0         1         0        0        0
13     866-867      2       1         ?         1        0        1
14     1007-1012    -6      1         0         0        0        0
15     1038-1044    -4      0         1         0        0        0
16     1040-1042    3       0         ?         ?        1        1
17     1040-1043    -1      0         ?         1        0        0
18     1053-1056    4       0         1         1        1        1
19     1144-1148    5       1         0         1        1        1
20     1173-1173    -1      0         1         0        0        0
21     1865-1870    6       0         1         1        1        1
22     1898-1902    5       0         1         1        1        1
23     1917-1923    -7      1         0         0        0        0
24     1928-1928    1       1         0         1        1        1
25     1963-1963    -1      1         0         0        0        0
26     1972-1975    4       0         1         1        1        1
27     2000-2031    -32     0         1         0        0        0
28     2047-2047    -1      0         1         0        0        0
29     2175-2184    10      0         1         1        1        1
30     2222-2228    -7      1         0         0        0        0
31     2272-2275    -4      1         0         0        0        0
32     2356-2356    1       1         0         1        1        1
33     2372-2379    8       1         0         1        1        1
34     2372-2379    5       0         1         1        1        1
35     2419-2442    24      1         1         1        1        0
36     2493-2498    6       0         1         1        1        1
37     2530-2541    -12     1         0         0        0        0
38     3315-3317    3       0         1         1        1        1
39     3609-3614    6       0         1         1        1        1
40     3619-3624    -6      0         1         0        0        0
41     3629-3634    6       0         1         1        1        1
42     3715-3717    3       0         1         1        1        1
43     3780-3785    -6      1         0         0        0        0
44     3879-3884    -6      1         0         0        0        0
45     4310-4315    6       0         1         1        1        1
46     4429-4437    -9      1         0         0        0        0
47     4452-4610    -159    1         0         0        0        0
48     4466-4474    9       ?         0         1        0        1
49     5017-5022    6       0         1         1        1        1
50     5096-5104    9       1         0         1        1        1
51     5128-5148    -21     1         0         0        0        0
52     5393-5398    -6      1         0         1        0        0
53     5653-5928    -276    0         1         0        0        0
54     5767-5769    3       0         ?         1        1        1
55     5911-5922    -12     1         ?         0        0        0
56     5994-5999    -6      0         1         0        0        0
57     6050-6055    -6      0         0         1        0        0
58     6143-6148    6       0         1         1        1        1
59     6304-6306    3       1         1         0        1        1
60     6805-6810    -6      1         0         1        0        0
61     6890-6898    -9      1         1         0        1        0
62     7548-7562    -15     1         0         0        0        0
63     7590-7607    -18     0         1         0        0        0
64     7641-7643    3       1         0         1        0        1
65     7836-7844    -9      1         0         1        0        0
66     8321-8326    6       0         1         1        1        1
67     8606-8611    6       0         1         1        1        1
68     8715-8729    15      1         0         1        1        1
69     8771-8779    9       0         0         0        0        1
70     8920-8925    6       0         1         1        1        1
71     8978-9007    -30     0         1         0        0        0
72     9278-9283    6       0         0         0        1        1
73     9281-9283    -3      0         0         1        ?        ?
74     9421-9429    -9      1         0         0        0        0
75     9434-9437    4       1         1         0        1        1
76     9486-9490    5       0         1         1        1        1
77     9518-9522    5       0         1         1        1        1
78     9568-9572    -5      0         1         0        0        0
79     9583-9587    -5      1         0         0        0        0
80     9602-9602    1       1         1         0        1        1
81     uncertain
82     9632-9632    1       ?         1         ?        0        ?
83     9632-9635    -4      ?         0         1        0        ?
84     9632-9637    -6      1         0         0        0        1
85     9648-9653    -6      1         0         0        0        0
86     9668-9675    8       0         1         1        1        1
87     9709-9709    1       1         0         1        1        1
88     9745-9750    6       0         1         1        1        1
89     9772-9774    3       1         0         1        0        1
90     9783-9785    3       1         0         1        1        1
91     9811-9812    2       1         1         1        1        0
92     9826-9836    -11     1         0         0        0        1
93     9831-9836    6       ?         0         1        1        ?
94     9849-9852    -4      1         0         ?        0        ?
95     9849-9855    7       0         0         1        0        1
96     9862-9868    7       0         1         1        1        1
97     9990-9991    2       1         0         1        1        1
98     10014-10014  -1      1         0         1        0        0
99     10043-10047  5       0         1         1        1        1
100    10065-10071  7       0         1         1        1        1
101    10083-10091  -9      0         1         0        0        0
102    10131-10143  13      0         0         1        1        1
103    10135-10143  4       0         1         ?        ?        ?
104    10158-10200  -43     1         1         0        0        0
105    10200-10200  1       ?         ?         1        1        0
106    10212-10213  -2      1         0         0        0        0
107    10218-10223  -6      1         0         0        0        0
108    10316-10316  1       0         1         1        1        1
109    10384-10385  -2      1         0         0        0        0
110    10389-10393  -5      0         0         0        0        1
111    10405-10409  -5      1         0         1        1        0
112    10413-10418  -6      1         0         0        0        0
113    10454-10462  -9      1         0         0        0        0
114    10468-10478  11      0         0         0        0        1
115    10470-10471  2       ?         0         1        ?        ?
116    10470-10478  -9      1         0         0        1        ?
117    10486-10486  -1      1         0         0        ?        0
118    10486-10489  -4      0         0         0        1        0
119    10495-10576  -82     1         0         0        0        0
120    10522-10535  14      ?         1         0        1        1
121    10545-10551  -7      ?         0         0        0        1
122    10575-10576  2       ?         1         0        1        ?
123    10575-10577  3       0         0         0        0        1
124    10581-10584  4       0         0         0        0        1
125    10613-10617  5       0         1         1        1        1
126    10643-10651  -9      1         0         0        0        0
127    10648-10651  4       ?         0         1        1        1
128    10701-10706  6       1         0         1        1        1
129    10714-10739  -26     1         0         0        0        0
130    10716-10723  -8      ?         0         1        0        0
131    10732-10739  8       ?         0         1        1        1
132    10758-10758  1       0         1         1        1        1
133    10837-10845  -9      0         1         0        0        0
134    10839-10842  4       0         ?         1        1        1
135    10872-11002  -131    1         0         0        0        0
136    10893-10893  -1      ?         1         0        0        0
137    11018-11207  -190    1         0         0        0        0
138    11050-11055  6       ?         0         1        1        1
139    11096-11116  -21     ?         1         0        0        0
140    11170-11175  6       ?         0         1        1        1
141    11216-11232  -11     1         0         0        0        0
142    11227-11232  6       ?         0         1        1        1
143    11238-11394  -157    1         0         0        0        0
144    11244-11248  -5      ?         1         0        0        0
145    11298-11298  1       ?         0         1        1        1
146    11320-11326  7       ?         0         1        1        1
147    11341-11386  -46     ?         1         0        0        0
148    11406-11950  -545    1         0         0        0        0
149    11504-11507  4       ?         0         1        1        1
150    11594-11623  -30     ?         1         0        0        0
151    11635-11635  -1      ?         1         0        0        0
152    11649-11654  -6      ?         1         0        0        0
153    11867-11868  2       ?         1         0        1        1
154    11965-12038  -74     1         0         0        0        0
155    12047-12139  -93     1         0         0        0        0
156    12065-12075  -11     ?         1         0        0        0
157    12108-12112  -5      ?         1         0        0        0
158    12145-12250  -106    1         0         0        0        0
159    12156-12285  -130    0         1         0        0        0
160    12328-12329  2       0         1         1        1        1
161    12352-12357  -6      1         0         0        0        0
162    12399-12821  -423    0         1         0        0        0
163    12407-12407  -1      1         ?         0        0        0
164    12417-12425  -9      1         ?         0        0        0
165    12456-12460  5       0         ?         1        1        1
166    12513-12517  5       0         ?         1        1        1
167    12542-12548  7       0         ?         1        1        1
168    12580-12588  -9      1         ?         0        0        0
169    12636-12642  7       0         ?         1        1        1
170    12688-12688  -1      1         ?         0        0        0
171    12706-12727  -22     1         ?         0        0        0
172    12790-13096  -307    1         0         0        0        0
173    12829-13137  -309    0         1         0        0        0
174    13146-13221  -76     0         1         0        0        0
175    13216-13222  -7      1         0         0        0        0
176    13220-13221  2       ?         ?         1        0        1
177    13238-13239  -2      0         1         0        0        0
178    13242-13246  -5      1         0         0        0        0
179    13252-13256  5       1         0         1        1        1
180    13264-13264  1       0         1         1        1        ?
181    13264-13269  6       0         0         0        0        1
182    13271-13295  -25     1         0         0        0        0
183    13277-13296  -20     0         0         1        0        0
184    13288-13292  5       ?         0         ?        1        1
185    13311-13311  1       1         0         1        1        1
186    13393-13397  5       0         0         0        0        1
187    13434-13438  5       1         0         1        1        1
188    13453-13460  8       1         0         1        1        1
189    13476-13479  4       0         1         1        1        1
190    13489-13489  -1      0         1         0        0        0
191    13510-13510  -1      1         0         0        0        0
192    14007-14008  -2      1         0         0        0        0
193    14024-14029  6       1         0         1        1        1
194    14049-14049  -1      1         0         0        0        0
195    14148-14152  5       0         1         1        1        1
196    14301-14306  6       1         0         1        1        1
197    14374-14377  4       1         0         1        1        1
198    14416-14416  1       0         1         1        1        1
199    14467-14468  2       0         1         1        1        1
200    14519-14519  1       0         1         1        1        1
201    14961-14961  1       0         ?         0        0        1
202    14961-15523  -563    0         1         0        0        0
203    14981-14985  -5      1         ?         0        0        0
204    14984-14985  2       ?         ?         0        0        1
205    14985-14985  -1      ?         ?         1        0        ?
206    15061-15208  -148    1         ?         0        0        0
207    15068-15073  6       ?         ?         0        0        1
208    15072-15073  -2      ?         ?         0        1        ?
209    15135-15138  4       ?         ?         1        1        0
210    15234-15255  -22     1         ?         0        0        0
211    15259-15259  1       0         ?         0        1        1
212    15272-15278  -7      0         ?         0        1        0
213    15273-15289  -17     1         ?         0        0        0
214    15276-15278  3       ?         ?         0        ?        1
215    15344-15352  -9      1         ?         1        0        0
216    15383-15396  -14     1         ?         0        0        0
217    15388-15396  9       ?         ?         0        0        1
218    15426-15426  -1      0         ?         1        0        0
219    15434-15438  5       1         ?         1        1        0
220    15454-15455  2       ?         ?         ?        1        0
221    15454-15458  -5      1         ?         1        0        0
222    15462-15467  -6      1         ?         1        0        0
223    15538-15542  -5      1         0         0        0        0
224    15554-15570  17      1         1         0        1        0
225    15554-15554  -1      ?         ?         0        ?        1
226    15650-15652  3       0         1         1        1        1
227    15656-15656  -1      0         1         0        0        0
228    15670-15673  4       1         1         0        1        1
229    15714-15715  2       ?         0         1        1        ?
230    15714-15720  7       1         0         0        0        1
231    15740-15741  2       1         0         ?        0        ?
232    15740-15742  1       0         0         1        0        1
233    15746-15748  3       0         1         1        1        1
234    15768-15772  -5      0         1         0        0        0
235    15775-15782  -8      1         0         0        0        0
236    15809-15815  7       0         1         0        1        1
237    15818-15822  -5      0         1         0        1        0
238    15833-15870  -38     1         0         0        0        0
239    15839-15865  -27     ?         1         0        0        0
240    15844-15865  -22     ?         ?         0        1        0
241    15851-15865  15      ?         ?         1        ?        0
242    15891-15908  18      1         1         0        1        1
243    15925-15934  10      0         0         0        0        1
244    15954-15967  -14     1         0         0        0        0
245    15961-15965  5       ?         0         1        1        1
246    15982-15985  4       0         1         1        1        1
247    16008-16017  -10     1         0         1        0        0
248    16023-16023  -1      1         0         1        0        0
249    16062-16062  -1      1         0         0        0        0
250    16069-16069  1       0         1         1        1        1
251    16077-16081  5       0         1         0        1        1
252    16157-16161  5       0         1         1        1        1
253    16220-16223  -4      1         0         0        0        0
254    16233-16238  6       1         0         1        0        1
255    16265-16479  -215    0         1         0        0        0
256    16267-16272  6       0         ?         1        ?        ?
257    16267-16273  1       0         ?         0        1        1
258    16292-16301  10      0         ?         1        1        1
259    16330-16334  -5      1         ?         0        0        0
260    16349-16349  -1      1         ?         0        0        0
261    16388-16393  6       0         ?         0        0        1
262    16444-16444  1       1         ?         0        1        1
263    16494-16514  21      0         1         1        1        1
264    16535-16535  -1      0         1         0        0        0
265    16563-16563  1       0         1         1        1        1
266    16585-16588  -4      1         0         0        0        0
267    16608-16616  -9      ?         1         0        ?        0
268    16608-16617  -10     ?         0         0        1        0
269    16608-16712  -105    1         0         0        0        0
270    16617-16617  1       ?         0         0        ?        1
271    16723-16780  -58     1         0         0        0        0
272    16731-16732  -2      ?         1         0        0        0
273    16743-16751  -9      ?         1         0        0        0
274    16780-16780  -1      ?         1         0        0        0
275    16790-16795  -6      1         0         0        0        0
276    17282-17282  -1      1         0         0        0        0
277    18013-18017  5       1         0         1        1        1
278    18021-18022  2       0         1         1        1        1
279    18432-18494  -63     1         0         0        0        0
280    18462-18463  2       ?         0         1        1        1
281    18480-18485  -6      ?         0         0        1        0
282    18481-18485  -5      ?         1         0        ?        0
283    18592-18675  -84     0         1         0        0        0
284    18595-18601  7       0         ?         1        1        1
285    18617-18621  5       0         ?         1        1        1
286    18640-18641  -2      0         ?         1        ?        0
287    18640-18648  -9      0         ?         0        1        0
288    18674-18675  2       0         ?         1        1        1
289    18698-18703  6       0         0         1        1        1
290    18703-18703  -1      0         1         ?        ?        ?
291    18723-18723  -1      0         1         0        0        0
292    18726-18727  delete this one  0  1       0        0        0
293    18727-18727  1       0         ?         1        1        1
294    18745-18746  2       0         1         1        1        1
295    18753-18768  16      0         1         1        1        1
296    18797-18815  -19     1         0         0        0        0
297    18897-18907  -11     1         0         0        0        0
298    19027-19035  -9      1         0         0        0        0
299    19035-19035  -1      ?         1         0        0        0
300    19087-19096  10      0         1         1        1        1
301    19133-19137  5       1         ?         0        0        ?
302    19133-19138  6       0         1         0        0        ?
303    19133-19141  9       0         0         0        0        1
304    19234-19238  5       1         0         1        1        1
305    19253-19255  -3      0         0         1        0        0
306    19255-19255  1       1         1         ?        0        1
307    19262-19270  -9      1         0         0        0        ?
308    19262-19511  250     0         0         0        0        1
309    19274-19414  -141    0         1         0        0        ?
310    19277-19277  1       1         ?         0        0        ?
311    19341-19342  2       1         ?         0        0        ?
312    19424-19428  5       1         0         0        0        ?
313    19434-19436  3       ?         0         1        1        ?
314    19434-19439  6       1         0         0        0        ?
315    19471-19485  15      0         1         1        1        ?
316    19505-19509  -5      1         0         0        0        ?
317    19516-19521  -6      1         0         0        0        0
318    19532-19542  -11     1         0         0        0        0
319    19593-19595  3       0         1         1        1        1
320    19608-19608  -1      1         0         0        0        0
321    19611-19611  1       1         0         1        1        1
322    19649-19651  3       0         1         1        1        1
323    19692-19692  -1      1         0         0        0        0
324    19715-19723  -9      1         0         0        0        0
325    19752-19758  7       1         0         1        1        1
326    19828-19828  -1      1         0         0        0        0
327    19842-19848  -7      0         1         0        0        0
328    19843-19843  -1      1         ?         0        0        0
329    19886-19895  -10     1         0         0        0        0
330    19917-19921  5       0         1         1        1        1
331    19985-19990  -6      1         0         0        0        0
332    20027-20114  -88     1         0         0        0        0
333    20080-20089  10      ?         0         1        1        1
334    20125-20164  -40     1         0         0        0        0
335    20169-20211  -43     1         0         0        0        0
336    20193-20193  -1      ?         0         1        0        0
337    20346-20351  6       0         1         1        1        1
338    20373-20381  -9      1         0         0        0        0
339    20428-20429  -2      1         0         0        0        0
340    20459-20459  1       0         1         0        1        1
341    20492-20498  7       0         1         1        1        ?
342    20492-20601  110     0         0         0        0        1
343    20508-20508  -1      0         1         0        0        ?
344    20538-20700  -163    1         0         0        0        0
345    20584-20584  -1      ?         1         0        0        ?
346    20700-20700  1       ?         0         1        1        1
347    20707-20713  -7      1         0         0        0        0
348    20773-20778  -6      1         0         0        0        0
349    20788-20793  -6      1         0         0        0        0
350    20804-20809  -6      1         0         0        0        0
351    20832-20837  6       0         1         1        1        1
352    20857-20857  1       0         1         1        1        1
353    20894-20898  -5      1         0         0        0        0
354    20928-20928  1       0         1         1        1        1
355    20954-20971  18      0         1         1        1        1
356    20983-20989  7       1         ?         0        ?        ?
357    20983-20996  14      0         1         0        ?        ?
358    20983-21001  19      0         0         0        1        1
359    21128-21149  22      1         0         1        1        1
360    21587-21587  1       1         0         1        0        1
361    21967-21967  1       0         1         1        1        1
362    22097-22097  1       0         1         1        1        1
363    22504-22505  2       0         1         1        1        1
364    22560-22561  2       0         1         1        1        1
365    22657-22657  1       0         1         0        1        1
366    22792-22794  -3      1         0         0        0        0
367    23194-23202  -9      1         0         0        0        0
368    23566-23566  -1      1         0         0        0        0
369    23872-23872  -1      1         1         0        0        0
370    23913-23913  -1      0         1         0        0        0
371    23933-23935  3       1         1         1        1        0
372    24135-24135  -1      ?         0         1        0        0
373    24135-24152  -18     1         0         0        0        0
374    24155-24164  10      ?         0         1        1        1
375    24155-24187  -33     1         0         0        0        0
376    24194-24196  3       1         0         0        0        0
377    24195-24196  2       ?         0         1        1        1
378    24201-24205  -5      1         0         0        0        0
379    24205-24205  -1      ?         1         0        0        0
380    24241-24241  -1      ?         1         0        0        0
381    24241-24250  -10     1         0         0        0        0
382    24277-24287  -11     1         0         0        0        0
383    24279-24288  -10     0         0         1        0        0
384    24303-24304  2       1         0         1        1        1
385    24464-24612  -149    0         1         0        0        0
386    24488-24501  -14     ?         ?         1        0        0
387    24488-24509  -22     1         ?         0        0        0
388    24497-24501  -5      ?         ?         ?        1        0
389    24543-24545  3       1         ?         1        0        1
390    24604-24608  -5      1         ?         0        0        0
391    24678-24679  2       1         0         1        1        1
392    24684-24687  4       0         1         1        1        1
393    24700-24704  -5      0         1         0        0        0
394    24711-24711  -1      1         0         0        0        0
395    24731-24731  1       1         0         1        1        1
396    24778-24778  1       1         0         1        1        1
397    24805-24807  4       0         1         1        1        1
398    24829-24829  1       0         1         1        1        1
399    24875-24875  1       0         1         1        1        1
400    24889-24897  -9      1         0         1        0        0
401    24911-24917  -7      0         1         1        0        0
402    24923-24923  1       0         1         1        1        1
403    24964-24969  6       0         ?         0        0        1
404    24964-25164  -201    0         1         0        0        0
405    24969-24969  1       1         ?         1        0        ?
406    24988-24989  -2      1         ?         0        0        0
407    25045-25050  6       0         ?         0        0        1
408    25049-25050  -2      0         ?         0        1        ?
409    25066-25070  5       0         ?         1        0        1
410    25069-25070  3       1         ?         ?        0        ?
411    25174-25207  -34     0         1         0        0        0
412    25188-25196  -9      1         ?         0        0        0
413    25201-25204  -4      1         ?         0        0        0
414    25220-25241  -22     0         1         0        0        0
415    25254-25255  2       ?         1         0        1        1
416    25254-25260  -7      1         0         0        0        0
417    25271-25271  -1      1         0         0        0        0
418    25285-25288  -4      1         0         0        0        0
419    25293-25302  -10     0         0         1        0        0
420    25358-25363  5       ?         1         0        ?        ?
421    25358-25368  11      1         1         0        1        1
422    25408-25415  8       1         0         1        1        1
423    25425-25426  -2      0         1         0        0        0
424    25437-25437  -1      0         1         0        0        0
425    25465-25465  1       1         0         1        1        1
426    25476-25477  2       1         0         1        1        1
427    25486-25493  -8      0         1         0        0        0
428    25529-25533  -5      1         0         0        0        0
429    25554-25560  -7      1         0         0        0        0
430    25620-25627  -8      1         0         0        0        0
431    25639-25641  3       0         1         1        1        1
432    25645-25657  -13     0         1         0        0        0
433    25683-25687  -5      1         0         0        0        0
434    25732-25766  -35     0         1         0        0        0
435    25742-25748  7       0         ?         1        1        1
436    25797-25816  -10     0         1         0        0        0
437    25807-25816  10      0         ?         1        1        1
438    25866-26074  -209    0         1         0        0        0
439    25890-25895  6       0         ?         1        1        1
440    26171-26179  -9      0         ?         0        1        0
441    26442-26459  -18     1         ?         0        0        0
442    26587-26665  -79     0         ?         1        0        0
443    26605-26616  -12     1         ?         ?        0        0
444    26698-26703  -6      ?         ?         ?        1        0
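A matrix of this form can be summarized per lineage with a few lines of Python. The sketch below is illustrative only: it assumes that a state of 1 marks the taxa carrying the scored indel, that ? marks an unscorable state, and that rows follow the column order of the table. The rows in the example are copied from the table above; the function names are our own, and malformed rows (such as indel 81) would need to be filtered out before parsing.

```python
from collections import Counter

# Column order of the taxon states in Appendix D.
TAXA = ["Epifagus", "Pholisma", "Mimulus", "Ehretia", "Nicotiana"]

# A few rows copied from Appendix D (index, position, length, five states).
ROWS = """\
4 257-259 -3 1 0 0 0 0
5 271-271 -1 0 1 0 0 0
7 732-732 1 0 0 0 0 1
8 743-745 -3 1 0 0 0 0
9 744-745 2 ? 0 1 1 1
12 849-867 -19 0 1 0 0 0
18 1053-1056 4 0 1 1 1 1
"""

def parse_row(line):
    """Split one table row into index, position, signed length and states."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    index, position, length = int(fields[0]), fields[1], int(fields[2])
    states = dict(zip(TAXA, fields[3:]))
    return index, position, length, states

def private_events(rows):
    """Count indels scored as present (1) in exactly one taxon.

    Rows containing any '?' are skipped, since the event cannot be
    assigned to a single lineage with confidence (our assumption).
    """
    counts = Counter()
    for line in rows.strip().splitlines():
        _, _, length, states = parse_row(line)
        if "?" in states.values():
            continue
        carriers = [taxon for taxon, state in states.items() if state == "1"]
        if len(carriers) == 1:
            kind = "insertion" if length > 0 else "deletion"
            counts[(carriers[0], kind)] += 1
    return counts

if __name__ == "__main__":
    for (taxon, kind), n in sorted(private_events(ROWS).items()):
        print(taxon, kind, n)
```

Splitting the tally by the sign of the length column separates insertions from deletions, which is one simple way to compare indel bias between the parasitic and photosynthetic lineages.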

Appendix E. Multiple sequence alignments of pseudogenes in Pholisma, Epifagus, Ehretia and Mimulus

(common pseudogenes of Pholisma and Epifagus are included: atpA, atpB, ndhB, psbB, rpoA)

atpA:

Epifagus  ......
Pholisma  ......
Mimulus   ATGGTAACCA TTCGAGCCGA TGAAATTAGT AATATTATTC GTGAACGTAT
Ehretia   ATGGTAACCA TTCGAGCCGA CGAAATTAGT AATATTATCC GTGAACGTAT

Epifagus  ......
Pholisma  ......
Mimulus   TGAACAATAT AATAGAGAAG TCAAGATTGT AAATACTGGT ACCGTACTTC
Ehretia   TGAACAATAT AATAGAGAAG TAAAGATTGT AAATACCGGT ACCGTACTTC

Epifagus  ......
Pholisma  ......
Mimulus   AAGTAGGCGA TGGTATTGCT CGTATTCATG GTCTTGATGA AGTAATGGCG
Ehretia   AAGTAGGCGA TGGCATTGCT CGTATTCATG GTCTTGATGA AGTAATGGCG

Epifagus  ......
Pholisma  ......
Mimulus   GGTGAATTAG TCGAATTTGA CGAAGGTACA ATAGGTATTG CTCTAAATTT
Ehretia   GGTGAATTAG TAGAATTTGA AGAGGGTACA ATAGGCATTG CTCTGAATTT

Epifagus  ......
Pholisma  ......
Mimulus   GGAATCAAAT AATGTTGGTG TTGTATTAAT GGGTGATGGT TTGCTGATAC
Ehretia   GGAATCAAAT AATGTTGGTG TTGTATTAAT GGGCGATGGT TTGATGATAC

Epifagus  ......
Pholisma  ......
Mimulus   AAGAAGGAAG TTCTGTAAAA GCAACAGGAA GAATTGCTCA GATACCAGTG
Ehretia   AGGAGGGAAG TTCTGTAAAA GCAACAGGAA GAATTGCTCA GATACCAGTG

Epifagus  ......
Pholisma  ......
Mimulus   AGTGGGGCTT ATTTGGGTCG TGTTATAAAC GCCCTAGCTA AACCTATTGA
Ehretia   AGTGAGGCCT ATTTGGGTCG TGTTATAAAT GCTCTGGCTA AACCTATTGA

Epifagus  ......
Pholisma  ......
Mimulus   TGGTAGAGGT GAAATTCCAG CTTCTGAATC TCGATTAATT GAATCTCCCG
Ehretia   TGGTAGAGGT GAAATTTCAG CTTCTGAATC TCGATTAATT GAATCTCCTG

Epifagus  ......
Pholisma  ......
Mimulus   CGCCAGGTAT TATTTCCCGG CGTTCCGTAT ACGAGCCCCT TCAAACCGGG
Ehretia   CTCCAGGTAT TATTTCGCGG CGTTCCGTAT ATGAACCTCT TCAAACCGGG

Epifagus  ......
Pholisma  ......
Mimulus   CTTATTGCTA TTGATTCCAT GATCCCTATA GGACGTGGTC AGCGAGAATT
Ehretia   CTTATTGCTA TTGATTCGAT GATCCCTATA GGACGTGGTC AGCGAGAATT

Epifagus  ......
Pholisma  ...... ACATATATGA
Mimulus   AATTATTGGA GACAGGCAGA CCGGTAAAAC AGCAGTAGCC ACAGATACGA
Ehretia   AATTATTGGG GACAGGCAGA CCGGTAAAAC AGCAGTAGCC ACAGATACGA

Epifagus  TTTTTAA...... GTAAG TATGTAAT...... T
Pholisma  TTCTCAATCA ACAAGTTAAA AATGTAATAT GTGTTTATGT AGCTATTGGT
Mimulus   TTCTCAATCA ACAAGGTCAA AATGTAATAT GTGTTTATGT AGCTATTGGG
Ehretia   TTCTTAATCA ACAAGGTCAA AATGTAATAT GCGTTTATGT AGCTATTGGT

Epifagus  CAACAACAAT TTTACTTACT ....ATTTTA AAAATTTTAT ATT......
Pholisma  CAAAAAGCAT CTTAG...... CAGGTAGTA AATACTTTAC ATAAAAGGGT
Mimulus   CAAAAAGCAT CTTCTGTGGC TCAGGTAGTA ACTACTTTAC AGGAAAGGGG
Ehretia   CAAAAAGCAT CTTCTGTGGC CCAGGTAGTA AATACTTTAC AGGAAAGGGG

Epifagus  ...... A TATATTAGTA GTAGTATAAG CTGAAACATC ...... TC
Pholisma  CG...... A TACACTATTT GTAGT...AT CCG.AACGGC GGATTCCACC
Mimulus   CGCGATGGAA TACACTA.TT GTGGT...AG CCGAAACGGC GGATT...CC
Ehretia   GGCGATGGAA TACACTA.TT GTGGT...AG CCGAAACGGC GGATT...CC

Epifagus  CCGATTAAGT T......
Pholisma  CTTGCTACAT TACAA......
Mimulus   CCTGCTACAT TACAATACCT CGCTCCTTAT ACAGGAGCTG CCTTGGCTGA
Ehretia   CCTGCTACAT TACAATATCT CGCTCCTTAT ACAGGAGCAG CTCTGGCTGA

Epifagus  ...... TAAAT TTGGTAAGTT AAGTTAGTAA GGCGGTCCCT
Pholisma  ...... TGAAC GACACACTTT AATCATTTAT GATGATCCTT
Mimulus   ATATTTTATG TACCGTAAAC AACACACTTT AATCATTTAT GATGATCCCT
Ehretia   ATATTTTATG TACCGTGAAC GACACACTTC AATCATTTAT GATGATCCCT

Epifagus  TCAACTAAAC TAGAACTTAA T......
Pholisma  CCAAACAAGC ACAAGCTTAT CGCCAAATTT ATTTTATATT ATGAAGACTC
Mimulus   CCAAACAAGC CCAAGCTTAT CGGCAAATGT CTCTTCTATT ACGAAGACCA
Ehretia   CCAAACAAGC ACAAGCTTAT CGCCAAATGT CTCTTCTATT ACGAAGACCC

Epifagus  ...... TT ATTTATT......
Pholisma  ACAGGGCATG AA..TTATC. ..AGGGTATT TTTTTTT... .TTAACGCC.
Mimulus   CCCGGCCGCG AAGCTTATCC AGGGGATGTT TTTTATTTGC ATTCACGCCT
Ehretia   CCTGGCCGCG AAGCTTATCC AGGGGATGTT TTTTATTTGC ATTCACGCCT

Epifagus  ...... AA TTTAGTGG......
Pholisma  ....GAAAGA TACACAAAAT TGAGTTATAG TTTAGGTGAA TAAAGTATGA
Mimulus   TTTGGAAAGA GCCGCTAAAT CAAGTTCTAG TTTAGGTGAA GGAAGTATGA
Ehretia   TTTGGAAAGA GCCGCTAAAT CAAGTTCTAG TTTAGGTGAA GGAAGTATGA

Epifagus  .CTCTTTACT AAA...... TACG
Pholisma  CCGCCTTACC AACAGTTGAA ACTCTCAATT GGTAGATGTT TCGTCTTATA
Mimulus   CCGCCTTACC TATAGTGGAA A..CTCAATC GGGAGATGTT TCGGCTTATA
Ehretia   CCGCCTTACC AATAGTTGAA A..CTCAATC GGGAGATGTT TCGGCTTATA

Epifagus  TTAATACAAA TA...... AAAAC AT......
Pholisma  TTTATACTAA TGGAAGAATT TCCATTACTG ATGGACAAAT ATTATTACCT
Mimulus   TTCCTACAAA TG...TAATT TCCATTACCG ATGGACAAAT ATTTTTATCT
Ehretia   TTCCTACAAA TG...TAATT TCCATTACTG ATGGACAAAT ATTCTTATCT

Epifagus  ...... CC CCTGAATAAT ATAAGCTTCA
Pholisma  GC...... CCATGCTGT AATCAGACCT ACTATAAAAT GTGGGCATCT
Mimulus   GCCGATTTAT TCAATTCTGG AATCAGACCT GCTAT.TAAT GTGGGGATCT
Ehretia   GCCGATCTAT TCAATGCTGG AATCAGACCC GCTAT.TAAT GTGGGTATCT

Epifagus  C......
Pholisma  TC...... CAGCTCAAA TTAAAGTCAT GAAACAAGTA
Mimulus   CCGTTTCCAG AGTGGGGTCT GCAGCTCAAA TTAAAGCTAT GAAACAAGTA
Ehretia   CCGTTTCCAG AGTGGGGTCT GCAGCTCAAA TTAAAGCCAT GAAACAAGTA

Epifagus  ...... GGCCGGG T......
Pholisma  GTTGTTAAAT TAAAATAGGA ACTGAACTGG TGAAATTTGC AGAATTTGAA
Mimulus   GCTGGTAAAT TAAAATTG.. ...GAACTGG CACAATTTGC AGAATTAGAA
Ehretia   GCTGGTAAAT TAAAATTG.. ...GAACTGG CGCAATTTGC AGAATTAGAA

Epifagus  ...... GG TCTTCATAAT
Pholisma  ACCTTTGCAC AGCACAATTT TCTTCTTATT TAGATAAAGA TACTCATAAT
Mimulus   GCTTTT.... .GCACAATTT GCTTCAGATC TTGATAAAGC TACTCAGAAT
Ehretia   GCCTTT.... .GCACAATTT GCTTCTGATC TCGATAAAGC TACTCAGAAT

Epifagus  ...... ATAA GA......
Pholisma  CAATTGACAA GAGGTCAACA ATTACGTGAA ......
Mimulus   CAATTGGCAA GAGGTCAACG ATTACGTGAA TTGCTTAAAC AATCCCAAGC
Ehretia   CAATTGGCAA GAGGTCAACG ATTACGTGAA TTGCTTAAAC AATCCCAAGC

Epifagus  ......
Pholisma  ......
Mimulus   AGCTCCGCTT ACAGTAGAAG AACAGATAAT GACTATTTAT ACCGGAACAA
Ehretia   AGCTCCTCTC GCGGTGGAAG AACAGATAAT GACTATTTAT ACCGGAACAA

Epifagus  ......
Pholisma  ......
Mimulus   ACGGTTATCT TGATTCATTA GAAATTGGAC AGGTAAGGAA ATTTCTTGTT
Ehretia   ACGGTTATCT TGATTCATTA GAAATTGGAC AGGTAAGGAA ATTTCTTGTT

Epifagus  ......
Pholisma  ......
Mimulus   GAATTACGTA CTTACTTAAA AACTAATAAA CCTCAATTCC AAGAAATCAT
Ehretia   GAGTTACGTA CTTACGTAAA AACGAATAAG CCTCAGTTCC AAGAAATCAT

Epifagus  ......
Pholisma  ......
Mimulus   ATCTTCTACT AAGACATTTA CTGAGGAAGC AGAATTCCTT TTGAAAGAAG
Ehretia   ATCTTCTACC AAGACATTTA CTGAGGAAGC AGAAGCCCTT TTGAAAGAAG

Epifagus  ......
Pholisma  ......
Mimulus   CTATTCAGGA ACAAATGGAC CGGTTTCTAC TTCAAGAACA AGCA...
Ehretia   CTATTCAAGA ACAAATGGAC CGTTTTCTAC TTCAAGAACA AGCATAA

atpB:

Epifagus  ......
Pholisma  ...... GTGTTTCCA CGCT.....A
Mimulus   ATGAGAATTA ATCCTACTAC TTCTGGTTCT GGGGTTTCCA CGCTTGAAAA
Ehretia   ATGAGAATCA ATCCTACAGC TTCTGGTTCT GGGGTTTCCA CGCTTGAAAA

Epifagus  ......
Pholisma  AAAAAACCTG GGGCGTGTCA TCCAAATCAT TGGTCCGGTA CTCGATGTTG
Mimulus   AAAAAACCTG GGGCGTATCG TCCAAATCAT AGGTCCGGTA CTAGATGTAG
Ehretia   AAAAAACCTG GGGCGTATCG TCCAAATCAT CGGTCCGGTA CTAGATGTCG

Epifagus  ......
Pholisma  CCTTTCCGCC AGGAAACATA CCTAATATTT ATAGCG.TCT GGTACTTTAA
Mimulus   CCTTTTCGCC GGGCAAGATG CCTAATATTT ATAACGCTCT AGTAGTTAAA
Ehretia   CCTTTCCGCC GGGCAAGATG CCTAATATTT ATAACGCTCT GGTAGTTAAA

Epifagus  ..TCGACATA AT......
Pholisma  GGTCGAGATA CTGTAGGTCA ACCAATTAAT CTGACTTGTG AGGTACAGCT
Mimulus   GGCCGAGATA CTGCTGGTCA ACCAATTAAT GTGACTTGTG AGGTACAGCA
Ehretia   GGTCGAGATA CTGTTGGTCA ACCAATTAAT GTGACTTGTG AGGTACAGCA

Epifagus  .CTACTAAAA AA......
Pholisma  ATTATTAGGA AATAATCGAG TTAGAGATGT GGCGATGAGT GCTACAGATG
Mimulus   ATTATTAGGA AATAATCGAG TTAGAGCTGT AGCTATGAGT GCTACAGATG
Ehretia   ATTATTAGGA AATAATCGAG TTAGAGCTGT AGCTATGAGT GCTACAGATG

Epifagus  ......
Pholisma  GTATAAGTAT AAGGAGGGGA ATGTAAGTGA TTGAC......
Mimulus   G...... TCT GACGAGAGGA ATGGAAGTGA TTGATACGGG AGCTCCTCTA
Ehretia   G...... TCT AACGAGGGGA ATGGAAGTGG TTGACACAGG AGCTCCTCTA

Epifagus  ......
Pholisma  ......
Mimulus   AGTGTTCCAG TCGGTGGAGC GACTCTGGGG CGAATTTTCA ACGTACTTGG
Ehretia   AGTGTTCCGG TTGGTGGAGC GACTCTGGGA CGAATTTTCA ACGTGCTTGG

Epifagus  ......
Pholisma  ......
Mimulus   AGAGCCTGTT GATAATTTAG GTCCTGTAGA TACTAGTACA ACATTTCCTA
Ehretia   AGAGCCTGTT GATAATTTAG GTCCTGTAGA TACTCGTACA ACATCTCCTA

Epifagus  ......
Pholisma  ......
Mimulus   TTCATCGATC TGCACCTGCC TTTATACAGT TAGATACAAA ATTATCTATT
Ehretia   TTCATAGACC TGCACCCGCT TTTATACAGT TAGATACAAA ATTATCTATT

Epifagus  ......
Pholisma  ......
Mimulus   TTTGAAACAG GAATAAAAGT AGTAGATCTT TTAGCACCTT ATCGACGTGG
Ehretia   TTTGAAACAG GAATTAAAGT AGTAGATCTT TTAGCCCCTT ATCGCCGTGG

Epifagus  ......
Pholisma  ...... TGGGTAAA ACAGTACTCA
Mimulus   GGGAAAAATC GGACTATTTG GGGGAGCTGG GGTTGGCAAA ACGGTACTCA
Ehretia   AGGAAAAATC GGACTATTCG GGGGAGCTGG AGTGGGTAAA ACAGTACTCA

Epifagus  ...... AAAG CCTA......
Pholisma  TTATGGAATT GATTAACAAT ATTGACAAAG CCCATGGGGG TGTATCCATA
Mimulus   TTATGGAATT GATTAACAAT ATTGCCAAAG CCCATGGTGG CGTCTCCGTA
Ehretia   TTATGGAATT GATTAACAAT ATTGCCAAAG CTCATGGGGG TGTATCCGTA

Epifagus  ......
Pholisma  C......
Mimulus   TTTGGCGGAG TGGGTGAACG TACTCGTGAA GGAAATGATC TTTACATGGA
Ehretia   TTTGGCGGAG TGGGTGAACG TACTCGCGAA GGAAATGATC TTTACATGGA

Epifagus  ......
Pholisma  ...... AA AATTTTATCA GAATCCATAG
Mimulus   AATGAAAGAA TCTGGAGTGA TTAATGAAGA AAATATTGCA GAATCAAAAG
Ehretia   AATGAAAGAA TCTGGAGTTA TTAATGAAGA AAATATTGCA GAATCCAAAG

Epifagus  ....TTTAGT TTA..GTAAG ACCACCTAAT TATCGGAA......
Pholisma  TGTCTCTAGT TTA..CCTAG ATGAATGTAC CACCAGGAGC TCATATGTAA
Mimulus   TGGCTCTAGT TTACGGCCAG ATGAATGAAC CGCCGGGAGC TCGTATG...
Ehretia   TAGCTCTAGT TTACGGCCAG ATGAATGAAC CGCCAGGAGC TCGTATG...

Epifagus  ...... AG GATTAATTGA AACCCTATAA C......
Pholisma  GTATTAGTAG AGTAGGTTTT ACTACTCTAA CTATGGC.GA ATATTTCCTA
Mimulus   ...... AG AGTTGGTTTG ACTGCCTTAA CGATGGCGGA ATATTTCCGA
Ehretia   ...... AG AGTTGGTTTG ACTGCCCTAA CTATGGCGGA ATATTTCCGA

Epifagus  ......
Pholisma  GATGTTAATG ATAAAGAAG......
Mimulus   GATGTTAATG AACAAGACGT ACTTCTATTT ATCGACAATA TTTTCCGTTT
Ehretia   GATGTTAATG AACAAGACGT ACTTCTATTT ATCGACAATA TCTTCCGTTT

Epifagus  ......
Pholisma  ......
Mimulus   CGTCCAAGCC GGATCCGAAG TATCCGCCTT ATTGGGTAGA ATGCCTTCCG
Ehretia   CGTTCAAGCA GGATCCGAAG TATCGGCCTT ATTGGGTAGA ATGCCTTCCG

Epifagus  ......
Pholisma  ......
Mimulus   CTGTCGGTTA TCAACCCACC CTGAGTACCG AAATGGGCTC TTTACAAGAA
Ehretia   CTGTGGGTTA TCAACCCACC CTGAGTACCG AAATGGGTTC TTTACAAGAG

Epifagus  ......
Pholisma  ......
Mimulus   AGAATTACCT CTACCAAAGA GGGGTCCATA ACGTCTATTC AAGCTGTTTA
Ehretia   AGAATTACTT CTACCAAAGC GGGGTCCATA ACTTCTATTC AAGCAGTTTA

Epifagus  ...... GATTTACAC
Pholisma  ...... ACGTCC TCCTGCTACT AGATTTGCAC
Mimulus   TGTACCTGCA GACGATTTGA CCGATCCTGC CCCTGCTACG ACATTTGCAC
Ehretia   TGTACCCGCA GATGATTTGA CCGACCCTGC TCCTGCTACG ACATTTGCAC

Epifagus  ...... TGAAAGGATT AATT....G.
Pholisma  ATTTAGATGC TACTACTGTA CCTGTACTAT CAAGAGTATT CGCTGACAGC
Mimulus   ATTTAGATGC TACTA...... CCGTACTAT CAAGAGGATT AGCT....GC
Ehretia   ATTTAGATGC TACTA...... CCGTACTAT CAAGAGGATT AGCT....GC

Epifagus  ...... AAACCC......
Pholisma  CAAAGGAATC GATCCAGCAG TAGATGC......
Mimulus   CAAAGGTATT TATCCAGCAG TAGATCCTTT AGATTCAACT TCAACCATGC
Ehretia   CAAAGGGATC TATCCAGCAG TAGATCCTTT AGATTCAACG TCAACCATGC

Epifagus  ......
Pholisma  ......
Mimulus   TTCAACCTCG GATCGTTGGT GAGGAACATT ATGAAACTGC GCAAAGAGTT
Ehretia   TTCAACCTCG GATCGTTGGT GAGGAACATT ATGAAACTGC GCAAAGAGTG

Epifagus  ...... TA
Pholisma  ...... TTA TAAAGAACTT CAAGACATTC ATTATAGCTA
Mimulus   AAACAAACTT TACAACGTTA TAAAGAGCTT CAGGACAT...... TA
Ehretia   AAGCAAACTT TACAACGTTA TAAAGAACTT CAGGACAT...... TA

Epifagus  TAACGAT..T TACACTAGAC CAATT......
Pholisma  TAGCTATCCT TGGATTAGAC GAATTA......
Mimulus   TAGCTATCCT TGGGTTGGAC GAATTATCTG AAGAGGATCG TTTAACCGTA
Ehretia   TAGCTATCCT TGGGTTGGAC GAATTATCTG AAGAGGATCG TTTAACCGTA

Epifagus  ......
Pholisma  ......
Mimulus   GCAAGAGCGC GAAAAATTGA GCGTTTCTTA TCACAACCCT TTTTTGTAGC
Ehretia   GCAAGAGCGC GAAAAATTGA GCGTTTCTTA TCACAACCCT TTTTCGTAGC

Epifagus  ......
Pholisma  ......
Mimulus   TGAAGTATTT ACTGGTTCTC CAGGGAAATA TGTTGGTCTA GCAGAAACCA
Ehretia   AGAAGTATTT ACCGGTTCTC CAGGGAAATA TGTTGGTCTA GCAGAAACAA

Epifagus  ......
Pholisma  ......
Mimulus   TTAGAGGGTT TCAATTGATC CTTTCCGGAG AATTAGATGG TCTTCCTGAA
Ehretia   TTAGAGGGTT TCAATTGATC CTTTCCGGAG AATTAGATGG TCTTCCTGAA

Epifagus  ...... TACC......
Pholisma  ...... AAG GAAGCTACCG CGAAGGCTAT
Mimulus   CAGGCCTTTT ATTTGGTAGG TAATATCGAT GAAGCTACCG CGAAAGCTAT
Ehretia   CAGGCCTTTT ATTTGGTAGG TAATATCGAT GAAGCTACCG CGAAGGCTAT

Epifagus  ...... TGGAGAA......
Pholisma  TAACTTCGAA ATGAAGAGCA ATTTGAATAA ATG.
Mimulus   CAACTTAGAA ATGGAGAGCA ATTTGAAGAA A...
Ehretia   GAACTTAGAA ATGGAGAGCA ATTTGAAGAA ATGA

163 ndhB: Epifagus ...... Pholisma ...... Mimulus ATGATCTGGC ATGTACAGAA TGAAAACTTC ATTCTCGATT CTACGAGAAT Ehretia ATGATCTGGC ATGTACAGAA TGAAAACTTC ATTCTCGATT CTACGAGAAT

Epifagus ...... Pholisma ...... Mimulus TTTTATGAAA GCCTTTCATT TGCTTCTCTT CGATGGAAGT TTGATTTTCC Ehretia TTTTATGAAA GCCTTTCATT TGCTTCTCTT CGATGGAAGT TTGATTTTCC

Epifagus ...... Pholisma ...... Mimulus CAGAATGTAT CCTAATTTTT GGTCTAATTC TTCTTCTGAT GATTGATTCA Ehretia CAGAATGTAT CCTAATTTTT GGCCTAATTC TTCTTCTGAT GATCGATTCA

Epifagus ...... Pholisma ...... Mimulus ACCTCTGATC AAAAAGATAT ACCTTGGTTA TATTTCATCT CTTCAACAAG Ehretia ACCTCTGATC AAAAAGATAT ACCTTGGTTA TATTTCATCT CTTCAACAAG

Epifagus ...... Pholisma ...... Mimulus TTTAGTAATG AGCATAACGG CCCTATTGTT CCGATGGAGA GAAGAACCTA Ehretia TTTAGTAATG AGCATAACGG CCCTATTGTT CCGATGGAGA GAAGAACCTA

Epifagus ...... Pholisma ...... Mimulus TGATTAGCTT TTCGGGAAAT TTCCAAACGA ACAATTTCAA CGAAATCTTT Ehretia TGATTAGCTT TTCGGGAAAT TTCCAAACGA ACAATTTCAA CGAAATCTTT

164

Epifagus ...... Pholisma ...... Mimulus CAATTTCTTA TTTTACTATG TTCAACTCTA TGTATTCCTC TATCCGTAGA Ehretia CAATTTCTTA TTTTACTATG TTCAACTCTA TGTATTCCTC TATCCGTAGA

Epifagus ...... Pholisma ...... Mimulus GTACATTGAA TGTACAGAAA TGGCTATAAC AGAGTTTCTC TTATTCGTAT Ehretia GTACATTGAA TGTACAGAAA TGGCTATAAC AGAGTTTCTC TTATTCGTAT

Epifagus ...... Pholisma ...... Mimulus TAACAGCTAC TCTAGGAGGA ATGTTTTTAT GCGGTGCTAA CGATTTAATA Ehretia TAACAGCTAC TCTAGGAGGA ATGTTTTTAT GCGGTGCTAA CGATTTAATA

Epifagus ...... Pholisma ...... Mimulus ACTATCTTTG TAGCTCCAGA ATGTTTCAGT TTATGCTCTT ACCTATTATC Ehretia ACTATCTTTG TAGCTCCAGA ATGTTTCAGT TTATGCTCCT ACCTATTATC

Epifagus ...... Pholisma ...... Mimulus TGGATATACC AAGAAAGATG TACGGTCTAA TGAGGCTACT ATGAAATATT Ehretia TGGATATACC AAGAAAGATG TACGGTCTAA TGAGGCTACT ATGAAATATT

Epifagus ...... Pholisma ...... Mimulus TACTCATGGG TGGGGCAAGC TCTTCTATTC TGGTTCATGG TTTCTCTTGG Ehretia TACTCATGGG TGGGGCAAGC TCTTCTATTC TGGTTCATGG TTTCTCTTGG

Epifagus ...... Pholisma ...... Mimulus CTATATGGTT TATCCGGGGG AGAGATCGAG CTTCAAGAAA TAGTGAATGG Ehretia CTATATGGTT CATCCGGGGG AGAGATCGAG CTTCAAGAAA TAGTGAATGG

165

Epifagus ...... Pholisma ...... Mimulus TCTTATCAAT ACACAAATGT ATAACTCCCC AGGAATTTCA ATTGCGCTCA Ehretia TCTTATCAAT ACACAAATGT ATAACTCCCC AGGAATTTCA ATTGCGCTCA

Epifagus ...... Pholisma ...... Mimulus TATTCATCAC TGTAGGAATT GGGTTCAAGC TTTCCCCAGC CCCTTCTCAT Ehretia TATTCATCAC TGTAGGAATT GGGTTCAAGC TTTCCCCAGC CCCTTCTCAT

Epifagus ...... AGATTC GTCGTTCCTG ACCTTGCTTC Pholisma ...... A CTCTGACTTT CCCACTCCAG TCGTTTCTTT Mimulus CAATGGACTC CTGACGTATA CGAAGGATCT CCCACTCCAG TCGTTGCTTT Ehretia CAATGGACTC CTGACGTATA CGAAGGATCT CCCACTCCAG TCGTTGCTTT

Epifagus ACCTTAATTG TTAATT...... GTT...... ATTTTAA CAAGTAAAA. Pholisma TCTTT..CTG TTACTTCGAA AGTAGCT...... GCTTCAG CCACTCGAA. Mimulus TCTTT..CTG TTACTTCGAA AGTAGCTGCT TCAGCTTCAG CCACTCGAAT Ehretia TCTTT..CTG TTACTTCGAA AGTAGCTGCT TCAGCTTCAG CCACTCGAAT

Epifagus ...... ATT CTGTCTTGGT CCGAG...... TGG Pholisma TTTTTATATT CCTTTTC...... TAG Mimulus TTTCGATATT CCTTTTTATT TCTCATCAAA CGAATGGCAT CTTCTTCTGG Ehretia TTTCGATATT CCTTTTTATT TCTCATCAAA CGAATGGCAT CTTCTTCTGG

Epifagus GGATAGCATT TATCTTCTGC ATG...... TCCAT AGAGTTTTTT Pholisma AAATACTAGC TATTCTTAGC ATGATATTGG GAAATCTCAT TGCTATTACT Mimulus AAATCCTAGC TATTCTTAGC ATGATATTGG GAAATATCAT TGCTATTACT Ehretia AAATCCTAGC TATTCTTAGC ATGATATTGG GAAATCTCAT TGCTATTACT

Epifagus TAAA.AATCC AAAAATATAA GA...... TAT ATAGGTAAGA Pholisma CAAACAAGCA TGAAACGTAT GCTTGCATAT TCATTCGTAC ATAGGTAAAA Mimulus CAAACAAGCA TGAAACGTAT GCTTGCAT.. ..ATTCGTCC ATAGGTCAAA Ehretia CAAACAAGCA TGAAACGTAT GCTTGCAT.. ..ATTCGTCC ATAGGTCAAA

Epifagus ..ATTTATAT AATGAACCCC ACTCCTTCGT ATA..CGAAG GAGTCCAAGG Pholisma TCGGGTATGT AATTA...... TTGGA ATAATTGTTG GAGACTCAAA Mimulus TCGGATATGT AATTA...... TTGGA ATAATTGTTG GAGACTCAAA Ehretia TCGGATATGT AATTA...... TTGGA ATAATTGTTG GAGACTCAAA

Epifagus GGCTGGG... GAAAGCTT.. TAACCCAATT CCTACTCTAC TGTGATTAAT Pholisma TGATGGATAT GCAAGCATGA TAACTTATAT GCTGTTCTAT ....ATCTCC Mimulus TGATGGATAT GCAAGCATGA TAACTTATAT GCTGTTCTAT ....ATCTCC Ehretia TGATGGATAT GCAAGCATGA TAACTTATAT GCTGTTCTAT ....ATCTCC

Epifagus ATGAGC...G CAAATTTAAT TCCTGTGGAG TTAT...... Pholisma ATGAATCTAG GAACTTTCG...... Mimulus ATGAATCTAG GAACTTTTGC TTGCATTGTA TTATTTGGTC TACGTACCGG Ehretia ATGAATCTAG GAACTTTTGC TTGCATTGTA TTATTTGGTC TACGTACCGG

Epifagus ...... AC ATTTGCC...... TATT GTATTGAAAA GACCATTCAC Pholisma ...... AG ATTCTGCAGG GATTAGGATT ATACACGAAA GATCCTTTAT Mimulus AACTGATAAC ATTCGAGATT ATGCAGGATT ATACACGAAA GATCCTTT.T Ehretia AACTGATAAC ATTCGAGATT ATGCAGGATT ATACACGAAA GATCCTTT.T

Epifagus TATATTCTTG TTCTTGAAGT TCGATCTCTC CCCCGGATAA ACAATAGAAA Pholisma TTGGCTCTCT CTTT...AGC CCTATGTCTC ...... TTA TCCCTAGGGG Mimulus TTGGCTCTCT CTTT...AGC CCTATGTCTC ...... TTA TCCCTAGGAG Ehretia TTGGCTCTCT CTTT...AGC CCTATGTCTC ...... TTA TCCCTAGGAG

Epifagus ....CCATGA ACCAGAA...... TAGAAGAGCT TGCCCCACCC Pholisma GTCTTCTT.. ...AGCAGGT TTTTGTTTTT TCGGAAAACT CTATTTATTC Mimulus GTCTTCCTCC ACTAGCAG...... GTTTTT TCGGAAAACT CTATTTATTC Ehretia GTCTTCCTCC ACTAGCAG...... GTTTTT TCGGAAAACT CTATTTATTC

Epifagus ATGAGTCAAT ATTAAATA...... TTTAATAGT Pholisma TGGTGTGGAT GGCAGGCAGG CCTATATTTA TATTTATTGG TTTTAATAGG Mimulus TGGTGTGGAT GGCAGGCAGG CC...... TA TATTCCTTGG TTTTAATAGG Ehretia TGGTGTGGAT GGCAGGCAGG CC...... TA TATTTCTTGG TTTTAATAGG

Epifagus AGCCTCATTA GACTGTACAT CTTTCT.TGG CTATCCAGA...... Pholisma ACTCCTTACA AGCCTTGTTT CTATCTACTA TTATCTAAA...... Mimulus ACTCCTTACA AGCGTTCTTT CTATCTACTA TTATCTAAAA ATAATCAAGT Ehretia ACTCCTTACA AGCGTTGTTT CTATCTACTA TTATCTAAAA ATAATCAAGT

Epifagus ...... TAAT AGGTAGGAGC TAAAAAG...... AT Pholisma ...... AGGACGAAAC CAAGAAATAA CCCTTTACGT GTGAAATTAT Mimulus TATTAATGAC TGGACGAAAC CAAGAAATCA CCCCTCACGT GCGA...... Ehretia TATTAATGAC TGGGCGAAAC CAAGAAATAA CCCCTCACGT GCGA......

Epifagus AGTTATTAAA .....TCGTT AGCACCACAT AAAAACATT...... Pholisma AATTATAGAG GATCTCCTTT TAGATCAAAC AATTCCATCG AATTGAGTAT Mimulus AATTATCGAA GATCTCCTTT AAGATCAAAC AATTCCATCG AATTGAGTAT Ehretia AATTATAGAA GATCCCCTTT AAGATCAAAC AATTCCATCG AATTGAGTAT

Epifagus ...... C CTCCTAGAGT AGTAAAATAA A...... Pholisma GATTGTATGT GTGATAGCAT CTACTATACC AGGAATATCA ATGAACCCGA Mimulus GATTGTATGT GTGATAGCAT CTACTATACC AGGAATATCA ATGAACCCAA Ehretia GATTGTATGT GTGATAGCAT CTACTATACC AGGAATATCA ATGAACCCGA

Epifagus ..ACTTCAAT CG...AAGAT AAGCAAATGA AGGGCTTTCA A Pholisma TTATTGCAAT TGCTCATGAT ACCC...... TTTTTTA G Mimulus TTATTGCAAT TGCTCAGGAT ACCC...... TTTTTTA A Ehretia TTATTGCAAT TGCTCAGGAT ACCC...... TTTTTTA G

Epifagus ...... Pholisma .....TTTGC C...... G TGTTCATACC GTTGTATTGA ATGCT.TCGG Mimulus ATGGGTTTGC CTTGGTATCG TGTTCATACC GTTGTATTGA ATGATCCCGG Ehretia ATGGGTTTGC CTTGGTATCG TGTTCATACC GTTGTATTGA ATGATCCCGG

Epifagus ...GTTGCTT TTAACTCATA TAATGCATAC AGCTCGAGTT CCT...... Pholisma TTGTTTGCTT TCTGTTCATA TAATGCATAC AGCTCTGGTT ...... GGG Mimulus CCGGTTGCTT TCCGTTCATA TAATGCATAC AGCTTTGGTT GCTGGTTGGG Ehretia TCGGTTGCTT TCTGTTCATA TAATGCATAC AGCTCTGGTT GCTGGTTGGG

Epifagus ...... Pholisma ATGGTTTTAT CAATCTGTAT GAATTTAAAG TTTTTGGTAC ...... Mimulus CGGGCTCGAT GGCTCTGTAT GAATTAGCAG TTTTTGATCC TTCTGACCCT Ehretia CCGGTTCTAT GGCTCTGTAT GAATTAGCAG TTTTTGATCC TTCTGACCCT

Epifagus ...... Pholisma ...... Mimulus GTTCTTGATC CAATGTGGAG ACAGGGTATG TTCGTTATAC CCTTCATGAC Ehretia GTTCTTGATC CAATGTGGAG ACAGGGCATG TTCGTTATAC CCTTCATGAC

Epifagus ...... Pholisma ...... Mimulus TCGTTTAGGA ATAACCAATT CATGGGGAGG TTGGAGTATC ACAGGAGGGA Ehretia TCGTTTAGGA ATAACAAATT CATGGGGCGG TTGGAGTATC ACAGGAGGAA

Epifagus ...... Pholisma ...... Mimulus CTGTAACAAA TCCAGGGATT TGGAGTTACG AAGGTGTAGC TGGGGCACAT Ehretia CTGTAACGAA TCCGGGTATT TGGAGTTACG AAGGTGTAGC TGGGGCACAT

Epifagus ...... Pholisma ...... Mimulus ATTGTGTTTT CTGGCTTATG CTTTTTGGCA GCTATCTGGC ATTGGGTCTA Ehretia ATTGTGTTTT CCGGCTTATG CTTTTTGGCA GCTATCTGGC ATTGGGTGTA

Epifagus ...... Pholisma ...... Mimulus TTGGGATCTA GAAATATTTT CTGATGAACG TACAGGAAAA CCTTCTTTGG Ehretia TTGGGATCTA GAAATATTTT GCGATGAACG TACAGGAAAA CCTTCTTTGG

Epifagus ...... Pholisma ...... TTTAT TTCTTTCAGA AGTGACTTGC Mimulus ATTTGCCCAA GATCTTTGGA ATTCATTTAT TTCTCTCCGG GGTGGCTTGC Ehretia ATTTGCCCAA GATCTTTGGA ATTCATTTAT TTCTTTCAGG GGTGGCTTGC

Epifagus ...... Pholisma TTTGTTTTCG GTGACATTTT ATGTAATAGG TTTGTACGGT CCCGGAATAT Mimulus TTTGGTTTTG GTG.CATTTC ATGTAACAGG CTTGTATGGT CCTGGAATAT Ehretia TTTGGTTTTG GTG.CATTTC ATGTAACAGG CTTGTATGGT CCTGGAATAT

Epifagus ...... Pholisma GGGTGTCTGA CC...... Mimulus GGGTATCCGA TCCTTATGGA CTAACCGGAA AAGTCCAACC TGTAAATCCG Ehretia GGGTGTCCGA CCCTTATGGA CTAACCGGAA AAGTACAACC CGTAAATCCA

Epifagus ...... Pholisma ...... Mimulus TCGTGGGGCG TGGAAGGTTT TGATCCTTTT GTTCCGGGAG GAATAGCTTC Ehretia GCGTGGGGCG TGGAAGGCTT TGATCCTTTT GTTCCGGGAG GAATAGCCTC

Epifagus ...... Pholisma ...... Mimulus TCATCATATT GCAGCAGGTG CATTGGGTAT ATTAGCGGGT CTATTCCATC Ehretia TCATCATATT GCAGCAGGGA CATTGGGCAT ATTAGCGGGT CTATTCCATC

Epifagus ...... Pholisma ...... CCCTCATAA CGCCTATACA AAGAGTTAC. TATGGGAAAT Mimulus TTAGCGTGCG ACCACCACAA CGTCTATACA AAGGATTGCG TATGGGAAAT Ehretia TTAGCGTCCG CCCGCCACAA CGTCTATACA AAGGGTTGCG TATGGGAAAT

Epifagus ...... Pholisma ATTGAAAGCG TTCTTT...... TTTTTTT TTTGTCTTTT AT.CAACTTT Mimulus ATTGAAACCG TACTCTCCAG TAGTATCGCG GCTGTCTTTT TTGCAGCTTT Ehretia ATTGAAACCG TCCTTTCCAG CAGTATCGCT GCCGTCTTTT TTGCAGCTTT

Epifagus ...... Pholisma TGTTGTT.CC AGAACTATGT GGTATGGCTC AGCAACTACC CC...... Mimulus TGTTGTTGCT GGAACTATGT GGTATGGTTC AGCAACTACC CCTATCGAAT Ehretia TGTTGTTGCC GGAACTATGT GGTATGGCTC AGCAACTACC CCCATCGAAT

Epifagus ...... Pholisma ...... Mimulus TATTTGGGCC CACTCGTTAT CAATGGGATC AGGGTTACTT CCAACAAGAG Ehretia TATTTGGGCC CACTCGTTAT CAATGGGATC AGGGGTACTT CCAACAAGAG

Epifagus ...... Pholisma ...... Mimulus ATATATCGAA GAGTTAGCGC CGGGCTCGCA GAAAATCAAA GTTTCTCAGA Ehretia ATATATCGAA GAGTGAGTGC TGGGCTAGCA GAAAATCAAA GTTTATCAGA

Epifagus ...... Pholisma ...... Mimulus AGCCTGGTCT AAAATTCCTG AAAAATTAGC TTTTTATGAT TATATCGGAA Ehretia AGCCTGGTCT AAAATTCCTG AAAAATTAGC TTTTTATGAT TACATCGGCA

Epifagus ...... Pholisma ...... Mimulus ATAATCCGGC AAAAGGAGGA TTATTCAGGG CAGGCTCAAT GGATAATGGA Ehretia ATAATCCGGC AAAAGGAGGA TTATTCAGGG CCGGCTCAAT GGATAATGGG

Epifagus ...... Pholisma ...... Mimulus GATGGAATAG CGGTTGGATG GTTAGGACAC CCTATCTTTA AAGATAAAGA Ehretia GATGGAATAG CGGTCGGATG GTTAGGACAC CCTATCTTTA GAGATAAAGA

Epifagus ...... Pholisma ...... Mimulus AGGGCGTGAG CTTTTTGTAC GTCGTATGCC TACTTTTTTT GAAACATTTC Ehretia AGGGCGTGAA CTTTTTGTAC GTCGTATGCC TACTTTTTTT GAAACATTCC

Epifagus ...... Pholisma ...... Mimulus CAGTCGTTTT GGTAGACGGC GACGGAATTG TTAGAGCTGA TGTTCCTTTT Ehretia CAGTCGTTTT GGTAGACGGC GACGGAATTG TTAGAGCCGA TGTTCCTTTT

Epifagus ...... Pholisma ...... Mimulus AGGAGGGCAG AATCGAAGTA TAGTGTTGAA CAAGTAGGTG TAACCGTTGA Ehretia AGAAGGGCAG AATCAAAGTA TAGTGTTGAA CAAGTAGGTG TAACTGTTGA

Epifagus ...... Pholisma ...... Mimulus GTTCTATGGC GGCGAACTCA ATGGAGTCAG TTATAGTGAT CCTGTTACCG Ehretia ATTCTACGGC GGCGAACTCA ACGGAGTCAG TTATAGTGAT CCTGCTACTG

Epifagus ...... Pholisma ...... Mimulus TAAAAAAATA TGCTAGACGT GCTCAATTGG GTGAAATTTT TGAATTAGAT Ehretia TGAAAAAATA TGCTAGACGT GCTCAATTGG GTGAAATTTT TGAATTAGAT

Epifagus ...... Pholisma ...... Mimulus CGTGCGACTT TGAAATCCGA TGGTGTTTTT CGTAGCAGTC CAAGGGGTTG Ehretia CGTGCTACTT TGAAATCCGA TGGTGTTTTT CGTAGCAGTC CAAGGGGTTG

Epifagus ...... Pholisma ...... Mimulus GTTTACTTTT GGGCATGCTT CCTTTGCTTT GCTCTTCTTC TTCGGACATA Ehretia GTTTACTTTT GGACATGCTT CATTTGCTTT GCTCTTCTTC TTCGGACACA

Epifagus ...... Pholisma ...... Mimulus TTTGGCATGG TGCTAGAACC TTGTTCAGAG ATGTTTTTGC TGGTATTGAC Ehretia TTTGGCATGG TGCTAGAACC TTGTTCAGAG ATGTTTTTGC TGGTATTGAC

Epifagus ...... Pholisma ...... Mimulus CCAGATTTAG ACGCTCAAGT AGAGTTTGGA GCATTCCAAA AACTTGGGGA Ehretia CCAGATTTGG ATGCTCAAGT AGAATTTGGA GCATTCCAGA AACTTGGAGA

Epifagus ...... Pholisma ...... Mimulus TCCAACTACA AGAAGGCAGG CAGTCTGA Ehretia TCCAACTACA AGAAGACAGG TAGTCTGA

rpoA:

Epifagus ...... GCATCGC AATATCTACT ...... GTGTA Phol_dogma ATGGTTCGAG ATAAAGTGAT AGTATCCACC CACCTGGACA CTCTAGTGGA Mimulus ATGGTTCGAG AGAAAGTAAG AGTATCTACT C....GGACA CTACAGTGGA Ehretia ATGGTTCGAG AGAAAGTAAT AGTATCTACT C....GGACA CTACAGTGGA

Epifagus AACTTG...... GCCTTT...... Phol_dogma AGTGTGTTGA ATCAAGAGCA GCCAGTAAAT TTCTTTTTTT TTAGTGCTTT Mimulus AGTGTGTTGA ATCAAGAATA GACAGTAAAC GTCTTTATTA TGGGCGCTTT Ehretia AGTGTGTTGA ATCAAGAACA GATAGTAAAC GTCTTTATTA TGGACGCTTT

Epifagus ...... CATACATAAG TG...... TAG ACATAATAAA Phol_dogma ATTCTGTATA CATTTATGAA AGGCCAA...... AATAG GCATTGCAAT Mimulus ATTCTGTCTC CACTTATGAA AGGCCAAGCC GACACAATAG GCATTGCGAT Ehretia ATTCTATCTC CACTTATGAA AGGTCAAGCC GATACAATAG GCATTGCGAT

Epifagus ATAAAGCGCT T...... ATAAAAAA AAAGTTTACT ATCTGTTCTT Phol_dogma GGAAAGATAT TTTATTGTAT AAATAGAAGG AACATATCTA AGACGTGTAA Mimulus GAGAAGAGCT TTGCTTGGAG AAATAGAAGG AACATGTATC ACACGTATAA Ehretia GCGAAGAGCT TTGCTTGGAG AAATAGAAGG AACATCTATC ACACGTGTAA

Epifagus GATTCAA...... CACACTTAC GCTGTAGTGT ACAATGTAT. Phol_dogma AATAAAAAGA TGATAAAGTC TCACATGAAT ATTCAACTAT G..GTATATT Mimulus AATCTGA...... GAACGTC CCCCATGAGT ATTCTACTAT AACGGGTATT Ehretia AATCTGAATC TGAGAAAGTC CCACATGAAT ATTCTACCAT ATCGGGTATT

Epifagus ...... Phol_dogma CAAGA...... Mimulus CAAGAATCGG TCCACGAAAT TTTAATGAAT TTAAAAGAAA TTGTATTGAG Ehretia CAAGAATCGG TACATGAAAT TTTAATGAAT TTGAAAGAAA TTGTATTGAG

Epifagus ...... Phol_dogma ...... Mimulus AAGTAATCTA TATGGAATTT GTGATGCGTC TATTTGTATC AAGGGCCCTG Ehretia AAGTAATCTA TATGGAACTT GTGACGCGTC TATTTGTGTT AGGGGTCCTG

Epifagus ...... Phol_dogma ...... Mimulus GATATGTAAC TGCTCAAGAT ATAAGCTTAC CCCCTTATGT AGAAATCGTC Ehretia GATATGTAAC TGCTCAAGAC ATCATCTTAC CGCCTTATGT GGAAATTGTT

Epifagus ...... Phol_dogma ...... Mimulus GACAATACAC AACATATAGC TAGCTTAACA GAACCAATTG ATTTGTGTAT Ehretia GATAATACAC AACATATAGC TAGCTTGACG GAACCAATTG ATTTGTGTAT

Epifagus ...... Phol_dogma ...... Mimulus TGCATTAGAA ATCGAGAGAA ATCGCGGATA TCTTCTAAAA ATGCCACATA Ehretia TGGATTACAA ATCGAGAGAA ATCGAGGATA TCTTATAAAA ACGCCACATA

Epifagus ...... Phol_dogma ...... Mimulus ACTTTCAAGA TGGAAGTTAT CCTATAGATG CTGTATTCAT GCCTGTTCGA Ehretia ACTTTCAAGA TGGAAGTTAT CCTATAGATG CTGTATTCAT GCCTGTTCGA

Epifagus ...... ATAATGTTTC TTTATATAGG ...... Phol_dogma ...... Mimulus AATGTGAATC ATAGTATTCA TTCCTATGGG ...... AATG AAAAACAAGA Ehretia AATGCGAATC ATAGTATTCA TTCTTATGGG AATGGAAATG AAAAACAAGA

Epifagus ...... Phol_dogma ...... Mimulus GATACTCTTT CTCGAAATAT GGACAAATGG GAGTTTAACT CCGAAAGAAG Ehretia GATACTTTTT CTCGAAATAT GGACAAATGG AAGTTTAACT CCGAAAGAAG

Epifagus ...... Phol_dogma ...... Mimulus CACTTCATGA AGCCTCCCGA AATTTGATTG ATTTATTTAT TCCCTTTTTA Ehretia CACTTCATGA AGCCTCTCGG AATTTGATTG ATTTATTTAT TCCTTTTTTA

Epifagus ...... Phol_dogma ...... Mimulus CATAAGGAAG AAGAAAACTT ACCTTTAGAG GACAATCGAC ACACGGTTAC Ehretia CATATGGAAG AAGAAAACGT ACATTTAGAG GACAATCAAC ACACGGTTCC

Epifagus ...... Phol_dogma ...... Mimulus TTTATCTCCT TTTACTTTTC ACGATAAATT GGATAAACTC AGAAAAAACA Ehretia TTTATCCCCT TTTATCTTTC ATGATAAATT GGCCAAACTC AGAAAAAAGA

Epifagus ...... Phol_dogma ...... Mimulus AAAAAAAAAT AGCATTAAAA TCGATTTTTA TTGATCAATC CGAATTGTCT Ehretia AAAAAAAAAT AGCATTGAAA TCGATTTTTA TTGACCAATC AGAATTGCCT

Epifagus ...... AATAT...... Phol_dogma ...... Mimulus CCCAGGATCT ATAATTGCCT CAAAAGGTCG AATATATACA CATTATTAGA Ehretia CCAAGGATCT ATAATTGTCT CAAAAGGTCC AATATATATA CATTATTGGA

Epifagus ...... Phol_dogma ...... Mimulus CCTTTTGAAT AACAGTCAAG AAGATCTTAT GAAAATTGAA CATTTTCACC Ehretia CCTTTTGAAT AACAGTCAAG AAGATCTTAT GAAAATTGAG CATTTTCGCA

Epifagus ...... Phol_dogma ...... Mimulus TAGAAGATAT AAAACAGATA TTGGGCATTC TAGAAAAACA TTTTGCACTT Ehretia TAGAAGATGT AAAACAGATA TTGGGCATTC TAGAAAAACA TTTCGCAATT

Epifagus ...... Phol_dogma ...... Mimulus GATTTACCAA AAAAGAAATT TTAA Ehretia GATTTATCAA AAAGTAAGTT TTAA

Appendix F. Cancer cell line nucleotide phylogenetic tree shown in Fig. 4-2, in phylogram format

Appendix G. Cancer cell line protein tree. Protein distance neighbor-joining tree of tumor cell lines with unique amino acid sequences, rooted by the wtRef sequence. Nodes supported by probabilities of 0.8–1.0 in the Bayesian tree reconstruction are marked with a large red “*”. The positions of Clade A taxa are indicated by a small “*” after the cell line name and a large green bar. The scale bar represents the estimated number of amino acid substitutions per site. Cell line labeling and two-letter abbreviations follow Fig. 4-2.

Appendix H. Cell lines with identical nucleotide sequences.

For each identity group, the first cell line listed (the name followed by a colon) was selected as the representative sequence for phylogenetic analyses.

Identity Group | Cell line | Tissue
1 | 1A2: | B lymphocyte
1 | G_401 | Kidney
1 | RWPE_1 | Prostatic
1 | Toledo | B lymphocyte
2 | A172: | Brain
2 | C_4_I | Cervix
2 | C_4_II | Cervix
3 | BeWo: | Placental
3 | Caki_1 | Kidney
3 | JEG_3 | Placental
3 | MJ | Lymphoblast
4 | CAL_54: | Kidney
4 | CCRF_SB | Lymphoblast
4 | F_36P | Myeloblast
5 | CGTH_W_1: | Thyroid
5 | SW579 | Thyroid
6 | CRO_AP2: |
6 | SF_539 | Brain
7 | CaHPV_10: | Prostate
7 | EKVX | Lung
7 | I_9_2 | Lymphocyte
8 | DBTRG_05MG: | Brain
8 | UACC_62 | Skin
9 | wtRef: | NCBI
9 | DOHH_2 | Peripheral blood
9 | H4 | Brain
9 | NCI_H522 | Lung
9 | RS4_11 | Bone
9 | SW780 | Bladder
10 | HOS: | Bone
10 | KHOS_240S | Bone
11 | L_428: | Lymph node
11 | MEG_01 | Peripheral blood
11 | RPMI_6666 | Peripheral blood
11 | SiHa | Cervix
11 | UACC_812 | Breast
11 | Y79 | Retina
12 | MDA_MB_435: | Breast
12 | UCLA_SO_M14 | Skin
13 | MEC_1: | Peripheral blood
13 | MG_63 | Bone
13 | SCC_15 | Tongue
14 | NCI_H226: | Lung
14 | SUP_B15 | Bone marrow
15 | NCI_H292: | Lung
15 | WIDR | Colon
16 | SH_4: | Skin
16 | SW_982 | Synovial joint
17 | SNB_19: | Brain
17 | U_251_MG | Brain
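
The grouping summarized in Appendix H can be illustrated with a minimal sketch. It is not the procedure used in this dissertation. It assumes that sequences are compared as exact strings after upper-casing and that the first cell line encountered for each distinct sequence serves as the representative. The function name, the record format, and the toy names and sequences are hypothetical.

```python
def group_identical_sequences(records):
    """Group cell-line names that share an identical nucleotide sequence.

    `records` is a list of (name, sequence) pairs in input order.
    Returns (groups, representatives): `groups` lists every identity group
    with more than one member, and `representatives` holds the first name
    seen for each distinct sequence.
    """
    by_sequence = {}  # dicts preserve insertion order in Python 3.7+
    for name, sequence in records:
        key = sequence.upper()  # assumption: letter case is not meaningful
        by_sequence.setdefault(key, []).append(name)
    groups = [names for names in by_sequence.values() if len(names) > 1]
    representatives = [names[0] for names in by_sequence.values()]
    return groups, representatives


# Toy input with made-up names and sequences (not the dissertation data).
toy_records = [
    ("line_A", "ACGTAC"),
    ("line_B", "ACGTAA"),
    ("line_C", "ACGTAC"),
    ("line_D", "acgtac"),
]
groups, representatives = group_identical_sequences(toy_records)
print(groups)
print(representatives)
```

Only the representatives would then be passed on to tree building, so that identical sequences do not appear as zero-length duplicate tips.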

Appendix I. 398 probesets (PLS-DA gene set) with VIP values > 2.5 from PLS-DA SIMCA

Table S6. 398 probesets (PLS-DA gene set) with VIP values > 2.5 from PLS-DA SIMCA
Probeset ID | Gene Symbol | Gene Title | VIP Score
213736_at | COX5B | Cytochrome c oxidase subunit Vb | 4.914
216962_at | RPAIN | RPA interacting protein | 4.85172
202648_at | --- | --- | 4.80471
1567457_at | RAC1 | Ras-related C3 botulinum toxin substrate 1 (rho family, small GTP binding protein Rac1) | 4.33306
230982_at | SOX1 | SRY (sex determining region Y)-box 1 | 4.20471
212924_s_at | LSM4 | LSM4 homolog, U6 small nuclear RNA associated (S. cerevisiae) | 4.08857
231534_at | CDC2 | cell division cycle 2, G1 to S and G2 to M | 4.05947
1553749_at | FAM76B | family with sequence similarity 76, member B | 4.03505
226830_x_at | CHD2 | Chromodomain helicase DNA binding protein 2 | 4.01009
223925_s_at | LOC100130332 /// LOC100134793 /// LUZP6 /// MTPN | myotrophin /// leucine zipper protein 6 /// similar to PRO2474 | 3.97902
206942_s_at | PMCH | pro-melanin-concentrating hormone | 3.97476
221372_s_at | P2RX2 | purinergic receptor P2X, ligand-gated ion channel, 2 | 3.86441
207688_s_at | --- | --- | 3.85148
217383_at | PGK1 | Phosphoglycerate kinase 1 | 3.74901
217426_at | --- | --- | 3.73116
241611_s_at | FNDC3A | fibronectin type III domain containing 3A | 3.7271
224568_x_at | MALAT1 | metastasis associated lung adenocarcinoma transcript 1 (non-protein coding) | 3.71491
1558048_x_at | --- | --- | 3.68277
241084_x_at | DYNC1H1 | dynein, cytoplasmic 1, heavy chain 1 | 3.66234
230156_x_at | CHD2 | Chromodomain helicase DNA binding protein 2 | 3.64854
222831_at | SAP30L | SAP30-like | 3.64814
213409_s_at | RHEB | Ras homolog enriched in brain | 3.61911
222936_s_at | FAM152A | family with sequence similarity 152, member A | 3.60374
229449_at | --- | CDNA FLJ36553 fis, clone TRACH2008478 | 3.58795
228530_at | RP11-11C5.2 | Similar to RIKEN cDNA 2410129H14 | 3.58501
228997_at | TRSPAP1 | tRNA selenocysteine associated protein 1 | 3.58272
203937_s_at | TAF1C | TATA box binding protein (TBP)-associated factor, RNA polymerase I, C, 110kDa | 3.57469
208404_x_at | KCNJ5 | potassium inwardly-rectifying channel, subfamily J, member 5 | 3.5729
218234_at | ING4 | inhibitor of growth family, member 4 | 3.56743
217625_x_at | --- | Homo sapiens, clone IMAGE:3851018, mRNA | 3.54215
222439_s_at | THRAP3 | thyroid hormone receptor associated protein 3 | 3.5404
241838_at | --- | Transcribed locus | 3.52151
233971_at | LOC401565 | similar to 4931415M17 protein | 3.52144
210332_at | LOC100127899 | hypothetical protein LOC100127899 | 3.52064
209114_at | TSPAN1 | tetraspanin 1 | 3.49875
213963_s_at | SAP30 | Sin3A-associated protein, 30kDa | 3.49228
208255_s_at | FKBP8 | FK506 binding protein 8, 38kDa | 3.45537
1559343_at | SNRPN | small nuclear ribonucleoprotein polypeptide N | 3.42147
239368_at | FAM111A | Family with sequence similarity 111, member A | 3.41026
234495_at | KLK15 | kallikrein-related peptidase 15 | 3.40927
235998_at | RHPN1 | rhophilin, Rho GTPase binding protein 1 | 3.39819
204536_s_at | --- | --- | 3.39446
213810_s_at | C6orf166 | Chromosome 6 open reading frame 166 | 3.39253
223292_s_at | MRPS15 | mitochondrial ribosomal protein S15 | 3.38852
217318_x_at | KIR2DL1 /// KIR2DL2 /// KIR2DL3 /// KIR2DL5A /// KIR2DL5B /// KIR2DS1 /// KIR2DS2 /// KIR2DS3 /// KIR2DS4 /// KIR2DS5 /// KIR3DL1 /// KIR3DL2 /// KIR3DL3 /// KIR3DP1 /// LOC652001 /// LOC652779 /// LOC727787 | killer cell immunoglobulin-like receptor, two domains, long cytoplasmic tail, 1 /// killer cell immunoglobulin-like receptor, two domains, long cytoplasmic tail, 2 /// killer cell immunoglobulin-like receptor, two domains, long cytoplasmic tail, 3 /// killer cell immunoglobulin-like receptor, two domains, short cytoplasmic tail, 1 /// killer cell immunoglobulin-like receptor, two domains, short cytoplasmic tail, 2 /// killer cell immunoglobulin-like receptor, two domains, short cytoplasmic tail, 3 /// killer cell immunoglobulin-like receptor, two domains, short cytoplasmic tail, 4 /// killer cell immunoglobulin-like receptor, two domains, short cytoplasmic tail, 5 /// killer cell immunoglobulin-like receptor, three domains, long cytoplasmic tail, 1 /// killer cell immunoglobulin-like receptor, three domains, long cytoplasmic tail, 2 /// killer cell immunoglobulin-like receptor, two domains, long cytoplasmic tail, 5A /// killer cell immunoglobulin-like receptor, three domains, long cytoplasmic tail, 3 /// killer cell immunoglobulin-like receptor, three domains, pseudogene 1 /// killer cell immunoglobulin-like receptor, two domains, long cytoplasmic tail, 5B /// similar to killer cell immunoglobulin-like receptor, two domains, long cytoplasmic tail, 5B /// similar to Killer cell immunoglobulin-like receptor 2DS3 precursor (MHC class I NK cell receptor) (Natural killer associated transcript 7) (NKAT-7) /// similar to killer cell immunoglobulin-like receptor 3DL2 precursor (MHC class I NK cell receptor) (Natural killer-associated transcript 4) (NKAT-4) (p70 natural killer cell receptor clone CL-5) (CD158k antigen) /// killer-cell Ig-like receptor | 3.35932
213755_s_at | SKI | V-ski sarcoma viral oncogene homolog (avian) | 3.35468
231161_x_at | --- | --- | 3.35442
1557029_at | --- | CDNA clone IMAGE:4822878 | 3.32852
233191_at | RUFY2 | RUN and FYVE domain containing 2 | 3.32849
224047_at | --- | --- | 3.32308
200908_s_at | RPLP2 | ribosomal protein, large, P2 | 3.31559
221890_at | ZNF335 | zinc finger protein 335 | 3.29466
1558678_s_at | MALAT1 | metastasis associated lung adenocarcinoma transcript 1 (non-protein coding) | 3.28709
214056_at | MCL1 | Myeloid cell leukemia sequence 1 (BCL2-related) | 3.2811
226765_at | SPTBN1 | spectrin, beta, non-erythrocytic 1 | 3.27299
233892_at | GRIN3B | glutamate receptor, ionotropic, N-methyl-D-aspartate 3B | 3.26645
214244_s_at | ATP6V0E1 | ATPase, H+ transporting, lysosomal 9kDa, V0 subunit e1 | 3.26538
225423_x_at | LOC100129015 | hypothetical protein LOC100129015 | 3.26373
219138_at | RPL14 | ribosomal protein L14 | 3.2598
227310_at | ADSS | Adenylosuccinate synthase | 3.25627
215450_at | --- | --- | 3.24496
1555272_at | LOC728194 | radial spoke head 10 homolog B (Chlamydomonas)-like | 3.24491
229284_at | MAT2B | Methionine adenosyltransferase II, beta | 3.23442
229115_at | DYNC1H1 | dynein, cytoplasmic 1, heavy chain 1 | 3.22043
222556_at | ALG5 | asparagine-linked glycosylation 5 homolog (S. cerevisiae, dolichyl-phosphate beta-glucosyltransferase) | 3.21898
228735_s_at | PANK2 | Pantothenate kinase 2 (Hallervorden-Spatz syndrome) | 3.20919
231005_at | --- | Transcribed locus | 3.20513
205987_at | CD1C | CD1c molecule | 3.20419
235063_at | C20orf196 | chromosome 20 open reading frame 196 | 3.20405
241347_at | KIAA1618 | KIAA1618 | 3.19927
204801_s_at | DHRS12 | dehydrogenase/reductase (SDR family) member 12 | 3.19492
217465_at | NCKAP1 | NCK-associated protein 1 | 3.18911
227164_at | SFRS1 | splicing factor, arginine/serine-rich 1 (splicing factor 2, alternate splicing factor) | 3.18814
1563048_at | --- | Homo sapiens, clone IMAGE:5393038, mRNA | 3.18648
222038_s_at | UTP18 | UTP18, small subunit (SSU) processome component, homolog (yeast) | 3.17736
214149_s_at | ATP6V0E1 | ATPase, H+ transporting, lysosomal 9kDa, V0 subunit e1 | 3.17587
1566342_at | --- | Transcribed locus | 3.17356
226746_s_at | UBE4B | Ubiquitination factor E4B (UFD2 homolog, yeast) | 3.17353
223631_s_at | C19orf33 | chromosome 19 open reading frame 33 | 3.16081
227619_at | WRNIP1 | Werner helicase interacting protein 1 | 3.13699
229212_at | --- | Transcribed locus | 3.13404
217416_x_at | --- | --- | 3.1252
221871_s_at | TFG | TRK-fused gene | 3.12273
244415_at | --- | MRNA; cDNA DKFZp686E2299 (from clone DKFZp686E2299) | 3.1225
228520_s_at | APLP2 | Amyloid beta (A4) precursor-like protein 2 | 3.11176
207435_s_at | SRRM2 | serine/arginine repetitive matrix 2 | 3.10813
215185_at | --- | Homo sapiens, clone IMAGE:3851018, mRNA | 3.10494
201171_at | ATP6V0E1 | ATPase, H+ transporting, lysosomal 9kDa, V0 subunit e1 | 3.09822
223940_x_at | MALAT1 | metastasis associated lung adenocarcinoma transcript 1 (non-protein coding) | 3.09598
227175_at | LOC100131311 | hypothetical protein LOC100131311 | 3.08979
230034_x_at | LOC100129015 | hypothetical protein LOC100129015 | 3.08744
34478_at | RAB11B | RAB11B, member RAS oncogene family | 3.06989
221960_s_at | RAB2A | RAB2A, member RAS oncogene family | 3.0591
1567213_at | PNN | pinin, desmosome associated protein | 3.05535
228593_at | LOC339483 | hypothetical LOC339483 | 3.05458
1555225_at | C1orf43 | chromosome 1 open reading frame 43 | 3.04715
211803_at | CDK2 | cyclin-dependent kinase 2 | 3.04489
241061_at | --- | Transcribed locus | 3.02877
234464_s_at | EME1 | essential meiotic endonuclease 1 homolog 1 (S. pombe) | 3.02573
223731_at | MYCBPAP | MYCBP associated protein | 3.02296
213597_s_at | CTDSPL | CTD (carboxy-terminal domain, RNA polymerase II, polypeptide A) small phosphatase-like | 3.02269
216870_x_at | DLEU2 | deleted in lymphocytic leukemia, 2 | 3.01455
238238_at | --- | Transcribed locus | 3.0075
217264_s_at | SCNN1A | sodium channel, nonvoltage-gated 1 alpha | 3.00367
215324_at | SEMA3D | sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin) 3D | 3.00068
203543_s_at | KLF9 | Kruppel-like factor 9 | 2.9978
233809_at | HYPK | Huntingtin interacting protein K | 2.9949
227921_at | --- | --- | 2.99475
201818_at | LPCAT1 | lysophosphatidylcholine acyltransferase 1 | 2.98973
220016_at | AHNAK | AHNAK nucleoprotein | 2.97919
229107_at | --- | --- | 2.97721
217247_at | --- | --- | 2.97134
235927_at | XPO1 | exportin 1 (CRM1 homolog, yeast) | 2.96457
243881_at | --- | CDNA FLJ45325 fis, clone BRHIP3006717 | 2.96251
228907_at | --- | Transcribed locus | 2.96025
226220_at | METTL9 | Methyltransferase like 9 | 2.96015
209830_s_at | SLC9A3R2 | solute carrier family 9 (sodium/hydrogen exchanger), member 3 regulator 2 | 2.95677
243597_at | FANCB | Fanconi anemia, complementation group B | 2.95549
1562238_at | USPL1 | ubiquitin specific peptidase like 1 | 2.95539
236809_at | --- | --- | 2.95411
221917_s_at | GRSF1 | G-rich RNA sequence binding factor 1 | 2.95012
215927_at | ARFGEF2 | ADP-ribosylation factor guanine nucleotide-exchange factor 2 (brefeldin A-inhibited) | 2.947
240377_at | LOC100132540 /// LOC100132620 /// LOC339047 /// NPIP | nuclear pore complex interacting protein /// hypothetical protein LOC339047 /// similar to LOC339047 protein | 2.94124
216609_at | TXN | Thioredoxin | 2.94117
225507_at | SFRS18 | splicing factor, arginine/serine-rich 18 | 2.94091
236398_s_at | --- | Transcribed locus, moderately similar to XP_001102659.1 PREDICTED: hypothetical protein [Macaca mulatta] | 2.94013
236529_at | SRCRB4D | scavenger receptor cysteine rich domain containing, group B (4 domains) | 2.93904
215714_s_at | SMARCA4 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a, member 4 | 2.93091
214310_s_at | ZFPL1 | zinc finger protein-like 1 | 2.92937
229589_x_at | BIVM | Basic, immunoglobulin-like variable motif containing | 2.92543
238202_at | --- | --- | 2.91899
228663_x_at | FIZ1 /// ZNF579 | FLT3-interacting zinc finger 1 /// zinc finger protein 579 | 2.9185
1552980_at | HAS3 | hyaluronan synthase 3 | 2.91773
227403_at | PIGX | phosphatidylinositol glycan anchor biosynthesis, class X | 2.91742
220825_s_at | KIRREL | kin of IRRE like (Drosophila) | 2.91299
234254_at | LOC441642 | similar to hCG27427 | 2.90854
1552316_a_at | GIMAP1 | GTPase, IMAP family member 1 | 2.90227
1557052_at | --- | --- | 2.89948
232291_at | MIRHG1 | microRNA host gene (non-protein coding) 1 | 2.89893
230540_at | --- | --- | 2.89775
234494_x_at | --- | MRNA; cDNA DKFZp761G241 (from clone DKFZp761G241) | 2.89293
1560402_at | GAS5 | growth arrest-specific 5 | 2.88745
208395_s_at | URB1 | URB1 ribosome biogenesis 1 homolog (S. cerevisiae) | 2.88578
1562677_at | --- | Homo sapiens, Similar to v-myb avian myeloblastosis viral oncogene homolog, clone IMAGE:3535159, mRNA | 2.88452
228522_at | LOC642031 | hypothetical protein LOC642031 | 2.88341
200835_s_at | MAP4 | microtubule-associated protein 4 | 2.87853
216139_s_at | MAPK8IP3 | mitogen-activated protein kinase 8 interacting protein 3 | 2.87329
235854_x_at | ROCK1 | Rho-associated, coiled-coil containing protein kinase 1 | 2.8712
215629_s_at | DLEU2 /// DLEU2L | deleted in lymphocytic leukemia, 2 /// deleted in lymphocytic leukemia 2-like | 2.86874
202783_at | NNT | nicotinamide nucleotide transhydrogenase | 2.86395
219781_s_at | ZNF771 | zinc finger protein 771 | 2.86232
228072_at | SYT12 | synaptotagmin XII | 2.86228
226836_at | SFT2D1 | SFT2 domain containing 1 | 2.85688
216450_x_at | HSP90B1 | heat shock protein 90kDa beta (Grp94), member 1 | 2.85653
208080_at | AURKA | aurora kinase A | 2.85382
1569297_at | LOC731779 | hypothetical protein LOC731779 | 2.85311
234347_s_at | DENR | density-regulated protein | 2.84846
205451_at | FOXO4 | forkhead box O4 | 2.84764
207280_at | RNF185 | Ring finger protein 185 | 2.84446
235237_at | LOC203547 | Hypothetical protein LOC203547 | 2.84255
229081_at | SLC25A13 | Solute carrier family 25, member 13 (citrin) | 2.84238
211012_s_at | LOC161527 /// PML | promyelocytic leukemia /// hypothetical protein LOC161527 | 2.84163
226539_s_at | DDX54 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 54 | 2.83646
1560082_at | --- | Transcribed locus | 2.83473
227137_at | C10orf46 | chromosome 10 open reading frame 46 | 2.83283
240900_at | C7orf50 | chromosome 7 open reading frame 50 | 2.83174
214395_x_at | --- | CDNA clone IMAGE:4838699 | 2.83164
1555680_a_at | SMOX | spermine oxidase | 2.83094
237882_at | --- | Transcribed locus | 2.82994
1568822_at | GTPBP5 | GTP binding protein 5 (putative) | 2.8257
222515_x_at | TMEM165 | transmembrane protein 165 | 2.82258
243229_at | --- | Transcribed locus | 2.81961
214040_s_at | GSN | gelsolin (amyloidosis, Finnish type) | 2.81758
243058_at | --- | --- | 2.81726
222057_at | NOL12 | nucleolar protein 12 | 2.81408
202411_at | IFI27 | interferon, alpha-inducible protein 27 | 2.81262
210656_at | EED | embryonic ectoderm development | 2.80936
231109_at | --- | CDNA FLJ38468 fis, clone FEBRA2021864 | 2.80907
210098_s_at | --- | --- | 2.80878
241621_at | SMCHD1 | Structural maintenance of chromosomes flexible hinge domain containing 1 | 2.80624
229798_s_at | --- | --- | 2.80542
227244_s_at | SSU72 | SSU72 RNA polymerase II CTD phosphatase homolog (S. cerevisiae) | 2.80449
243616_at | --- | Transcribed locus | 2.80181
206047_at | GNB3 | guanine nucleotide binding protein (G protein), beta polypeptide 3 | 2.8015
32099_at | SAFB2 | scaffold attachment factor B2 | 2.79674
1555944_at | FAM120A | family with sequence similarity 120A | 2.79552
202028_s_at | RPL38 | ribosomal protein L38 | 2.79257
236682_at | --- | --- | 2.79146
221995_s_at | --- | --- | 2.78497
226263_at | C6orf151 | chromosome 6 open reading frame 151 | 2.78003
222788_s_at | RSBN1 | round spermatid basic protein 1 | 2.77931
206620_at | GRAP | GRB2-related adaptor protein | 2.77921
234747_at | C4orf41 | chromosome 4 open reading frame 41 | 2.77915
216461_at | --- | CDNA FLJ20827 fis, clone ADKA03543 | 2.77646
235897_at | COPZ2 | coatomer protein complex, subunit zeta 2 | 2.77627
213758_at | COX4I1 | cytochrome c oxidase subunit IV isoform 1 | 2.77454
1555788_a_at | TRIB3 | tribbles homolog 3 (Drosophila) | 2.77088
229420_at | --- | CDNA FLJ37566 fis, clone BRCOC2002085 | 2.76896
227679_at | --- | Transcribed locus | 2.76788
239408_at | --- | Transcribed locus | 2.76636
225191_at | CIRBP | cold inducible RNA binding protein | 2.76535
209535_s_at | --- | --- | 2.76418
207026_s_at | ATP2B3 | ATPase, Ca++ transporting, plasma membrane 3 | 2.76343
214742_at | AZI1 | 5-azacytidine induced 1 | 2.76298
216412_x_at | IGL@ | immunoglobulin lambda locus | 2.76247
211690_at | RPS6 | ribosomal protein S6 | 2.76132
215375_x_at | LRRFIP1 | Leucine rich repeat (in FLII) interacting protein 1 | 2.76037
241629_at | --- | Transcribed locus | 2.75989
217258_x_at | IGL@ | immunoglobulin lambda locus | 2.75915
216875_x_at | HAB1 | B1 for mucin | 2.75875
1568815_a_at | DDX50 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 50 | 2.75838
231362_at | --- | Transcribed locus, strongly similar to XP_528750.2 PREDICTED: epidermal growth factor receptor pathway substrate 8 [Pan troglodytes] | 2.75784
236288_at | RNF34 | ring finger protein 34 | 2.75632
1568190_at | --- | --- | 2.75104
235239_at | QSOX2 | quiescin Q6 sulfhydryl oxidase 2 | 2.75025
206600_s_at | LOC100133772 /// SLC16A5 | solute carrier family 16, member 5 (monocarboxylic acid transporter 6) /// similar to MCT | 2.74899
226570_at | ATP1B3 | ATPase, Na+/K+ transporting, beta 3 polypeptide | 2.74803
227136_s_at | C10orf46 | Chromosome 10 open reading frame 46 | 2.74709
225348_at | FUSIP1 /// LOC642558 | FUS interacting protein (serine/arginine-rich) 1 /// similar to FUS interacting protein (serine-arginine rich) 1 | 2.74694
1559427_at | MCF2L | MCF.2 cell line derived transforming sequence-like | 2.74681
35436_at | GOLGA2 | golgi autoantigen, golgin subfamily a, 2 | 2.74536
242197_x_at | CD36 | CD36 molecule (thrombospondin receptor) | 2.74108
224346_at | --- | --- | 2.74021
1563867_at | LOC283194 | hypothetical protein LOC283194 | 2.73908
208719_s_at | DDX17 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 17 | 2.73465
214161_at | OSGIN2 | Oxidative stress induced growth inhibitor family member 2 | 2.72914
213350_at | RPS11 | Ribosomal protein S11 | 2.727
213998_s_at | DDX17 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 17 | 2.7265
226014_at | LOC100133577 | hypothetical protein LOC100133577 | 2.7264
220729_at | --- | --- | 2.71959
237202_at | PGPEP1 | pyroglutamyl-peptidase I | 2.71819
203255_at | FBXO11 | F-box protein 11 | 2.71796
233445_at | --- | Transcribed locus | 2.7161
1560019_at | MGC11082 | Hypothetical protein MGC11082 | 2.71542
210230_at | --- | CDNA: FLJ23438 fis, clone HRC13275 | 2.71398
227953_at | --- | CDNA FLJ36440 fis, clone THYMU2012565 | 2.7137
217410_at | AGRN | agrin | 2.7108
208272_at RANBP3 RAN binding protein 3 2.70918 231618_s_at SUNC1 Sad1 and UNC84 domain containing 1 2.70827 226108_at ZC3H18 zinc finger CCCH-type containing 18 2.70743 210588_x_at HNRNPH3 heterogeneous nuclear ribonucleoprotein H3 (2H9) 2.70706 204178_s_at RBM14 RNA binding motif protein 14 2.70615 WAS protein homology region 2 domain containing 1- like 1 /// WAS protein homology region 2 domain containing 1-like 2 /// similar to junction-mediating and LOC644231 /// LOC652637 /// regulatory protein p300 JMY-like hypothetical protein /// 244105_at WHDC1L1 /// WHDC1L2 similar to junction-mediating and regulatory protein 2.70234 solute carrier family 16, member 5 (monocarboxylic 206599_at LOC100133772 /// SLC16A5 acid transporter 6) /// similar to MCT 2.70218 204695_at CDC25A cell division cycle 25 homolog A (S. pombe) 2.7006 202473_x_at HCFC1 host cell factor C1 (VP16-accessory protein) 2.69973 LOC100129513 /// LOC100133276 /// 210596_at MAGT1 magnesium transporter 1 /// similar to PRO0756 2.69892 Nudix (nucleoside diphosphate linked moiety X)-type 226061_s_at NUDT3 motif 3 2.69785 205925_s_at RAB3B RAB3B, member RAS oncogene family 2.6961 223974_at MGC11082 hypothetical protein MGC11082 2.68629 229525_at ------2.68172 1568680_s_at YTHDC2 YTH domain containing 2 2.68018 238295_at C17orf42 open reading frame 42 2.67983 226937_at CRLS1 Cardiolipin synthase 1 2.67778 203061_s_at MDC1 mediator of DNA damage checkpoint 1 2.67593 228881_at PARL presenilin associated, rhomboid-like 2.67582 201531_at ZFP36 zinc finger protein 36, C3H type, homolog (mouse) 2.6738 209230_s_at NUPR1 nuclear protein 1 2.67334 242713_at --- Transcribed locus 2.66536 1553396_a_at CCDC13 coiled-coil domain containing 13 2.66522 1557226_a_at LOC374569 Similar to Lysophospholipase 2.66518 206142_at ZNF135 zinc finger protein 135 2.66444 232966_at LPIN3 lipin 3 2.66348 RCD1 required for cell differentiation1 homolog (S. 213903_s_at RQCD1 pombe) 2.66218


214537_at  HIST1H1D  histone cluster 1, H1d  2.66075
218778_x_at  EPS8L1  EPS8-like 1  2.66005
218145_at  TRIB3  tribbles homolog 3 (Drosophila)  2.65984
227350_at  HELLS  helicase, lymphoid-specific  2.65883
225645_at  EHF  Ets homologous factor  2.65748
208439_s_at  FCN2  ficolin (collagen/fibrinogen domain containing lectin) 2 (hucolin)  2.65664
216695_s_at  TNKS  tankyrase, TRF1-interacting ankyrin-related ADP-ribose polymerase  2.65527
219134_at  ELTD1  EGF, latrophilin and seven transmembrane domain containing 1  2.655
215089_s_at  RBM10  RNA binding motif protein 10  2.65424
224003_at  TTTY14  testis-specific transcript, Y-linked 14  2.65417
202855_s_at  SLC16A3  solute carrier family 16, member 3 (monocarboxylic acid transporter 4)  2.65307
234328_at  ---  ---  2.65207
210778_s_at  MXD4  MAX dimerization protein 4  2.65101
210110_x_at  HNRNPH3  heterogeneous nuclear ribonucleoprotein H3 (2H9)  2.64968
205131_x_at  CLEC11A  C-type lectin domain family 11, member A  2.6481
223295_s_at  LUC7L  LUC7-like (S. cerevisiae)  2.64804
210699_at  ---  ---  2.64789
206027_at  S100A3  S100 calcium binding protein A3  2.64391
215480_at  KIAA0509  KIAA0509 protein  2.64269
234449_at  ---  MRNA; cDNA DKFZp434O0212 (from clone DKFZp434O0212)  2.64239
207641_at  TNFRSF13B  tumor necrosis factor receptor superfamily, member 13B  2.64235
229671_s_at  C21orf45  Chromosome 21 open reading frame 45  2.64206
226424_at  CAPS  calcyphosine  2.63943
220839_at  METTL5  methyltransferase like 5  2.63772
207362_at  SLC30A4  solute carrier family 30 (zinc transporter), member 4  2.63698
214101_s_at  ---  Transcribed locus  2.63634
209355_s_at  PPAP2B  phosphatidic acid phosphatase type 2B  2.63406
218368_s_at  TNFRSF12A  tumor necrosis factor receptor superfamily, member 12A  2.63248


230759_at  SNX14  Sorting nexin 14  2.63233
234920_at  ZNF7  Zinc finger protein 7  2.63149
220675_s_at  PNPLA3  patatin-like phospholipase domain containing 3  2.62954
224158_s_at  LOC286144  hypothetical LOC286144  2.62856
1565786_x_at  FLJ45482  hypothetical LOC645566  2.6278
1565689_at  ---  ---  2.62629
213506_at  F2RL1  coagulation factor II (thrombin) receptor-like 1  2.62622
220538_at  ADM2  adrenomedullin 2  2.6242
227912_s_at  EXOSC3  exosome component 3  2.62179
226643_s_at  NUDCD2  NudC domain containing 2  2.62143
203496_s_at  MED1  mediator complex subunit 1  2.62079
216480_x_at  MLLT10  Myeloid/lymphoid or mixed-lineage leukemia (trithorax homolog, Drosophila); translocated to, 10  2.61799
220323_at  CNTD2  cyclin N-terminal domain containing 2  2.61752
236137_at  ---  Transcribed locus  2.61667
229188_s_at  ---  ---  2.61648
230590_at  ---  Transcribed locus  2.6143
216884_at  PTPN12  protein tyrosine phosphatase, non-receptor type 12  2.61405
215006_at  ---  CDNA FLJ13754 fis, clone PLACE3000362  2.61323
232367_x_at  ZNF598  zinc finger protein 598  2.61158
226491_x_at  PTBP1  Polypyrimidine tract binding protein 1  2.6096
241839_at  ---  Transcribed locus  2.60815
207757_at  ZFP2  zinc finger protein 2 homolog (mouse)  2.60747
231219_at  CMTM1  CKLF-like MARVEL transmembrane domain containing 1  2.60723
217670_at  RPLP2  Ribosomal protein, large, P2  2.60696
213907_at  EEF1E1  Eukaryotic translation elongation factor 1 epsilon 1  2.60297
1569600_at  DLEU2  Deleted in lymphocytic leukemia, 2  2.60106
204964_s_at  SSPN  sarcospan (Kras oncogene-associated gene)  2.60074
227913_at  EXOSC3  Exosome component 3  2.60057
202437_s_at  CYP1B1  cytochrome P450, family 1, subfamily B, polypeptide 1  2.59509
226703_at  KIAA1787  KIAA1787 protein  2.59476
213536_s_at  UBE2I  Ubiquitin-conjugating enzyme E2I (UBC9 homolog, yeast)  2.59302


201702_s_at  PPP1R10  protein phosphatase 1, regulatory (inhibitor) subunit 10  2.59198
210874_s_at  NAT6  N-acetyltransferase 6  2.59183
233243_at  ---  CDNA FLJ11533 fis, clone HEMBA1002678  2.59031
209483_s_at  ---  ---  2.5896
1563431_x_at  CALM3  Calmodulin 3 (phosphorylase kinase, delta)  2.58902
40850_at  FKBP8  FK506 binding protein 8, 38kDa  2.58674
222036_s_at  MCM4  minichromosome maintenance complex component 4  2.5864
201631_s_at  IER3  immediate early response 3  2.58235
243069_at  ---  ---  2.58097
1555240_s_at  GNG12  guanine nucleotide binding protein (G protein), gamma 12  2.57954
215034_s_at  TM4SF1  transmembrane 4 L six family member 1  2.57901
1568248_x_at  SNORA71B  small nucleolar RNA, H/ACA box 71B  2.57822
1553215_s_at  CCDC7  coiled-coil domain containing 7  2.57532
226444_at  ---  CDNA FLJ31108 fis, clone IMR322000164  2.57439
212225_at  EIF1  eukaryotic translation initiation factor 1  2.57038
1559144_x_at  ---  MRNA; cDNA DKFZp686H1819 (from clone DKFZp686H1819)  2.56943
239073_at  ANKFY1  ankyrin repeat and FYVE domain containing 1  2.56784
232118_at  ---  CDNA: FLJ22446 fis, clone HRC09457  2.56652
211406_at  IER3IP1  immediate early response 3 interacting protein 1  2.56571
206811_at  ADCY8 /// LOC100134077  adenylate cyclase 8 (brain) /// similar to adenylyl cyclase  2.56565
229896_at  ---  CDNA clone IMAGE:6106200  2.56313
1564072_at  MYH16  myosin, heavy chain 16 pseudogene  2.56243
216455_at  FLJ00049  FLJ00049 protein  2.56201
222850_s_at  DNAJB14  DnaJ (Hsp40) homolog, subfamily B, member 14  2.56113
1559936_at  ---  CDNA clone IMAGE:5106201  2.55825
239996_x_at  ATP2A2  ATPase, Ca++ transporting, cardiac muscle, slow twitch 2  2.55825
238105_x_at  ---  ---  2.55816
208844_at  VDAC3  voltage-dependent anion channel 3  2.55641
232706_s_at  TRABD  TraB domain containing  2.55515
210317_s_at  YWHAE  tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, epsilon polypeptide  2.55333


206361_at  GPR44  G protein-coupled receptor 44  2.55182
208090_s_at  AIRE  autoimmune regulator  2.55161
226675_s_at  MALAT1  metastasis associated lung adenocarcinoma transcript 1 (non-protein coding)  2.55096
243153_at  CDK5RAP2  CDK5 regulatory subunit associated protein 2  2.55006
1568903_at  ---  Transcribed locus  2.54807
220101_x_at  ---  ---  2.54729
214464_at  CDC42BPA  CDC42 binding protein kinase alpha (DMPK-like)  2.54693
239789_at  C11orf49  chromosome 11 open reading frame 49  2.54661
239498_at  ---  Transcribed locus  2.54549
203976_s_at  CHAF1A  chromatin assembly factor 1, subunit A (p150)  2.53884
206543_at  SMARCA2  SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a, member 2  2.53791
212037_at  PNN  pinin, desmosome associated protein  2.53778
226328_at  KLF16  Kruppel-like factor 16  2.53634
217300_at  ---  ---  2.5353
1559501_at  ---  CDNA clone IMAGE:5262521  2.53453
233396_s_at  CSRP2BP  CSRP2 binding protein  2.53413
206162_x_at  SYT5  synaptotagmin V  2.53402
209016_s_at  KRT7  keratin 7  2.5335
229204_at  HP1BP3  Heterochromatin protein 1, binding protein 3  2.53342
206623_at  PDE6A  phosphodiesterase 6A, cGMP-specific, rod, alpha  2.53325
1552596_at  GAS2L2  growth arrest-specific 2 like 2  2.52863
222045_s_at  PCIF1  PDX1 C-terminal inhibiting factor 1  2.52849
217218_at  WAPAL  wings apart-like homolog (Drosophila)  2.52821
1568126_at  ANXA2  annexin A2  2.52801
204665_at  RP5-1000E10.4  suppressor of IKK epsilon  2.52785
213961_s_at  ---  ---  2.52388
218355_at  KIF4A  kinesin family member 4A  2.52381
222356_at  ---  Transcribed locus  2.52374
236892_s_at  LOC100130740  similar to hCG2042068  2.5237
209491_s_at  AMPD3  adenosine monophosphate deaminase (isoform E)  2.52253
230934_at  STK32C  serine/threonine kinase 32C  2.5221


219790_s_at  NPR3  natriuretic peptide receptor C/guanylate cyclase C (atrionatriuretic peptide receptor C)  2.52106
233841_s_at  SUDS3  suppressor of defective silencing 3 homolog (S. cerevisiae)  2.52045
218215_s_at  NR1H2  nuclear receptor subfamily 1, group H, member 2  2.51998
223146_at  WDR33  WD repeat domain 33  2.51969
217230_at  EZR  ezrin  2.51968
210753_s_at  EPHB1  EPH receptor B1  2.51827
233905_at  SPAG4L  sperm associated antigen 4-like  2.51818
214187_x_at  CTDSPL  CTD (carboxy-terminal domain, RNA polymerase II, polypeptide A) small phosphatase-like  2.51754
210583_at  POLDIP3  polymerase (DNA-directed), delta interacting protein 3  2.51653
1558011_at  ---  Transcribed locus, strongly similar to XP_001727068.1 PREDICTED: similar to POM121-like 1 [Homo sapiens]  2.51461
205323_s_at  MTF1  metal-regulatory transcription factor 1  2.51457
1561596_at  ---  CDNA clone IMAGE:5288527  2.51315
217995_at  SQRDL  sulfide quinone reductase-like (yeast)  2.51258
213087_s_at  ---  CDNA clone IMAGE:4838699  2.51077

VITA

Yan Zhang

EDUCATION
Doctor of Philosophy, 8/03 – 12/12, The Pennsylvania State University, University Park, PA
Master of Applied Statistics, 8/05 – 5/09, The Pennsylvania State University, University Park, PA
Master of Science, 8/00 – 7/03, Chinese Academy of Sciences
Bachelor of Science, 9/96 – 7/00, Huazhong Normal University, China

WORK EXPERIENCE
Sr. Business Analyst, Informatics and Analytics, Quest Diagnostics Inc., Madison, NJ, 12/09 – present
Intern, Quest Diagnostics Inc., Madison, NJ, 6/09 – 8/09
Intern, GlaxoSmithKline, Collegeville, PA, 6/08 – 8/08

PUBLICATIONS
Yan Zhang, Michael J. Italia, Kurt R. Auger, Wendy S. Halsey, Stephanie F. Van Horn, Ganesh M. Sathe, Michal Magid-Slav, James R. Brown, Joanna D. Holbrook. (2009) "Molecular evolutionary analysis of cancer cell lines." Mol Cancer Ther 9(2): 279-91.

Norman J. Wickett, Yan Zhang, S. Kellon Hansen, Jessie M. Roper, Jennifer V. Kuehl, Shelia A. Plock, Paul G. Wolf, Claude W. dePamphilis, Jeffrey L. Boore and Bernard Goffinet. (2008) "Functional gene losses occur with minimal size reduction in the plastid genome of the parasitic liverwort Aneura mirabilis." Mol Biol Evol 25(2): 393-401.