The Pennsylvania State University

The Graduate School

Intercollege Graduate Program in Genetics

HORIZONTAL GENE TRANSFER STUDIES IN PARASITIC PLANTS OF THE

OROBANCHACEAE

A Dissertation in

Genetics

by

Yeting Zhang

© 2013 Yeting Zhang

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

December 2013

The dissertation of Yeting Zhang was reviewed and approved* by the following:

Stephen W. Schaeffer Professor of Biology Chair of Committee

Claude W. dePamphilis Professor of Biology Dissertation Advisor

Tomas A. Carlo Assistant Professor of Biology

John E. Carlson Professor of Molecular Genetics, School of Forest Resources Director of The Schatz Center for Tree Molecular Genetics

Naomi Altman Professor of Statistics

Robert F. Paulson Professor of Veterinary and Biomedical Sciences Chair, Intercollege Graduate Degree Program in Genetics

*Signatures are on file in the Graduate School

ABSTRACT

Parasitic plants, represented by several thousand species of angiosperms, use modified structures known as haustoria to tap into photosynthetic host plants in order to extract nutrients and water. As a result of these direct plant-to-plant connections with their host plants, parasitic plants have unique opportunities for horizontal gene transfer (HGT), the nonsexual transmission of genetic material across species boundaries. There is increasing evidence that parasitic plants have served as both recipients and donors of HGT, but the long-term impacts of eukaryotic HGT in parasitic plants are largely unknown.

Three genera from Orobanchaceae, represented by Triphysaria versicolor, Striga hermonthica, and Orobanche aegyptiaca (syn. Phelipanche aegyptiaca), were chosen for a massive transcriptome-sequencing project known as the Parasitic Plant Genome Project (PPGP). These species were chosen for two reasons. First, the three parasites range from facultative hemiparasite to obligate holoparasite, which makes them excellent candidates for understanding the evolution of plants having heterotrophic capacities. Second, because they have destructively impacted important crops, especially Striga hermonthica and Orobanche aegyptiaca (syn. Phelipanche aegyptiaca), they are relatively well studied by scientists working on controlling these harmful parasites. The first half of Chapter 1 is a comprehensive introduction to parasitic plants. The remainder of Chapter 1 is a detailed literature review of HGT studies in plants, particularly parasitic plants.

Chapter 2 introduces a gene encoding an albumin 1 KNOTTIN-like protein found in Phelipanche aegyptiaca and related parasitic species of the family Orobanchaceae. Based on phylogenetic analyses, this gene was likely acquired by a Phelipanche ancestor via HGT from a host. Before our research, albumin 1 genes were only known in papilionoid legumes, where they serve dual roles as food storage and insect toxin. The KNOTTINs are well known for their unique “disulfide through disulfide knot” structure and have been extensively studied in various contexts, including drug design. According to genomic sequences from nine related parasite species, 3D protein structure simulation tests, and evolutionary constraint analyses, the parasite gene we identified here retains the intron structure and the six highly conserved cysteine residues necessary to form a KNOTTIN protein, and displays levels of purifying selection like those seen in legumes. The albumin 1 xenogene has evolved through more than 150 speciation events over ca. 16 million years, forming a small family of differentially expressed genes that may confer novel functions in the parasites. Moreover, further data show that a distantly related parasitic plant, Cuscuta, obtained two copies of albumin 1 KNOTTIN-like genes from legumes through a separate HGT event, suggesting that legume KNOTTIN structures have been repeatedly co-opted by parasitic plants.

Chapter 3 summarizes the HGT findings in the PPGP, which used a phylogenomic approach to identify HGT events. Twenty-two published plant genomes (Selaginella moellendorffii, Physcomitrella patens, Amborella trichopoda, Oryza sativa, Brachypodium distachyon, Sorghum bicolor, Phoenix dactylifera, Musa acuminata, Nelumbo nucifera, Aquilegia coerulea, Arabidopsis thaliana, Carica papaya, Fragaria vesca, Glycine max, Medicago truncatula, Populus trichocarpa, Thellungiella parvula, Theobroma cacao, Vitis vinifera, Solanum lycopersicum, Solanum tuberosum, and Mimulus guttatus), two asterid EST datasets (Lactuca sativa and Helianthus annuus), and four PPGP species (Lindenbergia philippensis, Triphysaria versicolor, Striga hermonthica, and Phelipanche aegyptiaca) were used in orthogroup classifications. Phylogenies for each orthogroup were built, and customized Java scripts were used to perform the initial HGT screening. Secondary manual screening was carried out using various criteria. The final findings are presented in this chapter. With each finding, evidence from incongruent phylogenies, expression profiles for HGT genes, potential gene function, evolutionary constraint analyses, and genomic integration is discussed. The majority of the transgenes show evidence of introns, indicating that the transfer happened at the DNA level, and not from a retroprocessed transcript. We also identified two high confidence HGT transgenes in Striga hermonthica located adjacent to each other. This is the first time a genomic integration longer than one gene has been identified in a parasitic plant, and it suggests that a search for large integration fragments could be fruitful. Because we identify HGT transgenes from our EST datasets, these findings may be just the “tip of the iceberg” of HGT in parasitic plants: many of the HGT transgenes in parasitic plant genomes may be highly divergent pseudogenes that are not expressed. With more genomic data available in the future for the PPGP, we will be able to tackle this question. We hope that our findings provide a rich pool for functional studies in the parasitic plant research community.

TABLE OF CONTENTS

LIST OF FIGURES ...... ix

LIST OF TABLES...... xiv

ACKNOWLEDGEMENTS...... xvi

Chapter 1 Background of Parasitic Plants ...... 1

Introduction to Parasitic Plants ...... 1
Categories of Parasitic Plants ...... 1
General Introduction of the Haustorium ...... 2
Germination Signals Studies ...... 3
Studies on Genomes and Evolution of Parasitic Plants ...... 4
Introduction to Triphysaria versicolor, Striga hermonthica, and Orobanche aegyptiaca ...... 9
Degree in Orobanchaceae ...... 9
Evolutionary History and General Information for Orobanchaceae ...... 10
Ecological Impact of Orobanchaceae ...... 12
Large-Scale Genome Information Studies on Orobanchaceae ...... 12
Introduction to the Parasitic Plant Genome Project (PPGP) ...... 14
The Reasons to Initiate the PPGP and its Goals ...... 14
The Design and Unique Features of the PPGP ...... 15
The Achievements of the PPGP ...... 17
A Brief Review of Horizontal Gene Transfer in Microorganisms and Plants ...... 19
HGT in Bacteria ...... 19
HGT in Eukaryotes ...... 21

Chapter 2 Evolution of a horizontally acquired legume gene, albumin 1, in the parasitic plant Phelipanche aegyptiaca and related species [85] ...... 33

Background ...... 33
Results ...... 36
Identifying the albumin1 gene in Broomrape species ...... 36
Genomic sequence features of the albumin 1 gene in Phelipanche aegyptiaca and related species ...... 38
Incongruent Phylogeny of the albumin1 gene ...... 42
KNOTTIN structure identified in albumin 1 proteins from Phelipanche aegyptiaca ...... 44
Evolution constraint analysis on albumin1 genes in broomrape species ...... 47
Expression profile of albumin1 genes in Phelipanche aegyptiaca ...... 48
Discussion ...... 51
Conclusions ...... 55
Methods ...... 55
Screening for HGT candidates ...... 55
Phylogenetic analysis and dating ...... 59
KNOTTIN structure validation and 3D structure simulation ...... 60
dN, dS and dN/dS calculation ...... 60
Expression level comparisons of HGT candidates ...... 61
Obtaining genomic sequences by PCR approach ...... 61
Legume DNA extraction, and gene amplification ...... 62

Chapter 3 Widespread horizontal gene transfer events of transcribed gene sequences in parasitic plants of the Orobanchaceae...... 64

Abstract ...... 64
Background ...... 65
Results ...... 69
Result summary on four species, including Lindenbergia philippensis, Triphysaria versicolor, Striga hermonthica, and Phelipanche aegyptiaca ...... 69
HGT candidates identified in Lindenbergia philippensis ...... 72
HGT candidates identified in Triphysaria versicolor ...... 72
HGT candidates identified in Striga hermonthica ...... 73
HGT candidates identified in Phelipanche aegyptiaca ...... 75
Expression profile of the HGT gene candidates ...... 77
Discussion ...... 80
A large scale, but still conservative, approach to identify HGT ...... 80
High confidence and interesting HGT cases in Lindenbergia philippensis, Triphysaria versicolor, Striga hermonthica and Phelipanche aegyptiaca ...... 81
The possible mechanism of HGT ...... 83
Two high confidence Striga hermonthica HGT transgenes located adjacently on the same genomic contig ...... 84
Expression pattern comparison between HGT transgenes and homologous genes from potential donor species ...... 88
Conclusion ...... 88
Tables ...... 89
Figures ...... 105
Methods ...... 140
Sequencing Details ...... 140
Data process pipelines ...... 141
Constructing a global gene family classification ...... 143
Functional Annotation ...... 145
Expression Profile Mapping in Transcriptomic Data ...... 145
HGT Screening ...... 146
Genomic Data Assembly ...... 148
Genomic Contig Mapping onto EST data ...... 149
Constraint Analyses ...... 149
Supplemental Materials ...... 152

Appendix A...... 170

List of Publications ...... 170

Appendix B ...... 171

A genome triplication associated with early diversification of the core eudicots ...... 171

Reference ...... 202

LIST OF FIGURES

Figure 1-1. Range of Parasitism among all major parasitic plant lineages [2]...... 2

Figure 1-2. Phylogenetic Hypotheses for relationships of Triphysaria, Striga and Orobanche [2]...... 11

Figure 2-1. Phylogeny of major lineages of plants. Figure is adapted from Soltis et al. [130]...... 35

Figure 2-2. NCBI BLAST result (database: nr, BLASTp) of (A) P. aegyptiaca albumin1-1 (unigene 12653) and (B) P. aegyptiaca albumin1-2 (unigene 75797)...... 37

Figure 2-3. Alignments of 5’ ends of the genomic and inferred CDS sequences of albumin 1 homologs from five Phelipanche species...... 39

Figure 2-4. Alignment of insect toxin albumin 1 protein (Medicago_truncatula_albumin1_Q7XZC5) and inferred protein sequences for the two homologs in P. aegyptiaca, and structure of the M. truncatula toxic albumin 1 gene...... 40

Figure 2-5. Alignments of the 3’ end of genomic and inferred CDS sequences of albumin 1 homologs from five Phelipanche species...... 41

Figure 2-6. Partial genomic DNA and cDNA alignments of M. truncatula albumin 1 (Medtr8g025950), P. aegyptiaca albumin1-1 (12653) and P. aegyptiaca albumin 1-2 (75797)...... 42

Figure 2-7. Maximum likelihood (ML) and Bayesian inference (BI) phylogeny of albumin 1 homologs in broomrape species and legumes...... 43

Figure 2-8. Amino acid sequence alignment and 3D structure simulation of albumin 1 sequences from Medicago and P. aegyptiaca...... 46

Figure 2-9. ML estimate of dN and dS changes, and evolutionary constraint (dN/dS) through the history of albumin 1 sequences in broomrapes and their homologs in three related legume species...... 48

Figure 2-10. Expression level (log scale) of P. aegyptiaca albumin 1 genes in P. aegyptiaca across eight developmental stages...... 49

Figure 2-11. Maximum likelihood (ML) phylogeny of KNOTTIN homologs in broomrape species, Cuscuta pentagona and papilionoid legumes...... 54

Figure 3-1. Phylogeny of orthogroup 3861...... 105

Figure 3-2. Phylogeny of orthogroup 12303...... 106

Figure 3-3. Phylogeny of orthogroup 12577...... 106

Figure 3-4. Phylogeny of orthogroup 18774...... 107

Figure 3-5. Phylogeny of orthogroup 14233...... 108

Figure 3-6. Phylogeny of orthogroup 13656...... 108

Figure 3-7. Phylogeny of orthogroup 2270...... 108

Figure 3-8. Phylogeny of orthogroup 5896...... 109

Figure 3-9. Phylogeny of orthogroup 10124...... 110

Figure 3-10. Phylogeny of orthogroup 294...... 111

Figure 3-11. Phylogeny of orthogroup 1886...... 112

Figure 3-12. Phylogeny of orthogroup 4067...... 113

Figure 3-13. Phylogeny of orthogroup 8888...... 114

Figure 3-14. Phylogeny of orthogroup 10050...... 115

Figure 3-15. Phylogeny of orthogroup 10143...... 116

Figure 3-16. Phylogeny of orthogroup 11841...... 117

Figure 3-17. Phylogeny of orthogroup 806...... 118

Figure 3-18. Phylogeny of orthogroup 1685...... 119

Figure 3-19. Phylogeny of orthogroup 2376...... 121

Figure 3-20. Phylogeny of orthogroup 4336...... 123

Figure 3-21. Phylogeny of orthogroup 8235...... 123

Figure 3-22. Phylogeny of orthogroup 9613...... 124

Figure 3-23. Expression profile of HGT gene TrVeBC3_31777.1 (orthogroup 12303). .. 125

Figure 3-24. Expression profile of HGT gene TrVeBC3_22826.1 (orthogroup 12577). .. 125

Figure 3-25. Expression profile of HGT gene StHeBC3_4423.1 (orthogroup 18774)...... 126

Figure 3-26. Expression profile of HGT gene StHeBC3_16619.1 (orthogroup 14233)...... 126

Figure 3-27. Expression profile of HGT gene StHeBC3_41710.1 (orthogroup 13656)...... 127

Figure 3-28. Expression profile of HGT gene StHeBC3_2868.1 (orthogroup 2270)...... 127

Figure 3-29. Expression profile of HGT gene StHeBC3_21126 (orthogroup 5896)...... 128

Figure 3-30. Expression profile of HGT gene StHeBC3_5017 (orthogroup 10124)...... 128

Figure 3-31. Expression profile of HGT gene OrAeBC5_9142.1 (orthogroup 294)...... 129

Figure 3-32. Expression profile of HGT gene OrAeBC5_15496.1 (orthogroup 1886). .... 129

Figure 3-33. Expression profile of HGT gene OrAeBC5_9762.1 (orthogroup 4067)...... 130

Figure 3-34. Expression profile of HGT gene OrAeBC5_15086.1 (orthogroup 8888)...... 130

Figure 3-35. Expression profile of HGT gene OrAeBC5_4284.2 (orthogroup 10050)...... 131

Figure 3-36. Expression profile of HGT gene OrAeBC5_14056.1 (orthogroup 10143)...... 131

Figure 3-37. Expression profile of HGT gene OrAeBC5_3756.1 (orthogroup 11841)...... 132

Figure 3-38. Expression profile of HGT gene StHeBC3_55745.1 (orthogroup 11841)...... 132

Figure 3-39. Expression profile of HGT gene OrAeBC5_4239 (orthogroup 806), including three splicing forms from this gene...... 133

Figure 3-40. Expression profile of HGT gene OrAeBC5_14072.1 (orthogroup 1685)...... 133

Figure 3-41. Expression profile of HGT gene OrAeBC5_15353 (orthogroup 1685), including two splicing forms from this gene...... 134

Figure 3-42. Expression profile of HGT gene OrAeBC5_6791 (orthogroup 1685), including four splicing forms from this gene...... 134

Figure 3-43. Expression profile of HGT gene OrAeBC5_9731 (orthogroup 1685), including two splicing forms from this gene...... 135

Figure 3-44. Expression profile of HGT gene OrAeBC5_7956 (orthogroup 1685), including two splicing forms from this gene...... 136

Figure 3-45. Expression profile of HGT gene OrAeBC5_270 (orthogroup 2376), including four splicing forms from this gene...... 137

Figure 3-46. Expression profile of HGT gene StHeBC3_48088.1 (orthogroup 2376), including four splicing forms from this gene...... 137

Figure 3-47. Expression profile of HGT gene OrAeBC5_7046 (orthogroup 4336)...... 138

Figure 3-48. Expression profile of HGT gene OrAeBC5_13694.1 (orthogroup 4336)...... 139

Figure 3-49. Expression profile of HGT gene OrAeBC5_26251.1 (orthogroup 8235)...... 139

Figure 3-50. Expression profile of HGT gene OrAeBC5_16890.17 (orthogroup 9613)...... 140

Supplemental Figure 3-1. Phylogeny of orthologous gene group 2270. Additional homolog genes were identified in Phelipanche aegyptiaca (OrAeBC5_3356 and OrAeBC5_51488.1)...... 152

Supplemental Figure 3-2. Expression profile of additional Phelipanche aegyptiaca HGT gene, OrAeBC5_3356, including seven alternative splicing forms...... 153

Supplemental Figure 3-3. dN/dS ratios on background lineages and HGT parasite branch for each orthogroup having HGT genes identified in Striga hermonthica. ... 154

Supplemental Figure 3-4. dN/dS ratios on background lineages and HGT parasite branch for each orthogroup having HGT genes identified in Phelipanche aegyptiaca...... 154

Supplemental Figure 3-5. Expression profile of laterally transferred gene from Phelipanche aegyptiaca in orthogroup 4336...... 155

Supplemental Figure 3-6. Constraint analysis results for orthogroup 11841...... 156

Supplemental Figure 3-7. Constraint analysis results for orthogroup 4336...... 157

Supplemental Figure 3-8. Incongruent phylogeny for orthogroup 14624...... 158

Supplemental Figure 3-9. NCBI Blast result summary of StHeGnB1_80049...... 159

Supplemental Figure 3-10. NCBI Blast result summary of StHeBC3_16619.1...... 159

Supplemental Figure 3-11. Gene locations of StHeGnB1_80049 and StHeBC3_16619.1 on the same genomic contig 136486...... 160

Supplemental Figure 3-12. Conservation between intergenic region (StHeGnB1_80049 and StHeBC3_16619) and Sorghum bicolor Genome...... 161

Supplemental Figure 3-13. Expression pattern for StHeBC3_16619.1 and its homolog in rice (LOC_Os01g08650.1)...... 162

Appendix B Figure 1. Schematic phylogenetic tree of flowering plants...... 192

Appendix B Figure 2. Exemplar maximum likelihood phylogeny of Ortho 1202...... 193

Appendix B Figure 3. Exemplar maximum likelihood phylogeny of Ortho 1083...... 195

Appendix B Figure 4. Age distribution of γ duplications...... 195

Appendix B Figure 5. Ks distributions of paralogs in Vitis from syntenic block analysis...... 196

LIST OF TABLES

Table 1-1. Host plants used in the Parasitic Plant Genome Project...... 15

Table 1-2. Developmental Stages of Parasitic Plants defined by the Parasitic Plant Genome Project (http://ppgp.huck.psu.edu/download.php) with characteristics of each stage and their potential for having host plant tissue contamination in library preparations...... 16

Table 1-3. Sequencing data summary for the Parasitic Plant Genome Project (October, 2013)...... 17

Table 1-4. Summary list of major HGT cases reported in plants...... 25

Table 2-1. Expression values for albumin 1 genes in P. aegyptiaca at different developmental stages...... 49

Table 2-2. Developmental stages used for transcriptome sequencing in P. aegyptiaca with characteristics of each stage and the expectation of host plant tissue contamination in library preparations...... 50

Table 2-3. HGT candidates BLAST database...... 56

Table 2-4. PCR primers used for albumin 1 amplification...... 62

Table 3-1. Number of HGT genes identified in four species, including Lindenbergia philippensis, Triphysaria versicolor, Striga hermonthica, and Phelipanche aegyptiaca...... 89

Table 3-3. Genomic support for the HGT candidates identified in Striga hermonthica transcriptome assemblies. N/A sign means unknown based on the current data...... 96

Table 3-4. Depth of coverage for the Striga genomic contigs that have been mapped onto the HGT transcript assemblies...... 98

Table 3-5. Genomic evidence for the HGT candidates identified in Phelipanche aegyptiaca transcriptome datasets. N/A means unknown based on the current data...... 99

Table 3-6. Depth of coverage for the genomic contigs that have been mapped to the HGT gene ESTs identified in Phelipanche aegyptiaca (syn. Orobanche aegyptiaca). Average length of the Illumina reads is 101bp...... 102

Table 3-7. Summary of expression profile for each high confidence HGT transgene...... 103

Supplemental Table 3-1. Expression Profile for HGT genes identified in Triphysaria versicolor. Expression was calculated using FPKM (Fragments Per Kilobase per Million mapped reads)...... 163

Supplemental Table 3-2. Expression Profile for HGT genes identified in Striga hermonthica. Expression was calculated using FPKM (Fragments Per Kilobase per Million mapped reads)...... 164

Supplemental Table 3-3. Expression Profile for HGT genes identified in Phelipanche aegyptiaca. Expression was calculated using FPKM (Fragments Per Kilobase per Million mapped reads)...... 165

Supplemental Table 3-4. SH (Shimodaira-Hasegawa) test results for the high confidence HGT orthogroups...... 168

Supplemental Table 3-5. Assembly Results of PPGP Data...... 169

Appendix B Table 1. Summary of datasets for eight sequenced plant genomes included in this study ...... 198

Appendix B Table 2. Summary of unigene sequences of , basal eudicots, non-grass monocots, and basal angiosperms included in phylogenetic study...... 199

Appendix B Table 3. Phylogenetic timing of Vitis γ duplications inferred from orthogroup phylogenetic histories...... 201

ACKNOWLEDGEMENTS

I would first like to express my deep appreciation to my parents, Xiaoyun Deng and Baowu Zhang, who brought me into this world, nurtured me with all their efforts, supported my decisions unconditionally, and empowered me with great strength to survive and thrive in this fantastic world. They were my first mentors and will always be my mentors and best friends throughout my life.

My husband, Yazhou Sun, is a special person whom I can never thank enough. I feel extremely lucky to have known him in the past ten years. We are partners in our lives and careers and have achieved numerous things together. He has made and is continuing to make me a better person. My precious son, William Z. Sun, thank you for coming into my life and continuing to help me become stronger and more mature.

I would like to express my deep appreciation to Dr. Claude dePamphilis, my PhD advisor and mentor in my professional career. I am very lucky to have been able to work with Claude during my PhD thesis studies and have truly learned a lot from him. Claude is a generous and kind person who has not only supported me in my PhD research but also helped me get through some difficult situations in my life. For this, I can never thank him enough.

Throughout my PhD studies, I have been very lucky to know a few other mentors. Dr. Robert F. Paulson, the current chair of the Genetics Program, has always been there when I needed guidance and support. Dr. Richard Ordway, the former chair of the Genetics Program, has provided me with invaluable guidance both in my studies and in my life. I feel truly fortunate to have known both chairmen during my PhD studies. I owe a lot of thanks to my committee chair, Dr. Stephen W. Schaeffer, who has helped me through the milestones of my PhD. Many thanks go to Dr. Tomas A. Carlo, Dr. John E. Carlson, and Dr. Naomi Altman, who have provided me with much insight during my PhD studies.

During my PhD studies, I have worked with and learned a lot from numerous scientists. I want to thank Dr. Monica Fernandez-Aparicio and Dr. James H. Westwood for their collaboration on the PPGP project. I appreciate the support of my thesis work from all other PIs on the PPGP project, including Dr. Michael P. Timko and Dr. John I. Yoder. Thanks to Dr. Satoko Yoshida, Dr. Cedric Feschotte, Dr. Norman Wickett, Dr. Martin F Wojciechowski, Dr. Yan Zhang, Dr. Yuannian Jiao, Dr. Tim Evans, Dr. Giovanna Carpi, Dr. Arthur Lesk, Dr. Webb Miller, Dr. Stephan Schuster, Dr. Nicola Wittekindt, Dr. Daniela Drautz, Dr. Ji Qi, Dr. Hong Ma, Dr. Eric Thomas Harvill, Dr. Douglas Cavener and Dr. Masatoshi Nei for all of your support during my PhD studies.

I feel especially blessed that I have met and known a lot of fantastic co-workers and friends at Penn State through my PhD studies. A lot of thanks go to my co-workers, Paula Ralph, Dr. Joshua Der, Dr. Loren Honaas, Eric Wafula, Zhenzhen Yang, Marcos Caraballo, Prakash Timilsena, and Julia Naumann. I have really enjoyed working with you all.

Chapter 1

Background of Parasitic Plants

Introduction to Parasitic Plants

Categories of Parasitic Plants

Parasitic plants are plants that directly invade other plants’ tissues, either shoots or roots, through structures called haustoria [1], and acquire nutrients from their host plants to fulfill their needs. The degree of parasitism varies greatly. In general, parasites fall into two categories: facultative parasites and obligate parasites. Facultative parasites can complete their life cycles independent of a host plant, but will parasitize neighboring plants when available. Obligate parasites cannot complete their life cycles without host plants. Parasitic plants can also be categorized according to their photosynthetic ability. Hemiparasites have the ability to photosynthesize, though their efficiency in doing so may vary among species. Holoparasites acquire all of their carbon from host plants. Depending on which of their host plants’ organs they parasitize, parasitic plants can also be divided into root and stem parasites. Figure 1-1 shows the range of parasitism among all major parasitic plant lineages, their photosynthetic ability, and which organs of the host plants they usually parasitize [2].
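The three classification axes described above (host dependence, photosynthetic ability, and the host organ attacked) are independent labels, and they can be summarized in a small data structure. The following Python sketch is illustrative only: the class and field names are hypothetical, and the per-species assignments for the three PPGP parasites follow their standard characterization in the literature rather than anything spelled out in this section. The consistency check encodes one inference from the definitions: a plant that obtains all of its carbon from a host cannot complete its life cycle without one.

```python
from dataclasses import dataclass
from enum import Enum


class HostDependence(Enum):
    FACULTATIVE = "facultative"  # can complete its life cycle without a host
    OBLIGATE = "obligate"        # cannot complete its life cycle without a host


class Photosynthesis(Enum):
    HEMIPARASITE = "hemiparasite"  # retains some photosynthetic ability
    HOLOPARASITE = "holoparasite"  # obtains all of its carbon from the host


class AttachmentOrgan(Enum):
    ROOT = "root"
    STEM = "stem"


@dataclass(frozen=True)
class ParasiteProfile:
    """Hypothetical record combining the three classification axes."""
    name: str
    dependence: HostDependence
    photosynthesis: Photosynthesis
    organ: AttachmentOrgan

    def __post_init__(self):
        # Inference from the definitions (an assumption of this sketch):
        # a holoparasite depends entirely on host carbon, so it must be obligate.
        if (self.photosynthesis is Photosynthesis.HOLOPARASITE
                and self.dependence is HostDependence.FACULTATIVE):
            raise ValueError(f"{self.name}: a holoparasite cannot be facultative")

    def label(self) -> str:
        return f"{self.dependence.value} {self.organ.value} {self.photosynthesis.value}"


# Standard characterization of the three PPGP parasites (all root parasites).
PPGP_SPECIES = [
    ParasiteProfile("Triphysaria versicolor", HostDependence.FACULTATIVE,
                    Photosynthesis.HEMIPARASITE, AttachmentOrgan.ROOT),
    ParasiteProfile("Striga hermonthica", HostDependence.OBLIGATE,
                    Photosynthesis.HEMIPARASITE, AttachmentOrgan.ROOT),
    ParasiteProfile("Phelipanche aegyptiaca", HostDependence.OBLIGATE,
                    Photosynthesis.HOLOPARASITE, AttachmentOrgan.ROOT),
]

if __name__ == "__main__":
    for species in PPGP_SPECIES:
        print(f"{species.name}: {species.label()}")
```

Keeping the axes as separate fields, rather than one combined category, mirrors the way the text treats them as distinct questions about the same plant.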

Figure 1-1. Range of Parasitism among all major parasitic plant lineages [2]. Mode of parasitism is shown at each lineage. (r: root; s: stem; e: plant is mainly an internal endophyte). Numbers shown next to mode of feeding are the estimates of how many genera and species are in each lineage.

General Introduction of the Haustorium

The haustorium is the unique organ parasitic plants have evolved to penetrate their host plants’ tissue [1, 3]. It provides a vascular bridge that carries water and nutrients from the host plant to the parasitic plant, and is both a morphological and physiological bridge between host and parasite. When it is mature, it often appears as a swollen, round structure attached to the host’s surface. The developmental stages of haustoria include induction and early cell expansion, pre-attachment differentiation, host attachment, and penetration. Haustorial development is complete once vascular tissue has differentiated within the haustorium and cellular connections to the host’s vascular tissue have formed [4] (Chapter 3: Haustorial initiation and differentiation). The initiation of haustoria is unique in that it relies on exogenous signals in a complex environment. Haustorial-inducing factors (HIFs) have been described in cotton string, soybean extract and many plant root exudates [5, 6]. A simple phenolic compound, 2,6-dimethoxy-p-benzoquinone (2,6-DMBQ), has been found to be an HIF for Striga and Agalinis [7, 8]. This compound is common to most plants as an intermediate of phenylpropanoid metabolism and lignification [4] (Chapter 2: Seed Germination). Chang and Lynn also hypothesized that a laccase-like enzyme cleaves a quinone (DMBQ) from the host’s cell wall, and that this released DMBQ diffuses and functions as an HIF [8]. DMBQ is now very commonly used in experimental labs because of its efficiency in inducing haustoria.

Germination Signals Studies

Parasitic plants have evolved a system that uses chemical signals to determine whether host plants are in the vicinity. Many compounds that parasites utilize as germination signals have been identified in host plants. One of these compounds, the recently discovered plant hormone strigolactone, is the most studied [9, 10]. Two recent important studies have explained why host plants would give out germination signals to parasitic plants. The first study found that parasitic plants’ germination signals are also signals that facilitate colonization by arbuscular mycorrhizal fungi (AMF) [11]. AMF help to capture nutrients, such as phosphorus, in the soil for the plants. Plants produce strigolactones under phosphate starvation conditions in order to attract AMF colonization [10, 12, 13]. The second study found that strigolactone-related plant hormones also contribute to the regulation of branching in plants [14, 15]. As a result of these two studies, strigolactone is now known to be an important mediator of plants’ responses to the environment. How do parasitic plants utilize, and even take advantage of, such important signals? At this time, the mechanism remains unclear. One hypothesis is that parasitic plants may have shared the strigolactone pathway ancestrally with non-parasitic relatives and that parasites altered the function of this pathway at some point in their evolutionary history [2].

Studies on Genomes and Evolution of Parasitic Plants

Evolution of Parasitic Plants

It was hypothesized as early as 1970 that parasitic plants have three distinct phases of evolution [16]. In the first phase, the parasites evolve from free-living organisms into initial parasites, or symbionts with hosts. The second phase involves relaxed natural selection on important functions required for free-living organisms. In the third phase, the parasites evolve new functions and adaptations for a parasitic lifestyle [17]. With more and more genomic information available, this hypothesis can now be tested, and numerous studies have already been performed [18, 19]. The following reviews in this section (1.4) focus on major findings in each field.

[Parasitic Plants Plastid DNA studies] Genome studies of parasitic plants began when there was only a limited amount of genomic information available. Plastid DNA (ptDNA) is the most studied of the three plant genomes (plastid, mitochondrial, and nuclear), mostly due to the genome’s smaller size and its uniformity across most flowering plants. The first comprehensive analysis of a parasitic plant’s plastid DNA was done for Epifagus virginiana, which is a root holoparasite. Initial experiments were done using Southern blot hybridization with the plastid DNA of tobacco (Nicotiana tabacum), which is a related non-parasitic plant [20]. The results showed that Epifagus plastid DNA includes regions that are highly conserved with non-parasitic plants and regions that appear to be highly diverged. Further research on the genome map of Epifagus plastid DNA found that its plastid genome is highly reduced in size and that plastid-encoded photosynthetic genes had been deleted [18]. Another report on the Epifagus virginiana plastid genome [21] shows that this nonphotosynthetic parasitic plant lacks all genes for photosynthesis and chlororespiration. One recent review paper [22] provides a summary of the evolution of plastid genomes in parasitic plants. It reveals that 12 out of more than 130 completely sequenced plastid genomes are strongly reduced plastid genomes from parasitic plants or plant-related species having a parasitic lifestyle. Results from this review support the second evolutionary phase hypothesized by Searcy, in which parasites lose, or experience relaxed natural selection on, functions that are important for free-living organisms but not needed in the parasite. For parasitic plants, if nutrients (organic macromolecules and ions) can be obtained from hosts, the need for photosynthetic activity is lessened, leading to a reduced plastid genome. Reduced plastid genome size could also bring replication advantages, since a smaller genome replicates more quickly [4] (Chapter 8: Genes and genomes).

[Parasitic Plant Nuclear DNA Studies] In comparison to plastid genomes, which have been fully sequenced, much less is known about the nuclear genes of parasitic plants. This section will focus on the nuclear DNA studies performed before the large scale EST studies in parasitic plants, which will be discussed in section 2.4. Most nuclear DNA studies in parasitic plants in the early 1990s focused on nuclear ribosomal RNAs. This focus was mainly for two reasons. First, the DNA sequences available in the early 1990s were very limited since the sequencing technology was not as advanced as today’s. Second, since ribosomal genes are important to the functionality of cells, their sequences are quite conserved. As a result, primers designed from the conserved regions can be used to amplify ribosomal genes in different organisms, which led to the widespread use of ribosomal DNA in phylogenetic analyses. Plants’ nuclear ribosomal RNAs are encoded by a series of tandemly repeated sequences [23]. Each repeat includes an 18S (small subunit) gene, a 28S (large subunit) gene, a 5S gene and a spacer region [4] (Chapter 8: Genes and Genomes). Numerous phylogenetic studies have been performed on 18S rRNA [24, 25]. The first molecular phylogenetic analysis of parasitic plants based on 18S rRNA was performed by Nickrent et al. [24]. In this analysis, it was shown that Phoradendron serotinum and Dendrophthora domingensis may be derived from, or be sister to, the Santalaceae.

Since the early 1990s, researchers have continued to use nuclear rDNA sequences when analyzing parasitic plant phylogeny. The following are a few examples from studies using nuclear DNA in the past several years. Nuclear small subunit ribosomal DNA was used to analyze the phylogeny of all the genera from [26]. ITS regions of ribosomal DNA were used to analyze the phylogeny of the holoparasite Orobanche [27]. 26S rDNA [28], along with 18S rDNA and chloroplast genes, was also used in the phylogenetic analysis of Ericales.

[Parasitic Plant Mitochondrial DNA Studies] Much of the early phylogenetic research on parasitic plants was performed using plastid genes, as discussed above. However, the loss of photosynthetic and other genes has prevented comprehensive analysis of many parasitic plants. For this reason, researchers have started to use mitochondrial genes in phylogenetic analyses of parasitic plants. The phylogenetic position of , a holoparasite, was reported to be in the order of using mitochondrial genes [29, 30]. The phylogenetic position of Cynomorium, another holoparasite, was determined to be in Saxifragales using the mitochondrial gene matR [31]. Another important finding of this report is that the phylogenetic position of Balanophoraceae was determined at . According to this report, Balanophoraceae and Cynomoriaceae have independent origins, based on strong support from both mitochondrial and nuclear genes. The first large-scale phylogenetic study of parasitic plants using mitochondrial genes was reported in 2007 [32]. In this study, a phylogenetic analysis of 102 species of seed plants was performed to infer the position of all haustorial parasitic angiosperm lineages using three mitochondrial genes: atp1, cox1 and matR. The resulting phylogeny showed that parasitism has evolved independently at least 12 times. Moreover, the mtDNA phylogenetic tree is highly congruent with phylogenies of non-parasitic plants based on plastid and nuclear data. The placements of major parasitic lineages with strong support (BP > 50) according to this study are as follows: 1. Hydnoraceae in Piperales; 2. Cassytha with ; 3. Cuscuta with Solanales; 4. Orobanchaceae with ; 5. Lennoaceae with Boraginaceae; 6. Mitrastemonaceae with Ericales; 7. Cytinaceae with Malvales; 8. Krameriaceae with Zygophyllaceae; 9. Rafflesiaceae with Malpighiales. The finding that parasitism has arisen independently at least 12 times suggests that parasitic plants have evolved repeatedly from free-living plant lineages, which also supports Searcy’s phase one evolutionary hypothesis. Another important finding of this report is that endoparasitism has arisen in four independent lineages: Apodanthaceae, Rafflesiaceae, Cytinaceae and Mitrastemonaceae. These four families had traditionally been included in Rafflesiaceae, as endoparasitism was previously assumed to be uniquely derived [1]. The results of this research indicate that endoparasites are not monophyletic.


Molecular Mechanisms Used by Parasitic Plants to Germinate

Several studies have focused on interesting nuclear genes in parasitic plants. Alpha-expansin genes have been reported as differentially expressed in Triphysaria versicolor (a hemiparasite) root cells when treated with exudates from host plant roots [33]. The asparagine synthetase gene was found to be upregulated in Triphysaria versicolor root cells when seedlings were exposed to host root exudates [34]. Two genes, TvQR1 and TvQR2, which encode distinct quinone reductases, were found to be upregulated in Triphysaria versicolor root cells when exposed to DMBQ [35]. However, homologs of the two genes in non-parasitic plants show different regulation patterns. When exposed to DMBQ, TvQR1 was upregulated only in parasitic plants, while TvQR2 was upregulated in both parasitic and non-parasitic plants, suggesting that TvQR1 may be an important gene in the haustorium developmental pathway. TvQR2 cDNA was further cloned into Pichia pastoris, and the recombinant protein was purified to explore the function of this gene [36]. The recombinant protein reduced a variety of quinones and naphthoquinones. Based on this result, it was proposed that TvQR2 functions as a quinone reductase in roots to mitigate the toxicity of exogenous quinones present in the rhizosphere.

The findings summarized in the previous paragraph focused mainly on haustorial responses to host root exudates. Host exudates contain not only organogenic factors but also phytotoxic factors. Haustorium development in parasitic plants was observed under exposure to different concentrations of host root exudates [37]. At high concentrations of host root exudates, haustorial development was greatly hindered and eventually failed. Phytotoxic molecules in host root exudates are a means for the host plant to limit the growth of other plants when competing for limited resources. However, it seems that parasitic plants take advantage of this host strategy. Because parasitic plants sense toxic molecules in host root exudates (e.g., quinones), the expression of quinone reductases (TvQR1 and TvQR2) is upregulated correspondingly. Such resistance mechanisms in parasitic plants, which were discovered by the studies summarized in the previous paragraph, support phase 3 of Searcy’s parasitic plant evolutionary hypothesis [16, 17], which states that parasitic plants develop more complex adaptations specific to the parasitic lifestyle. The discovery that parasitic plants take advantage of signals that host plants use as toxins is an excellent example of Searcy’s phase 3 hypothesis.

To date, no complete nuclear genome sequence exists for any parasitic plant. Studies have started by generating EST datasets (e.g., the PPGP, Pscroph and Striga hermonthica EST databases) for parasitic plants. All three large-scale EST databases are introduced in the following sections. EST studies have provided rich information on the transcriptomes and expression profiles of parasitic plants (whole-plant EST datasets vs. EST datasets from a specific organ or developmental stage). The extensive EST datasets not only allow for research on the true phylogeny of parasitic plants, but also enable researchers to understand more deeply the and biology of parasitic plants.

Introduction to Triphysaria versicolor, Striga hermonthica, and Orobanche aegyptiaca

Parasitism Degree in Orobanchaceae

Among the major lineages of parasitic plants, three contain only hemiparasites, eight contain only holoparasites, and only Orobanchaceae contains both hemiparasites and holoparasites (as shown in Figure 1). Orobanchaceae also includes the most important agricultural pests. These unique features have made the Orobanchaceae the obvious focus of our studies.

Among the three species chosen in this study, Triphysaria versicolor is a facultative hemiparasite; Striga hermonthica is an obligate hemiparasite (which needs the host plant from germination to maturity); and Orobanche aegyptiaca is an obligate holoparasite [2]. All three species in this study are root parasites.

Evolutionary History and General Information for Orobanchaceae

Early studies of the phylogenetic position of Orobanchaceae tended to use chloroplast genes [18, 19, 38]. However, holoparasites are usually not included in such studies because of the loss of photosynthetic and other genes from the chloroplast. Nuclear genes have also been used to determine the phylogenetic position of Orobanchaceae; e.g., Bennett et al. (2006) used the nuclear gene encoding the photoreceptor phytochrome A (PHYA) [39]. Nevertheless, no large-scale phylogeny of the major parasitic lineages had been reported until Barkman et al. (2007) performed a comprehensive phylogenetic analysis of major parasitic lineages using mitochondrial genes [32].

In the same study, Orobanchaceae was also confirmed to be a member of the asterids [32]. The phylogenetic relationships among the three study subjects remain controversial. Hypotheses for the phylogenetic relationships of the three parasites are illustrated in Figure 1-2, which shows two possibilities: one is that Triphysaria versicolor and Striga hermonthica form a subgroup, and the other is that Striga hermonthica and Orobanche aegyptiaca form a subgroup. Mimulus guttatus is the closest non-parasitic relative of the three parasitic plants. The most recent Orobanchaceae phylogeny was inferred using 229 putative nuclear single-copy genes (PPGP annual report Year 3, unpublished data, N. Wickett and C. dePamphilis), and the result was consistent with Figure 1-2, supporting Triphysaria versicolor and Striga hermonthica as a subgroup [2].

Figure 1-2: Phylogenetic Hypotheses for relationships of Triphysaria, Striga and Orobanche [2].

The three parasitic plants in Orobanchaceae differ in their host ranges. Triphysaria versicolor’s host plants range from monocots to dicots. Striga hermonthica parasitizes monocots, and Orobanche aegyptiaca parasitizes (within dicots). Host plants with abundant sequence information for each chosen study subject are shown in Figure 1-2. The chromosome numbers of the three parasites do not differ significantly from that of Mimulus guttatus; details are also shown in Figure 1-2. Genome size appears to increase with the degree of parasitism. The reason for this interesting pattern remains unclear.


Ecological Impact of Orobanchaceae

One of the main reasons for this study is that Striga (witchweeds) and Orobanche (broomrapes) have a devastating impact on agricultural plants. For example, over two thirds of the 73 million hectares of farmland cultivated for cereal grains and legumes in Africa are infested with one or more Striga species, affecting the livelihoods of 100 million farmers in 25 countries [40, 41]. Most Striga species parasitize grasses, such as maize (Zea mays), rice (Oryza sativa) and sorghum (Sorghum bicolor), which are all very important crops. In Europe, Orobanche species damage important crops, including legumes (faba bean, chickpea and pea), vegetables (tomato, potato and carrot), and oilseed crops (sunflower and Brassica) [42]. Striga and Orobanche infestations also exist in the U.S. The methods to control these infestations, however, are very costly. It is difficult to control the parasitic plants using conventional methods because they attack crops underground and produce large numbers of seeds that survive for decades. Efforts to breed parasite-resistant crops have had some success, including physiology-based breeding methods [43, 44] and molecular breeding (marker-assisted breeding) [45-50]. Biotechnology has also been applied in attempts to generate crops resistant to parasites.

Large-Scale Genome Information Studies on Orobanchaceae

Two large-scale studies based on EST data from parasitic plants were performed before the PPGP (Parasitic Plant Genome Project). In 2005, Pscroph [51, 52], an EST database enriched for Triphysaria versicolor root transcripts, was made publicly available. In this study, Triphysaria versicolor roots were treated with host roots, haustorium-inducing factors, or host root exudates. Three suppressive subtractive libraries were prepared separately for Triphysaria versicolor root transcripts that were either upregulated or downregulated. A total of nine thousand ESTs were obtained. The database provided useful resources for investigating root haustorium development in Triphysaria versicolor. Moreover, the study reported interesting findings from the comparison of the downregulated and the upregulated transcripts. First, transcripts associated with the mitochondrion and electron transport were overrepresented. Second, transcripts associated with stress response were also upregulated. Third, transcripts associated with the metabolism of nucleic acids and proteins were downregulated. Although this was a fairly large-scale study, it was limited to changes in up- or downregulation in the transcriptome of a single parasitic plant species (Triphysaria versicolor).

Another large-scale EST database of parasitic plants is that for Striga hermonthica from the Riken Institutes [53]. In this study, full-length-enriched cDNA libraries were made with the SMART cDNA synthesis method. An advantage of this study, therefore, was that the library-building method avoids the 3’ bias caused by poly-A tail-based amplification. RNA was extracted from S. hermonthica seedlings, shoots, flowers and roots (secondary haustoria). A total of 67,814 high-quality ESTs were obtained, and the data are available in the Striga hermonthica EST Database (http://striga.psc.riken.jp/strigaDB/index.php). Though this study is limited to Striga hermonthica, it provided a large amount of transcriptome information to the public before the initiation of the Parasitic Plant Genome Project, which is the topic of the next section.


Introduction to the Parasitic Plant Genome Project (PPGP)

The Reasons to Initiate the PPGP and its Goals

The major subjects of the Parasitic Plant Genome Project (http://ppgp.huck.psu.edu) are three species in Orobanchaceae: Triphysaria versicolor, Striga hermonthica, and Orobanche aegyptiaca. As discussed earlier, Triphysaria versicolor is a facultative hemiparasite, Striga hermonthica is an obligate hemiparasite, and Orobanche aegyptiaca is an obligate holoparasite. Together, the three species cover the full spectrum of parasitic lifestyles, which is one of the main reasons they were chosen for this project. They thus serve as an excellent resource for understanding the evolution and consequences of heterotrophic abilities in plants. The other reason for choosing these three subjects is the hope that the power of genomics can provide clues about how to control the destructive impact of these plants on important crops. Through understanding the genomic and transcriptomic information of the parasites, novel strategies to control them may be developed. These reasons underlie the main goals of the PPGP, which are to understand the evolutionary mechanisms by which plants acquire parasitism, to discover how parasitic plant genomes change to accommodate the parasitic lifestyle, to characterize how host plants respond to invasion by parasitic plants, and to understand the interaction between host plants and parasitic plants.


The Design and Unique Features of the PPGP

The PPGP was carefully designed at several levels (roughly three) in order to fulfill its main objectives. 1. The PPGP has been utilizing both 454 and Illumina platforms to achieve the best combination of sequencing. Illumina platforms compensate for the homopolymer problems of 454 platforms and provide a larger amount of data per run, while 454 platforms provide longer reads. 2. The host plants of the three parasitic plants were chosen specifically because they have a fully or partially sequenced genome available, with the one exception of Nicotiana tabacum, which has extensive genomic filtration data and EST data to provide sufficient information. This choice facilitates the downstream data analysis for understanding the host plants’ responses at a genomic level. The host plants used in the PPGP are shown in Table 1-1. 3. Parasitic plants have major developmental stages, which divide roughly into six stages. Information on the stages is shown in Table 1-2.

For each developmental stage, tissue samples were collected for each parasitic plant, non-normalized libraries were prepared, and sequencing was performed. Replicate libraries were also separately prepared and sequenced for most developmental stages. This design supports two kinds of studies: one is to determine the expression level of different genes at each stage, and the other is to compare the expressed transcriptome across developmental stages. Moreover, besides the individual developmental stage data, the PPGP also designed three normalized libraries for whole plant tissues, one for each study subject. This will help to identify the complete parasitic plant transcriptome.

Table 1-1. Host plants used in the Parasitic Plant Genome Project.

Parasitic Plant Name | Host Plants Used in PPGP
Triphysaria versicolor | Medicago truncatula, Zea mays
Striga hermonthica | Sorghum bicolor
Orobanche aegyptiaca | Arabidopsis thaliana, Nicotiana tabacum

Table 1-2. Developmental stages of parasitic plants defined by the Parasitic Plant Genome Project (http://ppgp.huck.psu.edu/download.php), with characteristics of each stage and their potential for having host plant tissue contamination in library preparations.

Stage Name | Potential for Host Plant Tissue Contamination | Description of the Developmental Stage | More Information
0 | None | Seed germination | Pre-attachment of haustoria
1 | None | Germinated seed; radicle emerged; pre-haustorial growth |
2 | None | Seedling after exposure to haustorial induction factors (HIFs) |
3 | Yes | Haustoria attached to host root; penetration stages, pre-vascular connection (~48 hrs.) | Early post-attachment
4.1 | Yes | Early-established parasite; parasite vegetative growth after vascular connection (~72 hrs.) |
4.2 | Yes | Spider stage |
5.1 | None | Pre-emergence from soil - shoots | Late post-attachment
5.2 | None | Pre-emergence from soil - roots |
6.1 | None | Vegetative structures; leaves/stems |
6.2 | None | Reproductive structures; floral buds (up to anthesis) |


The Achievements of the PPGP

The PPGP’s sequencing is still ongoing. Stages that have data available are shown in Table 1-3 [54]. For each dataset, the data are first assembled de novo. For 454 data, MIRA (http://www.chevreux.org/projects_mira.html) was chosen for de novo assembly. MIRA 2.9.45 was used with the following command-line options:

-project=[project_name] -job=denovo,EST,draft,454 -notraceinfo -OUT:ora=yes:ota=yes -AS:ugpf=no -CL:cpat=yes -AL:egp=yes:ms=20:mrs=75 -SK:mnr=yes

For Illumina data, CLC Genomics Workbench (http://www.clcbio.com/index.php?id=1240) was used for de novo assembly with default parameters. Assembled contigs were annotated using customized scripts, and a putative Gene Ontology definition was assigned to each contig. The initial raw data, the assembled contigs and the annotation are all publicly available at the Parasitic Plant Genome Project website (http://ppgp.huck.psu.edu).

Table 1-3. Sequencing data summary for the Parasitic Plant Genome Project (October, 2013). Output is given in the number of reads and megabases (MB) of sequence for each library (the average Illumina read length for the following libraries is 75 bp/read).


The above raw data were assembled using the CLC Genomics Workbench for Illumina data and MIRA for 454 data (assembly details are described at the beginning of this section). The data, although incomplete, provide valuable information for studies on the evolution of parasitic plants and on parasite function. For example, these data provide outstanding opportunities for testing whether important parasitic functions may have originated through HGT from Agrobacteria, which would be consistent with a classic hypothesis proposed by Atsatt [6].

A Brief Review of Horizontal Gene Transfer in Microorganisms and Plants

HGT in Bacteria

HGT in Bacteria Findings

Horizontal gene transfer (HGT) is any process in which an organism incorporates genetic material from another organism without being that organism’s offspring. The phenomenon was first reported in bacteria in the 1950s, when multidrug resistance emerged on a worldwide scale, indicating that antibiotic resistance traits were transferred among taxa rather than generated de novo by each taxon [55]. Over time, more and more HGT cases have been identified in bacteria, and HGT is now considered a common event in bacterial evolution. A substantial amount of HGT is associated with plasmid-, phage- or transposon-related sequences [56]. In the past ten years, pathogenicity islands (PAIs) in bacterial genomes were discovered to be another example of HGT in prokaryotes. PAIs were originally identified in pathogenic bacteria. Though the exact origin of PAIs is not clear, it has been hypothesized that they were derived from plasmids or bacteriophages that lost their genes for replication and self-transfer. They are usually unstable and contain mobility genes that encode integrases or transposases [57, 58]. PAIs carry one or more virulence-associated genes and are very often found associated with tRNA genes. With the increasing number of sequenced bacterial genomes and comparative studies between bacteria, it has been found that features of PAIs in pathogenic bacteria can also be found in non-pathogenic bacteria. The definition of PAIs, therefore, has been broadened, and genomic islands (GEIs) have now been defined as a unique group of genetic entities.

Common Methods identifying HGT and Mechanisms of HGT in Bacteria

How are HGT cases identified in bacteria, and what mechanisms lead to high frequencies of HGT? There are two commonly used identification methods. 1. The first, more commonly used, method is to build a phylogeny and determine whether there is any strongly supported discordance (e.g., with a very high bootstrap value) relative to the commonly accepted species phylogeny. This method is used to detect HGT in both prokaryotes and eukaryotes. 2. The second method relies on the observation that genes within a particular bacterial species are very similar in base composition, patterns of codon usage, and frequencies of di- and trinucleotides [59, 60]. A horizontally transferred sequence retains the sequence characteristics of the donor genome and can thus be distinguished from the rest of the acceptor genome. Using this method, Ochman et al. (2000) evaluated 19 complete bacterial genomes and determined the extent of HGT for each bacterial taxon. According to this analysis, about 12.8% of the E. coli genome was acquired through HGT [56]. Bacteria have their own mechanisms that facilitate HGT events, which include transformation, transduction and conjugation. Transformation involves the uptake of naked DNA from the environment; this mechanism thus has a high potential to enable HGT between two distantly related organisms. Transduction involves a mediator, a bacteriophage, that carries genetic material from the donor bacterium to the acceptor bacterium. Conjugation involves physical contact between the donor and acceptor bacterial cells. Reports have shown that bacterial conjugation mediates the transfer of genetic material between different domains of life, e.g., bacteria and yeast [61], and bacteria and plants [62].
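As an illustration of the second (composition-based) method described above, the following minimal Python sketch compares the dinucleotide frequencies of individual genes with those of a genomic background and flags genes whose composition is atypical. It is only a sketch under stated assumptions: the function names, the Euclidean distance measure, the threshold of 0.15 and the toy sequences are hypothetical choices made for illustration and are not the statistics used in the cited studies [56, 59, 60].

```python
from collections import Counter
from math import sqrt

BASES = "ACGT"

def dinucleotide_freqs(seq):
    """Relative frequencies of the 16 dinucleotides (ambiguous bases skipped)."""
    seq = seq.upper()
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    counts = Counter(p for p in pairs if p[0] in BASES and p[1] in BASES)
    total = sum(counts.values())
    return {a + b: (counts[a + b] / total if total else 0.0)
            for a in BASES for b in BASES}

def composition_distance(gene, background):
    """Euclidean distance between two dinucleotide frequency profiles."""
    fg = dinucleotide_freqs(gene)
    fb = dinucleotide_freqs(background)
    return sqrt(sum((fg[k] - fb[k]) ** 2 for k in fg))

def flag_atypical(genes, background, threshold=0.15):
    """Names of genes whose composition differs from the background.

    The threshold is an arbitrary, hypothetical cutoff for illustration.
    """
    return [name for name, seq in genes.items()
            if composition_distance(seq, background) > threshold]

# Toy data (invented): an AT-rich background and two short "genes".
background = "AT" * 50
genes = {"native": "AT" * 20, "foreign": "GC" * 20}
print(flag_atypical(genes, background))
```

With these toy sequences only the GC-rich gene is flagged. A screen of this kind can only nominate candidates; each would still need to be examined phylogenetically, as in the first method.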


HGT in Eukaryotes

HGT Findings Summary in Eukaryotes

[HGT Findings in Non-Plant Eukaryotes] In contrast to bacterial HGT, there are only limited reports of HGT events in eukaryotic genomes. Scholl et al. identified several nuclear genes in plant-parasitic nematodes that were evidently acquired from bacteria [63]. HGT cases have also been identified in fungi, and these transfers have conferred some novel functions [64-66]. One of the most famous HGT findings in animals involves the P transposable element in Drosophila and related genera [67, 68]. Other well-known HGT findings in animals involve the mariner transposons in insects [69, 70] and nuclear genes transferred between a vertebrate host and a protozoan parasite [71].

[HGT found in Plants, mainly in Mitochondrial Genes] Numerous efforts have been made to look for HGT in flowering plants, and many likely instances of mitochondrial HGT have been identified. The cases were initially identified by observing that the phylogenetic position of the HGT candidate was supported as incongruent with the commonly accepted phylogeny. Bergthorsson et al. reported HGT events involving the mitochondrial genes atp1, rps2 and rps11 in distantly related flowering plants [72]. Gene sequences were obtained by PCR. In the HGT case of rps11, expression status was also explored by reverse transcription of rps11 mRNA in Sanguinaria. Not only were the cDNA sequences obtained, but evidence of RNA editing was also found, which supports the HGT gene being functional. In 2004, the same group reported massive mitochondrial HGT (20 out of 31 mitochondrial genes) from diverse land plants to the basal angiosperm Amborella. A PCR approach was used to obtain the mitochondrial gene sequences. Even more intriguing, most of the transferred genes were intact and potentially functional. The mechanism concerning why and how such large-scale HGT happened in the basal angiosperm remains unknown [73]. However, some of the phylogenies shown in this study do not have high bootstrap support (15 out of 26 putative HGT cases have a bootstrap value of less than 60%), which weakens the findings. Won et al. (2003) identified the horizontal transfer of the mitochondrial nad1 intron 2 and adjacent exons b and c from an asterid to Gnetum. Sequences were obtained by PCR. In addition to phylogenetic evidence, the study examined the domain structure of nad1 intron 2 in the potential donor and acceptor [74]. Another interesting finding was reported by Cho et al. (1998): the widespread invasion of the plant mitochondrial cox1 gene by a ‘homing’ group I intron. Southern blots were used to determine whether each of 341 species of land plants had the group I intron, followed by cloning and sequencing. This was an immense study considering the sequencing technology available at the time [75].
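The criterion used in the studies above, a placement in the gene tree that conflicts with the accepted species phylogeny and is strongly supported, can be written as a simple decision rule. The following Python sketch is illustrative only: the Placement record, the clade labels, the 70% bootstrap cutoff and the example records are all assumptions introduced here and are not taken from the cited studies.

```python
from dataclasses import dataclass

@dataclass
class Placement:
    gene: str
    expected_clade: str   # where the species phylogeny places the taxon
    observed_clade: str   # where the gene tree places the sequence
    bootstrap: float      # percent support for the observed placement

def classify(p, min_bootstrap=70.0):
    """Label a gene-tree placement relative to the species phylogeny.

    min_bootstrap is a hypothetical cutoff chosen for illustration.
    """
    if p.observed_clade == p.expected_clade:
        return "congruent"
    if p.bootstrap >= min_bootstrap:
        return "supported incongruence"  # HGT candidate worth following up
    return "weak incongruence"           # conflict without strong support

# Invented example records.
records = [
    Placement("geneA", "eudicots", "monocots", 95.0),
    Placement("geneB", "eudicots", "monocots", 55.0),
    Placement("geneC", "eudicots", "eudicots", 88.0),
]
for r in records:
    print(r.gene, classify(r))
```

Under this rule, a conflicting placement with less than the chosen support (such as the invented geneB record) is reported separately rather than counted as an HGT candidate, which mirrors the concern raised above about putative cases with bootstrap values below 60%.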

[HGT found in Parasitic Plants’ Mitochondrial Genes] Among the mitochondrial HGT cases found in plants, many are found in parasitic plants. Davis et al. (2004) reported the first mitochondrial HGT (nad1B-C), with the sequence obtained by PCR from the mitochondrial genome. The sequence grouped with the host plant lineage (Vitaceae) of the parasitic plants (Rafflesiaceae) with high bootstrap support (100%) [76]. In the same year, Mower et al. reported the first mitochondrial HGT cases involving the atp1 gene, showing highly supported transfers (bootstrap values of 97% and 81%) from parasitic plants to host plants. Using direct PCR amplification of the sequences, the study demonstrated that the transferred sequences were genes normally encoded in the mitochondrial genome [77]. Davis et al. reported the first mitochondrial HGT (nad1 and matR) from a parasitic plant to a fern in 2005 [78]. Subsequently, the dePamphilis group reported a large-scale phylogeny of parasitic plants using three mitochondrial genes (atp1, cox1 and matR). In this report, multiple putative HGT events were detected in the atp1 gene. The horizontally transferred copy is also the only copy of the atp1 gene found in the acceptor species with an intact open reading frame, suggesting that it is functional. RT-PCR performed on this gene in Rafflesia cantleyi demonstrated that it was transcribed and appropriately RNA edited, which further indicates that it is potentially functional. Moreover, in this report, a putative correlation between parasitism and the presence of the group I intron in the cox1 gene was observed [32].

[HGT Found in Plant Nuclear and Plastid Genes] Although most HGT events detected in plants involve mitochondrial genes, nuclear and plastid HGTs have also been reported, though on a much smaller scale. Diao et al. (2006) reported a putative HGT case involving a transposon (Mu-like element) between two grass lineages (Setaria and Oryza sativa). The genomic sequences were obtained by PCR [79]. Park et al. (2007) reported a case of HGT involving the plastid gene rps2 between two parasitic plants in Orobanchaceae [80]. This gene encodes the S2 subunit of the plastid ribosome. Whether the HGT candidate identified in this study is located in the plastid, mitochondrial, or nuclear genome is unresolved. Furthermore, a research group in Japan recently published the first finding of nuclear HGT in the parasitic plant Striga hermonthica [81]. The analysis began with an examination of EST datasets of Striga hermonthica [53]. After the potential HGT candidate was identified, the genomic sequence was obtained by PCR. Because this candidate was initially identified among cDNA sequences, the HGT sequence is transcribed and potentially functional. Although the mechanism and the function of this HGT are not fully elucidated, a possible integration mechanism involving an intermediate cDNA/mRNA was proposed, based on the fact that this HGT gene has 13 consecutive adenine nucleotides at the 3’ end. Southern blot analysis for this particular gene was also performed to demonstrate that the sequence is present in the genome of Striga hermonthica.

However, the homolog in Sorghum bicolor (Sb01g013240) is a gene with no introns, which weakens the hypothesis that a cDNA was directly inserted into the acceptor genome, though the other homolog (Sb0015s004030) is a gene with 3 exons. Another drawback of this research was its HGT analysis procedure. When a BLAST search found any hits in eudicot species, the sequence was excluded from the pool of potential HGT candidates. Such a procedure is very likely to miss HGT candidates that have hits in closely related species (eudicots) but significantly higher similarity to the donor lineage (in this case, monocots). The procedure can only recover HGT genes that are unique to monocot species. Xi et al. reported that a large number of HGT candidates were detected in the parasitic plant Rafflesia cantleyi [82, 83]. In a separate study, HGT events involving C3 or C4 photosynthetic pathways in panicoid grass species were reported [84]. Nuclear genes were horizontally transferred between panicoid species and were subsequently adapted into the existing pathways, with the effect of advancing the extent of C4 photosynthesis in some lineages [84]. Recently, our group published a paper demonstrating the long-term retention and evolution of a horizontally acquired legume gene, albumin1, in Phelipanche aegyptiaca and related species. This finding is discussed in detail in Chapter 2 of this thesis [85]. A summary of the reported HGT cases in plants is given in Table 1-4.


Table 1-4. Summary list of major HGT cases reported in plants.

Reference | HGT genes | Donor plants/other species | Acceptor plants | Parasitic plants involved? | Genome compartment | More information
[77] | atp1 | Cuscuta | Plantago | No | mitochondrion | The transferred atp1 genes have become pseudogenes.
 | | Bartsia (Orobanchaceae) | Plantago | | |
[72] | rps2 | Monocots | Actinidia | No | mitochondrion | Mitochondrion gene
 | rps11 | | Caprifoliaceae | No | mitochondrion | The upstream noncoding region is also HGT.
 | rps11 | Monocots | Sanguinaria (a basal eudicot) | No | mitochondrion | Only the 3’ half of the gene is HGT.
 | atp1 | eudicots | Amborella | No | mitochondrion | mitochondrial gene
[73] | 26 mitochondrial genes | Other land plants (eudicots, moss, Bryophyte) | Amborella | No | mitochondrion | mitochondrial gene
[75] | Group I intron (resides in cox1 gene) | fungus | 48 angiosperm genera | No | mitochondrion | mitochondrial gene
[74] | nad 1 Group II intron and adjacent exons b and c | asterid | Gnetum (Gnetales, gymnosperms) | No | mitochondrion | mitochondrial gene
[76] | nad 1B-C | Tetrastigma | Rafflesiaceae | Yes | mitochondrion | Host to parasites HGT
[78] | nad 1B-C, matR (second copy) | Santalales | fern (Botrychium virginianum) | Yes | mitochondrion | Host to parasites HGT
[32] | atp1 | four host lineages | endoparasite lineages | Yes | mitochondrion | Host to parasites HGT
 | cox1 intron | unknown vectors | Almost all parasitic lineages | Yes | mitochondrion | This is unique because it’s an intron HGT.
[81] | gene with unknown function | Sorghum bicolor | Striga hermonthica | Yes | nuclear | Host to parasites HGT
[80] | plastid gene (rps2) | Orobanche | Phelipanche | Yes | plastid | plastid gene
[79] | Transposon (MULEs) | rice | Setaria | No | nuclear | nuclear gene
[84] | C3, C4 photosynthetic pathways | Panicoid | Panicoid | No | nuclear | nuclear gene
[82] | expressed genes | Tetrastigma | Rafflesia | Yes | nuclear | large number of HGT candidates identified
[83] | mitochondrial genes | Tetrastigma | Rafflesia | Yes | mitochondrion | massive mitochondrial gene transfer
[85] | albumin1 | legumes | Broomrape species | Yes | nuclear | nuclear gene


Demonstration of HGT typically begins with the observation of an unusual BLAST hit or a misplaced sequence on a phylogenetic tree. However, these observations alone can only suggest that the sequence in question may have been acquired through HGT. Unambiguous identification of HGT requires that numerous alternative hypotheses, including contamination from the host plant or other sources, be carefully excluded.

Studies have taken the following steps to avoid contamination [32, 77, 86].

1. In some studies, two HGT cases rather than one were discovered for the same gene in different species of the same family, which could be considered phylogenetically reproducible.

2. When the candidate HGT genomic sequences were isolated, control experiments were performed in which a common gene (e.g., a ribosomal gene) was isolated and shown to be vertically transmitted.

3. If the HGT sequence had become a pseudogene, the functional copy was also isolated and incorporated into the phylogeny.

4. If possible, the experiments were performed in two to three individual labs, which obtained the same sequences.

5. HGT conclusions were usually based on very strongly supported phylogenies (with very high bootstrap values and high posterior probabilities), and the organismal phylogeny of both hosts and donors was not in debate.

6. When a potential HGT gene was identified, some research groups identified the transcripts and, more surprisingly, identified RNA editing sites on the potentially functional transcripts, which strongly supported the HGT finding [32].

7. It was also important to establish that the horizontally transferred sequence is actually in the genome. Southern blotting, or obtaining the genomic sequence, helped to infer the mechanism of integration, for example by comparing the genomic sequence with the cDNA sequence.


Mechanisms of HGT in Eukaryotes

[Mechanisms at the Physiological Level] The specific mechanism(s) for HGT in eukaryotes remain a mystery. To initiate HGT, a vector or a bridge needs to connect the donor and the receptor. There are currently three proposed physiological contact methods for eukaryotes to achieve HGT. The first is through co-infection by a viral or bacterial vector, the second is through direct contact with a parasite (e.g., parasitic plant, parasitic nematode, parasitic fungus), and the third is through grafting [87] [88].

The physiological connections between parasitic plants and host plants for the three study subjects, Triphysaria versicolor, Striga hermonthica and Orobanche aegyptiaca, are known to differ in their haustorium tissues. For Triphysaria and Striga species, only xylem connections between hosts and parasites have been observed [89, 90]. In Orobanche and Phelipanche species, direct symplastic connections between the cells of the parasite and the host sieve elements have been observed by electron microscopy [91]. Furthermore, additional studies have reported direct phloem transmission of dyes, proteins [92] and even viruses [93]. Symplastic parasites may therefore absorb a wider range of molecules directly from the host phloem, including small nutrient molecules and potentially macromolecules as well.

Several reports show that macromolecules, including proteins and RNA, move between parasites and hosts. Most of these reports concern Cuscuta, which has exceptionally open connections to host vascular tissues. Dye tracers, soluble proteins, mRNA and viruses were all reported to move from hosts to parasites in Cuscuta [94-96]. This again suggests that the more open the connections are between hosts and parasites, the greater the chance that macromolecules are transferred into the parasites, thus raising the possibility of an HGT event.


Although there are fewer reports on this topic than for Cuscuta, some studies show movement of macromolecules in Triphysaria and Phelipanche. Green fluorescent protein (GFP) and the phloem-localized dye carboxyfluorescein were reported to be mobile in Phelipanche aegyptiaca [92]. In addition, single-stranded RNA and DNA viruses were reported to transfer from their hosts to Phelipanche aegyptiaca [93]. Trans-specific gene silencing that suppresses mannose 6-phosphate reductase (M6PR) expression in Phelipanche aegyptiaca has also been reported, suggesting the transfer of mRNA from host plants to parasites [97]. In Triphysaria, small RNA translocation was reported to suppress the GUS (beta-glucuronidase) reporter gene in parasites [98]. Currently, there are no reports of the movement of large sections of DNA between plants, but this has been proposed as a mechanism of HGT in plants [99]. Large sections of DNA may transfer over short distances in the area of a graft junction [100]. The haustorium shares some features with grafts, including inter-specific symplastic connections [101]. These facts suggest that the transfer of a large section of DNA from hosts to parasites is potentially possible.

If the HGT transgenes were picked up during the haustorial development stage and incorporated into the genome of the haustorium cells, this would increase the chance that such transient HGT transgenes could be maintained in other tissues of the plant as the parasite grew. This would require that cells of the haustorial junction retain or gain totipotency and give rise to shoots that could flower and then transmit the new genes [102]. If the cells also maintained such a gene in the genome during the reproductive stage, the HGT transgene could be carried into the next generation. However, parasite regeneration from a haustorium has thus far not been reported in the Orobanchaceae. This remains a target for future study if a viable mechanism for host-to-parasite HGT is to be found.


[Mechanism at the Cellular Level] Another interesting question is whether the transferred sequences entered the hosts as DNA or RNA. There are two possible answers to this question.

1. The first possibility is that the transferred sequence is DNA. Depending on the amount of DNA transferred, there are two scenarios under this category.

1.1 One possibility is that a huge “chunk” of DNA (up to and including the entire genome) was transferred and released into the host body. Such a process could involve bacteria as a donor that undergoes cellular lysis, and the released DNA may be incorporated into the host genome by recombination. Studies have shown that genomic islands in bacteria, which are more prone to HGT, have higher AT content, suggesting that the higher AT content was the result of the genetic recombination reactions necessary to move homologous but diverged DNA segments into new genomes [103]. This is also what has been proposed as the “you are what you eat” mechanism [104]. One very well supported example proposed in this literature is the endosymbiosis of alpha-proteobacteria within eukaryotic cells, followed by massive gene loss and gene transfer to the host’s nuclear genome. During the digestion of the engulfed bacteria, the DNA managed to remain partially intact and some parts were recombined into the host’s own genome.

1.2 It is also possible that a certain kind of DNA sequence, such as a DNA transposon, was transferred to the acceptor cells and was somehow integrated into the acceptor’s genome. A well-known case is P element movement in Drosophila melanogaster [67].

A second possibility is that the transferred sequence is RNA. Incorporating an RNA sequence into the genome would require a reverse transcriptase, which could be available if the enzyme is already present in the host, for example because the host is infected with a virus. It is also possible that the transferred RNA sequence itself encodes a reverse transcriptase (e.g., a retrotransposon, perhaps one already present in the genome through an earlier horizontal transfer). There are reports of putative fungal genes being horizontally transferred from non-retroviral RNA viruses [105]. Such events are quite rare, perhaps because incorporating RNA virus genes into host genomes would require reverse transcriptases, which are lacking in non-retroviral genomes. The plausible mechanism described in this report is that the host fungus has its own retroelements that may provide the reverse transcriptase needed for this process. Although HGT findings in eukaryotes are not nearly as abundant as those in prokaryotes, such findings should increase now that more eukaryotic genomes are being sequenced.

In order to sustain the function of an HGT transgene, regardless of whether it originated as a DNA or an RNA molecule, a promoter region or certain transcription factor binding sites should be maintained. This hypothesis is supported by findings on HGT in fungi. An ST/AF gene cluster involved in intermediary and secondary metabolism in fungi was reported to have been horizontally transferred from Aspergillus to Podospora anserina [106]. The intergenic regions of the horizontally transferred ST/AF gene cluster contain 14 putative binding sites for AflR, the transcription factor required for activation of the ST/AF biosynthetic genes.

HGT and Parasitic Plants Hypothesis

As discussed above, many HGT cases reported so far have involved parasitic plants. It is possible that parasitic plants have played an important role in transferring genes between species because of their unique physiological organs, the haustoria. Is it possible that a network of both vertically and horizontally transferred functional genes exists in parasitic plants and has facilitated the evolution of the haustoria’s unique functions, or other interactions with host plants? It has been hypothesized that haustoria may have evolved via a transformation (HGT) event from a bacterium such as Agrobacterium [6]. To answer these interesting questions, a large-scale examination of HGT events present in parasitic plants (focusing on Orobanchaceae in this study) is needed. Identifying host-derived HGT genes in parasitic plants could help explain which genes have been uniquely selected, or randomly acquired, by parasitic plants for integration into their genomes. Examining whether certain groups of genes in parasitic plants evolved from Agrobacterium could test the interesting hypothesis that Atsatt proposed [6].


Chapter 2

Evolution of a horizontally acquired legume gene, albumin 1, in the parasitic plant Phelipanche aegyptiaca and related species [85]

Background

Horizontal gene transfer (HGT) is the nonsexual transmission of genetic material across species boundaries [107, 108]. HGT is well known in bacteria, where HGT often results in adaptive gains of novel genes and traits [55-57]. There are fewer well-documented cases of HGT among eukaryotes, especially in plants [87] and the large majority of these cases appear to result in short-lived, nonfunctional sequences [87, 109, 110]. Consequently, the long-term evolutionary impact of HGT in multicellular eukaryotes remains largely unknown. Several cases of HGT are known or suspected in plants [32, 72-80, 84, 111-114], most involving mitochondrial sequences, and/or parasitic plants [32, 76-78, 80, 82, 86, 111, 112, 114]. Parasitic plants form direct haustorial connections with their host plants and are capable of obtaining a wide range of macromolecules from their hosts, including viruses [115], gene silencing signals [98], and messenger RNAs [116]. Consequently, parasites may have many opportunities for HGT events and an increased likelihood that some of these result in functional, and potentially adaptive, gene transfers. Two recent reports by Yoshida et al [111] and Xi et al [82] were the first indications that nuclear protein coding sequences, likely obtained from their respective host species, could be integrated into the genomes of parasitic plants by HGT. These were important advances, but they provided few clues as to the long term impact of HGT, how the transgenes evolve, and how they may function. We hypothesized that systematic analysis of genome-scale datasets from parasitic


plants could lead to evidence for acquisition and long-term maintenance of functional gene sequences in plants that had been acquired via HGT.

Albumin 1 genes are known only from a subset of species in the legume family (Leguminosae) of angiosperms, where they encode seed storage proteins and insect toxins [117, 118]. The albumin 1 proteins in legumes are 112 to 154 amino acids in length and rich in cysteine residues. They form a unique protein structure known as a KNOTTIN, which has three disulfide bonds and is characterized by a “disulfide through disulfide knot” [119]. KNOTTINs have been extensively studied in various fields, mostly for their potential in drug design [120-125]. Albumin 1 genes may have originated early in the diversification of papilionoid legumes [117, 118], but multiple homologous gene copies have been found only in species that are members of the more derived “Millettioid s.l.” and “Hologalegina” clades [126].

Orobanche s. l., often known by the common name “broomrape,” includes 150-170 obligate parasitic plant species in the family Orobanchaceae. Growing evidence supports the segregation of broomrapes into four genera [127]: Aphyllon (syn. Orobanche sect. Gymnocaulis), Myzorrhiza (syn. Orobanche sect. M.), Phelipanche (syn. Orobanche sect. Trionychon), and Orobanche s. str. (syn. Orobanche sect. O.). Most broomrape species have a narrow host spectrum and grow exclusively on perennial eudicot host plants [128], with members of the Leguminosae, Solanaceae, and Asteraceae among the more common hosts [129]. As a member of the order Lamiales, Orobanchaceae is phylogenetically well separated from host members in these lineages, particularly legume hosts in the rosid order Fabales (Figure 2-1; [130]). A few broomrape species (e.g., P. aegyptiaca, P. ramosa, O. cernua, O. crenata, and O. minor) have become devastating pests of important crop plants, affecting their growth and resource allocation and causing significant losses in yield [131]. P. aegyptiaca, the focal species in this study, has a broad host range that includes members of the eudicot families Apiaceae, Asteraceae, Brassicaceae, Cucurbitaceae, Leguminosae, and Solanaceae.

Figure 2-1. Phylogeny of major lineages of plants. Figure is adapted from Soltis et al. [130]. Legumes belong to the rosid order Fabales (blue box), while the parasites Phelipanche and Cuscuta represent derived lineages within the asterid orders Lamiales (red box) and Solanales (green box), respectively. For the designations from 2a to 2l, please refer to Soltis et al. [130].

Here we show that a gene encoding an albumin 1 KNOTTIN-like protein is present in Phelipanche aegyptiaca and related parasitic species of the family Orobanchaceae. The gene is closely related to the albumin 1 genes known only from papilionoid legumes, where they serve dual roles in food storage and as insect toxins, and phylogenetic analyses indicate that it was likely acquired by a Phelipanche ancestor via HGT from a legume host. Genomic sequences from nine related parasite species, 3D protein structure simulations, and evolutionary constraint analyses show that the broomrape xenogene identified here retains the intron structure and the six highly conserved cysteine residues necessary to form a KNOTTIN protein, and displays levels of purifying selection like those seen in legumes. The albumin 1 xenogene has survived through more than 150 speciation events over ca. 16 million years [132], forming a small family of differentially expressed genes that may confer novel functions in the parasites.

Results

Identifying the albumin1 gene in Broomrape species

The albumin 1 transcript was first identified as an HGT candidate in the transcriptome of P. aegyptiaca (cultured and grown on Arabidopsis and tobacco) using a BLAST-based [133] bioinformatic screen (details in Materials and Methods). Albumin 1 transcripts were then searched further, using BLASTX, against the NCBI nr database and the PlantGDB database [134]. The top hits (Figure 2-2) were to Medicago truncatula albumin 1 sequences, with E-values of 5e-51 and 1e-48. Additional BLAST searches, including Hidden Markov Model (HMM)-based PSI-BLAST searches with the sequence from P. aegyptiaca, were performed to attempt to detect homologs in three other members of Orobanchaceae with large transcriptome datasets (two parasites, Striga hermonthica and Triphysaria versicolor, and the nonparasitic Lindenbergia philippensis [2]) (Parasitic Plant Genome Project, PPGP [135]). Several large public databases, including Phytozome [136], PlantGDB, and SOL Genomics Network [137], were also searched. After searching 34 sequenced genomes and the transcriptomes of 274 additional plant species, albumin 1 homologs were detected only in legumes and in the transcriptome libraries of P. aegyptiaca.

Figure 2-2. NCBI BLAST result (database: nr, BLASTp) of (A) P. aegyptiaca albumin1-1 (unigene 12653) and (B) P. aegyptiaca albumin1-2 (unigene 75797).


Genomic sequence features of the albumin 1 gene in Phelipanche aegyptiaca and related species

Having identified the albumin 1 sequence in the P. aegyptiaca transcriptome, we then obtained genomic sequences encoding albumin 1 from P. aegyptiaca and eight additional broomrape species: P. schultzii, P. ramosa, P. mutelli, P. nana, Orobanche hederae, O. minor, O. cernua and O. ballotae. The nucleotide sequences and inferred gene structures of the albumin 1 genes in broomrape species (Figures 2-3, 2-4, 2-5, 2-6) are closely related, with inferred protein alignments 57.3-58.3% identical and 72.7%-74.3% similar (= identity + conservative substitutions) in ungapped regions between the legume and parasite proteins. Two albumin 1 genes were identified in Phelipanche species, and are identified here as contig_12653 and contig_75797, or albumin1-1 and albumin1-2, respectively. An intron disrupts the coding region at the same position in both genes, and the intron sequences are similar but contain a number of insertion and deletion mutations. Only one albumin 1 gene was detected in Orobanche species. Although the intron length in albumin 1 genes of Phelipanche and legume species is not well conserved, several critical intron features are shared (Figure 2-6). First, the starting position of the intron is the same in the P. aegyptiaca and M. truncatula sequences, and the first nine base pairs are identical. Second, the introns have characteristic splice sites at their 5’ and 3’ ends; 5’ ends often have GT/GU and 3’ ends often have AG, and these motifs are found in both the M. truncatula and Phelipanche albumin 1 introns (Figure 2-3 and Figure 2-6). Albumin1 gene sequences from Phelipanche were also searched with BLASTn against the NCBI nt database to look for high-frequency repeats and mobile elements, but no such features were identified.
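The two intron features compared above (canonical GT...AG boundaries and a shared run of identical bases at the start of the intron) can be checked with a few lines of code. The following is a minimal sketch; the function names and the toy sequences are assumptions made for illustration and are not the albumin 1 sequences themselves.

```python
# Sketch: check canonical GT...AG intron boundaries and count the identical
# leading bases of two introns. Sequences below are made-up toy examples.

def has_canonical_splice_sites(genomic, intron_start, intron_end):
    """Return True if the intron (0-based, end-exclusive) starts with GT and ends with AG."""
    intron = genomic[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def shared_prefix_length(intron_a, intron_b):
    """Count identical leading bases shared by two intron sequences."""
    count = 0
    for base_a, base_b in zip(intron_a.upper(), intron_b.upper()):
        if base_a != base_b:
            break
        count += 1
    return count


toy_gene = "ATGGCCGTAAGTTTCTTGCAGGATTGC"  # exon - intron - exon (toy)
print(has_canonical_splice_sites(toy_gene, 6, 21))
print(shared_prefix_length("GTAAGTTTCAAC", "GTAAGTTTCGGT"))
```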



Figure 2-3. Alignments of 5’ ends of the genomic and inferred CDS sequences of albumin 1 homologs from five Phelipanche species. Two genes are identified from P. aegyptiaca unigene 12653 (first five sequences, red bar) and unigene 75797 (yellow bar). Red box indicates the intron region identified by comparison of the genomic DNA and cDNA sequences. Blue box indicates the putative translation start codon.


A
Medtru_Albumin1   MA-YIRFAHLVVFLLAA-FSLVPTKKVGATDCSGACSPFEMPPCRSSDCRCIPIGLVAGYCTYPSSPTVMKMVEEHPNLC  78
PhAeg_Albumin1-1  MADYVKLSPLALFLLATLFFMSPMKKADAADCSGVCSPFEMPPCGSTDCRCVPWGLFVGQCIYPTSVVMHKMVGEHNNLC  80
PhAeg_Albumin1-2  MADYVKLSPLALFLLATVFLMSPIKKAEATDCSGVCSPFEMPPCGSTDCRCVPLGLFFGQCIYPTSVEMNKMVDEHNNLC  80

Medtru_Albumin1   QSHADCTKKESGSFCARYPNPDIEHGWCFSSNFEAYDV------FFNVSSNRGLIKDSLPMFTLTLDS  140
PhAeg_Albumin1-1  KSHDDCMKKGSGSFCARYPNADIEYGWCFASVSDAQDMFKIASNSEFTKAFLKIASNSGLANGFLKMPAA-IAT  153
PhAeg_Albumin1-2  KSHDDCMKKGSGSFCARYPNADIEYGWCFASDSEAQDMLKIASNSEFTKTFLRIASNSGLAKSFLKMPGA----  150

Figure 2-4. Amino acid alignment of insect toxin albumin 1 protein (Medicago_truncatula_albumin1_Q7XZC5) and inferred protein sequences for the two homologs in P. aegyptiaca, and structure of the M. truncatula toxic albumin 1 gene. (A) Inferred protein sequence alignments are 57.3-58.3% identical and 72.7%-74.3% similar (= identity + conservative substitutions) in shared regions between the legume and parasite proteins. (B) The legume protein product has a 27 amino acid signal peptide and 113 amino acid mature peptide; both regions are similarly conserved between the legume and Phelipanche inferred proteins. The gene structure representation for this legume gene was obtained from EMBL-EBI databases [138] (accession #AJ574789).
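The identity figures quoted in the caption are computed over alignment columns in which neither sequence has a gap. A minimal sketch of that calculation follows. The function name is an assumption, and the two rows are only the first 80 alignment columns of the Medicago and P. aegyptiaca albumin1-1 sequences shown in panel A, so the printed value should not be expected to match the full-length percentages. Percent similarity would additionally require a substitution scoring scheme, which is not attempted here.

```python
# Sketch: percent identity over ungapped columns of a pairwise alignment.
# Rows are the first 80 alignment columns only (an illustrative subset).

def percent_identity(row_a, row_b, gap="-"):
    """Percent of identical residues among columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


medtru = "MA-YIRFAHLVVFLLAA-FSLVPTKKVGATDCSGACSPFEMPPCRSSDCRCIPIGLVAGYCTYPSSPTVMKMVEEHPNLC"
phaeg_1 = "MADYVKLSPLALFLLATLFFMSPMKKADAADCSGVCSPFEMPPCGSTDCRCVPWGLFVGQCIYPTSVVMHKMVGEHNNLC"

print(round(percent_identity(medtru, phaeg_1), 1))
```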


Figure 2-5. Alignments of the 3’ end of genomic and inferred CDS sequences of albumin 1 homologs from five Phelipanche species. Two genes are identified from P. aegyptiaca unigene 12653 (first five sequences, red bar) and unigene 75797 (yellow bar). Red box indicates putative stop codon.


Figure 2-6. Partial genomic DNA and cDNA alignments of M. truncatula albumin 1 (Medtr8g025950), P. aegyptiaca albumin1-1 (12653) and P. aegyptiaca albumin 1-2 (75797).

Incongruent Phylogeny of the albumin1 gene

Phylogenetic analysis (Figure 2-7) of all known plant albumin 1 sequences showed a strongly supported clade containing all of the albumin 1 sequences from broomrapes (maximum likelihood (ML) bootstrap 98, Bayesian inference (BI) posterior probability (PP) 0.99) nested deeply within the IRLC (inverted repeat-lacking clade) of papilionoid legumes [139]. Among legumes, the next most closely related sequences (ML bootstrap 100, BI PP 0.99) are from Onobrychis argentea and Onobrychis viciifolia. Because the node supporting the position of the broomrape clade within the papilionoid legumes is relatively weakly supported (ML bootstrap 79, BI PP 0.99), we also tested the hypothesis that the broomrape clade of albumin 1 sequences falls outside the larger clade of legumes represented in this analysis (i.e., at a position sister to the Millettioid and Hologalegina clades). This hypothesis was rejected (Shimodaira-Hasegawa test and Kishino-Hasegawa test, using Tree-Puzzle version 5.2, Log L = -4482.60) relative to the maximum likelihood position indicated in this tree. The two albumin 1 genes in Phelipanche species are resolved as sister clades, which are in turn resolved as sister to the single gene obtained from Orobanche species. Gene structures supported a similar conclusion (Figure 2-3).

[Tree image for Figure 2-7. Tip labels: Glycine max, Glycine soja, Pisum sativum (albumin1A to albumin1F), Medicago truncatula, Astragalus monspessulanus, Onobrychis viciifolia, Onobrychis argentea, Orobanche (cernua, ballotae, hederae, minor A and B), and Phelipanche (schultzii, mutelli, aegyptiaca, ramosa, nana; Albumin1-1 and Albumin1-2). Node annotations: 16 Mya (SE 2.5E-2), 11 Mya (SE 1.9E-2), 5 Mya (SE 9.3E-3).]

Figure 2-7. Maximum likelihood (ML) and Bayesian inference (BI) phylogeny of albumin 1 homologs in broomrape species and legumes. Horizontal acquisition of albumin 1 by an ancestral Phelipanche/Orobanche species was estimated to have occurred ca. 16 million years ago (Mya, with standard errors SE), with Orobanche-Phelipanche speciation ca. 11 Mya, and a gene duplication ca. 5 Mya in the Phelipanche lineage produced xenparalogous genes designated Albumin1-1 (12653) and Albumin1-2 (75797) (see Supplemental Methods). Papilionoid legumes in black, others as indicated. Age estimate of legume node marked by red circle (39 ± 2.4 Mya) taken from Lavin et al. [140]. Unrooted trees have been rooted with Glycine max, in agreement with a prior KNOTTIN phylogeny [118] and phylogenetic relationships of related legume sequences [139]. Tree shown is ML (the BI method produced the same tree topology); bootstrap values (if >50%) and posterior probabilities (if >0.60) are shown on internal nodes. The legume clade containing albumin 1 genes comprises the Millettioid clade, which contains genera such as Glycine and Phaseolus, as the sister group to the large, temperate Hologalegina clade, which includes Medicago, Pisum, and Onobrychis, as well as several other agriculturally important genera such as Cicer, Lens, Vicia, and Trifolium [139]. Legume KNOTTIN sequences were from the KNOTTIN database [119]. For each legume KNOTTIN, tripartite names are given as: species full name-ID from KNOTTIN database-sequence ID from UniProt database. Additional albumin 1 homologs from M. truncatula were retrieved from the Medicago truncatula HapMap Project [141] with original sequence IDs. Branches are scaled by number of substitutions. The two albumin 1 genes in Phelipanche aegyptiaca share 92% nucleotide sequence identity.

KNOTTIN structure identified in albumin 1 proteins from Phelipanche aegyptiaca

The amino acid sequence alignments of albumin 1 from legumes and P. aegyptiaca show conservation of all cysteine residues essential for disulfide bond formation in albumin 1 proteins (Figure 2-8A). We used Knoter1d [119] [142] to investigate whether the predicted albumin 1 proteins from parasites maintain the characteristic KNOTTIN structures found in the legume albumin 1 proteins. Simulated 3D structures show that the Phelipanche albumin 1 proteins form a characteristic KNOTTIN structure with three disulfide bonds and a “disulfide through disulfide knot”. KNOTTIN protein structures are also predicted for all of the other full-length albumin 1 genes in Phelipanche species. Knoter1d assigned scores greater than 35 to each Phelipanche albumin 1 sequence; a score greater than 20 in this analysis passes the Knoter1d criteria for identification as an albumin 1 structure. The predicted 3D structures for P_aegyptiaca_Albumin1-1 (Figure 2-8B) and P_aegyptiaca_Albumin1-2 (Figure 2-8C) are very similar to that of the insect-toxic albumin 1 protein from M. truncatula. Albumin 2, a non-KNOTTIN legume protein, has no discernible homology with the albumin 1 protein in legumes (Figure 2-8E).
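Conservation of cysteine residues across an alignment can be checked by listing the columns in which every sequence has a cysteine. The sketch below does this for the first 80 alignment columns of the three sequences in Figure 2-8A; the function name is an assumption, and restricting the input to the first alignment block is a simplification made for illustration, so the result covers only part of each protein.

```python
# Sketch: find alignment columns (1-based) where every row has the same residue.
# Input rows are the first 80 alignment columns only (an illustrative subset).

def conserved_columns(rows, residue="C"):
    """Return 1-based column positions at which all aligned rows carry `residue`."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    return [i + 1 for i, column in enumerate(zip(*rows))
            if all(symbol == residue for symbol in column)]


alignment_block = [
    "MA-YIRFAHLVVFLLAA-FSLVPTKKVGATDCSGACSPFEMPPCRSSDCRCIPIGLVAGYCTYPSSPTVMKMVEEHPNLC",
    "MADYVKLSPLALFLLATLFFMSPMKKADAADCSGVCSPFEMPPCGSTDCRCVPWGLFVGQCIYPTSVVMHKMVGEHNNLC",
    "MADYVKLSPLALFLLATVFLMSPIKKAEATDCSGVCSPFEMPPCGSTDCRCVPLGLFFGQCIYPTSVEMNKMVDEHNNLC",
]

print(conserved_columns(alignment_block))
```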


A
Medtru_Albumin1   MA-YIRFAHLVVFLLAA-FSLVPTKKVGATDCSGACSPFEMPPCRSSDCRCIPIGLVAGYCTYPSSPTVMKMVEEHPNLC  78
PhAeg_Albumin1-1  MADYVKLSPLALFLLATLFFMSPMKKADAADCSGVCSPFEMPPCGSTDCRCVPWGLFVGQCIYPTSVVMHKMVGEHNNLC  80
PhAeg_Albumin1-2  MADYVKLSPLALFLLATVFLMSPIKKAEATDCSGVCSPFEMPPCGSTDCRCVPLGLFFGQCIYPTSVEMNKMVDEHNNLC  80

Medtru_Albumin1   QSHADCTKKESGSFCARYPNPDIEHGWCFSSNFEAYDV------FFNVSSNRGLIKDSLPMFTLTLDS  140
PhAeg_Albumin1-1  KSHDDCMKKGSGSFCARYPNADIEYGWCFASVSDAQDMFKIASNSEFTKAFLKIASNSGLANGFLKMPAA-IAT  153
PhAeg_Albumin1-2  KSHDDCMKKGSGSFCARYPNADIEYGWCFASDSEAQDMLKIASNSEFTKTFLRIASNSGLAKSFLKMPGA----  150

Figure 2-8. Amino acid sequence alignment and 3D structure simulation of albumin 1 sequences from Medicago and P. aegyptiaca. (A) Amino acid alignment for the two P. aegyptiaca albumin 1 sequences and an M. truncatula albumin 1 sequence (Q7XZC5, a confirmed KNOTTIN insect toxin protein). Red squares indicate cysteine residues. (B) and (C) show the simulated 3D structures for both Phelipanche sequences. Protein 2D structures are colored from N-terminal to C-terminal with a rainbow color scheme. The three disulfide bonds are shown as colored sticks. The leftmost and rightmost sticks open a space that is pierced by the stick in the center. This “disulfide through disulfide knot” is the characteristic structure of KNOTTIN proteins. (D) 3D structure of the KNOTTIN insect toxin protein in M. truncatula. The toxicity of this protein to insect herbivores was confirmed in an earlier report [117]. The PDB file for this 3D structure was obtained from the KNOTTIN database. (E) Predicted 3D structure of the albumin 2 protein (a non-KNOTTIN albumin, PDB ID#3LP9) in grass pea (Lathyrus sativus).

Evolutionary constraint analysis on albumin1 genes in broomrape species

Having found that the horizontally acquired albumin1 genes were present in related species of broomrapes, we then asked whether the genes are evolving under purifying selection indicative of a functional protein coding sequence. dN (nonsynonymous substitutions), dS (synonymous substitutions) and dN/dS were calculated for all three lineages of the broomrape albumin 1 clade (= albumin 1 in Orobanche, albumin1-1 and albumin1-2 in Phelipanche) and for the albumin 1 sequences from three closely related legume species: Astragalus monspessulanus, Onobrychis argentea and Onobrychis viciifolia. Synonymous substitutions in the albumin 1 genes (dS) outnumber nonsynonymous substitutions (dN) by at least 3:1 in most lineages (Figure 2-9, and details in Supplemental Materials), and dN/dS, reflecting the level of purifying selection, is similar in broomrapes to the value estimated for closely related albumin1 sequences from legumes. All cysteine residues were also identified as evolving under purifying selection (Bayes factors ranging from 3.04 to 27.52), suggesting that the horizontally acquired albumin 1 genes in broomrapes are functional.


[Figure 2-9 image: ML tree with tips Astragalus monspessulanus, Onobrychis argentea and Onobrychis viciifolia, and clades labeled Albumin1 Orobanche, Albumin1-1 Phelipanche and Albumin1-2 Phelipanche. The inset table of the figure follows.]

Albumin 1 | dN | dS | dN/dS
Astragalus monspessulanus | 0.259 | 0.629 | 0.412
Onobrychis viciifolia | 0 | 0.263 | 0
Onobrychis argentea | 0.112 | 0.45 | 0.249
Orobanche | 0.123 | 1.068 | 0.115
Phelipanche 1-1 | 0.043 | 0.264 | 0.163
Phelipanche 1-2 | 0.018 | 0.079 | 0.228

Figure 2-9. ML estimate of dN and dS changes, and evolutionary constraint (dN/dS) through the history of albumin 1 sequences in broomrapes and their homologs in three related legume species. Branch lengths scaled by total number of substitutions. Because the total amount of evolutionary change on individual branches for closely related species can be very low (or even zero in some cases), changes have been pooled within several of the specific lineages.
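As a quick arithmetic check, the dN/dS values reported with Figure 2-9 can be recomputed from the dN and dS estimates. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the numbers are copied from the values reported with the figure.

```python
# dN and dS values copied from the values reported with Figure 2-9.
# The names ESTIMATES and dn_ds_ratio are illustrative, not from the dissertation.
ESTIMATES = {
    "Astragalus monspessulanus": (0.259, 0.629),
    "Onobrychis viciifolia": (0.0, 0.263),
    "Onobrychis argentea": (0.112, 0.45),
    "Orobanche": (0.123, 1.068),
    "Phelipanche 1-1": (0.043, 0.264),
    "Phelipanche 1-2": (0.018, 0.079),
}


def dn_ds_ratio(dn, ds):
    """Return dN/dS, or None when dS is zero and the ratio is undefined."""
    if ds == 0:
        return None
    return dn / ds


ratios = {name: dn_ds_ratio(dn, ds) for name, (dn, ds) in ESTIMATES.items()}
for name, ratio in ratios.items():
    print(f"{name}: dN/dS = {ratio:.3f}")
```

Ratios well below 1 on every branch are what would be expected under purifying selection.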

Expression profile of albumin1 genes in Phelipanche aegyptiaca

Having observed evidence for selection for structural conservation, we investigated whether these genes exhibit transcription profiles that suggest a new or unique pattern of expression in parasites. Normalized expression levels of both albumin 1 genes in P. aegyptiaca were estimated as reads per kilobase per million reads (RPKM) for eight libraries representing major stages of belowground and aboveground parasite development (Figure 2-10). Both genes displayed their lowest expression levels at stage 3 (haustorial attachment stage) and their highest at stage 6 (above-ground tissues). Transcripts were particularly abundant at stage 6.2 (reproductive), where RPKM values were roughly 400-fold (albumin1-1) and 3,500-fold (albumin1-2) higher than at the haustorial stage (Table 2-1).

[Figure 2-10 image: Log 10 (RPKM) of P_aegyptiaca_Albumin1-1 and P_aegyptiaca_Albumin1-2 plotted against Developmental Stages: 0-Imbibed seed, 1-Germinated seed, 2-Seedling+HIF, 3-Young haustorium, 4.1-Later haustorium, 4.2-Mature haustorium, 6.1-Leaves, 6.2-Floral buds.]

Figure 2-10. Expression levels (log scale) of the two albumin 1 genes in P. aegyptiaca across eight developmental stages. Normalized expression levels were estimated by RPKM (count of mapped Reads to the gene Per Kilobase of sequence length per Million library reads). Numerical values are given in Table 2-1; P. aegyptiaca stages are as defined in [143] and in Table 2-2. Stage 3 (haustorium attached to host root, pre-vascular connection) is the earliest post-attachment stage for this parasite [143].

Table 2-1. Expression values for albumin 1 genes in P. aegyptiaca at different developmental stages. Expression levels were measured by number of mapped Reads to this gene Per Kilobase of sequence length per Million (M) library reads (RPKM) in Illumina sequence (G) libraries (PPGP). Developmental stages are described in Table 2-2.

Albumin 1 genes | 0G | 1G | 2G | 3G | 41G | 42G | 61G | 62G
Albumin1-1 | 96.5 | 71 | 186.2 | 22.6 | 78 | 29 | 414 | 9402.3
Albumin1-2 | 186.3 | 237.1 | 368.4 | 6.5 | 169.5 | 40.9 | 2826.5 | 23162.7
Total reads used in mapping (M) | 15.6 | 26 | 25.9 | 18.2 | 27.9 | 20.7 | 15.8 | 16.2
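The difference between the reproductive and haustorial stages can be made concrete by computing fold changes from the RPKM values in Table 2-1. This is a minimal sketch for illustration; the variable and function names are hypothetical, and the numbers are copied from the table.

```python
# RPKM values copied from Table 2-1; stage labels follow the table header.
# STAGES, RPKM and fold_change are illustrative names, not from the dissertation.
STAGES = ["0G", "1G", "2G", "3G", "41G", "42G", "61G", "62G"]
RPKM = {
    "Albumin1-1": [96.5, 71, 186.2, 22.6, 78, 29, 414, 9402.3],
    "Albumin1-2": [186.3, 237.1, 368.4, 6.5, 169.5, 40.9, 2826.5, 23162.7],
}


def fold_change(gene, stage_a, stage_b):
    """Ratio of the RPKM at stage_a to the RPKM at stage_b for one gene."""
    values = dict(zip(STAGES, RPKM[gene]))
    return values[stage_a] / values[stage_b]


def lowest_stage(gene):
    """Stage label with the smallest RPKM for one gene."""
    values = dict(zip(STAGES, RPKM[gene]))
    return min(values, key=values.get)


for gene in RPKM:
    ratio = fold_change(gene, "62G", "3G")
    print(f"{gene}: lowest at {lowest_stage(gene)}, 62G/3G = {ratio:.0f}-fold")
```

Because RPKM is already normalized for contig length and library size, the ratio of two RPKM values for the same gene is a direct estimate of relative transcript abundance between stages.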

Table 2-2. Developmental stages used for transcriptome sequencing in P. aegyptiaca with characteristics of each stage and the expectation of host plant tissue contamination in library preparations.

Stage Name | Expectation of Host Plant Tissues Contamination | Description of the developmental stage | More Information
0 | No | Seeds imbibed, pre-germination | Pre-attachment of haustoria
1 | No | Germinated seed; radicle emerged; pre-haustorial growth |
2 | No | Seedling after exposure to haustorial induction factors (HIFs) |
3 | Yes | Haustoria attached to host root; early penetration stages, pre-vascular connection (~48 hrs.) | Early post-attachment
4.1 | Yes | Early-established parasite; parasite vegetative growth after vascular connection (~72 hrs.) |
4.2 | Yes | Spider stage |
5.1 | No | Pre-emergence from soil - shoots | Late post-attachment
5.2 | No | Pre-emergence from soil - roots |
6.1 | No | Post emergence from soil - Vegetative structures; leaves/stems |
6.2 | No | Post emergence from soil - Reproductive structures; floral buds (up to anthesis) |

Discussion

Biogeographic overlap and common feeding interactions between diverse broomrapes and temperate papilionoid legumes increase the likelihood that the HGT event occurred in a common ancestor of the parasites that was in direct contact with legume host plants. An alternative (and less parsimonious) explanation is that another organism or virus that co-occurred in the same habitats as the ancestral lineages served as a “stepping stone” for a transfer in two or more steps. However, our searches of the sequence databases provided no strong evidence for such an intermediate. Based on fossil-calibrated age estimates of legume lineages [139, 140], we estimate that this horizontal acquisition occurred in an ancestral broomrape that lived in the Miocene epoch, about 16 Mya. Both the parasites and their legume host groups have northern temperate distributions, and their lineages likely overlapped in the past as they do now, providing a minimal requirement for a horizontal gene transfer to occur. Another possibility, however unlikely, is that albumin1 was a more recent acquisition that underwent strong convergence at the protein level with this legume lineage. However, the branch lengths we observed in our DNA-based phylogeny (Figure 2-7) were not unusually long, and given the large collection of related sequences we obtained from other broomrape species, we have reduced any tendency the Orobanche/Phelipanche lineage may have had to connect by chance to a deep branch. Thus, the convergence hypothesis is not supported. Because the breadth of Phelipanche and Orobanche species we have sampled spans the deepest branches of broomrape diversity [132], the albumin gene can be inferred to have survived through an extended evolutionary radiation of at least 150 species [132, 144, 145], and more if the number of now-extinct broomrape species could be estimated.

Because the introns of Phelipanche albumin 1 xenogenes maintain critical splicing sites and share the same starting positions and first nine base pairs with the known M. truncatula albumin 1 intron, it is likely that the HGT event in broomrapes involved transfer of a genomic sequence rather than a separate cDNA. Following the transfer, albumin 1 genes in broomrape species have evolved under purifying selection consistent with what is observed in related legume albumin 1 genes. This observation, together with the stage-specific transcription patterns, conserved cysteine residues and predicted 3D KNOTTIN protein structures, strongly suggests that albumin 1 genes encode functional proteins in broomrape species. These proteins could potentially serve a function similar to their role in legumes, where they provide a large pool of sulfur storage and, in certain legumes, are toxic to insect herbivores [117, 118]. A recent report of HGT in plants involves panicoid grass species with C3 or C4 photosynthetic pathways. Evidence was presented that nuclear genes were horizontally transferred between panicoid species and were subsequently adapted into the existing pathways, with the effect of advancing the extent of C4 photosynthesis in some lineages [84].

This HGT finding in panicoid grass species, involving C3 and C4 photosynthetic pathways, indicates that HGT may promote the sharing of adaptive traits among related species. In comparison, the albumin example described here shows how a completely novel and highly specialized trait has been acquired at an ancestral stage from a distantly related donor species and maintained by the recipient lineage throughout an extended period of evolutionary history.


The albumin 1 genes in P. aegyptiaca are highly transcribed in most of the developmental stages we examined. Transcripts are more abundant in reproductive tissue, and lowest in the young haustorium (stage 3), which represents the earliest point in our tissue sampling where the parasite is in direct contact with the host plant. This suggests that the novel gene in P. aegyptiaca is probably not encoding a protein that is playing a direct role in the process of haustorial formation, and that albumin 1 expression is down-regulated as the parasite devotes energy to the essential process of establishing host vascular connections. It is also possible that the low expression in the haustorial stage could help the parasite avoid detection or minimize a negative impact on the health of the host plant during early stages of parasite contact and feeding.

Several other parasitic lineages, including members of Cuscuta (Convolvulaceae), Cassytha (Lauraceae), Apodanthaceae, Hydnoraceae, and the order Santalales, regularly feed upon legumes [146] and therefore might also have had opportunities to acquire albumin 1 sequences through HGT. Large transcriptome datasets are currently available for only two of these, the generalist parasite Cuscuta pentagona (Convolvulaceae) and the legume specialist feeder Pilostyles thurberi (Apodanthaceae) [32]. Both of these parasites, and other species in these genera, feed widely on legumes. No homolog of albumin 1 was detected in BLAST searches of the Pilostyles transcriptome in the 1KP dataset [147]. However, albumin 1 sequences were detected in the same dataset and in two additional transcriptome libraries from Cuscuta pentagona (J. Westwood, unpublished data). Phylogenetic analysis nests the Cuscuta sequences well within Leguminosae, but on an independent branch from the broomrape sequences (Figure 2-11), suggesting that these transcripts represent a separate HGT event into Cuscuta from a lineage of papilionoid legumes that was different from the source of the broomrape albumin 1 xenogene. The putative Cuscuta albumin 1 similarly encodes a protein predicted to have KNOTTIN structure (Knoter1D score: 33 to 35). No other albumin 1 sequences were identified elsewhere in searches of REFSEQ or publicly available plant transcriptome datasets.

[Figure 2-11 image: ML tree with bootstrap support values on branches and a scale bar of 0.2. Tips are albumin 1 sequences from papilionoid legumes (including Glycine, Phaseolus, Medicago, Pisum, Astragalus and Onobrychis), three Cuscuta pentagona sequences (clade labeled Albumin 1 Cuscuta pentagona), and broomrape sequences in clades labeled Albumin1 Orobanche, Albumin1-1 Phelipanche and Albumin1-2 Phelipanche.]

Figure 2-11. Maximum likelihood (ML) phylogeny of KNOTTIN homologs in broomrape species, Cuscuta pentagona and papilionoid legumes. ML and Bayesian Inference (BI) methods produced the same tree topology. Three Cuscuta pentagona sequences were obtained from the 1KP project and from additional independently prepared libraries. Other information is as given in Figure 2-6.


Conclusions

Because of their extensive, intimate contacts with host plant tissues, and the wide range of materials that are commonly transmitted across haustorial connections [2, 96, 98, 116, 148-150], parasitic plants could play an important role as recipients and donors for HGT in plants [32, 76-78, 86, 111]. Our results support the hypothesis that parasitic plants are creating a network of functional genes that are overlain on top of the normal set of vertically transmitted sequences, and that some of these genes likely go on to serve important functions in their new genomes. As parasitic plants increasingly become the targets for genome-scale analyses, it should become possible to estimate the frequency and likely mechanisms of HGT events between parasites and hosts involving albumin 1 and other genes, the likelihood of more complex stepping-stone models, and how often HGT leads to long-term maintenance of new genes and novel traits.

Methods

Screening for HGT candidates

The assembled transcriptome of the parasite P. aegyptiaca was systematically screened for potential HGT candidate sequences. Immediately following an HGT event, a host-derived sequence in a parasitic organism may be identical to the sequence from the host. Over evolutionary time, the host-derived sequence will diverge from the ancestral transgene and, if it survives long enough, the xenologous sequence may pass through both speciation events (forming “xenorthologs”) and/or duplication events (forming “xenoparalogs”). Initially, the xenologous sequence will be more closely related to the host sequence than to any other sequence in the parasite or its relatives’ genomes. Such sequences can provide valuable indicators of the rate and types of host-derived sequence incorporation in parasite-host interactions, but they can be difficult to distinguish from host-plant contamination or host-derived mobile transcripts in the parasite. However, as genetic divergence, speciation, and gene duplication events occur, the xenologs can be detectable as a clade of sequences that is closely related to sequences from the host lineage.

The parasitic plants that are the focus of this study are in the family Orobanchaceae (eudicots, asterid order Lamiales). The analysis begins with a high-throughput BLAST search (tBLASTx) of all the contigs from the P. aegyptiaca transcriptome assembly against a database with sequences from two closely related nonparasitic species (Lindenbergia philippensis, a member of Orobanchaceae, representing the nonparasitic sister group of the parasitic members, and Mimulus guttatus, another closely related nonparasitic species of Lamiales/Asteridae [151]) and thirteen other plant species with sequenced genomes or large transcriptome assemblies. These include eudicots (two Solanaceae [asterids related to Lamiales]: Solanum lycopersicum and Nicotiana tabacum; and six much more distantly related rosid taxa covering the range of major host families for most broomrapes: Arabidopsis thaliana [Brassicaceae], Carica papaya [Caricaceae], Populus trichocarpa [Salicaceae], Medicago truncatula [Fabaceae, papilionoid], Cucumis sativus [Cucurbitaceae], Vitis vinifera [Vitaceae]), monocots (Sorghum bicolor, Oryza sativa) and distantly related non-angiosperm species (Selaginella moellendorffii, Physcomitrella patens, Chlamydomonas reinhardtii). Details about the database are in Table 2-3. The analysis details are described below.

Table 2-3. HGT candidates BLAST database. Information that cannot be retrieved is marked as Not Applicable (NA). M: million; GB: Gigabase.

Group | Species | Classification | Resource | # of Reads | Size of Dataset | # of Unigenes
Closely related species | Striga hermonthica | Eudicots, Asterids, Lamiids, Lamiales, Orobanchaceae | PPGP | 473.3M | 41GB | 726534
Closely related species | Triphysaria versicolor | Eudicots, Asterids, Lamiids, Lamiales, Orobanchaceae | PPGP | 181M | 15.5GB | 480595
Closely related species | Lindenbergia philippensis | Eudicots, Asterids, Lamiids, Lamiales, Orobanchaceae | PPGP | 69M | 5.9GB | 104904
Closely related species | Mimulus guttatus | Eudicots, Asterids, Lamiids, Lamiales, Phrymaceae | Phytozome | NA | NA | 27501
Other more distantly related plant species | Solanum lycopersicum | Eudicots, Asterids, Solanales, Solanaceae | PlantGDB | NA | NA | 56845
Other more distantly related plant species | Nicotiana tabacum | Eudicots, Asterids, Solanales, Solanaceae | PlantGDB | NA | NA | 131942
Other more distantly related plant species | Arabidopsis thaliana | Eudicots, Rosids, Brassicales, Brassicaceae | TAIR 9 | NA | NA | 27379
Other more distantly related plant species | Carica papaya | Eudicots, Rosids, Brassicales, Caricaceae | ASGPB release | NA | NA | 25536
Other more distantly related plant species | Populus trichocarpa | Eudicots, Rosids, Malpighiales, Salicaceae | JGI version 2.0 | NA | NA | 41377
Other more distantly related plant species | Medicago truncatula | Eudicots, Rosids, Fabales, Fabaceae | Phytozome | NA | NA | 50962
Other more distantly related plant species | Cucumis sativus | Eudicots, Rosids, Cucurbitales, Cucurbitaceae | BGI release | NA | NA | 21635
Other more distantly related plant species | Vitis vinifera | Eudicots, Rosids, Vitales, Vitaceae | Genoscope release | NA | NA | 30434
Other more distantly related plant species | Sorghum bicolor | Monocots, Poales, Poaceae | JGI version 1.4 | NA | NA | 34496
Other more distantly related plant species | Oryza sativa | Monocots, Poales, Poaceae | RGAP release 6.1 | NA | NA | 56979
Other more distantly related plant species | Selaginella moellendorffii | Embryophyta, Tracheophyta, Lycopodiophyta, Isoetopsida, Selaginellales, Selaginellaceae | JGI version 1.4 | NA | NA | 34697
Other more distantly related plant species | Physcomitrella patens | Bryophyta, Bryophytina, Bryopsida, Funariidae, Funariales, Funariaceae | JGI version 1.1 | NA | NA | 35938
Other more distantly related plant species | Chlamydomonas reinhardtii | Chlorophyta, Chlorophyceae, Chlamydomonadales, Chlamydomonadaceae | Phytozome | NA | NA | 15935

Contigs were downloaded from the Parasitic Plant Genome Project website (Assembly version OrAeBC4). The HGT candidate screening included the following steps. First, contigs were searched with BLAST against the database described in the above paragraph (tBLASTx, expected value: 1e-10, -b 1, -v 1) and the top hit was retrieved. The “-b 1” option restricts the reported alignments to the top subject sequence, and the “-v 1” option restricts the reported description lines to the top subject sequence. Second, contigs with a rosid species as the top hit were retained for downstream filtering to identify sequences that could be useful for high-resolution evolutionary analysis. Candidate sequences were retained only if the contig was longer than five hundred base pairs, the aligned identity was between sixty and ninety-five percent, and the aligned length was at least fifty percent of the contig length. The last requirement was included to avoid long contigs in which only a small portion is nearly identical to a distantly related sequence. Third, the filtered contigs were searched against the same database and the top ten hits (expected value: 1e-10, -b 10, -v 10) were retrieved. Contigs with either of the closely related species Mimulus guttatus or Lindenbergia philippensis among the top ten hits were excluded from further consideration, to avoid sequences that were not decisively better matches to distantly related species. Fourth, the same BLAST search was performed for the contigs that passed the previous screenings, and all available BLAST hits (expected value: 1e-10, -b 100000, -v 100000) were considered. A vertically transmitted sequence inherited from a nonparasitic ancestor would be expected to have hits in Mimulus guttatus and Lindenbergia philippensis, so a contig with no hits to either species was considered an HGT candidate. A contig that did have Mimulus guttatus or Lindenbergia philippensis among its BLAST hits was also retained as an HGT candidate if those hits had a much higher expect value or a much smaller bit score than the hit to a host plant lineage. We began with 157806 Phelipanche aegyptiaca contigs. Of these, 333 contigs passed the initial BLAST screening, and 168 contigs and 36 contigs passed the second and third BLAST screenings, respectively. These 36 HGT candidates were passed on to phylogenetic testing. Once HGT candidates were found, we also checked for related sequences in the other parasitic Orobanchaceae species Striga hermonthica and Triphysaria versicolor using BLAST searches, including psi-BLAST.
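The filtering criteria applied after the first BLAST search can be sketched as a simple predicate. This is an illustrative sketch, not the screening code used in this study: the TopHit record and its field names are hypothetical, and it assumes that the aligned length has already been expressed in nucleotides of the contig.

```python
from dataclasses import dataclass


@dataclass
class TopHit:
    """Hypothetical summary of the top BLAST hit for one contig."""
    contig_length: int        # length of the query contig in base pairs
    percent_identity: float   # identity of the top-hit alignment, in percent
    aligned_length: int       # aligned portion of the contig, in base pairs (assumed)
    top_hit_is_rosid: bool    # True when the top hit comes from a rosid species


def passes_second_filter(hit):
    """Apply the length, identity and coverage thresholds described in the text."""
    return (
        hit.top_hit_is_rosid
        and hit.contig_length > 500
        and 60.0 <= hit.percent_identity <= 95.0
        and hit.aligned_length >= 0.5 * hit.contig_length
    )


example = TopHit(contig_length=1200, percent_identity=82.5,
                 aligned_length=700, top_hit_is_rosid=True)
print(passes_second_filter(example))
```

The upper identity bound excludes near-identical matches, which are hard to tell apart from host contamination, while the coverage requirement excludes contigs in which only a short segment resembles the distant sequence.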


Phylogenetic analysis and dating

Phylogenetic analysis was performed on all albumin1 homologs detected in the broomrape species (Phelipanche, Orobanche) as well as all other previously known albumin1 sequences and sequences obtained from additional legume species via PCR and cloning (see below). Albumin1 is reported to be restricted to papilionoid legume species (including Medicago). Low stringency BLAST searches (using an E-value cutoff of e-5; tBLASTx, BLASTp, and psiBLAST) of diverse angiosperm databases, including the NCBI nr database, PlantGDB, the Phytozome database and the SOL genome network (all database versions predate May 2012), failed to detect any additional homologs outside legumes. MUSCLE [152] was used to produce a multiple sequence alignment of the translated amino acid sequences; a custom Java program was used to force nucleotide sequences onto the corresponding amino acid alignment to yield a DNA sequence alignment consistent with the translated sequences. The ML phylogeny was obtained using RAxML, version 7.0.4 [153], with the following parameters: raxmlHPC -f a -x 12345 -p 12345 -# 100 -m GTRGAMMA -s alignmentsFile -n OutputFile. Bayesian analysis was performed with BEAST version 1.6.1 [154], using the following parameters: substitution model: GTR; base frequencies: estimated; site heterogeneity model: gamma; clock model: relaxed clock (uncorrelated exp); tree prior: speciation (Yule process); MCMC: length of chain 10000000, log parameters every: 1000. Tracer version 1.4 [154] was used to assess the performance of the BEAST run, with a burn-in of 1000000 states. All ESS values were larger than 196.
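The step of forcing nucleotide sequences onto the amino acid alignment can be illustrated with a short sketch. The custom Java program is not shown here, so the Python function below is only a hypothetical illustration of the general idea (each aligned residue is replaced by its codon and each gap by three gap characters); the function name and the example sequences are assumptions.

```python
def thread_codons(aligned_protein, cds):
    """Map an unaligned coding sequence onto its aligned protein sequence.

    aligned_protein: one row of the protein alignment, with '-' for gaps.
    cds: the corresponding unaligned coding sequence, starting in frame.
    Returns the codon-aligned nucleotide row.
    """
    residues = aligned_protein.replace("-", "")
    if len(cds) < 3 * len(residues):
        raise ValueError("coding sequence is too short for the aligned protein")
    row = []
    position = 0
    for symbol in aligned_protein:
        if symbol == "-":
            row.append("---")                      # one protein gap = three nucleotide gaps
        else:
            row.append(cds[position:position + 3])  # next codon in reading order
            position += 3
    return "".join(row)


# Toy example with made-up sequences.
print(thread_codons("M-KC", "ATGAAATGT"))
```

Aligning at the protein level and then threading the codons keeps every gap in frame, which is what codon-based models such as those used for dN/dS estimation require.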

The potential HGT acquisition time was estimated with BEAST v1.6.1 using the same alignment. We assigned one calibration point, the most recent common ancestor (MRCA) of Pisum/Medicago/Astragalus/Onobrychis, for which the prior was treated as a normal distribution with a mean of 39 Mya and a standard deviation of 2.4 Mya [140]. We also created taxon groups of Onobrychis/Orobanche/Phelipanche, Orobanche/Phelipanche, and a taxon group containing only Phelipanche genes. The other settings were the same as described above in the Phylogenetic analysis section. Tracer was used to analyze the BEAST output and to report the estimated mean and 95% HPD range of the divergence time of the three taxon groups, in the order defined above (16 Mya, 95% HPD 11-21 Mya; 11 Mya, 95% HPD 6-16 Mya; 5 Mya, 95% HPD 3-7 Mya). Similar patterns were observed within the BEAST confidence ranges when dates were estimated with r8s [155] (results not shown).
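To give a sense of how tightly the calibration constrains the analysis, the central 95% interval implied by a normal prior with mean 39 Mya and standard deviation 2.4 Mya can be computed directly. This is an illustrative calculation using the Python standard library, not part of the BEAST analysis itself; the variable names are hypothetical.

```python
from statistics import NormalDist

# Calibration prior on the MRCA of Pisum/Medicago/Astragalus/Onobrychis,
# using the mean and standard deviation given in the text (in Mya).
prior = NormalDist(mu=39.0, sigma=2.4)

# Central 95% interval of the prior.
lower = prior.inv_cdf(0.025)
upper = prior.inv_cdf(0.975)
print(f"95% prior interval: {lower:.1f}-{upper:.1f} Mya")
```

The interval spans roughly 34 to 44 Mya, so all three estimated divergence times for the broomrape groups fall well after the calibrated legume node.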

KNOTTIN structure validation and 3D structure simulation

HGT candidates were confirmed to be KNOTTIN proteins using the prediction programs provided by the KNOTTIN database [119, 156]. Amino acid sequences were first evaluated with the Knoter1D program; sequences with Knoter1D scores larger than 20 were considered to have KNOTTIN protein structures. Confirmed amino acid sequences (all the albumin1 sequences in Phelipanche) were then submitted to the Knoter1D3D program, which generated the PDB files.

dN, dS and dN/dS calculation

HyPhy version 2.0 was used to calculate dN, dS and dN/dS ratios [157]. Multiple sequence alignments of albumin 1 coding sequences were imported into HyPhy together with tree files of the ML phylogeny from the above analysis. Analyses were focused on broomrape species plus the three most closely related legume species. Calculations were performed using the following parameters: partition type: codon; substitution model: MG94xHKY85_3x4; parameters: local; equilibrium freqs: estimate. HyPhy was also used in functional constraint analyses among sites using the empirical Bayes technique.

Expression level comparisons of HGT candidates

Assembled contigs and raw Illumina reads were downloaded from the PPGP website. For each library, raw reads were mapped onto the HGT candidates in P. aegyptiaca using bwa [158], samtools [159] and bedtools [160]. Normalized measures of expression intensity, Reads Per Kilobase per Million mapped reads (RPKM), were calculated from the read counts, the length of each contig, and the total number of mapped reads in each library or developmental stage [143].
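The RPKM normalization can be written as a one-line formula: the read count is divided by the contig length in kilobases and by the library size in millions of reads. The function below is a minimal sketch of that standard formula; the function name and the example numbers are hypothetical and are not taken from the PPGP pipeline.

```python
def rpkm(read_count, contig_length_bp, total_reads):
    """Reads Per Kilobase of contig per Million reads in the library.

    read_count: reads mapped to the contig
    contig_length_bp: contig length in base pairs
    total_reads: total number of reads in the library used for normalization
    """
    if contig_length_bp <= 0 or total_reads <= 0:
        raise ValueError("contig length and total reads must be positive")
    # Equivalent to read_count / (length / 1e3) / (total_reads / 1e6).
    return read_count * 1e9 / (contig_length_bp * total_reads)


# Made-up example: 1,000 reads on a 2 kb contig in a library of 10 million reads.
print(rpkm(1000, 2000, 10_000_000))
```

Dividing by both contig length and library size is what makes values comparable across genes of different lengths and across libraries of different sequencing depth.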

Obtaining genomic sequences by PCR approach

Broomrape species DNA extraction, and gene amplification - Two different sources of tissue were used for broomrape species: dry seeds (obtained from the GermPlasm Bank of the IAS-CSIC, Cordoba, Spain) for Orobanche ballotae, Orobanche hederae, Phelipanche nana and Phelipanche schultzii, and vegetative shoots for Phelipanche aegyptiaca, Orobanche cernua, Orobanche minor, Phelipanche mutelii, and Phelipanche ramosa. Total genomic DNA was isolated from fresh, liquid nitrogen frozen tissue using a DNeasy Plant Mini Kit (Qiagen).

EST unigene contigs OrAeGnB1_75797 and OrAe41G2B1_12653 were downloaded from the Parasitic Plant Genome Project database. A different set of P. aegyptiaca specific primers was designed for each contig (Table 2-4). The P. aegyptiaca primers were also used to amplify related sequences from the other Phelipanche species P. mutelii, P. nana, P. ramosa and P. schultzii. Each PCR reaction contained 10 ng of genomic DNA, 0.5 µM each of the forward and reverse primers, and 12.5 µl of 2x iProof Master Mix (BIO-RAD), with conditions as described in the manufacturer’s protocol. PCR products were separated by electrophoresis through a 1% agarose gel, yielding a single band that was excised from the gel, purified using the QIAquick Gel extraction kit (Qiagen), and sequenced using an ABI3730xl genetic analyzer and the Big Dye Terminator v3.1 sequencing kit (both from Applied Biosystems).

Table 2-4. PCR primers used for albumin 1 amplification.

Primer used in | Primer | Orientation | 5’-3’ sequence
Broomrape species | OrAeGnB1_75797 | Fw1 | GATTCAGCATCAAAAGCAATGGC
Broomrape species | OrAeGnB1_75797 | Rv1 | GGAGTGTTGGATCGGATACAT
Broomrape species | OrAe41G2B1_12653 | Fw2 | CAACAGCAAGAACCAGTTCC
Broomrape species | OrAe41G2B1_12653 | Rv2 | GAGATCCAACTGAGTTGGAC
Legumes | LegAlb1 | Fw3 | TTAAGCTCACTCCTTTGGTCCTCTTC
Legumes | LegAlb1 | Rv3 | CAGGCATCTTCARGAAKCYTTTYKC

Legume DNA extraction, and gene amplification

Total DNA was isolated from herbarium material of Onobrychis argentea Boiss. ssp. africana, A. Dubois 13246 (M), using a DNeasy Plant Mini Kit (Qiagen). Because the Onobrychis sequence obtained from NCBI was incomplete, one forward primer (AlbuminFw3: 5´TTAAGCTCACTCCTTTGGTCCTCTTC3´) and one degenerate reverse primer (AlbuminRv3: 5´CAGGCATCTTCARGAAKCYTTTYKC3´) were designed in order to amplify the full-length albumin 1 gene in O. argentea. Forward 3 was designed on the Q6A1C9 sequence, targeting the more conserved region before the start codon between sequences Q6A1C9 and Q6A1D7 obtained from Onobrychis viciifolia and Astragalus monspessulanus. Reverse 3 was designed from the downstream end of the complete albumin genes Medtr7g041000.1 and OrAeGnB1_75797. The PCR reaction contained 10 ng of genomic DNA of O. argentea, forward primer (Fw3, 1 µM), reverse primer (Rv3, 1 µM), and 12.5 µl of 2x iProof Master Mix (BIO-RAD) in a final volume of 25 µl, following the manufacturer’s protocol. The PCR product was separated by electrophoresis through a 1% agarose gel, excised from the gel, purified using the QIAquick Gel extraction kit (Qiagen), sequenced and identified as albumin 1.


Chapter 3

Widespread horizontal gene transfer events of transcribed gene sequences in parasitic plants of the Orobanchaceae

Abstract

This report summarizes the HGT findings of the Parasitic Plant Genome Project (PPGP), obtained with a phylogenomic approach to identifying HGT events. Twenty-two plant genomes, two asterid EST datasets (Lactuca sativa and Helianthus annuus), and four PPGP species (Lindenbergia philippensis, Triphysaria versicolor, Striga hermonthica, and Phelipanche aegyptiaca) were used in orthogroup classifications. Phylogenies were built for each orthogroup, and customized Java scripts were used to perform the initial automated HGT screening for genes in the parasites with phylogenetic positions inconsistent with the known species relationships and common gene family evolutionary processes. Secondary manual screening was carried out under stringent criteria to avoid scenarios other than HGT that could also lead to phylogenetic incongruence.

Twenty-three well-supported cases of putative HGT of transcribed gene sequences were identified. Each case is supported by an incongruent phylogeny, and for each we discuss the expression profile of the HGT gene, potential gene function, evolutionary constraint analyses, and genomic evidence. The majority of the transgenes show evidence of introns, indicating that the transfer happened at the DNA (not transcript) level. We also identified two high confidence HGT transgenes in Striga hermonthica located adjacent to each other. This is the first time that two adjacent HGT transgenes have been reported in host plant to parasitic plant gene transfer. HGT transgenes detected from EST datasets could be just the “tip of the iceberg” of a potentially much larger number of still undiscovered HGT events, since the majority of transgenes are likely to lose function and degrade with time. The findings presented here provide a foundation for functional studies in the parasitic plant research community.

Background

Horizontal Gene Transfer (HGT) is any process in which an organism incorporates genetic material from another organism without being that organism’s offspring [107, 108]. The phenomenon was first reported in bacteria in the 1950s, when multidrug resistance emerged on a worldwide scale, indicating that the antibiotic resistance traits were transferred among taxa instead of generated de novo by each taxon [55]. Over time, more and more cases of HGT have been identified in bacteria, and it is now considered a common event in bacterial evolution. A substantial amount of HGT is associated with plasmid-, phage- or transposon-related sequences [56]. Fewer well-documented cases of HGT are known among eukaryotes [87], and many of these cases appear to result in short-lived, nonfunctional sequences [87, 109, 110]. Consequently, the long-term evolutionary impact of HGT in multicellular eukaryotes remains largely unknown.

The most commonly used method for detection of an HGT event is a well-supported incongruence between a gene tree and a well-resolved species tree. Several studies have utilized this approach to identify likely HGT events among plant species, including parasitic plants.

Parasitic plants invade their host plants’ tissues through either the shoots or roots. The haustorium [1], a unique organ for heterotrophic feeding, allows parasitic plants to acquire nutrients and water from their host plants. The large diversity of compounds exchanged between parasites and their host plants increases the possibility of HGT events [107].

Most HGT events have been studied one at a time as they were discovered [32, 72-80, 84, 111-114, 161]. With an increase in available genomic and transcriptomic data for plants, it is now possible to detect HGT events en masse by using a comprehensive phylogenomic approach. Phylogenomic analysis allowed Xi et al. to conclude that many sequences in the parasitic plant Rafflesia cantleyi originated via host-to-parasite HGT [82]. Rafflesiaceae are endophytic holoparasites that lack leaves and stems. They have a narrow host range, parasitizing members of the grapevine family, Vitaceae. Transcriptomic data for both the parasite (Rafflesia) and its obligate host (Tetrastigma) were generated. Combined with whole genome sequence data from nine other plant species, phylogenomic analyses were performed and several dozen potential HGT genes were identified. These HGT candidate genes were actively transcribed and were largely nuclear genes. Later, the same group reported massive mitochondrial gene transfer in the same species using a similar approach [83]. In that report, thirty-eight mitochondrial genes were examined and eleven of them showed phylogenetic patterns suggestive of HGT in Rafflesiaceae species.

The Parasitic Plant Genome Project (PPGP) has produced a large amount of transcriptome data [135] for Orobanchaceae, including the three root-feeding parasitic plants Triphysaria versicolor, Striga hermonthica, and Phelipanche aegyptiaca. Transcriptome data were also obtained for a fourth Orobanchaceae species, Lindenbergia philippensis, which is an autotrophic nonparasitic plant closely related to the parasites. Among the three parasite species chosen for this study, Triphysaria versicolor is a facultative hemiparasite, a photosynthetic plant that can go through its entire life cycle without parasitic connection to a host; Striga hermonthica is an obligate hemiparasite, a photosynthetic plant that needs a host plant from germination to maturity; and Phelipanche aegyptiaca is an obligate holoparasite that lacks photosynthetic ability and is completely dependent upon a host plant as a source of carbon, minerals, and water [2]. Such a wide range of parasitic ability is found in only one family of parasitic plants, Orobanchaceae [2].

One of the main reasons for this study is that Striga (witchweeds) and Phelipanche (broomrapes) have a devastating impact on agricultural plants. For example, over two thirds of the 73 million hectares of farmland cultivated for cereal grains and legumes in Africa are infested with one or more Striga species, affecting the livelihoods of over 100 million farmers in 25 countries [40, 41]. Most Striga species parasitize grasses, such as maize (Zea mays), rice (Oryza sativa), and sorghum (Sorghum bicolor), which are all very important agricultural species. In Europe, Phelipanche species detrimentally impact important crops, including legumes (faba bean, chickpea, and pea), vegetables (tomato, potato, and carrot), and oilseed crops (sunflower and Brassica) [42]. Striga and Phelipanche infestations also exist in the U.S. The methods used to control these infestations are very costly, and it is difficult to control parasitic plants using conventional methods because these plants infest crops underground. Efforts have been made to breed parasite-resistant crops with some success, including physiology-based breeding methods [43, 44] and molecular breeding (marker-assisted breeding) [45-50]. Biotechnologies have also been applied in trying to generate crops resistant to parasites. Currently, there is more information on the use of translocated RNAi than peptides to suppress parasites [93, 97, 98, 116].

The increasing amount of transcriptomic information from PPGP will facilitate the process of developing resistant crops.

The three parasitic plants in Orobanchaceae differ in their host ranges. Triphysaria versicolor’s host plants range from monocots to dicots [162, 163]. Striga hermonthica parasitizes grasses [131, 164], and Phelipanche aegyptiaca parasitizes a range of dicot hosts from rosid and asterid angiosperm lineages, including Solanaceae (tomato, potato, eggplant, and tobacco) and crops in Fabaceae, Apiaceae, and Asteraceae [128, 129, 131]. Because Striga hermonthica has a relatively narrow range of host plant species, a stringent BLAST-based screening for HGT candidates is feasible [111]. Using such an approach, Zhang et al. [85] reported a unique gene, albumin1, identified in Phelipanche aegyptiaca and other related broomrape species. However, for parasitic plants with a wide range of host plant species, such as Triphysaria versicolor, the BLAST-based HGT screening approach is less effective. Furthermore, ancient HGT events may have taken place in an ancestor of the parasite in question that fed on host plants different from those of the current species. Therefore, a phylogenomic approach [82, 165] would be more appropriate in such cases.

In this study, we implement a phylogenomic approach to detect cases of HGT in transcribed sequences of members of Orobanchaceae with a wide range of parasitic ability, including the parasites Triphysaria versicolor, Striga hermonthica, and Phelipanche aegyptiaca, and the closely related nonparasitic Lindenbergia philippensis. The short-term goals of this study are to address the following questions. First, how frequent is HGT leading to expressed transgenes in the Orobanchaceae? Second, do the HGT events detected involve only known host plants or unexpected lineages? Third, what is the mechanism involved in the HGT event, e.g., has HGT occurred through an RNA intermediate or direct transfer of a genomic fragment? Fourth, are any HGT events shared among the three parasite species, suggesting transfers from an ancestral parasite taxon? Fifth, are there any HGT genes that appear to have a function related to parasitism? Through a comprehensive study of all possible HGT genes, we hope to identify a functional pattern of the HGT genes to improve functional studies on controlling invasive parasites in Orobanchaceae.

Results

Result summary on four species, including Lindenbergia philippensis, Triphysaria versicolor, Striga hermonthica, and Phelipanche aegyptiaca

A phylogenomic approach was applied to detect putative HGT events in the four study species: Lindenbergia philippensis, Triphysaria versicolor, Striga hermonthica, and Phelipanche aegyptiaca. Twenty-two plant genomes (Selaginella moellendorffii, Physcomitrella patens, Amborella trichopoda, Oryza sativa, Brachypodium distachyon, Sorghum bicolor, Phoenix dactylifera, Musa acuminata, Nelumbo nucifera, Aquilegia coerulea, Arabidopsis thaliana, Carica papaya, Fragaria vesca, Glycine max, Medicago truncatula, Populus trichocarpa, Thellungiella parvula, Theobroma cacao, Vitis vinifera, Solanum lycopersicum, Solanum tuberosum, and Mimulus guttatus) and two asterid EST datasets (Lactuca sativa and Helianthus annuus) were selected to build a global classification of plant proteins using the protein clustering program OrthoMCL [166]. Orobanchaceae is placed within Lamiales, which also includes Mimulus guttatus (Phrymaceae), a nonparasitic relative with a sequenced genome [167]. Contigs from de novo assembly of large-scale transcriptome datasets from the four species reported here were assigned to orthogroups using BLAST and HMM models (E. Wafula, unpublished). Translated sequences were aligned, and then DNA sequences were forced back onto the protein alignments to obtain DNA-level alignments of each orthogroup, including any Orobanchaceae or Mimulus genes assigned to those objectively defined gene families. Phylogenies were built for a total of 13254 gene families. Details can be found in the Methods section under the subsections “constructing a global gene family classification”, “sorting EST assemblies into orthogroups”, and “global gene family phylogenetic analysis”.
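The step of forcing DNA sequences back onto a protein alignment can be illustrated with a minimal sketch. It assumes that each coding sequence is in frame and corresponds residue for residue to its translated sequence; the function name and the toy sequences are hypothetical and are not taken from the pipeline used in this study.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Thread an unaligned, in-frame coding sequence onto its aligned protein.

    Each residue in the protein alignment consumes the next codon of the
    coding sequence; each alignment gap becomes a three-column gap.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for the aligned protein")

    codons = []
    pos = 0  # position of the next unused codon in the coding sequence
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Toy example (hypothetical sequences): protein "MKT" aligned as "M-KT".
print(thread_codons("M-KT", "ATGAAAACC"))  # ATG---AAAACC
```

Because gaps are inserted in multiples of three, the resulting DNA alignment preserves the reading frame, which is what codon-based analyses such as dN/dS estimation require.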

In this report, phylogenetic trees are color-coded. Black represents the root group (Selaginella moellendorffii and Physcomitrella patens). Blue represents Amborella trichopoda, a basal angiosperm species. Yellow represents the monocot group (Oryza sativa, Brachypodium distachyon, Sorghum bicolor, Phoenix dactylifera, and Musa acuminata). Purple represents basal eudicots (Nelumbo nucifera and Aquilegia coerulea). Red represents rosids (Arabidopsis thaliana, Carica papaya, Fragaria vesca, Glycine max, Medicago truncatula, Populus trichocarpa, Thellungiella parvula, Theobroma cacao, and Vitis vinifera) and light green represents the following asterids (Solanum lycopersicum, Solanum tuberosum, Mimulus guttatus, Lactuca sativa, and Helianthus annuus). Dark green represents the taxa in Orobanchaceae (Lindenbergia philippensis, Triphysaria versicolor, Striga hermonthica, and Phelipanche aegyptiaca).

Taxon names are abbreviated in the phylogenies. Full names and corresponding abbreviations are as follows: Selaginella moellendorffii (Selmo1.0), Physcomitrella patens (Phypa1.6), Amborella trichopoda (Ambtr), Oryza sativa (Orysa6.0), Brachypodium distachyon (Bradi1.2), Sorghum bicolor (Sorbi1.4), Phoenix dactylifera (Phoda3.0), Musa acuminata (Musac1.0), Nelumbo nucifera (Nelnu1.0), Aquilegia coerulea (Aquco1.0), Arabidopsis thaliana (Arath10), Carica papaya (Carpa1.181), Fragaria vesca (Frave2.0), Glycine max (Glyma1.0), Medicago truncatula (Medtr3.5), Populus trichocarpa (Poptr2.2), Thellungiella parvula (Thepa2.0), Theobroma cacao (Theca1.0), Vitis vinifera (Vitvi12X), Solanum lycopersicum (Solly2.3), Solanum tuberosum (Soltu3.4), Mimulus guttatus (Mimgu1.0), Lactuca sativa (LaSa), Helianthus annuus (HeAn), Lindenbergia philippensis (LiPhGn), Triphysaria versicolor (TrVeBC), Striga hermonthica (StHeBC), and Orobanche aegyptiaca (OrAeBC) (syn. Phelipanche aegyptiaca).
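For readers who wish to work with the tree labels programmatically, the abbreviations and color groups listed above can be captured in a small lookup table. The following is an illustrative sketch only: the prefix-matching rule and the function name are assumptions made here, and the exact label format used in the tree files may differ.

```python
# Lineage group for each taxon abbreviation prefix listed in the text.
LINEAGE_BY_PREFIX = {
    "Selmo": "root", "Phypa": "root",
    "Ambtr": "basal angiosperm",
    "Orysa": "monocot", "Bradi": "monocot", "Sorbi": "monocot",
    "Phoda": "monocot", "Musac": "monocot",
    "Nelnu": "basal eudicot", "Aquco": "basal eudicot",
    "Arath": "rosid", "Carpa": "rosid", "Frave": "rosid", "Glyma": "rosid",
    "Medtr": "rosid", "Poptr": "rosid", "Thepa": "rosid", "Theca": "rosid",
    "Vitvi": "rosid",
    "Solly": "asterid", "Soltu": "asterid", "Mimgu": "asterid",
    "LaSa": "asterid", "HeAn": "asterid",
    "LiPh": "Orobanchaceae", "TrVe": "Orobanchaceae",
    "StHe": "Orobanchaceae", "OrAe": "Orobanchaceae",
}

# Tree colors as described in the text.
COLOR_BY_LINEAGE = {
    "root": "black",
    "basal angiosperm": "blue",
    "monocot": "yellow",
    "basal eudicot": "purple",
    "rosid": "red",
    "asterid": "light green",
    "Orobanchaceae": "dark green",
}


def lineage_of(label: str) -> str:
    """Return the lineage group for a tip label such as 'OrAeBC5_22555'.

    Assumption: every label starts with one of the prefixes above.
    """
    for prefix, lineage in LINEAGE_BY_PREFIX.items():
        if label.startswith(prefix):
            return lineage
    raise KeyError(f"unrecognized taxon label: {label}")


print(lineage_of("OrAeBC5_22555"), COLOR_BY_LINEAGE[lineage_of("Sorbi1.4")])
```

A table of this kind makes it straightforward to ask, for any Orobanchaceae tip, which lineage groups its neighbors in a gene tree belong to.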

Customized JAVA scripts were applied to the 13245 orthogroup phylogenies to automatically identify incongruent phylogenies. Stringent manual screening was then applied to select high confidence HGT gene trees. We used four general levels of manual screening. Level 1 checks for an adequate taxon representation in the Asteridae group. Level 2 checks for an even distribution of taxon sampling in the rest of the lineages, especially the lineage where the HGT gene resides. Level 3 checks for a gene tree topology that is generally consistent with the species tree plus gene duplication and loss processes. Some gene trees are so poorly resolved, or so conflicted relative to the species tree, that no meaningful conclusions can be drawn about potential HGT events. Level 4 checks for general bootstrap support. Low confidence HGT genes failed manual screening at level 1. Medium confidence HGT genes passed level 1 but failed at least one of the next three levels. High confidence genes passed all four levels. For high confidence HGT candidates, the 1KP database [147] was also searched using BLASTP [133] for homologs in closely related species. Details on these two steps can be found in the Methods section under the subsection “HGT screening”. Table 3-1 shows the results of the final HGT genes identified.
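The mapping from the four manual screening levels to a confidence category can be summarized in a few lines. This is a sketch of the decision rule as described in the text, not the screening code itself; it assumes that a medium confidence candidate is one that passes level 1 and fails any of levels 2 to 4, and the function name is hypothetical.

```python
from typing import Sequence


def classify_candidate(passed_levels: Sequence[bool]) -> str:
    """Classify an HGT candidate from its four manual screening outcomes.

    passed_levels holds True/False for levels 1 to 4, in order:
    1 = asterid taxon representation, 2 = even taxon sampling elsewhere,
    3 = topology consistent with the species tree, 4 = bootstrap support.
    """
    if len(passed_levels) != 4:
        raise ValueError("expected outcomes for exactly four screening levels")
    level1, *later_levels = passed_levels
    if not level1:
        return "low"
    if all(later_levels):
        return "high"
    return "medium"  # assumption: passed level 1, failed any later level


print(classify_candidate([True, True, True, True]))   # high
print(classify_candidate([True, True, False, True]))  # medium
print(classify_candidate([False, True, True, True]))  # low
```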

In total, twenty-three gene families with high confidence HGT events were detected using the screening method described above: one in the autotroph Lindenbergia philippensis, two in Triphysaria versicolor, six in Striga hermonthica, and fourteen in Phelipanche aegyptiaca. In the following sections, detailed results related to each of the high confidence HGT genes, including incongruent phylogenies, expression profiles, and potential functions, are discussed.

HGT candidates identified in Lindenbergia philippensis

Lindenbergia philippensis is an autotrophic plant species in Orobanchaceae. Because it is nonparasitic, we hypothesized that there might be fewer HGT events detected in this species.

After several levels of stringent manual screening, one gene family (orthogroup 3861) was identified with high confidence as horizontally transferred in Lindenbergia philippensis. The incongruent phylogeny is shown in Figure 3-1. Orthologous genes of this protein family were identified in Phelipanche aegyptiaca (OrAeBC5_22555 and OrAeBC5_9914), indicating that this HGT event happened prior to the divergence of Lindenbergia philippensis and Phelipanche aegyptiaca. Based on orthogroup annotation, ortho 3861 may be involved in glycohydrolase activity (GO: 0004649) and carbohydrate metabolic processes (GO: 0005975). Constraint analysis [168] shows that the dN/dS ratio at the branch containing potential HGT clades is 0.2949, which indicates a clear signature of purifying selection, suggesting that the horizontally transferred gene may still be functional.
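The reasoning from a dN/dS estimate to a selection regime follows a simple convention: values well below 1 suggest purifying selection, values near 1 suggest neutral evolution, and values well above 1 suggest positive selection. The sketch below encodes that convention; the tolerance band around 1 is an arbitrary illustrative choice, and a formal analysis would rely on likelihood-based tests rather than a fixed cutoff.

```python
def interpret_omega(omega: float, tolerance: float = 0.05) -> str:
    """Give the conventional reading of a dN/dS (omega) estimate.

    tolerance is a hypothetical band around 1 used only for illustration.
    """
    if omega < 0:
        raise ValueError("dN/dS cannot be negative")
    if omega < 1 - tolerance:
        return "purifying selection"
    if omega > 1 + tolerance:
        return "positive selection"
    return "approximately neutral"


# The branch estimate reported for orthogroup 3861.
print(interpret_omega(0.2949))  # purifying selection
```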

HGT candidates identified in Triphysaria versicolor

Two orthogroups (ortho 12303 and 12577) were identified as high confidence HGT candidates in the facultative parasite Triphysaria versicolor. The corresponding incongruent phylogenies and expression profiles are shown in Figures 3-2 and 3-3, and the function of both gene families is unknown. Evolutionary constraint analyses [168] show that the parasite genes are under the same constraint level as the homologous genes in the same gene family (dN/dS = 0.31 for orthogroup 12303 and dN/dS = 0.28 for orthogroup 12577). Only limited genomic data are available for Triphysaria versicolor, and we did not find genomic evidence for the two HGT candidates in these data.

HGT candidates identified in Striga hermonthica

In total, six gene families were identified with putative HGT events involving Striga hermonthica, including orthogroups 13656, 14233, 18774, 2270, 5896, and 10124. Among the six orthogroups, 13656, 14233, and 18744 are monocot-exclusive gene families. The potential gene functions for these three orthogroups are still unknown. Table 3-2 shows the potential GO annotation for the three HGT orthogroups that are not monocot exclusive. The incongruent phylogenies of the horizontally transferred gene corresponding to each orthogroup mentioned above are shown in Figures 3-4 to 3-9.

Figures 3-7 to 3-9 show the phylogenies of three high confidence HGT candidate genes (ortho 2270, ortho 5896, and ortho 10124) not grouped with monocots. All three Striga hermonthica genes are grouped within rosids. Although rosids are not commonly known as host plants of Striga hermonthica, two explanations may account for this observation. First, the HGT event may have happened earlier than the Striga hermonthica speciation event. The other possibility is that Striga hermonthica obtained the gene through another vector, such as a virus. To evaluate the first possibility, we searched the PPGP data again for homologs in other parasites. Additional homologs of StHeBC3_2868.1 were identified in the Orobanche (syn. Phelipanche) aegyptiaca data, and the expanded phylogeny is shown in Supplemental Figure 3-1. The expression profile of the additional Orobanche ESTs is shown in Supplemental Figure 3-2. Only OrAeBC5_3356 is shown; OrAeBC5_51488.1 is not shown due to extremely low expression across all stages. We did not find additional homologs in other parasites for ortho 5896 and ortho 10124.

Obtaining genomic evidence is crucial in detecting an HGT event. One reason is to confirm the presence of the gene in the genome. The other is that studying the gene structure and sequence may shed light on the mechanism of a particular HGT event. To gather this evidence, we used both GMAP and BLASTn to map genomic contigs onto HGT contigs [133, 169]. Details can be found in the Methods section. The mapping results are summarized in Tables 3-3 and 3-4. Table 3-3 shows the results for mapping genomic contigs onto HGT EST contigs; most of the HGT contigs (5 out of 6) are largely covered by genomic contigs (77% to 99%). Table 3-4 shows the depth of coverage, ranging from 10X to 41X, of the genomic contigs that were mapped onto the HGT EST contigs. Introns can be detected in four of the six HGT genes from Striga hermonthica (Table 3-3), which suggests that genomic DNA was integrated into the parasite genome directly.
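The percentage of an HGT EST contig covered by genomic contigs can be computed from alignment coordinates by merging overlapping hits so that no position is counted twice. The sketch below shows one way to do this; it is not the procedure used to produce Table 3-3, the coordinate convention (1-based, inclusive, on the EST contig) is an assumption, and the example intervals are hypothetical.

```python
from typing import Iterable, Tuple


def covered_fraction(contig_length: int, hits: Iterable[Tuple[int, int]]) -> float:
    """Fraction of an EST contig covered by at least one genomic alignment.

    hits are (start, end) positions on the EST contig, 1-based and inclusive.
    Overlapping or nested hits are counted only once.
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")

    covered = 0
    last_covered = 0  # highest position already counted
    for start, end in sorted(hits):
        start = max(start, last_covered + 1, 1)  # skip positions already counted
        end = min(end, contig_length)            # clip to the contig
        if end >= start:
            covered += end - start + 1
            last_covered = end
    return covered / contig_length


# Hypothetical hits on a 1000 bp contig: 1-400, 300-600, and 800-900.
print(f"{covered_fraction(1000, [(1, 400), (300, 600), (800, 900)]):.1%}")  # 70.1%
```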

Having identified the high confidence HGT candidate genes in Striga hermonthica, we explored the evolutionary constraints on these candidates. The results are shown in Supplemental Figure 3-3. Using PAML [168], the dN/dS ratio on the HGT parasite branch was estimated separately, while the dN/dS ratios on all other branches were summarized into one background value. All of the HGT transgenes still evolve under constraint (Supplemental Figure 3-3). However, in most cases, the constraint is relaxed in the HGT parasite lineage. In one gene (orthogroup 14233), the constraint on the parasite lineage is stronger than on the other lineages. This gene family is also monocot specific, but its function is currently unknown even in monocots. This pattern may indicate that the horizontally acquired gene serves a unique but unknown function in the parasite.
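When a separate dN/dS ratio is estimated for a foreground branch and compared with a single background ratio, a likelihood ratio test is one common way to ask whether the two-ratio model fits significantly better than a one-ratio model. The text does not state whether such a test was applied here, so the sketch below is illustrative only, and the log-likelihood values are hypothetical. With one additional free parameter, the test statistic is compared to a chi-square distribution with one degree of freedom.

```python
import math
from typing import Tuple


def branch_model_lrt(lnl_one_ratio: float, lnl_two_ratio: float) -> Tuple[float, float]:
    """Likelihood ratio test of a two-ratio against a one-ratio branch model.

    Returns the test statistic 2 * (lnL_two - lnL_one) and its p-value under
    a chi-square distribution with one degree of freedom.
    """
    statistic = max(2.0 * (lnl_two_ratio - lnl_one_ratio), 0.0)
    # For one degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods for the two nested models.
statistic, p_value = branch_model_lrt(-1000.0, -997.0)
print(f"2*dlnL = {statistic:.2f}, p = {p_value:.4f}")
```

A small p-value would indicate that the parasite branch evolves under a different level of constraint than the background, without by itself indicating whether the constraint is stronger or weaker; the direction comes from comparing the two estimated ratios.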

HGT candidates identified in Phelipanche aegyptiaca

Among the HGT candidates (present in 14 different orthogroups) identified in Phelipanche aegyptiaca, seven orthogroups have the Phelipanche aegyptiaca HGT gene grouped with Fragaria vesca under high bootstrap support, while the remaining seven orthogroups have the Phelipanche aegyptiaca HGT gene grouped with various species. Albumin 1 [85] belongs to the latter seven orthogroups but is not discussed further in this report. GO annotations of the HGT candidate gene families are also listed in Table 3-2. Figures 3-10 to 3-22 show the incongruent phylogenies and the expression profile for each high confidence HGT gene identified in Phelipanche aegyptiaca. For each orthogroup that has an identified HGT, the incongruent phylogeny is shown first, followed by the expression profile of the HGT genes in the parasite.

Among the orthogroups in which Phelipanche aegyptiaca genes are closely related to Fragaria vesca, ortho 11841 attracted our attention the most. First, in this orthogroup, homologous genes were found not only in Striga hermonthica but also in Orobanche fasciculata from the 1KP dataset [147]. This increases our confidence in this particular gene family and suggests that the HGT event happened prior to the divergence of Striga hermonthica and the broomrape species. Furthermore, the drastically different expression patterns in Striga hermonthica and Phelipanche aegyptiaca (Supplemental Tables 3-1 and 3-2) indicate that Striga hermonthica may have lost the function of this gene even though it still serves an important function in Phelipanche aegyptiaca. To obtain further evidence for this hypothesis, we studied the evolutionary constraint on this gene family using PAML (details are listed in the Methods section). The results are shown in Supplemental Figure 3-6. Based on the constraint analysis of each branch and terminal node, the HGT genes show a level of constraint similar to that of genes in the species closely related to the potential donor species.

Genomic contigs were also mapped onto the HGT contigs using the limited genomic data available for Phelipanche aegyptiaca. Results are shown in Table 3-5 and Table 3-6. For the HGT EST contigs with support in the limited genomic data (based on the depth of coverage in Table 3-6), the mapped percentage ranged from 19% to 95%. With more genomic data, it may be possible to find evidence for the HGT genes that lack genomic support in this dataset. Intron signatures were also examined, and the results are shown in Table 3-5. Of the thirteen potential HGT candidate genes in Phelipanche aegyptiaca, twelve have introns that were detectable based on available data. This again suggests that genomic DNA from donor species was integrated into the parasite genome directly.

Evolutionary constraint analyses were also performed on the HGT candidate gene families in Phelipanche aegyptiaca. Results are shown in Supplemental Figure 3-4. Results for orthogroup 11841 and orthogroup 4336 are not shown there, as they are discussed in detail separately. Based on Supplemental Figure 3-4, in most cases, the horizontally transferred genes were under relaxed evolutionary constraints.

Expression profile of the HGT gene candidates

Stage-specific expression data for the three parasites were also available through the PPGP database. Expression profiles for the HGT genes were examined. Raw data are listed in Supplemental Tables 3-1, 3-2, and 3-3. The expression pattern for each HGT candidate gene is shown in Figures 3-23 to 3-50. A summary table of highly expressed stages for each high confidence HGT transgene is shown in Table 3-7.

Based on Table 3-7, the HGT transgenes have a diverse range of expression patterns. Because the high confidence transgenes likely have distinct functions, it is not surprising that no unified expression pattern is observed. Many transgenes have high expression at non-haustorial stages, indicating that these genes may play a role in non-haustorial developmental stages. A few interesting expression patterns also appear in Table 3-7; those that are also supported by genomic evidence are discussed here.

TrVeBC3_22826.1 is a high confidence HGT transgene identified in Triphysaria versicolor and is only expressed in three stages, including the interface stage grown on Medicago truncatula, 3G and 41G. We have obtained genomic evidence for this gene from genomic PCR (data not shown, ongoing investigation). Based on the incongruent phylogeny shown in Figure 3-3, the gene is grouped with legumes under high confidence. Legumes are known to be common hosts of Triphysaria versicolor. The observation that this gene is expressed only at stages related to haustorial development is of great value to our further functional study, since this transgene may directly impact haustorial development. Furthermore, within the interface stages, this gene is expressed only when the parasite was grown on Medicago; no expression was detected when it was grown on maize. This suggests that the gene may play a role in host recognition.

OrAeBC5_9762.1 and OrAeBC5_4284.2 are both high confidence HGT transgenes identified in Phelipanche aegyptiaca. They have different functions, and their incongruent phylogenies are shown in Figure 3-12 and Figure 3-14. Both transgenes are grouped with Fragaria vesca under high bootstrap support. The expression patterns show that both transgenes are highly expressed at haustorial development stages, indicating that these two genes may play a role in haustorial development.

OrAeBC5_3756.1 and StHeBC3_55745.1 are both grouped with Fragaria vesca under high support within the same orthogroup (ortho 11841, Figure 3-16). The HGT event very likely happened prior to the divergence of Phelipanche aegyptiaca and Striga hermonthica. However, the two transgenes show drastically different expression patterns. The Phelipanche aegyptiaca transgene has moderate expression across all stages, while the Striga hermonthica transgene is barely expressed, and only at Stage 1. This suggests that the two transgenes in Phelipanche and Striga are evolving differently.

Among the HGT candidate genes identified in Phelipanche aegyptiaca, the HGT contigs in ortho 4336 (OrAeBC5_13694.1 and OrAeBC5_7046) showed an interesting pattern in which expression was highest in the haustorial related stages and the interface stage. We then examined the expression profile of the vertically inherited homologous gene in Phelipanche aegyptiaca (OrAeBC5_6623). Results are shown in Supplemental Figure 3-5. The horizontally transferred gene and the vertically inherited gene were both highly expressed in stages 3 and 4 (haustorial stages). However, the HGT gene was also highly expressed at the interface stage, while the vertically inherited homolog was not. This may indicate that the HGT gene plays a role at the interface tissues between parasite and host plants. Furthermore, we ran a thorough constraint analysis on each branch in orthogroup 4336. The results are shown in Supplemental Figure 3-7. Based on the results, the HGT genes have a level of constraint similar to that of genes in the species closely related to the potential donor species.

Discussion

A large scale, but still conservative, approach to identify HGT

This study sought to identify HGT events from a diverse range of related parasitic plant species that share the characteristic of having intimate cell-cell connections with their host plants and therefore may be good candidates for plant-plant HGT. Compared to the stringent BLAST-based HGT approach used previously [85], a phylogenomic approach should be able to harvest more HGT candidates. We indeed obtained many more initial HGT candidates in the first automated screening (Table 3-1).

However, the purpose of our report here is to describe in detail the HGT events that we believe have the highest confidence. After the second round of stringent manual screening (details can be found in the Methods section under “HGT screening”), we found 23 high confidence HGT genes. The 23 HGT candidates are rich resources for functional studies.

At the same time, we categorized the remaining HGT candidates, which passed the automated screening but failed the stringent manual screening, as low confidence HGT candidates under current conditions. We believe that there could be additional true HGT sequences hiding in this category; however, we are not able to identify them due to current limitations in the public genome data available for some critical species. As more plant sequencing data become available, some of these candidates may be resolved.

The current approach identifies HGT events between plant species. One area omitted here is HGT between bacteria, viruses, and parasitic plants. This area is currently under investigation and will be reported in a separate paper.

High confidence and interesting HGT cases in Lindenbergia philippensis, Triphysaria versicolor, Striga hermonthica and Phelipanche aegyptiaca

Lindenbergia philippensis is an autotrophic plant species closely related to all three parasites in Orobanchaceae. We used Lindenbergia philippensis as a negative control and did not expect to find many HGT events occurring within this species. However, one high confidence HGT candidate gene (orthogroup 3861) was found in this species. The constraint analysis showed that this gene is still evolving under constraints and may still be functional.

Triphysaria versicolor is a generalist parasitic plant that feeds on a variety of plant species, making it difficult to use a BLAST-based HGT detection approach. The phylogenomic approach used in this report is more suitable for detecting HGT in such a species. Two gene families (orthogroups 12303 and 12577) were identified as high confidence HGT candidates, although the function of these two genes remains unknown in both the PPGP annotation and the Medicago truncatula genome annotation. These two genes were grouped with legumes, common host plants for Triphysaria versicolor. These two HGT candidate genes have expression profiles with high expression in interface and haustoria related stages.

Two types of HGT cases were identified from Striga hermonthica. One candidate group is from the monocot specific gene families, while the other group contains gene families shared by dicots and monocots. Monocots are common host plants of Striga hermonthica, which may partially explain why Striga hermonthica has obtained monocot-specific genes. Unfortunately, the function of the three monocot specific genes is unknown in both parasites and host plants, so it is not possible to speculate on the possible function of these HGT acquisitions in Striga.

For the other three orthogroups that are not monocot specific gene families, ortho 2270 has a defense response related biological process, and ortho 5896 is involved in a tetrapyrrole biosynthetic process. Ortho 10124 has an unknown function.

Fourteen high confidence HGT candidates were found in Phelipanche aegyptiaca. Two major types of HGT candidates were identified. One group of candidates (7/14) was closely grouped with Fragaria vesca in rosids, while the rest (7/14) of the candidates were grouped with various plant species. There are two possible explanations for the fact that half of the candidates were grouped with Fragaria vesca. The first is a taxon sampling artifact, meaning that Fragaria vesca is the species most closely related to the real HGT donor species in our taxon sampling. With additional rosid species, especially other Rosaceae members, the specific source of the transgene could be resolved, as we have done with an extensive sampling of legume sequences in our prior study of albumin 1 (Chapter 2). It is also possible that the real donor species is actually Fragaria vesca, or a closely related ancestral source. In such a case, including additional Rosaceae and other rosid species would reinforce the position of the gene in the phylogenies. If this scenario is true, then an interesting question arises: what has caused such frequent HGT events from an ancestor of Fragaria vesca to Phelipanche aegyptiaca? In the study of the expression profiles of the potential HGT candidate genes, the HGT candidate gene in orthogroup 4336 showed an interesting pattern in which it is highly expressed at the interface stage. This gene family can be further studied for potential haustorial related function.

The possible mechanism of HGT

Physiologically, the haustorial connection between hosts and parasites increases the chances of HGT. For Triphysaria and Striga species, only the xylem connection between hosts and parasites is known [89, 90]. In Orobanche and Phelipanche species, direct symplastic connections between their own cells and their hosts’ sieve elements have been observed by electron microscopy [91]. Furthermore, additional studies have reported direct phloem transmission of dyes, proteins [92], and even viruses [93]. Symplastic parasites may absorb a potentially wider range of molecules directly from the host phloem, including small nutrient molecules and, potentially, macromolecules as well. Thus, the chances of obtaining macromolecules, e.g., DNA or RNA, are higher for broomrape species than for Triphysaria and Striga, which will affect the potential for HGT events. We indeed observed more high confidence HGT transgenes in Phelipanche (14) than in Triphysaria (2) and Striga (6). Though other explanations exist, e.g., bias in the screening strategy or differences in the completeness of the datasets, the difference in haustorial connections at a finer scale (xylem or phloem connection) may play a role in the susceptibility of the parasites to HGT events.

Some of the donor species of the HGT events were well-known parasite host plants, while some were not, e.g., the HGT gene (OrAeBC5_26251.1) identified in orthogroup 8235, which is grouped within the monocot clades. There are at least two possible explanations for this. The first is that the HGT event happened earlier than the speciation event of Phelipanche aegyptiaca. The second is that a vector, e.g., a virus or another parasite, was involved in the HGT event.

The examination of genomic sequences for the HGT candidate genes in this study showed that the majority of HGT candidate genes have detectable intron structures (4/6 for Striga hermonthica and 12/13 for Phelipanche aegyptiaca). This suggests that, at the cellular level, these HGT transgenes involve the transfer of DNA elements rather than RNA intermediates. Where genomic data were sufficient, we examined the 5 prime regions for potential promoter signatures, and the results are listed in Table 3-3 and Table 3-5. Four of the high confidence HGT transgenes in Striga hermonthica have predicted promoters. Having a functional promoter will greatly increase the survival rate of an HGT transgene. We further explored whether the 5 prime regions show conservation between parasites and potential donors; however, no conclusive results could be drawn.

Two high confidence Striga hermonthica HGT transgenes located adjacently on the same genomic contig

Basic features of the two high confidence Striga hermonthica HGT transgenes

In the process of studying genomic evidence for high confidence HGT transgene

(StHeBC3_16619.1, orthogroup 14423), we identified another contig (StHeGnB1_80049, orthogroup 14624) to be a high confidence HGT transgene, and StHeGnB1_80049 is located 5 prime upstream of StHeBC3_16619.1 on the same genomic contig.

StHeGnB1_80049 is missing from the current assembly but present in previous assemblies, which is why we did not identify this HGT transgene in the initial screening based on the current assembly. This reflects, in part, some drawbacks of our current assembly. Our research group is working on improving the assembly, and the findings will be reported in a separate article. The incongruent phylogeny of orthogroup 14624 is shown in Supplemental Figure 3-8. Based on the phylogeny, this is a monocot-exclusive gene family, as is the case for StHeBC3_16619.1 in orthogroup 14233 (Figure 3-5).

We performed BLASTx and BLASTn searches against the NCBI nr and nt databases with StHeGnB1_80049 and StHeBC3_16619.1. The results are shown in Supplemental Figure 3-9 and Supplemental Figure 3-10. Based on Supplemental Figure 3-9, StHeGnB1_80049 has 90% similarity at the amino acid level and 91% identity at the nucleotide level with a Sorghum bicolor gene (SORBIDRAFT_10g026740/Sb10g026740, the top hit in the NCBI BLAST results). A BTB superfamily domain was identified in StHeGnB1_80049, and its monocot homologs are mainly annotated as speckle-type POZ proteins. Supplemental Figure 3-10 shows the NCBI BLAST result for StHeBC3_16619.1. In contrast to StHeGnB1_80049, the best hit of StHeBC3_16619.1 is a hypothetical protein in Setaria italica (sequence ID: XP_004978140.1), with 43% identity at the amino acid level. The monocot homologs of StHeBC3_16619.1 have no known functional annotation.

We then closely studied the Striga hermonthica genomic contig (136486) that harbors the two high confidence HGT transgenes (StHeGnB1_80049 and StHeBC3_16619.1). Supplemental Figure 3-11 shows the gene locations and features observed on this contig. StHeGnB1_80049 is located on the plus strand of genomic contig 136486, while StHeBC3_16619.1 is located downstream of StHeGnB1_80049 on the minus strand.

StHeGnB1_80049 has no introns, based on a comparison of the transcript with the genomic sequence. Interestingly, its homolog in Sorghum bicolor (SORBIDRAFT_10g026740) also lacks introns. A TATA box was identified upstream of StHeGnB1_80049, and the promoter region was predicted using SoftBerry prediction software (www.softberry.com). We attempted to determine whether the promoter region of StHeGnB1_80049 came from its potential donor species, Sorghum bicolor; however, no conclusive result could be drawn from the current data. Because of limited genomic data, we cannot determine whether StHeBC3_16619.1 has introns. Its monocot homologs are reported to have introns (Setaria italica, XM_004978083.1; Sorghum bicolor, XM_002451523.1).

The intergenic region between StHeGnB1_80049 and StHeBC3_16619.1 was also studied. High similarity at the DNA level (90.07% and 88.54%) was found between this intergenic region and Sorghum bicolor genomic sequences. The intergenic region was searched (BLASTn, E-value cutoff: 1E-10) against the Sorghum bicolor genome database, and the results are shown in Supplemental Figure 3-12. The two best hits in the Sorghum bicolor genome are also intergenic regions and lie immediately downstream of Sb10g026740, the potential donor gene of StHeGnB1_80049.
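The E-value filtering applied to the intergenic-region search can be sketched as follows. This is a minimal illustration only, assuming the BLASTn results were exported in the standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function names, the subject names, and the example hit lines are hypothetical and are not taken from the actual search output.

```python
from typing import List, NamedTuple

E_VALUE_CUTOFF = 1e-10  # cutoff stated in the text


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    evalue: float
    bitscore: float


def parse_tabular_hits(text: str) -> List[Hit]:
    """Parse BLAST 12-column tabular output, skipping blank and comment lines."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        hits.append(Hit(cols[0], cols[1], float(cols[2]),
                        float(cols[10]), float(cols[11])))
    return hits


def best_hits(hits: List[Hit], cutoff: float = E_VALUE_CUTOFF, top_n: int = 2) -> List[Hit]:
    """Keep hits at or below the E-value cutoff and rank them by bit score."""
    kept = [h for h in hits if h.evalue <= cutoff]
    return sorted(kept, key=lambda h: h.bitscore, reverse=True)[:top_n]


# Hypothetical example lines (illustrative values, not real search output).
EXAMPLE = "\n".join([
    "intergenic\tsubject_A\t90.07\t500\t40\t2\t1\t500\t1000\t1499\t1e-150\t600.0",
    "intergenic\tsubject_B\t88.54\t480\t50\t3\t10\t489\t2000\t2479\t1e-120\t520.0",
    "intergenic\tsubject_C\t75.00\t60\t15\t0\t1\t60\t3000\t3059\t1e-05\t45.0",
])

top = best_hits(parse_tabular_hits(EXAMPLE))
```

Here the third hypothetical hit is discarded because its E-value exceeds the cutoff, and the remaining two are reported in order of bit score.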

Two consecutive HGT events with different donor species in monocots

The identification of two high confidence HGT transgenes located adjacently on the same genomic contig is notable. It is the first report in the plant HGT field of two HGT transgenes found adjacent to each other. In fungi, a cluster of 23 genes on the same chromosome has been reported to be horizontally transferred between different fungal species [170]. Based on our analysis of these two Striga hermonthica HGT transgenes (StHeGnB1_80049 and StHeBC3_16619.1), we hypothesize that the two HGT events happened at different times and from different donor species, for the following reasons. First, based on Supplemental Figure 3-9 and Supplemental Figure 3-10, the two transgenes have different BLAST best hits (StHeGnB1_80049 to a Sorghum bicolor gene, Sb10g026740, and StHeBC3_16619.1 to a Setaria italica gene, XP_004978140.1) with different levels of similarity (90% similarity at the amino acid level to the Sorghum bicolor gene and 43% identity at the amino acid level to the Setaria italica gene). Second, the intergenic region between the two transgenes is highly conserved with the intergenic region downstream of Sb10g026740 (Supplemental Figure 3-12). Although other explanations exist, e.g., that the two HGT transgenes were transferred at the same time from the same donor species and have evolved at different rates in the parasite genome, we think the most parsimonious scenario is that StHeBC3_16619.1 was transferred before StHeGnB1_80049 and from a different monocot donor species. At least for StHeGnB1_80049, the transfer directly involved a DNA sequence. For StHeBC3_16619.1, we do not have sufficient genomic data to conclude whether the transfer happened at the DNA level or the RNA level.

A hot spot for HGT events?

If StHeBC3_16619.1 was transferred from a monocot donor species at an earlier point in evolutionary history, the surrounding genomic context of this gene could provide the potential for recombination with other genomic sequences from the same or closely related monocot species. With limited genomic data, we were able to show that StHeBC3_16619.1 and StHeGnB1_80049 are two HGT transgenes on the same genomic contig. What about the other genes upstream and downstream of these two HGT transgenes? Is this genomic region a hot spot for HGT events? Surprisingly, we observed a poly-T stretch (10 bp) upstream of StHeGnB1_80049. A poly-T tail is a signature of an HGT event involving RNA intermediates. These questions will remain unanswered until we have more genomic data.
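A scan for poly-T stretches of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the procedure used in this study: the function name and the example sequence are hypothetical, and the minimum run length is a parameter (10 bp matches the stretch reported in the text, while Table 3-3 uses a threshold of more than 20 bp).

```python
import re
from typing import List, Tuple


def find_base_runs(seq: str, base: str = "T", min_len: int = 10) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) coordinates of every run of
    `base` that is at least `min_len` bases long."""
    pattern = re.compile(f"{re.escape(base)}{{{min_len},}}", re.IGNORECASE)
    return [(m.start(), m.end()) for m in pattern.finditer(seq)]


# Hypothetical upstream sequence containing a single 10 bp poly-T stretch.
upstream = "ACGTAC" + "T" * 10 + "GCATGCA"
runs = find_base_runs(upstream, base="T", min_len=10)
```

With a stricter threshold (for example `min_len=21`), the same hypothetical sequence would return no runs.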


Expression pattern comparison between HGT transgenes and homologous genes from potential donor species

We compared the expression patterns of our high confidence HGT transgenes with those of the homologous genes from potential donor species. Our hypothesis is that if the parasite HGT gene performs a function similar to that of its homolog in the donor species, both genes are likely to have similar expression patterns. Moreover, if the promoter region was co-transferred from the donor genome to the parasite genome, both genes are likely to have similar expression patterns. However, we do not yet have a conclusive result based on our data and existing databases.

Conclusion

A phylogenomic approach was presented in this report, and twenty-three high confidence HGT genes were identified from four species in Orobanchaceae, including three parasitic plants and one closely related autotrophic plant. The majority of the transgenes show evidence of introns, indicating that the transfers happened at the DNA level. We also identified two high confidence HGT transgenes in Striga hermonthica located adjacent to each other. This is the first report of two adjacent HGT transgenes in the parasitic plant HGT field. Because we identified HGT transgenes from our EST datasets, these may be just the tip of the iceberg, as many HGT transgenes may be evolving differently and may not be expressed. With more genomic data available in the future from the PPGP, we will be able to tackle this question. These candidate genes serve as a rich pool for functional studies and may shed light on parasitism in parasitic plants and on novel and effective methods of controlling these devastating parasites. Although we did not identify a clear functional pattern among the high confidence HGT candidates reported here, with additional plant data some of the low confidence HGT candidates may be reclassified as high confidence candidates in future studies, making it possible to identify functional patterns.

Tables

Table 3-1. Number of HGT genes identified in four species, including Lindenbergia philippensis, Triphysaria versicolor, Striga hermonthica, and Phelipanche aegyptiaca.

| Species | Number of HGT candidate gene families after automated screening | Number of low confidence HGT candidate gene families after manual screening | Number of moderate confidence HGT candidate gene families after manual screening | Number of high confidence HGT gene families after manual screening |
|---|---|---|---|---|
| Lindenbergia philippensis | 56 | 0 | 55 | 1 |
| Triphysaria versicolor | 274 | 164 | 108 | 2 |
| Striga hermonthica | 393 | 242 | 145 | 6 |
| Phelipanche aegyptiaca | 409 | 235 | 160 | 14 |


Table 3-2. GO annotation for HGT candidate gene families and other related information. LiPh: Lindenbergia philippensis, TrVe: Triphysaria versicolor, StHe: Striga hermonthica, OrAe: Phelipanche aegyptiaca, OrFa: Orobanche fasciculata. Detailed results on SH test can be found in Supplemental table 4. ND: no data. IntMed: interface stage grown on Medicago truncatula. IntSorbi: interface stage grown on Sorghum bicolor. IntArath: interface stage grown on Arabidopsis thaliana. Information about development stage can be found in Method section.

| Orthogroup ID | GO Molecular Function | GO Biological Process | GO Cellular Component | Highest expression stage for HGT group | SH Test Result | Closely related species to HGT group | Species present in the HGT clade |
|---|---|---|---|---|---|---|---|
| 3861 | poly(ADP-ribose) glycohydrolase activity (GO:0004649) | carbohydrate metabolic process (GO:0005975) | unknown | ND | passed | Glycine max | LiPh, OrAe |
| 12303 | unknown | unknown | unknown | IntMed | passed | Glycine max | TrVe |
| 12577 | unknown | unknown | unknown | 41 | passed | Glycine max | TrVe |
| 2270 | ATP binding (GO:0005524); aminoacyl-tRNA ligase activity (GO:0004812); nucleotide binding (GO:0000166); tRNA binding (GO:0000049) | Defense response (GO:0006952); tRNA aminoacylation for protein translation (GO:0006418) | cytoplasm (GO:0005737) | 4 | passed | Theobroma cacao | StHe, OrAe |
| 5896 | uroporphyrinogen-III synthase activity (GO:0004852) | tetrapyrrole biosynthetic process (GO:0033014) | unknown | 52 | passed | Glycine max | StHe |
| 13656 | unknown | unknown | unknown | IntSorbi | passed | Sorghum bicolor | StHe |
| 14233 | unknown | unknown | unknown | 61 | passed | Sorghum bicolor | StHe |
| 18774 | unknown | unknown | unknown | 51 | passed | Oryza sativa | StHe |
| 10124 | unknown | unknown | unknown | IntSorbi, 4 | passed | Populus trichocarpa | StHe |
| 294 | protein binding (GO:0005515); protein transporter activity (GO:0008565) | protein import into nucleus (GO:0006606) | cytoplasm (GO:0005737); nucleus (GO:0005634) | 52 | passed | Fragaria vesca | OrAe |
| 1886 | unknown | unknown | unknown | 62 | passed | Fragaria vesca | OrAe |
| 4067 | hydrolase activity, hydrolyzing O-glycosyl compounds (GO:0004553) | carbohydrate metabolic process (GO:0005975) | unknown | 41 | passed | Fragaria vesca | OrAe |
| 8888 | RNA binding (GO:0003723); nucleotidyltransferase activity (GO:0016779) | RNA processing (GO:0006396) | unknown | 41 | passed | Fragaria vesca | OrAe |
| 10050 | ATP binding (GO:0005524); aminoacyl-tRNA ligase activity (GO:0004812); ammonia-lyase activity (GO:0016841); nucleotide binding (GO:0000166) | biosynthetic process (GO:0009058); tRNA aminoacylation for protein translation (GO:0006418) | cytoplasm (GO:0005737) | 42 | passed | Fragaria vesca | OrAe |
| 10143 | protein binding (GO:0005515) | unknown | unknown | 42 | passed | Fragaria vesca | OrAe |
| 11841 | oxidoreductase activity (GO:0016491); oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, 2-oxoglutarate as one donor, and incorporation of one atom each of oxygen into both donors (GO:0016706) | oxidation-reduction process (GO:0055114) | unknown | 51 (OrAe), 1 (StHe) | passed | Fragaria vesca | OrAe, StHe, OrFa |
| 806 | ATP binding (GO:0005524); aminoacyl-tRNA ligase activity (GO:0004812); nucleotide binding (GO:0000166); valine-tRNA ligase activity (GO:0004832) | tRNA aminoacylation for protein translation (GO:0006418); valyl-tRNA aminoacylation (GO:0006438) | cytoplasm (GO:0005737) | 41, 42 | passed | Theobroma cacao | OrAe |
| 1685 | ATP binding (GO:0005524); protein kinase activity (GO:0004672); sugar binding (GO:0005529) | protein phosphorylation (GO:0006468) | unknown | 3, 51, 52, 61, 62 | passed | Glycine max | OrAe |
| 2376 | endopeptidase activity (GO:0004175); threonine-type endopeptidase activity (GO:0004298) | proteolysis involved in cellular protein catabolic process (GO:0051603); ubiquitin-dependent protein catabolic process (GO:0006511) | proteasome core complex (GO:0005839); proteasome core complex, alpha-subunit complex (GO:0019773) | 41 (OrAe), 1 (StHe) | passed | Populus trichocarpa | OrAe, StHe |
| 4336 | methyltransferase activity (GO:0008168) | protein modification process (GO:0006464); protein transport (GO:0015031) | unknown | IntArath, 3 | passed | Theobroma cacao | OrAe |
| 8235 | ATP binding (GO:0005524); protein kinase activity (GO:0004672) | protein phosphorylation (GO:0006468) | unknown | 1 | passed | Sorghum bicolor | OrAe |
| 9613 | unknown | unknown | unknown | 0 | passed | Glycine max | OrAe |


Table 3-3. Genomic support for the HGT candidates identified in Striga hermonthica transcriptome assemblies. N/A means unknown based on the current data.

| Striga HGT EST contig ID | Contig length (bp) | Orthogroup # | Regions that have been covered by Striga genomic contigs | PolyT tail > 20 bp detected? | TE detected? | Putative promoter region detected? | Intron detected? | Coverage of contig length by Striga genomic contigs | Mapped genomic contig IDs |
|---|---|---|---|---|---|---|---|---|---|
| StHeBC3_41710.1 | 331 | 13656 | none | N/A | N/A | N/A | N/A | 0 | none |
| StHeBC3_16619.1 | 1465 | 14233 | 1-690, 751-1348 | no | no | N/A | N/A | 0.88 | 136486, 450140 |
| StHeBC3_4423.1 | 2104 | 18774 | 1-971, 1423-2087 | no | no | yes | yes | 0.78 | 32771, 320496, 104197, 32772 |
| StHeBC3_2868.1 | 2117 | 2270 | 97-233, 230-360, 423-546, 544-662, 659-768, 857-1004, 1001-1235, 1234-1533, 1544-1691, 1690-1815, 1814-2023, 1907-2032 | no | no | yes | yes | 0.904 | 1522, 336758, 336757 |
| StHeBC3_21126.1 | 1393 | 5896 | 1-38, 68-631, 630-717, 714-773, 773-867, 866-986, 985-1052, 1052-1140, 1139-1393 | no | no | yes | yes | 0.99 | 138901, 88410 |
| StHeBC3_5017.3 | 997 | 10124 | 1-138, 139-684, 885-997 | no | no | yes | yes | 0.78 | 123996 |
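One plausible way to derive a coverage fraction from the listed regions is to merge overlapping intervals and divide the covered length by the transcript length. The sketch below is an assumption about the calculation, not the documented procedure: the function names are hypothetical, and coordinates are treated as 1-based and inclusive. Under these assumptions it reproduces the tabulated values for StHeBC3_16619.1 and StHeBC3_4423.1 after rounding, but it does not reproduce every row of Table 3-3, so the table may have been computed differently.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive transcript coordinates (assumed)


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(intervals: Iterable[Interval], transcript_length: int) -> float:
    """Fraction of the transcript covered by at least one genomic contig."""
    covered = sum(end - start + 1 for start, end in merge_intervals(intervals))
    return covered / transcript_length


# Regions and contig lengths taken from Table 3-3.
frac_16619 = covered_fraction([(1, 690), (751, 1348)], 1465)
frac_4423 = covered_fraction([(1, 971), (1423, 2087)], 2104)
```

Merging before summing avoids counting bases twice where adjacent genomic contigs overlap on the transcript, as many of the listed regions do.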


Table 3-4. Depth of coverage for the Striga genomic contigs that have been mapped onto the HGT transcript assemblies.

| Genomic contig ID | Corresponding EST contig ID (StHeBC) | Genomic contig length | Number of mapped genomic reads | Depth of coverage |
|---|---|---|---|---|
| contig_1522 | 2868 | 9507 | 2285 | 24.035 |
| contig_336757 | 2868 | 443 | 46 | 10.384 |
| contig_336758 | 2868 | 743 | 83 | 11.171 |
| contig_104197 | 4423 | 831 | 113 | 13.598 |
| contig_320496 | 4423 | 3143 | 1121 | 35.667 |
| contig_32771 | 4423 | 1422 | 234 | 16.456 |
| contig_32772 | 4423 | 702 | 294 | 41.880 |
| contig_123996 | 5017 | 6148 | 1056 | 17.176 |
| contig_136486 | 16619 | 3150 | 560 | 17.778 |
| contig_450140 | 16619 | 657 | 91 | 13.851 |
| contig_138901 | 21126 | 5330 | 1091 | 20.470 |
| contig_88410 | 21126 | 3287 | 567 | 17.250 |
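The depth values in Table 3-4 are consistent with the usual definition of average depth, that is, the number of mapped reads multiplied by the read length and divided by the contig length. The sketch below illustrates this relationship. The read length of 100 bp is an assumption inferred from the tabulated numbers (for example, 2285 × 100 / 9507 ≈ 24.035) and is not stated for this table; the function name is hypothetical.

```python
def depth_of_coverage(mapped_reads: int, contig_length: int,
                      read_length: float = 100.0) -> float:
    """Average depth = total mapped bases / contig length.

    read_length defaults to 100 bp, an assumption inferred from Table 3-4.
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return mapped_reads * read_length / contig_length


# Values taken from the first two rows of Table 3-4.
depth_1522 = depth_of_coverage(2285, 9507)
depth_336757 = depth_of_coverage(46, 443)
```

Rounded to three decimals, these two calls give 24.035 and 10.384, matching the corresponding table rows.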


Table 3-5. Genomic evidence for the HGT candidates identified in Phelipanche aegyptiaca transcriptome datasets. N/A means unknown based on the current data.

| EST contig ID (OrAeBC5_) | Orthogroup # | Total length | Regions that have been covered with genomic contigs | PolyT tail > 20 bp? | TE detected? | Promoter detected? | Intron detected? | Matched percentage | Mapped genomic contig IDs |
|---|---|---|---|---|---|---|---|---|---|
| 3756.1 | 11841 | 1235bp | 160-517, 504-866, 863-1119 | no | N/A | N/A | yes | 0.79 | 1021293, 249396, 760936 |
| 14056.1 | 10143 | 2285bp | none | N/A | N/A | N/A | N/A | N/A | none |
| 4284.2 | 10050 | 2273bp | 761-859, 896-1235, 1326-1619, 1774-1997 | no | N/A | N/A | yes | 0.42 | 824263, 323273, 870671, 342545 |
| 15086.1 | 8888 | 1094bp | 248-358, 357-458 | no | N/A | N/A | yes | 0.19 | 594865, 1104623 |
| 9762.1 | 4067 | 1722bp | 191-309, 307-414, 412-498, 498-670, 669-716, 714-847, 844-1176, 1174-1239, 1259-1336 | no | N/A | N/A | yes | 0.66 | 125457, 731729, 939666, 390491, 948174, 1001786 |
| 15496.1 | 1886 | 2150bp | 1-163, 176-392, 599-901, 939-1144, 1405-1917, 1892-2001 | no | no | N/A | N/A | 0.7 | 611679, 213054, 1025591, 808913, 365121 |
| 9142.1 | 294 | 1889bp | 490-658, 677-820, 798-1012, 1013-1143, 1140-1271, 1268-1349, 1349-1532, 1617-1680, 1682-1740 | no | N/A | N/A | yes | 0.62 | 62892, 809507, 1020702, 818912, 309131 |
| 4239.1 | 806 | 3171bp | 152-311, 312-386, 385-451, 448-537, 532-614, 613-696, 695-777, 961-1137, 1136-1231, 1185-1308, 1307-1515, 1514-1651, 1650-1916, 1913-2166, 2397-2667, 2676-2778, 2773-2876, 2873-3024, 3024-3171 | no | N/A | N/A | yes | 0.84 | 672795, 685108, 56970, 747901, 170955, 20667, 242757, 547941, 514754, 654491, 594899 |
| 15353.1 | 1685 | 948bp | 1-113, 690-839, 847-948 | no | N/A | N/A | yes | 0.38 | 971619, 1063709, 1044392 |
| 6791.1 | 1685 | 2034bp | 370-560, 500-731, 786-912, 1043-1307, 1444-1569, 1569-1720, 1766-1976 | no | N/A | N/A | yes | 0.64 | 951675, 935609, 944884, 932051, 1067473, 1094683, 221991 |
| 9731.1 | 1685 | 1080bp | 109-199, 203-510, 507-575, 583-628, 627-753, 762-972, 970-1080 | no | N/A | N/A | yes | 0.86 | 804799, 990507, 1031613, 921125 |
| 14072.1 | 1685 | 753bp | none | N/A | N/A | N/A | N/A | N/A | none |
| 7956.1 | 1685 | 591bp | none | N/A | N/A | N/A | N/A | N/A | none |
| 270.1 | 2376 | 696bp | 56-109, 108-177, 175-221, 219-298, 296-365, 365-437, 434-526, 520-584, 584-651 | no | N/A | N/A | yes | 0.88 | 1103954, 216847, 456028, 475928, 1029701 |
| 7046.1 | 4336 | 786bp | 1-171, 219-313, 310-457, 458-569, 567-653, 651-709, 709-786 | no | N/A | N/A | yes | 0.95 | 820264, 2472, 947606, 866690, 847264, 776905 |
| 13694.1 | 4336 | 1032bp | 128-182, 180-441, 558-703, 704-816, 814-900, 897-955, 956-1032 | no | N/A | N/A | yes | 0.77 | 445311, 1062136, 479356, 409246, 957251 |
| 26251.1 | 8235 | 945bp | none | no | N/A | N/A | N/A | N/A | none |
| 16890.17 | 9613 | 1251bp | 496-720, 105-210, 209-246, 1138-1233 | no | N/A | N/A | yes | 0.37 | 631883, 959303, 990579 |


Table 3-6. Depth of coverage for the genomic contigs that have been mapped to the HGT gene ESTs identified in Phelipanche aegyptiaca (syn. Orobanche aegyptiaca). Average length of the Illumina reads is 101bp.

| Genomic contig ID | Corresponding EST ID (OrAeBC) | Contig length (bp) | Number of genomic reads being mapped | Depth of coverage |
|---|---|---|---|---|
| contig_1029701 | 270.1 | 354 | 20 | 5.650 |
| contig_1103954 | 270.1 | 451 | 27 | 5.987 |
| contig_216847 | 270.1 | 792 | 65 | 8.207 |
| contig_456028 | 270.1 | 301 | 27 | 8.970 |
| contig_475928 | 270.1 | 348 | 25 | 7.184 |
| contig_1021293 | 3756.1 | 594 | 50 | 8.418 |
| contig_249396 | 3756.1 | 283 | 14 | 4.947 |
| contig_760936 | 3756.1 | 528 | 32 | 6.061 |
| contig_170955 | 4239.1 | 369 | 25 | 6.775 |
| contig_20667 | 4239.1 | 882 | 86 | 9.751 |
| contig_242757 | 4239.1 | 372 | 30 | 8.065 |
| contig_514754 | 4239.1 | 475 | 43 | 9.053 |
| contig_547941 | 4239.1 | 875 | 101 | 11.543 |
| contig_56970 | 4239.1 | 742 | 79 | 10.647 |
| contig_594899 | 4239.1 | 396 | 30 | 7.576 |
| contig_654491 | 4239.1 | 570 | 39 | 6.842 |
| contig_672795 | 4239.1 | 630 | 48 | 7.620 |
| contig_685108 | 4239.1 | 450 | 22 | 4.889 |
| contig_747901 | 4239.1 | 1062 | 90 | 8.475 |
| contig_323273 | 4284.2 | 352 | 26 | 7.387 |
| contig_342545 | 4284.2 | 391 | 94 | 24.041 |
| contig_824263 | 4284.2 | 768 | 72 | 9.375 |
| contig_870671 | 4284.2 | 224 | 25 | 11.161 |
| contig_1067473 | 6791.1 | 269 | 13 | 4.833 |
| contig_1094683 | 6791.1 | 258 | 15 | 5.814 |
| contig_221991 | 6791.1 | 891 | 172 | 19.304 |
| contig_932051 | 6791.1 | 406 | 44 | 10.837 |
| contig_935609 | 6791.1 | 233 | 14 | 6.009 |
| contig_944884 | 6791.1 | 211 | 13 | 6.161 |
| contig_951675 | 6791.1 | 274 | 17 | 6.204 |
| contig_2472 | 7046.1 | 229 | 13 | 5.677 |
| contig_776905 | 7046.1 | 267 | 23 | 8.614 |
| contig_820264 | 7046.1 | 236 | 14 | 5.932 |
| contig_847264 | 7046.1 | 423 | 23 | 5.437 |
| contig_866690 | 7046.1 | 254 | 10 | 3.937 |
| contig_947606 | 7046.1 | 390 | 23 | 5.897 |
| contig_1020702 | 9142.1 | 200 | 18 | 9 |
| contig_309131 | 9142.1 | 454 | 69 | 15.198 |
| contig_62892 | 9142.1 | 1636 | 221 | 13.509 |
| contig_809507 | 9142.1 | 222 | 36 | 16.216 |
| contig_818912 | 9142.1 | 280 | 31 | 11.071 |
| contig_1031613 | 9731.1 | 297 | 29 | 9.764 |
| contig_804799 | 9731.1 | 523 | 32 | 6.119 |
| contig_921125 | 9731.1 | 304 | 10 | 3.289 |
| contig_990507 | 9731.1 | 450 | 43 | 9.556 |
| contig_1001786 | 9762.1 | 268 | 17 | 6.343 |
| contig_125457 | 9762.1 | 1469 | 139 | 9.462 |
| contig_390491 | 9762.1 | 889 | 84 | 9.449 |
| contig_731729 | 9762.1 | 722 | 71 | 9.834 |
| contig_939666 | 9762.1 | 341 | 37 | 10.850 |
| contig_948174 | 9762.1 | 203 | 8 | 3.941 |
| contig_1062136 | 13694.1 | 341 | 18 | 5.279 |
| contig_409246 | 13694.1 | 214 | 9 | 4.206 |
| contig_445311 | 13694.1 | 384 | 13 | 3.385 |
| contig_479356 | 13694.1 | 577 | 228 | 39.515 |
| contig_957251 | 13694.1 | 277 | 27 | 9.747 |
| contig_1104623 | 15086.1 | 312 | 22 | 7.051 |
| contig_594865 | 15086.1 | 279 | 24 | 8.602 |
| contig_1044392 | 15353.1 | 276 | 15 | 5.435 |
| contig_1063709 | 15353.1 | 233 | 26 | 11.159 |
| contig_971619 | 15353.1 | 313 | 32 | 10.223 |
| contig_1025591 | 15496.1 | 217 | 13 | 5.991 |
| contig_213054 | 15496.1 | 551 | 37 | 6.716 |
| contig_365121 | 15496.1 | 643 | 61 | 9.487 |
| contig_611679 | 15496.1 | 514 | 31 | 6.031 |
| contig_808913 | 15496.1 | 393 | 18 | 4.580 |
| contig_631883 | 16890.17 | 244 | 6 | 2.459 |
| contig_959303 | 16890.17 | 317 | 17 | 5.363 |
| contig_990579 | 16890.17 | 239 | 19 | 7.950 |

Table 3-7. Summary of expression profile for each high confidence HGT transgene.

| Contig ID | Orthogroup | Highly Expressed Stage |
|---|---|---|
| TrVeBC3_31777.1 | 12303 | IntMedtrG, 3G, 41G |
| TrVeBC3_22826.1 | 12577 | IntMedtrG, 3G, 41G |
| StHeBC3_4423.1 | 18774 | 0G, 51G |
| StHeBC3_16619.1 | 14233 | 61G |
| StHeBC3_41710.1 | 13656 | IntSorbiG |
| StHeBC3_2868.1 | 2270 | 0G, 1G, 2G, 4G |
| StHeBC3_21126 | 5896 | 52G |
| StHeBC3_5017 | 10124 | IntSorbiG, 0G, 4G, 51G |
| OrAeBC5_9142.1 | 294 | 51G, 52G, 62G |
| OrAeBC5_15496.1 | 1886 | 62G |
| OrAeBC5_9762.1 | 4067 | 41G, 3G |
| OrAeBC5_15086.1 | 8888 | 3G, 41G, 52G, 62G |
| OrAeBC5_4284.2 | 10050 | 42G |
| OrAeBC5_14056.1 | 10143 | 41G, 42G, 51G, 52G |
| OrAeBC5_3756.1 | 11841 | 3G, 41G, 42G, 51G, 52G |
| StHeBC3_55745.1 | 11841 | 1G |
| OrAeBC5_4239 | 806 | 3G, 42G, 51G, 52G, 62G |
| OrAeBC5_14072.1 | 1685 | 42G, 51G, 61G |
| OrAeBC5_15353 | 1685 | 3G, 41G, 42G, 52G, 62G |
| OrAeBC5_6791 | 1685 | 51G |
| OrAeBC5_9731 | 1685 | 52G, 61G |
| OrAeBC5_7956 | 1685 | 61G, 62G |
| OrAeBC5_270 | 2376 | 41G, 61G |
| StHeBC3_48088.1 | 2376 | 1G |
| OrAeBC5_7046 | 4336 | IntArathG, 3G, 41G |
| OrAeBC5_13694.1 | 4336 | IntArathG, 3G, 41G |
| OrAeBC5_26251.1 | 8235 | 1G |
| OrAeBC5_16890.17 | 9613 | 0G, 41G, 62G |


Figures

[Phylogenetic tree image: Ortho 3861; scale bar 0.1]

Figure 3-1. Phylogeny of orthogroup 3861. The HGT gene identified in Lindenbergia philippensis is LiPhGnB2_6846. HGT gene or clade was labeled with a “(H)” sign on the phylogeny.


[Phylogenetic tree image: Ortho 12303; scale bar 0.1]

Figure 3-2. Phylogeny of orthogroup 12303. The HGT gene identified in Triphysaria versicolor is TrVeBC3_31777.1. HGT gene or clade was labeled with a “(H)” sign on the phylogeny.

[Phylogenetic tree image: Ortho 12577; scale bar 0.1]

Figure 3-3. Phylogeny of orthogroup 12577. The HGT gene identified in Triphysaria versicolor is TrVeBC3_22826.1. HGT gene or clade was labeled with a “(H)” sign on the phylogeny.

[Phylogenetic tree image: Ortho 18774; scale bar 0.01]

Figure 3-4. Phylogeny of orthogroup 18774. The HGT gene identified in Striga hermonthica is StHeBC3_4423.1. HGT gene or clade was labeled with a “(H)” sign on the phylogeny.


Figure 3-5. Phylogeny of orthogroup 14233. The HGT gene identified in Striga hermonthica is StHeBC3_16619.1.

[Phylogenetic tree image: Ortho 13656; scale bar 0.1]

Figure 3-6. Phylogeny of orthogroup 13656. The HGT gene identified in Striga hermonthica is StHeBC3_41710.1. HGT gene or clade was labeled with a “(H)” sign on the phylogeny.

[Phylogenetic tree image: Ortho 2270; scale bar 0.1]

Figure 3-7. Phylogeny of orthogroup 2270. The HGT gene identified in Striga hermonthica is StHeBC3_2868.1. HGT gene or clade was labeled with a “(H)” sign on the phylogeny.

[Phylogenetic tree image: Ortho 5896; scale bar 0.1]

Figure 3-8. Phylogeny of orthogroup 5896. The HGT contigs identified in Striga hermonthica are StHeBC3_21126.1, StHeBC3_21126.2, and StHeBC3_21126.3. These three Striga hermonthica HGT contigs are the same gene with different splice forms. HGT gene or clade was labeled with a “(H)” sign on the phylogeny.

[Phylogenetic tree image: Ortho 10124; scale bar 0.1]

Figure 3-9. Phylogeny of orthogroup 10124. The HGT contigs identified in Striga hermonthica are StHeBC3_5017.3 and StHeBC3_5017.9, which are from the same gene with different splice forms. HGT gene or clade was labeled with a “(H)” sign on the phylogeny.


[Phylogenetic tree image: Ortho 294; scale bar 0.1]

Figure 3-10. Phylogeny of orthogroup 294.

The HGT gene identified in Phelipanche aegyptiaca is OrAeBC5_9142.1. HGT gene or clade was labeled with a “(H)” sign on the phylogeny.

111

[Tree image not reproduced; scale bar = 0.1.]

Figure 3-11. Phylogeny of orthogroup 1886. The HGT gene identified in Phelipanche aegyptiaca is OrAeBC5_15496.1. The HGT gene or clade is labeled “(H)” on the phylogeny.

[Tree image not reproduced; scale bar = 0.1.]

Figure 3-12. Phylogeny of orthogroup 4067. The HGT gene identified in Phelipanche aegyptiaca is OrAeBC5_9762.1. The HGT gene or clade is labeled “(H)” on the phylogeny.

[Tree image not reproduced; scale bar = 0.1.]

Figure 3-13. Phylogeny of orthogroup 8888. The HGT gene identified in Phelipanche aegyptiaca is OrAeBC5_15086.1. The HGT gene or clade is labeled “(H)” on the phylogeny.

[Tree image not reproduced; scale bar = 0.1.]

Figure 3-14. Phylogeny of orthogroup 10050. The HGT gene identified in Phelipanche aegyptiaca is OrAeBC5_4284.2. The HGT gene or clade is labeled “(H)” on the phylogeny.

[Tree image not reproduced; scale bar = 0.1.]

Figure 3-15. Phylogeny of orthogroup 10143. The HGT gene identified in Phelipanche aegyptiaca is OrAeBC5_14056.1. The HGT gene or clade is labeled “(H)” on the phylogeny.

[Tree image not reproduced; scale bar = 0.1.]

Figure 3-16. Phylogeny of orthogroup 11841. The HGT gene identified in Phelipanche aegyptiaca is OrAeBC5_3756.1. A Striga hermonthica gene (StHeBC3_55745.1) also groups with Fragaria vesca, indicating that the HGT event may have occurred before the speciation event separating Phelipanche aegyptiaca and Striga hermonthica. For this orthogroup, the HGT gene was also identified in Orobanche fasciculata from the 1KP database. The HGT gene or clade is labeled “(H)” on the phylogeny.

[Tree image not reproduced; scale bar = 0.1.]

Figure 3-17. Phylogeny of orthogroup 806. The HGT gene identified in Phelipanche aegyptiaca is OrAeBC5_4239. The HGT gene or clade is labeled “(H)” on the phylogeny.

[Tree image not reproduced; scale bar = 0.1.]

Figure 3-18. Phylogeny of orthogroup 1685. The HGT genes identified in Phelipanche aegyptiaca are OrAeBC5_14072.1, OrAeBC5_15353, OrAeBC5_6791, OrAeBC5_9731, and OrAeBC5_7956. The HGT gene or clade is labeled “(H)” on the phylogeny.

[Tree image not reproduced; scale bar = 0.01.]

Figure 3-19. Phylogeny of orthogroup 2376. The HGT genes identified are OrAeBC5_270 in Phelipanche aegyptiaca and StHeBC3_48088 in Striga hermonthica. This indicates that the HGT event likely occurred before the speciation event separating Phelipanche aegyptiaca and Striga hermonthica. The HGT gene or clade is labeled “(H)” on the phylogeny.

[Tree image not reproduced; scale bar = 0.1.]

Figure 3-20. Phylogeny of orthogroup 4336. The HGT genes identified in Phelipanche aegyptiaca are OrAeBC5_7046 and OrAeBC5_13694.1. The HGT gene or clade is labeled “(H)” on the phylogeny.

[Tree image not reproduced; scale bar = 0.1.]

Figure 3-21. Phylogeny of orthogroup 8235. The HGT gene identified in Phelipanche aegyptiaca is OrAeBC5_26251.1. The HGT gene or clade is labeled “(H)” on the phylogeny.

[Tree image not reproduced; scale bar = 0.1.]

Figure 3-22. Phylogeny of orthogroup 9613. The HGT gene identified in Phelipanche aegyptiaca is OrAeBC5_16890.17. The HGT gene or clade is labeled “(H)” on the phylogeny.

Figure 3-23. Expression profile of HGT gene TrVeBC3_31777.1 (orthogroup 12303). Y-axis is the value of FPKM.

Figure 3-24. Expression profile of HGT gene TrVeBC3_22826.1 (orthogroup 12577). Y-axis is the value of FPKM.


Figure 3-25. Expression profile of HGT gene StHeBC3_4423.1 (orthogroup 18774). Y-axis is the value of FPKM.

Figure 3-26. Expression profile of HGT gene StHeBC3_16619.1 (orthogroup 14233). Y-axis is the value of FPKM.


Figure 3-27. Expression profile of HGT gene StHeBC3_41710.1 (orthogroup 13656). Y-axis is the value of FPKM.

Figure 3-28. Expression profile of HGT gene StHeBC3_2868.1 (orthogroup 2270). Y-axis is the value of FPKM.


Figure 3-29. Expression profile of HGT gene StHeBC3_21126 (orthogroup 5896). Y-axis is the value of FPKM.

Figure 3-30. Expression profile of HGT gene StHeBC3_5017 (orthogroup 10124). Y-axis is the value of FPKM.


Figure 3-31. Expression profile of HGT gene OrAeBC5_9142.1 (orthogroup 294). Y-axis is the value of FPKM.

Figure 3-32. Expression profile of HGT gene OrAeBC5_15496.1 (orthogroup 1886). Y-axis is the value of FPKM.


Figure 3-33. Expression profile of HGT gene OrAeBC5_9762.1 (orthogroup 4067). Y-axis is the value of FPKM.

Figure 3-34. Expression profile of HGT gene OrAeBC5_15086.1 (orthogroup 8888). Y-axis is the value of FPKM.


Figure 3-35. Expression profile of HGT gene OrAeBC5_4284.2 (orthogroup 10050). Y-axis is the value of FPKM.

Figure 3-36. Expression profile of HGT gene OrAeBC5_14056.1 (orthogroup 10143). Y-axis is the value of FPKM.


Figure 3-37. Expression profile of HGT gene OrAeBC5_3756.1 (orthogroup 11841). Y-axis is the value of FPKM.

Figure 3-38. Expression profile of HGT gene StHeBC3_55745.1 (orthogroup 11841). Y-axis is the value of FPKM.


Figure 3-39. Expression profile of HGT gene OrAeBC5_4239 (orthogroup 806), including three splicing forms from this gene. Y-axis is the value of FPKM.

Figure 3-40. Expression profile of HGT gene OrAeBC5_14072.1 (orthogroup 1685). Y-axis is the value of FPKM.


Figure 3-41. Expression profile of HGT gene OrAeBC5_15353 (orthogroup 1685), including two splicing forms from this gene. Y-axis is the value of FPKM.

Figure 3-42. Expression profile of HGT gene OrAeBC5_6791 (orthogroup 1685), including four splicing forms from this gene. Y-axis is the value of FPKM.

Figure 3-43. Expression profile of HGT gene OrAeBC5_9731 (orthogroup 1685), including two splicing forms from this gene. Y-axis is the value of FPKM.


Figure 3-44. Expression profile of HGT gene OrAeBC5_7956 (orthogroup 1685), including two splicing forms from this gene. Y-axis is the value of FPKM.


Figure 3-45. Expression profile of HGT gene OrAeBC5_270 (orthogroup 2376), including four splicing forms from this gene. Y-axis is the value of FPKM.

Figure 3-46. Expression profile of HGT gene StHeBC3_48088.1 (orthogroup 2376), including four splicing forms from this gene. Y-axis is the value of FPKM.

Figure 3-47. Expression profile of HGT gene OrAeBC5_7046 (orthogroup 4336). Y-axis is the value of FPKM.


Figure 3-48. Expression profile of HGT gene OrAeBC5_13694.1 (orthogroup 4336). Y-axis is the value of FPKM.

Figure 3-49. Expression profile of HGT gene OrAeBC5_26251.1 (orthogroup 8235). Y-axis is the value of FPKM.


Figure 3-50. Expression profile of HGT gene OrAeBC5_16890.17 (orthogroup 9613). Y-axis is the value of FPKM.

Methods

Sequencing Details

One Lindenbergia philippensis whole-plant normalized library, eleven Triphysaria versicolor stage-specific libraries and one whole-plant normalized library, nine Striga hermonthica stage-specific libraries and one whole-plant normalized library, and ten Phelipanche aegyptiaca stage-specific libraries and one whole-plant normalized library, all paired-end (PE) cDNA libraries, were used to generate 1,995,494,710 reads for the project [54]. Transcriptome sequencing (76-bp and 83-bp PE) for each library was performed on the Illumina GAIIx by the Genomics Core facility at the University of Virginia in Charlottesville, VA, USA; the Genomics Core facility at Michigan State University in East Lansing, MI, USA; the Stephan Schuster lab at the Pennsylvania State University in University Park, PA, USA; and the Genomic Core facility at the University of California in Davis, CA, USA. Additional sequencing was performed on the Roche 454 GS-FLX by the Stephan Schuster lab at the Pennsylvania State University for two Triphysaria, four Striga, and six Phelipanche stage-specific libraries, generating 3,153,353 single-fragment reads for the project [54]. All sequencing data generated in this study were deposited in the NCBI Sequence Read Archive (Study Accession Number SRP001053) and on the Parasitic Plant Genome Project web portal (http://ppgp.huck.psu.edu).

Data process pipelines

Read cleaning and quality control

Raw Roche 454 sequence files in Standard Flowgram Format (SFF) were converted to FASTA and associated quality files, with sequencing adapters and low-quality bases clipped, using sff_extract version 0.2.10 (http://bioinf.comav.upv.es/sff_extract/). Using CLC Assembly Cell version 3.2.0 (http://www.clcbio.com/products/clc-assembly-cell/), duplicate reads introduced into the Illumina read data during the PCR amplification step of library preparation were removed, and adapters and low-quality bases (

De novo assembly and post-processing

De novo assembly of Illumina reads and de novo hybrid assembly of 454 and Illumina reads for the combined data sets of each species were performed using Trinity release 2011-10-29 [171] and CLC Assembly Cell version 3.2.0, respectively, with default parameters. Using BLASTN (E-value = 1e-10), assembled transcripts from both assemblies were combined by assigning hybrid CLC transcripts to the Trinity components (putative loci) yielding the best bitscore, and the resulting combined species assemblies were filtered to remove redundancy [172] and contigs without coding regions [173]. Assemblies for the parasite species (Triphysaria, Striga, and Phelipanche) were then cleaned of contaminant sequences in three steps. First, transcripts were screened against the NCBI non-redundant protein database using BLASTX (E-value = 1e-5) to remove non-plant transcripts. Second, host transcripts were removed using BLASTN (E-value = 1e-10) against a collection of publicly available host genome and EST data sets. Third, the host candidate sequences were searched with BLASTN (E-value = 1e-10) against databases of Orobanchaceae species (Lindenbergia, Triphysaria, Striga, and Phelipanche, excluding the parasite species being cleaned) to retrieve transcripts that were better matches to Orobanchaceae than to the host plant. Since some downstream analyses use transcript translations, ORFs and protein sequences encoded by the reconstructed transcripts were predicted with ESTScan version 2.0 [173]. Assembly results are shown in Supplemental Table 3-5.
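The three-step contaminant screen described above can be summarized as a simple decision rule. The following Python sketch is illustrative only: the function name, its arguments, and the use of bitscore to decide which match is "better" are assumptions made for this example and are not taken from the pipeline itself.

```python
from typing import Optional


def keep_transcript(best_nr_hit_is_plant: bool,
                    host_bitscore: Optional[float],
                    orobanchaceae_bitscore: Optional[float]) -> bool:
    """Decide whether a parasite transcript survives contaminant screening.

    Hypothetical inputs (assumptions for this sketch):
      best_nr_hit_is_plant   -- True if the best BLASTX hit in NR is a plant protein
      host_bitscore          -- best BLASTN bitscore against host data, or None if no hit
      orobanchaceae_bitscore -- best BLASTN bitscore against other Orobanchaceae, or None
    """
    # Step 1: discard transcripts whose best protein hit is not from a plant.
    if not best_nr_hit_is_plant:
        return False
    # Step 2: transcripts with no host hit are not host candidates; keep them.
    if host_bitscore is None:
        return True
    # Step 3: a host candidate is rescued only if it matches Orobanchaceae better.
    if orobanchaceae_bitscore is None:
        return False
    return orobanchaceae_bitscore > host_bitscore
```

Under these assumptions, a transcript with a host bitscore of 200 is retained only when its best Orobanchaceae bitscore exceeds 200.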

Constructing a global gene family classification

A total of 586,228 protein-coding genes from 22 representative sequenced land plant genomes were classified into 53,136 orthogroups using OrthoMCL. The selected taxa include nine rosids (Arabidopsis thaliana, Thellungiella parvula, Carica papaya, Theobroma cacao, Populus trichocarpa, Fragaria vesca, Glycine max, Medicago truncatula, Vitis vinifera), three asterids (Solanum lycopersicum, Solanum tuberosum, Mimulus guttatus), two basal eudicots (Nelumbo nucifera, Aquilegia coerulea), five monocots (Oryza sativa, Brachypodium distachyon, Sorghum bicolor, Musa acuminata, Phoenix dactylifera), one basal angiosperm (Amborella trichopoda), one lycophyte (Selaginella moellendorffii), and one moss (Physcomitrella patens). Candidate orthogroups for Lindenbergia, Triphysaria, Striga, Phelipanche, and two Asteraceae species, Lactuca sativa and Helianthus annuus, were identified by retaining BLASTP [133] hits with E-value <= 1e-5 from searches of translated transcripts against the proteomes of the 22 plant genomes that were classified into orthogroups. HMM [174] searches of the translated transcripts were then performed against HMM profiles constructed for the candidate orthogroups, and each transcript was assigned to the orthogroup yielding the best bitscore. A total of 16,538 orthogroups were identified as containing at least one parasite transcript. Amino acid alignments of these orthogroups (with translated transcripts included) were generated with MAFFT [175], and the corresponding DNA sequences were forced onto the amino acid alignments using a custom Perl script. DNA alignments were then trimmed with trimAl [176] to remove sites represented in less than 10% of the taxa. Transcripts in orthogroup alignments were required to have an alignment coverage of at least 50%; transcripts failing this threshold were removed from the orthogroup amino acid and DNA FASTA sequence files, and the alignments were regenerated. Finally, maximum likelihood (ML) phylogenetic trees were generated from the DNA alignments of orthogroups containing parasite sequences using RAxML [153] with the GTRGAMMA model and 100 bootstrap replicates. Because very large transcriptome data sets are complex (containing alternative and incompletely spliced transcripts), it was also necessary to develop a scheme to select the single best transcript representing each locus (a proxy for a gene) in an orthogroup to improve the phylogenetic inference. Another iteration of alignments and phylogenetic trees was therefore conducted that included only the locus-representative transcripts with the highest alignment coverage. This resulted in 13,245 orthogroup phylogenetic trees with all classified transcripts and 13,125 with representative transcripts only that passed the alignment-filtering threshold.
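The coverage filter and the choice of one representative transcript per locus can be sketched as follows. This is a minimal illustration, not the script used in this study: the function names are hypothetical, and alignment coverage is assumed here to mean the fraction of alignment columns in which a sequence has a residue rather than a gap.

```python
from typing import Dict


def alignment_coverage(aligned_seq: str) -> float:
    """Fraction of alignment columns occupied by a residue (assumed definition)."""
    if not aligned_seq:
        return 0.0
    non_gap = sum(1 for ch in aligned_seq if ch not in "-.")
    return non_gap / len(aligned_seq)


def pick_representatives(aligned: Dict[str, str],
                         locus_of: Dict[str, str],
                         min_coverage: float = 0.5) -> Dict[str, str]:
    """Return {locus: transcript_id}, keeping the highest-coverage transcript.

    aligned  -- transcript ID -> aligned sequence (gaps as '-' or '.')
    locus_of -- transcript ID -> locus (e.g. a Trinity component); hypothetical mapping
    Transcripts below min_coverage are dropped before the comparison.
    """
    best = {}  # locus -> (transcript_id, coverage)
    for tid, seq in aligned.items():
        cov = alignment_coverage(seq)
        if cov < min_coverage:
            continue  # fails the 50% alignment-coverage requirement
        locus = locus_of[tid]
        if locus not in best or cov > best[locus][1]:
            best[locus] = (tid, cov)
    return {locus: tid for locus, (tid, _) in best.items()}
```

Ties are resolved in favor of the transcript encountered first, which is an arbitrary choice made for this sketch.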


Functional Annotation

The translations of assembled transcript sequences were used for BLASTP (E-value = 1e-5) searches against the Swiss-Prot, TAIR10, and TrEMBL databases to assign putative functional annotations in the form of human-readable descriptions using the Automated Assignment of Human Readable Descriptions (AHRD) pipeline (https://github.com/groupschoof/AHRD). AHRD uses similarity searches and lexical analysis to assign human-readable descriptions to protein sequences automatically. Translated transcripts were also annotated with Pfam domains using InterProScan version 4.8 [177], and the identified domains were translated directly into Gene Ontology terms.

Expression Profile Mapping in Transcriptomic Data

High-quality, non-redundant Illumina reads from individual stage-specific samples were independently mapped onto each parasite’s de novo assembled and post-processed transcripts using CLC Genomics Workbench version 6.0.4 (parameters: mismatch cost = 2, insertion cost = 3, deletion cost = 3, length fraction = 0.5, similarity = 0.8, min insert size = 100, and max insert size = 300). Transcript abundance was then estimated using the CLC Genomics Workbench RNA-Seq program, with unique reads counted toward matching transcripts and non-specifically mapped reads allocated proportionally to the number of uniquely mapped reads. The numbers of reads mapped per library were normalized by the Fragments Per Kilobase per Million mapped reads (FPKM) method, which corrects for biases in total transcript size and normalizes for the total number of read sequences obtained in each sample library. The resulting expression values were subsequently used to estimate orthogroup-locus expression as a proxy for gene expression: read counts and FPKM values of the transcripts for each Trinity component classified in an orthogroup were summed to obtain orthogroup-locus expression.
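For reference, the FPKM normalization and the summation to orthogroup-locus expression can be written as a short Python sketch. It uses the standard FPKM definition (fragments divided by transcript length in kilobases and by total mapped fragments in millions). The function and argument names are hypothetical, and the sketch does not reproduce the CLC Genomics Workbench implementation.

```python
from typing import Dict


def fpkm(fragments: float, transcript_length_bp: int,
         total_mapped_fragments: int) -> float:
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # fragments / (length / 1e3) / (total / 1e6) == fragments * 1e9 / (length * total)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def orthogroup_locus_expression(transcript_fpkm: Dict[str, float],
                                component_of: Dict[str, str]) -> Dict[str, float]:
    """Sum transcript-level FPKM values per Trinity component (locus proxy).

    component_of maps transcript ID -> component ID; this mapping is an
    assumption of the sketch and would come from the assembly in practice.
    """
    totals: Dict[str, float] = {}
    for tid, value in transcript_fpkm.items():
        comp = component_of[tid]
        totals[comp] = totals.get(comp, 0.0) + value
    return totals
```

For example, 500 fragments mapped to a 2,000-bp transcript in a library of 10 million mapped fragments give an FPKM of 25.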

HGT Screening

Customized Java scripts were developed by the author to automatically screen for incongruent phylogenies. Closely related species were used as negative controls. Lindenbergia philippensis and Mimulus guttatus were used as negative controls for screening the three parasites. For screening Lindenbergia philippensis, Mimulus guttatus, Solanum lycopersicum, and Solanum tuberosum were used as negative controls. A bootstrap cutoff of 50% was used in the scripts. Scripts are available upon request. Please send inquiries to [email protected].


After the automated screening, the output HGT candidates were manually screened at four general levels. Level 1 checks for an even distribution of taxon sampling in the Asteridae group. Level 2 checks for an even distribution of taxon sampling in the remaining lineages, especially the lineage in which the HGT gene resides. Level 3 checks that the gene tree topology is generally consistent with the species tree plus gene duplication and loss processes; some gene trees are so poorly resolved, or so conflicted relative to the species tree, that no meaningful conclusions can be drawn about potential HGT events. Level 4 checks for general bootstrap support. Low-confidence HGT genes failed manual screening at level 1. Medium-confidence HGT genes passed level 1 but failed at one of the next three levels. High-confidence genes passed all four levels.
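The confidence categories can be expressed as a small classification function. The sketch below is illustrative: the function name and the boolean encoding of each level are assumptions, and "medium" is interpreted as passing level 1 while failing at least one of levels 2 to 4.

```python
from typing import Sequence


def hgt_confidence(passed_levels: Sequence[bool]) -> str:
    """Classify an HGT candidate from its four manual-screening outcomes.

    passed_levels -- four booleans for levels 1..4, True if the level was passed.
    """
    if len(passed_levels) != 4:
        raise ValueError("expected outcomes for exactly four screening levels")
    if not passed_levels[0]:
        return "low"      # failed level 1 (taxon sampling in Asteridae)
    if all(passed_levels):
        return "high"     # passed all four levels
    return "medium"       # passed level 1, failed a later level (assumed reading)
```

A candidate that passes levels 1 to 3 but has weak bootstrap support at level 4 would therefore be reported as medium confidence under this reading.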

For high-confidence HGT candidates, the 1KP database [147] was also searched with BLASTP [133] for homologs in closely related species. In addition, each high-confidence candidate was checked against the following list of potential artifacts.

1. Does the current tree topology generally conform to the commonly accepted species tree, except for the HGT genes?

2. Check the quality of the alignment (contigs must cover >= 0.5 of the alignment).

3. Is the bootstrap support greater than 50% on the critical branches among closely related taxa?

4. Check the super-ortho trees if needed.

5. Check for possible contamination.

6. Rule out phylogenetic artifacts caused by nucleotide composition by checking the amino acid tree.

7. Could the HGT candidate tree instead be explained by a duplicated gene family with losses in a number of taxa, or by insufficient taxon sampling?

8. Is the branch leading to the HGT gene abnormally long, and if so, what is the cause?
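Item 2 of the checklist lends itself to a simple programmatic check. The following Python sketch assumes that "alignment coverage" means the fraction of alignment columns in which a contig has a residue rather than a gap; that definition, the gap characters, and the function names are assumptions made for illustration, not the criteria implemented in the original scripts.

```python
def alignment_coverage(aligned_seq, gap_chars="-?"):
    """Fraction of alignment columns occupied by a residue (assumed definition)."""
    if not aligned_seq:
        return 0.0
    covered = sum(1 for ch in aligned_seq if ch not in gap_chars)
    return covered / len(aligned_seq)


def passes_coverage(aligned_seq, minimum=0.5):
    """True if the aligned contig covers at least `minimum` of the alignment."""
    return alignment_coverage(aligned_seq) >= minimum
```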

Genomic Data Assembly

Genomic data (Illumina data for Striga hermonthica and Phelipanche aegyptiaca) were assembled de novo using CLC Assembly Cell v4.1 (http://www.clcbio.com/products/clc-assembly-cell/) with the following command:

    novo_assemble -o contigs.fasta -p fb ss 180 250 -q -i reads1.fq reads2.fq

Raw reads were mapped back onto the de novo assembled contigs using the clc_ref_assemble tool with the following command:

    clc_ref_assemble -d genomic_assembly.fasta -o genomic_assembly.fasta.ref_assemble -q -i R1.fastq R2.fastq -p fb ss 180 300 -t 1000 --cpus 40

The castosam tool was used to convert the read-mapping output file to SAM or BAM format, and Bedtools [160] was used to calculate depth of coverage.
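As an illustration of the depth-of-coverage step, the Python sketch below averages per-base depth for each contig. It assumes a three-column, tab-separated per-base depth listing (contig, position, depth), such as the output of `bedtools genomecov -d`; the text does not state which Bedtools subcommand was used, so the input format and the function name are assumptions.

```python
from collections import defaultdict


def mean_depth_per_contig(lines):
    """Average per-base depth for each contig.

    lines: iterable of 'contig<TAB>position<TAB>depth' strings (assumed format).
    """
    depth_sum = defaultdict(int)
    n_bases = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        contig, _position, depth = line.split("\t")
        depth_sum[contig] += int(depth)
        n_bases[contig] += 1
    return {contig: depth_sum[contig] / n_bases[contig] for contig in depth_sum}
```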


Genomic Contig Mapping onto EST data

GMAP [169] was used to map genomic contigs onto the EST data. BLASTN [133] (E-value <= 1e-10) was also used to confirm the GMAP mapping results. Results were confirmed with both methods.
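The cross-check between the two methods can be sketched as a set intersection. The Python example below assumes BLAST tabular output (`-outfmt 6`, in which the E-value is the eleventh column) and assumes that the GMAP mappings have already been reduced to (contig, EST) pairs; both function names are hypothetical, and no GMAP output parsing is attempted.

```python
def blast_pairs(lines, max_evalue=1e-10):
    """Collect (query, subject) pairs from BLAST tabular output that pass the E-value cutoff."""
    pairs = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            pairs.add((query, subject))
    return pairs


def confirmed_by_both(gmap_pairs, blast_hit_pairs):
    """Keep only the mappings reported by both GMAP and BLASTN."""
    return set(gmap_pairs) & set(blast_hit_pairs)
```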

Constraint Analyses

The codeml program in PAML (version 4.6) [168] was used to analyze dN/dS ratios. A sample control file for the M0 model is listed here.

    seqfile = 806.fasta.fna.aln.trim.phylip
    treefile = 806_bootstrapRemoved
    outfile = M0_result
    noisy = 3
    verbose = 0
    runmode = 0
    seqtype = 1        * 1:codons; 2:AAs; 3:codons-->AAs
    CodonFreq = 2      * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
    clock = 0          * 0:no clock, 1:global clock; 2:local clock; 3:TipDate
    aaDist = 0         * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a
    aaRatefile = /../wag.dat
    model = 0          * models for codons:
                       * 0:one, 1:b, 2:2 or more dN/dS ratios for branches
    NSsites = 0        * 0:one w;1:neutral;2:selection; 3:discrete;4:freqs;
                       * 5:gamma;6:2gamma;7:beta;8:beta&w;9:beta&gamma;
                       * 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1;
                       * 13:3normal>0
    icode = 0          * 0:universal code; 1:mammalian mt; 2-10:see below
    fix_kappa = 0      * 1: kappa fixed, 0: kappa to be estimated
    kappa = 3          * initial or fixed kappa
    fix_omega = 0      * 1: omega or omega_1 fixed, 0: estimate
    omega = 1          * initial or fixed omega, for codons or codon-based AAs
    fix_alpha = 1      * 0: estimate gamma shape parameter; 1: fix it at alpha
    alpha = 0          * initial or fixed alpha, 0:infinity (constant rate)
    Malpha = 0         * different alphas for genes
    ncatG = 10         * # of categories in dG of NSsites models
    getSE = 0          * 0: don't want them, 1: want S.E.s of estimates
    RateAncestor = 0   * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)
    Small_Diff = .5e-6
    * cleandata = 1
    * method = 1       * 0: simultaneous; 1: one branch at a time


Supplemental Materials

[Figure: phylogenetic tree of orthogroup 2270 with bootstrap support values on branches; scale bar = 0.1.]

Supplemental Figure 3-1. Phylogeny of orthologous gene group 2270. Additional homologous genes were identified in Phelipanche aegyptiaca (OrAeBC5_3356 and OrAeBC5_51488.1).


Supplemental Figure 3-2. Expression profile of additional Phelipanche aegyptiaca HGT gene, OrAeBC5_3356, including seven alternative splicing forms. Y-axis shows the value of FPKM.


Supplemental Figure 3-3. dN/dS ratios on background lineages and HGT parasite branch for each orthogroup having HGT genes identified in Striga hermonthica.

Supplemental Figure 3-4. dN/dS ratios on background lineages and HGT parasite branch for each orthogroup having HGT genes identified in Phelipanche aegyptiaca.


Supplemental Figure 3-5. Expression profile of laterally transferred gene from Phelipanche aegyptiaca in orthogroup 4336. Y-axis shows the value of FPKM.


[Figure: gene tree of orthogroup 11841 annotated with dN/dS values on branches and terminal taxa; scale bar = 2.0.]

Supplemental Figure 3-6. Constraint analysis results for orthogroup 11841. dN/dS values are shown on each branch and terminal taxon. Asterid branches are in green, and HGT genes in Orobanchaceae are colored dark green.


[Figure: gene tree of orthogroup 4336 annotated with dN/dS values on branches and terminal taxa; scale bar = 2.0.]

Supplemental Figure 3-7. Constraint analysis results for orthogroup 4336. dN/dS values are shown on each branch and terminal taxon. Asterid branches are in green, and HGT genes in Orobanchaceae are colored dark green.


[Figure: gene tree of orthogroup 14624, including the Striga hermonthica sequence StHeGnB1_80049 (H) among grass sequences, with bootstrap support values on branches; scale bar = 0.1.]

Supplemental Figure 3-8. Incongruent phylogeny for orthogroup 14624.


Supplemental Figure 3-9. NCBI Blast result summary of StHeGnB1_80049.

Supplemental Figure 3-10. NCBI Blast result summary of StHeBC3_16619.1.


Supplemental Figure 3-11. Gene locations of StHeGnB1_80049 and StHeBC3_16619.1 on the same genomic contig 136486.


Supplemental Figure 3-12. Conservation between intergenic region (StHeGnB1_80049 and StHeBC3_16619) and Sorghum bicolor Genome.


Supplemental Figure 3-13. Expression pattern for StHeBC3_16619.1 and its homolog in rice (LOC_Os01g08650.1).


Supplemental Table 3-1. Expression Profile for HGT genes identified in Triphysaria versicolor. Expression was calculated using FPKM (Fragments Per Kilobase per Million mapped reads).

Contig ID  Interface  Interface  Interface  0G      1G      2G      3G      3G2     41G     61G     61Gu    62G     62Gu    63G     63Gu
(TrVeBC3)  MedtrG1    MedtrG2    Zeama
31777.1    8.831      10.738     0          0       0       0       2.194   2.725   6.071   0       0       0       0       0       0.133
22826.1    0.953      8.261      0          0       0       0.03    2.343   1.704   12.268  0       0       0.09    0       0       0


Supplemental Table 3-2. Expression Profile for HGT genes identified in Striga hermonthica. Expression was calculated using FPKM (Fragments Per Kilobase per Million mapped reads).

Contig ID  Interface  0G      1G      1G2     1G3     2G      2G2     3G      4G      51G     51G2    51G3    52G     52G2    61G     62G
(StHeBC3)  SorbiG
41710.1    238.835    0.19    0       0       0       0       0       1.376   2.269   0.44    0.757   0.349   1.084   1.991   0       0
16619.1    1.541      12.238  4.351   2.823   2.721   5.537   5.783   6.357   6.128   4.57    4.278   7.40    4.12    6.79    26.7    9.34
4423.1     16.737     71.359  17.759  9.505   9.248   22.14   23.355  18.887  29.578  74.4    63.87   77.9    26.72   29.5    0.07    26.7
2868.1     25.5       54.18   61.73   46.78   39.79   55.20   49.12   40.73   64.73   20.4    20.41   17.82   46.26   40.16   20.3    22.3
21126.1    0.22       0.57    0.34    0.29    0       0.32    0.31    0.21    0.29    0.21    0.21    0.08    1.03    0.76    0.64    0.17
21126.2    0.12       0.32    0.50    0.49    0.58    0.62    0.41    0.74    1.18    0.35    0.39    0.42    1.6     3.6     2.14    0.02
21126.3    1.28       1.19    1.52    1.64    2.04    1.11    0.52    1.51    2.26    1.38    0.9     1.59    4.69    2.397   2.44    0.93
5017.3     5.04       6.89    9.06    7.17    5.90    6.10    2.88    6.44    11.54   10.8    5.66    8.91    7.06    5.42    1.15    5.55
5017.9     9.05       6.19    4.23    4.40    4.02    3.21    2.24    2.91    6.35    4.85    2.68    2.50    4.01    1.77    1.44    2.10


Supplemental Table 3-3. Expression Profile for HGT genes identified in Phelipanche aegyptiaca. Expression was calculated using FPKM (Fragments Per Kilobase per Million mapped reads).

ContigID   Interface  0G      1G      2G      3G      41G     41G2    42G     51G     52G     61G     62G
(OrAeBC5)  Arath
3756.1     13.30      7.71    19.13   9.02    116.28  66.22   110.97  108.63  173.53  102.62  72.24   39.62
14056.1    0.39       1.62    1.36    1.25    3.20    3.74    1.90    5.85    3.99    4.14    2.21    2.76
4284.2     0.16       0.78    1.57    2.58    2.63    2.90    3.36    7.56    1.37    3.29    0.49    3.60
15086.1    0.30       4.62    3.89    4.86    5.71    2.40    8.03    6.40    4.95    6.33    3.05    6.70
9762.1     0.86       1.66    1.25    1.69    4.09    1.77    6.60    3.09    1.66    2.39    1.16    1.88
15496.1    0          0       0.33    0.02    0.08    0.17    0.13    0.20    0.06    0       2.51    17.35
9142.1     2.93       10.68   9.12    13.32   20.98   17.43   21.81   18.71   28.03   28.42   19.72   27.92
4239.1     1.328      5.078   4.352   3.856   24.22   19.53   15.51   37.12   30.74   26.63   18.40   30.38
4239.2     1.223      0.284   0.1     0.037   0.60    0.33    0.86    0.16    0.16    0.39    0.28    0.3
4239.3     4.988      5.757   2.695   6.33    2.05    5.28    24.70   4.9     4.96    4.51    1.37    12.69
15353.1    0.34       0.19    1.36    0.66    2.46    0.67    5.22    4.67    1.48    0.83    0.61    0.44
15353.2    1.23       1.62    1.43    1.29    6.33    2.42    3.04    1.91    2.86    4.81    2.48    4.54
6791.1     1.41       0.55    0.67    1.41    3.78    3.90    4.77    6.68    9.62    8.32    7.42    6.53
6791.2     1.33       0.52    0.31    1.37    3.59    3.95    4.31    6.60    10.75   8.23    7.78    7.36
6791.3     1.48       0.29    0.60    1.19    3.43    4.21    5.83    6.91    11.28   10.02   8.36    7.99
6791.4     1.71       0.64    0.33    1.52    4.22    4.56    5.47    8.66    11.99   10.95   9.17    7.7
9731.1     0.06       2.19    8.88    6.54    5.13    4.53    9.49    14.91   14.94   18.61   17.36   15.19
9731.2     1.65       5.45    4.74    5.01    2.12    7.9     5.57    13.81   14.30   19.29   22.99   11.66
14072.1    1.19       4.00    1.92    4.04    8.71    6.81    7.49    14.76   12.92   9.10    12.43   6.88
7956.1     2.65       2.65    9.91    5.52    8.97    7.54    10.88   20.99   20.48   20.0    25.40   27.02
7956.2     0.57       0.42    0.44    0.12    1.61    1.07    1.41    1.55    3.00    1.76    1.22    1.62
270.1      17.18      2.98    6.69    4.40    21.26   43.68   16.04   22.32   33.73   25.39   36.77   17.07
270.2      18.04      4.15    6.68    4.99    20.71   8.45    17.57   8.36    7.51    10.86   5.54    7.63
270.3      6.24       7.08    12.22   8.25    37      129.39  28.93   63.54   98.15   66.54   124.52  47.31
270.4      7.88       6.64    12.12   10.95   31.39   15.83   33.10   16.93   21.35   15.73   12.52   16.94
7046.1     32.76      16.10   15.98   16.38   11.86   14.66   30.26   12.32   7.59    17.33   4.078   9.84
7046.2     8.45       1.95    4.02    1.43    33.48   1.98    9.28    5.85    4.01    6.39    4.33    3.91
13694.1    14.63      3.03    7.85    6.99    21.58   6.83    16.45   6.50    5.7     4.63    4.65    11.67
26251.1    0          0.24    12.96   0       0.12    0.09    0.029   0       0.055   0       0       0
16890.17   0.53       4.39    1.1     2.46    1.49    1.47    2.91    3.19    0.43    2.39    2.23    3.13


Supplemental Table 3-4. SH (Shimodaira-Hasegawa) test results for the high confidence HGT orthogroups. D(LH) means the difference of likelihood score between the reference tree (original tree) and the tested tree. D(LH) score smaller than zero means that the likelihood of tested tree topology is less than the likelihood of the reference tree topology, indicating that the reference tree topology is preferred.

Orthogroup  D(LH)    SD     significant  significant  significant
ID                          at 5%        at 2%        at 1%
3861        -243.57  23.54  yes          yes          yes
12303       -142.16  16.80  yes          yes          yes
12577       -57.63   14.39  yes          yes          yes
2270        -281.64  26.80  yes          yes          yes
5896        -386.37  27.55  yes          yes          yes
13656       -44.88   10.83  yes          yes          yes
14233       -162.72  19.32  yes          yes          yes
18744       -10.66   5.02   yes          no           no
10124       -64.11   14.95  yes          yes          yes
294         -54.16   11.89  yes          yes          yes
1886        -46.55   9.76   yes          yes          yes
4067        -326.71  29.02  yes          yes          yes
8888        -191.63  23.43  yes          yes          yes
10050       -363.77  29.01  yes          yes          yes
10143       -576.19  37.52  yes          yes          yes
11841       -434.47  31.27  yes          yes          yes
806         -700.25  42.03  yes          yes          yes
1685        -203.60  22.55  yes          yes          yes
2376        -241.10  18.02  yes          yes          yes
4336        -97.81   17.49  yes          yes          yes
8235        -12.44   10.14  no           no           no
9613        -58.87   13.32  yes          yes          yes

Supplemental Table 3-5. Assembly Results of PPGP Data. PRI = Primary Assembly, PP = Post-Processed Assembly, and ORF = Predicted Open Reading Frames.


Appendix A

List of Publications

Evolution of a horizontally acquired legume gene, albumin 1, in the parasitic plant Phelipanche aegyptiaca and related species Zhang Y, Fernandez-Aparicio M, Wafula EK, Das M, Jiao Y, Wickett NJ, Honaas LA, Ralph PE, Wojciechowski MF, Timko MP, Yoder JI, Westwood JH, dePamphilis CW. BMC Evol Biol. 2013 Feb 20;13:48. doi: 10.1186/1471-2148-13-48.

Phylogenomic analysis of transcriptome data elucidates co-occurrence of a paleopolyploid event and the origin of bimodal karyotypes in Agavoideae (Asparagaceae). McKain MR, Wickett N, Zhang Y, Ayyampalayam S, McCombie WR, Chase MW, Pires JC, dePamphilis CW, Leebens-Mack J. Am J Bot. 2012 Feb;99(2):397-406. doi: 10.3732/ajb.1100537. Epub 2012 Feb 1.

A genome triplication associated with early diversification of the core eudicots. Jiao Y, Leebens-Mack J, Ayyampalayam S, Bowers JE, McKain MR, McNeal J, Rolf M, Ruzicka DR, Wafula E, Wickett NJ, Wu X, Zhang Y, Wang J, Zhang Y, Carpenter EJ, Deyholos MK, Kutchan TM, Chanderbali AS, Soltis PS, Stevenson DW, McCombie R, Pires JC, Wong GK, Soltis DE, dePamphilis CW. Genome Biology 2012, 13:R3 doi:10.1186/gb-2012-13-1-r3

The mitochondrial genome sequence of the Tasmanian tiger (Thylacinus cynocephalus). Miller W, Drautz DI, Janecka JE, Lesk AM, Ratan A, Tomsho LP, Packard M, Zhang Y, McClellan LR, Qi J, Zhao F, Gilbert MT, Dalén L, Arsuaga JL, Ericson PG, Huson DH, Helgen KM, Murphy WJ, Götherström A, Schuster SC. Genome Res. 2009 Feb;19(2):213-20. doi: 10.1101/gr.082628.108. Epub 2009 Jan 12.


Appendix B

A genome triplication associated with early diversification of the core eudicots

Yuannian Jiao1,2, Jim Leebens-Mack3, Saravanaraj Ayyampalayam3, John E Bowers3, Michael R McKain3, Joel McNeal3,4, Megan Rolf5, Daniel R Ruzicka5, Eric Wafula2, Norman J Wickett2,6, Xiaolei Wu7, Yong Zhang7, Jun Wang7,8, Yeting Zhang2,9, Eric J Carpenter10, Michael K Deyholos10, Toni M Kutchan5, Andre S Chanderbali11,12, Pamela S Soltis11, Dennis W Stevenson13, Richard McCombie14, J. Chris Pires15, Gane Ka-Shu Wong7,16, Douglas E Soltis12 and Claude W dePamphilis1,2,*

1Intercollege Graduate Degree Program in Plant Biology, The Pennsylvania State University, University Park, PA 16802, USA

2Department of Biology, Institute of Molecular Evolutionary Genetics, Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA

3Department of Plant Biology, University of Georgia, Athens, GA 30602, USA

4Department of Biology and Physics, Kennesaw State University, Kennesaw, GA 30144, USA

5Donald Danforth Plant Science Center, 975 North Warson Road, St Louis, MO 63132, USA

6Division of Plant Science and Conservation, Chicago Botanic Garden, Glencoe, IL 60022, USA

7Beijing Genomics Institute-Shenzhen, Bei Shan Industrial Zone, Yantian District, Shenzhen 518083, China

8The Novo Nordisk Foundation Center for Basic Metabolic Research, Department of Biology, University of Copenhagen, Store Kannikestræde 11, 1169 København K, Denmark

9Intercollege Graduate Degree Program in Genetics, The Pennsylvania State University, University Park, PA 16802, USA

10Department of Biological Sciences, University of Alberta, Edmonton, Alberta T6G 2E9, Canada

11Florida Museum of Natural History, University of Florida, Gainesville, FL 32611, USA

12Department of Biology, University of Florida, Gainesville, FL 32611, USA

13New York Botanical Garden, Bronx, New York, NY 10458, USA

14Genome Research Center, Cold Spring Harbor Laboratory, 500 Sunnyside Blvd, Woodbury, NY 11797, USA

15Division of Biological Sciences, University of Missouri, Columbia, MO 65211, USA

16Departments of Biological Sciences and Medicine, Department of Biological Sciences, University of Alberta, Edmonton AB, T6G 2E9, Canada

*Correspondence: Claude W dePamphilis. Email: [email protected]

Received: 3 November 2011

Accepted: 26 January 2012

Published: 26 January 2012


© 2012 Jiao et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Although it is agreed that a major polyploidy event, gamma, occurred within the eudicots, the phylogenetic placement of the event remains unclear.

Results: To determine when this polyploidization occurred relative to speciation events in angiosperm history, we employed a phylogenomic approach to investigate the timing of gene set duplications located on syntenic gamma blocks. We populated 769 putative gene families with large sets of homologs obtained from public transcriptomes of basal angiosperms, , asterids, and more than 91.8 gigabases of new next-generation transcriptome sequences of non-grass monocots and basal eudicots. The overwhelming majority (95%) of well-resolved gamma duplications was placed before the separation of rosids and asterids and after the split of monocots and eudicots, providing strong evidence that the gamma polyploidy event occurred early in eudicot evolution. Further, the majority of gene duplications was placed after the divergence of the Ranunculales and core eudicots, indicating that the gamma appears to be restricted to core eudicots. Molecular dating estimates indicate that the duplication events were intensely concentrated around 117 million years ago.

Conclusions: The rapid radiation of core eudicot lineages that gave rise to nearly 75% of angiosperm species appears to have occurred coincidentally or shortly following the gamma triplication event. Reconciliation of gene trees with a species phylogeny can elucidate the timing of major events in genome evolution, even when genome sequences are only available for a subset of species represented in the gene trees. Comprehensive transcriptome datasets are valuable complements to genome sequences for high-resolution phylogenomic analysis.

Background

Gene duplication provides the raw genetic material for the evolution of functional novelty and is considered to be a driving force in evolution [178, 179]. A major source of gene duplication is whole genome duplication (WGD; polyploidy), which involves the doubling of the entire genome. WGD has played a major role in the evolution of most eukaryotes, including ciliates [180], fungi [181], flowering plants [165, 182-192], and vertebrates [193-195]. Studies in these lineages support an association between WGD and gene duplications [182, 196], functional divergence in duplicate gene pairs [197, 198], phenotypic novelty [199], and possible increases in species diversity [200, 201] driven by variation in gene loss and retention among diverging polyploidy sub-populations [202-205].

There is growing consensus that one or more rounds of WGD played a major role early in the evolution of flowering plants [165, 179, 183-185, 189, 206, 207]. Early synteny-based and phylogenomic analyses of the Arabidopsis genome revealed multiple WGD events [184, 185]. The oldest of these WGD events was placed before the monocot-eudicot divergence, a second WGD was hypothesized to be shared among most, if not all, eudicots, and a more recent WGD was inferred to have occurred before diversification of the Brassicales [185]. Synteny analyses of the recently sequenced nuclear genomes of Vitis vinifera (wine grape, grapevine) [208] and Carica papaya (papaya tree) [183] provided more conclusive evidence for a somewhat different scenario in terms of the number and timing of WGDs early in the history of angiosperms. Each Vitis (or Carica) genome segment can be syntenic with up to four segments in the Arabidopsis genome, implicating two WGDs in the Arabidopsis lineage after separation from the Vitis (or Carica) lineage [183, 188, 208]. The more ancient one (β) appears to have occurred around the time of the Cretaceous-Tertiary extinction [186]. Analyses of the genome structure of Vitis revealed triplicate sets of syntenic gene blocks [187, 208]. Because the blocks are all similarly diverged, and thus were probably generated at around the same time in the past, the triplicated genome structure is likely to have been generated by an ancient hexaploidy event, possibly similar to the two successive WGDs likely to have produced Triticum aestivum [209]. Although the mechanism is not clear at this point, the origin of this triplicated genome structure is commonly referred to as gamma or γ (hereafter γ refers to the gamma event).

Comparisons of available genome sequences for other core rosid species (including Carica, Populus, and Arabidopsis) and the recently sequenced potato genome (an asterid, Solanum tuberosum) show evidence of one or more rounds of polyploidy with the most ancient event within each genome represented by triplicated gene blocks showing interspecific synteny with triplicated blocks in the Vitis genome [183, 187, 210, 211]. The most parsimonious explanation of these patterns is that γ occurred in a common ancestor of rosids and asterids, because all sequenced genomes within these lineages share a triplicate genome structure [188, 211].


Despite this growing body of evidence from genome sequences, the phylogenetic placement of γ on the angiosperm tree of life remains equivocal (for example, [189]). As described above, the γ event is readily apparent in analyses of sequenced core eudicot genomes, and recent comparisons of regions of the Amborella genome and the Vitis synteny blocks indicate that the γ event occurred after the origin and early diversification of angiosperms [212]. In addition, comparisons of the Vitis synteny blocks with bacterial artificial chromosome sequences from the Musa (a monocot) genome provide weak evidence that γ postdates the divergence of monocots and eudicots [187].

As an alternative to synteny comparisons, a phylogenomic approach has also been used successfully to determine the relative timing of WGD events. By mapping paralogs created by a given WGD onto phylogenetic trees, we can determine whether the paralogs resulted from a duplication event before or after a given branching event [185]. In a recent study, Jiao et al. [165] used a similar strategy to identify two bouts of concerted gene duplications that are hypothesized to be derived from successive genome duplications in common ancestors of living seed plants and angiosperms. When using a phylogenomic approach, extensive rate variation among species could lead to incorrect phylogenetic inferences and then possibly also result in the incorrect placement of duplication events [187]. Gene or taxon sampling can reduce variation in branch lengths and the impact of long-branch attraction in gene tree estimates (for example, [213-215]). Therefore, effective use of the phylogenomic approach requires consideration of possible differences in substitution rates and careful taxon sampling to divide long branches that can lead to artifacts in phylogenetic analyses.

The availability of transcriptome data produced by both traditional (Sanger) and next-generation cDNA sequencing methods has grown rapidly in recent years [216, 217]. In PlantGDB, very large Sanger EST datasets from multiple members of Asteraceae (for example, Helianthus annuus, sunflower) and Solanaceae (for example, S. tuberosum, potato), in particular, provide good coverage of the gene sets from the two largest asterid lineages. With advances in next-generation sequencing, comprehensive transcriptome datasets are being generated for an expanding number of species. For example, the Ancestral Angiosperm Genome Project has generated large, multi-tissue cDNA datasets of magnoliids and other basal angiosperms, including Aristolochia, Persea, Liriodendron, Nuphar and Amborella [165]. The Monocot Tree of Life project [218] is generating deep transcriptome datasets for at least 50 monocot species that previously have not been the focus of genome-scale sequencing. The 1000 Green Plant Transcriptome Project [134] is generating at least 3 Gb of Illumina paired-end RNAseq data from each of 1,000 plant species from green algae through angiosperms (Viridiplantae). In this study, we draw upon these resources, including an initial collection of basal eudicot species that have been very deeply sequenced by the 1000 Green Plant Transcriptome Project. Six members of (Argemone mexicana, Eschscholzia californica, and four species of ) have been targeted for especially deep sequencing, with over 12 Gb of cDNA sequence derived from four or five tissue-specific RNAseq libraries. Three other basal eudicots (Podophyllum peltatum (Berberidaceae), Akebia trifoliata (Lardizabalaceae), and Platanus occidentalis (Platanaceae)) sequenced by the 1000 Green Plant (1KP) Transcriptome Project, and EST sets available for additional strategically placed species (for example, [219], [134]) were employed for phylogenomic estimation of the timing of the γ event. Assembled unigenes (sequences produced from assembly of EST data sets) were sorted into gene families and then the phylogenetic analyses of gene families were performed to test alternative hypotheses for the phylogenetic placement of the γ event.

Results and discussion

Since the γ event was first identified in a groundbreaking phylogenomic analysis of the Arabidopsis genome [185], its timing has been hypothesized to have predated the origin of angiosperms (for example, [201],[220]), the divergence of monocots and eudicots (for example, [221]) and the divergence of asterid and rosid eudicot clades (for example, [187, 211]) (Appendix B Figure 1). Most recent analyses suggest that γ occurred within the eudicots, but the timing of the γ event relative to the diversification of core eudicots remains unclear [189]. Resolving whether γ occurred just before the radiation of core eudicots or earlier, in a common ancestor of all eudicots, has implications for our understanding of the relationship between polyploidization, diversification rates, and morphological novelty (for example, [190]).

Phylogenomic placement of the γ polyploidy event

To ascertain the timing of the γ event relative to the origin and early diversification of eudicots, we mainly focused on dating paralogous gene pairs that are retained on synteny blocks in Vitis [187, 188]. Vitis displays the most complete retention for γ blocks among all genomes sequenced to date, and thus provides the best target for phylogenomic mining of the γ history. Vitis also represents the sister group to all other members of the rosid lineage (APG III, 2009) [130, 222], so homologous genes were sampled from other species of rosids, asterids, basal eudicots, monocots, and basal angiosperms in order to estimate the timing of the γ event in relation to the divergence of these lineages. Genes were clustered into ‘orthogroups’ (homologous genes that derive from a single gene in the common ancestor of the focal taxa) using OrthoMCL [166] with eight sequenced angiosperm genomes (Appendix B Table 1). By excluding Vitis pairs that are not included in the same orthogroups, and requiring that orthogroups contained both monocots and non-Vitis eudicots, 900 pairs of Vitis genes were retained from 781 orthogroups. These orthogroups were used in our investigation of the γ duplication event.

To verify that the phylogenetic placement of the γ event was shared by rosids and asterids, and to test whether it was shared by all eudicots or by eudicots and monocots (near angiosperm-wide), these orthogroups were then populated with unigenes of asterids, basal eudicots, non-grass monocots, and basal angiosperms (Appendix B Table 2). Grasses are known to be distinct from other angiosperms in their high rate of nucleotide substitutions, and codon biases within the grasses make this clade distinct from other angiosperms, including non-grass monocots (for example, [223, 224]), so inclusion of non-grass monocots was necessary to reduce artifacts in gene tree estimation. More generally, when dealing with phylogenomic-scale datasets, we strive for adequate taxon sampling to cut long branches, but avoid adding a large proportion of unigenes with low coverage. Inadequate taxon sampling could lead to spurious inference of phylogeny, while incomplete sequences (that is, low-coverage unigenes) can greatly degrade branch support and resolution of phylogenetic trees.

To phylogenetically place the γ event with confidence, we adopted the following support-based approach. Three relevant bootstrap values were taken into account when evaluating support for a particular duplication. For example, given a topology of (((clade2)bootstrap2,(clade3)bootstrap3)bootstrap1), bootstrap2 and bootstrap3 are the bootstrap values supporting clade2 (clade2 here will include one of the Vitis γ duplicates) and clade3 (including the other Vitis duplicate), respectively, while bootstrap1 is the bootstrap value supporting the larger clade including clade2 and clade3. The value of bootstrap1 indicates the degree of confidence in the inferred ancestral node joining clades 2 and 3. In this study, when bootstrap1 and at least one of bootstrap2 and bootstrap3 were ≥50% (or 80%), we determined whether an asterid, basal eudicot, monocot, or basal angiosperm was contained in clades 2 or 3 (for example, asterids in Appendix B Figures 2 and 3) or sister to their common ancestor (node defining clade 1) with a bootstrap value (BS) ≥50% (or 80%; for example, basal eudicots, monocots and basal angiosperms in Appendix B Figures 2 and 3).
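The support rule above reduces to a simple predicate on the three bootstrap values. The following Python sketch is illustrative only; the function names are hypothetical and the code is not taken from the scripts used in this study.

```python
def duplication_supported(bootstrap1, bootstrap2, bootstrap3, threshold=50):
    """True when the node joining both paralog clades (bootstrap1) and at
    least one of the two paralog clades (bootstrap2, bootstrap3) reach the
    threshold, following the rule stated in the text."""
    return bootstrap1 >= threshold and max(bootstrap2, bootstrap3) >= threshold


def support_level(bootstrap1, bootstrap2, bootstrap3):
    """Classify a duplication node as '>=80', '>=50', or 'unsupported'."""
    if duplication_supported(bootstrap1, bootstrap2, bootstrap3, threshold=80):
        return ">=80"
    if duplication_supported(bootstrap1, bootstrap2, bootstrap3, threshold=50):
        return ">=50"
    return "unsupported"
```

Under this rule a node scored at ≥80% is also counted at ≥50%, which matches the way the BS ≥80% counts are reported as subsets of the BS ≥50% counts.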

Homologous sequences were identified for 769 of the 781 orthogroups and were subsequently used for phylogenetic analysis. For example, orthogroup 1202 was well populated with unigenes of asterids, basal eudicots, non-grass monocots, and basal angiosperms (Appendix B Figure 2). Two Vitis genes, which were located on a syntenic block, were clustered into two clades, both of which include genes from asterids and other rosids. This phylogenetic tree supports (BS ≥80%) the duplication of two Vitis genes before the split of rosids and asterids and after the divergence of basal eudicots, indicating that γ is restricted to core eudicots (BR3 of Appendix B Figure 1; Appendix B Figure 2). In another example, only one asterid unigene passed the quality control steps and was clustered into orthogroup 1083. This asterid unigene was grouped into one of the duplicated clades, also supporting (BS ≥50%) a duplication in the common ancestor of extant core eudicots (BR3 of Appendix B Figure 1; Appendix B Figure 3). Only a few duplications of Vitis gene pairs were identified as occurring before the divergence of monocots and eudicots (BR1 of Appendix B Figure 1; seven duplications with BS ≥50%), or restricted to rosids (BR4 of Appendix B Figure 1; six duplications with BS ≥50%, four duplications with BS ≥80%). We identified 168 Vitis gene pairs that were duplicated after the split of basal eudicots (BR3 of Appendix B Figure 1) with BS ≥50%, and 80 of these had BS ≥80%. We also found that 70 Vitis gene pairs were duplicated before the separation of basal eudicots (BR2 of Appendix B Figure 1) with BS ≥50%, 19 of them with BS ≥80% (Appendix B Table 3).

Therefore, our phylogenomic analysis provided very strong support that γ occurred before the divergence of rosids and asterids, after the split of monocots and eudicots, and most likely after the earliest diversification of eudicots.

Molecular dating of the γ duplications

To estimate the absolute date of the γ event, we calibrated 161 of the 168 orthogroups supporting (BS ≥50%) a core eudicot-wide duplication and 66 of the 70 orthogroups supporting a eudicot-wide duplication, and then estimated the duplication times using the program r8s [155] (Materials and methods). We then analyzed the distribution of the inferred duplication times using a Bayesian method that assigned divergence time estimates to classes specified by a mixture model [225]. The distribution of duplication times of core eudicot-wide Vitis pairs shows a peak at 117 ± 1 million years ago (mya) (95% confidence interval) (Appendix B Figure 4a), and the distribution of all eudicot-wide duplication times has a peak at 133 ± 1 mya (Appendix B Figure 4b). Dating estimates have additional sources of error beyond the sampling effects accounted for in standard error estimates (for example, [226]). However, the clear pattern is that the duplication branch points occurred over a narrow window of time very close to the eudicot calibration point that represents the first documented appearance of tricolpate pollen in the fossil record. We also analyzed the 80 nodes and 19 nodes showing duplication shared by core eudicots and all eudicots, respectively, with bootstrap support ≥80% (Appendix B Figure 4d,e) and found similar distributions (116 ± 1 mya for core eudicot duplications and 135 ± 2 mya for all eudicot duplications). The inferred dates for Vitis duplications shared either by core eudicots or all eudicots are very close to each other, and are concentrated around 125 mya. We also investigated the distribution of all inferred duplication times together (core eudicot-wide and eudicot-wide). Even given a time constraint (125 mya) that would split the date estimates for core eudicot and eudicot-wide duplications, the distributions of combined inferred duplication times show only one significant peak, with a mean at 121 mya for orthogroups with bootstrap support ≥50% (Appendix B Figure 4c) and 120 mya for orthogroups with bootstrap support ≥80% (Appendix B Figure 4f). A single peak observed for the combined data (Appendix B Figure 4c) suggests that the genome-scale event(s) leading to the triplicated genome structure of core eudicots occurred in a narrow window of time nearly coincident with the sudden appearance of eudicot pollen-types in the fossil record [227].

Hexaploidization and early eudicot radiation are close in time

Many of the gene trees showed no resolution or low bootstrap support for nodes distinguishing hypotheses BR2 and BR3. If the γ event had occurred almost anywhere along the long branch leading to eudicots, this event would have been relatively easy to resolve.

The lack of resolution of the timing of duplication events around the basal eudicot speciation nodes suggests that the γ event may have occurred during a rapid species radiation. Another possibility could be due to the nature of hexaploidization. If, as our analyses suggest, the polyploidy event (see below for possible scenarios) occurred soon after the divergence of basal eudicots, the substitution rates for γ paralogs could vary. For example, one duplicate could evolve very slowly while the other evolves at an accelerated rate [181]. These possibilities could add significant challenges to the precise resolution of events occurring at or near the branch points for basal versus core eudicot lineages. Despite these challenges, most well-resolved gene trees support the hypothesis that the γ event occurred in association with the origin and diversification of the core eudicots, after the core eudicot lineage diverged from the Ranunculales (BR3 of Appendix B Figure 1).

Nature of the γ event

An additional question is whether the ancient hexaploid common ancestor was formed by one or two WGDs that occurred over a very short period (for example, as with hexaploid wheat). It was demonstrated that two of the three homologous regions were more fractionated than the third, suggesting a possible mechanism for the γ event [210]. In one proposed scenario, a genome duplication event generated a tetraploid, which then hybridized with a diploid to generate a (probably sterile) triploid. Finally, a second WGD event doubled the triploid genome to generate a fertile hexaploid. Alternatively, unreduced gametes of a tetraploid and a diploid could have fused to generate a hexaploid directly. Another characterization of syntenic blocks indicates that the three corresponding regions are generally equidistant from one another [187]. Our analyses of duplication points in the phylogenomic analyses resolve only a single peak in estimated dates for the ‘γ event’, which would be consistent with either scenario, given that any complex scenario would involve ancient events that occurred within a brief period of time. More evidence is needed to establish a more definitive mechanism for the apparent hexaploidization (that is, as one versus two events, allopolyploid versus autopolyploid).

Rate variations between paralogs of Vitis

In another attempt to increase resolving power, Ks distributions for duplicate Vitis genes were investigated. The Ks distributions of Vitis pairs supporting a core eudicot-wide duplication inferred from phylogenetic analyses show one significant peak at Ks ~1.03 (Appendix B Figure 5a). The Ks values for eudicot-wide duplicate Vitis pairs were not well clustered, and their distribution shows one peak at 1.31, which indicates slightly more divergence for these Vitis pairs (Appendix B Figure 5b). This result is consistent with phylogenetic analyses that show this set of duplications occurred somewhat earlier (all eudicot-wide versus core eudicot-wide). We also investigated the distribution of all Ks values together (core eudicot-wide and eudicot-wide). Three statistically significant peaks were identified: 0.3, 1.02 and 1.40 (Appendix B Figure 5c). Finally, we estimated Ks values for all (2,191) pairs of Vitis γ paralogs identified by Tang et al. [187] in analyses of syntenic blocks. We were able to detect four significant components using the mixture model implemented with EMMIX (McLachlan et al. [228]): 0.12, 1.09, 1.85, and 2.7 (Appendix B Figure 5d). This Ks distribution clearly shows that the major peak (approximately 1.09; green curve in Appendix B Figure 5d) was close to the peak of Ks distribution of core eudicot-wide duplicates (at approximately 1.03; Appendix B Figure 5a). This intriguing pattern (Appendix B Figure 5c,d) could be a consequence of stable hexaploidy arising from two WGDs, one in the common ancestor of all eudicots and one in the common ancestor of core eudicots. However, there are no consistent patterns of duplications for entire syntenic blocks; for example, some syntenic blocks have genes consistently duplicated in core eudicots, while other syntenic blocks were duplicated eudicot-wide (results not shown).

Alternatively, this pattern also could be consistent with the hypothesis of an allopolyploidy event for γ. If two ancestral genomes were involved in the hexaploidization and the Vitis genome had evolved slowly, two significant peaks might be detected [229]. A third possibility is that Vitis pairs supporting a eudicot-wide duplication may be the products of pre-WGD tandem or segmental duplications that were misidentified as syntenic γ paralogs due to loss of alternative copies through the fractionation process. These hypotheses will have to be tested through comparative analyses as additional plant genomes, especially of outgroups (for example, Aquilegia, Amborella) and other basal eudicots, are sequenced.

Implications of the γ event characterizing most eudicots

Our results suggest that the γ polyploidy event was closely coincident with a rapid radiation of the major core eudicot lineages, which together contain about 75% of living angiosperm species. This rapid lineage expansion following the γ event could be an important exception to the general pattern described by Mayrose et al. [207], who concluded that there may generally be reduced survival of polyploid plant lineages. The eudicots consist of a graded series of generally small clades (often called early-diverging or basal eudicots) that are successive sisters to the core eudicots ([130] and references therein). It is within the core eudicot clade that most major lineages, as well as the large majority of angiosperm species, reside (for example, rosids, asterids, caryophyllids). Several key evolutionary events seem to correspond closely to the origin of the core eudicots, including the genome-wide event described here, the evolution of a pentamerous, highly synorganized flower with a well-differentiated perianth, and the production of ellagic and gallic acids [230]. Significantly, the duplication of several genes crucial to the establishment of floral organ identity also occurred near the origin of the core eudicots (AP3, AP1, AG, and SEP gene lineages) [220, 231, 232], suggesting that these duplications - possibly originating from the γ event - may also be involved in the ‘new’ floral morphology that emerged in this clade [233, 234].

This study also helps to shed light on prior studies, where the potential timing of the γ event varied widely from possibly in an ancestor of all angiosperms [185] to perhaps as recent as only rosids [235]. A polyploid event has been detected that is angiosperm-wide, but this was an earlier event (ε, epsilon) [165]. Our results are consistent with a recent study that identified a signature of the γ event in the genome of the potato, an asterid [211]. The γ event was suggested to be absent from grass genomes in comparisons of Vitis and Oryza [208], but this finding was questioned by Tang et al. [187]. However, the draft genome of strawberry (Fragaria vesca), a rosid that shares the γ event, did not show evidence for γ in syntenic block analysis [236], suggesting that either the γ event has been obscured by further rearrangements and fractionation, or expansion of the Fragaria genome sequence data may be necessary. Although sequenced plant genomes are being produced at an increasing rate, a much larger source of genome-scale evidence is coming from very large-scale transcriptome studies such as the 1000 Green Plant Transcriptome Project and the Monocot Tree of Life Project. In this paper, we have used gigabases of transcriptome data from species at key branch points to phylogenetically time hundreds of ancient gene duplications. Combined with evidence from Ks analysis and syntenic blocks, global gene family phylogenies could incorporate extensive evidence without a sequenced genome, and ultimately facilitate a much better understanding of plant evolution.

Conclusions

Phylogenetic analyses and molecular dating provide consistent and strong evidence supporting the occurrence of the γ polyploidy event after the divergence of monocots and eudicots, and before the asterid-rosid split. It is difficult to determine whether the γ event was shared by monocots based only on synteny patterns shared between Vitis and monocot genomes [187]. By including massive transcriptome datasets from many additional taxa, such as basal angiosperms, non-grass monocots, basal eudicots and asterids, we employed a comprehensive phylogenomic approach, and dated gene pairs on syntenic blocks in a relatively slowly evolving species (Vitis) [187]. We were able to place the γ event(s) in a narrow window of time, most likely shortly before the origin and rapid radiation of core eudicots.


Materials and methods

Data and assemblies

Genomes were obtained from various sources as given in Appendix B Table 1. EST data or assemblies were obtained from sources indicated in Appendix B Table 2. The largest quantities of new sequence data are represented by transcriptome datasets for nine basal eudicot species produced by Beijing Genomics Institute for the 1000 Green Plant Transcriptome Project [147]. The Monocot Tree of Life Project (MonATOL) generated five non-grass monocot transcriptomes. One transcriptome dataset for Lindenbergia philippensis (asterid) was obtained from the Parasitic Plant Genome Project [135]. Several methods were used for EST data assembly, according to the type and quantity of data that were available. Assemblies involving large numbers of Sanger reads were obtained either from the Plant Genome Database [134] or The Institute for Genomic Research (TIGR) Plant Transcript Assemblies [237]. Hybrid assemblies with Sanger and 454 data were performed with MIRA.Est. Short-read Illumina datasets were assembled either with SOAPdenovo (K-mer size = 29 and asm_flag = 2) [238] or with CLC Genomics Workbench (reads trimmed first, and using default parameters except minimum contig length set to 200 bases). Assemblies for species with data from more than one sequencing technology were further post-assembled with CAP3 (overlap length cutoff = 40 and overlap percent identity = 98) to merge contigs that have significant overlap but could not be assembled into contiguous sequences by primary assemblers due to either the presence of SNPs in the consensus or path ambiguity in the graph.


Gene classification and phylogenetic analysis

The OrthoMCL method [166] was used to construct sets of orthogroups. Amino acid alignments for each orthogroup were generated with MUSCLE, and then trimmed by removing poorly aligned regions with trimAl 1.2, using the heuristic automate1 option [176]. In order to sort and align transcriptome data into our eight-genome scaffold for downstream phylogenetic analyses, we first used ESTScan [173] to find the best reading frame for all unigenes. The best hit from a BLAST search against the inferred proteins of our eight-genome scaffold was then used to assign each unigene to an orthogroup. Additional sorted unigene sequences for the orthogroups of sequenced genomes were aligned at the amino acid level into the existing full alignments (before trimming) of eight sequenced species using ClustalX 1.8 [239]. Then these large alignments were trimmed again using trimAl 1.2 with the same settings. Each unigene sequence was checked and removed from the alignment if the sequence contained less than 70% of the total alignment length.
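The 70% length filter can be sketched as follows. This is a minimal illustration rather than the script used here: it assumes that "length" means the fraction of trimmed alignment columns in which a unigene has a residue rather than a gap, and the function names and the gap characters are hypothetical choices.

```python
def coverage_fraction(aligned_seq, gap_chars="-?"):
    """Fraction of alignment columns in which the sequence has a residue."""
    if not aligned_seq:
        return 0.0
    residues = sum(1 for ch in aligned_seq if ch not in gap_chars)
    return residues / len(aligned_seq)


def drop_short_unigenes(alignment, unigene_ids, min_fraction=0.70):
    """Remove unigenes that cover less than min_fraction of the alignment.

    alignment maps sequence name to aligned sequence; unigene_ids is the set
    of names that are transcriptome unigenes. Sequences from the sequenced
    genomes (names not in unigene_ids) are always kept.
    """
    return {
        name: seq
        for name, seq in alignment.items()
        if name not in unigene_ids or coverage_fraction(seq) >= min_fraction
    }
```

Applying the filter only to unigenes keeps the genome-derived scaffold intact while removing the fragmentary sequences that tend to reduce branch support.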

Corresponding DNA sequences were then forced onto the amino acid alignments using custom Perl scripts, and DNA alignments were used in subsequent phylogenetic analysis.
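Forcing DNA onto an amino acid alignment means replacing each aligned residue with its codon and each gap with three gap characters. The Python sketch below illustrates the idea only; the analyses in this study used custom Perl scripts that are not reproduced here, and the function name and error handling are assumptions.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map an unaligned coding sequence onto its aligned protein sequence.

    Each residue in aligned_protein consumes the next codon of cds, and each
    gap becomes three gap characters, so the returned DNA alignment is three
    times as long as the protein alignment.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than 3 x protein length")
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)
```

Because the protein alignment fixes the columns, the resulting DNA alignment preserves codon boundaries, which matters for downstream codon-based estimates such as Ks.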

Maximum likelihood analyses were conducted using RAxML version 7.2.1 [153], searching for the best maximum likelihood tree with the GTRGAMMA model by conducting 100 bootstrap replicates, which represents an acceptable trade-off between speed and accuracy (RAxML 7.0.4 manual).

Molecular dating analyses and 95% confidence intervals

The best maximum-likelihood topology for each orthogroup was used to estimate divergence times. The divergence time of the two paralogous clades in each orthogroup was estimated under the assumption of a relaxed molecular clock by applying a semi-parametric penalized likelihood approach using a truncated Newton optimization algorithm as implemented in the program r8s [155]. The smoothing parameter was determined by cross-validation. We used the following dates in our estimation procedure: minimum age of 131 mya [240] and maximum age of 309 mya for crown-group angiosperms [241], and a fixed constraint age of 125 mya for crown-group eudicots [242]. We required that trees pass both the cross-validation procedure and provide estimates of the age of the duplication node. The collection of inferred divergence times was then analyzed by EMMIX [225]. For each significant component identified by EMMIX, the 95% confidence interval of the mean was then calculated.
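As a worked illustration of the last step, the sketch below computes a 95% confidence interval of the mean for the duplication times assigned to one component. The text does not state which formula was used, so the normal approximation (mean ± 1.96 standard errors) and the function name are assumptions.

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, half_width) of a normal-approximation 95% confidence
    interval of the mean, i.e. mean +/- 1.96 * sd / sqrt(n).

    Assumption: the sample standard deviation and the 1.96 normal quantile
    are used; the original analysis may have used a different estimator.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean, 1.96 * standard_error
```

The half-width shrinks with the square root of the number of nodes, which is why components built from many duplication nodes can have intervals as narrow as ± 1 mya even when individual estimates are widely scattered.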

Finite mixture models of genome duplications

To explore the divergence patterns for duplicated genes, the inferred distribution of Ks divergences was fitted to a mixture model comprising several component distributions in various proportions. The Ks value for each duplicated sequence pair was calculated using the Goldman and Yang maximum likelihood method implemented in codeml with the F3x4 model [168]. The EMMIX software was used to fit a mixture model of multivariate normal components to a given data set. The mixed populations were modelled with one to four components. The EM algorithm was repeated 100 times with random starting values, as well as 10 times with k-means starting values. The best mixture model was identified using the Bayesian information criterion.
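The model-selection procedure can be sketched with a small, self-contained EM routine for one-dimensional normal mixtures. This is not EMMIX: the function names, the variance floor, the fixed number of iterations, and the use of random starts only (10 by default, without k-means starts) are simplifying assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(data, k, n_iter=200, seed=0, min_var=1e-4):
    """Fit a k-component normal mixture by EM from one random start.
    Returns (log_likelihood, weights, means, variances)."""
    rng = random.Random(seed)
    n = len(data)
    grand_mean = sum(data) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in data) / n, min_var)
    means = rng.sample(data, k)       # random data points as starting means
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0.0 else [1.0 / k] * k)
        # M-step: update proportions, means, and variances
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, min_var)  # floor avoids degenerate spikes
    loglik = 0.0
    for x in data:
        dens = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
        loglik += math.log(max(dens, 1e-300))
    return loglik, weights, means, variances


def best_by_bic(data, max_k=4, n_starts=10):
    """Fit 1..max_k components and return (bic, k, fit) with the lowest BIC."""
    n = len(data)
    best = None
    for k in range(1, max_k + 1):
        fit = max((fit_mixture(data, k, seed=s) for s in range(n_starts)),
                  key=lambda f: f[0])
        n_params = 3 * k - 1          # k means, k variances, k - 1 proportions
        bic = -2.0 * fit[0] + n_params * math.log(n)
        if best is None or bic < best[0]:
            best = (bic, k, fit)
    return best
```

The BIC penalty grows with the number of free parameters, so an extra component is accepted only when it improves the log-likelihood by more than the cost of its three additional parameters.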

Abbreviations


BS, bootstrap value; EST, expressed sequence tag; Ks, rate of synonymous substitutions per synonymous site; mya, million years ago; WGD, whole genome duplication.

Authors' contributions

YJ, JL-M and CWD conceived of the study and its design, and YJ performed all of the final analyses. YJ, JL-M, CWD drafted the primary manuscript and additional text and discussion of the research was provided by DES, PSS, JEB, NJW, TMK, GW, DWS. Tissue samples, RNA isolations, library preparation sequencing and sample and sequence management were done by MR, MRM, JM, MR, XW, YongZ, JW, ASC, MKD, RM and JCP. Data assemblies and other analyses were done by YJ, SA, DRR, EW, and YetingZ. All authors contributed to and approved the final manuscript for publication.

Acknowledgements

We thank Joshua P Der for helpful comments. This work was supported in part by funds from the NSF Plant Genome Research Program (DEB 0638595, The Ancestral Angiosperm Genome Project to CWD, JL-M, PSS, DES; DEB 0701748, The Parasitic Plant Genome Project to CWD; DEB 0922742, The Amborella Genome: A Reference for Plant Biology to CWD, JL-M, PSS, DES; IOS 0421604, Genomics of Comparative Seed Evolution to DWS, RM), NSF Tree of Life program (‘MonATOL,’ DEB 0829868, From Acorus to Zingiber - Assembling the Phylogeny of the Monocots to DWS, JCP, JL-M, RM, CWD), National Institute on Drug Abuse (NIDA) Genetic Variation at the National Institutes of Health (project 5R01DA025197-02 to TMK, CWD, JL-M), the Alberta 1000 Plants Initiative (1000 Green Plant Transcriptome Project, to GW; funded by Alberta Advanced Education and Technology, by Musea Ventures, and by BGI-Shenzhen), iPlant (to JL-M), and by the Biology Department and Plant Biology Graduate Program of Penn State University.

Appendix B Figure 1. Schematic phylogenetic tree of flowering plants.

BR1 to BR4 denote potential time points when the γ event may have occurred. BR1, monocots + eudicots duplication; BR2, eudicot-wide duplication; BR3, core eudicot-wide duplication; BR4, rosid-wide duplication.

Appendix B Figure 2. Exemplar maximum likelihood phylogeny of Ortho 1202.

RAxML topology of an orthogroup (Ortho 1202) indicating that the γ paralogs of Vitis were duplicated before the split of rosids and asterids and after the early radiation of eudicots. The scored bootstrap (BS) value for this duplication is over 80%, because nodes #1 and #2 (and/or #3) have BS >80%. Legend: green star = core eudicot duplication; colored circles = recent independent duplications; numbers = bootstrap support values. Lineage key: rosids, asterids, basal eudicots, monocots, basal angiosperms.

Appendix B Figure 3. Exemplar maximum likelihood phylogeny of Ortho 1083.

RAxML topology of an orthogroup (Ortho 1083) indicating that the γ paralogs of Vitis were duplicated before the split of rosids and asterids, and after the early radiation of eudicots. The scored bootstrap (BS) value for this duplication is over 50%, because node #1 has BS <80%. Legend: green star = core eudicot duplication; colored circles = recent independent duplications; numbers = bootstrap support values. Lineage key: rosids, asterids, basal eudicots, monocots, basal angiosperms.

Appendix B Figure 4. Age distribution of γ duplications. Panels a to f; x-axis: Divergence time (mya); y-axis: Frequency.

(a) The inferred duplication times for 161 γ duplication nodes that support core eudicot-wide duplication (BS ≥50%) were analyzed by EMMIX to determine whether these duplications occurred randomly over time or within some small timeframe. Each component is written as ‘color/mean molecular timing/proportion’ where ‘color’ is the component (curve) color and ‘proportion’ is the percentage of duplication nodes assigned to the identified component. There is one statistically significant component: green/117 (mya)/1. (b) Distribution of inferred γ duplication times from 66 orthogroups that support a eudicot-wide duplication with BS ≥50%. There is one statistically significant component: blue/133 (mya)/1. (c) Distribution of inferred γ duplication times from combination of (a) and (b) shows one significant component: purple/121 (mya)/1. (d-f) Corresponding distributions of inferred duplication times from orthogroups with BS ≥80%. One significant component in (d), green/116 (mya)/1; one in (e), blue/135 (mya)/1; and one in (f), purple/120 (mya)/1.

Appendix B Figure 5. Ks distributions of paralogs in Vitis from syntenic block analysis. Panels a to d; x-axis: Ks; y-axis: Frequency.

Methods for sequence alignment and estimation of Ks were as reported (Cui et al. 2006), but were here limited to paralogous gene pairs retained on syntenic blocks in the Vitis genome. Colored lines superimposed on the Ks distributions represent significant duplication components identified by the likelihood mixture model as in Appendix B Figure 4 (Materials and methods). (a) Ks distribution of 168 Vitis pairs supporting core eudicot-wide duplication in phylogenetic analysis. One statistically significant component: green/1.03/1. (b) Ks distribution of 70 Vitis pairs showing all eudicot-wide duplications on phylogenies. One significant component: blue/1.31/1. (c) Ks distribution of the combination of Vitis pairs supporting core eudicot-wide (a) and eudicot-wide (b) duplications on phylogenies. Three significant components: black/0.3/0.01, green/1.02/0.70, blue/1.40/0.29. (d) Ks distribution of the 2,191 paralogous pairs identified from syntenic block analysis. Four significant components: black/0.12/0.02, green/1.09/0.74, blue/1.85/0.22, yellow/2.7/0.02.


Appendix B Table 1. Summary of datasets for eight sequenced plant genomes included in this study

Species                                 Annotation version      Number of annotated genes
Arabidopsis thaliana (thale cress)      TAIR version 9          27,379
Carica papaya (papaya)                  ASGPB release           25,536
Cucumis sativus (cucumber)              BGI release             21,635
Populus trichocarpa (black cottonwood)  JGI version 2.0         41,377
Glycine max (soybean)                   Phytozome version 1.0   55,787
Vitis vinifera (grape)                  Genoscope release       30,434
Oryza sativa (rice)                     RGAP release 6.1        56,979
Sorghum bicolor                         JGI version 1.4         34,496

These eight genome sequences were used to construct orthogroups, which were then populated with additional unigenes of asterids, basal eudicots, non-grass monocots, and basal angiosperms. The number of annotated genes in each genome is indicated. TAIR, The Arabidopsis Information Resource; ASGPB, Advanced Studies of Genomics, Proteomics and Bioinformatics; JGI, Joint Genome Institute; RGAP, Rice Genome Annotation Project.


Appendix B Table 2. Summary of unigene sequences of asterids, basal eudicots, non-grass monocots, and basal angiosperms included in phylogenetic study

Species                                   Lineage           Source            Number of reads/ESTs            Size of data             Assembly method(s)    Number of unigenes
Panax quinquefolius                       Asterid           NCBI-SRA          209,745                         89.7 Mb                  MIRA                  22,881
Lindenbergia philippensis                 Asterid           PPGP              69,545,362                      5.9 Gb                   CLC                   104,904
Helianthus annuus                         Asterid           TIGR PTA          93,279                          NA                       Megablast-CAP3        44,662
Solanum tuberosum                         Asterid           TIGR PTA          219,485                         NA                       Megablast-CAP3        81,072
Mimulus guttatus                          Asterid           PlantGDB          231,012                         NA                       Vmatch-PaCE-CAP3      39,577
Papaver somniferum                        Basal eudicot     1KP + SRA         140,604,904 + 3,709,876         10.3 Gb + 1.3 Gb         MIRA-SOAPDenovo-CAP3  252,894
Papaver setigerum                         Basal eudicot     1KP               134,478,938                     9.8 Gb                   SOAPDenovo-CAP3       406,167
Papaver rhoeas                            Basal eudicot     1KP               157,506,374                     11.5 Gb                  SOAPDenovo-CAP3       383,426
Papaver bracteatum                        Basal eudicot     1KP               89,663,900                      6.5 Gb                   SOAPDenovo-CAP3       201,564
Eschscholzia californica                  Basal eudicot     NCBI + SRA + 1KP  14,381 + 559,470 + 133,422,402  6.8 Mb + 55 Mb + 9.7 Gb  MIRA-SOAPDenovo-CAP3  165,260
Argemone mexicana                         Basal eudicot     1KP + NCBI        144,520,360 + 1,692             10.5 Gb + 1 Mb           SOAPDenovo-CAP3       148,533
Akebia trifoliata                         Basal eudicot     1KP               29,156,514                      2.1 Gb                   CLC-CAP3              46,024
Podophyllum peltatum                      Basal eudicot     1KP               20,139,210                      1.5 Gb                   CLC-CAP3              31,472
Platanus occidentalis                     Basal eudicot     1KP               25,508,642                      1.9 Gb                   CLC-CAP3              42,373
Aquilegia formosa x Aquilegia pubescens   Basal eudicot     PlantGDB          85,040                          NA                       Vmatch-PaCE-CAP3      19,615
Mesembryanthemum crystallinum             Caryophyllid      PlantGDB          27,553                          NA                       Vmatch-PaCE-CAP3      11,317
Beta vulgaris                             Caryophyllid      PlantGDB          25,883                          NA                       Vmatch-PaCE-CAP3      18,009
Acorus americanus                         Monocot           MonATOL + 1KP     149,320 + 15,427,316            44.9 Mb + 1.1 Gb         MIRA-SOAPDenovo-CAP3  59,453
Chamaedorea seifrizii                     Monocot           MonATOL           33,100,948                      2.5 Gb                   CLC                   68,489
Chlorophytum rhizopendulum                Monocot           MonATOL           59,505,714                      4.5 Gb                   CLC                   58,766
Neoregelia sp.                            Monocot           MonATOL           49,121,506                      3.7 Gb                   CLC                   63,269
Typha angustifolia                        Monocot           MonATOL           70,733,124                      5.7 Gb                   CLC                   57,980
Persea americana (avocado)                Magnoliid         AAGP              2,336,819                       683 Mb                   MIRA                  132,532
Aristolochia fimbriata (Dutchman’s pipe)  Magnoliid         AAGP              3,930,505                       880 Mb                   MIRA                  155,371
Liriodendron tulipifera (yellow-poplar)   Magnoliid         AAGP              2,327,654                       543 Mb                   MIRA                  137,923
Nuphar advena (yellow pond lily)          Basal angiosperm  AAGP              3,889,719                       1.1 Gb                   MIRA                  289,773
Amborella trichopoda                      Basal angiosperm  AAGP              2,943,273                       776 Mb                   MIRA                  208,394

1KP, 1000 Green Plant Transcriptome Project; AAGP, Ancestral Angiosperm Genome Project [219]; MonATOL, Monocot Tree of Life Project [218]; NA, not available; NCBI, National Center for Biotechnology Information; PPGP, Parasitic Plant Genome Project; SRA, Sequence Read Archive; TIGR PTA, The Institute for Genomic Research Plant Transcript Assemblies [237].


Appendix B Table 3. Phylogenetic timing of Vitis γ duplications inferred from orthogroup phylogenetic histories

Ortho         BR1               BR2               BR3               BR4
              BS ≥80   BS ≥50   BS ≥80   BS ≥50   BS ≥80   BS ≥50   BS ≥80   BS ≥50
Duplications  0        7        19       70       80       168      4        6
Percent       0%       2.8%     18.3%    27.9%    77.7%    67%      4%       2.3%

BRx designations are illustrated in Appendix B Figure 1. Bootstrap (BS) ≥80 and BS ≥50 are counts of nodes resolved with BS ≥80 or ≥50, respectively.

References

1. Kuijt J: The biology of parasitic flowering plants. Berkeley: University of California Press; 1969.

2. Westwood JH, Yoder JI, Timko MP, dePamphilis CW: The evolution of parasitism in plants. Trends Plant Sci 2010, 15(4):227-235.

3. Candolle APd: Théorie élémentaire de la botanique. 1813.

4. Press MC, Graves JD (eds): Parasitic Plants. 1995.

5. Atsatt P: The insect herbivore as a predictive model in parasitic seed plant biology. American Naturalist 1977, 111:579-586.

6. Atsatt PR: Parasitic flowering plants: How did they evolve? American Naturalist 1973, 107:502-509.

7. Chang M: Isolation and characterization of semiochemicals involved in host recognition in Striga asiatica. Ph.D. thesis, Department of Chemistry, University of Chicago. 1986.

8. Chang M, Lynn DG: Haustoria and the chemistry of host recognition in parasitic angiosperms. Journal of Chemical Ecology 1986, 12:561-579.

9. Matusova R, Rani K, Verstappen FW, Franssen MC, Beale MH, Bouwmeester HJ: The strigolactone germination stimulants of the plant-parasitic Striga and Orobanche spp. are derived from the carotenoid pathway. Plant Physiol 2005, 139(2):920-934.

10. Lopez-Raez JA, Charnikhova T, Gomez-Roldan V, Matusova R, Kohlen W, De Vos R, Verstappen F, Puech-Pages V, Becard G, Mulder P et al: Tomato strigolactones are derived from carotenoids and their biosynthesis is promoted by phosphate starvation. New Phytol 2008, 178(4):863-874.

11. Akiyama K, Matsuzaki K, Hayashi H: Plant sesquiterpenes induce hyphal branching in arbuscular mycorrhizal fungi. Nature 2005, 435(7043):824-827.

12. Yoneyama K, Takeuchi Y, Sekimoto H: Phosphorus deficiency in red clover promotes exudation of orobanchol, the signal for mycorrhizal symbionts and germination stimulant for root parasites. Planta 2007, 225(4):1031-1038.

13. Harrison MJ: Signaling in the arbuscular mycorrhizal symbiosis. Annu Rev Microbiol 2005, 59:19-42.

14. Umehara M, Hanada A, Yoshida S, Akiyama K, Arite T, Takeda-Kamiya N, Magome H, Kamiya Y, Shirasu K, Yoneyama K et al: Inhibition of shoot branching by new terpenoid plant hormones. Nature 2008, 455(7210):195-200.

15. Gomez-Roldan V, Fermas S, Brewer PB, Puech-Pages V, Dun EA, Pillot JP, Letisse F, Matusova R, Danoun S, Portais JC et al: Strigolactone inhibition of shoot branching. Nature 2008, 455(7210):189-194.

16. Searcy DG: Measurements by DNA hybridization in vitro of the genetic basis of parasitic reductions. Evolution 1970, 24:207-219.

17. Searcy DG, MacInnis AJ: Measurements by DNA renaturation of the genetic basis of parasitic reduction. Evolution 1970, 24:796-806.

18. dePamphilis CW, Palmer JD: Loss of photosynthetic and chlororespiratory genes from the plastid genome of a parasitic flowering plant. Nature 1990, 348(6299):337-339.

19. dePamphilis CW, Young ND, Wolfe AD: Evolution of plastid gene rps2 in a lineage of hemiparasitic and holoparasitic plants: many losses of photosynthesis and complex patterns of rate variation. Proc Natl Acad Sci U S A 1997, 94(14):7367-7372.

20. dePamphilis CW, Palmer JD: Evolution and function of plastid DNA: a review with special reference to nonphotosynthetic plants, in Physiology, Biochemistry and Genetics of Nongreen Plastids. American Society of Plant Physiologists 1989:182-202.

21. Wolfe KH, Morden CW, Palmer JD: Function and evolution of a minimal plastid genome from a nonphotosynthetic parasitic plant. Proc Natl Acad Sci U S A 1992, 89(22):10648-10652.

22. Krause K: From chloroplasts to "cryptic" plastids: evolution of plastid genomes in parasitic plants. Curr Genet 2008, 54(3):111-121.

23. Hamby KR, Zimmer EA: Ribosomal RNA as a phylogenetic tool in plant systematics. Molecular Systematics of Plants 1992:50-91.

24. Nickrent DL, Franchina CR: Phylogenetic relationships of the Santalales and relatives. Journal of Molecular Evolution 1990, 31:294-301.

25. Nickrent DL, Soltis DE: A comparison of angiosperm phylogenies based upon complete 18S and rbcL sequences. Annals of the Missouri Botanical Garden 1995.

26. Der JP, Nickrent DL: A molecular phylogeny of Santalaceae (Santalales). Systematic Botany 2008, 33(1):107-116.

27. Schneeweiss GM, Colwell A, Park JM, Jang CG, Stuessy TF: Phylogeny of holoparasitic Orobanche (Orobanchaceae) inferred from nuclear ITS sequences. Molecular Phylogenetics and Evolution 2004, 30(2):465-478.

28. Schonenberger J, Anderberg A, Sytsma K: Molecular Phylogenetics and Patterns of Floral Evolution in the Ericales. Int J Plant Sci 2005, 166(2):265-288.

29. Barkman TJ, Lim SH, Salleh KM, Nais J: Mitochondrial DNA sequences reveal the photosynthetic relatives of Rafflesia, the world's largest flower. Proc Natl Acad Sci U S A 2004, 101(3):787-792.

30. Nickrent DL, Blarer A, Qiu YL, Vidal-Russell R, Anderson FE: Phylogenetic inference in Rafflesiales: the influence of rate heterogeneity and horizontal gene transfer. BMC Evol Biol 2004, 4(1):40.

31. Nickrent DL, Der JP, Anderson FE: Discovery of the photosynthetic relatives of the "Maltese mushroom" Cynomorium. BMC Evol Biol 2005, 5:38.

32. Barkman TJ, McNeal JR, Lim SH, Coat G, Croom HB, Young ND, dePamphilis CW: Mitochondrial DNA suggests at least 11 origins of parasitism in angiosperms and reveals genomic chimerism in parasitic plants. BMC Evol Biol 2007, 7:248.

33. Wrobel RL, Yoder JI: Differential RNA expression of alpha-expansin gene family members in the parasitic angiosperm Triphysaria versicolor (Scrophulariaceae). Gene 2001, 266(1-2):85-93.

34. Delavault P, Estabrook E, Albrecht H, Wrobel R, Yoder JI: Host-root exudates increase gene expression of asparagine synthetase in the roots of a hemiparasitic plant Triphysaria versicolor (Scrophulariaceae). Gene 1998, 222(2):155-162.

35. Yoder JI: Host-plant recognition by parasitic Scrophulariaceae. Curr Opin Plant Biol 2001, 4(4):359-365.

36. Wrobel RL, Matvienko M, Yoder JI: Heterologous expression and biochemical characterization of an NAD(P)H:quinone oxidoreductase from the hemiparasitic plant Triphysaria versicolor. Plant Physiology and Biochemistry 2002, 40(3):265-272.

37. Tomilov A, Yoder JI: Chemical signaling between plants: mechanistic similarities between phytotoxic allelopathy and host recognition by parasitic plants. Chemical ecology: from genes to ecosystem 2006:55-69.

38. Young ND, Steiner KE, dePamphilis CW: The Evolution of Parasitism in Scrophulariaceae/Orobanchaceae: Plastid Gene Sequences Refute an Evolutionary Transition Series. Annals of the Missouri Botanical Garden 1999, 86(4):876-893.

39. Bennett JR, Mathews S: Phylogeny of the parasitic plant family Orobanchaceae inferred from Phytochrome A. American Journal of Botany 2006, 93(7):1039.

40. Scholes JD, Press MC: Striga infestation of cereal crops - an unsolved problem in resource limited agriculture. Curr Opin Plant Biol 2008, 11(2):180-186.

41. Ejeta G: The Striga Scourge in Africa: A Growing Pandemic. Integrating New Technologies for Striga Control 2007:3-16.

42. Westwood JH, Yoder JI, Timko MP, dePamphilis CW: The evolution of parasitism in plants. Trends Plant Sci 2010, 15(4):227-235.

43. Ejeta G, Butler LG: Host-Plant Resistance to Striga. International Crop Science I 1993:561-569.

44. Joel DM: The long-term approach to parasitic weeds control: manipulation of specific developmental mechanisms of the parasite. Crop Protection 2000, 19(8-10):753-758.

45. Rodenburg J, Bastiaans L, Kropff MJ: Characterization of host tolerance to Striga hermonthica. Euphytica 2006, 147(3):353-365.

46. Huang K, Whitlock R, Press MC, Scholes JD: Variation for host range within and among populations of the parasitic plant Striga hermonthica. Heredity 2012, 108(2):96-104.

47. Roman B, Torres AM, Rubiales D, Cubero JI, Satovic Z: Mapping of quantitative trait loci controlling broomrape (Orobanche crenata Forsk.) resistance in faba bean (Vicia faba L.). Genome 2002, 45(6):1057-1063.

48. Rubiales D, Fernandez-Aparicio M: Innovations in parasitic weeds management in legume crops. A review. Agronomy for Sustainable Development 2012, 32(2):433-449.

49. Die JV, Dita MA, Krajinski F, Gonzalez-Verdejo CI, Rubiales D, Moreno MT, Roman B: Identification by suppression subtractive hybridization and expression analysis of Medicago truncatula putative defence genes in response to Orobanche crenata parasitization. Physiological and Molecular Plant Pathology 2007, 70(1-3):49-59.

50. Die JV, Verdejo CIG, Dita MA, Nadal S, Roman B: Gene expression analysis of molecular mechanisms of defense induced in Medicago truncatula parasitized by Orobanche crenata. Plant Physiology and Biochemistry 2009, 47(7):635-641.

51. Torres MJ, Tomilov AA, Tomilova N, Reagan RL, Yoder JI: Pscroph, a parasitic plant EST database enriched for parasite associated transcripts. BMC Plant Biol 2005, 5:24.

52. Matvienko M, Torres MJ, Yoder JI: Transcriptional responses in the hemiparasitic plant Triphysaria versicolor to host plant signals. Plant Physiol 2001, 127(1):272-282.

53. Yoshida S, Ishida JK, Kamal NM, Ali AM, Namba S, Shirasu K: A full-length enriched cDNA library and expressed sequence tag analysis of the parasitic weed, Striga hermonthica. BMC Plant Biol 2010, 10:55.

54. Westwood JH, dePamphilis CW, Das M, Fernandez-Aparicio M, Honaas LA, Timko MP, Wafula EK, Wickett NJ, Yoder JI: The Parasitic Plant Genome Project: New Tools for Understanding the Biology of Orobanche and Striga. Weed Science 2012, 60(2):295-306.

55. Davies J, Davies D: Origins and evolution of antibiotic resistance. Microbiol Mol Biol Rev 2010, 74(3):417-433.

56. Ochman H, Lawrence JG, Groisman EA: Lateral gene transfer and the nature of bacterial innovation. Nature 2000, 405(6784):299-304.

57. Dobrindt U, Hochhut B, Hentschel U, Hacker J: Genomic islands in pathogenic and environmental microorganisms. Nat Rev Microbiol 2004, 2(5):414-424.

58. Hacker J, Kaper JB: Pathogenicity islands and the evolution of microbes. Annu Rev Microbiol 2000, 54:641-679.

59. Sueoka N: On the genetic basis of variation and heterogeneity of DNA base composition. Proc Natl Acad Sci U S A 1962, 48:582-592.

60. Karlin S, Campbell AM, Mrazek J: Comparative DNA analysis across diverse genomes. Annu Rev Genet 1998, 32:185-225.

61. Heinemann JA, Sprague GF, Jr.: Bacterial conjugative plasmids mobilize DNA transfer between bacteria and yeast. Nature 1989, 340(6230):205-209.

62. Zambryski P, Tempe J, Schell J: Transfer and function of T-DNA genes from Agrobacterium Ti and Ri plasmids in plants. Cell 1989, 56(2):193-201.

63. Scholl EH, Thorne JL, McCarter JP, Bird DM: Horizontally transferred genes in plant-parasitic nematodes: a high-throughput genomic approach. Genome Biol 2003, 4(6):R39.

64. Slot JC, Hibbett DS: Horizontal transfer of a nitrate assimilation gene cluster and ecological transitions in fungi: a phylogenetic study. PLoS One 2007, 2(10):e1097.

65. Inderbitzin P, Harkness J, Turgeon BG, Berbee ML: Lateral transfer of mating system in Stemphylium. Proc Natl Acad Sci U S A 2005, 102(32):11390-11395.

66. Friesen TL, Stukenbrock EH, Liu Z, Meinhardt S, Ling H, Faris JD, Rasmussen JB, Solomon PS, McDonald BA, Oliver RP: Emergence of a new disease as a result of interspecific virulence gene transfer. Nat Genet 2006, 38(8):953-956.

67. Daniels SB, Peterson KR, Strausbaugh LD, Kidwell MG, Chovnick A: Evidence for horizontal transmission of the P transposable element between Drosophila species. Genetics 1990, 124(2):339-355.

68. Engels WR: P elements in Drosophila. Curr Top Microbiol Immunol 1996, 204:103-123.

69. Hartl D: Discovery of the transposable element mariner. Genetics 2001, 157(2):471-476.

70. Grenier E, Abadon M, Brunet F, Capy P, Abad P: A mariner-like transposable element in the insect parasite nematode Heterorhabditis bacteriophora. J Mol Evol 1999, 48(3):328-336.

71. Steglich C, Schaeffer SW: The ornithine decarboxylase gene of Trypanosoma brucei: Evidence for horizontal gene transfer from a vertebrate source. Infect Genet Evol 2006, 6(3):205-219.

72. Bergthorsson U, Adams KL, Thomason B, Palmer JD: Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature 2003, 424(6945):197-201.

73. Bergthorsson U, Richardson AO, Young GJ, Goertzen LR, Palmer JD: Massive horizontal transfer of mitochondrial genes from diverse land plant donors to the basal angiosperm Amborella. Proc Natl Acad Sci U S A 2004, 101(51):17747-17752.

74. Won H, Renner SS: Horizontal gene transfer from flowering plants to Gnetum. Proc Natl Acad Sci U S A 2003, 100(19):10824-10829.

75. Cho Y, Qiu YL, Kuhlman P, Palmer JD: Explosive invasion of plant mitochondria by a group I intron. Proc Natl Acad Sci U S A 1998, 95(24):14244-14249.

76. Davis CC, Wurdack KJ: Host-to-parasite gene transfer in flowering plants: phylogenetic evidence from Malpighiales. Science 2004, 305(5684):676-678.

77. Mower JP, Stefanovic S, Young GJ, Palmer JD: Plant genetics: gene transfer from parasitic to host plants. Nature 2004, 432(7014):165-166.

78. Davis CC, Anderson WR, Wurdack KJ: Gene transfer from a parasitic flowering plant to a fern. Proc Biol Sci 2005, 272(1578):2237-2242.

79. Diao X, Freeling M, Lisch D: Horizontal transfer of a plant transposon. PLoS Biol 2006, 4(1):e5.

80. Park J-M, Manen J-F, Schneeweiss GM: Horizontal gene transfer of a plastid gene in the non-photosynthetic flowering plants Orobanche and Phelipanche (Orobanchaceae). Molecular Phylogenetics and Evolution 2007, 43:974-985.

81. Yoshida S, Maruyama S, Nozaki H, Shirasu K: Horizontal gene transfer by the parasitic plant Striga hermonthica. Science 2010, 328(5982):1128.

82. Xi Z, Bradley RK, Wurdack KJ, Wong KM, Sugumaran M, Bomblies K, Rest JS, Davis CC: Horizontal transfer of expressed genes in a parasitic flowering plant. BMC Genomics 2012, 13(1):227.

83. Xi Z, Wang Y, Bradley RK, Sugumaran M, Marx CJ, Rest JS, Davis CC: Massive mitochondrial gene transfer in a parasitic flowering plant clade. PLoS Genet 2013, 9(2):e1003265.

84. Christin PA, Edwards EJ, Besnard G, Boxall SF, Gregory R, Kellogg EA, Hartwell J, Osborne CP: Adaptive evolution of C(4) photosynthesis through recurrent lateral gene transfer. Curr Biol 2012, 22(5):445-449.

85. Zhang YT, Fernandez-Aparicio M, Wafula EK, Das M, Jiao YN, Wickett NJ, Honaas LA, Ralph PE, Wojciechowski MF, Timko MP et al: Evolution of a horizontally acquired legume gene, albumin 1, in the parasitic plant Phelipanche aegyptiaca and related species. BMC Evol Biol 2013, 13.

86. Park JM, Manen JF, Schneeweiss GM: Horizontal gene transfer of a plastid gene in the non-photosynthetic flowering plants Orobanche and Phelipanche (Orobanchaceae). Mol Phylogenet Evol 2007, 43(3):974-985.

87. Keeling PJ, Palmer JD: Horizontal gene transfer in eukaryotic evolution. Nat Rev Genet 2008, 9(8):605-618.

88. Bock R: The give-and-take of DNA: horizontal gene transfer in plants. Trends Plant Sci 2010, 15(1):11-22.

89. Heidejorgensen HS, Kuijt J: The Haustorium of the Root Parasite Triphysaria (Scrophulariaceae), with Special Reference to Xylem Bridge Ultrastructure. American Journal of Botany 1995, 82(6):782-797.

90. Dorr I: How Striga parasitizes its host: A TEM and SEM study. Annals of Botany 1997, 79(5):463-472.

91. Dorr I, Kollmann R: Symplasmic Sieve Element Continuity between Orobanche and Its Host. Botanica Acta 1995, 108(1):47-55.

92. Aly R, Hamamouch N, Abu-Nassar J, Wolf S, Joel DM, Eizenberg H, Kaisler E, Cramer C, Gal-On A, Westwood JH: Movement of protein and macromolecules between host plants and the parasitic weed Phelipanche aegyptiaca Pers. Plant Cell Rep 2011, 30(12):2233-2241.

93. Gal-On A, Naglis A, Leibman D, Ziadna H, Kathiravan K, Papayiannis L, Holdengreber V, Guenoune-Gelbert D, Lapidot M, Aly R: Broomrape Can Acquire Viruses from Its Hosts. Phytopathology 2009, 99(11):1321-1329.

94. Birschwilks M, Haupt S, Hofius D, Neumann S: Transfer of phloem-mobile substances from the host plants to the holoparasite Cuscuta sp. Journal of Experimental Botany 2006, 57(4):911-921.

95. Haupt S, Oparka KJ, Sauer N, Neumann S: Macromolecular trafficking between Nicotiana tabacum and the holoparasite Cuscuta reflexa. Journal of Experimental Botany 2001, 52(354):173-177.

96. Roney JK, Khatibi PA, Westwood JH: Cross-species translocation of mRNA from host plants into the parasitic plant dodder. Plant Physiol 2007, 143(2):1037-1043.

97. Aly R, Cholakh H, Joel DM, Leibman D, Steinitz B, Zelcer A, Naglis A, Yarden O, Gal-On A: Gene silencing of mannose 6-phosphate reductase in the parasitic weed Orobanche aegyptiaca through the production of homologous dsRNA sequences in the host plant. Plant Biotechnol J 2009, 7(6):487-498.

98. Tomilov AA, Tomilova NB, Wroblewski T, Michelmore R, Yoder JI: Trans-specific gene silencing between host and parasitic plants. Plant J 2008, 56(3):389-397.

99. Mower JP, Stefanovic S, Hao W, Gummow JS, Jain K, Ahmed D, Palmer JD: Horizontal acquisition of multiple mitochondrial genes from a parasitic plant followed by gene conversion with host mitochondrial genes. BMC Biol 2010, 8.

100. Stegemann S, Bock R: Exchange of genetic material between cells in plant tissue grafts. Science 2009, 324(5927):649-651.

101. Kuijt J: Tissue compatibility and the haustoria of parasitic angiosperms. In: Vegetative compatibility responses in plants. Waco, TX: Baylor University; 1983.

102. Westwood JH: The Physiology of the Established Parasite-Host Association. In: Parasitic Orobanchaceae. Edited by Joel DM, Gressel J, Musselman LJ: Springer; 2013.

103. Syvanen M: Evolutionary implications of horizontal gene transfer. Annu Rev Genet 2012, 46:341-358.

104. Doolittle WF: You are what you eat: a gene transfer ratchet could account for bacterial genes in eukaryotic nuclear genomes. Trends Genet 1998, 14(8):307-311.

105. Taylor DJ, Bruenn J: The evolution of novel fungal genes from non-retroviral RNA viruses. BMC Biol 2009, 7:88.

106. Slot JC, Rokas A: Horizontal Transfer of a Large and Highly Toxic Secondary Metabolic Gene Cluster between Fungi. Current Biology 2011, 21(2):134-139.

107. Richardson AO, Palmer JD: Horizontal gene transfer in plants. J Exp Bot 2007, 58(1):1-9.

108. Acuna R, Padilla BE, Florez-Ramos CP, Rubio JD, Herrera JC, Benavides P, Lee SJ, Yeats TH, Egan AN, Doyle JJ et al: Adaptive horizontal transfer of a bacterial gene to an invasive insect pest of coffee. Proc Natl Acad Sci U S A 2012, 109(11):4197-4202.

109. Feschotte C, Pritham EJ: DNA transposons and the evolution of eukaryotic genomes. Annu Rev Genet 2007, 41:331-368.

110. Schaack S, Gilbert C, Feschotte C: Promiscuous DNA: horizontal transfer of transposable elements and why it matters for eukaryotic evolution. Trends Ecol Evol 2010, 25(9):537-546.

111. Yoshida S, Maruyama S, Nozaki H, Shirasu K: Horizontal gene transfer by the parasitic plant Striga hermonthica. Science 2010, 328(5982):1128.

112. Sanchez-Puerta MV, Cho Y, Mower JP, Alverson AJ, Palmer JD: Frequent, phylogenetically local horizontal transfer of the cox1 group I Intron in flowering plant mitochondria. Mol Biol Evol 2008, 25(8):1762-1777.

113. Vallenback P, Jaarola M, Ghatnekar L, Bengtsson BO: Origin and timing of the horizontal transfer of a PgiC gene from Poa to Festuca ovina. Mol Phylogenet Evol 2008, 46(3):890-896.

114. Hepburn NJ, Schmidt DW, Mower JP: Loss of Two Introns from the Magnolia tripetala Mitochondrial cox2 Gene Implicates Horizontal Gene Transfer and Gene Conversion as a Novel Mechanism of Intron Loss. Mol Biol Evol 2012.

115. Birschwilks M, Haupt S, Hofius D, Neumann S: Transfer of phloem-mobile substances from the host plants to the holoparasite Cuscuta sp. J Exp Bot 2006, 57(4):911-921.

116. Westwood JH, Roney JK, Khatibi PA, Stromberg VK: RNA translocation between parasitic plants and their hosts. Pest Manag Sci 2009, 65(5):533-539.

117. Louis S, Delobel B, Gressent F, Rahioui I, Quillien L, Vallier A, Rahbe Y: Molecular and biological screening for insect-toxic seed albumins from four legume species. Plant Sci 2004, 167(4):705-714.

118. Louis S, Delobel B, Gressent F, Duport G, Diol O, Rahioui I, Charles H, Rahbe Y: Broad screening of the legume family for variability in seed insecticidal activities and for the occurrence of the A1b-like knottin peptide entomotoxins. Phytochemistry 2007, 68(4):521-535.

119. Gelly JC, Gracy J, Kaas Q, Le-Nguyen D, Heitz A, Chiche L: The KNOTTIN website and database: a new information system dedicated to the knottin scaffold. Nucleic Acids Res 2004, 32(Database issue):D156-159.

120. Clark RJ, Jensen J, Nevin ST, Callaghan BP, Adams DJ, Craik DJ: The engineering of an orally active conotoxin for the treatment of neuropathic pain. Angew Chem Int Ed Engl 2010, 49(37):6545-6548.

121. Wang X, Connor M, Smith R, Maciejewski MW, Howden ME, Nicholson GM, Christie MJ, King GF: Discovery and characterization of a family of insecticidal neurotoxins with a rare vicinal disulfide bridge. Nat Struct Biol 2000, 7(6):505-513.

122. Jackson PJ, McNulty JC, Yang YK, Thompson DA, Chai B, Gantz I, Barsh GS, Millhauser GL: Design, pharmacology, and NMR structure of a minimized cystine knot with agouti-related protein activity. Biochemistry 2002, 41(24):7565-7572.

123. Clark RJ, Daly NL, Craik DJ: Structural plasticity of the cyclic-cystine-knot framework: implications for biological activity and drug design. Biochem J 2006, 394(Pt 1):85-93.

124. Combelles C, Gracy J, Heitz A, Craik DJ, Chiche L: Structure and folding of disulfide-rich miniproteins: insights from molecular dynamics simulations and MM-PBSA free energy calculations. Proteins 2008, 73(1):87-103.

125. Silverman AP, Levin AM, Lahti JL, Cochran JR: Engineered cystine-knot peptides that bind alpha(v)beta(3) integrin with antibody-like affinities. J Mol Biol 2009, 385(4):1064-1075.

126. Lewis GP: Legumes of the World. Royal Botanic Gardens, Kew; 2005.

127. Joel DM: The new nomenclature of Orobanche and Phelipanche. Weed Res 2009, 49:6-7.

128. Schneeweiss GM: Correlated evolution of life history and host range in the nonphotosynthetic parasitic flowering plants Orobanche and Phelipanche (Orobanchaceae). J Evol Biol 2007, 20(2):471-478.

129. Index of Orobanchaceae [http://www.farmalierganes.com/Otrospdf/publica/Orobanchaceae%20Index.htm]

130. Soltis DE, Smith SA, Cellinese N, Wurdack KJ, Tank DC, Brockington SF, Refulio-Rodriguez NF, Walker JB, Moore MJ, Carlsward BS et al: Angiosperm phylogeny: 17 genes, 640 taxa. Am J Bot 2011, 98(4):704-730.

131. Parker C: Observations on the current status of Orobanche and Striga problems worldwide. Pest Manag Sci 2009, 65(5):453-459.

132. Schneeweiss GM, Colwell A, Park JM, Jang CG, Stuessy TF: Phylogeny of holoparasitic Orobanche (Orobanchaceae) inferred from nuclear ITS sequences. Mol Phylogenet Evol 2004, 30(2):465-478.

133. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402.

134. PlantGDB [http://www.plantgdb.org/]

135. Parasitic Plant Genome Project [http://ppgp.huck.psu.edu/]

136. Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N et al: Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res 2012, 40(Database issue):D1178-1186.

137. SOL Genomics Network. 2012.

138. EMBL-EBI [http://www.ebi.ac.uk/Databases/]

139. Wojciechowski MF, Lavin M, Sanderson MJ: A phylogeny of legumes (Leguminosae) based on analysis of the plastid matK gene resolves many well-supported subclades within the family. Am J Bot 2004, 91(11):1846-1862.

140. Lavin M, Herendeen PS, Wojciechowski MF: Evolutionary rates analysis of Leguminosae implicates a rapid diversification of lineages during the tertiary. Syst Biol 2005, 54(4):575-594.

141. Medicago truncatula HapMap Project [http://www.medicagohapmap.org/index.php]

142. Gracy J, Le-Nguyen D, Gelly JC, Kaas Q, Heitz A, Chiche L: KNOTTIN: the knottin or inhibitor cystine knot scaffold in 2007. Nucleic Acids Res 2008, 36(Database issue):D314-319.

143. Westwood JH et al: The Parasitic Plant Genome Project: New Tools for Understanding the Biology of Orobanche and Striga. Weed Sci 2012, 60(2):295-306.

144. Schneeweiss GM, Palomeque T, Colwell AE, Weiss-Schneeweiss H: Chromosome numbers and karyotype evolution in holoparasitic Orobanche (Orobanchaceae) and related genera. Am J Bot 2004, 91(3):439-448.

145. Manen JF, Habashi C, Jeanmonod D, Park JM, Schneeweiss GM: Phylogeny and intraspecific variability of holoparasitic Orobanche (Orobanchaceae) inferred from plastid rbcL sequences. Mol Phylogenet Evol 2004, 33(2):482-500.

146. The Parasitic Plant Connection [http://www.parasiticplants.siu.edu/]

147. The 1KP Project [http://www.onekp.com/]

148. Johnson F: Transmission of plant viruses by dodder. Phytopathology 1941, 31(7):649-656.

149. Bennett CW: Studies of dodder transmission of plant viruses. Phytopathology 1944, 34(10):905-932.

150. David-Schwartz R, Runo S, Townsley B, Machuka J, Sinha N: Long-distance transport of mRNA via parenchyma cells and phloem across the host-parasite junction in Cuscuta. New Phytol 2008, 179(4):1133-1141.

151. Olmstead RG, dePamphilis CW, Wolfe AD, Young ND, Elisons WJ, Reeves PA: Disintegration of the Scrophulariaceae. Am J Bot 2001, 88(2):348-361.

152. Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5:113.

153. Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 2006, 22(21):2688-2690.

154. Drummond AJ, Rambaut A: BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol 2007, 7:214.

155. Sanderson MJ: r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics 2003, 19(2):301-302.

156. Gracy J, Chiche L: Optimizing structural modeling for a specific protein scaffold: knottins or inhibitor cystine knots. BMC Bioinformatics 2010, 11:535.

157. Pond SL, Frost SD, Muse SV: HyPhy: hypothesis testing using phylogenies. Bioinformatics 2005, 21(5):676-679.

158. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25(14):1754-1760.

159. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25(16):2078-2079.

160. Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26(6):841-842.

161. Goremykin VV, Salamini F, Velasco R, Viola R: Mitochondrial DNA of Vitis vinifera and the issue of rampant horizontal gene transfer. Mol Biol Evol 2009, 26(1):99-110.

162. Estabrook EM, Yoder JI: Plant-plant communications: Rhizosphere signaling between parasitic angiosperms and their hosts. Plant Physiology 1998, 116(1):1-7.

163. Jamison DS, Yoder JI: Heritable variation in quinone-induced haustorium development in the parasitic plant Triphysaria. Plant Physiol 2001, 125(4):1870-1879.

164. De Groote H, Wangare L, Kanampiu F, Odendo M, Diallo A, Karaya H, Friesen D: The potential of a herbicide resistant maize technology for Striga control in Africa. Agricultural Systems 2008, 97(1-2):83-94.

165. Jiao Y, Wickett NJ, Ayyampalayam S, Chanderbali AS, Landherr L, Ralph PE, Tomsho LP, Hu Y, Liang H, Soltis PS et al: Ancestral polyploidy in seed plants and angiosperms. Nature 2011, 473(7345):97-100.

166. Li L, Stoeckert CJ, Jr., Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 2003, 13(9):2178-2189.

167. Hellsten U, Wright KM, Jenkins J, Shu S, Yuan Y, Wessler SR, Schmutz J, Willis JH, Rokhsar DS: Fine-scale variation in meiotic recombination in Mimulus inferred from population shotgun sequencing. Proc Natl Acad Sci U S A 2013.

168. Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 1997, 13(5):555-556.

169. Wu TD, Watanabe CK: GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 2005, 21(9):1859-1875.

170. Slot JC, Rokas A: Horizontal transfer of a large and highly toxic secondary metabolic gene cluster between fungi. Curr Biol 2011, 21(2):134-139.

171. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q et al: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 2011, 29(7):644-652.

172. Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010, 26(19):2460-2461.

173. Iseli C, Jongeneel CV, Bucher P: ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc Int Conf Intell Syst Mol Biol 1999:138-148.

174. Eddy SR: Accelerated Profile HMM Searches. PLoS Computational Biology 2011, 7(10).

175. Katoh K, Standley DM: MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 2013, 30(4):772-780.

176. Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T: trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 2009, 25(15):1972-1973.

177. McDowall J, Hunter S: InterPro protein classification. Methods Mol Biol 2011, 694:37-47.

178. Ohno S: Evolution by Gene Duplication; 1970.

179. Adams KL, Wendel JF: Polyploidy and genome evolution in plants. Curr Opin Plant Biol 2005, 8(2):135-141.

180. Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N et al: Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 2006, 444(7116):171-178.

181. Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 2004, 428(6983):617-624.

182. Blanc G, Hokamp K, Wolfe KH: A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res 2003, 13(2):137-144.

183. Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH, Senin P, Wang W, Ly BV, Lewis KL et al: The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 2008, 452(7190):991-996.

184. Vision TJ, Brown DG, Tanksley SD: The origins of genomic duplications in Arabidopsis. Science 2000, 290(5499):2114-2117.

185. Bowers JE, Chapman BA, Rong J, Paterson AH: Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 2003, 422(6930):433-438.

186. Fawcett JA, Maere S, Van de Peer Y: Plants with double genomes might have had a better chance to survive the Cretaceous-Tertiary extinction event. Proc Natl Acad Sci U S A 2009, 106(14):5737-5742.

187. Tang H, Wang X, Bowers JE, Ming R, Alam M, Paterson AH: Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Res 2008, 18(12):1944-1954.

188. Tang H, Bowers JE, Wang X, Ming R, Alam M, Paterson AH: Synteny and collinearity in plant genomes. Science 2008, 320(5875):486-488.

189. Van de Peer Y: A mystery unveiled. Genome Biol 2011, 12(5):113.

190. Soltis DE, Albert VA, Leebens-Mack J, Bell CD, Paterson AH, Zheng C, Sankoff D, dePamphilis CW, Wall PK, Soltis PS: Polyploidy and angiosperm diversification. Am J Bot 2009, 96(1):336-348.

191. Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, Bai Y, Mun JH, Bancroft I, Cheng F et al: The genome of the mesopolyploid crop species Brassica rapa. Nat Genet 2011, 43(10):1035-1039.

192. Schranz ME, Mitchell-Olds T: Independent ancient polyploidy events in the sister families Brassicaceae and Cleomaceae. Plant Cell 2006, 18(5):1152-1165.

193. Dehal P, Boore JL: Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol 2005, 3(10):e314.

194. Christoffels A, Koh EG, Chia JM, Brenner S, Aparicio S, Venkatesh B: Fugu genome analysis provides evidence for a whole-genome duplication early during the evolution of ray-finned fishes. Mol Biol Evol 2004, 21(6):1146-1151.

195. Jaillon O, Aury JM, Brunet F, Petit JL, Stange-Thomann N, Mauceli E, Bouneau L, Fischer C, Ozouf-Costaz C, Bernot A et al: Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 2004, 431(7011):946-957.

196. Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, Soltis PS, Carlson JE, Arumuganathan K, Barakat A et al: Widespread genome duplications throughout the history of flowering plants. Genome Res 2006, 16(6):738-749.

197. Duarte JM, Cui L, Wall PK, Zhang Q, Zhang X, Leebens-Mack J, Ma H, Altman N, dePamphilis CW: Expression pattern shifts following duplication indicative of subfunctionalization and neofunctionalization in regulatory genes of Arabidopsis. Mol Biol Evol 2006, 23(2):469-478.

198. Johnson DA, Thomas MA: The monosaccharide transporter gene family in Arabidopsis and rice: a history of duplications, adaptive evolution, and functional divergence. Molecular biology and evolution 2007, 24(11):2412-2423.

199. Conrad B, Antonarakis SE: Gene duplication: a drive for phenotypic diversity and cause of human disease. Annu Rev Genomics Hum Genet 2007, 8:17-35.

200. Meyer A, Van de Peer Y: From 2R to 3R: evidence for a fish-specific genome duplication (FSGD). Bioessays 2005, 27(9):937-945.

201. De Bodt S, Maere S, Van de Peer Y: Genome duplication and the origin of angiosperms. Trends Ecol Evol 2005, 20(11):591-597.

202. Lynch M, Force AG: The origin of interspecific genomic incompatibility via gene duplication. Am Nat 2000, 156(6):590-605.

203. Wolfe KH, Scannell DR, Byrne KP, Gordon JL, Wong S: Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature 2006, 440(7082):341-345.

204. Taylor JS, Van de Peer Y, Meyer A: Genome duplication, divergent resolution and speciation. Trends Genet 2001, 17(6):299-301.

205. Werth CR, Windham MD: A model for divergent, allopatric speciation of polyploid pteridophytes resulting from silencing of duplicate-gene expression. Am Nat 1991, 137(4):515-526.

206. Barker MS, Vogel H, Schranz ME: Paleopolyploidy in the Brassicales: analyses of the Cleome transcriptome elucidate the history of genome duplications in Arabidopsis and other Brassicales. Genome Biol and Evol 2009, 1:391-399.

207. Mayrose I, Zhan SH, Rothfels CJ, Magnuson-Ford K, Barker MS, Rieseberg LH, Otto SP: Recently formed polyploid plants diversify at lower rates. Science 2011, 333(6047):1257.

208. Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C et al: The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 2007, 449(7161):463-467.

209. Dvorak J, Luo MC, Yang ZL, Zhang HB: The structure of the Aegilops tauschii genepool and the evolution of hexaploid wheat. Theor Appl Genet 1998, 97(4):657-670.

210. Lyons E, Pedersen B, Kane J, Freeling M: The value of nonmodel genomes and an example using SynMap within CoGe to dissect the hexaploidy that predates the rosids. Tropical Plant Biol 2008, 1:181-190.

211. Xu X, Pan S, Cheng S, Zhang B, Mu D, Ni P, Zhang G, Yang S, Li R, Wang J et al: Genome sequence and analysis of the tuber crop potato. Nature 2011, 475(7355):189-195.

212. Zuccolo A, Bowers JE, Estill JC, Xiong Z, Luo M, Sebastian A, Goicoechea JL, Collura K, Yu Y, Jiao Y et al: A physical map for the Amborella trichopoda genome sheds light on the evolution of angiosperm genome structure. Genome Biol 2011, 12(5):R48.

212

213. Leebens-Mack J, Raubeson LA, Cui L, Kuehl JV, Fourcade MH, Chumley TW, Boore JL, Jansen RK, depamphilis CW: Identifying the basal angiosperm node in chloroplast genome phylogenies: sampling one's way out of the Felsenstein zone. Mol Biol Evol 2005, 22(10):1948-1963. 214. Felsenstein J: Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool 1978, 27(4):401-410. 215. Hendy MD, Penny D: A framework for the quantitative study of evolutionary trees. Syst Zool 1989, 38(4):297-309. 216. Childs KL, Hamilton JP, Zhu W, Ly E, Cheung F, Wu H, Rabinowicz PD, Town CD, Buell CR, Chan AP: The TIGR Plant Transcript Assemblies database. Nucleic Acids Res 2007, 35(Database issue):D846-851. 217. Shumway M, Cochrane G, Sugawara H: Archiving next generation sequencing data. Nucleic acids research 2010, 38:D870-871. 218. The Monocot Tree of Life project [http://www.botany.wisc.edu/givnish/monocotatol.htm] 219. The ancestral angiosperm genome project [http://ancangio.uga.edu] 220. Zahn LM, Kong H, Leebens-Mack JH, Kim S, Soltis PS, Landherr LL, Soltis DE, Depamphilis CW, Ma H: The evolution of the SEPALLATA subfamily of MADS-box genes: a preangiosperm origin with multiple duplications throughout angiosperm history. Genetics 2005, 169(4):2209-2223. 221. Chapman BA, Bowers JE, Feltus FA, Paterson AH: Buffering of crucial functions by paleologous duplicated genes may contribute cyclicality to angiosperm genome duplication. Proc Natl Acad Sci U S A 2006, 103(8):2730-2735. 222. Wang H, Moore MJ, Soltis PS, Bell CD, Brockington SF, Alexandre R, Davis CC, Latvis M, Manchester SR, Soltis DE: Rosid radiation and the rapid rise of angiosperm- dominated forests. Proc Natl Acad Sci U S A 2009, 106(10):3853-3858. 223. 
Kuhl JC, Cheung F, Yuan QP, Martin W, Zewdie Y, McCallum J, Catanach A, Rutherford P, Sink KC, Jenderek M et al: A unique set of 11,008 onion expressed sequence tags reveals expressed sequence and genomic differences between the monocot orders Asparagales and Poales. Plant Cell 2004, 16(1):114-125. 224. Kuhl JC, Havey MJ, Martin WJ, Cheung F, Yuan QP, Landherr L, Hu Y, Leebens-Mack J, Town CD, Sink KC: Comparative genomic analyses in Asparagus. Genome 2005, 48(6):1052-1060. 225. McLachlan GJ, Peel D, Basford KE, Adams P: The Emmix software for the fitting of mixtures of normal and t-components. J Stat Softw 1999, 4:i02. 226. Morrison DA: How to Summarize Estimates of Ancestral Divergence Times. Evolutionary Bioinformatics 2008, 4:75-95. 227. Doyle J, Hotton C: Pollen and Spores. Patterns of Diversification. : Oxford: Clarendon; 1991. 228. McLachlan G, Peel D, Basford K, Adams P: The Emmix software for the fitting of mixtures of normal and t-components. J Stat Softw 1999, 4(i02). 229. American Society of Plant Biologists [http://abstracts.aspb.org/pb2010/public/S02/S022.html] 230. Soltis DE, Soltis PS, Endress PK, Chase MW: Phylogeny and evolution of angiosperms. Sinauer Associates, Sunderland, MA 2005. 231. Litt A, Irish VF: Duplication and diversification in the APETALA1/FRUITFULL floral homeotic gene lineage: implications for the evolution of floral development. Genetics 2003, 165(2):821-833.

213

232. Kramer EM, Zimmer EA: Gene duplication and floral developmental genetics of basal eudicots. Adv Bot Res 2006, 44:353-384. 233. Soltis PS, Brockington SF, Yoo MJ, Piedrahita A, Latvis M, Moore MJ, Chanderbali AS, Soltis DE: Floral variation and floral genetics in basal angiosperms. Am J Bot 2009, 96(1):110-128. 234. Chanderbali AS, Yoo MJ, Zahn LM, Brockington SF, Wall PK, Gitzendanner MA, Albert VA, Leebens-Mack J, Altman NS, Ma H et al: Conservation and canalization of gene expression during angiosperm diversification accompany the origin and evolution of the flower. Proc Natl Acad Sci U S A 2010, 107(52):22570-22575. 235. Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A et al: The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 2006, 313(5793):1596-1604. 236. Folta KM, Shulaev V, Sargent DJ, Crowhurst RN, Mockler TC, Folkerts O, Delcher AL, Jaiswal P, Mockaitis K, Liston A et al: The genome of woodland strawberry (Fragaria vesca). Nat Genet 2011, 43(2):109-U151. 237. TIGR Plant Transcript Assemblies database [http://plantta.jcvi.org] 238. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K et al: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 2010, 20(2):265-272. 239. Thompson JD, Gibson TJ, Higgins DG: Multiple sequence alignment using ClustalW and ClustalX. Curr Protoc Bioinfo 2002, Chapter 2(1934-340X):Unit 2.3. 240. Hughes NF, Mcdougall AB: Records of angiospermid pollen entry into the English Early Cretaceous succession. Rev Palaeobot Palynol 1987, 50(3):255-272. 241. Miller CN: Implications of fossil conifers for the phylogenetic relationships of living families. Bot Rev 1999, 65(3):239-277. 242. Doyle JA, Hotton CL: Pollen and Spores. Patterns of Diversification. Clarendon: Oxford; 1991.

214

VITA

Yeting Zhang

Yeting Zhang was born in Lujiang, Anhui Province, P.R. China, in 1984. She studied at the University of Science and Technology of China from 2002 to 2006, majoring in bioscience; the curriculum also included many subjects in computer and information sciences. She studied at the University of Pittsburgh’s Biological Science Department for one year in 2006 and began her PhD studies in the Intercollege Graduate Program in Genetics at the Pennsylvania State University in the fall semester of 2007. Her graduate studies focused on parasitic plants in Orobanchaceae, including Triphysaria versicolor, Striga hermonthica, and Orobanche aegyptiaca (syn. Phelipanche aegyptiaca). She applied DNA sequence analysis, molecular evolution analysis, and large-scale data handling approaches to the next-generation sequencing data for these three parasitic plants (PPGP, http://ppgp.huck.psu.edu/) and concentrated on identifying horizontal gene transfer (HGT) events in the three species.

Publications

Evolution of a horizontally acquired legume gene, albumin 1, in the parasitic plant Phelipanche aegyptiaca and related species. Zhang Y, Fernandez-Aparicio M, Wafula EK, Das M, Jiao Y, Wickett NJ, Honaas LA, Ralph PE, Wojciechowski MF, Timko MP, Yoder JI, Westwood JH, dePamphilis CW. BMC Evol Biol. 2013 Feb 20;13:48. doi: 10.1186/1471-2148-13-48.

Phylogenomic analysis of transcriptome data elucidates co-occurrence of a paleopolyploid event and the origin of bimodal karyotypes in Agavoideae (Asparagaceae). McKain MR, Wickett N, Zhang Y, Ayyampalayam S, McCombie WR, Chase MW, Pires JC, dePamphilis CW, Leebens-Mack J. Am J Bot. 2012 Feb;99(2):397-406. doi: 10.3732/ajb.1100537. Epub 2012 Feb 1.

A genome triplication associated with early diversification of the core eudicots. Jiao Y, Leebens-Mack J, Ayyampalayam S, Bowers JE, McKain MR, McNeal J, Rolf M, Ruzicka DR, Wafula E, Wickett NJ, Wu X, Zhang Y, Wang J, Zhang Y, Carpenter EJ, Deyholos MK, Kutchan TM, Chanderbali AS, Soltis PS, Stevenson DW, McCombie R, Pires JC, Wong GK, Soltis DE, dePamphilis CW. Genome Biol. 2012;13:R3. doi: 10.1186/gb-2012-13-1-r3.

The mitochondrial genome sequence of the Tasmanian tiger (Thylacinus cynocephalus). Miller W, Drautz DI, Janecka JE, Lesk AM, Ratan A, Tomsho LP, Packard M, Zhang Y, McClellan LR, Qi J, Zhao F, Gilbert MT, Dalén L, Arsuaga JL, Ericson PG, Huson DH, Helgen KM, Murphy WJ, Götherström A, Schuster SC. Genome Res. 2009 Feb;19(2):213-20. doi: 10.1101/gr.082628.108. Epub 2009 Jan 12.