The Pennsylvania State University The Graduate School

COMPUTATIONAL METHODS FOR COMPARATIVE GENOMICS OF NON-MODEL

A CASE STUDY IN THE PARASITIC FAMILY

A Dissertation in Biology

by Eric Kenneth Wafula

© 2019 Eric Kenneth Wafula

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

December 2019


The dissertation of Eric Kenneth Wafula was reviewed and approved* by the following:

Claude W. dePamphilis Professor of Biology Dissertation Advisor

Naomi S. Altman Professor of Statistics

Istvan Albert Research Professor of Bioinformatics Director of Online Graduate Certificate in Applied Bioinformatics

James H. Marden Professor of Biology Associate Director Huck Institute of the Life Sciences Chair of Committee

Stephen W. Schaeffer Professor of Biology Associate Department Head of Graduate Education

*Signatures are on file in the Graduate School


ABSTRACT

The rapid development of sequencing technologies, coupled with the continuous drop in the cost of sequencing, has facilitated studies of genomes, transcriptomes, and metagenomes of a variety of organisms at unprecedented resolution. However, sequencing and accurately assembling the genomes of many non-model organisms, especially plants, remains cost-prohibitive because these genomes are often large or complex, which poses challenges to current sequencing technologies and assembly algorithms.

Therefore, many researchers now rely on comparative genomic approaches that integrate data from genomes and transcriptomes to gain novel insights into evolutionary history, including the unique features of complex non-model organisms. The genomes of parasitic angiosperms are relatively understudied. Past genome-scale research has focused primarily on understanding the mechanisms of plant parasitism as a means to control weedy species that parasitize crops. Research efforts to understand the evolutionary aspects of parasitic plants have been restricted to the plastome degradation associated with the reduction and loss of photosynthesis. In this dissertation, I present PlantTribes 2, a gene family analysis framework that utilizes objective classifications of complete protein sequences from genomes for comparative and evolutionary analyses of gene families and transcriptomes on a genome scale. Utilizing PlantTribes 2 and the draft genome of Striga asiatica, together with transcriptomes of sister lineages, I present evidence for an ancient polyploidy event shared by parasitic Orobanchaceae and closely related non-parasitic sister lineages. The observed gene family evolutionary dynamics in Striga reveal an association between whole genome duplication (WGD) and the evolutionary origins of parasitism in Orobanchaceae. Gene losses are overrepresented by older genes whose functions are complemented by the host, while gene gains, whose functions are associated with further adaptations to the parasitic lifestyle, often result from the WGD event specific to Orobanchaceae. The evolutionary transition from autotrophy to heterotrophy is associated with changes in gene functions common to non-parasitic plants. These findings will provide a focus for future studies into the mechanisms of plant parasitism and potential targets for parasite control.


TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGEMENTS

Chapter 1 Introduction and overview
    Introduction
    Gene discovery
    Gene expression following character evolution
    Phylogenetic analysis
    Gene and genome duplication
    Parasitic plants
    Content of this dissertation
    References

Chapter 2 PlantTribes 2: tools for comparative gene family analysis in plant genomics
    Abstract
    Introduction
    Pipeline implementation
    Gene family scaffolds
    Analysis tools
    Assembly post-processing
    Gene family classification
    Gene family alignment estimation
    Gene family phylogenetic inference
    Estimation of genome duplications
    Pipeline test dataset
    Performance evaluation
    Evaluation of gene family inference methods
    Evaluation of sequence classifiers
    Evaluation of targeted gene family assembly
    Conclusions
    Availability and requirements
    References

Chapter 3 Ancient whole genome duplication events in the parasitic lineages of Orobanchaceae
    Abstract
    Introduction
    Results and discussion
    Gene annotation
    Genome cleaning
    Annotation-specific repeat masking
    Gene prediction and quality assessment
    Genome duplication history
    Global gene family phylogenetic analysis
    Duplicated gene divergence of phylogenetic syntelogs
    Genome structure analysis
    Conclusions
    Materials and methods
    Genome assembly cleaning
    Repeat library construction
    RNA-Seq sequencing and transcriptome assembly
    Gene prediction and functional assignment
    Gene family classification
    Phylogenetic and gene collinearity analysis
    Analysis of synonymous substitutions rates (Ks)
    Genome structure and synteny analysis
    Integration of additional taxa
    References

Chapter 4 Gene family dynamics and functional diversification in parasitic lineages of Orobanchaceae
    Abstract
    Introduction
    Results and discussion
    Gene family contractions and expansions
    Evolution of tissue-specific gene families
    Evolutionary events related to parasitism
    Shifts in gene expression (Phase I)
    Shifts in gene content (Phase II)
    Adaptation to parasitic lifestyle (Phase III)
    Conclusions
    Materials and methods
    Estimating gene family evolutionary dynamics
    Functional enrichment of contracted and expanded genes
    Divergence rates for contracted and expanded genes
    Defining tissue-specific gene families
    Analysis of haustorial and photosynthesis-related genes
    References

Chapter 5 Conclusions and future directions

Appendix Additional material for the Striga asiatica genome


LIST OF FIGURES

Figure 2-1: PlantTribes 2 analysis workflow.

Figure 2-2: Gene family alignment, alignment visualization, and alignment editing.

Figure 2-3: Gene family phylogenetic tree visualization.

Figure 2-4: Estimation of whole genome duplication events.

Figure 2-5: Relationship between orthologous clusters and expert curated gene families.

Figure 3-1: Distribution of known or proposed whole genome duplication events within Lamiales.

Figure 3-2: The cumulative distribution of annotation edit distance (AED) scores for all Striga gene annotations.

Figure 3-3: Assessment of Striga genome assembly and annotation completeness.

Figure 3-4: Divergence (Ks) distribution plots for syntenic duplicate gene pairs in Striga and Mimulus.

Figure 3-5: Principal component analysis (PCA) displaying duplication component similarities among Orobanchaceae species and closely related lineages.

Figure 3-6: Striga syntenic dot plot showing the divergence of paralogous syntenic gene pairs within the Striga asiatica genome.

Figure 3-7: Striga cross-species syntenic dot plot against Mimulus showing the divergence of orthologous syntenic gene pairs.

Figure 3-8: Striga cross-species syntenic dot plot against Vitis showing the divergence of orthologous syntenic gene pairs.

Figure 3-9: Exemplar microsynteny analysis consistent with Striga lineage-specific and core Lamiales-wide duplications.

Figure 3-10: Exemplar microsynteny analysis consistent with Striga lineage-specific, core Lamiales-wide duplications in relation to an ancestral core eudicot represented by Vitis.

Figure 3-11: Exemplar phylogenetic tree consistent with Striga lineage-specific, core Lamiales-wide duplications in relation to an ancestral core eudicot represented by the Vitis gene.

Figure 3-12: Gene duplication phylogenetic topology scoring scheme.

Figure 4-1: Ks plots of Striga genes in contracted and expanded gene families.

Figure 4-2: The maximum likelihood species tree for 26 plant genomes.


LIST OF TABLES

Table 2-1: PlantTribes 2 test dataset.

Table 2-2: Distribution of genes from expert curated gene families into orthologous gene clusters.

Table 2-3: Performance evaluation of sequence classifiers.

Table 2-4: Completeness assessment of transcriptome assemblies.

Table 3-1: Repeat content and composition in the Striga genome.

Table 3-2: Gene family classification summary of 26 plant genomes.

Table 3-3: Duplication origins of Striga and Mimulus genes.

Table 3-4: Additional RNA-Seq datasets used in the study.

Table 3-5: Taxa in the 26 genomes gene family scaffold.

Table 4-1: List of enriched KEGG pathways for genes in contracted and expanded orthogroups in Striga.

Table 4-2: List of enriched GO terms for genes in contracted orthogroups in Striga.

Table 4-3: List of enriched GO terms for genes in expanded orthogroups in Striga.

Table 4-4: Contractions and expansions of genes in tissue-specific gene families in Striga.

Table 4-5: Core parasitism genes expression in tissue-specific gene families.

Table 4-6: Chlorophyll synthesis and photosynthetic pathways genes in Striga.


ACKNOWLEDGEMENTS

I wish to thank the countless people who helped me throughout my Ph.D., and more specifically Dr. Claude dePamphilis, my advisor, for all his guidance and support. I have enjoyed working with many collaborators through the dePamphilis lab, including current and former workmates. Although I was trained as a programmer when I came to Penn State, my workmates have been instrumental in helping me learn biology while working on various projects with them. Last but not least, many thanks to my committee members, Dr. Naomi Altman, Dr. Istvan Albert, and Dr. James Marden, for their advice and guidance whenever I needed help to advance my research work.

Finally, everything I have ever achieved has been possible only because of my late mother, Willbroda, who worked tirelessly as a single parent to give my brother Kevin, sister Laura, late sister Hilda, and me an opportunity in life. Her memories will always live with me. I celebrate this achievement with my lovely wife Edith and beautiful daughters Danae and Edna. Their loving comfort and prayers are a nourishment to my soul.

Funding:

Some datasets used in this dissertation were obtained from the Parasitic Plant Genome Project (PPGP) website (http://ppgp.huck.psu.edu/) that is supported by the National Science Foundation (NSF) and hosted at Penn State University. Any opinions, findings, conclusions or recommendations do not necessarily reflect the views of the NSF.

Chapter 1

Introduction and overview

Introduction

Recent advances in low-cost next-generation sequencing (NGS) technologies have allowed the genomes of a variety of plant species of ecological, evolutionary, agricultural, and economic importance to be sequenced by numerous research groups around the world. However, many non-model plant genomes are complex and remain challenging for current sequencing technologies and assembly algorithms. Some of these complexities, including gene and genome duplication, heterozygosity, repetitive sequences, and large genome size, can be alleviated by sequencing the transcriptome rather than the genome [1, 2]. Therefore, many researchers now rely on transcriptome sequencing coupled with comparative genomics approaches to aid in the interpretation of complex non-model plant genomes. Transcriptome sequencing, also known as RNA-Seq, involves isolating RNA from a tissue of interest, constructing a cDNA library from the RNA population, and sequencing the library on high-throughput NGS platforms [3-5]. A de novo assembly of the library reads is then performed to create a reference transcriptome, which is possible for virtually any non-model species, for subsequent downstream comparative analyses.

Transcriptome sequencing has become one of the preferred approaches for comparative analysis of non-model plants. Several large research initiatives that are collecting large numbers of plant transcriptomes are currently underway, including the 1000 Plant Transcriptomes Project (1KP), which aims to sequence the transcriptomes of over 1,000 plant species, ranging from algae to land and aquatic plants [6]. Beyond being one of the most efficient and cost-effective methods currently available for studying gene discovery and expression evolution in non-model organisms, transcriptomics is also being applied in other areas of plant comparative genomics. These areas include examining shifts in gene expression associated with character evolution and adaptation, reconstructing the evolutionary histories of gene families and genomes, and unraveling gene and genome duplication events [7, 8].

Gene discovery

Transcript reconstruction using high-throughput RNA-Seq sequence data is the most commonly employed approach for transcriptome characterization and gene discovery in non-model organisms. Reference-guided assembly and de novo assembly are the two complementary strategies for identifying both protein-coding and noncoding genes [9]. When a reference genome of a closely related species is available, transcript reconstruction can be guided by aligning RNA-Seq reads to the reference using splice-aware aligners [10-14]. Transcripts, including novel isoforms and unannotated genes, are then assembled from the resulting read alignments [15-17]. In addition, non-coding RNAs (ncRNAs), some of which have important regulatory functions, can also be identified, annotated, and/or assembled. Reference-based transcript reconstruction has several main advantages: smaller computational requirements than de novo assembly, the absence of contamination and sequencing artifacts in the assembled transcripts, and the ability to assemble low-abundance transcripts. However, the completeness of the assembled transcripts is highly dependent on the quality of the reference genome [9]. Unfortunately, most genome assemblies of non-model plant species, if available at all, are still in draft status, meaning they are often incomplete and replete with errors. Furthermore, novel genes that are not present in the reference genome, and genes that are simply too diverged from the reference to map, are missed in a reference-based approach.

The de novo transcript reconstruction strategy assembles short sequence reads into transcripts without the guidance of a reference genome, using de Bruijn graph algorithms [18-20]. Transcripts of genes that are missing in the reference genome, or are too diverged for accurate mapping, can be recovered without difficulty. In addition, precisely recognizing splice sites and spanning long introns with short read alignments, which is critical in reference-based approaches for discovering alternative splice forms and for connecting exons, is not a concern for the de novo assembly strategy [9]. Reconstructed transcripts from one or many organisms can then be used for a wide array of transcriptome-wide functional and evolutionary analyses. De novo transcript reconstruction is not without limitations, however. The complexity of a transcriptome assembly, including alternative splice forms, allelic variants, close paralogs, artefactual chimeras, incomplete transcripts, and missing genes, makes meaningful inferences challenging in the absence of a reference genome [21-23].
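To make the de Bruijn graph idea concrete, the following is a minimal sketch of how overlapping short reads can be merged into a longer sequence. It is an illustration only and not the algorithm of any particular assembler cited above: the function names `build_de_bruijn` and `walk_unbranched`, the toy reads, and the choice of k = 4 are all assumptions made for this example, and sequencing errors, read coverage, reverse complements, and isoform branches are ignored.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph from reads.

    Nodes are (k-1)-mers; each distinct k-mer contributes one edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Extend a contig from `start` while the current node has exactly one
    outgoing edge; stop at a branch, a dead end, or a revisited node."""
    contig = start
    node = start
    visited = {start}
    while node in graph and len(graph[node]) == 1:
        nxt = graph[node][0]
        if nxt in visited:
            break
        contig += nxt[-1]  # each step adds one new base
        visited.add(nxt)
        node = nxt
    return contig


# Hypothetical toy reads that overlap along one short transcript.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)
```

In this toy case every node has a single outgoing edge, so the walk recovers one contig spanning all three reads. In real transcriptome data, alternative splice forms, paralogs, and allelic variants create branches in the graph, which is one source of the assembly complexity described above.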

Gene expression following character evolution

Comparative gene expression analysis is one of the most promising approaches to understanding phenotypic variation. Genetic changes in the mechanisms controlling gene expression, including regulatory mutations and rearrangements and the duplication and loss of genes, arise during character evolution and may account for the major morphological and physiological differences observed among species [24]. Examining shifts in gene expression using transcriptome data has unveiled key innovations attributed to the biological diversification of various plants, animals, and other species [7, 25-30]. The major key innovations in the evolution of land plants were pollen, seeds, flowers, and fruits. Pollen allowed plants to disperse their gametes over greater distances, and seeds allowed plants to reproduce independently of water. The evolution of the flower has been responsible for one of the greatest terrestrial radiations, the diversification of the flowering plants (angiosperms), famously referred to as the "abominable mystery" by Darwin [31-34].

Here I highlight a few studies in plant comparative genomics that have employed transcriptome data to identify genes implicated in the origin of novel traits. A common flower in temperate spring gardens is Aquilegia, also known as the columbine. Aquilegia is a basal eudicot in the Ranunculaceae and has been developed as a model system for investigating the evolution of petaloid organs in angiosperms [35, 36]. Notable floral innovations exhibited by Aquilegia include petal spurs and a novel floral organ type, the staminodium. Differential RNA-Seq expression evidence identified gene duplication as the source of staminodia in Aquilegia, in contrast to many other angiosperms that follow the traditional ABC model of flower development [25]. Tendrils are another likely key innovation; they evolved independently in numerous angiosperms, gymnosperms, and ferns with different ontogenetic origins, and are a great model for studying convergent evolution [37]. With the aid of transcriptomes of 13 Vitaceae (grape family) species, Zhang et al. (2014) [26] investigated the expression patterns of key floral meristem genes and found that the expression of the LEAFY gene determines whether an anlage develops into a tendril or an inflorescence. The mimicking of female insects by orchids of the genus Ophrys to attract male pollinators from only a single species is an interesting key innovation reported in plant evolution [27, 38, 39]. Sedeek et al. (2013) [27] sequenced transcriptomes of three closely related species of Ophrys and identified expressed genes potentially involved in pollinator attraction and reproductive isolation in orchids. Lastly, parasitic plants present a novel invasive organ, the haustorium [40-43], that is comprehensively described in other sections of this dissertation. Using transcriptomes of three closely related parasites in the Orobanchaceae that span the trophic spectrum of parasitism, Yang et al. (2014) [44] identified genes with tissue-specific expression in autotrophic plants that have been recruited to function in the haustoria of parasitic plants. These genes were also found to be a product of an ancient whole genome duplication event predating the split of this parasitic lineage from its autotrophic sister lineages.

Phylogenetic analysis

In plant systematics, the preponderance of molecular data used for phylogenetic reconstruction of evolutionary histories has traditionally come from nuclear ribosomal DNA (nrDNA) and plastid DNA (cpDNA) [2, 7]. The high copy numbers of both classes of markers in plant genomes have been useful because no cloning step is required when sequencing them, and restriction site analysis, as well as PCR amplification of specific regions, is easily accessible [7, 45, 46]. However, these markers are not without limitations for testing phylogenetic hypotheses. Because plastid markers are maternally inherited in most plant species, they can only track half of the parentage in plant lineages that have a history of hybridization or allopolyploidization [2, 46]. Incomplete concerted evolution across repeat units in a genome [46] and genome downsizing in hybrids or allopolyploids [47] may result in unhomogenized nrDNA, thus confounding phylogenetic inference.

Relevant to this dissertation is the problematic use of cpDNA markers for phylogenetic analyses of parasitic plants, whose plastomes tend to degrade with increased dependence on the host plant for nutrients and water [48-51]. The loss of most plastid-encoded genes from many parasitic plants severely limits the utility of plastome sequences for phylogenetic analyses of these organisms. Fortunately, advances in NGS, and more specifically transcriptome sequencing in non-model systems, have helped overcome some of these barriers. Transcriptome data are a cost-effective source of thousands of nuclear-encoded protein-coding gene sequences, and a treasure trove of phylogenetically informative markers because of their generally higher level of sequence variation compared to plastid or mitochondrial sequences. Additionally, unlike nrDNA and cpDNA, nuclear protein-coding genes are less frequently subject to concerted evolution and are biparentally inherited [46]. Many phylogenetic studies now regularly integrate data from genomes and transcriptomes to gain novel insight into the evolutionary history of genomes, gene families, and the tree of life.

Gene and genome duplication

Gene duplication is the primary mechanism for generating new genetic material in eukaryotic organisms; it occurs frequently [52] and facilitates functional and structural diversity [53]. There are multiple modes of gene duplication that give rise to different types of duplicates. These include unequal crossing over for tandem or proximal duplicates, replicative transposition and retrotransposition for dispersed duplicates, segmental duplication for copying large segments of chromosomes, and whole-genome duplication (WGD) for copying an organism's entire genome either once or multiple times [53-57]. Gene duplication has occurred across the breadth of eukaryotic phylogeny and is implicated in the origin of traits among animals, plants, and other evolutionary lineages [58-61]. WGD, which is prevalent in land plants, has been proposed to be a major player in species diversification and a driving force in the rapid radiation of angiosperms [54, 62, 63]. Comparative genomic studies of duplicate genes in plants are necessary to shed light on their evolutionary [64], ecological [39, 65, 66], and agronomic [67] impact.

The three main methods for WGD inference are gene collinearity (synteny) [68-72], age distribution of paralogs based on synonymous substitution rates (Ks) [73-75], and phylogenetic analysis of gene families [76, 77]. Synteny methods require a relatively high-quality genome assembly to capture large syntenic blocks with many collinear genes. Most non-model plants either do not have a sequenced genome or, if one is available, it is usually of draft quality, which limits WGD inference in these species to Ks and phylogenetic methods using transcriptome data. The Ks methods estimate WGDs under a null model of continuous small-scale gene duplication (SSD) and loss of duplicates that are not under selection. Therefore, in the absence of large-scale duplication events, the inferred Ks distribution should follow a negative exponential distribution, reflecting the continuous process of generation and decay of retained duplicates over time (Ks being a proxy for time). If WGDs occurred in the history of the genome, the distribution will show additional peaks of excess duplicates, approximately normal in shape, against the background at particular Ks ages. These peaks can be differentiated from segmental duplications if the genome structure is available. However, Ks methods are not without limitations. Saturation of substitutions can produce peaks in the distribution that do not reflect WGDs. Thus, older events are challenging to detect with Ks methods because of saturation coupled with lower gene retention rates [78-80]. To overcome these challenges, Jiao et al. (2011) [77] introduced a phylogenetic method that utilizes orthologs from multiple species to infer gene family phylogenies and identify duplication nodes. Genes extracted from the duplication nodes of the trees are then analyzed using Ks values to detect evidence for WGDs. This approach of combining methodologies can unravel much older WGDs even when there is no detectable synteny [77].
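The contrast between the SSD null model and a WGD signal in a Ks distribution can be illustrated with a small numerical sketch. The code below is a simplified illustration under stated assumptions, not the method used in this dissertation: the background is modeled as an exponential density and each WGD as a Gaussian component, and the function names (`ks_density`, `interior_peaks`) and every parameter value (decay rate 2.0, a WGD component with weight 0.4, mean Ks 0.8, and standard deviation 0.1) are hypothetical choices made only for demonstration.

```python
import math


def ks_density(ks, ssd_rate, wgd_components=()):
    """Density of Ks under an exponential SSD background plus optional
    Gaussian WGD components given as (weight, mean, sd) tuples.

    The background weight is 1 minus the sum of the WGD weights.
    """
    background_weight = 1.0 - sum(w for w, _, _ in wgd_components)
    density = background_weight * ssd_rate * math.exp(-ssd_rate * ks)
    for weight, mean, sd in wgd_components:
        z = (ks - mean) / sd
        density += weight * math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return density


def interior_peaks(grid, densities):
    """Return grid points whose density exceeds both neighbours."""
    return [
        round(grid[i], 2)
        for i in range(1, len(grid) - 1)
        if densities[i] > densities[i - 1] and densities[i] > densities[i + 1]
    ]


# Ks grid from 0.05 to 2.0 in steps of 0.05 (illustrative range).
grid = [i * 0.05 for i in range(1, 41)]

# Null model: continuous small-scale duplication and loss only.
null_density = [ks_density(ks, ssd_rate=2.0) for ks in grid]

# Alternative: the same background plus one hypothetical WGD component.
wgd_density = [ks_density(ks, 2.0, [(0.4, 0.8, 0.1)]) for ks in grid]

null_peaks = interior_peaks(grid, null_density)
wgd_peaks = interior_peaks(grid, wgd_density)
print("peaks under null model:", null_peaks)
print("peaks with WGD component:", wgd_peaks)
```

Under the null model the density decays monotonically, so no interior peak is found, whereas the added component produces a single peak near its mean. With real data the component parameters are unknown and would have to be estimated from observed Ks values, and saturation at high Ks complicates the interpretation, as noted above.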

Parasitic plants

Parasitic flowering plants attach directly to other plants to acquire water and nutrients using a parasitic structure called a haustorium. Several species of parasitic plants are pests of agronomic crops and can cause dramatic reductions in yield, representing a major risk to food security in many developing economies [81, 82]. Despite their large potential negative economic impact, parasitic plants are relatively understudied, and control strategies are poorly developed. Parasitic plants acquire all or part of their water and nutrients from other living plants and, in the process, cause damage to their hosts. There are 28 dicotyledonous families with parasitic lineages, representing approximately 1-2% of angiosperm species. Parasitism has independently evolved 12 times in angiosperms and therefore exhibits taxonomic diversity and morphological variation [41, 83, 84]. Parasitic plants are classified as either facultative or obligate, depending on their degree of host dependency [81]. Facultative parasites can complete their life cycle without dependence on a host but will opportunistically parasitize other plants. Obligate parasites, on the other hand, must parasitize a host to survive. Morphologically, facultative parasites are very similar to autotrophic plants except for their distinguishing haustoria, whereas obligate parasites, particularly those that are non-photosynthetic, are distinct because they lack fundamental plant features such as roots and leaves. Parasites are also either hemiparasites or holoparasites; the former can photosynthesize but obtain some or most of their carbon from a host, while the latter have entirely lost their photosynthetic ability. Another distinction among parasitic plants is whether they attach to the stem or the roots of the host to derive their water and nutrients [41, 81, 83, 85].

One common feature of all parasitic plants is the haustorium, a specialized organ that penetrates either the roots or shoots of the host and serves as a physiological bridge to obtain water and nutrients from the host vascular system [86]. Parasitic members of the Orobanchaceae develop two types of haustoria depending on their ability to photosynthesize. Hemiparasites with efficient photosynthesis develop lateral haustoria that connect to host xylem to extract mineral nutrients and water. Terminal haustoria, on the other hand, form only at the apex of the radicle and are a feature of holoparasites and photosynthetically inefficient hemiparasites. Holoparasites connect to host phloem to take up sugars and a rich collection of metabolites, in addition to siphoning water, mineral nutrients, and sugars from the host xylem sap [83].

The broomrape family, Orobanchaceae, a member of the Lamiales, is the only angiosperm family with species that span the full trophic spectrum of parasitism from autotrophs to semi-heterotrophs to full heterotrophs [81, 82]. Orobanchaceae is the youngest and one of the largest parasitic plant lineages, with an estimated 2000 species [87, 88]. Some members of Orobanchaceae are among the most devastating agricultural weeds globally. In particular, Orobanche (broomrapes) and Striga (witchweeds) are agronomically destructive weeds in Africa, India, and Southeast Asia, where crop yields are adversely affected [81, 82]. Therefore, parasitic Orobanchaceae provides a unique opportunity to investigate genetic changes associated with the transition of an organism from autotrophy to heterotrophy.

Content of this dissertation

The Parasitic Plant Genome Project (http://ppgp.huck.psu.edu) has sequenced expressed transcripts for several species, including all key life stages of three parasitic species that span the full spectrum of parasitism, as well as a free-living non-parasitic close relative, to leverage the evolutionary framework presented by Orobanchaceae. The four focal species are a facultative parasite (Triphysaria versicolor), a photosynthetically competent obligate parasite (Striga hermonthica), an obligate holoparasite (Phelipanche aegyptiaca), and the closest non-parasitic sister group to the parasitic Orobanchaceae (Lindenbergia philippensis) [82]. In this dissertation, I employed comparative genomic approaches to explore the origins and evolutionary history of plant parasitism in the Orobanchaceae family using the Parasitic Plant Genome Project datasets. In the second chapter, I present a gene family analysis framework that utilizes objective classifications of complete protein sequences from genomes for comparative and evolutionary analyses of gene families and transcriptomes on a genome scale. Subsequent chapters illustrate the utility of the toolsets I developed, together with other analytical methods from the bioinformatics community, to address my research questions. In the third chapter, I unravel gene and genome duplication events in the Orobanchaceae, utilizing the draft genome of Striga asiatica and transcriptomes of sister lineages, and tie the detected duplication events to genes important in parasitism. The fourth and last research chapter focuses primarily on gene family evolutionary dynamics in Orobanchaceae and the corresponding expression shifts underlying the origins of parasitism. I believe that this collection of studies advances our knowledge of parasitic plant evolution and will provide critical evidence for understanding parasite function and potential targets for parasite control.

References

1. Hirsch, C.N., Robin Buell, C. Tapping the promise of genomics in species with complex, nonmodel genomes. Annu. Rev. Plant Biol. 64, 89–110 (2013). 2. Zimmer, E.A., Wen, J. Using nuclear gene data for plant phylogenetics: progress and prospects. Mol. Phylogenet. Evol. 65, 774–785 (2012). 3. Maekawa, S., Suzuki, A., Sugano, S., Suzuki, Y. RNA sequencing: from sample preparation to analysis. In: Transcription Factor Regulatory Networks. pp. 51–65. Springer New York, New York, NY (2014). 4. Wang, Z., Gerstein, M., Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009). 5. Góngora-Castillo, E., Fedewa, G., Yeo, Y., Chappell, J., DellaPenna, D., Buell, C.R. Genomic approaches for interrogating the biochemistry of medicinal plant species. Meth. Enzymol. 517, 139–159 (2012). 6. Matasci, N., Hung, L.H., Yan, Z., Carpenter, E.J., Wickett, N.J., Mirarab, S., Nguyen, N., Warnow, T., Ayyampalayam, S., Barker, M., Burleigh, J.G., Gitzendanner, M.A., Wafula, E., Der, J.P., dePamphilis, C.W., Roure, B., Philippe, H., Ruhfel, B.R., Miles, N.W., Graham, S.W., Mathews, S., Surek, B., Melkonian, M., Soltis, D.E., Soltis, P.S., Rothfels, C., Pokorny, L., Shaw, J.A., DeGironimo, L., Stevenson, D.W., Villarreal, J.C., Chen, T., Kutchan, T.M., Rolf, M., Baucom, R.S., Deyholos, M.K., Samudrala, R., Tian, Z., Wu, X., Sun, X., Zhang, Y., Wang, J., Leebens-Mack, J., Wong, G.K.-S. Data access for the 1,000 Plants (1KP) project. Gigascience. 3, 17 (2014). 7. Wen, J., Egan, A.N., Dikow, R.B., Zimmer, E.A. Utility of transcriptome sequencing for phylogenetic inference and character evolution. In: Next- generation sequencing in plant systematics. Smithsonian Libraries. pp. 1-41 (2015). 8. Strickler, S.R., Bombarely, A., Mueller, L.A. Designing a transcriptome: next-generation sequencing project for a nonmodel plant species. Am. J. Bot. 99, 257–266 (2012). 9. Martin, J.A., Wang, Z. Next-generation transcriptome assembly. Nat. Rev. Genet. 
12, 671–682 (2011). 10. Wang, K., Singh, D., Zeng, Z., Coleman, S.J., Huang, Y., Savich, G.L., He,

12

X., Mieczkowski, P., Grimm, S.A., Perou, C.M., MacLeod, J.N., Chiang, D.Y., Prins, J.F., Liu, J. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 38, e178 (2010). 11. Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., Salzberg, S.L. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013). 12. Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., Gingeras, T.R. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 29, 15–21 (2013). 13. Kim, D., Langmead, B., Salzberg, S.L. HISAT: a fast spliced aligner with low memory requirements. Nature Methods. 12, 357–U121 (2015). 14. Wu, T.D., Reeder, J., Lawrence, M., Becker, G., Brauer, M.J. GMAP and GSNAP for genomic sequence alignment: enhancements to speed, accuracy, and functionality. Methods Mol. Biol. 1418, 283–334 (2016). 15. Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J., Pachter, L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010). 16. Guttman, M., Garber, M., Levin, J.Z., Donaghey, J., Robinson, J., Adiconis, X., Fan, L., Koziol, M.J., Gnirke, A., Nusbaum, C., Rinn, J.L., Lander, E.S., Regev, A. Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010). 17. Pertea, M., Pertea, G.M., Antonescu, C.M., Chang, T.-C., Mendell, J.T., Salzberg, S.L. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015). 18. 
Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N., Gnirke, A., Rhind, N., di Palma, F., Birren, B.W., Nusbaum, C., Lindblad-Toh, K., Friedman, N., Regev, A. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011). 19. Robertson, G., Schein, J., Chiu, R., Corbett, R., Field, M., Jackman, S.D., Mungall, K., Lee, S., Okada, H.M., Qian, J.Q., Griffith, M., Raymond, A., Thiessen, N., Cezard, T., Butterfield, Y.S., Newsome, R., Chan, S.K., She, R., Varhol, R., Kamoh, B., Prabhu, A.L., Tam, A., Zhao, Y., Moore, R.A., Hirst, M., Marra, M.A., Jones, S.J.M., Hoodless, P.A., Birol, I. De novo assembly and analysis of RNA-seq data. Nature Methods. 7, 909–912 (2010). 20. Schulz, M.H., Zerbino, D.R., Vingron, M., Birney, E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 28, 1086–1092 (2012). 21. Góngora-Castillo, E., Buell, C.R. Bioinformatics challenges in de novo transcriptome assembly using short read sequences in the absence of a reference genome sequence. Nat. Prod. Rep. 30, 490–11 (2013). 22. Ungaro, A., Pech, N., Martin, J.-F., McCairns, R.J.S., Mévy, J.P., Chappaz,

13

R., Gilles, A. Challenges and advances for transcriptome assembly in non- model species. PLoS ONE. 12, e0185020 (2017). 23. Honaas L.A., Wafula E.K., Wickett N.J., Der J.P., Zhang Y., Edger PP, Altman N.S., Pires J.C., Leebens-Mack J.H., dePamphilis C.W. Selecting superior de novo transcriptome assemblies: lessons learned by leveraging the best plant genome. PLoS ONE. 11, e0146062 (2016). 24. King, M.C., Wilson, A.C. Evolution at two levels in humans and chimpanzees. Science. 188, 107–116 (1975). 25. Sharma, B., Yant, L., Hodges, S.A., Kramer, E.M. Understanding the development and evolution of novel floral form in Aquilegia. Curr. Opin. Plant Biol. 17, 22–27 (2014). 26. Zhang, N., Wen, J., Zimmer, E.A. Expression patterns of AP1, FUL, FT and LEAFY orthologs in Vitaceae support the homology of tendrils and throughout the grape family. J. Systematics Evol. 53, 469–476 (2015). 27. Sedeek, K.E.M., Qi, W., Schauer, M.A., Gupta, A.K., Poveda, L., Xu, S., Liu, Z.-J., Grossniklaus, U., Schiestl, F.P., Schlüter, P.M. Transcriptome and proteome data reveal candidate genes for pollinator attraction in sexually deceptive orchids. PLoS ONE. 8, e64621 (2013). 28. Li, F.W., Villarreal, J.C., Kelly, S., Rothfels, C.J., Melkonian, M., Frangedakis, E., Ruhsam, M., Sigel, E.M., Der, J.P., Pittermann, J., Burge, D.O., Pokorny, L., Larsson, A., Chen, T., Weststrand, S., Thomas, P., Carpenter, E., Zhang, Y., Tian, Z., Chen, L., Yan, Z., Zhu, Y., Sun, X., Wang, J., Stevenson, D.W., Crandall-Stotler, B.J., Shaw, A.J., Deyholos, M.K., Soltis, D.E., Graham, S.W., Windham, M.D., Langdale, J.A., Wong, G.K.-S., Mathews, S., Pryer, K.M. Horizontal transfer of an adaptive chimeric photoreceptor from bryophytes to ferns. Proc. Natl. Acad. Sci. U.S.A. 111, 6672–6677 (2014). 29. 
Brawand, D., Wagner, C.E., Li, Y.I., Malinsky, M., Keller, I., Fan, S., Simakov, O., Ng, A.Y., Lim, Z.W., Bezault, E., Turner-Maier, J., Johnson, J., Alcazar, R., Noh, H.J., Russell, P., Aken, B., Alföldi, J., Amemiya, C., Azzouzi, N., Baroiller, J.-F., Barloy-Hubler, F., Berlin, A., Bloomquist, R., Carleton, K.L., Conte, M.A., D'Cotta, H., Eshel, O., Gaffney, L., Galibert, F., Gante, H.F., Gnerre, S., Greuter, L., Guyon, R., Haddad, N.S., Haerty, W., Harris, R.M., Hofmann, H.A., Hourlier, T., Hulata, G., Jaffe, D.B., Lara, M., Lee, A.P., MacCallum, I., Mwaiko, S., Nikaido, M., Nishihara, H., Ozouf- Costaz, C., Penman, D.J., Przybylski, D., Rakotomanga, M., Renn, S.C.P., Ribeiro, F.J., Ron, M., Salzburger, W., Sanchez-Pulido, L., Santos, M.E., Searle, S., Sharpe, T., Swofford, R., Tan, F.J., Williams, L., Young, S., Yin, S., Okada, N., Kocher, T.D., Miska, E.A., Lander, E.S., Venkatesh, B., Fernald, R.D., Meyer, A., Ponting, C.P., Streelman, J.T., Lindblad-Toh, K., Seehausen, O., di Palma, F. The genomic substrate for adaptive radiation in African cichlid fish. Nature. 513, 375–381 (2014). 30. Guo, B., Chain, F.J., Bornberg-Bauer, E., Leder, E.H., Merilä, J. Genomic divergence between nine- and three-spined sticklebacks. BMC Genomics. 14, 1–11 (2013).

14

31. Doyle, J.A. Phylogenetic analyses and morphological innovations in land plants. In: Annual Plant Reviews, vol. 45. (2018). 32. Amborella Genome Project: The Amborella genome and the evolution of flowering plants. Science. 342, 1241089 (2013). 33. Chanderbali, A.S., Albert, V.A., Leebens-Mack, J., Altman, N.S., Soltis, D.E., Soltis, P.S. Transcriptional signatures of ancient floral developmental genetics in avocado (Persea americana; Lauraceae). Proc. Natl. Acad. Sci. U.S.A. 106, 8929–8934 (2009). 34. Berendse, F., Scheffer, M. The angiosperm radiation revisited, an ecological explanation for Darwin’s “abominable mystery.” Ecology Letters. 12, 865– 872 (2009). 35. Kramer, E.M. Aquilegia: a new model for plant development, ecology, and evolution. Annu. Rev. Plant Biol. 60, 261–277 (2009). 36. Sharma, B., Yant, L., Hodges, S.A., Kramer, E.M. Understanding the development and evolution of novel floral form in Aquilegia. Curr. Opin. Plant Biol. 17, 22–27 (2014). 37. Sousa-Baena, M.S., Sinha, N.R., Hernandes-Lopes, J., Lohmann, L.G. Convergent evolution and the diverse ontogenetic origins of tendrils in angiosperms. Front. Plant Sci. 9, (2018). 38. Gaskett, A.C., Winnick, C.G., Herberstein, M.E. Orchid sexual deceit provokes ejaculation. Am. Nat. 171, E206–12 (2008). 39. Tremblay, R.L. Trends in the ecology of the Orchidaceae: evolution and systematics. Canadian J. Bot. 70, 642–650 (1992). 40. Kokla, A., Melnyk, C.W. Developing a thief: haustoria formation in parasitic plants. Dev. Biol. 442, 53–59 (2018). 41. Yoshida, S., Cui, S., Ichihashi, Y., Shirasu, K. The Haustorium, a specialized invasive organ in parasitic plants. Annu. Rev. Plant Biol. 67, 643–667 (2016). 42. Searcy, D.G. Measurements by DNA hybridization in vitro of the genetic basis of parasitic reduction. Evolution. 24, 207–219 (1970). 43. Searcy, D.G., MacInnis, A.J. Measurements by DNA renaturation of the genetic basis of parasitic reduction. Evolution. 24, 796–806 (1970). 44. 
Yang, Z., Wafula, E.K., Honaas, L.A., Zhang, H., Das, M., Fernandez- Aparicio, M., Huang, K., Bandaranayake, P.C.G., Wu, B., Der, J.P., Clarke, C.R., Ralph, P.E., Landherr, L., Altman, N.S., Timko, M.P., Yoder, J.I., Westwood, J.H., dePamphilis, C.W. Comparative transcriptome analyses reveal core parasitism genes and suggest gene duplication and repurposing as sources of sructural novelty. Mol. Biol. Evol. 32, 767–790 (2014). 45. Liu, K.Q., Liu, Z.P., Hao, J.-K., Chen, L., Zhao, X.-M. Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics. 13, 126 (2012). 46. Small, R.L., Cronn, R.C., Wendel, J.F. Use of nuclear genes for phylogeny reconstruction in plants. Australian Systematic Botany. 17, 145–170 (2004). 47. Renny-Byfield, S., Chester, M., Kovařík, A., Le Comber, S.C., Grandbastien, M.A., Deloger, M., Nichols, R.A., Macas, J., Novak, P., Chase, M.W., Leitch, A.R. Next generation sequencing reveals genome

15

downsizing in allotetraploid Nicotiana tabacum, predominantly through the elimination of paternally derived repetitive DNAs. Mol. Biol. Evol. 28, 2843–2854 (2011). 48. Funk, H.T., Berg, S., Krupinska, K., Maier, U.G., Krause, K. Complete DNA sequences of the plastid genomes of two parasitic species, Cuscuta reflexa and Cuscuta gronovii. BMC Plant Biol. 7, 45 (2007). 49. McNeal, J.R., Kuehl, J.V., Boore, J.L., dePamphilis, C.W. Complete plastid genome sequences suggest strong selection for retention of photosynthetic genes in the parasitic plant genus Cuscuta. BMC Plant Biol. 7, 57 (2007). 50. Wicke, S., Müller, K.F., dePamphilis, C.W., Quandt, D., Wickett, N.J., Zhang, Y., Renner, S.S., Schneeweiss, G.M. Mechanisms of functional and physical genome reduction in photosynthetic and nonphotosynthetic parasitic plants of the broomrape family. Plant Cell. 25, 3711–3725 (2013). 51. Petersen, G., Cuenca, A., Seberg, O. Plastome evolution in hemiparasitic mistletoes. Genome Biology and Evolution. 7, 2520–2532 (2015). 52. Lynch, M., Conery, J.S. The evolutionary fate and consequences of duplicate genes. Science. 290, 1151–1155 (2000). 53. Zhang, J.Z. Evolution by gene duplication: an update. Trends Ecol. Evol. 18, 292–298 (2003). 54. Panchy, N., Lehti-Shiu, M.D., Shiu, S.-H. Evolution of gene duplication in plants. Plant Physiol. 171, 2294-2316 (2016). 55. Kaessmann, H. Origins, evolution, and phenotypic impact of new genes. Genome Res. 20, 1313–1326 (2010). 56. Magadum, S., Banerjee, U., Murugan, P., Gangapur, D., Ravikesavan, R. Gene duplication as a major force in evolution. Journal of Genetics. 92, 155– 161 (2013). 57. Freeling, M. Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annu. Rev. Plant Biol. 60, 433–453 (2009). 58. Donoghue, P., Purnell, M.A. Genome duplication, extinction and vertebrate evolution. Trends Ecol. Evol. 20, 312–319 (2005). 59. 
Schwager, E.E., Sharma, P.P., Clarke, T., Leite, D.J., Wierschin, T., Pechmann, M., Akiyama-Oda, Y., Esposito, L., Bechsgaard, J., Bilde, T., Buffry, A.D., Chao, H., Dinh, H., Doddapaneni, H., Dugan, S., Eibner, C., Extavour, C.G., Funch, P., Garb, J., Gonzalez, L.B., González, V.L., Griffiths-Jones, S., Han, Y., Hayashi, C., Hilbrant, M., Hughes, D.S.T., Janssen, R., Lee, S.L., Maeso, I., Murali, S.C., Muzny, D.M., da Fonseca, R.N., Paese, C.L.B., Qu, J., Ronshaugen, M., Schomburg, C., Schonauer, A., Stollewerk, A., Torres-Oliva, M., Turetzek, N., Vanthournout, B., Werren, J.H., Wolff, C., Worley, K.C., Bucher, G., Gibbs, R.A., Coddington, J., Oda, H., Stanke, M., Ayoub, N.A., Prpic, N.-M., Flot, J.-F., Posnien, N., Richards, S., McGregor, A.P. The house spider genome reveals an ancient whole- genome duplication during arachnid evolution. BMC Biol. 15, –27 (2017). 60. Marcet-Houben, M., Gabaldón, T. Beyond the whole-genome duplication: phylogenetic evidence for an ancient interspecies hybridization in the baker's yeast lineage. PLoS Biol. 13, e1002220 (2015).

16

61. Li, Z., Tiley, G.P., Galuska, S.R., Reardon, C.R., Kidder, T.I., Rundell, R.J., Barker, M.S. Multiple large-scale gene and genome duplications during the evolution of hexapods. Proc. Natl. Acad. Sci. U.S.A. 115, 4713–4718 (2018). 62. Van de Peer, Y., Mizrachi, E., Marchal, K. The evolutionary significance of polyploidy. Nat. Rev. Genet. 18, 411–424 (2017). 63. Clark, J.W., Donoghue, P.C.J. Constraining the timing of whole genome duplication in plant evolutionary history. Proc. Biol. Sci. 284, pii: 20170912 (2017). 64. Van de Peer, Y., Maere, S., Meyer, A. The evolutionary significance of ancient genome duplications. Nat. Rev. Genet. 10, 725–732 (2009). 65. Fenster, C.B., Armbruster, W.S., Wilson, P., Dudash, M.R., Thomson, J.D. Pollination syndromes and floral specialization. Annu. Rev. Ecol. Syst. 35, 375-403 (2004). 66. Edger, P.P., Heidel-Fischer, H.M., Bekaert, M., Rota, J., Glöckner, G., Platts, A.E., Heckel, D.G., Der, J.P., Wafula, E.K., Tang, M., Hofberger, J.A., Smithson, A., Hall, J.C., Blanchette, M., Bureau, T.E., Wright, S.I., dePamphilis, C.W., Schranz, M.E., Barker, M.S., Conant, G.C., Wahlberg, N., Vogel, H., Pires, J.C., Wheat, C.W. The butterfly plant arms-race escalated by gene and genome duplications. Proc. Natl. Acad. Sci. U.S.A. 112, 8362–8366 (2015). 67. Renny-Byfield, S., Wendel, J.F. Doubling down on genomes: polyploidy and crop plants. Am. J. Bot. 101, 1711–1725 (2014). 68. Vision, T.J., Brown, D.G., Tanksley, S.D. The origins of genomic duplications in Arabidopsis. Science. 290, 2114–2117 (2000). 69. Tang, H., Bowers, J.E., Wang, X., Ming, R., Alam, M., Paterson, A.H. Synteny and collinearity in plant genomes. Science. 320, 486–488 (2008). 70. Bowers, J.E., Chapman, B.A., Rong, J., Paterson, A.H. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature. 422, 433–438 (2003). 71. 
Jaillon, O., Aury, J.-M., Noel, B., Policriti, A., Clepet, C., Casagrande, A., Choisne, N., Aubourg, S., Vitulo, N., Jubin, C., Vezzi, A., Legeai, F., Hugueney, P., Dasilva, C., Horner, D., Mica, E., Jublot, D., Poulain, J., Bruyere, C., Billault, A., Ségurens, B., Gouyvenoux, M., Ugarte, E., Cattonaro, F., Anthouard, V., Vico, V., Del Fabbro, C., Alaux, M., Di Gaspero, G., Dumas, V., Felice, N., Paillard, S., Juman, I., Moroldo, M., Scalabrin, S., Canaguier, A., Le Clainche, I., Malacrida, G., Durand, E., Pesole, G., Laucou, V., Chatelet, P., Merdinoglu, D., Delledonne, M., Pezzotti, M., Lecharny, A., Scarpelli, C., Artiguenave, F., Pe, M.E., Valle, G., Morgante, M., Caboche, M., Adam-Blondon, A.-F., Weissenbach, J., Quétier, F., Wincker, P., French-Italian Public Consortium for Grapevine Genome Characterization. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 449, 463–467 (2007). 72. Tang, H., Wang, X., Bowers, J.E., Ming, R., Alam, M., Paterson, A.H. Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Res. 18, 1944–1954 (2008).

17

73. Cui, L., Wall, P.K., Leebens-Mack, J.H., Lindsay, B.G., Soltis, D.E., Doyle, J.J., Soltis, P.S., Carlson, J.E., Arumuganathan, K., Barakat, A., Albert, V.A., Ma, H., dePamphilis, C.W. Widespread genome duplications throughout the history of flowering plants. Genome Res. 16, 738–749 (2006). 74. Blanc, G., Wolfe, K.H. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell. 16, 1667–1678 (2004). 75. Maere, S., de Bodt, S., Raes, J., Casneuf, T., Van Montagu, M., Kuiper, M., Van de Peer, Y. Modeling gene and genome duplications in eukaryotes. Proc. Natl. Acad. Sci. U.S.A. 102, 5454–5459 (2005). 76. Jiao, Y., Leebens-Mack, J., Ayyampalayam, S., Bowers, J.E., McKain, M.R., McNeal, J., Rolf, M., Ruzicka, D.R., Wafula, E., Wickett, N.J., Wu, X., Zhang, Y., Wang, J., Zhang, Y., Carpenter, E.J., Deyholos, M.K., Kutchan, T.M., Chanderbali, A.S., Soltis, P.S., Stevenson, D.W., McCombie, R., Pires, J.C., Wong, G.K.-S., Soltis, D.E., dePamphilis, C.W. A genome triplication associated with early diversification of the core . Genome Biol. 13, R3 (2012). 77. Jiao, Y., Wickett, N.J., Ayyampalayam, S., Chanderbali, A.S., Landherr, L., Ralph, P.E., Tomsho, L.P., Hu, Y., Liang, H., Soltis, P.S., Soltis, D.E., Clifton, S.W., Schlarbaum, S.E., Schuster, S.C., Ma, H., Leebens-Mack, J., dePamphilis, C.W. Ancestral polyploidy in seed plants and angiosperms. Nature. 473, 97–100 (2011). 78. Tiley, G.P., Barker, M.S., Burleigh, J.G. Assessing the performance of Ks plots for detecting ancient whole genome duplications. Genome Biol. Evol. 1–17 (2018). 79. Rabier, C.E., Ta, T., Ane, C. Detecting and locating whole genome duplications on a phylogeny: a probabilistic approach. Mol. Biol. Evol. 31, 750–762 (2014). 80. Zwaenepoel, A., Van de Peer, Y. wgd—simple command line tools for the analysis of ancient whole-genome duplications. Bioinformatics, 35(12), pp.2153-2155 (2018). 81. 
Westwood, J.H., Yoder, J.I., Timko, M.P., dePamphilis, C.W. The evolution of parasitism in plants. Trends Plant Sci. 15, 227–235 (2010). 82. Westwood, J.H., dePamphilis, C.W., Das, M., Fernandez-Aparicio, M., Honaas, L.A., Timko, M.P., Wafula, E.K., Wickett, N.J., Yoder, J.I. The Parasitic Plant Genome Project: new tools for understanding the biology of Orobanche and Striga. Weed Science. 60, 295–306 (2012). 83. Těšitel, J. Functional biology of parasitic plants: a review. Plant Ecol. Evol. 149, 5–20 (2016). 84. Barkman, T.J., McNeal, J.R., Lim, S.-H., Coat, G., Croom, H.B., Young, N.D., dePamphilis, C.W. Mitochondrial DNA suggests at least 11 origins of parasitism in angiosperms and reveals genomic chimerism in parasitic plants. BMC Evol. Biol. 7, 1–15 (2007). 85. Heide-Jørgensen, H.S. Introduction: The Parasitic syndrome in higher plants. In: Parasitic Orobanchaceae. pp. 1–18. Springer, Berlin, Heidelberg (2013).

18

86. Joel, D.M. The haustorium and the life cycles of parasitic Orobanchaceae. In: Parasitic Orobanchaceae. pp. 21–23. Springer, Berlin, Heidelberg (2013). 87. Naumann, J., Salomo, K., Der, J.P., Wafula, E.K., Bolin, J.F., Maass, E., Frenzke, L., Samain, M.-S., Neinhuis, C., dePamphilis, C.W., Wanke, S. Single-copy nuclear genes place haustorial Hydnoraceae within piperales and reveal a cretaceous origin of multiple parasitic angiosperm lineages. PLoS ONE. 8, e79204 (2013). 88. Wicke, S. Genomic Evolution in Orobanchaceae. In: Parasitic Orobanchaceae. pp. 267–286. Springer, Berlin, Heidelberg (2013).

19

Chapter 2

PlantTribes 2: tools for comparative gene family analysis in plant genomics

This chapter represents my original development of PlantTribes 2. Integration into the Galaxy framework was done in collaboration with Greg Von Kuster.

Abstract

The rapid development of sequencing technologies coupled with the continuous drop in the cost of sequencing has facilitated the availability of a large number of complete genome sequences across the plant tree of life. However, sequencing some non-model organisms is still cost-prohibitive because plant genomes vary widely in size and complexity and they can be very challenging to assemble. Therefore, many researchers are now relying on comparative genomics approaches that integrate data from genomes and transcriptomes to gain novel insight into the evolutionary history of genomes and gene families, including complex non-model organisms. Here, I present PlantTribes 2, a gene family analysis framework that utilizes objective classifications of complete protein sequences from genomes for comparative and evolutionary analyses of gene families and transcriptomes on a genome-scale, for all types of organisms, including fungi, microbes, animals, plants, and viruses. PlantTribes 2 post-processes de novo assembly transcripts, assigns post-processed transcripts or into pre-computed orthologous gene family clusters of fully sequenced genomes, and estimates gene family multiple sequence alignments and their corresponding phylogenetic trees.

Additionally, synonymous substitution rates among paralogous transcripts can be estimated to infer large-scale duplication events. PlantTribes 2 is freely available for use at https://usegalaxy.org and can be downloaded at https://github.com/dePamphilis/PlantTribes.

Introduction

A rapid and continuing decline in sequencing costs over the 30 years since the introduction of automated sequencing methods has contributed to the generation of massive amounts of transcriptome and genome sequence data for non-model organisms by the plant genomics community [1-3]. Integrating data from sequenced genomes and transcriptomes representing diverse plant lineages in phylogenetic studies can provide the evolutionary context necessary for understanding gene function evolution [4-9], for resolving species relationships [10-15], and for resolving gene and genome duplications (One Thousand Plant Transcriptomes Initiative in press, 2019) [16-19]. However, many analytical frameworks in large-scale phylogenomic studies that utilize data from both genomes and transcriptomes require bioinformatic expertise and computational resources that are out of reach for many researchers [20, 21]. For example, a large-scale phylogenomic study may require objectively circumscribing proteomes of representative sequenced genomes into gene families (orthologous groups) that can serve as a scaffold into which to sort coding regions inferred from transcriptomes or new genomes, building multiple sequence alignments, and performing phylogenetic inferences. Additionally, member genes in families may need to be annotated with functional information that makes it possible to investigate the evolution of lineages.

Most of the existing gene family resources are static and not well suited to accommodate new data sets [22-29]. A few analytical pipelines address some of these computational needs [21, 30-34], but PlantTribes 2 provides the flexibility of integrating gene family circumscriptions from any source or organism, including expertly curated gene families, to serve as a scaffold for readily classifying protein-coding regions from transcriptomes or new genomes. PlantTribes 2 also provides a scalable framework that leverages information from diverse plant genomes to optimize circumscription of gene families from both an evolutionary and functional perspective.

An initial version of PlantTribes was developed and introduced by Wall et al. (2008) [22]. That version implemented OrthoMCL [35] clustering of proteomes from sequenced plant genomes and a series of downstream analyses in a relational database. However, the initial PlantTribes was static, had limited analytical tools, and has become outmoded. In this research, I have completely revamped PlantTribes from a static relational database into a flexible analysis pipeline with all new code, new features, and extensive testing, and made available a carefully annotated version with training tutorials. In collaboration with the Galaxy developers, we integrated PlantTribes 2 into the Galaxy framework, where it is now accessible to the research community on the main Galaxy analysis portal and the Galaxy ToolShed, allowing researchers to integrate PlantTribes 2 tools into their own instances of the Galaxy framework. This chapter explains the implementation of PlantTribes 2 and its features, and subsequent chapters demonstrate its utility in evolutionary analyses of genome-scale gene families and transcriptomes.

Pipeline implementation

The new PlantTribes 2 is a collection of automated modular analysis pipelines that utilize objective classifications of complete protein sequences from sequenced genomes for comparative and evolutionary analyses of genome-scale gene families and transcriptomes. At the core of PlantTribes 2 analyses are the gene family scaffolds, which are clusters of orthologous and paralogous sequences from specified sets of proteomes. A series of modular analytical pipelines and scripts interact with scaffolds to perform comparative and evolutionary analyses. Each PlantTribes 2 scaffold has up to three sets of clusters circumscribed by the GFam (clusters of consensus domain architecture) [36], OrthoFinder (broadly defined clusters) [32], or OrthoMCL (narrowly defined clusters) [35] algorithms. Externally circumscribed scaffolds can be integrated into PlantTribes 2 for user-specific gene family analyses as long as they are converted to the PlantTribes 2 data format as described on the PlantTribes 2 website (https://github.com/dePamphilis/PlantTribes). Although we provide examples, sample datasets, and gene family scaffolds for plants, the entire system can be readily used with genome and transcriptome information from any group of related organisms. The pipelines invoke external bioinformatic tools in addition to novel PlantTribes 2 methods to carry out several distinct analyses, as illustrated in Figure 2-1.

Figure 2-1: PlantTribes 2 analysis workflow. A schematic diagram broadly illustrating the PlantTribes 2 modular analysis workflow. Inputs and outputs are indicated in light green, software analysis modules in dark green, and gene family scaffold datasets in yellow. (1) A user provides a de novo transcriptome assembly for post-processing, resulting in a non-redundant set of predicted coding sequences and their corresponding translations. (2) The post-processed assembly is searched against a gene family scaffold BLAST and/or HMM database(s), and transcripts are assigned into their putative orthogroups with corresponding metadata. (3) Classified assembly transcripts are integrated with their corresponding scaffold gene models to estimate orthogroup multiple sequence alignments and corresponding phylogenetic trees. (4) Optionally, synonymous substitution rates (Ks) of paralogs, either from the post-processed assembly or inferred from the phylogenetic trees, are estimated. The Ks results are used to detect large-scale duplication events and to test many other evolutionary hypotheses.

PlantTribes 2 post-processes de novo assembly transcripts into putative coding sequences and their corresponding amino acid translations, estimates paralogous and orthologous pairwise synonymous and non-synonymous substitution rates for a set of gene sequences, classifies gene sequences into pre-computed orthologous plant gene family clusters, and builds gene family multiple sequence alignments and their corresponding phylogenies. A user provides de novo assembly transcripts and PlantTribes 2 produces: (1) predicted coding sequences and their corresponding translations, (2) a table of pairwise synonymous/non-synonymous substitution rates for either orthologous or paralogous transcript pairs, (3) results of significant duplication components in the distribution of synonymous substitution rates (Ks), (4) a summary table for transcripts classified into orthologous plant gene family clusters with their corresponding functional annotations, (5) gene family amino acid and nucleotide fasta sequences, (6) multiple sequence alignments, and (7) inferred maximum likelihood phylogenies (Figure 2-1). Optionally, a user can provide an external gene family scaffold and/or externally predicted coding sequences derived from a transcriptome assembly or gene predictions from a sequenced genome. PlantTribes 2 is freely available as a graphical user interface application on the main Galaxy analysis portal (https://usegalaxy.org/) [37] and the main Galaxy ToolShed (https://galaxyproject.org/toolshed/) [38]. The ToolShed serves as a repository for thousands of Galaxy utilities and allows administrators to install any of the utilities into their instances of the Galaxy framework. In addition, the standalone version of the pipeline is available for download on GitHub (https://github.com/dePamphilis/PlantTribes) as a command-line interface for batch processing of many datasets.

Gene family scaffolds

The current release of PlantTribes 2 (v1.0.4) provides several plant gene family scaffolds utilized in previously published and ongoing phylogenomic studies [8, 11, 16, 39, 40]. Complete sets of protein-coding genes from plant genomes represented in each of the PlantTribes 2 scaffolds were clustered into gene families (i.e., orthogroups) using at least one of the following protein clustering methods: GFam [36], a consensus domain architecture-based clustering method, and two widely used methods, OrthoMCL [35] and OrthoFinder [32], that infer orthology by gene descent and duplication. I further performed additional clustering of primary gene family clusters using the MCL algorithm [41] to connect distantly but potentially related orthogroups into larger hierarchical gene families (i.e., super-orthogroups), as described in Wall et al. (2008) [22]. I used 10 MCL stringencies with inflation values from 1.2 to 5.0 to cluster gene families of each PlantTribes 2 scaffold method into larger groups that are likely characterized by shared functional domains. Additionally, I annotated each orthogroup with gene function information from biological databases, including Gene Ontology (GO) [42, 43], InterPro/Pfam protein domains [44-46], The Arabidopsis Information Resource (TAIR) [47-49], UniProtKB/TrEMBL [50], and UniProtKB/Swiss-Prot [50]. PlantTribes 2 scaffold data sets include (1) orthogroup protein coding sequence fasta files, (2) orthogroup protein multiple sequence alignments, (3) orthogroup protein HMM profiles, (4) a scaffold protein BLAST database, (5) a scaffold protein HMM profile database, and (6) templates for analysis pipelines with scaffold metadata. A detailed description is available on the GitHub repository for how to build a PlantTribes 2 gene family scaffold, beginning with unclassified genome-scale gene sets or converting an existing gene family circumscription and corresponding metadata to a format that is usable with the PlantTribes 2 tools.
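The super-orthogroup step can be pictured as repeated MCL runs over one orthogroup similarity graph, one run per inflation value. The following is a minimal sketch under stated assumptions, not the PlantTribes 2 implementation: it assumes the graph is stored in MCL's label ("abc") edge-list format, the file names are hypothetical, and the two inflation values shown are only the endpoints of the 1.2 to 5.0 range (the ten values actually used are not listed here). The function only builds command lines; it does not run MCL.

```python
from typing import List, Sequence


def mcl_commands(graph_abc: str, inflations: Sequence[float], prefix: str) -> List[List[str]]:
    """Build one `mcl` command line per inflation value (file names are hypothetical)."""
    commands = []
    for inflation in inflations:
        out_file = f"{prefix}.I{inflation:.1f}.clusters"
        # --abc: input is a label-format edge list; -I: inflation value; -o: output file
        commands.append(["mcl", graph_abc, "--abc", "-I", f"{inflation:.1f}", "-o", out_file])
    return commands


if __name__ == "__main__":
    # Illustrative values only: the lowest and highest stringencies mentioned in the text.
    for cmd in mcl_commands("orthogroups.abc", [1.2, 5.0], "super_orthogroups"):
        print(" ".join(cmd))
```

Lower inflation values tend to yield fewer, larger clusters and higher values yield more, smaller clusters, which is why a range of stringencies gives a hierarchy of super-orthogroups.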

Analysis tools

Assembly post-processing

The AssemblyPostProcessor [51] tool is the entry point of a PlantTribes 2 analysis when the input data is a primary de novo transcriptome assembly. The pipeline uses either ESTScan [52] or TransDecoder [53] to transform de novo assembled contigs into putative coding sequences and their corresponding amino acid translations. Optionally, the resulting predicted coding regions can be filtered to remove duplicated sequences and exact subsequences using GenomeTools [54], a genome analysis software package. The pipeline also implements an additional assembly post-processing method that utilizes scaffold orthogroups to reduce fragmentation in a de novo assembly. Homology searches of post-processed transcripts against HMM profiles of targeted orthogroups are conducted using HMMER hmmsearch [55]. After assignment of transcripts to targeted orthogroups, orthogroup-specific gene assembly of overlapping primary contigs is performed using CAP3 [56], an overlap-layout-consensus assembler. Lastly, protein multiple sequence alignments of orthogroups are estimated and trimmed using MAFFT [57] and trimAl [58], respectively, to aid in identifying targeted assembled transcripts that are orthologous to the backbone reference gene models based on the global sequence alignment coverage.
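The optional redundancy filter can be illustrated with a small pure-Python sketch. This is not how GenomeTools implements the step; it is a quadratic-time illustration that assumes sequences fit in memory, compares sequences in one orientation only (reverse complements are ignored), and uses hypothetical transcript identifiers.

```python
from typing import Dict


def remove_redundant(seqs: Dict[str, str]) -> Dict[str, str]:
    """Drop sequences that are identical to, or exact substrings of, a kept sequence."""
    kept: Dict[str, str] = {}
    # Visit longest sequences first so that any sequence able to contain another
    # is seen before it; ties are broken by identifier for a deterministic result.
    for name, seq in sorted(seqs.items(), key=lambda item: (-len(item[1]), item[0])):
        seq = seq.upper()
        if any(seq in other for other in kept.values()):
            continue  # exact duplicate or exact subsequence of a kept sequence
        kept[name] = seq
    return kept


if __name__ == "__main__":
    example = {"t1": "ATGGCCTAA", "t2": "ATGGCCTAA", "t3": "GGCC", "t4": "ATGTTTTAA"}
    print(sorted(remove_redundant(example)))
```

In this example, t2 duplicates t1 and t3 is an exact subsequence of t1, so only t1 and t4 are retained.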

Gene family classification

The GeneFamilyClassifier tool classifies gene coding sequences, either produced by the AssemblyPostProcessor tool or from an external source, into pre-computed orthologous gene family clusters (orthogroups) of a PlantTribes 2 scaffold using BLASTp, HMMER hmmscan, or both classifiers. Classified sequences are then assigned the corresponding orthogroups' metadata, which includes gene counts of backbone taxa, superclusters (super-orthogroups) at multiple clustering stringencies, and annotations from functional genomic databases that include Gene Ontology (GO), InterPro protein domains, TAIR, UniProtKB/TrEMBL, and UniProtKB/Swiss-Prot. Additionally, sequences belonging to single/low-copy gene families, which are mainly utilized in species tree inference, can be determined. The GeneFamilyIntegrator tool then integrates PlantTribes 2 scaffold orthogroup backbone gene models with gene coding sequences classified into the scaffold by the GeneFamilyClassifier tool. Sequences from an external source can also be integrated into targeted orthogroups.

Gene family alignment estimation

The GeneFamilyAligner tool estimates protein and codon multiple sequence alignments of integrated orthologous gene family fasta files produced by the GeneFamilyIntegrator tool. Orthogroup alignments are estimated using either MAFFT's L-INS-i algorithm or the divide-and-conquer approach implemented in the PASTA [59] pipeline for large alignments. Optional post-alignment processing includes trimming out sites that are predominantly gaps [58], removing sequences with very low global orthogroup alignment coverage, and realigning orthogroup sequences following site trimming and sequence removal. In the Galaxy framework, the MSAViewer [60] plugin allows orthogroup fasta multiple sequence alignments produced by the GeneFamilyAligner to be visualized, and the alignments can be edited using Jalview Java Web Start [61] (Figure 2-2).
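
The two optional post-alignment steps, trimming gap-rich sites and removing low-coverage sequences, can be sketched in a few lines of Python. This is an illustration only and not the trimAl or PlantTribes 2 code: the function names, the toy alignment, and the 0.5 thresholds are assumptions chosen for the example.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.9):
    """Keep only alignment columns whose gap fraction is at most max_gap_fraction."""
    names = list(alignment)
    rows = [alignment[name] for name in names]
    n_seqs = len(rows)
    keep = [
        i for i in range(len(rows[0]))
        if sum(row[i] == "-" for row in rows) / n_seqs <= max_gap_fraction
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


def coverage(aligned_seq):
    """Fraction of alignment columns in which the sequence has a residue."""
    return sum(char != "-" for char in aligned_seq) / len(aligned_seq)


def filter_low_coverage(alignment, min_coverage=0.5):
    """Remove sequences whose global alignment coverage is below min_coverage."""
    return {name: seq for name, seq in alignment.items() if coverage(seq) >= min_coverage}


# Toy orthogroup protein alignment (hypothetical sequences).
msa = {
    "geneA": "MK-LVTA-",
    "geneB": "MK-LVSA-",
    "geneC": "MKQLVTAG",
    "frag1": "-----TA-",
}
trimmed = trim_gappy_columns(msa, max_gap_fraction=0.5)
kept = filter_low_coverage(trimmed, min_coverage=0.5)
print(trimmed)
print(sorted(kept))
```

In a real analysis the retained sequences would then be realigned, since removing sites and sequences can change the optimal alignment.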

Figure 2-2: Gene family alignment, alignment visualization, and alignment editing. An illustration of an orthogroup multiple sequence alignment produced by the Galaxy PlantTribes 2 GeneFamilyAligner using the test dataset that can be visualized in Galaxy using the MSAViewer plugin and manually edited with Jalview Java Web Start.

Gene family phylogenetic inference

The GeneFamilyPhylogenyBuilder tool performs gene family phylogenetic inference of multiple sequence alignments produced by the GeneFamilyAligner tool. PlantTribes 2 estimates maximum likelihood (ML) phylogenetic trees using either the RAxML [62] or FastTree [63] algorithm. Optional tree optimization includes setting the number of bootstrap replicates for RAxML to conduct a rapid bootstrap analysis, searching for the best-scoring ML tree, and rooting the inferred phylogenetic tree with the most distant taxon in the orthogroup or with specified taxa. In the Galaxy framework, either the Phylogenetic Tree Visualization plugin or the PHYLOViZ [64] plugin provides several options for rendering the phylogenetic trees produced by the GeneFamilyPhylogenyBuilder (Figure 2-3).

Figure 2-3: Gene family phylogenetic tree visualization. An illustration of an orthogroup phylogenetic tree produced by the Galaxy PlantTribes 2 GeneFamilyPhylogenyBuilder using the test dataset that can be visualized in Galaxy using either the Phylogenetic Tree Visualization plugin or the PHYLOViZ plugin. The Phylogenetic Tree Visualization plugin provides several options for tree rendering.

Estimation of genome duplications

The KaKsAnalysis tool estimates paralogous and orthologous pairwise synonymous (Ks) and non-synonymous (Ka) substitution rates using PAML [65] for a set of protein coding gene sequences (produced by the AssemblyPostProcessor), with duplicates inferred from the phylogenomic analysis (using both the GeneFamilyClassifier and GeneFamilyPhylogenyBuilder) or from an external source. Optionally, the resulting set of estimated Ks values can be clustered into components using a mixture of multivariate normal distributions, implemented in the EMMIX [66] software, to identify significant duplication event(s) in a species or a pair of species. The KsDistribution tool then plots the Ks rates and fits the estimated significant component(s) onto the distribution (Figure 2-4).
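
The idea behind clustering Ks values into components can be illustrated with a small expectation-maximization (EM) sketch. It is not the EMMIX implementation: it fits a univariate normal mixture with a fixed number of components and does not test their significance. The function names, starting means, and simulated Ks values are all hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_mixture(values, means, n_iter=200, min_sd=1e-3):
    """Fit a univariate normal mixture by EM, starting from the given component means."""
    k = len(means)
    means = list(means)
    spread = (max(values) - min(values)) / (2 * k) or min_sd
    sds = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each Ks value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weight, mean, and standard deviation of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            sds[j] = max(math.sqrt(var), min_sd)
    return sorted(zip(means, sds, weights))


# Simulated Ks values with two duplication peaks (not data from this study).
rng = random.Random(42)
ks_values = [rng.gauss(0.2, 0.05) for _ in range(200)]
ks_values += [rng.gauss(1.0, 0.15) for _ in range(200)]

components = fit_mixture(ks_values, means=[0.1, 1.2])
for mean, sd, weight in components:
    print(f"mean Ks = {mean:.2f}, sd = {sd:.2f}, weight = {weight:.2f}")
```

Each fitted component mean marks a candidate peak in the Ks distribution. Choosing the number of components and judging which ones are significant requires a model selection step that this sketch omits.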

Figure 2-4: Estimation of whole genome duplication events. An illustration of genome duplication events detected using the Galaxy PlantTribes 2 KaKsAnalysis tool. The distribution of estimated paralogous Ks values is clustered into components using a mixture of multivariate normal distributions to identify significant duplication event(s).

Pipeline test dataset

PlantTribes 2 provides a small test dataset, shown in Table 2-1, that allows a user to validate that all modular components of the pipeline are working correctly while learning how to use the tools and becoming familiar with the analysis results. For advanced users who would like to substitute external pipeline dependencies with tools of their choice, the test data also serves to check whether code changes work well with existing features. The test dataset consists of (1) a subset of a plant transcriptome de novo assembly of Phelipanche aegyptiaca from the Parasitic Plant Genome Project (http://ppgp.huck.psu.edu) [8, 67] database, pre-filtered to select low-copy genes to enable quick test run times, (2) a list of targeted single-copy gene orthogroup identifiers, corresponding to the PlantTribes 2 22Gv1.1 gene family scaffold, that were assigned to the de novo assembly transcripts of the test dataset, and (3) pre-identified whole genome duplication genes from two plant species (Striga asiatica and Mimulus guttatus) [68]. Detailed step-by-step tutorials using the test data are available for both the Galaxy and the command-line versions of the pipeline.

Table 2-1: PlantTribes 2 test dataset. A small test dataset is included with the PlantTribes 2 distribution that allows users to validate that all the features work properly through a series of analysis tutorials.

| Dataset | Description |
| --- | --- |
| assembly.fasta | a sub-set of a plant transcriptome de novo assembly |
| targetOrthos.ids | a list of targeted single-copy genes orthogroup identifiers corresponding to the PlantTribes 2 22Gv1.1 gene family scaffold |
| species1.fna | a sub-set of coding sequences (CDS) for the first species for estimating paralogous and orthologous pairwise synonymous (Ks) and non-synonymous (Ka) substitution rates |
| species1.faa | corresponding protein sequences for the first species for estimating paralogous and orthologous pairwise synonymous (Ks) and non-synonymous (Ka) substitution rates |
| species2.fna | a sub-set of coding sequences (CDS) for the second species for estimating paralogous and orthologous pairwise synonymous (Ks) and non-synonymous (Ka) substitution rates |
| species2.faa | corresponding protein sequences for the second species for estimating paralogous and orthologous pairwise synonymous (Ks) and non-synonymous (Ka) substitution rates |


Performance evaluation

Evaluation of gene family inference methods

In order to examine how closely the protein clustering methods used in PlantTribes 2 recapitulate functional gene families, I obtained 181 Arabidopsis thaliana expert curated gene families, consisting of 925 subfamilies, from the TAIR database [47-49] and evaluated how member genes are circumscribed in the three clustering methods of the 22Gv1.1 gene family scaffold. The PlantTribes 2 22Gv1.1 gene family scaffold contains the annotated protein coding sequences (CDS) for 22 representative land plant genomes, including nine rosids (Arabidopsis thaliana, Thellungiella parvula, Carica papaya, Theobroma cacao, Populus trichocarpa, Fragaria vesca, Medicago truncatula, Glycine max, and Vitis vinifera), three asterids (Solanum tuberosum, Solanum lycopersicum, and Mimulus guttatus), two basal eudicots (Aquilegia coerulea and Nelumbo nucifera), five monocots (Sorghum bicolor, Brachypodium distachyon, Oryza sativa, Musa acuminata, and Phoenix dactylifera), one basal angiosperm (Amborella trichopoda), one lycophyte (Selaginella moellendorffii), and one moss (Physcomitrella patens).

GFam is a consensus domain architecture-based gene family inference method [36, 45, 69-71], while OrthoMCL and OrthoFinder infer gene families by gene descent and duplication [25, 32, 72-76]. In PlantTribes 2, the primary gene clusters for the GFam method are also referred to as orthogroups for analytical consistency, even though some may not be orthologous gene sets by definition. I counted the number of orthogroups into which Arabidopsis genes from each curated gene family were assigned for the three clustering methods of the scaffold. All the methods show a strong positive correlation between the number of orthogroups and the number of subfamilies for each family. The Pearson correlation coefficient was 0.60 for OrthoFinder, 0.54 for GFam, and 0.50 for OrthoMCL, as shown in Figure 2-5. The majority of differences between subfamilies and orthogroups result from clustering methods splitting subfamilies into smaller groups. Very few differences are due to clusters that merge genes from different expert curated gene families (Table 2-2).
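
For reference, the Pearson correlation coefficient between per-family subfamily counts and orthogroup counts can be computed as in the minimal sketch below. The function name `pearson_r` and the example counts are hypothetical and are not data from this evaluation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts for five curated gene families (illustration only).
subfamilies = [1, 2, 4, 6, 9]
orthogroups = [2, 2, 7, 9, 20]
r = pearson_r(subfamilies, orthogroups)
print(f"r = {r:.2f}")
```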

Figure 2-5: Relationship between orthologous clusters and expert curated gene families. Scatter plots showing the correlation between subfamilies and orthologous clusters of OrthoMCL, OrthoFinder, and GFam clustering methods of the 22Gv1.1 scaffold for 181 Arabidopsis thaliana curated gene families.


Table 2-2: Distribution of genes from expert curated gene families into orthologous gene clusters. The distribution of genes from Arabidopsis thaliana expert curated gene families into orthologous clusters, as estimated by OrthoMCL, OrthoFinder, and GFam clustering methods. Merged, ideal, and split orthologous clusters contain more than one curated subfamily, all members of a single curated subfamily, and some members of a subfamily, respectively.

| Orthologous clusters | OrthoMCL | OrthoFinder | GFam |
| --- | --- | --- | --- |
| Merged subfamilies | 1% | 2% | 7% |
| Ideal subfamilies | 8% | 20% | 16% |
| Split subfamilies | 91% | 78% | 77% |
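
The merged, ideal, and split categories defined in the caption of Table 2-2 can be expressed as a short Python sketch. It follows one reading of the caption and labels each cluster, so it is not necessarily the exact procedure behind the percentages in the table. The function name and the toy gene, subfamily, and cluster identifiers are hypothetical.

```python
from collections import defaultdict


def classify_clusters(gene_to_subfamily, gene_to_cluster):
    """Label each cluster as 'merged', 'ideal', or 'split' relative to curated subfamilies."""
    subfamily_members = defaultdict(set)
    for gene, subfamily in gene_to_subfamily.items():
        subfamily_members[subfamily].add(gene)

    cluster_members = defaultdict(set)
    for gene, cluster in gene_to_cluster.items():
        if gene in gene_to_subfamily:  # only curated genes are evaluated
            cluster_members[cluster].add(gene)

    labels = {}
    for cluster, genes in cluster_members.items():
        subfamilies = {gene_to_subfamily[gene] for gene in genes}
        if len(subfamilies) > 1:
            labels[cluster] = "merged"  # more than one curated subfamily
        elif genes == subfamily_members[next(iter(subfamilies))]:
            labels[cluster] = "ideal"   # all members of a single subfamily
        else:
            labels[cluster] = "split"   # only some members of a subfamily
    return labels


# Toy curated subfamilies and inferred clusters (hypothetical identifiers).
curated = {"g1": "subA", "g2": "subA", "g3": "subB", "g4": "subB", "g5": "subC"}
clustered = {"g1": "og1", "g2": "og1", "g3": "og2", "g4": "og3", "g5": "og3"}
labels = classify_clusters(curated, clustered)
print(labels)
```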

In addition, I examined the orthogroups with genes from expert curated gene families to determine the accuracy of clustering methods from an inferred functional perspective. I considered both split and merged subfamilies to be inaccurate only if the orthogroups contained genes from different curated gene families. The accuracy of all three methods is similar and fairly high: 79%, 78%, and 77% for OrthoFinder, GFam, and OrthoMCL, respectively. This evaluation of PlantTribes 2 gene family circumscriptions illustrates the differences between the two paradigms of gene families: the expert curated and the evolutionary perspectives. Homology is defined by shared ancestry, arising either from a speciation event (orthologs) or a duplication event (paralogs). While orthologs and paralogs can share significant sequence similarity, they do not necessarily have the same or similar function. High sequence and functional similarity may occur among genes that are not descended from a common ancestor because of convergent evolution [77-81].


Evaluation of sequence classifiers

PlantTribes 2 utilizes BLAST (blastp) and HMMER (hmmscan and hmmsearch) algorithms to classify gene sequences into orthologous gene family clusters. In order to measure the performance of the two classifiers on GFam, OrthoFinder, and OrthoMCL gene family clusters, I selected three taxa in the previously described 22Gv1.1 gene family scaffold with varying evolutionary distances in relation to all the other taxa in the scaffold. The taxa were removed from the scaffold and then classified back into it to measure the recall and precision of the classifiers [82]. Only gene sequences reassigned to their original orthologous clusters were considered true positives. I also calculated the F-score, a single metric that considers both recall and precision, to measure the overall performance of the classifiers [82]. In this analysis, I selected Physcomitrella patens and two sister species, Solanum lycopersicum and Solanum tuberosum, to evaluate the performance of classifiers when sorting distant, moderately distant, and confamilial taxa into the scaffold using the following procedure:

1) Physcomitrella patens (Phypa) removed and sorted back into the scaffold to evaluate the performance of classifiers with distant species (-Phypa, +Phypa). No other moss species are present in this scaffold.

2) Both Solanum lycopersicum (Solly) and Solanum tuberosum (Soltu) removed, and Solanum lycopersicum sorted back into the scaffold to evaluate the performance of classifiers with moderately distant species (-Solly, -Soltu, +Solly). After removing both Solanum lycopersicum and Solanum tuberosum, no other sister species are present in the scaffold. However, other close lineages, including three asterids and nine rosids are present in the scaffold.

3) Solanum lycopersicum (Solly) removed and sorted back into the scaffold to evaluate the performance of classifiers with confamilial species (-Solly, +Solly). Solanum tuberosum, a sister species in the same family is present in the scaffold.
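
The three metrics can be sketched in Python as follows. The text specifies only that genes returned to their original cluster are true positives. The treatment of genes assigned to a different cluster as false positives and of unassigned genes as false negatives is an assumption made for this sketch, and the function names and toy gene and cluster identifiers are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_scores(original_cluster, assigned_cluster):
    """Recall, precision, and F-score for genes classified back into a scaffold.

    True positives are genes reassigned to their original cluster. As an
    assumption of this sketch, genes assigned to a different cluster count as
    false positives and genes left unassigned count as false negatives.
    """
    tp = sum(1 for gene, cluster in assigned_cluster.items()
             if original_cluster.get(gene) == cluster)
    fp = len(assigned_cluster) - tp
    fn = sum(1 for gene in original_cluster if gene not in assigned_cluster)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision, f_score(precision, recall)


# Toy example with six removed genes (hypothetical identifiers).
original = {"g1": "og1", "g2": "og1", "g3": "og2", "g4": "og3", "g5": "og3", "g6": "og4"}
assigned = {"g1": "og1", "g2": "og1", "g3": "og5", "g4": "og3"}  # g5 and g6 unassigned
recall, precision, f1 = classification_scores(original, assigned)
print(f"recall = {recall:.2f}, precision = {precision:.2f}, F-score = {f1:.2f}")
```

The harmonic mean form of the F-score is consistent with the values reported in Table 2-3. For example, a recall of 64% and a precision of 85% give an F-score of about 73%.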

As shown in Table 2-3, both classifiers have a higher recall rate (90% - 96%) when classifying into OrthoMCL and OrthoFinder clusters than when classifying into GFam clusters (58% - 80%) across the evolutionary distance of species in the genome scaffold. HMMER is slightly more sensitive than BLAST when the evolutionary distance is large, while BLAST is much more sensitive when classifying into GFam clusters across the board. Precision for both classifiers is similar across the evolutionary distance of the scaffold (78% - 95%). Classifying into OrthoFinder clusters yields much higher precision than classifying into OrthoMCL and GFam clusters. Overall, classification performance for BLAST and HMMER is similar based on the F-scores (67% - 94%) across the evolutionary distance of the scaffold. These results suggest that classifying into clusters inferred by orthology methods results in better performance for both classifiers than classifying into clusters inferred by consensus domain-based methods. Additionally, classifying using both BLAST and HMMER as implemented in PlantTribes 2, even though computationally expensive, should provide the best result by merging the sensitivity and specificity of both classifiers.


Table 2-3: Performance evaluation of sequence classifiers. Summaries of classification rates for BLAST and HMMER classifiers. Recall, precision and F-score for the two classifiers is measured on GFam, OrthoFinder, and OrthoMCL clustering methods to determine how well distant (-Phypa, +Phypa), moderately distant (-Solly, -Soltu, +Solly), and confamilial (-Solly, +Solly) taxa are classified into the PlantTribes 2 22Gv1.1 gene family scaffold. The best performing classifier score in each evolutionary distance category and clustering methods is shown in bold.

Recall

| Clustering method | -Phypa, +Phypa: BLAST | -Phypa, +Phypa: HMMER | -Solly, -Soltu, +Solly: BLAST | -Solly, -Soltu, +Solly: HMMER | -Solly, +Solly: BLAST | -Solly, +Solly: HMMER |
| --- | --- | --- | --- | --- | --- | --- |
| GFam | 64% | 58% | 71% | 62% | 80% | 66% |
| OrthoMCL | 90% | 91% | 93% | 95% | 95% | 95% |
| OrthoFinder | 92% | 95% | 91% | 94% | 94% | 94% |

Precision

| Clustering method | -Phypa, +Phypa: BLAST | -Phypa, +Phypa: HMMER | -Solly, -Soltu, +Solly: BLAST | -Solly, -Soltu, +Solly: HMMER | -Solly, +Solly: BLAST | -Solly, +Solly: HMMER |
| --- | --- | --- | --- | --- | --- | --- |
| GFam | 85% | 80% | 85% | 81% | 84% | 81% |
| OrthoMCL | 86% | 84% | 79% | 78% | 79% | 78% |
| OrthoFinder | 83% | 80% | 95% | 94% | 94% | 93% |

F-score

| Clustering method | -Phypa, +Phypa: BLAST | -Phypa, +Phypa: HMMER | -Solly, -Soltu, +Solly: BLAST | -Solly, -Soltu, +Solly: HMMER | -Solly, +Solly: BLAST | -Solly, +Solly: HMMER |
| --- | --- | --- | --- | --- | --- | --- |
| GFam | 73% | 67% | 77% | 70% | 82% | 73% |
| OrthoMCL | 88% | 87% | 86% | 86% | 87% | 86% |
| OrthoFinder | 87% | 88% | 93% | 94% | 94% | 94% |

Evaluation of targeted gene family assembly

De novo assembly of RNA-Seq data is commonly used to reconstruct expressed transcripts for non-model species that lack quality reference genomes. However, heterogeneous sequence coverage, sequencing errors, polymorphism, and sequence repeats, among other factors, cause algorithms to generate contigs that are fragmented [51, 83]. In order to demonstrate the utility of the targeted gene family assembly in PlantTribes 2, I obtained raw Illumina transcriptome datasets sequenced by the Parasitic Plant Genome Project (http://ppgp.huck.psu.edu) that represent key life stages of three parasitic species in the Orobanchaceae family [8, 67]. These species span the full spectrum of plant parasitism and include Triphysaria versicolor, Striga hermonthica, and Phelipanche aegyptiaca. Species-specific transcriptome assemblies were performed using two approaches with Trinity: (1) a single assembly combining raw Illumina reads from all developmental stages of the plant, and (2) multiple assemblies of individual developmental stages of the plant. BUSCO (Benchmarking Universal Single-Copy Orthologs) [84, 85] assembly quality assessment using 1,440 universally conserved land plant single-copy orthologs suggests that the assembly combining all raw data recovers more conserved single-copy genes than any developmental stage-specific assembly (Table 2-4). However, a meta-assembly of transcripts from both approaches with the targeted gene family function of the AssemblyPostProcessor utilizing the 26Gv1.0 gene family scaffold recovers even more full-length conserved single-copy genes. Therefore, the meta-assembly implementation of PlantTribes 2 can benefit many comparative transcriptome studies of non-model species by alleviating transcript fragmentation in gene families of interest.


Table 2-4: Completeness assessment of transcriptome assemblies. BUSCO completeness assessment of transcriptome assemblies to illustrate the utility of PlantTribes 2 targeted gene family assembly function in the AssemblyPostProcessor tool. The numbers indicate the proportion of complete, fragmented, and missing genes among 1,440 conserved photosynthetic land plant single-copy orthologs. Assemblies of parasitic plants, Phelipanche, Striga, and Triphysaria examined include (1) developmental stage-specific assemblies, (2) assemblies combining all developmental stage-specific raw data (combined assembly), and (3) meta-assembly of developmental stage-specific assemblies and combined assembly.

| Developmental stage | Phelipanche aegyptiaca: Complete | Phelipanche aegyptiaca: Fragmented | Phelipanche aegyptiaca: Missing | Striga hermonthica: Complete | Striga hermonthica: Fragmented | Striga hermonthica: Missing | Triphysaria versicolor: Complete | Triphysaria versicolor: Fragmented | Triphysaria versicolor: Missing |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 28.2% | 17.6% | 54.2% | 46.4% | 16.8% | 36.8% | 42.1% | 16.7% | 41.2% |
| 1 | 22.9% | 15.3% | 61.8% | 69.7% | 11.4% | 18.9% | 52.5% | 16.9% | 30.6% |
| 2 | 30.3% | 16.5% | 53.2% | 61.3% | 15.1% | 23.6% | 59.2% | 15.6% | 25.2% |
| 3 | 55.3% | 9.3% | 35.4% | 56.4% | 15.4% | 28.2% | 51.5% | 15.8% | 32.7% |
| 4 | NA | NA | NA | 56.8% | 16.2% | 27.0% | NA | NA | NA |
| 4.1 | 61.0% | 9.2% | 29.8% | NA | NA | NA | 29.1% | 18.2% | 52.7% |
| 4.2 | 54.0% | 10.3% | 35.7% | NA | NA | NA | NA | NA | NA |
| 5.1 | 38.7% | 14.1% | 47.2% | 70.2% | 9.7% | 20.1% | NA | NA | NA |
| 5.2 | 45.5% | 10.7% | 43.8% | 51.6% | 15.1% | 33.3% | NA | NA | NA |
| 6.1 | 31.5% | 15.6% | 52.9% | 57.1% | 14.1% | 28.8% | 72.8% | 9.6% | 17.6% |
| 6.2 | 41.9% | 14.6% | 43.5% | 67.3% | 12.3% | 20.4% | 64.5% | 12.3% | 23.2% |
| 6.3 | NA | NA | NA | NA | NA | NA | 60.1% | 13.3% | 26.6% |
| combined assembly | 70.2% | 9.3% | 20.5% | 84.6% | 5.9% | 9.5% | 87.7% | 5.1% | 7.2% |
| meta-assembly | 77.5% | 2.4% | 20.1% | 91.3% | 2.2% | 6.5% | 89.4% | 2.6% | 8.0% |

Conclusions

PlantTribes 2 utilizes pre-computed or expert gene family classifications for comparative and evolutionary analyses of gene families and transcriptomes for all types of organisms. The two main aims of PlantTribes 2 are (1) continual development of a scalable and modular set of analysis tools and methods that leverage gene family classifications for comparative genomics and phylogenomics to gain novel insight into the evolutionary history of genomes, gene families, and the tree of life, and (2) making these tools broadly available to the plant research community as a stand-alone package and through the Galaxy framework. Many genomic studies, including inference of species relationships, the timing of gene duplication and polyploidy, reconstruction of ancestral gene content, the timing of new gene function evolution, detection of reticulate evolutionary events such as horizontal gene transfer, and many others, can be performed using PlantTribes 2 tools. The modular structure, which allows component tools of the pipeline to be independent of each other, makes PlantTribes 2 easy to update and to expand with new features.

Availability and requirements

Project name: PlantTribes 2

Archived version: 1.0.4

Project home page: https://github.com/dePamphilis/PlantTribes

Galaxy: https://usegalaxy.org

Bioconda: https://bioconda.github.io/search.html?q=PlantTribes

Tutorials: https://github.com/dePamphilis/PlantTribes/blob/master/docs/Tutorial.md. https://galaxyproject.org/tutorials/pt_gfam/

Operating system(s): Linux, Mac OS X

Programming language: Perl, Python

Other requirements: Web browser for Galaxy

License: GNU


References

1. Barrett, T., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Holko, M., Yefanov, A., Lee, H., Zhang, N., Robertson, C.L., Serova, N., Davis, S., Soboleva, A. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2012). 2. Matasci, N., Hung, L.-H., Yan, Z., Carpenter, E.J., Wickett, N.J., Mirarab, S., Nguyen, N., Warnow, T., Ayyampalayam, S., Barker, M., Burleigh, J.G., Gitzendanner, M.A., Wafula, E., Der, J.P., dePamphilis, C.W., Roure, B., Philippe, H., Ruhfel, B.R., Miles, N.W., Graham, S.W., Mathews, S., Surek, B., Melkonian, M., Soltis, D.E., Soltis, P.S., Rothfels, C., Pokorny, L., Shaw, J.A., DeGironimo, L., Stevenson, D.W., Villarreal, J.C., Chen, T., Kutchan, T.M., Rolf, M., Baucom, R.S., Deyholos, M.K., Samudrala, R., Tian, Z., Wu, X., Sun, X., Zhang, Y., Wang, J., Leebens-Mack, J., Wong, G.K.S. Data access for the 1,000 plants (1KP) project. Gigascience. 3, 17 (2014). 3. Sayers, E.W., Cavanaugh, M., Clark, K., Ostell, J., Pruitt, K.D., Karsch- Mizrachi, I. GenBank. Nucleic Acids Res. 47, D94–D99 (2018). 4. Der, J.P., Barker, M.S., Wickett, N.J., dePamphilis, C.W., Wolf, P.G. De novo characterization of the gametophyte transcriptome in bracken fern, Pteridium aquilinum. BMC Genomics. 12, 99 (2011). 5. Zhang, N., Wen, J., Zimmer, E.A. Expression patterns of AP1, FUL, FT and LEAFY orthologs in Vitaceae support the homology of tendrils and inflorescences throughout the grape family. Jnl of Sytematics Evolution. 53, 469–476 (2015). 6. Jayakodi, M., Lee, S.-C., Park, H.-S., Jang, W., Lee, Y.S., Choi, B.-S., Nah, G.J., Kim, D.-S., Natesan, S., Sun, C., Yang, T.J. Transcriptome profiling and comparative analysis of Panax ginseng adventitious roots. J Ginseng Res. 38, 278–288 (2014). 7. Williams, J.S., Der, J.P., dePamphilis, C.W., Kao, T.-H. 
Transcriptome analysis reveals the same 17 S-locus F-box genes in two haplotypes of the self- incompatibility locus of Petunia inflata. Plant Cell. 26, 2873–2888 (2014). 8. Yang, Z., Wafula, E.K., Honaas, L.A., Zhang, H., Das, M., Fernandez-Aparicio, M., Huang, K., Bandaranayake, P.C.G., Wu, B., Der, J.P., Clarke, C.R., Ralph, P.E., Landherr, L., Altman, N.S., Timko, M.P., Yoder, J.I., Westwood, J.H., dePamphilis, C.W. Comparative transcriptome analyses reveal core parasitism genes and suggest gene duplication and repurposing as sources of structural novelty. Mol. Biol. Evol. 32, 767–790 (2014). 9. Sharma, P.P., Kaluziak, S.T., Pérez-Porro, A.R., González, V.L., Hormiga, G., Wheeler, W.C., Giribet, G. Phylogenomic interrogation of arachnida reveals systemic conflicts in phylogenetic signal. Mol. Biol. Evol. 31, 2963–2984 (2014). 10. Timme, R.E., Bachvaroff, T.R., Delwiche, C.F. Broad phylogenomic sampling and the sister lineage of land plants. PLoS ONE. 7, e29696 (2012). 11. Wickett, N.J., Mirarab, S., Nguyen, N., Warnow, T., Carpenter, E., Matasci, N., Ayyampalayam, S., Barker, M.S., Burleigh, J.G., Gitzendanner, M.A., Ruhfel,


B.R., Wafula, E., Der, J.P., Graham, S.W., Mathews, S., Melkonian, M., Soltis, D.E., Soltis, P.S., Miles, N.W., Rothfels, C.J., Pokorny, L., Shaw, A.J., DeGironimo, L., Stevenson, D.W., Surek, B., Villarreal, J.C., Roure, B., Philippe, H., dePamphilis, C.W., Chen, T., Deyholos, M.K., Baucom, R.S., Kutchan, T.M., Augustin, M.M., Wang, J., Zhang, Y., Tian, Z., Yan, Z., Wu, X., Sun, X., Wong, G.K.-S., Leebens-Mack, J. Phylotranscriptomic analysis of the origin and early diversification of land plants. Proc. Natl. Acad. Sci. U.S.A. 111, E4859–E4868 (2014). 12. Zeng, L., Zhang, Q., Sun, R., Kong, H., Zhang, N., Ma, H. Resolution of deep angiosperm phylogeny using conserved nuclear genes and estimates of early divergence times. Nature Communications. 5, (2014). 13. Huang, C.-H., Sun, R., Hu, Y., Zeng, L., Zhang, N., Cai, L., Zhang, Q., Koch, M.A., Al-Shehbaz, I., Edger, P.P., Pires, J.C., Tan, D.-Y., Zhong, Y., Ma, H. Resolution of Brassicaceae phylogeny using nuclear genes uncovers nested radiations and supports convergent morphological evolution. Mol. Biol. Evol. msv226 (2015). 14. Naumann, J., Salomo, K., Der, J.P., Wafula, E.K., Bolin, J.F., Maass, E., Frenzke, L., Samain, M.-S., Neinhuis, C., dePamphilis, C.W., Wanke, S. Single-copy nuclear genes place haustorial Hydnoraceae within Piperales and reveal a cretaceous origin of multiple parasitic angiosperm lineages. PLoS ONE. 8, e79204 (2013). 15. Rothfels, C.J., Larsson, A., Li, F.-W., Sigel, E.M., Huiet, L., Burge, D.O., Ruhsam, M., Graham, S.W., Stevenson, D.W., Wong, G.K.-S., Korall, P., Pryer, K.M. Transcriptome-mining for single-copy nuclear markers in ferns. PLoS ONE. 8, e76957 (2013). 16. Amborella Genome Project. The Amborella genome and the evolution of flowering plants. Science. 342, 1241089 (2013). 17. 
Jiao, Y., Leebens-Mack, J., Ayyampalayam, S., Bowers, J.E., McKain, M.R., McNeal, J., Rolf, M., Ruzicka, D.R., Wafula, E., Wickett, N.J., Wu, X., Zhang, Y., Wang, J., Zhang, Y., Carpenter, E.J., Deyholos, M.K., Kutchan, T.M., Chanderbali, A.S., Soltis, P.S., Stevenson, D.W., McCombie, R., Pires, J.C., Wong, G.K.-S., Soltis, D.E., dePamphilis, C.W. A genome triplication associated with early diversification of the core eudicots. Genome Biol. 13, R3 (2012). 18. Jiao, Y., Wickett, N.J., Ayyampalayam, S., Chanderbali, A.S., Landherr, L., Ralph, P.E., Tomsho, L.P., Hu, Y., Liang, H., Soltis, P.S., Soltis, D.E., Clifton, S.W., Schlarbaum, S.E., Schuster, S.C., Ma, H., Leebens-Mack, J., dePamphilis, C.W. Ancestral polyploidy in seed plants and angiosperms. Nature. 473, 97–100 (2011). 19. Li, Z., Baniaga, A.E., Sessa, E.B., Scascitelli, M., Graham, S.W., Rieseberg, L.H., Barker, M.S. Early genome duplications in conifers and other seed plants. Science Advances. 1, e1501084–e1501084 (2015). 20. Van Bel, M., Proost, S., Wischnitzki, E., Movahedi, S., Scheerlinck, C., Van de Peer, Y., Vandepoele, K. Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol. 158, 590–600 (2012). 21. Dunn, C.W., Howison, M., Zapata, F. Agalma: an automated phylogenomics


workflow, (2013). 22. Wall, P.K., Leebens-Mack, J., Müller, K.F., Field, D., Altman, N.S., dePamphilis, C.W. PlantTribes: a gene and gene family resource for comparative genomics in plants. Nucleic Acids Res. 36, D970–6 (2008). 23. Proost, S., Van Bel, M., Vaneechoutte, D., Van de Peer, Y., Inzé, D., Mueller- Roeber, B., Vandepoele, K. PLAZA 3.0: an access point for plant comparative genomics. Nucleic Acids Res. (2014). 24. Kriventseva, E.V., Kuznetsov, D., Tegenfeldt, F., Manni, M., Dias, R., Simão, F.A., Zdobnov, E.M. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids Res. 47, D807–D811 (2019). 25. Chen, F., Mackey, A.J., Stoeckert, C.J., Roos, D.S. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 34, D363–8 (2006). 26. Huerta-Cepas, J., Szklarczyk, D., Forslund, K., Cook, H., Heller, D., Walter, M.C., Rattei, T., Mende, D.R., Sunagawa, S., Kuhn, M., Jensen, L.J., Mering, von, C., Bork, P. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 44, D286–D293 (2016). 27. Mi, H., Muruganujan, A., Huang, X., Ebert, D., Mills, C., Guo, X., Thomas, P.D. Protocol update for large-scale genome and gene function analysis with the PANTHER classification system (v.14.0). Nat Protoc. 14, 703–721 (2019). 28. Schreiber, F., Patricio, M., Muffato, M., Pignatelli, M., Bateman, A. TreeFam v9: a new website, more species and orthology-on-the-fly. Nucleic Acids Res. 42, D922–D925 (2014). 29. Goodstein, D.M., Shu, S., Howson, R., Neupane, R., Hayes, R.D., Fazo, J., Mitros, T., Dirks, W., Hellsten, U., Putnam, N., Rokhsar, D.S. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 40, D1178–86 (2012). 30. Parkinson, J., Anthony, A., Wasmuth, J., Schmid, R., Hedley, A., Blaxter, M. 
PartiGene – constructing partial genomes. Bioinformatics. 20, 1398–1404 (2004). 31. Roure, B., Rodríguez-Ezpeleta, N., Philippe, H. SCaFoS: a tool for selection, concatenation and fusion of sequences for phylogenomics. BMC Evol. Biol. 7, S2 (2007). 32. Emms, D.M., Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 16, 157 (2015). 33. Song, J., Zheng, S., Nguyen, N., Wang, Y., Zhou, Y., Lin, K. Integrated pipeline for inferring the evolutionary history of a gene family embedded in the species tree: a case study on the STIMATE gene family. BMC Bioinformatics. 18, 439 (2017). 34. Thanki A.S., Soranzo N., Haerty W., Davey R.P. GeneSeqToFamily: a Galaxy workflow to find gene families based on the Ensembl Compara GeneTrees pipeline. Gigascience. 1–10 (2018).


35. Li, L., Stoeckert, C.J., Roos, D.S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research. 13, 2178–2189 (2003). 36. Sasidharan, R., Nepusz, T., Swarbreck, D., Huala, E., Paccanaro, A. GFam: a platform for automatic annotation of gene families. Nucleic Acids Res. 40, e152 (2012). 37. Goecks, J., Nekrutenko, A., Taylor, J. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11, R86 (2010). 38. Afgan, E., Baker, D., Batut, B., van den Beek, M., Bouvier, D., Čech, M., Chilton, J., Clements, D., Coraor, N., Grüning, B.A., Guerler, A., Hillman- Jackson, J., Hiltemann, S., Jalili, V., Rasche, H., Soranzo, N., Goecks, J., Taylor, J., Nekrutenko, A., Blankenberg, D. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 46, W537–W544 (2018). 39. Shahid, S., Kim, G., Johnson, N.R., Wafula, E., Wang, F., Coruh, C., Bernal- Galeano, V., Phifer, T., dePamphilis, C.W., Westwood, J.H., Axtell, M.J. MicroRNAs from the parasitic plant Cuscuta campestris target host messenger RNAs. Nature. 553, 82–85 (2018). 40. Li, F.W., Brouwer, P., Carretero-Paulet, L., Cheng, S., Vries, J., Delaux, P.-M., Eily, A., Koppers, N., Kuo, L.-Y., Li, Z., Simenc, M., Small, I., Wafula, E., Angarita, S., Barker, M.S., Bräutigam, A., dePamphilis, C., Gould, S., Hosmani, P.S., Huang, Y.-M., Huettel, B., Kato, Y., Liu, X., Maere, S., McDowell, R., Mueller, L.A., Nierop, K.G.J., Rensing, S.A., Robison, T., Rothfels, C.J., Sigel, E.M., Song, Y., Timilsena, P.R., Peer, Y., Wang, H., Wilhelmsson, P.K.I., Wolf, P.G., Xu, X., Der, J.P., Schluepmann, H., Wong, G.K.S., Pryer, K.M. Fern genomes elucidate land plant evolution and cyanobacterial symbioses. Nature Plants. 1–16 (2018). 41. Enright, A.J., Van Dongen, S., Ouzounis, C.A. An efficient algorithm for large- scale detection of protein families. Nucleic Acids Res. 
30, 1575–1584 (2002). 42. The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still going strong. Nucleic Acids Res. 47, D330–D338 (2019). 43. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel- Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G. The Gene Ontology Consortium: tool for the unification of biology. Nat Genet. 25, 25–29 (2000). 44. Mulder, N., Apweiler, R. InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol. Biol. 396, 59–70 (2007). 45. Sonnhammer, E.L.L., Eddy, S.R., Durbin, R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins: Structure, Function, and Bioinformatics. 28, 405–420 (1997). 46. Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., Lopez, R. InterProScan: protein domains identifier. Nucleic Acids Res. 33, W116–W120 (2005).


47. Swarbreck, D., Wilks, C., Lamesch, P., Berardini, T.Z., Garcia-Hernandez, M. Foerster, H., Li, D., Meyer, T., Muller, R., Ploetz, L., Radenbaugh, A., Singh, S., Swing, V., Tissier, C., Zhang, P., Huala, E. Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 36, D1009-10014 (2008). 48. Huala, E., Dickerman, A.W., Garcia-Hernandez, M. Weems, D., Reiser, L., LaFond, F., Hanley, D., Kiphart, D., Zhuang, M., Huang, W. Mueller, L.A., Bhattacharyya, D., Bhaya, D. Sobral, B.W., Beavis, W. Meinke, D.W., Town, C.D., Somerville, C. Rhee, S.Y. The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res. 29, 102-105 (2001). 49. Lamesch, P., Berardini, T.Z., Li, D., Swarbreck, D., Wilks, C., Sasidharan, R., Muller, R., Dreher, K., Alexander, D.L., Garcia-Hernandez, M., Karthikeyan, A.S., Lee, C.H., Nelson, W.D., Ploetz, L., Singh, S., Wensel, A., Huala, E. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 40, D1202–D1210 (2012). 50. UniProt Consortium, T. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 46, 2699 (2018). 51. Honaas, L.A., Wafula, E.K., Wicket, N.J., Der, J.P., Zhang, Y., Edger, P.P., Altman, N.S., Pires, J.C., Leebens-Mack, J.H., dePamphilis, C.W. Selecting superior de novo transcriptome assemblies: Lessons learned by leveraging the best plant genome. PLoS One. 11, e0146062 (2016). 52. Iseli, C. Jongeneel, C.V., Bucher, P. ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in ESTsequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 138-148 (1999). 1–11 (2002). 53. 
Haas, B.J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P.D., Bowden, J., Couger, M.B., Eccles, D., Li, B., Lieber, M., Macmanes, M.D., Ott, M., Orvis, J., Pochet, N., Strozzi, F., Weeks, N., Westerman, R., William, T., Dewey, C.N., Henschel, R., Leduc, R.D., Friedman, N., Regev, A. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512 (2013). 54. Gremme, G., Steinbiss, S., Kurtz, S. GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Trans Comput Biol Bioinform. 10, 645–656 (2013). 55. Eddy, S.R. Accelerated Profile HMM Searches. PLoS Comput Biol. 7, e1002195 (2011). 56. Huang, X., Madan, A. CAP3: A DNA Sequence Assembly Program. Genome Res. 9, 868–877 (1999). 57. Katoh, K., Standley, D.M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013). 58. Capella-Gutiérrez, S., Silla-Martínez, J.M., Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 25, 1972–1973 (2009). 59. Mirarab, S., Nguyen, N., Guo, S., Wang. L.S., Kim, J., Warnow, T. PASTA:


ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J. Comput. Biol. 22, 377-386 (2015). 60. Yachdav, G., Wilzbach, S., Rauscher, B., Sheridan, R., Sillitoe, I., Procter, J., Lewis, S.E., Rost, B., Goldberg, T. MSAViewer: interactive JavaScript visualization of multiple sequence alignments. Bioinformatics. 32, 3501– 3503 (2016). 61. Waterhouse, A.M., Procter, J.B., Martin, D.M.A., Clamp, M., Barton, G.J. Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics. 25, 1189–1191 (2009). 62. Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post- analysis of large phylogenies. Bioinformatics. 30, 1312–1313 (2014). 63. Price, M.N., Dehal, P.S., Arkin, A.P. FastTree 2--approximately maximum- likelihood trees for large alignments. PLoS ONE. 5, e9490 (2010). 64. Nascimento, M., Sousa, A., Ramirez, M., Francisco, A.P., Carriço, J.A., Vaz, C. PHYLOViZ 2.0: providing scalable data integration and visualization for multiple phylogenetic inference methods. Bioinformatics. 33, 128–129 (2017). 65. Yang, Z. PAML: Phylogenetic Analysis by Maximum Likelihood. Mol. Biol. Evol. 24, 1586-1591 (2007). 66. McLachlan, G., Peel, D. The EMMIX algorithm for the fitting of normal and t- components. J. Stat. Software. 04, 10.18637/jss.v004.i02 (1999). 67. Westwood, J.H., dePamphilis, C.W., Das, M., Fernandez-Aparicio, M., Honaas, L.A., Timko, M.P., Wafula, E.K., Wickett, N.J., Yoder, J.I. The Parasitic Plant Genome Project: New tools for understanding the biology of Orobanche and Striga. Weed Science. 60, 295–306 (2012). 68. 
Yoshida, S., Kim, S., Wafula, E.K., Tanskanen, J., Kim, Y.-M., Honaas, L., Yang, Z., Spallek, T., Conn, C.E., Ichihashi, Y., Cheong, K., Cui, S., Der, J.P., Gundlach, H., Jiao, Y., Hori, C., Ishida, J.K., Kasahara, H., Kiba, T., Kim, M.-S., Koo, N., Laohavisit, A., Lee, Y.-H., Lumba, S., McCourt, P., Mortimer, J.C., Mutuku, J.M., Nomura, T., Sasaki-Sekimoto, Y., Seto, Y., Wang, Y., Wakatake, T., Sakakibara, H., Demura, T., Yamaguchi, S., Yoneyama, K., Manabe, R.-I., Nelson, D.C., Schulman, A.H., Timko, M.P., dePamphilis, C.W., Choi, D., Shirasu, K. Genome sequence of Striga asiatica provides insight into the evolution of plant parasitism. Curr. Biol. 29, 3041-3052 (2019). 69. Haas, B.J., Wortman, J.R., Ronning, C.M., Hannick, L.I., Smith, R.K., Maiti, R., Chan, A.P., Yu, C., Farzad, M., Wu, D., White, O., Town, C.D. Complete reannotation of the Arabidopsis genome: methods, tools, protocols and the final release. BMC Biol. 3, 1–19 (2005). 70. Heger, A., Holm, L. Exhaustive enumeration of protein domain families. J. Mol. Biol. 328, 749-767 (2003). 71. Zhang, D., Aravind, L. Identification of novel families and classification of the C2 domain superfamily elucidate the origin and evolution of membrane targeting activities in eukaryotes. Gene. 469, 18–30 (2010). 72. Trachana, K., Larsson, T.A., Powell, S., Chen, W.-H., Doerks, T., Muller, J., Bork, P. Orthology prediction methods: a quality assessment using curated protein families. Bioessays. 33, 769–780 (2011).


73. Waterhouse, R.M., Tegenfeldt, F., Li, J., Zdobnov, E.M., Kriventseva, E.V. OrthoDB: A hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids Res. 41, D358–D365 (2013). 74. Powell, S., Forslund, K., Szklarczyk, D., Trachana, K., Roth, A., Huerta-Cepas, J., Gabaldón, T., Rattei, T., Creevey, C., Kuhn, M., Jensen, L.J., Mering, von, C., Bork, P. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res. 42, D231–D239 (2013). 75. Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., Rao, B.S., Smirnov, S., Sverdlov, A.V., Vasudevan, S., Wolf, Y.I., Yin, J.J., Natale, D.A. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 4, 41–14 (2003). 76. Altenhoff, A.M., Schneider, A., Gonnet, G.H., Dessimoz, C. OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res. 39, D289-294 (2011). 77. Webber, C., Ponting, C.P. Genes and homology. Curr. Biol. 14, R332–333 (2004). 78. Jensen, R.A. Orthologs and paralogs - we need to get it right. Genome Biol. 2, interactions1002.1–3 (2001). 79. Thornton, J.W., DeSalle, R. Gene family evolution and homology: genomics meets phylogenetics. Annu. Rev. Genomics Hum. Genet. 1, 41–73 (2000). 80. Doolittle, R.F. Convergent Evolution - the Need to Be Explicit. Trends Biochem. Sci. 19, 15–18 (1994). 81. Fitch, W.M. Homology - a personal view on some of the problems. Trends in Genetics. 16, 227–231 (2000). 82. Vihinen, M. How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis. BMC Genomics. 13 Suppl 4, S2 (2012). 83. Zhang, Y., Sun, Y., Cole, J.R. A scalable and accurate targeted gene assembly tool (SAT-Assembler) for next-generation sequencing data. PLoS Comput. Biol. 10, e1003737 (2014). 84. Simão, F.A., Waterhouse, R.M., Ioannidis, P., Kriventseva, E.V., Zdobnov, E.M. 
BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 31, 3210–3212 (2015). 85. Waterhouse, R.M., Seppey, M., Simão, F.A., Manni, M., Ioannidis, P., Klioutchnikov, G., Kriventseva, E.V., Zdobnov, E.M. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol. 35, 543–548 (2018).


Chapter 3

Ancient whole genome duplication events in the parasitic lineages of Orobanchaceae

This chapter represents part of my contribution to the published paper on the Striga asiatica genome by Yoshida et al. (2019) [1].

Abstract

The broomrape family, Orobanchaceae, a member of the Lamiales, is one of the 12 lineages that independently evolved parasitism in angiosperms. Orobanchaceae is also one of the largest parasitic plant lineages, with an estimated 2000 species, and includes species that are among the most devastating agricultural weeds globally. Although whole genome duplication (WGD) is prevalent in angiosperms and is considered a driving force in their rapid radiation and diversification, it has been unclear whether WGD has played a significant role in the evolution of parasitism, including in the Orobanchaceae. This chapter employs complementary WGD inference approaches and dense taxon sampling of Orobanchaceae and closely related sister species to maximize the sensitivity to detect large-scale duplication events and to constrain their timing in a phylogenetic context. The analyses provide evidence for an ancient polyploidy event in the history of Orobanchaceae and confirm the previously reported WGD event shared among all core Lamiales. This study provides the first opportunity to investigate how genome duplication is associated with the origins of parasitism in Orobanchaceae.

Introduction

Whole genome duplications (WGDs) have played an important role in the evolution of angiosperms. In eukaryotic plants, WGD has occurred across many lineages but predominantly in embryophytes (land plants) [2-4]. WGDs have been implicated in the origin of traits among several large plant groups and have been proposed to be a major factor in species diversification [5-11]. Lamiales is one of the largest orders of flowering plants in the asterid I clade of eudicots and has approximately 24,000 species in 24 families that are widely distributed around the world [12]. Additionally, it is one of the most interesting orders evolutionarily, with several families that have specialized life forms and traits, including parasitism (Orobanchaceae) [13, 14], epiphytism (Gesneriaceae and Acanthaceae) [15, 16], carnivory (Lentibulariaceae and Byblidaceae) [17, 18], and desiccation tolerance (Linderniaceae and Gesneriaceae) [19, 20]. Economically important species in the order include (), sesame (Pedaliaceae), and pot-herbs such as mint, oregano, and basil (Lamiaceae) [21].

The known polyploidy events in the order Lamiales are shown in Figure 3-1, a cladogram illustrating the phylogenetic relationships of major groups in the order inferred from four plastid markers by Schäferhoff et al. (2010) [21]. Three WGDs have been reported in the evolutionary history of the basal Lamiales family, Oleaceae; these include a species-specific WGD in the olive tree, Olea europaea, and two older paleopolyploidy events. The younger of these two events is shared among Olea europaea, Phillyrea angustifolia, and Fraxinus excelsior (ash tree). The older event is predicted to have occurred at the base of the Oleaceae family and includes () from the Jasmineae tribe [22-24]. Several studies have independently inferred a likely ancestral WGD that occurred at the base of all Lamiales (the core group) excluding Oleaceae and other basal Lamiales [3, 24-27]. Also, two species-specific WGDs have been reported in the miniaturized genome of the carnivorous plant, Utricularia gibba [25], and one WGD in Paulownia [3], a lineage closely related to the parasitic family, Orobanchaceae. To date, no polyploidy events have been fully investigated within Orobanchaceae, except for the likely events that were proposed by Wickett et al. (2011) [28] using a limited amount of transcriptome data. Wickett et al. performed lineage-specific analyses of rates of synonymous substitutions and suggested that there may have been at least two WGDs reflected in the genome of Striga hermonthica and one in Phelipanche aegyptiaca, two parasitic lineages in the Orobanchaceae. Therefore, resolving the timing of polyploidy events in Orobanchaceae will likely help in understanding the impact of gene duplication in the evolution of plant parasitism.

The aims of this study are to investigate: (1) WGD events in the ancestral lineages leading to parasitic clades in Orobanchaceae that might have contributed to the evolution of parasitism in the family, and (2) whether adequate sampling of major groups within Orobanchaceae, including sister lineages, can resolve the timing of the inferred WGD events. To address these questions, I obtained an unannotated genome assembly of Striga asiatica (US strain) that had been assembled by our collaborators using Illumina (143x) data coupled with Sanger BAC library sequence data. The assembly totals 472 Mb, representing about 80% of the kmer-based genome size estimate of 600 Mb, with a scaffold N50 > 1.3 Mbp (Appendix B) [1]. I first generated a comprehensive set of gene models, which I used together with genes from 25 other plant genomes (Table 3-2) to circumscribe objective gene family clusters (orthogroups). These genome-scale datasets were the starting point of my investigations into the evolutionary history and origins of parasitism in Striga and other lineages in Orobanchaceae.


Figure 3-1: Distribution of known or proposed whole genome duplication events within Lamiales. Whole genome duplications (WGDs) are shown on the phylogeny of Lamiales inferred by Schäferhoff et al. (2010) [21]. Events identified from whole genome assemblies are shown on the cladogram with green asterisks, and proposed events, primarily from transcriptome-based studies whose precise timing is not known, are marked with gold asterisks.

Results and discussion

Gene annotation

Accurate annotation of protein-coding genes in nuclear genomes depends on the quality of the genome assembly, which is influenced by several factors including assembly contiguity, assembly errors (misassemblies and sequence errors), the presence of contaminant sequences, and the accurate identification and masking of repeats, among others [29-36]. While assembly contiguity and assembly errors can be mitigated at the planning phase of a genome project to maximize assembly quality [37, 38], cleaning the assembly and identifying repeats to mask are analysis challenges that can be addressed with computational approaches [29, 30, 39]. It is not uncommon for sequences of untargeted species in genome assemblies to be annotated and erroneously included in the proteomes [40-44]. In plant genome assemblies, these may include DNA from bacteria, viruses, fungi, insects, and even humans, all of which are ubiquitous in greenhouses and laboratories. In addition to non-plant contamination, extranuclear DNA, which includes chloroplast and mitochondrial DNA present in plant cells, must also be removed from the genome assembly prior to gene annotation. Many plant species are notoriously repeat-rich, with repeat content as high as 85% in the bread wheat genome [45, 46], which poses a challenge for genome assembly and annotation [32, 33, 37]. Therefore, comprehensive identification and masking of repeats, which include low-complexity sequences and transposable elements, is needed to improve the sensitivity and specificity of gene-finding and annotation evidence alignment algorithms [32, 47].

Genome cleaning

In a plant genome assembly, high read depth contigs mainly belong to the chloroplast genome, the mitochondrial genome, and nuclear repeat sequences; lower read depth contigs are mostly from the nuclear genome [48-50]. During cleaning of the Striga genome assembly, a total of 200 out of 13,847 scaffolds and contigs were determined to be contaminants and excluded because their best-matching sequences in the NCBI nt database were attributed to either non-embryophytes or plant organelles. It has been shown that chloroplast and mitochondrial DNA can be transferred into nuclear chromosomes of diverse eukaryotes, including plants [51]. For this reason, likely chloroplast and mitochondrial scaffolds and contigs that did not exhibit high sequence read depth or had matching segments to protein databases were not removed from the assembly.
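As an illustration only, the cleaning rule described above could be sketched as follows. This is not the code used for the Striga assembly: the `Scaffold` record, the three best-hit labels, the `typical_depth` value, and the `depth_factor` threshold are hypothetical, and the sketch assumes one reading of the rule, namely that an organelle-like scaffold is removed only when it has elevated read depth and no match to a protein database.

```python
from dataclasses import dataclass


@dataclass
class Scaffold:
    name: str
    best_hit: str        # "embryophyte", "organelle", or "non_embryophyte" (hypothetical labels)
    read_depth: float    # mean read depth across the scaffold
    protein_match: bool  # True if any segment matches a protein database


def is_contaminant(s: Scaffold, typical_depth: float, depth_factor: float = 5.0) -> bool:
    """Flag a scaffold for removal under the assumed reading of the cleaning rule."""
    if s.best_hit == "non_embryophyte":
        return True
    if s.best_hit == "organelle":
        # Organelle-like scaffolds are kept when their depth is not elevated or
        # when they match proteins, because they may be nuclear insertions.
        high_depth = s.read_depth > depth_factor * typical_depth
        return high_depth and not s.protein_match
    return False


# Toy records, not real Striga scaffolds.
scaffolds = [
    Scaffold("scaf_nuclear", "embryophyte", 140.0, True),
    Scaffold("scaf_bacterial", "non_embryophyte", 35.0, False),
    Scaffold("scaf_plastid", "organelle", 4200.0, False),
    Scaffold("scaf_nuclear_insert", "organelle", 150.0, False),
]
kept = [s.name for s in scaffolds if not is_contaminant(s, typical_depth=143.0)]
print(kept)
```

In this toy input, the bacterial scaffold and the high-depth plastid scaffold are dropped, while the organelle-like scaffold with ordinary read depth is retained as a possible nuclear insertion.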


Annotation-specific repeat masking

A custom annotation-specific repeat library (database) for Striga was created for masking the repetitive fraction of the genome to enable high-quality gene prediction and analysis of the genome structure. Novel genomes often have new classes of repeats that are not present in Repbase, a comprehensive database of repetitive elements from diverse eukaryotic organisms [52, 53]. Therefore, generic genome masking using Repbase in conjunction with RepeatMasker (http://www.repeatmasker.org) prior to gene prediction and whole genome comparative alignment is not sufficient. It is essential to identify, annotate, and mask repeats, including interspersed repeats, low-complexity regions, and transposable elements, to avoid predicting spurious gene models and confounding alignments with repeat-mediated artifacts [32, 33, 54]. Using the Striga annotation-specific repeat database, approximately 59% of the assembled genome was masked prior to protein-coding gene annotation and comparative analysis of the genome structure. The genome content of the major classes of repeats identified and masked in Striga is shown in Table 3-1.
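A masked fraction such as the one reported above can be checked directly from a masked assembly. The following is a minimal sketch, assuming the assembly is soft-masked (repeats written in lowercase); the function name and the two-record FASTA string are hypothetical and are not taken from the Striga data.

```python
def masked_fraction(fasta_text: str) -> float:
    """Fraction of bases that are lowercase (soft-masked) in FASTA-formatted text."""
    masked = 0
    total = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Toy input: 12 of the 20 bases are lowercase.
example = ">scaffold_1\nACGTacgtAC\n>scaffold_2\nacgtacgtAC\n"
print(f"{masked_fraction(example):.2%}")
```

For a hard-masked assembly (repeats replaced by N), the same loop would count N characters instead, although assembly gaps would then be counted as well.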


Table 3-1: Repeat content and composition in the Striga genome. Major classes of repeat content and composition predicted in and used for masking the genome assembly to enable accurate annotation of protein-coding genes and comparative analysis of the genome structure.

Repeat classes                % of genome masked
DNA                                        5.00%
LINE                                       3.45%
LTR                                        3.21%
RC Helitrons                               0.83%
SINE                                       0.02%
Unknown                                   17.60%
Unspecified                               28.19%
Total interspersed repeats                58.30%
Satellites                                 4.00%
Simple repeats                             0.25%
Total repeats                             58.59%

Gene prediction and quality assessment

The final Striga post-processed gene annotation set consisted of all gene models supported by annotation evidence, as well as gene models not supported by annotation evidence but that encoded Pfam domains [33]. Additionally, provisional functional annotations of the initial gene set were analyzed to exclude genes with functional terms associated with transposable elements. MAKER [55] annotated 1,553 of the 5,666 scaffolds and contigs (>= 1 kb) that were provided to the pipeline. A majority of the sequences without annotated genes were either short (< 10 kb) or mainly composed of repeats. A total of 34,575 protein-coding genes were predicted, 91% of which have an annotation edit distance (AED) < 0.5 (Figure 3-2). AED is a quantitative measure of gene annotation quality based on annotation evidence, with values ranging from 0 (perfect agreement) to 1 (no support) [56]. To evaluate the quality of the Striga gene annotations, I examined the presence and completeness of 1,440 land plant (embryophyte) benchmarking universal single-copy orthologs (BUSCO) [57] in Striga compared to 25 other sequenced plant genomes classified in the gene family scaffold. BUSCO results (Figure 3-3) suggest that 87.1% of the conserved single-copy genes are complete, 4.0% are fragmented, and 8.9% are missing; these presence and completeness rates are comparable to other taxa in the classification.
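To make the AED summary statistic concrete, the sketch below computes AED for a few toy gene models and the fraction that fall below 0.5. It assumes the commonly used nucleotide-level definition, AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases overlapped by the annotation and SP is the fraction of annotation bases overlapped by evidence. The function names and the overlap counts are hypothetical and are not derived from the Striga annotation.

```python
def aed(overlap: int, annotation_len: int, evidence_len: int) -> float:
    """Annotation edit distance from base-pair overlap counts (assumed definition)."""
    if overlap == 0 or annotation_len == 0 or evidence_len == 0:
        return 1.0  # no supporting evidence
    sensitivity = overlap / evidence_len
    specificity = overlap / annotation_len
    return 1.0 - (sensitivity + specificity) / 2.0


def fraction_below(scores, threshold=0.5):
    """Fraction of gene models with a score strictly below the threshold."""
    return sum(score < threshold for score in scores) / len(scores)


# Toy gene models as (overlap, annotation length, evidence length) in bp.
models = [(900, 1000, 900), (0, 500, 0), (300, 1000, 600), (1000, 1000, 1000)]
scores = [aed(*m) for m in models]
print([round(score, 2) for score in scores])
print(fraction_below(scores))
```

Evaluating `fraction_below` over a range of thresholds gives the cumulative distribution of AED scores for an annotation set.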

Figure 3-2: The cumulative distribution of annotation edit distance (AED) scores for all Striga gene annotations. The cumulative distribution function (CDF) of 34,575 predicted protein-coding genes, 91% of which have an AED score < 0.5. The CDF provides a genome-wide perspective of how well the Striga gene annotations reflect the evidence data. AED of 0 and 1 denote perfect agreement with evidence data and absence of evidence data, respectively.


[Figure 3-3 plot: horizontal stacked bars (Complete, Fragmented, Missing) for Arabidopsis thaliana, Carica papaya, Theobroma cacao, Eucalyptus grandis, Phaseolus vulgaris, Medicago truncatula, Prunus persica, Manihot esculenta, Populus trichocarpa, Vitis vinifera, Solanum lycopersicum, Utricularia gibba, Mimulus guttatus, Striga asiatica, Beta vulgaris, Nelumbo nucifera, Aquilegia coerulea, Oryza sativa, Sorghum bicolor, Musa acuminata, Elaeis guineensis, Spirodella polyrhiza, Amborella trichopoda, Pinus taeda, Selaginella moellendorffii, and Physcomitrella patens. x-axis: % of 1,440 land plant benchmarking universal single-copy orthologs (BUSCO), 0 to 100.]

Figure 3-3: Assessment of Striga genome assembly and annotation completeness. The presence and completeness of universally conserved single-copy land plant genes in the Striga genome compared to 25 other annotated representative land plant genomes. Results are depicted as a stacked bar graph representing the percentage of 1,440 land plant BUSCO genes that are complete (blue), fragmented (green), and missing (gold).


Genome duplication history

Resolving the relative timing of WGD events relies on sampling genomic or transcriptomic data from a broad range of taxa across the phylogenetic diversity of the lineage of interest [4, 58]. Following WGD events, a significant fraction of duplicated genes in homeologous genomic regions are lost due to disabling mutations [59-63]. This process of massive gene loss, known as fractionation, rapidly erases the signal of genome duplication. As a result, older WGD events become challenging to detect, hence the need to employ complementary approaches to maximize sensitivity [7, 64]. To diagnose the duplication history of the genomes of Striga and the closely related non-parasitic plant, Mimulus guttatus (Erythranthe guttata), I integrated the results of three approaches: phylogenetic analysis of gene families, gene collinearity (synteny), and divergence patterns of duplicate genes. The analysis was then extended to include transcriptomes of major groups within Orobanchaceae and closely related sister lineages (Table 3-4) to constrain the timing of detected WGD events.

Global gene family phylogenetic analysis

A gene family classification analysis of 26 plant genomes identified 18,110 orthogroups containing at least two genes, 9,936 of which contain at least one gene from Striga. Of the 34,575 annotated genes in Striga, 25,126 (72.7%) were classified in an orthogroup, and the remaining 9,449 (27.3%) genes are considered singletons, a clustering rate that is comparable to other taxa in the classification (Table 3-2). Following gene collinearity analysis, genes in Striga and Mimulus were classified as singletons, tandem duplicates, proximal duplicates, dispersed duplicates, or WGD/segmental duplicates, based on the mechanism of duplication (Table 3-3). WGD/segmental duplicates (syntelogs) were inferred from the anchor genes in collinear blocks, with blocks defined by a minimum of five anchor genes. Lamiales-wide gene duplications were identified from orthogroup gene trees anchored on both sides of the duplication node by WGD/segmental duplicates. A total of 889 orthogroups preserved duplicate copies of Striga genes (supported by 1,605 Striga syntelogs), and 1,521 orthogroups preserved duplicate copies of Mimulus genes (supported by 3,493 Mimulus syntelogs). Among all the orthogroups with Lamiales-wide duplications, 323 were supported by both Striga and Mimulus (475 Striga and 608 Mimulus syntelogs).
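The bookkeeping behind these counts can be sketched in a simplified form. The example below keeps anchor pairs only from collinear blocks with at least five anchors, records the orthogroups in which both members of a pair fall, and intersects the resulting sets for two species. It deliberately omits the gene tree step (checking that syntelogs sit on both sides of a duplication node), and all gene identifiers, orthogroup labels, and function names are hypothetical.

```python
def anchor_pairs(blocks, min_anchors=5):
    """Anchor gene pairs from collinear blocks that meet the minimum block size."""
    return [pair for block in blocks if len(block) >= min_anchors for pair in block]


def duplicated_orthogroups(pairs, orthogroup_of):
    """Orthogroups containing both members of at least one syntelog pair."""
    supported = set()
    for gene_a, gene_b in pairs:
        og = orthogroup_of.get(gene_a)
        if og is not None and og == orthogroup_of.get(gene_b):
            supported.add(og)
    return supported


# Toy data: one valid Striga block (5 anchors) and one block that is too short.
striga_blocks = [
    [("s1", "s11"), ("s2", "s12"), ("s3", "s13"), ("s4", "s14"), ("s5", "s15")],
    [("s6", "s16"), ("s7", "s17")],
]
mimulus_blocks = [
    [("m1", "m2"), ("m3", "m4"), ("m5", "m6"), ("m7", "m8"), ("m9", "m10")],
]
orthogroup_of = {
    "s1": "OG1", "s11": "OG1", "s2": "OG2", "s12": "OG3",
    "s6": "OG4", "s16": "OG4",
    "m1": "OG1", "m2": "OG1", "m3": "OG5", "m4": "OG5",
}

striga_ogs = duplicated_orthogroups(anchor_pairs(striga_blocks), orthogroup_of)
mimulus_ogs = duplicated_orthogroups(anchor_pairs(mimulus_blocks), orthogroup_of)
shared = striga_ogs & mimulus_ogs
print(sorted(striga_ogs), sorted(mimulus_ogs), sorted(shared))
```

In the toy data, OG4 is not counted for Striga because its only supporting pair comes from a block with fewer than five anchors.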

Table 3-2: Gene family classification summary of 26 plant genomes. Orthogroup classification summary for 663,272 annotated protein-coding genes in 26 representatives of sequenced land plant genomes including Striga.

                              Validated and cleaned                  Genes in  % of genes in  Singleton  % singleton
Species                             annotated genes  Orthogroups  orthogroups    orthogroups      genes        genes

Rosids
Manihot esculenta                            32,966       10,368       29,259           88.8      3,707         11.2
Populus trichocarpa                          41,207       10,633       36,029           87.4      5,178         12.6
Phaseolus vulgaris                           27,388       10,305       26,135           95.4      1,253          4.6
Medicago truncatula                          50,869       10,922       39,619           77.9     11,250         22.1
Prunus persica                               26,772       10,289       23,852           89.1      2,920         10.9
Arabidopsis thaliana                         27,369        9,782       24,523           89.6      2,846         10.4
Carica papaya                                27,528       10,221       21,978           79.8      5,550         20.2
Theobroma cacao                              29,171       10,387       24,802           85.0      4,369         15.0
Eucalyptus grandis                           36,288        9,958       31,195           86.0      5,093         14.0
Vitis vinifera                               26,315        9,827       21,791           82.8      4,524         17.2

Asterids
Striga asiatica                              34,575        9,936       25,126           72.7      9,449         27.3
Mimulus guttatus                             28,079       10,173       26,131           93.1      1,948          6.9
Utricularia gibba                            27,206        9,102       21,220           78.0      5,986         22.0
Solanum lycopersicum                         34,476       10,422       28,586           82.9      5,890         17.1

Caryophyllales
Beta vulgaris                                27,911       10,026       21,794           78.1      6,117         21.9

Basal eudicots
Aquilegia coerulea                           29,869       10,310       25,255           84.6      4,614         15.4
Nelumbo nucifera                             26,643        9,795       23,775           89.2      2,868         10.8

Monocots
Sorghum bicolor                              34,118       11,115       27,239           79.8      6,879         20.2
Oryza sativa                                 41,411       11,216       29,734           71.8     11,677         28.2
Musa acuminata                               36,514        9,707       29,770           81.5      6,744         18.5
Elaeis guineensis                            29,667       10,054       26,638           89.8      3,029         10.2
Spirodella polyrhiza                         19,572        9,371       17,372           88.8      2,200         11.2

Basal angiosperm
Amborella trichopoda                         26,802       10,003       19,588           73.1      7,214         26.9

Gymnosperm
Pinus taeda                                  27,596        5,768       23,770           86.1      3,826         13.9

Seed-free land plants
Selaginella moellendorffii                   22,251        7,907       17,057           76.7      5,194         23.3
Physcomitrella patens                        32,853        8,348       21,037           64.0     11,816         36.0

Table 3-3: Duplication origins of Striga and Mimulus genes. A summary of Striga and Mimulus genes classified into their likely duplication origins based on the mechanism of duplication.

Species           Singleton  Dispersed  Proximal  Tandem  WGD/Segmental
Striga asiatica       7,997     17,121     1,467   1,181          6,809
Mimulus guttatus      4,248     11,295     1,730   3,366          7,440

Duplicated gene divergence of phylogenetic syntelogs

I sought evidence for genome duplications in Striga by examining the divergence patterns of synonymous substitution rates (Ks) for Lamiales duplicate genes identified by an integrated syntenic and phylogenomic analysis. Divergence estimates identified two significant duplication components in Striga, at mean Ks ≈ 0.47 and mean Ks ≈ 1.22, and one significant component in Mimulus, at mean Ks ≈ 0.94 (Figure 3-4). The peak of the older component in the Striga Ks distribution corresponds to the peak of the single component in the Mimulus Ks distribution, despite being shifted further to the right. The larger Ks value for Striga compared to Mimulus suggests a higher rate of synonymous substitutions in Striga. Inspection of conserved single-copy gene trees and spot inspection of phylograms from gene families with WGD syntenic orthologs showed that Striga genes were consistently on somewhat longer branches than their Mimulus orthologs. These observations indicate that Striga genes have indeed tended to evolve faster than those of Mimulus. Therefore, the accelerated rate of evolution in Striga is expected to be reflected in the estimated duplication components, with the event(s) shared with Mimulus shifted to higher Ks values.
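The idea of decomposing a Ks distribution into duplication components can be illustrated with a small expectation-maximization (EM) fit of a two-component Gaussian mixture. This is a sketch only and is not the mixture model analysis used in this chapter: the number of components is fixed at two rather than selected statistically, the fit is done on untransformed Ks values, and the input is synthetic data generated from evenly spaced quantiles with assumed means, standard deviations, and proportions. All function names are hypothetical.

```python
import math
from statistics import NormalDist, pstdev


def quantile_sample(mean, sd, n):
    """Deterministic synthetic sample: n evenly spaced quantiles of a normal."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


def fit_two_gaussians(values, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture; returns (mean, sd, weight) tuples."""
    xs = sorted(values)
    n = len(xs)
    mu = [xs[n // 10], xs[9 * n // 10]]  # start from the 10th and 90th percentiles
    sd = [pstdev(xs), pstdev(xs)]
    w = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        dists = [NormalDist(mu[k], sd[k]) for k in range(2)]
        resp = []
        for x in xs:
            p = [w[k] * dists[k].pdf(x) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: update weights, means, and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sd[k] = max(math.sqrt(var), 1e-3)  # guard against collapse
    return sorted(zip(mu, sd, w))


# Synthetic Ks values: 80% from a younger component, 20% from an older one.
ks_values = quantile_sample(0.47, 0.12, 320) + quantile_sample(1.22, 0.20, 80)
components = fit_two_gaussians(ks_values)
for mean, sd, weight in components:
    print(f"mean={mean:.2f} sd={sd:.2f} proportion={weight:.2f}")
```

Each fitted component is summarized by its mean and its proportion of the gene pairs, which is the same kind of summary shown for the Ks plots in this chapter.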

To resolve the relative timing of the younger WGD event in Striga, a principal component analysis (PCA) was performed to determine the relationship among the significant duplication components identified in taxa representing major groups (clades) within Orobanchaceae and closely related sister lineages (Table 3-4), including the Striga and Mimulus genomes. The most substantial variance among all the taxa is explained when the duplication components are grouped by component rather than by clade (Figure 3-5). This clustering pattern of estimated duplication components likely represents a younger Orobanchaceae-wide WGD event (red), an older core Lamiales-wide WGD event (green), and an even older WGD event shared among all core eudicots (blue). As expected, the components of the two older WGD events cluster less tightly because of saturation of synonymous changes coupled with lower gene retention rates.

Taken together, these analyses suggest that the prominent younger peak in the Striga Ks distributions represents a duplication event in the Orobanchaceae family and closely related non-parasitic sister lineage that occurred after the divergence of the lineages leading to Orobanchaceae and Mimulus, and that the older peak represents a duplication event in the common ancestral genome of the three Lamiales taxa in the classification (Striga, Mimulus, and Utricularia). There was no evidence of two lineage-specific WGDs in the Striga lineage as reported by Wickett et al. (2011) [28]. Analysis including taxa representing major clades within Orobanchaceae and closely related non-parasitic lineages did not identify additional WGDs. The two WGDs reported by Wickett et al. in the Striga lineage (Figure 3-1) likely represent the core Lamiales WGD and the WGD shared among all Orobanchaceae lineages identified in the Striga asiatica genome. My analysis identified evidence of two WGDs in both Phelipanche aegyptiaca and Triphysaria versicolor, contrary to the evidence of Wickett et al. for one WGD in Phelipanche and none in Triphysaria. The preliminary nature of the transcriptome data might not provide a clear resolution of other WGDs within Orobanchaceae, but as more genomes in Orobanchaceae are sequenced, it should become possible to place these events accurately in a phylogenetic context.


Figure 3-4: Divergence (Ks) distribution plots for syntenic duplicate gene pairs in Striga and Mimulus. Ks distributions of Lamiales-wide duplicate gene pairs in Striga and Mimulus identified by an integrated syntenic and phylogenomic analysis. Colored lines superimposed on the Ks distributions represent significant duplication components identified by a likelihood mixture model analysis [65]. Plots show “color/mean/proportion”, where color is the component (curve) color, mean is the mean divergence of gene pairs assigned to the identified component, and proportion is the fraction of duplicate pairs assigned to the identified component. (a) Pairwise Ks distribution for 1,605 Striga genes from duplications within orthogroups and on syntenic blocks anchored by Striga genes. Two statistically significant components: purple/0.47/0.80 and cyan/1.22/0.20. (b) Pairwise Ks distribution for 3,493 Mimulus genes from duplications within orthogroups and on syntenic blocks anchored by Mimulus genes. One statistically significant component: cyan/0.94/0.92. (c) Pairwise Ks distribution for 475 Striga genes from duplications within orthogroups and on syntenic blocks anchored by both Striga and Mimulus genes. Two statistically significant components: purple/0.45/0.88 and cyan/1.27/0.12. (d) Pairwise Ks distribution for 608 Mimulus genes from duplications within orthogroups and on syntenic blocks anchored by both Striga and Mimulus genes. One statistically significant component: cyan/0.89/0.95. Additional results are available in Appendix C: Data S1T and S1U.


Figure 3-5: Principal component analysis (PCA) displaying duplication component similarities among Orobanchaceae species and closely related lineages. PCA of the estimated significant duplication components in Orobanchaceae species and closely related lineages clustered by (a) clade and (b) the mean Ks of the estimated duplication component. Cluster boundaries (ellipses) are drawn to identify the first component (red), the second component (green), and the third component (blue). Almost all the variance in the multidimensional clustering data is explained by PC1 and PC2 when grouped by component mean Ks. The components in (b) likely represent an Orobanchaceae-wide WGD event (red), a core Lamiales-wide WGD event (green), and a core eudicot-wide WGD event (blue).

Genome structure analysis

In addition to phylogenetic and gene divergence analyses, I compared the genome structures of Striga, Mimulus, and Vitis vinifera to detect signatures of polyploidy using whole genome alignment. The self-self dot plot of Striga syntenic blocks (Figure 3-6) shows evidence (on the diagonal axis) of extensive collinear blocks, distributed throughout the genome, indicating at least one round of ancient polyploidy. However, there are numerous syntenic signals not on the diagonal, which suggest a second, older polyploidy event. The overlaid color scheme, which corresponds to the synonymous substitution (Ks) age distribution histogram (Figure 3-6b) as calculated by CODEML, shows that the majority of genes comprising syntenic regions are from one age distribution (purple), and numerous others (off-diagonal) are from an older age distribution (cyan). This pattern is also evident in the cross-species dot plots of Striga-Mimulus (Figure 3-7) and Striga-Vitis (Figure 3-8), which show a relatively recent WGD (purple) superimposed on an older polyploidy event (cyan).

Taken together, the structure and synteny results suggest that the Striga genome reflects two rounds of ancient polyploidy. The histogram of Striga Ks values derived from syntenic blocks shows a bimodal Ks distribution with peaks around log10 transformed values of -0.3 (Ks ≈ 0.5, younger peak) and 0.09 (Ks ≈ 1.2, older peak), indicated in purple and cyan, respectively (Figure 3-6b). The purple peak, which represents the larger population of duplicate pairs, has the lower Ks and is therefore derived from a younger evolutionary event than the smaller population represented by the cyan peak. Previous studies have shown that the Mimulus lineage reflects only one WGD (most probably shared with core Lamiales) following its divergence from the Vitis lineage, which has not had any polyploidy event since the much older eudicot-wide paleohexaploidy event (also known as gamma) [25, 66]. Therefore, there is a 1:2 mapping of orthologous syntenic regions between Mimulus and Vitis, as was reported by Ibarra-Laclette et al. (2013) [25]. The Striga-Mimulus and Striga-Vitis ortholog plots show many large purple syntenic regions superimposed on many smaller and older cyan syntenic regions, highlighting two different age classes of syntenic blocks (Figures 3-7, 3-8). The younger syntenic blocks are orthologous blocks, while older paralogous blocks were detected as well. The duplication peaks of Striga-Mimulus and Striga-Vitis orthologs are around log10 transformed values of 0.04 (Ks ≈ 1.0) and 0.3 (Ks ≈ 2.0), respectively.
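The relationship between the log10 transformed peak positions and the Ks values quoted above can be checked with a short back-transformation. The following is a minimal sketch; the dictionary keys are labels chosen for illustration only, and the numbers are the peak positions given in the text.

```python
# Back-transform the log10(Ks) peak positions quoted in the text to Ks.
# The keys are illustrative labels, not identifiers used in the analysis.
log10_peaks = {
    "striga_younger": -0.3,   # purple peak, Striga paralogs
    "striga_older": 0.09,     # cyan peak, Striga paralogs
    "striga_mimulus": 0.04,   # Striga-Mimulus orthologs
    "striga_vitis": 0.3,      # Striga-Vitis orthologs
}


def to_ks(log10_value):
    """Invert the log10 transformation applied to the histogram axis."""
    return 10 ** log10_value


ks_peaks = {label: round(to_ks(value), 2) for label, value in log10_peaks.items()}

for label, ks in ks_peaks.items():
    print(f"{label}: Ks ~ {ks}")
```

The back-transformed values agree with the approximate Ks values given in the text; the Striga-Mimulus peak evaluates to about 1.1, which the text reports in rounded form.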

The whole genome syntenic ortholog dot plot (Figure 3-7a) shows that most of the Striga genome is syntenic with at least one region of Mimulus. One of several regions identified with a 1x Mimulus to 2x Striga relationship shows fractionated gene content, as expected following a polyploidy event (Figure 3-9a) [67]. An earlier WGD in the common ancestor of Mimulus and Striga would, therefore, create syntenic blocks composed of 2x Mimulus regions and 4x Striga regions (Figure 3-9b). A close-up view of these regions (Figure 3-9c) shows evidence of 4 Striga and 2 Mimulus collinear anchor genes that are present on the duplication node of the gene family tree in Figure 3-11. I further identified a Vitis region from the ortholog collinear block that is syntenic to the shared Striga and Mimulus regions shown in Figure 3-9. The regenerated microsynteny plot (Figure 3-10) shows this Vitis region syntenic to the two Mimulus and four Striga regions, as is expected following their divergence after the core eudicot-wide paleohexaploidy event.


Figure 3-6: Striga syntenic dot plot showing the divergence of paralogous syntenic gene pairs within the Striga asiatica genome. Syntenic dot plots of Striga against itself showing evidence of at least two WGD events. (a). Self-self syntenic dot plot where contigs are ordered and oriented by syntenic path assembly. Syntenic gene pairs colored by their Ks values show two age distributions. Purple syntenic paralogs are younger than cyan.

(b). Histogram of log10 transformed Ks values of syntenic gene pairs identified in (a) shows a complex distribution with a peak of younger syntenic gene pairs in purple and a peak of older duplicates in cyan.


Figure 3-7: Striga cross-species syntenic dot plot against Mimulus showing the divergence of orthologous syntenic gene pairs. (a). Syntenic dot plot of orthologous Striga (y-axis) versus Mimulus (x-axis) with Striga contigs ordered and oriented based on their syntenic path to Mimulus. Syntenic gene pairs colored by their Ks values could reflect a mixture of two age distributions. Purple syntenic orthologs are younger than cyan.

(b). Histogram of log10 transformed Ks values of syntenic gene pairs identified in (a).


Figure 3-8: Striga cross-species syntenic dot plot against Vitis showing the divergence of orthologous syntenic gene pairs. (a). Syntenic dot plot of orthologous Striga (y-axis) versus Vitis (x-axis) with Striga contigs ordered and oriented based on their syntenic path to Vitis. Syntenic gene pairs colored by their Ks values could reflect a mixture of two age distributions. Purple syntenic orthologs are younger than cyan.

(b). Histogram of log10 transformed Ks values of syntenic gene pairs identified in (a).


Figure 3-9: Exemplar microsynteny analysis consistent with Striga lineage-specific and core Lamiales-wide duplications. a). Exemplar microsynteny analysis of two syntenic Striga regions and one Mimulus region showing evidence of fractionated gene content. b). Syntenic regions in (a) with one additional region of Mimulus and two additional regions of Striga. c). Evidence of 4x Striga to 2x Mimulus collinear anchor genes present on the duplication node of a gene family phylogenetic tree (Figure 3-11). Cyan asterisk represents the duplication in the common ancestor of Mimulus and Striga (core Lamiales-wide duplication), and purple asterisks represent duplications in the Striga lineage.


Figure 3-10: Exemplar microsynteny analysis consistent with Striga lineage-specific and core Lamiales-wide duplications in relation to an ancestral core eudicot represented by Vitis. a). Exemplar of microsynteny among four Striga and two Mimulus syntenic regions shown in Figure 3-9b, and one Vitis region. b). Evidence of 4x Striga to 2x Mimulus to 1x Vitis collinear anchor genes present on the duplication node of a gene family phylogenetic tree (Figure 3-11). Cyan asterisk represents the duplication in the common ancestor of core Lamiales (core Lamiales-wide duplication), and purple asterisks represent duplications in the Striga lineage.


Figure 3-11: Exemplar phylogenetic tree consistent with Striga lineage-specific and core Lamiales-wide duplications in relation to an ancestral core eudicot represented by the Vitis gene. An example subtree of the RAxML ML gene family phylogenetic tree (orthogroup 460) shows the duplication of syntenic anchor genes located on homologous Striga, Mimulus, and Vitis syntenic blocks. Anchor genes present on the syntenic blocks are enclosed in red boxes. Cyan asterisk represents a duplication in the common ancestor of core Lamiales with node support of 100% bootstrap replicates, and purple asterisks represent two duplications in the Striga lineage supported by 100% and 98% bootstrap replicates. The core Lamiales duplication includes Utricularia gibba (Utrgi) based on evidence from literature described in the introductory section. Utricularia is a carnivorous species that has a miniature genome and is reported to have experienced massive gene losses. Therefore, missing gene duplicates in Utricularia are likely. Additional results available in Appendix C: Data S1T and S1U.

Conclusions

To date, whole genome duplication has not been studied comprehensively in the Orobanchaceae, except for the hypothesized events reported by Wickett et al. (2011) using sparse transcriptome data from a small number of species. In this study, I used an integrated phylogenetic, syntenic, and gene divergence approach to unravel ancient WGD events in the Orobanchaceae family. Taken together, these analyses identify an ancient polyploidy event in a common ancestor of Orobanchaceae and closely related non-parasitic sister lineages that occurred following their divergence from all other core Lamiales lineages, and confirm the existence of a previously reported polyploidy event restricted to core Lamiales. Surprisingly, no evidence was found for additional WGD events within Orobanchaceae, meaning that the many WGD-derived duplicated genes [9] were present in a common ancestor of Orobanchaceae and have diversified further through duplication processes independent of WGD events within the family.

These results provide an opportunity to investigate the impact of whole genome duplications in the Orobanchaceae, the only lineage that has evolved parasitism in the order Lamiales. Whole genome duplication has already been shown to be one of the driving forces in the rapid radiation and diversification of angiosperms. In Chapter 4, I explore the role of WGD in the evolution of novel traits that led to the transition to parasitism in the Orobanchaceae.

Materials and methods

Genome assembly cleaning

To exclude any extraneous DNA sequences in the Striga asiatica nuclear genome assembly, reads from the Illumina PE libraries used in the assembly process were mapped back onto the assembly, and the read coverage depth of all scaffolds and contigs was computed using CLC Assembly Cell software (https://www.qiagenbioinformatics.com/products/clc-assembly-cell). Additionally, the 100 best-matching sequences in the NCBI nt database assigned to the Striga scaffolds and contigs by Megablast (e-value < 1e-10) searches were queried against the NCBI database to determine their taxonomic and source attribution. Assembly sequences that had all their best-matching sequences not classified as embryophytes were considered contaminants and removed from the assembly, together with sequences with high read depth (> 100x) that were attributed to plant organelles.
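The two removal rules described above can be illustrated with a minimal sketch. It is not the code used for the Striga assembly: the record layout, the function names, and the encoding of each best hit as an (is_embryophyte, is_organelle) pair are hypothetical. Two further points are assumptions because the text does not specify them: a sequence is treated as organellar when the majority of its hits are organellar, and sequences with no hits at all are retained.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

MAX_ORGANELLE_DEPTH = 100.0  # "> 100x" read-depth cutoff quoted in the text


@dataclass
class AssemblySequence:
    name: str
    read_depth: float
    # One (is_embryophyte, is_organelle) pair per Megablast best hit
    # (hypothetical encoding of the taxonomic and source attribution).
    hits: List[Tuple[bool, bool]] = field(default_factory=list)


def is_contaminant(seq: AssemblySequence) -> bool:
    """True when the sequence has hits and none is classified as an embryophyte."""
    return bool(seq.hits) and not any(embryophyte for embryophyte, _ in seq.hits)


def is_high_depth_organelle(seq: AssemblySequence) -> bool:
    """True for high-depth sequences whose hits are mostly organellar (assumed rule)."""
    organelle_hits = sum(1 for _, organelle in seq.hits if organelle)
    mostly_organelle = bool(seq.hits) and organelle_hits > len(seq.hits) / 2
    return seq.read_depth > MAX_ORGANELLE_DEPTH and mostly_organelle


def clean_assembly(seqs):
    """Return the sequences retained after applying both filters."""
    return [s for s in seqs if not (is_contaminant(s) or is_high_depth_organelle(s))]


example = [
    AssemblySequence("scaffold_1", 35.0, [(True, False)] * 3),
    AssemblySequence("contig_7", 28.0, [(False, False)] * 3),  # no embryophyte hit
    AssemblySequence("contig_9", 850.0, [(True, True)] * 3),   # organelle-like, high depth
    AssemblySequence("contig_12", 40.0, []),                    # no hits: retained
]
kept_names = [s.name for s in clean_assembly(example)]
print(kept_names)
```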

Repeat library construction

I followed the protocol described by Campbell et al., 2014 [33] to create a Striga-specific repeat library suitable for repeat masking prior to protein-coding gene annotation. Briefly, the genome assembly was first searched with structural approaches to collect consensus miniature inverted-repeat transposable elements (MITEs) and long terminal repeat retrotransposons (LTRs) using MITE-Hunter [68] and LTRharvest/LTRdigest [69, 70], respectively. LTRs were filtered to remove false positives and elements with nested insertions. The genome was then masked using the collected LTRs and MITEs, and additional de novo repetitive sequences were predicted by RepeatModeler (http://www.repeatmasker.org/RepeatModeler) from the unmasked regions of the genome. All collected repeat sequences were searched against plant proteins from UniRef [71], and elements with significant hits to genes were excluded from the repeat masking library.

RNA-Seq sequencing and transcriptome assembly

Total RNAs were extracted from tissue samples (leaf, shoot, root, and haustoria) of Striga asiatica according to the protocol described by Yoshida et al., 2010 [72]. RNA-Seq libraries were prepared using the TruSeq RNA Sample Prep Kit (Illumina) for an insert size of 180 bp and sequenced using 100-bp paired-end sequencing on the Illumina HiSeq 2000 platform. Raw RNA-Seq reads for Striga asiatica and other species representing major groups within Orobanchaceae (Table 3-1) were trimmed to remove low-quality bases as well as embedded adaptor sequences and filtered to discard short read fragments using Trimmomatic v0.33 [73]. FastQC v0.10.1 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was used to assess the overall sequence quality before and after trimming. Cleaned reads from each tissue sample were assembled de novo using Trinity [74] with the default parameters. The resulting transcriptome assemblies were post-processed with the PlantTribes 2 AssemblyPostProcessor (https://github.com/dePamphilis/PlantTribes) to select contigs with potential coding regions to use as evidence for gene annotation and in phylogenomic analyses.


Table 3-4: Additional RNA-Seq datasets used in the study. Additional RNA-Seq datasets for Orobanchaceae and closely related species obtained from public biological data repositories and utilized as evidence for gene annotation and in phylogenomic analyses. NCBI SRA = NCBI Short Read Archive; PPGP = Parasitic Plant Genome Project; 1KP = One Thousand Plant Transcriptome Initiative.

Species | Family | Clade | Source | Tissue | Library Type
Paulownia tomentosa | Paulowniaceae | | NCBI SRA | leaves | Illumina HiSeq 2000
Rehmannia glutinosa | Rehmanniaceae | Rehmannieae | NCBI SRA | roots | Illumina HiSeq 2000
Lindenbergia philippensis | Orobanchaceae | Lindenbergia (clade I) | PPGP | whole plant | Illumina HiSeq 2500 and GAII
Epifagus virginiana | Orobanchaceae | Orobanche (clade III) | 1KP | buds, above ground tissues | Illumina HiSeq 2000
Conopholis americana | Orobanchaceae | Orobanche (clade III) | 1KP | flower buds | Illumina HiSeq 2000
Cistanche deserticola | Orobanchaceae | Orobanche (clade III) | NCBI SRA | fleshy stem | Illumina HiSeq 2000
Orobanche californica | Orobanchaceae | Orobanche (clade III) | PPGP | leaves | Illumina HiSeq 2500
Orobanche minor | Orobanchaceae | Orobanche (clade III) | PPGP | leaves | Illumina HiSeq 2000
Orobanche fasciculata | Orobanchaceae | Orobanche (clade III) | 1KP | vegetative growth, flower buds | Illumina HiSeq 2000
Phelipanche ramosa | Orobanchaceae | Orobanche (clade III) | PPGP | leaves | Illumina HiSeq 2500
Phelipanche mutelii | Orobanchaceae | Orobanche (clade III) | PPGP | leaves | Illumina HiSeq 2500
Phelipanche aegyptiaca | Orobanchaceae | Orobanche (clade III) | PPGP | above ground tissues | Illumina HiSeq 2500 and GAII
Phtheirospermum japonicum | Orobanchaceae | Castilleja-Pedicularis (clade IV) | NCBI SRA | | Illumina HiSeq 2000
pectinata | Orobanchaceae | Striga-Alectra (clade VI) | NCBI SRA | | Illumina HiSeq 2500
Triphysaria versicolor | Orobanchaceae | Castilleja-Pedicularis (clade IV) | PPGP | above ground tissues | Illumina HiSeq 2500 and GAII
Triphysaria pusilla | Orobanchaceae | Castilleja-Pedicularis (clade IV) | John Yoder | roots | Sanger
Triphysaria eriantha | Orobanchaceae | Castilleja-Pedicularis (clade IV) | PPGP | leaves | Illumina HiSeq 2500
Euphrasia rostkoviana | Orobanchaceae | -Rhinanthus (clade V) | Alex Twyford | leaves | Illumina HiSeq 2500
Melampyrum roseum | Orobanchaceae | Euphrasia-Rhinanthus (clade V) | NCBI SRA | | Illumina HiSeq 2000
Striga gesnerioides | Orobanchaceae | Striga-Alectra (clade VI) | PPGP | leaves | Illumina HiSeq 2000
Striga hermonthica | Orobanchaceae | Striga-Alectra (clade VI) | PPGP | above ground tissues | Illumina HiSeq 2500 and GAII
indica | Orobanchaceae | Striga-Alectra (clade VI) | NCBI SRA | | Illumina HiSeq 2000
Alectra vogelii | Orobanchaceae | Striga-Alectra (clade VI) | PPGP | leaves | Illumina HiSeq 2000

Gene prediction and functional assignment

Gene models were predicted with the MAKER pipeline (release 2.31.8) [55] using tissue-specific transcriptome assemblies of Striga asiatica (described previously) and related species of Orobanchaceae obtained from the Parasitic Plant Genome Project [75] as transcript evidence. Further cross-species protein homology evidence was supplied by proteomes derived from the annotations for Mimulus guttatus v2.0 as represented in Phytozome v11 [76] and a set of canonical plant (embryophyte) proteins from UniProt/SwissProt release 2017_04 [77]. Repetitive and low complexity regions of the genome assembly were masked with RepeatMasker in MAKER using the custom repeat library developed for Striga. Genes were predicted in MAKER iteratively, with bootstrap training of the SNAP [78] and Augustus [79] ab initio gene predictors to improve their performance [33, 80]. Gene models from each round of MAKER runs were used to seed the next round of SNAP and Augustus training. Selected gene models for Augustus training were required to meet the following criteria: (1) must have greater than 75% evidence support, (2) the length of both 5' and 3' UTRs must be at least 200 bp, (3) at least 80% of the splice sites must be confirmed with RNA-Seq alignment evidence, (4) at least 80% of the exons must match both RNA-Seq and protein alignment evidence, (5) the length of the protein sequence produced by the predicted mRNAs must be approximately 450 amino acids, the average plant protein size [81], and (6) the training genes must be divergent from each other (< 50% identity) and must not overlap. Provisional functional descriptions for the gene models were assigned using AHRD (https://github.com/groupschoof/AHRD), a pipeline for lexical analysis and selection of the best functional descriptor for gene products following BLASTp searches against the UniProt/SwissProt, UniProt/TrEMBL, and TAIR10 [77, 82-84] databases. Additionally, gene models were annotated with protein family domains as detected by InterProScan [85-87], and identified domains were directly translated into gene ontology terms.
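The per-gene selection criteria (1) to (5) for Augustus training genes can be expressed as a simple filter, sketched below. The GeneModel record and its field names are hypothetical and are not part of MAKER or Augustus. Because the text states only that proteins must be "approximately" 450 amino acids, the length tolerance is an assumed parameter. Criterion (6) requires pairwise comparison among candidate genes and is not covered here.

```python
from dataclasses import dataclass

TARGET_PROTEIN_LEN = 450   # average plant protein size cited in the text
LENGTH_TOLERANCE = 0.5     # assumed: accept lengths within 50% of the target


@dataclass
class GeneModel:
    gene_id: str
    evidence_support: float   # fraction of the model supported by evidence
    utr5_len: int             # 5' UTR length in bp
    utr3_len: int             # 3' UTR length in bp
    splice_confirmed: float   # fraction of splice sites confirmed by RNA-Seq
    exon_match: float         # fraction of exons matching RNA-Seq and protein evidence
    protein_len: int          # protein length in amino acids


def passes_training_criteria(gene: GeneModel) -> bool:
    """Apply criteria (1) to (5); criterion (6) needs pairwise comparisons."""
    max_deviation = LENGTH_TOLERANCE * TARGET_PROTEIN_LEN
    length_ok = abs(gene.protein_len - TARGET_PROTEIN_LEN) <= max_deviation
    return (
        gene.evidence_support > 0.75
        and gene.utr5_len >= 200
        and gene.utr3_len >= 200
        and gene.splice_confirmed >= 0.80
        and gene.exon_match >= 0.80
        and length_ok
    )


candidates = [
    GeneModel("gene_a", 0.95, 250, 310, 1.00, 0.90, 430),
    GeneModel("gene_b", 0.95, 150, 310, 1.00, 0.90, 430),   # 5' UTR too short
    GeneModel("gene_c", 0.70, 250, 310, 1.00, 0.90, 430),   # too little evidence
    GeneModel("gene_d", 0.95, 250, 310, 1.00, 0.90, 1500),  # protein far from 450 aa
]
training_ids = [g.gene_id for g in candidates if passes_training_criteria(g)]
print(training_ids)
```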


Gene family classification

Complete sets of protein-coding genes from 26 plant genomes including Striga (Table 3-5) were clustered into gene lineages (i.e., orthogroups) using the OrthoFinder version 1.1.8 algorithm [88]. Selected taxa represent major land plant lineages, including ten rosid genomes, one basal core eudicot, four asterids, two basal eudicots, five monocots, one basal angiosperm, one gymnosperm, one lycophyte, and one moss. Additional analyses (described in the previous chapter) were then performed to generate orthogroup metadata and to convert the orthogroup circumscription into a PlantTribes 2 gene family scaffold.


Table 3-5: Taxa in the 26 genomes gene family scaffold. Data summary for the 26 genomes representing major land plant lineages used in the OrthoFinder orthogroups classification.

Species | Version | Transcripts | Validated | Source | Family | Order | Group
Manihot esculenta | v6.1 | 33,033 | 32,966 | Phytozome | Euphorbiaceae | Malpighiales | rosid
Populus trichocarpa | v3.0 | 41,335 | 41,207 | Phytozome | Salicaceae | Malpighiales | rosid
Phaseolus vulgaris | v2.1 | 27,433 | 27,388 | Phytozome | Fabaceae | Fabales | rosid
Medicago truncatula | Mt4.0v1 | 50,894 | 50,869 | Phytozome | Fabaceae | Fabales | rosid
Prunus persica | v2.1 | 26,873 | 26,772 | Phytozome | | | rosid
Arabidopsis thaliana | TAIR10 | 27,416 | 27,411 | Phytozome | Brassicaceae | Brassicales | rosid
Carica papaya | ASGPBv0.4 | 27,751 | 27,528 | Phytozome | Caricaceae | Brassicales | rosid
Theobroma cacao | v1.1 | 29,452 | 29,171 | Phytozome | Malvaceae | Malvales | rosid
Eucalyptus grandis | v2.0 | 36,349 | 36,288 | Phytozome | | | rosid
Vitis vinifera | Genoscope 12x | 26,346 | 26,315 | Phytozome | Vitaceae | Vitales | rosid
Striga asiatica | v2.0 | 34,575 | 34,575 | Private | Orobanchaceae | Lamiales | asterid
Mimulus guttatus | v2.0 | 28,140 | 28,079 | Phytozome | | Lamiales | asterid
Utricularia gibba | v4.1 | 28,032 | 27,206 | CoGe | Lentibulariaceae | Lamiales | asterid
Solanum lycopersicum | ITAGv2.4 | 34,725 | 34,476 | Phytozome | Solanaceae | Solanales | asterid
Beta vulgaris | v2.0 | 29,088 | 27,911 | GenBank | Amaranthaceae | Caryophyllales | basal core-eudicot
Aquilegia coerulea | v3.1 | 30,023 | 29,869 | Phytozome | Ranunculaceae | Ranunculales | basal eudicot
Nelumbo nucifera | v1.0 | 26,685 | 26,643 | GenBank | Nelumbonaceae | Proteales | basal eudicot
Sorghum bicolor | v3.1 | 34,211 | 34,118 | Phytozome | Poaceae | Poales | grass
Oryza sativa | v7.0 | 42,189 | 41,411 | Phytozome | Poaceae | Poales | grass
Musa acuminata | v1.0 | 36,542 | 36,514 | GenBank | Musaceae | Zingiberales | non-grass monocot
Elaeis guineensis | v2.0 | 30,752 | 29,667 | GenBank | Arecaceae | Arecales | non-grass monocot
Spirodela polyrhiza | v2.0 | 19,623 | 19,572 | Phytozome | Araceae | Alismatales | non-grass monocot
Amborella trichopoda | v1.0.27 | 26,846 | 26,802 | Phytozome | Amborellaceae | Amborellales | basal angiosperm
Pinus taeda | v2.0 | 34,059 | 27,596 | GenBank | Pinaceae | Pinales | gymnosperm
Selaginella moellendorffii | v1.0 | 22,285 | 22,251 | Phytozome | Selaginellaceae | Selaginellales | lycophyte
Physcomitrella patens | v3.3 | 32,929 | 32,853 | Phytozome | Funariaceae | Funariales | moss

Phylogenetic and gene collinearity analysis

Gene family multiple-sequence alignments and phylogenetic trees were constructed in the PlantTribes 2 framework with the GeneFamilyAligner and the GeneFamilyPhylogenyBuilder tools, respectively. Briefly, amino acid sequence alignments for each orthogroup were generated with PASTA [89] using a maximum of 5 iterative refinements. Corresponding DNA sequences were then forced onto the amino acid alignments, and the resulting DNA codon alignments were trimmed to remove gappy sites (gaps present in 90% of the sequences) using trimAl version 1.4.rev8 [90]. Approximately-maximum likelihood (ML) analyses were conducted using FastTree version 2.1.10 [91], searching for the best ML tree with the GTR and GAMMA models. The unrooted FastTree phylogenies were traversed and rooted with the most distant taxa in the orthogroup using rooting functions implemented in ETE Toolkit, a Python phylogenetic framework (http://etetoolkit.org/). Trees were then examined for gene duplications (terminal or shared with other taxa), and the detected duplications were scored using a scoring strategy similar to that described by Jiao et al., 2011 [7]. I scored orthogroups that showed at least one shared Lamiales (Striga and Mimulus, including Utricularia if present in the orthogroup) gene duplication with support values of at least 0.500 (50%) for the Lamiales duplication node and also for one of the two internal Lamiales branches (arbitrarily defined as the “right” or “left” branch). I also required that the duplication node be anchored on both sides with at least one species (Figure 3-12). Additionally, gene synteny and collinearity analysis was performed with the MCScanX algorithms [92] to classify all Striga and Mimulus duplicated genes by their likely duplication origins.
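The duplication scoring rule can be illustrated with a small, self-contained sketch. It does not use the ETE Toolkit API and is not the PlantTribes 2 implementation; the Node class and function names are hypothetical. The sketch assumes that a node is a candidate Lamiales duplication when at least one Lamiales species occurs on both sides of it (a species-overlap criterion), and then applies the support thresholds described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LAMIALES = {"Striga", "Mimulus", "Utricularia"}
MIN_SUPPORT = 0.5  # 0.500 (50%) threshold from the text


@dataclass
class Node:
    support: float = 0.0           # branch support in [0, 1]; leaves keep 0.0
    species: Optional[str] = None  # set on leaves only
    children: List["Node"] = field(default_factory=list)

    def leaf_species(self):
        """Collect the species names of all leaves below this node."""
        if not self.children:
            return {self.species}
        found = set()
        for child in self.children:
            found |= child.leaf_species()
        return found


def is_scored_duplication(node: Node) -> bool:
    """Species-overlap duplication with adequate node and branch support."""
    if len(node.children) != 2:
        return False
    left, right = node.children
    # Anchoring: the same Lamiales species must be present on both sides.
    shared = left.leaf_species() & right.leaf_species() & LAMIALES
    supported = (
        node.support >= MIN_SUPPORT
        and max(left.support, right.support) >= MIN_SUPPORT
    )
    return bool(shared) and supported


def count_scored_duplications(root: Node) -> int:
    """Count qualifying duplication nodes in the whole tree."""
    own = int(is_scored_duplication(root))
    return own + sum(count_scored_duplications(c) for c in root.children)


def clade(support, *names):
    """Build a clade whose children are leaves of the named species."""
    return Node(support=support, children=[Node(species=n) for n in names])


duplication = Node(support=0.9, children=[clade(0.8, "Striga", "Mimulus"),
                                          clade(0.3, "Striga", "Mimulus")])
tree = Node(support=1.0, children=[duplication, Node(species="Vitis")])
weak = Node(support=0.4, children=[clade(0.8, "Striga", "Mimulus"),
                                   clade(0.9, "Striga", "Mimulus")])
print(count_scored_duplications(tree), is_scored_duplication(weak))
```

In this example the first tree contains one scored duplication, because the duplication node and one of its two child branches meet the threshold, whereas the second node is rejected because its own support is below 0.5.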


Figure 3-12: Gene duplication phylogenetic topology scoring scheme. An illustration of gene family phylogenetic topologies required for identifying high confidence Lamiales-wide gene duplications. a). At least one Striga syntenic anchor gene must be on either side of the duplication node (gold asterisk) with at least 50% support and at least 50% support on either the left or the right branch. b). At least one Mimulus syntenic anchor gene must be on either side of the duplication node with at least 50% support and at least 50% support on either the left or the right branch. c). Both Striga and Mimulus syntenic anchor genes must be on both sides of the duplication node with at least 50% support and at least 50% support on either the left or the right branch.

Analysis of synonymous substitution rates (Ks)

Analysis of divergence patterns of synonymous substitution rates (Ks) was performed using the KaKsAnalysis tool of PlantTribes 2. Briefly, the matching Lamiales syntenic duplicate gene pairs for both Striga and Mimulus were identified using reciprocal BLAST searches. Protein alignments of paralogous pairs were estimated with MAFFT [93] and converted to codon alignments. Ks values of paralogs were calculated using the maximum likelihood method implemented in CODEML [94], imposing a minimum alignment length of 300 bp. The EMMIX [65] software was used to fit a mixture model of multivariate normal components to the Ks distributions, and frequencies of gene pairs with Ks divergences within the range of 0 to 2.0 were plotted for both Striga and Mimulus paralogs.
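The idea of decomposing a Ks distribution into normal components can be sketched with a basic expectation-maximization (EM) loop. This is not EMMIX and does not reproduce its options or model selection; it is a univariate, two-component illustration written for this purpose. The function names, the initialization from the lower and upper halves of the data, the fixed iteration count, and the treatment of the 0 to 2.0 range as excluding zero are all assumptions.

```python
import math


def filter_pairs(pairs, min_aln_len=300, max_ks=2.0):
    """Keep Ks values from (ks, alignment_length) pairs meeting both cutoffs."""
    return [ks for ks, aln_len in pairs if aln_len >= min_aln_len and 0 < ks <= max_ks]


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_two_components(values, n_iter=200):
    """EM for a two-component normal mixture; returns (mean, sd, proportion) tuples."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    overall_mean = sum(data) / n
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    # Assumed initialization: means of the lower and upper halves of the data.
    means = [sum(data[:half]) / half, sum(data[half:]) / (n - half)]
    sds = [overall_sd, overall_sd]
    props = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in data:
            weights = [props[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = sum(weights) or 1e-300
            resp.append([w / total for w in weights])
        # M-step: update means, standard deviations, and mixing proportions.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)
            props[k] = nk / n
    return sorted(zip(means, sds, props))


# Toy data: (Ks, codon alignment length in bp); values are invented for illustration.
toy_pairs = [(0.35, 900), (0.40, 600), (0.45, 1200), (0.50, 450), (0.55, 750),
             (0.40, 330), (0.45, 810), (0.50, 960), (1.1, 540), (1.2, 1500),
             (1.3, 660), (1.2, 390), (0.45, 120), (3.5, 900)]
ks_values = filter_pairs(toy_pairs)
components = fit_two_components(ks_values)
for mean, sd, proportion in components:
    print(f"mean={mean:.2f} sd={sd:.2f} proportion={proportion:.2f}")
```

The toy data are filtered to twelve Ks values (one pair fails the alignment length cutoff and one exceeds Ks 2.0) and then split into a younger and an older component, analogous to the "mean/proportion" summaries reported for the Ks plots.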

Genome structure and synteny analysis

Structural syntenic analyses were performed with the SynMap tool of the CoGe comparative genomics platform [95]. The genomes of Mimulus and Vitis were compared to the genome of Striga with the chaining algorithm implemented in DAGChainer [96]. I specified a maximum distance of 20 genes between gene matches and required a minimum of 5 genes to seed a syntenic region. Scaffolds and contigs of Striga were ordered and oriented based on their syntenic path to both Mimulus and Vitis. Additional high-resolution analyses of selected regions determined to be syntenic were performed with GEvo microsynteny analysis tool [97] in CoGe, which permits comparison of multiple genomic regions.
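The effect of the two chaining parameters (a maximum distance of 20 genes between matches and a minimum of 5 genes per syntenic region) can be illustrated with a greatly simplified sketch. It is not the DAGChainer algorithm, which uses dynamic programming with scoring. This version greedily extends forward-oriented chains only and ignores inverted regions, and the function name and input format are hypothetical.

```python
def chain_matches(matches, max_gap=20, min_genes=5):
    """Greedy forward-collinear chaining of (gene_index_a, gene_index_b) matches.

    A match extends a chain when it lies downstream of the chain's last match
    by at most `max_gap` genes in both genomes. Chains with fewer than
    `min_genes` matches are discarded.
    """
    chains = []
    for a, b in sorted(matches):
        for chain in chains:
            last_a, last_b = chain[-1]
            if 0 < a - last_a <= max_gap and 0 < b - last_b <= max_gap:
                chain.append((a, b))
                break
        else:
            # No existing chain can be extended: start a new one.
            chains.append([(a, b)])
    return [chain for chain in chains if len(chain) >= min_genes]


# Gene-order positions of homologous gene pairs in two genomes (toy data).
toy_matches = [(1, 101), (3, 103), (4, 104), (10, 110), (25, 125), (26, 126),
               (200, 5), (201, 6)]
blocks = chain_matches(toy_matches)
print(len(blocks), [len(block) for block in blocks])

# Four collinear matches followed by a gap of 26 genes: no block is reported.
sparse = chain_matches([(1, 1), (2, 2), (3, 3), (4, 4), (30, 30)])
print(sparse)
```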

Integration of additional taxa

Post-processed de novo assembly transcripts of Orobanchaceae and closely related sister species listed in Table 3-4 were sorted into the 26 genomes gene family scaffold using PlantTribes 2 GeneFamilyClassifier and integrated with orthologous backbone reference gene models using PlantTribes 2 GeneFamilyIntegrator. Protein alignments of integrated orthogroups were estimated and trimmed with the PlantTribes 2 GeneFamilyAligner, and representative sequences were selected for each integrated taxon. Representative sequences are the best scoring post-processed transcripts from each Trinity assembler gene cluster based on the integrated orthogroup global sequence alignment coverage. Orthogroup phylogenetic trees were inferred with PlantTribes 2 GeneFamilyPhylogenyBuilder and scored for Lamiales-wide duplications as previously described (Figure 3-12). Taxa representing major groups (clades) within Orobanchaceae were also searched on the phylogenetic trees to look for clade-specific WGDs. Lastly, paralogous pairwise Ks values for duplicate genes were estimated and clustered into components with the PlantTribes 2 KaKsAnalysis tool to identify significant duplication events in each integrated taxon.
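Selection of a representative transcript per Trinity gene cluster can be sketched as a simple grouping step. This is an illustration rather than the PlantTribes 2 code. The tuple layout, the function name, the example identifiers, and the rule that ties keep the first transcript encountered are assumptions.

```python
def select_representatives(transcripts):
    """Pick the transcript with the highest alignment coverage per gene cluster.

    `transcripts` is an iterable of (taxon, gene_cluster, transcript_id, coverage)
    tuples, where coverage is the fraction of the orthogroup alignment covered.
    """
    best = {}
    for taxon, cluster, transcript_id, coverage in transcripts:
        key = (taxon, cluster)
        # Strict comparison: on ties the first transcript seen is retained.
        if key not in best or coverage > best[key][1]:
            best[key] = (transcript_id, coverage)
    return {key: transcript_id for key, (transcript_id, _) in best.items()}


# Toy records; identifiers and coverage values are invented for illustration.
toy_transcripts = [
    ("Striga hermonthica", "cluster_1", "cluster_1_isoform_1", 0.62),
    ("Striga hermonthica", "cluster_1", "cluster_1_isoform_2", 0.91),
    ("Striga hermonthica", "cluster_2", "cluster_2_isoform_1", 0.40),
    ("Orobanche minor", "cluster_1", "cluster_1_isoform_1", 0.75),
]
representatives = select_representatives(toy_transcripts)
for key, transcript_id in representatives.items():
    print(key, transcript_id)
```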

References

1. Yoshida, S., Kim, S., Wafula, E.K., Tanskanen, J., Kim, Y.-M., Honaas, L., Yang, Z., Spallek, T., Conn, C.E., Ichihashi, Y., Cheong, K., Cui, S., Der, J.P., Gundlach, H., Jiao, Y., Hori, C., Ishida, J.K., Kasahara, H., Kiba, T., Kim, M.-S., Koo, N., Laohavisit, A., Lee, Y.-H., Lumba, S., McCourt, P., Mortimer, J.C., Mutuku, J.M., Nomura, T., Sasaki-Sekimoto, Y., Seto, Y., Wang, Y., Wakatake, T., Sakakibara, H., Demura, T., Yamaguchi, S., Yoneyama, K., Manabe, R.-I., Nelson, D.C., Schulman, A.H., Timko, M.P., dePamphilis, C.W., Choi, D., Shirasu, K. Genome sequence of Striga asiatica provides insight into the evolution of plant parasitism. Curr. Biol. 29, 3041-3052 (2019).
2. Cui, L., Wall, P.K., Leebens-Mack, J.H., Lindsay, B.G., Soltis, D.E., Doyle, J.J., Soltis, P.S., Carlson, J.E., Arumuganathan, K., Barakat, A., Albert, V.A., Ma, H., dePamphilis, C.W. Widespread genome duplications throughout the history of flowering plants. Genome Research. 16, 738–749 (2006).
3. Ren, R., Wang, H., Guo, C., Zhang, N., Zeng, L., Chen, Y., Ma, H., Qi, J. Widespread whole genome duplications contribute to genome complexity and species diversity in angiosperms. Mol. Plant. 11, 414–428 (2018).
4. Clark, J.W., Donoghue, P.C.J. Whole-genome duplication and plant macroevolution. Trends Plant Sci. 23, 933–945 (2018).
5. Ainouche, M.L., Baumel, A., Salmon, A. Spartina anglica C. E. Hubbard: a natural model system for analysing early evolutionary changes that affect allopolyploid genomes. Biol J Linn Soc. 82, 475–484 (2004).
6. Wood, T.E., Takebayashi, N., Barker, M.S., Mayrose, I., Greenspoon, P.B., Rieseberg, L.H. The frequency of polyploid speciation in vascular plants. Proc. Natl. Acad. Sci. U.S.A. 106, 13875–13879 (2009).
7. Jiao, Y., Leebens-Mack, J., Ayyampalayam, S., Bowers, J.E., McKain, M.R., McNeal, J., Rolf, M., Ruzicka, D.R., Wafula, E., Wickett, N.J., Wu, X., Zhang, Y., Wang, J., Zhang, Y., Carpenter, E.J., Deyholos, M.K., Kutchan, T.M., Chanderbali, A.S., Soltis, P.S., Stevenson, D.W., McCombie, R., Pires, J.C., Wong, G.K.-S., Soltis, D.E., dePamphilis, C.W. A genome triplication associated with early diversification of the core eudicots. Genome Biol. 13, R3 (2012).
8. Buggs, R.J., Renny-Byfield, S., Chester, M., Jordon-Thaden, I.E., Viccini, L.F., Chamala, S., Leitch, A.R., Schnable, P.S., Barbazuk, W.B., Soltis, P.S., Soltis, D.E. Next-generation sequencing and genome evolution in allopolyploids. Am. J. Bot. 99, 372-382 (2012).
9. Yang, Z., Wafula, E.K., Honaas, L.A., Zhang, H., Das, M., Fernandez-Aparicio, M., Huang, K., Bandaranayake, P.C.G., Wu, B., Der, J.P., Clarke, C.R., Ralph, P.E., Landherr, L., Altman, N.S., Timko, M.P., Yoder, J.I., Westwood, J.H., dePamphilis, C.W. Comparative transcriptome analyses reveal core parasitism genes and suggest gene duplication and repurposing as sources of structural novelty. Mol. Biol. Evol. 32, 767–790 (2014).
10. Tank, D.C., Eastman, J.M., Pennell, M.W., Soltis, P.S., Soltis, D.E., Hinchliff, C.E., Brown, J.W., Sessa, E.B., Harmon, L.J. Nested radiations and the pulse of angiosperm diversification: increased diversification rates often follow whole genome duplications. New Phytol. 207, 454-467 (2015).
11. Edger, P.P., Heidel-Fischer, H.M., Bekaert, M., Rota, J., Glöckner, G., Platts, A.E., Heckel, D.G., Der, J.P., Wafula, E.K., Tang, M., Hofberger, J.A., Smithson, A., Hall, J.C., Blanchette, M., Bureau, T.E., Wright, S.I., dePamphilis, C.W., Schranz, M.E., Barker, M.S., Conant, G.C., Wahlberg, N., Vogel, H., Pires, J.C., Wheat, C.W. The butterfly plant arms-race escalated by gene and genome duplications. Proc. Natl. Acad. Sci. U.S.A. 112, 8362–8366 (2015).
12. Chase, M.W., Christenhusz, M.J.M., Fay, M.F., Byng, J.W., Judd, W.S., Soltis, D.E., Mabberley, D.J., Sennikov, A.N., Soltis, P.S., Stevens, P.F. An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG IV. Bot J Linn Soc. 181, 1–20 (2016).
13. Press, M., Graves, J. Parasitic Plants. Springer. (1995).
14. Joel, D.M., Gressel, J., Musselman, L.J. Parasitic Orobanchaceae. Springer. (2013).
15. Smith, J.F., Wolfram, J.C., Brown, K.D., Carroll, C.L., Denton, D.S. Tribal relationships in the Gesneriaceae: Evidence from DNA sequences of the chloroplast gene ndhF. Ann. Missouri Bot. Gard. 84, 50-66 (1997).
16. Xu, Z., Chang, L. Acanthaceae. In: Identification and Control of Common Weeds: Vol 3. pp. 329–338. Springer Singapore, Singapore (2017).
17. Juniper, B.E., Robins, R.J., Joel, D.M. The Carnivorous Plants. Academic Press (1989).
18. Müller, K., Borsch, T., Legendre, L., Porembski, S., Theisen, I., Barthlott, W. Evolution of carnivory in Lentibulariaceae and the Lamiales. Plant Biol. 6, 477–490 (2004).
19. Bartels, D. Desiccation Tolerance Studied in the Resurrection Plant Craterostigma plantagineum. Integr. Comp. Biol. 45, 696-701 (2005).
20. Müller, J., Sprenger, N., Bortlik, K., Boller, T., Wiemken, A. Desiccation increases sucrose levels in Ramonda and Haberlea, two genera of resurrection plants in the Gesneriaceae. Physiologia Plantarum. 100, 153–158 (1997).
21. Schäferhoff, B., Fleischmann, A., Fischer, E., Albach, D.C., Borsch, T., Heubl, G., Müller, K.F. Towards resolving Lamiales relationships: insights from rapidly evolving chloroplast sequences. BMC Evol. Biol. 10, 352 (2010).
22. Unver, T., Wu, Z., Sterck, L., Turktas, M., Lohaus, R., Li, Z., Yang, M., He, L., Deng, T., Escalante, F.J., Llorens, C., Roig, F.J., Parmaksiz, I., Dundar, E., Xie, F., Zhang, B., Ipek, A., Uranbey, S., Erayman, M., Ilhan, E., Badad, O., Ghazal, H., Lightfoot, D.A., Kasarla, P., Colantonio, V., Tombuloglu, H., Hernandez, P., Mete, N., Cetin, O., Van Montagu, M., Yang, H., Gao, Q., Dorado, G., Van de Peer, Y. Genome of wild olive and the evolution of oil biosynthesis. Proc. Natl. Acad. Sci. U.S.A. 114, E9413–E9422 (2017).
23. Sollars, E.S.A., Harper, A.L., Kelly, L.J., Sambles, C.M., Ramirez-Gonzalez, R.H., Swarbreck, D., Kaithakottil, G., Cooper, E.D., Uauy, C., Havlickova, L., Worswick, G., Studholme, D.J., Zohren, J., Salmon, D.L., Clavijo, B.J., Li, Y., He, Z., Fellgett, A., McKinney, L.V., Nielsen, L.R., Douglas, G.C., Kjær, E.D., Downie, J.A., Boshier, D., Lee, S., Clark, J., Grant, M., Bancroft, I., Caccamo, M., Buggs, R.J.A. Genome sequence and genetic diversity of European ash trees. Nature. 541, 212–216 (2017).
24. Julca, I., Marcet-Houben, M., Vargas, P., Gabaldón, T. Phylogenomics of the olive tree (Olea europaea) reveals the relative contribution of ancient allo- and autopolyploidization events. BMC Biol. 16, 15 (2018).
25. Ibarra-Laclette, E., Lyons, E., Hernández-Guzmán, G., Pérez-Torres, C.A., Carretero-Paulet, L., Chang, T.-H., Lan, T., Welch, A.J., Juárez, M.J.A., Simpson, J., Fernández-Cortés, A., Arteaga-Vázquez, M., Góngora-Castillo, E., Acevedo-Hernández, G., Schuster, S.C., Himmelbauer, H., Minoche, A.E., Xu, S., Lynch, M., Oropeza-Aburto, A., Cervantes-Pérez, S.A., de Jesús Ortega-Estrada, M., Cervantes-Luevano, J.I., Michael, T.P., Mockler, T., Bryant, D., Herrera-Estrella, A., Albert, V.A., Herrera-Estrella, L. Architecture and evolution of a minute plant genome. Nature. 498, 94–98 (2013).
26. Wang, L., Yu, S., Tong, C., Zhao, Y., Liu, Y., Song, C., Zhang, Y., Zhang, X., Wang, Y., Hua, W., Li, D., Li, D., Li, F., Yu, J., Xu, C., Han, X., Huang, S., Tai, S., Wang, J., Xu, X., Li, Y., Liu, S., Varshney, R.K., Wang, J., Zhang, X. Genome sequencing of the high oil crop sesame provides insight into oil biosynthesis. Genome Biol. 15, R39 (2014).
27. Xu, H., Song, J., Luo, H., Zhang, Y., Li, Q., Zhu, Y., Xu, J., Li, Y., Song, C., Wang, B., Sun, W., Shen, G., Zhang, X., Qian, J., Ji, A., Xu, Z., Luo, X., He, L., Li, C., Sun, C., Yan, H., Cui, G., Li, X., Li, X., Wei, J., Liu, J., Wang, Y., Hayward, A., Nelson, D., Ning, Z., Peters, R.J., Qi, X., Chen, S. Analysis of the Genome Sequence of the Medicinal Plant Salvia miltiorrhiza. Mol. Plant. 9, 949–952 (2016).
28. Wickett, N.J., Honaas, L.A., Wafula, E.K., Das, M., Huang, K., Wu, B., Landherr, L., Timko, M.P., Yoder, J., Westwood, J.H., dePamphilis, C.W. Transcriptomes of the parasitic plant family Orobanchaceae reveal surprising conservation of chlorophyll synthesis. Curr. Biol. 21, 2098–2104 (2011).
29. Breitwieser, F.P., Pertea, M., Zimin, A., Salzberg, S.L. Human contamination in bacterial genomes has created thousands of spurious proteins. Genome Res. 29, 954-960 (2019).
30. Lu, J., Salzberg, S.L. Removing contaminants from databases of draft genomes. PLoS Comput. Biol. 14, e1006277 (2018).
31. Salzberg, S.L. Next-generation genome annotation: we still struggle to get it right. Genome Biol. 20, 92 (2019).
32. Yandell, M., Ence, D. A beginner’s guide to eukaryotic genome annotation. Nat. Rev. Genet. 13, 329-342 (2012).
33. Campbell, M.S., Law, M., Holt, C., Stein, J.C., Moghe, G.D., Hufnagel, D.E., Lei, J., Achawanantakun, R., Jiao, D., Lawrence, C.J., Ware, D., Shiu, S.-H., Childs, K.L., Sun, Y., Jiang, N., Yandell, M. MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol. 164, 513–524 (2014).
34. Florea, L., Souvorov, A., Kalbfleisch, T.S., Salzberg, S.L. Genome assembly has a major impact on gene content: A comparison of annotation in two Bos taurus assemblies. PLoS ONE. 6, e21400 (2011).
35. Watson, M., Warr, A. Errors in long-read assemblies can critically affect protein prediction. Nat. Biotechnol. 37, 124–126 (2019).
36. Hubisz, M.J., Lin, M.F., Kellis, M., Siepel, A. Error and error mitigation in low-coverage genome assemblies. PLoS ONE. 6, e17034 (2011).
37. Del Angel, V.D., Hjerde, E., Sterck, L., Capella-Gutierrez, S., Notredame, C., Pettersson, O.V., Amselem, J., Bouri, L., Bocs, S., Klopp, C., Gibrat, J.-F., Vlasova, A., Leskosek, B.L., Soler, L., Binzer-Panchal, M., Lantz, H. Ten steps to get started in genome assembly and annotation. F1000Res. 7, 148 (2018).
38. Li, F.-W., Harkess, A. A guide to sequence your favorite plant genomes. Appl. Plant Sci. 6, e1030 (2018).
39. Delmont, T.O., Eren, A.M. Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies. PeerJ. 4, e1839 (2016).
40. Merchant, S., Wood, D.E., Salzberg, S.L. Unexpected cross-species contamination in genome sequencing projects. PeerJ. 2, e675 (2014).
41. Longo, M.S., O'Neill, M.J., O'Neill, R.J. Abundant human DNA contamination identified in non-primate genome databases. PLoS ONE. 6, e16410 (2011).
42. Kryukov, K., Imanishi, T. Human contamination in public genome assemblies. PLoS ONE. 11, e0162424 (2016).
43. Richards, T.A., Monier, A. A tale of two tardigrades. Proc. Natl. Acad. Sci. U.S.A. 113, 4892–4894 (2016).
44.
Koutsovoulos, G., Kumar, S., Laetsch, D.R., Stevens, L., Daub, J., Conlon, C., Maroon, H., Thomas, F., Aboobaker, A.A., Blaxter, M. No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proc. Natl. Acad. Sci. U.S.A. 113, 5053–5058 (2016). 45. Garbus, I., Romero, J.R., Valarik, M., Vanžurová, H., Karafiátová, M., Caccamo, M., Dolezel, J., Tranquilli, G., Helguera, M., Echenique, V. Characterization of repetitive DNA landscape in wheat homeologous group 4 chromosomes. BMC Genomics. 16, 1–16 (2015). 46. International Wheat Genome Sequencing Consortium (IWGSC), IWGSC RefSeq principal investigators:, Keller, B., IWGSC whole-genome assembly principal investigators:, Distelfeld, A., Poland, J., Ronen, G., Sharpe, A.G., Whole-genome sequencing and assembly:, Pozniak, C., Barad, O., Baruch, K., Ben-Zvi, G., Hi-C data-based scaffolding:, Himmelbach, A., Whole-genome assembly quality control and analyses:, Choulet, F., Gutierrez-Gonzalez, J., Josselin, A.-A., Koh, C., Muehlbauer, G., Pasam, R.K., Rigault, P., Pseudomolecule assembly:, Mascher, M., RefSeq genome structure and gene analyses:, Ormanbekova, D., Wicker, T., Automated annotation:, Swarbreck,

88

D., Felder, M., Guilhot, N., Kaithakottil, G., Keilwagen, J., Lang, D., Leroy, P., Lux, T., Mayer, K.F.X., Twardziok, S., Venturini, L., Manual gene curation:, Rimbert, H., Subgenome comparative analyses:, Abrouk, M., Haberer, G., Transposable elements:, Gundlach, H., Phylogenomic analyses:, Fischer, I., Transcriptome analyses and RNA-seq data:, Uauy, C., Arnaud, D., Chalabi, S., Chalhoub, B., Cory, A., Datla, R., Davey, M.W., van Ex, F., Whole-genome methylome:, Robinson, S.J., Histone mark analyses:, Benhamed, M., Bendahmane, A., Concia, L., Latrasse, D., BAC chromosome MTP IWGSC– Bayer Whole-Genome Profiling (WGP) tags:, Jacobs, J., Alaux, M., Bartoš, J., Singh, K., Chromosome LTC mapping and physical mapping quality control:, Frenkel, Z., Fahima, T., Glikson, V., Raats, D., RH mapping:, Tiwari, V., Optical mapping:, Číhalíková, J., Šimková, H., Recombination analyses:, Sourdille, P., Darrier, B., Gene family analyses:, Spannagl, M., Prade, V., CBF gene family:, Barabaschi, D., Cattivelli, L., Dehydrin gene family:, Hernandez, P., Budak, H., NLR gene family:, Steuernagel, B., Jones, J.D.G., Witek, K., Wulff, B.B.H., Yu, G., PPR gene family:, Small, I., Melonek, J., Zhou, R., Prolamin gene family:, Juhász, A., WAK gene family:, King, R., Stem solidness (SSt1) QTL team:, Nilsen, K., Cuthbert, R., Knox, R., Wiebe, K., Xiang, D., Flowering locus C (FLC) gene team:, Rohde, A., Golds, T., Genome size analysis:, Čížková, J., MicroRNA and tRNA annotation:, Akpinar, B.A., Biyiklioglu, S., Genetic maps and mapping:, Gao, L., N'Daiye, A., BAC libraries and chromosome sorting:, Kubaláková, M., Šafář, J., Vrána, J., BAC pooling, BAC library repository, and access:, Berges, H., IWGSC sequence and data repository and access:, Alfama, F., Adam-Blondon, A.-F., Flores, R., Guerche, C., Letellier, T., Loaec, M., Quesneville, H., Physical maps and BAC- based sequences:, 1A BAC sequencing and assembly:, Walkowiak, S., Condie, J., Ens, J., Maclachlan, R., Tan, Y., 1B BAC sequencing 
and assembly:, Paux, E., Alberti, A., Aury, J.-M., Balfourier, F., Barbe, V., Couloux, A., Cruaud, C., Labadie, K., Mangenot, S., Wincker, P., 1D, 4D, and 6D physical mapping:, Gill, B., Kaur, G., Luo, M., Sehgal, S., 2AL physical mapping:, Chhuneja, P., Gupta, O.P., Jindal, S., Kaur, P., Malik, P., Sharma, P., Yadav, B., 2AS physical mapping:, Singh, N.K., Khurana, J., Chaudhary, C., Khurana, P., Kumar, V., Mahato, A., Mathur, S., Sevanthi, A., Sharma, N., Tomar, R.S., 2B, 2D, 4B, 5BL, and 5DL IWGSC–Bayer Whole-Genome Profiling (WGP) physical maps:, Bellec, A., Dolezel, J., Feuillet, C., Korol, A., van der Vossen, E., Vautrin, S., 3AL physical mapping:, 3DS physical mapping and BAC sequencing and assembly:, Holušová, K., Plíhal, O., 3DL BAC sequencing and assembly:, Clark, M.D., Heavens, D., Kettleborough, G., Wright, J., 4A physical mapping, BAC sequencing, assembly, and annotation:, Valarik, M., Balcárková, B., Hu, Y., 5BS BAC sequencing and assembly:, Salina, E., Ravin, N., Skryabin, K., Beletsky, A., Kadnikov, V., Mardanov, A., Nesterov, M., Rakitin, A., Sergeeva, E., 6B BAC sequencing and assembly:, Handa, H., Kanamori, H., Katagiri, S., Kobayashi, F., Nasuda, S., Tanaka, T., Wu, J., 7A physical mapping and BAC sequencing:, Appels, R., Hayden, M., Keeble- Gagnère, G., Tibbits, J., 7B physical mapping, BAC sequencing, and assembly:, Olsen, O.-A., Belova, T., Cattonaro, F., Jiumeng, M., Kugler, K.,

89

Pfeifer, M., Sandve, S., Xun, X., Zhan, B., 7DS BAC sequencing and assembly:, Batley, J., Bayer, P.E., Edwards, D., Hayashi, S., Toegelová, H., Tulpová, Z., Visendi, P., 7DL physical mapping and BAC sequencing:, Weining, S., Cui, L., Du, X., Feng, K., Nie, X., Tong, W., Wang, L., Figures:, Galvez, S., Manuscript writing team:, Stein, N., Eversole, K., Rogers, J., Borrill, P., Kanyuka, K., Pozniak, C.J., Ramirez-Gonzalez, R.H. Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science. 361, eaar7191 (2018). 47. Treangen, T.J., Salzberg, S.L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 13, 36–46 (2012). 48. Dierckxsens, N., Mardulyn, P., Smits, G. NOVOPlasty de novo assembly of organelle genomes from whole genome data. Nucleic Acids Res. 45, e18 (2017). 49. Ahmed, I. Chloroplast Genome Sequencing: Some Reflections. J. Next Generat. Sequenc. & Applic. 2:2, (2015). 50. Ekblom, R., Smeds, L., Ellegren, H. Patterns of sequencing coverage bias revealed by ultra-deep sequencing of vertebrate mitochondria. BMC Genomics. 15, 467 (2014). 51. Timmis, J.N., Ayliffe, M.A., Huang, C.Y., Martin, W. Endosymbiotic gene transfer: organelle genomes forge eukaryotic chromosomes. Nat. Rev. Genet. 5, 123–135 (2004). 52. Kapitonov, V.V., Jurka, J. A universal classification of eukaryotic transposable elements implemented in Repbase. Nat. Rev. Genet. 9, 411–412; author reply 414 (2008). 53. Bao, W., Kojima, K.K., Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA. 6, 11 (2015). 54. Frith, M.C., Hamada, M., Horton, P. Parameters for accurate genome alignment. BMC Bioinformatics. 11, 80 (2010). 55. Holt, C., Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 12, 491 (2011). 56. Eilbeck, K., Moore, B., Holt, C., Yandell, M. 
Quantitative measures for the management and comparison of annotated genomes. BMC Bioinformatics. 10, 67 (2009). 57. Waterhouse, R.M., Seppey, M., Simão, F.A., Manni, M., Ioannidis, P., Klioutchnikov, G., Kriventseva, E.V., Zdobnov, E.M. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol. 35, 543–548 (2018). 58. Clark, J.W., Donoghue, P.C.J. Constraining the timing of whole genome duplication in plant evolutionary history. Proc. Biol. Sci. 284, pii: 20170912 (2017). 59. Wolfe, K.H. Yesterday's polyploids and the mystery of diploidization. Nat. Rev. Genet. 2, 333–341 (2001). 60. Schnable, J.C., Freeling, M., Lyons, E. Genome-wide analysis of syntenic gene deletion in the grasses. Genome Biol. Evol. 4, 265-277 (2012). 61. Lyons, E., Pedersen, B., Kane, J., Alam, M., Ming, R., Tang, H., Wang, X.,

90

Bowers, J., Paterson, A., Lisch, D., Freeling, M. Finding and comparing syntenic regions among Arabidopsis and the outgroups papaya, poplar, and grape: CoGe with rosids. Plant Physiol. 148, 1772–1781 (2008). 62. Dehal, P., Boore, J.L. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 3, e314 (2005). 63. Tang, H., Bowers, J.E., Wang, X., Paterson, A.H. Angiosperm genome comparisons reveal early polyploidy in the monocot lineage. Proc. Natl. Acad. Sci. U.S.A. 107, 472–477 (2010). 64. Jiao, Y., Wickett, N.J., Ayyampalayam, S., Chanderbali, A.S., Landherr, L., Ralph, P.E., Tomsho, L.P., Hu, Y., Liang, H., Soltis, P.S., Soltis, D.E., Clifton, S.W., Schlarbaum, S.E., Schuster, S.C., Ma, H., Leebens-Mack, J., dePamphilis, C.W. Ancestral polyploidy in seed plants and angiosperms. Nature. 473, 97–100 (2011). 65. McLachlan, G., Peel, D. The EMMIX algorithm for the fitting of normal and t- components. J. Stat.Software. 004, (1999). 66. Denoeud, F., Carretero-Paulet, L., Dereeper, A., Droc, G., Guyot, R., Pietrella, M., Zheng, C., Alberti, A., Anthony, F., Aprea, G., Aury, J.-M., Bento, P., Bernard, M., Bocs, S., Campa, C., Cenci, A., Combes, M.-C., Crouzillat, D., Da Silva, C., Daddiego, L., De Bellis, F., Dussert, S., Garsmeur, O., Gayraud, T., Guignon, V., Jahn, K., Jamilloux, V., Joët, T., Labadie, K., Lan, T., Leclercq, J., Lepelley, M., Leroy, T., Li, L.-T., Librado, P., Lopez, L., Muñoz, A., Noel, B., Pallavicini, A., Perrotta, G., Poncet, V., Pot, D., Priyono, Rigoreau, M., Rouard, M., Rozas, J., Tranchant-Dubreuil, C., Vanburen, R., Zhang, Q., Andrade, A.C., Argout, X., Bertrand, B., de Kochko, A., Graziosi, G., Henry, R.J., Jayarama, Ming, R., Nagai, C., Rounsley, S., Sankoff, D., Giuliano, G., Albert, V.A., Wincker, P., Lashermes, P. The coffee genome provides insight into the convergent evolution of caffeine biosynthesis. Science. 345, 1181–1184 (2014). 67. 
Freeling, M., Woodhouse, M.R., Subramaniam, S., Turco, G., Lisch, D., Schnable, J.C. Fractionation mutagenesis and similar consequences of mechanisms removing dispensable or less-expressed DNA in plants. Curr. Opin. Plant Biol. 15, 131–139 (2012). 68. Han, Y., Wessler, S.R. MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 38, e199 (2010). 69. Ellinghaus, D., Kurtz, S., Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 9, (2008). 70. Steinbiss, S., Willhoeft, U., Gremme, G., Kurtz, S. Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucleic Acids Res. 37, 7002–7013 (2009). 71. Suzek, B.E., Wang, Y., Huang, H., McGarvey, P.B., Wu, C.H., UniProt Consortium. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 31, 926–932 (2015). 72. Yoshida, S., Ishida, J.K., Kamal, N.M., Ali, A.M., Namba, S., Shirasu, K. A full-length enriched cDNA library and expressed sequence tag analysis of the

91

parasitic weed, Striga hermonthica. BMC Plant Biol. 10, 55 (2010). 73. Bolger, A.M., Lohse, M., Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 30, 2114–2120 (2014). 74. Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N., Gnirke, A., Rhind, N., di Palma, F., Birren, B.W., Nusbaum, C., Lindblad-Toh, K., Friedman, N., Regev, A. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011). 75. Westwood, J.H., dePamphilis, C.W., Das, M., Fernandez-Aparicio, M., Honaas, L.A., Timko, M.P., Wafula, E.K., Wickett, N.J., Yoder, J.I. The Parasitic Plant Genome Project: New tools for understanding the biology of Orobanche and Striga. Weed Science. 60, 295–306 (2012). 76. Goodstein, D.M., Shu, S., Howson, R., Neupane, R., Hayes, R.D., Fazo, J., Mitros, T., Dirks, W., Hellsten, U., Putnam, N., Rokhsar, D.S. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 40, D1178–86 (2012). 77. UniProt Consortium, T. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 46, 2699 (2018). 78. Korf, I. Gene finding in novel genomes. BMC Bioinformatics. 5, 59 (2004). 79. Stanke, M., Keller, O., Gunduz, I., Hayes, A., Waack, S., Morgenstern, B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435–9 (2006). 80. Cantarel, B.L., Korf, I., Robb, S.M.C., Parra, G., Ross, E., Moore, B., Holt, C., Sánchez Alvarado, A., Yandell, M. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Research. 18, 188–196 (2008). 81. Ramírez-Sánchez, O., Pérez-Rodríguez, P., Delaye, L., Tiessen, A. Plant proteins are smaller because they are encoded by fewer exons than animal proteins. Genomics Proteomics Bioinformatics. 14, 357–370 (2016). 82. 
Lamesch, P., Berardini, T.Z., Li, D., Swarbreck, D., Wilks, C., Sasidharan, R., Muller, R., Dreher, K., Alexander, D.L., Garcia-Hernandez, M., Karthikeyan, A.S., Lee, C.H., Nelson, W.D., Ploetz, L., Singh, S., Wensel, A., Huala, E. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 40, D1202–D1210 (2012). 83. Huala, E., Dickerman, A.W., Garcia-Hernandez, M., Weems, D., Reiser, L., LaFond, F., Hanley, D., Kiphart, D., Zhuang, M., Huang, W., Mueller, L.A., Bhattacharyya, D., Bhaya, D. Sobral, B.W., Beavis, W. Meinke, D.W., Town, C.D., Somerville, C., Rhee, S.Y. Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res. 29, 102-105 (2001). 84. Swarbreck, D., Wilks, C., Lamesch, P., Berardini, T.Z., Garcia-Hernandez, M. Foerster, H., Li, D., Meyer, T., Muller, R., Ploetz, L., Radenbaugh, A., Singh, S., Swing, V., Tissier, C., Zhang, P., Huala, E. Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 36, D1009-10014 (2008).

92

85. Mulder, N., Apweiler, R. InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol. Biol. 396, 59–70 (2007). 86. McDowall, J., Hunter, S. InterPro protein classification. Methods Mol. Biol. 694, 37–47 (2011). 87. Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., Lopez, R. InterProScan: protein domains identifier. Nucleic Acids Res. 33, W116–W120 (2005). 88. Emms, D.M., Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 16, 157 (2015). 89. Mirarab, S., Nguyen, N., Guo, S., Wang. L.S., Kim, J., Warnow, T. PASTA: Ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J. Comput. Biology. 22, 377-386 (2015). 90. Capella-Gutiérrez, S., Silla-Martínez, J.M., Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 25, 1972–1973 (2009). 91. Price, M.N., Dehal, P.S., Arkin, A.P. FastTree 2--approximately maximum- likelihood trees for large alignments. PLoS ONE. 5, e9490 (2010). 92. Wang, Y., Tang, H., Debarry, J.D., Tan, X., Li, J., Wang, X., Lee, T.-H., Jin, H., Marler, B., Guo, H., Kissinger, J.C., Paterson, A.H. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49 (2012). 93. Katoh, K., Standley, D.M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013). 94. Yang, Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555–556 (1997). 95. Tang, H., Lyons, E. Unleashing the genome of Brassica rapa. Front. Plant Sci. 3, 172 (2012). 96. Haas, B.J., Delcher, A.L., Wortman, J.R., Salzberg, S.L. DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics. 20, 3643–3646 (2004). 97. 
Pedersen, B.S., Tang, H., Freeling, M. Gobe: an interactive, web-based tool for comparative genomic visualization. Bioinformatics. 27, 1015–1016 (2011).

93

Chapter 4

Gene family dynamics and functional diversification in parasitic lineages of Orobanchaceae

This chapter represents an expanded version of my contribution to the published paper on the Striga asiatica genome by Yoshida et al. (2019) [1]. The discussion of gene contractions and expansions was done in collaboration with Dr. Loren Honaas.

Abstract

The growing number of fully sequenced genomes across the tree of life presents an opportunity to examine the evolutionary histories of gene families across a variety of species. Knowledge of lineage-specific gene family expansions and contractions, and of their associated biological functions, can provide insight into the basis of observed phenotypic diversity. Striga asiatica is an obligate hemiparasite that is photosynthetically competent but depends on its host for carbon, mineral nutrients, and water. It therefore constitutes an ideal model to study the evolution of a parasite from a free-living organism to an obligate holoparasite. In this chapter, I examined the role that whole-genome duplications (WGDs) have played in the evolution of parasitism in Orobanchaceae by examining gene family dynamics in the Striga genome relative to the genome of its closest sequenced non-parasitic relative, Mimulus guttatus (Erythranthe guttata).

Contracted gene families in Striga include older genes whose functions are more likely to align with parasite functions that are supplemented by the host plant, while the relatively younger expanded gene families encode specialized traits in the parasite.

Introduction

Gene and genome duplication are the two major mechanisms that provide new genetic material for evolutionary forces to act upon and create novel gene functions [2]. While it is now accepted that duplications are a driving force behind functional and species diversification [3-5], evolutionary changes resulting from gene copy number variation have not garnered as much attention. The increasing availability of fully sequenced genomes from related lineages is now revealing widespread variation in gene copy number that may play a critical role in adaptive phenotypic diversity among organisms. Changes in gene family size result from differential gene gain and loss following duplication [6-8]. The resulting gene family expansions and contractions can arise through random processes or through natural selection; either way, these dynamics can have enormous functional consequences [9]. In plants, adaptive expansion of gene families is much more prevalent than adaptive contraction [7], as a consequence of gene duplication and whole-genome duplication (WGD) followed by neo- or sub-functionalization [2, 10-13]. The exception to this general pattern is found in parasitic and myco-heterotrophic plants, which exhibit dramatic loss of genes that have become superfluous following the adoption of a parasitic (or symbiotic), and therefore host-dependent, lifestyle [14-16].

While there is abundant evidence of massive gene loss in autotrophic plants following both local and large-scale duplications [17-19], these losses usually either involve functionally inconsequential, redundant duplicates or are biased toward a clear set of functional categories associated with diploidization [20-23]. In parasitic plants, gene losses are likely to be associated with functions that have become redundant in the parasites because they are supplemented by the host plant. Most notable is the wholesale gene loss associated with plastid genome downsizing in holoparasites [24-28]. The recently published genomes of two Cuscuta species (Cuscuta australis and Cuscuta campestris) revealed large-scale contractions of nuclear gene families with functions that are essential for plant growth and development [14, 15]. The loss of genes related to photosynthesis, transport processes, metabolism, leaf and root development, flowering time, and defense in Cuscuta spp. (dodders) is indicative of a parasite actively adapting to a lifestyle dependent on the host.

Although gene losses are typically associated with the loss of autotrophic functions during adaptation to a parasitic lifestyle, other complementary evolutionary mechanisms may be adaptive for parasites, such as gene family expansion, gene co-option, and horizontal gene transfer (HGT). Conn et al. (2015) [29] reported an expansion of KAI2 genes in the obligate parasites of Orobanchaceae. KAI2 genes mediate karrikin-specific germination responses in autotrophic plants, but in parasitic plants they have evolved strigolactone perception for host sensing, which subsequently stimulates germination of the parasite's seeds [30-32]. Yang et al. (2014) [33] identified genes with tissue-specific expression in autotrophic plants that have been recruited to function in the haustoria of Orobanchaceae parasites. These genes were predominantly from root tissues that have disappeared or been reduced to vestigial structures in the parasites; however, significant numbers of genes that serve floral or other organ-specific functions were also recruited to the haustorium. The intimate relationship between parasites and their hosts facilitates the exchange of material across species boundaries [34], including nucleic acids, which potentially affords the parasite the opportunity to acquire genetic material from the host that may then become integrated into its genome [35-39]. Yang et al. (2016; 2019) [37, 38] identified numerous cases of DNA-mediated HGT from host to parasite in Cuscuta species and in the Orobanchaceae, many of which were shown to be functional genes in the parasite. A majority of the integrated functional HGT genes showed elevated expression in the haustoria, with enriched functions related to defense response. Acquisition of defense-related genes from the host suggests an adaptive evolutionary strategy by the parasite to attenuate host resistance.

More than 45 years ago, Dennis Searcy [40, 41] proposed that the evolutionary transition from autotrophy to heterotrophy in parasitic angiosperms would predictably involve three phases of evolutionary events: gains of parasitic ability (Phase I), followed by losses of functions supplemented by the host (Phase II), and finally gains of highly specialized traits (Phase III). Therefore, the relative timing of evolutionary events, and thus the age of affected gene families, should follow a predictable pattern. The supplementary functions should be more broadly shared by the parasite and host and therefore be older traits, while newer, lineage-specific functions should reflect specialized adaptations by the parasite. Despite their ability to photosynthesize, obligate hemiparasites are unable to complete their life cycle without attachment to the host. Their carbon budget deficit and their reduced or non-existent root systems require them to acquire virtually all mineral nutrients and water through the host [42-46]. The evolutionary history of Striga asiatica, an obligate hemiparasite, should retain evidence of all three phases of parasite evolution hypothesized by Searcy, and Striga is therefore an ideal model to study the evolution of a parasite from a free-living organism to an obligate holoparasite. In this chapter, I focus primarily on gene family evolutionary dynamics in Striga and a closely related non-parasite, Mimulus, as well as other successively earlier common ancestors, and the corresponding expression shifts underlying the origins of parasitism.

Results and discussion

Gene family contractions and expansions

I applied the weighted parsimony method implemented in DupliPHY [6] to the 26-genome gene family data set described in Chapter 3 to estimate the ancestral history of the gene families. Among 10,248 orthogroups with representative asterid taxa, ~23% showed a significant change in gene number between Striga and its ancestral node shared with Mimulus. In total, 647 contractions, 1,742 expansions, 456 losses, and 153 gains were estimated in Striga (Appendix C: Data S1D and S1W). Subsequent analyses of gene family dynamics focused primarily on contractions and expansions because of the expected uncertainty in inferring gene family losses and gains. Even though parsimony methods are robust enough to infer evolutionary histories of gene families in genome-wide studies involving many species, they do not accurately infer the number of individual loss or gain events [6, 47, 48]. In parsimonious scenarios, gains are less likely than losses except in instances of non-lateral acquisitions. However, many inferred losses in draft genomes can be attributed to sequencing issues or incorrect gene prediction [49].
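The four categories counted above can be illustrated with a minimal sketch. It assumes that an ancestral copy number has already been inferred for each orthogroup (in this chapter, by weighted parsimony in DupliPHY) and simply compares that number with the copy number observed in Striga. The function name, the orthogroup identifiers, and the copy numbers are hypothetical and are not taken from the data set; the sketch does not reproduce the parsimony reconstruction itself.

```python
from collections import Counter


def classify_change(ancestral: int, descendant: int) -> str:
    """Label the change in gene copy number along a single branch."""
    if ancestral < 0 or descendant < 0:
        raise ValueError("copy numbers must be non-negative")
    if ancestral == 0 and descendant > 0:
        return "gain"         # family absent at the ancestral node
    if ancestral > 0 and descendant == 0:
        return "loss"         # family absent in the descendant
    if descendant > ancestral:
        return "expansion"
    if descendant < ancestral:
        return "contraction"
    return "no change"


# Hypothetical orthogroups: (inferred ancestral copies, copies in Striga).
example_counts = {
    "OG_a": (3, 1),
    "OG_b": (2, 5),
    "OG_c": (1, 0),
    "OG_d": (0, 2),
    "OG_e": (4, 4),
}

labels = {og: classify_change(a, d) for og, (a, d) in example_counts.items()}
summary = Counter(labels.values())
print(labels)
print(summary)
```

Keeping losses and gains as separate labels from contractions and expansions mirrors the distinction drawn in the text, where only the latter two categories are carried forward into the downstream analyses.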

The relative age of genes in contracted families (mean Ks ~ 0.41) was significantly older (two-tailed Mann-Whitney U test, p < 2.2e-16) than that of genes in expanded families (mean Ks ~ 0.66), as shown in Figure 4-1. The difference in age is even more pronounced between WGD duplicates from contracted families (mean Ks ~ 0.53) and expanded families (mean Ks ~ 0.89). In addition, the ratios of non-synonymous to synonymous substitutions (Ka/Ks) of genes in expanded families (mean Ka/Ks ~ 0.33) are significantly higher (two-tailed Mann-Whitney U test, p < 1.66e-13) than the ratios in contracted families (mean Ka/Ks ~ 0.21), suggesting that the expanded gene families are under more relaxed purifying selection (Appendix C: Data S1X and Appendix D: Figure S1). This evolutionary pattern is concordant with previous studies that have focused on the characteristics of young (expansions in Striga) and old (contractions in Striga) duplicate genes [50-54]. In agreement with Searcy’s hypothesis [40, 41], the older, contracted gene families include plant genes whose functions are more likely to align with ancestral autotroph functions, rather than functions that are vestigial in the parasite. The relatively younger expanded gene families, gained largely as a result of the younger Striga WGD, also support this hypothesis by providing a more recent source of genes that could be neo-functionalized for specialized traits in the parasite.
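The comparisons above rely on a two-tailed Mann-Whitney U test between two sets of per-gene values (Ks or Ka/Ks). The following is a minimal sketch of that test using the normal approximation with a tie correction and no continuity correction; it is not the code used for the analysis, and the two input lists are hypothetical values that merely stand in for the two orthogroup categories.

```python
import math


def mann_whitney_u(x, y):
    """Two-tailed Mann-Whitney U test (normal approximation).

    Returns (U statistic for sample x, two-tailed p-value). Tied values
    receive average ranks and the variance is corrected for ties.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0   # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i                      # size of this group of ties
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_two_tailed


# Hypothetical Ks values for paralog pairs in two orthogroup categories.
ks_set_1 = [0.31, 0.38, 0.42, 0.47, 0.52, 0.40, 0.36]
ks_set_2 = [0.58, 0.63, 0.71, 0.66, 0.80, 0.61, 0.69]

u_stat, p_value = mann_whitney_u(ks_set_1, ks_set_2)
print(f"U = {u_stat}, two-tailed p = {p_value:.4g}")
```

A rank-based test is a natural choice here because Ks and Ka/Ks distributions are typically skewed, so a comparison of ranks avoids assuming normality of the underlying values.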

Figure 4-1: Ks plots of Striga genes in contracted and expanded gene families. Distributions of synonymous substitution rates (Ks) of paralogous gene pairs in contracted and expanded orthogroups in Striga asiatica. Genes in contracted families are significantly older than genes in expanded families (two-tailed Mann-Whitney U test, p < 2.2e-16). Additional results are available in Appendix C: Data S1E.

Gene enrichment analysis of contracted and expanded gene families in Striga detected significant (Benjamini-corrected p < 0.05) signatures of gene family contraction in two photosynthesis-related KEGG pathways and several secondary metabolite biosynthesis pathways (Table 4-1). Additionally, an analysis of Gene Ontology (GO) terms among contracted gene families showed that several photosynthesis-related cellular component (CC) terms (with profiles biased toward structural and photosynthesis-related gene families) and biological process (BP) terms were significantly over-represented, including GO BP terms associated with leaf anatomy and function. The most substantial contractions were in gene families annotated with GO BP terms that relate to responses to abiotic and biotic stimuli, including responses to virtually all plant hormones (Table 4-2).

Expanded gene families were significantly biased toward transcription functions, endocytosis, and intracellular transport (Table 4-1 and Table 4-3).
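The enrichment tables report a raw p-value alongside Bonferroni- and Benjamini-corrected values. As a minimal sketch of what those two corrections do, the functions below implement the textbook Bonferroni adjustment and the Benjamini-Hochberg step-up adjustment. This is an illustration under that assumption only: the enrichment software may compute its corrected columns slightly differently, and the input p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is the minimum of p * m / rank over all equal or larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four tested pathways.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

The Bonferroni adjustment controls the chance of any false positive and is therefore the more conservative of the two, whereas the Benjamini-Hochberg adjustment controls the expected proportion of false positives among the reported terms, which is why a term can pass the Benjamini threshold of 0.05 while its Bonferroni value exceeds it.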

Table 4-1: List of enriched KEGG pathways for genes in contracted and expanded orthogroups in Striga.

Contractions

ID        Ontology  Pathway                                                 P-value      Bonferroni   Benjamini
ath00940  KEGG      Phenylpropanoid biosynthesis                            1.46E-52     1.47E-50     1.47E-50
ath00500  KEGG      Starch and sucrose metabolism                           5.53E-15     5.61E-13     2.80E-13
ath00460  KEGG      Cyanoamino acid metabolism                              9.01E-14     9.11E-12     3.04E-12
ath04075  KEGG      Plant hormone signal transduction                       4.76E-12     4.80E-10     1.20E-10
ath01110  KEGG      Biosynthesis of secondary metabolites                   1.36E-10     1.37E-08     2.74E-09
ath00040  KEGG      Pentose and glucuronate interconversions                9.05E-09     9.14E-07     1.52E-07
ath00062  KEGG      Fatty acid elongation                                   3.96E-07     3.99E-05     5.71E-06
ath00073  KEGG      Cutin, suberine and wax biosynthesis                    1.58E-06     1.59E-04     1.99E-05
ath04130  KEGG      SNARE interactions in vesicular transport               3.97E-06     4.01E-04     4.46E-05
ath00196  KEGG      Photosynthesis - antenna proteins                       2.13E-04     0.021274736  0.00214812
ath00220  KEGG      Arginine biosynthesis                                   0.001135833  0.108441636  0.010380688
ath00052  KEGG      Galactose metabolism                                    0.001388487  0.1309326    0.011626434
ath00909  KEGG      Sesquiterpenoid and triterpenoid biosynthesis           0.001645769  0.153258332  0.012715362
ath04626  KEGG      Plant-pathogen interaction                              0.002365591  0.212748704  0.016941127
ath00906  KEGG      Carotenoid biosynthesis                                 0.002779184  0.245038398  0.018564738
ath00710  KEGG      Carbon fixation in photosynthetic organisms             0.006760808  0.495990612  0.041918589

Expansions

ID        Ontology  Pathway                                                 P-value      Bonferroni   Benjamini
ath00960  KEGG      Tropane, piperidine and pyridine alkaloid biosynthesis  2.57E-06     2.88E-04     2.88E-04
ath03010  KEGG      Ribosome                                                5.70E-06     6.38E-04     3.19E-04
ath04144  KEGG      Endocytosis                                             1.44E-05     0.001615846  5.39E-04
ath04120  KEGG      Ubiquitin mediated proteolysis                          2.66E-04     0.029382711  0.007428032

Table 4-2: List of enriched GO terms for genes in contracted orthogroups in Striga. Additional results available in Appendix C: Data S1H.

ID Ontology Term P-value Bonferroni Benjamini

GO:0006979 BP response to oxidative stress 9.77E-21 1.12E-17 1.40E-18

GO:0009733 BP response to auxin 4.26E-17 4.87E-14 4.87E-15

GO:0009751 BP response to salicylic acid 1.75E-14 1.99E-11 1.42E-12

GO:0009651 BP response to salt stress 1.35E-12 1.54E-09 9.63E-11


GO:0009416 BP response to light stimulus 3.50E-09 4.01E-06 1.91E-07

GO:0009409 BP response to cold 1.19E-08 1.35E-05 5.21E-07

GO:0009737 BP response to abscisic acid 1.46E-08 1.67E-05 5.75E-07

GO:0009753 BP response to jasmonic acid 2.05E-07 2.34E-04 7.09E-06

GO:0009414 BP response to water deprivation 1.07E-06 0.001222569 3.22E-05

GO:0006813 BP potassium ion transport 1.36E-06 0.001557429 4.00E-05

GO:0015693 BP magnesium ion transport 5.57E-06 0.006350285 1.38E-04

GO:0019722 BP calcium-mediated signaling 7.61E-06 0.008660002 1.77E-04

GO:0009908 BP flower development 1.04E-05 0.011822195 2.33E-04

GO:0018298 BP protein-chromophore linkage 2.51E-06 0.002862785 6.67E-05

GO:0048366 BP leaf development 3.62E-05 0.040581533 7.53E-04

GO:0001944 BP vasculature development 0.002806083 0.959718428 0.033946576

GO:0070588 BP calcium ion transmembrane transport 4.49E-05 0.050020022 9.00E-04

GO:0006816 BP calcium ion transport 5.16E-05 0.057284053 0.001016556

GO:0015977 BP carbon fixation 7.77E-05 0.08500941 0.001504655

GO:0042538 BP hyperosmotic salinity response 1.06E-04 0.113732205 0.002010245

GO:0009768 BP photosynthesis, light harvesting in photosystem I 1.24E-04 0.13254359 0.002328269

GO:0010119 BP regulation of stomatal movement 0.002327716 0.930308748 0.029815575

GO:0010148 BP transpiration 0.0027989 0.959385435 0.034592232

GO:0006970 BP response to osmotic stress 3.29E-04 0.313102401 0.005350919

GO:0009734 BP auxin-activated signaling pathway 0.001638628 0.846566453 0.022068185

GO:0009863 BP salicylic acid mediated signaling pathway 0.002806083 0.959718428 0.033946576

GO:0009867 BP jasmonic acid mediated signaling pathway 0.003311954 0.977446082 0.039536014

GO:0030076 CC light-harvesting complex 1.21E-04 0.026470859 0.002061528

GO:0009522 CC photosystem I 0.002691016 0.448722115 0.025559688

GO:0009579 CC thylakoid 0.005661581 0.714857086 0.04711417

GO:0010287 CC plastoglobule 7.78E-07 1.72E-04 2.46E-05

GO:0008519 MF ammonium transmembrane transporter activity 2.65E-06 0.00149169 4.15E-05

GO:0015095 MF magnesium ion transmembrane transporter activity 4.56E-06 0.002571116 6.60E-05

GO:0005385 MF zinc ion transmembrane transporter activity 6.11E-05 0.033897071 6.89E-04

GO:0004392 MF heme oxygenase (decyclizing) activity 1.38E-04 0.074701641 0.001385451

GO:0009881 MF photoreceptor activity 1.68E-04 0.090547665 0.001635087

GO:0005381 MF iron ion transmembrane transporter activity 0.002387847 0.740333176 0.020222364

GO:0015250 MF water channel activity 0.005167401 0.946172461 0.040883211

GO:0010295 MF (+)-abscisic acid 8'-hydroxylase activity (ABA) 0.00638799 0.973066469 0.046444524


Table 4-3: List of enriched GO terms for genes in expanded orthogroups in Striga. Additional results available in Appendix C: Data S1H.

ID Ontology Term P-value Bonferroni Benjamini

GO:0015031 BP protein transport 1.23E-10 2.09E-07 6.98E-08

GO:0016192 BP vesicle-mediated transport 2.45E-09 4.16E-06 1.04E-06

GO:0045489 BP pectin biosynthetic process 4.97E-09 8.45E-06 1.69E-06

GO:0048194 BP Golgi vesicle budding 6.82E-07 0.001157704 1.29E-04

GO:0009415 BP response to water 1.09E-05 0.018421014 9.29E-04

GO:0031047 BP gene silencing by RNA 1.21E-05 0.020307163 9.32E-04

GO:0006281 BP DNA repair 1.54E-05 0.025798865 0.001088469

GO:0008380 BP RNA splicing 2.02E-05 0.033675799 0.001316667

GO:0006397 BP mRNA processing 2.91E-04 0.389846743 0.014859555

GO:0009851 BP auxin biosynthetic process 3.75E-04 0.471310072 0.018045289

GO:0015976 BP carbon utilization 5.93E-04 0.634886766 0.023703778

GO:0006886 BP intracellular protein transport 0.001380877 0.904414256 0.046783182

GO:0005802 CC trans-Golgi network 1.29E-09 4.99E-07 4.99E-08

GO:0005801 CC cis-Golgi network 1.09E-05 0.004216605 2.22E-04

GO:0032588 CC trans-Golgi network membrane 0.003279258 0.719493503 0.032068351

GO:0032580 CC Golgi cisterna membrane 0.004723841 0.83998023 0.043710029

GO:0000139 CC Golgi membrane 1.09E-17 4.21E-15 1.05E-15

GO:0005768 CC endosome 1.39E-09 5.39E-07 4.90E-08

GO:0031901 CC early endosome membrane 4.05E-04 0.145258919 0.005397679

GO:0032580 CC clathrin-coated vesicle membrane 0.004723841 0.83998023 0.043710029

GO:0003729 MF mRNA binding 7.26E-19 7.10E-16 7.10E-16

GO:0000166 MF nucleotide binding 3.10E-14 3.03E-11 7.57E-12

GO:0001085 MF RNA polymerase II transcription factor binding 7.13E-14 6.96E-11 1.39E-11

GO:0003723 MF RNA binding 1.42E-13 1.38E-10 2.31E-11

GO:0003676 MF nucleic acid binding 4.92E-09 4.81E-06 4.37E-07

GO:0031492 MF nucleosomal DNA binding 7.63E-09 7.45E-06 5.73E-07

GO:0008143 MF poly(A) binding 1.63E-05 0.015801795 6.92E-04

GO:0003727 MF single-stranded RNA binding 4.74E-05 0.045226275 0.001541507

GO:0042277 MF peptide binding 1.63E-04 0.146953325 0.00428649

GO:0019843 MF rRNA binding 0.001464261 0.761081156 0.029385258

GO:0003690 MF double-stranded DNA binding 0.001794807 0.827110454 0.035184527

GO:0016874 MF ligase activity 0.001937275 0.849614471 0.037182177


GO:0005515 MF protein binding 0.002535174 0.916257575 0.045714736

Evolution of tissue-specific gene families

Presumably, the specificity of gene expression is correlated with tissue- or organ-specific function; therefore, changes in gene number for tissue-specific orthogroups can be used as a proxy for changes in tissue function. Various studies have demonstrated that changes in copy number have drastic effects in plant genomes, including adaptation to different lifestyles and environments, and phenotypic changes [20, 55, 56]. All tissue-specific orthogroups were examined for enrichment of evolutionary events in Striga, including expansions, contractions, and losses. Statistically significant evolutionary events in various tissues, including root, inflorescence stem, seedling, embryo, pollen, hypocotyl, leaf, pistil, seed, shoot apex, stamen, and testa, are shown in Table 4-4. Interestingly, most of the tissue-specific gene families were enriched for contractions. These contracted families most likely represent functions that are supplemented by the host species, and they tend to contain older genes from the ancient polyploidy event predating the divergence of the core Lamiales lineages.

Table 4-4: Contractions and expansions of genes in tissue-specific gene families in Striga. Enrichment of evolutionary events including expansions, contractions, and losses in Striga orthogroups with expression specific to various tissues in Arabidopsis. Statistically significant categories are highlighted in yellow when the observed value is greater than the expected number, and in blue if the observed value is less than the expected number. Additional results available in Appendix C: Data S1D.

Events root orthogroups All orthogroups Expected Significant (P-value)
Contractions 78 647 (23%) 58.19 0.03439
Expansions 165 1742 (61%) 154.33 0.375311
Losses 10 456 (16%) 40.48 0.000341

Events inflorescence stem orthogroups All orthogroups Expected Significant (P-value)
Contractions 6 647 (23%) 1.84 0.00905
Expansions 2 1742 (61%) 4.88 0.104874
Losses 0 456 (16%) 1.28 0.640824

Events seedling orthogroups All orthogroups Expected Significant (P-value)
Contractions 3 647 (23%) 1.38 0.386741
Expansions 2 1742 (61%) 3.66 0.367879
Losses 1 456 (16%) 0.96 1

Events embryo orthogroups All orthogroups Expected Significant (P-value)
Contractions 55 647 (23%) 57.93 0.937067
Expansions 165 1742 (61%) 153.11 0.293758
Losses 31 456 (16%) 40.16 0.484325

Events pollen orthogroups All orthogroups Expected Significant (P-value)
Contractions 219 647 (23%) 154.56 < 0.00001
Expansions 401 1742 (61%) 409.92 0.774916
Losses 52 456 (16%) 107.52 4.80E-05

Events hypocotyl orthogroups All orthogroups Expected Significant (P-value)
Contractions 59 647 (23%) 40.71 0.016408
Expansions 107 1742 (61%) 107.97 0.99005
Losses 11 456 (16%) 28.32 0.025097

Events leaf orthogroups All orthogroups Expected Significant (P-value)
Contractions 24 647 (23%) 15.41 0.091173
Expansions 37 1742 (61%) 40.87 0.615697
Losses 6 456 (16%) 10.72 0.484325

Events pistil orthogroups All orthogroups Expected Significant (P-value)
Contractions 30 647 (23%) 13.34 3.00E-05
Expansions 27 1742 (61%) 35.38 0.072078
Losses 1 456 (16%) 9.28 0.076536

Events seed orthogroups All orthogroups Expected Significant (P-value)
Contractions 30 647 (23%) 13.57 4.80E-05
Expansions 23 1742 (61%) 35.99 0.001999
Losses 6 456 (16%) 9.44 0.647265

Events shoot apex orthogroups All orthogroups Expected Significant (P-value)
Contractions 18 647 (23%) 8.51 0.005042
Expansions 18 1742 (61%) 22.57 0.293758
Losses 1 456 (16%) 5.92 0.241714

Events stamen orthogroups All orthogroups Expected Significant (P-value)
Contractions 46 647 (23%) 19.32 < 0.00001
Expansions 33 1742 (61%) 51.24 0.000182
Losses 5 456 (16%) 13.44 0.158025

Events testa orthogroups All orthogroups Expected Significant (P-value)
Contractions 128 647 (23%) 97.52 0.008523
Expansions 260 1742 (61%) 258.64 0.99005
Losses 36 456 (16%) 67.84 0.005517
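The Expected values in Table 4-4 appear to equal the total number of events observed in a tissue category multiplied by the rounded genome-wide proportion of each event type (23%, 61%, and 16%). For root-specific orthogroups, for example, 78 + 165 + 10 = 253 events, and 253 x 0.23 = 58.19. The Python sketch below reproduces that arithmetic under this assumption; the function name `expected_counts` is illustrative only, and the statistical test behind the reported P-values is not reproduced here.

```python
# Observed events in root-specific orthogroups (Table 4-4).
observed_root = {"Contractions": 78, "Expansions": 165, "Losses": 10}

# Rounded proportions printed in the "All orthogroups" column of Table 4-4.
# Assumption: these rounded values were used to derive the Expected column.
background = {"Contractions": 0.23, "Expansions": 0.61, "Losses": 0.16}


def expected_counts(observed, proportions):
    """Scale the tissue-specific event total by each background proportion."""
    total = sum(observed.values())
    return {event: round(total * proportions[event], 2) for event in observed}


expected_root = expected_counts(observed_root, background)
for event, value in expected_root.items():
    print(f"{event}: observed={observed_root[event]}, expected={value}")
```

The same calculation applied to the other tissue categories gives values that agree with the Expected column, for example 672 x 0.23 = 154.56 for pollen contractions.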

Evolutionary events related to parasitism

Important aspects of the Searcy hypothesis are the function and the source of genes leveraged by the parasite during the three phases of parasitic evolution [57]. During Phase I, genetic innovation is required for the evolution of the haustorium, either by the acquisition of new genetic material or by modification of existing genetic material. Phase II is characterized by loss of genes whose encoded functions were made redundant by resources acquired from the host (e.g., carbon and water). Phase III predicts that obligate parasites would add genetic material associated with further adaptations to the parasitic lifestyle. Striga, an obligate parasite, should show evidence of all three phases of parasite evolution.

Shifts in gene expression (Phase I)

Yang et al. (2014) [33] reported finding genes in the Orobanchaceae with preferential haustorium expression (core parasitism genes) that were derived from duplicated genes whose orthologs have preferential root or pollen gene expression in non-parasitic angiosperms. Here, I classified the core parasitism genes identified in Yang et al. into the 26-genome gene family scaffold and evaluated them with respect to their assignments to orthogroups with tissue-specific expression. Concordant with the results in Yang et al., I found evidence for the recruitment of tissue-specific genes for haustorial development in the parasitic Orobanchaceae. Parasitism genes were enriched for orthogroups with tissue-specific expression patterns. Testa, hypocotyl, root, and especially pollen were identified as likely sources for haustorial genes (Table 4-5).

These results suggest that in the Orobanchaceae, during Phase I, haustorium innovation was underpinned by neo-sub-functionalization of existing genes from tissue-specific gene families. Curiously, most of the tissue-specific gene families (except seedling, leaf, and embryo) were also enriched for contracted orthogroups, which predominantly contain older genes from the WGD event shared with Mimulus (Table 4-4). These findings may also represent Phase II, the loss of parasite functions via host complementation. It is not surprising that the haustorium, which is one of the key innovations among seed plants, might have evolved from the genetic material of an ancient WGD event rather than a more recent WGD event that is specific to the Striga lineage. Recent studies assessing ancient polyploidies have found significant evidence that major shifts in diversification do not always involve an entire species clade [58-62]. This asymmetrical phylogenetic divergence suggests that WGD does not immediately give rise to novel functions; rather, a lag in time precedes a burst in diversification [58-62].


Table 4-5: Core parasitism gene expression in tissue-specific gene families. Orobanchaceae (Striga, Phelipanche, and Triphysaria) haustorial genes identified in Yang et al. (2014) in tissue-specific orthogroups. Haustorial genes are enriched for orthogroups with tissue-specific expression patterns in testa, hypocotyl, and stamen, but most predominantly pollen.

Shifts in gene content (Phase II)

In Striga, patterns of gene family evolution are commonly characterized by contractions. Notably, orthogroups that are enriched for contracted gene families have highly tissue-specific expression (Table 4-4). Because evolutionary losses of leaf and root genes in the leafless and rootless holoparasites Monotropa (a mycoheterotroph) and Cuscuta have been reported [14, 15, 63], I was not surprised to see contractions in “root”-specific orthogroups, since Striga completely lacks a proper root system [46]. However, these data also suggest that the pattern of functional complementation by the host, and a corresponding loss of function in the parasite, extends to other parasite functions beyond the more obvious changes like loss of a functional root system. Consistent with the relatively normal outward appearance of Striga leaves, leaf-specific orthogroups lacked strong evidence of evolutionary shifts. However, even the leafy green hemiparasite Striga is heavily dependent upon the host for carbon, and is entirely heterotrophic as a seedling and during its extensive subterranean growth phase [43, 46, 64]. Therefore, in a hemiparasite like Striga, even though it retains some photosynthetic capabilities, losses in photosynthesis-related genes might be expected.

It has been shown that the plastid genomes of holoparasitic plants undergo wholesale gene loss, accelerated sequence evolution, and genome reduction, including the loss of photosynthesis genes [25, 57]. These observations support Phase II of the Searcy hypothesis: functions that are vestigial in the parasite, like carbon assimilation, are supplemented by host photosynthesis, and through time are lost from parasitic plants due to relaxed constraint on genes involved in the pathway. A recent study [65] defined a list of photosynthesis genes used to survey changes in the photosynthetic apparatus in three species of parasitic Orobanchaceae, including Striga hermonthica. Concordant with the findings in Wickett et al. (2011) [65], this study found that most gene families representing chlorophyll synthesis and photosynthesis pathways are present. However, some of these gene families encoding proteins involved in heme and protoporphyrin IX (in the chlorophyll biosynthesis pathway), as well as light harvesting, showed signatures of contraction (Table 4-6). By contrast, the nuclear-encoded photosystem gene families were intact compared to the ancestral state (shared with Mimulus, Table 4-6). Two expansions (one in the chlorophyll synthesis pathway, one in the photosynthesis pathway) were unexpected and may represent genes co-opted to perform novel functions in the parasite.


Additional Phase II signatures of gene loss in the genome of Striga include overrepresentation among contracted orthogroups of the KEGG pathways “photosynthesis - antenna proteins” (Benjamini P = 0.0021) and “carbon fixation in photosynthetic organisms” (Benjamini P = 0.0419) (Table 4-1). Among contracted orthogroups, the GO BP terms “protein-chromophore linkage” (Benjamini P = 6.7e-5), “carbon fixation” (Benjamini P = 0.0015), and “photosynthesis, light harvesting in photosystem I” (Benjamini P = 0.0023) were significantly enriched (Table 4-2). A similar theme of photosynthesis-related losses is also observed in the GO CC terms “plastoglobule” (Benjamini P = 2.5e-5), “light-harvesting complex” (Benjamini P = 0.0021), “photosystem I” (Benjamini P = 0.0256), and “thylakoid” (Benjamini P = 0.0471), which were enriched among contracted orthogroups (Table 4-2). These losses may explain the greatly reduced photosynthetic efficiency of Striga [46, 64], even though Striga still maintains low levels of photosynthetic flux that result in carbon fixation [46].

Leaves of Striga have undifferentiated mesophyll [66], a low number of plastids per cell [67], low chlorophyll concentration [68], an insensitive apparatus for regulating water loss [69], and likely a negative net carbon gain in leaves [66]. Consistent with these reductions in anatomy and function of Striga leaves, the GO BP terms “leaf development” (Benjamini P = 7.5e-4), “regulation of stomatal movement” (Benjamini P = 0.0298), “transpiration” (Benjamini P = 0.0346), and “vasculature development” (Benjamini P = 0.0339) are overrepresented among contracted orthogroups. This indicates that genes encoding elements of the transpirational apparatus of Striga are also under relaxed constraint. Indeed, the insensitive water loss apparatus [69] and abnormally high nighttime foliar carbon emission due to constitutively open stomata [66, 70] show that Striga has limited capability to regulate water loss. It has been shown that the closely related holoparasite Phelipanche aegyptiaca expresses a full complement of chlorophyll synthesis genes, but not photosystem genes [65]. Additional roles for chlorophyll (and other tetrapyrroles), like retrograde plastid-nuclear signaling [71], may explain conservation of these pathways in obligate parasites that have diminished photosynthetic capability. Taken together, this suggests that the primary functions of the Striga leaf could extend beyond carbon assimilation.

A clear and dominant signal in the ancestral gene family reconstruction is the contraction of cellular response machinery. Approximately 28% of all overrepresented GO BP terms in contracted orthogroups, compared to ~4% in the expanded orthogroups, were “response to abiotic or biotic stimuli” terms, including responses to virtually all major plant hormones (Table 4-2 and Table 4-3). Also included were numerous “signaling” terms that implicate hormone response/action (Table 4-2). Furthermore, the KEGG pathways “plant hormone signal transduction” (Benjamini P = 1.2e-10) and “plant-pathogen interaction” (Benjamini P = 0.0169) were also enriched among contracted orthogroups. Consistent with Searcy’s prediction of functional complementation by the host plant resulting in loss of functions in the parasite, these data, along with the reported insensitivity to water stress (thus implicating ABA [69]), show that the parasite may have increased its reliance on the host to sense and respond to its environment. This shift would reduce the energetic burden to perceive and integrate environmental cues while at the same time promoting parasite wellness over a stressed host plant. The same applies to biotic stresses: parasites could leverage host responses and defense strategies to biotic stress without expending their own resources. This might even expand the parasite niche by leveraging locally adapted defense responses. These data reveal a wide pattern of loss of sensing and response apparatus that provides strong support to the Searcy hypothesis.

Functions that are lost in the parasite and complemented by the host during Phase II may also be targets for Phase III specialization of the parasite-host relationship. For instance, alteration in water movement functions may span evolutionary events in Phases II and III because the host plant could complement water stress response pathways, while decreased water potential [46], constitutive transpiration [25, 57], and other alterations to the water relations apparatus, such as host vessel element invasion by parasitic vascular connections (oscula) [72], could be adaptive. Evolutionary shifts within a common process can be parsed into the respective phases based on the timing of these events. For instance, the GO BP term “response to water” (Benjamini P = 9.29e-4) is enriched in expanded orthogroups (Table 4-3), which have been shown to be significantly younger than contracted ones. This would suggest that these expanded orthogroups represent Phase III signatures, even though orthogroup contractions dominate water relation signatures.


Table 4-6: Chlorophyll synthesis and photosynthetic pathway genes in Striga. The photosynthesis apparatus is intact in Striga, an obligate hemiparasite that is photosynthetically competent. However, there is evidence of gene loss, suggesting that Striga is in the early stages of losing genes related to photosynthesis (red). The two expansions (green) are contrary to the expected trajectory of parasite evolution towards complete parasitism. These may represent genes co-opted to perform novel functions in the parasite.

a) Chlorophyll synthesis

b) Photosynthesis


Adaptation to parasitic lifestyle (Phase III)

During the transition to obligate parasitism, it was suggested by Searcy that parasitic plants would adapt to the parasitic lifestyle by accruing new genetic information [40, 41]. Since WGD in the Striga genome is a source for gene family expansion, it is possible that new and highly derived genes sourced from the Striga lineage-specific WGD underpin highly adapted parasite traits, especially in the novel haustorium. The primary function of the haustorium is to make a vascular connection from the parasite to its host, and implicit in this function is the acquisition of host resources and regulation of host defenses. Heide-Jørgensen and Kuijt observed that the haustorium of the closely related Triphysaria versicolor contained transfer-like cells [73, 74]. Because evidence of phloem continuity in Striga host connections is lacking, we hypothesized that Phase III innovation may include cellular machinery, such as endocytosis and vesicle-mediated transport. These would facilitate acquisition of host resources, perhaps in haustorial-interface transfer cells. It is clear that the high proportion of heterotrophic carbon, especially in unemerged Striga seedlings at virtually 100%, would require a highly efficient means of obtaining host carbon [43]. The survey of functions in expanded orthogroups (Table 4-1 and Table 4-3) revealed that the GO BP terms “vesicle-mediated transport” (Benjamini P = 1.04e-6) and “Golgi vesicle budding” (Benjamini P = 1.29e-4) were enriched. Furthermore, the GO CC terms “Golgi membrane” (Benjamini P = 1.05e-15), “trans-Golgi network” (Benjamini P = 4.99e-8), “endosome” (Benjamini P = 4.90e-8), “cis-Golgi network” (Benjamini P = 2.22e-4), “early endosome membrane” (Benjamini P = 0.0054), “clathrin-coated vesicle membrane” (Benjamini P = 0.0273), “trans-Golgi network membrane” (Benjamini P = 0.0321), and “Golgi cisterna membrane” (Benjamini P = 0.0437), and the KEGG pathway “endocytosis” (Benjamini P = 5.39e-4), were enriched among expanded orthogroups. This suggests that relatively young and significantly expanded orthogroups that encode inter- and intra-cellular transport genes may represent Phase III innovations related to host resource acquisition.

Host-induced gene silencing (involving movement of silencing RNAs from host to parasite) would be one potential mechanism to enhance host resistance to parasitic plants [75, 76]. Previous work has revealed massive mRNA transfer between the parasitic plant Cuscuta and its host [34]. However, the mechanism(s) of RNA transport in these systems remain unknown. Clues that RNA transfer may occur in Striga as well are found in enriched GO Molecular Function terms that are unique to expanded orthogroups, including “mRNA binding” (Benjamini P = 7.1e-16), “RNA binding” (Benjamini P = 2.3e-11), “nucleic acid binding” (Benjamini P = 4.4e-7), “poly(A) binding” (Benjamini P = 6.9e-4), and “single-stranded RNA binding” (Benjamini P = 0.0015) (Table 4-3). These orthogroups encode nucleic acid binding proteins that could be part of a mechanism for RNA transfer between parasitic plants and host plants, perhaps similar to phloem-localized RNA binding proteins that likely facilitate mRNA translocation via phloem in plants [77].

Conclusions

Changes in gene family sizes can provide valuable insights into mechanisms that underlie adaptive organismal differences among closely related species. The observed gene family dynamics in Striga asiatica follow a predictable pattern of gene losses in functions that are likely complemented by the host, and gene gains in functions likely to improve parasite fitness. Interestingly, the contracted and expanded gene families align with older (Lamiales-wide) and younger (Orobanchaceae-wide) polyploidy events in Striga, respectively. Contractions include functions related to photosynthesis, environmental sensing, and leaf developmental processes, while expansions are associated with transcription functions, endocytosis, and intracellular transport. These findings suggest that parasitism is associated with changes in gene functions common to non-parasitic plants, thus providing critical evidence for understanding parasite function and potential targets for parasite control.

Materials and methods

Estimating gene family evolutionary dynamics

To estimate gene family contractions and expansions in Striga, I selected 1,440 BUSCO [78] single-copy genes from the 26-genome gene family data described in the previous chapter. Amino acid sequence alignments for each BUSCO single-copy orthogroup were estimated using the MAFFT (v7.205) L-INS-i algorithm [79] with a maximum of 1000 iterative refinements. Corresponding DNA sequences were then forced onto the amino acid alignments, and the resulting DNA codon alignments were trimmed with trimAl version 1.4.rev8 [80] to remove sites with gaps in 90% of the sequences. A maximum likelihood (ML) species tree was inferred using RAxML [81] with 100 bootstrap replicates and the GTR+GAMMA model of DNA sequence evolution (Figure 4-2).
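Forcing DNA sequences onto an amino acid alignment amounts to replacing each aligned residue with its codon and each gap with three gap characters. The Python sketch below illustrates that step for a single sequence. It is not the pipeline code used in this study; the function name `thread_codons` and the tolerance for a trailing stop codon are assumptions made for illustration.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Map an unaligned coding sequence onto one gapped protein alignment row.

    Each residue becomes its codon; each gap becomes a triple gap, so the
    codon alignment keeps the column structure of the protein alignment.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    # Assumption: the CDS may carry a trailing stop codon absent from the protein.
    if len(cds) not in (3 * n_residues, 3 * n_residues + 3):
        raise ValueError("CDS length does not match the protein sequence")

    codons = []
    position = 0  # index of the next unused nucleotide in the CDS
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)
        else:
            codons.append(cds[position:position + 3])
            position += 3
    return "".join(codons)


# Toy example: a two-residue protein aligned with one internal gap.
print(thread_codons("M-K", "ATGAAA"))
```

Because gaps are inserted in whole-codon units, the reading frame is preserved, which is what downstream codon-based analyses require.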

The weighted Wagner parsimony method implemented in DupliPHY version 2.0 [6] was used to reconstruct the presence and size of each gene family in the common ancestor of Striga and Mimulus, as well as other successively earlier common ancestors. The weights matrix that defines the cost of a family changing from the ancestral state was computed by DupliPHY from the input matrix of family sizes. I used the species tree inferred from BUSCO single-copy genes and a table of genes for each species observed in each orthogroup (with at least one asterid taxon) from the 26-genome orthogroup circumscription (Appendix C: Data S1S) as input for DupliPHY.
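To make the idea of parsimony reconstruction of family sizes concrete, the Python sketch below runs a small Sankoff-style dynamic program on a toy tree. It is a simplified stand-in, not DupliPHY's implementation: the linear gain and loss weights, the function names, the outgroup name "Outgroup", and the family sizes are all hypothetical, whereas DupliPHY derives its weights matrix from the input data.

```python
def linear_cost(parent, child, gain=1.0, loss=1.0):
    """Cost of a size change along a branch (hypothetical linear weights)."""
    diff = child - parent
    return gain * diff if diff >= 0 else loss * -diff


def _up(node, sizes, states, cost, tables):
    """Bottom-up pass: tables[node][s] is the best subtree cost if node has size s."""
    if isinstance(node, str):  # leaf: only the observed size is allowed
        tables[node] = [0.0 if s == sizes[node] else float("inf") for s in states]
        return
    for child in node:
        _up(child, sizes, states, cost, tables)
    tables[node] = [
        sum(min(cost(s, t) + tables[child][t] for t in states) for child in node)
        for s in states
    ]


def _down(node, state, states, cost, tables, result):
    """Top-down pass: pick the best child state given the parent state."""
    result[node] = state
    if isinstance(node, str):
        return
    for child in node:
        best = min(states, key=lambda t: cost(state, t) + tables[child][t])
        _down(child, best, states, cost, tables, result)


def reconstruct(tree, sizes, cost=linear_cost):
    """Return (minimum total cost, {node: inferred family size})."""
    states = range(max(sizes.values()) + 1)
    tables = {}
    _up(tree, sizes, states, cost, tables)
    root_state = min(states, key=lambda s: tables[tree][s])
    result = {}
    _down(tree, root_state, states, cost, tables, result)
    return tables[tree][root_state], result


# Toy rooted tree as nested tuples; leaf names must be unique.
tree = (("Striga", "Mimulus"), "Outgroup")
sizes = {"Striga": 1, "Mimulus": 3, "Outgroup": 3}  # hypothetical gene counts
total_cost, states = reconstruct(tree, sizes)
print(total_cost, states[("Striga", "Mimulus")])
```

In this toy case the ancestor of Striga and Mimulus is inferred to have had three copies, so the family would be scored as a contraction of two genes on the Striga branch.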

Figure 4-2: The maximum likelihood species tree for 26 plant genomes. A phylogram of 26 representative plant species estimated from the concatenated data matrix for 1,440 single-copy orthogroup genes obtained from the BUSCO classification. Selected taxa represent major land plant lineages including asterids (green), Caryophyllales (gold), rosids (red), basal eudicot (purple), monocots (blue), one basal angiosperm (turquoise green), and outgroups (gymnosperm/lycophyte/moss) (black).

Functional enrichment of contracted and expanded genes

To evaluate the putative functional roles of contracted and expanded gene families (orthogroups), the Arabidopsis thaliana genes in these orthogroups were extracted and examined for statistically enriched Gene Ontology (GO) terms [82, 83] and KEGG pathways [84-86]. The Arabidopsis genome has a relatively complete gene set and thus it is ideal for identifying gene family functions. The enrichment analysis was performed in DAVID [87], the gene function classification database, using Fisher's exact test with a Benjamini adjusted p-value ≤ 0.05, and allowing a minimum of two genes per functional term. Arabidopsis genes in both the contracted and the expanded gene families (foreground) were compared to all genes in the Arabidopsis genome (background) to identify significantly enriched GO terms and KEGG pathways.
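The logic of a foreground-versus-background enrichment test can be sketched in a few lines of Python. The example below computes a one-sided Fisher's exact p-value from the hypergeometric distribution and applies a Benjamini-Hochberg adjustment. It is an illustration under stated assumptions, not a reimplementation of DAVID, whose scoring and correction may differ in detail; the function names and all gene counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(hits, foreground, annotated, background):
    """One-sided Fisher's exact test, P(X >= hits).

    hits       -- foreground genes annotated with the term
    foreground -- total foreground genes (a subset of the background)
    annotated  -- background genes annotated with the term
    background -- total background genes
    """
    upper = min(foreground, annotated)
    total = comb(background, foreground)
    tail = sum(
        comb(annotated, i) * comb(background - annotated, foreground - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical counts: (hits, foreground size, annotated in background, background size)
terms = {
    "term_A": (12, 200, 150, 27000),
    "term_B": (3, 200, 400, 27000),
    "term_C": (2, 200, 30, 27000),
}
raw = [fisher_enrichment_p(*counts) for counts in terms.values()]
for name, p, q in zip(terms, raw, benjamini_hochberg(raw)):
    status = "enriched" if q <= 0.05 else "not significant"
    print(f"{name}: p={p:.3g}, adjusted p={q:.3g}, {status}")
```

A term is called enriched only when its adjusted p-value is at or below 0.05, mirroring the threshold stated above.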

Divergence rates for contracted and expanded genes

Synonymous substitution rates (Ks) for paralogous genes in contracted and expanded gene families were estimated using the KaKsAnalysis tool of PlantTribes 2 to determine their age distributions. Briefly, Striga paralogous gene pairs in both the contracted and the expanded orthogroups were separately identified using reciprocal BLAST searches. Pairwise protein alignments were estimated with MAFFT [79] and converted to codon alignments. Ks values of paralogs were calculated using the maximum likelihood method implemented in CODEML [88], imposing a minimum alignment length of 300 bp. The age distributions of syntenic paralogs emanating from the whole genome duplication analysis described in the previous chapter were also identified for both contractions and expansions.

Defining tissue-specific gene families

Baseline Arabidopsis gene expression data were obtained from the ExpressionData database (www.expressiondata.org) [89] to aid in defining tissue-specific orthogroups for the 26-genome gene family scaffold. These data are a curated summary of more than 5,000 microarray datasets from AtGenExpress experiments conducted using the Affymetrix ATH1 GeneChip® [90-92]. The baseline gene expression matrix was updated with current ATH1 probe annotations from the Gene Networks in Seed Development website (http://seedgenenetwork.net), and Arabidopsis genes with orthogroup assignments were used to determine tissue-specific orthogroups. Following the method described by Severin et al. (2010) [93], Z-score analysis was performed on the expression matrix for all Arabidopsis genes in the 26-genome orthogroups. A Z-score cutoff of 2 was determined empirically to select gene sets for which >95% of the genes had a Z-score >2 in only one tissue category. The Arabidopsis gene identifiers were then searched against the 26-genome orthogroup classification to identify orthogroups with genes that have tissue-specific expression.
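The Z-score criterion can be illustrated with a short Python sketch that standardizes one gene's expression across tissues and reports a tissue only when it is the single category exceeding the cutoff. This is a minimal illustration, assuming Z-scores are computed per gene across tissues with the sample standard deviation; the normalization details in Severin et al. may differ, and the function names and expression values are hypothetical.

```python
from statistics import mean, stdev


def tissue_z_scores(expression):
    """Standardize one gene's expression values across tissues."""
    values = list(expression.values())
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (an assumption)
    if sd == 0:
        return {tissue: 0.0 for tissue in expression}
    return {tissue: (value - mu) / sd for tissue, value in expression.items()}


def specific_tissue(expression, cutoff=2.0):
    """Return the tissue if exactly one exceeds the Z-score cutoff, else None."""
    scores = tissue_z_scores(expression)
    hits = [tissue for tissue, z in scores.items() if z > cutoff]
    return hits[0] if len(hits) == 1 else None


TISSUES = [
    "root", "inflorescence stem", "seedling", "embryo", "pollen", "hypocotyl",
    "leaf", "pistil", "seed", "shoot apex", "stamen", "testa",
]

# Hypothetical gene expressed almost exclusively in pollen.
pollen_gene = {tissue: 1.0 for tissue in TISSUES}
pollen_gene["pollen"] = 100.0

# Hypothetical gene with uniform expression in every tissue.
housekeeping_gene = {tissue: 50.0 for tissue in TISSUES}

print(specific_tissue(pollen_gene), specific_tissue(housekeeping_gene))
```

Requiring exactly one tissue above the cutoff keeps genes with elevated expression in two or more tissues out of the tissue-specific sets.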

Analysis of haustorial and photosynthesis-related genes

Transcripts of three Orobanchaceae parasites with elevated expression in haustorial tissues, defined as core parasitism genes by Yang et al. (2014) [33], were assigned to the 26-genome gene family classification to trace their evolutionary history. The identified orthogroups enriched for tissue-specific expression were examined to determine likely tissues from which haustorial genes were recruited. Additionally, orthogroups corresponding to Arabidopsis chlorophyll synthesis and photosynthesis reference pathways described in Wickett et al. (2011) [65] were identified and examined for gene family contractions and expansions.

References

1. Yoshida, S., Kim, S., Wafula, E.K., Tanskanen, J., Kim, Y.-M., Honaas, L., Yang, Z., Spallek, T., Conn, C.E., Ichihashi, Y., Cheong, K., Cui, S., Der, J.P., Gundlach, H., Jiao, Y., Hori, C., Ishida, J.K., Kasahara, H., Kiba, T., Kim, M.-S., Koo, N., Laohavisit, A., Lee, Y.-H., Lumba, S., McCourt, P., Mortimer, J.C., Mutuku, J.M., Nomura, T., Sasaki-Sekimoto, Y., Seto, Y., Wang, Y., Wakatake, T., Sakakibara, H., Demura, T., Yamaguchi, S., Yoneyama, K., Manabe, R.-I., Nelson, D.C., Schulman, A.H., Timko, M.P., dePamphilis, C.W., Choi, D., Shirasu, K. Genome sequence of Striga asiatica provides insight into the evolution of plant parasitism. Curr. Biol. 29, 3041–3052 (2019).
2. Ohno, S. Evolution by Gene Duplication. Springer-Verlag Berlin Heidelberg GmbH (1970).
3. Van de Peer, Y., Maere, S., Meyer, A. The evolutionary significance of ancient genome duplications. Nat. Rev. Genet. 10, 725–732 (2009).
4. Panchy, N., Lehti-Shiu, M.D., Shiu, S.-H. Evolution of gene duplication in plants. Plant Physiol. 171, 2294–2316 (2016).
5. Clark, J.W., Donoghue, P.C.J. Whole-genome duplication and plant macroevolution. Trends Plant Sci. 23, 933–945 (2018).
6. Ames, R.M., Money, D., Ghatge, V.P., Whelan, S., Lovell, S.C. Determining the evolutionary history of gene families. Bioinformatics. 28, 48–55 (2012).
7. Demuth, J.P., Hahn, M.W. The life and death of gene families. Bioessays. 31, 29–39 (2009).
8. Albalat, R., Cañestro, C. Evolution by gene loss. Nat. Rev. Genet. 17, 379–391 (2016).
9. Harris, R.M., Hofmann, H.A. Seeing is believing: Dynamic evolution of gene families. Proc. Natl. Acad. Sci. U.S.A. 112, 1252–1253 (2015).
10. Sidow, A. Gen(om)e duplications in the evolution of early vertebrates. Curr. Opin. Genet. Dev. 6, 715–722 (1996).
11. Force, A., Lynch, M., Pickett, F.B., Amores, A., Yan, Y.-L., Postlethwait, J. Preservation of duplicate genes by complementary, degenerative mutations. Genetics. 151, 1531–1545 (1999).
12. Rastogi, S., Liberles, D.A. Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol. Biol. 5, 28 (2005).
13. Langham, R.J., Walsh, J., Dunn, M., Ko, C., Goff, S.A., Freeling, M. Genomic duplication, fractionation and the origin of regulatory novelty. Genetics. 166, 935–945 (2004).
14. Vogel, A., Schwacke, R., Denton, A.K., Usadel, B., Hollmann, J., Fischer, K., Bolger, A., Schmidt, M.H.W., Bolger, M.E., Gundlach, H., Mayer, K.F.X., Weiss-Schneeweiss, H., Temsch, E.M., Krause, K. Footprints of parasitism in the genome of the parasitic flowering plant Cuscuta campestris. Nat. Commun. 9, 2515 (2018).
15. Sun, G., Xu, Y., Liu, H., Sun, T., Zhang, J., Hettenhausen, C., Shen, G., Qi, J., Qin, Y., Li, J., Wang, L., Chang, W., Guo, Z., Baldwin, I.T., Wu, J. Large-scale gene losses underlie the genome evolution of parasitic plant Cuscuta australis. Nat. Commun. 9, 2683 (2018).
16. Yuan, Y., Jin, X., Liu, J., Zhao, X., Zhou, J., Wang, X., Wang, D., Lai, C., Xu, W., Huang, J., Zha, L., Liu, D., Ma, X., Wang, L., Zhou, M., Jiang, Z., Meng, H., Peng, H., Liang, Y., Li, R., Jiang, C., Zhao, Y., Nan, T., Jin, Y., Zhan, Z., Yang, J., Jiang, W., Huang, L. The Gastrodia elata genome provides insights into plant adaptation to heterotrophy. Nat. Commun. 9, 1615 (2018).
17. Wolfe, K.H. Yesterday's polyploids and the mystery of diploidization. Nat. Rev. Genet. 2, 333–341 (2001).
18. Brunet, F.G., Roest Crollius, H., Paris, M., Aury, J.M., Gilbert, P., Jaillon, O., Laudet, V., Robinson-Rechavi, M. Gene loss and evolutionary rates following whole-genome duplication in teleost fishes. Mol. Biol. Evol. 23, 1808–1816 (2006).
19. Inoue, J., Sato, Y., Sinclair, R., Tsukamoto, K., Nishida, M. Rapid genome reshaping by multiple-gene loss after whole-genome duplication in teleost fish suggested by mathematical modeling. Proc. Natl. Acad. Sci. U.S.A. 112, 14918–14923 (2015).
20. Blanc, G., Wolfe, K.H. Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell. 16, 1679–1691 (2004).
21. Maere, S., De Bodt, S., Raes, J., Casneuf, T., Van Montagu, M., Kuiper, M., Van de Peer, Y. Modeling gene and genome duplications in eukaryotes. Proc. Natl. Acad. Sci. U.S.A. 102, 5454–5459 (2005).
22. Seoighe, C., Gehring, C. Genome duplication led to highly selective expansion of the Arabidopsis thaliana proteome. Trends Genet. 20, 461–464 (2004).
23. Chen, E.C.H., Buen Abad Najar, C., Zheng, C., Brandts, A., Lyons, E., Tang, H., Carretero-Paulet, L., Albert, V.A., Sankoff, D. The dynamics of functional classes of plant genes in rediploidized ancient polyploids. BMC Bioinformatics. 14 Suppl 15, S19 (2013).
24. Krause, K. From chloroplasts to “cryptic” plastids: evolution of plastid genomes in parasitic plants. Curr. Genet. 54, 111–121 (2008).
25. Wicke, S. Genomic Evolution in Orobanchaceae. In: Parasitic Orobanchaceae. pp. 267–286. Springer, Berlin, Heidelberg (2013).
26. Naumann, J., Der, J.P., Wafula, E.K., Jones, S.S., Wagner, S.T., Honaas, L.A., Ralph, P.E., Bolin, J.F., Maass, E., Neinhuis, C., Wanke, S., dePamphilis, C.W. Detecting and characterizing the highly divergent plastid genome of the nonphotosynthetic parasitic plant Hydnora visseri (Hydnoraceae). Genome Biol. Evol. 8, 345–363 (2016).
27. 
Su, H.-J., Barkman, T.J., Hao, W., Jones, S.S., Naumann, J., Skippington, E., Wafula, E.K., Hu, J.-M., Palmer, J.D., dePamphilis, C.W. Novel genetic code and record-setting AT-richness in the highly reduced plastid genome of the holoparasitic plant Balanophora. Proc. Natl. Acad. Sci. U.S.A. 116, 934–943 (2019). 28. McNeal, J.R., Kuehl, J.V., Boore, J.L., dePamphilis, C.W. Complete plastid genome sequences suggest strong selection for retention of photosynthetic genes in the parasitic plant genus Cuscuta. BMC Plant Biol. 7, 57 (2007). 29. Conn, C.E., Bythell-Douglas, R., Neumann, D., Yoshida, S., Whittington, B., Westwood, J.H., Shirasu, K., Bond, C.S., Dyer, K.A., Nelson, D.C. Convergent evolution of strigolactone perception enabled host detection in parasitic plants. Science. 349, 540–543 (2015). 30. Tsuchiya, Y., Yoshimura, M., Sato, Y., Kuwata, K., Toh, S., Holbrook-Smith, D., Zhang, H., McCourt, P., Itami, K., Kinoshita, T., Hagihara, S. Probing strigolactone receptors in Striga hermonthica with fluorescence. Science. 349, 864–868 (2015). 31. Toh, S., Holbrook-Smith, D., Stogios, P.J., Onopriyenko, O., Lumba, S., Tsuchiya, Y., Savchenko, A., McCourt, P. Structure-function analysis identifies

121

highly sensitive strigolactone receptors in Striga. Science. 350, 203–207 (2015). 32. Smith, S.M. Q&A: What are strigolactones and why are they important to plants and soil microbes? BMC Biol. 12, 19 (2014). 33. Yang, Z., Wafula, E.K., Honaas, L.A., Zhang, H., Das, M., Fernandez- Aparicio, M., Huang, K., Bandaranayake, P.C.G., Wu, B., Der, J.P., Clarke, C.R., Ralph, P.E., Landherr, L., Altman, N.S., Timko, M.P., Yoder, J.I., Westwood, J.H., dePamphilis, C.W. Comparative transcriptome analyses reveal core parasitism genes and suggest gene duplication and repurposing as sources of structural novelty. Mol. Biol. Evol. 32, 767–790 (2014). 34. Kim, G., LeBlanc, M.L., Wafula, E.K., dePamphilis, C.W., Westwood, J.H. Plant science. Genomic-scale exchange of mRNA between a parasitic plant and its hosts. Science. 345, 808–811 (2014). 35. Yoshida, S., Maruyama, S., Nozaki, H., Shirasu, K. Horizontal gene transfer by the parasitic plant Striga hermonthica. Science. 328, 1128–1128 (2010). 36. Zhang, Y., Fernandez-Aparicio, M., Wafula, E.K., Das, M., Jiao, Y., Wickett, N.J., Honaas, L.A., Ralph, P.E., Wojciechowski, M.F., Timko, M.P., Yoder, J.I., Westwood, J.H., dePamphilis, C.W. Evolution of a horizontally acquired legume gene, albumin 1, in the parasitic plant Phelipanche aegyptiaca and related species. BMC Evol. Biol. 13, 48 (2013). 37. Yang, Z., Wafula, E.K., Kim, G., Shahid, S., McNeal, J.R., Ralph, P.E., Timilsena, P.R., Yu, W.-B., Kelly, E.A., Zhang, H., Person, T.N., Altman, N.S., Axtell, M.J., Westwood, J.H., dePamphilis, C.W. Convergent horizontal gene transfer and cross-talk of mobile nucleic acids in parasitic plants. Nat. Plants. 5, 991-1001 (2019). 38. Yang, Z., Zhang, Y., Wafula, E.K., Honaas, L.A., Ralph, P.E., Jones, S., Clarke, C.R., Liu, S., Su, C., Zhang, H., Altman, N.S., Schuster, S.C., Timko, M.P., Yoder, J.I., Westwood, J.H., dePamphilis, C.W. Horizontal gene transfer is more frequent with increased heterotrophy and contributes to parasite adaptation. 
Proc. Natl. Acad. Sci. U.S.A. 113, E7010–E7019 (2016). 39. Kado, T., Innan, H. Horizontal gene transfer in five parasite plant species in Orobanchaceae. Genome Biol. Evol. 10, 3196–3210 (2018). 40. Searcy, D.G. Measurements by DNA hybridization in vitro of the genetic basis of parasitic reduction. Evolution. 24, 207–219 (1970). 41. Searcy, D.G., MacInnis, A.J. Measurements by DNA renaturation of the genetic basis of parasitic reduction. Evolution. 24, 796–806 (1970). 42. Těšitel, J. Functional biology of parasitic plants: a review. Plant Ecology and Evolution. 149, 5–20 (2016). 43. Těšitel, J., Plavcová, L., Cameron, D.D. Interactions between hemiparasitic plants and their hosts: the importance of organic carbon transfer. Plant Signal Behav. 5, 1072–1076 (2010). 44. Yoshida, S., Cui, S., Ichihashi, Y., Shirasu, K. The haustorium, a specialized invasive organ in parasitic plants. Annu. Rev. Plant Biol. 67, 643–667 (2016). 45. Westwood, J.H., Yoder, J.I., Timko, M.P., dePamphilis, C.W. The evolution of parasitism in plants. Trends Plant Sci. 15, 227–235 (2010). 46. Westwood, J.H. The physiology of the established parasite-host association. In:

122

Parasitic Orobanchaceae: Parasitic Mechanisms and Control Strategies. pp. 87– 114 (2013). 47. Krylov, D.M., Wolf, Y.I., Rogozin, I.B., Koonin, E.V. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Research. 13, 2229–2235 (2003). 48. Ames, R.M., Money, D., Lovell, S.C. Inferring gene family histories in yeast identifies lineage specific expansions. PLoS ONE. 9, e99480 (2014). 49. Deutekom, E.S., Vosseberg, J., van Dam, T.J.P., Snel, B. Measuring the impact of gene prediction on gene loss estimates in Eukaryotes by quantifying falsely inferred absences. PLoS Comput Biol. 15, e1007301 (2019). 50. Song, H., Sun, J., Yang, G. Old and young duplicate genes reveal different responses to environmental changes in Arachis duranensis. Mol. Genet. Genomics. 25, 3389–11 (2019). 51. Cui, X., Lv, Y., Chen, M., Nikoloski, Z., Twell, D., Zhang, D. Young genes out of the male: An insight from evolutionary age analysis of the pollen transcriptome. Mol Plant. 8, 935–945 (2015). 52. Wolf, Y.I., Novichkov, P.S., Karev, G.P., Koonin, E.V., Lipman, D.J. The universal distribution of evolutionary rates of genes and distinct characteristics of eukaryotic genes of different apparent ages. Proc. Natl. Acad. Sci. U.S.A. 106, 7273–7280 (2009). 53. Vishnoi, A., Kryazhimskiy, S., Bazykin, G.A., Hannenhalli, S., Plotkin, J.B. Young proteins experience more variable selection pressures than old proteins. Genome Research. 20, 1574–1581 (2010). 54. Wang, J., Tao, F., Marowsky, N.C., Fan, C. Evolutionary fates and dynamic functionalization of young duplicate genes in Arabidopsis genomes. Plant Physiol. 172, 427–440 (2016). 55. Wang, Y., Wang, X., Paterson, A.H. Genome and gene duplications and gene expression divergence: a view from plants. Ann. N.Y. Acad. Sci. 1256, 1–14 (2012). 56. 
Hardigan, M.A., Crisovan, E., Hamilton, J.P., Kim, J., Laimbeer, P., Leisner, C.P., Manrique-Carpintero, N.C., Newton, L., Pham, G.M., Vaillancourt, B., Yang, X., Zeng, Z., Douches, D.S., Jiang, J., Veilleux, R.E., Buell, C.R. Genome reduction uncovers a large dispensable genome and adaptive role for copy number variation in asexually propagated Solanum tuberosum. The Plant Cell. 28, 388–405 (2016). 57. Press, M., Graves, J. Parasitic Plants. Springer. (1995). 58. Edger, P.P., Heidel-Fischer, H.M., Bekaert, M., Rota, J., Glöckner, G., Platts, A.E., Heckel, D.G., Der, J.P., Wafula, E.K., Tang, M., Hofberger, J.A., Smithson, A., Hall, J.C., Blanchette, M., Bureau, T.E., Wright, S.I., dePamphilis, C.W., Schranz, M.E., Barker, M.S., Conant, G.C., Wahlberg, N., Vogel, H., Pires, J.C., Wheat, C.W. The butterfly plant arms-race escalated by gene and genome duplications. Proc. Natl. Acad. Sci. U.S.A. 112, 8362–8366 (2015). 59. Soltis, D.E., Albert, V.A., Leebens-Mack, J., Bell, C.D., Paterson, A.H., Zheng, C., Sankoff, D., dePamphilis, C.W., Wall, P.K., Soltis, P.S. Polyploidy and angiosperm diversification. Am. J. Bot. 96, 336–348 (2009).

123

60. Schranz, M.E., Mohammadin, S., Edger, P.P. Ancient whole genome duplications, novelty and diversification: the WGD Radiation Lag-Time Model. Curr. Opin. Plant Biol. 15, 147–153 (2012). 61. Tank, D.C., Eastman, J.M., Pennell, M.W., Soltis, P.S., Soltis, D.E., Hinchliff, C.E., Brown, J.W., Sessa, E.B., Harmon, L.J. Nested radiations and the pulse of angiosperm diversification: increased diversification rates often follow whole genome duplications. New Phytol. 207, 454–467 (2015). 62. Macqueen, D.J., Johnston, I.A. A well-constrained estimate for the timing of the salmonid whole genome duplication reveals major decoupling from species diversification. Proc. Biol. Sci. 281, 20132881 (2014). 63. Ravin, N.V., Gruzdev, E.V., Beletsky, A.V., Mazur, A.M., Prokhortchouk, E.B., Filyushin, M.A., Kochieva, E.Z., Kadnikov, V.V., Mardanov, A.V., Skryabin, K.G. The loss of photosynthetic pathways in the plastid and nuclear genomes of the non-photosynthetic mycoheterotrophic eudicot Monotropa hypopitys. BMC Plant Biol. 16, 153–161 (2016). 64. Rogers, W.E., Nelson, R.R. Penetration and nutrition of Striga asiatica. Phytopathology. 52, 1064-1070 (1962). 65. Wickett, N.J., Honaas, L.A., Wafula, E.K., Das, M., Huang, K., Wu, B., Landherr, L., Timko, M.P., Yoder, J., Westwood, J.H., dePamphilis, C.W. Transcriptomes of the parasitic plant family Orobanchaceae reveal surprising conservation of chlorophyll synthesis. Curr. Biol. 21, 2098–2104 (2011). 66. Press, M.C., Smith, S., Stewart, G.R. Carbon acquisition and assimilation in parasitic plants. Functional Ecology. 5, 278 (1991). 67. Tuohy, J., Smith, E.A., Stewart, G.R. The parasitic habit: trends in morphological and ultrastructural reductionism. In: Biology and Control of Orovanche. Wageningen, The Netherlands. pp. 86-95 (1986). 68. Shah, N., Smirnoff, N., Stewart, G.R. Photosynthesis and stomatal characteristics of Striga hermonthica in relation to its parasitic habit. Physiologia Plantarum. 69, 699–703 (1987). 69. 
Press, M.C., Tuohy, J.M., Stewart, G.R. Gas exchange characteristics of the sorghum-striga host-parasite association. Plant Physiol. 84, 814–819 (1987). 70. Smith, S., Stewart, G.R. Effect of potassium levels on the stomatal behavior of the hemi-parasite Striga hermonthica. Plant Physiol. 94, 1472–1476 (1990). 71. Singh, R., Singh, S., Parihar, P., Singh, V.P., Prasad, S.M. Retrograde signaling between plastid and nucleus: A review. J. Plant Physiol. 181, 55–66 (2015). 72. Dorr, I. How Striga Parasitizes its Host: a TEM and SEM Study. Annals of Botany. 79, 463–472 (1997). 73. Heide-JØrgensen, H.S., Kuijt, J. Epidermal derivatives as xylem elements and transfer cells: a study of the host-parasite interface in two species of Triphysaria (Scrophulariaceae). Protoplasma 174, 173-183 (1993). 74. Heide-Jørgensen, H.S., Kuijt, J. The Haustorium of the Root Parasite Triphysaria (Scrophulariaceae), with Special Reference to Xylem Bridge Ultrastructure. Am. J. Bot. 82, 782–797 (1995). 75. Tomilov, A.A., Tomilova, N.B., Wroblewski, T., Michelmore, R., Yoder, J.I. Trans-specific gene silencing between host and parasitic plants. The Plant Journal. 56, 389–397 (2008).

124

76. Aly, R., Hamamouch, N., Abu-Nassar, J., Wolf, S., Joel, D.M., Eizenberg, H., Kaisler, E., Cramer, C., Gal-On, A., Westwood, J.H. Movement of protein and macromolecules between host plants and the parasitic weed Phelipanche aegyptiaca Pers. Plant Cell Rep. 30, 2233–2241 (2011). 77. Pallas, V., Gomez, G. Phloem RNA-binding proteins as potential components of the long-distance RNA transport system. Front Plant Sci. 4, 130 (2013). 78. Waterhouse, R.M., Seppey, M., Simão, F.A., Manni, M., Ioannidis, P., Klioutchnikov, G., Kriventseva, E.V., Zdobnov, E.M. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol. 35, 543–548 (2018). 79. Katoh, K., Standley, D.M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013). 80. Capella-Gutiérrez, S., Silla-Martínez, J.M., Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 25, 1972–1973 (2009). 81. Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post- analysis of large phylogenies. Bioinformatics. 30, 1312–1313 (2014). 82. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 25, 25–29 (2000). 83. The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. 47, D330–D338 (2019). 84. Kanehisa, M., Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27-30 (2000). 85. Kanehisa, M., Sato, Y., Furumichi, M., Morishmima, K., Tanabe, M. New approach for understanding genome variations in KEGG. Nucleic Acids Res. 47, D590-D595 (2019). 86. 
Kanehisa, M. Toward understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947-1951 (2019). 87. Huang, D.W., Sherman, B.T., Tan, Q., Kir, J., Liu, D., Bryant, D., Guo, Y., Stephens, R., Baseler, M.W., Lane, H.C., Lempicki, R.A. DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res. 35, W169– W175 (2007). 88. Yang, Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555–556 (1997). 89. Zimmermann, P., Bleuler, S., Laule, O., Martin, F., Ivanov, N.V., Campanoni, P., Oishi, K., Lugon-Moulin, N., Wyss, M., Hruz, T., Gruissem, W. ExpressionData - A public resource of high quality curated datasets representing gene expression across anatomy, development and experimental conditions. BioData Mining. 7, 18 (2014). 90. Goda, H., Sasaki, E., Akiyama, K., Maruyama-Nakashita, A., Nakabayashi, K.,

125

Li, W., Ogawa, M., Yamauchi, Y., Preston, J., Aoki, K., Kiba, T., Takatsuto, S., Fujioka, S., Asami, T., Nakano, T., Kato, H., Mizuno, T., Sakakibara, H., Yamaguchi, S., Nambara, E., Kamiya, Y., Takahashi, H., Hirai, M.Y., Sakurai, T., Shinozaki, K., Saito, K., Yoshida, S., Shimada, Y. The AtGenExpress hormone and chemical treatment data set: experimental design, data evaluation, model data analysis and data access. The Plant Journal. 55, 526–542 (2008). 91. Schmid, M., Davison, T.S., Henz, S.R., Pape, U.J., Demar, M., Vingron, M., Schölkopf, B., Weigel, D., Lohmann, J.U. A gene expression map of Arabidopsis thaliana development. Nat Genet. 37, 501–506 (2005). 92. Kilian, J., Whitehead, D., Horak, J., Wanke, D., Weinl, S., Batistic, O., D’Angelo, C., Bornberg-Bauer, E., Kudla, J., Harter, K. The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. Plant J. 50, 347–363 (2007). 93. Severin, A.J., Woody, J.L., Bolon, Y.-T., Joseph, B., Diers, B.W., Farmer, A.D., Muehlbauer, G.J., Nelson, R.T., Grant, D., Specht, J.E., Graham, M.A., Cannon, S.B., May, G.D., Vance, C.P., Shoemaker, R.C. RNA-Seq Atlas of Glycine max: a guide to the soybean transcriptome. BMC Plant Biol. 10, 160 (2010).

126

Chapter 5

Conclusions and future directions

This dissertation aimed to better understand the evolutionary origins of plant parasitism in Orobanchaceae, the largest and most diverse family of parasitic angiosperms. I employed comparative genomic approaches using the draft genome of Striga asiatica together with transcriptomes of diverse species representing the major groups within Orobanchaceae and related sister lineages. This strategy allowed me to unravel gene and genome duplication events in the Orobanchaceae and to tie the associated changes in gene family sizes and expression shifts to genes important in parasitism.

To address my research objectives thoroughly, I developed PlantTribes 2, a gene family analysis framework that uses objective classifications of complete protein sequences from genomes for comparative and evolutionary analyses of gene families and transcriptomes at a genome scale. PlantTribes 2 post-processes de novo assembled transcripts, assigns the post-processed transcripts to pre-computed orthologous gene family clusters based on the inferred proteomes of fully sequenced genomes, and estimates gene family multiple sequence alignments and their corresponding phylogenetic trees. Additionally, substitution rates among orthologous and paralogous transcripts can be estimated to address evolutionary hypotheses, including inference of large-scale duplication events. The key novelty in PlantTribes 2 is the ease and flexibility with which it can accommodate new data and integrate new analysis modules. To date, no global gene family comparative analysis tool provides the flexibility of integrating gene family circumscriptions from any source or organismal group, including expert-curated gene families, to serve as a scaffold for readily classifying protein-coding regions from transcriptomes or new genomes for comparative gene analyses. The modular design of its analysis tools makes PlantTribes 2 a platform- and language-independent general framework that can be easily extended by integrating third-party tools with custom methods. Moreover, advanced users can substitute tools of their choice for the external pipeline dependencies of existing features. Reproducibility is key to the scientific process because it allows published data analysis results to be recapitulated. PlantTribes 2 supports reproducibility by providing fully automated analysis modules that can be executed independently or collectively, so that complete analyses can be reproduced. Integration into the Galaxy framework enables scientists without strong computational backgrounds to access PlantTribes 2 tools through a graphical user interface, and allows all scientists to share analysis histories, workflows, and visualizations.
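
The idea of analysis modules that can be run independently or collectively, with replaceable dependencies, can be illustrated with a minimal Python sketch. This is not PlantTribes 2 code and does not reflect its actual commands or file formats: the function names (post_process, classify, prefix_classifier, run_all), the length cutoff, and the toy sequences are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Transcripts = Dict[str, str]        # transcript id -> inferred protein sequence
Classifier = Callable[[str], str]   # sequence -> gene family label (hypothetical)


def post_process(transcripts: Transcripts, min_len: int = 6) -> Transcripts:
    """Toy post-processing step: keep sequences at or above a length cutoff."""
    return {tid: seq for tid, seq in transcripts.items() if len(seq) >= min_len}


def classify(transcripts: Transcripts, classifier: Classifier) -> Dict[str, List[str]]:
    """Group transcript ids by the family label a pluggable classifier returns."""
    families: Dict[str, List[str]] = {}
    for tid, seq in sorted(transcripts.items()):
        families.setdefault(classifier(seq), []).append(tid)
    return families


def prefix_classifier(seq: str) -> str:
    """Toy classifier (an assumption): the first three residues name the family."""
    return seq[:3]


def run_all(transcripts: Transcripts,
            classifier: Classifier = prefix_classifier,
            min_len: int = 6) -> Dict[str, List[str]]:
    """Run the modules collectively; each can also be called on its own."""
    return classify(post_process(transcripts, min_len), classifier)


# Made-up input data, for illustration only.
toy = {"t1": "MKVLAAG", "t2": "MKVLSSGT", "t3": "MAT", "t4": "GGHWTYK"}
families = run_all(toy)
print(families)
```

Because the classifier is passed in as an argument, a different classification method can be swapped in without touching the other steps, which is the kind of substitution of dependencies described above.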

With the aid of PlantTribes 2 utilities, I used an integrated phylogenetic, syntenic, and gene divergence approach to unravel ancient whole genome duplication (WGD) events in the Orobanchaceae (order Lamiales) and closely related sister lineages. Even though WGD is prevalent in angiosperms and is considered a key driving force in their rapid radiation and diversification [1-3], it has been unclear whether WGD has played a significant role in the evolution of parasitism, including in the Orobanchaceae. This dissertation identified an ancient polyploidy event in a common ancestor of Orobanchaceae and closely related non-parasitic sister lineages that occurred after their divergence from all other core Lamiales lineages, and confirmed the existence of a previously reported polyploidy event restricted to the core Lamiales [4-8]. Surprisingly, no evidence supports additional WGD events within Orobanchaceae, which suggests that the large diversity within the family could have come from WGD-derived duplicated genes already present in a recent common ancestor of the Orobanchaceae. Orobanchaceae species span the full trophic spectrum of parasitism, including a species-poor autotrophic clade (~12 species) and a species-rich crown group consisting of semi- and full heterotrophs (~2000 species) [9-12]. This pattern agrees with recent studies of ancient polyploidy, which have found significant evidence that major shifts in diversification do not always involve an entire clade [13-17]. The species divergence pattern observed in Orobanchaceae suggests that novel traits and functions do not arise immediately after WGD, but that a lag in time precedes a burst in diversification. Current research provides little insight into the polyploidy and evolutionary history of other parasitic angiosperm lineages, except for the morning glory family (Convolvulaceae), in which a whole genome triplication event predated the divergence of the parasitic lineage Cuscuta from its autotrophic sister lineage, Ipomoea [18]. However, unlike in Orobanchaceae, there has been no investigation to date of whether this polyploidy event played a role in the creation of the specific genes deployed in parasitism in Cuscuta. The research presented in this dissertation is the first attempt to investigate how genome duplication is associated with the origins of parasitism in Orobanchaceae, and it provides a framework that can be used in studies of the other independent parasitic angiosperm lineages to determine whether WGD is a necessary precursor in the evolution of plant parasitism. Future studies should extend these approaches to critically evaluate how many of the 12 independently evolved parasitic angiosperm clades may have similarly depended on genes from ancestral WGD events for the evolutionary origin of key parasitic functions.

Knowledge of lineage-specific changes in gene family sizes and their associated biological functions can provide valuable insights into mechanisms that underlie adaptive organismal differences among closely related species. I explored the role that WGD has played in the evolution of novel traits that led to the transition to parasitism in the Orobanchaceae by examining gene family dynamics in the Striga genome relative to that of its closest sequenced non-parasitic relative, Mimulus guttatus (Erythranthe guttata). Striga is an obligate hemiparasite that is photosynthetically competent but depends on a host for carbon, mineral nutrients, and water. It therefore constitutes an ideal model for studying the evolution of a parasite from a free-living ancestor. The observed gene family changes in Striga reveal an association between WGD and the evolutionary origins of parasitism in Orobanchaceae, and are consistent with a three-phase model of parasite evolution that identifies gene gains and expression shifts associated with the origin of parasitism, losses of redundant gene functions, and gene gains associated with enhancing parasitic ability. Gene losses, which include functions related to photosynthesis, environmental sensing, and leaf developmental processes, disproportionately involve older genes whose functions are complemented by the host. Some of these older genes, specifically those expressed in the novel parasitic organ, the haustorium, are duplicates from the WGD event in the common ancestor of Striga and Mimulus. On the other hand, gene gains, which include functions related to transcription, endocytosis, and intracellular transport (functions associated with further adaptations to a parasitic lifestyle), often arose from the WGD event specific to Orobanchaceae. These findings suggest that the evolutionary transition from autotrophy to heterotrophy is associated with changes in gene functions that are also common to non-parasitic plants.

Future studies will also seek to test and expand our three-phase model of parasite genome evolution by examining genomes and transcriptomes that represent the rich diversity of parasitic dependencies in Orobanchaceae, and to establish how repeatable these patterns are in independent parasitic lineages. The trophic diversity in Orobanchaceae, which includes non-parasitic species as well as hemi- and holoparasites, makes it an ideal group of closely related species in which to investigate the evolution of plant parasitism. A comparative cross-species expression study among Orobanchaceae species could help identify critical functional changes at every transition towards increased parasitic dependency, i.e., from free-living non-parasites to facultative hemiparasites, from facultative hemiparasites to obligate hemiparasites, and from obligate hemiparasites to holoparasites. Parallels are starting to emerge from recent studies of Orobanchaceae parasites and Cuscuta, which have identified similarities in functional gene losses and gains (including gene gains by pervasive horizontal gene transfer) that are likely related to adaptation to a parasitic lifestyle [18-22]. Therefore, leveraging the evolutionary framework presented by Orobanchaceae could provide critical evidence for understanding the progression of parasite evolution from free-living plants, and could also identify potential functional targets for parasite control.

The evolutionary transition from autotrophy to heterotrophy in parasitic plants is an extraordinary example of adaptive plant biology. It is associated with striking morphological adaptations, such as the development of a haustorium, physiological changes that enable the parasite to overcome host defenses and redirect host nutrients to the parasite, and reduced leaf and root architecture in holoparasites. Sequencing of transcriptomes and genomes from diverse parasitic plant lineages will lead to better characterization of the evolution and mechanisms of parasitism. However, the overarching goal of the parasitic plant research community is to reduce the negative impact of weedy parasitic species in the developing world, where they reduce crop yields and threaten world food security. Our studies have identified specific genes with expression patterns and predicted functions that are likely to be crucial to parasitic processes. In doing so, we have identified gene targets for functional analysis in the parasites as well as for targeted interference through enhancement of host defenses or deployment of host-induced gene silencing [9, 10, 23-26]. As more omics data sets become available for diverse parasitic lineages with different host relationships, research progress will accelerate, including in the kind of comparative evolutionary analyses described in this dissertation, and will contribute towards understanding the essential aspects of parasitism that can be targeted by new control strategies.

References

1. Panchy, N., Lehti-Shiu, M.D., Shiu, S.-H. Evolution of gene duplication in plants. Plant Physiol. 171, 2294–2316 (2016).
2. Van de Peer, Y., Mizrachi, E., Marchal, K. The evolutionary significance of polyploidy. Nat. Rev. Genet. 18, 411–424 (2017).
3. Clark, J.W., Donoghue, P.C.J. Constraining the timing of whole genome duplication in plant evolutionary history. Proc. Biol. Sci. 284, 20170912 (2017).
4. Ibarra-Laclette, E., Lyons, E., Hernández-Guzmán, G., Pérez-Torres, C.A., Carretero-Paulet, L., Chang, T.-H., Lan, T., Welch, A.J., Juárez, M.J.A., Simpson, J., Fernández-Cortés, A., Arteaga-Vázquez, M., Góngora-Castillo, E., Acevedo-Hernández, G., Schuster, S.C., Himmelbauer, H., Minoche, A.E., Xu, S., Lynch, M., Oropeza-Aburto, A., Cervantes-Pérez, S.A., de Jesús Ortega-Estrada, M., Cervantes-Luevano, J.I., Michael, T.P., Mockler, T., Bryant, D., Herrera-Estrella, A., Albert, V.A., Herrera-Estrella, L. Architecture and evolution of a minute plant genome. Nature. 498, 94–98 (2013).
5. Wang, L., Yu, S., Tong, C., Zhao, Y., Liu, Y., Song, C., Zhang, Y., Zhang, X., Wang, Y., Hua, W., Li, D., Li, D., Li, F., Yu, J., Xu, C., Han, X., Huang, S., Tai, S., Wang, J., Xu, X., Li, Y., Liu, S., Varshney, R.K., Wang, J., Zhang, X. Genome sequencing of the high oil crop sesame provides insight into oil biosynthesis. Genome Biol. 15, R39 (2014).
6. Ren, R., Wang, H., Guo, C., Zhang, N., Zeng, L., Chen, Y., Ma, H., Qi, J. Widespread whole genome duplications contribute to genome complexity and species diversity in angiosperms. Mol. Plant. 11, 414–428 (2018).
7. Julca, I., Marcet-Houben, M., Vargas, P., Gabaldón, T. Phylogenomics of the olive tree (Olea europaea) reveals the relative contribution of ancient allo- and autopolyploidization events. BMC Biol. 16, 15 (2018).
8. Xu, H., Song, J., Luo, H., Zhang, Y., Li, Q., Zhu, Y., Xu, J., Li, Y., Song, C., Wang, B., Sun, W., Shen, G., Zhang, X., Qian, J., Ji, A., Xu, Z., Luo, X., He, L., Li, C., Sun, C., Yan, H., Cui, G., Li, X., Li, X., Wei, J., Liu, J., Wang, Y., Hayward, A., Nelson, D., Ning, Z., Peters, R.J., Qi, X., Chen, S. Analysis of the genome sequence of the medicinal plant Salvia miltiorrhiza. Mol. Plant. 9, 949–952 (2016).
9. Westwood, J.H., Yoder, J.I., Timko, M.P., dePamphilis, C.W. The evolution of parasitism in plants. Trends Plant Sci. 15, 227–235 (2010).
10. Westwood, J.H., dePamphilis, C.W., Das, M., Fernandez-Aparicio, M., Honaas, L.A., Timko, M.P., Wafula, E.K., Wickett, N.J., Yoder, J.I. The Parasitic Plant Genome Project: New tools for understanding the biology of Orobanche and Striga. Weed Science. 60, 295–306 (2012).
11. Naumann, J., Salomo, K., Der, J.P., Wafula, E.K., Bolin, J.F., Maass, E., Frenzke, L., Samain, M.-S., Neinhuis, C., dePamphilis, C.W., Wanke, S. Single-copy nuclear genes place haustorial Hydnoraceae within Piperales and reveal a Cretaceous origin of multiple parasitic angiosperm lineages. PLoS ONE. 8, e79204 (2013).
12. Wicke, S. Genomic evolution in Orobanchaceae. In: Parasitic Orobanchaceae. pp. 267–286. Springer, Berlin, Heidelberg (2013).
13. Edger, P.P., Heidel-Fischer, H.M., Bekaert, M., Rota, J., Glöckner, G., Platts, A.E., Heckel, D.G., Der, J.P., Wafula, E.K., Tang, M., Hofberger, J.A., Smithson, A., Hall, J.C., Blanchette, M., Bureau, T.E., Wright, S.I., dePamphilis, C.W., Schranz, M.E., Barker, M.S., Conant, G.C., Wahlberg, N., Vogel, H., Pires, J.C., Wheat, C.W. The butterfly plant arms-race escalated by gene and genome duplications. Proc. Natl. Acad. Sci. U.S.A. 112, 8362–8366 (2015).
14. Soltis, D.E., Albert, V.A., Leebens-Mack, J., Bell, C.D., Paterson, A.H., Zheng, C., Sankoff, D., dePamphilis, C.W., Wall, P.K., Soltis, P.S. Polyploidy and angiosperm diversification. Am. J. Bot. 96, 336–348 (2009).
15. Schranz, M.E., Mohammadin, S., Edger, P.P. Ancient whole genome duplications, novelty and diversification: the WGD Radiation Lag-Time Model. Curr. Opin. Plant Biol. 15, 147–153 (2012).
16. Tank, D.C., Eastman, J.M., Pennell, M.W., Soltis, P.S., Soltis, D.E., Hinchliff, C.E., Brown, J.W., Sessa, E.B., Harmon, L.J. Nested radiations and the pulse of angiosperm diversification: increased diversification rates often follow whole genome duplications. New Phytol. 207, 454–467 (2015).
17. Macqueen, D.J., Johnston, I.A. A well-constrained estimate for the timing of the salmonid whole genome duplication reveals major decoupling from species diversification. Proc. Biol. Sci. 281, 20132881 (2014).
18. Sun, G., Xu, Y., Liu, H., Sun, T., Zhang, J., Hettenhausen, C., Shen, G., Qi, J., Qin, Y., Li, J., Wang, L., Chang, W., Guo, Z., Baldwin, I.T., Wu, J. Large-scale gene losses underlie the genome evolution of parasitic plant Cuscuta australis. Nat. Commun. 9, 2683 (2018).
19. Yoshida, S., Kim, S., Wafula, E.K., Tanskanen, J., Kim, Y.-M., Honaas, L., Yang, Z., Spallek, T., Conn, C.E., Ichihashi, Y., Cheong, K., Cui, S., Der, J.P., Gundlach, H., Jiao, Y., Hori, C., Ishida, J.K., Kasahara, H., Kiba, T., Kim, M.-S., Koo, N., Laohavisit, A., Lee, Y.-H., Lumba, S., McCourt, P., Mortimer, J.C., Mutuku, J.M., Nomura, T., Sasaki-Sekimoto, Y., Seto, Y., Wang, Y., Wakatake, T., Sakakibara, H., Demura, T., Yamaguchi, S., Yoneyama, K., Manabe, R.-I., Nelson, D.C., Schulman, A.H., Timko, M.P., dePamphilis, C.W., Choi, D., Shirasu, K. Genome sequence of Striga asiatica provides insight into the evolution of plant parasitism. Curr. Biol. 29, 3041–3052 (2019).
20. Vogel, A., Schwacke, R., Denton, A.K., Usadel, B., Hollmann, J., Fischer, K., Bolger, A., Schmidt, M.H.W., Bolger, M.E., Gundlach, H., Mayer, K.F.X., Weiss-Schneeweiss, H., Temsch, E.M., Krause, K. Footprints of parasitism in the genome of the parasitic flowering plant Cuscuta campestris. Nat. Commun. 9, 2515 (2018).
21. Yang, Z., Zhang, Y., Wafula, E.K., Honaas, L.A., Ralph, P.E., Jones, S., Clarke, C.R., Liu, S., Su, C., Zhang, H., Altman, N.S., Schuster, S.C., Timko, M.P., Yoder, J.I., Westwood, J.H., dePamphilis, C.W. Horizontal gene transfer is more frequent with increased heterotrophy and contributes to parasite adaptation. Proc. Natl. Acad. Sci. U.S.A. 113, E7010–E7019 (2016).
22. Yang, Z., Wafula, E.K., Kim, G., Shahid, S., McNeal, J.R., Ralph, P.E., Timilsena, P.R., Yu, W.-B., Kelly, E.A., Zhang, H., Person, T.N., Altman, N.S., Axtell, M.J., Westwood, J.H., dePamphilis, C.W. Convergent horizontal gene transfer and cross-talk of mobile nucleic acids in parasitic plants. Nat. Plants. 5, 991–1001 (2019).
23. Alakonya, A., Kumar, R., Koenig, D., Kimura, S., Townsley, B., Runo, S., Garces, H.M., Kang, J., Yanez, A., David-Schwartz, R., Machuka, J., Sinha, N. Interspecific RNA interference of SHOOT MERISTEMLESS-like disrupts Cuscuta pentagona plant parasitism. Plant Cell. 24, 3153–3166 (2012).
24. Yoder, J.I., Gunathilake, P., Wu, B., Tomilova, N., Tomilov, A.A. Engineering host resistance against parasitic weeds with RNA interference. Pest Manag. Sci. 65, 460–466 (2009).
25. Tomilov, A.A., Tomilova, N.B., Wroblewski, T., Michelmore, R., Yoder, J.I. Trans-specific gene silencing between host and parasitic plants. The Plant Journal. 56, 389–397 (2008).
26. Aly, R., Cholakh, H., Joel, D.M., Leibman, D., Steinitz, B., Zelcer, A., Naglis, A., Yarden, O., Gal-On, A. Gene silencing of mannose 6-phosphate reductase in the parasitic weed Orobanche aegyptiaca through the production of homologous dsRNA sequences in the host plant. Plant Biotechnology Journal. 7, 487–498 (2009).

Appendix

Additional material for the Striga asiatica genome

(Published in Current Biology, Yoshida et al. 2019)

A. Main genome paper: https://ars.els-cdn.com/content/image/1-s2.0-S0960982219310103-mmc5.pdf

B. Main supplementary document: https://ars.els-cdn.com/content/image/1-s2.0-S0960982219310103-mmc3.pdf

C. Supplementary Excel tables, including whole genome duplication (WGD) and ancestral gene family reconstruction: https://ars.els-cdn.com/content/image/1-s2.0-S0960982219310103-mmc2.xlsx

D. Additional supplementary figures: https://ars.els-cdn.com/content/image/1-s2.0-S0960982219310103-mmc1.pdf

VITA

Eric Kenneth Wafula

Professional Experience

Bioinformatics Programmer Analyst June 2009 – Present
The Pennsylvania State University, University Park, PA
• Develop comparative genomics computational analysis pipelines, tools, and methods to gain novel insight into the functional and evolutionary history of plant genomes, gene families, and the tree of life
• Apply existing and/or new methods to large-scale next-generation sequencing datasets
• Develop, administer, set up, use, maintain, modify, and/or troubleshoot plant genome analysis pipelines and databases for sequence information
• Participate in scientific presentations and manuscript/grant preparation
• Provide computational support to project collaborators
• Perform bioinformatics training of REU undergraduates, graduate students, and research scholars

Education

Ph.D. in Biology - Molecular Evolutionary Biology option Dec 2019 The Pennsylvania State University, University Park, PA

M.S. in Bioinformatics May 2008 Johns Hopkins University, Baltimore, MD

B.S. in Computer Science May 2005 University of the District of Columbia, Washington, DC

Selected Publications

* denotes co-authorship or significant contribution, i.e., performed most of the bioinformatics analyses and/or wrote significant sections of the paper (additional publications available on Google Scholar).

Yoshida, Satoko, Seungill Kim, Eric K. Wafula*, Jaakko Tanskanen, Yong-Min Kim, Loren Honaas, Zhenzhen Yang et al. "Genome sequence of Striga asiatica provides insight into the evolution of plant parasitism." Current Biology (2019).

Yang, Zhenzhen, Eric K. Wafula*, Gunjune Kim, Saima Shahid, Joel R. McNeal, Paula E. Ralph, Prakash R. Timilsena et al. "Convergent horizontal gene transfer and cross-talk of mobile nucleic acids in parasitic plants." Nature Plants (2019).

Honaas, Loren A., Eric K. Wafula*, Norman J. Wickett, Joshua P. Der, Yeting Zhang, Patrick P. Edger, Naomi S. Altman, J. Chris Pires, and James H. Leebens-Mack. "Selecting superior de novo transcriptome assemblies: lessons learned by leveraging the best plant genome." PLoS One (2016).