ALTERNATIVE SPLICING IN BASAL ANGIOSPERMS AND TRAGOPOGON (ASTERACEAE)

By

XIAOXIAN LIU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2018

© 2018 Xiaoxian Liu

To my Husband and Parents

ACKNOWLEDGMENTS

I would like to express my most sincere gratitude to my advisors, Drs. Doug Soltis and Brad Barbazuk. Their patience, encouragement, and generosity have all been indispensable for my research. Dr. Pam Soltis has also been a true mentor who offered tremendous help in all stages of my dissertation research. I want to thank my other committee member, Dr. Mark Settles, for his consistent support and guidance during my study.

I am extremely grateful to my fellow graduate students and postdocs for helping me with my research, inspiring me to be a better scholar, and providing a lovely environment in which to learn and grow. Members of the Soltis and Barbazuk labs offered valuable suggestions on experimental design, data interpretation, and verbal and oral presentation. I want to give special thanks to Shengchen Shan, Qinyin Ling, Nathan Catlin, and Drs. Andre Chanderbali, Mi-jeong Yoo, Miao Sun, Blaine Marchant, Srikar Chamala, Lucas Boatwright, and Wenbin Mei.

My work would not have been possible without the hard work of several laboratory managers and assistants: Dr. Matt Gitzendanner, Dr. Evgeny Mavrodiev, and Ruth Davenport. I also owe thanks to Mallory St Clair for her help with laboratory work.

I thank Dr. Patrick Schnable of Iowa State University for his help with the 10X Genomics approach, and Dr. Jim Leebens-Mack of the University of Georgia for sharing data.

I am extremely grateful for the financial support from the National Science Foundation, including a Dissertation Improvement Grant, and for research awards from the Botanical Society of America and The American Society of Taxonomists. I am also grateful to NSF and the Botanical Society of America for their Travel Award to attend the International Botanical Congress.

And last, but certainly not least, I wish to thank my husband and my family for their unconditional love and support. To them I dedicate this dissertation.


TABLE OF CONTENTS

page

ACKNOWLEDGMENTS ...... 4

LIST OF TABLES ...... 10

LIST OF FIGURES ...... 12

ABSTRACT ...... 14

CHAPTER

1 INTRODUCTION ...... 16

Whole-Genome Duplication and Plant Evolution ...... 16

Gene Expression Changes after Polyploidy ...... 17

RNA Alternative Splicing ...... 18

Alternative Splicing After Whole-Genome Duplication ...... 19

Tragopogon as a Model System for Studying Polyploidy ...... 20

2 DETECTION ALTERNATIVE SPLICING WITHOUT REFERENCE GENOME ...... 27

Introduction ...... 27

Materials and Methods ...... 30

Amborella Genome, Annotation, and RNA-Seq Data ...... 30

Plant Materials and RNA Sample Preparation ...... 31

cDNA Synthesis, Library Construction, and Sequencing ...... 31

Data Collection and Error Correction ...... 31

Evaluation of Read Quality ...... 32

Reference-Based Transcriptome Assembly and Alternative Splicing Detection ...... 33

Comparisons between Iso-Seq™ Assemblies and Amborella Gene Models ...... 33

De novo Detection and Validation of Alternative Splicing Events ...... 33

Optimization of de novo Pipeline ...... 35

Results ...... 35

General Properties of ROIs in Iso-Seq™ ...... 35

Error Correction by ICE-Quiver and Illumina Reads ...... 37

Mapping Full-Length ROIs and Error-Corrected Reads to the Amborella Genome ...... 38

Alternative Splicing Analysis Using Program to Assemble Spliced Alignments (PASA) ...... 39

Improvements of Current Gene Models by Isoform-Level Comparison ...... 40

De novo Detection of AS Events in Iso-Seq™ Data and Comparison with PASA Results ...... 41

AS Primer Design and RT-PCR Validation ...... 42

Optimization of the de novo Pipeline Based on Decision Tree Analysis ...... 43

Discussion ...... 43

Pretreatment Strategies for Iso-Seq™ Raw Data ...... 43

Advantages of Iso-Seq™ in Reference-Based Transcriptome Analysis ...... 46

Iso-Seq™ and AS de novo Detection Pipeline ...... 48

3 CONSERVED ALTERNATIVE SPLICING IN BASAL ANGIOSPERMS ...... 61

Introduction ...... 61

Materials and Methods ...... 64

Nymphaea Genome, Annotation, and Transcriptome Data ...... 64

Nymphaea Iso-Seq™ Data QC and Error Correction ...... 64

Nymphaea RNA-Seq Data Assembly ...... 64

Alternative Splicing Analysis in Nymphaea ...... 65

Reanalysis of Amborella Alternative Splicing Events ...... 66

Gene Orthology Analysis ...... 67

Conserved Alternative Splicing in Basal Angiosperms, Eudicots, and Monocots ...... 67

Functional Annotation of Nymphaea Protein-Coding Genes ...... 68

GO Term Enrichment ...... 69

Principal Component Analysis (PCA) of Shared AS Events in Eight Angiosperms ...... 69

Phylogeny and Divergence Time Estimation ...... 69

Results ...... 69

Alternative Splicing in Nymphaea ...... 69

Alternative Splicing in Amborella ...... 70

Shared AS Events in Angiosperms ...... 71

Conserved AS between Basal Angiosperms: Nymphaea and Amborella ...... 72

Highly Conserved AS between Basal Angiosperms, Monocots and Eudicots ...... 73

GO Annotation and Enrichment Analysis of the Conserved or Shared AS Events ...... 74

Discussion ...... 75

Iso-Seq™ or RNA-Seq for AS Analysis ...... 75

Different AS Frequencies in Two Basal Angiosperms ...... 76

Dynamic AS Changes during Angiosperm Evolution ...... 77

Conserved AS in Basal Angiosperms ...... 78

Shared AS Events in Aquatic Species (Habitat Effects) ...... 79

4 DE NOVO ALTERNATIVE SPLICING DETECTION IN TRAGOPOGON ...... 98

Introduction ...... 98

Methods and Materials ...... 100

RNA Sample Extraction and Iso-Seq™ Sequencing ...... 100

Tragopogon dubius Iso-Seq™ Data Processing ...... 101

Tragopogon dubius AS Candidates from de novo Detection ...... 101

Population-Level RT-PCR Validation ...... 101

Results ...... 102

Tragopogon dubius Iso-Seq™ Dataset ...... 102

Primer Design and RT-PCR Validation of AS Candidate Genes ...... 102

Population-Level RT-PCR Validation in T. dubius ...... 103

Discussion ...... 104

5 TRAGOPOGON DUBIUS DRAFT GENOME ASSEMBLY AND ANNOTATION AND A SURVEY OF ALTERNATIVE SPLICING ...... 116

Introduction ...... 116

Materials and Methods ...... 118

DNA Sample Collection and Sequencing Strategy ...... 118

Raw Data Trimming ...... 118

Removal of Organellar Contaminants from Genomic Data ...... 119

Estimation of Genome Size of T. dubius ...... 119

Genome Assembly and Evaluation ...... 119

RNA Isolation and Iso-Seq™ Sequencing for Genome Annotation ...... 120

Repetitive Element Annotation ...... 121

Genome Annotation ...... 121

Orthology Analysis ...... 123

Functional Annotation ...... 124

Alternative Splicing Analysis ...... 124

Results ...... 124

Tragopogon dubius Linked-read Sequencing and Removal of Organellar Contaminants ...... 124

Genome Size of T. dubius ...... 124

Tragopogon dubius Genome Assembly and Evaluation ...... 125

Repetitive Element Annotation ...... 125

Genome Annotation ...... 126

Orthology Analysis ...... 126

Gene Family Analysis ...... 126

Tragopogon dubius Alternative Splicing Behavior ...... 127

Discussion ...... 127

Interpreting the Completeness of the T. dubius Draft Genome Assembly and Annotation ...... 127

Implications of the Results of Repetitive Elements Components on Sequencing Platform ...... 129

Orthogroups as Resources for Future Studies ...... 130

Cycloidea/Teosinte Branched1 (CYC/TB1) Gene Family in T. dubius ...... 130

Latex/Rubber-Related Genes in T. dubius ...... 131

Tragopogon dubius Leaf Alternative Splicing ...... 132

Cost-Efficiency of de novo Genome Assembly ...... 133

6 CONCLUSION ...... 141

APPENDIX

A SUPPLEMENTAL MATERIALS FOR CHAPTER 2 ...... 146

Materials and Methods ...... 146

cDNA Synthesis, Library Construction, and Sequencing ...... 146

Data Collection and Error Correction Using SMRT Analysis Software ...... 146

AS de novo Detection Pipeline ...... 147

B SUPPLEMENTAL MATERIALS FOR CHAPTER 5 ...... 159

LIST OF REFERENCES ...... 165

BIOGRAPHICAL SKETCH ...... 180


LIST OF TABLES

Table page

2-1 General properties of zero-full-passes ROIs and ICE consensus isoforms using Iso-Seq™ ...... 51

2-2 High-confidence mappings by GMAP with different identity/length filters...... 52

2-3 AS events detected by the hybrid dataset...... 53

2-4 Gene model corrected by Iso-Seq™ approach ...... 54

2-5 Summary of validation of de novo inferred AS events...... 55

3-1 Amborella RNA-Seq data...... 81

3-2 Nine species for orthology analysis...... 82

3-3 Eight species for conserved AS analysis...... 83

3-4 AS events in Nymphaea...... 84

3-5 The performance of Iso-Seq™ and RNA-Seq in AS detection in Nymphaea ...... 85

3-6 AS events in Amborella...... 86

3-7 The performance of Iso-Seq™ and RNA-Seq in AS detection in Amborella ...... 87

3-8 Shared AS between eight tested angiosperms...... 88

3-9 Conserved AS between A. trichopoda and N. caerulea...... 89

3-10 Genes with AS events that are highly conserved in eight angiosperms...... 90

4-1 Ten T. dubius populations used in this study...... 107

4-2 Five confirmed T. dubius AS events...... 108

5-1 Statistics of the T. dubius genome assembly...... 134

5-2 Classification of the predicted repetitive elements in the T. dubius genome. ... 135

5-3 Comparison between the predicted gene sets of T. dubius and L. sativa...... 136

5-4 Alternative splicing in T. dubius leaf tissue...... 137

A-1 Number of reads collected using different minimum number of full passes...... 148


A-2 Number of reads generated from ICE and LSC after error correction...... 149

A-3 Summary of the number of AS events detected by three datasets...... 150

A-4 Comparison of detected AS events between zero-full-passes ROIs and LSC- corrected zero-full-passes ROIs...... 151

A-5 Genes using RT-PCR validation...... 152

B-1 The TCP protein family in Tragopogon...... 159

B-2 CYC/TB1 gene family...... 161

B-3 The REF protein family in Tragopogon...... 162

B-4 The Bet_v_1 protein family in Tragopogon...... 163


LIST OF FIGURES

Figure page

1-1 The “wondrous cycles” of polyploidy in plants ...... 24

1-2 Diagram of AS types and resulting proteomic diversity...... 25

1-3 The Tragopogon triangle...... 26

2-1 De novo detection of alternative splicing...... 56

2-2 Length distribution between zero-full-pass flncROIs and Amborella gene models...... 57

2-3 PacBio Iso-Seq™ “well-mapped reads”...... 58

2-4 Example of revised gene models after inclusion of Iso-Seq™ data...... 59

2-5 CHAID decision tree analysis...... 60

3-1 The taxonomic distribution of the 22,764 shared AS events...... 92

3-2 Top 30 over-represented GO - Biological Process...... 93

3-3 Top 29 over-represented GO - Cellular Component...... 94

3-4 Top 30 over-represented GO - Molecular Function...... 95

3-5 Phylogenetic distribution of conserved AS...... 96

3-6 Over-represented GO in N. caerulea genes that have aquatic-shared AS events - Biological Process...... 97

4-1 The modified AS de novo detection pipeline...... 109

4-2 Size distribution of T. dubius leaf Iso-Seq™ ROIs...... 110

4-3 RT-PCR results of Td9...... 111

4-4 Manual checking and primer design, RT-PCR validation and ORF prediction of Td27...... 112

4-5 Manual checking and primer design, RT-PCR validation and ORF prediction of Td30...... 113

4-6 Manual checking and primer design, RT-PCR validation and ORF prediction of Td31...... 114


4-7 Manual checking and primer design, RT-PCR validation and ORF prediction of Td32...... 115

5-1 K-mer frequency analysis of the T. dubius genome...... 138

5-2 Genome components of T. dubius...... 139

5-3 BUSCO results of the T. dubius predicted gene set...... 140

A-1 AS de novo detection pipeline flowchart...... 155

A-2 Length distribution of all zero-full-passes ROIs and full-length ROIs...... 156

A-3 Percentage of full-length ROIs out of all zero-full-passes ROIs in all size ranges and tissues...... 157

A-4 Pie chart of zero-full-passes ROIs of three categories...... 158


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ALTERNATIVE SPLICING IN BASAL ANGIOSPERMS AND TRAGOPOGON (ASTERACEAE)

By

Xiaoxian Liu

December 2018

Chair: Douglas Edward Soltis

Cochair: William Bradley Barbazuk

Major: Botany

Polyploidy, or whole-genome duplication (WGD), is a widespread phenomenon throughout eukaryotes and has long been considered an important speciation mechanism in plants. Alternative splicing (AS) is a major source of transcript and proteome diversity. My dissertation focuses on interpreting the evolutionary impacts of WGD on AS. I first developed a system for detecting AS events without a reference genome: a de novo pipeline that captures potential AS events from Amborella trichopoda full-length transcriptome data (PacBio Iso-Seq™). To validate the de novo pipeline, I compared its results with reference-based AS detection using the Amborella genome. The pipeline showed a 63.5% overall success rate in identifying AS events.

Second, I investigated and compared the genome-scale conserved AS events between the basal angiosperms A. trichopoda and Nymphaea caerulea, as well as four eudicots and two monocots, in order to explore the evolutionary implications of these conserved AS events. The results suggest that genes with conserved AS are concentrated in certain specific functions, and that the gain or loss of AS events is likely related to evolutionary lineage. Besides these evolutionary implications, environmental factors, such as an aquatic habitat, may also affect the AS pattern.

Finally, I applied the de novo detection pipeline to Tragopogon dubius (Asteraceae). According to RT-PCR validation, the success rate was low (approximately 33%), suggesting that genome-wide AS analysis in plant species with complex genomes requires a reference genome. Thus, I developed a draft genome assembly and annotation for T. dubius, the shared diploid parent of two allotetraploid species.

Using linked-read sequencing (10X Chromium), I performed genome assembly and annotation, functional annotation, AS analysis in leaf tissue, and analysis of gene families that may be of interest in Tragopogon and Asteraceae. The draft assembly was 808.32 Mb, covering approximately 40% of the genome, and had an N50 scaffold size of 0.11 Mb and an N50 contig size of 16.60 kb. Approximately 80% of gene space was captured in this draft genome assembly, and 30,325 protein-coding genes were annotated. The Tragopogon dubius genome forms the foundation for analyses of AS in polyploids relative to their diploid parents.


CHAPTER 1 INTRODUCTION

Whole-Genome Duplication and Plant Evolution

Polyploidy, or whole-genome duplication (WGD), is a widespread phenomenon throughout eukaryotes, including fungi, animals, and plants (e.g. Kellis et al., 2004; Gordon et al., 2009; Moghadam et al., 2009; Jose and Dufresne, 2010; Adolfsson et al., 2010; Soltis and Soltis, 2012; Husband et al., 2013). Although long considered an important speciation mechanism in plants, especially angiosperms, polyploidy is now recognized as being even more common than previously thought (Soltis et al., 2009a; Van de Peer, 2009). Recent genomic studies indicate that all angiosperms have experienced at least one round of WGD (Jiao et al., 2011; Amborella Genome Project, 2013).

The evolutionary impacts of WGD are vast and include gene loss, chromosome rearrangement, differential gene expression, gene functional divergence, and phenotypic diversity (Chen, 2007; Otto, 2007; Gaeta et al., 2007, 2009; Doyle et al., 2008; Flagel et al., 2008, 2009; Gaeta and Pires, 2010; Xiong and Pires, 2011; Xiong et al., 2011; Grover et al., 2012; Madlung and Wendel, 2013; Yoo et al., 2013; Conant et al., 2014; Soltis et al., 2014a, 2015; Hovav et al., 2015; Hu et al., 2015; Renny-Byfield et al., 2015; Wendel, 2015; Wendel and Grover, 2015; Wendel et al., 2016; Gallagher et al., 2016; Soltis and Soltis, 2016). Interestingly, all of the above-noted genomic and genetic consequences of WGD are also primary drivers of plant evolution and speciation. For instance, the rise and eventual dominance of seed plants and angiosperms, respectively, have been attributed to innovations contributed by the ancestral whole-genome duplications that occurred in the ancestor of each clade (Jiao et al., 2011).

Most lineages show repeated patterns of WGD in a cycle of WGD, diploidization, WGD, etc. (e.g. Wendel, 2015; Soltis et al., 2016). Wendel (2015) proposed a model called the “wondrous cycle” to explain the processes and patterns of polyploidy in plants (Figure 1-1, modified from Wendel, 2015, Figure 1). After WGD, a polyploid species contains two homoeologous genomes inherited from its diploid parents. In the short term, responses to polyploidy can include gene loss, gene silencing, homoeologous recombination, biased homoeologous gene expression, subfunctionalization, and neofunctionalization. Subsequently, diploidization occurs through genome fractionation, genome downsizing, and chromosome reduction, and another WGD may occur. These processes typically are cyclical, occurring repeatedly on timescales of thousands to millions of years.

Gene Expression Changes after Polyploidy

There are several types of polyploidy. During polyploidization, duplicated chromosome sets can come from closely related individuals of the same species (autopolyploidy) or from hybridization involving two species of the same genus (allopolyploidy) (Stebbins, 1947; Grant, 1975). Allopolyploids, which result from interspecific hybridization and genome doubling (Stebbins, 1950), are expected to contain, at the time of formation, duplicated gene copies from both parents, termed homoeologs. However, allotetraploids rarely exhibit simple additivity of their parental genomes or phenotypes and typically express genetic, morphological, and phenotypic novelty (e.g. Soltis et al., 2014b; Soltis and Soltis, 2016). Extensive changes in gene expression have been observed in many polyploid species relative to their diploid progenitors (Lynch and Conery, 2000; Flagel and Wendel, 2010; Wendel et al., 2012; Roulin et al., 2013; Yoo et al., 2014; Wendel, 2015; Wendel et al., 2016).

Although we now recognize that polyploid genomes are dynamic, with shifts in both content and regulation relative to their diploid progenitors, it is unclear whether or not such modifications are sufficient to generate the phenotypic novelty that characterizes most polyploids and has allowed many polyploids to occupy new habitats (e.g. Ehrendorfer, 1980; Brochmann et al., 2004; Soltis et al., 2014a; Soltis and Soltis, 2016; Marchant et al., 2016). Given that most allopolyploids are genetically depauperate for segregating variation, their persistence and ‘success’ are often attributed to genetic variation contributed by divergent parental genomes.

RNA Alternative Splicing

Following transcription of DNA sequences, introns are spliced out and exons are joined prior to translation into proteins. A frequent splicing behavior in both plants and animals is RNA alternative splicing (AS). Through AS, multiple forms (isoforms) of a transcript can be generated from a single gene by the regulated selection of splice sites (Figure 1-2; Barbazuk et al., 2008; Syed et al., 2012; Reddy et al., 2013). AS is therefore a post-transcriptional means of modulating protein levels and can promote functional innovation and proteomic diversity, even when genetic variation is limited. AS can influence gene expression and ultimately proteomic novelty on two levels: (i) AS creates multiple forms of mRNA from a single gene, leading to multiple protein isoforms; and (ii) AS can modulate mRNA stability and translation through nonsense-mediated decay (NMD) and miRNA regulation (Syed et al., 2012; Reddy et al., 2013).
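As an illustrative aside, the four AS types shown in Figure 1-2 (intron retention, IR; exon skipping, ES; alternative donor, AD; alternative acceptor, AA) can be distinguished by comparing the exon coordinates of two isoforms of the same gene. The following Python sketch is a minimal example under stated assumptions and is not the detection method used in this dissertation: coordinates are 1-based and inclusive, both isoforms lie on the plus strand (so the donor is the 5' boundary of an intron and the acceptor the 3' boundary), and the function names `introns`, `holds_exon`, and `classify_as` are hypothetical.

```python
from typing import List, Set, Tuple

Region = Tuple[int, int]  # (start, end), 1-based inclusive, plus strand (assumed)


def introns(exons: List[Region]) -> List[Region]:
    """Introns implied by consecutive exons of one isoform."""
    ordered = sorted(exons)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])]


def holds_exon(intron: Region, exons: List[Region]) -> bool:
    """True if any of the given exons lies entirely within the intron."""
    return any(intron[0] <= s and e <= intron[1] for s, e in exons)


def classify_as(iso_a: List[Region], iso_b: List[Region]) -> Set[str]:
    """Return the set of AS types that distinguish two isoforms (illustrative)."""
    events: Set[str] = set()
    for this, other in [(iso_a, iso_b), (iso_b, iso_a)]:
        # IR: an intron of one isoform sits strictly inside an exon of the other.
        for i_start, i_end in introns(this):
            if any(s < i_start and i_end < e for s, e in other):
                events.add("IR")
        # ES: an internal exon of one isoform sits inside an intron of the other.
        for exon in sorted(this)[1:-1]:
            if any(holds_exon(i, [exon]) for i in introns(other)):
                events.add("ES")
    # AD / AA: two introns share exactly one boundary. Pairs in which either
    # intron spans a whole exon of the other isoform are exon skipping, so skip them.
    for ia in introns(iso_a):
        for ib in introns(iso_b):
            if ia == ib or holds_exon(ia, iso_b) or holds_exon(ib, iso_a):
                continue
            if ia[1] == ib[1] and ia[0] != ib[0]:
                events.add("AD")  # same acceptor, different donor
            elif ia[0] == ib[0] and ia[1] != ib[1]:
                events.add("AA")  # same donor, different acceptor
    return events


if __name__ == "__main__":
    reference = [(1, 100), (201, 300), (401, 500)]
    skipped = [(1, 100), (401, 500)]
    print(classify_as(reference, skipped))
```

For minus-strand genes the donor and acceptor labels would be swapped, and real data would also require tolerance for imprecise transcript ends; both are omitted here for brevity.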

In plants, the number of isoforms generated via AS is high, based on the study of a few model systems. Genome-wide AS frequency was estimated as 61% of all multi-exonic loci in the well-studied model Arabidopsis thaliana (Marquez et al., 2012). Studies of AS behavior in other plant species (crops and model systems), including Oryza sativa (Zhang et al., 2010), Zea mays (Li et al., 2010), Solanum tuberosum (Lozano et al., 2012), Physcomitrella patens (Wu et al., 2014), Brachypodium distachyon (Walters et al., 2013), Gossypium raimondii (Li et al., 2014), Glycine max (Sagasti et al., 2014), and Vigna radiata (Satyawan et al., 2017), show similarly frequent AS.
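To make the notion of "AS frequency among multi-exonic loci" concrete, the short Python sketch below computes such a fraction from a toy annotation. It is only an illustration under simplifying assumptions, not the procedure of the cited studies: a locus is counted as alternatively spliced when its isoforms have more than one distinct intron chain, and the names `intron_chain` and `as_frequency` and the toy loci are hypothetical.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]      # (start, end), 1-based inclusive
Isoform = List[Exon]


def intron_chain(isoform: Isoform) -> Tuple[Tuple[int, int], ...]:
    """Ordered introns of an isoform; empty for a single-exon transcript."""
    ordered = sorted(isoform)
    return tuple((a_end + 1, b_start - 1)
                 for (_, a_end), (b_start, _) in zip(ordered, ordered[1:]))


def as_frequency(loci: Dict[str, List[Isoform]]) -> float:
    """Fraction of multi-exonic loci with more than one distinct intron chain."""
    # A locus is multi-exonic if at least one of its isoforms has two or more exons.
    multi_exonic = [isos for isos in loci.values()
                    if any(len(iso) > 1 for iso in isos)]
    if not multi_exonic:
        return 0.0
    with_as = sum(1 for isos in multi_exonic
                  if len({intron_chain(iso) for iso in isos}) > 1)
    return with_as / len(multi_exonic)


if __name__ == "__main__":
    toy_loci = {
        "geneA": [[(1, 100), (201, 300)], [(1, 300)]],  # intron retained in 2nd isoform
        "geneB": [[(1, 100), (201, 300)]],              # one isoform, no AS
        "geneC": [[(1, 500)]],                          # single-exon locus, excluded
    }
    print(as_frequency(toy_loci))
```

Because single-exon loci are excluded from the denominator, the reported fraction describes only genes that can be spliced at all, which matches the way the 61% figure above is phrased.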

Alternative Splicing After Whole-Genome Duplication

Despite the importance of both AS and WGD in eukaryotic evolution, surprisingly little is known about the relationship between WGD and AS. Some clues come from limited studies of gene duplication and AS (Su et al., 2006; Roux and Robinson-Rechavi, 2011). Early studies suggested a negative correlation between AS and gene duplication (Su et al., 2006; Talavera et al., 2007; Irimia et al., 2007); that is, duplicate genes undergo less AS than single-copy genes. This hypothesis is consistent with the “function-sharing” model of functional divergence of duplicated genes (Hughes, 1994), i.e., subfunctionalization (Lynch and Conery, 2000), which suggests that given an ancestral gene with two functions through AS (e.g. one gene, two isoforms, each with a different function), one gene copy will keep one function after duplication, and the other copy the other function. In contrast, later investigations suggested a positive correlation between gene duplication and AS. Higher AS frequency was observed in small gene families compared to singletons, suggesting that gene duplication allows for more AS events to occur (Jin et al., 2008; Chen et al., 2011; Roux and Robinson-Rechavi, 2011). Roux and Robinson-Rechavi (2011) therefore suggested that genes progressively gain new splice variants with duplication and time, i.e. the “age-dependent gain of AS” hypothesis.

These results for individual genes and gene families indicate that gene duplication has an important impact on AS, but how AS patterns change after genome-wide duplication is unclear. Very few studies have examined the impact of WGD on AS, and those conducted have yielded inconsistent results (Terashima and Takumi, 2009; Zhou et al., 2011; Saminathan et al., 2014). Terashima and Takumi (2009) suggested that WGD may inhibit the efficiency of AS of the WDREB2 locus in hexaploid wheat. In contrast, Saminathan et al. (2014) confirmed that 22 genes experienced AS events across tissues in tetraploid watermelon (Citrullus lanatus var. lanatus). In the crop allotetraploid Brassica napus (which does not occur in nature), 16 of 82 AS events showed altered AS patterns relative to its diploid parents. Of these 16 events, 15 were losses of AS in one homoeolog, and one was a gain of a new AS isoform in both homoeologs. This indicates that AS patterns can change rapidly after polyploid formation (Zhou et al., 2011).
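The proportions implied by the Brassica napus counts cited above can be checked with a few lines of arithmetic. The Python snippet below simply restates those counts; the variable names are illustrative, and no data beyond the cited numbers are assumed.

```python
# Counts for Brassica napus as cited in the text (Zhou et al., 2011).
total_events = 82   # AS events examined
altered = 16        # events with altered AS patterns relative to the diploid parents
losses = 15         # losses of AS in one homoeolog
gains = 1           # gain of a new AS isoform in both homoeologs

# The two categories of change should account for every altered event.
assert losses + gains == altered

frac_altered = altered / total_events   # share of examined events that changed
frac_losses = losses / altered          # share of changes that were losses

print(f"Altered events: {frac_altered:.1%} of {total_events}")
print(f"Losses among altered events: {frac_losses:.1%}")
```

In other words, roughly one in five of the examined events changed, and nearly all of those changes were losses rather than gains.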

Tragopogon as a Model System for Studying Polyploidy

Tragopogon (~150 species) includes two recently and repeatedly formed, naturally occurring allotetraploids, T. mirus and T. miscellus, whose parents are T. dubius and T. porrifolius, and T. dubius and T. pratensis, respectively (Figure 1-3). The polyploids each formed multiple times, with at least 10 separate formations of each species in different small towns in eastern Washington and adjacent Idaho, where the diploid parents came into contact during the past 90 years (45 generations in these biennials) after introduction from Europe (Ownbey, 1950; Symonds et al., 2010; Soltis et al., 2012). These multiple origins represent independent evolutionary lineages and are an important source of genetic, genomic, chromosomal, and morphological variation in these species. Some of these morphological differences reflect underlying chromosomal and genetic differences between populations of the parental species (legacy effects), and others are novel changes.

Building on the early studies of Tragopogon by Roose and Gottlieb (1976, 1980) and the Soltis lab (Soltis and Soltis, 1989, 1991; Soltis et al., 1995; Cook et al., 1998), investigations of the polyploids and their parents over the past 15 years have addressed genomic, transcriptomic, and chromosomal diversity (reviewed in Soltis et al., 2012). In both T. miscellus and T. mirus, homoeolog loss (i.e., loss of one parental gene copy) and extensive changes to patterns of homoeologous gene expression (“transcriptomic shock”) occurred soon after polyploidization (Tate et al., 2006, 2009a, 2009b; Buggs et al., 2009, 2010a, 2010b, 2011, 2012a, 2012b; Koh et al., 2010; reviewed by Yoo et al., 2014). For instance, homoeolog-specific patterns were examined using cDNA cleaved amplified polymorphic sequence (CAPS) and Sequenom analysis (Tate et al., 2006, 2009a; Buggs et al., 2010a, 2011). In T. miscellus, the homoeolog expression of one parent dominated, and some changes were shown to occur soon after polyploidization (Buggs et al., 2014). Patterns of gene expression as well as homoeolog loss are repeated across natural populations and synthetic lines. Similar results have now been observed in T. mirus (Chester et al., 2012; Koh et al., 2010; reviewed by Yoo et al., 2014). All evidence suggests that WGD significantly influences gene expression patterns in polyploids in general and Tragopogon in particular.


An evolutionary model system such as Tragopogon is ideal for studying the impact of WGD on AS, both immediately after polyploidization and in established polyploid species. However, the genome resources for the Tragopogon polyploidy system are very limited. A reference genome for Tragopogon was not previously available; the lack of a reference genome made it extremely challenging to study the effects of WGD on AS.

The second chapter of my dissertation introduces a system for detecting AS events without a reference genome. I used the basal angiosperm Amborella trichopoda as the study system. In Chapter 3, I focused on the conserved AS events in basal angiosperms. I investigated and compared the genome-scale conserved AS events between the basal angiosperms Amborella trichopoda and Nymphaea caerulea. I also investigated whether these putatively “ancestral” AS events were retained in other angiosperms such as core eudicots and monocots, in order to explore the evolutionary implications of these conserved AS events. In Chapter 4, I attempted to study AS in Tragopogon dubius using the de novo pipeline described in Chapter 2. The results showed an overall success rate of 33.3% in T. dubius, approximately half of the success rate for Amborella reported in Chapter 2. The low efficiency of the de novo method limited further AS detection in T. dubius and the tetraploid species. To develop better genome reference sequences for Tragopogon to aid in AS discovery and analysis, I completed a draft genome assembly and annotation for T. dubius with support from an NSF Doctoral Dissertation Improvement Grant (DDIG). Tragopogon dubius is the shared diploid parent of the two tetraploids in this system. Development of the T. dubius reference genome sequence and a genome-scale AS analysis are discussed in Chapter 5, to lay the foundation for future studies of AS in the allotetraploid Tragopogon species.


Figure 1-1. The “wondrous cycles” of polyploidy in plants, modified from Wendel (2015) Figure 1.


[Diagram: genomic DNA → transcription → pre-mRNA → splicing by AS type (intron retention, IR; exon skipping, ES; alternative donor, AD; alternative acceptor, AA) → multiple isoforms → translation → proteomic diversity and functional diversity.]

Figure 1-2. Diagram of AS types and resulting proteomic diversity.


Figure 1-3. The Tragopogon triangle modified from Figure 1 in Soltis et al. (2009b). Diploids are at the points, with the two recently formed polyploids between their diploid parents. Tragopogon miscellus formed reciprocally in nature (outside left of triangle), resulting in long-liguled and short-liguled forms. T. mirus only formed with T. porrifolius as the maternal parent (outside right). Reciprocal synthetics of both polyploids have been produced (inside triangle) (Tate et al., 2009b; Soltis et al., 2009b).


CHAPTER 2 DETECTION ALTERNATIVE SPLICING WITHOUT REFERENCE GENOME

Introduction

RNA alternative splicing (AS), which occurs after a pre-mRNA transcript is formed from template DNA, can greatly increase transcript diversity in eukaryotes (e.g. Graveley, 2001; Barbazuk et al., 2008; Syed et al., 2012; Reddy et al., 2013). Considerable recent progress in the study of AS events has resulted from the development of high-throughput approaches in both plants and animals. In plants, recent genome-wide analyses estimated AS frequencies to be over 60% based on RNA-Seq short reads (Marquez et al., 2012; Chamala et al., 2015). However, most of the case studies of AS in plants involve crops and/or model systems. The major factor that limits studies of AS in numerous non-model species is the challenge of de novo assembly of transcriptome sequence without the benefit of a well-annotated reference genome. Building complete transcripts to study gene isoforms is crucial for detecting AS candidate genes. This is not a problem in genetic model species or crops with reference genomes available. However, for nearly all naturally occurring species, including numerous evolutionary model systems, reference genomes are not available. High-throughput RNA-Seq methods such as the Illumina platform make de novo transcriptome assembly especially challenging because these technologies provide only small snippets of the transcripts. Recent progress in single-molecule long-read sequencing, such as the Pacific Biosciences platform (“PacBio”, http://www.pacificbiosciences.com/), has introduced powerful new tools to help solve this problem. The sequencing read length of the PacBio RS II is 10 kb (‘P6-C4’ chemistry), which should cover the size distribution of most transcripts in eukaryotes. Thus, PacBio sequences are expected to represent full-length or nearly full-length transcripts and therefore do not need assembly for downstream analysis (Au et al., 2012; Sharon et al., 2013; Tilgner et al., 2014).

A limitation of the PacBio platform that affected previous applications was its high sequence error rate, but this problem has been mitigated with bioinformatics applications. The PacBio SMRT analysis software now provides high-accuracy reads using their Reads of Insert (ROIs) algorithm [the circular consensus sequence (CCS) in PacBio’s early terminology]. In addition, several examples of third-party software have been developed for hybrid error correction using PacBio data and short reads from other sequencing platforms, especially for PacBio data sets with low sequencing depth (Au et al., 2012; Koren et al., 2012; Salmela and Rivals, 2014).

PacBio long-read transcriptome sequencing (Iso-SeqTM) has been successfully applied in to both human and mouse and shows a significant advantage over short-read

RNA-Seq methods in identifying novel isoforms, in detecting AS events, and in gene fusion studies (Sharon et al., 2013; Tilgner et al., 2014; Weirather et al., 2015). In plants, the Iso-SeqTM platform has been recently applied in transcriptomic studies in two crops, maize and sorghum, using a reference-based approach (Abdel-Ghany et al.,

2016; Wang et al., 2016). The recently developed TAPIS (Transcriptome Analysis

Pipeline for Isoform Sequencing) pipeline defines isoforms, AS, and polyadenylation sites from Iso-SeqTM data (Abdel-Ghany et al., 2016). However, two key steps in this

Iso-SeqTM data analysis pipeline, namely the error correction and the AS detection steps, are highly dependent on an available genome assembly and annotation.

Numerous evolutionary models are poised to address specific fundamentally important

questions that would benefit from a well-characterized transcriptome, but most of these evolutionary model species lack reference genome sequences, precluding application of otherwise potentially revolutionary methods, such as Iso-SeqTM. An excellent example of an evolutionary model that lacks a reference genome is Tragopogon miscellus (Asteraceae), a naturally occurring allopolyploid of very recent origin (90 years) that serves as an important model for genome evolution following whole-genome duplication (Soltis et al., 2012). The TAPIS reference-based pipeline cannot be applied to Iso-SeqTM data for error correction and de novo AS detection in species that lack a reference genome sequence. To address this limitation of all currently available analytical pipelines, I developed a de novo approach for AS detection that can be used with Iso-SeqTM data for any species, even those species that lack reference genome sequences.

In this chapter, I (i) recommend some pre-treatment strategies for Iso-SeqTM raw data, and (ii) report a de novo pipeline for AS detection using Iso-SeqTM long-read data.

An Iso-SeqTM data set for de novo AS detection requires higher data quality and more pretreatment than sequence data being used for reference-based approaches.

However, applying increasingly stringent parameters within the SMRT analysis software decreases the amount of usable read data extracted from the Iso-SeqTM raw data. Thus, selecting the appropriate SMRT analysis parameters and ROI data set is an important step in error correction of Iso-SeqTM data (Koren et al., 2012; Au et al., 2012; Sharon et al., 2013; Weirather et al., 2015). However, existing error correction methods may give different results depending on the algorithm employed (Sharon et al., 2013; Weirather et al., 2015).

Here I used the Iso-SeqTM workflow to produce a single-molecule RNA-Seq data set for Amborella trichopoda, the sister species to all other living angiosperms and thus of major importance to studies of plant evolution at organismal, -omic, and genetic levels (Soltis et al., 2008; Amborella Genome Project 2013). I investigated the general properties of Iso-SeqTM and compared the data quality using different SMRT analysis parameters and different error correction methods. Based on this high-accuracy Iso-SeqTM data set, I then detected AS events in A. trichopoda using a reference-based approach. Finally, I developed a de novo AS detection pipeline and applied it to the Iso-SeqTM data set. The AS events identified by the reference-based approach were used to validate results from the de novo pipeline. My hope was that this pipeline would have wide application to non-model systems, even those that lack a reference genome sequence. I also discuss the advantages of the PacBio platform for full-length transcriptome sequencing and de novo analysis of AS events in species without vast genomic resources.

Materials and Methods

Amborella Genome, Annotation, and RNA-Seq Data

Genome and annotation data used here for A. trichopoda were downloaded from the Amborella Genome database (http://amborella.org/), including genome scaffolds v1.0 (AmTr_v1.0_454Scaffolds.fna.bz2), genome annotation EVM 27 (AmTr_v1.0_evm_run27_filter02.cds.fna), alternatively spliced annotation (ASisoform_overlap_evm27_0926_JunctionSupport2x_FPKM1_pasa.gtf.zip), and protein-coding sequence (AmTr_v1.0_evm_run27_filter02.cds.fna). In addition, existing Illumina RNA-Seq data (2x100 bp) from leaves and flowers (Amborella Genome Project

2013, NCBI BioProject PRJNA212863) were also used here for conducting hybrid error correction on PacBio data.

Plant Materials and RNA Sample Preparation

Young leaves and female flowers of A. trichopoda were collected for RNA sample preparation. RNA was extracted separately from leaf and flower tissues using the CTAB method and RNeasy Mini extraction kit (Qiagen, Germantown, MD, USA). The TURBO DNA-free Kit (Invitrogen, Carlsbad, CA, USA) was used for DNA digestion in RNA samples. The concentration of each RNA sample was checked using the QUBIT® Fluorometer (Life Technologies, Carlsbad, CA, USA), and RNA integrity was checked using a Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA).

cDNA Synthesis, Library Construction, and Sequencing

I conducted cDNA synthesis and library construction following the PacBio Iso-SeqTM protocol used at the Interdisciplinary Center for Biotechnology Research (http://www.biotech.ufl.edu/) at the University of Florida, Gainesville (see APPENDIX A). All sequencing was performed using the PacBio RS II system with P4-C2 polymerase and chemistry at the UF ICBR.

Data Collection and Error Correction

Data from leaf and flower samples were first analyzed separately and then combined and analyzed. Sequence data were collected using PacBio’s SMRT analysis software (v2.2.0). In total, three flncROI data sets were generated. I recorded them as zero-full-passes flncROIs, one-full-pass flncROIs, and two-full-passes flncROIs, respectively (see APPENDIX A).

The Iterative Clustering for Error Correction (ICE) algorithm and Quiver were applied together to remove redundancy and to improve accuracy of the flncROIs. The

zero-full-passes flncROI data set was used as the input, following the PacBio Iso-SeqTM tutorial’s recommendation. Consensus isoforms were predicted and polished through the pbtranscript.py cluster module with default settings. flncROIs from different size fractions were analyzed separately.

LSC (version 1alpha), a fast PacBio long-read error correction tool (Au et al., 2012), was implemented for hybrid error correction using Illumina RNA-Seq reads from Amborella. Hybrid error correction was conducted on both the zero-full-passes flncROI data set and the two-full-passes flncROI data set. Long reads were corrected by LSC if at least 50% of the length of the long read was covered by short reads, and all other parameters were set as default. I chose ICE-Quiver and LSC over the recently released TAPIS pipeline (Abdel-Ghany et al., 2016) to perform error correction and for additional isoform identification because the TAPIS error correction module requires a reference genome. Because the primary goal was to develop a de novo process to identify AS in Iso-SeqTM data acquired from species without a reference genome, I used error correction and data cleaning modules that do not rely on the presence of a reference.
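The 50% short-read coverage rule can be illustrated with a minimal Python sketch. This is not code from LSC; the function names, the half-open 0-based interval convention, and the example coordinates are assumptions made only for illustration.

```python
def covered_fraction(read_length, intervals):
    """Fraction of a long read covered by at least one short-read alignment.

    intervals: (start, end) pairs on the long read, 0-based and half-open
    (an assumed convention, not taken from LSC).
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    covered = 0
    current_end = 0  # right edge of the region already counted
    for start, end in sorted(intervals):
        start = max(start, current_end, 0)  # skip bases already counted
        end = min(end, read_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / read_length


def passes_coverage_filter(read_length, intervals, min_fraction=0.5):
    """True if short reads cover at least min_fraction of the long read."""
    return covered_fraction(read_length, intervals) >= min_fraction


# Hypothetical 1,000-bp long read with two overlapping short-read alignments.
example = [(0, 400), (300, 600)]
print(covered_fraction(1000, example))        # 0.6
print(passes_coverage_filter(1000, example))  # True
```

Overlapping short-read alignments are merged before counting so that a base covered several times contributes only once to the covered fraction.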

Evaluation of Read Quality

I generated six Iso-SeqTM databases: 1) zero-full-passes flncROIs, 2) one-full-pass flncROIs, 3) two-full-passes flncROIs, 4) ICE-Quiver consensus isoforms, 5) LSC-corrected zero-full-passes flncROIs, and 6) LSC-corrected two-full-passes flncROIs. These six data sets were mapped to the Amborella trichopoda genome using GMAP (Wu and Watanabe, 2005). High-confidence mappings (“well-mapped reads”) were determined by the average identity of the alignment and the percentage of length that a read aligned. I used the following filters on all GMAP alignments separately: 75%/67% (identity/length aligned) based on Iso-SeqTM analysis on human (Sharon et al., 2013),

80%/90% as recommended by the PacBio Iso-SeqTM tutorial, 95%/90% as applied in the PASA pipeline (Campbell et al., 2006), and stricter identity/length values of 95%/95%.
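The four identity/length filters can be applied to a set of alignments as in the following Python sketch. The threshold pairs come from the text above; the data structure, the function name, the inclusive (>=) comparison, and the example alignments are assumptions for illustration and do not reproduce the scripts used in this study.

```python
# Filter name -> (minimum percent identity, minimum percent of read length aligned)
FILTERS = {
    "75/67": (75.0, 67.0),
    "80/90": (80.0, 90.0),
    "95/90": (95.0, 90.0),
    "95/95": (95.0, 95.0),
}


def well_mapped_counts(alignments):
    """Count alignments passing each identity/length filter.

    alignments: iterable of (identity_percent, aligned_length_percent) pairs,
    one pair per read (hypothetical input format).
    """
    counts = {name: 0 for name in FILTERS}
    for identity, aligned in alignments:
        for name, (min_identity, min_aligned) in FILTERS.items():
            if identity >= min_identity and aligned >= min_aligned:
                counts[name] += 1
    return counts


# Five made-up alignments, not real Amborella data.
demo = [(99.0, 98.0), (96.0, 92.0), (85.0, 95.0), (78.0, 70.0), (60.0, 99.0)]
print(well_mapped_counts(demo))
```

Dividing each count by the number of reads in the data set gives the percentage of well-mapped reads under that filter.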

Reference-Based Transcriptome Assembly and Alternative Splicing Detection

The original zero-full-passes flncROI data set, the LSC-corrected zero-full-passes flncROI data set, and a hybrid data set (see Results) were used in reference-based detection. Given that TAPIS (Abdel-Ghany et al., 2016) was not used in the error correction step (as explained above), I applied an alternative approach, the PASA pipeline (Campbell et al., 2006), to build transcriptome assemblies, compare annotations, and identify AS events. Iso-SeqTM full-length ROIs were aligned to the reference genome with BLAT (Kent, 2002) and GMAP. The maximum intron size considered by PASA was 20,000 bp to account for large introns discovered in Amborella (Amborella Genome Project 2013; Chamala et al., 2015).

Comparisons between Iso-SeqTM Assemblies and Amborella Gene Models

The annotation comparison module in PASA was applied to conduct isoform-level comparisons. Gene models from EVM 27 were loaded as the original annotation in the first cycle of annotation comparison. Iso-SeqTM assemblies were then used as the input transcripts. Three cycles of transcript loading, annotation comparison, and annotation updates were conducted to maximize the incorporation of transcript alignments into gene structures. All other parameters were set as PASA defaults.

De novo Detection and Validation of Alternative Splicing Events

In general, two isoforms that result from AS could align as illustrated in Figure 2-1a: perfectly aligned regions flanking an indel that represents the AS event. The length of the indel region could range from several base pairs, such as the alternative

acceptor/donor (AA, AD) events, to several hundred base pairs, such as intron retention and exon skipping (IR, ES) events. Previous work on Brassica shows that it is possible to conduct de novo AS detection by clustering transcript sequences from the same gene and determining whether an insertion appears in the alignment (e.g., Zhou et al., 2011). In this study, I used Iso-SeqTM data directly to run all-vs-all BLAST with high identity settings.

For example, in an IR event with a 150-bp retained intron, BLAST alignments that met all criteria given in Figure 2-1b were considered products of candidate AS events: there should be two HSPs (high-scoring segment pairs) in the alignment; in the sequence of the fully spliced transcript (isoform 1 in Figure 2-1b), the base pair coordinates representing the end of HSP1 and the start of HSP2 should be sequentially continuous (recorded here as “Overlap”, 1 bp); in the sequence with a retained intron (isoform 2 in Figure 2-1b), the difference between the base pair coordinate at the end of HSP1 and the base pair coordinate at the start of HSP2 should be one greater than the size of the intron (recorded here as “AS Gap”, 151 bp).
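The coordinate arithmetic in this worked example can be sketched in Python as follows. The sketch assumes 1-based inclusive BLAST coordinates, with the spliced isoform as the query and the intron-retaining isoform as the subject; the class and function names are hypothetical, and the strict test (difference of exactly 1 bp on the spliced isoform) reflects only the idealized example above, not the tolerances of the full pipeline in Figure 2-1b.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class HSP(NamedTuple):
    """One high-scoring segment pair, 1-based inclusive coordinates."""
    q_start: int  # start on the fully spliced isoform (query)
    q_end: int    # end on the fully spliced isoform
    s_start: int  # start on the isoform with the retained intron (subject)
    s_end: int    # end on the isoform with the retained intron


def junction_differences(hsps: Sequence[HSP]) -> Optional[Tuple[int, int]]:
    """Return ("Overlap", "AS Gap") as defined in the text, or None.

    Overlap = start of HSP2 minus end of HSP1 on the spliced isoform.
    AS Gap  = start of HSP2 minus end of HSP1 on the retained isoform.
    None is returned unless the alignment has exactly two HSPs.
    """
    if len(hsps) != 2:
        return None
    first, second = sorted(hsps, key=lambda h: h.q_start)
    overlap = second.q_start - first.q_end
    as_gap = second.s_start - first.s_end
    return overlap, as_gap


def is_strict_candidate(hsps: Sequence[HSP]) -> bool:
    """Idealized test: HSPs are contiguous on the spliced isoform and
    separated by an inserted segment on the other isoform."""
    diffs = junction_differences(hsps)
    if diffs is None:
        return False
    overlap, as_gap = diffs
    return overlap == 1 and as_gap > 1


# The 150-bp retained-intron example: AS Gap = 151, inserted length = 150.
example = [HSP(1, 500, 1, 500), HSP(501, 900, 651, 1050)]
overlap, as_gap = junction_differences(example)
print(overlap, as_gap, as_gap - 1)  # 1 151 150
```

Under these conventions, the length of the inserted segment is the AS Gap minus one.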

The entire pipeline is described in Figure A-1 (see APPENDIX A). The percentage of de novo events that could be validated was calculated. I also summarized the number of events that were validated by each reference-based AS database separately, as well as the number of each type of AS.

To validate this de novo approach further, 61 AS events (14% of the total) were picked randomly for reverse transcription (RT)-PCR. These 61 events included 41 computationally validated events and 20 computationally failed events (Table A-5). Primers for each putative event were designed in the regions flanking the “AS Gap”. First-strand cDNAs were synthesized using the iScript kit (Bio-Rad)

following the manufacturer’s protocol. Taq DNA polymerase (ApexTM) was used to perform PCR, and PCR products were checked using 1.5% agarose gel electrophoresis.

Optimization of de novo Pipeline

To determine whether the success rate of the de novo pipeline could be improved by changing any settings of the pipeline, the 428 de novo events were classified into two categories based on the results of computational validation. Valid events were marked as “yes”, and failed events were marked as “no.” Then a decision tree analysis was conducted in SPSS (v20) using Chi-squared Automatic Interaction Detection (CHAID). Three options in the de novo pipeline, identity percentage, “AS Gap” size, and “Overlap” size, were considered as three independent variables in this analysis. The identity percentage refers to the minimum percent identity setting in the BLAST analysis.

Results

General Properties of ROIs in Iso-SeqTM

The complete Iso-SeqTM data set for this experiment is composed of sequences that meet the most permissive criterion (minimum number of full passes = 0). In all, 401,873 ROIs (11 bp - 31,981 bp) were collected from eight SMRT cells of leaf samples; 258,585 ROIs (10 bp - 27,197 bp) were collected from 11 SMRT cells of flower samples. After filtering out short ROIs (<300 bp), 386,351 ROIs from leaf samples and 242,973 ROIs from flower samples were retained, resulting in a total of 629,324 ROIs. The length distribution of these short-filtered ROIs is shown in Figure A-2a.

Within these 629,324 short-filtered ROIs (≥300 bp), I identified 217,954 flncROIs with an average length of 2,044 bp (Table 2-1). On average, 33.0% of all ROIs were

“PacBio” full-length. This percentage varied greatly among size ranges and tissues (e.g., 26.6% in the flower data set and 37.1% in the leaf data set). In both the leaf and flower data sets, the percentage of flncROIs was higher in the small size fraction than in the large size fraction. For example, in the leaf-derived 1-2-kb, 2-3-kb, and 3-6-kb size fractions, 53.2%, 37.5%, and 28.3% of all ROIs were flncROIs, respectively. Overall, the leaf data set had a higher percentage of flncROIs than the flower data set (Table 2-1, Figure A-3).

The length distribution of flncROIs is shown in Figure A-2b. Some of these flncROIs were extremely long (e.g., >10,000 bp), but most were 800 bp to 4,000 bp (Figure A-2b). I also compared these Iso-SeqTM flncROIs with the protein-coding gene transcripts of Amborella trichopoda (Figure 2-2). Transcripts of protein-coding genes (red line) had a peak in the length distribution at 335 bp and an average length of approximately 950 bp. I found that 98.83% of these transcripts were no longer than 4,000 bp in length. Iso-SeqTM flncROIs (green line) had three peaks at 1,034, 1,804, and 2,874 bp, with an average length of 2,044 bp. Comparison between the protein-coding gene transcripts and those of flncROIs revealed strong concordance, indicating that the latter were often full-length sequences and had a better recovery on large transcripts than previous 454/Illumina RNA-Seq data for gene model prediction, especially in the 2,500-4,000-bp size range.

As noted, 33.0% of the ROIs were full-length with polyA tails, suggesting that they were mRNA sequences with intact 3’ ends. To gain a better understanding of the complete ROIs data set, especially those non-full-length ROIs, I placed all ROIs into three categories: 1) flncROIs as described above; 2) ROIs with polyA tails, but not

“PacBio” full-length; 3) ROIs without polyA tails. The second category, partial polyA ROIs, represented a considerable fraction of the data set (Figure A-4): 16%, 18%, and 21% of ROIs were partial polyA ROIs in the 1-2-kb, 2-3-kb, and 3-6-kb size fractions, respectively, with an average of 19%.

In addition, two stricter settings on the parameter “--minFullPasses” (--minFullPasses as 1 or 2) were applied during the ROI collection process to investigate their effects on data quality and quantity. In total, three different values (0, 1, and 2) of the minimum number of full passes were set. Counts of ROIs generated from each setting are shown in Table A-1. The data show that the most ROIs and flncROIs were retrieved when the minimum number of full passes was set at 0, and the fewest when it was set at 2. The quality of the three data sets in Table A-1 is discussed below.

Error Correction by ICE-Quiver and Illumina Reads

During ICE analysis, flncROIs from the same transcript were clustered together to predict a consensus isoform. Thus, ICE analysis could increase accuracy and partially remove redundancy without requiring additional sequence data. I predicted 55,602 ICE consensus isoforms in the flower data set and 91,085 isoforms in the leaf data set (Table A-2).

From 149,175 leaf zero-full-passes flncROIs (Table 2-1), hybrid error correction by LSC generated 146,158 corrected long reads, 143,065 of which had at least 50% of the length of the flncROIs covered by Illumina short reads (Table A-2). For flowers (Table A-2), 68,177 corrected long reads were generated from 68,779 zero-full-passes flncROIs (Table 2-1), and 65,254 of these had at least 50% of their length covered by

Illumina short reads. The quality of these data generated by ICE and LSC is discussed below.

Mapping Full-Length ROIs and Error-Corrected Reads to the Amborella Genome

High-confidence mappings of three types of flncROIs, ICE consensus isoforms, and LSC-corrected long reads identified by different identity/length filters are given in Table 2-2. Considering both the percentage of well-mapped reads and the number of reads in each data set (Figure 2-3), I proceeded with the two-full-passes flncROIs data set rather than the other flncROIs data sets or the LSC-corrected reads for downstream de novo analysis.

Overall, the two LSC-corrected long-read data sets had the highest percentage of well-mapped reads under all identity/length settings (compare the purple and orange lines in Figure 2-3a). However, the LSC-corrected zero-full-passes flncROIs data contained many more ROI reads than LSC-corrected two-full-passes flncROIs (purple and orange bars in Figure 2-3b, Table 2-2). In LSC-corrected zero-full-passes flncROIs, 183,311 reads had more than 95% of their whole length mapped to the genome with at least 95% identity (95%/95% filter). In terms of the percentage of well-mapped reads, two-full-passes flncROIs were comparable to the LSC-corrected reads under all four identity/length filters (Figure 2-3a); 85.7% of the reads in the two-full-passes flncROIs data set passed the 95%/90% filter, and 85.3% of the reads passed the 95%/95% filter. By comparison, of the LSC-corrected reads (zero-full-passes), 85.9% passed the 95%/90% filter, and 85.5% passed the 95%/95% filter. Relative to the two-full-passes flncROIs, zero- and one-full-pass flncROIs had more reads (Table 2-2), even though the percentages of well-mapped reads were lower under all four identity/length filters (Figure 2-3). For example, only 58.4% of zero-full-passes flncROIs could pass the 95%/95% filter, but the actual number of zero-full-passes flncROIs that passed this

cutoff value was 127,301, which is more than in the two-full-passes flncROIs data set (108,900 reads). Interestingly, polished consensus isoforms given by the ICE-Quiver analysis had the lowest percentages (55%-56%) and the lowest numbers of well-mapped reads under strict identity/length filters (Figure 2-3a, b).

Alternative Splicing Analysis Using Program to Assemble Spliced Alignments (PASA)

To include the greatest sequence information, the two data sets that contained the highest number of well-mapped reads under the 95%/90% identity/length filters were used in this part of the analysis. Of the 217,954 zero-full-passes flncROIs first placed in the PASA pipeline, 96,212 passed all criteria for PASA assembly, while the remainder failed. In contrast, of 214,335 LSC-corrected zero-full-passes flncROIs placed in the PASA pipeline, 150,970 passed all criteria for PASA assembly.

Four major types of AS event (IR, ES, AA, and AD) were present among the AS events detected by PASA in both the zero-full-passes flncROIs database and in the LSC-corrected zero-full-passes flncROIs (Table A-3). Moreover, in order to be consistent with the terminology in the PASA output, IR events were recorded separately as the subtypes “retained intron” and “spliced intron”. Similarly, the ES events were recorded separately as the subtypes “skipped exon” and “spliced exon” (Table 2-3, A-3, A-4). More AS events were detected in the LSC-corrected flncROIs data set (21,748) than in the original flncROIs data set (17,693). The distribution of the four AS types also differed between the two data sets. For example, the percentage of IR events in the original flncROIs data set was 27.3%, while the LSC-corrected flncROI data set had a value of 34.1%. Similarly, the percentage of ES events was 6.5% in the original flncROI data set but only 4.0% in the LSC-corrected flncROI data set.

I also summarized the proportion of AS events shared between the two data sets vs. those detected in only one data set. For all four types of AS, the LSC-corrected data set and the original data sets revealed quite different AS events (Table A-4). For example, only 3,938 IR events could be found in both the LSC-corrected and original data sets, representing ~50% of all IR events in the LSC-corrected data set and ~80% in the original data set.

In addition, zero-full-passes flncROIs that failed during pre-validation in the PASA assembly were collected and replaced by their LSC-corrected reads. Here, 41,223 LSC-corrected reads for flowers and 75,197 LSC-corrected reads for leaf tissue were extracted from all LSC-corrected zero-full-passes flncROIs and combined with 96,212 PASA-valid zero-full-passes flncROIs. A hybrid data set of 212,632 Iso-SeqTM long reads was generated and fed to the PASA pipeline to detect AS events. Based on this mixed data set, PASA identified 24,798 assemblies. These assemblies formed 10,617 isoform clusters, with 28,229 AS events in 17,037 assemblies and 4,879 isoform clusters (Table 2-3).
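The construction of the hybrid data set can be summarized with a short Python sketch. The dictionaries keyed by read identifier, the function name, and the toy sequences are assumptions for illustration; only the three read counts in the final check are taken from the text.

```python
def build_hybrid(original_reads, pasa_valid_ids, lsc_corrected):
    """Keep PASA-valid original reads; swap in the LSC-corrected version
    of each PASA-failed read when one exists; otherwise drop the read.

    original_reads: dict of read ID -> sequence (zero-full-passes flncROIs)
    pasa_valid_ids: set of read IDs that passed PASA validation
    lsc_corrected:  dict of read ID -> LSC-corrected sequence
    """
    hybrid = {}
    for read_id, sequence in original_reads.items():
        if read_id in pasa_valid_ids:
            hybrid[read_id] = sequence
        elif read_id in lsc_corrected:
            hybrid[read_id] = lsc_corrected[read_id]
    return hybrid


# Toy example with made-up reads.
original = {"r1": "AAAA", "r2": "ACGT", "r3": "TTTT"}
valid = {"r1"}
corrected = {"r1": "AAAT", "r2": "ACGA"}
print(build_hybrid(original, valid, corrected))  # {'r1': 'AAAA', 'r2': 'ACGA'}

# Read counts reported above: flower LSC + leaf LSC + PASA-valid originals.
assert 41_223 + 75_197 + 96_212 == 212_632
```

In this sketch a PASA-failed read without an LSC-corrected counterpart is simply omitted.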

Improvements of Current Gene Models by Isoform-Level Comparison

In total, 10,594 gene models from EVM 27 showed coverage by Iso-SeqTM assemblies. Compared with the original annotation, 3,255 loci were improved by Iso-SeqTM isoforms. These improvements included gene extension, gene structure rearrangement, gene merging, and novel genes (Table 2-4). Compared with the EVM 27 gene models, the first/last exon of 950 loci and the internal gene structure of 1,755 loci were extended and rearranged by Iso-SeqTM isoforms, respectively. An additional 590 loci in EVM 27 were merged into 290 loci in the Iso-SeqTM data (Figure 2-4). Of the 290 gene-merging events, 280 loci contained two original loci in the EVM 27 gene

models, while 10 merged loci contained three original loci. In addition, 510 novel genes were discovered, supported by 725 Iso-SeqTM isoforms.

De novo Detection of AS Events in Iso-SeqTM Data and Comparison with PASA Results

My de novo approach following the all-vs-all BLAST gave 428 pairs of Iso-SeqTM reads that might represent AS events. To assess the authenticity of these “de novo” candidates, all 428 events were compared with AS events detected by the genome-guided transcript discovery process implemented in the PASA analysis of the Iso-SeqTM data described previously, and with the existing RNA-Seq AS database. Of 428 de novo candidates, 157 events were found in both the PASA and RNA-Seq results, 160 events were found only by PASA, and 11 events occurred only in the RNA-Seq database (Table 2-5). In all, 328 events (76.6%) were validated by PASA or the RNA-Seq databases, of which 55 events were ES, 233 events were IR, and 40 were AD/AA. In contrast, 100 events could not be found in either the PASA or RNA-Seq databases.

Of these 100 “false positive” events, eight events had AS in the same region as those in the existing AS database, but differed in the type of AS. For example, the AS database showed an IR event in the locus scaffold00018.167, while in the same intron region, my de novo pipeline found an AA event. In addition to false positive events, I also found three events that could not be mapped on any Amborella genome scaffolds. BLAST results indicated that two events were likely the result of plant pathogen mRNA, and one event had high similarity with a 30S ribosomal protein encoded by the plastid genome. High-score BLAST hits of Iso-SeqTM long reads in the first two false positive events were derived from the fungal plant pathogen, Cochliobolus (anamorph: Bipolaris). This

fungal genus is globally distributed and causes leaf spot disease in plants. This analysis indicates that at least 76.6% of the de novo AS candidates may be “real.”

AS Primer Design and RT-PCR Validation

To confirm that the de novo approach could detect AS events, 61 candidates were randomly selected for primer design and validated with RT-PCR (Table A-5). Of the 61 candidates, 41 were present in the PASA or the RNA-Seq database, while the remaining 20 were not present in the existing AS data sets and were considered false positive events.

Of the 41 candidates that were validated by PASA or the RNA-Seq database, 34 were also confirmed by RT-PCR, which indicates an 82.9% validation rate. Of these 34 confirmed candidates, 15 candidates were found only by the Iso-SeqTM PASA analysis, and four events were found only in the existing RNA-Seq database. Seven candidates that were not detected by RT-PCR included four IR events and three AA events, although all of them were confirmed by either the RNA-Seq database or the PASA analysis based on the Iso-SeqTM long reads. These candidates may not have been detected because of the low sensitivity of RT-PCR, or the putative IR events may have been artifacts resulting from incomplete splicing rather than bona fide IR events.

Interestingly, of the 20 false positive events, three were detected by RT-PCR. Of these three events, two contained introns larger than the PASA validation setting, while the third harbored a sequence-error-mediated mismatch at the splice boundary that prevented PASA from detecting it. Of the remaining 17 false positive events, four mapped to gap areas within scaffolds and represented transcripts from regions of the genome that are missing in the draft assembly. Ten events were IR, AA, and AD with non-canonical splice sites. Considering their failure in both

computational and RT-PCR validations, I speculate that these 10 AS-like events may reflect large indel sequencing errors introduced by Iso-SeqTM.

Optimization of the de novo Pipeline Based on Decision Tree Analysis

A CHAID decision tree analysis indicated that the “overlap size” had the strongest influence on the success rate of the de novo pipeline in computational validation. Decreasing the maximum overlap value between two HSPs of the “spliced transcript” (see Figure 2-1b) could significantly improve the accuracy of de novo AS detection. Figure 2-5a shows that when there was no overlap between two HSPs (overlap size = 0 bp), the accuracy was 91.2%, with 68 candidates in this category. When overlap size was 1 or 2 bp, the accuracy was 79.1%, and 249 candidates were retained. When overlap size was larger than 2 bp, accuracy (62.2%) was below the average (76.6%), with 111 candidates. In addition, the “AS gap size” in the “retained transcript” (see Figure 2-1b) also influenced the success rate. A significant decrease in success rate was observed for de novo events whose AS gap was larger than 173 bp. For the 174 de novo events with smaller AS gap sizes (≤173 bp), the success rate was 84.5% (Figure 2-5b).
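As a consistency check on the three overlap-size categories, the candidate counts and accuracies reported above can be pooled in a few lines of Python. The numbers are taken from the text; the variable names are illustrative, and the estimated number of valid events is derived from rounded percentages rather than from the original table.

```python
# (overlap-size category, number of candidates, accuracy in percent)
groups = [
    ("0 bp", 68, 91.2),
    ("1-2 bp", 249, 79.1),
    (">2 bp", 111, 62.2),
]

total_candidates = sum(n for _, n, _ in groups)
estimated_valid = sum(n * pct / 100 for _, n, pct in groups)
pooled_accuracy = 100 * estimated_valid / total_candidates

print(total_candidates)           # 428
print(round(estimated_valid))     # 328
print(round(pooled_accuracy, 1))  # 76.6
```

The three categories account for all 428 de novo candidates, and their pooled accuracy recovers the overall validation rate of 76.6% (328 of 428 events).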

Discussion

Pretreatment Strategies for Iso-SeqTM Raw Data

I determined in my analyses that a hybrid data set was the best approach for reference-based AS analysis. This hybrid data set was composed of all PASA-valid zero-full-passes ROIs and LSC-corrected zero-full-passes ROIs that corresponded to PASA-failed zero-full-passes ROIs. Thus, the hybrid data set included information from the original zero-full-passes ROI and LSC-corrected zero-full-passes ROI data sets and hence detected more AS events than either of the two individual data sets (Table A-3).

The original zero-full-passes ROIs are not recommended for use as the direct input into the PASA pipeline because of their high sequence error. PASA does not use all input data in its assembly. Reads were dropped if they had an alignment identity lower than 95%, an aligned length shorter than 90% of the total length of the flncROI, a mismatch near the splicing junction, or splice-site sequences that did not match the PASA settings. Due to the relatively higher error rate, only 44% (96,212 out of 217,954) of all zero-full-passes ROIs were able to pass the PASA assembly settings, so fewer events were detected in the original zero-full-passes ROI data set than in the LSC-corrected zero-full-passes ROI data set or the hybrid data set.

Sequencing accuracy of the LSC-corrected zero-full-passes flncROIs was greatly improved compared with the original zero-full-passes flncROIs. Approximately 70% (149,160 out of 214,335) of all LSC-corrected zero-full-passes ROIs passed the PASA assembly settings. This increased accuracy could help a read pass PASA validation using not only the “95%/90%” filter, but also the “no mismatch in splicing site boundary region” filter. The latter filter is the major reason that PASA dropped the same read in the zero-full-passes flncROI data set. However, the LSC-corrected zero-full-passes flncROIs data set is also not the ideal input data for PASA analysis. Its major disadvantage is that all reads in this data set must have coverage by short reads. That is, if a zero-full-passes flncROI is not mapped by short reads, it will not be corrected, but instead dropped by LSC and omitted from the output. As a result, the LSC-corrected zero-full-passes flncROI data set may reflect sequencing biases present in the short reads. In fact, 662 flower flncROIs (0.96%) and 3,017 leaf flncROIs (2.02%) were missed in the LSC-corrected data set. Hence a significant difference in detected AS

events was observed between the LSC-corrected data set and the original data sets (Table A-4). All of the above reasons make the LSC-based data set less appropriate as the input for PASA than the other approaches, although it detects slightly more AS events than the ROI-based data set.

The ICE-Quiver analysis is an important module in the SMRT analysis pipeline. It was designed to obtain high-accuracy non-redundant isoforms. Unfortunately, this data set gave me the lowest percentage of well-mapped reads. This result likely occurred because ICE-Quiver analysis performs best when there is sufficient full-length ROI (flncROI) and non-full-length ROI coverage; PacBio recommended 20 SMRT cells for one Iso-SeqTM experiment per tissue (http://www.pacificbiosciences.com/), more than were used here.

For a de novo transcriptome analysis, I suggest using two-full-passes ROIs. De novo transcriptome analysis requires higher-quality data than reference-based analysis. The LSC-corrected two-full-passes ROIs had the highest percentage of well-mapped reads under all identity/length settings (Figure 2-3). However, this data set is a hybrid data set that requires information from two different data sets: Iso-SeqTM ROIs and Illumina short reads. Such hybrid long reads are not truly single-molecule reads and may produce artifacts owing to alignment errors (Sharon et al., 2013). In terms of the percentage of well-mapped reads, I found that the results from the two-full-passes flncROIs were comparable to the LSC-corrected zero-full-passes ROIs under all four identity/length filters (Figure 2-3). I therefore used this two-full-passes data set as the input for the de novo pipeline.

For a de novo transcriptome analysis, I also recommend conducting a BLAST analysis between the Iso-SeqTM data and all existing protein-coding sequences of

common plant pathogens, using, for example, the Comprehensive Phytopathogen

Genomics Resource (CPGR, http://cpgr.plantbiology.msu.edu/index.html, Hamilton et al., 2011). I make this recommendation because two of the three false positive candidates I discovered were caused by sequences from pathogenic fungi

(Cochliobolus, anamorph: Bipolaris). These sequences are easily filtered out in a reference-based analysis because they will not be mapped onto the genome reference.

However, in a de novo analysis, these sequences could not be detected without the

BLAST analysis. Moreover, this type of contamination due to pathogen infection could be difficult to avoid during plant cultivation and sample collection, particularly for those plants grown in a greenhouse with other plants and not in a growth chamber. The fungal mRNA sequences were detectable even though the plant was grown under careful management, with the leaf tissues collected when they were very young. Thus, a

BLAST analysis is highly recommended in a de novo pipeline.
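The screening step recommended above can be sketched as follows. This is a minimal illustration, assuming the Iso-SeqTM reads have already been searched against a pathogen sequence set and the hits saved in the default 12-column BLAST tabular format (-outfmt 6). The function name, the read and subject IDs, and the identity and E-value thresholds are hypothetical and are not the settings used in this study.

```python
def flag_contaminants(blast_tab_lines, min_identity=90.0, max_evalue=1e-10):
    """Return the IDs of query reads with a strong hit to a pathogen sequence.

    Each line is expected to hold the 12 default BLAST tabular columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore. Both thresholds are illustrative assumptions.
    """
    flagged = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        identity = float(fields[2])   # percent identity of the hit
        evalue = float(fields[10])    # E-value of the hit
        if identity >= min_identity and evalue <= max_evalue:
            flagged.add(query_id)
    return flagged


# Toy input with made-up read and subject IDs.
example_hits = [
    "read_001\tfungal_prot_A\t98.5\t210\t3\t0\t1\t630\t1\t210\t1e-120\t410",
    "read_002\tfungal_prot_B\t45.0\t80\t44\t2\t10\t250\t5\t84\t0.5\t30.1",
]
contaminants = flag_contaminants(example_hits)
print(sorted(contaminants))
```

Reads flagged in this way would be removed before AS candidates are called, so that pathogen transcripts cannot be mistaken for plant isoforms.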

Advantages of Iso-SeqTM in Reference-Based Transcriptome Analysis

I recognize that the sequencing depth and coverage of the Iso-SeqTM experiments are much lower than those of the Illumina data sets used to generate the EVM 27 gene models (Amborella Genome Project, 2013). The Iso-SeqTM data covered only 62% of the multi-exonic genes in the EVM 27 gene models (10,594 vs. 17,089). As described in the Iso-SeqTM protocol, eight SMRT cells per tissue are sufficient to recover multiple transcripts and isoforms of abundant transcripts, but do not provide sufficiently deep transcriptome sequencing for a detailed view of most transcripts and their isoforms or for rare transcripts. More SMRT cells could be employed in the future to obtain a more complete transcriptome for Amborella. For example, the Iso-SeqTM protocol recommends 8-50 SMRT cells per tissue for sequencing most transcripts and their isoforms of a plant transcriptome. On the other hand, more tissues could also be added to the experimental design, especially for polyploids or plants with large genome sizes. An Iso-SeqTM analysis of tetraploid cotton (Gossypium hirsutum; 2n = 4x = 52), with a genome size of 2.5 Gb, included three tissues (leaf, root, and stem), with four libraries per tissue and four SMRT cells per library, for 16 SMRT cells per tissue (van Eijk, 2015). Nevertheless, even with these pilot sequencing data for Amborella (only 8 SMRT cells per tissue), the Iso-SeqTM approach shows significant advantages in transcriptome studies for different purposes.

My isoform-level comparison between Iso-SeqTM isoforms and the original genome annotation indicated that the Iso-SeqTM platform could improve current gene models by extending the length of the current EVM 27 gene models, rearranging the internal structure of the gene, and merging adjacent inferred genes into a single gene. These improvements are possible because Iso-SeqTM recovers large transcripts better than short reads obtained from the Illumina platform: mRNAs are more likely to be sequenced completely by long-read than by short-read sequencing approaches. Thus, a new gene model may be longer than an earlier one. Longer transcript reads also provide better support and higher accuracy for splice junctions than short reads when these reads are aligned back to the genome. As a result, gene models predicted from long reads correct exon/intron structure (2,705 loci) and can merge two or more adjacent genes (590 loci) that may have been mistakenly considered discrete genes.

Using the PASA pipeline, the Iso-SeqTM approach also detected 510 novel genes that were not included in the previous genome annotation. In the randomly selected RT-PCR validation set, two novel genes were found (AmTrS6.new and AmTrS9.new; Table A-5). Considering the strict parameter settings for genome mapping and assembly in PASA, these 510 novel genes are not likely derived from contamination during sample preparation. Considering that the sequencing depth and coverage of the Iso-SeqTM experiments were much lower than those for the EVM 27 gene models, the discovery of novel genes is quite encouraging, further demonstrating the advantages of the Iso-SeqTM approach.

Iso-SeqTM also improves reference-based AS analysis. Based on RNA-Seq data, Amborella has 17,089 predicted multi-exonic genes, 6,407 (37.5%) of which have AS (Amborella Genome Project, 2013). In contrast, the Iso-SeqTM hybrid data set identified 10,617 isoform clusters, and 4,879 (45.8%) of them exhibit AS, demonstrating that Iso-SeqTM long reads detect AS in a higher proportion of genes than Illumina data. De novo AS validation results (Table 2-5) also indicated that of the 328 events validated by PASA/RNA-Seq, nearly 50% (160) were found only in the Iso-SeqTM PASA data.

Iso-SeqTM and AS de novo Detection Pipeline

Iso-SeqTM generates long reads that do not require assembly to obtain full-length transcripts. This is the major advantage of Iso-SeqTM compared with short-read RNA-Seq data. My results show that de novo AS detection is possible using Iso-SeqTM reads, with 76.6% of the de novo candidates confirmed in the AS database generated by the reference-based method. Among the 41 events confirmed in the AS database, 34 events (82.9%) were validated by RT-PCR. The percentage of de novo candidates confirmed by RT-PCR was lower than the expected 100% (all 41 events were confirmed in the reference-based validation), but this might reflect the limitations of RT-PCR and agarose gel electrophoresis in terms of sensitivity and resolution. RT-PCR products from lowly expressed isoforms may show very light or fuzzy bands on an agarose gel, making visualization difficult. These results indicate that AS analysis, using methods such as RT-PCR, is possible in non-model species and could have a relatively high success rate (63.5%, i.e., 76.6% multiplied by 82.9%).
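The combined success rate quoted above can be reproduced with a short calculation. The sketch below uses the rounded percentages exactly as they are reported in the text; the variable names are illustrative only.

```python
# Share of de novo candidates confirmed in the reference-based AS database (as reported).
database_confirmed_rate = 0.766

# RT-PCR outcome for the database-confirmed events that were tested.
rtpcr_validated, rtpcr_tested = 34, 41
rtpcr_rate = round(rtpcr_validated / rtpcr_tested, 3)  # rounded to three decimals, as in the text

# Expected success rate for a candidate passing both steps.
combined_rate = database_confirmed_rate * rtpcr_rate

print(f"RT-PCR validation rate: {rtpcr_rate:.1%}")
print(f"Combined success rate: {combined_rate:.1%}")
```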

The accuracy of AS candidate detection could be improved by changing the settings of the de novo pipeline. Decreasing the “overlap size” could significantly increase the potential success of the RT-PCR step; allowing a smaller “AS gap” size could also prevent the detection of false positives in de novo analysis. However, such changes may result in the loss of some candidates that are real AS events. Using a 2-bp overlap size instead of the current 5-bp size could increase the success rate from the current 76.6% to 81.7% (Figure 2-5), but this would also result in a loss of roughly 20% of the real AS candidates (69 out of 328). In addition, I found that AS candidates that failed in the RT-PCR validation analysis were all IR or AA events, whereas ES events usually did not show this problem. Polyribosomal RNA-Seq data from Arabidopsis thaliana indicate a significant difference in IR events compared with traditional RNA-Seq data (Zhang et al., 2015). It is likely that a large proportion of retained introns in RNA-Seq data results from incompletely spliced pre-mRNA (Zhang et al., 2015). Thus, avoiding potential IR events in the RT-PCR step may increase the success rate.

In Amborella, the average intron length (1,528 bp) is much longer than the average exon length (229 bp). This trend is also true of other plants examined, including Picea abies (average exon/intron size: 246 bp/1,018 bp), Vitis vinifera (average exon/intron size: 295 bp/966 bp), and Zea mays (average exon/intron size: 277 bp/640 bp) (Nystedt et al., 2013). As a result, using de novo candidates with small “AS gaps” could decrease the probability of selecting IR events in primer design. The decision tree analysis also indicated that smaller “AS gap” sizes (e.g., <173 bp) could avoid false positive candidates in de novo analysis.
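The two tuning options discussed above (a smaller overlap size and an upper limit on the AS gap size) amount to a simple filter on the candidate list. The sketch below is illustrative only: the Candidate fields, the way overlap and gap sizes are stored, and the toy values are assumptions rather than the data structures of the actual pipeline. Only the 5-bp and 2-bp overlap sizes and the 173-bp gap size come from the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A de novo AS candidate; the field names are hypothetical."""
    pair_id: str
    overlap_size: int  # bp of overlap between the two flanking alignments
    as_gap_size: int   # bp of the unaligned region representing the AS event


def filter_candidates(candidates, max_overlap=5, max_gap=None):
    """Keep candidates within the overlap limit and, if given, below the gap limit."""
    kept = []
    for cand in candidates:
        if cand.overlap_size > max_overlap:
            continue
        if max_gap is not None and cand.as_gap_size >= max_gap:
            continue
        kept.append(cand)
    return kept


# Toy candidates with made-up values.
candidates = [
    Candidate("pair_1", overlap_size=1, as_gap_size=96),
    Candidate("pair_2", overlap_size=4, as_gap_size=150),
    Candidate("pair_3", overlap_size=2, as_gap_size=1400),
]

# Current setting: 5-bp overlap size, no limit on the AS gap.
current = filter_candidates(candidates)
# Stricter setting: 2-bp overlap size and AS gap below 173 bp.
stricter = filter_candidates(candidates, max_overlap=2, max_gap=173)
print([c.pair_id for c in current], [c.pair_id for c in stricter])
```

As noted in the text, the stricter setting trades sensitivity for accuracy: it removes likely false positives and long retained introns, but may also discard real AS events.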

In conclusion, the results show the advantages of using Iso-SeqTM data in AS detection using both reference-based and de novo approaches. The Iso-SeqTM platform provides better recovery on large transcripts and detects more AS events than can be recovered from transcriptome assemblies based on short reads. I recommend a combination of an original ROI-based data set and an LSC-corrected data set for reference-based AS detection, especially for studies having only relatively small Iso-SeqTM data sets. The overall high success rate of the de novo approach in the identification of AS events indicates that this method could be successfully applied to a non-model system with limited genetic resources. The latter is a particularly exciting result, opening the door to many novel research possibilities.

Table 2-1. General properties of zero-full-passes ROIs and ICE consensus isoforms using Iso-SeqTM.

Tissue   Size       SMRT    Number of   Number of    flncROIs %    Average length   ICE-Quiver
type     fraction   cells   ROIs1       flncROIs2    in all ROIs   of flncROIs      consensus isoforms
Leaf     1-2 kb     2       83,577      44,439       53.2%         1,027            24,933
         2-3 kb     3       158,919     59,614       37.5%         1,954            36,454
         >3 kb      3       159,377     45,122       28.3%         2,940            29,699
         Total      8       401,873     149,175      37.1%         1,967            91,084
Flower   1-2 kb     3       38,913      15,118       38.9%         1,201            11,274
         2-3 kb     4       67,528      21,262       31.5%         1,973            16,538
         >3 kb      4       152,144     32,399       21.3%         2,795            27,790
         Total      11      258,585     68,779       26.6%         2,191            55,602
TOTAL               19      660,458     217,954      33.0%         2,044            146,686

1. ROIs: Reads of Insert, set minimum number of full passes at 0.
2. flncROI: full-length ROIs.

Table 2-2. High-confidence mappings by GMAP with different identity/length filters.

                              Total      Mapped            95% identity,     95% identity,     80% identity,     75% identity,
Data type                     flncROIs   flncROIs          95% length        90% length        90% length        67% length
                                                           aligned           aligned           aligned1          aligned2
Zero-full-passes flncROI3     217,954    212,983 (97.8%)   127,301 (58.4%)   127,965 (58.7%)   196,979 (90.4%)   204,228 (93.7%)
One-full-passes flncROI       156,887    153,113 (97.6%)   121,280 (77.3%)   121,878 (77.7%)   145,029 (92.4%)   148,612 (94.7%)
Two-full-passes flncROI       127,703    124,509 (97.5%)   108,900 (85.3%)   109,394 (85.7%)   119,693 (93.7%)   121,878 (95.4%)
LSC Zero-full-passes flncROI  214,335    212,387 (99.2%)   183,311 (85.5%)   184,118 (85.9%)   203,017 (94.7%)   206,866 (96.5%)
LSC Two-full-passes flncROI   125,518    124,170 (98.9%)   114,953 (91.6%)   115,427 (92.0%)   120,164 (95.7%)   121,938 (97.1%)
ICE isoform                   146,686    142,676 (97.3%)   81,592 (55.6%)    82,142 (56.0%)    128,507 (87.6%)   134,729 (91.8%)

1. Used in PacBio Iso-SeqTM tutorial.
2. Used in human Iso-SeqTM analysis (Sharon et al., 2013).
3. Number in brackets shows the minimum number of full passes setting.

Table 2-3. AS events detected by the hybrid dataset.

AS type                 Events             Assemblies         Isoform clusters
Retained Intron         8,286    29.4%     7,188    42.2%     3,622    74.2%
Spliced Intron          8,286    29.4%     9,710    57.0%     3,622    74.2%
Retained Exon           1,686    6.0%      2,603    15.3%     1,233    25.3%
Skipped Exon            1,595    5.7%      2,420    14.2%     1,233    25.3%
Alternative Acceptor    5,242    18.7%     6,850    40.2%     1,886    38.7%
Alternative Donor       3,134    11.1%     4,541    26.7%     1,256    25.7%
Total                   28,229             17,037             4,879

Table 2-4. Gene models corrected by the Iso-SeqTM approach.

Type                                      EVM 27      Iso-SeqTM   Iso-SeqTM
                                          gene loci   gene loci   isoforms
Gene extension                            950         950         950
Internal gene structure rearrangement     1,755       1,755       1,755
Gene Merging                              590         290         290
New Gene                                  -           510         725
Total                                     3,255       3,465       3,680

Table 2-5. Summary of validation of de novo inferred AS events.

Valid     PASA and RNA-Seq    157
          PASA only           160
          RNA-Seq only        11
          Total               328
Failed    "N" scaffold        19
          Large intron        5
          Single exon         21
          Other               55
          Total               100
Overall Total                 428

Figure 2-1. De novo detection of alternative splicing. a. Alignment of two AS isoforms, with perfectly aligned regions flanking an INDEL (the green square, and base pairs in red) that represents the AS event. b. BLAST alignment graph of an AS event with a 150-bp retained intron. HSP: High-scoring Segment Pair. Numbers above or below the alignment graph are the base pair coordinates representing the start/end of an HSP.

Figure 2-2. Length distributions of zero-full-passes flncROIs and Amborella gene models; the number near each peak is the read length.

Figure 2-3. PacBio Iso-SeqTM “well-mapped reads”. a. Percentage of “well-mapped reads” in different PacBio Iso-SeqTM datasets under four identity/length filters. b. Actual number of “well-mapped reads” in different PacBio Iso-SeqTM datasets under four identity/length filters. Data sets plotted: 0-full-passes flncROI, 1-full-passes flncROI, 2-full-passes flncROI, ROI0_LSC corrected reads, ROI2_LSC corrected reads, and ICE+Quiver. Filters: 75% identity, 67% length aligned; 80% identity, 90% length aligned; 95% identity, 90% length aligned; 95% identity, 95% length aligned.

Figure 2-4. Example of revised gene models after inclusion of Iso-SeqTM isoform data. a. Gene merging. b. Gene extension. c. Rearrangement of internal gene structure. The panels show the original gene models (evm_27.TU.AmTr_v1.0_scaffold00001.196; evm_27.TU.AmTr_v1.0_scaffold00001.355; evm_27.TU.AmTr_v1.0_scaffold00005.63, .64, and .65), the updated gene model, and the Iso-SeqTM isoform.

Figure 2-5. CHAID decision tree analysis. a. Overlap size. b. AS gap size.

CHAPTER 3 CONSERVED ALTERNATIVE SPLICING IN BASAL ANGIOSPERMS

Introduction

The angiosperms, also known as flowering plants, are the most diverse group of green plants, with at least 350,000 species (The Plant List, 2010). The angiosperms represent one of the greatest terrestrial radiations, which occurred in the early to mid-Cretaceous (approximately 100 to 140 million years ago). Angiosperms now dominate most terrestrial and many aquatic environments on our planet. They are also the primary source of human food (e.g., maize, rice, wheat, and most vegetables and fruits) and provide material for industrial products, such as natural rubber and biofuel, as well as pharmaceuticals. Thus, understanding the evolutionary history of flowering plants is fundamental to plant biology and has significant implications for food safety, human health, and technological innovation.

Nymphaeaceae, or the water lily family, has a critical evolutionary position in the angiosperms (Chen et al., 2017). The order Nymphaeales is one of three orders that form the basal grade of angiosperms (sometimes referred to as the ANA grade; Soltis et al., 2005); ANA stands for Amborellales, Nymphaeales, and Austrobaileyales (Soltis et al., 2005; Soltis et al., 2018). Amborella trichopoda, a single species of shrub from New Caledonia, is considered sister to all other extant angiosperms (Angiosperm Phylogeny Group, APG IV, 2016). Following Amborellales, Nymphaeales is sister to all remaining angiosperms (Soltis et al., 2018). Nymphaeales contain ~80 species from three families, Nymphaeaceae, Cabombaceae, and Hydatellaceae (APG IV, 2016). Thus, Nymphaeaceae are also critical for studying the early evolution of angiosperms.


RNA alternative splicing (AS) is a common phenomenon during gene expression in plants that plays regulatory roles and can promote functional innovation and proteomic diversity even when genetic variation is limited (Barbazuk et al., 2008; Syed et al., 2012; Reddy et al., 2013). However, there is some evidence that some AS in plants may be stochastic (Zhang et al., 2009; Pickrell et al., 2010; Hon et al., 2012; Satyawan et al., 2017). Recent proteomics evidence also suggests that the vast majority of genes have a single dominant splice isoform (Tress et al., 2017). Thus, it is important to reduce splicing noise when investigating the extent and function of AS. A common hypothesis is that AS events with functional significance may be conserved among different species. For instance, one of the AS events in the FCA gene involved in flowering control is conserved between Arabidopsis and rice (Lee et al., 2005). Another example of functionally significant conserved AS includes the exon skipping events in TFIIIA (transcription factor for polymerase III A) found in all land plant species examined (Fu et al., 2009). Interestingly, the spliced exon in the TFIIIA gene is an exonization of 5S ribosomal RNA (5S rRNA), and the TFIIIA gene is a transcription regulator of the 5S rRNA (Fu et al., 2009).

There are several case studies of genome-wide analyses of conserved AS events among different lineages of angiosperms. Chamala et al. (2015) reported a method to identify conserved AS events across large phylogenetic distances using RNA-Seq datasets. They also investigated conserved AS across nine species, including seven eudicots, one monocot, and the basal angiosperm A. trichopoda. Mei et al. (2017a) focused on AS conservation in seven monocot species. Both case studies suggested that serine/arginine (SR) splicing factor protein families undergo AS. In plant-specific RS and RS2Z subfamilies of the serine/arginine (SR) splice-factor proteins, both conservation and divergence of AS events were observed after the whole-genome duplication in maize (Mei et al., 2017b). In addition, a recent study (Yang et al., 2018) examined conserved AS in the mangrove genus Sonneratia and suggested that the conserved AS might contribute to the adaptation of Sonneratia species to harsh intertidal environments.

Existing studies on conservation of AS among angiosperms used only A. trichopoda to represent the basal angiosperms. None of them includes other species in the ANA grade. Here, I conduct a genome-scale detection of the conserved AS events between the basal angiosperms A. trichopoda and Nymphaea caerulea. AS events conserved between basal angiosperms may represent the ancestral AS events with important biological functions. The presence or absence of these conserved AS events was examined in six additional angiosperms (two monocots and four eudicots) to see whether these ancestral AS events were conserved during evolutionary history. The basal monocot Spirodela polyrhiza (Araceae) and a well-studied crop species Oryza sativa ssp. japonica (rice; Poaceae) were used to represent the monocot clade. Two clades from the core eudicots were included in the conserved AS analysis, the rosids and asterids: the rosids Vitis vinifera cv. Cabernet Sauvignon (Vitaceae) and the model species Arabidopsis thaliana (Brassicaceae), and the asterids Camptotheca acuminata (Nyssaceae) and the well-studied Solanum lycopersicum (Solanaceae). This comparison will help us to better understand the evolutionary implications of conserved AS events. I hypothesize that conserved AS patterns among different species should be associated with their evolutionary lineages (hereafter, the Ancestral Inheritance Hypothesis).

Materials and Methods

Nymphaea Genome, Annotation, and Transcriptome Data

Genome and annotation data for Nymphaea caerulea were obtained from collaborators in the Amborella Genome Project (unpublished). Transcriptome data for alternative splicing analysis, including both RNA-Seq and Iso-SeqTM, were also obtained from the same group of collaborators (unpublished). The Iso-SeqTM data were from four libraries and sequenced through the PacBio Sequel system. Each Iso-SeqTM library was constructed with one tissue or stress type, including leaf, flower, root, and leaf stress. The RNA-Seq data (150-bp PE) were generated from leaf and flower tissues.

Nymphaea Iso-SeqTM Data QC and Error Correction

The circular consensus reads (CCS reads) were extracted from the Iso-SeqTM raw data using the CCS module in SMRTanalysis v4.0 with the following parameters: “--noPolish --minLength=200 --minPasses=0 --minZScore=-999 --maxDropFraction=0.8 --minPredictedAccuracy=0.8 --minSnr=4”. Then the full-length, non-chimeric CCS reads (flnc CCS reads) were identified by the SMRTanalysis module “pbtranscript.py classify” with the parameter “--min_seq_len 180”. Finally, the Iterative Clustering for Error Correction (ICE) algorithm and Arrow were applied together to remove redundancy and to improve accuracy of the flnc CCS reads using the SMRTanalysis module “pbtranscript cluster”. The output was termed ICE-Quiver polished isoforms.

Nymphaea RNA-Seq Data Assembly

Cleaned RNA-Seq data from leaf and flower tissues of N. caerulea were aligned separately to the N. caerulea genome assembly using HiSat2 (v2.1.0, Kim et al., 2015) with default parameters. The SAM file outputs were sorted using SAMtools (v1.8, Li et al., 2009), and duplicate reads were identified and removed using MarkDuplicates included in the Picard suite (v2.10.3, http://broadinstitute.github.io/picard). The filtered HiSat2 alignment was then used to conduct annotation-guided transcriptome assembly using StringTie (v1.3.3, Pertea et al., 2015) with the parameter -f 0.05 and Cufflinks (v2.2.1.1, Trapnell et al., 2010) with parameters --no-faux-reads -F 0.05 -I 12500 (-I: maximum intron length), respectively. In all Nymphaea analyses, the maximum intron length was set to 12,500 bp, which represents the 98th percentile of intron size. Then the output GTF files of these two assemblers were merged using StringTie (v1.3.3) with the parameter --merge -f 0.05. Finally, the genome-guided Trinity (v2.5.0, Grabherr et al., 2011) assembly was conducted based on the merged GTF using parameters --min_contig_length 200 --genome_guided_max_intron 12500.
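The maximum intron length used above is described as the 98th percentile of intron size. A minimal sketch of how such a cutoff could be derived from an annotation is given below. The nearest-rank percentile definition, the function names, and the toy coordinates are assumptions made for illustration; the text does not state which percentile definition was used.

```python
import math


def intron_lengths(exons):
    """Intron lengths for one transcript.

    exons: list of (start, end) exon coordinates, 1-based and inclusive,
    sorted by position along the scaffold.
    """
    return [next_start - prev_end - 1
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (0 < pct <= 100) of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Toy annotation: two transcripts with made-up exon coordinates.
transcripts = [
    [(1, 100), (201, 300), (1301, 1400)],
    [(5000, 5200), (5451, 5600)],
]
all_introns = [length for exons in transcripts for length in intron_lengths(exons)]
max_intron = percentile_nearest_rank(all_introns, 98)
print(all_introns, max_intron)
```

The resulting value would then be supplied as the maximum intron length to the aligner and assemblers (12,500 bp for Nymphaea in this study).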

Alternative Splicing Analysis in Nymphaea

Alternative splicing analysis was conducted using the Program to Assemble Spliced Alignments (PASA) 2.0 pipeline. The maximum intron length was set as 12,500 bp. I adopted the PASA parameter --INVALIDATE_SINGLE_EXON_ESTS to ignore single-exon ESTs when building PASA assemblies.

AS events based on Iso-SeqTM data were detected following the workflow described in Liu et al. (2017). ICE-Quiver polished isoforms generated by “pbtranscript cluster” were put into PASA for pre-validation. These isoforms were classified into two groups, PASA-valid isoforms or PASA-failed isoforms, based on whether the ICE-Quiver polished isoforms passed or failed the PASA validation, respectively. All PASA-failed isoforms were error-corrected with RNA-Seq data from the same tissue using LSC (v2.0). Finally, the PASA-valid isoforms and the short-read-corrected PASA-failed isoforms were combined and used as input to PASA for additional assembly and clustering to remove redundancy. The output of this PASA run is referred to as the Iso-SeqTM PASA assembly.

Based on RNA-Seq data, AS events were detected by the PASA pipeline with the same parameters that were used for the Iso-SeqTM data. Trinity genome-guided assemblies, as well as the merged GTF file produced from the StringTie and Cufflinks assemblies, were used as the input. The PASA output was filtered by a custom Python script (Chamala et al., 2015) that checks for splice junction support: if a splice junction defined by PASA is supported by fewer than two reads, the PASA assembly is discarded. The junction-filtered output is referred to as the RNA-Seq PASA assembly.
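The junction-support rule described above can be illustrated with a short sketch. This is not the script of Chamala et al. (2015); the dictionaries, the junction tuples, and the function name are hypothetical stand-ins for the PASA output and the read alignments. Only the threshold of two supporting reads comes from the text.

```python
def filter_by_junction_support(assemblies, junction_reads, min_reads=2):
    """Keep assemblies in which every splice junction has at least min_reads reads.

    assemblies: dict mapping an assembly ID to its list of junctions, each a
        (scaffold, intron_start, intron_end) tuple.
    junction_reads: dict mapping a junction to its number of supporting reads.
    Junctions absent from junction_reads are treated as unsupported.
    """
    kept = {}
    for assembly_id, junctions in assemblies.items():
        if all(junction_reads.get(junction, 0) >= min_reads for junction in junctions):
            kept[assembly_id] = junctions
    return kept


# Toy input with made-up coordinates and read counts.
j1 = ("scaffold_1", 1200, 1850)
j2 = ("scaffold_1", 2100, 2400)
j3 = ("scaffold_1", 2100, 2950)
assemblies = {"asm_a": [j1, j2], "asm_b": [j1, j3]}
junction_reads = {j1: 5, j2: 2, j3: 1}

supported = filter_by_junction_support(assemblies, junction_reads)
print(sorted(supported))
```

In this toy case the second assembly is discarded because one of its junctions is supported by a single read.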

Finally, the Iso-SeqTM PASA assembly and the RNA-Seq PASA assembly were combined and run through PASA once more. The AS events detected from this combined dataset represent the comprehensive AS event dataset for N. caerulea.

Reanalysis of Amborella Alternative Splicing Events

Similar to the AS analysis in N. caerulea, this analysis of AS in Amborella trichopoda also included RNA-Seq data and Iso-SeqTM data. The RNA-Seq PASA assembly was conducted with data from flower, leaf, and root tissues (Table 3-1) using the same workflow and parameters that were used in the N. caerulea AS analysis. The maximum intron length in all A. trichopoda analyses was set to 20,000 bp, which represents the 98th percentile of intron size based on the Amborella annotation. As for Nymphaea, A. trichopoda Iso-SeqTM PASA assemblies were constructed following the workflow described in Liu et al. (2017). Thus, the A. trichopoda Iso-SeqTM PASA assembly was based on leaf and female flower transcripts. The Iso-SeqTM PASA assembly and the RNA-Seq PASA assembly of A. trichopoda were combined and run through PASA a final time to resolve redundancy. The AS events detected from this combined dataset constitute the comprehensive AS dataset for A. trichopoda.

Gene Orthology Analysis

Protein sequences of N. caerulea, A. trichopoda, Aquilegia coerulea (Ranunculaceae), the rosid Vitis vinifera (Vitaceae), the model rosid Arabidopsis thaliana (Brassicaceae), the asterid Camptotheca acuminata (Nyssaceae), the model asterid Solanum lycopersicum (Solanaceae), the monocot Spirodela polyrhiza (Araceae), and the model monocot Oryza sativa ssp. japonica (Poaceae) were clustered using OrthoFinder (v 2.2.3) with default parameters (Emms and Kelly, 2015). The versions of the genome assemblies and annotations used for these species are described in Table 3-2. All analyses of conserved AS events were based on this orthogroup list for these nine angiosperms; however, Aquilegia coerulea is not included in the conserved AS analysis because its RNA-Seq data are currently unavailable.

Conserved Alternative Splicing in Basal Angiosperms, Eudicots, and Monocots

The AS events conserved between the two basal angiosperms A. trichopoda and N. caerulea, and six additional species representing both eudicots and monocots (see Table 3-3), were investigated using the method developed by Chamala et al. (2015). AS events detected in two or more species were referred to as shared events. In comparison, AS events detected in both basal angiosperms were termed basal-angiosperm-conserved AS events. AS datasets for A. trichopoda and N. caerulea were obtained from the analyses described above. The AS datasets of the six eudicots and monocots were obtained from collaborators in the Barbazuk lab at the University of Florida (unpublished). Because the plant materials from both basal angiosperms were from the same tissues/developmental stages, I selected RNA-Seq data from similar developmental stages from the other six species for consistency. Three tissue types, i.e., leaf, flower, and root, were included in the AS analysis in all eight species (Table 3-3). The nature of stress treatments examined by the plant community varied greatly among the species examined; thus, stress treatment data were not included in this analysis. In order to maintain consistency in sample data size and quality, a single lab source for each species/tissue type combination was identified.
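The two definitions given above (shared events and basal-angiosperm-conserved events) can be expressed as set operations once AS events from different species have been mapped to a comparable key. The sketch below assumes that this mapping has already been done, which is the substantive part of the method of Chamala et al. (2015) and is not reproduced here. The event keys (orthogroup, AS type, position index) and the toy data are hypothetical.

```python
from collections import defaultdict


def find_shared_events(events_by_species, min_species=2):
    """Return {event key: set of species} for events seen in at least min_species species."""
    species_by_event = defaultdict(set)
    for species, events in events_by_species.items():
        for event in events:
            species_by_event[event].add(species)
    return {event: species
            for event, species in species_by_event.items()
            if len(species) >= min_species}


# Toy data: event keys are (orthogroup, AS type, position index), all made up.
events_by_species = {
    "A. trichopoda": {("OG0001", "IR", 3), ("OG0002", "ES", 1)},
    "N. caerulea": {("OG0001", "IR", 3)},
    "O. sativa": {("OG0002", "ES", 1), ("OG0003", "AA", 2)},
}

shared = find_shared_events(events_by_species)

# Basal-angiosperm-conserved events: detected in both basal angiosperms.
basal = {"A. trichopoda", "N. caerulea"}
basal_conserved = sorted(event for event, species in shared.items() if basal <= species)
print(len(shared), basal_conserved)
```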

Functional Annotation of Nymphaea Protein-Coding Genes

The functional annotation of N. caerulea protein-coding genes was conducted using Trinotate (v3.0.1, http://trinotate.github.io), which coordinates the analyses and collects the results of a number of protein functional annotation tools; details of the parameters used for each are described below. Homology searches between N. caerulea protein-coding genes and the SwissProt plant database were conducted by BLASTP and BLASTX with a maximum E-value of 11e-10 (NCBI BLAST v 2.7.1, Altschul et al., 1990). This SwissProt search also included the KEGG (Kanehisa et al., 2011), Gene Ontology (GO) (The Gene Ontology Consortium, 2000), and EggNOG (Powell et al., 2011) annotation information. Protein domain identification was conducted based on the latest Pfam-A database (Finn et al., 2015) using HMMER (v3.1b2, Finn et al., 2011) with a maximum E-value of 11e-5. Protein signal peptide prediction and transmembrane domain prediction were conducted by SignalP (v 4, Petersen et al., 2011) and TMHMM (v 2, Krogh et al., 2001), respectively. Final functional annotation data from the above analyses were integrated by Trinotate (v 3.0.1).


GO Term Enrichment

The GO term enrichment analyses of genes with shared AS events were conducted with the R package topGO (Alexa et al., 2006) using Fisher’s exact test. GO terms of all N. caerulea protein-coding genes were used as the background. The weight01 algorithm was applied for the Fisher’s exact test.
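For readers unfamiliar with the test, the core of a GO enrichment analysis is a one-sided Fisher's exact test on a 2 x 2 table (genes in the set versus the background, annotated with the term or not). The sketch below computes that P-value for a single GO term from the hypergeometric distribution. It does not reproduce the weight01 algorithm of topGO, which additionally accounts for the GO graph structure; the function name and the toy counts are illustrative assumptions.

```python
from math import comb


def fisher_enrichment_pvalue(hits_in_set, set_size, hits_in_background, background_size):
    """One-sided Fisher's exact test for over-representation of one GO term.

    Returns P(X >= hits_in_set), where X is the number of annotated genes in a
    random draw of set_size genes from a background of background_size genes,
    hits_in_background of which carry the term.
    """
    max_hits = min(set_size, hits_in_background)
    total = comb(background_size, set_size)
    tail = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, set_size - k)
        for k in range(hits_in_set, max_hits + 1)
    )
    return tail / total


# Toy numbers: 20 background genes, 5 annotated with the term;
# 4 genes with shared AS events, 3 of which carry the term.
p_value = fisher_enrichment_pvalue(3, 4, 5, 20)
print(f"{p_value:.4f}")
```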

Principal Component Analysis (PCA) of Shared AS Events in Eight Angiosperms

Principal component analysis (PCA) of shared AS events in eight angiosperms was conducted using the “prcomp” function included in the R statistical programming environment. An AS event that occurs in any two species within the eight angiosperms was recorded as a shared AS event. The results of the PCA were then plotted using the R package ggplot2.

Phylogeny and Divergence Time Estimation

The evolutionary relationships of the studied species were compared to their shared AS patterns. The plastome-based phylogenetic tree for green plants of Ruhfel et al. (2014), which agrees largely with the topology of angiosperms in APG IV (2016) and Soltis et al. (2011), was pruned to the eight species analyzed for shared AS events. This pruned tree was dated using the penalized likelihood program treePL (Smith and O’Meara, 2012), calibrated with 162.2-209.7 million years ago (Mya) as the crown age of angiosperms (Magallón, 2014) and 125 Mya as the crown age of eudicots (Crane et al., 1995; Magallón et al., 1999).

Results

Alternative Splicing in Nymphaea

In total, 159,144 ICE-Quiver polished isoforms and 39,835,016 RNA-Seq reads (150-bp PE) from the leaf, flower, and root tissues, as well as the leaf tissue with stress treatment, were used to identify AS events in N. caerulea. Based on this comprehensive transcriptome dataset, PASA identified 39,067 isoforms that formed 15,264 isoform clusters (loci). Only four types of AS events (intron retention (IR), exon skipping (ES), alternative acceptor (AA), and alternative donor (AD)) were considered among the AS events detected by PASA. The results indicated that 46,876 AS events in 28,342 isoforms and 7,283 gene loci were detected in N. caerulea (Table 3-4). In total, the N. caerulea genome contained 24,764 predicted genes, and 20,948 of them were multi-exonic genes. Similar to other plant species examined (Campbell et al., 2006; Marquez et al., 2012; Chamala et al., 2015; Mei et al., 2017a), the most common type of AS in N. caerulea is IR. Among 7,283 AS genes, 79.06% contained IR event(s). In contrast, ES is the least represented event; only 19.91% of AS genes contained ES event(s).

Of the 46,876 AS events detected in N. caerulea, 38.02% (17,820) were detected by both Iso-SeqTM data and RNA-Seq data (Table 3-5). In contrast, 44.69% of the AS events were detected only by the Iso-SeqTM platform and 17.30% were detected only by RNA-Seq (Table 3-5). The percentage of events detected only by RNA-Seq data varied by AS type. For instance, only 7.75% of the IR events were detected exclusively by Illumina data, whereas the corresponding percentages for the ES, AA, and AD events ranged between 22.26% and 30.89% (Table 3-5).

Alternative Splicing in Amborella

To identify conserved AS between the two basal angiosperms, AS events in A. trichopoda were reanalyzed with 146,686 ICE-Quiver polished isoforms and 186,131,792 RNA-Seq reads (2 x 100 bp) from the leaf, flower, and root tissues. PASA identified 147,381 isoforms that formed 16,866 isoform clusters (loci). The results indicated that 141,970 AS events in 133,589 isoforms and 11,416 gene loci were detected in A. trichopoda (Table 3-6). The A. trichopoda genome has 17,047 multi-exon genes. Thus, 66.97% (11,416 out of 17,047) of the multi-exon genes in A. trichopoda undergo AS. IR is the most common AS event in A. trichopoda, occurring in 78.78% of the AS genes. Consistent with results in other plant species, ES was the least represented event, occurring in only 19.91% of the AS genes.

Of the 141,970 AS events detected in A. trichopoda, 69.64% (98,867) were detected in both Iso-SeqTM data and RNA-Seq data (Table 3-7). In contrast, 0.34% of the AS events were detected only by the Iso-SeqTM platform, and 30.02% of the AS events were detected only within RNA-Seq data (Table 3-7). The percentage of events detected only by RNA-Seq data varied by AS type. For instance, 27.57% of the IR events were detected exclusively by Illumina data, whereas the percentages of Illumina-only AS events of the ES, AA, and AD types varied from 30.10% to 36.65% (Table 3-7).

Shared AS Events in Angiosperms

AS datasets of eight angiosperms (two basal angiosperms, two monocots, and four eudicots) were processed using the pipeline developed by Chamala et al. (2015) to generate an AS coordinate table for each species. It is worth noting that the number of AS coordinates of A. trichopoda and N. caerulea (the Input AS Coordinates column in Table 3-8) is roughly half the number of AS events (the Events column in Table 3-4 and Table 3-6). This is because the retained intron and spliced intron events reported by PASA are represented by a single AS coordinate. Across the eight angiosperms, AS events detected in more than one species were considered shared events. A shared event may have been inherited from a common ancestor, but convergent evolution could also produce the same AS event in different species, and it is challenging to pinpoint the exact mechanism. Both mechanisms are addressed in the Discussion. In total, 22,764 shared events were found. Among these events, the proportions of the four AS types were: 64.59% (14,704 events) IR, 19.80% (4,507 events) AA, 8.53% (1,941 events) AD, and 7.08% (1,612 events) ES.
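The definition of a shared event used here (an event detected in more than one species) can be expressed as a simple set operation. The sketch below is illustrative only: the event keys (orthogroup, AS type, intron index) and the toy data are hypothetical and do not reproduce the AS coordinate format of the Chamala et al. (2015) pipeline.

```python
from collections import Counter


def shared_events(events_by_species):
    """Return {event: set of species} for events seen in at least two species."""
    carriers = {}
    for species, events in events_by_species.items():
        for event in events:
            carriers.setdefault(event, set()).add(species)
    return {event: sp for event, sp in carriers.items() if len(sp) >= 2}


# Hypothetical toy input: each event is (orthogroup, AS type, intron index).
toy = {
    "A. trichopoda": {("OG1", "IR", 3), ("OG2", "ES", 1)},
    "N. caerulea": {("OG1", "IR", 3), ("OG3", "AA", 2)},
    "O. sativa": {("OG3", "AA", 2), ("OG4", "AD", 5)},
}

shared = shared_events(toy)

# Tally shared events by AS type, analogous to the IR/AA/AD/ES proportions.
type_counts = Counter(event[1] for event in shared)

# Number of shared events carried by each species (cf. Table 3-8).
per_species = Counter(sp for carriers in shared.values() for sp in carriers)

print(type_counts, per_species)
```

Note that this comparison says nothing about whether a shared event is inherited or convergent; it only records co-occurrence.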

The taxonomic distribution of these 22,764 shared AS events is shown in Table 3-8 and Figure 3-1. Amborella trichopoda, A. thaliana, and C. acuminata had more events shared with at least one other species than the remaining species examined: 10,057, 10,512, and 11,652 events, respectively. The two monocots analyzed, S. polyrhiza and O. sativa, had far fewer events shared with at least one other species: 1,993 and 2,178 events, respectively. The number of AS coordinates from each species and the size of the input transcriptome data sets are also summarized in Table 3-8. The largest input data set was available for O. sativa (98.40 Gb), but only 17,995 AS coordinates were generated from it for shared AS detection. In contrast, A. trichopoda and N. caerulea both have relatively small input transcriptome data sets (19.06 Gb and 6.78 Gb, respectively), but more AS coordinates were generated (70,315 and 22,910, respectively).

A principal component analysis (PCA) was conducted to compare the similarities of shared AS events among selected angiosperm species. The results show that the two monocots, O. sativa and S. polyrhiza, clustered closely together (Figure 3-5a). In contrast, the four eudicots did not group together based on the shared AS data. Interestingly, the two basal angiosperms, A. trichopoda and N. caerulea, were somewhat separated from the four eudicots according to PC1. Amborella trichopoda was separated from all other species, while N. caerulea grouped most closely with the monocots (Figure 3-5a).

Conserved AS between Basal Angiosperms: Nymphaea and Amborella

In total, 2,199 conserved AS events from 1,471 orthogroups were identified between A. trichopoda and N. caerulea. These 2,199 conserved AS events were distributed in 1,729 A. trichopoda genes and 1,674 N. caerulea genes. The most common type of conserved AS event is IR, which represents 77.67% of all conserved AS events and occurred in 83.00% of the conserved AS orthogroups (Table 3-9). ES events are the least common at both the event and gene levels; only 3.37% of the conserved events, from 4.69% of the conserved AS orthogroups, are ES events. The distribution of different types of AS among the conserved AS events was similar to the distribution of overall AS types observed in A. trichopoda and N. caerulea. At the gene level, 1,451 A. trichopoda genes and 1,388 N. caerulea genes featured IR events. In contrast, 87 A. trichopoda genes and 78 N. caerulea genes had ES events. The AA events were found in 292 A. trichopoda genes and 291 N. caerulea genes. Additionally, 137 A. trichopoda genes and 135 N. caerulea genes had AD events (Table 3-9).

Of the 2,199 events that appear conserved in basal angiosperms, one-third (715 events) were not detected in the other six angiosperms, while 1,484 events were identified in at least one eudicot or monocot species. The taxonomic distribution of these 1,484 events is shown in Figure 3-5b. Far fewer of these conserved AS events were retained in the two monocots, O. sativa and S. polyrhiza, each of which had ~200 events. In contrast, the four eudicots contain an average of 700 AS events conserved with the basal angiosperms.

Highly Conserved AS between Basal Angiosperms, Monocots and Eudicots

In total, 25 AS events from 23 orthogroups were detected in all eight angiosperms. These highly conserved AS events included four ES events, one AD event, seven AA events, and 13 IR events (Table 3-10). Two orthogroups contained more than one highly conserved AS event. For instance, a serine/arginine-rich splicing factor orthogroup, OG0001040, had an exon skipping event and an intron retention event that were highly conserved among all studied species (Table 3-10). Another orthogroup that had two conserved AS events was the serine/threonine-protein kinase orthogroup OG0000750 (Table 3-10). The UniProt gene names for A. thaliana were used to describe the potential function of these orthogroups. These highly conserved AS events were enriched in some protein families. For instance, of the 23 orthogroups, six were annotated as protein kinases and three were annotated as splicing factors.

GO Annotation and Enrichment Analysis of the Conserved or Shared AS Events

Gene Ontology (GO) annotation and GO term enrichment were conducted to further investigate the functional significance of the conserved AS events. The Trinotate pipeline was used to conduct the GO annotation of the N. caerulea protein-coding genes. The GO annotation was based on three sequence similarity searches: (i) the BLASTP analysis between the N. caerulea protein sequences and the plant UniProt/Swiss-Prot database; (ii) the BLASTX analysis between the N. caerulea nucleotide sequences and the plant UniProt/Swiss-Prot database; and (iii) the HMMER search between the N. caerulea protein sequences and the Pfam-A protein database. The Trinotate pipeline uses the results of these searches to generate a comprehensive GO annotation for N. caerulea. In total, 16,825 out of 24,764 (67.94%) protein-coding genes in N. caerulea had GO annotations.

A GO term enrichment analysis of the 1,674 N. caerulea genes that shared conserved AS events with A. trichopoda indicates that 193 GO terms were enriched in these genes (Fisher’s exact test; FDR < 0.05). Of these terms, 88 were in the biological process category, 29 were in the cellular component category, and 76 belonged to the molecular function category. The enriched GO terms in each category were sorted by the FDR value. The top 30 over-represented GO terms in each category (all 29 in the cellular component category) are presented in Figure 3-2, Figure 3-3, and Figure 3-4.
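To illustrate the statistics behind a GO term enrichment test of this kind, the following Python sketch computes a one-sided Fisher’s exact test from the hypergeometric distribution and then adjusts the p-values. Several points are assumptions made for illustration: the text does not state whether the test was one- or two-sided, which background gene set was used, or which FDR procedure was applied (Benjamini-Hochberg is assumed here), and all counts in the toy example are hypothetical.

```python
from math import comb


def fisher_right_tail(k, n, K, N):
    """One-sided Fisher's exact test: P(X >= k) under the hypergeometric null.

    N: background genes, K: background genes annotated with the GO term,
    n: study genes, k: study genes annotated with the GO term.
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical toy counts: (k, K) per GO term, with n study genes out of N.
N, n = 1000, 100
toy_terms = {"term_a": (12, 30), "term_b": (4, 35), "term_c": (9, 40)}

pvals = {term: fisher_right_tail(k, n, K, N) for term, (k, K) in toy_terms.items()}
fdr = dict(zip(pvals, benjamini_hochberg(list(pvals.values()))))
enriched = [term for term, q in fdr.items() if q < 0.05]

print(enriched)
```

In the analysis described above, the study set would correspond to the 1,674 N. caerulea genes with conserved AS events; the appropriate background (for example, the 16,825 GO-annotated genes) is a design choice that affects the resulting p-values.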

Based on the 22,764 shared AS events in eight angiosperms, a PCA analysis grouped the basal angiosperm N. caerulea closely with the two monocots. Thus, GO term enrichment analysis was conducted to investigate the functional annotation classes enriched among the genes underlying the AS events shared by these three species. Here, shared AS events were those shared between N. caerulea and either O. sativa or S. polyrhiza, or among all three. The GO term enrichment analysis for the biological process category detected 57 over-represented GO terms (FDR < 0.05) (Figure 3-6). The enriched GO terms were related to responses to abiotic and biotic stimuli (e.g. GO:0046898, GO:0071217, GO:0001666, GO:1902074), cellular metabolism (e.g. GO:0071586, GO:0046167, GO:0080065), mRNA stabilization (e.g. GO:0048255), and tissue development (e.g. GO:1902183).
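The selection rule for these events (present in N. caerulea and in at least one of the two monocots) reduces to an intersection with a union. The short sketch below shows that rule and the step from events to the N. caerulea gene list used for enrichment; the event identifiers, gene identifiers, and function name are hypothetical.

```python
def aquatic_shared_events(nymphaea, oryza, spirodela):
    """Events found in N. caerulea and in at least one of the two monocots."""
    return nymphaea & (oryza | spirodela)


# Hypothetical event identifiers mapped to hypothetical N. caerulea gene IDs.
event_to_gene = {"ev1": "Nc_g1", "ev2": "Nc_g2", "ev3": "Nc_g2", "ev4": "Nc_g3"}

nymphaea = {"ev1", "ev2", "ev3", "ev4"}
oryza = {"ev1", "ev5"}
spirodela = {"ev1", "ev3"}

events = aquatic_shared_events(nymphaea, oryza, spirodela)

# Collapse events to the unique genes that carry them (the enrichment study set).
study_genes = sorted({event_to_gene[event] for event in events})

print(sorted(events), study_genes)
```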

Discussion

Iso-SeqTM or RNA-Seq for AS Analysis

In Chapter 2, I discussed the advantages of Iso-SeqTM in transcriptome analysis. Results from that chapter address the advantages of Iso-SeqTM for de novo transcriptome analysis, as well as the improvement of genome annotation using Iso-SeqTM full-length isoforms. In this chapter, I compared the capability of two sequencing approaches, Iso-SeqTM and RNA-Seq, for reference-based AS analysis.

In N. caerulea, the Iso-SeqTM platform detected more AS events than the RNA-Seq approach (Table 3-5); 44.69% of the AS events were supported only by Iso-SeqTM, while only 17.30% were supported only by RNA-Seq. However, in A. trichopoda, I observed the opposite trend: more AS events were detected only by RNA-Seq (30.02% of the AS events) than only by Iso-SeqTM (0.34% of the AS events; Table 3-7).

To interpret this inconsistency between N. caerulea and A. trichopoda, I compared the size of the input data for each species. The Iso-SeqTM inputs for N. caerulea and A. trichopoda were in the same range: 159,144 vs. 146,686 ICE-Quiver polished isoforms, respectively. However, for the RNA-Seq data, the sequencing depth of A. trichopoda is approximately five-fold that of N. caerulea (186,131,792 2x100bp reads vs. 39,835,016 2x150bp reads). Thus, I speculate that the size of the RNA-Seq data set could significantly influence the capability of identifying AS events.
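The depth comparison can be made explicit with a few lines of arithmetic. The sketch below assumes that both read totals count individual reads (the A. trichopoda total equals the sum of the Number of reads values in Table 3-1) and treats the nominal read lengths as exact; under those assumptions the ratio is close to five-fold in read number and somewhat lower in sequenced bases, because the N. caerulea reads are longer.

```python
# RNA-Seq read totals and nominal read lengths quoted in the text.
amborella_reads, amborella_read_len = 186_131_792, 100  # 2x100bp
nymphaea_reads, nymphaea_read_len = 39_835_016, 150     # 2x150bp

# Ratio by number of reads.
read_ratio = amborella_reads / nymphaea_reads

# Ratio by total sequenced bases (reads x read length).
amborella_bases = amborella_reads * amborella_read_len
nymphaea_bases = nymphaea_reads * nymphaea_read_len
base_ratio = amborella_bases / nymphaea_bases

print(f"read ratio: {read_ratio:.2f}, base ratio: {base_ratio:.2f}")
```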

The results also show that the Iso-SeqTM data support more IR events than the Illumina data. For instance, most of the retained intron events in N. caerulea (70.54%) were supported by Iso-SeqTM-only data, whereas just 7.75% were supported by Illumina-only data (Table 3-5). A similar, though smaller, discrepancy was observed for the spliced intron events: 26.98% were supported by Iso-SeqTM-only data, while only 17.54% were supported by Illumina-only data. This discrepancy suggests that our Iso-SeqTM data detected more IR events than were represented within the Illumina data.

Different AS Frequencies in Two Basal Angiosperms

In both N. caerulea and A. trichopoda, the most common type of AS event was IR, and the rarest type was ES. This is consistent with observations in other species (Barbazuk et al., 2008; Chamala et al., 2015; Mei et al., 2017a). Regarding the frequency of genes exhibiting AS in each species, I observed that 34.77% of multi-exonic genes in N. caerulea exhibit AS, vs. ~67% in A. trichopoda. The higher frequency of genes exhibiting AS in A. trichopoda may reflect the greater sequencing depth of the A. trichopoda transcriptome relative to that of N. caerulea (e.g. five-fold for RNA-Seq data). Another contributing factor is that more of the protein-coding genes in N. caerulea are multi-exonic: 84.59% of protein-coding genes (20,948 out of 24,764) are multi-exonic in N. caerulea, while this percentage in A. trichopoda is 62.92% (17,047 out of 27,095).
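The frequencies quoted above can be reproduced from the gene counts reported in this chapter. The following is a small Python sketch of that calculation; the dictionary keys and the helper name pct are illustrative assumptions.

```python
def pct(part, whole):
    """Percentage of part in whole, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Gene counts as reported in the text for each species.
counts = {
    "N. caerulea": {"genes": 24764, "multi_exonic": 20948, "as_genes": 7283},
    "A. trichopoda": {"genes": 27095, "multi_exonic": 17047, "as_genes": 11416},
}

# Share of multi-exonic genes that exhibit AS.
as_frequency = {
    sp: pct(c["as_genes"], c["multi_exonic"]) for sp, c in counts.items()
}

# Share of protein-coding genes that are multi-exonic.
multi_exonic_share = {
    sp: pct(c["multi_exonic"], c["genes"]) for sp, c in counts.items()
}

print(as_frequency, multi_exonic_share)
```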

Dynamic AS Changes during Angiosperm Evolution

These analyses revealed significant differences between the two monocots and four eudicots in terms of AS events shared with A. trichopoda and N. caerulea (Table 3-8, Figure 3-1). In all, the eight angiosperms contain 22,764 AS events shared between a minimum of two species. The two monocot species had considerably lower numbers of shared AS events (2,178 in O. sativa and 1,993 in S. polyrhiza; Table 3-8), contributing to the significant distance from the cluster of eudicots in the PCA analysis (Figure 3-5a). The dissimilarity between monocots and eudicots in the shared AS events generally supports the Ancestral Inheritance Hypothesis. This low number of shared AS events could result from the generally low number of AS events reported for both species. This is not due to a lack of sequencing depth, however, as O. sativa has the largest input transcriptome data set. This low number of shared AS events in monocots relative to eudicots was not observed in previous studies. For example, data from Chamala et al. (2015) and Mei et al. (2017a) suggested that more shared AS events were detected in monocot species (e.g. O. sativa and Zea mays) than in eudicots and A. trichopoda. Both of these studies combined data from multiple sources and/or developmental stages, contributing to the large number of AS events in general. Because the transcriptome data used in the present analysis were selected to be consistent with the tissues and developmental stage sampled in the two basal species, the discrepancy between these results and past studies may reflect the impact of developmental stage on the expression and patterns of AS events. Indeed, AS during seed development in maize was found to be dynamic (Mei et al., 2017b).

Conserved AS in Basal Angiosperms

In total, 2,199 conserved AS events were identified between the basal angiosperms A. trichopoda and N. caerulea. To gain a better understanding of the functional significance of these conserved AS events, I conducted a GO term enrichment analysis on the genes that produce them. The results indicated that conserved AS events were enriched in genes involved in the photosynthetic process, mRNA transcription, stress response, and DNA methylation. Protein kinases and genes involved in mRNA processing, splicing, and splice site selection were particularly likely to contain conserved AS events. For instance, the snRNP-U1-70K gene, the pre-mRNA-processing factor 40A gene, and several splicing factors showed conserved AS between A. trichopoda and N. caerulea. All of these genes are involved in fundamental functions during plant development and reproduction.

Of the 2,199 basal-angiosperm-conserved AS events, 715 events were lost after the common ancestor of monocots and eudicots. The GO enrichment of the genes that underwent these 715 events suggests that their functions were diverse, including DNA methylation, stress response, photosynthesis, and tRNA synthesis. Among the genes related to mRNA processing and splicing, the conserved AS events in the snRNP-U1-70K gene were lost in all monocots and eudicots, while those in the pre-mRNA-processing factor 40A gene and splicing factor genes were kept. A total of 1,484 basal-angiosperm-conserved AS events were kept in at least one eudicot or monocot. Far fewer of these conserved AS events were retained in the two monocots than in eudicots (Figure 3-5b). Again, the sharp contrast in basal-angiosperm-conserved AS events between monocots and eudicots is generally consistent with the Ancestral Inheritance Hypothesis.

I also found that 25 AS events were highly conserved in all eight angiosperms (Table 3-10). Functional annotation of these AS genes suggested that 40% of these highly conserved AS events are in genes that encode protein kinases and serine/arginine-rich splicing factors. These results suggest that conserved AS may not be a random process, but has functional importance and evolutionary implications. Genes with conserved AS are concentrated in some specific functions, and the gain/loss of the AS events is related to some extent to phylogeny.

Shared AS Events in Aquatic Species (Habitat Effects)

The PCA analysis based on shared AS events among the eight angiosperm species grouped N. caerulea with the two monocots. This result does not support the Ancestral Inheritance Hypothesis, which holds that conserved AS patterns among species should be associated with phylogeny. One similarity among these three species is their habitat: Nymphaea caerulea, S. polyrhiza, and O. sativa all have aquatic habitats. This result leads to a new hypothesis: environmental factors might preserve certain AS events, reflecting selection for these AS events in distantly related species. To test this hypothesis, I first compiled a list of the “aquatic-shared AS” events, which included the AS events that were detected in N. caerulea and at least one monocot species. Then, GO enrichment analysis was conducted on the N. caerulea genes that have those aquatic-shared AS events (Figure 3-6). The enriched GO terms were related to responses to abiotic and biotic stimuli and to cellular metabolism. The GO terms response to hypoxia (GO:0001666, FDR=0.03802) and response to salt (GO:1902074, FDR=0.00677) were over-represented in the genes with shared AS between N. caerulea and S. polyrhiza. This is consistent with the fact that the aquatic habitat usually has lower oxygen concentration and higher salt concentrations than soil. This suggests that, besides the evolutionary implications, environmental factors may also affect the AS pattern.

Convergent evolution may be responsible for the similar AS patterns among these aquatic species. Some AS events that were not present in the last common ancestor of these species might have been selected for by the aquatic habitat, leading to convergent evolution. Alternatively, these AS events might have been present in the last common ancestor and maintained by selection in the aquatic habitat, indicative of conserved AS events. It is technically difficult to differentiate whether these AS events were conserved during evolution or were the results of convergent evolution. The AS patterns of their common ancestors would be key to answering this question. Adding more species to this analysis and conducting ancestral state reconstruction would be fruitful avenues for future research.

Table 3-1. Amborella RNA-Seq data.

Tissue         NCBI SRA accession number  Number of read pairs  Number of reads  Read type
Old leaf       SRR5293261                 17,385,573            34,771,146       100PE
Young leaf     SRR5293262                 18,325,611            36,651,222       100PE
Male flower    SRR5293266                 10,672,610            21,345,220       100PE
Male flower    SRR5293269                 9,722,859             19,445,718       100PE
Female flower  SRR5293273                 9,815,015             19,630,030       100PE
Female flower  SRR5293276                 9,318,353             18,636,706       100PE
Old root       SRR5293260                 17,825,875            35,651,750       100PE

Table 3-2. Nine species for orthology analysis.

Species                     Genome and annotation
Amborella trichopoda        V6, Unpublished
Nymphaea caerulea           V1, Unpublished
Aquilegia coerulea          V 3.1, https://phytozome.jgi.doe.gov
Vitis vinifera              https://cantulab.github.io/data.html
Arabidopsis thaliana        Araport11, https://phytozome.jgi.doe.gov/
Camptotheca acuminata       https://datadryad.org/resource/doi:10.5061/dryad.nc8qr
Solanum lycopersicum        ITAG3.2, https://solgenomics.net/organism/Solanum_lycopersicum/genome
Spirodela polyrhiza         V 2, https://phytozome.jgi.doe.gov/
Oryza sativa ssp. japonica  MSU, unpublished version

Table 3-3. Eight species for conserved AS analysis.

Species                 Tissue (RNA-Seq)                      Tissue (Iso-SeqTM)            Tissue (EST)
Amborella trichopoda    Young leaf, Old leaf, Root,           Leaf, Female flower           N/A
                        Female flower, Male flower
Nymphaea caerulea       Leaf, Flower                          Leaf, Root, Flower,           N/A
                                                              Leaf with stress treatment
Vitis vinifera          Leaf                                  N/A                           Leaf, Flower, Root
Arabidopsis thaliana    Leaf, Root, Flower                    N/A                           N/A
Camptotheca acuminata   Young leaf, Old leaf, Root, Flower    N/A                           N/A
Solanum lycopersicum    3-week-old leaf, 4-week-old leaf,     N/A                           N/A
                        7-week-old leaf, 4-week-old root,
                        7-week-old, 3-week-old flower
Spirodela polyrhiza     Frond, Turion                         N/A
Oryza sativa            Leaf, Root, Flower                    N/A                           N/A

Table 3-4. AS events in Nymphaea.

AS Type               Events            Assemblies        Loci (genes)
Retained Intron       15,374 (32.80%)   12,652 (44.64%)   5,758 (79.06%)
Spliced Intron        15,374 (32.80%)   18,142 (64.01%)   5,758 (79.06%)
Retained Exon         2,156 (4.60%)     3,616 (12.76%)    1,450 (19.91%)
Skipped Exon          1,981 (4.23%)     3,479 (12.28%)    1,450 (19.91%)
Alternative Acceptor  7,034 (15.01%)    10,963 (38.68%)   2,699 (37.06%)
Alternative Donor     4,957 (10.57%)    8,139 (28.72%)    2,041 (28.02%)
Total                 46,876            28,342            7,283

Table 3-5. The performance of Iso-SeqTM and RNA-Seq in AS detection in Nymphaea.

AS Type               Iso-SeqTM only    Illumina only    Both platforms    Total
Retained Intron       10,845 (70.54%)   1,192 (7.75%)    3,337 (21.71%)    15,374
Spliced Intron        4,148 (26.98%)    2,697 (17.54%)   8,529 (55.48%)    15,374
Retained Exon         919 (42.63%)      480 (22.26%)     757 (35.11%)      2,156
Skipped Exon          715 (36.09%)      612 (30.89%)     654 (33.01%)      1,981
Alternative Acceptor  2,436 (34.63%)    1,882 (26.76%)   2,716 (38.61%)    7,034
Alternative Donor     1,884 (38.01%)    1,246 (25.14%)   1,827 (36.86%)    4,957
Total four types AS   20,947 (44.69%)   8,109 (17.30%)   17,820 (38.02%)   46,876

Table 3-6. AS events in Amborella.

AS Type               Events            Assemblies        Loci (genes)
Retained Intron       34,861 (24.56%)   70,021 (52.42%)   8,993 (78.78%)
Spliced Intron        34,861 (24.56%)   103,290 (77.32%)  8,993 (78.78%)
Retained Exon         13,174 (9.28%)    48,295 (36.15%)   5,338 (46.76%)
Skipped Exon          10,666 (7.51%)    41,693 (31.21%)   5,338 (46.76%)
Alternative Acceptor  28,546 (20.11%)   92,821 (69.48%)   7,378 (64.63%)
Alternative Donor     19,862 (13.99%)   74,814 (56.00%)   5,905 (51.73%)
Total                 141,970           133,589           11,416

Table 3-7. The performance of Iso-SeqTM and RNA-Seq in AS detection in Amborella.

AS Type               Iso-SeqTM only    Illumina only     Both platforms    Total
Retained Intron       2,693 (7.72%)     18,200 (52.21%)   13,968 (40.07%)   34,861
Spliced Intron        1,079 (3.10%)     17,119 (49.11%)   16,663 (47.80%)   34,861
Retained Exon         514 (3.90%)       8,137 (61.77%)    4,523 (34.33%)    13,174
Skipped Exon          540 (5.06%)       6,723 (63.03%)    3,403 (31.91%)    10,666
Alternative Acceptor  1,143 (4.00%)     16,527 (57.90%)   10,876 (38.10%)   28,546
Alternative Donor     767 (3.86%)       11,932 (60.07%)   7,163 (36.06%)    19,862
Total four types AS   6,736 (4.74%)     78,638 (55.39%)   56,596 (39.86%)   141,970

Table 3-8. Shared AS between eight tested angiosperms.

Species                 Input data (# of bases)   # of Input AS coordinates  # of Shared AS
Amborella trichopoda    19.06G                    70,315                     10,057
Nymphaea caerulea       6.78G                     22,910                     4,932
Vitis vinifera          39.42G                    28,377                     7,717
Arabidopsis thaliana    73.59G                    60,771                     10,512
Camptotheca acuminata   44.33G                    58,054                     11,652
Solanum lycopersicum    43.21G                    49,201                     9,214
Spirodela polyrhiza     36.97G                    7,409                      1,993
Oryza sativa            98.40G                    17,995                     2,178

Table 3-9. Conserved AS between A. trichopoda and N. caerulea.

AS Type               Events           Orthogroups      Genes (A. trichopoda / N. caerulea)
Intron retention      1,708 (77.67%)   1,221 (83.00%)   1,451 / 1,388
Exon skipping         74 (3.37%)       69 (4.69%)       87 / 78
Alternative acceptor  287 (13.05%)     271 (18.42%)     292 / 291
Alternative donor     130 (5.91%)      126 (8.57%)      137 / 135
Total                 2,199            1,471            1,729 / 1,674

Table 3-10. Genes with AS events that are highly conserved in eight angiosperms.

Orthogroup  AS type      Gene name                                                                         Arabidopsis orthologs
OG0000016   IntronR      G-type lectin S-receptor-like serine/threonine-protein kinase                     AT1G11330, AT4G21380, AT4G21390, AT1G11340, AT1G11350, AT1G65800, AT1G11280, AT1G11300, AT1G61490, AT4G27300, AT1G65790, AT1G61430, AT1G11410, AT1G61480, AT1G61360, AT1G61610, AT4G11900, AT4G27290, AT1G11303
OG0000045   IntronR      Cysteine-rich receptor-like protein kinase                                        AT4G23280, AT4G21410, AT4G21400, AT3G45860, AT4G23200, AT4G23140, AT4G23270, AT4G23130, AT4G00970, AT4G23220, AT4G21230, AT4G23180, AT4G23190
OG0000095   AltA         Unknown protein                                                                   AT5G18460
OG0000117   IntronR      Sugar transport protein                                                           AT3G19930, AT1G34580, AT1G11260, AT5G61520
OG0000400   IntronR      Ethylene-responsive transcription factor                                          AT5G67180, AT2G28550, AT5G60120, AT4G36920
OG0000625   IntronR      CBL-interacting serine/threonine-protein kinase                                   AT1G01140, AT1G30270, AT2G26980
OG0000664   IntronR      Receptor-like cytosolic serine/threonine-protein kinase                           AT5G65530, AT5G10520, AT3G05140
OG0000693   IntronR      THO complex subunit                                                               AT1G66260, AT5G02530
OG0000731   IntronR      Protein kinase ATN1                                                               AT5G01850, AT5G40540, AT3G27560, AT5G50180
OG0000747   ES           Serine/arginine-rich splicing factor                                              AT4G25500, AT5G52040, AT3G61860
OG0000750   ES, IntronR  Serine/threonine-protein kinase                                                   AT4G24740, AT3G53570, AT4G32660
OG0000940   IntronR      Probable methyltransferase                                                        AT3G51070, AT2G34300
OG0001040   ES, IntronR  Serine/arginine-rich splicing factor                                              AT1G07350, AT4G35785
OG0001279   IntronR      Protein NDL1/NDL2                                                                 AT5G11790, AT5G56750
OG0001410   AltA         Putative chloride channel-like protein CLC-g                                      AT5G33280
OG0001739   AltD         Serine/threonine protein phosphatase 2A 55 kDa regulatory subunit B beta isoform  AT1G17720
OG0001950   ES           Polyadenylate-binding protein RBP45                                               AT5G54900, AT1G11650
OG0002772   AltA         Serine/arginine-rich splicing factor                                              AT3G53500, AT2G37340
OG0004743   AltA         CLIP-associated protein                                                           AT2G20190
OG0006247   IntronR      Dihydrodipicolinate reductase-like protein CRR1                                   AT5G52100
OG0008480   AltA         Peptidyl-prolyl cis-trans isomerase                                               AT3G01480
OG0009273   AltA         Subtilisin-like protease SBT6.1                                                   AT5G19660
OG0009285   AltA         Lysophospholipid acyltransferase LPEAT2                                           AT2G45670


Figure 3-1. The taxonomic distribution of the 22,764 shared AS events.

[Bar chart. Horizontal axis: 0.00% to 6.00%. GO terms shown: cell surface receptor signaling pathway; protein autophosphorylation; transmembrane transport; photosystem II oxygen evolving complex a...; mRNA processing; regulation of RNA splicing; defense response to bacterium, incompati...; galactose metabolic process; regulation of defense response to bacter...; methylglyoxal catabolic process to D-lac...; N-glycan processing; chlorophyll biosynthetic process; vacuole organization; MAPK cascade; chloroplast organization; mRNA splicing, via spliceosome; protein stabilization; peptidyl-lysine trimethylation; purine nucleotide transport; capsule polysaccharide biosynthetic proc...; negative regulation of shoot apical meri...; positive regulation of shoot apical meri...; glutamyl-tRNA aminoacylation; prolyl-tRNA aminoacylation; S-adenosylhomocysteine catabolic process; cellular response to cytokine stimulus; detection of cytokinin stimulus; regulation of transcription from RNA pol...; UDP-L-arabinose biosynthetic process; glucose homeostasis.]

Figure 3-2. Top 30 over-represented GO terms in N. caerulea genes among the 2,199 basal-angiosperm-conserved AS events - Biological Process.

[Bar chart. Horizontal axis: 0.00% to 18.00%. GO terms shown: chloroplast stroma; chloroplast membrane; chloroplast; nuclear speck; mitochondrial inner membrane; plant-type vacuole membrane; integral component of Golgi membrane; trans-Golgi network transport vesicle me...; early endosome; mitochondrial matrix; P-body; chloroplast thylakoid lumen; peribacteroid membrane; aminoacyl-tRNA synthetase multienzyme co...; pre-autophagosomal structure membrane; spliceosomal complex; U1 snRNP; Set1C/COMPASS complex; chloroplast nucleoid; chloroplast thylakoid membrane; trans-Golgi network; mediator complex; PML body; nuclear pore cytoplasmic filaments; root hair tip; Golgi medial cisterna; nuclear membrane; lysosomal membrane; cytoplasmic vesicle.]

Figure 3-3. Top 29 over-represented GO terms in N. caerulea genes among the 2,199 basal-angiosperm-conserved AS events - Cellular Component.

[Bar chart. Horizontal axis: 0.00% to 20.00%. GO terms shown: protein serine/threonine/tyrosine kinase...; ATP binding; GTP binding; UDP-glucose 4-epimerase activity; transmembrane receptor protein serine/th...; protein serine/threonine phosphatase act...; ADP binding; calcium ion binding; protein self-association; [ribulose-bisphosphate carboxylase]-lysi...; mannosyl-oligosaccharide 1,2-alpha-manno...; structural constituent of cytoskeleton; 3-methyl-2-oxobutanoate dehydrogenase (2...; proline-tRNA ligase activity; adenosylhomocysteinase activity; glutamate-tRNA ligase activity; ATPase-coupled protein transmembrane tra...; dihydroorotate dehydrogenase activity; mRNA (2'-O-methyladenosine-N6-)-methyltr...; ion channel binding; digalactosyldiacylglycerol synthase acti...; polynucleotide adenylyltransferase activ...; 1-acyl-2-lysophosphatidylserine acylhydr...; 1,2-alpha-L-fucosidase activity; phosphatidylserine 1-acylhydrolase activ...; magnesium-protoporphyrin IX monomethyl e...; 2-carboxy-D-arabinitol-1-phosphatase act...; UDP-arabinose 4-epimerase activity; glycerone kinase activity; 12-oxophytodienoate reductase activity.]

Figure 3-4. Top 30 over-represented GO terms in N. caerulea genes among the 2,199 basal-angiosperm-conserved AS events - Molecular Function.

Figure 3-5. Phylogenetic distribution of conserved AS. a. PCA on 22,764 shared AS events among eight angiosperms; AS events detected in more than one species were considered shared events. b. Distribution of the 2,199 basal-angiosperm-conserved AS events in eudicots and monocots. The number in the pink block after each species represents how many basal-angiosperm-conserved AS events were kept in that species (numbers in brackets are the events that are also conserved in the basal species of this clade). Each orange spot represents one WGD event.

[Bar chart. Horizontal axis: 0.00% to 5.00%. GO terms shown: cellular response to external biotic sti...; response to silver ion; response to cycloheximide; regulation of shoot apical meristem deve...; response to nematode; meristem maintenance; mRNA stabilization; glycerol-3-phosphate biosynthetic proces...; 4-alpha-methyl-delta7-sterol oxidation; etioplast organization; sphingolipid catabolic process; fumarate transport; negative regulation of mitochondrial cal...; histone H2A acetylation; negative regulation of exoribonuclease a...; stem cell fate determination; GDP-L-fucose salvage; photoreactive repair; response to salt; plastid organization; polarity specification of adaxial/abaxia...; oligopeptide transport; CAAX-box protein processing; negative regulation of anion channel act...; mucilage pectin biosynthetic process; negative regulation of microtubule depol...; rhamnogalacturonan II biosynthetic proce...; succinate transport; thiamine pyrophosphate transport; response to microbial phytotoxin; cellular response to DNA damage stimulus; mucilage extrusion from seed coat; lipid transport; galacturonan metabolic process; photoinhibition; UV protection; regulation of histone H3-K4 methylation; intracellular lipid transport; glycerol catabolic process; homogentisate catabolic process; isoleucyl-tRNA aminoacylation; regulation of embryonic development; mucilage metabolic process involved in s...; cellular calcium ion homeostasis; protein deubiquitination; sterol biosynthetic process; positive regulation of proteasomal ubiqu...; tyrosine catabolic process; phosphatidylinositol-mediated signaling; positive regulation of unidimensional ce...; response to hypoxia; actin filament-based process; histone H4 acetylation; regulation of proton transport; L-serine biosynthetic process; glycerol-3-phosphate catabolic process; autophagy of peroxisome.]

Figure 3-6. Over-represented GO terms in N. caerulea genes that have aquatic-shared AS events - Biological Process.


CHAPTER 4
DE NOVO ALTERNATIVE SPLICING DETECTION IN TRAGOPOGON

Introduction

Whole-genome duplication (WGD; polyploidy) is widespread in eukaryotes, including fungi, animals, and particularly green plants (e.g. Kellis et al., 2004; Gordon et al., 2009; Moghadam et al., 2009; Jose and Dufresne, 2010; Adolfsson et al., 2010; Soltis and Soltis, 2012; Husband et al., 2013). Recent genomic studies indicate that all angiosperms have experienced at least one round of WGD and that most lineages show repeated patterns of WGD (Jiao et al., 2011; Amborella Genome Project, 2013; Wendel, 2015; Soltis et al., 2016). Polyploid genomes are dynamic, with changes in gene number, expression, and regulation relative to their diploid progenitors (Lynch and Conery, 2000; Flagel and Wendel, 2010; Roulin et al., 2013; Yoo et al., 2014; Wendel et al., 2012; Wendel, 2015; Wendel et al., 2016). However, it is unclear whether or not such modifications are sufficient to generate the phenotypic novelty that characterizes most polyploids and has allowed many polyploids to occupy new habitats. RNA alternative splicing (AS) is a major regulatory mechanism and a major factor promoting functional innovation and proteomic diversity, even when genetic variation is limited (Barbazuk et al., 2008; Syed et al., 2012; Reddy et al., 2013). Yet, very few studies have explored the effect of WGD on AS. In fact, a genome-wide comparison of AS has yet to be conducted in any naturally occurring allopolyploid and its diploid parents, and the impact of WGD on AS is therefore largely unknown.

Reddy et al. (2013) described three evolutionary models of AS after gene duplication. First, the “function-sharing model” suggests a negative correlation between AS and gene duplication (Su et al., 2006); that is, duplicated genes undergo less AS than single-copy genes. Second, the “accelerated AS model” suggests a positive correlation between gene duplication and AS (Jin et al., 2008); that is, genes progressively gain new splice variants with duplication and time. Third, the independent model predicts no causal relationship between gene duplication and the amount of AS (Reddy et al., 2013). Thus, the question now is which model best describes the relationship between AS and polyploidy (WGD).

An evolutionary model system such as Tragopogon is ideal for studying the impact of WGD on AS, both immediately after polyploidization and in established polyploid species. A draft genome of T. dubius, together with its annotation, is now available (see Chapter 5). However, until early 2018, the genomic resources for the Tragopogon polyploidy system were very limited. A reference genome for Tragopogon was not available when I started the work described in this chapter, and the lack of a reference genome made it extremely challenging to study the effects of WGD on AS.

This chapter describes my attempt to study AS in T. dubius without a reference genome. I used the de novo pipeline described in Chapter 2 to detect AS events in T. dubius. The original goal was to develop PCR primers for a set of AS genes (e.g., 50 genes) and then to compare AS patterns between Tragopogon diploids and tetraploids. My hypothesis was that AS frequency would decrease in the allotetraploid T. miscellus relative to its diploid parents, T. dubius and T. pratensis. However, because of the low efficiency of the de novo pipeline in identifying AS in T. dubius, I obtained PCR primers for only a small number of genes with confirmed AS events (five genes). The low success rate of the de novo pipeline also made it infeasible to acquire enough AS genes for the diploid vs. tetraploid comparison. I therefore ended my attempt at de novo AS detection in Tragopogon and focused on building a T. dubius draft genome (see Chapter 5). This short chapter describes the development and testing of primers in T. dubius only.

Methods and Materials

Tragopogon dubius RNA Sample Extraction and Iso-SeqTM Sequencing

Eight-week-old leaves of T. dubius (Soltis & Soltis 2674-3-4) were collected for RNA sample preparation. RNA was extracted from leaf tissue using the CTAB method and the RNeasy Mini extraction kit (Qiagen, Germantown, MD, USA). The TURBO DNA-free Kit (Invitrogen, Carlsbad, CA, USA) was used to digest DNA in the RNA samples. The concentration of each RNA sample was checked using the QUBIT® Fluorometer (Life Technologies, Carlsbad, CA, USA), and RNA integrity was checked using a Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA). The cDNA synthesis and library construction followed the PacBio Iso-SeqTM protocol used at the Interdisciplinary Center for Biotechnology Research (ICBR; http://www.biotech.ufl.edu/) at the University of Florida (UF), Gainesville, FL. The mRNAs with polyA tails were reverse-transcribed to full-length cDNA using the Clontech SMARTerTM PCR cDNA Synthesis Kit. The cDNAs were then size-selected into 1-2-kb, 2-3-kb, and 3-6-kb fractions with BluePippinTM and converted to SMRTbell libraries. In total, the Iso-SeqTM project required eight SMRT cells: two for the 1-2-kb fraction, three for the 2-3-kb fraction, and three for the 3-6-kb fraction. All sequencing was performed using the PacBio RS II system with P6-C4 polymerase and chemistry at the UF ICBR.

Tragopogon dubius Iso-SeqTM Data Processing

For the leaf Iso-SeqTM data, SMRT Analysis (v2.3.0) was used to pre-process the raw data with the parameters “--minPredictedAccuracy 90” and “--minFullPasses 2” for the module “ConsensusTools.sh CircularConsensus” and “--min_seq_len 300” for the module “pbtranscript.py classify”. The resulting dataset is referred to as the leaf Iso-SeqTM ROIs dataset.
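The effect of these three thresholds can be illustrated with a minimal Python sketch. It is not the SMRT Analysis implementation: the `Roi` record, its field names, and the example values are hypothetical, the two modules are collapsed into a single predicate, and the thresholds are assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class Roi:
    """Hypothetical summary of one read of insert (ROI)."""
    name: str
    predicted_accuracy: float  # percent, 0-100
    full_passes: int           # number of full passes over the insert
    length: int                # sequence length in bp


def passes_filters(roi, min_accuracy=90.0, min_full_passes=2, min_len=300):
    """Return True if the ROI meets all three cutoffs used in this chapter."""
    return (
        roi.predicted_accuracy >= min_accuracy
        and roi.full_passes >= min_full_passes
        and roi.length >= min_len
    )


# Made-up records, one failing each criterion in turn.
rois = [
    Roi("roi_a", 95.2, 3, 1800),  # passes all filters
    Roi("roi_b", 88.0, 4, 2100),  # predicted accuracy too low
    Roi("roi_c", 92.0, 1, 1500),  # too few full passes
    Roi("roi_d", 99.0, 5, 250),   # too short
]
kept = [roi.name for roi in rois if passes_filters(roi)]
print(kept)
```

Only the first record survives in this toy example, because each of the others fails exactly one cutoff.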

Tragopogon dubius AS Candidates from de novo Detection

The T. dubius AS candidates were first detected using the workflow described in Chapter 2, with the Iso-SeqTM ROIs dataset fed directly to the de novo pipeline. Subsequently, to increase the accuracy of the de novo detection and reduce false positives in the RT-PCR validation, I modified the original workflow to take the expression level of AS isoforms into account (Figure 4-1). Briefly, all T. dubius leaf Iso-SeqTM ROIs were clustered using CD-HIT (Li and Godzik, 2006). CD-HIT representative reads from clusters that contained at least two reads were fed into the de novo detection pipeline. The resulting candidate AS transcripts (ROIs) were then used as reference transcripts for mapping the leaf Iso-SeqTM ROIs dataset with BLASR (v20160518, Chaisson and Tesler, 2012). Finally, the AS candidate transcript alignments were manually checked in Geneious, followed by BLAST analysis against the Arabidopsis TAIR database and open reading frame (ORF) prediction.
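The cluster-support step of the modified workflow can be sketched as follows. This is an illustration only, not the script used in this study: it assumes the standard CD-HIT `.clstr` text layout (a `>Cluster N` header followed by one member per line, with the representative marked by a trailing `*`), and the function name and read names are hypothetical.

```python
import re


def representatives_with_support(clstr_text, min_members=2):
    """Return representative read names of clusters with >= min_members reads."""
    clusters = []  # one dict per cluster: representative name and member count
    for raw_line in clstr_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append({"rep": None, "size": 0})
            continue
        # Member lines look like: "0<TAB>1833nt, >roi_1... *"
        match = re.search(r">(.+?)\.\.\.", line)
        if match is None or not clusters:
            continue
        clusters[-1]["size"] += 1
        if line.endswith("*"):
            clusters[-1]["rep"] = match.group(1)
    return [
        c["rep"] for c in clusters
        if c["rep"] is not None and c["size"] >= min_members
    ]


# Made-up cluster file content: two supported clusters and one singleton.
example = "\n".join([
    ">Cluster 0",
    "0\t1833nt, >roi_1... *",
    "1\t1790nt, >roi_2... at +/99.10%",
    ">Cluster 1",
    "0\t900nt, >roi_3... *",
    ">Cluster 2",
    "0\t1500nt, >roi_4... at +/98.75%",
    "1\t2100nt, >roi_5... *",
    "2\t1480nt, >roi_6... at +/99.00%",
])
supported = representatives_with_support(example)
print(supported)
```

Requiring at least two reads per cluster acts as a crude expression filter: an isoform seen only once in the Iso-SeqTM data is not carried forward as a candidate.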

Population-Level RT-PCR Validation

Leaf RNA from 10 T. dubius populations (Table 4-1) was used for RT-PCR validation. Primers for each putative event were designed in the flanking region of the “AS Gap” (see Figure 2-1a). First-strand cDNAs were synthesized using the iScript kit (Bio-Rad) following the manufacturer’s protocol. Taq DNA polymerase (ApexTM) was used to perform PCR, and PCR products were checked using 1.5% agarose gel electrophoresis.

Results

Tragopogon dubius Iso-SeqTM Dataset

The Iso-SeqTM sequencing of T. dubius leaf tissue generated valuable transcriptomic information. This standard eight-SMRT-cell Iso-SeqTM experiment generated 235,890 ROIs with an average length of 1,833 bp (432 Mb total, Figure 4-2). Because T. dubius lacked a reference genome when I did this analysis, I used stricter cutoff values than those used for Amborella in raw data processing and ROI generation: all ROIs in the output were required to have a minimum predicted accuracy of 90% and at least two full passes.

Primer Design and RT-PCR Validation of AS Candidate Genes

Following the original workflow of AS de novo detection, I obtained 2,893 candidate alignments that might contain AS events based on the T. dubius leaf Iso-SeqTM dataset. I first designed 24 pairs of primers from 21 randomly selected candidate alignments. However, only one AS candidate gene, labeled Td9, showed the expected AS bands in the RT-PCR validation (Figure 4-3a). This success rate (one of 21 candidates, or about 4.8%) was much lower than that in the Amborella study (63.5%). A likely explanation is that some alternatively spliced isoforms were expressed at much lower levels than the major splice isoform. I therefore modified the original workflow with the aim of improving the success rate.

The modified workflow considered the expression level of an isoform, as assessed by CD-HIT clustering, before feeding the ROIs to the de novo detection pipeline (Figure 4-1), and added a manual check of the transcript alignments of the AS candidates (e.g., the transcript alignments in Figure 4-4a, Figure 4-5a, Figure 4-6a, and Figure 4-7a). The CD-HIT clustering analysis based on 235,890 ROIs resulted in 20,926 clusters containing more than two transcripts (ROIs). The all-vs-all BLAST between representative transcripts of these 20,926 clusters resulted in 655 pairs of AS candidate transcripts (ROIs). After removing redundancy, 518 ROIs were identified as the reference transcripts for the BLASR transcript mapping. The 518 BLASR alignments were then manually checked in Geneious, and 12 AS candidates were selected for primer design and RT-PCR validation. In total, 26 pairs of primers were designed from 12 AS candidate genes.

Of the 26 primer pairs, six, from four AS genes, were confirmed by RT-PCR assays (Td27, Figure 4-4b; Td30, Figure 4-5b; Td31, Figure 4-6b; Td32, Figure 4-7b; Table 4-2). This result corresponds to a 33.3% success rate (four of 12 candidate genes) for the modified workflow. Together with Td9 from the original workflow, five AS events were confirmed in total. ORF prediction indicated that 80% of these AS events (four of five) occurred in an untranslated region (UTR) (Table 4-2, Figure 4-4c, Figure 4-5c, Figure 4-6c, Figure 4-7c). Orthologs of these AS genes in Arabidopsis were identified to predict their function (Table 4-2).
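The percentages reported here follow directly from the counts in this section and in Table 4-2, as the short Python check below shows. The helper name `percent` is hypothetical, and the primer-pair-level rate is a derived figure that is not reported in the text.

```python
from fractions import Fraction


def percent(numerator, denominator):
    """Return numerator/denominator as a percentage rounded to one decimal."""
    return round(float(Fraction(numerator, denominator) * 100), 1)


# Gene-level success of the modified workflow: 4 confirmed of 12 candidates.
gene_level = percent(4, 12)
# Primer-pair-level success: 6 confirmed of 26 pairs (derived, not in the text).
primer_level = percent(6, 26)
# Share of the five confirmed events located in a UTR (Table 4-2: four UTR, one CDS).
utr_share = percent(4, 5)

print(gene_level, primer_level, utr_share)
```

Counting by primer pair rather than by gene gives a lower rate, which is relevant when estimating how many primers must be ordered.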

Population-Level RT-PCR Validation in T. dubius

Primers for the five AS genes confirmed in the previous RT-PCR validation were then used to conduct a population-level validation in 10 T. dubius populations. Four genes with putative AS events, Td9, Td27, Td30, and Td31, showed robust and consistent AS patterns among all 10 T. dubius populations (Figure 4-3b, Figure 4-4d, Figure 4-5d, Figure 4-6d). Primer pairs for the Td32 gene produced consistent AS patterns among eight T. dubius populations, but the multiple bands representing an AS event were not observed in two populations (Soltis & Soltis 2877, Pullman, WA, and Soltis & Soltis 2895, Garfield, WA) (Figure 4-7d).

Discussion

The analyses in this chapter were conducted prior to the availability of the T. dubius genome sequence. This study of AS in T. dubius indicates that the de novo pipeline developed and validated in the basal angiosperm Amborella can also be applied to other angiosperms that lack a reference genome. However, the success rate of de novo AS detection in T. dubius was relatively low. Even after the workflow was improved, the success rate in T. dubius was only about half of that in Amborella (33.3% vs. 63.5%). Compared to reference-based approaches, de novo methods are more limited and carry higher false-positive rates. WGD might be another reason for the low success rate in T. dubius: Tragopogon dubius has undergone at least five more rounds of WGD than Amborella (Huang et al., 2016; Van De Peer et al., 2017), and paralogs remaining from ancestral WGD events may affect the accuracy of de novo AS detection.

The low efficiency of the de novo detection method makes it impractical to conduct a detailed comparison of AS in diploid vs. tetraploid Tragopogon. Specifically, the de novo method is neither cost-effective nor time-efficient. For instance, to obtain 60 confirmed AS genes at the observed ~33% success rate, primers for about 180 candidate genes (at least 180 primer pairs) would need to be designed, purchased, and validated. Designing these primers alone would take approximately 20 days, as each candidate alignment needs to be manually checked prior to primer design. Furthermore, to investigate changes in splicing patterns during polyploidization, additional sets of degenerate primers would need to be designed and purchased for the RT-PCR assays in T. miscellus and its other diploid parent, T. pratensis, as well as in T. mirus and its other diploid parent, T. porrifolius. When the low efficiency of the de novo method was first discovered in T. dubius, I had just received funding (NSF Doctoral Dissertation Improvement Grant) to build a genome for T. dubius. After consulting with my committee and careful consideration, I decided to focus on the T. dubius genome assembly and annotation (see Chapter 5), as a reference genome would support high-quality, high-throughput AS detection in this complicated species. With the genome assembly now in hand, it will be possible to conduct the planned comparison between the allotetraploids and their diploid parents.
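The primer estimate in the paragraph above (60 genes at a success rate of about one in three) can be reproduced with a small Python sketch. The function name is hypothetical, and the calculation assumes that the gene-level success rate observed for the modified workflow (four of 12) would hold for a larger candidate set and that each candidate needs at least one primer pair.

```python
import math
from fractions import Fraction


def candidates_needed(target_confirmed, confirmed, tested):
    """Candidates to test so that the expected number confirmed reaches the target."""
    success_rate = Fraction(confirmed, tested)  # exact arithmetic avoids float drift
    return math.ceil(Fraction(target_confirmed) / success_rate)


needed = candidates_needed(target_confirmed=60, confirmed=4, tested=12)
print(needed)
```

Because several candidates in this study required two or more primer pairs, the number of candidates is a lower bound on the number of primer pairs.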

In this study, 80% of the confirmed AS events (four of five) occurred in a UTR and did not result in an ORF shift or potential nonsense-mediated decay (NMD) (Table 4-2). One AS event is in the coding region and introduces a premature termination codon (PTC) through an ORF shift (Table 4-2). Because the sample size is small (only five events), it is hard to evaluate whether AS in UTRs is common in T. dubius. However, previous studies suggested that AS events in UTRs, especially in 5′ UTRs, usually represent a regulatory mechanism for gene expression or translation (Palaniswamy et al., 2010; Kramer et al., 2013). The functional implications of these UTR AS events remain to be studied.

The AS pattern of the same gene may differ among populations (Park et al., 2018; Tapanainen et al., 2018). Thus, the AS patterns of the five confirmed AS genes were also examined across 10 T. dubius populations. Four AS events were detected in all 10 populations (Figure 4-3b, Figure 4-4d, Figure 4-5d, Figure 4-6d), which suggests that the AS pattern of these four genes is consistent across these populations. For the Td32 gene, the AS pattern is consistent across eight of the 10 T. dubius populations but absent in the remaining two (Figure 4-7d). Whether this is a repeatable population-specific AS event requires more evidence to evaluate.

In the next chapter (Chapter 5), I will describe the development of a T. dubius draft genome sequence. With this draft genome sequence, I conducted a reference-based AS analysis in T. dubius leaf tissue. Compared to the de novo approach, this reference-based T. dubius AS database supports more accurate AS detection with greater efficiency. The T. dubius AS database, as well as the T. dubius draft genome, provide a reference for future AS analyses in the Tragopogon polyploidy system, including comparisons of (i) the tetraploid T. miscellus with its diploid parents, T. dubius and T. pratensis; and (ii) the tetraploid T. mirus with its diploid parents, T. dubius and T. porrifolius.

Table 4-1. Ten T. dubius populations used in this study.

Population Number   Geographic location   Lane number¹
2891                Palouse, WA           1
2860                Spangle, WA           2
2879                Pullman, WA           3
2864                Spokane, WA           4
2877                Pullman, WA           5
2863                Spokane, WA           6
2874                Oakesdale, WA         7
2895                Garfield, WA          8
2886                Moscow, ID            9
2887                Troy, ID              10

¹ Lane number on the gel image in Figure 4-3, 4-4, 4-5, 4-6.

Table 4-2. Five confirmed T. dubius AS events.

Gene  Primer pair                Predicted function                                                 TAIR locus         UTR/CDS  ORF shift
Td9   Td9_F1, Td9_R1             Shaggy-like kinase                                                 AT5G14640 (SK13)   5’UTR    No
Td27  Td27_F1, Td27_R1, Td27_R2  2-methyl-6-phytyl-1,4-hydroquinone methyltransferase               AT3G63410 (VTE3)   5’UTR    No
Td30  Td_30F2, Td_30R2           F-box/RNI-like superfamily protein                                 AT1G67190          CDS      Yes
Td31  Td31_F1, Td31_R1, Td31_R2  Transport inhibitor response 1                                     AT3G62980 (TIR1)   5’UTR    No
Td32  Td_32F2, Td_32R1           Dicarboxylate transporter 1 or 2-oxoglutarate/malate translocator  AT5G12860 (DIT1)   3’UTR    No

Figure 4-1. The modified AS de novo detection pipeline.

Figure 4-2. Size distribution of T. dubius leaf Iso-SeqTM ROIs.

Figure 4-3. RT-PCR results of Td9. a. temperature gradient PCR. b. population level validation using Td9_F1, Td9_R1. The Lane number refers to Table 4-1.

Figure 4-4. Manual checking and primer design, RT-PCR validation and ORF prediction of Td27. a. Transcript alignment with splicing variants, b. temperature gradient PCR, c. ORF prediction, d. population level RT-PCR validation using Td27_F1, Td27_R1. The Lane number refers to Table 4-1.

Figure 4-5. Manual checking and primer design, RT-PCR validation and ORF prediction of Td30. a. Transcript alignment with splicing variants, b. temperature gradient PCR, c. ORF prediction, d. population level RT-PCR validation using Td30_F2, Td30_R2. The Lane number refers to Table 4-1.

Figure 4-6. Manual checking and primer design, RT-PCR validation and ORF prediction of Td31. a. Transcript alignment with splicing variants, b. temperature gradient PCR, c. ORF prediction, d. population level RT-PCR validation using Td31_F1, Td31_R2.

Figure 4-7. Manual checking and primer design, RT-PCR validation and ORF prediction of Td32. a. Transcript alignment with splicing variants, b. temperature gradient PCR, c. ORF prediction, d. population level RT-PCR validation using Td32_F2, Td32_R1. The Lane number refers to Table 4-1.

CHAPTER 5 TRAGOPOGON DUBIUS DRAFT GENOME ASSEMBLY AND ANNOTATION AND A SURVEY OF ALTERNATIVE SPLICING

Introduction

As mentioned in Chapter 1, Tragopogon (sunflower family, Asteraceae) is an ideal system for examining the initial consequences of WGD on the genome, transcriptome, and proteome relative to well-documented diploid parents. This system includes two recently and repeatedly formed, naturally occurring allotetraploids, T. mirus and T. miscellus, whose parents are T. dubius and T. porrifolius, and T. dubius and T. pratensis, respectively. Investigations of the polyploids and their parents over the past 30 years cover multiple areas of research, including cytology, population genetics, genomics, transcriptomics, and epigenetics (reviewed in Soltis et al., 2012; Chester et al., 2012; Sehrish et al., 2015; Spoelhof et al., 2017). However, a reference genome for Tragopogon has not been available, which makes further mechanistic studies extremely challenging. As a result, Tragopogon is currently an excellent evolutionary model but not a genetic model. The absence of a reference genome precludes many studies, including genome-scale analyses of DNA methylation, transposable elements, and RNA alternative splicing (AS).

To obtain a reference genome for Tragopogon, T. dubius (2n = 12), the shared diploid parent of the two recent polyploids in this system (T. miscellus and T. mirus), is a good candidate species. There are several challenges in developing the T. dubius draft genome. First, the genome size of T. dubius is relatively large, estimated to be 2.3 Gb to 2.88 Gb (Pires et al., 2004; Garcia et al., 2013). Second, assembling a highly heterozygous diploid genome is a major challenge in many de novo genome projects (Kajitani et al., 2014). Fortunately for this analysis, genetic variation in natural populations of T. dubius is low (due to a genetic bottleneck from introduction to the U.S.; Soltis et al., 1995). In addition, T. dubius inbred lines have been generated by previous research (Buggs et al., 2011; Yoo et al., unpublished).

In this chapter, I sequenced the complete nuclear genome of T. dubius using a linked-read sequencing approach (10X Genomics, Pleasanton, CA, USA) and subsequently completed the genome assembly and annotation for this diploid species. The linked-read sequencing approach generates long-range information analogous to that of traditional BAC-by-BAC sequencing technologies, but at a small fraction of the cost and at high throughput (Hulse-Kemp et al., 2018). These benefits make the sequencing of complex plant genomes more accessible. On the computational side, the Supernova diploid assembler was designed specifically for assembling linked-read sequencing data (Weisenfeld et al., 2017). Supernova performs better on highly heterozygous diploid genomes than commonly used haploid genome assemblers (Weisenfeld et al., 2017). For example, diploid pepper (Capsicum annuum), with a 3.5-Gb genome, has been successfully sequenced and assembled using the linked-read sequencing approach and the Supernova assembler (Hulse-Kemp et al., 2018).

Using the resulting T. dubius genome assembly and a transcriptome dataset generated on a single-molecule long-read sequencing platform (PacBio Iso-SeqTM), I then investigated AS events in leaf tissue. I also conducted a gene family analysis and report gene families that are related to key morphological characters of Tragopogon, such as the head-like inflorescence (capitulum) and leaves and stems that are rich in milky sap (latex).

Materials and Methods

DNA Sample Collection and Sequencing Strategy

An inbred line of T. dubius (Oakesdale, WA, USA; Soltis & Soltis 2674; voucher deposited in FLAS) was grown in the Department of Biology greenhouse at the University of Florida, Gainesville, FL, USA; a single plant (Soltis & Soltis 2674-4-3-14) was used as the source of material for genome sequencing. Nuclei were isolated from young leaves and then used to extract high-molecular-weight (HMW) DNA using a modified method developed by PacBio (https://www.pacb.com/wp-content/uploads/2015/09/Shared-Protocol-Preparing-Arabidopsis-DNA-for-20-kb-SMRTbell-Libraries.pdf). HMW DNA larger than 50 kb was size-selected using the SageELF system. The size-selected HMW DNA sample was then evaluated with a Qubit Fluorometer (Life Technologies, Carlsbad, CA, USA) and on a pulsed-field gel.

The linked-read sequencing approach was used to sequence the T. dubius genome. The Chromium Gel Bead and Library Kit (10X Genomics) and the Chromium instrument (10X Genomics) were used to prepare the Chromium libraries for genome sequencing, with 1 ng of HMW DNA as input.

Raw Data Trimming

Modules from the FASTX-Toolkit (v0.0.14, http://hannonlab.cshl.edu/fastx_toolkit/) were used to perform quality trimming of the Illumina raw data. First, the fastq_quality_filter module was run with the following parameters: -q (minimum quality score to keep) 20, -p (minimum percent of bases that must have [-q] quality) 70. Then, fastx_trimmer was run to trim Chromium barcodes and adapter sequences with the following parameters: “-f (first base to keep) 24, -l (last base to keep) 140” for the R1 file and “-f 2, -l 140” for the R2 file. The output was processed through the fastx_artifacts_filter module with default parameters.
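The quality-filtering and trimming parameters above can be illustrated in Python. This is a sketch of what the parameters mean, not a reimplementation of the toolkit: the function names and the toy read are hypothetical, positions are taken to be 1-based and inclusive, and the 70% threshold is assumed to be inclusive.

```python
def passes_quality(quals, min_q=20, min_percent=70):
    """Keep a read if at least min_percent of its bases have quality >= min_q."""
    good = sum(q >= min_q for q in quals)
    return good * 100 >= min_percent * len(quals)


def trim(seq, first, last):
    """Keep bases first..last (1-based, inclusive), mirroring the -f and -l options."""
    return seq[first - 1:last]


# Toy 151-bp read: 23 leading bases, a 117-bp stretch, and 11 trailing bases.
read = "A" * 23 + "C" * 117 + "G" * 11

r1_trimmed = trim(read, 24, 140)  # R1 settings: drops the first 23 bases
r2_trimmed = trim(read, 2, 140)   # R2 settings: drops only the first base

print(len(r1_trimmed), len(r2_trimmed))
print(passes_quality([30] * 7 + [10] * 3))  # exactly 70% of bases at Q>=20
```

With these settings an R1 read retains at most 117 bases and an R2 read at most 139 bases.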

Removal of Organellar Contaminants from Genomic Data

Trimmed reads were aligned against plastid and mitochondrial genomes from other Asteraceae (the Lactuca sativa and Helianthus annuus plastid genome sequences and the Helianthus annuus mitochondrial genome sequence) using GSNAP (v20160501, Wu and Nacu, 2010). Reads that aligned with at least 85% identity over at least 50% of their length were considered to have potentially originated from an organellar genome. The filterbyname.sh script in BBMap (v37.09, Bushnell, 2014) was used to remove these organellar contaminants from two datasets in parallel: (i) for the k-mer analysis and genome size estimation, contaminant reads were removed from the trimmed data; (ii) for the de novo genome assembly, contaminant reads were removed from the raw data. I used this dual approach because the raw data contain the Chromium barcodes and adapter sequences that are required for the 10X Chromium de novo assembly.
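The organellar-read criterion can be written as a simple predicate. The Python sketch below is illustrative: the function and argument names are hypothetical, identity is assumed to be matches divided by aligned length, and coverage is assumed to be aligned length divided by read length; the exact definitions used by the aligner output may differ.

```python
def is_organellar(matches, aligned_len, read_len,
                  min_identity=0.85, min_coverage=0.50):
    """Flag a read whose alignment meets both the identity and coverage cutoffs."""
    if aligned_len == 0 or read_len == 0:
        return False
    identity = matches / aligned_len   # fraction of aligned bases that match
    coverage = aligned_len / read_len  # fraction of the read that is aligned
    return identity >= min_identity and coverage >= min_coverage


# Made-up alignments for 150-bp reads.
print(is_organellar(matches=120, aligned_len=130, read_len=150))  # both cutoffs met
print(is_organellar(matches=60, aligned_len=70, read_len=150))    # coverage too low
print(is_organellar(matches=100, aligned_len=130, read_len=150))  # identity too low
```

Read names flagged in this way would form the list passed to the name-based filtering step.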

Estimation of Genome Size of T. dubius

After removing organellar contaminants, clean reads without Chromium barcodes and adapter sequences were used to estimate the genome size of T. dubius by k-mer frequency distribution analysis with Jellyfish (v 2.2.6; Marçais and Kingsford, 2011).
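The text does not state how the k-mer histogram was converted into a size estimate. A common approach, sketched below in Python under that assumption, divides the total number of k-mer occurrences by the depth of the main (homozygous) peak after discarding low-depth k-mers that mostly reflect sequencing errors. The function name, the `min_depth` cutoff, and the toy histogram are hypothetical.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer histogram.

    histogram maps depth -> number of distinct k-mers observed at that depth.
    """
    # Discard low-depth k-mers, which are dominated by sequencing errors.
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(usable, key=usable.get)             # depth of the main peak
    total_kmers = sum(d * n for d, n in usable.items())  # total k-mer occurrences
    return total_kmers / peak_depth


# Toy histogram: an error spike at depth 1, a main peak near depth 20,
# and a few two-copy k-mers at depth 40.
toy = {1: 5000, 18: 50, 19: 80, 20: 100, 21: 80, 22: 50, 40: 10}
size = estimate_genome_size(toy)
print(size)
```

In the toy data, 360 single-copy k-mers and 10 two-copy k-mers give an estimate of 380 positions, which is what the function returns.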

Genome Assembly and Evaluation

After removing organelle-contaminated reads, the Illumina raw data (untrimmed reads containing Chromium barcodes and adapter sequences) were passed to the Supernova (v1.2.1) assembler for de novo assembly using the command “supernova run” with default parameters. The Supernova assembler is specifically designed for linked-read data. It is highly integrated and contains modules for read pre-processing, such as quality filtering and trimming, before the de novo assembly. The output scaffolds were generated with the command “supernova mkoutput --style=pseudohap --headers=short --minsize=10000”, so scaffolds smaller than 10 kb were excluded from the Supernova assembly.

The completeness of the genome assembly was evaluated using BUSCO (v3.0, Simão et al., 2015). The alignment rate of RNA-Seq reads mapped back to the genome was also calculated as a measure of the completeness of the coding regions in the genome assembly.

RNA Isolation and Iso-SeqTM Sequencing for Genome Annotation

Mature leaves were collected from the same inbred line of T. dubius (Soltis & Soltis 2674-4-3-14) for RNA extraction and Iso-SeqTM sequencing. RNA was extracted using the CTAB method and the RNeasy Mini extraction kit (Qiagen, Germantown, MD, USA). The TURBO DNA-free Kit (Invitrogen, Carlsbad, CA, USA) was used for DNA digestion in RNA samples. The concentration of the RNA sample was checked using the QUBIT Fluorometer (Life Technologies, Carlsbad, CA, USA), and RNA integrity was checked using a Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA).

I conducted cDNA synthesis and library construction at the Interdisciplinary Center for Biotechnology Research (ICBR) at the University of Florida, Gainesville, FL, USA (UF). The mRNAs with polyA tails were reverse-transcribed to full-length cDNA using the Clontech SMARTerTM PCR cDNA Synthesis Kit. The cDNAs were then size-selected into 1-2-kb, 2-3-kb, and 3-6-kb fractions with BluePippinTM and converted to SMRTbell libraries. In total, the Iso-SeqTM project required eight SMRT cells: two SMRT cells were used for the 1-2-kb fraction, three cells for the 2-3-kb fraction, and another three cells for the 3-6-kb fraction. All sequencing was performed using the PacBio RS II system with P6-C4 polymerase and chemistry at ICBR.

Repetitive Element Annotation

A T. dubius-specific repeat library was constructed using the method described by Campbell et al. (2014). Miniature inverted-repeat transposable elements (MITEs) were collected with MITE-Hunter. Long terminal repeat (LTR) retrotransposons were first collected with LTRharvest (GenomeTools v1.5.7, Gremme et al., 2013) and then filtered with LTRdigest (GenomeTools v1.5.7) and custom scripts (Campbell et al., 2014). I ran LTRharvest twice, with two different similarity thresholds between LTRs (99% and 85%). In addition, de novo repeat searching was conducted with RepeatModeler (v1.0.8, Smit et al., 2013). Gene fragments were removed from the repeat libraries identified above using BLAST and ProtExcluder (v1.2, Campbell et al., 2014). All repeat libraries were then combined into the T. dubius-specific repeat library for further analyses. All custom scripts can be found at http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced.

Subsequently, the T. dubius-specific repeat library was classified with the RepeatClassifier module in RepeatModeler (v1.0.8). RepeatMasker (v4.0.5, Smit et al., 2013), run through the MAKER pipeline, was then used to evaluate the repeat content of the T. dubius genome assembly based on the T. dubius-specific repeat library and the Asteraceae RepBase repeat library (v20180326, Jurka et al., 2005).

Genome Annotation

The MAKER (v2.31.9) pipeline was used to annotate the T. dubius genome. Repeat regions were first masked using RepeatMasker through the MAKER pipeline based on the two repeat libraries described above. RNA-Seq data from eight-week-old leaf (Yoo et al., unpublished) and inflorescence (Shan et al., unpublished) tissues were each aligned to the T. dubius genome assembly using HiSat2 (v2.1.0) with default parameters. Duplicate reads were removed with the MarkDuplicates module in Picard v2.10.3. The HiSat2 alignments were then used to create transcriptome assemblies with StringTie (v1.3.3) and Cufflinks (v2.2.1.1), and the resulting GTF files were merged and provided to Trinity (v2.5.0) for genome-guided assembly. Isoforms identified by these assemblers were then used as input for the PASA 2.0 pipeline (Campbell et al., 2006) to conduct additional assembly and clustering to remove redundancy. For the leaf Iso-SeqTM data, I used SMRT Analysis (v2.3.0) to pre-process the raw data with the parameters “--minPredictedAccuracy 90” and “--minFullPasses 2” for the module “ConsensusTools.sh CircularConsensus” and “--min_seq_len 300” for the module “pbtranscript.py classify”.

In total, I conducted three rounds of analysis with MAKER to obtain a high-confidence genome annotation. For the initial MAKER analysis, the three PASA assemblies described above (i.e., leaf RNA-Seq, inflorescence RNA-Seq, and leaf Iso-SeqTM) were used as the EST evidence. Protein sequences of primary transcripts of the eudicot species Arabidopsis thaliana, Glycine max, Gossypium raimondii, Helianthus annuus, Lactuca sativa, Vitis vinifera, and Solanum lycopersicum were downloaded from Phytozome (https://phytozome.jgi.doe.gov/) and used as the protein evidence. To avoid false positives, I did not consider single-exon EST evidence when generating annotations. In this round of MAKER analysis, I performed MAKER gene predictions based on the aligned transcripts and proteins by setting “est2genome=1” and “protein2genome=1”; all remaining settings used default values.

The gene models identified by the initial MAKER analysis were used to train the SNAP and Augustus gene prediction software packages. Only gene models with an Annotation Edit Distance (AED) score of 0.25 or better (i.e., 0.25 or lower) and a length of 50 or more amino acids were considered for training SNAP. Augustus was trained using regions that were annotated as mRNA, plus the 1000-bp flanking region on each side, through BUSCO with the settings “-l embryophyta_odb9 -m geno --long -sp tomato --augustus_parameters='--progress=true' -z”.
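The selection of gene models for SNAP training can be sketched as a simple filter. The Python example below is illustrative only: the dictionary keys, function name, and example records are hypothetical, and it relies on the fact that a lower AED indicates closer agreement between a gene model and its supporting evidence.

```python
def select_training_models(models, max_aed=0.25, min_protein_len=50):
    """Return IDs of gene models that meet the AED and protein-length cutoffs."""
    return [
        m["id"]
        for m in models
        if m["aed"] <= max_aed and m["protein_len"] >= min_protein_len
    ]


# Made-up gene models.
models = [
    {"id": "g1", "aed": 0.10, "protein_len": 320},  # kept
    {"id": "g2", "aed": 0.40, "protein_len": 500},  # AED too high
    {"id": "g3", "aed": 0.25, "protein_len": 50},   # kept (both cutoffs inclusive)
    {"id": "g4", "aed": 0.05, "protein_len": 30},   # protein too short
]
training_ids = select_training_models(models)
print(training_ids)
```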

After the gene predictors were trained, ab initio gene prediction was conducted by running SNAP and Augustus within MAKER. Predicted proteins were required to be at least 50 amino acids long to be retained. In total, I conducted two rounds of ab initio gene prediction.

Orthology Analysis

To initiate the gene family analysis, as well as to provide a foundation for further functional genomic studies in Tragopogon, I conducted an orthology analysis using the T. dubius protein set and proteins from 11 other plant genomes. The predicted T. dubius protein set was clustered using OrthoFinder (v2.2.3) with proteins from Amborella trichopoda, Aquilegia coerulea, Arabidopsis thaliana, Glycine max, Gossypium raimondii, Helianthus annuus, Lactuca sativa, Vitis vinifera, Solanum lycopersicum, Oryza sativa, and Sorghum bicolor. These 11 species cover the sister group to all other extant angiosperms as well as the major clades of angiosperms, including monocots, asterids, and rosids.

Functional Annotation

The Pfam protein domain analysis was conducted using the Pfam-A database (v31) and HMMER v3.1b2 (http://hmmer.org/). The abundance of a gene family in T. dubius was inferred from the number of protein domains detected per species with hmmsearch at an e-value of 11e-5 or better.

Alternative Splicing Analysis

The alternative splicing analysis was conducted with the PASA 2.0 pipeline (Campbell et al., 2006) using the leaf Iso-SeqTM data. The maximum intron length was set to 30,000 bp. Single-exon genes were excluded.
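The two criteria above (a maximum intron length and the exclusion of single-exon genes) can be illustrated with a short Python sketch. It shows the logic of the criteria rather than how PASA applies them internally; the function names are hypothetical, and exon coordinates are assumed to be 1-based and inclusive.

```python
def intron_lengths(exons):
    """Return intron lengths for a list of (start, end) exon coordinates."""
    ordered = sorted(exons)
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(ordered, ordered[1:])]


def eligible_for_as_analysis(exons, max_intron=30_000):
    """Require at least one intron and no intron longer than max_intron."""
    introns = intron_lengths(exons)
    return len(introns) >= 1 and max(introns) <= max_intron


# Made-up transcript structures.
print(eligible_for_as_analysis([(1, 100), (201, 300)]))       # one 100-bp intron
print(eligible_for_as_analysis([(1, 500)]))                   # single exon
print(eligible_for_as_analysis([(1, 100), (40_000, 40_100)])) # intron too long
```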

Results

Tragopogon dubius Linked-read Sequencing and Removal of Organellar Contaminants

The 10X Chromium library was sequenced on two lanes of an Illumina HiSeq X sequencer to produce 2 × 150-bp paired-end sequences. In total, 253.2 Gb of raw data (1,688,011,008 reads) were generated. Based on similarity to organellar genomes from the closely related species Lactuca sativa and Helianthus annuus, 144,566,834 of the 1,688,011,008 reads (8.56%) were filtered out and dropped from further analyses.
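The read accounting in this section can be checked with a few lines of Python. The read counts are those reported in this chapter; the 150-bp read length is an assumption that is consistent with the stated data volumes, and the variable names are hypothetical.

```python
READ_LEN = 150  # bp per read; assumed, consistent with the reported totals

raw_reads = 1_688_011_008
organellar_reads = 144_566_834

# Reads remaining after organellar filtering.
kept_reads = raw_reads - organellar_reads

# Data volume in gigabases before and after filtering.
raw_gb = raw_reads * READ_LEN / 1e9
kept_gb = kept_reads * READ_LEN / 1e9

print(kept_reads, round(raw_gb, 1), round(kept_gb, 2))
```

The remaining read count and data volume match the input to the assembler reported in the genome assembly results (1,543,444,174 reads; 231.52 Gb).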

Genome Size of T. dubius

The k-mer frequency distribution analysis in T. dubius used a k value of 53, and the final result was plotted as a frequency graph (Figure 5-1). My analyses show that the genome size of T. dubius is approximately 2.02 Gb, which is slightly smaller than the previously estimated size derived from C-values (2.30 Gb to 2.88 Gb; Pires et al., 2004; Garcia et al., 2013). The k-mer frequency distribution analysis also indicates that T. dubius is a diploid species with low heterozygosity, which is consistent with the fact that a single individual of an inbred line of this largely selfing species (Cook and Soltis, 1999, 2000) was used for sequencing.

Tragopogon dubius Genome Assembly and Evaluation

After removing organelle-contaminated reads, 231.52 Gb (1,543,444,174 reads) of Illumina raw data containing Chromium barcodes and adapters were passed to the Supernova (v1.2.1) assembler for de novo assembly. This resulted in an assembly of 808.32 Mb, which covers approximately 40% of the estimated size of the T. dubius genome. The genome assembly comprises 12,310 scaffolds with a contig N50 of 16.60 kb and a scaffold N50 of 106.72 kb (Table 5-1).
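For readers unfamiliar with the N50 statistic, the following Python sketch shows how it is computed and reproduces the coverage figure quoted above. The function name and the toy scaffold lengths are hypothetical; only the assembly size (808.32 Mb), the estimated genome size (2.02 Gb), and the 10-kb minimum scaffold size come from this chapter.

```python
def n50(lengths):
    """Length L such that sequences of length >= L contain at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


# Toy scaffold lengths (bp); scaffolds under 10 kb are dropped first,
# mirroring the minimum scaffold size used for the assembly output.
toy_scaffolds = [150_000, 106_000, 90_000, 12_000, 8_000, 5_000]
retained = [s for s in toy_scaffolds if s >= 10_000]
toy_n50 = n50(retained)

# Share of the estimated genome represented in the assembly.
coverage_percent = round(808.32 / 2020 * 100, 1)

print(toy_n50, coverage_percent)
```

A contig N50 is computed in the same way from contig lengths, that is, after splitting scaffolds at gaps.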

Another statistic that could reflect the completeness of the coding region of a genome assembly is the alignment rate when mapping RNA-Seq reads back to the genome. Tragopogon dubius RNA-Seq data from two tissues, the eight-week-old leaf and the inflorescence (Yoo et al., unpublished; Shan et al., unpublished) were aligned to the T. dubius genome assembly using HiSat2 v2.1.0 with default parameters. The overall alignment rates of RNA-Seq reads were 72.95% for the leaf dataset and 78.43% for the inflorescence dataset.

Repetitive Element Annotation

Plant genomes are often characterized by high repeat content (Kubis et al., 1998), and repeats should be excluded from the gene annotation process. In total, 59.40% of this T. dubius draft assembly was composed of repetitive elements. The most abundant elements were LTRs, accounting for 32.92% of the T. dubius genome assembly (Table 5-2, Figure 5-2).


Genome Annotation

The MAKER pipeline identified 30,325 high-confidence protein-coding genes with an average gene length of 2,996.01 bp. In this gene set, 96.2% of the gene models had an Annotation Edit Distance (AED) score of 0.5 or lower. To evaluate completeness, the gene set was compared against the Embryophyta odb9 database containing 1,440 conserved plant single-copy orthologs (BUSCO groups). The results showed that 72.6% and 10.3% of the 1,440 plant BUSCO groups were identified as complete and fragmented genes, respectively, while 17.1% were considered to be missing from the current T. dubius genome assembly (Figure 5-3).
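The BUSCO percentages follow directly from the category counts shown in Figure 5-3 (1,046 complete, 148 fragmented, and 246 missing groups out of 1,440). The short sketch below reproduces that arithmetic; the function name is a hypothetical helper and is not part of the BUSCO software.

```python
def busco_summary(complete, fragmented, missing):
    """Convert BUSCO category counts into percentages of the total."""
    total = complete + fragmented + missing
    counts = {"complete": complete, "fragmented": fragmented, "missing": missing}
    # Round to one decimal place, matching the precision used in the text.
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Counts taken from Figure 5-3.
summary = busco_summary(complete=1046, fragmented=148, missing=246)
```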

Orthology Analysis

In total, 436,223 protein sequences from T. dubius and the 11 angiosperms described previously were used to conduct orthology analysis. The results show that among these 12 species, 336,344 of 436,223 proteins were clustered into 19,406 orthogroups, which covered 24,263 (80.0%) of T. dubius proteins. A total of 12,535 orthologous groups containing at least a single T. dubius protein were identified, which were used for the gene family analysis.

Gene Family Analysis

My analysis focused on the gene families that are related to key morphological characters of Tragopogon, such as its head-like inflorescence (capitulum) and its leaves and stems, which are rich in milky sap (latex and rubber). I detected the presence of the CYC/TB1 gene family, which is important in inflorescence architecture (Tähtiharju et al., 2012; Reyes-Chin-Wo et al., 2017), as well as two other protein families related to latex/rubber production (Tang et al., 2016; Reyes-Chin-Wo et al., 2017). The hmmsearch results based on the Pfam-A database showed that 33 genes belonged to the TCP (PF03634) transcription factor family (Table B-1). Orthogroup analysis indicated that, among the 33 TCP genes, 13 were clustered into the same orthogroup as the three CYC/TB1 genes in Arabidopsis (TCP1: AT1G67260, TCP12: AT3G18550, TCP18: AT1G68800). Twelve genes belonging to the Rubber Elongation Factor family (PF05755) were found in T. dubius (Table B-3). In addition, the Pfam protein domain analysis found that 41 genes belonged to the Bet_v_1/Major Latex Protein family (PF00407) in T. dubius (Table B-4).

Tragopogon dubius Alternative Splicing Behavior

In total, 235,890 Iso-Seq™ ROIs from leaf tissue were used to identify AS events in T. dubius. Based on this leaf transcriptome dataset, PASA identified 16,301 isoforms that formed 10,406 isoform clusters (loci). Only the four major types of AS events (intron retention (IR), exon skipping (ES), alternative acceptor (AA), and alternative donor (AD)) were included. My results showed that 14,628 AS events in 8,556 isoforms and 3,009 gene loci were detected in T. dubius (Table 5-4). The T. dubius genome contained 26,631 multi-exon genes. Thus, 11.30% (3,009 out of 26,631) of the multi-exon genes in T. dubius showed AS. The most common type of AS in T. dubius was IR. Among the 3,009 AS genes, 72.75% contained IR event(s). In contrast, ES had the lowest frequency; only 9.87% of AS genes contained ES event(s).
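The event totals and frequencies above can be checked against the per-type counts in Table 5-4. The sketch below is a plain arithmetic check using the numbers reported in this chapter; the variable names are illustrative and the code is not part of the PASA pipeline.

```python
# Event counts per AS type, as listed in Table 5-4.
events = {
    "retained intron": 4780,
    "spliced intron": 4780,
    "retained exon": 396,
    "skipped exon": 352,
    "alternative acceptor": 2936,
    "alternative donor": 1384,
}

# Total number of AS events across all listed types.
total_events = sum(events.values())

# Share of each type among all events, in percent.
shares = {name: round(100 * n / total_events, 2) for name, n in events.items()}

# Percentage of multi-exon genes showing AS (3,009 of 26,631).
as_gene_percent = round(100 * 3009 / 26631, 2)
```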

Discussion

Interpreting the Completeness of the T. dubius Draft Genome Assembly and Annotation

I conducted the first de novo genome assembly of T. dubius, which is also the first reference genome of the Tragopogon polyploid system. However, this assembly size was 808.32 Mb, which covers approximately 40% of the estimated genome size.


Noting that repeat content of this draft genome assembly is 59.40%, we speculate that the remaining 60% of the genome is largely made up of additional repeats.

Thus, the question is whether this draft genome assembly provides sufficient information for further analysis in Tragopogon. Considering that the primary goal of this draft genome sequencing project is to initiate AS analysis, I evaluated the completeness of the gene space of this draft genome assembly. The BUSCO analysis of the predicted gene set for T. dubius showed that approximately 83% of the 1,440 conserved plant BUSCO groups were identified as complete or fragmented, with approximately 17% missing (Figure 5-3). In addition, the alignment rate of T. dubius RNA-Seq reads from multiple tissues is consistent with the assembly completeness based on the BUSCO results, suggesting that ~80% of the gene space in the T. dubius genome was captured in this draft genome assembly. Given the large size and high repeat content of the T. dubius genome, these findings indicate that the draft assembly captures the majority of the gene space.

I also compared the statistics of this predicted T. dubius gene set with the gene set of the related species Lactuca sativa (Reyes-Chin-Wo et al., 2017). The gene models of T. dubius have an average coding length of 1.03 kb and 5.21 exons per gene (Table 5-3), similar to those of the L. sativa genome (1.05 kb in length and 4.54 exons per gene, Reyes-Chin-Wo et al., 2017). Interestingly, I found that T. dubius has proportionally fewer single-exon genes (12.18%) than does L. sativa (26.00%). This result is likely because I did not consider single-exon EST evidence in the annotation of T. dubius, which decreased the sensitivity of MAKER in detecting single-exon genes. In support of this idea, a comparison between the Arabidopsis MAKER annotation and the TAIR10 annotation also showed that most of the absent gene models (60%) in the MAKER annotation are single-exon genes (Campbell et al., 2014).

Thus, this first draft genome for T. dubius covered most of the gene space. It is the first reference genome in the genus Tragopogon and will be the foundation for further studies of alternative splicing, DNA methylation, transposon diversity, and other aspects of genome evolution in the Tragopogon polyploid system. This reference genome will help make the Tragopogon polyploid system a much more tractable genetic and evolutionary model for the study of polyploidy. The draft genome provides essential genomic resources for both downstream analyses of Tragopogon polyploidy as well as for studies of Asteraceae.

Implications of Repetitive Element Composition for the Choice of Sequencing Platform

Repeat masking is a required step preceding gene model prediction during genome annotation (Campbell et al., 2014). Thus, the repetitive element analyses in this chapter were focused on building a T. dubius-specific repeat library for masking the repeats during MAKER genome annotation. The repeat content of T. dubius was estimated as 59.40% by RepeatMasker, which is lower than those of two other Asteraceae species, L. sativa (74.2%) and H. annuus (74.7%). Recalling that this T. dubius draft genome assembly covered approximately 40% of the estimated genome size, and that the 10X Genomics approach relies on short-read sequencing, it is possible that some repeat elements were not captured by the 10X Genomics data, leading to an underestimated repeat content. Consistent with L. sativa and H. annuus, the most abundant elements in T. dubius were LTRs. However, the proportion of LTRs in T. dubius (32.92%) was much lower than that in L. sativa (60%) and H. annuus (72.40%). These discrepancies between this T. dubius draft genome and two more data-rich genomes in Asteraceae suggest that, although it covers most of the gene space, the T. dubius assembly could be improved. More sequence data, especially long-read data, are needed to improve the resolution of the highly repetitive regions and to enable more detailed analysis of repeats in T. dubius.

Orthogroups as Resources for Future Studies

A total of 12,535 orthologous groups containing at least a single T. dubius protein were identified. The orthology analysis was conducted to enable the investigation of specific gene families, such as the CYC/TB1 gene family, that are central to key morphological characters of Tragopogon, such as inflorescence morphology and flower size. Besides this goal, the results from this orthology analysis also serve as valuable resources for studies in Tragopogon and Asteraceae. For instance, DNA and protein sequences from the 12,535 orthologous groups could provide useful nuclear loci for phylogenetic and phylogenomic analyses.

Cycloidea/Teosinte Branched1 (CYC/TB1) Gene Family in T. dubius

Species in Asteraceae are characterized by their complex, head-like inflorescences, which contain a number of individual flowers closely aggregated together. It is commonly hypothesized that this feature contributes to the evolutionary success of Asteraceae (Gillies et al., 2004). Previous studies demonstrated that plant-specific TCP domain transcription factors, particularly members of the CYC/TB1 gene family, regulate the complex architecture of the Asteraceae inflorescence (Tähtiharju et al., 2012). Due to whole-genome duplication, the CYC/TB1 gene family has three major clades, CYC1, CYC2, and CYC3, in core eudicots (Howarth and Donoghue, 2006). Expression-level analysis in three phylogenetically distant tribes in Asteraceae indicated that genes in the CYC2 clade are highly expressed in ray flowers (Tähtiharju et al., 2012; Bello et al., 2017), with certain genes considered to be strong candidates as regulators of ray flower identity (Tähtiharju et al., 2012).

Species of the Tragopogon polyploidy system, which belong to tribe Cichorieae, have significant morphological variation in their ray flowers. The polyploid species T. miscellus formed reciprocally in nature, resulting in both ‘long-liguled’ and ‘short-liguled’ forms (Ownbey, 1950). We identified the CYC/TB1 gene family in T. dubius. The Pfam protein domain analysis and orthogroup analysis indicated that in T. dubius, 13 TCP genes were clustered into the same orthogroup as the three CYC/TB1 genes in Arabidopsis (TCP1: AT1G67260, TCP12: AT3G18550, TCP18: AT1G68800). In contrast, L. sativa, a close relative of Tragopogon, has eight genes in the same orthogroup (Table B-2) (Reyes-Chin-Wo et al., 2017).

Latex/Rubber-Related Genes in T. dubius

The family Asteraceae has drawn interest as an alternative source of natural rubber (Bushman et al., 2006). The Pfam protein domain analysis in T. dubius revealed expansion and diversification of two families related to latex and rubber production.

Twelve genes belonging to the Rubber Elongation Factor (REF) family (PF05755) were found in T. dubius (Table B-3). A previous study in Hevea brasiliensis (rubber tree) indicated that the copy number of this gene family is related to the overall capability to produce rubber (Tang et al., 2016). In non-rubber-producing species, the average number of paralogs per genome in this gene family is 3.6 (Reyes-Chin-Wo et al., 2017), while in rubber-producing species, this gene family usually has a higher copy number. For instance, H. brasiliensis has 18 REF/SRPP members, and L. sativa has 11 members. Another protein family related to latex production showed the same pattern in T. dubius. The Pfam protein domain analysis found that 41 genes belonged to the Bet_v_1/Major Latex Protein family (PF00407) in T. dubius (Table B-4). This number is lower than that found in L. sativa (78 genes), but still higher than the average of 34 genes per genome in non-rubber-producing species (Reyes-Chin-Wo et al., 2017).

Considering that T. dubius has milky sap in its stem and leaves (Duke, 1992), our genomic data suggest that T. dubius, or Tragopogon polyploids such as T. miscellus and T. mirus, might serve as another potential alternative source of natural rubber.

Tragopogon dubius Leaf Alternative Splicing

In Chapter 4, I attempted to study AS in T. dubius using the de novo pipeline described in Chapter 2. Here, a reference-based AS analysis for T. dubius was conducted using leaf Iso-Seq™ data. It is worth noting that the inflorescence RNA-Seq data (Shan et al., unpublished) used in genome annotation were not included in the AS analysis conducted here. These inflorescence RNA-Seq data were from two different T. dubius populations (Pullman, WA, and Moscow, ID) and were originally generated for the data provider’s differential gene expression project. Thus, for the AS analysis, I did not combine these inflorescence RNA-Seq data with the leaf Iso-Seq™ data as I did for the genome annotation. In T. dubius, the most common type of AS event was IR, and the rarest was ES, consistent with observations in other species (Barbazuk et al., 2008; Chamala et al., 2015; Mei et al., 2017a). In leaves, 11.30% (3,009 out of 26,631) of the multi-exon genes in T. dubius showed AS. This frequency was much lower than the overall AS frequency of the two basal angiosperms discussed in Chapter 2, as well as those of monocots and eudicots (Barbazuk et al., 2008; Chamala et al., 2015; Mei et al., 2017a). This result may reflect the impact of tissue type on the expression and patterns of AS events, as only one tissue type was included in this study. Indeed, tissue-specific AS events are commonly observed in other plant species (Zhou et al., 2011; Amborella Genome Project, 2013).

In Chapter 4, the de novo pipeline generated 518 BLASR alignments as AS candidates. Considering the ~33% success rate of that pipeline, only ~170 of these candidates (about one-third of 518) would be expected to pass RT-PCR validation. In contrast, the reference-based AS analysis here had much higher throughput and was more time-efficient and accurate. In total, 14,628 AS events from 3,009 genes were detected. This T. dubius AS database, as well as the T. dubius draft genome, will be the reference for future AS analyses in the Tragopogon polyploidy system.

Cost-Efficiency of de novo Genome Assembly

The 10X Genomics platform proved to be a highly cost-efficient approach that can be applied to future plant genome projects. The total budget of my project was only around $4,000, while the draft genome was able to cover approximately 80% of the gene space. This result is particularly encouraging given the complex evolutionary history of T. dubius. The computation behind genome assembly was also reasonably effective thanks to the Supernova assembler. Additional PacBio data would be able to further improve the quality and coverage of the genome. In the end, the draft genome supports high-throughput AS analysis and gene family analyses. It also lays the foundation for future research in the Tragopogon polyploidy system, such as DNA methylation, transposon diversity, and differential gene expression.


Table 5-1. Statistics of the T. dubius genome assembly.

Metric                               Value
Assembly size (scaffolds >10 kb)     808.32 Mb
Large scaffolds (scaffolds >10 kb)   12,310
GC content                           36.2%
Repeat content                       59.40%
Contig N50                           16.60 kb
Scaffold N50                         106.72 kb
Largest scaffold                     1.05 Mb
Minimum scaffold                     10.00 kb


Table 5-2. Classification of the predicted repetitive elements in the T. dubius genome.

Type                        Number of elements   Length occupied (bp)   % of sequence
SINEs                       2,514                245,432                0.02%
LINEs                       22,493               14,072,739             1.39%
LTR                         356,932              333,941,521            32.92%
DNA                         105,291              35,599,144             3.51%
Unclassified                806,593              210,896,889            20.79%
Total interspersed repeats                       594,755,725            58.63%
Small RNA                   215                  32,803                 0.00%
Satellites                  1,431                240,124                0.02%
Simple repeat               136,909              8,555,193              0.84%
Low complexity              21,490               1,138,669              0.11%


Table 5-3. Comparison between the predicted gene sets of T. dubius and L. sativa.

Metric                             T. dubius     L. sativa
Number of gene models              30,325        37,828
Number of monoexonic gene models   3,694         9,834
Number of exons per gene           5.21441055    4.54597124
Mean gene length (bp)              2996.01       2824.79
Min gene length (bp)               180           150
Max gene length (bp)               77,648        90,071
Mean CDS length (bp)               1030.17       1058.64
Min CDS length (bp)                156           69
Max CDS length (bp)                15,204        15,948
Mean exon length (bp)              280.26        280.11
Min exon length (bp)               3             1
Max exon length (bp)               10,017        12,527
Mean intron length (bp)            364.37        336.8
Min intron length (bp)             4             4
Max intron length (bp)             68,182        49,397


Table 5-4. Alternative splicing in T. dubius leaf tissue.

AS Type                Events   %        Assemblies   %        Loci (genes)   %
Retained Intron        4,780    32.68%   3,574        41.77%   2,189          72.75%
Spliced Intron         4,780    32.68%   4,679        54.69%   2,189          72.75%
Retained Exon          396      2.71%    581          6.79%    297            9.87%
Skipped Exon           352      2.41%    499          5.83%    297            9.87%
Alternative Acceptor   2,936    20.07%   3,564        41.65%   1,164          38.68%
Alternative Donor      1,384    9.46%    1,931        22.57%   606            20.14%
Total                  14,628            8,556                 3,009


[Figure: K-mer frequency distribution. x-axis: K-mer depth (0 to 200); y-axis: Frequency (0e+00 to 4e+07).]

Figure 5-1. K-mer frequency analysis of the T. dubius genome.


[Figure: genome components. LTR: 32.92%; Other non-repetitive DNA: 31.43%; Unclassified repeat region: 20.79%; Intron: 4.59%; Protein-coding gene: 4.37%; DNA transposon: 3.51%; LINEs: 1.39%; Simple repeat: 0.84%; Low complexity: 0.11%; SINEs: 0.02%; Satellites: 0.02%; Small RNA: 0.00%.]

Figure 5-2. Genome components of T. dubius.


[Figure: BUSCO categories. Completed: 72.6% (1,046), comprising Single-copy: 66.9% (964) and Duplicated: 5.7% (82); Fragmented: 10.3% (148); Missing: 17.1% (246).]

Figure 5-3. BUSCO results of the T. dubius predicted gene set.


CHAPTER 6 CONCLUSION

Polyploidy, or whole-genome duplication (WGD), is a widespread phenomenon throughout eukaryotes and has long been considered an important speciation mechanism in plants. Alternative splicing (AS) is a major source of transcript and proteome diversity, which may also influence speciation. However, studies examining the impact of WGD on AS are still limited. Examining AS in non-model plant species is a challenge, usually requiring de novo assembly of transcriptome sequences without the benefit of a well-annotated reference genome. Thus, I first developed a novel approach for detecting AS events in non-model organisms (Chapter 2).

Research on both human and mouse has demonstrated the advantages of using Iso-Seq™ data for isoform-level transcriptome analysis, including the study of AS and gene fusion. I applied Iso-Seq™ to investigate AS in Amborella trichopoda, a phylogenetically pivotal species that is sister to all other living angiosperms. The data show that, compared with RNA-Seq data, the Iso-Seq™ platform provides better recovery of large transcripts, new gene locus identification, and gene model correction. Reference-based AS detection with Iso-Seq™ data identifies AS within a higher fraction of multi-exonic genes than observed for published RNA-Seq analysis (45.8% vs. 37.5%). These data demonstrate that the Iso-Seq™ approach is useful for detecting AS events.

Using the Iso-Seq-defined transcript collection in A. trichopoda as a reference, I further describe a pipeline for detection of AS isoforms from PacBio Iso-Seq™ data without using a reference sequence (de novo). Results using this pipeline show a 63.5% overall success rate in identifying AS events. This de novo AS detection pipeline provides a method to accurately characterize and identify bona fide alternatively spliced transcripts in any non-model system that lacks a reference genome sequence. Hence, the pipeline has broad potential applications across the biology community.

Splicing noise is observed in AS analysis in plants. A common hypothesis is that AS events with functional significance may be conserved among different species. Thus, the second part of my Ph.D. project was to conduct a genome-scale analysis of conserved alternative splicing events among representatives of basal angiosperms, eudicots, and monocots (Chapter 3). This analysis includes the basal angiosperms A. trichopoda and N. caerulea; the rosids V. vinifera cv. Cabernet Sauvignon and A. thaliana; the asterids C. acuminata and S. lycopersicum; and the two monocots S. polyrhiza and O. sativa ssp. japonica.

In total, 22,764 AS events shared between 2 or more species were identified among these eight angiosperms. My results revealed significant differences between monocots and eudicots in terms of shared AS events, which suggests dynamic AS changes during angiosperm evolution. The two monocot species examined had considerably lower numbers of shared AS events, and AS events in general, despite having large input sequence data sets.

Next, I identified 2,199 conserved AS events between the two basal angiosperms, A. trichopoda and N. caerulea. The GO enrichment analysis indicated that conserved AS genes were enriched in genes that are involved in photosynthesis, mRNA transcription, stress response, and DNA methylation. Protein kinase genes and genes involved in mRNA processing, splicing, and splice site selection were likely to contain conserved AS events. I also found that 25 AS events were highly conserved in all eight angiosperms. Functional annotation of these AS genes suggested that nearly 50% of these highly conserved AS genes encode protein kinases and serine/arginine-rich splicing factors. These results suggest that conserved AS may not be a random process, but rather may have functional importance and evolutionary implications. Genes with conserved AS are concentrated in some specific functions, and the gain/loss of the AS events is likely related to their evolutionary lineages.

The PCA of shared AS events among the eight angiosperm species grouped together three evolutionarily distant species: one basal angiosperm (N. caerulea) and two monocots (S. polyrhiza and O. sativa). I conducted GO enrichment analysis on genes that have shared AS events among these aquatic species. The results showed that the GO terms “response to hypoxia” and “response to salt” were over-represented in the genes with shared AS between N. caerulea and S. polyrhiza. Water lily, rice, and duckweed all grow in aquatic habitats, which usually have lower oxygen concentrations and higher salt concentrations than land habitats. It is tempting to speculate that these events result from convergent evolution because they assist in adapting to an aquatic environment, which suggests that environmental factors may also affect AS in addition to evolutionary forces like WGD and speciation.

In Chapter 4, I applied the de novo AS detection pipeline to study the Tragopogon polyploidy system; the analysis was performed in T. dubius. Results indicate that the de novo pipeline developed and validated in the basal angiosperm A. trichopoda could also be applied in other angiosperms that lack a reference genome. However, according to RT-PCR validation, the success rate of the pipeline for de novo detection of AS in T. dubius was relatively low (33.3% in T. dubius vs. 63.5% in A. trichopoda). Interestingly, 80% of the confirmed AS events occurred in UTR regions and did not cause any ORF shift or potential NMD. AS events in the 5’ UTR region usually represent a regulatory mechanism for gene expression or translation (Palaniswamy et al., 2010; Kramer et al., 2013). The low percentage of loci detected to undergo AS suggests that de novo detection may not identify all, or even most, cases of AS and that a genome-based approach is necessary for AS detection in Tragopogon.

Allopolyploid species and their diploid parents in Tragopogon (Asteraceae) represent an important evolutionary model for examining the genomic consequences of recent and recurring allopolyploidy. The genome size of Tragopogon is large, estimated to be 2.30 Gb to 2.88 Gb (Pires et al., 2004; Garcia et al., 2013). Genome-wide AS analysis in large, complex genomes such as Tragopogon requires a reference genome. However, there is no available genome reference sequence for any species of Tragopogon, which limits investigation of the impacts of polyploidization in this evolutionary model system. Thus, in the last part of my dissertation, I used the linked-read sequencing approach (10X Genomics) to construct a draft genome assembly of T. dubius, the shared diploid parent of the recently formed allotetraploids T. mirus and T. miscellus (Chapter 5).

I generated 253.2 Gb of raw data and completed a de novo assembly of the T. dubius genome. This draft genome is 808.32 Mb, covering approximately 40% of the genome, and has an N50 scaffold size of 0.11 Mb and N50 contig size of 16.60 kb. Annotation of both genes and repeats identified 30,325 protein-coding genes with an average gene length of 2,996 bp. BUSCO analysis suggested that approximately 80% of the gene space was captured in this draft genome assembly. I also detected the presence of the CYC/TB1 gene family, which is important in inflorescence architecture, as well as two other protein families related to latex/rubber production.

This draft genome for T. dubius is the first reference genome in the genus Tragopogon. It will be the foundation for analyses of alternative splicing, DNA methylation, and transposable elements in the Tragopogon polyploid system. This reference genome will help make the Tragopogon polyploid system a much more tractable genetic and evolutionary model for the study of polyploidy. The draft genome provides essential genomic resources both for downstream analyses of polyploidy in Tragopogon and for studies of Asteraceae in general.


APPENDIX A SUPPLEMENTAL MATERIALS FOR CHAPTER 2

Materials and Methods

cDNA Synthesis, Library Construction, and Sequencing

mRNAs with polyA tails were reverse-transcribed to full-length cDNA using the Clontech SMARTer™ PCR cDNA Synthesis Kit. cDNAs were then size-selected into 1-2-kb, 2-3-kb, and 3-6-kb fractions with BluePippin™ and converted to SMRTbell libraries. The leaf cDNA library was normalized using an Evrogen Trimmer-2 Kit prior to size selection. For leaf tissue, two SMRT cells were used for the 1-2-kb fraction, three cells for the 2-3-kb fraction, and another three cells for the 3-6-kb fraction. For flower tissue, three, four, and four SMRT cells were used in 1-2-, 2-3-, and 3-6-kb fractions, respectively.

Data Collection and Error Correction Using SMRT Analysis Software

The PacBio platform uses ROIs to represent the primary sequence data, which can be collected by the ConsensusTools.sh module from the raw data (PacBio HDF5 files). The ConsensusTools.sh module has two important parameters, the minimum predicted accuracy (--minPredictedAccuracy) and the minimum number of full passes (--minFullPasses). The parameter --minFullPasses denotes the number of full-pass reads traversing the insert on a SMRTbell that are required for making an ROI. For example, setting --minFullPasses at 1 means only raw reads that contain at least one full-pass read could generate ROIs. In contrast, setting --minFullPasses at 0 allows raw reads that contain only partial reads to generate ROIs.
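The effect of the two thresholds can be pictured as a simple filter over raw reads. The sketch below is only a conceptual illustration of the parameter semantics described above: the `RawRead` record, its fields, and `select_for_roi` are hypothetical names, and the code does not reproduce the internals of ConsensusTools.sh. The accuracy cutoff of 75 is the value this appendix reports using.

```python
from dataclasses import dataclass


@dataclass
class RawRead:
    """Hypothetical summary of one raw read (not a PacBio data structure)."""
    name: str
    full_passes: int            # number of full passes across the insert
    predicted_accuracy: float   # predicted accuracy, in percent


def select_for_roi(reads, min_full_passes=0, min_predicted_accuracy=75):
    """Keep reads that satisfy both thresholds and could therefore yield an ROI."""
    return [
        r for r in reads
        if r.full_passes >= min_full_passes
        and r.predicted_accuracy >= min_predicted_accuracy
    ]


# Toy reads for illustration only.
toy_reads = [
    RawRead("a", 0, 80.0),   # partial pass only
    RawRead("b", 1, 90.0),
    RawRead("c", 2, 70.0),   # fails the accuracy threshold
    RawRead("d", 2, 95.0),
]
kept_0 = [r.name for r in select_for_roi(toy_reads, min_full_passes=0)]
kept_1 = [r.name for r in select_for_roi(toy_reads, min_full_passes=1)]
kept_2 = [r.name for r in select_for_roi(toy_reads, min_full_passes=2)]
```

Raising the full-pass threshold trades yield for per-read quality, which is the pattern visible in the ROI counts of Table A-1.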

To investigate the effects of different parameter settings on ROI collection, we set --minFullPasses to 0, 1, and 2 to generate three ROI data sets. Another parameter, --minPredictedAccuracy, was set at 75 based on the PacBio Iso-Seq™ manufacturer’s recommendations. Full-length, non-chimeric ROIs (flncROIs) were then identified by the pbtranscript.py classify module with a minimum read length requirement (--min_seq_len) of 300 bp for each ROI data set. A flncROI does not necessarily equate to a biological full-length transcript, which is expected to include the entire CDS and UTR regions. Rather, flncROIs are defined as reads that capture the 5’ and 3’ adapters and the polyA tail, and thus are reads that define the full length of an insert.

In total, three flncROI data sets were generated: zero-full-passes flncROIs (--minFullPasses 0), one-full-pass flncROIs (--minFullPasses 1), and two-full-passes flncROIs (--minFullPasses 2).

AS de novo Detection Pipeline

The entire pipeline is described in Figure A-1. All-vs-all BLAST was first performed on two-full-passes flncROIs using NCBI-BLAST with a 99% identity setting. BLAST alignments that met the following criteria were kept as candidate AS events: 1) both sequences are longer than 1,000 bp and have two HSPs in the alignment; 2) the “AS Gap” is larger than 100 bp and is at least 100 bp from the 3’/5’ end; and 3) an “Overlap” of 5 bp is allowed in the fully spliced transcript. AS events found by the de novo method were first subjected to computational validation with the existing Amborella AS database based on RNA-Seq data (ASisoform_overlap_evm27_0926_JunctionSupport2x_FPKM1_pasa.gtf.zip) and AS events detected in the Iso-Seq™ data by the PASA pipeline.
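The three filtering criteria can be expressed as a small predicate over a two-HSP alignment. The following is a minimal sketch under stated assumptions, not the scripts used in this study: the `HSP` record, the function name, and the coordinate convention (1-based, inclusive, both HSPs on the forward strand) are all hypothetical, and the thresholds are simply the values quoted above.

```python
from dataclasses import dataclass


@dataclass
class HSP:
    """One high-scoring segment pair; 1-based inclusive, forward-strand coordinates."""
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def is_as_candidate(q_len, s_len, hsp_a, hsp_b,
                    min_len=1000, min_gap=100, min_end_dist=100, max_overlap=5):
    """Return True if a two-HSP alignment looks like an AS candidate."""
    # Criterion 1: both transcripts must be longer than min_len.
    if q_len <= min_len or s_len <= min_len:
        return False
    # Order the HSPs along the query and require the same order on the subject.
    first, second = sorted((hsp_a, hsp_b), key=lambda h: h.q_start)
    if second.s_start <= first.s_start:
        return False
    # Unaligned bases between the HSPs on each sequence (negative = overlap).
    q_between = second.q_start - first.q_end - 1
    s_between = second.s_start - first.s_end - 1
    # Try each sequence in turn as the one carrying the AS gap.
    options = (
        (q_between, s_between, first.q_end, second.q_start, q_len),
        (s_between, q_between, first.s_end, second.s_start, s_len),
    )
    for gap, other, gap_left, gap_right, length in options:
        gap_ok = gap > min_gap                                  # criterion 2: gap size
        ends_ok = (gap_left >= min_end_dist
                   and length - gap_right + 1 >= min_end_dist)  # criterion 2: distance from ends
        continuous = -max_overlap <= other <= 0                 # criterion 3: small overlap allowed
        if gap_ok and ends_ok and continuous:
            return True
    return False


# Toy example: the query keeps a 200-bp segment that the subject lacks.
retained = is_as_candidate(1500, 1300, HSP(1, 600, 1, 600), HSP(801, 1500, 601, 1300))
```

In this sketch the sequence with the gap plays the role of the isoform that retains the extra segment, while the continuous sequence corresponds to the fully spliced transcript.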


Table A-1. Number of reads collected using different minimum numbers of full passes.

Tissue   Size fraction   Type      minfullpass¹=0    minfullpass=1     minfullpass=2
Leaf     1-2kb           ROI       83,577            46,479            41,014
Leaf     1-2kb           flncROI   44,439 (53.2%)    36,667 (78.9%)    32,982 (80.4%)
Leaf     2-3kb           ROI       158,919           59,259            47,137
Leaf     2-3kb           flncROI   59,614 (37.5%)    46,268 (78.1%)    38,133 (80.9%)
Leaf     > 3kb           ROI       159,377           37,108            26,606
Leaf     > 3kb           flncROI   45,122 (28.3%)    29,132 (78.5%)    21,471 (80.7%)
Flower   1-2kb           ROI       38,913            17,661            15,195
Flower   1-2kb           flncROI   15,118 (38.9%)    12,453 (70.5%)    11,022 (72.5%)
Flower   2-3kb           ROI       67,528            17,451            13,785
Flower   2-3kb           flncROI   21,262 (31.5%)    13,344 (76.5%)    10,735 (77.9%)
Flower   > 3kb           ROI       152,144           26,397            18,145
Flower   > 3kb           flncROI   32,399 (21.3%)    19,023 (72.1%)    13,360 (73.6%)

¹ minfullpass: minimum number of full passes


Table A-2. Number of reads generated from ICE and LSC after error correction.

Tissue type   ICE Consensus isoform   LSC       LSC > 0.5 SR length coverage
Leaf          91,085                  146,158   143,065
Flower        55,602                  68,177    65,254
Total         146,687                 214,335   208,319


Table A-3. Summary of the number of AS events detected by three datasets.

Number of AS events   Zero-full-passes flncROIs   LSC-corrected zero-full-passes flncROIs   ROI0+LSC_Full_LR
Retained Intron       4,836 (27.3%)               7,410 (34.1%)                             8,286 (29.4%)
Spliced Intron        4,836                       7,410                                     8,286
Retained Exon         1,206 (6.8%)                943 (4.3%)                                1,686 (6.0%)
Skipped Exon          1,155 (6.5%)                871 (4.0%)                                1,595 (5.7%)
Alt Acceptor          3,570 (20.2%)               3,221 (14.8%)                             5,242 (18.7%)
Alt Donor             2,090 (11.8%)               1,893 (8.7%)                              3,134 (11.1%)
Total                 17,693                      21,748                                    28,229


Table A-4. Comparison of detected AS events between zero-full-passes ROIs and LSC-corrected zero-full-passes ROIs.

AS type                Same    LSC only   ROIs only
Retained Intron        3,938   3,472      898
Retained Exon          495     448        711
Skipped Exon           472     399        683
Alternative Acceptor   1,644   1,577      1,926
Alternative Donor      918     975        1,172


Table A-5. Genes used for RT-PCR validation.

               Validated by
Gene           PASA   RNA_Seq   RT-PCR   Scaffold   AS type   Annotation
AmTrS1.217     Yes    No        Yes      1          IR        Hippocampus abundant transcript 1 protein
AmTrS1.593     No     No        No       1          N/A       N/A
AmTrS2.33      Yes    Yes       Yes      2          IR        Cyclin-dependent kinase E-1
AmTrS2.155     Yes    Yes       Yes      2          ES        Pentatricopeptide repeat-containing protein At4g17616
AmTrS2.258     No     No        No       2          N/A       N/A
AmTrS3.122     Yes    Yes       Yes      3          ES        Flap endonuclease 1
AmTrS3.4       Yes    No        Yes      3          IR        pyrophosphate-energized vacuolar membrane proton pump 1
AmTrS3.151     No     No        No       3          N/A       N/A
AmTrS4.83      Yes    No        Yes      4          ES        Cationic amino acid transporter 2
AmTrS4.287     No     Yes       Yes      4          IR        AP-5 complex subunit zeta-1
AmTrS6.197     Yes    Yes       Yes      6          ES        WPP domain-interacting tail-anchored protein 2
AmTrS6.new     Yes    No        Yes      6          ES        Nudix hydrolase 2-like
AmTrS9.428     Yes    Yes       No       9          IR        Protochlorophyllide reductase
AmTrS9.832     Yes    Yes       No       9          alt       MLO-like protein 14
AmTrS9.new     Yes    No        Yes      9          IR        Hyoscyamine 6-dioxygenase-like
AmTrS9.279     No     No        No       9          N/A       N/A
AmTrS9.289     No     No        No       9          N/A       N/A
AmTrS10.197    No     No        No       10         N/A       N/A
AmTrS11.95     Yes    No        Yes      11         ES        dihydroflavonol-4-reductase
AmTrS12.292a   Yes    Yes       Yes      12         IR        Ribulose bisphosphate carboxylase/oxygenase activas
AmTrS12.292b   Yes    Yes       Yes      12         IR        Ribulose bisphosphate carboxylase/oxygenase activas


Table A-5. Continued

               Validated by
Gene           PASA   RNA_Seq   RT-PCR   Scaffold   AS type   Annotation
AmTrS13        No     No        No       13         N/A       N/A
AmTrS14        No     No        No       14         N/A       N/A
AmTrS17.69     Yes    No        Yes      17         IR        Serine/threonine-protein kinase SRK2A
AmTrS17.174a   No     No        No       17         N/A       N/A
AmTrS17.174b   No     No        No       17         N/A       N/A
AmTrS18.130    No     No        No       18         N/A       N/A
AmTrS18.167    No     No        Yes      18         AA        Heterogeneous nuclear ribonucleoprotein R
AmTrS22.242    Yes    No        Yes      22         ES        Linoleate 13S-lipoxygenase 2-1
AmTrS29.21     Yes    No (IR)   No       29         AA        Probable RNA helicase SDE3
AmTrS29.298    Yes    Yes       Yes      29         ES        Putative ATP-dependent RNA helicase C550.03c
AmTrS33.102    Yes    Yes       No       33         IR        Large proline-rich protein bag6-A
AmTrS38.5      Yes    Yes       No       38         IR        Uncharacterized protein
AmTrS40.175    No     No        Yes      40         ES        Neutral/alkaline invertase 3
AmTrS48.133    Yes    No        No       48         AA        E3 ubiquitin-protein ligase SINAT2
AmTrS48.222    No     No        No       48         N/A       N/A
AmTrS49.254    No     Yes       Yes      49         IR        Beta-1,4-mannosyl-glycoprotein 4-beta-N-acetylglucosaminyltransferase
AmTrS52.82     Yes    No        Yes      52         IR        Vacuolar protein sorting-associated protein 35B
AmTrS53.36     Yes    Yes       Yes      53         IR        ABC transporter C family member 10
AmTrS56.145    Yes    Yes       Yes      56         ES        Inactive poly [ADP-ribose] polymerase RCD1
AmTrS57.195    Yes    Yes       Yes      57         IR        Uncharacterized protein

153

Table A-5. Continued Gene Validated by Scaffold AS type Annotation PASA RNA_Seq RT-PCR AmTrS57.275 Yes yes Yes 57 ES Dephospho-CoA kinase AmTrS58.182 Yes yes Yes 58 ES Uncharacterized protein AmTrS70.132 Yes yes Yes 70 ES Potassium channel AKT2/3 AmTrS76.42 Yes Yes Yes 76 AA SET and MYND domain-containing protein 4 AmTrS77.91 No No No 77 N/A N/A AmTrS95.73 Yes No Yes 95 AD CBL-interacting protein kinase 32 AmTrS99.64 No No No 99 N/A N/A AmTrS99.153a No No No 99 N/A N/A AmTrS99.153b No No No 99 N/A N/A AmTrS126.55 No No Yes 126 AA Protein CHROMATIN REMODELING 25 AmTrS129.65 Yes no Yes 129 AD 1-phosphatidylinositol-3-phosphate 5-kinase FAB1B AmTrS137 No No No 137 N/A N/A AmTrS138.47 Yes Yes Yes 138 AD Uncharacterized protein AmTrS140.51 Yes yes Yes 140 IR Malate dehydrogenase [NADP] 1 AmTrS142.13 Yes yes Yes 142 IR RING finger and CHY zinc finger domain-containing protein 1 AmTrS146.64 Yes no Yes 146 IR U-box domain-containing protein 34 AmTrS156.36 no yes Yes 156 ES Isoamylase 2 AmTrS166.42 Yes no No 166 IR E3 ubiquitin-protein ligase RGLG2 AmTrS175.20 Yes no Yes 175 IR Chorismate synthase 2 AmTrS224.3 no yes Yes 224 ES Pentatricopeptide repeat-containing protein MRL1

Iso-Seq™ raw data

Self-correction, 3′/5′ end completeness identification, and length filtering with SMRT Analysis

Hybrid error correction, if short reads are available

Error-corrected long reads (> 1 kb)

All-vs-all NCBI BLAST analysis

Transcript alignments

Python scripts: transcript alignments with two HSPs, filtered by the following criteria:

The two HSPs have the same forward/reverse direction.

Within the same alignment, one sequence is continuous or has only a small "Overlap"; the other is discrete and shows an "AS gap".

Within the same alignment, the continuous sequence aligns almost completely to the discrete sequence.

The AS gap is larger than 100 bp and at least 100 bp away from the 3′/5′ ends.

AS candidate alignments for primer design and RT-PCR analysis

Figure A-1. AS de novo detection pipeline flowchart.
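
The HSP filtering criteria listed in the flowchart can be illustrated with a minimal Python sketch. This is not the dissertation's actual script. The `HSP` class, its field names, and the function `find_as_candidate` are hypothetical, and the `max_overlap` (10 bp) and `min_coverage` (0.9) defaults are assumptions, since the flowchart gives no numbers for the "small Overlap" or for "almost completely". Only the two 100-bp thresholds come from the flowchart. Coordinates are assumed to be 1-based, inclusive, and normalized so that start <= end, with orientation carried in `strand`. Collinearity of the two HSPs is not checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSP:
    """One high-scoring pair; coordinates are 1-based, inclusive, start <= end."""
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    strand: str  # "+" or "-"


def _gap_info(segments, seq_len):
    """Return (gap, bases left of gap, bases right of gap, covered fraction)."""
    (l_start, l_end), (r_start, r_end) = sorted(segments)
    gap = r_start - l_end - 1  # negative value means the segments overlap
    span = max(l_end, r_end) - l_start + 1
    return gap, l_end, seq_len - r_start + 1, span / seq_len


def find_as_candidate(hsps, q_len, s_len, max_overlap=10, min_gap=100,
                      min_end_dist=100, min_coverage=0.9):
    """Return (continuous sequence, AS gap size), or None if no criterion set fits."""
    # Exactly two HSPs, both in the same direction.
    if len(hsps) != 2 or hsps[0].strand != hsps[1].strand:
        return None
    q = _gap_info([(h.q_start, h.q_end) for h in hsps], q_len)
    s = _gap_info([(h.s_start, h.s_end) for h in hsps], s_len)
    # Try each sequence in turn as the continuous one.
    for name, cont, disc in (("query", q, s), ("subject", s, q)):
        c_gap, _, _, c_cov = cont
        d_gap, d_left, d_right, _ = disc
        if (-max_overlap <= c_gap <= 0        # continuous, or small overlap
                and c_cov >= min_coverage     # continuous sequence almost fully aligned
                and d_gap > min_gap           # AS gap larger than 100 bp
                and d_left >= min_end_dist    # gap at least 100 bp from one end
                and d_right >= min_end_dist): # ... and from the other end
            return name, d_gap
    return None


# A 1,000-bp isoform aligned to a 1,300-bp isoform carrying an extra 300-bp segment.
pair = [HSP(1, 500, 1, 500, "+"), HSP(501, 1000, 801, 1300, "+")]
print(find_as_candidate(pair, q_len=1000, s_len=1300))  # ('query', 300)
```

In this sketch the shorter isoform is reported as the continuous sequence and the 300-bp unaligned stretch of the longer isoform as the AS gap. Alignments that return `None` would be dropped before primer design.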

Figure A-2. Length distribution of all zero-full-passes ROIs (a) and full-length ROIs (b).

Figure A-3. Percentage of full-length ROIs out of all zero-full-passes ROIs in all size ranges and tissues.

[Figure A-4: four pie charts. Panels: a, 1-2 kb; b, 2-3 kb; c, > 3 kb; d, all. Across the four panels, flncROIs account for 25-36%, partial polyA ROIs for 16-21%, and no-polyA ROIs plus short ROIs for 46-54%.]

Figure A-4. Pie charts of zero-full-passes ROIs of three categories, as defined in the text: blue: full-length ROIs, representing full-length mRNA sequences; red: partial ROIs with polyA tails, representing incomplete mRNA sequences; green: ROIs without a polyA tail. a. 1-2-kb size fraction. b. 2-3-kb size fraction. c. 3-6-kb size fraction. d. all size ranges; both leaf and flower data are included.

APPENDIX B SUPPLEMENTAL MATERIALS FOR CHAPTER 5

Table B-1. The TCP protein family in Tragopogon.

Target name | Accession | Target length | Query name | Query length | E-value | Description of target
TCP | PF03634.12 | 158 | TragDub22573-RA | 371 | 1.00E-42 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub22573-RA | 371 | 1.00E-42 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub17855-RA | 392 | 1.20E-42 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub17855-RA | 392 | 1.20E-42 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub21715-RA | 322 | 4.90E-30 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub16635-RA | 371 | 1.40E-37 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub22569-RA | 449 | 4.30E-41 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub01087-RA | 369 | 1.90E-42 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub23885-RA | 319 | 3.00E-37 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub10279-RA | 192 | 1.10E-28 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub16441-RA | 163 | 1.60E-22 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub16025-RA | 488 | 6.30E-41 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub15759-RA | 342 | 3.80E-42 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub15759-RA | 342 | 3.80E-42 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub06533-RA | 210 | 6.50E-29 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub04047-RA | 303 | 1.30E-34 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub04048-RA | 330 | 1.60E-33 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub04049-RA | 331 | 2.20E-36 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub04049-RA | 331 | 2.20E-36 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub04045-RA | 291 | 1.20E-34 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub04046-RA | 343 | 1.50E-34 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub04304-RA | 395 | 1.20E-35 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub04304-RA | 395 | 1.20E-35 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub04304-RA | 395 | 1.20E-35 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub16894-RA | 339 | 2.40E-40 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub29017-RA | 294 | 7.20E-38 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub18480-RA | 392 | 2.60E-41 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub25478-RA | 364 | 7.00E-27 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub25478-RA | 364 | 7.00E-27 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub26634-RA | 242 | 4.90E-30 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub15985-RA | 200 | 1.10E-16 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub15986-RA | 214 | 6.20E-22 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub15990-RA | 131 | 4.20E-21 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub06280-RA | 558 | 4.20E-33 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub08163-RA | 292 | 1.30E-29 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub08163-RA | 292 | 1.30E-29 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub08399-RA | 443 | 1.80E-40 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub08399-RA | 443 | 1.80E-40 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub26682-RA | 523 | 3.10E-41 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub26682-RA | 523 | 3.10E-41 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub26682-RA | 523 | 3.10E-41 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub07816-RA | 342 | 2.00E-21 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub14174-RA | 281 | 2.30E-34 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub24305-RA | 400 | 9.00E-30 | TCP family transcription factor
TCP | PF03634.12 | 158 | TragDub24305-RA | 400 | 9.00E-30 | TCP family transcription factor

Table B-2. CYC/TB1 gene family.

Species | Gene ID
Aquilegia coerulea | Aqcoe3G048600.1, Aqcoe3G395500.1
Arabidopsis thaliana | AT1G67260.1, AT1G68800.1, AT3G18550.1
Glycine max | Glyma.04G152400.1, Glyma.05G013300.1, Glyma.06G210600.1, Glyma.08G256400.1, Glyma.10G246200.1, Glyma.13G047400.1, Glyma.17G121500.1, Glyma.18G280700.1, Glyma.19G044400.1, Glyma.20G148500.1
Gossypium raimondii | Gorai.001G200400.1, Gorai.007G007500.1, Gorai.008G186800.1, Gorai.008G285300.1
Helianthus annuus | HanXRQChr04g0109041, HanXRQChr04g0112141, HanXRQChr04g0112191, HanXRQChr04g0112201, HanXRQChr05g0149701, HanXRQChr05g0153871, HanXRQChr08g0209331, HanXRQChr09g0269811, HanXRQChr10g0283781, HanXRQChr12g0365351, HanXRQChr12g0365361, HanXRQChr15g0472911, HanXRQChr15g0489241, HanXRQChr15g0489921, HanXRQChr16g0501831, HanXRQChr16g0528881
Lactuca sativa | Lsat_1_v5_gn_1_16061.1, Lsat_1_v5_gn_3_94461.1, Lsat_1_v5_gn_4_112280.1, Lsat_1_v5_gn_4_175901.1, Lsat_1_v5_gn_4_175961.1, Lsat_1_v5_gn_4_19160.1, Lsat_1_v5_gn_5_118480.1, Lsat_1_v5_gn_8_37420.1
Oryza sativa | Os03g49880.1, Os08g33530.1, Os09g24480.1
Sorghum bicolor | Sobic.001G121600.1, Sobic.002G198400.1, Sobic.007G135700.1
Solanum lycopersicum | Solyc02g089830.1.1, Solyc03g045030.1.1, Solyc03g119770.2.1, Solyc04g006980.1.1, Solyc05g009900.1.1, Solyc06g069240.1.1
Tragopogon dubius | TragDub01087, TragDub04045, TragDub04046, TragDub04047, TragDub04048, TragDub04049, TragDub04304, TragDub06533, TragDub08399, TragDub15759, TragDub16635, TragDub22573, TragDub29017
Vitis vinifera | GSVIVT01008234001, GSVIVT01011962001, GSVIVT01036449001

Table B-3. The REF protein family in Tragopogon.

Target name | Accession | Target length | Query name | Query length | E-value | Description of target
REF | PF05755.11 | 206 | TragDub22449-RA | 210 | 9.80E-78 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub22450-RA | 175 | 2.70E-62 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub22450-RA | 175 | 2.70E-62 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub22451-RA | 173 | 4.30E-61 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub22451-RA | 173 | 4.30E-61 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub07265-RA | 224 | 2.90E-58 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub07266-RA | 207 | 8.70E-39 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub07266-RA | 207 | 8.70E-39 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub07266-RA | 207 | 8.70E-39 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub07262-RA | 229 | 5.80E-65 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub07263-RA | 229 | 1.50E-60 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub07260-RA | 230 | 1.00E-58 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub07261-RA | 230 | 4.00E-62 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub07258-RA | 261 | 6.40E-10 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub07258-RA | 261 | 6.40E-10 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub28742-RA | 227 | 2.70E-30 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub28742-RA | 227 | 2.70E-30 | Rubber elongation factor protein (REF)
REF | PF05755.11 | 206 | TragDub21985-RA | 117 | 1.30E-33 | Rubber elongation factor protein (REF)

Table B-4. The Bet_v_1 protein family in Tragopogon.

Target name | Accession | Target length | Query name | Query length | E-value | Description of target
Bet_v_1 | PF00407.18 | 151 | TragDub06492-RA | 118 | 4.70E-16 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub06492-RA | 118 | 4.70E-16 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub13429-RA | 154 | 6.20E-38 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub02621-RA | 133 | 6.50E-22 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub02621-RA | 133 | 6.50E-22 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub01419-RA | 197 | 3.10E-06 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub01419-RA | 197 | 3.10E-06 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub23020-RA | 158 | 9.10E-20 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub23022-RA | 158 | 6.40E-19 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub14117-RA | 153 | 1.60E-13 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub14116-RA | 101 | 2.20E-06 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub12813-RA | 153 | 5.50E-32 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub17298-RA | 151 | 6.90E-29 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub29091-RA | 137 | 1.20E-16 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub15359-RA | 158 | 1.80E-28 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub24233-RA | 151 | 1.90E-28 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub10643-RA | 169 | 3.10E-20 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub19436-RA | 151 | 8.30E-36 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub23438-RA | 175 | 5.10E-41 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub11902-RA | 152 | 1.10E-31 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub11899-RA | 111 | 7.50E-27 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub11898-RA | 103 | 6.60E-19 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub00565-RA | 154 | 5.60E-10 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub00567-RA | 153 | 2.10E-09 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub00564-RA | 156 | 7.10E-12 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub01854-RA | 240 | 1.10E-46 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub01854-RA | 240 | 1.10E-46 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub10035-RA | 198 | 6.10E-07 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub20066-RA | 165 | 2.20E-14 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub08865-RA | 156 | 1.60E-15 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub29182-RA | 207 | 7.60E-06 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub21718-RA | 302 | 4.00E-86 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub21718-RA | 302 | 4.00E-86 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub22348-RA | 152 | 3.10E-33 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub22966-RA | 102 | 7.50E-06 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub12582-RA | 302 | 1.10E-86 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub12582-RA | 302 | 1.10E-86 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub29938-RA | 152 | 4.90E-31 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub08716-RA | 158 | 5.10E-21 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub08718-RA | 155 | 1.90E-21 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub08719-RA | 158 | 4.90E-20 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub08717-RA | 279 | 1.60E-39 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub08717-RA | 279 | 1.60E-39 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub26966-RA | 135 | 2.10E-23 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub26966-RA | 135 | 2.10E-23 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub26967-RA | 151 | 9.10E-35 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub21289-RA | 158 | 4.30E-23 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub21291-RA | 158 | 2.50E-20 | Pathogenesis-related protein Bet_v_I family
Bet_v_1 | PF00407.18 | 151 | TragDub24368-RA | 152 | 8.50E-27 | Pathogenesis-related protein Bet_v_I family

LIST OF REFERENCES

Abdel-Ghany, S.E., Hamilton, M., Jacobi, J.L., Ngam, P., Devitt, N., Schilkey, F., Ben- Hur, A., and Reddy, A.S., 2016. A survey of the sorghum transcriptome using single-molecule long reads. Nature Communications 7, 11706.

Adolfsson, S., Michalakis, Y., Paczesniak, D., Bode, S.N., Butlin, R.K., Lamatsch, D.K., Martins, M.J., Schmit, O., Vandekerkhove, J., and Jokela, J., 2010. Evaluation of elevated ploidy and asexual reproduction as alternative explanations for geographic parthenogenesis in Eucypris virens ostracods. Evolution 64, 986-997.

Alexa, A., Rahnenführer, J., and Lengauer, T., 2006. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22, 1600-1607.

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403-410.

Amborella Genome Project, 2013. The Amborella genome and the evolution of flowering plants. Science 342, 1241089.

The Angiosperm Phylogeny Group, 2016. An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG IV. Bot. J. Linn. Soc. 181, 1-20.

Au, K., Underwood, J., Lee, L., and Wong, W., 2012. Improving PacBio Long Read Accuracy by Short Read Alignment. PloS one 7, e46679.

Barbazuk, W.B., Fu, Y., and McGinnis, K.M., 2008. Genome-wide analyses of alternative splicing in plants: opportunities and challenges. Genome Res. 18, 1381-1392.

Bello, M.A., Cubas, P., Álvarez, I., Sanjuanbenito, G., and Fuertes-Aguilar, J., 2017. Evolution and Expression Patterns of CYC/TB1 Genes in Anacyclus: Phylogenetic Insights for Floral Symmetry Genes in Asteraceae. Frontiers in plant science 8, 589.

Brochmann, C., Brysting, A., Alsos, I., Borgen, L., Grundt, H., Scheen, A., and Elven, R., 2004. Polyploidy in arctic plants. Biol. J. Linn. Soc. 82, 521-536.

Buggs, R.J., Chamala, S., Wu, W., Gao, L., May, G.D., Schnable, P.S., Soltis, D.E., Soltis, P.S., and Barbazuk, W.B., 2010a. Characterization of duplicate gene evolution in the recent natural allopolyploid Tragopogon miscellus by next‐ generation sequencing and Sequenom iPLEX MassARRAY genotyping. Mol. Ecol. 19, 132-146.

165

Buggs, R.J., Chamala, S., Wu, W., Tate, J.A., Schnable, P.S., Soltis, D.E., Soltis, P.S., and Barbazuk, W.B., 2012a. Rapid, repeated, and clustered loss of duplicate genes in allopolyploid plant populations of independent origin. Current Biology 22, 248-252.

Buggs, R., Doust, A., Tate, J., Koh, J., Soltis, K., Feltus, F., Paterson, A., Soltis, P., and Soltis, D., 2009. Gene loss and silencing in Tragopogon miscellus (Asteraceae): comparison of natural and synthetic allotetraploids. Heredity 103, 73.

Buggs, R.J., Elliott, N.M., Zhang, L., Koh, J., Viccini, L.F., Soltis, D.E., and Soltis, P.S., 2010b. Tissue‐specific silencing of homoeologs in natural populations of the recent allopolyploid Tragopogon mirus. New Phytol. 186, 175-183.

Buggs, R.J., Renny‐Byfield, S., Chester, M., Jordon‐Thaden, I.E., Viccini, L.F., Chamala, S., Leitch, A.R., Schnable, P.S., Barbazuk, W.B., and Soltis, P.S., 2012b. Next‐generation sequencing and genome evolution in allopolyploids. Am. J. Bot. 99, 372-382.

Buggs, R.J., Wendel, J.F., Doyle, J.J., Soltis, D.E., Soltis, P.S., and Coate, J.E., 2014. The legacy of diploid progenitors in allopolyploid gene expression patterns. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 369, 10.1098/rstb.2013.0354.

Buggs, R.J., Zhang, L., Miles, N., Tate, J.A., Gao, L., Wei, W., Schnable, P.S., Barbazuk, W.B., Soltis, P.S., and Soltis, D.E., 2011. Transcriptomic shock generates evolutionary novelty in a newly formed, natural allopolyploid plant. Current Biology 21, 551-556.

Bushman, B.S., Scholte, A.A., Cornish, K., Scott, D.J., Brichta, J.L., Vederas, J.C., Ochoa, O., Michelmore, R.W., Shintani, D.K., and Knapp, S.J., 2006. Identification and comparison of natural rubber from two Lactuca species. Phytochemistry 67, 2590-2596.

Bushnell, B., 2014. BBMap: a fast, accurate, splice-aware aligner.

Campbell, M.A., Haas, B.J., Hamilton, J.P., Mount, S.M., and Buell, C.R., 2006. Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis. BMC Genomics 7, 1-17.

Campbell, M.S., Law, M., Holt, C., Stein, J.C., Moghe, G.D., Hufnagel, D.E., Lei, J., Achawanantakun, R., Jiao, D., Lawrence, C.J., et al., 2014. MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol. 164, 513-524.

Chaisson, M.J. and Tesler, G., 2012. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13, 238.

166

Chamala, S., Feng, G., Chavarro, C., and Barbazuk, W.B., 2015. Genome-wide identification of evolutionarily conserved alternative splicing events in flowering plants. Frontiers in bioengineering and biotechnology 3, 33.

Chen, F., Liu, X., Yu, C., Chen, Y., Tang, H., and Zhang, L., 2017. Water lilies as emerging models for Darwin’s abominable mystery. Horticulture research 4, 17051.

Chen, T., Wu, T.H., Ng, W.V., and Lin, W., 2011. Interrogation of alternative splicing events in duplicated genes during evolution. 12, S16.

Chen, Z.J., 2007. Genetic and epigenetic mechanisms for gene expression and phenotypic variation in plant polyploids. Annu.Rev.Plant Biol. 58, 377-406.

Chester, M., Gallagher, J.P., Symonds, V.V., Cruz da Silva, A.V., Mavrodiev, E.V., Leitch, A.R., Soltis, P.S., and Soltis, D.E., 2012. Extensive chromosomal variation in a recently formed natural allopolyploid species, Tragopogon miscellus (Asteraceae). Proc. Natl. Acad. Sci. U. S. A. 109, 1176-1181.

Conant, G.C., Birchler, J.A., and Pires, J.C., 2014. Dosage, duplication, and diploidization: clarifying the interplay of multiple models for duplicate gene evolution over time. Curr. Opin. Plant Biol. 19, 91-98.

Cook, L.M. and Soltis, P.S., 2000. Mating systems of diploid and allotetraploid populations of Tragopogon (Asteraceae). II. Artificial populations. Heredity 84, 410.

Cook, L.M. and Soltis, P.S., 1999. Mating systems of diploid and allotetraploid populations of Tragopogon (Asteraceae). I. Natural populations. Heredity 82, 237-244.

Cook, L., Soltis, P., Brunsfeld, S., and Soltis, D., 1998. Multiple independent formations of Tragopogon tetraploids (Asteraceae): evidence from RAPD markers. Mol. Ecol. 7, 1293-1302.

Crane, P.R., Friis, E.M., and Pedersen, K.R., 1995. The origin and early diversification of angiosperms. Nature 374, 27.

Doyle, J.J., Flagel, L.E., Paterson, A.H., Rapp, R.A., Soltis, D.E., Soltis, P.S., and Wendel, J.F., 2008. Evolutionary genetics of genome merger and doubling in plants. Annu. Rev. Genet. 42, 443-461.

Duke, J., 1992. Handbook of edible weeds. Handbook of edible weeds.

Ehrendorfer, F. 1980. Polyploidy and distribution. In: Anonymous Polyploidy. Springer, pp. 45-60.

167

Emms, D.M. and Kelly, S., 2015. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 16, 157.

Finn, R.D., Clements, J., and Eddy, S.R., 2011. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39, W29-W37.

Finn, R.D., Coggill, P., Eberhardt, R.Y., Eddy, S.R., Mistry, J., Mitchell, A.L., Potter, S.C., Punta, M., Qureshi, M., and Sangrador-Vegas, A., 2015. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279-D285.

Flagel, L.E., Chen, L., Chaudhary, B., and Wendel, J.F., 2009. Coordinated and fine- scale control of homoeologous gene expression in allotetraploid cotton. J. Hered. 100, 487-490.

Flagel, L., Udall, J., Nettleton, D., and Wendel, J., 2008. Duplicate gene expression in allopolyploid Gossypium reveals two temporally distinct phases of expression evolution. BMC biology 6, 16.

Flagel, L.E. and Wendel, J.F., 2010. Evolutionary rate variation, genomic dominance and duplicate gene expression evolution during allotetraploid cotton speciation. New Phytol. 186, 184-193.

Fu, Y., Bannach, O., Chen, H., Teune, J.H., Schmitz, A., Steger, G., Xiong, L., and Barbazuk, W.B., 2009. Alternative splicing of anciently exonized 5S rRNA regulates plant transcription factor TFIIIA. Genome Res. 19, 913-921.

Gaeta, R.T. and Chris Pires, J., 2010. Homoeologous recombination in allopolyploids: the polyploid ratchet. New Phytol. 186, 18-28.

Gaeta, R.T., Pires, J.C., Iniguez-Luy, F., Leon, E., and Osborn, T.C., 2007. Genomic changes in resynthesized Brassica napus and their effect on gene expression and phenotype. Plant Cell 19, 3403-3417.

Gaeta, R.T., Yoo, S., Pires, J., Doerge, R.W., Chen, Z.J., and Osborn, T.C., 2009. Analysis of gene expression in resynthesized Brassica napus allopolyploids using Arabidopsis 70mer oligo microarrays. PLoS One 4, e4760.

Gallagher, J.P., Grover, C.E., Hu, G., and Wendel, J.F., 2016. Insights into the ecology and evolution of polyploid plants through network analysis. Mol. Ecol. 25, 2644- 2660.

Garcia, S., Hidalgo, O., Jakovljević, I., Siljak-Yakovlev, S., Vigo, J., Garnatje, T., and Vallès, J., 2013. New data on genome size in 128 Asteraceae species and subspecies, with first assessments for 40 genera, 3 tribes and 2 subfamilies. Plant Biosystems-An International Journal Dealing with all Aspects of Plant Biology 147, 1219-1227.

168

The Gene Ontology Consortium, 2000. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25.

Gillies, A.C., Cubas, P., Coen, E.S., and Abbott, R.J. 2004. Making rays in the asteraceae: Genetics and evolution of radiate versus discoid flower heads. In: Anonymous Developmental genetics and plant evolution. CRC Press, pp. 248- 261.

Gordon, A. and Hannon, G., 2010. Fastx-toolkit. FASTQ/A short-reads pre-processing tools. Unpublished http://hannonlab.cshl.edu/fastx_toolkit.

Gordon, J.L., Byrne, K.P., and Wolfe, K.H., 2009. Additions, losses, and rearrangements on the evolutionary route from a reconstructed ancestor to the modern Saccharomyces cerevisiae genome. PLoS genetics 5, e1000485.

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., et al., 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644-652.

Grant, V., 1975. Genetics of Flowering Plants. Columbia University Press.

Graveley, B.R., 2001. Alternative splicing: increasing diversity in the proteomic world. TRENDS in Genetics 17, 100-107.

Gremme, G., Steinbiss, S., and Kurtz, S., 2013. GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 10, 645- 656.

Grover, C., Gallagher, J., Szadkowski, E., Yoo, M., Flagel, L., and Wendel, J., 2012. Homoeolog expression bias and expression level dominance in allopolyploids. New Phytol. 196, 966-971.

Hamilton, J.P., Neeno-Eckwall, E.C., Adhikari, B.N., Perna, N.T., Tisserat, N., Leach, J.E., Levesque, C.A., and Buell, C.R., 2011. The Comprehensive Phytopathogen Genomics Resource: a web-based resource for data-mining plant pathogen genomes. Database 2011, bar053.

Hon, C., Weber, C., Sismeiro, O., Proux, C., Koutero, M., Deloger, M., Das, S., Agrahari, M., Dillies, M., and Jagla, B., 2012. Quantification of stochastic noise of splicing and polyadenylation in Entamoeba histolytica. Nucleic Acids Res. 41, 1936-1952.

Hovav, R., Faigenboim-Doron, A., Kadmon, N., Hu, G., Zhang, X., Gallagher, J.P., and Wendel, J.F., 2015. A transcriptome profile for developing seed of polyploid cotton. The Plant Genome 8, .

169

Howarth, D.G. and Donoghue, M.J., 2006. Phylogenetic analysis of the "ECE" (CYC/TB1) clade reveals duplications predating the core eudicots. Proc. Natl. Acad. Sci. U. S. A. 103, 9101-9106.

Hu, G., Koh, J., Yoo, M.J., Chen, S., and Wendel, J.F., 2015. Gene-expression novelty in allopolyploid cotton: a proteomic perspective. Genetics 200, 91-104.

Huang, C., Zhang, C., Liu, M., Hu, Y., Gao, T., Qi, J., and Ma, H., 2016. Multiple polyploidization events across Asteraceae with two nested events in the early history revealed by nuclear phylogenomics. Mol. Biol. Evol. 33, 2820-2835.

Hughes, A.L., 1994. The evolution of functionally novel proteins after gene duplication. Proc. Biol. Sci. 256, 119-124.

Hulse-Kemp, A.M., Maheshwari, S., Stoffel, K., Hill, T.A., Jaffe, D., Williams, S.R., Weisenfeld, N., Ramakrishnan, S., Kumar, V., and Shah, P., 2018. Reference quality assembly of the 3.5-Gb genome of Capsicum annuum from a single linked-read library. Horticulture Research 5, 4.

Husband, B.C., Baldwin, S.J., and Suda, J. 2013. The incidence of polyploidy in natural plant populations: Major patterns and evolutionary processes. In: Anonymous Plant Genome Diversity Volume 2. Springer, pp. 255-276.

Irimia, M., Rukov, J.L., Penny, D., Garcia-Fernandez, J., Vinther, J., and Roy, S.W., 2007. Widespread evolutionary conservation of alternatively spliced exons in Caenorhabditis. Mol. Biol. Evol. 25, 375-382.

Jiao, Y., Wickett, N.J., Ayyampalayam, S., Chanderbali, A.S., Landherr, L., Ralph, P.E., Tomsho, L.P., Hu, Y., Liang, H., and Soltis, P.S., 2011. Ancestral polyploidy in seed plants and angiosperms. Nature 473, 97.

Jin, L., Kryukov, K., Clemente, J.C., Komiyama, T., Suzuki, Y., Imanishi, T., Ikeo, K., and Gojobori, T., 2008. The evolutionary relationship between gene duplication and alternative splicing. Gene 427, 19-31.

Jose, C. and Dufresne, F., 2010. Differential survival among genotypes of Daphnia pulex differing in reproductive mode, ploidy level, and geographic origin. Evol. Ecol. 24, 413-421.

Jurka, J., Kapitonov, V.V., Pavlicek, A., Klonowski, P., Kohany, O., and Walichiewicz, J., 2005. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110, 462-467.

Kajitani, R., Toshimoto, K., Noguchi, H., Toyoda, A., Ogura, Y., Okuno, M., Yabana, M., Harada, M., Nagayasu, E., Maruyama, H., et al., 2014. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res. 24, 1384-1395.

170

Kanehisa, M., Goto, S., Sato, Y., Furumichi, M., and Tanabe, M., 2011. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, D109-D114.

Kellis, M., Birren, B.W., and Lander, E.S., 2004. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428, 617.

Kent, W.J., 2002. BLAT--the BLAST-like alignment tool. Genome Res. 12, 656-664.

Kim, D., Langmead, B., and Salzberg, S.L., 2015. HISAT: a fast spliced aligner with low memory requirements. Nature methods 12, 357.

Koh, J., Soltis, P.S., and Soltis, D.E., 2010. Homeolog loss and expression changes in natural populations of the recently and repeatedly formed allotetraploid Tragopogon mirus (Asteraceae). BMC Genomics 11, 97.

Koren, S., Schatz, M.C., Walenz, B.P., Martin, J., Howard, J.T., Ganapathy, G., Wang, Z., Rasko, D.A., McCombie, W.R., and Jarvis, E.D., 2012. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 30, 693.

Kramer, M., Sponholz, C., Slaba, M., Wissuwa, B., Claus, R.A., Menzel, U., Huse, K., Platzer, M., and Bauer, M., 2013. Alternative 5’untranslated regions are involved in expression regulation of human heme oxygenase-1. PloS one 8, e77224.

Krogh, A., Larsson, B., Von Heijne, G., and Sonnhammer, E.L., 2001. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305, 567-580.

Kubis, S., Schmidt, T., and Heslop-Harrison, J.S., 1998. Repetitive DNA elements as a major component of plant genomes. Annals of Botany 82, 45-55.

Lee, J., Cho, Y., Yoon, H., Suh, M.C., Moon, J., Lee, I., Weigel, D., Yun, C., and Kim, J., 2005. Conservation and divergence of FCA function between Arabidopsis and rice. Plant Mol. Biol. 58, 823-838.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R., 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078-2079.

Li, P., Ponnala, L., Gandotra, N., Wang, L., Si, Y., Tausta, S.L., Kebrom, T.H., Provart, N., Patel, R., and Myers, C.R., 2010. The developmental dynamics of the maize leaf transcriptome. Nat. Genet. 42, 1060.

Li, Q., Xiao, G., and Zhu, Y., 2014. Single-nucleotide resolution mapping of the Gossypium raimondii transcriptome reveals a new mechanism for alternative splicing of introns. Molecular plant 7, 829-840.

171

Li, W. and Godzik, A., 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658-1659.

Liu, X., Mei, W., Soltis, P.S., Soltis, D.E., and Barbazuk, W.B., 2017. Detecting alternatively spliced transcript isoforms from single‐molecule long‐read sequences without a reference genome. Molecular ecology resources 17, 1243- 1256.

Lozano, R., Ponce, O., Ramirez, M., Mostajo, N., and Orjeda, G., 2012. Genome-wide identification and mapping of NBS-encoding resistance genes in Solanum tuberosum group phureja. PLoS One 7, e34775.

Lynch, M. and Conery, J.S., 2000. The evolutionary fate and consequences of duplicate genes. Science 290, 1151-1155.

Madlung, A. and Wendel, J.F., 2013. Genetic and epigenetic aspects of polyploid evolution in plants. Cytogenet. Genome Res. 140, 270-285.

Magallón, S., 2014. A review of the effect of relaxed clock method, long branches, genes, and calibrations in the estimation of angiosperm age. Botanical Sciences 92, 1-22.

Magallón, S., Crane, P.R., and Herendeen, P.S., 1999. Phylogenetic pattern, diversity, and diversification of eudicots. Ann. Mo. Bot. Gard. 297-372.

Marçais, G. and Kingsford, C., 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764-770.

Marchant, D.B., Soltis, D.E., and Soltis, P.S., 2016. Patterns of abiotic niche shifts in allopolyploids relative to their progenitors. New Phytol. 212, 708-718.

Marquez, Y., Brown, J.W., Simpson, C., Barta, A., and Kalyna, M., 2012. Transcriptome survey reveals increased complexity of the alternative splicing landscape in Arabidopsis. Genome Res. 22, 1184-1195.

Mei, W., Boatwright, L., Feng, G., Schnable, J.C., and Barbazuk, W.B., 2017a. Evolutionarily Conserved Alternative Splicing Across Monocots. Genetics 207, 465-480.

Mei, W., Liu, S., Schnable, J.C., Yeh, C., Springer, N.M., Schnable, P.S., and Barbazuk, W.B., 2017b. A comprehensive analysis of alternative splicing in paleopolyploid maize. Frontiers in plant science 8, 694.

Moghadam, H.K., Ferguson, M.M., and Danzmann, R.G., 2009. Comparative genomics and evolution of conserved noncoding elements (CNE) in rainbow trout. BMC Genomics 10, 278.

Nystedt, B., Street, N.R., Wetterbom, A., Zuccolo, A., Lin, Y., Scofield, D.G., Vezzi, F., Delhomme, N., Giacomello, S., and Alexeyenko, A., 2013. The Norway spruce genome sequence and conifer genome evolution. Nature 497, 579.

Otto, S.P., 2007. The evolutionary consequences of polyploidy. Cell 131, 452-462.

Ownbey, M., 1950. Natural hybridization and amphiploidy in the genus Tragopogon. Am. J. Bot. 487-499.

Palaniswamy, R., Teglund, S., Lauth, M., Zaphiropoulos, P.G., and Shimokawa, T., 2010. Genetic variations regulate alternative splicing in the 5' untranslated regions of the mouse glioma-associated oncogene 1, Gli1. BMC molecular biology 11, 32.

Park, E., Pan, Z., Zhang, Z., Lin, L., and Xing, Y., 2018. The expanding landscape of alternative splicing variation in human populations. The American Journal of Human Genetics 102, 11-26.

Petersen, T.N., Brunak, S., von Heijne, G., and Nielsen, H., 2011. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature methods 8, 785.

Pickrell, J.K., Pai, A.A., Gilad, Y., and Pritchard, J.K., 2010. Noisy splicing drives mRNA isoform diversity in human cells. PLoS genetics 6, e1001236.

Pires, J.C., Lim, K.Y., Kovarík, A., Matyásek, R., Boyd, A., Leitch, A.R., Leitch, I.J., Bennett, M.D., Soltis, P.S., and Soltis, D.E., 2004. Molecular cytogenetic analysis of recently evolved Tragopogon (Asteraceae) allopolyploids reveal a karyotype that is additive of the diploid progenitors. Am. J. Bot. 91, 1022-1035.

Powell, S., Szklarczyk, D., Trachana, K., Roth, A., Kuhn, M., Muller, J., Arnold, R., Rattei, T., Letunic, I., and Doerks, T., 2011. eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res. 40, D284-D289.

Reddy, A.S., Marquez, Y., Kalyna, M., and Barta, A., 2013. Complexity of the alternative splicing landscape in plants. Plant Cell 25, 3657-3683.

Renny-Byfield, S., Gong, L., Gallagher, J.P., and Wendel, J.F., 2015. Persistence of subgenomes in paleopolyploid cotton after 60 my of evolution. Mol. Biol. Evol. 32, 1063-1071.

Reyes-Chin-Wo, S., Wang, Z., Yang, X., Kozik, A., Arikit, S., Song, C., Xia, L., Froenicke, L., Lavelle, D.O., and Truco, M., 2017. Genome assembly with in vitro proximity ligation data and whole-genome triplication in lettuce. Nature communications 8, 14953.

Roose, M. and Gottlieb, L., 1980. Biochemical properties and level of expression of alcohol dehydrogenases in the allotetraploid plant Tragopogon miscellus and its diploid progenitors. Biochem. Genet. 18, 1065-1085.

Roose, M. and Gottlieb, L., 1976. Genetic and biochemical consequences of polyploidy in Tragopogon. Evolution 30, 818-830.

Roulin, A., Auer, P.L., Libault, M., Schlueter, J., Farmer, A., May, G., Stacey, G., Doerge, R.W., and Jackson, S.A., 2013. The fate of duplicated genes in a polyploid plant genome. The Plant Journal 73, 143-153.

Roux, J. and Robinson-Rechavi, M., 2011. Age-dependent gain of alternative splice forms and biased duplication explain the relation between splicing and duplication. Genome Res. 21, 357-363.

Ruhfel, B.R., Gitzendanner, M.A., Soltis, P.S., Soltis, D.E., and Burleigh, J.G., 2014. From algae to angiosperms–inferring the phylogeny of green plants (Viridiplantae) from 360 plastid genomes. BMC Evolutionary Biology 14, 23.

Sagasti, S., Bernal, M., Sancho, D., del Castillo, M.B., and Picorel, R., 2014. Regulation of the chloroplastic copper chaperone (CCS) and cuprozinc superoxide dismutase (CSD2) by alternative splicing and copper excess in Glycine max. Functional plant biology 41, 144-155.

Salmela, L. and Rivals, E., 2014. LoRDEC: accurate and efficient long read error correction. Bioinformatics 30, 3506-3514.

Saminathan, T., Nimmakayala, P., Manohar, S., Malkaram, S., Almeida, A., Cantrell, R., Tomason, Y., Abburi, L., Rahman, M.A., and Vajja, V.G., 2014. Differential gene expression and alternative splicing between diploid and tetraploid watermelon. J. Exp. Bot. 66, 1369-1385.

Satyawan, D., Kim, M.Y., and Lee, S., 2017. Stochastic alternative splicing is prevalent in mungbean (Vigna radiata). Plant biotechnology journal 15, 174-182.

Sehrish, T., Symonds, V.V., Soltis, D.E., Soltis, P.S., and Tate, J.A., 2015. Cytonuclear coordination is not immediate upon allopolyploid formation in Tragopogon miscellus (Asteraceae) allopolyploids. PLoS One 10, e0144339.

Sharon, D., Tilgner, H., Grubert, F., and Snyder, M., 2013. A single-molecule long-read survey of the human transcriptome. Nat. Biotechnol. 31, 1009.

Simão, F.A., Waterhouse, R.M., Ioannidis, P., Kriventseva, E.V., and Zdobnov, E.M., 2015. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210-3212.

Smit, A., Hubley, R., and Green, P., 2013-2015. RepeatMasker Open-4.0.

Smith, S.A. and O'Meara, B.C., 2012. treePL: divergence time estimation using penalized likelihood for large phylogenies. Bioinformatics 28, 2689-2690.

Soltis, D.E., Albert, V.A., Leebens-Mack, J., Bell, C.D., Paterson, A.H., Zheng, C., Sankoff, D., dePamphilis, C.W., Wall, P.K., and Soltis, P.S., 2009a. Polyploidy and angiosperm diversification. Am. J. Bot. 96, 336-348.

Soltis, D.E., Albert, V.A., Leebens-Mack, J., Palmer, J.D., Wing, R.A., Ma, H., Carlson, J.E., Altman, N., Kim, S., and Wall, P.K., 2008. The Amborella genome: an evolutionary reference for plant biology. Genome Biol. 9, 402.

Soltis, D.E., Buggs, R.J., Barbazuk, W.B., Chamala, S., Chester, M., Gallagher, J.P., Schnable, P.S., and Soltis, P.S., 2012. The early stages of polyploidy: Rapid and repeated evolution in Tragopogon. In: Soltis, P.S. and Soltis, D.E. (eds.), Polyploidy and Genome Evolution. Springer, pp. 271-292.

Soltis, D.E., Buggs, R.J., Barbazuk, W.B., Schnable, P.S., and Soltis, P.S., 2009b. On the origins of species: does evolution repeat itself in polyploid populations of independent origin? Cold Spring Harb. Symp. Quant. Biol. 74, 215-223.

Soltis, D.E., Smith, S.A., Cellinese, N., Wurdack, K.J., Tank, D.C., Brockington, S.F., Refulio‐Rodriguez, N.F., Walker, J.B., Moore, M.J., and Carlsward, B.S., 2011. Angiosperm phylogeny: 17 genes, 640 taxa. Am. J. Bot. 98, 704-730.

Soltis, D.E. and Soltis, P.S., 1989. Allopolyploid speciation in Tragopogon: insights from chloroplast DNA. Am. J. Bot. 76, 1119-1124.

Soltis, D., Soltis, P., Endress, P., and Chase, M., 2005. Angiosperm phylogeny and evolution. Sinauer, Sunderland, MA.

Soltis, D., Soltis, P., Endress, P., Chase, M.W., Manchester, S., Judd, W., Majure, L., and Mavrodiev, E., 2018. Phylogeny and Evolution of the Angiosperms: Revised and Updated Edition. University of Chicago Press.

Soltis, D.E., Visger, C.J., Marchant, D.B., and Soltis, P.S., 2016. Polyploidy: Pitfalls and paths to a paradigm. Am. J. Bot. 103, 1146-1166.

Soltis, D.E., Visger, C.J., and Soltis, P.S., 2014a. The polyploidy revolution then… and now: Stebbins revisited. Am. J. Bot. 101, 1057-1078.

Soltis, P.S., Liu, X., Marchant, D.B., Visger, C.J., and Soltis, D.E., 2014b. Polyploidy and novelty: Gottlieb's legacy. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 369, 10.1098/rstb.2013.0351.

Soltis, P.S., Marchant, D.B., Van de Peer, Y., and Soltis, D.E., 2015. Polyploidy and genome evolution in plants. Curr. Opin. Genet. Dev. 35, 119-125.

Soltis, P.S., Plunkett, G.M., Novak, S.J., and Soltis, D.E., 1995. Genetic variation in Tragopogon species: additional origins of the allotetraploids T. mirus and T. miscellus (Compositae). Am. J. Bot. 82, 1329-1341.

Soltis, P.S. and Soltis, D.E., 2016. Ancient WGD events as drivers of key innovations in angiosperms. Curr. Opin. Plant Biol. 30, 159-165.

Soltis, P.S. and Soltis, D.E., 2012. Polyploidy and Genome Evolution. Springer.

Soltis, P.S. and Soltis, D.E., 1991. Multiple origins of the allotetraploid Tragopogon mirus (Compositae): rDNA evidence. Syst. Bot. 407-413.

Spoelhof, J.P., Chester, M., Rodriguez, R., Geraci, B., Heo, K., Mavrodiev, E., Soltis, P.S., and Soltis, D.E., 2017. Karyotypic variation and pollen stainability in resynthesized allopolyploids Tragopogon miscellus and T. mirus. Am. J. Bot. 104, 1484-1492.

Stebbins Jr, G.L., 1947. Types of polyploids: Their classification and significance. In: Advances in Genetics. Elsevier, pp. 403-429.

Stebbins, G.L., 1950. Variation and Evolution in Plants. Geoffrey Cumberlege. London.

Su, Z., Wang, J., Yu, J., Huang, X., and Gu, X., 2006. Evolution of alternative splicing after gene duplication. Genome Res. 16, 182-189.

Syed, N.H., Kalyna, M., Marquez, Y., Barta, A., and Brown, J.W., 2012. Alternative splicing in plants–coming of age. Trends Plant Sci. 17, 616-623.

Symonds, V.V., Soltis, P.S., and Soltis, D.E., 2010. Dynamics of polyploid formation in Tragopogon (Asteraceae): recurrent formation, gene flow, and population structure. Evolution 64, 1984-2003.

Tähtiharju, S., Rijpkema, A.S., Vetterli, A., Albert, V.A., Teeri, T.H., and Elomaa, P., 2011. Evolution and diversification of the CYC/TB1 gene family in Asteraceae—a comparative study in Gerbera (Mutisieae) and sunflower (Heliantheae). Mol. Biol. Evol. 29, 1155-1166.

Talavera, D., Vogel, C., Orozco, M., Teichmann, S.A., and De La Cruz, X., 2007. The (in) dependence of alternative splicing and gene duplication. PLoS computational biology 3, e33.

Tang, C., Yang, M., Fang, Y., Luo, Y., Gao, S., Xiao, X., An, Z., Zhou, B., Zhang, B., and Tan, X., 2016. The rubber tree genome reveals new insights into rubber production and species adaptation. Nature plants 2, 16073.

Tapanainen, R., Parker, D.J., and Kankare, M., 2018. Photosensitive Alternative Splicing of the Circadian Clock Gene timeless Is Population Specific in a Cold-Adapted Fly, Drosophila montana. G3 (Bethesda) 8, 1291-1297.

Tate, J.A., Joshi, P., Soltis, K.A., Soltis, P.S., and Soltis, D.E., 2009a. On the road to diploidization? Homoeolog loss in independently formed populations of the allopolyploid Tragopogon miscellus (Asteraceae). BMC Plant Biology 9, 80.

Tate, J.A., Ni, Z., Scheen, A.C., Koh, J., Gilbert, C.A., Lefkowitz, D., Chen, Z.J., Soltis, P.S., and Soltis, D.E., 2006. Evolution and expression of homeologous loci in Tragopogon miscellus (Asteraceae), a recent and reciprocally formed allopolyploid. Genetics 173, 1599-1611.

Tate, J.A., Symonds, V.V., Doust, A.N., Buggs, R.J., Mavrodiev, E., Majure, L.C., Soltis, P.S., and Soltis, D.E., 2009b. Synthetic polyploids of Tragopogon miscellus and T. mirus (Asteraceae): 60 Years after Ownbey's discovery. Am. J. Bot. 96, 979-988.

Terashima, A. and Takumi, S., 2009. Allopolyploidization reduces alternative splicing efficiency for transcripts of the wheat DREB2 homolog, WDREB2. Genome 52, 100-105.

The Plant List, 2010. http://www.theplantlist.org/.

Tilgner, H., Grubert, F., Sharon, D., and Snyder, M.P., 2014. Defining a personal, allele-specific, and single-molecule long-read transcriptome. Proc. Natl. Acad. Sci. U. S. A. 111, 9869-9874.

Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., Van Baren, M.J., Salzberg, S.L., Wold, B.J., and Pachter, L., 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511.

Tress, M.L., Abascal, F., and Valencia, A., 2017. Alternative splicing may not be the key to proteome complexity. Trends Biochem. Sci. 42, 98-110.

Van de Peer, Y., Maere, S., and Meyer, A., 2009. The evolutionary significance of ancient genome duplications. Nature Reviews Genetics 10, 725.

Van de Peer, Y., Mizrachi, E., and Marchal, K., 2017. The evolutionary significance of polyploidy. Nature Reviews Genetics 18, 411.

van Eijk, M., 2015. Genome assembly and Iso-Seq transcriptome sequencing of tetraploid cotton.

Walters, B., Lum, G., Sablok, G., and Min, X.J., 2013. Genome-wide landscape of alternative splicing events in Brachypodium distachyon. DNA research 20, 163-171.

Wang, B., Tseng, E., Regulski, M., Clark, T.A., Hon, T., Jiao, Y., Lu, Z., Olson, A., Stein, J.C., and Ware, D., 2016. Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing. Nature communications 7, 11708.

Weirather, J.L., Afshar, P.T., Clark, T.A., Tseng, E., Powers, L.S., Underwood, J.G., Zabner, J., Korlach, J., Wong, W.H., and Au, K.F., 2015. Characterization of fusion genes and the significantly expressed fusion isoforms in breast cancer by hybrid sequencing. Nucleic Acids Res. 43, e116-e116.

Weisenfeld, N.I., Kumar, V., Shah, P., Church, D.M., and Jaffe, D.B., 2017. Direct determination of diploid genome sequences. Genome Res. 27, 757-767.

Wendel, J.F., 2015. The wondrous cycles of polyploidy in plants. Am. J. Bot. 102, 1753-1756.

Wendel, J.F., Flagel, L.E., and Adams, K.L., 2012. Jeans, genes, and genomes: Cotton as a model for studying polyploidy. In: Soltis, P.S. and Soltis, D.E. (eds.), Polyploidy and Genome Evolution. Springer, pp. 181-207.

Wendel, J.F. and Grover, C.E., 2015. Taxonomy and evolution of the cotton genus, Gossypium. Cotton 25-44.

Wendel, J.F., Jackson, S.A., Meyers, B.C., and Wing, R.A., 2016. Evolution of plant genome architecture. Genome Biol. 17, 37.

Wu, H., Su, Y., Chen, H., Chen, Y., Wu, C., Lin, W., and Tu, S., 2014. Genome-wide analysis of light-regulated alternative splicing mediated by photoreceptors in Physcomitrella patens. Genome Biol. 15, R10.

Wu, T.D. and Nacu, S., 2010. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, 873-881.

Wu, T.D. and Watanabe, C.K., 2005. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859-1875.

Xiong, Z., Gaeta, R.T., and Pires, J.C., 2011. Homoeologous shuffling and chromosome compensation maintain genome balance in resynthesized allopolyploid Brassica napus. Proc. Natl. Acad. Sci. U. S. A. 108, 7908-7913.

Xiong, Z. and Pires, J.C., 2011. Karyotype and identification of all homoeologous chromosomes of allopolyploid Brassica napus and its diploid progenitors. Genetics 187, 37-49.

Yang, Y., Guo, W., Shen, X., Li, J., Yang, S., Chen, S., He, Z., Zhou, R., and Shi, S., 2018. Identification and characterization of evolutionarily conserved alternative splicing events in a mangrove genus Sonneratia. Scientific reports 8, 4425.

Yoo, M., Liu, X., Pires, J.C., Soltis, P.S., and Soltis, D.E., 2014. Nonadditive gene expression in polyploids. Annu. Rev. Genet. 48, 485-517.

Yoo, M., Szadkowski, E., and Wendel, J., 2013. Homoeolog expression bias and expression level dominance in allopolyploid cotton. Heredity 110, 171.

Zhang, G., Guo, G., Hu, X., Zhang, Y., Li, Q., Li, R., Zhuang, R., Lu, Z., He, Z., Fang, X., et al., 2010. Deep RNA sequencing at single base-pair resolution reveals high complexity of the rice transcriptome. Genome Res. 20, 646-654.

Zhang, X., Rosen, B.D., Tang, H., Krishnakumar, V., and Town, C.D., 2015. Polyribosomal RNA-Seq reveals the decreased complexity and diversity of the Arabidopsis translatome. PLoS One 10, e0117699.

Zhang, Z., Xin, D., Wang, P., Zhou, L., Hu, L., Kong, X., and Hurst, L.D., 2009. Noisy splicing, more than expression regulation, explains why some exons are subject to nonsense-mediated mRNA decay. BMC biology 7, 23.

Zhou, R., Moshgabadi, N., and Adams, K.L., 2011. Extensive changes to alternative splicing patterns following allopolyploidy in natural and resynthesized polyploids. Proc. Natl. Acad. Sci. U. S. A. 108, 16122-16127.

BIOGRAPHICAL SKETCH

Xiaoxian Liu earned her B.Sc. in Biological Sciences from Zhejiang University, China, in 2007. Under the mentorship of Professor Chengxin Fu, Xiaoxian began her academic career in plant conservation genetics, studying the genetic diversity of Fritillaria thunbergii (Liliaceae), a plant used in traditional Chinese medicine. After completing her undergraduate degree, Xiaoxian joined Fu's lab as a master's student in 2007 to study the phytogeography and population genetics of the polyploid complex Smilax china (Smilacaceae). From late 2008 to early 2009, she took part in an exchange program between Zhejiang University and National Taiwan University under the mentorship of Assistant Professor Dr. Jer-Ming Hu.

Xiaoxian Liu received her M.Sc. in Botany from Zhejiang University in 2011. She began her doctoral work at the University of Florida in the fall of 2011 in the Department of Biology under the mentorship of Drs. Douglas E. Soltis and W. Brad Barbazuk. She studies plant evolution by investigating alternative splicing in Tragopogon and the evolutionary impacts of conserved alternative splicing in basal angiosperms. Xiaoxian Liu received her Ph.D. from the University of Florida in the fall of 2018.
