
OUT OF THE BUSHES AND INTO THE TREES: ALTERNATIVE APPROACHES FOR RESOLVING THE PHYLOGENY OF LAMIACEAE

By

GRANT THOMAS GODDEN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2014

© 2014 Grant Thomas Godden

To my father, Clesson Dale Godden Jr., who would have been proud to see me complete this journey, and to Mr. Tea and Skippyjon Jones, who sat patiently by my side and offered friendship along the way

ACKNOWLEDGMENTS

I would like to express my deepest gratitude for the consistent support of my advisor, Dr. Pamela Soltis, whose generous allocation of time, innovative advice, encouragement, and mentorship positively shaped my research and professional development. I also offer my thanks to Dr. J. Gordon Burleigh, Dr. Bryan Drew, Dr. Ingrid Jordon-Thaden, Dr. Stephen Smith, and the members of my committee (Dr. Nicoletta Cellinese, Dr. Walter Judd, Dr. Matias Kirst, and Dr. Douglas Soltis) for their helpful advice, guidance, and research support.

I also acknowledge the many individuals who helped make possible my field research activities in the and abroad. I wish to extend a special thank you to Dr. Angelica Cibrian Jaramillo, who kindly hosted me in her laboratory at the National

Laboratory of Genomics for Biodiversity (Langebio) and helped me acquire collecting permits and resources in Mexico. Additional thanks belong to Francisco Mancilla

Barboza, Gerardo Balandran, and Praxaedis (Adan) Sinaca for their field assistance in

Northeastern Mexico; my collecting trip was a great success thanks to your resourcefulness and on-site support. Thank you to Patricia Manning, Dr. Michael

Powell, Dr. Alan Prather, Dr. Donovan Bailey, Dr. Timothy Lowrey, Robert Savinski, Dr.

Leslie Landrum, Dr. Tina Ayers, the US Forest Service, and the US Bureau of Land

Management for their assistance with field research activities in the Southwestern

United States. I also wish to thank Dr. Selene Baez, Dr. Omar Torres-Carvajal, Dr. C.

Lorena Endara, Dr. Claudia Segovia Salcedo, and Rhonda Arthur for their support of my field research activities in Ecuador.

Thank you also to Dr. Matthew Gitzendanner, Dr. Michael Moore, and Dr. David Tank for their guidance and training in the areas of library preparation, next-generation sequencing, and data processing. To Drs. Steven Manchester and Evgeny Mavrodiev: thank you for your assistance sourcing and translating descriptions, respectively.

And lastly, thank you to my friends in Gainesville and colleagues in the Soltis Lab,

Florida Museum of Natural History, and University of Florida, particularly Jared Cellon,

Kim Schleissing, Claudia Segovia Salcedo, Dr. C. Lorena Endara, Dr. Igor Ignatovich,

Dr. Gretchen Ionta, Dr. Ingrid Jordon-Thaden, and Dr. Maribeth Latvis. Your friendship and support throughout this journey were vital to my success.

Financial support for this research was made possible by a grant from the

National Science Foundation (DEB-1210671), with additional funding from the

University of Florida Graduate Student Research Abroad Program, The Herb Society of

America, The Community Foundation Desert Legacy Fund, the Native

Society, and the Botanical Society of America.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 GENERAL INTRODUCTION

2 INTERPRETING COMPLEX PATTERNS IN LAMIACEAE PHYLOGENY

Introduction
Current Status of Lamiaceae Taxonomy and Phylogeny
Interpreting Complex Patterns in Lamiaceae Phylogeny
Materials and Methods
Supermatrix Phylogenetic Inference
Phylogenetic Inference
Assessment of Phylogenetic Sampling Effort
Divergence Time Estimation
Diversification Dynamics in Lamiaceae
Results
Descriptive Data
Phylogenetic Inference
Divergence Times
Diversification Dynamics
Discussion
Phylogenetic Relationships in Lamiaceae
Evolutionary Timing and Tempo in Lamiaceae
Limitations Imposed by Opportunistic Sampling
Towards Fully Resolving Relationships in Lamiaceae

3 MAKING NEXT-GENERATION SEQUENCING WORK FOR YOU: APPROACHES AND PRACTICAL CONSIDERATIONS FOR MARKER DEVELOPMENT AND PHYLOGENETICS

Introduction
NGS and New Markers for Phylogenetic Studies
Organellar Genomes and Gene Space as Markers
Nuclear Markers
Cost Considerations for NGS Projects
Experimental Pathways for NGS Phylogenetics
Transcriptomes vs. Genomes
Pathways for Phylogenetic Marker Development
Marker Development Pipeline for Targeted Sequencing
Data Mining
Tests for Orthology
Primer Development for Targeted NGS
Targeted Sequencing Approaches for Phylogenetics
Amplicon Approaches for Targeted Sequencing
Hybridization Approaches for Targeted Sequencing
Processing of NGS Data for Phylogenetic Marker Development
NGS Data: File Types and Conversion Tools
Quality Control and Trimming of NGS Data
Genomic and Transcriptomic Assembly
Data Mining for Orthologous Markers
Case Study: 1KP Pipeline
Discussion

4 RESOLVING RELATIONSHIPS WITH ORGANELLAR PHYLOGENOMIC DATA: A CASE STUDY WITH HEDEOMA AND ALLIED GENERA FROM MENTHINAE (LAMIACEAE)

Introduction
Phylogenetic Relationships within Hedeoma and Among Closely Allied Genera of Menthinae
A Model System for Evaluating Organellar Phylogenomic Approaches
Materials and Methods
Plant Materials and Taxonomic Sampling
DNA Extraction, Shearing, and Genomic Library Preparation
Plastid Genome Enrichment and Sequencing
Data Processing, Mining, and Assembly
Alignments and Phylogenetic Analyses
Results
Descriptive Data
Phylogenetic Analyses
Discussion
Phylogenetic Relationships Among Hedeoma and Allied Genera and Their Taxonomic Implications
The Importance of Morphological Characters in Delineating Generic Boundaries
The Taxonomy of Hedeoma and Allied Menthinae
The Utility of Plastid Phylogenomic Approaches

5 RESOLVING RELATIONSHIPS WITH NUCLEAR PHYLOGENOMIC APPROACHES: A CASE STUDY WITH LAMIALES

Introduction
Background
Current Status of Lamiales Phylogeny
Materials and Methods
Taxonomic Sampling
Discovery of Single-copy Nuclear Genes from Transcriptome Data
Clustering, Reorientation, and Characterization of Putative Orthologues
Multiple Sequence Alignment and Comparisons of Phylogenetic Utility Among 85 Commonly Shared SCN Genes
Outgroup Selection and Phylogenetic Analyses
Results
Single-copy Nuclear Gene Discovery
Sequence Characteristics and Phylogenetic Utility
Phylogenetic Results and Relationships in Lamiales
Discussion
Utility of New SCN Gene Sets for Phylogenetic Studies of Lamiales
The Interfamilial Phylogeny of Lamiales
Conclusions

6 GENERAL CONCLUSIONS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

Table

2-1 A hierarchical taxonomic classification of Lamiaceae

2-2 General Time Reversible (GTR) and CAT approximation parameters estimated by Randomized Axelerated Maximum Likelihood (RAxML)

2-3 Divergence times and diversification dynamics from treePL and MEDUSA

3-1 Summary of targeted sequencing methods

3-2 Summary of sequence formats and data processing tools for NGS

4-1 Proposed taxa and nomenclature for Hedeoma

4-2 Accessions used for phylogeny reconstruction

4-3 General Time Reversible (GTR) + Γ model parameters estimated by Randomized Axelerated Maximum Likelihood (RAxML)

5-1 OneKP transcriptome assemblies used for single-copy nuclear gene discovery and phylogenetic analyses

5-2 A summary of 1,993 single-copy nuclear (SCN) loci identified by the MarkerMiner 1.0 pipeline

5-3 Sequence characteristics for 85 commonly shared single-copy nuclear loci

5-4 General Time Reversible (GTR) + Γ model parameters estimated by Randomized Axelerated Maximum Likelihood (RAxML)

LIST OF FIGURES

Figure

2-1 Lamiaceae diversity sampled for the phylogenetic analysis

2-2 Proportions of recognized and sampled species-level diversity among the subfamilial clades of Lamiaceae

2-3 Maximum likelihood phylogeny of 1,256 Lamiaceae species and nine outgroup families from Lamiales

2-4 Summary showing relationships among major clades in Lamiaceae

2-5 Lineages through time (LTT) plot for Lamiaceae species

2-6 Chronogram showing estimated divergence dates and diversification dynamics for selected clades of Lamiaceae

2-7 Four family-wide phylogenetic hypotheses for Lamiaceae

3-1 Experimental design pathways for NGS phylogenetics

3-2 Marker development pipeline for targeted sequencing

3-3 Data processing with NGS for phylogenetics

4-1 Mean number of reads (or FASTQ records) yielded per each of four lanes of Illumina GAIIx sequencing

4-2 Maximum likelihood tree inferred from an analysis of nearly complete plastome sequences

5-1 Number of single-copy nuclear genes discovered in each of 77 transcriptomes from the oneKP project

5-2 The distribution of single-copy nuclear genes shared across 77 transcriptomes from the oneKP project

5-3 Concatenated supermatrix of 85 single-copy nuclear genes

5-4 Maximum likelihood tree inferred from 85 single-copy nuclear loci

5-5 Phylogenetic relationships among 17 families from Lamiales and 5 additional lamiid families

5-6 Most parsimonious tree from an analysis of 85 single-copy nuclear loci

5-7 Comparison of maximum likelihood and maximum parsimony topologies recovered from an analysis of 85 single-copy nuclear genes

LIST OF OBJECTS

Object

2-1 Species and DNA sequences used for phylogenetic analysis (.xlsx 86 KB)

2-2 Optimal tree topology identified by RAxML (.pdf 1.8 MB)

2-3 Chronogram showing results from divergence time estimation and diversification analyses (.pdf 131 KB)

4-1 Aligned matrix in PHYLIP multiple alignment format (.rtf file 12.5 MB)

4-2 Supermatrix data assembly output, including characteristics by taxon and plastome partition (.xlsx file 172 KB)

5-1 Modified output and summary data reported by MarkerMiner 1.0 (.xlsx file 950 KB)

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

OUT OF THE BUSHES AND INTO THE TREES: ALTERNATIVE APPROACHES FOR RESOLVING THE PHYLOGENY OF LAMIACEAE

By

Grant Thomas Godden

December 2014

Chair: Pamela S. Soltis Major:

The infrafamilial phylogeny of Lamiaceae (mint family) and its position in

Lamiales have remained poorly understood for many decades, and standard approaches used to resolve phylogenetic relationships, including the reasonable addition of species and genes into analyses, have proven minimally effective. I attempted to identify challenges and impediments to phylogeny reconstruction and designed and tested new approaches for resolving this complex phylogenetic problem.

First, I inferred a comprehensive family-wide phylogenetic hypothesis for 1,265 species, representing a synthesis of current molecular data; the topology was used to evaluate progress in mint systematics, estimate divergence times, and investigate diversification dynamics. The composition of major clades was consistent with previous hypotheses, but some infrafamilial relationships differed and were poorly supported due to insufficient sequence variation, discordant signal among genes, and/or heterogeneity among the data resources. My results also indicated that several lineages in Lamiaceae have experienced bursts of rapid diversification over a short evolutionary timeframe; carefully designed family-wide sampling strategies and large, multi-locus datasets are likely necessary to resolve infrafamilial relationships. Thus, I designed several new workflows for phylogenetic marker development and targeted sequencing of organellar and nuclear datasets on next-generation sequencing (NGS) platforms, and I tested these as part of two case studies. Whole-plastome enrichment and NGS were utilized as part of a phylogenomic study of 96 accessions from Hedeoma and allied Menthinae.

The application of plastid phylogenomic data did not fully resolve relationships with strong support. Nevertheless, the data demonstrate that Hedeoma is non-monophyletic.

In contrast, I successfully reconstructed relationships in Lamiales using a nuclear phylogenomic approach. I identified a set of 1,993 single-copy nuclear (SCN) loci with a novel bioinformatic workflow and used 85 of these to infer a phylogeny for 77 accessions from 17 families of Lamiales and five outgroup families. My phylogenetic results, the first inferred with large-scale, multi-locus nuclear data, provide a strongly supported view of Lamiales evolution; they also demonstrate the enormous potential of

SCN loci and pave the way for future studies with multispecies coalescent approaches.


CHAPTER 1 GENERAL INTRODUCTION

Mints (Lamiaceae), with a nearly cosmopolitan distribution and more than 7,000 species, are the sixth-largest angiosperm family and are of major economic and cultural importance worldwide (Harley et al. 2004). This hyper-diverse clade of flowering plants presents many exciting opportunities to investigate global patterns of radiation and to link these patterns to shifts in distribution, ecology, and phenotypic evolution. However, comparative studies of Lamiaceae have proven somewhat intractable due to poor phylogenetic resolution.

Numerous molecular systematic studies of mint groups have been published in recent decades, including investigations of infrafamilial relationships that span the full phylogenetic scale of Lamiaceae (e.g. see Chapter 2 for citations and a detailed discussion). Many studies have helped identify the membership of well-supported monophyletic groups (e.g. clades now recognized at subfamilial, tribal, or subtribal ranks), and they provide good starting hypotheses for both infrafamilial relationships and improved taxonomic circumscriptions. However, an overwhelming majority of these studies have not adequately resolved relationships within and among these major lineages. Consequently, many phylogenetic relationships along the backbone of the family remain unresolved or poorly supported, presenting a significant roadblock to our common goals in understanding the evolutionary history of Lamiaceae and revising its taxonomy.

Less well known are the relationships within each of the major lineages of

Lamiaceae; phylogenetic analyses of closely related genera and species regularly yield polytomies composed of a few well-supported subclades, and standard approaches


used to resolve phylogenies, such as the reasonable addition of species and genes into the analyses, have been largely ineffective. Thus, progress in mint systematics remains largely stalled in the absence of effective strategies that can resolve these difficult phylogenetic problems.

In this study, I use Lamiaceae and Lamiales (an inclusive and difficult-to-resolve clade that includes Lamiaceae and ca. 16,000 additional species from ca. 24 other families) as model systems in which to evaluate new, alternative approaches to resolving difficult phylogenetic problems. I also explore the limits of how evolutionary relationships can be efficiently and cost-effectively reconstructed for large groups of organisms.

The sequential goals of this study are to:

• synthesize and analyze all publicly available data for Lamiaceae;

• illustrate gaps in our current knowledge of Lamiaceae phylogeny;

• characterize challenges or impediments to reconstructing Lamiaceae phylogeny;

• design new research strategies and workflows for resolving difficult phylogenetic problems with phylogenomic approaches;

• investigate the utility of organellar and nuclear phylogenomic approaches in resolving problematic relationships in Lamiaceae and Lamiales, respectively; and

• make recommendations regarding effective strategies and best practices for resolving difficult phylogenetic problems, including those in Lamiaceae, Lamiales, and other angiosperm groups.

These goals are addressed throughout the following four chapters: In Chapter 2,

I reconstruct a family-wide phylogeny of Lamiaceae and summarize our current knowledge of infrafamilial relationships. I also investigate the timing and tempo of evolution across the family to identify possible challenges or impediments to fully reconstructing mint phylogeny.


In Chapter 3, I present a detailed review of experimental approaches and practical considerations for developing new phylogenetic loci with NGS technologies. I also outline a flexible framework for data acquisition that is readily adaptable to the needs of individual researchers and carefully consider both time- and cost-related issues that may be of concern to many laboratories in evolutionary biology.

In Chapter 4, I investigate the utility of organellar phylogenomic approaches to resolve phylogenetic relationships in Lamiaceae. I use Hedeoma and allied genera of

Menthinae (Lamiaceae: Nepetoideae: Mentheae), a difficult-to-resolve group of closely related mints from the southwestern United States and , as a model system for this investigation, and I follow the recommended workflows outlined in

Chapter 3 for targeted sequencing of whole plastomes. Lastly, I present a phylogeny inferred from nearly complete plastome sequences from 97 accessions and discuss the potential utility of organellar phylogenomic approaches in mint systematics.

In Chapter 5, I follow the recommended workflows outlined in Chapter 3 for utilization of existing NGS data resources, and I identify a large set of new putatively single-copy nuclear (SCN) loci that can be used in phylogenomic investigations of

Lamiales and Lamiaceae (as well as other families in Lamiales). From this set of loci, I investigate the phylogenetic utility of 85 randomly selected genes and conduct preliminary phylogenomic analyses. Lastly, I present the first nuclear phylogenetic results for Lamiales (including 19 species from 4 subfamilies of Lamiaceae), as well as the results of an ancestral character state reconstruction using this topology, and discuss their implications for our current understanding of interfamilial relationships and character evolution.


In Chapter 6, I summarize what has been learned about resolving difficult phylogenetic problems and propose future directions that will help move mint systematics “out of the bushes and into the trees”—that is, to resolve clear evolutionary trees from the currently unresolved bush patterns.


CHAPTER 2 INTERPRETING COMPLEX PATTERNS IN LAMIACEAE PHYLOGENY

Introduction

Mints (Lamiaceae Martinov) are the sixth-largest angiosperm family and comprise ca. 236 genera and over 7,000 species, many of which are culturally and economically important worldwide. Since the family was expanded to include members of a more broadly defined sensu Briquet, Lamiaceae possess extraordinary diversity and exhibit a broad range of growth forms and life histories, floral architectures, and secondary chemistry (reviewed in Harley et al. 2004). They also have a nearly cosmopolitan distribution and occupy a remarkable range of ecological niches in most world biomes.

The extreme morphological complexity of Lamiaceae, combined with the large number of species spanning diverse , presents many taxonomic problems; it is often difficult to distinguish closely related genera and species. Resolving these taxonomic problems is a priority, but it currently presents many scientific challenges.

Comparative studies are critical to our understanding of species relationships and character evolution in Lamiaceae, but these have proven somewhat intractable in recent

decades due to poor phylogenetic resolution.

Current Status of Lamiaceae Taxonomy and Phylogeny

Lamiaceae were first recognized more than two centuries ago (Jussieu 1789), but the limits of the family have since been subject to widely divergent morphological interpretations (e.g. Bentham 1832-1836, 1848, 1876; Briquet 1895-1897; Erdtman

1945; Wunderlich 1967; see also Cantino 1992). Although Junell (1934) proposed a circumscription of Lamiaceae very similar to the current one, this view has only been


widely accepted and formalized recently on the basis of morphological and molecular evidence (Cantino 1992a, 1992b; Cantino et al. 1992, 1997; Wagstaff et al. 1995;

Wagstaff and Olmstead 1997; Wagstaff et al. 1998; Harley et al. 2004; see also Junell

1934).

Molecular systematics studies conducted over the last two decades have helped clarify the composition of Lamiaceae (e.g. Wagstaff et al. 1995; Wagstaff and Olmstead

1997; Wagstaff et al. 1998; Paton et al. 2004; Bramley et al. 2009; Conn et al. 2009;

Bräuchler et al. 2010; Bendiksby et al. 2011; Drew and Sytsma 2011, 2012; Li et al.

2012; Chen et al. 2014). In fact, seven groups within Lamiaceae are currently recognized at the subfamilial level (e.g. , , Nepetoideae,

Viticoideae, , Scutellarioideae, and Symphorematoideae), and their monophyly has been confirmed with molecular evidence (Harley et al. 2004; Bendiksby et al. 2011; Li et al. 2012; Chen et al. 2014).1 However, our understanding of the interrelationships among these clades remains exceedingly poor.

Less well known are the relationships within each of the major clades of

Lamiaceae. Phylogenetic analyses of many closely related mint groups from limited

DNA datasets regularly yield few well-supported clades within a largely unresolved backbone (e.g., Steane et al. 1999 [ L.]; Barber et al. 2002 [Sideritis L.];

Prather et al. 2002 [ L.]; Jamzad et al. 2003 [Nepeta L.]; Paton et al. 2004

[Ocimeae]; Steane et al. 2004 [Clerodendrum and other Ajugoideae]; Trusty et al. 2004

[ L’Hèr and other Mentheae]; Walker et al. 2004 [Salvia L.]; Bräuchler et al.

2005 [ Benth.]; Trusty et al. 2005 [Bystropogon]; Edwards et al. 2006

1 Viticoideae are apparently non-monophyletic (Bramley et al. 2009).


[]; Meimberg et al. 2006 [Micromeria]; Oliveira et al. 2007 [];

Walker et al. 2007 [Salvia]; Edwards et al. 2008 [Conradina and related Menthinae];

Scheen et al. 2008 [Synandreae]; Schmidt-Lebuhn 2008 [ (Benth.)

Spach.]; Bramley et al. 2009 [Viticoideae]; Godden 2009 [Poliomintha A. Gray and other

Menthinae]; Scheen et al. 2009 [Leucas R. Br.: Lamioideae]; Bräuchler et al. 2010

[Menthinae]; Moon et al. 2010 [Mentheae]; Scheen et al. 2010 [Lamioideae]; Bendiksby et al. 2011a [Lamioideae], 2011b [ L.]; Mathiesen et al. 2011 [Phlomis L.];

Pastore et al. 2011 [Ocimeae]; Agostini et al. 2012 [ L. and other Menthinae]; Li et al. 2012 [Lamiaceae-wide]; Wilson et al. 2012 [Prostanthera Labill.]; Drew and Sytsma

2012 [Mentheae]). Moreover, relationships within some major mint clades appear more difficult to resolve than others with traditional phylogenetic approaches involving reasonable addition of taxa and genes. For example, relationships in Ajugoideae are fully resolved with only a few plastid genes (e.g. rbcL and ndhF), while those in

Nepetoideae are largely unresolved (e.g. see Li et al. 2012 and Chen et al. 2014).

Previously proposed explanations for why some clades are notoriously difficult to resolve are at best speculative; a formal investigation of phylogenetic patterns across the family has never been conducted. Nevertheless, the frequency of unresolved phylogenetic patterns, which apparently occur at all phylogenetic depths within

Lamiaceae, illustrates the extreme challenges confronting mint systematists using traditional sampling and phylogenetic approaches. A better understanding of the putative factors underlying poor phylogenetic resolution may help identify new sampling approaches and methodological strategies that can help address these challenges.


Interpreting Complex Patterns in Lamiaceae Phylogeny

Unresolved phylogenetic patterns do not resolve taxonomic questions, but they may represent important evolutionary events and processes as well as symptoms of fundamental challenges that must be confronted as part of future phylogenetic investigations (Rokas and Carroll 2006). For example, phylogenetic inference may be greatly complicated by either systematic error in the analyses (e.g. alignment error, unrecognized paralogy, inadequate taxon sampling, and model misspecification

[Sanderson and Schafer 2002; Felsenstein 2004]) or evolutionary processes and events

(e.g. incomplete lineage sorting, reticulation, introgression, gene duplication, and heterogeneous processes of molecular evolution across loci and lineages [Comes et al.

2001; Jakob and Blattner 2006; Maddison and Knowles 2006]).

A primary goal of this investigation is to diagnose factors contributing to poor resolution and support in Lamiaceae. My approach synthesizes all available sequence data from Lamiaceae to construct a family-wide phylogenetic hypothesis. The resulting tree provides a starting point for evaluating recent progress in Lamiaceae systematics, as well as for characterizing and comparing phylogenetic patterns and investigating the timing of evolution and patterns of diversification across all major clades.

Materials and Methods

Supermatrix Phylogenetic Inference

A phylogenetic hypothesis was generated for Lamiaceae from available sequence data. During November 2009, all nucleotide sequence data for Lamiaceae were downloaded from GenBank and compiled along with unpublished sequence data generated by previous projects (Object 2-1). Additional sequence data were also downloaded for the following species, representing a diverse selection of outgroups


from Lamiales: Barleria prionitis L. (), bignonioides Walter

(), lucida L. (), Ligustrum vulgare L. (),

Sesamum indicum L. (Pedaliaceae), Scrophularia californica Cham. & Schltdl.

(), Streptocarpus holstii Engl. (Gesneriaceae), Verbena bonariensis L. and Verbena bracteata Lag. & Rodr. (Verbenaceae; 2 accessions), and Veronica persica Poir. () (Object 2-1).

To identify sets of homologous sequences from the compiled data set, sequences shorter than 10 kilobase pairs (kbp) were clustered using the bioinformatic approach of Burleigh et al. (2012). Of the clusters identified, only those with sequences from at least four taxa (i.e. the minimum number for an informative, unrooted tree) were retained and aligned using MUSCLE (Edgar 2004).

The resulting cluster alignments were checked for errors and manually adjusted as necessary in Se-Al version 2.0a11 (Rambaut 2002).
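The two filtering criteria described above can be sketched as follows. This is only a minimal illustration and not the clustering procedure of Burleigh et al. (2012); the record layout (pairs of taxon name and sequence), the cluster identifiers, and the function names are assumptions introduced here.

```python
MAX_LENGTH_BP = 10_000  # sequences of 10 kbp or longer are excluded before clustering
MIN_TAXA = 4            # fewest taxa that can yield an informative unrooted tree


def drop_long_sequences(records):
    """Keep (taxon, sequence) records shorter than MAX_LENGTH_BP."""
    return [(taxon, seq) for taxon, seq in records if len(seq) < MAX_LENGTH_BP]


def retain_informative_clusters(clusters):
    """Keep clusters whose sequences come from at least MIN_TAXA distinct taxa.

    `clusters` maps a hypothetical cluster ID to a list of (taxon, sequence) records.
    """
    kept = {}
    for cluster_id, records in clusters.items():
        taxa = {taxon for taxon, _ in records}  # count taxa, not sequences
        if len(taxa) >= MIN_TAXA:
            kept[cluster_id] = records
    return kept
```

Counting distinct taxa rather than sequences matters here, because a cluster with several accessions of the same species could otherwise pass the threshold without being informative.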

Prior to analysis, an additional large data set for Menthinae (i.e. Bräuchler et al.

2010) was released in GenBank. These data were incorporated into the completed alignments with additional clustering, profile alignment, and manual alignment editing steps. The final cluster alignments were then edited for inclusion into a supermatrix using the criteria described by Burleigh et al. (2012). Finally, all edited alignment clusters were concatenated into a single supermatrix for phylogenetic analysis.
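The concatenation step can be illustrated with a short sketch. It assumes each cluster alignment is held as a dictionary from taxon name to aligned sequence and that taxa absent from a cluster are padded with gap characters; both choices, and the function name, are assumptions for illustration rather than the procedure actually used.

```python
def concatenate_alignments(alignments):
    """Concatenate per-cluster alignments into one supermatrix.

    `alignments` is a list of dicts mapping taxon name -> aligned sequence;
    all sequences within one dict must have the same length.
    Returns (supermatrix, partitions), where partitions holds 1-based,
    inclusive (start, end) column ranges for each input alignment.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in an alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Taxa missing from this cluster are filled with gaps.
            pieces[taxon].append(aln.get(taxon, "-" * width))
        partitions.append((start, start + width - 1))
        start += width
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions
```

The returned column ranges are the kind of information a partition file would record, should per-cluster models be wanted later.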

Phylogenetic Inference

A maximum likelihood (ML) analysis was conducted using the full supermatrix to reconstruct an optimal phylogenetic hypothesis for Lamiaceae. The ML analysis was performed with Randomized Axelerated Maximum Likelihood (RAxML) version 8.0.25 (Stamatakis 2006; released June 16, 2014) using the following command: raxmlHPC-PTHREADS-SSE3 -f a -m GTRCAT -c 6 -p 2178 -N 2 -T 32 -x 7670 -N 100. Because the outgroups are not a monophyletic group, Ligustrum vulgare (Oleaceae) was chosen to root the tree based on previous phylogenetic results (e.g. Schäferhoff et al. 2010). Rapid bootstraps (RBS) were performed with 100 replicates, followed by both fast and slow ML heuristic searches (two replicates) using RBS starting trees. The general time reversible (GTR) model with CAT approximation of rate heterogeneity (Stamatakis 2006) was used for all ML analyses, with default settings for model optimization.
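For readers who wish to reproduce a comparable run, the reported options can be assembled programmatically, as in the hedged sketch below. The alignment file name, run name, and outgroup label are hypothetical, and the `-s`, `-n`, and `-o` options are not part of the command as reported; they are added here only because RAxML needs an input alignment and a run name, and because an outgroup is described in the text. The reported command lists `-N` twice (2 and 100); the sketch keeps only `-N 100`, matching the 100 rapid bootstrap replicates.

```python
def build_raxml_command(alignment, run_name, outgroup, threads=32):
    """Return the RAxML invocation as an argument list (nothing is executed)."""
    return [
        "raxmlHPC-PTHREADS-SSE3",
        "-f", "a",          # rapid bootstrapping and best-tree ML search in one run
        "-m", "GTRCAT",     # GTR with the CAT approximation of rate heterogeneity
        "-c", "6",          # number of CAT rate categories
        "-p", "2178",       # random seed for parsimony starting trees
        "-x", "7670",       # random seed for rapid bootstrapping
        "-N", "100",        # number of rapid bootstrap replicates
        "-T", str(threads), # number of threads
        "-o", outgroup,     # assumption: outgroup tip label as it appears in the alignment
        "-s", alignment,    # assumption: hypothetical alignment file name
        "-n", run_name,     # assumption: hypothetical run name
    ]
```

A call such as `build_raxml_command("supermatrix.phy", "lamiaceae", "Ligustrum_vulgare")` returns a list that could be passed to `subprocess.run` on a machine where RAxML is installed.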

Assessment of Phylogenetic Sampling Effort

To assess phylogenetic sampling efforts across Lamiaceae as of 2010, species richness data were compiled from Harley et al. (2004) for all recognized genera of

Lamiaceae and compared with the taxon sampling in the supermatrix described above.

To assess sampling homogeneity across higher-level taxonomic groups (i.e. clades recognized at the subfamilial, tribal, and subtribal levels), the proportional species-level diversity (relative to family-wide diversity) was also calculated for major clades and compared with the taxon sampling used in this study.
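The comparison described above reduces to a few proportions per clade, sketched below. The function and key names are assumptions, and any counts passed in would need to come from Harley et al. (2004) and the supermatrix; no real values are reproduced here.

```python
def sampling_summary(recognized, sampled):
    """Summarize sampling effort per clade.

    `recognized` and `sampled` map clade name -> number of species
    (recognized in the classification vs. present in the supermatrix).
    """
    total_recognized = sum(recognized.values())
    total_sampled = sum(sampled.values())
    summary = {}
    for clade, n_recognized in recognized.items():
        n_sampled = sampled.get(clade, 0)
        summary[clade] = {
            # fraction of the clade's own species that were sampled
            "fraction_sampled": n_sampled / n_recognized if n_recognized else 0.0,
            # clade's share of family-wide recognized diversity
            "share_of_family": n_recognized / total_recognized if total_recognized else 0.0,
            # clade's share of all sampled species
            "share_of_sample": n_sampled / total_sampled if total_sampled else 0.0,
        }
    return summary
```

Sampling is roughly homogeneous when `share_of_sample` tracks `share_of_family` across clades; large differences between the two flag over- or under-sampled groups.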

Divergence Time Estimation

To investigate the timing of evolution in Lamiaceae, divergence times were estimated using the optimal tree from RAxML and available fossil evidence for

Lamiaceae and Lamiales (see below). Because an acceptable date for the crown of

Lamiales was not available for use, my approach to divergence time estimation followed

Drew and Sytsma (2012). An upper boundary of 107 Ma was imposed as a constraint for Oleaceae and was selected based on the upper confidence interval (CI) crown estimate for Lamiales reported by Janssens et al. (2009). A minimum age constraint


was also used and was based on the lower CI crown estimate for Lamiales reported by

Wikström et al. (2001). Additionally, the stem of Bignoniaceae was constrained using a minimum age of 60 Ma, with a mean of 2.0 Ma and a standard deviation of 1.0 Ma, based on Paleocene from western (Brown 1962) and Japan

(Horiuchi and Manchester, in prep.). The fossil calibration was placed on the stem because it was unclear to which extant the Bignoniaceae fossils are attributable.

Lamiaceae are not well represented in the fossil record (Harley et al. 2004). Drew and Sytsma (2012) identified several fossils for use as calibration points, including an

Early Eocene hexacolpate fossil pollen sample used at the crown of Nepetoideae (Kar

1996) and a fossil of Melissa L. (Salviinae) from the Early-Middle Oligocene (Reid and Chandler 1926; Martínez-Millán 2010); these calibrations were also used for this dating analysis. However, my research of the paleobotanical literature extended this list to include six additional fossil for use as calibration points: i.e. Ajuginucula smithii

E. M. Reid & Chandler (1926; Ajugoideae), Oligocene of the Isle of Wight; Stachys palustris L. and S. germanica L. (Lamioideae), Miocene of western Siberia; tertiara Dorofeev (Elsholtzieae: Nepetoideae), Oligocene-Miocene of western Siberia;

Lycopus europaeus L. (Lycopinae: Mentheae: Nepetoideae), Oligocene of western

Siberia; and L. (Menthinae: Mentheae: Nepetoideae), Miocene of western

Siberia (Dorofeev 1988, 1963).

Minimum dates associated with Lamiaceae fossil calibration points correspond to the Middle Oligocene (28.1 Ma) and Oligocene-Miocene boundary (23 Ma) (Gradstein et al. 2012), the latter of which was confirmed by paleomagnetism studies (Gnibidenko and Semakov 2009) along the Kompasskii Bor Tract on the Tym River (western Siberia)


from which many of the fossils are described. An upper boundary of 86 Ma was also imposed for all Lamiaceae fossil calibrations and corresponds to the oldest unconfirmed fossil documented for Lamiaceae (Boltenhagen 1976a, 1976b).

The analysis was conducted using penalized likelihood (Sanderson 2002) in treePL (Smith and O’Meara 2012), with the following minimum age constraints: most recent common ancestor (MRCA) of Ajugoideae: 28.1 million years ago (Ma); MRCA of Bignoniaceae: 60 Ma; MRCA of Lamioideae: 23 Ma; MRCA of Ligustrum L. (Oleaceae): 60 Ma; MRCA of Nepetoideae: 47.8 Ma; MRCA of Melissa L. (Salviinae): 28.1 Ma; MRCA of Lycopus L. (Lycopinae): 23 Ma; MRCA of Mentha L. (Menthinae): 23 Ma; and MRCA of Perilla L. (Elsholtzieae): 23 Ma. A cross-validation step was first performed to determine the best smoothing parameter for the dating analysis.
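The constraint set described above can be written down compactly. The following is a minimal sketch, not the configuration actually used for this analysis: it assumes treePL's `min = <name> <age>` and `max = <name> <age>` configuration lines, uses the clade names as hypothetical node labels, and applies the 86 Ma upper boundary only to the Lamiaceae calibrations. The `mrca = ...` lines that treePL also requires are omitted because they depend on tip names that are not given here.

```python
# Minimum ages (Ma) as listed in the text; keys are hypothetical node labels.
MIN_AGES = {
    "Ajugoideae": 28.1,
    "Bignoniaceae": 60.0,
    "Lamioideae": 23.0,
    "Ligustrum": 60.0,
    "Nepetoideae": 47.8,
    "Melissa": 28.1,
    "Lycopus": 23.0,
    "Mentha": 23.0,
    "Perilla": 23.0,
}

# Calibrations outside Lamiaceae do not receive the 86 Ma upper boundary.
NON_LAMIACEAE = {"Bignoniaceae", "Ligustrum"}
LAMIACEAE_MAX_AGE = 86.0


def constraint_lines(min_ages, max_age, excluded):
    """Build treePL-style min/max constraint lines (assumed syntax)."""
    lines = []
    for node, age in min_ages.items():
        lines.append(f"min = {node} {age}")
        if node not in excluded:
            lines.append(f"max = {node} {max_age}")
    return lines


if __name__ == "__main__":
    print("\n".join(constraint_lines(MIN_AGES, LAMIACEAE_MAX_AGE, NON_LAMIACEAE)))
```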

Diversification Dynamics in Lamiaceae

General patterns of diversification through time among the extant Lamiaceae included in the phylogenetic analyses were examined. Lineages through time (LTT) plots were calculated using APE version 2.7-3 (Paradis et al. 2004) in R version 2.13.1

(The R Development Core Team). LTT plots were made for each of 100 BS trees to account for topological uncertainty in the dating estimates; BS trees were transformed into chronograms and pruned of all outgroup taxa prior to the analyses.
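For readers unfamiliar with LTT plots, the underlying calculation is simple. The sketch below is written in Python for illustration only (the analysis itself used APE in R); it assumes a fully bifurcating chronogram, and the function name and the node ages in the usage example are hypothetical.

```python
def lineages_through_time(branching_times):
    """Return (age_in_Ma, lineage_count) steps for a bifurcating chronogram.

    branching_times: ages (Ma before present) of all internal nodes,
    including the root.
    """
    times = sorted(branching_times, reverse=True)  # oldest split first
    # The root split produces two lineages; every later split adds one more.
    return [(age, index + 2) for index, age in enumerate(times)]


if __name__ == "__main__":
    # Hypothetical node ages for a five-tip tree.
    for age, count in lineages_through_time([30.0, 10.0, 50.0, 20.0]):
        print(f"{age:6.2f} Ma: {count} lineages")
```

Plotting the logarithm of the lineage count against time for each bootstrap chronogram gives one curve per tree, which is how topological uncertainty shows up as spread among the curves.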

To investigate the tempo of evolution across major clades in Lamiaceae, diversification rates were also calculated to test for shifts in speciation and relative extinction rates using MEDUSA (Alfaro et al. 2009). The analysis was implemented in the GEIGER package of R (Harmon et al. 2008) using the chronogram from treePL

(Smith and O’Meara 2012; see Divergence Time Estimation above). A pure birth (Yule) model was implemented in MEDUSA to test whether any of the branches in a given tree

led to clades of exceptional species richness given (i.e. clades with high net diversification rates).
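To give a sense of what a net diversification rate means under a pure birth model, the sketch below uses the simple crown-age estimator r = (ln N - ln 2) / t with no extinction. This is not the MEDUSA procedure, which fits and compares likelihood models across the tree; it is only a back-of-the-envelope illustration, and the function name is hypothetical.

```python
import math


def yule_crown_rate(n_species, crown_age_ma):
    """Net diversification rate (per Myr) under pure birth from a crown age.

    Assumes the clade began as two lineages at its crown node and has
    experienced no extinction.
    """
    if n_species < 2 or crown_age_ma <= 0:
        raise ValueError("need at least two species and a positive crown age")
    return (math.log(n_species) - math.log(2)) / crown_age_ma


if __name__ == "__main__":
    # Illustrative inputs only: a 140-species clade with a 26.57 Ma crown age.
    print(round(yule_crown_rate(140, 26.57), 4))
```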

Most diversification methods perform best with reasonably well-sampled phylogenies (Alfaro et al. 2009). One of the benefits of MEDUSA is that it allows for integration of species richness data to supplement unresolved, incompletely or non-randomly sampled clades. However, given that many genera in Lamiaceae are non-monophyletic (see Introduction for citations), it was not possible to assign taxonomic data to generic-level clades. Thus, in this case, I assumed that taxon sampling was random, homogeneous across major clades of Lamiaceae, and representative of the total recognized species diversity in each clade.

Results

Descriptive Data

The final supermatrix included 1,265 accessions, of which 11 were outgroups.

The total aligned length of the matrix included 31,349 characters, of which 7,290 were variable and 24,059 were constant. In all, the matrix included 2,149,154 nucleotides, but

94.59% of the data were missing (including missing loci and gaps).
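The proportion of missing data follows directly from the matrix dimensions. As a quick check (a sketch with a hypothetical function name, treating every cell not occupied by a nucleotide as missing), the arithmetic comes out within rounding of the reported figure:

```python
def missing_fraction(n_accessions, n_characters, n_nucleotides):
    """Fraction of supermatrix cells not occupied by a nucleotide."""
    total_cells = n_accessions * n_characters
    return 1.0 - n_nucleotides / total_cells


if __name__ == "__main__":
    # Values reported in the text.
    variable, constant = 7290, 24059
    aligned_length = variable + constant  # 31,349 characters
    fraction = missing_fraction(1265, aligned_length, 2149154)
    print(f"{fraction:.2%} of cells are missing")
```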

Sequence data from 34 aligned clusters (or data partitions) were included, representing three single-copy nuclear genes (GapC-1, GapC-2, FPS2), one mitochondrial gene (COX1), 13 plastid genes (accD, atpB, atp1, ndhF, rbcL, rpl16, rpoB, rpoC1, rps2, and two clusters each for matK and rps16), 11 plastid intergenic spacers (atpF-atpH, psbA-trnH, rpl32-trnL, trnS-trnG, trnS-psbZ, trnT-trnL, trnT-trnF, and two clusters each for trnL-trnF [including the trnL gene] and psbA-trnH), and six nuclear ribosomal internal transcribed spacer (ITS) clusters. The coordinates for all data partitions in the supermatrix, as well as GenBank accession numbers for all sampled taxa, are reported in tabular format (see Object 2-1).

Estimates of species richness for all recognized higher-level taxonomic groups (i.e. at the level of genus, subtribe, tribe, subfamily, and family) are reported in Table 2-1 (see footnote 2). My results suggest that, as of 2010, only approximately 20.9% of the estimated species diversity in Lamiaceae was sampled for molecular phylogenetic analysis. However, the percentage of species-level diversity sampled varied considerably among subfamilies, with the highest and lowest percentages observed for Prostantheroideae (51%) and Scutellarioideae (5%), respectively (Figure 2-1; see also Table 2-1).
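The sampling percentages in Table 2-1 are the number of sampled taxa divided by the maximum number of recognized taxa. A minimal sketch of that calculation, using a few rows from the table as inputs (the function name is hypothetical):

```python
def percent_sampled(n_sampled, max_recognized):
    """Percentage of recognized diversity that was sampled, to two decimals."""
    if max_recognized <= 0:
        raise ValueError("maximum number of recognized taxa must be positive")
    return round(100.0 * n_sampled / max_recognized, 2)


if __name__ == "__main__":
    # (taxon, species in analysis, maximum estimated species) from Table 2-1.
    rows = [
        ("Congea", 1, 10),
        ("Vitex", 25, 250),
        ("Prostantheroideae", 61, 266),
        ("Lamiaceae", 1284, 6124),
    ]
    for taxon, sampled, maximum in rows:
        print(f"{taxon}: {percent_sampled(sampled, maximum):.2f}%")
```

Values above 100% in the table would indicate that more accessions were sampled than the species estimate used as the denominator, so the same function can serve as a simple consistency check.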

The proportion of species diversity within subfamilies relative to the total family-wide species diversity is shown in Figure 2-2A. Nepetoideae represented the largest proportion of family-wide diversity (49.07%), followed by Lamioideae (18.56%), Ajugoideae (12.20%), Viticoideae (7.83%), Scutellarioideae (5.64%), Prostantheroideae (3.93%), genera of incertae sedis (I.S.) with regard to subfamily (2.38%), and Symphorematoideae (0.40%). Pairwise comparisons between recognized versus sampled proportional diversity revealed only minor over- or under-sampling biases in the dataset used for phylogenetic analysis (Figure 2-2A and B); in most cases, only small differences in proportional diversity were detected (e.g. Nepetoideae [+7.6%], Lamioideae [+4.6%], Ajugoideae [-3.5%], Viticoideae [-4.7%], Scutellarioideae [-4.3%], Prostantheroideae [+0.8%], I.S. [-0.3%], and Symphorematoideae [-0.3%]) (see footnote 3).

2 Estimates are based on higher-level taxonomic groups recognized by Harley et al. (2004). More recent revisions to higher-level taxonomy were published for Lamioideae and Mentheae by Scheen et al. (2010) and Bendiksby et al. (2011), and Drew and Sytsma (2012), respectively. However, only the latter was supported by my phylogenetic results and, therefore, was updated in Table 2-1.

Phylogenetic Inference

The ML tree had a negative log likelihood (-lnL) of 163214.248232 (Figure 2-3; see Table 2-2 for estimated model parameters). In the 1,265-taxon tree, only 332 (mostly internal) nodes had at least 70% bootstrap support. However, Lamiaceae were recovered as a strongly supported clade (ML BS = 97%).

Eleven major subclades representing subfamilies and genera of I.S. were recovered within Lamiaceae (Figure 2-4). Of these, only four (i.e. Tectona I.S., Viticoideae A, Peronema/Hymenopyramis/Petraeovitex, and Scutellarioideae) were strongly supported (ML BS = 91-100%; see Figure 2-4). Nevertheless, the composition of all 11 major subclades was consistent with previous phylogenetic evidence.

In the ML tree (Figures 2-3 and 2-4), Nepetoideae was sister to a large clade comprising the remainder of the family. The remaining subclades formed a grade that branched in the following order: Callicarpa I.S. + Prostantheroideae, Symphorematoideae and Viticoideae B (i.e. “the group” sensu Bramley et al. [2009]), Tectona I.S., Viticoideae A (i.e. “Gmelina/Premna group” sensu Bramley et al. [2009]), Ajugoideae, Peronema/Hymenopyramis/Petraeovitex, and Scutellarioideae + Lamioideae (see Figure 2-4). However, the relationships among these major subclades were not supported.

3 Percentage gains and losses are expressed as positive and negative values [in brackets], respectively, and correspond to proportional species-level diversity sampled for this study minus the proportional species-level diversity recognized by subfamily of Lamiaceae.

Divergence Times

treePL identified a cross-validation value of 2 = 0.001 for the dating analysis with penalized likelihood (PL). The results of the divergence time estimation analysis placed the basal node of Lamiaceae in the Late Cretaceous (72 Ma) (Figure 2-6; see also Table 2-3). As shown in the chronogram (Figure 2-6), the MRCAs of most major subclades of Lamiaceae originated and diverged during the Late Cretaceous (70.34 Ma) through the Thanetian age of the Paleocene (59.63 Ma).

The crown nodes of three of the 11 subclades were placed in the Paleocene (see Table 2-3): Ajugoideae (61.72 Ma), Peronema/Hymenopyramis/Petraeovitex I.S. (57.38 Ma), and Nepetoideae (56.68 Ma) (see footnote 4). As for the remaining subclades, Viticoideae A (55.92 Ma), Lamioideae (52.39 Ma), Scutellarioideae (47.09 Ma), and Prostantheroideae (46.27 Ma) were placed in the Eocene; Viticoideae B (32.38 Ma) and Tectona I.S. (24.13 Ma) were placed in the Oligocene; and Callicarpa (17.65 Ma) was placed in the Miocene.
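Assigning a node age to a geologic epoch is a simple interval lookup. The sketch below is illustrative only; the boundary ages are entered as I understand them from the Gradstein et al. (2012) time scale and should be treated as assumptions to be checked against that source, and the function name is hypothetical.

```python
# (epoch, older boundary in Ma, younger boundary in Ma); assumed boundary ages.
EPOCHS = [
    ("Late Cretaceous", 100.5, 66.0),
    ("Paleocene", 66.0, 56.0),
    ("Eocene", 56.0, 33.9),
    ("Oligocene", 33.9, 23.03),
    ("Miocene", 23.03, 5.333),
]


def epoch_for(age_ma):
    """Return the epoch whose interval contains the given age (Ma)."""
    for name, older, younger in EPOCHS:
        if younger < age_ma <= older:
            return name
    raise ValueError(f"age {age_ma} Ma is outside the tabulated range")


if __name__ == "__main__":
    # Crown node ages reported in Table 2-3.
    for clade, age in [("Lamiaceae", 72.00), ("Nepetoideae", 56.68),
                       ("Viticoideae B", 32.38), ("Callicarpa I.S.", 17.65)]:
        print(f"{clade}: {age} Ma, {epoch_for(age)}")
```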

Diversification Dynamics

The LTT plots displayed considerable variation among bootstrap replicates (Figure 2-4). However, the general trend among the bootstrap trees suggested increasing rates of diversification beginning at 40-30 Ma, with a recent and rapid acceleration in the last five to ten million years. As for the diversification analyses, MEDUSA identified nine clades with high rates of net diversification: Nepetoideae (r = 0.062851); Mentheae (Nepetoideae; r = 0.132401); Haplostachys/Phyllostegia/Stachys/Stenogyne (Lamioideae; r = 0.330419); Acrotome/Leucas/Otostegia/Stachys (Lamioideae; r = 0.192031); Dicrastylis/Cyanostegia (Prostantheroideae; r = 2.283437); Sideritis (Lamioideae; r = 0.184496); Ocimum/Syncolostemon (Nepetoideae: Ocimeae; r = 0.365144); Salviinae (Nepetoideae: Mentheae; r = 0.90164); and /Hedeoma/Poliomintha (Nepetoideae: Mentheae: Menthinae; r = 0.329501) (Figure 2-5).

4 Symphorematoideae was represented by only one accession in the phylogenetic analysis. The MRCA of Symphorematoideae and Viticoideae B originated in the Danian age of the Paleocene (62.21 Ma), but the crown node of Symphorematoideae may be younger in age.

Discussion

The 1,265-taxon tree generated by this study provides a comprehensive view of phylogenetic relationships in Lamiaceae and serves as a useful tool for understanding broad-scale phylogenetic patterns. The results presented here illustrate current gaps in our knowledge of mint phylogeny and provide several new and important insights with regard to evolutionary processes and differential phylogenetic patterns and species richness among major clades. As a result, several new sampling strategies are proposed, and these may help move mint systematics “out of the bushes and into the trees”—that is, to resolve clear evolutionary trees from the currently unresolved patterns.

Phylogenetic Relationships in Lamiaceae

Previous studies have contributed much to our understanding of major clades within Lamiaceae. My phylogenetic hypothesis extends support for the monophyly of many of these, including most clades recognized at the subfamilial, tribal, and subtribal levels. However, despite my synthesis of nearly two decades of phylogenetic data, there is still no clear picture of infrafamilial relationships.

The phylogenetic hypotheses reported by Bendiksby et al. (2011), Li et al. (2012), Chen et al. (2014), and my study, the most recent studies in which all seven subfamilies were sampled, disagree with regard to most infrafamilial relationships (Figure 2-7). One exception is a sister relationship between Scutellarioideae and Lamioideae, but this was first demonstrated nearly 20 years ago (e.g. Wagstaff et al. 1998).

Only the study of Bendiksby et al. (2011) recovers relationships among all Lamiaceae subfamilies with strong support, but their study was based on only four plastid loci and included poor sampling outside of Lamioideae. As a result, their support values may be inflated and misleading compared to the more densely sampled phylogeny presented here or those of Li et al. (2012) and Chen et al. (2014).

Evolutionary Timing and Tempo in Lamiaceae

My phylogenetic dating results indicate that Lamiaceae originated approximately 72 Ma, extending support for a recent hypothesis of 60-71 Ma by Yao et al. (in prep.). This date is older than the previous estimate for Lamiales by Bell et al. (2010) (i.e. 27-50 Ma), but is within the range of dates reported by Wikström et al. (2001) (i.e. 71-74 Ma). However, none of these angiosperm studies used Lamiaceae fossils, and sampling of both Lamiaceae and Lamiales was limited.

As shown in Figure 2-6, Lamiaceae began to diversify prior to the Cretaceous-

Paleogene boundary (KPB) at 66.043 ± 0.043 Ma (Renne et al. 2013) and continued to radiate throughout the Paleogene. These results are intriguing and suggest that much of

Lamiaceae not only survived a global extinction event, but also largely arose in its wake.

While the origins of many clades of Lamiaceae appear much older, my dating and diversification results suggest that much of extant mint diversity diverged much more recently (e.g. < 20-10 Ma) from a MRCA. Moreover, the diversification results suggest that some clades are still diversifying rapidly (e.g. Nepetoideae). The

combination of recent origins and high net diversification rates may explain why relationships in some clades appear more challenging to resolve than others as well as differential patterns of species richness across the family (e.g. Nepetoideae comprises nearly half of mint species diversity and their phylogenetic relationships appear more challenging to resolve relative to other subfamilial clades). In this evolutionary scenario, the time between divergence events is likely small, and ancestral variation within sampled loci contributes discordant signal that confounds phylogenetic inference (Oliver

2013; see Chapter 5: Discussion on incomplete lineage sorting).

Limitations Imposed by Opportunistic Sampling

While providing the most comprehensive taxon sampling for Lamiaceae to date, this study also suffers from limitations imposed by data availability as it relies on previous taxonomic sampling and available sequence data. The opportunistic sampling strategy employed negatively impacts support for phylogenetic relationships and, in some cases, affects downstream interpretation of results.

Gene sampling. Current data resources available for Lamiaceae appear too heterogeneous for reconstruction of family-wide relationships. In other words, the data used for phylogenetic inference reflect individual gene sampling strategies used by previous researchers to resolve group-specific hypotheses within the family rather than family-wide hypotheses.

Many recovered clades in my tree topology reflect individual research efforts and are (not surprisingly) well supported, but their placement within the family and relationships to one another remain uncertain due to insufficient overlap of gene sets or inadequate rates of sequence evolution among shared loci. With regard to the latter, there appears to be substantial conflict and phylogenetic uncertainty within both the

large tribe Mentheae (Nepetoideae) and some clades of Lamioideae (e.g. within

Stachys), where shared loci (e.g. primarily ITS and trnL-trnF) apparently exhibit insufficient sequence variation to adequately resolve relationships. Additionally, topological discordance among trees independently inferred from ITS and trnL-trnF datasets has been documented in previous studies of Mentheae (e.g. Edwards et al.

2006; Godden 2009; Bräuchler et al. 2010). In this case, ITS and plastid data may introduce signal conflict into the analysis and result in poorly supported topology.

In all, very few loci have been used for phylogenetic reconstructions in

Lamiaceae, and most of our knowledge of phylogenetic relationships is based on ITS and a small number of plastid loci. It seems clear from my results that additional variable sequence data are necessary to fully resolve phylogenetic relationships, and the importance of carefully designed gene-sampling strategies cannot be overemphasized. Given the phylogenetic patterns observed here, reconstructing relationships within and among major clades of Lamiaceae will likely require carefully selected DNA markers that exhibit “appropriate” rates of molecular evolution across multiple phylogenetic depths and allow for more complete overlap among gene sets.

However, this could prove challenging, especially since fast-evolving markers that are useful for resolving shallow-level relationships could be difficult to align across a broadly sampled Lamiaceae.

Taxon sampling. My estimates of phylogenetic sampling effort indicate that approximately 20% of total mint species diversity has been sampled for phylogenetic analyses, with lower percentages of sampled diversity within several subfamilies (range:

3.7%-23.6%; mean: 14.3%; median: 15.1%; standard deviation: 8%). Four subfamilies

are poorly sampled here, including Ajugoideae, Scutellarioideae, Symphorematoideae, and Viticoideae, and approximately 63 genera of Lamiaceae have never been sampled.

Since 2010, several additional datasets have been published for Lamioideae (Bendiksby 2011a, 2011b; Mathiesen et al. 2011; Salmaki et al. 2013), Nepetoideae (Pastore et al. 2011 [Ocimeae]; Agostini et al. 2012 [Menthinae]; Drew & Sytsma 2012,

2013 [Mentheae]), and Prostantheroideae (Wilson et al. 2012). Inclusion of these data would likely improve resolution and support within each of these clades. However, as shown in Figure 2-2B, this would also contribute to additional sampling bias and, consequently, inflate estimates of diversification rates in all three cases. While there is little doubt that Nepetoideae and Lamioideae have high net diversification rates, these subfamilies are slightly overrepresented in my dataset. Thus, the magnitude of their diversification rates may be lower than my results suggest. The reverse is also possible; undersampled groups may actually have higher rates of diversification, but their poor representation in my phylogenetic analysis hinders detection of these patterns.

Towards Fully Resolving Relationships in Lamiaceae

Future family-wide phylogenetic analyses of Lamiaceae will likely benefit from the application of whole or nearly complete plastome sequence data. With the availability of new methods for targeted sequencing on next-generation sequencing platforms, this is now an economically feasible option for a species-rich clade such as Lamiaceae

(Godden et al. 2012; see Chapter 3). However, it is important to note that all of the broad-scale phylogenetic analyses of Lamiaceae are based primarily on plastid DNA

(including this study), and this is a concern given that analyses of plastid and nuclear loci can (and often do) yield different topologies (e.g. Edwards et al. 2006; Godden

2009; Bräuchler et al. 2010). Thus, nuclear trees are crucial for comparison with plastid trees.

Even at shallow phylogenetic depths, few studies of Lamiaceae have explored the wealth of information in nuclear genes for phylogeny reconstruction (e.g., Edwards et al. 2008a; Curto et al. 2012; Drew & Sytsma 2013). However, my results suggest that nuclear loci may be needed to resolve relationships within several clades of Lamiaceae, particularly in clades where incomplete lineage sorting may complicate accurate phylogenetic inference (e.g. recently derived and rapidly diversifying lineages in

Mentheae). Fortunately, new single-copy nuclear gene sets are already being developed for Lamiaceae (see Chapter 5), and these will serve as a valuable resource for future phylogenetic investigations.

Table 2-1. A hierarchical taxonomic classification of Lamiaceae adapted from Harley et al. 2004, including estimated species diversity reported by taxon. The status of sampling efforts across Lamiaceae as of 2010 is summarized in the table. Reported for all higher taxa are the total number of species sampled as part of previous studies and the percentages of sampled species diversity (generic-level diversity estimates are also reported for higher taxonomic groups). All percentages were calculated using the total number of sampled taxa divided by the maximum number of recognized taxa.

Taxon; Minimum estimated species; Maximum estimated species; Genera in analysis; Species in analysis; % Generic diversity sampled; % Species diversity sampled

Symphorematoideae Briq.
Congea Roxb. 7 10 1 1 10.00%
Sphenodesme Jack 14 0 0 0.00%
Symphorema Roxb. 3 0 0 0.00%

SYMPHOREMATOIDEAE TOTAL SPECIES 24 27 1 3.70%
SYMPHOREMATOIDEAE TOTAL GENERA 3 1 33.33%

Viticoideae Briq.
Cornutia L. 12 0 0 0.00%
Gmelina L. 33 0 0 0.00%
Paravitex Fletcher 1 4 1 4 100.00%
Petitia Jacq. 2 1 1 50.00%
Premna L. 50 200 1 4 2.00%
Pseudocarpidium Millsp. 8 0 0 0.00%
Teijsmanniodendron Koorders 14 1 3 21.43%
Tsoongia Merrill 1 1 1 100.00%
Vitex L. 250 1 25 10.00%
Viticipremna H.J. Lam 5 1 2 40.00%

VITICOIDEAE TOTAL SPECIES 376 530 40 7.55%
VITICOIDEAE TOTAL GENERA 10 7

Ajugoideae Kostel.
Jacq. 116 1 8 6.90%
L. 40 50 1 6 12.00%
L. f. 8 1 1 12.50%
L. 1 1 1 100.00%
Bunge 7 1 6 85.71%
Clerodendrum L. 400 500 1 48 9.60%
Discretitheca P.D. Cantino 1 0 0 0.00%
F. Muell. 3 1 2 66.67%
Wall. ex Griff. 13 0 0 0.00%
Hosea Ridl. 1 0 0 0.00%
Huxleya Ewart 1 0 0 0.00%
Dop 9 1 1 11.11%
Fisch. & C.A. Mey 2 0 0 0.00%
Oncinocalyx F. Muell. 1 1 1 100.00%
Labill. 21 1 8 38.10%
Pseudocaryopteris P.D. Cantino 3 1 1 33.33%
Raf. 50 60 1 2 3.33%
Rubiteucris Kudô 2 1 2 100.00%
Schnabelia Hand.-Mazz. 5 1 3 60.00%
Briq. 3 0 0 0.00%
Hook. f. 1 1 1 100.00%
L. 250 1 7 2.80%
L. 17 1 17 100.00%
P.D. Cantino 1 0 0 0.00%

AJUGOIDEAE TOTAL SPECIES 706 826 115 13.92%
AJUGOIDEAE TOTAL GENERA 24 17 70.83%

Prostantheroideae Luerssen
Chloantheae Benth. & Hook. f.
Brachysola Rye 2 1 2 100.00%
Chloanthes R. Br. 4 1 3 75.00%
Cyanostegia Turcz. 5 1 3 60.00%
Dicrastylis J. Drumm. ex Harv. 26 1 18 69.23%
Hemiphora F. Muell. 1 1 1 100.00%
Lachnostachys Hook. 6 1 5 83.33%
Mallophora Endl. 2 0 0 0.00%
Newcastelia F. Muell. 12 1 4 33.33%
Physopsis Turcz. 2 1 2 100.00%
Pityrodia R. Br. 45 1 15 33.33%

CHLOANTHEAE TOTAL SPECIES 105 53 50.48%
CHLOANTHEAE TOTAL GENERA 10 9 90.00%

Westringieae Bartl.
Hemiandra R. Br. 14 0 0 0.00%
Microcorys R. Br. 20 0 0 0.00%
Prostanthera Labill. 100 1 5 5.00%
Westringia Sm. 25 1 3 12.00%
Wrixonia F. Muell. 2 0 0 0.00%

WESTRINGIEAE TOTAL SPECIES 161 8 4.97%
WESTRINGIEAE TOTAL GENERA 5 2 40.00%

PROSTANTHEROIDEAE TOTAL SPECIES 266 61 22.93%
PROSTANTHEROIDEAE TOTAL GENERA 15 11 73.33%

Scutellarioideae Luerssen
Holmskioldia Retz. 1 1 1 100.00%
Renschia Vatke 1 0 0 0.00%
Scutellaria L. 360 1 15 4.17%
Tinnea Kotschy ex Hook. f. 19 1 1 5.26%
Wenchengia C.Y. Wu & S. Chow 1 0 0 0.00%

SCUTELLARIOIDEAE TOTAL SPECIES 382 17 4.45%
SCUTELLARIOIDEAE TOTAL GENERA 5 3 60.00%

Lamioideae Harley
Achyrospermum Blume 25 0 0 0.00%
Acrotome Benth. ex Endl. 6 1 4 66.67%
Ajugoides Makino 1 0 0 0.00%
Alajja Ikonn. 3 0 0 0.00%
Anisomeles R. Br. 3 1 1 33.33%
Ballota L. 30 1 4 13.33%
Bostrychanthera Benth. 2 0 0 0.00%
Brazoria Englm. ex A. Gray 2 3 1 3 100.00%
Chaiturus Willd. 1 0 0 0.00%
Chamaesphacos Schrenk ex Fisch. & C.A. Mey 1 0 0 0.00%
Chelonopsis Miq. 16 1 1 6.25%
Colebrookea Sm. 1 1 1 100.00%
Colquhounia Wall. 6 1 2 33.33%
Comanthosphace S. Moore 3 4 0 0 0.00%
Craniotome Rchb. 1 0 0 0.00%
Eremostachys Bunge 5 60 1 1 1.67%
Eriophyton Benth. 1 0 0 0.00%
Eurysolen Prain 1 0 0 0.00%

Galeopsis L. 10 1 3 30.00%
Gomphostemma Wall. ex Benth. 36 1 1 2.78%
Haplostachys (A. Gray) W.F. Hillebr. 5 1 2 40.00%
Hypogomphia Bunge 1 3 0 0 0.00%
Isoleucas O. Schwartz 1 1 1 100.00%
Lagochilus Bunge ex Benth. 40 0 0 0.00%
Lagopsis (Bunge ex Benth.) Bunge 4 1 1 25.00%
Lamiophlomis Kudô 1 1 1 100.00%
Lamium L. 17 30 1 8 26.67%
Leonotis (Pers.) R. Br. 10 1 6 60.00%
Leonurus L. 25 1 8 32.00%
Leucas Burm. ex. R. Br. 100 1 38 38.00%
Leucosceptrum Sm. 1 0 0 0.00%
Loxocalyx Hemsl. 3 0 0 0.00%
Macbridea Elliott ex Nutt. 2 1 2 100.00%
Marrubium L. 40 1 4 10.00%
Matsumurella Makino 1 5 0 0 0.00%
Melittis L. 1 1 1 100.00%
Metastachydium Airy Shaw ex C.Y. Wu & H.W. Li 1 0 0 0.00%
Microtoenia Prain 24 0 0 0.00%
Moluccella L. 2 1 3 150.00%
Notochaete Benth. 2 0 0 0.00%
Otostegia Benth. 15 1 9 60.00%
Panzerina Soják 2 7 0 0 0.00%
Paralamium Dunn 1 0 0 0.00%
Paraphlomis (Prain) Prain 20 1 4 20.00%
Phlomidoschema (Benth.) Vved. 1 1 1 100.00%
Phlomis L. 100 1 23 23.00%
Phyllostegia Benth. 34 1 29 85.29%
Physostegia Benth. 12 1 12 100.00%
Pogostemon Desf. 80 1 1 1.25%

Prasium L. 1 1 1 100.00%
Pseudoeremostachys Popov 1 0 0 0.00%
Pseudomarrubium Popov 1 0 0 0.00%
Rostrinucula Kudô 2 0 0 0.00%
Roylea Wall. ex Benth. 1 1 1 100.00%
Sideritis L. 140 1 49 35.00%
Stachyopsis Popov & Vved. 4 0 0 0.00%
Stachys L. 300 1 52 17.33%
Stenogyne Benth. 20 1 17 85.00%
Sulaimania Hedge & Rech. f. 1 0 0 0.00%
Suzukia Kudô 2 0 0 0.00%
Synandra Nutt. 1 1 1 100.00%
Thuspeinanta T. Durand 2 0 0 0.00%
M.W. Turner 1 1 1 100.00%

LAMIOIDEAE TOTAL SPECIES 1177 1257 297 23.63%
LAMIOIDEAE TOTAL GENERA 63 36 57.14%

Nepetoideae (Dumort.) Luerss.

Elsholtzieae (Burnett) Sanders & Cantino
L. 10 1 1 10.00%
Willd. 40 1 1 2.50%
(Benth.) Buch.-Ham. ex Maxim. 22 1 1 4.55%
Perilla L. 1 1 1 100.00%
Perillula Maxim. 1 0 0 0.00%

ELSHOLTZIEAE TOTAL SPECIES 75 4 5.33%
ELSHOLTZIEAE TOTAL GENERA 5 4 80.00%

Mentheae Dumort.
Salviinae (Dumort.) Endl.
Chaunostoma Donn. Sm. 1 0 0 0.00%
Dorystoechas Boiss. & Heldr. 1 1 1 100.00%
Lepechinia Willd. 40 1 5 12.50%
Meriandra Benth. 2 1 1 50.00%
Perovskia Kar. 7 1 3 42.86%
Rosmarinus L. 3 1 1 33.33%
Salvia L. 900 1 169 18.78%
Zhumeria Rech. f. & Wendelbo 1 1 1 100.00%

SALVIINAE TOTAL SPECIES 955 181 18.95%
SALVIINAE TOTAL GENERA 8 7 87.50%

Menthinae (Dumort.) Endl.
(A. Gray) Benth. & Hook. f. 4 1 3 75.00%
Raf. 3 1 2 66.67%
Bystropogon L’Hér. 7 1 7 100.00%
Clinopodium L. 100 1 75 75.00%
Conradina A. Gray 7 1 6 85.71%
Colla 1 2 1 2 100.00%
Cunila D. Royen ex L. 15 1 10 66.67%
Cyclotrichium (Boiss.) Manden. & Schreng. 8 1 7 87.50%
Dicerandra Benth. 9 10 1 9 90.00%
Eriothymus (Benth.) Schmidt 1 0 0 0.00%
Glechon Spreng. 6 7 1 3 42.86%
Gontscharovia Boriss. 1 1 1 100.00%
Hedeoma Pers. 42 1 29 69.05%
Hesperozygis Epling 8 1 4 50.00%
Hoehnea Epling 4 1 1 25.00%

Hyssopus L. 2 1 2 100.00%
Killickia 4 1 4 100.00%
Kurzamra Kuntze 1 1 1 100.00%
Mentha L. 20 1 18 90.00%
Micromeria Benth. 70 1 27 38.57%
Minthostachys (Benth.) Spach 12 1 10 83.33%
Monarda L. 20 1 18 90.00%
Monardella Benth. 30 1 6 20.00%
Neoeplingia Ramamoorthy, Hiriart & Medrano 1 0 0 0.00%
Obtegomeria 1 1 1 100.00%
L. 40 1 10 25.00%
Pentapleura Hand.-Mazz. 1 1 1 100.00%
Raf. 1 1 1 100.00%
Pogogyne Benth. 7 1 6 85.71%
Poliomintha A. Gray 7 1 6 85.71%
Michx. 17 21 1 7 33.33%
Rhabdocaulon (Benth.) Epling 7 1 2 28.57%
Rhododon Epling 1 2 1 2 100.00%
Saccocalyx Coss. & Durieu 1 1 1 100.00%
L. 38 1 17 44.74%
Small 1 1 1 100.00%
Thymbra L. 4 1 4 100.00%
Thymus L. 220 1 23 10.45%
Zataria Boiss. 1 1 1 100.00%
Ziziphora L. 20 1 6 30.00%

MENTHINAE TOTAL SPECIES 743 751 334 44.95%
MENTHINAE TOTAL GENERA 44 38 95.00%

Nepetinae Coss. & Germ.
J. Clayton ex Gronov. 14 1 10 71.43%
Cedronella Moench 1 1 1 100.00%
Dracocephalum L. 70 1 6 8.57%
Drepanocaryum Pojark. 1 1 1 100.00%
Glechoma L. 4 8 1 5 62.50%
Hymenocrater Fisch. & C.A. Mey 10 0 0 0.00%
Lallemantia Fisch. & C.A. Mey 5 1 1 20.00%
Lophanthus Adans. 20 0 0 0.00%
Marmoritis Benth. 4 5 1 2 40.00%
Meehania Britton 6 1 2 33.33%
Nepeta L. 200 1 37 18.50%
Schizonepeta (Benth.) Briq. 3 1 2 66.67%

NEPETINAE TOTAL SPECIES 338 348 67 19.25%
NEPETINAE TOTAL GENERA 12 10 83.33%

Lycopinae B. T. Drew & Sytsma
Lycopus L. 14 1 4 28.57%

LYCOPINAE TOTAL SPECIES 14 4 28.57%
LYCOPINAE TOTAL GENERA 1 1 100.00%

Prunellinae (Dumortier) B. T. Drew & Sytsma
Cleonia L. 1 1 1 100.00%
Horminum L. 1 1 1 100.00%
Prunella L. 7 1 5 71.43%

PRUNELLINAE TOTAL SPECIES 9 7 77.78%
PRUNELLINAE TOTAL GENERA 3 3 100.00%

Mentheae Incertae Sedis
Heterolamium C.Y. Wu 1 0 0 0.00%
Melissa L. 4 1 1 25.00%

MENTHEAE I.S. TOTAL SPECIES 5 1 20.00%
MENTHEAE I.S. TOTAL GENERA 2 1 50.00%

MENTHEAE TOTAL SPECIES 1726 2077 594 28.60%
MENTHEAE TOTAL GENERA 66 60 90.91%

Ocimeae Dumort.
Lavandulinae Endl.
Lavandula L. 36 1 10 27.78%

SPECIES SUBTOTAL 36 10 27.78%
GENERA SPECIES SUBTOTAL 1 1 100.00%

Hanceolinae (C.Y. Wu) A.J. Paton, Ryding & Harley
Hanceola Kudô 8 0 0 0.00%
Isodon (Benth.) Schrader ex Spach 100 1 10 10.00%
Siphocranion Kudô 2 1 1 50.00%

SPECIES SUBTOTAL 110 11 10.00%
GENERA SPECIES SUBTOTAL 3 2 66.67%

Hyptidinae Endlicher
Asterohyptis Epling 3 0 0 0.00%
Eriope Bonpl. ex Benth. 40 0 0 0.00%
Hypenia (Mart. ex. Benth.) Harley 23 1 2 8.70%
Hyptidendron Harley 19 0 0 0.00%
Hyptis Jacq. 280 1 8 2.86%

Marypianthes Mart. ex Benth. 5 6 0 0 0.00%
Peltodon Pohl 6 0 0 0.00%
Raphiodon Schauer 1 0 0 0.00%

SPECIES SUBTOTAL 377 378 10 2.65%
GENERA SPECIES SUBTOTAL 8 2 25.00%

Ociminae (Dumort.) Schmidt
Basilicum Moench 1 1 1 100.00%
Benguellia G. Tayl. 1 0 0 0.00%
Catoferia (Benth.) Benth. 4 1 1 25.00%
Endostemon N.E. Br. 19 1 2 10.53%
Fuerstia T.C.E. Fries 8 1 1 12.50%
Haumaniastrum P.A. Duvign. & Plancke 35 1 1 2.86%
Hoslundia Vahl 1 1 1 100.00%
Ocimum L. 65 1 8 12.31%
Orthosiphon Benth. 40 1 5 12.50%
Syncolostemon E. Mey. ex Benth. in E. Mey. 10 47 1 19 190.00%
Platostoma P. Beauv. 45 1 19 42.22%

SPECIES SUBTOTAL 229 266 58 25.33%
GENERA SPECIES SUBTOTAL 11 10 90.91%

Plectranthinae Endlicher
Aeollanthus Mart. ex Spreng. 40 1 2 5.00%
Alvesia Welw. 3 1 1 33.33%
Capitanopsis S. Moore 3 1 1 33.33%
Dauphinea Hedge 1 1 1 100.00%
Leocus A. Chev. 5 6 0 0 0.00%
Madlabium Hedge 1 0 0 0.00%

Plectranthus L’Hér. 300 1 27 9.00%
Pycnostachys Hook. 40 1 3 7.50%
Tetradenia Benth. 15 20 1 2 10.00%
Thorncroftia N.E. Br. 4 1 2 50.00%

SPECIES SUBTOTAL 412 418 39 9.33%
GENERA SPECIES SUBTOTAL 11 8 72.73%

OCIMEAE TOTAL SPECIES 1164 1171 128 10.93%
OCIMEAE TOTAL GENERA 33 23 69.70%

NEPETOIDEAE TOTAL SPECIES 2965 3323 726 21.85%
NEPETOIDEAE TOTAL GENERA 104 87 83.65%

Lamiaceae Incertae Sedis
Acrymia Prain 1 0 0
Callicarpa L. 140 1 21
Benth. 2 3 0 0
Garrettia H.R. Fletcher 1 0 0
Holocheila (Kudô) S. Chow 1 1 1
Hymenopyramis Wall. ex Griffith 1 1 1
Ombrocharis Hand.-Mazz. 1 0 0
Peronema Jack 1 1 1
Petraeovitex Oliv. 8 1 1
Tectona L.f. 4 1 2

SPECIES SUBTOTAL 160 161 27 16.77%
GENERA SPECIES SUBTOTAL 10 6 60.00%

LAMIACEAE TOTAL SPECIES 6056 6124 1284 20.97%
LAMIACEAE TOTAL GENERA 234 168 71.79%

Table 2-2. General Time Reversible (GTR) and CAT approximation parameters estimated by Randomized Axelerated Maximum Likelihood (RAxML).

Shape parameter    Rate matrix                Base frequencies
alpha: 0.328396    rate A <-> C: 1.737188     freq pi(A): 0.268135
                   rate A <-> G: 1.989645     freq pi(C): 0.230598
                   rate A <-> T: 0.488569     freq pi(G): 0.234677
                   rate C <-> G: 0.844535     freq pi(T): 0.266590
                   rate C <-> T: 2.739690
                   rate G <-> T: 1.000000


Table 2-3. Divergence times estimated with treePL for selected clades of Lamiaceae (Smith and O’Meara 2012). Crown node age estimates, in millions of years ago (Ma), and the geologic era, period, epoch, and age corresponding to each date are indicated (see footnote 5). Crown node age estimates for the nine clades with high net diversification rates identified by MEDUSA (Alfaro et al. 2009) are also reported and denoted by an asterisk (*).

Columns: Era; Period; Epoch; Age (Geologic Time Scale); Date (Ma); Clade (Divergence Time Estimation)

Mesozoic Cretaceous Late Maastrichtian 72.00 Lamiaceae
Cenozoic Paleogene Paleocene Danian 62.21 MRCA Symphorematoideae and Viticoideae B
Cenozoic Paleogene Paleocene Danian 61.72 Ajugoideae
Cenozoic Paleogene Paleocene Thanetian 57.38 Peronema/Hymenopyramis/Petraeovitex I.S.
Cenozoic Paleogene Paleocene Thanetian 56.68 Nepetoideae
Cenozoic Paleogene Eocene Ypresian 55.92 Viticoideae A
Cenozoic Paleogene Eocene Ypresian 52.84 Ocimeae
Cenozoic Paleogene Eocene Ypresian 52.39 Lamioideae
Cenozoic Paleogene Eocene Ypresian 49.48 Mentheae
Cenozoic Paleogene Eocene Lutetian 47.31 Salviinae
Cenozoic Paleogene Eocene Lutetian 47.09 Scutellarioideae
Cenozoic Paleogene Eocene Lutetian 46.55 Nepetinae
Cenozoic Paleogene Eocene Lutetian 46.27 Prostantheroideae
Cenozoic Paleogene Eocene Bartonian 39.62 Prunellinae
Cenozoic Paleogene Eocene Bartonian 38.30 Haplostachys/Phyllostegia/Stachys/Stenogyne (Stachydeae)*
Cenozoic Paleogene Eocene Priabonian 37.44 Chloantheae
Cenozoic Paleogene Oligocene Rupelian 32.38 Viticoideae B
Cenozoic Paleogene Oligocene Rupelian 31.26 Menthinae
Cenozoic Paleogene Oligocene Chattian 26.89 Westringieae
Cenozoic Paleogene Oligocene Chattian 26.57 Sideritis (Stachydeae)
Cenozoic Paleogene Oligocene Chattian 24.13 Tectona I.S.
Cenozoic Neogene Miocene Aquitanian 18.80 Ocimum/Syncolostemon (Ocimeae)*
Cenozoic Neogene Miocene Burdigalian 18.14 Lycopinae
Cenozoic Neogene Miocene Burdigalian 17.65 Callicarpa I.S.

5 Geologic eras, periods, epochs, and ages correspond to the geologic time scale of Gradstein et al. (2012).


Table 2-3. Continued
Columns: Era; Period; Epoch; Age (Geologic Time Scale); Date (Ma); Clade (Divergence Time Estimation)

Cenozoic Neogene Miocene Burdigalian 16.25 Leucadeae
Cenozoic Neogene Miocene Serravallian 13.08 Dicrastylis, Cyanostegia (Prostantheroideae)
Cenozoic Neogene Miocene Tortonian 8.44 Clinopodium/Cunila/Hedeoma/Poliomintha (Menthinae)*


Figure 2-1. Lamiaceae species diversity sampled for the phylogenetic analysis. The histogram shows the estimated species diversity by recognized subfamily (red) and the proportion of total species-level diversity sampled for the analysis (blue). For convenience, percentages of total diversity sampled by taxon (rounded to the nearest whole number) are also indicated. Estimated species richness data were compiled from Harley et al. (2004; see also Table 2-1).


Figure 2-2. Proportions of recognized and sampled species-level diversity among the subfamilial clades of Lamiaceae. Figure A) shows the proportions of recognized species-level diversity (adapted from Harley et al. [2004]). Figure B) shows the proportions of species-level diversity sampled for this study. Count data are reported in Table 2-1.


Figure 2-3. Maximum likelihood (ML) phylogeny of 1,256 Lamiaceae species and nine outgroup families from Lamiales (-ln L = 163215.258232). The ML tree was inferred from a supermatrix of approximately 2.1 million nucleotides. Branch lengths are drawn to scale, and colors indicate Lamiaceae subfamilies sensu Harley et al. (2004): Ajugoideae (gold), Lamioideae (cyan), Nepetoideae (blue), Prostantheroideae (orange), Symphorematoideae (green), Scutellarioideae (red), Viticoideae (magenta), incertae sedis (black), and outgroups (lavender).


Figure 2-4. Summary tree showing relationships among major clades in Lamiaceae. Triangle sizes are proportional to clade size. I.S. = incertae sedis. Maximum likelihood BS support values (> 70%) are indicated above the branches.


Figure 2-5. Lineages through time (LTT) plot for Lamiaceae species. The LTT plot was calculated using 100 bootstrap trees inferred with RAxML, with ultrametric branch lengths from r8s (Sanderson 2003) and outgroup taxa removed. Each line represents a single ML bootstrap tree. The graph shows the pattern of diversification of the Lamiaceae taxa in the tree through time, as the tree grew from a single lineage at the root to the current sampling of 1,256 species.


Figure 2-6. Chronogram of the ML tree derived from the penalized likelihood analysis from treePL. Major clades within Lamiaceae are shown as cartoons (I.S. = incertae sedis), and estimated divergence dates are shown at crown nodes. A time scale is shown at bottom; units are Ma (million years ago). The nine clades with the highest rates of net diversification identified by a MEDUSA analysis are indicated with numbers and red font.


Figure 2-7. Four family-wide phylogenetic hypotheses for Lamiaceae. Summary trees shown here are from Bendiksby et al. 2011 (A), Li et al. 2012 (B), Chen et al. 2014 (C), and this study (D). MP and ML bootstrap support values are indicated below and above the branches, respectively (Figure A = jackknife range values).


Object 2-1. Species names and authorities and DNA sequences used for phylogenetic analysis. Accession numbers are provided for all data downloaded from GenBank by matrix partition; unpublished sequences are also indicated. Missing data are indicated by empty cells. (.xlsx 86 KB)


Object 2-2. Optimal tree topology identified by RAxML. Bootstrap support values are indicated at the nodes. (.pdf 1.8 MB)


Object 2-3. Chronogram showing results from divergence time estimation and diversification analyses. (.pdf 131 KB)


CHAPTER 3 MAKING NEXT-GENERATION SEQUENCING WORK FOR YOU: APPROACHES AND PRACTICAL CONSIDERATIONS FOR MARKER DEVELOPMENT AND PHYLOGENETICS

Introduction

†We have witnessed a genomic revolution recently, fuelled by the emergence of next-generation sequencing (NGS) technologies (see definitions in Glenn 2011). The cost of a human genome sequence has decreased from $2.7 billion to under $10,000, and the capabilities of each new sequencing platform are astounding (Davies 2010; see http://www.genome.gov/sequencingcosts/). The rapid responses by manufacturers and advancements in NGS technology development have been, for the most part, motivated by the needs of the biomedical research community. However, NGS is beginning to make far-reaching impacts in other research areas and offers exciting benefits to many biological disciplines, including plant-related fields (as is evident from the collection of papers in this Special Feature; see also American Journal of Botany vol. 99) such as systematics (Metzker 2010; Harrison and Kidner 2011; Straub et al. 2011; Lee et al. 2011), genetics (Buggs et al. 2012a), co-evolution (Young et al. 2011), ecology (Ekblom and Galindo 2010), and conservation genetics (Kohn et al. 2006; Allendorf et al. 2010).

As the field of plant systematics progresses toward a more comprehensive understanding of plant evolution and phylogeny, the importance of resolving lower-level taxonomic relationships is becoming apparent. For example, we are increasingly confident in the relationships among most major plant lineages, especially within the angiosperms (reviewed in Soltis et al. 2009; e.g. Moore et al. 2010, 2011; Soltis et al. 2011). Our confidence in relationships decreases, however, as we begin to shift our focus from deeper-level questions towards the tips of the plant phylogeny. Commonly used phylogenetic markers, in many cases, appear insufficient to resolve relationships among many closely related, recently diverged and rapidly evolving lineages.

† Reprinted with permission from Taylor & Francis Group. Original publication: Godden, G. T., I. E. Jordon-Thaden, S. Chamala, A. A. Crowl, N. García, C. C. Germain-Aubrey, J. M. Heaney, M. Latvis, X. Qi, and M. A. Gitzendanner, 2012. Making next-generation sequencing work for you: approaches and practical considerations for marker development and phylogenetics. Plant Ecology and Diversity 5:427-450. Online access: http://www.tandfonline.com/doi/full/10.1080/17550874.2012.745909#

Plant molecular phylogenetics, since its inception in the early 1980s (reviewed by Soltis et al. 2009), has allowed significant progress in systematics. Chloroplast DNA was introduced early on as a promising tool (Palmer and Zamir 1982) and has since reigned as the ‘workhorse’ of plant phylogenetics (e.g. Chase et al. 1993; reviewed in Soltis et al. 2009). In fact, since the first chloroplast genome was sequenced for Nicotiana tabacum L. (Shinozaki et al. 1986), many studies have identified structural and point mutations in the chloroplast genome that are useful for population-level investigations (e.g. Cronn et al. 2008; Parks et al. 2009; Kane et al. 2012), analyses of deeper evolutionary divergences (e.g. Moore et al. 2010; Jansen and Ruhlman 2012), and phylogeographic studies (McCormack et al. 2012).

The conservative nature of the chloroplast and the orthology of its genes make it ideal for phylogenetic studies and facilitate the use of universal primers that amplify efficiently across much of plant diversity (Graham and Olmstead 2000; Shaw et al. 2005, 2007). However, the use of chloroplast sequence data also has its limitations. The relatively low rates of molecular evolution in the chloroplast genome often necessitate use of large amounts of sequence data to obtain statistically robust and well-resolved trees (Small et al. 1998; Cronn et al. 2002; Shaw et al. 2005, 2007; Moore et al. 2007, 2010; Jian et al. 2008; Givnish et al. 2010), particularly at shallow levels. Whole chloroplast genomes or additional sequence data from the nuclear genome may help to resolve these relationships.

Nuclear genes generally exhibit faster rates of molecular evolution than organellar genes, especially in their introns, and provide more sequence variation at shallow phylogenetic levels. Single-copy nuclear genes are preferred over plastid and nuclear-ribosomal genes because they avoid problems associated with complete linkage, uniparental inheritance and concerted evolution (Alvarez and Wendel 2003; Linder and Rieseberg 2004; Small et al. 2004). Single-copy nuclear (SCN) sequences are more likely to have conserved homologues from each contributing genome than repeated sequences (Barakat et al. 1997; Sang 2002; Duarte et al. 2010), and are useful for investigations of hybridisation and introgression events (Sang 2002; Egan et al. 2012). Moreover, SCNs are useful for understanding patterns of inheritance in polyploids (Small et al. 2004; Clarkson et al. 2010).

At shallow levels, processes such as incomplete lineage sorting complicate phylogenetic inference, and individual gene histories may differ from the true species history (Maddison and Knowles 2006; Liu and Pearl 2007; Liu et al. 2008). Phylogenetic analyses, in these cases, would benefit from approaches that use multiple unlinked loci and alternative methods of inference, such as those based on multi-species coalescence (reviewed in Degnan and Rosenberg 2009). However, much of the research employing nuclear markers in phylogenetics has been limited to the ribosomal cistron (e.g. the slowly evolving 18S and 26S rDNA, and the faster spacers ITS and ETS) or a handful of low-copy genes, such as alcohol dehydrogenase (Adh), meristem identity (LEAFY), and granule-bound starch synthase (GBSSI or waxy) genes (Sang 2002; Calonje et al. 2008).

There has been great interest in the development of single-copy nuclear genes for phylogenetic applications (e.g. Sang 2002; Mort and Crawford 2004; Small et al. 2004; Hughes et al. 2006). The processes of identifying and developing nuclear markers present major challenges. First, traditional methods of primer development and sequencing of large numbers of loci are time consuming, labour intensive, and expensive for large-scale projects. Second, the identification of orthologous loci (i.e. loci inherited via a speciation event) is crucial for phylogenetics (Fitch 1970). Inadvertently including some orthologous and some paralogous loci (i.e. loci inherited following a gene or genome duplication event, which do not reflect the history of speciation) in an analysis can disrupt the phylogenetic signal within an aligned matrix and diminish the resolution of phylogenetic relationships. When all loci are sampled, paralogous copies can easily be identified with phylogenetic analyses. An issue arises, however, when some taxa have lost one or more copies or when these copies are not amplified and sequenced.

NGS approaches are having a profound impact on the development of new markers for phylogenetics, allowing for relatively inexpensive and rapid acquisition of data from organellar (e.g. Moore et al. 2006, 2010; Parks et al. 2009; Givnish et al. 2010; Griffin et al. 2011) and nuclear genomes (e.g. Griffin et al. 2011; Lee et al. 2011; Straub et al. 2011). The goal of this paper is to guide researchers of plant systematics through the process of designing NGS experimental approaches for new marker development, including mining and use of existing NGS data sets and new data acquisition on NGS platforms. We also review several targeted sequencing approaches and discuss useful pipelines and tools for post-NGS processing, orthology prediction, and primer development. Our discussion is by no means comprehensive, and many important considerations are detailed in other recent reviews (e.g. Glenn 2011; Cronn et al. 2012; Egan et al. 2012). However, we cover many topics related to large-scale marker development and discuss them with ‘typical’ systematics researchers in mind; that is, we carefully consider practical and cost-related issues that may be of concern to many laboratories in evolutionary biology.

NGS and New Markers for Phylogenetic Studies

Organellar Genomes and Gene Space as Markers

Recent advances in NGS have enabled phylogenetic analysis of whole chloroplast genomes across a wide taxonomic breadth (e.g. Moore et al. 2010; Steele et al. 2012). Initial NGS-based approaches, such as those of Moore et al. (2006, 2007), involved sequencing of chloroplast-enriched samples following laborious chloroplast isolation protocols. These protocols limited sequencing to taxa for which large quantities of fresh tissue were readily obtainable (Jansen et al. 2005). However, standard genomic DNA or total RNA extracts contain sufficient chloroplast sequence, and these data can typically be skimmed out of genomic or transcriptomic NGS runs by using current technologies.

Even though a number of studies have employed mitochondrial DNA sequence data at deep levels in green plants (e.g. Qiu et al. 2005, 2010; Davis et al. 2007; Soltis et al. 2011), plant mitochondrial genes have been less widely used in phylogenetic studies than chloroplast or nuclear genes because of evidence for low nucleotide mutation rates, RNA editing, horizontal gene transfer, and high rates of structural rearrangement within the mitochondrial genome (Palmer and Herbon 1988; Bowe and dePamphilis 1996; Bergthorsson et al. 2003; Davis and Wurdack 2004; Davis et al. 2005; Xue et al. 2010; Davila et al. 2011). Unfortunately, the assembly of mitochondrial genomes from NGS data is not a fruitful goal due to variations in genome structure, which sometimes exist within individuals (Woloszynska 2010). However, the mitochondrial gene space (i.e. the extended space including and surrounding genic regions, including both coding and intergenic non-coding regions closely associated with each gene) may prove useful in phylogenetic applications. It is already possible to obtain most mitochondrial genes from NGS runs with modest investment. In fact, in many cases it simply involves additional processing in an informatics pipeline (e.g. Straub et al. 2012). Analyses of new NGS data may provide additional insights into the overall phylogenetic utility of mitochondrial gene space and may promote more frequent use of these markers in the future.

Nuclear Markers

NGS technologies afford the ability to explore the wealth of the nuclear genome more rapidly than ever before. Several recent studies have made use of existing genomic data for the development of new nuclear phylogenetic markers. For example, Duarte et al. (2010) identified a set of 959 single-copy nuclear genes by comparing complete genome sequences from Arabidopsis, Populus, Vitis, and Oryza, followed by mining of genes from the public EST database (TIGR Plant Transcript Assemblies: http://plantta.jcvi.org/) and characterisation of phylogenetic utility. In another recent study, Lee et al. (2011) obtained 22,833 sets of orthologs from 101 genera of land plants by using a similar strategy: they compared five completely sequenced plant genomes and then isolated unigenes from the TIGR Plant Transcript Assemblies to fill in the remaining taxon sampling. By leveraging available data from existing NGS projects, both studies were able to identify orthologs and produce a list of candidate genes for use in phylogenetic analyses.
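The core filtering step behind such genome comparisons can be illustrated with a minimal sketch. It assumes that genes have already been clustered into families and that copy numbers per genome are known; the function name, family identifiers, and counts are hypothetical, and the published studies used considerably more elaborate procedures.

```python
from typing import Dict, List


def shared_single_copy(families: Dict[str, Dict[str, int]],
                       genomes: List[str]) -> List[str]:
    """Return IDs of gene families with exactly one copy in every listed genome."""
    return sorted(
        family for family, counts in families.items()
        if all(counts.get(genome, 0) == 1 for genome in genomes)
    )


# Hypothetical copy-number table (family -> genome -> copies); not real data.
example = {
    "fam001": {"Arabidopsis": 1, "Populus": 1, "Vitis": 1, "Oryza": 1},
    "fam002": {"Arabidopsis": 2, "Populus": 1, "Vitis": 1, "Oryza": 1},  # duplicated
    "fam003": {"Arabidopsis": 1, "Populus": 1, "Vitis": 1},              # absent in Oryza
}
genomes = ["Arabidopsis", "Populus", "Vitis", "Oryza"]

print(shared_single_copy(example, genomes))  # prints ['fam001']
```

Families that are duplicated in, or missing from, any reference genome are excluded, which is why only the first family passes the filter here.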

Very recently, researchers have started to explore use of NGS approaches to acquire new data for marker development and phylogenetics. For example, a recent study by Straub et al. (2011) used the Illumina platform (Illumina Inc., San Diego, CA) to sequence the genome of Asclepias syriaca L. (Apocynaceae). Their approach yielded a whole chloroplast genome, the rDNA cistron, and a partial mitochondrial genome, and facilitated characterisation of the nuclear genome; 25 microsatellite loci and 27 informative low-copy nuclear genes in total were developed using the approach. In a subsequent study, an approach known as genome skimming was used to acquire new NGS data from six additional species of Asclepias, which were aligned to the reference genome of A. syriaca (Straub et al. 2012). In genome skimming, low-coverage sequence data are collected and processed to find regions of interest (e.g. chloroplast genome, various nuclear loci) that are sequenced by chance. Genome skimming provides an alternative to other approaches that attempt to either generate full genomic coverage or sequence targeted regions of the genome through some form of selection. In this case, the approach yielded nearly 160,000 bp of sequence data from all three genomes and demonstrated the efficiency of NGS approaches for phylogenetic applications.

Cost Considerations for NGS Projects

Advancements in NGS technologies are making the acquisition of larger data sets by individual researchers possible, and at increasingly lower costs (reviewed in Harrison and Kidner 2011). Currently, megabases of data from plant organellar and nuclear genomes can be acquired for multiple samples in a single NGS run, such as a single lane of Illumina (Illumina Inc., San Diego, CA) (Steele and Pires 2011; Straub et al. 2011, 2012; Steele et al. 2012). NGS data are either employed as primary data in phylogenetic analyses or used as a tool for the discovery and development of new phylogenetic markers. New markers developed from these primary data can be subsequently targeted and sequenced for a group of interest on an NGS platform. With regard to the latter approach for secondary data acquisition, many exciting new technologies for targeted sequencing have recently become available and can offer considerable savings of cost and time for systematics researchers.

With the increasing use of NGS, many researchers are discovering a growing number of data sets from other projects that can be used for their research, mitigating potential costs of generating primary sequence data. For example, 23,795,023 expressed sequence tags (ESTs) for Embryophyta were available in NCBI as of October 2012 (http://www.ncbi.nlm.nih.gov/), and new transcriptomic and genomic data are becoming available at increasingly rapid rates (e.g. 476 genome sequences for Embryophyta were available as of October 2012). NGS data collected for microsatellite development or comparative transcriptome projects are also likely to contain chloroplast genomic data that can be used either to add a taxon to a phylogenetic analysis or to design primer sequences for clade-specific amplification of particular genes (see Zalapa et al. 2012).

Two large sets of NGS data that are now available derive from the transcriptomes produced by the 1KP (www.onekp.com) and BigPlant (nypg.bio.nyu.edu/bp/) projects. The 1KP project, in particular, is producing approximately 2 Gbp of Illumina paired-end transcriptomic sequence data per sample for 1,000 species representing much of green plant diversity. The project is a collaborative effort among multiple international labs, and sequencing is being carried out at BGI (Shenzhen, Guangdong, ) using Illumina’s GA IIx and HiSeq platforms (Illumina Inc., San Diego, California, USA). The first set of 1KP data has already been released to the public, and new data will continue to be released as they are generated.

Additionally, the Brassicales Map Alignment Project (BMAP) is a comparative genomics project with applications for systematics. The project aims to sequence 150 genomes of representative species distributed across the Brassicales phylogeny (see http://www.brassica.info/resource/sequencing/bmap.php). Currently, the United States Department of Energy’s Joint Genome Institute (DOE-JGI) is completing genomic and transcriptomic sequencing for the first 20 species, which are scheduled for completion by 2013 (R. Wing et al. pers. comm.).

When choosing among different approaches for NGS data acquisition, several factors, often interrelated, are worthy of consideration (see Glenn 2011: Table 2 for cost comparisons per base). For many phylogenetic applications, NGS remains somewhat costly per sample. Sequencing projects are best designed in a way that minimises wasted data (i.e. unused or unusable NGS reads that are not of interest or are uninformative), facilitates capture of useful data from a large number of samples, and takes full advantage of NGS instrumentation (Baker 2010). Approaches that reduce wasted data maximise efficiency and cost-savings both before and after sequencing as well as facilitate successful acquisition of target data on NGS platforms. Multiplexing samples in an NGS run, for example, reduces overall costs on a per-sample basis because the expenses associated with processing and sequencing are divided among the total number of samples (Glenn 2011; Cronn et al. 2012). The goal may be to acquire large quantities of sequence data for primary use or to identify useful markers for targeted enrichment and sequencing in subsequent NGS runs. In either case, multiplexed NGS offers considerable savings over traditional PCR and Sanger sequencing approaches. See Cronn et al. (2012) for additional discussion on multiplexing and NGS.
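The arithmetic behind the per-sample savings from multiplexing can be sketched as follows. This is a simplified illustration that assumes the run cost is shared equally among samples while library preparation is paid once per sample; the function name and all monetary figures are hypothetical and are not taken from the cited sources.

```python
def cost_per_sample(run_cost: float, library_cost: float, n_samples: int) -> float:
    """Per-sample cost when one run is shared equally by n_samples multiplexed libraries.

    run_cost     -- total cost of the sequencing run or lane (hypothetical value)
    library_cost -- library preparation cost for a single sample (hypothetical value)
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return run_cost / n_samples + library_cost


# Hypothetical figures: a 2400-unit lane and 100 units of library preparation per sample.
for n in (1, 6, 12):
    print(n, cost_per_sample(2400.0, 100.0, n))
```

With these illustrative numbers the shared run cost dominates at low multiplexing levels, whereas the fixed per-sample library cost sets a floor on savings as more samples are pooled.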

In the following sections, we highlight several factors to help researchers in choosing among various NGS experimental design pathways applicable to phylogenetic studies. We review methods and tools for the mining and assembly of organellar gene space as well as those for identifying new orthologous nuclear markers, primer design, and targeted NGS for any taxonomic group of interest. These methods and tools are applicable to both newly acquired and existing NGS data sets and, therefore, are not presented within the context of a particular experimental pathway. Useful methods for NGS library preparation and the advantages and disadvantages of various NGS platforms, while important to any NGS project design, are beyond the scope of this paper and are discussed in detail elsewhere (see Glenn 2011; McCormack et al. 2012; Egan et al. 2012).

Experimental Pathways for NGS Phylogenetics

As with any traditional approach in molecular systematics, a primary factor that should be determined or estimated a priori involves the quantity and type of data necessary to resolve a particular systematic question or phylogenetic challenge (Figure 3-1). For example, deeper-level phylogenetic analyses may benefit from the addition of either transcriptomic or organellar genomic sequence data from representative samples across a group of interest (e.g. 454: Moore et al. 2006, 2007, 2010; Illumina: Cronn et al. 2008, Whittall et al. 2010; transcriptomes: Hittinger et al. 2010, Timme et al. 2012, V. Maia et al., Universidade do Rio de Janeiro, in prep; B. Ruhfel et al. University of Florida, pers. comm.). On the other hand, it may be better to approach analyses of shallow-level relationships or rapidly evolving groups by using large numbers of nuclear markers (Heled and Drummond 2010; Straub et al. 2011, 2012) and multiple individuals per species (Heled and Drummond 2010). Identification of the type(s) of data required may seem obvious, but its importance should not be underestimated. It may be tempting to sequence vast amounts of genomic data simply because it is possible, but specificity in terms of data needs can help to identify the most economical and time-saving approaches, both for acquisition and downstream processing of NGS data for the research question at hand.

Transcriptomes vs. Genomes

Phylogenetic markers can be developed from both transcriptomic and genomic data. However, there may be a number of trade-offs to consider when choosing among data types for marker development. The choice is largely dependent on the research goals and budget for a particular project, the availability of existing data resources, and the unique challenges of developing phylogenetically informative markers from each type of data. It is difficult to make a blanket recommendation in favour of a single approach; every project is unique. We can, however, provide a few points of comparison for advanced consideration.

Transcriptomic sequencing (see Strickler et al. 2012) can provide a cost-effective means of acquiring new data for phylogenetic marker development. With multiplexed NGS, there are several trade-offs that involve factors such as genome size, number of samples per run, and the desired depth of coverage per sample. Transcriptomic sequencing can offer an advantage over genome skimming with regard to all three of these factors because the sequencing effort is focused on the expressed portion of the genome. The transcriptome is a much-reduced fraction of the whole genome, and, therefore, more samples can be multiplexed with high coverage per sample in an NGS run (e.g. up to 12 taxa per Illumina lane; Gane Ka-Shu Wong, BGI, pers. comm.).

Genomic sequencing, on the other hand, may prove less economical; genome sizes will dictate the amount of NGS necessary to provide the desired coverage depth across all samples. For instance, large genome sizes likely will impose limitations on the number of samples that can be multiplexed with high coverage per NGS run (e.g. see Kelly et al., this issue). Thus, costs for primary NGS data acquisition could be substantially more for genome skimming approaches than for transcriptomic approaches. We encourage researchers to carefully consider their data needs for marker development and contemplate alternative uses for the data. Transcriptome data, for example, may be employed in gene expression studies. With careful planning, researchers may be able to gain additional value from their investment. It may even be possible to share costs with other researchers who are interested in using the data in different ways.
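The relationship between target size, multiplexing level, and coverage can be made concrete with a rough calculation. The sketch below assumes that lane output is divided evenly among samples and that every read maps to the target, which ignores uneven expression, organellar reads, and duplicates; the function names and the yield and size figures are hypothetical.

```python
def expected_coverage(lane_yield_bp: float, n_samples: int, target_size_bp: float) -> float:
    """Mean depth per sample if lane output is split evenly and all reads hit the target."""
    return lane_yield_bp / n_samples / target_size_bp


def max_samples(lane_yield_bp: float, target_size_bp: float, min_coverage: float) -> int:
    """Largest number of samples per lane that still reaches min_coverage."""
    return int(lane_yield_bp // (target_size_bp * min_coverage))


# Hypothetical figures: a lane yielding 30 Gbp, a 1 Gbp and a 10 Gbp genome.
LANE_YIELD = 30_000_000_000
print(max_samples(LANE_YIELD, 1_000_000_000, 1))    # samples per lane at 1x, 1 Gbp genome
print(max_samples(LANE_YIELD, 10_000_000_000, 1))   # samples per lane at 1x, 10 Gbp genome
print(expected_coverage(LANE_YIELD, 12, 1_000_000_000))  # depth with 12 samples per lane
```

Under these assumptions a tenfold increase in genome size reduces the number of samples per lane tenfold at the same depth, which is the constraint on genome skimming described above.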

A second issue worthy of consideration involves post-data processing and the ability to discover phylogenetically useful markers for primer development and sequencing. Transcriptomic approaches to marker development, in particular, may prove risky for lower-level phylogenetic studies; methods for developing phylogenetic markers from transcriptomic data remain, for now, largely unexplored.


Transcriptomic data, while economical from an NGS cost perspective, may pose a number of downstream challenges for marker development. Firstly, tissue selection, library preparation, and the sequencing approach chosen by the researcher can have serious positive or negative outcomes on the overall utility of the data. For example, transcriptomic data derived from different plant tissues may be less comparable; transcriptomes often differ in their size and uniqueness, with differing proportions of selectively expressed and enriched genes in different plant tissues (Pina et al. 2005). Secondly, designing primers from transcriptomic data is not a straightforward process (see marker development). Transcriptomic data are derived from mRNA and, therefore, do not include the more desirable intronic regions that are useful for lower-level studies. It is also challenging to identify intron-exon boundaries from transcriptomic data (e.g. see Ward et al. 2012) for primer development; primers must be anchored to the conserved elements of transcribed gene space to facilitate capture of informative data adjacent to these less-variable coding regions. Consequently, phylogenetic utility cannot be assessed a priori and a larger number of markers will need to be sequenced in order to ensure capture of informative data.

Genomic data, in contrast, may represent a more ideal medium for developing variable markers for phylogenetic analyses. Genomic data provide access to both coding and variable non-coding sequences and, thus, primers can be easily designed from primary NGS data to capture phylogenetically informative markers with targeted sequencing approaches.

Pathways for Phylogenetic Marker Development

Various experimental pathways for generating new NGS data for phylogenetic analyses are currently available, and each has its advantages and disadvantages, particularly in terms of financial and time investment (Figure 3-1). When choosing among possible designs for an NGS project, it may be necessary to recombine these pathways in ways that maximise productivity and minimise total costs for NGS data acquisition. The pathways described here are merely recommendations, and researchers can adapt them to better suit their study systems and individual needs. It is worth noting, however, that NGS technologies are rapidly evolving, and a re-evaluation of the various options is always advisable.

One of the first considerations for researchers is whether existing NGS data resources (e.g. BGI, EMBL-EBI, NCBI, JGI GOLD, Phytozome) are available for a group of interest and can be used as primary data for new nuclear marker development (Figure 3-1: right side). These data are especially useful when existing taxon sampling is adequate, and they enable identification of phylogenetically useful markers within a particular study group. In other words, existing NGS data must be available for a representative subset of taxa (or close relatives) spanning the full phylogenetic scale of the study. Where these NGS data exist, they can be used as primary data for nuclear marker development followed by secondary data acquisition (i.e. targeted sequencing) for an entire taxonomic group.

Conversely, when there are no existing NGS data for a group of interest or the taxon sampling in existing data sets is inadequate, it is necessary to acquire new primary data (Figure 3-1: middle). An efficient pathway for acquiring these new NGS data for nuclear marker development involves the use of either transcriptome sequencing or multiplexed, low-coverage genome skimming approaches with a subset of taxa. As previously described, these primary NGS data can be used to develop new nuclear markers, followed by secondary NGS data acquisition of targeted markers with expanded taxon sampling.

The collection of organellar genomic data, in contrast, necessitates NGS on a larger scale because the data will be used as primary data in the phylogenetic analysis (Figure 3-1: left side). If both chloroplast and mitochondrial data are desired, standard genomic library preparation and multiplexed, low-coverage genome skimming on NGS platforms are adequate for studies involving smaller numbers of sampled taxa, for example 50 or fewer samples (e.g. Cronn et al. 2008; Parks et al. 2009; Straub et al. 2011). It will be possible to mine and assemble organellar gene space from these low-coverage NGS runs with bioinformatic pipelines. Alternatively, many new methods for organellar sequence capture and other targeted sequencing approaches are available and facilitate sequencing of larger numbers of multiplexed samples per NGS run than genome skimming pathways. In cases where nuclear and organellar genomic data are required for an analysis, targeted sequences from all three genomes can be sequenced together on NGS platforms. These methods and approaches allow for reductions in wasted data and take full advantage of NGS instrumentation capabilities in a way that maximises cost savings.

Marker Development Pipeline for Targeted Sequencing

Once transcriptomic or genomic data have been identified from existing resources or acquired using NGS, the next step in the experimental pathway is to develop orthologous nuclear markers with optimum utility, that is, sufficient levels of sequence variation to resolve a phylogenetic problem of interest (Figure 3-2). The following steps are purely bioinformatic and include workflows for data mining and primer development. We briefly outline each of the workflow components comprising a marker development pipeline within this section. Specific tools for NGS data processing and marker development are discussed in more detail later.

Marker development, as it pertains to nuclear genomic or transcriptomic data, is essentially a two-step process involving primary and secondary NGS data (Figure 3-2: top). First, primary NGS data from a subset of taxa are mined to identify orthologous markers and to design PCR primers for use in secondary NGS data acquisition pathways—that is, targeted sequencing. Next, secondary NGS data from all taxa are mined and the orthology of newly acquired nuclear markers is tested with full taxon sampling (Figure 3-2: bottom).

In contrast, organellar genomic data generally represent a one-step process that involves mining and assembly of organellar gene space from primary NGS data. For example, chloroplast or mitochondrial genes can be mined from full genomic or transcriptomic reads and assembled (e.g. Parks et al. 2009; Straub et al. 2011) or simply processed and assembled from targeted NGS runs. Limited orthology screening is needed to ensure that pseudogenes are not included; both chloroplast and mitochondrial genes are frequently translocated between organellar genomes and the nuclear genome (e.g. Timmis et al. 2004). Additionally, many of the ribosomal RNA genes have high sequence similarity and are found in all three genomes; care is needed to prevent accidental cross-assembly or mislabelling of these genes. In general, however, processing of organellar genomic data is relatively straightforward and the value of these markers cannot be overstated (Steele and Pires 2011).

Data Mining

As presented here in the context of phylogenetic marker development, data mining refers to the direct handling and sorting of assembled sequences as well as the combined processes of gene clustering, multiple sequence alignment, and tree-based orthology testing (phylogenetic analysis). The latter processes are iterative and ensure confidence in orthology (Figure 3-3). Once orthologs are predicted from primary NGS data, researchers can assess phylogenetic utility among putative markers with pair-wise comparisons of sequence divergence, curate and annotate genes (optional), and continue with bioinformatic pipelines for primer development.
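The pair-wise comparison of sequence divergence mentioned above can be illustrated with a minimal sketch. It assumes that sequences for one putative ortholog are already aligned, and it uses an uncorrected p-distance (the proportion of differing sites) as the divergence measure; the taxon names, sequences, and function names are hypothetical and are not taken from any published pipeline.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differences += 1
    return differences / compared if compared else 0.0


def mean_pairwise_divergence(alignment):
    """Average p-distance over all pairs of sequences in an alignment."""
    pairs = list(combinations(alignment.values(), 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical aligned sequences for a single putative marker
alignment = {
    "taxon_A": "ATGCCGTA-ACT",
    "taxon_B": "ATGCTGTAGACT",
    "taxon_C": "ATGATGTAGACA",
}

for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
    print(name_a, name_b, round(p_distance(seq_a, seq_b), 3))
print("mean divergence:", round(mean_pairwise_divergence(alignment), 3))
```

Ranking candidate markers by a summary such as the mean divergence is only one possible criterion; model-corrected distances or counts of parsimony-informative sites could be substituted without changing the overall workflow.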

While this workflow may sound simple and straightforward, anyone who has worked with NGS data and phylogenetic analyses realises the complexity and time investment involved. For example, a recent study by Straub et al. (2012) realistically illustrated the extensive amount of data mining that was required for new marker development; the authors cited numerous Perl scripts developed in-house as part of their marker development pipeline. It is becoming increasingly common for researchers to hire or collaborate with bioinformaticians to assist with data mining for biological research. However, many graduate students and post-doctoral researchers are beginning to acquire bioinformatic skills (see Zauhar 2001; Haddock and Dunn 2011; Lewitter and Bourne 2011). These individuals may help ease the transition from the traditional PCR and Sanger sequencing approaches used in many laboratories to new NGS approaches.

Tests for Orthology

The nuclear genome offers thousands of protein-encoding genes for potential use in phylogenetic reconstruction. However, most genes are members of multi-gene families (e.g. Clegg et al. 1997; The Arabidopsis Genome Initiative 2000; Goff et al. 2002; Yuan et al. 2009), and mistakenly accepting paralogs as orthologs poses a problem for phylogenetic research. With Sanger sequence data, paralogs were often more readily identified following PCR with a given primer pair: double peaks in the chromatogram were indicative of multiple copies, which could be confirmed with cloned sequence data and phylogenetic analyses. It should be noted, however, that many paralogs undoubtedly went undetected and were inadvertently included in data sets. In contrast, NGS is conducted over a single strand of DNA, and paralogy cannot be detected until after the reads are assembled. Signals of paralogy in NGS data include presence of more than two bases (alleles) at a putative locus, high levels of observed heterozygosity at a given position in an assembly, or excessive variation within a single locus observed in the raw data (reviewed in McCormack et al. 2012). With NGS data, bioinformatic pipelines automate the process of tree-based orthology detection.
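One of the paralogy signals listed above, the presence of more than two bases at a putative locus, can be screened for with a short sketch. It assumes a diploid organism and that reads have already been aligned so that each string holds the bases observed at one site; the support threshold `min_count` and the example data are hypothetical choices for illustration only.

```python
from collections import Counter


def sites_with_excess_alleles(read_columns, min_count=2, max_alleles=2):
    """Return indices of sites with more well-supported bases than expected.

    read_columns: one string per site, holding the base each read shows there.
    min_count:    minimum number of reads for a base to count as supported
                  (a crude guard against sequencing errors).
    max_alleles:  number of alleles expected at a single-copy locus
                  (2 for a diploid).
    """
    flagged = []
    for index, column in enumerate(read_columns):
        counts = Counter(base for base in column.upper() if base in "ACGT")
        supported = [base for base, n in counts.items() if n >= min_count]
        if len(supported) > max_alleles:
            flagged.append(index)
    return flagged


# Hypothetical read columns for five sites of one assembled locus
columns = ["AAAAAAAA", "AAAACCCC", "AACCGGTT", "AAAAAAAC", "GGGTTTCC"]
print(sites_with_excess_alleles(columns))  # [2, 4]
```

A screen of this kind only highlights candidate loci; tree-based orthology tests remain necessary to confirm or reject paralogy.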

Primer Development for Targeted NGS

Several methodologies and software to determine oligonucleotide properties for primer design have been reviewed elsewhere (Dieffenbach et al. 1993, Abd-Elsalam 2003, Chavali et al. 2005). Therefore, we briefly mention only some aspects of primer design that are relevant for nuclear marker development based on NGS data. We also introduce a useful software tool that can assist researchers with primer design and discuss issues related to large-scale primer testing.

An important step in the primer development process is to identify exon/intron boundaries in nuclear genes. To achieve good cross-amplification for interspecific or intergeneric studies, primers should ideally be located in conserved regions within exons flanking the target intron(s) (Creer 2007). This primer design strategy, known as exon-primed intron-crossing (EPIC; Palumbi 1998), has been extensively used to amplify non-coding regions in animals (e.g. Slade et al. 1994, Friesen et al. 1999, Hassan et al. 2002, Touriya et al. 2003) and plants (e.g. Ishikawa et al. 2002, Cronn et al. 2002, Shaw et al. 2005). The putative locations of introns can be assessed by comparison with well-annotated genomes. If possible, primer sites should be chosen from sections of the exons that are highly similar to the references. It is also important that priming sites are without mismatches at the 3’ end of the oligonucleotide to achieve optimal annealing potential. Primers developed from transcriptome sequences, on the other hand, should be designed to amplify the flanking regions of the conserved elements to facilitate capture of non-coding regions with phylogenetic utility.
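The requirement that priming sites be free of mismatches at the 3’ end can be checked bioinformatically before primers are ordered. The following sketch assumes that the primer and each candidate priming site are written 5’ to 3’ on the same strand and have equal length; the window of five bases, the function names, and the sequences are hypothetical illustrations.

```python
def count_mismatches(primer, site):
    """Number of mismatched positions between a primer and a priming site."""
    if len(primer) != len(site):
        raise ValueError("primer and priming site must have the same length")
    return sum(1 for p, s in zip(primer.upper(), site.upper()) if p != s)


def three_prime_mismatches(primer, site, window=5):
    """Mismatches within the last `window` bases, i.e. the 3' end of the primer."""
    return count_mismatches(primer[-window:], site[-window:])


primer = "ACGTTGCAAGCTTGACCA"

# Hypothetical priming sites from two taxa, same strand and orientation
site_taxon_1 = "ACGTAGCAAGCTTGACCA"  # one mismatch near the 5' end
site_taxon_2 = "ACGTTGCAAGCTTGACTA"  # one mismatch near the 3' end

for name, site in [("taxon_1", site_taxon_1), ("taxon_2", site_taxon_2)]:
    print(name, count_mismatches(primer, site), three_prime_mismatches(primer, site))
```

In this example both sites differ from the primer at a single position, but only the second would be rejected under a rule that forbids 3’ mismatches.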

Researchers will find there are many software tools to assist with the process of primer design. Primer-BLAST (http://www.ncbi.nlm.nih.gov/tools/primer-blast/; Ye et al. 2012) is an example of a promising new tool for the development of target-specific primers, accommodating the design parameters outlined above. By combining the widely used program Primer3 (Rozen and Skaletsky 2000) with a BLAST search (using local alignment) and a global alignment algorithm, the program facilitates a full primer-target alignment and detects targets with a high number of mismatches to the primer. Primer-BLAST is currently the only program that allows the user to specify the number and location of required mismatches, providing researchers with greater flexibility in primer stringency. It also has the capability to place primers at exon-intron boundaries and SNP locations and is the only tool that will allow placement on different exons (i.e. to span an intron) (Ye et al. 2012). Primer-BLAST results are summarised graphically and through detailed primer reports that show alignments between primer pairs and targets. For these reasons, Primer-BLAST is an excellent tool for NGS projects.

Once new primers are designed, researchers typically proceed with primer testing. However, traditional tests for amplification are not a realistic option with NGS approaches. At the scale of potentially hundreds or thousands of primers amplified across many taxa, tests for amplification with traditional PCR methods remain extraordinarily time consuming and, therefore, highly impractical. Once primers have been identified bioinformatically, it may be more economical to continue with high-throughput amplicon (PCR) approaches for targeted sequencing. Some of these commercially available technologies are especially useful with regard to monitoring of amplification success, since digital PCR quantitation capabilities are integrated into the instrumentation, facilitating accurate quantitation of amplified DNA molecules and absolute concentrations of DNA libraries. However, as a precaution, it may be worthwhile to synthesise oligonucleotide primers for a randomly selected subset of the bioinformatically designed primers. These test primers can be used to evaluate amplification success using a subset of taxa, and will help to detect potential bioinformatic errors prior to investing in a complete set of barcoded primers for high-throughput amplicon approaches.

When developing primers for targeted NGS across many nuclear loci, it is worth considering that high-throughput PCR applies identical annealing temperatures to all reactions, so primers should be designed to perform well at a shared temperature. The length of targeted sequences also should be considered and made to correspond to read lengths of NGS instrumentation. Target sequences that exceed read length capabilities will need to be subdivided into smaller sections with multiple primers and assembled post-NGS.
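The subdivision of long targets into smaller sections can be planned with a simple tiling calculation. The sketch below assumes a maximum amplicon length that the chosen NGS platform can cover and a fixed overlap between neighbouring sections to allow post-NGS assembly; the values of 400 bp and 50 bp are hypothetical and should be replaced with platform-specific figures.

```python
def tile_target(target_length, max_amplicon, overlap):
    """Split a target into (start, end) sections no longer than max_amplicon.

    Neighbouring sections share `overlap` bases so that reads from adjacent
    amplicons can be joined during assembly. Coordinates are 0-based and
    end-exclusive.
    """
    if overlap < 0 or max_amplicon <= overlap:
        raise ValueError("max_amplicon must be larger than a non-negative overlap")
    sections = []
    start = 0
    while True:
        end = min(start + max_amplicon, target_length)
        sections.append((start, end))
        if end == target_length:
            break
        start = end - overlap  # step back so the next section overlaps this one
    return sections


# A hypothetical 1,000-bp target with 400-bp amplicons overlapping by 50 bp
print(tile_target(1000, 400, 50))  # [(0, 400), (350, 750), (700, 1000)]
```

Each section would then need its own primer pair, which is why long targets increase the number of primers that must share a common annealing temperature.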

Targeted Sequencing Approaches for Phylogenetics

Acquiring large quantities of sequence data from multiple independent nuclear loci and organellar genomes may help to resolve many of the current challenges in plant systematics. However, large-scale data acquisition has remained somewhat inaccessible for many laboratories that rely on standard PCR amplification and Sanger sequencing of targeted loci. PCR and Sanger sequencing costs for data acquisition remain high for studies involving large numbers of samples and loci, especially in terms of time and effort. For example, if a researcher were interested in acquiring sequence data from 48 nuclear loci for 96 individuals, it would require a minimum of 4,608 PCR and 9,216 forward and reverse Sanger sequencing reactions. Thus, to acquire the targeted data in this scenario, upwards of $27,648 would be required for PCR and sequencing expenses (estimated at $0.50 and $2.75 per sample, respectively), not to mention potential cloning costs and significant laboratory time.
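The arithmetic behind this scenario can be reproduced in a few lines, which also makes it easy to substitute other numbers of loci, individuals, or per-reaction prices. The variable names are ours; the values are those given in the text.

```python
loci = 48
individuals = 96
pcr_cost_per_reaction = 0.50      # USD, estimate used in the text
sanger_cost_per_reaction = 2.75   # USD, estimate used in the text

pcr_reactions = loci * individuals       # one PCR per locus per individual
sanger_reactions = 2 * pcr_reactions     # forward and reverse reads

total_cost = (pcr_reactions * pcr_cost_per_reaction
              + sanger_reactions * sanger_cost_per_reaction)

print(pcr_reactions)     # 4608
print(sanger_reactions)  # 9216
print(total_cost)        # 27648.0
```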

NGS instruments are exceptional at generating enormous quantities of data per sample. Indexing allows multiple samples to be sequenced together (multiplexed), dividing the reads among a number of barcoded samples. However, without some form of targeting, indexing simply spreads reads over a larger pool of templates (i.e. by combining multiple samples, the total sequencing space is increased), reducing coverage and overlap among indexed samples. To compensate for this issue, there is a growing array of targeted sequencing methods that allow researchers to efficiently sequence loci of interest from a large number of samples (Table 3-1).
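The effect of targeting on coverage can be approximated with a back-of-the-envelope calculation. The sketch below assumes that reads are divided evenly among indexed samples and that every read falls within the sequenced region, both of which are idealisations; the run size, read length, genome size, and target size are hypothetical values chosen only to show the contrast.

```python
def mean_coverage(total_reads, read_length, n_samples, region_bp):
    """Expected mean depth per sample, assuming evenly distributed reads."""
    bases_per_sample = total_reads * read_length / n_samples
    return bases_per_sample / region_bp


total_reads = 10_000_000   # hypothetical number of reads in one run
read_length = 100          # bp
n_samples = 96             # indexed samples pooled in the run

# Untargeted: reads are spread over an entire (hypothetical) 500-Mb genome
genome_coverage = mean_coverage(total_reads, read_length, n_samples, 500_000_000)

# Targeted: reads are concentrated on 48 loci of about 1,000 bp each
target_coverage = mean_coverage(total_reads, read_length, n_samples, 48 * 1000)

print(round(genome_coverage, 3))  # 0.021
print(round(target_coverage))     # 217
```

Under these assumptions the same run yields far less than 1x coverage per sample without targeting, but more than 200x when reads are restricted to the loci of interest.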

Amplicon Approaches for Targeted Sequencing

Several innovative high-throughput commercial technologies enabling targeted sequencing have recently become available, offering maximised productivity and reduced costs for phylogenetic research. Microfluidic targeted enrichment technologies such as Access Array (Fluidigm Corp., , California, USA) and RainStorm/RDT1000 (RainDance Technologies Inc., Lexington, Massachusetts, USA) enable simultaneous amplification of multiple target loci for large numbers of samples using either plate-based nanofluidic or microdroplet PCR technology, respectively. Each system uses PCR to tag amplicons with unique indexes (barcodes) for downstream multiplex sequencing on NGS platforms, eliminating the costly and time-consuming library preparation steps that generally are required for NGS and the need to curate thousands of PCR samples.

High-throughput systems for targeted sequencing help to overcome some of the problems associated with multiplexed PCR, such as variable levels of amplification among amplicons (Baker 2010; Mamanova et al. 2009). Amplicons produced by microfluidic systems are highly uniform across all samples and target loci, which helps to ensure evenly distributed numbers of reads across all samples pooled in a single NGS run (or sequencing uniformity). The systems also work well with smaller amounts of template DNA or whole-genome amplified DNA, and without biases to certain alleles (Baker 2010). The former feature may enable use of herbarium tissues, particularly in cases where large quantities of tissue may be necessary to acquire up to 15 µg DNA for genomic library construction (see Discussion).

Among the most obvious benefits to phylogenetic applications, some microfluidic systems are extraordinarily inexpensive to use (e.g. as low as $350 per run; Baker 2010). The Access Array System by Fluidigm, for example, can accommodate a range of 2,304 to 23,040 unique reactions in a single run, that is, amplification of 48 to 480 loci (respectively) for 48 samples per Access Array Integrated Fluidic Circuit (IFC) or ‘chip’ (http://www.fluidigm.com). Moreover, unique barcode sets can be used with individual IFCs, enabling the pooling of multiple runs of Access Array into a single multiplexed NGS run. Using our previous scenario with 48 loci and 96 individuals, a targeted sequencing approach using Fluidigm technology can offer an 84% reduction in costs or $23,248 in total savings (estimated at ca. $0.46 per sample or $2,400 per 48 custom primer pairs, $700 per two runs of Access Array, and $1,300 per single Illumina lane; $4,400 in total) as compared to traditional PCR and Sanger sequencing approaches for new data acquisition (estimated above at $27,648). Moreover, the Fluidigm approach offers considerable savings over traditional PCR approaches in terms of laboratory time; DNA isolations are all that is required.
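The savings quoted for this scenario follow directly from the itemised estimates, as the short calculation below shows. The variable names are ours, and the prices are the estimates given in the text rather than current quotations.

```python
sanger_total = 27_648        # USD, PCR plus Sanger estimate from the earlier scenario

primer_pairs_cost = 2_400    # USD, 48 custom primer pairs
access_array_runs = 700      # USD, two Access Array runs (2 x 48 samples)
illumina_lane = 1_300        # USD, one Illumina lane

fluidigm_total = primer_pairs_cost + access_array_runs + illumina_lane
savings = sanger_total - fluidigm_total
percent_reduction = 100 * savings / sanger_total

print(fluidigm_total)             # 4400
print(savings)                    # 23248
print(round(percent_reduction))   # 84
```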

RainDance Technologies’ RainStorm and RDT1000 systems are also high-throughput targeted sequencing options that may offer low-cost-per-sample solutions in the near future; costs are currently out of reach for most phylogenetic projects. RainDance technology is designed around a microfluidic process that brings together template and primers in individual oil droplets, followed by PCR in a standard thermal cycler and downstream NGS. Up to 1,536 primer pairs (expected to increase to 4,000 pairs in the near future) are individually incorporated into droplets, and these droplets are mixed to form a primer pool. On the RDT1000 instrument, for example, a DNA template is merged with each droplet, resulting in 1,536 reactions for each sample. RainDance’s ThunderStorm platform allows automated handling of up to 96 samples and eight separate primer pools. Many thousands of PCR reactions therefore can be conducted in parallel and sequenced in one run.

An additional advantage of the RainDance system is that each droplet receives, on average, less than half of a haploid genome’s equivalent of DNA. Thus, the probability of two heterozygous alleles being in the droplet is low, eliminating the problem of ambiguous alleles and the resulting need for cloning. Among all droplets, both alleles will be amplified. In this way, single-molecule allele-specific amplification and sequencing facilitate allelic determination. While Fluidigm may still have PCR recombination errors, both alleles should assemble from NGS reads with adequate coverage (Harismendy et al. 2009).

Targeted sequencing technologies such as Fluidigm and RainDance have been primarily applied within the realm of human biomedical research. However, some plant-based applications have been published recently (e.g. Maughan et al. 2011; Paux et al. 2011). While these technologies, as well as others in development, offer promising potential for phylogenetic applications, there may be several disadvantages to their use in terms of accessibility, initial financial investments, and performance; each warrants caution and additional consideration.

First, the instrumentation is expensive (e.g. $79,000 to $225,000; see Baker 2010) and may not be available at most institutions. Thus, identification and use of core facilities or third-party service providers may be required to complete projects, which can add to costs. Second, technologies such as Fluidigm’s Access Array System require costly investments in custom-designed, target-specific, barcoded primer sets (recently quoted by Fluidigm Corp. at approximately $50 per primer pair). Custom primers may not be universally applicable among plant groups, necessitating additional investments in custom primers for future studies of other plant groups. Lastly, and perhaps most importantly, it is currently unknown how well microfluidic technologies perform in phylogenetic applications involving multiple taxa, particularly distantly related taxa. Taxon-specific amplification problems are not uncommon in phylogenetic studies of plants, even among closely related species, and multiple primer sets often are required for successful amplification across a range of species. RainDance seems particularly well suited for these cases because clade-specific primers can be incorporated into the primer design. Using multiple primer sets for a locus may reduce the total number of target loci per run. However, if (for example) every locus required two primer sets to amplify across the taxa of interest, 768 loci could still be amplified in a single run of RDT1000. Preliminary tests are needed to evaluate the successful application of these technologies to plant phylogenetics.

As an alternative to microfluidic technologies, Bybee et al. (2011) developed an amplicon method for targeted sequencing on NGS platforms. The PCR-based method employs two primer sets in a two-step PCR protocol, that is, one primer pair for target-specific amplification and another primer pair for sample-specific indexing (barcoding). Within limits, this method is easily scalable and economical in terms of laboratory time and financial investment. However, it remains impractical for studies targeting hundreds or thousands of loci for a large taxonomic sample. Given the uncertain performance of commercial systems, however, the method of Bybee et al. may provide an effective means to circumvent taxon-specific amplification problems where microfluidic systems fail. For example, it may be useful to employ this method for failed samples using a gradient PCR approach to optimise taxon-specific annealing temperatures. For further discussion on targeted sequencing methods see also Cronn et al. (2012) and Grover et al. (2012).

Hybridization Approaches for Targeted Sequencing

Sequence capture methods offer an alternative to amplicon methods for targeted sequencing on NGS platforms, but they are currently more expensive. These methods are based on hybridisation of targeted regions to probes or baits designed from known sequences of those regions. Depending on the technology, the hybridisation is conducted either on a slide or array or in solution (e.g. Agilent SureSelect, NimbleGen SeqCap EZ, which have both array-based and solution-based options). In general, fragmented genomic DNA is mixed with probes (DNA or RNA) designed to target regions (reviewed in detail in Mamanova et al. 2009). The probes hybridise with the DNA fragments, after which the non-hybridised fragments are washed away, leaving the targeted fragments for NGS. It is possible to prepare barcoded genomic libraries and pool them before or after enrichment (Kenny et al. 2011). SureSelect kit costs vary greatly depending on the number of samples the kit can accommodate (e.g. ranging from $918 per sample for a 16-sample kit to $180 per sample for a 480-sample kit). Additionally, Agilent recently released a kit that supports pre-selection pooling of up to 16 samples and, with modifications of the official manufacturer protocols, the kits might be used for pre-selection pooling of many more samples (Kenny et al. 2011). Probe design is an important step and needs to take into account the sequence variation at the loci of interest. However, probes tend to be quite long (120 bp for SureSelect), so more mismatches, and even insertion and deletion mutations, are tolerated in sequence capture methods than in PCR-based methods.

Agilent recently released HaloPlex, a new product that bypasses the library preparation steps that are still required by the hybridisation methods described above. Genomic DNA is digested with enzymes, and probes designed to recognise each end of a desired fragment are hybridised, creating circular molecules. The probe sequence includes priming sites to amplify from the selected circularised templates, generating ready-to-sequence products. The costs of these kits can be as low as $200 per individual sample and, considering that no additional library preparation step is required, they may prove economical as a method for sequencing up to ca. 500 kb of targeted DNA. The technology is currently available for human genome samples only. However, Agilent is actively developing the technology for other samples and, like their SureSelect products, the product may become applicable to any relatively well-characterised target in the near future.

Preliminary trials with hybridisation-based targeted sequencing approaches are producing promising results that will be of great interest to the larger research community. Recently, reports have emerged that illustrate the successful implementation of hybridisation-based approaches to large-scale data acquisition in a number of laboratories. For example, Liston (2012) recently reported on work with solution hybridisation-based target enrichment in Rosaceae. Through pair-wise comparisons of the apple, peach, and strawberry genomes, it was possible to identify and develop probes anchored to conserved orthologous exons in 257 single-copy nuclear loci. The utility of these markers is currently being tested across four lineages of Rosaceae. Additionally, the success with in-solution hybridisation of chloroplast gene space using Agilent’s SureSelect kits and Illumina sequencing has recently been tested (Stull et al. 2012). The preliminary results suggest that approximately 100 chloroplast genomes can be sequenced in a single lane of Illumina, offering exciting potential for large-scale data acquisition.

Processing of NGS Data for Phylogenetic Marker Development

Bioinformatic pipelines for developing nuclear markers from NGS data consist generally of the following steps: 1) sequence format conversion; 2) quality assessment and trimming; 3) assembly of raw read data into contigs; and 4) the combined processes of data mining, including clustering, multiple sequence alignment, and orthology testing (Figure 3-3). There are several options for processing NGS data, especially with regard to software (Table 3-2).

Transforming NGS data into reliable orthologous markers for phylogenetic studies involves many computational steps and, therefore, necessitates the creation of new, or the use of existing, project-specific databases to manage raw read data, downstream modified data, and results (see Discussion). Data processing also requires some training in bioinformatics. Fortunately, several bioinformatics software packages were developed during the last decade to assist biologists with limited computational skills. These software packages streamline processes for NGS data handling and help biologists achieve their data processing and analysis goals.

NGS Data: File Types and Conversion Tools

NGS platforms deliver sequence data in a variety of file formats (Table 3-2), and some of these may include additional variants. For example, there are at least three incompatible variants of the FASTQ format that encode base quality differently, e.g. Solexa (now Illumina), Illumina 1.3-1.7, and Sanger quality score coding. It is important to understand the differences among various file formats before proceeding with processing of NGS data. A detailed review by Cock et al. (2009) describes variants of FASTQ files.
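To illustrate why the FASTQ variants are incompatible, the sketch below decodes the same quality characters under each convention described by Cock et al. (2009): Sanger files store Phred scores with an ASCII offset of 33, Illumina 1.3+ files store Phred scores with an offset of 64, and Solexa files store Solexa-scaled scores with an offset of 64. The variant labels and function name are our own, and the conversion from Solexa to Phred scores is rounded to the nearest integer.

```python
import math

# ASCII offsets used by each FASTQ variant
OFFSETS = {"sanger": 33, "solexa": 64, "illumina-1.3": 64}


def decode_qualities(quality_string, variant):
    """Convert a FASTQ quality string to a list of Phred scores."""
    offset = OFFSETS[variant]
    scores = [ord(character) - offset for character in quality_string]
    if variant == "solexa":
        # Solexa scores use a different scale; convert them to Phred scores
        scores = [round(10 * math.log10(10 ** (q / 10) + 1)) for q in scores]
    return scores


print(decode_qualities("I", "sanger"))        # [40]
print(decode_qualities("I", "illumina-1.3"))  # [9]
print(decode_qualities("h", "illumina-1.3"))  # [40]
print(decode_qualities(";", "solexa"))        # [1]
```

The character `I` therefore denotes a high-quality base in a Sanger-encoded file but a poor one under the Illumina 1.3+ encoding, so applying the wrong convention would distort all downstream quality filtering.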

Native NGS file formats must be converted to accommodate the input file format requirements of various biological data analysis software packages. NGS instrument manufacturers usually provide software tools to convert proprietary sequence data formats into commonly used formats such as FASTA and FASTQ (Rhie et al. 2010). For instance, the Newbler assembly software from 454/Roche includes a utility programme called ‘sffinfo’ (http://454.com/products-solutions/analysis-tools/index.asp) that can convert SFF files to FASTA/QUAL formats. Similarly, Pacific Biosciences provides the pbh5tools software package that can convert the HDF5 files of PacBio to FASTA/QUAL or FASTQ formats (https://github.com/PacificBiosciences/pbh5tools).

Apart from proprietary software, there are also several open-source software packages that were developed to convert various NGS file formats. Some of the NGS file format converters include Pyrus (Rhie et al. 2010), sff_extract (http://bioinf.comav.upv.es/sff_extract), and the fq_all2std.pl and solid2fastq.pl scripts included in the Maq alignment and assembly package (http://maq.sourceforge.net).

Quality Control and Trimming of NGS Data

Quality assessment is a necessary component of NGS data processing and aids the detection of problematic sequences prior to assembly or other analyses. Researchers are often reluctant to discard data, but removal of low quality sequences can greatly improve assemblies and facilitate downstream analyses (Zhang et al. 2011; Niu et al. 2010). There are several measures for assessing quality of NGS data, including per base quality scores, per sequence quality scores, per base sequence content, per base GC content, per sequence GC content, per base N content, sequence length distribution, sequence duplication levels, over-represented sequences, and kmer content.
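A few of the measures listed above are simple to compute once reads and their decoded quality scores are in memory. The following sketch shows per base mean quality, per sequence GC content, and per sequence N content for a handful of hypothetical reads; the function names and data are ours, and dedicated tools report these and the remaining measures far more efficiently.

```python
def per_base_mean_quality(reads):
    """Mean Phred score at each read position.

    reads: list of (sequence, qualities) pairs, where qualities is a list of
    Phred scores of the same length as the sequence.
    """
    longest = max(len(qualities) for _, qualities in reads)
    totals = [0] * longest
    counts = [0] * longest
    for _, qualities in reads:
        for position, score in enumerate(qualities):
            totals[position] += score
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


def gc_content(sequence):
    """Fraction of unambiguous bases in a read that are G or C."""
    sequence = sequence.upper()
    called = sum(sequence.count(base) for base in "ACGT")
    gc = sequence.count("G") + sequence.count("C")
    return gc / called if called else 0.0


def n_content(sequence):
    """Fraction of positions in a read that were called as N."""
    return sequence.upper().count("N") / len(sequence) if sequence else 0.0


# Hypothetical reads with decoded Phred scores
reads = [
    ("ACGT", [40, 38, 30, 20]),
    ("GGCN", [36, 34, 10, 2]),
    ("AT", [40, 40]),
]

print(per_base_mean_quality(reads))
print([gc_content(sequence) for sequence, _ in reads])
print([n_content(sequence) for sequence, _ in reads])
```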

There are a number of available tools for NGS data quality assessment and trimming of raw read data, including BIGpre (Zhang et al. 2011), FastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/), FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/), PIQA (Martínez-Alcántara 2009), SolexaQA (Cox et al. 2010), and Trimmomatic (http://www.usadellab.org/cms/index.php?page=trimmomatic). These tools assist with quality visualisations (graphs) and ‘trimming’ of raw read data, a process in which bases with poor quality scores, stretches of Ns, and adapter sequences from multiplexed samples are identified and removed (i.e. quality trimming, ambiguity trimming, and adapter sequence trimming, respectively). Sequences shorter than a specified threshold also can be removed following the initial trimming steps. When choosing a quality assessment and trimming tool, it is important to consider the capabilities of each; many tools provide functionality for both processes, but differ in their unique capabilities. It may be preferable, for example, to use one tool for quality visualization and another for trimming.
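The logic of quality trimming, ambiguity trimming, and length filtering can be conveyed with a deliberately simplified sketch. It removes low-quality bases and Ns from the 3’ end of a read only, and it does not reproduce the algorithm of any of the tools named above, which typically use sliding windows and also handle adapters; the thresholds are hypothetical defaults.

```python
def quality_trim(sequence, qualities, min_quality=20, min_length=30):
    """Trim low-quality or ambiguous bases from the 3' end of a read.

    Returns the trimmed (sequence, qualities) pair, or None when the read
    falls below min_length after trimming and should be discarded.
    """
    end = len(sequence)
    while end > 0 and (qualities[end - 1] < min_quality
                       or sequence[end - 1] in "Nn"):
        end -= 1
    if end < min_length:
        return None
    return sequence[:end], qualities[:end]


# A hypothetical read whose quality declines towards the 3' end
sequence = "ACGTACGTNN"
qualities = [30, 30, 30, 30, 30, 30, 25, 10, 5, 2]

print(quality_trim(sequence, qualities, min_quality=20, min_length=5))
print(quality_trim(sequence, qualities, min_quality=20, min_length=8))  # None
```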

Another important consideration for quality assessment and trimming of NGS data involves removal of sequencing artifacts such as artificial duplicate sequences. These sequences have exactly the same start positions, but may vary in their end positions where the shorter reads fully align with longer reads (Niu et al. 2010). Artificial duplicate sequences arise during the PCR amplification steps, especially with insufficient starting quantities of DNA or RNA (Zhang et al. 2011; Niu et al. 2010), and can be removed using BIGpre (Zhang et al. 2011) or CDHIT-454 (Niu et al. 2010) with either exact or nearly exact matches. This process of quality-based trimming and artificial duplicate removal will reduce genomic and transcriptomic misassemblies (Zhang et al. 2011; Niu et al. 2010).
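The definition of artificial duplicates given above (identical start, with shorter reads fully contained in longer ones) corresponds to one read being an exact copy or a prefix of another. The sketch below removes such reads using exact matching only; it is not the algorithm of BIGpre or CDHIT-454, which also accommodate nearly exact matches, and it cannot distinguish artificial duplicates from natural ones.

```python
def remove_artificial_duplicates(reads):
    """Keep the longest read among exact copies and prefix-contained reads.

    After sorting, any read that is a prefix of another read is immediately
    followed by a read that starts with it, so one comparison per read
    is sufficient.
    """
    ordered = sorted(set(reads))  # set() removes exact copies
    kept = []
    for current, following in zip(ordered, ordered[1:] + [None]):
        if following is not None and following.startswith(current):
            continue  # current is fully contained at the start of a longer read
        kept.append(current)
    return kept


# Hypothetical reads: two are prefixes of longer reads and one is an exact copy
reads = ["ACGTAC", "ACGTACGG", "ACGTACGG", "ACGTTT", "TTGACA", "TTG"]
print(remove_artificial_duplicates(reads))  # ['ACGTACGG', 'ACGTTT', 'TTGACA']
```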

It is important to recognise that the quality measures above provide information about how well the genomic or transcriptomic data were sequenced, but not about the quality of the sample itself. If proper care was not taken during the DNA/RNA sample collection and preparation, foreign or unwanted DNA/RNA contaminants may appear in NGS data. Longo et al. (2011), for example, reported extensive contamination by human DNA in sequences, genomic assemblies, and trace archive reads databases of non-human species. Thus, it is a good practice to conduct a crude analysis to detect the presence of contamination by blasting NGS data against reference sequences from potential contaminant sources (Folta et al. 2010). Additionally, Martin and Wang (2011) suggested that better transcriptomic assemblies were generated when ribosomal RNA (rRNA) and other abundant transcripts were removed during library preparation. Screening transcriptomic sequence data for the presence of rRNA and other abundant transcripts will provide information about the quality of a transcriptomic sample (Martin and Wang 2011).

Genomic and Transcriptomic Assembly

Once NGS data have been analysed for quality and the reads are trimmed, the data are assembled into contigs. We note, however, that an additional error correction step may be useful for Illumina genomic reads prior to assembly (see Yang et al. 2012); most Illumina-based genome assemblers include error correction algorithms.

Several software packages are available to help tackle the challenges of genomic and transcriptomic NGS data assembly, and their underlying algorithms can accommodate the large data sets, short read lengths, and error rate variations among different NGS platforms (reviewed in Bao et al. 2011; Garber et al. 2011; and Martin and

Wang 2011). There are three main strategies for assembling genomic or transcriptomic data: reference-based assembly, de novo assembly, and a hybrid (or combined) approach. The reference-based strategy involves the reconstruction of sequences by aligning NGS data to an available genome or transcriptome. In contrast, the de novo assembly strategy reconstructs sequences without making comparisons to a database or other reference. The use of de novo genomic or transcriptomic assemblers (Bao et al. 2011; Garber et al. 2011; Martin and Wang 2011) depends largely on variables, such as NGS data type and coverage as well as the availability of computational resources, which can be a limiting factor with large and complex assemblies. Hybrid assembly strategies combine both reference and de novo assemblies, allowing researchers to detect novel and variable transcripts, while still employing the high sensitivity of reference-based approaches. Depending on the quality or phylogenetic relatedness of the available reference genome to the taxon of interest, hybrid strategies may begin with assembly using a reference genome, followed by de novo assembly of reads that initially failed to align to the reference, or by initiating the process with de novo assembly, followed by alignment of contigs to a reference scaffold. When the reference genome is from a different species, as is often the case when working in non-model systems, the latter approach is favoured (Martin and Wang 2011). By using a hybrid assembly technique, a reference genome assists with filling the gaps produced by the

de novo assembly, which is typically fragmented (e.g. Surget-Groba and Montoya-Burgos 2010). These methods of combining reference-guided and de novo strategies may provide an optimal avenue to capture known information, while allowing for novel variation (Garber et al. 2011).

It is important to note that genomic assembly algorithms, in general, cannot be directly adopted for transcriptomic assemblies for three reasons (Garber et al. 2011; Martin and Wang 2011). First, the read depth associated with transcriptomic data is highly variable due to differences in expression. Second, strand-specific information needs to be considered in order to efficiently resolve overlapping sense and antisense transcripts, especially in the case of strand-specific RNA-seq protocols (Levin et al. 2010). Lastly, transcript variants that share one or more exons from the same gene must be teased apart. Transcriptome assemblers were developed to accommodate these characteristic differences between transcriptomic and genomic data. Details of reference-based and de novo assemblers for genomic NGS data (whole-genome data, genomic DNA from targeted sequencing, and genome skimming) and for transcriptomic data were reviewed by Bao et al. (2011) and Martin and Wang (2011), respectively. The species sequenced and the availability of a reference genome, along with the type of data (genomic or transcriptomic) and the computational resources available, will largely guide the selection of assembly programmes and the generation of contigs for further analysis.

Data Mining for Orthologous Markers

Once NGS data are assembled into contigs, steps for mining orthologous markers can commence. This iterative process combines steps for clustering, multiple sequence alignment, and tree-based orthology testing (Figure 3-3). Clustering of assembled

genomic or transcriptomic sequences from multiple species is accomplished using high similarity scores from pair-wise comparisons and blast searches (Lechner et al. 2011). Next, the sequence clusters are aligned using a multiple sequence alignment tool (Edgar and Batzoglou 2006; Liu et al. 2009; Wang et al. 2011). Lastly, aligned clusters are analysed phylogenetically (i.e. tree-based orthology testing) to differentiate orthologs from paralogs, that is, by identifying clades containing a single sequence from each taxon in the total sample.
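The tree-based test described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure of any tool cited here: the rooted gene tree is represented as a hypothetical nested tuple, leaf labels follow an assumed `taxon|sequence` naming convention, and the `min_taxa` threshold is an arbitrary parameter. A clade is accepted as a putative ortholog group when no taxon is represented by more than one sequence.

```python
def leaves(tree):
    """Return the leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out


def taxon_of(label):
    # Assumed (hypothetical) label convention: "taxon|sequence_id"
    return label.split("|", 1)[0]


def ortholog_clades(tree, min_taxa=2):
    """Return the largest clades in which each taxon occurs only once."""
    labels = leaves(tree)
    taxa = [taxon_of(label) for label in labels]
    if len(set(taxa)) == len(taxa):
        # No taxon is duplicated: accept the clade if it is large enough.
        return [sorted(labels)] if len(taxa) >= min_taxa else []
    # At least one taxon is duplicated (paralogs present): test the subclades.
    clades = []
    for child in tree:
        clades.extend(ortholog_clades(child, min_taxa))
    return clades


# Hypothetical gene tree with two copies of a locus in taxa A and B.
gene_tree = ((("A|1", "B|1"), "C|1"), (("A|2", "B|2"), "A|3"))
print(ortholog_clades(gene_tree))
```

The result depends on how the gene tree is rooted, so in practice the rooting and the minimum number of taxa per clade would need to be chosen with care.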

A large variety of methods for predicting orthologs among two or more data sets have appeared in recent years (Tatusov et al. 1997; Remm et al. 2001; Li et al. 2003; Dessimoz et al. 2005; DeLuca et al. 2006; Wheeler et al. 2007; Hubbard et al. 2007; Jensen et al. 2008; Martinez 2011), and some reviews assessing their relative performance have been published (Chen et al. 2007; Dutilh et al. 2007; Altenhoff and Dessimoz 2009). In general, algorithms to predict orthology among two or more loci can be classified as pair-wise orthology, cluster orthology, and tree-based orthology (Dutilh et al. 2007), the last being the closest to the original definition of the concept (Fitch 1970; Sonnhammer and Koonin 2002).

Several algorithms are currently available for clustering and orthology prediction for coding loci and are used to compile orthology databases across multiple species. For example, available databases include COG-database (Tatusov et al. 2000), eggNOG (Jensen et al. 2008), Ensembl Compara (Hubbard et al. 2007), GreenPhyl (Rouard et al. 2011), HomoloGene (Wheeler et al. 2007), InParanoid (Berglund et al. 2008), OMA Browser (Dessimoz et al. 2005; Schneider et al. 2007; Altenhoff et al. 2011), OrthoMCL-DB (Chen et al. 2006), RoundUp (DeLuca et al. 2006), and TreeFam (Li et

al. 2006; Ruan et al. 2008). The content in these databases is primarily limited to previously published protein sequences from Arabidopsis thaliana (L.) Heynh. or Oryza sativa L. However, there are other stand-alone clustering and orthology prediction tools that were developed to accommodate larger amounts of transcriptomic data from NGS technologies (Lechner et al. 2011), including tools such as MultiMSOAR (Shi et al. 2010), MultiParanoid (Alexeyenko et al. 2006), OrthoMCL (Li et al. 2003), and Proteinortho (Lechner et al. 2011). Additionally, UCLUST (Edgar et al. 2010) and CD-HIT (Li et al. 2006) are useful for independent clustering of conserved DNA sequences.
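To make the idea of similarity-based clustering concrete, the following minimal sketch groups sequences by single-linkage clustering of pair-wise hits. It is a deliberate simplification and not the algorithm used by any of the tools named above; the input tuples, the `min_identity` threshold, and the function name are assumptions introduced for illustration.

```python
def cluster_hits(hits, min_identity=80.0):
    """Single-linkage clustering of sequences linked by similar pairs.

    hits: iterable of (query_id, subject_id, percent_identity) tuples,
    e.g. parsed from pair-wise comparisons (hypothetical input format).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, identity in hits:
        root_q, root_s = find(query), find(subject)
        # Merge the two groups only when the pair is similar enough.
        if identity >= min_identity and root_q != root_s:
            parent[root_s] = root_q

    clusters = {}
    for seq in parent:
        clusters.setdefault(find(seq), set()).add(seq)
    return sorted(sorted(members) for members in clusters.values())


hits = [("a", "b", 95.0), ("b", "c", 90.0), ("c", "d", 50.0)]
print(cluster_hits(hits))
```

Single linkage tends to chain loosely related sequences into one cluster, which is one reason the resulting clusters still require alignment and tree-based orthology testing.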

Case Study: 1KP Pipeline

Our initial analyses showed that the preliminary 1KP transcriptome assemblies, which were made with a genomic assembler, were not well suited for extracting chloroplast, mitochondrial, or nuclear ribosomal data (V. Maia et al. pers. comm.; B. Ruhfel pers. comm.). For these data and, we suspect, many other data sets, it is more fruitful to start with the original raw sequence data and then pull and assemble the smaller subsets of reads that are of interest. This approach is also more computationally tractable because, for this study, we were not interested in the nuclear genes. It should be noted that the 1KP project is also reassembling data sets with SOAPdenovo-trans (http://soap.genomics.org.cn/SOAPdenovo-Trans.html), a modified version of the program specifically designed for RNA-seq data sets.

The pipeline we developed for extracting transcribed chloroplast, mitochondrial, and nuclear ribosomal genes from 1KP transcriptomic data begins with the raw reads. These reads are blasted against a database consisting of phylogenetically representative sequences for the regions of interest. Any read with a significant blast hit, and its pair in a paired-end dataset, is then pulled into a separate file of selected reads.

We then use the VelvetOptimizer script (Gladman and Seeman 2011), which is distributed with Velvet (Zerbino and Birney 2008), to create the optimal assembly of the selected reads. Next, the assembled contigs are blasted against the same database in order to cluster, orient, and trim the contigs to the individual gene regions. Where a taxon has multiple contigs for a given gene region, they are joined together with missing data between contigs to generate a consensus sequence for each region. The consensus sequences for each region are then assembled together, followed by multiple sequence alignment and orthology testing with tree-based methods (i.e. phylogenetic analysis). With this pipeline, hundreds of NGS data sets can be processed efficiently, and regions of interest can be extracted for final phylogenetic analysis.
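Two steps of this pipeline, read selection and the joining of contigs, can be sketched as follows. The sketch is not the 1KP pipeline code. It assumes tab-separated BLAST output in the default 12-column tabular layout (e-value in column 11), a hypothetical `/1` and `/2` naming convention for mates, an arbitrary e-value cut-off, and an arbitrary run of `N` characters to represent missing data between contigs.

```python
def select_reads(blast_lines, max_evalue=1e-5):
    """Collect IDs of reads with a significant hit, plus their mates.

    Assumes tab-separated BLAST output with the query ID in column 1 and
    the e-value in column 11, and mates named <name>/1 and <name>/2.
    """
    selected = set()
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip malformed or comment lines
        read_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            base = read_id.rsplit("/", 1)[0]
            # Keep both members of the pair, as described in the text.
            selected.update({base + "/1", base + "/2"})
    return selected


def join_contigs(contigs, gap_length=10):
    """Join (start, sequence) contigs in positional order, padded with N.

    gap_length is an arbitrary placeholder for missing data (assumption).
    """
    ordered = [seq for _, seq in sorted(contigs)]
    return ("N" * gap_length).join(ordered)


lines = [
    "r1/1\tref\t98.0\t100\t2\t0\t1\t100\t5\t104\t1e-30\t180",
    "r2/2\tref\t80.0\t50\t10\t0\t1\t50\t5\t54\t0.5\t30",
]
print(sorted(select_reads(lines)))
print(join_contigs([(200, "GG"), (1, "AC")], gap_length=3))
```

The selected IDs would then be used to extract the corresponding reads from the raw files before assembly.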

Discussion

The field of plant systematics has arrived at a new crossroads of technology and innovation (Soltis et al. 2009; Straub et al. 2012). Rapid developments in NGS are prompting systematists to rethink what is possible (e.g. Harrison and Kidner 2011), especially since these technological advancements present exciting opportunities to revisit many long-standing questions and grand challenges in biology. This technological revolution makes possible not only a next generation of sequencing but also the next generation of evolutionary synthesis: a rejuvenation of systematics research in which we can explore genome evolution (e.g. Renny-Byfield et al. 2011; Buggs et al. 2012a, 2012b), link genotypes with phenotypes (reviewed in Henry et al. 2011), and pursue investigations of adaptive traits in a dynamic and ever-changing global environment (e.g. Stapley et al. 2010; Anderson and Mitchell-Olds 2011; Parchman et al. 2012).

The availability of large numbers of nuclear markers for phylogenetic studies will allow us to take advantage of and promote further developments in alternative methods of inference, such as deep coalescent (or multi-species) approaches (Liu and Pearl 2007; Degnan and Rosenberg 2009), non-parametric tree reconciliation (Slowinski and Page 1999; Ané et al. 2007), and network representations of reticulate evolution (Huson et al. 2010). Until NGS, access to valuable phylogenetic markers from the nuclear genome was limited. Now that we have a larger phylogenetic toolbox, we can begin to better address the problematic issue of incongruence between gene trees and species trees and further explore alternatives to inference from concatenated data matrices (Kubatko and Degnan 2007; Edwards 2009), apply new methods for species delimitation (Carstens and Dewey 2010; Zhang et al. 2011), and untangle previously problematic relationships obscured by hybridisation and introgression (Twyford and Ennos 2012). These alternative approaches may be important as we attempt to reconstruct many relationships at the tips of the Tree of Life.

Although we place much emphasis on identification of orthologous markers for phylogenetic analyses, access to large-scale NGS data from the nuclear genome may also present an opportunity to explore other methods and analytical tools for inferring species trees from multicopy genes (i.e. orthologs and paralogs). The software package iGTP (Chaudhary et al. 2010), for example, uses Gene Tree Parsimony (GTP) (Slowinski and Page 1999) to reconcile topological heterogeneities among thousands of individual gene trees inferred from hundreds of taxa, providing a useful means to circumvent problems with phylogenetic inference that arise from evolutionary events such as gene duplications and losses, incomplete lineage sorting, and horizontal

gene transfer. Information from ancient or recently duplicated genes may provide an additional level of phylogenetic information for many analyses and, when linked with models of chromosome evolution, facilitate investigations of genome evolution and polyploidisation (e.g. Ness et al. 2011). We caution, however, that adequate levels of sequencing coverage are necessary to recover all orthologs, and care is needed to avoid generation of chimeric sequences during assembly; these factors will mislead analyses and interpretation of results.
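The reconciliation idea behind gene tree parsimony can be illustrated with a small sketch that counts the gene duplications implied when one gene tree is mapped onto one candidate species tree. This is not the iGTP implementation: it scores duplications only (losses and deep coalescence are ignored), it assumes rooted trees written as hypothetical nested tuples, and it assumes each gene-tree leaf is labelled with the name of the species it came from. A gene-tree node is counted as a duplication when it maps to the same species-tree clade as one of its children.

```python
def clades(tree):
    """Return (leaf set of tree, list of leaf sets for every clade)."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    union, all_clades = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        union |= child_leaves
        all_clades.extend(child_clades)
    all_clades.append(union)
    return union, all_clades


def count_duplications(gene_tree, species_tree):
    """Count duplications implied by mapping a gene tree to a species tree."""
    _, species_clades = clades(species_tree)

    def lca(taxa):
        # Smallest species-tree clade that contains all of the given taxa.
        return min((c for c in species_clades if taxa <= c), key=len)

    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca(frozenset([node]))
        child_maps = [visit(child) for child in node]
        mapped = lca(frozenset().union(*child_maps))
        if mapped in child_maps:
            # Parent and child map to the same clade: a duplication.
            duplications += 1
        return mapped

    visit(gene_tree)
    return duplications


species_tree = (("A", "B"), "C")
gene_tree = ((("A", "B"), "C"), ("A", "C"))  # two copies of the locus
print(count_duplications(gene_tree, species_tree))
```

In a full analysis, a cost of this kind would be summed over all gene trees and used to compare alternative candidate species trees.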

For some researchers who are beginning to familiarise themselves with NGS approaches, the computational demands of data processing may appear daunting. Fortunately, there are many large-scale initiatives that are developing user-friendly tools to assist researchers with the development of unique bioinformatic pipelines as well as to conduct phylogenetic and related analyses. For example, resources such as the Discovery Environment and the DNA Subway of the iPlant Collaborative (http://www.iplantcollaborative.org/) help to streamline data processing and provide many useful features for phylogenetic research, including data storage, remote computing capabilities, customized application installations, and collaborative research and data sharing. In addition, the tools within the Discovery Environment can be tailored to the particular interests or needs of individual researchers. Users also can share newly developed applications within the iPlant community for beta testing, including everything from file format converters to statistical applications.

Despite the exciting advancements in NGS and computational biology, which afford great potential within many areas of systematic research, accessing the world’s biodiversity via field collections remains an ongoing challenge and may pose obstacles

to NGS projects. For example, the acquisition of plant tissues comprises a major component of any molecular study, and field research presents unique challenges, especially in terms of political or physical access to field localities and the financial and time investments required for fieldwork. Historically, molecular systematists have relied on herbarium specimens as a source of tissue for molecular studies, particularly in cases where field-collected tissues were not readily accessible. However, not every museum specimen is amenable for use with NGS (Blow et al. 2008).

There are several potential issues related to tissue preservation that may affect the utility of either herbarium or newly field-collected materials in NGS applications, such as DNA or RNA quality/stability and the quantity of tissue required for extraction.

With regard to herbarium collections, the systematics community may benefit from empirically based measures of degradation and average NGS sequencing success in single specimens across a diversity of plant groups (Staats et al. 2011). It is not known whether emulsion or solid-phase PCR is preferable for amplification of specimen-extracted tissues or whether sequencing chemistry or read length affects sequence quality in extremely fragmented DNA (<300 bp in length). Many studies of herbarium specimens, for example, have focused on ‘DNA barcodes’ (Staats et al. 2011) or individual nuclear genes (Lister et al. 2008). Additional tests with NGS may help elucidate the extent of genomic sequence degradation, as illustrated for ancient DNA samples of mammoths and cave bears (Gilbert et al. 2008 and Stiller et al. 2009, respectively), and help to optimise protocols for collection, preservation, and processing of living or dried material for NGS applications. These protocols may be particularly useful for small specimens or rare populations for which access to larger quantities of plant tissue is limited.

Some of the experimental pathways described here may help circumvent challenges posed by limited access to biological resources that are appropriate for NGS library preparation, that is, when plant tissues are insufficient for acquisition of high DNA yields, the DNA is potentially degraded, or flash-frozen tissues are required for RNA extractions. For example, our recommended two-step NGS data acquisition pathways (Figure 3-2) make use of a smaller subset of taxa for marker development, enabling use of a smaller number of specimens or field collections for which appropriate and adequate material exists or is readily available. Some institutions maintain collections of vouchered silica-dried or frozen tissues and nucleotide extracts as part of developing tissue or DNA repositories, respectively. Seed-banking projects, such as the Millennium Seed Bank Partnership at the Royal Botanic Gardens, Kew (http://www.kew.org/science-conservation/save-seed-prosper/millennium-seed-bank/), also serve as repositories of plant biodiversity and provide an ‘insurance policy’ against extinction of species in the wild. These collections may provide useful material in the near future as we transition to NGS approaches for marker development and other phylogenetic applications. However, in scenarios where it remains necessary to use smaller quantities of field-collected tissues or herbarium tissues of difficult-to-obtain taxa, targeted sequencing approaches for secondary data acquisition are a particularly effective option. Microfluidic technologies and other amplicon targeted sequencing approaches, for example, can accommodate smaller amounts of starting DNA (Baker 2010).

Lastly, the stability of tissue for RNA extraction poses significant challenges for NGS projects that require field-collected taxa. Experimental results suggest that it is

necessary to flash-freeze plant tissues in liquid nitrogen within seconds of removal from the plant in order to extract high quality RNA for transcriptomic projects (Johnson et al. 2012). The sampling protocol for the 1KP project, for example, requires transport of a portable liquid nitrogen Dewar to collection sites. This use of liquid nitrogen presents practical challenges for collection of plant tissue for RNA extractions in the field. In response, companies are beginning to offer tools to help prepare field-collected tissues for RNA extraction; the tools bypass the need for on-site flash freezing of tissues. One example is a modified cordless drill that facilitates maceration of tissues with ‘bashing beads’ in pre-filled tubes and stabilises RNA in a buffer solution (Zymo Research, Irvine, CA, USA).

Gayral et al. (2011) recently tested different stabilisation buffers, storage times, and temperatures on different animal tissues collected in the field; they report that the buffers RNAlater (Qiagen, Hilden, Germany) and Trizol (Life Technologies, Carlsbad, CA, USA) can be used successfully in the field without flash freezing, followed by cold temperature storage (4 °C to -20 °C). Their RNA extractions were greatly improved when animal tissues were sliced into smaller pieces before they were added to stabilising buffers. The tissues also maintained RNA stability when stored for months at -20 °C. These techniques may prove useful with field collections of plants. However, formal tests with plant tissues are needed to assess whether or not these methods can be successfully applied.

Future Directions. This is an exciting time for the systematics community. As we take advantage of the benefits of NGS and move closer to a comprehensive view of plant phylogeny, we can begin to make larger-scale inferences of the patterns in

nature—such as phylogeny, historical biogeography, and the fossil record—and further explore the individual processes that influenced these patterns (e.g. mutation/variation, gene flow, natural selection, adaptation, and speciation) with increasingly larger data sets. Genomic data repositories and web-based, community-driven initiatives will accelerate exploration of these links between evolutionary patterns and processes, particularly as more NGS innovations become available in the near future.

The capabilities of NGS technologies are already well beyond marker development, as illustrated by recently published phylogenomic studies (Lai et al. 2012; Lee et al. 2011; McKain et al. 2012; Smith et al. 2011; Steele et al. 2012; Zhou et al. 2011) and other large-scale genomic and transcriptomic projects such as 1KP, BMAP, and the Compositae Genome Project (Lei et al. 2012). Platforms for plant comparative genomics such as CoGe (Lyons and Freeling 2008) and Phytozome, a DOE-JGI initiative (Goodstein et al. 2012), will be important for investigations seeking correlations between gene structure and function. The platforms already facilitate many research activities ranging from chromosome mapping to studies of differential gene expression. Phytozome, for example, currently houses 25 fully or partially annotated genomes and is updated annually; it will also include new BMAP sequences in the near future.

Most of the large genomic data repositories are in their infancy. These repositories have the potential to be vastly beneficial as we move forward towards an evolutionary synthesis, particularly as more genomes and transcriptomes are sequenced across the plant phylogeny. Connecting these genomic resources to other knowledge resources—for example, taxonomic and phylogenetic resources—will be imperative to future investigations of adaptation and speciation. Recently, Parr et al.

(2012) reviewed the progress and needs of ‘evolutionary informatics’, a specialised field that includes the integration of information from the Tree of Life with all associated, available data (or metadata), for example morphological, geospatial, ecological, genomic, and other data. iPToL (iPlant Tree of Life) from the iPlant Collaborative (http://iplantcollaborative.org/) is another example of a community effort to develop informatics tools that support plant phylogenetics and data integration. In addition, taxon- or project-specific database resources are growing in number and provide a vast amount of information directly relevant or complementary to the Tree of Life. For example, the Silene EST annotation database (SiESTs), a genomic resource for Silene L. and Dianthus L. (Caryophyllaceae), comprises ESTs generated by 454 sequencing (Blavet et al. 2011). BrassiBase (Koch et al. 2012) is another resource currently in development and will represent a compilation of knowledge on Brassicaceae. Lastly, data and tools from amborella.org and ancangio.uga.edu (AAGP: Ancestral Angiosperm Genome Project) will facilitate rooting of genomic data for all other angiosperms (Zuccolo et al. 2011).

In addition to phylogenetics, evolutionary informatics is taking the burgeoning field of adaptive genomics to new levels and will contribute to many other fields (Stapley et al. 2010; Arnold et al. 2013). For example, genes conferring adaptations for growth on serpentine (ultramafic) soils were recently elucidated in Arabidopsis lyrata (L.) O’Kane and Al-Shehbaz (Turner et al. 2010) using a combination of NGS and comparative approaches. With the newly published full genome sequence of A. lyrata (Hu et al. 2011), further connections between genotype and phenotype across various projects with this species will be possible. Other examples include current investigations

of the extremophile mustard, Thellungiella parvula (Schrenk) Al-Shehbaz and O’Kane, whose genome was recently sequenced (Dassanayake et al. 2011). Adaptive genomics of T. parvula may help us better understand adaptations to saline and poor-quality soils. Integrating complex adaptation processes with genomic sequence data or other data available from the community continues to provide new insights into the evolution of lineages.

By taking full advantage of newly available tools and resources for evolutionary research, we can begin to address many unanswered questions in biology, including those related to the origin of plant species and environmental adaptation (Darwin 1859; Kane et al. 2011; Strasburg et al. 2012). However, we must begin our transition to NGS now in order to speed our progress in reconstructing phylogenetic relationships among all plant species. We hope that the experimental design approaches, practical considerations, and methodologies described here will make NGS more tangible to systematic researchers, and we encourage the integration of such approaches and practices into research programmes as soon as possible.

Table 3-1. Summary of targeted sequencing methods.

Method | Description | Technology
Transcriptome | Reduces genomes to expressed genes | RNA-seq & GA II (Illumina Inc., San Diego, CA, USA)
Hybridisation/sequence capture | Reduces genomes using probes or baits to capture sequence of interest on a slide or array in solution | SureSelect (Agilent Technologies Inc., Santa Clara, CA, USA); NimbleGen SeqCap EZ (NimbleGen Inc., Madison, WI, USA)
Exome | Kits designed to enrich the exome | TruSeq Kit (Illumina Inc., San Diego, CA, USA)
Amplicon | PCR amplification of targeted sequences | Access Array (Fluidigm Corp., San Francisco, CA, USA); RainStorm (RainDance Technologies Inc., Lexington, MA, USA); Bybee et al. 2011
RADSeq | Anonymous enzymatic digested portions of the genome | Davey and Blaxter 2010

Table 3-2. Summary of sequence formats and data processing tools for NGS. Resources and Information

NGS sequence formats

Sequencing technology | Sequence formats | References
Roche 454 | SFF | http://www.my454.com
PacBio | HDF5 | http://www.pacificbiosciences.com
Illumina | FASTQ | http://www.illumina.com; Cock et al. (2009)
Ion Torrent | SFF/FASTQ | http://www.iontorrent.com
ABI Solid | CSFASTA/CSQUAL | http://www.appliedbiosystems.com

NGS sequence format conversion software

Conversion software | Format conversions | References
sffinfo (part of Newbler Assembly package) | SFF to FASTA/QUAL | http://454.com/products-solutions/analysis-tools/index.asp
Pbh5tools | HDF5 to FASTA/QUAL or FASTQ formats | https://github.com/PacificBiosciences/pbh5tools
Pyrus | CSFASTA/CSQUAL to FASTQ or FASTA/QUAL; SFF to FASTQ or FASTA/QUAL; FASTQ to FASTA/QUAL; FASTA and QUAL to FASTQ | Rhie et al. (2010)
sff_extract | SFF to FASTA/QUAL | http://bioinf.comav.upv.es/sff_extract
fq_all2std.pl | FASTQ to FASTA/QUAL or other FASTQ variants | http://maq.sourceforge.net; Cock et al. (2009)
solid2fastq.pl | CSFASTA/CSQUAL to FASTQ | http://maq.sourceforge.net
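As an illustration of what a FASTQ to FASTA/QUAL conversion involves, the following sketch converts FASTQ records to FASTA and QUAL text. It is not the code of any tool listed above. It assumes unwrapped FASTQ with exactly four lines per record and, by default, Phred+33 quality encoding; the function name and the `offset` parameter are introduced here for illustration.

```python
def fastq_to_fasta_qual(fastq_lines, offset=33):
    """Convert four-line FASTQ records to FASTA and QUAL text.

    Assumes unwrapped FASTQ (four lines per record) and a quality
    encoding offset of 33 unless another offset is given.
    """
    lines = [line.rstrip("\n") for line in fastq_lines]
    fasta, qual = [], []
    for i in range(0, len(lines) - 3, 4):
        header, seq, _, quality = lines[i:i + 4]
        name = header[1:]  # drop the leading "@"
        # Decode each quality character to its numeric Phred score.
        scores = " ".join(str(ord(ch) - offset) for ch in quality)
        fasta.append(f">{name}\n{seq}")
        qual.append(f">{name}\n{scores}")
    return "\n".join(fasta), "\n".join(qual)


record = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
fasta_text, qual_text = fastq_to_fasta_qual(record)
print(fasta_text)
print(qual_text)
```

Older Illumina pipelines used other quality offsets, so the encoding of a given file should be confirmed before conversion.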

NGS quality control and trimming software

Quality control and trimming | Summary | References
BIGpre | Input NGS format: FASTA and QUAL or FASTQ or SFF. Functionality: quality visualisation, quality trimming, and artificial duplicate removal | Zhang et al. (2011)
FastQC, PIQA, SolexaQA | Input NGS format: FASTQ. Functionality: quality visualisation | http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc; Martinez-Alcantara (2009); Cox et al. (2010)
FASTX-Toolkit | Input NGS format: FASTA/FASTQ. Functionality: quality visualisation, quality trimming, and artificial duplicate removal | http://hannonlab.cshl.edu/fastx_toolkit/
Trimmomatic | Input NGS format: FASTQ. Functionality: quality trimming | http://www.usadellab.org/cms/index.php?page=trimmomatic
CDHIT-454 | Input NGS format: FASTA. Functionality: artificial duplicate removal | Niu et al. (2010)
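The quality trimming function listed above can be illustrated with a minimal sketch that removes low-quality bases from the 3' end of a read. This is not the algorithm of any tool in the table, which implement more elaborate schemes; the threshold `min_q`, the Phred+33 `offset`, and the function name are assumptions made for the example.

```python
def trim_3prime(seq, quality, min_q=20, offset=33):
    """Trim trailing bases whose Phred score is below min_q.

    Assumes `quality` is a FASTQ quality string of the same length as
    `seq`, encoded with the given offset (Phred+33 by default).
    """
    end = len(seq)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], quality[:end]


print(trim_3prime("ACGTAC", "IIII#!"))
```

Reads that become very short after trimming are usually discarded, and paired reads need to be kept in step when one mate is removed.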

NGS genomic or transcriptomic assembly

Application | Summary | References
Genomic NGS data error correction | Reviews and compares available software for de novo error correction of genomic reads generated from Illumina sequencing technologies. | See Yang et al. (2012) for summary
De novo and reference-based genomic assembly | Overview and evaluation of mapping and assembly algorithms and software for genomic data generated from next generation sequencing technologies. | See Bao et al. (2011) for summary
De novo and reference-based transcriptomic assembly | Summarises recent methodologies and software implemented for assembly of transcriptome data generated from next generation sequencing technologies. | See Martin and Wang (2011) for summary

Clustering and orthology testing

Precompiled orthology databases | Hosted species | References
COG-database, eggNOG, Ensembl Compara, OMA Browser, OrthoMCL-DB, and RoundUp | Archaea, bacteria and eukaryota | Tatusov et al. (2000); Jensen et al. (2008); Hubbard et al. (2007); Altenhoff et al. (2004); Schneider et al. (2007); Chen et al. (2006); DeLuca et al. (2006)
HomoloGene and InParanoid | Eukaryotes | Wheeler et al. (2008); Berglund et al. (2008)
GreenPhyl | Members of Plantae kingdom | Rouard et al. (2008)
TreeFam | Animals, yeast (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and plants (Arabidopsis thaliana, Oryza sativa) | Ruan et al. (2006); Li et al. (2006)
CD-HIT and UCLUST | Cluster protein and nucleotide sequences | Li et al. (2006); Edgar et al. (2010)
MultiMSOAR, MultiParanoid, OrthoMCL, and Proteinortho | Cluster and identify orthologous genes across multiple species | Shi et al. (2010); Alexeyenko et al. (2006); Li et al. (2003); Lechner et al. (2011)

Figure 3-1. Experimental design pathways for NGS phylogenetics. Our flowchart highlights many important questions to help guide researchers in selecting an appropriate pathway for phylogenetic marker development, sequencing, and analyses. We consider both pathways that make use of existing NGS data resources and pathways for new data acquisition on NGS platforms. Advanced considerations for new data acquisition are indicated with stars and include variables such as experimental scale (e.g. number of sampled taxa and loci), multiplexing, and desired depth of NGS coverage per sample.

Figure 3-2. Marker development pipeline for targeted sequencing. Input data from NGS is shown here as a part of a two-step, bioinformatic process. Our flowchart generally applies to processes for nuclear marker development. However, we also show how organellar data can be mined from NGS runs.

Figure 3-3. Data processing with NGS for phylogenetics. Our flowchart illustrates in general terms the post-NGS processing of new data for phylogenetic marker development, including processes for file format conversion, quality control and trimming of raw read data, assembly, data mining for orthologous markers, and primer/probe design. Newly acquired, targeted NGS data (dashed box) must follow a portion of the same pipeline prior to phylogenetic analyses.

CHAPTER 4 RESOLVING RELATIONSHIPS WITH ORGANELLAR PHYLOGENOMIC DATA: A CASE STUDY WITH HEDEOMA AND ALLIED GENERA FROM MENTHINAE (LAMIACEAE)

Introduction

Hedeoma Pers. (Lamiaceae), or false pennyroyal, is a genus of perennial and annual herbs from the arid and montane regions of the southwestern United States (U.S.) and Mexico and a few disjunct localities in South America. The genus comprises 43 species, one subspecies, and eight varieties (Table 4-1) classified in the following subgenera and sections: H. subgenus Ciliata Irving, H. subgenus Poliominthoides Irving, H. subgenus Saturejoides section Alpine Irving, H. subgenus Saturejoides section Saturejoides Irving, and H. subgenus Hedeoma.

The circumscription of Hedeoma has remained contentious throughout much of its taxonomic history, and its generic limits have been subject to widely divergent morphological interpretations (e.g. Persoon 1807; Bentham 1832-1836; Bentham 1848; Gray 1878; Briquet 1897; Epling and Stewart 1939). Species from seven other genera, including Mosla Buch.-Ham. ex Maxim., Rhododon Epling, Stachydeoma Small, Poliomintha A. Gray, Eriothymus J. A. Schmidt, Hoehnea Epling, and Rhabdocaulon (Benth.) Epling, have been merged with Hedeoma at some point in time, and some taxa from Cunila D. Royen ex L., Hesperozygis Epling, Melissa L., Micromeria Benth., and Satureja L. have been transferred to the genus on the basis of morphology.

As currently delimited, Hedeoma is characterized by an herbaceous (or occasionally semishrub) habit, gibbous or saccate calyces with zygomorphic symmetry and differentiated upper (deltoid) and lower (lanceolate) sets of teeth, well-defined calyx annulus, and mucilaginous nutlets (Irving 1980). However, none of these characters is

unique to Hedeoma, and systematic treatments have therefore largely focused on a combination of habit, calyx, and nutlet characters to distinguish Hedeoma from other allied genera of Menthinae (Nepetoideae: Mentheae) (Epling and Stewart 1939; Irving 1972; Irving 1980). Poliomintha A. Gray, in particular, shares a strikingly similar morphology with Hedeoma but differs slightly in its character combinations, including a habit, tubular and radially symmetrical calyces with subequal teeth, absent or irregular calyx annulus, and oblong nutlets that are not mucilaginous (Epling and Stewart 1939; Irving 1972); both genera have tubular corollas with two .

In addition to taxonomic problems related to the generic limits of Hedeoma, a number of problems also exist at the specific and intraspecific levels. Patterns of morphological variation within several Hedeoma species (e.g. H. costata A. Gray, H. drummondii Benth., and H. nana (Torr.) Briq. species complexes) appear correlated with patterns of disjunct geographic distribution and preference. These patterns may represent either cryptic species derived from recent speciation events involving isolation by distance or phenology, polyploidization events, or ongoing speciation processes. A well-resolved phylogenetic framework in which to investigate these patterns of morphological variation and possible causative hypotheses is currently lacking.

Phylogenetic Relationships within Hedeoma and Among Closely Allied Genera of Menthinae

That Hedeoma falls within Menthinae (Nepetoideae: Mentheae) is well established with molecular evidence (Wagstaff et al. 1995; Walker and Sytsma 2007; Bräuchler et al. 2010; Drew and Sytsma 2012), but its monophyly and relationship to other genera remain ambiguous. Previous phylogenetic results suggest a close

relationship among at least some members of Poliomintha and Hedeoma (Wagstaff et al. 1998; Bräuchler et al. 2010; Drew and Sytsma 2012), but sampling has been limited to a small number of taxa and molecular markers.

Godden (2009) conducted the most extensively sampled phylogenetic study to date of Hedeoma and allied genera; the sampling design included 22 species of Hedeoma (including representatives from all subgenera), six species of Poliomintha, four species of Hesperozygis, and 23 other species from Menthinae. Phylogenetic analyses of the nuclear ribosomal internal transcribed spacer (nrITS) and two plastid markers (trnL-trnF and rpl32-trnL) revealed a close relationship among most species of Poliomintha and Hedeoma. However, resolution and bootstrap (BS) support for relationships among recovered clades were especially poor along the backbone of the phylogeny. Many sampled genera also appeared polyphyletic (including Clinopodium L., Hedeoma, Hesperozygis, and Poliomintha), and there was no evidence to suggest that the type species of either Poliomintha or Hedeoma (P. incana A. Gray and H. pulegioides Pers., respectively) was included in the same clade as the majority of the Poliomintha and Hedeoma species. In fact, the results of a Shimodaira-Hasegawa test (Shimodaira and Hasegawa 1999) suggested that P. incana was not related to the remaining Poliomintha species (P = 0.009; α = 0.05).

The phylogenetic results of Godden (2009) may have important taxonomic implications, especially with regard to the generic limits of Poliomintha. However, the taxonomic placement of most species of Poliomintha depends on their relationships among the species of Hedeoma, which were not exhaustively sampled. Additional phylogenetic evidence is needed in order to assess the monophyly of Hedeoma, to


clarify its membership, and to infer the relationships of Hedeoma species to species of

Poliomintha and other genera.

A Model System for Evaluating Organellar Phylogenomic Approaches

A wealth of biological data exist for Hedeoma and Poliomintha, and when combined with a resolved phylogenetic framework, they provide exciting opportunities to explore questions related to biogeography, character evolution, and speciation. For example, the morphology and distribution of Poliomintha and Hedeoma have been well studied (Epling and Stewart 1939; Irving 1972; Irving 1980), and Hedeoma has one of the most complete cytological records of any plant genus (Irving 1976). Moreover, inter- and intrageneric hybridization potential has been described from artificial hybridization experiments involving both Hedeoma and Poliomintha (Irving et al. 1979).

Unfortunately, phylogenetic relationships within Hedeoma and related genera have proven extraordinarily challenging to reconstruct, hindering downstream comparative studies that explore many interesting biological questions. Phylogenetic patterns reported in previous studies—including short branch lengths, insufficient resolution, and low bootstrap support within and among recovered clades—are not exclusive to Hedeoma and allies, but are shared across many groups of Menthinae (e.g.

Prather et al. 2002; Trusty et al. 2004, 2005; Bräuchler et al. 2005, 2010; Meimberg et al. 2006; Edwards et al. 2006; Oliveira et al. 2007; Schmidt-Lebuhn 2008; Katsiotis et al.

2009; Drew and Sytsma 2012) as well as across other distantly related groups of

Lamiaceae (see Chapter 2). It is possible that the lack of resolution and support is a consequence of either incomplete taxon sampling or insufficient sequence data, or both.

Most Menthinae phylogenies published to date have been inferred from only a few plastid loci (mostly trnL-trnF) and the nrITS (e.g. Trusty et al. 2005; Oliveira et al.


2007; Edwards et al. 2008; Godden 2009; Bräuchler et al. 2010; Drew and Sytsma

2012), and very few studies have explored the wealth of information in nuclear genes

(e.g. Edwards et al. 2008; Curto et al. 2012; Drew and Sytsma 2013). The overall paucity of genomic resources for Lamiaceae has hindered identification of useful loci for phylogenetic investigations. Additional character data appear necessary to resolve phylogenetic relationships within and among many groups in the family. However, the amount and types of data necessary (e.g. organellar vs. nuclear data) remain unclear.

Recently, new methods for targeted capture and next-generation sequencing

(NGS) of both organellar (e.g. Moore et al. 2006, 2010; Parks et al. 2009; Givnish et al.

2010; Griffin et al. 2011; Stull et al. 2013) and nuclear (e.g. Griffin et al. 2011; Lee et al.

2011; Straub et al. 2011) genomes have become available, making possible inexpensive and rapid acquisition of large-scale data for phylogenomic studies of

Lamiaceae (reviewed in Godden et al. 2012; see Chapter 3). For this study, I selected

Hedeoma and allied genera as a model system in which to test new sequence-capture techniques for plastid data (Stull et al. 2013), and I evaluate the utility of increased taxon sampling and plastid phylogenomic approaches as a possible solution for resolving relationships within and among closely related groups in Lamiaceae.

Materials and Methods

Plant Materials and Taxonomic Sampling

In all, 96 accessions were sampled for the study, including 53 accessions from

Hedeoma (39/44 species; 88 percent of total recognized taxonomic diversity) and 41 accessions representing taxa from 15 genera of New World Menthinae: Acanthomintha

(A. Gray) A. Gray (1/4), Blephilia Raf. (1/3), Clinopodium (14/100), Conradina A. Gray

(1/6), Cunila (5/15), Dicerandra Benth. (1/9), Glechon Spreng. (1/7), Hesperozygis (3/8),


Kurzamra Kuntze (1/1), Monarda L. (1/20), Monardella Benth. (1/30), Poliomintha (6/7),

Pycnanthemum Michx. (1/21), Rhabdocaulon (1/8), and Rhododon (2/2).6 One accession each from two additional genera, Mentha L. and Thymus L., were sampled as outgroups for the phylogenetic analyses. All taxon sampling was based on Irving (1980) and a family-wide phylogenetic hypothesis for Lamiaceae, representing a synthesis of all available molecular data (Godden et al., unpublished; see Chapter 2).
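The sampling proportions listed above can be tabulated to make the coverage of each genus explicit. The following is a minimal sketch in Python; the dictionary simply transcribes the counts given in the text, and the names SAMPLING and coverage are illustrative only.

```python
# Species sampled vs. species recognized, transcribed from the text above.
SAMPLING = {
    "Hedeoma": (39, 44),
    "Acanthomintha": (1, 4),
    "Blephilia": (1, 3),
    "Clinopodium": (14, 100),
    "Conradina": (1, 6),
    "Cunila": (5, 15),
    "Dicerandra": (1, 9),
    "Glechon": (1, 7),
    "Hesperozygis": (3, 8),
    "Kurzamra": (1, 1),
    "Monarda": (1, 20),
    "Monardella": (1, 30),
    "Poliomintha": (6, 7),
    "Pycnanthemum": (1, 21),
    "Rhabdocaulon": (1, 8),
    "Rhododon": (2, 2),
}


def coverage(genus):
    """Return the percentage of recognized species sampled for a genus."""
    sampled, recognized = SAMPLING[genus]
    return 100.0 * sampled / recognized


if __name__ == "__main__":
    # Print genera from best to worst sampled.
    for genus in sorted(SAMPLING, key=coverage, reverse=True):
        sampled, recognized = SAMPLING[genus]
        print(f"{genus:<14}{sampled:>3}/{recognized:<4}{coverage(genus):6.1f}%")
```

For Hedeoma, 39 of 44 species is about 88.6 percent, consistent with the 88 percent stated in the text; the 15 other ingroup genera contribute 40 sampled species in total.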

Leaf tissues were acquired through field collections and destructive sampling of specimens from the following herbaria: ASU, CAS, CONC, FLAS, HUH, IEB, MEXU, MI,

MO, MSC, NY, SR, TEX/LL, UC, and US. Research permits from the appropriate government agencies were obtained for all US and Mexico field research activities, and permission from each herbarium was granted for destructive sampling of all specimens

(see Table 4-2 for voucher information).

DNA Extraction, Shearing, and Genomic Library Preparation

Approximately 10-15 µg of genomic DNA was extracted from 20-50 mg of tissue using the CTAB method of Doyle and Doyle (1987) and the small-scale extraction techniques of Loockerman and Jansen (1996). DNA extractions were evaluated for purity and quantified using a NanoDrop 1000 spectrophotometer (Thermo Fisher Scientific, Inc.,

Waltham, Massachusetts, USA).

Total genomic DNA samples were sheared to 400-base pair (bp) fragments in

130-µL volumes, each containing 5 µg DNA, using a Covaris E220 focused-ultrasonicator (Covaris, Inc., Woburn, Massachusetts, USA) and manufacturer-specified

6 The total number of species sampled from each genus versus the total number of recognized species is shown as a proportion in parentheses. Species diversity data were modified from Table 1 of Bräuchler et al. (2010).


settings at the University of Florida Interdisciplinary Center for Biotechnology Research

(UF-ICBR). Sheared samples representing individual accessions were combined into a single tube, followed by cleaning with two-thirds the total volume AMPure XP (Beckman

Coulter, Inc., Brea, California, USA).

Genomic libraries were prepared using the NEBNext DNA Library Prep Master

Mix Set for Illumina (New England Biolabs, Inc., Ipswich, Massachusetts, USA) and a set of 24 unique 5-bp barcode adapters from Craig et al. (2008). A modified version of the NEBNext manufacturer’s protocol was used, which included half-volume reactions for all end repair, dA-tailing, and adapter ligation reactions to reduce per-sample costs.

Additional modifications to the manufacturer’s protocol included a 41.5-µL starting volume (comprising 10-15 µg sonicated DNA) for the initial end repair step and the addition of 1 µL DNA Polymerase I, Large (Klenow) Fragment (New England Biolabs,

Inc.) to repair potential nicks in the DNA. Following adapter ligation, 400-500-bp fragments were excised and purified from agarose gels using Freeze ‘N Squeeze gel extraction spin columns (Bio-Rad, Hercules, California, USA) and cleaned with 0.65X total purified volume AMPure XP to remove unligated adapters. Size-selected libraries were enriched in two 25-µL reactions using Phusion High-Fidelity PCR Master Mix (New

England Biolabs, Inc.) with the following PCR profile: 98ºC for 30 s; 11-16 cycles of

98ºC for 10 s, 65ºC for 30 s, and 72ºC for 30 s; and an extension at 72ºC for 5 min, followed by a final hold at 4ºC. Individual enrichments were visualized using gel electrophoresis, combined into a single volume, and cleaned using 0.90X total volume

AMPure XP to remove adapter dimers. All enriched samples were submitted for DNA quantitation on a 2100 Bioanalyzer (Agilent Technologies, Inc., Santa Clara, California,


USA) at UF-ICBR, followed by pooling into one of four equimolar mixes, each with 24 uniquely barcoded samples.
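Equimolar pooling means that each library contributes the same number of molecules to a mix, not the same mass. The arithmetic can be sketched as follows, assuming that each library is described by a mass concentration (ng/µL) and a mean fragment length (bp), and using the common approximation of 660 g/mol per base pair of double-stranded DNA. The function names and any input values are hypothetical; this is not the procedure used at UF-ICBR.

```python
AVG_BP_MASS = 660.0  # approximate g/mol per base pair of double-stranded DNA


def molarity_nM(conc_ng_per_ul, mean_length_bp):
    """Convert a dsDNA mass concentration (ng/uL) to molarity (nM)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_length_bp)


def pooling_volumes(libraries, fmol_each):
    """Volume (uL) of each library that contributes `fmol_each` femtomoles.

    `libraries` maps a sample name to (ng/uL, mean fragment length in bp).
    1 nM equals 1 fmol/uL, so volume = fmol / nM.
    """
    return {
        name: fmol_each / molarity_nM(conc, length)
        for name, (conc, length) in libraries.items()
    }
```

Under these assumptions, a library at twice the molarity of another contributes half the volume to the pool.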

Plastid Genome Enrichment and Sequencing

Plastid genomic fragments were captured from each of the four equimolar mixes using a custom set of 55,000 120-bp RNA probes (or “baits”) and the SureSelect kit

(Agilent Technologies Inc.; see Stull et al. 2013 for methods, including deviations from the manufacturer’s protocol). Each enriched library pool was amplified using Phusion

High-Fidelity Master Mix (New England Biolabs, Inc.) and the following PCR profile:

98ºC for 30 s; 18 cycles of 98ºC for 10 s, 57ºC for 30 s, and 72ºC for 30 s; and an extension at 72ºC for 7 min, followed by a final hold at 4ºC. The amplified products were cleaned using AMPure XP beads and sequenced in one of four lanes (i.e. one 24-plex product per lane) of the Illumina GAIIx (Illumina, Inc., San Diego, California, USA) with

100 cycles and single-end reads at the UF-ICBR.

Data Processing, Mining, and Assembly

De-multiplexing, quality control, and trimming of Illumina raw read data were performed using the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) and Sickle

(https://github.com/najoshi/sickle; quality threshold=33, length threshold=40) modules available through the UF High Performance Computing Center (HPCC). Processed data were assembled using a de novo reference-guided approach implemented within the

Assembly by Reduced Complexity (ARC) pipeline (https://github.com/ibest/ARC); reads were mapped against annotated gene and intergenic spacer targets corresponding to the large single-copy (LSC), inverted repeat B (IRb), and small single-copy (SSC) regions of the complete Origanum vulgare L. chloroplast genome (Lukas and Novak


2013; GenBank: JX880022). The entire ycf1 gene, which overlaps the SSC and IRa regions, was included as a target.
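De-multiplexing, the first processing step named above, assigns each read to a sample according to its barcode. The idea can be sketched as follows, assuming that the 5-bp barcode sits at the start of each single-end read and that only exact matches are accepted. The function name and data layout are illustrative, and this is not the FASTX-Toolkit implementation.

```python
from collections import defaultdict

BARCODE_LENGTH = 5  # the libraries above used 5-bp barcode adapters


def demultiplex(reads, barcode_to_sample):
    """Group (sequence, quality) reads by their leading barcode.

    Reads whose first five bases match no known barcode are collected
    under the key "unassigned". The barcode is trimmed from both the
    sequence and the quality string of every assigned read.
    """
    bins = defaultdict(list)
    for seq, qual in reads:
        sample = barcode_to_sample.get(seq[:BARCODE_LENGTH])
        if sample is None:
            bins["unassigned"].append((seq, qual))
        else:
            bins[sample].append((seq[BARCODE_LENGTH:], qual[BARCODE_LENGTH:]))
    return dict(bins)
```

Exact matching is the simplest policy; tolerating one mismatch would recover more reads at the cost of possible misassignment among similar barcodes.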

Assembled contigs were clustered by annotation using a custom Python script and imported into Geneious 6.1.4 (Biomatters, Inc., Auckland, New Zealand) for further processing. Accessions with multiple contigs for a given locus were mapped to the corresponding O. vulgare reference sequence; consensus sequences, which included question marks (i.e. for parsimony) or “N” (i.e. for maximum likelihood) to designate regions of missing data relative to the reference, were constructed and trimmed.

Alignments and Phylogenetic Analyses

Data were initially aligned with MUSCLE (Edgar 2004) and manually edited as necessary to remove areas of ambiguous alignment. Aligned contigs were also trimmed to the reference sequence at their 5’ and 3’ ends to eliminate possible overlap among assembled contigs comprising individually aligned partitions. Finally, processed alignments were concatenated into a supermatrix according to the order of genes in the reference sequence of O. vulgare (GenBank: JX880022) using SequenceMatrix 1.7.8

(Vaidya et al. 2011) for phylogenetic analysis as a single locus.
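Concatenation into a supermatrix can be illustrated with a short sketch. It assumes that each locus is held as a mapping from accession name to aligned sequence and that accessions missing from a locus are padded with question marks. It is not the SequenceMatrix implementation, and the function name is hypothetical.

```python
def concatenate(alignments, missing="?"):
    """Concatenate per-locus alignments into one supermatrix.

    `alignments` is an ordered list of dicts, one per locus, each mapping
    an accession name to its aligned sequence. Accessions absent from a
    locus are padded with `missing` so every row has the same length.
    """
    taxa = sorted({name for locus in alignments for name in locus})
    matrix = {name: [] for name in taxa}
    for locus in alignments:
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must be aligned")
        width = lengths.pop()
        for name in taxa:
            matrix[name].append(locus.get(name, missing * width))
    return {name: "".join(parts) for name, parts in matrix.items()}
```

The order of the input list determines the order of loci in the supermatrix, so passing loci in reference-genome order reproduces the arrangement described above.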

Two phylogenetic analyses were conducted using maximum likelihood (ML) and maximum parsimony (MP) optimality criteria, respectively. Branch support for the resulting trees from each analysis was evaluated using bootstrap analyses (Felsenstein

1985). The reference sequence of O. vulgare (GenBank: JX880022) was included as a third outgroup for both analyses. Gaps and missing data were ignored.
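The nonparametric bootstrap used to assess branch support resamples alignment columns with replacement, and a tree is then inferred from each pseudoreplicate. The resampling step alone can be sketched as follows, assuming an alignment stored as a mapping from accession name to aligned sequence. This is an illustration of the principle, not code from RAxML or PAUP.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Return one bootstrap pseudoreplicate of an alignment.

    `alignment` maps accession names to aligned sequences of equal length.
    Columns are drawn with replacement until the replicate has as many
    columns as the original, and the same columns are used for every row.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }
```

Passing a seeded generator such as random.Random(2089) makes a replicate reproducible; the bootstrap support for a clade is the percentage of replicate trees in which that clade appears.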

The ML analysis was conducted using Randomized Axelerated Maximum

Likelihood (RAxML) version 8.0.25 (Stamatakis 2014) and the following commands: raxmlHPC-PTHREADS-SSE3 -f a -m GTRGAMMA -p 2089 -N 10 -T 16 -x 31538 -N 100. Rapid bootstraps (RBS) were performed with 100 replicates, followed by both fast and slow ML heuristic searches (10 replicates) using RBS starting trees and the GTR+Γ model.

The MP analysis was conducted in PAUP version 4.0a136 (Swofford 2002) using a heuristic search with 100 random addition replicates and tree bisection-reconnection

(TBR) branch swapping. One tree was held during each stepwise addition, and no more than 100 trees were saved per replicate. BS support values were obtained for the resulting trees using a heuristic search with TBR branch swapping, 100 replicates, and ten random addition sequences per BS replicate, saving no more than 100 trees per replicate.

Results

Descriptive Data

The Illumina GAIIx sequencing run yielded a total of 197.75 Gbp of sequence data for 96 samples distributed across four lanes, of which 11.04 Gbp of data were retained after quality filtering (Figure 4-1). The number of retained reads per sample was highly variable and ranged from 15,014 (Rhabdocaulon lavanduloides [Benth.]

Epling) to 6,301,120 (Poliomintha bustamanta B. L. Turner) after quality filtering, with a mean recovery of 1,150,073 (median = 993,268; standard deviation = 2,885,113) reads.

The matrix (Object 4-1) included 97 samples, of which 3 were outgroups. The length of the plastid dataset was 128,716 base pairs (including gaps and missing data).

Sequence data from 113 genes (rps12 with two intervals), 104 intergenic spacers, and one pseudogene (ycf1) were included in the matrix (Object 4-2), representing a nearly complete plastome sequence spanning the LSC, IRb, and SSC regions of the Origanum


vulgare reference genome (GenBank: JX880022). Sequence data from one gene

(tRNA-Arg[UCU]) and one intergenic spacer (rpl2-rpl23) were not recovered by our assembly pipeline and, consequently, were not included in the phylogenetic analyses.

The final matrix included 120,954 constant characters and 4,854 variable characters, of which 2,908 were parsimony-informative. Sequence data representing individual accessions ranged from 37,105 to 128,663 bp (not including gaps; mean =

124,604; median = 128,005; standard deviation = 12,049). The number of plastome partitions recovered per accession ranged from 75 to 218 (mean =

213; median = 218; standard deviation = 19) (Object 4-1).
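The counts reported above can be cross-checked with simple arithmetic. The sketch below uses only figures from the text. Whether the 2,908 parsimony-informative characters are a subset of the 4,854 variable characters or are counted separately from them is not stated explicitly, so both readings are computed; the second reading is an assumption.

```python
MATRIX_LENGTH = 128_716  # aligned positions, gaps and missing data included
GENES, SPACERS, PSEUDOGENES = 113, 104, 1
CONSTANT, VARIABLE, INFORMATIVE = 120_954, 4_854, 2_908

# Total number of plastome partitions in the matrix.
partitions = GENES + SPACERS + PSEUDOGENES

# Reading 1: the 2,908 informative characters are a subset of the 4,854.
nested_total = CONSTANT + VARIABLE
# Reading 2 (an assumption): the 4,854 are variable but uninformative.
disjoint_total = CONSTANT + VARIABLE + INFORMATIVE

if __name__ == "__main__":
    print("partitions:", partitions)
    print("unaccounted positions, reading 1:", MATRIX_LENGTH - nested_total)
    print("unaccounted positions, reading 2:", MATRIX_LENGTH - disjoint_total)
```

The partition total of 218 matches the maximum number of partitions recovered per accession. Only the second reading accounts for every position in the 128,716-bp matrix.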

Phylogenetic Analyses

The ML tree (-ln L = 254797.6572) recovered a monophyletic ingroup (BS =

100%) (Figure 4-2 A and B; estimated model parameters are reported in Table 4-3).

The outgroup taxa also comprised a strongly supported clade (BS = 100%), with

Thymus L. as sister to a clade (BS = 100%) of Mentha and Origanum.

Within the ingroup, the first five nodes (BS < 50%) constituted a poorly supported grade of five clades whose interrelationships are uncertain. Of these clades, only two received strong support: a sister relationship between Clinopodium vulgare L. and

Hesperozygis marifolia (S. Schauer) Epling (Figure 4-2 A: Clade I; BS = 100%) and a

California Floristic Province clade of Clinopodium species (Figure 4-2 A: Clade II; BS =

100%) that included C. ganderi (Epling) Govaerts, C. chandleri (Brandegee) P. D.

Cantino & Wagstaff, and C. douglasii (Benth.) Kuntze.

Seven other noteworthy clades were identified in the ML tree that had either taxonomic or geographic affinities (Figure 4-2 A and B: Clades III through IX). These clades formed a grade along with two solitary taxa, Hedeoma patrina W. S. Stewart and


Clinopodium maderense (Henr.) Govaerts, and their interrelationships were fully resolved and supported by the BS results (see Figure 4-2 A and B). The grade included the following (in ascending branching order): (1) Trans-Mexico Volcanic Belt Clade III

(BS < 70%); (2) Eastern South American Clade IV (BS = 99%); (3) South American

Clade V (BS = 87%); (4) H. patrina; (5) Southeastern USA Clade VI (BS = 96%); (6)

North American Clade VII (BS = 95%); (7) Rhododon Clade VIII (BS = 100%); (8) C. maderense; and (9) Southwestern USA and Mexico Clade IX (BS = 98%).

The monophyly of Hedeoma was not supported in the ML topology. The majority of Hedeoma species were placed within the large Clade IX, which also included taxa from Clinopodium, Cunila, Kurzamra, and Poliomintha. Seven Hedeoma species were placed outside of Clade IX (e.g. H. bella (Epling) Irving, H. crenata Irving, H. hispida

Pursh., H. mandoniana Wedd., H. media Epling, H. patrina W. S. Stewart, and H. piperita Benth.). The phylogenetic placements of six of these species were supported by moderate to strong BS values (> 87%), providing evidence for close relationships with taxa from at least four other genera, including Clinopodium (Clades III and IV); Eastern

South American Cunila, Glechon, Hesperozygis (Clade V); and Poliomintha (Clade VII).

Relationships among the majority of Hedeoma species and other taxa in Clade

IX were largely unsupported by BS values. The sister taxa, H. todsenii Irving and H. quercetora Epling (BS = 92%), were strongly supported as sister (BS = 99%) to the remaining taxa in Clade IX. However, the remainder of the phylogenetic backbone of

Clade IX was unsupported (BS < 50%). Several well-supported subclades exhibited geographic affinities related to species distributions in the Sierra Madre Oriental and adjacent regions of the Chihuahuan Desert and the Madrean Sky Islands of


southeastern Arizona, southwestern New Mexico, and northwestern Mexico. Two strongly supported subclades (BS > 94%; Figure 4-2 B: Clades XIa and XIb) were geographically distinct: a Sierra Madre Occidental Hedeoma Clade XIa including H. jucunda Greene, H. patens M. E. Jones, and H. floribunda Standl. and a Sonoran

Desert Hedeoma Clade XIb including H. nana var. macrocalyx W. S. Stewart, H. matomiana Moran, and H. tenuiflora Brandegee.

The parsimony analysis of the combined plastid data recovered 9,400 most-parsimonious trees with a length of 9,652 steps, consistency index (CI) of 0.861, retention index (RI) of 0.856, and a rescaled consistency index (RC) of 0.737. Trees resulting from the ML and MP analyses varied slightly in their topologies. Most differences occurred at either poorly resolved or poorly supported nodes, except for the following examples.

In the strict consensus of MP trees (MP BS values are presented in Figure 4-2 A and B), Clinopodium acinos (L.) Kuntze was strongly supported (MP BS = 100%) as sister to Clade I (MP BS = 100%), which in turn was sister to all remaining taxa (MP BS

= 100%). Secondly, Rhabdocaulon lavanduloides (Benth.) Epling, which was unsupported as sister to Clinopodium acinos in the ML topology, was placed within a moderately supported Eastern South American Clade V (MP BS = 88%) in the strict consensus of MP trees; relationships within this clade were poorly resolved. Lastly, the

Trans-Mexico Volcanic Belt Clade III was better supported in the MP topology (MP BS =

100% vs. ML BS = 53%), but the placement of Clinopodium brownei (Sw.) Kuntze was unresolved in a well-supported trichotomy (MP BS = 97%) of fully supported subclades

(see Figure 4-2 A).
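The tree statistics reported for the parsimony analysis are internally consistent: the rescaled consistency index is defined as the product of the consistency and retention indices. A brief check, using only the values given in the text (the function name is illustrative):

```python
def rescaled_consistency_index(ci, ri):
    """Rescaled consistency index: the product of CI and RI."""
    return ci * ri


if __name__ == "__main__":
    rc = rescaled_consistency_index(0.861, 0.856)
    print(f"RC = {rc:.3f}")
```

With CI = 0.861 and RI = 0.856, the product rounds to the reported RC of 0.737.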


Discussion

The phylogenetic hypotheses produced as part of this study—the first phylogenomic results for Lamiaceae, inferred from nearly complete plastome sequences—illustrate the complexity of resolving inter- and intrageneric relationships in

Menthinae. They also resurrect several systematic issues relevant to the generic boundaries of Hedeoma and allied genera. Several important findings and their implications are discussed here, including: (1) phylogenetic relationships among

Hedeoma and allied genera; (2) the importance of morphological characters in delineating generic boundaries; (3) the taxonomy of Hedeoma and allied genera; and

(4) the utility of plastid phylogenomic approaches in studies of closely related

Lamiaceae.

Phylogenetic Relationships Among Hedeoma and Allied Genera and Their Taxonomic Implications

Hedeoma is not monophyletic as currently circumscribed, and species are distributed among at least six clades (e.g. Hedeoma sensu stricto [s.s.] and five other clades containing at least one species of Hedeoma; see descriptions below). These results confirm and contrast with a number of conclusions drawn from recent phylogenetic studies, and they may have important evolutionary, taxonomic, or nomenclatural implications involving several genera of Menthinae.

The largest clade of Hedeoma, or “Hedeoma s.s.” (Figure 4-2A-B: Clades VII-IX, including Clinopodium maderense [Henr.] Govaerts), is well supported (ML/MP BS =

96/92%) and comprises most of the recognized species diversity from North American

Hedeoma and Poliomintha, including both type species (H. pulegioides and P. incana).

The close relationship among species from these two genera is consistent with previous


hypotheses; it was first reported by Wagstaff et al. (1995) in a parsimony analysis of chloroplast restriction site variation and was confirmed more recently in a phylogenetic investigation of DNA sequence variation by Godden (2009). However, the extent of the interrelationships among species from Hedeoma and Poliomintha is demonstrated here for the first time with more comprehensive taxon sampling and a robust phylogenetic framework.

In the phylogenomic context presented here, Hedeoma s.s. also includes

Rhododon, Poliomintha, and at least some species from Clinopodium and Cunila.7

These results contradict previous conclusions made by both Bräuchler et al. (2010), who reported that Hedeoma was monophyletic, and Godden (2009), whose analyses placed neither the type species of Hedeoma nor that of Poliomintha within the large clade containing most species of the two genera. The close relationship among species from Hedeoma, Poliomintha, Clinopodium,

Cunila, and Rhododon has been reported previously (e.g. Drew and Sytsma 2012), but the type species from each of the four genera have never been included in the same analysis until now.

My results suggest that Clinopodium and Cunila are non-monophyletic, corroborating evidence from previous phylogenetic studies (e.g. Bräuchler et al. 2010;

Agostini et al. 2012; Drew and Sytsma 2012). The generic circumscriptions of both genera encompass several distantly related clades with distinct geographic affinities

(e.g. see Bräuchler et al. 2010 and Agostini et al. 2012, respectively). These affinities

7 The phylogenetic placement of Kurzamra pulchella (Clos) Kuntze within Hedeoma s.s. is likely due to poor sequence quality and low plastome coverage. The taxon belongs to a South American clade of Clinopodium and Cuminia (Benth.) Benth. (e.g. see Bräuchler et al. 2010; Drew and Sytsma 2012), and its position was confirmed by the supermatrix analysis presented in Chapter 2. Thus, I have intentionally ignored this result in this discussion of phylogenetic relationships.


are most pronounced in Cunila, which comprises two strongly supported South

American clades and one strongly supported Mexican clade (Agostini et al. 2012).

Previous plastid-based studies support the phylogenetic position of the type species of Clinopodium (C. vulgare) and Cunila (C. origanoides [L.] Britton) with Old

World mint genera (e.g. Acinos Mill., Bystropogon L'Hér., Clinopodium, Killickia

Bräuchler, Heubl & Doroszenko, Micromeria Benth., and Ziziphora L.) and North

American mint genera (e.g. Blephilia, Conradina, Clinopodium, Dicerandra, Monarda, and Pycnanthemum), respectively (Bräuchler et al. 2010; Drew and Sytsma 2012).

Although my results support only the latter hypothesis, there is no evidence to suggest that the type species of either genus belongs to Hedeoma s.s. Thus, based on plastid phylogenomic evidence alone, the circumscription of Hedeoma could be expanded to include at least two taxa from Clinopodium and all of Mexican Cunila, Poliomintha, and

Rhododon.8

As for the remaining species of Hedeoma, whose phylogenetic position is outside of Hedeoma s.s., they may require taxonomic or nomenclatural revision to reflect either their status as distinct lineages or their close relationships with other genera and geographic clades. First, the phylogenetic position of H. patrina renders Hedeoma s.s. paraphyletic, and to maintain its current circumscription within the genus would necessitate inclusion of taxa from many other genera (e.g. Blephilia, Conradina,

Dicerandra, Monarda, and Pycnanthemum) in Hedeoma. Second, several accessions of

Mexican and South American Hedeoma are more closely related to species from other

8 Of the three genera whose type species are included in Hedeoma s.s., Hedeoma Pers. is the oldest name and has priority for preservation under the International Code of Nomenclature for algae, fungi, and plants (http://www.iapt-taxon.org/).


genera from the same geographic region than they are to other North American

Hedeoma. For example, H. bella and H. piperita are maximally supported (ML/MP BS =

100%) as sister to Trans-Mexico Volcanic Belt Clinopodium species (Figure 4-2A: Clade

III), and H. media (Uruguay and Argentina) and H. crenata (Brazil) are independently supported as sister (ML/MP BS > 87%) to species from other Eastern South American mint genera (e.g. Clinopodium, Cunila, Hesperozygis, and Rhabdocaulon; Figure 4-2A: Clades V and IV, respectively). Drew and Sytsma (2012) reported similar relationships involving H. piperita (Mexico) and H. multiflora Benth. (Brazil; not sampled as part of this study), which were sister to a Trans-Mexico Volcanic Belt species of

Clinopodium and to an Eastern South American clade of Cunila, Glechon Spreng.,

Hoehnea Epling, Hesperozygis, and Rhabdocaulon, respectively. Lastly, my results suggest that H. mandoniana (Peru and Bolivia) may be closely related to species from the California Floristic Province (e.g. Acanthomintha or Monardella species). However, the position of H. mandoniana in the plastid phylogeny is inconsistent with previous evidence from Godden (2009), whose combined nrITS and plastid phylogeny moderately supports (ML BS = 83%) a relationship with a Mexican species, Clinopodium macrostemum (Moc. & Sessé ex Benth.) Kuntze (not sampled here). The uncertainty regarding the phylogenetic position of H. mandoniana may be explained by poor taxon sampling from the Andean region of South America in both studies. However, both of the putative relatives of H. mandoniana occur along the Pacific flyway, and many mints

(including Hedeoma) have persistent, hooked calyces that may facilitate attachment to migratory animals such as . Thus, historical long-distance dispersal remains a plausible hypothesis that might explain either of these disjunct relationships.


The Importance of Morphological Characters in Delineating Generic Boundaries

As discussed in more detail by Godden (2009), individual characters or character suites based on habit and calyx morphology, the presence or absence of a well-defined or irregular calyx annulus, and number of stamens (Epling and Stewart 1939; Irving

1972, 1980) represent poor choices for delineating generic boundaries among

Hedeoma and allied genera. None of these characters represent synapomorphies that characterize well-supported clades, but instead appear homoplasious in this phylogenomic context.

It is not yet clear which morphological characters distinguish Hedeoma s.s. from other clades of Menthinae; this represents an important and exciting area of future research. However, given the apparent widespread dispersion of character states used by previous researchers to delineate genera, it seems likely that any future morphological analyses will need to objectively consider a broader suite of phenotypic characters. Moreover, these analyses will necessitate a well-resolved phylogenetic framework with more comprehensive taxon sampling across Menthinae.

The Taxonomy of Hedeoma and Allied Menthinae

Several genera of Menthinae are not monophyletic as currently circumscribed, including Clinopodium, Cunila, Hedeoma, Hesperozygis, and Poliomintha. These genera will likely require recircumscription, with appropriate taxonomic and nomenclatural changes, in order that generic circumscriptions more accurately reflect phylogenetic patterns. Such changes should be conducted as part of taxonomic revisions and are thus beyond the scope of this study. Nevertheless, I acknowledge several limitations associated with interpretation of my results below and provide


possible solutions for revising the taxonomy of Hedeoma and other allied genera. I also discuss possible directions for future research.

Hedeoma s.s. includes at least two taxa from Clinopodium and all of Mexican

Cunila, Poliomintha, and Rhododon. The latter two genera have been treated previously as part of Hedeoma by Briquet (1897), who in his treatment of Labiatae recognized H.

(= Poliomintha) glabrescens (A. Gray) Briq., H. (= Poliomintha) longiflora (A.Gray) Briq., and H. (= Hesperozygis) marifolia (Schauer) Briq. as part of H. section Poliomintha (A.

Gray) Briq. and H. (= Rhododon) ciliata and H. (= Stachydeoma) (Chapm.) Briq. as part of H. section Stachydeoma A. Gray. However, as Epling and Stewart (1939) pointed out, Briquet’s decision to subsume these genera into Hedeoma on the basis of insufficient clear-cut differences was inappropriate within the larger context of

Menthinae and would require inclusion of many other genera from the clade.

Cunila was historically considered distantly related to Hedeoma (e.g.

Bentham 1832-1836, 1848; Briquet 1897). However, Epling and Stewart (1939) treated

Cunila and other New World genera with two stamens as closely allied taxa. They nonetheless noted that Cunila encompasses “considerable diversity” and may be too

“heterogenous”; at least a portion of the genus may be more closely allied to other four- genera9. Irving (1980), on the other hand, acknowledged that the morphological features that distinguish Cunila and Hedeoma are “subtle.” While he did not dispute the generic distinction of either genus, he implied that the relationship between the two

9 Cunila has two exserted adaxial stamens and two abaxial staminodes, a feature common to many genera of Menthinae (including Hedeoma).


genera might be closer than previously supposed, citing similarities in calyx morphology among H. floribunda and at least some species from Cunila.

My results, along with those of Drew and Sytsma (2012), strongly suggest that at least some Mexican Cunila species are closely related to Hedeoma. Based on available phylogenetic evidence, it seems that Cunila is best limited to the type species, C. origanoides. As for the remaining Cunila taxa, their recognition or taxonomic dispositions depend on their relationships with other genera of Menthinae (including

Hedeoma). Additional phylogenetic evidence is necessary to place these taxa confidently within Menthinae.

In light of my results and other previously reported evidence, it seems reasonable to consider an expanded concept of Hedeoma that includes Hedeoma s.s. and excludes the six Hedeoma species positioned outside of the clade. However, it may be premature to propose taxonomic or nomenclatural changes based solely on phylogenetic relationships inferred from the plastid genome. Revisions based on plastid phylogenies could, in some cases, fail to consider a number of important evolutionary processes, including (but not limited to) hybridization and introgression.

Irving (1980) suggested that hybridization is common within Hedeoma section

Saturejoides Irving. In his artificial hybridization experiments with Hedeoma, Irving demonstrated that the majority of crosses among taxa in section Saturejoides produce fertile or partially fertile F1 hybrids (Irving et al. 1979). In fact, in some cases he documented naturally occurring interspecific hybrids in regions where the putative parental species occur sympatrically (e.g. H. drummondii Benth. × H. nana (Torr.) Briq. var. nana, H. drummondii var. drummondii × H. pulcherrima Wooton & Standl., H.

133

dentata Torr.  H. hyssopifolia A. Gray, and H. drummondii  H. reverchonii A. Gray 

H. serpyllifolia Small) (Irving 1980). I have observed several putative hybrids throughout my field studies and have documented an additional interspecific hybrid from Nuevo

Léon, H. costata var. pulchella [Greene] Irving  H. palmeri subsp. galeana B. L. Turner

(e.g. Godden 226 [FLAS]).

My plastid phylogenomic evidence is suggestive of several intra- and intergeneric hybridization events involving taxa from within Hedeoma s.s. and with other genera of Menthinae; these require further investigation. First, several species complexes of Hedeoma appear non-monophyletic (e.g. H. costata, H. nana, H. oblongifolia, and H. palmeri). While these examples may represent cryptic species that have not been formally recognized, hypotheses of hybridization or introgression cannot be ruled out. Second, many species within Hedeoma s.s. appear to share similar plastid haplotypes. It is possible that these shared haplotypes are the result of either widespread historical hybridization or introgression events, or a combination of recent diversification events and slow rates of molecular evolution in the plastome. In either case, one would expect the plastid data to appear uninformative with regard to species-level relationships; these patterns are observed in Hedeoma s.s. (Figure 4-2A: Clade IX). Finally, at least five Hedeoma species appear closely related to species from other genera of Menthinae that occur within the same geographic range (e.g. Figure 4-2A: Clades III-V). While the phylogenetic placement of Hedeoma in each of these cases may represent poor taxonomy or erroneous generic-level classification, it is also plausible that intergeneric hybridization events or introgression is responsible for these patterns. Intergeneric hybridization has been hypothesized for other Menthinae based on phylogenetic results. For example, Edwards et al. (2006) found evidence of historical intergeneric hybridization among species of Clinopodium, Conradina, Stachydeoma, and Piloblephis; shared plastid haplotypes among these genera were the result of historical range expansion and contraction that brought previously isolated taxa into zones of hybridization.

Species of Hedeoma are typically distributed at mid-elevations of 1500-2500 m in temperate xerophytic shrublands dominated by Pinus L., Juniperus L., and Quercus L. Many populations of Hedeoma appear disjunct, occurring on isolated mountain ranges separated by vast expanses of xeric desert. These species distributions suggest that at least some taxa may have been more widely distributed in the past. Historical climate and vegetation modeling supports the plausibility of this hypothesis; temperate xerophytic shrublands were distributed over a vast expanse of the western U.S. and Mexico during the Middle Pliocene (3.6-2.6 million years ago; Saltzman et al. 2008), whereas they are now restricted to montane regions at mid-elevations. In this scenario, it is possible that the ranges of many Hedeoma species may have expanded and contracted several times throughout their evolutionary histories. Previously isolated species and populations may have come into contact with one another at lower elevations, facilitating zones of hybridization or introgression. Nuclear phylogenetic and population genetic data may be useful in testing these interesting hypotheses.

The Utility of Plastid Phylogenomic Approaches

As reviewed in Chapter 3, plastid phylogenomic approaches have been used successfully in both phylogenetic and phylogeographic investigations (see Godden et al. 2012). This work represents the first phylogenomic study for Lamiaceae and is perhaps one of the earliest large-scale studies employing plastid phylogenomic approaches to reconstruct shallow-level relationships in angiosperms. Consequently, it is appropriate to offer some discussion regarding sequence quality and the efficacy and limitations of the approach, as well as the utility of plastid phylogenomics for reconstructing relationships within and among recently derived and rapidly diverging angiosperm lineages.

In this study, the depth of sequencing on the Illumina GAIIx platform was highly variable at the level of individual accessions, and in some cases yielded fewer than 200,000 reads after quality filtering (i.e. Clinopodium alpestre [Urb.] Harley, Cunila pycnantha B. L. Rob. & Greenm., Cunila ramamoorthiana M. R. Garcia-Peña, Hedeoma acinoides Scheele, H. matomiana, H. molle Torr., H. pulegioides, Poliomintha maderensis Henr., Rhabdocaulon lavanduloides, and Thymus mastichina). Of those accessions with fewer retained reads, only three resulted in plastome assemblies with less than 100 kbp of aligned sequence data (i.e. C. alpestre [66,991 bp], H. pulegioides [90,322 bp], R. lavanduloides [37,105 bp]); these were evaluated with caution when interpreting phylogenetic results. One of these three taxa (R. lavanduloides) behaved abnormally with regard to its phylogenetic placement in the ML vs. MP topologies, but the sequence was retained because its phylogenetic placement in the MP results was consistent with previous hypotheses (e.g. Godden 2009). Additionally, three other taxa (i.e. Kurzamra pulchella [Clos] Kuntze, H. multiflora Benth., and H. martirense Moran; see Figure 4-2A) contained large numbers of ambiguous base calls in their assembled sequences, which likely resulted in their phylogenetic placements in clades that were inconsistent with their geographic distributions. The sequences were derived from herbarium tissues, and DNA degradation may be responsible for their poor quality.


The custom RNA probe set of Stull et al. (2013) was used for sequence capture in this study. The probes were designed for capture of plastomes across angiosperms, and their efficacy appeared somewhat limited here with regard to capture of variable loci, especially the intergenic spacer regions. For example, one of the most informative plastid regions for phylogenetic studies of Mentheae (Drew and Sytsma 2011), ycf1, received poor coverage across my sampled accessions, and the assembled sequences included many large indel regions corresponding to regions of missing data. I followed the recommendations of Stull et al. (2013) and used larger 400-500-bp insert sizes to increase sequence coverage of these variable regions. However, in spite of this modification, several variable regions failed to fully assemble even with use of multiple conspecific references for plastome assembly. Future studies of Lamiaceae may benefit from the development of new family- or taxon-specific probe sets based on available mint plastomes – including those inferred from this analysis – to facilitate more efficient capture of noncoding regions.

Regarding the utility of plastid phylogenomic approaches for recently derived and rapidly diverging mint lineages, phylogenetic resolution and support for intergeneric and species relationships were only incrementally improved (i.e. as compared with Godden 2009). Nevertheless, many of the same evolutionary processes and patterns emerge (e.g. the likelihood of hybridization and introgression), and the application of large-scale data supports and strengthens previous inferences. Based on my phylogenomic results, it seems likely that hybridization is a major evolutionary force driving diversification in Menthinae (and perhaps in other mint lineages as well), and future studies will benefit from phylogenetic approaches involving multilocus nuclear data sets that can further explore these patterns.


Table 4-1. Proposed taxa and nomenclature for Hedeoma Pers. Four subgenera and two sections proposed by Irving (1980) in the most recent revision of Hedeoma are indicated with bold font, immediately followed by an alphabetical listing of recognized subordinate taxa. Taxa described after 1980 are assigned to subgenera or sections based on either the author’s classification or their morphological affinities to other classified taxa. Type designations specified by Irving are indicated with the following symbols: * = type of genus; † = type of subgenus; ‡ = type of section.
Columns: Taxon, Publication Year, Synonyms, Basionym

Hedeoma subg. Ciliata Irving 1980 Hedeoma apiculata W. S. Stewart 1939 Hedeoma ciliolata (Epling & W. S. Stewart) 1970 Hesperozygis ciliolata Epling & W. Irving† S. Stewart Hedeoma pilosa Irving 1970 Hedeoma pusilla (Irving) Irving 1970 Hesperozygis pusilla Irving Irving 1979

Hedeoma subg. Poliominthoides Irving 1980 Hedeoma irvingii B. L. Turner 1991 Hedeoma molle Torr. 1859 Poliomintha mollis (Torr.) A. Gray Hedeoma montana Brandegee 1911 Hedeoma montana Hedeoma palmeri Hemsl. subsp. galeanum 1991 B. L. Turner Hedeoma palmeri Hemsl. var. palmeri† 1881 Hedeoma rotundifolium Hemsl. Hedeoma palmeri Hemsl. var. santiagoanum 1995 B. L. Turner Hedeoma palmeri Hemsl. var. zaragozanum 1995 B. L. Turner Hedeoma patrina W. S. Stewart 1939 Hedeoma rzedowskii B. L. Turner 1994


Hedeoma subg. Saturejoides sect. Alpine 1980 Irving Hedeoma bella (Epling) Irving 1970 Hesperozygis bella Epling Hedeoma floribunda Standl. 1939 Hedeoma jucunda Greene 1888 1908 Hedeoma piperita Benth.‡ 1835 Cunila piperita Moç. & Sessé ex Benth. Hedeoma polygalæfolia Benth. 1834

Hedeoma subg. Saturejoides Irving sect. 1980 Saturejoides Irving Hedeoma acinoides Scheele 1849 Hedeoma chihuahuensis (Henr.) B. L. Turner 1991 H. hyssopifolium A. Gray var. Hedeoma hyssopifolium A. chihuahuensis Henr. Gray var. chihuahuensis [sic] Henr. Hedeoma costata Hemsl. var. costata† 1872 H. tenellum Hemsl., H. pringlei Briq., H. permixtum Briq., H. quinquenervatum Bartlett Hedeoma costata Hemsl. var. pulchella 1970 H. pulchellum Greene, H. Hedeoma pulchellum Greene (Greene) Irving albescentifolium Bartlett, H. convisae A. Nels. Hedeoma dentata Torr. 1859 Benth. var. 1836 H. ciliatum Nutt., H. ovatum A. drummondii Nels., H. longiflorum Rydb., H. camporum Rydb. Hedeoma drummondii Benth. var. crenulata 1970 Irving


Hedeoma diffusa Greene 1898 H. blepharodontum Greene Hedeoma hispida Pursh 1814 H. hirtum Nutt., Ziziphora hispidum (Pursh) R. & S., H. hispidum Pursh forma simplex Lalonde Hedeoma hyssopifolia A. Gray 1876 Hedeoma johnstonii Irving 1977 Hedeoma media Epling 1939 Hedeoma pulcherrima Woot. & Standl. 1913 A. Gray var. reverchonii 1878 H. drummondii var. reverchonii A. Gray, H. latum Small Hedeoma reverchonii serpyllifolia (Small) 1970 H. serpyllifolium Small, H. sanctum Irving Small Hedeoma martirense Moran 1969 Hedeoma matomiana Moran 1999 Hedeoma microphylla Irving 1970 Hedeoma multiflora Benth. 1834 H. gilliesii Benth., Micromeria bonariensis Fisch. & Meyer, Satureia gilliesii (Benth.) Briq., Satureia bonariensis (Fisch. & Meyer) Briq. in Engler & Prantl. Hedeoma nana (Torr.) Briq. var. nana 1872 H. dentatum var. nanum Torr., H. thymoides A. Gray, H. nanum Greene, H. nanum Briq. var. typicum W. S. Stewart, H. nanum var. mexicanum W. S. Stewart Hedeoma nana (Torr.) Briq. var. californica 1939 W. S. Stewart


Hedeoma nana (Torr.) Briq. var. macrocalyx 1939 W. S. Stewart Hedeoma oblatifolia Villarreal 1993 Hedeoma oblongifolia (A. Gray) A. Heller var. 1900 H. piperitum Benth. var. Hedeoma piperitum Benth. oblongifolia oblongifolium A. Gray, H. thymoides var. oblongifolium A. Gray A. Gray var. oblongifolium A. Gray Hedeoma oblongifolia (A. Gray) A. Heller var. 1970 mexicana Irving Hedeoma plicata Torr. 1859 Hedeoma quercetora Epling 1939 Hedeoma tenuiflora Brandegee 1908 Hedeoma tenuipes Epling 1939

Hedeoma subg. Hedeoma Hedeoma crenata Irving 1970 H. polygalæfolium Benth. var. montanum Dusén, Pseudocunila montana Brade, Cunila montana Brade ex Epling, H. montanum (Brade) Epling & Jativa Hedeoma mandoniana Wedd. 1860 H. breviflorum Griseb. in Lechler, H. adscendens Rusby (L.) Pers.*,† 1807 Melissa pulegioides L., Cunila pulegioides L., H. pulegioides (L.) Pers. forma simplex Lalonde


Table 4-2. Accessions used for phylogeny reconstruction, including sample identifier, voucher information (herbarium code), and collection locality. Herbarium code abbreviations are as follows: Arizona State University (ASU), California Academy of Sciences (CAS), Universidad de Concepción (CONC), Florida Museum of Natural History (FLAS), Harvard University (HUH), Instituto de Ecología, A. C. (IEB), Universidad Nacional Autónoma de México (MEXU), University of Michigan (MICH), Missouri Botanical Garden (MO), Michigan State University (MSC), New York Botanical Garden (NY), Sul Ross State University (SR), University of Texas at Austin (TEX), University of California (UC), University of New Mexico (UNM), and Smithsonian Institution (US).

Taxon | Identifier | Voucher (Herbarium) | Country: State/Province
Acanthomintha ilicifolia A. Gray | 216 | Moran 21979 (MSC) | Mexico: Baja California
(Pursh) Benth. | 322 | Smith 4096 (MSC) | USA: Indiana
Clinopodium acinos (L.) Kuntze (syn. Acinos arvensis (Schur) Dandy) | 321 | Stephenson s.n. 14 June 1986 (MSC) | USA: Michigan
Clinopodium alpestre (Urb.) Harley | 324 | Zanoni et al. 37666 (NY) | Dominican Republic
Clinopodium brownei (Sw.) Kuntze | 166 | López Pérez 436 (CAS) | Mexico: Chiapas
Clinopodium carolinianum Mill. (syn. Clinopodium georgianum R. M. Harper) | 326 | Churchill 89389 (MSC) | USA: South Carolina
Clinopodium chandleri (Brandegee) P. D. Cantino & Wagstaff | 183 | Moran 17799 (MO) | Mexico: Baja California
Clinopodium darwinii (Benth.) Kuntze | 807 | Domínguez 473 (CONC) | Chile
Clinopodium douglasii (Benth.) Kuntze | 813 | Daniel 8898 (CAS) | USA: California
Clinopodium ganderi (Epling) Govaerts | 323 | Moran 19092 (NY) | Mexico: Baja California
Clinopodium hintoniorum (B. L. Turner) Govaerts | 173 | Hinton et al. 23946 (CAS) | Mexico: Nuevo León
Clinopodium maderense (Henr.) Govaerts | 278 | Henrickson et al. 18656 (TEX) | Mexico: Coahuila
Clinopodium mexicanum (Benth.) Govaerts | 796 | Cronquist et al. 11850 (NY) | Mexico: Oaxaca

Clinopodium oaxacanum (Fernald) Standl. (syn. Clinopodium mexicanum (Benth.) Govaerts) | 270 | Salinas Tovar 6400 (CAS) | Mexico: Puebla
Clinopodium procumbens (Greenm.) Harley | 770 | Breedlove 29192 (CAS) | Mexico: Chiapas
Clinopodium selerianum (Loes.) Govaerts | 771 | Breedlove 12574 (UC) | Mexico: Chiapas
Clinopodium vulgare L. | 286 | Godden et al. 119 (FLAS) | USA: New Mexico
Kral & McCartney | 349 | Godden 190 (FLAS) | USA: Florida
(L.) Britton | 229 | McNeilus 89-1053 (MSC) | USA: Tennessee
Cunila pycnantha B. L. Rob. & Greenm. | 576 | Delgado 2496 (MEXU) | Mexico: Guerrero
Cunila ramamoorthiana M. R. García Peña | 581 | García Peña 107 (MEXU) | Mexico: Guerrero
Cunila secunda S. Watson (syn. Cunila polyantha Benth.) | 578 | Rzedowski 42101 (IEB) | Mexico: Guanajuato
Cunila spicata Benth. | 590 | Hatschbach 82390 (MEXU) | Brazil
Dicerandra radfordiana Huck | 293 | Payton s.n. (FLAS) | USA:
Glechon ciliata Benth. | 231 | Cristobál et al. 2279 (MICH) | Argentina: Misiones
Hedeoma acinoides Scheele | 298 | Godden 172 (FLAS) | USA: Texas
Hedeoma apiculata W. S. Stewart | 332 | Knight 3647 (UNM) | USA: New Mexico
Hedeoma bella (Epling) Irving | 798 | Arsène 8593 (NY) | Mexico: Michoacán
Hedeoma ciliolata (Epling & W. S. Stewart) Irving | 373 | Godden et al. 227 (FLAS) | Mexico: Nuevo León
Hedeoma costata Hemsl. var. pulchella (Greene) Irving | 287 | Godden et al. 123 (FLAS) | USA: New Mexico
Hedeoma costata Hemsl. var. pulchella (Greene) Irving | 359 | Godden et al. 209 (FLAS) | Mexico: Coahuila

Hedeoma costata Hemsl. var. pulchella (Greene) Irving | 355 | Godden et al. 198 (FLAS) | Mexico: Coahuila
Hedeoma crenata Irving | 823 | Plowman 2842 and Sucre 5142 (US) | Brazil
Hedeoma dentata Torr. | 291 | Godden 149 (FLAS) | USA: Arizona
Hedeoma diffusa Greene | 339 | Fletcher 4465 (UNM) | USA: Arizona
Hedeoma drummondii Benth. | 284 | Godden et al. 117 (FLAS) | USA: New Mexico
Hedeoma drummondii Benth. var. crenulata Irving | 763 | Breedlove et al. 63769 (CAS) | Mexico: San Luis Potosí
Hedeoma floribunda Standl. | 306 | Reina G. et al. 97-1212 (ASU) | Mexico: Sonora
Hedeoma hispida Pursh | 296 | Godden 169 (FLAS) | USA: Texas
Hedeoma hyssopifolia A. Gray | 290 | Godden 145 (FLAS) | USA: New Mexico
Hedeoma irvingii B. L. Turner | 820 | Estrada 12742 (MEXU) | Mexico: Nuevo León
Hedeoma johnstonii Irving | 542 | Lowry et al. 3141 (MEXU) | Mexico: Coahuila
Hedeoma jucunda Greene | 139 | Irving 707 (TEX) | Mexico: Durango
Hedeoma mandoniana Wedd. | 117 | Tapayachi 1035 (NY) | Peru: Cusco
Hedeoma martirensis Moran | 305 | Rebman 5444 (ASU) | Mexico: Baja California
Hedeoma matomiana Moran | 819 | Moran 20810 (CAS) | Mexico: Baja California
Hedeoma media Epling | 307 | Irving et al. U-15 (ASU) | Uruguay: Colonia
Hedeoma c.f. microphylla Irving | 816 | Bye et al. 18259 (MEXU) | Mexico: Chihuahua
Hedeoma molle Torr. | 334 | Warnock 332 (SR) | USA: Texas
Hedeoma montana Brandegee | 40 | Henrickson 12159 (NY) | Mexico: Coahuila
Hedeoma multiflora Benth. | 794 | Pedersen 15869 (NY) | Uruguay: Paysandú
Hedeoma nana (Torr.) Briq. | 354 | Godden et al. 193 (FLAS) | Mexico: Coahuila
Hedeoma nana (Torr.) Briq. | 365 | Godden et al. 218 (FLAS) | Mexico: Nuevo León

Hedeoma nana (Torr.) Briq. var. macrocalyx W. S. Stewart | 808 | Lehto et al. 10947 (NY) | USA: Arizona
Hedeoma nana (Torr.) Briq. var. nana | 340 | Sivinski 7548 (UNM) | USA: New Mexico
Hedeoma oblatifolia Villarreal | 556 | Carranza et al. s.n. 13 August 1993 (MEXU) | Mexico: Coahuila
Hedeoma oblongifolia (A. Gray) A. Heller var. mexicana Irving | 560 | Breedlove 61106 (MEXU) | Mexico: Sonora
Hedeoma oblongifolia (A. Gray) A. Heller var. oblongifolia | 289 | Godden et al. 135 (FLAS) | USA: New Mexico
Hedeoma palmeri Hemsl. subsp. galeana B. L. Turner | 371 | Godden et al. 225 (FLAS) | Mexico: Nuevo León
Hedeoma palmeri Hemsl. var. palmeri | 802 | Ventura et al. 8639 (TEX) | Mexico: Guanajuato
Hedeoma palmeri Hemsl. var. santiagoana B. L. Turner | 357 | Godden et al. 201 (FLAS) | Mexico: Coahuila
Hedeoma palmeri Hemsl. var. zaragozana B. L. Turner | 366 | Godden et al. 219 (FLAS) | Mexico: Nuevo León
Hedeoma patens M. E. Jones | 817 | Alvarado 615 (MEXU) | Mexico: Durango
Hedeoma patrina W. S. Stewart | 564 | Villarreal et al. 8293 (MEXU) | Mexico: Coahuila
Hedeoma piperita Benth. | 281 | Sanders 1072 (TEX) | Mexico: Distrito Federal
Hedeoma plicata Torr. | 337 | Manning 5021 (SR) | USA: Texas
Hedeoma pulcherrima Wooton & Standl. | 821 | Godden et al. 132 (FLAS) | USA: New Mexico
Hedeoma pulegioides (L.) Pers. | 830 | Magrath 4647 (NY) | USA: Missouri
Hedeoma pusilla (Irving) Irving | 818 | Godden et al. 228 (FLAS) | Mexico: Nuevo León
Hedeoma quercetora Epling | 370 | Godden et al. 223 (FLAS) | Mexico: Nuevo León
Hedeoma reverchonii A. Gray | 98 | Holmes 10075 (TEX) | USA: Texas

Hedeoma rzedowskii B. L. Turner | 766 | Johnston et al. 11084 (CAS) | Mexico: San Luis Potosí
Hedeoma serpyllifolia Small | 569 | Villareal et al. 9095 (MEXU) | Mexico: Nuevo León
Hedeoma serpyllifolia Small | 336 | Turner 23-149 (SR) | USA: Texas
Hedeoma sp. | 327 | Spellenberg et al. 13836 (NMC) | Mexico: Chihuahua
Hedeoma tenuiflora Brandegee | 765 | Moran 12806 (UC) | Mexico: Baja California
Hedeoma todsenii Irving | 328 | Dunmire 1136 (UNM) | USA: New Mexico
Hesperozygis kleinii Epling & Játiva | 302 | Hatschbach et al. 78233 (ASU) | Brazil: Santa Catarina
Hesperozygis marifolia (S. Schauer) Epling | 316 | Ventura et al. 257 (TEX) | Mexico: Guanajuato
Hesperozygis spathulata Epling | 797 | Dusén 15165 (MSC) | Brazil: Paraná
Kurzamra pulchella (Clos) Kuntze | 318 | Werdermann 957 (HUH) | Chile: Atacama
Mentha × rotundifolia (L.) Huds. | 152 | Stevens 1525 (MSC) | USA: California
Monarda fistulosa L. var. menthifolia (Graham) Fernald | 319 | Chamberlain 1567 (MSC) | USA: Arizona
Monardella odoratissima Benth. | 259 | Churchill 761326 (MSC) | USA: Washington
Poliomintha bustamanta B. L. Turner | 828 | Godden GTG1 (FLAS) | USA: Florida (cultivated)
Poliomintha dendritica B. L. Turner | 376 | Godden et al. 232 (FLAS) | Mexico: Nuevo León
Poliomintha glabrescens A. Gray ex Hemsl. | 283 | Godden et al. 100 (FLAS) | USA: Texas
(Torr.) A. Gray | 829 | Godden et al. 114 (FLAS) | USA: New Mexico
Poliomintha longiflora A. Gray | 83 | Hinton et al. 22073 (TEX) | Mexico: Coahuila
Poliomintha maderensis Henr. | 308 | Reeves et al. PI3155 (ASU) | Mexico: Coahuila
Pycnanthemum virginianum (L.) T. Durand & B. D. Jacks. ex B. L. Rob. & Fernald in A. Gray | 793 | Churchill 92-165 (MSC) | USA: Virginia
Rhabdocaulon lavanduloides (Benth.) Epling | 240 | Hatschbach 42766 (MO) | Brazil: Paraná
Rhododon angulatus (Tharp) B. L. Turner | 822 | Godden et al. 93 (FLAS) | USA: Texas

Rhododon ciliatus (Benth.) Epling | 297 | Godden 171 (FLAS) | USA: Texas
Thymus mastichina (L.) L. | 153 | Ladero s.n. 16 May 1969 (MSC) | : Cáceres


Table 4-3. General Time Reversible (GTR) + Γ model parameters estimated by Randomized Axelerated Maximum Likelihood (RAxML).

Shape parameter | Rate matrix | Base frequencies
alpha: 0.020000 | rate A <-> C: 1.227778 | freq pi(A): 0.310739
 | rate A <-> G: 2.051774 | freq pi(C): 0.174235
 | rate A <-> T: 0.612300 | freq pi(G): 0.196969
 | rate C <-> G: 0.979892 | freq pi(T): 0.318057
 | rate C <-> T: 2.283587 |
 | rate G <-> T: 1.000000 |


Figure 4-1. Mean number of reads (or FASTQ records) yielded per each of four lanes of Illumina GAIIx sequencing (in green) and mean reads retained after quality control and filtering (in red). On average, 56% of all sequenced reads were retained for assembly (shown here as “All”).


Figure 4-2. Maximum likelihood tree inferred from an analysis of nearly complete plastome sequences. The topology is shown as a cladogram, with a phylogram illustrating the tree shape and region of the tree shown as an inset (in Figure B). Bootstrap values are indicated above the branches subtending nodes with >50% support (ML/MP) in either analysis. An asterisk (*) indicates BS = 100%; a hyphen (-) indicates the node was absent in the MP strict consensus. For convenience, Hedeoma accessions are indicated with bold font; colored boxes, Roman numerals, and letters are indicated to facilitate in-text discussion. The phylogenetic placement of all taxa indicated with red font is considered dubious based on previous analyses and unpublished data.


Object 4-1. Aligned matrix in PHYLIP multiple alignment format, including 128,716 base pairs of plastome sequence data for 97 accessions of Menthinae (.rtf file 12.5 MB)


Object 4-2. Supermatrix data assembly output, including characteristics by taxon and plastome partition (.xlsx file 172 KB)


CHAPTER 5 RESOLVING RELATIONSHIPS WITH NUCLEAR PHYLOGENOMIC APPROACHES: A CASE STUDY WITH LAMIALES

Introduction

Background

Lamiales sensu Olmstead (1993) are the largest clade of Lamiidae (Refulio-Rodriguez and Olmstead 2014). As currently delimited, they comprise more than 23,000 species (Kadereit 2004) from 25 families (i.e. Acanthaceae, Bignoniaceae, Byblidaceae, Calceolariaceae, Carlemanniaceae, Gesneriaceae, Lamiaceae, Lentibulariaceae, Linderniaceae, Martyniaceae, Mazaceae, Oleaceae, Orobanchaceae, Paulowniaceae, Pedaliaceae, Phrymaceae, Plantaginaceae, Plocospermataceae, Rehmanniaceae, Schlegeliaceae, Scrophulariaceae, Stilbaceae, Tetrachondraceae, Thomandersiaceae, and Verbenaceae; see Olmstead et al. 2012, http://depts.washington.edu/phylo/Classification.pdf), many of which include ecologically, economically, and culturally important species.

Lamiales are nearly ubiquitous, and they occupy a diverse range of ecological niches found throughout most (if not all) world biomes, from tropical rainforest and submerged aquatic habitats to dry deserts and high alpine scree. They also exhibit a remarkably diverse range of growth forms and life histories (e.g. ephemeral herbs and long-lived trees), specialized life strategies (e.g. epiphytes, carnivores, and parasites), secondary metabolites, floral architectures, and biology. These attributes present exciting opportunities to investigate global patterns of radiation and to link these patterns to shifts in distribution, ecology, genomics, chemistry, and floral morphology. However, comparative studies of Lamiales have proven somewhat intractable due to poor phylogenetic resolution.


Current Status of Lamiales Phylogeny

Much of our understanding of Lamiales (and Lamiidae, in general) stems from many years of molecular research whose primary aim was to identify and reconstruct relationships among major angiosperm lineages (e.g. Olmstead et al. 1992, 1993, 2000; Chase et al. 1993; Soltis et al. 2000, 2011; Albach et al. 2001; Kårehed 2001; Bremer et al. 2002; González; reviewed in Refulio-Rodriguez and Olmstead 2014). The monophyly of Lamiales was first demonstrated more than two decades ago (Olmstead et al. 1992, 1993), and, since that time, it has been consistently recovered with strong support in many phylogenetic studies (e.g. Olmstead et al. 2000; Soltis et al. 2000, 2011; Bremer et al. 2002; Schäferhoff et al. 2010; Refulio-Rodriguez and Olmstead 2014). Lamiales are now a universally accepted clade, but their phylogeny remains poorly known.

Previous research efforts in Lamiales systematics have focused primarily on the monophyly and membership of individual families. In fact, several families have been re-circumscribed in light of molecular phylogenetic results, and a few newly discovered subclades have been recognized at the family rank (e.g. Acanthaceae [Scotland et al. 1995; McDade et al. 1999, 2000, 2008, 2013], Bignoniaceae [Spangler and Olmstead 1999; Olmstead et al. 2009], Calceolariaceae [Olmstead et al. 2001; Andersson 2006], Gesneriaceae [Smith et al. 1997; Perret et al. 2013], Lamiaceae [Cantino 1992, 1992b; Wagstaff et al. 1998; Bendiksby et al. 2011; Drew and Sytsma 2012], Lentibulariaceae [Jobson et al. 2003; Müller et al. 2004], Linderniaceae [Rahmanzadeh et al. 2005], Martyniaceae [Gutierrez 2011], Mazaceae [Reveal 2011], Oleaceae [Wallander and Albert 2000], Orobanchaceae [Young et al. 1999; Bennett and Matthews 2006], Phrymaceae [Beardsley and Olmstead 2002], Plantaginaceae [Albach et al. 2005], Rehmanniaceae [Albach et al. 2009; Xia et al. 2009], Scrophulariaceae [Kornhall et al. 2001; Kornhall and Bremer 2004; Oxelman et al. 2005], Stilbaceae [Schäferhoff et al. 2010; Refulio-Rodriguez and Olmstead 2014], Thomandersiaceae [Wortley et al. 2007], and Verbenaceae [Marx et al. 2010]).

Our understanding of the interfamilial phylogeny of Lamiales has progressed considerably in recent years. Currently, the branching order of early diverging lineages in Lamiales10 appears well established by molecular evidence from multiple studies (Olmstead et al. 1993; Savolainen et al. 2000; Bremer et al. 2002; Hilu et al. 2003; Rahmanzadeh et al. 2005; Müller et al. 2006; Schäferhoff et al. 2010; Refulio-Rodriguez and Olmstead 2014). However, progress has been slower with regard to resolution and support for relationships in the “Higher Core Lamiales” (HCL)11, a large clade composed of the following 17 families: Acanthaceae, Byblidaceae, Bignoniaceae, Lamiaceae, Lentibulariaceae, Linderniaceae, Martyniaceae, Mazaceae, Orobanchaceae, Paulowniaceae, Pedaliaceae, Phrymaceae, Rehmanniaceae, Schlegeliaceae, Stilbaceae, Thomandersiaceae, and Verbenaceae.

The interfamilial phylogeny of the HCL is, perhaps, the last major hurdle in Lamiales systematics (at least at the superfamilial level). Use of more comprehensive taxonomic sampling across Lamiales and increasingly larger plastid DNA datasets have incrementally improved resolution and support across consecutive phylogenetic studies (Olmstead et al. 2001; Bremer et al. 2002; Schäferhoff et al. 2010; Soltis et al. 2011; Refulio-Rodriguez and Olmstead 2014). However, even with more than 17 kilobase pairs (kbp) of sequence data applied to the problem, we still lack resolution and support along the backbone of the HCL.

10 (Plocospermataceae ((Carlemanniaceae (Oleaceae)) (Tetrachondraceae ((Calceolariaceae (Gesneriaceae)) (Plantaginaceae (Scrophulariaceae (remaining families)))))))

11 Clade nomenclature used by Schäferhoff et al. (2010)

Refulio-Rodriguez and Olmstead (2014) concluded that “much of the interfamilial phylogeny of Lamiales is now resolved”, but the HCL comprises most of the diversity in Lamiales, and the subclade includes some of the weakest support values found within their Bayesian, maximum likelihood, and parsimony trees. Moreover, trees recovered by each of the three optimality criteria were discordant with regard to the HCL topology. Without a consistent view of interfamilial relationships within the largest subclade of Lamiales, it seems that most phylogenetic relationships in the clade are uncertain; more convincing phylogenetic evidence is needed.

Previous studies of Lamiales have relied heavily on plastid DNA for phylogenetic inference. Thus, for this study, my goals were to develop a large multilocus nuclear dataset for Lamiales and to evaluate the utility of nuclear phylogenomic approaches for resolving interfamilial relationships.

Materials and Methods

Taxonomic Sampling

Seventy-seven de novo transcriptome assemblies were downloaded from the One Thousand Plants (oneKP) project (http://www.onekp.com) and processed to identify putatively orthologous single-copy nuclear (SCN) markers suitable for phylogenetic analyses (Table 5-1). The taxonomic sampling was opportunistic and limited to resources afforded by oneKP. In all, 68 representative accessions from 17 of 25 recognized families from Lamiales were sampled (i.e. Acanthaceae, Bignoniaceae, Byblidaceae, Calceolariaceae, Gesneriaceae, Lamiaceae, Lentibulariaceae, Oleaceae, Orobanchaceae, Paulowniaceae, Pedaliaceae, Plantaginaceae, Rehmanniaceae, Schlegeliaceae, Scrophulariaceae, Tetrachondraceae, and Verbenaceae; transcriptomes were not available for Carlemanniaceae, Linderniaceae, Martyniaceae, Mazaceae, Phrymaceae, Stilbaceae, Plocospermataceae and Thomandersiaceae) (Olmstead et al. 2000; Olmstead 2012; Refulio-Rodriguez and Olmstead 2014). Nine additional accessions representing the following angiosperm orders (and families) were also included in the analysis: Boraginales (Boraginaceae), ( and ), and Solanales (Convolvulaceae and Solanaceae).

Discovery of Single-copy Nuclear Genes from Transcriptome Data

The oneKP transcriptome assemblies were downloaded and filtered individually using a novel bioinformatic workflow for discovery of SCN genes across angiosperms (Chamala et al., submitted) and computing resources and modules available via the High Performance Computing Center at the University of Florida. Scaffolds less than 900 bp in length were removed first to avoid inclusion of broken or partially assembled transcripts in subsequent data processing steps. Next, putative sequence orthology was assessed with reciprocal BLAST searches (Altschul et al. 1990, 1997) against a complete Arabidopsis thaliana (L.) Heynh proteome downloaded from the PLAZA 2.5 database (Van Bel et al. 2011). The filtered transcripts from each assembly were aligned against A. thaliana proteins using BLASTX with default settings, and, reciprocally, the A. thaliana proteins were aligned against the filtered transcripts using TBLASTN with default settings. The top hits from each query (i.e. the subjects with the highest BLAST alignment score S) were retained if they met either of the following criteria: a minimum of 70% of the transcript length was aligned with an A. thaliana protein with at least 70% sequence similarity (BLASTX) or a minimum of 80% of the protein length was aligned to the transcript with at least 70% sequence similarity (TBLASTN). Only the reciprocal uniquely best BLAST alignments were considered putative orthologues and were retained for downstream processing.
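The length filter and reciprocal-best-hit logic described above can be sketched in Python as follows. This is an illustration only and is not the workflow of Chamala et al.; the `Hit` record, the function names, and the toy identifiers are hypothetical. It also assumes one reading of the retention criteria, in which the BLASTX thresholds are applied to transcript-to-protein hits and the TBLASTN thresholds to protein-to-transcript hits.

```python
from collections import namedtuple

# Hypothetical, simplified BLAST hit record; field names are illustrative only.
# aligned_fraction and similarity are proportions between 0 and 1.
Hit = namedtuple("Hit", "query subject score aligned_fraction similarity")

MIN_SCAFFOLD_LEN = 900  # scaffolds shorter than this are discarded


def filter_scaffolds(scaffolds, min_len=MIN_SCAFFOLD_LEN):
    """Keep scaffolds of at least min_len bp (scaffolds: dict of id -> sequence)."""
    return {sid: seq for sid, seq in scaffolds.items() if len(seq) >= min_len}


def unique_best_hits(hits, min_aligned, min_similarity=0.70):
    """Map each query to its top-scoring subject, if unique and above thresholds."""
    by_query = {}
    for hit in hits:
        by_query.setdefault(hit.query, []).append(hit)
    best = {}
    for query, query_hits in by_query.items():
        query_hits.sort(key=lambda h: h.score, reverse=True)
        top = query_hits[0]
        # Skip queries whose best score is shared by two or more subjects.
        if len(query_hits) > 1 and query_hits[1].score == top.score:
            continue
        if top.aligned_fraction >= min_aligned and top.similarity >= min_similarity:
            best[query] = top.subject
    return best


def reciprocal_best_hits(blastx_hits, tblastn_hits):
    """Return transcript -> protein pairs that are uniquely best in both directions."""
    forward = unique_best_hits(blastx_hits, min_aligned=0.70)   # transcript -> protein
    reverse = unique_best_hits(tblastn_hits, min_aligned=0.80)  # protein -> transcript
    return {t: p for t, p in forward.items() if reverse.get(p) == t}
```

In practice, the hit records would be parsed from BLAST output rather than constructed by hand; the sketch only shows how the thresholds and the reciprocity requirement combine.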

To identify SCN loci present in the oneKP transcriptomes, the filtered orthologues were aligned against a reduced A. thaliana protein reference data set that included only SCN loci identified as “strictly” single-copy or single-copy across a “majority” of the 20 angiosperm genomes surveyed by De Smet et al. (2013).

Clustering, Reorientation, and Characterization of Putative Orthologues

The putative orthologues detected by the aforementioned workflow were clustered into SCN gene sets using the A. thaliana reference protein ID. The transcripts within each set were then reverse-complemented as necessary to ensure identical orientation among all sequences; the corresponding DNA sequence of A. thaliana was used as a reference for reorientation.
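The clustering and reorientation steps can be sketched as follows. The text does not state how orientation was determined, so this example makes an assumption: a transcript is reverse-complemented when its reverse complement shares more k-mers with the A. thaliana reference DNA than its forward strand does. All function names and the k-mer size are hypothetical; in practice the strand reported by BLAST could serve the same purpose.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient_to_reference(seq, reference, k=8):
    """Return seq in whichever orientation shares more k-mers with the reference."""
    ref_kmers = kmers(reference, k)
    flipped = reverse_complement(seq)
    forward_shared = len(kmers(seq, k) & ref_kmers)
    reverse_shared = len(kmers(flipped, k) & ref_kmers)
    return flipped if reverse_shared > forward_shared else seq


def cluster_by_reference(orthologues):
    """Group (reference_id, scaffold_id, sequence) records by reference protein ID."""
    clusters = defaultdict(dict)
    for ref_id, scaffold_id, seq in orthologues:
        clusters[ref_id][scaffold_id] = seq
    return dict(clusters)
```

Shared k-mer counts are only a rough proxy for orientation and would be unreliable for highly divergent sequences, which is why this is offered as an illustration rather than a substitute for alignment-based strand information.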

Detailed information about each gene was also compiled in tabular format. Tabular information included an A. thaliana gene ID from the PLAZA 2.5 database, a single-copy classification (e.g. “strictly” or “majority”) from De Smet et al. (2013), a gene functional description (see below), the number of putative orthologues detected across all assemblies, and a scaffold ID for each of the oneKP transcriptome assemblies included in the analysis. All functional annotations were compiled using data available in the TAIR10 Arabidopsis genome release (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_functional_descriptions_20130831.txt).


Multiple Sequence Alignment and Comparisons of Phylogenetic Utility Among 85 Commonly Shared SCN Genes

To facilitate both downstream assessments of phylogenetic utility and phylogenetic analyses using selected SCN genes, multiple sequence alignments were performed for all SCN gene sets with MAFFT (Katoh et al. 2013), using the --quiet and --auto parameters. In all, 85 gene alignments, representing a selection of the most commonly shared SCN genes detected across all samples (see Table 5-2), were manually edited in Geneious 6.1.4 (Biomatters Inc., Auckland, New Zealand); ragged ends and ambiguously aligned regions with uncertain sequence similarity were removed. Sequence characteristics were then calculated for each gene: the average pairwise sequence identity was calculated in Geneious, and the total number of characters, constant characters, variable but parsimony-uninformative characters, and parsimony-informative characters were calculated in PAUP* v. 4.0a136 (Swofford 2002). Finally, the edited alignments were concatenated into a supermatrix for phylogenetic analysis.
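The character categories reported for each alignment can be illustrated with a short Python sketch. This is not the PAUP* implementation; it applies the standard definitions (a parsimony-informative column has at least two states that each occur in at least two sequences) and, as a simplifying assumption, treats gaps, "?", and "N" as missing data while counting any other ambiguity code as an ordinary state. Function names are hypothetical.

```python
from collections import Counter

MISSING = set("-?N")


def classify_column(column):
    """Classify one alignment column as constant, uninformative, or informative."""
    states = Counter(c for c in column if c.upper() not in MISSING)
    if len(states) <= 1:
        # Includes columns made up entirely of missing data.
        return "constant"
    # Informative: at least two states, each present in at least two sequences.
    if sum(1 for n in states.values() if n >= 2) >= 2:
        return "informative"
    return "uninformative"


def summarize_alignment(seqs):
    """Count character categories for a list of equal-length aligned sequences."""
    upper = [s.upper() for s in seqs]
    counts = Counter(classify_column(col) for col in zip(*upper))
    return {
        "total": len(upper[0]),
        "constant": counts["constant"],
        "variable_uninformative": counts["uninformative"],
        "parsimony_informative": counts["informative"],
    }
```

For example, summarize_alignment(["AAGT-", "AAGTC", "ACTAC", "AATAC"]) reports five characters in total, two constant, one variable but uninformative, and two parsimony-informative.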

Outgroup Selection and Phylogenetic Analyses

Two species from Gentianales, affine Balf.f. and Galium boreale L. (Gentianaceae and Rubiaceae, respectively), were selected as outgroups based on the phylogenetic results of Refulio-Rodriguez and Olmstead (2014). Phylogenetic analyses of the 85-gene data set were conducted using both maximum likelihood (ML) and maximum parsimony (MP) optimality criteria; gaps and missing data were ignored. Branch support for the resulting trees from each analysis was evaluated using bootstrap (BS) analyses (Felsenstein 1985).
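For readers unfamiliar with the bootstrap, the resampling step can be sketched in Python as follows. This is a generic illustration of nonparametric bootstrapping of alignment columns, not the procedure implemented in RAxML or PAUP*; tree inference on each pseudoreplicate is omitted, and the function names are hypothetical.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping taxon name to an aligned sequence (equal lengths).
    Returns a pseudoreplicate alignment with the same dimensions.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_support(clade_recovered):
    """Percentage of replicates in which a clade of interest was recovered."""
    flags = list(clade_recovered)
    return 100.0 * sum(flags) / len(flags)
```

A clade recovered in three of four replicate trees would receive a BS value of 75% under this calculation; real analyses use many more replicates (100 in this study).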


With regard to my choice of methods for phylogenetic inference, MP was used to diagnose suprafamilial relationships that may result from long-branch attraction (Felsenstein 1978; Hendy and Penny 1989). Simulation studies demonstrate that MP is particularly susceptible to long-branch attraction problems (Huelsenbeck and Hillis 1993). Thus, comparisons of ML and MP topologies are useful for detecting artificial relationships that result from incomplete taxon sampling and heterogeneity in branch lengths.

The ML analysis was conducted using Randomized Axelerated Maximum Likelihood (RAxML) version 8.0.25 (Stamatakis 2014; released June 16, 2014) and the following commands: raxmlHPC-PTHREADS-SSE3 -f a -m GTRGAMMA -p 14769 -N 10 -T 8 -x 18160 -N 100. Rapid bootstraps (RBS) were performed with 100 replicates, followed by both fast and slow ML heuristic searches (ten replicates) using RBS starting trees and the GTR+Γ model. The MP analysis was conducted in PAUP* version 4.0a136 (Swofford 2002) using a heuristic search with 100 random addition replicates, tree bisection-reconnection (TBR) branch swapping, and one tree held during each stepwise addition. BS values were obtained for the resulting trees using a heuristic search with TBR branch swapping, 100 replicates, and ten random addition sequences per BS replicate.

Results

Single-copy Nuclear Gene Discovery

In all, 1,993 putatively orthologous SCN genes were discovered with the bioinformatic approach described here (Table 5-2; Object 5-1). The number of SCN genes detected per accession ranged from zero to 909 [i.e. in Plantago coronopus L. and Clinopodium serpyllifolium subsp. fruticosum (L.) Bräuchler, respectively], with a median number of 616 loci detected per transcriptome (mean = 589, standard deviation = 193) (Figure 5-1).

The distribution of shared SCN loci across all sampled accessions showed a negative trend (Figure 5-2). No locus was present in all 76 of the transcriptomes. In all, only 60 of the 1,993 (3%) loci detected were present in at least 57 (75%) of the sampled transcriptomes; the most commonly shared locus was present in only 69 accessions. As for the less commonly shared genes, 103 were present in at least three accessions. There were no uniquely detected loci in any of the oneKP transcriptomes.
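Counts such as the number of loci present in at least 75% of transcriptomes follow from a simple occupancy tally. The Python sketch below shows one way such a summary could be computed; it is not the code used in this study, and the sample and locus identifiers in the usage note are hypothetical placeholders.

```python
from collections import Counter


def locus_occupancy(loci_per_sample):
    """Count how many samples contain each locus.

    loci_per_sample: dict mapping a sample ID to an iterable of locus IDs.
    """
    counts = Counter()
    for loci in loci_per_sample.values():
        counts.update(set(loci))  # count each locus at most once per sample
    return counts


def shared_loci(occupancy, n_samples, min_fraction):
    """Return the sorted loci present in at least min_fraction of the samples."""
    cutoff = min_fraction * n_samples
    return sorted(locus for locus, n in occupancy.items() if n >= cutoff)
```

With four toy samples in which locus "g1" occurs in all four and "g2" in two, shared_loci(occupancy, 4, 0.75) returns only "g1"; applied to the 76 transcriptomes and a fraction of 0.75, the same cutoff corresponds to the 57-accession threshold used above.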

Sequence Characteristics and Phylogenetic Utility

Sequence characteristics for 85 commonly shared SCN genes detected by the marker discovery pipeline are summarized in Table 5-4. The aligned and edited gene sets ranged in size from 569 to 2,345 bp (mean = 1,365, median = 1,301, standard deviation = 419) and exhibited many desirable sequence characteristics for phylogenetic investigations, including a mean pairwise sequence identity of 82.2% (median = 82.7%, standard deviation = 2.5%, range = 73.6–86.7%). The total number of variable characters within these gene alignments ranged from 286 to 1,555, with a mean of 817 (median = 783, standard deviation = 265); nearly 85.0% of the variable characters were parsimony-informative (mean = 693, median = 657, standard deviation = 226, range = 252–1,357).

The final supermatrix of 85 concatenated SCN genes (Figure 5-3) included 116,052 characters, of which 46,648 were constant and 69,404 were variable. The percent pairwise sequence identity was 49.8%, which is lower than the mean pairwise sequence identity calculated for individual gene alignments due to missing data in the supermatrix. The alignment is available as a supplemental document; detailed sequence characteristics are provided in Table 5-4.

Phylogenetic Results and Relationships in Lamiales

The ML tree had a negative log likelihood (-ln L) of 2035023.973823 (Figure 5-5 A and B; estimated model parameters are reported in Table 5-5). Most of the relationships in the resulting topology were strongly supported by the BS values; only five nodes received support less than 90%.

In the ML tree, the outgroup taxa from Gentianales comprised a maximally supported clade (BS = 100%; Figure 5-5 A and B). An clade (BS = 51%) of Boraginales and Solanales, each of which was monophyletic (BS = 100%), was strongly supported as sister to Lamiales (BS = 100%).

All families from Lamiales represented by multiple accessions were monophyletic (BS = 100%; Figure 5-5 A and B, Figure 5-6 A and B), and their interrelationships were fully resolved with very high or maximum support (BS = 92–100%; except for the position of Pedaliaceae). The family-level branching order of the first six nodes in Lamiales (Figure 5-6 A and B; see numbered nodes in Figure 5-7) was as follows: Oleaceae, Tetrachondraceae, Calceolariaceae and Gesneriaceae, Plantaginaceae, Scrophulariaceae, and Byblidaceae. The next branching group (Figure 5-7: node 7) included the sister families, Acanthaceae and Lentibulariaceae (BS = 100%), which were sister (BS = 100%) to a clade of all remaining families.

The phylogenetic placement of Pedaliaceae was not well supported in the ML tree (BS = 57%); the family was sister to a large clade (BS = 97%) composed of two subclades, with three and four of the remaining families of Lamiales, respectively. The first of these subclades (BS = 97%) included Schlegeliaceae, which was sister (BS = 97%) to the sister families Verbenaceae and Bignoniaceae (BS = 99%). The second subclade included Lamiaceae, which was sister to a maximally supported clade of the sister families Orobanchaceae and Rehmanniaceae (BS = 100%) and their sister, Paulowniaceae (BS = 100%).

Topological differences between the ML and MP phylogenies were observed at multiple nodes. Some of these differences were strongly supported with BS values (see description below and Figure 5-7).

The MP search recovered only one tree (Figure 5-6 A and B) with a length of 430,444 steps, consistency index (CI) of 0.296, retention index (RI) of 0.496, and a rescaled consistency index (RC) of 0.147. Solanales were sister to Lamiales (MP BS = 85%), and this clade in turn was sister to Boraginales (MP BS = 100%). The MP and ML topologies disagreed slightly with regard to the family-level branching order across the first six nodes in Lamiales. As compared to the ML tree, the positions of Tetrachondraceae and Oleaceae in the MP topology were inverted and inferred with strong support (MP BS = 100% and 97%, respectively; Figure 5-7: nodes 1-2). The positions of Scrophulariaceae and Byblidaceae were also inverted (Figure 5-7: node 6). However, the alternative position of Scrophulariaceae (Figure 5-7: node 6) as sister to the remaining families of Lamiales received BS support of only 56%.
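As a quick check on the tree statistics reported for the MP tree, the rescaled consistency index is defined as the product of the consistency index and the retention index, so the three reported values can be verified against one another:

```latex
\[
\mathrm{RC} = \mathrm{CI} \times \mathrm{RI} = 0.296 \times 0.496 \approx 0.147
\]
```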

The remaining families of Lamiales in the MP topology (Figure 5-7: node 7) formed a large clade (MP BS = 100%) composed of two subclades whose membership differed greatly from the ML results. The first (MP BS = 70%) included Lamiaceae, which was sister to a smaller subclade (MP BS = 87%) that included Verbenaceae and the sister families Acanthaceae and Lentibulariaceae (MP BS = 100%). The second subclade (MP BS = 68%) comprised two small subclades, each with three families: Orobanchaceae, Rehmanniaceae, and Paulowniaceae (MP BS = 100%) and Bignoniaceae, Pedaliaceae, and Schlegeliaceae (MP BS = 68%). In the latter subclade, Pedaliaceae was weakly supported (MP BS = 72%) as sister to Bignoniaceae.

Discussion

My phylogenetic results—the first inferred with large-scale, multi-locus nuclear data—provide an alternative view of Lamiales phylogeny as compared with previous studies. They also demonstrate the enormous potential of SCN loci to resolve difficult phylogenetic problems and pave the way for future studies that can confidently resolve interfamilial relationships in Lamiales.

Utility of New SCN Gene Sets for Phylogenetic Studies of Lamiales

The marker development approach used in this study was effective and yielded a new set of 1,993 SCN genes for phylogenetic studies of Lamiales. Based on my preliminary gene survey, at least 85 of these SCN loci exhibit desirable characteristics for phylogenetic applications, especially with regard to their sequence variability and alignability across the full phylogenetic scale of Lamiales. Most importantly, my results suggest that these loci are phylogenetically informative. Phylogenetic analyses of the 85-gene set recover fully resolved topologies with many strongly supported relationships, including both inter- and intra-familial relationships. These results are particularly encouraging and suggest that SCN loci may offer enormous potential for resolving difficult problems across multiple phylogenetic scales in Lamiales.

The Interfamilial Phylogeny of Lamiales

All previous phylogenetic analyses of Lamiales have been based on organellar datasets, of which the largest and most recent included over 17 kilobase-pairs (Kbp) of sequence data (i.e. Refulio-Rodriguez and Olmstead 2014). The nuclear phylogenomic results presented here are based on an even larger dataset comprising nuclear genes (i.e. over 116 Kbp of sequence data), and they provide a view of interfamilial relationships that, in several cases, extends support for previous topologies derived from MP, ML, and Bayesian phylogenetic analyses (Schäferhoff et al. 2010; Refulio-Rodriguez and Olmstead 2014). However, incomplete taxon sampling and conflicting ML and MP topologies (see additional discussion below) collectively limit direct comparisons of my results with previous studies.

The phylogenetic analyses of Schäferhoff et al. (2010) and Refulio-Rodriguez and Olmstead (2014) are consistent with regard to their family-level branching order across the basal grade of Lamiales.12 Although my sampling excludes two of these families (i.e. Plocospermataceae and Carlemanniaceae), the topologies recovered by my phylogenomic analyses appear largely consistent with previous studies. However, the positions of two family-level clades (i.e. Tetrachondraceae and Scrophulariaceae) differ between the ML and MP topologies. Tetrachondraceae are sister to all other Lamiales in the MP tree (BS = 100%), but are sister to all other Lamiales except Oleaceae in the ML tree (BS = 92%). These MP results contrast with those of Refulio-Rodriguez and Olmstead (2014), who recovered identical topologies across the first five nodes in Lamiales with MP, ML, and Bayesian inference; Tetrachondraceae are sister to all Lamiales except Oleaceae, Carlemanniaceae, and Plocospermataceae. With regard to Scrophulariaceae and Byblidaceae, their positions are interchanged in my MP and ML topologies. It is impossible to make a direct comparison with the ML and MP topologies of Refulio-Rodriguez and Olmstead (2014) because the trees are not available. However, the support values reported by the authors suggest topological uncertainty. Scrophulariaceae are weakly supported (MP BS = 68%) as sister to the HCL, and the sister relationship of Byblidaceae and Linderniaceae was only recovered with Bayesian inference (Refulio-Rodriguez and Olmstead 2014).

12 Plocospermataceae ((Carlemanniaceae (Oleaceae)) (Tetrachondraceae ((Calceolariaceae (Gesneriaceae)) (Plantaginaceae (Scrophulariaceae (HCL))))))

Within the HCL, my phylogenomic analyses recover a clade of families that is consistent with previous evidence (e.g. Schäferhoff et al. 2010; Refulio-Rodriguez and Olmstead 2014).13 Both resolution and support for interfamilial relationships in the HCL are considerably improved with the application of large-scale nuclear data relative to previous studies; only a few nodes were poorly supported (ML/MP BS < 70%) in either the ML or MP topologies.

Direct comparisons with previous HCL topologies are difficult to interpret because my sampling was opportunistic and six of the 17 families in the clade were not included in my analyses. Of the relationships that are comparable, the inclusive subclade of Lamiaceae, Orobanchaceae, Rehmanniaceae, and Paulowniaceae (the “Lamiaceae/Orobanchaceae” clade) recovered by the ML analysis appears consistent with previous studies (e.g. Soltis et al. 2011; Schäferhoff et al. 2010; Refulio-Rodriguez and Olmstead 2014).14 As for the remaining relationships in the HCL, my topologies conflict with most existing phylogenetic evidence. One exception is the sister relationship between Acanthaceae and Lentibulariaceae (or “Acanthaceae/Lentibulariaceae” clade), which is maximally supported in both the ML and MP topologies; the relationship is reported in the 17-gene tree of Soltis et al. (2011), but not in Refulio-Rodriguez and Olmstead (2014).

13 In this case, the HCL includes 11 of the 17 families recovered in the analyses of Schäferhoff et al. (2010) (i.e. Acanthaceae, Bignoniaceae, Byblidaceae, Lamiaceae, Lentibulariaceae, Orobanchaceae, Paulowniaceae, Pedaliaceae, Rehmanniaceae, Schlegeliaceae, and Verbenaceae). Byblidaceae were not included in the HCL in the MP tree because their position was interchanged with that of Scrophulariaceae.

14 The Lamiaceae/Orobanchaceae subclade of HCL is not recovered by the MP analysis. Lamiaceae are sister to a subclade that includes Acanthaceae, Lentibulariaceae, and Verbenaceae. The relationships among Orobanchaceae, Rehmanniaceae, and Paulowniaceae are equivalent, but the subclade is sister to another subclade that includes Bignoniaceae, Pedaliaceae, and Schlegeliaceae.

My ML and MP phylogenomic results provide two contrasting views of interfamilial relationships in the HCL (see Figure 5-7). The HCL is composed of either three major subclades and two independent family-level clades (ML) or two major subclades comprising six and four family-level clades, respectively (MP). These HCL topologies share few common relationships; only two clades of families are identical across both trees (i.e. Paulowniaceae/Rehmanniaceae/Orobanchaceae and Acanthaceae/Lentibulariaceae), but they differ in their relationships to other family-level clades of Lamiales.

Sources of Topological Conflict

Topological differences between the ML and MP phylogenies were observed at multiple nodes. These topologies conflict not only with each other as regards interfamilial relationships, but also, in many cases, with previous phylogenetic evidence, particularly within the HCL.

It is likely that incomplete taxon sampling and long-branch attraction are major factors contributing to topological incongruence between the trees recovered here with ML and MP. Parsimony, in particular, is susceptible to long-branch attraction (Felsenstein 1978), and my taxon sampling is limited to 17 of 25 families of Lamiales. Use of comprehensive taxon sampling in future studies may facilitate more accurate phylogenetic inference, or at least resolve some of the conflicts between ML and MP topologies.

The positions of Byblidaceae and Lentibulariaceae, in particular, have proven difficult to reconstruct in previous studies. So far, only Bayesian inference has recovered their positions within the HCL with strong support (e.g. Refulio-Rodriguez and Olmstead 2014). Anomalous molecular evolutionary attributes have been documented in both carnivorous and parasitic families (e.g. Lentibulariaceae and Orobanchaceae, respectively), including dynamic genome size evolution, relaxed gene functional constraints, and increased rates of substitution (Albert et al. 2010; Wolfe and dePamphilis 1998). These attributes, which are likely associated with unique life history traits, can result in longer branch lengths (i.e. relative to non-carnivorous or non-parasitic plants) and may cause problems for phylogenetic inference. Simulation studies have demonstrated that Bayesian inference is biased in favor of topologies that group long branches together, particularly as more data are applied to the problem (Kolaczkowski and Thornton 2009). Given the patterns I observe here, it seems likely that at least some relationships in the topology of Refulio-Rodriguez and Olmstead (2014) may reflect this bias.

While topological inconsistencies between my nuclear phylogenomic results and previous plastid-based studies could indicate ancient hybridization events (a hypothesis that must be ruled out as part of future investigations), evidence from preliminary plastid phylogenomic analyses of Lamiidae suggests otherwise (Stull 2014, unpublished data). These large-scale plastid data sets yield strongly supported ML topologies that are consistent with those described in this study. Thus, it seems likely that incomplete taxon sampling, long-branch attraction, and phylogenetic method are primary factors responsible for the topological discordances described above, and my ML phylogeny may be a more accurate representation of interfamilial relationships.

Conclusions

Resolving relationships among family-level clades in Lamiales represents one of the most difficult problems in angiosperm phylogenetics (Soltis et al. 2005). This study successfully identified nearly 2,000 SCN genes and demonstrated that interfamilial relationships can be resolved with large-scale nuclear data. More importantly, it produced a resource that will aid future phylogenetic investigations across this large and diverse clade of life.

However, despite the extraordinary value of the oneKP resources, this study was limited by several factors. First, the opportunistic taxon sampling afforded by oneKP is not representative of family-level diversity in Lamiales (17 of 25 families included) and thus both precludes placement of those families not represented and possibly alters those relationships that are included. While useful for exploring the utility of SCN loci for resolving phylogenetic relationships in Lamiales, this analysis provides a limited view of interfamilial relationships that will likely change with expanded taxon sampling. Second, phylogenetic relationships may be better inferred with complete gene sampling (Huelsenbeck 1991; Lemmon et al. 2009; Roure et al. 2012), although some studies provide contrasting viewpoints on this issue (e.g. see Wiens 2003, 2006; Philippe et al. 2004). My dependence on existing oneKP data resources necessitates use of a matrix with missing data from many accessions, and these missing data could have deleterious impacts on phylogenetic accuracy (Lemmon et al. 2009). Third, my phylogenetic analyses are inferred from a concatenated dataset; missing data precluded use of alternative methods that can account for variation among individual gene histories (e.g. MP-EST [Liu et al. 2010], BUCKy [Larget et al. 2010], or ASTRAL [Mirarab et al. 2014]).

The Lamiales crown group is estimated to have diverged from its most recent common ancestor around 62.77 to 106.9 million years ago (Wikström et al. 2001; Bremer et al. 2004; Janssens et al. 2009; Magallón and Castillo 2009), and yet short branch lengths are commonly observed along the backbone of both Lamiales (e.g. Schäferhoff et al. 2010) and many of its subclades (e.g. Lamiaceae; see Chapter 2). These phylogenetic patterns suggest that Lamiales may have undergone many bouts of rapid divergence throughout their relatively short evolutionary history. In other words, Lamiales are likely composed of many nested, rapidly diverging clades (a common pattern in angiosperm phylogeny [Soltis et al. 2004]); resolving phylogenetic relationships under this evolutionary scenario requires a carefully planned strategy.

Evolutionary processes and historical events such as incomplete lineage sorting (ILS), hybridization or horizontal transfer, and gene duplication or are all possible factors that can contribute to poorly resolved or unsupported tree topologies. Phylogenetic inference from multilocus nuclear datasets may represent a useful strategy for both resolving relationships and hypothesis testing in any of these cases.

In light of the phylogenetic patterns described above, ILS may be a predominant factor complicating phylogenetic inference within many groups of Lamiales. With ILS, individual gene histories can appear misleading or uninformative with regard to species relationships (Maddison and Knowles 2006; see also: Degnan and Rosenberg 2006, 2009; Kubatko and Degnan 2007; Edwards et al. 2007; Liu and Pearl 2007; Liu et al. 2008). Furthermore, simulation studies demonstrate that the probability of discordant signal among sampled loci depends more on the time between divergence events than on the absolute time of divergence (Oliver 2013). Thus, individual gene histories in Lamiales could provide conflicting signal at both shallow and deep phylogenetic nodes. Phylogenetic inference from a concatenated dataset, such as that analyzed here, may therefore contribute to topological uncertainty along the backbone of the phylogeny.

The SCN gene set identified by this study will serve as a valuable resource for future studies in Lamiales systematics. Using this new resource, we can employ novel strategies for targeted sequencing (see Chapter 3) to fill in the matrix to include all families and reduce missing data, allowing us to approach our goal of providing a more comprehensive view of evolutionary relationships in this large and biologically interesting clade of life. Moreover, we can begin to explore use of newly available and powerful genome-scale coalescent-based species tree estimation approaches (Mirarab et al. 2014; see Wickett et al. 2014 for recent example).


Table 5-1. The following de novo transcriptome assemblies were downloaded from the One Thousand Plants (oneKP) project and used for single-copy nuclear locus discovery and phylogenetic analyses. The phylogenetic position (APGIII Clade) and taxonomic classification (i.e., order, family, and taxon) for each assembly is reported below. Additional information about oneKP assemblies, including tissues used in sample preparation and voucher specimen data, is available online at http://www.onekp.com.

APGIII Clade | Order | Family | Taxon | oneKP Sample ID
Core Eudicots/Asterids/Lamiids | Boraginales | Boraginaceae | Ehretia acuminata R. Br. | EMAL
Core Eudicots/Asterids/Lamiids | Boraginales | Boraginaceae | Lennoa madreporoides La Llave & Lex. | SMUR
Core Eudicots/Asterids/Lamiids | Boraginales | Boraginaceae | Mertensia paniculata (Aiton) G.Don. | DKFZ
Core Eudicots/Asterids/Lamiids | Boraginales | Boraginaceae | Phacelia campanularia A.Gray | YQIJ
Core Eudicots/Asterids/Lamiids | Boraginales | Boraginaceae | Pholisma arenarium Nutt. | HANM
Core Eudicots/Asterids/Lamiids | Gentianales | Gentianaceae | Balf. f. | KPUM
Core Eudicots/Asterids/Lamiids | Gentianales | Rubiaceae | Galium boreale L. | WQRD
Core Eudicots/Asterids/Lamiids | Lamiales | Acanthaceae | quadrifidus Standl. | PCGJ
Core Eudicots/Asterids/Lamiids | Lamiales | Acanthaceae | Ruellia brittoniana Leonard | AYIY
Core Eudicots/Asterids/Lamiids | Lamiales | Acanthaceae | Sanchezia Ruiz & Pav. | NBMW
Core Eudicots/Asterids/Lamiids | Lamiales | Acanthaceae | Strobilanthes dyeriana Mast. | WEAC
Core Eudicots/Asterids/Lamiids | Lamiales | Bignoniaceae | Kigelia africana (Lam.) Benth. | QKEI
Core Eudicots/Asterids/Lamiids | Lamiales | Bignoniaceae | Kigelia africana (Lam.) Benth. | SVQC
Core Eudicots/Asterids/Lamiids | Lamiales | Bignoniaceae | Mansoa alliacea (Lam.) A. H. Gentry | TKEK
Core Eudicots/Asterids/Lamiids | Lamiales | Bignoniaceae | Tabebuia umbellata (Sond.) Sandwith | UTQR
Core Eudicots/Asterids/Lamiids | Lamiales | Byblidaceae | gigantea Lindl. | GDZS
Core Eudicots/Asterids/Lamiids | Lamiales | Calceolariaceae | Calceolaria pinifolia Cav. | DCCI
Core Eudicots/Asterids/Lamiids | Lamiales | Gesneriaceae | Saintpaulia ionantha H. Wendl. | RWKR
Core Eudicots/Asterids/Lamiids | Lamiales | Gesneriaceae | Sinningia tuberosa (Mart.) H. E. Moore | DTNC
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Agastache rugosa Kuntze. | PUCW
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Ajuga reptans L. | UCNM
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Lavandula angustifolia Mill. | FYUH
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Leonurus japonicus Houtt. | SNNC
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Marrubium vulgare L. | EAAA
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Melissa officinalis L. | TAGM
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Clinopodium serpyllifolium subsp. fruticosum (L.) Bräuchler | WHNV
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Nepeta cataria L. | FUMQ
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Oxera neriifolia (Montrouz.) Beauvis. | GNPX
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Oxera pulchella Labill. | RTNA
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Pogostemon cablin (Blanco) Benth. | GETL
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Poliomintha bustamanta B. L. Turner | XMBA
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Prunella vulgaris L. | PHCE
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Rosmarinus officinalis L. | FDMM
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Salvia L. | EQDA
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Scutellaria montana Chapm. | ATYL
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Plectranthus scutellarioides (L.) R. Br. | BAHE
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Teucrium chamaedrys L. | LRRR
Core Eudicots/Asterids/Lamiids | Lamiales | Lamiaceae | Thymus vulgaris L. | IYDF
Core Eudicots/Asterids/Lamiids | Lamiales | Lentibulariaceae | Pinguicula agnata Casper | MXFG
Core Eudicots/Asterids/Lamiids | Lamiales | Lentibulariaceae | Pinguicula caudata Schltdl. | JCMU
Core Eudicots/Asterids/Lamiids | Lamiales | Lentibulariaceae | Utricularia L. | HRUR
Core Eudicots/Asterids/Lamiids | Lamiales | Oleaceae | retusus Paxton | KTAR
Core Eudicots/Asterids/Lamiids | Lamiales | Oleaceae | segregata (Jacq.) Krug & Urb. | UEEN
Core Eudicots/Asterids/Lamiids | Lamiales | Oleaceae | Ligustrum sinense Lour. | MZLD
Core Eudicots/Asterids/Lamiids | Lamiales | Oleaceae | Olea europaea L. | TORX
Core Eudicots/Asterids/Lamiids | Lamiales | Orobanchaceae | (L.) Wallr. | FAMO
Core Eudicots/Asterids/Lamiids | Lamiales | Orobanchaceae | Epifagus virginiana (L.) W.P.C.Barton | URZI
Core Eudicots/Asterids/Lamiids | Lamiales | Orobanchaceae | Epifagus virginiana (L.) W. P. C. Barton | XMOG
Core Eudicots/Asterids/Lamiids | Lamiales | Orobanchaceae | Lindenbergia philippinensis Benth. | WUZV
Core Eudicots/Asterids/Lamiids | Lamiales | Orobanchaceae | Lindenbergia philippinensis Benth. | ZVFS
Core Eudicots/Asterids/Lamiids | Lamiales | Orobanchaceae | fasciculata Nutt. | PHOQ
Core Eudicots/Asterids/Lamiids | Lamiales | Orobanchaceae | Orobanche fasciculata Nutt. | VTOK
Core Eudicots/Asterids/Lamiids | Lamiales | Paulowniaceae | Paulownia fargesii Franch. | UMUL
Core Eudicots/Asterids/Lamiids | Lamiales | Pedaliaceae | Uncarina grandidieri (Beaill.) Stapf | ZRIN
Core Eudicots/Asterids/Lamiids | Lamiales | Phrymaceae | Rehmannia glutinosa Steud. | OWAS
Core Eudicots/Asterids/Lamiids | Lamiales | Plantaginaceae | Antirrhinum majus L. | EBOL
Core Eudicots/Asterids/Lamiids | Lamiales | Plantaginaceae | Antirrhinum majus L. | TPUT
Core Eudicots/Asterids/Lamiids | Lamiales | Plantaginaceae | Antirrhinum braun-blanquetii Rothm. | YRHD
Core Eudicots/Asterids/Lamiids | Lamiales | Plantaginaceae | caroliniana (Walter) B. L. Rob. | CLRW
Core Eudicots/Asterids/Lamiids | Lamiales | Plantaginaceae | Digitalis purpurea L. | GNRI
Core Eudicots/Asterids/Lamiids | Lamiales | Plantaginaceae | Plantago coronopus L. | DCVZ
Core Eudicots/Asterids/Lamiids | Lamiales | Plantaginaceae | Plantago maritima L. | YKZB
Core Eudicots/Asterids/Lamiids | Lamiales | Plantaginaceae | Plantago virginica L. | PTBJ
Core Eudicots/Asterids/Lamiids | Lamiales | Schlegeliaceae | Schlegelia parasitica Griseb. | GAKQ
Core Eudicots/Asterids/Lamiids | Lamiales | Schlegeliaceae | Schlegelia parasitica Griseb. | CWLL
Core Eudicots/Asterids/Lamiids | Lamiales | Schlegeliaceae | Schlegelia violacea Griseb. | EDXZ
Core Eudicots/Asterids/Lamiids | Lamiales | Scrophulariaceae | Anticharis glandulosa Asch. | EJBY
Core Eudicots/Asterids/Lamiids | Lamiales | Scrophulariaceae | L. | GRFT
Core Eudicots/Asterids/Lamiids | Lamiales | Scrophulariaceae | Buddleja lindleyana Lindl. | XRLM
Core Eudicots/Asterids/Lamiids | Lamiales | Scrophulariaceae | Celsia arcturus Jacq. | SIBR
Core Eudicots/Asterids/Lamiids | Lamiales | Scrophulariaceae | Verbascum L. | XXYA
Core Eudicots/Asterids/Lamiids | Lamiales | Tetrachondraceae | Polypremum procumbens L. | COBX
Core Eudicots/Asterids/Lamiids | Lamiales | Verbenaceae | Lantana camara L. | PSHB
Core Eudicots/Asterids/Lamiids | Lamiales | Verbenaceae | Phyla dulcis (Trev.) Moldenke | MQIV
Core Eudicots/Asterids/Lamiids | Lamiales | Verbenaceae | Verbena hastata L. | GCFE
Core Eudicots/Asterids/Lamiids | Solanales | Convolvulaceae | Ipomoea pubescens Lam. | EMBR
Core Eudicots/Asterids/Lamiids | Solanales | Solanaceae | Solanum ptychanthum Dunal | DLJZ

177

Table 5-2. A summary of 1,993 single-copy nuclear (SCN) loci identified by the MarkerMiner 1.0 pipeline (Chamala et al., submitted) using 77 transcriptomes from the One Thousand Plants project. Reported for each locus are the Arabidopsis thaliana gene ID (PLAZA 2.5 database; Van Bel et al. 2012), single-copy (SC) status across angiosperms (De Smet et al. 2013), gene functional annotations derived from the TAIR10 Arabidopsis genome release, and the number of putative orthologues detected across analyzed transcriptomes. The loci used for phylogeny reconstruction are indicated with a superscript dagger (†) in the final column.
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT1G53280  Majority  Class I glutamine amidotransferase-like superfamily protein  69†
AT1G64550  Majority  General control non-repressible 3  68†
AT4G37040  Strictly  Methionine aminopeptidase 1D  67†
AT5G64370  Majority  Beta-ureidopropionase  67†
AT4G29490  Majority  Metallopeptidase M24 family protein  66†
AT5G05200  Majority  Protein kinase superfamily protein  66†
AT5G57655  Majority  Xylose isomerase family protein  66†
AT1G30360  Majority  Early-responsive to dehydration stress protein (ERD4)  65†
AT1G43860  Majority  Sequence-specific DNA binding transcription factors  65†
AT5G14520  Majority  Pescadillo-related  65†
AT1G62750  Majority  Translation elongation factor EFG/EF2 protein  64†
AT2G18940  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  64†
AT5G14250  Majority  Proteasome component (PCI) domain protein  64†
AT5G46580  Majority  Pentatricopeptide (PPR) repeat-containing protein  63†
AT4G35850  Majority  Pentatricopeptide repeat (PPR) superfamily protein  62†
AT5G62530  Majority  Aldehyde dehydrogenase 12A1  62†
AT5G63890  Majority  Histidinol dehydrogenase  62†
AT1G13900  Majority  Purple acid phosphatases superfamily protein  61†
AT1G57770  Majority  FAD/NAD(P)-binding oxidoreductase family protein  61†
AT1G60770  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  61
AT1G76400  Majority  Ribophorin I  61†
AT2G31880  Majority  Leucine-rich repeat protein kinase family protein  61†


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT3G10230  Majority  Lycopene cyclase  61†
AT3G51050  Majority  FG-GAP repeat-containing protein  61†
AT3G58460  Majority  RHOMBOID-like protein 15  61†
AT5G65720  Majority  Nitrogen fixation S (NIFS)-like 1  61†
AT1G07230  Majority  Non-specific phospholipase C1  60†
AT1G67680  Majority  SRP72 RNA-binding domain  60†
AT1G73720  Majority  Transducin family protein / WD-40 repeat family protein  60†
AT3G45300  Majority  Isovaleryl-CoA-dehydrogenase  60†
AT5G30510  Majority  Ribosomal protein S1  60†
AT1G18260  Majority  HCP-like superfamily protein  59†
AT1G74470  Majority  Pyridine nucleotide-disulphide oxidoreductase family protein  59†
AT3G54860  Majority  Sec1/munc18-like (SM) proteins superfamily  59†
AT4G03560  Majority  Two-pore channel 1  59†
AT4G29830  Majority  Transducin/WD40 repeat-like superfamily protein  59†
AT5G04420  Majority  Galactose oxidase/kelch repeat superfamily protein  59†
AT5G04660  Majority  Cytochrome P450, family 77, subfamily A, polypeptide 4  59†
AT5G06260  Majority  TLD-domain containing nucleolar protein  59†
AT5G19540  Majority  Unknown  59†
AT5G42310  Majority  Pentatricopeptide repeat (PPR-like) superfamily protein  59†
AT1G05350  Majority  NAD(P)-binding Rossmann-fold superfamily protein  58†
AT1G74850  Majority  Plastid transcriptionally active 2  58†
AT2G18710  Majority  SECY homolog 1  58†
AT3G20790  Strictly  NAD(P)-binding Rossmann-fold superfamily protein  58†
AT4G19860  Majority  Alpha/beta-Hydrolases superfamily protein  58†
AT4G27800  Majority  Thylakoid-associated phosphatase 38  58†
AT4G30310  Majority  FGGY family of carbohydrate kinase  58†


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT4G31990  Majority  Aspartate aminotransferase 5  58†
AT5G08170  Majority  Porphyromonas-type peptidyl-arginine deiminase family protein  58†
AT5G13650  Majority  Elongation factor family protein  58†
AT5G42480  Majority  Chaperone DnaJ-domain superfamily protein  58†
AT3G06510  Majority  Glycosyl hydrolase superfamily protein  57†
AT3G09650  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  57
AT3G17810  Majority  Pyrimidine 1  57†
AT3G47610  Majority  Transcription regulators; Zinc ion binding  57†
AT3G53700  Majority  Pentatricopeptide repeat (PPR) superfamily protein  57†
AT3G55070  Majority  LisH/CRA/RING-U-box domains-containing protein  57†
AT4G16390  Majority  Pentatricopeptide (PPR) repeat-containing protein  57
AT5G13030  Majority  Unknown  57†
AT1G28340  Majority  Receptor like protein 4  56†
AT1G43580  Majority  Sphingomyelin synthetase family protein  56†
AT1G68830  Majority  STT7 homolog STN7  56†
AT1G73990  Majority  Signal peptide peptidase  56†
AT2G15230  Majority  Lipase 1  56†
AT3G01720  Majority  Unknown  56
AT3G09180  Majority  Unknown  56†
AT3G56460  Majority  GroES-like zinc-binding alcohol dehydrogenase family protein  56†
AT3G66658  Majority  Aldehyde dehydrogenase 22A1  56†
AT4G00740  Majority  S-adenosyl-L-methionine-dependent methyltransferases superfamily protein  56†
AT5G08100  Majority  N-terminal nucleophile aminohydrolases (Ntn hydrolases) superfamily protein  56†
AT5G17530  Majority  Phosphoglucosamine mutase family protein  56†
AT5G51340  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  56
AT5G57030  Majority  Lycopene beta/epsilon cyclase protein  56†


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT2G24820  Majority  Translocon at the inner envelope membrane of chloroplasts 55-II  55
AT2G37500  Majority  Arginine biosynthesis protein ArgJ family  55†
AT3G05350  Majority  Metallopeptidase M24 family protein  55†
AT3G52190  Majority  Phosphate transporter traffic facilitator1  55†
AT4G09020  Majority  Isoamylase 3  55†
AT5G09860  Majority  Nuclear matrix protein-related  55†
AT1G12410  Majority  CLP protease proteolytic subunit 2  54†
AT1G14300  Majority  ARM repeat superfamily protein  54†
AT1G15980  Majority  NDH-dependent cyclic electron flow 1  54
AT2G19940  Majority  Oxidoreductases, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor; Copper ion binding  54†
AT2G26060  Majority  Transducin/WD40 repeat-like superfamily protein  54†
AT2G43770  Majority  Transducin/WD40 repeat-like superfamily protein  54
AT3G17940  Majority  Galactose mutarotase-like superfamily protein  54
AT3G18860  Majority  Transducin family protein / WD-40 repeat family protein  54†
AT3G54850  Majority  Plant U-box 14  54
AT4G00090  Majority  Transducin/WD40 repeat-like superfamily protein  54†
AT4G13360  Majority  ATP-dependent caseinolytic (Clp) protease/crotonase family protein  54†
AT4G33030  Majority  Sulfoquinovosyldiacylglycerol 1  54†
AT5G13520  Majority  Peptidase M1 family protein  54†
AT5G43600  Majority  Ureidoglycolate amidohydrolase  54†
AT5G65860  Majority  Ankyrin repeat family protein  54
AT1G06550  Majority  ATP-dependent caseinolytic (Clp) protease/crotonase family protein  53
AT1G08490  Majority  Chloroplastic NIFS-like cysteine desulfurase  53
AT1G10510  Strictly  RNI-like superfamily protein  53
AT1G14810  Majority  Semialdehyde dehydrogenase family protein  53†


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT1G27150  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  53
AT1G31780  Majority  Unknown  53
AT1G80910  Majority  Protein of unknown function (DUF1712)  53
AT2G38000  Majority  Chaperone protein dnaJ-related  53
AT2G39190  Majority  Protein kinase superfamily protein  53
AT2G41720  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  53
AT3G04460  Majority  Peroxin-12  53
AT3G11830  Majority  TCP-1/cpn60 chaperonin family protein  53
AT3G57050  Majority  Cystathionine beta-lyase  53
AT3G57790  Majority  Pectin lyase-like superfamily protein  53
AT3G62370  Majority  Heme binding  53
AT4G01690  Majority  Flavin containing amine oxidoreductase family  53
AT4G04955  Majority  Allantoinase  53
AT4G14210  Majority  Phytoene desaturase 3  53
AT4G30810  Majority  Serine carboxypeptidase-like 29  53
AT5G11490  Majority  Adaptin family protein  53
AT5G13400  Majority  Major facilitator superfamily protein  53
AT5G22480  Majority  ZPR1 zinc-finger domain protein  53
AT1G16180  Majority  Serinc-domain containing serine and sphingolipid biosynthesis protein  52
AT1G27980  Majority  Dihydrosphingosine phosphate lyase  52
AT1G31800  Majority  Cytochrome P450, family 97, subfamily A, polypeptide 3  52
AT1G42970  Majority  Glyceraldehyde-3-phosphate dehydrogenase B subunit  52
AT1G61870  Majority  Pentatricopeptide repeat 336  52
AT3G15410  Majority  Leucine-rich repeat (LRR) family protein  52
AT3G27000  Majority  Actin related protein 2  52
AT3G44880  Majority  Pheophorbide a oxygenase family protein with Rieske [2Fe-2S] domain  52


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT3G55260  Majority  Beta-hexosaminidase 1  52
AT4G14500  Majority  Polyketide cyclase/dehydrase and lipid transport superfamily protein  52
AT4G16180  Majority  Unknown  52
AT4G30950  Majority  Fatty acid desaturase 6  52
AT5G08720  Majority  Unknown  52
AT5G10920  Majority  L-Aspartase-like family protein  52
AT5G14220  Majority  Flavin containing amine oxidoreductase family  52
AT5G19350  Majority  RNA-binding (RRM/RBD/RNP motifs) family protein  52
AT5G23940  Majority  HXXXD-type acyl-transferase family protein  52
AT5G48520  Majority  Unknown  52
AT5G54140  Majority  IAA-leucine-resistant (ILR1)-like 3  52
AT5G61060  Majority  Histone deacetylase 5  52
AT1G04620  Majority  Coenzyme F420 hydrogenase family / dehydrogenase, beta subunit family  51
AT1G08520  Majority  ALBINA 1  51
AT1G48520  Majority  GLU-ADT subunit B  51
AT1G73110  Majority  P-loop containing nucleoside triphosphate hydrolases superfamily protein  51
AT1G77470  Majority  Replication factor C subunit 3  51
AT2G01320  Majority  ABC-2 type transporter family protein  51
AT2G05830  Majority  NagB/RpiA/CoA transferase-like superfamily protein  51
AT2G27450  Majority  Nitrilase-like protein 1  51
AT2G33630  Majority  NAD(P)-binding Rossmann-fold superfamily protein  51
AT2G36630  Majority  Sulfite exporter TauE/SafE family protein  51
AT3G01910  Majority  Sulfite oxidase  51
AT3G24190  Majority  Protein kinase superfamily protein  51
AT4G38160  Majority  Mitochondrial transcription termination factor family protein  51
AT4G38890  Majority  FMN-linked oxidoreductases superfamily protein  51


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT5G21930  Majority  P-type ATPase of Arabidopsis 2  51
AT5G48020  Majority  2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily protein  51
AT5G52520  Majority  Class II aaRS and biotin synthetases superfamily protein  51
AT5G64050  Majority  Glutamate tRNA synthetase  51
AT5G67530  Majority  Plant U-box 49  51
AT1G21780  Majority  BTB/POZ domain-containing protein  50
AT1G47670  Majority  Transmembrane amino acid transporter family protein  50
AT1G70820  Majority  Phosphoglucomutase, putative / glucose phosphomutase, putative  50
AT1G79600  Majority  Protein kinase superfamily protein  50
AT2G01490  Majority  Phytanoyl-CoA dioxygenase (PhyH) family protein  50
AT2G26230  Majority  Uricase / urate oxidase / nodulin 35, putative  50
AT2G26510  Majority  Xanthine/uracil permease family protein  50
AT2G42690  Majority  Alpha/beta-Hydrolases superfamily protein  50
AT3G04870  Majority  Zeta-carotene desaturase  50
AT3G23940  Majority  Dehydratase family  50
AT3G48500  Majority  Nucleic acid-binding, OB-fold-like protein  50
AT4G08810  Majority  Calcium ion binding  50
AT4G10050  Majority  Esterase/lipase/thioesterase family protein  50
AT4G10180  Majority  Light-mediated development protein 1 / deetiolated1 (DET1)  50
AT4G21470  Majority  Riboflavin kinase/FMN hydrolase  50
AT4G29120  Majority  6-phosphogluconate dehydrogenase family protein  50
AT5G03070  Majority  Importin alpha isoform 9  50
AT5G24120  Majority  Sigma factor E  50
AT5G42470  Majority  Unknown  50
AT5G48220  Majority  Aldolase-type TIM barrel family protein  50
AT5G61510  Majority  GroES-like zinc-binding alcohol dehydrogenase family protein  50


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT5G64860  Majority  Disproportionating enzyme  50
AT1G19600  Majority  PfkB-like carbohydrate kinase family protein  49
AT1G62390  Majority  Octicosapeptide/Phox/Bem1p (PB1) domain-containing protein / tetratricopeptide repeat (TPR)-containing protein  49
AT1G67550  Majority  Urease  49
AT2G42920  Majority  Pentatricopeptide repeat (PPR-like) superfamily protein  49
AT2G47330  Majority  P-loop containing nucleoside triphosphate hydrolases superfamily protein  49
AT3G02300  Majority  Regulator of chromosome condensation (RCC1) family protein  49
AT3G03310  Majority  Lecithin:cholesterol acyltransferase 3  49
AT3G52940  Majority  Ergosterol biosynthesis ERG4/ERG24 family  49
AT3G53130  Majority  Cytochrome P450 superfamily protein  49
AT3G59040  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  49
AT4G15110  Majority  Cytochrome P450, family 97, subfamily B, polypeptide 3  49
AT4G33060  Majority  Cyclophilin-like peptidyl-prolyl cis-trans isomerase family protein  49
AT4G34030  Majority  3-methylcrotonyl-CoA carboxylase  49
AT4G38240  Majority  Alpha-1,3-mannosyl-glycoprotein beta-1,2-N-acetylglucosaminyltransferase, putative  49
AT5G06830  Majority  Unknown  49
AT5G13630  Majority  Magnesium-chelatase subunit chlH, chloroplast, putative / Mg-protoporphyrin IX chelatase, putative (CHLH)  49
AT5G49970  Majority  Pyridoxin (pyrodoxamine) 5'-phosphate oxidase  49
AT5G56900  Majority  CwfJ-like family protein / zinc finger (CCCH-type) family protein  49
AT5G57460  Majority  Unknown  49
AT5G61530  Majority  Small G protein family protein / RhoGAP family protein  49
AT5G64440  Majority  Fatty acid amide hydrolase  49
AT1G04130  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  48
AT1G09010  Majority  Glycoside hydrolase family 2 protein  48
AT1G51110  Majority  Plastid-lipid associated protein PAP / fibrillin family protein  48


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT1G64600  Majority  Methyltransferases; copper ion binding  48
AT2G13540  Majority  ARM repeat superfamily protein  48
AT2G35130  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  48
AT3G02710  Strictly  ARM repeat superfamily protein  48
AT3G20740  Majority  Transducin/WD40 repeat-like superfamily protein  48
AT3G25660  Majority  Amidase family protein  48
AT3G26570  Majority  Phosphate transporter 2;1  48
AT3G44600  Majority  Cyclophilin 71  48
AT3G47450  Majority  P-loop containing nucleoside triphosphate hydrolases superfamily protein  48
AT3G51830  Majority  SAC domain-containing protein 8  48
AT4G09750  Strictly  NAD(P)-binding Rossmann-fold superfamily protein  48
AT4G31390  Majority  Protein kinase superfamily protein  48
AT4G33420  Majority  Peroxidase superfamily protein  48
AT4G33670  Majority  NAD(P)-linked oxidoreductase superfamily protein  48
AT4G36530  Majority  Alpha/beta-Hydrolases superfamily protein  48
AT5G08380  Majority  Alpha-galactosidase 1  48
AT5G08530  Majority  51 kDa subunit of complex I  48
AT5G39410  Majority  Saccharopine dehydrogenase  48
AT5G50310  Majority  Galactose oxidase/kelch repeat superfamily protein  48
AT1G04870  Majority  Protein arginine methyltransferase 10  47
AT1G69740  Majority  Aldolase superfamily protein  47
AT1G73430  Majority  Sec34-like family protein  47
AT2G01220  Majority  Nucleotidylyl transferase superfamily protein  47
AT2G15430  Majority  DNA-directed RNA polymerase family protein  47
AT2G20330  Majority  Transducin/WD40 repeat-like superfamily protein  47
AT2G27680  Majority  NAD(P)-linked oxidoreductase superfamily protein  47


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT2G31170  Majority  Cysteinyl-tRNA synthetase, class Ia family protein  47
AT2G38020  Majority  Vacuoleless1 (VCL1)  47
AT2G41680  Majority  NADPH-dependent thioredoxin reductase C  47
AT2G45990  Majority  Unknown  47
AT3G01480  Majority  Cyclophilin 38  47
AT3G19630  Majority  Radical SAM superfamily protein  47
AT3G44190  Majority  FAD/NAD(P)-binding oxidoreductase family protein  47
AT3G56940  Majority  Dicarboxylate diiron protein, putative (Crd1)  47
AT3G60830  Majority  Actin-related protein 7  47
AT4G17300  Majority  Class II aminoacyl-tRNA and biotin synthetases superfamily protein  47
AT4G21520  Majority  Transducin/WD40 repeat-like superfamily protein  47
AT4G31850  Strictly  Proton gradient regulation 3  47
AT4G39820  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  47
AT5G02230  Majority  Haloacid dehalogenase-like hydrolase (HAD) superfamily protein  47
AT5G02270  Majority  Non-intrinsic ABC protein 9  47
AT5G08740  Majority  NAD(P)H dehydrogenase C1  47
AT5G20990  Majority  Molybdopterin biosynthesis CNX1 protein / molybdenum cofactor biosynthesis enzyme CNX1 (CNX1)  47
AT5G23340  Majority  RNI-like superfamily protein  47
AT5G25150  Majority  TBP-associated factor 5  47
AT5G47860  Majority  Protein of unknown function (DUF1350)  47
AT5G49570  Majority  Peptide-N-glycanase 1  47
AT5G49880  Majority  Mitotic checkpoint family protein  47
AT5G53450  Majority  OBP3-responsive gene 1  47
AT5G54080  Majority  Homogentisate 1,2-dioxygenase  47
AT5G55960  Majority  Unknown  47


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT5G56740  Majority  Histone acetyltransferase of the GNAT family 2  47
AT5G65490  Majority  Unknown  47
AT1G05620  Majority  Uridine-ribohydrolase 2  46
AT1G11870  Majority  Seryl-tRNA synthetase  46
AT1G21480  Majority  Exostosin family protein  46
AT1G21640  Majority  NAD kinase 2  46
AT1G75330  Majority  Ornithine carbamoyltransferase  46
AT2G17990  Majority  Unknown  46
AT2G37230  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  46
AT3G15520  Majority  Cyclophilin-like peptidyl-prolyl cis-trans isomerase family protein  46
AT3G51820  Majority  UbiA prenyltransferase family protein  46
AT3G55760  Majority  Unknown  46
AT3G61080  Majority  Protein kinase superfamily protein  46
AT4G01040  Majority  Glycosyl hydrolase superfamily protein  46
AT4G15420  Majority  Ubiquitin fusion degradation UFD1 family protein  46
AT5G01010  Majority  Unknown  46
AT5G02860  Majority  Pentatricopeptide repeat (PPR) superfamily protein  46
AT5G22010  Majority  Replication factor C1  46
AT5G24760  Majority  GroES-like zinc-binding dehydrogenase family protein  46
AT5G25265  Majority  Unknown  46
AT5G48960  Majority  HAD-superfamily hydrolase, subfamily IG, 5'-nucleotidase  46
AT5G55000  Strictly  Potassium channel tetramerisation domain-containing protein / pentapeptide repeat-containing protein  46
AT5G61970  Majority  Signal recognition particle-related / SRP-related  46
AT5G67030  Majority  Zeaxanthin epoxidase (ZEP) (ABA1)  46
AT1G05590  Majority  Beta-hexosaminidase 2  45


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT1G05940  Majority  Cationic amino acid transporter 9  45
AT1G12370  Majority  Photolyase 1  45
AT1G14030  Majority  Rubisco methyltransferase family protein  45
AT1G27520  Majority  Glycosyl hydrolase family 47 protein  45
AT1G52310  Majority  Protein kinase family protein / C-type lectin domain-containing protein  45
AT1G56000  Majority  FAD/NAD(P)-binding oxidoreductase family protein  45
AT1G66510  Majority  AAR2 protein family  45
AT1G73180  Majority  Eukaryotic translation initiation factor eIF2A family protein  45
AT1G74750  Majority  Pentatricopeptide repeat (PPR) superfamily protein  45
AT2G19560  Majority  Proteasome family protein  45
AT2G31060  Majority  Elongation factor family protein  45
AT2G34560  Majority  P-loop containing nucleoside triphosphate hydrolases superfamily protein  45
AT2G34640  Majority  Plastid transcriptionally active 12  45
AT2G36310  Majority  Uridine-ribohydrolase 1  45
AT2G42810  Majority  Protein phosphatase 5.2  45
AT2G44760  Majority  Domain of unknown function (DUF3598)  45
AT3G10420  Majority  P-loop containing nucleoside triphosphate hydrolases superfamily protein  45
AT3G10970  Majority  Haloacid dehalogenase-like hydrolase (HAD) superfamily protein  45
AT3G19490  Majority  Sodium:hydrogen antiporter 1  45
AT3G52640  Majority  Zn-dependent exopeptidases superfamily protein  45
AT3G55400  Majority  Methionyl-tRNA synthetase / methionine--tRNA ligase / MetRS (cpMetRS)  45
AT4G17370  Majority  Oxidoreductase family protein  45
AT5G08640  Majority  Flavonol synthase 1  45
AT5G20070  Majority  Nudix hydrolase homolog 19  45
AT5G36170  Majority  High fluorescent 109  45
AT5G39590  Majority  TLD-domain containing nucleolar protein  45


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT1G04200  Majority  Unknown  44
AT1G50710  Majority  Unknown  44
AT1G72990  Strictly  Beta-galactosidase 17  44
AT2G25840  Majority  Nucleotidylyl transferase superfamily protein  44
AT2G31040  Majority  ATP synthase protein I -related  44
AT2G35720  Majority  DNAJ heat shock N-terminal domain-containing protein  44
AT2G41530  Majority  S-formylglutathione hydrolase  44
AT3G12290  Majority  Amino acid dehydrogenase family protein  44
AT3G59380  Majority  Farnesyltransferase A  44
AT4G04320  Majority  Malonyl-CoA decarboxylase family protein  44
AT4G34460  Majority  GTP binding protein beta 1  44
AT5G03430  Majority  Phosphoadenosine phosphosulfate (PAPS) reductase family protein  44
AT5G05660  Majority  Sequence-specific DNA binding transcription factors; zinc ion binding; sequence-specific DNA binding transcription factors  44
AT5G15550  Majority  Transducin/WD40 repeat-like superfamily protein  44
AT5G19150  Majority  PfkB-like carbohydrate kinase family protein  44
AT5G23210  Majority  Serine carboxypeptidase-like 34  44
AT5G24300  Majority  Glycogen/starch synthases, ADP-glucose type  44
AT5G37830  Majority  Oxoprolinase 1  44
AT5G66030  Majority  Golgi-localized GRIP domain-containing protein  44
AT1G11780  Majority  Oxidoreductase, 2OG-Fe(II) oxygenase family protein  43
AT1G12050  Majority  Fumarylacetoacetase, putative  43
AT1G16970  Majority  KU70 homolog  43
AT1G24610  Majority  Rubisco methyltransferase family protein  43
AT1G32080  Majority  Membrane protein, putative  43
AT1G52630  Majority  O-fucosyltransferase family protein  43


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT1G68010  Majority  Hydroxypyruvate reductase  43
AT1G76130  Majority  Alpha-amylase-like 2  43
AT2G06210  Majority  Binding  43
AT2G35660  Majority  FAD/NAD(P)-binding oxidoreductase family protein  43
AT2G40190  Majority  UDP-Glycosyltransferase superfamily protein  43
AT2G45500  Majority  AAA-type ATPase family protein  43
AT3G01060  Majority  Unknown  43
AT3G07720  Strictly  Galactose oxidase/kelch repeat superfamily protein  43
AT3G08010  Majority  RNA binding  43
AT3G19720  Majority  P-loop containing nucleoside triphosphate hydrolases superfamily protein  43
AT3G21300  Majority  RNA methyltransferase family protein  43
AT3G26410  Majority  Methyltransferases; nucleic acid binding  43
AT3G27530  Majority  Golgin candidate 6  43
AT3G48425  Majority  DNAse I-like superfamily protein  43
AT3G48610  Majority  Non-specific phospholipase C6  43
AT3G56040  Majority  UDP-glucose pyrophosphorylase 3  43
AT4G01130  Majority  GDSL-like Lipase/Acylhydrolase superfamily protein  43
AT4G08540  Majority  DNA-directed RNA polymerase II protein  43
AT5G15880  Majority  Unknown  43
AT5G35570  Majority  O-fucosyltransferase family protein  43
AT5G63610  Majority  Cyclin-dependent kinase E;1  43
AT5G65760  Strictly  Serine carboxypeptidase S28 family protein  43
AT1G04420  Majority  NAD(P)-linked oxidoreductase superfamily protein  42
AT1G05120  Majority  Helicase protein with RING/U-box domain  42
AT1G08460  Majority  Histone deacetylase 8  42
AT1G64350  Majority  Transducin/WD40 repeat-like superfamily protein  42


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT1G77030  Majority  Hydrolases, acting on acid anhydrides, in phosphorus-containing anhydrides; ATP-dependent helicases; nucleic acid binding; ATP binding; RNA binding; helicases  42
AT2G17510  Majority  Ribonuclease II family protein  42
AT2G29560  Majority  Cytosolic enolase  42
AT2G39930  Majority  Isoamylase 1  42
AT3G02660  Majority  Tyrosyl-tRNA synthetase, class Ib, bacterial/mitochondrial  42
AT3G17770  Majority  Dihydroxyacetone kinase  42
AT3G21580  Majority  Cobalt ion transmembrane transporters  42
AT3G22590  Majority  Plant Homologous to PARAFIBROMIN  42
AT3G52200  Strictly  Dihydrolipoamide acetyltransferase, long form protein  42
AT4G08790  Majority  Nitrilase/cyanide hydratase and apolipoprotein N-acyltransferase family protein  42
AT4G30825  Strictly  Tetratricopeptide repeat (TPR)-like superfamily protein  42
AT5G05740  Majority  Ethylene-dependent gravitropism-deficient and yellow-green-like 2  42
AT5G38530  Majority  Tryptophan synthase beta type 2  42
AT5G39830  Strictly  Trypsin family protein with PDZ domain  42
AT5G50230  Majority  Transducin/WD40 repeat-like superfamily protein  42
AT5G53170  Majority  FTSH protease 11  42
AT1G26460  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  41
AT1G76730  Majority  NagB/RpiA/CoA transferase-like superfamily protein  41
AT1G78800  Majority  UDP-Glycosyltransferase superfamily protein  41
AT2G16860  Majority  GCIP-interacting family protein  41
AT2G23390  Majority  Unknown  41
AT2G42120  Majority  DNA polymerase delta small subunit  41
AT3G04880  Majority  DNA-damage-repair/toleration protein (DRT102)  41
AT3G05675  Majority  BTB/POZ domain-containing protein  41
AT3G14860  Majority  NHL domain-containing protein  41


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT3G14910  Majority  Unknown  41
AT3G14930  Majority  Uroporphyrinogen decarboxylase  41
AT3G45770  Majority  Polyketide synthase, enoylreductase family protein  41
AT3G48380  Majority  Peptidase C78, ubiquitin fold modifier-specific peptidase 1/ 2  41
AT4G01870  Majority  TolB protein-related  41
AT4G04880  Majority  Adenosine/AMP deaminase family protein  41
AT4G14030  Majority  Selenium-binding protein 1  41
AT4G17100  Strictly  Unknown  41
AT4G24880  Majority  Unknown  41
AT4G36390  Majority  Methylthiotransferase  41
AT5G02250  Majority  Ribonuclease II/R family protein  41
AT5G02820  Majority  Spo11/DNA topoisomerase VI, subunit A protein  41
AT5G40440  Majority  Mitogen-activated protein kinase kinase 3  41
AT5G57300  Majority  S-adenosyl-L-methionine-dependent methyltransferases superfamily protein  41
AT1G03090  Majority  Methylcrotonyl-CoA carboxylase alpha chain, mitochondrial / 3-methylcrotonyl-CoA carboxylase 1 (MCCA)  40
AT1G12470  Majority  Zinc ion binding  40
AT1G18480  Majority  Calcineurin-like metallo-phosphoesterase superfamily protein  40
AT1G25570  Majority  Di-glucose binding protein with Leucine-rich repeat domain  40
AT1G50510  Majority  Indigoidine synthase A family protein  40
AT1G65030  Majority  Transducin/WD40 repeat-like superfamily protein  40
AT2G14260  Majority  Proline iminopeptidase  40
AT2G25280  Majority  Unknown  40
AT2G38270  Majority  CAX-interacting protein 2  40
AT2G39550  Majority  Prenyltransferase family protein  40
AT2G47940  Majority  DEGP protease 2  40


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT3G04260  Majority  Plastid transcriptionally active 3  40
AT3G10700  Majority  Galacturonic acid kinase  40
AT3G17040  Majority  High chlorophyll fluorescent 107  40
AT3G48460  Majority  GDSL-like Lipase/Acylhydrolase superfamily protein  40
AT3G54690  Majority  Sugar isomerase (SIS) family protein  40
AT4G02820  Majority  Pentatricopeptide repeat (PPR) superfamily protein  40
AT4G05090  Majority  Inositol monophosphatase family protein  40
AT4G11120  Majority  Translation elongation factor Ts (EF-Ts), putative  40
AT5G15730  Majority  Protein kinase superfamily protein  40
AT5G18070  Majority  Phosphoglucosamine mutase-related  40
AT5G19210  Majority  P-loop containing nucleoside triphosphate hydrolases superfamily protein  40
AT5G21060  Majority  Glyceraldehyde-3-phosphate dehydrogenase-like family protein  40
AT5G46180  Majority  Ornithine-delta-aminotransferase  40
AT5G48730  Majority  Pentatricopeptide repeat (PPR) superfamily protein  40
AT5G49030  Majority  tRNA synthetase class I (I, L, M and V) family protein  40
AT5G55860  Majority  Plant protein of unknown function (DUF827)  40
AT5G62650  Majority  Tic22-like family protein  40
AT1G03910  Majority  Unknown  39
AT1G08660  Majority  Male gametophyte defective 2  39
AT1G11090  Majority  Alpha/beta-Hydrolases superfamily protein  39
AT1G50500  Majority  Membrane trafficking VPS53 family protein  39
AT1G53120  Majority  RNA-binding S4 domain-containing protein  39
AT1G73920  Majority  Alpha/beta-Hydrolases superfamily protein  39
AT1G76570  Majority  Chlorophyll A-B binding family protein  39
AT2G05170  Majority  Vacuolar protein sorting 11  39
AT2G17020  Majority  F-box/RNI-like superfamily protein  39


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT2G26870  Majority  Non-specific phospholipase C2  39
AT2G30700  Majority  Unknown  39
AT2G43950  Majority  Chloroplast outer envelope protein 37  39
AT2G44970  Majority  Alpha/beta-Hydrolases superfamily protein  39
AT3G07140  Majority  GPI transamidase component Gpi16 subunit family protein  39
AT3G17470  Majority  Ca2+-activated RelA/spot homolog  39
AT3G47390  Majority  Cytidine/deoxycytidylate deaminase family protein  39
AT3G57680  Majority  Peptidase S41 family protein  39
AT3G62910  Majority  Peptide chain release factor 1  39
AT4G09900  Majority  Methyl esterase 12  39
AT4G15545  Majority  Unknown  39
AT5G08710  Majority  Regulator of chromosome condensation (RCC1) family protein  39
AT5G14760  Majority  L-aspartate oxidase  39
AT5G43430  Majority  Electron transfer flavoprotein beta  39
AT5G50420  Majority  O-fucosyltransferase family protein  39
AT5G56130  Majority  Transducin/WD40 repeat-like superfamily protein  39
AT5G58480  Majority  O-Glycosyl hydrolases family 17 protein  39
AT5G59250  Majority  Major facilitator superfamily protein  39
AT1G01910  Majority  P-loop containing nucleoside triphosphate hydrolases superfamily protein  38
AT1G05910  Majority  Cell division cycle protein 48-related / CDC48-related  38
AT1G14000  Majority  VH1-interacting kinase  38
AT1G71190  Majority  Senescence associated gene 18  38
AT1G71810  Strictly  Protein kinase superfamily protein  38
AT2G01350  Majority  Quinolinate phoshoribosyltransferase  38
AT2G28070  Majority  ABC-2 type transporter family protein  38
AT2G36360  Majority  Galactose oxidase/kelch repeat superfamily protein  38


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT2G44020  Majority  Mitochondrial transcription termination factor family protein  38
AT2G48120  Majority  Pale cress protein (PAC)  38
AT3G12080  Majority  GTP-binding family protein  38
AT3G26580  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  38
AT3G47700  Majority  RINT-1 / TIP-1 family  38
AT4G00560  Majority  NAD(P)-binding Rossmann-fold superfamily protein  38
AT4G14605  Majority  Mitochondrial transcription termination factor family protein  38
AT4G25450  Strictly  Non-intrinsic ABC protein 8  38
AT4G27640  Majority  ARM repeat superfamily protein  38
AT4G33410  Majority  SIGNAL PEPTIDE PEPTIDASE-LIKE 1  38
AT4G34890  Majority  Xanthine dehydrogenase 1  38
AT4G39520  Majority  GTP-binding protein-related  38
AT5G03160  Majority  Homolog of mamallian P58IPK  38
AT5G24340  Majority  3'-5' exonuclease domain-containing protein  38
AT5G52810  Majority  NAD(P)-binding Rossmann-fold superfamily protein  38
AT5G55500  Majority  Beta-1,2-xylosyltransferase  38
AT5G61540  Majority  N-terminal nucleophile aminohydrolases (Ntn hydrolases) superfamily protein  38
AT1G17760  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  37
AT1G35190  Majority  2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily protein  37
AT1G65020  Majority  Unknown  37
AT1G65280  Majority  DNAJ heat shock N-terminal domain-containing protein  37
AT1G71240  Majority  Plant protein of unknown function (DUF639)  37
AT2G04530  Majority  Metallo-hydrolase/oxidoreductase superfamily protein  37
AT2G15860  Majority  Unknown  37
AT2G16405  Majority  Transducin/WD40 repeat-like superfamily protein  37
AT2G42700  Majority  Unknown  37


Table 5-2. Continued
Gene ID  SC Status  Gene Functional Category  No. Orthologues
AT2G47210  Majority  Myb-like transcription factor family protein  37
AT3G15290  Majority  3-hydroxyacyl-CoA dehydrogenase family protein  37
AT3G56310  Majority  Melibiase family protein  37
AT3G59630  Majority  Diphthamide synthesis DPH2 family protein  37
AT4G04850  Majority  K+ efflux antiporter 3  37
AT4G04930  Majority  Fatty acid desaturase family protein  37
AT4G34270  Majority  TIP41-like family protein  37
AT4G37380  Strictly  Tetratricopeptide repeat (TPR)-like superfamily protein  37
AT5G11010  Majority  Pre-mRNA cleavage complex II protein family  37
AT5G12410  Majority  THUMP domain-containing protein  37
AT5G20660  Strictly  Zn-dependent exopeptidases superfamily protein  37
AT5G44450  Majority  Methyltransferases  37
AT5G45760  Majority  Transducin/WD40 repeat-like superfamily protein  37
AT5G47760  Majority  2-phosphoglycolate phosphatase 2  37
AT5G51140  Majority  Pseudouridine synthase family protein  37
AT5G57140  Majority  Purple acid phosphatase 28  37
AT5G57410  Majority  Afadin/alpha-actinin-binding protein  37
AT5G64250  Majority  Aldolase-type TIM barrel family protein  37
AT1G08470  Majority  Strictosidine synthase-like 3  36
AT1G12780  Majority  UDP-D-glucose/UDP-D-galactose 4-epimerase 1  36
AT1G18070  Majority  Translation elongation factor EF1A/initiation factor IF2gamma family protein  36
AT1G23400  Majority  RNA-binding CRS1 / YhbY (CRM) domain-containing protein  36
AT1G30090  Majority  Galactose oxidase/kelch repeat superfamily protein  36
AT1G64890  Majority  Major facilitator superfamily protein  36
AT2G01650  Majority  Plant UBX domain-containing protein 2  36
AT2G03270  Majority  DNA-binding protein, putative  36

AT2G19430 | Majority | DWD (DDB1-binding WD40 protein) hypersensitive to ABA 1 | 36
AT2G21280 | Majority | NAD(P)-binding Rossmann-fold superfamily protein | 36
AT2G21370 | Majority | Xylulose kinase-1 | 36
AT2G24830 | Majority | Zinc finger (CCCH-type) family protein / D111/G-patch domain-containing protein | 36
AT2G29900 | Majority | Presenilin-2 | 36
AT2G34460 | Majority | NAD(P)-binding Rossmann-fold superfamily protein | 36
AT2G40090 | Majority | ABC2 homolog 9 | 36
AT3G02060 | Majority | DEAD/DEAH box helicase, putative | 36
AT3G07180 | Majority | GPI transamidase component PIG-S-related | 36
AT3G09580 | Majority | FAD/NAD(P)-binding oxidoreductase family protein | 36
AT3G10850 | Majority | Metallo-hydrolase/oxidoreductase superfamily protein | 36
AT3G16810 | Majority | Pumilio 24 | 36
AT3G20780 | Majority | Topoisomerase 6 subunit B | 36
AT3G21540 | Majority | Transducin family protein / WD-40 repeat family protein | 36
AT3G28700 | Majority | Protein of unknown function (DUF185) | 36
AT3G28720 | Majority | Unknown | 36
AT3G50660 | Majority | Cytochrome P450 superfamily protein | 36
AT4G08960 | Majority | Phosphotyrosyl phosphatase activator (PTPA) family protein | 36
AT4G16490 | Majority | ARM repeat superfamily protein | 36
AT4G27750 | Majority | Binding | 36
AT4G33945 | Majority | ARM repeat superfamily protein | 36
AT5G10460 | Majority | Haloacid dehalogenase-like hydrolase (HAD) superfamily protein | 36
AT5G20380 | Majority | Phosphate transporter 4;5 | 36
AT5G21040 | Majority | F-box protein 2 | 36
AT5G43280 | Majority | Delta(3,5),delta(2,4)-dienoyl-CoA isomerase 1 | 36
AT5G50840 | Majority | Unknown | 36

AT5G60940 | Majority | Transducin/WD40 repeat-like superfamily protein | 36
AT1G02910 | Majority | Tetratricopeptide repeat (TPR)-containing protein | 35
AT1G16900 | Majority | Alg9-like mannosyltransferase family | 35
AT1G29260 | Majority | Peroxin 7 | 35
AT1G31190 | Majority | Myo-inositol monophosphatase like 1 | 35
AT1G44820 | Majority | Peptidase M20/M25/M40 family protein | 35
AT1G50590 | Majority | RmlC-like cupins superfamily protein | 35
AT1G54350 | Majority | ABC transporter family protein | 35
AT1G63110 | Majority | GPI transamidase subunit PIG-U | 35
AT1G70590 | Majority | F-box family protein | 35
AT1G72090 | Majority | Methylthiotransferase | 35
AT1G72660 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 35
AT2G20360 | Majority | NAD(P)-binding Rossmann-fold superfamily protein | 35
AT2G21960 | Majority | Unknown | 35
AT2G28100 | Majority | Alpha-L-fucosidase 1 | 35
AT3G03790 | Majority | Ankyrin repeat family protein / regulator of chromosome condensation (RCC1) family protein | 35
AT3G06920 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 35
AT3G09030 | Majority | BTB/POZ domain-containing protein | 35
AT3G14900 | Majority | Unknown | 35
AT3G18940 | Majority | Clast3-related | 35
AT3G19553 | Majority | Amino acid permease family protein | 35
AT3G23620 | Majority | Ribosomal RNA processing Brix domain protein | 35
AT3G44160 | Majority | Outer membrane OMP85 family protein | 35
AT4G12700 | Majority | Unknown | 35
AT4G28980 | Majority | CDK-activating kinase 1AT | 35

AT4G29380 | Majority | Protein kinase family protein / WD-40 repeat family protein | 35
AT5G05920 | Majority | Deoxyhypusine synthase | 35
AT5G11240 | Majority | Transducin family protein / WD-40 repeat family protein | 35
AT5G19680 | Majority | Leucine-rich repeat (LRR) family protein | 35
AT5G20910 | Majority | RING/U-box superfamily protein | 35
AT5G24850 | Majority | Cryptochrome 3 | 35
AT5G26240 | Majority | Chloride channel D | 35
AT5G41330 | Majority | BTB/POZ domain with WD40/YVTN repeat-like protein | 35
AT1G02100 | Majority | Leucine carboxyl methyltransferase | 34
AT1G03000 | Majority | Peroxin 6 | 34
AT1G13770 | Majority | Protein of unknown function, DUF647 | 34
AT1G20300 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 34
AT1G31410 | Majority | Putrescine-binding periplasmic protein-related | 34
AT1G32160 | Majority | Protein of unknown function (DUF760) | 34
AT1G43620 | Majority | UDP-Glycosyltransferase superfamily protein | 34
AT1G60230 | Majority | Radical SAM superfamily protein | 34
AT1G69380 | Majority | Protein of unknown function (DUF155) | 34
AT1G76990 | Majority | ACT domain repeat 3 | 34
AT1G80680 | Majority | Suppressor of auxin resistance 3 | 34
AT2G03510 | Majority | SPFH/Band 7/PHB domain-containing membrane-associated protein family | 34
AT2G06050 | Majority | Oxophytodienoate-reductase 3 | 34
AT2G13840 | Majority | Polymerase/histidinol phosphatase-like | 34
AT2G26350 | Strictly | Peroxin 10 | 34
AT2G44270 | Majority | Repressor of lrx1 | 34
AT3G01510 | Majority | Like SEX4 1 | 34
AT3G18520 | Majority | Histone deacetylase 15 | 34

AT3G43540 | Majority | Protein of unknown function (DUF1350) | 34
AT3G57300 | Majority | INO80 orthologue | 34
AT4G01880 | Majority | Methyltransferases | 34
AT4G02290 | Majority | Glycosyl hydrolase 9B13 | 34
AT4G02790 | Majority | GTP-binding family protein | 34
AT4G02990 | Majority | Mitochondrial transcription termination factor family protein | 34
AT4G16210 | Majority | Enoyl-CoA hydratase/isomerase A | 34
AT4G30610 | Majority | Alpha/beta-Hydrolases superfamily protein | 34
AT5G10730 | Majority | NAD(P)-binding Rossmann-fold superfamily protein | 34
AT5G14260 | Majority | Rubisco methyltransferase family protein | 34
AT5G16120 | Majority | Alpha/beta-Hydrolases superfamily protein | 34
AT5G20170 | Majority | RNA polymerase II transcription mediators | 34
AT5G45170 | Majority | Haloacid dehalogenase-like hydrolase (HAD) superfamily protein | 34
AT5G51570 | Majority | SPFH/Band 7/PHB domain-containing membrane-associated protein family | 34
AT5G62030 | Majority | Diphthamide synthesis DPH2 family protein | 34
AT1G29120 | Majority | Hydrolase-like protein family | 33
AT1G69910 | Majority | Protein kinase superfamily protein | 33
AT1G71110 | Majority | Unknown | 33
AT2G01070 | Majority | Lung seven transmembrane receptor family protein | 33
AT2G36740 | Majority | Sequence-specific DNA binding transcription factors;DNA binding;DNA binding | 33
AT2G40660 | Majority | Nucleic acid-binding, OB-fold-like protein | 33
AT2G41670 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 33
AT3G08760 | Majority | Protein kinase superfamily protein | 33
AT3G12940 | Majority | 2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily protein | 33
AT3G44680 | Majority | Histone deacetylase 9 | 33
AT3G52210 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 33

AT3G52570 | Majority | Alpha/beta-Hydrolases superfamily protein | 33
AT5G17640 | Majority | Protein of unknown function (DUF1005) | 33
AT5G27290 | Majority | Unknown | 33
AT5G42130 | Majority | Mitochondrial substrate carrier family protein | 33
AT5G52860 | Majority | ABC-2 type transporter family protein | 33
AT5G56460 | Majority | Protein kinase superfamily protein | 33
AT5G64940 | Majority | ABC2 homolog 13 | 33
AT1G63680 | Majority | Acid-amino acid ligases; ligases; ATP binding; | 32
AT1G63900 | Majority | E3 Ubiquitin ligase family protein | 32
AT1G71790 | Majority | Subunits of heterodimeric actin filament capping protein Capz superfamily | 32
AT2G02860 | Majority | Sucrose transporter 2 | 32
AT2G04305 | Majority | Magnesium transporter CorA-like family protein | 32
AT2G16440 | Majority | Minichromosome maintenance (MCM2/3/5) family protein | 32
AT2G24220 | Majority | Purine permease 5 | 32
AT2G30800 | Majority | Helicase in vascular tissue and tapetum | 32
AT2G36570 | Majority | Leucine-rich repeat protein kinase family protein | 32
AT3G06850 | Majority | 2-oxoacid dehydrogenases acyltransferase family protein | 32
AT3G22290 | Majority | Endoplasmic reticulum vesicle transporter protein | 32
AT3G25900 | Majority | Homocysteine S-methyltransferase family protein | 32
AT4G02030 | Majority | Vps51/Vps67 family (components of vesicular transport) protein | 32
AT4G19900 | Majority | Alpha 1,4-glycosyltransferase family protein | 32
AT4G21770 | Majority | Pseudouridine synthase family protein | 32
AT4G36440 | Majority | Unknown | 32
AT4G36470 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 32
AT5G01230 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 32
AT5G47010 | Majority | RNA helicase, putative | 32

AT5G57930 | Majority | Arabidopsis thaliana protein of unknown function (DUF794) | 32
AT5G66530 | Majority | Galactose mutarotase-like superfamily protein | 32
AT1G01770 | Majority | Unknown | 31
AT1G04010 | Majority | Phospholipid sterol acyl transferase 1 | 31
AT1G11545 | Majority | Xyloglucan endotransglucosylase/hydrolase 8 | 31
AT1G13160 | Majority | ARM repeat superfamily protein | 31
AT1G30300 | Majority | Metallo-hydrolase/oxidoreductase superfamily protein | 31
AT1G49540 | Majority | Elongator protein 2 | 31
AT1G70070 | Strictly | DEAD/DEAH box helicase, putative | 31
AT2G26000 | Majority | Zinc finger (C3HC4-type RING finger) family protein | 31
AT2G26200 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 31
AT2G31530 | Majority | SecY protein transport family protein | 31
AT2G31890 | Majority | RAP | 31
AT2G44420 | Majority | Protein N-terminal asparagine amidohydrolase family protein | 31
AT2G47680 | Strictly | Zinc finger (CCCH type) helicase family protein | 31
AT2G47990 | Majority | Transducin family protein / WD-40 repeat family protein | 31
AT3G01920 | Majority | DHBP synthase RibB-like alpha/beta domain | 31
AT3G09720 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 31
AT3G14120 | Majority | Unknown | 31
AT3G28760 | Majority | Unknown | 31
AT3G48200 | Majority | Unknown | 31
AT3G48820 | Majority | Glycosyltransferase family 29 (sialyltransferase) family protein | 31
AT4G27340 | Majority | Met-10+ like family protein | 31
AT4G27450 | Majority | Aluminium induced protein with YGL and LRDR motifs | 31
AT4G36650 | Majority | Plant-specific TFIIB-related protein | 31
AT5G03555 | Majority | Permease, cytosine/purines, uracil, thiamine, allantoin family protein | 31

AT5G10910 | Majority | MraW methylase family protein | 31
AT5G17290 | Majority | Autophagy protein Apg5 family | 31
AT5G27560 | Majority | Domain of unknown function (DUF1995) | 31
AT5G41760 | Majority | Nucleotide-sugar transporter family protein | 31
AT5G45380 | Majority | Solute:sodium symporters; Urea transmembrane transporters | 31
AT5G64380 | Majority | Inositol monophosphatase family protein | 31
AT1G03110 | Majority | Transducin/WD40 repeat-like superfamily protein | 30
AT1G06560 | Majority | NOL1/NOP2/sun family protein | 30
AT1G35510 | Majority | O-fucosyltransferase family protein | 30
AT1G50120 | Majority | Unknown | 30
AT1G51965 | Majority | ABA Overly-Sensitive 5 | 30
AT1G56345 | Majority | Pseudouridine synthase family protein | 30
AT1G65380 | Majority | Leucine-rich repeat (LRR) family protein | 30
AT1G76080 | Majority | Chloroplastic drought-induced stress protein of 32 kD | 30
AT2G14530 | Majority | Trichome birefringence-like 13 | 30
AT2G25620 | Majority | DNA-binding protein phosphatase 1 | 30
AT2G44660 | Majority | ALG6, ALG8 glycosyltransferase family | 30
AT3G02760 | Majority | Class II aaRS and biotin synthetases superfamily protein | 30
AT3G04480 | Majority | Endoribonucleases | 30
AT3G07080 | Strictly | EamA-like transporter family | 30
AT3G07890 | Majority | Ypt/Rab-GAP domain of gyp1p superfamily protein | 30
AT3G10210 | Majority | SEC14 cytosolic factor family protein / phosphoglyceride transfer family protein | 30
AT3G11210 | Majority | SGNH hydrolase-type esterase superfamily protein | 30
AT3G19770 | Majority | Vacuolar sorting protein 9 (VPS9) domain | 30
AT4G01570 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 30
AT4G33630 | Majority | Protein of unknown function (DUF3506) | 30

AT4G37030 | Majority | Unknown | 30
AT5G04710 | Majority | Zn-dependent exopeptidases superfamily protein | 30
AT5G08580 | Majority | Calcium-binding EF hand family protein | 30
AT5G10330 | Majority | Histidinol phosphate aminotransferase 1 | 30
AT5G14600 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 30
AT5G35560 | Majority | DENN (AEX-3) domain-containing protein | 30
AT5G60960 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 30
AT1G02020 | Majority | Nitroreductase family protein | 29
AT1G08125 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 29
AT1G15440 | Majority | Periodic tryptophan protein 2 | 29
AT1G28060 | Majority | Pre-mRNA-splicing factor 3 | 29
AT1G29700 | Majority | Metallo-hydrolase/oxidoreductase superfamily protein | 29
AT1G48270 | Majority | G-protein-coupled receptor 1 | 29
AT1G54780 | Majority | Thylakoid lumen 18.3 kDa protein | 29
AT1G68400 | Majority | Leucine-rich repeat transmembrane protein kinase family protein | 29
AT1G78280 | Majority | Transferases, transferring glycosyl groups | 29
AT2G21250 | Majority | NAD(P)-linked oxidoreductase superfamily protein | 29
AT2G21860 | Majority | Violaxanthin de-epoxidase-related | 29
AT2G42160 | Majority | Zinc finger (ubiquitin-hydrolase) domain-containing protein | 29
AT2G45770 | Majority | Signal recognition particle receptor protein, chloroplast (FTSY) | 29
AT3G33520 | Majority | Actin-related protein 6 | 29
AT3G46610 | Majority | Pentatricopeptide repeat (PPR-like) superfamily protein | 29
AT3G59770 | Majority | SacI homology domain-containing protein / WW domain-containing protein | 29
AT3G60850 | Majority | Unknown | 29
AT4G02060 | Majority | Minichromosome maintenance (MCM2/3/5) family protein | 29
AT4G04940 | Majority | Transducin family protein / WD-40 repeat family protein | 29

AT5G08370 | Majority | Alpha-galactosidase 2 | 29
AT5G11960 | Majority | Protein of unknown function (DUF803) | 29
AT5G40610 | Majority | NAD-dependent glycerol-3-phosphate dehydrogenase family protein | 29
AT5G44635 | Majority | Minichromosome maintenance (MCM2/3/5) family protein | 29
AT5G46800 | Majority | Mitochondrial substrate carrier family protein | 29
AT5G54910 | Majority | DEA(D/H)-box RNA helicase family protein | 29
AT5G57480 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 29
AT5G60870 | Strictly | Regulator of chromosome condensation (RCC1) family protein | 29
AT1G08960 | Majority | Cation exchanger 11 | 28
AT1G50940 | Majority | Electron transfer flavoprotein alpha | 28
AT1G51940 | Majority | Protein kinase family protein / peptidoglycan-binding LysM domain-containing protein | 28
AT1G52340 | Majority | NAD(P)-binding Rossmann-fold superfamily protein | 28
AT1G55510 | Majority | Branched-chain alpha-keto acid decarboxylase E1 beta subunit | 28
AT1G68070 | Majority | Zinc finger, C3HC4 type (RING finger) family protein | 28
AT2G04850 | Majority | Auxin-responsive family protein | 28
AT2G18850 | Majority | SET domain-containing protein | 28
AT2G26830 | Majority | Protein kinase superfamily protein | 28
AT2G41835 | Majority | Zinc finger (C2H2 type, AN1-like) family protein | 28
AT3G08840 | Majority | D-alanine--D-alanine ligase family | 28
AT3G12680 | Majority | Floral homeotic protein (HUA1) | 28
AT3G14075 | Majority | Mono-/di-acylglycerol lipase, N-terminal; Lipase, class 3 | 28
AT3G19895 | Majority | RING/U-box superfamily protein | 28
AT3G21110 | Majority | Purin 7 | 28
AT3G52390 | Majority | TatD related DNase | 28
AT4G13250 | Majority | NAD(P)-binding Rossmann-fold superfamily protein | 28

AT4G31790 | Majority | Tetrapyrrole (Corrin/Porphyrin) Methylases | 28
AT4G33470 | Strictly | Histone deacetylase 14 | 28
AT4G35870 | Majority | Early-responsive to dehydration stress protein (ERD4) | 28
AT4G38460 | Majority | Geranylgeranyl reductase | 28
AT5G41880 | Majority | DNA primases; DNA primases | 28
AT5G48470 | Majority | Unknown | 28
AT5G52030 | Majority | TraB family protein | 28
AT5G65950 | Majority | Unknown | 28
AT1G09280 | Majority | Unknown | 27
AT1G11290 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 27
AT1G17650 | Strictly | Glyoxylate reductase 2 | 27
AT1G17870 | Strictly | Ethylene-dependent gravitropism-deficient and yellow-green-like 3 | 27
AT1G25375 | Majority | Metallo-hydrolase/oxidoreductase superfamily protein | 27
AT1G52510 | Majority | Alpha/beta-Hydrolases superfamily protein | 27
AT1G71696 | Majority | Carboxypeptidase D, putative | 27
AT1G74030 | Majority | Enolase 1 | 27
AT1G78010 | Majority | tRNA modification GTPase, putative | 27
AT2G05320 | Majority | Beta-1,2-N-acetylglucosaminyltransferase II | 27
AT2G17670 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 27
AT2G20790 | Majority | Clathrin adaptor complexes medium subunit family protein | 27
AT2G26590 | Majority | Regulatory particle non-ATPase 13 | 27
AT2G32760 | Majority | Unknown | 27
AT3G05510 | Majority | Phospholipid/glycerol acyltransferase family protein | 27
AT3G24040 | Majority | Core-2/I-branching beta-1,6-N-acetylglucosaminyltransferase family protein | 27
AT4G14790 | Majority | ATP-dependent RNA helicase, mitochondrial (SUV3) | 27
AT4G32190 | Majority | Myosin heavy chain-related protein | 27

AT5G14240 | Majority | Thioredoxin superfamily protein | 27
AT5G16280 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 27
AT5G23060 | Majority | Calcium sensing receptor | 27
AT5G41920 | Majority | GRAS family transcription factor | 27
AT5G45900 | Majority | ThiF family protein | 27
AT5G62840 | Strictly | Phosphoglycerate mutase family protein | 27
AT5G66380 | Majority | Folate transporter 1 | 27
AT1G09300 | Majority | Metallopeptidase M24 family protein | 26
AT1G09800 | Majority | Pseudouridine synthase family protein | 26
AT1G21370 | Majority | Unknown | 26
AT1G30000 | Majority | Alpha-mannosidase 3 | 26
AT1G48050 | Majority | Ku80 family protein | 26
AT1G75210 | Majority | HAD-superfamily hydrolase, subfamily IG, 5'-nucleotidase | 26
AT1G80380 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 26
AT1G80770 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 26
AT2G02880 | Majority | Mucin-related | 26
AT2G25830 | Majority | YebC-related | 26
AT2G30170 | Majority | Protein phosphatase 2C family protein | 26
AT2G34470 | Majority | Urease accessory protein G | 26
AT3G16840 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 26
AT3G46790 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 26
AT3G54090 | Majority | Fructokinase-like 1 | 26
AT4G09730 | Strictly | RH39 | 26
AT4G19010 | Majority | AMP-dependent synthetase and ligase family protein | 26
AT4G19610 | Majority | Nucleotide binding;nucleic acid binding;RNA binding | 26
AT4G20930 | Majority | 6-phosphogluconate dehydrogenase family protein | 26

AT4G34910 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 26
AT5G01920 | Majority | Protein kinase superfamily protein | 26
AT5G03800 | Strictly | Pentatricopeptide repeat (PPR) superfamily protein | 26
AT5G15170 | Majority | Tyrosyl-DNA phosphodiesterase-related | 26
AT5G16210 | Majority | HEAT repeat-containing protein | 26
AT5G23400 | Majority | Leucine-rich repeat (LRR) family protein | 26
AT5G36950 | Majority | DegP protease 10 | 26
AT5G39900 | Majority | Small GTP-binding protein | 26
AT5G45300 | Majority | Beta-amylase 2 | 26
AT1G03560 | Majority | Pentatricopeptide repeat (PPR-like) superfamily protein | 25
AT1G16280 | Majority | RNA helicase 36 | 25
AT1G21790 | Majority | TRAM, LAG1 and CLN8 (TLC) lipid-sensing domain containing protein | 25
AT1G23740 | Majority | Oxidoreductase, zinc-binding dehydrogenase family protein | 25
AT1G55870 | Majority | Polynucleotidyl transferase, ribonuclease H-like superfamily protein | 25
AT1G64880 | Majority | Ribosomal protein S5 family protein | 25
AT1G69220 | Majority | Protein kinase superfamily protein | 25
AT1G74530 | Majority | Unknown | 25
AT1G78590 | Majority | NAD(H) kinase 3 | 25
AT2G17900 | Majority | SET domain group 37 | 25
AT2G31240 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 25
AT2G31440 | Majority | Unknown | 25
AT3G04970 | Majority | DHHC-type zinc finger family protein | 25
AT3G12150 | Majority | Unknown | 25
AT3G15620 | Strictly | DNA photolyase family protein | 25
AT3G46980 | Majority | Phosphate transporter 4;3 | 25
AT4G01037 | Majority | Ubiquitin carboxyl-terminal hydrolase family protein | 25

AT4G04180 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 25
AT4G16330 | Majority | 2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily protein | 25
AT4G17910 | Majority | Transferases, transferring acyl groups | 25
AT4G20940 | Majority | Leucine-rich receptor-like protein kinase family protein | 25
AT4G36400 | Majority | FAD-linked oxidases family protein | 25
AT5G13800 | Majority | Pheophytinase | 25
AT5G16420 | Strictly | Pentatricopeptide repeat (PPR-like) superfamily protein | 25
AT5G16890 | Majority | Exostosin family protein | 25
AT5G18900 | Majority | 2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily protein | 25
AT5G27660 | Majority | Trypsin family protein with PDZ domain | 25
AT5G48330 | Strictly | Regulator of chromosome condensation (RCC1) family protein | 25
AT5G53580 | Majority | NAD(P)-linked oxidoreductase superfamily protein | 25
AT5G54860 | Majority | Major facilitator superfamily protein | 25
AT5G55220 | Majority | Trigger factor type chaperone family protein | 25
AT5G55760 | Majority | Sirtuin 1 | 25
AT5G61770 | Majority | Peter Pan-like protein | 25
AT5G64320 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 25
AT5G66120 | Majority | 3-dehydroquinate synthase, putative | 25
AT1G03030 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 24
AT1G08315 | Majority | ARM repeat superfamily protein | 24
AT1G26640 | Strictly | Amino acid kinase family protein | 24
AT1G44350 | Majority | IAA-leucine resistant (ILR)-like gene 6 | 24
AT1G48310 | Majority | Chromatin remodeling factor18 | 24
AT1G63810 | Majority | Unknown | 24
AT1G64810 | Majority | Arabidopsis thaliana protein of unknown function (DUF794) | 24
AT1G68100 | Majority | ZIP metal ion transporter family | 24

AT1G74640 | Majority | Alpha/beta-Hydrolases superfamily protein | 24
AT1G80030 | Majority | Molecular chaperone Hsp40/DnaJ family protein | 24
AT2G03430 | Majority | Ankyrin repeat family protein | 24
AT2G17265 | Majority | Homoserine kinase | 24
AT2G47580 | Majority | Spliceosomal protein U1A | 24
AT3G07400 | Majority | Lipase class 3 family protein | 24
AT3G25530 | Majority | Glyoxylate reductase 1 | 24
AT3G51980 | Majority | ARM repeat superfamily protein | 24
AT4G14040 | Majority | Selenium-binding protein 2 | 24
AT4G32960 | Majority | Unknown | 24
AT4G35250 | Majority | NAD(P)-binding Rossmann-fold superfamily protein | 24
AT5G11450 | Majority | Mog1/PsbP/DUF1795-like photosystem II reaction center PsbP family protein | 24
AT5G12040 | Majority | Nitrilase/cyanide hydratase and apolipoprotein N-acyltransferase family protein | 24
AT5G27270 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 24
AT5G39960 | Majority | GTP binding | 24
AT5G44000 | Majority | Glutathione S-transferase family protein | 24
AT5G47500 | Majority | Pectin lyase-like superfamily protein | 24
AT5G48840 | Majority | Homolog of bacterial PANC | 24
AT5G53770 | Majority | Nucleotidyltransferase family protein | 24
AT5G63220 | Majority | Unknown | 24
AT5G64150 | Majority | RNA methyltransferase family protein | 24
AT5G64630 | Majority | Transducin/WD40 repeat-like superfamily protein | 24
AT5G64730 | Majority | Transducin/WD40 repeat-like superfamily protein | 24
AT5G65000 | Majority | Nucleotide-sugar transporter family protein | 24
AT1G03750 | Majority | Switch 2 | 23
AT1G09850 | Majority | Xylem cysteine peptidase 3 | 23

AT1G12250 | Majority | Pentapeptide repeat-containing protein | 23
AT1G31600 | Majority | RNA-binding (RRM/RBD/RNP motifs) family protein | 23
AT1G55880 | Majority | Pyridoxal-5'-phosphate-dependent enzyme family protein | 23
AT1G64430 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 23
AT1G70480 | Majority | Domain of unknown function (DUF220) | 23
AT1G74240 | Majority | Mitochondrial substrate carrier family protein | 23
AT2G04280 | Majority | Unknown | 23
AT2G04540 | Majority | Beta-ketoacyl synthase | 23
AT2G16880 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 23
AT2G21710 | Majority | Mitochondrial transcription termination factor family protein | 23
AT2G27590 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 23
AT2G30100 | Majority | Pentatricopeptide (PPR) repeat-containing protein | 23
AT2G32630 | Majority | Pentatricopeptide repeat (PPR-like) superfamily protein | 23
AT2G37560 | Majority | Origin recognition complex second largest subunit 2 | 23
AT2G39090 | Majority | Tetratricopeptide repeat (TPR)-containing protein | 23
AT2G40760 | Majority | Rhodanese/Cell cycle control phosphatase superfamily protein | 23
AT3G06430 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 23
AT3G09010 | Majority | Protein kinase superfamily protein | 23
AT3G10130 | Majority | SOUL heme-binding family protein | 23
AT3G18110 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 23
AT3G18630 | Strictly | Uracil DNA glycosylase | 23
AT3G20420 | Majority | RNAse THREE-like protein 2 | 23
AT3G24430 | Majority | ATP binding | 23
AT3G26710 | Majority | Cofactor assembly of complex C | 23
AT3G54480 | Strictly | SKP1/ASK-interacting protein 5 | 23
AT4G12130 | Majority | Glycine cleavage T-protein family | 23

AT4G15850 | Majority | RNA helicase 1 | 23
AT4G17420 | Majority | Tryptophan RNA-binding attenuator protein-like | 23
AT4G19670 | Majority | RING/U-box superfamily protein | 23
AT4G20325 | Majority | Unknown | 23
AT5G05310 | Majority | TLC ATP/ADP transporter | 23
AT5G13390 | Majority | No exine formation 1 | 23
AT5G17570 | Majority | TatD related DNase | 23
AT5G19500 | Strictly | Tryptophan/tyrosine permease | 23
AT5G19850 | Majority | Alpha/beta-Hydrolases superfamily protein | 23
AT5G20040 | Majority | Isopentenyltransferase 9 | 23
AT5G25480 | Majority | DNA methyltransferase-2 | 23
AT5G48440 | Majority | FAD-dependent oxidoreductase family protein | 23
AT5G48790 | Majority | Domain of unknown function (DUF1995) | 23
AT5G57850 | Majority | D-aminoacid aminotransferase-like PLP-dependent enzymes superfamily protein | 23
AT5G59750 | Majority | DHBP synthase RibB-like alpha/beta domain; GTP cyclohydrolase II | 23
AT1G01860 | Majority | Ribosomal RNA adenine dimethylase family protein | 22
AT1G03475 | Majority | Coproporphyrinogen III oxidase | 22
AT1G06240 | Majority | Protein of unknown function DUF455 | 22
AT1G16290 | Majority | Unknown | 22
AT1G47840 | Majority | Hexokinase 3 | 22
AT1G64280 | Majority | Regulatory protein (NPR1) | 22
AT1G77220 | Majority | Protein of unknown function (DUF300) | 22
AT1G77290 | Majority | Glutathione S-transferase family protein | 22
AT2G02710 | Majority | PAS/LOV protein B | 22
AT2G23890 | Strictly | HAD-superfamily hydrolase, subfamily IG, 5'-nucleotidase | 22
AT2G25530 | Majority | AFG1-like ATPase family protein | 22

AT2G28790 | Majority | Pathogenesis-related thaumatin superfamily protein | 22
AT2G35450 | Majority | Catalytics; hydrolases | 22
AT2G39970 | Majority | Mitochondrial substrate carrier family protein | 22
AT2G40700 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 22
AT2G47760 | Majority | Asparagine-linked glycosylation 3 | 22
AT3G04560 | Majority | Unknown | 22
AT3G13180 | Majority | NOL1/NOP2/sun family protein / antitermination NusB domain-containing protein | 22
AT3G18790 | Majority | Unknown | 22
AT3G24660 | Majority | Transmembrane kinase-like 1 | 22
AT3G58690 | Majority | Protein kinase superfamily protein | 22
AT4G14000 | Majority | Putative methyltransferase family protein | 22
AT4G28020 | Majority | Unknown | 22
AT4G31770 | Majority | Debranching enzyme 1 | 22
AT4G38050 | Majority | Xanthine/uracil permease family protein | 22
AT5G02410 | Majority | DIE2/ALG10 family | 22
AT5G04360 | Strictly | Limit dextrinase | 22
AT5G07360 | Majority | Amidase family protein | 22
AT5G14660 | Majority | Peptide deformylase 1B | 22
AT5G15640 | Majority | Mitochondrial substrate carrier family protein | 22
AT5G26820 | Majority | Iron-regulated protein 3 | 22
AT5G37850 | Majority | PfkB-like carbohydrate kinase family protein | 22
AT5G39710 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 22
AT5G41150 | Majority | Restriction endonuclease, type II-like superfamily protein | 22
AT5G47090 | Majority | Unknown | 22
AT5G55260 | Majority | Protein phosphatase X 2 | 22
AT5G56580 | Majority | MAP kinase kinase 6 | 22

AT5G60540 | Majority | Pyridoxine biosynthesis 2 | 22
AT1G01920 | Majority | SET domain-containing protein | 21
AT1G03310 | Majority | Debranching enzyme 1 | 21
AT1G05055 | Majority | General transcription factor II H2 | 21
AT1G09960 | Majority | Sucrose transporter 4 | 21
AT1G15510 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 21
AT1G19520 | Majority | Pentatricopeptide (PPR) repeat-containing protein | 21
AT1G24340 | Majority | FAD/NAD(P)-binding oxidoreductase family protein | 21
AT1G27920 | Majority | Microtubule-associated protein 65-8 | 21
AT1G34630 | Majority | Unknown | 21
AT1G54650 | Strictly | Methyltransferase family protein | 21
AT1G56310 | Majority | Polynucleotidyl transferase, ribonuclease H-like superfamily protein | 21
AT1G62730 | Majority | Terpenoid synthases superfamily protein | 21
AT1G77930 | Majority | Chaperone DnaJ-domain superfamily protein | 21
AT1G78930 | Majority | Mitochondrial transcription termination factor family protein | 21
AT2G04740 | Majority | Ankyrin repeat family protein | 21
AT2G18220 | Majority | Noc2p family | 21
AT2G26680 | Majority | Unknown | 21
AT2G37330 | Majority | Aluminum sensitive 3 | 21
AT2G44580 | Majority | Zinc ion binding | 21
AT3G02450 | Majority | Cell division protein ftsH, putative | 21
AT3G06060 | Majority | NAD(P)-binding Rossmann-fold superfamily protein | 21
AT3G07090 | Majority | PPPDE putative thiol peptidase family protein | 21
AT3G14890 | Majority | Phosphoesterase | 21
AT3G16565 | Majority | Alanine-tRNA ligases;nucleic acid binding;ligases, forming aminoacyl-tRNA and related compounds;nucleotide binding;ATP binding | 21

AT3G18390 | Majority | CRS1 / YhbY (CRM) domain-containing protein | 21
AT3G24030 | Majority | Hydroxyethylthiazole kinase family protein | 21
AT3G45640 | Majority | Mitogen-activated protein kinase 3 | 21
AT3G57190 | Majority | Peptide chain release factor, putative | 21
AT3G57430 | Strictly | Tetratricopeptide repeat (TPR)-like superfamily protein | 21
AT3G63170 | Majority | Chalcone-flavanone isomerase family protein | 21
AT4G10750 | Majority | Phosphoenolpyruvate carboxylase family protein | 21
AT4G10760 | Majority | mRNA adenosine methylase | 21
AT4G24270 | Majority | Embryo defective 140 | 21
AT4G29310 | Majority | Protein of unknown function (DUF1005) | 21
AT5G01360 | Majority | Plant protein of unknown function (DUF828) | 21
AT5G04520 | Strictly | Protein of unknown function DUF455 | 21
AT5G08470 | Majority | Peroxisome 1 | 21
AT5G10750 | Majority | Protein of unknown function (DUF1336) | 21
AT5G18475 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 21
AT5G19660 | Majority | SITE-1 protease | 21
AT5G20520 | Majority | Alpha/beta-Hydrolases superfamily protein | 21
AT5G26180 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 21
AT5G37710 | Majority | Alpha/beta-Hydrolases superfamily protein | 21
AT5G50280 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 21
AT5G57700 | Majority | BNR/Asp-box repeat family protein | 21
AT5G63010 | Majority | Transducin/WD40 repeat-like superfamily protein | 21
AT1G06730 | Majority | PfkB-like carbohydrate kinase family protein | 20
AT1G07010 | Majority | Calcineurin-like metallo-phosphoesterase superfamily protein | 20
AT1G11190 | Majority | Bifunctional nuclease i | 20
AT1G28690 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 20

AT1G65320 | Majority | Cystathionine beta-synthase (CBS) family protein | 20
AT1G67320 | Majority | DNA primase, large subunit family | 20
AT1G77550 | Majority | Tubulin-tyrosine ligases | 20
AT1G79050 | Strictly | RecA DNA recombination family protein | 20
AT1G80080 | Strictly | Leucine-rich repeat (LRR) family protein | 20
AT2G01860 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 20
AT2G03200 | Majority | Eukaryotic aspartyl protease family protein | 20
AT2G26780 | Majority | ARM repeat superfamily protein | 20
AT2G33560 | Majority | BUB1-related (BUB1: budding uninhibited by benzymidazol 1) | 20
AT2G46090 | Majority | Diacylglycerol kinase family protein | 20
AT3G05620 | Majority | Plant invertase/pectin methylesterase inhibitor superfamily | 20
AT3G09410 | Majority | Pectinacetylesterase family protein | 20
AT3G23020 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 20
AT3G49170 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 20
AT3G57180 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 20
AT3G58530 | Majority | RNI-like superfamily protein | 20
AT3G61320 | Majority | Bestrophin-like protein | 20
AT4G13550 | Majority | Triglyceride lipases | 20
AT4G16630 | Majority | DEA(D/H)-box RNA helicase family protein | 20
AT4G17740 | Majority | Peptidase S41 family protein | 20
AT4G20130 | Majority | Plastid transcriptionally active 14 | 20
AT4G21585 | Majority | Endonuclease 4 | 20
AT4G24750 | Majority | Rhodanese/Cell cycle control phosphatase superfamily protein | 20
AT4G39620 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 20
AT5G15390 | Majority | tRNA/rRNA methyltransferase (SpoU) family protein | 20
AT5G17670 | Majority | Alpha/beta-Hydrolases superfamily protein | 20


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT5G20140  Majority  SOUL heme-binding family protein  20
AT5G24020  Majority  Septum site-determining protein (MIND)  20
AT5G42760  Majority  Leucine carboxyl methyltransferase  20
AT5G44520  Strictly  NagB/RpiA/CoA transferase-like superfamily protein  20
AT5G50390  Majority  Pentatricopeptide repeat (PPR-like) superfamily protein  20
AT5G51540  Majority  Zincin-like metalloproteases family protein  20
AT5G54090  Majority  DNA mismatch repair protein MutS, type 2  20
AT5G59900  Majority  Pentatricopeptide repeat (PPR) superfamily protein  20
AT1G08030  Majority  Tyrosylprotein sulfotransferase  19
AT1G16870  Majority  Mitochondrial 28S ribosomal protein S29-related  19
AT1G18030  Majority  Protein phosphatase 2C family protein  19
AT1G25380  Majority  NAD+ transporter 2  19
AT1G30520  Majority  Acyl-activating enzyme 14  19
AT1G30960  Strictly  GTP-binding family protein  19
AT1G49380  Majority  Cytochrome c biogenesis protein family  19
AT1G59990  Strictly  DEA(D/H)-box RNA helicase family protein  19
AT1G63850  Majority  BTB/POZ domain-containing protein  19
AT1G65070  Strictly  DNA mismatch repair protein MutS, type 2  19
AT1G67940  Majority  Non-intrinsic ABC protein 3  19
AT1G78915  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  19
AT2G15820  Majority  Endonucleases  19
AT2G18890  Majority  Protein kinase superfamily protein  19
AT2G34090  Majority  Maternal effect embryo arrest 18  19
AT2G37130  Majority  Peroxidase superfamily protein  19
AT3G03440  Majority  ARM repeat superfamily protein  19
AT3G05410  Strictly  Photosystem II reaction center PsbP family protein  19


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT3G06950  Majority  Pseudouridine synthase family protein  19
AT3G11960  Majority  Cleavage and polyadenylation specificity factor (CPSF) A subunit protein  19
AT3G26670  Majority  Protein of unknown function (DUF803)  19
AT3G47860  Majority  Chloroplastic lipocalin  19
AT3G61620  Majority  3'-5'-exoribonuclease family protein  19
AT4G13020  Majority  Protein kinase superfamily protein  19
AT4G14965  Majority  Membrane-associated progesterone binding protein 4  19
AT4G18465  Strictly  RNA helicase family protein  19
AT4G35910  Majority  Adenine nucleotide alpha hydrolases-like superfamily protein  19
AT5G03860  Majority  Malate synthase  19
AT5G14550  Majority  Core-2/I-branching beta-1,6-N-acetylglucosaminyltransferase family protein  19
AT5G14580  Majority  Polyribonucleotide nucleotidyltransferase, putative  19
AT5G18820  Majority  TCP-1/cpn60 chaperonin family protein  19
AT5G23590  Majority  DNAJ heat shock N-terminal domain-containing protein  19
AT5G24840  Majority  tRNA (guanine-N-7) methyltransferase  19
AT5G38520  Strictly  Alpha/beta-Hydrolases superfamily protein  19
AT5G39250  Majority  F-box family protein  19
AT5G57960  Majority  GTP-binding protein, HflX  19
AT5G58760  Majority  Damaged DNA binding 2  19
AT5G64670  Majority  Ribosomal protein L18e/L15 superfamily protein  19
AT5G65960  Majority  GTP binding  19
AT1G10910  Majority  Pentatricopeptide repeat (PPR) superfamily protein  18
AT1G11880  Majority  Transferases, transferring hexosyl groups  18
AT1G12520  Majority  Copper chaperone for SOD1  18
AT1G33270  Majority  Acyl transferase/acyl hydrolase/lysophospholipase superfamily protein  18
AT1G52530  Strictly  Unknown  18


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT1G54310  Majority  S-adenosyl-L-methionine-dependent methyltransferases superfamily protein  18
AT1G55760  Majority  BTB/POZ domain-containing protein  18
AT1G59600  Majority  ZCW7  18
AT1G62780  Majority  Unknown  18
AT2G03667  Majority  Asparagine synthase family protein  18
AT2G36895  Majority  Unknown  18
AT3G05040  Majority  ARM repeat superfamily protein  18
AT3G10150  Majority  Purple acid phosphatase 16  18
AT3G17430  Majority  Nucleotide-sugar transporter family protein  18
AT3G20810  Majority  2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily protein  18
AT3G26780  Majority  Phosphoglycerate mutase family protein  18
AT3G29240  Majority  Protein of unknown function (DUF179)  18
AT3G63140  Majority  Chloroplast stem-loop binding protein of 41 kDa  18
AT4G01030  Majority  Pentatricopeptide (PPR) repeat-containing protein  18
AT4G01310  Majority  Ribosomal L5P family protein  18
AT4G19180  Majority  GDA1/CD39 nucleoside phosphatase family protein  18
AT4G20740  Majority  Pentatricopeptide repeat (PPR-like) superfamily protein  18
AT4G20760  Majority  NAD(P)-binding Rossmann-fold superfamily protein  18
AT4G31600  Majority  UDP-N-acetylglucosamine (UAA) transporter family  18
AT5G04200  Majority  Metacaspase 9  18
AT5G13760  Majority  Plasma-membrane choline transporter family protein  18
AT5G26040  Strictly  Histone deacetylase 2  18
AT5G27710  Majority  Unknown  18
AT5G46220  Majority  Protein of unknown function (DUF616)  18
AT5G59400  Majority  Unknown  18
AT5G63080  Majority  2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily protein  18


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT5G67570  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  18
AT1G07040  Majority  Unknown  17
AT1G19800  Majority  Trigalactosyldiacylglycerol 1  17
AT1G25145  Majority  UDP-3-O-acyl N-acetylglycosamine deacetylase family protein  17
AT1G31480  Majority  Shoot gravitropism 2 (SGR2)  17
AT1G31500  Majority  DNAse I-like superfamily protein  17
AT1G31860  Majority  Histidine biosynthesis bifunctional protein (HISIE)  17
AT1G32220  Majority  NAD(P)-binding Rossmann-fold superfamily protein  17
AT1G53710  Majority  Calcineurin-like metallo-phosphoesterase superfamily protein  17
AT1G59720  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  17
AT2G04560  Majority  Transferases, transferring glycosyl groups  17
AT2G04660  Majority  Anaphase-promoting complex/cyclosome 2  17
AT2G11000  Majority  MAK10 homologue  17
AT2G17140  Majority  Pentatricopeptide repeat (PPR) superfamily protein  17
AT2G26930  Majority  4-(cytidine 5'-phospho)-2-C-methyl-D-erithritol kinase  17
AT2G38620  Majority  Cyclin-dependent kinase B1;2  17
AT3G10940  Majority  Dual specificity protein phosphatase (DsPTP1) family protein  17
AT3G11220  Majority  Paxneb protein-related  17
AT3G20260  Majority  Protein of unknown function (DUF1666)  17
AT3G20440  Majority  Alpha amylase family protein  17
AT3G21200  Majority  Proton gradient regulation 7  17
AT3G48250  Majority  Pentatricopeptide repeat (PPR) superfamily protein  17
AT3G53190  Majority  Pectin lyase-like superfamily protein  17
AT3G54170  Majority  FKBP12 interacting protein 37  17
AT3G56950  Majority  Small and basic intrinsic protein 2;1  17
AT3G59490  Strictly  Unknown  17


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT4G06676  Majority  Unknown  17
AT4G12230  Majority  Alpha/beta-Hydrolases superfamily protein  17
AT4G21190  Majority  Pentatricopeptide repeat (PPR) superfamily protein  17
AT4G26180  Majority  Mitochondrial substrate carrier family protein  17
AT4G38020  Majority  tRNA/rRNA methyltransferase (SpoU) family protein  17
AT5G13890  Majority  Family of unknown function (DUF716)  17
AT5G15680  Majority  ARM repeat superfamily protein  17
AT5G23120  Majority  Photosystem II stability/assembly factor, chloroplast (HCF136)  17
AT5G27920  Majority  F-box family protein  17
AT5G41480  Majority  Folylpolyglutamate synthetase family protein  17
AT5G46630  Majority  Clathrin adaptor complexes medium subunit family protein  17
AT5G62270  Majority  Unknown  17
AT5G63200  Majority  Tetratricopeptide repeat (TPR)-containing protein  17
AT1G03687  Majority  DTW domain-containing protein  16
AT1G08550  Majority  Non-photochemical quenching 1  16
AT1G10830  Majority  15-cis-zeta-carotene isomerase  16
AT1G26170  Strictly  ARM repeat superfamily protein  16
AT1G72440  Majority  CCAAT-binding factor  16
AT1G73060  Majority  Low PSII Accumulation 3  16
AT1G74680  Majority  Exostosin family protein  16
AT1G76630  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  16
AT1G78140  Majority  S-adenosyl-L-methionine-dependent methyltransferases superfamily protein  16
AT2G03390  Majority  uvrB/uvrC motif-containing protein  16
AT2G13100  Majority  Major facilitator superfamily protein  16
AT2G18950  Majority  Homogentisate phytyltransferase 1  16
AT2G39805  Majority  Integral membrane Yip1 family protein  16


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT2G42450  Majority  Alpha/beta-Hydrolases superfamily protein  16
AT2G46915  Strictly  Protein of unknown function (DUF3754)  16
AT3G05210  Majority  Nucleotide repair protein, putative  16
AT3G25410  Majority  Sodium bile acid symporter family  16
AT3G53170  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  16
AT3G55250  Majority  Unknown  16
AT3G56740  Majority  Ubiquitin-associated (UBA) protein  16
AT3G60450  Majority  Phosphoglycerate mutase family protein  16
AT3G60660  Majority  Unknown  16
AT4G08940  Majority  Ubiquitin carboxyl-terminal hydrolase family protein  16
AT4G10340  Majority  Light harvesting complex of photosystem II 5  16
AT4G34215  Majority  Domain of unknown function (DUF303)  16
AT4G34500  Majority  Protein kinase superfamily protein  16
AT5G01850  Majority  Protein kinase superfamily protein  16
AT5G06680  Majority  Spindle pole body component 98  16
AT5G14140  Majority  Zinc ion binding; nucleic acid binding  16
AT5G19300  Majority  Unknown  16
AT5G42520  Majority  Basic pentacysteine 6  16
AT5G52380  Majority  Vascular-related NAC-domain 6  16
AT5G55060  Majority  Unknown  16
AT5G63970  Majority  Copine (Calcium-dependent phospholipid-binding protein) family  16
AT1G02145  Majority  Homolog of asparagine-linked glycosylation 12  15
AT1G02420  Majority  Pentatricopeptide repeat (PPR) superfamily protein  15
AT1G09820  Majority  Pentatricopeptide repeat (PPR-like) superfamily protein  15
AT1G10520  Majority  DNA polymerase lambda (POLL)  15
AT1G19720  Majority  Pentatricopeptide repeat (PPR-like) superfamily protein  15


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT1G24310  Majority  Unknown  15
AT1G27060  Majority  Regulator of chromosome condensation (RCC1) family protein  15
AT1G51310  Majority  Transferases; tRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferases  15
AT1G65950  Majority  Protein kinase superfamily protein  15
AT1G76170  Majority  2-thiocytidine tRNA biosynthesis protein, TtcA  15
AT1G80550  Majority  Pentatricopeptide repeat (PPR) superfamily protein  15
AT2G18030  Majority  Peptide methionine sulfoxide reductase family protein  15
AT2G25570  Majority  Binding  15
AT2G26900  Majority  Sodium bile acid symporter family  15
AT2G30390  Majority  Ferrochelatase 2  15
AT2G34980  Majority  Phosphatidylinositolglycan synthase family protein  15
AT2G38060  Majority  Phosphate transporter 4;2  15
AT2G40550  Majority  E2F target gene 1  15
AT3G06440  Strictly  Galactosyltransferase family protein  15
AT3G19180  Majority  Paralog of ARC6  15
AT3G29010  Majority  Biotin/lipoate A/B protein ligase family  15
AT3G45040  Majority  Phosphatidate cytidylyltransferase family protein  15
AT3G53090  Majority  Ubiquitin-protein ligase 7  15
AT3G54720  Majority  Peptidase M28 family protein  15
AT3G58520  Majority  Ubiquitin carboxyl-terminal hydrolase family protein  15
AT3G61360  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  15
AT4G02750  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  15
AT4G26980  Majority  RNI-like superfamily protein  15
AT4G29590  Majority  S-adenosyl-L-methionine-dependent methyltransferases superfamily protein  15
AT4G30840  Majority  Transducin/WD40 repeat-like superfamily protein  15
AT4G33460  Strictly  ABC transporter family protein  15


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT4G39970  Majority  Haloacid dehalogenase-like hydrolase (HAD) superfamily protein  15
AT5G03910  Strictly  ABC2 homolog 12  15
AT5G10710  Majority  Unknown  15
AT5G11480  Majority  P-loop containing nucleoside triphosphate hydrolases superfamily protein  15
AT5G22110  Majority  DNA polymerase epsilon subunit B2  15
AT5G48600  Majority  Structural maintenance of chromosome 3  15
AT5G50110  Majority  S-adenosyl-L-methionine-dependent methyltransferases superfamily protein  15
AT5G51130  Majority  S-adenosyl-L-methionine-dependent methyltransferases superfamily protein  15
AT5G53920  Strictly  Ribosomal protein L11 methyltransferase-related  15
AT5G63440  Majority  Protein of unknown function (DUF167)  15
AT1G01040  Majority  Dicer-like 1  14
AT1G02050  Majority  Chalcone and stilbene synthase family protein  14
AT1G08370  Majority  Decapping 1  14
AT1G17850  Majority  Rhodanese/Cell cycle control phosphatase superfamily protein  14
AT1G18870  Majority  Isochorismate synthase 2  14
AT1G19150  Majority  Photosystem I light harvesting complex gene 6  14
AT1G20410  Majority  Pseudouridine synthase family protein  14
AT1G20575  Majority  Nucleotide-diphospho-sugar transferases superfamily protein  14
AT1G52780  Majority  Protein of unknown function (DUF2921)  14
AT1G66330  Majority  Senescence-associated family protein  14
AT1G67420  Majority  Zn-dependent exopeptidases superfamily protein  14
AT1G67570  Majority  Protein of unknown function (DUF3537)  14
AT1G72320  Majority  Pumilio 23  14
AT1G72640  Majority  NAD(P)-binding Rossmann-fold superfamily protein  14
AT2G13290  Majority  Beta-1,4-N-acetylglucosaminyltransferase family protein  14
AT2G21120  Majority  Protein of unknown function (DUF803)  14


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT2G26270  Majority  Unknown  14
AT2G31450  Majority  DNA glycosylase superfamily protein  14
AT2G33205  Majority  Serinc-domain containing serine and sphingolipid biosynthesis protein  14
AT2G39670  Majority  Radical SAM superfamily protein  14
AT2G44510  Majority  CDK inhibitor P21 binding protein  14
AT2G45060  Majority  Uncharacterised conserved protein UCP022280  14
AT2G45790  Majority  Phosphomannomutase  14
AT2G47830  Majority  Cation efflux family protein  14
AT3G10600  Majority  Cationic amino acid transporter 7  14
AT3G14580  Strictly  Pentatricopeptide repeat (PPR) superfamily protein  14
AT3G15140  Strictly  Polynucleotidyl transferase, ribonuclease H-like superfamily protein  14
AT3G15150  Majority  RING/U-box superfamily protein  14
AT3G18020  Majority  Pentatricopeptide repeat (PPR) superfamily protein  14
AT3G48880  Majority  RNI-like superfamily protein  14
AT3G53760  Majority  GAMMA-TUBULIN COMPLEX PROTEIN 4  14
AT3G59530  Majority  Calcium-dependent phosphotriesterase superfamily protein  14
AT4G15450  Majority  Senescence/dehydration-associated protein-related  14
AT4G17760  Majority  Damaged DNA binding; exodeoxyribonuclease IIIs  14
AT4G22840  Majority  Sodium bile acid symporter family  14
AT4G39010  Majority  Glycosyl hydrolase 9B18  14
AT5G03880  Majority  Thioredoxin family protein  14
AT5G04070  Majority  NAD(P)-binding Rossmann-fold superfamily protein  14
AT5G05130  Majority  DNA/RNA helicase protein  14
AT5G08510  Majority  Pentatricopeptide repeat (PPR) superfamily protein  14
AT5G26850  Majority  Uncharacterized protein  14
AT5G45780  Majority  Leucine-rich repeat protein kinase family protein  14


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT5G54270  Majority  Light-harvesting chlorophyll B-binding protein 3  14
AT5G54290  Majority  Cytochrome c biogenesis protein family  14
AT5G57590  Majority  Adenosylmethionine-8-amino-7-oxononanoate transaminases  14
AT5G65820  Majority  Pentatricopeptide repeat (PPR) superfamily protein  14
AT1G01090  Majority  Pyruvate dehydrogenase E1 alpha  13
AT1G05060  Majority  Unknown  13
AT1G06050  Majority  Protein of unknown function (DUF1336)  13
AT1G06710  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  13
AT1G08610  Majority  Pentatricopeptide repeat (PPR) superfamily protein  13
AT1G11750  Majority  CLP protease proteolytic subunit 6  13
AT1G16070  Majority  Tubby like protein 8  13
AT1G20370  Majority  Pseudouridine synthase family protein  13
AT1G21710  Majority  8-oxoguanine-DNA glycosylase 1  13
AT1G27752  Majority  Ubiquitin system component Cue protein  13
AT1G28680  Majority  HXXXD-type acyl-transferase family protein  13
AT1G34150  Majority  Pseudouridine synthase family protein  13
AT1G35340  Majority  ATP-dependent protease La (LON) domain protein  13
AT1G44770  Majority  Unknown  13
AT1G48430  Majority  Dihydroxyacetone kinase  13
AT1G56690  Majority  Pentatricopeptide repeat (PPR) superfamily protein  13
AT1G59840  Majority  Cofactor assembly of complex C  13
AT2G18360  Majority  Alpha/beta-Hydrolases superfamily protein  13
AT2G22650  Majority  FAD-dependent oxidoreductase family protein  13
AT2G27760  Majority  tRNAisopentenyltransferase 2  13
AT2G35790  Majority  Unknown  13
AT2G37020  Majority  Translin family protein  13


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT2G38680  Majority  5'-nucleotidases;magnesium ion binding  13
AT2G38780  Majority  Unknown  13
AT2G40430  Majority  Unknown  13
AT2G45200  Majority  Golgi snare 12  13
AT2G46910  Majority  Plastid-lipid associated protein PAP / fibrillin family protein  13
AT3G03890  Majority  FMN binding  13
AT3G03990  Majority  Alpha/beta-Hydrolases superfamily protein  13
AT3G11920  Majority  Glutaredoxin-related  13
AT3G20015  Majority  Eukaryotic aspartyl protease family protein  13
AT3G24320  Majority  MUTL protein homolog 1  13
AT3G27925  Majority  DegP protease 1  13
AT3G55530  Majority  RING/U-box superfamily protein  13
AT3G59520  Majority  RHOMBOID-like protein 13  13
AT3G61870  Majority  Unknown  13
AT3G63090  Majority  Ubiquitin carboxyl-terminal hydrolase family protein  13
AT3G63190  Majority  Ribosome recycling factor, chloroplast precursor  13
AT4G08455  Majority  BTB/POZ domain-containing protein  13
AT4G10090  Majority  Elongator protein 6  13
AT4G10620  Strictly  P-loop containing nucleoside triphosphate hydrolases superfamily protein  13
AT4G15440  Majority  Hydroperoxide lyase 1  13
AT4G24090  Majority  Unknown  13
AT4G25080  Majority  Magnesium-protoporphyrin IX methyltransferase  13
AT4G28740  Majority  Unknown  13
AT4G33495  Majority  Ubiquitin carboxyl-terminal hydrolase family protein  13
AT5G06370  Majority  NC domain-containing protein-related  13
AT5G08400  Majority  Protein of unknown function (DUF3531)  13


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT5G09580  Majority  Unknown  13
AT5G13680  Majority  IKI3 family protein  13
AT5G16780  Majority  SART-1 family  13
AT5G19020  Majority  Mitochondrial editing factor 18  13
AT5G24330  Majority  Arabidopsis trithorax-related protein 6  13
AT5G27620  Majority  Cyclin H;1  13
AT5G37360  Majority  Unknown  13
AT5G46390  Strictly  Peptidase S41 family protein  13
AT5G51170  Majority  Unknown  13
AT1G16080  Majority  Unknown  12
AT1G26500  Majority  Pentatricopeptide repeat (PPR) superfamily protein  12
AT1G26760  Majority  SET domain protein 35  12
AT1G30290  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  12
AT1G45110  Majority  Tetrapyrrole (Corrin/Porphyrin) Methylases  12
AT1G49980  Majority  DNA/RNA polymerases superfamily protein  12
AT1G53140  Majority  Dynamin related protein 5A  12
AT1G55480  Majority  Protein containing PDZ domain, a K-box domain, and a TPR region  12
AT1G62260  Majority  Mitochondrial editing factor 9  12
AT1G64680  Majority  Unknown  12
AT1G69500  Majority  Cytochrome P450, family 704, subfamily B, polypeptide 1  12
AT1G69800  Majority  Cystathionine beta-synthase (CBS) protein  12
AT1G71180  Majority  6-phosphogluconate dehydrogenase family protein  12
AT1G79490  Majority  Pentatricopeptide repeat (PPR) superfamily protein  12
AT2G20495  Majority  Unknown  12
AT2G32415  Majority  Polynucleotidyl transferase, ribonuclease H fold protein with HRDC domain  12
AT2G32520  Majority  Alpha/beta-Hydrolases superfamily protein  12


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT2G36990  Majority  RNApolymerase sigma-subunit F  12
AT2G38025  Strictly  Cysteine proteinases superfamily protein  12
AT2G39910  Strictly  ARM repeat superfamily protein  12
AT2G40570  Majority  Initiator tRNA phosphoribosyl transferase family protein  12
AT3G02330  Majority  Pentatricopeptide repeat (PPR) superfamily protein  12
AT3G09050  Majority  Unknown  12
AT3G12100  Majority  Cation efflux family protein  12
AT3G21740  Majority  Arabidopsis thaliana protein of unknown function (DUF794)  12
AT3G42660  Majority  Transducin family protein / WD-40 repeat family protein  12
AT3G51930  Majority  Transducin/WD40 repeat-like superfamily protein  12
AT3G52050  Majority  5'-3' exonuclease family protein  12
AT3G54460  Majority  SNF2 domain-containing protein / helicase domain-containing protein / F-box family protein  12
AT3G55480  Majority  Protein Affected Trafficking 2  12
AT3G56840  Majority  FAD-dependent oxidoreductase family protein  12
AT3G59420  Majority  Crinkly4  12
AT3G60050  Majority  Pentatricopeptide repeat (PPR) superfamily protein  12
AT4G00590  Majority  N-terminal nucleophile aminohydrolases (Ntn hydrolases) superfamily protein  12
AT4G00620  Majority  Amino acid dehydrogenase family protein  12
AT4G13650  Majority  Pentatricopeptide repeat (PPR) superfamily protein  12
AT4G17430  Majority  O-fucosyltransferase family protein  12
AT4G19890  Majority  Pentatricopeptide repeat (PPR-like) superfamily protein  12
AT4G21300  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  12
AT4G23440  Majority  Disease resistance protein (TIR-NBS class)  12
AT4G27250  Majority  NAD(P)-binding Rossmann-fold superfamily protein  12
AT4G32750  Majority  Unknown  12
AT4G34360  Strictly  S-adenosyl-L-methionine-dependent methyltransferases superfamily protein  12


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT4G35440  Majority  Chloride channel E  12
AT4G35740  Strictly  DEAD/DEAH box RNA helicase family protein  12
AT5G01110  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  12
AT5G11650  Majority  Alpha/beta-Hydrolases superfamily protein  12
AT5G13570  Majority  Decapping 2  12
AT5G18390  Strictly  Pentatricopeptide repeat (PPR) superfamily protein  12
AT5G38730  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  12
AT5G47420  Majority  Tryptophan RNA-binding attenuator protein-like  12
AT5G51290  Majority  Diacylglycerol kinase family protein  12
AT5G54260  Majority  DNA repair and meiosis protein (Mre11)  12
AT5G56450  Majority  Mitochondrial substrate carrier family protein  12
AT5G59520  Majority  ZRT/IRT-like protein 2  12
AT1G02330  Majority  Unknown  11
AT1G05310  Strictly  Pectin lyase-like superfamily protein  11
AT1G07590  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  11
AT1G07740  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  11
AT1G07970  Majority  Unknown  11
AT1G10280  Majority  Core-2/I-branching beta-1,6-N-acetylglucosaminyltransferase family protein  11
AT1G17470  Majority  Developmentally regulated G-protein 1  11
AT1G28140  Majority  Unknown  11
AT1G28210  Majority  DNAJ heat shock family protein  11
AT1G56180  Majority  Unknown  11
AT1G56350  Majority  Peptide chain release factor 2  11
AT1G59540  Majority  P-loop containing nucleoside triphosphate hydrolases superfamily protein  11
AT1G68290  Majority  Endonuclease 2  11
AT1G69523  Majority  S-adenosyl-L-methionine-dependent methyltransferases superfamily protein  11


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT1G71170  Majority  6-phosphogluconate dehydrogenase family protein  11
AT1G73470  Majority  Unknown  11
AT1G74580  Majority  Pentatricopeptide repeat (PPR) superfamily protein  11
AT1G76920  Majority  F-box family protein  11
AT2G02150  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  11
AT2G20890  Majority  Photosystem II reaction center PSB29 protein  11
AT2G23093  Majority  Major facilitator superfamily protein  11
AT2G23380  Majority  SET domain-containing protein  11
AT2G28450  Majority  Zinc finger (CCCH-type) family protein  11
AT2G38660  Majority  Amino acid dehydrogenase family protein  11
AT2G40390  Majority  Unknown  11
AT2G40800  Majority  Unknown  11
AT3G05480  Majority  Cell cycle checkpoint control protein family  11
AT3G09060  Strictly  Pentatricopeptide repeat (PPR) superfamily protein  11
AT3G13220  Majority  ABC-2 type transporter family protein  11
AT3G13450  Majority  Transketolase family protein  11
AT3G15180  Majority  ARM repeat superfamily protein  11
AT3G17590  Strictly  Transcription regulatory protein SNF5, putative (BSH)  11
AT3G19300  Majority  Protein kinase superfamily protein  11
AT3G25210  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  11
AT3G25805  Strictly  Unknown  11
AT3G25920  Majority  Ribosomal protein L15  11
AT3G28460  Majority  Methyltransferases  11
AT3G45880  Majority  2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily protein  11
AT3G48420  Majority  Haloacid dehalogenase-like hydrolase (HAD) superfamily protein  11
AT3G60360  Majority  Embryo sac development arrest 14  11


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT3G62390  Majority  Trichome birefringence-like 6  11
AT4G01730  Majority  DHHC-type zinc finger family protein  11
AT4G23560  Majority  Glycosyl hydrolase 9B15  11
AT4G27600  Majority  PfkB-like carbohydrate kinase family protein  11
AT4G33990  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  11
AT4G39470  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  11
AT5G01310  Majority  APRATAXIN-like  11
AT5G01470  Majority  S-adenosyl-L-methionine-dependent methyltransferases superfamily protein  11
AT5G11640  Majority  Thioredoxin superfamily protein  11
AT5G12260  Majority  Unknown  11
AT5G37530  Majority  NAD(P)-binding Rossmann-fold superfamily protein  11
AT5G38380  Majority  Unknown  11
AT5G63280  Majority  C2H2-like zinc finger protein  11
AT5G66860  Majority  Ribosomal protein L25/Gln-tRNA synthetase, anti-codon-binding domain  11
AT1G01280  Strictly  Cytochrome P450, family 703, subfamily A, polypeptide 2  10
AT1G01880  Majority  5'-3' exonuclease family protein  10
AT1G04900  Strictly  Protein of unknown function (DUF185)  10
AT1G05670  Majority  Pentatricopeptide repeat (PPR-like) superfamily protein  10
AT1G07615  Majority  GTP-binding protein Obg/CgtA  10
AT1G09420  Majority  Glucose-6-phosphate dehydrogenase 4  10
AT1G14140  Majority  Mitochondrial substrate carrier family protein  10
AT1G18460  Majority  Alpha/beta-Hydrolases superfamily protein  10
AT1G34770  Majority  Unknown  10
AT1G53760  Majority  Unknown  10
AT1G55325  Majority  RNA polymerase II transcription mediators  10
AT1G64770  Majority  NDH-dependent cyclic electron flow 1  10


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT1G65230  Majority  Uncharacterized conserved protein (DUF2358)  10
AT1G67690  Majority  Zincin-like metalloproteases family protein  10
AT1G72420  Majority  NADH:ubiquinone oxidoreductase intermediate-associated protein 30  10
AT1G80510  Majority  Transmembrane amino acid transporter family protein  10
AT2G14050  Majority  Minichromosome maintenance 9  10
AT2G17033  Majority  Pentatricopeptide (PPR) repeat-containing protein  10
AT2G31290  Majority  Ubiquitin carboxyl-terminal hydrolase family protein  10
AT2G36810  Majority  ARM repeat superfamily protein  10
AT2G39000  Majority  Acyl-CoA N-acyltransferases (NAT) superfamily protein  10
AT2G39140  Majority  Pseudouridine synthase family protein  10
AT2G44520  Majority  Cytochrome c oxidase 10  10
AT3G10840  Majority  Alpha/beta-Hydrolases superfamily protein  10
AT3G15840  Majority  Post-illumination chlorophyll fluorescence increase  10
AT3G50120  Majority  Plant protein of unknown function (DUF247)  10
AT3G51040  Majority  RTE1-homolog  10
AT3G52260  Majority  Pseudouridine synthase family protein  10
AT4G01995  Majority  Unknown  10
AT4G14850  Strictly  Pentatricopeptide repeat (PPR) superfamily protein  10
AT4G24710  Majority  P-loop containing nucleoside triphosphate hydrolases superfamily protein  10
AT4G25720  Majority  Glutaminyl cyclase  10
AT4G29860  Majority  Transducin/WD40 repeat-like superfamily protein  10
AT4G29910  Majority  Origin recognition complex protein 5  10
AT4G31530  Majority  NAD(P)-binding Rossmann-fold superfamily protein  10
AT4G32430  Majority  Pentatricopeptide repeat (PPR) superfamily protein  10
AT4G36980  Majority  Unknown  10
AT5G04480  Majority  UDP-Glycosyltransferase superfamily protein  10


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT5G14100  Majority  Non-intrinsic ABC protein 14  10
AT5G15330  Majority  SPX domain gene 4  10
AT5G17400  Majority  Endoplasmic reticulum-adenine nucleotide transporter 1  10
AT5G21970  Majority  Ubiquitin carboxyl-terminal hydrolase family protein  10
AT5G24970  Strictly  Protein kinase superfamily protein  10
AT5G32450  Majority  RNA binding (RRM/RBD/RNP motifs) family protein  10
AT5G35100  Majority  Cyclophilin-like peptidyl-prolyl cis-trans isomerase family protein  10
AT5G47240  Majority  Nudix hydrolase homolog 8  10
AT5G52110  Strictly  Protein of unknown function (DUF2930)  10
AT5G58020  Majority  Unknown  10
AT5G67100  Majority  DNA-directed DNA polymerases  10
AT1G02970  Majority  WEE1 kinase homolog  9
AT1G09680  Majority  Pentatricopeptide repeat (PPR) superfamily protein  9
AT1G14620  Majority  Decoy  9
AT1G16650  Majority  S-adenosyl-L-methionine-dependent methyltransferases superfamily protein  9
AT1G19140  Majority  Unknown  9
AT1G19690  Majority  NAD(P)-binding Rossmann-fold superfamily protein  9
AT1G22460  Majority  O-fucosyltransferase family protein  9
AT1G31970  Majority  DEA(D/H)-box RNA helicase family protein  9
AT1G32200  Majority  Phospholipid/glycerol acyltransferase family protein  9
AT1G50575  Majority  Putative lysine decarboxylase family protein  9
AT1G52640  Majority  Pentatricopeptide repeat (PPR) superfamily protein  9
AT1G61520  Majority  Photosystem I light harvesting complex gene 3  9
AT1G61900  Majority  Unknown  9
AT1G67400  Majority  ELMO/CED-12 family protein  9
AT1G67960  Majority  Unknown  9


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT2G01390  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  9
AT2G20020  Majority  RNA-binding CRS1 / YhbY (CRM) domain-containing protein  9
AT2G23540  Majority  GDSL-like Lipase/Acylhydrolase superfamily protein  9
AT2G28560  Strictly  DNA repair (Rad51) family protein  9
AT3G01380  Majority  Transferases; sulfuric ester hydrolases; catalytics; transferases  9
AT3G08740  Majority  Elongation factor P (EF-P) family protein  9
AT3G10220  Majority  Tubulin folding cofactor B  9
AT3G11470  Majority  4'-phosphopantetheinyl transferase superfamily  9
AT3G20680  Majority  Domain of unknown function (DUF1995)  9
AT3G53140  Majority  O-methyltransferase family protein  9
AT3G54180  Majority  Cyclin-dependent kinase B1;1  9
AT3G61750  Majority  Cytochrome b561/ferric reductase transmembrane with DOMON related domain  9
AT4G00026  Majority  Unknown  9
AT4G10380  Majority  NOD26-like intrinsic protein 5;1  9
AT4G16270  Strictly  Peroxidase superfamily protein  9
AT4G16570  Majority  Protein arginine methyltransferase 7  9
AT4G21790  Majority  Tobamovirus multiplication 1  9
AT4G22300  Majority  Carboxylesterases  9
AT4G22790  Majority  MATE efflux family protein  9
AT4G32050  Majority  Neurochondrin family protein  9
AT4G32320  Majority  Ascorbate peroxidase 6  9
AT4G33000  Majority  Calcineurin B-like protein 10  9
AT4G33440  Majority  Pectin lyase-like superfamily protein  9
AT4G38380  Majority  MATE efflux family protein  9
AT5G01510  Majority  Protein of unknown function, DUF647  9
AT5G08305  Strictly  Pentatricopeptide repeat (PPR) superfamily protein  9


Table 5-2. Continued
Gene ID  SC Gene Status  Functional Category  No. Orthologues
AT5G09790  Majority  Arabidopsis trithorax-related protein 5  9
AT5G13290  Majority  Protein kinase superfamily protein  9
AT5G13410  Majority  FKBP-like peptidyl-prolyl cis-trans isomerase family protein  9
AT5G17660  Majority  tRNA (guanine-N-7) methyltransferase  9
AT5G19130  Majority  GPI transamidase component family protein / Gaa1-like family protein  9
AT5G23240  Majority  DNAJ heat shock N-terminal domain-containing protein  9
AT5G40400  Strictly  Pentatricopeptide repeat (PPR) superfamily protein  9
AT5G45950  Majority  GDSL-like Lipase/Acylhydrolase superfamily protein  9
AT1G06440  Majority  Ubiquitin carboxyl-terminal hydrolase family protein  8
AT1G15330  Majority  Cystathionine beta-synthase (CBS) protein  8
AT1G15390  Majority  Peptide deformylase 1A  8
AT1G20230  Majority  Pentatricopeptide repeat (PPR) superfamily protein  8
AT1G26840  Majority  Origin recognition complex protein 6  8
AT1G44575  Majority  Chlorophyll A-B binding family protein  8
AT1G49350  Majority  PfkB-like carbohydrate kinase family protein  8
AT1G54990  Majority  Alpha/beta-Hydrolases superfamily protein  8
AT1G60600  Majority  UbiA prenyltransferase family protein  8
AT1G78620  Majority  Protein of unknown function DUF92, transmembrane  8
AT2G02410  Majority  Unknown  8
AT2G07680  Strictly  Multidrug resistance-associated protein 11  8
AT2G22870  Majority  P-loop containing nucleoside triphosphate hydrolases superfamily protein  8
AT2G42220  Majority  Rhodanese/Cell cycle control phosphatase superfamily protein  8
AT3G10630  Majority  UDP-Glycosyltransferase superfamily protein  8
AT3G15200  Majority  Tetratricopeptide repeat (TPR)-like superfamily protein  8
AT3G18165  Majority  Modifier of snc1,4  8
AT3G20480  Majority  Tetraacyldisaccharide 4'-kinase family protein  8

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT3G26540 | Strictly | Tetratricopeptide repeat (TPR)-like superfamily protein | 8
AT3G46210 | Majority | Ribosomal protein S5 domain 2-like superfamily protein | 8
AT4G09650 | Majority | ATP synthase delta-subunit gene | 8
AT4G12760 | Majority | Unknown | 8
AT4G14480 | Majority | Protein kinase superfamily protein | 8
AT4G15640 | Majority | Unknown | 8
AT4G27540 | Majority | Prenylated RAB acceptor 1.H | 8
AT4G27700 | Majority | Rhodanese/Cell cycle control phosphatase superfamily protein | 8
AT4G28590 | Majority | Unknown | 8
AT4G31010 | Strictly | RNA-binding CRS1 / YhbY (CRM) domain-containing protein | 8
AT4G32605 | Majority | Mitochondrial glycoprotein family protein | 8
AT4G34310 | Majority | Alpha/beta-Hydrolases superfamily protein | 8
AT4G38370 | Majority | Phosphoglycerate mutase family protein | 8
AT5G03770 | Majority | KDO transferase A | 8
AT5G06130 | Majority | Chaperone protein dnaJ-related | 8
AT5G08490 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 8
AT5G17990 | Majority | Tryptophan biosynthesis 1 | 8
AT5G18570 | Majority | GTP1/OBG family protein | 8
AT5G22470 | Majority | NAD+ ADP-ribosyltransferases; NAD+ ADP-ribosyltransferases | 8
AT5G25310 | Majority | Exostosin family protein | 8
AT5G51220 | Majority | Ubiquinol-cytochrome C chaperone family protein | 8
AT5G52800 | Majority | DNA primases | 8
AT5G64470 | Majority | Plant protein of unknown function (DUF828) | 8
AT5G66960 | Majority | Prolyl oligopeptidase family protein | 8
AT1G07270 | Majority | Cell division control, Cdc6 | 7
AT1G08220 | Strictly | Unknown | 7

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT1G09380 | Majority | Nodulin MtN21 /EamA-like transporter family protein | 7
AT1G19880 | Majority | Regulator of chromosome condensation (RCC1) family protein | 7
AT1G26900 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 7
AT1G28560 | Majority | snRNA activating complex family protein | 7
AT1G31430 | Majority | Pentatricopeptide repeat (PPR-like) superfamily protein | 7
AT1G32520 | Majority | Unknown | 7
AT1G36310 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 7
AT1G49730 | Majority | Protein kinase superfamily protein | 7
AT1G50300 | Majority | TBP-associated factor 15 | 7
AT1G52155 | Majority | Unknown | 7
AT1G55630 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 7
AT1G69290 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 7
AT1G71340 | Majority | PLC-like phosphodiesterases superfamily protein | 7
AT1G77090 | Majority | Mog1/PsbP/DUF1795-like photosystem II reaction center PsbP family protein | 7
AT2G19080 | Majority | Metaxin-related | 7
AT2G33680 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 7
AT2G35490 | Majority | Plastid-lipid associated protein PAP / fibrillin family protein | 7
AT2G46180 | Majority | Golgin candidate 4 | 7
AT2G47490 | Majority | NAD+ transporter 1 | 7
AT2G47910 | Majority | Chlororespiratory reduction 6 | 7
AT3G01440 | Majority | PsbQ-like 1 | 7
AT3G03690 | Majority | Core-2/I-branching beta-1,6-N-acetylglucosaminyltransferase family protein | 7
AT3G05625 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 7
AT3G05740 | Strictly | RECQ helicase l1 | 7
AT3G10060 | Majority | FKBP-like peptidyl-prolyl cis-trans isomerase family protein | 7
AT3G12270 | Majority | Protein arginine methyltransferase 3 | 7

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT3G13770 | Strictly | Pentatricopeptide repeat (PPR) superfamily protein | 7
AT3G17060 | Majority | Pectin lyase-like superfamily protein | 7
AT3G18680 | Majority | Amino acid kinase family protein | 7
AT3G19260 | Majority | LAG1 homologue 2 | 7
AT3G19440 | Strictly | Pseudouridine synthase family protein | 7
AT3G45890 | Majority | Protein of unknown function, DUF647 | 7
AT3G48470 | Majority | Embryo defective 2423 | 7
AT3G54380 | Majority | SAC3/GANP/Nin1/mts3/eIF-3 p25 family | 7
AT3G57060 | Strictly | Binding | 7
AT4G01935 | Majority | Unknown | 7
AT4G04670 | Majority | Met-10+ like family protein / kelch repeat-containing protein | 7
AT4G08690 | Majority | Sec14p-like phosphatidylinositol transfer family protein | 7
AT4G10020 | Majority | Hydroxysteroid dehydrogenase 5 | 7
AT4G10030 | Majority | Alpha/beta-Hydrolases superfamily protein | 7
AT4G11080 | Majority | HMG (high mobility group) box protein | 7
AT4G15030 | Majority | Unknown | 7
AT4G16510 | Majority | YbaK/aminoacyl-tRNA synthetase-associated domain | 7
AT4G18370 | Majority | DEGP protease 5 | 7
AT4G20350 | Majority | Oxidoreductases | 7
AT4G21530 | Majority | Transducin/WD40 repeat-like superfamily protein | 7
AT4G30700 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 7
AT5G02130 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 7
AT5G04490 | Majority | Vitamin E pathway gene 5 | 7
AT5G13230 | Strictly | Tetratricopeptide repeat (TPR)-like superfamily protein | 7
AT5G15300 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 7
AT5G17930 | Majority | MIF4G domain-containing protein / MA3 domain-containing protein | 7

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT5G19630 | Majority | Alpha/beta-Hydrolases superfamily protein | 7
AT5G20590 | Majority | Trichome birefringence-like 5 | 7
AT5G22050 | Majority | Protein kinase superfamily protein | 7
AT5G24460 | Majority | Unknown | 7
AT5G39530 | Majority | Protein of unknown function (DUF1997) | 7
AT5G44230 | Strictly | Pentatricopeptide repeat (PPR) superfamily protein | 7
AT5G47520 | Majority | RAB GTPase homolog A5A | 7
AT5G47680 | Majority | Unknown | 7
AT5G50375 | Majority | Cyclopropyl isomerase | 7
AT5G55840 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 7
AT5G59440 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 7
AT5G59600 | Strictly | Tetratricopeptide repeat (TPR)-like superfamily protein | 7
AT5G60590 | Majority | DHBP synthase RibB-like alpha/beta domain | 7
AT5G63040 | Majority | Unknown | 7
AT5G67130 | Majority | PLC-like phosphodiesterases superfamily protein | 7
AT1G01180 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 6
AT1G03250 | Majority | Unknown | 6
AT1G03540 | Majority | Pentatricopeptide repeat (PPR-like) superfamily protein | 6
AT1G04640 | Majority | Lipoyltransferase 2 | 6
AT1G12800 | Majority | Nucleic acid-binding, OB-fold-like protein | 6
AT1G13040 | Majority | Pentatricopeptide repeat (PPR-like) superfamily protein | 6
AT1G13330 | Majority | Arabidopsis Hop2 homolog | 6
AT1G14730 | Majority | Cytochrome b561/ferric reductase transmembrane protein family | 6
AT1G16020 | Majority | Protein of unknown function (DUF1712) | 6
AT1G18900 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 6
AT1G20830 | Majority | Multiple chloroplast division site 1 | 6

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT1G25360 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 6
AT1G43800 | Majority | Plant stearoyl-acyl-carrier-protein desaturase family protein | 6
AT1G51730 | Majority | Ubiquitin-conjugating enzyme family protein | 6
AT1G56230 | Majority | Protein of unknown function (DUF1399) | 6
AT1G63250 | Majority | DEA(D/H)-box RNA helicase family protein | 6
AT1G63270 | Majority | Non-intrinsic ABC protein 10 | 6
AT1G64970 | Majority | Gamma-tocopherol methyltransferase | 6
AT1G69070 | Majority | Unknown | 6
AT1G71060 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 6
AT1G71500 | Majority | Rieske (2Fe-2S) domain-containing protein | 6
AT1G77010 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 6
AT1G77405 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 6
AT2G03070 | Majority | Mediator subunit 8 | 6
AT2G20980 | Majority | Minichromosome maintenance 10 | 6
AT2G22070 | Majority | Pentatricopeptide (PPR) repeat-containing protein | 6
AT2G22410 | Strictly | SLOW GROWTH 1 | 6
AT2G26540 | Majority | Uroporphyrinogen-III synthase family protein | 6
AT2G31740 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 6
AT2G35190 | Strictly | Novel plant snare 11 | 6
AT3G01370 | Majority | CRM family member 2 | 6
AT3G01800 | Majority | Ribosome recycling factor | 6
AT3G03490 | Majority | Peroxin 19-1 | 6
AT3G10180 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 6
AT3G11460 | Strictly | Pentatricopeptide repeat (PPR) superfamily protein | 6
AT3G13170 | Majority | Spo11/DNA topoisomerase VI, subunit A protein | 6
AT3G15820 | Majority | Phosphatidic acid phosphatase-related / PAP2-related | 6

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT3G18580 | Majority | Nucleic acid-binding, OB-fold-like protein | 6
AT3G19230 | Majority | Leucine-rich repeat (LRR) family protein | 6
AT3G22150 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 6
AT3G22880 | Majority | DNA repair (Rad51) family protein | 6
AT3G24495 | Majority | MUTS homolog 7 | 6
AT3G26085 | Strictly | CAAX amino terminal protease family protein | 6
AT3G26782 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 6
AT3G27730 | Strictly | ATP binding; ATP-dependent helicases; DNA helicases | 6
AT3G42630 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 6
AT3G47830 | Strictly | DNA glycosylase superfamily protein | 6
AT3G48900 | Strictly | Single-stranded DNA endonuclease family protein | 6
AT3G55120 | Majority | Chalcone-flavanone isomerase family protein | 6
AT3G56690 | Majority | Cam interacting protein 111 | 6
AT3G59300 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 6
AT3G59780 | Majority | Rhodanese/Cell cycle control phosphatase superfamily protein | 6
AT4G00830 | Majority | RNA-binding (RRM/RBD/RNP motifs) family protein | 6
AT4G13070 | Majority | RNA-binding CRS1 / YhbY (CRM) domain protein | 6
AT4G13670 | Majority | Plastid transcriptionally active 5 | 6
AT4G13970 | Majority | Zinc ion binding | 6
AT4G15780 | Majority | Vesicle-associated membrane protein 724 | 6
AT4G39920 | Majority | C-CAP/cofactor C-like domain-containing protein | 6
AT5G07250 | Majority | RHOMBOID-like protein 3 | 6
AT5G08340 | Majority | Nucleotidylyl transferase superfamily protein | 6
AT5G12100 | Majority | Pentatricopeptide (PPR) repeat-containing protein | 6
AT5G13050 | Majority | 5-formyltetrahydrofolate cycloligase | 6
AT5G15940 | Majority | NAD(P)-binding Rossmann-fold superfamily protein | 6

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT5G17550 | Majority | Peroxin 19-2 | 6
AT5G19840 | Majority | 2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily protein | 6
AT5G20050 | Majority | Protein kinase superfamily protein | 6
AT5G20870 | Majority | O-Glycosyl hydrolases family 17 protein | 6
AT5G21070 | Majority | Unknown | 6
AT5G39350 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 6
AT5G43820 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 6
AT5G44600 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 6
AT5G45790 | Majority | Ubiquitin carboxyl-terminal hydrolase family protein | 6
AT5G47380 | Majority | Protein of unknown function, DUF547 | 6
AT5G50970 | Majority | Transducin family protein / WD-40 repeat family protein | 6
AT5G51880 | Majority | 2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily protein | 6
AT5G55580 | Majority | Mitochondrial transcription termination factor family protein | 6
AT5G57160 | Strictly | DNA ligase IV | 6
AT5G63920 | Majority | Topoisomerase 3-alpha | 6
AT5G66470 | Majority | RNA binding; GTP binding | 6
AT1G02410 | Majority | Cytochrome c oxidase assembly protein CtaG / Cox11 family | 5
AT1G07220 | Majority | Arabidopsis thaliana protein of unknown function (DUF821) | 5
AT1G10930 | Majority | DNA helicase (RECQl4A) | 5
AT1G14410 | Majority | ssDNA-binding transcriptional regulator | 5
AT1G25420 | Majority | Regulator of Vps4 activity in the MVB pathway protein | 5
AT1G26940 | Majority | Cyclophilin-like peptidyl-prolyl cis-trans isomerase family protein | 5
AT1G30610 | Strictly | Pentatricopeptide (PPR) repeat-containing protein | 5
AT1G54820 | Majority | Protein kinase superfamily protein | 5
AT1G56570 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 5
AT1G63610 | Majority | Unknown | 5

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT1G64150 | Majority | Uncharacterized protein family (UPF0016) | 5
AT1G65270 | Majority | Unknown | 5
AT1G66670 | Majority | CLP protease proteolytic subunit 3 | 5
AT1G67700 | Majority | Unknown | 5
AT1G72250 | Majority | Di-glucose binding protein with Kinesin motor domain | 5
AT1G74070 | Majority | Cyclophilin-like peptidyl-prolyl cis-trans isomerase family protein | 5
AT1G77020 | Majority | DNAJ heat shock N-terminal domain-containing protein | 5
AT1G79120 | Majority | Ubiquitin carboxyl-terminal hydrolase family protein | 5
AT1G79540 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 5
AT2G02180 | Majority | Tobamovirus multiplication protein 3 | 5
AT2G02590 | Majority | Unknown | 5
AT2G15630 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 5
AT2G26810 | Strictly | Putative methyltransferase family protein | 5
AT2G27800 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 5
AT2G28230 | Majority | TATA-binding related factor (TRF) of subunit 20 of Mediator complex | 5
AT2G28250 | Majority | Protein kinase superfamily protein | 5
AT2G35360 | Majority | Ubiquitin family protein | 5
AT2G38720 | Majority | Microtubule-associated protein 65-5 | 5
AT2G40316 | Majority | Unknown | 5
AT2G43030 | Majority | Ribosomal protein L3 family protein | 5
AT2G46060 | Majority | Transmembrane protein-related | 5
AT2G46100 | Majority | Nuclear transport factor 2 (NTF2) family protein | 5
AT2G47450 | Majority | Chloroplast signal recognition particle component (CAO) | 5
AT3G02010 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 5
AT3G04020 | Majority | Unknown | 5
AT3G04550 | Majority | Unknown | 5

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT3G08820 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 5
AT3G08970 | Majority | DNAJ heat shock N-terminal domain-containing protein | 5
AT3G12040 | Majority | DNA-3-methyladenine glycosylase (MAG) | 5
AT3G14750 | Majority | Unknown | 5
AT3G18890 | Majority | NAD(P)-binding Rossmann-fold superfamily protein | 5
AT3G25680 | Majority | Unknown | 5
AT3G49730 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 5
AT3G53270 | Majority | Small nuclear RNA activating complex (SNAPc), subunit SNAP43 protein | 5
AT3G54510 | Strictly | Early-responsive to dehydration stress protein (ERD4) | 5
AT3G54940 | Majority | Papain family cysteine protease | 5
AT4G02020 | Majority | SET domain-containing protein | 5
AT4G02460 | Majority | DNA mismatch repair protein, putative | 5
AT4G09350 | Majority | Chaperone DnaJ-domain superfamily protein | 5
AT4G11690 | Majority | Pentatricopeptide repeat (PPR-like) superfamily protein | 5
AT4G15890 | Majority | Binding | 5
AT4G21220 | Majority | Trimeric LpxA-like enzymes superfamily protein | 5
AT4G29720 | Majority | Polyamine oxidase 5 | 5
AT4G30580 | Majority | Phospholipid/glycerol acyltransferase family protein | 5
AT4G31150 | Majority | Endonuclease V family protein | 5
AT5G05930 | Majority | Guanylyl cyclase 1 | 5
AT5G07910 | Majority | Leucine-rich repeat (LRR) family protein | 5
AT5G09950 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 5
AT5G11540 | Majority | D-arabinono-1,4-lactone oxidase family protein | 5
AT5G16860 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 5
AT5G24490 | Majority | 30S ribosomal protein, putative | 5
AT5G26010 | Majority | Protein phosphatase 2C family protein | 5

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT5G27395 | Majority | Mitochondrial inner membrane translocase complex, subunit Tim44-related protein | 5
AT5G37340 | Majority | ZPR1 zinc-finger domain protein | 5
AT5G40020 | Majority | Pathogenesis-related thaumatin superfamily protein | 5
AT5G40720 | Majority | Domain of unknown function (DUF23) | 5
AT5G42450 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 5
AT5G44740 | Majority | Y-family DNA polymerase H | 5
AT5G46420 | Majority | 16S rRNA processing protein RimM family | 5
AT5G48340 | Majority | Unknown | 5
AT5G49580 | Majority | Chaperone DnaJ-domain superfamily protein | 5
AT5G49800 | Majority | Polyketide cyclase/dehydrase and lipid transport superfamily protein | 5
AT5G53490 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 5
AT5G57630 | Majority | CBL-interacting protein kinase 21 | 5
AT5G57950 | Majority | 26S proteasome regulatory subunit, putative | 5
AT5G62990 | Majority | Ubiquitin carboxyl-terminal hydrolase family protein | 5
AT5G66360 | Majority | Ribosomal RNA adenine dimethylase family protein | 5
AT5G66631 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 5
AT5G66810 | Majority | Unknown | 5
AT1G01930 | Majority | Zinc finger protein-related | 4
AT1G04110 | Majority | Subtilase family protein | 4
AT1G05750 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 4
AT1G09450 | Strictly | Protein kinase superfamily protein | 4
AT1G10330 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 4
AT1G17330 | Majority | Metal-dependent phosphohydrolase | 4
AT1G18335 | Majority | Acyl-CoA N-acyltransferases (NAT) superfamily protein | 4
AT1G18550 | Majority | ATP binding microtubule motor family protein | 4
AT1G20810 | Strictly | FKBP-like peptidyl-prolyl cis-trans isomerase family protein | 4

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT1G22180 | Majority | Sec14p-like phosphatidylinositol transfer family protein | 4
AT1G22730 | Majority | MA3 domain-containing protein | 4
AT1G22850 | Majority | SNARE associated Golgi protein family | 4
AT1G23070 | Majority | Protein of unknown function (DUF300) | 4
AT1G25054 | Majority | UDP-3-O-acyl N-acetylglycosamine deacetylase family protein | 4
AT1G26180 | Majority | Unknown | 4
AT1G30320 | Majority | Remorin family protein | 4
AT1G33350 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 4
AT1G34160 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 4
AT1G34380 | Majority | 5'-3' exonuclease family protein | 4
AT1G35420 | Majority | Alpha/beta-Hydrolases superfamily protein | 4
AT1G45474 | Strictly | Photosystem I light harvesting complex gene 5 | 4
AT1G69020 | Majority | Prolyl oligopeptidase family protein | 4
AT1G69200 | Majority | Fructokinase-like 2 | 4
AT1G70630 | Majority | Nucleotide-diphospho-sugar transferase family protein | 4
AT1G73740 | Majority | UDP-Glycosyltransferase superfamily protein | 4
AT1G73970 | Majority | Unknown | 4
AT1G74460 | Majority | GDSL-like Lipase/Acylhydrolase superfamily protein | 4
AT1G76120 | Majority | Pseudouridine synthase family protein | 4
AT1G80150 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 4
AT2G15560 | Majority | Putative endonuclease or glycosyl hydrolase | 4
AT2G16630 | Majority | Pollen Ole e 1 allergen and extensin family protein | 4
AT2G20410 | Majority | RNA-binding ASCH domain protein | 4
AT2G28380 | Majority | dsRNA-binding protein 2 | 4
AT2G35260 | Majority | Unknown | 4

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT2G38920 | Majority | SPX (SYG1/Pho81/XPR1) domain-containing protein / zinc finger (C3HC4-type RING finger) protein-related | 4
AT2G42240 | Majority | RNA-binding (RRM/RBD/RNP motifs) family protein | 4
AT2G43650 | Majority | Sas10/U3 ribonucleoprotein (Utp) family protein | 4
AT2G45700 | Strictly | Sterile alpha motif (SAM) domain-containing protein | 4
AT3G02690 | Majority | Nodulin MtN21 /EamA-like transporter family protein | 4
AT3G06880 | Majority | Transducin/WD40 repeat-like superfamily protein | 4
AT3G07440 | Majority | Unknown | 4
AT3G09040 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 4
AT3G09210 | Majority | Plastid transcriptionally active 13 | 4
AT3G15080 | Majority | Polynucleotidyl transferase, ribonuclease H-like superfamily protein | 4
AT3G16990 | Majority | Haem oxygenase-like, multi-helical | 4
AT3G21470 | Strictly | Pentatricopeptide repeat (PPR-like) superfamily protein | 4
AT3G22425 | Majority | Imidazoleglycerol-phosphate dehydratase | 4
AT3G23700 | Majority | Nucleic acid-binding proteins superfamily | 4
AT3G43860 | Majority | Glycosyl hydrolase 9A4 | 4
AT3G48260 | Majority | With no lysine (K) kinase 3 | 4
AT3G48810 | Strictly | Pentatricopeptide repeat (PPR) superfamily protein | 4
AT3G54980 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 4
AT3G55510 | Majority | Noc2p family | 4
AT3G56650 | Majority | Mog1/PsbP/DUF1795-like photosystem II reaction center PsbP family protein | 4
AT3G58590 | Strictly | Pentatricopeptide repeat (PPR) superfamily protein | 4
AT3G58830 | Majority | Haloacid dehalogenase (HAD) superfamily protein | 4
AT3G62080 | Majority | SNF7 family protein | 4
AT4G04870 | Majority | Cardiolipin synthase | 4
AT4G10260 | Majority | PfkB-like carbohydrate kinase family protein | 4

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT4G12610 | Majority | Transcription activators; DNA binding; RNA polymerase II transcription factors; catalytics; transcription initiation factors | 4
AT4G12790 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 4
AT4G14590 | Majority | Embryo defective 2739 | 4
AT4G16835 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 4
AT4G17380 | Strictly | MUTS-like protein 4 | 4
AT4G18975 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 4
AT4G23800 | Majority | HMG (high mobility group) box protein | 4
AT4G23860 | Majority | PHD finger protein-related | 4
AT4G23890 | Majority | Unknown | 4
AT4G25835 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 4
AT4G30870 | Majority | Restriction endonuclease, type II-like superfamily protein | 4
AT4G34830 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 4
AT4G35640 | Majority | Serine acetyltransferase 3;2 | 4
AT5G04440 | Majority | Protein of unknown function (DUF1997) | 4
AT5G09820 | Majority | Plastid-lipid associated protein PAP / fibrillin family protein | 4
AT5G13830 | Majority | FtsJ-like methyltransferase family protein | 4
AT5G17240 | Majority | SET domain group 40 | 4
AT5G17780 | Majority | Alpha/beta-Hydrolases superfamily protein | 4
AT5G18110 | Majority | Novel cap-binding protein | 4
AT5G18525 | Majority | Protein serine/threonine kinases; protein tyrosine kinases; ATP binding; protein kinases | 4
AT5G24130 | Majority | Unknown | 4
AT5G26110 | Majority | Protein kinase superfamily protein | 4
AT5G27680 | Strictly | RECQ helicase SIM | 4
AT5G38630 | Majority | Cytochrome B561-1 | 4
AT5G42370 | Majority | Calcineurin-like metallo-phosphoesterase superfamily protein | 4

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT5G44560 | Majority | SNF7 family protein | 4
AT5G46100 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 4
AT5G46330 | Majority | Leucine-rich receptor-like protein kinase family protein | 4
AT5G48040 | Majority | Ubiquitin carboxyl-terminal hydrolase family protein | 4
AT5G48830 | Majority | Unknown | 4
AT5G51700 | Majority | Protein binding; Zinc ion binding | 4
AT5G57250 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 4
AT5G60370 | Majority | Unknown | 4
AT1G01760 | Majority | Adenosine deaminases; RNA binding; | 3
AT1G03100 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 3
AT1G03620 | Majority | ELMO/CED-12 family protein | 3
AT1G09130 | Majority | ATP-dependent caseinolytic (Clp) protease/crotonase family protein | 3
AT1G09760 | Majority | U2 small nuclear ribonucleoprotein A | 3
AT1G11650 | Majority | RNA-binding (RRM/RBD/RNP motifs) family protein | 3
AT1G12244 | Strictly | Polynucleotidyl transferase, ribonuclease H-like superfamily protein | 3
AT1G15200 | Majority | Protein-protein interaction regulator family protein | 3
AT1G16445 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 3
AT1G16480 | Strictly | Tetratricopeptide repeat (TPR)-like superfamily protein | 3
AT1G16880 | Majority | Uridylyltransferase-related | 3
AT1G18700 | Majority | DNAJ heat shock N-terminal domain-containing protein | 3
AT1G27595 | Majority | Unknown | 3
AT1G31870 | Majority | Unknown | 3
AT1G53600 | Strictly | Tetratricopeptide repeat (TPR)-like superfamily protein | 3
AT1G53800 | Majority | Unknown | 3
AT1G62250 | Majority | Unknown | 3
AT1G68930 | Strictly | Pentatricopeptide (PPR) repeat-containing protein | 3

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT1G70190 | Majority | Ribosomal protein L7/L12, oligomerisation; Ribosomal protein L7/L12, C-terminal/adaptor protein ClpS-like | 3
AT1G77720 | Majority | Putative protein kinase 1 | 3
AT1G79190 | Majority | ARM repeat superfamily protein | 3
AT2G04360 | Majority | Unknown | 3
AT2G07750 | Majority | DEA(D/H)-box RNA helicase family protein | 3
AT2G19640 | Majority | ASH1-related protein 2 | 3
AT2G21720 | Majority | Plant protein of unknown function (DUF639) | 3
AT2G23840 | Majority | HNH endonuclease | 3
AT2G30080 | Majority | ZIP metal ion transporter family | 3
AT2G34050 | Majority | Unknown | 3
AT2G34570 | Majority | PIN domain-like family protein | 3
AT2G36710 | Majority | Pectin lyase-like superfamily protein | 3
AT2G41760 | Strictly | Unknown | 3
AT2G42710 | Majority | Ribosomal protein L1p/L10e family | 3
AT2G44850 | Strictly | Unknown | 3
AT2G45460 | Majority | SMAD/FHA domain-containing protein | 3
AT3G05340 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 3
AT3G09660 | Majority | Minichromosome maintenance 8 | 3
AT3G10670 | Majority | Non-intrinsic ABC protein 7 | 3
AT3G10915 | Majority | Reticulon family protein | 3
AT3G14110 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 3
AT3G15120 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 3
AT3G15130 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 3
AT3G18500 | Majority | DNAse I-like superfamily protein | 3
AT3G20970 | Majority | NFU domain protein 4 | 3

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT3G21465 | Majority | Unknown | 3
AT3G27970 | Majority | Exonuclease family protein | 3
AT3G32930 | Majority | Unknown | 3
AT3G52030 | Majority | F-box family protein with WD40/YVTN repeat domain | 3
AT3G54970 | Majority | D-aminoacid aminotransferase-like PLP-dependent enzymes superfamily protein | 3
AT3G56430 | Majority | Unknown | 3
AT3G56570 | Majority | SET domain-containing protein | 3
AT3G57480 | Majority | Zinc finger (C2H2 type, AN1-like) family protein | 3
AT3G58470 | Majority | Nucleic acid binding; methyltransferases | 3
AT3G59710 | Majority | NAD(P)-binding Rossmann-fold superfamily protein | 3
AT3G60810 | Majority | Unknown | 3
AT3G61570 | Majority | GRIP-related ARF-binding domain-containing protein 1 | 3
AT3G61770 | Majority | Acid phosphatase/vanadium-dependent haloperoxidase-related protein | 3
AT3G61800 | Majority | Unknown | 3
AT4G02405 | Majority | S-adenosyl-L-methionine-dependent methyltransferases superfamily protein | 3
AT4G02485 | Strictly | 2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily protein | 3
AT4G09740 | Majority | Glycosyl hydrolase 9B14 | 3
AT4G12680 | Majority | Unknown | 3
AT4G13590 | Majority | Uncharacterized protein family (UPF0016) | 3
AT4G23840 | Strictly | Leucine-rich repeat (LRR) family protein | 3
AT4G25120 | Majority | P-loop containing nucleoside triphosphate hydrolases superfamily protein | 3
AT4G26410 | Majority | Uncharacterised conserved protein UCP022280 | 3
AT4G29070 | Majority | Phospholipase A2 family protein | 3
AT4G29170 | Strictly | Mnd1 family protein | 3
AT4G29750 | Majority | CRS1 / YhbY (CRM) domain-containing protein | 3
AT4G30690 | Majority | Translation initiation factor 3 protein | 3

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT4G30860 | Strictly | SET domain group 4 | 3
AT4G31160 | Majority | DDB1-CUL4 associated factor 1 | 3
AT4G36610 | Majority | Alpha/beta-Hydrolases superfamily protein | 3
AT4G37480 | Majority | Chaperone DnaJ-domain superfamily protein | 3
AT5G01400 | Majority | HEAT repeat-containing protein | 3
AT5G02320 | Majority | Myb domain protein 3r-5 | 3
AT5G06400 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 3
AT5G09995 | Majority | Unknown | 3
AT5G11310 | Majority | Pentatricopeptide repeat (PPR) superfamily protein | 3
AT5G11580 | Majority | Regulator of chromosome condensation (RCC1) family protein | 3
AT5G11810 | Majority | Unknown | 3
AT5G11840 | Strictly | Protein of unknown function (DUF1230) | 3
AT5G12920 | Majority | Transducin/WD40 repeat-like superfamily protein | 3
AT5G13240 | Majority | Transcription regulators | 3
AT5G14020 | Majority | Endosomal targeting BRO1-like domain-containing protein | 3
AT5G14080 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 3
AT5G15010 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 3
AT5G15100 | Majority | Auxin efflux carrier family protein | 3
AT5G16180 | Strictly | Ortholog of maize chloroplast splicing factor CRS1 | 3
AT5G22340 | Majority | Unknown | 3
AT5G28500 | Majority | Unknown | 3
AT5G30490 | Majority | Unknown | 3
AT5G35400 | Majority | Pseudouridine synthase family protein | 3
AT5G35690 | Majority | Unknown | 3
AT5G38510 | Majority | Rhomboid-related intramembrane serine protease family protein | 3
AT5G43750 | Majority | NAD(P)H dehydrogenase 18 | 3

Table 5-2. Continued

Gene ID | SC Gene Status | Functional Category | No. Orthologues
AT5G47840 | Majority | Adenosine monophosphate kinase | 3
AT5G47850 | Majority | CRINKLY4 related 4 | 3
AT5G48390 | Majority | Tetratricopeptide repeat (TPR)-like superfamily protein | 3
AT5G49560 | Majority | Putative methyltransferase family protein | 3
AT5G52630 | Majority | Mitochondrial RNA editing factor 1 | 3
AT5G52910 | Majority | Timeless family protein | 3
AT5G54310 | Majority | ARF-GAP domain 5 | 3
AT5G57670 | Majority | Protein kinase superfamily protein | 3
AT5G66930 | Majority | Unknown | 3

Table 5-3. Sequence characteristics for 85 commonly shared single-copy nuclear loci detected by the marker discovery pipeline (see Table 5-2 for Gene IDs). The numbers of total aligned, constant, variable but parsimony-uninformative, and parsimony-informative characters, together with the average pairwise sequence identity, are reported for each locus and for the combined data set. Summary statistics are provided at the bottom of the table.

Data set | Total Characters | Constant Characters | Variable, Parsimony Uninformative Characters | Parsimony Informative Characters | % Pairwise identity
AT1G53280 | 1198 | 408 | 116 | 674 | 81.1
AT1G64550 | 2162 | 982 | 173 | 1007 | 83.1
AT4G37040 | 853 | 343 | 86 | 424 | 84.2
AT5G64370 | 1283 | 580 | 87 | 616 | 82.9
AT4G29490 | 1388 | 605 | 100 | 683 | 83.5
AT5G05200 | 1501 | 592 | 135 | 774 | 84.0
AT5G57655 | 1340 | 615 | 108 | 617 | 84.4
AT1G30360 | 2345 | 790 | 198 | 1357 | 75.4
AT1G43860 | 995 | 394 | 86 | 515 | 82.3
AT5G14520 | 1301 | 455 | 134 | 712 | 82.1
AT1G62750 | 2139 | 1086 | 106 | 947 | 84.1
AT2G18940 | 2090 | 658 | 248 | 1184 | 79.3
AT5G14250 | 1237 | 393 | 157 | 687 | 80.1
AT5G46580 | 1694 | 583 | 145 | 966 | 78.4
AT4G35850 | 1263 | 352 | 137 | 774 | 80.2
AT5G62530 | 1735 | 774 | 171 | 790 | 80.7
AT5G63890 | 1451 | 711 | 106 | 634 | 85.0
AT1G13900 | 1696 | 573 | 155 | 968 | 78.2
AT1G57770 | 1802 | 831 | 183 | 788 | 82.2
AT1G76400 | 1798 | 572 | 184 | 1042 | 73.6
AT2G31880 | 1504 | 439 | 133 | 932 | 80.8
AT3G10230 | 1341 | 489 | 78 | 774 | 76.1
AT3G51050 | 2100 | 880 | 223 | 997 | 83.0
AT3G58460 | 1034 | 406 | 84 | 544 | 81.9
AT5G65720 | 1257 | 621 | 78 | 558 | 82.6
AT1G07230 | 1509 | 631 | 99 | 779 | 79.5
AT1G67680 | 1775 | 540 | 156 | 1079 | 79.8
AT1G73720 | 1671 | 901 | 113 | 657 | 86.5
AT3G45300 | 1285 | 671 | 100 | 514 | 85.5
AT5G30510 | 1116 | 517 | 91 | 508 | 83.3
AT1G18260 | 1724 | 729 | 180 | 815 | 84.6
AT1G74470 | 1216 | 515 | 48 | 653 | 79.3
AT3G54860 | 1697 | 652 | 177 | 868 | 83.0
AT4G03560 | 2098 | 822 | 161 | 1115 | 81.4
AT4G29830 | 791 | 285 | 87 | 419 | 79.4
AT5G04420 | 1362 | 379 | 168 | 815 | 80.9

Table 5-3. Continued

Data set | Total Characters | Constant Characters | Variable, Parsimony Uninformative Characters | Parsimony Informative Characters | % Pairwise identity
AT5G04660 | 1386 | 451 | 128 | 807 | 76.1
AT5G06260 | 925 | 300 | 117 | 508 | 80.6
AT5G19540 | 936 | 391 | 93 | 452 | 82.6
AT5G42310 | 1689 | 645 | 176 | 868 | 81.9
AT1G05350 | 1024 | 535 | 71 | 418 | 86.7
AT1G74850 | 2323 | 962 | 229 | 1132 | 83.3
AT2G18710 | 1233 | 577 | 108 | 548 | 84.2
AT3G20790 | 1068 | 239 | 213 | 616 | 78.4
AT4G19860 | 1215 | 489 | 108 | 618 | 81.0
AT4G27800 | 930 | 298 | 83 | 549 | 80.2
AT4G30310 | 1852 | 887 | 164 | 801 | 83.1
AT4G31990 | 1211 | 560 | 84 | 567 | 85.7
AT5G08170 | 1047 | 382 | 88 | 577 | 82.5
AT5G13650 | 1799 | 879 | 103 | 817 | 84.7
AT5G42480 | 1679 | 615 | 184 | 880 | 82.9
AT3G06510 | 1354 | 555 | 110 | 689 | 83.1
AT3G17810 | 1123 | 561 | 52 | 510 | 83.8
AT3G47610 | 797 | 309 | 67 | 421 | 82.8
AT3G53700 | 1865 | 544 | 219 | 1102 | 78.4
AT3G55070 | 1169 | 480 | 129 | 560 | 83.5
AT5G13030 | 1669 | 706 | 130 | 833 | 83.6
AT1G28340 | 1749 | 537 | 144 | 1068 | 78.9
AT1G43580 | 993 | 418 | 63 | 512 | 81.0
AT1G68830 | 1641 | 814 | 97 | 730 | 81.6
AT1G73990 | 1663 | 655 | 152 | 856 | 83.0
AT2G15230 | 856 | 299 | 88 | 469 | 81.9
AT3G09180 | 689 | 245 | 68 | 376 | 82.6
AT3G56460 | 880 | 324 | 79 | 477 | 81.3
AT3G66658 | 1756 | 838 | 131 | 787 | 86.1
AT4G00740 | 1530 | 555 | 198 | 777 | 83.2
AT5G08100 | 727 | 246 | 79 | 402 | 80.6
AT5G17530 | 1509 | 648 | 136 | 725 | 84.9
AT5G57030 | 1329 | 620 | 105 | 604 | 84.3
AT2G37500 | 1225 | 508 | 155 | 562 | 83.6
AT3G05350 | 1781 | 755 | 166 | 860 | 82.7
AT3G52190 | 893 | 299 | 98 | 496 | 82.5
AT4G09020 | 1991 | 847 | 195 | 949 | 83.8
AT5G09860 | 1254 | 548 | 129 | 577 | 85.7
AT1G12410 | 643 | 278 | 84 | 281 | 84.3
AT1G14300 | 963 | 332 | 105 | 526 | 80.8
AT2G19940 | 1032 | 439 | 84 | 509 | 82.7
AT2G26060 | 784 | 265 | 121 | 398 | 83.0
AT3G18860 | 1575 | 598 | 139 | 838 | 82.7

Table 5-3. Continued

Data set Total Constant Variable, Parsimony % Pairwise Characters Characters Parsimony Informative identity Uninformative Characters Characters AT4G00090 885 323 88 474 82.3 AT4G13360 833 366 75 392 84.8 AT4G33030 1163 587 66 510 81.7 AT5G13520 983 422 72 489 84.1 AT5G43600 1111 460 101 550 83.1 AT1G14810 569 283 34 252 84.6

Total 116052 46648 10499 58905 - Mean 1365 549 124 693 82.2 Median 1301 548 110 657 82.7 Standard 419 197 46 226 86.7 Deviation Range 569 - 2345 239 - 1086 34 - 248 252 - 1357 73.6 - 86.7
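The summary rows of Table 5-3 can be cross-checked with simple arithmetic. The following is a minimal sketch in Python, not part of the original analysis; the function name `summarize` and the dictionary keys are hypothetical labels introduced only for illustration. It confirms that the three character classes partition the total alignment length, recomputes the per-locus means from the column totals, and derives the fraction of parsimony-informative characters in the combined data set.

```python
# Column totals and locus count as reported in the summary rows of Table 5-3.
N_LOCI = 85
TOTALS = {
    "total": 116052,          # total aligned characters
    "constant": 46648,        # constant characters
    "uninformative": 10499,   # variable, parsimony-uninformative characters
    "informative": 58905,     # parsimony-informative characters
}


def summarize(totals, n_loci):
    """Recompute simple summary values from the reported column totals."""
    parts = totals["constant"] + totals["uninformative"] + totals["informative"]
    return {
        # The three character classes should add up to the alignment length.
        "partition_ok": parts == totals["total"],
        # Per-locus means, rounded to whole characters as in the table.
        "means": {name: round(value / n_loci) for name, value in totals.items()},
        # Share of the supermatrix that is parsimony informative.
        "fraction_informative": totals["informative"] / totals["total"],
    }


summary = summarize(TOTALS, N_LOCI)
print(summary)
```

Under these inputs, roughly half of the aligned characters in the combined data set are parsimony informative.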

Table 5-4. General Time Reversible (GTR) + Γ model parameters estimated by Randomized Axelerated Maximum Likelihood (RAxML).

Shape parameter    Rate matrix               Base frequencies
alpha: 0.403861    rate A <-> C: 1.737967    freq pi(A): 0.275421
                   rate A <-> G: 4.266640    freq pi(C): 0.194872
                   rate A <-> T: 1.301809    freq pi(G): 0.251427
                   rate C <-> G: 1.507803    freq pi(T): 0.278380
                   rate C <-> T: 5.535664
                   rate G <-> T: 1.000000
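To illustrate how the parameters in Table 5-4 fit together, the sketch below assembles a GTR instantaneous rate matrix from the reported exchangeability rates and base frequencies. It is an illustrative reconstruction under standard GTR conventions (off-diagonal entry q_ij = r_ij * pi_j, rows summing to zero, matrix scaled to one expected substitution per unit time), not RAxML's internal code. The function name `gtr_rate_matrix` is hypothetical. The base frequencies as printed sum to 1.0001, so the sketch renormalizes them; that renormalization is an assumption made here. The shape parameter alpha governs among-site rate variation and does not enter the matrix itself.

```python
BASES = "ACGT"

# Exchangeability rates as reported in Table 5-4 (symmetric; G <-> T fixed at 1).
RATES = {
    ("A", "C"): 1.737967,
    ("A", "G"): 4.266640,
    ("A", "T"): 1.301809,
    ("C", "G"): 1.507803,
    ("C", "T"): 5.535664,
    ("G", "T"): 1.000000,
}

# Base frequencies as reported in Table 5-4.
FREQS = {"A": 0.275421, "C": 0.194872, "G": 0.251427, "T": 0.278380}


def gtr_rate_matrix(rates, freqs):
    """Build a normalized GTR rate matrix Q and the frequencies used for it."""
    total = sum(freqs.values())
    pi = {base: f / total for base, f in freqs.items()}  # renormalize (assumption)

    q = {i: {j: 0.0 for j in BASES} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                # Rates are symmetric, so look the pair up in either order.
                r = rates[(i, j)] if (i, j) in rates else rates[(j, i)]
                q[i][j] = r * pi[j]
        # Diagonal entries make each row sum to zero.
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)

    # Scale so that the expected substitution rate at equilibrium equals 1.
    mu = -sum(pi[i] * q[i][i] for i in BASES)
    for i in BASES:
        for j in BASES:
            q[i][j] /= mu
    return q, pi


Q, PI = gtr_rate_matrix(RATES, FREQS)
for base in BASES:
    print(base, [round(Q[base][other], 4) for other in BASES])
```

In the resulting matrix the two transition types (A <-> G and C <-> T) carry the largest rates, consistent with the exchangeabilities reported in the table.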

Figure 5-1. Number of single-copy nuclear (SCN) genes discovered in each of 77 transcriptomes from the One Thousand Plants (oneKP) project. In all, 1,993 SCN genes were identified by the bioinformatic approach, and the number of genes detected per accession ranged from 0 to 909 (mean = 588.92, median = 616, standard deviation = 192.50). OneKP accession codes correspond to Table 5-1.

Figure 5-2. The distribution of single-copy nuclear genes shared across 77 transcriptomes from the One Thousand Plants (oneKP) project.

Figure 5-3. Concatenated supermatrix of 85 single-copy nuclear genes. The supermatrix included 116,052 characters of sequence data for 77 oneKP accessions (see Table 5-3 for sequence characteristics). The image was exported from Geneious 6.1.4 (Biomatters Inc., Auckland, New Zealand). Sequence data are shown as black bars and missing data and indels are shown as blank spaces.

Figure 5-4. ML tree (-ln L = 2035023.973823) inferred from a maximum likelihood (ML) analysis of a supermatrix that included 85 single-copy nuclear loci and 22 angiosperm families from the following orders: Lamiales (17 families), Boraginales (1 family), Solanales (2 families), and Gentianales (2 families). Branch lengths are drawn to scale, and ML bootstrap values are indicated above the branches. Sample identifiers from the One Thousand Plants (oneKP) project are indicated for duplicate accessions representing a single taxon (see Table 5-1).

Figure 5-5. Phylogenetic relationships among 17 families from Lamiales and 5 additional lamiid families recovered from a maximum likelihood (ML) analysis of 85 putatively single-copy nuclear genes. The topology shown in (A) is a cladogram with branches collapsed at the family level; (B) is a phylogram, included to illustrate the tree shape. Bootstrap values are indicated above the branches.

Figure 5-6. Most parsimonious tree (length = 430,444 steps; consistency index = 0.296; retention index = 0.496; rescaled consistency index = 0.147) inferred from a maximum parsimony (MP) analysis of 85 single-copy nuclear loci for 17 families of Lamiales and five additional families of lamiids (Boraginales, Solanales, and Gentianales). Branch lengths are drawn to scale, and MP bootstrap values are indicated above the branches. Sample identifiers from the One Thousand Plants (oneKP) project are indicated for duplicate accessions representing a single taxon (see Table 5-1).
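The tree statistics reported in the caption of Figure 5-6 are related by a standard definition: the rescaled consistency index is the product of the consistency index and the retention index. A minimal check in Python follows; the function name is a hypothetical label used only for illustration, and the inputs are the rounded values from the caption, so agreement is expected only to three decimal places.

```python
def rescaled_consistency_index(ci, ri):
    """Rescaled consistency index, defined as RC = CI * RI."""
    return ci * ri


# Rounded values as reported in the caption of Figure 5-6.
rc = rescaled_consistency_index(0.296, 0.496)
print(round(rc, 3))  # 0.147, matching the reported value
```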

Figure 5-7. Comparison of maximum likelihood (ML; left) and maximum parsimony (MP; right) topologies recovered from an analysis of 85 putatively single-copy nuclear genes. The topologies shown are cladograms with branches collapsed at the family level. Bootstrap values are indicated above the branches. Black and grey lines illustrate differences in the phylogenetic placement of families and clades, respectively. Selected nodes within Lamiales are numbered to facilitate in-text discussion.

Object 5-1. Modified output and summary data reported by MarkerMiner 1.0 (.xlsx file, 950 KB)

CHAPTER 6 GENERAL CONCLUSIONS

Mints (Lamiaceae) are the sixth-largest angiosperm family, with a worldwide distribution and more than 7,000 species. The extreme morphological complexity of Lamiaceae, combined with the large number of species spanning diverse habitats, presents many exciting opportunities for comparative study. However, fully resolving relationships within and among many mint groups has proven intractable; phylogenetic studies often yield a few well-supported clades within a largely unresolved backbone.

Thus, a primary goal of this study was to help move mint systematics “out of the bushes and into the trees”—that is, to resolve clear evolutionary trees from the currently unresolved patterns.

Available sequence data from Lamiaceae were assembled for phylogenetic analysis in order to examine relationships throughout the clade and to investigate potential factors complicating phylogenetic inference. The analysis produced the most comprehensive family-wide phylogenetic hypothesis for Lamiaceae to date, representing a synthesis of current molecular data.

The composition of major clades within Lamiaceae was consistent with previously published hypotheses. However, infrafamilial relationships were poorly supported due to a combination of systematic error in the analyses and heterogeneous processes of molecular evolution across loci and lineages. First, available data resources for Lamiaceae are too heterogeneous to reconstruct family-wide phylogenetic relationships; insufficient overlap of gene sets and inadequate sequence variation among shared loci introduced conflict or uncertainty into the analysis. Second, phylogenetic dating and diversification analyses revealed heterogeneous patterns of evolutionary timing and tempo across lineages. Much of extant mint species diversity appears recently derived, and at least nine clades had high net diversification rates.

Thus, the amount of time between divergence events was likely small in several mint groups, which may explain their poorly resolved phylogenetic patterns. In this scenario, individual DNA loci appear uninformative or provide conflicting phylogenetic signal with regard to species relationships because of a lack of coalescence.

My results suggest that more data are needed to resolve Lamiaceae phylogeny. Future study designs should strive for more inclusive taxon and data sampling; more complete overlap of gene sets and use of a combination of slow- and fast-evolving genes may help to resolve relationships across multiple phylogenetic scales. Moreover, coalescent-based approaches should be explored as an alternative to phylogenetic inference from concatenated datasets; given possible incomplete lineage sorting, these may provide a more accurate view of species relationships.

Towards fully resolving relationships in Lamiaceae, several new workflows for phylogenetic marker development and targeted sequencing of organellar and nuclear datasets on next-generation sequencing (NGS) platforms were proposed. These approaches were tested as part of two independent case studies with Lamiaceae and Lamiales, respectively.

First, I investigated the utility of plastid phylogenomic approaches for resolving relationships within a closely related group of mints from the southwestern United States and Mexico (i.e. Hedeoma Pers. and allied genera of Menthinae). The study produced the first phylogenomic results for Lamiaceae, inferred from nearly complete plastome sequences. Some relationships in the topology were either poorly supported or unsupported. However, as compared with previously reported phylogenetic results based on only a few plastid loci, resolution and support were considerably improved.

Plastid phylogenomic data do not support the monophyly of Hedeoma, species of which are distributed among at least six clades (e.g. Hedeoma sensu stricto and five other clades containing at least one species of Hedeoma). Despite poor support for some relationships, the data extend support for previously hypothesized evolutionary patterns and processes, including geographically based clades and possible widespread hybridization or introgression. Future studies will benefit from nuclear phylogenies that can provide additional evidence to evaluate these hypotheses.

Lastly, a novel bioinformatic workflow was used to identify a set of 1,993 single-copy nuclear (SCN) loci for Lamiales from available transcriptome assemblies. In all, 85 of these SCN loci were used to generate a phylogenetic hypothesis and to demonstrate the utility of nuclear markers for future investigations of Lamiales. Both resolution and support for inter- and infrafamilial relationships were considerably improved with the application of large-scale nuclear data relative to previous plastid-based studies. Only a few nodes were poorly supported, and this may have resulted from incomplete taxon sampling. Nevertheless, the nuclear data provided a view of interfamilial relationships that, in several cases, extends support for previous hypotheses.

In summary, this study identifies several possible factors underlying poorly resolved phylogenetic patterns in Lamiaceae and demonstrates the utility of several approaches that can help resolve these difficult problems. Towards resolving relationships in Lamiaceae and other groups in Lamiales, future investigators are advised to take advantage of the newly identified gene set developed here. New sequence capture methods and NGS provide an efficient and cost-effective means for nuclear data acquisition, and nearly complete plastome data can be mined from these runs for phylogenomic analyses.

LIST OF REFERENCES

Abd-Elsalam KA. 2003. Bioinformatic tools and guideline for PCR primer design. African Journal of Biotechnology 2:91-95.

Albach DC, Kun Y, Søren Rosendal J, Hong-Quing L. 2009. Phylogenetic placement of Triaenophora (formerly Scrophulariaceae) with some implications for the phylogeny of Lamiales. Taxon 58:749-756.

Albach DC, Meudt HM, Oxelman B. 2005. Piecing together the “new” Plantaginaceae. American Journal of Botany 92:297-315.

Albert VA, Jobson RW, Michael TP, Taylor DJ. 2010. The carnivorous bladderwort (Utricularia, Lentibulariaceae): a system inflates. Journal of Experimental Botany 61:5-9.

Alexeyenko A, Tamas I, Liu G, Sonnhammer EL. 2006. Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 22:e9-15.

Allendorf FW, Hohenlohe PA, Luikart G. 2010. Genomics and the future of conservation genetics. Nature Reviews Genetics 11:697-709.

Altenhoff AM, Schneider A, Gonnet GH, Dessimoz C. 2011. OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Research 39(suppl 1):D289-D294.

Altenhoff AM, Dessimoz C. 2009. Phylogenetic and functional assessment of orthologs inference project and methods. PLoS Computational Biology 5:e1000262.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25:3389-3402.

Alvarez I, Wendel JF. 2003. Ribosomal ITS sequences and plant phylogenetic inference. Molecular Phylogenetics and Evolution 29:417-434.

Anderson JT, Mitchell-Olds T. 2011. Ecological genetics and genomics of plant defenses: evidence and approaches. Functional Ecology 25:312-324.

Andersson S. 2006. On the phylogeny of the genus Calceolaria (Calceolariaceae) as inferred from ITS and plastid matK sequences. Taxon 55:125-137.

Ané C, Larget B, Baum DA, Smith SD, Rokas A. 2007. Bayesian estimation of concordance among gene trees. Molecular Biology and Evolution 24:412-426.

Arnold ML, Ballerini ES, Brothers AN, Hamlin JAP, Ishibashi CDA, Zuelig MP. 2013. The genomics of natural selection and adaptation: Christmas past, present and future(?). Plant Ecology and Diversity 5:451-456.

Baker M. 2010. Clever PCR: more genotyping, smaller volumes. Nature Methods 7:351-355.

Bao S, Jiang R, Kwan W, Wang B, Ma X, Song Y. 2011. Evaluation of next-generation sequencing software in mapping and assembly. Journal of Human Genetics 56:406-414.

Barakat A, Carels N, Bernardi G. 1997. The distribution of genes in the genomes of Gramineae. Proceedings of the National Academy of Sciences of the United States of America 94:6857-6861.

Barber JC, Francisco-Ortega J, Santos-Guerra A, Turner KG, Jansen RK. 2002. Origin of Macaronesian Sideritis L. (Lamioideae: Lamiaceae) inferred from nuclear and chloroplast sequence datasets. Molecular Phylogenetics and Evolution 23:293-306.

Beardsley PM, Olmstead RG. 2002. Redefining Phrymaceae: The placement of Mimulus, tribe Mimuleae, and Phryma, character evolution and biogeography. American Journal of Botany 89:1093-1102.

Bendiksby M, Brysting AK, Thorbek L, Gussarova G, Ryding O. 2011a. Molecular phylogeny of the genus Lamium L. (Lamiaceae): disentangling origins and presumed allotetraploids. Taxon 60:987-1000.

Bendiksby M, Thorbek L, Scheen A-C, Lindqvist C, Ryding O. 2011b. An updated phylogeny and classification of Lamiaceae subfamily Lamioideae. Taxon 60:471-484.

Bennett JR, Matthews S. 2006. Phylogeny of the parasitic plant family Orobanchaceae inferred from phytochrome A. American Journal of Botany 93:1039-1051.

Bentham G. 1848. Labiatae. In: de Candolle AP. Prodromus systematis naturalis regni vegetabilis 23:27-603.

Bentham G. 1834. Labiatarum genera et species. London (UK): James Ridgway. p. 1-783.

Berglund AC, Sjölund E, Ostlund G, Sonnhammer EL. 2008. InParanoid 6: eukaryotic ortholog clusters with inparalogs. Nucleic Acids Research 36:D263-266.

Bergthorsson U, Adams KL, Thomason B, Palmer JD. 2003. Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature 424:197-201.

Blavet N, Charif D, Oger-Desfeux C, Marais GAB, Widmer A. 2011. Comparative high-throughput transcriptome sequencing and development of SiESTa, the Silene EST annotation database. BioMed Central Genomics 12:376.

Blow MJ, Zhang BT, Woyke T, Speller CF, Krivoshapkin A, Yang DY, Derevianko A, Rubin EM. 2008. Identification of ancient remains through genomic sequencing. Genome Research 18:1347-1353.

Boltenhagen E. 1976a. La microflore Senonienne du Gabon. Revue de Micropaleontologie 18:191-199.

Boltenhagen E. 1976b. Pollens et spores Senoniennes du Gabon. Cahiers de Micropaleontologie 3:1-21.

Bowe LM, DePamphilis CW. 1996. Effects of RNA editing and gene processing on phylogenetic reconstruction. Molecular Biology and Evolution 13:1159-1166.

Bramley GLC, Forest F, de Kok RPJ. 2009. Troublesome tropical mints: re-examining generic limits of Vitex and relations (Lamiaceae) in South East Asia. Taxon 58:500-510.

Bräuchler C, Meimberg H, Heubl G. 2010. Molecular phylogeny of Menthinae (Lamiaceae, Nepetoideae, Mentheae)–taxonomy, biogeography and conflicts. Molecular Phylogenetics and Evolution 55:501-523.

Bräuchler C, Meimberg H, Abele T, Heubl G. 2005. Polyphyly of the genus Micromeria (Lamiaceae) – evidence from cpDNA sequence data. Taxon 54:639-650.

Bremer K, Friis EM, Bremer B. 2004. Molecular phylogenetic dating of asterid flowering plants shows Early Cretaceous diversification. Systematic Biology 53:496-505.

Bremer B, Bremer K, Heidari N, Erixon P, Olmstead RG, Anderberg AA, Källersjö M, Barkhordarian E. 2002. Phylogenetics of asterids based on 3 coding and 3 non-coding chloroplast DNA markers and the utility of non-coding DNA at higher taxonomic levels. Molecular Phylogenetics and Evolution 24:274-301.

Briquet J. 1897. Labiatae. In: Engler A, Prantl EK. Die Natürlichen Pflanzenfamilien 4:183-375.

Brown RW. 1962. Paleocene flora of the Rocky Mountains and Great Plains. United States Geological Survey Professional Paper 375:1-119.

Buggs RJA, Chamala S, Wu W, Tate JA, Schnable PS, Soltis DE, Soltis PS, Barbazuk WB. 2012a. Rapid, Repeated, and Clustered Loss of Duplicate Genes in Allopolyploid Plant Populations of Independent Origin. Current Biology 22:248-252.

Buggs RJA, Renny-Byfield S, Chester M, Jordon-Thaden IE, Facio Viccini L, Chamala S, Leitch AR, Schnable PS, Barbazuk WB, Soltis PS, Soltis DE. 2012b. Next-generation sequencing and genome evolution in allopolyploids. American Journal of Botany 99:372-382.

Burleigh JG, Barbazuk WB, Davis JM, Morse AM, Soltis PS. 2012. Exploring diversification and genome size evolution in extant gymnosperms through phylogenetic synthesis. Journal of Botany 2012:1-6.

Bybee SM, Bracken-Grissom H, Haynes BD, Hermansen RA, Byers RL, Clement MJ, Udall JA, et al. 2011. Targeted amplicon sequencing (TAS): a scalable next-gen approach to multilocus, multitaxa phylogenetics. Genome Biology and Evolution 3:1312-1323.

Calonje M, Martín-Bravo S, Dobeš C, Gong W, Jordon-Thaden I, Keifer C, Markus K, Paule J, Schmickl R, Koch MA. 2008. Non-coding nuclear DNA markers in phylogenetic reconstruction. Plant Systematics and Evolution 282:257-280.

Cantino PD. 1997. A comparison of phylogenetic nomenclature with the current system: a botanical case study. Systematic Biology 46:313-331.

Cantino PD. 1992a. Evidence for a polyphyletic origin of the Labiatae. Annals of the Missouri Botanical Garden 79:361-379.

Cantino PD. 1992b. Toward a phylogenetic classification of the Labiatae. In: Harley RM, Reynolds T, editors. Advances in labiate science. Royal Botanic Gardens, Kew. p. 27-37.

Cantino PD, Harley RM, Wagstaff SJ. 1992. Genera of Labiatae: Status and Classification. In: Harley RM, Reynolds T, editors. Advances in labiate science. Royal Botanic Gardens, Kew. p. 511-522.

Cantino PD, Sanders RW. 1986. Subfamilial classification of Labiatae. Systematic Botany 11:163-185.

Carstens BC, Dewey TA. 2010. Species delimitation using a combined coalescent and information-theoretic approach: an example from North American Myotis bats. Systematic Biology 59:400-14.

Chamala S, García N, Godden GT, Krishnakumar V, Jordon-Thaden IJ, De Smet R, Barbazuk WB, Soltis DE, Soltis PS. MarkerMiner 1.0: A new application for phylogenetic marker development using angiosperm transcriptomes. Applications in Plant Sciences, submitted.

Chase MW, Soltis DE, Olmstead RG, Morgan D, Les DH, Mishler BD, Duvall MR, Price RA, Hills HG, Qiu YL, et al. 1993. Phylogenetics of seed plants: an analysis of nucleotide sequences from the plastid gene rbcL. Annals of the Missouri Botanical Garden 80:528-580.

Chaudhary R, Bansal MS, Wehe A, Fernández-Baca D, Eulenstein O. 2010. iGTP: A software package for large-scale gene tree parsimony analysis. BMC Bioinformatics 11:574-580.

Chavali S, Mahajan A, Tabassum R, Maiti S, Bharadwaj D. 2005. Oligonucleotide properties determination and primer designing: a critical examination of predictions. Bioinformatics 21:3918-3925.

Chen F, Mackey AJ, Vermunt JK, Roos DS. 2007. Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS One 2:e383.

Chen F, Mackey AJ, Stoeckert CJ Jr, Roos DS. 2006. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Research 34 Database:D363-D368.

Chen Y-P, Li B, Olmstead RG, Cantino PD, Liu E-D, Xiang C-L. 2014. Phylogenetic placement of the enigmatic genus Holocheila (Lamiaceae) inferred from plastid DNA sequences. Taxon 63:355-366.

Clarkson JJ, Kelley LJ, Leitch AR, Knapp S, Chase MW. 2010. Nuclear glutamine synthetase evolution in Nicotiana: phylogenetics and the origins of allotetraploid and homoploid (diploid) hybrids. Molecular Phylogenetics and Evolution 55:99-112.

Clegg MT, Cummings MP, Durbin ML. 1997. The evolution of plant nuclear genes. Proceedings of the National Academy of Sciences of the United States of America 94:7791-7798.

Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM. 2009. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 38:1767-1771.

Comes HP, Abbott RJ. 2001. Molecular phylogeography, reticulation, and lineage sorting in Mediterranean Senecio sect. Senecio. Evolution 55:1943-1962.

Conn B, Streiber N, Brown E, Henwood M, Olmstead RG. 2009. Infrageneric phylogeny of Chloantheae (Lamiaceae) based on chloroplast ndhF and nuclear ITS sequence data. Australian Systematic Botany 22:243-256.

Cox MP, Peterson DA, Biggs PJ. 2010. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 11:485.

Creer S. 2007. Choosing and using introns in molecular phylogenetics. Evolutionary Bioinformatics 3:99-108.

Cronn R, Knaus BJ, Liston A, Maughan PJ, Parks M, Syring JV, Udall J. 2012. Targeted enrichment strategies for next-generation plant biology. American Journal of Botany 99:291-311.

Cronn R, Liston A, Parks M, Gernandt DS, Shen R, Mockler T. 2008. Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by-synthesis technology. Nucleic Acids Research 36:e122-e122.

Cronn RC, Small RL, Haselkorn T, Wendel JF. 2002. Rapid diversification of the cotton genus (Gossypium: Malvaceae) revealed by analysis of sixteen nuclear and chloroplast genes. American Journal of Botany 89:707-725.

Curto MA, Puppo P, Ferreira D, Nogueira M, Meimberg H. 2012. Development of phylogenetic markers from single-copy nuclear genes for multi locus, species level analyses in the mint family (Lamiaceae). Molecular Phylogenetics and Evolution 63:758-767.

Dassanayake M, Oh D-H, Haas JS, Hernandez A, Hong H, Ali S, Yun D-J, Bressan RA, Zhu J-K, Bohnert HJ. 2011. The genome of the extremophile crucifer Thellungiella parvula. Nature Genetics 43:913-918.

Darwin C. 1859. Origin of Species by Means of Natural Selection or The Preservation of Favoured Races in the Struggle for Life. London: John Murray, Albemarle Street.

Davies K. 2010. The $1,000 Genome: The Revolution in DNA Sequencing and the New Era of Personalized Medicine, 1st ed. Free Press.

Davis CC, Anderson WR, Wurdack KJ. 2005. Gene transfer from a parasitic flowering plant to a fern. Proceedings of the Royal Society B 272:2237-2242.

Davis CC, Latvis M, Nickrent DL, Wurdack KJ, Baum DA. 2007. Floral gigantism in Rafflesiaceae. Science 315:1812.

Davis CC, Wurdack KJ. 2004. Host-to-parasite gene transfer in flowering plants: phylogenetic evidence from Malpighiales. Science 305:676-678.

Degnan JH, Rosenberg NA. 2009. Gene tree discordance, phylogenetic inference and the multi-species coalescent. Trends in Ecology and Evolution 24:332-340.

DeLuca TF, Wu IH, Pu J, Monaghan T, Peshkin L, Singh S, Wall DP. 2006. Roundup: a multi-genome repository of orthologs and evolutionary distances. Bioinformatics 22:2044-2046.

De Smet R, Adams KL, Vandepoele K, Van Montagu MCE, Maere S, Van de Peer Y. 2013. Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. Proceedings of the National Academy of Sciences of the United States of America 110:2899-2903.

Dessimoz C, Cannarozzi G, Gil M, Margadant D, Roth A, Schneider A, Gonnet GH. 2005. OMA, a comprehensive, automated project for the identification of orthologs from complete genome data: introduction and first achievements. In: McLysaght A, Huson DH, editors. Comparative Genomics. Springer (Berlin). p. 61-72.

Dieffenbach CW, Lowe TMJ, Dveksler GS. 1993. General concepts for PCR primer design. PCR Methods and Applications 3:S30-S37.

Dorofeev PI. 1988. Miocene floras of the Tambov district. [posthumous work ed. by Velichkevich FY, in Russian]. Akademii Nauka, Leningrad, Russia.

Dorofeev PI. 1963. The Tertiary floras of western Siberia. Moscow (Russia): Izdatelstvo Akademii Nauka SSSR. 366 p.

Doyle JJ, Doyle JA. 1987. A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochemical Bulletin 19:11-15.

Drew BT, Sytsma KJ. 2013. The South American radiation of Lepechinia (Lamiaceae): phylogenetics, divergence times and evolution of dioecy. Botanical Journal of the Linnean Society 171:171-190.

Drew BT, Sytsma KJ. 2012. Phylogenetics, biogeography, and staminal evolution in the tribe Mentheae (Lamiaceae). American Journal of Botany 99:933-953.

Duarte JM, Wall PK, Edger PP, Landherr LL, Ma H, Pires JC, Leebens-Mack J, dePamphilis CW. 2010. Identification of shared single copy nuclear genes in Arabidopsis, Populus, Vitis, and Oryza and their phylogenetic utility across various taxonomic levels. BMC Evolutionary Biology 10:61.

Dutilh BE, van Noort V, van der Heijden RTJM, Boekhout T, Snel B, Huynen MA. 2007. Assessment of phylogenomic and orthology approaches for phylogenetic inference. Bioinformatics 23:815-824.

Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26:2460-2461.

Edgar RC. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32:1792-1797.

Edgar RC, Batzoglou S. 2006. Multiple sequence alignment. Current Opinion in Structural Biology 16:368-373.

Edwards CE, Lefkowitz D, Soltis DE, Soltis PS. 2008. Phylogeny of Conradina and related southeastern scrub mints (Lamiaceae) based on GapC gene sequences. International Journal of Plant Sciences 169:579-594.

Edwards CE, Soltis DE, Soltis PS. 2006. Molecular phylogeny of Conradina and other scrub mints (Lamiaceae) from the southeastern USA: evidence for hybridization in Pleistocene refugia? Systematic Botany 31:183-207.

Edwards SV. 2009. Is a new and general theory of molecular systematics emerging? Evolution 63:1-19.

Egan AN, Schlueter J, Spooner DM. 2012. Applications of next-generation sequencing in plant biology. American Journal of Botany 99:175-185.

Ekblom R, Galindo J. 2010. Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity 107:1-15.

Epling C, Stewart WS. 1939. A revision of Hedeoma with review of allied genera. Repertorium Specierum Novarum Regni Vegetabilis Beihefte 115:1-49.

Erdtman G. 1945. Pollen morphology and plant taxonomy, vol IV: Labiatae, Verbenaceae, and Avicenniaceae. Svensk Botanisk Tidskrift 39:279-285.

Felsenstein J. 2004. Inferring phylogenies. Sunderland (MA): Sinauer Associates.

Felsenstein J. 1985. Phylogenies and the comparative method. American Naturalist 125:1-15.

Felsenstein J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Systematic Zoology 27:401-410.

Fitch WM. 1970. Distinguishing homologous from analogous proteins. Systematic Zoology 19:99-113.

Folta KM, Clancy MA, Chamala S, Brunings AM, Dhingra A, Gomide L, Kulathinal RJ, Peres N, Davis TM, Barbazuk WB. 2010. A transcript accounting from diverse tissues of a cultivated strawberry. Plant Genetics 3:90-105.

Friesen VL, Congdon BC, Kidd MG, Birt TP. 1999. Polymerase chain reaction (PCR) primers for the amplification of five nuclear introns in vertebrates. Molecular Ecology 8:2141-2152.

Garber M, Grabherr MG, Guttman M, Trapnell C. 2011. Computational methods for transcriptome annotation and quantification using RNA-seq. Nature Methods 8:469-477.

Gayral P, Weinert L, Chiari Y, Tsagkogeorga G, Ballenghien M, Galtier N. 2011. Next-generation sequencing of transcriptomes: a guide to RNA isolation in nonmodel animals. Molecular Ecology Resources 11:650-661.

Gilbert MTP, Drautz DI, Lesk AM, Ho SYW, Qi J, Ratan A, Hsu C-H, Sher A, Dalén L, Götherström A. 2008. Intraspecific phylogenetic analysis of Siberian woolly mammoths using complete mitochondrial genomes. Proceedings of the National Academy of Sciences of the United States of America 105:8327-8332.

Givnish TJ, Ames M, McNeal JR, McKain MR, Steele PR, dePamphilis CW, Graham SW, Pires JC, Stevenson DW, Zomlefer WB, et al. 2010. Assembling the tree of the monocotyledons: plastome sequence phylogeny and evolution of Poales 1. Annals of the Missouri Botanical Garden 97:584-616.

Glenn TC. 2011. Field guide to next-generation DNA sequencers. Molecular Ecology Resources 11:759-769.

Gnibidenko ZN, Semakov NN. 2009. Paleomagnetism of boundary Oligocene-Miocene deposits in the Kompasskii Bor Tract on the Tym River (Western Siberia). Izvestiya, Physics of the Solid Earth 45:70-79.

Godden GT. 2009. Phylogenetic relationships in Poliomintha and related genera in the Mentheae (Lamiaceae) [thesis]. [East Lansing (MI)]: Michigan State University.

Godden GT, Jordon-Thaden IE, Chamala S, Crowl AA, García N, Germain-Aubrey CC, Heaney JM, Latvis M, Qi X, Gitzendanner MA. 2012. Making next-generation sequencing work for you: approaches and practical considerations for marker development and phylogenetics. Plant Ecology and Diversity 5:427-450.

Goff SA, Ricke D, Lan T-H, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Hadley D, et al. 2002. A Draft Sequence of the Rice Genome (Oryza sativa L. ssp. japonica). Science 296:92-100.

Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N, Rokhsar DS. 2012. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Research 40:D1178-D1186.

Gradstein FM, Ogg JG, Schmitz MD (Eds.). 2012. The Geologic Time Scale 2012. Elsevier, Boston (MA). doi: 10.1016/B978-0-444-59425-9.00004-4

Graham SW, Olmstead RG. 2000. Utility of 17 chloroplast genes for inferring the phylogeny of the basal angiosperms. American Journal of Botany 87:1712-1730.

Gray A. 1878. Synoptical flora of North America, Vol. 2. New York, New York.

Griffin PC, Robin C, Hoffmann AA. 2011. A next-generation sequencing method for overcoming the multiple gene copy problem in polyploid phylogenetics, applied to Poa grasses. BioMed Central Biology 9:19.

Gutierrez R. 2011. A phylogenetic study of the plant family Martyniaceae (order Lamiales) [dissertation]. [Tempe (AZ)]: Arizona State University.

Grover CE, Salmon A, Wendel JF. 2012. Targeted sequence capture as a powerful tool for evolutionary analysis. American Journal of Botany 99:312-319.

Haddock SHD, Dunn CW. 2011. Practical Computing for Biologists. Sinauer Associates: Sunderland, Massachusetts.

Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, Schork NJ, Murray SS, Topol EJ, Levy S, et al. 2009. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biology 10:R32.

Harley RM, Atkins S, Budantsev A, Cantino PD, Conn BJ, Grayer R, Harley MM, De Kok R, Krestovskaja T, Morales R, et al. 2004. Labiatae. In: Kubitzki K. The Families and Genera of Vascular Plants, vol. 7. Berlin: Springer Verlag. p. 167-275.

Harmon LJ, Weir JT, Brock CD, Glor RE, Challenger W. 2008. GEIGER: Investigating evolutionary radiations. Bioinformatics 24:129-131.

Harrison N, Kidner CA. 2011. Next-generation sequencing and systematics: What can a billion base pairs of DNA sequence data do for you? Taxon 60:1552-1566.

Hassan M, Lemaire C, Fauvelot C, Bonhomme F. 2002. Seventeen new exon-primed intron-crossing polymerase chain reaction amplifiable introns in fish. Molecular Ecology Notes 2:334-340.

Hedtke SM, Townsend TM, Hillis DM. 2006. Resolution of phylogenetic conflict in large data sets by increased taxon sampling. Systematic Biology 55:522-529.

Heled J, Drummond AJ. 2010. Bayesian inference of species trees from multilocus data. Molecular Biology and Evolution 27:570-580.

Hendy MD, Penny D. 1989. A framework for the quantitative study of evolutionary trees. Systematic Zoology 38:297-309.

Henry CS, Overbeek R, Xia F, Best AA, Glass E, Gilbert J, Larsen P, Edwards R, Disz T, Meyer F, Vonstein V, Dejongh M, Bartels D, Desai N, D’Souza M, Devoid S, Keegan KP, Olson R, Wilke A, Wilkening J, Stevens RL. 2011. Connecting genotype to phenotype in the era of high-throughput sequencing. Biochimica et Biophysica Acta 1810:967-977.

Hilu KW, Borsch T, Müller K, Soltis DE, Soltis PS, Savolainen V, Chase MW, Powell MP, Alice LA, Evans R, Sauquet H, et al. 2003. Angiosperm phylogeny based on matK sequence information. American Journal of Botany 90:1758-1776.

Hittinger CT, Johnston M, Tossberg JT, Rokas A. 2010. Leveraging skewed transcript abundance by RNA-Seq to increase the genomic depth of the tree of life. Proceedings of the National Academy of Sciences of the United States of America 107:1476-1481.

Hu TT, Pattyn P, Bakker EG, Cao J, Cheng J-F, Clark RM, Fahlgren N, Fawcett JA, Grimwood J, Gundlach H, et al. 2011. The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nature Genetics 43:476-481.

Hubbard TJP, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, et al. 2007. Ensembl 2007. Nucleic Acids Research 35:D610-D617.

Huelsenbeck JP, Hillis DM. 1993. Success of phylogenetic methods in the four-taxon case. Systematic Biology 42:247-264.

Huelsenbeck JP. 1991. When are fossils better than extant taxa in phylogenetic analysis? Systematic Biology 40:458-469.

Hughes CE, Eastwood RJ, Bailey CD. 2006. Review: From famine to feast? Selecting nuclear DNA sequence loci for plant species-level phylogeny reconstruction. Philosophical Transactions of the Royal Society B: Biological Sciences 361:211-225.

Huson DH, Rupp R, Scornavaca C. 2010. Phylogenetic networks. Concepts, algorithms and applications. New York (NY): Cambridge University Press.

Irving RS. 1980. The systematics of Hedeoma (Labiatae). Sida 8:218-295.

Irving RS. 1976. Chromosome numbers of Hedeoma (Labiatae) and related genera. Systematic Botany 1:46-56.

Irving RS. 1972. A revision of the genus Poliomintha (Labiatae). Sida 5:8-22.

Irving RS, Brenholts HS, Irving DD. 1979. Artificial hybridization in Hedeoma (Labiatae). Systematic Botany 4:1-15.

Ishikawa H, Watano Y, Kano K, Ito M, Kurita S. 2002. Development of primer sets for PCR amplification of the PgiC gene in ferns. Journal of Plant Research 115:65-70.

Jakob SS, Blattner FR. 2006. A chloroplast genealogy of Hordeum (Poaceae): Long-term persisting haplotypes, incomplete lineage sorting, regional extinction, and the consequences for phylogenetic inference. Molecular Biology and Evolution 23:1602-1612.

Jamzad Z, Chase MW, Ingrouille M, Simmonds MSJ, Jalili A. 2003. Phylogenetic relationships in Nepeta L. (Lamiaceae) and related genera based on ITS sequence data. Taxon 52:21-32.

Jansen RK, Raubeson LA, Boore JL, dePamphilis CW, Chumley TW, Haberle RC, Wyman SK, Alverson AJ, Peery R, Herman SJ, et al. 2005. Methods for obtaining and analyzing whole chloroplast genome sequences. Methods in Enzymology 395:348-384.

Jansen RK, Ruhlman TA. 2012. Plastid genomes of seed plants. In: Bock R, Knoop V, editors. Genomics of Chloroplasts and Mitochondria. New York (NY): Springer. p. 103-126.

Janssens SB, Knox EB, Huysmans S, Smets EF, Merckx VSTF. 2009. Rapid radiation of Impatiens (Balsaminaceae) during Pliocene and Pleistocene: Result of global climate change. Molecular Phylogenetics and Evolution 52:806-824.

Jensen LJ, Julien P, Kuhn M, von Mering C, Muller J, Doerks T, Bork P. 2008. eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Research 36:D250-D254.

Jian S, Soltis PS, Gitzendanner MA, Moore MJ, Li R, Hendry TA, Qiu Y-L, Dhingra A, Bell CD, Soltis DE. 2008. Resolving an ancient, rapid radiation in Saxifragales. Systematic Biology 57:38-57.

Jobson RW, Playford J, Cameron KM, Albert VA. 2003. Molecular phylogenetics of Lentibulariaceae inferred from plastid rps16 intron and trnL-F DNA sequences: Implications for character evolution and biogeography. Systematic Botany 28:157-171.

Johnson MTJ, Carpenter EJ, Tian Z, Bruskiewich R, Burris JN, Carrigan CT, Chase MW, Clarke ND, Covshoff S, dePamphilis CW, et al. 2012. Evaluating methods for isolating total RNA and predicting the success of sequencing phylogenetically diverse plant transcriptomes. PLoS One 7:e50226.

Junell S. 1934. Zur gynäceummorphologie und systematik der Verbenaceen und Labiaten nebst bemerkungen über ihre samenentwicklung [dissertation]. [Uppsala (Sweden)]: Uppsala University.

de Jussieu AL. 1789. Genera plantarum. Paris: Herrisant.

Kane NC, Barker MS, Zhan SH, Rieseberg LH. 2011. Molecular evolution across the Asteraceae: micro- and macroevolutionary processes. Molecular Biology and Evolution 28:3225-3235.

Kane NC, Sveinsson S, Dempewolf H, Yang JY, Zhang D, Engels JMM, Cronk Q. 2012. Ultra-barcoding in cacao (Theobroma spp.; Malvaceae) using whole chloroplast genomes and nuclear ribosomal DNA. American Journal of Botany 99:320-329.

Kar PK. 1996. On the Indian origin of Ocimum (Lamiaceae): A palynological approach. Palaeobotanist 43:45-50.

Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution 30:772-780.

Katsiotis A, Nikoloudakis N, Linos A, Drossou A, Constantinidis T. 2009. Phylogenetic relationships in Origanum spp. based on rDNA sequences and intra-genetic variation of Greek O. vulgare subsp. hirtum revealed by RAPD. Scientia Horticulturae 121:103-108.

Kenny EM, Cormican P, Gilks WP, Gates AS, O'Dushlaine CT, Pinto C, Corvin AP, Gill M, Morris DW. 2011. Multiplex target enrichment using DNA indexing for ultra-high throughput SNP detection. DNA Research 18:31-38.

Koch MA, Kiefer M, German D, Al-Shehbaz IA, Franzke A, Mummenhoff K, Schmickl R. 2012. BrassiBase: Tools and biological resources to study characters and traits in the Brassicaceae. Version 1.1. Taxon 61:1001-1009.

Kohn MH, Murphy WJ, Ostrander EA, Wayne RK. 2006. Genomics and conservation genetics. Trends in Ecology and Evolution 21:629-637.

Kolaczkowski B, Thornton JW. 2009. Long-branch attraction bias and inconsistency in Bayesian phylogenetics. PLoS One 4:e7891.

Kornhall P, Bremer B. 2004. New circumscription of the tribe Limoselleae (Scrophulariaceae) that includes the taxa of the tribe Manuleeae. Botanical Journal of the Linnean Society 146:453-467.

Kornhall P, Heidari N, Bremer B. 2001. Selagineae and Manuleeae, two tribes or one? Phylogenetic studies in the Scrophulariaceae. Plant Systematics and Evolution 228:199-218.

Kubatko LS, Degnan JH. 2007. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Systematic Biology 56:17-24.

Lai Z, Kane NC, Kozik A, Hodgins KA, Dlugosch KM, Barker MS, Matvienko M, Yu Q, Turner KG, Pearl SA, et al. 2012. Genomics of Compositae weeds: EST libraries, microarrays, and evidence of introgression. American Journal of Botany 99:209-218.

Larget BR, Kotha SK, Dewey CN, Ané C. 2010. BUCKy: gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics 26:2910-2911.

Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ. 2011. Proteinortho: detection of (co-)orthologs in large-scale analysis. BioMed Central Bioinformatics 12:124.

Lee EK, Cibrian-Jaramillo A, Kolokotronis SO, Katari MS, Stamatakis A, Ott M, Chiu JC, Little DP, Stevenson DW, McCombie WR, et al. 2011. A functional phylogenetic view of the seed plants. PLoS Genetics 7:e1002411.

Lemmon AR, Brown JM, Stanger-Hall K, Lemmon EM. 2009. The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Systematic Biology 58:130-145.

Levin JZ, Yassour M, Adiconis X, Nusbaum C, Thompson DA, Friedman N, Gnirke A, Regev A. 2010. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nature Methods 7:709-715.

Lewitter F, Bourne PE. 2011. Teaching bioinformatics at the secondary school level. PLoS Computational Biology 7:e1002242.

Li B, Xu W, Tu T, Wang Z, Olmstead RG, Peng H, Francisco-Ortega J, Cantino PD, Zhang D. 2012. Phylogenetic position of Wenchengia (Lamiaceae): a taxonomically enigmatic and critically endangered genus. Taxon 61:392-401.

Li H, Coghlan A, Ruan J, Coin LJ, Hériché J-K, Osmotherly L, Li R, Liu T, Zhang Z, Bolund L, et al. 2006. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Research 34:D572-D580.

Li L, Stoeckert CJ Jr, Roos DS. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research 13:2178-2189.

Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K. 2010. De novo assembly of human genomes with massively parallel short read sequencing. Genome Research 20:265-272.

Linder CR, Rieseberg LH. 2004. Reconstructing patterns of reticulate evolution in plants. American Journal of Botany 91:1700-1708.

Lister DL, Bower MA, Howe CJ, Jones MK. 2008. Extraction and amplification of nuclear DNA from herbarium specimens of emmer wheat: a method for assessing DNA preservation by maximum amplicon length recovery. Taxon 57:254-258.

Liston A. 2012. Simultaneous sequencing of hundreds of nuclear loci for phylogenetic analyses across Rosaceae: A next generation sequencing approach. Paper presented at: Botany 2012 – The Next Generation. Annual meeting of the Botanical Society of America. Columbus, Ohio, USA.

Liu K, Raghavan S, Nelesen S, Linder CR, Warnow T. 2009. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324:1561-1564.

Liu L, Yu L, Edwards SV. 2010. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evolutionary Biology 10:302.

Liu L, Yu L, Kubatko L, Pearl DK, Edwards SV. 2009. Coalescent methods for estimating phylogenetic trees. Molecular Phylogenetics and Evolution 53:320-328.

Liu L, Pearl DK, Brumfield RT, Edwards SV. 2008. Estimating species trees using multiple-allele DNA sequence data. Evolution 62:2080-2091.

Liu L, Pearl DK. 2007. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Systematic Biology 56:504-514.

Longo MS, O'Neill MJ, O'Neill RJ. 2011. Abundant human DNA contamination identified in non-primate genome databases. PLoS ONE 6:e16410. doi:10.1371/journal.pone.0016410.

Loockerman DJ, Jansen RK. 1996. The use of herbarium material for DNA studies. In: Stuessy TF, Sohmer SH, editors. Sampling the green world. New York (NY): Columbia University Press. p. 205-220.

Lukas B, Novak J. 2013. The complete chloroplast genome of Origanum vulgare L. (Lamiaceae). Gene 528:163-169.

Lyons E, Freeling M. 2008. How to usefully compare homologous plant genes and chromosomes as DNA sequences. The Plant Journal 53:661-673.

Maddison WP, Maddison DR. 2014. Mesquite: A modular system for evolutionary analysis. Version 2.75. http://mesquiteproject.org

Maddison WP, Knowles LL. 2006. Inferring phylogeny despite incomplete lineage sorting. Systematic Biology 55:21-30.

Magallón S, Castillo A. 2009. Angiosperm diversification through time. American Journal of Botany 96:349-365.

Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, Howard E, Shendure J, Turner DJ. 2009. Target-enrichment strategies for next generation sequencing. Nature Methods 7:111-118.

Martin JA, Wang Z. 2011. Next-generation transcriptome assembly. Nature Reviews Genetics 12:671-682.

Martínez-Alcántara A, Ballesteros E, Feng C, Rojas M, Koshinsky H, Fofanov VY, Havlak P, Fofanov Y. 2009. PIQA: pipeline for Illumina G1 genome analyzer data quality assessment. Bioinformatics 25:2438-2439.

Martinez M. 2011. Plant protein-coding gene families: emerging bioinformatics approaches. Trends in Plant Science 10:558-567.

Martínez-Millán M. 2010. Fossil record and age of the Asteridae. Botanical Review 76:83-135.

Marx HE, O’Leary N, Yuan Y-W, Lu-Irving P, Tank DC, Múlgura M, Olmstead RG. 2010. A molecular phylogeny and classification of Verbenaceae. American Journal of Botany 97:1647-1663.

Mathiesen C, Scheen A-C, Lindqvist C. 2010. Phylogeny and biogeography of the lamioid genus Phlomis (Lamiaceae). Kew Bulletin 66:83-99.

Maughan PJ, Smith SM, Fairbanks DJ, Jellen EN. 2011. Development, characterization, and linkage mapping of single nucleotide polymorphisms in the grain Amaranths (Amaranthus sp.). The Plant Genome Journal 4:92.

McCormack JE, Hird SM, Zellmer AJ, Carstens BC, Brumfield RT. 2012. Applications of next-generation sequencing to phylogeography and phylogenetics. Molecular Phylogenetics and Evolution 66:526-538.

McDade LA, Daniel TF, Kiel CA, Borg AJ. 2013. Phylogenetic placement, delimitation, and relationships among genera of the enigmatic Nelsonioideae (Lamiales: Acanthaceae). Taxon 61:637-651.

McDade LA, Daniel TF, Kiel CA. 2008. Toward a comprehensive understanding of phylogenetic relationships among lineages of Acanthaceae s.l. (Lamiales). American Journal of Botany 95:1136-1152.

McDade LA, Masta SE, Moody ML, Waters E. 2000. Phylogenetic relationships among Acanthaceae: evidence from two genomes. Systematic Botany 25:106-121.

McDade LA, Moody ML. 1999. Phylogenetic relationships among Acanthaceae: evidence from noncoding trnL-trnF chloroplast DNA sequences. American Journal of Botany 86:70-80.

Meimberg H, Abele T, Bräuchler C, McKay JK, Pérez de Paz PL, Heubl G. 2006. Molecular evidence for adaptive radiation of Micromeria Benth. (Lamiaceae) on the Canary Islands as inferred from chloroplast and nuclear DNA sequences and ISSR fingerprint data. Molecular Phylogenetics and Evolution 41:566-578.

Metzker ML. 2010. Sequencing technologies - the next generation. Nature Reviews Genetics 11:31-46.

Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. 2014. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics 30:i541-i548.

Moon H-K, Smets E, Huysmans S. 2010. Phylogeny of tribe Mentheae (Lamiaceae): the story of molecules and micromorphological characters. Taxon 59:1065-1076.

Moore MJ, Bell CD, Soltis PS, Soltis DE. 2007. Using plastid genome-scale data to resolve enigmatic relationships among basal angiosperms. Proceedings of the National Academy of Sciences of the United States of America 104:19363-19368.

Moore MJ, Dhingra A, Soltis PS, Shaw R, Farmerie W, Folta K, Soltis DE. 2006. Rapid and accurate pyrosequencing of angiosperm plastid genomes. BioMed Central Plant Biology 6:17.

Moore MJ, Hassan N, Gitzendanner MA, Bruenn RA, Croley M, Vandeventer A, Horn JW, Dhingra A, Brockington SF, Latvis M, et al. 2011. Phylogenetic analysis of the plastid inverted repeat for 244 species: insights into deeper-level angiosperm relationships from a long, slowly evolving sequence region. International Journal of Plant Sciences 172:541-558.

Moore MJ, Soltis PS, Bell CD, Burleigh JG, Soltis DE. 2010. Phylogenetic analysis of 83 plastid genes further resolves the early diversification of eudicots. Proceedings of the National Academy of Sciences of the United States of America 107:4623-4628.

Mort ME, Crawford DJ. 2004. The continuing search: low-copy nuclear sequences for lower-level plant molecular phylogenetic studies. Taxon 53:257-261.

Müller K, Borsch T, Legendre L, Porembski S, Thiesen I, Barthlott W. 2004. Evolution of carnivory in Lentibulariaceae and the Lamiales. Plant Biology 6:477-490.

Ness RW, Graham SW, Barrett SCH. 2011. Reconciling gene and genome duplication events: Using multiple nuclear gene families to infer the phylogeny of the aquatic plant family Pontederiaceae. Molecular Biology and Evolution 28:3009-3018.

Niu B, Fu L, Sun S, Li W. 2010. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics 11:187.

Oliveira LO, Huck RB, Gitzendanner MA, Judd WS, Soltis DE, Soltis PS. 2007. Molecular phylogeny, biogeography, and systematics of Dicerandra (Lamiaceae), a genus endemic to the southeastern United States. American Journal of Botany 94:1017-1027.

Olmstead RG, Bremer B, Scott KM, Palmer JD. 1993. A parsimony analysis of the Asteridae sensu lato based on rbcL sequences. Annals of the Missouri Botanical Garden 80:700-722.

Olmstead RG, dePamphilis CW, Wolfe AD, Young ND, Elisens WJ, Reeves PA. 2001. Disintegration of the Scrophulariaceae. American Journal of Botany 88:348-361.

Olmstead RG, Kim KJ, Jansen RK, Wagstaff SJ. 2000. The phylogeny of the Asteridae sensu lato based on chloroplast ndhF gene sequences. Molecular Phylogenetics and Evolution 16:96-112.

Olmstead RG, Michaels HJ, Scott K, Palmer JD. 1992. Monophyly of the Asteridae and identification of their major lineages inferred from DNA sequences of rbcL. Annals of the Missouri Botanical Garden 79:249-265.

Olmstead RG, Zjhra ML, Lohmann LG, Grose SO, Eckert AJ. 2009. A molecular phylogeny and classification of Bignoniaceae. American Journal of Botany 96:1731-1743.

Oxelman B, Kornhall P, Olmstead RG, Bremer B. 2005. Further disintegration of the Scrophulariaceae. Taxon 54:411-425.

Palmer JD, Herbon LA. 1988. Plant mitochondrial DNA evolved rapidly in structure, but slowly in sequence. Journal of Molecular Evolution 28:87-97.

Palmer JD, Zamir D. 1982. Chloroplast DNA evolution and phylogenetic relationships in Lycopersicon. Proceedings of the National Academy of Sciences of the United States of America 79:5006-5010.

Palumbi SR. 1998. Nucleic acids II: the polymerase chain reaction. In: Hillis DM, Moritz C, Mable BK, editors. Molecular Systematics. 2nd ed. Sunderland (MA): Sinauer Associates. p. 205-247.

Paradis E, Bolker B, Claude J, Cuong HS, Desper R, Durand B, Dutheil J, Gascuel O, Heibl C, Lawson D, Lefort V, Legendre P, Lemon J, Noel Y, Nylander J, Opgen-Rhein R, Schliep K, Strimmer K, de Vienne D. 2011. Package ‘ape’. Available at: http://cran.r-project.org/web/packages/ape/ape.pdf

Parchman TL, Gompert Z, Mudge J, Schilkey FD, Benkman CW, Buerkle CA. 2012. Genome-wide association genetics of an adaptive trait in lodgepole pine. Molecular Ecology 21:2991-3005.

Parks M, Cronn R, Liston A. 2009. Increasing phylogenetic resolution at low taxonomic levels using massively parallel sequencing of chloroplast genomes. BioMed Central Biology 7:84.

Parr C, Guralnick R, Cellinese N. 2012. Evolutionary informatics: unifying knowledge about the diversity of life. Trends in Ecology and Evolution 27:94-103.

Pastore JFB, Harley RM, Forest F, Paton A, van den Berg C. 2011. Phylogeny of the subtribe Hyptidinae (Lamiaceae tribe Ocimeae) as inferred from nuclear and plastid DNA. Taxon 60:1317-1329.

Paton AJ, Springate D, Suddee S, Otieno D, Grayer RJ, Harley MM, Willis F, Simmonds MSJ, Powell MP, Savolainen V. 2004. Phylogeny and evolution of basils and allies (Ocimeae, Labiatae) based on three plastid DNA regions. Molecular Phylogenetics and Evolution 31:277-299.

Paux E, Sourdille P, Mackay I. 2011. Sequence-based marker development in wheat: Advances and applications to breeding. Biotechnology Advances, doi:10.1016/j.biotechadv.2011.09.015.

Perret MA, Chautems A, Onofre de Araujo A, Salamin N. 2013. Temporal and spatial origin of Gesneriaceae in the New World inferred from plastid DNA sequences. Botanical Journal of the Linnean Society 171:61-79.

Philippe H, Snell EA, Bapteste E, Lopez P, Holland PW, Casane D. 2004. Phylogenomics of eukaryotes: impact of missing data on large alignments. Molecular Biology and Evolution 21:1740-1752.

Pina C, Pinto F, Feijó JA, Becker JD. 2005. Gene family analysis of the Arabidopsis pollen transcriptome reveals biological implications for cell growth, division control, and gene expression regulation. Plant Physiology 138:744-756.

Prather LA, Monfils AK, Posto AL, Williams RA. 2002. Monophyly and phylogeny of Monarda (Lamiaceae): evidence from the internal transcribed spacer (ITS) region of nuclear ribosomal DNA. Systematic Botany 27:127-137.

Qiu Y-L, Dombrovska O, Lee J, Li L, Whitlock BA, Bernasconi-Quadroni F, Rest JS, Davis CC, Borsch T, Hilu KW, et al. 2005. Phylogenetic analyses of basal angiosperms based on nine plastid, mitochondrial, and nuclear genes. International Journal of Plant Sciences 166:815-842.

Qiu Y-L, Li L, Wang B, Xue JY, Hendry TA, Li RQ, Brown JW, Liu Y, Hudson GT, Chen ZD. 2010. Angiosperm phylogeny inferred from sequences of four mitochondrial genes. Journal of Systematics and Evolution 48:391-425.

Rahmanzadeh RK, Müller K, Fischer E, Bartels D, Borsch T. 2005. The Linderniaceae and Gratiolaceae are further lineages distinct from Scrophulariaceae (Lamiales). Plant Biology 7:67-78.

Rambaut A. 2002. Se-Al. Sequence Alignment Editor. Department of Zoology, University of Oxford.

Refulio-Rodriguez NF, Olmstead RG. 2014. Phylogeny of Lamiidae. American Journal of Botany 101:287-299.

Reid EM, Chandler MEJ. 1926. The Bembridge flora. Catalogue of Cainozoic plants in the Department of Geology, vol. 1. British Museum (Natural History), London, UK.

Remm M, Storm C, Sonnhammer E. 2001. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. Journal of Molecular Biology 314:1041-1052.

Renne PR, Deino AL, Hilgen FJ, Kuiper KF, Mark DF, Mitchell III WS, Morgan LE, Mundil R, Smit J. 2013. Time scales of critical events around the Cretaceous-Paleogene boundary. Science 339:684-687.

Renny-Byfield S, Chester M, Kovarik A, Le Comber SC, Grandbastien M-A, Deloger M, Nichols R, Macas J, Novák P, Chase MW, Leitch AR. 2011. Next generation sequencing reveals genome downsizing in allotetraploid Nicotiana tabacum, predominantly through the elimination of paternally derived repetitive DNAs. Molecular Biology and Evolution 28:2843-2854.

Reveal JL. 2011. Summary of recent systems of angiosperm classification. Kew Bulletin 66:5-48.

Rhie A, Yang S, Lee K-E, Thong CT, Park H-S. 2010. genominfo.org : A simple GUI-based sequencing format conversion tool for the three NGS platforms. Genomics & Informatics 8:97-99.

Rokas A, Carroll SB. 2006. Bushes in the Tree of Life. PLoS Biology 4:e352.

Rouard M, Guignon V, Aluome C, Laporte MA, Droc G, Walde C, Zmasek CM, Périn C, Conte MG. 2011. GreenPhylDB v2.0: comparative and functional genomics in plants. Nucleic Acids Research 39(Suppl. 1):D1095-D1102.

Roure B, Baurain D, Philippe H. 2012. Impact of missing data on phylogenies inferred from empirical phylogenomic data sets. Molecular Biology and Evolution 30:197-214.

Rozen S, Skaletsky HJ. 2000. Primer3 on the WWW for general users and for biologist programmers. In: Krawetz S, Misener S, editors. Bioinformatics Methods and Protocols: Methods in Molecular Biology. Totowa (NJ): Humana Press. p. 365-386.

Ruan J, Li H, Chen Z, Coghlan A, Coin LJ, Guo Y, Hériché JK, Hu Y, Kristiansen K, Li R, Liu T, et al. 2008. TreeFam: 2008 update. Nucleic Acids Research 36:D735-D740.

Salmaki Y, Zarre S, Ryding O, Lindqvist C, Bräuchler C, Heubl G, Barber J, Bendiksby M. 2013. Molecular phylogeny of tribe Stachydeae (Lamiaceae subfamily Lamioideae). Molecular Phylogenetics and Evolution 69:535-551.

Sanderson MJ. 2002. Estimating absolute rates of molecular evolution and divergence times: a penalized likelihood approach. Molecular Biology and Evolution 19:101-109.

Sanderson MJ, Shaffer HB. 2002. Troubleshooting molecular phylogenetic analyses. Annual Review of Ecology and Systematics 33:49-72.

Sang T. 2002. Utility of low-copy nuclear gene sequences in plant phylogenetics. Critical Reviews in Biochemistry and Molecular Biology 37:121-147.

Savolainen V, Fay MF, Albach DC, Backlund A, Van der Bank M, Cameron KM, Johnson LA, Lledó MD, Pintaud J-C, Powell M, Sheahan MC, et al. 2000. Phylogeny of the eudicots: a nearly complete familial analysis based on rbcL gene sequences. Kew Bulletin 55:257-309.

Schäferhoff B, Fleischmann A, Fischer E, Albach DC, Borsch T, Heubl G, Müller KF. 2010. Towards resolving Lamiales relationships: insights from rapidly evolving chloroplast sequences. BMC Evolutionary Biology 10:352.

Scheen A-C, Albert VA. 2009. Molecular phylogenetics of the Leucas group (Lamioideae; Lamiaceae). Systematic Botany 34:173-181.

Scheen A-C, Bendiksby M, Ryding O, Mathiesen C, Albert VA, Lindqvist C. 2010. Molecular phylogenetics, character evolution, and suprageneric classification of Lamioideae (Lamiaceae). Annals of the Missouri Botanical Garden 97:191-219.

Scheen A-C, Lindqvist C, Fossdal CG, Albert VA. 2008. Molecular phylogenetics of tribe Synandreae, a North American lineage of lamioid mints (Lamiaceae). Cladistics 24:299-314.

Schneider A, Dessimoz C, Gonnet GH. 2007. OMA Browser-exploring orthologous relations across 352 complete genomes. Bioinformatics 23:2180-2182.

Schmidt-Lebuhn AN. 2008. Monophyly and phylogenetic relationships of Minthostachys (Labiatae, Nepetoideae) examined using morphological and nrITS data. Plant Systematics and Evolution 270:25-38.

Scotland RW, Sweere JA, Reeves PA, Olmstead RG. 1995. Higher-level systematics of Acanthaceae determined by chloroplast DNA sequences. American Journal of Botany 82:266-275.

Shaw J, Lickey EB, Beck JT, Farmer SB, Liu W, Miller J, Siripun KC, Winder CT, Schilling EE, Small RL. 2005. The tortoise and the hare II: relative utility of 21 noncoding chloroplast DNA sequences for phylogenetic analysis. American Journal of Botany 92:142-166.

Shaw J, Lickey EB, Schilling EE, Small RL. 2007. Comparison of whole chloroplast genome sequences to choose noncoding regions for phylogenetic studies in angiosperms: the tortoise and the hare III. American Journal of Botany 94:275-288.

Shi G, Peng M, Jiang T. 2010. Accurate identification of ortholog groups among multiple genomes. In: Proceeding LSS Comput Syst Bioinform Conference Stanford, CA. Vol. 2. Stanford (CA): Life Sciences Society. p. 166-179.

Shimodaira H, Hasegawa M. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molecular Biology and Evolution 16:1114-1116.

Shinozaki K, Ohme M, Tanaka M, Wakasugi T, Hayashida N, Matsubayashi T, Zaita N, Chunwongse J, Obokata J, Yamaguchi-Shinozaki K, et al. 1986. The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression. The EMBO Journal 5:2043-2049.

Slade RW, Moritz C, Heideman A. 1994. Multiple nuclear-gene phylogenies: Application to pinnipeds and comparison with a mitochondrial DNA gene phylogeny. Molecular Biology and Evolution 11:341-356.

Slowinski JB, Page RDM. 1999. How should species phylogenies be inferred from sequence data? Systematic Biology 48:814-825.

Small RL, Cronn RC, Wendel JF. 2004. Use of nuclear genes for phylogeny reconstruction in plants. Australian Systematic Botany 17:145-170.

Small RL, Ryburn JA, Cronn RC, Seelanan T, Wendel JF. 1998. The tortoise and the hare: choosing between noncoding plastome and nuclear Adh sequences for phylogeny reconstruction in a recently diverged plant group. American Journal of Botany 85:1301-1315.

Smith JF, Wolfram JC, Brown KD, Carroll CL, Denton DS. 1997. Tribal relationships in the Gesneriaceae: Evidence from DNA sequences of the chloroplast gene ndhF. Annals of the Missouri Botanical Garden 84:50-66.

Smith SA, O’Meara BC. 2012. treePL: divergence time estimation using penalized likelihood for large phylogenies. Bioinformatics 28:2689-2690.

Smith SA, Wilson NG, Goetz FE, Feehery C, Andrade SCS, Rouse GW, Giribet G, Dunn C. 2011. Resolving the evolutionary relationships of mollusks with phylogenomic tools. Nature 480:364-367.

Soltis DE, Smith SA, Cellinese NA, Wurdack KJ, Tank DC, Brockington SF, Refulio-Rodriguez NF, Walker JB, Moore MJ, Carlsward BS, et al. 2011. Angiosperm phylogeny: 17 genes, 640 taxa. American Journal of Botany 98:704-730.

Soltis DE, Moore MJ, Burleigh G, Soltis PS. 2009. Molecular markers and concepts of plant evolutionary relationships: progress, promise and future prospects. Critical Reviews in Plant Science 28:1-15.

Soltis DE, Soltis PS, Endress PK, Chase MW. 2005. Phylogeny and evolution of angiosperms. Sunderland (MA): Sinauer Associates.

Soltis PS, Soltis DE, Chase MW, Endress PK, Crane PR. 2004. The diversification of flowering plants. In: Cracraft J, Donoghue M, editors. The Tree of Life. New York (NY): Oxford University Press. p. 154-167.

Soltis DE, Soltis PS, Chase MW, Mort ME, Albach DC, Zanis M, Savolainen V, Hahn WH, Hoot SB, Fay MF, et al. 2000. Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences. Botanical Journal of the Linnean Society 133:381-461.

Sonnhammer ELL, Koonin EV. 2002. Orthology, paralogy and proposed classification for paralog subtypes. Trends in Genetics 18:619-620.

Spangler RE, Olmstead RG. 1999. Phylogenetic analysis of the Bignoniaceae based on the cpDNA gene sequences rbcL and ndhF. Annals of the Missouri Botanical Garden 84:1-49.

Staats M, Cuenca A, Richardson JE, van Ginkel RV, Petersen G, Seberg O, Bakker FT. 2011. DNA damage in plant herbarium tissue. PLoS ONE 6:e28448.

Stamatakis A. 2014. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. In: Bioinformatics. Open access link: http://bioinformatics.oxfordjournals.org/content/early/2014/01/21/bioinformatics.btu033.abstract?keytype=ref&ijkey=VTEqgUJYCDcf0kP

Stamatakis A. 2006. Phylogenetic models of rate heterogeneity: a high performance computing perspective. In: Parallel and Distributed Processing Symposium, IPDPS 2006, 20th International. doi: 10.1109/IPDPS.2006.1639535

Stapley J, Reger J, Feulner PGD, Smadja C, Galindo J, Ekblom R, Bennison C, Ball AD, Beckerman AP, Slate J. 2010. Adaptation genomics: The next generation. Trends in Ecology and Evolution 12:702-712.

Strasburg JL, Sherman NA, Wright KM, Moyle LC, Willis JH, Rieseberg LH. 2012. What can patterns of differentiation across plant genomes tell us about adaptation and speciation? Philosophical Transactions of the Royal Society B: Biological Sciences 367:364-373.

Steane DA, de Kok RPJ, Olmstead RG. 2004. Phylogenetic relationships between Clerodendrum (Lamiaceae) and other Ajugoid genera. Molecular Phylogenetics and Evolution 32:39-45.

Steane DA, Scotland RW, Mabberley DJ, Olmstead RG. 1999. Molecular systematics of Clerodendrum (Lamiaceae): ITS sequences and total evidence. American Journal of Botany 86:98-107.

Steele PR, Hertweck KL, Mayfield D, McKain MR, Leebens-Mack J, Pires JC. 2012. Quality and quantity of data recovered from massively parallel sequencing: examples in Asparagales and Poaceae. American Journal of Botany 99:330-348.

Steele PR, Pires JC. 2011. Biodiversity assessment: State-of-the-art techniques in phylogenomics and species identification. American Journal of Botany 98:415-425.

Stiller M, Knapp M, Stenzel U, Hofreiter M, Meyer M. 2009. Direct multiplex sequencing (DMPS)--a novel method for targeted high-throughput sequencing of ancient and highly degraded DNA. Genome Research 19:1843-1848.

Straub S, Fishbein M, Livshultz T, Foster Z, Parks M, Weitemier K, Cronn R, Liston A. 2011. Building a model: developing genomic resources for common milkweed (Asclepias syriaca) with low coverage genome sequencing. BioMed Central Genomics 12:211.

Straub SC, Parks M, Weitemier K, Fishbein M, Cronn RC, Liston A. 2012. Navigating the tip of the genomic iceberg: Next-generation sequencing for plant systematics. American Journal of Botany 99:349-364.

Strickler SR, Bombarely A, Mueller LA. 2012. Designing a transcriptome next-generation sequencing project for a non-model plant species. American Journal of Botany 99:257-266.

Stull GW, Duno de Stefano R, Manchester SR, Soltis DE, Soltis PS. 2012. Resolving basal lamiid phylogenetic relationships and the placement of Icacinaceae with next-generation sequence data. Paper presented at: Botany 2012 – The Next Generation. Annual meeting of the Botanical Society of America. Columbus, Ohio, USA.

Stull GW, Moore MJ, Mandala VS, Douglas NA, Kates HR, Qi X, Brockington SF, Soltis PS, Soltis DE, Gitzendanner MA. 2013. A targeted enrichment strategy for massively parallel sequencing of angiosperm plastid genomes. Applications in Plant Sciences 1:1-7.

Stull GW, Refulio-Rodriguez NF, Olmstead RG, Duno de Stefano R, Soltis DE, Soltis PS. 2014. Resolving enigmatic basal lamiid relationships with a plastome-scale data set. Paper presented at: Botany 2014 – New Frontiers in Botany. Annual meeting of the Botanical Society of America. Boise, Idaho, USA.

Surget-Groba Y, Montoya-Burgos JI. 2010. Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Research 20:1432-1440.

Swofford DL. 2002. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sunderland (MA): Sinauer Associates.

Tatusov RL, Galperin MY, Natale DA, Koonin EV. 2000. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research 28:33-36.

Tatusov RL, Koonin EV, Lipman DJ. 1997. A genomic perspective on protein families. Science 278:631-637.

The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant: Arabidopsis thaliana. Nature 408:796-815.

Timme RE, Bachvaroff TR, Delwiche CF. 2012. Broad phylogenomic sampling and the sister lineage of land plants. PLoS One 7:e29696.

Timmis JN, Ayliffe MA, Huang CY, Martin W. 2004. Endosymbiotic gene transfer: organelle genomes forge eukaryotic chromosomes. Nature Reviews Genetics 5:123-135.

Touriya A, Rami M, Cattaneo-Berrebi G, Ibanez C, Augros S, Boissin E, Dakkak A, Berrebi P. 2003. Primers for EPIC amplification of intron sequences for fish and other vertebrate population genetic studies. Biotechniques 35:676-678.

Trusty JL, Olmstead RG, Bogler DJ, Santos-Guerra A, Francisco-Ortega J. 2004. Using molecular data to test a biogeographic connection of the Macaronesian genus Bystropogon (Lamiaceae) to the New World: a case of conflicting phylogenies. Systematic Botany 29:702-715.

Trusty JL, Olmstead RG, Santos-Guerra A, Sá-Fontinha S, Francisco-Ortega J. 2005. Molecular phylogenetics of the Macaronesian-endemic genus Bystropogon (Lamiaceae): palaeo-islands, ecological shifts and interisland colonizations. Molecular Ecology 14:1177-1189.

Turner TL, Bourne EC, Von Wettberg EJ, Hu TT, Nuzhdin SV. 2010. Population resequencing reveals local adaptation of Arabidopsis lyrata to serpentine soils. Nature Genetics 42:260-263.

Twyford AD, Ennos RA. 2012. Next-generation hybridization and introgression. Heredity 108:179-189.

Vaidya G, Lohman DJ, Meier R. 2011. SequenceMatrix: concatenation software for the fast assembly of multi-gene datasets with character set and codon information. Cladistics 27:171-180.

Van Bel M, Proost S, Wischnitzki E, Movahedi S, Scheerlinck C, Van de Peer Y, Vandepoele K. 2012. Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiology 158:590-600.

Walker JB, Sytsma KJ. 2007. Staminal evolution in the genus Salvia (Lamiaceae): molecular phylogenetic evidence for multiple origins of the staminal lever. Annals of Botany 100:375-391.

Walker JB, Sytsma KJ, Treutlein J, Wink M. 2004. Salvia (Lamiaceae) is not monophyletic: implications for the systematics, radiation, and ecological specializations of Salvia and tribe Mentheae. American Journal of Botany 91:1115-1125.

Wagstaff SJ, Hickerson L, Spangler R, Reeves PA, Olmstead RG. 1998. Phylogeny in Labiatae s. l., inferred from cpDNA sequences. Plant Systematics and Evolution 209:265-274.

Wagstaff SJ, Olmstead RG. 1997. Phylogeny of the Labiatae and Verbenaceae inferred from rbcL sequences. Systematic Botany 22:165-179.

Wagstaff SJ, Olmstead RG, Cantino PD. 1995. Parsimony analysis of cpDNA restriction site variation in subfamily Nepetoideae (Labiatae). American Journal of Botany 82:886-892.

Wallander E, Albert VA. 2000. Phylogeny and classification of Oleaceae based on rps16 and trnL-F sequence data. American Journal of Botany 87:1827-1841.

Wang L-S, Leebens-Mack J, Wall PK, Beckmann K, dePamphilis CW, Warnow T. 2011. The impact of multiple protein sequence alignment on phylogenetic estimation. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8:1108-1119.

Ward JA, Ponnala L, Weber CA. 2012. Strategies for transcriptome analysis in non-model plants. American Journal of Botany 99:267-276.

Wiens JJ. 2006. Missing data and the design of phylogenetic analyses. Journal of Biomedical Informatics 39:34-42.

Wiens JJ. 2003. Missing data, incomplete taxa, and phylogenetic accuracy. Systematic Biology 52:528-538.

Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S. 2007. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 35(Database):D5-D12.

Whittall JB, Syring J, Parks M, Buenrostro J, Dick C, Liston A, Cronn R. 2010. Finding a (pine) needle in a haystack: chloroplast genome sequence divergence in rare and widespread pines. Molecular Ecology 19:100-114.

Wickett NJ, Mirarab S, Nguyen N, Warnow T, Carpenter E, Matasci N, Ayyampalayam S, Barker MS, Burleigh JG, Gitzendanner MA, et al. 2014. Phylotranscriptomic analysis of the origin and early diversification of land plants. Proceedings of the National Academy of Sciences of the United States of America 111:e4859-e4868.

Wikström N, Savolainen V, Chase MW. 2001. Evolution of the angiosperms: calibrating the family tree. Proceedings of the Royal Society B 268:2211-2220.

Wilson TC, Conn BJ, Henwood MJ. 2012. Molecular phylogeny and systematics of Prostanthera (Lamiaceae). Australian Systematic Botany 25:341-352.

Wolfe AD, dePamphilis CW. 1998. The effect of relaxed functional constraints on the photosynthetic gene rbcL in photosynthetic and nonphotosynthetic parasitic plants. Molecular Biology and Evolution 15:1243-1258.

Woloszynska M. 2010. Heteroplasmy and stoichiometric complexity of plant mitochondrial genomes—though this be madness, yet there's method in't. Journal of Experimental Botany 61:657-671.

Wortley AH, Harris DJ, Scotland RW. 2007. On the taxonomy and phylogenetic position of Thomandersia. Systematic Botany 32:415-444.

Wortley AH, Rudall PJ, Harris DJ, Scotland RW. 2005. How much data are needed to resolve a difficult phylogeny? Case study in Lamiales. Systematic Biology 54:697-709.

Wunderlich R. 1967. Ein Vorschlag zu einer natürlichen Gliederung der Labiaten auf Grund der Pollenkörner, der Samenentwicklung und des reifen Samens. Österreichische Botanische Zeitschrift 114:383-483.

Xia X, Wang YZ, Smith JF. 2009. Familial placement and relations of Rehmannia and Triaenophora (Scrophulariaceae s.l.) inferred from five gene regions. American Journal of Botany 96:519-530.

Xue J-Y, Liu Y, Li L, Wang B, Qiu Y-L. 2010. The complete mitochondrial genome sequence of the hornwort Phaeoceros laevis: retention of many ancient pseudogenes and conservative evolution of mitochondrial genomes in hornworts. Current Genetics 56:53-61.

Young ND, Debellé F, Oldroyd GED, Geurts R, Cannon SB, Udvardi MK, Benedito VA, Mayer KFX, Gouzy J, Schoof H, et al. 2011. The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature 10.1038/nature10625.

Young ND, Steiner KE, dePamphilis CW. 1999. The evolution of parasitism in Scrophulariaceae/Orobanchaceae: plastid gene sequences refute an evolutionary transition series. Annals of the Missouri Botanical Garden 86:876-893.

Yang X, Chockalingam SP, Aluru S. 2012. A survey of error-correction methods for next-generation sequencing. Briefings in Bioinformatics 10.1093/bib/bbs015.

Ye J, Coulouris G, Zaretskaya I, Cutcutache I, Rozen S, Madden T. 2012. Primer-BLAST: A tool to design target-specific primers for polymerase chain reaction. BioMed Central Bioinformatics 13:134.

Yuan YW, Liu C, Marx HE, Olmstead RG. 2009. The pentatricopeptide repeat (PPR) gene family, a tremendous resource for plant phylogenetic studies. New Phytologist 182:272-283.

Zalapa JE, Cuevas H, Zhu H, Steffan S, Senalik D, Zeldin E, McCown B, Harbut R, Simon P. 2012. Using next-generation sequencing approaches to isolate simple sequence repeat (SSR) loci in the plant sciences. American Journal of Botany 99:193-208.

Zauhar RJ. 2001. University bioinformatics programs on the rise. Nature Biotechnology 19:285-286.

Zerbino DR, Birney E. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18:821-829.

Zhang C, Zhang DX, Zhu T, Yang Z. 2011. Evaluation of a Bayesian coalescent method of species delimitation. Systematic Biology 60:747-761.

Zhang T, Luo Y, Liu K, Pan L, Zhang B, Yu J, Hu S. 2011. BIGpre: A quality assessment package for next-generation sequencing data. Genomics, Proteomics and Bioinformatics 9:238-244.

Zhou X, Xu S, Xu J, Chen B, Zhou K, Yang G. 2011. Phylogenomic analysis resolves the interordinal relationships and rapid diversification of the Laurasiatherian mammals. Systematic Biology 61:150-164.

Zuccolo A, Bowers JE, Estill JC, Xiong Z, Luo M, Sebastian A, Goicoechea JL, Collura K, Yu Y, Jiao Y, et al. 2011. A physical map for the Amborella trichopoda genome sheds light on the evolution of angiosperm genome structure. Genome Biology 12:R48.

BIOGRAPHICAL SKETCH

Grant Thomas Godden completed his undergraduate education at Michigan State University (MSU), where he majored in Interdisciplinary Humanities. He received his BA degree in 2000 and, after working briefly for the university, returned in 2001 to complete an additional year of advanced coursework in Botany and Plant Pathology. As an undergraduate researcher, Grant studied color polymorphism and adaptation to serpentine soils in Linanthus (Polemoniaceae) with Douglas W. Schemske at MSU and investigated the pollination biology of Lepanthes (Orchidaceae) with James D. Ackerman at the University of . In 2002, he moved to New York City and began a professional career, specializing in cause-related marketing and corporate sponsorship of science, education, and the environment. He also worked briefly in medical education and publishing before returning to MSU in 2006 to continue his graduate education. At MSU, Grant worked under the supervision of L. Alan Prather from 2006 to 2009 and studied the systematics of Poliomintha (Lamiaceae) and related mint genera from the southwestern U.S.A. and Mexico. Grant began his PhD at the University of Florida (UF) in August 2008 but completed his MS degree at MSU in 2009, during his second semester at UF. Shortly thereafter, he began his PhD research under the direction of Pamela S. Soltis, expanding his previous work with Lamiaceae. Grant graduated with a PhD in Botany in December 2014 and began a postdoctoral research position with Lucinda A. McDade at the Rancho Santa Ana Botanical Garden in Claremont, California, during the same semester.