Genomic characterisation of species from two independent return to the sea events

HueyTyng Lee

BSc Hons Biotechnology

A thesis submitted for the degree of Doctor of Philosophy at

The University of Queensland in 2017

School of Agriculture and Food Sciences Abstract

Seagrasses are marine angiosperms that evolved from land , but returned to the sea around 140 Mya during the early evolution of monocotyledonous plants through multiple independent lineages. They successfully adapted to stresses posed by the marine environment, and today are distributed in coastal waters worldwide. Seagrass meadows are important ecosystems as carbon sinks, where its storage per unit area is comparable to terrestrial forests, primary food sources of marine animals, nursery groups for commercially-important species and sediment stabilization.

This thesis reports the genomic comparison between model species and seagrass, as well as between two lineages of seagrasses. This is achieved by the assembly of the Zostera muelleri genome, a southern hemisphere temperate seagrass species from the family , and genome wide characterization of Halophila ovalis, a species in the Hydrocharitaceae family which is abundant in the Indo-Pacific area.

Multiple genes were lost or modified in Z. muelleri compared to terrestrial or floating aquatic plants, highlighting the biological processes associated with adaptation to the sea. These include genes for hormone biosynthesis and signalling, and cell wall catabolism. Whole genome duplication common to the seagrass lineage, as well as a species-specific event dated at 5.7-10.5 Mya were also identified. The absence of the ancient Tau genome duplication in Z. muelleri further confirmed its phylogenetic placement after the divergence of from monocots.

As the Zosteraceae and Hydrocharitaceae families represent two marine recolonization events which occurred approximately 30 My apart, their comparison presents the first instance of molecular evidence that describe this repeated evolution. These two lineages share the same gene loss in ethylene biosynthesis and signalling, terpenoid biosynthesis, stomata and flower development. A group of well-conserved genes that are specific to the seagrass genomes also provided insights into the uniqueness of seagrasses among other angiosperms. The seagrass-specific genes are highly enriched in the intracellular transport pathway and cell wall construction, organization and modification, suggesting similar selective pressures drove the evolution of similar molecular, and possibly, functional outcomes in this two taxa. Intra-comparison of seagrass genomes also revealed the loss of the NDH protein complex in H. ovalis. This loss is not shared by the Zosteraceae species, and is likely to be related to the nutrient levels of habitats.

2

The Z. muelleri genome is the first assembled southern hemisphere temperate species and one of the few representatives of the basal monocot lineage. This is also the first time the convergence evolution of multiple seagrass lineages is studied in the genomic level. The results of this thesis therefore contributed to the early understanding of seagrass genetics and provide a foundation for future studies of many biological questions that remain open.

3

Declaration by author

This thesis is composed of my original work, and contains no material previously published or written by another person except where due reference has been made in the text. I have clearly stated the contribution by others to jointly-authored works that I have included in my thesis.

I have clearly stated the contribution of others to my thesis as a whole, including statistical assistance, survey design, data analysis, significant technical procedures, professional editorial advice, and any other original research work used or reported in my thesis. The content of my thesis is the result of work I have carried out since the commencement of my research higher degree candidature and does not include a substantial part of work that has been submitted to qualify for the award of any other degree or diploma in any university or other tertiary institution. I have clearly stated which parts of my thesis, if any, have been submitted to qualify for another award.

I acknowledge that an electronic copy of my thesis must be lodged with the University Library and, subject to the policy and procedures of The University of Queensland, the thesis be made available for research and study in accordance with the Copyright Act 1968 unless a period of embargo has been approved by the Dean of the Graduate School.

I acknowledge that copyright of all material contained in my thesis resides with the copyright holder(s) of that material. Where appropriate I have obtained copyright permission from the copyright holder to reproduce material in this thesis.

4

Publications during candidature

Golicz AA, Schliep M, Lee HT, Larkum AW, Dolferus R, Batley J, Chan CK, Sablok G, Ralph PJ, Edwards D (2015) Genome-wide survey of the seagrass Zostera muelleri suggests modification of the ethylene signalling network. J Exp Bot 66: 1489-1498

Davey PA, Pernice M, Sablok G, Larkum A, Lee HT, Golicz A, Edwards D, Dolferus R, Ralph P (2016) The emergence of molecular profiling and omics techniques in seagrass biology; furthering our understanding of seagrasses. Funct Integr Genomics 16: 465-480

Lee HT, Golicz AA, Bayer PE, Jiao Y, Tang H, Paterson AH, Sablok G, Krishnaraj RR, Chan CK, Batley J, Kendrick GA, Larkum AW, Ralph PJ, Edwards D (2016) The Genome of a Southern Hemisphere Seagrass Species (Zostera muelleri). Plant Physiol 172: 272- 283

Montenegro JD, Golicz AA, Bayer PE, Hurgobin B, Lee HT, Chan CK, Visendi P, Lai K, Dolezel J, Batley J, Edwards D (2017) The pangenome of hexaploid bread wheat. Plant J 90: 1007-1013

Bayer PE, Hurgobin B, Golicz AA, Chan CK, Yuan Y, Lee HT, Renton M, Meng J, Li R, Long Y, Zou J, Bancroft I, Chalhoub B, King GJ, Batley J, Edwards D (2017) Assembly and comparison of two closely related Brassica napus genomes. Plant Biotechnol J (In press; Accepted 12 April 2017)

Yuan Y, Bayer PE, Lee HT, Edwards D (2017) runBNG: A software package for BioNano genomics analysis on the command line. Bioinformatics (In press; Accepted 2 June 2017)

5

Publications included in this thesis

Lee HT, Golicz AA, Bayer PE, Jiao Y, Tang H, Paterson AH, Sablok G, Krishnaraj RR, Chan CK, Batley J, Kendrick GA, Larkum AW, Ralph PJ, Edwards D (2016) The Genome of a Southern Hemisphere Seagrass Species (Zostera muelleri). Plant Physiol 172: 272- 283

All of the results presented in this publication were incorporated into Chapter 2 and Chapter 3, with rewritten content and extensive elaboration in both results and discussion sections. Chapter 2 was mainly expanded with issues in performing accurate gene annotation and quality assessment, as well as solutions to overcome. Chapter 3 was expanded with multiple sequence alignments and phylogeny of gene families of interest.

Contributor Statement of contribution

(Contribution by task in percent)

H.T. Lee (Candidate) Manuscript design (70%)

Manuscript writing (70%)

Genome assembly, annotation and gene clustering (80%)

Whole genome duplication analysis (30%)

A.A. Golicz Genome assembly, annotation and gene clustering (5%)

Manuscript editing (10%)

P.E. Bayer Genome assembly, annotation and gene clustering (5%)

Computational support (50%)

Y. Jiao Whole genome duplication analysis (50%)

H. Tang Whole genome duplication analysis (10%)

6

A.H. Paterson Whole genome duplication analysis (10%)

G. Sablok Genome assembly, annotation and gene clustering (5%)

R.R. Krishnaraj Genome assembly, annotation and gene clustering (5%)

C.K. Chan Computational support (50%)

J. Batley (Supervisor) Manuscript design (10%)

Manuscript writing (10%)

Manuscript editing (40%)

G.A. Kendrick Provided insights into seagrass biology

Manuscript editing (10%)

A.W. Larkum Provided sequence data (50%)

P.J. Ralph Provided sequence data (50%)

D. Edwards (Supervisor) Manuscript design (20%)

Manuscript writing (20%)

Manuscript editing (40%)

7

Contributions by others to the thesis

David Edwards and Jacqueline Batley helped with study design and editing of all chapters.

Statement of parts of the thesis submitted to qualify for the award of another degree

None.

8

Acknowledgements

I am extremely lucky to have met many kind hearts during this journey of PhD. This thesis cannot be completed without them, and I would like to acknowledge their contributions.

A big thanks to my supervisors, Dave and Jacqui, for relentlessly guide, support and encourage me through projects, presentations, writings and all other tasks in the past 3.5 years.

Happy to work together and being in a team with everyone in the Applied Bioinformatics Group in UQ and UWA, including Kenneth, Philipp, Bhavna, Paula, Agnieszka, Paul, Juan, Pradeep, Andy, Armin and others. I appreciate the inputs, sharing, discussions and lunch sessions, science-related or not, as they are always of great help to my work and sometimes of great fun. The same appreciation goes to everyone in the Batley Lab.

To other office mates, Mahsa, Candy, Vicky and others, thanks for chit chats and sharing PhD woes.

A shout-out to my gym instructors, Amy and Maria, and other gym mates, you guys kept me sane amidst stress and setbacks.

Lastly, I am grateful to have my family and best friends, for encouragements, laughs, (occasionally) showing interest in my project and always fixing my food cravings.

9

Keywords bioinformatics, genome assembly, genome annotation, seagrass, zostera muelleri, halophila ovalis

Australian and New Zealand Standard Research Classifications (ANZSRC)

ANZSRC code: 060408 Genomics, 40%

ANZSRC code: 060102 Bioinformatics, 40%

ANZSRC code: 060303 Biological Adaptation, 20%

Fields of Research (FoR) Classification

FoR code: 0607, Plant Biology, 50%

FoR code: 0604, Genetics, 25%

FoR code: 0603, Evolutionary Biology, 25%

10

Table of Contents

Abstract...... 2

Declaration by author...... 4

Publications during candidature……………………………………………………………...... 5

Publications included in this thesis...... 6

Contributions by others to the thesis...... 8

Statement of parts of the thesis submitted to qualify for the award of another degree...... 8

Acknowledgements...... 10

Keywords…………………………………………………………………...... 10

Australian and New Zealand Standard Research Classifications...... 10

Fields of Research (FoR) Classification……………………………………………………...... 10

Chapter 1. General introduction ...... 20 1.1 The seagrass families ...... 20 1.2 Global distribution of seagrass and seagrass conservation ...... 23 1.3 Genetic analysis on seagrasses so far ...... 26 1.4 Species of interest ...... 28 1.4.1 Zostera muelleri ...... 28 1.4.2 Halophila ovalis ...... 29 1.5 Introduction to genomics ...... 30 1.5.1 Whole genome sequencing ...... 30 1.5.2 Genome assembly and gene annotation ...... 32 1.6 Summary of literature review ...... 36 Chapter 2. Assembly and gene annotation of the Zostera muelleri genome ...... 38 2.1 The importance of obtaining a seagrass genome assembly ...... 38 2.2 Sequenced and publicly-available data ...... 40 2.3 Summary of workflow and software used ...... 41 2.4 Materials and methods ...... 43 2.4.1 DNA sequencing ...... 43 2.4.2 Genome size estimation ...... 43 2.4.3 Preprocessing of reads ...... 44 2.4.4 Selecting optimal k using a subset of reads ...... 44 2.4.5 Assembly of reads ...... 45 11

2.4.6 Scaffolding of contigs ...... 45 2.4.7 Error correction using REAPR ...... 46 2.4.8 Digital normalization of reads ...... 46 2.4.9 Evaluation of genic completeness ...... 46 2.4.10 De novo repeat annotation ...... 47 2.4.11 Assembly of the RNA-seq evidence data ...... 47 2.4.12 Downloading of the protein evidence data ...... 49 2.4.13 Model training of gene predictors SNAP and Augustus ...... 49 2.4.14 Gene annotation using the MAKER pipeline ...... 50 2.4.15 Functional annotation ...... 51 2.5 Results and Discussion ...... 53 2.5.1 Genome size estimation ...... 53 2.5.2 Genome assembly ...... 54 2.5.3 Effect of data normalization in plant genome assembly ...... 57 2.5.4 Quality assessment of the assembled genome ...... 59 2.5.5 Predicting the Z. muelleri gene set ...... 65 2.6 Quality assessment of the predicted genes ...... 74 2.7 Functional annotation of Z. muelleri genes ...... 80 2.8 Conclusion ...... 82 Chapter 3. Comparative genomics between Zostera muelleri and other plant species ... 84 3.1 Introduction to comparative genomics ...... 84 3.1.1 Concepts of homology ...... 84 3.1.2 Whole genome duplication ...... 86 3.2 Materials and methods ...... 90 3.2.1 Clustering of orthologous genes ...... 90 3.2.2 Detecting gene families of interest ...... 91 3.2.3 Multiple sequence alignments and phylogeny of genes of interest in Z. muelleri and other plants ...... 93 3.2.4 Phylogenomic timing analysis of Z. muelleri among major land plant lineages 94 3.2.5 K-mer frequency of homologous gene pairs between two Zosteraceae species 95 3.2.6 Copy number of gene pairs between two Zosteraceae species ...... 95 3.2.7 Rate of substitutions per synonymous sites between two Zosteraceae species 95 3.3 Results and discussion ...... 96 12

3.3.1 Orthologous gene clustering of Z. muelleri and four other reference species identifies absent or modified genes ...... 96 3.3.2 Identification of WGD events in Z. muelleri and the refinement of the timing Tau 118 3.3.3 Z. muelleri has an extra WGD when compared to Z. marina...... 122 3.4 Conclusion ...... 125 Chapter 4. Genetic comparison between two independent seagrass lineages ...... 128 4.1 Introduction to the independent evolution of seagrass lineages ...... 128 4.2 Summary of pipeline and workflow ...... 131 4.3 Materials and methods ...... 135 4.3.1 DNA sequencing ...... 135 4.3.2 Pre-processing of reads ...... 135 4.3.3 Orthologous gene cluster (OGC) construction ...... 136 4.3.4 Pipeline to identify lost and conserved genes ...... 136 4.3.5 Lost and conserved H. ovalis, Z. muelleri and Z. marina genes in OGCsM . 137 4.3.6 Lost and conserved H. ovalis genes in OGCZ ...... 137 4.3.7 Gene ontology enrichment and word cloud plotting ...... 138 4.3.8 Assembly of H. ovalis protein and multiple sequence alignments with orthologs of other species ...... 138 4.4 Results and discussion ...... 140 4.4.1 H. ovalis reads quality check and preprocessing ...... 140 4.4.2 OGC construction ...... 142 4.4.3 Conservation of core biological processes ...... 145 4.4.4 Gene loss in H. ovalis and comparison of lost genes between the three seagrass species...... 147 4.4.5 H. ovalis lost genes related to nitrogen usage ...... 151 4.4.6 Seagrass-specific genes suggest co-evolution of intracellular transport, cell wall and ion transport related genes in H. ovalis and Zosteraceae ...... 157 4.5 Conclusion ...... 167 Chapter 5. Summary and outlook ...... 169 5.1 Addressing limitations of methodologies used and possible improvements ...... 169 5.1.1 Genome assembly and gene annotation ...... 169 5.1.2 Analyses of non-genic regions: repeat annotation and whole genome duplication events...... 170 5.1.3 Comparative pathway analysis using orthologous relationship ...... 171 5.1.4 Presence and absence of genes using read mapping ...... 172

13

5.2 Additional analysis that could be performed ...... 173 5.3 The future of seagrass genomics...... 174 References ...... 176 Appendix 1 ...... 199 Appendix 2 ...... 200 Appendix 3 ...... 203 Appendix 4 ...... 205 Appendix 5 ...... 206 Appendix 6 ...... 208 Appendix 7 ...... 211 Appendix 8 ...... 214

14

List of figures

Figure 1-1 The origin of seagrass in the Cretaceous period from a freshwater ancestor of terrestrial origin. Figure adapted from Catherine Collier, Integration and Application Network, University of Maryland Center for Environmental Science (ian.umces.edu/imagelibrary/)...... 21 Figure 1-2 The relationship between environmental variability, rhizome persistence and morphological/physiological plasticity for all seagrass genera. Figure adapted from (Carruthers et al., 2007)...... 24 Figure 2-1 Variations in approaches to genome annotation based on time, effort, amount of evidence and accuracy of predictions. Figure adapted from (Yandell and Ence, 2012)...... 39 Figure 2-2 Workflow and software used in genome size estimation, genome assembly, gene annotation and functional annotation...... 42 Figure 2-3 Distribution of 17-mer coverage of Z. muelleri libraries...... 54 Figure 2-4 Distribution of per base Phred quality score in a) read one and b) read two of the Z. muelleri whole genome shotgun sequencing...... 56 Figure 2-7 Workflow of model training and gene annotation using the MAKER pipeline. The dotted line arrows indicate the second-pass annotation...... 69 Figure 2-8 Comparison between one-pass and iterative runs of MAKER annotation in scaffold 3725_30727; a) results after the first MAKER run and b) results after the second MAKER run. Blue blocks within the light blue panels are genes annotated by MAKER, where the upper and lower panels represent the forward and reverse direction. The blocks within the upper and lower black panels are repeat annotation, evidence alignments, and ab initio predictions by SNAP and Augustus. Transparent boxes with blue outline represent the untranslated regions (UTRs)...... 77 Figure 2-9 Cumulative distribution function curve of Annotation Edit Distance (AED) scores of all predicted Z. muelleri genes...... 78 Figure 2-10 Stacked plots of three categories of predictions across AED scores, 1) prediction is supported by both ab initio and evidence, 2) prediction is only ab initio¬-based, and 3) prediction is only evidence-based...... 79 Figure 3-2 Illustration of the cyclical episodes during polyploidization. Figure adapted from (Wendel et al., 2016)...... 87 Figure 3-3 Known whole genome duplication events in the plant phylogeny. Figure adapted from https://genomevolution.org/wiki/index.php/Plant_paleopolyploid ...... 89 Figure 3-4 Venn diagram showing the distribution of shared gene families and genes (in parentheses) among five species (M. acuminata, O. sativa, S. polyrhiza, A. thaliana and Z. muelleri). Areas were colour-coded by the number of species...... 98 Figure 3-5 Word cloud representing GO terms enriched in AMOS gene families. Terms were coloured based on related functions 1) pink: hormone response signalling, 2) green: cell wall modification, and 3) purple: light and photosynthesis-related...... 100 Figure 3-6 Neighbour-joining tree of genes orthologous to A. thaliana (ATH) gene AT5G55590.1 in four other species: M. acuminata (MAC), O. sativa (OSA), S. polyrhiza (SPO) and Z. muelleri (ZMU). Genes were clustered into two groups (branch colour red and black)...... 103 Figure 3-7 Ethylene biosynthesis and signalling in Z. muelleri. Genes encoding proteins marked with a cross are found to be lost and those marked with a tick are found to be present. EBF1/2 is marked with a question mark as both are not found in Z. muelleri but EBF1 is present in Z. marina. Figure adapted and edited from (Golicz et al., 2015a)...... 108 Figure 3-8 EIN3/EIL1 multiple sequence alignment between Z. muelleri (ZMU), Z. marina (ZMA), A. thaliana (ATH), G. max (GMA), O. sativa (OSA), P. trichocarpa (PTR) and S. polyrhiza (SPO). Regions conserved between all sequences were indicated in blue, DNA-binding domain was

15

indicated by grey boxes and region deleted in Z. muelleri and Z. marina but conserved in eight other sequences was boxed in red...... 112 Figure 3-9 Neighbour-joining tree of putative carboxylic acid OMTs in Z. muelleri (ZMU), Z. marina (ZMA), Z. mays (ZMY), N. tabacum (NTA), C. breweri (CBR), A. majus (AMA), A. thaliana (ATH) and B. rapa (BRA). Carboxylic acid OMTs included were SAM:salicylic acid carboxyl methyltransferase (BAMT), Salicylate/benzoate carboxyl methyltransferase (BSMT), SAM:anthranilic acid methyltransferase (AAMT) and SAM:jasmonic acid carboxyl methyltransferase (JMT)...... 114 Figure 3-10 Jasmonate biosynthesis. α-linolenic acid is converted to 12-oxophytodienoic acid (OPDA) by three enzymes: lipoxygenase (LOX), allene oxide synthase (AOS) and allene oxidase cyclase (AOC). OPDA is then reduced by OPDA reductase 3 (OPR3) and oxidized to form jasmonic acid (JA). JA can be transformed to various derivatives, including methyl-jasmonate (MeJA). Figure adapted from Fonseca et al., 2009...... 115 Figure 3-11 Phylogenetic timing of inferred gene duplications. Values given are the number of orthogroups with at least one Z. muelleri gene showing duplications at the specified branches on the tree...... 119 Figure 3-16 Phylogenetic timing of WGD events in plants. WGD events postulated in Z. muelleri are labelled as grey and yellow circles. Figure adapted and edited from Ming et al., 2015...... 127 Figure 4-1 Phylogeny of core alismatid species. Branches of seagrass species were labelled with red stars. Figure adapted and modified from (Ross et al., 2016)...... 130 Figure 4-2 Workflow describes whole genome comparison of H. ovalis to two groups of genomes, five model plant species (A. thaliana, O. sativa, M. acuminata, P. dactylifera and S. polyrhiza) and two seagrass species (Z. muelleri and Z. marina). Status of conserved or lost was assigned to genes within the orthologous clusters, OGCsM (orthologous gene clusters conserved between seven species, with at least one gene originating from monocot; Golicz et al., 2015a) and OGCZ (orthologous gene clusters uniquely conserved between Z. muelleri and Z. marina). Results were in four categories: 1) Genes common in H. ovalis and all angiosperms; 2) genes common in all angiosperms but lost in H. ovalis; 3) genes uniquely conserved in all seagrasses; 4) genes present in Zosteraceae but lost in H. ovalis. Category 2 was further compared with the Zosteraceae lineage to obtain a collection of seagrass lost genes. Evidence for molecular convergence was searched in category 3, the collection of seagrass-specific genes. Solid arrows represent the gene loss pipeline and dotted arrows represent the subsequent analysis...... 133 Figure 4-3 Distribution of per base Phred quality score in a) read one and b) read two of the H. ovalis whole genome shotgun sequencing...... 141 Figure 4-4 Overrepresented sequences in a) read one and b) read two of the H. ovalis whole genome shotgun sequencing...... 142 Figure 4-5 Venn diagram showing the number of shared orthologous clusters among six species (A. thaliana, M. acuminata, O. sativa, S. polyrhiza and two Zosteraceae species)...... 144 Figure 4-6 Significantly enriched biological process GO terms in the genes conserved in seven plant species but lost in H. ovalis. Terms were coloured based on related functions 1) green: ethylene synthesis and signalling, 2) purple: flower development, 3) brown: stomata development and 4) pink: terpenoid biosynthesis...... 150 Figure 4-7 a) The involvement of NDH complex in electron transfer pathways during oxygenic photosynthesis of plants and b) its subunit composition. Figures adapted from (Peltier et al., 2016)...... 155 Figure 4-8 Nutrient (nitrogen and phosphorus) cycle in seagrass ecosystems; (1) and (2), nitrification; (3) and (4), nitrate reduction; and (5) denitrification; dotted lines represent excretion and dashed lines represent absorption. Figure adapted from (Leoni et at al., 2008) ...... 156

16

Figure 4-9 Significantly enriched cellular component GO terms in seagrass-specific genes. Terms in green are subcomponents or organelles of the intracellular transport pathways...... 160 Figure 4-10 Ribosomal protein L16 multiple sequence alignments between 19 species and three seagrass species. Species and corresponding IDs were listed in Table 4-1. Amino acids that were conserved within the non-seagrass group or among seagrasses were coloured according to physicochemical properties based on “Zappo” colour scheme. White arrows indicated seagrass- specific mutations...... 165 Figure 4-11 Phylogenetic tree showing distance between Rpl16 sequences of 17 species together with three seagrasses. Species and corresponding IDs were listed in Table 4-1. IDs coloured in red were members of core Alismatids and blue were members of Araceae. Branches were labelled with bootstrap values (%)...... 166 Figure 5-1 Duplication events generate false positives in the identification of functionally identical orthologs between species. Figure adapted from (Fang et al., 2010)...... 172

17

List of tables

Table 2-1 Sequenced and publicly-available data as evidence for Z. muelleri genome assembly and gene annotation...... 40 Table 2-2 Number of reads and nucleotides retained for each pre-processing step ...... 55 Table 2-3 Assembly statistics of contigs and scaffolds...... 57 Table 2-4 Assembly statistics of digitally normalized and unnormalized reads...... 59 Table 2-5 Assessment results of Z. muelleri scaffolds using CEGMA and BUSCO orthologous groups...... 62 Table 2-6 Repetitive sequences in Z. muelleri scaffolds...... 67 Table 2-7 Pre-calculated parameters of plant species provided for open source gene predictors. . 71 Table 2-8 Statistics of assembled Z. muelleri and Z. marina transcriptomes...... 71 Table 2-9 Gene characteristics of Z. muelleri in comparison with four monocot species (Z. marina, S. polyrhiza, M. acuminata and O. sativa) and one dicot (A. thaliana). * genome not available at the time of analysis...... 73 Table 2-10 Functional annotation of predicted proteins in Z. muelleri genome...... 80 Table 2-11 Public databases used for functional annotation of predicted Z. muelleri proteins...... 81 Table 3-1 Significantly enriched Biological process (BP) GO terms in AMOS families...... 99 Table 3-2 A. thaliana representative genes in AMOS corresponding to cell wall modification (GO:0042545) and pectin catabolic process (GO:0045490) pathways...... 102 Table 3-3 Number of pectin methylesterase (PME) and/or methylesterase invertase (PMEI)-related proteins and families in Z. muelleri...... 102 Table 3-4 A. thaliana representative genes in AMOS corresponding to ethylene signalling (GO:0010105) pathway...... 107 Table 4-1 Species selected for multiple sequence alignment of orthologous proteins...... 139 Table 4-2 Number of reads and nucleotides retained for each pre-processing step...... 142 Table 4-3 Significantly enriched biological process GO terms in the genes conserved in H. ovalis compared with five other plant species...... 146 Table 4-4 Significantly enriched biological process GO terms in the genes conserved in five other plant species but absent in H. ovalis...... 149 Table 4-5 Significantly enriched biological process GO terms in the genes that were lost in H. ovalis, but present in Z. muelleri, Z. marina and five other plant species...... 154

18

List of abbreviations

AED Annotation edit distance

AMOS Arabidopsis thaliana, Musa acuminata, Oryza sativa, Spirodela polyrhiza

BLAST Basic local alignment search tool

CDS Coding sequence

DBD DNA binding domain

DNA Deoxyribonucleic acid

Gb Giga bases

GO Gene ontology

JA Jasmonic acid

KEGG Kyoto Encyclopedia of Genes and Genomes

Ks rate of substitutions per synonymous sites

LTR Long terminal repeats

MeJA Methyl jasmonate

Mya Million years ago

My Million years

Mb Mega bases

RNA Ribonucleic acid

TE Transposable element

UTR Untranslated region

WGD Whole genome duplication

WGS Whole genome sequencing

19

Chapter 1. General introduction

1.1 The seagrass families

Seagrasses are a group of flowering plants which live fully submerged in the marine environment. The morphology of seagrass varies among species, the common features include long, strap-shaped leaves, simple flowers and they form mono-specific meadows resembling terrestrial grasses. They are, however, not close relatives to the terrestrial grasses in the Poaceae family. Seagrass belongs to a basal lineage which diverged approximately 15 My earlier than Poaceae within the clade and evolved to re-colonize the marine environment (Figure 1-1). The uniqueness of seagrass evolution is therefore summarised as resembling the whales (Lambers et al., 1998), which returned to aquatic life from terrestrial mammals.

Arber (1920) theorized seagrasses as a “biological group” derived from independent ancestors, but possessed necessary requirements to survive in the sea. Early morphological (Tomlinson, 1984) and molecular evidence (Les et al., 1993) supported the polyphyletic origins of seagrasses. Les et al. (1997) detected three separate origins of seagrass using the plastid gene rbcL, and this was confirmed in subsequent studies (Janssen and Bremer, 2004; Larkum et al., 2006). Results from Les et al. (1997) also rejected the hypothesis of the formation of seagrass through evolution of “mangrove-like” saltwater plants (Larkum et al., 1989), since the remaining members of phylogenetic families are freshwater species. The current seagrass is agreed to contain around 72 species forming 3 families, Zosteraceae, Hydrocharitaceae and Cymodoceaceae complex, within the order Alismatales and representing less than one percent of all species (Les et al., 1997; Short et al., 2011; Nguyen et al., 2015).

20

Figure 1-1 The origin of seagrass in the Cretaceous period from a freshwater ancestor of terrestrial origin. Figure adapted from Catherine Collier, Integration and Application Network, University of Maryland Center for Environmental Science (ian.umces.edu/imagelibrary/).

The occurrence of at least three parallel evolution pathways from a freshwater ancestor of terrestrial origin (Les et al., 1997) form independent lineages seagrasses. Through independent evolutionary routes, seagrasses develop a set of physiological and morphological adaptive features, which collectively represent marine adaptation traits. High salinity is one of the major obstacles overcome by submerged marine plants. For terrestrial angiosperms, osmotic changes caused by elevated salinity in the soil reduce the

21

ability of roots to absorb water (Läuchli and Grattan, 2007). The longer-term effect of salt stress is the accumulation of sodium and chloride ions in vacuoles and cytoplasm, leading to disruption of the ionic equilibrium in cells. This effect, termed salt toxicity causes great impacts on plant growth and development. For example, in wheat and rice, high salinity increases crop sterility and affects flowering and maturity time.

Effective osmoregulation is crucial for plants to survive salinity fluctuations, for example, salt-secreting glands are common in some halophytes (Liphschitz and Waisel, 1974; Hansen et al., 1976). In seagrasses, obvious physiological adaptive features are not found, but multiple characteristics were hypothesized as salt-tolerance mechanisms in seagrasses (reviewed in Touchette, 2007), such as cell wall rigidity, selective ion flux (Carpaneto et al., 2004) and vacuolar ion sequestering, and the synthesis of compatible solutes (Ye and Zhao, 2003). For example, seagrass cell walls contain sulphated polysaccharides (Popper et al., 2011), a feature found in algae (Aquino et al., 2011) but lost in land plants (Michel et al., 2010). Sulfation increases the anionic potential of molecules and is hypothesized to support ion transport of seagrass cell walls in high salinity environments (Aquino et al., 2011; Popper et al., 2011). The regulation of these mechanisms in seagrasses is not understood.

With the lack of stomata, gas exchange in seagrasses occurs through permeable leaf cuticles. To adapt to the low gas diffusion rate in the water, seagrasses have specialized gas-filled cavities called aerenchyma in the roots and rhizomes. This morphology facilitates oxygen exchange of roots in anaerobic sediments, especially during periods of reduced photosynthesis. For carbon acquisition, some seagrasses have evolved carbon- - concentrating mechanisms (CCMs) to utilize HCO 3 ions (Beer et al., 2002; Borum et al.,

2015), which are about 200 times more concentrated in water than CO2 (Procaccini et al., 2012). Seagrasses have also adapted to variable and low levels of light that attenuates quickly in seawater into blue or green wavelengths of the spectrum (Larkum et al., 2006). In addition, seagrasses are hydrophilous angiosperms, where the transportation and capture of pollen grains are carried out on or below the water surface.

To date, whole genome sequences of more than 500 eukaryotic species are available (National Centre for Biotechnology Information on May 2017). There are, however, many taxonomic gaps in the plant kingdom, as sequencing efforts are largely focused on economically-fundamental and model species. Seagrasses are basal monocots residing in one of these taxonomic gaps. With land plants as ancestors, seagrass lineages adapted to

22

a submerged lifestyle in the sea. Little is known about seagrass genetics: For example: How marine adaptation occurs in seagrass? Which genes and cellular processes are modified under marine selection pressures? How different are seagrasses from fresh water aquatic plants? Seagrasses are not agricultural crops, the value of understanding the genomics of seagrass is therefore not immediately obvious. In fact, the survival of seagrasses has a direct effect on us. Seagrass meadows are the second most important carbon sinks in the world after rain forests, they are therefore one of our few solutions to fight global warming. Soil salinity affects an estimated of 45 million hectares of land (Roy et al., 2014), and is a major limitation to crop production. Genes responsible for salt tolerance in seagrass may be a long-term solution for crops to tolerate high salinity in soil. Crops with improved traits can contribute to solving the challenge to match food availability with global population growth.

This thesis describes seagrass genomes through computational analysis of high- throughput genome sequence data. This work aims to answer two fundamental questions: 1) What biological pathways are being selected by the marine environment in seagrasses and what genes are involved? 2) Do multiple lineages of seagrasses follow the same evolutionary path?

1.2 Global distribution of seagrass and seagrass conservation

The geomorphological structures of regions and their oceanographic factors significantly influence the habitats of seagrass meadows. Three main types of seagrass habitats are 1) permanently or rarely open estuaries, 2) sheltered and 3) exposed habitats, with varying tidal energy, salinity, temperature and light intensity (Carruthers et al., 2007). Depending on the morphological and physiological plasticity, which describe the seagrass’ ability to withstand acclimation of stresses, different genera of seagrass are distributed in each habitat type (Figure 1-2). For example, coriacea has deeply buried rhizomes and long thick leaves to survive exposed habitats, but is sensitive to low salinity and epiphyte overgrowth. Halophila ovalis is highly adaptable, with tolerance to large salinity range and anaerobic sediment.

23

Figure 1-2 The relationship between environmental variability, rhizome persistence and morphological/physiological plasticity for all seagrass genera. Figure adapted from (Carruthers et al., 2007).

Seagrass species can be categorized to two main groups, temperate or tropical, based on their geographic distribution (Short et al., 2007). Seagrasses thrive in global coastlines of six bioregions (Figure 1-3a), 1) Temperate North Atlantic, 2) Tropical Atlantic, 3) Temperate Mediterranean, 4) Temperate North Pacific, 5) Tropical Indo-Pacific and 6) Temperate Southern Oceans. Ruppia under the family Cymodoceaceae complex, also commonly known as widgeonweed, is the only seagrass genus found in both tropical and temperate regions and in both hemispheres. Zostera, another widespread genus, dominates the northern temperate bioregions 1 and 4, whereas species of Posidonia dominate the southern temperate bioregions 3 and 6. Bioregion 5 Tropical Indo-Pacific contains the largest diversity of seagrass species with Halophila as the major genus (Short et al., 2007).

24

Seagrass meadows are valuable ecosystems. Global seagrass living biomass and soils were estimated to store 19.9 Pg of organic carbon, where its storage per unit area is comparable to terrestrial forests (Fourqurean et al., 2012). The ecological role of seagrasses includes primary food sources of endangered animals such as dugongs, manatees and sea turtles (Heck and Valentine, 2006), nursery grounds for many commercially-important species of fish and shellfish (Hemminga and Duarte, 2000), nutrient cycling and sediment stabilization (Hemminga and Duarte, 2000). Based on the criteria of the International Union for the Conservation of Nature (IUCN), 24% of all seagrass species (Short et al., 2011) were assigned with Threatened (Endangered or Vulnerable) and Near Threatened status (Figure 1-3b) in 2011. Three species in Zosteraceae were endangered, including japonicus and Zostera geojeensis which are found near Chinese, Korean and Japanese shores, and Zostera chilensis in Chile. Seven species which mostly thrive in bioregion 5 were listed as vulnerable. The main reason contributing to seagrass decline is human activity that causes habitat destruction and degradation of water quality, particularly light reduction due to eutrophication and poor water clarity due to coastal developments (Waycott et al., 2009). With efforts to reduce remineralization of organic carbon, together with identification of extinction risk in these seagrass species, awareness in conservation and restoration of seagrass meadows is now increasing.

25

Figure 1-3 Global distribution of a) all seagrass species; b) 15 seagrass species listed in threatened (VU: Vulnerable; EN: Endangered) or Near Threatened (NT) based on IUCN Red List Categories and Criteria. Figure adapted from (Short et al., 2011).

1.3 Genetic analysis on seagrasses so far

Besides plastid gene analysis for the purpose of taxonomy studies, genetic analyses of the seagrass species have largely focused on two species, Zostera marina, the most widespread temperate species, and which is endemic to the Mediterranean Sea. Most studies aim to understand the effect of climate change on seagrass growth and distribution. Besides increase of thermal stress, seagrass habitats are also negatively impacted by light limitation caused by human-induced sediment suspension, as seagrasses have relatively high minimum light requirements (Dennison et 26

al., 1993). Models of seagrass responses under current and future light climate have been described, including individual leaf responses, shoot-scale responses and alterations to meadow structure (reviewed in Ralph et al., 2007) . The genetics of seagrass responses to differing light conditions were studied through gene (Dattolo et al., 2013; Dattolo et al., 2014; Salo et al., 2015) and protein (Mazzuca et al., 2009) expression, together with genome methylation signatures (Greco et al., 2013). Differential expression of genes for protein turnover, cellular stress related enzymes and photosynthesis were identified as responsive to light stress. Photosynthesis, particularly components of photosystem II (PS II) are known to be highly sensitive to thermal stress (Sharkey, 2005; Dutta et al., 2009). Measurements of photophysiological gene expression under a simulated heat wave showed different sets of upregulated genes in populations of different latitude (Winters et al., 2011). A low latitude seagrass population differentially expressed fewer genes and achieved full recovery from thermal stress faster than a high latitude population. In two different species, markedly different molecular responses to heat wave were also observed through RNAseq (Franssen et al., 2014). Overall, these analyses demonstrated negative effect of thermal stress to seagrasses, with complex responses in genes related to protein folding, synthesis of chloroplast ribosomes, cell wall modification and heat shock proteins. Salt tolerance mechanisms in seagrass are also of particular interest due to their potential to be applied to improve salinity tolerance in crop plants, such as the use of a Z. marina salt tolerance gene in transgenic rice research (Zhao et al., 2013).

To understand the genetic mechanisms underlying marine adaptation, expressed sequence tag libraries of two seagrass species revealed three functional groups of positively selected genes, including those encoding glycolysis enzymes, ribosomal genes and photosynthesis genes (Wissler et al., 2011). In a separate study, ortholog groups of the Z. marina transcriptome were found to be more similar to land plants than the red algae Porphyra yezoensis (Kong et al., 2014).

A comparative genomics approach was applied to characterize gene conservation and loss in seagrass, in comparison to four terrestrial plants and one aquatic plant, using unassembled whole genome shotgun sequence data of Z. muelleri (Golicz et al., 2015a). Genes associated with ethylene synthesis and signalling pathways were found to be lost, suggesting that seagrasses have evolved to survive and grow without this gaseous hormone. Ethylene mediates many aspects of plant development, such as seed germination, fruit ripening and senescence (Ecker, 1995). Mutants of ethylene synthesis, perception or signalling have mild phenotypes, but constitutive ethylene response 27

Arabidopsis thaliana mutants displayed impaired growth as compact dwarves with small roots (Kieber et al., 1993; Roman et al., 1995). This indicates the importance of regulating endogenous ethylene concentrations through balance between ethylene biosynthesis and outward diffusion. During flood, with reduced outward diffusion, accumulated ethylene in terrestrial plant tissues is responsible for triggering flood-adaptive responses such as accelerated shoot elongation (Voesenek and Bailey-Serres, 2015). Flood-adapted plants, such as deep water rice (Kende et al., 1998), only respond to ethylene in high concentrations (Voesenek and Vanderveen, 1994). In seagrasses, where gaseous outward diffusion is hampered and flood-adaptation is irrelevant, it is logical to hypothesize that genes involved in ethylene biosynthesis, perception and signalling were negatively selected.

The whole genome assembly of Z. marina has been published (Olsen et al., 2016). The Z. marina genome was assembled to 202.3 Mb, with 63% of repetitive regions and encodes a total of 20,450 protein-coding genes. Loss and gain of gene families in Z. marina were investigated in detail across a phylogenetic tree with 12 other representatives of plants. Besides displaying the loss of ethylene, as seen in Z. muelleri (Golicz et al., 2015a), Z. marina also lost the genes responsible for the synthesis of secondary volatile terpenes and stomatal differentiation. Also, genes responsible for ultraviolet sensing and protection, and phytochromes for far-red sensing were lost in Z. marina. This is explained by the light- attenuated habitat of seagrasses, with low penetration of UV-B and far-red wavelengths (Kirk, 2011). To enhance light harvesting under low light, Z. marina also has an expanded light-harvesting complex B (LHCB) family. Variation within carbohydrate-active enzymes (CAZyme) gene families also modified cell wall hemicelluloses and pectins in Z. marina, which is hypothesized to enable salinity tolerance.

1.4 Species of interest

1.4.1 Zostera muelleri

Zostera muelleri, commonly known as eelgrass, is a temperate southern hemisphere seagrass species native to both New Zealand and the south and east coasts of Australia. Z. muelleri belongs to the Zosteraceae family which consists of about 20 seagrass species and is estimated to have evolved at about 17 Mya (Larkum et al., 2006). Z. muelleri was also known as Z. capricorni, Z. novazelandica or Z. mucronata before the results of 28

morphological and molecular analyses showed the conspecificity of these species (Les et al., 2002). Z. muelleri is a monoecious, fast-growing species which often dominates estuaries and coastal lakes. Z. muelleri meadows have narrow leaf blades and thick rhizomes, and are therefore relatively resilient to sedimentation.

With the exception of phylogenetic analyses using a few nuclear and plastid representative loci, the number of genetic studies of Z. muelleri performed has been very limited. Chromosome micrographs of Z. muelleri showed a chromosome number of 2n=24, whereas most of the members in the Zosteraceae family have 2n=12 (Kuo, 2001). The genome size of Z. muelleri was not estimated, but image and flow cytometry analysis of two other Zostera species with 2n=12 showed 2C values of 1.22 pg and 1.54 pg, respectively (Koce et al., 2003). The recently sequenced Z. marina, which belongs to a sister clade (Tanaka et al., 2003), has a genome assembly size of 202.3 Mb (Olsen et al., 2016).

1.4.2 Halophila ovalis

Halophila ovalis belongs to Hydrocharitaceae, a largely diverse aquatic angiosperm family with about 75 species (Larkum et al., 2006). The historical biogeography studies of Hydrocharitaceae suggest that it originated from the Indo-pacific region (Chen et al., 2012). Using plastid genes rbcL and matK incorporated with fossil evidence, the divergence time estimates of the seagrass branch in Hydrocharitaceae are approximated to 55 Mya (Kato et al., 2003; Janssen and Bremer, 2004; Chen et al., 2012).This evolutionary lineage is therefore about 30 My earlier than the Zostera genus. H. ovalis is a small plant with paddle-shaped leaves, with ephemeral and morphologically variable characteristics (Carruthers et al., 2007). It is widely distributed in the sheltered habitats of tropical Indo-Pacific Ocean, as well as south western Australia to East Africa (Short et al., 2011; Nguyen et al., 2014).

Many molecular barcoding analyses have been performed to identify divergence within the Hydrocharitaceae (Li and Zhou, 2009; Chen et al., 2012; Iles et al., 2013; Petersen et al., 2016), and some of the species in the Halophila genus still remained as an unresolved complex. Flow cytometric analysis has been performed in H. ovalis, but estimation of two other Hydrocharitaceae members, Najas minor and Elodea Canadensis, were 2C=7.28 and 7.54 respectively (Hidalgo et al., 2015).

29

1.5 Introduction to genomics

In genomics, efforts to obtain whole genome sequences of species using DNA sequencing technology has answered complex questions which contribute to multiple fields. In medical research, the use of variants in genes associated with cancer subtypes and respective drug responses as biomarkers has led to the advancements of personalized medicine and targeted therapy (reviewed in Wang et al., 2015). For example, before the development of next-generation sequencing, the morphology of lung cancer tumour was used for simple categorisation of cancer types - non-small-cell (NSCLC) and small-cell (SCLC). Using sequences of tumour cells, more than twelve driver mutations are currently used to better describe the heterogeneity of clinical subtypes and targeted therapies (reviewed in Hensing et al., 2014) . For instance, through the characterization of point mutation in a receptor tyrosine kinase, EGFR, clinical response rates of patients with chemotherapy- resistant NSCLC increase by 20% with EGFR-inhibitors (Shepherd et al., 2005). In agriculture, genome sequencing of crops has revolutionized crop breeding by replacing field trials with molecular markers for trait selection (reviewed in Edwards and Batley, 2010 and Varshney et al., 2014) . By providing a large repertoire of candidate genes for trait markers and quantitative trait locus (QTL) mining, crop genomes link genotype data to phenotypes. When coupled with high throughput phenotyping, collections of genomic data are used for genome wide association studies (GWAS) to map genomic regions of important traits. For example, high-resolution mapping of 5000 recombinant inbred maize lines identified multiple small additive QTL controlling maize flowering time, providing insights into the architecture of adaptive traits in plants (Buckler et al., 2009). In wheat, GWAS studies of 322 accessions of its D-genome progenitor revealed variants associated with 29 morphological traits (Liu et al., 2015), providing a platform for studies of important agronomic traits in wheat.

1.5.1 Whole genome sequencing

The genome of a bacteriophage was the first genome sequenced (Sanger et al., 1977) using the “chain-termination” method. The sequencing technique uses fluorescent-labelled chain-terminating nucleotide analogs to emit light at each base termination during DNA synthesis. By performing four individual dideoxynucleotide base reactions in parallel, the sequence can be inferred using autoradiography. The Sanger sequencing technology became the basis for first-generation DNA sequencing machines and a few improvements

30

were made in the following years. For example, Applied Biosystems developed an automated large-scale sequencer range which produced the first draft of the human genome (Lander et al., 2001; Venter et al., 2001).

Second-generation sequencing technologies were pioneered by another “sequence-by- synthesis” method termed pyrosequencing. This method which was first published in 1988 (Hyman, 1988) inferred sequences by measuring pyrophosphate synthesis. Compared to the Sanger method, pyrosequencing does not need modified dideoxynucleotides and can be observed in real time (Ronaghi et al., 1996; Ronaghi et al., 1998). The major drawback of the method is observed when sequencing homopolymers, as the increment of light intensity is not linear after four or more consecutive identical nucleotides, causing difficulty in determining the exact number of nucleotides (Ronaghi et al., 1998). Pyrosequencing was then commercialized by 454 Life Sciences and parallelisation of reactions was performed for increased yield. In the following years, instead of new sequencing reactions, the focus of DNA sequencing shifted to development of parallelisation techniques (reviewed in Shendure and Ji, 2008), such as “bead-based water-in-oil emulsion PCR” by 454 and “bridge amplification” by Solexa. These developments defined the second- generation era by high-throughput data and low sequencing cost.

Third-generation sequencing is marked by platforms which enable direct observation of DNA synthesis and the reduction of DNA material needed to one single molecule (Braslavsky et al., 2003). This is a major breakthrough, as single molecule sequencing (SMS) is applicable to single cell studies, such as gene expression and mutations of cancer cells. Currently, the single molecule real time (SMRT) platform by Pacific Biosciences, which utilizes microfabricated nanostructures to confine DNA polymerase for real time observation (Levene et al., 2003), is the most widely used technology (van Dijk et al., 2014). Another example is nanopore DNA sequencing by Oxford Nanopore Technologies which transits DNA strand through a pore and read each nucleotide using electric signals (Clarke et al., 2009). The main challenge faced by third-generation technologies is a relatively low read accuracy (Clarke et al., 2009; Eid et al., 2009; Koren and Phillippy, 2015).

Repetitive sequences in genomes are by far the largest obstacle towards obtaining a gapless genome assembly (Phillippy et al., 2008). Theoretically, if all the repeats in the genome were spanned by reads with length larger than the maximum repeat size, the repetitive regions are resolvable (Bickhart et al., 2017). Factors such as the sequencing

31

coverage, and hence the sequencing cost, as well as the error rates of long read sequencing, are currently limiting the progress of resolving repetitive regions. Several technologies have been recently developed for the purpose of orienting assembled contigs, indirectly reducing the gaps caused by repeats. Chromosome interaction mapping (Hi-C) identifies long-range chromosome interactions and captures chromosomal conformation (Dekker et al., 2002; Burton et al., 2013). Optical mapping (such as BioNano Genomics) uses restriction enzymes recognition sites for contiguity information (Schwartz et al., 1993; Riley et al., 2011).

1.5.2 Genome assembly and gene annotation

The development of sequencing platforms is complemented by concurrent advances of bioinformatics algorithms. Genome assembly and annotation are usually the two analyses that followed a whole genome shotgun sequencing (Figure 1-4). Assembly is the construction of continuous sequences based on the overlapping of raw reads to resemble a representation of the original genome (reviewed in Schatz et al., 2010). The product is usually termed as a “draft genome” and the level of completion is determined by the amount of gaps or missing information within the stretches of contiguous sequences (Chain et al., 2009). The annotation step is essential in giving biological relevance to the assembled genome. Annotation algorithms predict the location and function of protein- coding genic sequences in the draft genome to obtain a set of gene models (reviewed in Yandell and Ence, 2012) .

Huge continuing efforts are put into improving these two steps mainly in terms of contiguity and accuracy, dealing with low-complexity regions, flexibility in accepting different read types, requirements of computational resources and speed. Accuracy of assembly and annotation are fundamental for valid downstream analysis, but with a lack of gold standard, the quality assessment of these algorithms are proven to be challenging (reviewed in El-Metwally et al., 2013).

In the following sections, common tools for genome assembly and gene annotation will be reviewed.

32

Figure 1-4 Illustration of the assembly process, 1) Shotgun sequencing: DNA sequencing platform produce fragments of genome, 2) Genome assembly: extension of short reads to form contigs and scaffolds, 3) Annotation: prediction of gene models. Figure adapted from

(Ekblom and Wolf, 2014).

1.5.2.1 Genome assemblers and quality assessment of assemblies

The variety of genome assemblers is large and ever increasing to accommodate differing strategies of read sequencing and therefore read lengths and insert sizes. Besides accuracy, the design of an efficient assembler has to take into account computational parameters like memory consumption, space requirements and run time. Assemblers are designed to strive to achieve contiguity of genome sequences, but also address different

33

limitations of sequencing, such as base correction for very long reads with high error rate and collapsing repetitive regions when assembling short reads.

Algorithms of current successful short-read assemblers are graph-based, which use a set of nodes and edges between the nodes. Graph-based assemblers can better model genome complexity and resolve repeats (reviewed in Pop, 2009) , a major complication in plant genomes (Hamilton and Buell, 2012), as compared to the “greedy approach” used by early genome assemblers. The two main graph-based methods are Overlap/Layout/Consensus (OLC) and de Bruijn (DBG). OLC assemblers have a three- stage pipeline, identify overlapped reads, construct layout graph and produce consensus sequences. Examples of widely-used OLC assemblers are PHRAP (de la Bastide and McCombie, 2007), Celera Assembler (Myers et al., 2000) and CAP3 (Huang and Madan, 1999). The drawbacks of comparing between all pairs of reads in the first step of OLC are long run time and increasing graph size as genome complexity increases. DBG constructs vertices and edges of graphs by read k-mers and an Eulerian path, that is, a path that traverses each edge once, is followed to assemble the sequences into contigs. The DBG method does not require the memory expensive process of pairwise overlap between reads as in OLC. However, DBG is highly sensitive to sequencing error, leading to exponential increase of the resulting deBruijn graph size, which also consumes memory to resolve. Various methods were developed to address this limitation, for example, Velvet simplifies non-intersecting paths during graph construction (Zerbino and Birney, 2008) and ABYSS introduced a representation of graph without storing edges to distribute memory more effectively (Simpson et al., 2009). Current genome assemblers also combine both OLC and DBG strategies, for example MaSuRCA (Zimin et al., 2013) first extends the reads using the DBG approach and then assemble the extended super-reads using an OLC assembler.

Assemblers designed for very long third-generation reads are also available, such as Canu (Koren et al., 2016) and FALCON (Chin et al., 2016). To overcome the challenge of high sequencing error rate in long reads, error correction approaches, through hybrid or hierarchical methods, are extensively developed in long read assemblers (Koren and Phillippy, 2015). Hybrid methods rely on short reads such as Illumina to correct base calls in long reads, whereas hierarchical methods align and overlap long reads for multiple rounds to achieve read correction.

34

A lack of gold standards, particularly in de novo assemblies, further emphasizes the importance of quality assessments. General statistics such as maximum contig length, total length of assembly, average contig length and N50 (a metric similar to median of contig length but gives a greater weight to longer contigs) are used to describe assembly quality. However, public assessment efforts of assemblers by research groups such as Genome Assembly Gold-standard Evaluations (GAGE) (Salzberg et al., 2012), Assemblathon (Earl et al., 2011; Bradnam et al., 2013) and dnGASP (de novo Genome Assembly Assessment; http://big.crg.cat/rgasp_dngasp) have acknowledged the drawbacks of measuring assembly quality solely by general statistics. Programs to generate more comprehensive parameters like percentage of correct joints (Hunt et al., 2013), percentage of reads that aligned to the assembly and number of conserved genes found (Parra et al., 2007; Simao et al., 2015) are therefore now included as standard procedures of assessment pipelines.

1.5.2.2 Gene annotation tools

The step after obtaining a genome assembly is the translation of DNA sequences to biological information. When compared to genome assembly, the task of gene annotation has a relatively less automated workflow and is more challenging. Basic steps of gene annotation are identification and masking of repeat regions in the assembly, followed by prediction of coding sequence through intron-exon structures and lastly the annotation of a consensus set of gene models. Gene prediction can be performed ab initio, but benefits from external data as evidence, such as species-specific transcripts or proteins. Tools such as MAKER (Cantarel et al., 2008; Holt and Yandell, 2011; Campbell et al., 2014b), PASA (Haas et al., 2003) and TriAnnot (Leroy et al., 2012) automated the gene annotation pipeline. However, the assistance of manual curation is still useful, particularly in the final step as the resolution of gene models is often complicated by pseudogenes, pre-mRNA, transposons and conflicting information between evidence. The importance of references is emphasized in gene annotation, as prediction relies heavily on known information for algorithm training. In general, the accuracy of predicted genes is affected by multiple factors: the completeness of genome assembly, the complexity of the genome depending on ploidy and amount of repetitive regions, the presence of gene models of closely related species and the availability of evidence data in the form of transcripts or proteins.

35

Efforts of collecting and managing functional information of genes were initiated about 20 years ago through consortiums like Gene Ontology (Harris et al., 2004; Primmer et al., 2013) and Kyoto Encyclopaedia of Genes and Genomes (KEGG) (Kanehisa and Goto, 2000). The process of generating a collection of well-characterized genes with associated ontologies and pathways involves manual curation. Due to the amount of analysis and time needed to achieve considerable accuracy, the data collected are often limited to model species. For example, the genome of Arabidopsis thaliana, the first sequenced plant, has remained as the best choice for reference in plant research. Functional data of reference species are nevertheless valuable references for assigning protein domains, families and functional sites to predicted genes of non-model species.

1.6 Summary of literature review

Seagrasses belong to a basal monocotyledonous lineage within the Alismatales order. There are 72 species of seagrasses forming 3 families the Zosteraceae, Hydrocharitaceae and Cymodoceaceae complex. The seagrass species is a polyphyletic group, where the members have been grouped together through common characteristics rather than a common ancestor. Parallel evolution from a freshwater ancestor of land plant origin occurred at least three times to form seagrass families.

As seagrasses evolved to live fully submerged in the sea, physiological and morphological adaptive features for the marine environment were developed. For example, seagrasses lack stomata, have aerenchymas in roots and rhizomes for effective gas exchange, osmoregulate for high salinity and transport pollen grains in the water.

Seagrass species are categorized as temperate or tropical based on their geographical distribution. 24% of all seagrass species are listed as threatened or near threatened by the International Union of the Conservation of Nature (IUCN) mainly due to increasing human coastal activity. Efforts of seagrass conservation and restoration are important, as seagrasses are key carbon sinks and also act as breeding ground and food source for many marine fauna.

The genetic analyses of seagrasses are largely focused on the effect of climate change, particularly light limitation and thermal stress, to seagrass survival and distribution. Comparative genomics of seagrass with land plants revealed positively selected genes related to photosynthesis, glycolysis enzymes and ribosomes as well as lost genes in 36

ethylene synthesis and signalling pathways. The whole genome assembly of a northern hemisphere temperate species further identified gene loss in synthesis of secondary volatile terpenes and stomatal differentiation. Gene family modifications were also observed relating to low light harvesting and salinity tolerance of the cell wall.

This thesis aimed to 1) detect biological pathways responsible for marine adaptation in seagrasses, and 2) identify whether seagrass species of multiple lineages underwent convergent evolution.

The advancement in genomics, including genome sequencing platforms and bioinformatics algorithms has contributed greatly in scientific research, particularly in human medicine, crop breeding, microbiology and evolutionary studies. The aims of this project will therefore be addressed by applying genomic methods on two seagrass species, Z. muelleri and H. ovalis.

It is important to note that there was no whole genomic information available at the start of this project. The Z. marina assembly, which is the first seagrass draft genome, was published in February 2016, the third year after commencement of this project.

In the following chapters, genome characterization of Z. muelleri (Chapter 2), identification of genes responsible for marine adaptation and whole genome duplication events (Chapter 3), and gene comparison of multiple evolutionary lineages (Chapter 4) are discussed. Lastly, closing thoughts and future directions are detailed in Chapter 5.

37

Chapter 2. Assembly and gene annotation of the Zostera muelleri genome

2.1 The importance of obtaining a seagrass genome assembly

Assembly and gene annotation are two main components of genome analysis (Pevsner, 2009). Using whole genome sequencing data as input of assembler algorithms, genome assembly aims to reconstruct the full chromosomal sequences of the species of interest. The identification of gene structures in an assembled genome is termed gene annotation. Gene annotation programs function to distinguish open reading frames, intron and exons from intergenic and repetitive regions. The set of predicted genes is then used to represent the genic content of a species of interest in various downstream analysis, such as gene expression studies, genetic variation and trait association, gene and molecular regulation and comparative genomics between species.

This chapter describes the genome assembly and gene annotation of Zostera muelleri, a temperate southern hemisphere seagrass species. Seagrasses belong to the basal monocots, a lineage which is underrepresented among sequenced plant genomes to date. Besides representing the Alismatales order, more importantly, the assembled genome of Z. muelleri and the annotated gene set are the bases of understanding seagrass genetics. For example, comparative analysis of seagrass genes with other model plant species enables the detection of presence and absence of genes and biological pathways. At the molecular level, gene sequence comparison is useful for identifying site-based evidence of selection, which would explain marine adaptation of plants.

A complete assembly and annotation of the Z. muelleri genome is therefore important for downstream analyses. However, the level of completion largely depends on the amount of available data. For instance, Figure 2-1 shows the variability of gene annotation results based on the amount of evidence data, time and effort. The amount of publicly available seagrass data and the data produced for the purpose of this thesis, together with the expectation on the genome completion level, were evaluated in this chapter. Also, the accuracy of the results is expected to be influenced by known complexities of the plant genome. For example, allelic sequences in plant genomes with high level of heterozygosity could interfere the assembly process and inflate the size of draft genomes

38

(example in potato: Potato Genome Sequencing Consortium et al., 2011 and conifer: Nystedt et al., 2013).

In plant genomes with a large amount of duplicated and repetitive sequences, however, the size of assemblies is possibly underestimated, as assemblers tend to collapse repeat regions (Treangen and Salzberg, 2011). As genome assemblies and gene annotations are predictions lacking golden standards, the assessment of accuracy is also a major challenge. This chapter addresses practical difficulties in genome assembly and gene annotation of plant genomes, particularly in the context of a newly sequenced non-model species.

Results presented in this chapter will serve as the foundation for the analyses in the following chapters, and also for the future studies of seagrass biology.

Figure 2-1 Variations in approaches to genome annotation based on time, effort, amount of evidence and accuracy of predictions. Figure adapted from (Yandell and Ence, 2012).

39

2.2 Sequenced and publicly-available data

At the time of commencing this work, there were no publicly-available sequences of Z. muelleri. We performed shotgun sequencing of the Z. muelleri using paired-end reads and mate-paired reads (Table 2-1). These two libraries of whole genome sequencing were expected to be assembled to form a draft genome of scaffold level. To improve the accuracy of gene annotation, evidence is used to assist with intron-exon and homology prediction (Yandell and Ence, 2012). There are two main categories of evidence, 1) transcriptomic data of the species to be annotated or of a closely related species, 2) proteins of species to be annotated, or proteins of closely related species or even species more distantly related. For evidence in the first category, 9 libraries of Z. muelleri RNA-seq data, 11 libraries of Zostera marina RNA-seq data and 93,148 Z. marina EST sequences from previous publications were used (as listed in Table 2-1). Z. marina is the best studied member in the Zosteraceae family, which belongs to a different clade than Z. muelleri (Tanaka et al., 2003). Since proteins do align over large evolutionary distances and well-curated proteomes of the Alismatidae order are not available, proteomes of two monocotyledon species; Brachypodium distachyon, the grass model, and Musa acuminata, the non-grass model, were selected as evidence of the second category.

Table 2-1 Sequenced and publicly-available data as evidence for Z. muelleri genome assembly and gene annotation.

Accession Associated Usage in this Species Data type numbers/ publications project Database Genomic Genome Z. muelleri (WGS) SRR1714574 Golicz et al., 2015a assembly (paired-end) Genomic Genome Z. muelleri (WGS) SRR3398780 Lee et al., 2016 assembly (mate-paired) Genome Z. muelleri RNA-seq ERP010473 Lee et al., 2016 annotation Franssen et al., SRP035489 Gene Z. marina RNA-seq 2014 SRP022957 annotation Kong et al., 2014 Gene Z. marina EST Dr.Zompo (v2.0) Wissler et al., 2009 annotation International Brachypodium Ensembl Gene Protein Brachypodium distachyon (release-26) annotation Initiative, 2010 Ensembl Gene M. acuminata Protein D’Hont et al., 2012 (release-26) annotation 40

2.3 Summary of workflow and software used

Figure 2-2 summarises the workflow of the four main steps described in this chapter: 1) genome size estimation, 2) genome assembly, 3) gene annotation and 4) functional annotation. With the lack of flow cytometry information, the genome size of Z. muelleri is unknown. An estimation of genome size was performed by counting the occurrence of a fixed length of sequence, k (k-mers) in the paired-end whole genome shotgun sequenced reads using the software Khmer (Crusoe et al., 2015).

The paired-end reads were pre-processed for genome assembly through removal of clones and quality-based trimming. Velvet, a graph-based assembler that builds de Bruijn representation of reads k-mer (Zerbino, 2010), was used to perform de novo assembly of pre-processed reads. Distance information in mate-paired reads was then incorporated with the contigs to generate scaffolds using the program SSPACE (Boetzer et al., 2011).

The third step of the workflow was gene annotation of the assembled scaffolds. As no reference species is closely related to seagrass, model training of two predictors, SNAP (Korf, 2004) and Augustus (Stanke and Waack, 2003), and de novo repeat annotation using RepeatModeler (Smit and Hubley, 2008) was performed. A two-pass MAKER pipeline was designed to iteratively train gene predictors for higher prediction accuracy, particularly when annotating a newly sequenced species. This is discussed in detail in Section 2.5.5.

To assign functions to the predicted gene set, proteins were aligned to public databases using BLAST (Camacho et al., 2009) and InterProScan (Jones et al., 2014).

41

Figure 2-2 Workflow and software used in genome size estimation, genome assembly, gene annotation and functional annotation.

42

2.4 Materials and methods

2.4.1 DNA sequencing

Sample collection and DNA extraction were performed by Martin Schliep in the University of Technology Sydney. Z. muelleri ssp. capricorni plants were collected from Pelican Banks in Gladstone Harbour (Queensland, Australia) in November 2011, transferred with rhizomes attached in a 5–10cm deep sediment layer into 1 litre rectangular, clear plastic food storage containers, and delivered on the same day to the University of Technology Sydney. The plants were acclimatized for 2 months in an aerated and temperature- controlled mesocosm under Sydney summer conditions with weekly partial seawater changes. Leaf blades of a single plant were snap-frozen and stored in liquid nitrogen for later stage DNA extraction with a plant sample DNA extraction kit (DNeasy PowerPlant ProKit by Qiagen) according to the manufacturer’s instructions. Prior to DNA extraction, the plant material was ground to a fine powder in liquid nitrogen with a pre-chilled mortar and pestle. Genomic DNA was sequenced using an Illumina HiSeq 2000-SBS v3.0 sequencer with 100bp paired-end (PE) technology and an insert size of 304 bp at the Ramaciotti Centre at the University of New South Wales (UNSW; NCBI: PRJNA253152). The libraries for genome sequencing were prepared with the Illumina Tru-seq DNA-seq kit, according to the manufacturer’s instructions. Reads were deposited in the short read archive (SRR1714574).

2.4.2 Genome size estimation

A total of 7,750,350 sequenced reads were used as input to the program Khmer (Crusoe et al., 2015) for k-mer counting. Khmer uses Bloom filters for effective hashing and calculates abundance distribution of 17-mer with the following commands:

43

READS="MB_Z_ATCACG_L006_R1and2.fastq" READ1=”MB_Z_ATCACG_L006_R1.fastq” READ2=”MB_Z_ATCACG_L006_R2.fastq” HASH=”reads.kh” HIST=”reads.hist”

cat $READ1 $READ2 > $READS

load-into-counting.py -k 17 -N 4 -T 8 -x 2.5e9 $HASH $READS

abundance-dist.py $HASH $READS $HIST

The genome size, G, was calculated from the formula G=K_num/K_depth, where K_num is the total number of k-mer and K_depth is the major peak of frequency k-mer occurrences.

2.4.3 Preprocessing of reads

FastQC (Andrews, 2010) performed on the sequenced reads using default parameters for quality checks.

The reads were preprocessed for clone-removal using a custom perl script “remove_possible_clones.pl” and quality-based trimming using Sickle (Joshi and Fass, 2011) with the following commands:

READ1=”MB_Z_ATCACG_L006_R1.fastq” READ2=”MB_Z_ATCACG_L006_R2.fastq”

remove_possible_clones.pl –a $READ1 –b $READ2 –length 80 –threshold 1

sickle pe –t sanger –f $READ1_cr –r $READ2_cr –l 80 –q 20 –o $READ1_cr_cleaned –p $READ2_cr_cleaned -s single.fastq

2.4.4 Selecting optimal k using a subset of reads

About 10% of all reads were subsampled for optimization of the k value in Velvet assembly. Values ranging from 15 to 99 were used in Velvet runs. Commands were as below:

44

#subsample

seqtk sample -2 -s100 r1_cr_cleaned.fastq 0.1 > r1_cr_cleaned_sampled.fastq

seqtk sample -2 -s100 r2_cr_cleaned.fastq 0.1 > r2_cr_cleaned_sampled.fastq

#interleave python interleave_fastq.py -l -o r1_cr_cleaned_sampled.fastq r2_cr_cleaned_sampled.fastq subsampled.fastq

#velvet

VelvetOptimizer.pl –s 15 –e 99 –f ‘-shortPaired –fastq subsampled.fastq’

2.4.5 Assembly of reads

The left and right reads were interleaved into one file using a custom python script “interleave_fastq.py” and the reads were assembled using Velvet with the optimal k as 69:

READ1=”MB_Z_ATCACG_L006_R1.fastq_cr_cleaned” READ2=”MB_Z_ATCACG_L006_R2.fastq_cr_cleaned”

kmer=69

python interleave_fastq.py -l $READ1 $READ2 -o MB_Z_ATCACG_L006_interleaved.fastq

velveth \ Assem $kmer \ -shortPaired -fastq r1r2_interleaved_cr_cleaned.fastq \ -short –fastq single.fastq

velvetg \ Assem \ -exp_cov auto -cov_cutoff auto

2.4.6 Scaffolding of contigs

Contigs larger than 500 bp were scaffolded with SSPACE version 2.0 (Boetzer et al., 2011) using mate-paired reads of 2 kb insert size (SRR3398780). Commands as below:

Libraries.txt: MP bowtie MP_AH7GB5ADXX_CTTGA_L002_R1.fastq \ MP_AH7GB5ADXX_CTTGA_L002_R2.fastq 2000 0.75 FR

perl SSPACE_Basic_v2.0.pl -l libraries.txt -s contigs.fa.abv500 -x 1 -T 1 -b scaffolded -v 1

45

2.4.7 Error correction using REAPR

Read mapping and REAPR analysis (Hunt et al., 2013) were performed on Z. muelleri scaffolds with the commands below:

SCAFFOLD_PREFIX="Zmuelleri_scaffolds” MP1_PREFIX="Seagrass_MP_AH7GB5ADXX_CTTGTA_L002_R1" MP2_PREFIX="Seagrass_MP_AH7GB5ADXX_CTTGTA_L002_R2" PE1_PREFIX="MB_Z_ATCACG_L006_R1_001" PE2_PREFIX="MB_Z_ATCACG_L006_R2_001"

$REAPR smaltmap $SCAFFOLD_PREFIX.fa $MP1_PREFIX.fastq $MP2_PREFIX.fastq long_mapped.bam

$REAPR perfectmap $SCAFFOLD_PREFIX.fa $PE1_PREFIX.fastq $PE2_PREFIX.fastq 300 short

$REAPR pipeline $SCAFFOLD_PREFIX.fa long_mapped.bam out short

2.4.8 Digital normalization of reads

Preprocessed reads one and two were interleaved and normalized using commands below:

R1=”r1_cr_cleaned.fastq” R2=”r2_cr_cleaned.fastq” OUT=”r1r2_interleaved_cr_cleaned.fastq”

#Interleave reads sed -i ‘s| 1:N:0:ATCACG|/1’ $R1 sed -i ‘s| 2:N:0:ATCACG|/2’ $R2 python /home/jlee/scripts/interleave_fastq.py -l $R1 -r $R2 -o $OUT

#Normalization module load khmer/2.0 normalize-by-median.py -C 20 -k 20 -N 4 -x 64e9 -p --savehash hashtable.kh $OUT

2.4.9 Evaluation of genic completeness

CEGMA version 2.4 (Parra et al., 2007) was run on the Z. muelleri genome assembly using default parameters:

46

GENOME=”Zmuelleri_scaffolds.fa” PREFIX=”scaffold”

cegma -g $GENOME -o $PREFIX -T 4 -v --tmp --ext

The early release of plant dataset and necessary software of the program BUSCO (Simao et al., 2015) was downloaded upon request to the authors. BUSCO was run on Z. muelleri and seven other genome assemblies (downloaded from Phytozome: Z. marina v2.1, Amborella trichopoda contigs v1.0, A. trichopoda scaffolds v1.0, Linum usitatissimum v1.0, Brassica napus v1.4 Darmor-bzh, M. acuminata v1, A. thaliana TAIR10) using default parameters:

GENOME=”Zmuelleri_scaffolds.fa” PREFIX=”ZMU”

python3 BUSCO_plants.py -in $GENOME -o $PREFIX -l /home/jlee/downloads/BUSCO_v1.1b1/plant_early_release/plantae -m genome -f -c 12

2.4.10 De novo repeat annotation

RepeatModeler (Smit and Hubley, 2008) which employs two repeat-finding programs, RECON (Bao and Eddy, 2002) and RepeatScout (Price et al., 2005), was used to identify de novo repeat families and build a Z. muelleri-specific repeat library. Command is as below:

module load repeatmodeler/1.0.5

BuildDatabase -name RepeatModeler -database

2.4.11 Assembly of the RNA-seq evidence data

RNA-seq data of Z. muelleri and Z. marina were downloaded in the SRA (short read archive) format and converted to FASTQ. Libraries of reads were combined and assembled by species using Trinity (Haas et al., 2013) to form transcripts. The commands are as below:

47

wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra- instant/reads/ByRun/sra/SRR/SRR866/SRR866326/SRR866326.sra wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra- instant/reads/ByRun/sra/SRR/SRR866/SRR866328/SRR866328.sra wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra- instant/reads/ByRun/sra/SRR/SRR866/SRR866329/SRR866329.sra wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra- instant/reads/ByRun/sra/SRR/SRR866/SRR866327/SRR866327.sra wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra- instant/reads/ByRun/sra/SRR/SRR117/SRR1171048/SRR1171048.sra wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra- instant/reads/ByRun/sra/SRR/SRR118/SRR1180229/SRR1180229.sra wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/sralite/ByExp/litesra/ERX/ERX963/ERX963450/ ERR884884/ERR884047.sra wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/sralite/ByExp/litesra/ERX/ERX963/ERX963451/ ERR884884/ERR884048.sra wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/sralite/ByExp/litesra/ERX/ERX963/ERX963452/ ERR884884/ERR884049.sra wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/sralite/ByExp/litesra/ERX/ERX963/ERX963453/ ERR884884/ERR884050.sra wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/sralite/ByExp/litesra/ERX/ERX963/ERX963454/ ERR884884/ERR884051.sra wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/sralite/ByExp/litesra/ERX/ERX963/ERX963455/ ERR884884/ERR884052.sra wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/sralite/ByExp/litesra/ERX/ERX963/ERX963456/ ERR884884/ERR884053.sra wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/sralite/ByExp/litesra/ERX/ERX963/ERX963457/ ERR884884/ERR884054.sra wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/sralite/ByExp/litesra/ERX/ERX963/ERX963458/ ERR884884/ERR884055.sra module load sratoolkit/2.3.5-2 fastq-dump --split-files module load trinityrnaseq/20140717 Trinity --seqType fq --left $LEFT --right $RIGHT --SS_lib_type FR --trimmomatic --CPU 16 --JM 60G --output trinity_out_dir2 Trinity --seqType fq --single $LEFT --SS_lib_type F --trimmomatic --CPU 16 --JM 60G

TrinityStats.pl Trinity.fasta

48

2.4.12 Downloading of the protein evidence data

The protein sequences of M. acuminata and B. distachyon were downloaded from ENSEMBL with the commands below:

wget ftp://ftp.ensemblgenomes.org/pub/plants/release- 26/fasta/musa_acuminata/pep/Musa_acuminata.MA1.26.pep.all.fa.gz wget ftp://ftp.ensemblgenomes.org/pub/plants/release- 26/fasta/brachypodium_distachyon/pep/Brachypodium_distachyon.v1.0.26.pep.all.fa.gz

2.4.13 Model training of gene predictors SNAP and Augustus

SNAP (Korf, 2004) was trained with the assembled Z. muelleri transcriptome using the “est2genome” option in the MAKER pipeline. The transcripts were directly used to identify genes in the genome, without ab initio predictions. Following are the changes made in the MAKER config file “maker_opts.ctl”:

genome=04.break.broken_assembly.fa.abv1000.renamed est=Trinity_zmuelleri.fasta model_org=Liliopsida rmlib=consensi.fa.classified repeat_protein=/ws/ws-group/app/maker/data/te_proteins.fasta est2genome=1

MAKER was run and the results were used to train SNAP with the following commands:

nohup /ws/ws-group/app/mpich/bin/mpiexec -n 10 /ws/ws-group/app/maker/bin/maker -- fix_nucleotides 2> maker.err 1> maker.out /ws/ws-group/app/maker/bin/gff3_merge -d 04.break.broken_assembly.fa.abv1000_master_datastore_index.log /ws/ws-group/app/maker/bin/maker2zff 04.break.broken_assembly.fa.abv1000.all.gff /ws/ws-group/app/snap_2013_11_29/fathom -categorize 1000 genome.ann genome.dna /ws/ws-group/app/snap_2013_11_29/fathom -export 1000 -plus uni.ann uni.dna /ws/ws-group/app/snap_2013_11_29/forge export.ann export.dna /ws/ws-group/app/snap_2013_11_29/hmm-assembler.pl zmuelleri . > zmuelleri.hmm

49

Augustus (Stanke and Waack, 2003) was trained using the set of conserved genes identified by CEGMA during quality assessment of the assembly:

CEGMA_GFF=” scaf.cegma.gff” TRAIN_GFF=” augustus-training.gff”

export AUGUSTUS_CONFIG_PATH=/ws/ws-group/app/augustus-3.0.2/config/ /ws/ws-group/app/augustus-3.0.2/scripts/new_species.pl --species=zmu /ws/ws-group/app/augustus-3.0.2/scripts/cegma2gff.pl $CEGMA_GFF > $TRAIN_GFF /ws/ws-group/app/augustus-3.0.2/scripts/gff2gbSmallDNA.pl augustus-training.gff SSPACE_trimmed_out.final.scaffolds.fasta.renamed 1000 cegma_zm.gb /ws/ws-group/app/augustus-3.0.2/scripts/optimize_augustus.pl --cpus=12 --species=zmu cegma_zm.gb > optimize_augustus.log /ws/ws-group/bin/etraining --species=zmu cegma_zm.gb > etraining.log

2.4.14 Gene annotation using the MAKER pipeline

Iterative runs of MAKER (Cantarel et al., 2008) were designed to improve the gene training process and subsequently gene prediction. The first pass was run with the following parameters in the MAKER config file “maker_opts.ctl”:

genome=04.break.broken_assembly.fa.abv1000.renamed altest=Zoma_A_unigenes.fasta, Zoma_B_unigenes.fasta, Zoma_C_unigenes.fasta, Trinity_zmarina_SRR8663.renamed.fasta, Trinity_zmarina_SRR11.renamed.fasta protein=Brachypodium_distachyon.v1.0.26.pep.all.fa, Musa_acuminata.MA1.26.pep.all.fa model_org=Liliopsida rmlib=consensi.fa.classified repeat_protein=/ws/ws-group/app/maker/data/te_proteins.fasta snaphmm=zmuelleri.hmm augustus_species=zmu min_contig=5000

Retraining of SNAP and Augustus was done using the predictions of the first pass MAKER run:

50

/ws/ws-group/app/maker/bin/gff3_merge -d 04.break.broken_assembly.fa.abv1000_master_datastore_index.log /ws/ws-group/app/maker/bin/maker2zff -n 04.break.broken_assembly.fa.abv1000.all.gff /ws/ws-group/app/snap_2013_11_29/fathom -categorize 1000 genome.ann genome.dna /ws/ws-group/app/snap_2013_11_29/fathom -export 1000 -plus uni.ann uni.dna /ws/ws-group/app/snap_2013_11_29/forge export.ann export.dna /ws/ws-group/app/snap_2013_11_29/hmm-assembler.pl zmuelleri_retrained . > zmuelleri_retrained.hmm

perl zff2augustus_gbk.pl export.ann export.dna > train.gb export AUGUSTUS_CONFIG_PATH=/ws/ws-group/app/augustus-3.0.2/config/ /ws/ws-group/app/augustus-3.0.2/scripts/new_species.pl --species=zmu_retrained /ws/ws-group/app/augustus-3.0.2/scripts/optimize_augustus.pl --species=zmu_retrained train.gb > optimize_augustus.log

The second-pass of MAKER was run with the following changes in the MAKER config file “maker_opts.ctl”:

snaphmm=zmuelleri_retrained.hmm augustus_species=zmu_retrained

2.4.15 Functional annotation

Predicted proteins were compared with TIGRFAM, ProDom, Panther, PfamA, PrositeProfile and PrositePatterns using InterProScan (version 5.3) (Jones et al., 2014), for motif and domain annotation. Two hundred and one proteins containing transposase and reverse transcriptase-associated domain signatures were removed from the gene set. Commands are as following:

INPUT=”zmu_proteins.fa” OUT=”zmu_proteins_iprs.gff”

interproscan.sh -appl TIGRFAM,Panther,PfamA,PrositeProfile,PrositePatterns -i $INPUT -o $OUT -iprlookup -goterms -pa -f gff3

Gene functions were annotated from the best match alignments of predicted genes to Swiss-Prot and TrEMBL plant proteins (release 2015_05) using Blastx (BLAST+ v2.28)

51

(Camacho et al., 2009) with the following parameters: “blastx -evalue 0.00001 - max_target_seqs 1” and further filtered with percentage of identical matches >= 40 and percentage of subject sequence covered >= 60.

52

2.5 Results and Discussion

2.5.1 Genome size estimation

The distribution plot of 17-mer in Z. muelleri paired-end reads approximate to a Poisson distribution with two peaks (Figure 2-3). Based on the k-mer abundance, the total genome size of Z. muelleri was estimated to be 889 Mb.

Flow cytometry data of Z. muelleri is not available, but the 2C value of Z. marina which has 2n=12 is 1.22 pg (Koce et al., 2003). 2C values are used to represent the mass of diploid DNA content, and can be converted to number of base pairs based on 1pg = 978 Mb (Dolezel and Bartos, 2005). The genome of Z. muelleri which has 2n=24 is therefore expected to approximate to 1 Gb in size, which is consistent with the k-mer abundance estimation.

The high frequency of unique k-mers represent sequencing errors. A major peak at 35 represents the sequencing coverage of all the single copy sequences in the genome. The minor peak at 75 reflects the repetitive structures in the genome (Figure 2-3). Since the occurrence of the minor peak is double of that of the single copy genome occurrence, the repetitive sequences are likely to be a result of whole genome duplication events. A shoulder at 18 suggests relatively low heterozygosity. Seagrass species have varying degree of clonal diversity (Reusch et al., 2000; Hughes and Stachowicz, 2004; Hughes et al., 2008). Z. muelleri, as well as more than 40% of all seagrass species, is monoecious (Short et al., 2011). Together with the clonal structure of meadows, this potentially increases the likelihood for self-pollination (Reusch, 2001). However, studies on highly clonal seagrass species such as P. oceanica and P. australis and, found effective hydrophilous pollination in various water conditions, which encourages outcrossing (Arnaud-Haond et al., 2007; Sinclair et al., 2014).

Genome size varies largely across plant species (Bennetzen and Wang, 2014; Michael, 2014; Wendel et al., 2016) even within a genus. There is also no clear correlation between size and speciation (Kraaijeveld, 2010). The knowledge of the genome size of a species is therefore less informative biologically or phylogenetically. It is, however, important for measuring the completion of genome assembly and understanding the genome architecture (Wendel et al., 2016), such as polyploidisation and activity of transposable elements.

53

Figure 2-3 Distribution of 17-mer coverage of Z. muelleri libraries.

2.5.2 Genome assembly

Reads were pre-processed by clonal pair removal and quality trimming, with an average Phred threshold of 20 within an 80 bases window. A total of 1,705,962 reads (0.89%) have an average quality Phred score of lower than 20. The per base Phred distribution of the paired end reads (Figure 2-4) shows relatively large standard deviation at the end of reads. A total of 3,875,175 paired reads (2%) were identified as duplicated. Table 2-2 summarises the number of reads and nucleotides after each pre-processing step. A total of 179,178,513 (89%) of paired-end reads were retained.

Phred score is a metric to indicate the accuracy of a given base called by the sequencer, for example, a Phred score of 20 has a 1 in 100 probability of an incorrect base call (Ewing

54

and Green, 1998). When in high percentage, duplicated reads interfere with read coverage estimation and are therefore advisable to be removed as a pre-assembly step (Ekblom and Wolf, 2014).

The processed paired-end reads were assembled to form 237,933 contigs with a total length of 657,815,437 nucleotides. Mate-pair information was incorporated to contigs to form 45,719 scaffolds with a N50 of 47,119. Table 2-3 shows the statistics of assembled contigs and scaffolds. Misassemblies such as large structural and small base errors were identified using REAPR (Recognition of Errors in Assemblies using Paired Reads) (Hunt et al., 2013). REAPR utilizes mapped paired-end and mate-pair reads to measure accuracy of the assembly and introduce gaps at incorrect scaffolds. The corrected scaffolds increase slightly in number to 56,823 with a N50 length of 36,732 bp. Over 80% of the draft assembly is represented by 12,904 scaffolds and the length of the largest scaffold is 327,369 bp. The total length of the scaffolds is 632 Mbp, which is about 71% of the estimated genome size.

Table 2-2 Number of reads and nucleotides retained for each pre-processing step

Total reads Total nucleotides Read 1 Read 2 Single reads (Percentage (Percentage retained) retained) Raw 382,283,286 38,610,611,886 191,141,643 191,141,643 0 reads (100%) (100%) After 374,532,936 37,827,826,536 clone 187,266,468 187,266,468 0 (98.0%) (98.0%) removal After 353,853,079 35,585,827,798 quality 170,316,193 170,316,193 13,220,693 (92.6%) (92.2%) filtering

55

a)

b)

Figure 2-4 Distribution of per base Phred quality score in a) read one and b) read two of the Z. muelleri whole genome shotgun sequencing.

56

Table 2-3 Assembly statistics of contigs and scaffolds.

Contigs Scaffolds Final Scaffolds

Number Length (bp) Number Length (bp) Number Length (bp)

Total 235,854 666,249,128 45,719 632,821,539 56,823 632,070,940 number (2.91) (3.38) (3.49) (% of Ns)

Total 20,755 446,438,073 14,657 571,260,147 16,674 552,311,609 number (>=10 kbp)

N90 50,847 2,042 14,488 10,250 18,533 7,926

N50 11,859 16,095 4,043 47,119 5,114 36,732

Longest - 136,127 - 327,369 - 303,880

Mean - 2,825 - 12,144 - 11,124 length

2.5.3 Effect of data normalization in plant genome assembly

To investigate the effect of data normalization on the assembly of plant genomes, normalized reads were assembled using Velvet to form contigs and compared with assembly from unnormalized reads (Table 2-4). Data normalization reduced the total pre- processed reads by 54%. 15 hours 40 mins were used to complete the assembly, with 64.4 GB memory consumed. Unnormalized reads were assembled in 22 hours 15 mins with 82.2 GM of memory consumed. Data normalization therefore shortened the assembly process by nearly 7 hours with about 20 GB less memory consumption. The contigs of normalized data was about 100 Mb shorter with lower N50 and N90, when compared to contigs of unnormalized data.

Based on the N50, N90 and total length of contigs larger than 10 kbp, the contigs produced using normalized data were more fragmented overall. Data normalization also discarded reads that would contribute to the assembly of about 100 Mb of sequences. The results of this analysis implied that data normalization is effective in reducing runtime and memory consumption of genome assembly. The quality of plant genome assemblies however, is compromised by normalized data. 57

Digital normalization is an algorithm developed to down-sample sequencing data to normalize read coverage and remove errors (Titus Brown et al., 2012). It is useful to reduce computing resources required in subsequent steps of analysis, such as the assembly of genomic and transcriptomic data. The Velvet program prioritized specificity over memory efficiency (Abbas et al., 2014). Therefore the larger the amount of data and higher the genome complexity, a larger graph needs to be generated where memory consumption increases. De novo assembly of the plant genome is constrained by features of high performance supercomputers available, such as maximum memory per node and time limit per job. One possible solution is the usage of digital normalization on whole genome shotgun reads to reduce memory consumption during assembly. Normalized bacterial, human and insect genomes had been demonstrated to be assembled without compromising the quality (Kleftogiannis et al., 2013). However, the effect of digital normalization on plant genomes is unclear. Using the Z. muelleri genome assembly as an example, we showed that digital normalization produced a shorter and more fragmented assembly. It is possible that the repetitive nature and polyploidy of plant genomes requires high coverage of sequencing data to be resolved, and a reduced coverage causes fragmentation.

Data normalization is therefore not a compatible solution to reduce computational resources in plant genome research. With the development of newer assemblers that prioritized memory efficiency, such as the recent implementation of a sparse de Bruijn pregraph method in SOAPdenovo2 (Luo R et al., 2012), it is possible that this is a lesser concern in the near future.

58

Table 2-4 Assembly statistics of digitally normalized and unnormalized reads.

Normalized reads Unnormalized reads

Number of reads 164,756,031 353,853,079

(% of total reads) (46.56%) (100%)

Time taken for 15 hours 40 mins 22 hours 15 mins assembly

Memory consumed 64,417,460 KB 82,185,736 KB

Number Length (bp) Number Length (bp)

Total number 155,390 562,041,448 235,854 666,249,128

(% of Ns) (2.63) (2.91)

Total number (>=10 16,867 300,216,383 20,755 446,438,073 kbp)

N90 59,661 1,902 50,847 2,042

N50 15,025 10,865 11,859 16,095

Longest - 144,306 - 136,127

Mean length - 3,617 - 2,825

2.5.4 Quality assessment of the assembled genome

The quality of the draft assembly was evaluated using Core Eukaryotic Genes Mapping Approach (CEGMA; Parra et al., 2007) and Benchmarking Universal Single-Copy Orthologs (BUSCO; Simao et al., 2015). Table 2-5 shows 81% and 77% of complete orthologs were identified for CEGMA and BUSCO respectively. A total of 78% of the complete orthologs identified by CEGMA are duplicated in Z. muelleri, whereas only about 50% of complete BUSCO orthologs are duplicated. Analaysis revealed that 6.5% of CEGs and 17.5% of BUSCOs were not found in the assembly.

To determine whether the number of missing orthologs is a direct measure of genic incompleteness in the assembly, the assemblies of six other species were searched using the BUSCO dataset and compared to Z. muelleri (Figure 2-5). The assemblies of M. acuminata (3% BUSCOs missing), B. napus (3% BUSCO missing) and A. thaliana (2% BUSCOs missing) in chromosomal-level served as well-assembled standards. In order to

59

address the 17% missing BUSCOs in the Z. muelleri assembly, Z. marina and the contigs of A. trichopoda, instead of the scaffolds, were analysed, to control for the degree of conservation of BUSCO orthologs and the comparability of assembly statistics. Since the A. trichopoda contigs have 76% of complete BUSCOs, whereas the scaffolds have 93% (results not shown), it is highly likely that the amount of missing BUSCOs in Z. muelleri is due to the level of assembly. Interestingly, the number of duplicated BUSCOs detected in Z. muelleri is more than double of that in Z. marina. A total of 63.7 % of CEGMA genes were also duplicated in Z. muelleri. Two species which recently underwent whole genome duplication (WGD) were included for comparison. The B. napus genome is highly multiplied due to allopolyploidy (Chalhoub et al., 2014) and triplication around 13-17 Mya (Yang et al., 1999; Town et al., 2006). As expected, 90% of BUSCOs are duplicated in B. napus. Similarly, in L. usitatissimum that was duplicated 5-9 Mya (Wang et al., 2012), 69% of BUSCOs are duplicated. In contrast, the A. trichopoda genome which only retains evidence from the ancient epsilon WGD (Amborella Genome Project, 2013) has 8% of duplicated BUSCOs. The presence of three lineage-specific WGDs in M. acuminata dated at around 65-100 Mya (D'Hont et al., 2012), however, is not reflected in the percentage of duplicated BUSCOs, most possibly because these WGDs are relatively ancient.

CEGMA uses hidden Markov model (HMM) profiles of 248 highly conserved and least paralogous core eukaryotic genes to define the level of genic completeness in a draft assembly. Alignments are then categorized as complete or partially covered, and containing single or duplicated copies. In 2015, BUSCO was available as an improved version of CEGMA. BUSCO uses larger lineage-specific core genes to solve the inconsistency of CEGs across diverse species. CEGMA and BUSCO orthologs are expected to be single-copy as they were selected through exclusion of paralogs that exist in multiple species (three or more out of six species in CEGMA (Parra et al., 2007); more than 10% out of all species selected per lineage (Simao et al., 2015)). The level of redundancy in an assembly is estimated using the single-copy metrics, where high number of duplicated orthologs indicates a highly erroneous assembly (Simao et al., 2015). Based on the BUSCO results of species with known recent WGD events in Figure 2-5, such as L. usitatissimum and B. napus, it is obvious that this assumption is not applicable to plant genomes. Besides, small scale duplications, which are common in plant genomes (Veeckman et al., 2016), can also affect the accuracy of this measurement.

Not only is the variety of ploidy and frequency of whole-genome duplications in different plant species large (Adams and Wendel, 2005), but the representation of plant species in 60

constructing these orthologous sets is also low (A. thaliana - one out of six in CEGMA; 20 species in BUSCO) to assign “single-copy” status to genes. As shown by the BUSCO results of Z. muelleri and six other species, the measurement of genic incompleteness using the number of missing orthologs is affected by two other factors, 1) the degree of conservation of the orthologous datasets in organisms that are distantly-related from model species, and 2) the assembly level (contigs/scaffolds/pseudochromosomes) of the draft genome. As seagrass belongs to an early branch of monocots and lacks a sequenced model species, it is unknown whether all the orthologs searched by CEGMA and BUSCO are indeed shared in seagrasses. From these results, two conclusions are drawn: 1) the single-copy orthology status of BUSCO plant dataset is valid in genomes with no recent duplications or chromosomal doubling, 2) Z. muelleri has more duplicated genes than another Zosteraceae member in a sister clade.

Four methods of assessment in two main categories were compared in a recent review (Veeckman et al., 2016); CEGMA, BUSCO and coreGFs (Van Bel et al., 2012) which are conserved gene set-based, and transcript mapping which is species-specific. Differences between completeness scores of each method were detailed in 12 plant genomes (Figure 2-6). Two main messages were conveyed: 1) Methods using conserved gene set evaluation overlooked the issue of fungal contamination in Cicer arietinum, but this is reflected in low EST mapping score. This emphasizes the importance of evaluating genomes with both transcript and conserved gene sets to avoid misassumptions. 2) In a few instances, coreGF which uses predefined evolutionary levels scored lower. For example only 83% of monocot coreGFs, which were defined with Poales gene set, were found in Phalaenopsis equestris. As P. equestris is not a commelinid, this again indicates the limitation in assessing gene space in non-model plants, as argued above.

61

Table 2-5 Assessment results of Z. muelleri scaffolds using CEGMA and BUSCO orthologous groups.

Parameter CEGMA (Percentage of BUSCO (Percentage of total total orthologs searched) orthologs searched)

Complete orthologs 201 (81.0) 737 (77.1)

Single-copy 43 365

Duplicated 158 372

Fragmented orthologs 31 (12.5) 52 (5.4)

Missing orthologs 16 (6.5) 167 (17.5)

Total orthologs 248 (100.0) 956 (100.0)

62

Figure 2-5 Stack plots of the categories (complete single copy, complete duplicated copy, fragmented and missing) of BUSCO groups searched in draft assemblies of Z. muelleri and six other plant species (Z. marina, A. trichopoda, L. usitatissimum, B. napus, M. acuminata, A. thaliana)

63

Figure 2-6 Comparison of CEGMA, BUSCO, coreGF and EST completeness scores for twelve genomes. Figure adapted from (Veeckman et al., 2016). .

64

2.5.5 Predicting the Z. muelleri gene set

2.5.5.1 De novo repeat annotation

De novo and homology-based repeat prediction using RepBase (version 18.06) identified 339 Mb (56%) of repetitive sequence in the assembly (Table 2-6). Out of the 339 Mb of repeats, 83 Mb are categorized as retrotransposons, 23 Mb are DNA transposons and the remaining include simple sequence repeats (SSR), low complexity sequences, rRNA, satellite and others. Within the two groups of transposons, 22% is made up of LTR retrotransposons and 7% for non-LTR retrotransposons and DNA transposons respectively. The frequency ratio of LTR retrotransposon superfamilies Gypsy/Copia is 2.1, whereas subclass I DNA transposons are more abundant than subclass II.

The process of accumulation (Neumann et al., 2006; Piegu et al., 2006) and elimination of repetitive sequences (Hawkins et al., 2009; Renny-Byfield et al., 2011), particularly LTR- retrotransposons, along with whole genome duplication events, has been shown to be key drivers of genome size evolution in plant species. This is indicated by the positive correlation of the amount of repetitive sequences and genome size across species (El Baidouri and Panaud, 2013; Michael, 2014), with relatively stable length and number of protein-coding sequences. The composition of repeats in a genome is therefore highly related to evolutionary distances between species. Also, as each category of transposable elements has different insertional properties, length and amplification profiles, repeat annotation is important to predict the potential of genes being impacted by transposons in each species (Vitte et al., 2014).

The reported amount of repetitive sequences varies hugely in species, such as 85% in maize (Schnable et al., 2009) and 25% in duckweed (Wang et al., 2014). In Z. marina, 63% of the genome was annotated as repeats (Olsen et al., 2016). In most published plant genomes, the repeat content is around 57% (Michael, 2014), which approximates to the repeats found in Z. muelleri. The long terminal repeat (LTR) retrotransposon is known to be the most abundant subclass in plant repeats (Gaut and Ross-Ibarra, 2008). The ratio of the two best characterised and most abundant LTR retrotransposons, Gypsy and Copia is species specific (Vitte et al., 2014). In papaya (Ming et al., 2008), sorghum (Paterson et al., 2009), rice (McCarthy et al., 2002) and poplar (Natali et al., 2015), the prevalence of Gypsy is higher than Copia, whereas in the grape genome, the amount of Copia is double that of Gypsy (Jaillon et al., 2007). The frequency ratio found in Z. muelleri, 2.1, is 65

comparable to other seagrasses; P. oceanica and Z. marina, which were recently reported as 1.8 (Barghini et al., 2015) and 1.7 (Olsen et al., 2016). Also consistent with other plant species, non-LTR retrotransposons and DNA transposons were found to be poorly represented in Z. muelleri. As the structural features of DNA transposons are less understood (Du et al., 2006), the amount of annotated DNA transposons is likely to be underestimated. About half of the annotated repeats in the assembly categorized as “unknown” represent both the de novo sequences characterized by RepeatScout and RECON, and the unclassified known repeats in RepBase.

Complete identification of repeat regions in a genome and the masking of these regions are two essential steps prior to gene prediction. This is because gene predictors are unable to differentiate between transposon and gene open reading frames, causing a huge increase in false positives (Yandell and Ence, 2012).

66

Table 2-6 Repetitive sequences in Z. muelleri scaffolds.

Class Subclass Superfamilies Copy Number Size (bp) % Retrotransposon LTR retrotransposon Gypsy 109,973 88,361,874 14.48 Copia 90,120 40,846,715 6.7 Caulimovirus 9 581 0 ERV1 290 133,853 0.02 Pao 2,382 2,142,352 0.35 Others 1,024 532,856 0.09 Non -LTR retrotransposon 34,058 18,092,892 2.97 DNA transposon Subclass I hAT -Tag1 16,939 8,958,321 1.47 hAT-Tip100 3,830 1,162,568 0.19 hAT-Ac 8,270 2,440,575 0.4 CMC-EnSpm 18,752 10,701,684 1.75 MuDR 10,505 7,965,979 1.31 En-Spm 8,431 5,089,984 0.83 PIF-Harbinger 2,006 4,924,949 0.81 MULE-MuDr 1,757 301,922 0.05 MULE-NOF 95 45,958 0.01 TcMar-Stowaway 15 1,125 0 Subclass II Helitron 517 147,523 0.02 Others 139 50,448 0.01 rRNA 510 108,607 0.02 Satellite 55 5,380 0 Low complexity 31,882 1,594,884 0.26 Simple repeat 141,135 12,013,786 1.97 Unknown 208,109 133,550,359 21.89 Total 580,830 339,175,175 55.6

67

2.5.5.2 Selection of software and the gene annotation workflow

MAKER is an annotation pipeline that masks repeat regions, handles evidence-based information, predicts genes and finally scores annotation (Holt and Yandell, 2011). It provides high flexibility in workflow, parameters, evidence file format and selection of gene prediction programs. There are many gene predictors available. Four of them which are most commonly used; Augustus (Stanke and Waack, 2003), SNAP (Korf, 2004), GeneMark (Ter-Hovhannisyan et al., 2008) and FGENESH (Salamov and Solovyev, 2000), can be run internally within the MAKER pipeline. They have been shown to produce results with comparable sensitivity and specificity at the gene, exon and nucleotide level when provided with enough training data (Goodswen et al., 2012; Zickmann and Renard, 2015). The general trend (Guigo et al., 2006; Knapp and Chen, 2007; Goodswen et al., 2012) of prediction accuracy is in ascending order of the following: GeneMark, SNAP and Augustus. Augustus is recognised as a highly accurate ab initio predictor (Holt and Yandell, 2011), but memory intensive and time consuming to train. SNAP is easily trained but has a tendency to make short and partially overlapped predictions that actually belonged to one gene (Goodswen et al., 2012). FGENESH is a commercial service developed by Softberry Inc.

To annotate the Z. muelleri genome, iterative runs of the MAKER pipeline using Augustus and SNAP as predictors was performed (Figure 2-7). To train the gene predictors for a Z. muelleri-specific model, conserved genes identified by CEGMA (Section 2.5.3) and assembled Z. muelleri RNA-seq data were used in Augustus and SNAP, respectively. Using transcriptome and ESTs from Z. marina, and proteomes from two monocots B. distachyon and M. acuminata, the first round of gene prediction is performed on repeat- masked genome through the MAKER pipeline using trained models. Next, an iterative process where models are retrained is performed with the predicted gene set. Finally, the MAKER pipeline was rerun to obtain a final set of consensus genes.

68

Figure 2-7 Workflow of model training and gene annotation using the MAKER pipeline. The dotted line arrows indicate the second-pass annotation.

2.5.5.3 Problems faced in predicting genes of a newly sequenced genome and solutions incorporated in the gene annotation workflow

Genomic characteristics such as codon frequencies, intron-exon length and GC content, are used as parameters for ab initio gene predictors to correctly identify genic regions (Yandell and Ence, 2012). Most gene predictors provide pre-calculated parameters for model organisms, however only a limited number of plant species were included (Table 2- 7). Both SNAP and GlimmerHMM provided parameters for rice and A. thaliana, the standard models for monocot and dicot, respectively. Augustus provided parameters for maize and tomato, in addition to A. thaliana. GeneMark-ES featured an unsupervised training algorithm which iteratively self-trains using the genome to be annotated, therefore eliminating the need to build species-specific parameters (Ter-Hovhannisyan et al., 2008).

69

The nucleotide landscape is highly species-specific. For example, the A. thaliana genome is GC-poor and homogeneous, whereas the rice genome is generally GC-rich and heterogeneous (Wang and Hickey, 2007). Even within the grass family, the variation of GC pattern is large (Serres-Giardi et al., 2012). The effect of predicting genes using interspecies parameters was shown in Korf, 2004 using the predictor SNAP. The author concluded that gene prediction accuracy is hampered when using parameters of species with significant GC compositional and codon usage differences, emphasising the importance of reliable genomic parameters. Evaluation performed on small eukaryotic parasitic genome also highlighted the need to use models specifically trained for organism to be annotated (Goodswen et al., 2012).

A Z. muelleri-specific model needs to be trained prior to gene annotation for two main reasons, 1) species-specific parameters is preferred for accurate gene prediction, and 2) pre-calculated parameters are limited to four model plant species. Model training involves first collecting high-confidence data and then training a choice of gene predictor with the data (Yandell and Ence, 2012). Valid training data should contain a large number (more than 200) of complete coding sequences with splice sites and exons that are manually curated, or at least well-supported by mRNA data (Coghlan et al., 2008). These criteria are challenging for a newly sequenced species, as available data are usually scarce and experimental (Holt and Yandell, 2011).

Published work of de novo species include solutions to constructing a reliable training set. Examples include aligning transcripts with long ORFs (open reading frames) to curated sequence databases and selecting high-confidence queries based on e-value and percent identity (examples in Ming et al., 2008 and Huang et al., 2009), or using complete sequences of conserved orthologous genes identified by genome assessment programs (Holt and Yandell, 2011). Another solution is to use iterative runs of predictor training and gene prediction to bootstrap the training process (Cantarel et al., 2008). The first-pass training material is usually the conserved gene set. Gene prediction is then performed and the output is used to retrain the gene predictor. As this two-pass procedure incorporates external evidence in the process of training, such as proteins and mRNA of species to be annotated, as well as closely-related species, gene annotations are improved (Campbell et al., 2014a). This method was therefore selected to be included in the Z. muelleri gene annotation workflow as an attempt to solve the limitations of predicting a newly sequenced genome in this project.

70

Table 2-7 Pre-calculated parameters of plant species provided for open source gene predictors.

Plant species with pre- Corresponding publication of Gene predictors calculated parameters predictors Arabidopsis thaliana Augustus Zea mays (Stanke and Waack, 2003) Solanum lycopersicum Arabidopsis thaliana SNAP (Korf, 2004) Oryza sativa Arabidopsis thaliana GlimmerHMM (Majoros et al., 2004) Oryza sativa None (Ter-Hovhannisyan et al., GeneMark-ES (self-training algorithm) 2008)

2.5.5.4 Training and evidence data: Z. muelleri and Z. marina transcriptome

RNA-seq data of Z. muelleri were assembled to 89,792 transcripts with a median length of 763 bp. Two other Z. marina transcriptomes with median length of 399.40 and 941.42 bp, respectively, were also assembled to provide evidence data for gene annotation.

Table 2-8 Statistics of assembled Z. muelleri and Z. marina transcriptomes.

Z. marina Z. marina Z. muelleri (ID: SRP022957) (ID: SRP035489) Total transcripts 89,792 29,382 128,505 Percent GC 40.16 35.92 40.25 Median contig length (bp) 763 380 641 Average contig length 1,119.76 399.40 941.42 (bp) Total assembled bases 100,545,252 11,735,208 120,976,569 (bp)

71

2.5.5.5 Two-pass run of the MAKER pipeline

A total of 35,875 protein-coding genes were annotated in the Z. muelleri genome, with average gene length of 3,154 bp, coding sequence length of 984 bp and 5.7 exons (Table 2-9).

The number of genes annotated is similar to M. acuminata. However, in comparison with another seagrass species and duckweed, both members of the same order Alismatales, Z. muelleri has about 15,000 more genes predicted. This difference is thought to be consistent with the difference in genome sizes, as the genomes of Z. marina and duckweed are only about one-third the size of Z. muelleri. Gene characteristics of Z. muelleri such as CDS and exon size, as well as exon number, are approximate to other angiosperms. This is expected as the gene features and structure of exons are generally well conserved across species (Hupalo and Kern, 2013). This comparison also serves as a simple quality evaluation of the gene annotation process.

72

Table 2-9 Gene characteristics of Z. muelleri in comparison with four monocot species (Z. marina, S. polyrhiza, M. acuminata and O. sativa) and one dicot (A. thaliana). * genome not available at the time of analysis.

Species Assembled Total genes Average gene Average Average Average Publication size of annotated size (bp) coding exons per exon genome sequence gene length (Mbp) length (bp) (bp) Z. muelleri 632 35,875 3,155 987 5.7 226 - Z. marina* 203 20,450 3,301 1,177 5.2 227 (Olsen et al., 2016) S. polyrhiza 128 19,623 3,458 1,108 5.2 213 (Wang et al., 2014) M. acuminata 472 36,542 3,596 1,038 5.4 192 (D'Hont et al., 2012) O. sativa 374 39,049 2,.29 1,064 4.1 258 (International Rice Genome Sequencing Project, 2005) A. thaliana 119 27,416 1,867 1,218 5.1 237 (Maumus and Quesneville, 2014)

73

2.6 Quality assessment of the predicted genes

The quality of the predicted gene set was evaluated in several ways. Firstly, the annotation was manually inspected using the viewer Apollo (Lewis et al., 2002) to assess the improvements in iterative runs. Figure 2-8 shows an example of gene prediction improvement in “scaffold 3725_30727” using the viewer Apollo. Gene “augustus_masked- 3275_30727-processed-gene-0.3-mRNA-1” (in b) was initially annotated as three separate transcripts (in a), where one was longer and another two were short. The predictions of untranslated regions (UTRs) and its distinction with coding region were also improved in the second-pass annotation.

Using two metrics provided by the MAKER pipeline for each of the gene it annotated, annotation edit distance (AED) and quality index (QI), the correctness of the predicted gene set was estimated. A cumulative distribution function (CDF) curve of the AED scores of all Z. muelleri genes shows more than 85% of annotations are supported by homology- based or transcript evidence (AED < 1.0; Figure 2-9). The sharp increase of annotations from AED 0.9 to 1.0 observed in the CDF curve represents 4,781 ab initio predictions with no supporting evidence. Genes were further categorized based on support across AED scores in Figure 2-10. In general, the majority of the predictions have an overlap of ab initio and evidence support. The number of genes that are only supported by evidence is also lower than the number of ab initio genes.

There is no direct way of deciding the correctness of gene prediction, but these measurements provide ways to detect possible errors. AED measures the fit of each annotation to its supporting homology-based or transcript evidence in the MAKER pipeline (Yandell and Ence, 2012). AED score ranges from 0 to 1, where 0 indicates the highest fitness between annotation and evidence. Further information is detailed in QI, a nine dimensional summary that describes the fraction of splice sites and exons that are supported by evidence and ab initio prediction, as well as UTR and protein lengths (Campbell et al., 2014a). These metrics identify the congruency of genes with provided evidence and estimate the accuracy of the ab initio predictors. For example, a gene with low AED and high fraction of evidence and ab initio support is reliable annotation, a gene with well-supported evidence but low ab initio support suggests a poorly trained gene predictor, whereas an ab initio gene without evidence support would benefit from manual curation.

74

The two-pass MAKER run was shown to improve gene prediction where exons mistakenly predicted as genes in the first pass were merged in the second pass. UTR regions were also better predicted in the second pass. Combining gene training and gene predicting using iterative runs is therefore a valid solution to annotate a non-model genome. Figure 2- 10 further confirms the accuracy of the trained predictors, as most of the predictions are supported by both ab initio and evidence support. As these results demonstrated confidence on the predictions, and also that the amount of seagrasses evidence data (transcripts and proteins) provided is relatively low as compared to well-studied plant families and therefore might not sufficiently support all true Z. muelleri genes, the 4,781 ab initio predictions with AED 0.9 to 1.0 were retained in the final gene set.

75

a)

76

b)

Figure 2-8 Comparison between one-pass and iterative runs of MAKER annotation in scaffold 3725_30727; a) results after the first MAKER run and b) results after the second MAKER run. Blue blocks within the light blue panels are genes annotated by MAKER, where the upper and lower panels represent the forward and reverse direction. The blocks within the upper and lower black panels are repeat annotation, evidence alignments, and ab initio predictions by SNAP and Augustus. Transparent boxes with blue outline represent the untranslated regions (UTRs).

77

Figure 2-9 Cumulative distribution function curve of Annotation Edit Distance (AED) scores of all predicted Z. muelleri genes.

78

Figure 2-10 Stacked plots of three categories of predictions across AED scores, 1) prediction is supported by both ab initio and evidence, 2) prediction is only ab initio¬-based, and 3) prediction is only evidence-based.

79

2.7 Functional annotation of Z. muelleri genes

Table 2-10 shows the results of functional annotation of the Z. muelleri annotated proteins. About 50% of the proteins are orthologous to at least one sequence in the protein databases, whereas functional domains were identified in nearly 85% of all Z. muelleri proteins. In total, 30,135 (95%) genes had identity to sequences in the Swiss-Prot, TrEMBL or InterPro databases. A total of 5,740 proteins remain unannotated. Interestingly, 81% (4,655) of the unannotated genes are supported by transcript evidence, with high confidence (<= 0.5) AED scores. This indicates the novelty of these unannotated genes which could collectively describe seagrass-specific characteristics.

The purpose of functional annotation is to assign functions to predicted genes through sequence similarities with known proteins and domains. Functional annotation also serves to assess the quality of the annotated gene set as the percentage of proteins with no sequence similarities to public databases is expected to be considerably low (Yandell and Ence, 2012). This is because protein homology are well conserved across distant phylogenetic relationships (Frazer et al., 2003), and to date there are already more than 50 plant genomes annotated (Michael and Jackson, 2013). Two main categories of databases were used for alignments, protein sequences (Swiss-Prot and TrEMBL) and domains or families (InterPro, its member databases and Gene Ontology) (Table 2-11).

Table 2-10 Functional annotation of predicted proteins in Z. muelleri genome.

Number Percentage Total 35,875 100 Annotated SWISSPROT 9,857 27.5 TrEMBL 16,235 45.4 InterProScan 30,135 84.2 Gene Ontology 18,916 52.9 Unannotated 5,740 16.0

80

Table 2-11 Public databases used for functional annotation of predicted Z. muelleri proteins.

Databases Description Publication Swiss-Prot Manually annotated and curated protein (Bairoch and records Boeckmann, 1991) TrEMBL Computationally predicted protein (Bairoch and Apweller, records 1997) InterPro Collection of protein signatures (Jones et al., 2014) predicted using models of a combination of member databases (listed below) TIGRFAM Collection of multiple sequence (Haft et al., 2001) alignments and hidden Markov models (HMM) of protein domains ProDom Homologous domains of proteins (Corpet et al., 2000) Panther Models of functionally-related families (Mi et al., 2005) and subfamilies Pfam Collection of multiple sequence (Sonnhammer et al., alignments and hidden Markov models 1997) (HMM) of protein domains Prosite Biologically significant sites, patterns (Sigrist et al., 2010) and profiles of protein domains Gene Ontology Models of biological systems which (Ashburner et al., 2000) describe gene functions

81

2.8 Conclusion

In this chapter, the genome assembly and annotation of Z. muelleri is described. The abundance distribution of k-mers in whole genome shotgun sequences was calculated and the expected genome size was estimated as 889 Mbp. The reads were then preprocessed and assembled to form 237,933 contigs. The contigs were scaffolded using mate-paired information and error corrected to form a final set of 56,823 scaffolds.

The level of completeness in this draft assembly is relatively low, when compared to pseudochromosomal assemblies. However, the main aim of this study is based on the genic regions and genome assemblies can be constantly improved depending on the advancement of sequencing technologies. One simple way to determine that the completeness of genic regions is not hampered by genome fragmentation is that the N50 length should at least approximate to gene size (Yandell and Ence, 2012). In this assembly, the N50 length is 36,732 bp and the average predicted gene length is about ten times shorter. Two software which search for highly conserved genes and score the genic completeness in genomes, CEGMA and BUSCO were used to further assess the assembly. Drawbacks of using this assessment method in plant genomes, as well as the availability of other methods, were reviewed.

The assembled genome was then annotated to predict repetitive regions and protein- coding genes. 56% of the genome contains repeats, with a majority of LTR retrotransposons. The challenges of annotating a newly sequenced genome with limited data from closely-related species were addressed. Based on the data available, including Z. muelleri RNA-seq, Z. marina RNA-seq and ESTs, an annotation method which employs the flexibility of the MAKER pipeline was design. This method combines model training of predictors and gene prediction with iterative runs, in attempt to improve the accuracy of predictions with limited data as evidence.

A total of 35,875 Z. muelleri genes were predicted and the gene characteristics were compared with other plant species. The two-pass MAKER run was proven successful through quality evaluation of the genes with two categories of metrics, 1) the congruency of predictions with evidence data (transcriptome and proteins), and 2) the sequence similarity with known proteins and domains in public databases. More than 80% of the predictions were identified to have evidence support and 95% of the genes had sequence identity in Swiss-Prot, TrEMBL or InterPro databases.

82

Besides assessing the quality of predictions, functional annotation also assigned putative functions to genes. It is important to note that the identification of the true biological function of a protein requires extensive molecular biology research, including protein- protein interaction and loss-of-function studies. This is because the mechanisms of gene specialization, following gene or genome duplication, to decide the functional fates of genes, is not well understood (reviewed in Panchy et al., 2016). It is hypothesized that subfunctionalization, neofunctionalization or pseudogenization of genes is driven by positive selection (Innan and Kondrashov, 2010). Phylogenetic sequence similarity alone is therefore not sufficient to confirm protein functions.

83

Chapter 3. Comparative genomics between Zostera muelleri and other plant species

3.1 Introduction to comparative genomics

Comparative genomics is the study of the similarities and differences in the genomes of individual species across taxa. Genomic similarities and differences are expressed in DNA sequences, domain structure and even functionalities. The comparison between species is made valid by relatively slow evolution of genic regions hence high conservation of homology, but in plants it is complicated by complex structural mutations such as large and small-scale duplications, gene deletion and pseudogenization, repetitive elements and localized rearrangements.

Comparative genomics is useful in agricultural research, particularly locating genes and structural features that contribute to the expression of economical valuable phenotypes in crops. In a broader view, comparative genomics enables phylogenetic studies through analysis such as the reconstruction of ancestral genomes, patterns of natural selection, conservation of genes due to domestication and predictions of gene functions.

In non-model species where information of closely related taxa is lacking, such as seagrasses, the prediction of gene functions through sequence comparison is challenging, as divergence has introduced complex gene deviations. However, comparative genomics is proven to be a good start in differentiating novel lineage-specific genes and conserved genes shared across higher plants.

In the following sections, concepts of homology between genes and the events of whole genome duplication will be introduced. Analysis of orthologous genes and whole genome duplications in Z. muelleri will be explored using results of genome annotation prediction in

Chapter 2.

3.1.1 Concepts of homology

The term homology is used to describe the relationship between two characters that have descended from a common ancestor. Here, the focus is restricted to genes. Homologous genes are categorized into orthologs and paralogs based on the type of events which

84

occurred at the most recent common ancestor. The inference of homologs, particularly orthologs, is an important step in comparative genomics for two main reasons. The first reason is to assist in functionality prediction of genes, since orthologous genes are more likely to share similar functions than paralogs (Altenhoff et al., 2012). More importantly, orthologs represent the evolutionary classification of genes, and therefore the identification of orthologous relationships enables reconstruction of the evolution of species. Orthologous genes are genes which have diverged since a speciation event, whereas paralogous genes are homologs which have diverged since a duplication event. For example, as depicted in Figure 3-1, besides gene 1 in species A and B, gene 1α and 1β are also orthologs of gene 1, since they are related by speciation. The relationship between gene 1α and 1β, however, is paralogous as they arise through duplication. Orthologous clusters identified between green algae and higher plants were used in phylogenetic analysis to estimate divergence time (Zimmer et al., 2007). In agricultural research, comparative genetic maps, which contain the order of orthologous genes, have been built for many economically important genera or clades, such as Poaceae (Bennetzen and Freeling, 1997; Gale and Devos, 1998; Wilson and Doudna Cate, 2012), legumes (Menanciohautea et al., 1993; Boutin et al., 1995) and Solanaceae (Mueller et al., 2005).

For more divergent species, the evolutionary time between species is greater and gene similarities reduced, comparative mapping is therefore more challenging. This is because in general, orthology inference methods lack a golden standard, as the evolutionary history is unknown. Even so, efforts of indirect benchmarking, such as measuring coexpression levels, domain conservation (Hulsen et al., 2006), manual curation (Trachana et al., 2011) and data simulation (Dalquen et al., 2013) had been carried out. In addition, identification of homologous genes between organisms is further complicated by species-specific large and segmental gene duplications, gene loss and retention, and other gene rearrangements such as horizontal gene transfer and gene fusion (example review Szollosi et al., 2015). In highly duplicated genomes such as plants, orthologs between species may exist in many-to-many relationships, rather than the conceptual “one-to-one” relationship.

Computational methods to infer orthology are categorized into three main groups (reviewed in Kristensen et al., 2011 and Tekaia, 2016): 1) most of the tools available are based on sequence similarity-based methods, which cluster genes with high similarity as orthoglous groups; 2) phylogenetic tree-based methods use gene similarity information 85

together with species phylogeny; and 3) synteny-based methods rely on conservation of gene order and chromosomal orientation to infer orthologs.

Figure 3-1 Types of homology relationships in the evolution of four genes with a common ancestor. Figure adapted from (Kristensen et al., 2011).

3.1.2 Whole genome duplication

Whole genome duplication (WGD) is the doubling of the entire genome of a species, causing a change in the level of ploidy. The polyploidization of a genome is thought to provide the source of genetic material to facilitate adaptation, generate novel genes and drive speciation. Studies linking small and large-scale gene duplication to evolution have a long history (reviewed in Taylor and Raes, 2004) . All species in major phylogenetic lineages are results of one or more round of ancient WGD(s), besides plant lineages, other examples are yeasts (Kellis et al., 2004), frogs (Morin et al., 2006) and fish (Postlethwait et al., 2000; Meyer and Van de Peer, 2005).

What follows after genome duplication is an accumulation of sequence rearrangements, fractionation, mutations and deletions, with the ultimate aim of “genome-downsizing” or “diploidization” (Figure 3-2). The amount of comparative analyses aimed at understanding the mechanisms of determining the fate of duplicated gene pairs are increasing (Koonin et al., 2004; Paterson et al., 2006; Duarte et al., 2010; Waterhouse et al., 2011; De Smet et al., 2013). For example, a group of “duplication-resistant” genes are found to be convergently restored to single-copy status. These single-copy genes are enriched for 86

survival and housekeeping genes, and often have high expression levels (De Smet et al., 2013).

Figure 3-2 Illustration of the cyclical episodes during polyploidization. Figure adapted from (Wendel et al., 2016).

Many cultivated crops are results of recent WGD (neo-polyploids) (Allario et al., 2013; Chao et al., 2013; Li et al., 2015; Zhang et al., 2015). A recent analysis linked domestication with WGDs, where species that underwent polyploidy events are more likely to be successfully domesticated than their wild relatives (Salman-Minkov et al., 2016). Besides identifying species-specific duplications, the detection of WGDs also helps to place the timeline of ancient pan-lineage events, for example in core eudicots (Jiao et al., 2012) and cereals (Paterson et al., 2004).

There are three main methods used to detect WGD in an assembled genome, using rates of synonymous substitutions (Ks) of paralogous genes, detection of syntenic blocks and calibrated phylogenetic trees. Synteny-based methods use the clustering of matching gene

87

pairs and their collinearity to locate WGDs. These methods have high accuracy but are limited by the level of completion the genome assembly (Tang et al., 2008). Ks-based methods are more suitable for incomplete assemblies, but are only able to detect relatively recent WGDs, as the estimation of older WGDs is biased by Ks saturation effect (Vanneste et al., 2013). Calibrated phylogenetic trees of gene families are also useful in detecting

WGDs in incomplete assemblies and are able to detect older WGDs than Ks-based methods (Jiao et al., 2011).

88

Figure 3-3 Known whole genome duplication events in the plant phylogeny. Figure adapted from https://genomevolution.org/wiki/index.php/Plant_paleopolyploid

89

3.2 Materials and methods

3.2.1 Clustering of orthologous genes

To cluster the annotated Z. muelleri genes into orthologous groups, four other species were selected for comparison: two model plant species; Arabidopsis thaliana and Oryza sativa to represent eudicot and monocot respectively, M. acuminata as a non-grass monocot representative, and the only Alismatales genome available at the time of this analysis, Spirodela polyrhiza. Protein sequences of the four species were downloaded from Phytozome version 10. Together with Z. muelleri, the protein sequences of these five species were used as input to the OrthoMCL pipeline with the following commands:

orthomclAdjustFasta ATH ATH_prot.fa 1 orthomclAdjustFasta OSA OSA_prot.fa 1 orthomclAdjustFasta MAC MAC_prot.fa 1 orthomclAdjustFasta SPO SPO_prot.fa 1 orthomclAdjustFasta ZMU ZMU_prot.fa 1 mkdir compliantFasta mv ATH_prot.fa BDI_prot.fa MAC_prot.fa SPO_prot.fa ZMU_prot.fa compliantFasta/

orthomclFilterFasta ./compliantFasta 10 20 makeblastdb -in goodProteins.fasta -dbtype prot blastp -db goodProteins.fasta -query goodProteins.fasta -out allvsall.hit -outfmt 6 -evalue 1e-5 orthomclBlastParser allvsall.hit ./compliantFasta >> SimilarSequences orthomclInstallSchema orthomcl.config mysqlimport -u root -p orthomcl SimilarSequences --local orthomclPairs orthomcl.config orthomcl.log cleanup=no orthomclDumpPairsFiles orthomcl.config mcl mclInput --abc -I 1.5 -o mclOutput orthomclMclToGroups cluster 1000 < mclOutput > groups.txt

90

Results of clustered orthologous genes were presented in a five-way Venn diagram using the following commands:

grep "ATH" groups.txt | cut -d : -f1 | tr "\n" " " > ATH_groups.list grep "OSA" groups.txt | cut -d : -f1 | tr "\n" " " > OSA_groups.list grep "MAC" groups.txt | cut -d : -f1 | tr "\n" " " > MAC_groups.list grep “SPO" groups.txt | cut -d : -f1 | tr "\n" " " > SPO_groups.list grep "ZMU" groups.txt | cut -d : -f1 | tr "\n" " " > ZMU_groups.list

wget https://cran.r-project.org/src/contrib/gplots_3.0.1.tar.gz tar -xzf /home/uqhlee16/HPC/downloads/gplots_3.0.1.tar.gz

plot_venn.R: library(gplots, lib.loc="/home/uqhlee16/HPC/downloads") ath = read.table("ATH_groups.list",header=F,sep=" ") spo = read.table("SPO_groups.list",header=F,sep=" ") zmu = read.table("ZMU_groups.list",header=F,sep=" ") mac = read.table("MAC_groups.list",header=F,sep=" ") osa = read.table("OSA_groups.list",header=F,sep=" ") v.table <- venn(list(V.vinifera=vvi,A.thaliana=ath,Z.muelleri=zmu,M.acuminata=mac,O.sativa=osa)) plot (v.table,font=“3”,ps=“10”)

The Venn diagram was coloured and labelled using the Adobe Photoshop software.

3.2.2 Detecting gene families of interest

Gene families that were present in A. thaliana, M. acuminata, O. sativa and S. polyrhiza, but absent in Z. muelleri, were named AMOS, after the genus of the four species present. One A. thaliana gene was selected at random to represent each AMOS cluster for Gene Ontology enrichment analysis, since the GO ID attribution of A. thaliana is the most complete among four species. Gene ontology (GO) ID was assigned to 1,131 A. thaliana orthologs based on TAIR10 (2010-09-TAIR10) attributes downloaded from BioMart EnsemblPlants (http://plants.ensembl.org/biomart/martview). GO enrichment was performed using the R package topGO version 2.22.0 (Alexa and Rahnenfuhrer, 2010) with A. thaliana whole proteome (2010-09-TAIR10) as background with the following R script:

91

library(topGO) writeTopGO <- function(background, foreground, GOterm, top, txt) { annAT <- readMappings(background, sep="\t", IDsep=";") allgenes <- unique(unlist(read.table(file="gene.list"))) mygenes <-scan(foreground ,what="") geneList <- factor(as.integer(allgenes %in% mygenes)) names(geneList) <- allgenes GOdata <-new ("topGOdata", ontology = GOterm, allGenes = geneList, nodeSize = top, annot=annFUN.gene2GO, gene2GO=annAT) weight01.fisher <- runTest(GOdata, statistic = "fisher") allRes <- GenTable(GOdata, classicFisher=weight01.fisher,topNodes=30) names(allRes)[length(allRes)] <- "p.value" write.table(allRes, file=txt, sep="\t",quote=FALSE,row.names=FALSE) }

writeTopGO("background.GO.txt","AMOS.txt","BP",5,"AMOS_BP") writeTopGO("background.GO.txt","AMOS.txt","BP",5,"AMOS_MF") writeTopGO("background.GO.txt","AMOS.txt","BP",5,"AMOS_CC")

A word cloud where the sizes of GO terms were based on enrichment p values was plotted. Three categories of functions, cell wall-related, hormone-related and light-related were coloured. The following R script was used to plot the word cloud:

library("wordcloud") t <- read.table("GO_table.csv", head=T, sep="\t") pal3 <- c("#B15928","#B15928","#FB9A99","#000000","#000000","#000000","#000000","#00000 0","#FB9A99","#000000","#6A3D9A","#000000","#000000","#6A3D9A","#000000","#000 000","#000000","#FB9A99","#000000","#000000","#000000","#000000","#000000","#000 000","#000000","#000000","#FB9A99","#000000","#000000","#000000" png("wordcloud.png", width=17000,height=17000, res=400) wordcloud(t$Term,t$Size, scale=c(9,2),min.freq=1, max.words=Inf, random.order=FALSE, rot.per=0.0, colors=pal3,ordered.colors=TRUE)

92

As the GO enrichment analysis only put weights on the GO terms with high numbers of representatives, single gene loss/modification within pathways is overlooked. Therefore, the A. thaliana orthologs were also mapped to KEGG pathways (Kanehisa and Goto, 2000) using the KEGG Mapper web interface (http://www.kegg.jp/kegg/tool/map_pathway2.html).

3.2.3 Multiple sequence alignments and phylogeny of genes of interest in Z. muelleri and other plants

3.2.3.1 Pectin methylesterase-related genes

Protein sequences of two orthogroups, one Z. muelleri-specific (cluster11122) and one containing orthologs of all species analysed in the OrthoMCL run (cluster5405), were aligned and distance calculated using the program MUSCLE (Edgar, 2004):

muscle -in pme.fa -out pme.afa muscle -maketree -I pme.afa -out pme.phy -cluster neighborjoining

Dendroscope version 3.5.7 (Huson et al., 2007) was used to plot a phylotree.

3.2.3.2 Ethylene responsive gene EIN3

Multiple sequence alignment between EIN3/EIL1 orthologs of seven species, Z. muelleri, Z. marina (IDs: Zosma140g00280.1, Zosma44g00270.1), A. thaliana (IDs: AT2G27050, AT3G20770), Glycine max (ID: Glyma.02G274600.1.p), Oryza sativa (ID: Q8W3L9), Populus trichocarpa (ID: B9GMA1) and S. polyrhiza (IDs: Spiopo0G0106000, Spipo3G0015900) was performed using T-coffee (Notredame et al., 2000) and coloured in Jalview (v2.8.1) (Waterhouse et al., 2009).

3.2.3.3 Jasmonate methyltransferase JMT

Protein sequences with similarities to carboxylic acid O-methyltransferases in Z. muelleri and Z. marina were collected. Putative SAM:salicylic acid carboxyl methyltransferase (SAMT), SAM:benzoic acid carboxyl methyltransferase (BAMT), Salicylate/benzoate carboxyl methyltransferase (BSMT), SAM:anthranilic acid methyltransferase (AAMT) and 93

SAM:jasmonic acid carboxyl methyltransferase (JMT) in Nicotiana tabacum (B5TVE1), Clarkia breweri (ID: Q9SPV4), Antirrhinum majus (IDs: B6SU46, Q8H6N2), A. thaliana (ID: Q9AR07) and Brassica rapa (IDs: Q9SBK6, Q6XMI3) were collected. Sequences were aligned and distance calculated using the program MUSCLE (Edgar, 2004):

muscle -in jmt.fa -out jmt.afa muscle -maketree -I jmt.afa -out jmt.phy -cluster neighborjoining

Dendroscope version 3.5.7 (Huson et al., 2007) was used to plot a phylotree.

3.2.4 Phylogenomic timing analysis of Z. muelleri among major land plant lineages

Seventeen genomes, together with Z. muelleri, were selected for WGD analysis: seven eudicotyledons (A. thaliana, Populus trichocarpa, Theobroma cacao, Vitis vinifera, Solanum lycopersicum, Solanum tuberosum, Nelumbo nucifera), seven (O. sativa, B. distachyon, Sorghum bicolor, Elaeis guineensis, Phoenix dactylifera, M. acuminata, S. polyrhiza), one basal angiosperm (A. trichopoda), one lycophyte (Selaginella moellendorffii), and one moss (Physcomitrella patens). Below are the links to the downloaded data:

Nelumbo nucifera http://lotus-db.wbgcas.cn/genome_download/

Elaeis guineensis ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000442705.1_EG5/

Phoenix dactylifera ftp://ftp.ncbi.nih.gov/genomes/Phoenix_dactylifera

Other species Phytozome (version 11)

Analyses were performed based on a published approach (Jiao et al., 2011; Jiao et al., 2014). Briefly, orthogroup classification of coding sequences of all 18 genomes were constructed using OrthoMCL. Amino acid alignments of orthologs were used for construction and trimming of DNA alignments. Maximum likelihood phylogenetic trees were generated using RAxML version 7.2.1. To determine the phylogenomic timing of gene duplications, homologous anchor genes were mapped to phylogenetic trees and only those with bootstrap value equal or greater than 50% were considered.

94

3.2.5 K-mer frequency of homologous gene pairs between two Zosteraceae species

To demonstrate the hypothesis of the occurrence of a Z. muelleri-specific large-scale duplication, k-mer frequency of homologous gene pairs between two Zosteraceae species were analysed as below based on a published approach (Vu et al., 2015).

A total of 1,944 of orthologous pairs between Z. muelleri and Z. marina were randomly selected. 17-mer frequencies of coding sequences of orthologs in each species were calculated with the program khmer (Pell et al., 2012) using the following command:

Generate k-mer counting hash table:

python /pawsey/sles11sp4/python/2.7.10/khmer/2.0/bin/load-into-counting.py -k 17 -N 4 - x 8e9 $hash_table $reads

Count kmer abundance in orthologs:

python /pawsey/sles11sp4/python/2.7.10/khmer/2.0/bin/abundance-dist.py $hash_table $orthologs_seq $histogram

Average copy number of the k-mers was calculated by dividing the sum product of k-mer abundance and frequency by sum of k-mer frequency.

3.2.6 Copy number of gene pairs between two Zosteraceae species

Z. muelleri genes were used as query to align against Z. marina genes using the following parameters “blastp -evalue 0.00001 -max_target_seqs 1”. The number of query genes per target hit was calculated and plotted in a histogram.

3.2.7 Rate of substitutions per synonymous sites between two Zosteraceae species

The Ks analysis was performed according to previous approach (Cui et al., 2006). Paralogous pairs of sequences in Z. muelleri and Z. marina, respectively, were identified from best reciprocal matches in all-by-all BLAST searches. The EMMIX software was

95

used to fit a mixture model of multivariate normal components to a given data set. The mixed populations were modelled with one to four components.

3.3 Results and discussion

3.3.1 Orthologous gene clustering of Z. muelleri and four other reference species identifies absent or modified genes

A total of 7,404 orthologous groups, termed “gene families”, comprising of 78,683 genes, were identified as common to A. thaliana, M. acuminata, O. sativa, S. polyrhiza and Z. muelleri genome assemblies (Figure 3-4). The number of gene families shared between A. thaliana, M. acuminata, O. sativa and S. polyrhiza (AMOS) was significantly larger than gene families shared between any other four species (purple sections in Figure 3-4). AMOS consists of 7,946 genes making up 1,131 families. GO enrichment analysis identified 30 GO terms which were significantly over-represented in the biological process category mapped by 196 of the AMOS families (Table 3-1). The results were presented in a word cloud (Figure 3-5). In a separate analysis, 143 of the AMOS families mapped to 85 KEGG pathways (Appendix 2). Cell wall modification and pectin catabolism are the top two enriched terms in AMOS, followed by the ethylene signalling pathway. As ethylene is involved in complex crosstalk, pathways of other hormones (coloured in red) also appeared to be enriched. Two terms related to photosynthesis and light, response to high light intensity and cytochrome b6f complex assembly, were coloured in purple. A combination of GO and KEGG results suggest that genes related to at least five processes, pectin catabolism, hormone response, photosynthesis, stress response and ribosomal constituent, were either lost or modified in seagrass. AMOS genes associated with these pathways will be further discussed in the following sections.

OrthoMCL (Li et al., 2003), software widely used to detect orthologs based on similarity of sequence alignments, was chosen in our analysis for two reasons: 1) ablility to perform multiple species comparisons instead of one-to-one, 2) evaluation results showed more than 90% of specificity and sensitivity (Chen et al., 2007), and are comparable to manually curated ortholog database KOG (euKaryotic Orthologous Group). Despite the unique habitat and morphology of seagrasses, the number of Z. muelleri specific gene families was relatively low (2,233), representing 16% of all Z. muelleri genes. This suggests that Z. muelleri maintained a similar gene content to the land plant species and that the early divergence and marine adaptation of seagrass did not involve a major increase in the

96

abundance of novel gene families. The large number of O. sativa specific gene families (2,880) was consistent with previous studies that demonstrate a high level of gene diversification in Poaceae (Goff et al., 2002). Similarly, a relatively low number of S. polyrhiza specific gene families has been previously reported (Wang et al., 2014). S. polyrhiza, commonly known as duckweed, is a freshwater floating plant. Considering that both duckweed and seagrass are aquatic plants, and that they both are members of the Alismatales order, the number of shared gene families between S. polyrhiza and Z. muelleri was not particularly high (8,497; Z. muelleri and A. thaliana: 8,615; Z. muelleri and M. acuminata: 8,841; Z. muelleri and O. sativa: 8,516). This could be explained by the polyphyletic nature of the Alismatales order, as well as differences between submerged versus floating, and marine versus freshwater habitat.

Considering that Z. muelleri does not share 1,131 families of common plant genes and that the gene families unique to Z. muelleri is not particularly high, it is possible that seagrass may have adapted through the loss of terrestrial plant genes rather than the generation of novel, seagrass-specific genes. However, detailed gene family analysis is needed to test this hypothesis. It is important to note that the presence of a gene family in AMOS suggests three possibilities: 1) the gene is absent in Z. muelleri and therefore the gene functionality is lost; 2) the gene functionality is conserved in Z. muelleri but the gene sequence is modified or has very low similarity with orthologs in other four species; 3) the gene is not lost in Z. muelleri but is missing from the draft assembly or the annotated gene set.

97

Figure 3-4 Venn diagram showing the distribution of shared gene families and genes (in parentheses) among five species (M. acuminata, O. sativa, S. polyrhiza, A. thaliana and Z. muelleri). Areas were colour-coded by the number of species.

98

Table 3-1 Significantly enriched Biological process (BP) GO terms in AMOS families.

GO ID Term p value GO:0042545 cell wall modification 1.7e-07 GO:0045490 pectin catabolic process 4.5e-06 GO:0010105 negative regulation of ethylene-activated signalling 5.9e-05 pathway GO:0005983 starch catabolic process 0.00017 GO:0042542 response to hydrogen peroxide 0.00043 GO:0016567 protein ubiquitination 0.00057 GO:0009408 response to heat 0.00155 GO:0000023 maltose metabolic process 0.00228 GO:0009737 response to abscisic acid 0.00344 GO:0009965 leaf morphogenesis 0.00387 GO:0010190 cytochrome b6f complex assembly 0.00481 GO:0016236 macroautophagy 0.00620 GO:0043043 peptide biosynthetic process 0.00620 GO:0009644 response to high light intensity 0.00636 GO:0071281 cellular response to iron ion 0.00696 GO:0002237 response to molecule of bacterial origin 0.00699 GO:0045892 negative regulation of transcription, DNA-templated 0.00777 GO:0009739 response to gibberellin 0.00781 GO:0019252 starch biosynthetic process 0.00994 GO:0030307 positive regulation of cell growth 0.01013 GO:0010029 regulation of seed germination 0.01115 GO:0019375 galactolipid biosynthetic process 0.01436 GO:0050665 hydrogen peroxide biosynthetic process 0.01436 GO:0046685 response to arsenic-containing substance 0.01439 GO:0080060 integument development 0.01487 GO:0050994 regulation of lipid catabolic process 0.01487 GO:0009755 hormone-mediated signalling pathway 0.01487 GO:0010200 response to chitin 0.01790 GO:0046777 protein autophosphorylation 0.01860 GO:0008356 asymmetric cell division 0.02005

99

Figure 3-5 Word cloud representing GO terms enriched in AMOS gene families. Terms were coloured based on related functions 1) pink: hormone response signalling, 2) green: cell wall modification, and 3) purple: light and photosynthesis-related.

100

3.3.1.1 Pectin catabolism

The two most significantly enriched GO terms in AMOS are cell wall-related: cell wall modification (GO:0042545) and pectin catabolic process (GO:0045490). Table 3-3 shows that most of the cell wall-related AMOS genes are pectin methylesterases (PME) and the PME regulators, pectin methylesterase inhibitors (PMEI). To determine whether PMEs are lost or modified in seagrass, motif finding analysis (results from Chapter 2 Section 2.7) was used to identify PME (Pfam ID: PF01095) and PMEI (Pfam ID: PF04043) domains in Z. muelleri proteins. A total of 121 PME and/or PMEI-related proteins are identified, and about half of the clustered proteins do not group with PME/PMEI proteins from other four plant species (Table 3-4). A total of 42 PME/PMEI proteins formed 15 species-specific clusters, while 28 proteins were not clustered due to low sequence similarities. A Neighbour-joining tree constructed using PME/PMEI genes in all five species showed an example of gene expansion in Z. muelleri (Figure 3-6). OrthoMCL grouped these genes into two clusters. The cluster labelled by black branches contains the A. thaliana gene AT5G55590.1 and its orthologs in four other species (4 in Z. muelleri, 2 in O. sativa and one each in M. acuminata and S. polyrhiza). Another cluster labelled in red has five Z. muelleri genes and has a longer branch, indicating greater genetic distance from the clusters in black. Gene expansion was observed in Z. muelleri genes of both clusters.

101

Table 3-2 A. thaliana representative genes in AMOS corresponding to cell wall modification (GO:0042545) and pectin catabolic process (GO:0045490) pathways.

A. thaliana genes Gene name corresponding to (Gene description) GO:0042545 and GO:0045490 AT1G01300.1 Eukaryotic aspartyl protease family protein AT3G43270.1 PME32 (Probable pectinesterase/pectinesterase inhibitor 32) AT2G36710.1 PME15 (Probable pectinesterase 15) AT2G43050.1 PME16 (Probable pectinesterase/pectinesterase inhibitor 16) AT1G53830.1 PME2 (Pectinesterase 2) AT5G47500.1 PME68 (Probable pectinesterase 68) AT3G05610.1 PME21 (Probable pectinesterase/pectinesterase inhibitor 21) AT1G02810.1 PME7 (Probable pectinesterase/pectinesterase inhibitor 7) AT3G05620.1 PME22 (Putative pectinesterase/pectinesterase inhibitor 22) AT2G26440.1 PME12 (Probable pectinesterase/pectinesterase inhibitor 12) AT5G19730.1 PME53 (Probable pectinesterase 53) AT2G26450.1 PME13 (Probable pectinesterase/pectinesterase inhibitor 13) AT3G15680.1 Ran BP2/NZF zinc finger-like superfamily protein AT1G11580.1 PME18 (Pectinesterase/pectinesterase inhibitor 18) AT2G21610.1 PME11 (Putative pectinesterase 11) AT1G14420.1 AT59 (Probable pectate lyase 3)

Table 3-3 Number of pectin methylesterase (PME) and/or methylesterase invertase (PMEI)-related proteins and families in Z. muelleri.

Total number of Z. muelleri proteins with PME and/or PMEI 121 domains Number of PME/PMEI proteins in Z. muelleri-specific clusters 51 (21) (Number of clusters) Number of PME/PMEI proteins in other clusters 42 (15) (Number of clusters) Number of unclustered PME/PMEI proteins 28

102

Figure 3-6 Neighbour-joining tree of genes orthologous to A. thaliana (ATH) gene AT5G55590.1 in four other species: M. acuminata (MAC), O. sativa (OSA), S. polyrhiza (SPO) and Z. muelleri (ZMU). Genes were clustered into two groups (branch colour red and black).

103

Pectic polysaccharides are members of the plant cell wall structure, where their ratio and types differ depending on species, environment and tissue (reviewed in Caffall and Mohnen, 2009). The biosynthesis of pectin occurs in the Golgi and pectin is inserted into the cell wall as a highly methyl esterified polymer. Pectin degradation leads to cell separation during processes such as fruit ripening, leaf abscission and pollen release (reviewed in Daher and Braybrook, 2015). The removal of methyl groups in pectin, or de- esterification, by PME/PMEI proteins is involved in both cell wall loosening (breakdown of pectin) and cell wall rigidification (interaction with calcium ions). Contradicting evidence of PME activity is thought to be caused by other modifying proteins and the biochemical environment (Daher and Braybrook, 2015), which further highlights the complexity of cell adhesion and separation. PME activity therefore needs to be tightly regulated, because the pattern and degree of methylation affect the physiochemical properties of the pectin polysaccharides (reviewed in Micheli, 2001). Seagrass pectin is known to contain a rare class of pectin homogalacturonan which is apiose-substituted (AGA) (Ovodov et al., 1971), and this AGA is also found in the floating plant, duckweed (Hart and Kindel, 1970). Seagrass AGA, known as zosterin, has a low level of methyl esterification (Khotimchenko et al., 2012).

Besides forming species-specific clusters, the PME/PMEI family members were also expanded in Z. muelleri when compared to monocot land plants. For example, both O. sativa and S. bicolor have less than 80 PME/PMEI genes. Figure 3-6 depicts the formation and expansion of two distinct groups of Z. muelleri genes, suggesting sequence diversity, possibly as a result of functional specialization in regulation of pectin methyl esterification in seagrass cell walls. The biological function of the apiose side-chains and the AGA methyl esterification level in seagrass, and whether they are related to the marine environment, is unknown. However, since duckweed, which also contains AGA, has PME/PMEI genes with high sequence similarity with the other species, it is likely that seagrass genes were modified to adapt to submergence. Pectin degradation is partly regulated by ethylene (De Paepe et al., 2004; Onkokesung et al., 2010) and it is unknown whether the modified pectin in seagrass is associated with the loss of ethylene signalling in seagrass (Golicz et al., 2015a).

This expansion of pectin catabolic genes was also observed in Z. marina (Olsen et al., 2016). It has been suggested that the low methylesterification of pectin is one of many mechanisms to increase the polyanionic character of cell wall matrix in seagrasses (Olsen

104

et al., 2016), as this feature is also observed in marine algae (reviewed Popper et al., 2011).

3.3.1.2 Ethylene biosynthesis and signalling

Genes responsible for ethylene signalling (GO:0010105) were significantly enriched in AMOS families (Table 3-4), including ethylene receptors, enzymes responsible for ethylene biosynthesis, ethylene-activated kinases and transcription factors. The sequence conservation of an ethylene responsive transcription factor, ethylene insensitive 3 (EIN3) was investigated using multiple sequence alignments and annotated Z. muelleri and Z. marina genes (Olsen et al., 2016) with the highest percent identity and alignment length with A. thaliana EIN3 and its ortholog EIN3-like1 (EIL1). Figure 3-9 shows the N-terminal deletion in Z. marina protein Zosma140g00280.1 but not in the Z. muelleri protein. The Z. muelleri protein augustus_masked-5412_21708--0.0-mRNA-1 has high similarity with another Z. marina protein Zosma44g00270.1. These two proteins contain the well- conserved EIN3 DNA binding domains (grey in Figure 3-8), except for a deletion of ~60 amino acids at about position 340, which spans one conserved region (red box in Figure 3- 8). Components of ubiquitin ligase complexes responsible for EIN3 degradation, EIN3- binding F-box 1 protein 1 (EBF1) and 2 were absent from the Z. muelleri annotated gene set but one copy of EBF1 ortholog was found in Z. marina (Protein ID: Zosma17g00680.1).

Ethylene is a gaseous hormone responsible for a wide range of plant growth and development processes (Ecker, 1995). Ethylene is synthesized by the conversion of S- adenosylmethionine (SAM) to ACC by ACS, and ACC is then oxidized by ACO. Five ethylene receptors were found in A. thaliana, which activate a Raf-like kinase CTR1 and then EIN2 upon ethylene perception. EIN2 blocks the ubiquitination of EIN3, a family of transcription factors that activate various downstream ethylene-induced gene transcription responses (Wang et al., 2002). In the absence of ethylene, EIN3 is targeted for degradation by EBF1 and 2 (Gagne et al., 2004).

The overrepresentation of ethylene-related gene families encoding in AMOS support previous findings which demonstrated the loss of multiple ethylene biosynthesis and signalling genes in Z. muelleri (Figure 3-7) and Z. marina (Golicz et al., 2015a) (Olsen et al., 2016). Consistent with the loss of stomata and hence the lack of outward diffusion processes in seagrass, the absence of ethylene production avoids tissue accumulation. In terrestrial plants, submergence triggers hypoxic response through ethylene responsive 105

transcription factors (Nakano et al., 2006). In flood-adapted land plants, the ethylene signal is used to sense submergence and induces a response to flooding (reviewed in Voesenek et al., 2015). It is possible that ethylene is selected against during the evolution of land plants to submerged life. Ethylene biosynthesis and signalling also play an important role in plant response to salinity (reviewed in Zhang et al., 2016). There is conflicting evidence of ethylene as positive or negative regulator during high salinity stress in different species at different developmental stages (reviewed in Tao et al., 2015) . For example, functional knockout of EIN2 and EIN3 in the seedlings of a flood-adapted species, rice, increased the salt tolerance (Yang et al., 2015), as opposed to A. thaliana where mutations of EIN2 and EIN3 caused extreme salt sensitivity (Lei et al., 2011; Peng et al., 2014). This suggests novelty in components of ethylene signalling pathways in some species to adjust plant sensitivity to stress according to environmental factors.

Without ethylene, the question of how ethylene-responsive genes are activated in seagrass is still open. In particular, with the loss of EIN2, the regulation of EIN3 is of interest. It is possible that seagrass EIN3 is constitutively active, or utilizes an EIN2- independent pathway for stabilization. A recent paper showed that ionic stress triggered by high salinity modulates the stability of EIN3 and EIL1 in an unknown pathway independent of upstream members of ethylene perception, including EIN2 (Peng et al., 2014). The paper further reported EIN3 accumulation enhanced salt tolerance in A. thaliana through elimination of the stress signal, reactive oxygen species (ROS).

Previous alignments of Z. marina and Z. noltii transcripts with EIN3/EIL1 orthologs in seven plant species showed an N-terminal deletion of ~130 residues (Golicz et al., 2015a) with well-conserved DNA-binding domains (DBD). This raises the possibility of positive auto-feedback regulation of ethylene-responsive transcription factors. The multiple sequence alignment in Figure 3-9 revisited this hypothesis. These results indicate two forms of EIN3/EIL1 in seagrasses with conserved DBD (Song et al., 2015), one with the N- terminal deletion and another with the complete N-terminus but a deletion at position 340. The homolog with the N-terminal deletion was not identified in Z. muelleri, possibly due to partial annotation or incomplete assembly. The region at position 340 is conserved in other species but the function is unknown based on domain analysis using InterPro (Jones et al., 2014). The region recognized by EBF1 and 2 for ubiquitination is also unknown but expected to be near C-terminus.

106

It is possible that the two forms of EIN3/EIL1 act and are regulated differently in seagrasses. If encouraging EIN3/EIL1 accumulation is part of the salt-adaptive mechanism in seagrasses, the activity of EIN3/EIL1 degradation has to be repressed, possibly through regulation of EBF1 and 2 or changes in ubiquitination recognition sites in EIN3/EIL1. As one copy of EBF1 ortholog was found in Z. marina (Protein ID: Zosma17g00680.1) but none in Z. muelleri, whether the function of EBF1 is conserved in seagrasses needs further examination.

Table 3-4 A. thaliana representative genes in AMOS corresponding to ethylene signalling (GO:0010105) pathway.

A. thaliana representative Gene name genes corresponding to (Gene description) GO:0010105 AT1G50640.1 ERF3 (Ethylene-responsive transcription factor 3) AT1G66340.1 ETR1 (Ethylene receptor 1) AT1G04310.1 ERS2 (Ethylene response sensor 2) AT2G26070.1 RTE1 (Protein REVERSION-TO-ETHYLENE SENSITIVITY1) AT2G25490.1 EBF1 (EIN3-binding F-box protein 1) AT5G03730.1 CTR1 (Serine/threonine-protein kinase CTR1)

107

Figure 3-7 Ethylene biosynthesis and signalling in Z. muelleri. Genes encoding proteins marked with a cross are found to be lost and those marked with a tick are found to be present. EBF1/2 is marked with a question mark as both are not found in Z. muelleri but EBF1 is present in Z. marina. Figure adapted and edited from (Golicz et al., 2015a).

108

109

110

111

Figure 3-8 EIN3/EIL1 multiple sequence alignment between Z. muelleri (ZMU), Z. marina (ZMA), A. thaliana (ATH), G. max (GMA), O. sativa (OSA), P. trichocarpa (PTR) and S. polyrhiza (SPO). Regions conserved between all sequences were indicated in blue, DNA- binding domain was indicated by grey boxes and region deleted in Z. muelleri and Z. marina but conserved in eight other sequences was boxed in red.

112

3.3.1.3 Methyl-jasmonate synthesis

With the loss of ethylene signalling and biosynthesis in Z. muelleri, the conservation of other gaseous signals is in question. The presence of jasmonate methyl transferase (JMT), which is responsible for the conversion of jasmonic acid to its volatile derivative, was investigated. As JMT belongs to the carboxylic acid group of O-methyltransferases (OMTs) family, multiple sequence alignment between putative carboxylic acid OMTs in Z. muelleri, Z. marina, and other plant species was performed. The neighbour-joining tree of sequence similarity (Figure 3-9) shows the clustering of three Z. muelleri and two Z. marina proteins with Z. mays SAM:anthranilic acid methyltransferase (AAMT), whereas another Z.marina protein clusters with SAM:salicylic acid carboxyl methyltransferases (SAMT). All of the seagrass proteins are distantly related with the two JMTs. Sequences with high similarities to MeJA esterase (MJE), which reverses MeJA to jasmonic acid, as well as other key genes for jasmonate biosynthesis (AOS1, AOC, LOXs, OPR3 and JAR1) and signalling (COI1, JAZs, MYCs and NINJA), were found to be present in Z. muelleri.

Jasmonate is a class of fatty-acid derived plant hormone which function in multiple developmental processes, including fertility, root elongation and fruit ripening, as well as defence response against pathogens and wounding (Katsir et al., 2008). Jasmonate represents a collection of bioactive compounds, including jasmonic acid, its precursors, conjugates and derivatives (Figure 3-10). Among the jasmonic acid derivatives, methyl- jasmonate (MeJA) is released as a volatile compound (reviewed in Wasternack and Hause, 2013). Since seagrasses had dispensed with the use of the gaseous hormone ethylene, it is therefore of interest to identify whether MeJA is synthesized in seagrasses. The conversion of jasmonic acid to MeJA is catalyzed by JMT and reversed by MJE. The level of JMT conservation across species and the substrate-binding domains are not well characterized, but JMT proteins are highly substrate specific (Zubieta et al., 2003).

Since no orthologs of JMT are detected in Z. muelleri, it is likely that MeJA biosynthesis is lost. However, there are possibilities that other OMTs host amino acid substitutions in key regions that enable wider substrate specificity, therefore function to convert more than one carboxylic acid. Measurement of the level of JA and its derivatives in seagrasses, as well as further characterization of domains in plant JMTs will help in concluding the loss of MeJA in marine adaptation.

113

Figure 3-9 Neighbour-joining tree of putative carboxylic acid OMTs in Z. muelleri (ZMU), Z. marina (ZMA), Z. mays (ZMY), N. tabacum (NTA), C. breweri (CBR), A. majus (AMA), A. thaliana (ATH) and B. rapa (BRA). Carboxylic acid OMTs included were SAM:salicylic acid carboxyl methyltransferase (BAMT), Salicylate/benzoate carboxyl methyltransferase (BSMT), SAM:anthranilic acid methyltransferase (AAMT) and SAM:jasmonic acid carboxyl methyltransferase (JMT). 114

Figure 3-10 Jasmonate biosynthesis. α-linolenic acid is converted to 12-oxophytodienoic acid (OPDA) by three enzymes: lipoxygenase (LOX), allene oxide synthase (AOS) and allene oxidase cyclase (AOC). OPDA is then reduced by OPDA reductase 3 (OPR3) and oxidized to form jasmonic acid (JA). JA can be transformed to various derivatives, including methyl-jasmonate (MeJA). Figure adapted from Fonseca et al., 2009.

3.3.1.4 Crosstalk between ethylene, jasmonate, abscisic acid and gibberellin signalling pathways

The hormone-mediated signalling pathway (GO:0009755; p value 0.01487), and pathways related to abscisic acid (GO:0009737; p value 0.00344) and gibberellin (GO:0009739; p value 0.00781) were also enriched in AMOS families. Twenty two genes in AMOS are

115

abscisic acid-related, where most of them encode for transcription factors. Ten representative genes in AMOS were found to be involved in the GO term response to gibberellin. Among those, four out of five Z. muelleri proteins with sequence similarity to DELLA proteins, which are repressors of the gibberellin signalling pathway (Daviere and Achard, 2016), clustered in Z. muelleri-specific orthologous groups, indicating sequence diversification.

Abscisic acid stimulates short-term response under abiotic stress and is involved in processes like seed dormancy, stomata closure and leaf abscission. Gibberellin is recognized as an antagonist of abscisic acid, promotes seed germination and plays a role in plant growth and development. The interaction between hormone signalling pathways are complex and coordinated in plants. Ethylene-responsive proteins, such as EIN3 and ERF, have been shown to crosstalk with the jasmonate ZIM-domain (JAZ) repressor (Song et al., 2014) and DELLA proteins (De Grauwe et al., 2008), which are members of jasmonate and gibberellin signalling pathways, respectively. JAZ and DELLA were also shown to interact to regulate the antagonization of gibberellin and jasmonate in regulating seedling growth and pathogen interaction (Song et al., 2014). Ethylene and abscisic acid also regulate the signalling components of each other in plant development such as root growth and seedling development (Ghassemian et al., 2000; Cheng et al., 2009).

The network of complex hormone crosstalk is likely to be affected in Z. muelleri with the absence of ethylene. The Z. muelleri-specific groups of abscisic acid responsive transcription factors and DELLA proteins are likely to be modified to compensate for the loss of ethylene signalling members.

3.3.1.5 Genes involved in other pathways: starch metabolism, stress response, ribosomes and terpenoid biosynthesis

Genes involved in a few other pathways were also enriched in AMOS, including starch metabolism, stress response, terpenoid biosynthesis and ribosomal proteins (Appendix 3). A total of 14 genes in the starch catabolic (GO:0005983) and biosynthetic (GO:0019252) processes corresponded to the AMOS group. Multiple heat shock proteins were also enriched in three stress response-related GO terms, response to hydrogen peroxide (GO:0042542), hydrogen peroxide biosynthesis (GO:0050665) and response to high light

116

intensity (GO:0009644). 8 ribosomal-related (KEGG pathway ath03010) gene families were highly represented in AMOS. Genes of terpenoid backbone biosynthesis (ath00900) and diterpenoid biosynthesis (ath00904) were also found in AMOS.

The enrichment of starch metabolism genes in the AMOS group suggests modifications in the seagrass carbohydrate storage. The reduction of starch-related genes was also reported in the Z. marina genome (Olsen et al., 2016). In seagrass leaves, sucrose is the preferred carbon storage compared to starch (Burke et al., 1996). In addition, carbohydrate metabolism is one of the known hypoxic response triggered by submergence in land plants (reviewed in Fukao and Bailey-Serres, 2004) .

Seagrasses are adapted to conditions that are deemed stressful for land plants, such as high salinity and irregular light. Hydrogen peroxides are intermediates of reactive oxygen species (ROS), which are synthesized in response to kinase cascades triggered by stress signals. Heat shock proteins function to chaperone protein folding, assembly and degradation under stress and normal conditions (reviewed in Park and Seo, 2015). It is likely that the stress perception and response network in seagrasses are modified under the selection pressure in the marine environment.

As ribosomes play important roles in translation, a fundamental biological process, ribosomal genes are expected to be highly conserved within eukaryotes (reviewed in Wilson and Doudna Cate, 2012). However, Z. muelleri ribosomal genes have low sequence similarities with other plant species, and ribosomes were also found to be positively selected in Z. marina and P. oceania ESTs (Wissler et al., 2011). The basis for the observed differences in ribosomal gene sequences is not known, but it is postulated to be related to salt tolerance. Translation and consequently, protein synthesis are shown to be salt-sensitive in yeast and plants (Rausell et al., 2003). For example, the expression of genes involved in the translation apparatus had been shown to increase when transcriptomics of A. thaliana was compared to the halophyte salt cress (Taji et al., 2004). When under salt stress, levels of genes encoding for plastidial translation machinery were altered in A. thaliana (Omidbakhshfard et al., 2012). Ribosomes in seagrasses could be modified to contribute to salt tolerance through regulation of salt responses.

The presence of terpenoid biosynthesis genes in AMOS is supported by the loss of genes in synthesizing volatile terpenoids reported in Z. marina (Olsen et al., 2016). Terpenoids

117

constitute the largest class of volatiles produced by plants, functioning in plant defence and reproduction in land plants (Dudareva et al., 2006). 3.3.2 Identification of WGD events in Z. muelleri and the refinement of the timing Tau

Phylogenetic analysis between Z. muelleri and 18 other genomes revealed the absence of the ancient WGD Tau (τ) in Alismatales. Results of phylogenomic timing analysis support that Tau occurred after the divergence of the Alismatales from the main monocot lineage (summarised in Figure 3-11). Figure 3-12 and Figure 3-13 show two exemplar orthogroup trees of gene families in monocots, where the Z. muelleri and S. polyrhiza branch appear to be independent of the Tau event (red star). Duplications specific to the Zostera lineage were also detected (green stars).

The ancient pan-Commelinid WGD (Jiao et al., 2014) Tau, has been identified in most of the sequenced genomes in the monocotyledon lineage, including grasses, banana and palm (International Brachypodium Initiative, 2010; D'Hont et al., 2012; Al-Mssallem et al., 2013), and Tau was hypothesized to be a pan-monocot WGD which occurred about 150 Mya, echoing the pan-eudicot WGD gamma (ɣ). Seagrasses belong to the basal monocot order Alismatales, which diverged relatively early in monocot evolution. Tau was estimated to occur 150 Mya, coinciding with the origin of monocots (Bremer et al., 2009), whereas the split of Alismatales from monocots occurred at 130 Mya based on analysis of the chloroplast gene rbcL (Janssen and Bremer, 2004). The seagrass genome is therefore a suitable outgroup for comparative analyses. The only other sequenced representative of Alismatales at the time of this study is the giant duckweed, S. polyrhiza. The recent comparison of syntenic regions of the pineapple genome to another member of Alismatales, the duckweed, revealed that the Tau event is not shared by Alismatales (Ming et al., 2015). Results in Figure 3-12 and Figure 3-13 supported that the early split of the Alismatales branch occurred before Tau. The presence of Tau was indicated in Z. marina with a lack of supporting evidence (Figure 2c, Olsen et al., 2016).

Multiple duplications specific to Z. muelleri were also detected. However, since the Z. marina genome was not available at the time of this analysis, whether these duplications were shared within the genus was unknown. The following section discusses the comparison between these two species, which was performed following the publication of the Z. marina genome.

118

Figure 3-11 Phylogenetic timing of inferred gene duplications. Values given are the number of orthogroups with at least one Z. muelleri gene showing duplications at the specified branches on the tree.

119

Figure 3-12 Exemplar orthogroup tree (orthoID_1773) showing duplication events in monocots. Red star shows the phylogenetic duplication timing of Tau event. Green stars show the duplications specific to Z.muelleri lineage. Numbers on branches are the bootstrap values. Branch lengths are drawn to scale, with the scale bar indicating the number of nucleotide substitutions per site.

120

Figure 3-13 Exemplar orthogroup tree (orthoID_2030) showing duplication events in monocots. Red star shows the phylogenetic duplication timing of Tau event. Green stars show the duplications specific to Z. muelleri lineage. Numbers on branches are the bootstrap values. Branch lengths are drawn to scale, with the scale bar indicating the number of nucleotide substitutions per site.

121

3.3.3 Z. muelleri has an extra WGD when compared to Z. marina

The duplication of gene content in Z. muelleri compared to Z. marina was evaluated in two ways: 1) using copy numbers of orthologous pairs and 2) using rate of substitutions per synonymous sites (Ks) of paralogous pairs. In the first method, K-mer frequency comparison of the coding sequences of 1,944 randomly selected orthologous pairs between Z. marina and Z. muelleri revealed a copy number ratio of 1:1.6. Gene set alignments of Z. muelleri to Z. marina showed about 52% (7,502) of Z. marina genes had two copies of orthologs in Z. muelleri (Figure 3-14), whereas 3,393 are single copy. The rate of substitutions per synonymous sites (Ks) of paralogous pairs in Z. muelleri and Z. marina further suggested a species-specific WGD event. Excluding small-scale background duplications, two significant components were demonstrated in Z. muelleri paralogous pairs (Figure 3-15b; mean-variance-proportion: 0.10-0.00-0.40; 0.17-0.01- 0.55; 1.10-0.32-0.05), whereas three were found in Z. marina pairs (Figure 3-15a; mean- variance-proportion: 0.38-0.04-0.17; 1.19-0.09-0.30; 1.99-0.12-0.24; 2.63-0.05-0.08). The peak at 0.17 (red in Figure 3-15b), which was absent in Z. marina, was identified as a potential extra WGD in Z. muelleri which occurred 5.7-10.5 Mya, based on assumptions of synonymous mutation rate per base (Koch et al., 2000; Lynch and Conery, 2000). The peak shared between two species at 1.1 (grey in Figure 3-15a and blue in Figure 3-15b) suggests a pan-lineage WGD.

Based on previous multi-locus phylogeny studies (Coyer et al., 2013) using sequences of nuclear (ITS1) and chloroplast (matK, rbcL and psbA-trnH) genes, Z. marina and Z. muelleri belong to two clades, Zostera and Nanozostera, that diverged around 14 Mya. Z. marina is estimated to be 430 Mbp (Golicz et al., 2015a) and assembled to 202 Mbp (Olsen et al., 2016), which is 50-70% smaller than the Z. muelleri genome. Z. marina also has fewer predicted genes (20,450) (Olsen et al., 2016) than Z. muelleri (35,875). With the lack of Z. muelleri flow cytometry data, there is no preliminary evidence to connect the doubling of genome size to large scale duplication. Based on the composition of duplicated CEGMA and BUSCO orthologs in Z. muelleri (as shown in Chapter 2 Section 2.5.3), we hypothesized that Z. muelleri possibly underwent at least one recent large scale genome duplication, or is the result of recent hybridization between Zostera species, causing the duplication of gene content. Results of k-mer abundance, sequence alignments and Ks supported the presence of large-scale duplication in Z. muelleri that is not shared by Z.

122

marina. The pan-lineage WGD observed in Figure 3-14 is likely to coincide with the WGD previously found in Z. marina (Olsen et al., 2016) which occurred 64-72 Mya.

Figure 3-14 Histogram of orthologous gene ratio of Z. muelleri to Z. marina

123

a.

b.

Figure 3-15 Rate of substitutions per synonymous sites (Ks) of paralogous gene pairs in a) Z. marina with four distinct components, and in b) Z. muelleri with three distinct components.

124

3.4 Conclusion

In this chapter, the Z. muelleri annotated gene set was extensively studied using comparative alignment techniques to analyse orthologous relationships and whole genome duplication events. As genomes of closely-related taxa of Z. muelleri were not available at the time of analysis, for most of the comparative analysis, model species of clades and orders were used. In a few instances, the recently published Z. marina genes were used as additional analysis.

Gene family analysis revealed 7,404 gene families shared by Z. muelleri with four other model species and 2,233 species-specific families. As seagrasses have land plant ancestors, the genetic drive behind marine adaptation is likely to involve minor modifications such as gene family expansion and gene loss or pseudogenization, rather than generation of genes with novel functions. Nevertheless, functional prediction of the Z. muelleri-specific families, such as reference-based or ab initio domain annotations, although challenging, is helpful in complementing our understanding in marine adaptation.

A total of 1,131 gene families conserved in all species except for Z. muelleri termed as AMOS were identified, highlighting the difference between Z. muelleri and land plants. It is important to note that analysis of the AMOS family does not identify novel genes, and does not elucidate between lost and modified genes. It is also possible that the fragmentation of scaffolds, misassemblies and misannotation interfere with the presence and absence outcome of genes.

Genes involved in ethylene biosynthesis and signalling, including enzymes for precursor conversion, receptors and signalling kinase, are lost in Z. muelleri. The possible loss of gaseous MeJA was also examined through the presence of carboxyl methyltransferase JMT. Plants rely on complex hormone cross talk as central regulators of response to environmental stresses. With the loss of genes for ethylene production and perception in seagrass and possible modification of abscisic acid, gibberellin and jasmonate signalling pathways, seagrasses provide an interesting model to test hypotheses about roles and interactions of these hormone signalling pathways. The PME and PMEI gene families are expanded in Z. muelleri, indicating tight control of the cell wall composition and modification, possibly to assist osmoregulation and ion exchange. Other pathways with possible modification of genes include ROS stress response, starch metabolism, terpenoid biosynthesis and ribosomal proteins. Genes of the pathways highlighted in these results

125

provide the first step towards understanding the genetic basis of marine adaptation in seagrasses.

Using calibrated phylogenetic trees of gene families in Z. muelleri and 17 taxa, WGD events were identified. The ancient WGD Tau is absent in both Z. muelleri and S. polyrhiza, indicating that the split of Alismatales from monocots occurred before Tau. A lack of synteny information limits the conclusion drawn for pan-lineage and Z. muelleri- specific WGDs. However, using comparison of ortholog copy numbers and paralogs Ks rates between Z. muelleri and Z. muelleri, one WGD at 64-72 Mya was identified to be common between Z. muelleri and Z. marina. Another recent WGD potentially occurred at 5.7-10.5 Mya in Z. muelleri but this event is not shared in Z. marina. These results were summarised in Figure 3-16. As genome duplication is expected to be triggered by selection and contributes to the generation of novel genes, more gene comparison between Z. marina and Z. muelleri is needed. It would be interesting to find geographical adaptation signals since Z. marina and Z. muelleri inhabit Northern and Southern Hemisphere respectively, and belong to two different clades. Selection pressure analysis was attempted on Z. muelleri genes using land plant species as references with no significant results (not described in this thesis), likely due to the phylogenetic distances between seagrass and land plants. Future efforts in gene comparison, selection pressure analysis, and refining the timing of WGDs in seagrasses will contribute in understanding the return-to-the-sea events.

126

Figure 3-16 Phylogenetic timing of WGD events in plants. WGD events postulated in Z. muelleri are labelled as grey and yellow circles. Figure adapted and edited from Ming et al., 2015.

127

Chapter 4. Genetic comparison between two independent seagrass lineages

4.1 Introduction to the independent evolution of seagrass lineages

The term seagrass is used to collectively describe the group of flowering plants which live fully submerged in the sea and form mono-specific meadows resembling terrestrial grasses. Elucidating the taxonomy of seagrasses has long been challenging due to the ambiguous definition of properties unique to seagrasses (Den Hartog and Kuo, 2006). For example, some aquatic species are adapted to both submergence and salinity, but do not successfully thrive in the marine environment, such as Ruppia and Potamogeton in hypersaline brackish waters; whereas some members of the seagrass genera, such as Halophila and Cymodocea, extended their habitat from the coast to estuaries. These debatable cases had contributed to the development of seagrass taxonomic keys. These difficulties also highlight the polyphyletic nature of seagrasses and imply that members of the group are not necessarily closely-related taxonomically.

Molecular evidence using plastid genes detected at least three independent lineages in seagrasses (Les et al., 1997; Janssen and Bremer, 2004; Larkum et al., 2006). Figure 4-1 shows the phylogeny of the core Alismatids, where a total of three families, Hydrocharitaceae, Cymodoceaeceae complex (including Ruppiaceae and Posidoniaceae) and Zosteraceae contain at least one seagrass genus. The seagrass lineages (red stars in Figure 4-1) are embedded between branches of non-marine angiosperms, including freshwater species, indicating multiple occurrences of marine colonization. The sharing of physiological and morphological characteristics between individual lineages of seagrasses is an example of parallel evolution. However, to date, no genetic analysis has been performed to characterize the adaptation mechanisms that evolved independently in each lineage. Genetic differences between model plants and Zosteraceae as described in previous chapters and by others (Golicz et al., 2015a; Olsen et al., 2016) were not explored in other seagrass lineages.

As these gene losses, modification and gene family expansions within single genus are not sufficient to reflect marine adaptation of all seagrasses, analysis of a second lineage is needed to provide non-plastid genetic evidence of seagrass parallel evolution. To investigate the genetic processes that caused the independent rise of seagrass 128

phenotypes, Halophila ovalis from the Hydrocharitaceae family, was selected for inter- genus comparison. The seagrass branch in Hydrocharitaceae diverged about 55 Mya (Chen et al., 2012), which is 30 My earlier than Zosteraceae (Coyer et al., 2013). Examples of parallel evolution, where similar phenotypes were generated from a similar genetic process of independent convergent evolution (Ord and Summers, 2015), is not abundant in plants (example of carnivorous species Fukushima et al., 2017) . Other less phenotypically-obvious examples include the recurrence of C4 photosynthesis in plant lineages (reviewed in Washburn et al., 2016) and the convergent mutations in loci during domestication (Paterson et al., 1995). Also, as the likelihood of convergent evolution is predicted to decrease with phylogenetic distance (Ord and Summers, 2015), the repeated adaptation in Zostera and Halophila species is particularly interesting.

The main aim of this chapter is to answer whether the phenotypic characteristics common to seagrasses arise from the same genetic changes across independent lineages. Similarities between seagrasses of Hydrocharitaceae and Zosteraceae were explored in two aspects: 1) whether the gene loss previously identified in Z. muelleri and Z. marina is also observed in H. ovalis, 2) are there any seagrass-specific genes that are unique to both lineages? These results were expected to provide molecular evidence for seagrass convergent evolution, as well as yielding a more complete description of marine adaptation genetics. Genetic differences between H. ovalis and Zosteraceae species were also explored.

129

Figure 4-1 Phylogeny of core alismatid species. Branches of seagrass species were labelled with red stars. Figure adapted and modified from (Ross et al., 2016).

130

4.2 Summary of pipeline and workflow

To compare the molecular evolution of two independent lineages represented by H. ovalis and Zosteraceae, analyses were performed with the aim of obtaining two groups of genes; genes that are commonly lost in both lineages, and genes that are uniquely conserved in both lineages. This workflow was summarised in Figure 4-2.

First, H. ovalis whole genome sequencing reads were compared to two groups of annotated genomes to identify presence and absence of genes through the gene loss pipeline (Golicz et al., 2015b). The first group consists of five model plant species (four land plants: A. thaliana, O. sativa, M. acuminata, P. dactylifera; one floating freshwater plant: S. polyrhiza), and the second group consists of Z. muelleri and Z. marina. In the gene loss pipeline, reads were mapped to reference coding sequences (CDS), the horizontal and vertical coverage of exons in each CDS were then calculated to determine gene presence. Read-to-genome mappers available are abundant (example BWA (Li and Durbin, 2009) and Bowtie (Langmead et al., 2009)) . However, these mappers were only used in intra- or closely related species analyses, such as genome resequencing, RNA- seq and SNP detection. The mapping algorithms therefore have strict parameters for mismatched and gapped alignments to increase mapping sensitivity (Holtgrewe et al., 2011). To map reads to divergent references, mappers such as Stampy that account for indels, are used across species but within genus (Lunter and Goodson, 2011). A variation of BLAST, dc-megablast allows gapped extension, therefore supports alignments of sequences with a low degree of identity (Camacho et al., 2009). dc-megablast was used in detecting Z. muelleri gene loss with success (Golicz et al., 2015b).

To eliminate noise and false positives, only genes in two specific orthologous groups, OGCsM (orthologous gene clusters conserved between seven species, with at least one gene originating from monocot; Golicz et al., 2015a) and OGCZ (orthologous gene clusters uniquely conserved between Z. muelleri and Z. marina), were inspected and assigned with present/absent status. The construction of these two OGCs is further described in Section 4.3.3. Results were in four categories: 1) Genes common in H. ovalis and all angiosperms; 2) genes common in all angiosperms but lost in H. ovalis; 3) genes uniquely conserved in all seagrasses; 4) genes present in Zosteraceae but lost in H. ovalis.

Genes in category 2 were compared to gene loss previously reported in the Zosteraceae lineage (Olsen et al., 2016) (Golicz et al., 2015a), to obtain a collection of genes commonly 131

lost in seagrasses. Category 3 represents “gained” genes, or genes that are conserved in seagrasses but have low sequence similarity in other model plants. Evidence of molecular convergence was searched in this category through multiple sequence alignments.

The methods and results obtained through this workflow were detailed in the following sections.

132

Figure 4-2 Workflow describes whole genome comparison of H. ovalis to two groups of genomes, five model plant species (A. thaliana, O. sativa, M. acuminata, P. dactylifera and S. polyrhiza) and two seagrass species (Z. muelleri and Z. marina). Status of conserved or 133

lost was assigned to genes within the orthologous clusters, OGCsM (orthologous gene clusters conserved between seven species, with at least one gene originating from monocot; Golicz et al., 2015a) and OGCZ (orthologous gene clusters uniquely conserved between Z. muelleri and Z. marina). Results were in four categories: 1) Genes common in H. ovalis and all angiosperms; 2) genes common in all angiosperms but lost in H. ovalis; 3) genes uniquely conserved in all seagrasses; 4) genes present in Zosteraceae but lost in H. ovalis. Category 2 was further compared with the Zosteraceae lineage to obtain a collection of seagrass lost genes. Evidence for molecular convergence was searched in category 3, the collection of seagrass-specific genes. Solid arrows represent the gene loss pipeline and dotted arrows represent the subsequent analysis.

134

4.3 Materials and methods

4.3.1 DNA sequencing

The H. ovalis sample was collected and provided by Prof. Gary Kendrick (University of Western Australia). The collection location was Swan River, Claremont waters in Perth, Western Australia (coordinate: 32 0’ 3.98’’ S, 115 45’ 18.31’’ E).

DNA extraction was performed by Mrs. Anita Severn-Ellis (University of Western Australia). The growth tips of the seagrass thalli were carefully removed, rinsed in sterile water and inspected for visible external contamination. Seven hundred milligram of tissue was placed in 5ml tubes, flash frozen in liquid nitrogen and bead pulverised using the 2010 Geno/Grinder (SPEX SamplePrep, USA). The Qiagen DNeasy Plant mini kit was used for the extraction of the DNA. The frozen powdered plant material was suspended in 3ml of Buffer AP1 and 28ul of RNAse A was added. After incubating at 65 °C, 910 ul of Buffer AP2 was added. The tubes were incubated on ice for 5 minutes and centrifuged thereafter to collect plant debris. Four hundred and fifty ul of lysate was transferred to each of 5-6 qiashredder tubes. The remainder of the DNA extraction steps were followed according to the kit protocol. The extracted DNA of each repeatition was pooled after elution. DNA concentration was quantitated using the Qubit 3.0 Fluorometer (Thermo Fischer Scientific) and visualised using the Labchip GX Touch 24 (PerkinElmer).

The extrated DNA was submitted to the Australian Genome Research Facility (AGRF) for library preparation and whole genome sequencing. The libraries for genome sequencing were prepared using the Illumina Tru-seq Nano DNA HT Library Preparation kit, according to the manufacturer’s instructions. Genomic DNA was sequenced using an Illumina XTEN sequencer with 150 bp paired-end (PE) technology at the Garvan Institute of Medical Research.

4.3.2 Pre-processing of reads

FastQC (Andrews) was performed on the sequenced reads using default parameters for quality checks.

135

The reads were preprocessed for adapters-removal using Trimmomatic (Bolger et al., 2014) with the following commands:

R1=” Undetermined_S0_L008_R1_001.fastq”

R2=” Undetermined_S0_L008_R1_001.fastq”

R1P="Haloph_R1_p.fastq"

R1S="Haloph_R1_s.fastq"

R2P="Haloph_R2_p.fastq"

R2S="Haloph_R2_s.fastq"

/home/jlee/downloads/FastQC/fastqc $R1 $R2

java -jar /group/pawsey0149/groupEnv/ivec/app/Trimmomatic/Trimmomatic- 0.33/trimmomatic-0.33.jar PE -threads 36 -trimlog trimmomatic.log $R1 $R2 $R1P $R1S $R2P $R2S ILLUMINACLIP:Truseq.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:20 MINLEN:50"

4.3.3 Orthologous gene cluster (OGC) construction

Gene clusters conserved between seven model species (dicot: A. thaliana, S. lycopersicum, N. nucifera, and monocot: O. sativa, P. dactylifera, M. acuminata and S. polyrhiza) with at least one gene originating from a monocot species termed OGCsM (as previously defined and constructed in Golicz et al., 2015a) was used to represent orthologs highly conserved in plants.

Gene clusters unique to Zosteraceae were identified using all-against-all comparison with BLASTP (Camacho et al., 2009) “blastp -evalue 1e-5” and OrthoMCL (Li et al., 2003) between Z. muelleri, Z. marina, A. thaliana, O. sativa, M. acuminata and S. polyrhiza (data source as stated in Chapter 3). Clusters containing genes originating from both Zostera species were termed OGCZ.

4.3.4 Pipeline to identify lost and conserved genes

136

The identification of lost and conserved genes was achieved using the mapping of whole genome shotgun sequencing reads against reference genomes based on the previous approach (Golicz et al., 2015b). Reads were mapped to coding sequences (CDS) of reference species using dc-megaBLAST (Camacho et al., 2009) with e-value 1e-5. A custom python script calculate_blast_coverage.py was used to calculate the horizontal coverage of each CDS. The average coverage of each CDS across multiple reference species was calculated. If the average coverage was <2%, the ortholog was considered lost. If the average coverage was >50%, the ortholog was conserved. Command lines were as below:

blastn –task dc-megablast –query $QUERY –db $DB –evalue 1e-5 –outfmt 6 –out $HIT

perl choose_best_hit_blast.pl $HIT > $HIT.uniq

python count_gene_coverage_v3.py $REF $OUT.uniq > $COV

python find_conserved3_atha.py $COV length.txt $OGC 0.5 > conserved.txt

python find_lost3_atha.py $COV length.txt $OGC 0.5 > lost.txt

4.3.5 Lost and conserved H. ovalis, Z. muelleri and Z. marina genes in OGCsM

Primary transcript CDSs of five species (four land plants: A. thaliana, O. sativa, M. acuminata, P. dactylifera; one floating freshwater plant: S. polyrhiza; versions as listed in Golicz et al., 2015a) were used as references for mapping of H. ovalis reads. For each ortholog in OGCsM, lost or conserved status was assigned in each species. Comparison of gene lists was done between H. ovalis, Z. muelleri and Z. marina.

4.3.6 Lost and conserved H. ovalis genes in OGCZ

Primary transcript CDSs of Z. muelleri annotated as described in Chapter 2 (Data downloadable through http://www.appliedbioinformatics.com.au/index.php/Seagrass_Zmu_Genome) and Z. marina (Phytozome 10; Olsen et al., 2016) were used as references for H. ovalis read mapping. For each ortholog in OGCZ, lost or conserved status was assigned in H. ovalis.

137

4.3.7 Gene ontology enrichment and word cloud plotting

GO annotation and enrichment were performed using the topGO package (Alexa and Rahnenfuhrer, 2010) based on the previous approach (Golicz et al., 2015a). A word cloud was generated and coloured to represent the enriched significance of GO terms using the wordcloud package (Fellows, 2013). Commands were as described in Chapter 3 Section 3.2.2.

4.3.8 Assembly of H. ovalis protein and multiple sequence alignments with orthologs of other species

H. ovalis reads aligned to CDS of 50S ribosomal protein Rpl16 were extracted and assembled using Spades v3.10.1 (Bankevich et al., 2012) with the following commands “spades.py --only-assembler -1 reads_1.fasta -2 reads_2.fasta”. The corresponding protein was aligned to the assembled contigs using Exonerate (Slater and Birney, 2005) with the following parameters: “exonerate –model protein2genome –E 1 –bestn 1 –score 100 – softmaskquery no –softmasktarget yes –minintron 20 –maxintron 20000 –ryo ‘>HAL_%qi_%qd\n%tas’”. The aligned target regions were translated to protein sequences using the translate tool in ExPASy (Gasteiger et al., 2003). Each H. ovalis protein sequence obtained was aligned with orthologs of selected species (Table 4-1) using MAFFT (Katoh et al., 2002). A phylogenetic tree was plotted with PhyML (Guindon et al., 2009) assuming the JTT model for amino acid substitution and gamma parameter for invariable sites (based on Huang et al., 2016) using the alignments excluding the outgroup (charophyte and chlorophyte). The multiple-sequence alignments were visualized and coloured using Jalview (Waterhouse et al., 2009).

138

Table 4-1 Species selected for multiple sequence alignment of orthologous proteins.

ID Species Order/Suborder Habitat UniProt/ Protein ID ATH|DL Arabidopsis Dicot Land P56793 thaliana SLY|ML Solanum Dicot Land Q2MI63 lycopersicum OSA|ML Oryza sativa Monocot Land P0C443 PDA|ML Phoenix dactylifera Monocot Land D5FHC4 MAC|ML Musa acuminata Monocot Land S6DQH6 DSE|MLA Dieffenbachia Monocot, Alismatales, Land A0A0G3FDD1 seguine Araceae TTH|MLA Tofieldia thibetica Monocot, Alismatales, Land A0A142CRI4 Tofieldiaceae PPE|MWA Potamogeton Monocot, Alismatales, Freshwater, A0A142CRR9 perfoliatus Core alismatid submerged AMA|MLA Alocasia Monocot, Alismatales, Land A0A0U2CJM7 macrorhizzos Araceae

ECA|MWA Elodia canadensis Monocot, Alismatales, Freshwater, J7F447 Core alismatid submerged SPU|MWA Spirodela punctata Monocot, Alismatales, Freshwater, P06510 Araceae floating SPO|MWA Spirodela polyrhiza Monocot, Alismatales, Freshwater, G1FB78 Araceae floating WAU|MWA Wolfia australiana Monocot, Alismatales, Freshwater, G1FBL9 Araceae floating LMI|MWA Lemna minor Monocot, Alismatales, Freshwater, A9L9D3 Araceae floating NFL|MWA Najas flexilis Monocot, Alismatales, Freshwater, S4TB75 Core alismatid submerged SLI|MWA Sagittaria Monocot, Alismatales, Freshwater, A0A142CS04 lichuanensis Core alismatid emerged EAU|MLA Epiprenum aureum Monocot, Alismatales, Land A0A0M4B5N3 Araceae CRE|CHL Chlamydomonas Chlorophyta - P05726 reinhardtii MVI|CHA Mesostigma viride Charophyta - Q9MUU3 HAL|MMA Halophila ovalis Monocot, Alismatales, Marine, - Core alismatid submerged ZMU|MMA Zostera muelleri Monocot, Alismatales, Marine, snap_masked- Core alismatid submerged 124_71937_1_ 31140--0.14- mRNA-1 ZMA|MMA Zostera marina Monocot, Alismatales. Marine, Zosma101g00 Core alismatid submerged 550.1

139

4.4 Results and discussion

4.4.1 H. ovalis reads quality check and preprocessing

The per base Phred distribution of the paired end reads (Figure 4-3) shows a relatively large standard deviation at the end of reads, particularly in read 2. A total of 7,061,495 (1.4%) read 1 and 225,706 (0.04%) of read 2 have an average quality Phred score of lower than 20. Two overrepresented short sequences (3.4% and 0.3%) were identified in read 1 and one (3.3%) in read 2 (Figure 3-4). These sequences were adapters and primers used during the whole genome sequencing process and were removed using the pre-processing software Trimmomatic. Table 4-2 summarised the number of reads and nucleotides after the removal of adapters. A total of 908,769,239 (89%) reads were retained for the subsequent analysis.

140

a)

b)

Figure 4-3 Distribution of per base Phred quality score in a) read one and b) read two of the H. ovalis whole genome shotgun sequencing.

141

a)

b)

Figure 4-4 Overrepresented sequences in a) read one and b) read two of the H. ovalis whole genome shotgun sequencing.

Table 4-2 Number of reads and nucleotides retained for each pre-processing step.

Read 1 Read 2 Single reads Total reads Total nucleotides (Percentage (Percentage retained) retained)

Raw 510,485,779 510,485,779 0 1,020,971,558 154,166,705,25 reads (100%) 8 (100%)

After 425,827,039 425,827,039 57,115,161 908,769,239 136,315,385,85 adapters (89.0%) 0 (88.4%) removal

4.4.2 OGC construction

Based on previous studies (Golicz et al., 2015a), OGCsM was defined as a set of 16,007 gene clusters highly conserved in plants.

Figure 4-5 shows the number of orthologous clusters shared between Zostera and four other plant species as identified by OrthoMCL. Z. marina was added to better represent the Zostera genome in this Venn diagram, as an update of Figure 3-4 in Chapter 3. A total of 8,068 clusters were conserved reference species as well as seagrasses, whereas 3,294 142

clusters have low sequence similarities with other species and are unique to Zostera. A total of 1,748 clusters, containing 6,252 genes were therefore defined as OGCZ (Appendix 8).

143

Figure 4-5 Venn diagram showing the number of shared orthologous clusters among six species (A. thaliana, M. acuminata, O. sativa, S. polyrhiza and two Zosteraceae species).

144

4.4.3 Conservation of core biological processes

A total of 4,367 OGCsM genes were conserved in H. ovalis. When compared with conserved genes previously described in Z. muelleri and Z. marina (Golicz et al., 2015a) (Lee et al., 2016; Olsen et al., 2016), 3,335 (76.4%) genes were conserved in all three seagrass species, 377 genes were shared with either Z. muelleri or Z. marina and 655 genes were only conserved in H. ovalis. A total of 508 genes were only conserved in the Zosteraceae species. A full list of genes conserved in H. ovalis and their presence in other seagrass species is detailed in Appendix 8.

Among the GO terms enriched in the H. ovalis conserved genes involved core biological pathways such as photosynthesis, chlorophyll biosynthesis and glycolytic processes, as well as response to stresses such as cadmium ion and salinity (Table 4-3).

The conservation of genes in core plant cellular processes and stress response in H. ovalis was in agreement with observation in Z. muelleri (Chapter 3, Golicz et al., 2015a) and Z. marina (Olsen et al., 2016). As members of the monocot lineage with land plant ancestry, these findings were expected in the seagrass species. These results serve as validation of the quality of H. ovalis sequencing data and the gene loss pipeline performed.

145

Table 4-3 Significantly enriched biological process GO terms in the genes conserved in H. ovalis compared with five other plant species.

GO.ID Term p value GO:0046686 response to cadmium ion < 1e-30 GO:0006412 translation < 1e-30 GO:0006096 glycolytic process 1.00E-30 GO:0055085 transmembrane transport 9.70E-26 GO:0006099 tricarboxylic acid cycle 1.40E-22 GO:0018105 peptidyl-serine phosphorylation 1.30E-19 GO:0009651 response to salt stress 1.10E-18 GO:0015991 ATP hydrolysis coupled proton transport 7.60E-18 GO:0006094 gluconeogenesis 3.50E-17 GO:0009735 response to cytokinin 1.60E-16 GO:0015979 photosynthesis 5.20E-16 GO:0008152 metabolic process 5.10E-15 GO:0006886 intracellular protein transport 4.40E-13 GO:0006446 regulation of translational initiation 1.10E-12 GO:0051258 protein polymerization 1.40E-12 GO:0015986 ATP synthesis coupled proton transport 2.00E-12 GO:0008643 carbohydrate transport 2.80E-12 GO:0015995 chlorophyll biosynthetic process 5.40E-12 GO:0015031 protein transport 5.60E-12 GO:0071555 cell wall organization 5.60E-12 GO:0006468 protein phosphorylation 7.00E-12 GO:0006418 tRNA aminoacylation for protein translation 1.50E-11 GO:0006108 malate metabolic process 2.00E-11 GO:0018298 protein-chromophore linkage 7.50E-11 GO:0009853 photorespiration 9.50E-11 GO:0098656 anion transmembrane transport 1.20E-10 GO:0010498 proteasomal protein catabolic process 7.10E-10 GO:0006631 fatty acid metabolic process 8.70E-10 GO:0006006 glucose metabolic process 3.20E-09 GO:0044262 cellular carbohydrate metabolic process 3.70E-09

146

4.4.4 Gene loss in H. ovalis and comparison of lost genes between the three seagrass species

A total of 1,822 OGCsM genes were lost in H. ovalis, and these were compared with those previously reported as lost in both Z. muelleri and Z. marina (Lee et al., 2016; Olsen et al., 2016) (Golicz et al., 2015a) (Appendix 8). Out of these, a total of 1,197 (65.6%) lost genes were shared between all three seagrass species, 187 were shared with either Z. muelleri or Z. marina, and 412 were only lost in H. ovalis. In comparison to this, 743 genes were only lost in the Zosteraceae lineage.

Enriched GO terms of the 1,822 genes highlighted the loss of genes associated with ethylene biosynthesis and signalling, including fruit ripening (regulated by ethylene), 1- aminoclyclopropane-1-carboxylate (ACC, intermediate in ethylene synthesis) biosynthesis, ethylene biosynthetic process and negative regulation of ethylene-activated signalling pathway (Table 4-4; green in Figure 4-5). Genes involved in flower development (sepal and petal formation) are also lost (purple in Figure 4-3). Other terms include terpenoid biosynthetic process and stomatal complex patterning. Appendix 5 details the presence and absence of genes involved in three main biological processes; stomata development, ethylene synthesis and signalling, and terpenoid biosynthesis in H. ovalis, Z. marina and Z. muelleri.

The concurrent absence of multiple genes in H. ovalis, Z. muelleri and Z. marina has revealed the pathways that are potentially negatively-selected in the process of marine adaptation. The detection of ethylene absence in Z. muelleri and Z. marina raised the question of whether ethylene is dispensable in all species of submergent lifestyle. A freshwater submerged monocot Potamogeton pectinatus is also unable to produce ethylene, but does accumulate ACC and responds to exogenous ethylene (Summers et al., 1996; Summers and Jackson, 1998). This suggests that ethylene responsive pathway and enzyme activity are still possibly intact in some species. Results showed that enzymes responsible for biosynthesis of precursors, receptors, ethylene-responsive kinases and transcription factors of ethylene pathways are lost in H. ovalis, which matches the loss observed in Z. muelleri and Z. marina. Moreover, H. ovalis also lost the enzymes responsible for the synthesis of another gaseous compound, terpene. Together, these shared losses correspond to the low diffusion rate of gases underwater, and the dispensability of volatiles in biological pathways is likely to be one of the characteristics common to all seagrasses.

147

Two other well-studied morphologies of seagrasses are the absence of stomata in leaf epidermis, and flowers with simplified structures. For example, both H. ovalis and Zosteraceae species have reduced perianth and compressed ovaries in female flowers (Larkum et al., 2006). These phenotypical differences between seagrass and terrestrial angiosperms were also captured in the gene loss analysis, with genes of flower development and stomatal differentiation being absent.

148

Table 4-4 Significantly enriched biological process GO terms in the genes conserved in five other plant species but absent in H. ovalis.

GO.ID Term p value GO:0009835 fruit ripening 1.20E-10 1-aminocyclopropane-1-carboxylate GO:0042218 2.10E-10 biosynthetic process negative regulation of ethylene-activated GO:0010105 1.10E-06 signalling pathway GO:0009693 ethylene biosynthetic process 2.30E-06 GO:0048453 sepal formation 4.40E-05 GO:0048451 petal formation 6.90E-05 GO:0009626 plant-type hypersensitive response 0.0001 GO:0010375 stomatal complex patterning 0.0002 GO:0071493 cellular response to UV-B 0.0002 GO:0071486 cellular response to high light intensity 0.00044 GO:0071281 cellular response to iron ion 0.00045 positive regulation of defense response to GO:1900426 0.00083 bacterium GO:0071396 cellular response to lipid 0.00091 GO:0080027 response to herbivore 0.00142 GO:0010942 positive regulation of cell death 0.00214 photosynthetic electron transport in GO:0009773 0.00281 photosystem I GO:0015743 malate transport 0.00295 GO:0018106 peptidyl-histidine phosphorylation 0.00295 GO:0043543 protein acylation 0.00382 GO:0010257 NADH dehydrogenase complex assembly 0.00408 GO:0071483 cellular response to blue light 0.00408 GO:0071229 cellular response to acid chemical 0.0043 GO:0006811 ion transport 0.00461 GO:0006810 transport 0.00553 GO:0016114 terpenoid biosynthetic process 0.00616 GO:0019430 removal of superoxide radicals 0.0066 GO:0042542 response to hydrogen peroxide 0.00776 GO:0010162 seed dormancy process 0.00951 GO:0002237 response to molecule of bacterial origin 0.01088 GO:0046688 response to copper ion 0.01141

149

Figure 4-6 Significantly enriched biological process GO terms in the genes conserved in seven plant species but lost in H. ovalis. Terms were coloured based on related functions 1) green: ethylene synthesis and signalling, 2) purple: flower development, 3) brown: stomata development and 4) pink: terpenoid biosynthesis.

150

4.4.5 H. ovalis lost genes related to nitrogen usage

The five most significantly enriched GO terms in the 412 genes that were only lost in H. ovalis were transport (GO:006810), cellular response to high light intensity (GO:0071486), ion transport (GO:006811), thylakoid membrane organization (GO:0010027) and photosynthetic electron transport in photosystem I (GO:0009773). All significantly enriched terms are included in Table 4-5.

Closer examination revealed that a total of 23 (15 nuclear and 8 chloroplast encoded) genes which encode for the 5 subcomplexes in the NAD(P)H dehydrogenase (NDH) complex were lost (Appendix 6). In addition, 17 genes required for the supercomplex formation, including tethering of NDH to PSI, assembly of subunits, accessory proteins and transcription factors, were absent in H. ovalis.

NDH is a major protein complex residing in the thylakoid membrane of chloroplasts which participates in cyclic electron flow pathways as oxireductase (Figure 4-6a) (reviewed in Peltier et al., 2016) . The NDH complex is made up of at least 30 subunits (Figure 4-6b), with 11 of the NDH genes encoded in the chloroplast (ndhA-K) and 5 others in the nucleus (ndhL-S). Plastidal (Neyland and Urbatsch, 1996) and the expression of nuclear (Ruhlman et al., 2015), ndh genes, was shown to be highly conserved across all vascular plants. NDH-specific knockouts in plants showed no change in growth phenotype, but were sensitive to environmental stresses (Burrows et al., 1998; Kofer et al., 1998; Shikanai et al., 1998; Horvath et al., 2000; Martin et al., 2009). NDH gene expression is responsive to abiotic stresses, such as low temperature, low light and nutrient (particularly nitrogen) starvation (Peltier and Schmidt, 1991; Yamori et al., 2011; Ueda et al., 2012). As the NDH complex is only present in the Streptophyta lineage, which includes charophyte algae and plants, and acquisitions of novel NDH genes occurred during terrestrial transition, NDH is hypothesized as one of the innovations to enable land plant evolution (Martin et al., 2009; Ruhlman et al., 2015).

The absence of genes encoding for NDH subunits and proteins required for complex formation in H. ovalis points to a total loss of NDH complex in the H. ovalis thylakoid. A recent whole plastid phylogenomics analysis of Alismatales revealed four convergent losses of NDH genes, where one of them occurred in the Hydrocharitaceae family, including genera of three seagrasses (Enhalus, Thalassia and Halophila) and three submerged fresh water plants (Nechamandra, Vallisneria and Najas) (Ross et al., 2016).

151

The absence of plastid NDH genes in H. ovalis therefore supported previous results, whereas the dispensability of the NDH complex was confirmed through this first reported instance of nuclear NDH and NDH-related genes among Alismatales. Besides Alismatales aquatic plants (Iles et al., 2013; Peredo et al., 2013; Wilkin and Mayo, 2013; Ross et al., 2016), rare evidence of loss and pseudogenization of plastid NDH genes were also reported in parasitic plants (Wolfe et al., 1992; Haberhausen and Zetsche, 1994; Funk et al., 2007; Logacheva et al., 2011) and two gymnosperm orders (Braukmann et al., 2009).

Reasons for NDH dispensability have been speculated (Stefanovic and Olmstead, 2005; Iles et al., 2013; Peredo et al., 2013; Xu et al., 2013). In parasitic plants, this is due to heterotrophy (Stefanovic and Olmstead, 2005). It is also hypothesized that a certain level of functional redundancy is present between chloroplastic and mitochondrial NDH (also known as NDH-2) (Xu et al., 2013). The loss observed in Tofieldiaceae is the first in non- submerged aquatic species, therefore rejecting the possibility of submergence as the reason for NDH gene loss (Iles et al., 2013; Peredo et al., 2013). The loss is also not related to adaptation to the marine environment, since the plastid NDH genes remain functional in many other seagrass species, including Zosteraceae as shown in Appendix 6.

Ross et al., 2016 suggested that this loss enabled low nitrogen investment as an adaptation to reduce chances of nutrient deficiency. This hypothesis could be plausible, as in the green algae Chlamydomomnas reinhardtii, factors such as types of nitrogen sources available and the level of excess nitrogen affects NDH expression (Peltier and Schmidt, 1991). Seagrasses derive both organic and inorganic forms of nitrogen, mostly nitrate and ammonium, from the sediment and seawater column (Figure 4-7). Affinity of seagrass leaves and roots towards nitrogen sources vary among different species and habitat. For example, conflicting responses are observed in different seagrass species to nutrient enrichment and/or eutrophication events (Touchette and Burkholder, 2000; Leoni et al., 2008). This is because nitrogen metabolism is integrally coupled with photosynthesis (Turpin, 1991), therefore environmental and physiological conditions that affect carbon fixation also affect nitrogen uptake. However, the correlation of nitrogen availability in seagrass habitats, as well as the use of nitrogen sources, and NDH presence is unclear. For example, in Z. marina which has NDH genes present, nitrate reductase (NR) had been shown to be highly inducible with increasing level of nitrate (Roth and Pregnall, 1988). In Posidonia oceanica (Loques et al., 1990) and Halophila stipulacea (Doddema and Howari, 1983; Alexandre et al., 2014) which lacked plastid NDH genes, the correlation between NR

152

activity and nitrate concentration is weak. However, Thalassia testudinum which also lacked plastid NDH, is able to utilize nitrate through uptake and assimilation reactions (Lee and Dunton, 1999). Halophila and Zostera species often inhabit the same environment (Carruthers et al., 2007) but have different NDH presence. Interestingly, two proteins related to nitrate uptake, nitrogen reductase 1 (NR1) and nitrate transporter (NRT3.1) were also identified as lost in H. ovalis (Appendix 6).

A main limitation of seagrass nitrogen uptake studies listed above is that the contributions of microbial or algal communities to the uptake measurements were not accounted for. The possibilities of symbionts compensating for the seagrass loss of NDH is likely, as seen in myco-heterotrophic liverworts (Wickett et al., 2008a; Wickett et al., 2008b). There is evidence of cyanobacterial mats on Posidonia leaves responsible for assisting seagrass in organic nitrogen uptake (Jeremy Bougoure, personal communication). To elucidate whether the dispensability of NDH complex in H. ovalis is related to nitrogen use, further examination is needed, particularly the role of microbial communities in seagrass roots and leaves.

As the gene loss method does not generate complete coding sequence of the species of interest, the possibility of pseudogenization is not examined in this analysis. A gene is therefore categorized as conserved, lost or information not available when the read coverage is ambiguous. The absence of plastid NDH in H. ovalis is therefore consistent with results reported in Halophila decipiens (Ross et al., 2016), with a few exceptions of pseudogenes.

153

Table 4-5 Significantly enriched biological process GO terms in the genes that were lost in H. ovalis, but present in Z. muelleri, Z. marina and five other plant species.

GO ID Term p value GO:0006810 transport 5.70E-10 GO:0071486 cellular response to high light intensity 1.30E-06 GO:0006811 ion transport 2.10E-06 GO:0010027 thylakoid membrane organization 1.10E-05 photosynthetic electron transport in GO:0009773 photosystem I 1.80E-05 GO:0019430 removal of superoxide radicals 2.40E-05 GO:0010257 NADH dehydrogenase complex assembly 5.30E-05 GO:0071493 cellular response to UV-B 5.30E-05 GO:0030003 cellular cation homeostasis 0.00014 GO:0019684 photosynthesis, light reaction 0.0002 GO:0071472 cellular response to salt stress 0.00032 GO:0016556 mRNA modification 0.00085 GO:0046688 response to copper ion 0.00109 anthocyanin accumulation in tissues in GO:0043481 response to UV light 0.0019 GO:0003333 amino acid transmembrane transport 0.00282 GO:0010117 photoprotection 0.00282 GO:0071490 cellular response to far red light 0.00392 GO:0071483 cellular response to blue light 0.00498 GO:0034644 cellular response to UV 0.00508 GO:0071329 cellular response to sucrose stimulus 0.00517 GO:0071229 cellular response to acid chemical 0.00559 GO:0071491 cellular response to red light 0.00659 GO:0010193 response to ozone 0.00759 GO:0009825 multidimensional cell growth 0.00759 GO:0000023 maltose metabolic process 0.01175 GO:0010030 positive regulation of seed germination 0.01176 GO:0019252 starch biosynthetic process 0.01229 GO:0010380 regulation of chlorophyll biosynthetic process 0.01377 GO:0031407 oxylipin metabolic process 0.01397 GO:0048856 anatomical structure development 0.01449

154

a)

b)

Figure 4-7 a) The involvement of NDH complex in electron transfer pathways during oxygenic photosynthesis of plants and b) its subunit composition. Figures adapted from (Peltier et al., 2016).

155

Figure 4-8 Nutrient (nitrogen and phosphorus) cycle in seagrass ecosystems; (1) and (2), nitrification; (3) and (4), nitrate reduction; and (5) denitrification; dotted lines represent excretion and dashed lines represent absorption. Figure adapted from (Leoni et at al., 2008)

156

4.4.6 Seagrass-specific genes suggest co-evolution of intracellular transport, cell wall and ion transport related genes in H. ovalis and Zosteraceae

A total of 57 OGCZ genes were conserved in H. ovalis. The majority of these genes are predicted to be involved in protein secretion and intracellular transport, as the top tree significantly enriched terms in the cellular component ontology analysis were important organelles of the intracellular transport pathways: Golgi apparatus, trans-Golgi network and endosome, and nearly half of the remaining terms are intracellular transport-related (Figure 4-8).

There are 13 of the conserved genes that function in protein secretion and intracellular transport, mainly as transport proteins or transport regulators. Nine genes are responsible for cell wall construction, organization and modification. The remaining genes are involved in ion or proton transport, lipid catabolism, transcription and translation-related, protein ubiquitination and histone assembly (Appendix 7).

In plant cells, secreted proteins are processed through the Golgi apparatus as cargo molecules and sorted by receptors in the trans-Golgi network to different destinations (reviewed in Brandizzi and Barlowe, 2013). Non-cellulosic cell wall matrix polysaccharides are among the wide range of vesicles synthesized and transported by the Golgi apparatus (Driouich et al., 1993; Lerouxel et al., 2006; Driouich et al., 2012). Besides catalytic mechanisms of glycosyltransferases and nucleotide-sugar conversions for polysaccharide assembly, the Golgi is also responsible for post-processes such as acetylation and methylation of the cell wall polysaccharides. The known differences between cell walls of seagrasses and land plants were discussed in Chapter 3, particularly the level of pectin modification. Interestingly, within the list of seagrass-specific genes conserved in H. ovalis, CGR2 (cotton Golgi-related 2), a methyltransferase was shown to be involved in pectin methylesterification in A. thaliana (Weraduwage et al., 2016). Tubulin cofactor which is responsible for microtubule stability, one of the mediators of cell wall components trafficking (Zhu et al., 2015), is conserved among seagrasses. A total of five genes which encode for RAB GTPases, the key regulators of vesicle trafficking (Miserey-Lenkei et al., 2010; Valente et al., 2010), were also conserved across both seagrass lineages. Knockouts of some members of the RAB GTPases has shown their redundant roles in salinity stress tolerance (Asaoka et al., 2013). It is likely that this conservation of cell wall- related genes, as well as proteins involved in intracellular transport, in both families of 157

seagrasses is linked to modification of cell wall composition as one of the osmoregulatory strategies.

Multiple characteristics were hypothesized as salt-tolerance mechanisms in seagrasses (reviewed in Touchette, 2007), such as cell wall rigidity, selective ion flux and vacuolar ion sequestering, and the synthesis of compatible solutes (Ye and Zhao, 2003; Carpaneto et al., 2004; Touchette et al., 2014). To avoid salt damage, plant cells adjust osmotic balance through 1) removal of Na+ and Cl- ions from the cytoplasm and 2) accumulation of K+ and Ca2+ ions in cells. This is achieved by influx and efflux of these ions through the transmembrane transport proteins, assisted by H+ pumps (reviewed in Hasegawa, 2013). Three genes, including the component of vacuolar proton pump, ATP synthase and calmodulin were identified as conserved across the two seagrass lineages. Moreover, vacuolar proton ATPase A1 had been shown to be responsive to salt stress in sugar beet (Kirsch et al., 1996). This collection of genes is likely to be specialized in seagrass to act together in regulating ion and osmotic homeostatis of cells in the marine environment.

Lipid transport and catabolism is another crucial role of the intracellular transport system. The endoplasmic reticulum (ER) synthesizes and exports phospholipids, sterols and storage lipids to other organelles such as mitochondria, thylakoids and peroxisomes for various purposes (van Meer et al., 2008). A total of four genes involved in lipid transport and catabolism were conserved in all three seagrass species, including ceramidase which is responsible for sphingolipid metabolism. Sphingolipids form membrane structure and are involved in cellular signal transduction (Hannun and Obeid, 2008). The difference between lipids of seagrasses and land plants is not well understood, but expansion in genes related to sphingolipid metabolism was observed in Z. marina when compared to duckweed (Olsen et al., 2016). Another alkaline ceramidase had been shown to regulate cell turgor pressure in A. thaliana (Chen et al., 2015), however, more evidence are needed to elucidate whether any seagrass-specific lipid metabolism plays a role in marine adaptation.

Two members of the core histone family were also conserved in seagrasses. The domains in histone families, particularly H2A and H3, showed great expansion and variety but each variant is strongly conserved across species (Kawashima et al., 2015). Epigenetic regulation, including histone acetylation, methylation and phosphorylation, is known to be responsive to salt stress in plants (Luo M et al., 2012). Also, members of ribosomal constituents were previously identified as modified in Z. muelleri when compared to land 158

plants (as described in Chapter 3) and positively selected in Z. marina and P. oceanica (Wissler et al., 2011), and results show that these genes are also conserved in H. ovalis.

159

Figure 4-9 Significantly enriched cellular component GO terms in seagrass-specific genes. Terms in green are subcomponents or organelles of the intracellular transport pathways.

160

4.4.6.1 Amino acid alignments show seagrass-specific conservation

The presence of OGCZ genes in H. ovalis indicates possible functional conservation across seagrasses that is related to adapting to the marine environment. To further investigate the molecular difference of these seagrass-specific genes from other plants, comparisons of protein alignments were performed. Multiple-sequence alignments of orthologs from 21 species and H. ovalis were used to identify seagrass-specific mutations.

The alignments of 50S ribosomal protein L16 revealed nine mutation sites that were specific in H. ovalis, Z. muelleri and Z. marina (white arrows in Figure 4-10). These nine positions were conserved among 17 angiosperms (12 belong to the Alismatales order, eight are freshwater plants), one charophyte and one chlorophyte. Among the nine mutations observed, three of them were mutated to a positive residue, three were mutated to a hydrophobic residue, one was a deletion and two were hydrophilic residues.

Relationships between orthologs of these 22 species were represented in the phylogenetic tree (Figure 4-11). The branch of the three seagrasses (HAL, ZMA, ZMU) from the remaining species had a 100% bootstrap support, complementing the observed mutations in the multi-sequence alignments. The separation of the two Zostera orthologs from H. ovalis was also well-supported. Four Araceae members (blue in Figure 4-11), including duckweed, clustered together with average support. Phylogenetic split within Alismatales was not well-supported, as shown in previous work (Ross et al., 2016). Core alismatids (Alismatidae sensu; Les and Tippery, 2013) labelled in red in Figure 4-11 grouped together, except for the three seagrasses.

Together, the amino acid alignments and phylogenetic tree results show the divergence of seagrass proteins from other embryophyta and green algae. As the mutations were not shared by representatives of both core alismatids and Araceae, the possibility of having common ancestors with these mutations was ruled out. This is particularly obvious when Najas flexilis (NFL) and Potamogeton perfoliatus (PPE), the two freshwater species in sister genera of Halophila and Zostera, do not contain these mutations and have higher similarities with other monocots and dicots. In addition, these two species shared submergence characteristics with seagrasses, suggesting that the mutations observed are not required to survive underwater. It is therefore possible that these mutated amino acids were subjected to selection pressure posed by salinity.

161

The results further complemented the seagrass clustering of OGCZ through OrthoMCL analysis, and provided the first molecular evidence of convergent evolution of seagrasses.

162

163

164

Figure 4-10 Ribosomal protein L16 multiple sequence alignments between 19 species and three seagrass species. Species and corresponding IDs were listed in Table 4-1. Amino acids that were conserved within the non-seagrass group or among seagrasses were coloured according to physicochemical properties based on “Zappo” colour scheme. White arrows indicated seagrass-specific mutations.

165

Figure 4-11 Phylogenetic tree showing distance between Rpl16 sequences of 17 species together with three seagrasses. Species and corresponding IDs were listed in Table 4-1. IDs coloured in red were members of core Alismatids and blue were members of Araceae. Branches were labelled with bootstrap values (%). 166

4.5 Conclusion

Through inter- and intracomparison between genomes of seagrasses and other non- marine plants, this chapter describes genomic characteristics uniquely shared in two families of seagrasses.

H. ovalis and two members of Zosteraceae represent two occurrences of repeated evolution to adapt to the marine environment which were about 30 My apart. Remarkably, the two lineages shared the same gene loss in ethylene biosynthesis and signalling, terpenoid biosynthesis, stomata and flower development. These gene losses describe the similar phenotypes which characterized seagrasses, and suggest that they were driven by similar genetic processes. A group of well-conserved genes that are specific to the seagrass genomes also provided insights into the uniqueness of seagrasses among other angiosperms. The seagrass-specific genes are highly enriched in the intracellular transport pathway and cell wall construction, organization and modification. A few genes responsible for ion transport and sequestering were also conserved within H. ovalis, Z. muelleri and Z. marina. Using one of the seagrass-specific genes, ribosomal protein L16, the first instance of molecular evidence that described independent marine recolonization events was obtained.

To date, only one non-seagrass Alismatales species, duckweed (S. polyrhiza), has a publicly available genome assembly (Wang et al., 2014). Gene clustering results (Figure 3- 4 and 4-5) had shown that a group of seagrass genes have low sequence similarities with model land plant species and also duckweed. This demonstrates that the mutations in these genes are likely absent in the common ancestor of Araceae (represented by duckweed) and core Alismatid (represented by seagrasses). However, protein sequences of other sister genera were essential to show that the mutations do not arise from the core Alismatid branch. This is achieved using the ribosomal protein L16 sequences from publicly available chloroplast assemblies. The distance of seagrass protein sequences from other species, including core Alismatid members, clearly indicated mutations specific to seagrasses. This evidence is likely applicable to all 57 genes shared between H. ovalis and Zosteraceae and points to protein domains that are subjected to marine selective pressures. It is likely that this group of genes independently evolved to form similar molecular, and possibly, functional outcomes in this two taxa. It is hypothesized that salinity stress is the main selection pressure in play.

167

Habitat is the most common factor associated with reported examples of repeated evolution (Ord and Summers, 2015). The conservation of gene loss and the sharing of seagrass-specific orthologs in these two independent lineages, despite the phylogenetic distance, described the requirements of an angiosperm of land plant ancestry to survive in the sea. These results also present another example of parallel evolution in the plant kingdom.

168

Chapter 5. Summary and outlook

This thesis presents the genomic characterization of seagrass species through comparisons between Z. muelleri with land plants, as well as between Z. muelleri and H. ovalis which represent two lineages of seagrasses. This is the second whole genome study of a seagrass species, and first among southern hemisphere species. This is also the first time the convergent evolution of multiple seagrass lineages is studied at the genomic level.

This chapter addresses the limitations of the methods used and suggests possible improvements. Lastly, the future prospects of seagrass genomic are reviewed.

5.1 Addressing limitations of methodologies used and possible improvements

5.1.1 Genome assembly and gene annotation

The methods of Z. muelleri genome assembly can be modified or improved in several ways. Using the existing amount of sequencing data, other assemblers could possibly replace Velvet. Another de Bruijn-based assembler SOAPdenovo was used in several plant genome assemblies, for example Chinese cabbage (Wang et al., 2011) and cotton (Li et al., 2014). More recent assemblers which employ different strategies, such as combining Overlap/Layout/Consensus and de Bruijn strategies for read extension in MaSuRCA (Zimin et al., 2013), could improve the contiguity and accuracy of the assembly. With the availability of the Z. marina genome (Olsen et al., 2016), a reference-based assembly could help in positioning scaffolds and reduce gaps in our assembly. Lastly, the completion level of the assembly would benefit from additional long range data or long read sequences. These scaffolding technologies are increasingly recognised as solutions (reviewed in Jiao and Schneeberger, 2017) to challenges of assembling complete plant (example in Yang et al., 2016) or vertebrate genomes (example in Bickhart et al., 2017), particularly by resolving repetitive regions. However, as advancements in genome assembly, such as sequencing technologies and bioinformatics algorithms, are constantly emerging, it is important that research progress is made even without the availability of a fully complete genome.

169

The gene annotation was performed using gene predictors trained specifically for Z. muelleri to increase prediction accuracy. However, an inherent bias is present because predictors were trained to identify genes similar to the training data, which were the highly conserved orthologs from CEGMA (Parra et al., 2007) and the Z. muelleri transcriptome in this case. It is therefore possible that non-canonical or under-expressed genes were unannotated. This limitation could be partly solved by the two-pass training method, as the ab initio genes from the first-pass prediction were also included for training, but it is possible that these genes have lower accuracy since they are not well-supported.

5.1.2 Analyses of non-genic regions: repeat annotation and whole genome duplication events

The results of the analyses on non-genic regions, such as repeat annotation and Whole Genome Duplication events, could be improved with an assembly of a higher level of completion.

As the main reason of assembly fragmentation is unresolved repetitive regions (Phillippy et al., 2008), it is likely that the number of repeats annotated in the current assembly is under-estimated. An assembly with a higher contiguity and fewer gaps would improve the repeat annotation. Even though repetitive regions, such as transposable elements, made up the largest genomic fraction in most plant species (Bennetzen and Wang, 2014), they are still not well understood. Their importance in gene regulation has been prevailing in recent years (reviewed in Lisch, 2013). Seagrass repeats are likely to play a role in regulating the evolution of marine adaptation, and are therefore an essential piece in shedding light on seagrass genetics.

The WGD analysis of the Z. muelleri genome is also limited by the level of completion of the assembly. As synteny and collinearity information are only available in pseudochromosomes, the inference of WGD is likely to be less accurate when using only genic regions. An improved assembly will enable the identification of syntenic ancestral regions, and complement the suggested WGDs deduced by phylogenetic trees of orthologs and Ks analysis of paralogs.

170

5.1.3 Comparative pathway analysis using orthologous relationship

The identification of lost or modified genes in Z. muelleri when compared to other plant species was based on comparisons between orthologous clusters. The use of A. thaliana representative genes to assign function for each cluster through GO terms is built upon the assumption of functional consistencies of putative orthologs. This assumption is supported by studies of manually curated conserved genes (Bandyopadhyay et al., 2006; Hulsen et al., 2006; Altenhoff et al., 2012; Wang et al., 2013). In fact, sequence similarities have been shown to be a stronger predictor of functional relatedness than evolutionary history of genes (Altenhoff and Dessimoz, 2009). The validity of this assumption, however, is questionable in several cases. For example, genes of different functions could be clustered in the same orthogroup due to subfunctionalization events or the dispensability of function in some but not all species, therefore increasing false positives (Figure 5-1).

Four species represented in the AMOS clusters have large phylogenetic distance. In particular, duckweed in the Alismatales order is known to be distantly related to the model monocots (Wang et al., 2014). Moreover, duckweed is the only floating aquatic plant included among other three land plants. It is therefore likely that functions of orthologous genes between these four species cannot be fully represented by A. thaliana genes. In general, plant genomes are subjected to complicated duplication events (Wendel et al., 2016), a high rate of neofunctionalization and subfunctionalization of genes would possibly hamper the relationship of gene function and orthology. However, the accurate identification of protein function requires extensive biochemical and structural studies, and is therefore not feasible for genomic-scale analysis. One possible improvement in this context is increasing the number of species per clade in orthogroup construction, thereby reducing the phylogenetic distance between species.

The conclusion of gene loss or modification in Z. muelleri was drawn by the absence of orthologs in AMOS. Possibilities of pathway compensation, inaccuracies in gene annotation or cases of pseudogenization are not accounted for in this method. The absence and modification of seagrass genes can be confirmed by gene expression data or targeted sequencing.

171

Figure 5-1 Duplication events generate false positives in the identification of functionally identical orthologs between species. Figure adapted from (Fang et al., 2010).

5.1.4 Presence and absence of genes using read mapping

The identification of gene presence and absence in H. ovalis using the read mapping method provides evidence for convergent evolution of seagrass families. However, the calculation of exon coverage by reads is based on sequence alignments and does not account for small variants, partial and pseudo genes. The genes identified to be lost could be actually present with variants that interfere with read mapping. The results also relied on the quality of the coding sequence annotation in reference genomes. The incorporation of transcriptome data to match gene presence and absence with its expression can reduce false positives caused by sequence diversity.

The gene loss method was used to obtain a group of genes that are unique to seagrasses and therefore possibly subjected to selection pressures of the marine environment. However, only genes with low sequence similarities to non-marine plants were examined. In the cases where seagrass-specific gene family expansion occurred, for example Z.

172

marina genes involved in light harvesting complex (LHCB1), sucrose synthase and sucrose transport (Olsen et al., 2016), as the duplications are orthologous to other plants, they were not included in the analysis.

5.2 Additional analysis that could be performed

It is of interest to correlate the detected WGDs in the Zostera genus to function of orthologous groups. For example, orthologs that retained back to single copy in species following WGDs are expected to have housekeeping functions such as DNA replication and repair (De Smet et al., 2013). The extra round of WGD in Z. muelleri that is absent in Z. marina makes the orthologs between these two species ideal models for gene duplicability studies (McGrath et al., 2014). Duplication of species-specific gene families could also render adaptation advantage unique to each species. However, functional analysis of genes is limited by the low similarity between seagrass genes and genes of model plants in public databases. GO representation of the seagrass gene sets is also low, where only 52.9% and 49.8% of genes are annotated with GO terms in Z. muelleri and Z. marina, respectively. At this stage, it is not possible to accurately annotate gene categories that are maintained after WGD, but ab initio protein domain analysis could possibly help assign functions to novel genes.

The detection of positive selection signatures in seagrass genes could be performed using the Z. muelleri genome. Positively selected genes (PSGs) code for proteins that contain variants that contribute to fitness and therefore adaptation, which would form species- specific phenotypic traits (Yang, 2005). Seagrass PSGs had been studied previously using ESTs (Wissler et al., 2011). A genome-scaled search using the Z. muelleri gene set will enable a more robust inference of selection signals. Examples of genome-wide scans for PSGs are abundant in pathogenic microorganisms (Petersen et al., 2007; Soyer et al., 2009; Martemyanov et al., 2017) and animals (Rhesus Macaque Genome Sequencing Analysis Consortium et al., 2007; Markova-Raina and Petrov, 2011; Ge et al., 2013; Lan et al., 2017). However, as computational tools for this purpose (recent examples are POTION (Hongo et al., 2015) and PosiGene (Sahm et al., 2017)) are sensitive to the quality of gene annotation and misalignments (Mallick et al., 2009; Schneider et al., 2009; Privman et al., 2012), careful quality filtering is necessary, particularly for newly assembled genome drafts. It would be of huge interest to be able to show that the seagrass-specific genes

173

shared by Z. muelleri, Z. marina and H. ovalis, as described in Chapter 4, are under positive selection.

The genome assembly of H. ovalis would largely contribute to the analysis of parallel evolution of multiple return to the sea events, particularly increasing the evidence of conservation of amino acid mutations. However, the current sequencing data is insufficient for an assembly with enough contiguity. Long range data, such as mate paired sequencing reads, or third generation long read data, would greatly improve the genome assembly to serve as another reference of the seagrass species.

5.3 The future of seagrass genomics

Even though the first description and naming of seagrass species Z. marina and Posidonia oceanica by Linnaeus dated back to the 17th century, we are only beginning to understand seagrass biology through genetics and genomics.

Questions remain open in fundamental biological processes of seagrasses. For example, reports on carbon fixation signatures of seagrasses are conflicting, and osmoregulation mechanisms of seagrass cells to survive in high salinity are still not completely understood (reviewed in Davey et al., 2016). In addition, the difference of light perception, nutrient utilization and anoxia tolerance between species of varying plasticity and habitats is very much unknown. Results of inter-species comparison described in Chapter 4 relate nitrate usage to nutrient content in different habitats, suggesting key adaptive differences even within seagrass families.

The genome characterization of Z. marina (Olsen et al., 2016) and this thesis contribute largely in shedding light on marine adaptation signatures through comparison with terrestrial and other aquatic plants. The dispensability of ethylene, together with other gaseous compounds in seagrasses is one of the important findings. However, whether the loss of ethylene is a result of adaptation to submerged life, or high salinity, is unknown. Ethylene and stomatal loss were also observed in pondweed P. pectinatus, a freshwater submerged plant (Summers et al., 1996). As an essential stress signal, ethylene also regulates salinity tolerance in land plants, though the evidence of ethylene as positive or negative regulator during high salinity stress in different species at different developmental stages are conflicting (reviewed in Tao et al., 2015) . It is also necessary to study the

174

regulation of hormonal crosstalk in seagrasses to understand how the absence of ethylene is compensated.

The critical need for awareness of seagrass conservation in both the scientific community and publicly has been highlighted since seagrass decline was recognized (Orth et al., 2006). A recent study has further emphasized this through concluding that seagrass ecosystems significantly reduce bacterial contamination in the waters (Lamb et al., 2017). Genomic knowledge has a large potential in seagrass protection. The use of molecular indicators in monitoring programs enables the early signs of stresses such as eutrophication, heat and light intensity (Macreadie et al., 2014). For example, molecular markers were used to detect light stress in Z. muelleri (Pernice et al., 2015).

The future of seagrass genomics will benefit from additional genomic data of species and research efforts, whether aiming at increasing seagrass protection or understanding marine plant evolution.

175

References

Abbas MM, Malluhi QM, Balakrishnan P (2014) Assessment of de novo assemblers for draft genomes: a case study with fungal genomes. BMC Genomics 15 Suppl 9: S10 Adams KL, Wendel JF (2005) Polyploidy and genome evolution in plants. Current Opinion in Plant Biology 8: 135-141 Al-Mssallem IS, Hu S, Zhang X, Lin Q, Liu W, Tan J, Yu X, Liu J, Pan L, Zhang T, Yin Y, Xin C, Wu H, Zhang G, Ba Abdullah MM, Huang D, Fang Y, Alnakhli YO, Jia S, Yin A, Alhuzimi EM, Alsaihati BA, Al- Owayyed SA, Zhao D, Zhang S, Al-Otaibi NA, Sun G, Majrashi MA, Li F, Tala, Wang J, Yun Q, Alnassar NA, Wang L, Yang M, Al-Jelaify RF, Liu K, Gao S, Chen K, Alkhaldi SR, Liu G, Zhang M, Guo H, Yu J (2013) Genome sequence of the date palm Phoenix dactylifera L. Nature Communication 4: 2274 Alexa A, Rahnenfuhrer J (2010) topGO: Enrichment analysis for Gene Ontology. In, Ed R package version 2.22.0 Alexandre A, Georgiou D, Santos R (2014) Inorganic nitrogen acquisition by the tropical seagrass Halophila stipulacea. Marine Ecology 35: 387-394 Allario T, Brumos J, Colmenero-Flores JM, Iglesias DJ, Pina JA, Navarro L, Talon M, Ollitrault P, Morillon R (2013) Tetraploid Rangpur lime rootstock increases drought tolerance via enhanced constitutive root abscisic acid production. Plant Cell and Environment 36: 856-868 Altenhoff AM, Dessimoz C (2009) Phylogenetic and functional assessment of orthologs inference projects and methods. PLOS Computational Biology 5: e1000262 Altenhoff AM, Studer RA, Robinson-Rechavi M, Dessimoz C (2012) Resolving the ortholog conjecture: orthologs tend to be weakly, but significantly, more similar in function than paralogs. PLOS Computational Biology 8: e1002514 Amborella Genome Project (2013) The Amborella genome and the evolution of flowering plants. Science 342: 1241089 Andrews S (2010) FastQC A Quality Control tool for High Throughput Sequence Data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Aquino RS, Grativol C, Mourao PA (2011) Rising from the sea: correlations between sulfated polysaccharides and salinity in plants. PLOS One 6: e18862 Arber AR (1920) Water plants : a study of aquatic angiosperms. University Press, Cambridge England Arnaud-Haond S, Migliaccio M, Diaz-Almela E, Teixeira S, van de Vliet MS, Alberto F, Procaccini G, Duarte CM, Serrao EA (2007) Vicariance patterns in the Mediterranean Sea: east-west cleavage and low dispersal in the endemic seagrass Posidonia oceanica. Journal of Biogeography 34: 963-976 Asaoka R, Uemura T, Ito J, Fujimoto M, Ito E, Ueda T, Nakano A (2013) Arabidopsis RABA1 GTPases are involved in transport between the trans-Golgi network and the plasma membrane, and are required for salinity stress tolerance. Plant Journal 73: 240-249 Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G, Consortium GO (2000) Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25-29 Bairoch A, Apweller R (1997) The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Research 25: 31-36 Bairoch A, Boeckmann B (1991) The Swiss-Prot Protein-Sequence Data-Bank. Nucleic Acids Research 19: 2247-2248 Bandyopadhyay S, Sharan R, Ideker T (2006) Systematic identification of functional orthologs based on protein network comparison. Genome Research 16: 428-435 Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology 19: 455-477

176

Bao Z, Eddy SR (2002) Automated de novo identification of repeat sequence families in sequenced genomes. Genome Research 12: 1269-1276 Barghini E, Mascagni F, Natali L, Giordani T, Cavallini A (2015) Analysis of the repetitive component and retrotransposon population in the genome of a marine angiosperm, Posidonia oceanica (L.) Delile. Marine Genomics 24 Pt 3: 397-404 Beer S, Bjork M, Hellblom F, Axelsson L (2002) Inorganic carbon utilization in marine angiosperms (seagrasses). Functional Plant Biology 29: 349-354 Bennetzen JL, Freeling M (1997) The unified grass genome: Synergy in synteny. Genome Research 7: 301- 306 Bennetzen JL, Wang H (2014) The contributions of transposable elements to the structure, function, and evolution of plant genomes. Annual Review of Plant Biology 65: 505-530 Bickhart DM, Rosen BD, Koren S, Sayre BL, Hastie AR, Chan S, Lee J, Lam ET, Liachko I, Sullivan ST, Burton JN, Huson HJ, Nystrom JC, Kelley CM, Hutchison JL, Zhou Y, Sun J, Crisa A, Ponce de Leon FA, Schwartz JC, Hammond JA, Waldbieser GC, Schroeder SG, Liu GE, Dunham MJ, Shendure J, Sonstegard TS, Phillippy AM, Van Tassell CP, Smith TP (2017) Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nature Genetics 49: 643-650 Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W (2011) Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27: 578-579 Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30: 2114-2120 Borum J, Pedersen O, Kotula L, Fraser MW, Statton J, Colmer TD, Kendrick GA (2015) Photosynthetic response to globally increasing CO2 of co-occurring temperate seagrass species. Plant , Cell & Environment in press Boutin SR, Young ND, Olson TC, Yu ZH, Shoemaker RC, Vallejos CE (1995) Genome conservation among three legume genera detected with DNA markers. Genome 38: 928-937 Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S, Chapman JA, Chapuis G, Chikhi R, Chitsaz H, Chou WC, Corbeil J, Del Fabbro C, Docking TR, Durbin R, Earl D, Emrich S, Fedotov P, Fonseca NA, Ganapathy G, Gibbs RA, Gnerre S, Godzaridis E, Goldstein S, Haimel M, Hall G, Haussler D, Hiatt JB, Ho IY, Howard J, Hunt M, Jackman SD, Jaffe DB, Jarvis ED, Jiang H, Kazakov S, Kersey PJ, Kitzman JO, Knight JR, Koren S, Lam TW, Lavenier D, Laviolette F, Li Y, Li Z, Liu B, Liu Y, Luo R, Maccallum I, Macmanes MD, Maillet N, Melnikov S, Naquin D, Ning Z, Otto TD, Paten B, Paulo OS, Phillippy AM, Pina-Martins F, Place M, Przybylski D, Qin X, Qu C, Ribeiro FJ, Richards S, Rokhsar DS, Ruby JG, Scalabrin S, Schatz MC, Schwartz DC, Sergushichev A, Sharpe T, Shaw TI, Shendure J, Shi Y, Simpson JT, Song H, Tsarev F, Vezzi F, Vicedomini R, Vieira BM, Wang J, Worley KC, Yin S, Yiu SM, Yuan J, Zhang G, Zhang H, Zhou S, Korf IF (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2: 10 Brandizzi F, Barlowe C (2013) Organization of the ER-Golgi interface for membrane traffic control. Nature Reviews Molecular Cell Biology 14: 382-392 Braslavsky I, Hebert B, Kartalov E, Quake SR (2003) Sequence information can be obtained from single DNA molecules. Proceedings of the National Academy of Sciences of the United States of America 100: 3960-3964 Braukmann TWA, Kuzmina M, Stefanovic S (2009) Loss of all plastid ndh genes in Gnetales and conifers: extent and evolutionary significance for the seed plant phylogeny. Current Genetics 55: 323-337 Bremer B, Bremer K, Chase MW, Fay MF, Reveal JL, Soltis DE, Soltis PS, Stevens PF, Anderberg AA, Moore MJ, Olmstead RG, Rudall PJ, Sytsma KJ, Tank DC, Wurdack K, Xiang JQY, Zmarzty S, Angiosperm Phylogeny Grp (2009) An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG III. Botanical Journal of the Linnean Society 161: 105-121 Buckler ES, Holland JB, Bradbury PJ, Acharya CB, Brown PJ, Browne C, Ersoz E, Flint-Garcia S, Garcia A, Glaubitz JC, Goodman MM, Harjes C, Guill K, Kroon DE, Larsson S, Lepak NK, Li H, Mitchell SE, Pressoir G, Peiffer JA, Rosas MO, Rocheford TR, Romay MC, Romero S, Salvo S, Sanchez Villeda H, da Silva HS, Sun Q, Tian F, Upadyayula N, Ware D, Yates H, Yu J, Zhang Z, Kresovich S, McMullen MD (2009) The genetic architecture of maize flowering time. Science 325: 714-718 177

Burke MK, Dennison WC, Moore KA (1996) Non-structural carbohydrate reserves of eelgrass Zostera marina. Marine Ecology Progress Series 137: 195-201 Burrows PA, Sazanov LA, Svab Z, Maliga P, Nixon PJ (1998) Identification of a functional respiratory complex in chloroplasts through analysis of tobacco mutants containing disrupted plastid ndh genes. EMBO Journal 17: 868-876 Burton JN, Adey A, Patwardhan RP, Qiu R, Kitzman JO, Shendure J (2013) Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nature Biotechnolgy 31: 1119- 1125 Caffall KH, Mohnen D (2009) The structure, function, and biosynthesis of plant cell wall pectic polysaccharides. Carbohydrate Research 344: 1879-1900 Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics 10: 421 Campbell MS, Holt C, Moore B, Yandell M (2014a) Genome Annotation and Curation Using MAKER and MAKER-P. Current Protocols in Bioinformatics 48: 4 11 11-39 Campbell MS, Law M, Holt C, Stein JC, Moghe GD, Hufnagel DE, Lei J, Achawanantakun R, Jiao D, Lawrence CJ, Ware D, Shiu SH, Childs KL, Sun Y, Jiang N, Yandell M (2014b) MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiology 164: 513-524 Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sanchez Alvarado A, Yandell M (2008) MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Research 18: 188-196 Carpaneto A, Naso A, Paganetto A, Cornara L, Pesce ER, Gambale F (2004) Properties of ion channels in the protoplasts of the Mediterranean seagrass Posidonia oceanica. Plant Cell and Environment 27: 279-292 Carruthers TJB, Dennison WC, Kendrick GA, Waycott M, Walker DI, Cambridge ML (2007) Seagrasses of south-west Australia: A conceptual synthesis of the world's most diverse and extensive seagrass meadows. Journal of Experimental Marine Biology and Ecology 350: 21-45 Chain PS, Grafham DV, Fulton RS, Fitzgerald MG, Hostetler J, Muzny D, Ali J, Birren B, Bruce DC, Buhay C, Cole JR, Ding Y, Dugan S, Field D, Garrity GM, Gibbs R, Graves T, Han CS, Harrison SH, Highlander S, Hugenholtz P, Khouri HM, Kodira CD, Kolker E, Kyrpides NC, Lang D, Lapidus A, Malfatti SA, Markowitz V, Metha T, Nelson KE, Parkhill J, Pitluck S, Qin X, Read TD, Schmutz J, Sozhamannan S, Sterk P, Strausberg RL, Sutton G, Thomson NR, Tiedje JM, Weinstock G, Wollam A, Genomic Standards Consortium Human Microbiome Project Jumpstart C, Detter JC (2009) Genome project standards in a new era of sequencing. Science 326: 236-237 Chalhoub B, Denoeud F, Liu S, Parkin IA, Tang H, Wang X, Chiquet J, Belcram H, Tong C, Samans B, Correa M, Da Silva C, Just J, Falentin C, Koh CS, Le Clainche I, Bernard M, Bento P, Noel B, Labadie K, Alberti A, Charles M, Arnaud D, Guo H, Daviaud C, Alamery S, Jabbari K, Zhao M, Edger PP, Chelaifa H, Tack D, Lassalle G, Mestiri I, Schnel N, Le Paslier MC, Fan G, Renault V, Bayer PE, Golicz AA, Manoli S, Lee TH, Thi VH, Chalabi S, Hu Q, Fan C, Tollenaere R, Lu Y, Battail C, Shen J, Sidebottom CH, Wang X, Canaguier A, Chauveau A, Berard A, Deniot G, Guan M, Liu Z, Sun F, Lim YP, Lyons E, Town CD, Bancroft I, Wang X, Meng J, Ma J, Pires JC, King GJ, Brunel D, Delourme R, Renard M, Aury JM, Adams KL, Batley J, Snowdon RJ, Tost J, Edwards D, Zhou Y, Hua W, Sharpe AG, Paterson AH, Guan C, Wincker P (2014) Plant genetics. Early allopolyploid evolution in the post-Neolithic Brassica napus oilseed genome. Science 345: 950-953 Chao DY, Dilkes B, Luo HB, Douglas A, Yakubova E, Lahner B, Salt DE (2013) Polyploids exhibit higher potassium uptake and salinity tolerance in Arabidopsis. Science 341: 658-659 Chen F, Mackey AJ, Vermunt JK, Roos DS (2007) Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLOS One 2: e383 Chen LY, Chen JM, Gituru RW, Wang QF (2012) Generic phylogeny, historical biogeography and character evolution of the cosmopolitan aquatic plant family Hydrocharitaceae. BMC Evolutionary Biology 12 Chen LY, Shi DQ, Zhang WJ, Tang ZS, Liu J, Yang WC (2015) The Arabidopsis alkaline ceramidase TOD1 is a key turgor pressure regulator in plant cells. Nature Communications 6

178

Cheng WH, Chiang MH, Hwang SG, Lin PC (2009) Antagonism between abscisic acid and ethylene in Arabidopsis acts in parallel with the reciprocal regulation of their metabolism and signaling pathways. Plant Molecular Biology 71: 61-80 Chin C-S, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, Dunn C, O'Malley R, Figueroa- Balderas R, Morales-Cruz A, Cramer GR, Delledonne M, Luo C, Ecker JR, Cantu D, Rank DR, Schatz MC (2016) Phased diploid genome assembly with single molecule real-time sequencing. bioRxiv Clarke J, Wu HC, Jayasinghe L, Patel A, Reid S, Bayley H (2009) Continuous base identification for single- molecule nanopore DNA sequencing. Nature Nanotechnology 4: 265-270 Coghlan A, Fiedler TJ, McKay SJ, Flicek P, Harris TW, Blasiar D, n GC, Stein LD (2008) nGASP--the nematode genome annotation assessment project. BMC Bioinformatics 9: 549 Corpet F, Servant F, Gouzy J, Kahn D (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Research 28: 267-269 Coyer JA, Hoarau G, Kuo J, Tronholm A, Veldsink J, Olsen JL (2013) Phylogeny and temporal divergence of the seagrass family Zosteraceae using one nuclear and three chloroplast loci. Systematics and Biodiversity 11: 271-284 Crusoe MR, Alameldin HF, Awad S, Boucher E, Caldwell A, Cartwright R, Charbonneau A, Constantinides B, Edvenson G, Fay S, Fenton J, Fenzl T, Fish J, Garcia-Gutierrez L, Garland P, Gluck J, Gonzalez I, Guermond S, Guo J, Gupta A, Herr JR, Howe A, Hyer A, Harpfer A, Irber L, Kidd R, Lin D, Lippi J, Mansour T, McA'Nulty P, McDonald E, Mizzi J, Murray KD, Nahum JR, Nanlohy K, Nederbragt AJ, Ortiz-Zuazaga H, Ory J, Pell J, Pepe-Ranney C, Russ ZN, Schwarz E, Scott C, Seaman J, Sievert S, Simpson J, Skennerton CT, Spencer J, Srinivasan R, Standage D, Stapleton JA, Steinman SR, Stein J, Taylor B, Trimble W, Wiencko HL, Wright M, Wyss B, Zhang Q, Zyme E, Brown CT (2015) The khmer software package: enabling efficient nucleotide sequence analysis. F1000Research 4: 900 Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, Soltis PS, Carlson JE, Arumuganathan K, Barakat A, Albert VA, Ma H, dePamphilis CW (2006) Widespread genome duplications throughout the history of flowering plants. Genome Research 16: 738-749 D'Hont A, Denoeud F, Aury JM, Baurens FC, Carreel F, Garsmeur O, Noel B, Bocs S, Droc G, Rouard M, Da Silva C, Jabbari K, Cardi C, Poulain J, Souquet M, Labadie K, Jourda C, Lengelle J, Rodier-Goud M, Alberti A, Bernard M, Correa M, Ayyampalayam S, McKain MR, Leebens-Mack J, Burgess D, Freeling M, Mbeguie AMD, Chabannes M, Wicker T, Panaud O, Barbosa J, Hribova E, Heslop- Harrison P, Habas R, Rivallan R, Francois P, Poiron C, Kilian A, Burthia D, Jenny C, Bakry F, Brown S, Guignon V, Kema G, Dita M, Waalwijk C, Joseph S, Dievart A, Jaillon O, Leclercq J, Argout X, Lyons E, Almeida A, Jeridi M, Dolezel J, Roux N, Risterucci AM, Weissenbach J, Ruiz M, Glaszmann JC, Quetier F, Yahiaoui N, Wincker P (2012) The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488: 213-217 Daher FB, Braybrook SA (2015) How to let go: pectin and plant cell adhesion. Frontiers in Plant Science 6 Dalquen DA, Altenhoff AM, Gonnet GH, Dessimoz C (2013) The impact of gene duplication, insertion, deletion, lateral gene transfer and sequencing error on orthology inference: A simulation study. PLOS One 8 Dattolo E, Gu J, Bayer PE, Mazzuca S, Serra IA, Spadafora A, Bernardo L, Natali L, Cavallini A, Procaccini G (2013) Acclimation to different depths by the marine angiosperm Posidonia oceanica: transcriptomic and proteomic profiles. Frontiers in Plant Science 4: 195 Dattolo E, Ruocco M, Brunet C, Lorenti M, Lauritano C, D'Esposito D, De Luca P, Sanges R, Mazzuca S, Procaccini G (2014) Response of the seagrass Posidonia oceanica to different light environments: Insights from a combined molecular and photo-physiological study. Marine Environmental Research 101: 225-236 Davey PA, Pernice M, Sablok G, Larkum A, Lee HT, Golicz A, Edwards D, Dolferus R, Ralph P (2016) The emergence of molecular profiling and omics techniques in seagrass biology; furthering our understanding of seagrasses. Funct Integr Genomics 16: 465-480 Daviere JM, Achard P (2016) A pivotal role of DELLAs in regulating multiple hormone signals. Molecular Plant 9: 10-20 De Grauwe L, Dugardeyn J, Van Der Straeten D (2008) Novel mechanisms of ethylene-gibberellin crosstalk revealed by the gai eto2-1 double mutant. Plant Signalling & Behavior 3: 1113-1115 179

de la Bastide M, McCombie WR (2007) Assembling genomic DNA sequences with PHRAP. Current Protocols in Bioinformatics Chapter 11: Unit11.14 De Paepe A, Vuylsteke M, Van Hummelen P, Zabeau M, Van Der Straeten D (2004) Transcriptional profiling by cDNA-AFLP and microarray analysis reveals novel insights into the early response to ethylene in Arabidopsis. Plant Journal 39: 537-559 De Smet R, Adams KL, Vandepoele K, van Montagu MCE, Maere S, Van de Peer Y (2013) Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. Proceedings of the National Academy of Sciences of the United States of America 110: 2898-2903 Dekker J, Rippe K, Dekker M, Kleckner N (2002) Capturing chromosome conformation. Science 295: 1306- 1311 Den Hartog C, Kuo J (2006) Taxonomy and biogeography in seagrasses. In Seagrasses: Biology, ecology and conservation. Springer, pp 1-23 Dennison WC, Orth RJ, Moore KA, Stevenson JC, Carter V, Kollar S, Bergstrom PW, Batiuk RA (1993) Assessing water quality with submersed aquatic vegetation. BioScience 43: 86-94 Doddema H, Howari M (1983) Invivo nitrate reductase activity in the seagrass Halophila stipulacea from the Gulf of Aqaba (Jordan). Botanica Marina 26: 307-312 Dolezel J, Bartos J (2005) Plant DNA flow cytometry and estimation of nuclear genome size. Ann Bot 95: 99- 110 Driouich A, Faye L, Staehelin LA (1993) The plant Golgi apparatus: a factory for complex polysaccharides and glycoproteins. Trends in Biochemical Sciences 18: 210-214 Driouich A, Follet-Gueye ML, Bernard S, Kousar S, Chevalier L, Vicre-Gibouin M, Lerouxel O (2012) Golgi- mediated synthesis and secretion of matrix polysaccharides of the primary cell wall of higher plants. Frontiers in Plant Science 3: 79 Du C, Swigonova Z, Messing J (2006) Retrotranspositions in orthologous regions of closely related grass species. BMC Evolutionary Biollogy 6: 62 Duarte JM, Wall PK, Edger PP, Landherr LL, Ma H, Pires JC, Leebens-Mack J, dePamphilis CW (2010) Identification of shared single copy nuclear genes in Arabidopsis, Populus, Vitis and Oryza and their phylogenetic utility across various taxonomic levels. BMC Evolutionary Biology 10 Dudareva N, Negre F, Nagegowda DA, Orlova I (2006) Plant volatiles: Recent advances and future perspectives. Critical Reviews in Plant Sciences 25: 417-440 Dutta S, Mohanty S, Tripathy BC (2009) Role of temperature stress on chloroplast biogenesis and protein import in pea. Plant Physiology 150: 1050-1061 Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HO, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol I, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, Yang SP, Wu W, Chou WC, Srivastava A, Shaw TI, Ruby JG, Skewes-Cox P, Betegon M, Dimon MT, Solovyev V, Seledtsov I, Kosarev P, Vorobyev D, Ramirez-Gonzalez R, Leggett R, MacLean D, Xia F, Luo R, Li Z, Xie Y, Liu B, Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Yin S, Sharpe T, Hall G, Kersey PJ, Durbin R, Jackman SD, Chapman JA, Huang X, DeRisi JL, Caccamo M, Li Y, Jaffe DB, Green RE, Haussler D, Korf I, Paten B (2011) Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Research 21: 2224-2241 Ecker JR (1995) The ethylene signal transduction pathway in plants. Science 268: 667-675 Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32: 1792-1797 Edwards D, Batley J (2010) Plant genome sequencing: applications for crop improvement. Plant Biotechnology Journal 8: 2-9 Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B, Bibillo A, Bjornson K, Chaudhuri B, Christians F, Cicero R, Clark S, Dalal R, Dewinter A, Dixon J, Foquet M, Gaertner A, Hardenbol P, Heiner C, Hester K, Holden D, Kearns G, Kong X, Kuse R, Lacroix Y, Lin S, Lundquist P, Ma C, Marks P, Maxham M, Murphy D, Park I, Pham T, Phillips M, Roy J, Sebra R, Shen G, Sorenson J, Tomaney A, Travers K, Trulson M, Vieceli J, Wegener J, Wu D, Yang A, Zaccarin D, Zhao P, Zhong F, Korlach J, Turner S (2009) Real-time DNA sequencing from single polymerase molecules. Science 323: 133-138 180

Ekblom R, Wolf JB (2014) A field guide to whole-genome sequencing, assembly and annotation. Evolutionary Applications 7: 1026-1042 El Baidouri M, Panaud O (2013) Comparative Genomic Paleontology across Plant Kingdom Reveals the Dynamics of TE-Driven Genome Evolution. Genome Biology and Evolution 5: 954-965 El-Metwally S, Hamza T, Zakaria M, Helmy M (2013) Next-generation sequence assembly: four stages of data processing and computational challenges. PLOS Computational Biology 9: e1003345 Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 8: 186-194 Fang G, Bhardwaj N, Robilotto R, Gerstein MB (2010) Getting started in gene orthology and functional analysis. PLOS Computational Biology 6: e1000703 Fellows I (2013) Package ‘wordcloud’. Fonseca S, Chico JM, Solano R (2009) The jasmonate pathway: the ligand, the receptor and the core signalling module. Curr Opin Plant Biol 12: 539-547 Fourqurean JW, Duarte CM, Kennedy H, Marba N, Holmer M, Mateo MA, Apostolaki ET, Kendrick GA, Krause-Jensen D, McGlathery KJ, Serrano O (2012) Seagrass ecosystems as a globally significant carbon stock. Nature Geoscience 5: 505-509 Franssen SU, Gu J, Winters G, Huylmans AK, Wienpahl I, Sparwel M, Coyer JA, Olsen JL, Reusch TBH, Bornberg-Bauer E (2014) Genome-wide transcriptomic responses of the seagrasses Zostera marina and Nanozostera noltii under a simulated heatwave confirm functional types. Marine Genomics 15: 65-73 Frazer KA, Elnitski L, Church DM, Dubchak I, Hardison RC (2003) Cross-species sequence comparisons: A review of methods and available resources. Genome Research 13: 1-12 Fukao T, Bailey-Serres J (2004) Plant responses to hypoxia--is survival a balancing act? Trends Plant Sci 9: 449-456 Fukushima K, Fang X, Alvarez-Ponce D, Cai H, Carretero-Paulet L, Chen C, Chang T-H, Farr KM, Fujita T, Hiwatashi Y, Hoshi Y, Imai T, Kasahara M, Librado P, Mao L, Mori H, Nishiyama T, Nozawa M, Pálfalvi G, Pollard ST, Rozas J, Sánchez-Gracia A, Sankoff D, Shibata TF, Shigenobu S, Sumikawa N, Uzawa T, Xie M, Zheng C, Pollock DD, Albert VA, Li S, Hasebe M (2017) Genome of the pitcher plant Cephalotus reveals genetic changes associated with carnivory. Nature Ecology & Evolution 1: 0059 Funk HT, Berg S, Krupinska K, Maier UG, Krause K (2007) Complete DNA sequences of the plastid genomes of two parasitic flowering plant species, Cuscuta reflexa and Cuscuta gronovii. BMC Plant Biology 7: 45 Gagne JM, Smalle J, Gingerich DJ, Walker JM, Yoo SD, Yanagisawa S, Vierstra RD (2004) Arabidopsis EIN3- binding F-box 1 and 2 form ubiquitin-protein ligases that repress ethylene action and promote growth by directing EIN3 degradation. Proceedings of the National Academy of Sciences of the United States of America 101: 6803-6808 Gale MD, Devos KM (1998) Comparative genetics in the grasses. Proceedings of the National Academy of Sciences of the United States of America 95: 1971-1974 Gasteiger E, Gattiker A, Hoogland C, Ivanyi I, Appel RD, Bairoch A (2003) ExPASy: The proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Research 31: 3784-3788 Gaut BS, Ross-Ibarra J (2008) Selection on major components of angiosperm genomes. Science 320: 484- 486 Ge RL, Cai Q, Shen YY, San A, Ma L, Zhang Y, Yi X, Chen Y, Yang L, Huang Y, He R, Hui Y, Hao M, Li Y, Wang B, Ou X, Xu J, Zhang Y, Wu K, Geng C, Zhou W, Zhou T, Irwin DM, Yang Y, Ying L, Bao H, Kim J, Larkin DM, Ma J, Lewin HA, Xing J, Platt RN, 2nd, Ray DA, Auvil L, Capitanu B, Zhang X, Zhang G, Murphy RW, Wang J, Zhang YP, Wang J (2013) Draft genome sequence of the Tibetan antelope. Nature Communication 4: 1858 Ghassemian M, Nambara E, Cutler S, Kawaide H, Kamiya Y, McCourt P (2000) Regulation of abscisic acid signaling by the ethylene response pathway in Arabidopsis. Plant Cell 12: 1117-1126 Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, Hadley D, Hutchison D, Martin C, Katagiri F, Lange BM, Moughamer T, Xia Y, Budworth P, Zhong J, Miguel T, Paszkowski U, Zhang S, Colbert M, Sun WL, Chen L, Cooper B, Park S, Wood TC, Mao L, 181

Quail P, Wing R, Dean R, Yu Y, Zharkikh A, Shen R, Sahasrabudhe S, Thomas A, Cannings R, Gutin A, Pruss D, Reid J, Tavtigian S, Mitchell J, Eldredge G, Scholl T, Miller RM, Bhatnagar S, Adey N, Rubano T, Tusneem N, Robinson R, Feldhaus J, Macalma T, Oliphant A, Briggs S (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92-100 Golicz AA, Martinez PA, Zander M, Patel DA, Van De Wouw AP, Visendi P, Fitzgerald TL, Edwards D, Batley J (2015b) Gene loss in the fungal canola pathogen Leptosphaeria maculans. Functional & Integrative Genomics 15: 189-196 Golicz AA, Schliep M, Lee HT, Larkum AW, Dolferus R, Batley J, Chan CK, Sablok G, Ralph PJ, Edwards D (2015a) Genome-wide survey of the seagrass Zostera muelleri suggests modification of the ethylene signalling network. Journal of Experimental Botany 66: 1489-1498 Goodswen SJ, Kennedy PJ, Ellis JT (2012) Evaluating high-throughput ab initio gene finders to discover proteins encoded in eukaryotic pathogen genomes missed by laboratory techniques. PLOS One 7: e50609 Greco M, Chiappetta A, Bruno L, Bitonti MB (2013) Effects of light deficiency on genome methylation in Posidonia oceanica. marine Ecology Progress Series 473: 103-114 Guigo R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG (2006) EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biology 7 Suppl 1: S2 1-31 Guindon S, Delsuc F, Dufayard JF, Gascuel O (2009) Estimating maximum likelihood phylogenies with PhyML. Methods in Molecular Biology 537: 113-137 Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Jr., Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 31: 5654-5666 Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M, Macmanes MD, Ott M, Orvis J, Pochet N, Strozzi F, Weeks N, Westerman R, William T, Dewey CN, Henschel R, Leduc RD, Friedman N, Regev A (2013) De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols 8: 1494-1512 Haberhausen G, Zetsche K (1994) Functional loss of all ndh genes in an otherwise relatively unaltered plastid genome of the holoparasitic flowering plant Cuscuta reflexa. Plant Molecular Biology 24: 217-222 Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O (2001) TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Research 29: 41-43 Hamilton JP, Buell CR (2012) Advances in plant genome sequencing. Plant Journal 70: 177-190 Hannun YA, Obeid LM (2008) Principles of bioactive lipid signalling: lessons from sphingolipids. Nature Reviews Molecular Cell Biology 9: 139-150 Hansen DJ, Dayanandan P, Kaufman PB, Brotherson JD (1976) Ecological adaptations of salt-marsh grass, Distichlis spicata (Gramineae), and environmental factors affecting its growth and distribution. American Journal of Botany 63: 635-650 Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, Richter J, Rubin GM, Blake JA, Bult C, Dolan M, Drabkin H, Eppig JT, Hill DP, Ni L, Ringwald M, Balakrishnan R, Cherry JM, Christie KR, Costanzo MC, Dwight SS, Engel S, Fisk DG, Hirschman JE, Hong EL, Nash RS, Sethuraman A, Theesfeld CL, Botstein D, Dolinski K, Feierbach B, Berardini T, Mundodi S, Rhee SY, Apweiler R, Barrell D, Camon E, Dimmer E, Lee V, Chisholm R, Gaudet P, Kibbe W, Kishore R, Schwarz EM, Sternberg P, Gwinn M, Hannick L, Wortman J, Berriman M, Wood V, de la Cruz N, Tonellato P, Jaiswal P, Seigfried T, White R, Gene Ontology C (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research 32: D258-261 Hart DA, Kindel PK (1970) Isolation and partial characterization of apiogalacturonans from the cell wall of Lemna minor. Biochemical Journal 116: 569-579 Hasegawa PM (2013) Sodium (Na+) homeostasis and salt tolerance of plants. Environmental and Experimental Botany 92: 19-31

182

Hawkins JS, Proulx SR, Rapp RA, Wendel JF (2009) Rapid DNA loss as a counterbalance to genome expansion through retrotransposon proliferation in plants. Proceedings of the National Academy of Sciences of the United States of America 106: 17811-17816 Heck KL, Valentine JF (2006) Plant-herbivore interactions in seagrass meadows. Journal of Experimental Marine Biology and Ecology 330: 420-436 Hemminga MA, Duarte CM (2000) Seagrass ecology. Cambridge University Press, Cambridge Hensing T, Chawla A, Batra R, Salgia R (2014) A personalized treatment for lung cancer: molecular pathways, targeted therapies, and genomic characterization. Advances in Experimental Medicine and Biology 799: 85-117 Hidalgo O, Garcia S, Garnatje T, Mumbru M, Patterson A, Vigo J, Valles J (2015) Genome size in aquatic and wetland plants: fitting with the large genome constraint hypothesis with a few relevant exceptions. Plant Systematics and Evolution 301: 1927-1936 Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12: 491 Holtgrewe M, Emde AK, Weese D, Reinert K (2011) A novel and well-defined benchmarking method for second generation read mapping. BMC Bioinformatics 12: 210 Hongo JA, de Castro GM, Cintra LC, Zerlotini A, Lobo FP (2015) POTION: an end-to-end pipeline for positive Darwinian selection detection in genome-scale data through phylogenetic comparison of protein- coding genes. BMC Genomics 16: 567 Horvath EM, Peter SO, Joet T, Rumeau D, Cournac L, Horvath GV, Kavanagh TA, Schafer C, Peltier G, Medgyesy P (2000) Targeted inactivation of the plastid ndhB gene in tobacco results in an enhanced sensitivity of photosynthesis to moderate stomatal closure. Plant Physiology 123: 1337- 1349 Huang R, O'Donnell AJ, Barboline JJ, Barkman TJ (2016) Convergent evolution of caffeine in plants by co- option of exapted ancestral enzymes. Proceedings of the National Academy of Sciences of the United States of America 113: 10613-10618 Huang S, Li R, Zhang Z, Li L, Gu X, Fan W, Lucas WJ, Wang X, Xie B, Ni P, Ren Y, Zhu H, Li J, Lin K, Jin W, Fei Z, Li G, Staub J, Kilian A, van der Vossen EA, Wu Y, Guo J, He J, Jia Z, Ren Y, Tian G, Lu Y, Ruan J, Qian W, Wang M, Huang Q, Li B, Xuan Z, Cao J, Asan, Wu Z, Zhang J, Cai Q, Bai Y, Zhao B, Han Y, Li Y, Li X, Wang S, Shi Q, Liu S, Cho WK, Kim JY, Xu Y, Heller-Uszynska K, Miao H, Cheng Z, Zhang S, Wu J, Yang Y, Kang H, Li M, Liang H, Ren X, Shi Z, Wen M, Jian M, Yang H, Zhang G, Yang Z, Chen R, Liu S, Li J, Ma L, Liu H, Zhou Y, Zhao J, Fang X, Li G, Fang L, Li Y, Liu D, Zheng H, Zhang Y, Qin N, Li Z, Yang G, Yang S, Bolund L, Kristiansen K, Zheng H, Li S, Zhang X, Yang H, Wang J, Sun R, Zhang B, Jiang S, Wang J, Du Y, Li S (2009) The genome of the cucumber, Cucumis sativus L. Nature Genetics 41: 1275-1281 Huang X, Madan A (1999) CAP3: A DNA sequence assembly program. Genome Research 9: 868-877 Hughes AR, Inouye BD, Johnson MT, Underwood N, Vellend M (2008) Ecological consequences of genetic diversity. Ecol Lett 11: 609-623 Hughes AR, Stachowicz JJ (2004) Genetic diversity enhances the resistance of a seagrass ecosystem to disturbance. Proc Natl Acad Sci U S A 101: 8998-9002 Hulsen T, Huynen MA, de Vlieg J, Groenen PMA (2006) Benchmarking ortholog identification methods using functional genomics data. Genome Biology 7 Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD (2013) REAPR: a universal tool for genome assembly evaluation. Genome Biology 14 Hupalo D, Kern AD (2013) Conservation and Functional Element Discovery in 20 Angiosperm Plant Genomes. Molecular Biology and Evolution 30: 1729-1744 Huson DH, Richter DC, Rausch C, Dezulian T, Franz M, Rupp R (2007) Dendroscope: An interactive viewer for large phylogenetic trees. BMC Bioinformatics 8: 460 Hyman ED (1988) A new method of sequencing DNA. Analytical Biochemistry 174: 423-436 Iles WJD, Smith SY, Graham SW (2013) A well-supported phylogenetic framework for the monocot order Alismatales reveals multiple losses of the plastid NADH dehydrogenase complex and a strong long- branch effect. Early Events in Monocot Evolution 83: 1-28

183

Innan H, Kondrashov F (2010) The evolution of gene duplications: classifying and distinguishing between models. Nature Reviews Genetics 11: 97-108 International Brachypodium Initiative (2010) Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463: 763-768 International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436: 793-800 Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, Vezzi A, Legeai F, Hugueney P, Dasilva C, Horner D, Mica E, Jublot D, Poulain J, Bruyere C, Billault A, Segurens B, Gouyvenoux M, Ugarte E, Cattonaro F, Anthouard V, Vico V, Del Fabbro C, Alaux M, Di Gaspero G, Dumas V, Felice N, Paillard S, Juman I, Moroldo M, Scalabrin S, Canaguier A, Le Clainche I, Malacrida G, Durand E, Pesole G, Laucou V, Chatelet P, Merdinoglu D, Delledonne M, Pezzotti M, Lecharny A, Scarpelli C, Artiguenave F, Pe ME, Valle G, Morgante M, Caboche M, Adam-Blondon AF, Weissenbach J, Quetier F, Wincker P, Public F-I (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449: 463-U465 Janssen T, Bremer K (2004) The age of major monocot groups inferred from 800+rbcL sequences. Botanical Journal of the Linnean Society 146: 385-398 Jiao Y, Li J, Tang H, Paterson AH (2014) Integrated syntenic and phylogenomic analyses reveal an ancient genome duplication in monocots. Plant Cell 26: 2792-2802 Jiao Y, Wickett NJ, Ayyampalayam S, Chanderbali AS, Landherr L, Ralph PE, Tomsho LP, Hu Y, Liang H, Soltis PS, Soltis DE, Clifton SW, Schlarbaum SE, Schuster SC, Ma H, Leebens-Mack J, dePamphilis CW (2011) Ancestral polyploidy in seed plants and angiosperms. Nature 473: 97-100 Jiao YN, Leebens-Mack J, Ayyampalayam S, Bowers JE, McKain MR, McNeal J, Rolf M, Ruzicka DR, Wafula E, Wickett NJ, Wu XL, Zhang Y, Wang J, Zhang YT, Carpenter EJ, Deyholos MK, Kutchan TM, Chanderbali AS, Soltis PS, Stevenson DW, McCombie R, Pires JC, Wong GKS, Soltis DE, dePamphilis CW (2012) A genome triplication associated with early diversification of the core eudicots. Genome Biology 13 Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, Pesseat S, Quinn AF, Sangrador-Vegas A, Scheremetjew M, Yong SY, Lopez R, Hunter S (2014) InterProScan 5: genome-scale protein function classification. Bioinformatics 30: 1236-1240 Joshi NA, Fass JN (2011) Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files. In, Vol Version 1.33, p Available at https://github.com/najoshi/sickle. Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28: 27-30 Kato Y, Aioi K, Omori Y, Takahata N, Satta Y (2003) Phylogenetic analyses of Zostera species based on rbcL and matK nucleotide sequences: implications for the origin and diversification of seagrasses in Japanese waters. Genes & Genetic Systems 78: 329-342 Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30: 3059-3066 Katsir L, Chung HS, Koo AJK, Howe GA (2008) Jasmonate signaling: a conserved mechanism of hormone sensing. Current Opinion in Plant Biology 11: 428-435 Kawashima T, Lorkovic ZJ, Nishihama R, Ishizaki K, Axelsson E, Yelagandula R, Kohchi T, Berger F (2015) Diversification of histone H2A variants during plant evolution. Trends in Plant Science 20: 419-425 Kellis M, Birren BW, Lander ES (2004) Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428: 617-624 Kende H, van der Knaap E, Cho HT (1998) Deepwater rice: A model plant to study stem elongation. Plant Physiology 118: 1105-1110 Khotimchenko Y, Khozhaenko E, Kovalev V, Khotimchenko M (2012) Cerium binding activity of pectins isolated from the seagrasses Zostera marina and Phyllospadix iwatensis. Marine Drugs 10: 834-848 Kieber JJ, Rothenberg M, Roman G, Feldmann KA, Ecker JR (1993) CTR1, a negative regulator of the ethylene response pathway in Arabidopsis, encodes a member of the raf family of protein kinases. Cell 72: 427-441 Kirk JTO (2011) Light and photosynthesis in aquatic ecosystems. Cambridge University Press, Cambridge

184

Kirsch M, Zhigang A, Viereck R, Low R, Rausch T (1996) Salt stress induces an increased expression of V- type H+-ATPase in mature sugar beet leaves. Plant Molecular Biology 32: 543-547 Kleftogiannis D, Kalnis P, Bajic VB (2013) Comparing memory-efficient genome assemblers on stand-alone and cloud infrastructures. PLOS One 8: e75505 Knapp K, Chen YP (2007) An evaluation of contemporary hidden Markov model genefinders with a predicted exon taxonomy. Nucleic Acids Research 35: 317-324 Koce JD, Vilhar B, Bohanec B, Dermastia M (2003) Genome size of Adriatic seagrasses. Aquatic Botany 77: 17-25 Koch MA, Haubold B, Mitchell-Olds T (2000) Comparative evolutionary analysis of chalcone synthase and alcohol dehydrogenase loci in Arabidopsis, Arabis, and related genera (Brassicaceae). Molecular Biology and Evolution 17: 1483-1498 Kofer W, Koop HU, Wanner G, Steinmuller K (1998) Mutagenesis of the genes encoding subunits A, C, H, I, J and K of the plastid NAD(P)H-plastoquinone-oxidoreductase in tobacco by polyethylene glycol- mediated plastome transformation. Molecular and General Genetics 258: 166-173 Kong F, Li H, Sun P, Zhou Y, Mao Y (2014) De novo assembly and characterization of the transcriptome of seagrass Zostera marina using Illumina paired-end sequencing. PLOS One 9: e112245 Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Rogozin IB, Smirnov S, Sorokin AV, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2004) A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biology 5 Koren S, Phillippy AM (2015) One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Current Opinion in Microbiology 23: 110-120 Koren S, Phillippy AM (2015) One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol 23: 110-120 Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM (2016) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. bioRxiv Korf I (2004) Gene finding in novel genomes. BMC Bioinformatics 5: 59 Kraaijeveld K (2010) Genome Size and Species Diversification. Evol Biol 37: 227-233 Kristensen DM, Wolf YI, Mushegian AR, Koonin EV (2011) Computational methods for gene orthology inference. Briefings in Bioinformatics 12: 379-391 Kuo J (2001) Chromosome numbers of the Australian Zosteraceae. Plant Systemics and Evolution 226: 155- 163 Lamb JB, van de Water JA, Bourne DG, Altier C, Hein MY, Fiorenza EA, Abu N, Jompa J, Harvell CD (2017) Seagrass ecosystems reduce exposure to bacterial pathogens of humans, fishes, and invertebrates. Science 355: 731-733 Lambers H, Chapin FS, Pons TL (1998) Plant physiological ecology. Springer, New York Lan Y, Sun J, Tian R, Bartlett DH, Li R, Wong YH, Zhang W, Qiu JW, Xu T, He LS, Tabata HG, Qian PY (2017) Molecular adaptation in the world's deepest-living animal: Insights from transcriptome sequencing of the hadal amphipod Hirondellea gigas. Molecular Ecology Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann Y, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, 185

Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowki J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, Szustakowki J, International Human Genome Sequencing C (2001) Initial sequencing and analysis of the human genome. Nature 409: 860-921 Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10: R25 Larkum AWD, McComb AJ, Shepherd SA (1989) Biology of seagrasses : a treatise on the biology of seagrasses with special reference to the Australian region. Elsevier, Amsterdam ; New York Larkum AWD, Orth RJ, Duarte CM (2006) Seagrasses : biology, ecology and conservation. Springer, Dordrecht, Netherlands Läuchli A, Grattan SR (2007) Plant Growth And Development Under Salinity Stress. In MA Jenks, PM Hasegawa, SM Jain, eds, Advances in Molecular Breeding Toward Drought and Salt Tolerant Crops. Springer Netherlands, Dordrecht, pp 1-32 Lee H, Golicz AA, Bayer PE, Jiao Y, Tang H, Paterson AH, Sablok G, Krishnaraj RR, Chan CK, Batley J, Kendrick GA, Larkum AW, Ralph PJ, Edwards D (2016) The genome of a southern hemisphere seagrass species (Zostera muelleri). Plant Physiology 172: 272-283 Lee KS, Dunton KH (1999) Inorganic nitrogen acquisition in the seagrass Thalassia testudinum: Development of a whole-plant nitrogen budget. Limnology and Oceanography 44: 1204-1215 Lei G, Shen M, Li ZG, Zhang B, Duan KX, Wang N, Cao YR, Zhang WK, Ma B, Ling HQ, Chen SY, Zhang JS (2011) EIN2 regulates salt stress response and interacts with a MA3 domain-containing protein ECIP1 in Arabidopsis. Plant, Cell & Environment 34: 1678-1692 Leoni V, Vela A, Pasqualini V, Pergent-Martini C, Pergent G (2008) Effects of experimental reduction of light and nutrient enrichments (N and P) on seagrasses: a review. Aquatic Conservation-Marine and Freshwater Ecosystems 18: 202-220 Lerouxel O, Cavalier DM, Liepman AH, Keegstra K (2006) Biosynthesis of plant cell wall polysaccharides - a complex process. Current Opinion in Plant Biology 9: 621-630 Leroy P, Guilhot N, Sakai H, Bernard A, Choulet F, Theil S, Reboux S, Amano N, Flutre T, Pelegrin C, Ohyanagi H, Seidel M, Giacomoni F, Reichstadt M, Alaux M, Gicquello E, Legeai F, Cerutti L, Numa H, Tanaka T, Mayer K, Itoh T, Quesneville H, Feuillet C (2012) TriAnnot: a versatile and high performance pipeline for the automated annotation of plant genomes. Frontiers in Plant Science 3 Les DH, Cleland MA, Waycott M (1997) Phylogenetic studies in Alismatidae, II: evolution of marine angiosperms (seagrasses) and hydrophily. Systematic Botany: 443-463 Les DH, Garvin DK, Wimpee CF (1993) Phylogenetic studies in the monocot subclass Alismatidae: evidence for a reappraisal of the aquatic order Najadales. Molecular Phylogenetics and Evolution 2: 304-314 Les DH, Moody ML, Jacobs SWL, Bayer RJ (2002) Systematics of seagrasses (Zosteraceae) in Australia and New Zealand. Systematic Botany 27: 468-484 Les DH, Tippery NP (2013) In time and with water . . . the systematics of alismatid monocotyledons. In P Wilkin, SJ Mayo, eds, Early Events in Monocot Evolution. Cambridge University Press, Cambridge

186

Levene MJ, Korlach J, Turner SW, Foquet M, Craighead HG, Webb WW (2003) Zero-mode waveguides for single-molecule analysis at high concentrations. Science 299: 682-686 Lewis SE, Searle SM, Harris N, Gibson M, Lyer V, Richter J, Wiel C, Bayraktaroglu L, Birney E, Crosby MA, Kaminker JS, Matthews BB, Prochnik SE, Smithy CD, Tupy JL, Rubin GM, Misra S, Mungall CJ, Clamp ME (2002) Apollo: a sequence annotation editor. Genome Biology 3: RESEARCH0082 Li F, Fan G, Wang K, Sun F, Yuan Y, Song G, Li Q, Ma Z, Lu C, Zou C, Chen W, Liang X, Shang H, Liu W, Shi C, Xiao G, Gou C, Ye W, Xu X, Zhang X, Wei H, Li Z, Zhang G, Wang J, Liu K, Kohel RJ, Percy RG, Yu JZ, Zhu YX, Wang J, Yu S (2014) Genome sequence of the cultivated cotton Gossypium arboreum. Nature Genetics 46: 567-572 Li FG, Fan GY, Lu CR, Xiao GH, Zou CS, Kohel RJ, Ma ZY, Shang HH, Ma XF, Wu JY, Liang XM, Huang G, Percy RG, Liu K, Yang WH, Chen WB, Du XM, Shi CC, Yuan YL, Ye WW, Liu X, Zhang XY, Liu WQ, Wei HL, Wei SJ, Huang GD, Zhang XL, Zhu SJ, Zhang H, Sun FM, Wang XF, Liang J, Wang JH, He Q, Huang LH, Wang J, Cui JJ, Song GL, Wang KB, Xu X, Yu JZ, Zhu YX, Yu SX (2015) Genome sequence of cultivated Upland cotton (Gossypium hirsutum TM-1) provides insights into genome evolution. Nature Biotechnology 33: 524-U242 Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754-1760 Li L, Stoeckert CJ, Jr., Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research 13: 2178-2189 Li XX, Zhou ZK (2009) Phylogenetic studies of the core Alismatales inferred from morphology and rbcL sequences. Progress in Natural Science-Materials International 19: 931-945 Liphschitz N, Waisel Y (1974) Existence of Salt-Glands in Various Genera of Gramineae. New Phytologist 73: 507-+ Lisch D (2013) How important are transposons for plant evolution? Nature Reviews Genetics 14: 49-61 Liu Y, Wang L, Mao S, Liu K, Lu Y, Wang J, Wei Y, Zheng Y (2015) Genome-wide association study of 29 morphological traits in Aegilops tauschii. Scientific Reports 5: 15562 Logacheva MD, Schelkunov MI, Penin AA (2011) Sequencing and analysis of plastid genome in mycoheterotrophic orchid Neottia nidus-avis. Genome Biology & Evolution 3: 1296-1303 Loques F, Cave G, Meinesz A (1990) Axenic culture of selected tissue of Posidonia oceanica. Aquatic Botany 37: 171-188 Lunter G, Goodson M (2011) Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Research 21: 936-939 Luo M, Liu X, Singh P, Cui Y, Zimmerli L, Wu K (2012) Chromatin modifications and remodeling in plant abiotic stress responses. Biochimica et Biophysica Acta 1819: 129-136 Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, Tang J, Wu G, Zhang H, Shi Y, Liu Y, Yu C, Wang B, Lu Y, Han C, Cheung DW, Yiu SM, Peng S, Xiaoqian Z, Liu G, Liao X, Li Y, Yang H, Wang J, Lam TW, Wang J (2012) SOAPdenovo2: an empirically improved memory-efficient short- read de novo assembler. Gigascience 1: 18 Lynch M, Conery JS (2000) The evolutionary fate and consequences of duplicate genes. Science 290: 1151- 1155 Macreadie PI, Schliepl MT, Rasheed MA, Chartrand KM, Ralph PJ (2014) Molecular indicators of chronic seagrass stress: A new era in the management of seagrass ecosystems? Ecological Indicators 38: 279-281 Majoros WH, Pertea M, Salzberg SL (2004) TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20: 2878-2879 Mallick S, Gnerre S, Muller P, Reich D (2009) The difficulty of avoiding false positives in genome scans for natural selection. Genome Research 19: 922-933 Markova-Raina P, Petrov D (2011) High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes. Genome Research 21: 863-874 Martemyanov VV, Podgwaite JD, Belousova IA, Pavlushin SV, Slavicek JM, Baturina OA, Kabilov MR, Ilyinykh AV (2017) A comparison of the adaptations of strains of Lymantria dispar multiple nucleopolyhedrovirus to hosts from spatially isolated populations. Journal of Invertebrate Pathology 146: 41-46 187

Martin M, Funk HT, Serrot PH, Poltnigg P, Sabater B (2009) Functional characterization of the thylakoid Ndh complex phosphorylation by site-directed mutations in the ndhF gene. Biochimica et Biophysica Acta 1787: 920-928 Maumus F, Quesneville H (2014) Ancestral repeats have shaped epigenome and genome composition for millions of years in Arabidopsis thaliana. Nature Communication 5: 4104 Mazzuca S, Spadafora A, Filadoro D, Vannini C, Marsoni M, Cozza R, Bracale M, Pangaro T, Innocenti AM (2009) Seagrass light acclimation: 2-DE protein analysis in Posidonia leaves grown in chronic low light conditions. Journal of Experimental Marine Biology and Ecology 374: 113-122 McCarthy EM, Liu J, Lizhi G, McDonald JF (2002) Long terminal repeat retrotransposons of Oryza sativa. Genome Biology 3: RESEARCH0053 McGrath CL, Gout JF, Johri P, Doak TG, Lynch M (2014) Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Research 24: 1665-1675 Menanciohautea D, Fatokun CA, Kumar L, Danesh D, Young ND (1993) Comparative Genome Analysis of Mungbean (Vigna radiata L. Wilczek) and Cowpea (V. unguiculata L Walpers) Using Rflp Mapping Data. Theoretical and Applied Genetics 86: 797-810 Meyer A, Van de Peer Y (2005) From 2R to 3R: evidence for a fish-specific genome duplication (FSGD). Bioessays 27: 937-945 Mi HY, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ, Kitano H, Thomas PD (2005) The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research 33: D284-D288 Michael TP (2014) Plant genome size variation: bloating and purging DNA. Briefings in Functional Genomics 13: 308-317 Michael TP, Jackson S (2013) The First 50 Plant Genomes. Plant Genome 6 Michel G, Tonon T, Scornet D, Cock JM, Kloareg B (2010) The cell wall polysaccharide metabolism of the brown alga Ectocarpus siliculosus. Insights into the evolution of extracellular matrix polysaccharides in Eukaryotes. New Phytologist 188: 82-97 Micheli F (2001) Pectin methylesterases: cell wall enzymes with important roles in plant physiology. Trends in Plant Science 6: 414-419 Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH, Senin P, Wang W, Ly BV, Lewis KL, Salzberg SL, Feng L, Jones MR, Skelton RL, Murray JE, Chen C, Qian W, Shen J, Du P, Eustice M, Tong E, Tang H, Lyons E, Paull RE, Michael TP, Wall K, Rice DW, Albert H, Wang ML, Zhu YJ, Schatz M, Nagarajan N, Acob RA, Guan P, Blas A, Wai CM, Ackerman CM, Ren Y, Liu C, Wang J, Wang J, Na JK, Shakirov EV, Haas B, Thimmapuram J, Nelson D, Wang X, Bowers JE, Gschwend AR, Delcher AL, Singh R, Suzuki JY, Tripathi S, Neupane K, Wei H, Irikura B, Paidi M, Jiang N, Zhang W, Presting G, Windsor A, Navajas-Perez R, Torres MJ, Feltus FA, Porter B, Li Y, Burroughs AM, Luo MC, Liu L, Christopher DA, Mount SM, Moore PH, Sugimura T, Jiang J, Schuler MA, Friedman V, Mitchell-Olds T, Shippen DE, dePamphilis CW, Palmer JD, Freeling M, Paterson AH, Gonsalves D, Wang L, Alam M (2008) The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452: 991-996 Ming R, VanBuren R, Wai CM, Tang H, Schatz MC, Bowers JE, Lyons E, Wang ML, Chen J, Biggers E, Zhang J, Huang L, Zhang L, Miao W, Zhang J, Ye Z, Miao C, Lin Z, Wang H, Zhou H, Yim WC, Priest HD, Zheng C, Woodhouse M, Edger PP, Guyot R, Guo HB, Guo H, Zheng G, Singh R, Sharma A, Min X, Zheng Y, Lee H, Gurtowski J, Sedlazeck FJ, Harkess A, McKain MR, Liao Z, Fang J, Liu J, Zhang X, Zhang Q, Hu W, Qin Y, Wang K, Chen LY, Shirley N, Lin YR, Liu LY, Hernandez AG, Wright CL, Bulone V, Tuskan GA, Heath K, Zee F, Moore PH, Sunkar R, Leebens-Mack JH, Mockler T, Bennetzen JL, Freeling M, Sankoff D, Paterson AH, Zhu X, Yang X, Smith JA, Cushman JC, Paull RE, Yu Q (2015) The pineapple genome and the evolution of CAM photosynthesis. Nature Genetics 47: 1435-1442 Miserey-Lenkei S, Chalancon G, Bardin S, Formstecher E, Goud B, Echard A (2010) Rab and actomyosin- dependent fission of transport vesicles at the Golgi complex. Nature Cell Biology 12: 645-U649 Morin RD, Chang E, Petrescu A, Liao N, Griffith M, Kirkpatrick R, Butterfield YS, Young AC, Stott J, Barber S, Babakaiff R, Dickson MC, Matsuo C, Wong D, Yang GS, Smailus DE, Wetherby KD, Kwong PN, Grimwood J, Brinkley CP, Brown-John M, Reddix-Dugue ND, Mayo M, Schmutz J, Beland J, Park 188

M, Gibson S, Olson T, Bouffard GG, Tsai M, Featherstone R, Chand S, Siddiqui AS, Jang WH, Lee E, Klein SL, Blakesley RW, Zeeberg BR, Narasimhan S, Weinstein JN, Pennacchio CP, Myers RM, Green ED, Wagner L, Gerhard DS, Marra MA, Jones SJM, Holt RA (2006) Sequencing and analysis of 10,967 full-length cDNA clones from Xenopus laevis and Xenopus tropicalis reveals post- tetraploidization transcriptome remodeling. Genome Research 16: 796-803 Mueller LA, Solow TH, Taylor N, Skwarecki B, Buels R, Binns J, Lin CW, Wright MH, Ahrens R, Wang Y, Herbst EV, Keyder ER, Menda N, Zamir D, Tanksley SD (2005) The SOL Genomics Network. A comparative resource for Solanaceae biology and beyond. Plant Physiology 138: 1310-1317 Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KH, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC (2000) A whole genome assembly of Drosophila. Science 287: 2196-2204 Nakano T, Suzuki K, Ohtsuki N, Tsujimoto Y, Fujimura T, Shinshi H (2006) Identification of genes of the plant-specific transcription-factor families cooperatively regulated by ethylene and jasmonate in Arabidopsis thaliana. J Plant Res 119: 407-413 Natali L, Cossu RM, Mascagni F, Giordani T, Cavallini A (2015) A survey of Gypsy and Copia LTR- retrotransposon superfamilies and lineages and their distinct dynamics in the Populus trichocarpa (L.) genome. Tree Genetics & Genomes 11 Neumann P, Koblizkova A, Navratilova A, Macas J (2006) Significant expansion of Vicia pannonica genome size mediated by amplification of a single type of giant retroelement. Genetics 173: 1047-1056 Neyland R, Urbatsch LE (1996) The ndhF chloroplast gene detected in all divisions. Planta 200: 273-277 Nguyen VX, Detcharoen M, Tuntiprapas P, Soe-Htun U, Sidik JB, Harah MZ, Prathep A, Papenbrock J (2014) Genetic species identification and population structure of Halophila (Hydrocharitaceae) from the Western Pacific to the Eastern Indian Ocean. BMC Evolutionary Biology 14: 92 Nguyen XV, Hofler S, Glasenapp Y, Thangaradjou T, Lucas C, Papenbrock J (2015) New insights into DNA barcoding of seagrasses. Systematics and Biodiversity 13: 496-508 Notredame C, Higgins DG, Heringa J (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology 302: 205-217 Olsen JL, Rouze P, Verhelst B, Lin YC, Bayer T, Collen J, Dattolo E, De Paoli E, Dittami S, Maumus F, Michel G, Kersting A, Lauritano C, Lohaus R, Topel M, Tonon T, Vanneste K, Amirebrahimi M, Brakel J, Bostrom C, Chovatia M, Grimwood J, Jenkins JW, Jueterbock A, Mraz A, Stam WT, Tice H, Bornberg-Bauer E, Green PJ, Pearson GA, Procaccini G, Duarte CM, Schmutz J, Reusch TB, Van de Peer Y (2016) The genome of the seagrass Zostera marina reveals angiosperm adaptation to the sea. Nature 530: 331-335 Omidbakhshfard MA, Omranian N, Ahmadi FS, Nikoloski Z, Mueller-Roeber B (2012) Effect of salt stress on genes encoding translation-associated proteins in Arabidopsis thaliana. Plant Signaling & Behavior 7: 1095-1102 Onkokesung N, Baldwin IT, Galis I (2010) The role of jasmonic acid and ethylene crosstalk in direct defense of Nicotiana attenuata plants against chewing herbivores. Plant Signaling & Behavior 5: 1305-1307 Ord TJ, Summers TC (2015) Repeated evolution and the impact of evolutionary history on adaptation. BMC Evolutionary Biology 15: 137 Orth RJ, Carruthers TJB, Dennison WC, Duarte CM, Fourqurean JW, Heck KL, Hughes AR, Kendrick GA, Kenworthy WJ, Olyarnik S, Short FT, Waycott M, Williams SL (2006) A global crisis for seagrass ecosystems. Bioscience 56: 987-996 Ovodov YS, Ovodova RG, Bondaren.Od, Krasikov.In (1971) Pectic substances of Zosteraceae: Part IV. Pectinase digestion of zosterine. Carbohydrate Research 18: 311-& Panchy N, Lehti-Shiu M, Shiu SH (2016) Evolution of gene duplication in plants. Plant Physiology 171: 2294- 2316 Park CJ, Seo YS (2015) Heat shock proteins: A review of the molecular chaperones for plant immunity. Plant Pathology Journal 31: 323-333 Parra G, Bradnam K, Korf I (2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23: 1061-1067 189

Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A, Schmutz J, Spannagl M, Tang H, Wang X, Wicker T, Bharti AK, Chapman J, Feltus FA, Gowik U, Grigoriev IV, Lyons E, Maher CA, Martis M, Narechania A, Otillar RP, Penning BW, Salamov AA, Wang Y, Zhang L, Carpita NC, Freeling M, Gingle AR, Hash CT, Keller B, Klein P, Kresovich S, McCann MC, Ming R, Peterson DG, Mehboob ur R, Ware D, Westhoff P, Mayer KF, Messing J, Rokhsar DS (2009) The Sorghum bicolor genome and the diversification of grasses. Nature 457: 551-556 Paterson AH, Bowers JE, Chapman BA (2004) Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proceedings of the National Academy of Sciences of the United States of America 101: 9903-9908 Paterson AH, Chapman BA, Kissinger JC, Bowers JE, Feltus FA, Estill JC (2006) Many gene and domain families have convergent fates following independent whole-genome duplication events in Arabidopsis, Oryza, Saccharomyces and Tetraodon. Trends in Genetics 22: 597-602 Paterson AH, Lin YR, Li ZK, Schertz KF, Doebley JF, Pinson SRM, Liu SC, Stansel JW, Irvine JE (1995) Convergent domestication of cereal crops by independent mutations at corresponding genetic-loci. Science 269: 1714-1718 Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT (2012) Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proceedings of the National Academy of Sciences of the United States of America 109: 13272-13277 Peltier G, Aro EM, Shikanai T (2016) NDH-1 and NDH-2 plastoquinone reductases in oxygenic photosynthesis. Annual Review of Plant Biology 67: 55-80 Peltier G, Schmidt GW (1991) Chlororespiration: an adaptation to nitrogen deficiency in Chlamydomonas reinhardtii. Proceedings of the National Academy of Sciences of the United States of America 88: 4791-4795 Peng J, Li Z, Wen X, Li W, Shi H, Yang L, Zhu H, Guo H (2014) Salt-induced stabilization of EIN3/EIL1 confers salinity tolerance by deterring ROS accumulation in Arabidopsis. PLOS Genetics 10: e1004664 Peredo EL, King UM, Les DH (2013) The plastid genome of Najas flexilis: adaptation to submersed environments is accompanied by the complete loss of the NDH complex in an aquatic angiosperm. PLOS One 8: e68591 Pernice M, Schliep M, Szabo M, Rasheed M, Bryant C, York P, Chartrand K, Petrou K, Ralph P (2015) Development of a molecular biology tool kit to monitor dredging-related light stress in the seagrass Zostera muelleri ssp. capricorni in Port Curtis Final Report. Petersen G, Seberg O, Cuenca A, Stevenson DW, Thadeo M, Davis JI, Graham S, Ross TG (2016) Phylogeny of the Alismatales (Monocotyledons) and the relationship of Acorus (Acorales?). Cladistics 32: 141- 159 Petersen L, Bollback JP, Dimmic M, Hubisz M, Nielsen R (2007) Genes under positive selection in Escherichia coli. Genome Research 17: 1336-1343 Pevsner J (2009) Bioinformatics and functional genomics, Ed 2nd. Wiley-Blackwell, Hoboken, N.J. Phillippy AM, Schatz MC, Pop M (2008) Genome assembly forensics: finding the elusive mis-assembly. Genome Biology 9: R55 Piegu B, Guyot R, Picault N, Roulin A, Saniyal A, Kim H, Collura K, Brar DS, Jackson S, Wing RA, Panaud O (2006) Doubling genome size without polyploidization: Dynamics of retrotransposition-driven genomic expansions in Oryza australiensis, a wild relative of rice. Genome Research 16: 1262-1269 Pop M (2009) Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 10: 354-366 Popper ZA, Michel G, Herve C, Domozych DS, Willats WG, Tuohy MG, Kloareg B, Stengel DB (2011) Evolution and diversity of plant cell walls: from algae to flowering plants. Annual Review of Plant Biology 62: 567-590 Postlethwait JH, Woods IG, Ngo-Hazelett P, Yan YL, Kelly PD, Chu F, Huang H, Hill-Force A, Talbot WS (2000) Zebrafish comparative genomics and the origins of vertebrate chromosomes. Genome Research 10: 1890-1902 Potato Genome Sequencing Consortium, Xu X, Pan S, Cheng S, Zhang B, Mu D, Ni P, Zhang G, Yang S, Li R, Wang J, Orjeda G, Guzman F, Torres M, Lozano R, Ponce O, Martinez D, De la Cruz G, Chakrabarti 190

SK, Patil VU, Skryabin KG, Kuznetsov BB, Ravin NV, Kolganova TV, Beletsky AV, Mardanov AV, Di Genova A, Bolser DM, Martin DM, Li G, Yang Y, Kuang H, Hu Q, Xiong X, Bishop GJ, Sagredo B, Mejia N, Zagorski W, Gromadka R, Gawor J, Szczesny P, Huang S, Zhang Z, Liang C, He J, Li Y, He Y, Xu J, Zhang Y, Xie B, Du Y, Qu D, Bonierbale M, Ghislain M, Herrera Mdel R, Giuliano G, Pietrella M, Perrotta G, Facella P, O'Brien K, Feingold SE, Barreiro LE, Massa GA, Diambra L, Whitty BR, Vaillancourt B, Lin H, Massa AN, Geoffroy M, Lundback S, DellaPenna D, Buell CR, Sharma SK, Marshall DF, Waugh R, Bryan GJ, Destefanis M, Nagy I, Milbourne D, Thomson SJ, Fiers M, Jacobs JM, Nielsen KL, Sonderkaer M, Iovene M, Torres GA, Jiang J, Veilleux RE, Bachem CW, de Boer J, Borm T, Kloosterman B, van Eck H, Datema E, Hekkert B, Goverse A, van Ham RC, Visser RG (2011) Genome sequence and analysis of the tuber crop potato. Nature 475: 189-195 Price AL, Jones NC, Pevzner PA (2005) De novo identification of repeat families in large genomes. Bioinformatics 21 Suppl 1: i351-358 Primmer CR, Papakostas S, Leder EH, Davis MJ, Ragan MA (2013) Annotated genes and nonannotated genomes: cross-species use of Gene Ontology in ecology and evolution research. Molecular Ecology 22: 3216-3241 Privman E, Penn O, Pupko T (2012) Improving the performance of positive selection inference by filtering unreliable alignment regions. Molecular Biology and Evolution 29: 1-5 Procaccini G, Beer S, Bjork M, Olsen J, Mazzuca S, Santos R (2012) Seagrass ecophysiology meets ecological genomics: are we ready? Marine Ecology-an Evolutionary Perspective 33: 522-527 Ralph PJ, Durako MJ, Enriquez S, Collier CJ, Doblin MA (2007) Impact of light limitation on seagrasses. Journal of Experimental Marine Biology and Ecology 350: 176-193 Rausell A, Kanhonou R, Yenush L, Serrano R, Ros R (2003) The translation initiation factor eIF1A is an important determinant in the tolerance to NaCl stress in yeast and plants. Plant Journal 34: 257-267 Renny-Byfield S, Chester M, Kovarik A, Le Comber SC, Grandbastien MA, Deloger M, Nichols RA, Macas J, Novak P, Chase MW, Leitch AR (2011) Next generation sequencing reveals genome downsizing in allotetraploid Nicotiana tabacum, predominantly through the elimination of paternally derived repetitive DNAs. Molecular Biology and Evolution 28: 2843-2854 Reusch TB, Stam WT, Olsen JL (2000) A microsatellite-based estimation of clonal diversity and population subdivision in Zostera marina, a marine flowering plant. Molecular Ecology 9: 127-140 Reusch TBH (2001) Fitness-consequences of geitonogamous selfing in a clonal marine angiosperm (Zostera marina). Journal of Evolutionary Biology 14: 129-138 Rhesus Macaque Genome Sequencing Analysis Consortium, Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, Remington KA, Strausberg RL, Venter JC, Wilson RK, Batzer MA, Bustamante CD, Eichler EE, Hahn MW, Hardison RC, Makova KD, Miller W, Milosavljevic A, Palermo RE, Siepel A, Sikela JM, Attaway T, Bell S, Bernard KE, Buhay CJ, Chandrabose MN, Dao M, Davis C, Delehaunty KD, Ding Y, Dinh HH, Dugan-Rocha S, Fulton LA, Gabisi RA, Garner TT, Godfrey J, Hawes AC, Hernandez J, Hines S, Holder M, Hume J, Jhangiani SN, Joshi V, Khan ZM, Kirkness EF, Cree A, Fowler RG, Lee S, Lewis LR, Li Z, Liu YS, Moore SM, Muzny D, Nazareth LV, Ngo DN, Okwuonu GO, Pai G, Parker D, Paul HA, Pfannkoch C, Pohl CS, Rogers YH, Ruiz SJ, Sabo A, Santibanez J, Schneider BW, Smith SM, Sodergren E, Svatek AF, Utterback TR, Vattathil S, Warren W, White CS, Chinwalla AT, Feng Y, Halpern AL, Hillier LW, Huang X, Minx P, Nelson JO, Pepin KH, Qin X, Sutton GG, Venter E, Walenz BP, Wallis JW, Worley KC, Yang SP, Jones SM, Marra MA, Rocchi M, Schein JE, Baertsch R, Clarke L, Csuros M, Glasscock J, Harris RA, Havlak P, Jackson AR, Jiang H, Liu Y, Messina DN, Shen Y, Song HX, Wylie T, Zhang L, Birney E, Han K, Konkel MK, Lee J, Smit AF, Ullmer B, Wang H, Xing J, Burhans R, Cheng Z, Karro JE, Ma J, Raney B, She X, Cox MJ, Demuth JP, Dumas LJ, Han SG, Hopkins J, Karimpour-Fard A, Kim YH, Pollack JR, Vinar T, Addo- Quaye C, Degenhardt J, Denby A, Hubisz MJ, Indap A, Kosiol C, Lahn BT, Lawson HA, Marklein A, Nielsen R, Vallender EJ, Clark AG, Ferguson B, Hernandez RD, Hirani K, Kehrer-Sawatzki H, Kolb J, Patil S, Pu LL, Ren Y, Smith DG, Wheeler DA, Schenck I, Ball EV, Chen R, Cooper DN, Giardine B, Hsu F, Kent WJ, Lesk A, Nelson DL, O'Brien W E, Prufer K, Stenson PD, Wallace JC, Ke H, Liu XM, Wang P, Xiang AP, Yang F, Barber GP, Haussler D, Karolchik D, Kern AD, Kuhn RM, Smith KE, Zwieg AS (2007) Evolutionary and biomedical insights from the rhesus macaque genome. Science 316: 222-234 191

Riley MC, Kirkup BC, Jr., Johnson JD, Lesho EP, Ockenhouse CF (2011) Rapid whole genome optical mapping of Plasmodium falciparum. Malaria Journal 10: 252 Roman G, Lubarsky B, Kieber JJ, Rothenberg M, Ecker JR (1995) Genetic analysis of ethylene signal transduction in Arabidopsis thaliana: five novel mutant loci integrated into a stress response pathway. Genetics 139: 1393-1409 Ronaghi M, Karamohamed S, Pettersson B, Uhlen M, Nyren P (1996) Real-time DNA sequencing using detection of pyrophosphate release. Analytical Biochemistry 242: 84-89 Ronaghi M, Uhlen M, Nyren P (1998) A sequencing method based on real-time pyrophosphate. Science 281: 363-+ Ross TG, Barrett CF, Soto Gomez M, Lam VKY, Henriquez CL, Les DH, Davis JI, Cuenca A, Petersen G, Seberg O, Thadeo M, Givnish TJ, Conran J, Stevenson DW, Graham SW (2016) Plastid phylogenomics and molecular evolution of Alismatales. Cladistics 32: 160-178 Roth NC, Pregnall AM (1988) Nitrate reductase activity in Zostera marina. Marine Biology 99: 457-463 Roy SJ, Negrao S, Tester M (2014) Salt resistant crop plants. Current Opinion in Biotechnology 26: 115-124 Ruhlman TA, Chang WJ, Chen JJ, Huang YT, Chan MT, Zhang J, Liao DC, Blazier JC, Jin X, Shih MC, Jansen RK, Lin CS (2015) NDH expression marks major transitions in plant evolution and reveals coordinate intracellular gene loss. BMC Plant Biology 15: 100 Sahm A, Bens M, Platzer M, Szafranski K (2017) PosiGene: automated and easy-to-use pipeline for genome-wide detection of positively selected genes. Nucleic Acids Research Salamov AA, Solovyev VV (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res 10: 516- 522 Salman-Minkov A, Sabath N, Mayrose I (2016) Whole-genome duplication as a key factor in crop domestication. Nature Plants 2 Salo T, Reusch TBH, Bostrom C (2015) Genotype-specific responses to light stress in eelgrass Zostera marina, a marine foundation plant. Marine Ecology Progress Series 519: 129-140 Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marcais G, Pop M, Yorke JA (2012) GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research 22: 557-567 Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America 74: 5463-5467 Schatz MC, Delcher AL, Salzberg SL (2010) Assembly of large genomes using second-generation sequencing. Genome Research 20: 1165-1173 Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, Minx P, Reily AD, Courtney L, Kruchowski SS, Tomlinson C, Strong C, Delehaunty K, Fronick C, Courtney B, Rock SM, Belter E, Du F, Kim K, Abbott RM, Cotton M, Levy A, Marchetto P, Ochoa K, Jackson SM, Gillam B, Chen W, Yan L, Higginbotham J, Cardenas M, Waligorski J, Applebaum E, Phelps L, Falcone J, Kanchi K, Thane T, Scimone A, Thane N, Henke J, Wang T, Ruppert J, Shah N, Rotter K, Hodges J, Ingenthron E, Cordes M, Kohlberg S, Sgro J, Delgado B, Mead K, Chinwalla A, Leonard S, Crouse K, Collura K, Kudrna D, Currie J, He R, Angelova A, Rajasekar S, Mueller T, Lomeli R, Scara G, Ko A, Delaney K, Wissotski M, Lopez G, Campos D, Braidotti M, Ashley E, Golser W, Kim H, Lee S, Lin J, Dujmic Z, Kim W, Talag J, Zuccolo A, Fan C, Sebastian A, Kramer M, Spiegel L, Nascimento L, Zutavern T, Miller B, Ambroise C, Muller S, Spooner W, Narechania A, Ren L, Wei S, Kumari S, Faga B, Levy MJ, McMahan L, Van Buren P, Vaughn MW, Ying K, Yeh CT, Emrich SJ, Jia Y, Kalyanaraman A, Hsia AP, Barbazuk WB, Baucom RS, Brutnell TP, Carpita NC, Chaparro C, Chia JM, Deragon JM, Estill JC, Fu Y, Jeddeloh JA, Han Y, Lee H, Li P, Lisch DR, Liu S, Liu Z, Nagel DH, McCann MC, SanMiguel P, Myers AM, Nettleton D, Nguyen J, Penning BW, Ponnala L, Schneider KL, Schwartz DC, Sharma A, Soderlund C, Springer NM, Sun Q, Wang H, Waterman M, Westerman R, Wolfgruber TK, Yang L, Yu Y, Zhang L, Zhou S, Zhu Q, Bennetzen JL, Dawe RK, Jiang J, Jiang N, Presting GG, Wessler SR, Aluru S, Martienssen RA, Clifton SW, McCombie WR, Wing RA, Wilson RK (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326: 1112-1115 Schneider A, Souvorov A, Sabath N, Landan G, Gonnet GH, Graur D (2009) Estimates of positive Darwinian selection are inflated by errors in sequencing, annotation, and alignment. Genome Biology and Evolution 1: 114-118 192

Schwartz DC, Li X, Hernandez LI, Ramnarain SP, Huff EJ, Wang YK (1993) Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping. Science 262: 110-114 Serres-Giardi L, Belkhir K, David J, Glemin S (2012) Patterns and evolution of nucleotide landscapes in seed plants. Plant Cell 24: 1379-1397 Sharkey TD (2005) Effects of moderate heat stress on photosynthesis: importance of thylakoid reactions, rubisco deactivation, reactive oxygen species, and thermotolerance provided by isoprene. Plant Cell and Environment 28: 269-277 Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26: 1135-1145 Shepherd FA, Rodrigues Pereira J, Ciuleanu T, Tan EH, Hirsh V, Thongprasert S, Campos D, Maoleekoonpiroj S, Smylie M, Martins R, van Kooten M, Dediu M, Findlay B, Tu D, Johnston D, Bezjak A, Clark G, Santabarbara P, Seymour L, National Cancer Institute of Canada Clinical Trials G (2005) Erlotinib in previously treated non-small-cell lung cancer. New England Journal of Medicine 353: 123-132 Shikanai T, Endo T, Hashimoto T, Yamada Y, Asada K, Yokota A (1998) Directed disruption of the tobacco ndhB gene impairs cyclic electron flow around photosystem I. Proceedings of the National Academy of Sciences of the United States of America 95: 9705-9709 Short F, Carruthers T, Dennison W, Waycott M (2007) Global seagrass distribution and diversity: A bioregional model. Journal of Experimental Marine Biology and Ecology 350: 3-20 Short FT, Polidoro B, Livingstone SR, Carpenter KE, Bandeira S, Bujang JS, Calumpong HP, Carruthers TJB, Coles RG, Dennison WC, Erftemeijer PLA, Fortes MD, Freeman AS, Jagtap TG, Kamal AM, Kendrick GA, Kenworthy WJ, La Nafie YA, Nasution IM, Orth RJ, Prathep A, Sanciangco JC, van Tussenbroek B, Vergara SG, Waycott M, Zieman JC (2011) Extinction risk assessment of the world's seagrass species. Biological Conservation 144: 1961-1971 Sigrist CJA, Cerutti L, de Castro E, Langendijk-Genevaux PS, Bulliard V, Bairoch A, Hulo N (2010) PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Research 38: D161-D166 Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31: 3210-3212 Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I (2009) ABySS: A parallel assembler for short read sequence data. Genome Research 19: 1117-1123 Sinclair EA, Gecan I, Krauss SL, Kendrick GA (2014) Against the odds: complete outcrossing in a monoecious clonal seagrass Posidonia australis (Posidoniaceae). Annals of Botany 113: 1185-1196 Slater GS, Birney E (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6: 31 Smit AFA, Hubley R (2008) RepeatModeler Open-1.0. In, p http://www.repeatmasker.org Song JH, Zhu CX, Zhang X, Wen X, Liu LL, Peng JY, Guo HW, Yi CQ (2015) Biochemical and Structural Insights into the Mechanism of DNA Recognition by Arabidopsis ETHYLENE INSENSITIVE3. PLOS One 10 Song SS, Huang H, Gao H, Wang JJ, Wu DW, Liu XL, Yang SH, Zhai QZ, Li CY, Qi TC, Xie DX (2014) Interaction between MYC2 and ETHYLENE INSENSITIVE3 modulates antagonism between jasmonate and ethylene signaling in Arabidopsis. Plant Cell 26: 263-279 Sonnhammer ELL, Eddy SR, Durbin R (1997) Pfam: A comprehensive database of protein domain families based on seed alignments. Proteins-Structure Function and Bioinformatics 28: 405-420 Soyer Y, Orsi RH, Rodriguez-Rivera LD, Sun Q, Wiedmann M (2009) Genome wide evolutionary analyses reveal serotype specific patterns of positive selection in selected Salmonella serotypes. BMC Evolutionary Biology 9: 264 Stanke M, Waack S (2003) Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19 Suppl 2: ii215-225 Stefanovic S, Olmstead RG (2005) Down the slippery slope: Plastid genome evolution in Convolvulaceae. Journal of Molecular Evolution 61: 292-305 Summers JE, Jackson MB (1998) Light- and dark-grown Potamogeton pectinatus, an aquatic macrophyte, make no ethylene (ethene) but retain responsiveness to the gas. Australian Journal of Plant Physiology 25: 599-608 193

Summers JE, Voesenek L, Blom C, Lewis MJ, Jackson MB (1996) Potamogeton pectinatus is constitutively incapable of synthesizing ethylene and lacks 1-aminocyclopropane-1-carboxylic acid oxidase. Plant Physiology 111: 901-908 Szollosi GJ, Tannier E, Daubin V, Boussau B (2015) The inference of gene trees with species trees. Systematic Biology 64: e42-62 Taji T, Seki M, Satou M, Sakurai T, Kobayashi M, Ishiyama K, Narusaka Y, Narusaka M, Zhu JK, Shinozaki K (2004) Comparative genomics in salt tolerance between Arabidopsis and Arabidopsis-related halophyte salt cress using Arabidopsis microarray. Plant Physiology 135: 1697-1709 Tanaka N, Kuo J, Omori Y, Nakaoka M, Aioi K (2003) Phylogenetic relationships in the genera Zostera and Heterozostera (Zosteraceae) based on matK sequence data. Journal of Plant Research 116: 273-279 Tang H, Bowers JE, Wang X, Ming R, Alam M, Paterson AH (2008) Synteny and collinearity in plant genomes. Science 320: 486-488 Tao JJ, Chen HW, Ma B, Zhang WK, Chen SY, Zhang JS (2015) The role of ethylene in plants under salinity stress. Frontiers in Plant Science 6: 1059 Taylor JS, Raes J (2004) Duplication and divergence: the evolution of new genes and old ideas. Annual Review of Genetics 38: 615-643 Tekaia F (2016) Inferring Orthologs: Open Questions and Perspectives. Genomics Insights 9: 17-28 Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M (2008) Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Research 18: 1979-1990 Titus Brown C, Howe A, Zhang Q, Pyrkosz AB, Brom TH (2012) A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data, Vol 1203 Tomlinson PB (1984) The Monocotyledons: a comparative study Bioscience 34: 525-525 Touchette BW (2007) Seagrass-salinity interactions: Physiological mechanisms used by submersed marine angiosperms for a life at sea. Journal of Experimental Marine Biology and Ecology 350: 194-215 Touchette BW, Burkholder JM (2000) Review of nitrogen and phosphorus metabolism in seagrasses. Journal of Experimental Marine Biology and Ecology 250: 133-167 Touchette BW, Marcus SE, Adams EC (2014) Bulk elastic moduli and solute potentials in leaves of freshwater, coastal and marine hydrophytes. Are marine plants more rigid? AoB Plants 6 Town CD, Cheung F, Maiti R, Crabtree J, Haas BJ, Wortman JR, Hine EE, Althoff R, Arbogast TS, Tallon LJ, Vigouroux M, Trick M, Bancroft I (2006) Comparative genomics of Brassica oleracea and Arabidopsis thaliana reveal gene loss, fragmentation, and dispersal after polyploidy. Plant Cell 18: 1348-1359 Trachana K, Larsson TA, Powell S, Chen WH, Doerks T, Muller J, Bork P (2011) Orthology prediction methods: A quality assessment using curated protein families. Bioessays 33: 769-780 Treangen TJ, Salzberg SL (2011) Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics 13: 36-46 Turpin DH (1991) Effects of inorganic N availability on algal photosynthesis and carbon metabolism. Journal of Phycology 27: 14-20 Ueda M, Kuniyoshi T, Yamamoto H, Sugimoto K, Ishizaki K, Kohchi T, Nishimura Y, Shikanai T (2012) Composition and physiological function of the chloroplast NADH dehydrogenase-like complex in Marchantia polymorpha. Plant Journal 72: 683-693 Valente C, Polishchuk R, De Matteis MA (2010) Rab6 and myosin II at the cutting edge of membrane fission. Nature Cell Biology 12: 635-638 Van Bel M, Proost S, Wischnitzki E, Movahedi S, Scheerlinck C, Van de Peer Y, Vandepoele K (2012) Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiology 158: 590-600 van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C (2014) Ten years of next-generation sequencing technology. Trends in Genetics 30: 418-426 van Meer G, Voelker DR, Feigenson GW (2008) Membrane lipids: where they are and how they behave. Nature Reviews Molecular Cell Biology 9: 112-124 Vanneste K, Van de Peer Y, Maere S (2013) Inference of genome duplications from age distributions revisited. Molecular Biology and Evolution 30: 177-190

194

Varshney RK, Terauchi R, McCouch SR (2014) Harvesting the promising fruits of genomics: applying genome sequencing technologies to crop breeding. PLOS Biology 12: e1001883 Veeckman E, Ruttink T, Vandepoele K (2016) Are We There Yet? Reliably Estimating the Completeness of Plant Genome Sequences. Plant Cell 28: 1759-1768 Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L, Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R, Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z, Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun J, Wang Z, Wang A, Wang X, Wang J, Wei M, Wides R, Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao Q, Zheng L, Zhong F, Zhong W, Zhu S, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter C, Cravchik A, Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I, Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport L, Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart B, Haynes J, Haynes C, Heiner C, Hladun S, Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline L, Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T, McMullen I, Moy M, Moy L, Murphy B, Nelson K, Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers YH, Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E, Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M, Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigo R, Campbell MJ, Sjolander KV, Karlak B, Kejariwal A, Mi H, Lazareva B, Hatton T, Narechania A, Diemer K, Muruganujan A, Guo N, Sato S, Bafna V, Istrail S, Lippert R, Schwartz R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J, Blick L, Caminha M, Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A, Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K, Glodek A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J, Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X, Lopez J, Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T, Nguyen N, Nodell M, Pan S, Peck J, Peterson M, Rowe W, Sanders R, Scott J, Simpson M, Smith T, Sprague A, Stockwell T, Turner R, Venter E, Wang M, Wen M, Wu D, Wu M, Xia A, Zandieh A, Zhu X (2001) The sequence of the human genome. Science 291: 1304-1351 Vitte C, Fustier MA, Alix K, Tenaillon MI (2014) The bright side of transposons in crop evolution. Briefings in Functional Genomics 13: 276-295 Voesenek LA, Bailey-Serres J (2015) Flood adaptive traits and processes: an overview. New Phytologist 206: 57-73 Voesenek LACJ, Pierik R, Sasidharan R (2015) Plant life without ethylene. Trends in Plant Science 20: 783- 786 Voesenek LACJ, Vanderveen R (1994) The role of phytohormones in plant stress - too much or too little water. Acta Botanica Neerlandica 43: 91-127 Vu GTH, Schmutzer T, Bull F, Cao HX, Fuchs J, Tran TD, Jovtchev G, Pistrick K, Stein N, Pecinka A, Neumann P, Novak P, Macas J, Dear PH, Blattner FR, Scholz U, Schubert I (2015) Comparative genome analysis reveals divergent genome size evolution in a carnivorous plant genus. Plant Genome 8 Wang E, Zaman N, Mcgee S, Milanese JS, Masoudi-Nejad A, O'Connor-McCourt M (2015) Predictive genomics: A cancer hallmark network framework for predicting tumor clinical phenotypes using genome sequencing data. Seminars in Cancer Biology 30: 4-12 Wang HC, Hickey DA (2007) Rapid divergence of codon usage patterns within the rice genome. BMC Evol Biol 7 Suppl 1: S6 Wang KLC, Li H, Ecker JR (2002) Ethylene biosynthesis and signaling networks. Plant Cell 14: S131-S151 195

Wang PW, Lai WF, Li MLJ, Xu F, Yalamanchili HK, Lovell-Badge R, Wang JW (2013) Inference of gene- phenotype associations via protein-protein interaction and orthology. PLOS One 8 Wang W, Haberer G, Gundlach H, Glasser C, Nussbaumer T, Luo MC, Lomsadze A, Borodovsky M, Kerstetter RA, Shanklin J, Byrant DW, Mockler TC, Appenroth KJ, Grimwood J, Jenkins J, Chow J, Choi C, Adam C, Cao XH, Fuchs J, Schubert I, Rokhsar D, Schmutz J, Michael TP, Mayer KF, Messing J (2014) The Spirodela polyrhiza genome reveals insights into its neotenous reduction fast growth and aquatic lifestyle. Nature Communication 5: 3311 Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, Bai Y, Mun JH, Bancroft I, Cheng F, Huang S, Li X, Hua W, Wang J, Wang X, Freeling M, Pires JC, Paterson AH, Chalhoub B, Wang B, Hayward A, Sharpe AG, Park BS, Weisshaar B, Liu B, Li B, Liu B, Tong C, Song C, Duran C, Peng C, Geng C, Koh C, Lin C, Edwards D, Mu D, Shen D, Soumpourou E, Li F, Fraser F, Conant G, Lassalle G, King GJ, Bonnema G, Tang H, Wang H, Belcram H, Zhou H, Hirakawa H, Abe H, Guo H, Wang H, Jin H, Parkin IA, Batley J, Kim JS, Just J, Li J, Xu J, Deng J, Kim JA, Li J, Yu J, Meng J, Wang J, Min J, Poulain J, Wang J, Hatakeyama K, Wu K, Wang L, Fang L, Trick M, Links MG, Zhao M, Jin M, Ramchiary N, Drou N, Berkman PJ, Cai Q, Huang Q, Li R, Tabata S, Cheng S, Zhang S, Zhang S, Huang S, Sato S, Sun S, Kwon SJ, Choi SR, Lee TH, Fan W, Zhao X, Tan X, Xu X, Wang Y, Qiu Y, Yin Y, Li Y, Du Y, Liao Y, Lim Y, Narusaka Y, Wang Y, Wang Z, Li Z, Wang Z, Xiong Z, Zhang Z, Brassica rapa Genome Sequencing Project C (2011) The genome of the mesopolyploid crop species Brassica rapa. Nature Genetics 43: 1035-1039 Wang Z, Hobson N, Galindo L, Zhu S, Shi D, McDill J, Yang L, Hawkins S, Neutelings G, Datla R, Lambert G, Galbraith DW, Grassa CJ, Geraldes A, Cronk QC, Cullis C, Dash PK, Kumar PA, Cloutier S, Sharpe AG, Wong GK, Wang J, Deyholos MK (2012) The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads. Plant Journal 72: 461-473 Washburn JD, Bird KA, Conant GC, Pires JC (2016) Convergent evolution and the origin of complex phenotypes in the age of systems biology. International Journal of Plant Sciences 177: 305-318 Wasternack C, Hause B (2013) Jasmonates: biosynthesis, perception, signal transduction and action in plant stress response, growth and development. An update to the 2007 review in Annals of Botany. Annals of Botany 111: 1021-1058 Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ (2009) Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics 25: 1189-1191 Waterhouse RM, Zdobnov EM, Kriventseva EV (2011) Correlating Traits of Gene Retention, Sequence Divergence, Duplicability and Essentiality in Vertebrates, Arthropods, and Fungi. Genome Biology and Evolution 3: 75-86 Waycott M, Duarte CM, Carruthers TJ, Orth RJ, Dennison WC, Olyarnik S, Calladine A, Fourqurean JW, Heck KL, Jr., Hughes AR, Kendrick GA, Kenworthy WJ, Short FT, Williams SL (2009) Accelerating loss of seagrasses across the globe threatens coastal ecosystems. Proceedings of the National Academy of Sciences of the United States of America 106: 12377-12381 Wendel JF, Jackson SA, Meyers BC, Wing RA (2016) Evolution of plant genome architecture. Genome Biology 17: 37 Weraduwage SM, Kim SJ, Renna L, Anozie FC, Sharkey TD, Brandizzi F (2016) Pectin Methylesterification Impacts the Relationship between Photosynthesis and Plant Growth. Plant Physiology 171: 833-848 Wickett NJ, Fan Y, Lewis PO, Goffinet B (2008a) Distribution and evolution of pseudogenes, gene losses, and a gene rearrangement in the plastid genome of the nonphotosynthetic liverwort, Aneura mirabilis (Metzgeriales, Jungermanniopsida). Journal of Molecular Evolution 67: 111-122 Wickett NJ, Zhang Y, Hansen SK, Roper JM, Kuehl JV, Plock SA, Wolf PG, DePamphilis CW, Boore JL, Goffinet B (2008b) Functional gene losses occur with minimal size reduction in the plastid genome of the parasitic liverwort Aneura mirabilis. Molecular Biology and Evolution 25: 393-401 Wilkin P, Mayo SJ (2013) Early Events in Monocot Evolution. Cambridge University Press, Cambridge Wilson DN, Doudna Cate JH (2012) The structure and function of the eukaryotic ribosome. Cold Spring Harbour Perspectives in Biology 4 Winters G, Nelle P, Fricke B, Rauch G, Reusch TBH (2011) Effects of a simulated heat wave on photophysiology and gene expression of high- and low-latitude populations of Zostera marina. Marine Ecology Progress Series 435: 83-95 196

Wissler L, Codoner FM, Gu J, Reusch TB, Olsen JL, Procaccini G, Bornberg-Bauer E (2011) Back to the sea twice: identifying candidate plant genes for molecular evolution to marine life. BMC Evolutionary Biology 11: 8 Wolfe KH, Morden CW, Palmer JD (1992) Function and evolution of a minimal plastid genome from a nonphotosynthetic parasitic plant. Proceedings of the National Academy of Sciences of the United States of America 89: 10648-10652 Xu L, Law SR, Murcha MW, Whelan J, Carrie C (2013) The dual targeting ability of type II NAD(P)H dehydrogenases arose early in land plant evolution. BMC Plant Biology 13 Yamori W, Sakata N, Suzuki Y, Shikanai T, Makino A (2011) Cyclic electron flow around photosystem I via chloroplast NAD(P)H dehydrogenase (NDH) complex performs a significant physiological role during photosynthesis and plant growth at low temperature in rice. Plant Journal 68: 966-976 Yandell M, Ence D (2012) A beginner's guide to eukaryotic genome annotation. Nature Reviews Genetics 13: 329-342 Yang C, Ma B, He SJ, Xiong Q, Duan KX, Yin CC, Chen H, Lu X, Chen SY, Zhang JS (2015) MAOHUZI6/ETHYLENE INSENSITIVE3-LIKE1 and ETHYLENE INSENSITIVE3-LIKE2 regulate ethylene response of roots and coleoptiles and negatively affect salt tolerance in rice. Plant Physiology 169: 148-165 Yang J, Liu D, Wang X, Ji C, Cheng F, Liu B, Hu Z, Chen S, Pental D, Ju Y, Yao P, Li X, Xie K, Zhang J, Wang J, Liu F, Ma W, Shopan J, Zheng H, Mackenzie SA, Zhang M (2016) The genome sequence of allopolyploid Brassica juncea and analysis of differential homoeolog gene expression influencing selection. Nat Genet 48: 1225-1232 Yang JH, Liu DY, Wang XW, Ji CM, Cheng F, Liu BN, Hu ZY, Chen S, Pental D, Ju YH, Yao P, Li XM, Xie K, Zhang JH, Wang JL, Liu F, Ma WW, Shopan J, Zheng HK, Mackenzie SA, Zhang MF (2016) The genome sequence of allopolyploid Brassica juncea and analysis of differential homoeolog gene expression influencing selection. Nature Genetics 48: 1225-1232 Yang YW, Lai KN, Tai PY, Li WH (1999) Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of divergence between Brassica and other angiosperm lineages. Journal of Molecular Evolution 48: 597-604 Yang Z (2005) The power of phylogenetic comparison in revealing protein function. Proceedings of the National Academy of Sciences of the United States of America 102: 3179-3180 Ye CJ, Zhao KF (2003) Osmotically active compounds and their localization in the marine halophyte eelgrass. Biologia Plantarum 46: 137-140 Zerbino DR (2010) Using the Velvet de novo assembler for short-read sequencing technologies. Current Protocols in Bioinformatics Chapter 11: Unit 11 15 Zerbino DR, Birney E (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18: 821-829 Zhang TZ, Hu Y, Jiang WK, Fang L, Guan XY, Chen JD, Zhang JB, Saski CA, Scheffler BE, Stelly DM, Hulse- Kemp AM, Wan Q, Liu BL, Liu CX, Wang S, Pan MQ, Wang YK, Wang DW, Ye WX, Chang LJ, Zhang WP, Song QX, Kirkbride RC, Chen XY, Dennis E, Llewellyn DJ, Peterson DG, Thaxton P, Jones DC, Wang Q, Xu XY, Zhang H, Wu HT, Zhou L, Mei GF, Chen SQ, Tian Y, Xiang D, Li XH, Ding J, Zuo QY, Tao LN, Liu YC, Li J, Lin Y, Hui YY, Cao ZS, Cai CP, Zhu XF, Jiang Z, Zhou BL, Guo WZ, Li RQ, Chen ZJ (2015) Sequencing of allotetraploid cotton (Gossypium hirsutum L. acc. TM-1) provides a resource for fiber improvement. Nature Biotechnology 33: 531-U252 Zhao F, Li QY, Weng ML, Wang XL, Guo BT, Wang L, Wang W, Duan DL, Wang B (2013) Cloning of TPS gene from eelgrass species Zostera marina and its functional identification by genetic transformation in rice. Gene 531: 205-211 Zhu C, Ganguly A, Baskin TI, McClosky DD, Anderson CT, Foster C, Meunier KA, Okamoto R, Berg H, Dixit R (2015) The fragile Fiber1 kinesin contributes to cortical microtubule-mediated trafficking of cell wall components. Plant Physiology 167: 780-792 Zickmann F, Renard BY (2015) IPred - integrating ab initio and evidence based gene predictions to improve prediction accuracy. BMC Genomics 16: 134 Zimin AV, Marcais G, Puiu D, Roberts M, Salzberg SL, Yorke JA (2013) The MaSuRCA genome assembler. Bioinformatics 29: 2669-2677 197

Zimmer A, Lang D, Richardt S, Frank W, Reski R, Rensing SA (2007) Dating the early evolution of plants: detection and molecular clock analyses of orthologs. Molecular Genetics and Genomics 278: 393- 402 Zubieta C, Ross JR, Koscheski P, Yang Y, Pichersky E, Noel JP (2003) Structural basis for substrate recognition in the salicylic acid carboxyl methyltransferase family. Plant Cell 15: 1704-1716

198

Appendix 1

Assembled contigs and results of gene annotation were downloadable from this URL: http://appliedbioinformatics.com.au/index.php/Seagrass_Zmu_Genome

199

Appendix 2

List of KEGG pathways mapped by AMOS families

KEGG KEGG pathway name Number of pathway ID AMOS genes mapped ath01100 Metabolic pathways 58 ath01110 Biosynthesis of secondary metabolites 35 ath04075 Plant hormone signal transduction 15 ath00040 Pentose and glucuronate interconversions 11 ath00500 Starch and sucrose metabolism 10 ath00940 Phenylpropanoid biosynthesis 8 ath03010 Ribosome 8 ath04626 Plant-pathogen interaction 8 ath01200 Carbon metabolism 8 ath04141 Protein processing in endoplasmic reticulum 7 ath03040 Spliceosome 7 ath00195 Photosynthesis 6 ath00270 Cysteine and methionine metabolism 6 ath01230 Biosynthesis of amino acids 5 ath04144 Endocytosis 5 ath00620 Pyruvate metabolism 4 ath00630 Glyoxylate and dicarboxylate metabolism 4 ath00190 Oxidative phosphorylation 4 ath00130 Ubiquinone and other terpenoid-quinone 3 biosynthesis ath03013 RNA transport 3 ath00330 Arginine and proline metabolism 3 ath00030 Pentose phosphate pathway 3 ath00561 Glycerolipid metabolism 3 ath00010 Glycolysis / Gluconeogenesis 3 ath00260 Glycine, serine and threonine metabolism 2 ath04122 Sulfur relay system 2 ath00052 Galactose metabolism 2 ath00790 Folate biosynthesis 2 ath04712 Circadian rhythm 2 ath00520 Amino sugar and nucleotide sugar metabolism 2 ath00410 beta-Alanine metabolism 2 ath00480 Glutathione metabolism 2 ath03410 Base excision repair 2 ath04145 Phagosome 2

200

ath03440 Homologous recombination 2 ath00592 alpha-Linolenic acid metabolism 2 ath00380 Tryptophan metabolism 2 ath00860 Porphyrin and chlorophyll metabolism 2 ath00564 Glycerophospholipid metabolism 2 ath04120 Ubiquitin mediated proteolysis 2 ath00670 One carbon pool by folate 2 ath00603 Glycosphingolipid biosynthesis 2 ath03018 RNA degradation 2 ath00903 Limonene and pinene degradation 2 ath03015 mRNA surveillance pathway 2 ath00053 Ascorbate and aldarate metabolism 1 ath00350 Tyrosine metabolism 1 ath00904 Diterpenoid biosynthesis 1 ath00280 Valine, leucine and isoleucine degradation 1 ath00360 Phenylalanine metabolism 1 ath00511 Other glycan degradation 1 ath00920 Sulfur metabolism 1 ath00240 Pyrimidine metabolism 1 ath04140 Regulation of autophagy 1 ath00531 Glycosaminoglycan degradation 1 ath00230 Purine metabolism 1 ath04146 Peroxisome 1 ath00062 Fatty acid elongation 1 ath00590 Arachidonic acid metabolism 1 ath00196 Photosynthesis 1 ath00908 Zeatin biosynthesis 1 ath00071 Fatty acid degradation 1 ath00600 Sphingolipid metabolism 1 ath00604 Glycosphingolipid biosynthesis 1 ath00945 Stilbenoid, diarylheptanoid and gingerol 1 biosynthesis ath00400 Phenylalanine, tyrosine and tryptophan 1 biosynthesis ath00710 Carbon fixation in photosynthetic organisms 1 ath00020 Citrate cycle 1 ath00460 Cyanoamino acid metabolism 1 ath04070 Phosphatidylinositol signalling system 1 ath03420 Nucleotide excision repair 1 ath03008 Ribosome biogenesis in eukaryotes 1 ath00310 Lysine degradation 1 ath03050 Proteasome 1

201

ath00073 Cutin, suberine and wax biosynthesis 1 ath00510 N-Glycan biosynthesis 1 ath00900 Terpenoid backbone biosynthesis 1 ath04130 SNARE interactions in vesicular transport 1 ath00906 Carotenoid biosynthesis 1 ath00910 Nitrogen metabolism 1 ath00051 Fructose and mannose metabolism 1 ath03430 Mismatch repair 1 ath03030 DNA replication 1 ath00340 Histidine metabolism 1 ath00730 Thiamine metabolism 1

202

Appendix 3

A. thaliana representative genes in AMOS corresponding to GO terms or KEGG pathways of starch metabolism, stress response, terpenoid biosynthesis and ribosomal proteins

GO or GO term or A. thaliana Gene name KEGG KEGG representative (Gene description) pathway ID pathway genes description GO:0005983 starch AT2G40220 ABI4; Ethylene-responsive transcription factor catabolic AT2G40840 DPE2; 4-alpha-glucanotransferase DPE2 process AT3G01510 LSF1; Phosphoglucan phosphatase AT3G10940 LSF2; Phosphoglucan phosphatase AT5G17520 MEX1; Maltose excess protein GO:0019252 starch AT1G12800 Nucleic acid-binding, OB-fold-like protein biosynthetic AT1G76730 5-formyltetrahydrofolate cyclo-ligase process paralogous protein AT3G01660 S-adenosyl-L-methionine-dependent methyltransferases superfamily protein AT3G16250 Photosynthetic NDH subunit of subcomplex B 3 AT3G19000 2-oxoglutarate (2OG) and Fe (II)-dependent oxygenase superfamily protein AT4G09650 ATPD; F-type H+-transporting ATPase subunit delta AT4G17360 putative formyltetrahydrofolate deformylase AT5G08370 AGAL2; alpha-galactosidase 2 AT3G17930 Unknown function GO:0042542 response to AT1G17870 EGY3; Probable zinc metallopeptidase hydrogen AT1G54050 CIII heat shock protein 17.4 peroxide AT1G71000 Chaperone DnaJ-domain superfamily domain AT2G32120 HSP70-8; Heat shock 70 kDa protein 8 AT3G02800 Tyrosine phosphatase family protein AT3G08970 DnaJ protein AT4G12570 UPL5; E3 ubiquitin-protein ligase UPL5 AT4G21320 HSA32; Aldolase-type TIM barrel family protein AT4G27670 HSP21; heat shock protein 21 AT2G43630 Unknown function AT5G07330 Unknown function AT5G35320 Unknown function GO:0050665 hydrogen AT1G24030 Protein kinase superfamily protein peroxide AT1G66340 ETR1; ethylene receptor 1 biosynthetic AT3G14130 Aldolase-type TIM barrel family protein

203

process AT3G23900 RNA recognition motif-containing protein AT5G12290 DGS1; DGD1 suppressor 1 GO:0009644 response to AT1G17870 EGY3; Probable zinc metallopeptidase high light AT1G54050 CIII heat shock protein 17.4 intensity AT1G71000 Chaperone DnaJ-domain superfamily domain AT2G32120 HSP70-8; Heat shock 70 kDa protein 8 AT3G02800 Tyrosine phosphatase family protein AT3G08970 DnaJ protein AT4G21320 HSA32; Aldolase-type TIM barrel family protein AT4G27670 HSP21; heat shock protein 21 AT2G43630 Unknown function AT5G07330 Unknown function AT5G35320 Unknown function ath03010 Ribosome AT1G07320 RPL4; 50S ribosomal protein L4 AT1G23410 40S ribosomal protein S27a-1 AT1G74970 RPS9; 30S ribosomal protein S9 AT2G21580 40S ribosomal protein S25-2 AT3G22300 RPS10; 40S ribosomal protein S10 AT3G52590 UBQ1; ubiquitin extension protein 1 AT5G59850 40S ribosomal protein S15a-1 ATCG00750 rps11; 30S ribosomal protein S11 ath00900 Terpenoid AT5G58770 dehydrodolichyl diphosphate synthase 2 backbone biosynthesis ath00904 Diterpenoid AT1G02400 GA2OX6; gibberellin 2-oxidase 6 biosynthesis

204

Appendix 4

Additional results described in Chapter 3, including

1) List of genes significantly enriched in AMOS with corresponding GO annotation 2) List of genes in AMOS corresponding to KEGG pathways were published as supplementary materials in Plant Physiology and downloadable from this URL: http://www.plantphysiol.org/content/suppl/2016/07/03/pp.16.00868.DC1/PP2016- 00868D_Supplemental_Material.pdf

205

Appendix 5

Presence and absence of genes involved in stomata development, ethylene synthesis and signalling, terpenoid biosynthesis, sporopollenin biosynthesis and methyl-jasmonate biosynthesis in OGCsM, H. ovalis, Z. marina and Z. muelleri. Categories are: gene present (+), gene absent (-) and information not available (NA).

TAIR ID Gene Function Conserved in Presence in H. Presence in Presence in symbol OGCsM ovalis Z. muelleri Z. marina Stomata development AT1G04110 SBT1.2 Spacing and patterning + NA - - AT4G12970 EPFL9 Spacing and patterning + - - - AT2G20875 EPF1 Spacing and patterning + - - - AT1G80080 TMM Spacing and patterning + - - - AT1G34245 EPF2 Spacing and patterning + - - - AT2G02820 MYB88 Differentiation - NA NA - AT3G06120 MUTE Differentiation + - - - AT5G53210 SPCH Differentiation + - - - AT3G24140 FAMA Differentiation + NA NA - AT1G12860 SCRM2 Differentiation - NA NA - AT1G14350 FLP Differentiation + - - - Ethylene synthesis and signalling AT2G19590 ACO1 ACC oxidase + - - - AT1G62380 ACO2 ACC oxidase + - - - AT1G05010 ACO4 ACC oxidase + - - - AT1G77330 ACO5 ACC oxidase + - - - AT3G61510 ACS1 ACC synthase + - - - AT1G01480 ACS2 ACC synthase + - - - AT2G22810 ACS4 ACC synthase + - - - 206

AT5G65800 ACS5 ACC synthase + - - - AT4G11280 ACS6 ACC synthase + - - - AT4G26200 ACS7 ACC synthase + - - - AT4G37770 ACS8 ACC synthase + - - - AT3G49700 ACS9 ACC synthase + - - - AT4G08040 ACS11 ACC synthase + - - - AT2G40940 ERS1 Ethylene receptor + - - - AT1G66340 ETR1 Ethylene receptor + - - - AT3G23150 ETR2 Ethylene receptor + - - - AT3G04580 EIN4 Ethylene receptor + - - - AT5G03730 CTR1 Raf-like kinase + NA - - AT5G03280 EIN2 Signal transducer + NA - - AT2G25490 EBF1 EIN2 degradation + - - - AT5G25350 EBF2 EIN2 degradation + - - - Terpenoid biosynthesis AT3G25820 TPS-CIN Terpene synthase + - - - AT3G25830 TPS23 Terpene synthase + - - - AT4G16740 TPS03 Terpene synthase - - - - AT2G24210 TPS10 Terpene synthase + - - - AT3G25810 TPS24 Terpene synthase - - - -

207

Appendix 6

Presence and absence of 40 nuclear and chloroplast-encoded genes involved in formation of the NDH complex. Categories are: gene present (+), gene absent (-) and information not available (NA).

TAIR ID Gene symbol Function Presence in Presence in Presence in Presence OGCsM H. ovalis Z. muelleri in Z. marina Nuclear encoded AT1G70760 NDHL Subunit A + NA + + AT4G37925 NDHM Subunit A + - + + AT5G58260 NDHN Subunit A + NA + + AT1G74880 NDHO Subunit A + - + + AT4G23890 NDHS Subunit ED + - NA + AT4G09350 NDHT Subunit ED + - + + AT5G21430 NDHU Subunit ED + - + + AT1G15980 PNSB1 Subunit B + - NA + AT1G64770 PNSB2 Subunit B + - + + AT3G16250 PNSB3 Subunit B + - + + AT1G18730 PNSB4 Subunit B + - + + AT2G39470 PNSL1 Subunit B + - + + AT1G14150 PNSL2 Subunit L + - + + AT3G01440 PNSL3 Subunit L + NA + + 208

AT4G39710 PNSL4 Subunit L + - + + AT5G13120 PNSL5 Subunit L + + + + AT2G47910 CRR6 Complex formation + - + + AT5G39210 CRR7 Complex formation + - + + AT1G45474 Lhca5 Complex formation + - + + AT1G19150 Lhca6 Complex formation + NA + + AT1G26230 CRR27 Complex formation + NA + + AT1G51100 CRR41 Complex formation + NA + + AT2G05620 PGR5 Proton gradient regulation + NA + + AT4G22890 PGRL1A Proton gradient regulation + + + + AT3G46790 CRR2 Unknown + NA + + AT2G01590 CRR3 Unknown + - - + AT5G20935 CRR42 Unknown + - + + AT2G01918 PQL3 Unknown + - - + AT1G55370 NDF5 Unknown + - + + Chloroplast encoded ATCG00890 NDHB Subunit M + NA + + ATCG01250 ATCG01010 NDHF Subunit M + NA + + ATCG00440 NDHC Subunit M + - + + ATCG01050 NDHD Subunit M + - + + ATCG01070 NDHE Subunit M + - + +

209

ATCG01100 NDHA Subunit M + - + + ATCG01080 NDHG Subunit M + NA + + ATCG01110 NDHH Subunit A + + + + ATCG00420 NDHJ Subunit A + - + + ATCG00430 NDHK Subunit A + - + + ATCG01090 NDHI Subunit A + - + +

210

Appendix 7

57 orthologous groups of seagrass-specific genes shared in two Zosteraceae species (Z. muelleri and Z. marina) and Halophila categorized by predicted function. Gene functions were predicted with corresponding A. thaliana gene of highest sequence similarity.

Category of Name of best TAIR10 hit corresponding to Zostera ID of best TAIR10 Gene function/involved in related ortholog hit corresponding function to Zostera ortholog Protein Endoplasmic reticulum retention defective 2B AT3G25040.1 Retention mechanism secretion and Endoplasmic reticulum-type calcium-transporting AT1G10130.1 Calcium and manganese ion intracellular ATPase 3 transport transport RAB GTPase homolog A1F AT5G60860.1 GTPase activity RAB GTPase homolog A2B AT1G07410.1 GTPase activity Secretory carrier 3 AT1G61250.1 Integral membrane protein NOD26-like intrinsic protein 1;2 AT4G18910.1 Aquaporin Mitochondrial substrate carrier family protein AT3G53940.1 Substrate transport Mitochondrial import inner membrane translocase AT5G63000.1 Protein transport subunit Tim17/Tim22/Tim23 family protein Transducin/WD40 repeat-like superfamily protein AT3G01340.1 Protein transport Protein of unknown function AT1G09330.1 - Cell wall Expansin A16 AT3G55500.1 Cell wall loosening Expansin A1 AT1G69530.2 Cell wall loosening Galacturonosyltransferase-like 2 AT3G50760.1 Cell wall organization Xyloglucan endotransglucosylase/hydrolase 5 AT5G13870.1 Cell wall organization Glucan synthase-like 8 AT2G36850.1 Callose synthesis S-adenosyl-L-methionine-dependent AT4G34050.1 Lignin biosynthesis methyltransferases superfamily protein 211

Peroxidase superfamily protein AT5G05340.1 Lignin biosynthesis Cotton Golgi-related 2 (pectin methyltransferase) AT3G49720.1 Cell wall modification Vascular related NAC-domain protein 1 AT2G18060.1 Xylem secondary cell wall formation Ion flux and ATP synthase epsilon chain, mitochondrial AT1G51650.1 Proton-transporting ATPase sequestering activity Vacuolar proton ATPase A1 AT2G28520.1 Proton-transporting ATPase activity Calmodulin 4 AT1G66410.1 Calcium ion binding Lipid Trigalactosyldiacylglycerol 5 AT1G27695.1 Lipid transport catabolism GDSL-like Lipase/Acylhydrolase superfamily protein AT1G29670.1 Lipid catabolic process AT5G45670.1 Peroxin 6 AT1G03000.1 Peroxisomal matrix protein import Alkaline phytoceramidase (aPHC) AT4G22330.1 Ceramide synthase involved in sphingolipid metabolism Transcription- RNA polymerase subunit beta ATCG00190.1 Constituent of RNA polymerase B related Pre-mRNA-splicing factor SPF27 homolog AT3G18165.1 mRNA splicing of resistance genes Ribosome/tran Ribosomal protein L16 ATCG00790.1 Structural constituent of ribosome slation-related Ribosomal protein S26e family protein AT2G40510.1 Structural constituent of ribosome Ribosomal protein S8e family protein AT5G59240.1 Structural constituent of ribosome Ribosomal protein S2 ATCG00160.1 Structural constituent of ribosome eukaryotic translation initiation factor 3A AT4G11420.1 Constituent of eukaryotic initiation factor 3 (eIF3) Protein F-box protein PP2-A13 AT3G61060.1 Protein ubiquitination ubiquitination BTB/POZ domain-containing protein AT1G63850.1 Protein ubiquitination Ubiquitin-conjugating enzyme 28 AT1G64230.1 Protein ubiquitination Ubiquitin-like protein 5 AT5G42300.1 Ubiquitin-like modification Histone Histone H2A.2 AT3G20670.1 Histones/DNA 212

binding/nucleosome assembly Histone H3.3 AT4G40030.2 Histones/DNA binding/nucleosome assembly Others Photosystem II light harvesting complex gene 2.1 AT2G05100.1 Constituent of light harvesting complex II (LHC II) Alternative oxidase 1A AT3G22370.1 Alternative oxidase activity Tubulin folding cofactor D AT3G60740.1 Microtubule stability Asparagine synthetase 2 AT5G65010.2 Asparagine biosynthesis Glutamate-1-semialdehyde 2,1-aminomutase 2 AT3G48730.1 Porphyrin-containing compound metabolism Membrane-associated progesterone binding protein 3 AT3G48890.1 Porphyrin binding Thioredoxin superfamily protein AT3G62950.1 Electron carrier activity DNA polymerase epsilon catalytic subunit AT1G08260.1 DNA replication proofreading NAC domain containing protein 32 AT1G77450.1 Transcription factor DNA-binding protein phosphatase 1 AT2G25620.1 Protein phosphatase activity Protein kinase 1B AT2G28930.1 Serine/threonine kinase activity UDP-Glycosyltransferase superfamily protein AT5G04480.1 - Adenine nucleotide alpha hydrolases-like superfamily AT1G11360.4 - protein Protein of unknown function (DUF300) AT1G11200.1 - Protein of unknown function (DUF803) AT1G34470.1 -

213

Appendix 8

Additional results described in Chapter 4, including

1) List of genes included in OGCZ clusters 2) List of genes which were conserved in OGCsM and at least one species among H. ovalis, Z. muelleri and Z. marina 3) List of genes which were conserved in OGCsM but absent in at least one species among H. ovalis, Z. muelleri and Z. marina were uploaded to a public research data repository and are retrievable with the DOI: 10.5281/zenodo.807026.

214