Bioinformatic Identification and Analysis of Hydroxyproline-rich Glycoproteins in Plants

A dissertation presented to

the faculty of

the College of Arts and Sciences of Ohio University

In partial fulfillment

of the requirements for the degree

Doctor of Philosophy

Xiao Liu

August 2017

© 2017 Xiao Liu. All Rights Reserved.

This dissertation titled

Bioinformatic Identification and Analysis of Hydroxyproline-rich Glycoproteins in Plants

by

XIAO LIU

has been approved for

the Department of Environmental and Plant Biology

and the College of Arts and Sciences by

Allan M. Showalter

Professor of Environmental and Plant Biology

Robert Frank

Dean, College of Arts and Sciences

ABSTRACT

LIU, XIAO, Ph.D., August 2017, Environmental and Plant Biology

Bioinformatic Identification and Analysis of Hydroxyproline-rich Glycoproteins in Plants

Director of Dissertation: Allan M. Showalter

Hydroxyproline-rich glycoproteins (HRGPs) are a superfamily of plant cell wall glycoproteins that function in diverse aspects of plant growth and development. This superfamily consists of three members: arabinogalactan-proteins (AGPs), extensins (EXTs), and proline-rich proteins (PRPs). A bioinformatic software program, BIO OHIO 2.0, was developed to expedite the genome-wide identification and classification of HRGPs based on characteristic motifs and biased amino acid compositions. Principles of HRGP identification and a stepwise tutorial for using this program with proteomic data were provided to facilitate and guide basic and applied research on HRGPs. In a first study, bioinformatic identification of EXTs was conducted in plants. A total of 758 EXTs were identified in 16 species, including 87 classical EXTs, 97 short EXTs, 61 leucine-rich repeat extensins (LRXs), 75 proline-rich extensin-like receptor kinases (PERKs), 54 formin-homolog EXTs (FHXs), 38 long chimeric EXTs, and 346 other chimeric EXTs. Classical EXTs were likely derived after the terrestrialization of plants; LRXs, PERKs, and FHXs were derived earlier than classical EXTs; monocots have fewer classical EXTs than eudicots; green algae have no classical EXTs but have a number of long chimeric EXTs that are absent in embryophytes. Phylogenetic analysis shed light on the evolution of three EXT classes. In a second study, bioinformatic identification of HRGPs was conducted in poplar (Populus trichocarpa), which identified and classified 271 HRGPs including 162 AGPs, 60 EXTs, and 49 PRPs. Comparisons were made with Arabidopsis thaliana to facilitate the understanding of their respective structural and functional roles, including their possible applications in the areas of plant biofuels and natural products for medicinal or industrial uses. In a third study, the bioinformatic identification and analysis of EXTs was conducted in Arabidopsis lyrata and Arabidopsis halleri, two close relatives of Arabidopsis thaliana. A total of 61 EXTs and 65 EXTs were identified in A. lyrata and A. halleri, respectively, compared with 69 EXTs in A. thaliana. Phylogenetic trees and cluster analysis revealed a number of potential orthologous and paralogous proteins among the three species. The identified EXTs and their homologous proteins provide insight into the evolution and functions of EXTs in related species within the same genus.

ACKNOWLEDGMENTS

First and foremost, I want to express my earnest gratitude to my advisor, Dr. Allan Showalter. Dr. Showalter always provided me with support and encouragement, allowed me to learn from my mistakes and develop problem-solving skills, helped me to learn about writing manuscripts and grants, guided me to identify and work with my strengths and weaknesses, and provided opportunities for developing independence in my career. Dr. Showalter has served as a role model of a successful academic career, personal integrity, persistence, and accountability.

I am grateful to Dr. Ahmed Faik, who shared his expertise in cell wall biology, his critical thinking in our journal club, and his ideas on experimental design and implementation. I thank Dr. Marcia Kieliszewski, an expert in chemistry, for facilitating my scientific growth and development, and for providing scientific instruction and suggestions for my research project. I thank Dr. Lonnie Welch, who led me to the world of bioinformatics, provided guidance and wisdom for my study in this interdisciplinary field, and offered me opportunities for joint research projects and presentations. I especially thank Dr. Frank Drew for his support and valuable comments and suggestions on my dissertation.

I thank my past and current lab mates, Dr. Yizhu Zhang, Dr. Yan Liang, Dr. Debarti Basu, Dr. Melanie Schori, Mr. Wuda Wang, Ms. Lu Tian, Mr. Oyeyemi Ajayi, and Ms. Dasmeet Kaur, for their support and encouragement. I thank the cell wall research group members, Dr. Michael Held, Dr. Yuning Chen, Dr. Dening Ye, Dr. Wen Dong, Dr. Nan Jiang, Mr. Daniel Nething, Mr. Yadi Zhou, Mr. John Elmore, and Ms. Tasleem Javaid, for sharing their knowledge and skills in cell wall biology. I thank my collaborators and coworkers, Mr. Richard A. Wolfe, Mr. Yichao Li, Ms. Savannah McKenna, and Mr. David Masters, for their contributions to our research projects and manuscripts. I thank Ms. Connie Pollard and Ms. Eileen Delehanty-Schulz for their excellent administrative service.

Without the support and encouragement from all the above people, my research would not have gone as smoothly and my study at Ohio University would not have been so enjoyable.

I thank the Department of Environmental and Plant Biology and the Molecular and Cellular Biology Program for offering me this valuable opportunity to pursue a Ph.D., and for providing me with teaching positions and funding to attend local, regional, and national meetings.

Last but not least, I am indebted to my family and friends, who walked me through difficult times and provided me with endless love and support.

TABLE OF CONTENTS

Abstract
Acknowledgments
List of Tables
List of Figures
List of Abbreviations
Chapter 1. Introduction
1.1. The significance of the plant cell wall
1.2 The structure of plant cell wall
1.3 Cell wall polysaccharides
1.3.1 Cellulose
1.3.2 Hemicelluloses
1.3.3 Pectins
1.4 Cell wall structural proteins
1.4.1 AGPs
1.4.2 EXTs
1.4.3 PRPs
1.5 History of bioinformatic identification of HRGPs
Chapter 2. Bioinformatic Identification of Plant Hydroxyproline-rich Glycoproteins
2.1. Introduction
2.2. Materials and methods
2.2.1 BIO OHIO 2.0 program overview
2.2.2 Install the BIO OHIO 2.0 program
2.2.3 The BIO OHIO 2.0 program graphical user interface (GUI)
2.2.4 Select a file to be analyzed
2.2.5 Methods for identifying AGPs
2.2.6 Methods for identifying EXTs
2.2.7 Methods for identifying PRPs
2.2.8 The SignalP, GPI, and BLAST modules
2.2.9 The general workflow of identifying HRGPs
2.3 Results
2.3.1 Identification of P. vulgaris AGPs
2.3.2 Identification of P. vulgaris EXTs
2.3.3 Identification of P. vulgaris PRPs
2.3.4 HRGPs identified in P. vulgaris
2.4. Notes and tips for using BIO OHIO 2.0
Chapter 3. Bioinformatic Identification and Analysis of Extensins in the Plant Kingdom
3.1 Introduction
3.2 Materials and methods
3.2.1 Identification of EXTs
3.2.2 BLAST analysis
3.2.3 Signal peptides and GPI anchors
3.2.4 Sequence alignment and phylogenetic analysis
3.3 Results
3.3.1 Classical and short EXTs
3.3.2 LRXs, PERKs, and FHs
3.3.3 Long chimeric EXTs and other chimeric EXTs
3.3.4 Comparison with previously identified EXTs
3.4 Discussion
3.4.1 Bioinformatic identification of plant EXTs using BIO OHIO 2.0
3.4.2 The origin and evolution of EXTs
3.5 Conclusions
Chapter 4. Bioinformatic Identification and Analysis of Hydroxyproline-rich Glycoproteins in Populus trichocarpa
4.1. Background
4.2. Materials and methods
4.2.1 Identification of AGPs, EXTs, and PRPs using BIO OHIO 2.0
4.2.2 BLAST analysis against Arabidopsis and poplar proteomes
4.2.3 Pfam database and poplar HRGP gene expression database
4.3 Results
4.3.1 Arabinogalactan-proteins (AGPs)
4.3.2 Extensins (EXTs)
4.3.3 Proline-rich proteins (PRPs)
4.4 Discussion
4.4.1 A bioinformatics approach for identifying HRGPs
4.4.2 HRGPs exist as a spectrum of proteins
4.4.3 Comparisons with previously identified poplar HRGPs
4.4.4 Comparisons with Arabidopsis
4.4.5 Poplar HRGPs genome 2.0 release and expression analysis
4.4.6 Pfam analysis of poplar HRGPs
4.5 Conclusions
Chapter 5. Bioinformatic Identification and Analysis of Cell Wall Extensins in Three Arabidopsis Species
5.1 Introduction
5.2 Materials and methods
5.2.1 BIO OHIO 2.0 program and the input data files
5.2.2 Identification of EXTs
5.2.3 Phylogenetic and cluster analysis of EXTs
5.3 Results
5.3.1 EXTs identified in A. lyrata
5.3.2 EXTs identified in A. halleri
5.3.3 Update on the EXTs in Arabidopsis thaliana
5.3.4 Phylogenetic analysis of EXTs in A. thaliana, A. lyrata, and A. halleri
5.3.5 Cluster analysis and identification of potential orthologs and paralogs
5.4 Discussion
Chapter 6. Verification of the Bioinformatic Methods
6.1 Verification with previously reported AGPs
6.2 Verification with previously reported EXTs
6.3 Verification with previously reported PRPs
Chapter 7. Conclusions and Future Work
7.1 Conclusions
7.2 Discussion
7.3 Future work
7.3.1 Refinement of the current bioinformatic methods
7.3.2 Envisioning BIO OHIO 3.0
7.3.3 Automation of the identification and classification process
7.3.4 Predicting HRGPs
7.3.5 Deep understanding of plant HRGPs
References

LIST OF TABLES

Table 2.1 Identification and classification of P. vulgaris AGPs.
Table 2.2 Identification and classification of P. vulgaris EXTs.
Table 2.3 Identification and classification of P. vulgaris PRPs.
Table 2.4 Comparison of HRGPs identified in P. vulgaris and A. thaliana.
Table 4.1 AGPs, EXTs, and PRPs identified from the Populus trichocarpa protein database based on biased amino acid compositions, size, and repeat units.
Table 4.2 Identification and analysis of AGP genes in Populus trichocarpa.
Table 4.3 Identification and analysis of EXT genes in Populus trichocarpa.
Table 4.4 Identification and analysis of PRP genes in Populus trichocarpa.
Table 4.5 Comparison of HRGPs identified in Populus trichocarpa and Arabidopsis thaliana.
Table 5.1 The list of EXTs identified in A. lyrata.
Table 5.2 The list of EXTs identified in A. halleri.
Table 5.3 Comparison of EXTs in A. thaliana, A. lyrata, and A. halleri.
Table 5.4 Summary of EXTs identified in A. thaliana, A. lyrata, and A. halleri.
Table 5.5 The list of EXTs identified in A. thaliana.
Table 5.6 The list of identified clusters using the OrthoMCL program.
Table 5.7 The list of orthologous proteins identified by the OrthoMCL program.
Table 5.8 The list of paralogous proteins identified by the OrthoMCL program.
Table 6.1 List of historically reported AGPs in the literature and their related information.
Table 6.2 List of historically reported EXTs in the literature and their related information.
Table 6.3 List of historically reported PRPs in the literature and their related information.

LIST OF FIGURES

Figure 1.1 Protein sequences of an AGP molecule, an EXT molecule and a hybrid PRP molecule identified in Arabidopsis.
Figure 2.1 The BIO OHIO 2.0 GUI.
Figure 2.2 Workflow diagram for the identification, classification, and analysis of HRGPs (AGPs, EXTs, and PRPs) using a newly revised and improved BIO OHIO 2.0.
Figure 2.3 Representative AGPs in P. vulgaris.
Figure 2.4 Representative EXTs in P. vulgaris.
Figure 2.5 Representative PRPs in P. vulgaris.
Figure 3.1 Phylogenetic distribution of EXTs in selected plant genomes.
Figure 3.2 Protein sequences encoded by representative EXT gene classes in Brassica rapa and Chlamydomonas reinhardtii.
Figure 3.3 The frequency of SP3, SP4, and SP5 repeats in classical EXTs of selected genomes.
Figure 3.4 Average number of YXY motif in classical and non-classical EXTs.
Figure 3.5 Structural schemes of classical EXTs, LRX, and PERKs.
Figure 3.6 Maximum likelihood analysis of LRXs.
Figure 3.7 Maximum likelihood analysis of PERKs.
Figure 3.8 Maximum likelihood analysis of FHs.
Figure 4.1 Workflow diagram for the identification, classification, and analysis of HRGPs (AGPs, EXTs, and PRPs) in poplar using a newly revised and improved BIO OHIO 2.0.
Figure 4.2 Protein sequences encoded by the representative AGP gene classes in Populus trichocarpa.
Figure 4.3 Protein sequences encoded by the representative EXT gene classes in Populus trichocarpa.
Figure 4.4 Protein sequences encoded by the representative PRP gene classes in Populus trichocarpa.
Figure 5.1 The species tree of A. thaliana and its close relatives.
Figure 5.2 Phylogenetic tree of classical EXTs using the maximum likelihood method.
Figure 5.3 Phylogenetic tree of short EXTs using the maximum likelihood method.
Figure 5.4 Phylogenetic tree of LRXs using the maximum likelihood method.
Figure 5.5 Phylogenetic tree of PERKs using the maximum likelihood method.
Figure 5.6 Phylogenetic tree of FHXs using the maximum likelihood method.

LIST OF ABBREVIATIONS

AGP: arabinogalactan-protein

BLAST: basic local alignment search tool

EXT: extensin

FH: formin homology protein

FLA: fasciclin-like AGP

GPI: glycosylphosphatidylinositol

GT: glycosyltransferase

HRGP: hydroxyproline-rich glycoprotein

IDT: isodityrosine

LRX: leucine-rich repeat extensin

NGS: next generation sequencing

PAG: plastocyanin AGP

PERK: proline-rich extensin-like receptor kinase

PRP: proline-rich protein

PTM: post-translational modification

P4H: prolyl 4-hydroxylase

CHAPTER 1. INTRODUCTION

1.1. The significance of the plant cell wall

The plant cell wall is a dynamic and complex structure of polysaccharides, lignin, and glycoproteins that is unique to plant cells (Cassab 1998). It plays fundamental roles in plant growth, development, and defense, such as maintaining cell shape, controlling cell growth, mediating signaling events, and serving as a physical barrier to pathogen infection (Caño-Delgado et al. 2003). Besides playing essential roles in plant physiology, the cell wall is of great economic significance, as it provides raw materials for commercial products such as food, paper, wood, and shelter. In recent years, cell wall biology has been a major research area in plant biology, as the cell wall represents about 50% of plant biomass and contains raw materials for the production of bioethanol (Sarkar et al. 2009). A deep understanding of the structure and function of the cell wall will facilitate its industrial applications and better meet human needs.

1.2 The structure of plant cell wall

All plant cells have primary cell walls, which form at the cell plate and expand as the cell expands. Two types of primary cell wall are found in flowering plants (Carpita and Gibeaut 1993). Type I cell walls are found in dicots and most monocot plants. They consist of a framework of cellulose microfibrils cross-linked by the hemicellulose xyloglucan, embedded in a matrix of pectin polymers. Type I cell walls are also rich in structural proteins that have various functions. Type II cell walls are found in commelinoid monocots. They are mostly composed of cellulose microfibrils cross-linked by glucuronoarabinoxylan (GAX), except in the order Poales, where mixed-linkage glucans serve as the intermolecular linker. Unlike type I cell walls, type II cell walls contain much less pectin and are also poor in structural proteins.

In some specialized cells, once the cells have ceased growth, a secondary cell wall is formed between the primary cell walls and the plasma membrane. The secondary cell walls are often thickened due to the presence of lignin polymers that enhance the mechanical strength of cell walls. The secondary cell walls often have a high content of cellulose and a low content of pectin and proteins.

1.3 Cell wall polysaccharides

1.3.1 Cellulose

Cellulose is the most abundant component in both primary and secondary walls. Primary cell walls contain 20-30% (w/w) cellulose, and secondary cell walls contain an even higher amount of cellulose (Albersheim et al. 2011). Cellulose is built from β-(1, 4)-linked D-glucose units that form glucan chains. Approximately 36 glucan chains associate through hydrogen bonds to form a cellulose microfibril (Herth 1983). Microfibrils are synthesized at the plasma membrane by cellulose synthase complexes, which contain six cellulose synthase catalytic subunits (Somerville 2006). Cellulose microfibrils serve as the load-bearing framework in the cell wall, giving the cell wall its mechanical strength and stiffness.

1.3.2 Hemicelluloses

Hemicelluloses are a group of polysaccharides found in both primary and secondary cell walls. Their major function is to strengthen the cell wall by cross-linking with the cellulose microfibrils. Hemicelluloses include xyloglucan, glucuronoxylan (GX), GAX, arabinoxylan (AX), glucomannan, galactomannan, and galactoglucomannan. Unlike cellulose, which contains only glucose, the sugar backbones of hemicelluloses are often substituted with other kinds of monosaccharides such as xylose (Xyl), mannose (Man), galactose (Gal), rhamnose (Rha), and arabinose (Ara). The abundance of hemicellulose varies across species. In dicots, xyloglucan and GX often predominate in the primary and secondary cell wall, respectively. In commelinoid monocots, GAX is the major hemicellulose in the primary cell wall. In gymnosperms, galactoglucomannans are major components of the secondary cell wall (Ebringerová et al. 2005).

1.3.3 Pectins

Pectin polymers are the major components of the cell wall matrix. There are three major types of pectins: homogalacturonan (HG), rhamnogalacturonan I (RG-I) and rhamnogalacturonan II (RG-II). HG is the most abundant pectin, composing 60%-70% of pectin polymers. It is an unbranched α-(1, 4)-linked GalA chain that may be partially methylesterified and/or acetylated. RG-I is the second most abundant pectin, composing about 20% of the pectin polymers. RG-II has an HG backbone substituted by various types of oligosaccharide and polysaccharide side chains (Mohnen 2008).

1.4 Cell wall structural proteins

Proteins represent important components in the cell wall. One superfamily of structural proteins is the hydroxyproline-rich glycoproteins (HRGPs), consisting of three major members: arabinogalactan-proteins (AGPs), extensins (EXTs) and proline-rich proteins (PRPs) (Fig 1.1) (Showalter 1993). HRGPs are widespread in plants, but their members differ in the characteristic motifs of their protein backbones, their level of glycosylation, and their cross-linking properties. Nonetheless, the HRGP superfamily shares the same rules of proline hydroxylation and subsequent glycosylation, and should be regarded as a spectrum of proteins (Kieliszewski and Lamport 1994).

Figure 1.1 Protein sequences of an AGP molecule, an EXT molecule and a hybrid PRP molecule identified in Arabidopsis (panels: an AGP, an EXT, and a PRP). Colored sequences at the N and C termini indicate predicted signal peptide (green) and GPI anchor (light blue) addition sequences if present. A classical AGP, At1g35230, is mainly composed of Pro-dipeptide sequences (yellow). The EXT contains the YVY motif (dark red) and the Ser-Pro(n) motif located in longer repetitive motifs (purple and red). The PRP contains short PPV motifs (pink) and longer repetitive motifs (cyan) (Showalter et al. 2010).

1.4.1 AGPs

AGPs are the most highly glycosylated members of the HRGP superfamily (Tan et al. 2003). Typically, carbohydrate accounts for more than 90% of the molecular mass of an AGP, while the protein backbone accounts for less than 10% (Nothnagel 1997). AGP protein backbones have biased amino acid compositions that are rich in proline (P), alanine (A), serine (S), and threonine (T). These four amino acids often account for more than 50% of the total amino acids and usually form repeated dipeptide motifs, such as AP, PA, SP, TP (Showalter et al. 2010; Gao and Showalter 1999; Showalter 2001). Many proline residues are hydroxylated to hydroxyproline (Hyp) by prolyl 4-hydroxylase (Hieta and Myllyharju 2002; Koski et al. 2007; Koski et al. 2009). Another unique feature of AGPs is their specific binding to an artificial Yariv phenylglycoside (β-Yariv reagent) due to the presence of β-(1, 3)-galactan in the AGPs (Kitazawa et al. 2013). The carbohydrate moiety of AGPs generally consists of a (1, 3)-linked β-galactan backbone from which (1, 6)-linked β-galactan chains branch off. These chains are usually heavily substituted with Ara residues and other sugars like Xyl, Rha, fucose (Fuc), glucuronic acid (GlcA), and 4-O-methyl-GlcA. In addition to glycosylation, mature AGPs often have a glycosylphosphatidylinositol (GPI) anchor added at the C-terminus of the protein backbone.
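The compositional bias described above can be illustrated with a short calculation. The following Python sketch is only an illustration and is not the BIO OHIO implementation; the function names `past_fraction` and `dipeptide_counts` and the toy sequence are hypothetical. It computes the fraction of P, A, S, and T residues in a protein sequence and counts the AP, PA, SP, and TP dipeptides.

```python
from collections import Counter

# Residues and dipeptides named in the text; everything else here is illustrative.
PAST = set("PAST")
AGP_DIPEPTIDES = ("AP", "PA", "SP", "TP")


def past_fraction(seq: str) -> float:
    """Return the fraction of residues that are Pro, Ala, Ser, or Thr."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for aa in seq if aa in PAST) / len(seq)


def dipeptide_counts(seq: str) -> dict:
    """Count every (overlapping) occurrence of the AGP-type dipeptides."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return {dp: counts[dp] for dp in AGP_DIPEPTIDES}


if __name__ == "__main__":
    toy = "APAPSPTPGK"  # hypothetical sequence, not a real AGP
    print(past_fraction(toy))
    print(dipeptide_counts(toy))
```

A sequence whose PAST fraction exceeds 0.5 would match the bias described in the text, although composition alone does not establish that a protein is an AGP.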

AGPs have been implicated in various aspects of plant growth and development (Showalter 2001; Seifert and Roberts 2007; Ellis et al. 2010). AGPs were found to promote somatic embryogenesis, and were shown to be involved in programmed cell death by treatment with β-Yariv reagent (van Hengel et al. 2002; Gao and Showalter 1999). AGPs were also implicated in signal transduction as putative co-receptors due to the presence of the GPI anchor (Schultz et al. 1998; Zhang et al. 2011). Based on NMR data and molecular dynamics simulations, Lamport and Varnai identified AGPs as potential intramolecular Ca2+-binding molecules (Lamport and Varnai 2013). AGPs were highly up-regulated in salt-stressed tobacco BY-2 cells (Lamport et al. 2006). AGPs also play roles in pollen tube growth (Mollet et al. 2002; Lee et al. 2008) and pollen grain development (Pereira et al. 2006; Levitin et al. 2008). AGPs have recently been found to be involved in the molecular interaction and cross-linking of other plant cell wall components (Tan et al. 2013). AGPs also have the potential for commercial uses as emulsifiers in the food industry and as immunological regulators for human health (Showalter 2001).

Characterization and genetic studies of individual AGPs have also been conducted in various species and AGP subclasses. AtAGP30 was found to play a role in regeneration and germination (van Hengel and Roberts 2003). In cucumber (Cucumis sativus), CsAGP1 was associated with gibberellin response and stem elongation (Park et al. 2003). Recently, GhPLA1 was identified in cotton (Gossypium hirsutum) as a chimeric AGP that promotes somatic embryogenesis (Poon et al. 2012). Shi et al. (2003) reported a salt overly sensitive mutant, sos5, in which the affected gene encodes an Arabidopsis fasciclin-like AGP (FLA4). Interestingly, fei1/2, a double mutant for two cell wall-associated leucine-rich repeat receptor kinases, phenocopies sos5 (Xu et al. 2008). Both sos5 and fei1/2 demonstrate a swollen-root phenotype; moreover, genetic analysis indicates that sos5 is in the same genetic pathway as fei1/2. Also, a FLA mutant of Arabidopsis thaliana, fla1, showed defects in shoot regeneration in tissue culture (Johnson et al. 2011).

1.4.2 EXTs

EXTs are a group of moderately glycosylated proteins in which sugar components account for about half of the molecular weight. The protein backbones of EXTs are composed of repetitive motifs of Ser-Hyp(n) (where n is usually 3-5) and often hydrophobic motifs of tyrosine (Tyr)-X-Tyr (where X can be any amino acid) (Kieliszewski and Lamport 1994; Showalter et al. 2010).

Pro residues are post-translationally modified to Hyp by prolyl 4-hydroxylases (P4Hs), and subsequent glycosylation includes Ara oligosaccharides attached to Hyp and Gal monosaccharides attached to Ser residues (Kieliszewski and Lamport 1994). Tyr residues on EXTs are involved in both intramolecular and intermolecular cross-linking, forming isodityrosine (IDT), di-IDT and pulcherosines (Held et al. 2004). These cross-linking properties of EXTs contribute to the extracellular network and play roles in and defense mechanisms (Merkouropoulos et al. 1999; Cannon et al. 2008).

EXTs can be divided into several classes: classical EXTs, short EXTs, leucine-rich repeat extensins (LRXs), proline-rich extensin-like receptor kinases (PERKs), formin-homolog EXTs (FHXs), long chimeric EXTs and other chimeric EXTs. Classical EXTs have SP3-5 repeated motifs throughout their sequences; many classical EXTs have numerous YXY motifs. For instance, one classical EXT (Bra000292) identified in Brassica rapa contains 84 SPn motifs and 52 YXY motifs (Liu et al. 2016). In Arabidopsis, a homozygous extensin gene mutant (atext3) had an abnormal cell wall structure and was found to be lethal (Saha et al. 2013).
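Counting SP3-5 and YXY motifs of this kind can be sketched with regular expressions. The Python example below is a minimal illustration under stated assumptions, not the counting procedure used in this dissertation: the function name `count_ext_motifs` and the toy sequence are hypothetical, SP repeats are counted without overlap, and YXY motifs are counted with overlap (so that YVYKY would count as two), which is a design choice made here rather than a documented convention.

```python
import re

# Ser followed by three to five Pro residues (Pro in the unmodified, genome-encoded sequence).
SP_REPEAT = re.compile(r"SP{3,5}")
# Zero-width lookahead so that YXY motifs sharing a Tyr are each counted.
YXY = re.compile(r"(?=Y.Y)")


def count_ext_motifs(seq: str) -> dict:
    """Count SP3-5 repeats (non-overlapping) and YXY motifs (overlapping)."""
    seq = seq.upper()
    return {
        "SPn": len(SP_REPEAT.findall(seq)),
        "YXY": len(YXY.findall(seq)),
    }


if __name__ == "__main__":
    toy = "SPPPPYVYKSPPPSPPPPPYYY"  # hypothetical sequence, not a real EXT
    print(count_ext_motifs(toy))
```

Because the quantifier is greedy, a run of more than five Pro residues after a Ser is counted once, using its first five Pro residues.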

Short EXTs have a smaller number of amino acids, usually fewer than 200. Some short EXTs are predicted to have a C-terminal glycosylphosphatidylinositol (GPI) membrane anchor addition sequence that tethers the protein to the outer leaflet of the plasma membrane (Showalter et al. 2010; Showalter et al. 2016; Liu et al. 2016). Such GPI anchor addition sequences are rarely seen in other subclasses of EXTs.

Chimeric EXTs are proteins with both EXT domains and one or more domains not belonging to HRGPs (Showalter et al. 2010; Liu et al. 2016). One group of chimeric EXTs is the leucine-rich repeat extensins (LRXs). In addition to EXT motifs, LRXs have a leucine-rich repeat (LRR) domain, which is thought to be involved in protein-protein interactions, giving LRXs a possible regulatory role on the plant cell surface (Kobe and Deisenhofer 1994). Another group of chimeric EXTs is the proline-rich extensin-like receptor kinases (PERKs), with the EXT domains located at the N-terminal portion of the protein and a protein kinase catalytic domain located at the C-terminus. PERKs have been implicated in important physiological processes, including cell expansion and floral organ formation (Haffani et al. 2006), abscisic acid responses (Bai et al. 2009), and responses to wounding and pathogen stimuli (Silva et al. 2002). FHXs are a third group of chimeric EXTs that have a conserved FH2 domain (pfam02181) in addition to EXT motifs. In plants, the FH2 domain is involved in actin nucleation and the co-ordination between actin and microtubule cytoskeletons (Chalkia et al. 2008; Cvrčková et al. 2015). "Long chimeric EXTs" represent another group of chimeric EXTs, distinguished by their extraordinary sequence length of more than 2,000 amino acids.

Besides the above chimeric EXTs, other chimeric EXTs and hybrid EXTs also exist. All EXTs except the PERKs have N-terminal signal peptide sequences that direct the proteins to the endomembrane system and ultimately to the extracellular matrix. Despite the absence of a signal peptide, Arabidopsis PERK1 was found to be localized at the plasma membrane (Silva and Goring 2002).

1.4.3 PRPs

PRPs are the least glycosylated members of the HRGP superfamily, containing 2-27% (w/w) sugar (Shpak et al. 1999). PRPs typically contain contiguous P residues, which may reside in repeated amino acid motifs such as PPVX(K/T) (where X can be any amino acid), KKPCPP, and KPPXYKPP (where X can be valine, histidine, threonine, glutamine or alanine and Y can be tyrosine, threonine, glutamic acid or proline) (Sommer-Knudsen et al. 1998). Variations of repeated motifs are also found in different species. In addition to an abundance of P residues, PRPs are generally rich in valine, lysine, cysteine, tyrosine, and threonine residues. Some Pro residues are hydroxylated and subsequently O-glycosylated by Ara oligosaccharides. PRPs can be divided into subfamilies, including the classical PRPs, PR peptides, and chimeric PRPs (Showalter et al. 2010).
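The three PRP motifs named above translate directly into regular expressions. The Python sketch below shows one possible encoding; the dictionary `PRP_MOTIFS`, the function `find_prp_motifs`, and the toy sequence are hypothetical names introduced for illustration, and the patterns cover only the motif variants listed in the text. Matches are reported as zero-based start positions and are non-overlapping within each motif.

```python
import re

# One pattern per motif described in the text.
PRP_MOTIFS = {
    "PPVX(K/T)": re.compile(r"PPV.[KT]"),              # X = any amino acid
    "KKPCPP": re.compile(r"KKPCPP"),
    "KPPXYKPP": re.compile(r"KPP[VHTQA][YTEP]KPP"),    # X in V/H/T/Q/A, Y in Y/T/E/P
}


def find_prp_motifs(seq: str) -> dict:
    """Return the start positions of each PRP motif found in the sequence."""
    seq = seq.upper()
    return {
        name: [m.start() for m in pattern.finditer(seq)]
        for name, pattern in PRP_MOTIFS.items()
    }


if __name__ == "__main__":
    toy = "MPPVYKAKKPCPPGKPPVYKPP"  # hypothetical sequence, not a real PRP
    print(find_prp_motifs(toy))
```

Different motifs may overlap one another in the same stretch of sequence, as the PPVX(K/T) and KPPXYKPP hits do near the end of the toy sequence.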

The exact functions of PRPs are not clear, but PRPs have been implicated in plant growth and development, including nodule formation, cell expansion, and responses to fungal and endogenous elicitors and to wounding (Showalter 1993; Dvorakova et al. 2012).

1.5 History of bioinformatic identification of HRGPs

The rapid advancement of the “next generation sequencing (NGS)” techniques has revolutionized biology research. Our enhanced abilities to acquire, store, process, and analyze data have also greatly facilitated the study of genomics and proteomics, including the bioinformatic study of HRGPs.

The first comprehensive bioinformatics analysis in the HRGP field was done by Schultz et al. (2002), who used the biased amino acid composition of AGPs to identify AGPs and AG peptides in the Arabidopsis genome and found 13 classical AGPs, 10 AG peptides, three lysine-rich AGPs, and 21 fasciclin-like AGPs (FLAs). Faik et al. (2006) used a Java script to analyze wheat (Triticum aestivum) and rice (Oryza sativa) EST databases and identified 34 wheat FLAs and 24 rice FLAs. Ma and Zhao (2010) used a Perl script to identify AGPs from the rice (Oryza sativa L.) genome based on biased amino acid composition, length, and conserved domain. They identified 69 rice AGPs, including 13 classical AGPs, 15 AG peptides, 3 non-classical AGPs, 3 early nodulin-like AGPs, 8 non-specific lipid-transfer protein-like AGPs, and 27 FLAs. In addition, Newman and Cooper (2011) utilized a bioinformatic approach based on EST and NCBI Non-Redundant protein sequence data and identified 31 distinct classes of proline-rich Tandem Repeat Proteins (TRPs), including certain classes of AGPs, EXTs, and PRPs, in a variety of plant species. Building on previous advances in the bioinformatic identification of AGPs, Ma et al. (2016) developed a Python script named Finding-AGP to systematically screen for AGPs from proteomic data. They conducted bioinformatic identification of AGPs in 47 plant species and proposed possible origins of certain classes of AGPs. Based on motifs and amino acid bias, Johnson et al. (2017a) recently developed a bioinformatics pipeline which classified HRGPs into 23 subclasses. The pipeline was tested on 15 plant genomes and applied to the 1000 plants transcriptome project data set (Johnson et al. 2017b).

Showalter et al. (2010) identified and classified the whole HRGP superfamily in Arabidopsis, including AGPs, EXTs, PRPs, hybrid HRGPs, and chimeric HRGPs, using the bioinformatics software program BIO OHIO (Lichtenberg et al. 2012). The BIO OHIO software program was designed and developed specifically for protein identification based on amino acid signatures. In addition to searching for biased amino acid composition in the genome-encoded protein sequences, the program can search for common HRGP amino acid motifs. The program can also check for the presence of potential signal peptide sequences and GPI anchor addition sequences, and find similar HRGPs via the Basic Local Alignment Search Tool (BLAST). The program identified 166 HRGPs consisting of 85 AGPs, 59 EXTs, 18 PRPs, and 4 AGP/EXT hybrid HRGPs in Arabidopsis thaliana. More recently, the BIO OHIO 2.0 program was developed with improved functional modules over its predecessor. Using this updated program, the Showalter lab conducted a more comprehensive identification and analysis of EXTs in the plant kingdom, and identified the inventory of HRGPs in a number of plant species, including poplar and common bean (Liu et al. 2016; Showalter et al. 2016 and unpublished data). The details of the improved BIO OHIO 2.0 program and the associated bioinformatic studies will be discussed in later chapters.
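The general idea of screening genome-encoded protein sequences for biased amino acid composition can be sketched in a few lines. The Python example below is a simplified illustration and does not reproduce BIO OHIO or its modules; the functions `parse_fasta` and `screen`, the default 50% threshold, and the toy records are assumptions made for the example.

```python
def parse_fasta(text: str):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first word of the header line as the identifier.
            header, chunks = (line[1:].split() or [""])[0], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def screen(records, residues: str = "PAST", min_fraction: float = 0.5):
    """Return identifiers whose sequences meet the composition threshold."""
    hits = []
    for ident, seq in records:
        seq = seq.upper().rstrip("*")  # drop a trailing stop symbol if present
        if not seq:
            continue
        fraction = sum(seq.count(r) for r in residues) / len(seq)
        if fraction >= min_fraction:
            hits.append(ident)
    return hits


if __name__ == "__main__":
    toy = ">p1 toy record\nAPAPSPTP\nGK\n>p2\nMKLLVVGG\n"  # hypothetical data
    print(screen(parse_fasta(toy)))
```

In practice, candidates found by a composition filter of this kind would still need the motif, signal peptide, GPI anchor, and BLAST checks described above before being classified.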


CHAPTER 2. BIOINFORMATIC IDENTIFICATION OF PLANT HYDROXYPROLINE-RICH GLYCOPROTEINS

This work is in press in the following manuscript.

Xiao Liu, David Masters, Savannah McKenna, Lonnie R. Welch, and Allan M. Showalter 2017 Bioinformatic identification of plant hydroxyproline-rich glycoproteins. Methods Mol Biol.

2.1. Introduction

Hydroxyproline-rich glycoproteins (HRGPs) are a diverse superfamily of plant cell wall glycoproteins that are implicated in various roles in plant growth and development (Showalter 1993; Kieliszewski and Lamport 1994; Cassab 1998; Seifert and Roberts 2007; Kieliszewski et al. 2010). Based on their protein backbone sequences and patterns of proline hydroxylation and glycosylation, HRGPs are divided into three families: arabinogalactan-proteins (AGPs), extensins (EXTs), and proline-rich proteins (PRPs).

AGPs are rich in proline (P), alanine (A), serine (S), and threonine (T). The P residues are generally clustered as non-contiguous residues (e.g., APSPTP) and reside in certain common dipeptide sequences such as AP, PA, SP, TP, VP, and GP. These P residues are post-translationally hydroxylated and glycosylated with arabinogalactan (AG) polysaccharides (Tan et al. 2003; Tan et al. 2004; Tan et al. 2012). The AGP family can be divided into several subfamilies, including the classical AGPs, AG peptides, and chimeric AGPs such as fasciclin-like AGPs (FLAs) and plastocyanin AGPs (PAGs). EXTs typically contain repeating units of SP3-5, in which the contiguous P residues are hydroxylated and subsequently glycosylated with arabinose oligosaccharides (Kieliszewski et al. 2010; Shpak et al. 2001). Many EXTs also contain Tyr-X-Tyr (YXY) amino acid repeats [where X can be any amino acid], which are involved in intramolecular and intermolecular crosslinking in the cell wall (Kieliszewski et al. 2010). EXTs can be divided into several subfamilies, including the classical EXTs, short EXTs, leucine-rich repeat extensins (LRXs), proline-rich extensin-like receptor kinases (PERKs), formin-homolog EXTs (FHXs), and other chimeric EXTs (Kieliszewski et al. 2010; Liu et al. 2016). PRPs typically contain contiguous P residues (e.g., PPVYK), which may reside in repeating amino acid sequences such as PPVX(K/T) [where X represents any amino acid] or KKPCPP. In addition to an abundance of P residues, PRPs are generally rich in valine (V), lysine (K), cysteine (C), tyrosine (Y), and threonine (T) residues. The P residues may be hydroxylated and subsequently glycosylated with arabinose oligosaccharides. PRPs can be divided into subfamilies, including the classical PRPs, PR peptides, and chimeric PRPs (Showalter et al. 2010).

Most HRGPs have an N-terminal signal peptide that directs them into the endomembrane system and ultimately to the plasma membrane/cell wall. Certain HRGP groups, in particular the AGP family, are also modified with a C-terminal glycosylphosphatidylinositol (GPI) membrane anchor, which tethers the protein to the plasma membrane and allows the rest of the glycoprotein to extend toward the cell wall in the periplasm (Seifert and Roberts 2007; Youl et al. 1998).

Several bioinformatic approaches to identify HRGPs have been reported (Liu et al. 2016; Showalter et al. 2010; Schultz et al. 2002; Ma and Zhao 2010; Newman and Cooper 2011). All of these approaches take advantage of the characteristic amino acid compositions, repeating sequences, and sequence features present in HRGPs, but vary with respect to how each of these characteristic features is integrated into the program. Here, we present the most recent version of the HRGP bioinformatic program developed in our laboratory, called BIO OHIO 2.0. This program and its predecessor allow for the effective identification and classification of HRGPs from proteomic databases by utilizing bioinformatic approaches involving biased amino acid composition searches and HRGP amino acid motif searches (Showalter et al. 2010; Lichtenberg et al. 2012).

Successful identification and classification of HRGPs will facilitate both basic and applied research on these important cell wall proteins, particularly as they relate to emulsifiers (Showalter 1993), adhesives (Huang et al. 2016), and biofuel (Fleming et al. 2016). Moreover, HRGP comparisons among different plant species will provide further insight into the roles these HRGPs play in plants and in evolution. This chapter provides a tutorial on how to use the BIO OHIO 2.0 bioinformatic program to identify and classify HRGPs from proteomes revealed by genomic sequencing. As an example, the proteome of the common bean (Phaseolus vulgaris) is analyzed using the BIO OHIO 2.0 program to identify and characterize its set of HRGPs.

2.2. Materials and methods

2.2.1 BIO OHIO 2.0 program overview

BIO OHIO 2.0 is a newly revised and improved bioinformatics software program developed at Ohio University that was designed primarily to identify and characterize plant HRGPs. The program allows for protein identification based on amino acid signatures, such as biased amino acid compositions and common HRGP amino acid motifs in genome-encoded protein sequences (i.e., the proteome). The program further analyzes identified proteins by examining them for the presence of potential signal peptide sequences and GPI anchor addition sequences and by searching for similar HRGPs using the Basic Local Alignment Search Tool (BLAST). The BIO OHIO 2.0 program is free and publicly available at https://github.com/david-masters/Bio-Ohio-Public/releases

2.2.2 Install the BIO OHIO 2.0 program

(1) Download the program from https://github.com/david-masters/Bio-Ohio-Public/releases

(2) Unzip BIO_OHIO.zip to a convenient location (BIO OHIO has only been tested on Windows).

2.2.3 The BIO OHIO 2.0 program graphical user interface (GUI)

In the program folder, double click the BioOhio.bat file to launch the BIO OHIO 2.0 GUI (Fig 2.1). Select a protein sequence file of interest in the fasta format to be analyzed. Under the “Select test” drop-down menu, select one of the available tests: AGP, EXT, PRP, or ALL tests.

Figure 2.1 The BIO OHIO 2.0 GUI.

2.2.4 Select a file to be analyzed

At the top of the GUI, the “Select File” module is used to select the file of interest for analysis. The file must contain protein sequences in fasta format and can be located in any folder/directory on the computer. In our example, we downloaded the proteome file of Phaseolus vulgaris (Pvulgaris_218_v1.0.protein.fa) from the Phytozome website v12 (Phaseolus vulgaris v1.0, DOE-JGI and USDA-NIFA, http://phytozome.jgi.doe.gov/) and selected it for our analysis.
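For readers who wish to inspect such a file outside the GUI, the following is a minimal Python sketch of reading protein sequences in fasta format. The helper name read_fasta and the toy input are hypothetical and are not part of BIO OHIO 2.0; the sketch assumes a plain multi-sequence fasta layout in which the sequence ID is the first token of each header line.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse fasta-formatted lines into a {sequence ID: sequence} dictionary.

    The ID is taken as the first whitespace-delimited token after '>'.
    A trailing '*' (used as a stop symbol in some proteome files) is dropped.
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header_tokens = line[1:].split()
            current_id = header_tokens[0] if header_tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line.rstrip("*")
    return records


# Toy input (hypothetical); in practice the lines would come from an open file.
toy_lines = [">seq1 example description", "MKT", "AAP*", "", ">seq2", "SPPP"]
toy_records = read_fasta(toy_lines)
```

With a real proteome file, the same function could be applied to an open file handle, for example one opened on Pvulgaris_218_v1.0.protein.fa.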

2.2.5 Methods for identifying AGPs

If the AGP test is selected, the AGP-related functional modules will appear. If the goal is to identify classical AGPs, select “Yes” to include the “AGP Biased Amino Acid” module in the test. Classical AGPs were identified by examining proteins in the proteome for biased amino acid compositions. Candidate classical AGPs were identified as having 50% or greater of the amino acids P, A, S, and T (PAST). In this module, “Short” and “Long” denote the lower and upper limits on protein length (in amino acids) for a protein to be included in the test. “Threshold” denotes the minimum percentage of the selected amino acids required to pass the test; the default is set at 50%. “Window” denotes the length of sequence used for the calculation. “Amino Acids” denotes the specific biased amino acids (e.g., PAST).
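The biased amino acid calculation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the BIO OHIO 2.0 source code: the function names are hypothetical, the percentage is computed over the whole protein, and the “Window” parameter is not modeled because its exact behavior is not specified here.

```python
from typing import Optional


def biased_percent(sequence: str, amino_acids: str = "PAST") -> float:
    """Percentage of residues in `sequence` that belong to `amino_acids`."""
    if not sequence:
        return 0.0
    hits = sum(1 for residue in sequence if residue in amino_acids)
    return 100.0 * hits / len(sequence)


def passes_biased_test(
    sequence: str,
    amino_acids: str = "PAST",
    threshold: float = 50.0,
    short: int = 0,
    long: Optional[int] = None,
) -> bool:
    """Apply the length limits (Short/Long) and the composition threshold."""
    if len(sequence) < short:
        return False
    if long is not None and len(sequence) > long:
        return False
    # "50% or greater" is read as an inclusive threshold (assumption).
    return biased_percent(sequence, amino_acids) >= threshold


# Toy sequences (hypothetical), chosen so the arithmetic is easy to follow.
rich = "PASTPASTGG"    # 8 of 10 residues are P, A, S, or T
borderline = "PAPAGGGG"  # 4 of 8 residues
poor = "PAGGGG"        # 2 of 6 residues
```

For example, biased_percent(rich) evaluates to 80.0, so the toy sequence passes the default 50% threshold, whereas the last toy sequence (about 33%) does not.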

If the goal is to identify AG peptides, select “Yes” to include the “AG Peptide” module in the test. For AG peptides, a reduced PAST threshold percentage of 35% and a protein length of 50 to 90 amino acids were used, as AG peptides usually contain an N-terminal signal peptide and possibly a C-terminal GPI anchor addition sequence. In this module, the length of the AG peptide was therefore restricted from 50 (Short) to 90 (Long) amino acids, and the threshold for the biased amino acids (i.e., PAST) was set to 35%.
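As an illustration of the AG peptide criteria, the sketch below filters a dictionary of sequences by length (50 to 90 amino acids) and PAST content (at least 35%). The function name ag_peptide_candidates is hypothetical and the code is not taken from BIO OHIO 2.0. The example sequence is the representative AG peptide Phvul.011G150500 shown in Fig 2.3; the two other entries are invented controls.

```python
from typing import Dict, List, Tuple


def ag_peptide_candidates(
    proteome: Dict[str, str],
    short: int = 50,
    long: int = 90,
    threshold: float = 35.0,
    amino_acids: str = "PAST",
) -> List[Tuple[str, int, float]]:
    """Return (ID, length, PAST%) for sequences meeting the AG peptide criteria."""
    candidates = []
    for seq_id, seq in proteome.items():
        if not seq or not (short <= len(seq) <= long):
            continue  # outside the Short/Long length limits
        percent = 100.0 * sum(seq.count(aa) for aa in amino_acids) / len(seq)
        if percent >= threshold:
            candidates.append((seq_id, len(seq), round(percent, 1)))
    return candidates


example_proteome = {
    # Representative AG peptide from Fig 2.3
    "Phvul.011G150500": (
        "MDMRKVTCAILIAAASVSATMAAAEVPAPAPGPSSGASAAF"
        "PLVGSLVGASVLSFFALFH"
    ),
    # Invented controls: too long, and too poor in PAST
    "toy_long": "PAST" * 30,
    "toy_poor": "G" * 60,
}
hits = ag_peptide_candidates(example_proteome)
```

Under these assumptions only Phvul.011G150500 is returned, with a length of 60 and a PAST content of 50%, in agreement with its entry in Table 2.1. As in the program itself, such a hit is only a candidate until signal peptide, GPI, and BLAST evidence is considered.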

If the goal is to identify FLAs, select “Yes” to include the “Fasciclin AGP” module in the test. FLAs were identified by searching for one or more fasciclin motifs. A regular expression typical of the fasciclin motif was defined as follows: [MALIT]T[VILS][FLCM][CAVT][PVLIS][GSTKRNDPEIV]+[DNS][DSENAGE]+[ASQM] (Johnson et al. 2003). In this module, the “H1 Motif” denotes the above FLA core motif. “Search Length” denotes the maximum length of the window used in searching for this motif. “AGP Region Motif” denotes the motif to be searched for in the AGP domain of a protein.
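The fasciclin motif search can be illustrated with Python's re module, using the regular expression given above. This is a sketch only: the function name is hypothetical, the “Search Length” and “AGP Region Motif” settings are not modeled, and the test sequence is a short invented string rather than a real protein.

```python
import re
from typing import List, Tuple

# H1 core motif as given in the text (Johnson et al. 2003).
H1_MOTIF = re.compile(
    r"[MALIT]T[VILS][FLCM][CAVT][PVLIS][GSTKRNDPEIV]+[DNS][DSENAGE]+[ASQM]"
)


def find_fasciclin_motifs(sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start position, matched text) for each motif match."""
    return [(m.start(), m.group()) for m in H1_MOTIF.finditer(sequence)]


# Invented toy sequence containing one H1-like stretch.
toy_sequence = "MKKITVLAVGNDAISS"
toy_matches = find_fasciclin_motifs(toy_sequence)
```

Here the single match starts at position 3 and spans ITVLAVGNDA. A protein with one or more such matches would be retained as a candidate FLA for further inspection.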

2.2.6 Methods for identifying EXTs

If the EXT test is selected, the EXT-related functional modules will appear. A regular expression for two or more SPPP repeats is used to search for candidate EXTs. In this module, “Pattern 0” and “Pattern 1” denote specific extensin motifs used in the search, such as SPPP for Pattern 0 and SPPPP for Pattern 1. “Qty” denotes the minimum number of occurrences required to pass the test. The default search criteria select any protein sequence with two or more SPPP or SPPPP sequences.
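A minimal sketch of this search is given below. The function names are hypothetical, and the assumption that a sequence passes when any one pattern reaches the “Qty” threshold is an interpretation of the description above rather than a statement about the program's internals.

```python
import re
from typing import Sequence


def count_motif(sequence: str, pattern: str) -> int:
    """Count non-overlapping occurrences of a regular expression pattern."""
    return len(re.findall(pattern, sequence))


def passes_ext_test(
    sequence: str,
    patterns: Sequence[str] = ("SPPP", "SPPPP"),
    qty: int = 2,
) -> bool:
    """True if at least one pattern occurs `qty` or more times (assumption)."""
    return any(count_motif(sequence, p) >= qty for p in patterns)


# Invented toy sequences.
two_repeats = "ASPPPKYSPPPPY"  # one SPPP run and one SPPPP run
one_repeat = "ASPPPKYSPPY"     # a single SPPP run
```

Because every SPPPP also contains SPPP, the first toy sequence counts two SPPP occurrences and passes, while the second counts only one and does not.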

2.2.7 Methods for identifying PRPs

If the PRP test is selected, the PRP-related functional modules will be displayed. PRPs are identified by two approaches. One approach involves searching for proteins having a biased amino acid composition of greater than 45% PVKCYT (Fowler et al. 1999). Select “Yes” to include the “PRP Biased Amino Acid” module to perform this analysis. The default value is set at 45% PVKCYT amino acids. The other approach involves using regular expressions to identify proteins that contain two or more PPVX(K/T) sequences [where X represents any amino acid] or KKPCPP sequences (Fowler et al. 1999). For this approach, select “Yes” to include the “PRP Motif” module in the test. The default value is set to two occurrences of KKPCPP or PPV.[KT] sequences, where “ . ” indicates any amino acid.
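The two PRP approaches can be sketched together in Python. The function name and return format are hypothetical; the threshold is treated as inclusive (at least 45%), which is an assumption, and the motif count combines KKPCPP and PPV.[KT] occurrences as described above.

```python
import re
from typing import Tuple

# KKPCPP or PPV.[KT], where "." matches any amino acid.
PRP_MOTIF = re.compile(r"KKPCPP|PPV.[KT]")


def prp_tests(
    sequence: str,
    amino_acids: str = "PVKCYT",
    threshold: float = 45.0,
    qty: int = 2,
) -> Tuple[bool, bool]:
    """Return (passes biased amino acid test, passes motif test)."""
    if not sequence:
        return (False, False)
    hits = sum(1 for residue in sequence if residue in amino_acids)
    percent = 100.0 * hits / len(sequence)
    motif_count = sum(1 for _ in PRP_MOTIF.finditer(sequence))
    return (percent >= threshold, motif_count >= qty)


# Invented toy sequences.
toy_prp = "PPVYKPPVEKPPVHT"    # three PPV.[KT] motifs, 13 of 15 residues in PVKCYT
toy_other = "GGGGKKPCPPGGGG"   # one KKPCPP motif, 6 of 14 residues in PVKCYT
```

The first toy sequence passes both tests; the second passes neither, since it has a single motif occurrence and roughly 43% PVKCYT.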

2.2.8 The SignalP, GPI, and BLAST modules

The SignalP, GPI, and BLAST modules always appear regardless of the test selected, as they assist in the process of identification. The SignalP module was used to search for the presence of a signal peptide by entering each sequence into the SignalP website (http://www.cbs.dtu.dk/services/SignalP/) using the sensitive mode (Petersen et al. 2011). In our practice, this module was included for all HRGP tests as most HRGPs have predicted N-terminal signal peptides. The GPI module is used to search for the presence of a GPI anchor addition sequence by entering each protein sequence into the big-PI plant predictor website (http://mendel.imp.ac.at/gpi/gpi_server.html) (Eisenhaber et al. 2003). Likewise, this module was used for all of the HRGP tests given that the presence of GPI anchor addition sequences is a characteristic feature of many AGPs and certain short EXTs, but not of other groups of HRGPs. The BLAST module submits each identified sequence to a protein BLAST search against the Arabidopsis thaliana proteome and returns similar Arabidopsis HRGPs previously identified by Showalter et al. (2010). This module is useful for all HRGP tests because BLAST results provide valuable information about the likely subfamily of a candidate HRGP.

2.2.9 The general workflow of identifying HRGPs

A stepwise strategy was devised for the systematic identification, classification, and analysis of HRGPs (AGPs, EXTs, and PRPs) using the newly revised and improved BIO OHIO 2.0 (Fig 2.2). Classical AGPs were characterized as having at least 50% PAST. AG peptides were characterized as being 50 to 90 amino acids in length and containing at least 35% PAST. FLAs were characterized as having a fasciclin domain. Chimeric AGPs were characterized as containing greater than 50% PAST coupled with one or more domain(s) not known in HRGPs. All AGPs feature the presence of AP, PA, TP, GP, and SP repeats distributed throughout the protein.

Figure 2.2 Workflow diagram for the identification, classification, and analysis of HRGPs (AGPs, EXTs, and PRPs) using a newly revised and improved BIO OHIO 2.0.

Candidate extensin sequences were analyzed for the locations of the SPn repeats (where n ≥ 3) and YXY cross-linking motifs (where X can be any amino acid). Classical EXTs were defined as containing numerous SPPP repeats, in many cases coupled with the distribution of YXY repeats throughout the protein; chimeric extensins, including LRXs, PERKs, FHXs, long chimeric EXTs (>2000 aa), and other chimeric EXTs, were similarly identified but were distinguished from the classical EXTs by the localized distribution of SPn repeats in the protein and the presence of non-HRGP sequences/domains, many of which were identified by the Pfam analysis; and short extensins were defined as proteins less than 200 amino acids in length that otherwise met the EXT definition.
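Counting the SPn and YXY motifs in a candidate sequence can be sketched as follows. The counting rules are assumptions made for illustration and are not documented behavior of BIO OHIO 2.0: each maximal run of three or more prolines following a serine is counted once, runs of five or more prolines are binned as SP5, and YXY matches are counted without overlap. The function name is hypothetical. The example sequence is the representative short EXT Phvul.011G059100 from Fig 2.4.

```python
import re
from typing import Dict


def ext_motif_counts(sequence: str) -> Dict[str, int]:
    """Count SP3, SP4, SP5 (five or more prolines), and YXY motifs."""
    counts = {"SP3": 0, "SP4": 0, "SP5": 0, "YXY": 0}
    # Each match is S followed by a maximal run of at least three P residues.
    for match in re.finditer(r"SP{3,}", sequence):
        prolines = len(match.group()) - 1
        key = "SP5" if prolines >= 5 else f"SP{prolines}"
        counts[key] += 1
    # Non-overlapping Tyr-X-Tyr motifs, where X is any residue.
    counts["YXY"] = len(re.findall(r"Y.Y", sequence))
    return counts


# Representative short EXT from Fig 2.4 (Phvul.011G059100, 105 aa).
short_ext = (
    "MGTRQWPRLILAFAFSLMAITLAADYDKPYYSQPSTYYPHPTPPYQQQRSPPYYYKSPPPPPPYV"
    "RKFPPYYYKSPPPPSPSPPPPYVYKSPPPPSPSPPPPYLY"
)
short_ext_counts = ext_motif_counts(short_ext)
```

Under these assumptions the sequence yields SP3/SP4/SP5/YXY counts of 0/4/1/4, which agrees with the entry for this protein in Table 2.2. Whether the same counting rules reproduce every row of the table has not been checked here.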

Classical PRPs were identified as having at least 45% PVKCYT or two or more KKPCPP or PPVX(K/T) repeats, coupled with the distribution of such repeats or PPV throughout the protein. Chimeric PRPs were similarly identified but were distinguished from classical PRPs by the localized distribution of such repeats in the protein.

The SignalP module is used to search for the presence of a signal peptide sequence; all subclasses of HRGPs contain predicted signal peptide sequences, except for PERKs, which are known to lack signal peptides. The GPI module searches for the presence of a GPI anchor addition sequence, which provides added support for the identification of AGPs and a certain group of short EXTs, and the BLAST module provides some support for classification at the subclass level.

2.3 Results

2.3.1 Identification of P. vulgaris AGPs

In the P. vulgaris proteome, 49 candidate AGPs were found that had at least 50% PAST. Among them, 15 sequences had a predicted signal peptide and clustered non-contiguous proline residues residing in dipeptide sequences such as AP, PA, SP, TP, VP, and GP. These 15 proteins were therefore deemed classical AGPs. In addition, the BLAST results of one sequence (Phvul.005G073200) revealed significant similarities with PAGs found in Arabidopsis. Thus, this sequence was deemed a PAG (Fig 2.3 and Table 2.1).

The AG peptide test for P. vulgaris returned 156 candidate proteins, among which 19 proteins had predicted signal peptides and were rich in non-contiguous proline residues residing in dipeptide sequences such as AP, PA, SP, TP, VP, and GP; many of these 19 proteins had predicted GPI anchor addition sites. These 19 sequences were deemed AG peptides (Fig 2.3 and Table 2.1).

The fasciclin test for P. vulgaris returned 13 candidate proteins that had a core fasciclin domain. Among the 13 sequences, eight had a predicted signal peptide and significant similarities with Arabidopsis FLAs as revealed by the BLAST results. Therefore, these eight proteins were deemed FLAs (Fig 2.3 and Table 2.1).


Classical AGP >Phvul.003G238200 MDRNSIFSLAFICIVIAGVGGQSPASAPSGTQPTPAASTPAAAPSTTKSPAPVASPKSSTPAASP KAVTPASSPVASPPTVVAPAPATKPPAASPPAAAPVSSPPAPVPVSSPPAPVPIAAPVAAPPTPV APAPAPGKHKKSKKHGAPAPSPSLLGPPAPPTGAPGPSEDASSPGPASSANDESGAETIMCLKKV LGGLALGWATLVLVF

AG Peptide >Phvul.011G150500 MDMRKVTCAILIAAASVSATMAAAEVPAPAPGPSSGASAAFPLVGSLVGASVLSFFALFH

FLA >Phvul.001G058300 MSFKSSSLLCIAFLLAFSSAVYGFDITKMLEKEPELSSFNKYITEAKLADQINSRNTITVLAVGN DAISSIAGKSPELIKAIISTHVILDYYDEKKLVEAQASTPQLTTLFQSSGNAVKEQGFVKVSLIG EGEIAFSSVGSSDYSELVKPIASEPYNISILQVTKPIIPPGLDSQTAQSPQQAKASPPTSSKTAK APAPSKTAKAPSPAKSSEAPAPSDAAAAPSPSETVAESPLGGSDAPATAAEGPAADDGDAASDSS SSSTVKMGLVAVMALASLFIVS

PAG >Phvul.005G073200 MELRISVFCLLSFLFSLLSGSQAYTFNVGGKDGWLLYPSENYNHWAERMRFQVSDTLVFKYKKGS DSVLVVNKEDYEKCNKKNPIKKFEDGASQFQFDRSGPFYFISGKDYNCEKGQKLIIVVLAVREPP PYSPPKAPYPPHQTPPPVYTPPEAPSPPTNLPFPPKAQPPYAPIPPNTPSPTYAPSPNNHPPHAP LPPNTPSPHHQPPFVPITPSPFSQSPYVPPQPNTPSPTSQPPYTPSPPNTPSPISQPPYIPSPPN TTPSPISQPPYISTPPSTTPSPISQPPSYISTPPSTTPSPISQPPYIPSPPNTPSPTSQSPYPYA PTPNSNSPVTQPPSSSSPPSPPSLSPYPSTIPPSYPPSPFAPATTPSPPLSSPPSESTPAATSPS SSSSPGSSSNETTPSRPNGASSMSKSRFGVYSLTILVGAALSTILG

Figure 2.3 Representative AGPs in P. vulgaris. Highlighted sequences indicate predicted signal peptide (green) and GPI anchor addition sequences (light blue). AP, PA, SP, TP, GP, and VP dipeptide sequences (yellow) are also indicated.

Table 2.1 Identification and classification of P. vulgaris AGPs.

Protein ID        AGP Subfamily  AA   PAST%  SP  GPI  Top 5 Arabidopsis HRGP BLAST Hits
Phvul.003G289900  Classical AGP  140  68%    Y   Y    AGP5C AGP10C AGP18K PEX4 AGP2C
Phvul.003G238200  Classical AGP  210  63%    Y   Y    AGP18K AGP17K PEX2
Phvul.009G033400  Classical AGP  197  74%    Y   N    None
Phvul.004G059300  Classical AGP  146  61%    Y   N    AGP6C PAG17 AGP18K AGP17K
Phvul.007G058600  Classical AGP  222  50%    Y   Y    AGP29I
Phvul.006G057200  Classical AGP  193  64%    Y   Y    AGP10C PEX2
Phvul.006G204100  Classical AGP  133  66%    Y   Y    AGP1C
Phvul.002G149200  Classical AGP  144  74%    Y   Y    AGP4C AGP10C AGP2C AGP7C AGP3C
Phvul.002G211000  Classical AGP  133  63%    Y   Y    AGP1C AGP10C
Phvul.003G281900  Classical AGP  419  63%    Y   N    AGP18K
Phvul.003G024400  Classical AGP  261  65%    Y   Y    None
Phvul.006G126400  Classical AGP  283  68%    Y   N    PRP10 EXT51
Phvul.002G272600  Classical AGP  681  67%    Y   N    None
Phvul.002G272800  Classical AGP  470  62%    Y   N    None
Phvul.002G014100  Classical AGP  297  74%    Y   Y    AGP18K PRP1 AGP17K PRP2 PRP14
Phvul.003G278100  AG Peptide     80   35%    Y   N    None
Phvul.009G137200  AG Peptide     63   44%    Y   Y    AGP16P AGP22P AGP20P AGP41P AGP43P
Phvul.005G019400  AG Peptide     67   47%    Y   Y    AGP16P AGP22P
Phvul.005G019300  AG Peptide     66   46%    Y   Y    AGP16P AGP43P
Phvul.005G036200  AG Peptide     66   43%    Y   Y    AGP43P AGP23P
Phvul.005G162900  AG Peptide     66   43%    Y   N    PERK1
Phvul.011G150500  AG Peptide     60   50%    Y   Y    AGP43P AGP23P AGP40P AGP14P AGP13P
Phvul.011G150600  AG Peptide     60   50%    Y   Y    AGP43P AGP40P AGP23P AGP14P AGP13P
Phvul.008G250300  AG Peptide     53   39%    Y   N    None
Phvul.004G002700  AG Peptide     88   35%    Y   N    None
Phvul.004G080600  AG Peptide     55   41%    Y   Y    AGP14P AGP12P AGP21P AGP13P AGP22P
Phvul.007G255800  AG Peptide     58   44%    Y   Y    AGP43P AGP23P AGP40P AGP14P AGP15P
Phvul.001G137100  AG Peptide     58   44%    Y   Y    AGP43P AGP23P AGP40P AGP14P AGP12P
Phvul.001G033400  AG Peptide     59   52%    Y   N    AGP15P AGP14P AGP21P AGP13P AGP12P
Phvul.001G111400  AG Peptide     57   45%    Y   Y    AGP14P AGP12P AGP13P AGP21P AGP22P
Phvul.001G079000  AG Peptide     59   40%    Y   Y    AGP16P AGP20P AGP22P AGP41P AGP15P
Phvul.006G179100  AG Peptide     69   37%    Y   N    None
Phvul.002G162600  AG Peptide     72   37%    Y   Y    AGP16P AGP20P AGP22P AGP41P AGP15P
Phvul.002G296000  AG Peptide     74   39%    Y   Y    AGP20P AGP16P AGP22P AGP41P AGP12P
Phvul.009G218400  FLA            207  40%    Y   Y    FLA20 FLA6 FLA21 FLA11 FLA9
Phvul.011G100600  FLA            450  28%    Y   N    FLA16 FLA17 FLA18 FLA15 FLA11
Phvul.008G288800  FLA            403  31%    Y   Y    FLA2 FLA1 FLA8 FLA10 FLA14
Phvul.008G075000  FLA            419  42%    Y   Y    FLA10 FLA8 FLA2 FLA1 FLA14
Phvul.008G287700  FLA            257  42%    Y   Y    FLA7 FLA13 FLA11 FLA6 FLA9
Phvul.007G037300  FLA            326  32%    Y   N    FLA20 FLA21 FLA19 FLA12 FLA7
Phvul.001G058300  FLA            282  44%    Y   Y    FLA8 FLA3 FLA14 FLA5 FLA10
Phvul.006G057400  FLA            420  35%    Y   Y    FLA1 FLA2 FLA8 FLA10 FLA4
Phvul.005G073200  PAG            436  54%    Y   Y    PAG17 PAG10 PAG11 PAG14 PAG2

2.3.2 Identification of P. vulgaris EXTs

The EXT test for the P. vulgaris proteome file returned 79 candidate EXT sequences, each with at least two SPPP repeats. Analysis of these 79 sequences revealed: (1) eight classical EXTs that had predicted signal peptide sequences and EXT motifs (SP3-5 and possibly YXY motifs) throughout the sequence; (2) 11 short EXTs (<200 AA) that had predicted signal peptide sequences and EXT motifs; (3) four LRXs that had signal peptide sequences, clusters of EXT motifs at the C-terminus, and significant similarities to Arabidopsis LRXs, two of which (Phvul.003G011500 and Phvul.009G062000) are possibly pollen-specific LRXs (PEXs) as indicated by BLAST results; (4) six PERKs that had clusters of EXT motifs near the N-terminus and significant similarities to Arabidopsis PERKs; (5) one FH EXT that had a predicted signal peptide and was homologous to formins; (6) one chimeric EXT that had a predicted signal peptide, clustered EXT motifs, and a non-HRGP domain; and finally (7) two hybrid AGP-EXTs (HAEs) that had a predicted signal peptide, clustered EXT motifs and an AGP domain (Fig 2.4 and Table 2.2).


Classical EXT >Phvul.004G098000 MGSLMASITLTLVLAIVSLSLPSQTSADNYPYSSPPPPPKNPYYYHSPPPQEYSPPKHPYHHPSP PYKYPSSPPPIYKNKSPPPPYKYSSPPPPPKKPYKYPSPPPPVYKYKSPPPPVYKYKSPPPPPKK PYKYPSPPPPVYKYKSPPPPVYKYKSPPPPVYKYKSPPPPYKYPSPPPPPYKYPSPPPPVYKYKS PPPPYKYPSPPPPPYKYPSPPPPVYKYKSPPPPYKYPSPPPPPYKYPSPPPPAYYYKSPPPPPKY PSPPPPHYVYASPPPPHHY

Short EXT >Phvul.011G059100 MGTRQWPRLILAFAFSLMAITLAADYDKPYYSQPSTYYPHPTPPYQQQRSPPYYYKSPPPPPPYV RKFPPYYYKSPPPPSPSPPPPYVYKSPPPPSPSPPPPYLY

LRX >Phvul.003G011500 MAHHCSNKALGLLIFLSLLSTICSAQLVPEFTHSEPVDPADAVAPAPEVDEDGAELAPAPSIEFE NDEPLTPPPPAKTNERLKKAHIAFKAWKKAIHSDPLNITGNWVGEDVCSYNGVYCAPALHDPTIN VVAGIDLNNADIAGNLPEELLHLEDIALFHINSNRFCGVIPQNLQNLTLMHEYDISNNRFVGSFP SVVLTWPNLKYLDLRFNDFEGAIPPELFHKNLDAIFLNNNRFTSIIPDSLGNSSASVITFAYNNF KGCVPNSMGNMRNLNEIVFIGNNLGGCFPQEIGTMENLRVLDLSGNGFVGTLPNLSGLKSVEVID IAYNKMSGYVSNSVCQLPALKNFTFSHNYFNGEAQSCVAEGNSSVALDDSWNCLPGRKNQKASMK CLPVLTRPVDCVKQCGGGKENEHSHSPKPSPSPKPLTPKVVHSSPPPPVHSPPPPPPVHSPPPPP PPPVNSPPPPPPPVNSPPPPPPVFSPPPPVFSPPPPPVSSPPPPPVHSPPPPVHSPPPPVHSPPP PVSSPPPPPVHSPPPPVFSPPPPPVSSPPPPVHSPPPPVSSPPPPVSSPPPPPVSSPPPPVHSPP PPVFSPPPPVFSPPPPPVHSPPPPVFSPPPPPPVHSPPPPVFSPPPPVFSPPPPPLVSSPPPPVS SPPPPPVSSPPPPVNSPPPPPVSSPPPPVSSPPPPIFSPPPPVSSPPPPVYSPPPPVYPPPPPTW DDVFLPPHFGASYKSPPPPVIVGY

PERK >Phvul.003G024000 MSAASPSPAASTASPPSQTPSSSNTGPSPSNTTTPPPQQTPSSTSPPQQPASPPPSPPSSPPSSP PSSSPPPVSGTTPPSVPPPSPPPSPPPAPDAPPPVTPSSPSPPPPVTPSSPSPPPPVTPSSPPPP PPSAPIPSRSPPSPPPPSNPPNNTSPPPLPPPPQPSPSAPPPNNTPPPPPRGNSPPPPATTPPPA SPPRNSPPAPAAPPPSNSTRSPPPVNSPPPAAHAPPPRSSAPPAPEPSNPPSSISPPPTPSSSPP PPSNSTPSSSPPSPPSLTPLAPPPPPSPPSPPPNATSDSTPGGDGIGTAGVVAISVVGGFLLLGF IGVLMWCMRRQKRKIPVNGGYVMPSTLASSPESDSSFFKTHSSAPLVQSGSGSDVVYTPSDPGGL GHSRSWFSYEELIKATNGFSSQNLLGEGGFGCVYKGQLPDGREIAVKELKIGGGQGEREFKAEVE IISRIHHRHLVSLVGYCIEDNRRLLVYDFVSNDTLYFHLHGESQPVLEWANRVKIAAGAARGLAY LHEDCNPRIIHRDIKSSNILLDFNYEAKVSDFGLAKLALDANTHITTRVMGTFGYVAPEYASSGK LTEKSDVYSFGVVLLELITGRKPVDVSQPLGDESLVEWARPLLSQAIDTEEFNTLADPRLEKNYI ESELYCMIEVAAACVRHSASKRPRMGQVVRAFDSLGGSDLTNGMQLGESEVFDSAQQSEEIRLFR RMAFGSQNYSTDFFSRASMNP

FHX >Phvul.008G050300 MQISSFFFFFFYLFLLCALASSQPLFFNRRILHEPFIPLTSLSPSDPPKPPPSHPSPSQPPKPPP PHPSPSASKQKPKYPSSSTIPTTTITTTPETATTTTTQSPFFPLYPSSPPPPSPITFASFPANIS SLILPHSPKPSSSSNKLLPVALSAVVAAALVISISTFVCYRRRRNAPPSPAGKVLRSETGLRPLR RNAETSVETRKLRHTSSASSEFLYLGTVVNSHMVEEAEVGDGDRKMESPELRPLPPLARQLSVPP APRDEAGFMTAEEDEDEFYSPRGSSLGGSGGTGSVSRRVFAADRSVTSSSRSSSSSGSPERSITN LLPRASSSYGNTLPKSPENYNHQHVHSSSSSMCSTPDRVFAERDNDALSACAHADAAPSSLHEGT LEKNENALSSPPPQRLSNASSSSAFSLPSSPENVTRHHTFDQSPRMSSVSDGLMLPGLSSLPLSP ALLSSPETERGSFGAQRKHWSIPVLSMPITTPFDEIRSIPAPPPLPQRKHWEIPGPAPPPPPPLP RQRKQWGVQAPGPSTPVGQPVSRPPELVPPSRPFVLQNQATNVELPGSLREIEETGRPKLKPLHW DKVRTSSEREMVWDQMKSSSFKLNEKMIETLFVVNTPNPKGKDAATNSVSHPPNQEERILDPKKS QNISILLKALNVTIEEVCEALSEGSTDALGTELLESLLRMAPSKEEERKLREHKDESQTKLGLAE KFLKAVLDVPFAFKRIEAMLFIASFESEVEYLRTSFQTLEAACEELRHCRMFLKLLEAVLKTGNR MNVGTNRGDAEAFKLDTLLKLADVKGADGKTTLLHFVVQEISRTEGARLSDTNQTPSSSLNEDGK CRRLGLQVVSSLSSELSNVKKAAAMDSEVLSSDVSKLSKGIATIAEVVQLNQSSENFTESVKKFI SMAEEEIPKIQAQESVASSLVKEITEYFHGNLAKEEAHPFRLFMVVRDFVAVLDRVCKEVGMMNE RTMVSSAHKFPVPVNPMLPQPLPGSP

Figure 2.4 Representative EXTs in P. vulgaris. Highlighted sequences indicate predicted signal peptide (green), SP3 (blue), SP4 (red), SP5 (purple), and YXY (dark red) sequences.

Table 2.2 Identification and classification of P. vulgaris EXTs.

Protein ID        Subfamily      AA    SP3/SP4/SP5/YXY Counts  SP  GPI  Top Arabidopsis HRGP BLAST Hits
Phvul.011G059000  Classical EXT  338   7/0/7/27                Y   N    EXT22 EXT21
Phvul.008G003200  Classical EXT  884   79/0/0/47               Y   N    None
Phvul.004G098300  Classical EXT  557   0/36/3/25               Y   N    EXT22
Phvul.004G098000  Classical EXT  279   2/14/7/21               Y   N    EXT3/5 EXT22 EXT18
Phvul.007G099700  Classical EXT  233   1/8/5/7                 Y   N    EXT3/5 PRP1
Phvul.007G002400  Classical EXT  870   1/54/6/48               Y   N    EXT22 EXT3/5 HAE3
Phvul.007G084600  Classical EXT  679   1/50/1/33               Y   N    EXT22
Phvul.007G078600  Classical EXT  427   0/27/5/21               Y   N    EXT3/5
Phvul.003G023700  Short EXT      160   1/4/0/1                 Y   Y    EXT31 EXT33
Phvul.003G110300  Short EXT      174   1/2/1/4                 Y   Y    None
Phvul.009G253400  Short EXT      165   0/4/0/2                 Y   Y    EXT37
Phvul.008G234100  Short EXT      160   1/0/2/2                 Y   Y    EXT33 PERK6
Phvul.004G155200  Short EXT      159   2/0/0/0                 Y   Y    None
Phvul.007G216700  Short EXT      175   0/3/1/0                 Y   N    None
Phvul.001G231800  Short EXT      173   0/0/2/1                 Y   N    None
Phvul.006G169900  Short EXT      138   0/2/1/0                 Y   N    None
Phvul.002G093100  Short EXT      131   0/2/0/0                 Y   Y    EXT31 EXT33 PERK3 PERK4 PERK6
Phvul.011G059100  Short EXT      105   0/4/1/4                 Y   N    AGP45P EXT21
Phvul.011G166600  Short EXT      180   1/0/2/0                 Y   N    None
Phvul.003G011500  LRX            739   0/23/14/0               Y   N    PEX1 PEX2 PEX4 LRX5 LRX4
Phvul.009G062000  LRX            634   1/10/1/0                Y   N    PEX1 PEX3 PEX4 PEX2 LRX3
Phvul.004G161500  LRX            725   3/15/13/3               Y   N    LRX4 LRX5 LRX3 PEX4 PEX2
Phvul.002G314200  LRX            740   3/2/3/1                 Y   N    LRX4 LRX5 LRX3 LRX7 PEX2
Phvul.010G149300  PERK           566   0/1/1/1                 N   N    PERK1 PERK15 PERK4 PERK5 PERK3
Phvul.003G024000  PERK           736   9/5/1/0                 N   N    PERK8 PERK12 PERK13 PERK1 PERK15
Phvul.008G021000  PERK           743   3/2/0/2                 N   N    PERK10 PERK12 PERK1 PERK15 PERK4
Phvul.004G164000  PERK           663   5/0/0/0                 N   N    PERK1 PERK5 PERK4 PERK15 PERK3
Phvul.002G070900  PERK           695   9/3/0/1                 N   N    PERK12 PERK8 PERK10 PERK15 PERK5
Phvul.002G029200  PERK           631   6/0/0/1                 N   N    PERK4 PERK5 PERK7 PERK6 PERK15
Phvul.008G050300  FH             1001  1/1/0/0                 Y   N    None
Phvul.004G097800  Chimeric EXT   248   0/4/9/0                 Y   N    None
Phvul.002G052800  HAE            213   15/0/0/0                Y   Y    AGP17K
Phvul.010G159300  HAE            384   1/5/0/1                 Y   N    None

2.3.3 Identification of P. vulgaris PRPs

The biased amino acid PRP test in P. vulgaris returned 125 candidate PRP sequences, among which the following were identified: (1) 19 PRPs that had a predicted signal peptide, contiguous proline residues and were rich in PVKCYT; (2) three PR peptides (<200 AA) that had predicted signal peptides, contiguous proline residues and were rich in PVKCYT; (3) two chimeric PRPs that had a predicted signal peptide, a domain rich in contiguous proline and VKCYT residues and a non-HRGP domain; (4) one hybrid AGP-PRP (HAP) that had a predicted signal peptide, an AGP domain and a domain typical of PRPs; and (5) one chimeric AGP-PRP that had a predicted signal peptide, an AGP domain, a PRP domain, and a non-HRGP region (Fig 2.5 and Table 2.3). The PRP motif test in P. vulgaris returned 10 candidate PRP sequences, among which one PR peptide and four PRPs were identified. These five PRPs and PR peptides, however, were already found using the biased amino acid PRP test (Fig 2.5 and Table 2.3).


PRP >Phvul.009G211000 MASLSFLVLLFAALVLSPQGLANYDKPPVYKPPIQKPPVYKPPVEKPPVYKPPVEKPPVYKPPVE KPPVYKPPVQKPPVYKPPYGKPPVHESQYEKAPLYKSPPLYKPPVHKPPVEKPPVYKPPVHKPPV EKPPVHKPPYGKPPVQESQYEKPPVYKSPPVYKPPVHKPPVEKPPVYKPPVHKPPVEKPPVYKPP YGKPPHPKYPPGSN

PR Peptide >Phvul.005G027300 MASFLSFLVLLLAALILIPQGLATYYKPIKKHPIYKPPVYKHPIYKPPVYKHKPPYYKPPYKKPP YKKHPPVEENNNHA

Chimeric PRP >Phvul.003G169100 MESSKIHAYLFLSMLFISSATPILGCGYCGKPPKKPNHGKKPKTPVVNPPVTHPPIVKPPVIVPP ITVPPVTVPPVTVPPIVKPPIPLPIPPVTVPPVLNPPTTPTTPGKGGNTPCPPPNSPAQATCPID TLKLGACVDLLGGLVHVGVGDPAANQCCPVLQGLVELEAAACLCTTLKLKLLNLNIYVPIALQLL VACGKSPPPGYTCSL

Figure 2.5 Representative PRPs in P. vulgaris. Highlighted sequences indicate predicted signal peptide (green), PPVX(K/T) (gray), and PPV (pink) sequences.

Table 2.3 Identification and classification of P. vulgaris PRPs.

Protein ID        Subfamily         AA   PVKCYT%  PPVX[KT]/KKPCPP/PPV Counts  SP  GPI  Top 5 Arabidopsis HRGP BLAST Hits
Phvul.009G213600  PRP               315  46%      0/0/0                       Y   N    PRP13
Phvul.005G018100  PRP               326  47%      0/0/1                       Y   N    PRP10
Phvul.005G048800  PRP               375  45%      0/0/1                       Y   N    PRP13 PEX4
Phvul.011G209600  PRP               338  71%      0/0/0                       Y   N    PEX2 PRP2
Phvul.011G209800  PRP               286  55%      0/0/0                       Y   N    PEX2 PEX4
Phvul.009G092000  PRP               162  47%      0/0/0                       Y   N    PRP15 PRP14 PRP16 PRP17 HAE4
Phvul.009G211100  PRP               137  58%      7/0/10                      Y   N    PRP1 PEX4
Phvul.009G211200  PRP               331  72%      34/0/39                     N   N    PRP1 PRP2 PRP5
Phvul.009G211000  PRP               209  72%      24/0/26                     Y   N    PRP1 PRP2 PRP5
Phvul.011G026700  PRP               386  55%      2/0/5                       Y   N    PRP2 PRP4 PEX1
Phvul.011G208200  PRP               266  55%      0/0/0                       Y   N    PEX2
Phvul.011G209700  PRP               232  53%      0/0/0                       Y   N    PEX2
Phvul.004G019900  PRP               179  58%      0/0/4                       Y   N    PRP18 PRP1 PEX2 PRP15
Phvul.007G127700  PRP               392  69%      0/0/19                      Y   N    PRP1 PRP5
Phvul.001G031400  PRP               295  54%      0/0/0                       Y   N    PRP10 PRP9 PEX2 PRP15
Phvul.006G003700  PRP               332  66%      4/0/11                      Y   N    PRP17 PRP15 PRP14 PRP16 HAE4
Phvul.002G259700  PRP               437  63%      0/0/3                       Y   N    PRP1 PRP5
Phvul.002G212200  PRP               176  52%      0/0/8                       Y   N    PRP10 PRP15 PRP9
Phvul.009G217800  PRP               186  50%      0/0/1                       Y   N    PRP1 PRP15 EXT20 PRP2 PEX4
Phvul.005G027300  PR Peptide        79   55%      2/0/3                       Y   N    PRP1 PRP2
Phvul.003G219200  PR Peptide        131  45%      1/0/1                       Y   N    PRP14 PRP16 PRP15 PRP17 HAE4
Phvul.003G219100  PR Peptide        131  45%      1/0/1                       Y   N    PRP14 PRP16 PRP15 PRP17 HAE4
Phvul.002G229200  HAP               286  53%      0/0/10                      Y   N    AGP30I AGP31I PRP7 PRP1 PRP11
Phvul.003G219000  Chimeric PRP      171  46%      0/0/2                       Y   N    PRP14 PRP15 PRP17 PRP16 HAE4
Phvul.003G169100  Chimeric PRP      210  53%      0/0/6                       Y   N    PRP15 PRP14 PRP16 PRP17 PRP2
Phvul.004G042800  Chimeric AGP PRP  365  46%      0/0/0                       Y   N    PRP11 PEX2

2.3.4 HRGPs identified in P. vulgaris

A total of 102 HRGPs were identified and classified in P. vulgaris using BIO OHIO 2.0, including 43 AGPs, 31 EXTs, 25 PRPs, and three hybrid HRGPs. These P. vulgaris HRGPs represent virtually all subfamilies previously identified in Arabidopsis (Showalter et al. 2010). Compared with Arabidopsis, P. vulgaris has smaller numbers of HRGPs in most subfamilies, particularly with respect to the numbers of FLAs and PAGs in the AGP family, and the number of classical EXTs in the EXT family (Table 2.4).

Table 2.4 Comparison of HRGPs identified in P. vulgaris and A. thaliana.

HRGP family   HRGP subfamily          P. vulgaris  A. thaliana a
AGPs          Classical AGPs          15           25
              AG-Peptides             19           16
              (Chimeric) FLAs         8            21
              (Chimeric) PAGs         1            17
              Other Chimeric AGPs     0            6
              All AGP subfamilies     43           85
EXTs          Classical EXTs          8            20
              Short EXTs              11           12
              (Chimeric) LRXs         4            11
              (Chimeric) FHXs         1            6
              (Chimeric) PERKs        6            13
              Other Chimeric EXTs     1            3
              All EXT subfamilies     31           59
PRPs          PRPs                    19           11
              PR Peptides             3            1
              Chimeric PRPs           3            6
              All PRP subfamilies     25           18
Hybrid HRGPs  Hybrid AGP EXT          2            4
              Hybrid AGP PRP          1            0
              All hybrid HRGPs        3            4
Total                                 102          172

a The Arabidopsis HRGP data are from Showalter et al. (2010) with the exceptions of the 6 chimeric FHXs that were added later and the one PR-peptide that was subsequently reclassified within the originally identified 12 PRPs.

2.4. Notes and tips for using BIO OHIO 2.0

(1) Highlighting amino acid sequences corresponding to characteristic HRGP features, such as signal peptides (green), GPI anchor addition sequences (light blue), AP,

PA, SP, TP, GP, and VP dipeptide sequences (yellow), SP3 sequences (blue), SP4 sequences (red), SP5 sequences (purple), and YXY sequences (dark red), PPVX(K/T) sequences (gray), and PPV sequences (pink) is beneficial to the identification process (Fig

2.2-2.4). For the convenience of the users, we provided a python script 52

(gpi_signalp_formatter.py) in the main folder of the program that highlights the signal peptide in green and the GPI anchor addition sequences in light blue.

(2) When running BIO OHIO 2.0, the SignalP and GPI modules may take more time to complete than the other modules, as these two modules take candidate sequences to the corresponding websites and wait until the test results can be obtained. In our practice, the servers of the websites may respond slower occasionally, resulting in a prolonged time for a test to finish. If these two modules are not included, it often takes within minutes for the rest modules to finish.

(3) Generally, HRGPs have signal peptides that allow them entry to the endomembrane system and delivery to the plasma membrane/cell wall. In our analysis of

P. vulgaris, all AGPs and EXTs (except for the PERKs) and 94% of the PRPs have predicted signal peptides.

(4) Some HRGPs, especially members of the AGP family, are also modified with a C-terminal GPI membrane anchor that tethers the protein to the plasma membrane. In our analysis of P. vulgaris, 67% of AGPs, 12% of EXTs, and none of the PRPs have predicted GPI anchor addition sequences.

(5) The same sequence can appear in multiple test results if it meets the criteria of each test, which suggests that it may be a hybrid HRGP. Hybrid HRGPs consist of domains from more than one HRGP family. They often appear in multiple tests and may have BLAST hits in more than one HRGP family. For instance, one HAE (Phvul.002G052800) appeared in both the EXT test, with 15 SPPP repeats, and the biased amino acid AGP test, with 73% PAST and a GPI anchor addition sequence.

(6) Chimeric HRGPs consist of domain(s) typical of an HRGP family and a non-HRGP domain. Additional chimeric HRGPs could be identified through BLAST analysis against the P. vulgaris proteome itself.
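The biased amino acid composition mentioned in note (5) can be illustrated with a short calculation. The following is only a minimal sketch and not the BIO OHIO 2.0 implementation: the function names and the 50% default threshold are assumptions introduced here for illustration.

```python
def past_percent(seq):
    """Return the percentage of Pro, Ala, Ser, and Thr residues in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    past = sum(seq.count(aa) for aa in "PAST")
    return 100.0 * past / len(seq)


def passes_biased_composition(seq, threshold=50.0):
    """Hypothetical filter: True if PAST content meets the threshold.

    The default threshold is an illustrative assumption, not a value
    taken from this chapter.
    """
    return past_percent(seq) >= threshold


if __name__ == "__main__":
    toy = "MASPTPAPSAPTGG"  # made-up sequence, not a real HRGP
    print(f"{past_percent(toy):.1f}% PAST")
```

A sequence that passes such a composition filter and also contains SPPP repeats would be a candidate hybrid HRGP in the sense described in note (5).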

CHAPTER 3. BIOINFORMATIC IDENTIFICATION AND ANALYSIS OF EXTENSINS IN THE PLANT KINGDOM

This work has been published in the following manuscript.

Liu X, R Wolfe, LR Welch, DS Domozych, ZA Popper, AM Showalter 2016 Bioinformatic identification and analysis of extensins in the plant kingdom. PLoS One 11:2.

3.1 Introduction

Extensins (EXTs) are a diverse family of hydroxyproline-rich glycoproteins (HRGPs) found only in the plant kingdom. They are cell wall proteins characterized by the repeated occurrence of serine (Ser) followed by several consecutive prolines (Pro) (Kieliszewski and Lamport 1994; Showalter et al. 2010). Some EXT molecules have Tyr-X-Tyr motifs (where X can be any amino acid) that are responsible for intramolecular or intermolecular cross-linking with other EXT molecules in the form of isodityrosine (IDT), di-IDT, and pulcherosine (Held et al. 2004). These cross-linking properties contribute to the extracellular matrix and play roles in plant development and defense mechanisms (Merkouropoulos et al. 1999; Cannon et al. 2008).

Besides cross-linking of Tyr motifs, post-translational modification of EXTs includes hydroxylation of Pro residues to hydroxyproline (Hyp) and subsequent glycosylation of Hyp and Ser residues. Peptidyl-Pro is hydroxylated by prolyl 4-hydroxylases (P4Hs). Plant P4Hs belong to a family of 2-oxoglutarate-dependent dioxygenases (Hieta and Myllyharju 2002; Koski et al. 2007; Koski et al. 2009). Characterization of P4Hs has been reported for a number of plants, including Arabidopsis thaliana (Hieta and Myllyharju 2002; Tiainen et al. 2005; Vlad et al. 2007; Velasquez et al. 2011), Nicotiana tabacum (Yuasa et al. 2005), Dianthus caryophyllus (Vlad et al. 2010), and Chlamydomonas reinhardtii (Keskiaho et al. 2007).

O-glycosylation of EXTs predominantly occurs on Ser-Hypn motifs, with often four to five oligoarabinosides attached to Hyp residues and galactose (Gal) monosaccharides attached to Ser (Kieliszewski and Lamport 1994). In Arabidopsis, the sequential addition of arabinose (Ara) residues is carried out by distinct arabinosyltransferases: hydroxyproline O-β-arabinosyltransferase (HPAT1-3) (Ogawa-Ohnishi et al. 2013), reduced residual arabinose 1-3 (RRA1-3) (Velasquez et al. 2011; Egelund et al. 2007), Xyloglucanase113 (XEG113) (Gille et al. 2009), and extensin arabinose deficient (ExAD) (Petersen et al. unpublished). The addition of Gal to Ser is carried out by Ser galactosyltransferase (SGT1) (Saito et al. 2014).

EXTs can be divided into several classes: classical EXTs, short EXTs, leucine-rich repeat extensins (LRXs), proline-rich extensin-like receptor kinases (PERKs), formin-homolog EXTs (FH EXTs), long chimeric EXTs, and other chimeric EXTs (Showalter et al. 2010). Classical EXTs have signal peptide sequences that direct the proteins to the secretory system and ultimately the extracellular matrix. Most prominently, they have repeated Ser-Pro3-5 motifs throughout their sequences. Moreover, some EXTs have Tyr-X-Tyr (YXY) motifs along with the Ser-Pro3-5 motifs. EXTs that are shorter than 200 amino acids are referred to as “Short EXTs”. LRXs are a class of chimeric EXTs that usually have signal peptide sequences at the N terminus, followed by leucine-rich repeat (LRR) domains and repeated Ser-Pro3-5 modules near the C terminus. The LRR domain is known to be involved in protein-protein interactions (Kobe and Deisenhofer 1994), and the EXT domain is thought to contribute to insolubility in the cell wall. These features make LRXs candidates for regulatory functions on the cell surface. In Arabidopsis, LRXs are implicated in root hair morphogenesis (Baumberger et al. 2001). PERKs represent another class of chimeric EXTs. They lack a signal peptide sequence, their SPn repeated motifs are predominantly located at the N terminus, and they have a protein kinase catalytic domain near their C terminus. In Arabidopsis, the PERK gene family contains 15 members, and PERK1 was localized at the plasma membrane (Silva and Goring 2002). Microarray data showed that there are two major groups of PERKs: those that are specifically expressed in the pollen and those that are generally expressed throughout all plant tissues (Nakhamchik et al. 2004). Research has shown that PERKs may affect cell expansion and normal floral organ formation (Haffani et al. 2006). In Arabidopsis, they are associated with an abscisic acid response (Bai et al. 2009). In Brassica napus, BnPERK1 is reported to be involved in signal perception and response to wound and/or pathogen stimuli (Silva and Goring 2002).

A third class of chimeric EXTs is the FH EXTs. FH EXTs are characterized by significant homology to formins and the presence of repeated Ser-Pro3-5 motifs. In , formins are associated with actin dynamics in that they control the assembly and elongation of unbranched actin filaments (Mao 2011; Chalkia et al. 2008). A fourth group of chimeric EXTs is termed “long chimeric EXTs” because of their extraordinary length of more than 2,000 amino acids. Lastly, some EXTs were characterized as “other chimeric EXTs” because they have an EXT domain and one or more domain(s) not known in HRGPs or in the above classes of chimeric EXTs.

Showalter et al. (2010 and unpublished data) identified the HRGP superfamily in Arabidopsis thaliana and Populus trichocarpa, in which 59 and 60 EXTs were found, respectively. In addition, Newman and Cooper (2011) identified numerous proline-rich tandem repeat proteins (TRPs), including EXTs, through a bioinformatics approach using EST and NCBI Non-Redundant protein sequence data from a number of plant species, but the search criteria for TRPs were not tailored for identifying EXTs. Thus, knowledge about the number and distribution of EXTs in the plant kingdom is still lacking.

BIO OHIO 2.0 is a newly revised and improved bioinformatics software program developed at Ohio University that was tailored to fulfill this task (Showalter et al. 2010; Lichtenberg et al. 2012). The program was designed and developed for protein identification based on amino acid signatures, such as biased amino acid composition and common HRGP amino acid motifs in the genome-encoded protein sequences (i.e., the predicted proteome). The program can also further analyze identified proteins by checking for the presence of potential signal peptide sequences and GPI anchor addition sequences and finding similar HRGPs via the Basic Local Alignment Search Tool (BLAST). Using this bioinformatics tool, Showalter et al. (2010) identified and classified the HRGP superfamily in Arabidopsis (Arabidopsis thaliana) and poplar (Populus trichocarpa), two fully sequenced plant genomes (Showalter et al. 2010; Lichtenberg et al. 2012 and unpublished data).

Rapid advances in next-generation sequencing (NGS) techniques are making genome sequences increasingly available. Thus, it is now feasible to conduct a more detailed analysis of the EXT family in the plant kingdom. Here, we analyzed 16 plant genomes: Ostreococcus lucimarinus (Palenik et al. 2007), Chlamydomonas reinhardtii (Merchant et al. 2007), Volvox carteri (Prochnik et al. 2010), Klebsormidium flaccidum (Hori et al. 2014), Physcomitrella patens (Physcomitrella patens v3.0, DOE-JGI, http://www.phytozome.net/Physcomitrella), Selaginella moellendorffii (Banks et al. 2011), Pinus taeda (Zimin et al. 2014), Picea abies (Nystedt et al. 2013), Brachypodium distachyon (International Brachypodium Initiative 2010), Zea mays (Schnable et al. 2009), Oryza sativa (Ouyang et al. 2007), Glycine max (Schmutz et al. 2010), Medicago truncatula (Young et al. 2011), Brassica rapa (Brassica rapa FPsc v1.3, DOE-JGI, http://www.phytozome.net/BrapaFPsc), Solanum lycopersicum (Tomato Genome Consortium 2012), and Solanum tuberosum (Xu et al. 2011). We also integrated previously studied data on Arabidopsis and P. trichocarpa to determine the number and distribution of the EXT family members in the plant kingdom and examine the evolutionary history of this fundamental cell wall constituent (Tan et al. 2012).

3.2 Materials and methods

3.2.1 Identification of EXTs

The predicted protein data files from 16 plant species (O. lucimarinus, C. reinhardtii, V. carteri, K. flaccidum, P. patens, S. moellendorffii, P. taeda, P. abies, B. distachyon, Z. mays, O. sativa, G. max, M. truncatula, B. rapa, S. lycopersicum, and S. tuberosum) were downloaded from the Phytozome website (www.phytozome.org). The protein database was searched for EXTs using the BIO OHIO 2.0 software, which integrates more functional modules than BIO OHIO 1.0 (Showalter et al. 2010; Lichtenberg et al. 2012). Briefly, a regular expression of two or more SPPP repeats was used to search for candidate EXTs. Candidate EXT sequences were then analyzed for the positions of SPn repeats and YXY cross-linking motifs, the presence of signal peptide sequences, the presence of GPI anchors, and for similar sequences using BLAST searches against known Arabidopsis EXTs. A candidate was classified as an EXT by comparing all the above information with the known features of each class of EXTs. If a sequence did not fit any class of EXTs, it was designated a potential EXT.
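The initial screening step described above (two or more SPPP repeats) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the BIO OHIO 2.0 source code: the function names, the dictionary-based proteome, and the toy sequences are hypothetical, and non-overlapping matching is assumed.

```python
import re

# Assumed pattern for a single repeat; matches are counted without overlap.
SPPP = re.compile(r"SPPP")


def count_sppp(seq):
    """Count non-overlapping SPPP occurrences in a protein sequence."""
    return len(SPPP.findall(seq.upper()))


def candidate_exts(proteome, min_repeats=2):
    """Return {protein_id: SPPP count} for sequences meeting the cutoff.

    `proteome` is a hypothetical mapping of protein IDs to sequences.
    """
    hits = {}
    for name, seq in proteome.items():
        n = count_sppp(seq)
        if n >= min_repeats:
            hits[name] = n
    return hits


if __name__ == "__main__":
    # Made-up toy sequences, not real proteins.
    toy = {
        "protA": "MKSPPPPYVYKSPPPPYVY",
        "protB": "MKSPPPAAG",
        "protC": "MAAGGLL",
    }
    print(candidate_exts(toy))
```

Candidates passing this filter would still need the downstream checks listed above (repeat positions, YXY motifs, signal peptide, GPI anchor, and BLAST similarity) before being assigned to an EXT class.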

3.2.2 BLAST analysis

The BLAST functional module was integrated into BIO OHIO 2.0. All candidate EXTs identified were subjected to NCBI BLASTP analysis using the current Arabidopsis protein BLAST dataset (November 2010 TAIR10 Genome Release) downloaded from The Arabidopsis Information Resource (TAIR; www.arabidopsis.org).

3.2.3 Signal peptides and GPI anchors

The functional modules for signal peptides and GPI anchors were integrated into BIO OHIO 2.0. All proteins were analyzed for signal peptides using SignalP (www.cbs.dtu.dk/services/SignalP/) (Petersen et al. 2011) and for GPI anchor addition sequences using the big-PI plant predictor (mendel.imp.ac.at/gpi/plant_server.html) (Eisenhaber et al. 2003).

3.2.4 Sequence alignment and phylogenetic analysis

Amino acid sequences were aligned using the Geneious software program (http://www.geneious.com/) to obtain conserved domains. Aligned sequences of LRXs, PERKs, and FH EXTs were input into MEGA 6 for phylogenetic analysis using the maximum likelihood and maximum parsimony methods (Tamura et al. 2013). For LRXs, the analysis involved 78 protein sequences, and there were a total of 294 amino acid positions in the final dataset. The evolutionary history inferred by the maximum likelihood method was based on the JTT matrix-based model (Jones et al. 1992). The evolutionary history inferred by the maximum parsimony method used the Tree-Bisection-Regrafting (TBR) algorithm with search level 1, in which the initial trees were obtained by the random addition of sequences (10 replicates) (Nei and Kumar 2000). The bootstrap consensus tree inferred from 1000 replicates is shown, and branches corresponding to partitions reproduced in less than 50% of bootstrap replicates are collapsed. For the phylogenetic analysis of PERKs and FH EXTs, the same methods were used as for the analysis of the LRXs. The analysis of PERKs involved 93 protein sequences, and a total of 283 amino acid positions were present in the final dataset. The analysis of FH EXTs involved 76 protein sequences, and a total of 377 amino acid positions were present in the final dataset.

3.3 Results

To identify candidate EXTs, the BIO OHIO 2.0 program searched for protein sequences with two or more SPPP repeats in 16 plant proteomes: O. lucimarinus, C. reinhardtii, V. carteri, K. flaccidum, P. patens, S. moellendorffii, P. taeda, P. abies, B. distachyon, Z. mays, O. sativa, G. max, M. truncatula, B. rapa, S. lycopersicum, and S. tuberosum. This initial screening obtained 2563 candidate EXTs, among which 758 were classified as EXTs and 1804 as potential EXTs. The EXTs include 87 classical EXTs, 97 short EXTs, 61 LRXs, 75 PERKs, 54 FH EXTs, 38 long chimeric EXTs, and 346 other chimeric EXTs (Fig 3.1). In addition to having at least two SPPPs, these EXTs contain an HRGP domain that is rich in Pro, Alanine (Ala), Valine (Val), Ser, Glycine (Gly), and Threonine (Thr), and most proteins (76%) have predicted signal peptide sequences that direct them into the secretory pathway and ultimately to the cell wall. A representative EXT sequence from each class is shown in Fig 3.2. Detailed sequence feature analysis of identified EXTs for each species is shown in Table S1-S16 (Liu et al. 2016). All the identified EXT sequences are shown in Fig S1 (Liu et al. 2016); the sequences of potential EXTs are shown in Fig S2 (Liu et al. 2016). These potential EXTs have at least two SPPP repeat motifs but mostly lack a signal sequence and an HRGP domain.

Figure 3.1 Phylogenetic distribution of EXTs in selected plant genomes. The dendrogram shows the evolutionary relationships of the selected species, which represent major plant divisions. The distribution of EXTs identified in this study and in the previous literature (Showalter et al. 2010, and unpublished data) is shown. EXTs are divided into seven subclasses: classical EXTs, short EXTs, LRXs, PERKs, FH EXTs, chimeric EXTs, and long chimeric EXTs.

Classical EXT >Bra011019|PACid:22722048 (Brassica rapa) MANPNWPSLLMLVLALFTIVVHSSAQYSPPSPPPYAYSYPWLPPYVYKSPPYAYSSPPPPPYVYN SPPPPPYVYSSPPPPPYVYKSPPPPPYVYSSPPPPPYVYKSPPPPPYVYKSPPPPPYVYSSPPPP PYVYKSPPPPPYVYKSPPPPPYVYSSPPPPPYVYKSPPPPPYVYKSPPPPPYVYSSPPPPPYVYK SPPPPPYVYKSPPPPPYVYSSPPPPPYVYKSPPPPPYVYSSPPPPPYVYKSPPPPPYVYSSPPPP PYVYKSPPPPPYVYSSPPPPPYVYKSPPPPPYVYRSPPPPPYVYKSPPPPPYVYRSPPPPPYVYK SPPPPPYVYSSAPPPPYVYKSPPPPPYVYNSPPSPSYIYSSPPPPSYSYSYSSPPPPIY

Short EXT >Bra009880|PACid:22695319 (Brassica rapa) MVSLNLSFALVFILAILFTFAEANYSRKLLQTPTNYQPAYSPPSPTPVYSPPVNPPPPTPVTYPP PTPAYPPPVALPPPAPINSPPPPAPIIPPLKANPSPQAYRAFYYRKSPPPPSGKPWWWLL

LRX >Bra017617|PACid:22708869 (Brassica rapa) MAKPPSFVCCIFLLFFFLLSSSFVAFALTDTEAAFIVQRQLLTLPANGELPDDIEYEVDLKATFA NTRLKRAYIALQTWKKAFFSDPFNTTGNWHGPHVCGYNGVVCAPALDDPDVTVVAGVDLNGADIA GHLPAELGLMTDVALFHLNSNRFCGIIPKSFEKLKLMHEFDVSNNRFVGPFPEVVLAWPDVKFID LRFNDFEGQVPSELFKKELDAIFLNNNRFTSTIPESLGESPATVVSFANNKFTGCIPKSIGNMKN LNEVVFMDNKLGGCFPSEIGKLSNVTLFDASKNTFIGRLPTSFVGLTGVEEFDISENKLTGLVAD NICKLPNLVNFTYSYNYFNGQGGSCVPGGGRKEIELDDVRNCLPHRPDQRSAQECAVVISRPVDC SKDKCAGGGSSTPSRPSLVHKPSPVPTTPVQKPSPVPTTPVQKPSPVPTTPVHKPSPVPTTPVHK PSPVPTRPVHEPQPPKKSPQPDDPYDQSPVGNRRSPPPPHESQPPVVFSPPPTPVSSPPLLSPPL PSPPPPVYSPPPPVHSPPPPVNSPPPPVHSPPPPVHSPPPPVHSPPPPVHSPPPPVHSPPPPVHS PPPPVHSPPPPVHSPPPPVHSPPPPVHSPPPPVHSPPPPVHSPPPPVHSPPPPVHSPPPPVHSPP PPVHSPPPPVHSPPPPVHSPPPPVHSPPPPPVYSPPPPVFSPPPPPVHSPPPPVHSPPPPVHSPP PPPVNSPPPPPPVEKKEVPLAHEPAPSDEFILPPFVGHQYASPPPPMFPGY

PERK >Bra028192|PACid:22721050 (Brassica rapa) MSLVPPLPILTPPPSSNSSTTSSPPPSSSPPTTPLVPPPVTPPPSPPVPSSSPPPPVISSPPPSS SPPPPVVSSPPPAAASSPPPPPVVVASPPPSTPPPAPPQDSSPPPPPESSPPAPTTTSSPPPPPS NTPSPPKPSPSPPSDTRSPPPPPSSDKPSPPPPGSTSTPPPASHPTDPAALAPPPTPLPLVPREK PTPPASPNANGNNTSSSSPSGVGTGGIVAIGVIVGLLLLSLFVLALWLTRKRKRKDPGAFVGYTM PPSGYSSPQGSDAVLFNTNSSAPNNKMRSHSGNDYMYASSDSGMVSNQRSWFSYDELAQVTNGFS QKNLLGEGGFGCVYKGVLSDGREIAVKQLKIGGSQGEREFKAEVEIISRVHHRHLVTLVGYCISE QHRLLVYDYVPNNTLHYHLHAPGRPVMTWETRVKVAAGAARGIAYLHEDCHPRIIHRDIKSSNIL LDNSFEALVADFGLAKIAQELDLNTHVSTRVMGTFGYMAPEYATSGKLSEKADVYSYGVILLELI TGRKPVDTSQPLGDESLVEWAKPLLSQAIENEEFGELVDPRLGVNFIAAEMFRMVEAAAACVRHS AARRPKMSQVVRALDTLEEASDITNGMRPGQSQVYDSRQQSAQIRMFQRMAFGSQDYSSDFFDRS QSHSSWGSRDTRDQSKFVP

FHX >Bra016233|PACid:22702140 (Brassica rapa) MAAMLHQPWPFPLQVTILLFLCVVVLPYHSFSQSDSPQNIETFYPTAELSPVPPPINPSTPSSPP SNQSSSSDRGTITKAVLITAASTVVLAAAFFFFLQKCVIARRRRRRENRINVQNTLPPYPPTTTT VAATETLAREGFTRFGGVKGLILDENGLDVLYWRNLQSQRQRRSGSFRKEIVVAGGEEGGSEEKE VIYYKNKKKMEQVTEVPLLRGRSSTSHSIIHNDSYELESPPPPPPPPSIPAKKTVPAPSPPPKKG

PSPPPPPIPVNKTAPPPPPPPSPPPTPMKKTAPAPPPPRPSPSPPPPPPVKKAAALSSSGSRPPP APRGSSGGESSRGGGSGQVKLKPLHWDKVNPDSDHSMVWDKIDRGSFSFDGDLMEALFGYVAVGK KSPEHDKQKSNSPTKIFILDPRKSQNSAIVLKSLGMTREELVEALVEGHDFVPETLERLARIAPT KEEQSAILEFDGGDAGKLADAEAFLFHLLKAVPTAFTRLNAFLFRANYYPEIAHHGRSLQTLDLA CKELRSRGLFVKLLEAILKAGNRMNAGTARGNAQAFNLTALLKLSDVKSVDGKTTLLNFVVEEVV RSEGKRCVLNRRSNSLTRSNSNRSSSSSSSNNAPQAMSKEEQEKEFLKLGLPIVGGLSSEFSNVK KAASVDYDTIVATCSALAVRAKDARKVIAQCEGGRFVEKMMTFLDSVEEEVKTAREEERKVMELV KRTTEYYQAGASVKGKNPLHLFVIVRDFLAMVDKVCLDIMRNVQRRKVTSSSESTSTQRNAVKFP VLPPNFMSDRSRSDSGESDSDM

Long Chimeric EXT >g7021.t1|PACid:27567910 (Chlamydomonas reinhardtii) MRAGMAGVWVLLLVAAAHQAAAAGNGNGNGNNAATTTTTTTTTVVPVGSTSTAATIQPLYTLGNN GCPINSYNYCIAKSKSGCAACGTVNGVNSTGPTTGNNGCKTIDGFGQNVQACRMPFGAITVRFPN GTKYTEDFGGIVPQSEICTGSIAATPVFAYAGTGEENRVGTLVTYRTQAGWLVATVRLDCPYLSW FSYNGKDAPGATIVDAYNSVYNLTTKQTTANVTTRRRNLLQANKASTTTCPSLTTDGSVLYGNGN GQYQNGKPDARSYIALSYKVDGAVTAFSIPTYRYQNNGDGSNNVEDNGGDGWYSCYTFMHQVSFG SKNTDTCATINMNIGLFMSQIFMVSSDGKQCAVKAKTSIVTPPTTTTPTTSVVDPPVTITTVTAT PPTPSPSPSPSPPSPAPSPPPSPAPSPLPPSPTPSPPPSPAPSPRPPSPTPSPPPSPPPSPPPSP TPSPPPPVPVPSPSPPPSPTPSPPPSPPPSPRPPSPTPSPPPPSPRPSPPSPAPSPPPSPPPSPA PPTPGSVTCAANGGNCDKLVQTTDQNGQGKCWVNAATGEYAFCMYSDPGNNTAVGVDPNGNINTY TGVTTPNEQNNGGPILPPSPPPPSPSPPSPPPPSPKPPSPTPPPSVSPSPPPFPSPPPSPSPPSP PPPSPPPPSPSPPPPSPPPPSPPPTPSPSPPPSPPPSPSPPPPSPPPPSPEPPSPEPPSPSPPPP SPGTDSKDTRSEGQVQIIKSTGEDEILPNIDSPPLIPSDTQPSTGPTSAAVEGTGSFTEVVPDCV ENCAIVGGPGKVEGGSAVGVPVEELKNVDTNGGMSVPPTVPEVPVPTATVVTNTDGSTTTTYPNG TVVIEWNDGTTKTTTTDGTTSTTKVDGTEIVTTPNGTETTTKPNGEVITKYPNGTIVDEKPGGTT ITTTPDGKTVTESPGGTVTTVTPTPSGGTNTTVEQPNGSTSTTIKDSTGSTTSSSTKQPDGTNST STTTSGGTTTVVTYPNGTIDTTNPDGKEIISYPNGTTVTKNTDGSTSTTTSSGTTTVTNPDGSTK VTTSDGTSTTTYSNGTVVTVKTDNTTTYTYPNGTTSTTTPDTTVTVKYPDGSSTVTQPGGTTIVT TTDKTVTVTEPDGTKTVITVDGTKTVTEPDGTKTVEKPDGQVTIVEPDGTVTIKTPDGKVTVTEP NGVTTVTTTNGTTTITDPNAGTVKEIPPSGNPVTTTTSDGTKTVVDGTTGTTTTTTPGGDVTVTT KDGTTNTNRTDGVVIVTTPDGTTTTTVPGGTTVTEFINGTKIVDDKSTGTTTVSYPNATTVTTTK DGTITTVDADGTTNTTTSTGTTTIWYPDGTTVIDPPGVGNVTVIYINGTVEIIAPNGSESITYPN GTSKDVLPNGSTITTYPNGTVITTTPGGTPQITQPPTTTTTTTDNSKVTTLPDQTTVTTKPDGST QIQTQEGVNITTDKTGVTTVTPPTTTPEVTPPVTVVKPDGSSTQKDPDNTVTTQDTTGKVTTVIP SNDPTVPPEVIVNNPATGETIRTVPEIPIVPDVVTPTPSPSPSPSPSPSPSPSPSPSPTPTPPVE APVTVITVRDSNTDTPLVAIITSNGDTTVQTSNGTVVTTEVETGQTTTVTTDGIVKVQDPSTGTT SVTNTTTGQTTTTIAVPVPAPEPTGPFCSCPPTVANGATCSSGTPLEWYTRTVKPDGSFQDNLVA TQCVDAPTVTNADTGAVVFSFCWRRACITQGKTLDNEALFFKAPAVALPAKYQASSSYGIALTIL RPSCATNAAATGHTVAANFMPLDFNEAETVETTQASGGLLNFDAKIVAPGGCDDTGATYMLNIMA ISLVETVYNQCPKHAIPDGTNGNADCTFPLVPVYLDITDANFVKDPSFPTYKQCVDPCNFDLFSG 
ELACSYCWSFGQLPPQFSSNYFVIELRNVTKYNIDLMAGGTYRSLVNQPQLIMPDGKKVEVTFTL NPKDEIKGKGAKAVVFTTNGNYTLPDGVAPVSSIGDIRGDFTKTTKYGGIHDVNMQIKVPRGLDD QSVSVQMFGSPNPPPPEVLCPCPTPLNSEGDCDGTLTKIKLYPSDNSTTFVEGCIPSPYWDGAKG ELMISHCWRASCLENPPFFTPTAAKPNSPFYRKDYDIEMLLTDKYNIAAVPSGKHELSFFKPKVV DMFVQFQYVDRARPTETAQFSSRDGQVQSITVPAGGIAKMGLRMKVPFNPIYDINSASANLNTDP SRLIPPPPPPLAITNYSYTINRNAQVLRSTVLIDDLLNVTTAASNGSDAASGAFNFTQVKSTFAR DFSNDIDMSMSGYIMSQDISFPDYGLVQSCTSADVQAVINKIAAKYGVSPDSIAASCQYDYKKPD AVTTVTAAHRRRRGRSLQQTTVTTPAVCYPRISITVSMPLGVPMDGNVTKALAAATTAADAVYDA

VGTDACYVPTGAESRVGTVVAFTTTNSKATCEYITQAFANAGNFTADMIRTYNCGEENPKKPYPP GLIALWCVLGVLGVVVIGVVTYLGVSWRKRRAANLQFVNQEGAISGGLPKVAPSPSSLAMGGNDY KYSTKAAAVAEGLQRPDTPTRYSPRAAVPAIERNRVVPV

Other Chimeric EXT >Bra038210|PACid:22704542 (Brassica rapa) MDLAKHTTLQMLGFILLASLVLTMAQPPGLTKPSHATCKIKKYKHCYNLEHVCPKFCPDTCHVEC ASCKPICGPASPGDDGGDTPPTPVPPVSPPPPAPVPPVSPPPPVTPTPSYPTPTDPLPPAPVSPP PPAPVPPVSPPPPTPTPYVPSPTPPVSPPPPSPTPDVPSPTPPSSPPPTTPTPAVPSPTPSSPPP PSPTPAVLTPPHVTPTPPTPAVPSPPDVTPTPPTPSVPSPDTPTAPLPPYSPPATPAPSVPSPTP TPPSSPTPPGSTPTTPTPSVPTPSPSVPVPSAPNSPPYVPPSSPTPTPPSDGEAGAGVRRARCKK KGSPCYGVEYSCPSACPRSCEVDCVTCKPLCNCDKPGSVCQDPRFIGGDGLTFYFHGKKDSNFCL ISDPNLHINAHFIGKRRPGMARDFTWVQSIAVLFGTHRFYVGALKTATWDDSVDRISASFDGNVI SLPQLDGATWTSSPGVYPQVSVKRVNADTNNIEVEVEGLLKITARVVSITMEDSRIHGYDVKEDD CLAHLDLGFKFQDLSDNVDGVLGQTYRPNYVSRVKIGVHMPVMGGDREFQTTGLFAPDCSAARFI GNGGRNGGWSKMELPEMSCASGVGGKGVVCKR

Figure 3.2 Protein sequences encoded by representative EXT gene classes in Brassica rapa and Chlamydomonas reinhardtii. Green colored sequences at the N terminus indicate predicted signal peptides. SP3 (blue), SP4 (red), SP5 (purple), and YXY (dark red) repeats are also indicated. Sequences typical of AGPs (AP, PA, SP, TP, VP, and GP repeats) are also indicated (yellow).

3.3.1 Classical and short EXTs

Classical EXTs are categorized as having EXT domains throughout the protein sequence, except at the N terminus, where there is usually a signal peptide that directs the protein into the secretory pathway and ultimately to the cell wall. The EXT domains contain repeated SPn motifs, where n ≥ 3. Moreover, most classical EXTs have cross-linking YXY motifs in addition to the SPn motifs.

In this study, no classical EXTs were identified in the five earliest diverging species (O. lucimarinus, C. reinhardtii, V. carteri, K. flaccidum, and P. patens), indicating that classical EXTs are absent in these non-vascular species that are either aquatic green algae (O. lucimarinus, C. reinhardtii, V. carteri, K. flaccidum) or land plants that are dependent on water for reproduction, lacking , and predominantly living in humid habitats (P. patens). However, classical EXTs were found in tracheophytes (vascular plants), including early diverging members: 13 were identified in the lycophyte S. moellendorffii, 11 of which contained YXY motifs. S. moellendorffii EXT11 was found to share high similarity with all 12 of the other EXTs, indicating the likely occurrence of gene duplication events (data not shown).

Despite the presence of classical EXTs in tracheophytes dating back more than 420 million years before present (MYBP), classical EXTs were nearly absent from the genomes of the two gymnosperm species and the three monocot species examined here. No classical EXTs were identified in loblolly pine (P. taeda), and only one classical EXT (MA_74039g0010) was identified in Norway spruce (P. abies). Furthermore, while this protein contains 35 SPPP3-5, it lacks a signal peptide. Similarly, no classical EXTs were identified in corn (Z. mays) or rice (O. sativa), while B. distachyon, a non-crop species, contained only one apparent classical EXT, Bradi3g10280. This protein contains 11 SP3 and three SP4 repeats, along with 19 QAAA repeats, which are not known to be associated with any other EXTs or HRGPs. In addition, a BLAST search with Bradi3g10280 as the query found no hits of significant similarity to other protein sequences (data not shown). These findings are consistent with two previous studies in monocots that found a lack of the SPn repeat motif in Z. mays (Stiefel et al. 1988; Kieliszewski et al. 1990).

Classical EXTs, however, were ubiquitous in eudicots. In this project, five species were chosen for analysis: G. max, M. truncatula, B. rapa, S. lycopersicum, and S. tuberosum. Classical EXTs were found in all these species. In addition, previous research reported the identification of classical EXTs in Arabidopsis and poplar (Showalter et al. 2010 and unpublished data). An overview of the number of classical EXTs identified in these plants is shown in Fig 3.1.

The frequency of SP3, SP4, and SP5 (or more) repeats in classical EXTs among the above species was calculated to determine which of these repeat motifs is dominant in classical EXTs. The results showed that SP4 repeats universally predominated in EXT sequences, with the lowest proportion in the lycophyte S. moellendorffii (56%) and the highest in S. tuberosum (80%) (Fig 3.3). However, the dominance of the SP4 repeated motif is not seen in other categories of EXTs (data not shown).

Figure 3.3 The frequency of SP3, SP4, and SP5 repeats in classical EXTs of selected genomes. The frequency was calculated as the total number of each type of repeat divided by the combined total number of SP3, SP4, and SP5 repeats in each species.
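The frequency calculation described in the Figure 3.3 legend can be sketched as follows. This is a minimal illustration rather than the procedure actually used to produce the figure: it assumes that each maximal run of a Ser followed by three or more Pro residues counts as one repeat, and that runs with five or more Pro are pooled as SP5. The function names are hypothetical.

```python
import re

# A Ser followed by a maximal run of three or more Pro residues.
SPN_RUN = re.compile(r"SP{3,}")


def spn_counts(sequences):
    """Count SP3, SP4, and SP5 (five or more Pro) repeats across sequences."""
    counts = {"SP3": 0, "SP4": 0, "SP5": 0}
    for seq in sequences:
        for match in SPN_RUN.finditer(seq.upper()):
            n_pro = len(match.group()) - 1  # subtract the leading Ser
            key = "SP5" if n_pro >= 5 else f"SP{n_pro}"
            counts[key] += 1
    return counts


def spn_frequencies(sequences):
    """Divide each repeat count by the combined SP3 + SP4 + SP5 total."""
    counts = spn_counts(sequences)
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: value / total for key, value in counts.items()}


if __name__ == "__main__":
    # Made-up toy sequence standing in for the classical EXTs of one species.
    print(spn_frequencies(["SPPPASPPPPASPPPPASPPPPP"]))
```

Applying `spn_frequencies` to all classical EXT sequences of one species gives one set of bars of the kind plotted in the figure.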

The average number of YXY motifs in classical EXTs and non-classical EXTs (i.e., all other classes of EXTs) was calculated to confirm the observation that YXY motifs are abundant exclusively in classical EXTs. As shown in Fig 3.4, the average number of YXY motifs in classical EXTs ranges from 5.7 (in S. lycopersicum) to 27.5 (in B. rapa), whereas fewer than two occurrences of the YXY motif were found in non-classical EXTs in all species studied.

Figure 3.4 Average number of YXY motifs in classical and non-classical EXTs. The average was calculated as the total number of YXY motifs divided by the total number of classical or non-classical EXTs, respectively, in each species.
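The averaging described in the Figure 3.4 legend can be sketched in the same way. The code below is an illustration under stated assumptions, not the original analysis: it counts overlapping Tyr-X-Tyr occurrences with a lookahead pattern (so YAYSY counts as two motifs), which may differ from how motifs were counted for the figure, and the function names are hypothetical.

```python
import re

# Zero-width lookahead so that overlapping motifs (e.g. YAYSY) are all counted.
YXY = re.compile(r"(?=Y[A-Z]Y)")


def count_yxy(seq):
    """Count Tyr-X-Tyr motifs, including overlapping ones."""
    return len(YXY.findall(seq.upper()))


def mean_yxy(sequences):
    """Total YXY motifs divided by the number of sequences in the group."""
    sequences = list(sequences)
    if not sequences:
        return 0.0
    return sum(count_yxy(seq) for seq in sequences) / len(sequences)


if __name__ == "__main__":
    # Made-up toy groups, not real EXT sequences.
    classical = ["SPPPPYVYKSPPPPYVYK", "SPPPPYAYSY"]
    non_classical = ["MSPPPAASPPPGG"]
    print(mean_yxy(classical), mean_yxy(non_classical))
```

Running `mean_yxy` separately on the classical and non-classical EXTs of a species yields the pair of values compared in the figure.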

Short EXTs were found in most species in this study but no short EXTs were identified from the O. lucimarinus, V. carteri, K. flaccidum, and P. taeda genomes.

Interestingly, two short EXTs were identified in the aquatic species, C. reinhardtii. Unlike classical EXTs, short EXTs were also found in P. patens. Overall, there is a slight increase in the number of short EXTs in the embryophytes, which indicates the importance of this group of proteins in plant growth, development, and defense (Fig 3.1).

3.3.2 LRXs, PERKs, and FHs

Leucine-rich repeat extensins (LRXs) are a group of chimeric EXTs. A typical LRX has an N-terminal signal peptide, followed by a leucine-rich repeat (LRR) domain and a C-terminal EXT domain. A representative structure of an LRX is shown in Fig 3.5. In this study, LRXs were found in all but the four algal species (O. lucimarinus, C. reinhardtii, V. carteri, K. flaccidum) and one gymnosperm (P. taeda). B. rapa, with 15 LRXs, had the highest number of any species. However, most species contain two to seven LRXs (Fig 3.1 and Fig S1) (Liu et al. 2016). Five LRXs were identified in P. patens, suggesting the possibility that LRXs were derived during plant terrestrialization (and subsequently lost from some species, e.g., P. taeda). A BLAST search against all the LRXs identified in this study revealed that these five LRXs share more homology among themselves than with any LRXs identified in other species, indicating that they are likely paralogs derived from gene duplication events. Interestingly, of the two gymnosperm species included in this study, two LRXs were identified in P. abies, while none were found in P. taeda. Although Picea and Pinus are closely related genera of the Pinaceae, evidence from chloroplast, mitochondrial, and nuclear genes indicates that they diverged ~140 million years ago, so it is reasonable that differences in LRXs may exist between them (Wang et al. 2000). LRXs were identified in all flowering plants in this study, with eudicots having more LRXs in general.

Figure 3.5 Structural schemes of classical EXTs, LRXs, and PERKs. Classical EXTs have an N-terminal signal peptide (green) followed only by an EXT domain (red). LRXs have an N-terminal signal peptide (green) followed by a leucine-rich region (LRR, blue). The EXT domain (red) of LRXs is located at the C terminus. The EXT domain of PERKs is located at the N terminus, followed by a transmembrane domain (TMD, black) and a kinase domain (yellow).

To explore the evolutionary relationship of LRXs in different species, phylogenetic analysis was conducted using the maximum likelihood method based on the JTT matrix-based model. The phylogenetic analysis showed that all LRXs in the moss P. patens were clustered together as the outgroup. The rest of the LRXs fell into five major clades. Among them, all eudicot LRXs fell into clades A and B (G. max, M. truncatula, P. trichocarpa, A. thaliana, B. rapa, S. lycopersicum, and S. tuberosum), while clades C and D contained all the monocot LRXs (O. sativa, Z. mays, and B. distachyon). This topology indicates that either LRXs in monocots and eudicots went through quite different changes, or one or more clades of LRXs were derived after the divergence of monocots and eudicots. Notably, previously reported PEXs were found only in clade E, which contained all PEXs from Arabidopsis (AtPEX1-4), two O. sativa PEXs (OsPEX1 and OsPEX3), one Z. mays PEX (ZmPEX1), and one S. lycopersicum PEX (LePEX1). Therefore, it is likely that ancestral PEX gene(s) existed before the division of monocots and eudicots, and that the rest of the LRXs in clade E may also be PEXs (Fig 3.6). Phylogenetic analysis was also conducted using the maximum parsimony method (Fig S3) (Liu et al. 2016), which showed a tree topology almost identical to that inferred by the maximum likelihood method. The aligned sequences of LRXs are shown in Fig S4 (Liu et al. 2016).

Figure 3.6 Maximum likelihood analysis of LRXs. The evolutionary history was inferred by using the Maximum Likelihood method based on the JTT matrix-based model. The bootstrap consensus tree inferred from 1000 replicates is taken to represent the evolutionary history of the taxa analyzed. Branches corresponding to partitions reproduced in less than 50% bootstrap replicates are collapsed. The analysis involved 78 amino acid sequences. There were a total of 294 positions in the final dataset. The green-colored fan area (clades A and B) indicates LRXs from eudicot species. The yellow-colored fan area (clades C and D) indicates LRXs from monocot species. The blue-colored fan area indicates the possible PEX clade. Previously reported PEXs are either marked with a red “☆” or shown in parentheses.

Proline-rich extensin-like receptor kinases (PERKs) represent another group of chimeric EXTs. They have an extracellular EXT domain at the N terminus followed by a transmembrane domain and an intracellular receptor kinase domain (Fig 3.5). PERKs were identified in most of the species in this study but not in O. lucimarinus, C. reinhardtii, V. carteri, K. flaccidum, S. moellendorffii, and P. abies. Notably, BLAST analysis of two K. flaccidum proteins (kfl00031_0230 and kfl00671_0010p) revealed that they were similar to Arabidopsis PERKs, but closer inspection showed that they differed from the general PERK structure, and they were thus classified as chimeric EXTs (Table S4 and Fig 5) (Liu et al. 2016). A number of PERKs were identified in P. patens, suggesting that PERKs were derived after the terrestrialization of plants. With the exception of P. taeda, at least five PERKs were identified in each of the other species, with G. max and B. rapa having as many as 14 PERKs. To explore the evolutionary history of PERKs in these species, phylogenetic analysis was conducted using the maximum likelihood method based on the JTT matrix-based model (Fig 3.7). The phylogenetic tree shows that PERKs from these species form two dominant clades (clades A and B). Expression pattern analysis of PERKs in Arabidopsis showed that some of the Arabidopsis PERK members were pollen-specific genes while others were more broadly expressed (Nakhamchik et al. 2004). In this tree, five AtPERKs that are pollen-specific, namely PERKs 3-7, were clustered in one sub-clade under clade B. However, other pollen-specific PERKs were also seen elsewhere in the tree (AtPERKs 11 and 12). The phylogenetic tree showed that most groups at the tips were from either the same or closely related species, and the tree had high confidence at the tips in general. PERKs from monocot plants did not form a single branch, but they tended to cluster together. As with the LRXs, the tree topology inferred by the maximum parsimony method was nearly identical to that inferred by the maximum likelihood method (Fig S5) (Liu et al. 2016). The aligned sequences of PERKs are shown in Fig S6 (Liu et al. 2016).

Figure 3.7 Maximum likelihood analysis of PERKs. The evolutionary history was inferred by using the Maximum Likelihood method based on the JTT matrix-based model. The bootstrap consensus tree inferred from 1000 replicates is taken to represent the evolutionary history of the taxa analyzed. Branches corresponding to partitions reproduced in less than 50% bootstrap replicates are collapsed. The analysis involved 93 amino acid sequences. There were a total of 283 positions in the final dataset. A and B represent two major clades.

Formins were first found in animal cells as cytoplasmic proteins that are associated with the organization of the actin cytoskeleton. Plant formin homologs are quite different in structure and function (Deeks et al. 2002). Plant formins may play important roles in cell cortex organization, including cortical actin, microtubule cytoskeletons, and the attachment to the plasma membrane (Fatima 2012).

In this study, formin homolog EXTs (FH EXTs) were categorized as a third group of chimeric EXTs, as some of the formin homologs were found to have an N-terminal signal peptide and contain a number of SPn repeats. FH EXTs were found in all species except three of the algal species investigated: O. lucimarinus, C. reinhardtii, and K. flaccidum. Interestingly, one FH EXT (Vocar20008550m) was found in V. carteri, the remaining algal species examined, suggesting that plant formin homologs were derived before the divergence of the embryophytes. In general, an increase in the number of FH EXTs was found in higher plants, with B. rapa having as many as 11 FH EXTs. To explore the evolutionary history of FH EXTs in these species, phylogenetic analysis was conducted using the maximum likelihood method based on the JTT matrix-based model (Fig 3.8). Notably, the FH EXT identified in V. carteri was placed as the outgroup given the considerable difference between this sequence and other FH EXTs upon alignment. Similar to the analysis done by Deeks et al. (2002), phylogenetic analysis revealed that FH EXTs were clustered into two major clades with 100% confidence. Clade A contained Arabidopsis FH1-11, while clade B contained AtFH 12-21. In general, FH EXTs from the same or closely related species tended to cluster together, and higher confidence was shown in deeper branches (with over 80% bootstrap support). Phylogenetic analysis was also conducted using the maximum parsimony method (Fig S7), which showed a tree topology almost identical to that inferred by the maximum likelihood method. The aligned sequences of FH EXTs are shown in Fig S8 (Liu et al. 2016).

Figure 3.8 Maximum likelihood analysis of FHs. The evolutionary history was inferred by using the Maximum Likelihood method based on the JTT matrix-based model. The bootstrap consensus tree inferred from 1000 replicates is taken to represent the evolutionary history of the taxa analyzed. Branches corresponding to partitions reproduced in less than 50% bootstrap replicates are collapsed. The analysis involved 73 amino acid sequences. There were a total of 377 positions in the final dataset.


3.3.3 Long chimeric EXTs and other chimeric EXTs

A group of chimeric EXTs was characterized by an enormous number of amino acids, usually over 2,000 amino acids per protein. This group of EXTs was referred to as “long chimeric EXTs”. A few such EXTs were previously reported in C. reinhardtii (Woessner and Goodenough 1989; Lee et al. 2007). In this project, 27 long chimeric EXTs were found in the C. reinhardtii genome. In addition, four and seven long chimeric EXTs were identified in O. lucimarinus and V. carteri, respectively, consistent with the total of 38 long chimeric EXTs identified in this study (Fig 3.1-3.2; Fig S1; Table S1-S3) (Liu et al. 2016). Long chimeric EXTs, however, were not found in any other species analyzed in this study. Other chimeric EXTs were defined as chimeric EXTs that did not belong to the LRXs, PERKs, FH EXTs, or long chimeric EXTs. In this study, chimeric EXTs were identified in all but the two gymnosperm species; however, the number of chimeric EXTs in each species varied greatly (Fig 3.1). While most other species have no more than 21 chimeric EXTs, the two flagellated chlorophyte species, C. reinhardtii and V. carteri, contained 153 and 84 chimeric EXTs, respectively.

3.3.4 Comparison with previously identified EXTs

The literature was searched to compare previously characterized EXTs with EXTs identified in this study. A total of 54 previously characterized EXTs were obtained, of which 32 were from species included in this study. Among these 32 EXTs, 24 were found to be identical or similar to EXTs identified in this study. The previously identified EXTs and their counterparts in this study are listed in Table S17 (Liu et al. 2016).

3.4 Discussion

3.4.1 Bioinformatic identification of plant EXTs using BIO OHIO 2.0

The first comprehensive bioinformatic identification of EXTs was done in Arabidopsis, in which 59 EXTs were identified, followed by poplar, where 60 EXTs were found (Showalter et al. 2010 and unpublished data). In another study, Newman and Cooper (2011) identified numerous proline-rich tandem repeat proteins (TRPs), including EXTs, from proteomic databases, but they adopted more stringent search criteria, which resulted in fewer EXTs being identified in their study. To date, the number and distribution of EXTs in the plant kingdom remain poorly understood. With a rapidly increasing number of fully sequenced plant genomes, vast amounts of data are being produced, and there is a need to process, mine, and analyze these genome data to extract biological meaning.

The BIO OHIO 2.0 software program provides an efficient and reliable tool to identify proteins with biased amino acid compositions and known repetitive motifs (Showalter et al. 2010; Lichtenberg et al. 2012). The newly revised and improved 2.0 version integrates additional functional modules that search for the presence of a signal peptide and a GPI anchor and run automated BLAST searches against the Arabidopsis proteome. These improvements make the program an ideal bioinformatic tool to study cell wall components and to gain insight into the evolution of protein families.

In this project, a total of 758 EXTs, including 87 classical EXTs, 97 short EXTs, 61 LRXs, 75 PERKs, 54 FH EXTs, 38 long chimeric EXTs, and 346 other chimeric EXTs, were identified among over half a million predicted protein sequences in 16 plant genomes ranging from primitive aquatic plants to eudicots. Moreover, the searches for a signal peptide and a GPI anchor, together with BLAST searches against the model plant Arabidopsis, were incorporated in the program, making it more robust and efficient for identification.

3.4.2 The origin and evolution of EXTs

The EXTs identified in this study suggest that EXT prototypes arose early in the evolution of green photosynthetic organisms, most likely in the form of chimeric EXTs, as they are found in both early diverging and later diverging species. This suggests that ancestral EXTs might have originated as a stretch of amino acids or a small EXT domain that conferred an evolutionary advantage and thus became more widespread over time. On the other hand, long chimeric EXTs were identified in only three of the most primitive species (O. lucimarinus, C. reinhardtii, and V. carteri) but are absent in all land plants, suggesting a lack of evolutionary advantage for these long molecules.

LRXs, PERKs, and FH EXTs predated the evolution of classical EXTs, as they are found in the land plant P. patens, where no classical EXTs were found (Fig 3.1 and Fig S5) (Liu et al. 2016). The origin of classical EXTs was possibly associated with plant vascularization: although they are absent from the (non-vascular) bryophyte P. patens, they are present in the (vascular) lycophyte S. moellendorffii. Further analysis is needed to address whether they are indeed associated with vascularization or alternatively with terrestrialization, since P. patens is the only bryophyte examined in this study. Interestingly, classical EXTs are absent or nearly absent in the gymnosperms P. taeda (0) and P. abies (1), as well as in monocots such as B. distachyon (1), Z. mays (0), and O. sativa (0). However, further analysis is needed to provide greater support for this conclusion, as it is possible that classical EXTs may be more abundant in other gymnosperm and monocot species. Eudicots have the greatest numbers of EXTs and the largest number of Tyr-X-Tyr cross-linking motifs, which largely occur in the classical EXTs.

Phylogenetic analysis, which incorporated previously identified EXTs in Arabidopsis and poplar, was conducted for LRXs, PERKs and FH EXTs, but not for classical EXTs, as the latter failed to align in the EXT domain owing to the highly variable number of SPn repeat motifs and Tyr-X-Tyr motifs; a reliable alignment is necessary for a meaningful outcome. Similarly, phylogenetic analysis was not done for short EXTs, chimeric EXTs, and long chimeric EXTs due to the heterogeneity of their non-EXT domains.

The phylogenetic analysis of LRXs included the LRR regions of 78 LRXs from 13 plant genomes. Five P. patens LRXs were clustered in the phylogenetic trees generated by both maximum likelihood and maximum parsimony methods, indicating that one ancestral gene duplicated multiple times. LRXs were found in almost all of the more advanced species in this study, demonstrating the widespread nature of this protein family.

Baumberger et al. (2003) reported that LRXs form two clades: reproductive, pollen-expressed LRXs, referred to as PEXs, and vegetative LRXs. In comparison, the analysis here showed that LRXs form three major groups: one group of eudicot-specific LRXs, one group of monocot-specific LRXs, and one group of likely PEXs. The existence of eudicot- and monocot-specific clades indicates that LRXs in monocots and eudicots evolved quite distinctly from ancient LRXs. The PEX-specific clade contains LRXs from both monocots and eudicots, indicating that ancestral LRXs duplicated and diversified to become pollen-specific LRXs before the division of monocots and eudicots (Baumberger et al. 2003).

The PERK phylogenetic analysis included the receptor kinase domain of 93 protein sequences from 12 plant genomes. Similar to LRXs, PERKs identified in P. patens were clustered at the root of the phylogenetic tree by both maximum likelihood and maximum parsimony methods. However, PERKs were not found in the lycophyte S. moellendorffii or the gymnosperm P. abies. The phylogenetic tree shows that PERKs form two major clades, both of which included monocots and eudicots. This indicates that ancestral PERK genes existed before the division of monocots and eudicots. Expression pattern analysis of PERKs in Arabidopsis showed that some of the AtPERK members were pollen-specific genes while others were more broadly expressed (Nakhamchik et al. 2004). Here, pollen-specific AtPERK3-7 were clustered in clade B, but other pollen-specific PERKs were also seen in other branches. The phylogenetic tree shows that most groups at the tips comprise sequences from either the same or closely related species, and the tree has high confidence at the tips in general. This may be due to gene duplication events that lead to gene redundancy, as is seen in Arabidopsis (Shiu and Bleecker 2003; Champion et al. 2004).

Phylogenetic analysis of FH EXTs included the FH domain of 76 proteins from eleven plant species. All 23 formin homologs in Arabidopsis were included in this analysis, as it was of interest to see their distribution in the phylogenetic tree. Notably, only AtFH1, AtFH5, AtFH8, AtFH13, AtFH16, and AtFH20 contain two or more SPPP repeats. The phylogenetic tree showed two major clades (clades A and B) with high confidence, which was consistent with the study of Deeks et al. (2002), who showed that Arabidopsis FHs form two major types.

A major gap in our understanding of the evolution of EXTs in green plants results from the lack of significant genomic information for charophytes, i.e., the group of green algae ancestral and most closely related to modern-day land plants. However, immunological screening has revealed the presence of EXTs as well as other HRGPs, including arabinogalactan-proteins, in many of the charophyte taxa (Sørensen et al. 2011; Eder et al. 2008; Domozych et al. 2009). A previous study on cell wall biosynthetic pathways in charophytes (Mikkelsen et al. 2014) and this report have also shown the presence of EXT-like macromolecules in the charophyte K. flaccidum and in charophyte ancestors, including the prasinophyte (Ostreococcus) and the charophyte sister clade, the chlorophytes (Chlamydomonas and Volvox), thereby supporting the presence of EXTs in charophytes. Similarly, the lack of genomic information for bryophytes and ferns, with only one bryophyte sequence and no fern sequences available, makes some of the EXT distribution patterns difficult to interpret. For instance, classical EXTs were found in S. moellendorffii but not in P. patens, suggesting that the origin of classical EXTs might be associated with vascularization. However, since only one bryophyte and one lycophyte genome were investigated, it is possible that P. patens is an exception and that classical EXTs are present in other bryophytes, in which case classical EXTs may be associated instead with terrestrialization. A significantly more resolved interpretation will be possible as more genomic data for charophytes, bryophytes, and ferns become available.

3.5 Conclusions

A revised and newly improved bioinformatics software program, BIO OHIO 2.0, was utilized to identify and classify EXTs from the predicted proteomes of 16 plant species. A total of 758 EXTs were identified, including 87 classical EXTs, 97 short EXTs, 61 LRXs, 75 PERKs, 54 FH EXTs, 38 long chimeric EXTs, and 346 other chimeric EXTs. Analysis of these data revealed that: (1) classical EXTs were likely derived after the terrestrialization of plants; (2) LRXs, PERKs, and FHs were likely derived earlier than classical EXTs; (3) gymnosperms and monocots have few classical EXTs; (4) eudicots have the greatest number of classical EXTs, and Tyr-X-Tyr cross-linking motifs are predominantly found in classical EXTs; (5) green algae lack classical EXTs but have a number of long chimeric EXTs that are absent in embryophytes. Furthermore, phylogenetic analysis was conducted for LRXs, PERKs and FH EXTs, which shed light on the evolution of the EXTs.


CHAPTER 4. BIOINFORMATIC IDENTIFICATION AND ANALYSIS OF HYDROXYPROLINE-RICH GLYCOPROTEINS IN POPULUS TRICHOCARPA

This work has been published in the following manuscript.

Showalter AM, B Keppler, X Liu, J Lichtenberg, LR Welch 2016 Bioinformatic identification and analysis of hydroxyproline-rich glycoproteins in Populus trichocarpa. BMC Plant Biol 16:229.

4.1. Background

The hydroxyproline-rich glycoproteins (HRGPs) constitute a diverse superfamily of glycoproteins found throughout the plant kingdom (Showalter 1993; Kieliszewski and Lamport 1994; Nothnagel 1997; Cassab 1998; Jose-Estanyol and Puigdomenech 2000; Seifert and Roberts 2007). Based on their patterns of proline hydroxylation and subsequent glycosylation, HRGPs are separated into three families: arabinogalactan-proteins (AGPs), extensins (EXTs), and proline-rich proteins (PRPs). These differences in proline hydroxylation and glycosylation are ultimately determined by the primary amino acid sequence, particularly with respect to the location and distribution of proline residues.

Specifically, AGPs typically contain non-contiguous proline residues (e.g., APAPAP) which are hydroxylated and glycosylated with arabinogalactan (AG) polysaccharides (Tan et al. 2003; Tan et al. 2004; Tan et al. 2012). In contrast, EXTs typically contain contiguous prolines (e.g., SPPPP) that are hydroxylated and subsequently glycosylated with arabinose oligosaccharides (Kieliszewski and Lamport 1994; Shpak et al. 2001). The PRPs typically contain stretches of contiguous proline residues which are shorter than those found in EXTs; these proline residues may be hydroxylated and subsequently glycosylated with arabinose oligosaccharides. Thus, AGPs are extensively glycosylated, EXTs are moderately glycosylated, and PRPs are lightly glycosylated, if at all. In addition, most HRGPs have an N-terminal signal peptide that results in their insertion into the endomembrane system and delivery to the plasma membrane/cell wall. Certain families of HRGPs, particularly the AGPs, are also modified with a C-terminal glycosylphosphatidylinositol (GPI) membrane anchor, which tethers the protein to the outer leaflet of the plasma membrane and allows the rest of the glycoprotein to extend toward the cell wall in the periplasm (Youl et al. 1998; Sherrier et al. 1999; Svetek et al. 1999).

These characteristic amino acid sequences and sequence features allow for the effective identification and classification of HRGPs from proteomic databases by bioinformatic approaches involving biased amino acid composition searches and/or HRGP amino acid motif searches (Schultz et al. 2002; Graham et al. 2004; Showalter et al. 2010; Ma and Zhao 2010). In addition, Newman and Cooper (2011) utilized another bioinformatic approach involving searching for proline-rich tandem repeats to identify numerous HRGPs as well as other proteins in a variety of plant species.

The AGP family can be divided into the classical AGPs, which include a subset of lysine-rich classical AGPs, and the AG peptides. In addition, chimeric AGPs exist, most notably the fasciclin-like AGPs (FLAs) and the plastocyanin AGPs (PAGs), but also other proteins which have AGP-like regions along with non-HRGP sequences. Classical AGPs are identified using a search for proteins whose amino acid composition consists of at least 50% proline (P), alanine (A), serine (S), and threonine (T), or more simply, 50% PAST (Schultz et al. 2002; Showalter et al. 2010). Similarly, AG peptides are identified with a search of 35% PAST, but are size limited to between 50 and 90 amino acids in length. EXTs contain characteristic SPPP and SPPPP repeats. As such, EXTs are identified by searching for proteins that contain at least two SPPP repeats. Finally, PRPs are identified by searching for proteins that contain at least 45% PVKCYT or contain two or more repeated motifs (PPVX[KT] or KKPCPP). Similar to AGPs, chimeric versions of EXTs and PRPs also exist. Each HRGP identified in this poplar study can then be subjected to BLAST searches against both the Arabidopsis and poplar databases for several purposes: 1) to ensure that the protein identified is similar in sequence to some known HRGPs in Arabidopsis, 2) to identify if the protein is similar to other proteins in poplar which were identified as HRGPs by using the BIO OHIO 2.0 program, and 3) to identify similar proteins that may be HRGPs, but which do not meet the search criteria.
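
The composition-based criteria described above can be illustrated with a short Python sketch. It is not the BIO OHIO 2.0 implementation: the function names and the toy sequence are hypothetical, the thresholds are simply those quoted in the text, and the signal peptide, GPI anchor, and BLAST checks are not modeled.

```python
def composition_fraction(seq: str, residues: str) -> float:
    """Return the fraction of residues in seq that belong to the given set."""
    seq = seq.upper()
    if not seq:
        return 0.0
    wanted = set(residues)
    return sum(1 for aa in seq if aa in wanted) / len(seq)


def composition_flags(seq: str) -> dict:
    """Flag a sequence against the composition thresholds quoted in the text."""
    past = composition_fraction(seq, "PAST")
    pvkcyt = composition_fraction(seq, "PVKCYT")
    return {
        # at least 50% PAST, any length
        "classical_agp_candidate": past >= 0.50,
        # at least 35% PAST and 50 to 90 amino acids long
        "ag_peptide_candidate": past >= 0.35 and 50 <= len(seq) <= 90,
        # at least 45% PVKCYT
        "prp_candidate": pvkcyt >= 0.45,
    }


# Hypothetical 60-residue toy sequence (80% PAST, 40% PVKCYT).
toy = "APSTG" * 12
flags = composition_flags(toy)
```

A sequence flagged in this way would only be a candidate; as the text notes, hits are subsequently compared by BLAST against Arabidopsis and poplar proteins before being classified.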

Although the numbers and types of HRGPs in Arabidopsis are well established (Schultz et al. 2002; Showalter et al. 2010), much less is known about other plant species. As more plant genome sequencing projects are completed, comprehensive identification and analysis of HRGPs in these species can be carried out. This knowledge can be used to facilitate and guide basic and applied research on these cell wall proteins, potentially with respect to plant biofuel research that utilizes cell wall components for energy production. In fact, a paper was recently published linking poplar EXTs to recalcitrance (Fleming et al. 2016). Moreover, comparisons can be made with what is already known in Arabidopsis, which will potentially provide further insight into the roles that these particular classes of HRGPs play in the plant as well as their evolution. A comprehensive inventory of HRGPs in poplar, or trees in general, is lacking, although a search for proline-rich tandem repeat proteins in poplar recently identified several HRGP sequences (Newman and Cooper 2011). Additionally, fifteen fasciclin-like AGPs (FLAs) were identified in Populus tremula × P. alba, a hybrid related to Populus trichocarpa, and found to be highly expressed in tension wood (Lafarguette et al. 2004).

Here, the completed genome sequence, or more precisely the encoded proteome, of Populus trichocarpa was utilized to conduct a comprehensive bioinformatics-based identification of HRGPs in this species (Fig 4.1). This approach utilizes a newly revised and improved BIO OHIO 2.0 program. Since Arabidopsis and poplar are both dicots, they are expected to have a similar inventory of HRGPs, as opposed to the monocots, which may prove to be considerably different. Nevertheless, Arabidopsis and poplar are morphologically different from one another, with Arabidopsis being a small annual herbaceous plant and poplar being a large woody deciduous tree. Distinct differences were reflected in their inventories of HRGPs, which can now be used to guide further research on the functional roles, commercial applications, and evolution of these ubiquitous and highly modified plant glycoproteins.


Figure 4.1 Workflow diagram for the identification, classification, and analysis of HRGPs (AGPs, EXTs, and PRPs) in poplar using a newly revised and improved BIO OHIO 2.0.


4.2. Materials and methods

4.2.1 Identification of AGPs, EXTs, and PRPs using BIO OHIO 2.0

The Populus trichocarpa protein database (Ptrichocarpa_210_v3.0.protein.fa.gz) was downloaded from the Phytozome v11.0 website (www.phytozome.org) (Tuskan et al. 2006). The protein database was searched for AGPs, EXTs, and PRPs using the newly revised and improved BIO OHIO 2.0 software (Showalter et al. 2010; Lichtenberg et al. 2012). Compared to the previous version, this new version integrates additional functional modules that search for the presence of a signal peptide using the SignalP server (www.cbs.dtu.dk/services/SignalP/) (Petersen et al. 2011), search for the presence of GPI anchor addition sequences using the big-PI plant predictor (mendel.imp.ac.at/gpi/plant_server.html) (Eisenhaber et al. 2003), and run an automated BLAST search against the Arabidopsis proteome. In cases where no signal peptide was identified for a sequence using the default parameters, the sensitive mode was then used, which lowered the D-cutoff values to 0.34 (Petersen et al. 2011). These improvements make the program an ideal bioinformatic tool to study cell wall proteins/glycoproteins within any sequenced plant species. The program is freely available upon request. Briefly, classical AGPs were identified as proteins of any length that consisted of 50% or greater of the amino acids P, A, S, and T (PAST). AG peptides were identified as proteins of 50 to 90 amino acids in length consisting of 35% or greater PAST. FLAs were designated as proteins containing the following consensus motif: [MALIT]T[VILS][FLCM][CAVT][PVLIS][GSTKRNDPEIV]+[DNS][DSENAGE]+[ASQM]. EXTs were identified by searching with a regular expression for the occurrence of two or more SPPP repeats in the protein. Hits were examined for the location and distribution of SP3 and SP4 repeats as well as for the occurrence of other repeating sequences, including YXY. PRPs were identified by searching for a biased amino acid composition of greater than 45% PVKCYT or for sequences containing two or more repeated motifs (PPVX[KT] or KKPCPP) (Fowler et al. 1999).
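
The motif-based searches described above can be sketched with Python regular expressions. This is an illustration only, under stated assumptions: the FLA consensus is transcribed from the text, PPVX[KT] is written as `PPV.[KT]` with X taken to mean any residue, matches are counted without overlap, and all function names and test sequences are hypothetical rather than taken from BIO OHIO 2.0.

```python
import re

# FLA consensus motif as given in the text.
FLA_CONSENSUS = re.compile(
    r"[MALIT]T[VILS][FLCM][CAVT][PVLIS][GSTKRNDPEIV]+[DNS][DSENAGE]+[ASQM]"
)
# EXT repeat; an SPPPP stretch also counts once as an SPPP match.
SPPP = re.compile(r"SPPP")
# PRP motifs: PPVX[KT] (X = any residue, an assumption) or KKPCPP.
PRP_MOTIF = re.compile(r"PPV.[KT]|KKPCPP")


def count_sppp(seq: str) -> int:
    """Count non-overlapping SPPP occurrences."""
    return len(SPPP.findall(seq.upper()))


def is_ext_candidate(seq: str) -> bool:
    """Two or more SPPP repeats, as stated in the text."""
    return count_sppp(seq) >= 2


def has_fla_motif(seq: str) -> bool:
    """True if the fasciclin consensus motif occurs anywhere in seq."""
    return FLA_CONSENSUS.search(seq.upper()) is not None


def count_prp_motifs(seq: str) -> int:
    """Count non-overlapping PPVX[KT] or KKPCPP occurrences."""
    return len(PRP_MOTIF.findall(seq.upper()))


# Hypothetical toy sequences, not real poplar proteins.
ext_like = "MKT" + "SPPPP" * 3 + "YVY"
fla_like = "MKATVFAPGDDAL"
prp_like = "PPVEKPPVYT"
```

Because a single substitution in the consensus prevents a regex match, a sketch like this would miss divergent FLAs, which is consistent with the text's use of BLAST searches to recover additional family members.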

4.2.2 BLAST analysis against Arabidopsis and poplar proteomes

All proteins identified by the BIO OHIO 2.0 searches were subjected to protein-protein BLAST (blastp) analysis. BLAST analysis against Arabidopsis HRGPs was conducted as an integrated module within BIO OHIO 2.0. BLAST analysis against the poplar database (Ptrichocarpa_210_v3.0.protein.fa) was conducted using NCBI BLAST+ (2.2.30) downloaded from the NCBI website. BLAST searches were conducted with the “filter query” option both on and off.
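
A standalone BLAST+ run of this kind could be scripted roughly as follows. This is a hedged sketch, not the commands used in the study: it assumes that the “filter query” option corresponds to the `-seg` low-complexity filter of `blastp`, that the poplar FASTA file has already been formatted into a protein database (for example with `makeblastdb`), and that the file and database names shown are hypothetical placeholders.

```python
from typing import List


def blastp_command(query_fasta: str, database: str,
                   out_path: str, filter_query: bool) -> List[str]:
    """Build a blastp argument list for one filter setting."""
    return [
        "blastp",
        "-query", query_fasta,        # FASTA file of candidate HRGPs
        "-db", database,              # name of a formatted protein database
        "-outfmt", "6",               # tabular output
        "-seg", "yes" if filter_query else "no",  # assumed "filter query" switch
        "-out", out_path,
    ]


# One command with the filter on and one with it off (hypothetical file names).
commands = [
    blastp_command(
        "bio_ohio_hits.fa",
        "Ptrichocarpa_210_v3.0.protein",
        "hits_filter_{}.tsv".format("on" if flag else "off"),
        flag,
    )
    for flag in (True, False)
]
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed; comparing the two outputs shows which hits depend on low-complexity filtering, which matters for repeat-rich proteins such as HRGPs.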

4.2.3 Pfam database and poplar HRGP gene expression database

All proteins identified in this study were subjected to a sequence search using Pfam database 30.0 (http://pfam.xfam.org/) to identify Pfam matches within the protein sequences (Finn et al. 2016), and were queried in the Poplar eFP Browser (http://bar.utoronto.ca/efppop/cgi-bin/efpWeb.cgi) for organ/tissue-specific expression data (Wilkins et al. 2009). Specifically, protein sequences of poplar v3.0 were entered into the Pfam database, while poplar v2.0 identifiers were entered into the Poplar eFP Browser since the eFP browser currently does not recognize poplar v3.0 identifiers.

4.3 Results

4.3.1 Arabinogalactan-proteins (AGPs)

Among the 73,013 proteins in the poplar database, 86 proteins were found to have at least 50% PAST, while 194 peptides had at least 35% PAST and were between 50 and 90 amino acids in length (Table 4.1). Several chimeric AGPs were identified in the 50% PAST search, but the FLAs in particular required a unique test as they typically do not meet the 50% PAST threshold. Previously in Arabidopsis, a consensus sequence for the fasciclin H1 domain was utilized to search for these proteins, and this consensus sequence was again utilized here (Showalter et al. 2010). A total of 43 proteins were found to contain this sequence.


Table 4.1 AGPs, EXTs, and PRPs identified from the Populus trichocarpa protein database based on biased amino acid compositions, size, and repeat units.

Search Criteria | ≥50% PAST | ≥35% PAST and 50-90 AA | Fasciclin domain | ≥2 SPPP | ≥2 KKPCPP | ≥2 PPV.[KT] | ≥45% PVKCYT
Classical AGP | 10 | 0 | 0 | 1 | 0 | 0 | 4
Lys-Rich AGP | 5 | 0 | 0 | 1 | 0 | 0 | 5
AG Peptide | 0 | 31 | 0 | 0 | 0 | 0 | 0
FLA | 1 | 0 | 24 | 0 | 0 | 0 | 0
PAG | 5 | 0 | 0 | 2 | 0 | 0 | 0
Other Chimeric AGP | 0 | 0 | 0 | 0 | 0 | 0 | 1
EXT | 7 | 0 | 0 | 8 | 0 | 0 | 8
Short EXT | 4 | 0 | 0 | 21 | 0 | 0 | 8
LRX | 0 | 0 | 0 | 10 | 0 | 0 | 0
PERK | 0 | 0 | 0 | 12 | 0 | 0 | 0
FHX | 0 | 0 | 0 | 5 | 0 | 0 | 0
Other Chimeric EXT | 0 | 0 | 0 | 3 | 0 | 0 | 0
PRP | 1 | 0 | 0 | 0 | 0 | 4 | 10
PR Peptide | 16 | 0 | 0 | 0 | 0 | 0 | 10
Chimeric PRP | 0 | 0 | 0 | 0 | 0 | 0 | 0
Others | 37 | 163 | 19 | 99 | 0 | 25 | 194
Total | 86 | 194 | 43 | 162 | 0 | 29 | 240
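
As a quick consistency check on Table 4.1, the per-class counts in each search column should add up to the Total row. The sketch below re-enters the table values by hand; the variable and function names are illustrative only and are not part of BIO OHIO 2.0.

```python
# Column order: >=50% PAST, >=35% PAST and 50-90 AA, fasciclin domain,
# SPPP, KKPCPP, PPV.[KT], >=45% PVKCYT (values copied from Table 4.1).
ROWS = {
    "Classical AGP":      (10, 0, 0, 1, 0, 0, 4),
    "Lys-Rich AGP":       (5, 0, 0, 1, 0, 0, 5),
    "AG Peptide":         (0, 31, 0, 0, 0, 0, 0),
    "FLA":                (1, 0, 24, 0, 0, 0, 0),
    "PAG":                (5, 0, 0, 2, 0, 0, 0),
    "Other Chimeric AGP": (0, 0, 0, 0, 0, 0, 1),
    "EXT":                (7, 0, 0, 8, 0, 0, 8),
    "Short EXT":          (4, 0, 0, 21, 0, 0, 8),
    "LRX":                (0, 0, 0, 10, 0, 0, 0),
    "PERK":               (0, 0, 0, 12, 0, 0, 0),
    "FHX":                (0, 0, 0, 5, 0, 0, 0),
    "Other Chimeric EXT": (0, 0, 0, 3, 0, 0, 0),
    "PRP":                (1, 0, 0, 0, 0, 4, 10),
    "PR Peptide":         (16, 0, 0, 0, 0, 0, 10),
    "Chimeric PRP":       (0, 0, 0, 0, 0, 0, 0),
    "Others":             (37, 163, 19, 99, 0, 25, 194),
}
TOTAL = (86, 194, 43, 162, 0, 29, 240)


def column_sums(rows: dict) -> tuple:
    """Sum each search-criterion column across all class rows."""
    return tuple(sum(col) for col in zip(*rows.values()))


totals_match = column_sums(ROWS) == TOTAL
```

The "Others" row makes clear how many hits from each search were not classified as HRGPs, for example 99 of the 162 proteins returned by the SPPP search.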


In addition to meeting one of the search criteria, several other factors were considered in determining if the proteins were classified as HRGPs. All proteins were examined for signal peptides and for GPI membrane anchor addition sequences, as these are known to occur in AGPs. In addition, sequences were examined for certain dipeptide repeats which are characteristic of AGPs, including AP, PA, SP, TP, VP, and GP (Nothnagel 1997; Schultz et al. 2004). The presence of these repeats was used to determine if a protein identified by the search was classified as an AGP. The various searches for AGPs combined with BLAST searches identified a total of 162 poplar proteins that were determined to be AGPs (Table 4.2). In total, 27 classical AGPs (which include six lysine-rich AGPs) and 35 AG peptides were identified. In terms of chimeric AGPs, FLAs were particularly abundant in poplar with 50 being identified. Using the consensus sequence that identifies all 21 of the Arabidopsis FLAs, a total of 24 FLAs were identified in poplar. However, because a single amino acid change in the consensus sequence would result in a particular FLA not being identified, the additional 26 FLAs were identified with BLAST searches. Another particularly common class of chimeric AGPs identified in Arabidopsis was the plastocyanin AGPs, or PAGs. Only five PAGs were identified with the 50% PAST search, but 34 others that fall below the 50% PAST threshold were identified with BLAST searches. Finally, 11 other chimeric AGPs were also identified. Representative AGP sequences from each class are shown in Fig 4.2, while sequences from all 162 AGPs identified are available in Fig S1 (Showalter et al. 2016).
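
The dipeptide repeat profile used to support AGP classification can be sketched as follows. The counting convention used by the authors is not stated in the text, so this example simply counts every occurrence of each dipeptide and reports them in the AP/PA/SP/TP/GP/VP order of the Table 4.2 column header; the function name and the toy sequence are hypothetical.

```python
# Dipeptides in the order of the "AP/PA/SP/TP/GP/VP Repeats" column.
DIPEPTIDES = ("AP", "PA", "SP", "TP", "GP", "VP")


def dipeptide_profile(seq: str) -> str:
    """Return slash-separated counts of each AGP-type dipeptide in seq."""
    seq = seq.upper()
    # str.count is sufficient here: a dipeptide made of two different
    # residues cannot overlap with itself.
    return "/".join(str(seq.count(dp)) for dp in DIPEPTIDES)


# Hypothetical toy sequence, not a real poplar AGP.
profile = dipeptide_profile("APAPAPSPTPGPVP")
```

A profile rich in AP, PA, SP, and TP, together with a predicted signal peptide, is the kind of evidence the text describes for accepting a search hit as an AGP.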


Table 4.2 Identification and analysis of AGP genes in Populus trichocarpa. Locus Identifier 3.0 Name Class AP/PA/ % AA Pfamb SPc GPI Organ/tissu Arabidopsis Poplar HRGP (ID 2.0) a SP/TP/ PAST e-specific HRGP BLAST Hitse GP/VP Expressiond BLAST Repeats Hits Potri.017G050200 PtAGP1C Classical 3/3/12/2 66% 137 Y Y AtAGP1C, PtAGP2C, /1/1 AtAGP17K, PtAGP7C, AtAGP18K, PtAGP9C, AtAGP7C PtAGP5C, Potri.005G077100 Potri.017G050300 PtAGP2C Classical 5/5/9/2/ 64% 133 Y Y Female AtAGP1C, PtAGP9C, (POPTR_0017s07700) 1/1 catkins AtAGP10C, PtAGP1C, AtAGP3C, Potri.004G161700, AtPAG11 Potri.001G376400, Potri.009G009600 Potri.005G161100 PtAGP3C Classical 11/9/8/5 59% 161 Y N Roots AtAGP10C, Potri.013G119700, (POPTR_0005s17440) /0/2 AtAGP3C, Potri.009G124200, AtAGP5C, Potri.004G162500, AtAGP18K, Potri.001G376400, AtPERK13 Potri.013G112500 Potri.014G135100 PtAGP4C Classical 4/4/6/1/ 54% 140 Y Y Dark AtAGP26C, PtAGP47C, (POPTR_0014s12960) 2/0 etiolated AtAGP27C, PtAGP48C, seedlings, AtAGP25C PtAGP49K, light-grown Potri.013G119700, seedling, Potri.004G196400 young Potri.001G339700 PtAGP5C Classical 9/8/4/3/ 59% 144 Y Y Male catkins AtAGP6C, PtAGP50C, (POPTR_0001s35940) 4/0 AtAGP11C, Potri.003G031800, AtAGP17K PtAGP51C, PtAGP52C, Potri.003G143000 Potri.001G259700 PtAGP6C Classical 1/3/20/3 57% 197 Y N None PtAGP43P, /0/1 PtPtEXT7, PtPtEXT4 94

Table 4.2: continued Potri.001G310300 PtAGP7C Classical 6/7/8/5/ 63% 126 Y Y Young leaf AtAGP6C PtAGP1C, (POPTR_0001s31780) 0/2 PtAGP9C, Potri.002G256200, Potri.002G235500, Potri.005G049100 Potri.001G367600 PtAGP8C Classical 7/8/29/4 68% 265 Y Y None Potri.004G145800 /1/1 Potri.001G310400 PtAGP9C Classical 6/7/9/3/ 62% 137 Y Y Young leaf AtAGP18K, PtAGP2C, (POPTR_0001s31790) 0/2 AtAGP1C, Potri.009G085400, AtPEX4, Potri.013G119700, AtAGP10C PtAGP7C, Potri.005G043900 Potri.017G047500 PtAGP10C Classical 0/2/4/5/ 50% 207 Y Y Female None Potri.011G046900, (POPTR_0017s07480) 1/3 catkins Potri.010G094700, PtPRP23, Potri.004G038300, PtPRP28 Potri.002G207500 PtAGP47C Classical 4/4/6/1/ 49% 141 Y N AtAGP26C, PtAGP4C, (POPTR_0020s00250) 2/0 AtAGP27C PtAGP48C, PtAGP49K, Potri.013G119700, Potri.003G164300 Potri.010G031700 PtAGP48C Classical 2/2/9/2/ 44% 169 Y* N Xylem AtAGP26C, PtAGP49K, (POPTR_0010s03290) 1/2 AtAGP25C, PtAGP4C, AtAGP27C PtAGP47C, Potri.008G153000, Potri.008G147100 Potri.008G182400 PtAGP50C Classical 3/2/1/0/ 47% 101 Y Y Male catkins AtAGP50C, PtAGP52C, (POPTR_0008s18270) 3/1 AtAGP6C, PtAGP51C, AtAGP5C PtAGP5C, Potri.013G011700, Potri.018G128000

95

Table 4.2: continued Potri.015G093700 PtAGP51C Classical 6/3/0/0/ 49% 115 Y Y Male catkins AtAGP50C, PtAGP52C, (POPTR_0015s10580) 2/1 AtAGP6C, PtAGP50C, AtAGP15P PtAGP5C, Potri.014G159300, Potri.009G065300 Potri.012G095900 PtAGP52C Classical 6/5/0/0/ 49% 115 Y Y Male catkins AtAGP50C, PtAGP51C, (POPTR_0012s09790) 2/1 AtAGP6C, PtAGP50C, AtAGP3C PtAGP5C, Potri.014G159300, Potri.019G095800 Potri.005G169000 PtAGP64C Classical 10/9/4/1 48% 216 PF143 Y N AtAGP29I PtAGP60I, /0/3 68.4 PtAGP57I, PtAGP58I, Potri.001G210100, PtAGP69C Potri.008G155200 PtAGP65C Classical 4/4/3/4/ 45% 219 PF143 Y* Y Xylem, male AtAGP29I Potri.010G085200, (POPTR_0008s15500) 0/7 68.4 catkins, PtAGP66C, female PtAGP67C, catkins PtAGP68C, PtAGP69C Potri.005G212000 PtAGP66C Classical 4/4/5/4/ 45% 207 PF143 Y Y Roots AtAGP29I PtAGP67C, (POPTR_0005s23360) 2/2 68.4 Potri.010G085200, PtAGP65C, PtAGP69C, PtAGP68C Potri.002G050200 PtAGP67C Classical 4/5/5/4/ 46% 205 PF143 Y N AtAGP29I PtAGP66C, (POPTR_0002s05110) 2/2 68.4 Potri.010G085200, PtAGP65C, PtAGP68C, PtAGP69C

96

Table 4.2: continued Potri.010G085400 PtAGP68C Classical 0/2/4/4/ 44% 170 PF143 Y Y Male catkins AtAGP29I PtAGP69C, (POPTR_0010s09550) 0/1 68.4 Potri.005G211800, Potri.002G050500, Potri.002G050300, Potri.005G211900 Potri.008G155100 PtAGP69C Classical 1/2/5/2/ 44% 170 PF143 Y Y Male catkins AtAGP29I PtAGP68C, (POPTR_0008s15490) 0/1 68.4 Potri.005G211800, Potri.002G050500, Potri.010G085300, Potri.002G050300 Potri.009G092300 PtAGP11K Lysine- 11/19/8/ 69% 196 Y Y Xylem AtAGP17K, PtAGP14K, (POPTR_0009s09530) rich 11/1/2 AtAGP18K, Potri.004G181200, AtPRP1 Potri.001G310900, PtAGP71I Potri.010G132500 PtAGP12K Lysine- 18/24/1 65% 241 Y N Xylem AtAGP19K PtAGP15K, (POPTR_0010s14250) rich 0/12/0/4 Potri.013G003500, Potri.007G013600 Potri.007G051600 PtAGP13K Lysine- 12/12/9/ 60% 204 Y Y Dark AtAGP17K, PtAGP14K, (POPTR_0007s10230) rich 11/2/5 etiolated AtAGP18K Potri.013G003500, seedlings, PtAGP72I, young leaf Potri.018G122900 Potri.005G144900 PtAGP14K Lysine- 11/12/9/ 62% 208 Y Y Female AtAGP18K, PtAGP13K, (POPTR_0005s18840) rich 10/3/4 catkins AtAGP17K, Potri.002G008600, AtPRP1 Potri.005G049100, Potri.006G234100 Potri.008G111000 PtAGP15K Lysine- 23/33/1 66% 276 Y Y None PtAGP12K, (POPTR_0008s11040) rich 4/12/0/2 PtPtPAG5 Potri.008G195700 PtAGP49K Lysine- 2/2/9/1/ 45% 194 Y N Female AtAGP25C, PtAGP48C, (POPTR_0008s20030) rich 1/4 catkins AtAGP27C, PtAGP4C, AtAGP26C PtAGP47C, Potri.008G147100, Potri.010G094700


Table 4.2: continued Potri.009G063600 PtAGP16P AG 2/2/1/0/ 48% 60 Y Y AtAGP43P, PtAGP41P, (POPTR_0006s05460) peptide 0/0 AtAGP23P, PtAGP24P, AtAGP40P, Potri.016G052000, AtAGP14P, PtAGP29P, AtAGP15P PtAGP28P Potri.009G062700 PtAGP17P AG 2/2/0/0/ 36% 68 Y Y AtAGP22P, PtAGP38P, peptide 0/0 AtAGP16P PtAGP29P, PtAGP22P, PtAGP28P, PtAGP25P Potri.009G063200 PtAGP18P AG 3/2/0/0/ 40% 69 Y Y AtAGP43P PtAGP39P, peptide 0/0 PtAGP19P, PtAGP29P, PtAGP38P, PtAGP53P Potri.009G063000 PtAGP19P AG 3/2/0/0/ 41% 70 Y Y None PtAGP18P, peptide 0/0 PtAGP39P, PtAGP29P, PtAGP53P, PtAGP38P Potri.013G057500 PtAGP20P AG 2/2/1/0/ 41% 60 Y Y Male catkins AtAGP14P, PtAGP54P, (POPTR_0013s05400) peptide 0/1 AtAGP12P, PtAGP33P, AtAGP13P, PtAGP44P, AtAGP21P, PtAGP41P, AtAGP15P PtAGP30P Potri.003G136600 PtAGP21P AG 3/2/0/0/ 39% 69 PF063 Y Y Female AtAGP20P, PtAGP40P, (POPTR_0003s13640) peptide 0/0 76.10 catkins, AtAGP16P, PtAGP30P, male catkins AtAGP22P, PtAGP45P, AtAGP41P, PtAGP35P, AtAGP15P PtAGP54P


Table 4.2: continued Potri.006G056000 PtAGP22P AG 3/2/0/0/ 36% 68 Y Y Xylem AtAGP40P, PtAGP53P, (POPTR_0831s00200) peptide 0/0 AtAGP43P PtAGP28P, PtAGP29P, PtAGP27P, PtAGP25P Potri.006G055700 PtAGP23P AG 4/3/0/0/ 42% 66 Y Y male AtAGP16P, PtAGP29P, (POPTR_0006s05460) peptide 0/0 catkins, dark AtAGP43P PtAGP27P, etiolated PtAGP22P, seedlings PtAGP25P, PtAGP28P Potri.006G056200 PtAGP24P AG 2/1/1/0/ 47% 61 Y Y Male catkins AtAGP43P, Potri.016G052000, (POPTR_0006s05490) peptide 0/0 AtAGP23P, PtAGP16P, AtAGP40P, PtAGP41P, AtAGP13P, PtAGP29P, AtAGP14P PtAGP23P Potri.006G055900 PtAGP25P AG 3/2/0/0/ 37% 67 Y Y AtAGP43P, PtAGP27P, peptide 0/0 AtPAG2 PtAGP28P, PtAGP22P, PtAGP29P, PtAGP53P Potri.006G055500 PtAGP26P AG 4/3/1/0/ 39% 69 Y Y Dark AtAGP12P, PtAGP23P, (POPTR_0006s05440) peptide 0/0 etiolated AtAGP43P, PtAGP29P, seedlings AtAGP15P PtAGP28P, PtAGP22P, PtAGP27P Potri.006G055800 PtAGP27P AG 3/2/0/0/ 37% 67 Y Y AtAGP43P, PtAGP25P, peptide 0/0 AtPAG2 PtAGP28P, PtAGP22P, PtAGP29P, PtAGP53P


Table 4.2: continued Potri.016G052400 PtAGP28P AG 3/2/0/0/ 37% 67 Y Y Dark AtAGP40P, PtAGP27P, (POPTR_0016s05280) peptide 0/0 etiolated AtAGP15P PtAGP22P, seedlings PtAGP25P, PtAGP53P, PtAGP29P Potri.016G052200 PtAGP29P AG 3/2/1/0/ 38% 67 Y Y Male catkins AtAGP40P, PtAGP22P, (POPTR_0016s05270) peptide 0/1 AtAGP28I PtAGP27P, AtAGP43P, PtAGP25P, AtAGP12P PtAGP28P, PtAGP53P Potri.015G022600 PtAGP30P AG 2/1/1/0/ 37% 64 PF063 Y Y AtAGP20P, PtAGP45P, (POPTR_0015s06130) peptide 0/0 76.10 AtAGP22P, PtAGP35P, AtAGP16P, PtAGP40P, AtAGP41P, PtAGP21P, AtAGP15P Potri.001G070600 Potri.015G139200 PtAGP31P AG 2/0/0/1/ 35% 57 Y N None Potri.015G139100, peptide 0/0 Potri.012G137400, Potri.006G150100, Potri.008G094200, Potri.007G131100 Potri.002G226300 PtAGP32P AG 1/1/4/0/ 37% 74 Y N None PtAGP34P, (POPTR_0002s21530) peptide 1/1 Potri.012G138200, Potri.001G274200, Potri.002G121800, Potri.015G140000 Potri.019G035500 PtAGP33P AG 2/2/1/0/ 44% 59 Y Y AtAGP14P, PtAGP20P, (POPTR_0019s05110) peptide 0/1 AtAGP12P, PtAGP54P, AtAGP13P, PtAGP44P, AtAGP21P, PtAGP41P, AtAGP22P PtAGP30P


Table 4.2: continued Potri.014G156600 PtAGP34P AG 1/0/2/1/ 37% 74 Y N None PtAGP32P, (POPTR_0014s15480) peptide 0/1 Potri.001G274200, Potri.012G138200, Potri.015G140000, Potri.010G111200 Potri.014G094800 PtAGP35P AG 3/3/2/0/ 42% 76 PF063 Y N Male catkins AtAGP20P, PtAGP30P, (POPTR_0014s09050) peptide 0/0 76.10 AtAGP16P, PtAGP45P, AtAGP22P, PtAGP40P, AtAGP41P, PtAGP21P, AtAGP15P PtAGP17P Potri.T142100 PtAGP36P AG 1/2/2/1/ 36% 90 Y N None Potri.004G234800, peptide 0/0 Potri.014G034500, Potri.005G136800, Potri.007G041500, Potri.007G041400 Potri.001G387800 PtAGP37P AG 1/0/3/0/ 37% 78 Y N Female None Potri.004G061300, (POPTR_0001s39620) peptide 0/0 catkins, Potri.011G070500, male Potri.003G125800, catkins, Potri.008G019500, young leaf Potri.002G195300 Potri.001G268400 PtAGP38P AG 3/2/0/0/ 39% 68 Y Y AtAGP22P, PtAGP17P, (POPTR_0001s27530) peptide 0/0 AtPAG1 PtAGP29P, PtAGP22P, PtAGP28P, PtAGP27P Potri.001G268500 PtAGP39P AG 3/3/0/0/ 40% 69 Y Y AtAGP15P, PtAGP18P, (POPTR_0001s27540) peptide 0/0 AtAGP14P, PtAGP19P, AtAGP28I PtAGP29P, AtAGP13P, PtAGP53P, AtPAG1 PtAGP38P


Table 4.2: continued Potri.001G094700 PtAGP40P AG 3/2/0/0/ 42% 69 PF063 Y Y AtAGP20P, PtAGP21P, (POPTR_0001s10310) peptide 0/0 76.10 AtAGP16P, PtAGP30P, AtAGP22P, PtAGP45P, AtAGP41P, PtAGP35P, AtAGP12P Potri.016G086300 Potri.001G268800 PtAGP41P AG 2/1/1/0/ 46% 60 Y Y AtAGP43P, PtAGP16P, peptide 0/0 AtAGP23P, PtAGP24P, AtAGP40P, Potri.016G052000, AtAGP12P, PtAGP29P, AtAGP15P PtAGP28P Potri.001G268900 PtAGP42P AG 1/1/0/0/ 36% 66 Y Y None PtAGP29P, (POPTR_0001s27570) peptide 0/0 PtAGP56P, Potri.010G100200, Potri.011G126900, PtAGP23P Potri.001G259500 PtAGP43P AG 0/0/3/1/ 37% 67 Y N None PtAGP6C, peptide 0/0 PtEXT7, PtEXT4, Potri.018G145800, Potri.007G096600 Potri.001G004100 PtAGP44P AG 2/1/1/0/ 40% 59 Y Y AtAGP14P, PtAGP54P, (POPTR_0001s04130) peptide 0/1 AtAGP12P, PtAGP20P, AtAGP13P, PtAGP33P, AtAGP21P, PtAGP41P, AtAGP15P PtAGP60I Potri.012G032000 PtAGP45P AG 2/1/1/0/ 39% 64 PF063 Y Y Male catkins AtAGP20P, PtAGP30P, (POPTR_0012s01350) peptide 0/0 76.10 AtAGP16P, PtAGP35P, AtAGP22P, PtAGP40P, AtAGP41P, PtAGP21P, AtAGP15P PtAGP54P


Table 4.2: continued Potri.012G144100 PtAGP46P AG 1/1/1/2/ 41% 89 Y N None Potri.002G258000, peptide 0/1 Potri.007G124600, Potri.003G086400, Potri.001G148100, Potri.013G051400 Potri.016G052300 PtAGP53P AG 3/2/1/0/ 32% 110 Y* Y AtAGP15P, PtAGP22P, peptide 0/0 AtAGP40P, PtAGP28P, AtPAG11, PtAGP27P, AtAGP43P, PtAGP25P, AtPERK3 PtAGP29P Potri.003G220900 PtAGP54P AG 3/1/1/1/ 37% 139 Y* Y AtAGP14P, PtAGP44P, (POPTR_0003s21020) peptide 0/1 AtAGP12P, PtAGP20P, AtAGP13P, PtAGP33P, AtAGP21P, PtAGP41P, AtAGP22P Potri.004G067400 Potri.006G056100 PtAGP55P AG 1/1/0/1/ 33% 66 Y N None PtAGP56P, (POPTR_0006s05480) peptide 0/0 PtAGP28P, PtAGP29P, PtAGP22P, PtAGP25P Potri.016G052100 PtAGP56P AG 1/1/0/1/ 31% 66 Y N Xylem None PtAGP55P, (POPTR_0016s05260) peptide 0/0 PtAGP29P, PtAGP25P, PtAGP27P, PtAGP22P Potri.010G244900 PtFLA1 Chimeric 10/4/0/0 26% 459 PF024 Y N AtFLA17, PtFLA19, PtFLA6, (POPTR_0010s25110) /3/1 69.20 AtFLA16, PtFLA8, PtFLA41, AtFLA18, Potri.012G006200 AtFLA15, AtFLA12


Table 4.2: continued Potri.009G012200 PtFLA2 Chimeric 8/7/3/2/ 39% 254 PF024 Y N AtFLA11, PtFLA34, (POPTR_0009s01740) 2/0 69.20 AtFLA12, PtFLA10, AtFLA13, PtFLA23, AtFLA9, PtFLA40, AtFLA6 PtFLA48 Potri.013G120600 PtFLA3 Chimeric 4/2/2/3/ 34% 238 PF024 Y Y Dark AtFLA6, PtFLA15, PtFLA9, (POPTR_0013s12490) 1/1 69.20 etiolated AtFLA9, PtFLA7, PtFLA10, seedlings, AtFLA13, PtFLA23 roots, AtFLA11, female AtFLA12 catkins Potri.013G152200 PtFLA4 Chimeric 5/0/5/0/ 31% 353 PF024 N N Female AtFLA21, Potri.019G125200, (POPTR_0013s14840) 1/0 69.20 catkins AtFLA19, PtFLA36, AtFLA20, PtFLA42, AtFLA15, PtFLA44, AtFLA16 Potri.T118500 Potri.011G093500 PtFLA5 Chimeric 7/4/2/2/ 32% 408 PF024 Y Y AtFLA1, PtFLA22, (POPTR_0011s09590) 1/2 69.20 AtFLA2, PtFLA16, AtFLA8, PtFLA17, AtFLA10, PtFLA21, AtFLA14 PtFLA37 Potri.006G200300 PtFLA6 Chimeric 8/2/1/0/ 27% 466 PF024 Y N AtFLA17, PtFLA8, PtFLA1, (POPTR_0006s21460) 3/1 69.20 AtFLA18, PtFLA19, AtFLA16, PtFLA41, AtFLA15, Potri.012G006200 AtFLA11 Potri.006G129200 PtFLA7 Chimeric 6/5/2/1/ 36% 227 PF024 Y N AtFLA11, PtFLA9, PtFLA10, (POPTR_0006s13120) 1/2 69.20 AtFLA12, PtFLA23, AtFLA6, PtFLA32, AtFLA13, PtFLA49 AtFLA9


Table 4.2: continued Potri.016G066500 PtFLA8 Chimeric 7/2/2/1/ 27% 466 PF024 Y N Male AtFLA17, PtFLA6, PtFLA1, (POPTR_0016s06680) 3/1 69.20 catkins, and AtFLA18, PtFLA19, light AtFLA16, PtFLA41, etiolated AtFLA15, Potri.012G006200 seedlings, AtFLA11 light grown seedling Potri.016G088700 PtFLA9 Chimeric 7/6/2/1/ 37% 239 PF024 Y Y Xylem AtFLA11, PtFLA7, PtFLA10, (POPTR_0016s09010) 1/2 69.20 AtFLA12, PtFLA23, AtFLA6, PtFLA32, AtFLA13, PtFLA49 AtFLA9 Potri.015G129400 PtFLA10 Chimeric 5/5/3/2/ 37% 240 PF024 Y Y Xylem AtFLA11, PtFLA23, (POPTR_0015s14570) 1/1 69.20 AtFLA12, PtFLA34, PtFLA2, AtFLA6, PtFLA20, AtFLA13, PtFLA28 AtFLA9 Potri.T130300 PtFLA11 Chimeric 8/3/3/1/ 40% 271 Y Y Male catkins AtFLA3, PtFLA25, (POPTR_0018s03790) 2/2 AtFLA5, PtFLA26, AtFLA14, PtFLA21, AtFLA8, PtFLA17, AtFLA10 PtFLA16 Potri.002G223300 PtFLA12 Chimeric 8/7/5/4/ 41% 263 PF024 Y Y Xylem AtFLA7, PtFLA18, PtFLA3, (POPTR_0002s22020) 1/1 69.20 AtFLA6, PtFLA9, PtFLA7, AtFLA11, PtFLA23 AtFLA9, AtFLA12 Potri.019G122600 PtFLA13 Chimeric 7/5/1/0/ 39% 215 PF024 N N AtFLA12, PtFLA45, (POPTR_0019s14350) 0/2 69.20 AtFLA11, PtFLA35, AtFLA13, PtFLA39, AtFLA9, PtFLA29, AtFLA6 PtFLA47


Table 4.2: continued Potri.019G120800 PtFLA14 Chimeric 10/10/2/ 43% 214 PF024 N N AtFLA12, PtFLA39, (POPTR_0019s14320) 1/0/1 69.20 AtFLA11, PtFLA28, AtFLA9, PtFLA13, AtFLA13, PtFLA45, AtFLA6 PtFLA35 Potri.019G093300 PtFLA15 Chimeric 6/5/3/0/ 34% 245 PF024 Y Y Dark AtFLA6, PtFLA3, PtFLA9, (POPTR_0019s12310) 1/1 69.20 etiolated AtFLA9, PtFLA7, PtFLA10, seedlings AtFLA13, PtFLA23 AtFLA11, AtFLA12 Potri.014G168100 PtFLA16 Chimeric 9/1/0/0/ 30% 397 PF024 Y Y Roots AtFLA2, PtFLA22, PtFLA5, (POPTR_0014s16610) 1/0 69.20 AtFLA1, PtFLA17, AtFLA8, PtFLA21, AtFLA10, PtFLA37 AtFLA4 Potri.014G071700 PtFLA17 Chimeric 13/7/7/4 42% 421 PF024 Y Y Xylem AtFLA10, PtFLA16, (POPTR_0014s06740) /1/3 69.20 AtFLA8, PtFLA22, PtFLA5, AtFLA2, PtFLA21, AtFLA1, PtFLA25 AtFLA14 Potri.014G162900 PtFLA18 Chimeric 7/6/7/4/ 40% 262 PF024 Y Y Xylem AtFLA7, PtFLA12, PtFLA3, (POPTR_0014s16100) 1/1 69.20 AtFLA6, PtFLA9, PtFLA7, AtFLA9, PtFLA23 AtFLA11, AtFLA12 Potri.008G012400 PtFLA19 Chimeric 11/4/1/0 27% 463 PF024 Y N Xylem AtFLA17, PtFLA1, PtFLA6, (POPTR_0008s01310) /3/1 69.20 AtFLA16, PtFLA8, PtFLA41, AtFLA18, Potri.012G006200 AtFLA15, AtFLA12


Table 4.2: continued Potri.001G320800 PtFLA20 Chimeric 7/6/3/1/ 37% 243 PF024 Y Y Xylem AtFLA11, PtFLA10, (POPTR_0001s32800) 1/1 69.20 AtFLA12, PtFLA23, AtFLA6, PtFLA39, AtFLA13, PtFLA34, AtFLA9 PtFLA13 Potri.001G037800 PtFLA21 Chimeric 2/5/7/2/ 43% 281 PF024 Y Y Male catkins AtFLA14, PtFLA26, (POPTR_0001s07490) 4/2 69.20 AtFLA8, PtFLA25, AtFLA10, PtFLA11, AtFLA3, PtFLA17, PtFLA5 AtFLA2 Potri.001G367900 PtFLA22 Chimeric 7/4/2/2/ 33% 406 PF024 Y Y Dark AtFLA1, PtFLA5, PtFLA16, (POPTR_0001s37650) 1/1 69.20 etiolated AtFLA2, PtFLA17, seedlings, AtFLA8, PtFLA21, young leaf AtFLA10, PtFLA37 AtFLA14 Potri.012G127900 PtFLA23 Chimeric 5/3/2/2/ 35% 240 PF024 Y Y Xylem AtFLA11, PtFLA10, (POPTR_0012s14510) 2/1 69.20 AtFLA12, PtFLA22, AtFLA6, PtFLA34, PtFLA2, AtFLA9, PtFLA20 AtFLA13 Potri.001G440800 PtFLA24 Chimeric 8/5/8/16 50% 399 Y Y Male catkins AtFLA20, Potri.T118500, (POPTR_0001s43130) /3/2 AtFLA19, PtFLA44, AtFLA21, PtFLA36, AtFLA15, Potri.019G125200, AtFLA17 PtFLA19 Potri.018G005100 PtFLA25 Chimeric 8/3/3/1/ 40% 271 Y Y AtFLA3, PtFLA11, 2/2 AtFLA5, PtFLA26, AtFLA14, PtFLA21, AtFLA8, PtFLA17, AtFLA10 PtFLA16


Table 4.2: continued Potri.006G276200 PtFLA26 Chimeric 11/11/4/ 38% 393 Y* Y Male catkins AtFLA3, PtFLA11, (POPTR_0006s29110) 4/4/2 AtFLA14, PtFLA25, AtFLA5, PtFLA21, AtFLA8, PtFLA17, AtFLA10 PtFLA16 Potri.012G015000 PtFLA27 Chimeric 8/6/2/1/ 38% 269 PF024 Y Y AtFLA11, PtFLA48, (POPTR_0012s02210) 1/2 69.20 AtFLA12, PtFLA10, AtFLA13, PtFLA23, AtFLA6, PtFLA39, AtFLA9 PtFLA28 Potri.013G014200 PtFLA28 Chimeric 8/8/2/2/ 42% 266 PF024 Y Y AtFLA12, PtFLA39, (POPTR_0013s01570) 0/2 69.20 AtFLA11, PtFLA47, AtFLA13, PtFLA50, AtFLA9, PtFLA32, AtFLA6 PtFLA49 Potri.019G121200 PtFLA29 Chimeric 8/8/3/1/ 42% 263 PF024 Y Y Xylem AtFLA11, PtFLA50, (POPTR_0019s14420) 0/2 69.20 AtFLA12, PtFLA32, AtFLA13, PtFLA49, AtFLA9, PtFLA28, AtFLA6 PtFLA39 Potri.006G174900 PtFLA30 Chimeric 1/4/5/3/ 38% 426 PF024 Y* Y Xylem AtFLA4, PtFLA37, (POPTR_0006s18920) 0/2 69.20 AtFLA8, PtFLA17, AtFLA10, PtFLA16, PtFLA5, AtFLA1, PtFLA22 AtFLA2 Potri.008G127500 PtFLA31 Chimeric 1/0/3/1/ 29% 292 PF024 Y N Male catkins AtFLA20, PtFLA36, (POPTR_0008s12640) 0/1 69.20 AtFLA19, PtFLA42, AtFLA21, Potri.019G125200, AtFLA10, PtFLA44, PtFLA4 AtFLA12


Table 4.2: continued Potri.019G123200 PtFLA32 Chimeric 10/9/1/1 42% 263 PF024 Y Y AtFLA11, PtFLA49, (POPTR_0019s14430) /0/2 69.20 AtFLA12, PtFLA50, AtFLA9, PtFLA28, AtFLA13, PtFLA39, AtFLA6, PtFLA29 Potri.019G120900 PtFLA33 Chimeric 8/8/3/1/ 42% 227 PF024 Y Y Xylem AtFLA11, PtFLA43, (POPTR_0019s14330) 0/2 69.20 AtFLA12, PtFLA50, AtFLA13, PtFLA32, AtFLA9, PtFLA49, AtFLA6 PtFLA29 Potri.004G210600 PtFLA34 Chimeric 10/5/3/3 40% 268 PF024 Y N Xylem AtFLA11, PtFLA2, PtFLA10, (POPTR_0004s22030) /2/0 69.20 AtFLA12, PtFLA23, AtFLA9, PtFLA39, AtFLA13, PtFLA40 AtFLA6 Potri.019G123000 PtFLA35 Chimeric 11/9/2/1 39% 269 PF024 Y Y AtFLA12, PtFLA45, (POPTR_0019s14410) /0/1 69.20 AtFLA11, PtFLA39, AtFLA13, PtFLA28, AtFLA9, PtFLA47, AtFLA6 PtFLA13 Potri.008G128200 PtFLA36 Chimeric 1/0/1/1/ 28% 344 PF024 Y Y Female AtFLA20, PtFLA31, (POPTR_0008s12720) 0/2 69.20 catkins, AtFLA21, PtFLA42, male catkins AtFLA19, PtFLA44, PtFLA4, AtFLA12, Potri.T118500 AtFLA6 Potri.019G002300 PtFLA37 Chimeric 1/2/3/0/ 29% 283 Y N Female AtFLA19, Potri.001G306800, (POPTR_0019s01620) 0/2 catkins, AtFLA21, PtFLA4, young leaf AtFLA20, Potri.T118500, AtFLA17, PtFLA24, AtFLA16 Potri.019G049600


Table 4.2: continued Potri.018G097000 PtFLA38 Chimeric 2/2/5/2/ 38% 427 PF024 Y* N Xylem AtFLA4, PtFLA30, (POPTR_0018s10600) 0/3 69.20 AtFLA8, PtFLA17, AtFLA10, PtFLA16, PtFLA5, AtFLA1, PtFLA22 AtFLA2, Potri.013G151300 PtFLA39 Chimeric 9/5/2/1/ 39% 269 PF024 Y Y Xylem AtFLA12, PtFLA40, (POPTR_0013s14760) 0/2 69.20 AtFLA11, PtFLA28, AtFLA13, PtFLA47, AtFLA6, PtFLA45, AtFLA9 PtFLA50 Potri.013G151400 PtFLA40 Chimeric 9/9/2/1/ 40% 269 PF024 Y Y Xylem AtFLA11, PtFLA39, (POPTR_0013s14780) 0/2 69.20 AtFLA12, PtFLA28, AtFLA13, PtFLA47, AtFLA9, PtFLA50, AtFLA6 PtFLA32 Potri.019G008400 PtFLA41 Chimeric 9/4/0/0/ 27% 361 PF024 N N Xylem AtFLA17, PtFLA1, (POPTR_0073s00210) 3/1 69.20 AtFLA16, Potri.012G006200, AtFLA18, PtFLA19, PtFLA6, AtFLA15, PtFLA8 AtFLA7 Potri.017G111600 PtFLA42 Chimeric 5/2/4/2/ 30% 352 PF024 Y N Male catkins AtFLA20, PtFLA36, (POPTR_0017s14020) 0/2 69.20 AtFLA21, PtFLA31, AtFLA19, PtFLA44, PtFLA4, AtFLA10, Potri.019G125200 AtFLA6 Potri.019G122800 PtFLA43 Chimeric 9/8/3/0/ 41% 252 PF024 Y Y Xylem AtFLA11, PtFLA50, (POPTR_0019s14390) 0/2 69.20 AtFLA12, PtFLA32, AtFLA9, PtFLA49, AtFLA13, PtFLA29, AtFLA6 PtFLA28


Table 4.2: continued Potri.005G079500 PtFLA44 Chimeric 3/3/5/2/ 33% 442 Y N Male catkins AtFLA21, PtFLA36, (POPTR_0005s08130) 1/6 AtFLA20, PtFLA42, AtFLA19, Potri.T118500, AtFLA15 PtFLA24, PtFLA4 Potri.019G121100 PtFLA45 Chimeric 10/9/2/1 41% 262 PF024 Y N AtFLA11, PtFLA35, (POPTR_0019s14370) /0/1 69.20 AtFLA12, PtFLA39, AtFLA13, PtFLA13, AtFLA9, PtFLA28, AtFLA6 PtFLA47 Potri.009G012100 PtFLA46 Chimeric 6/7/2/0/ 36% 263 PF024 Y N Xylem AtFLA11, PtFLA2, PtFLA48, (POPTR_0009s01730) 1/2 69.20 AtFLA12, PtFLA27, AtFLA9, PtFLA28, AtFLA13, PtFLA10 AtFLA6 Potri.013G151500 PtFLA47 Chimeric 8/9/2/2/ 42% 264 PF024 Y N Xylem AtFLA12, PtFLA28, (POPTR_0013s14790) 0/2 69.20 AtFLA11, PtFLA39, AtFLA13, PtFLA40, AtFLA9, PtFLA50, AtFLA6, PtFLA32 Potri.015G013300 PtFLA48 Chimeric 7/5/2/0/ 36% 267 PF024 Y Y Xylem AtFLA11, PtFLA27, (POPTR_0015s01560) 1/3 69.20 AtFLA12, PtFLA23, AtFLA13, PtFLA10, PtFLA2, AtFLA9, PtFLA34 AtFLA6 Potri.019G121300 PtFLA49 Chimeric 10/9/1/1 42% 263 PF024 Y Y AtFLA11, PtFLA32, /0/2 69.20 AtFLA12, PtFLA50, AtFLA9, PtFLA28, AtFLA13, PtFLA39, AtFLA6 PtFLA29


Table 4.2: continued Potri.019G123100 PtFLA50 Chimeric 8/8/3/1/ 42% 263 PF024 Y Y AtFLA11, PtFLA29, 0/2 69.20 AtFLA12, PtFLA32, AtFLA13, PtFLA49, AtFLA9, PtFLA28, AtFLA6 PtFLA39 Potri.011G117800 PtPAG1 Chimeric 10/10/2 52% 343 PF022 Y Y Roots AtPAG17, PtPAG5, PtPAG6, (POPTR_0011s11860) 2/9/4/3 98.15 AtPAG11, PtPAG7, PtPAG8, AtPAG10, PtPAG9 AtPAG14, AtPAG7 Potri.006G067300 PtPAG2 Chimeric 9/13/13/ 54% 322 PF022 Y* Y Male catkins AtPAG4, PtPAG3, PtPAG10, (POPTR_0006s06640) 13/1/0 98.15 AtPAG3, PtPAG11, PtPAG4, AtPAG5, PtPAG12 AtPAG16, AtPAG7 Potri.018G129200 PtPAG3 Chimeric 4/7/14/1 60% 250 PF022 Y Y Roots AtPAG5, PtPAG2, PtPAG10, (POPTR_0018s12930) 2/0/0 98.15 AtPAG4, PtPAG11, PtPAG4, AtPAG7, PtPAG12 AtPAG17, AtPAG3 Potri.018G129400 PtPAG4 Chimeric 1/1/3/4/ 50% 183 PF022 Y Y AtPAG16, PtPAG11, (POPTR_0018s12950) 1/0 98.15 AtPAG5, PtPAG10, AtPAG7, PtPAG13, PtPAG2, AtPAG3, PtPAG3 AtPAG8 Potri.001G398800 PtPAG5 Chimeric 15/11/2 51% 377 PF022 Y Y Light-grown AtPAG17, PtPAG1, PtPAG6, (POPTR_0001s40940) 3/8/5/3 98.15 seedling, AtPAG11, PtPAG7, PtPAG9, young leaf AtPAG10, PtPAG14 AtPAG14, AtPAG7


Table 4.2: continued Potri.017G011200 PtPAG6 Chimeric 1/3/5/2/ 33% 212 PF022 Y Y AtPAG11, PtPAG7, PtPAG1, (POPTR_0017s04390) 2/0 98.15 AtPAG14, PtPAG5, PtPAG16, AtPAG17, PtPAG14 AtPAG10, AtPAG7 Potri.017G012300 PtPAG7 Chimeric 1/3/5/2/ 33% 212 PF022 Y Y AtPAG11, PtPAG6, PtPAG1, (POPTR_0017s00580) 2/0 98.15 AtPAG14, PtPAG5, PtPAG16, AtPAG17, PtPAG14 AtPAG10, AtPAG7 Potri.011G135400 PtPAG8 Chimeric 2/2/3/2/ 35% 208 PF022 Y Y Roots, AtPAG7, PtPAG14, (POPTR_0011s13870) 2/2 98.15 young leaf AtPAG13, PtPAG16, PtPAG1, AtPAG2, PtPAG5, PtPAG15 AtPAG12, AtPAG17 Potri.018G018200 PtPAG9 Chimeric 1/2/2/0/ 26% 178 PF022 Y Y Young leaf AtPAG13, PtPAG16, (POPTR_0018s02630) 2/0 98.15 AtPAG2, PtPAG15, PtPAG1, AtPAG15, PtPAG5, PtPAG6 AtPAG12, AtPAG1 Potri.001G192100 PtPAG10 Chimeric 2/1/5/3/ 41% 210 PF022 Y Y Male catkins AtPAG2, PtPAG2, PtPAG3, (POPTR_0001s19280) 1/1 98.15 AtPAG4, PtPAG4, PtPAG11, AtPAG3, PtPAG17 AtPAG16, AtPAG7 Potri.006G067400 PtPAG11 Chimeric 0/1/3/0/ 39% 163 PF022 Y Y Light-grown AtPAG16, PtPAG4, PtPAG2, (POPTR_0006s06650) 1/0 98.15 seedling AtPAG5, PtPAG3, PtPAG10, AtPAG8, PtPAG13 AtPAG3, AtPAG13


Table 4.2: continued Potri.003G047300 PtPAG12 Chimeric 1/0/4/2/ 35% 217 PF022 Y Y Female AtPAG16, PtPAG18, (POPTR_0003s04580) 1/2 98.15 catkins AtPAG4, PtPAG19, AtPAG5, Potri.006G259100, AtPAG3, PtPAG20, AtPAG8 Potri.006G259000 Potri.014G049600 PtPAG13 Chimeric 2/1/1/5/ 48% 192 PF022 Y Y Dark AtPAG9, PtPAG21, (POPTR_0014s04850) 1/1 98.15 etiolated AtPAG8, PtPAG22, seedlings AtPAG6, PtPAG290, AtPAG3, PtPAG23, AtPAG5 PtPAG12 Potri.001G419200 PtPAG14 Chimeric 4/5/2/3/ 35% 221 PF022 Y Y Roots AtPAG7, PtPAG8, PtPAG15, (POPTR_0001s44510) 0/2 98.15 AtPAG17, PtPAG6, PtPAG1, AtPAG15, PtPAG7 AtPAG11, AtPAG12 Potri.006G184100 PtPAG15 Chimeric 2/2/3/0/ 29% 178 PF022 Y Y AtPAG13, PtPAG16, PtPAG9, (POPTR_0006s19770) 2/0 98.15 AtPAG2, PtPAG8, PtPAG14, AtPAG15, PtPAG1 AtPAG12, AtPAG1 Potri.006G264600 PtPAG16 Chimeric 2/3/3/0/ 28% 179 PF022 Y Y Young leaf AtPAG13, PtPAG9, PtPAG15, (POPTR_0006s28040) 2/0 98.15 AtPAG2, PtPAG8, PtPAG1, AtPAG15, PtPAG6 AtPAG1, AtPAG12 Potri.013G061300 PtPAG17 Chimeric 2/2/3/1/ 29% 155 PF022 Y N Female AtPAG5, PtPAG39, (POPTR_0013s05800) 0/1 98.15 catkins, AtPAG4, PtPAG24, male catkins AtPAG3, PtPAG25, AtPAG16, PtPAG26, AtPAG13 PtPAG27


Table 4.2: continued Potri.002G161300 PtPAG18 Chimeric 2/2/2/0/ 31% 169 PF022 Y Y Male catkins AtPAG16, PtPAG19, (POPTR_0002s16270) 1/0 98.15 AtPAG4, Potri.002G156100, AtPAG3, Potri.002G156400, AtPAG5, Potri.006G259000, AtPAG13 Potri.006G259100 Potri.001G268700 PtPAG19 Chimeric 1/2/4/0/ 31% 165 PF022 Y Y Male catkins AtPAG16, PtPAG18, (POPTR_0001s27560) 0/0 98.15 AtPAG4, Potri.002G156100, AtPAG3, Potri.002G156400, AtPAG5, Potri.006G259000, AtPAG13 PtPAG20 Potri.002G052500 PtPAG20 Chimeric 0/1/2/0/ 28% 169 PF022 Y Y Young leaf AtPAG16, PtPAG18, (POPTR_0002s05340) 1/0 98.15 AtPAG4, PtPAG19, AtPAG3, Potri.002G156100, AtPAG5, Potri.002G156400, AtPAG13 Potri.006G259000 Potri.001G080700 PtPAG21 Chimeric 1/2/0/0/ 30% 184 PF022 Y Y AtPAG5, PtPAG22, (POPTR_0001s11680) 0/1 98.15 AtPAG8, PtPAG13, AtPAG9, PtPAG28, AtPAG16, PtPAG23, AtPAG3 PtPAG290 Potri.003G150300 PtPAG22 Chimeric 1/1/1/0/ 31% 183 PF022 Y Y AtPAG5, PtPAG21, (POPTR_0003s15000) 0/0 98.15 AtPAG16, PtPAG13, AtPAG8, PtPAG28, AtPAG3, PtPAG23, AtPAG4 PtPAG290 Potri.002G101300 PtPAG23 Chimeric 0/1/3/1/ 42% 188 PF022 Y Y Xylem AtPAG5, PtPAG290, (POPTR_0002s10170) 0/4 98.15 AtPAG8, PtPAG13, AtPAG6, PtPAG12, AtPAG3, PtPAG22, AtPAG9 PtPAG24


Table 4.2: continued Potri.013G030000 PtPAG24 Chimeric 0/1/3/2/ 31% 168 PF022 Y Y Male catkins AtPAG5, PtPAG25, (POPTR_0013s03090) 1/3 98.15 AtPAG4, PtPAG30, AtPAG3, PtPAG26, AtPAG16, PtPAG27, AtPAG13 Potri.001G114200 Potri.013G030200 PtPAG25 Chimeric 0/1/3/2/ 31% 168 PF022 Y Y Male catkins AtPAG5, PtPAG24, (POPTR_0986s00200) 1/3 98.15 AtPAG4, PtPAG30, AtPAG3, PtPAG26, AtPAG16, PtPAG27, AtPAG13 Potri.001G114200 Potri.019G037800 PtPAG26 Chimeric 1/1/1/2/ 32% 155 PF022 Y Y AtPAG5, PtPAG27, 0/0 98.15 AtPAG16, PtPAG39, AtPAG4, PtPAG24, AtPAG9, PtPAG25, AtPAG3 PtPAG30 Potri.T070900 PtPAG27 Chimeric 1/1/1/2/ 32% 155 PF022 Y Y Male catkins AtPAG5, PtPAG26, (POPTR_0019s05370) 0/0 98.15 AtPAG16, PtPAG39, AtPAG4, PtPAG24, AtPAG9, PtPAG25, AtPAG3 PtPAG30 Potri.007G120200 PtPAG28 Chimeric 2/6/13/7 49% 247 PF022 Y Y Dark AtPAG5, PtPAG21, (POPTR_0007s02750) /1/0 98.15 etiolated AtPAG17, PtPAG22, seedlings AtPAG4, PtPAG13, AtPAG3, PtPAG12, AtPAG8 PtPAG31 Potri.002G101200 PtPAG29 Chimeric 0/1/4/3/ 37% 249 PF022 Y* Y AtPAG5, PtPAG23, (POPTR_1040s00200) 0/4 98.15 AtPAG8, PtPAG13, AtPAG3, PtPAG12, AtPAG6, PtPAG22, AtPAG9 PtPAG21


Table 4.2: continued Potri.003G117900 PtPAG30 Chimeric 0/0/6/1/ 33% 167 PF022 Y Y Male AtPAG5, PtPAG24, (POPTR_0003s11780) 0/2 98.15 catkins, AtPAG4, PtPAG25, female AtPAG3, PtPAG26, catkins AtPAG16, PtPAG27, AtPAG9 PtPAG17 Potri.001G332200 PtPAG31 Chimeric 1/1/2/1/ 33% 168 PF022 Y Y Xylem AtPAG5, PtPAG24, (POPTR_0001s33960) 0/0 98.15 AtPAG4, PtPAG25, AtPAG3, Potri.009G136200, AtPAG13, PtPAG28, AtPAG16 PtPAG23 Potri.008G151000 PtPAG32 Chimeric 3/1/2/0/ 35% 185 PF022 Y N Xylem AtPAG16, PtPAG38, (POPTR_0008s15040) 1/3 98.15 AtPAG3, PtPAG18, AtPAG4, Potri.006G259000, AtPAG5, Potri.006G259100, AtPAG13 PtPAG19 Potri.017G088500 PtPAG33 Chimeric 2/2/1/1/ 23% 175 PF022 Y* Y Roots AtPAG16, Potri.001G219900, (POPTR_0017s12450) 0/0 98.15 AtPAG9, Potri.001G219800, AtPAG1, Potri.017G088600, AtPAG5, Potri.003G183300, AtPAG2, Potri.001G043600 Potri.015G114300 PtPAG34 Chimeric 0/2/0/0/ 20% 131 PF022 Y N AtPAG11, Potri.015G114700, (POPTR_0015s12570) 0/1 98.15 AtPAG7, Potri.015G113300, AtPAG13, Potri.015G115600, AtPAG2, Potri.015G117100, AtPAG14 Potri.015G114600 Potri.010G243600 PtPAG35 Chimeric 3/3/6/0/ 34% 214 PF022 Y Y Male catkins AtPAG11, PtPAG2, PtPAG4, (POPTR_0010s24980) 1/2 98.15 AtPAG5, PtPAG3, PtPAG18, AtPAG17, PtPAG12 AtPAG2, AtPAG4,


Table 4.2: continued Potri.001G187700 PtPAG36 Chimeric 1/1/2/2/ 27% 181 PF022 Y Y Male AtPAG11, PtPAG37, (POPTR_0001s18820) 1/0 98.15 catkins, AtPAG7, Potri.015G052000, female AtPAG2, PtPAG8, PtPAG1, catkins AtPAG17, Potri.001G338800 AtPAG14 Potri.003G050500 PtPAG37 Chimeric 2/0/2/1/ 26% 180 PF022 Y Y AtPAG17, PtPAG36, (POPTR_0003s04900) 0/0 98.15 AtPAG2, Potri.015G052000, AtPAG13, PtPAG15, AtPAG7, Potri.001G338800, AtPAG15 PtPAG1 Potri.010G089900 PtPAG38 Chimeric 1/2/2/1/ 34% 185 PF022 Y N Xylem AtPAG16, PtPAG32, (POPTR_0010s10020) 1/2 98.15 AtPAG3, PtPAG18, AtPAG4, Potri.006G259000, AtPAG5, Potri.006G259100, AtPAG13 Potri.002G156100 Potri.013G054500 PtPAG39 Chimeric 2/1/0/1/ 29% 156 PF022 Y N Female AtPAG5, PtPAG26, (POPTR_0013s05140) 0/0 98.15 catkins AtPAG16, PtPAG27, AtPAG4, PtPAG24, AtPAG3, PtPAG25, AtPAG9 PtPAG17 Potri.002G092800 PtAGP57I Chimeric 10/7/3/0 46% 193 PF143 Y N AtAGP29I PtAGP60I, (POPTR_0002s09340) /0/1 68.4 PtAGP64C, PtAGP58I, PtAGP61I, PtAGP69C Potri.003G020200 PtAGP58I Chimeric 6/5/2/1/ 43% 179 PF143 Y Y Xylem, AtAGP29I PtAGP61I, (POPTR_0003s01440) 1/0 68.4 young leaf PtAGP60I, PtAGP64C, PtAGP57I, PtAGP68C


Table 4.2: continued Potri.006G261800 PtAGP59I Chimeric 3/11/9/5 36% 484 PF007 Y N Male catkins None Potri.018G112100, (POPTR_0006s27770) /2/4 04.26 Potri.006G188400, Potri.006G188300, Potri.018G111600, Potri.006G262000 Potri.005G167500 PtAGP60I Chimeric 10/9/4/1 48% 216 PF143 Y N Male AtAGP29I PtAGP64C, (POPTR_0005s16550) /0/3 68.4 catkins, PtAGP57I, female PtAGP58I, catkins PtAGP61I, PtAGP69C Potri.001G210100 PtAGP61I Chimeric 8/5/3/0/ 41% 178 PF143 Y Y Young leaf AtAGP29I, PtAGP58I, (POPTR_0001s21750) 0/0 68.4 AtAGP3C PtAGP60I, PtAGP64C, PtAGP57I, Potri.001G231400 Potri.010G085200 PtAGP62I Chimeric 4/1/6/5/ 47% 216 PF143 Y Y Male catkins AtAGP29I PtAGP65C, (POPTR_0010s09530) 2/4 68.4 PtAGP66C, PtAGP67C, PtAGP68C, PtAGP69C Potri.005G003500 PtAGP63I Chimeric 7/15/6/9 41% 624 PF079 Y Y AtPRP13, Potri.013G003500, (POPTR_0005s00550) /0/5 83.11 AtPEX4 PtAGP70I, PtAGP71I, PtAGP72I, PtAGP73I Potri.002G059600 PtAGP70I Chimeric 0/1/4/7/ 47% 255 PF079 Y N AtPRP13 PtAGP73I, (POPTR_0002s06050) 0/3 83.11 PtAGP71I, PtAGP72I, PtAGP63I, Potri.011G094400


Table 4.2: continued Potri.001G353400 PtAGP71I Chimeric 1/7/5/9/ 49% 286 PF079 Y N AtPRP13 PtAGP72I, (POPTR_0001s34420) 1/5 83.11 PtAGP70I, PtAGP73I, PtAGP63I, Potri.013G003500 Potri.011G078500 PtAGP72I Chimeric 1/7/5/10 46% 304 PF079 Y Y AtPRP13 PtAGP71I, (POPTR_0011s02870) /1/1 83.11 PtAGP70I, Potri.013G003500, PtAGP63I, PtAGP73I Potri.005G202400 PtAGP73I Chimeric 1/2/4/5/ 44% 261 PF079 Y N AtPRP13 PtAGP70I, 0/3 83.11 PtAGP71I, PtAGP72I, PtAGP63I, Potri.013G003500

a Version 2.0 protein identifiers are shown in parentheses. Italics indicate a protein that was identified only by a BLAST search.

b The domains indicated by the Pfam number are: PF14368.4, LTP_2 domain (Probable lipid transfer); PF06376.10, AGP domain (Arabinogalactan peptide); PF02469.20, Fasciclin domain (Fasciclin domain); PF02298.15, Cu_bind_like domain (Plastocyanin-like domain); PF00704.26, Glyco_hydro_18 domain (Glycoside hydrolase family 18); PF07983.11, X8 domain (X8 domain).

c An asterisk indicates a protein that is predicted to have a signal peptide either using the sensitive mode in the SignalP website or only if amino acids at the N terminus are discarded.

d Expression data are shown only when available at http://bar.utoronto.ca/efppop/cgi-bin/efpWeb.cgi.

e A locus ID indicates a protein that was not identified as an HRGP.


Classical AGP >Potri.001G310400-PtAGP9C MAANKSMVFLMLITFLVASTKAQSPSSSPASSPTKSPPVATPPPKASAPAPTTVKPPASAPSPLE TPPPAANAPSPTTTTSPPSPLPVTPVPSTGDVPTSTIGSPAVAPAPANGAVLNRFALGGSVAVGV LAAVLVL

Lysine-rich Classical AGP >Potri.007G051600-PtAGP13K MDRNGILGWTLICVLVAGVGGQAPAATPTSTPATPTTPSVPLAAPAKAPAKPTTPAPVSSPPAVT PVASSPKQTVPTPVATPLATPPPAVTPVSSPPAPVPVSSPPEKSPPSPVPVAPPTSSPVAAPTAE VPAPTPSKKKPKKAPAPGPALLSPPAPPTEAPGPSAESMSPGSIADDSGAGRTRCFQKIAGGLAL GWGLLALIF

AG Peptide >Potri.001G268400-PtAGP38P MAQLASTFKAMAIFFVVAMYSATVTGQDFEMAPAPAPTMDKGAACSLGMSGAVFCSTLLLSLLAL LKH

(Chimeric) FLA >Potri.013G120600-PtFLA3 MATTPLSFFLLSLLSLSLNAQAQTPTAPAPTPSGPVNFTAVLVKGGQFATLIRLLNNTQTLNQIE NQLNSSSEGMTIFAPTDNAFNNLKAGALNGLNQQEQVQLLQYHTLPKFYTMSNLLLVSNPVPTQA SGQDGVWGLNFTGQSNQVNVSTGLVEVQINNALRQDSPLAVYPVDKVLLPEALFGVKPPTASPPA PSSKSNSTVAAAEPSTGKNSAGGRNVALGLVVGLGLVCMGILS

(Chimeric) PAG >Potri.001G398800-PtPAG5 MESRRCLCLLWALLACYSFTSSAAYNNSFDVGGKDGWVTNPSESYNHWAEKNRFQVNDSLVFKYN NGSDSVLLVTKDDYNSCKTKKPLKTMGSGSSVFQFDKSGPYFFISGNEDNCRKGQKMTVVVLSVK PKQAPTPVSQPPAMSPKAPSPVAYNNPSPAPSKSPSPSAEPPASSQGPSLSPISPAPISKTPSGS PLEAPGPSLVPVKSSPPSADTPTLAPSPTSNAPTGPVPAKSPSLSVSSPYLAPSPFSDAPTGAPG PSPVAMTPHISLVPSGSPASAPGSEISPSPLTNPPAPSQSPESPSPLASAPVVSPIPAKSPSSST PTPKSSYTPAHSPNSNGADLAPAPAASCVATPSTVMVIVASFLIGSVIGVWP

Other Chimeric AGP >Potri.001G210100-PtAGP61I MASRKVLSLILLCTFSISCCSQSPASAPAPSSVDCANLIFSMADCLSFVSNDSTAAKPEGKCCAG LKTVLSTKAECLCEAFKSSARFDIVLNVTKALSLPSVCKIHAPPASNCGCQLAISPSGARAPAPG GSAPGLAVNGGGNEQAPAPSPGHSGSIGFSISVGSLIIGFVFASFSSF

Figure 4.2 Protein sequences encoded by the representative AGP gene classes in Populus trichocarpa. The colored sequences at the N and C termini indicate predicted signal peptides (green) and GPI anchor addition sequences (light blue), if present. AP, PA, SP, TP, VP, and GP repeats (yellow), lysine-rich regions (olive), and the core fasciclin motif (brown) are also indicated.
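The dipeptide repeats highlighted in Figure 4.2 can be tallied directly from a sequence. The following is a minimal Python sketch, not the BIO OHIO 2.0 implementation; the function names are hypothetical, and it assumes that the percentage column of Table 4.2 reports the combined Pro, Ala, Ser, and Thr (PAST) content. It uses the PtAGP38P sequence from Figure 4.2.

```python
from collections import Counter

# Representative AG peptide from Figure 4.2 (Potri.001G268400, PtAGP38P).
PTAGP38P = (
    "MAQLASTFKAMAIFFVVAMYSATVTGQDFEMAPAPAPTMDKGAACSLGMSGAVFCSTLLLSLLAL"
    "LKH"
)

# Dipeptide repeats named in the Figure 4.2 caption.
DIPEPTIDES = ("AP", "PA", "SP", "TP", "VP", "GP")


def past_percent(seq: str) -> float:
    """Percentage of residues that are Pro, Ala, Ser, or Thr (assumed meaning of % PAST)."""
    return 100.0 * sum(seq.count(aa) for aa in "PAST") / len(seq)


def dipeptide_counts(seq: str) -> Counter:
    """Count every occurrence, overlapping ones included, of each dipeptide of interest."""
    counts = Counter()
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in DIPEPTIDES:
            counts[pair] += 1
    return counts


counts = dipeptide_counts(PTAGP38P)
composition = past_percent(PTAGP38P)
```

For this 68-residue sequence the sketch finds three AP and two PA dipeptides and a PAST content just under 40%, which agrees with the PtAGP38P row of Table 4.2 (3/2/0/0/0/0 and 39%).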


The vast majority (97%) of the identified AGPs were predicted to have a signal peptide, and many (70%) were predicted to have a GPI anchor; both are characteristic features of the AGP family. Of the 162 AGPs identified, only four, all FLAs, were predicted to lack a signal peptide, and 114 (70%) were predicted to have a GPI anchor addition sequence. BLAST searches against the Arabidopsis protein database found that all but 21 of the putative AGPs were similar to at least one known Arabidopsis AGP, providing further evidence that these proteins are likely AGPs.

4.3.2 Extensins (EXTs)

Poplar had fewer classical EXTs containing large numbers of SPPPP repeats than Arabidopsis. For instance, a search for proteins with at least 15 SPPPP repeats in Arabidopsis found 21 “hits”, while a similar search in poplar yielded only six, two of which are chimeric EXTs. The largest number of SPPPP repeats found in a single poplar protein is 25, whereas one Arabidopsis EXT contains 70 SPPPP repeats.
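A repeat-count search of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BIO OHIO 2.0 code: the function names are hypothetical, the toy sequences are invented, and SPPPP occurrences are counted as plain substrings.

```python
from typing import Dict, List


def count_sp4(seq: str) -> int:
    """Number of SPPPP occurrences; the motif cannot overlap itself."""
    return seq.upper().count("SPPPP")


def proteins_with_min_sp4(proteins: Dict[str, str], min_repeats: int = 15) -> List[str]:
    """Names of proteins carrying at least `min_repeats` SPPPP motifs."""
    return [name for name, seq in proteins.items() if count_sp4(seq) >= min_repeats]


# Hypothetical toy sequences, not real poplar or Arabidopsis proteins.
toy_proteome = {
    "toy_classical_EXT": "MKT" + "SPPPPYVYK" * 16,
    "toy_short_EXT": "MKT" + "SPPPPK" * 2,
}
hits = proteins_with_min_sp4(toy_proteome, min_repeats=15)
```

Lowering `min_repeats` widens the search in the same way; in practice the input dictionary would be built from a proteome FASTA file.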

Interestingly, although classical EXTs are less abundant in poplar, many chimeric EXTs found in Arabidopsis were also found in poplar in similar numbers, including the leucine-rich repeat extensins (LRXs) and proline-rich extensin-like receptor protein kinases (PERKs). A search for proteins containing at least two SPPP repeats identified 162 poplar proteins (Table 4.1), 59 of which were determined to be EXTs (Table 4.3). One additional short EXT (Potri.T139000, or PtEXT33), which contains a single SPPPP repeat, was identified by a BLAST search because it is homologous to several other short EXTs. These 60 proteins included 8 classical EXTs, 22 short EXTs, 10 LRXs, 12 PERKs, 5 Formin Homology proteins (FHs), and 3 other chimeric EXTs (Fig 4.3 and Fig S2) (Showalter et al. 2016). YXY sequences, which are involved in cross-linking EXTs (Brady et al. 1996; Schnabelrauch et al. 1996; Brady et al. 1998; Held et al. 2004; Cannon et al. 2008), were observed in 27 of the 60 EXTs (45%), and the X residue was quite variable. In contrast, 40 of the 59 EXTs in Arabidopsis (68%) contained YXY sequences, in which X was often V (Showalter et al. 2010). Many of the classical EXTs and some of the LRXs also contained an SPPPP or SPPPPP sequence and a Y residue at the C-terminus, as previously observed in Arabidopsis EXTs (Cannon et al. 2008).
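The YXY tally described above can be illustrated with a short Python sketch. It is an assumption-laden example rather than the method used in this study: the function names and toy sequences are hypothetical, and a lookahead is used so that overlapping motifs such as YVYSY are each counted.

```python
import re
from collections import Counter
from typing import Iterable

# Zero-width lookahead: the match itself is empty, so overlapping YXY motifs are all found.
YXY = re.compile(r"(?=Y([A-Z])Y)")


def yxy_middle_residues(seq: str) -> Counter:
    """Tally the X residue of every YXY motif in a protein sequence."""
    return Counter(m.group(1) for m in YXY.finditer(seq.upper()))


def fraction_with_yxy(seqs: Iterable[str]) -> float:
    """Fraction of sequences that contain at least one YXY motif."""
    seqs = list(seqs)
    return sum(1 for s in seqs if yxy_middle_residues(s)) / len(seqs)


# Hypothetical toy sequences, not real EXTs.
toy_exts = ["SPPPPYVYSYK", "SPPPPKSPPPPK"]
middle = yxy_middle_residues(toy_exts[0])
share = fraction_with_yxy(toy_exts)
```

Applied to a full set of EXT sequences, `fraction_with_yxy` would give the proportion reported in the text (27 of 60 in poplar), and the `Counter` shows how variable the X residue is.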


Table 4.3 Identification and analysis of EXT genes in Populus trichocarpa.

Columns: Locus Identifier 3.0 (ID 2.0)a; Name; Class; SP3/SP4/SP5/YXY Repeats; AA; Pfamb; SPc; GPI; Organ/tissue-specific Expressiond; Arabidopsis HRGP BLAST Hits; Poplar HRGP BLAST Hitse

Potri.018G050100 PtEXT1 Classical 1/6/4/5 190 PF04554.11 Y N Young leaf AtEXT22, Potri.001G201800 (POPTR_0018s05480) EXT AtEXT21 Potri.001G019700 PtEXT2 Classical 1/21/0/1 213 Y N AtEXT3/5 PtEXT8 (POPTR_0001s05720) EXT 1 Potri.001G122100 PtEXT3 Classical 2/5/6/0 238 PF14547.4 Y N Male catkins AtPRP16, Potri.013G128800, (POPTR_0001s00420) EXT AtPRP15, Potri.002G200100, AtPRP14, Potri.018G025900, AtHAE4 Potri.001G158400, Potri.014G059800 Potri.001G259600 PtEXT4 Classical 2/8/2/0 500 Y N AtAGP51C PtEXT7, AGP6C, (POPTR_0001s26690) EXT AGP43P Potri.001G020100 PtEXT5 Classical 1/22/0/1 257 Y N None PtEXT6, PtEXT8 (POPTR_0001s05740) EXT 3 Potri.001G019900 PtEXT6 Classical 1/25/0/1 259 Y* N None PtEXT8, PtEXT5 EXT 4 Potri.001G260200 PtEXT7 Classical 4/6/1/0 222 Y N None AGP43P, AGP6C, (POPTR_0001s26680) EXT PtEXT4, Potri.003G074200 Potri.001G020000 PtEXT8 Classical 1/23/0/1 267 Y* N AtEXT3/5 PtEXT6, PtEXT5 EXT 6 Potri.010G001200 PtEXT9 Short 1/6/0/3 174 Y Y AtEXT37, PtEXT24, (POPTR_0010s00350) EXT AtEXT41 FLA21Potri.008G1292 00, Potri.010G128900, Potri.008G117500, Potri.010G113300 PtEXT10 Short 0/2/0/0 131 Y N AtEXT31, PtEXT23, (POPTR_0010s12360) EXT AtEXT33 Potri.006G106800, Potri.005G033000, Potri.001G371600, PossiblePtEXT5

Table 4.3: continued Potri.T091000 PtEXT11 Short 1/1/0/0 106 Y N None PtEXT12, EXT PtEXT19, Potri.005G079400 Potri.013G045700 PtEXT12 Short 1/1/0/0 111 Y N None PtEXT11, PtEXT19 (POPTR_0013s04290) EXT Potri.003G064900 PtEXT13 Short 1/1/3/0 167 Y N AtEXT32, PtEXT26, (POPTR_0003s06350) EXT AtAGP57C, Potri.009G013500, AtPERK5 Potri.006G276200 Potri.006G225400 PtEXT14 Short 2/0/1/3 186 Y Y Male AtEXT38, Potri.015G147200, (POPTR_0006s24190) EXT catkins, AtEXT7 Potri.008G168300, roots Potri.010G094700, Potri.012G144400, PtFH2 Potri.002G070100 PtEXT15 Short 0/1/2/2 102 Y N AtEXT3/5, PtEXT20, EXT AtEXT1/4, Potri.017G110900, AtEXT22 PtEXT1, PtLRX3 Potri.019G015900 PtEXT16 Short 0/2/0/0 108 Y N None PtEXT18, (POPTR_0019s03210) EXT PtEXT33, PtEXT17, Potri.019G015700, Potri.T139100 Potri.019G015800 PtEXT17 Short 0/2/0/0 107 Y N Male catkins None PtEXT33, PtEXT18, (POPTR_0019s03200) EXT PtEXT16, Potri.T139100, Potri.019G015700 Potri.019G016000 PtEXT18 Short 0/2/0/0 116 Y N None PtEXT16, PtEXT33, EXT PtEXT17, Potri.019G015700, Potri.T139100


Table 4.3: continued Potri.019G017300 PtEXT19 Short 0/2/0/0 110 Y* N Dark AtPERK6, PtEXT11, (POPTR_0019s03400) EXT etiolated AtAGP45P PtEXT12, seedlings Potri.005G257000, Potri.010G244800, Potri.006G136900 Potri.005G190100 PtEXT20 Short 1/2/0/2 115 Y N AtEXT3/5, Potri.019G083200, EXT AtEXT1/4, Potri.013G112500, AtPRP3, PtLRX3, AtPRP1 Potri.007G090300, Potri.005G077700 Potri.014G124700 PtEXT21 Short 0/2/0/0 168 Y N AtEXT34, Potri.015G147200, EXT AtEXT41, Potri.012G144400, AtPERK3, Potri.001G371600, AtPERK5 Potri.004G143700, PtFH2 Potri.T082000 PtEXT22 Short 1/1/1/0 177 Y* N None PtAEH4, EXT PtEXT28, PtEXT27, Potri.001G042100, Potri.008G043900 Potri.008G129100 PtEXT23 Short 0/3/0/0 155 Y Y Female AtEXT31, PtEXT10, (POPTR_0008s12800) EXT catkins, AtEXT33, Potri.010G094700, xylem AtPAG10 Potri.015G147200, Potri.006G163700, Potri.018G086100 Potri.008G213600 PtEXT24 Short 0/1/1/2 172 Y Y Male catkins AtEXT37, PtEXT9, (POPTR_0008s22980) EXT AtPERK6, Potri.008G129200, AtEXT41 PossiblePtEXT15, Potri.010G094700, Potri.004G143700


Table 4.3: continued Potri.008G125400 PtEXT25 Short 2/0/0/0 80 Y* N None Potri.005G239200, (POPTR_0008s12430) EXT Potri.010G094700, Potri.010G006800, Potri.002G189300, Potri.005G239200 Potri.001G169200 PtEXT26 Short 0/0/2/0 147 Y N None PtEXT13, (POPTR_0001s16930) EXT Potri.010G006800 Potri.001G042200 PtEXT27 Short 2/2/0/1 177 Y N None PtEXT28, (POPTR_0001s03370) EXT PtEXT22, PtAEH4, Potri.001G042100, Potri.001G316500 Potri.T179500 PtEXT28 Short 1/0/1/0 176 Y* N None PtAEH4, (POPTR_0523s00220) EXT PtEXT22, PtEXT27, Potri.001G042100, Potri.005G030300 Potri.T101300 PtEXT29 Short 0/2/0/0 151 Y* N AtAGP56C Potri.007G120100, (POPTR_0017s06820) EXT Potri.002G054100, Potri.001G371600, Potri.015G147200, Potri.002G235500 Potri.T139000 PtEXT33 Short 0/1/0/0 107 Y N None PtEXT17, PtEXT18, EXT PtEXT16, Potri.019G015700, Potri.T139100 Potri.009G108100 PtLRX1 Chimeric 5/16/6/1 982 PF13855.4 Y N Female AtPEX3, PtLRX2, (POPTR_0009s11130) catkins AtPEX1, PtLRX10, AtPEX4, PtLRX3, AtPEX2, PtLRX6, AtLRX4 PtLRX7


Table 4.3: continued Potri.004G146400 PtLRX2 Chimeric 2/19/1/1 603 PF13855.4 Y N Male catkins AtPEX3, PtLRX1, PtLRX10, (POPTR_0004s15360) AtPEX4, PtLRX3, PtLRX4, AtPEX1, PtLRX7 AtPEX2, AtLRX4 Potri.006G081200 PtLRX3 Chimeric 2/1/3/0 584 PF13855.4 Y* N AtLRX2, PtLRX7, PtLRX6, PF08263.10 AtLRX1, PtLRX4, PtLRX2, AtLRX4, PtLRX10 AtLRX3, AtLRX5 Potri.006G245600 PtLRX4 Chimeric 2/2/5/1 549 PF08263.10 Y N Xylem AtLRX4, PtLRX8, PtLRX5, (POPTR_0006s26190) AtLRX3, PtLRX9, PtLRX6, AtLRX5, PtLRX3 AtLRX7, AtLRX6 Potri.006G162300 PtLRX5 Chimeric 2/3/3/0 569 PF13855.4 Y N Male catkins AtLRX4, PtLRX9, PtLRX6, (POPTR_0024s00730) AtLRX3, PtLRX4, PtLRX8, AtLRX2, PtLRX3 AtLRX1, AtPEX4 Potri.018G075900 PtLRX6 Chimeric 1/2/5/0 509 PF13855.4 Y N Male AtLRX3, PtLRX5, PtLRX9, (POPTR_0018s06150) catkins, AtLRX5, PtLRX4, PtLRX8, young leaf, AtLRX2, PtLRX3 xylem AtLRX7, AtLRX1 Potri.018G151000 PtLRX7 Chimeric 1/6/1/0 481 PF08263.10 Y N Male catkins AtLRX2, PtLRX3, PtLRX6, (POPTR_0018s14790) PF13855.4 AtLRX1, PtLRX5, PtLRX9, AtLRX4, PtLRX4 AtLRX3, AtLRX5


Table 4.3: continued Potri.018G035100 PtLRX8 Chimeric 0/3/2/1 496 PF08263.10 Y N Male catkins AtLRX4, PtLRX4, PtLRX6, (POPTR_0018s01010) AtLRX3, Potri.010G083000, AtLRX5, PtLRX3, PtLRX7 AtLRX7, AtLRX6 Potri.T016600 PtLRX9 Chimeric 2/3/4/0 573 PF13855.4 Y N Male catkins AtLRX4, PtLRX5, PtLRX6, (POPTR_0028s00200) AtLRX3, PtLRX8, PtLRX3, AtLRX2, PtLRX7 AtLRX1, AtPEX4 Potri.014G036700 PtLRX10 Chimeric 1/5/1/1 474 PF13855.4 Y N Male catkins AtPEX3, PtLRX2, PtLRX1, (POPTR_0014s03600) AtPEX1, PtLRX3, PtLRX7, AtPEX4, Potri.007G139200 AtPEX2, AtLRX4 Potri.010G041400 PtPERK1 Chimeric 5/0/2/1 700 PF07714.15 N N AtPERK13, PtPERK11,PtPERK3, (POPTR_0010s05110) AtPERK12, PtPERK6, PtPERK3, AtPERK11, PtPERK12 AtPERK10, AtPERK8 Potri.010G132900 PtPERK2 Chimeric 5/4/2/1 765 PF00069.23 N N AtPERK8, PtPERK12, (POPTR_0010s14290) AtPERK13, PtPERK11, PtPERK1, AtPERK1, PtPERK8, PtPERK10 AtPERK15, AtPERK4 Potri.017G110400 PtPERK3 Chimeric 5/5/0/1 724 PF07714.15 N N Dark AtPERK8, PtPERK6, PtPERK12, (POPTR_0017s14140 ) etiolated and AtPERK10, PtPERK2, PtPERK1, light-grown AtPERK13, PtPERK11 seedlings AtPERK12, AtPERK3


Table 4.3: continued Potri.009G115200 PtPERK4 Chimeric 1/6/2/1 649 PF07714.15 N N Male catkins AtPERK5, PtPERK10, PtPERK9, (POPTR_0009s11810 ) AtPERK4, PtPERK8, AtPERK15, Potri.001G183000, AtPERK3, Potri.T140000 AtPERK13 Potri.004G153600 PtPERK5 Chimeric 3/3/3/1 656 PF07714.15 N N AtPERK5, PtPERK4, PtPERK10, (POPTR_0004s16100) AtPERK7, PtPERK9, PtPERK8, AtPERK4, Potri.001G183000 AtPERK6, AtPERK15 Potri.004G105200 PtPERK6 Chimeric 6/4/0/2 724 PF07714.15 N N Dark AtPERK10, PtPERK3, PtPERK2, (POPTR_0004s10490) etiolated AtPERK12, PtPERK1, PtPERK11, seedlings AtPERK13, PtPERK10 AtPERK3, AtPERK15 Potri.006G242800 PtPERK7 Chimeric 2/0/0/1 706 PF07714.15 N N AtPERK1, PtPERK10, PtPERK9, AtPERK5, Potri.001G183000, AtPERK14, Potri.003G053300, AtPERK15, Potri.T140000 AtPERK3 Potri.018G081300 PtPERK8 Chimeric 0/2/2/0 672 PF07714.15 N N Xylem AtPERK1, Potri.001G183000, (POPTR_0018s08800) AtPERK4, PtPERK10, PtPERK9, AtPERK5, Potri.003G053300, AtPERK15, PtPERK5 AtPERK6 Potri.007G027000 PtPERK9 Chimeric 2/3/5/1 639 PF07714.15 N N AtPERK5, PtPERK10, PtPERK8, (POPTR_0007s12680) AtPERK7, PtPERK5, AtPERK6, Potri.003G053300, AtPERK15, Potri.T140000 AtPERK13


Table 4.3: continued Potri.005G124400 PtPERK10 Chimeric 2/1/5/0 592 PF07714.15 N N Female AtPERK4, PtPERK9, PtPERK8, (POPTR_0005s12590) catkins, AtPERK5, PtPERK5, PtPERK4, male catkins AtPERK7, Potri.001G183000 AtPERK6, AtPERK1 Potri.008G189700 PtPERK11 Chimeric 5/3/1/1 733 PF07714.15 N N Male catkins AtPERK13, PtPERK1, PtPERK3, (POPTR_0008s19400) AtPERK11, PtPERK6, PtPERK12, AtPERK8, PtPERK2 AtPERK10, AtPERK15 Potri.008G111600 PtPERK12 Chimeric 0/6/2/1 728 PF07714.15 N N AtPERK13, PtPERK2, PtPERK1, (POPTR_0008s11080) AtPERK1, PtPERK8, PtPERK11, AtPERK5, Potri.001G183000 AtPERK15, AtPERK3 Potri.003G103800 PtFH1 Chimeric 1/0/2/0 1226 PF02181.21 N N Female None Potri.018G019600, (POPTR_0003s10280) PF10409.7 catkins, PtFH5, male catkins Potri.018G108000, Potri.006G263700, Potri.015G061000 Potri.011G131700 PtFH2 Chimeric 1/0/2/0 987 PF02181.21 Y N Roots None Potri.001G416100, (POPTR_0011s13510) Potri.007G119900, Potri.007G054900, PtFH4, Potri.017G009900 Potri.002G240200 PtFH3 Chimeric 1/0/1/0 1066 PF02181.21 Y N Young leaf, None PtFH4, (POPTR_0002s24130) male catkins Potri.007G140200, Potri.017G009900, Potri.007G054900, Potri.013G017900


Table 4.3: continued Potri.014G174700 PtFH4 Chimeric 0/0/2/0 1071 PF02181.21 Y N Roots, light- AtPERK5 PtFH3, (POPTR_0014s17310) grown Potri.007G140200, seedling Potri.017G009900, Potri.007G054900, Potri.013G017900 Potri.012G067900 PtFH5 Chimeric 0/0/2/0 1400 PF10409.7 N N Xylem, male None Potri.015G061000, (POPTR_0012s06980) PF02181.21 catkins Potri.018G019600, Potri.006G185500, Potri.018G108000, PtFH1 Potri.009G145700 PtEXT30 Chimeric 5/0/0/0 467 PF06830.9 Y N Male AtEXT51 Potri.009G097400, (POPTR_0009s14810) catkins, Potri.012G145400, roots Potri.011G127900, Potri.009G012600, Potri.009G012500 Potri.014G115700 PtEXT31 Chimeric 8/0/0/0 526 PF00295.15 Y* N Roots None Potri.002G190600, (POPTR_0014s11110) Potri.005G005500, Potri.013G005000, Potri.010G152000, Potri.008G100500 Potri.011G066900 PtEXT32 Chimeric 0/1/2/2 498 PF00112.21 Y N Female AtAGP4C Potri.011G066800, (POPTR_0011s07300) PF00396.16 catkins, Potri.004G057700, PF08246.10 male catkins Potri.005G232900, Potri.014G024100, Potri.001G302100 Potri.004G024500 PtAEH1 AGP 0/1/1/1 673 PF01657.15 Y N None Potri.004G024600, EXT PF07714.15 PtAEH2, Hybrid Potri.004G025800, Potri.011G028400, Potri.004G025900


Table 4.3: continued Potri.004G024800 PtAEH2 AGP 0/1/1/1 678 PF01657.15 Y N None Potri.004G024600, EXT PF07714.15 Potri.004G025800, Hybrid PtAEH1, Potri.011G028400, Potri.004G025900 Potri.003G082300 PtAEH3 AGP 2/0/0/0 188 Y* Y Dark and AtPRP1 Potri.005G191900, (POPTR_0003s08030) EXT light-grown Potri.016G025300, Hybrid seedlings, Potri.004G162500, young leaf PossibleHybrid2, Potri.015G147200 Potri.003G184500 PtAEH4 AGP 1/1/1/0 177 Y* N None PtEXT22, PtEXT28, EXT PtEXT27, Hybrid Potri.001G042100, Potri.019G047600 a Protein identifiers of the version 2.0 are shown in the parenthesis. Italics indicates a protein that was identified only by a BLAST search. b The domains indicated by the Pfam number are: PF04554.11, Extensin_2 domain (Extensin-like region); PF14547.4, Hydrophob_seed domain (Hydrophobic seed protein); PF13855.4, LRR_8 domain (Leucine rich repeat); PF08263.10, LRRNT_2 domain (Leucine rich repeat N-terminal domain); PF07714.15, Pkinase_Tyr domain (Protein tyrosine kinase); PF00069.23, Pkinase domain (Protein kinase domain); PF02181.21, FH2 domain (Formin Homology 2 Domain); PF10409.7, PTEN_C2 domain (C2 domain of PTEN tumour-suppressor protein); PF06830.9, Root_cap domain (Root cap); PF00295.15, Glyco_hydro_28 domain (Glycoside hydrolase family 28); PF00112.21, Peptidase_C1 domain (Papain family cysteine protease); PF00396.16, Granulin domain (Granulin) ; PF08246.10, Inhibitor_I29 domain (Cathepsin propeptide inhibitor domain); PF01657.15, Stress-antifung domain (Salt stress response/antifungal); PF07714.15, Pkinase_Tyr domain (Protein tyrosine kinase). c Asterisk indicates a protein that is predicted to have a signal peptide either using the sensitive mode in the SignalP website or only if amino acids at the N terminus are discarded. d Expression data are shown only when available at http://bar.utoronto.ca/efppop/cgi-bin/efpWeb.cgi e A locus ID indicates that it is not identified as an HRGP.


Classical EXT >Potri.001G020100-PtEXT5 MENRGRMGHLSPMIHAIAICLVATSVVAYEPYYYKSPPPPSQSPPPPYHYSSPPPPKKSPPPPYH YTSPPPPKKSPPPPYHYSSPPQPKKSPPPPYHYSSPPPPKKSPPPPYHYSSPPPPKKSPLPPYHY SSPPPPKKSPPPPYHYSSPPPPKKSPPPQYHYTSPPPPKKSPPPPYHYTSPPPPKKSPPPPYHYT SPPPPKKSPPPPYHYTSPPPPKKSPPPPYHYSSPPPPKKSPPPPYHYTSPPPPKKIEIVDPW

Short EXT >Potri.010G001200-PtEXT9 METRHKLKLSLLALFMLLPSTKSSTMPKSRMLYQIACTMCSTCCGSTPVPSPPPPSPPPPAASPP PPATTAICPPPPSPPPSGGGSYYYSPPPPSTYTYSSPPPPQGGVVGGTYYPPPNYKNYPTPPPPN PIVPYFPFYYYSPPPPSMSASFKLMASYSTSVLVGVVALVLCLF

(Chimeric) LRX >Potri.018G075900-PtLRX6 MKKTTHSHLHLSLLIALFLGTVVCLSAAKQAPTISSYGGLSDADAMYIKKRQLLYYKDEFGDRGE RVTVDPSLVFPNPRLRNAYIALQAWKQAIFSDPLNLTANWVGSQVCNYEGVFCAPAPDNKTIRTV AGIDLNHGDIAGYLPEELGLLVDLALFHINSNRFCGTVPRKFKDMRLLFELDLSNNRFAGKFPQV VLKLPSLKFLDLRFNEFEGTVPKELFDKDLDAIFINHNRFVFDLPENLGNSPVSVIVLANNKFHG CVPSSLGNMSNLNEIILMNNGFRSCMPAEIGLLKDLTVLDVSFNQLMGPLPDAFGGMASLEQLNV AHNMLSGKIPASICKLPNLDNFTFSYNFFTGEPPVCLSLPDFSDRRNCLPARPLQRSAAQCNAFL SRPVDCSSFRCAPFVPSLPPPPPPSPPMPVPSPPPPPPVYSPPPPPPSSSPPPPIHYKPPPSPSP PPPPAVYYHSPPPLSPPPPPVIYGSPPPPTPVYEGPLPPITGVSYASPPPPPFY

(Chimeric) PERK >Potri.005G124400-PtPERK10 MSSTPDAPSPDSGSPPPPPASSPPPENSPPPPPPQSDSPPPDASSPPPPPPPTSESPPPPPPKHS NASPPPPPNSRSLSPPPPPPPPPPPPPPPPNSSNSGGSSDQMKIVVGVAVGVGIFLIAMIFICAY CSRRKKRKNMHYYGENPQGGSEQFSYNSPQQSNWHNGLPTEHGMKLSQSPGPMGSGWPAPPPPMM NSSDMSSNYSGPYRPPLPPPSPNIALGFNKSTFTYDELAAATNGFDQANLLGQGGFGYVHKGVLP NGKDIAVKSLKLGSGQGEREFQAEVDIISRVHHRHLVSLVGYCIAGGQRMLVYEFVPNKTLEHHL HGKGLPVMDWPTRLRIALGSAKGLAYLHEDCHPRIIHRDIKAANILIDNNFEAMVADFGLAKLSS DNYTHVSTRVMGTFGYLAPEYASSGKLTDKSDVFSYGVMLLELITGKKPVDPSSAMEDSLVDWAR PLMITSLDTGNYNELVDPMLENNYNHQEMQRMIACAAASIRHSARKRPKMSQVARALEGDVLLDD LNEGTKPGQSSVFSGSNGSADYDASSYNADMKKFRQVALNSQEFGSNELGTSSNESPVTGPSGIH RNSESNY

(Chimeric) FH >Potri.002G240200-PtFH3 MSTSTSTSTSITILLFFLLSYAPALHFSSTSSPHNRRILHQPFFPEGSIPPTEPPSSSPPSPPSS TTPQIPFSTSTPNPPPFFPSYPSPPPPPSPTTFASFPANISSLILPQSSKPKPTSQKPLLVAISA VISALIVLSITIIVYYARRRRNRSNFSDDKTYTGSNISNRNADTRVIGTSNNSYKLSITSTSSNF LYMDTLVNSTRLDESSDGSDRRKLESPELRPLPPLNKENSTLKYGNGEVGYISSTTTNSRDGREE EEEEFYSPRGSLGGRDSPSGTGSGSRRVFAAGVGFDEKSSDSSSSYSSSTSASPSRSQSLSISPP LSLSSTPKSHTLLAAQSQPPPPPMMDVDNERKSPSSASSPDVSPRNVLSSASTSPRVSHRNNVLM RSPSLSPARILNNNLSQNTPSSSPSSVSSSPGRALNDSAPFNAQSPSLSSVSTSPGNGVLEKTPP LIISFGLDQTAQSPSLSIASTSPERGLEKSPPPSPIVSNVLGRIQMMFPSLSASKSSFSSTRISY PTSGPPPPPPPPPLPPLSILQPQRQWEAPSVDASSTPTDQPISKPPALIPPSRPFVLQSTTNVSP IELPPSSKTVQDAEETPKPKLKPLHWDKVRASSDREMVWDHLKSSSFKLNEEMIETLFVVKTPKP 134

KATTPNSVSPTTSRENRVLDPKKAQNIAILLRALNVTIEEVCEGLLEATKVSPAGACYAGNVDTL GTELLESLLKMAPTKEEERKLKEYKEDSPTKLGHAEKFLKAVLDVPFAFKRVDAMLYVANFESEV EYLKKSFETLEAACEELRNSRMFFKLLEAVLKTGNRMNVGTNRGDAHAFKLDTLLKLVDVKGADG KTTLLHFVVQEIIRTEGARLSSTNQTPNSISSEDAKWRRLGLQVVSGLSLELTHVKKAAAMDSDV LSSDVSKLSRGTENISEVVRLIEKLGMVESNQKFSESMTMFMKMAEEEIIRIQAQESVALSLVKE ITEYFHGNSAKEEAHPFRIFMVVRDFLSVLDRVCKEVGMINERTIVSSALKFPVPVNPMLPVPVN PTLPQVFSGSNASKQYNSFDDESESP

Other Chimeric EXT >Potri.009G145700-PtEXT30 MARLSICLLVIFLAIVAEAATQKPKKAKKCRDKKNYPVCFKTKNLYCPPQCPRDCYVDCATCTPV CSKPSKSPPFLPPPPHSLSPPPTRSTPPSLSPPPTTSTPPLSPPPTTSTPPLSPPPTTSTPPPTI STPPPPPATSTPPLSPPPTDFTPPPSSTPPPATTTPPAQNPPPPPDSSESAPKRARCKNRNYATC YGQEYTCPSACPNQYCDRPGAVCQDPRFIGGDGITFYFHGKKDRDFCIVSDSNLHINAHFIGRRN EKLTRDFTWVQSLGILFGTHKLFIGAQKTATWDDSVDRLSLALDGEPIYLPDGEGMKWKAEISPS VTITRSSDANAVVIEAEDNFKIKAAVVPITQKDSRIHSYGIASENCFAHLDLSFKFYKLSGDVNG VLGQTYGSNYVSRVKMGVLMPVLGGEKEFASSNIFATDCAVARFSGQHPSSNSSENFEFANLHCA SGIDGRGVVCKR Figure 4.3 Protein sequences encoded by the representative EXT gene classes in Populus trichocarpa. The colored sequences at the N and C terminus indicate predicted signal peptides (green) and GPI anchor addition sequences (light blue) if present in the sequences. The SP3 (blue), SP4 (red), SP5 (purple), and YXY (dark red) repeats are also indicated in the sequences. The sequences typical of AGPs, specifically AP, PA, SP, TP, VP, and GP repeats, are also indicated (yellow).

In addition to the presence of SPPP and SPPPP repeats, the presence of a signal peptide was another factor in determining whether a protein was considered an EXT. As with the AGPs, all potential EXTs identified by the search were examined for signal peptides and GPI anchors. Signal peptides are known to occur in EXTs, but certain chimeric EXTs, notably the PERKs, lack a signal peptide (Nakhamchik et al. 2004). In total, 46 of the 60 EXTs identified (77%) have a signal peptide. Only four EXTs with GPI anchor addition sequences were identified, all of which were classified as short EXTs. This novel class of short EXTs with GPI anchor addition sequences was also observed in Arabidopsis (Showalter et al. 2010).

Because EXTs were identified by searching for proteins with at least two SPPP sequences, many proteins were identified that contain only a few SPPP or SPPPP repeats within a much larger protein sequence. Many of these potential chimeric EXTs are not included in Table 4.3, but the sequences are available in Fig S3 for further review (Showalter et al. 2016). These may in fact be chimeric EXTs, but many lack a signal peptide and have only a few SPPP or SPPPP repeats within a much larger protein that does not belong to a class of previously characterized chimeric EXTs, such as PERKs, LRXs, or FHs.
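The repeat-based screen described above can be illustrated with a minimal sketch. It is not the BIO OHIO 2.0 implementation; the function names are hypothetical, and the rule that each serine followed by a run of prolines is counted once, by the exact length of the proline run, is an assumption made for illustration.

```python
import re

def count_sp_repeats(seq: str) -> dict:
    """Count Ser-Pro(n) runs by proline-run length: 3, 4, or 5 and longer."""
    counts = {"SP3": 0, "SP4": 0, "SP5": 0}
    # Each match is one serine followed by a maximal run of at least three prolines.
    for match in re.finditer(r"SP{3,}", seq):
        n_pro = len(match.group()) - 1  # subtract the leading serine
        if n_pro == 3:
            counts["SP3"] += 1
        elif n_pro == 4:
            counts["SP4"] += 1
        else:
            counts["SP5"] += 1
    return counts

def is_ext_candidate(seq: str, min_repeats: int = 2) -> bool:
    """Apply an 'at least two SPPP sequences' screen (assumed counting rule)."""
    # SPPPP and longer runs also contain SPPP, so every run counts once here.
    return sum(count_sp_repeats(seq).values()) >= min_repeats
```

A protein that passes such a screen is only a candidate; as noted above, signal peptide prediction and BLAST analysis are still needed before it is classified as an EXT.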

4.3.3 Proline-rich proteins (PRPs)

PRPs were identified by searching for proteins that contain at least 45% PVKCYT or contain two or more repeated motifs (PPVX[KT] or KKPCPP) (Table 4.1). Although this search generates a large number of false positives, as well as proteins already identified as AGPs and EXTs by the searches described above, it was effective in identifying PRPs in Arabidopsis (Showalter et al. 2010). Of the 240 poplar proteins meeting the 45% PVKCYT criterion, 20 were determined to be PRPs based on sequence analysis, the presence of a signal peptide, and BLAST analysis. The PPVX[KT] motif search returned 29 candidate proteins, of which four were determined to be PRPs, while the KKPCPP motif search returned no candidate proteins despite its effectiveness in Arabidopsis (Table 4.4 and Fig S4) (Showalter et al. 2016). Additional proteins that fall below the 45% threshold were identified by BLAST searches. Some of these proteins were also determined to be PRPs based on a spectrum of information, including the presence of a signal peptide and Pfam domains, the number of motif repeats, and BLAST hits against Arabidopsis HRGPs. BLAST searches against the Arabidopsis database were particularly beneficial in determining whether a protein was a PRP. In total, 49 proteins were determined to be PRPs, including 16 PRPs, 30 PR-peptides, and three chimeric PRPs (Fig 4.4 and Fig S4) (Showalter et al. 2016). Indeed, each of the 49 putative PRPs identified here is similar to at least one PRP previously identified in Arabidopsis (Showalter et al. 2010).
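The two PRP search criteria can be sketched as follows. This is an illustrative reconstruction rather than the BIO OHIO 2.0 code: the function names are hypothetical, X in PPVX[KT] is assumed to mean any residue, and motif occurrences are assumed to be counted without overlap.

```python
import re

PRP_RESIDUES = set("PVKCYT")
# Assumption: X in PPVX[KT] stands for any single residue.
PRP_MOTIFS = [re.compile(r"PPV.[KT]"), re.compile(r"KKPCPP")]

def composition_fraction(seq, residues=PRP_RESIDUES):
    """Fraction of residues in seq that belong to the given residue set."""
    if not seq:
        return 0.0
    return sum(aa in residues for aa in seq) / len(seq)

def motif_count(seq, pattern):
    """Number of non-overlapping matches of a compiled pattern in seq."""
    return len(pattern.findall(seq))

def is_prp_candidate(seq, min_fraction=0.45, min_motifs=2):
    """Pass if composition is at least 45% PVKCYT or either motif occurs twice or more."""
    if composition_fraction(seq) >= min_fraction:
        return True
    return any(motif_count(seq, p) >= min_motifs for p in PRP_MOTIFS)
```

Because the two criteria are joined by "or", a sequence with a low overall PVKCYT content can still be returned when it carries repeated motifs, which is one reason manual review of the candidates remains necessary.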


Table 4.4 Identification and analysis of PRP genes in Populus trichocarpa. Locus Identifier 3.0 Name Class % PPV/PPLP AA Pfamb SPc GPI Organ/issue Arabidopsis Poplar HRGP (ID 2.0)a PVKCYT /PELPK -Specific HRGP BLAST Hitse Repeats Expressiond BLAST Hits Potri.004G168600 PtPRP1 PRP 64% 24/8/0 554 PF011 Y N Dark AtPRP2, PtPRP6, PtPRP32, (POPTR_0004s17590) 90.15 etiolated AtPRP1, PtPRP33, seedlings AtPRP11 PtPRP143, Potri.016G006200 Potri.016G015500 PtPRP2 PRP 70% 13/0/0 449 PF145 Y N Dark and AtPRP18, Potri.012G076700, (POPTR_0016s01720) 47.4 +3hrs light AtPEX4 Potri.015G071500, etiolated Potri.019G083900, seedlings Potri.T155100, Potri.005G239100 Potri.014G126200 PtPRP3 PRP 51% 0/0/0 372 PF011 Y N AtPRP9, PtPRP24, PtPRP22, (POPTR_0014s12100) 90.15 AtPRP10 PtPRP28, PtPRP26, PtPRP21 Potri.014G126500 PtPRP4 PRP 52% 0/0/0 366 PF011 Y N AtPRP7, PtPRP35, PtPRP3, (POPTR_0014s12120) 90.15 AtPRP3, PtPRP4, AtPRP1, Potri.014G126300, AtAGP30I, PtPRP39 AtAGP31I Potri.018G126000 PtPRP5 PRP 62% 15/9/0 310 PF145 Y* N AtPRP9, PtPRP44, PtPRP42, (POPTR_0018s12630) 47.4 AtPRP10, PtPRP41, PtPRP43, AtPERK15 Potri.011G060200 Potri.009G129900 PtPRP6 PRP 48% 2/1/0 283 PF011 Y* N AtPRP9, Potri.019G082700, (POPTR_0009s13250) 90.15 AtPRP10, PtPRP21, PtPRP26, AtPRP1 PtPRP18, PtPRP28 Potri.003G111300 PtPRP7 PRP 46% 4/1/0 234 PF145 Y* N Male catkins AtPRP9, PtPRP27, PtPRP30, (POPTR_0003s11060) 47.4 AtPRP10, PtPRP21, PtPRP26, AtPRP15 PtPRP22


Table 4.4: continued Potri.006G008300 PtPRP8 PRP 59% 8/0/0 234 PF145 Y N AtPRP9, PtPRP49, 47.4 AtPRP10 PtPRP26, PtPRP22, PtPRP23, PtPRP24 Potri.T162800 PtPRP9 PRP 50% 2/0/0 216 PF145 Y N AtPRP9, PtPRP48, (POPTR_0006s01030) 47.4 AtPRP10 PtPRP26, PtPRP22, PtPRP28, PtPRP23 Potri.006G008600 PtPRP10 PRP 53% 4/0/0 214 PF145 Y N Young leaf AtPRP16, PtPRP15, 47.4 AtPRP14, PtPRP13, PtPRP5, AtPRP17, PtPRP11, AtPRP15, Potri.018G025900 AtHAE4 Potri.002G201800 PtPRP34 PRP 37% 0/0/0 213 PF011 Y N Young leaf, AtPRP9, PtPRP22, (POPTR_0002s20290) 90.15 male catkins AtPRP10 PtPRP23, PtPRP26, PtPRP24, PtPRP29 Potri.017G145800 PtPRP35 PRP 42% 0/0/0 272 PF011 Y N AtPRP9, PtPRP22, (POPTR_0017s01230) 90.15 AtPRP10 PtPRP26, PtPRP21, PtPRP23, PtPRP24 Potri.001G060500 PtPRP38 PRP 39% 0/7/0 332 PF011 Y N Dark and AtPRP11, PtPRP33, (POPTR_0001s13450) 90.15 +3hrs light AtAGP31I, PtPRP36, etiolated AtPRP1 Potri.001G326200, seedlings Potri.017G068400, PtPRP38 Potri.003G167100 PtPRP40 PRP 39% 0/2/0 299 PF011 Y N Female AtPRP7, PtPRP34, PtPRP4, (POPTR_0003s16550) 90.15 catkins AtPRP1, PtPRP3, AtPRP3, Potri.014G126300, AtAGP30I, PtPRP39 AtAGP31I


Table 4.4: continued Potri.007G114400 PtPRP44 PRP 43% 0/1/10 275 Y N Roots AtPRP7, PtPRP34, AtPRP3, PtPRP35, PtPRP4, AtPRP1, PtPRP3, AtAGP30I, Potri.014G126300 AtAGP31I Potri.013G111600 PtPRP46 PRP 39% 0/4/0 216 Y N AtPRP9, PtPRP45, (POPTR_0013s11600) AtPRP10, PtPRP44, AtPERK5 PtPRP42, PtPRP43, PtPRP28 Potri.006G065500 PtPRP11 PR 56% 5/2/0 198 PF145 Y N Dark and AtPRP7, PtPRP4, (POPTR_0006s06430) Peptide 47.4 +3hrs light AtPRP3, PossiblePtPRP6, etiolated AtPRP1, Potri.002G201700, seedlings AtAGP30I, PtPRP34, PtPRP35 AtPRP9 Potri.001G350600 PtPRP12 PR 63% 6/0/0 191 PF027 Y N AtPRP7, PtPRP3, (POPTR_0001s34750) Peptide 04.12 AtPRP3, PossiblePtPRP6, AtPRP1, Potri.002G201700, AtPRP9, PtPRP34, PtPRP35 AtAGP30I Potri.T162900 PtPRP13 PR 52% 4/0/0 184 PF145 Y N Young leaf AtPRP15, PtPRP11, PtPRP7, (POPTR_0006s01020) Peptide 47.4 AtPRP14, PtPRP13, AtPRP17, PtPRP15, PtPRP8 AtPRP2, AtPRP1 Potri.010G072200 PtPRP14 PR 50% 6/0/0 179 PF020 Y N Mature leaf AtPRP2, PtPRP1.8, (POPTR_0010s08290) Peptide 95.13 AtPRP4, PtPRP32, AtPRP11 PtPRP33, PtPRP36, Potri.005G041400 Potri.006G008500 PtPRP15 PR 53% 4/0/0 179 PF145 Y N Roots AtPRP14, PtPRP11, PtPRP5, Peptide 47.4 AtPRP15, PtPRP2, PtPRP13, AtPRP16, PtPRP15 AtPRP17 140

Table 4.4: continued Potri.007G113900 PtPRP16 PR 47% 0/4/0 130 Y N AtPRP16, PtPRP15, (POPTR_0007s03420) Peptide AtPRP17, PtPRP13, PtPRP9, AtPRP15, PtPRP2, PtPRP11 AtPRP14, AtHAE4 Potri.007G114100 PtPRP17 PR 46% 0/3/0 119 Y N AtPRP16, PtPRP10, (POPTR_0007s03400) Peptide AtPRP17, PtPRP13, PtPRP8, AtPRP14, PtPRP2, PtPRP11 AtPRP15, AtHAE4 Potri.007G113700 PtPRP18 PR 47% 0/4/0 119 Y N AtPRP16, PtPRP9, PtPRP13, (POPTR_0007s03440) Peptide AtPRP17, PtPRP8, PtPRP2, AtPRP14, PtPRP15 AtPRP15, AtAGP30I Potri.017G047400 PtPRP19 PR 46% 0/3/0 113 Y N Dark AtPRP15, PtPRP5, PtPRP7, (POPTR_0017s07470 ) Peptide etiolated AtPRP14, PtPRP13, seedlings, AtPRP17, PtPRP15, PtPRP8 light-grown AtPRP2 seedling Potri.019G082600 PtPRP20 PR 45% 0/4/0 112 Y N light-grown AtPRP16, PtPRP15, PtPRP8, (POPTR_0019s11220) Peptide seedling AtPRP17, PtPRP10, PtPRP9, AtPRP14, PtPRP11 AtPRP15, AtHAE4, Potri.017G047200 PtPRP21 PR 43% 0/3/0 130 Y N Young leaf, AtPRP1, Potri.004G110100, (POPTR_0017s07450 ) Peptide male catkins AtPRP2, Potri.010G211100, AtPEX4 Potri.004G109000, Potri.T018900, Potri.004G109900


Table 4.4: continued Potri.017G045800 PtPRP22 PR 43% 0/3/0 116 Y N AtPRP16, PtPRP13, (POPTR_0017s07310) Peptide AtPRP17, PtPRP10, PtPRP2, AtPRP14, PtPRP9, PtPRP11 AtPRP15, AtHAE4, AtPERK5 Potri.017G046700 PtPRP23 PR 40% 0/3/0 116 Y N AtPRP9, PtPRP21, (POPTR_0017s07400) Peptide AtPRP10, PtPRP26, AtPRP15 PtPRP31, Potri.017G046800, PtPRP27 Potri.017G046400 PtPRP24 PR 43% 0/3/0 116 Y N Roots AtPRP9, PtPRP21, (POPTR_0017s07370) Peptide AtPRP10 PtPRP30, PtPRP27, Potri.017G046800, PtPRP18 Potri.017G045900 PtPRP25 PR 43% 0/3/0 116 Y N AtPRP9, PtPRP19, (POPTR_0017s07320) Peptide AtPRP10, PtPRP21, AtPRP15 PtPRP27, PtPRP30, Potri.017G046800 Potri.017G047000 PtPRP26 PR 42% 0/3/0 116 Y N AtPRP9, PtPRP18, (POPTR_0017s07430) Peptide AtPRP10 PtPRP21, Potri.017G046800, PtPRP27, PtPRP30 Potri.017G047100 PtPRP27 PR 44% 0/4/0 134 Y N Female AtPRP9, PtPRP21, Peptide catkins AtPRP10, PtPRP18, AtPRP15 PtPRP26, PtPRP37, PtPRP19 Potri.017G045600 PtPRP28 PR 44% 0/3/0 126 Y N Roots AtPRP9, PtPRP30, (POPTR_0017s07290) Peptide AtPRP10 Potri.017G046800, PtPRP27, PtPRP18, PtPRP17 142

Table 4.4: continued Potri.017G046100 PtPRP29 PR 42% 0/3/0 116 Y N AtPRP9, PtPRP26, (POPTR_0017s07340) Peptide AtPRP10 PtPRP25, PtPRP24, PtPRP23, PtPRP29 Potri.T178800 PtPRP30 PR 42% 0/4/0 135 Y N Xylem AtPRP9, PtPRP22, (POPTR_2000s00200) Peptide AtPRP10 PtPRP23, PtPRP26, PtPRP21, PtPRP28 Potri.007G114200 PtPRP31 PR 44% 0/4/0 121 Y N AtPRP9, PtPRP22, (POPTR_0007s03390) Peptide AtPRP10 PtPRP26, PtPRP21, PtPRP23, PtPRP28 Potri.017G045000 PtPRP37 PR 40% 0/3/0 105 Y N Roots AtPRP9, PtPRP16, Peptide AtPRP10, PtPRP21, AtPRP15 PtPRP26, Potri.017G046800, PtPRP27 Potri.002G201900 PtPRP39 PR 33% 0/0/0 179 PF011 Y N AtPRP11, PtPRP32, (POPTR_0002s20300) Peptide 90.15 AtAGP31I, PtPRP36, AtPRP1 Potri.001G326200, Potri.017G068400, PtPRP38 Potri.017G044800 PtPRP41 PR 34% 0/1/3 112 Y N Young leaf, AtPRP11, PtPRP32, (POPTR_0017s07230) Peptide male catkins AtPRP1, Potri.001G326200, AtAGP31I, Potri.017G068400, AtPRP2 PtPRP38, PtPRP40 Potri.017G044900 PtPRP42 PR 39% 0/0/5 109 Y N AtPRP9, PtPRP26, Peptide AtPRP10 PtPRP21, PtPRP22, PtPRP28, PtPRP23


Table 4.4: continued Potri.018G146200 PtPRP43 PR 42% 0/1/2 114 Y N Young leaf AtPRP9 PtPRP40, Peptide Potri.017G068400, Potri.001G326200, PtPRP32, PtPRP33 Potri.007G114700 PtPRP45 PR 38% 0/0/4 107 Y N AtPRP11 PtPRP38, (POPTR_0007s03340) Peptide Potri.017G068400, Potri.001G326200, PtPRP33, PtPRP32 Potri.017G046800 PtPRP47 PR 41% 0/5/0 174 Y* N AtPRP9, PtPRP45, (POPTR_0017s07440) Peptide AtPRP10, PtPRP44, AtPEX2 PtPRP43, PtPRP41, PtPRP18 Potri.017G045700 PtPRP48 PR 38% 0/2/0 97 Y N AtPRP9, PtPRP44, (POPTR_0017s07300 ) Peptide AtPRP10 PtPRP45, PtPRP42, PtPRP41, PtPRP37 Potri.017G046500 PtPRP49 PR 38% 0/3/0 97 Y* N AtPRP10, PtPRP45, (POPTR_0017s07380) Peptide AtPRP9, PtPRP43, AtPEX2 PtPRP42, PtPRP41, Potri.017G052100 Potri.004G114300 PtPRP32I Chimeric 41% 2/5/0 319 PF011 Y N AtPRP9, PtPRP22, (POPTR_0004s11300) 90.15 AtPRP10 PtPRP21, PtPRP23, PtPRP28, PtPRP24 Potri.004G114400 PtPRP33I Chimeric 41% 0/6/0 365 PF011 Y N AtPRP9, PtPRP30, 90.15 AtPRP10 Potri.017G046800, PtPRP21, PtPRP17, PtPRP18 Potri.017G100600 PtPRP36I Chimeric 43% 0/5/0 410 PF011 Y N AtPRP9, PtPRP27, (POPTR_0017s13490) 90.15 AtPRP10 PtPRP21, Potri.017G046800, PtPRP17, PtPRP18 144 a Protein identifiers of the version 2.0 are shown in the parenthesis. Italics indicates a protein that was identified only by a BLAST search. b The domains indicated by the Pfam number are: PF01190.15, Pollen_Ole_e_I domain (Pollen proteins Ole e I like); PF14547.4, Hydrophob_seed domain (Hydrophobic seed protein); PF02704.12, GASA domain (Gibberellin regulated protein); PF02095.13, Extensin_1 domain (Extensin-like protein repeat). c Asterisk indicates a protein that is predicted to have a signal peptide either using the sensitive mode in the SignalP website or only if amino acids at the N terminus are discarded. d Expression data are shown only when available at http://bar.utoronto.ca/efppop/cgi-bin/efpWeb.cgi e A locus ID indicates that it is not identified as an HRGP.


PRP >Potri.016G015500-PtPRP2 MAKFALANLLILLVNLGTLLTSLACPYCPYPAPPSKPPQYPPKIPPVHPPPKVKPPPKYPPKIPP VHPPPKVKPPPKYPPKIPPVHPPPKVKPPSKYPPKIPPVHPPPKVKPPPKCPPKIPPVHPPPTVK PPHDPKPPKPHPPKPPVVPKPPIVHPPFKPKPPIVHPPYTPKPPIVHPPFKPKPPIVHPPFKPKP PIVHPPFIPKPPYVPKPPVLPPTPPALPPPKPPVIPPIKPPTPPALPPPTLPPPKPPVTPPIKPP TPPTLPPPKPPVTPPIMPPTPPTLPPPKPPVTPPIMPPTPPTPPTLPPPKPPVTPPIMPPAPPTL PPPKPPVTPPIKPPTPPIVNPPPPETPCPPPPPPPPKQETCSIDTLKLGACVDVLGGLVHIGIGS SAKDACCPVLQGLLDLDAAICLCTTIKAKLLNISIIIPIALEVLVDCGKTPPEGFKCPA

PR Peptide >Potri.001G350600-PtPRP12 MAFKAVCLMVVAFVLVTAKASYMNEDFKEKAVFSKSVVPASTPAPPEVKSPTPAPPVVTPSTPLY KPPTPAPPVKTPPPAPPVNPPTPVKPPTTPAPPVYKPPSPAPPVNPPTPVKPPTGPMPPPVRTRS DCTPLCGQRCKLHSRKRLCVRACMTCCDRCKCVPPGTYGNREKCGKCYTDMTTRRNKPKCP

Chimeric PRP >Potri.004G114300-PtPRP32I MPWFFIIFLLGFTFNNPSEASHGKKLPSAVVVGTVYCDTCFQEYFSRNSHFISGAHVAVECKDEK SRPSFREEAKTDEHGEFKVHLPFSVSKHVKKIKRCSVELLSSSEPYCAVASTATSSSLRLKSRKE GTHIFSAGFFTFKPEKEPFLCNQKPSIENPREFSSKEASLPSFDNPTFPPPLQDPKTPVLPPLPP LPILPPLPQLPPLPPLPGLPFLPPIPANTENTKTTESLKSTTLPDEKAVHHPNQFGFPTPPLFPP NPFQPPPILPPIIQPPPLFPPILPPNPLQPPPVPSLPLPPVPSLPLPPYHPFRCHQYLA Figure 4.4 Protein sequences encoded by the representative PRP gene classes in Populus trichocarpa. The colored sequences at the N terminus indicate predicted signal peptides (green). PPV (pink) repeats typical of PRPs are indicated. The sequences typical of AGPs, specifically AP, PA, SP, TP, VP, and GP repeats, are also indicated (yellow) if present.

Interestingly, 30 short PRPs were identified in poplar, most of which contain a single SPPP repeat at the C-terminus. Nearly all of the 30 proteins show similarity to AtPRP9 and AtPRP10 based on BLAST searches. These 30 novel proteins were grouped into a new class known as the proline-rich peptides (PR peptides) because they are much shorter than the typical PRPs identified. These PR peptides can be further subdivided based on the presence of two repeat sequences, the tetrapeptide PPLP and the pentapeptide PELPK. The PPLP repeat is present in 23 of these PR peptides and in a few other PRPs and chimeric PRPs, while the PELPK repeat is found only in one PRP and four PR peptides, including two that contain PPLP repeats. It is also interesting to note that the 23 genes encoding the PPLP-containing PR peptides are clustered on chromosome 17, while the genes encoding only the PELPK-containing PR peptides are clustered on chromosome 7.

All of the 49 PRPs had a predicted signal peptide, while none had a GPI anchor predicted.
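The subdivision of PR peptides by repeat content can be expressed as a small sketch. The subgroup labels and function names are hypothetical, and non-overlapping literal matching is an assumption; the counts reported in Table 4.4 may have been obtained with a different rule.

```python
def count_motif(seq, motif):
    """Count non-overlapping occurrences of a literal motif in seq."""
    return seq.count(motif)

def pr_peptide_subgroup(seq):
    """Assign a PR peptide to a subgroup by its PPLP and PELPK content."""
    has_pplp = count_motif(seq, "PPLP") > 0
    has_pelpk = count_motif(seq, "PELPK") > 0
    if has_pplp and has_pelpk:
        return "PPLP+PELPK"
    if has_pplp:
        return "PPLP only"
    if has_pelpk:
        return "PELPK only"
    return "neither"
```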

4.4 Discussion

4.4.1 A bioinformatics approach for identifying HRGPs

As more plant genome sequencing projects are completed, vast amounts of biological data are being generated. Bioinformatics, and in particular the BIO OHIO 2.0 program, which was recently revised and improved, provides a more rapid, reliable, and efficient method to identify proteins with biased amino acid compositions and known repetitive motifs (Showalter et al. 2010; Lichtenberg et al. 2012). For instance, the BIO OHIO/Prot-Class program can search through over 73,000 proteins in the poplar proteomic database and identify those containing at least 50% PAST in one minute. Using the various search criteria, we have predicted 271 HRGPs in poplar, including 162 AGPs, 60 EXTs, and 49 PRPs.

Although HRGPs were identified primarily by searching for biased amino acid compositions and repetitive motifs, the possibility exists that other HRGPs could be found in the poplar genome. Not all AGPs meet the 50% PAST threshold; for instance, one classical AGP, PtAGP51C, contains only 49% PAST. Similar problems exist for identifying chimeric AGPs. Because these proteins may contain only a small AGP region within a much larger sequence, they are likely to contain less than 50% PAST. The possibility remains that other classes of chimeric AGPs or individual proteins that contain AGP-like regions exist and were not identified by the search parameters used in this study. A similar problem could exist for AG peptides that fall below the 35% PAST cut-off or for PRPs that fall below 45% PVKCYT.

One possible solution is to simply lower the thresholds and continue to search, but the number of false positives increases markedly as thresholds are lowered, making such searches less feasible. For instance, lowering the threshold for the AG peptide search to 30% would identify 877 proteins compared to the 194 identified with a 35% threshold.
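The trade-off between threshold and candidate count can be explored with a simple sweep such as the sketch below. The function names and the toy proteome are hypothetical, and only the composition criterion is modeled; any additional criteria used in the actual AG peptide search (for example, sequence length limits) are not represented.

```python
def past_fraction(seq):
    """Fraction of residues in seq that are Pro, Ala, Ser, or Thr."""
    if not seq:
        return 0.0
    return sum(aa in "PAST" for aa in seq) / len(seq)

def candidates_by_threshold(proteome, thresholds):
    """Map each threshold to the number of sequences whose PAST fraction meets it."""
    fractions = [past_fraction(seq) for seq in proteome.values()]
    return {t: sum(f >= t for f in fractions) for t in thresholds}

# Toy proteome for illustration only; these are not real poplar proteins.
toy_proteome = {
    "a": "PASTPAST" + "GG",   # 80% PAST
    "b": "PAST" + "G" * 6,    # 40% PAST
    "c": "PAST" + "G" * 8,    # about 33% PAST
    "d": "G" * 10,            # 0% PAST
}
sweep = candidates_by_threshold(toy_proteome, [0.30, 0.35, 0.50])
```

In this toy example the candidate count grows from one to three as the threshold drops from 50% to 30%, mirroring on a small scale the increase from 194 to 877 proteins reported above.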

In such a scenario, BLAST provides an alternative means to find additional candidate proteins. When identified proteins are used as queries, BLAST is effective in finding a few related family members. For example, when identified FLAs are used as queries, BLAST is capable of finding additional FLAs that do not meet the criteria of the BIO OHIO 2.0 program. However, it is not particularly effective in finding other members of the HRGP superfamily and thus could not be utilized in a comprehensive manner.

Indeed, a bioinformatics search that identifies HRGPs, especially chimeric HRGPs, without also identifying a very large number of false positives remains difficult. Nevertheless, the search parameters and BLAST searches used here provide an efficient means to identify HRGPs and distinguish them from a limited number of false positive sequences. Of course, future molecular and biochemical analysis of the HRGPs predicted from this study will be necessary to validate these predictions more completely and elucidate their biological functions. Only when such work is completed will it become possible to conclusively distinguish HRGPs from false positive sequences.

4.4.2 HRGPs exist as a spectrum of proteins

Although HRGPs are divided into AGPs, EXTs, and PRPs, the distinction between these categories is not always clear, since many HRGPs appear to exist as members of a spectrum of proteins rather than distinct categories. Indeed, several HRGPs identified here as well as some previously identified in Arabidopsis have characteristics of multiple families and can be considered hybrid HRGPs. For instance, many of the PRPs identified here, particularly some chimeric PRPs, also contain dipeptide repeats that are characteristic of AGPs. As such, it is difficult to determine if these should be considered as AGPs, PRPs, or classified as a hybrid HRGP. Determining whether these are actually AGPs or PRPs would depend on whether the proline residues are hydroxylated and subsequently glycosylated with arabinogalactan polysaccharides, which are characteristic of AGPs.

Similarly, PtEXT4 also contains large numbers of characteristic AGP repeats (Fig S2) (Showalter et al. 2016). In addition, BLAST searches revealed that it is similar in sequence to AtAGP51. Given that it contains many SPPP and SPPPP repeats, it was classified as an EXT. However, there is a possibility that this protein may also be glycosylated with the addition of AG polysaccharides, in which case it could potentially be grouped as a hybrid HRGP. Another example is the novel class identified here as the PR peptides (Table 4.4). Although grouped here as PRPs, these short sequences (i.e., PtPRP16-31 and PtPRP37) also contain an SPPP sequence characteristic of an EXT as well as the dipeptide repeats characteristic of AGPs, particularly AP, PA, and VP (Fig S4) (Showalter et al. 2016).

Other difficulties arise when chimeric HRGPs are considered. For instance, the plastocyanins range from those that contain a majority of AGP repeats and easily pass the 50% PAST test to those that contain only a few AP, PA, SP, VP, and GP repeats to those that contain no characteristic AGP repeats. The exact cutoff between proteins that are considered chimeric AGPs and those that are simply plastocyanin proteins is difficult to determine. Again, biochemical studies would be required to examine which of the proteins are actually glycosylated to make a final determination for classification. However, all those proteins annotated here as PAGs have at least a few characteristic AGP repeats, contain a signal peptide, and most have predicted GPI membrane anchor addition sequences, all of which is consistent with the chimeric AGP designation (Fig S1) (Showalter et al. 2016).
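The 50% PAST test referred to above can be illustrated with a minimal Python sketch. It assumes that the test asks whether proline (P), alanine (A), serine (S), and threonine (T) together account for at least half of the residues in a sequence; the function names, the inclusive threshold, and the toy sequences are hypothetical and are not taken from BIO OHIO 2.0.

```python
def past_fraction(seq: str) -> float:
    """Return the fraction of residues that are P, A, S, or T."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(seq.count(aa) for aa in "PAST") / len(seq)


def passes_past_test(seq: str, threshold: float = 0.5) -> bool:
    # Assumed rule: PAST residues make up at least `threshold` of the sequence.
    return past_fraction(seq) >= threshold


# Hypothetical toy sequences, not real poplar proteins.
agp_like = "APAPSPTPAPSPAK"
globular_like = "MKLVDEGHIKLNQR"
print(passes_past_test(agp_like), passes_past_test(globular_like))
```

A composition test of this kind says nothing about glycosylation, which is why the text above still calls for biochemical confirmation.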

A similar situation also exists for the chimeric EXTs, such as the PERKs and LRXs. How many SPPP or SPPPP repeats are required for a protein to be considered an LRX and not simply a leucine-rich repeat (LRR) protein? Here the cutoff was arbitrarily set to at least two repeats. As such, there may be LRR proteins that contain one SPPP that are not considered here as LRXs. Another example which illustrates this classification difficulty concerns the four proteins (PtAGP70I, PtAGP71I, PtAGP72I, and PtAGP73I) which are similar to AtPRP13 based on BLAST searches. However, these four proteins also contain numerous SP and AP repeats that would be more characteristic of an AGP. Exactly how proteins such as these should be classified is certainly debatable. Indeed, it is human nature to group and classify items to facilitate understanding, while Mother Nature operates without such regard.

4.4.3 Comparisons with previously identified poplar HRGPs

This study identified 271 poplar HRGPs (162 AGPs, 60 EXTs, and 49 PRPs) in contrast to the 24 HRGPs (3 AGPs, 10 EXTs, and 11 PRPs) identified by Newman and Cooper (2011). The more stringent search criteria for proline-rich tandem repeats and a less comprehensive poplar proteomic database based on EST and NCBI Non-Redundant protein sequence data from 10/04/09 likely account for the fewer poplar HRGPs identified in this earlier study. In addition, homologs of the 15 FLA AGPs reported by Lafarguette et al. (2004) in a Populus tremula × P. alba hybrid related to Populus trichocarpa were also identified, along with 35 other FLAs. Thus, the present study represents the most comprehensive and detailed picture of the HRGP inventory in poplar to date.

4.4.4 Comparisons with Arabidopsis

Findings here allow for a comparison of the HRGPs identified in Arabidopsis to those in poplar (Table 4.5). For AGPs, the number of classical AGPs identified in poplar was similar to that in Arabidopsis. Specifically, 27 classical AGPs including six lysine-rich AGPs were identified in poplar, while 25 classical AGPs including three lysine-rich AGPs were identified in Arabidopsis. Among other AGPs, particularly notable is the large increase in the number of FLAs, PAGs, and AG peptides in poplar compared to Arabidopsis. While 21 FLAs, 17 PAGs and 16 AG peptides were identified in Arabidopsis, 50 FLAs, 39 PAGs and 35 AG peptides were identified here in poplar. There is also a noticeable increase in the number of other chimeric AGPs in poplar compared to Arabidopsis. Here, 11 other chimeric AGPs were identified in poplar, while only 6 were found in Arabidopsis.

Table 4.5 Comparison of HRGPs identified in Populus trichocarpa and Arabidopsis thaliana.

HRGP family   HRGP subfamily               Poplar   Arabidopsis (a)
AGPs          Classical AGPs               21       22
              Lysine-Rich Classical AGPs   6        3
              AG-Peptides                  35       16
              (Chimeric) FLAs              50       21
              (Chimeric) PAGs              39       17
              Other Chimeric AGPs          11       6
              All AGP subfamilies          162      85
EXTs          Classical EXTs               8        20
              Short EXTs                   22       12
              (Chimeric) LRXs              10       11
              (Chimeric) FHs               5        6
              (Chimeric) PERKs             12       13
              Other Chimeric EXTs          3        3
              All EXT subfamilies          60       59
PRPs          PRPs                         16       11
              PR Peptides                  30       1
              Chimeric PRPs                3        6
              All PRP subfamilies          49       18
Total                                      271      168

(a) The Arabidopsis HRGP data shown here are from Showalter et al. (2010), with the exceptions that 6 chimeric FH EXTs were added and that one PR-peptide was found among the 12 originally identified PRPs as part of this study.

Among EXTs, the classical EXTs with large numbers of SPPPP repeats are markedly decreased in poplar, while similar numbers of the chimeric EXTs exist in both species. The reduction in the number of classical EXTs in poplar is dramatic and likely indicates that many EXT genes or EXT functions are dispensable in poplar, and therefore not conserved in evolution. A similar loss of EXTs has also been observed in analysis of certain monocot species (Newman and Cooper 2011 and unpublished data). Moreover, far fewer poplar EXTs contain putative cross-linking YXY sequences compared to Arabidopsis, and this can be largely explained by the reduced number of classical EXT sequences, which typically contain such cross-linking sequences. The various chimeric EXTs, namely the LRXs/PEXs, PERKs, and FHs, are conserved in both species. Although FHs were not reported in Showalter et al. (2010), a reexamination of the Arabidopsis proteome shows that 6 FH sequences (AtFH1-At3g2550, AtFH5-At5g54650, AtFH8-At1g70140, AtFH13-At5g58160, AtFH16-At5g07770, and AtFH20-At5g07740) contain two or more SPPP sequences. These 6 formins are included in Table 4.5 and are a subset of the 21 reported formins in Arabidopsis (Cvrčková et al. 2012). Similar to the chimeric EXTs, the short EXTs are also conserved in Arabidopsis and poplar. The short EXTs are a particularly interesting class because EXTs are not known to have GPI membrane anchors, a feature commonly found in many AGPs and associated with proteins found in lipid rafts (Borner et al. 2005). The finding that several of these short EXTs encode a predicted GPI-anchor sequence and are conserved in poplar and Arabidopsis certainly prompts the question of what role these proteins are playing in the plant. Currently, no publications verifying their biochemical existence or examining their roles exist, but this class stands out in terms of having interesting candidates for further investigation, particularly with respect to confirming their plasma membrane localization, hydroxylation, and glycosylation.

PRPs are similar in both species with the notable exception of the PR-peptides, which is a much-expanded class in poplar compared to Arabidopsis, which is now recognized to have only one PR-peptide following a reexamination prompted by this study. All of the PR-peptides in poplar are similar in sequence with most containing LPPLP repeats and having a single SPPP repeat at the C terminus, although some contained PELPK repeats. In addition, most of these PR-peptides are similar to AtPRP9 and AtPRP10 based on BLAST analysis; both of these Arabidopsis proteins contain PELPK repeats as well. Indeed, AtPRP9 is quite short and similar in sequence to the PR peptides found in poplar but lacks the C terminal SPPP repeat. However, this is the only such protein found in Arabidopsis, while 30 were observed in poplar. AtPRP10 contains some similarity in sequence but is much longer than the poplar PR-peptides. Indeed, the large number of LPPLP- and PELPK-containing PR-peptides in poplar clustered respectively in two chromosomal locations indicates that these two gene subfamilies likely result from tandem gene duplication events, analogous to a unique, clustered set of PEHK-containing PRP genes in the grape family (Newman and Cooper 2011).

Although most sub-families of HRGPs exist in both the Arabidopsis and poplar inventories, certain species-specific differences do exist, which are reflected in the differing numbers of certain groups and in the total number of HRGPs (271 in poplar versus 168 in Arabidopsis). Precisely why certain classes of HRGPs are increased or decreased in abundance in a particular species remains to be determined, but these results lay the groundwork for future experimentation in this area.

4.4.5 Poplar HRGPs genome 2.0 release and expression analysis

The study revealed that the poplar genome 3.0 release is quite different from the 2.0 release in terms of HRGPs. Only 33% of HRGPs identified in 3.0 are the same as their counterparts in 2.0; the others differ by as little as a few amino acids in sequence or by as much as a distinct start and/or stop position. For several such cases, a green highlight indicated a likely signal sequence placed internally, either because these signal sequences were at the N terminus in the 2.0 release or because they should be at the N terminus based on analysis of sequences in this study.

In addition, tissue/organ-specific HRGP expression data were obtained from the poplar eFP browser. However, this database does not contain all HRGP data, and it only accepts query IDs in poplar genome version 2.0 format. Judging from the available information, one could observe that HRGPs in general have high expression in seedlings, , and reproductive tissues (Tables 4.2-4.4). In particular, a number of FLAs were specifically expressed in xylem, while some PAGs were found to be highly expressed in male catkins. Many PRPs have high expression in seedlings and leaves. Interestingly, several LRXs are found to be uniquely expressed in male catkins; this finding is consistent with previous research in Arabidopsis and rice that a group of LRXs are pollen-specific LRXs, or PEXs (Baumberger et al. 2003).

4.4.6 Pfam analysis of poplar HRGPs

All 271 poplar HRGPs identified in this study were subjected to Pfam analysis to identify specific domains within them. Pfam domains were found in 160 of the 271 proteins (59%). More specifically, Pfam domains were identified in 105 of the 162 AGPs, 32 of the 60 EXTs, and 23 of the 49 PRPs. In particular, Pfam analysis excelled at finding domains within chimeric HRGPs, such as FLAs, PAGs, LRXs, PERKs, and FH EXTs. In contrast, such analysis often failed to find domains in classical AGPs or EXTs, possibly due to the variable sequences and numbers of sequence repeats associated with many of the HRGPs. Interestingly, many of the PRPs were found to contain Pollen Ole domains and Hydrophob seed domains. Pfam analysis also has merit in identifying domains in the chimeric HRGPs identified in the study. Indeed, while Pfam analysis alone is not sufficient for identifying HRGPs in a comprehensive manner, it can add valuable information to identified HRGPs, and thus a Pfam analysis module will likely be incorporated into future versions of the BIO OHIO program.

4.5 Conclusions

The new and improved BIO OHIO 2.0 bioinformatics program was used to identify and classify the current inventory of HRGPs in poplar. This information will allow researchers to determine the structure and function of individual HRGPs and to explore potential industrial applications of these proteins in such areas as plant biofuel production, food additives, lubricants, and medicine. Other plant proteomes/genomes can also be examined with the program to provide their respective HRGP inventories and facilitate comparative evolutionary analysis of the HRGP family in the plant kingdom (Showalter et al. 2010; Liu et al. 2016). Finally, while this program was specifically developed for HRGP identification, it can also be used to examine other plant or non-plant genomes/proteomes in order to identify proteins or protein families with any particular amino acid bias and/or amino acid sequence motif, making it useful throughout the three domains and six kingdoms of life.


CHAPTER 5. BIOINFORMATIC IDENTIFICATION AND ANALYSIS OF CELL WALL EXTENSINS IN THREE ARABIDOPSIS SPECIES

This work has been accepted for publication in the following manuscript. Refer to the following manuscript for all the supplementary materials.

Xiao Liu, Savannah McKenna, David Masters, Lonnie R. Welch, and Allan M. Showalter 2017 Bioinformatic identification and analysis of cell wall extensins in three Arabidopsis species. Int J Plant Sci.

5.1 Introduction

Extensins (EXTs) are hydroxyproline-rich glycoproteins (HRGPs) that are unique to plant cell walls (Showalter 1993). Structurally, EXTs are composed of repeated EXT motifs in which a serine (S) residue is followed by three to five proline (P) residues, forming a rod-like, polyproline II helical secondary structure (Cannon et al. 2008). Glycosylation serves to stabilize the EXTs. The S residues are often modified with galactose monosaccharides, and the P residues can be hydroxylated to hydroxyproline (O) and subsequently O-glycosylated with arabinose oligosaccharides (Kieliszewski and Lamport 1994; Shpak et al. 2001; Kieliszewski et al. 2010).

EXTs can be subdivided into several classes based on the arrangement of the EXT motifs, the protein length, and the presence of other amino acid motifs or protein domains. In addition to the presence of repeating S[P]3-5 motifs, many classical EXTs also contain Tyr-X-Tyr (YXY) amino acid repeats [where X can be any amino acid, but usually is Y, K, V, E, or I]. Such Tyr residues are involved in EXT intramolecular and intermolecular cross-linking by forming isodityrosine (IDT), di-IDT, and pulcherosine (Held et al. 2004; Kieliszewski et al. 2010). These cross-linking properties of EXTs play important roles in plant defense against pathogens (Merkouropoulos et al. 1999). Eudicot classical EXTs usually contain numerous YXY motifs. In Brassica rapa, for instance, BrEXT8 (Bra000292) was identified as a classical EXT with 84 EXT motifs and 52 YXY motifs (Liu et al. 2016). Some EXTs have a smaller number of amino acids than classical EXTs; these EXTs (< 200 aa) are referred to as “short EXTs”. A number of short EXTs are modified with a C-terminal glycosylphosphatidylinositol (GPI) membrane anchor addition sequence that results in the tethering of the protein to the outer leaflet of the plasma membrane. Such GPI anchor addition sequences are rarely seen in other subclasses of EXTs, but are commonly found in arabinogalactan-proteins (AGPs) (Youl et al. 1998; Seifert and Roberts 2007; Liu et al. 2016).
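The motif definitions above can be made concrete with a short Python sketch that tallies EXT and YXY motifs in a single sequence. Several choices here are assumptions rather than documented BIO OHIO 2.0 behavior: SP3, SP4, and SP5 are counted as a serine followed by a run of exactly three, four, or five prolines; longer proline runs are ignored; and overlapping YXY motifs are counted separately. The function name and the toy sequence are hypothetical.

```python
import re
from collections import Counter

SP_RUN = re.compile(r"S(P{3,})")   # serine followed by a run of 3+ prolines
YXY = re.compile(r"(?=Y.Y)")       # lookahead so overlapping YXY motifs count


def count_ext_motifs(seq: str) -> dict:
    """Tally SP3, SP4, SP5, and YXY motifs in one protein sequence."""
    seq = seq.upper()
    # Length of each proline run that directly follows a serine.
    runs = Counter(len(m.group(1)) for m in SP_RUN.finditer(seq))
    return {
        "SP3": runs[3],
        "SP4": runs[4],
        "SP5": runs[5],
        "YXY": sum(1 for _ in YXY.finditer(seq)),
        "short": len(seq) < 200,   # short EXTs are < 200 aa in the text
    }


# Hypothetical toy sequence, not a real extensin.
toy = "SPPPPYVYKSPPPKHYSPPPPPYYYH"
print(count_ext_motifs(toy))
```

Counts produced this way correspond in spirit to the SP3/SP4/SP5/YXY columns reported for each protein, although the exact counting rules used for those columns may differ.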

Besides classical and short EXTs, chimeric EXTs also exist (Showalter et al. 2010; Liu et al. 2016). Leucine-rich repeat extensins (LRXs) are a class of chimeric EXTs with a leucine-rich repeat (LRR) domain that is involved in protein-protein interactions (Kobe and Deisenhofer 1994). The EXT motifs (i.e., S[P]3-5 repeats) of LRXs are located at the C-terminal portion of the protein and contribute to their insolubility in the cell wall (Draeger et al. 2015). LRXs likely play regulatory roles on the plant cell surface. Proline-rich extensin-like receptor kinases (PERKs) are another class of chimeric EXTs with the EXT domains located at the N-terminal portion of the protein and a protein kinase catalytic domain located at the C-terminus. PERKs are associated with cell expansion and floral organ formation (Haffani et al. 2006), abscisic acid responses (Bai et al. 2009), and responses to wounding and pathogen stimuli (Silva et al. 2002). Formins are a group of eukaryotic proteins with a conserved FH2 domain (pfam02181). In plants, the FH2 domain is involved in actin nucleation and the coordination between actin and microtubule cytoskeletons (Chalkia et al. 2008; Cvrčková et al. 2015). Some formins contain a signal peptide, an extracellular region with EXT motifs, and a transmembrane domain, suggesting a possible role as linkers between cell wall components and the cytoskeleton (Blanchoin and Staiger 2010). In this study, such formin homologs with EXT motifs are categorized as a class of chimeric EXTs, termed formin homology EXTs (FHXs). In addition to the above chimeric EXTs, other chimeric EXTs and hybrid EXTs also exist. All but the PERKs have N-terminal signal peptide sequences that direct the proteins to the endomembrane system and ultimately to the extracellular matrix. Despite the absence of a signal peptide, Arabidopsis PERK1 was found to be localized at the plasma membrane (Silva and Goring 2002).

Arabidopsis thaliana (family Brassicaceae) has served as the primary model plant for decades. With five pairs of chromosomes and a 150 Mb genome size, it is a fully self-compatible annual plant with distributions in North Africa, Eurasia, and North America (Mitchell-Olds 2001; Koenig and Weigel 2015). More recently, information has accumulated about its closest relatives Arabidopsis lyrata, Arabidopsis halleri, and A. arenosa. A. lyrata is a perennial herb found in cold to mild climatic regions of the Northern Hemisphere (Clauss and Koch 2006). A. lyrata is self-incompatible due to the sporophytic incompatibility system, a phenomenon rarely seen in flowering plants (Koch et al. 2008). By contrast, A. halleri is a perennial herb widely distributed in Europe and eastern Asia. Tolerance to zinc and cadmium and the hyperaccumulation of these heavy metals are the most widely recognized characteristics of A. halleri (Clauss and Koch 2006). Similar to A. lyrata, A. halleri also shows high levels of self-incompatibility (Koch et al. 2008). Unlike A. thaliana, both A. lyrata and A. halleri have eight pairs of chromosomes and genome sizes of about 250 Mb. A. arenosa, with a genome size of about 200 Mb, is a diverse species group consisting of annual, biennial or perennial herbs that are widely distributed in Europe (Clauss and Koch 2006; Koenig and Weigel 2015). Repeated genome duplications and inter-specific hybridizations contribute to the hybridization–polyploidization property in the A. arenosa species group (Clauss and Koch 2006).

Phylogenetic analysis using plastidic matK and nuclear CHS sequences revealed that A. arenosa and A. halleri are more closely related to each other than to A. lyrata, while A. thaliana is the least closely related to its three relatives (Fig 5.1 adapted from Koenig and Weigel 2015) (Mitchell-Olds 2001; Koch et al. 2001). More recently, fossil-constrained BEAST analysis and phylogenetic analysis using whole-chloroplast genome sequence data also reached the same conclusion that A. halleri and A. lyrata are more closely related to each other than to A. thaliana (Hohmann et al. 2015). The evolutionary split between x5 A. thaliana and x8 Arabidopsis taxa, including A. lyrata and A. halleri, was estimated to be approximately 5.97 million years ago (mya) (Hohmann et al. 2015).

Figure 5.1 The species tree of A. thaliana and its close relatives.

Bioinformatic identification and analysis of EXTs were accomplished in A. thaliana and a number of other plant species (Showalter et al. 2010; Liu et al. 2016). These identified EXTs have allowed for the further study of this protein family across different orders and divisions. However, knowledge is lacking in model plant systems in which direct comparisons can be made for orthologs and co-orthologs in closely related species of the same genus. The genomes of A. lyrata and A. halleri are now sequenced, while genome sequencing for A. arenosa is underway (Rawat et al. 2015; Lamesch et al. 2012).

Here, we have identified and characterized the EXTs in A. lyrata and A. halleri, and updated the EXTs in A. thaliana using the BIO OHIO 2.0 bioinformatic software program. These identified EXTs were used to conduct phylogenetic analysis of the EXT subfamilies. In addition, EXT orthologous pairs between species, paralogous pairs within one species, and co-orthologous groups within the model system were identified. A. thaliana and its wild relatives provide a model system in which diverse genetic resources and biological data make it possible to answer fundamental evolutionary questions that cannot be addressed by a study of a single species (Mitchell-Olds 2001; Koch and Matschinger 2007). This study provides a good example of this concept with respect to the analysis of EXTs in this model system and insight into the evolution and functions of the EXT protein family.

5.2 Materials and methods

5.2.1 BIO OHIO 2.0 program and the input data files

BIO OHIO 2.0 is a bioinformatics software program developed at Ohio University that is designed primarily to identify and characterize plant hydroxyproline-rich glycoproteins (HRGPs). The program identifies potential HRGPs based on biased amino acid compositions and common HRGP amino acid motifs in protein sequences. The program also analyzes proteins for the presence of signal peptide sequences and GPI anchor addition sequences and searches for similar HRGPs using the Basic Local Alignment Search Tool (BLAST). The BIO OHIO 2.0 program can be obtained at https://github.com/Showlaterlab/BIO-OHIO-2.0. The proteome data files of A. lyrata (Alyrata_384_v2.1.protein.fa) and A. thaliana (Athaliana_167_TAIR10.protein.fa) were obtained from the Phytozome website v12 (https://phytozome.jgi.doe.gov/pz/portal.html) (Rawat et al. 2015; Lamesch et al. 2012). The coding sequences of all reported transcripts in A. halleri were downloaded from the Dryad Digital Repository (http://dx.doi.org/10.5061/dryad.gn4hh) and translated using an in-house Python script (Briskine et al. 2016).
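The in-house translation script is not reproduced in this chapter. As a rough illustration only, the following Python sketch shows how coding sequences in FASTA format could be translated with the standard genetic code. The function names, the choice to stop at the first stop codon, the use of X for ambiguous codons, and the example sequence are all hypothetical and do not come from the original script.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code laid out in TCAG order (first, second, third base).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical record, not taken from the A. halleri data set.
for name, cds in read_fasta([">toy_transcript", "ATGTCTCCACCT", "CCATAA"]):
    print(name, translate(cds))
```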

5.2.2 Identification of EXTs

The methods for identifying EXTs were previously described in Liu et al. (2016). Briefly, a regular expression of two or more SPPP amino acid motifs was used to search for candidate EXTs. Candidate EXT sequences were then analyzed for the presence of signal peptides and GPI anchor addition sequences by inputting each sequence to the SignalP server using the sensitive mode (http://www.cbs.dtu.dk/services/SignalP/) and the big-PI plant predictor (http://mendel.imp.ac.at/gpi/cgi-bin/gpi_pred.cgi), respectively (Petersen et al. 2011; Eisenhaber et al. 2003). Candidate sequences were also examined by BLAST analysis against previously identified Arabidopsis thaliana HRGPs (Showalter et al. 2010; Liu et al. 2016). All candidate sequences were then subjected to protein-protein BLAST (blastp) analysis against the A. thaliana protein database (Athaliana_167_TAIR10.protein.fa). A candidate EXT was identified as an EXT if it had the predicted signal peptide (except for PERKs) and a cluster of EXT motifs, which in many cases had homology with known A. thaliana HRGPs. In a few instances, however, a candidate EXT might still be identified to be an EXT if it had EXT motifs that were homologous to A. thaliana EXTs but lacked the predicted signal peptide. All EXTs identified were subjected to domain analysis using the Pfam database 30.0 (http://pfam.xfam.org/) (Finn et al. 2016).
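The screening logic described above can be sketched in Python as follows. This is a simplified paraphrase under stated assumptions, not the BIO OHIO 2.0 code: the SignalP, big-PI, and BLAST results are external predictions and are represented here as boolean inputs, the requirement for a cluster of EXT motifs is not modeled, and all function and parameter names are hypothetical.

```python
import re

SPPP = re.compile(r"SPPP")


def is_candidate(seq: str, min_motifs: int = 2) -> bool:
    """True if the sequence has at least `min_motifs` non-overlapping SPPP motifs."""
    return len(SPPP.findall(seq.upper())) >= min_motifs


def accept_as_ext(seq, has_signal_peptide, is_perk=False,
                  homolog_of_known_ext=False):
    """Simplified decision rule paraphrasing the criteria in the text."""
    if not is_candidate(seq):
        return False
    # A predicted signal peptide is required, except for PERKs.
    if has_signal_peptide or is_perk:
        return True
    # Exception noted in the text: no predicted signal peptide, but the
    # EXT motifs are homologous to known A. thaliana EXTs.
    return homolog_of_known_ext


# Hypothetical toy sequence, not a real candidate.
print(accept_as_ext("MKSPPPAASPPPPK", has_signal_peptide=True))
```

In practice the final calls also rested on manual inspection of the motif arrangement and BLAST hits, which a rule of this kind cannot capture.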

5.2.3 Phylogenetic and cluster analysis of EXTs

Protein sequences were aligned using the Geneious program (http://www.geneious.com/) to obtain conserved domains. Aligned sequences of classical EXTs, short EXTs, LRXs, PERKs, and FHXs were entered into MEGA 7 as input files for phylogenetic analysis using the maximum likelihood and the maximum parsimony methods (Kumar et al. 2016). For classical EXTs, the analysis involved 47 proteins. There was a total of 195 amino acid positions in the final dataset. The evolutionary history inferred by the maximum likelihood method was based on the JTT matrix-based model (Jones et al. 1992). The evolutionary history inferred by the maximum parsimony method used the Tree-Bisection-Regrafting (TBR) algorithm with search level 1 in which the initial trees were obtained by the random addition of sequences (10 replicates) (Nei and Kumar 2000). The bootstrap consensus tree inferred from 1000 replicates was shown and branches corresponding to partitions reproduced in less than 50% bootstrap replicates were collapsed. These methods were also applied to the analysis of short EXTs, LRXs, PERKs, and FHXs. The analysis of short EXTs involved 37 protein sequences and 106 amino acid positions. The analysis of LRXs involved 27 protein sequences and 391 amino acid positions. The analysis of PERKs involved 40 protein sequences and 312 amino acid positions. The analysis of FHXs involved 17 protein sequences and 529 amino acid positions. The identification of potential orthologs, paralogs, and cluster analysis were conducted using the OrthoMCL 2.0 program (Li et al. 2003).
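The rule of collapsing branches found in fewer than 50% of bootstrap replicates can be illustrated with a toy Python sketch. It is not MEGA 7's implementation: each replicate tree is assumed to be reduced to the set of clades it contains, with every clade written as a frozenset of taxon labels, and the taxa, replicates, and function names are made up for the example.

```python
from collections import Counter


def clade_support(replicate_trees):
    """Fraction of replicate trees that contain each clade."""
    counts = Counter(clade for tree in replicate_trees for clade in set(tree))
    n = len(replicate_trees)
    return {clade: c / n for clade, c in counts.items()}


def retained_clades(replicate_trees, min_support=0.5):
    """Clades kept after collapsing those supported below `min_support`."""
    support = clade_support(replicate_trees)
    return {clade: s for clade, s in support.items() if s >= min_support}


# Hypothetical replicates over four made-up taxa A, B, C, and D.
AB, CD, AC = frozenset("AB"), frozenset("CD"), frozenset("AC")
replicates = [{AB, CD}, {AB, CD}, {AB}, {AC}]
print(retained_clades(replicates))
```

With 1000 replicates, as used here, a clade would need to appear in at least 500 replicate trees to remain resolved in the consensus.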

5.3 Results

5.3.1 EXTs identified in A. lyrata

The predicted proteome of A. lyrata (Alyrata_384_v2.1.protein.fa) contains 33,132 protein sequences. BIO OHIO 2.0 found 121 sequences with two or more SPPP motifs, out of which 61 proteins were determined to be EXTs. The 61 EXTs include: (1) 15 classical EXTs that had EXT motifs (i.e., S[P]3-5 repeats and in most cases YXY motifs) throughout the sequence; all but one classical EXT had predicted signal peptide sequences; (2) 10 short EXTs (<200 AA) that had predicted signal peptide sequences and EXT motifs; (3) 8 LRXs that had clusters of EXT motifs at the C-terminus, and significant similarities to A. thaliana LRXs; 7 of the 8 proteins had predicted signal peptide sequences; (4) 14 PERKs that had clusters of EXT motifs near the N terminus and significant similarities to A. thaliana PERKs; (5) 5 FHXs that were homologous to A. thaliana FHXs; (6) 5 other types of chimeric EXTs, and (7) 4 hybrid HRGPs that had predicted signal peptide sequences, clustered EXT motifs and an AGP domain or PRP domain (Table 5.1 and Fig S1).

Table 5.1 The list of EXTs identified in A. lyrata.

Protein ID    Name      Class           SP3/SP4/SP5/YXY repeats  AA    Pfam                    SP   GPI  Top BLAST hits against A. thaliana proteome
AL5G18190.t1  AlEXT1    Classical SP4   1/22/1/11                351   PF04554.12              Yes  No   AtEXT9 AtEXT18 AtEXT10 AtEXT2 AtEXT15
AL5G18170.t1  AlEXT2    Classical SP4   2/13/2/7                 290   PF04554.12              Yes  No   AtEXT9 AtEXT18 AtEXT2 AtEXT10 AtEXT12
AL5G35670.t1  AlEXT3    Classical SP4   1/22/1/10                408   PF04554.12              Yes  No   AtEXT9 AtEXT2 AtEXT12 AtEXT15 AtEXT22
AL4G41820.t1  AlEXT4    Classical SP4   0/24/0/13                228   PF04554.12              Yes  No   AtEXT8 AtEXT3/5
AL9U10270.t1  AlEXT5    Classical SP4/  2/7/7/8                  329   PF04554.12              Yes  No   AtEXT18 AtEXT15 AtEXT14 AtEXT22 AtEXT12
AL6G42210.t1  AlEXT6    Classical SP3   32/2/0/50                427   -                       Yes  No   AtEXT20 AtEXT22 AtEXT17 AtEXT21
AL6G42130.t1  AlEXT7    Classical SP3   12/1/0/27                244   -                       Yes  No   AtEXT20 AtEXT22 AtEXT21
AL6G42140.t1  AlEXT8    Classical SP5   0/5/8/14                 242   -                       Yes  No   AtEXT22 AtEXT17 AtEXT3/5
AL1G23290.t1  AlEXT9    Classical SP4   3/12/3/8                 294   -                       No   No   AtEXT22 AtEXT21
AL1G37470.t1  AlEXT10   Classical SP4   1/12/4/9                 277   PF04554.12              Yes  No   AtEXT9 AtEXT18 AtEXT15 AtEXT2 AtEXT12
AL2G37060.t1  AlEXT11   Classical SP4   18/21/0/3                361   PF04554.12              Yes  No   AtEXT3/5 AtEXT1/4 AtEXT1/4
AL8G22510.t1  AlEXT12   Classical SP4   2/13/0/12                282   PF04554.12              Yes  No   AtEXT18 AtEXT15 AtEXT12 AtEXT7 AtEXT14
AL8G22520.t1  AlEXT13   Classical SP4   1/12/0/8                 234   PF04554.12              Yes  No   AtEXT15 AtEXT18 AtEXT7 AtEXT11 AtEXT12
AL6G31330.t1  AlEXT14   Classical SP5   0/0/11/0                 215   -                       Yes  No   AtEXT39 AtEXT40 AtEXT21
AL6G16720.t1  AlEXT15   Classical SP4   1/10/0/7                 257   PF04554.12              Yes  No   AtEXT14 AtEXT18 AtEXT13 AtEXT15 AtEXT12
AL5G35660.t1  AlEXT16s  Short           1/5/0/2                  125   PF04554.12              Yes  No   AtEXT10 AtEXT2 AtEXT9 AtEXT18 AtEXT12
AL6G31320.t1  AlEXT17s  Short           0/0/3/1                  94    -                       Yes  No   AtEXT39 AtEXT19 AtEXT40 AtEXT35 AtAGP45P
AL6G38070.t1  AlEXT18s  Short           3/2/1/0                  147   -                       Yes  No   AtEXT40 AtEXT35 AtEXT39 AtPRP1 AtPRP2
AL6G38060.t1  AlEXT19s  Short           0/1/1/0                  102   -                       Yes  No   AtEXT40 AtEXT35 AtEXT39
AL1G62930.t1  AlEXT20s  Short           0/1/1/0                  170   -                       Yes  No   AtEXT32
AL1G36610.t1  AlEXT21s  Short           0/2/0/0                  145   -                       Yes  Yes  AtEXT31 AtEXT33 AtEXT30
AL1G12080.t1  AlEXT22s  Short           0/2/2/0                  155   -                       Yes  Yes  AtEXT34
AL7G39030.t1  AlEXT23s  Short           0/2/0/3                  167   -                       Yes  Yes  AtEXT37 AtEXT41 AtPERK5
AL3G17870.t1  AlEXT24s  Short           0/1/1/2                  149   -                       Yes  Yes  AtEXT34 AtEXT41
AL3G52210.t1  AlEXT25s  Short           0/6/0/5                  143   PF04554.12              Yes  No   AtEXT7 AtEXT13 AtEXT14 AtEXT15 AtEXT11
AL9U10340.t1  AlLRX1    LRX             2/3/10/1                 624   PF08263.11              Yes  No   AtLRX3 AtLRX4 AtLRX5 AtLRX6 AtLRX2
AL6G37370.t1  AlLRX2    LRX             0/2/1/1                  435   PF13855.5, PF00560.32   Yes  No   AtLRX7 AtLRX4 AtLRX2 AtLRX5 AtLRX3
AL7G35590.t1  AlLRX3    LRX             4/3/7/2                  780   PF13855.5, PF08263.11   Yes  No   AtLRX5 AtLRX4 AtLRX2 AtLRX1 AtLRX6
AL7G17420.t1  AlLRX4    LRX             1/13/6/2                 751   PF00560.32, PF08263.11  Yes  No   AtPEX4 AtPEX3 AtPEX1 AtPEX2 AtLRX4
AL3G32400.t1  AlLRX5    LRX             1/19/5/0                 958   -                       Yes  No   AtPEX2 AtPEX3 AtPEX4 AtLRX4 AtLRX3
AL3G47670.t1  AlLRX6    LRX             4/17/8/2                 752   -                       Yes  No   AtPEX3 AtPEX4 AtPEX1 AtPEX2 AtLRX5
AL3G37120.t1  AlLRX7    LRX             0/2/1/2                  460   PF13855.5, PF08263.11   No   No   AtLRX6 AtLRX4 AtLRX3 AtLRX5 AtLRX2
AL2G13430.t1  AlLRX8    LRX             1/8/6/3                  748   PF13855.5, PF08263.11   Yes  No   AtLRX2 AtLRX1 AtLRX4 AtLRX3 AtLRX5
AL1G56200.t1  AlPERK1   PERK            1/3/0/0                  541   PF13966.5               No   No   AtPERK7 AtPERK6 AtPERK5 AtPERK1
AL1G56140.t1  AlPERK2   PERK            1/4/0/0                  700   PF07714.16, PF00069.24  No   No   AtPERK7 AtPERK6 AtPERK5 AtPERK15 AtPERK3
AL1G39600.t1  AlPERK3   PERK            2/5/2/1                  768   PF07714.16, PF00069.24  No   No   AtPERK10 AtPERK12 AtPERK11 AtPERK13 AtPERK1
AL1G21470.t1  AlPERK4   PERK            2/0/0/0                  719   PF07714.16, PF00069.24  No   No   AtPERK11 AtPERK13 AtPERK10 AtPERK1 AtPERK15
AL1G37300.t1  AlPERK5   PERK            1/3/0/1                  708   PF07714.16, PF00069.24  No   No   AtPERK12 AtPERK11 AtPERK8 AtPERK10 AtPERK5
AL7G52030.t1  AlPERK6   PERK            5/1/3/9                  673   PF07714.16, PF00069.24  No   No   AtPERK8 AtPERK10 AtPERK13 AtPERK12 AtPERK1
AL7G16790.t1  AlPERK7   PERK            3/0/0/0                  669   PF07714.16, PF00069.24  No   No   AtPERK5 AtPERK7 AtPERK4 AtPERK15 AtPERK13
AL7G19036.t1  AlPERK8   PERK            3/2/0/0                  721   PF07714.16, PF00069.24  No   No   AtPERK14 AtPERK3 AtPERK15 AtPERK5 AtPERK4
AL3G40080.t1  AlPERK9   PERK            2/1/1/1                  455   PF07714.16, PF00069.24  No   No   AtPERK3 AtPERK1 AtPERK15 AtPERK5 AtPERK4
AL3G32190.t1  AlPERK10  PERK            1/2/2/0                  654   PF07714.16, PF00069.24  No   No   AtPERK6 AtPERK7 AtPERK5 AtPERK4 AtPERK1
AL3G51760.t1  AlPERK11  PERK            1/0/1/1                  640   PF07714.16, PF00069.24  No   No   AtPERK4 AtPERK5 AtPERK1 AtPERK7 AtPERK6
AL3G40090.t1  AlPERK12  PERK            2/0/1/0                  650   PF07714.16, PF00069.24  No   No   AtPERK1 AtPERK5 AtPERK3 AtPERK15 AtPERK7
AL2G29830.t1  AlPERK13  PERK            3/1/2/1                  686   PF07714.16, PF00069.24  No   No   AtPERK13 AtPERK12 AtPERK11 AtPERK8 AtPERK10
AL2G27610.t1  AlPERK14  PERK            5/6/3/1                  705   PF07714.16, PF00069.24  No   No   AtPERK12 AtPERK13 AtPERK1 AtPERK15 AtPERK4
AL3G41670.t1  AlFHX1    FHX             2/1/0/0                  1054  PF02181.22              Yes  No   AtFH1 AtFH5 AtFH5 AtFH8 AtFH20
AL8G30100.t1  AlFHX2    FHX             1/0/1/0                  902   PF02181.22              Yes  No   AtFH5 AtFH5 AtFH1 AtFH8 AtFH20
AL8G34260.t1  AlFHX3    FHX             1/1/1/0                  1190  PF02181.22, PF10409.8   No   No   AtFH13 AtFH20 AtFH16 AtFH16 AtFH1
AL6G18120.t1  AlFHX4    FHX             0/0/2/0                  622   PF02181.22              No   No   AtFH16 AtFH16 AtFH20 AtFH13 AtFH1
AL6G18110.t1  AlFHX5    FHX             1/0/9/0                  1593  PF02181.22, PF10409.8   No   No   AtFH20 AtFH16 AtFH16 AtFH13 AtFH1
AL2G13030.t1  AlEXT26i  Chimeric        2/2/0/1                  280   PF04043.14              Yes  No   AtHAE1
AL1G39490.t1  AlEXT27i  Chimeric        0/3/18/24                531   PF04554.12              No   No   AtEXT22 AtEXT3/5
AL6G22780.t1  AlEXT28i  Chimeric        5/0/1/1                  183   -                       Yes  No   AtEXT38
AL1G50850.t1  AlEXT29i  Chimeric        3/0/0/0                  249   PF14368.5, PF00234.21   Yes  No   AtAGP29I
AL3G32840.t1  AlEXT30i  Chimeric        0/3/0/0                  510   PF06830.10              Yes  No   AtEXT51 AtPEX4
AL1G52230.t1  AlEXT31h  Hybrid HRGP     7/5/1/0                  364   -                       Yes  No   AtHAE2 AtPRP8
AL5G30730.t1  AlEXT32h  Hybrid HRGP     1/2/0/0                  269   -                       Yes  No   AtHAE2 AtPRP8 AtPEX4
AL1G27980.t1  AlEXT33h  Hybrid HRGP     6/8/2/0                  734   -                       Yes  No   AtPRP5 AtPRP1
AL1G27990.t1  AlEXT34h  Hybrid HRGP     2/4/1/1                  272   -                       Yes  No   None


5.3.2 EXTs identified in A. halleri

The translated A. halleri data file contains 34,553 proteins, of which 140 proteins were identified as candidate EXTs with at least two SPPP repeats. From these candidates, 65 were determined to be EXTs, including (1) 13 classical EXTs that had characteristic EXT motifs; 10 of these 13 proteins had predicted signal peptide sequences; (2) 14 short EXTs that had predicted signal peptide sequences and EXT motifs; (3) 8 LRXs with signal peptide sequences, clusters of C-terminal EXT motifs, and significant similarities to A. thaliana LRXs; (4) 14 PERKs that had clusters of N-terminal EXT motifs and significant similarities to A. thaliana PERKs; (5) 6 FHXs that were homologous to A. thaliana FHXs; (6) 6 chimeric EXTs that had clustered EXT motifs and a non-HRGP domain; and (7) 4 hybrid HRGPs that had a predicted signal peptide sequence, clustered EXT motifs, and an AGP domain or PRP domain (Table 5.2 and Fig S2).
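The motif-based screen described in this section can be illustrated with a minimal sketch. It is not the BIO OHIO 2.0 implementation: the function names, the toy sequence, and the rule of counting each serine followed by a run of prolines once as SP3, SP4, or SP5 (with longer runs counted as SP5) are assumptions made only for illustration.

```python
import re

def count_ext_motifs(seq):
    """Count SP3/SP4/SP5 runs and YXY motifs in a protein sequence.

    Assumption: each maximal run of 'S' followed by n prolines (n >= 3)
    is counted once, as SP3 (n == 3), SP4 (n == 4), or SP5 (n >= 5).
    """
    counts = {"SP3": 0, "SP4": 0, "SP5": 0, "YXY": 0}
    for run in re.finditer(r"SP{3,}", seq):
        n_pro = len(run.group()) - 1  # prolines that follow the serine
        key = "SP3" if n_pro == 3 else "SP4" if n_pro == 4 else "SP5"
        counts[key] += 1
    # A lookahead is used so that overlapping Y-x-Y motifs are all counted.
    counts["YXY"] = len(re.findall(r"(?=Y.Y)", seq))
    return counts

def is_candidate_ext(seq, min_sppp=2):
    """Candidate rule taken from the text: at least two SPPP repeats."""
    return len(re.findall(r"SPPP", seq)) >= min_sppp

# Hypothetical sequence, not taken from any of the analyzed proteomes.
toy = "MKSPPPPYVYSPPPKSPPPPPYYY"
print(count_ext_motifs(toy))
print(is_candidate_ext(toy))
```

Under these assumptions, the four counts correspond to the SP3/SP4/SP5/YXY column reported in the tables of this chapter, although the exact counting rules of the original program may differ.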

Table 5.2 The list of EXTs identified in A. halleri.

| Protein ID | Name | Class | SP3/SP4/SP5/YXY Repeats | AA | Pfam | SP | GPI | Top BLAST Hits Against A. thaliana Proteome |
|---|---|---|---|---|---|---|---|---|
| g08113.1 | AhEXT1 | Classical SP5 | 5/0/23/32 | 343 | PF04554.12 | Yes | No | AtEXT22 AtEXT3/5 AtEXT21 |
| g21362.1 | AhEXT2 | Classical SP4/SP5 | 1/3/3/3 | 190 | PF04554.12 | Yes | No | AtEXT18 AtEXT15 AtEXT14 AtEXT7 AtEXT13 |
| g26842.1 | AhEXT3 | Classical SP4 | 10/14/0/2 | 218 | PF04554.12 | No | No | AtEXT3/5 |
| g19996.1 | AhEXT4 | Classical SP4 | 1/12/2/11 | 242 | PF04554.12 | No | No | AtEXT18 AtEXT15 AtEXT12 AtEXT9 AtEXT14 |
| g13437.1 | AhEXT5 | Classical SP4 | 9/17/1/11 | 390 | PF04554.12 | Yes | No | AtEXT3/5 AtEXT1/4 AtEXT1/4 AtEXT18 AtEXT22 |
| g24818.1 | AhEXT6 | Classical SP4 | 2/16/0/7 | 320 | PF04554.12 | Yes | No | AtEXT15 AtEXT14 AtEXT12 AtEXT13 AtEXT7 |
| g13632.1 | AhEXT7 | Classical SP4 | 2/25/4/16 | 441 | PF04554.12 | Yes | No | AtEXT18 AtEXT15 AtEXT10 AtEXT14 AtEXT12 |
| g20328.1 | AhEXT8 | Classical SP3 | 26/0/0/34 | 371 | | Yes | No | AtEXT20 AtEXT22 AtEXT17 AtEXT21 |
| g08114.1 | AhEXT9 | Classical SP4 | 1/17/1/18 | 201 | PF04554.12 | No | No | AtEXT22 AtEXT3/5 AtEXT21 |
| g01247.1 | AhEXT10 | Classical SP4 | 10/11/0/1 | 236 | | Yes | No | AtEXT3/5 AtEXT1/4 AtEXT1/4 AtPRP1 AtEXT18 |
| g25079.1 | AhEXT11 | Classical SP4 | 7/31/0/22 | 656 | PF04554.12 | Yes | No | AtEXT18 AtEXT15 AtEXT12 AtEXT13 AtEXT11 |
| g26951.1 | AhEXT12 | Classical SP4 | 0/14/0/11 | 273 | PF04554.12 | Yes | No | AtEXT15 AtEXT14 AtEXT12 AtEXT13 AtEXT7 |
| g17979.1 | AhEXT13 | Classical SP4 | 0/58/1/39 | 1123 | PF04554.12 | Yes | No | AtEXT8 AtEXT3/5 AtEXT22 AtEXT18 AtEXT9 |
| g11173.1 | AhEXT14s | Short | 1/0/1/1 | 134 | | Yes | No | AtEXT35 AtEXT40 AtEXT39 AtEXT19 AtPRP2 |
| g15596.1 | AhEXT15s | Short | 0/6/0/4 | 137 | | Yes | No | AtEXT7 AtEXT13 AtEXT12 AtEXT11 AtEXT14 |
| g11478.1 | AhEXT16s | Short | 1/6/0/3 | 125 | PF04554.12 | Yes | No | AtEXT9 AtEXT2 AtEXT18 AtEXT10 AtEXT12 |
| g20317.1 | AhEXT17s | Short | 3/2/2/7 | 136 | PF04554.12 | Yes | No | AtEXT22 AtEXT17 AtEXT20 AtEXT21 AtEXT3/5 |
| g12151.1 | AhEXT18s | Short | 0/0/3/1 | 95 | | Yes | No | AtEXT39 AtEXT19 AtEXT40 AtEXT35 AtAGP45P |
| g06485.1 | AhEXT19s | Short | 1/0/1/3 | 143 | | Yes | No | AtEXT34 AtEXT41 |
| g20314.1 | AhEXT20s | Short | 4/0/0/10 | 118 | | Yes | No | AtEXT20 AtEXT22 AtEXT17 AtEXT21 AtAGP45P |
| g13700.1 | AhEXT21s | Short | 0/2/0/0 | 145 | | Yes | No | AtEXT31 AtEXT33 AtEXT30 |
| g28923.1 | AhEXT22s | Short | 1/3/2/2 | 140 | | Yes | No | AtEXT9 AtEXT2 AtEXT10 AtEXT18 AtEXT15 |
| g20319.1 | AhEXT23s | Short | 8/2/0/12 | 143 | | Yes | No | AtEXT20 AtEXT22 AtEXT17 AtEXT21 AtAGP45P |
| g25081.1 | AhEXT24s | Short | 1/9/0/5 | 195 | PF04554.12 | Yes | No | AtEXT15 AtEXT18 AtEXT7 AtEXT14 AtEXT12 |
| g24874.1 | AhEXT25s | Short | 0/1/1/0 | 169 | | Yes | No | AtEXT32 |
| g28924.1 | AhEXT26s | Short | 1/6/2/2 | 165 | | Yes | No | AtEXT9 AtEXT10 AtEXT2 AtEXT18 AtEXT12 |
| g31500.1 | AhEXT27s | Short | 0/2/0/0 | 104 | | Yes | No | AtEXT31 AtEXT33 |
| g21653.1 | AhLRX1 | LRX | 0/7/1/0 | 1027 | PF14226.5, PF03171.19 | Yes | No | AtPEX1 AtPEX2 AtPEX3 AtPEX4 AtLRX4 |
| g00545.1 | AhLRX2 | LRX | 3/12/7/2 | 790 | PF08263.11 | Yes | No | AtPEX4 AtPEX3 AtPEX2 AtLRX4 AtLRX5 |
| g29325.1 | AhLRX3 | LRX | 3/15/4/2 | 726 | | Yes | No | AtPEX4 AtLRX3 AtLRX4 AtLRX1 AtLRX2 |
| g14084.2 | AhLRX4 | LRX | 0/2/1/1 | 435 | PF13855.5, PF00560.32 | Yes | No | AtLRX7 AtLRX4 AtLRX3 AtLRX2 AtLRX1 |
| g15232.1 | AhLRX5 | LRX | 3/11/3/0 | 848 | | Yes | No | AtPEX2 AtPEX1 AtPEX4 AtPEX3 AtLRX4 |
| g04875.1 | AhLRX6 | LRX | 2/18/5/9 | 737 | PF08263.11 | Yes | No | AtLRX1 AtLRX2 AtLRX4 AtLRX3 AtLRX5 |
| g15350.1 | AhLRX7 | LRX | 3/3/9/3 | 811 | PF13855.5, PF08263.11 | Yes | No | AtLRX5 AtLRX4 AtLRX3 AtLRX2 AtLRX1 |
| g19987.1 | AhLRX8 | LRX | 4/10/17/3 | 756 | PF08263.11 | Yes | No | AtLRX3 AtLRX5 AtLRX4 AtLRX2 AtLRX6 |
| g20067.1 | AhPERK1 | PERK | 1/0/1/0 | 653 | PF07714.16, PF00069.24 | No | No | AtPERK1 AtPERK4 AtPERK3 AtPERK5 AtPERK15 |
| g20066.1 | AhPERK2 | PERK | 1/1/1/1 | 394 | PF07714.16, PF00069.24 | No | No | AtPERK3 AtPERK1 AtPERK15 AtPERK4 AtPERK5 |
| g00485.2 | AhPERK3 | PERK | 3/0/0/0 | 537 | PF07714.16, PF00069.24 | No | No | AtPERK5 AtPERK7 AtPERK13 AtPERK12 AtPERK3 |
| g21635.1 | AhPERK4 | PERK | 0/0/2/0 | 511 | PF07714.16, PF00069.24 | No | No | AtPERK6 AtPERK7 AtPERK5 AtPERK4 AtPERK1 |
| g13643.1 | AhPERK5 | PERK | 1/3/0/1 | 711 | PF07714.16, PF00069.24 | No | No | AtPERK12 AtPERK13 AtPERK11 AtPERK10 AtPERK1 |
| g15194.1 | AhPERK6 | PERK | 2/3/1/0 | 653 | PF07714.16, PF00069.24 | Yes | No | AtPERK7 AtPERK6 AtPERK5 AtPERK4 AtPERK1 |
| g08124.1 | AhPERK7 | PERK | 5/3/0/1 | 957 | PF07714.16, PF00069.24 | No | No | AtPERK10 AtPERK8 AtPERK12 AtPERK1 AtPERK15 |
| g14720.1 | AhPERK8 | PERK | 3/5/2/1 | 647 | PF07714.16, PF00069.24 | No | No | AtPERK10 AtPERK8 AtPERK13 AtPERK12 AtPERK11 |
| g00693.1 | AhPERK9 | PERK | 3/4/1/0 | 720 | PF07714.16, PF00069.24 | No | No | AtPERK14 AtPERK1 AtPERK5 AtPERK15 AtPERK3 |
| g20038.1 | AhPERK10 | PERK | 1/5/0/0 | 1126 | PF16113.4, PF07714.16 | No | No | AtPERK1 AtPERK3 AtPERK4 AtPERK15 AtPERK5 |
| g25049.1 | AhPERK11 | PERK | 5/1/3/3 | 673 | PF07714.16, PF00069.24 | No | No | AtPERK8 AtPERK10 AtPERK13 AtPERK12 AtPERK1 |
| g05041.1 | AhPERK12 | PERK | 2/0/0/0 | 689 | PF07714.16, PF00069.24 | No | No | AtPERK11 AtPERK12 AtPERK13 AtPERK8 AtPERK10 |
| g03784.1 | AhPERK13 | PERK | 4/1/2/1 | 689 | PF07714.16, PF00069.24 | No | No | AtPERK13 AtPERK12 AtPERK8 AtPERK1 AtPERK10 |
| g00485.1 | AhPERK14 | PERK | 3/0/0/0 | 669 | PF07714.16, PF00069.24 | No | No | AtPERK5 AtPERK7 AtPERK15 AtPERK13 AtPERK12 |
| g13582.1 | AhFHX1 | FHX | 0/0/2/0 | 762 | PF02181.22 | Yes | No | AtFH8 AtFH5 AtFH5 AtFH1 AtFH20 |
| g23866.1 | AhFHX2 | FHX | 1/1/4/0 | 1567 | PF02181.22, PF10409.8 | No | No | AtFH20 AtFH16 AtFH16 AtFH13 AtFH1 |
| g23384.1 | AhFHX3 | FHX | 2/1/0/0 | 1051 | PF02181.22 | Yes | No | AtFH1 AtFH5 AtFH5 AtFH8 AtFH13 |
| g31345.1 | AhFHX4 | FHX | 1/1/0/0 | 1264 | PF02181.22, PF10409.8 | No | No | AtFH13 AtFH20 AtFH16 AtFH16 AtFH1 |
| g16913.1 | AhFHX5 | FHX | 1/1/0/0 | 1309 | PF02181.22, PF10409.8 | No | No | AtFH13 AtFH20 AtFH16 AtFH16 AtFH1 |
| g09373.1 | AhFHX6 | FHX | 1/0/1/0 | 902 | PF02181.22 | Yes | No | AtFH5 AtFH5 AtFH1 AtFH8 AtFH20 |
| g17832.1 | AhEXT28i | Chimeric | 1/10/7/2 | 477 | | No | No | AtLRX2 AtLRX1 AtLRX4 AtLRX7 AtPEX2 |
| g22518.1 | AhEXT29i | Chimeric | 0/5/1/0 | 584 | PF06830.10 | Yes | No | AtEXT51 |
| g32264.1 | AhEXT30i | Chimeric | 4/0/0/0 | 239 | PF14368.5, PF00234.21 | Yes | No | AtAGP29I |
| g17071.1 | AhEXT31i | Chimeric | 6/5/1/0 | 364 | | Yes | No | AtHAE2 AtPRP8 |
| g11479.1 | AhEXT32i | Chimeric | 1/14/1/7 | 676 | PF01965.23, PF04554.12 | Yes | No | AtEXT2 AtEXT18 AtEXT12 AtEXT15 AtEXT22 |
| g09998.1 | AhEXT33i | Chimeric | 2/4/1/1 | 266 | | Yes | No | None |
| g14984.1 | AhEXT34h | Hybrid HRGP | 3/3/0/0 | 302 | | Yes | No | AtHAE2 AtPRP8 |
| g14717.1 | AhEXT35h | Hybrid HRGP | 6/0/0/0 | 249 | | Yes | No | AtAGP19K |
| g14199.1 | AhEXT36h | Hybrid HRGP | 4/1/0/0 | 191 | | Yes | No | AtAGP9C AtAGP9C AtPRP1 |
| g09999.1 | AhEXT37h | Hybrid HRGP | 6/9/2/0 | 744 | | Yes | No | AtPRP5 AtPRP1 |


5.3.3 Update on the EXTs in Arabidopsis thaliana

A detailed identification and classification of EXTs in A. thaliana was initially reported by Showalter et al. in 2010, in which an older version of the A. thaliana protein data file containing 28,954 predicted proteins (ATH1.pep) was analyzed (Table 5.3). That study revealed 63 EXTs, including 20 classical EXTs, 12 short EXTs, 11 LRXs, 13 PERKs, 3 chimeric EXTs, and 4 hybrid HRGPs. Subsequent analysis also revealed 6 FHXs (Showalter et al. 2016). In total, 69 EXTs were reported in A. thaliana (Table 5.4). Given that a more recent A. thaliana proteome data file is available, it was of interest to determine whether the numbers of the various EXT subclasses differed. Using the most recent proteome data file, we again identified 69 EXTs; however, the number of EXTs in several of the subclasses varied slightly (Table 5.5 and Fig S3). One classical EXT (At5g49080) was present in the old data file but absent in the recent data file; two PERKs (At1g52290 and At4g32710) were also absent, while one new PERK (AT1G68690) was identified. An additional short EXT (AT1G12665) and one more chimeric EXT (AT1G44191) were also identified in the recent data file.
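The bookkeeping behind this update can be checked with a short sketch. The subclass counts and locus identifiers are those given in the text; the dictionary layout and the function name are assumptions introduced for illustration.

```python
# Subclass counts reported for the older data file
# (Showalter et al. 2010, with the FHXs added in Showalter et al. 2016).
original = {"classical": 20, "short": 12, "LRX": 11, "PERK": 13,
            "FHX": 6, "hybrid": 4, "chimeric": 3}

# Changes described in the text: (subclass, locus, +1 gained or -1 lost).
changes = [
    ("classical", "At5g49080", -1),
    ("PERK", "At1g52290", -1),
    ("PERK", "At4g32710", -1),
    ("PERK", "AT1G68690", +1),
    ("short", "AT1G12665", +1),
    ("chimeric", "AT1G44191", +1),
]

def apply_changes(counts, changes):
    """Return a new dict of subclass counts after applying the gains and losses."""
    updated = dict(counts)
    for subclass, _locus, delta in changes:
        updated[subclass] += delta
    return updated

updated = apply_changes(original, changes)
print(updated)
print(sum(original.values()), sum(updated.values()))
```

Three losses and three gains leave the total unchanged, which is why both data files yield 69 EXTs even though four subclasses differ.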


Table 5.3 Comparison of EXTs in A. thaliana, A. lyrata, and A. halleri.

| Species | A. thaliana | A. thaliana | A. lyrata | A. halleri |
|---|---|---|---|---|
| File Name | ATH1.pep | Athaliana_167_TAIR10.protein.fa | Alyrata_384_v2.1.protein.fa | transcripts.fa |
| Date of Release | June 10, 2004 | Jan. 12, 2014 | May 12, 2016 | Sept. 22, 2016 |
| No. of Proteins | 28,954 | 35,386 | 33,132 | 34,553 |
| Candidate EXTs (> 2 SPPP) | 125 | 140 | 121 | 140 |

Table 5.4 Summary of EXTs identified in A. thaliana, A. lyrata, and A. halleri.

| Species | A. thaliana (original) | A. thaliana (updated) | A. lyrata | A. halleri |
|---|---|---|---|---|
| Classical EXTs | 20 | 19 | 15 | 13 |
| Short EXTs | 12 | 13 | 10 | 14 |
| LRXs | 11 | 11 | 8 | 8 |
| PERKs | 13 | 12 | 14 | 14 |
| FHXs | 6 | 6 | 5 | 6 |
| Hybrid HRGPs | 4 | 4 | 4 | 4 |
| Chimeric EXTs | 3 | 4 | 5 | 6 |
| Total EXTs | 69 | 69 | 61 | 65 |


Table 5.5 The list of EXTs identified in A. thaliana.

| Protein ID | Name | Class | SP3/SP4/SP5/YXY Repeats | AA | Pfam | SP | GPI | Top BLAST Hits Against A. thaliana Proteome |
|---|---|---|---|---|---|---|---|---|
| AT2G24980 | AtEXT7 | Classical SP4 | 3/37/0/21 | 559 | PF04554.11 | Yes | No | AtEXT7 AtEXT15 AtEXT13 AtEXT18 AtEXT14 |
| AT2G43150 | AtEXT8 | Classical SP4 | 0/22/0/12 | 212 | PF04554.11 | Yes | No | AtEXT8 AtEXT3/5 AtEXT18 |
| AT4G08410 | AtEXT12 | Classical SP4 | 2/41/0/26 | 707 | PF04554.11 | Yes | No | AtEXT12 AtEXT11 AtEXT14 AtEXT15 AtEXT7 |
| AT4G08400 | AtEXT11 | Classical SP4 | 2/21/0/18 | 513 | PF04554.11 | Yes | No | AtEXT11 AtEXT12 AtEXT14 AtEXT15 AtEXT13 |
| AT4G08380 | AtEXT17 | Classical SP3 | 34/2/0/49 | 437 | | Yes | No | AtEXT17 AtEXT20 AtEXT22 AtEXT21 |
| AT4G13390 | AtEXT18 | Classical SP4 | 0/14/8/13 | 429 | PF04554.11 | Yes | No | AtEXT18 AtEXT14 AtEXT15 AtEXT9 AtEXT13 |
| AT4G08370 | AtEXT22 | Classical SP5 | 3/1/13/18 | 350 | PF04554.11 | Yes | No | AtEXT22 AtEXT17 AtEXT3/5 AtEXT21 AtAGP45P |
| AT1G21310 | AtEXT3/5 | Classical SP4 | 13/27/1/16 | 431 | PF04554.11 | Yes | No | AtEXT3/5 AtEXT1/4 AtEXT1/4 |
| AT1G23720 | AtEXT6 | Classical SP4 | 2/61/3/34 | 895 | PF04554.11 | No | No | AtEXT6 AtEXT18 AtEXT9 AtEXT15 AtEXT12 |
| AT1G26240 | AtEXT20 | Classical SP5 | 2/1/40/44 | 478 | PF04554.11 | Yes | No | AtEXT20 AtEXT22 AtEXT3/5 AtEXT17 |
| AT1G76930 | AtEXT1/4 | Classical SP4 | 8/9/0/1 | 293 | | Yes | No | AtEXT1/4 AtEXT3/5 AtPRP1 AtEXT22 AtEXT18 |
| AT1G26250 | AtEXT21 | Classical SP5 | 7/0/28/40 | 443 | PF04554.11 | Yes | No | AtEXT21 AtEXT22 AtEXT3/5 |
| AT3G28550 | AtEXT9 | Classical SP4 | 3/70/0/35 | 1018 | PF04554.11 | Yes | No | AtEXT9 AtEXT15 AtEXT12 |
| AT3G54580 | AtEXT10 | Classical SP4 | 2/68/0/33 | 951 | PF04554.11 | Yes | No | AtEXT10 AtEXT9 AtEXT2 AtEXT15 AtEXT18 |
| AT3G54590 | AtEXT2 | Classical SP4 | 2/47/0/22 | 699 | PF04554.11 | Yes | No | AtEXT2 AtEXT10 AtEXT15 AtEXT18 AtEXT12 |
| AT5G06640 | AtEXT14 | Classical SP4 | 2/42/0/25 | 689 | PF04554.11 | Yes | No | AtEXT14 AtEXT12 AtEXT18 AtEXT15 AtEXT11 |
| AT5G06630 | AtEXT13 | Classical SP4 | 1/29/0/17 | 440 | PF04554.11 | Yes | No | AtEXT13 AtEXT7 AtEXT14 AtEXT15 AtEXT11 |
| AT5G19810 | AtEXT19 | Classical SP5 | 0/4/13/1 | 249 | | Yes | No | AtEXT19 AtEXT39 AtEXT40 |
| AT5G35190 | AtEXT15 | Classical SP4 | 2/12/2/8 | 328 | PF04554.11 | Yes | No | AtEXT15 AtEXT14 AtEXT12 AtEXT7 AtEXT9 |
| AT4G16140 | AtEXT37s | Short | 0/1/1/4 | 164 | | Yes | Yes | AtEXT37 AtEXT41 AtPERK6 |
| AT1G02405 | AtEXT30s | Short | 0/3/0/0 | 134 | | Yes | No | AtEXT30 AtEXT33 AtEXT31 AtPERK6 |
| AT1G12665 | AtEXT42s | Short | 2/0/0/0 | 89 | | Yes | No | None |
| AT1G70990 | AtEXT33s | Short | 0/2/0/1 | 176 | | Yes | No | AtEXT33 AtEXT31 AtEXT30 |
| AT1G23040 | AtEXT31s | Short | 0/2/0/0 | 144 | | Yes | Yes | AtEXT31 AtEXT33 AtEXT30 |
| AT1G54215 | AtEXT32s | Short | 0/1/1/0 | 169 | | Yes | No | AtEXT32 |
| AT3G49270 | AtEXT36s | Short | 0/2/0/0 | 148 | | Yes | No | AtEXT36 AtEXT36 |
| AT3G06750 | AtEXT34s | Short | 0/1/1/1 | 147 | | Yes | Yes | AtEXT34 AtEXT41 AtEXT37 AtFH20 |
| AT3G20850 | AtEXT35s | Short | 1/0/1/2 | 134 | | Yes | No | AtEXT35 AtEXT40 AtEXT19 AtEXT39 |
| AT5G11990 | AtEXT38s | Short | 4/0/1/1 | 181 | | Yes | Yes | AtEXT38 |
| AT5G26080 | AtEXT40s | Short | 2/1/3/0 | 141 | | Yes | No | AtEXT40 AtEXT35 AtEXT39 AtPRP1 AtPRP2 |
| AT5G49280 | AtEXT41s | Short | 0/2/0/2 | 162 | | Yes | Yes | AtEXT41 AtEXT34 AtPERK3 |
| AT5G19800 | AtEXT39s | Short | 0/0/3/1 | 96 | | Yes | No | AtEXT39 AtEXT19 AtEXT40 |
| AT4G18670 | AtLRX5 | LRX | 3/2/6/3 | 857 | PF08263.10, PF13855.4 | Yes | No | AtLRX5 AtLRX4 AtLRX3 AtLRX2 AtLRX6 |
| AT4G13340 | AtLRX3 | LRX | 4/13/15/3 | 760 | PF08263.10 | Yes | No | AtLRX3 AtLRX4 AtLRX5 AtLRX7 AtPEX1 |
| AT1G62440 | AtLRX2 | LRX | 4/12/6/3 | 826 | PF08263.10, PF13855.4 | No | No | AtLRX2 AtLRX1 AtLRX4 AtLRX3 AtLRX5 |
| AT1G12040 | AtLRX1 | LRX | 1/17/7/9 | 744 | PF08263.10 | Yes | No | AtLRX1 AtLRX2 AtLRX4 AtLRX3 AtLRX5 |
| AT3G22800 | AtLRX6 | LRX | 1/0/2/6 | 470 | PF08263.10, PF13855.4 | Yes | No | AtLRX6 AtLRX4 AtLRX3 AtLRX5 AtLRX2 |
| AT3G24480 | AtLRX4 | LRX | 2/1/3/1 | 494 | PF08263.10, PF13855.4 | Yes | No | AtLRX4 AtLRX5 AtLRX3 AtLRX2 AtLRX6 |
| AT1G49490 | AtPEX2 | LRX | 1/13/1/0 | 847 | | Yes | No | AtPEX2 AtPEX1 AtPEX4 AtPEX3 AtLRX4 |
| AT2G15880 | AtPEX3 | LRX | 2/16/9/1 | 727 | | No | No | AtPEX3 AtPEX4 AtPEX1 AtLRX3 AtLRX5 |
| AT3G19020 | AtPEX1 | LRX | 1/19/5/0 | 956 | | Yes | No | AtPEX1 AtPEX2 AtPEX3 AtLRX4 AtLRX3 |
| AT4G33970 | AtPEX4 | LRX | 4/10/4/1 | 699 | | Yes | No | AtPEX4 AtPEX3 AtPEX2 AtLRX4 AtLRX5 |
| AT5G25550 | AtLRX7 | LRX | 1/0/1/1 | 433 | PF13855.4, PF00560.31 | Yes | No | AtLRX7 AtLRX4 AtLRX3 AtLRX5 AtLRX2 |
| AT5G38560 | AtPERK8 | PERK | 5/2/2/3 | 681 | PF00069.23 | No | No | AtPERK8 AtPERK10 AtPERK12 AtPERK13 AtPERK11 |
| AT3G18810 | AtPERK6 | PERK | 1/1/2/0 | 700 | PF07714.15 | No | No | AtPERK6 AtPERK7 AtPERK5 AtPERK4 AtPERK1 |
| AT3G24550 | AtPERK1 | PERK | 3/0/0/0 | 652 | PF07714.15 | No | No | AtPERK1 AtPERK4 AtPERK3 AtPERK5 AtPERK15 |
| AT3G24540 | AtPERK3 | PERK | 0/1/1/0 | 509 | PF07714.15 | No | No | AtPERK3 AtPERK1 AtPERK15 AtPERK4 AtPERK5 |
| AT1G23540 | AtPERK12 | PERK | 1/2/0/1 | 720 | PF07714.15 | No | No | AtPERK12 AtPERK13 AtPERK8 AtPERK10 AtPERK1 |
| AT1G49270 | AtPERK7 | PERK | 1/4/1/0 | 699 | | No | No | AtPERK7 AtPERK6 AtPERK5 AtPERK4 AtPERK15 |
| AT1G26150 | AtPERK10 | PERK | 4/2/1/1 | 762 | PF07714.15 | No | No | AtPERK10 AtPERK1 AtPERK15 AtPERK4 AtPERK3 |
| AT1G68690 | AtPERK16 | PERK | 8/4/2/1 | 708 | PF00069.23 | No | No | AtPERK10 AtPERK13 AtPERK1 AtPERK15 AtPERK5 |
| AT1G10620 | AtPERK11 | PERK | 2/0/0/0 | 718 | PF07714.15 | No | No | AtPERK11 AtPERK12 AtPERK13 AtPERK10 AtPERK1 |
| AT1G70460 | AtPERK13 | PERK | 3/2/2/1 | 710 | PF07714.15 | No | No | AtPERK13 AtPERK8 AtPERK1 AtPERK15 AtPERK3 |
| AT4G34440 | AtPERK5 | PERK | 2/0/0/0 | 670 | PF07714.15 | No | No | AtPERK5 AtPERK6 AtPERK7 AtPERK4 AtPERK1 |
| AT2G18470 | AtPERK4 | PERK | 1/0/1/1 | 633 | PF07714.15 | No | No | AtPERK4 AtPERK1 AtPERK5 AtPERK7 AtPERK6 |
| AT3G25500 | AtFH1 | FHX | 2/1/0/0 | 1051 | PF02181.21 | Yes | No | AtFH1 AtFH5 AtFH5 AtFH8 AtFH20 |
| AT5G54650 | AtFH5 | FHX | 1/0/1/0 | 900 | PF02181.21 | Yes | No | AtFH5 AtFH5 AtFH1 AtFH8 AtFH20 |
| AT1G70140 | AtFH8 | FHX | 0/1/1/0 | 760 | PF02181.21 | Yes | No | AtFH8 AtFH1 AtFH5 AtFH5 AtFH20 |
| AT5G58160 | AtFH13 | FHX | 1/1/1/0 | 1324 | PF02181.21, PF10409.7 | No | No | AtFH13 AtFH20 AtFH16 AtFH16 AtFH1 |
| AT5G07770 | AtFH16 | FHX | 1/1/1/0 | 722 | PF02181.21 | No | No | AtFH20 AtFH16 AtFH16 AtFH13 AtFH1 |
| AT5G07740 | AtFH20 | FHX | 0/1/8/0 | 1649 | PF02181.21, PF10409.7 | No | No | AtHAE2 AtPRP8 |
| AT1G44191 | AtEXT53i | Chimeric | 5/7/0/0 | 359 | | Yes | No | AtEXT51 |
| AT3G19430 | AtEXT51i | Chimeric | 0/7/0/0 | 559 | PF06830.9 | No | No | AtEXT52 AtPAG3 AtPAG14 AtPAG4 AtPAG7 |
| AT3G53330 | AtEXT52i | Chimeric | 0/3/0/1 | 310 | PF02298.15 | Yes | No | AtEXT50 |
| AT3G11030 | AtEXT50i | Chimeric | 0/5/0/0 | 451 | PF14416.4, PF13839.4 | No | No | AtHAE2 AtPRP8 |
| AT3G50580 | AtHAE2 | Hybrid HRGP | 1/2/1/0 | 250 | | Yes | No | AtHAE1 AtPAG10 |
| AT1G62760 | AtHAE1 | Hybrid HRGP | 2/0/2/0 | 312 | PF04043.13 | Yes | No | AtHAE3 |
| AT4G11430 | AtHAE3 | Hybrid HRGP | 2/0/2/0 | 219 | | No | No | AtHAE4 AtPRP15 AtFH20 AtPRP1 |
| AT4G22470 | AtHAE4 | Hybrid HRGP | 2/1/0/0 | 375 | PF14547.4 | Yes | No | AtAtHAE4 AtFH20 |


5.3.4 Phylogenetic analysis of EXTs in A. thaliana, A. lyrata, and A. halleri

Elucidating the phylogenetic relationships of EXTs in A. thaliana and its close relatives can be helpful in understanding the evolution of this protein family. EXTs were divided into a number of subfamilies, each with unique sequence features and associated functions. Phylogenetic analysis using both maximum likelihood and maximum parsimony methods was conducted on each of the subfamilies, including classical EXTs, short EXTs, LRXs, PERKs, and FHXs.

The conserved regions of 47 classical EXT protein sequences (19 from A. thaliana, 15 from A. lyrata, and 13 from A. halleri) were used to construct a phylogenetic tree by the maximum likelihood method based on the JTT matrix-based model (Fig 5.2 and Fig S4). A number of classical EXT pairs with high bootstrap values (>98) were resolved: AtEXT22/AlEXT8, AtEXT19/AlEXT14, AhEXT5/AhEXT10, and AtEXT11/AtEXT12. Only one major clade with a high bootstrap value was resolved, which included 22 sequences (8 AtEXTs, 7 AlEXTs, and 7 AhEXTs). Phylogenetic analysis of classical EXTs was also conducted using the maximum parsimony method (Fig S5), which showed a tree topology almost identical to that inferred by the maximum likelihood method.
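Bootstrap values such as those cited above express how often a grouping recurs when the alignment columns are resampled with replacement. The following minimal sketch illustrates only that resampling idea. It does not perform maximum likelihood inference under the JTT model, which requires dedicated phylogenetics software; as a stated stand-in, two sequences are counted as grouped in a replicate when they are each other's nearest neighbor by p-distance. The function names, the toy alignment, and this proxy are assumptions and are not the procedure used in this study.

```python
import random

def p_distance(a, b, cols):
    """Fraction of the sampled columns at which two aligned sequences differ."""
    return sum(a[i] != b[i] for i in cols) / len(cols)

def nearest(name, aln, cols):
    """Name of the sequence closest to `name` over the sampled columns."""
    others = [n for n in aln if n != name]
    return min(others, key=lambda n: p_distance(aln[name], aln[n], cols))

def bootstrap_support(aln, x, y, replicates=100, seed=0):
    """Percentage of replicates in which x and y are mutual nearest neighbors."""
    rng = random.Random(seed)
    length = len(next(iter(aln.values())))
    hits = 0
    for _ in range(replicates):
        # One bootstrap replicate: resample alignment columns with replacement.
        cols = [rng.randrange(length) for _ in range(length)]
        if nearest(x, aln, cols) == y and nearest(y, aln, cols) == x:
            hits += 1
    return 100.0 * hits / replicates

# Hypothetical toy alignment; labels and residues are illustrative only.
toy = {
    "seqA": "SPPPPYVYKSPP",
    "seqB": "SPPPPYVYKSPP",
    "seqC": "TAAAAHLHRTAA",
    "seqD": "GQQQQFIFEGQQ",
}
print(bootstrap_support(toy, "seqA", "seqB"))
```

In a real analysis each replicate would be used to rebuild a full tree, and the support for a clade would be the percentage of replicate trees that contain it.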


Figure 5.2 Phylogenetic tree of classical EXTs using the maximum likelihood method.


The conserved regions of 37 short EXT protein sequences (13 from A. thaliana, 10 from A. lyrata, and 14 from A. halleri) were used to construct the phylogenetic tree by the maximum likelihood method based on the JTT matrix-based model (Fig 5.3 and Fig S6). Several clusters of short EXTs with high bootstrap values (>98) were resolved: AlEXT24s/AhEXT19s/AtEXT34s, AtEXT37s/AlEXT23s, AhEXT20s/AhEXT23s, AtEXT31s/AlEXT21s/AhEXT21s/AhEXT27s, and AlEXT17s/AhEXT18s. Phylogenetic analysis of short EXTs was also conducted using the maximum parsimony method (Fig S7), which showed a tree topology almost identical to that inferred by the maximum likelihood method.

Figure 5.3 Phylogenetic tree of short EXTs using the maximum likelihood method.


The conserved regions of 27 LRXs, including four A. thaliana pollen-expressed LRXs (PEXs), were used to construct the phylogenetic tree by the maximum likelihood method based on the JTT matrix-based model (Fig 5.4 and Fig S8). Seven clusters of LRXs with high bootstrap values (>98) were resolved in the tree, including AtLRX1/AhLRX6, AtLRX3/AlLRX1/AhLRX8, AtLRX5/AlLRX3/AhLRX7, AtLRX6/AlLRX7, and AtPEX2/AhLRX5. The phylogenetic tree was divided into two distinct clades, and all the A. thaliana PEXs reside in the second clade. Interestingly, a previous study revealed that the phylogenetic grouping of LRXs across multiple plant species overlapped perfectly with their gene expression patterns (Ringli 2005). Thus, this clade likely comprises only PEXs in these three species. In addition to the four known AtPEXs, three of the A. lyrata LRXs (AlLRX4/5/6) and four of the A. halleri LRXs (AhLRX1/2/3/5) are likely PEXs. Phylogenetic analysis of LRXs was also conducted using the maximum parsimony method (Fig S9), which showed a tree topology almost identical to that inferred by the maximum likelihood method.

Figure 5.4 Phylogenetic tree of LRXs using the maximum likelihood method.

The conserved regions of 40 PERKs (12 from A. thaliana, 14 from A. lyrata, and 14 from A. halleri) were used to construct the phylogenetic tree by the maximum likelihood method based on the JTT matrix-based model (Fig 5.5 and Fig S10). Several clusters of PERKs with high bootstrap values (>98) were resolved in the tree: AtPERK10/AlPERK3/AhPERK7, AtPERK16/AlPERK14/AhPERK8, AlPERK8/AhPERK9, AtPERK8/AlPERK6/AhPERK11, AtPERK11/AlPERK4/AhPERK12, AtPERK12/AlPERK5/AhPERK5, AtPERK13/AlPERK13/AhPERK13, AtPERK3/AlPERK9/AhPERK2, AtPERK7/AlPERK2/AhPERK6, and AtPERK6/AlPERK10/AhPERK4. Phylogenetic analysis of PERKs was also conducted using the maximum parsimony method (Fig S11), which showed a tree topology similar to that inferred by the maximum likelihood method, especially at the tips of the branches.

Figure 5.5 Phylogenetic tree of PERKs using the maximum likelihood method.

For the phylogenetic analysis of FHXs, five A. lyrata FHXs and six A. halleri FHXs were aligned together with six A. thaliana FHXs. These AtFHXs were identified out of the 23 A. thaliana formin homologs (AtFH1-At3g25500, AtFH5-At5g54650, AtFH8-At1g70140, AtFH13-At5g58160, AtFH16-At5g07770, and AtFH20-At5g07740). Altogether, the conserved regions of the 17 sequences were used to construct the phylogenetic tree by the maximum likelihood method based on the JTT matrix-based model (Fig 5.6 and Fig S12). The phylogenetic tree split into two major clades and displayed a topology similar to that of a previous study (Deeks et al. 2002). One clade included AtFH1/5/8, AlFHX1/2, and AhFHX1/3/6; the other clade included AtFH13/16/20, AlFHX3/4/5, and AhFHX2/4/5. Phylogenetic analysis of FHXs was also conducted using the maximum parsimony method (Fig S13), which showed a tree topology almost identical to that inferred by the maximum likelihood method.

Figure 5.6 Phylogenetic tree of FHXs using the maximum likelihood method.

5.3.5 Cluster analysis and identification of potential orthologs and paralogs

The identified EXT sequences from all three species were entered into the OrthoMCL 2.0 program to identify potential protein clusters and homologous pairs (Li et al. 2003). The OrthoMCL algorithm starts with an all-against-all BLASTP analysis. Putative orthologs are identified between genomes as reciprocal best similarity pairs; probable paralogs are identified as reciprocal better similarity pairs within the same genome. Then, a similarity matrix is generated after normalization by species. Finally, the Markov Cluster algorithm is applied to the similarity matrix to generate ortholog groups and paralogs.
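The reciprocal-best-pair step described above can be sketched as follows. This is a simplified illustration and not the OrthoMCL implementation: it covers only the between-genome ortholog step, it uses a plain similarity score where larger is better, and it omits the within-genome paralog step, the normalization by species, and the Markov clustering. All protein labels and scores are hypothetical.

```python
def best_hits(scores, genome_of):
    """For each protein, record its best-scoring hit in every other genome."""
    best = {}
    for (query, subject), score in scores.items():
        if genome_of[query] == genome_of[subject]:
            continue  # within-genome hits are ignored in this step
        key = (query, genome_of[subject])
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return best

def reciprocal_best_pairs(scores, genome_of):
    """Pairs from different genomes that are each other's best hit."""
    best = best_hits(scores, genome_of)
    pairs = set()
    for (query, _genome), (subject, _score) in best.items():
        back = best.get((subject, genome_of[query]))
        if back is not None and back[0] == query:
            pairs.add(tuple(sorted((query, subject))))
    return sorted(pairs)

# Hypothetical proteins from two genomes with directional similarity scores.
genome_of = {"At_a": "At", "At_b": "At", "Ah_a": "Ah", "Ah_b": "Ah"}
scores = {
    ("At_a", "Ah_a"): 200, ("At_a", "Ah_b"): 80,
    ("At_b", "Ah_a"): 90,  ("At_b", "Ah_b"): 50,
    ("Ah_a", "At_a"): 190, ("Ah_a", "At_b"): 60,
    ("Ah_b", "At_a"): 85,  ("Ah_b", "At_b"): 40,
}
print(reciprocal_best_pairs(scores, genome_of))
```

In this toy example only one pair is reciprocal; the other proteins each point to a partner whose own best hit lies elsewhere, so they are not reported as putative orthologs.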

A total of 33 clusters were identified by the OrthoMCL program, which included 15 classical/short EXT groups, 4 PERK-exclusive groups, 5 LRX-exclusive groups, 5 FHX-exclusive groups, 2 chimeric EXT-exclusive groups, and 2 hybrid EXT groups (Table 5.6). A number of potential orthologs and paralogs were identified (Tables 5.7 and 5.8). Specifically, 9 pairs of orthologs (AhEXT5/AtEXT3/5, AhEXT4/AtEXT18, AlEXT11/AhEXT10, AhEXT4/AlEXT5, AhEXT11/AlEXT12, AtEXT8/AlEXT4, AtEXT9/AlEXT2, AtEXT22/AlEXT8, and AtEXT18/AlEXT5) and 2 pairs of paralogs (AtEXT11/AtEXT12 and AlEXT1/AlEXT2) were identified for classical EXTs. Twelve pairs of orthologs (AtEXT35s/AhEXT14s, AtEXT39s/AhEXT18s, AtEXT31s/AhEXT21s, AtEXT32s/AhEXT25s, AlEXT16s/AhEXT16s, AhEXT18s/AlEXT17s, AhEXT21s/AlEXT21s, AhEXT25s/AlEXT20s, AtEXT31s/AlEXT21s, AtEXT32s/AlEXT20s, AtEXT39s/AlEXT17s, and AtEXT40s/AlEXT18s) were identified for short EXTs. Interestingly, 6 pairs of orthologs were also identified between classical EXTs and short EXTs (Table 5.7). For LRXs, the analysis revealed four groups of homologs. Group one included all reproductive LRXs: AtPEX1/2/3/4, AlLRX4/5/6, and AhLRX1/2/3/5; group two contained AtLRX3/4/5, AlLRX1/3, and AhLRX7/8; group three consisted of AtLRX1/2, AlLRX8, and AhLRX6; the last group had AhLRX4, AtLRX7, and AlLRX2. For PERKs, three major groups of orthologs and paralogs were identified. Group one contained AtPERK1/3/4/5/6/7, AlPERK2/7/9/10/11/12, and AhPERK1/2/3/4/6/10/14; group two included AtPERK11/12/13, AlPERK4/5/13, and AhPERK5/12/13; group three included AtPERK8/10/16, AlPERK3/6/14, and AhPERK7/8/11. For FHXs, 15 orthologous pairs were identified, including 5 pairs between A. lyrata and A. thaliana, 3 pairs between A. lyrata and A. halleri, and 7 pairs between A. thaliana and A. halleri.


Table 5.6 The list of identified clusters using the OrthoMCL program.

| Cluster Number | EXTs |
|---|---|
| G1 | AhPERK14 AhPERK3 AhPERK1 AhPERK4 AhPERK6 AhPERK10 AtPERK7 AtPERK4 AtPERK6 AtPERK5 AtPERK1 AtPERK3 AlPERK2 AlPERK10 AlPERK7 AlPERK11 AlPERK12 AhPERK2 AlPERK9 |
| G2 | AhLRX2 AhLRX3 AhLRX5 AhLRX1 AtPEX2 AtPEX1 AtPEX4 AtPEX3 AlLRX5 AlLRX6 AlLRX4 |
| G3 | AhPERK13 AhPERK12 AhPERK5 AtPERK12 AtPERK13 AlPERK13 AtPERK11 AlPERK4 AlPERK5 |
| G4 | AhPERK8 AhPERK11 AtPERK10 AtPERK16 AhPERK7 AtPERK8 AlPERK3 AlPERK14 AlPERK6 |
| G5 | AtLRX4 AtLRX3 AtLRX5 AlLRX3 AlLRX1 AhLRX7 AhLRX8 |
| G6 | AtFH20 AtFH16 AhFHX2 AlFHX5 AlFHX4 |
| G7 | AhFHX5 AhFHX4 AtFH13 AlFHX3 |
| G8 | AtLRX1 AtLRX2 AhLRX6 AlLRX8 |
| G9 | AlEXT2 AlEXT1 AhEXT22s AtEXT9 |
| G10 | AhEXT15s AtEXT7 AhEXT12 AlEXT25s |
| G11 | AhFHX6 AtFH5 AlFHX2 |
| G12 | AhEXT18s AtEXT39s AlEXT17s |
| G13 | AhEXT21s AtEXT31s AlEXT21s |
| G14 | AhLRX4 AtLRX7 AlLRX2 |
| G15 | AhEXT4 AtEXT18 AlEXT5 |
| G16 | AhEXT17s AtEXT22 AlEXT8 |
| G17 | AhEXT29i AtEXT51i AlEXT30i |
| G18 | AhFHX3 AtFH1 AlFHX1 |
| G19 | AhEXT25s AtEXT32s AlEXT20s |
| G20 | AtEXT11 AtEXT12 |
| G21 | AhPERK9 AlPERK8 |
| G22 | AhEXT10 AlEXT11 |
| G23 | AhEXT14s AtEXT35s |
| G24 | AhEXT16s AlEXT16s |
| G25 | AhEXT5 AtEXT3/5 |
| G26 | AhFHX1 AtFH8 |
| G27 | AhEXT11 AlEXT12 |
| G28 | AhEXT30i AlEXT29i |
| G29 | AtHAE1 AlEXT26i |
| G30 | AtEXT8 AlEXT4 |
| G31 | AtLRX6 AlLRX7 |
| G32 | AtEXT38s AlEXT28i |
| G33 | AtEXT40s AlEXT18s |


Table 5.7 The list of orthologous proteins identified by the OrthoMCL program.

| Orthologous Protein A | Orthologous Protein B | Normalized Score |
|---|---|---|
| g00485.1_AhPERK14 | AT1G49270_AtPERK7 | 1.114 |
| g00485.1_AhPERK14 | AT4G34440_AtPERK5 | 1.114 |
| g00485.2_AhPERK3 | AT4G34440_AtPERK5 | 1.114 |
| g00545.1_AhLRX2 | AT2G15880_AtPEX3 | 1.114 |
| g00545.1_AhLRX2 | AT4G33970_AtPEX4 | 1.114 |
| g03784.1_AhPERK13 | AT1G23540_AtPERK12 | 1.114 |
| g03784.1_AhPERK13 | AT1G70460_AtPERK13 | 1.114 |
| g04875.1_AhLRX6 | AT1G12040_AtLRX1 | 1.114 |
| g05041.1_AhPERK12 | AT1G10620_AtPERK11 | 1.114 |
| g05041.1_AhPERK12 | AT1G23540_AtPERK12 | 1.114 |
| g05041.1_AhPERK12 | AT1G70460_AtPERK13 | 1.114 |
| g08124.1_AhPERK7 | AT1G26150_AtPERK10 | 1.114 |
| g08124.1_AhPERK7 | AT1G68690_AtPERK16 | 1.114 |
| g08124.1_AhPERK7 | AT5G38560_AtPERK8 | 1.114 |
| g09373.1_AhFHX6 | AT5G54650_AtFH5 | 1.114 |
| g11173.1_AhEXT14s | AT3G20850_AtEXT35s | 0.127 |
| g12151.1_AhEXT18s | AT5G19800_AtEXT39s | 0.138 |
| g13437.1_AhEXT5 | AT1G21310_AtEXT3/5 | 0.935 |
| g13582.1_AhFHX1 | AT1G70140_AtFH8 | 1.114 |
| g13643.1_AhPERK5 | AT1G10620_AtPERK11 | 1.114 |
| g13643.1_AhPERK5 | AT1G23540_AtPERK12 | 1.114 |
| g13700.1_AhEXT21s | AT1G23040_AtEXT31s | 0.425 |
| g14084.2_AhLRX4 | AT5G25550_AtLRX7 | 1.114 |
| g14720.1_AhPERK8 | AT1G26150_AtPERK10 | 1.114 |
| g14720.1_AhPERK8 | AT1G68690_AtPERK16 | 1.114 |
| g15194.1_AhPERK6 | AT1G49270_AtPERK7 | 1.114 |
| g15194.1_AhPERK6 | AT3G18810_AtPERK6 | 1.114 |
| g15232.1_AhLRX5 | AT1G49490_AtPEX2 | 1.114 |
| g15232.1_AhLRX5 | AT2G15880_AtPEX3 | 1.114 |
| g15232.1_AhLRX5 | AT3G19020_AtPEX1 | 1.114 |
| g15350.1_AhLRX7 | AT3G24480_AtLRX4 | 1.114 |
| g15596.1_AhEXT15s | AT2G24980_AtEXT7 | 0.177 |
| g16913.1_AhFHX5 | AT5G58160_AtFH13 | 1.114 |
| g19987.1_AhLRX8 | AT4G18670_AtLRX5 | 1.114 |
| g19996.1_AhEXT4 | AT4G13390_AtEXT18 | 0.122 |
| g20038.1_AhPERK10 | AT3G24540_AtPERK3 | 1.114 |
| g20038.1_AhPERK10 | AT3G24550_AtPERK1 | 1.114 |
| g20066.1_AhPERK2 | AT3G24540_AtPERK3 | 1.114 |
| g20067.1_AhPERK1 | AT2G18470_AtPERK4 | 1.114 |
| g20067.1_AhPERK1 | AT3G24540_AtPERK3 | 1.114 |
| g20067.1_AhPERK1 | AT3G24550_AtPERK1 | 1.114 |
| g20067.1_AhPERK1 | AT4G34440_AtPERK5 | 1.114 |
| g20317.1_AhEXT17s | AT4G08370_AtEXT22 | 0.108 |
| g21635.1_AhPERK4 | AT1G49270_AtPERK7 | 1.114 |
| g21635.1_AhPERK4 | AT2G18470_AtPERK4 | 1.114 |
| g21635.1_AhPERK4 | AT3G18810_AtPERK6 | 1.114 |
| g21635.1_AhPERK4 | AT4G34440_AtPERK5 | 1.114 |
| g21653.1_AhLRX1 | AT1G49490_AtPEX2 | 1.114 |
| g21653.1_AhLRX1 | AT3G19020_AtPEX1 | 1.114 |
| g22518.1_AhEXT29i | AT3G19430_AtEXT51i | 1.114 |
| g23384.1_AhFHX3 | AT3G25500_AtFH1 | 1.114 |
| g23866.1_AhFHX2 | AT5G07740_AtFH20 | 1.114 |
| g23866.1_AhFHX2 | AT5G07770_AtFH16 | 1.114 |
| g24874.1_AhEXT25s | AT1G54215_AtEXT32s | 0.29 |
| g25049.1_AhPERK11 | AT1G26150_AtPERK10 | 1.114 |
| g25049.1_AhPERK11 | AT5G38560_AtPERK8 | 1.114 |
| g29325.1_AhLRX3 | AT4G33970_AtPEX4 | 1.114 |
| g31345.1_AhFHX4 | AT5G58160_AtFH13 | 1.114 |
| g00485.1_AhPERK14 | AL1G56140.t1_AlPERK2 | 1.189 |
| g00485.1_AhPERK14 | AL3G32190.t1_AlPERK10 | 1.189 |
| g00485.1_AhPERK14 | AL3G40090.t1_AlPERK12 | 1.189 |
| g00485.1_AhPERK14 | AL3G51760.t1_AlPERK11 | 1.189 |
| g00485.1_AhPERK14 | AL7G16790.t1_AlPERK7 | 1.189 |
| g00485.2_AhPERK3 | AL7G16790.t1_AlPERK7 | 1.189 |
| g00545.1_AhLRX2 | AL3G47670.t1_AlLRX6 | 1.189 |
| g00545.1_AhLRX2 | AL7G17420.t1_AlLRX4 | 1.189 |
| g00693.1_AhPERK9 | AL7G19036.t1_AlPERK8 | 1.189 |
| g01247.1_AhEXT10 | AL2G37060.t1_AlEXT11 | 0.629 |
| g03784.1_AhPERK13 | AL2G29830.t1_AlPERK13 | 1.189 |
| g05041.1_AhPERK12 | AL1G21470.t1_AlPERK4 | 1.189 |
| g05041.1_AhPERK12 | AL1G37300.t1_AlPERK5 | 1.189 |
| g05041.1_AhPERK12 | AL2G29830.t1_AlPERK13 | 1.189 |
| g08124.1_AhPERK7 | AL1G39600.t1_AlPERK3 | 1.189 |
| g08124.1_AhPERK7 | AL2G27610.t1_AlPERK14 | 1.189 |
| g09373.1_AhFHX6 | AL8G30100.t1_AlFHX2 | 1.189 |
| g11478.1_AhEXT16s | AL5G35660.t1_AlEXT16s | 0.186 |
| g12151.1_AhEXT18s | AL6G31320.t1_AlEXT17s | 0.375 |
| g13643.1_AhPERK5 | AL1G21470.t1_AlPERK4 | 1.189 |
| g13643.1_AhPERK5 | AL1G37300.t1_AlPERK5 | 1.189 |
| g13700.1_AhEXT21s | AL1G36610.t1_AlEXT21s | 0.449 |
| g14084.2_AhLRX4 | AL6G37370.t1_AlLRX2 | 1.189 |
| g14720.1_AhPERK8 | AL1G39600.t1_AlPERK3 | 1.189 |
| g14720.1_AhPERK8 | AL7G52030.t1_AlPERK6 | 1.189 |
| g15194.1_AhPERK6 | AL1G56140.t1_AlPERK2 | 1.189 |
| g15194.1_AhPERK6 | AL3G32190.t1_AlPERK10 | 1.189 |
| g15232.1_AhLRX5 | AL3G32400.t1_AlLRX5 | 1.189 |
| g15350.1_AhLRX7 | AL9U10340.t1_AlLRX1 | 1.189 |
| g19987.1_AhLRX8 | AL9U10340.t1_AlLRX1 | 1.189 |
| g19996.1_AhEXT4 | AL9U10270.t1_AlEXT5 | 0.182 |
| g20038.1_AhPERK10 | AL3G40090.t1_AlPERK12 | 1.189 |
| g20066.1_AhPERK2 | AL3G40080.t1_AlPERK9 | 1.158 |
| g20067.1_AhPERK1 | AL3G40090.t1_AlPERK12 | 1.189 |
| g20067.1_AhPERK1 | AL3G51760.t1_AlPERK11 | 1.189 |
| g20317.1_AhEXT17s | AL6G42140.t1_AlEXT8 | 0.098 |
| g21635.1_AhPERK4 | AL1G56140.t1_AlPERK2 | 1.189 |
| g21635.1_AhPERK4 | AL3G32190.t1_AlPERK10 | 1.189 |
| g21635.1_AhPERK4 | AL3G51760.t1_AlPERK11 | 1.189 |
| g21635.1_AhPERK4 | AL7G16790.t1_AlPERK7 | 1.189 |
| g21653.1_AhLRX1 | AL3G32400.t1_AlLRX5 | 1.189 |
| g22518.1_AhEXT29i | AL3G32840.t1_AlEXT30i | 1.189 |
| g23384.1_AhFHX3 | AL3G41670.t1_AlFHX1 | 1.189 |
| g23866.1_AhFHX2 | AL6G18110.t1_AlFHX5 | 1.189 |
| g24874.1_AhEXT25s | AL1G62930.t1_AlEXT20s | 0.331 |
| g25049.1_AhPERK11 | AL7G52030.t1_AlPERK6 | 1.189 |
| g25079.1_AhEXT11 | AL8G22510.t1_AlEXT12 | 0.204 |
| g26951.1_AhEXT12 | AL3G52210.t1_AlEXT25s | 0.226 |
| g28923.1_AhEXT22s | AL5G18190.t1_AlEXT1 | 0.186 |
| g29325.1_AhLRX3 | AL7G17420.t1_AlLRX4 | 1.189 |
| g32264.1_AhEXT30i | AL1G50850.t1_AlEXT29i | 0.616 |
| AT1G10620_AtPERK11 | AL1G21470.t1_AlPERK4 | 1.158 |
| AT1G10620_AtPERK11 | AL1G37300.t1_AlPERK5 | 1.158 |
| AT1G23040_AtEXT31s | AL1G36610.t1_AlEXT21s | 0.416 |
| AT1G23540_AtPERK12 | AL1G37300.t1_AlPERK5 | 1.158 |
| AT1G26150_AtPERK10 | AL1G39600.t1_AlPERK3 | 1.158 |
| AT1G26150_AtPERK10 | AL2G27610.t1_AlPERK14 | 1.158 |
| AT1G26150_AtPERK10 | AL7G52030.t1_AlPERK6 | 1.158 |
| AT1G49270_AtPERK7 | AL1G56140.t1_AlPERK2 | 1.158 |
| AT1G49270_AtPERK7 | AL3G32190.t1_AlPERK10 | 1.158 |
| AT1G49270_AtPERK7 | AL7G16790.t1_AlPERK7 | 1.158 |
| AT1G49490_AtPEX2 | AL3G32400.t1_AlLRX5 | 1.158 |
| AT1G49490_AtPEX2 | AL3G47670.t1_AlLRX6 | 1.158 |
| AT1G54215_AtEXT32s | AL1G62930.t1_AlEXT20s | 0.271 |
| AT1G62440_AtLRX2 | AL2G13430.t1_AlLRX8 | 1.158 |
| AT1G62760_AtHAE1 | AL2G13030.t1_AlEXT26i | 0.751 |
| AT1G68690_AtPERK16 | AL1G39600.t1_AlPERK3 | 1.158 |
| AT1G68690_AtPERK16 | AL2G27610.t1_AlPERK14 | 1.158 |
| AT1G70460_AtPERK13 | AL2G29830.t1_AlPERK13 | 1.158 |
| AT2G15880_AtPEX3 | AL3G32400.t1_AlLRX5 | 1.158 |
| AT2G15880_AtPEX3 | AL3G47670.t1_AlLRX6 | 1.158 |
| AT2G15880_AtPEX3 | AL7G17420.t1_AlLRX4 | 1.158 |
| AT2G18470_AtPERK4 | AL3G32190.t1_AlPERK10 | 1.158 |
| AT2G18470_AtPERK4 | AL3G51760.t1_AlPERK11 | 1.158 |
| AT2G18470_AtPERK4 | AL7G16790.t1_AlPERK7 | 1.158 |
| AT2G24980_AtEXT7 | AL3G52210.t1_AlEXT25s | 0.188 |
| AT2G43150_AtEXT8 | AL4G41820.t1_AlEXT4 | 0.689 |
| AT3G18810_AtPERK6 | AL1G56140.t1_AlPERK2 | 1.158 |
| AT3G18810_AtPERK6 | AL3G32190.t1_AlPERK10 | 1.158 |
| AT3G19020_AtPEX1 | AL3G47670.t1_AlLRX6 | 1.158 |
| AT3G19430_AtEXT51i | AL3G32840.t1_AlEXT30i | 1.158 |
| AT3G22800_AtLRX6 | AL3G37120.t1_AlLRX7 | 1.158 |
| AT3G24480_AtLRX4 | AL7G35590.t1_AlLRX3 | 1.158 |
| AT3G24480_AtLRX4 | AL9U10340.t1_AlLRX1 | 1.158 |
| AT3G24540_AtPERK3 | AL3G40080.t1_AlPERK9 | 1.158 |
| AT3G24540_AtPERK3 | AL3G40090.t1_AlPERK12 | 1.158 |
| AT3G24550_AtPERK1 | AL3G40090.t1_AlPERK12 | 1.158 |
| AT3G25500_AtFH1 | AL3G41670.t1_AlFHX1 | 1.158 |
| AT3G28550_AtEXT9 | AL5G18170.t1_AlEXT2 | 0.154 |
| AT4G08370_AtEXT22 | AL6G42140.t1_AlEXT8 | 0.133 |
| AT4G13340_AtLRX3 | AL9U10340.t1_AlLRX1 | 1.158 |
| AT4G13390_AtEXT18 | AL9U10270.t1_AlEXT5 | 0.332 |
| AT4G18670_AtLRX5 | AL7G35590.t1_AlLRX3 | 1.158 |
| AT4G18670_AtLRX5 | AL9U10340.t1_AlLRX1 | 1.158 |
| AT4G33970_AtPEX4 | AL3G32400.t1_AlLRX5 | 1.158 |
| AT4G33970_AtPEX4 | AL3G47670.t1_AlLRX6 | 1.158 |
| AT4G33970_AtPEX4 | AL7G17420.t1_AlLRX4 | 1.158 |
| AT4G34440_AtPERK5 | AL1G56140.t1_AlPERK2 | 1.158 |
| AT4G34440_AtPERK5 | AL3G51760.t1_AlPERK11 | 1.158 |
| AT4G34440_AtPERK5 | AL7G16790.t1_AlPERK7 | 1.158 |
| AT5G07770_AtFH16 | AL6G18110.t1_AlFHX5 | 1.158 |
| AT5G07770_AtFH16 | AL6G18120.t1_AlFHX4 | 1.158 |
| AT5G11990_AtEXT38s | AL6G22780.t1_AlEXT28i | 0.35 |
| AT5G19800_AtEXT39s | AL6G31320.t1_AlEXT17s | 0.149 |
| AT5G25550_AtLRX7 | AL6G37370.t1_AlLRX2 | 1.158 |
| AT5G26080_AtEXT40s | AL6G38070.t1_AlEXT18s | 0.138 |
| AT5G38560_AtPERK8 | AL7G52030.t1_AlPERK6 | 1.158 |
| AT5G54650_AtFH5 | AL8G30100.t1_AlFHX2 | 1.158 |
| AT5G58160_AtFH13 | AL8G34260.t1_AlFHX3 | 1.158 |


Table 5.8 The list of paralogous proteins identified by the OrthoMCL program.

| Paralogous Protein A | Paralogous Protein B | Normalized Score |
|---|---|---|
| g00485.1_AhPERK14 | g00485.2_AhPERK3 | 1 |
| g00485.1_AhPERK14 | g20067.1_AhPERK1 | 1 |
| g00485.1_AhPERK14 | g21635.1_AhPERK4 | 1 |
| g00545.1_AhLRX2 | g29325.1_AhLRX3 | 1 |
| g03784.1_AhPERK13 | g05041.1_AhPERK12 | 1 |
| g03784.1_AhPERK13 | g13643.1_AhPERK5 | 1 |
| g05041.1_AhPERK12 | g13643.1_AhPERK5 | 1 |
| g14720.1_AhPERK8 | g25049.1_AhPERK11 | 1 |
| g15194.1_AhPERK6 | g21635.1_AhPERK4 | 1 |
| g15232.1_AhLRX5 | g21653.1_AhLRX1 | 1 |
| g16913.1_AhFHX5 | g31345.1_AhFHX4 | 1 |
| g20038.1_AhPERK10 | g20067.1_AhPERK1 | 1 |
| AT1G12040_AtLRX1 | AT1G62440_AtLRX2 | 1 |
| AT1G26150_AtPERK10 | AT1G68690_AtPERK16 | 1 |
| AT1G49270_AtPERK7 | AT2G18470_AtPERK4 | 1 |
| AT1G49270_AtPERK7 | AT3G18810_AtPERK6 | 1 |
| AT1G49270_AtPERK7 | AT4G34440_AtPERK5 | 1 |
| AT1G49490_AtPEX2 | AT3G19020_AtPEX1 | 1 |
| AT1G49490_AtPEX2 | AT4G33970_AtPEX4 | 1 |
| AT2G15880_AtPEX3 | AT3G19020_AtPEX1 | 1 |
| AT2G15880_AtPEX3 | AT4G33970_AtPEX4 | 1 |
| AT2G18470_AtPERK4 | AT3G24550_AtPERK1 | 1 |
| AT2G18470_AtPERK4 | AT4G34440_AtPERK5 | 1 |
| AT3G18810_AtPERK6 | AT4G34440_AtPERK5 | 1 |
| AT3G24480_AtLRX4 | AT4G13340_AtLRX3 | 1 |
| AT3G24480_AtLRX4 | AT4G18670_AtLRX5 | 1 |
| AT3G24540_AtPERK3 | AT3G24550_AtPERK1 | 1 |
| AT3G24550_AtPERK1 | AT4G34440_AtPERK5 | 1 |
| AT4G08400_AtEXT11 | AT4G08410_AtEXT12 | 0.251 |
| AT5G07740_AtFH20 | AT5G07770_AtFH16 | 1 |
| AL1G56140.t1_AlPERK2 | AL3G32190.t1_AlPERK10 | 1.117 |
| AL1G56140.t1_AlPERK2 | AL7G16790.t1_AlPERK7 | 1.117 |
| AL3G32190.t1_AlPERK10 | AL3G51760.t1_AlPERK11 | 1.117 |
| AL3G32400.t1_AlLRX5 | AL3G47670.t1_AlLRX6 | 1.117 |
| AL3G40090.t1_AlPERK12 | AL3G51760.t1_AlPERK11 | 1.117 |
| AL3G47670.t1_AlLRX6 | AL7G17420.t1_AlLRX4 | 1.117 |
| AL5G18170.t1_AlEXT2 | AL5G18190.t1_AlEXT1 | 0.182 |
| AL7G35590.t1_AlLRX3 | AL9U10340.t1_AlLRX1 | 1.117 |


In sum, the OrthoMCL 2.0 program identified 167 pairs of potential orthologs: 51 pairs between A. lyrata and A. halleri, 58 pairs between A. lyrata and A. thaliana, and 58 pairs between A. halleri and A. thaliana. It also identified 38 pairs of potential paralogs: 8 pairs in A. lyrata, 12 pairs in A. halleri, and 18 pairs in A. thaliana.
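The per-species totals above can be re-derived from the pair lists in Tables 5.7 and 5.8. The following is a minimal Python sketch of such a tally, not the OrthoMCL procedure itself. It assumes each protein label ends in an underscore followed by a two-letter species tag (Ah, At, or Al), as the labels in the tables do; the function names `species_of` and `tally_pairs` are hypothetical, and only four rows copied from Table 5.8 are included for illustration.

```python
from collections import Counter

# Species tags as they appear after the underscore in the protein labels
# (an assumption based on the naming used in Tables 5.7 and 5.8).
SPECIES_TAGS = {"Ah": "A. halleri", "At": "A. thaliana", "Al": "A. lyrata"}


def species_of(label: str) -> str:
    """Return the species for a label such as 'AT1G12040_AtLRX1'."""
    tag = label.rsplit("_", 1)[1][:2]
    return SPECIES_TAGS[tag]


def tally_pairs(rows):
    """Count pairs per species combination; rows are (protein_a, protein_b, score)."""
    counts = Counter()
    for protein_a, protein_b, _score in rows:
        key = tuple(sorted((species_of(protein_a), species_of(protein_b))))
        counts[key] += 1
    return counts


# A few rows copied from Table 5.8 (paralogs), for illustration only.
example_rows = [
    ("g00545.1_AhLRX2", "g29325.1_AhLRX3", 1.0),
    ("AT1G12040_AtLRX1", "AT1G62440_AtLRX2", 1.0),
    ("AT4G08400_AtEXT11", "AT4G08410_AtEXT12", 0.251),
    ("AL5G18170.t1_AlEXT2", "AL5G18190.t1_AlEXT1", 0.182),
]
paralog_counts = tally_pairs(example_rows)
```

Feeding all rows of Table 5.8 to `tally_pairs` should reproduce the 8, 12, and 18 paralog pairs reported for A. lyrata, A. halleri, and A. thaliana, respectively.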

5.4 Discussion

In this project, bioinformatic identification and classification of EXTs were conducted in A. lyrata and A. halleri, two close relatives of A. thaliana. In A. lyrata, 61 EXTs were identified, including 15 classical EXTs, 10 short EXTs, 8 LRXs, 14 PERKs, 5 FHXs, 4 hybrid HRGPs, and 5 other types of chimeric EXTs. In A. halleri, 65 EXTs were identified, including 13 classical EXTs, 14 short EXTs, 8 LRXs, 14 PERKs, 6 FHXs, 4 hybrid HRGPs, and 6 other types of chimeric EXTs. The inventory of EXTs in A. thaliana was also updated using a more recent proteome data file, which revealed 69 EXTs encompassing 19 classical EXTs, 13 short EXTs, 11 LRXs, 12 PERKs, 6 FHXs, 4 hybrid HRGPs, and 4 other types of chimeric EXTs. Overall, the total numbers of EXTs were similar in all three species. While the numbers for each subclass varied slightly, all three species maintain all of the EXT subclasses identified to date. The analysis also allowed for more detailed comparisons of the EXTs among the three closely related species (Tables 5.3-5.4).

Among the three species, A. thaliana has the highest number of classical EXTs, while A. halleri has the fewest. Nevertheless, all three species are rich in classical EXTs among the plants studied to date. Our previous study indicated that SP4 was the most abundant repetitive motif in classical EXTs in eudicots (Liu et al. 2016). This notion is consistent with the findings in the Arabidopsis relatives, as SP4 repeats were predominant in most classical EXTs in A. lyrata (10/15) and in A. halleri (10/13). Classical EXTs can also be categorized based on the number of YXY crosslinking motifs, or IDT motifs. In A. thaliana, most classical EXTs were found to be IDT-rich (Cannon et al. 2008). This is also true for its close relatives. In A. lyrata, all but two classical EXTs (AlEXT11 and AlEXT14) are IDT-rich, and in A. halleri, 10 of the 13 classical EXTs are IDT-rich (the exceptions being AhEXT2, AhEXT3, and AhEXT10).

All three species have similar numbers of short EXTs. In A. thaliana, several short EXTs were found to contain GPI membrane anchor addition sequences, a feature that is commonly seen in AGPs but is rare in other subclasses of EXTs (Youl et al. 1998; Seifert and Roberts 2007; Liu et al. 2016). In A. lyrata, four of the ten short EXTs were found to have such a feature (AlEXT21s-24s). Interestingly, no short EXT with such a feature was identified in A. halleri. The exact glycosylation patterns and functions of short EXTs with this feature remain to be studied.

A. lyrata and A. halleri have slightly fewer LRXs than A. thaliana, yet both “vegetative” LRXs and “reproductive” PEXs are identified in both species. The phylogenetic analysis revealed a likely PEX-exclusive clade, which contained AlLRX4/5/6 and AhLRX1/2/3/5 in addition to four AtPEXs. This notion was also supported by the cluster analysis, in which these 11 sequences were clustered in the same group (G2 in Table 5.6). The cluster analysis also supported the topology of the LRX phylogenetic tree in its vegetative clade, as the LRX-only groups (G5, G8, G14, G31) were precisely consistent with the four branches in the vegetative clade.

All three species have numerous PERKs, suggesting their important physiological roles in organ formation, hormone responses, and defense mechanisms (Haffani et al. 2006; Bai et al. 2009; Silva et al. 2002). The phylogenetic tree resolved 10 instances where orthologous proteins were identified in all three species; in 6 of the 10 cases, a PERK from A. lyrata is closer to its orthologous protein in A. halleri than to its counterpart in A. thaliana. This coincides with the evolutionary relationship at the species level. A similar trend is also observed in the phylogenetic tree of FHXs, where an A. lyrata FHX is found to be closer to an A. halleri FHX in all three cases of triple orthologous proteins. In addition, the tree topology is well supported by the cluster analysis.

Hybrid HRGPs were initially observed in A. thaliana, where four proteins were found to contain characteristic motifs of both AGPs and EXTs (Showalter et al. 2010). Four such hybrid HRGPs were also observed in A. lyrata and A. halleri. Interestingly, AlEXT33h and AhEXT37h also contain a number of PPV repeats, a motif characteristic of proline-rich proteins (Showalter et al. 2010).

Apart from the above subclasses of EXTs, other types of chimeric EXTs also exist. Chimeric EXTs with a root cap domain (PF06830.10) were identified in all three species (G17 in Table 5.6). AhEXT30i and AlEXT29i also contain the LTP_2 lipid transfer domain (PF14368.5) in addition to the EXT domain.

The presence of the full spectrum of EXTs in all three Arabidopsis species indicates the conservation of these subclasses from an evolutionary perspective and suggests the physiological and functional importance of these EXT subclasses. Classical EXTs with YXY motifs were prevalent in all three species; these cross-linking motifs are a signature feature of eudicot classical EXTs that play important roles in plant defense mechanisms (Merkouropoulos et al. 1999). Chimeric EXTs also play important roles in plant growth and development. For instance, LRXs were shown to be essential for normal root hair development and cell wall formation (Baumberger et al. 2001; 2003); PEXs are thought to be sexual recognition molecules in the pistil (Rubinstein et al. 1995); PERKs are response molecules to various stimuli, such as hormones, pathogens, and wounding (Silva et al. 2002; Bai et al. 2009); and formin homologs are suggested to be the linkers between cell wall components and the cytoskeleton (Blanchoin and Staiger 2010).

Among the three close relatives, A. lyrata and A. halleri are more closely related to each other than to A. thaliana, as revealed by phylogenetic analysis using plastidic matK sequences, nuclear CHS sequences, and whole-chloroplast genome sequence data (Mitchell-Olds 2001; Koch et al. 2001; Hohmann et al. 2015). However, this species-level relationship is not clearly reflected by the protein family level analysis of the EXTs. In our phylogenetic analysis, EXTs from A. halleri and A. lyrata were clustered together in some cases, but in other cases, A. thaliana EXTs were found to be closer to their counterparts in either A. lyrata or A. halleri. Indeed, it is not uncommon for a gene tree to disagree with a species tree (Pamilo and Nei 1988; Doyle 1992; Maddison 1997). Some biological processes, such as gene duplication, gene loss, horizontal gene transfer, and multi-species coalescence, can lead to distinct topologies between gene/protein trees and species trees (Yang and Warnow 2011; Mirarab et al. 2014).

In this project, the identification and classification of EXTs were conducted in two close relatives of the widely studied model plant A. thaliana. The identified EXTs in the Arabidopsis model plant system allowed for a comparative analysis within the three Arabidopsis species, providing insight into the evolution of the EXT family in closely related species. EXTs from the three species within the same genus maintained many similarities, which was not seen in a previous study of EXTs across different genera or families (Liu et al. 2016). For instance, classical EXTs with YXY crosslinking motifs were identified in all three species, and other EXT subfamilies are also conserved in all three species. The total numbers of EXTs were similar, and the numbers in each subclass varied only slightly. In the future, it would be interesting to add more close relatives of A. thaliana, such as A. arenosa, to our analysis when the genome data of these species become available; further studies should also focus on the AGP and PRP families to investigate whether the level of similarity observed for EXTs is also maintained in the other two HRGP families. Nonetheless, this project provides a starting point for the study of HRGPs in three model plants within the same genus, which is important for answering fundamental evolutionary questions that cannot be addressed by studying any single species alone.


CHAPTER 6. VERIFICATION OF THE BIOINFORMATIC METHODS

To examine the effectiveness of our bioinformatic methods for HRGP identification, a literature and database search was performed to find manuscripts reporting identified and experimentally verified HRGPs. Based on the sequence information, we tested whether the methods used in BIO OHIO 2.0 can identify these true HRGPs.

6.1 Verification with previously reported AGPs

A total of 107 AGPs that had previously been identified and verified through wet lab experiments were found, including 36 classical AGPs, 13 AG peptides, 46 FLAs, four PAGs, two HAEs, and six other types of chimeric AGPs (Table 6.1). Of these 107 reported AGPs, 100 were identified by our program. Specifically, all the classical AGPs were identified by the AGP Biased Amino Acid module, which looks for at least 50% PAST, with the exception of one AGP that has 48% PAST. For the AG peptides, all 13 sequences were identified by the AG Peptide module, which searches for protein lengths of 50-90 amino acids and at least 35% PAST. For the FLAs, 34 of the 46 sequences were directly identified by the Fasciclin module; the remaining 12 sequences were not identifiable by the Fasciclin module because they lack the core fasciclin consensus sequences, but they can be identified through a BLAST search within the respective genomes. The four PAGs can only be identified through BLAST searches against their respective genomes, as no functional module was designed specifically to find PAGs. Likewise, the six reported chimeric AGPs were not identifiable because none of them has at least 50% PAST. Lastly, the two HAEs were not identified by any of the AGP modules but were identified by the Extensin Motif module and were later determined to be HAEs.
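The two composition-based thresholds described above can be illustrated with a short Python sketch. This is not the BIO OHIO 2.0 source code; it is a minimal illustration that assumes the PAST percentage is computed over the full input sequence, and the function names `residue_fraction` and `agp_modules_hit` are hypothetical. The toy sequences are synthetic and are not real proteins.

```python
def residue_fraction(sequence: str, residues: str) -> float:
    """Fraction of positions in `sequence` occupied by any residue in `residues`."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return sum(seq.count(r) for r in residues) / len(seq)


def agp_modules_hit(sequence: str) -> list:
    """Return the names of the composition-based AGP tests the sequence passes."""
    past = residue_fraction(sequence, "PAST")
    hits = []
    # AGP Biased Amino Acid test: at least 50% P, A, S, and T combined.
    if past >= 0.50:
        hits.append("AGP Biased AA")
    # AG Peptide test: 50-90 amino acids long and at least 35% PAST.
    if 50 <= len(sequence) <= 90 and past >= 0.35:
        hits.append("AG Peptide")
    return hits


# Synthetic toy sequences (assumptions for illustration only).
long_rich = "PAST" * 30             # 120 aa, 100% PAST
short_moderate = "PA" * 12 + "G" * 36  # 60 aa, 40% PAST
long_borderline = "P" * 48 + "G" * 52  # 100 aa, 48% PAST
```

Under these assumptions, a 100-residue sequence with 48% PAST passes neither test, which mirrors the one classical AGP that the AGP Biased Amino Acid module missed.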

Table 6.1 List of historically reported AGPs in the literature and their related information. Name Species Cultivar/Ssp Seq Repeated Motif Known Gene Subfamily AA PAST BIO OHIO 2.0 Reference Type or Peptide Seq % Identifiable (Test Modules) NaAGP1 Nicotiana genotype s6s6 cDNA AP; TP; SP AAA66362 Classical 132 64% Yes (AGP Du et al. alata Biased AA) 1994 NaAGP4 Nicotiana Link et Otto cDNA AP; TP; SP; PA; AAG24616.1 Classical 228 63% Yes (AGP Gilson et al. alata (s6s6) VP Biased AA) 2001 BcMF8 Brassica chinensis cDNA AP; TP; SP; GP ABP57237 Classical 136 58% Yes (AGP Huang et al. campestris Biased AA) 2008 LeAGP1 Lycopersicon Pixie Hybrid cDNA AP; TP; SP; PA CAA67584.1 Classical 215 60% Yes (AGP Li and esculentum II Biased AA) Showalter 1996 OsAGP1 Oryza sativa japonica cDNA AP; TP; SP; PA LOC_Os08g3 Classical 146 71% Yes (AGP Mashiguchi 7630 Biased AA) et al. 2004 CsAGP1 Cucumis Spacemaster cDNA AP; TP; VP; PA; BAC75541 Classical 243 67% Yes (AGP Park et al. sativus 80 SP Biased AA) 2003 AtAGP1 Arabidopsis Columbia cDNA AP; SP At5g64310 Classical 131 59% Yes (AGP Pereira et al. thaliana Biased AA) 2006 AtAGP2 Arabidopsis Columbia cDNA AP; TP; SP; GP At2g22470 Classical 131 63% Yes (AGP Pereira et al. thaliana Biased AA) 2006 AtAGP3 Arabidopsis Columbia cDNA AP; TP; SP; GP At4g40090 Classical 139 48% No Pereira et al. thaliana 2006 AtAGP4 Arabidopsis Columbia cDNA AP; TP; SP; PA At5g10430 Classical 135 72% Yes (AGP Pereira et al. thaliana Biased AA) 2006 AtAGP5 Arabidopsis Columbia cDNA AP; TP; SP At1g35230 Classical 133 63% Yes (AGP Pereira et al. thaliana Biased AA) 2006 AtAGP6 Arabidopsis Columbia cDNA AP; SP; GP At5g14380 Classical 150 64% Yes (AGP Pereira et al. thaliana Biased AA) 2006 AtAGP7 Arabidopsis Columbia cDNA AP; TP; SP; PA At5g65390 Classical 130 66% Yes (AGP Pereira et al. thaliana Biased AA) 2006


Table 6.1: continued AtAGP9 Arabidopsis Columbia cDNA AP; SP; TP; PA At2g14890 Classical 191 71% Yes (AGP Pereira et al. thaliana Biased AA) 2006 AtAGP10 Arabidopsis Columbia cDNA AP; TP; SP At4g09030 Classical 127 65% Yes (AGP Pereira et al. thaliana Biased AA) 2006 AtAGP11 Arabidopsis Columbia cDNA AP; SP; TP; GP At3g01700 Classical 136 61% Yes (AGP Pereira et al. thaliana Biased AA) 2006 AtAGP25 Arabidopsis Columbia cDNA SP At5g18690 Classical 116 50% Yes (AGP Pereira et al. thaliana Biased AA) 2006 AtAGP26 Arabidopsis Columbia cDNA AP; SP; TP At2g47930 Classical 136 54% Yes (AGP Pereira et al. thaliana Biased AA) 2006 AtAGP27 Arabidopsis Columbia cDNA AP; SP; PA At3g06360 Classical 125 53% Yes (AGP Pereira et al. thaliana Biased AA) 2006 AtAGP1 Arabidopsis Columbia-0 cDNA AP; SP At5g64310 Classical 131 59% Yes (AGP Schultz et al. thaliana strain CS1092 Biased AA) 2000 AtAGP2 Arabidopsis Columbia-0 cDNA AP; SP; GP; TP At2g22470 Classical 126 64% Yes (AGP Schultz et al. thaliana strain CS1092 Biased AA) 2000 AtAGP3 Arabidopsis Columbia-0 cDNA AP; SP; GP; TP At4g40090 Classical 139 61% Yes (AGP Schultz et al. thaliana strain CS1092 Biased AA) 2000 AtAGP4 Arabidopsis Columbia-0 cDNA AP; SP; GP; At5g10430 Classical 135 72% Yes (AGP Schultz et al. thaliana strain CS1092 TP; PA Biased AA) 2000 AtAGP5 Arabidopsis Columbia-0 cDNA AP; TP; SP At1g35230 Classical 133 63% Yes (AGP Schultz et al. thaliana strain CS1092 Biased AA) 2000 AtAGP6 Arabidopsis Columbia-0 cDNA AP; SP; GP At5g14380 Classical 150 64% Yes (AGP Schultz et al. thaliana strain CS1092 Biased AA) 2000 AtAGP7 Arabidopsis Columbia-0 cDNA AP; SP At5g65390 Classical 130 66% Yes (AGP Schultz et al. thaliana strain CS1092 Biased AA) 2000 AtAGP9 Arabidopsis Columbia-0 cDNA AP; SP; TP; PA At2g14890 Classical 191 71% Yes (AGP Schultz et al. thaliana strain CS1092 Biased AA) 2000 AtAGP10 Arabidopsis Columbia-0 cDNA AP; SP; GP; TP At4g09030 Classical 127 65% Yes (AGP Schultz et al. 
thaliana strain CS1092 Biased AA) 2000 201

Table 6.1: continued AtAGP11 Arabidopsis Columbia-0 cDNA AP; SP; GP; At3g01700 Classical 135 61% Yes (AGP Biased Schultz et al. thaliana strain CS1092 TP AA) 2000 AtAGP17 Arabidopsis Columbia-0 cDNA AP; SP; TP; At2g18770 Classical 185 56% Yes (AGP Biased Sun et al. thaliana GP AA) 2005 AtAGP18 Arabidopsis Columbia-0 cDNA AP; SP; TP; At4g42670 Classical 209 61% Yes (AGP Biased Sun et al. thaliana VP; GP AA) 2005 AtAGP19 Arabidopsis Columbia-0 cDNA AP; SP; TP; At1g62790 Classical 249 68% Yes (AGP Biased Sun et al. thaliana PA AA) 2005 GhH6L Gossypium Xuzhou142 cDNA AP; SP; TP; ACU55134 Classical 214 72% Yes (AGP Biased Wu et al. hirsutum and Cocker PA AA) 2009 312 PtaAGP6 Pinus taeda Genotype 7-56 cDNA AP; SP; TP AAF75821 Classical 236 61% Yes (AGP Biased Zhang et al. AA) 2000 EpAGP E. purpurea Moench Protein NA NA Classical NA 51% Yes (AGP Biased Classen et al. AA) 2000 CWAGP Red grapes NA Protein NA NA Classical NA 59% Yes (AGP Biased Doco1 and AA) Williams 2013 OSIAGP Oryza sativa indica. Pusa cDNA AP AAQ15184.1 AG 59 54% Yes (AG Peptide) Anand and Basmati 1 Peptide Tyagi 2010 OsAGPEP1 Oryza sativa japonica cDNA AP ABX83034.1 AG 68 57% Yes (AG Peptide) Mashiguchi et Peptide al. 2004 OsAGPEP2 Oryza sativa japonica cDNA AP ABX83035.1 AG 61 55% Yes (AG Peptide) Mashiguchi et Peptide al. 2004 OsAGPEP3 Oryza sativa japonica cDNA AP ABX83036.1 AG 63 52% Yes (AG Peptide) Mashiguchi et Peptide al. 2004 AtAGP21 Arabidopsis Columbia-0 cDNA AP; GP; AG At1g49900 AG 58 46% Yes (AG Peptide) Matsui et al. thaliana Peptide 2012 AtAGP12 Arabidopsis Columbia-0 cDNA AP At3g13520 AG 60 43% Yes (AG Peptide) Schultz et al. thaliana strain CS1092 Peptide 2000 AtAGP13 Arabidopsis Columbia-0 cDNA AP At4g26320 AG 59 47% Yes (AG Peptide) Schultz et al. thaliana strain CS1092 Peptide 2000 202

Table 6.1: continued AtAGP14 Arabidopsis Columbia-0 cDNA AP At5g56540 AG Peptide 60 41% Yes (AG Peptide) Schultz et al. thaliana strain CS1092 2000 AtAGP15 Arabidopsis Columbia-0 cDNA AP At5g11740 AG Peptide 61 50% Yes (AG Peptide) Schultz et al. thaliana strain CS1092 2000 AtAGP16 Arabidopsis Columbia-0 cDNA AP At2g46330 AG Peptide 73 41% Yes (AG Peptide) Schultz et al. thaliana strain CS1092 2000 MdAGP1 Malus Fuji Protein TP; AP; PA AAQ03094.1 AG Peptide 75 48% Yes (AG Peptide) Choi et al. 2010 domestica MdAGP2 Malus Fuji Protein TP; AP; PA AAQ03095.1 AG Peptide 88 58% Yes (AG Peptide) Choi et al. 2010 domestica MdAGP3 Malus Fuji Protein TP; AP; PA AAQ03096.1 AG Peptide 75 49% Yes (AG Peptide) Choi et al. 2010 domestica AGP12 Arabidopsis Columbia-0 Protein AP; PA At3g13520 AG Peptide 60 43% Yes (AG Peptide) Schultz et al. thaliana strain CS1092 2004 AGP13 Arabidopsis Columbia-0 Protein AP; PA At4g26320 AG Peptide 59 47% Yes (AG Peptide) Schultz et al. thaliana strain CS1092 2004 AGP14 Arabidopsis Columbia-0 Protein AP At5g56540 AG Peptide 60 42% Yes (AG Peptide) Schultz et al. thaliana strain CS1092 2004 AGP15 Arabidopsis Columbia-0 Protein AP At5g11740 AG Peptide 61 51% Yes (AG Peptide) Schultz et al. thaliana strain CS1092 2004 AGP16 Arabidopsis Columbia-0 Protein AP; PA At2g46330 AG Peptide 73 41% Yes (AG Peptide) Schultz et al. thaliana strain CS1092 2004 AGP21 Arabidopsis Columbia-0 Protein AP; PA At1g55330 AG Peptide 58 47% Yes (AG Peptide) Schultz et al. thaliana strain CS1092 2004 AGP22 Arabidopsis Columbia-0 Protein AP; PA At5g53250 AG Peptide 63 38% Yes (AG Peptide) Schultz et al. thaliana strain CS1092 2004 AGP24 Arabidopsis Columbia-0 Protein AP; PA At5g40730 AG Peptide 69 41% Yes (AG Peptide) Schultz et al. thaliana strain CS1092 2004 ZeFLA11 Zinnia Envy cDNA AP; TP; GP CAI99883 FLA 252 36% Yes (Fasciclin and Dahiya et al. elegans BLAST) 2006 GhFLA1 Gossypium Coker312 cDNA AP ABV27472.1 FLA 243 37% Yes (Fasciclin and Huang et al. 
hirsutum BLAST) 2008b 203

Table 6.1: continued GhFLA10 Gossypium Coker312 cDNA AP; TP ABV27481.1 FLA 414 32% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA11 Gossypium Coker312 cDNA AP; TP; SP ABV27482.1 FLA 417 32% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA12 Gossypium Coker312 cDNA AP; TP; SP; ABV27483.1 FLA 415 40% Yes (Fasciclin and Huang et al. 2008b hirsutum VP BLAST) GhFLA13 Gossypium Coker312 cDNA AP; TP; SP ABV27484.1 FLA 425 40% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA14 Gossypium Coker312 cDNA AP; GP ABV27485.1 FLA 459 27% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA15 Gossypium Coker312 cDNA AP; GP ABV27486.1 FLA 460 27% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA16 Gossypium Coker312 cDNA AP; GP ABV27487.1 FLA 457 27% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA17 Gossypium Coker312 cDNA AP; SP ABV27488.1 FLA 285 38% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA18 Gossypium Coker312 cDNA SP ABV27489.1 FLA 276 32% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA19 Gossypium Coker312 cDNA AP; SP ABV27490.1 FLA 398 30% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA2 Gossypium Coker312 cDNA AP; SP; TP ABV27473.1 FLA 265 40% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA3 Gossypium Coker312 cDNA AP; SP; TP ABV27474.1 FLA 263 39% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA4 Gossypium Coker312 cDNA AP; SP ABV27475.1 FLA 244 34% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA5 Gossypium Coker312 cDNA AP ABV27476.1 FLA 239 34% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA6 Gossypium Coker312 cDNA AP ABV27477.1 FLA 241 31% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) 204

Table 6.1: continued GhFLA7 Gossypium Coker312 cDNA AP; SP ABV27478.1 FLA 262 37% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA8 Gossypium Coker312 cDNA SP; VP; PA ABV27479.1 FLA 424 35% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhFLA9 Gossypium Coker312 cDNA SP; VP; AP ABV27480.1 FLA 436 36% Yes (Fasciclin and Huang et al. 2008b hirsutum BLAST) GhAGP1 Gossypium Xuzhou 142 cDNA AP AAO92753 FLA 243 37% Yes (Fasciclin and Ji et al. 2003 hirsutum BLAST) AtFLA1 Arabidopsis Columbia-0 RNA AP; VP; SP; At5g55730 FLA 424 33% Yes (Fasciclin and Johnson et al. 2003 thaliana GP BLAST) AtFLA10 Arabidopsis Columbia-0 RNA AP; SP; TP At3g60900 FLA 422 41% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA11 Arabidopsis Columbia-0 RNA AP; GP At5g03170 FLA 246 36% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA12 Arabidopsis Columbia-0 RNA AP At5g60490 FLA 249 35% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA13 Arabidopsis Columbia-0 RNA AP; SP; GP At5g44130 FLA 247 30% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA14 Arabidopsis Columbia-0 RNA AP; SP; PA At3g12660 FLA 255 35% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA15 Arabidopsis Columbia-0 RNA AP; GP; SP At3g52370 FLA 436 28% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA16 Arabidopsis Columbia-0 RNA AP; SP; GP; At2g35860 FLA 445 28% Yes (Fasciclin and Johnson et al. 2003 thaliana VP BLAST) AtFLA17 Arabidopsis Columbia-0 RNA AP; GP; PA At5g06390 FLA 458 26% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA18 Arabidopsis Columbia-0 RNA AP; GP At3g11700 FLA 462 25% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA19 Arabidopsis Columbia-0 RNA AP; SP; VP At1g15190 FLA 248 33% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST)


Table 6.1: continued AtFLA2 Arabidopsis Columbia-0 RNA AP; SP At4g12730 FLA 403 31% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA20 Arabidopsis Columbia-0 RNA AP; SP At5g40940 FLA 424 29% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA21 Arabidopsis Columbia-0 RNA SP; TP; VP At5g06920 FLA 353 32% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA3 Arabidopsis Columbia-0 RNA AP; SP; TP At2g24450 FLA 280 38% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA4 Arabidopsis Columbia-0 RNA SP; VP; PA At3g46550 FLA 420 37% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA5 Arabidopsis Columbia-0 RNA AP; TP; SP; At4g31370 FLA 278 37% Yes (Fasciclin and Johnson et al. 2003 thaliana VP BLAST) AtFLA6 Arabidopsis Columbia-0 RNA AP; SP At2g20520 FLA 247 34% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA7 Arabidopsis Columbia-0 RNA AP; SP; VP; At2g04780 FLA 254 39% Yes (Fasciclin and Johnson et al. 2003 thaliana GP BLAST) AtFLA8 Arabidopsis Columbia-0 RNA AP; SP; TP At2g45470 FLA 420 43% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) AtFLA9 Arabidopsis Columbia-0 RNA AP; SP; PA At1g03870 FLA 247 31% Yes (Fasciclin and Johnson et al. 2003 thaliana BLAST) EniFLA1 Eucalyptus Columbia-0 cDNA AP; TP; VP; ADE62307 FLA 265 38% Yes (Fasciclin and MacMillan et al. 2010 nitens CS60000 PA BLAST) EniFLA2 Eucalyptus Columbia-0 cDNA AP; SP; PA ADE62308 FLA 260 34% Yes (Fasciclin and MacMillan et al. 2010 nitens CS60000 BLAST) EniFLA3 Eucalyptus Columbia-0 cDNA AP; VP ADE62309 FLA 248 39% Yes (Fasciclin and MacMillan et al. 2010 nitens CS60000 BLAST) AtAGP8 Arabidopsis Columbia-0 cDNA AP; SP; TP At2g45470 FLA 323 44% Yes (Fasciclin and Schultz et al. 2000 thaliana CS1092 BLAST) VfENOD5 Vicia faba Kleine cDNA AP; SP CAB97367 PAG 137 34% Yes (BLAST) Fru¨hling et al. 2000 Thu¨ringer


Table 6.1: continued NtEPc Nicotiana Samsun cDNA SP BAA75495 PAG 166 28% Yes (BLAST) Kyo et al. tabacum 2000 OsENODL1 Oryza sativa japonica cDNA AP; GP; PA ABX83038.1 PAG 237 35% Yes (BLAST) Mashiguchi et al. 2004 GhPLA1 Gossypium Coker 315 cDNA AP; SP; GP ABI96983 PAG 175 27% Yes (BLAST) Poon et al. hirsutum 2012 DcAGP1 Daucus NA cDNA AP; TP; SP CAC16734 Chimeric 242 40% No Baldwin et carota al. 2001 NaAGP3 Nicotiana NA cDNA AP AAB41479 Chimeric 169 27% No Du et al. alata 1996 AtAGP31 Arabidopsis Columbia-0 cDNA AP; PA; SP; TP At1g28290 Chimeric 359 43% No Hijazi et al. thaliana 2012 NaAGP2 Nicotiana NA cDNA AP AAB35284 Chimeric 461 25% No Mau et al. alata 1995 PhPRP1 Petunia TidalWave cDNA AP; SP; TP; PA ACN60130 Chimeric 256 40% No Twomey et hybrida Silver al. 2013 PtaAGP3 Loblolly NA Protein AP; PA; SP AA556993 Chimeric 161 43% No Loopstra et pines al. 2000 Na120KD Nicotiana Link et Otto cDNA AP; SP; GP; AAC15893 HAE 461 48% Yes (Extensin Schultz et al. alata TP; PA Motif) 1997 Acacia NA Protein SOOO(O/T/S)L NA HAE NA NA Yes (Extensin Goodrum et senegal SOSOTOTOO Motif) al. 2000 (O/L)GPH


6.2 Verification with previously reported EXTs

A total of 54 EXTs were found in the literature, of which 51 were identified by the Extensin Motif module in BIO OHIO 2.0, which looks for at least two SPPP repeats in a given protein sequence (Table 6.2). The three exceptions are: (1) one Hordeum vulgare protein that has repeated APP, SPP, KPP, and TPP motifs and is likely an AGP based on its sequence features; (2) one Beta vulgaris partial protein with repeated SPP motifs; and (3) one Zea mays chimeric EXT with only one SPPPP.
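The motif criterion described above can be sketched with a regular expression. The following Python example is an illustration only and does not reproduce the BIO OHIO 2.0 implementation. It assumes that a longer run such as SPPPP or SPPPPP counts as a single repeat, which is consistent with the Zea mays protein above being missed because it has only one SPPPP; the function names are hypothetical, and the test strings are short fragments rather than full proteins.

```python
import re

# One serine followed by three or more prolines; a run such as SPPPP
# is counted as a single repeat (an assumption of this sketch).
SPPP_REPEAT = re.compile(r"SP{3,}")


def count_sppp_repeats(sequence: str) -> int:
    """Count non-overlapping SPPP-type repeats in a protein sequence."""
    return len(SPPP_REPEAT.findall(sequence.upper()))


def passes_extensin_motif(sequence: str, min_repeats: int = 2) -> bool:
    """True when the sequence has at least `min_repeats` SPPP-type repeats."""
    return count_sppp_repeats(sequence) >= min_repeats


# Fragment modeled on a repeat listed in Table 6.2 (SPPPPSPSPPPPYYYK).
two_repeats = "SPPPPSPSPPPPYYYK"
# Synthetic fragments for the failing cases.
one_repeat = "AAASPPPPAAA"
spp_only = "SPPKPPTPPAPP"
```

With these definitions, `passes_extensin_motif(two_repeats)` is true, while the fragments with a single SPPPP or with only SPP-type motifs fail the test.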

Given that we have identified numerous EXTs in a number of plant species in our bioinformatics research projects, it would be interesting to check how well the genome sequencing results match the previously reported EXTs. Among the reported EXTs, 32 sequences were from plant species in which we have conducted EXT identification and analysis. Fifteen previously reported EXTs found a perfect or similar match in the predicted proteome generated by genome sequencing projects. For the remaining reported EXTs with no similar EXT in the proteome data file, the most likely reason is a difference in the plant material used; specifically, the cultivars or subspecies used in the earlier reports differ from those used for genome sequencing.


Table 6.2 List of historically reported EXTs in the literature and their related information. Name Species Cultivar/Ssp Seq Repeated Motif Known Gene Subfamily AA BIO OHIO 2.0 Reference Type or Peptide Seq Identifiable (EXT motif) AtExt1 Arabidopsis landsberg erecta Protein SPPP; SPPPP; At1g76930 Classical EXT 373 Yes Merkouropou thaliana YXYK; VYK los et al. 1999 AtExt2 Arabidopsis Columbia cDNA SPPP At3g54590 Classical EXT 300 Yes Yoshiba et al. thaliana 2001 AtExt3 Arabidopsis Columbia cDNA SPPP At1g21310 Classical EXT 325 Yes Yoshiba et al. thaliana 2001 AtExt4 Arabidopsis Columbia cDNA SPPP At1g76930 Classical EXT 246 Yes Yoshiba et al. thaliana 2001 AtExt5 Arabidopsis Columbia cDNA SPPP At1g21310 Classical EXT 203 Yes Yoshiba et al. thaliana 2001 Hyp3.6 Phaseolus Kievitsboon cDNA YYYKSPPPPSP NA Classical EXT 163 Yes Corbin et al. vulgaris Koekoek SPPPP 1987 Hyp2.13 Phaseolus Kievitsboon cDNA YYYKSPPPPSP AAA33765.1 Classical EXT 368 Yes Corbin et al. vulgaris Koekoek SPPPP 1987 Hyp4.1 Phaseolus Kievitsboon cDNA YYYKSPPPPSP AAA33764.1 Classical EXT 230 Yes Corbin et al. vulgaris Koekoek SPPPP 1987 pDC5A1 NA cDNA SPPPP P06599.1 Classical EXT 306 Yes Chen and Varner 1985a SP2 Pseudotsuga NA Protein SPPPP NA Classical EXT NA Yes Fong et al. menziesii 1992 SbHRGP-1 Glycine max Wayne cDNA SPPPPSPSPPPP AAA33972.1 Classical EXT 118 Yes Hong et al. YVYK 1994 SbHRGP-2 Glycine max Wayne cDNA SPPPPSPSPPPP AAA33971.1 Classical EXT 169 Yes Hong et al. YYYK/H 1994 SbHRGP-3 Glycine max Wayne cDNA SPPPPYKYP, AAA33970.1 Classical EXT 199 Yes Hong et al. SPPPPPYKYP, 1994 SPPPPVYKYK


Table 6.2: continued SbHRGP3 Glycine max Paldal cDNA SPPPPKHSPPPPYY AAB53156.1 Classical 432 Yes Ahn et al. 1996 YH; EXT SPPPPVYKYKSPPP PYKYPSPPPPPYKY PSPPPPVYKYK NaPRP3 Nicotiana S1S3, S2S3, cDNA SPPPP AAB23813.1 Classical 151 Yes Chen et al. 1992 alata S6S6, S6S7 EXT 6PExt 1.2 Nicotiana Samsun; cDNA SPPPP, YXY CAA49808.1 Classical 131 Yes Parmentier et al. sylvestris Xanthi nc EXT 1995 Ext 1.2A Nicotiana Petit Havana cDNA SPPP CAC14888.1 Classical 311 Yes Guzzardi et al. 2004 sylvestris SR1 EXT Ext 1.4 Nicotiana Samsun; Genomic SPPPP AAB63341.1 Classical 224 Yes Hirsinger et al. 1997 tabacum Xanthi nc clone EXT NA Pisum Frisson cDNA SPPPP; SPPPPP AAK77899.1 Classical NA Yes Rathbun et al. 2002 sativum EXT NA Prunus Texas cDNA PYHYK, SPPPP, CAA46634.1 Classical 278 Yes Garcia-Mas et al. amygdalus SPSPPKH EXT 1992 Ext26G Vigna Red Caloona cDNA SPPPP CAA62943.1 Classical 489 Yes Arsenijević- unguiculata EXT Maksimović et al. 1997 Potato Solanom unpublished Protein SPPPP NA Classical NA Yes Kieliszewski and lectin tuberosum data EXT Lamport 1994 P1 Solanum Bonnie Best Protein SPPPPTPVYK; NA Classical NA Yes Smith et al. 1986 lycopersicum SPPPPVKPYHPTPV EXT YK P2 Solanum Bonnie Best Protein YK; SPPPPVYK; NA Classical NA Yes Smith et al. 1986 lycopersicum SPPPPVYKYK EXT P3 Solanum Bonnie Best Protein SPPPPSPSPPPPYYY NA Classical NA Yes Epstein and Lamport lycopersicum K EXT 1984 class I Solanum UC82B cDNA SPPPPSPSPPPPYYY S14983; Classical NA Yes Showalter et al. 1991 lycopersicum K S14981 EXT class II Solanum UC82B cDNA SPPPPSPSPPPPTY CAA39216.1 Classical NA Yes Showalter et al. 1991 lycopersicum EXT 210

Table 6.2: continued class IV Solanum UC82B cDNA SPPPPSPSPPPPYYY NA Classical NA Yes Showalter et al. lycopersicum K EXT 1991 Tom J-10 Solanum UC82B genomic SPPPPSPKYVYK; AAA34163.1 Classical 388 Yes Zhou et al. 1992 lycopersicum clone SPPPPYYYKSPPPPS EXT P Tom L-4 Solanum UC82B genomic SPPPP; AAA34164.1 Classical 322 Yes Zhou et al. 1992 lycopersicum clone TPSYEHPKTP; EXT SSPPPPSPSPPPPTY PTEL15 Solanum Cyprus cDNA SPPP; SPPPP CAA79930.1 Classical 291 Yes Bown et al. 1993 tuberosum EXT pCNT1 Tobacco Petit Havana' cDNA SPPPP; PYYPPH; CAA50603.1 Classical 318 Yes Memelink et al. SR1 TPVYTK EXT 1993 ISG Volvox carteri strains HK 10 cDNA SPPP; SPPPP; CAA46283.1 Classical 464 Yes Ertl et al. 1992 and 69-lb SPPPPP EXT LRX1 Arabidopsis Columbia cDNA SPPPP At1g12040 LRX 744 Yes Baumberger et al. thaliana 2001 NpLRX1 Nicotiana NA cDNA SPPP NA LRX 725 Yes Chida et al. 2007 plumbaginifolia tPex Solanum Moneymaker cDNA SPPP Solyc01g108900 LRX 711 Yes Bucher et al. 1997 lycopersicum PEX1 Zea mays inbred lines cDNA SPPP; SPPPP; 2111476A LRX 1188 Yes Rubinstein et al. Ky2l and B73 SPPPPP 1995 mPex2 Zea mays NA genomic SPPP; SPPPP AAD55980.2 LRX 1016 Yes Stratford et al. clone 2001 PERK1 Brassica napus Westar and cDNA SPPP Q9LV48.1 PERK 652 Yes Silva and Goring W1 cultivars 2002 NA Tobacco Petit Havana' cDNA SPPPP NP_001312455. Chimeric 426 Yes Goldman et al. SR1 1 EXT 1992 LSG1 Volvox carteri female strain cDNA SPPP Vocar20001045 Chimeric 415 Yes Shimizu et al. 2002 HK10 m EXT


Table 6.2: continued LSG2 Volvox carteri female strain cDNA SPPP BAB85218.1 Chimeric 625 Yes Shimizu et al. 2002 HK10 EXT S9 Volvox carteri female strain cDNA SPPP BAB85219.1 Chimeric 568 Yes Shimizu et al. 2002 HK10 EXT The class Chlamydomo strain CC-620 cDNA SPPP AAB23258.2 HAE 202 Yes Woessner and IV gene nas Goodenough 1989 reínhardtíí Gum arabic Acacia NA Protein SPPPTLSPSPTP NA HAE NA Yes Goodrum et al. 2000 senegal TPPLGPH Dif10 Solanum Moneymaker cDNA SPPP Solyc02g030220 HAE 396 Yes Bucher et al. 1997 lycopersicum Dif54 Solanum Moneymaker cDNA SPPP Solyc01g005880 HAE 438 Yes Bucher et al. 1997 lycopersicum A sugar Beta vulgaris line SP-6926- Protein SPP(VHE/KYP) NA HAE NA Yes Li et al. 1990 beet EXT O PPTPVYKP THRGP Zea mays Black Protein TPSPKPPTPKP 1707318A HAE 251 Yes Kieliszewski et al. Mexican TPPTY; 1990 TPSPPPPY pDC11 Daucus NA cDNA KPP; TPP; AAA33137.1 HEP 43 Yes Chen and Varner carota SPPPP; PPV 1985b MC56 Zea mays W64A and cDNA PPTYTP; SPPPP 1604465A Potential 267 No Stiefel et al. 1988 E41 EXT Hvex1 Hordeum Bomi cDNA APP; SPP; KPP; CAB10894.1 Potential 330 No Sturaro et al. 1998 vulgar TPP AGP NA Beta vulgaris NA Protein SP2 NA Potential 222 No Nuñez et al. 2009 AGP NA hybrid of L. NA Protein SPPPP NA NA NA Yes Brownleader and Dey esculentum 1993 and L. peruvianum


6.3 Verification with previously reported PRPs

A total of 51 PRP sequences were found through a literature search (Table 6.3). Of these, BIO OHIO 2.0 identified 45 using the PRP Biased Amino Acid module, which looks for sequences composed of 45% or more PVKCYT. A motif search for sequences with at least two repeats of KKPCPP or PPVX[K/T] returned 17 sequences, all of which were also identified by the PRP biased amino acid composition search. Six sequences were not identified; they lack the targeted repeated motifs, and their PVKCYT composition is below 45%.
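The two PRP tests described above can be illustrated with a minimal Python sketch. This is not the BIO OHIO 2.0 source code: the function names, the toy sequence, and the choice to count non-overlapping motif matches over the full-length sequence are assumptions made for illustration. Only the residue set (PVKCYT), the 45% threshold, the two motifs, and the two-repeat minimum come from the text.

```python
import re

# Residues counted by the biased amino acid test (from the text).
PRP_RESIDUES = set("PVKCYT")
# KKPCPP is a literal motif; in PPVX[K/T], X is taken to mean any residue.
PRP_MOTIF = re.compile(r"KKPCPP|PPV.[KT]")


def pvkcyt_fraction(seq):
    """Return the fraction of residues in seq that are P, V, K, C, Y, or T."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(aa in PRP_RESIDUES for aa in seq) / len(seq)


def count_prp_motifs(seq):
    """Count non-overlapping KKPCPP or PPVX[K/T] matches (an assumption)."""
    return len(PRP_MOTIF.findall(seq.upper()))


def prp_tests(seq, min_fraction=0.45, min_repeats=2):
    """Report which of the two PRP tests a sequence passes."""
    return {
        "biased_aa": pvkcyt_fraction(seq) >= min_fraction,
        "motif": count_prp_motifs(seq) >= min_repeats,
    }


# Hypothetical toy sequence, not a real protein.
toy = "MAS" + "PPVYKPPVEK" * 3 + "GLLA"
print(prp_tests(toy))
```

Keeping the two tests separate mirrors the way Table 6.3 reports which module identified each sequence.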


Table 6.3 List of historically reported PRPs in the literature and their related information.

Name | Species | Cultivar/Ssp | Seq Type | Repeated Motif | Known Gene or Peptide Seq | Subfamily | AA | BIO OHIO 2.0 Identifiable (Test Modules) | Reference
30-kD hook protein | Glycine max | Merr. Williams 82 | Protein | PPVYK; PPVEK | NA | PRP | NA | Yes (PRP Biased AA) | Francisco and Tierney 1990
A2ENOD2 | Medicago sativa | iroquois | cDNA | PPHEK; PPEYQ | CAA31092.1 | PRP | 97 | Yes (PRP Biased AA) | Dickstein et al. 1988
Ac1503 | Zea mays | NA | cDNA | TPPTYTP; KPP | M36635 | PRP | 328 | Yes (PRP Biased AA) | Raz et al. 1992
AtPRP1 | Arabidopsis thaliana | Columbia | cDNA | PPVX(K/T) | At1g54970 | PRP | 335 | Yes (PRP Biased AA and PRP Motif) | Fowler et al. 1999
AtPRP2 | Arabidopsis thaliana | Columbia | cDNA | PIYKPPV; KKPCPP | At2g21140 | PRP | 320 | Yes (PRP Biased AA and PRP Motif) | Fowler et al. 1999
AtPRP3 | Arabidopsis thaliana | Columbia | cDNA | PPVX(K/T) | At3g62680 | PRP | 313 | Yes (PRP Biased AA and PRP Motif) | Fowler et al. 1999
AtPRP4 | Arabidopsis thaliana | Columbia | cDNA | PPPKIEHPP PVPVYK KKPCPP | At4g38770 | PRP | 448 | Yes (PRP Biased AA and PRP Motif) | Fowler et al. 1999
CanPRP | Cicer arietinum | Castellana | cDNA | PPVEK; PPVYK | CAA66038.1 | PRP | 227 | Yes (PRP Biased AA and PRP Motif) | Munoz et al. 1998
GhPRP5 | Gossypium hirsutum | Coker 312 | cDNA | PPK | ABM05953.1 | PRP | 182 | Yes (PRP Biased AA) | Xu et al. 2007
GhPRP6 | Gossypium hirsutum | Coker 312 | cDNA | PPK; KPP; PPV | ABM05954.1 | PRP | 258 | Yes (PRP Biased AA and PRP Motif) | Xu et al. 2007
HyPRP | Zea mays | W64A | cDNA | PPTPRPS; PPYV | CAA42959.1 | PRP | 301 | Yes (PRP Biased AA and PRP Motif) | José-Estanyol et al. 1992
HyPRP | Cuscuta reflexa | NA | cDNA | PPV; PPHXP; PPXYK (X,Y=any AA) | AAA33132.1 | PRP | 329 | Yes (PRP Biased AA and PRP Motif) | Subramaniam et al. 1994

Table 6.3: continued

JX564851 | Lens culinaris | NA | cDNA | PPVXK | JX564851 | PRP | 72 | Yes (PRP Biased AA and PRP Motif) | Singh et al. 2014
LlPRP2 | Lupinus luteus | Ventus | cDNA | PPHEK | AAB68048.1 | PRP | 894 | Yes (PRP Biased AA) | Karlowski et al. 2000
Maize HRGP | Zea mays | inbreds W64A | cDNA | KPP; PPK | CAA10387.1 | PRP | 328 | Yes (PRP Biased AA) | Stiefel et al. 1990
mrip1 | Vitis vinifera | Merlot | cDNA | PEHKP | AAQ84302.1 | PRP | 268 | Yes (PRP Biased AA) | Burger et al. 2004
MsPRP2 | Medicago sativa | Apollo | Protein | PPV; PPIV | Q40336 | PRP | 381 | Yes (PRP Biased AA and PRP Motif) | Deutch and Winicov 1995
MsPRP5 | Medicago sativa | Regen S | cDNA | TPVLPPR(K/R)GRPPPVPP; PPV | CAA67553.1 | PRP | 84 | Yes (PRP Biased AA) | Györgyey et al. 1997
MtN4 | Medicago truncatula | jemalong | cDNA | PPV; PPI; KPP | Y15372 | PRP | 249 | Yes (PRP Biased AA) | Gamas et al. 1996
NaPRP4 | Nicotiana alata | NA | cDNA | KP; PTKPPTYSPSKPP | CAA49895.1 | PRP | 255 | Yes (PRP Biased AA) | Chen et al. 1993
OsPRP | Oryza sativa | Yukihikari | cDNA | PEPK | AAL73214.1 | PRP | 358 | Yes (PRP Biased AA) | Akiyama and Pillai 2003
OsRePRP1.1 | Oryza sativa | Tainung 67 | cDNA | P(E/K)P(E/K)P | BAF16879.1 | PRP | 358 | Yes (PRP Biased AA) | Tseng et al. 2013
OsRePRP1.2 | Oryza sativa | Tainung 67 | cDNA | P(E/K/)P(E/K)P | BAF16881.1 | PRP | 416 | Yes (PRP Biased AA) | Tseng et al. 2013
OsRePRP2.1 | Oryza sativa | Tainung 67 | cDNA | P(E/K/Q)PXP | BAF21387.1 | PRP | 247 | Yes (PRP Biased AA) | Tseng et al. 2013
OsRePRP2.2 | Oryza sativa | Tainung 67 | cDNA | P(E/K/Q)PXP | BAF21386.1 | PRP | 247 | Yes (PRP Biased AA) | Tseng et al. 2013
p33 | Daucus carota | NA | cDNA | PPVHK; PPVYT | 1108300B | PRP | 211 | Yes (PRP Biased AA and PRP Motif) | Chen and Varner 1985b
p60 | Cicer arietinum | ILC 3279 | cDNA | EKPO; NPO; HPO | NA | PRP | NA | No | Otte and Barz 2000
pENOD2 | Glycine max | Merr. Williams | cDNA | PPHEKPP | P08297.2 | PRP | 309 | Yes (PRP Biased AA) | Franssen et al. 1987


Table 6.3: continued

PIS_30 | Brassica napus | Westar flower | cDNA | PP(L/F/A)P | AAT75332.1 | PRP | 103 | No | Foster et al. 2005
pLN70 | Lupinus luteus | NA | cDNA | PPPHEKPP; PPV | Q06841.1 | PRP | 434 | Yes (PRP Biased AA) | Szczyglowski and Legocki 1990
pPsENOD2 | Pisum sativum | Sparkle | cDNA | PPHEK; PPEYQ | CAA36245.1 | PRP | 112 | Yes (PRP Biased AA) | van de Wiel et al. 1990
pTIP13 | Asparagus officinalis | Limbras 10 | cDNA | PPV; PPT | CAA57810.1 | PRP | 184 | Yes (PRP Biased AA) | King et al. 1996
PvPRP1 | Phaseolus vulgaris | NA | cDNA | PPV | CAA42942.1 | PRP | 297 | Yes (PRP Biased AA) | Sheng et al. 1991
PVR5 | Phaseolus vulgaris | ChungJu Jaelae | cDNA | KPP; PPK | AAC49369.1 | PRP | 127 | Yes (PRP Biased AA) | Choi et al. 1996
RPRP3 | Glycine max | Wayne | cDNA | PPVEK; PPVYK | P13993.1 | PRP | 230 | Yes (PRP Biased AA and PRP Motif) | Datta and Marcus 1990
SaPRP | Santalum album | NA | cDNA | SPSP(P/S)PTPP | NA | PRP | 326 | Yes (PRP Biased AA) | Bhattacharya and Sita 1998
SbPRP1 | Glycine max | Wayne | cDNA | PPVYK | AAA66287.1 | PRP | 256 | Yes (PRP Biased AA and PRP Motif) | Hong et al. 1987
SbPRP2 | Glycine max | Wayne | cDNA | PPV(Y/E)K | AAA34011.1 | PRP | 230 | Yes (PRP Biased AA and PRP Motif) | Hong et al. 1990
SbPRP3 | Glycine max | Wayne | cDNA | PPVYK; PPYKK | AAA34012.1 | PRP | 90 | Yes (PRP Biased AA and PRP Motif) | Hong et al. 1990
sorghum | Sorghum vulgare | NA | cDNA | PPVYT; PPVTKPPT; TPP; KPP; PPV | X56010 | PRP | 283 | Yes (PRP Biased AA and PRP Motif) | Raz et al. 1992
StGCPRP | Solanum tuberosum | Désirée | cDNA | PPV; PLPP; KPPCPP; KPKPKP | CAA04449.1 | PRP | 491 | Yes (PRP Biased AA and PRP Motif) | Menke et al. 2000
teosinte | Zea diploperennis | NA | cDNA | KPPTPKPTP; TPPTYTP | X64173 | PRP | 350 | Yes (PRP Biased AA) | Raz et al. 1992


Table 6.3: continued

VvPRP1 | Vitis vinifera | Arka Neelamani | cDNA | KPP; PPV | AY046416 | PRP | 189 | Yes (PRP Biased AA) | Thomas et al. 2003
VvPRP2 | Vitis vinifera | Arka Neelamani | cDNA | PEHKP; KPP | AY046417 | PRP | 193 | Yes (PRP Biased AA) | Thomas et al. 2004
W22 | Zea mays | NA | cDNA | KPTPPTYTP; KPPTPKP | X63134 | PRP | 303 | Yes (PRP Biased AA) | Raz et al. 1992
WPRP1 | Triticum aestivum | NA | cDNA | PEPK; PEPMPK; PMPK; MPKPEPKPEPKPEP | CAA36712.1 | PRP | 378 | Yes (PRP Biased AA) | Raines et al. 1991
PHRGP | Pseudotsuga menziesii | Franco | Protein | PPV | NA | PRP | NA | Yes (PRP Motif) | Kieliszewski et al. 1992
GhPRP3 | Gossypium hirsutum | Coker 312 | cDNA | PVPL; PIPL; PLPP | NA | PR Peptide | 111 | No | Xu et al. 2007
GhPRPL | Gossypium hirsutum | Coker 312 | cDNA | PPV; KPP | NA | Chimeric PRP | 164 | No | Xu et al. 2007
OsPRP1 | Oryza sativa | japonica | cDNA | PKPE; P(V/E)PPK | BAB84823.1 | Chimeric PRP | 224 | No | Wang et al. 2006
GhPRP4 | Gossypium hirsutum | Coker 312 | cDNA | PPP(F/M/L) | NA | Chimeric AGP | 237 | No | Xu et al. 2007


CHAPTER 7. CONCLUSIONS AND FUTURE WORK

7.1 Conclusions

Hydroxyproline-rich glycoproteins (HRGPs) are a superfamily of plant cell wall proteins that are implicated in various aspects of plant growth and development. This superfamily consists of three main members: arabinogalactan-proteins (AGPs), extensins (EXTs), and proline-rich proteins (PRPs); however, chimeric HRGPs and hybrid HRGPs also exist. To better understand the distribution of HRGPs in plants and assist in their bioinformatic identification, the BIO OHIO program was revised, resulting in version 2.0 with enhanced features and improved functionality. This program is freely available at https://github.com/showalte/Bio-Ohio-Public/releases, and a step-by-step tutorial on how to use it was also provided to facilitate and guide basic and applied research on HRGPs.

Based on this program, several research projects were conducted to enhance our knowledge of HRGPs in the plant kingdom and to better understand their evolution. In the first place, bioinformatic identification of EXTs was conducted in representative plant species. A total of 758 EXTs were identified in 16 species, including 87 classical EXTs, 97 short EXTs, 61 LRXs, 75 PERKs, 54 FHXs, 38 long chimeric EXTs, and 346 other types of chimeric EXTs. The data indicated that classical EXTs were likely derived after the terrestrialization of plants; LRXs, PERKs, and FHXs were derived earlier than classical EXTs; monocots have fewer classical EXTs than eudicots; and green algae have no classical EXTs but have a number of long chimeric EXTs that are absent in embryophytes. Phylogenetic analysis was conducted, which shed light on the evolution of LRXs, PERKs, and FHXs.

In the second place, bioinformatic identification of the HRGP family was conducted in poplar (Populus trichocarpa), which identified and classified 271 HRGPs, including 162 AGPs, 60 EXTs, and 49 PRPs, each divided into various subfamilies. Compared with a previous analysis of the Arabidopsis proteome, which identified 162 HRGPs (85 AGPs, 59 EXTs, and 18 PRPs), poplar was observed to have fewer classical EXTs, more fasciclin-like AGPs, plastocyanin AGPs and AG peptides, and PR peptides.

In the third place, the bioinformatic identification and analysis of EXTs were conducted in Arabidopsis lyrata and Arabidopsis halleri, two close relatives of Arabidopsis thaliana. In A. lyrata, 61 EXTs were identified, including 15 classical EXTs, 10 short EXTs, 8 LRXs, 14 PERKs, 5 FHXs, 4 hybrid HRGPs, and 5 other chimeric EXTs. In A. halleri, 65 EXTs were identified, including 13 classical EXTs, 14 short EXTs, 8 LRXs, 14 PERKs, 6 FHXs, 4 hybrid HRGPs, and 6 other chimeric EXTs. For comparison, 69 EXTs were identified in A. thaliana, including 19 classical EXTs, 13 short EXTs, 11 LRXs, 12 PERKs, 6 FHXs, 4 hybrid HRGPs, and 4 other chimeric EXTs. While the total numbers and subclasses of EXTs were similar in all three species, subtle differences also exist. Phylogenetic trees and cluster analysis revealed numerous potential orthologous and paralogous proteins among the three species. The identified EXTs and their homologous proteins provide insight into the evolution and functions of EXTs in related species within the same genus.

Finally, to validate the bioinformatic methods used here for the identification and classification of plant HRGPs, we tested our program on previously reported HRGPs. Our program identified 100 out of 107 reported AGPs, 51 out of 54 reported EXTs, and 45 out of 51 reported PRPs. Together, our program successfully identified 92% (196/212) of the previously reported HRGPs.

In sum, this research project expanded our knowledge of the distribution of EXTs in the plant kingdom and their evolutionary relationships; enhanced our understanding of the HRGPs in the model tree species poplar through a comparison with the genetic model plant Arabidopsis; and provided an example of studying a model plant system of three closely related Arabidopsis species, in which diverse genetic resources and biological data make it possible to address fundamental evolutionary questions that cannot be addressed in a single species. The HRGPs identified in this research project provide insight into the evolution of the HRGP family in the plant kingdom and allow for the examination of their respective structural and functional roles, including their possible applications in plant biofuel and natural products for medicinal or industrial uses.

7.2 Discussion

Currently, many genome sequencing projects provide large-scale annotations for their predicted proteins. While these annotations assign functional categories to the predicted proteins, in our practice we found that many of these assignments lack accuracy, at least those relating to the HRGP families. As elaborated above, HRGP family members have different protein structures and functions, so distinguishing them is certainly beneficial for biologists studying these cell wall proteins. Yet in many cases only a tag of “hypothetical hydroxyproline-rich glycoproteins” was provided, and in many other cases only “hypothetical protein” or “expressed protein” was provided. In comparison, our program provides more accurate annotations of the HRGP family members than the genome sequencing projects do, which will provide guidance for more directed functional verification. For example, a recent study found that AtEXT18 is required for full male fertility and normal vegetative growth (Choudhary et al. 2015), and another study indicated a possible linkage between poplar EXTs and biomass recalcitrance (Fleming et al. 2016).

The program does have its limitations. While an HRGP test usually finishes within minutes, it often takes much more human effort to classify some of the candidate proteins, as HRGPs exist as a spectrum of proteins in which the boundaries between family members are not always clear-cut.

Furthermore, the quality of the program’s output depends on the quality of the input proteome data files. In our practice, we observed that the quality of the input proteome data file drastically affected the number of identified HRGPs and in turn altered our conclusions about the number of HRGPs in a species. For instance, in our project identifying EXTs in three Arabidopsis species, we initially obtained a proteome file of A. halleri (Ahalleri_264_v1.1.protein.fa) from the Phytozome website v11.0 (Arabidopsis halleri v1.1, DOE-JGI, http://phytozome.jgi.doe.gov/). Using this data file, we identified only 18 EXTs, including 3 classical EXTs, 5 short EXTs, 3 LRXs, 2 PERKs, 1 FHX, 3 other chimeric EXTs, and 1 hybrid AGP-EXT (unpublished data). These numbers were in great contrast with those in A. thaliana and A. lyrata, making it tempting to conclude that A. halleri had many fewer EXTs than its two relatives. However, we found that the A. halleri proteome data file was in its initial release and had not been published in a peer-reviewed journal. Fortunately, we found another A. halleri file whose data had been published. Using the new input data file, we identified far more EXTs in A. halleri and thus corrected our conclusions.

In addition, the glycosylation and cross-linking motifs of HRGPs are critical for their functions. In other words, these functional motifs ultimately define the function of an HRGP, yet the current program is unable to predict these important post-translational modifications (PTMs).

7.3 Future work

7.3.1 Refinement of the current bioinformatic methods

While our program successfully identified the vast majority of reported HRGPs, there is still room for improvement. In the first place, the methods for identifying certain groups of HRGPs can be refined. For example, six of the previously reported PRPs in the literature were not identified by our current criteria. A closer look at these six sequences revealed that they have slightly different repeated motifs, such as KPP, LPP, PPK, PPL, and PPF. Adding these search motifs to the PRP Motif module will increase the coverage for candidate PRPs. In addition, more tailored methods for specific subclasses of HRGPs should be developed. This applies mostly to chimeric HRGPs, such as LRXs, PERKs, FHXs, and PAGs. Currently, these chimeric HRGPs can only be identified through the BLAST module. It would be helpful to develop specific modules for these groups. Lastly, the search criteria can be refined to eliminate as many false positives as possible. Ma et al. (2017) recently took an approach that looks for “localized” AGP motifs in addition to global motifs, making the identification of AGPs more efficient. Similar approaches can be adopted for EXTs and PRPs.
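One way a localized search could look for PRPs is sketched below in Python: instead of scoring PVKCYT content over the whole protein, the score is taken from the richest fixed-length window. This is an illustrative assumption, not the method of Ma et al. (2017) and not part of BIO OHIO 2.0; the function name, the 30-residue window, and the toy sequence are all hypothetical.

```python
def max_window_fraction(seq, residues="PVKCYT", window=30):
    """Highest fraction of `residues` found in any window of length `window`.

    Sequences shorter than the window are scored as a whole.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    wanted = set(residues)
    w = min(window, len(seq))
    hits = [aa in wanted for aa in seq]
    count = sum(hits[:w])          # hits in the first window
    best = count
    for i in range(w, len(seq)):   # slide the window one residue at a time
        count += hits[i] - hits[i - w]
        best = max(best, count)
    return best / w


# Hypothetical chimeric-style sequence: a PRP-like block inside a long
# stretch that contains none of the scored residues.
toy = "A" * 60 + "PPVYK" * 6 + "A" * 60
global_fraction = sum(aa in set("PVKCYT") for aa in toy) / len(toy)
print(global_fraction, max_window_fraction(toy))
```

In this toy case the global composition (20%) falls below a 45% cutoff while the richest window is entirely PVKCYT, which is the kind of chimeric candidate a global test would miss.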

7.3.2 Envisioning BIO OHIO 3.0

While BIO OHIO 2.0 was revised with enhanced features and functional modules over its predecessor, additional improvements can be made. For example, several functional modules can be incorporated into the program, such as modules for expression data, Pfam domain searches, and searches for homologous proteins. These additional modules will provide more comprehensive information on candidate HRGPs to allow for easier classification.

Moreover, the BLAST module can be revised for more flexibility. Currently, each candidate sequence is subjected to a BLAST search against the Arabidopsis proteome, and Arabidopsis HRGPs in the BLAST results above a certain threshold are shown. In our experience, it is also beneficial to run a BLAST analysis of each candidate against the proteome of its own species, as this may reveal additional candidate HRGPs, especially for certain groups of chimeric HRGPs. In addition, while we provided a Python script (gpi_signalp_formatter.py) that highlights predicted SignalP sequences and GPI anchor addition sequences, it would be useful to develop a module for automated highlighting of characteristic HRGP motifs and other user-defined motifs to visually assist in HRGP classification.
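A motif-highlighting module of the kind proposed here could be prototyped along the following lines. The sketch is independent of gpi_signalp_formatter.py and of BIO OHIO 2.0; the function name, the bracket markers, the regular expressions chosen for the motifs, and the toy sequence are assumptions for illustration.

```python
import re


def highlight_motifs(seq, motifs, open_tag="[", close_tag="]"):
    """Wrap every region of seq covered by any motif pattern in tags.

    `motifs` is a list of regular expressions supplied by the user.
    Overlapping or adjacent matches are merged into one highlighted region.
    """
    seq = seq.upper()
    covered = [False] * len(seq)
    for pattern in motifs:
        for match in re.finditer(pattern, seq):
            for i in range(match.start(), match.end()):
                covered[i] = True
    out = []
    inside = False
    for aa, flag in zip(seq, covered):
        if flag and not inside:
            out.append(open_tag)
        elif not flag and inside:
            out.append(close_tag)
        inside = flag
        out.append(aa)
    if inside:
        out.append(close_tag)
    return "".join(out)


# SP{3,} covers SPPP, SPPPP, and longer runs; the other two patterns are the
# PRP motifs PPVX[K/T] and KKPCPP. The sequence is a hypothetical toy example.
motifs = [r"SP{3,}", r"PPV.[KT]", r"KKPCPP"]
print(highlight_motifs("MKSPPPPYVYKAPPVYKGG", motifs))
# prints MK[SPPPP]YVYKA[PPVYK]GG
```

Passing the motifs as a plain list keeps user-defined patterns on the same footing as the built-in ones; the tags could be swapped for HTML or terminal color codes.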

7.3.3 Automation of the identification and classification process

Human intervention in the identification and classification process could be reduced or eliminated by applying supervised machine learning techniques to our inventory of HRGPs already identified through various methods. This would enhance identification efficiency and avoid human errors. It would also allow for the identification of HRGPs on an even larger scale, perhaps by analyzing all plant species whose genomes are sequenced.
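As a rough illustration of what a supervised approach involves, the Python sketch below trains a nearest-centroid classifier on amino acid composition vectors. It is a deliberately simple stand-in chosen for this example, not a method proposed or evaluated in this work; the training sequences and labels are hypothetical toy data, and a real classifier would need curated HRGP sets, richer features (for example motif counts and domain hits), and proper validation.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the 20-dimensional amino acid composition vector of seq."""
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq) or 1
    return [counts[aa] / n for aa in AMINO_ACIDS]


def train_centroids(labeled):
    """Average the composition vectors of each class.

    `labeled` is an iterable of (sequence, label) pairs.
    """
    sums, sizes = {}, Counter()
    for seq, label in labeled:
        acc = sums.setdefault(label, [0.0] * len(AMINO_ACIDS))
        for i, value in enumerate(composition(seq)):
            acc[i] += value
        sizes[label] += 1
    return {lab: [v / sizes[lab] for v in acc] for lab, acc in sums.items()}


def classify(seq, centroids):
    """Assign seq to the class whose centroid is nearest (Euclidean)."""
    vec = composition(seq)
    return min(centroids, key=lambda lab: math.dist(vec, centroids[lab]))


# Hypothetical toy training set, not real proteins.
training = [
    ("SPPPPYVYKSPPPPKHYSPPPP", "EXT"),
    ("SPPPPVYKYKSPPPPSPPPPYY", "EXT"),
    ("PPVYKPPVEKPPVYKPPVEK", "PRP"),
    ("KKPCPPPPVHKPPVYTPPVEK", "PRP"),
    ("MKLLAGEDNQRWFIHGALLDE", "other"),
    ("MAGLLEDNQRSWFIHAGEDLL", "other"),
]
centroids = train_centroids(training)
print(classify("PPVYKPPVYKPPVEKPPVYT", centroids))
print(classify("SPPPPYYSPPPPKSPPPP", centroids))
```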

7.3.4 Predicting HRGP glycosylation

Glycosylation is thought to be critical for HRGPs to carry out their functions. For example, glycosylation has been shown to stabilize HRGP secondary structures and facilitate secretion of HRGPs into the plant cell wall (Xu et al. 2008; Velasquez et al. 2011). Therefore, successful prediction of HRGP glycosylation patterns is important in elucidating functions of individual HRGPs.

The glycosylation of HRGPs mostly occurs on Pro residues via a two-step process. First, Pro residues are hydroxylated to Hyp by Pro-4-hydroxylases (P4Hs) (Hieta and Myllyharju 2002; Koski et al. 2007; Koski et al. 2009). It was observed that not all Pro residues are hydroxylated: PV is always hydroxylated; KP, YP, and FP are never hydroxylated; and only some GP are hydroxylated (Kieliszewski, 2001). Second, Hyp residues are O-glycosylated with arabino-oligosaccharides or AG polysaccharides by the action of various glycosyltransferases (GTs). The rules of Hyp glycosylation were summarized in the Hyp contiguity hypothesis, which states that contiguous Hyp residues are glycosylated with arabino-oligosaccharides, while clustered, noncontiguous Hyp residues are the sites of AG polysaccharide modification (Kieliszewski and Lamport, 1994; Kieliszewski and Shpak, 2001).
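The Hyp contiguity hypothesis as stated above can be written as a small rule-based sketch in Python. It is a simplification under stated assumptions rather than a working predictor: Hyp positions are taken as already known and written as O in the input (the hydroxylation step is not modeled), every noncontiguous Hyp is labeled as a candidate AG site without checking that it is clustered with other Hyp residues, and the function name and toy peptide are hypothetical.

```python
import re


def classify_hyp_sites(seq):
    """Label each Hyp ('O') by the Hyp contiguity hypothesis.

    Runs of two or more contiguous Hyp residues are labeled as
    arabino-oligosaccharide sites; single (noncontiguous) Hyp residues
    are labeled as candidate AG polysaccharide sites.
    Returns a list of (1-based position, label) tuples.
    """
    sites = []
    for run in re.finditer(r"O+", seq.upper()):
        if len(run.group()) >= 2:
            label = "arabino-oligosaccharide"
        else:
            label = "AG polysaccharide"
        for pos in range(run.start(), run.end()):
            sites.append((pos + 1, label))
    return sites


# Hypothetical toy peptide: a contiguous SOOOO block followed by
# noncontiguous Hyp residues.
for position, label in classify_hyp_sites("SOOOOKASOSOAO"):
    print(position, label)
```

For the toy peptide, positions 2 to 5 are labeled as arabino-oligosaccharide sites and positions 9, 11, and 13 as candidate AG polysaccharide sites.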

While the patterns of Pro hydroxylation and Hyp glycosylation were found to be consistent with predictions in many cases, exceptions to these rules were also identified, indicating that the patterns may vary among species, organs, or tissues. As more HRGPs are isolated, the rules of Pro hydroxylation and Hyp glycosylation can be refined. This will make it possible to design and implement another bioinformatic program for the prediction of Pro hydroxylation and subsequent glycosylation, which will benefit the cell wall research community.

7.3.5 Deep understanding of plant HRGPs

The goal of developing new tools and applying new techniques is to gain a deeper understanding of HRGPs in terms of their distribution, evolution, and functions in plants. Next-generation sequencing techniques have made available the genomic data of more than 60 plant species, and more plant genomic data continue to be released. This creates a wonderful opportunity to study the plant HRGP superfamily in greater detail and to analyze it from an evolutionary point of view, which may provide insight into the functions of HRGPs.


REFERENCES

Ahn JH, Choi Y, Kwon YM, Kim SG, Choi YD, Lee JS. A novel extensin gene encoding a hydroxyproline-rich glycoprotein requires sucrose for its wound-inducible expression in transgenic plants. Plant Cell. 1996; 8(9):1477-90.

Akiyama T, Pillai MA. Isolation and characterization of a gene for a repetitive proline rich protein (OsPRP) down-regulated during submergence in rice (Oryza sativa). Physiol Plant. 2003; 118(4):507-13.

Albersheim P, Darvill A, Roberts K, Sederoff R, Staehelin A. Plant cell walls: from chemistry to biology. Ann Bot. 2011; 108(1): viii–ix.

Anand S, Tyagi AK. Characterization of a pollen-preferential gene OSIAGP from rice (Oryza sativa L. subspecies indica) coding for an arabinogalactan protein homologue, and analysis of its promoter activity during pollen development and growth. Transgenic Res. 2010; 19(3):385-97.

Arsenijević-Maksimović L, Broughton WJ, Krause A. Rhizobia modulate root-hair-specific expression of extensin genes. Mol Plant Microbe Interact. 1997; 10(1):95-101.

Bai L, Zhang G, Zhou Y, Zhang Z, Wang W, Du Y, et al. Plasma membrane-associated proline-rich extensin-like receptor kinase 4, a novel regulator of Ca2+ signaling, is required for abscisic acid responses in Arabidopsis thaliana. Plant J. 2009; 60(2):314-27.

Baldwin TC, Domingo C, Schindler T, Seetharaman G, Stacey N, Roberts K. DcAGP1, a secreted arabinogalactan protein, is related to a family of basic proline-rich proteins. Plant Mol Biol. 2001; 45(4):421-35.

Banks JA, Nishiyama T, Hasebe M, Bowman JL, Gribskov M, dePamphilis C, et al. The Selaginella genome identifies genetic changes associated with the evolution of vascular plants. Science. 2011; 332(6032):960-3.

Basu D, Liang Y, Liu X, Himmeldirk K, Faik A, Kieliszewski M, et al. Functional identification of a hydroxyproline-O-galactosyltransferase specific for arabinogalactan protein biosynthesis in Arabidopsis. J Biol Chem. 2013; 288(14):10132-43.

Batoko H, Zheng H, Hawes C, Moore I. A Rab1 GTPase is required for transport between the endoplasmic reticulum and Golgi apparatus and for normal Golgi movement in plants. Plant Cell. 2000; 12(11):2201-17.

Baumberger N, Doesseger B, Guyot R, Diet A, Parsons RL, Clark MA, et al. Whole-genome comparison of leucine rich repeat extensins in Arabidopsis and rice: a conserved family of cell wall proteins form a vegetative and a reproductive clade. Plant Physiol. 2003; 131(3):1313–26.

Baumberger N, Ringli C, Keller B. The chimeric leucine-rich repeat/extensin cell wall protein LRX1 is required for root hair morphogenesis in Arabidopsis thaliana. Genes Dev. 2001; 15(9):1128-39.

Baumberger N, Steiner M, Ryser U, Keller B, Ringli C. Synergistic interaction of the two paralogous Arabidopsis genes LRX1 and LRX2 in cell wall formation during root hair development. Plant J. 2003; 35(1):71–81.

Bhattacharya A, Sita GL. cDNA cloning and characterization of a proline (or hydroxyproline)-rich protein from Santalum album L. Curr Sci. 1998; 75(7):697-701.

Blanchoin L, Staiger CJ. Plant formins: diverse isoforms and unique molecular mechanism. Biochim Biophys Acta. 2010; 1803(2):201-206.

Borner GHH, Sherrier DJ, Weimar T, Michaelson LV, Hawkins ND, MacAskill A, et al. Analysis of detergent-resistant membranes in Arabidopsis. Evidence for plasma membrane lipid rafts. Plant Physiol. 2005; 137(1):104–16.

Bown DP, Bolwell P, Gatehouse JA. Characterisation of potato (Solanum tuberosum L.) extensins: a novel extensin-like cDNA from dormant tubers. Gene. 1993; 134(2):229-33.

Brady JD, Sadler IH, Fry SC. Di-isodityrosine, a novel tetrameric derivative of tyrosine in plant cell wall proteins: a new potential cross-link. Biochem J. 1996; 315(Pt 1):323-7.

Brady JD, Sadler IH, Fry SC. Pulcherosine, an oxidatively coupled trimer of tyrosine in plant cell walls: Its role in cross-link formation. Phytochemistry. 1998; 47(3):349-53.

Briskine RV, Paape T, Shimizu-Inatsugi R, Nishiyama T, Akama S, Sese J, et al. Genome assembly and annotation of Arabidopsis halleri, a model for heavy metal hyperaccumulation and evolutionary ecology. Mol Ecol Resour. 2016; doi: 10.1111/1755-0998.12604

Brownleader MD, Dey PM. Purification of extensin from cell walls of tomato (hybrid of Lycopersicon esculentum and L. peruvianum) cells in suspension culture. Planta. 1993; 191(4):457-69.

Bucher M, Schroeer B, Willmitzer L, Riesmeier JW. Two genes encoding extensin-like proteins are predominantly expressed in tomato root hair cells. Plant Mol Biol. 1997; 35(4):497-508.

Burger AL, Zwiegelaar JP, Botha FC. Characterisation of the gene encoding the Merlot ripening-induced protein 1 (mrip1): evidence that this putative protein is a distinct member of the plant proline-rich protein family. Plant Sci. 2004; 167:1075-89.

Cannon MC, Terneus K, Hall Q, Tan L, Wang Y, Wegenhart BL, et al. Self-assembly of the plant cell wall requires an extensin scaffold. Proc Natl Acad Sci U S A. 2008; 105(6):2226-31.

Cano-Delgado A, Penfield S, Smith C, Catley M, Bevan M. Reduced cellulose synthesis invokes lignification and defense responses in Arabidopsis thaliana. Plant J. 2003; 34(3):351-62.

Carpita N, Gibeaut D. Structural models of primary-cell walls in flowering plants - consistency of molecular-structure with the physical-properties of the walls during growth. Plant J. 1993; 3(1):1-30.

Cassab GI. Plant cell wall proteins. Annu Rev Plant Physiol Plant Mol Biol. 1998; 49:281-309.

Chalkia D, Nikolaidis N, Makalowski W, Klein J, Nei M. Origins and evolution of the formin multigene family that is involved in the formation of actin filaments. Mol Biol Evol. 2008; 25(12):2717-33.

Champion A, Kreis M, Mockaitis K, Picaud A, Henry Y. Arabidopsis kinome: after the casting. Funct Integr Genomics. 2004; 4(3):163–87.

Chen C, Cornish EC, Clarke AE. Specific expression of an extensin-like gene in the style of Nicotiana alata. Plant Cell. 1992; 4(9):1053-62.

Chen CG, Mau SL, Clarke AE. Nucleotide sequence and style-specific expression of a novel proline-rich protein gene from Nicotiana alata. Plant Mol Biol. 1993; 21:391-5.

Chen J, Varner JE. An extracellular matrix protein in plants: characterization of a genomic clone for carrot extensin. EMBO J. 1985a; 4(9): 2145–51.

Chen J, Varner JE. Isolation and characterization of cDNA clones for carrot extensin and a proline-rich 33-kDa protein. Proc Natl Acad Sci U S A. 1985b; 82(13):4399-403.

Chida H, Yazawa K, Hasezawa S, Iwai H, Satoh S. Involvement of a tobacco leucine-rich repeat-extensin in cell morphogenesis. Plant Biotechnol. 2007; 24(2):171–7.

Choi DW, Song JY, Kwon YM, Kim SG. Characterization of a cDNA encoding a proline-rich 14 kDa protein in developing cortical cells of the roots of bean (Phaseolus vulgaris) seedlings. Plant Mol Biol. 1996; 30(5):973-82.

Choi YO, Kim SS, Lee S, Kim S, Yoon GB, Kim H, et al. Isolation and promoter analysis of anther-specific genes encoding putative arabinogalactan proteins in Malus x domestica. Plant Cell Rep. 2010; 29(1):15-24.

Choudhary P, Saha P, Ray T, Tang Y, Yang D, Cannon MC. EXTENSIN18 is required for full male fertility as well as normal vegetative growth in Arabidopsis. Front Plant Sci. 2015; 6:553.

Classen B, Witthohn K, Blaschek W. Characterization of an arabinogalactan-protein isolated from pressed juice of Echinacea purpurea by precipitation with the beta-glucosyl Yariv reagent. Carbohydr Res. 2000; 327(4):497-504.

Clauss MJ, Koch MA. Poorly known relatives of Arabidopsis thaliana. Trends Plant Sci. 2006; 11(9):449-59.

Corbin DR, Sauer N, Lamb CJ. Differential regulation of a hydroxyproline-rich glycoprotein gene family in wounded and infected plants. Mol Cell Biol. 1987; 7(12): 4337-44.

Cvrčková F. Formins: emerging players in the dynamic plant cell cortex. Scientifica (Cairo). 2012; 2012: 712605.

Cvrčková F, Grunt M, Žárský V. Expression of GFP-mTalin reveals an actin related role for the Arabidopsis Class II formin AtFH12. Biologia Plantarum. 2012; 56(3):431-40.

Cvrčková F, Oulehlová D, Žárský V. Formins: linking cytoskeleton and endomembranes in plant cells. Int J Mol Sci. 2015; 16(1):1–18.

Dahiya P, Findlay K, Roberts K, McCann MC. A fasciclin-domain containing gene, ZeFLA11, is expressed exclusively in xylem elements that have reticulate wall thickenings in the stem vascular system of Zinnia elegans cv Envy. Planta. 2006; 223(6):1281-91.

Datta K, Marcus A. Nucleotide sequence of a gene encoding soybean repetitive proline-rich protein 3. Plant Mol Biol. 1990; 14(2):285-6.

Deeks MJ, Hussey PJ, Davies B. Formins: intermediates in signal-transduction cascades that affect cytoskeletal reorganization. Trends Plant Sci. 2002; 7(11):492–8.

Deutch CE, Winicov I. Post-transcriptional regulation of a salt-inducible alfalfa gene encoding a putative chimeric proline-rich cell wall protein. Plant Mol Biol. 1995; 27(2):411-8.

Dickstein R, Bisseling T, Reinhold VN, Ausubel FM. Expression of nodule-specific genes in alfalfa root nodules blocked at an early stage of development. Genes Dev. 1988; 2(6):677-87.

Doco T, Williams P. Purification and structural characterization of a type II arabinogalactan-protein from champagne wine. Am J Enol Vitic. 2013; 64:3.

Domozych DS, Sørensen I, Willats WGT. The distribution of cell wall polymers during antheridium development and spermatogenesis in the charophycean green alga, Chara corallina. Ann Bot. 2009; 104(6):1045-56.

Doyle JJ. Gene trees and species trees: molecular systematics as one-character taxonomy. Syst Bot. 1992; 17(1):144-163.

Draeger C, Ndinyanka Fabrice T, Gineau E, Mouille G, Kuhn BM, Moller I, et al. Arabidopsis leucine-rich repeat extensin (LRX) proteins modify cell wall composition and influence plant growth. BMC Plant Biol. 2015; 15:155.

Du H, Simpson RJ, Clarke AE, Bacic A. Molecular characterization of a stigma-specific gene encoding an arabinogalactan-protein (AGP) from Nicotiana alata. Plant J. 1996; 9(3):313-23.

Du H, Simpson RJ, Moritz RL, Clarke AE, Bacic A. Isolation of the protein backbone of an arabinogalactan-protein from the styles of Nicotiana alata and characterization of a corresponding cDNA. Plant Cell. 1994; 6(11):1643-53.

Dvoráková L, Srba M, Opatrny Z, Fischer L. Hybrid proline-rich proteins: novel players in plant cell elongation? Ann Bot. 2012; 109(2):453-462.

Ebringerova A, Hromadkova Z, Heinze T. Hemicellulose. In: Polysaccharides I: structure, characterization and use. 2005; 186:1-67.

Eder M, Tenhaken R, Driouich A, Lutz-Meindl U. Occurrence and characterization of arabinogalactan-like proteins and hemicelluloses in Micrasterias (Streptophyta). J Phycol. 2008; 44(5):1221-34.

Egelund J, Obel N, Ulvskov P, Geshi N, Pauly M, Bacic A, et al. Molecular characterization of two Arabidopsis thaliana glycosyltransferase mutants, rra1 and rra2, which have a reduced residual arabinose content in a polymer tightly associated with the cellulosic wall residue. Plant Mol Biol. 2007; 64(4): 439-49.

Eisenhaber B, Wildpaner M, Schultz CJ, Borner GHH, Dupree P, Eisenhaber F. Glycosylphosphatidylinositol lipid anchoring of plant proteins. Sensitive prediction from sequence- and genome-wide studies for Arabidopsis and rice. Plant Physiol. 2003; 133(4):1691-701.

Ellis M, Egelund J, Schultz CJ, Bacic A. Arabinogalactan-proteins: Key regulators at the cell surface? Plant Physiol. 2010; 153(2):403-19.

Epstein L, Lamport DTA. An intramolecular linkage involving isodityrosine in extensin. Phytochemistry. 1984; 23(6):1241–6.

Ertl H, Hallmann A, Wenzl S, Sumper M. A novel extensin that may organize extracellular matrix biogenesis in Volvox carteri. EMBO J. 1992; 11(6):2055-62.

Faik A, Abouzouhair J, Sarhan F. Putative fasciclin-like arabinogalactan-proteins (FLA) in wheat (Triticum aestivum) and rice (Oryza sativa): identification and bioinformatic analyses. Mol Genet Genomics. 2006; 276(5):478-94.

Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 2016; 44:D279-85.

Fleming MB, Decker SR, Bedinger PA. Investigating the role of extensin proteins in poplar biomass recalcitrance. BioResources. 2016; 11(2):4727-44.

Fong C, Kieliszewski MJ, Zacks R, Leykam JF, Lamport DTA. A gymnosperm extensin contains the serine-tetrahydroxyproline motif. Plant Physiol. 1992; 99(2):548-52.

Foster E, Lévesque-Lemay M, Schneiderman D, Albani D, Schernthaner J, Routly E, et al. Characterization of a gene highly expressed in the Brassica napus pistil that encodes a novel proline-rich protein. Sex Plant Reprod. 2005; 17:261-7.

Fowler TJ, Bernhardt C, Tierney ML. Characterization and expression of four proline-rich cell wall protein genes in Arabidopsis encoding two distinct subsets of multiple domain proteins. Plant Physiol. 1999; 121(4):1081-91.

Franssen HJ, Nap JP, Gloudemans T, Stiekema W, van Dam H, Govers F. Characterization of cDNA for nodulin-75 of soybean: A gene product involved in early stages of root nodule development. Proc Natl Acad Sci U S A. 1987; 84:4495-9.

Frühling M, Hohnjec N, Schröder G, Küster H, Pühler A, Perlick AM. Genomic organization and expression properties of the VfENOD5 gene from broad bean (Vicia faba L.). Plant Sci. 2000;155(2):169-78.

Gamas P, Niebel Fde C, Lescure N, Cullimore J. Use of a subtractive hybridization approach to identify new Medicago truncatula genes induced during root nodule development. Mol Plant Microbe Interact. 1996; 9(4):233-42.

Gao M, Kieliszewski M, Lamport D, Showalter A. Isolation, characterization and immunolocalization of a novel, modular tomato arabinogalactan-protein corresponding to the LeAGP-1 gene. Plant J. 1999; 18(1):43-55.

Gao M, Showalter A. Yariv reagent treatment induces programmed cell death in Arabidopsis cell cultures and implicates arabinogalactan protein involvement. Plant J. 1999; 19(3):321-31.

Garcia-Mas J, Messeguer R, Arús P, Puigdomènech P. The extensin from Prunus amygdalus. Plant Physiol. 1992; 100(3):1603-4.

Gille S, Hansel U, Ziemann M, Pauly M. Identification of plant cell wall mutants by means of a forward chemical genetic approach using hydrolases. Proc Natl Acad Sci U S A. 2009; 106(34): 14699-704.

Gilson P, Gaspar YM, Oxley D, Youl JJ, Bacic A. NaAGP4 is an arabinogalactan protein whose expression is suppressed by wounding and fungal infection in Nicotiana alata. Protoplasma. 2001; 215(1-4):128-39.

Goldman MH, Pezzotti M, Seurinck J, Mariani C. Developmental expression of tobacco pistil-specific genes encoding novel extensin-like proteins. Plant Cell. 1992; 4(9):1041-51.

Goodrum LJ, Patel A, Leykam JF, Kieliszewski MJ. Gum arabic glycoprotein contains glycomodules of both extensin and arabinogalactan-glycoproteins. Phytochemistry. 2000; 54(1):99-106.

Graham MA, Silverstein KAT, Cannon SB, VandenBosch KA. Computational identification and characterization of novel genes from legumes. Plant Physiol. 2004; 135(3):1179-97.

Guzzardi P, Genot G, Jamet E. The Nicotiana sylvestris extensin gene, Ext 1.2A, is expressed in the root transition zone and upon wounding. Biochim Biophys Acta. 2004; 1680(2):83-92.

Györgyey J, Németh K, Magyar Z, Kelemen Z, Alliotte T, Inzé D, et al. Expression of a novel-type small proline-rich protein gene of alfalfa is induced by 2,4-dichlorophenoxiacetic acid in dedifferentiated callus cells. Plant Mol Biol. 1997; 34(4):593-602.

Haffani YZ, Silva-Gagliardi NF, Sewter SK, Aldea MG, Zhao Z, Nakhamchik A, et al. Altered expression of perk receptor kinases in Arabidopsis leads to changes in growth and floral organ formation. Plant Signal Behav. 2006; 1(5):251–60.

Held M, Tan L, Kamyab A, Hare M, Shpak E, Kieliszewski M. Di-isodityrosine is the intermolecular cross-link of isodityrosine-rich extensin analogs cross-linked in vitro. J Biol Chem. 2004; 279(53):55474-82.

Herth W. Arrays of plasma-membrane rosettes involved in cellulose microfibril formation of spirogyra. Planta. 1983; 159(4):347-56.

Hieta R, Myllyharju J. Cloning and characterization of a low molecular weight prolyl 4-hydroxylase from Arabidopsis thaliana. Effective hydroxylation of proline-rich collagen-like, and hypoxia-inducible transcription factor alpha-like peptides. J Biol Chem. 2002; 277(26):23965-71.

Hijazi M, Durand J, Pichereaux C, Pont F, Jamet E, Albenne C. Characterization of the arabinogalactan protein 31 (AGP31) of Arabidopsis thaliana: new advances on the Hyp-O-glycosylation of the Pro-rich domain. J Biol Chem. 2012; 287(12):9623-32.

Hirsinger C, Parmentier Y, Durr A, Fleck J, Jamet E. Characterization of a tobacco extensin gene and regulation of its gene family in healthy plants and under various stress conditions. Plant Mol Biol. 1997; 33(2):279-89.

Hohmann N, Wolf EM, Lysak MA, Koch MA. A time-calibrated road map of brassicaceae species radiation and evolutionary history. Plant Cell. 2015; 27(10):2770-84.

Hong JC, Cheong YH, Nagao RT, Bahk JD, Cho MJ, Key JL. Isolation and characterization of three soybean extensin cDNAs. Plant Physiol. 1994; 104(2):793-6.

Hong JC, Nagao RT, Key JL. Characterization and sequence analysis of a developmentally regulated putative cell wall protein gene isolated from soybean. J Biol Chem. 1987; 262(17):8367-76.

Hong JC, Nagao RT, Key JL. Characterization of a proline-rich cell wall protein gene family of soybean. J Biol Chem. 1990; 265(5):2470-5.

Hori K, Maruyama F, Fujisawa T, Togashi T, Yamamoto N, Seo M, et al. Klebsormidium flaccidum genome reveals primary factors for plant terrestrial adaptation. Nat Commun. 2014; 5:3978.

Huang GQ, Xu WL, Gong SY, Li B, Wang XL, Xu D, et al. Characterization of 19 novel cotton FLA genes and their expression profiling in fiber development and in response to phytohormones and salt stress. Physiol Plant. 2008;134(2):348-59.

Huang L, Cao JS, Zhang AH, Ye YQ. Characterization of a putative pollen-specific arabinogalactan protein gene, BcMF8, from Brassica campestris ssp. chinensis. Mol Biol Rep. 2008;35(4):631-9.

Huang Y, Wang Y, Tan L, Sun L, Petrosino J, Cui MZ, et al. Nanospherical arabinogalactan proteins are a key component of the high-strength adhesive secreted by English ivy. Proc Natl Acad Sci U S A. 2016; 113(23):E3193-202.

International Brachypodium Initiative. Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature. 2010; 463(7282):763-8.

Ji SJ, Lu YC, Feng JX, Wei G, Li J, Shi YH, et al. Isolation and analyses of genes preferentially expressed during early cotton fiber development by subtractive PCR and cDNA array. Nucleic Acids Res. 2003; 31(10):2534-43.

Johnson KL, Cassin AM, Lonsdale A, Bacic A, Doblin MS, Schultz CJ. A motif and amino acid bias bioinformatics pipeline to identify hydroxyproline-rich glycoproteins. Plant Physiol. 2017a; 174(2):886-903.

Johnson KL, Cassin AM, Lonsdale A, Wong GK, Soltis DE, Miles NW, et al. Insights into the Evolution of Hydroxyproline-Rich Glycoproteins from 1000 Plant Transcriptomes. Plant Physiol. 2017b; 174(2):904-921.

Johnson KL, Jones BJ, Bacic A, Schultz CJ. The fasciclin-like arabinogalactan proteins of Arabidopsis: a multigene family of putative cell adhesion molecules. Plant Physiol. 2003; 133(4):1911–25.

Johnson KL, Kibble NA, Bacic A, Schultz CJ. A fasciclin-like arabinogalactan-protein (FLA) mutant of Arabidopsis thaliana, fla1, shows defects in shoot regeneration. PLoS One. 2011; 6(9):e25154.

Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992; 8(3):275–82.

José-Estanyol M, Puigdomènech P. Plant cell wall glycoproteins and their genes. Plant Physiol Biochem. 2000; 38(1-2):97-108.

José-Estanyol M, Ruiz-Avila L, Puigdomènech P. A maize embryo-specific gene encodes a proline-rich and hydrophobic protein. Plant Cell. 1992; 4(4):413-23.

Karłowski WM, Strózycki PM, Legocki AB. Characterization and expression analysis of the yellow lupin (Lupinus luteus L.) gene coding for nodule specific proline-rich protein. Acta Biochim Pol. 2000; 47(2):371-83.

Keskiaho K, Hieta R, Sormunen R, Myllyharju J. Chlamydomonas reinhardtii has multiple prolyl 4-hydroxylases, one of which is essential for proper cell wall assembly. Plant Cell. 2007; 19(1):256–69.

Kieliszewski MJ. The latest hype on Hyp-O-glycosylation codes. Phytochemistry. 2001; 57(3):319-23.

Kieliszewski MJ, de Zacks R, Leykam JF, Lamport DT. A repetitive proline-rich protein from the gymnosperm douglas fir is a hydroxyproline-rich glycoprotein. Plant Physiol. 1992; 98(3):919-26.

Kieliszewski MJ, Lamport DTA. Extensin - repetitive motifs, functional sites, posttranslational codes, and phylogeny. Plant J. 1994; 5(2):157-72.

Kieliszewski MJ, Lamport DTA, Tan L, Cannon MC. Hydroxyproline-rich glycoproteins: form and function. Annu Plant Rev: Plant polysaccharides, biosynthesis and bioengineering. 2010; 41.

Kieliszewski MJ, Leykam JF, Lamport DTA. Structure of the threonine-rich extensin from Zea mays. Plant Physiol. 1990; 92(2):316–26.

Kieliszewski MJ, Shpak E. Synthetic genes for the elucidation of glycosylation codes for arabinogalactan-proteins and other hydroxyproline-rich glycoproteins. Cell Mol Life Sci. 2001; 58:1386-98.

Kim J, Carpita N. Changes in esterification of the uronic-acid groups of cell-wall polysaccharides during elongation of maize coleoptiles. Plant Physiol. 1992; 98(2):646-53.

King GA, O'Donoghue EM, Borst WM, Davies KM, Moyle RL, Farnden KJ. Identification and characterization of an mRNA encoding a proline-rich protein that rapidly declines in abundance in the tips of harvested asparagus spears. Plant Cell Physiol. 1996; 37(5):706-10.

Kitazawa K, Tryfona T, Yoshimi Y, Hayashi Y, Kawauchi S, Antonov L, et al. Beta-galactosyl Yariv reagent binds to the beta-(1,3)-galactan of arabinogalactan proteins. Plant Physiol. 2013; 161(3):1117-26.

Kleis-San Francisco SM, Tierney ML. Isolation and characterization of a proline-rich cell wall protein from soybean seedlings. Plant Physiol. 1990; 94(4):1897–902.

Kobe B, Deisenhofer J. The leucine-rich repeat: a versatile binding motif. Trends Biochem Sci. 1994; 19(10):415–21.

Koch MA, Haubold B, Mitchell-Olds T. Molecular systematics of the cruciferae: evidence from coding plastome matK and nuclear CHS sequences. Amer J Bot. 2001; 88(3):534-44.

Koch MA, Matschinger M. Evolution and genetic differentiation among relatives of Arabidopsis thaliana. Proc Natl Acad Sci U S A. 2007; 104(15):6272-7.

Koch MA, Wernisch M, Schmickl R. Arabidopsis thaliana's wild relatives: an updated overview on systematics, taxonomy and evolution. Taxon. 2008; 57(3):933-943.

Koenig D, Weigel D. Beyond the thale: comparative genomics and genetics of Arabidopsis relatives. Nat Rev Genet. 2015; 16(5):285-298.

Koski MK, Hieta R, Bollner C, Kivirikko KI, Myllyharju J, Wierenga RK. The active site of an algal prolyl 4-hydroxylase has a large structural plasticity. J Biol Chem. 2007; 282(51):37112-23.

Koski MK, Hieta R, Hirsila M, Ronka A, Myllyharju J, Wierenga RK. The crystal structure of an algal prolyl 4-hydroxylase complexed with a proline-rich peptide reveals a novel buried tripeptide binding motif. J Biol Chem. 2009; 284(37):25290-301.

Kumar S, Stecher G, Tamura K. MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets. Mol Biol Evol. 2016; 33(7):1870-4.

Kyo M, Miyatake H, Mamezuka K, Amagata K. Cloning of cDNA encoding NtEPc, a marker protein for the embryogenic dedifferentiation of immature tobacco pollen grains cultured in vitro. Plant Cell Physiol. 2000; 41(2):129-37.

Lafarguette F, Leplé J-C, Déjardin A, Laurans F, Costa G, Lesage-Descauses M-C, et al. Poplar genes encoding fasciclin-like arabinogalactan proteins are highly expressed in tension wood. New Phytologist. 2004; 164(1):107–21.

Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2012; 40:D1202-10.

Lamport DTA, Kieliszewski MJ, Showalter AM. Salt stress upregulates periplasmic arabinogalactan proteins: Using salt stress to analyse AGP function. New Phytologist. 2006; 169(3):479-92.

Lamport DTA, Várnai P. Periplasmic arabinogalactan glycoproteins act as a calcium capacitor that regulates plant growth and development. New Phytologist. 2013; 197(1):58-64.

Lee CB, Swatek KN, McClure B. Pollen proteins bind to the C-terminal domain of Nicotiana alata pistil arabinogalactan proteins. J Biol Chem. 2008; 283(40):26965-73.

Lee JH, Waffenschmidt S, Small L, Goodenough U. Between-species analysis of short-repeat modules in cell wall and sex-related hydroxyproline-rich glycoproteins. Plant Physiol. 2007; 144(4):1813–26.

Levitin B, Richter D, Markovich I, Zik M. Arabinogalactan proteins 6 and 11 are required for stamen and pollen function in Arabidopsis. Plant J. 2008; 56(3):351-363.

Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of orthologue groups for eukaryotic genomes. Genome Res. 2003; 13(9):2178–89.

Li SX, Showalter AM. Cloning and developmental/stress-regulated expression of a gene encoding a tomato arabinogalactan protein. Plant Mol Biol. 1996; 32(4):641-52.

Li X, Kieliszewski MJ, Lamport DTA. A chenopod extensin lacks repetitive tetrahydroxyproline blocks. Plant Physiol. 1990; 92(2):327-33.

Liang Y, Faik A, Kieliszewski M, Tan L, Xu W, Showalter AM. Identification and characterization of in vitro galactosyltransferase activities involved in arabinogalactan-protein glycosylation in tobacco and Arabidopsis. Plant Physiol. 2010; 154(2):632-642.

Lichtenberg J, Keppler B, Conley T, Gu D, Burns P, Welch L, et al. Prot-Class: a bioinformatics tool for protein classification based on amino acid signatures. Natural Sci. 2012; 4:1161-4.

Liu X, Wolfe R, Welch LR, Domozych DS, Popper ZA, Showalter AM. Bioinformatic identification and analysis of extensins in the plant kingdom. PLoS One. 2016; 11(2):e0150177.

Loopstra CA, Puryear JD, No EG. Purification and cloning of an arabinogalactan-protein from xylem of loblolly pine. Planta. 2000; 210(4):686-9.

Ma H, Zhao J. Genome-wide identification, classification, and expression analysis of the arabinogalactan protein gene family in rice (Oryza sativa L.). J Exp Bot. 2010; 61(10):2647-68.

Ma Y, Yan C, Li H, Wu W, Liu Y, Wang Y, et al. Bioinformatics prediction and evolution analysis of arabinogalactan proteins in the plant kingdom. Front Plant Sci. 2017; 8:66.

MacMillan CP, Mansfield SD, Stachurski ZH, Evans R, Southerton SG. Fasciclin-like arabinogalactan proteins: specialization for stem biomechanics and cell wall architecture in Arabidopsis and Eucalyptus. Plant J. 2010;62(4):689-703.

Maddison WP. Gene trees in species trees. Syst Biol. 1997; 46(3):523-36.

Mao Y. Formin: a link between kinetochores and microtubule ends. Trends Cell Biol. 2011; 21(11):625-9.

Mashiguchi K, Yamaguchi I, Suzuki Y. Isolation and identification of glycosylphosphatidylinositol-anchored arabinogalactan proteins and novel beta-glucosyl Yariv-reactive proteins from seeds of rice (Oryza sativa). Plant Cell Physiol. 2004; 45(12):1817-29.

Matsui T, Matsuura H, Sawada K, Takita E, Kinjo S, Takenami S, et al. High level expression of transgenes by use of 5’-untranslated region of the Arabidopsis thaliana arabinogalactan-protein 21 gene in dicotyledons. Plant Biotechnology. 2012; 29:319–322.

Mau SL, Chen CG, Pu ZY, Moritz RL, Simpson RJ, Bacic A, et al. Molecular cloning of cDNAs encoding the protein backbones of arabinogalactan-proteins from the filtrate of suspension-cultured cells of Pyrus communis and Nicotiana alata. Plant J. 1995; 8(2):269-81.

Memelink J, Swords KM, de Kam RJ, Schilperoort RA, Hoge JH, Staehelin LA. Structure and regulation of tobacco extensin. Plant J. 1993; 4(6):1011-22.

Menke U, Renault N, Mueller-Roeber B. StGCPRP, a potato gene strongly expressed in stomatal guard cells, defines a novel type of repetitive proline-rich proteins. Plant Physiol. 2000; 122(3):677-86.

Merchant SS, Prochnik SE, Vallon O, Harris EH, Karpowicz SJ, Witman G, et al. The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science. 2007; 318(5848):245-50.

Merkle R, Poppe I. Carbohydrate-composition analysis of glycoconjugates by gas-liquid-chromatography mass-spectrometry. Methods Enzymol. 1994; 230:1-15.

Merkouropoulos G, Barnett D, Shirsat A. The Arabidopsis extensin gene is developmentally regulated, is induced by wounding, methyl jasmonate, abscisic and salicylic acid, and codes for a protein with unusual motifs. Planta. 1999; 208(2):212-9.

Mikkelsen MD, Harholt J, Ulvskov P, Johansen IE, Fangel JU, Doblin MS, et al. Evidence for land plant cell wall biosynthetic mechanisms in charophyte green algae. Ann Bot. 2014; 114(6): 1217-36.

Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics. 2014; 30:i541-548.

Mitchell-Olds T. Arabidopsis thaliana and its wild relatives: a model system for ecology and evolution. Trends in Ecology & Evolution. 2001; 16(12):693-700.

Mohnen D. Pectin structure and biosynthesis. Curr Opin Plant Biol. 2008; 11(3):266-77.

Mollet J, Kim S, Jauh G, Lord E. Arabinogalactan proteins, pollen tube growth, and the reversible effects of yariv phenylglycoside. Protoplasma. 2002; 219(1-2):89-98.

Munoz FJ, Dopico B, Labrador E. A cDNA encoding a proline-rich protein from Cicer arietinum. Changes in the expression during development and abiotic stresses. Physiol Plant. 1998; 102:582-90.

Nakhamchik A, Zhao Z, Provart NJ, Shiu SH, Keatley SK, Cameron RK, et al. A comprehensive expression analysis of the Arabidopsis proline-rich extensin-like receptor kinase gene family using bioinformatic and experimental approaches. Plant Cell Physiol. 2004; 45(12):1875-81.

Nei M, Kumar S. Molecular evolution and phylogenetics. New York: Oxford University Press; 2000.

Newman AM, Cooper JB. Global analysis of proline-rich tandem repeat proteins reveals broad phylogenetic diversity in plant secretomes. PLoS One. 2011; 6(8):e23167.

Nothnagel EA. Proteoglycans and related components in plant cells. Int Rev Cytol. 1997; 174:195-291.

Nuñez A, Fishman ML, Fortis LL, Cooke PH, Hotchkiss AT Jr. Identification of extensin protein associated with sugar beet pectin. J Agric Food Chem. 2009; 57(22):10951-8.

Nystedt B, Street NR, Wetterbom A, Zuccolo A, Lin YC, Scofield DG, et al. The Norway spruce genome sequence and conifer genome evolution. Nature. 2013; 497:579–84.

Ogawa-ohnishi M, Matsushita W, Matsubayashi Y. Identification of three hydroxyproline O-arabinosyltransferases in Arabidopsis thaliana. Nature Chem Biol. 2013; 9:726-30.

Otte O, Barz W. Characterization and oxidative in vitro cross-linking of an extensin-like protein and a proline-rich protein purified from chickpea cell walls. Phytochemistry. 2000; 53:1-5.

Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, et al. The TIGR rice genome annotation resource: improvements and new features. Nucleic Acids Res. 2007; 35:D883-7.

Palenik B, Grimwood J, Aerts A, Rouzé P, Salamov A, Putnam N, et al. The tiny Ostreococcus provides genomic insights into the paradox of plankton speciation. Proc Natl Acad Sci U S A. 2007; 104(18):7705-10.

Pamilo P, Nei M. Relationships between gene trees and species trees. Mol Biol Evol. 1988; 5(5):568-83.

Park MH, Suzuki Y, Chono M, Knox J, Yamaguchi I. CsAGP1, a gibberellin-responsive gene from cucumber hypocotyls, encodes a classical arabinogalactan protein and is involved in stem elongation. Plant Physiol. 2003; 131(3):1450-9.

Parmentier Y, Durr A, Marbach J, Hirsinger C, Criqui MC, Fleck J, et al. A novel wound-inducible extensin gene is expressed early in newly isolated protoplasts of Nicotiana sylvestris. Plant Mol Biol. 1995; 29(2):279-92.

Pereira LG, Coimbra S, Oliveira H, Monteiro L, Sottomayor M. Expression of arabinogalactan protein genes in pollen tubes of Arabidopsis thaliana. Planta. 2006; 223(2):374-80.

Petersen TN, Brunak S, von Heijne G, Nielsen H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011; 8(10):785-6.

Poon S, Heath RL, Clarke AE. A chimeric arabinogalactan protein promotes somatic embryogenesis in cotton cell culture. Plant Physiol. 2012; 160(2):684-95.

Prochnik SE, Umen J, Nedelcu AM, Hallmann A, Miller SM, Nishii I, et al. Genomic analysis of organismal complexity in the multicellular green alga Volvox carteri. Science. 2010; 329(5988):223-6.

Qu Y, Egelund J, Gilson PR, Houghton F, Gleeson PA, Schultz CJ, et al. Identification of a novel group of putative Arabidopsis thaliana beta-(1,3)-galactosyltransferases. Plant Mol Biol. 2008; 68(1-2):43-59.

Raines CA, Lloyd JC, Chao SM, John UP, Murphy GJ. A novel proline-rich protein from wheat. Plant Mol Biol. 1991;16(4):663-70.

Rathbun EA, Naldrett MJ, Brewin NJ. Identification of a family of extensin-like glycoproteins in the lumen of rhizobium-induced infection threads in pea root nodules. Mol Plant Microbe Interact. 2002; 15(4):350-9.

Rawat V, Abdelsamad A, Pietzenuk B, Seymour DK, Koenig D, Weigel D, Pecinka A, et al. Improving the annotation of Arabidopsis lyrata using RNA-Seq data. PLoS One. 2015; 10(9):e0137391.

Raz R, José M, Moya A, Martínez-Izquierdo JA, Puigdomènech P. Different mechanisms generating sequence variability are revealed in distinct regions of the hydroxyproline-rich glycoprotein gene from maize and related species. Mol Gen Genet. 1992; 233(1-2):252-9.

Ringli C. The role of extracellular LRR-extensin (LRX) proteins in cell wall formation. Plant Biosystems. 2005; 139(1):32–5.

Rubinstein AL, Broadwater AH, Lowrey KB, Bedinger PA. Pex1, a pollen-specific gene with an extensin-like domain. Proc Natl Acad Sci U S A. 1995; 92(8):3086-90.

Saha P, Ray T, Tang Y, Dutta I, Evangelous NR, Kieliszewski MJ, et al. Self-rescue of an EXTENSIN mutant reveals alternative gene expression programs and candidate proteins for new cell wall assembly in Arabidopsis. Plant J. 2013; 75(1):104-16.

Saint-Jore C, Evins J, Batoko H, Brandizzi F, Moore I, Hawes C. Redistribution of membrane proteins between the Golgi apparatus and endoplasmic reticulum in plants is reversible and not dependent on cytoskeletal networks. Plant J. 2002; 29(5):661-78.

Saito F, Suyama A, Oka T, Yoko-O T, Matsuoka K, Jigami Y, et al. Identification of novel peptidyl serine α-galactosyltransferase gene family in plants. J Biol Chem. 2014; 289(30):20405-20.

Sarkar P, Bosneaga E, Auer M. Plant cell walls throughout evolution: towards a molecular understanding of their design principles. J Exp Bot. 2009; 60(13):3615-35.

Schnabelrauch LS, Kieliszewski MJ, Upham BL, Alizedeh H, Lamport DTA. Isolation of pI 4.6 extensin peroxidase from tomato cell suspension cultures and identification of Val-Tyr-Lys as putative intermolecular cross-link site. Plant J. 1996; 9(4):477-89.

Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, et al. The B73 maize genome: complexity, diversity, and dynamics. Science. 2009; 326(5956):1112-5.

Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, et al. Genome sequence of the palaeopolyploid soybean. Nature. 2010; 463(7278):178-83.

Schmutz J, McClean PE, Mamidi S, Wu GA, Cannon SB, Grimwood J, et al. A reference genome for common bean and genome-wide analysis of dual domestications. Nat Genet. 2014; 46(7):707-13.

Schultz CJ, Ferguson KL, Lahnstein J, Bacic A. Post-translational modifications of arabinogalactan-peptides of Arabidopsis thaliana. Endoplasmic reticulum and glycosylphosphatidylinositol-anchor signal cleavage sites and hydroxylation of proline. J Biol Chem. 2004; 279(44):45503-11.

Schultz CJ, Gilson P, Oxley D, Youl J, Bacic A. GPI-anchors on arabinogalactan-proteins: implications for signaling in plants. Trends Plant Sci. 1998; 3(11):426-31.

Schultz CJ, Hauser K, Lind JL, Atkinson AH, Pu ZY, Anderson MA, et al. Molecular characterisation of a cDNA sequence encoding the backbone of a style-specific 120 kDa glycoprotein which has features of both extensins and arabinogalactan proteins. Plant Mol Biol. 1997; 35(6):833-45.

Schultz CJ, Johnson KL, Currie G, Bacic A. The classical arabinogalactan protein gene family of Arabidopsis. Plant Cell. 2000; 12(9):1751-68.

Schultz CJ, Rumsewicz MP, Johnson KL, Jones BJ, Gaspar YM, Bacic A. Using genomic resources to guide research directions. The arabinogalactan protein gene family as a test case. Plant Physiol. 2002; 129(4):1448-63.

Seifert GJ, Roberts K. The biology of arabinogalactan proteins. Annu Rev Plant Biol. 2007; 58:137-61.

Serpe M, Nothnagel E. Arabinogalactan-proteins in the multiple domains of the plant cell surface. Advances Bot Res. 1999; 30(30):207-89.

Sheng J, D'Ovidio R, Mehdy MC. Negative and positive regulation of a novel proline-rich protein mRNA by fungal elicitor and wounding. Plant J. 1991; 1(3):345-54.

Sherrier DJ, Prime TA, Dupree P. Glycosylphosphatidylinositol-anchored cell surface proteins from Arabidopsis. Electrophoresis. 1999; 20(10):2027-35.

Shi HZ, Kim Y, Guo Y, Stevenson B, Zhu JK. The Arabidopsis SOS5 locus encodes a putative cell surface adhesion protein and is required for normal cell expansion. Plant Cell. 2003; 15(1):19-32.

Shimizu T, Inoue T, Shiraishi H. Cloning and characterization of novel extensin-like cDNAs that are expressed during late somatic cell phase in the green alga Volvox carteri. Gene. 2002; 284(1-2):179-87.

Shiu SH, Bleecker AB. Expansion of the receptor-like kinase/pelle gene family and receptor-like proteins in Arabidopsis. Plant Physiol. 2003; 132(2):530–43.

Showalter AM. Arabinogalactan-proteins: Structure, expression and function. Cell Mol Life Sci. 2001; 58(10):1399-417.

Showalter AM. Structure and function of plant-cell wall proteins. Plant Cell. 1993; 5(1):9-23.

Showalter AM, Keppler B, Lichtenberg J, Gu D, Welch LR. A bioinformatics approach to the identification, classification, and analysis of hydroxyproline-rich glycoproteins. Plant Physiol. 2010; 153(2):485–513.

Showalter AM, Keppler B, Liu X, Lichtenberg J, Welch LR. Bioinformatic identification and analysis of hydroxyproline-rich glycoproteins in Populus trichocarpa. BMC Plant Biol. 2016; 16(1):229.

Showalter AM, Zhou J, Rumeau D, Worst SG, Varner JE. Tomato extensin and extensin-like cDNAs: structure and expression in response to wounding. Plant Mol Biol. 1991; 16(4):547-65.

Shpak E, Barbar E, Leykam JF, Kieliszewski MJ. Contiguous hydroxyproline residues direct hydroxyproline arabinosylation in Nicotiana tabacum. J Biol Chem. 2001; 276(14):11272-8.

Shpak E, Leykam J, Kieliszewski M. Synthetic genes for glycoprotein design and the elucidation of hydroxyproline-O-glycosylation codes. Proc Natl Acad Sci U S A. 1999; 96(26):14736-41.

Silva N, Goring D. The proline-rich, extensin-like receptor kinase-1 (PERK1) gene is rapidly induced by wounding. Plant Mol Biol. 2002; 50(4-5):667-85.

Sims I, Bacic A. Extracellular polysaccharides from suspension-cultures of Nicotiana plumbaginifolia. Phytochemistry. 1995; 38(6):1397-405.

Singh J, Bhardwaj J, Kumar P, Tomar P, Rani A, Rani R. In-silico validation and comparative analysis of candidate gene encoding proline rich protein in Lens culinaris. Legume Res. 2014; 37(2):133-8.

Smith JJ, Muldoon EP, Willard JJ, Lamport DTA. Tomato extensin precursors P1 and P2 are highly periodic structures. Phytochemistry. 1986; 25(5):1021-30.

Somerville C. Cellulose synthesis in higher plants. Annu Rev Cell Dev Biol. 2006; 22:53-78.

Sommer-Knudsen J, Bacic A, Clarke A. Hydroxyproline-rich plant glycoproteins. Phytochemistry. 1998; 47(4):483-97.

Strasser R, Bondili JS, Vavra U, Schoberer J, Svoboda B, Gloessl J, et al. A unique beta-(1,3)-galactosyltransferase is indispensable for the biosynthesis of N-glycans containing Lewis a structures in Arabidopsis thaliana. Plant Cell. 2007; 19(7):2278-92.

Stiefel V, Pérez-Grau L, Albericio F, Giralt E, Ruiz-Avila L, Ludevid MD, et al. Molecular cloning of cDNAs encoding a putative cell wall protein from Zea mays and immunological identification of related polypeptides. Plant Mol Biol. 1988; 11(4):483-93.

Stiefel V, Ruiz-Avila L, Raz R, Pilar Vallés M, Gómez J, Pagés M, et al. Expression of a maize cell wall hydroxyproline-rich glycoprotein gene in early leaf and root vascular differentiation. Plant Cell. 1990; 2(8):785-93.

Stratford S, Barne W, Hohorst DL, Sagert JG, Cotter R, Golubiewski A, et al. A leucine-rich repeat region is conserved in pollen extensin-like (Pex) proteins in monocots and dicots. Plant Mol Biol. 2001; 46(1):43-56.

Sturaro M, Linnestad C, Kleinhofs A, Olsen OA, Doan DNP. Characterization of a cDNA encoding a putative extensin from developing barley grains (Hordeum vulgare L.). J Exp Bot. 1998; 49(329):1935-44.

Subramaniam K, Ranie J, Srinivasa BR, Sinha AM, Mahadevan S. Cloning and sequence of a cDNA encoding a novel hybrid proline-rich protein associated with cytokinin-induced haustoria formation in Cuscuta reflexa. Gene. 1994;141(2):207-10.

Sun W, Xu J, Yang J, Kieliszewski MJ, Showalter AM. The lysine-rich arabinogalactan-protein subfamily in Arabidopsis: gene expression, glycoprotein purification and biochemical characterization. Plant Cell Physiol. 2005; 46(6):975-84.

Svetek J, Yadav MP, Nothnagel EA. Presence of a glycosylphosphatidylinositol lipid anchor on rose arabinogalactan proteins. J Biol Chem. 1999; 274:14724-33.

Szczygłowski K, Legocki AB. Isolation and nucleotide sequence of cDNA clone encoding nodule-specific (hydroxy)proline-rich protein LENOD2 from yellow lupin. Plant Mol Biol. 1990;15(2):361-3.

Sørensen I, Pettolino F, Bacic A, Ralph J, Lu F, O'Neill M, et al. The Charophycean green algae provide insights into the early origins of plant cell walls. Plant J. 2011; 68(2):201-11.

Tamura K, Stecher G, Peterson D, Filipski A, Kumar S. MEGA6: molecular evolutionary genetics analysis version 6.0. Mol Biol Evol. 2013; 30(12):2725-9.

Tan L, Eberhard S, Pattathil S, Warder C, Glushka J, Yuan C, et al. An Arabidopsis cell wall consists of pectin and arabinoxylan covalently linked to an arabinogalactan protein. Plant Cell. 2013; 25(1):270-287.

Tan L, Leykam JF, Kieliszewski MJ. Glycosylation motifs that direct arabinogalactan addition to arabinogalactan-proteins. Plant Physiol. 2003; 132(3):1362-9.

Tan L, Qiu F, Lamport DTA, Kieliszewski MJ. Structure of a hydroxyproline (Hyp)-arabinogalactan polysaccharide from repetitive Ala-Hyp expressed in transgenic Nicotiana tabacum. J Biol Chem. 2004; 279:13156-65.

Tan L, Showalter AM, Egelund J, Hernandez-Sanchez A, Doblin MS, Bacic A. Arabinogalactan-proteins and the research challenges for these enigmatic plant cell surface proteoglycans. Front Plant Sci. 2012; 3:1-10.

Thomas P, Lee MM, Schiefelbein J. Molecular identification of proline-rich protein genes induced during root formation in grape (Vitis vinifera L.) stem cuttings. Plant Cell Envir. 2003; 26:1497–504.

Tiainen P, Myllyharju J, Koivunen P. Characterization of a second Arabidopsis thaliana prolyl 4-hydroxylase with distinct substrate specificity. J Biol Chem. 2005; 280:1142-8.

Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature. 2012; 485(7400):635-41.

Tseng IC, Hong CY, Yu SM, Ho TH. Abscisic acid- and stress-induced highly proline-rich glycoproteins regulate root growth in rice. Plant Physiol. 2013; 163(1):118-34.

Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science. 2006; 313(5793):1596-604.

Twomey MC, Brooks JK, Corey JM, Singh-Cundy A. Characterization of PhPRP1, a histidine domain arabinogalactan protein from Petunia hybrida pistils. J Plant Physiol. 2013; 170(15):1384-8.

van de Wiel C, Scheres B, Franssen H, van Lierop MJ, van Lammeren A, van Kammen A, et al. The early nodulin transcript ENOD2 is located in the nodule parenchyma (inner cortex) of pea and soybean root nodules. EMBO J. 1990; 9(1):1-7.

van Hengel A, Roberts K. AtAGP30, an arabinogalactan-protein in the cell walls of the primary root, plays a role in root regeneration and seed germination. Plant J. 2003; 36(2):256-70.

van Hengel A, van Kammen A, de Vries S. A relationship between seed development, arabinogalactan-proteins (AGPs) and the AGP mediated promotion of somatic embryogenesis. Physiologia Plantarum. 2002; 114(4):637-44.

Velasquez SM, Ricardi MM, Dorosz JG, Fernandez PV, Nadra AD, Pol-Fachin L, et al. O-glycosylated cell wall proteins are essential in root hair growth. Science. 2011; 332(6036):1401–3.

Vlad F, Spano T, Vlad D, Daher FB, Ouelhadj A, Fragkostefanakis S, et al. Involvement of Arabidopsis prolyl 4 hydroxylases in hypoxia, anoxia and mechanical wounding. Plant Signal Behav. 2007; 2(5):368-9.

Vlad F, Tiainen P, Owen C, Spano T, Daher FB, Oualid F, et al. Characterization of two carnation petal prolyl 4 hydroxylases. Physiol Plant. 2010; 140(2):199–207.

Wang R, Chong K, Wang T. Divergence in spatial expression patterns and in response to stimuli of tandem-repeat paralogues encoding a novel class of proline-rich proteins in Oryza sativa. J Exp Bot. 2006; 57(11):2887-97.

Wang XQ, Tank DC, Sang T. Phylogeny and divergence times in the Pinaceae: evidence from three genomes. Mol Biol Evol. 2000; 17(5):773–81.

Wilkins O, Nahal H, Foong J, Provart NJ, Campbell MM. Expansion and diversification of the Populus R2R3-MYB family of transcription factors. Plant Physiol. 2009; 149(2):981-93.

Woessner JP, Goodenough UW. Molecular characterization of a zygote wall protein: an extensin-like molecule in Chlamydomonas reinhardtii. Plant Cell. 1989; 1:901-11.

Wu Y, Williams M, Bernard S, Driouich A, Showalter AM, Faik A. Functional identification of two nonredundant Arabidopsis alpha-(1,2)-fucosyltransferases specific to arabinogalactan proteins. J Biol Chem. 2010; 285(18):13638-45.

Wu Y, Xu W, Huang G, Gong S, Li J, Qin Y, et al. Expression and localization of GhH6L, a putative classical arabinogalactan protein in cotton (Gossypium hirsutum). Acta Biochim Biophys Sin. 2009; 41(6):495-503.

Xu JF, Tan L, Lamport DTA, Showalter AM, Kieliszewski MJ. The O-Hyp glycosylation code in tobacco and Arabidopsis and a proposed role of Hyp-glycans in secretion. Phytochemistry. 2008; 69:1631-40.

Xu S, Rahman A, Baskin TI, Kieber JJ. Two leucine-rich repeat receptor kinases mediate signaling, linking cell wall biosynthesis and ACC synthase in Arabidopsis. Plant Cell. 2008; 20(11):3065-79.

Xu WL, Huang GQ, Wang XL, Wang H, Li XB. Molecular characterization and expression analysis of five novel genes encoding proline-rich proteins in cotton (Gossypium hirsutum). Progress in Biochemistry and Biophysics. 2007; 34(5):509-17.

Xu X, Pan S, Cheng S, Zhang B, Mu D, Ni P, et al. Genome sequence and analysis of the tuber crop potato. Nature. 2011; 475(7355):189-95.

Yang J, Sardar HS, McGovern KR, Zhang Y, Showalter AM. A lysine-rich arabinogalactan protein in Arabidopsis is essential for plant growth and development, including cell division and expansion. Plant J. 2007; 49(4):629-40.

Yang J, Warnow T. Fast and accurate methods for phylogenomic analyses. BMC Bioinformatics. 2011; 12(Suppl 9):S4.

Yoshiba Y, Aoki C, Iuchi S, Nanjo T, Seki M, Sekiguchi F, et al. Characterization of four extensin genes in Arabidopsis thaliana by differential gene expression under stress and non-stress conditions. DNA Res. 2001; 8(3):115-22.

Youl JJ, Bacic A, Oxley D. Arabinogalactan-proteins from Nicotiana alata and Pyrus communis contain glycosylphosphatidylinositol membrane anchors. Proc Natl Acad Sci U S A. 1998; 95:7921-6.

Young ND, Debellé F, Oldroyd GE, Geurts R, Cannon SB, Udvardi MK, et al. The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature. 2011; 480(7378):520-4.

Yuasa K, Toyooka K, Fukuda H, Matsuoka K. Membrane-anchored prolyl hydroxylase with an export signal from the endoplasmic reticulum. Plant J. 2005; 41(1):81–94.

Zhang Y, Sederoff RR, Allona I. Differential expression of genes encoding cell wall proteins in vascular tissues from vertical and bent loblolly pine trees. Tree Physiol. 2000; 20(7):457-66.

Zhang Y, Yang J, Showalter AM. AtAGP18, a lysine-rich arabinogalactan protein in Arabidopsis thaliana, functions in plant growth and development as a putative co-receptor for signal transduction. Plant Signal Behav. 2011; 6(6):855-7.

Zhao Z, Tan L, Showalter AM, Lamport DTA, Kieliszewski MJ. Tomato LeAGP-1 arabinogalactan-protein purified from transgenic tobacco corroborates the Hyp contiguity hypothesis. Plant J. 2002; 31(4):431-44.

Zhou J, Rumeau D, Showalter AM. Isolation and characterization of two wound-regulated tomato extensin genes. Plant Mol Biol. 1992; 20(1):5-17.

Zimin A, Stevens KA, Crepeau MW, Holtz-Morris A, Koriabine M, Marçais G, et al. Sequencing and assembly of the 22-Gb loblolly pine genome. Genetics. 2014; 196(3):875-90.


Thesis and Dissertation Services