Scholars' Mine

Masters Theses Student Theses and Dissertations

Fall 2012

Analysis of a wound-induced gene family in Glycine max

Gena Robertson


Recommended Citation Robertson, Gena, "Analysis of a wound-induced gene family in Glycine max" (2012). Masters Theses. 6940. https://scholarsmine.mst.edu/masters_theses/6940

This thesis is brought to you by Scholars' Mine, a service of the Missouri S&T Library and Learning Resources. This work is protected by U. S. Copyright Law. Unauthorized use including reproduction for redistribution requires the permission of the copyright holder. For more information, please contact [email protected].

ANALYSIS OF A WOUND-INDUCED GENE FAMILY IN GLYCINE MAX

by

GENA ROBERTSON

A THESIS

Presented to the Faculty of the Graduate School of the

MISSOURI UNIVERSITY OF SCIENCE AND TECHNOLOGY

In Partial Fulfillment of the Requirements for the Degree

MASTER OF SCIENCE IN APPLIED AND ENVIRONMENTAL BIOLOGY

2012

Approved by

Ronald L. Frank, Advisor Melanie Mormile David J. Westenberg


ABSTRACT

Gene families are important in understanding genome evolution because they indicate when and where genome duplications and segmental duplications occurred, as well as subsequent divergence and subfunctionalization. A gene family in Glycine max that encodes a WI12 protein, a wound-induced protein, was found to consist of ten genes on five chromosomes. Wound-induced proteins are activated in response to wounding in plants, and the WI12 protein in particular is thought to be involved in cell wall modifications at the wound site. A variety of bioinformatics tools were used to analyze the expansion of this family in G. max and to identify potential functional domains in the protein.

ACKNOWLEDGMENTS

I would like to thank, first and foremost, my advisor Ronald Frank, whose support and knowledge have been invaluable. Without his help, this thesis would not have been possible. I simply could not have asked for a better advisor.

I would like to thank my graduate committee members, Melanie Mormile and David Westenberg.

I would like to acknowledge the Biological Sciences Department at Missouri University of Science and Technology for graduate student support.

Finally, I would like to thank Ciarán Ryan-Anderson for his encouragement and formatting help.


TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
LIST OF ILLUSTRATIONS
LIST OF TABLES
SECTION
    1. INTRODUCTION
        1.1. GLYCINE MAX
        1.2. GENE DUPLICATION AND GENE FAMILIES
        1.3. WOUND-INDUCED PROTEIN
        1.4. PROTEIN STRUCTURE AND FOLDING
        1.5. MEDICAGO TRUNCATULA
        1.6. SELAGINELLA MOELLENDORFFII
        1.7. DATABASES AND TOOLS
            1.7.1. Protein Databases
            1.7.2. BLAST
            1.7.3. Expressed Sequence Tags
            1.7.4. Phytozome
            1.7.5. Sequence Alignments
            1.7.6. MEME
            1.7.7. Augustus
            1.7.8. CELLO
            1.7.9. PLACE
    2. MATERIALS AND METHODS
        2.1. IDENTIFICATION OF GENE FAMILY MEMBERS
        2.2. GENE STRUCTURE PREDICTION AND EST EXPRESSION ANALYSIS
        2.3. EVOLUTIONARY ANALYSIS
        2.4. FUNCTIONAL ANALYSIS
    3. RESULTS
        3.1. IDENTIFICATION OF GENE FAMILY MEMBERS
        3.2. GENE STRUCTURE PREDICTION AND EST EXPRESSION ANALYSIS
        3.3. EVOLUTIONARY ANALYSIS
        3.4. FUNCTIONAL ANALYSIS
    4. DISCUSSION
        4.1. IDENTIFICATION OF GENE FAMILY MEMBERS
        4.2. GENE STRUCTURE PREDICTION AND EST EXPRESSION ANALYSIS
        4.3. EVOLUTIONARY ANALYSIS
        4.4. FUNCTIONAL ANALYSIS
APPENDICES
    A. NUCLEOTIDE SEQUENCES
    B. ESTS
    C. PHYLOGENETIC TREE FILE
    D. PGENTHREADER AND PDOMTHREADER PREDICTIONS
    E. PLACE DATA
    F. CELLO DATA
BIBLIOGRAPHY
VITA

LIST OF ILLUSTRATIONS

Figure
3.1. Chromosome Map
3.2. ClustalW2 Protein Alignment
3.3. Phylogeny
3.4. Neighbor Analysis
3.5. pGenTHREADER
3.6. pDomTHREADER
3.7. GmWI-1 PLACE
3.8. GmWI-2 PLACE
3.9. MEME Motifs

LIST OF TABLES

Table
1.1. Glycine max Classification
3.1. Glycine max Gene Table
3.2. Medicago truncatula Gene Table
3.3. Selaginella moellendorffii Gene Table
3.4. EST Cultivars
3.5. EST Tissues
3.6. EST Treatments
3.7. SNAP Results

1. INTRODUCTION

1.1. GLYCINE MAX

Glycine max is commonly known as cultivated soybean. It is a dicotyledon crop found in the legume family. The classification of G. max can be seen in Table 1.1 below.

Table 1.1. Glycine max Classification [1]

Rank            Taxon             Common name
Kingdom         Plantae           Plants
Subkingdom      Tracheobionta     Vascular plants
Superdivision   Spermatophyta     Seed plants
Division        Magnoliophyta     Flowering plants
Class           Magnoliopsida     Dicotyledons
Subclass        Rosidae
Order           Fabales
Family          Fabaceae          Legumes
Genus           Glycine Willd.    Soybean
Species         Glycine max       Soybean

G. max is one of the world’s principal food crops. Cultivation of the domesticated soybean dates back as early as 3100 years ago in East Asia [2]. It is native to North and Central China. Soybean was introduced into the United States in 1765 [3], and into Canada in 1893 [4].

Soybean is the most valuable legume crop. It is used for its seed protein and oil content, and is the main source of biodiesel. It also fixes nitrogen in the soil by way of a symbiotic relationship with the bacterium Bradyrhizobium japonicum. G. max makes up more than 55% of all oilseed production, and 80% of consumable fats and oils in the United States [5]. In 2011, 3,056,032,000 bushels of soybean were produced on 73,636,000 acres in the U.S. [6]. The U.S. is the main producer of soybean, making up roughly 36% of the world’s production, followed by Brazil, Argentina, China, and India; together these five countries account for approximately 95% of total production [7].

The soybean genome was sequenced in 2008 through large-scale shotgun sequencing, initiated by the Department of Energy-Joint Genome Institute (DOE-JGI) Community Sequencing Program. Approximately 975 megabases (Mb) are present in 20 chromosomes, with a small additional amount found in mostly repetitive unmapped scaffolds. An initial screen of the genome resulted in the prediction of 46,430 high-confidence genes. Approximately 78% of these genes are found in the chromosome ends. Of the predicted genes, 31,264 have been grouped into 12,253 gene families, and 15,166 exist as singletons [8]. It should be noted that the number of gene families in soybean is only an estimate.

Papilionoideae (the “papilionoids”) is the largest of the three subfamilies of Fabaceae (the legumes) with 476 genera and 13,860 species. Papilionoideae includes the grain legumes (beans, lentils, peanuts, etc.) [9]. A sub-clade of the papilionoids, the phaseoloids, contains Glycine, Phaseolus (bean), Cajanus (pigeon pea), and Vigna (lentil, mung bean, cowpea) [10]. Another sub-clade of the papilionoids, the galegoids (Medicago, Pisum), diverged from the phaseoloids approximately 50 million years ago (Mya) [10]. The papilionoid origin has been predicted to have occurred approximately 59 Mya by way of matK and rbcL loci analysis. The origin of legumes has been estimated at 56 Mya using fossils [11]. The Glycine-specific genome duplication has been estimated to have occurred approximately 13 Mya [12,13]. Schmutz et al. (2010) confirmed this by estimating the pairwise synonymous distances (KS) of the paralogues [8]. The legumes thus evolved roughly 55-60 million years ago and began splitting into clades soon afterward.

1.2. GENE DUPLICATION AND GENE FAMILIES

Gene duplication is an important mechanism of evolution and new gene creation. In genomes, sequences that encode function diverge at a slower rate than sequences that do not encode any function. In protein-coding regions, mutations in which a base change would alter an amino acid tend to be removed by selection. Because a duplicate initially has the same function as the original, gene duplication allows the new copy to evolve freely [14].

When a gene is duplicated, the duplicate must be fixed in the population and preserved over time in order to be observed and compared, though fixation rarely occurs. Once a duplicate becomes fixed, there are three possible outcomes: nonfunctionalization, neofunctionalization, or subfunctionalization. Nonfunctionalization, the most common outcome, occurs when mutations in the gene copy result in a non-functioning pseudogene. Neofunctionalization occurs when a new allele arises after gene duplication due to an increased rate of amino acid change, resulting in the gene copy gaining a new function. Subfunctionalization occurs when the gene duplicates take on altered functions after gene duplication. This means that the gene still carries out the same function but under different conditions or in different locations, such as during different developmental stages or in response to different environmental cues [14,15].

Plants harbor many gene families in their genomes. A gene family is a group of genes that have a similar function and usually share a similar DNA sequence, though dissimilar genes can comprise gene families if they are involved in the same process. Categorizing genes into gene families helps describe how genes are related to one another. Gene families are useful because they can assist in predicting the function of newly discovered genes depending on how similar they are to previously discovered genes [16]. These gene families arise from whole genome duplication events as well as segmental duplications of chromosomes. Gene family members are paralogous, as they are related by gene duplication, distinguishing them from orthologous genes, which are related by speciation and are found in separate genomes [17].

1.3. WOUND-INDUCED PROTEIN

One gene family in Glycine max is thought to encode a wound-induced protein. Wound-induced proteins are activated in response to wounding in plants. Wounding can occur by way of biotic (such as lesions from infections and damage from herbivorous feeding) or abiotic (mechanical) stresses. One of the first wound-induced gene families was found in potato, Solanum tuberosum, and consists of the genes win1 and win2 as well as others [18]. The WI12 protein was first identified in the halophyte ice plant, Mesembryanthemum crystallinum. The WI12 gene has a homolog, WUN1, which was found in potato. The WI12 protein is thought to be involved in cell wall modifications at the wound site. WI12 genes were expressed and accumulated in the epidermis and phloem of leaves in M. crystallinum, and the expression of WI12 was found to be restricted to the wounded site [19].

1.4. PROTEIN STRUCTURE AND FOLDING

Proteins are composed of chains of amino acids connected by peptide bonds. There are four levels of protein structure. Primary structure is the linear sequence of amino acids held together by peptide (covalent) bonds. Secondary structure is the organization of amino acid chains into regular stable structures. The two types of secondary protein structures are alpha helices and beta sheets. An alpha helix is a coiled structure held together by hydrogen bonds. A beta sheet is a pleated structure held together by hydrogen bonds. Tertiary structure is the final three-dimensional structure of a single protein, in which alpha helices and beta sheets are folded into a compact structure. This structure is stabilized by ionic bonds, disulfide bonds, and weak interactions. Quaternary structure is the three-dimensional structure of a multi-subunit protein. A multi-subunit protein is composed of two or more polypeptides. These proteins are stabilized by weak interactions such as hydrogen bonds and van der Waals interactions [20,21].

Protein folding is important in understanding protein structure and function. Protein folds can often show evolutionary relationships, and can thus be used to understand function. Fold recognition is used to predict how a protein will fold. Two methods of fold recognition include sequence similarity and threading. Sequence similarity uses multiple sequence alignments to predict protein structure with programs such as the Basic Local Alignment Search Tool (BLAST) [22]. Threading uses sequence to structure matching by comparing a protein sequence to another sequence with a known structure. Fold recognition methods typically use a combination of sequence matching and threading [23]. Protein domains are functional folding units in a protein. Domains usually have a defined interaction or function, and can resemble another protein structure or domain in sequence or structure. A protein can contain a single domain or multiple domains. Each domain has the ability to individually fold [24].

1.5. MEDICAGO TRUNCATULA

Medicago truncatula, a close relative of G. max, is a small legume used as a model in genomic research. It has a small diploid genome, short generation time, and abundant seed production. The genome of M. truncatula consists of 241 Mb in eight chromosomes, with an additional 16.6 Mb of unmapped sequence [25].

1.6. SELAGINELLA MOELLENDORFFII

Selaginella moellendorffii, a type of spikemoss, is a lycophyte. Lycophytes are ancient spore-producing vascular plants. The origins of S. moellendorffii have been dated back as far as 400 million years. S. moellendorffii has the smallest genome of any sequenced plant with approximately 100 Mb in 27 chromosomes [26]. There is no evidence to suggest the occurrence of an ancient whole-genome duplication in Selaginella [27].

1.7. DATABASES AND TOOLS

1.7.1. Protein Databases. The functions of the genes were determined using three protein databases: PFAM, PANTHER, and KOG. The PFAM database is a large collection of protein families, with each family represented by multiple sequence alignments and Hidden Markov Models (HMMs). Hidden Markov Models can be used to predict genes, as well as determine the probability of genomic regions and signals such as introns, exons, promoters, etc. [28]. The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a database that classifies genes by their functions using published scientific experimental evidence and evolutionary relationships to predict function; classification can therefore be performed in the absence of direct experimental evidence [29]. EuKaryotic Orthologous Groups (KOG) is a tool for identifying ortholog and paralog proteins [30].

1.7.2. BLAST. BLAST is a tool at the National Center for Biotechnology Information (NCBI) website used to calculate similarity between sequences as well as the statistical significance of matches. This is done by using a nucleotide or protein sequence as a query to search selected databases and produce alignments. The query is the sequence that is being used to search the database. The database is a collection of searchable sequences. Each alignment pair (query to hit) has a bit score and an expect value (E-value). A better alignment has a higher bit score. The E-value is the number of alignments with an equal or better score that would be expected to occur by chance, given the alignment quality and database size. A lower E-value implies a more significant hit [31].
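The relationship between bit score, search space, and E-value can be illustrated with a short calculation. The sketch below uses the standard Karlin-Altschul approximation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The function name and the example lengths are hypothetical, and BLAST itself uses effective (edge-corrected) lengths, so the E-values it reports will differ somewhat.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value: E = m * n * 2**(-S'), with S' the bit score.

    Simplification (assumption): raw lengths are used instead of the
    effective lengths that BLAST computes internally.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 150-residue query against a database of 1e9 letters.
for score in (30, 50, 100):
    e_value = expected_hits(score, 150, 1_000_000_000)
    print(f"bit score {score:>3}: E = {e_value:.3g}")
```

Each additional bit of score halves the E-value, while doubling the database size doubles it, which is why the same alignment is less significant in a larger database.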

Different types of BLAST can be used to create alignments. Nucleotide BLAST programs, such as BLASTN, perform a sequence comparison of a nucleotide query to a nucleotide database. Protein BLAST programs, such as BLASTP and PSI-BLAST, use a protein query to search a protein database. BLASTX uses a translated nucleotide query to search a protein database. TBLASTN compares a protein query to a translated nucleotide database. TBLASTX compares a translated nucleotide query to a translated nucleotide database. MEGABLAST is a nucleotide BLAST program that performs nucleotide-nucleotide comparisons. It is most often used to find an identical match to an unknown sequence by searching for long alignments between highly similar sequences [32].

1.7.3. Expressed Sequence Tags. An expressed sequence tag (EST) is a short sequence that is generated from a cDNA sequence. ESTs are used to create genome maps. This is helpful for identifying unknown genes and locating them within a genome. A large EST database can be found at the NCBI website [33].

1.7.4. Phytozome. Phytozome is a comparative green plant genome database with a graphical user interface. It is a joint project of the DOE-JGI and the Center for Integrative Genomics (CIG). Currently in version 8.0, Phytozome houses thirty-three sequenced and annotated green plant genomes, with new genomes and updated versions of genomes becoming available regularly. The genes of these species have been clustered into gene families within a phylogeny. Where possible, each gene has a PFAM, PANTHER, and KOG annotation. Through the genome browser, the gene transcript, sequences, peptide homologs, and gene ancestry can all be accessed. Each sequence (genomic, transcript, CDS, and peptide) can be used in a BLAST search in Phytozome and NCBI [34].

When performing a BLAST search in Phytozome, there are several parameter settings. A comparison matrix is a substitution (scoring) matrix that determines the penalty for each possible nucleotide or amino acid mismatch between the query and target sequence; BLOSUM62 is one such matrix for amino acid comparisons. Word length is the minimum number of letters, or residues, that a search attempts to match between the query and target sequences. The expectation (E) threshold is the maximum expectation value of the alignments retained when performing a BLAST search [34].

1.7.5. Sequence Alignments. A sequence alignment is a comparison of two or more sequences. An alignment is performed by searching for character matches that are in the same order in all sequences. When sequences are aligned, they are put into rows in which identical or similar characters share a column, and non-identical characters either share a column as a mismatch or are placed opposite a gap in the other sequence. Aligning sequences shows similarity between the sequences. This is useful because similar sequences can imply similar function or structure. The sequences can then be classified into families and also used to examine evolutionary relationships [35].

ClustalW is a multiple sequence alignment program for DNA or proteins. It creates alignments by calculating the best match between sequences so that similarities and differences can be observed. The alignments produced by ClustalW are global alignments, which align sequences along their entire length and insert gaps for insertions and deletions. There are several parameters that can be manipulated when using ClustalW to create alignments. The Gonnet matrix is a protein weight matrix used to score an alignment. Gap open is a gap penalty for the first amino acid or nucleotide in a gap. Gap extension is a gap penalty for each additional residue in a gap. Gap distance is the gap separation penalty. The gap separation penalty can be disabled when scoring gaps at the end of an alignment with the no end gap option. An iteration step can be performed to improve the alignment by removing each sequence in turn and realigning it. The numiter option designates the maximum number of iterations to complete. There are two types of clustering for phylogenetic tree construction. Unweighted Pair Group Method with Arithmetic Mean (UPGMA) is the simplest method of tree construction and uses a sequential clustering method. In neighbor-joining clustering, external nodes are combined until an arrangement that gives the smallest sum of branch lengths is found. Phylogenetic trees can be calculated from the ClustalW alignments. The percent identity matrix (P.I.M.) reports the percentage of identical nucleotides or amino acids between each pair of sequences, relative to the length of the alignment [36].
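A percent identity matrix of the kind described above can be sketched in a few lines. The example below is a minimal illustration, not the ClustalW implementation: the function names are hypothetical, the three aligned sequences are invented, and the denominator is assumed to be the number of alignment columns in which at least one of the two sequences has a residue.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent of compared columns in which two aligned sequences match.

    Assumption: columns where both sequences have a gap are skipped, and
    a residue opposite a gap counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def identity_matrix(aligned: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise percent identity for every pair of aligned sequences."""
    return {
        (n1, n2): percent_identity(aligned[n1], aligned[n2])
        for n1, n2 in combinations(aligned, 2)
    }


# Invented three-sequence alignment, for illustration only.
alignment = {"seqA": "MKT-LV", "seqB": "MKTALV", "seqC": "MRT-LI"}
for pair, value in identity_matrix(alignment).items():
    print(pair, round(value, 1))
```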

PAL2NAL is a program that creates a codon alignment from a multiple sequence alignment of proteins (such as from ClustalW) and the corresponding DNA sequences. This codon alignment can then be used in the calculation of synonymous and non-synonymous substitution rates [37]. Synonymous Non-synonymous Analysis Program (SNAP) is a program that calculates synonymous and non-synonymous substitution rates based on a codon alignment [38]. Synonymous substitutions occur when one base is substituted for another in a protein-coding gene without changing the amino acid sequence. Synonymous substitutions typically occur at a higher rate than non-synonymous substitutions, and this rate is similar for several different genes. The synonymous substitution rate is the average number of synonymous substitutions per sequence site per unit time. The rate is assumed to be neutral, that is, neither reduced by negative (purifying) selection nor driven to fixation by positive selection. Estimation of synonymous (silent) and non-synonymous (amino acid-altering) nucleotide substitutions is important because it can be used as a molecular clock for the evolutionary dating of genes [39].
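The distinction between synonymous and non-synonymous changes can be made concrete with the standard genetic code. The sketch below simply counts differing codons in two aligned coding sequences and classifies each by whether the encoded amino acid changes. It is a simplified illustration with hypothetical function names and invented sequences; it does not reproduce the SNAP method, which also counts potential sites and corrects for multiple substitutions.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_changes(seq_a: str, seq_b: str) -> dict[str, int]:
    """Count differing codons as synonymous or non-synonymous.

    Codons containing gaps are skipped. This is a raw count, not a rate.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    counts = {"synonymous": 0, "non-synonymous": 0}
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if "-" in codon_a or "-" in codon_b or codon_a == codon_b:
            continue
        same_residue = CODON_TABLE[codon_a] == CODON_TABLE[codon_b]
        counts["synonymous" if same_residue else "non-synonymous"] += 1
    return counts


# Invented example: CTT -> CTC keeps leucine; GAA -> GAT changes Glu to Asp.
print(count_changes("ATGCTTGAA", "ATGCTCGAT"))
```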

1.7.6. MEME. Multiple Em for Motif Elicitation (MEME) is a tool that is used for motif discovery in DNA and protein sequences. A motif, or signal, is a gapless, approximate sequence pattern. In a process similar to multiple sequence alignment, MEME works by searching for statistically significant domains in input sequences [40]. MEME uses an expectation maximization (Em) algorithm which fits a finite mixture model to a set of DNA or protein sequences in order to create motifs [41].

1.7.7. Augustus. Augustus is a gene prediction program used for eukaryotic genomic sequences. The sequences can be uploaded to Augustus to create annotations. EST data can also be uploaded with the sequences in order to improve gene prediction. Augustus can also predict alternative transcripts and splicing as well as 5’ and 3’ untranslated regions [42].

1.7.8. CELLO. SubCELlular LOcalization predictor (CELLO) is a program that predicts subcellular localization from protein sequences. The subcellular localization of a protein can be used to infer its function. CELLO is based on a multi-class two-level support vector machine (SVM) system. The first level is made up of four SVM classifiers: the n-peptide composition, the partitioned amino acid composition, the di-peptide composition, and the local amino acid composition. The second level consists of an SVM jury that processes the results from the first level and determines the probability distribution of subcellular localization [43].

1.7.9. PLACE. The plant cis-acting regulatory DNA elements (PLACE) database is a collection of nucleotide sequence motifs found in plant cis-acting regulatory DNA elements. Cis-acting regulatory DNA elements (cis-elements) are regions of DNA that control expression of genes located on the same chromosome. These cis-elements activate transcription and are located upstream of the transcription start site, usually within 200 base pairs [44]. PLACE motifs have been gathered from previously published work on genes in vascular plants. Each motif in PLACE contains a motif sequence, description, and PubMed and GenBank information (if available). These motifs are useful because they can assist in determining the location of transcription elements such as promoters, and CAAT and TATA boxes [45].

2. MATERIALS AND METHODS

2.1. IDENTIFICATION OF GENE FAMILY MEMBERS

One soybean sequence that potentially encoded a PFAM-identified wound-induced domain was used as a query to search the entire soybean genome using TBLASTN in Phytozome [46], with an expect (e) threshold of -1, the BLOSUM62 comparison matrix, a default word (w) length of 3, and the filter set to off. All hits with an E-value less than -10, at least 65% match to the query length, and 30% identity of residues were recorded and used again to query the entire soybean genome. The family was limited to sequences that hit all other family members and encode the WI domain. A chromosome map was constructed using the length of each chromosome, approximate location of the centromere, and approximate location of each gene.
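The hit-selection criteria above can be expressed as a simple filter. The sketch below is illustrative only: the Hit fields, the hit names, and the example values are hypothetical, and the E-value cutoff is written as 1e-10 on the assumption that the threshold of "-10" in the text refers to the exponent.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str            # hypothetical hit identifier
    e_value: float
    aligned_len: int     # number of query residues covered by the alignment
    query_len: int
    pct_identity: float


def passes(hit: Hit, max_e: float = 1e-10, min_cov: float = 0.65,
           min_ident: float = 30.0) -> bool:
    """Apply the three criteria: E-value, query coverage, residue identity.

    Assumption: max_e = 1e-10 interprets the stated threshold as an exponent.
    """
    coverage = hit.aligned_len / hit.query_len
    return (hit.e_value < max_e
            and coverage >= min_cov
            and hit.pct_identity >= min_ident)


# Invented hits against a 107-residue query.
hits = [
    Hit("hit_A", 1e-40, 100, 107, 85.0),  # meets all three criteria
    Hit("hit_B", 1e-12, 50, 107, 60.0),   # coverage too low (about 47%)
    Hit("hit_C", 1e-5, 100, 107, 45.0),   # E-value too high
    Hit("hit_D", 1e-20, 90, 107, 25.0),   # identity too low
]
kept = [h.name for h in hits if passes(h)]
print(kept)
```

In an iterative search of the kind described, each retained hit would in turn be used as a query until no new family members are found.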

2.2. GENE STRUCTURE PREDICTION AND EST EXPRESSION ANALYSIS

The transcript sequence of each gene family member was taken from Phytozome and used as a query in a BLASTN [47,48] search of the Glycine max dbEST (database of expressed sequence tags) at NCBI. The search used the non-human non-mouse (Others) expressed sequence tags (est) database, with Glycine max (taxid: 3847) as the organism, optimization for highly similar sequences (megablast), a maximum of 1000 aligned sequences displayed, automatic parameter adjustment for short input sequences, an expect threshold of 10, a word size of 28, match/mismatch scores of 1,-2, linear gap costs, and all filtering options off. After the BLAST search, the alignment view was changed to plain text. After reformatting, the list of alignments was copied into Microsoft Excel and organized. Each EST was then used as a query to search the genome using BLAST. Starting with the last score, the sequence of each EST was put into Phytozome, where a BLASTN search was performed against the Glycine max genome with filtering options turned off. This was done for each EST, and symmetrical best hits were used to assign ESTs to appropriate gene family members. Information regarding cultivar, tissue, and treatment was recorded for each EST and compared between family members to identify potential differential expression patterns.
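The assignment of ESTs by symmetrical best hits can be sketched as follows. This is one possible reading of the procedure, stated as an assumption: an EST is assigned to a gene only if it was retrieved by that gene's transcript query and the gene is also the top hit when the EST is searched against the genome. The function name and the EST identifiers are hypothetical.

```python
def assign_ests(gene_hits: dict[str, set[str]],
                est_best_gene: dict[str, str]) -> dict[str, list[str]]:
    """Assign each EST to the gene for which the hit is symmetrical.

    gene_hits:     gene -> ESTs retrieved when the gene transcript is the query
    est_best_gene: EST  -> top-scoring gene when the EST is the query
    """
    assigned: dict[str, list[str]] = {gene: [] for gene in gene_hits}
    for gene, ests in gene_hits.items():
        for est in sorted(ests):
            if est_best_gene.get(est) == gene:
                assigned[gene].append(est)
    return assigned


# Invented example: EST_b is retrieved by both genes but matches GmWI-2 best.
gene_hits = {"GmWI-1": {"EST_a", "EST_b"}, "GmWI-2": {"EST_b", "EST_c"}}
est_best_gene = {"EST_a": "GmWI-1", "EST_b": "GmWI-2", "EST_c": "GmWI-2"}
print(assign_ests(gene_hits, est_best_gene))
```

Requiring agreement in both directions keeps an EST that matches several closely related paralogs from being counted more than once.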

ESTs were aligned with genomic sequences along with predictions by Augustus [49] and FGenesH at SoftBerry to construct models of each gene family member. ORF Finder (Open Reading Frame Finder) at NCBI was used to verify what was found by FGenesH and Augustus.

2.3. EVOLUTIONARY ANALYSIS

Each WI soybean amino acid sequence was used as a query to search the Medicago truncatula genome in order to find orthologs, using the same parameters as in the original TBLASTN search of the soybean genome. In order to find an outgroup for phylogeny construction, the WI amino acid sequences were used as queries in Phytozome. Each species was queried in descending order of relatedness to Glycine max until a phylogeny was constructed in which all genes of the outgroup species appeared in the same clade and produced the same phylogeny using each gene as a root. Selaginella moellendorffii was finally found to be an outgroup that met all of the criteria.

ClustalW2 [50,51] was used to create an alignment of the Glycine max, Medicago truncatula, and Selaginella moellendorffii sequences. The ClustalW2 pairwise alignment search parameters used were a slow alignment type, a Gonnet protein weight matrix, a gap open of 10, and a gap extension of 0.1. The multiple sequence alignment search parameters used were a Gonnet protein weight matrix, a gap open of 10, a gap extension of 0.2, gap distances of 5, allowing gaps at ends, without iteration, a numiter of 1, NJ clustering, an alignment with numbers format, and an aligned order. Once the alignment was created, it was put into the ClustalW2 phylogenetic tree generation program using the default tree format, no distance correction, no gap exclusion, a neighbor-joining clustering method, and no P.I.M. TreeDyn was used to view the phylogeny when the tree file obtained from Clustal was uploaded in Newick format.

The putative functions of genes within 50 kbp (kilobase pairs) of each gene family member, in both the 5’ and 3’ flanking regions, were determined from PFAM [28], the PANTHER Classification System [29], and KOG [52] annotations. The flanking genes were then used to determine homology of surrounding regions and deduce chromosomal and segmental duplications.

The PAL2NAL program [53] was used to convert a multiple sequence alignment of the gene family member protein sequences and the corresponding DNA sequences into a codon alignment. The program was run with the universal codon table, without removing gaps, in-frame stop codons, or mismatches, with no selected positions, and with CLUSTAL output format. SNAP [54,55] was then used to calculate the synonymous and non-synonymous substitution rates based on the PAL2NAL codon alignment.
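The conversion of a protein alignment into a codon alignment can be illustrated with a minimal sketch. The function below is not the PAL2NAL code; it only shows the core idea of threading the coding sequence through the aligned protein so that each residue is replaced by its codon and each gap by three gap characters. The function name and the example sequences are hypothetical, and the check that each codon actually translates to its residue is omitted.

```python
def thread_codons(aligned_protein: str, cds: str) -> str:
    """Replace each residue with its codon and each gap with '---'.

    Assumption: cds is in frame and encodes exactly the residues of
    aligned_protein (gaps removed), in order.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length does not match the number of residues")
    codons = iter(cds[i:i + 3] for i in range(0, len(cds), 3))
    return "".join("---" if aa == "-" else next(codons)
                   for aa in aligned_protein)


# Invented example: the gap in the protein row becomes a three-base gap.
print(thread_codons("M-KL", "ATGAAACTT"))
```

Because gaps are inserted in multiples of three, the reading frame is preserved across all sequences, which is what allows codon-by-codon comparison afterward.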


2.4. FUNCTIONAL ANALYSIS

The fold recognition algorithms pGenTHREADER and pDomTHREADER [56] of the PSIPRED Protein Structure Prediction Server were used with all filters off to identify proteins with similar secondary structure. They were also used to identify potential distant homologs of the wound-induced gene family and provide clues on their molecular role in the cell. The function of the top hit from the results of each algorithm was recorded.

MEME [57] was used to create motifs for the ten related WI protein sequences.

The PLACE (Plant Cis-acting Regulatory DNA Elements) database [45] was used to identify potential regulatory elements in the 1.5 kbp 5’ flanking region of each gene. The 1500 bases upstream of the transcription start site of each gene family member were put into the PLACE web signal scan program and run as a group signal scan. The results were then copied into Microsoft Excel and sorted by site number in ascending order. The CAAT and TATA boxes that were at an appropriate distance from the transcription start site and from each other were noted, if they existed at all in that combination. A keyword search by SRS in the PLACE database was conducted with an AllText search of the term “wound”, with search terms combined with & (AND) and wildcards used. The PLACE IDs in the keyword search results were compared with the PLACE IDs from the signal scan results, and any corresponding IDs were noted. The results were then correlated to information regarding genotype, tissue, and treatment recorded for each cDNA library of origin for the ESTs that were assigned to each family member.
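A signal scan of an upstream region can be sketched with regular expressions. The example below is a simplified illustration and not the PLACE signal scan program: the two patterns are generic consensus forms chosen for this sketch rather than actual PLACE entries, the upstream sequence is invented, and the function name is hypothetical. Positions are reported relative to the transcription start site, with -1 being the base immediately upstream.

```python
import re

# Simplified consensus patterns (assumption: not taken from PLACE).
MOTIFS = {
    "CAAT box": "CAAT",
    "TATA box": "TATA[AT]A",
}


def scan_upstream(upstream: str) -> list[tuple[int, str, str]]:
    """Return (position, motif name, matched text), sorted by position.

    The last base of `upstream` is taken to be position -1. A lookahead
    is used so that overlapping occurrences are also reported.
    """
    seq = upstream.upper()
    found = []
    for name, pattern in MOTIFS.items():
        for match in re.finditer(f"(?=({pattern}))", seq):
            found.append((match.start() - len(seq), name, match.group(1)))
    return sorted(found)


# Invented 27-base upstream fragment.
for position, name, text in scan_upstream("GGCCAATCGGTATAAAGGCGCGTTACG"):
    print(position, name, text)
```

Sorting by position mirrors the step of ordering the scan results by site number, which makes it easier to judge whether a CAAT box and a TATA box lie at plausible distances from each other and from the transcription start site.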

The CELLO program [43] was used to predict the subcellular location of each family member using an average of all methods available.


3. RESULTS

3.1. IDENTIFICATION OF GENE FAMILY MEMBERS

The name, gene address, amino acid length, exon number, and EST number for the ten members of the wound-induced gene family in Glycine max are listed in Table 3.1. This information is also listed for the genes in Medicago truncatula and Selaginella moellendorffii in Tables 3.2 and 3.3. The sequences for all of the genes can be found in Appendix A.

Table 3.1. Glycine max Gene Table

Name of Gene Model   Gene Address       Amino Acid Length   Number of Exons   EST Number
GmWI-1               Glyma18g02610.1    107                 1                 8
GmWI-2               Glyma11g35800.1    107                 1                 124
GmWI-3               Glyma18g06340.1    163                 1                 2
GmWI-4               Glyma11g29750.1    154                 1                 2
GmWI-5               Glyma18g06350.1    166                 1                 0
GmWI-6               Glyma11g29740.1    163                 1                 0
GmWI-7               Glyma18g44090.1    144                 1                 2
GmWI-8               Glyma09g41610.1    152                 1                 1
GmWI-9               Glyma14g38040.1    158                 1                 0
GmWI-10              Glyma03g05050.1    168                 1                 2

Table 3.2. Medicago truncatula Gene Table Gene Address Amino Acid Number of Length Exons Medtr7g07737.1 55 4 Medtr3g033850.1 162 1 Medtr3g097160.1 107 1 Medtr5g082480.1 153 1


Table 3.3. Selaginella moellendorffii Gene Table

Name of Gene Model   Gene Address   Amino Acid Length   Number of Exons
Selmo73443           73443          156                 1

A map of approximate locations of G. max genes on their respective chromosomes with centromere location indicated can be seen in Figure 3.1. The ten genes are found on five chromosomes of the 20 total chromosomes in Glycine max.

3.2. GENE STRUCTURE PREDICTION AND EST EXPRESSION ANALYSIS

The ESTs generated from the BLAST search of each G. max gene can be found in Appendix B. The totals of cultivar, tissue, and treatment type taken from the EST data for each gene can be seen in Tables 3.4-3.6. One gene, GmWI-2, accounts for 88% of expression as measured by EST abundance.
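The share of ESTs contributed by each gene can be recomputed from the EST counts in Table 3.1. The Python sketch below does only that arithmetic; the variable and function names are illustrative, and EST abundance is treated as a rough proxy for expression, as in the text.

```python
# EST counts per gene, copied from Table 3.1.
est_counts = {
    "GmWI-1": 8, "GmWI-2": 124, "GmWI-3": 2, "GmWI-4": 2, "GmWI-5": 0,
    "GmWI-6": 0, "GmWI-7": 2, "GmWI-8": 1, "GmWI-9": 0, "GmWI-10": 2,
}


def est_share(counts):
    """Return each gene's fraction of the total EST count."""
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


shares = est_share(est_counts)
top_gene = max(shares, key=shares.get)

# Report the gene with the largest share as a whole-number percentage.
print(top_gene, round(100 * shares[top_gene]))
```

With the Table 3.1 counts, the total is 141 ESTs, of which 124 belong to a single gene.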


Figure 3.1. Chromosome Map

Table 3.4. EST Cultivars Cultivar EST Variation Table Gene Williams PI567374 Williams Raiden Nourin Asgrow Kefeng Harosoy Harosoy Bragg Various None T157 Name 82 No. 2 A3237 1 63 NTS382

GmWI-1 1 1 6
GmWI-2 48 3 10 1 2 40 2 5 2 3 7 1
GmWI-3 2
GmWI-4 2
GmWI-7 2
GmWI-8 1
GmWI-10 1 1


Table 3.5. EST Tissues (Tissue Type EST Variation Table)
Columns: Gene Name, Seed, Cotyledons, Flower, Leaf, Roots, Stem, Seedlings, Various/Other, Hypocotyls, Callus, Seed Coats, Vegetable Buds

GmWI-1 6 1 1

GmWI-2 44 5 1 3 5 1 16 12 4 27 5 1

GmWI-3 2

GmWI-4 2

GmWI-7 2

GmWI-8 1

GmWI-10 1 1


Table 3.6. EST Treatments (Treatment EST Variation Table)
Columns: Gene Name, No Treatment, Phytophthora sojae, Salt Stressed, Bradyrhizobium japonicum, Multiple Treatments, Sudden Death Syndrome, Pseudomonas syringae, Dark, Other
Rows: GmWI-1, GmWI-2, GmWI-3, GmWI-4, GmWI-7, GmWI-8, GmWI-10

3.3. EVOLUTIONARY ANALYSIS

The G. max protein alignment created with ClustalW2 can be seen in Figure 3.2.

The yellow highlighted region indicates the wound-induced domain. The wound-induced domain is found in all ten genes. The pink highlighted region indicates a secondary domain. The secondary domain is found in eight of the ten genes. The phylogenetic tree file in Newick format can be found in Appendix C. A phylogeny generated with the genes from G. max, M. truncatula, and S. moellendorffii can be seen in Figure 3.3.

Figure 3.2. ClustalW2 Protein Alignment


Figure 3.3. Phylogeny

A neighbor analysis of the G. max genes can be seen in Figure 3.4. Each differing gene is indicated in a different color. The chromosomes are arranged by similarity of neighboring genes. The most similar chromosomes are arranged closest together.

[Figure 3.4: annotated neighboring genes plotted by distance in kb (0 to 50 on either side) from each GmWI gene, in columns labeled by chromosome number (14, 11, 18, 11, 18, 9, 3).]

Figure 3.4. Neighbor Analysis


The G. max and M. truncatula pairwise comparison results of the SNAP program can be seen in Table 3.7.

Table 3.7. SNAP Results

Sequence Names               ps
GmWI-6   GmWI-5              0.1622
GmWI-2   GmWI-1              0.1366
GmWI-4   GmWI-3              0.1408
GmWI-8   GmWI-7              0.1541
GmWI-7   GmWI-10             0.4737
GmWI-8   GmWI-10             0.5225
GmWI-3   GmWI-9              0.6088
GmWI-4   GmWI-9              0.6184

Sequence Names               ps
GmWI-1   Medtr3g097160.1     0.5514
GmWI-2   Medtr3g097160.1     0.5314
GmWI-6   Medtr3g033850.1     0.5754
GmWI-5   Medtr3g033850.1     0.6242
GmWI-4   Medtr5g082480.1     0.5751
GmWI-3   Medtr5g082480.1     0.5905
GmWI-9   Medtr5g082480.1     0.594

3.4. FUNCTIONAL ANALYSIS

Sections of predicted protein folding for one G. max gene from the pGenTHREADER and pDomTHREADER programs can be seen in Figures 3.5 and 3.6.

pGenTHREADER

Ntf2:    HHHHHHHHHHHHCC-CCCCHHHHHHC-EEEEEEEEECCEEEEEEEEEEECCEEEEEEEECCHHHHHHHCCCCCC
         WVFAITVRKSKVTSIREYIDT
         WVHVWTVRNGLITQFREYFNT
GmWI-10: CCHHHHHHHHHHHHHCCCCHHHHHHHHCCCCCEEEEEC---EEEEEEEECCEEEEEEEEEEEEEEEEECCCCCC

Figure 3.5. pGenTHREADER


pDomTHREADER

Limonene: CCCCCCCCCCCCCCCCCC-----HHHHHHHHHHHHHHHC-CHHHHHC-CCCCCCEEEECCCCCEECHHHHHHHH
          TMYQNMPLP
          LEWWYHGPP
GmWI-10:  CCCCCCCHHHHCCCCCCCCHHHCCHHHHHHHHHHHHHCCCCHHHHHHHHCCCCCEEEEECCCC------CHHH

Figure 3.6. pDomTHREADER

In both figures, the folding sequence of the top hit for GmWI-10 is shown. The amino acid sequence in the pGenTHREADER figure includes the wound-induced domain, and the sequence in the pDomTHREADER figure includes the secondary domain. The top result from the pGenTHREADER and pDomTHREADER programs for each gene can be found in Appendix D.

The results for the two more highly expressed G. max genes from the PLACE web signal scan program can be found in Appendix E. The results include the factor or site name of the element, site location and strand, signal sequence, and site number. Wound elements, CAAT boxes, TATA boxes, and transcription start sites indicated at approximate locations on two genes can be seen in Figures 3.7 and 3.8. Decorated nucleotide sequences for all genes include CAAT and TATA boxes, wound elements, and open reading frames, and can be found in Appendix A.

Figure 3.7. GmWI-1 PLACE

Figure 3.8. GmWI-2 PLACE

Motifs of the conserved domains created using the MEME program can be seen in Figure 3.9. In the wound-induced domain, 12 of the 24 amino acids are identical. In the secondary domain, 7 of the 20 amino acids are identical.


Figure 3.9. MEME Motifs


4. DISCUSSION

4.1. IDENTIFICATION OF GENE FAMILY MEMBERS

Computational analysis of the Glycine max genome revealed ten genes encoding a PFAM-identified WI12 domain (Table 3.1). Of the ten genes, four are located on chromosome 18, three on chromosome 11, and one each on chromosomes 3, 9, and 14 (Figure 3.1). Two genes, one on chromosome 18 (GmWI-1) and the other on chromosome 11 (GmWI-2), have considerably shorter amino acid sequences (107 residues) than the other genes (144-168 residues). These two genes are also lacking an additional conserved domain (Figure 3.2). However, EST data analysis indicates that most expression is from these two genes while the others are expressed at very low levels or not at all (Tables 3.4-3.6). The three genes on chromosome 11 appear to be ancestrally related to three of the genes on chromosome 18. Since no introns were found, ORF Finder was used to verify what was found by FGenesH and Augustus.
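For an intronless gene, the predicted coding region can be checked by scanning the genomic sequence for its longest open reading frame. The Python sketch below illustrates the idea on the three forward reading frames only. It is not the ORF Finder implementation, the function name is hypothetical, and the example sequence is made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_forward_orf(dna):
    """Return the longest ATG-to-stop ORF (stop codon included) in the three forward frames."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = dna[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


# Made-up example sequence containing two short ORFs in different frames.
example = "CCATGAAGACCATCTGATTATGCCCTAG"
orf = longest_forward_orf(example)

# Print the ORF and the number of encoded amino acids (stop codon excluded).
print(orf, (len(orf) - 3) // 3)
```

A complete check would also scan the reverse complement, which is omitted here to keep the sketch short.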

4.2. GENE STRUCTURE PREDICTION AND EST EXPRESSION ANALYSIS

According to expressed sequence tag (EST) analysis, GmWI-2 (124 ESTs) is expressed abundantly compared to the other genes in the family. GmWI-1 (8 ESTs), although not expressed as much as GmWI-2, is expressed at a considerably higher level than the other eight genes, which are either expressed at low levels or not at all. Those eight genes have two or fewer ESTs each, and all are missing either a CAAT box or a TATA box in the appropriate location. According to the EST profile across genotypes, tissues, and treatments, no examples of wounding are present. However, there appears to be a response to abiotic stress, namely salt and dark. This expression is consistent with expression in the halophyte ice plant [19]. GmWI-1 and GmWI-2 contain regulatory elements that have been reported elsewhere to be associated with wound induction [58,59,60]. They also appear to have promoter elements, CAAT and TATA boxes, at appropriate locations relative to transcription start sites that were estimated from EST data (Figures 3.7-3.8).

4.3. EVOLUTIONARY ANALYSIS

Phylogenetic analysis, chromosomal location, and neighboring gene analysis provide some insight into the evolution of these genes. Phylogenetic analysis showed the presence of four clades, with each clade containing a pair of sister genes. Of the four pairs of sister genes that are present, three have one member on chromosome 11 and the other on chromosome 18, and the fourth pair has one member on chromosome 18 and the other on chromosome 9 (Figure 3.3). The gene pairs GmWI-1/GmWI-2, GmWI-3/GmWI-4, and GmWI-5/GmWI-6 likely correspond to the whole genome duplication dated approximately 13 mya [12,13]. Neighboring gene analysis in the region supports this. The distal halves of the short arms of chromosomes 11 and 18 appear homeologous, and analysis of the annotated function of neighboring genes confirmed this (Figure 3.4). In the neighboring gene analysis, the isolated genes do not show any segmental homeology in the region. The secondary ancestral similarity between the two adjacent pairs of genes on chromosomes 11 and 18 (GmWI-3/GmWI-5 and GmWI-4/GmWI-6) suggests that an inverted segmental duplication likely occurred before the whole genome duplication.


Schmutz et al. (2010) estimate that the whole genome duplication occurred 13 million years ago, and approximate the rate of synonymous substitution at 0.01 substitutions per site per million years. For the four most recently duplicated pairs of genes in the G. max WI gene family, whose ps values average about 0.148 (Table 3.7), the duplication is estimated to have occurred 15 million years ago. Each gene pair corresponding to the whole genome duplication is orthologous to a different Medicago truncatula gene.
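The date estimate can be reproduced from the ps values in Table 3.7. The Python sketch below assumes the simple conversion time = ps / rate, with the rate of 0.01 substitutions per site per million years quoted above. Whether the original analysis applied exactly this conversion is an assumption, although it does give a value close to the stated 15 million years. The names used are illustrative.

```python
# ps values for the four most similar G. max gene pairs, copied from Table 3.7.
PS_RECENT_PAIRS = {
    ("GmWI-6", "GmWI-5"): 0.1622,
    ("GmWI-2", "GmWI-1"): 0.1366,
    ("GmWI-4", "GmWI-3"): 0.1408,
    ("GmWI-8", "GmWI-7"): 0.1541,
}

# Substitutions per synonymous site per million years, as quoted in the text.
RATE_PER_MY = 0.01


def divergence_time_my(ps, rate=RATE_PER_MY):
    """Convert a ps value to an estimated divergence time in millions of years."""
    return ps / rate


mean_ps = sum(PS_RECENT_PAIRS.values()) / len(PS_RECENT_PAIRS)

# Print the mean ps and the corresponding time estimate in millions of years.
print(round(mean_ps, 4), round(divergence_time_my(mean_ps), 1))
```

Applied to the GmWI-7/GmWI-10 and GmWI-8/GmWI-10 comparisons in Table 3.7, the same conversion gives roughly 47 and 52 million years.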

Speciation occurred in a window of 54-60 million years ago. The reason for this estimate is that the M. truncatula genes in clades with sister genes diverged at around the same time. The papilionoid origin (at 59 million years ago) is very close to the divergence of M. truncatula. Almost immediately after the speciation event, GmWI-9 split from the ancestral GmWI-3/4 gene. The split of GmWI-10 from the ancestral GmWI-7/8 gene seems to have occurred considerably later, at 50 million years ago. The divergence of the M. truncatula family into 3-4 members in the M. truncatula/G. max ancestor occurred in a window of 70-90 million years ago.

There is slight similarity between GmWI-8 and GmWI-10 in the neighbor genes.

There is no M. truncatula ortholog for the GmWI-7/GmWI-8/GmWI-10 clade. This could mean that the ortholog was lost, or that there was an additional lineage-specific duplication in soybean. The M. truncatula outlier is a protein of unknown function. Its wound-induced ancestry is not certain. It does not contain a WI domain, but has some sequence similarity to the G. max family.


4.4. FUNCTIONAL ANALYSIS

All ten genes contain the PFAM-identified wound-induced domain. In this domain, there is a consecutive invariant WVH (tryptophan, valine, histidine) separated by a distance of 12 amino acids from a consecutive invariant REYFNT (arginine, glutamic acid, tyrosine, phenylalanine, asparagine, threonine). Between these two invariant stretches are an invariant G (glycine), I (isoleucine), and Q (glutamine) (Figure 3.9). This domain is located in the middle of the protein, and is equidistant from the amino and carboxy termini in the two more highly expressed proteins.

Eight of the ten genes contain an additional conserved domain. In this domain, there are seven invariant amino acids in a span of eight and a consecutive invariant LEWW (leucine, glutamic acid, tryptophan, tryptophan) separated by one amino acid from a consecutive invariant HGP (histidine, glycine, proline). This domain is located in the amino-terminal third of the protein (Figure 3.9). It does not exist in GmWI-1 and GmWI-2 as these proteins are missing the corresponding amino-terminal region found in the other eight.
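The two motif descriptions above can be written as simple patterns and checked against protein sequences. The Python sketch below transcribes them as regular expressions, reading "a distance of 12 amino acids" as exactly 12 intervening residues. That reading and the names used are assumptions. The test fragments are the short sequences printed in Figures 3.5 and 3.6.

```python
import re

# WVH, then 12 residues of any kind, then REYFNT (wound-induced domain as described in the text).
WI_DOMAIN = re.compile(r"WVH.{12}REYFNT")

# LEWW, then one residue of any kind, then HGP (secondary domain as described in the text).
SECONDARY_DOMAIN = re.compile(r"LEWW.HGP")


def has_motif(pattern, protein):
    """Return True if the compiled pattern occurs anywhere in the protein sequence."""
    return pattern.search(protein.upper()) is not None


# Sequence fragments as printed in Figures 3.5 and 3.6.
wi_fragment = "WVHVWTVRNGLITQFREYFNT"
ntf2_fragment = "WVFAITVRKSKVTSIREYIDT"
secondary_fragment = "LEWWYHGPP"

print(has_motif(WI_DOMAIN, wi_fragment))
print(has_motif(WI_DOMAIN, ntf2_fragment))
print(has_motif(SECONDARY_DOMAIN, secondary_fragment))
```

Exact patterns like these are strict. A position-specific scoring approach such as the MEME motifs in Figure 3.9 tolerates substitutions that a literal pattern would reject.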

Because protein secondary and tertiary structure are more conserved than primary structure (amino acid sequence), dissimilar sequences can result in similar structure. A threading program was used to compare the predicted structure of these proteins with the protein structural database. The non-wound-induced domain is very similar in its predicted secondary structure to a domain in a protein that was identified as a limonene hydrolase (Figure 3.6). This particular search, pDomTHREADER, showed similarity in sequence but not in secondary structure in the region of the WI12 domain. Limonene is a liquid hydrocarbon classified as a cyclic terpene and is present in considerable amounts in the skin of citrus fruits.

Similarly, the WI domain is only somewhat similar in amino acid sequence but very similar in secondary structure to an NTF-2-like protein domain (Figure 3.5). NTF-2, nuclear transport factor 2, stimulates efficient nuclear import of a cargo protein.

CELLO indicates that the two more highly expressed genes may be targeted to chloroplasts. The results of the CELLO program can be found in Appendix F.

APPENDIX A.

NUCLEOTIDE SEQUENCES


For each sequence, the yellow highlighted sections are CAAT boxes, the light blue highlighted sections are TATA boxes, the green highlighted sections are wound elements, and the dark blue highlighted sections are open reading frames.

GmWI-1:

TCCTATTATTTATTACATATTGGTTTACGTGAGTTATCGAAAAAAAAAAGGTTACGTAA TTTTGCAATATTCTTTTCCCACATTAATTAGCACTAAGACAAAAAAAAAAAAAAACATT GTTTTTGGGAGTGAAAACTGAAAAGTCTAAATTGCCAAAACTTGTAATACCTACTATGG TTCTACCTAGCTTACCTTGTGGATTGTAGTTAAGCTAAACCCGTTAAAGACTTAATGCA ACTAAGTAACCAAAAAACAAAGACAAGCGAACGCATTTGTATATACATGCTTTCACTTA TCACTTTTTTTTTATAAAAAAAGAAAAGATAGCAACAAGGTTTGAACATATGATCATGT ATAAAAAATATGTTTTTTATTTATTATTTTACCTTAAATATGAAGGAAATAAAAATATA CTTTTATTGAAGAATATTTATTAATGATATAACTAAAAATATGTCAATCATAGTATTGT AATGAATAATAATTTAATTGACTAAAAAATTAATGATATAACAAATTGTATCCAGTAAA AGAAACTAAAAATGTAGTATCAATCATTAATAAATGATAAAATCAAAATTTACAAAACA ACATATTTTAGAATTACCAAGGTATCAAAATGTTACTATTTCCTCCTCCGTTTCAAATT ATATGATATTTTAAAAAGAAAAAAATTGTTCAAAAATAGATATTATTTTACAAAATTAA TGTAATATTAAATATATATTTTTAAATTATTCTTAATCAATATTCAGTGAAAATAATAT ATAGAAAGAATAAATTAATTGTTAATTAAAGAGATAAATATTATATTAAAAATTACTTA ATATTTATTTGAAATTAATCGTATTTATTATTATTAAGTGTGCTTTCAATTCTTTAACG TGGACAAATACTTATTTCTGCATAATAAGAATATTAAAGGTGAAGAGAATATCCGATCC ACATATTTTTGGTCAAACTCCTTCAGACATCAGTGAAGTCCTTAAAAACTCAAATTAAG TCAAAAAAGTAGGCTTGACTACTATCATCAATTTGAACACTCATGACGTTGCAACATTT AAAACCCGCCTACTATTTCCATCTTACTCCACTTCTATTCCATCGCCCAATCAAAACCC AACACGTCATCAAATCAAAAAATATTCAAACTAGTACCGCCCGCACCCGATCGCAAGCT TCCCGGTACGGAATTCCTATATAAACCCCTCTTCCTTCCCATTTTAAGACACAACTCCA ACAAGGCATTGTTTGGTTTGGTTTGGTTACTTGAAAATCTCGAATCGCTTAATTTTGAT TTAGTTTTCCACCGCAACGCGGAACCTCTTTCTCGAACTGGCTAACTCTCAGGCAAGTG GCTCGGACGCTGATTCCAGCAACACGCGGCTGGTGGTTGCACTGTATGACGCCCTAAAC TCCGGCGACTCCAACGCCGTCGTCAAGATCGTCGCCGCCGATCTCGAGTGGTGGTTCCA TGGTCCGCCCTCTCACCAGTTTTTG TTTGGTTTGGTTTGGTTACTTGAAAATCTCGAATCGCTTAATTTTGATTTAGTT TTCCACCGCAACGCGGAACCTCTTTCTCGAACTGGCTAACTCTCAGGCAAGTGGCTCGG ACGCTGATTCCAGCAACACGCGGCTGGTGGTTGCACTGTATGACGCCCTAAACTCCGGC GACTCCAACGCCGTCGTCAAGATCGTCGCCGCCGATCTCGAGTGGTGGTTCCATGGTCC GCCCTCTCACCAGTTTTTGATGCGCATGCTCACCGGCGACTCCGCCGCCGACAACTCCT TCCGATTCGTTCCGCAGTCCATCGCCGCCTTCGGCTCCACCGTCATCGTCGAGGGCTGC GACTCCGCCCGCAACATTGCCTGGGTCCACGCCTGGACCGTCACTGATGGGATGATCAC 
TCAAATCAGAGAGTACTTCAACACCGCCCTCACCGTCACTCGCATCCACGATTCCGGCG AGATTGTTCCGGCCAGATCCGGCGCCGGCCGTTTGCCCTGCGTCTGGGAGAGCAGCGTC TCCGGTCGGGTCGGGAAATCCGTCCCCGGTTTGGTTCTCGCAATATAAAATATAAAATA AGTAATTAGGGAAGGACGAGGTCACGTGTTGCCGTTGCTATAATAATTAAATAAGGGAC TTGTGCACGTGGCGGTGACTGGATCGATCGGTTTCAGGGAACATTGATACTTTGTGTTA GTATTGGAGGTTAGGGAGATGTGAGAGCTTTGTTGTTATTGGTGTGGTTTGTTTTGTTT


GCTTGTGTGTTTTTCACCACTATGGGCGTATTCAGGTGGTTGTATCTTTCTTTTGTTAT TTGGAGTGTTGATGATGATGCAATAAGAATATCTATGGACTATGCTTTTAAGAGTTGGG TTGTGATGATGCC

GmWI-2:

CCCTTTTTAGTTTAAATTTCTTTTTGCCTTGTTTCCTGCAATGATTTGTTTTTAACGAA AGAATGGAAACATATTATTTTTTTTTAATTTAAATTTCCTTTATCCTTGTTTCCTGCAA TACATCTAATTGCGTACAAGATACAACTAAATTTGTAGTTAACTTTTTTTTATATATTT ATATGTCACTTTAAAAACATATAATATTAGTTACTAATTTATGAACGGCATCTTAACAT TTTCTTAATTGAATAACATTGTAAAATTATTTACATTATCAAAGTATAACCTATTTTTT CATTATTATTTTACAGTATACACTTGCCAAAAAAATTGATAATAAATTTTTTAATGTTA ACCAAAATAGGCTATAATTGACGTAATTTTATAACTCATTTTTTAATATCAAGCAAAAC TTAAGTGAAATTTATTAAATCATTAAATCCTATTATTCATTACATATTAGTTGACGCAA TTTTCTAATATTCTTTACCCATTTTTTAATACTCCGTATACACTTAAATTTGATAATAT GTAATTAGCATTAAGACAAAAGTATATTGTTTTGGGTTTTTTCGCGAGTGAAAACTGAA AAGTCTAAATTTTCAAAACAAGTAACGTACCTACTGTGTTAAGAAGGCACCAAACAAAG GTAAGCTAAACCCGTTAACGTAACAAAGTAACCAAAAAAACAAAGACAAGCAAACGCAT TTATAACATACATGCTTTCACTTATCACTTTTTTTAAAAAGAAAAGATAGCAACAAGGT TTGAAGTTTGAAGGTATAAAAATATACTTTTATTAAAGAATATTTATTAATGATACAAC TAAAAATAGGTCAATGATATCATAATAGTAATAGTGATGAATAATATATAATTTAATTG AAATCAATTAAGTTTCTTGTTTTGGAGTCTGAGTCCCACAGTTTTTGTGTGTTTTTGGT GAAAGCCCTTCACACACATCACTGAACTAAATGCTCAAATTAAGTCAATAAAGTAGGCT TGACCGTCGCTTCAGTTTCATCCCTAATGACGTTGCAGCATTTAAATAAAACCCGCGTA CTATTTCCATCTTAATCCACTCGTATTCCAACTTCCAATGAAAAACCAACACGTCATCA AATAAAAAAATGTTCCAACTAGTACCGTCCGCACTCGATCACAAGCTTCCCCATGCGGA CTTCCTATATAAACCCCTCTTCCTTCCCATTTCAAGACTCAACTCAAACAACAAAAATA AGACATTGTTTGGTTTGGTTACTTGCTCGAATTTCCTCGAATCGCTTAATTTGTATTTG ATTAGTTAAACCGCAACGCGGAATCTCTTTTCTCGAACTGGCTAACTCTCAGGCAAGTG GTTCGGACGCTGATTCCAGCAACAAGCGGCTGGTGCTCGCACTGTATGACGCCCTAAAC TCCGGCGACTCCGACGCCGTCGTCAAGATCGTCGCCGCCGACCTCGAGTGGTGGTTCCA TGGTCCGCCCTCACACCAGTTTTTG AACTCAAACAACAAAAATAAGACATTGTTTGGTTTGGTTACTTGCTCGAATTTC CTCGAATCGCTTAATTTGTATTTGATTAGTTAAACCGCAACGCGGAATCTCTTTTCTCG AACTGGCTAACTCTCAGGCAAGTGGTTCGGACGCTGATTCCAGCAACAAGCGGCTGGTG CTCGCACTGTATGACGCCCTAAACTCCGGCGACTCCGACGCCGTCGTCAAGATCGTCGC CGCCGACCTCGAGTGGTGGTTCCATGGTCCGCCCTCACACCAGTTTTTGATGCGCATGC TCACCGGCGACTCCGCTGCCGACAACTCCTTCCGCTTCCTTCCGCAGTCCATCGCCGCC TTCGGCTCCACCGTCATCGTCGAGGGCTGCGACACCGCCCGCAACATTGCCTGGGTCCA 
CGCCTGCACCGTCACGGATGGGATAATTACTCAGATCAGAGAGTACTTCAACACCGCCC TCACCGTCACCCGCATCCACGATTCCGGCGAGATTGTTCCGGCTAGCTCCGGCGCCGGC CGTTTGCCCTGTGTCTGGGAAAGCAGCGTCTCCGGTCGGGTCGGGAAATCCGTACCCGG TTTGGTTCTTGCAATATAAAATAATTATTAACAAGTAATTAGGGAAGAACGCGGTCACG TGTGAATAATAATTAAATAAGGAGGTTGTGCACGTGGCGGTGACTGGGTCGAACGGTTT CAGGGAACATTGATATATTTTCGTAGTATTGGTGTGTTCTAGAGGTTAGAGAGATGTGA GACCCTATTGGTGGGGTTTCTTATTTCTTTAATTTTCTCAGGTTTGGTTTGTTTTTGTT TTGTTTGCTTGTGTGTTTTGGGAACCACTATGGGCCTATCCAGGTGGCTGTATCTTTCT


TTTGTAATGCGGTTTGTTGATGATGATGATGATGCAATAAGAATATCTATGGATTATGG TTTTAAGAACTGTGTTGTGATGATGCTAAATTCGTTTATTATTTTATTTAGTA

GmWI-3:

GTGTTTTTTATACTCAAAGAGTCATAGTTATTAATTACTAAATCTGTTTGTCTTAAAAA AAATCAAATATTGTGTTAAGAATGTTCATGACCAGTATAATCTAATTAATGGTTCGAAG TTATTTTTTGACAAGTTCGAGTTTAAGAAATTAAATTTGGTGTATTGTTTAGAATAGAT TTTGAAGGCCAAAATGGATTTGCAAATTCAAGCCATATGCATTACATTTTGATGATGTT CCCTTTGATTCCAATATTAAAATGTTTGATTATGTTTTTTTCAGATGTTCCCTTTAATC AAAGGTCAATGCTAATTTTTTTAGTTTTCTTTATATTACTAAAGAGTTTTATGGGATGC AATGAGTAGCGATGCAAGTGCCATGAGCCAATTTTAGTAATAGACAATAAATATAGTTC ACAAAGTTAAGTTACATTATGTATTGGAGAGTTTTTGGATGTAATGAATTTATAATTCT AAAACATCACACAAGTTAATACAAAGATATTGGTTGCCATATAGTGCATATGTAAAGTG AGGAAGTATGTATTTGACATTTTATGTATTTGGGGGGTCATTTTTTGATGGACTTAGTA ATAATAGTAGTTGAAAAATAGAAAATTTGATTTGTTGTTGTGGGCAAGTATGTATATGC ACAATTTGTATTTTTTTACCCTTTTATAGAGTATAATATTGTGACAATATTCCTATTTA AATTTTACTGTATAATGTTACATATTTAATTGTTATTACAAAACAAATATTAAAAAAAA TTAACCCATGTATCCCACGGTTAACCTAACTACTTAATTTGAATGATTGGCAAAGCTAG GACAATGAAGATTAATATATATATGTTGTATTATTAAGTAACTAAAACAACCAAATATA AAAAGACAAAGAAGGGGAATGAACAATAACGTTATGCGTTCGGAGTAATTGAATAATCA AAACAATAAAGAACTTTAATAGTGGAATATTGCGTACATAAAATTAGAATTGACAAAGT TTGATGTGATGAATCAAACAATCAGTCACAAGTGATTAATATATTTAAATGAAGATTAA ATCATTCAGATAATTTTTAAGTTCAATATTTGATAGAAATAATTTTTTATTAGATTTTA ATTATTTTCTTCTAAATTATTCAAGATTTTTTTTTCTTATAAACCTGAGAGTTAACACT AAAAAAAATTGATGTGATGACTCTATTCTTGACCCTTTAAATTTATTTTATCGCAGGTC CACTACTATATAAGAACCTCCACCCTGCATGCACTTCCATTCCCACCAACATTGATTCC ATAAATACATAGCTTTCTTCCTTAATTTGAAAGTAAAGAAGCAATACCAAGTACAATCT CTCTACTTTCAGCGTGCTTGATTTCACATTTTGAATGCAATTTTCTAAATTTTAGAAAC TAGTAGTTATAATTTTCACGTCTTAAACGTGTGATTCTTTCATTTTCGAGTTTCTAAAC CTCTTACGAGAACCATTCCTAGTAT CATTCCCACCAACATTGATTCCATAAATACATAGCTTTCTTCCTTAATTTGAAA GTAAAGAAGCAATACCAAGTACAATCTCTCTACTTTCAGCGTGCTTGATTTCACATTTT GAATGCAATTTTCTAAATTTTAGAAACTAGTAGTTATAATTTTCACGTCTTAAACGTGT GATTCTTTCATTTTCGAGTTTCTAAACCTCTTACGAGAACCATTCCTAGTATATGTTTA CTTGTGTTCCGGAGCTGGCTAACTCACAATCACAAGAATGCTTGGAGGAATATGTGGAA CGCAACACGAGAGTGGTCACAGAGTTATACAAAGCCCTAACCTCCAAAGACCCCGAGAC GCTCCATGGGCTTGTGACCCAAGACTTGGAGTGGTGGTTCCACGGCCCACCATGCCATC 
GGCACCACTTGGTCCCCTGGCTCACGGGCTCCTCACCCTCCTCAAAGGCTTTGGTTCCA CAACACATGGTGGGCTTTGGGCCAGTGGTAATTGCTGAAGGGTTTGATGATGACCATTT GGTGTGGTGGGTGCACGCGTGGACCGTCACTGCTGATGGGCTCATAACTCAGGTCAAAG AGTATGTGAACACTTCCGTGACCGTGACTCTTTTATCACAACAAGTGCTTCCAAATGCT TCCAAGTGTCAATGCATTTGGCAGAGTAGGCTTTGTGATGAATCTGTGCCTGGACTTAT TCTGCCAATCTAGGGTATTAGAATTAGATGGTGGTTGGGTTTCAAAAGTTCACCATGCC ACATAGTGAGTGGTGCATGTCACGTAGTTTATATATGTGCACGTAATTGGGTCTATACT GTGAGTGTGCTTGTTTTTGTGCAATTGTGTGGGTTTTGTGACATTTAGTAATTGTATGT


TATGCTATGGTTTGGGGCTGTTGAAATAGTGGGTGTTTATATAAATATTAACAACGTAT CATCATCTTTTATT

GmWI-4:

GTAATGTACCATACAAAAAATTTAAACTAAACCTAATCTAAAATTATATTTTTCAAAAC CTCAAACTATTCTTAATCTACTATATAAACAATTTTAACTAAACCTAATGTACCATATA AAAATTTAAACTATTCCTAATTTATCATATAAAATTAATAACTTAGATTACCTAAAACT ATTCCTAATCTACCATATAAATTTTTTTAACTAAATCTAATCTACTATATAAAATACAA AACTCAAACTAATAATATTAACAAATAAACTAAATCTAATGTATCCTATAATATAAATA GGAACATACTAACTAAAATTATAACATATACATATAATATAAATAATATTAATATTTCA AATTACTACTTAAAATTTATAAATACCACCTAAACATACATAACATTAACATATAAAAC AAACCTAACCTAAACATACTTAATACATAACACTTAAAATTTATAACCTAAACATACAT AAAATTTATTATCAATTCAAATAAACTTACCAAGAAGTTACAACCAGCACAAACCAAAG AGGCAAACGAAGCAAGGAAGCAAGCTTGAAGAAGGCTCAGGACAACGGGTAGCAATGCA GTGGACGAGAACTGCATATAAAGGAGCAAAGCGCTCCTCGTGATGGCGCGTTCCACCTG CTATCCGCAACCCGCAAGCTCCAATGGCGTTTCCTGCTCCTGCTTCCTGCAGCAGTAGC AGACTGCTGCAGCACTGTGCTCTTTGTCCTTGCTGCAAAGGTAAAATGCCAGTTTGACT GGCGCCTTCAGTCTTCCATGCAGAACGCGCCAGTGGTCCTTACGCCTCCCTCGCCACGT AAGCAGAAGAGCAACGCGCTGGTGCTACTGGCGTCTCGAGGCCATGTGGCAGCTTCTAC GCGAAAGCGTCAATGGTGGTGGCGCCATGCATGGGCTTCACGTAGAAAAACTGCACCTG CCATGAATAAGTTTCAAAAGAACCCCATTTAGAGAATTAGTTTGTAAAAAAGCATCATT TGAGAATTTTTGCCGCTAATCTAAGTAGAAAGTTAAGACGACAAAGTACATTAAGTTGA AAATTAATGGCTAGTATTTATTAATTTGAATGATTGACATGGACAGTGAAGATTAAAGT ATGTTGTATTATTATTAAATAAATAAAACAACCAACGATAAAAAGACAGAAGGGGAATG AACAATAATGTTATGCGTTCTCAAAGGAATTAAATAATCAAAATAATAAAGAACTCTAA TAGTGGAATATTGCGGACATAAAATCAGAATTGACAAAGTTTGATGTGATGACTCTATT CTTGACCCTTTAAATTTATTATATCGCAAGTCCACTACTATATAAGAACCTCCACCCTG CATGCACTTCCATTCAATTCCCATCAACATTGATTCTATAAATACTTAGCTTTCTTCCT TAATTTGAAAGAAAAGACGTGATTCTTTCATTTTCAAAGAGTTTCTAAACCTGCTACTA GTAAATATTTACTTGTGTTCCGAAG CTGGCTAACTCACGAGAATGCTTAGAGGAATATGTGGAACACAACACGAGAGTG GTCACGGAATTATACAAAGCCCTAACCTCTAAAGACCCCGAAACGCTTCATGGGATTGT GGCCCAAGACTTGGAGTGGTGGTTCCACGGCCCATCATGCCATCGGCACCACTTGGTCC CCTGGCTCACGGGCTCCTCACCCTCCTCAAAGGCTTTGGTTCCGCAACATGTGGTGGGC TTTGGGCCATTGGTTATTGCTGAAGGGTTTGATGAGGCCCATTTGGTGTGGTGGGTGCA TGCATGGACCATCAGAACTTCTGATGGGCTCATAACCCAGGTCAAAGAGTATGTGAACA CATCCGTTTCCGTGACTCGTTTATCACAAATGCTTCCAAATGCTTCCAATTGTCAATGC ATTTGGCAGAGTAGGCTTTGTGATGAATCAGTGCCTGGACTTGTTCTGGCAATCTAA


GmWI-5:

GTGGAGAGGAAGTGTGGAAAGTATTTTATAGCTTGCCTAACAATTTAGAAAATAAAAAC AGAAAATAACATATAATTGTTTTCATTTTTTTTTCTATATAACTTTTGAAAATATGAAA TAAAGTGAAAATAAGGTTATTTTTAATTATAAGTAAAAATAAAAAATGATAAAAAAAAT CCAAATTAAATGACACCCTTTACTATTCATTTCTGTAGTCTCCGGTTCAGGGTGCGCCA ATTTATTTGGCATAAGACCGCAAGCAAAACAATAGATTATGATTGGGCCAAAGCGGCTC CAATTGCAGATCCGGACCCAGTTTCTACGAAGTGAAAAAGGCCCACAGGCCGATGGGGT GTTGCTAGGTGCACCCTGCACTTAAGAGAAAAAGATTGAAATACCTTTGATCGGTAAGT CTTTTAAGTTGATCTGTAAAAAGCTTACAGATTAACTTGATCTGTAAAAGACTTATGAA TCATCCTCACGTACACTCATCACAAATACAAAGAAAGCACAAGAGTAGTTTTGATATTT TTTAAAAATACTGGGTGCACCTAACAACACCCCGCCGATGACCACCGTGACCCAACAAG AGTTGAGAGAACCTGCTTATAACTCACAAGCATGTAGAATACAACCTTATTTCAATTCC GAAACACAATATGGCATAATATGCATCCATAGCAAGAAAGAAACAAATGTGTTGATGGT AAAAAATTTGGGAGGACCATGGCCCCTCAGCTTTAAAGAAGCTCTACCCCTTATATATA AACCAGACGTATCTTTCAACAAAATTTTATTTGGATCAAATAAAATTATTATAAATAAC AAAAGAGTATTTAAGGTAATTATAAATAACAAAAGAGTATTTAAGGTAATTGCATATTC AAAAATTGAATTTGGAGAGCCAATTGAGATTCTTTAGCCGTATATTGTCACCGCTTTTA CACATGTCTTCTGCTTAAAAAAAAAAAACAAAACTCATGAGATTAAACTGAAACAGTCA GCCAAACGTGAAAGAATAAAATTATACTTCTATTTAGTATAGTCTTTGTAATTTGTTTT TTTTTTTTTAAGATAGTCTTTGTAGTTTGTTAACATCTTTGTGTTAATAACATACATAA ACTTTTTTATAAAAACATTTTAATTATTAATTTAATTCTTACATTTTTCAAACTTATTA TTTTAGTTCATATAGTTAATAATCTAAAATTTAATAAGTGAATTTTTTAGTTTAAATTT ATATTTTAATTCTTAAAAAATTTTATCATTAAAATTTTTTTAACGACAAATATAAATTT GAGAAATTAAAAAATTCACTTATTAACTATAACAACTACAAAATATAAGTTTGAAAATT ATATTAACTAAATAATTAATTAAACTTAAAAAAAAAACATATATAAACTATTCACATAT AGCAACGAAATTGTCTATAACTGTAACCTAAATTCTGAGCATAAGCCTCTTATAAATAG CATGGGCTACAACCATTCTTTCACC ATGAAGATCATCCCCGGCAGCAAGGCCGAGCCGGAGAAATTCTCCGGCACGGTC ACGGAACACGAAAACCGCAACCGGGAGACGGTAAAATTGGTGTACAAGGCATTATTGCG TGATGCCGACACGGAGAAGCTTGCGAAGGTGGTGAGGGCAGAGTTGGAATGGTGGTACC ACGGGCCTCCTCACTGCCAGCACATGATGAAAGTGTTGACCGGCGAGTCAACGACGGCG CAAAAGGCCTTCAAGTTTAGGCCTCGGAGGATCAGGGCCGTCGGTGACCGCGTGGTGGT TGAAGGGTGGGAGGGGGCGGGGGAGTACTGGGTGCACGTGTGGAGGTTCAAGCACGGGA TCATTACTCAGCTGCGTGAGTACTTTAACACGTTGATCACGGTGGTGCATCGGGTTTCG 
GAGGACGGCGATGAGGCTCGCTTGTGGCGGAGCACCAACCGAGTTCGGGTTCGGGTTCA CGGGTCGTTGCCGGATCTTGTGCTTTCCATTTGA


GmWI-6:

AAAGTTGGTTACTCAAAAAAATGAACAATCACAAACAGTAAAATACCAATGCTGTAATT GCATATTCAAAACTGTTGTAATTCAAGATTACATGTGTCCATTAGTCTACACACTTAGA CGGCATTTGTTTGACGAATTATTTCTTTAATTTTAGAAATATTTTTTCTTGTAAATATA ATATGAGAATATTATTTCTAAGTATTTAAAAAATTGAGTTTTATATCCAACAATAATTA TTTTATTACACCATAACAAACATGAGAAAAAAATTTTCAATCTCATATTCTCGGAAATA AAAAATAATCTGTGAAACAAACCAAAAAAAGGAAAGACTAATATATATTAACTATTATC GTTAATTAGAAAACTAATTATTGTTAATACTGTACGTGATAAAATACATGTATATACAC AGATGTGAGACTGATACAAGGGGGCAGCCAAGGACTTTAGCCCCTTCCTCAAATTGCTT TTAATTATATATATATATCAATTAATAATATATATACTACTTCCAATAAAAATAAAATA ATTATATATATATATATATATATATATATATATATATATATATTAATATATGTTACCAC TAATTATATATTAATCTTTTTGATAATATATTGTTACAAATTATAATTTACTTAGAGAT ATATACACTAATAAATAGATCTAATAAAAATGATATTTTTATATGATTTTTATTTCATC CATAGTTATCTAAGTTAATGATGTTTTTGTATTATTATAAGTCTCATTTATCTATTTAA TTATTAATATTATATTATAAATATATTTATATATAATTGTTATGAAATTTGATGAATTT TAATTAGTTATTATTATATTTTTTCGTTTTATTGTGAGATTATTATATTGAGTCCTGTT TGATTGTAATTGTGCTTGTGAGATCATTTCAAAACTTTAAAAACAATTATACAAGTGAT AAACAAATGTATTTCCTTCTTTATCATTTGAAATAAAAACCAACTTTAATTCTAAGTGT GAAACTAATAAATAATTTTAAAAAATTATTATTTATTTGTTTGAAAATAGATATTTAAA TAATTTAAAAAATCCAGTTGTCTTTTGATGGTTTTCTAGATCCCTCAGGTATAAACTTA ATAAATTATAAAATTATTCTAGTTTGTTTAATAGTTATAACTCTTCATTTGATACTTGA TTCAAACTTCAAATTCAAACACAGCCACCACAATCGTTGAAAGGTCAGGAAGTCACCTT CGCATTTGCATATATATACTTGAATTTGCAATTACTAATTAAACTAAAACAGTCGTCGG TGATCCGTGAAAGAATAAAACTTTACAGTACTTATATTATTCAGTATAGTCTTTGTAAT ATATTACATCAGCTGTGTTAATATATATATATATATATATATATATATATATATATATA TATATATATATAAACTATTCACATATAGCAACAATATTGTCTATTAGTATAACCTGAAT TCTGAGCTACAACCGTTCTTCCACC ATGAAGACCATCCCCGGCAGCAAGGCCGAGCCGGAGAAACTCTCCGGCACGGCA GCGGATCACGAAAACCGCAACCGGGAGACGGTGAAAATGGTGTACAAGGCATTGCGTGA CGGCGACACGGAGAAGCTGGCGAAGGTGGTGAGGGTGGAGTTGGAATGGTGGTACCATG GACCTCCTCACTGCCAGCACATGAAGAAAGTGTTGACCGGGGAGTCAACGGCGACGCAA AAGGCTTTCAAGTTTAGGCCTCGGAGGATCAGGGCCGTCGGTGACCGCGTGATGGTTGA AGGGTGGGAGGGGGCGGGGGAGTACTGGGTCCACGTGTGGAGGCTCAAGCACGGGATTA TTGCCCAGCTTCGTGAGTATTTCAACACGTTGATCACGGTGGTGCTTCGGGTTTTGGAG 
GACGGTGATGAGGCTCGGTTGTGGCGGAGCACCGACCGGGTTCGGGTTCAAGGGTCGTT ACCGGATATTGTGCTTTCCATTTGA


GmWI-7:

TTAGGGACATATGACTATAAATTTGTCTGAGTTAGTTAATTCCATGTTAAAAAACACGA CATTTGCCGGTGTGTCATAGTTGAGTGAGGAGGCGTATTTCAAGATCGCAAAATTCTTT GCTATTAAAGGCCGACGAACCCAGACAATGATCAACACAAGCTTGCAGTATTCCAAAGT CATCTCCGAAGCAATGAATAACGGTCAACAAGAATCAAATACACACATTGTTAACAAAT TTAACAAACACAACCACACATTTATTGTAAGAGAGACCCATGCCCCACTTCTAACACCC AGACCACAGGGAAGGTTTAGGGTCATGTTAGATGTTACAATCCCAAAATTATCATTGTG ATGCATATCATACTAAACACTTACCGTGTTCTCAGGTCATGTTTGTTTGTAAATCTTTC ATGATCTCATGAACTATGTGCCGCCTACAAAATATTTTACACTTCTACCACAACTCCTT TGGTTTACTACCACACGAATTAATGTGGCAAAAATATGAAAGAGTTCAATGGGGTCCTG ACACAGGGAGAAGGAAGAGTGCAAAGCGTCGTTCTATTTCAACTTGCATTCCTACTAAG ATAGACGAAGAAGAAAATGAATGAGAGAGTAGAAAAAAAATGTGAAATTTGACTGCAAC AAGGACATAATAGAAATAATTATCCCAACGTACCTTCATCTTTAGTCTTCCCTTGGACA AATATTATTATAAATATTTTATTTATAATGTAATGTTCTTCTATAAATAAATTCTTAAA CAAAAAAATTATCATTTTTTAAAAAAATGTAATGACATAAAAAAAATTATATTTAGTAA AATAATAATATTAAAAGTTTTAATTTTTTTATGAATGCAAAAATAATATTTAAAAAATT TAGATCATTTTTAAAATTAAAATGTTAACTAAAAAAATGTCTTGTATAAATTAAATATT ATCAAACAAAGTCTATTATCAAATGAGTCTTGTATAAATTAAATAATGTAAGAAAAAAA ATTATATTAAATAACATCATTTTCTTAAACAAATAAAAATTATTTTAATAAAAAATTAT TTTTAAAATCAATTAAAAATATTTTTTATATAATATAATATTTAAAATAAAAGTACCAA AGTTTAAATTTAAAATATTTTTAATTATTTAAACTAAAATATAAATTTAAAATTAAAAA TAAAATTTAAATTAAAAAAAATCACTGTTTAGAAGTCGGGTGGCAAATCCCGACTTCTT GATGAAGCCTAAAAGTGGTACATTTTGGGATATAAGCTACAAATAGGTGTAGATTGAGA ATTTTCTAGGGAGAGAAAGTGTACATTGATAAAAAAACCTACAAAACTCATACATACCT TGTTGTCTATGCCTATAAATTGTGACACCACCAAGTCCTCCATGCAACTCACAACTAGT GTGACCATTTCCTAAAACATAAGCCCAGAAGCATAAATTTCACACAAAAAAATTCACAA CACGGAGGATCATTAAAAGGTGGAG ACATAAGCCCAGAAGCATAAATTTCACACAAAAAAATTCACAACACGGAGGATC ATTAAAAGGTGGAGATGCAAAACAAGGCAACAGTTGAAATGCTCTACATGGCACTATTA GGGCAAGAAACAATGGACAATGTGGCCAAACTGCTAGCAAGTGACCTTGAGTGGTGGTT CCATGGCCCTCCTCAATGCCACCACATGATGAAGGTGCTCACAGGGGAAACAGATCACA CCAAGGGCTTCCGGTTCGAGCCGAAGCAAGTCACCGCCATTGGAGACTGCGTGATTGCC GAGGGGTGGGAGGGCAAGGCCTATTGGGTGCACGTTTGGACCCTCAAAAATGGCCTCAT AACTCAGTTTAGAGAGTACTTTAACACGTGGCTTGTGGTGAGGGATTTGAGGACACCAA 
GGAGGGAAGATAGCAAAGATAGCATGACATTGTGGGAAAGCCAGCCTCGTGATCTCTAT CACCGGTCACTGCCAGGACTTGTGTTAACCATTTAGCCACAACAACACTTGATGGTTGT GGTTAATTGGAACTTGAATTGGAAGCTAGTTATTAGGGACGTGATGTGTAATGTACCTT ATAGTGGCCCCTACGGATTAAACATGTTGCTCTACAACATTGGAACTATTATACTGTTC GTACATATTTGGCTTGTGGACCTGGTTGATTGTTCAATAAAATGGTTTTCATCATTTTG AATCAAAAGG


GmWI-8:

ACAATTTATCTTAACAAAACTTATATGCAATTTCTTAGATTTTTTGTGATGTTCTCCAATTAAAA TTAAAAGTTTCATCACAATAATTTACATGTTGATTTTTAATACAAATCTTTATGATCTATATTAT CGGGGTGAAAAACTAATTTTGATAGTAGCACATAAAGAATCTAAGGTAGTATGTATATAAGTTTT ATCACAATAATTTACCTAGATGTTGTTGGTGATGTTCTCCCATCAAAATTGAAGTTTCATCATCA TTGTAGTTTGTGTATAAGTTTTGTCCTTTGTCATATTGATTTGTGGCTACTCATTTTTATAATAG AAAAGTTGACTAAAAAATTTAAAATGATTAAGATTGTTTGGGCAAAAAAAAAATAATACAAAAGA TAATTTAAAGTAGTTCTTATAAAATAAGGGAGGTGGGAGAATACTCTGAAAAATAAAAATAAAAG TAAATGAAAAATACTTAATGCTTATTTAGAATAACTTAATCTCTACAAAATAATCATTTATTTCT TAAAAGCTCTCTCAAGAAATTAGTAAAAATAAGATGTTTTTTTCTAAAGTAAATTGAACCAAACA AGAACTTGGTGTTTAGATTTATATCTTTGAATACAATAAATCTGACCATATACATGTGTTTATAA TAAAAATCAGCATCATAAATTGGAAAATAATCAAGTTCTCTAATTGTCATCTATTTTGTTCGTGA TTTTGTTTTTCTTTCACTCCTCACAGTATATTCTTATCTAGTAAGTTAGAGATTAATCTATAGAG ATGATAATTATTTACTAGACGTTAAAATGGAGACTCCTCACTACTATAATTAATGATTTATTTTT TGTTTTATAAATAAATCTCCTGAATAAACTTGAAACCACATGTTTAACTGATCTATTGACTATTC GTTCGTACTAACACAAGTTGCCTATGCTGAACGAGAAGAGGGTCGGCCATGCATAGTACATTCCA AAAACATCATTTCAAGTTCATAGCACCTTTAGATTATACCTAAACATTTAATGAAAACGACGGGT CCATCTTTCACACGAGTTATTAACAAGTCAAACTTCCCAAAGGCTGATTAGTCATTCTAATTATC CCCAAATTTAATTTTGGCTCCCTCATGCACTTACGTCTACAAGTTAATTCACATCGTTATCTCCT CATAAAAAGCATTCCAACTCTTACAATTTATAATTATTATATGAACATTTTGGTGTGACACATTT GTCATCTCAATCGTGCAAACATTAATCACAACTTCTAGGAGCTAATCCACACCCTCCTATACCCC ACTTTATAAATCAATCAATCTCAACCGTCTATGCATTTTTTATTTTAACCTATAATAAGGCACAA ATTTCGGATATACCTTCCTATCTATGCCTATAAATTGTGACACCACCAATCCCTCCATGCAATTC ACAACTAGTGCAACCATTTCCTAAAACACAAGCCTAGAAGCATAAATTTCACACAAAAAAAATTC AAAGC GTGCAACCATTTCCTAAAACACAAGCCTAGAAGCATAAATTTCACACAAAAAAAATTCA AAGCATGGAGGATCATCAAAAGGTGGAGATGCAAAACAAGGCAAGAGTTGAAGTGCTCTACAAGG CACTATTAGGGCAAGGAACAATGGACAATGTGGCCAAATTGCTAGCAAGTGACCTTGAGTGGTGG TTCCATGGCCCTCCTCATTGCCACCACATGATGAAGGTGCTCACAGGGGAAACAGATCACACCAA GGGCTTCCGGTTTGAGCCGAGGCGAGTCACAGCCGTCGGAGACTGCACGATCGCTGAGGGCTGGG AGGGCAAGGCGTATTGGGTGCACGTTTGGACCCTAAAAAATGGCCTCATAACTCAGTTTAGAGAG TACTTCAACACATGGCTTGTGGTGAGGGATTTGAGGCCTCCAAGGTGGGAAGATAGCAAAGATAG 
CATGACATTGTGGCAAAGCCAGCCTCGTGATCTCTATCACCGGTCACTGCCAGGGCTTGTGTTAA CTATTTAG


GmWI-9:

AATAGAATAGAGTGCTTTTTATATAGGTTAAAATATGTACATTATTTCCTGTAAAAATA AAATATATTAATTATTTTAAATAGGATGTTTAACCCACTCATAACAAGACATATATTTC TAAGTGCAAACATCTTTAATTCACCACCAAATTTTCATCTGTTAAAAAAGATCCTCCAT GTTGGTAACAAACCCACAAAAAGGCATTATATATGCATGCACGGCCATAAATACCAGAC CACACACAATAACCAAACAGCATATTCCATAGATGAATTAATTCTTGATTCTTTTCTTT TTAACAAATCAGAATGCAAGTGAAAATTCCTTTTACAATCCATGGAGAAAAGAGACTTA AAGCCATCCCAATGATAACAGGATTGCCACAATTAACAGGACCAGTTACAATCCTCAAT CTTACAGAGAACACGATATACAACTCATCGCCTCATGATAGAACATATTAATTGGACTG TTGACCTAATTCAAATTATAGATATGTTATCTACAATTTGAAATTGGAATAAATTCTAA GACAAGATATCTACCATTTGATAATTGAATTAGGTCAACCGAATAGAATACCTCTTCTG TGGAGACTCAATAATTAACTGAAATCCTGTTGAGAAGAACTTTGTAAGAAAAAAATAAT AAATCTCAAAGGTGACTTATTTGAAGACCATATATAGTCTCTCAAATCAGAATTGATCA GTTCCTCTAACCGCACAAAACTTTATTCTTCAAATGGCAAAACATCAACCCAAAAGAAC CAACATAATTTTGACACGGCTTACTATAAACAGAAAAGTCATACGGATTAATTATGGTT CGAGAATCTAACATTGTAAAGCCTTAACTGTCAATCGAAATAGTTTTTAGATTGCCTTG TATGAAAATGAATCATTTTTTTTTTCAATCAGTTGTTTTGTTATTTAACGATGAATATT CAAATTTTGTGATGGAACATTTTCCCTCTCATTCAGCAGGGATACTTGTTTTTGTTGAT AATTTATTTAACTTCCTGGCCAATTTTAGCTACAAATTTAATTTTATTATTCACTACAA TTTAATTAGTAGCCGCGTTATAAATGAGATTAATTTCTTAATTATAATGAGTCCGGGAC TGAATAATTTGATGATTATTTATCTGTATATCTCGCAATGTGTGTGAGACCAATGAATG CAATATATTAGTGCAGAGATGTTTAGTGTAATAATGTTTTCAAACAAGTAAAAAAATGT TAGTATAATGTTATTAAACTTAAAAGAAAAAATTAATGTAAGAAAATTAAAAAAAATAA GATTATGTATAATTGCATGAAACCTCATCATCATTCCATCAAGTTATCAACCATCTTTT TTTAAAAGGCCATTCCATGAACCATCTTACAACCACACCTATATAAGAACGTGCACTCA CCACCATTCACTTTCACCCCATATCTCTCTCATAGCAAATTAAGGAGTTAACAACTTAA TTAATTTGACCATATCACTATTTAG GAGCTGGCAAACTCGTCGCAAAAACCATGGCTAGAGGATCAAGAGGATCCCAAC AAGAGAGGGGTGATAGACTTGTACAACGCCATACTCTCCAATGACACTGACACAATGCA GCGCCTTCTAGCTCCAGACCTGGAGTGGTGGTTCCACGGTCCACCGTGTCACCGCCACC ACCTCGTTCCCCTCCTCACGGGCTCCTCGTCGAAGCCTTTAGTCCCTGATCTCGTAGTT GGGTTTGGATCAGTGACTATTGCTGAAGGCTTCGACGAGAGCAACTTGGTGTGGTGGGT ACATGCATGGACCACCACTGACGAAGGCATCATCACGGAGGTTAGAGAGTATGTCAACA CTTCCGTCACTGTCACCAAGCTCGGTTTGCATGTGAACAACGATGATGTTGTTTCTGCC 
ACGTGTCAGAGTGTCTGGCAGAGCAAGCTCTGCGATGAGTCTGTTCCGGGACTCATTCT TGCCATCTAG


GmWI-10:

TACTGGAATAGTCGAATATACTGTCCTTTATTCGTGGAATACTAGTGAGGCTATTTATGCGCTGT TTTTATTTTTGTCTCTTTGGCGTATATTTATGGCTTAAACTTTTAAATTGATTGAGTAATTGATT ATTGGACAAAATTTAGATACAATTTATGTTAATCTTGTTCTTCAATTAAAATTGAAATTCTACAA AAGTAGTCTGCATAATAAAAATTTACATTAAAAATCAACATAAGTTAATGATATAAAATTTTAAT CCTGATAAGAAAATAAAACAAAAAACACTCAAAAGAACTACATCTAAGTTATATATATAGAGAGA GATTATTACAAAAATTTTCTACTAAAAATAATCCTTGATCCAATTAGATATTGAAACTATTAATT AAGTTGAAAAAATCTTACACTAGTTGATCACTTGGTCAAAATTTGAATATCGTGTATAATCTTTT CCAAAATTTATGGAATCACTGTCAGTCGTTTGGTTTTAGGATTGTTTTTAGTATGTAGAAGTTAC AAACTAGAGTTATTTTGAAATTTTAAAATACTGCAAACCTTTTTATATGAATTAGAAAATGTTTA TTCATTATATTTATCATCTATGAACATTAGTAACTTGATACATTTCATCGATAAAATTGGCATCA TGAATCAGGGATATCAGTTAAGTTCTCCATTTGTTTTGCATTGTTTTCTTTTCAAAGCAATTCTG ATACTTGTTAACACTAGAGCAACAACATTCAAACCTCACGGATCTTTTTGGTTTTGTCCCTCATA TGCCTACAATACCATAACCTTAGGTAACTCTAATAAGGAGAAAGCCATGTTTATTTGTCCTTTGT TTAACCTACGGATACTTTGGTAAGAAATTGGCAGTGTTGGTCGACAAATTAAACAATAAAAGATC CTTCAAATAACACTCCAAAAGCATCAATTTAACATAGCACTTTTAGATTATACCTAAATTATTCA AAGATTTTTCCACAGAATAGACCAAACCAACCCAAGGGTAAATTAGTCATTCTAGAAATACACCC AATATTGCTTTGGCTCCATCGTACGCGTACATCCACAAGTCCACATCGTTATCTCCACATAATAA CGTTACAGCTTAATAGCTTATGATTGTTGTTAAATGTAAATTTTGTTCTGACAGACTTTTTGACA CTTCACACGTGCAGCATTAAATGCAAGTTTTCTAGGACCAAGTCCGTATCATCCAATCCACACTT TATAAAGTAATCAATCTCCACCGTCGTAGTTTAATTATTATTTGTGTTAAATGATTTGTCACATT AAATTAGTGTCAATCCTTATTAAAACTAATTATGTCTAAAAAAATGAGTGTCTCCTAGTCTCCTT AAACAATAATAAGGTGTGCTAATACTTTGACATTTATATCTCTATGTAGGCCTATAAATATGTGA CTATCCAAGAGCCACATTCCCATGCACAAAACAAATCAAGCTACTCTCAAATTAAACTTCCCATA TCAAA ACTCTCAAATTAAACTTCCCATATCAAAATGAAGCACATATCTAGCTCCATTGAAATTT CAGAAATGGGAGCAGAGAGGCACGAGAAAGAGGAAATGCATAACAAGGCAACTGTGGAAGCACTC TACAAGGCCCTTTTAGGGCAAGGCCAAATGGACACGGTTGCAACAAACATGCTAGCAAGTGACCT CGAGTGGTGGTTCCATGGCCCCCCTCAGTGCCAGCACATGATGAGGGTGCTCACCGGAGAGACCA CCCTCGACAACAACGGGTTCCGGTTCGAGCCAAGGAGCGTCACCGCCATCGGCGATTGCGTGATC GCCGAGGGGTGGGAAGGCAAGGCCTACTGGGTGCATGTTTGGACCGTGAGGAATGGCCTTATCAC GCAGTTTAGAGAGTACTTCAACACATGGCTTGTGGTCCGGGACTTGAGGTCTCAAAGGTGGGAAA 
ACAAGCTTGACAACATGACATTGTGGCAAAGCCAGCCGCGTGATCTATATCGTCGGTCCCTGCCA GGACTTGTGCTAGCCATTTAGCTACAAGAGAGAAATGCTAGCAATGTACTAAAAAAAAAACAGAG AAATTATTATATAGACTAAGACCATAGTGTGTTTGTAATCCTGATCAATCTAAATGTGACAAAAT TAAGTTTTATTAACTCTGTCACGTTCAGATTGATCAGAACCATAGACATGTAGTAGTGGTCATCG ACTATACAATAAATTCTCCATAAGATGAAATTTATGTGAGGTTTATTAGTTTTGTTTAAATTAAG TGTGTTATTATTTACCAAAACCTTCATGAATTTGATCCAATAAGGGAAAGTGTTTTGAAAAGAGT ATGTTGATGGTATAGGTCTTGGGTGAAGTTCAAGGATGTGAAATTGTGGGCAAAATAATAGCTTG GGATGTAACCTTTGCAGGGTTGTGTACCTGTGGTCAACTCAACCACATTGAGTTTTGGTACATAT TTGGTTTGTGTAACTGATTAATAGAAAATGACGGTCG

APPENDIX B.

ESTS

GmWI-1:

Accession | Description | Score (Bits) | E-value | Tissue | Treatment | Cultivar
GR854937.1 | CCCC19967.b1 CCCC Glycine max callus grown in ... | 1478 | 0 | Callus | Dark | Williams 82
GR842685.1 | CCCC12683.b1 CCCC Glycine max callus grown in ... | 1454 | 0 | Callus | Dark | Williams 82
GR827959.1 | CCCC3172.b1 CCCC Glycine max callus grown in d... | 1432 | 0 | Callus | Dark | Williams 82
GR837180.1 | CCCC890.b1 CCCC Glycine max callus grown in da... | 1428 | 0 | Callus | Dark | Williams 82
FK019793.1 | GLNDE48TF JCVI-SOY3 Glycine max cDNA 5', mRNA ... | 1384 | 0 | Nodules, root tip, roots | | Williams 82
BE661337.1 | 474 GmaxSC Glycine max cDNA, mRNA sequence | 1373 | 0 | Seed coats | | Harosoy 63
GR854938.1 | CCCC19967.g1 CCCC Glycine max callus grown in ... | 1354 | 0 | Callus | Dark | Williams 82
GR837181.1 | CCCC890.g1 CCCC Glycine max callus grown in da... | 1349 | 0 | Callus | Dark | Williams 82
GR842686.1 | CCCC12683.g1 CCCC Glycine max callus grown in ... | 1325 | 0 | Callus | Dark | Williams 82
BE661338.1 | 478 GmaxSC Glycine max cDNA, mRNA sequence | 1315 | 0 | Seed coats | | Harosoy 63
EH260355.1 | JGI_ACBU3420.fwd ACBU Phakopsora pachyrhizi in... | 1306 | 0 | Leaf | Phakopsora pachyrhizi TW72-1 | Williams
GR848325.1 | CCCC16223.b1 CCCC Glycine max callus grown in ... | 1301 | 0 | Callus | Dark | Williams 82
EV278391.1 | GLNA619TF JCVI-SOY3 Glycine max cDNA 5', mRNA ... | 1286 | 0 | Nodules, root tip, roots | Salt- and drought-stressed, waterlogged | Williams 82
GR849501.1 | CCCC16962.g1 CCCC Glycine max callus grown in ... | 1230 | 0 | Callus | Dark | Williams 82
GR849500.1 | CCCC16962.b1 CCCC Glycine max callus grown in ... | 1214 | 0 | Callus | Dark | Williams 82
CD406629.1 | Gm_ck31522 Soybean induced by Salicylic Acid G... | 1203 | 0 | Seedlings | Salicylic acid | Kefeng 1
FK014314.1 | GLQAV71TR JCVI-SOY2 Glycine max cDNA 5', mRNA ... | 1190 | 0 | Stem, leaves, and etiolated seedlings | Pseudomonas | Williams 82
CF808381.1 | psHB034xN16f USDA-IFAFS:Expression of Phytopht... | 1173 | 0 | Hypocotyl | Phytophthora sojae | Harosoy
GR826208.1 | CCCC2071.b1 CCCC Glycine max callus grown in d... | 1170 | 0 | Callus | Dark | Williams 82
GR852283.1 | CCCC18680.b1 CCCC Glycine max callus grown in ... | 1151 | 0 | Callus | Dark | Williams 82
BU764182.1 | sas53f08.y1 Gm-c1023 Glycine max cDNA clone SO... | 1149 | 0 | Seed coats | | T157
GR844294.1 | CCCC13711.b1 CCCC Glycine max callus grown in ... | 1136 | 0 | Callus | Dark | Williams 82
GR844418.1 | CCCC13791.b1 CCCC Glycine max callus grown in ... | 1131 | 0 | Callus | Dark | Williams 82
GR852284.1 | CCCC18680.g1 CCCC Glycine max callus grown in ... | 1125 | 0 | Callus | Dark | Williams 82
GR848326.1 | CCCC16223.g1 CCCC Glycine max callus grown in ... | 1110 | 0 | Callus | Dark | Williams 82
AW570157.1 | sj18e08.y1 Gm-c1032 Glycine max cDNA clone GEN... | 1081 | 0 | Cotyledons | | Williams
GR844295.1 | CCCC13711.g1 CCCC Glycine max callus grown in ... | 1075 | 0 | Callus | Dark | Williams 82
AW132993.1 | se13a07.y1 Gm-c1013 Glycine max cDNA clone GEN... | 1062 | 0 | Seedlings | | Williams
BI426633.1 | sag05f05.y1 Gm-c1080 Glycine max cDNA clone GE... | 1061 | 0 | Roots | Bradyrhizobium japonicum | Bragg NTS382
GR826209.1 | CCCC2071.g1 CCCC Glycine max callus grown in d... | 1059 | 0 | Callus | Dark | Williams 82
BQ786333.1 | saq66f06.y1 Gm-c1076 Glycine max cDNA clone SO... | 1059 | 0 | Wounded cotyledons | Phytophthora sojae | Williams 82
BM177174.1 | saj77g04.y1 Gm-c1074 Glycine max cDNA clone SO... | 1057 | 0 | Seedlings | Pseudomonas syringae | Williams 82
BU765127.1 | sar77d11.y2 Gm-c1074 Glycine max cDNA clone SO... | 1055 | 0 | Seedlings | Pseudomonas syringae | Williams 82
DB967435.1 | DB967435 full-length enriched soybean cDNA li... | 1051 | 0 | Mixture of seedling, young plant, adult plant | | Nourin No. 2
BU764546.1 | sas03b04.y2 Gm-c1080 Glycine max cDNA clone SO... | 1046 | 0 | Roots | Bradyrhizobium japonicum | Bragg NTS382
BM178356.1 | saj72a05.y1 Gm-c1072 Glycine max cDNA clone SO... | 1046 | 0 | Seedlings | Induced for symptoms of SDS (Sudden Death Syndrome) disease by the translocation of culture filtrate of Fusarium solani f. sp. glycines | PI567374
AW185699.1 | se58e04.y1 Gm-c1019 Glycine max cDNA clone GEN... | 1046 | 0 | Immature seed coats | | Williams
BM176915.1 | saj74d05.y1 Gm-c1074 Glycine max cDNA clone SO... | 1044 | 0 | Seedlings | Pseudomonas syringae | Williams 82
BQ785752.1 | saq87b10.y1 Gm-c1076 Glycine max cDNA clone SO... | 1042 | 0 | Wounded cotyledons | Phytophthora sojae | Williams 82
BQ079461.1 | san14a06.y1 Gm-c1084 Glycine max cDNA clone SO... | 1040 | 0 | Etiolated hypocotyls | Phytophthora sojae | Williams 82
BM891905.1 | sam45h11.y1 Gm-c1069 Glycine max cDNA clone SO... | 1026 | 0 | Cotyledons | | Williams
BQ252630.1 | san78e06.y2 Gm-c1052 Glycine max cDNA clone SO... | 1024 | 0 | Seedlings | | Harosoy
BU090042.1 | sr68c04.y1 Gm-c1052 Glycine max cDNA clone GEN... | 1022 | 0 | Seedlings | | Harosoy
CD391966.1 | Gm_ck10970 Soybean induced by Salicylic Acid G... | 1016 | 0 | Seedlings | | Kefeng 1
HO044432.1 | ocpsga0_0495_G02.ab1 Soybean immature seed ful... | 1003 | 0 | Seed | |
GR844788.1 | CCCC14011.g1 CCCC Glycine max callus grown in ... | 1003 | 0 | Callus | Dark | Williams 82
GR828552.1 | CCCC3550.g1 CCCC Glycine max callus grown in d... | 1003 | 0 | Callus | Dark | Williams 82
BM188130.1 | saj85a09.y1 Gm-c1074 Glycine max cDNA clone SO... | 1003 | 0 | Seedlings | Pseudomonas syringae | Williams 82
BF325250.1 | su20a03.y1 Gm-c1066 Glycine max cDNA clone GEN... | 1003 | 0 | Leaf and shoot tip | Salt-stressed | Williams
GR844787.1 | CCCC14011.b1 CCCC Glycine max callus grown in ... | 983 | 0 | Callus | Dark | Williams 82
HO026126.1 | ocpsga0_0226_E04.ab1 Soybean immature seed ful... | 966 | 0 | Seed | |
BM188531.1 | saj96h12.y1 Gm-c1074 Glycine max cDNA clone SO... | 966 | 0 | Seedlings | Pseudomonas syringae | Williams 82
HO011832.1 | ocpsga0_0043_C02.ab1 Soybean immature seed ful... | 955 | 0 | Seed | |
GR855498.1 | CCCC20313.b1 CCCC Glycine max callus grown in ... | 944 | 0 | Callus | Dark | Williams 82
CA801630.1 | sau08f07.y2 Gm-c1062 Glycine max cDNA clone SO... | 942 | 0 | Stem | | Raiden
BQ133572.1 | san58a09.y1 Gm-c1052 Glycine max cDNA clone SO... | 941 | 0 | Seedlings | | Harosoy
AW186445.1 | se67h07.y1 Gm-c1019 Glycine max cDNA clone GEN... | 935 | 0 | Immature seed coats | | Williams
BU091038.1 | su10e09.y1 Gm-c1066 Glycine max cDNA clone GEN... | 931 | 0 | Leaf and shoot tip | Salt-stressed | Williams
BG509606.1 | sad19b12.y1 Gm-c1074 Glycine max cDNA clone GE... | 920 | 0 | Seedlings | Pseudomonas syringae | Williams 82
BE821854.1 | GM700015B10H12 Gm-r1070 Glycine max cDNA clone... | 911 | 0 | Various | |
BG510264.1 | sac64e12.y1 Gm-c1072 Glycine max cDNA clone GE... | 894 | 0 | Seedlings | Induced for symptoms of SDS (Sudden Death Syndrome) disease by the translocation of culture filtrate of Fusarium solani f. sp. glycines | PI567374
BQ786459.1 | saq68e06.y1 Gm-c1076 Glycine max cDNA clone SO... | 893 | 0 | Wounded cotyledons | Phytophthora sojae | Williams 82
BU764695.1 | sas05c09.y2 Gm-c1080 Glycine max cDNA clone SO... | 889 | 0 | Roots | Bradyrhizobium japonicum | Bragg NTS382
BU578124.1 | sar47d07.y1 Gm-c1074 Glycine max cDNA clone SO... | 874 | 0 | Seedlings | Pseudomonas syringae | Williams 82
GR845551.1 | CCCC14473.b1 CCCC Glycine max callus grown in ... | 870 | 0 | Callus | Dark | Williams 82
HO045010.1 | ocpsga0_0505_A10.ab1 Soybean immature seed ful... | 867 | 0 | Seed | |
BG043257.1 | st99b05.y1 Gm-c1066 Glycine max cDNA clone GEN... | 856 | 0 | Leaf and shoot tip | Salt-stressed | Williams
BM523140.1 | sam79e05.y2 Gm-c1087 Glycine max cDNA clone SO... | 826 | 0 | Roots | | Williams 82
BM142724.1 | saj54a07.y1 Gm-c1072 Glycine max cDNA clone SO... | 817 | 0 | Seedlings | Induced for symptoms of SDS (Sudden Death Syndrome) disease by the translocation of culture filtrate of Fusarium solani f. sp. glycines | PI567374
BE057711.1 | sn06b06.y1 Gm-c1015 Glycine max cDNA clone GEN... | 815 | 0 | Flowers | | Williams 82
DB984702.1 | DB984702 full-length enriched soybean cDNA li... | 797 | 0 | Mixture of seedling, young plant, adult plant | | Nourin No. 2
BQ079470.1 | san14b08.y1 Gm-c1084 Glycine max cDNA clone SO... | 795 | 0 | Etiolated hypocotyls | Phytophthora sojae | Williams 82
CF806412.1 | psHB011xE05f USDA-IFAFS:Expression of Phytopht... | 787 | 0 | Hypocotyl | Phytophthora sojae | Harosoy
GR845552.1 | CCCC14473.g1 CCCC Glycine max callus grown in ... | 771 | 0 | Callus | Dark | Williams 82
BU090545.1 | su06b08.y1 Gm-c1066 Glycine max cDNA clone GEN... | 752 | 0 | Leaf and shoot tip | Salt-stressed | Williams
EV555038.1 | GmUV-B_172 Substract library of UV-B treated l... | 695 | 0 | Leaves | UV-B |
EG702119.1 | GmO3_183 Substract library of ozone treated le... | 695 | 0 | Leaves | Ozone |
GR851197.1 | CCCC18012.b1 CCCC Glycine max callus grown in ... | 671 | 0 | Callus | Dark | Williams 82
GR828551.1 | CCCC3550.b1 CCCC Glycine max callus grown in d... | 671 | 0 | Callus | Dark | Williams 82
GR851198.1 | CCCC18012.g1 CCCC Glycine max callus grown in ... | 665 | 0 | Callus | | Williams 82
FK452023.1 | 454GmaGlobSeed186655 Soybean Seeds Containing ... | 540 | 9.00E-152 | Seed | | Asgrow A3237
FK434230.1 | 454GmaGlobSeed168862 Soybean Seeds Containing ... | 538 | 3.00E-151 | Seed | | Asgrow A3237
BQ612541.1 | sap71e04.y1 Gm-c1087 Glycine max cDNA clone SO... | 523 | 9.00E-147 | Roots | | Williams 82
FK331020.1 | 454GmaGlobSeed65652 Soybean Seeds Containing G... | 505 | 3.00E-141 | Seed | | Asgrow A3237
GD871224.1 | 454GmaGlobSeed613873 Soybean Seeds Containing ... | 501 | 4.00E-140 | Seed | | Asgrow A3237
GD865352.1 | 454GmaGlobSeed608001 Soybean Seeds Containing ... | 499 | 2.00E-139 | Seed | | Asgrow A3237
GD951465.1 | 454GmaGlobSeed694114 Soybean Seeds Containing ... | 497 | 6.00E-139 | Seed | | Asgrow A3237
FK497401.1 | 454GmaGlobSeed232033 Soybean Seeds Containing ... | 496 | 2.00E-138 | Seed | | Asgrow A3237
GD879902.1 | 454GmaGlobSeed622551 Soybean Seeds Containing ... | 492 | 3.00E-137 | Seed | | Asgrow A3237
GD792585.1 | 454GmaGlobSeed535234 Soybean Seeds Containing ... | 492 | 3.00E-137 | Seed | | Asgrow A3237
GD883868.1 | 454GmaGlobSeed626517 Soybean Seeds Containing ... | 486 | 1.00E-135 | Seed | | Asgrow A3237
GD973465.1 | 454GmaGlobSeed716114 Soybean Seeds Containing ... | 484 | 4.00E-135 | Seed | | Asgrow A3237
AW394537.1 | sh32c07.y1 Gm-c1017 Glycine max cDNA clone GEN... | 481 | 6.00E-134 | Vegetable buds | Field-grown | Williams 82
FK518166.1 | 454GmaGlobSeed252798 Soybean Seeds Containing ... | 457 | 1.00E-126 | Seed | | Asgrow A3237
GD831550.1 | 454GmaGlobSeed574199 Soybean Seeds Containing ... | 451 | 4.00E-125 | Seed | | Asgrow A3237
FK280849.1 | 454GmaGlobSeed15481 Soybean Seeds Containing G... | 444 | 7.00E-123 | Seed | | Asgrow A3237
GD774986.1 | 454GmaGlobSeed517635 Soybean Seeds Containing ... | 436 | 1.00E-120 | Seed | | Asgrow A3237
FK648247.1 | 454GmaGlobSeed382879 Soybean Seeds Containing ... | 436 | 1.00E-120 | Seed | | Asgrow A3237
FG995249.1 | GLLEI67TF JCVI-SOY1 Glycine max cDNA 5', mRNA ... | 436 | 1.00E-120 | Stem, apical meristem, flowers, young pods, green seeds | | Williams 82
FK623136.1 | 454GmaGlobSeed357768 Soybean Seeds Containing ... | 425 | 3.00E-117 | Seed | | Asgrow A3237
FK379654.1 | 454GmaGlobSeed114286 Soybean Seeds Containing ... | 411 | 8.00E-113 | Seed | | Asgrow A3237
FK662044.1 | 454GmaGlobSeed396676 Soybean Seeds Containing ... | 409 | 3.00E-112 | Seed | | Asgrow A3237
FK459622.1 | 454GmaGlobSeed194254 Soybean Seeds Containing ... | 409 | 3.00E-112 | Seed | | Asgrow A3237
GD749509.1 | 454GmaGlobSeed492158 Soybean Seeds Containing ... | 398 | 6.00E-109 | Seed | | Asgrow A3237
FK277731.1 | 454GmaGlobSeed12363 Soybean Seeds Containing G... | 398 | 6.00E-109 | Seed | | Asgrow A3237
GD763526.1 | 454GmaGlobSeed506175 Soybean Seeds Containing ... | 370 | 1.00E-100 | Seed | | Asgrow A3237
FK636487.1 | 454GmaGlobSeed371119 Soybean Seeds Containing ... | 363 | 2.00E-98 | Seed | | Asgrow A3237
FK494262.1 | 454GmaGlobSeed228894 Soybean Seeds Containing ... | 361 | 8.00E-98 | Seed | | Asgrow A3237
FK469201.1 | 454GmaGlobSeed203833 Soybean Seeds Containing ... | 313 | 2.00E-83 | Seed | | Asgrow A3237
GE049470.1 | 454GmaGlobSeed792118 Soybean Seeds Containing ... | 311 | 8.00E-83 | Seed | | Asgrow A3237
GD723986.1 | 454GmaGlobSeed466635 Soybean Seeds Containing ... | 311 | 8.00E-83 | Seed | | Asgrow A3237
FK564341.1 | 454GmaGlobSeed298973 Soybean Seeds Containing ... | 305 | 4.00E-81 | Seed | | Asgrow A3237
GD946536.1 | 454GmaGlobSeed689185 Soybean Seeds Containing ... | 302 | 5.00E-80 | Seed | | Asgrow A3237
FK341814.1 | 454GmaGlobSeed76446 Soybean Seeds Containing G... | 270 | 1.00E-70 | Seed | | Asgrow A3237
FK286461.1 | 454GmaGlobSeed21093 Soybean Seeds Containing G... | 270 | 1.00E-70 | Seed | | Asgrow A3237
GE055926.1 | 454GmaGlobSeed798574 Soybean Seeds Containing ... | 268 | 5.00E-70 | Seed | | Asgrow A3237
FK298591.1 | 454GmaGlobSeed33223 Soybean Seeds Containing G... | 268 | 5.00E-70 | Seed | | Asgrow A3237
GE026660.1 | 454GmaGlobSeed769308 Soybean Seeds Containing ... | 243 | 3.00E-62 | Seed | | Asgrow A3237
FK486771.1 | 454GmaGlobSeed221403 Soybean Seeds Containing ... | 209 | 3.00E-52 | Seed | | Asgrow A3237
FK362344.1 | 454GmaGlobSeed96976 Soybean Seeds Containing G... | 193 | 3.00E-47 | Seed | | Asgrow A3237
FK014313.1 | GLQAV71TF JCVI-SOY2 Glycine max cDNA 5', mRNA ... | 185 | 5.00E-45 | Stem, leaves, etiolated seedlings | Pseudomonas | Williams 82
FK540059.1 | 454GmaGlobSeed274691 Soybean Seeds Containing ... | 130 | 2.00E-28 | Seed | | Asgrow A3237
GD662805.1 | 454GmaGlobSeed405454 Soybean Seeds Containing ... | 104 | 1.00E-20 | Seed | | Asgrow A3237
FK407305.1 | 454GmaGlobSeed141937 Soybean Seeds Containing ... | 91.6 | 1.00E-16 | Seed | | Asgrow A3237

GmWI-2:

Accession | Description | Score (Bits) | E-value | Tissue | Treatment | Cultivar
BF009080.1 | ss73b08.y1 Gm-c1062 Glycine max cDNA clone GEN… | 833 | 0.0 | Stem | | Raiden
BM143160.1 | saj40d03.y1 Gm-c1066 Glycine max cDNA clone SO… | 771 | 0.0 | Leaf and shoot tip | Salt-stressed | Williams
GD690307.1 | 454GmaGlobSeed432956 Soybean Seeds Containing … | 483 | 1.E-134 | Seed | | Asgrow A3237
GE046898.1 | 454GmaGlobSeed789546 Soybean Seeds Containing … | 392 | 2.E-107 | Seed | | Asgrow A3237
GD666161.1 | 454GmaGlobSeed408810 Soybean Seeds Containing … | 198 | 6.E-49 | Seed | | Asgrow A3237
FK588181.1 | 454GmaGlobSeed322813 Soybean Seeds Containing ... | 154 | 1.E-35 | Seed | | Asgrow A3237
GD784905.1 | 454GmaGlobSeed527554 Soybean Seeds Containing … | 147 | 2.E-33 | Seed | | Asgrow A3237
GD766089.1 | 454GmaGlobSeed508738 Soybean Seeds Containing … | 147 | 2.E-33 | Seed | | Asgrow A3237

GmWI-3:

Accession | Description | Score (Bits) | E-value | Tissue | Treatment | Cultivar
BW671115.1 | BW671115 GMFL01 Glycine max cDNA clone GMFL01… | 1208 | 0 | Various | N/A | N/A
BW671116.1 | BW671116 GMFL01 Glycine max cDNA clone GMFL01… | 900 | 0 | Various | N/A | N/A

GmWI-4:

Accession | Description | Score (Bits) | E-value | Tissue | Treatment | Cultivar
FK019067.1 | GLND636TF JCVI-SOY3 Glycine max cDNA 5', mRNA … | 848 | 0 | Nodules, root tip, strip root, whole root, inoculated root, drought stressed root, salt stressed root, waterlogged root | None | Williams 82
FK014415.1 | GLNBO65TF JCVI-SOY3 Glycine max cDNA 5', mRNA … | 848 | 0 | Nodules, root tip, strip root, whole root, inoculated root, drought stressed root, salt stressed root, waterlogged root | None | Williams 82

GmWI-7:

Accession | Description | Score (Bits) | E-value | Tissue | Treatment | Cultivar
FK000414.1 | GLPAG49TR JCVI-SOY1 Glycine max cDNA 5', mRNA … | 1304 | 0 | Stem, apical meristem, flowers, young pods, green seeds (Various) | N/A | Williams 82
FK000413.1 | GLPAG49TF JCVI-SOY1 Glycine max cDNA 5', mRNA … | 147 | 2.E-33 | Stem, apical meristem, flowers, young pods, green seeds (Various) | N/A | Williams 82

GmWI-8:

Accession | Description | Score (Bits) | E-value | Tissue | Treatment | Cultivar
BF425343.1 | su55e11.y1 Gm-c1069 Glycine max cDNA clone GEN… | 767 | 0 | Degenerating cotyledons of 9-10 day old etiolated seedlings (Cotyledons) | None | Williams

GmWI-10:

Accession | Description | Score (Bits) | E-value | Tissue | Treatment | Cultivar
CO984163.1 | GM89021A2F07.r1 Gm-r1089 Glycine max cDNA clon... | 641 | 0 | Various | Various | Various
BU762196.1 | sar85f07.y1 Gm-c1074 Glycine max cDNA clone SO… | 588 | 3.E-166 | Seedlings | Pseudomonas syringae | Williams 82

APPENDIX C.

PHYLOGENETIC TREE FILE


(((((GmWI-10:0.09693,(GmWI-8:0.04548,GmWI-7:0.03785):0.06060):0.14805,((GmWI-6:0.06771,GmWI-5:0.04885):0.08333,Medtr3g033850.1:0.12656):0.14304):0.01856,((GmWI-2:0.02699,GmWI-1:0.01974):0.09760,Medtr3g097160.1:0.14095):0.14590):0.03499,((GmWI-4:0.06556,GmWI-3:0.07170):0.11444,(GmWI-9:0.13872,Medtr5g082480.1:0.15380):0.05984):0.08070):0.02293,Selmo73443:0.31853,Medtr7g077370.1:0.53332);
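
The tree above is written in Newick format, where each leaf label is followed by a colon and its terminal branch length. As a minimal sketch (not part of the original analysis), the leaf labels and their branch lengths can be listed with a regular expression. The function name `leaf_branch_lengths` is hypothetical, and the approach assumes that every leaf label starts with a letter and sits directly before its colon, which holds for this file but not for Newick files in general.

```python
import re

# Newick string copied from this appendix, with whitespace removed.
TREE = (
    "(((((GmWI-10:0.09693,(GmWI-8:0.04548,GmWI-7:0.03785):0.06060):0.14805,"
    "((GmWI-6:0.06771,GmWI-5:0.04885):0.08333,Medtr3g033850.1:0.12656):0.14304):0.01856,"
    "((GmWI-2:0.02699,GmWI-1:0.01974):0.09760,Medtr3g097160.1:0.14095):0.14590):0.03499,"
    "((GmWI-4:0.06556,GmWI-3:0.07170):0.11444,"
    "(GmWI-9:0.13872,Medtr5g082480.1:0.15380):0.05984):0.08070):0.02293,"
    "Selmo73443:0.31853,Medtr7g077370.1:0.53332);"
)


def leaf_branch_lengths(newick: str) -> dict:
    """Return {leaf label: terminal branch length}.

    Internal branch lengths follow a ')' instead of a label,
    so the pattern below does not pick them up.
    """
    compact = "".join(newick.split())  # tolerate spaces and line breaks
    pairs = re.findall(r"([A-Za-z][\w.\-]*):(\d+\.\d+)", compact)
    return {name: float(length) for name, length in pairs}


leaves = leaf_branch_lengths(TREE)
# Print leaves from shortest to longest terminal branch.
for name, length in sorted(leaves.items(), key=lambda item: item[1]):
    print(f"{name}\t{length:.5f}")
```

This only reads terminal branches; recovering the full topology would need a proper Newick parser from a phylogenetics library.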

APPENDIX D.

PGENTHREADER AND PDOMTHREADER PREDICTIONS


GmWI-1 pGenTHREADER:

10 20 30 40 50 60 CEEEEEECCCCCCHHHHHHHHHHHHHHHHCCCCCEEEEEEECCEEEEEECCCCCCHHHHH 3oepA0 VRGFYLRFGEGVSEEANRRALALAEALLRAPPPGLLDAVPAYGVLYLEYDPRRLSRGRLL Query ------

70 80 90 100 110 120 HHHHHCCCEEEEEEECCCCCHHHHHHHHCCCHHHHHHHHHCCCEEEEEECCCCCCEEEEC 3oepA0 RLLKGLPRVVEIPVRYDGEDLPEVASRLGLSLEAVKALHQKPLYRVYALGFTPGFPFLAE Query ------

130 140 150 160 170 180 CCCCCCCCCCCCCEEEECCCEEEEECCEEEECCCCEEECCEEEEEECCCCCCCCCCCCCC 3oepA0 VEPALRLPRKPHPRPRVPAHAVAVAGVQTGIYPLPSPGGWNLLGTSLVAVYDPHRETPFL Query ------

190 200 210 220 230 240 CCCCCEEEEEECCCCCCCCCCCCCCCCCCCCCEEEEEEECCCCCEEECCCCCCCCCCCCC 3oepA0 LRPGDRVRFLEAEGPTPPEPRPLELLPEEPRLPALLVEEPGLMDLVVDGGRFLGGHLGLA Query ------

250 260 270 280 290 300 CCCCCCHHHHHHHHHHCCCCCCCCEEEEECCCCEEEECCCEEEEEEECCEEEEECCEEEC 3oepA0 RSGPLDAPSARLANRLVGNGAGAPLLEFAYKGPVLTALRDLVAAFAGYGFVALLEGEEIP Query ------MRMLTGDSAADNSFRFVPQ--SIAAFGSTVIVE-GCD------S ------CCEECCCCCCCCCEEEEEC--EEEEECCEEEEE-ECC------C 10 20 30

310 320 330 340 CC-----EEEEECCCCEEEEEECCCCCEEEEEECCCCCCCCCCCCCCCC------CCCC 3oepA0 PG-----QSFLWPRGKTLRFRPRGPGVRGYLAVAGGLEVRPFLGSASPD------LRGR Query ARNIAWVHAWTVTDGMITQIREYFNTALTVTRIHDSGEIVPARSGAGRLPCVWESSVSGR CCCEEEEEEEEECCCCEEEEEEEEEEEEEEEEECCCCCCCCCCCCCCCCCEEEECCCCCC 40 50 60 70 80 90

360 370 380 390 400 CCCCCCCCCEEECCCCCCCCCCCCCCCCCCCCCEEEEEEECCCCCHHHHHHHCCCCEEEE 3oepA0 IGRPLWAGDVLGLEALRPVRPGRAFPQRPLPEAFRLRLLPGPQFAGEAFRALCSGPFRVA Query VGKSV-PGLVLAI------CCCCC-CCEEEEC------100

420 430 440 450 460 EECCCEEEEECCCCCCCEEEEEECCCCEEEECCCCCCEEECCCCCCEEEEEEEEEECHHH 3oepA0 RADRVGVELLGPEVPGGEGLSEPTPLGGVQVPPSGRPLVLLADKGSLGGYAKPALVDPRD Query ------


480 HHHHCCCCCCCEEEEEC 3oepA0 LWLLGQARPGVEIHFTS Query ------

Percentage Identity = 16.8%
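
Each alignment in this appendix ends with a percentage identity between the query and the template. The exact denominator pGenTHREADER uses is not stated here, so the following is only a sketch under one common assumption: identity is counted over columns where neither sequence has a gap. The function name `percent_identity` and the two short sequences are illustrative and are not taken from the thesis data, so the result is not expected to reproduce the reported values.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where both rows have a residue.

    Assumption: gap-containing columns are excluded from the denominator.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which both sequences carry a residue.
    aligned = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return 100.0 * matches / len(aligned)


# Toy alignment, made up for illustration only.
template = "GMST--QENVQ"
query = "-MFTCVPELAN"
print(f"Percentage Identity = {percent_identity(template, query):.1f}%")
```

Other conventions (dividing by the query length or by the full alignment length) give different numbers for the same alignment, which is worth keeping in mind when comparing values across tools.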

GmWI-2 pGenTHREADER:

10 20 30 40 50 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-----CEEEECCCEEEEEECCCCC---EE 1ufgA0 GSSGSSGQSQGGGSVTKKRKLESSESRSSFS-----QHARTSGRVAVEEVDEEGK---FV Query ------MRMLTGDSAADNSFRFLPQSIAAFGSTVIVEGCDTARNIAWV ------CCEECCCCCCCCCEEEEECEEEEECCEEEEEECCCCCCEEEE 10 20 30 40

60 70 80 90 100 EEEECCCCCEECCCCEEEEEEC---CCCCEEEECCCCCEECCCCEEEEEECCCCCCCCCC 1ufgA0 RLRNKSNEDQSMGNWQIRRQNG---DDPLMTYRFPPKFTLKAGQVVTIWASGAGATHSPP Query HACTVTDG------IITQIREYFNTALTVTRIHDS-----GE---IVPASSGAGRLP- EEEEECCC------CEEEEEEEEEEEEEEEEECCC-----CC---CCCCCCCCCCCC- 50 60 70 80

120 130 140 150 CEEEECCCCCCCCC----CEEEEEEECCCCCEEEEEEEECCCCCCC 1ufgA0 TDLVWKAQNTWGCG----SSLRTALINSTGEEVAMRKLVRSGPSSG Query --CVWESSVSGRVGKSVPGLVLAI------EEEECCCCCCCCCCCCCEEEEC------90 100

Percentage Identity = 16.8%

GmWI-3 pGenTHREADER:

10 20 30 40 CCCC------HHHHHHHHHHHHCCCCCCHHHHHHCEEEEEEEEECCCC 3fgyA0 GMST------QENVQIVKDFFAAMGRGDKKGLLAVSAEDIEWIIPGEW Query -MFTCVPELANSQSQECLEEYVERNTRVVTELYKALTSKDPETLHGLVTQDLEWWFHGPP -CCCCHHHHCCCCCHHHHHHHHHHHHHHHHHHHHHHHCCCHHHHHHHCCCCCEEEEECCC 10 20 30 40 50

50 60 70 80 90 100 CCCEEEEHHHHHHHHHHHHHCCCEEECCCCCEEEEECCEEEEEEEEEEECCCCCCEEEEE 3fgyA0 PLAGTHRGHAALAALLQKASEMVEISYPEPPEFVAQGERVLVVGFATGRVKSTNRTFEDD Query CH------RHHLVPWLTGSSPSSKALV--PQHMVGFGPVVIAEGFDDDHL------VW CC------CHHHHHHHHCCCCCCCEEE--CCEEEEECCEEEEEEECCCEE------EE 70 80 90 100


110 120 130 EEEEEEEE-CCEEEEEEEECCHH----HHHHHCCCCCC------3fgyA0 WVFAITVR-KSKVTSIREYIDTL----ALARATNFNAT------Query WVHAWTVTADGLITQVKEYVNTSVTVTLLSQQVLPNASKCQCIWQSRLCDESVPGLILPI EEEEEEEECCCEEEEEEEEECCEEEEEEECCCCCCCCCCCCCEECCCCCCCCCCCEEECC 110 120 130 140 150 160

Percentage Identity = 23.7%

GmWI-4 pGenTHREADER:

10 20 30 40 50 CCCC------HHHHHHHHHHHHCCCCCCHHHHHHCEEEEEEEEECCCCCCCEEEEHH 3fgyA0 GMST------QENVQIVKDFFAAMGRGDKKGLLAVSAEDIEWIIPGEWPLAGTHRGH Query -LANSRECLEEYVEHNTRVVTELYKALTSKDPETLHGIVAQDLEWWFHGPSCH------R -CCCCHHHHHHHHHHHHHHHHHHHHHHHCCCHHHHHHCCCCCCEEEEECCCCH------H 10 20 30 40 50

60 70 80 90 100 110 HHHHHHHHHHHCCCEEECCCCCEEEEECCEEEEEEEEEEECCCCCCEEEEEEEEEEEEE- 3fgyA0 AALAALLQKASEMVEISYPEPPEFVAQGERVLVVGFATGRVKSTNRTFEDDWVFAITVR- Query HHLVPWLTGSSPSSKALV--PQHVVGFGPLVIAE----GFDEAH----LVWWVHAWTIRT HHHHHHHHCCCCCCEEEE--CCEEEEECCEEEEE----EECCCE----EEEEEEEEEEEC 60 70 80 90 100

120 130 -CCEEEEEEEECC-HHHHHHHCCCCCC------3fgyA0 -KSKVTSIREYID-TLALARATNFNAT------Query SDGLITQVKEYVNTSVSVTRLSQMLPNASNCQCIWQSRLCDESVPGLVLAI CCCCEEEEEEEEECHHHHHHHCCCCCCCCCCCEEEECCCCCCCCCCEEEEC 110 120 130 140 150

Percentage Identity = 20.0%

GmWI-5 pGenTHREADER:

10 20 30 40 ------CCC------CHHHHHHHHHHHHCC-CCCCHHHHHHCEEEEEEEEECCC 3fgyA0 ------GMS------TQENVQIVKDFFAAM-GRGDKKGLLAVSAEDIEWIIPGE Query MKIIPGSKAEPEKFSGTVTEHENRNRETVKLVYKALLRDADTEKLAKVVRAELEWWYHG- CCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHCCCCHHHHHHHCCCCCEEEEEC- 10 20 30 40 50

50 60 70 80 90 100 CCCCEEEEHHHHHHHHHHHHHCCCEE-ECCCCCEEEEECCEEEEEEEEEEECCCCCCEEE 3fgyA0 WPLAGTHRGHAALAALLQKASEMVEI-SYPEPPEFVAQGERVLVVGFATGRVKSTNRTFE Query -PPH-----CQHMMKVLTGESTTAQKAFKFRPRRIRAVGDRVVVEGW-----EGAGE--- -CCC-----HHHHHHHHCCCCCCCCCEEEEECCEEEEECCEEEEEEE-----CCCCE--- 70 80 90 100


110 120 130 EEEEEEEEEECCEEEEEEEECCHHHHHHHCCCCCC------3fgyA0 DDWVFAITVRKSKVTSIREYIDTLALARATNFNAT------Query -YWVHVWRFKHGIITQLREYFN------TLITVVHRVSEDGDEARLWRSTNRVR -EEEEEEEEECCEEEEEEEEEE------EEEEEEECCCCCCCCCEEECCCCCCC 110 120 130 140 150

------3fgyA0 ------Query VRVHGSLPDLVLSI CCCCCCCCCEEEEC 160

Percentage Identity = 20.7%

GmWI-6 pGenTHREADER:

10 20 30 40 ------CCC---CHHHHHHHHHHHHCCCCCCHHHHHHCEEEEEEEEECCCC 3fgyA0 ------GMS---TQENVQIVKDFFAAMGRGDKKGLLAVSAEDIEWIIPGEW Query MKTIPGSKAEPEKLSGTAADHENRNRETVKMVYKALRDGDTEKLAKVVRVELEWWYHGPP CCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHCCCHHHHHHHCCCCCEEEEECCC 10 20 30 40 50 60

50 60 70 80 90 100 CCCEEEEHHHHHHHHHHHHHCCCEE-ECCCCCEEEEECCEEEEEEEEEEECCCCCCEEEE 3fgyA0 PLAGTHRGHAALAALLQKASEMVEI-SYPEPPEFVAQGERVLVVGFATGRVKSTNRTFED Query HC------QHMKKVLTGESTATQKAFKFRPRRIRAVGDRVMVEGW-----EGAGE---- CC------CHHHHHHCCCCCCCCCEEEEECCEEEEECCEEEEEEE-----ECCCE---- 70 80 90 100

110 120 130 EEEEEEEEECCEEEEEEEECCHHHHHHHCCCCCC------3fgyA0 DWVFAITVRKSKVTSIREYIDTLALARATNFNAT------Query YWVHVWRLKHGIIAQLREYFN------TLITVVLRVLEDGDEARLWRSTDRVRV EEEEEEEEECCCEEEEEEEEE------EEEEEEEEECCCCCCCEEEEECCCCCC 110 120 130 140 150

------3fgyA0 ------Query QGSLPDIVLSI CCCCCCEEEEC 160

Percentage Identity = 20.0%

GmWI-7 pGenTHREADER:

10 20 30 40 50 CCCCHHHHHHHHHHHHCC-CCCCHHHHHHCEEEEEEEEECCCCCCCEEEEHHHHHHHHHH 3fgyA0 GMSTQENVQIVKDFFAAM-GRGDKKGLLAVSAEDIEWIIPGEWPLAGTHRGHAALAALLQ Query ----MQNKATVEMLYMALLGQETMDNVAKLLASDLEWWFHGPPQCH------HMMKVLT ----CCCHHHHHHHHHHHHCCCHHHHHHHHHCCCCCEEEECCCCCH------HHHHHHC 10 20 30 40


70 80 90 100 110 HHHCCCEEECCCCCEEEEECCEEEEEEEEEEECCCCCCEEEEEEEEEEEEECCEEEEEEE 3fgyA0 KASEMVEISYPEPPEFVAQGERVLVVGFATGRVKSTNRTFEDDWVFAITVRKSKVTSIRE Query GETDHTKGFRFEPKQVTAIGDCVIAEGW------EGK---AYWVHVWTLKNGLITQFRE CCCCCCCEEEECCCEEEEECCEEEEEEC------CCC---EEEEEEEEECCCEEEEEEE 60 70 80 90

130 ECCHHHHHHHCCCCCC------3fgyA0 YIDTLALARATNFNAT------Query YFNTWLVVRDLRTPRREDSKDSMTLWESQPRDLYHRSLPGLVLTI EEEEEEEEEECCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEC 110 120 130 140

Percentage Identity = 18.5%

GmWI-8 pGenTHREADER:

10 20 30 40 50 CCCC-----HHHHHHHHHHHHCC-CCCCHHHHHHCEEEEEEEEECCCCCCCEEEEHHHHH 3fgyA0 GMST-----QENVQIVKDFFAAM-GRGDKKGLLAVSAEDIEWIIPGEWPLAGTHRGHAAL Query -MEDHQKVEMQNKARVEVLYKALLGQGTMDNVAKLLASDLEWWFHGPPH------CHHM -CCCHHHHHHHHHHHHHHHHHHHHCCCCHHHHHHHCCCCCEEEEECCCC------HHHH 10 20 30 40 50

60 70 80 90 100 110 HHHHHHHHCCCEEECCCCCEEEEECCEEEEEEEEEEECCCCCCEEEEEEEEEEEEECCEE 3fgyA0 AALLQKASEMVEISYPEPPEFVAQGERVLVVGFATGRVKSTNRTFEDDWVFAITVRKSKV Query MKVLTGETDHTKGFRFEPRRVTAVGDCTIAEGWEGKAY------WVHVWTLKNGLI HHHHCCCCCCCCEEEEECCEEEEECCEEEEEEEEEEEE------EEEEEEEECCCE 60 70 80 90 100

120 130 EEEEEECCHHHHHHHCCCCCC------3fgyA0 TSIREYIDTLALARATNFNAT------Query TQFREYFNTWLVVRDLRPPRWEDSKDSMTLWQSQPRDLYHRSLPGLVLTI EEEEEEEEHHEEEHHCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEC 110 120 130 140 150

Percentage Identity = 19.3%

GmWI-9 pGenTHREADER:

10 20 30 40 ------CCCCHHHHHHHHHHHHCCCCCCHHHHHHCEEEEEEEEECCCCCCCEEEE 3fgyA0 ------GMSTQENVQIVKDFFAAMGRGDKKGLLAVSAEDIEWIIPGEWPLAGTHR Query ELANSSQKPWLEDQEDPNKRGVIDLYNAILSNDTDTMQRLLAPDLEWWFHGPPCH----- CCCCCCCCCCCCCCCCCCHHHHHHHHHHHHCCCHHHHHHHCCCCCCEEEECCCCC----- 10 20 30 40 50

60 70 80 90 100 HHHHHHHHHHHHHCCCEEECCCCCEEEEECCEEEEEEEEEEECCCCCCEEEEEEEEEEEE 3fgyA0 GHAALAALLQKASEMVEISYPEPPEFVAQGERVLVVGFATGRVKSTNRTFEDDWVFAITV Query -RHHLVPLLTGSSSKP----LVPDLVVGFGSVTIAEG-----FDESNL---VWWVHAWTT -CCHHHHHHCCCCCCC----EECCEEEEECCEEEEEE-----ECCCCE---EEEEEEEEE 60 70 80 90 100


120 130 EC-CEEEEEEEECCHHHHHHHCCCCCC------3fgyA0 RK-SKVTSIREYIDTLALARATNFNAT------Query TDEGIITEVREYVN------TSVTVTKLGLHVNNDDVVSATCQSVWQSKLCDES CCCCEEEEEEEEEE------EEEEEEEECCCCCCCCCCCCCCCCEECCCCCCCC 110 120 130 140 150

------3fgyA0 ------Query VPGLILAI CCCEEEEC

Percentage Identity = 20.7%

GmWI-10 pGenTHREADER:

10 20 30 40 ------CCCC-----HHHHHHHHHHHHCC-CCCCHHHHHHC-EEEEEEEEECC 3fgyA0 ------GMST-----QENVQIVKDFFAAM-GRGDKKGLLAV-SAEDIEWIIPG Query MKHISSSIEISEMGAERHEKEEMHNKATVEALYKALLGQGQMDTVATNMLASDLEWWFHG CCCCCCCHHHHCCCCCCCCHHHCCHHHHHHHHHHHHHCCCCHHHHHHHHCCCCCEEEEEC 10 20 30 40 50 60

50 60 70 80 90 CCCCCEEEEHHHHHHHHHHHHHCCC-EEECCCCCEEEEECCEEEEEEEEEEECCCCCCEE 3fgyA0 EWPLAGTHRGHAALAALLQKASEMV-EISYPEPPEFVAQGERVLVVGFATGRVKSTNRTF Query PPQC------QHMMRVLTGETTLDNNGFRFEPRSVTAIGDCVIAEGWEGKAY------CCCC------HHHHHHHHCCCCCCCCCEEEECCEEEEECCEEEEEEEECCEE------70 80 90 100

110 120 130 EEEEEEEEEEECCEEEEEEEECCHHHHHHHCCCCCC------3fgyA0 EDDWVFAITVRKSKVTSIREYIDTLALARATNFNAT------Query ---WVHVWTVRNGLITQFREYFNTWLVVRDLRSQRWENKLDNMTLWQSQPRDLYRRSLPG ---EEEEEEEECCEEEEEEEEEEEEEEEEECCCCCCCCCCCCCEEEECCCCCCCCCCCCC 110 120 130 140 150 160

----- 3fgyA0 ----- Query LVLAI EEEEC

Percentage Identity = 21.5%


GmWI-1 pDomTHREADER:

------1ad2A02 ------Query MRMLTGDSAADNSFRFVPQSIAAFGSTVIVEGCDSARNIAWVHAWTVTDGMITQIREYFN CCEECCCCCCCCCEEEEECEEEEECCEEEEEECCCCCCEEEEEEEEECCCCEEEEEEEEE 10 20 30 40 50 60

10 20 ------CC-CCCCC--EEEECCCHHHHHHHHCC 1ad2A02 ------GL-GKQVR--VLAIAKGEKIKEAEEAG Query TALTVTRIHDSGEIVPARSGAGRLPCVWESSVSGRVGKSVPGLVLAI------EEEEEEEECCCCCCCCCCCCCCCCCEEEECCCCCCCCCCCCCEEEEC------70 80 90 100

30 40 50 60 70 80 CCEEECHHHHHHHHCCCCCCCEEEECHHHHHHHHHHHHHHHHHHCCCCCCCCCCCCCCHH 1ad2A02 ADYVGGEEIIQKILDGWMDFDAVVATPDVMGAVGSKLGRILGPRGLLPNPKAGTVGFNIG Query ------

90 HHHHHHHCC 1ad2A02 EIIREIKAG Query ------

Percentage Identity = 8.6%

GmWI-2 pDomTHREADER:

------1ad2A02 ------Query MRMLTGDSAADNSFRFLPQSIAAFGSTVIVEGCDTARNIAWVHACTVTDGIITQIREYFN CCEECCCCCCCCCEEEEECEEEEECCEEEEEECCCCCCEEEEEEEEECCCCEEEEEEEEE 10 20 30 40 50 60

10 20 ------CC-CCCCC--EEEECCCHHHHHHHHCC 1ad2A02 ------GL-GKQVR--VLAIAKGEKIKEAEEAG Query TALTVTRIHDSGEIVPASSGAGRLPCVWESSVSGRVGKSVPGLVLAI------EEEEEEEECCCCCCCCCCCCCCCCCEEEECCCCCCCCCCCCCEEEEC------70 80 90 100

30 40 50 60 70 80 CCEEECHHHHHHHHCCCCCCCEEEECHHHHHHHHHHHHHHHHHHCCCCCCCCCCCCCCHH 1ad2A02 ADYVGGEEIIQKILDGWMDFDAVVATPDVMGAVGSKLGRILGPRGLLPNPKAGTVGFNIG Query ------


90 HHHHHHHCC 1ad2A02 EIIREIKAG Query ------

Percentage Identity = 8.6%

GmWI-3 pDomTHREADER:

10 20 30 40 ------CHHHHHHHHHHHHHHHHHHCCHHHHHHHEEEEEEEEECCCCC 1tuhA00 ------NEAEQNAETVRRGYAAFNSGDMKTLTELFDENASWHTPGRSR Query MFTCVPELANSQSQECLEEYVERNTRVVTELYKALTSKDPETLHGLVTQDLEWWFHGPPC CCCCHHHHCCCCCHHHHHHHHHHHHHHHHHHHHHHHCCCHHHHHHHCCCCCEEEEECCCC 10 20 30 40 50 60

50 60 70 80 90 100 CCEEEECHHHHHHHHHHHHHCCCCCCEEEEEEEEECCCCCEEEEEEEEEEECCEEEEEEE 1tuhA00 IAGDHKGREAIFAQFGRYGGETGGTFKAVLLHVLKSDDGRVIGIHRNTAERGGKRLDVGC Query HR------HHLVPWLTGSSPSSKALVPQHMVGFGP------VVIAEGFDDDHL-VWW CC------HHHHHHHHCCCCCCCEEECCEEEEECC------EEEEEEECCCEE-EEE 70 80 90 100

110 120 130 EEEEEEE-CCEEEEEEEEECCHHHHHHHHC------1tuhA00 CIVFEFK-NGRVIDGREHFYDLYAWDEFWR------Query VHAWTVTADGLITQVKEYVNT------SVTVTLLSQQVLPNASKCQCIWQSRLCDESV EEEEEEECCCEEEEEEEEECC------EEEEEEECCCCCCCCCCCCCEECCCCCCCCC 110 120 130 140 150

------1tuhA00 ------Query PGLILPI CCEEECC 160

Percentage Identity = 14.5%

GmWI-4 pDomTHREADER:

10 20 30 40 50 ------CHHHHHHHHHHHHHHHHHHCCHHHHHHHEEEEEEEEECCCCCCCEEEECHH 1tuhA00 ------NEAEQNAETVRRGYAAFNSGDMKTLTELFDENASWHTPGRSRIAGDHKGRE Query LANSRECLEEYVEHNTRVVTELYKALTSKDPETLHGIVAQDLEWWFHGPSCHR------CCCCHHHHHHHHHHHHHHHHHHHHHHHCCCHHHHHHCCCCCCEEEEECCCCHH------10 20 30 40 50

60 70 80 90 100 HHHHHHHHHHHCCCCCCEEEEEEEEECCCCCEEEEEEEEEEECCEEEEEEEEEEEEEE-- 1tuhA00 AIFAQFGRYGGETGGTFKAVLLHVLKSDDGRVIGIHRNTAERGGKRLDVGCCIVFEFK-- Query --HHLVPWLTGSSPSSKALVPQHVVGFGP------LVIAEGFDEAHL-VWWVHAWTIRTS --HHHHHHHHCCCCCCEEEECCEEEEECC------EEEEEEECCCEE-EEEEEEEEEECC 60 70 80 90 100


120 130 CCEEEEEEEEECCHHHHHHHHC------1tuhA00 NGRVIDGREHFYDLYAWDEFWR------Query DGLITQVKEYVNTSVSVTRLSQMLPNASNCQCIWQSRLCDESVPGLVLAI CCCEEEEEEEEECHHHHHHHCCCCCCCCCCCEEEECCCCCCCCCCEEEEC 110 120 130 140 150

Percentage Identity = 15.3%

GmWI-5 pDomTHREADER:

10 20 30 40 50 --CCCCCCCCCCCCCCCCCC---HHHHHHHHHHHHHHHC-CHHHHHCCCCCCCEEEECCC 1nwwA00 --IEQPRWASKDSAAGAAST---PDEKIVLEFMDALTSN-DAAKLIEYFAEDTMYQNMPL Query MKIIPGSKAEPEKFSGTVTEHENRNRETVKLVYKALLRDADTEKLAKVVRAELEWWYHGP CCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHCCCCHHHHHHHCCCCCEEEEECC 10 20 30 40 50 60

60 70 80 90 100 CCEECHHHHHHHHHHHCCCCCCC---CEEEE--EEECCCCCCEEEEEECCCCCCCCCCCC 1nwwA00 PPAYGRDAVEQTLAGLFTV-SID---AVETF--HIGSSNG-VYTERVDV-RALPTGKSYN Query PH------CQHMMKVLTGESTTAQKAFKFRPRRIRAVGDRVVVEGWEGAGE------CC------HHHHHHHHCCCCCCCCCEEEEECCEEEEECCEEEEEEECCCCE------70 80 90 100

120 130 140 EEEEEEEEEECCEEEEEEEECC------HHHHHHHHCCCCCC--- 1nwwA00 LSILGVFQLTEGKITGWRDYFD------LREFEEAVDLPLRG--- Query -YWVHVWRFKHGIITQLREYFNTLITVVHRVSEDGDEARLWRSTNRVRVRVHGSLPDLVL -EEEEEEEEECCEEEEEEEEEEEEEEEEECCCCCCCCCEEECCCCCCCCCCCCCCCCEEE 110 120 130 140 150 160

-- 1nwwA00 -- Query SI EC

Percentage Identity = 17.9%

GmWI-6 pDomTHREADER:

10 20 30 40 50 CCCCCCCCCC-----CCCCCCCCHHHHHHHHHHHHHHHCCHHHHHCCCCCCCEEEECCCC 1nwwA00 IEQPRWASKD-----SAAGAASTPDEKIVLEFMDALTSNDAAKLIEYFAEDTMYQNMPLP Query MKTIPGSKAEPEKLSGTAADHENRNRETVKMVYKALRDGDTEKLAKVVRVELEWWYHGPP CCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHCCCHHHHHHHCCCCCEEEEECCC 10 20 30 40 50 60


60 70 80 90 100 110 CEECHHHHHHHHHHHCCCCCCCC---EEEE--EEECCCCCCEEEEEECCCCCCCCCCCCE 1nwwA00 PAYGRDAVEQTLAGLFTV-SIDA---VETF--HIGSSNG-VYTERVDV-RALPTGKSYNL Query H------CQHMKKVLTGESTATQKAFKFRPRRIRAVGDRVMVEGWEGAGE------C------CCHHHHHHCCCCCCCCCEEEEECCEEEEECCEEEEEEEECCCE------70 80 90 100

120 130 140 EEEEEEEEECCEEEEEEEECCHHHHHHHHCCCCCC------1nwwA00 SILGVFQLTEGKITGWRDYFDLREFEEAVDLPLRG------Query YWVHVWRLKHGIIAQLREYFN------TLITVVLRVLEDGDEARLWRSTDRVR EEEEEEEEECCCEEEEEEEEE------EEEEEEEEECCCCCCCEEEEECCCCC 110 120 130 140 150

------1nwwA00 ------Query VQGSLPDIVLSI CCCCCCCEEEEC 160

Percentage Identity = 14.5%

GmWI-7 pDomTHREADER:

10 20 30 40 50 CCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHCC-HHHHHCCCCCCCEEEECCCCCEEC 1nwwA00 IEQPRWASKDSAAGAASTPDEKIVLEFMDALTSND-AAKLIEYFAEDTMYQNMPLPPAYG Query ------MQNKATVEMLYMALLGQETMDNVAKLLASDLEWWFHGPPQ------CCCHHHHHHHHHHHHCCCHHHHHHHHHCCCCCEEEECCCC--- 10 20 30 40

70 80 90 100 110 HHHHHHHHHHHCCCCCCCCEEEE---EEECCCCCCEEEEEECCCCCCCCCCCCEEEEEEE 1nwwA00 RDAVEQTLAGLFTV-SIDAVETF---HIGSSNG-VYTERVDV-RALPTGKSYNLSILGVF Query ---CHHMMKVLTGETDHTKGFRFEPKQVTAIGDCVIAEGWEG------KAYWVHVW ---CHHHHHHHCCCCCCCCEEEECCCEEEEECCEEEEEECCC------CEEEEEEE 50 60 70 80

120 130 140 EEECCEEEEEEEECCHHHHHHHHCCCCCC------1nwwA00 QLTEGKITGWRDYFDLREFEEAVDLPLRG------Query TLKNGLITQFREYFNTWLVVRDLRTPRREDSKDSMTLWESQPRDLYHRSLPGLVLTI EECCCEEEEEEEEEEEEEEEEECCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEC 90 100 110 120 130 140

Percentage Identity = 13.9%

GmWI-8 pDomTHREADER:

10 20 30 40 50 CCCCCCCCCCCCCCCCCC------HHHHHHHHHHHHHHHC-CHHHHHCCCCCCCEEEEC 1nwwA00 IEQPRWASKDSAAGAAST------PDEKIVLEFMDALTSN-DAAKLIEYFAEDTMYQNM Query ------MEDHQKVEMQNKARVEVLYKALLGQGTMDNVAKLLASDLEWWFH ------CCCHHHHHHHHHHHHHHHHHHHHCCCCHHHHHHHCCCCCEEEEE 10 20 30 40


60 70 80 90 100 110 CCCCEECHHHHHHHHHHHCCCCCCCCEEEEEEECCCCCCEEEEEECCCCCCCCCCCCEEE 1nwwA00 PLPPAYGRDAVEQTLAGLFTV-SIDAVETFHIGSSNG-VYTERVDV-RALPTGKSYNLSI Query GPPH---CHHMMKVLTGETDHTKGFRFEPRRVTAVGDCTIAEGWEG------KAYW CCCC---HHHHHHHHCCCCCCCCEEEEECCEEEEECCEEEEEEEEE------EEEE 50 60 70 80 90

120 130 140 EEEEEEECCEEEEEEEECCHHHHHHHHCCCCCC------1nwwA00 LGVFQLTEGKITGWRDYFDLREFEEAVDLPLRG------Query VHVWTLKNGLITQFREYFNTWLVVRDLRPPRWEDSKDSMTLWQSQPRDLYHRSLPGLVLT EEEEEEECCCEEEEEEEEEHHEEEHHCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEE 100 110 120 130 140 150

- 1nwwA00 - Query I C

Percentage Identity = 13.1%

GmWI-9 pDomTHREADER:

10 20 30 40 50 60 CCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHCCHHHHHCCCCCCCEEEECCCCCEECH 1nwwA00 IEQPRWASKDSAAGAASTPDEKIVLEFMDALTSNDAAKLIEYFAEDTMYQNMPLPPAYGR Query --ELANSSQKPWLEDQEDPNKRGVIDLYNAILSNDTDTMQRLLAPDLEWWFHGPPC------CCCCCCCCCCCCCCCCCCHHHHHHHHHHHHCCCHHHHHHHCCCCCCEEEECCCC---- 10 20 30 40 50

70 80 90 100 110 120 HHHHHHHHHHCCCCCCCCEEEEEEECCCCCCEEEEEECCCCCCCCCCCCEEEEEEEEEEC 1nwwA00 DAVEQTLAGLFTV-SIDAVETFHIGSSNG-VYTERVDV-RALPTGKSYNLSILGVFQLTE Query --HRHHLVPLLTGSSSKPLVPDLVVGFGSVTIAEGFDESNL------VWWVHAWTTTD --CCCHHHHHHCCCCCCCEECCEEEEECCEEEEEEECCCCE------EEEEEEEEECC 60 70 80 90 100

130 140 -CEEEEEEEECCHHHHHHHHCCCCCC------1nwwA00 -GKITGWRDYFDLREFEEAVDLPLRG------Query EGIITEVREYVN------TSVTVTKLGLHVNNDDVVSATCQSVWQSKLCDESV CCEEEEEEEEEE------EEEEEEEECCCCCCCCCCCCCCCCEECCCCCCCCC 110 120 130 140 150

------1nwwA00 ------Query PGLILAI CCEEEEC

Percentage Identity = 15.2%


GmWI-10 pDomTHREADER:

10 20 30 40 50 CCCCCCCCCCCCCCCCCC-----HHHHHHHHHHHHHHHC-CHHHHHC-CCCCCCEEEECC 1nwwA00 IEQPRWASKDSAAGAAST-----PDEKIVLEFMDALTSN-DAAKLIE-YFAEDTMYQNMP Query MKHISSSIEISEMGAERHEKEEMHNKATVEALYKALLGQGQMDTVATNMLASDLEWWFHG CCCCCCCHHHHCCCCCCCCHHHCCHHHHHHHHHHHHHCCCCHHHHHHHHCCCCCEEEEEC 10 20 30 40 50 60

60 70 80 90 100 CCCEECHHHHHHHHHHHC---CCCCCCCEEEE--EEECCCCCCEEEEEECCCCCCCCCCC 1nwwA00 LPPAYGRDAVEQTLAGLF---TV-SIDAVETF--HIGSSNG-VYTERVDV-RALPTGKSY Query PPQ------CQHMMRVLTGETTLDNNGFRFEPRSVTAIGDCVIAEGWEG------CCC------CHHHHHHHHCCCCCCCCCEEEECCEEEEECCEEEEEEEEC------70 80 90 100

120 130 140 CEEEEEEEEEECCEEEEEEEECC------HHHHHHHHCCCCCC 1nwwA00 NLSILGVFQLTEGKITGWRDYFD------LREFEEAVDLPLRG Query KAYWVHVWTVRNGLITQFREYFNTWLVVRDLRSQRWENKLDNMTLWQSQPRDLYRRSLPG CEEEEEEEEEECCEEEEEEEEEEEEEEEEECCCCCCCCCCCCCEEEECCCCCCCCCCCCC 110 120 130 140 150 160

----- 1nwwA00 ----- Query LVLAI EEEEC

Percentage Identity = 15.2%
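
The percentage identity values reported above summarize how many aligned positions carry the same residue in the query and the template. The following is a minimal sketch of one common way to compute such a value from a gapped pairwise alignment. It assumes that columns containing a gap in either sequence are excluded from the denominator; the convention actually used by pDomTHREADER is not stated in this appendix and may differ. The function name `percent_identity` and the toy sequences are illustrative only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumption (not taken from the thesis): gapped columns are ignored
    in both the numerator and the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a.upper() == b.upper():
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy alignment: four ungapped columns, three of them identical.
toy_identity = percent_identity("AC-GT", "ACTGA")
```

Other denominators (for example the full alignment length or the length of the shorter sequence) give different values for the same alignment, so the convention should be stated whenever identities are compared.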

APPENDIX E.

PLACE DATA


GmWI‐1: Factor or Site Name Location Strand Signal Sequence Site # SEF1MOTIF 172 (+) ATATTTAWW S000006 SEF1MOTIF 807 (+) ATATTTAWW S000006 AMYBOX1 670 (+) TAACARA S000020 ASF1MOTIFCAMV 373 (+) TGACG S000024 ASF1MOTIFCAMV 465 (+) TGACG S000024 ASF1MOTIFCAMV 1031 (+) TGACG S000024 ASF1MOTIFCAMV 1404 (+) TGACG S000024 ASF1MOTIFCAMV 1114 (‐) TGACG S000024 ASF1MOTIFCAMV 1437 (‐) TGACG S000024 CAATBOX1 39 (+) CAAT S000028 CAATBOX1 116 (+) CAAT S000028 CAATBOX1 470 (+) CAAT S000028 CAATBOX1 838 (+) CAAT S000028 CAATBOX1 890 (+) CAAT S000028 CAATBOX1 990 (+) CAAT S000028 CAATBOX1 1098 (+) CAAT S000028 CAATBOX1 127 (‐) CAAT S000028 CAATBOX1 244 (‐) CAAT S000028 CAATBOX1 254 (‐) CAAT S000028 CAATBOX1 330 (‐) CAAT S000028 CAATBOX1 371 (‐) CAAT S000028 CAATBOX1 557 (‐) CAAT S000028 CAATBOX1 882 (‐) CAAT S000028 CAATBOX1 1244 (‐) CAAT S000028 CCAATBOX1 1097 (+) CCAAT S000030 CEREGLUBOX2PSLEGA 580 (+) TGAAAACT S000033 GATABOX 138 (+) GATA S000039 GATABOX 333 (+) GATA S000039 GATABOX 524 (+) GATA S000039 GATABOX 754 (+) GATA S000039 GATABOX 819 (+) GATA S000039 GATABOX 842 (+) GATA S000039 GATABOX 101 (‐) GATA S000039 GATABOX 273 (‐) GATA S000039 GATABOX 401 (‐) GATA S000039 GATABOX 731 (‐) GATA S000039 GATABOX 844 (‐) GATA S000039 HEXMOTIFTAH3H4 1113 (+) ACGTCA S000053 HEXMOTIFTAH3H4 373 (‐) ACGTCA S000053


HEXMOTIFTAH3H4 1031 (‐) ACGTCA S000053 MARTBOX 74 (+) TTWTWTTWTT S000067 MARTBOX 1122 (‐) TTWTWTTWTT S000067 POLASIG1 336 (+) AATAAA S000080 POLASIG1 991 (+) AATAAA S000080 POLASIG1 1048 (+) AATAAA S000080 POLASIG1 1122 (+) AATAAA S000080 POLASIG1 424 (‐) AATAAA S000080 POLASIG1 796 (‐) AATAAA S000080 POLASIG1 810 (‐) AATAAA S000080 POLASIG2 83 (‐) AATTAAA S000081 POLASIG2 878 (‐) AATTAAA S000081 INTRONLOWER 35 (‐) TGCAGG S000086 INTRONLOWER 112 (‐) TGCAGG S000086 POLASIG3 866 (+) AATAAT S000088 POLASIG3 73 (‐) AATAAT S000088 POLASIG3 262 (‐) AATAAT S000088 POLASIG3 297 (‐) AATAAT S000088 POLASIG3 300 (‐) AATAAT S000088 POLASIG3 445 (‐) AATAAT S000088 ROOTMOTIFTAPOX1 71 (+) ATATT S000098 ROOTMOTIFTAPOX1 172 (+) ATATT S000098 ROOTMOTIFTAPOX1 201 (+) ATATT S000098 ROOTMOTIFTAPOX1 457 (+) ATATT S000098 ROOTMOTIFTAPOX1 480 (+) ATATT S000098 ROOTMOTIFTAPOX1 555 (+) ATATT S000098 ROOTMOTIFTAPOX1 807 (+) ATATT S000098 ROOTMOTIFTAPOX1 200 (‐) ATATT S000098 ROOTMOTIFTAPOX1 399 (‐) ATATT S000098 ROOTMOTIFTAPOX1 479 (‐) ATATT S000098 ROOTMOTIFTAPOX1 527 (‐) ATATT S000098 ROOTMOTIFTAPOX1 788 (‐) ATATT S000098 ROOTMOTIFTAPOX1 806 (‐) ATATT S000098 ROOTMOTIFTAPOX1 869 (‐) ATATT S000098 SEF4MOTIFGM7S 48 (+) RTTTTTR S000103 SEF4MOTIFGM7S 926 (+) RTTTTTR S000103 SEF4MOTIFGM7S 936 (+) RTTTTTR S000103 SEF4MOTIFGM7S 1494 (+) RTTTTTR S000103 SEF4MOTIFGM7S 1232 (‐) RTTTTTR S000103 SEF4MOTIFGM7S 189 (‐) RTTTTTR S000103 SEF4MOTIFGM7S 784 (‐) RTTTTTR S000103


SEF4MOTIFGM7S 827 (‐) RTTTTTR S000103 TATABOX2 174 (‐) TATAAAT S000109 TATABOX2 707 (‐) TATAAAT S000109 TATABOX3 812 (+) TATTAAT S000110 TATABOX4 871 (+) TATATAA S000111 TATABOX4 1186 (+) TATATAA S000111 TATABOX4 168 (‐) TATATAA S000111 SEF3MOTIFGM 564 (‐) AACCCA S000115 ‐300ELEMENT 256 (+) TGHAAARK S000122 ‐300ELEMENT 587 (+) TGHAAARK S000122 ‐300ELEMENT 303 (‐) TGHAAARK S000122 SV40COREENHAN 1355 (+) GTGGWWHG S000123 IBOX 729 (‐) GATAAG S000124 GT1CORE 352 (‐) GGTTAA S000125 PALBOXAPC 1146 (+) CCGTCC S000137 ELRECOREPCRP1 1003 (+) TTGACC S000142 ELRECOREPCRP1 835 (‐) TTGACC S000142 EBOXBNNAPA 316 (+) CANNTG S000144 EBOXBNNAPA 1352 (+) CANNTG S000144 EBOXBNNAPA 316 (‐) CANNTG S000144 EBOXBNNAPA 1352 (‐) CANNTG S000144 HEXAMERATH4 1007 (+) CCGTCG S000146 HEXAMERATH4 1433 (+) CCGTCG S000146 CANBNNAPA 1108 (+) CNAACAC S000148 LTRECOREATCOR15 1427 (+) CCGAC S000153 LTRECOREATCOR15 1454 (+) CCGAC S000153 MYBPLANT 638 (+) MACCWAMC S000167 MYBPLANT 1247 (‐) MACCWAMC S000167 MYBPLANT 1252 (‐) MACCWAMC S000167 MYBCORE 661 (+) CNGTTR S000176 MYBST1 101 (‐) GGATA S000180 MYBGAHV 670 (+) TAACAAA S000181 SP8BFIBSP8BIB 1061 (+) TACTATT S000184 SP8BFIBSP8BIB 850 (‐) TACTATT S000184 GT1CONSENSUS 333 (+) GRWAAW S000198 GT1CONSENSUS 524 (+) GRWAAW S000198 GT1CONSENSUS 1102 (+) GRWAAW S000198 GT1CONSENSUS 93 (‐) GRWAAW S000198 GT1CONSENSUS 235 (‐) GRWAAW S000198 GT1CONSENSUS 271 (‐) GRWAAW S000198 GT1CONSENSUS 472 (‐) GRWAAW S000198


GT1CONSENSUS 599 (‐) GRWAAW S000198 GT1CONSENSUS 1065 (‐) GRWAAW S000198 GT1CONSENSUS 1270 (‐) GRWAAW S000198 GT1CONSENSUS 99 (‐) GRWAAW S000198 GT1CONSENSUS 291 (‐) GRWAAW S000198 GT1CONSENSUS 486 (‐) GRWAAW S000198 GT1CONSENSUS 569 (‐) GRWAAW S000198 IBOXCORE 333 (+) GATAA S000199 IBOXCORE 524 (+) GATAA S000199 IBOXCORE 100 (‐) GATAA S000199 IBOXCORE 272 (‐) GATAA S000199 IBOXCORE 730 (‐) GATAA S000199 TATABOX5 74 (+) TTATTT S000203 TATABOX5 263 (+) TTATTT S000203 TATABOX5 301 (+) TTATTT S000203 TATABOX5 1047 (‐) TTATTT S000203 TATABOX5 1121 (‐) TTATTT S000203 TATABOX5 1235 (‐) TTATTT S000203 CGACGOSAMY3 1428 (+) CGACG S000205 CGACGOSAMY3 1008 (‐) CGACG S000205 CGACGOSAMY3 1434 (‐) CGACG S000205 CGACGOSAMY3 1446 (‐) CGACG S000205 AUXRETGA1GMGH3 373 (+) TGACGTAA S000234 POLLEN1LELAT52 748 (+) AGAAA S000245 POLLEN1LELAT52 17 (‐) AGAAA S000245 POLLEN1LELAT52 237 (‐) AGAAA S000245 POLLEN1LELAT52 474 (‐) AGAAA S000245 POLLEN1LELAT52 898 (‐) AGAAA S000245 POLLEN1LELAT52 1327 (‐) AGAAA S000245 CIACADIANLELHC 838 (+) CAANNNNATC S000252 QELEMENTZMZM13 834 (+) AGGTCA S000254 PYRIMIDINEBOXOSRAMY1A 2 (+) CCTTTT S000259 DOFCOREZM 58 (+) AAAG S000265 DOFCOREZM 277 (+) AAAG S000265 DOFCOREZM 550 (+) AAAG S000265 DOFCOREZM 590 (+) AAAG S000265 DOFCOREZM 646 (+) AAAG S000265 DOFCOREZM 674 (+) AAAG S000265 DOFCOREZM 691 (+) AAAG S000265 DOFCOREZM 746 (+) AAAG S000265 DOFCOREZM 751 (+) AAAG S000265


DOFCOREZM 802 (+) AAAG S000265 DOFCOREZM 946 (+) AAAG S000265 DOFCOREZM 994 (+) AAAG S000265 DOFCOREZM 3 (‐) AAAG S000265 DOFCOREZM 20 (‐) AAAG S000265 DOFCOREZM 98 (‐) AAAG S000265 DOFCOREZM 161 (‐) AAAG S000265 DOFCOREZM 186 (‐) AAAG S000265 DOFCOREZM 485 (‐) AAAG S000265 DOFCOREZM 723 (‐) AAAG S000265 DOFCOREZM 736 (‐) AAAG S000265 DOFCOREZM 794 (‐) AAAG S000265 DOFCOREZM 1325 (‐) AAAG S000265 NTBBF1ARROLB 185 (+) ACTTTA S000273 NTBBF1ARROLB 993 (‐) ACTTTA S000273 DPBFCOREDCDC3 315 (+) ACACNNG S000292 DPBFCOREDCDC3 1488 (+) ACACNNG S000292 RAV1AAT 759 (+) CAACA S000314 RAV1AAT 1109 (+) CAACA S000314 RAV1AAT 1229 (+) CAACA S000314 RAV1AAT 1377 (+) CAACA S000314 TATAPVTRNALEU 167 (+) TTTATATA S000340 TATAPVTRNALEU 1186 (‐) TTTATATA S000340 REALPHALGLHCB21 354 (+) AACCAA S000362 REALPHALGLHCB21 679 (+) AACCAA S000362 REALPHALGLHCB21 1106 (+) AACCAA S000362 REALPHALGLHCB21 1249 (‐) AACCAA S000362 REALPHALGLHCB21 1254 (‐) AACCAA S000362 TGACGTVMAMY 373 (+) TGACGT S000377 TGACGTVMAMY 1031 (+) TGACGT S000377 TGACGTVMAMY 1113 (‐) TGACGT S000377 GTGANTG10 418 (+) GTGA S000378 GTGANTG10 579 (+) GTGA S000378 GTGANTG10 860 (+) GTGA S000378 GTGANTG10 943 (+) GTGA S000378 GTGANTG10 183 (‐) GTGA S000378 GTGANTG10 726 (‐) GTGA S000378 GTGANTG10 733 (‐) GTGA S000378 GTGANTG10 954 (‐) GTGA S000378 GTGANTG10 963 (‐) GTGA S000378 GTGANTG10 1160 (‐) GTGA S000378


GTGANTG10 1486 (‐) GTGA S000378 TBOXATGAPB 276 (‐) ACTTTG S000383 TBOXATGAPB 673 (‐) ACTTTG S000383 TAAAGSTKST1 801 (+) TAAAG S000387 TAAAGSTKST1 993 (+) TAAAG S000387 TAAAGSTKST1 98 (‐) TAAAG S000387 TAAAGSTKST1 186 (‐) TAAAG S000387 TAAAGSTKST1 485 (‐) TAAAG S000387 WBOXATNPR1 372 (+) TTGAC S000390 WBOXATNPR1 464 (+) TTGAC S000390 WBOXATNPR1 1003 (+) TTGAC S000390 WBOXATNPR1 836 (‐) TTGAC S000390 WBOXATNPR1 988 (‐) TTGAC S000390 WBOXATNPR1 1438 (‐) TTGAC S000390 ‐10PEHVPSBD 481 (+) TATTCT S000392 ‐10PEHVPSBD 804 (‐) TATTCT S000392 INRNTPSADB 389 (+) YTCANTYY S000395 INRNTPSADB 1014 (+) YTCANTYY S000395 INRNTPSADB 583 (‐) YTCANTYY S000395 TATABOXOSPAL 1045 (‐) TATTTAA S000400 CARGATCONSENSUS 356 (+) CCWWWWWWGG S000404 CARGATCONSENSUS 356 (‐) CCWWWWWWGG S000404 MYCCONSENSUSAT 316 (+) CANNTG S000407 MYCCONSENSUSAT 1352 (+) CANNTG S000407 MYCCONSENSUSAT 316 (‐) CANNTG S000407 MYCCONSENSUSAT 1352 (‐) CANNTG S000407 MYB1AT 1105 (+) WAACCA S000408 MYB1AT 353 (+) WAACCA S000408 MYB1AT 678 (+) WAACCA S000408 MYB1AT 1250 (‐) WAACCA S000408 MYB1AT 1255 (‐) WAACCA S000408 MYB2CONSENSUSAT 661 (‐) YAACKG S000409 ABRELATERD1 1112 (‐) ACGTG S000414 ACGTATERD1 375 (+) ACGT S000415 ACGTATERD1 615 (+) ACGT S000415 ACGTATERD1 667 (+) ACGT S000415 ACGTATERD1 1033 (+) ACGT S000415 ACGTATERD1 1113 (+) ACGT S000415 ACGTATERD1 375 (‐) ACGT S000415 ACGTATERD1 615 (‐) ACGT S000415 ACGTATERD1 667 (‐) ACGT S000415


ACGTATERD1 1033 (‐) ACGT S000415 ACGTATERD1 1113 (‐) ACGT S000415 DRECRTCOREAT 1453 (+) RCCGAC S000418 GARE2OSREP1 613 (+) TAACGTA S000420 GARE2OSREP1 665 (+) TAACGTA S000420 CAREOSREP1 1220 (+) CAACTC S000421 GCCCORE 1450 (+) GCCGCC S000430 CARGCW8GAT 211 (+) CWWWWWWWWG S000431 CARGCW8GAT 515 (+) CWWWWWWWWG S000431 CARGCW8GAT 826 (+) CWWWWWWWWG S000431 CARGCW8GAT 1232 (+) CWWWWWWWWG S000431 CARGCW8GAT 211 (‐) CWWWWWWWWG S000431 CARGCW8GAT 515 (‐) CWWWWWWWWG S000431 CARGCW8GAT 826 (‐) CWWWWWWWWG S000431 CARGCW8GAT 1232 (‐) CWWWWWWWWG S000431 GAREAT 670 (+) TAACAAR S000439 WBOXHVISO1 987 (‐) TGACT S000442 WRKY71OS 373 (+) TGAC S000447 WRKY71OS 465 (+) TGAC S000447 WRKY71OS 1004 (+) TGAC S000447 WRKY71OS 1031 (+) TGAC S000447 WRKY71OS 1404 (+) TGAC S000447 WRKY71OS 182 (‐) TGAC S000447 WRKY71OS 836 (‐) TGAC S000447 WRKY71OS 988 (‐) TGAC S000447 WRKY71OS 1115 (‐) TGAC S000447 WRKY71OS 1438 (‐) TGAC S000447 CACTFTPPCA1 184 (+) YACT S000449 CACTFTPPCA1 316 (+) YACT S000449 CACTFTPPCA1 513 (+) YACT S000449 CACTFTPPCA1 727 (+) YACT S000449 CACTFTPPCA1 734 (+) YACT S000449 CACTFTPPCA1 964 (+) YACT S000449 CACTFTPPCA1 1080 (+) YACT S000449 CACTFTPPCA1 1153 (+) YACT S000449 CACTFTPPCA1 1397 (+) YACT S000449 CACTFTPPCA1 209 (+) YACT S000449 CACTFTPPCA1 502 (+) YACT S000449 CACTFTPPCA1 622 (+) YACT S000449 CACTFTPPCA1 792 (+) YACT S000449 CACTFTPPCA1 1061 (+) YACT S000449


CACTFTPPCA1 1259 (+) YACT S000449 CACTFTPPCA1 279 (‐) YACT S000449 CACTFTPPCA1 310 (‐) YACT S000449 CACTFTPPCA1 417 (‐) YACT S000449 CACTFTPPCA1 552 (‐) YACT S000449 CACTFTPPCA1 578 (‐) YACT S000449 CACTFTPPCA1 611 (‐) YACT S000449 CACTFTPPCA1 676 (‐) YACT S000449 CACTFTPPCA1 853 (‐) YACT S000449 CACTFTPPCA1 859 (‐) YACT S000449 CACTFTPPCA1 996 (‐) YACT S000449 CACTFTPPCA1 1142 (‐) YACT S000449 CACTFTPPCA1 1354 (‐) YACT S000449 CACTFTPPCA1 1463 (‐) YACT S000449 PREATPRODH 388 (+) ACTCAT S000450 GT1GMSCAM4 1102 (+) GAAAAA S000453 GT1GMSCAM4 291 (‐) GAAAAA S000453 GT1GMSCAM4 569 (‐) GAAAAA S000453 ARR1AT 42 (+) NGATT S000454 ARR1AT 1297 (+) NGATT S000454 ARR1AT 1368 (+) NGATT S000454 ARR1AT 431 (‐) NGATT S000454 ARR1AT 439 (‐) NGATT S000454 ARR1AT 887 (‐) NGATT S000454 ARR1AT 1076 (‐) NGATT S000454 ARR1AT 1279 (‐) NGATT S000454 ARR1AT 1320 (‐) NGATT S000454 WBOXNTERF3 1004 (+) TGACY S000457 WBOXNTERF3 987 (‐) TGACY S000457 WBOXNTERF3 835 (‐) TGACY S000457 NODCON1GM 751 (+) AAAGAT S000461 NODCON2GM 1197 (+) CTCTT S000462 NODCON2GM 1323 (+) CTCTT S000462 OSE1ROOTNODULE 751 (+) AAAGAT S000467 OSE2ROOTNODULE 1197 (+) CTCTT S000468 OSE2ROOTNODULE 1323 (+) CTCTT S000468 SREATMSD 100 (+) TTATCC S000470 ANAERO1CONSENSUS 642 (+) AAACAAA S000477 ANAERO1CONSENSUS 687 (+) AAACAAA S000477 ANAERO1CONSENSUS 45 (‐) AAACAAA S000477 ANAERO3CONSENSUS 860 (‐) TCATCAC S000479


CPBCSPOR 202 (+) TATTAG S000491 CPBCSPOR 458 (+) TATTAG S000491 CPBCSPOR 477 (‐) TATTAG S000491 CURECORECR 132 (+) GTAC S000493 CURECORECR 617 (+) GTAC S000493 CURECORECR 1060 (+) GTAC S000493 CURECORECR 1143 (+) GTAC S000493 CURECORECR 132 (‐) GTAC S000493 CURECORECR 617 (‐) GTAC S000493 CURECORECR 1060 (‐) GTAC S000493 CURECORECR 1143 (‐) GTAC S000493 EECCRCAH1 1179 (+) GANTTNC S000494 EECCRCAH1 1268 (+) GANTTNC S000494 CBFHV 1453 (+) RYCGAC S000497 BIHD1OS 181 (+) TGTCA S000498 CGCGBOXAT 1314 (+) VCGCGB S000501 CGCGBOXAT 1056 (+) VCGCGB S000501 CGCGBOXAT 1314 (‐) VCGCGB S000501 CGCGBOXAT 1056 (‐) VCGCGB S000501 MYBCOREATCYCB1 221 (+) AACGG S000502 MYBCOREATCYCB1 661 (‐) AACGG S000502 PRECONSCRHSP70A 990 (‐) SCGAYNRNNNNNNNNNNNNNNNHD S000506 PRECONSCRHSP70A 1428 (‐) SCGAYNRNNNNNNNNNNNNNNNHD S000506 ABRERATCAL 1313 (+) MACGYGB S000507

GmWI‐2: Factor or Site Name Location Strand Signal Sequence Site # SEF1MOTIF 427 (+) ATATTTAWW S000006 SEF1MOTIF 827 (+) ATATTTAWW S000006 SEF1MOTIF 715 (‐) ATATTTAWW S000006 AMYBOX1 511 (+) TAACARA S000020 ASF1MOTIFCAMV 1047 (+) TGACG S000024 ASF1MOTIFCAMV 1404 (+) TGACG S000024 ASF1MOTIFCAMV 1126 (‐) TGACG S000024 ASF1MOTIFCAMV 1437 (‐) TGACG S000024 CAATBOX1 65 (+) CAAT S000028 CAATBOX1 458 (+) CAAT S000028 CAATBOX1 552 (+) CAAT S000028 CAATBOX1 746 (+) CAAT S000028


CAATBOX1 873 (+) CAAT S000028 CAATBOX1 1032 (+) CAAT S000028 CAATBOX1 1110 (+) CAAT S000028 CAATBOX1 19 (‐) CAAT S000028 CAATBOX1 116 (‐) CAAT S000028 CAATBOX1 149 (‐) CAAT S000028 CAATBOX1 200 (‐) CAAT S000028 CAATBOX1 419 (‐) CAAT S000028 CAATBOX1 468 (‐) CAAT S000028 CAATBOX1 489 (‐) CAAT S000028 CAATBOX1 517 (‐) CAAT S000028 CAATBOX1 674 (‐) CAAT S000028 CAATBOX1 785 (‐) CAAT S000028 CAATBOX1 1247 (‐) CAAT S000028 CCAATBOX1 1109 (+) CCAAT S000030 CCAATBOX1 19 (‐) CCAAT S000030 CEREGLUBOX2PSLEGA 130 (+) TGAAAACT S000033 ERELEE4 834 (‐) AWTTCAAA S000037 GATABOX 323 (+) GATA S000039 GATABOX 439 (+) GATA S000039 GATABOX 507 (+) GATA S000039 GATABOX 567 (+) GATA S000039 GATABOX 654 (+) GATA S000039 GATABOX 688 (+) GATA S000039 GATABOX 800 (+) GATA S000039 GATABOX 35 (‐) GATA S000039 GATABOX 294 (‐) GATA S000039 GATABOX 521 (‐) GATA S000039 GATABOX 549 (‐) GATA S000039 GATABOX 613 (‐) GATA S000039 GATABOX 935 (‐) GATA S000039 GATABOX 1026 (‐) GATA S000039 HEXMOTIFTAH3H4 1125 (+) ACGTCA S000053 HEXMOTIFTAH3H4 1047 (‐) ACGTCA S000053 MARABOX1 370 (‐) AATAAAYAAA S000063 MARTBOX 40 (‐) TTWTWTTWTT S000067 MARTBOX 100 (‐) TTWTWTTWTT S000067 MARTBOX 101 (‐) TTWTWTTWTT S000067 MARTBOX 102 (‐) TTWTWTTWTT S000067 MARTBOX 103 (‐) TTWTWTTWTT S000067 MARTBOX 104 (‐) TTWTWTTWTT S000067


MARTBOX 105 (‐) TTWTWTTWTT S000067 POLASIG1 402 (+) AATAAA S000080 POLASIG1 560 (+) AATAAA S000080 POLASIG1 776 (+) AATAAA S000080 POLASIG1 9 (‐) AATAAA S000080 POLASIG1 370 (‐) AATAAA S000080 POLASIG1 374 (‐) AATAAA S000080 POLASIG1 416 (‐) AATAAA S000080 POLASIG1 430 (‐) AATAAA S000080 POLASIG1 830 (‐) AATAAA S000080 POLASIG1 850 (‐) AATAAA S000080 POLASIG2 791 (+) AATTAAA S000081 POLASIG2 485 (‐) AATTAAA S000081 POLASIG3 477 (+) AATAAT S000088 POLASIG3 480 (+) AATAAT S000088 POLASIG3 760 (+) AATAAT S000088 POLASIG3 5 (‐) AATAAT S000088 POLASIG3 377 (‐) AATAAT S000088 POLASIG3 691 (‐) AATAAT S000088 POLASIG3 734 (‐) AATAAT S000088 POLASIG3 853 (‐) AATAAT S000088 POLASIG3 856 (‐) AATAAT S000088 ROOTMOTIFTAPOX1 17 (+) ATATT S000098 ROOTMOTIFTAPOX1 67 (+) ATATT S000098 ROOTMOTIFTAPOX1 427 (+) ATATT S000098 ROOTMOTIFTAPOX1 593 (+) ATATT S000098 ROOTMOTIFTAPOX1 655 (+) ATATT S000098 ROOTMOTIFTAPOX1 689 (+) ATATT S000098 ROOTMOTIFTAPOX1 713 (+) ATATT S000098 ROOTMOTIFTAPOX1 724 (+) ATATT S000098 ROOTMOTIFTAPOX1 748 (+) ATATT S000098 ROOTMOTIFTAPOX1 805 (+) ATATT S000098 ROOTMOTIFTAPOX1 810 (+) ATATT S000098 ROOTMOTIFTAPOX1 827 (+) ATATT S000098 ROOTMOTIFTAPOX1 916 (+) ATATT S000098 ROOTMOTIFTAPOX1 947 (+) ATATT S000098 ROOTMOTIFTAPOX1 1143 (+) ATATT S000098 ROOTMOTIFTAPOX1 66 (‐) ATATT S000098 ROOTMOTIFTAPOX1 361 (‐) ATATT S000098 ROOTMOTIFTAPOX1 391 (‐) ATATT S000098 ROOTMOTIFTAPOX1 408 (‐) ATATT S000098


ROOTMOTIFTAPOX1 426 (‐) ATATT S000098 ROOTMOTIFTAPOX1 451 (‐) ATATT S000098 ROOTMOTIFTAPOX1 712 (‐) ATATT S000098 ROOTMOTIFTAPOX1 719 (‐) ATATT S000098 ROOTMOTIFTAPOX1 747 (‐) ATATT S000098 ROOTMOTIFTAPOX1 763 (‐) ATATT S000098 ROOTMOTIFTAPOX1 804 (‐) ATATT S000098 ROOTMOTIFTAPOX1 826 (‐) ATATT S000098 ROOTMOTIFTAPOX1 915 (‐) ATATT S000098 ROOTMOTIFTAPOX1 933 (‐) ATATT S000098 ROOTMOTIFTAPOX1 1142 (‐) ATATT S000098 SEF4MOTIFGM7S 726 (+) RTTTTTR S000103 SEF4MOTIFGM7S 949 (+) RTTTTTR S000103 SEF4MOTIFGM7S 119 (+) RTTTTTR S000103 SEF4MOTIFGM7S 1494 (+) RTTTTTR S000103 SEF4MOTIFGM7S 680 (‐) RTTTTTR S000103 SEF4MOTIFGM7S 404 (‐) RTTTTTR S000103 SEF4MOTIFGM7S 447 (‐) RTTTTTR S000103 SEF4MOTIFGM7S 538 (‐) RTTTTTR S000103 SEF4MOTIFGM7S 814 (‐) RTTTTTR S000103 SEF4MOTIFGM7S 987 (‐) RTTTTTR S000103 TATABOX3 432 (+) TATTAAT S000110 TATABOX3 557 (‐) TATTAAT S000110 TATABOX4 1198 (+) TATATAA S000111 SEF3MOTIFGM 1117 (+) AACCCA S000115 ‐300ELEMENT 137 (+) TGHAAARK S000122 ‐300ELEMENT 59 (‐) TGHAAARK S000122 ‐300ELEMENT 694 (‐) TGHAAARK S000122 SV40COREENHAN 196 (+) GTGGWWHG S000123 SV40COREENHAN 939 (‐) GTGGWWHG S000123 IBOX 292 (‐) GATAAG S000124 ACGTABOX 53 (+) TACGTA S000130 ACGTABOX 53 (‐) TACGTA S000130 ELRECOREPCRP1 955 (‐) TTGACC S000142 EBOXBNNAPA 270 (+) CANNTG S000144 EBOXBNNAPA 342 (+) CANNTG S000144 EBOXBNNAPA 1352 (+) CANNTG S000144 EBOXBNNAPA 270 (‐) CANNTG S000144 EBOXBNNAPA 342 (‐) CANNTG S000144 EBOXBNNAPA 1352 (‐) CANNTG S000144 HEXAMERATH4 1433 (+) CCGTCG S000146


CANBNNAPA 1120 (+) CNAACAC S000148 MYBPLANT 1250 (‐) MACCWAMC S000167 MYBPLANT 1255 (‐) MACCWAMC S000167 MYBPLANT 1260 (‐) MACCWAMC S000167 MYBCORE 218 (+) CNGTTR S000176 MYBST1 521 (‐) GGATA S000180 MYBST1 935 (‐) GGATA S000180 MYBGAHV 511 (+) TAACAAA S000181 SP8BFIBSP8BIB 624 (+) TACTATT S000184 SP8BFIBSP8BIB 1073 (+) TACTATT S000184 GT1CONSENSUS 39 (+) GRWAAW S000198 GT1CONSENSUS 399 (+) GRWAAW S000198 GT1CONSENSUS 567 (+) GRWAAW S000198 GT1CONSENSUS 667 (+) GRWAAW S000198 GT1CONSENSUS 757 (+) GRWAAW S000198 GT1CONSENSUS 800 (+) GRWAAW S000198 GT1CONSENSUS 1272 (+) GRWAAW S000198 GT1CONSENSUS 603 (‐) GRWAAW S000198 GT1CONSENSUS 628 (‐) GRWAAW S000198 GT1CONSENSUS 1077 (‐) GRWAAW S000198 GT1CONSENSUS 73 (‐) GRWAAW S000198 GT1CONSENSUS 382 (‐) GRWAAW S000198 GT1CONSENSUS 1303 (‐) GRWAAW S000198 IBOXCORE 567 (+) GATAA S000199 IBOXCORE 800 (+) GATAA S000199 IBOXCORE 34 (‐) GATAA S000199 IBOXCORE 293 (‐) GATAA S000199 TATABOX5 6 (+) TTATTT S000203 TATABOX5 371 (+) TTATTT S000203 TATABOX5 378 (+) TTATTT S000203 TATABOX5 692 (+) TTATTT S000203 TATABOX5 831 (+) TTATTT S000203 TATABOX5 897 (+) TTATTT S000203 TATABOX5 401 (‐) TTATTT S000203 TATABOX5 759 (‐) TTATTT S000203 CGACGOSAMY3 1434 (‐) CGACG S000205 CGACGOSAMY3 1446 (‐) CGACG S000205 POLLEN1LELAT52 317 (+) AGAAA S000245 POLLEN1LELAT52 532 (+) AGAAA S000245 POLLEN1LELAT52 666 (+) AGAAA S000245 POLLEN1LELAT52 770 (+) AGAAA S000245


POLLEN1LELAT52 900 (‐) AGAAA S000245 POLLEN1LELAT52 1327 (‐) AGAAA S000245 PYRIMIDINEBOXOSRAMY1A 46 (‐) CCTTTT S000259 DOFCOREZM 47 (+) AAAG S000265 DOFCOREZM 140 (+) AAAG S000265 DOFCOREZM 223 (+) AAAG S000265 DOFCOREZM 255 (+) AAAG S000265 DOFCOREZM 315 (+) AAAG S000265 DOFCOREZM 320 (+) AAAG S000265 DOFCOREZM 530 (+) AAAG S000265 DOFCOREZM 664 (+) AAAG S000265 DOFCOREZM 772 (+) AAAG S000265 DOFCOREZM 795 (+) AAAG S000265 DOFCOREZM 921 (+) AAAG S000265 DOFCOREZM 1009 (+) AAAG S000265 DOFCOREZM 72 (‐) AAAG S000265 DOFCOREZM 286 (‐) AAAG S000265 DOFCOREZM 299 (‐) AAAG S000265 DOFCOREZM 414 (‐) AAAG S000265 DOFCOREZM 869 (‐) AAAG S000265 DOFCOREZM 878 (‐) AAAG S000265 DOFCOREZM 1326 (‐) AAAG S000265 DPBFCOREDCDC3 1379 (+) ACACNNG S000292 BOXIINTPATPB 768 (+) ATAGAA S000296 BOXIINTPATPB 1095 (‐) ATAGAA S000296 WBBOXPCWRKY1 1002 (‐) TTTGACY S000310 WBBOXPCWRKY1 955 (‐) TTTGACY S000310 RAV1AAT 328 (+) CAACA S000314 RAV1AAT 589 (+) CAACA S000314 RAV1AAT 1055 (+) CAACA S000314 RAV1AAT 1121 (+) CAACA S000314 RAV1AAT 1238 (+) CAACA S000314 RAV1AAT 1377 (+) CAACA S000314 TATAPVTRNALEU 1198 (‐) TTTATATA S000340 REALPHALGLHCB21 244 (+) AACCAA S000362 REALPHALGLHCB21 20 (‐) AACCAA S000362 REALPHALGLHCB21 1252 (‐) AACCAA S000362 REALPHALGLHCB21 1257 (‐) AACCAA S000362 REALPHALGLHCB21 1262 (‐) AACCAA S000362 REBETALGLHCB21 935 (‐) CGGATA S000363 CATATGGMSAUR 342 (+) CATATG S000370


CATATGGMSAUR 342 (‐) CATATG S000370 GMHDLGMVSPB 81 (+) CATTAATTAG S000372 TGACGTVMAMY 1047 (+) TGACGT S000377 TGACGTVMAMY 1125 (‐) TGACGT S000377 GTGANTG10 29 (+) GTGA S000378 GTGANTG10 129 (+) GTGA S000378 GTGANTG10 755 (+) GTGA S000378 GTGANTG10 925 (+) GTGA S000378 GTGANTG10 977 (+) GTGA S000378 GTGANTG10 289 (‐) GTGA S000378 GTGANTG10 296 (‐) GTGA S000378 GTGANTG10 1488 (‐) GTGA S000378 TAAAGSTKST1 222 (+) TAAAG S000387 TAAAGSTKST1 794 (+) TAAAG S000387 TAAAGSTKST1 920 (+) TAAAG S000387 TAAAGSTKST1 878 (‐) TAAAG S000387 WBOXATNPR1 490 (+) TTGAC S000390 WBOXATNPR1 1018 (+) TTGAC S000390 WBOXATNPR1 456 (‐) TTGAC S000390 WBOXATNPR1 956 (‐) TTGAC S000390 WBOXATNPR1 1003 (‐) TTGAC S000390 WBOXATNPR1 1438 (‐) TTGAC S000390 ‐10PEHVPSBD 68 (+) TATTCT S000392 ‐10PEHVPSBD 736 (+) TATTCT S000392 ‐10PEHVPSBD 424 (‐) TATTCT S000392 ‐10PEHVPSBD 774 (‐) TATTCT S000392 ‐10PEHVPSBD 913 (‐) TATTCT S000392 ‐10PEHVPSBD 931 (‐) TATTCT S000392 INRNTPSADB 871 (+) YTCANTYY S000395 INRNTPSADB 133 (‐) YTCANTYY S000395 INRNTPSADB 126 (‐) YTCANTYY S000395 TATABOXOSPAL 388 (‐) TATTTAA S000400 TATABOXOSPAL 716 (‐) TATTTAA S000400 TATCCAOSAMY 521 (+) TATCCA S000403 MYCCONSENSUSAT 270 (+) CANNTG S000407 MYCCONSENSUSAT 342 (+) CANNTG S000407 MYCCONSENSUSAT 1352 (+) CANNTG S000407 MYCCONSENSUSAT 270 (‐) CANNTG S000407 MYCCONSENSUSAT 342 (‐) CANNTG S000407 MYCCONSENSUSAT 1352 (‐) CANNTG S000407 MYB1AT 243 (+) WAACCA S000408


MYB1AT 21 (‐) WAACCA S000408 MYB1AT 1253 (‐) WAACCA S000408 MYB1AT 1258 (‐) WAACCA S000408 MYB1AT 1263 (‐) WAACCA S000408 MYB2CONSENSUSAT 218 (‐) YAACKG S000409 ABRELATERD1 27 (+) ACGTG S000414 ABRELATERD1 883 (+) ACGTG S000414 ABRELATERD1 1124 (‐) ACGTG S000414 ACGTATERD1 27 (+) ACGT S000415 ACGTATERD1 54 (+) ACGT S000415 ACGTATERD1 883 (+) ACGT S000415 ACGTATERD1 1049 (+) ACGT S000415 ACGTATERD1 1125 (+) ACGT S000415 ACGTATERD1 27 (‐) ACGT S000415 ACGTATERD1 54 (‐) ACGT S000415 ACGTATERD1 883 (‐) ACGT S000415 ACGTATERD1 1049 (‐) ACGT S000415 ACGTATERD1 1125 (‐) ACGT S000415 CAREOSREP1 1232 (+) CAACTC S000421 GCCCORE 1450 (+) GCCGCC S000430 CARGCW8GAT 81 (+) CWWWWWWWWG S000431 CARGCW8GAT 387 (+) CWWWWWWWWG S000431 CARGCW8GAT 592 (+) CWWWWWWWWG S000431 CARGCW8GAT 946 (+) CWWWWWWWWG S000431 CARGCW8GAT 1287 (+) CWWWWWWWWG S000431 CARGCW8GAT 81 (‐) CWWWWWWWWG S000431 CARGCW8GAT 387 (‐) CWWWWWWWWG S000431 CARGCW8GAT 592 (‐) CWWWWWWWWG S000431 CARGCW8GAT 946 (‐) CWWWWWWWWG S000431 CARGCW8GAT 1287 (‐) CWWWWWWWWG S000431 GAREAT 511 (+) TAACAAR S000439 WBOXHVISO1 491 (+) TGACT S000442 WBOXHVISO1 1019 (+) TGACT S000442 WBOXHVISO1 1002 (‐) TGACT S000442 WRKY71OS 491 (+) TGAC S000447 WRKY71OS 1019 (+) TGAC S000447 WRKY71OS 1047 (+) TGAC S000447 WRKY71OS 1404 (+) TGAC S000447 WRKY71OS 456 (‐) TGAC S000447 WRKY71OS 956 (‐) TGAC S000447 WRKY71OS 1003 (‐) TGAC S000447


WRKY71OS 1127 (‐) TGAC S000447 WRKY71OS 1438 (‐) TGAC S000447 CACTFTPPCA1 91 (+) YACT S000449 CACTFTPPCA1 290 (+) YACT S000449 CACTFTPPCA1 297 (+) YACT S000449 CACTFTPPCA1 1041 (+) YACT S000449 CACTFTPPCA1 1092 (+) YACT S000449 CACTFTPPCA1 1397 (+) YACT S000449 CACTFTPPCA1 170 (+) YACT S000449 CACTFTPPCA1 412 (+) YACT S000449 CACTFTPPCA1 624 (+) YACT S000449 CACTFTPPCA1 821 (+) YACT S000449 CACTFTPPCA1 894 (+) YACT S000449 CACTFTPPCA1 1023 (+) YACT S000449 CACTFTPPCA1 1073 (+) YACT S000449 CACTFTPPCA1 1087 (+) YACT S000449 CACTFTPPCA1 1267 (+) YACT S000449 CACTFTPPCA1 128 (‐) YACT S000449 CACTFTPPCA1 241 (‐) YACT S000449 CACTFTPPCA1 465 (‐) YACT S000449 CACTFTPPCA1 526 (‐) YACT S000449 CACTFTPPCA1 547 (‐) YACT S000449 CACTFTPPCA1 754 (‐) YACT S000449 CACTFTPPCA1 863 (‐) YACT S000449 CACTFTPPCA1 976 (‐) YACT S000449 CACTFTPPCA1 1011 (‐) YACT S000449 CACTFTPPCA1 1154 (‐) YACT S000449 CACTFTPPCA1 1354 (‐) YACT S000449 CACTFTPPCA1 1463 (‐) YACT S000449 PREATPRODH 1042 (+) ACTCAT S000450 GT1GMSCAM4 39 (+) GAAAAA S000453 GT1GMSCAM4 667 (+) GAAAAA S000453 ARR1AT 198 (+) NGATT S000454 ARR1AT 1295 (+) NGATT S000454 ARR1AT 1368 (+) NGATT S000454 ARR1AT 459 (‐) NGATT S000454 ARR1AT 553 (‐) NGATT S000454 ARR1AT 572 (‐) NGATT S000454 ARR1AT 743 (‐) NGATT S000454 ARR1AT 843 (‐) NGATT S000454 ARR1AT 1111 (‐) NGATT S000454


ARR1AT 1134 (‐) NGATT S000454 ARR1AT 1275 (‐) NGATT S000454 ARR1AT 1282 (‐) NGATT S000454 WBOXNTERF3 491 (+) TGACY S000457 WBOXNTERF3 1019 (+) TGACY S000457 WBOXNTERF3 1002 (‐) TGACY S000457 WBOXNTERF3 955 (‐) TGACY S000457 T/GBOXATPIN2 882 (+) AACGTG S000458 P1BS 275 (+) GNATATNC S000459 P1BS 932 (+) GNATATNC S000459 P1BS 275 (‐) GNATATNC S000459 P1BS 932 (‐) GNATATNC S000459 NODCON1GM 320 (+) AAAGAT S000461 NODCON2GM 1209 (+) CTCTT S000462 NODCON2GM 1324 (+) CTCTT S000462 NODCON2GM 796 (‐) CTCTT S000462 NODCON2GM 928 (‐) CTCTT S000462 LECPLEACS2 593 (‐) TAAAATAT S000465 LECPLEACS2 655 (‐) TAAAATAT S000465 OSE1ROOTNODULE 320 (+) AAAGAT S000467 OSE2ROOTNODULE 1209 (+) CTCTT S000468 OSE2ROOTNODULE 1324 (+) CTCTT S000468 OSE2ROOTNODULE 796 (‐) CTCTT S000468 OSE2ROOTNODULE 928 (‐) CTCTT S000468 ANAERO1CONSENSUS 251 (+) AAACAAA S000477 SORLIP1AT 1355 (‐) GCCAC S000482 CURECORECR 1155 (+) GTAC S000493 CURECORECR 1186 (+) GTAC S000493 CURECORECR 1155 (‐) GTAC S000493 CURECORECR 1186 (‐) GTAC S000493 EECCRCAH1 601 (+) GANTTNC S000494 EECCRCAH1 1191 (+) GANTTNC S000494 EECCRCAH1 1190 (‐) GANTTNC S000494 EECCRCAH1 1272 (‐) GANTTNC S000494 BIHD1OS 455 (+) TGTCA S000498 CGCGBOXAT 1315 (+) VCGCGB S000501 CGCGBOXAT 1381 (+) VCGCGB S000501 CGCGBOXAT 1315 (‐) VCGCGB S000501 CGCGBOXAT 1381 (‐) VCGCGB S000501 MYBCOREATCYCB1 218 (‐) AACGG S000502 MYBCOREATCYCB1 638 (‐) AACGG S000502


PRECONSCRHSP70A 1085 (‐) SCGAYNRNNNNNNNNNNNNNNNHD S000506 PRECONSCRHSP70A 1152 (‐) SCGAYNRNNNNNNNNNNNNNNNHD S000506 PRECONSCRHSP70A 1428 (‐) SCGAYNRNNNNNNNNNNNNNNNHD S000506 ABRERATCAL 882 (+) MACGYGB S000507 ABRERATCAL 1314 (+) MACGYGB S000507 ABRERATCAL 1380 (+) MACGYGB S000507 RHERPATEXPA7 27 (‐) KCACGW S000512
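
The tables above list, for each motif, a position, a strand, and a signal sequence written in IUPAC nucleotide codes (for example `W` = A or T, `N` = any base). The following is a minimal sketch of how such a scan could be reproduced for a single motif. It is not the PLACE implementation. It assumes 1-based positions and that a hit on the (‐) strand is reported at its leftmost coordinate on the forward strand, which appears consistent with entries such as TGACGT at 373 (+) and ACGTCA at 373 (‐) for GmWI‐1. The function names and the toy sequence are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Complement table that also covers the ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a sequence or IUPAC motif."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def motif_regex(motif: str) -> "re.Pattern[str]":
    """Compile a motif into a lookahead regex so overlapping hits are found."""
    body = "".join(IUPAC[base] for base in motif.upper())
    return re.compile("(?=(" + body + "))")


def scan(sequence: str, motif: str) -> list:
    """Return sorted (position, strand) hits; positions are 1-based and
    refer to the leftmost base of the hit on the forward strand
    (an assumed convention, see the lead-in)."""
    seq = sequence.upper()
    hits = [(m.start() + 1, "+") for m in motif_regex(motif).finditer(seq)]
    rc = reverse_complement(motif)
    hits += [(m.start() + 1, "-") for m in motif_regex(rc).finditer(seq)]
    return sorted(hits)


# Toy sequence (not from the thesis): TGACG occurs once on each strand.
toy_hits = scan("TTGACGTCAA", "TGACG")
```

A palindromic motif such as ACGT or CANNTG is its own reverse complement, so every occurrence is reported on both strands at the same position, which matches the paired (+) and (‐) entries in the tables.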

APPENDIX F.

CELLO DATA


CELLO RESULTS

SeqID: Glyma03g05050.1
CELLO prediction: (predictor location reliable-index)
Composition Mitochondrial 0.613
Di-peptide Cytoplasmic 0.409
part-Comp. Cytoplasmic 0.470
chemo-typy Cytoplasmic 0.770
Neighboring Extracellular 0.417

Combined SVM classifier:
Extracellular 0.695
PlasmaMembrane 0.254
Cytoplasmic 1.904 *
Cytoskeletal 0.022
ER 0.060
Golgi 0.038
Lysosomal 0.048
Mitochondrial 1.356 *
Chloroplast 0.289
Peroxisomal 0.160
Vacuole 0.016
Nuclear 0.159

*********************************************************************************
SeqID: Glyma09g41610.1
CELLO prediction: (predictor location reliable-index)
Composition Mitochondrial 0.644
Di-peptide Cytoplasmic 0.383
part-Comp. Cytoplasmic 0.457
chemo-typy Cytoplasmic 0.684
Neighboring Extracellular 0.469

Combined SVM classifier:
Extracellular 0.880
PlasmaMembrane 0.053
Cytoplasmic 1.797 *
Cytoskeletal 0.018
ER 0.101
Golgi 0.028
Lysosomal 0.116
Mitochondrial 1.331 *
Chloroplast 0.151
Peroxisomal 0.346
Vacuole 0.016
Nuclear 0.164


*********************************************************************************
SeqID: Glyma11g29740.1
CELLO prediction: (predictor location reliable-index)
Composition Mitochondrial 0.683
Di-peptide Cytoplasmic 0.897
part-Comp. Cytoplasmic 0.447
chemo-typy Cytoplasmic 0.513
Neighboring Mitochondrial 0.382

Combined SVM classifier:
Extracellular 0.045
PlasmaMembrane 0.017
Cytoplasmic 2.414 *
Cytoskeletal 0.009
ER 0.019
Golgi 0.007
Lysosomal 0.004
Mitochondrial 1.474
Chloroplast 0.586
Peroxisomal 0.073
Vacuole 0.004
Nuclear 0.347

*********************************************************************************
SeqID: Glyma11g29750.1
CELLO prediction: (predictor location reliable-index)
Composition Extracellular 0.337
Di-peptide Extracellular 0.364
part-Comp. Cytoplasmic 0.553
chemo-typy Nuclear 0.500
Neighboring Extracellular 0.343

Combined SVM classifier:
Extracellular 1.301 *
PlasmaMembrane 0.287
Cytoplasmic 1.411 *
Cytoskeletal 0.012
ER 0.082
Golgi 0.102
Lysosomal 0.070
Mitochondrial 0.491
Chloroplast 0.394
Peroxisomal 0.109
Vacuole 0.014
Nuclear 0.726


*********************************************************************************
SeqID: Glyma11g35800.1
CELLO prediction: (predictor location reliable-index)
Composition PlasmaMembrane 0.514
Di-peptide Chloroplast 0.732
part-Comp. PlasmaMembrane 0.398
chemo-typy Chloroplast 0.774
Neighboring Mitochondrial 0.334

Combined SVM classifier:
Extracellular 0.762
PlasmaMembrane 1.072
Cytoplasmic 0.360
Cytoskeletal 0.017
ER 0.027
Golgi 0.035
Lysosomal 0.030
Mitochondrial 0.591
Chloroplast 1.872 *
Peroxisomal 0.055
Vacuole 0.026
Nuclear 0.154

*********************************************************************************
SeqID: Glyma14g38040.1
CELLO prediction: (predictor location reliable-index)
Composition Cytoplasmic 0.434
Di-peptide Chloroplast 0.469
part-Comp. Cytoplasmic 0.607
chemo-typy Nuclear 0.377
Neighboring Extracellular 0.383

Combined SVM classifier:
Extracellular 0.825
PlasmaMembrane 0.241
Cytoplasmic 1.562 *
Cytoskeletal 0.015
ER 0.072
Golgi 0.218
Lysosomal 0.118
Mitochondrial 0.164
Chloroplast 0.862
Peroxisomal 0.135
Vacuole 0.037
Nuclear 0.752

*********************************************************************************
SeqID: Glyma18g02610.1
CELLO prediction: (predictor, location, reliable-index)
  Composition   PlasmaMembrane   0.398
  Di-peptide    Chloroplast      0.568
  part-Comp.    PlasmaMembrane   0.494
  chemo-typy    Chloroplast      0.686
  Neighboring   Mitochondrial    0.270

Combined SVM classifier:
  Extracellular    0.645
  PlasmaMembrane   1.032
  Cytoplasmic      0.493
  Cytoskeletal     0.018
  ER               0.027
  Golgi            0.034
  Lysosomal        0.038
  Mitochondrial    0.642
  Chloroplast      1.792 *
  Peroxisomal      0.056
  Vacuole          0.028
  Nuclear          0.194

*********************************************************************************
SeqID: Glyma18g06340.1
CELLO prediction: (predictor, location, reliable-index)
  Composition   Extracellular    0.672
  Di-peptide    Extracellular    0.524
  part-Comp.    Cytoplasmic      0.574
  chemo-typy    Nuclear          0.468
  Neighboring   Extracellular    0.621

Combined SVM classifier:
  Extracellular    2.076 *
  PlasmaMembrane   0.300
  Cytoplasmic      1.329
  Cytoskeletal     0.006
  ER               0.067
  Golgi            0.050
  Lysosomal        0.091
  Mitochondrial    0.131
  Chloroplast      0.239
  Peroxisomal      0.080
  Vacuole          0.008
  Nuclear          0.621

*********************************************************************************
SeqID: Glyma18g06350.1
CELLO prediction: (predictor, location, reliable-index)
  Composition   Mitochondrial    0.572
  Di-peptide    Cytoplasmic      0.856
  part-Comp.    Cytoplasmic      0.526
  chemo-typy    Cytoplasmic      0.421
  Neighboring   Cytoplasmic      0.325

Combined SVM classifier:
  Extracellular    0.126
  PlasmaMembrane   0.046
  Cytoplasmic      2.374 *
  Cytoskeletal     0.015
  ER               0.024
  Golgi            0.014
  Lysosomal        0.008
  Mitochondrial    1.189
  Chloroplast      0.467
  Peroxisomal      0.131
  Vacuole          0.007
  Nuclear          0.600

*********************************************************************************
SeqID: Glyma18g44090.1
CELLO prediction: (predictor, location, reliable-index)
  Composition   Mitochondrial    0.477
  Di-peptide    Extracellular    0.327
  part-Comp.    Extracellular    0.491
  chemo-typy    Extracellular    0.354
  Neighboring   Extracellular    0.594

Combined SVM classifier:
  Extracellular    1.823 *
  PlasmaMembrane   0.078
  Cytoplasmic      1.318 *
  Cytoskeletal     0.009
  ER               0.103
  Golgi            0.028
  Lysosomal        0.116
  Mitochondrial    1.095
  Chloroplast      0.176
  Peroxisomal      0.147
  Vacuole          0.009
  Nuclear          0.098

BIBLIOGRAPHY

[1] Glycine max (L.) Merr., USDA Plants Profile, http://plants.usda.gov/java/profile?symbol=GLMA4, August 2012.

[2] T. Hymowitz, On the domestication of the soybean, Econ. Bot. 24 (1970) 408-421.

[3] T. Hymowitz, J. R. Harlan, Introduction of soybean to North America by Samuel Bowen in 1765, Econ. Bot. 37 (1983) 371-379.

[4] The Biology of Glycine max (L.) Merr. (Soybean), Canadian Food Inspection Agency, http://www.inspection.gc.ca/plants/plants-with-novel-traits/applicants/directive-94-08/biology-documents/glycine-max-l-merr-/eng/1330975306785/1330975382668, August 2012.

[5] Why Sequence the Soybean?, DOE Joint Genome Institute, http://www.jgi.doe.gov/sequencing/why/3123.html, August 2012.

[6] 2011 U.S. Soybean Production, USDA National Agricultural Statistics Service, http://quickstats.nass.usda.gov/results/DAA55BB7-3C8F-3E17-9D63-EE1E8C885962?pivot=short_desc, August 2012.

[7] 2010 Soybean Production, Food and Agriculture Organization of the United Nations, http://faostat.fao.org/site/339/default.aspx, August 2012.

[8] J. Schmutz, et al., Genome sequence of the palaeopolyploid soybean, Nature 463 (2010) 178-183.

[9] P. Gepts, W.D. Beavis, E.C. Brummer, R.C. Shoemaker, H.T. Stalker, N.F. Weeden, N.D. Young, Legumes as a model plant family: Genomics for food and feed Report of the Cross‐Legume Advances through Genomics Conference, Plant Physiol. 137 (2005) 1228-1235.

[10] D.J. Bertioli, M.C. Moretzsohn, L.H. Madsen, N. Sandal, S.C.M. Leal-Bertioli, P.M. Guimarães, B.K. Hougaard, J. Fredslund, L. Schauser, A.M. Nielsen, S. Sato, S. Tabata, S.B. Cannon, J. Stougaard, An analysis of synteny of Arachis with Lotus and Medicago sheds new light on the structure, stability and evolution of legume genomes, BMC Genomics 10 (2009) 1-11.

[11] M. Lavin, P.S. Herendeen, M.F. Wojciechowski, Evolutionary rates analysis of Leguminosae implicates a rapid diversification of lineages during the tertiary, Syst. Biol. 54 (2005) 575-594.

[12] B.E. Pfeil, J.A. Schlueter, R.C. Shoemaker, J.J. Doyle, Placing paleopolyploidy in relation to taxon divergence: a phylogenetic analysis in legumes using 39 gene families, Syst. Biol. 54 (2005) 441-454.

[13] J.A. Schlueter, P. Dixon, C. Granger, D. Grant, L. Clark, J.J. Doyle, R.C. Shoemaker, Mining EST databases to resolve evolutionary events in major crop species, Genome 47 (2004) 868-876.

[14] M. Hurles, Gene Duplication: The Genomic Trade in Spare Parts, PLoS Biology 2 (2004) 900-904.

[15] S. Rastogi, D. A. Liberles, Subfunctionalization of duplicated genes as a transition state to neofunctionalization, BMC Evolutionary Biology 5 (2005) 1-7.

[16] What are gene families?, Genetics Home Reference, http://ghr.nlm.nih.gov/handbook/howgeneswork/genefamilies, August 2012.

[17] S. B. Cannon, N. D. Young, OrthoParaMap: Distinguishing orthologs from paralogs by integrating comparative genome data and gene phylogenies, BMC Bioinformatics 4 (2003) 1-15.

[18] A. Stanford, M. Bevan, D. Northcote, Differential expression within a family of novel wound-induced genes in potato, Mol. Gen. Genet. 215 (1989) 200-208.

[19] S.K. Yen, M.C. Chung, P.C. Chen, H.E. Yen, Environmental and developmental regulation of the wound-induced cell wall protein WI12 in the halophyte ice plant, Plant Physiol. 127 (2001) 517-528.

[20] J. J. Diwan, Review: Basic Concepts of Protein Structure, http://www.rpi.edu/dept/bcbp/molbiochem/MBWeb/mb1/part2/protein.htm, August 2012.

[21] J. Clark, The Structure of Proteins, http://www.chemguide.co.uk/organicprops/aminoacids/proteinstruct.html, August 2012.

[22] R. L. Dunbrack Jr., Sequence comparison and protein structure prediction, Curr. Opin. Struct. Biol. 16 (2006) 374–384.

[23] J. U. Bowie, R. Luthy, D. Eisenberg, A Method to Identify Protein Sequences that Fold into a Known Three-Dimensional Structure, Science 253 (1991) 164-170.

[24] What are protein domains?, European Bioinformatics Institute, http://www.ebi.ac.uk/training/online/course/introduction-protein-classification-ebi/protein-classification/what-are-protein-domains, August 2012.

[25] Medicago truncatula, Phytozome, http://www.phytozome.net/medicago.php, August 2012.

[26] Selaginella moellendorffii, DOE Joint Genome Institute, http://genome.jgi-psf.org/Selmo1/Selmo1.home.html, August 2012.

[27] J. A. Banks, et al., The Selaginella Genome Identifies Genetic Changes Associated with the Evolution of Vascular Plants, Science 332 (2011) 960-963.

[28] R. D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J. E. Pollington, O. L. Gavin, P. Gunesekaran, G. Ceric, K. Forslund, L. Holm, E. L. Sonnhammer, S. R. Eddy, A. Bateman, The Pfam protein families database, Nucleic Acids Res. Database Issue 38 (2010) D211-222.

[29] P.D. Thomas, A. Kejariwal, M. J. Campbell, H. Mi, K. Diemer, N. Guo, I. Ladunga, B. Ulitsky-Lazareva, A. Muruganujan, S. Rabkin, J. A. Vandergriff, O. Doremieux, PANTHER: a browsable database of gene products organized by biological function using curated protein family and subfamily classification, Nucleic Acids Res. 31 (2003) 334-341.

[30] The KOG Browser, DOE Joint Genome Institute, http://genome.jgi.doe.gov/Tutorial/tutorial/kog.html, August 2012.

[31] T. Madden, Chapter 16: The BLAST Sequence Analysis Tool, The NCBI Handbook (2002), http://www.ncbi.nlm.nih.gov/books/NBK21097/, August 2012.

[32] Web BLAST page options, NCBI, http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml, August 2012.

[33] ESTs: Gene Discovery Made Easier, NCBI, http://www.ncbi.nlm.nih.gov/About/primer/est.html, August 2012.

[34] Phytozome Tutorials and Help, http://phytozome.net/help.php, August 2012.

[35] D. Huson, Alignments, Algorithmic Bioinformatics (2005) 4-18.

[36] ClustalW2 Multiple Sequence Alignments Help, http://www.ebi.ac.uk/Tools/msa/clustalw2/help/index.html, August 2012.

[37] What is PAL2NAL?, http://www.bork.embl.de/pal2nal/, August 2012.

[38] Synonymous Non-synonymous Analysis Program (SNAP), http://www.hiv.lanl.gov/content/sequence/SNAP/SNAP.html, August 2012.

[39] M. Nei, T. Gojobori, Simple Methods for Estimating the Numbers of Synonymous and Nonsynonymous Nucleotide Substitutions, Mol. Biol. Evol. 3 (1986) 418-426.

[40] T. L. Bailey, N. Williams, C. Misleh, W. W. Li, MEME: discovering and analyzing DNA and protein sequence motifs, Nucleic Acids Res. 34 (2006) W369–W373.

[41] T. L. Bailey, C. Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (1994) 28-36.

[42] M. Stanke, R. Steinkamp, S. Waack, B. Morgenstern, AUGUSTUS: a web server for gene finding in eukaryotes, Nucleic Acids Res. 32 (2004) W309–W312.

[43] C.S. Yu, Y.C. Chen, C.H. Lu, J.K. Hwang, Prediction of protein subcellular localization, Proteins: Struct., Funct., Bioinf. 64 (2006) 643-651.

[44] P. McClean, Cis-Acting Elements and Trans-Acting Factors, http://www.ndsu.edu/pubweb/~mcclean/plsc731/cis-trans/cis-trans6.htm, August 2012.

[45] K. Higo, Y. Ugawa, M. Iwamoto, T. Korenaga, Plant cis-acting regulatory DNA elements (PLACE) database, Nucleic Acids Res. 27 (1999) 297-300.

[46] D.M. Goodstein, S. Shu, R. Howson, R. Neupane, R. D. Hayes, J. Fazo, T. Mitros, W. Dirks, U. Hellsten, N. Putnam, D. S. Rokhsar, Phytozome: a comparative platform for green plant genomics, Nucleic Acids Res. 40 (2012) D1178-D1186.

[47] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic local alignment search tool, J. Mol. Biol. 215 (1990) 403-410.

[48] W. Gish, D.J. States, Identification of protein coding regions by database similarity search, Nature Genet. 3 (1993) 266-272.

[49] M. Stanke, M. Diekhans, R. Baertsch, D. Haussler, Using native and syntenically mapped cDNA alignments to improve de novo gene finding, Bioinformatics 24 (2008) 637-644.

[50] M. A. Larkin, G. Blackshields, N.P. Brown, R. Chenna, P. A. McGettigan, H. McWilliam, F. Valentin, I. M. Wallace, A. Wilm, R. Lopez, J. D. Thompson, T. J. Gibson, D. G. Higgins, ClustalW and ClustalX version 2, Bioinformatics 23 (2007) 2947-2948.

[51] M. Goujon, H. McWilliam, W. Li, F. Valentin, S. Squizzato, J. Paern, R. Lopez, A new bioinformatics analysis tools framework at EMBL-EBI, Nucleic Acids Res. 38 (2010) W695-W699.

[52] I.V. Grigoriev, H. Nordberg, I. Shabalov, A. Aerts, M. Cantor, D. Goodstein, A. Kuo, S. Minovitsky, R. Nikitin, R. A. Ohm, R. Otillar, A. Poliakov, I. Ratnere, R. Riley, T. Smirnova, D. Rokhsar, I. Dubchak, The genome portal of the Department of Energy Joint Genome Institute, Nucleic Acids Res. 40 (2012) D26-32.

[53] M. Suyama, D. Torrents, P. Bork, PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments, Nucleic Acids Res. 34 (2006) W609-W612.

[54] HIV database website, www.hiv.lanl.gov, August 2012.

[55] B. Korber, HIV Signature and Sequence Variation Analysis, Computational Analysis of HIV Molecular Sequences Chapter 4 (2000) 55-72.

[56] A. Lobley, M.I. Sadowski, D.T. Jones, pGenTHREADER and pDomTHREADER: New Methods For Improved Protein Fold Recognition and Superfamily Discrimination, Bioinformatics 25 (2009) 1761-1767.

[57] T.L. Bailey, C. Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (1994) 28-36.

[58] P.J. Rushton, J.T. Torres, M. Parniske, P. Wernert, K. Hahlbrock, I.E. Somssich, Interaction of elicitor-induced DNA-binding proteins with elicitor response elements in the promoters of parsley genes, EMBO J. 15 (1996) 5690-5700.

[59] T. Nishiuchi, H. Shinshi, K. Suzuki, Rapid and transient activation of transcription of the ERF3 gene by wounding in tobacco leaves: Possible involvement of NtWRKYs and autorepression, J. Biol. Chem. 279 (2004) 55355-55361.

[60] M. Boter, O. Ruiz-Rivero, A. Abdeen, S. Prat, Conserved MYC transcription factors play a key role in jasmonate signaling both in tomato and Arabidopsis, Genes Dev. 18 (2004) 1577-1591.

VITA

Gena Clare Robertson was born in Poplar Bluff, Missouri, USA. In December 2010, she received her B.S. degree in Physics from Missouri University of Science and Technology, Rolla, Missouri. In December 2012, she received her M.S. degree in Applied and Environmental Biology from Missouri University of Science and Technology, Rolla, Missouri.