BMC Biology BioMed Central

Research article
Open Access
Identification of novel conserved peptide uORF homology groups in Arabidopsis and rice reveals ancient eukaryotic origin of select groups and preferential association with transcription factor-encoding genes
Celine A Hayden and Richard A Jorgensen*

Address: Department of Sciences, University of Arizona, Tucson, AZ 85721-0036, USA Email: Celine A Hayden - [email protected]; Richard A Jorgensen* - [email protected] * Corresponding author

Published: 30 July 2007 Received: 22 January 2007 Accepted: 30 July 2007 BMC Biology 2007, 5:32 doi:10.1186/1741-7007-5-32 This article is available from: http://www.biomedcentral.com/1741-7007/5/32 © 2007 Hayden and Jorgensen; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Background: Upstream open reading frames (uORFs) can mediate translational control over the largest, or major ORF (mORF) in response to starvation, polyamine concentrations, and sucrose concentrations. One plant uORF with conserved peptide sequences has been shown to exert this control in an amino acid sequence-dependent manner but generally it is not clear what kinds of genes are regulated, or how extensively this mechanism is invoked in a given genome.
Results: By comparing full-length cDNA sequences from Arabidopsis and rice we identified 26 distinct homology groups of conserved peptide uORFs, only three of which have been reported previously. Pairwise Ka/Ks analysis showed that purifying selection had acted on nearly all conserved peptide uORFs and their associated mORFs. Functions of predicted mORF proteins could be inferred for 16 homology groups and many of these proteins appear to have a regulatory function, including 6 transcription factors, 5 signal transduction factors, 3 developmental signal molecules, a homolog of translation initiation factor eIF5, and a RING finger protein. Transcription factors are clearly overrepresented in this data set when compared to the frequency calculated for the entire genome (p = 1.2 × 10⁻⁷). Duplicate gene pairs arising from a whole genome duplication (ohnologs) with a conserved uORF are much more likely to have been retained in Arabidopsis (Arabidopsis thaliana) than are ohnologs of other genes (39% vs 14% of ancestral genes, p = 5 × 10⁻³). Two uORF groups were found in animals, indicating an ancient origin of these putative regulatory elements.
Conclusion: Conservation of uORF amino acid sequence, association with homologous mORFs over long evolutionary time periods, preferential retention after whole genome duplications, and preferential association with mORFs coding for transcription factors suggest that the conserved peptide uORFs identified in this study are strong candidates for translational controllers of regulatory genes.

Background
Upstream open reading frames (uORFs) are small open reading frames found in the 5' UTR of a mature mRNA, and can mediate translational regulation of the largest, or major, ORF (mORF). Regulation by uORFs has been studied in several individual transcripts demonstrating the importance of uORFs in such processes as polyamine production [1], amino acid production [2,3], and sucrose response [4], but the biological effect of uORFs in the vast majority of transcripts of the genome is still unclear. Upstream start codons (uAUGs) occur in 20–30% of yeast, mammalian, and plant transcript 5' UTRs [5-7]; therefore, potentially thousands of genes are regulated in this manner.

The majority of characterized uORFs appear to act in an amino acid sequence-independent manner, regulating mORF translation by the uORF start codon nucleotide context, by the uORF length, or by the distance between the uORF stop codon and the mORF start codon, rather than by uORF-encoded peptides [8-11]. Some uORFs, however, do rely on peptide sequences to mediate translational regulation of the associated mORF, but few examples have been identified and characterized to date. In fungi and animals, a few genes have been shown to contain uORFs whose amino acid sequences are similar between two or more species [12-17], but only two cases, CPA1 [3] and SAMDC1/AdoMetDC1 [18], have demonstrated uORF sequence-dependent regulation. In plants, two groups of genes, S-Adenosylmethionine decarboxylases (AdoMetDCs; EC 4.1.1.50) and group S basic region leucine zipper (bZIP) transcription factors, have been shown to contain uORFs with similar amino acids between monocots and dicots [19,20]. In the former group, mORF translational regulation is dependent on the sequence of the uORF peptide [1,4] and overexpression of the mORF in either group results in stunted or lethal phenotypes, suggesting that these genes play a critical role in growth and/or development. Indeed, AdoMetDC is required for the synthesis of polyamines, molecules that are implicated in essential plant functions such as cell division, embryogenesis, leaf, root, and flower development, and stress responses [21,22].

In general, it has been difficult to carry out genome-wide surveys of conserved peptide uORFs due to poor annotation of 5' UTRs. The availability of expressed sequence tags (ESTs) has improved exon and intron annotation of the genomic sequence, but they are relatively short and often do not predict the entire mRNA molecule, even when several ESTs overlap the same genomic region and can be assembled to predict one transcript. As there are very few introns in yeast transcripts, prediction of uORF conservation has been attempted in S. cerevisiae by analyzing genomic sequence upstream of predicted mORF start sites [23], but it is still not clear whether these uORFs are truly conserved (i.e., are under negative selection pressures), or are simply undergoing evolutionary drift. With the sequencing of the Aspergillus nidulans genome, comparison to A. fumigatus and A. oryzae has identified 38 uORFs with putatively conserved start and stop codon positions relative to the mORF, 14 of which are conserved in one of Neurospora crassa, Fusarium graminearum, or Magnaporthe grisea [5], but the authors did not comment on whether the uORF amino acid sequences are also conserved.

With the emergence of large plant full-length cDNA sequence collections [24-26], it is now possible to adopt a comparative genomics approach to determine the prevalence of conserved amino acid uORFs in the genome and the persistence of these elements throughout eukaryotic evolution. Because rice and Arabidopsis shared a common ancestor 140–200 million years ago (Mya) [27-29], sequence similarity retained over this amount of time provides good candidates for truly conserved peptide uORF sequences. In this study we have used Oryza sativa (rice) and Arabidopsis thaliana (Arabidopsis) full-length cDNA sequence collections to estimate the incidence of conserved peptide uORFs in the rice and Arabidopsis genomes, to determine the prevalence of uORFs within regulatory genes, and to compare evolutionary rates for uORFs versus mORFs. By examining more distantly related sequences, we posit an ancient origin for select uORFs and we provide evidence for one mechanism by which uORFs can arise within genes.

Results
Identification of conserved peptide uORFs by comparison of rice and Arabidopsis transcripts
To identify conserved peptide uORFs, we developed "uORF-Finder", a Perl program that first compares the mORF amino acid sequence of each cDNA from one collection with the mORF sequences of another species' collection to identify putative mORF homologs, and then compares the uORFs in the 5' UTRs of the two paired sequences to identify uORFs with conserved amino acid sequences (see Methods). Comparison by uORF-Finder of a corrected set of 34000 full-length cDNA sequences from Arabidopsis with a similar set from rice resulted in the identification of conserved peptide uORFs in 44 Arabidopsis genes and 36 rice genes, which together comprise 19 homology groups based on uORF amino acid similarity (Tables 1, 2, 3; Figures 1, 2, 3, 4, 5). All three of the homology groups that had been previously reported were identified by uORF-Finder [1,4]. The other 16 conserved uORFs have not been reported previously. Homologs of these 19 conserved uORF groups also exist in other angiosperm species (Figures 1, 2, 3, 4, 5).

Comparison of Arabidopsis homologs detects additional conserved uORFs
Conserved uORFs that are not sufficiently well conserved to be detected in a rice-Arabidopsis comparison could conceivably be detected in ohnologs, homologous genes arising by whole-genome duplication (WGD) [30], and in paralogs, homologous genes arising from segmental duplication or tandem duplication. Modification of uORF-Finder allowed comparison of each full-length cDNA to all other cDNAs in the same collection (see Methods), and identified seven additional conserved uORF homology groups (Tables 1 and 4; Figures 6, 7, 8). Six of these pairs are ohnologs, created by the most recent WGD (24–40 Mya) in an ancestor of Arabidopsis [31-33]. The seventh pair is not found in syntenic regions and is most likely a paralogous pair. It appears to have arisen at about the same time as the recent WGD event because its synonymous substitution frequency (Ks value) of 0.7 is similar to the median Ks of recent duplicate pairs (0.8) and is within their Ks range (0.4–1.6) [32]. The corresponding rice genes in four of the seven homology groups possess uORFs, but lack sufficient uORF sequence similarity to have been detected in the Arabidopsis-rice comparison (Figures 6, 7, 8).

Purifying selection maintains uORF amino acid sequences
Pairwise Ka/Ks tests for selection on amino acid sequences were applied to each uORF homology group and their associated mORFs to determine whether uORF amino acid sequences are under selective constraints similar to their associated mORFs. Both an approximate method (Yn00) and a maximum likelihood method (codeml) were used to calculate mean pairwise Ka/Ks ratios for each group. A Ka/Ks ratio less than 1 implies that negative, or purifying, selection has acted on the sequence, a ratio equal to 1 suggests drift, and a ratio greater than 1 indicates that positive selection has acted on an amino acid sequence. It is also true that conservation at the nucleotide level, not the amino acid level, can drive the Ka/Ks ratio to one. Analysis of all 26 homology groups showed that generally both uORFs and mORFs have been under mild to strong purifying selection since the divergence of each gene pair (Table 5) and these low Ka/Ks ratios suggest that

Table 1: uORF homology groups and associated mORF molecular function and biological role

Homology group | mORF: known or probable molecular function/domain | Known or inferred biological process | Source
uORF conserved in Arabidopsis and rice
1 | bZIP transcription factor | Sucrose regulation | [89]
2 | bHLH transcription factor | Transcriptional control | [68]
3 | AdoMetDC | Polyamine biosynthesis: developmental regulation | [1]
4 | Unknown; plant-specific | Unknown | BLAST analysis
5 | Ankyrin repeat protein | Unknown | Protein domain analysis*
6 | Amine oxidase | Unknown | Protein domain analysis*
7 | Putative translation initiation factor eIF5 | Start codon selection | Protein domain analysis*
8 | Similar to Mic-1 | Unknown | BLAST analysis
9 | Unknown, cysteine-rich | Unknown (Possible novel zinc finger?) | CX4–7CX10CX2HX5 tandem repeats
10 | MAP kinase | Signal transduction | PlantsP database
11 | Trehalose-6-phosphate phosphatase | Trehalose metabolism: developmental regulation | [90]
12 | Unknown | Systemically primed response to pathogens | [91]
13 | Phosphoethanolamine N-methyltransferase | Phosphocholine biosynthesis | [38]
14 | HDZip class I transcription factor | Transcriptional control; development | [92,93]
15 | bHLH transcription factor | Transcriptional control; responsive to polyamine? | [52,68]
16 | MAP kinase | Signal transduction | PlantsP database [99]
17 | Unknown | Unknown |
18 | Transcription co-activator/repressor HsfB1 | Mediator of heat shock response | [94,95]
19 | SAUR protein | Mediator of auxin response; calmodulin (CaM) binding | IPR003676; [96]
uORF conserved in Arabidopsis paralogs
20 | Unknown | Unknown |
21 | ERF/AP2 transcription factor | Putative regulator of pathogen resistance | [97,98]
22 | Unknown | Unknown |
23 | MAP kinase | Signal transduction | PlantsP database [99]
24 | Unknown | Unknown |
25 | Calcium response protein kinase | Ca++/CaM-dependent signal transduction | PlantsP database [100]
26 | RING finger (C3HC4-type zinc finger) | Ubiquitination; mediator of protein degradation | Protein domain analysis*

bZIP, basic leucine zipper; bHLH, basic helix-loop-helix; AdoMetDC, S-Adenosylmethionine decarboxylase; Mic-1, colon cancer-associated protein macrophage-inhibitory cytokine 1; MAP kinase, mitogen activated protein kinase; HDZip, homeodomain leucine zipper; ERF/AP2, ethylene response factor/apetala2. *As determined by InterProScan and NCBI conserved domains search.
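The Results describe uORF-Finder only in outline, and its Perl source is not reproduced here. As a minimal illustrative sketch of just one step, enumerating candidate uORFs in a transcript's 5' UTR, the following Python code may help. The function names `find_uorfs` and `translate`, the `min_codons` parameter, and the toy sequence are all hypothetical and are not taken from uORF-Finder; the homolog pairing and peptide comparison steps are not shown.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def translate(orf: str) -> str:
    """Translate a DNA ORF (given without its stop codon) into a peptide."""
    orf = orf.upper()
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf) - 2, 3))


def find_uorfs(cdna: str, morf_start: int, min_codons: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for ORFs whose ATG lies upstream of the mORF.

    cdna is the transcript as DNA (5' to 3'); morf_start is the 0-based index
    of the mORF ATG. end is exclusive and includes the stop codon.
    min_codons is a hypothetical length cutoff (codons before the stop).
    """
    seq = cdna.upper()
    uorfs = []
    # Only ATGs lying entirely within the 5' UTR are considered.
    for start in range(0, morf_start - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                # An in-frame upstream ATG that reads through the mORF start
                # is an N-terminal extension of the mORF, not a uORF.
                in_frame_extension = pos > morf_start and (morf_start - start) % 3 == 0
                if not in_frame_extension and (pos - start) // 3 >= min_codons:
                    uorfs.append((start, pos + 3))
                break
    return uorfs


# Toy transcript (invented for illustration): a 3-codon uORF, then the mORF at index 16.
toy = "GG" + "ATGTCTCCTTAA" + "CC" + "ATGGCTGCTGCTTGA"
for start, end in find_uorfs(toy, morf_start=16):
    print(start, end, translate(toy[start:end - 3]))  # prints: 2 14 MSP
```

In a comparison like the one described, the peptides returned for two paired transcripts would then be compared for amino acid similarity; the similarity measure and thresholds are given in the Methods and are not assumed here.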

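The interpretation of Ka/Ks ratios given in the text (below 1, purifying selection; about 1, drift; above 1, positive selection) can be illustrated with a short sketch. The Ka and Ks values below are invented for illustration, the tolerance around 1 is an arbitrary assumption, and averaging the per-pair ratios is one plausible reading of "mean pairwise Ka/Ks". The estimation of Ka and Ks itself (done with Yn00 and codeml in the study) is not reproduced.

```python
from statistics import mean
from typing import Dict, Tuple


def classify_ka_ks(ratio: float, tol: float = 0.05) -> str:
    """Map a Ka/Ks ratio to the interpretation given in the text.

    tol is a hypothetical tolerance for treating a ratio as "equal to 1".
    """
    if ratio < 1 - tol:
        return "purifying selection"
    if ratio > 1 + tol:
        return "positive selection"
    return "consistent with drift"


def mean_pairwise_ka_ks(pairs: Dict[Tuple[str, str], Tuple[float, float]]) -> float:
    """Average the per-pair Ka/Ks ratios of one homology group.

    pairs maps (sequence A, sequence B) to (Ka, Ks); pairs with Ks == 0
    are skipped because the ratio is undefined.
    """
    ratios = [ka / ks for ka, ks in pairs.values() if ks > 0]
    if not ratios:
        raise ValueError("no pair with Ks > 0")
    return mean(ratios)


# Hypothetical (Ka, Ks) estimates for three sequences of one group.
group = {
    ("seqA", "seqB"): (0.10, 0.50),
    ("seqA", "seqC"): (0.20, 0.80),
    ("seqB", "seqC"): (0.15, 0.60),
}
ratio = mean_pairwise_ka_ks(group)
print(round(ratio, 3), classify_ka_ks(ratio))
```

As the text notes, a ratio near one can also arise from conservation at the nucleotide level, so a classification like this is a starting point for interpretation and not a test in itself.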

Table 2: Arabidopsis loci with conserved peptide uORFs identified from Arabidopsis-rice comparison

Homology group | Locus | Gene name | mORF description | Gene ontology molecular function | Recent duplicate
1 | At2g18160.1 | GBF5, AtbZIP2 | Basic leucine zipper (bZIP) | Transcription factor | At4g34590
1 | At4g34590.1 | GBF6, ATB2, AtbZIP11 | bZIP | Transcription factor | At2g18160
1 | At3g62420.1 (a) | AtbZIP53 | bZIP | Transcription factor | Not found
1 | At5g49450.1 | AtbZIP1 | bZIP | Transcription factor | Not found
1 | At1g75390.1 | AtbZIP44 | bZIP | Transcription factor | Not found
2 | At2g27230.1 | AtBHLH156 (b) | Basic helix-loop-helix (bHLH) | Transcription factor | Not found
2 | At2g31280.1 | AtBHLH155 (b) | bHLH | Transcription factor | At1g06150
2 | At1g06150.1 | | bHLH | Transcription factor | At2g31280
3 | At3g02470.1 | AdoMetDC1 | AdoMetDC | AdoMetDC | At5g15950
3 | At5g15950.1 | AdoMetDC2 | AdoMetDC | AdoMetDC | At3g02470
3 | At3g25570.1 | AdoMetDC3 | AdoMetDC | AdoMetDC | Not found
4 | At4g25670.1 | | Expressed transcript | Unknown | At5g52550
4 | At4g25690.1 | | Expressed transcript | Unknown | At5g52550 (c)
4 | At5g52550.1 | | Expressed transcript | Unknown | At4g25670
5 | At5g61230.1 | | Ankyrin repeat | Protein binding | At5g07840
5 | At5g07840.1 | | Ankyrin repeat | Protein binding | At5g61230
6 | At2g43020.1 | | Amine oxidase | Oxidoreductase | At3g59050
6 | At3g59050.1 | | Amine oxidase | Oxidoreductase | At2g43020
7 | At1g36730.1 | | Putative eIF-5 | Translation initiation factor | Not found
8 | At3g12010.1 (a) | | Similar to Mic-1 | Unknown | Not found
9 | At5g09670.1 and .2 | | Expressed transcript | Unknown | At5g64550
9 | At5g64550.1 | | Expressed transcript | Unknown | At5g09670
9 | At1g64140.1 | | Expressed transcript | Unknown | Not found
10 | At5g45430.1 | AtMPK23 (d) | MAP kinase, PPC family 4.5.1 | ATP binding, protein kinase | At4g19110 (e)
10 | At4g19110.1 | AtMPK22 (d) | MAP kinase, PPC family 4.5.1 | ATP binding, protein kinase | At5g45430 (e)
11 | At4g12430.1 | | TPPase | Catalytic activity | At4g22590
11 | At4g22590.1 | | TPPase | Catalytic activity | At4g12430
12 | At1g70780.1 | | Expressed transcript | Unknown | At1g23150
12 | At1g23150.1 | | Expressed transcript | Unknown | At1g70780
13 | At3g18000.1 | XPL1, NMT1, PEAMT1 | Phosphoethanolamine N-methyltransferase | Methyltransferase | At1g48600
13 | At1g48600.2 | NMT2 | Methyltransferase | Methyltransferase | At3g18000
13 | At1g73600.1 | NMT3 | Methyltransferase | Methyltransferase | Not found
14 | At3g01470.1 | HAT5, HB-1, HD-ZIP-1, ATHB1 | Homeobox | DNA binding, transcription factor, transcriptional activator | Not found
15 | At1g29950.2 | AtBHLH144 (b) | bHLH | Transcription factor | Not found
15 | At5g50010.1 | AtBHLH145 (b) | bHLH | Transcription factor | Not found
15 | At5g64340.1 | AtBHLH142 (b), SAC51 | bHLH | Transcription factor | At5g09460
15 | At5g09460.1 (a) | AtBHLH143 (b) | bHLH | Transcription factor | At5g64340
16 | At3g51630.1 | ZIK1, WNK5 | MAP kinase, PPC family 4.1.5 | Protein kinase | Not found
17 | At1g58120.1 | | Expressed transcript | Unknown | Not found
17 | At3g53400.1 | | Expressed transcript | Unknown | Not found
17 | At5g03190.1 | | Expressed transcript | Unknown | Not found
17 | At5g01710.1 | | Expressed transcript | Unknown | Not found
18 | At4g36990.1 | AT-HSFB1, ATHSF4 | Heat shock factor | Transcription factor | Not found
19 | At5g53590.1 | SAUR | Auxin responsive | Unknown | Not found

AdoMetDC, S-Adenosylmethionine decarboxylase; PPC, PlantsP protein kinase classification; TPPase, Trehalose-6-phosphate phosphatase.
(a) uORF found upstream of annotated mORF-containing locus (within 2 kb).
(b) As designated by Bailey et al [68], nomenclature agreed upon by both Heim et al [69] and Toledo-Ortiz et al [67].
(c) At4g25670 and At4g25690 (tandem duplicates) have the same recent retained duplicate (not reported by Blanc and Wolfe).
(d) As designated by the PlantsP database [99].
(e) Not found in Blanc and Wolfe's initial analysis of ohnologs, but synteny and homology suggest they are retained recent duplicates.


Table 3: Rice loci with conserved peptide uORFs identified from Arabidopsis-rice comparison

Homology group | Locus | mORF description | Gene ontology molecular function
1 | LOC_Os02g03960 | bZIP | DNA binding, transcription factor
1 | LOC_Os09g13570 | bZIP | DNA binding, transcription factor
1 | LOC_Os05g03860 | bZIP | DNA binding, transcription factor
1 | LOC_Os03g19370 | bZIP | DNA binding, transcription factor
1 | LOC_Os12g37410 | bZIP | DNA binding, transcription factor
2 | LOC_Os12g06330 | bHLH | Transcription factor
3 | LOC_Os02g39790 | AdoMetDC | AdoMetDC activity
3 | LOC_Os04g42090 | AdoMetDC | AdoMetDC activity
3 | LOC_Os09g25620 | AdoMetDC | AdoMetDC activity
4 | LOC_Os02g01360 | Expressed transcript | Unknown
5 | LOC_Os02g01240, 133165–133284* | Ankyrin repeat | Protein binding, Acyl CoA binding
6 | LOC_Os04g53190, 31234580–31234757* | Amine oxidase | Amine oxidase
7 | LOC_Os09g15770 | IF2B and IF5 domains | Translation initiation
7 | LOC_Os06g48350 | IF2B and IF5 domains | Translation initiation
8 | LOC_Os10g26140 | Similar to Mic-1 | Unknown
9 | LOC_Os04g38520 | Expressed transcript | Transcription factor
9 | LOC_Os02g36590, 22043438–22043536* | Expressed transcript | Transcription factor
9 | LOC_Os01g43370 | Expressed transcript | Transcription factor
9 | LOC_Os02g15880, 8987945–8988028* | Expressed transcript | Transcription factor
10 | LOC_Os06g02550 | Protein kinase | Kinase activity
10 | LOC_Os02g47220, 28767408–28767530* | Protein kinase | Kinase activity
11 | LOC_Os02g44230 | TPPase | Trehalose phosphatase
11 | LOC_Os10g40550 | TPPase | Trehalose phosphatase
12 | LOC_Os02g21920 | Expressed transcript | Unknown
13 | LOC_Os01g50030 | Methyltransferase | Phosphoethanolamine N-methyltransferase activity
13 | LOC_Os05g47540 | Methyltransferase | Phosphoethanolamine N-methyltransferase activity
14 | LOC_Os08g32080, 19755174–19755260* | Homeobox | DNA binding, transcription factor, protein binding
15 | LOC_Os02g21090 | bHLH | Transcription factor
15 | LOC_Os01g43680, 25011025–25012089* | bHLH | Transcription factor
15 | LOC_Os03g39432, 21870203–21870427* (LOC_Os03g39432 v.4 TIGR annotation) | bHLH | Transcription factor
15 | LOC_Os03g27390 | bHLH | Unknown
16 | LOC_Os11g02300 | Protein kinase | Protein kinase
17 | LOC_Os07g42830, 25650516–25650623* (LOC_Os0742834 v.4 TIGR annotation) | Expressed transcript | Unknown
17 | LOC_Os02g52300 | Expressed transcript | Unknown
18 | LOC_Os09g28350 (LOC_Os09g28354 v.4 TIGR annotation) | Heat shock factor | DNA binding, transcription factor
19 | LOC_Os10g36700 (LOC_Os10g36699 v.4 TIGR annotation) | Auxin responsive | Unknown

All locus identifiers based on version 3 TIGR pseudomolecule assembly except where noted. AdoMetDC, S-Adenosylmethionine decarboxylase; TPPase, Trehalose-6-phosphate phosphatase.
*Locus numbers indicate mORF location, and coordinates indicate uORF location in intergenic region on the same chromosome.


Group 1 10 20 30 40 | | | | Arath1 M S P - V I S E I L R S G L T I D S S L R R R T H L V Q S F S V V F L Y W F Y V F S - - - - Orysa1 M S P - V I S E I L L S G F M I N S T L R R R T H L V Q S F S V V F L Y W F Y V F S - - - - Arath2 M T P - V L C E I L L S G L T V K S A L C R R T H L V Q S F S V V F L Y W F Y N V S - - - - Arath3 M S P I I L S E I F L S G F M L N S T I R R R T H L V Q S F S V V F L Y W L Y Y V S - - - - Orysa2 ------M K L N V L R R R R S H - - - S F S V A F L Y W F Y V F S - - - - Orysa3 M L L L I I P E H L L S R F M T G P A I - - - T Q L V Q S F S V V F L Y W F Y V F S V S S L Arath4 ------M S Y S I L F R R I R I L H S F S V V Y L Y Y T Y V F S - - - - Orysa4 ------M S I T E A S L H F C Y L H S L S V V F L Y F F Y V I S - - - - Orysa5 ------M K I A T H L R R S C F L H S F S V V F L Y W F Y V I S - - - - Arath5 ------M I N - - - L N Q F L V Y H S I S V V I L H W F Y V I S - - - -

Group 2 10 20 30 40 | | | | Arath1 - - - - - MG K R Q I S Q D - - - - E V G P P I K P R A G L R R E Q A G R G S Y R G S Arath2 MF G K G MG T R Q Y S L G - - - - E V G P P I N P R A G L R R E Q A G R G S Y R G S Lyces - - - - - MG C R V L I V V - - - - N L D P Q R K R S G G L R T K Q A G R G S C R G S Medtr ------R C K K L T V V - - - - N L D P E R K R S G G L R T K Q A G R G S C R G S Arath3 - - - - - MA C R S MI A F - - - - S L D P E R K Q S G G L R T K Q A G R G S C R G I Lacsa - - - - - MS R I T WV L I L G G - N L D P E R K R S G G L R S K Q A G R G S C R G S Orysa - - - - - MA A A S V A A R S G G G G V R I P A R R G G G L R A K Q A G R A S F R G A

Group 3 10 20 30 40 50 | | | | | Orysa1 - M E S K G G K K K S - - S S S R - - - S L M Y E A P L G Y S I E D V R P A G - G V K K F Q S A A Y Orysa2 - M E S K G G K K K S - - S S S - - - - S S M Y E A P L G Y K I E D V R P A G - G I K K F Q S A A Y Orysa3 - M E S K G G K K K S - - S S S R S - - S L M Y E A P L G Y S I E D L R P A G - G I K K F R S A A Y Arath1 - M E T K G G K K K S S N S S S R D - - S L F F E A P L R Y S I E D V R P N G - G I K K F R S A A Y Arath2 MM E S K G G K K K S - - S S S S - - - S L F Y E A P L G Y S I E D V R P N G - G I K K F K S S V Y Cycru - M E T K G G K K K S S S N S Q ------Q Y E T P L G Y K I E D V R P H G - G I E K F R S A A Y Selmo - M D A K G G E D D S S S S S S N R N S K F R Y E I P L G Y V V E D V R P N G - G F E K L R R A G F Torru - M E T K G G K K - - S K K S S - - - - - I Q Y E S L L G Y V I E D V R P H G - G S E K F R T A A Y Arath3 MM E S K A G N K K S S S N S S - - - - - L C Y E A P L G Y S I E D V R P F G - G I K K F K S S V Y Ulvli MM C S A E D D S L N S Y L E S - - - - - L G E L A S L L A D C P D L R P W G P N K R K Y R T I A W

Orysa1 S N C A K K P S Orysa2 S N C A R K P S Orysa3 S N C A R K P S Arath1 S N I T G K P S Arath2 S N C S K R P S Cycru S N C V R K P S Selmo T V C V R K P S Torru S N C V R K P S Arath3 S N C A K R P S Ulvli D D C K R L P S

Group 4 10 20 30 40 50 | | | | | Arath1 ------M Q K M G - E S T G C G K G G L M I P N H K S A N S Y N L S L E L S L V D S V W D C N V Brana ------M G - E S K G Y G K G G R V I P N A K S A N S Y N V S L Q L S P V V S F W D C I V Arath2 ------M G - E S K G C G N V G R M I S K N K S A N S Y N L S L Q L S P V V S F W D C I V Arath3 ------M G - E V K G C W K L G R M L S K D N S A N S Y N V S L L L S P V V S V W D C I V Goshi ------M G - K C K G C G K L - R M L T G D G P A K A Y H Y A L L C S P V V S V W D C I V Gosra ------M G - K C K G F G K L G R M L P R D G P V N A Y Y S S L L C S P V V S V W D C I V Medtr ------M G - E C K C C G K L G R M G P R D G S K S A Y E F S L M L S P V V S F W D C I V Prupe MI S S F S M L Q M G - K C K G C G K L G R M M P R G G S V S A Y Q F S L L L S P V V S V W D C I V Poptt ------M G - K C K G C G K L G R M V P R N E P V D A Y H F S L L L S P V V S V W D C I V Orysa ------M G Q E L T T N K P L T T L L P T D E I A T A Y H F S L L L S P V V S F W D C I F 60 | Arath1 R T M R Y S Y I P E WL Brana R T M R Y S Y I P E WV Arath2 R T M R Y S Y I P E WV Arath3 R K M R Y T Y I P E WV Goshi R K M R Y S Y I P E WV Gosra R K M R Y S Y R P E WV Medtr R K M R Y S Y R P E WV Prupe R K M R Y S Y R P E WV Poptt R K M R Y S F R P E WV Orysa R K I R Y S F R P E WV

Figure 1. Alignments of plant uORF homology groups 1–4. Plant sequences were aligned using ClustalW v. 1.82 and displayed using Jalview. See main text for abbreviated species names and GenBank accession number, cDNA clone number, or genome identifier.


Group 5 10 20 30 40 | | | | Arath1 M L I F S - - - - S L S M A L S V I - L Q K L S V F G P G L N P S F P Y C I A N H S F A S R Arath2 M L V F S - - - - S L S M T P V V I - P Q N L R V F G P G L N P S F P Y C I A N H F P - - - Gosra M F L L S - - - - F S L M T C N V I - F Q N L R V F G P G L N P F S P F C V A H H S S A S R Hevbr M F I V S S - - - S S A I V S V V I - F E N L R V F G P G L N P F A P Y C I A N H F L A S R Phaac M I I T ------M N S I F A - A H N L R V F G P G L N P F A P Y C I A N H S L R A R Lacse M F N T S L T R S S S S L S L L V I K L R N L R L F G P G L N P F A P Y N M A N H F - - S R Orysa M V L T - - P S P S - - - P P P ML - P K K L R A L G P G L N P F A P F G M G N Y Y S S S R Triae M V R R - R P S S S S T S S S P ML - H K N L R A L G P G L N P F A P F G M G N Y - - - S R

Group 6 10 20 30 40 50 | | | | | Vitvi M S W L V N L G F T - - - S N - - S N H F T - T C L L I F I L H Y S P Q - - - L P P A L F K F E K P Aspof M S Y L V N L G F S - - - T N - - S N H F I - T C L L I F T L H Y S P Q - - - L P P ------Pontr M S W L G K L G F Y F L Y S N - - S N H F I - T C L L I F I L H Y S P Q - - - L P P S L F K F E K K Gosar M S W L G N L G F T C F S S L N C S N H F I - T C L L I F I L H Y S P Q - - - L P P S L F K F E K K Glyma M N W F G K L D S S - - S S N L N F N H F V - T C L L I F I L H F S P Q - - - S L S S ------Orysa M N C F V N L G L N - - - S N - - S N H Y V - A C L I I F L L H F S P Q - F L L N P S ------Triae M N C F V N L G L N - - - S N - - S N H Y V - A C L L I F L L H F S P Q P L L I N H P ------Arath1 M N L F G S L I F S - - - S S - I S S H F T - T C L L I F I L H F S P Q L ------Arath2 M N F F R K L G F N - - - P S - I A S Y F T N T S L L I S I L S F S P Q L P I S L F L F E Q - - - - 60 70 80 90 | | | | Vitvi L E P G S L V S C F N W------N F N H R R R R R F R S P E N G V V R E K ------Aspof - - - - S I I N C R ------S R W N G D Y R S S K H G E P L ------Pontr A T E P L L L L S L L S I S S N L I N N F N S N S N N N F S S N R L V L Y S S S H G F C K Q E Gosar S T E P L L L L ------F N S N S N - H I P P - - L H L K S K S ------Glyma ------T F D S N T N L E F N S - E Y G V E N ------Orysa ------N C N S N S S S N R E G E K Q N P N N G E Q Q F I W- - Triae ------N S N S - S R S N R Q E G Q S N L S S S N R ------Arath1 ------N S N N N - - N N N S N R N S N N L E S N R L I N - - - - Arath2 ------K S L D P L S L C N S S T N F N N L E S N Y S V H C - - -

Group 7 10 20 30 40 50 | | | | | Arath ------MS E Q A S L F Q S S I R L L G E Y C P I S R K - - Brana ------MS E Q A S L F Q S S I Q L L G E Y C P T S R K - - Nicbe ------MS E R A S L F L S S I G L L G E Y C P P S R - - - Mescr ------MS E Q F S L F L S I I G L L G E Y C P T R K K D T Orysa1 ------MS E Q D L L Y L G S V G L L G E Y S R T T R K R H Orysa2 M M Q E A A S V L L Q S D K V E G L K V E A S F MS E Q D L L Y L G S V G L L G E Y S R - T R K R H Triae ------MS E Q A L V Y L G S V G L L G E Y S R A T R K R H 60 70 80 90 100 | | | | | Arath K S T H K V G G L R F E E ------D I T S C I G F Q F M Q L S D Y H C L F - - - - Brana K S T - R L G - S S F E A F ------N N T S S S S F Q F L ------C L F - - - - Nicbe - L S H K V G C M H L E S S Q A H K R Y L K D H F I Q P T P S I D Q E I L I D L G S L Y H F - - - - Mescr H I S V K A S C M H H R P S E G ------S T V E I Q G C T F T R P S F P T I L V S T A S Orysa1 K L S L G R P C C ------A L V S A D S F Q G - - Y K T T V P F G - - S W Orysa2 K L P L G R P C C MH P K P C Q F M Q T S G C S WS A L V S A D P L Q G S R L L S L L A L G - - A S Triae K F V L - - P C M H Q E P ------I R F S G S E S S L A T A C A E S N L P A A T I A S

Arath ------Brana ------Nicbe ------Mescr ------Orysa1 C F ------Orysa2 D I I S V K E N Triae E I C H I - - -

Figure 2. Alignments of plant uORF homology groups 5–7. Details as in Figure 1.


Group 8 10 20 30 40 50 | | | | | Popca - M K - - - - E R N - - T T S - T L R R V L V N C A A Q A K E Y G G C V A A K V P E I E R D MC L K Popeu - M K - - - - E R N S - S S S - T L R R I L V N C A A Q A K E Y G C C V A E K V P E I E R D MC L K Arath - M K - - - - E K N S T T A S - T L G R I L A T C S K Q A K D Y G S C V A S K V H E V E R D I C L K Sachc - M K - - - - E R K P A A S S P A L V R I L A V C A S Q A K D Y G R C I A A K I P E I E H N MC S K Triae - M K - - - - E R K P A A P S - A L A R I L A T C A S Q A K D Y G R C I A A K V P E I E H N MC S K Orysa - M K - - - - E R N P A V A S S A L A R I L A A C A S Q A K D Y G R C I A E K V P E I E Q N MC A K Chlre S K K P P G L R S A P E Q T A A R I G R A F A A C S A K A E L Y G T C I K K L V P E V D K G V C A K Phypa - M A S - - - E K G R P L P S - P L K N V F L R C S P A M K E Y G Q C V A T K L P A V E K G MC E K Mesvi1 MS S G G R S V P S S S K Y L K E L G Q G L A S C T A E V A V Y G K C I S Q G L Q D I N K G MC E Q Mesvi2 ------P S S W K I L K E L R Q V W G L C P P E V A V Y R K C I S Q A L Q D I N K G MC E Q 60 70 | | Popca E F L A L K N C M Q N T I R - - - G K A Popeu E F L A L K N C M H I T I R - - - G K A Arath E F L A L K S C M Q H T I R - - - G K A Sachc E F L A L R A C M Q T A V K - - - N K V Triae E F L A L R A C M Q T A V K - - - N K A Orysa E F L A L R S C M Q T V V K - - - R K A Chlre E F Q E L K T C F T R A M R S G S G R A Phypa E F L A L K T C M Q N A A K - - - K K - Mesvi1 E F R A L T K C M R Q A R V - - - G R - Mesvi2 D F R A L T K C M R Q A R V - - - G R -

Group 9 10 20 30 40 | | | | Arath1 - - - - M D R A D T S G F V S G G S C S S L N L L - - - Q Y M L I A T S MM C - Arath2 - - - - M D R A D T W G F V S G G R C L Y L K L L - - - L C L ------Orysa1 - - - - M D R A D T S G F V S G G - C F S T T L L L - - I F F H F V Y S V I C V Orysa2 - - - - M D R A D T S G F V S G G - L I G T S L L L - - I F F H F V Y S P V H I Allce - - - - M D R A D T S G F V S G G - C F K L K F R L - - - Y F Q F I Y T Q - - - Arath3 - - - - M D R C D T S G F V S G G G D ------A Y K V F ------Gosra M L Y K M D R E D T S G F V S G G G A S S ------E F L I F L I C F - - - Orysa3 - - - - M D H R D T S G F V S G G - C K E S T N C I T L G F L T F ------Orysa4 - - - - M D R D D I A G F V S G G - C K A ------F ------

Group 10.1 10 20 30 40 | | | | Arath1 ME Q D Y I C S G C Y Q Y R V F S L Q E A L D WR F L V H S D F L I G S F V N C T Brana ME Q D Y I C S G C Y Q F R V S S L Q E A L D L R F L A L S G F L G G S F V N C T Theha ME Q D Y I C S G C F Q Y R V F S L Q E A L D WR F L A R A D I L V G S F V N C T Arath2 ME Q V F V W P S C Y H Y R L F S F Q E A L D WR F L V R S D F L V G S F V N C T Poptt ME K A C F W S N C F Q H R V F S F Q E V L D WR F L I L G D F L L V S F V N C T Orysa1 ME Y T L Y T T S S S V L H I S L L E E V L G WR F S L Y G D F L V I S F V N C T Orysa2 ME K V I F W A H L L K C K L F L S K E T L D WC V L A R G N T H H F S L V N Y T Sorbi ME K V T C R A Y L L Q C K Q F F S K E T L D WC V L A R G N T H L S S L V N Y T

Group 10.2 10 20 | | Arath1 M F C L F L L A S F F C L V G Q P I D V Q T S K Brana M F C L L L L L S F F C L A G Q P S D F Q - S K Arath2 M C F P L L L L S F F C L N G Q P V Y V R - S K

Group 11 10 20 30 40 | | | | Arath1 ------M E L S P T S S D K K T M K R W F F I D K R V G Soltu - - - - MQ C G S G D E I A - G S P Y L M E S I P S S S D K K T L K R W F F I D K R V G Arath2 ------M D S S T T S S D K K T L K R W F F I D K R V G Jugre - - - - MQ W S I G N E K F P G S L C L M D C K P S S S D K K T L K R W F F I D K R V G Medtr ------M D C Y P K S S D K K T L K R W F F I D K R V G Glyma ------M D C Y P K S S D K K T L K R W F F I D K R V G Orysa1 - - - - MC T R S E H D K F S G I S L L M N C L H T C S D K K T L K K W F F I D K T V G Orysa2 M I C V H H G D E P N A G T S R L M G WM R C MH V C S D K K T L K R W F F I D K R V G

Figure 3. Alignments of plant uORF homology groups 8–11. Details as in Figure 1. Decimal places in the group number indicate multiple conserved uORFs in a given 5' UTR.
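The alignments can be checked programmatically for one property the text relies on: whether the peptides of a group end at the same alignment column with matching C-terminal residues. The following is a minimal Python sketch, not the authors' procedure. The function names and the `tail` parameter are illustrative assumptions, and the rows are transcribed from the Group 10.2 alignment (Figure 3) and the Group 15.2 alignment (Figure 5).

```python
def last_residue_column(aligned: str, gap: str = "-") -> int:
    """Return the 1-based alignment column of the last non-gap residue."""
    return len(aligned.rstrip(gap))


def c_terminus_conserved(alignment: dict[str, str], tail: int = 2) -> bool:
    """True if all rows end at the same column and share the last `tail` residues."""
    end_columns = {last_residue_column(seq) for seq in alignment.values()}
    tails = {seq.rstrip("-")[-tail:] for seq in alignment.values()}
    return len(end_columns) == 1 and len(tails) == 1


# Rows transcribed from the Group 10.2 alignment.
group_10_2 = {
    "Arath1": "MFCLFLLASFFCLVGQPIDVQTSK",
    "Brana":  "MFCLLLLLSFFCLAGQPSDFQ-SK",
    "Arath2": "MCFPLLLLSFFCLNGQPVYVR-SK",
}

# Rows transcribed from the Group 15.2 alignment.
group_15_2 = {
    "Arath3": "MGSQRFYVFHCDQYRII",
    "Arath2": "MGSQHFYVYHCEQ-RIS",
}

print(c_terminus_conserved(group_10_2))  # all three rows end in "SK" at column 24
print(c_terminus_conserved(group_15_2))  # same end column, but final residues differ
```

A stricter or looser definition of conservation (for example, allowing conservative substitutions) would need a substitution matrix, which this sketch deliberately leaves out.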


Group 12 10 20 30 40 50 | | | | | Arath1 M V E T N P I G G K F T L S L S I K R A S F S - L P L S T S S V F A F G S C K E G W Y F R C L G L D P Brana M V E T N P T G G N S T L S L P V E R A S L - - - - L S A S S V F A F G S C K E G W Y F R C F G I D P Arath2 M V E T N L I G D N H R L T L S T K R A S F L I I P L S Q D S V F S F G S C K E G W Y F R C I G L D P Popde M V E I L R S S S N Y ML S L C Q S R A G V - - - - V S V A S A F A F G S C K E G W Y Y R C I G V D P Citsi M V E T I L K S S N L ML S L C E S R A ------S S P L A F A F G S C K E G W Y Y R C L G L D P Jugre M V E I L S S R S - F MR S L F H S G A ------S A A F A F A F G S C MK G W Y Y R C L G L D P Orysa M V G A R - - T T S P A S P P Q P E R ------A S F R S F G C C MQ G W H H R C I G L D P Zeama M V G A R - - A T S V A S Q P Q P E E ------A S F R G F G Y C MQ G W Y H R C I G L D P Triae M V G T H T I A A S P S S P P Q T E R D ------S S A S F R C F G A C MQ G W Y H R C I G L D P

Group 13 10 20 30 40 50 | | | | | Arath1 MQ Q R G R S V N R R S R S F S R S R L A V E G H ------Arath2 MN Q R G R S T N R R S R S F S R S R L A V E G H ------Gosra MQ Q R G R T N N R R S R S F S R S R L A V E G F ------Orysa1 MQ Q R G L A H N R R S R G F S R S R R A V E G A ------Zeama MQ Q R G L A H N C R S R G F S R S R R A V E G A ------Medtr MQ Q R G R A S N R R S R S F S R S R L A I E G T ------Arath3 MQ S K G R L H N F R S R S F S R S R L A I E G S ------Orysa2 MQ P R G R S F N R R S R S F S R A R L A I E G S ------Iponi MH Q R G R V K D H R S R T F S R S R I A I Q G R ------Cycru MQ Q R G R T S N R R S T T F S R A R L A I Q G R H K F L C WS S S S I I F L A G N H G P Q W R A I Linus MQ Q K G R S N N R R S R S F S R S R I A I Q G Y ------Phypa MH Q R G K S I N R R S R T F S R T R I A I E G H ------Xentr MR P R G K S Y N R Y S R S F S R T R V A I H T V I Y ------Xenla MR P R G R S D N R Y S R S F S R S R V A I H T V L L T V P L T Q ------60 70 80 90 | | | | Arath1 ------Arath2 ------Gosra ------Orysa1 ------Zeama ------Medtr ------Arath3 ------Orysa2 ------Iponi ------Cycru S S V K L L E R A F C S I E C G S N D A R Q S G F E A G P G R K A R D I V F A T T T Linus ------Phypa ------Xentr ------Xenla ------

Group 14 10 20 30 | | | Soltu - M G F C I C P L E T P A R L L W S T S F F R H K L M L F - - - - Nicbe - M G F C I C P L E T P A R L L W S T S F F R H K L M L F - - - - Medtr M M G S C I C P L E T P A R L L W T T S F F R H K L M I F - - - - Arath M M G F C I C P L E S P A R L L W S T S F F R H K I M I F - - - - Orysa - M G F S L Y P M K T S T R M L W S T S F F R H K V A S S - - F - Zeama - M G F S L Y P M K T S T R L L W S T S F F R H K L S S S S C F L Allce M M G F S L F P M K T S T R L L W S T S F F R H K I F V A - L F -

Group 15.1 10 20 30 | | | Arath1 - - - - - MP W - - - - - T A F F M F F N - - R T C T R L V V F F L V I L Gosra - - - - - MP W - - - - - S L V Y R ------V R M V V F F L I I L Orysa1 - - - - - MP W - - - - - V R F L ------V R L V V F F R V I V Welma - - - - - MP W - - - - - I G F V E D I F - - W Q T V R I V V F F R V I Q Orysa2 - - - - - MP W - - - - - A G T T ------S V R L A V F F R V I L Orysa3 - - - - - MP W - - - - - T P L Y S T Y Y S S R K S V H L A V F F R V I V Sacof - - - - - MP W - - - - - T P L Y S S Y R S S R K S I C L V V F F H V I V Medtr - - - - - MP W - - - - - F S L F K S - - - - - N C V C L V V F F R V I L Vitvi - - - - - MT W - - - - - T S S F V N Y - - S R N C V R L V V F F R V I L Maldo - - - - - MP W - - - - - T S S F V N Y - - S R N C V R L V V F F R V I L Citpa MK K V N MP W - - - - - T S S F V N Y - - S R N - - - L V V F F R V I L Arath2 - - - - - M------C I A V Y R ------K V L S L N L Y C R V I L Arath3 - - - - - MS W - F T R S V D V Y R ------K V V S L N L Y C R V I L Orysa4 - - - - - MR W - L A L S V D I F R ------K G I S L N L Y C R V I L Arath4 - - - - - MR W W L C L S A Y V F R ------T V V - - - V F C R V I L

Figure 4. Alignments of plant uORF homology groups 12–15.1. Details as in Figure 1. Decimal places in the group number indicate multiple conserved uORFs in a given 5' UTR.


Group 15.2
                10
Arath3  MGSQRFYVFHCDQYRII
Arath2  MGSQHFYVYHCEQ-RIS

Group 15.3 10 20 30 40 50 | | | | | Medtr ------MF I A K V G - Y I K M S - - - - V G V I A Y S Q - L ML V C Q A E Y F R Q L L K P V T Maldo - MN I G V N V K K Y R A I K E V G - K I L M S - - - - V G V I A Y F H - Q MQ V C Q A E Y F R Q L L K P V T Orysa3 - MY V K V E L K R F R I V K E V G - T I V M N - - - - V R I F A Y Y Q - L MQ V C Q A E Y F R Q L L K P V T Sacof - MY V K V E L K R F R V V K E V G - R I V M N - - - - V R I F A C Y Q - L I E V C Q A E Y F R Q L L K P V T Gosra - MY I T A T T K R Y R A V K E V G - K I K M S - - - - V G V I A Y Y Q - L V Q V C Q A E Y F R Q L L K P V T Citpa - MY I A I T L K R Y R A V K E V G - K I K M S - - - - V G V I A Y Y Q - L V Q V C Q A E Y F R Q L L K P V T Vitvi - MY I A V T L K R Y R A V K E V G - K I K M S - - - - V G I I A H Y Q - L MQ V C Q A E Y F R Q L L K P V T Arath1 - MC I V G N K K G N R S L K E I G - T F MM T - - - - T C F I A N Y Q - S V Q V C Q A E Y F R Q L L K P V T Orysa2 - MC V G F N G - - - R S L D T - - - - V V M T - - - - S G A V A Y Y Y R L MQ V C Q A E Y F R Q L L K P V T Orysa1 - MC I G F Q G S R R S S S E T S A A D A V M R - - - - S G A I A Y Y H T L T K V C Q A E Y F R H L L K P V T Arath4 M V C Q S A G Q T R F R T L K H E H - G I T G - - N I V V R V I A C F Q - P L Q D C Q A E Y F R Q L L K P V T Orysa4 M V W Q P V A Q T R F R V F K H E N - G I A ------V R V I A C F Q - P S Q D C Q A E Y F R H L L K P V T Arath2 M V C Q S P G K T R F R G L K Y E T - G N A N E S T I V V R V I E C Y Q - P MD N C Q A E Y F R L L L K P V T Arath3 M V S Q S A G Q T R F R T F K Y E N N G D S S R P T I V V R V I A C F Q - P MD N C Q A E Y F R H I L K P V T Adica M MH R S A R L T R L K T T K H G S - E V S G ------K P V MV H R - - - - - C Q G E Y F R Q L L K P I T Welma - MC Q A T G T F R I C G A K K R R - S V Q G S I I M T V R T I V V F Q - L L Q V C Q A E Y F R Q L L K P I T

Group 16 10 20 30 40 50 | | | | | Arath MK V R K G R R Q M I A K E Q D E Y K L R Q Q G L V F Q M A S L F L S F D Y S L L F F W L L F R V F Medtr ------M L S V D Y S R L L - W Q L F G V F Orysa1 ------MS A G T C C S E A S - M A W I D C S L L A - W L L F R V V Orysa2 ------MS A G P C C S E A P - M A W I D C S L L A - W L L F R V V 60 | Arath L Y E N H E R K L Q I F Medtr S Y E N H - W K P R I I Orysa1 F F S C H A K Q F S I F Orysa2 F F S C H A K Q F S I F

Group 17 10 20 30 40 50 | | | | | Arath1 ------M D L L MI Q R R E S R R V WL G V Q N S I V - L L V G G N H T S S R F C T R Arath2 - - - - MV D R T V R M G T E V K S I R V P R R V WF G S Q I S I V - L L V G G N H T S S R F C T R Arath3 ------M E I Q R M R Q L M S L WS N L K S S R V - S L T G G N H T A A R F C T R Lyces ------M L P T E K I M E V A R I R Q L L L L WS R I K S T R V - S L V G G N H T A A R F C T R Gosra M Y E I Q H W G I M G K T M E V Q R M K E L L L L WS R F Q S N R V - A L V G G N H T A A R F C T R Medtr1 ------M E I Q R M K E L L S L L L A F K S N R V - A L V G G N H T A A R F C S R Medtr2 ------M E V E R M K Q L M L L S F G K Q S S H V - A L V G G N H T A A R F C T R Orysa1 ------M E T E R I R H L L L L T L R A P H A R L - A L V G G N H T A A R F C T R Sacof ------M E T E R I R E L I L L T L R A P H S R L - A L V G G N H T A A R F C T R Orysa2 ------M D R R R A A E L M L L S L S S Q R T R M - A L V G G N H T A A R F C T R Arath4 ------M R F F C P Y P P T T C R I C N S S C Q F P T L I G A N H T S A R F S S R

Group 18 10 20 30 40 | | | | Arath ME E T K - - - - R N S D - L L R S R V F L S G F Y C W D W E F L T A L L L F S - C - - - Gosra ME T D T I - - - K P C D N V V H S R F F R S G F Y C W G W E F L T A L L L F S - C S A - Nicbe ME E T K I I I N R P K N N L V H S M V F L F G F Y V W D W E F L T A L L L F S Y C F F - Orysa MG V E A G - - - G G C G - - - - G R A V V T G F Y V W G W E F L T A L L L F S A T T S Y

Group 19 10 20 30 | | | Triae M S N V P T S L R D S S L T V F R L A I S S S - D P W P F S S - - - Zeama M S N V P T S L R D S P L T V F R L A I S S S - D P W P F S S C - L Orysa M S N V P T S L R D S S L T V F R L A I S S S - D P W P F S L - - - Vitvi M S N V P R S L T D S S L T L F K L A I S A S - D P W P F S F V F L Eupes M S N I P S S L T D S S L T L F N L A I S S S - D P W P F S L L S L Medtr M S N I P K S L T D S F L T L F N L T I S S S - D P W P F S F V F L Helan M S N I P K S L T K S I L A L F Q F A I S S S - D P W P F S F L K S Arath M S N I P R S L T D S D L S L F T L I I S S A V D P W P F S I T V F

Figure 5. Alignments of plant uORF homology groups 15.2–19. Details as in Figure 1. Decimal places in the group number indicate multiple conserved uORFs in a given 5' UTR.


Group 20 10 20 30 40 | | | | Styhu M P S V P V I E L L L R H L V P S I T I F H G G S S S H R L K H L E K S D R R G N L S Medtr M P S V P I Q L L L L N H L V P F F N N F H N G S S S H R L K P L E K S D R R G N L S Brugy M P S V P V L L L L L - - - V P C L T L F H G G L C S H R L K E L E K G D R R G N L S Maldo M P S V P T H L L L R P - S V P N L T S F H G G S C S H R L K H L E K G D R R G N L S Gosar M P S L P L R S I L L P - L I P S L T P F H A G L S S H R L K H L E K S D R R G N L I Arath1 M R - V S S E I - L L L S L V P A L T F Y H S G L S S N R L K P L Q K S D W R G N L S Brana M R - L S S E - - L L I S L V P S L T F S Y S G L S S N R L K P L Q K S D W R G N L S Arath2 M P S V P S K L L P L L S L I P S L T P Y H S G L S S N R L K P L E K C D W R G N L S

Group 21-Dicot 10 20 30 40 50 | | | | | Popca M S A E F D L A L V G V Q - L K L WI L S L I A K T F V V A S C C C L L C Y ------ML L C Gosra M S A E F D L A L I G A Q - L Q L C F S S S T T T V C A R L F I T A L L A V S ------C MI L C Eupti M S A S F D L A L I G A P - L Q L WI F S S S K S T L S - A S S C V L R C A L ------F WL L C Maldo M S A S F D L V V I G A P - S K L WI F T ------A A C S A V C F ------ML L C Medtr ------ML F V Glyma M S S G F I L A V I G A P - L K L F I V Y A F V S F F L - T L F V C Y C C C ------ML H C Lacse M S A G - S L N L I G A P - L K L C I S F ------A L F I C S ------Y L C C Escca ------M Q A W G Betvu M S A K S D L I V F G A P S L K L F Q L Y ------Y S L F F V H C ------M Q L C C Arath1 M S A G F E S - T V I L S - L I H - I S S P I S S S S S P S S S S L K - - - - - S C L L N A L H F C Brana M S A G F E S K A V I W S - L I N N I S G P I S S T S C - S K P G L G - - - - - P T S - N A L H F C Arab2 M S A V S E S - A A I S S - L F L - I S F A I Q S S S S S S S S V L F F G L S L T A V L N A L R F C 60 70 80 90 100 | | | | | Popca C L Q Q R L N L V Y S A I S M P L R P K R T C S - - - - F G ------A F F Y L F L F L R G A L T Gosra F Y R Q R L N I V F S T V S M R L R P K K S C S G V E S F G G F H I K Q R F F Y L F L F L R G A H T Eupti C F Q Q R L N L G L P P F S M R L R P K R T C S G V E C F G G F H I K - R F F Y L F L F L R G A L T Maldo F F R Q R F N P V I R P V S M S L R P K R T C S G S Q Y F G A F H I K - L F S S L F L F L R G D L T Medtr N K S K - - F F V I P S V S M R L R P K R T C Y R V V C F G G F H I - Q K F S N L F L F L R V D L T Glyma I L S H - - N T V N L P V S M R L R P K R T C C G V E C F G G F H I - Q R F S Y L F I F L R G D F T Lacse F T Q Q S L L L C V P T V S M R L R P K R T C S G V E C F G G F H I N - R L F D L F L F L R G P L T Escca S D K - - - - - G F L T A S M R L R L Q R T G S G V E C F G G F L I K - R F L Y L F L F L R G I L T Betvu S S Q Q S N N L G F S S V S M R L R P K R T C S G V E C F G G F H I K - R F I A L F L F F R 
G T L T Arath1 C N K Q R L V L E T P S L S M R L R P K R T C S S V E V F G G F H I K Q Q K F S F F I V - R - - - - Brana S - K Q R L V L E T P S L S M R L R P K R T C S G V E V F G G F H I K Q Q K F S F F I V - R - - - - Arab2 S - E Q R L I L E V P S I S M G L R L K R T C S G V E V F G G F H E K Q - K I P F F I V - R - - - -

Group 21-Monocot 10 20 30 40 50 | | | | | Erate ------M S P A L L P - I H G S E L P S P - P A S V T C S L L L Q L R Q - - - - R G A R A Horvu ------M S P A L L - - L H G S E L P S P S P P S A F C S L L P Q A R Q - - - - R G A R T Orysa ------M S P A L P P F L H G S E L Q S P - P T V S S C S L L I L L L Q N WY T R G A R T Allce MS F E S N N Y F F N Y S L V A A Q I S C F F S S S S I L A F N C S L I L V A L F S F M R R K S E C 60 70 80 | | | Erate C G F S - - - F R I S R F R F G V K Y S G F R H L G P F F P T S L H L T R Horvu Q D Y R - - - F K I S R F G F G V K Y S G F L H L D - - - Q C P V H L K - Orysa Q Q N P - - - F T I S R F G F G V K Y F G F L R L D W ------Allce F K S D S ML L K S S R P G F G V K Y F G F L H L R - - - N C F F I F L T Group 21-Dicot and Monocot 10 20 30 40 50 | | | | | Eupti ------M S A S F D L A L I G A P - L Q L W I F S - S S K S T L S A S S C V - - - - - L R C A L F W Popca ------M S A E F D L A L V G V Q - L K L W I L S L I A K T F V V A S C C C - - - - - L L C - - Y M Gosra ------M S A E F D L A L I G A Q - L Q L C F S S S T T T V C A R L F I T A - - - - - L L A V S C M Maldo ------M S A S F D L V V I G A P - S K L W I F T ------A A C S - - - - - A V C - - F M Medtr ------Glyma ------M S S G F I L A V I G A P - L K L F I V Y A F - - - - - V S F F L T - - - - - L F V C Y C C Lacse ------M S A G - S L N L I G A P - L K L C I S F A ------L F I C S - Y Escca ------M Q Betvu ------M S A K S D L I V F G A P S L K L F Q L Y Y S ------L - - - - - F F V H C M Q Arath1 ------M S A G F E S - T V I L S L I H - I S S P I S S S S S P S S S S L K - - - - - S C L L N A L Brana ------M S A G F E S K A V I W S L I N N I S G P I S S T S C - S K P G L G - - - - - P T S - N A L Arath2 ------M S A V S E S - A A I S S L F L - I S F A I Q S S S S S S S S V L F F G L S L T A V L N A L Orysa ------M S P A L P P F L H G S E L Q S P - P T V S S ------C S L Erate ------M S P A L L P - I H G S E L P S P - P A S V T ------C S L Horvu ------M S P A L L L - - H G S E L P S 
P S P P S A F ------C S L Allce M S F E S N N Y F F N Y S L V A A Q I S C F F S S S S I L A F N ------C S L 60 70 80 90 100 110 | | | | | | Eupti L L C C F Q Q R L N L - - - G L P P F S M R L R P K R T C S G V E C F G G F H I - K R F F Y L F L F L R G A L T Popca L L C C L Q Q R L N L - - - V Y S A I S M P L R P K R T C S - - - - F G ------A F F Y L F L F L R G A L T Gosra I L C F Y R Q R L N I - - - V F S T V S M R L R P K K S C S G V E S F G G F H I K Q R F F Y L F L F L R G A H T Maldo L L C F F R Q R F N P - - - V I R P V S M S L R P K R T C S G S Q Y F G A F H I - K L F S S L F L F L R G D L T Medtr - M L F V N K S K F F - - - V I P S V S M R L R P K R T C Y R V V C F G G F H I - Q K F S N L F L F L R V D L T Glyma C M L H C I L S H N T - - - V N L P V S M R L R P K R T C C G V E C F G G F H I - Q R F S Y L F I F L R G D F T Lacse L C C F T Q Q S L L L - - - C V P T V S M R L R P K R T C S G V E C F G G F H I - N R L F D L F L F L R G P L T Escca A W G S D K ------G F L T A S M R L R L Q R T G S G V E C F G G F L I - K R F L Y L F L F L R G I L T Betvu L C C S S Q Q S N N L - - - G F S S V S M R L R P K R T C S G V E C F G G F H I - K R F I A L F L F F R G T L T Arath1 H F C C N K Q R L V L - - - E T P S L S M R L R P K R T C S S V E V F G G F H I K Q Q K F S F F I V R - - - - - Brana H F C S - K Q R L V L - - - E T P S L S M R L R P K R T C S G V E V F G G F H I K Q Q K F S F F I V R - - - - - Arath2 R F C S - E Q R L I L - - - E V P S I S M G L R L K R T C S G V E V F G G F H E K Q - K I P F F I V R - - - - - Orysa L I L L L Q N W Y T R - - - G A R T Q Q N P F T I S R F G F G V K Y F G F L R L D W ------Erate L L Q L R Q - - - - R - - - G A R A C G F S F R I S R F R F G V K Y S G F R H L G P F F P T S L H L T R - - - - Horvu L P Q A R Q - - - - R - - - G A R T Q D Y R F K I S R F G F G V K Y S G F L H L D Q C P - - - V H L K - - - - - Allce I L V A L F S F M R R K S E C F K S D S M L L K S S R P G F G V K Y F G F L H L R N C F F I F L T ------

Figure 6. Alignments of plant uORF homology groups 20 and 21. Details as in Figure 1. Groups with similarity in both the monocot and dicot lineages are shown as separate alignments and as a joint alignment.


Group 22-Dicot
                10        20
Arath1  MGSWLPQYCFLFIFYCHSHNT--NHFS
Brana   MGSRLPQYCFLPVFYCHSQNSNYNYFS
Arath2  MGSQLLECCFRFVFYCHSHSS--NCFP

Group 22-Monocot (uORF1)
Horvu   MLALPLRS
Triae   MSAVPSRS

Group 22-Monocot (uORF2)
                10        20
Zeama   MQGVSLTARRAAVH-VKTHT---
Sacof   MQGVSLTARRAAVHVVKTQI---
Horvu   -------MRRTAVH-VKTQISFA
Orysa   MQDIGCSMRRIAVH-IKTQI---

Group 22-Dicot and Monocot
                10        20
Horvu   -------MRRTAVH-VKTQIS----FA
Orysa   MQDIGCSMRRIAVH-IKTQI-------
Zeama   MQGVSLTARRAAVH-VKTHT-------
Sacof   MQGVSLTARRAAVHVVKTQI-------
Arath1  MGSWLPQYCFLFIFYCHSHNT--NHFS
Brana   MGSRLPQYCFLPVFYCHSQNSNYNYFS
Arath2  MGSQLLECCFRFVFYCHSHSS--NCFP

Group 23 10 20 30 40 50 | | | | | Citsi ------M R L Y D T - - - - V C R ------F R F R - S A Glyma ------M F MG G S M I M R S H S S S S S P F C N N Y K M S ------S V E C R F R - S A Arath1 M E E ML F F F Y P I T I I R D T T L T Q T Q N H L T - T V I I S E M A A T L L C S R S R F R - S A Arath2 M E G - - - - F N R I K E K R R N H D G R R D R E I G K S E M M G I S F A A K L R F R S R F R T S S 60 70 80 | | | Citsi F D S Y Y V A C E A F S E Q Q R F G L S F S V P F G V K I K S L L F L Glyma F D S Y S V A C E A F S E P Q ------P S P L ------Arath1 I D S Y C F A R E A F S D P I S ------F H L L T L L ------Arath2 I D S Y C F A R E A F S D C I S S - - - - - V H F F F L ------

Figure 7. Alignments of plant uORF homology groups 22 and 23. Details as in Figure 6.

the conservation is at the amino acid level, not simply at the nucleotide level.

One possible explanation for low Ka/Ks ratios in the putative uORFs invokes an incomplete splicing of the full-length cDNAs for which the uORF and mORF are normally fused. To address this possibility, all Genbank Arabidopsis ESTs were screened for evidence of uORF-mORF translational fusions. No ORFs were found to run continuously between the uORF and mORF, with one exception. A fusion product (Genbank accession no. DR353698) was identified between the N-terminal and central region of the uORF and the central and C-terminal region of the mORF found at locus At5g03190 (group 17). Classification of this putative uORF is shown in Table 1 for two reasons. Firstly, the four uORF C-terminal amino acids that are excluded in the fusion EST are perfectly conserved in monocot and dicot members, and the position of their stop codon is perfectly conserved, therefore it is difficult to explain this conservation if the uORF is not translated.
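The screen for uORF-mORF translational fusions asks whether an open reading frame runs continuously from the uORF start codon into the mORF. The text does not describe how the EST screen was implemented, so the following Python sketch is only one way such a check might look. The function name, the 0-based coordinates, and the toy sequences are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_translational_fusion(transcript: str, uorf_start: int, morf_start: int) -> bool:
    """Return True if reading from uorf_start reaches morf_start in frame
    without meeting a stop codon. Coordinates are 0-based offsets of each ATG."""
    if morf_start <= uorf_start or (morf_start - uorf_start) % 3 != 0:
        return False  # mORF start is upstream or in a different reading frame
    seq = transcript.upper()
    for i in range(uorf_start, morf_start, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return False  # uORF terminates before the mORF start codon
    return True


# Toy transcripts (hypothetical, not real EST sequences).
separate_orfs = "ATGAAATAGCCCATGGGGTAA"  # uORF ends at TAG before the second ATG
fused_orfs = "ATGAAACCCATGGGGTAA"        # no stop codon between the two ATGs

print(is_translational_fusion(separate_orfs, 0, 12))
print(is_translational_fusion(fused_orfs, 0, 9))
```

A real screen would also need to locate the uORF and mORF start codons in each EST and tolerate sequencing errors, neither of which is attempted here.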


Group 24-Dicot 10 20 30 40 50 | | | | | Arath1 - - - - - MN L A G K R N I P V - I I S V S V T S D - - S Y H R R H V F R E H R ------Glyma L T A G D E D L A G E M N I P A K L F S L S L S N A R L S Y H R R H V F H E S L D L E T F F T S S F Arath2 - - - - - MN I P - - V T I S S - - - - I T V S N N A V S D N R R H V S R E N R Q ------Brana - - - - - MN L A G K M N I P V - - - N MT V S S N A V S D H R R H V F R E N R H ------

Arath1 - - - R T - - Glyma S S T R V V V Arath2 - - - R - - - Brana ------

Group 24-Monocot 10 | Horvu M R P A N A G - - H Q D R R L - Triae M R P A I D G - - D Q D R R L - Sorbi M S P A T A A I P S Q D R F P W Zeama M S Q A T A A - - S R D R F P W Orysa M S P S I A A - - S Q D R V S R Sacof M P A A I P R - E D Q D R R L Q

Group 24-Dicot and Monocot 10 20 30 40 50 | | | | | Sorbi ------M S P A T A A I P S Q D R F P W ------Zeama ------M S Q A T A A - - S R D R F P W ------Orysa ------M S P S I A A - - S Q D R V S R ------Horvu ------M R P A N A G - - H Q D R R L ------Triae ------M R P A I D G - - D Q D R R L ------Sacof ------M P A A I P R E - D Q D R R L Q ------Arath1 - - - - - M N L A G K R N I P V I - - - I S V S V T S D S Y H R R H V F R E H R R T ------Brana - - - - - M N L A G K M N I P V N - - - M T V S S N A V S D H R R H V F R E N R H ------Arath2 - - - - - M N I P - - V T I S S - - - - I T V S N N A V S D N R R H V S R E N R Q R ------Glyma L T A G D E D L A G E M N I P A K L F S L S L S N A R L S Y H R R H V F H E S L D L E T F F T S S F

Sorbi ------Zeama ------Orysa ------Horvu ------Triae ------Sacof ------Arath1 ------Brana ------Arath2 ------Glyma S S T R V V V

Group 25 10 20 | | Vitsh - M L V L N R S G I D F V R C F F H L A E W H P V E Escca - M L V L N R I G I N F I S C F V P L C E W H P L E Glyma - M L G V N R S G F N F C S R L I H L A E W H P L E Phaco - M L G V N R C G F N F C S R F L H L A E W H P L E Citsi - M L C L N R N G F N I I S S F F H L A E W H P L E Soltu MM S W F N R T G I N L I N C F F H F A E W H P I E Arath1 - M S C F D C S G Y D V S S C L I H L A E W H P L E Brana - M S R F E C S G Y D L S S S S V H L A E W H P L E Arath2 - M L R F S C S G F D I S S C F I H L A E W H P L E

Group 26 10 20 30 | | | Arath1 M R D I D S L F S I S F V I R F V Q Q D E S ------Arath2 M K K L D S F I S V S F V V R F E Q K L E Q ------Prupe M G K MN S L L S V S F V V R F G A K N P W R Y E T P S L F A Poptd - - - MN S L L S V S F V V R F E A R K H W S H E I S S L F A Citja M S V MN F L L S V S F V V L I H N ------Pethy M K T I - L L S L L S F V S Q L N C L S T V - - Q I F S H F A

Figure 8. Alignments of plant uORF homology groups 24–26. Details as in Figure 6.


Table 4: Arabidopsis loci with conserved peptide uORFs identified from Arabidopsis-Arabidopsis comparison

Homology group  Locus         Gene name   mORF description                           Gene ontology molecular function   Recent duplicate

20              At3g53670.1   -           Expressed transcript                       Unknown                            At2g37480a
20              At2g37480.1   -           Expressed transcript                       Unknown                            At3g53670a
21              At1g68550.1   AtERF#118b  Group VI-L ERF/AP2 transcription factor    Transcription factor               At1g25470
21              At1g25470.1   AtERF#116b  Group VI-L ERF/AP2 transcription factor    Transcription factor               At1g68550
22              At1g16860.1   -           Expressed transcript                       Unknown                            At1g78880
22              At1g78880.1   -           Expressed transcript                       Unknown                            At1g16860
23              At1g64630.1   ZIK10       MAP kinase, PPC family 4.1.5               Transcription factor               Not found
23              At5g41990.1   WNK8/ZIK6   MAP kinase, PPC family 4.1.5               Protein kinase                     Not found
24              At3g22970.1   -           Expressed transcript                       Unknown                            At4g14620
24              At4g14620.1   -           Expressed transcript                       Unknown                            At3g22970
25              At3g45240.1c  -           Calcium response kinase, PPC family 4.2.7  ATP binding, protein kinase        At5g60550
25              At5g60550.1   -           Calcium response kinase, PPC family 4.2.7  ATP binding, protein kinase        At3g45240
26              At3g10910.1   -           Zinc finger, C3HC4-type (RING finger)      Protein binding, zinc ion binding  At5g05280
26              At5g05280.1   -           Zinc finger, C3HC4-type (RING finger)      Protein binding, zinc ion binding  At3g10910

ERF/AP2, Ethylene Response Factor/Apetala 2 transcription factor; PPC, PlantsP protein kinase classification.
a Blanc and Wolfe (2004) report that At2g3790 and At3g53670 are retained recent duplicates, but the At2g3790 locus has since been replaced by At2g3780.
b As defined by Nakano, et al. [97] and previously characterized as part of subfamily B-6 by Sakuma, et al. [100].
c uORF found upstream of annotated mORF-containing locus (within 2 kb).
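The "Recent duplicate" column above records which loci still have a retained paralog. The Results compare the retention of conserved-uORF gene pairs (12 of 31 ancestral pairs) with the genome-wide retention rate of 14%. The paper does not state which statistical test produced its p-value, so the Python sketch below assumes a one-sided exact binomial test purely for illustration. The function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k
    retained pairs out of n if each pair were retained with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Figures taken from the text: 12 of 31 pairs retained, 14% genome-wide rate.
retained_pairs = 12
ancestral_pairs = 31
genome_wide_rate = 0.14

p_value = binomial_upper_tail(retained_pairs, ancestral_pairs, genome_wide_rate)
print(f"observed retention: {retained_pairs / ancestral_pairs:.0%}")
print(f"one-sided binomial p-value: {p_value:.2g}")
```

A different choice of test (for example, a chi-square or hypergeometric test against the full set of ancestral pairs) would give a somewhat different value, so the output should be read as an order-of-magnitude check rather than a reproduction of the reported statistic.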

Secondly, the N-terminal portion of the mORF that is removed in the fusion EST is similar between three Arabidopsis loci of the same homology group, with the start codon position also being conserved in these three members. It is likely, therefore, that the fusion EST represents an alternatively spliced form of this transcript, but further characterization of this locus will be needed to support this conclusion. Most of the homology groups show uORFs with conserved amino acid residues at the C-terminus and an identical positioning of the uORF stop codon (Figures 1, 2, 3, 4, 5, 6, 7, 8). This would suggest that the full-length cDNAs are fully spliced and are not erroneously predicting uORF sequences due to incomplete splicing.

Conserved features of uORF sequences
The lengths of uORFs vary to differing degrees within and among homology groups, but in amino acid sequence alignments nearly all groups exhibit considerable conservation of the position of the N-terminus and/or the C-terminus, i.e., length variation is usually due to a variable region in the middle or at one end of the uORF (Table 6; Figures 1, 2, 3, 4, 5, 6, 7, 8). The amino acid sequences of some uORFs possess potentially interesting features. Notably, some uORF groups possess regions rich in serine, threonine, and/or tyrosine, and others possess regions rich in lysine and/or arginine. Two homology groups are particularly noteworthy: Group 8 uORFs specify peptides with a coiled coil-helix, coiled coil-helix (CHCH) domain (Pfam accession number PF06747; Figure 9), and group 13 uORFs encode peptides that are extremely serine/arginine-rich (Figure 10). Both of these unusual peptides will be discussed in further detail below.

Most genes with conserved uORFs appear to have regulatory functions
A total of 31% of mORFs encoded by conserved peptide uORF loci in Arabidopsis were predicted to be a transcription factor, as determined by GO molecular function terms (Tables 2 and 4), whereas only 5.9% of all Arabidopsis loci are predicted to encode transcription factors [34]. Thus, genes predicted to encode transcription factors are significantly overrepresented (p = 1.2 × 10⁻⁷) among conserved peptide uORF loci. In each case, GO terms were validated by manual annotation of protein functions using domain predictions from NCBI Conserved Domain


and InterProScan Database searches [35,36]. A variety of different types of transcription factors, including bZIP, Ethylene Response Factor/Apetala 2-like (ERF/AP2-like), basic helix-loop-helix (bHLH), and homeobox proteins, are represented among conserved peptide uORF loci with no demonstrable bias. No other GO terms were found to be significantly over- or under-represented in the uORF data set.

Biological functions could be inferred for 16 of the 26 uORF homology groups (Table 1). Six groups encode transcription factor homologs and so are presumably involved in transcriptional control (1, 2, 14, 15, 18, and 21). Five groups are likely to be involved in signal transduction, including four protein kinases and a putative calmodulin-binding protein involved in auxin response (groups 10, 16, 19, 23, 25). Two groups are involved in the metabolism of small molecules that regulate plant development: polyamines (group 3) [1] and trehalose (group 11) [37]. One group (13) encodes the key enzyme in the biosynthesis of phosphocholine, which is an intermediate in biosynthesis of phosphatidylcholine and phosphatidic acid; phosphocholine levels influence levels of phosphatidic acid, an important physiological and developmental signal molecule [38-40]. Group 7 putatively encodes translation initiation factor eIF5, which influences start codon selection, and Group 26 encodes a RING finger protein, suggesting a role in targeted protein turnover by ubiquitination. Of the remaining 10 groups, 8 encode predicted proteins of unknown function, 1 encodes an ankyrin-repeat protein, and 1 encodes an amine oxidase. Thus, all but two families of conserved uORF genes whose functions are known or can be inferred potentially play a regulatory role in the biology of plants.

Table 5: Mean pairwise Ka/Ks values for all pairwise combinations of a given homology group using two methods (yn00 and codeml).

Homology group  uORF            mORF
                yn00    codeml  yn00    codeml

1               0.20    0.16    0.22    0.11
2               0.28    0.15    0.29    0.19
3               0.13    0.11    0.15    0.09
4               0.19    0.18    0.21    0.22
5               0.06    0.06    0.06    0.08
6               0.43    0.01a   0.10    0.08
7               0.43    0.89    0.09    0.05
8               0.14    0.01a   0.11    0.09
9               0.19    0.05    0.20    0.09
10.1b           0.69    0.48    0.10    0.10
10.2b           0.70    0.64
11              0.13    0.09    0.13    0.09
12              0.25    0.26    0.15    0.09
13              0.07    0.04    0.10    0.09
14              0.17    0.05    0.14    0.01a
15.1b           0.31    0.17    0.34    0.21
15.2b           0.03    0.07
15.3b           0.37    0.16
16              0.30    0.11    0.10    0.11
17              0.28    0.24    0.41    0.11
18              0.26    0.01a   0.15    0.01a
19              0.00a   0.01a   0.01a   0.01a
20              0.13    0.17    0.48    0.39
21              0.47    0.44    0.11    0.09
22              0.52    0.16    0.09    0.09
23              0.57    0.43    0.23    0.21
24              0.18    0.20    0.22    0.20
25              0.53    0.50    0.16    0.14
26              0.37    0.28    0.23    0.22

a Ka or Ks values too high to determine Ka/Ks ratio accurately.
b Decimal points after homology group numbers are used when multiple independent uORF peptides are conserved within a single transcript.

Genes with conserved uORFs were preferentially retained after whole genome duplication
Since the most recent WGD event in the Arabidopsis lineage, only 14% of the original gene pairs present in the ancestral tetraploid have been retained as a duplicate pair in the extant Arabidopsis genome, i.e., for the remaining 86% of ancestral gene pairs, one member has been lost [32]. Among 31 ancestral gene pairs that possessed conserved uORFs at the time immediately following the genome duplication, 12 (39%) pairs have been retained in the present Arabidopsis genome (Table 2), which is significantly higher than the genome-wide average (p = 0.0005). The conserved uORF was retained in both copies of each of the twelve retained duplicate pairs. Retention of these 12 uORFs in both paralogs suggests that they act in cis, consistent with the expectation that uORFs typically control translation of downstream mORFs on the same RNA molecule [41].

The overrepresentation of transcription factors among conserved uORF loci could be due, in part, to preferential retention of transcription factor recent duplicates (22.7% retention of transcription factor duplicates vs 14.4% retention genome-wide) [32], but this alone does not account for the high frequency of predicted transcription factors among the uORF loci. When duplicate history bias is removed by calculating GO term frequencies of the pre-genome-duplication set of loci, transcription factors are still overrepresented (11/31 loci, or 35%).

Conserved angiosperm uORF peptide sequences in primitive plants and other eukaryotes
To determine whether any of the 19 uORF homology groups conserved between rice and Arabidopsis might also be present in other eukaryotes, we searched for uORF


Table 6: uORF features conserved between Arabidopsis and rice

uORF      uORF length  Length conserved   Conserved sequence features                                     Overall
homology  (amino       at N- and          (N-terminus, middle, C-terminus; amino acids)
group     acids)       C-termini

uORF features conserved between Arabidopsis and rice
1         25–43        C-terminus         SY-rich: 5/14                                                   20–39% STY
2         34–39        C-terminus         KR-rich: 5–6/20                                                 18–24% KR
3         50–54        N- and C-termini   K-rich: 4/9; S-rich: 5–6/6; KR-rich: 4–5/16, SY-rich: 4–5/12    17–23% KR, 22–29% SY
4         52–55        N- and C-termini   SY-rich: 7/30                                                   21–23% STY
5         38–41        N- and C-termini
6         55–68        N-terminus
7         57–105       N-terminus         STY-rich: 6–7/22; KR-rich: 4/5–8
8         61–62        N- and C-termini                                                                   CHCH domain, 17% KR
9         17–33        N-terminus
10.1      41           N- and C-termini
11        24–44        C-terminus         KR-rich: 5–6/15                                                 25% KR
12        39–51        N- and C-termini
13        25           N- and C-termini   RS-rich: 10–12/18                                               40–48% RS
14        29           N-terminus         ST-rich: 4–6/10                                                 14–32% STY
15.1      18–27        N- and C-termini   8/9 hydrophobic
15.3      43–54        N- and C-termini   13/14 completely conserved
16        40–62        C-terminus
17        36–45        C-terminus
18        36–38        Neither
19        30–34        N-terminus                                                                         29% ST

uORF features conserved between Arabidopsis paralogs
20        41–43        N- and C-termini   STY-rich: 8–9/27                                                23% STY
21        87–90        N- and C-termini   ST-rich: 11–12/17–22                                            22–25% ST
22        25           N- and C-termini
23        69–71        Neither
24        31–34        N- and C-termini
25        25           N- and C-termini
26        22           N-terminus
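The "Overall" percentages in Table 6 describe how much of a uORF peptide is made up of a given residue class. How the authors computed them is not stated, so the Python sketch below assumes the simplest reading: the number of residues in the class divided by the peptide length. The function name is illustrative, and the sequence is the group 13 Arath1 row transcribed from the alignment figure.

```python
def residue_fraction(peptide: str, residues: str) -> float:
    """Fraction of a peptide made up of the given one-letter residues.
    Alignment gap characters ('-') are ignored."""
    cleaned = peptide.replace("-", "").upper()
    if not cleaned:
        raise ValueError("peptide contains no residues")
    wanted = set(residues.upper())
    return sum(1 for aa in cleaned if aa in wanted) / len(cleaned)


# Group 13, Arath1 uORF peptide as transcribed from the alignment (25 residues).
group13_arath1 = "MQQRGRSVNRRSRSFSRSRLAVEGH"

print(len(group13_arath1))                              # 25
print(f"{residue_fraction(group13_arath1, 'RS'):.0%}")  # 48%
```

Under this assumption the Arath1 peptide is 12/25 arginine or serine, which falls at the upper end of the 40–48% RS range listed for group 13.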

sequences in all Genbank eukaryotic ESTs. Amino acid sequences similar to four homology groups (3, 8, 13, and 15) were detected in non-angiosperms. Group 15 was found only as distantly as a fern (Adiantum); group 3 was found as far from angiosperms as the green algae (Ulva); group 13 was found in an animal (Xenopus tropicalis); and group 8 uORF sequence was found in primitive plants, animals, fungi, and a slime mold (Figures 9 and 10). Another algal sequence (Chlamydomonas) from the Genbank non-redundant database was identified as belonging to group 3 (Genbank: AJ841703). The group 13 uORF homolog found in an X. tropicalis EST was also found in a genomic contig sequence [42] in which the uORF homolog is flanked by genes that are more similar to animal sequences than to any known plant sequences. Thus, this group 13 uORF homolog most likely exists in the Xenopus genome rather than being an EST library contaminant.

Sequences similar to group 8 Arabidopsis and rice uORFs were found in most eukaryotes, but transcript sequence following the uORF varied among the different lineages. All land plant uORFs were associated with macrophage inhibitory cytokine-1-like (Mic1-like) mORF sequences while the mORFs downstream of the group 8 uORF homologs in nematodes and arthropods code for an unknown protein and a putative mannosyl transferase, respectively (Figure 11). Available EST sequences for each of the group 8 uORF homologs in mammals, fungi, algae, and slime mold end shortly after the conserved peptide uORF, suggesting that in these eukaryotes the uORF homolog is not associated with a mORF and is simply a


Alignment positions 1–50
Homsa  M S A - N G A V W G R V R S R - L R A F P E R L A A C G A E A A A Y G R C V Q A S T A P G G R L S K
Danre  M S G - S - N V W T R S R A R - M R L F P E L M A Q C S G E A T A Y G K C V A A T T T G K Q E L T R
Arath  M K - - E K N S T T A S - - - - - - T L G R I L A T C S K Q A K D Y G S C V A S K V H - - - E V E R
Orysa  M K - - E R N P A V A S S - - - - - A L A R I L A A C A S Q A K D Y G R C I A E K V P - - - E I E Q
Phypa  M A S - E K G R P L P S - - - - - - P L K N V F L R C S P A M K E Y G Q C V A T K L P - - - A V E K
Mesvi  M S - - S G G R S V P S S S K Y L K E L G Q G L A S C T A E V A V Y G K C I S Q G L Q - - - D I N K
Ustma  M R - - - P G A I K Q R D V A P V Q T F A K A A A K C A S E A R I Y G A C V T A N Y E - - - N I E R
Dicdi  M I - - - K S T S K W S - - - - - Q R V A - S L G D C T F E M S I Y G A C V T S N L D - - - N I E K
Ciosa  M P S T V A Q S V L R A Q N Y L R R E L P V K V K S C S K Q A V A Y G T C V G E W D N - - - - L R K
Strpu  M N S - - - G S V Q K A R Q K M A - Q F P A A F A E C T P Q A L A Y G R C V S S K D H - - - - V G R
Caeel  M P - - - - - K T Q Q R - - - - L L N Y A T S I A K C P T E T S N Y G S C V S V Q A E - - - R I K Q
Drome  M E S - - V R K A N Q R - - - - I R N Y P V L L S K C A D K A T A Y A V C V S R D L N - - - - V Q H
Neucr  M Q P - - P K T T V R P - - - - I Q R F A S A V S K C S V E S A A Y G K C I L A D Y N - - - S V H K

Alignment positions 51–76
Homsa  D F C A R E F E A L R S C F A A A A K K T L E G G C
Danre  N M C V K E F E A L K S C F Q S A A K K A V K - - -
Arath  D I C L K E F L A L K S C M Q H T I R G K A - - - -
Orysa  N M C A K E F L A L R S C M Q T V V K R K A - - - -
Phypa  G M C E K E F L A L K T C M Q N A A K K K - - - - -
Mesvi  G M C E Q E F R A L T K C M R Q A R V G R - - - - -
Ustma  N M C Q K E F F A F K A C V Q Q K L G R K W - - - -
Dicdi  N V C K V E F E K F K N C M A K S M V S K V K K - -
Ciosa  G D C E K E F L A F K Q C I Q S - - - - L K K - - -
Strpu  N D C G K E F Q T Y K E C L Q K A M S K I K K - - -
Caeel  G D C S A E F R K L I D C V T K N L K K K - - - - -
Drome  K I C D T E F K E F L S C I R K T A L E M K T K L -
Neucr  D M C V K E F M R L K D C Y L V R S S P P S L F T -

Figure 9: Group 8 small ORF/uORF alignment and percent identity across various eukaryotes. Representative eukaryotic species aligned using Muscle and displayed by percent identity using Jalview. Arrowheads represent two conserved intron positions for all but Mesvi (no genomic support), Dicdi (first but not second intron present), Ciosa (no introns), Caeel (no introns), Drome (no introns), and Neucr (first but not second intron present based on predicted mRNA). See main text for abbreviated species names and Genbank accession number, cDNA clone number, or genome identifier.
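The cysteine spacing that the text reports for group 8 (nine residues between the first and second cysteines, 11–15 between the second and third, and nine between the third and fourth) can be checked against rows of the Figure 9 alignment. The sketch below is illustrative only and is not part of the published analysis: the human and Arabidopsis sequences were transcribed from the alignment with gap characters removed by hand, and the function and pattern names are hypothetical.

```python
import re

# Ungapped peptides transcribed from the Figure 9 alignment rows
# (Homsa and Arath); gap characters were removed by hand.
HOMSA = ("MSANGAVWGRVRSRLRAFPERLAACGAEAAAYGRCVQASTAPGGRLSK"
         "DFCAREFEALRSCFAAAAKKTLEGGC")
ARATH = ("MKEKNSTTASTLGRILATCSKQAKDYGSCVASKVHEVER"
         "DICLKEFLALKSCMQHTIRGKA")

# Hypothetical helper pattern: C-x9-C, then 11-15 residues, then C-x9-C.
# Note that "." also matches cysteine, so this is a permissive check.
CYS_PATTERN = re.compile(r"C(.{9})C(.{11,15})C(.{9})C")


def cysteine_spacers(peptide):
    """Return the three spacer lengths between the four cysteines,
    or None if the peptide does not contain the spacing pattern."""
    match = CYS_PATTERN.search(peptide)
    if match is None:
        return None
    return tuple(len(group) for group in match.groups())


for name, peptide in (("Homsa", HOMSA), ("Arath", ARATH)):
    print(name, len(peptide), cysteine_spacers(peptide))
```

Both peptides show the 9 / 11–15 / 9 spacing, with the middle spacer differing between the two species. The human peptide carries a fifth, C-terminal cysteine that lies outside the pattern.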

short ORF. This is further supported by more than 10 human ESTs that end at the same position and include a polyA sequence. In the sea squirt lineage a putative mORF is present in the EST sequences, but a full-length cDNA sequence will be needed to further investigate this possibility.

Although there is variability in the sequences found downstream of group 8 uORFs, three features of these uORF homologs are relatively well conserved: the length of the predicted uORF, the relative positions of four cysteine codons, and the positions of two introns (Figure 9). The length of the uORF peptide ranges from 51 amino acids in Haemonchus (nematode), to 74 amino acids in humans, and length is even more highly conserved within each of the land plant, arthropod, nematode, fungal, and vertebrate lineages (59–62, 65–69, 51–68, 54–66, and 69–74 amino acids, respectively). Four cysteine residues consistently align in all eukaryotes, with nine amino acids separating the first and second cysteine residues, as well as the third and fourth cysteine residues, whereas 11–15 residues separate the second and third cysteines. Two intron positions are perfectly conserved among the land plants, vertebrates, and at least one member of the fungal lineage. The first intron lies between the third and fourth amino acids following the first conserved cysteine position, and the second intron lies between the fourth and fifth amino acids following the fourth conserved cysteine position (Figure 11). The first and/or second intron positions are present in Dictyostelium, algae, and some fungi, but are absent in nematodes, arthropods, and sea squirts.

The four cysteines are part of a putative coiled coil-helix, coiled coil-helix (CHCH) domain (Pfam accession


Panel A (uORF), alignment positions 1–50
Arath1  M Q Q R G R S V N R R S R S F S R S R L A V E G H ------
Arath2  M N Q R G R S T N R R S R S F S R S R L A V E G H ------
Gosra   M Q Q R G R T N N R R S R S F S R S R L A V E G F ------
Orysa1  M Q Q R G L A H N R R S R G F S R S R R A V E G A ------
Zeama   M Q Q R G L A H N C R S R G F S R S R R A V E G A ------
Medtr   M Q Q R G R A S N R R S R S F S R S R L A I E G T ------
Arath3  M Q S K G R L H N F R S R S F S R S R L A I E G S ------
Orysa2  M Q P R G R S F N R R S R S F S R A R L A I E G S ------
Iponi   M H Q R G R V K D H R S R T F S R S R I A I Q G R ------
Cycru   M Q Q R G R T S N R R S T T F S R A R L A I Q G R H K F L C W S S S S I I F L A G N H G P Q W R A I
Linus   M Q Q K G R S N N R R S R S F S R S R I A I Q G Y ------
Phypa   M H Q R G K S I N R R S R T F S R T R I A I E G H ------
Xentr   M R P R G K S Y N R Y S R S F S R T R V A I H T V I Y ------
Xenla   M R P R G R S D N R Y S R S F S R S R V A I H T V L L T V P L T Q ------

Panel B (mORF), alignment positions 1–50
Orysa2  M D A A A A T A V N G ------ V L E V E E R K A Q K S Y W E E H S K D L T V E A M M L D S R
Phypa   ------ M A V N ------ T E R S L Q S T Y W K E H S V E P S V E A M M L D S Q
Arath1  ------ M A A S ------ Y E E E R D I Q K N Y W I E H S A D L T V E A M M L D S R
Arath2  ------ M A T P ------ Y K E E R D I Q K S Y W M E H S S D L T V E A M M L D S K
Arath3  ------ M A S ------ Y G E E R E I Q K N Y W K E H S V G L S V E A M M L D S K
Iponi   ------ M A A ------ I T Q E E R E I Q K N Y W M E H S A D L T V E A M M L D S K
Medtr   ------ M A S P P ------ G M T T T D E R E I Q K S Y W I Q H C A D L S V E A M M L D S K
Linus   ------ M A A N ------ G A A P V V E R E V Q K S Y W I E H S V D L T V E S M M L D S K
Gosra   ------ M A A N ------ G Y V G D E R K V Q K N Y W I E H S V D L T V E A M M L D S K
Orysa1  ------ M D A V ------ A A N G I G E V E R K A Q R S Y W E E H S K D L T V E A M M L D S R
Zeama   ------ M D T V G V P V V A V A N G I G E V E R K V Q K S Y W E E H S K C L T V E S M M L D S R
Cycru   ------ M A P N ------ G E R S L Q L S Y W K E H S A V L S V E A M M L D S Q
Xentr   M A D D ------ T R Q V M T Q F W E E H S R D A T V E E M M L D S S
Xenla   M G E V ------ T R Q V M T Q F W E E H S Q N A T V E E M M L D S S

Panel B (mORF), alignment positions 51–100
Orysa2  A A D L D K E E R P E I L S L L P P Y E G K S V L E L G A G I G R F T G E L V K T A G H V L A M D F
Phypa   A S K L D K E E R P E I L S L L P P Y E N K D V M E L G A G I G R F T G E L A K H A G H V L A M D F
Arath1  A S D L D K E E R P E V L S L L P P Y E G K S V L E L G A G I G R F T G E L A Q K A G E L I A L D F
Arath2  A S D L D K E E R P E V L S L I P P Y E G K S V L E L G A G I G R F T G E L A Q K A G E V I A L D F
Arath3  A S D L D K E E R P E I L A F L P P I E G T T V L E F G A G I G R F T T E L A Q K A G Q V I A V D F
Iponi   A V D L D K E E R P E V L A L L P P Y E G K S V L E L G A G I G R F T G E L A K K A G Q L I A L D F
Medtr   A S D L D K E E R P E V L S L L P E Y E G K S V I E L G A G I G R F T G E L A Q K A G Q L L A V D F
Linus   A S D L D K E E R P E V L S M L P P Y E G K S V L E L G A G I G R F T G E I A K K A G Q V V A L D F
Gosra   A A D I D K E E R P E V L S L L P P Y E G K T I L E L G A G I G R F T G D L A K K A G H V I A L D F
Orysa1  A A D L D K E E R P E V L S V L P S Y K G K S V L E L G A G I G R F T G E L A K E A G H V L A L D F
Zeama   A A D L D K E E R P E I L S L L P S Y K G K S V L E L G A G I G R F T G D L A K E A G H V L A L D F
Cycru   A S K L D R E E R P E I L S L L P P L E G K S V V E L G A G I G R Y T G E L A K S A D H V L A I D F
Xentr   A K L L S L E E K P E I I L L L P C L D G H S V L E L G A G I G R Y T G H L A K L A S H V T A V D F
Xenla   A E L L S L E E K P E I I L L L P C L D G L S V L E L G A G M G R Y T G H L A K L A S H V T A V D F

Figure 10: Group 13 alignment and percent identity of (A) uORF and (B) mORF sequences. Representative eukaryotic species were aligned using Muscle and displayed using Jalview. Panel A alignment is restricted to the first 50 amino acid positions, which excludes the full 92 amino acid uORF of Cycas rumphii. All other uORFs are shown in their entirety. Panel B alignment is restricted to the first 100 amino acid positions of the mORFs. See main text for abbreviated species names and Genbank accession number, cDNA clone number, or genome identifier.
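Pairwise percent identity between rows of the panel A alignment can be computed directly. This is a minimal sketch and not the Jalview colouring used for the figure, which is assumed here to be a per-column measure; the function name is hypothetical, and the three sequences were transcribed from the Arath1, Arath2, and Orysa1 rows of panel A, which are all 25 residues long and align without internal gaps.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity over aligned columns in which both sequences
    carry a residue (columns with a gap '-' in either row are skipped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Group 13 uORF peptides transcribed from Figure 10, panel A.
ARATH1 = "MQQRGRSVNRRSRSFSRSRLAVEGH"
ARATH2 = "MNQRGRSTNRRSRSFSRSRLAVEGH"
ORYSA1 = "MQQRGLAHNRRSRGFSRSRRAVEGA"

print(percent_identity(ARATH1, ARATH2))  # two Arabidopsis paralogs
print(percent_identity(ARATH1, ORYSA1))  # Arabidopsis versus rice
```

The Arabidopsis paralogs differ at two of 25 positions and the Arabidopsis and rice peptides at six of 25.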


number PF06747), also found in three small yeast proteins, Cox17p, Cox19p, and Mrp10p. Cox17p and Cox19p are required for assembly of functional cytochrome oxidase and Mrp10p is homologous to a nuclear-encoded mitochondrial ribosomal protein. A hypothetical human gene, CHCH domain 7 (CHCHD7), is also similar to the group 8 uORF, as determined by BLAST similarity searches.

Figure 11: Diagrammatic representation of Group 8 features among eukaryotes. Light grey boxes represent small ORFs/uORFs, four perfectly conserved cysteine residues are shown as 'C', and numbers within triangles represent the number of amino acids between the immediately preceding cysteine and an intron. Brackets surrounding fungal introns represent the variable nature of the intron position and/or presence. White boxes show mORFs directly downstream of the uORFs in a given lineage. Presence of a polyA tail is likely to occur in vertebrates (pA; see Results). Question marks indicate mORFs could be present, but insufficient EST sequence is available to infer this feature reliably.

Lineage | Intron positions shown | Downstream of the small ORF/uORF
Land plants | +3, +4 | Mic1 homolog
Vertebrates | +3, +4 | pA
Slime mold | +3 |
Sea squirts | | ??
Algae | +7, +4 |
Arthropods | | Mannosyl Trfase
Worms | | Unknown function
Fungi | +3, +4 |
Fungal detail: Ustilago: C1+3, C4+4; Gibberella: C1+3, C4+2; Neurospora: C1+3, none; Debaryomyces: none, none

Phylogenetic relationships among group 8-like ORFs

Fungal, animal, and plant representatives of each CHCH-containing ORF were identified using a BLAST search, and their evolutionary relationships were inferred using a Bayesian phylogenetic analysis (Figure 12; Additional file 1). Animal Mrp10p-like (Genbank: BC075310, DR155443 and BX935835), Debaryomyces group 8-like (Genbank: NC_006045), and Dictyostelium Cox19p-like (Genbank: XM_631387) sequences were more divergent than other sequences, causing long branch attraction [43]. Thus, these sequences were removed from the analysis to prevent tree topology distortion. Five distinct clades were observed, which we refer to as Cox17p-like, Cox19p-like, Mrp10p-like, CHCHD7-like, and uORF group 8-like (Figure 12). All clades but one (Mrp10p-like) contain representatives from fungi, animals, and plants and are strongly supported, showing branch order probabilities greater than 0.8, which suggests that these sequences emerged in a common eukaryotic ancestor and have since diverged in the three lineages. Mrp10p-like sequences do not strongly group independently of other branches (P = 0.57), which could be due to highly divergent amino acid sequence represented by relatively long branches. The tree shows that the group 8-like proteins are a distinct clade from other CHCH domain proteins (P = 1.0), and that CHCHD7-like proteins are more closely related to group 8-like members than to other CHCH-containing proteins (P = 0.94). The tree topology also indicates that Cox17p-like and Cox19p-like genes are more closely related to each other than to other CHCH proteins (P = 0.97).

A separate phylogenetic analysis of the 46 group 8-like sequences shows that most cluster into five taxonomic groups (plants and green algae, arthropods, nematodes, vertebrates, and fungi) with strong branch support (0.85–1.00) in all but the fungal lineage (0.58; Figure 13). Sea squirt sequences group with one of two Branchiostoma sequences with weak branch support (0.53). Dictyostelium, sea urchin (Strongylocentrotus), and one further Branchiostoma sequence do not group with any of these with weak support (0.53). Sea squirt, Branchiostoma, and sea urchin sequences should be more similar to other deuterostomes (includes the vertebrate lineage) than other organisms, but the short group 8-like sequence alignment could prevent resolution of correct evolutionary relationships of some groups (Additional file 2). Despite weakly supported branches, there is strong support for independent clustering of the arthropods, nematodes, vertebrates and plants, as expected.

Although two Branchiostoma group 8-like sequences (Brafl1 and 2) suggest that there has been a duplication event within this lineage, there is no evidence for maintenance of ancient group 8-like gene duplications occurring within the plant, vertebrate, nematode, arthropod, or fungal lineages. In Arabidopsis both the recent and ancient duplicates from two WGD events have been lost from the genome. Only the Mesostigma genome contains two group 8-like transcripts. Their short branch lengths indicate that this duplication occurred relatively recently and it is possible that insufficient time has passed for loss of the second copy.

Figure 12: Phylogenetic tree depicting CHCH domain-containing genes and alignment. Unrooted phylogenetic tree generated using MrBayes 3.0. See main text for abbreviated species names and Genbank accession number, cDNA clone number, or genome identifier.

Discussion and conclusion

Comparative analysis by uORF-Finder of 5' UTRs in full-length cDNAs from two distantly related plant species, rice and Arabidopsis, identified conserved peptide uORFs in 58 Arabidopsis loci that comprised 26 uORF homology groups and in 36 rice loci that comprised 19 homology groups, increasing the number of known conserved uORF homology groups from two to 26 and providing useful, new information for investigations of regulatory biology. Because full-length cDNAs derived from both Arabidopsis and rice only represent a fraction of all nuclear genes, not all conserved uORFs are expected to be detected by this approach. Extrapolation to the whole Arabidopsis genome suggests that it possesses approximately 61 to 102 genes with conserved peptide uORFs that are also conserved in the rice genome (see Methods for calculation). An additional 24 conserved peptide uORF genes are predicted among Arabidopsis loci with retained duplicates from the most recent WGD event. In all, there are likely to


be approximately 99–140 genes, or 0.38–0.53% of all protein-coding genes, with conserved peptide uORFs in the Arabidopsis genome. Because short conserved uORFs (<20 amino acids) would not have been detected by uORF-Finder, this is a conservative estimate.

Figure 13: Phylogenetic tree depicting group 8 small ORFs/uORFs and alignment. Unrooted phylogenetic tree generated using MrBayes 3.0. See main text for abbreviated species names and Genbank accession number, cDNA clone number, or genome identifier.

To find additional conserved uORFs, more extensive collections of full-length cDNA sequences will need to be developed and/or 5' UTRs predicted from genomic sequence will be required. As full-length cDNA sequence resources become available for other plant species, such as maize [44] and poplar [45], it should be possible to identify additional conserved uORFs that might be specific to taxonomic groups, such as monocotyledons or dicotyledons. Similarly, analysis of ancient tetraploidy events in species such as poplar and maize might be able to identify uORFs conserved between retained duplicates.
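As a rough illustration of how uORFs might be pulled from a full-length cDNA, the sketch below follows the definitions given in Methods: the mORF is the longest AUG-initiated ORF, the sequence upstream of its AUG is the 5' UTR, and uORFs are AUG-initiated ORFs in that 5' UTR, with uORFs longer than 100 amino acids excluded. It is not uORF-Finder and does not call getorf. All function names are hypothetical, and two simplifying assumptions are made: only ORFs that end in a stop codon are reported, and a uORF must lie entirely upstream of the mORF start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(cdna):
    """Return (start, end) pairs for ATG-initiated ORFs in the three
    forward frames. 'start' indexes the A of ATG and 'end' is the index
    just past the stop codon. Only the first ATG after each stop is
    used, and ORFs without a stop codon are ignored (assumption)."""
    cdna = cdna.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(cdna) - 2, 3):
            codon = cdna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3))
                start = None
    return orfs


def peptide_length(orf):
    """Number of amino acids encoded, excluding the stop codon."""
    return (orf[1] - orf[0]) // 3 - 1


def split_transcript(cdna, max_uorf_aa=100):
    """Return (mORF, [uORFs]) for one cDNA. The mORF is the longest ORF.
    uORFs are ORFs ending at or before the mORF start codon (assumption)
    that encode at most max_uorf_aa amino acids."""
    orfs = find_orfs(cdna)
    if not orfs:
        return None, []
    morf = max(orfs, key=lambda orf: orf[1] - orf[0])
    uorfs = sorted(orf for orf in orfs
                   if orf[1] <= morf[0] and peptide_length(orf) <= max_uorf_aa)
    return morf, uorfs


# Toy transcript (not a real gene): a 2-codon uORF, then a 6-codon mORF.
toy = "GG" + "ATGCCCTGA" + "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
morf, uorfs = split_transcript(toy)
print(morf, uorfs)
```

In the pipeline described in Methods, uORF peptides found this way in homologous rice and Arabidopsis transcripts would then be aligned pairwise; that alignment step is outside the scope of this sketch.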


Conserved uORF genes are regulatory genes

Based on the study of a few hundred genes, it has been suggested that uORFs are usually associated with mORFs that encode proteins that regulate cell growth [41,46], but a genome-wide study of upstream AUGs (uAUGs) found no correlation of uAUG-containing transcripts with any particular gene ontology (GO) molecular function term in mammalian transcripts [6]. These observations did not differentiate between sequence-dependent and sequence-independent uORFs. Our analysis shows that genes encoding transcription factors are overrepresented among genes predicted to encode conserved peptide uORFs, representing almost one third of the 58 Arabidopsis loci as compared to 6% of all genes. Moreover, nearly all genes whose function can be reasonably inferred appear to play some regulatory role in the biology of plants.

Do conserved peptide uORFs mediate feedback translational regulation by small regulatory molecules?

Certain eukaryotic conserved peptide uORFs are known to control translation of a downstream mORF in response to a metabolic product such as arginine or polyamines [4,14,47]. In the case of the fungal arginine-regulated carbamoyl-phosphate synthase subunit, a uORF codes for the arginine attenuator peptide that responds to increased arginine concentrations by causing ribosomes to stall near the 3' end of the uORF, interfering with ribosome scanning and translation of the downstream mORF [14]. A similar mechanism has been elucidated for the regulation of AdoMetDC in which the uORF peptide interferes with the termination of uORF translation in a polyamine-dependent manner [48,49]. In plants, sucrose is a signaling molecule that controls not only the transcription of many genes, but also translation of a class of bZIP transcription factors via their conserved uORF, suggesting the possibility of sucrose interaction with a uORF-encoded peptide to regulate translation downstream [4].

Our analysis identified not only these previously known examples of genes involved in pathways exhibiting small molecule feedback in a uORF sequence-dependent manner, but several additional genes that might also act via this mechanism. One is the conserved group 13 uORFs, which are present in genes that encode phosphoethanolamine N-methyltransferase (PEAMT/NMT), the key enzyme in phosphocholine (PCho) biosynthesis. Recently, NMT1 has been shown to contain a uORF that differentially affects translation of the mORF in response to exogenously added choline [50]. This effect is observed when the uORF start codon is abolished but it remains to be determined whether the response to choline is uORF sequence-specific. Intriguingly, the group 13 uORF peptide is rich in arginine and serine (40–48% in Arabidopsis and rice genes; Table 6). A variety of arginine-rich peptides 15–20 amino acids long with 5 or more arginines bind to specific RNA sequences [51]. The predicted group 13 uORF peptide has 5–7 arginines in a 16–17 amino acid region, well within this range, suggesting the possibility that it might bind to a specific RNA sequence, perhaps in PEAMT/NMT transcripts. The fact that the group 13 uORF peptide was also found in Xenopus suggests that its regulatory role is widespread in eukaryotes.

Another example is homology group 11, whose mORFs are predicted to encode trehalose-6-phosphate phosphatase (TPPase); trehalose-6-phosphate is postulated to regulate sugar metabolism in plants [37]. In summary, sucrose, polyamines, phosphatidic acid, and trehalose-6-phosphate are possible regulators of translation of downstream mORFs through interaction with conserved uORFs. Also interesting in this light are group 19, which specifies an auxin-induced calmodulin-binding homolog, and group 15, which encodes a bHLH transcription factor that is believed to be subject to translational control through its conserved uORF by spermine synthase [52]. Spermine is a polyamine signal molecule necessary for normal plant growth and defense responses.

As mentioned, six conserved uORF families specify transcription factors, one of which is regulated by the small signaling molecule sucrose. In plants, transcription factors often act quantitatively to control target gene expression proportionate to transcription factor concentration [53]. Therefore, it is interesting to consider the possibility that translational control of transcription factor protein levels could be mediated by interaction of a conserved uORF peptide with a metabolite. This might be an effective means for quantitatively modulating the levels of expression of a pathway or network of downstream genes, for instance, in response to changing physiological or environmental conditions. This logic can equally be applied to other key control proteins and their uORFs.

How is translational control mediated by conserved peptide uORFs?

If conserved uORF peptides can regulate mORF levels in response to small molecules, they are clearly analogous to RNA sensors and riboswitches that sense small molecules and regulate transcript translation accordingly [28,54]. It is interesting to think of conserved peptide uORFs too as sensors of cellular, physiological, or developmental conditions. Although the role of conserved uORFs as 'sensors' of cellular metabolites has been clearly established in the cases of polyamine, sucrose, and arginine concentration, it is still not clear how uORF peptides gauge cellular conditions. uORF peptides could affect mORF translation by interacting directly with the ribosomal complex, by associating with other proteins that influence the translational machinery, and/or by stabilizing or destabilizing RNA secondary structures in the 5' UTR that impede or promote


mORF translation. Given the variety of uORF peptides represented in the 26 homology groups, each of these possibilities could occur one or more times.

It is perhaps interesting to note also that the uORFs of 9 homology groups are rich in serine, threonine, and/or tyrosine. These amino acids are potential targets for phosphorylation that conceivably could promote or inhibit ribosome stalling or initiation at downstream mORFs. As mentioned above, lysine/arginine-rich motifs could function in RNA binding [51].

Effect of nonsense-mediated decay on uORF transcripts

Because uORFs create a premature termination codon (PTC), the nonsense-mediated decay (NMD) system might target uORF transcripts for degradation. Yoine et al [55] carried out a microarray analysis of plants mutant in the UPF1 ortholog, which is required for NMD.

Among 75 genes that Yoine et al identified that accumulate transcripts at more than twice the level in the upf1 mutant as in wild type Arabidopsis, we found representatives of seven uORF homology groups (1, 7, 10, 12, 13, 15, and 17), suggesting that these uORF transcripts are susceptible to nonsense-mediated decay. The uORFs in these groups might work in a manner analogous to the uORF arginine attenuator protein (AAP) in the fungal CPA1 transcript. The CPA1 transcript exclusively exhibits increased levels of degradation via NMD when the AAP inhibits translation termination in response to high levels of arginine, ultimately decreasing translation using a two-pronged approach [56]. Similarly, the above-identified plant uORFs could intensify translational inhibition of their associated mORFs by both blocking the ribosome physically and inducing the NMD pathway.

Evolutionary emergence of uORFs and a 'transcriptional fusion' model

Very little is known about how uORFs arise. In the extant rice and Arabidopsis genomes, sequences homologous to uORFs identified by uORF-Finder were observed only in 5' UTRs and never as part of another mORF, within 3' UTRs, within introns, or in non-transcribed regions. Possible origins of 5' UTR ORFs include (a) fragmentation of mORF sequences, (b) creation of an AUG or alternate start codon by random mutation within the 5' UTR and subsequent selection for the peptide sequence, and (c) relocation of other ORF sequences within the genome to the 5' UTR or upstream region of a given gene and subsequent transcriptional fusion of the two ORFs.

Transcriptional fusions occur in an estimated 2% of adjacently transcribed mRNAs in the human genome [57]. The evolutionary history of uORF homology group 8 suggests a stable transcriptional fusion model leading to uORF emergence in plants, arthropods and nematodes. Group 8 uORFs are associated with three independent mORFs in the land plant, arthropod and nematode lineages, while the vertebrate, slime mold, algal, and fungal small ORFs that are orthologous to group 8 uORFs do not seem to be associated with mORFs. Given the phylogenetic relationships among these species [58], the most parsimonious explanation for the evolutionary origin of group 8 uORFs is that they originated as a small ORF transcribed independently of a mORF. Subsequently, this small ORF gene was displaced via genome rearrangements or transposition events to regions upstream of three independent large ORFs resulting in transcriptional fusions of the two previously independent transcripts. The uORFs and mORFs in the plant, nematode, and arthropod lineages have remained associated within the same transcript for 300–500 My, therefore these transcriptional fusion events seem to be stable and perhaps biologically advantageous. Evidence for other uORF emergence models, such as mORF fragmentation or de novo creation, will require further analysis of closely related organisms.

Potential dual role for uORF proteins

uORFs can regulate specific mORF protein expression in trans when the cis uORF is intact [59,60] but it is still unclear whether uORF proteins can play additional roles in the cell. Small proteins, similar in length to uORFs, play a role in plant development and could also be involved in plant defense [61,62]. Potentially, uORFs could affect such processes independently of their role as a translational regulator. Homology group 8 uORFs are largely conserved in length, sequence, and intron position across most eukaryotes, but in fungi, algae, slime mold, and vertebrates, the associated mORF seems to be absent. The absence of the mORF and strong conservation of the uORF amino acid sequence over one billion years in these eukaryotes indicates that, in plants, this protein could act as both a regulator of mORF expression and as a trans acting factor in the cell.

Group 13 uORFs contain peptides similar to RS motifs found in SR proteins. SR proteins are a family of proteins required for alternative and constitutive pre-mRNA splicing [63,64]. A subset of these proteins, shuttling SR proteins, have not only been implicated in splicing but have also been shown to stimulate translation of a reporter gene when fused to the same transcript [65], analogous to a uORF-mORF associated pair. It is possible then, that group 13 uORF proteins could also play a dual role, as a translational regulator and trans factor.

Similarly, some uORFs in mammalian genomes might adopt these dual roles and further characterization of conserved mammalian uORFs [66] could resolve a dual role model.
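The arginine/serine richness of the group 13 uORF peptide discussed above (40–48% RS, with 5–7 arginines in a 16–17 amino acid region) can be recomputed from the peptide sequences. This is a minimal sketch, assuming the Arabidopsis and rice group 13 uORF peptides as transcribed from the Arath1 and Orysa1 rows of Figure 10, panel A. The helper names are hypothetical and the 17-residue window is simply the upper end of the range quoted in the text.

```python
def residue_fraction(peptide, residues):
    """Fraction of positions in the peptide occupied by any of 'residues'."""
    return sum(aa in residues for aa in peptide) / len(peptide)


def max_count_in_window(peptide, residue, window):
    """Largest number of 'residue' found in any stretch of 'window'
    consecutive positions (the whole peptide if it is shorter)."""
    window = min(window, len(peptide))
    return max(peptide[i:i + window].count(residue)
               for i in range(len(peptide) - window + 1))


# Group 13 uORF peptides transcribed from Figure 10, panel A.
ARATH1 = "MQQRGRSVNRRSRSFSRSRLAVEGH"
ORYSA1 = "MQQRGLAHNRRSRGFSRSRRAVEGA"

for name, peptide in (("Arath1", ARATH1), ("Orysa1", ORYSA1)):
    rs = residue_fraction(peptide, "RS")
    arg = max_count_in_window(peptide, "R", 17)
    print(f"{name}: {rs:.0%} RS, up to {arg} arginines per 17 residues")
```

For these two peptides the RS fractions fall at the two ends of the 40–48% range reported in Table 6, and each has seven arginines within a 17-residue stretch.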


Applications
Ka/Ks analyses suggest that conserved peptide uORFs are under mild to strong negative selection and might therefore be useful for resolving orthology and paralogy of specific gene pairs. For example, phylogenetic studies have sometimes failed to identify all members within a uORF homology group when only considering the mORF sequence (e.g. homology group 2). Although the bHLH transcription factor domain occurs in the mORF of all three group 2 members, none were identified in the original studies, and only two of the three members have been included in the latest description of Arabidopsis bHLH family members [67-69].

Further characterization of conserved peptide uORFs and their functional mechanisms might also provide useful tools for creating inducible or repressible expression vectors in plants. AdoMetDC1, bZIP11, and PEAMT/NMT1 protein levels are regulated by conserved uORFs in a metabolite-dependent manner (polyamine, sucrose, and choline, respectively) and other conserved uORFs might also regulate mORF translation in response to cellular compounds, such as TPPases. If this is the case, further functional characterization of conserved peptide uORFs could provide the tools necessary to build constructs that are quickly inducible or repressible at the translational level under various conditions.

Methods
Identifying conserved uORFs in rice and Arabidopsis
Corrected RIKEN and Genoscope Arabidopsis thaliana ecotype Columbia and NIAS, FAIS and RIKEN Oryza sativa ssp. japonica cv Nipponbare full-length cDNA collections were used for all analyses [70]. A cDNA's major ORF (mORF) was defined as the longest ORF starting with an AUG, the sequence upstream of this AUG was designated the 5' UTR, and upstream ORFs (uORFs) were any ORFs found in the 5' UTR starting with an AUG. All ORFs were identified using getorf [71]. Arabidopsis mORFs were aligned to rice cDNAs using tBLASTn with an E-value cutoff = 1e-5 [72,73] to find putative homologs. Rice cDNAs with hits below this threshold were paired with their respective Arabidopsis transcript, 5' UTR sequences were extracted from both, uORFs were determined using getorf, and all combinations of rice and Arabidopsis uORF peptide pairs were aligned using needle [71]. The reciprocal analysis was also performed, starting with rice full-length cDNA sequences and comparing them to Arabidopsis transcript sequences. All uORFs greater than 100 amino acids were excluded from this analysis.

All pairs with scores >50 were kept and examined manually against existing Arabidopsis transcript annotations (TAIR and TIGR) and existing ESTs to determine whether aligned peptides fall within a probable 5' UTR. To validate the putative uORFs, the first 100 amino acids of the Arabidopsis mORF were aligned to GenBank plant ESTs using tBLASTn (E-value = 1e-10, limit: Viridiplantae [orgn] NOT Arabidopsis [orgn], complexity filter off), and all retrieved plant uORF sequences were aligned to rice and Arabidopsis uORFs using ClustalW [74], manually adjusted, and visualized using Jalview [75] (Figures 1, 2, 3, 4, 5, 6, 7, 8). There were two exceptions to this procedure. First, because the uORFs in group 10 are 400–600 bp upstream of the mORF AUG, only the first 25 mORF amino acids were used to search GenBank plant ESTs (the first 25 amino acids are very highly conserved). Second, high identity was limited to the 3' end of mORFs in group 17, therefore the Arabidopsis transcript's terminal 50 amino acids were aligned to GenBank non-EST plant sequences. Support for a conserved uORF was found in the Medicago truncatula and Lotus corniculatus genomic sequences.

To test whether uORFs appear upstream of non-homologous genes, Arabidopsis uORF sequences were aligned to the entire Arabidopsis genome (version 5) [76] using tBLASTn (E-value = 10). Predicted conserved uORFs were found to lie upstream of the annotated gene instead of in the annotated 5' UTR in approximately 10% of Arabidopsis and 25% of rice genes (Tables 2, 3, 4). The discrepancies with the accepted annotations, found at TAIR [76] and TIGR [77], respectively, demonstrate the benefit of using full-length cDNA sequences for this analysis.

To determine whether sequences similar to these conserved uORFs reside elsewhere in the rice and Arabidopsis genomes, uORF amino acid sequences were aligned with sequences translated from the genome sequence using tBLASTn [73]. Sequences similar to these uORFs were found within 5' UTRs of homologous mORF loci, and were absent from non-homologous transcripts, intronic regions, and intergenic regions with only one exception, Arabidopsis NMT3 (AGI locus identifier At1g73600). The annotated mORF for NMT3 [78] is not covered by any available full-length cDNA and has no EST support at its 5' end. Thus, we annotated NMT3 by comparison with its paralog, NMT1 (At3g18000) [33]. NMT3 possesses sequences similar to the NMT1 uORF, as well as sequences similar to the NMT1 mORF, but the TAIR annotation fuses these into a single ORF. However, NMT3 possesses potential splice sites that would produce transcripts with uORF and mORF sequences similar to those in NMT1. The NMT3 uORF predicted by one alternative splice model is the same length as, and is 72% identical to, the NMT1 uORF amino acid sequence (Group 13 in Figure 4).

The TAIR website was used to assign locus numbers for each Arabidopsis transcript and the TIGR website for rice locus numbers. The Arabidopsis locus numbers were then used to search for retained duplicates from the recent and ancient whole genome duplications.
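As an illustration of the ORF definitions given under "Identifying conserved uORFs in rice and Arabidopsis", the following Python sketch locates a transcript's mORF and the uORFs that lie wholly within its 5' UTR. It is not the getorf-based pipeline used in this study. The function names find_orfs and split_transcript are hypothetical, only the forward strand is scanned, each ORF is assumed to run from the first ATG after a stop codon to the next in-frame stop, and the cDNA is written as DNA (ATG rather than AUG).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq):
    """Return (start, end) for every ATG...stop ORF on the forward strand.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    Within a frame, only the first ATG after a stop opens an ORF
    (simplifying assumption, so nested ATGs are not reported separately).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3))
                start = None
    return orfs


def split_transcript(seq, max_uorf_aa=100):
    """Return (mORF, uORFs) for one cDNA sequence.

    mORF: the longest ORF starting with ATG.
    uORFs: ORFs ending at or before the mORF start codon, i.e. lying
    wholly inside the 5' UTR, and no longer than max_uorf_aa residues.
    """
    orfs = find_orfs(seq)
    if not orfs:
        return None, []
    morf = max(orfs, key=lambda orf: orf[1] - orf[0])
    utr5_end = morf[0]
    uorfs = [
        orf for orf in orfs
        if orf[1] <= utr5_end
        and (orf[1] - orf[0]) // 3 - 1 <= max_uorf_aa  # residues, stop excluded
    ]
    return morf, sorted(uorfs)


if __name__ == "__main__":
    # Toy transcript: a 2-residue uORF followed by a 5-residue mORF.
    toy = "GG" + "ATGCCCTAG" + "C" + "ATGGCTGCTGCTGCTTAA"
    print(split_transcript(toy))
```

Peptide translation and the pairwise needle alignment of uORF peptides are left out of the sketch. The 100-residue limit mirrors the cutoff stated in the text.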


The recent and ancient whole genome duplications are those defined on the Arabidopsis Paralogon website [33].

Calculating Ka/Ks
For homology groups 1–19, Ka/Ks values for homologous rice and Arabidopsis mORFs and uORFs were determined using pairwise_kaks.PLS (version 1.7) [79]. Both the approximate method (option -kaks yn00) and the maximum likelihood method (-kaks codeml) were used. Any Ka/Ks value resulting from a Ka or Ks value >10 was excluded from the analysis, as these values result in inaccurate predictions of Ka/Ks [80,81]. The Ka/Ks values for homology groups 20–26 were determined with the same approach using Arabidopsis sequences only.

GO molecular function terms
GO molecular function terms [82] were retrieved from TAIR Locus History pages [76]. GO terms for all Arabidopsis loci were downloaded from the TAIR website and used to compare genome-wide GO molecular function term frequencies to those found in the conserved uORF-containing loci. Statistically significant differences were detected using the Exact Binomial test as described in the R program package [83]. This analysis was also carried out by GeneMerge, a program that incorporates a Bonferroni corrected P-value [84].

Identification of Arabidopsis ohnologs and paralogs with conserved uORF
Conserved uORFs were found in Arabidopsis duplicates in much the same way as conserved uORFs were found between rice and Arabidopsis. uORFs and mORFs were defined in the same way, and mORF sequences were aligned to the entire Arabidopsis full-length cDNA collection using BLASTp (E-value cutoff = 1e-5) to detect transcripts deriving from a duplicated locus. mORFs aligning with ≥ 99% identity were discarded, and uORFs of all remaining pairs were aligned using needle and validated as above.

Generation of phylogenetic trees
Sequences similar to Cox17p, Cox19p, Mrp10, CHCHD7, and uORF homology group 8 (as determined by tBLASTn and analyzed for conservation of the CHCH motif) were aligned using Muscle [85], trimmed of non-informative sites, and analyzed using MrBayes v. 3.0 [86] (rates = gamma, aamodel = mixed, ngen = 2000000). Phylogenetic trees were visualized using PHYLIP's DRAWTREE program v. 3.65 [87].

Sequences similar to uORF homology group 8 were aligned, edited, and analyzed in the same manner with one exception, ngen = 3000000.

Estimate of conserved peptide uORF prevalence
Number of Arabidopsis-rice loci
There is an average of 2.23 full-length cDNAs per uORF locus identified (excluding loci identified by BLAST alignment), which suggests that 15200 Arabidopsis genes are represented in the cDNA collections (34000 cDNAs/2.23 cDNAs per locus), representing approximately 60% of all Arabidopsis genes (assuming 26000 genes) [88]. In addition, Kikuchi et al. [25] report that the 28000 rice full-length cDNA sequences represent 20000 transcription units (TUs) and that 64% of these (12800) have a homolog in Arabidopsis. Assuming that 60–100% of these homologs are represented in the Arabidopsis cDNA collections, the estimated number of Arabidopsis homologs screened for uORF conservation is 7800–13000. Only 80% of Arabidopsis genes also have a homolog in rice (~21000) [25], therefore the uORF-Finder program has identified 37–62% of all conserved upstream ORFs (7800/21000 to 13000/21000) when comparing rice and Arabidopsis full-length cDNAs. Therefore, there should be 61–102 loci that contain conserved uORFs: 38 loci found by uORF-Finder, 6 additional loci found by aligning known uORF sequences with the Arabidopsis genome using BLAST, and 17–58 presently unidentified loci. Using both uORF-Finder and BLAST algorithms we estimate that between 43% and 72% of conserved peptide uORFs between monocots and dicots have been identified.

Number of Arabidopsis-Arabidopsis loci
A total of 60% of Arabidopsis genes are represented in the full-length cDNA collections used for this study. Therefore, the probability of selecting two loci that have conserved peptide uORFs from the pool of known sequences is 0.6*0.6 = 0.36. This translates to a total of 38 loci that have conserved uORFs using an Arabidopsis-Arabidopsis comparison (14 identified (36%) and 24 unidentified).

Total loci
We therefore predict that there are between 99 and 140 loci in the Arabidopsis genome that contain conserved peptide uORFs, 41–58% of which have been identified.

Abbreviations
Species name abbreviations for Figures 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, and 13
Acypi, Acyrthosiphon pisum; Adica, Adiantum capillus-veneris; Ajeca, Ajellomyces capsulatus; Allce, Allium cepa; Anoga, Anopheles gambiae; Apime, Apis mellifera; Arath, Arabidopsis thaliana; Ascsu, Ascaris suum; Aspof, Asparagus officinalis; Betvu, Beta vulgaris; Bosta, Bos taurus; Brafl, Branchiostoma floridae; Brana, Brassica napus; Brugy, Bruguiera gymnorhiza; Caeel, Caenorhabditis elegans; Cicli, Cicindela litorea; Ciosa, Ciona savignyi; Citja, Citrus jambhiri; Citpa, Citrus paradisi; Citsi, Citrus sinensis; Chlre,


Chlamydomonas reinhardtii; Cocpo, Coccidioides posadasii; Cryne, Cryptococcus neoformans; Cycru, Cycas rumphii; Danre, Danio rerio; Debha, Debaryomyces hansenii; Dicdi, Dictyostelium discoideum; Drome, Drosophila melanogaster; Drops, Drosophila pseudoobscura; Erate, Eragrostis tef; Escca, Eschscholzia californica; Eupes, Euphorbia esula; Eupti, Euphorbia tirucalli; Galga, Gallus gallus; Gibze, Gibberella zeae; Glomo, Glossina morsitans; Glyma, Glycine max; Glyso, Glycine soja; Gosar, Gossypium arboreum; Goshi, Gossypium hirsutum; Gosra, Gossypium raimondii; Haeco, Haemonchus contortus; Helan, Helianthus annuus; Hetgl, Heterodera glycines; Hevbr, Hevea brasiliensis; Homsa, Homo sapiens; Horvu, Hordeum vulgare; Iponi, Ipomoea nil; Jugre, Juglans regia; Lacsa, Lactuca sativa; Lacse, Lactuca serriola; Linus, Linum usitatissimum; Locmi, Locusta migratoria; Lyces, Lycopersicon esculentum; Maldo, Malus domestica; Medtr, Medicago truncatula; Mescr, Mesembryanthemum crystallinum; Mesvi, Mesostigma viride; Molte, Molgula tectiformis; Musmu, Mus musculus; Neucr, Neurospora crassa; Nicbe, Nicotiana benthamiana; Oncmy, Oncorhynchus mykiss; Oryla, Oryzias latipes; Orysa, Oryza sativa; Parbr, Paracoccidioides brasiliensis; Pethy, Petunia hybrida; Phaac, Phaseolus acutifolius; Phaco, Phaseolus coccineus; Phypa, Physcomitrella patens; Pontr, Poncirus trifoliata; Popde, Populus deltoides; Popca, Populus canadensis; Popeu, Populus euphratica; Poptd, Populus trichocarpa × Populus deltoides; Poptt, Populus tremula × Populus tremuloides; Prupe, Prunus persica; Sacce, Saccharomyces cerevisiae; Sachc, Saccharum hybrid cultivar; Sacof, Saccharum officinarum; Schma, Schistosoma mansoni; Schpo, Schizosaccharomyces pombe; Selmo, Selaginella moellendorffii; Soltu, Solanum tuberosum; Sorbi, Sorghum bicolor; Strpu, Strongylocentrotus purpuratus; Strra, Strongyloides ratti; Styhu, Stylosanthes humilis; Tetni, Tetraodon nigroviridis; Theha, Thellungiella halophila; Torru, Tortula ruralis; Triae, Triticum aestivum; Trisp, Trichinella spiralis; Ulvli, Ulva linza; Ustma, Ustilago maydis; Vitsh, Vitis shuttleworthii; Vitvi, Vitis vinifera; Welma, Welwitschia mirabilis; Xenla, Xenopus laevis; Xentr, Xenopus tropicalis; Yarli, Yarrowia lipolytica; Zeama, Zea mays.

Abbreviated species names and GenBank accession number, cDNA clone number, or genome identifier
Figures 1, 2, 3, 4, 5, 6, 7, 8
Group 1: Arath1 (CNS0ABWH); Arath2 (CNS09Y87); Arath3 (CNS0A364); Arath4 (CNS0A728); Arath5 (RAFL11-10-D10); Orysa1 (AK070887); Orysa2 (AK065180); Orysa3 (AK064903); Orysa4 (AK109929); Orysa5 (LOC_Os12g37410).

Group 2: Arath1 (At2g31280); Arath2 (At1g06150); Arath3 (RAFL04-15-e03); Lacsa (BQ869454); Lyces (AW621910); Medtr (BF643643); Orysa (AK074015.1).

Group 3: Arath1 (CNS0A7A6); Arath2 (RAFL04-16-A04); Arath3 (RAFL09-22-L13); Cycru (CB092297); Orysa1 (AK072162); Orysa2 (AK100397); Orysa3 (AK070259); Selmo (DN838497); Torru (CN201012); Ulvli (AJ892634).

Group 4: Arath1 (RAFL09-11-P17); Arath2 (RAFL09-63-H05); Arath3 (RAFL06-76-P19); Brana (CD823274); Goshi (AI730427); Gosra (CO113165); Medtr (AW689516); Orysa (AK060830); Poptt (BU896557); Prupe (BU045695).

Group 5: Arath1 (RAFL05-05-C03); Arath2 (CNS0A9PN); Gosra (CO130855); Hevbr (CB376393); Lacse (BU011020); Orysa (AK103103); Phaac (BU791117); Triae (BJ233459).

Group 6: Arath1 (RAFL05-17-I08); Arath2 (CNS0A6ZP); Aspof (CV291431); Glyma (BM143067); Gosar (BG442153); Orysa (AK064902); Pontr (CD576165); Triae (CK161649); Vitvi (CB980452).

Group 7: Arath (RAFL09-25-N17); Brana (CD836460); Mescr (BM301482); Nicbe (CK290710); Orysa1 (AK067685); Orysa2 (LOC_Os06g48350); Triae (CV066319).

Group 8: Arath (RAFL07-08-P17); Chlre (BE121764); Mesvi1 (DN255332); Mesvi2 (DN261354); Orysa (AK072620); Phypa (BJ174896); Popca (CX178804); Popeu (AJ776458); Sachc (CF573523); Triae (CA499582).

Group 9: Allce (CF443194); Arath1 (RAFL07-09-G06); Arath2 (RAFL09-23-F23); Arath3 (At1g64140); Gosra (CO081490); Orysa1 (AK101398); Orysa2 (AK105763); Orysa3 (AK068099); Orysa4 (AK099577).

Group 10: Arath1 (RAFL07-11-O11); Arath2 (RAFL09-17-I10); Brana (CN732239); Orysa1 (AK069526); Orysa2 (AK100056); Poptt (BI131713); Sorbi (CN139168); Theha (BE758596).

Group 11: Arath1 (RAFL07-14-D12); Arath2 (CNS0A404); Glyma (CA783255); Jugre (CV197923); Medtr (AW691064); Orysa1 (AK103391); Orysa2 (AK069361); Soltu (BQ113418).

Group 12: Arath1 (RAFL07-18-F03); Arath2 (CNS0AB39); Brana (CD812479); Citse (CN185367); Jugre (CV196770); Orysa (AK060405); Popde (CK319714); Triae (BQ752938); Zeama (CD433782).

Group 13: Arath1 (RAFL08-10-M03); Arath2 (At1g48600.2); Arath3 (At1g73600); Cycru (CB093136); Gosra (CO080661); Iponi (BJ562806); Linus (CA483285); Medtr (AW587372); Orysa1


(LOC_Os05g47540); Orysa2 (AK102037); Phypa (BJ204269); Xenla (CA792398); Xentr (CX412233); Zeama (AY103779).

Group 14: Allce (CF450799); Arath (RAFL09-10-M04); Medtr (AW267817); Nicbe (CK295530); Orysa (AK101569); Soltu (CK258175); Zeama (CO519993).

Group 15: Adica (BP914226); Arath1 (CNS0ADY7); Arath2 (RAFL08-17-G21); Arath3 (RAFL04-17-N21); Arath4 (RAFL16-69-M04); Citpa (DN959636); Gosra (CO125506); Maldo (CV082382); Medtr (CX528608); Orysa1 (AK102703); Orysa2 (AK101749); Orysa3 (AK071582); Orysa4 (AK065674); Sacof (CA154823); Vitvi (CB001711); Welma (DT579937).

Group 16: Arath (CNS0A4RC); Medtr (AW693231); Orysa1 (AK071885); Orysa2 (AK067447).

Group 17: Arath1 (RAFL09-25-E19); Arath2 (At5g03190); Arath3 (RAFL19-67-G09); Arath4 (At5g01710); Gosra (CO108440); Lyces (AW738430); Medtr1 (BQ149694); Medtr2 (AC144517); Orysa1 (AK69088); Orysa2 (AK070250); Sacof (CA191644).

Group 18: Arath (RAFL08-18-B11); Gosra (CO115325); Nicbe (CK286574); Orysa (AK061433).

Group 19: Arath (CNS09ZXM); Eupes (DV113097); Helan (AJ541596); Medtr (BI309364); Orysa (AK068270); Triae (CD927685); Vitvi (CB918939); Zeama (DV166198).

Group 20: Arath1 (RAFL04-17-G13); Arath2 (CNS0A8YX); Brana (CD835762); Brugy (BP941533); Gosar (BF274209); Maldo (CN940921); Medtr (BE316669); Styhu (L36823).

Group 21: Allce (CF450138); Arath1 (RAFL07-08-G04); Arath2 (RAFL21-49-G19); Betvu (BQ594525); Brana (CD835573); Erate (DN481483); Escca (CD481239); Eupti (BP958766); Gosra (CO074819); Glyma (BU761432); Horvu (AV834976); Lacse (BQ998418); Maldo (CV881926); Medtr (CA991201); Orysa (AK100575); Popca (CX182168).

Group 22: Arath1 (RAFL07-11-D20); Arath2 (RAFL11-03-J07); Brana (CD836422); Horvu (CA023398); Orysa (CK041713); Sacof (CA242575); Triae (BJ247925); Zeama (CO458204).

Group 23: Arath1 (RAFL07-11-L03); Arath2 (RAFL09-07-L11); Citsi (CV720092); Glyma (BI892512).

Group 24: Arath1 (RAFL07-14-J09); Arath2 (CNS0A44P); Brana (CD828343); Glyma (BI471587); Horvu (BQ471053); Orysa (AK119634); Sacof (CA118382); Sorbi (CB928687); Triae (CA483985); Zeama (CO520078).

Group 25: Arath1 (RAFL09-94-P19); Arath2 (CNS0A6N0); Brana (CD835519); Citsi (CN191447); Escca (CD481312); Glyma (BE805986); Phaco (CA913939); Soltu (DN940765); Vitsh (CV098492).

Group 26: Arath1 (CNS0A7NI); Arath2 (CNS0A1F5); Citja (CO912573); Pethy (CV298852); Poptd (CN521002); Prupe (BU045483).

Figure 9
Arath (RAFL07-08-P17); Caeel (U10402); Ciosa (BW577210); Danre (CO350578); Dicdi (AU072562); Drome (AI297387); Homsa (BU541024); Mesvi (DN255332); Neucr (BX284746); Orysa (AK072620); Phypa (BJ174896); Strpu (CX079489); Ustma (CF644197).

Figure 10
Arath1 (RAFL08-10-M03); Arath2 (At1g48600.2); Arath3 (At1g73600); Cycru (CB093136); Gosra (CO080661); Iponi (BJ562806); Linus (CA483285); Medtr (AW587372); Orysa1 (LOC_Os05g47540); Orysa2 (AK102037); Phypa (BJ204269); Xenla (CA792398); Xentr (CX412233); Zeama (AY103779).

Figure 12
Acypi (CV847404); Ajeca (CV605785); Anoga1 (BX617953), Anoga2 (XM_552406); Apime (NW_622706); Arath1 (BP562704), Arath2 (AY065264), Arath3 (RAFL07-08-P17), Arath4 (NM_179521), Arath5 (NM_112400); Ascsu (BM964977); Bosta (CO877216); Brafl1 (BW786058), Brafl2 (BW840607); Caeel (U10402); Chlre1 (BE121764), Chlre2 (AF280543); Cicli (CV156944); Ciosa (BW577210); Cocpo (CO006101); Cryne (XM_572394); Danre (CO350578); Debha (NC_006045); Dicdi1 (AU072562), Dicdi2 (XM_631387); Drome1 (AI297387), Drome2 (AY102691); Drops (DR121964); Erate (DN481021); Galga1 (BX935835), Galga2 (CR407540); Gibze (BI750032); Glomo (BX557417); Glyso (BG045953); Haeco (CA956938); Hetgl (CB299856); Homsa1 (DR155443), Homsa2 (CR607136), Homsa3 (BU541024), Homsa4 (AY957566), Homsa5


(NM_005694); Hordvu (BF628344); Locmi1 (CO854527), Locmi2 (CO825844); Mesvi1 (DN255332), Mesvi2 (DN261354); Molte (CJ368011); Musmu1 (BC030366), Musmu2 (AK010111); Neucr (BX284746); Oncmy (BX081024); Oryla (BJ737531); Orysa1 (XM_482456), Orysa2 (AK072620), Orysa3 (AK120143), Orysa4 (XM_468245); Parbr (CA581923); Phypa1 (BJ966696), Phypa2 (BJ174896); Popca (CX178804); Popeu (AJ776458); Sacce1 (NC_001136), Sacce2 (AY692601), Sacce3 (NC_001144), Sacce4 (NC_001144); Sachc (CF573523); Schma (CD081475); Schpo1 (NM_001019463), Schpo2 (NM_001022867), Schpo3 (NM_001022571); Sorbi (CD423660); Strpu (CX079489); Strra (BI323578); Tetni1 (CR709012), Tetni2 (CNS0G27U); Triae (CA499582); Trisp (BQ693345); Ustma1 (CF644197), Ustma2 (XM_754796); Xenla1 (BI477811), Xenla2 (BC084847); Xentr1 (BC075310), Xentr2 (CN119217); Yarli (XM_500713).

Figure 13
Acypi (CV847404); Ajeca (CV605785); Anoga (BX617953); Apime (NW_622706); Arath (RAFL07-08-P17); Ascsu (BM964977); Bosta (CO877216); Brafl1 (BW840607), Brafl2 (BW786058); Caeel (U10402); Chlre (BE121764); Cicli (CV156944); Ciosa (BW577210); Danre (CO350578); Debha (NC_006045); Dicdi (AU072562); Drome (AI297387); Drops (DR121964); Galga (CR407540); Gibze (BI750032); Glomo (BX557417); Haeco (CA956938); Hetgl (CB299856); Homsa (BU541024); Locmi (CO825844); Mesvi1 (DN255332), Mesvi2 (DN261354); Molte (CJ368011); Musmu (AK010111); Neucr (BX284746); Oncmy (BX081024); Oryla (BJ737531); Orysa (AK072620); Phypa (BJ174896); Popca (CX178804); Popeu (AJ776458); Sachc (CF573523); Schma (CD081475); Sorbi (CD423660); Strpu (CX079489); Strra (BI323578); Tetni (CR709012); Triae (CA499582); Trisp1 (BQ693345), Trisp2 (BQ692350); Ustma (CF644197); Xenla (BI477811); Xentr (CN119217).

Authors' contributions
Both CAH and RAJ designed and implemented the analyses for the present study. CAH drafted the manuscript and RAJ provided critical comments. Both authors have read and approved the final manuscript.

Additional material
Additional file 1
Alignment used to generate Figure 12
[http://www.biomedcentral.com/content/supplementary/1741-7007-5-32-S1.doc]

Additional file 2
Alignment used to generate Figure 13
[http://www.biomedcentral.com/content/supplementary/1741-7007-5-32-S2.doc]

Acknowledgements
We thank Dr Yadegari for suggesting this project. We would also like to recognize N Merchant and S Miller at the Biotechnology Computing Facility, as well as T Wheeler in the Department of Computer Science for invaluable programming suggestions. This research was supported by a University of Arizona NSF IGERT Genomics Initiative fellowship (DGE-0114420) to CAH and the NSF Plant Genome Program Grant No. DBI-0421679 (RAJ).

References
1. Hanfrey C, Franceschetti M, Mayer MJ, Illingworth C, Michael AJ: Abrogation of upstream open reading frame-mediated translational control of a plant S-adenosylmethionine decarboxylase results in polyamine disruption and growth perturbations. J Biol Chem 2002, 277:44131-44139.
2. Hinnebusch AG: Translational regulation of yeast GCN4. J Biol Chem 1997, 272:21661-21664.
3. Werner M, Feller A, Messenguy F, Pierard A: The leader peptide of yeast gene CPA1 is essential for the translational repression of its expression. Cell 1987, 49:805-813.
4. Wiese A, Elzinga N, Wobbes B, Smeekens S: A conserved upstream open reading frame mediates sucrose-induced repression of translation. Plant Cell 2004, 16:1717-1729.
5. Galagan JE, Calvo SE, Cuomo C, Ma LJ, Wortman JR, Batzoglou S, Lee SI, Basturkmen M, Spevak CC, Clutterbuck J, et al.: Sequencing of Aspergillus nidulans and comparative analysis with A. fumigatus and A. oryzae. Nature 2005, 438:1105-1115.
6. Churbanov A, Rogozin IB, Babenko VN, Ali H, Koonin EV: Evolutionary conservation suggests a regulatory function of AUG triplets in 5'-UTRs of eukaryotic genes. Nucleic Acids Res 2005, 33:5512-5520.
7. Kawaguchi R, Bailey-Serres J: mRNA sequence features that contribute to translational regulation in Arabidopsis. Nucleic Acids Res 2005, 33:955-965.
8. Futterer J, Hohn T: Role of an upstream open reading frame in the translation of polycistronic mRNAs in plant cells. Nucleic Acids Res 1992, 20:3851-3857.
9. Kozak M: Effects of intercistronic length on the efficiency of reinitiation by eucaryotic ribosomes. Mol Cell Biol 1987, 7:3438-3445.
10. Kozak M: Pushing the limits of the scanning mechanism for initiation of translation. Gene 2002, 299:1-34.
11. Luukkonen BG, Tan W, Schwartz S: Efficiency of reinitiation of translation on human immunodeficiency virus type 1 mRNAs is determined by the length of the upstream open reading frame and by intercistronic distance. J Virol 1995, 69:4086-4094.
12. Gopfert U, Kullmann M, Hengst L: Cell cycle-dependent translation of p27 involves a responsive element in its 5'-UTR that overlaps with a uORF. Hum Mol Genet 2003, 12:1767-1779.
13. Gray TA, Saitoh S, Nicholls RD: An imprinted, mammalian bicistronic transcript encodes two independent proteins. Proc Natl Acad Sci USA 1999, 96:5616-5621.
14. Fang P, Wang Z, Sachs MS: Evolutionarily conserved features of the arginine attenuator peptide provide the necessary requirements for its function in translational regulation. J Biol Chem 2000, 275:26710-26719.
15. Jin X, Turcott E, Englehardt S, Mize GJ, Morris DR: The two upstream open reading frames of oncogene mdm2 have different translational regulatory properties. J Biol Chem 2003, 278:25716-25721.


16. Lee J, Park EH, Couture G, Harvey I, Garneau P, Pelletier J: An upstream open reading frame impedes translation of the huntingtin gene. Nucleic Acids Res 2002, 30:5110-5119.
17. Lincoln AJ, Monczak Y, Williams SC, Johnson PF: Inhibition of CCAAT/enhancer-binding protein alpha and beta translation by upstream open reading frames. J Biol Chem 1998, 273:9552-9560.
18. Hill JR, Morris DR: Cell-specific translational regulation of S-adenosylmethionine decarboxylase mRNA. J Biol Chem 1993, 268:726-731.
19. Lee MM, Lee SH, Park KY: Characterization and expression of two members of the S-adenosylmethionine decarboxylase gene family in carnation flower. Plant Mol Biol 1997, 34:371-382.
20. Martinez-Garcia JF, Moyano E, Alcocer MJ, Martin C: Two bZIP proteins from Antirrhinum flowers preferentially bind a hybrid C-box/G-box motif and help to define a new sub-family of bZIP transcription factors. Plant J 1998, 13:489-505.
21. Evans PT, Malmberg RL: Do polyamines have roles in plant development? Annu Rev Plant Phys 1989, 40:235-269.
22. Walden R, Cordeiro A, Tiburcio AF: Polyamines: small molecules triggering pathways in plant growth and development. Plant Physiol 1997, 113:1009-1013.
23. Zhang Z, Dietrich FS: Identification and characterization of upstream open reading frames (uORF) in the 5' untranslated regions (UTR) of genes in Saccharomyces cerevisiae. Curr Genet 2005, 48:77-87.
24. Castelli V, Aury JM, Jaillon O, Wincker P, Clepet C, Menard M, Cruaud C, Quetier F, Scarpelli C, Schachter V, et al.: Whole genome sequence comparisons and "full-length" cDNA sequences: a combined approach to evaluate and improve Arabidopsis genome annotation. Genome Res 2004, 14:406-413.
25. Kikuchi S, Satoh K, Nagata T, Kawagashira N, Doi K, Kishimoto N, Yazaki J, Ishikawa M, Yamada H, Ooka H, et al.: Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science 2003, 301:376-379.
26. Seki M, Narusaka M, Kamiya A, Ishida J, Satou M, Sakurai T, Nakajima M, Enju A, Akiyama K, Oono Y, et al.: Functional annotation of a full-length Arabidopsis cDNA collection. Science 2002, 296:141-145.
27. Wolfe KH, Gouy M, Yang YW, Sharp PM, Li WH: Date of the monocot-dicot divergence estimated from chloroplast DNA sequence data. Proc Natl Acad Sci USA 1989, 86:6201-6205.
28. Sanderson MJ: A nonparametric approach to estimating divergence times in the absence of rate constancy. Mol Biol Evol 1997, 14:1218-1231.
29. Chaw SM, Chang CC, Chen HL, Li WH: Dating the monocot-dicot divergence and the origin of core eudicots using whole chloroplast genomes. J Mol Evol 2004, 58:424-441.
30. Wolfe K: Robustness – it's not where you think it is. Nat Genet 2000, 25:3-4.
31. Schranz ME, Mitchell-Olds T: Independent ancient polyploidy events in the sister families Brassicaceae and Cleomaceae. Plant Cell 2006, 18:1152-1165.
32. Blanc G, Wolfe KH: Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 2004, 16:1679-1691.
33. Blanc G, Hokamp K, Wolfe KH: A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res 2003, 13:137-144.
34. Riechmann JL, Heard J, Martin G, Reuber L, Jiang C, Keddie J, Adam L, Pineda O, Ratcliffe OJ, Samaha RR, et al.: Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science 2000, 290:2105-2110.
35. Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Geer LY, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, et al.: CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res 2005, 33:D192-196.
36. Zdobnov EM, Apweiler R: InterProScan – an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001, 17:847-848.
37. Eastmond PJ, Graham IA: Trehalose metabolism: a regulatory role for trehalose-6-phosphate? Curr Opin Plant Biol 2003, 6:231-235.
38. Cruz-Ramirez A, Lopez-Bucio J, Ramirez-Pimentel G, Zurita-Silva A, Sanchez-Calderon L, Ramirez-Chavez E, Gonzalez-Ortega E, Herrera-Estrella L: The xipotl mutant of Arabidopsis reveals a critical role for phospholipid metabolism in root system development and epidermal cell integrity. Plant Cell 2004, 16:2020-2034.
39. Mou Z, Wang X, Fu Z, Dai Y, Han C, Ouyang J, Bao F, Hu Y, Li J: Silencing of phosphoethanolamine N-methyltransferase results in temperature-sensitive male sterility and salt hypersensitivity in Arabidopsis. Plant Cell 2002, 14:2031-2043.
40. Wang X: Regulatory functions of phospholipase D and phosphatidic acid in plant growth, development, and stress responses. Plant Physiol 2005, 139:566-573.
41. Geballe AP, Sachs MS: Translational control by upstream open reading frames. In Translational control of gene expression. Edited by: Sonenberg N, Hershey JWB, Mathews MB. Cold Spring Harbor, New York: Cold Spring Harbor Laboratory Press; 2000:595-614.
42. UCSC X. tropicalis BLAT search [http://genome.ucsc.edu/cgi-bin/hgBlat]
43. Felsenstein J: Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool 1978, 27:401-410.
44. The Maize full length cDNA Project [http://www.maizecdna.org]
45. RIKEN Poplar full-length cDNA clones [http://www.brc.riken.jp/lab/epd/Eng/catalog/poplar.shtml]
46. Kozak M: An analysis of 5'-noncoding sequences from 699 vertebrate messenger RNAs. Nucleic Acids Res 1987, 15:8125-8148.
47. Hanfrey C, Elliott KA, Franceschetti M, Mayer MJ, Illingworth C, Michael AJ: A dual upstream open reading frame-based autoregulatory circuit controlling polyamine-responsive translation. J Biol Chem 2005, 280:39229-39237.
48. Law GL, Raney A, Heusner C, Morris DR: Polyamine regulation of ribosome pausing at the upstream open reading frame of S-adenosylmethionine decarboxylase. J Biol Chem 2001, 276:38036-38043.
49. Raney A, Law GL, Mize GJ, Morris DR: Regulated translation termination at the upstream open reading frame in S-adenosylmethionine decarboxylase mRNA. J Biol Chem 2002, 277:5988-5994.
50. Tabuchi T, Okada T, Azuma T, Nanmori T, Yasuda T: Posttranscriptional regulation by the upstream open reading frame of the phosphoethanolamine N-methyltransferase gene. Biosci Biotechnol Biochem 2006, 70:2330-2334.
51. Bayer TS, Booth LN, Knudsen SM, Ellington AD: Arginine-rich motifs present multiple interfaces for specific binding by RNA. RNA 2005, 11:1848-1857.
52. Imai A, Hanzawa Y, Komura M, Yamamoto KT, Komeda Y, Takahashi T: The dwarf phenotype of the Arabidopsis acl5 mutant is suppressed by a mutation in an upstream ORF of a bHLH gene. Development 2006, 133:3575-3585.
53. Hollick JB, Patterson GI, Asmundsson IM, Chandler VL: Paramutation alters regulatory control of the maize pl locus. Genetics 2000, 154:1827-1838.
54. Lai EC: RNA sensors and riboswitches: self-regulating messages. Curr Biol 2003, 13:R285-291.
55. Yoine M, Ohto MA, Onai K, Mita S, Nakamura K: The lba1 mutation of UPF1 RNA helicase involved in nonsense-mediated mRNA decay causes pleiotropic phenotypic changes and altered sugar signalling in Arabidopsis. Plant J 2006, 47:49-62.
56. Gaba A, Jacobson A, Sachs MS: Ribosome occupancy of the yeast CPA1 upstream open reading frame termination codon modulates nonsense-mediated mRNA decay. Mol Cell 2005, 20:449-460.
57. Akiva P, Toporik A, Edelheit S, Peretz Y, Diber A, Shemesh R, Novik A, Sorek R: Transcription-mediated gene fusion in the human genome. Genome Res 2006, 16:30-36.
58. Douzery EJ, Snell EA, Bapteste E, Delsuc F, Philippe H: The timing of eukaryotic evolution: does a relaxed molecular clock reconcile proteins and fossils? Proc Natl Acad Sci USA 2004, 101:15386-15391.
59. Parola AL, Kobilka BK: The peptide product of a 5' leader cistron in the beta 2 adrenergic receptor mRNA inhibits receptor synthesis. J Biol Chem 1994, 269:4497-4505.
60. Pendleton LC, Goodwin BL, Solomonson LP, Eichler DC: Regulation of endothelial argininosuccinate synthase expression and NO production by an upstream open reading frame. J Biol Chem 2005, 280:24252-24260.


61. Marton ML, Cordts S, Broadhvest J, Dresselhaus T: Micropylar pollen tube guidance by egg apparatus 1 of maize. Science 2005, 307:573-576.
62. Wen JQ, Lease KA, Walker JC: DVL, a novel class of small polypeptides: overexpression alters Arabidopsis development. Plant J 2004, 37:668-677.
63. Graveley BR: Sorting out the complexity of SR protein functions. RNA 2000, 6:1197-1211.
64. Fu XD: The superfamily of arginine/serine-rich splicing factors. RNA 1995, 1:663-680.
65. Sanford JR, Gray NK, Beckmann K, Caceres JF: A novel role for shuttling SR proteins in mRNA translation. Genes Dev 2004, 18:755-768.
66. Crowe ML, Wang XQ, Rothnagel JA: Evidence for conservation and selection of upstream open reading frames suggests probable encoding of bioactive peptides. BMC Genomics 2006, 7:16.
67. Toledo-Ortiz G, Huq E, Quail PH: The Arabidopsis basic/helix-loop-helix transcription factor family. Plant Cell 2003, 15:1749-1770.
68. Bailey PC, Martin C, Toledo-Ortiz G, Quail PH, Huq E, Heim MA, Jakoby M, Werber M, Weisshaar B: Update on the basic helix-loop-helix transcription factor gene family in Arabidopsis thaliana. Plant Cell 2003, 15:2497-2502.
69. Heim MA, Jakoby M, Werber M, Martin C, Weisshaar B, Bailey PC: The basic helix-loop-helix transcription factor family in plants: a genome-wide study of protein structure and functional diversity. Mol Biol Evol 2003, 20:735-747.
70. Hayden CA, Wheeler TJ, Jorgensen RA: Evaluating and improving cDNA sequence quality with cQC. Bioinformatics 2005, 21:4414-4415.
71. EMBOSS-European Molecular Biology Open Software Suite [http://emboss.sourceforge.net]
72. Gish W, States DJ: Identification of protein coding regions by database similarity search. Nat Genet 1993, 3:266-272.
73. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410.
74. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22:4673-4680.
75. Clamp M, Cuff J, Searle SM, Barton GJ: The Jalview Java Alignment Editor. Bioinformatics 2004, 20:426-427.
76. The Arabidopsis Information Resource [http://www.arabidopsis.org]
77. TIGR Rice Genome Annotation Project – web BLAST server [http://tigrblast.tigr.org/euk-blast/index.cgi?project=osa1]
78. TAIR locus At1g73600 [http://www.arabidopsis.org/servlets/TairObject?id=29540&type=locus]
79. Pairwise KaKs Perl script [http://cvs.biopl.orcgviewcvewcvs.cgbi- operl-livriptutilitiepairwise_kaks.PLS?cvs root=biop erl&rev=HEAD]
80. Yang Z: Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol Biol Evol 1998, 15:568-573.
81. Anisimova M, Bielawski JP, Yang Z: Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol Biol Evol 2001, 18:1585-1592.
82. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29.
83. The R Project for statistical computing [http://www.r-project.org]
84. Castillo-Davis CI, Hartl DL: GeneMerge – post-genomic analysis, data mining, and hypothesis testing. Bioinformatics 2003, 19:891-892.
85. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32:1792-1797.
86. Ronquist F, Huelsenbeck JP: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 2003, 19:1572-1574.
87. Felsenstein J: PHYLIP-Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5:164-166.
88. AGI: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, 408:796-815.
89. Rook F, Gerrits N, Kortstee A, van Kampen M, Borrias M, Weisbeek P, Smeekens S: Sucrose-specific signalling represses translation of the Arabidopsis ATB2 bZIP transcription factor gene. Plant J 1998, 15:253-263.
90. Satoh-Nagasawa N, Nagasawa N, Malcomber S, Sakai H, Jackson D: A trehalose metabolic enzyme controls inflorescence architecture in maize. Nature 2006, 441:227-230.
91. Verhagen BW, Glazebrook J, Zhu T, Chang HS, van Loon LC, Pieterse CM: The transcriptome of rhizobacteria-induced systemic resistance in Arabidopsis. Mol Plant Microbe Interact 2004, 17:895-908.
92. Henriksson E, Olsson AS, Johannesson H, Johansson H, Hanson J, Engstrom P, Soderman E: Homeodomain leucine zipper class I genes in Arabidopsis. Expression patterns and phylogenetic relationships. Plant Physiol 2005, 139:509-518.
93. Aoyama T, Dong CH, Wu Y, Carabelli M, Sessa G, Ruberti I, Morelli G, Chua NH: Ectopic expression of the Arabidopsis transcriptional activator Athb-1 alters leaf cell fate in tobacco. Plant Cell 1995, 7:1773-1785.
94. Bharti K, Von Koskull-Doring P, Bharti S, Kumar P, Tintschl-Korbitzer A, Treuter E, Nover L: Tomato heat stress transcription factor HsfB1 represents a novel type of general transcription coactivator with a histone-like motif interacting with the plant CREB binding protein ortholog HAC1. Plant Cell 2004, 16:1521-1535.
95. Nover L, Scharf KD, Gagliardi D, Vergne P, Czarnecka-Verner E, Gurley WB: The Hsf world: classification and properties of plant heat stress transcription factors. Cell Stress Chaperones 1996, 1:215-223.
96. Yang T, Poovaiah BW: Molecular and biochemical evidence for the involvement of calcium/calmodulin in auxin action. J Biol Chem 2000, 275:3137-3143.
97. Nakano T, Suzuki K, Fujimura T, Shinshi H: Genome-wide analysis of the ERF gene family in Arabidopsis and rice. Plant Physiol 2006, 140:411-432.
98. Gutterson N, Reuber TL: Regulation of disease resistance pathways by AP2/ERF transcription factors. Curr Opin Plant Biol 2004, 7:465-471.
99. PlantsP kinase classification [http://plantsp.genomics.purdue.edu/html/families.html]
100. Sakuma Y, Liu Q, Dubouzet JG, Abe H, Shinozaki K, Yamaguchi-Shinozaki K: DNA-binding specificity of the ERF/AP2 domain of Arabidopsis DREBs, transcription factors involved in dehydration- and cold-inducible gene expression. Biochem Biophys Res Commun 2002, 290:998-1009.
