<<

© 2011 Nature America, Inc. All rights reserved. Joao Carlos Setubal Carlos Joao Desany Brian Janet P Slovin Pankaj Jaiswal Vladimir Shulaev The ( of woodland Nature Ge Nature Received 9 June; accepted 2 December; published online 26 December 2010; *A full list of author affiliations appears at the end of the paper. pled with forselection favorable recessive traits such as day neutrality, growing range, self-compatibility and long history of gardens cultivation, cou European in centuries for or ‘alpine’perflorens’ forms of as sequencing a genomic reference for the for research. genomic advantages substantial offers Mb, ~240 genus, strawberry the of size ferent diploid ancestors. Paradoxically, the small basic ( sets of (2 ­ananassa Genomically, ethylene. to response in ripen not does because flesh the non-climacteric be to considered is strawberry the peach, appleand as such crops family other Unlike . the tip, shoot modified fleshy of a surface the dot that seeds) (or dry abundant the of consists actual the as fruit, true a nor crop youngest the among is and strawberry, cultivated The S Schuyler Korban Jade Carter Samuel E Fox Palitha Dharmawardhana Sergei A Filichkin Juan Jairo Ruiz Rojas analysis between to other genetic genome, cultivated perennial The VelascoRiccardo Ron Mittler

An international consortium selected selected consortium international An valuable

woodland

.

linkage of

is among the most complex of crop , harboring eight eight harboring plants, crop of complex most the among is which

has

strawberry n 154

horticultural etics

Gene

8 1 a

, Elena Lopez Girona , Lopez Elena strawberry,

, Barry Flinn , Barry

-coding map small was 4 14 12 4 ,

5

and , , Keithanne Mockaitis

, Oswald R Crasta R Oswald , , , Nahla Bassil , , Scott A Givan prediction VOLUME 43 | NUMBER 2 | FEBRUARY 2011 FEBRUARY | 2 NUMBER | 43 VOLUME

27 sequenced into 1

4 genome

( * 33 Prunus n , , , Mark Borodovsky Fragaria , , Daniel J Sargent 5 = 8 9 , , Michela Troggio , , Jahn Davik

, traits 20 seven 18 F.

Fragaria vesca x , , Mithu Chatterjee , Jean-Marc, Celton × × = 56) from derived as many as four dif F. vesca

predict

modeling (240 21

ananassa including 4

genes

× pseudochromosomes. to , , Justin Elser , , Asaph Aharoni

ananassa

×39 13 1

Mb), . Their widespread temperate temperate widespread Their . ssp. 4

, , Roger P Hellens

suggests , a 5

1

34 coverage , Larry J Wilhelm , Larry , originated ~250 years ago ago years ~250 originated , hypothetical 2 . Botanically, it is neither a neither is it Botanically, .

F.vesca

identified

6 , , Anna Zdepski , vesca is

, , Amparo Monfort , Roderick V Jensen V Roderick , (2

) amenable

2 n and 8 , , Ross N Crowhurst

27 10 , , Aaron Liston 2 =

have been cultivated have cultivated been that . The so-called . ‘semThe so-called 4

, , Roberto Viola (2 , 2 nutritional

, , Rajani Raja 38

other using x 22

n , , Richard E Veilleux 19

34,809 = assignment = 2 = , 23

25

x , , D Jasper G Rees 14), ancestral

= 7) genome to

This

, , Bo Liu , , Jeffrey L Bennetzen economically second-generation x

genetic = 14) for for 14) =

is 3

genes,

, , Clive Evans diploid

value 17 a

4

versatile , , Wenqin Wang , 5

F

Rosaceae 4 of , , Sushma Naithani 4 35 . × . 11 , , Shrinivasrao P Mane

transformation , , Henry D Priest

doi:10.1038/ng.74

and - - - with , Populus 15 27 , , Herman Silva

36 strawberry , , Tia-Lynn Ashman , Andrew C Allan C Andrew , 3

, Beatrice , Denoyes-Rothan Beatrice important

Moreover, this genome was sequenced using an open-access open-access an using sequenced was model. community genome this Moreover, these solely using characterized and assembled be can sequence genome contiguous a that physical demonstrating a reference, without assembly did and technologies short-read with exclusively We coverage # PI551572). achieved accession Repository F.vesca family. Rosaceae the in plants all for function gene testing for surrogate attractive an strawberry render studies properties functional and structural as well as tools genetic for lished vitro Robust in . or peach as such species tree with compared stature herbaceous small and propagation of vegetative ease a perennial, for a including time short generation for ence system genomic Rosaceae, broadly, More diversity. genotypic extensive offer forms yellow-fruited and non-runnering , , Todd C Mockler flowering

experimental most This This report presents the genome sequence of the diploid strawberry

genome 19

9 20 technology, to , , Tim Harkins , , Kelly P Williams

being & Kevin M Folta

regeneration and transformation systems have been estab been have systems transformation and regeneration ssp. ssp. Malvidae,

sequence F. vesca

29

rosaceous 17 time and 0

vesca , , Steven L Salzberg that

, Randall A , KerstetterRandall supported 24

4

plant shares were 30 ,

5 , Lee Meisel, Lee

, facilitating the production of forward and reverse reverse and of forward production the , facilitating had accession 4 (National Clonal Germplasm Germplasm Clonal (National 4 Hawaii accession assembled , Douglas , W Douglas Bryant Jr

, , Alan Christoffels rather

lacks 3

Fragaria vesca

4 system. 9 plants. , nine identified.

16 28 , , , Paul Burns substantial 5 14

, , Otto Folkerts , Todd, Michael P by

, , Hao Wang F.vesca

the , , Chinnappa Kodira than

chromosomes.

transcriptome 9 22

de novo , Sarah , H Sarah Holt Here large

This ,

23 Fabidae, 25 37 offers many advantages as a refer a as advantages many offers

Macrosyntenic , , Avital Adato

, , Pere Arus

diminutive

sequence we genome 7 , , Allan W Dickerman

10

and report 29 17

, , Thomas M Davis is 31 , , , Wilfried Schwab

6

warranted.

then New mapping. , , David Y Salama , , Arthur L Delcher

duplications

4 identity

the 17 , herbaceous 5 20

, 35

anchored , phylogenetic

,

relationships s e l c i t r A

21 draft , 14 26 36 , , ,

,

Genes

with

technologies. technologies.

) F. vesca

seen

3–6

to the

critical . These These . the

11

in 9

32 , 109 22

,

7 , , - - ,

© 2011 Nature America, Inc. All rights reserved. Resequencing using Illumina confirmed the high quality of the ass scaffolds. 272 represented in is sequence total the of Mb) (209.8 95% ass bly is presented in Methods). A ofsummary the input sequence data forused the assem SOLiD platforms to generate ×39 combined average coverage (Online stems modified . called from plants new produces and fruit white-yellow It has season. of regardless months 4–6 in cycle life a completes and on self- seeds abundant sets is day neutral, H4x4 tagging. mutagenesis insertional (T-DNA) DNA transfer for used was accession 4 Hawaii The tion. transforma genetic high-throughput to amenability its of because the of line vesca inbred fourth-generation a selected We Genome RESULTS s e l c i t r A 110 progeny × FN FV full markers, genetic by 320 anchored were embedded gaps), 131 were anchored to the FV × FN map. The of posed over 10,000 bp (a total of 209.8 Mb of scaffold and sequences ( Fragaria diploid the to assembly the of scaffolds the oriented and We aligned Anchoring (300 Mb) asserving internal controls ( with cytometry, flow using Mb ~240 at estimated was size genome H4x4 ( ×26 approximately of depth average an with reads Illumina perfect-match by validated being bly assem the in bly, bases ofthe and99.98% scaffolds ofthe 99.8% with were anchored and oriented using map positions of markers in the full FV × FN progeny, whereas the yellow scaffolds were anchored to mapping bins. mapping to anchored were scaffolds yellow the whereas progeny, × FN FV full the in markers of positions map using oriented and anchored were scaffolds Blue markers. genetic 390 with map genetic the to anchored were length) in kb 10 over sequence contiguous all of (99.2% gaps embedded 1 Figure CFACT099,I:41 (37.0) Fig. Fig. Marker cMLG 512991CT47 (54.0) RC1635,I:90 (90.0) EPpCU9910 (14.6) e sd h Rce44 Ilmn/oea n Lf Technologies/ Life and Illumina/Solexa Roche/454, the used We UWC513099,I:90 e EMFn128 (56.5) EMFn182 (51.0) EMFn136 (26.0) EMFn049 (12.8) EMFv164 (65.6) PES,I:21 (41.1) EMFvi072 (6.3 UDF002 (22.2)

mbled with an N50 of 1.3 Mb ( Mb 1.3 of N50 an with mbled VT-846 (0.0 F3H (47.5) accession Hawaii 4 known as ‘H4×4’ for sequencing primarily primarily sequencing for ‘H4×4’ as 4 known Hawaii accession Arabidopsis thalianaArabidopsis 1 (74.0) ). Of 272 272 Of ).

reference linkage map (FV × FN) and its associated bin map ) ) Anchoring the the Anchoring

sequencing

genome K 1 F. vesca F. 578 794 480 136 74 283 80 468 181 111 75 406 106 153 813 713 550 594 681 1,238 1,900 324 424 1,992 3,435 500 885 779 1,875 1,445 589 Supplementary Table 1 b UWC513169,II:50 (50.0)

sequence EMFxa379796 (55.0) 6 UFFxa15H09 (44.0) RC0537,II:47 (47.0) F. vesca

( CFVCT021 (65.0) Marker cMLG EMFvi099 (10.0) and EMFn134 (38.0) EMFn214 (34.0) EMFn148 (20.0) EMFn121 (17.0) EMFn235 (16.0) EMFv183 (72.0) EMFv003 (63.0) RC0052 (14.0) VT-008A (0.0 Fig. Fig. VT-714 (26.0) VT-837 (8.0 Fvi11 (31.0) ADH (18.0) DFR (59.0) APB (22.0) 5,6 H4x4 sequence scaffolds that were com were that scaffolds sequence H4x4 (~147 Mb) and Mb) (~147

, as well as transposon and activation activation and transposon as well as , assembly 1 ) ) genome to the diploid diploid the to genome , blue bars) and 86 mapped in a bin set set a bin in mapped 86 and bars) , blue upeetr Fg 1 Fig. Supplementary

to M 2

the Supplementary Table 2 Table Supplementary Supplementary Table 3

K 2,262 626 920 694 2,682 1,898 1,079 1,193 542 216 237 193 232 639 503 114 359 99 3,083 539 465 420 406 350 338 244 1,000 237 255 457 468 150 231 900 183 2,778 111 1,052 1,790 535 940 677 110 363 709 genetic

b including 234 mapped in the the in mapped 234 including EMFxa381788 (50.0) UFFxa02H04 (8.0) CFVCT032 (31.0) CFVCT012 (68.0) . Over 3,200. scaffolds Over were EMFvi038 (43.0) FPS.III:13 (13.0) EMFn125 (41.0) EMFn170 (39.0) EMFn034 (35.0) EMFn207 (48.0) UDF004 (33.0) UDF016 (28.5) UDF017 (27.0) EMFv029 (2.0 EMFv016 (0.0 Brachypodium distachyonBrachypodium VT-725 (53.0) VT-744 (24.0) VT-673 (21.0) VT-398 (15.0) arker cMLG3

map ) ) Fragaria ). The The ). . vesca F. K 2,850 1,419 545 1,398 132 461 691 341 584 923 2,261 495 1,854 343 355 1,497 858 385 264 595 1,669 720 377 895 83 2,088 839 376 1,461 246 857 reference map, FV × FN. Scaffolds representing 198.1 Mb of scaffolded sequence with with sequence scaffolded of Mb 198.1 representing Scaffolds × FN. FV map, reference b

UWC513004.IV:68 (68.0) scaffolds EFvVB1231.IV:46 (46.0) EFvVB2179,IV:26 (26.0) . vesca F. ). Over Over ). ). Marker cMLG4 EMFvi136 (53.0) EMFn111 (32.0 EMFv180 (30.0) U VT-753 (86.0 VT-861 (60.0 VT-816 (40.0 VT-756 (35.0) ssp. ssp. DF008 (80.0 AG53 (22.0) AC24 (20.0) CEL1 (75.0) e LOX (28.0) Chr4 (0.0) m 7 - - - - ­ ) ) ) ) )

25S 25S signal and a proximal 5S signal, whereas F displayed and hybridized to 25S () and 5S (green) rDNA probes (Online Methods chromosomes tip) ( mitotic using pseudochromosomes defined genomically and chromosomes marked cytologically two between typic features in loci in an unspecified accession of loci and one pair of 5S loci that co-localized with one of the pairs of 45S rescent lished for a in study previous used nomenclature group linkage the to according numbered markers. We assembled the scaffolds into seven pseudochromosomes, to FV the × was anchored FN map 390 using sequence) scaffold total the of (~94% data sequence of Mb 198.1 containing scaffolds) split scaffolds sequence genome 204 of total a Thus, on map the genetic and were split into two at regions of low coverage. locations two to mapped scaffolds Three map. genetic the to length kb in of over 100,000 scaffolds DNA ling 70 additional for anchoring ­sequencing of a reduced complexity Illumina SNP direct markers through segregating to identify method seedlings six of also contains 25S sequences in a mapped scaffold may may scaffold mapped a in which sequences 25S VI, contains also Pseudochromosome G. chromosome to correspond to appears respectively, locations, proximal and distal at sequences 5S and 25S containing scaffolds kb.mapped with VII, 1.7 Pseudochromosome than less of scaffolds unmapped were 30 which of identity, marker EMFv190. The 25S sequence had 32 matches of >90% sequence by defined VII at locus the that maps to scaffold pseudochromosome one and pseudochromosomes to are mapped not that two scaffolds small to sequence had sequence probe 5S The signal. 25S distal weak a displayed D chromosome and signal 25S distal strong a Although a comprehensive molecular karyotype has yet to be estab Supplementary Fig. 2 Supplementary 978 647 998 4,676 404 875 2,767 2,544 1,027 1,111 833 186 881 681 91 2,874 808 902 Kb VOLUME 43 | NUMBER 2 | FEBRUARY 2011 FEBRUARY | 2 NUMBER | 43 VOLUME in situ EFvVB2013.V:73 (73.0) CFACT104.V:29 (29.0) F. vesca CFVCT024 (37.0) Marker cMLG EMFvi018 (49.0) EMFn010 (55.5) EMFn238 (53.0) EMFn162 (51.5) EMFn110 (26.0) EMFvi108 (7.0 GP.V:21 (10.0) UDF009 (39.0) UDF006 (35.0) VT-734 (18.0) AG33 (45.0) CEL2 (20.0) PC14 (12.0) AC49 (0.0 hybridization (FISH), three pairs of 45S (18S-5.8S-25S) 8 . 7 F. vesca ) ) ( , researchers , in researchers a previous study Fig. Fig. 5 H4x4 and identified tentative correspondences 1 , yellow bars). Additionally, we used a new new a used we Additionally, bars). yellow , K 399 439 709 570 1,272 67 182 511 632 336 686 100 2,078 141 1,481 2,447 1,378 374 1,264 2,597 495 2,549 1,811 115 159 168 1,443 388 1,074 3,447 ). ). Chromosome G displayed a strong distal b UWC512940.VI:90 (90.0) UWC512934.VI:70 (70.0) RC1173,VI:07 (10.0) EPpCU1830 (17.0) CFVCT017 (50.0) CFVCT036 (43.0) Marker cMLG6 EMFvi133 (41.0) EMFvi025 (12.0) EMFn153 (60.0) EMFn123 (35.0) EMFn185 (32.0) EMFn228 (15.0) EMFv104 (45.0) EMFv006 (44.0) RC1247 (68.0) VT-679 (78.0) VT-076 (73.0) VT-674 (48.0) VT-704 (0.0 ACO (58.0) F. vesca ZIP (56.0) QR (75.0) Alu ) I I digestion of the bin set seed . . We also found these karyo Kb 156 1,327 479 206 177 138 1,469 306 506 599 247 258 208 2,460 420 1,174 653 898 2,869 903 2,335 2,458 3,552 796 1,278 1,178 1,017 4,105 1,392 419 361 337 1,386 203 1,929

( 9 including the three three the including EPpCU9223 (36.0) EPpCU9257 (18.0) identified, by identified, fluo MEX.VII:90 (80.0) CFVCT019 (24.0) Marker cMLG Nature Ge Nature EMFvi008 (30.0) EMFn213 (21.0) EMFv023 (56.0) EMFv190 (16.0) EMFv021 (14.0) VT-659 (59.0) VT-838 (41.0) Omt1 (66.0) CHS (13.0) AC31 (3.0 Chr7 (0.0 (47.0) ) ) c orrespond orrespond 7 n etics Kb 3,270 2,607 1,416 1,772 1,226 546 486 1,040 1,497 995 1,594 5,161 1,016 769 - - - - © 2011 Nature America, Inc. All rights reserved. to FC3 and FC2, whereas those on PG7 mapped onto FC6 and FC1. FC1. and FC6 onto mapped PG7 on those whereas FC2, and FC3 to located PG4 on Markers FC1. on being remainder the with located FC6, on were PG3 to mapped markers Most FC5. of section a with and two taxa, with complete synteny between the between conservation genome remarkable revealed analysis This . two the between shared blocks’ ‘syntenic within occurred RosCOS more or five when genomes two the between orthologous relationships between the two genomes ( of seven pseudochromosomes in bin mapped previously markers the map set of (RosCOS) ortholog positions 389 conserved rosaceous of macro-syntenic relationships among rosaceous taxa. Comparison of into provide insight the nature and analyses dynamics Genome-wide Synteny D.on chromosome signal 25S weak to the responding to Nature Ge Nature 100,000 nucleotide intervals within each T × E mappingby multiplying bin (see the URLs). marker positions in cM by 400,000.reference Markers map werewere convertedspaced at to approximate physicalmap positions for comparison map positions on the eight linkage groups (PG1-8)on ofthe the seven pseudochromosomes (FC1-7) of Figure 2 in Repbase elements transposable by masked assembly the of ~1.3% to compared assembly, the of 22% ( able elements exemplars ment genome maize larger vesca F. the of searches structure-based and homology Extensive genome they evolution. and/or gene genome drive they nuclear which to the degree of the and represent percentage the in both genomes, of components major are elements transposable studied, plants all In Repetitive triplication. ancient the of signature the obscure may genes) duplicated of loss preferential by accompanied (perhaps reduction ( rearrangement chromosome strawberry, grape in documented first triplication, ancient an share clade rosid the of large-scale, of ( duplication evidence no within-genome with date to sequenced genome plant less. or occurrences three have families other All ­contains the longest (14,721 bp) of the 126 matches that are to logy a ABA95102.1) protein rice retrotransposon (GenBank: ( occurrences 15 distinct families of approximate repeats, with the largest family having eight form long matches that showed Methods) (Online 3.22 version of the A comparison Absence Spiraeoideae the Rosaceae, modern the within group largest the of number chromosome haploid base the with consistent genome for Rosaceae with a haploid chromosome number ( ancestral an of reconstruction the for evidence extensive providing study previous a to resolution improved adding identified, were PG6-FC6 and PG3-FC1 PG1-FC5, between translocations chromosomal New

chromosome F. No mapped scaffold could be implicated as cor as implicated be could scaffold mapped F.No chromosome 1 Fragaria 0 7 . The diagram was plotted using Circos; map positions from the 1 . Our data support these same broad structural relationships, relationships, structural broad same these support data Our . 4

and found in all other rosid genomes, including apple including genomes, rosid other all in found and genome, performed as previously described for the much much the for described previously as performed genome, A schematic representation of the positions of 389 RosCOS markers

infers of

n sequences

chromosome (FC) 7, PG8 and a section of FC2, and PG5 large etics Supplementary Table 4 Supplementary

ancestral 1 Supplementary Fig. 3 Fig. Supplementary 7 , including more than 6,000 fully intact transpos intact fully 6,000 than more including ,

duplications VOLUME 43 | NUMBER 2 | FEBRUARY 2011 FEBRUARY | 2 NUMBER | 43 VOLUME F. vesca 1 6

and ietfe 56 ifrn tasoal ele transposable different 576 identified ,

relationships

genome against itself using MUMmer using itself against genome transposable Supplementary Fig. 3 Fig. Supplementary F. vesca

in 1 8 Prunus

, the Munich Information Center , Center the Munich Information the H4x4 macro-syntenic revealed ). These elements mask about elements ). These F. vesca Fig.

). This family shows homo shows family This ). F. vesca Prunus 1

0 elements Fig. Fig. 2 to their positions on the positions to their ). Markers were deemed in relation to their bin linkage group (PG) 2 2

Prunus genome ) and genome size size genome and ) F.vesca ). All members members All ). 1 1 T × E reference . ≥ is the only only the is x 10,000 bp. ) ) of nine, Prunus

1 1 3 5 and . In .

1 2 - - - ­

and validation and provided pools F. (cDNA) vesca DNA complementary complex Multiple Transcriptome identity between some elements suggests recent transposition activity. be the reason 10,000, so the lack of highly abundant LTR retrotransposonsthan greater is numberslikely copy to with families have genomes angiosperm LTR retrotransposon family has fewer than 2,100 copies. Average-size resent 2.8% and 2.4%, respectively, of nuclear DNA. The elementsmost numerous (), the most numerous DNA transposable whereaselements, CACTA elementsrep and miniature inverted-repeat transposable searches of 30 sequenced superfamily were identified by the structure-based and homology-based and nine intact long terminal repeat (LTR) retrotransposonscovery resource.of the Moreover, previously identified transposablethat the assembly provides a comprehensiveelements transposable element dis masked in the raw sequence reads compared to the database assembly, indicating repeat ­statistically (TIGR) Research Genomic for database repeat (MIPs) Sequences Protein for observed overrepresentation of overrepresentation biotic and observed genes abiotic stress-related We categories. transduction signal and kinase factor, transcription other reports with consistent overrepresented, also were genes metabolic lipid and metabolic secondary metabolism, acid amino maturation-associated rases and polygalacturonases). Flower, fruit and embryo development, to ­associated with fruit development and were dominated by genes related functions and processes molecular of categories for biological several provided ( is fruit and in genes of these patterns expression of the spective < 0.01) in genes) the and (1,753 fruit root A genes). (2,151 per global rate discovery false (>twofold, genes overrepresented of transcripts identified the Methods) (Online in libraries cDNA genes root-specific and of fruit- representation relative the of Analysis transcripts.

LTR retrotransposons occupy ~16% of the PG4 metabolic activity (glycosyltransferases, pectin este pectin (glycosyltransferases, activity metabolic carbohydrate PG5

PG3

transcriptome sequence resources for gene model prediction prediction model for gene resources sequence transcriptome PG6 Fig. Fig.

PG2 data of amount the in found was difference significant 2 F. vesca 4 2 . . In contrast, abundant transcripts in roots represented 3

3

). Genes ). overrepresented Genes in showed fruit sequence and highlighted organ specificity of fruit and root root and fruit of specificity organ highlighted and PG7 has a relatively small-size genome. High sequence F. vesca

PG1 PG8 spp. americana

FC1 FC7 F. vesca fosmids 1 9 s e l c i t r A and the Institute Institute the and 2 0

FC2 nuclear genome, obnd No combined. FC6 2 2

enrichment enrichment .

FC3 FC5

FC4 Copia 111 2 1 - - - -

© 2011 Nature America, Inc. All rights reserved. s e l c i t r A 112 and provided empirical supportsequences EST for Roche/454 the and RNA-seqpredictions Illumina the ( using models gene families to 18,170 genes (~55%). Fig. 5a preliminary gene models. Based on our functional annotation pipeline, we provided Table6 and databases (plant) RefSeq UniRef90,SwissProt, against BLAST by similarity for searched were exons per gene ( models with a mean coding sequence size of 1,160 nt and a mean of 4.8 F. vesca elements and marked for introns inferred from a high confidence set of models were generated for the genomic sequence masked for repetitive training on the whole unmasked sequence of of the precision of gene elements derived from the transcriptome sequencing, to optimize from the which combines Weimplemented machinenewlearning algorithm,a GeneMark-ES+, Gene stages. organs and developmental different between plasticity tional transcrip in the roots expected compared to confirmed Results fruit. category. each in genes of number the denotes circle each of radius the and < 0.05), rate discovery false (yellow, level significance on based shaded are circles The genes. expressed 3 Figure Transferase a Post-embryonic development Comparison of transcriptome sequence information to gene gene to information sequence transcriptome of Comparison activity

6b 5 Multicellular organismal Transporter activity development × ab initio

10 , prediction ). Our approach to gene prediction proved highly transcript sequences. This process generated 34,809 gene b ). AtInterprooneleast). motif Flower

development Hydrolase activity ). Gene clustering methods allowed us to computationally assign –2 Gene ontology mapping and functional annotation of strawberry genes. Overrepresented gene ontology categories in fruit ( fruit in categories ontology gene Overrepresented genes. strawberry of annotation functional and mapping ontology Gene F. vesca metabolic process Catalytic Lipid metabolic activity Carbohydrate

F. vesca annotation for approximately 25,050 genes ( gene finder GeneMark-ES < 5 process Supplementary Table 5 transposable element library and external evidence for ab initio × Reproduction 10

gene annotation ( Molecular_function and –7 Biosynthetic Nucleotide binding process

models gene predictions, evidence for host gene deserts Binding Biological_process Photosynthesis Gene ontology metabolic process Metabolic process . thaliana A. Secondary membrane Plasma Supplementary Fig. Cellular_component Membrane 2 Catabolic 6 process was detected in 63% of hybridof63% indetected was 2 5 ). Predicted protein sequences were defined by unsupervised F. vesca rtis ( proteins Peroxisom Vacuole Supplementary Figs. Amino acidandderivative Cell abiotic stimulus Response to genome; hybrid gene metabolic process Response process Cellular External encapsulating to stress Extracellular e Cytoplasm Supplementary region

Supplementary effective as 90% structure

4 ). Parameters Intracellular Cell wall differentiation Thylakoid Cell

6a -

Transcription factor b binding similar to the pattern reported in in reported pattern the to similar is inserts shorter with identity sequence reduced of correlation The to to the nuclear genome (chloroplast nomads) ( plastid the from DNA transfer recent for evidence We observed also Malpighiales of outside genomes chloroplast plant land in found the of 78 encodes absence is the and Noteworthy tRNAs 30 rRNA and four genes. proteins, long bp 155,691 is genome chloroplast H4x4 The Chloroplast sequences consensus produce ( to rRNAs large of coverage length sufficient the was along there however, rRNAs; cytoplasmic of large copies full-length no found We sequences. RNA organellar eral assembly,wereunderrepresented the generally in recover we did sev alternative splicing of THIC mRNA RNAs somal pyrophosphate thiamin and the that controls riboswitch ( RNAs transfer RNAs, 569 nucleolar168 small RNAs, 76 micro RNAs and 24 other RNAs spliceosomal assembly: 111 fragments, (rRNA) the RNA ribosomal 177 in (tRNAs), genes RNA identified We RNA may genome be through accessed the browserstrawberry URLs). (see tools analytical and support transcriptional models, Gene coverage. sequence genome of completeness the confirm also findings These evidence. transcript-based by supported were models gene hybrid of DNA Supplementary Supplementary Table 8 Supplementary Table 7 Table Supplementary Protein metabolic Nucleic aci Nucleobase, nucleoside,nucleotideandnucleicaci activity binding regulator activity process

Transcriptio Oxygen genes binding VOLUME 43 | NUMBER 2 | FEBRUARY 2011 FEBRUARY | 2 NUMBER | 43 VOLUME atpF Nucleotid d Transporte binding group II intron. This absence has previously not been been not previously has absence This intron. II group activity

genome Biosynthetic Metabolic process n metabolic process Binding process Transcriptio e binding Protei r Molecular_functio Protein modificatio n process ). n ). These include the minor ATACspliceo minor the include These ). Catalytic activity Transferase Biological_process Kinase activity communicatio activity Gene ontology process Cellula Cell n n Sorghum differentiatio 2 r transductio Cellular_component 7 . . Although organellar sequences Cel Signal n membrane endogenous stimulu Plasm d Membrane l Vacuol abiotic stimulu Response to Response to death biotic stimulu 2 n Cel Response to n 9 a Deat . Supplementary Fig. Supplementary 7 Response l to stress e h 5

Nucleus Cytoplasm a × ) and root ( root ) and Nature Ge Nature 10

s s –2 s Cel l Intracellula < 5 b n ) × etics 10 –7 r 2 8 ). ). - - .

© 2011 Nature America, Inc. All rights reserved. ries defined for transport, signal transduction and structural molecules. F. vesca ( genes ~25,050 of tion Arabidopsis Annotation coverage in the genome strawberry is equivalent to that of Gene clusters. the find to species 21 of a total using done was analysis The species. four all among shared were that clusters gene 6,233 were there Additionally, strawberry. and rice to unique and strawberry to unique clusters gene 663 were There strawberry. to unique clusters gene 681 revealed species four the of Comparison clusters. 9,895 in aligned (from genes protein-coding 33,264 total the of genes 18,170 strawberry, of case the In species. four all among shared were species four those from genes 103,570 of rice, with grape, rice, among and 4 Figure Nature Ge Nature species. four all to common were clusters gene 6,233 although (269), strawberry and grape crops, fruit perennial two the by shared are as models, monocot and eudicot annotated well- the by shared were clusters of number Approximatelysame the genes with no known InterPro or GO assignments. category,strawberry-specific of which 128 are previously unidentified 2,151 genes overrepresented in the theof Similarly,root classification. transcriptome, ontology gene or InterPro133 known belong no to the unique clusters and 84 of arethese previously unidentified genes with overrepresented in the fruit transcriptome, 92 belong to the fruit development, and metabolism. Of the 1,753 genes gories, followed by kinase domains and enzymatic activities related to The most InterPro domains found belong to transcription factor cate genomes. sequenced other from proportions relative with consistent unidentified predicted genes of unknown function. These numbers are previouslyare 541 remaining The categories. ontology gene assigned represent 957 genes, of which 416 contain InterPro domains and were geneclusters 681 genestrawberry.clusterstounique 681 being These berry, 18,170 genes aligned in 9,895 of these gene family clusters, with ( families gene dicots ( three and (rice) monocot a from sequences gene 103,570 of total A Strawberry true indeed are coding sequences. regions conserved overlapping and CDSs the that predictions,providingthusgenefrom evidence sequences coding lap on translated nucleotide sequences; 87% of the conserved regions over This genome alignment is independent of gene predictions, as it isVitis based vinifera complete version of this table, link in the showssee URL section) that relatedavailable genomic sequences. closely most representplants the seven other The Methods). (Online Eight plant genomes were aligned, anchored by the genome of Multiple employed gene annotation ontology study.in methods this membrane, ribosome, cytoskeleton plastid, and chromosome mitochondria, might be due the to the enriched to localization cellular with Additional and proteingene to response counts metabolism freezing. whereas more were assigned for biological processes, such as transport, activity, catalytic to assigned was genes of number same the Roughly By comparison, 2,777 clusters were unique to the monocot. monocot. the to unique were clusters 2,777 comparison, By

ontology genome maintains more genes for ‘molecular function’ catego Arabidopsis

Venn diagram showing unique and shared gene families between between families gene shared and unique showing Venn diagram Arabidopsis

genome n , which has a genome of similar size. Preliminary annota Preliminary size. a genome ofhas similar , which and

unique etics Fig. Fig.

Arabidopsis Populus trichocarpa annotation

, grape and strawberry) clustered together in 15,969

alignment 4 , grape and strawberry genes revealed that a total a total that revealed genes strawberry and , grape

VOLUME 43 | NUMBER 2 | FEBRUARY 2011 FEBRUARY | 2 NUMBER | 43 VOLUME ). Of the 33,264 protein-coding genes in in genes protein-coding 33,264 the Of ). gene Arabidopsis Supplementary Fig. 5a Fig. Supplementary ab initio ab

clusters , whereas there were 262 gene clusters clusters gene 262 were there , whereas

to and strawberry. Comparative analysis analysis Comparative strawberry. and predictions; predictions;

F. vesca share the most genes with Supplementary TableSupplementary 9 Arabidopsis

as S

anchor upplementary , b ) suggested that the the that suggested ) and rice (260), (260), rice and

strawberry- T able 5 (for the F. vesca. F. vesca

) - - - - ­

Studies in model species such Studies in as species model An annotated genes. with genomes sequenced completely have and life of distribution tree the uniform across a represent that species non-plant and plant from sequences gene of analysis the of subset a represent data These ( whereas factors, Overall, rhythm. circadian and light and stress, biotic and abiotic hormones, and (lignins phenylpropanoids) secondary in metabolites response to and () primary of regulation or growth including responses, have factors plant in diverse implicated been regulating transcription for 1,856 ( tors the Within Analysis transduction elements is shared between strawberry and other plants. disease resistance in various species relatedproteins ( Table14 monic acid ( and tochromes) through the circadian oscillator ( passes genes controlling the sensing of light and (cryptochromes phy parallels that circuit molecular flowering 8 Fig. molecule O small the and synthases the acyltransferases, the including components, volatile of these in the production implicated been have families gene Several pathways. metabolic nylpropanoid phe and terpenoid acid, fatty the by produced mainly compounds ( disease to Note response and flowering production, 15 14, 13, ( genes structural many of paralogs and in characterized functionally been have processes these to aroma). central genes and 100 about However,only nutrition (flavor, quality fruit especially Rosaceae, to The system. translational agile vesca F. an in paradigms these testing for system parallel a represents strawberry diploid The . plant of Supplementary Fig. 10 - 39,207 genesinclusters 11,719 clusters 56,797 genes Rice methyltransferases ( methyltransferases

opportunity Supplementary Fig. 9 Supplementary 11 ). and aromas arise from the perception of volatile volatile of perception the from arise aromas and Flavors ). Supplementary Table 17 Table Supplementary ). Examination of the strawberry genome revealed an intact intact an revealed genome strawberry the of Examination ). ). Analysis of the the of Analysis ). sequence permits access to genomic information relevant relevant information genomic to access permits sequence ), nitric oxide ( oxidenitric ),

Arabidopsis Arabidopsis of and and

21 species clusters comprising 28,149 totalunique F. vesca F. Supplementary Supplementary Table 13 transcription 16 SupplementaryTable 16

) involved in key biological processes such as such processes biological in key ) involved for . vesca F. genome, we identified 1,616 transcription fac transcription 1,616 identified we genome,

has 303 303 has 2,777 using the same stringent BLAST-identity. stringent same the using translational Supplementary TableSupplementary 11 ) showed orthologs of many SupplementaryTable15 F.vesca has 187. Phylogeny of the R2R3 R2R3 the of Phylogeny 187. has ). Genes controlling the production of the jas production ). controlling Genes

260 factor 312 Arabidopsis MYB 23,163 10,384 27,025 Fragaria ), compared to 1,403 for grape and and grape for 1,403 to compared ), genome has revealed orthologs and and orthologs revealed has genome 963 A. A. thaliana 640 30,31 349

families

and and studies , indicating that a core set of signal Supplementary Tables 11, 12, 12, 11, Tables Supplementary 6,233 ), ), salicylic acid ( 262 187 ( MYB Supplementary Tables 10 Tables Supplementary ) have been associatedhavebeenwith ) 23,030 30,434 886 552 Grape 935 have defined basic tenets have basic defined Arabidopsis 9,477 Supplementary Table 12 -related transcription transcription -related 663 and 269 ) and pathogenesis-and ) s e l c i t r A 4 species at leastoneofthese 15,969 clustershad Arabidopsis Supplementary Supplementary Supplementary Supplementary Supplementary Supplementary 681 and encom and Strawberry 18,170 33,264 9,895

MYBs genes flavor flavor MYB 113 - - - - -

© 2011 Nature America, Inc. All rights reserved. the the seed as known also around was expansion homologous sequences ( the against pathway metabolic ing transcription factors implicated in regulating the phenylpropanoid proanthocyanins. and of flavonols production the to relevant sion for gene that expres example, underlie those function, with assigned per site. are at shown values The is scale nodes. acid substitutions amino Bootstrap tests. and by resampling topology supported was strongly studies, of placement The blue. are red in clade are colored and colored the species Malvidae in predominant and duplicates, were selected, or little no duplication exhibiting Genes of in 154 at on present genes of eight least alignments ten genomes. The tree is based outgroups. with two monocot genomes eudicot other 5 Figure s e l c i t r A 114 genes nuclear three and mitochondrial, six plastid, four as the mitochondrial gene Malvidae and Fabidae clades, rosid large two support strong with resolved consistently have genes plast power. overall more accumulated datasets larger the though genes, non-universal more included we as position per positions (20%) for 154 genes. This shows a 13,704 decline and in resolvinggenes power87 for (35%) positions 13,594 genes, 24 for (55%) positions 5,120 was minimum The of replicates. 1% than less to logy topo Fabidae the for support reduce to needed positions of number minimum the to determine for topology Fabidae the versus tree best the likelihoods per-site the bootstrapping used We choice. gene of except clades all jackknifing genewise (50% sets 87-gene and 154- the same topology. genes from yielded Resampling the AA positions) (9,480 species ten all in AA present of genes 24 subset and the (38,840 positions) species more or nine in present genes 87 of subset the a with Fabidae of monophyly the Malvidae rather than with earlier studies (discussed below) in placing phylogenetic analysis produced a tree ( likelihood Maximum positions. acid amino aligned 68,526 totaling species ten the of eight least at in present genes 154 yielded This uct. simple terminal duplications were resolved by a single selecting prod whereas genes, complexsome duplication patternsrejecting required homologs from newly sequenced genomes ( Note sure of phylogenetic coherence (Online Methods and groups, searching for genes that were single copy, and filtering selection began by with an a earlier genemea tree database be maycrops, tree fororganism model important an revealed that the currently surprisingly accepted genomes phylogenetic angiosperm placement other of with analysis Comparative Reinterpretation in duplicated been This synthesis. strawberry in tion ( one and expressed, were five silent, be to appeared two considered, was genes these of mapping read cDNA Illumina-sequenced when FvMyb33 BLAST analysis of BLAST analysis 20 Phylogenetic analyses of angiosperms based exclusively on based of chloro analyses angiosperms Phylogenetic test unbiased approximately The

), which yielded 240 genes from seven species. Upon species. genes fromaddition240 seven of yielded which ),

3 Maximum likelihood phylogeny relating relating phylogeny likelihood Maximum 2 . . There are at least six strawberry ; gene08694) was highly expressed, suggesting a key func key a suggesting expressed, highly was gene08694) ; Populus Glycine MYB123 Medicago-Lotus

of Brassica napus Brassica in Malvidae and not Fabidae, as and in in not found Fabidae, previous Malvidae

, were removed. Species in clade , the Species Fabidae were removed. angiosperm matR ), which controls proanthocyanidin levels in in levels proanthocyanidin controls which ), TT2 Arabidopsis Supplementary Supplementary Table 18 Fragaria (encoding (encoding 3 8 and 13 protein-coding genes F.vesca (98%), demonstrating robustness robustness demonstrating (98%), and and and legumes in Fabidae.

P 3 phylogeny R2R3 3 5 value of 3 × 10 × 3 of value 7 using the 154-gene set rejected rejected set 154-gene the using . In contrast, studies based on on based studies contrast, In . Fig. Lotus japonicus Lotus genome identified 25 highly highly 25 identified genome transparent Populus MYB Glycine TT2 5 ) ) that differed from most Fragaria -like -like sequences represent sequences 3 4 ). ). The greatest clade with 3 , of 9,000 homology 6 Carica ) fully supported supported fully ) to seven MYB

Supplementary Supplementary −38 incorrect. Gene Gene incorrect. MYB 32,33 Arabidopsis . Analysis of of Analysis . s. s. However, and T . 4 gene has has gene 3 0 9 esta Populus , placed placed , , , as well Lotus 2, 2, in ), - - - - - ­ ­ , or taxonomic sampling of Malvidae was limited to to limited was Malvidae of sampling taxonomic support or bootstrap likelihood maximum <50% either had ogy Populus the Malpighiales share a common ancestor with Malvidae and no no and Malvidae with ancestor common a share Malpighiales the with this result, there are at characters least floral seven for evidence placing of sources independent two now are there results, gene nuclear our Fabidae of non-monophyly the for support strong obtained genes of mitochondrial four analysis phylogenetic A recent Methods and any associated references are available in the online online the at http://www.nature.com/naturegenetics/. paper the of in version available are references associated any and Methods M or vt.edu/setubal/m or URLs. map a or genome. without physical reference technology using exclusively short-read and assembled be sequenced a that plant can genome illustrates also genome strawberry of ing the sequenc the of Completion system. agile this with addressed be can resistance, disease developmental controls and fruit flavor and quality important traits including or apple, Economically peach. , cherry crops with long generation times and/or spatial requirements, such as statured large or strawberry, cultivated as such genomes, polyploid cumbersome with crops fruit are high-value Its relatives nearest nial. for functional genomics research. Unlike useful it particularly make that traits are all which seed, to seed from formable, grows with a small footprint and has a short generation time gene studies within Rosaceae. Like genome other than for sequence genome The DISCUSSION of data advantages genomic copious the combining by clarified be will discrepancies ent ago years lion mil 83–108 orders rosid the of radiation rapid the for hypothesized a whole as genome the with odds at inheritance Chloroplasts, experience genes. can however, nuclear in orthologs recognizing in errors to lead can that loss and duplications gene frequent of Chloroplasts history the lack evolution. nuclear and chloroplast between ferences Malpighiales. including Fabidae, for characters derived shared egonstate.edu g ethods These apparently conflicting results may be due to biological dif biological to due be may results conflicting apparently These ; the complete version of ; version complete the 4 Strawberry genome browser, genome Strawberry VOLUME 43 | NUMBER 2 | FEBRUARY 2011 FEBRUARY | 2 NUMBER | 43 VOLUME 3 (and its order, Malpighiales) in Malvidae. However, this topol , especially when speciation events when arespeciation , especially compressed in time, as 3 7 100 . As more plant genomes are sequenced, these appar these are . sequenced, As genomes more plant / apG.htm ; Circos, Circos, ; 100 100 100 Arabidopsis Populus 100 4 4 0.1 with denser taxonomic sampling. taxonomic denser with Viti l http: ; ; HashMatch, s 100 F. vesca F. in Malvidae and not Fabidae. Consistent Supplementary Table 9 Supplementary 99 Populus //mkweb.bcgsc.ca/cir and represents a gateway to functional Carica Fragaria Arabidopsis 100 Glycine is the smallest sequenced plant plant sequenced smallest the is Lotus http://www.strawberrygenome. Medicago http://mock Arabidopsis Oryza Sorghum Arabidopsis , F. vesca

Nature Ge Nature 4 , lerlab-tools.cgrb. 1 c , , F. vesca Fabidae Malvidae is rapidly trans . Together with with Together . o http://staff. Arabidopsis s 4 / 2 . that suggest is peren n etics 38,39 vbi. 4 0 ------, .

© 2011 Nature America, Inc. All rights reserved. Genome Biology Genome InitiativeBiology (to T.C.M.); Oregon State University start-up fund Foundation #ARF4435 (to T.C.M.); the Oregon State Computational and Council (BBSRC) (to D.J.S. and E.L.G.); the Oregon State Agricultural Research East Malling Trust (EMT) and Biotechnology and Research Biological Sciences Project VA-135816 (to B.F.); Rutgers Busch Biomedical Funding (to T.P.M.); State Research, Education and (USDA/CSREES)Extension Service Hatch Associates; UnitedStrawberry States Department of /Cooperative Breeding Program;Strawberry The Province of Trento, (to R.V.); Driscoll’s (IFAS)Agricultural Sciences Dean for Research; the University of Bioinformatics Institute; the University of Florida Institute of Food and This work was supported by Roche and 454 Sequencing; the Virginia Note: Supplementary information is available on the and genomic data and SRA026350 contains contains 454 transcriptome reads. SRA020125 notation: 454-generated following genomic reads, SRA026313 contains the Illumina RNA-seq under archive read short- the to deposited been have reads Sequence AEMH00000000. project the under GenBank and DDBJ,EMBL at deposited codes. Accession Nature Ge Nature and derivative works must be licensed under the same or similar license. author and source are credited. This licence original does the not permit provided commercial exploitation, medium, any in reproduction and distribution, permits which Commercial-Share Alike licence ( Attribution-Non- Commons Creative the of terms the under distributed is article This reprintsandpermissions/. http://npg.nature.com/ at online available is information permissions and Reprints Published online at http://www.nature.com/naturegenetics/. The authors declare no competing interests.financial All authors read critically and approved the manuscript. H.S., L.M., K.M.F. Contributed to revisions: A. Aharoni, J.L.B., R. Velasco. T.C.M., P.J., A.C.A., V.S., K.M., J.C.S., H.S., L.M., A. Adato, H.W., S.S.K., Manuscript preparation: T.C.M., C.E., J.G.R., J.P.S., K.M., S.S.K., R.P.H., B.F., R.M. funding Provided and/or other support: T.M.D., W.S., A.L., P.J., H.W., J.L.B., R.E.V. Contributed tables, figures and other analyses: Analysis of gene families: Comparative genomics: Evolutionary analyses: Gene ontology and pathway analysis: R.P.H., N.B., J.P.S., S.F., A.C.A., K.P.W. Gene prediction and annotation: management): Computational resources (GBrowse, submission Blast, Genbank and data M.T., R. Velasco, T.M.D., B.L., T.-L.A., B.D.-R., A.M., P.A. Anchoring to scaffolds map: linkage R. Viola, T.C.M., H.D.P., D.W.B., R.P.H., A.L., S.F., T.P.M. processing and Sequence assembly: B.D., O.R.C., M.T., R. Velasco, J.D., S.A.F., T.P.M.,Library construction and sequencing:S.E.F., R.P.H., B.F., R.A.K.,W.W. Germplasm, DNA and RNA preparation: T.P.M.A.C.A., R.N.C., Project coordination: Project management: 21000-180-01R (to J.P.S.). Experiment Station Project NH00535 (to T.M.D.); and USDA/ARS CRIS #1275- (NRI) Plant Genome Grant 2008-35300-04411 and New Hampshire Agricultural Health (grant HG00783; to M.B.); USDA-CSREES National Research Initiative METACyt Initiative of Indiana University (to K.M.); US National Institutes (to ofP.J.); the Center for Genomics and Bioinformatics, supported in part by the COMPETING FINANCIAL INTERESTS FINANCIAL COMPETING AUTHOR CONTRIBUTIONS A cknowledgments n R.N.C., S.P.M.,R.N.C., S.A.G., H.D.P., L.J.W. etics This whole-genome shotgun project has been been has project shotgun whole-genome This K.M.F., V.S., R.E.V. O.F., T.C.M., D.J.S., T.M.D., J.P.S., N.B., T.-L.A., L.M., H.S.,

A.L., A.W.D., D.J.S. D.J.S., A.L., J.C.S., E.L.G., M.C., K.M.F. VOLUME 43 | NUMBER 2 | FEBRUARY 2011 FEBRUARY | 2 NUMBER | 43 VOLUME T.C.M., R.E.V., K.M., T.M.D., J.P.S., M.B., N.B., T.-L.A., K.M.F., R.E.V., T.M.D., T.-L.A., J.P.S., A.L., N.B., D.J.S., A.C.A., A. Adato, A. Aharoni. http://creativecommons.org/licenses/by-nc-sa/3.0 P.B., M.B., T.C.M., H.D.P., D.W.B., R.N.C., A.L.D., S.L.S., M.T., S.P.M., R. Velasco, D.J.S., J.-M.C., J.G.R., A.C., J.J.R.R., E.L.G., O.F., T.C.M., R.V.J., C.E., T.H., J.C., K.M., C.K., P.J., T.C.M., P.D., J.E., S.N. R.R., T.P.M., J.P.S., A.Z., D.Y.S., K.M.F., S.H.H. V.S., R.E.V., R. Velasco, R. Viola, K.M.F., H.S., L.M., T.C.M., D.J.S., B.L., N a t u r e

G

e n e t i c s website.

accession

/ ),

14. 13. 12. 11. 10. 9. 8. 7. 6. 5. 4. 3. 2. 1. 36. 35. 34. 27. 26. 25. 24. 23. 22. 21. 20. 19. 18. 17. 16. 15. 33. 32. 31. 30. 29. 28.

ie hoooe 1 ad 2 eunig osri. h sqec o rice Jaillon, O. of sequence The Consortia. Sequencing 12 and 11 Chromosomes Rice S. Kurtz, D. Potter, A. Cabrera, strawberry, Lim, diploid K.Y. Karyotype the and of ribosomal gene map mapping in linkage A H. Yu, & T.M. Davis, D.J. Sargent, Ruiz-Rojas, J.J. Oosumi, T., Ruiz-Rojas, J.J., Veilleux, R.E., Dickerman, A. & Shulaev, V. Implementing T.Oosumi, Alsheikh, M.K., Suso, H.P., Robson, M., Battey, N.H. & Wetten, A. Appropriate choice V. Shulaev, Darrow,G.M. Williams, K.P. selection. tree phylogenetic of test unbiased approximately An H. Shimodaira, prokaryotes. for resource phylogenomics a GeneTrees:A.W. Dickerman, & Y.Tian, A. Wachter, Hunter,S. Gene M. Borodovsky, & Y.O. Chernoff, V., Ter-Hovhannisyan, A., Lomsadze, O & A. Aharoni, K.M. Folta, Pontaroli, A.C. T.M. Davis, resource collective a Databases: Repeat Plant TIGR The C.R. Buell, & S. Ouyang, H.W.Mewes, J. Jurka, Baucom, R.S. P.S.Schnable, R. Velasco, ohd, K. Yoshida, Y.L.Wei, resistance and disease D.F.Engineering Klessig, & H. Silva, D.A., Dempsey, D.F. Klessig, Paterson, A.H. Daniell, H. hoooe 1 ad 2 rc i dsae eitne ee ad eet gene recent and genes resistance disease in rich duplications. 12, and 11 chromosomes Biol. (2007). 5–43 markers. of (COS) Set 103–106(2004). Hered. J. diploid the (2008). 120–127 in markers 103 of mapping in (2010). strawberry, diploid the in lines mutant insertional of sequences flanking T-DNA of analysis Rosaceae: in genetics reverse vesca vescaFragaria ofantibiotic and (2008). 985–1003 1966). USA, York, New York, New Winston, (2010). Biol. Res. Acids Nucleic (2008). Res. Res. algorithm. self-training by genomes eukaryotic novel in identification microarrays. (2002). DNA using maturation receptacle strawberry. vesca Biol. Plant BMC inplants. (2004). D360–D363 sequences repetitive of identification the for Res. Acids Nucleic Res. Genome Cytogenet. genome.maize B73 theretroelements inof Science Borkh.). phyla. angiosperm major in transcription factors comprising a multigene family. Rep. Biol. Mol. family encoding potential MYB regulatory proteins of proanthocyanidin biosynthesis. plants. in USA Sci. Acad. Natl. Proc. Nature intron.IIgroup a multiplelossesof of evolution the and genome chloroplast 3 alternative Fragariavesca

37 33 ) for functional genomics. functional for ) . 5 51

, R12 (2004). R12 , Plant Genome Plant 457

, D211–D215 (2009). D211–D215 , (2005). 6494–6506 , , 492–508 (2002). 492–508 , 326 Nat. Genet. Nat.

et al. et et al. et 88 t al. et TrendsMicrobiol. et al. et al. et et al. et al. et Plant Genome Plant et al. et , 551–556 (2009). 551–556 , et al. et et al. et t al. et t al. et t al. et t al. et ′ , 1112–1115 (2009). 1112–1115 , , 215–221 (1997). 215–221 , t al. et end processing of mRNAs. mRNAs. of processing end et al. et The Strawberry: History,Strawberry: The Physiology and Breeding et al. et BMC Biol. BMC et al. et al. and t al. et Molecular cloning of cloning Molecular et al. et al. Versatile and open software for comparing large genomes. large comparing for software open and Versatile et al.

The grapevine genome sequence suggests ancestral hexaploidization et al. et The complete nucleotide sequence of the cassava ( ebs udt, dtbs o ekroi rpttv elements. repetitive eukaryotic of database a update, Repbase Phylogeny and classification of Rosaceae. of classification and Phylogeny InterPro: the integrative protein signature database. signature protein integrative the InterPro: 34

′ L. Agrobacterium High-efficiency transformation of the diploid strawberry ( strawberry diploid the of transformation High-efficiency onl, .. ee xrsin nlss f tabry cee and strawberry of analysis expression Gene A.P. Connell, Development and bin mapping of a Rosaceae Conserved Ortholog Conserved Rosaceae a of mapping bin and Development 10 Riboswitch control of gene expression in plants by splicing and and splicing by plants in expression gene of control Riboswitch ucinl ifrnito of differentiation Functional n xmnto o tree gn nihohos n strawberry. in neighborhoods gene targeted of examination An utpe oes o Rsca genomics. Rosaceae for models Multiple h gnm o te oetctd pl ( apple domesticated the of genome The Phylogeny of Gammaproteobacteria. MIPS: analysis and annotation of proteins from whole genomes. whole from proteins of annotation and analysis MIPS: Exceptional diversity, non-random distribution, and rapid evolution The development of a bin mapping population and the selective the and population mapping bin a of development The .rncit conig rm ies tsus f cultivated a of tissues diverse from accounting A.transcript , 105–120 (2007). 105–120 ,

F.v.semperflorens Gene content and distribution in the nuclear genome of The 35 (2010). 81 , 32 irc xd ad aiyi ai sgaig n ln defense. plant in signaling acid salicylic and oxide Nitric SNP discovery and genetic mapping of T-DNA insertional mutants Theor.Genet.Appl.

The B73 maize genome: complexity, diversity, and dynamics. complexity,diversity,and genome: maize B73 The

42 2 , D328–D331 (2007). D328–D331 , (2004). D41–D44 , BMC Genomics BMC , 93–101 (2009). 93–101 ,

, 833–839 (2010). 833–839 ,

3 Sorghum bicolor 110

, 20 (2005). 20 , 3

, 90–105 (2010). 90–105 , 97 Nature 6 , 462–467 (2005). 462–467 , , 54–61 (1998). 54–61 , strain improves transformation ofantibiotic-sensitive , 8849–8855 (2000). 8849–8855 , Planta

Theor.Genet.Appl. 449 Brassica napus TRANSPARENTTESTA2 napus Brassica .

Plant Cell Rep. Cell Plant

10 223 Fragaria vesca Fragaria , 463–467 (2007). 463–467 , 121 genome and the diversification of grasses. , 562 (2009). 562 , Plant Cell Plant atpF , 1219–1230 (2006). 1219–1230 , Fragaria , 449–463(2010)., PLoS Genet.PLoS n apgils RA dtn and editing RNA Malpighiales: in ou japonicus Lotus Fragaria vesca . x. Bot. Exp. J. Plant Cell Physiol. eeec map. reference

19

J. Bacteriol. L. 20

116 , 3437–3450 (2007). 3437–3450 ,

, 1173–1180 (2002).1173–1180 , 5 Physiol. Plant. Physiol. uli Ais Res. Acids Nucleic , e1000732 (2009).e1000732, s e l c i t r A , 723–737(2008)., Plant Syst. Evol. Syst. Plant . (Holt, Rinehart and Rinehart (Holt, . ln Physiol. Plant Malus L.

Ts R2R3-MYB TT2s, Manihot esculenta

53 192 Acta Hortic. rgra vesca Fragaria 2073–2087 ,

× Nucleic Acids Nucleic Nucleic Acids Nucleic 49 , 2305–2314 Genome , 157–169

domestica 140 Genome Fragaria Fragaria

, 1–9 , gene

115 266 147 Syst.

649

32 51 ) , , , . , ,

© 2011 Nature America, Inc. All rights reserved. Bellville, Bellville, South Africa. New Jersey, USA. 16 USA. Henry Wallace Beltsville Agricultural Research Center, Beltsville, Maryland, USA. Sciences, University of New Hampshire, Durham, New Hampshire, USA. Virginia, USA. Indiana Bioinformatics, University, Bloomington, Indiana, USA. Champaign, Illinois, USA. University, Corvallis, Oregon, USA. Food Research Limited (Plant and Food Research), Mt. Albert Research Centre, Auckland, New Zealand. 1 40. 39. 38. 37. s e l c i t r A 116 de Recerca i Tecnologia (IRTA),Agroalimentàries Cabrils, Barcelona, . Illinois, Urbana, Illinois, USA. Africa. of Oregon Horticulture, State University, Corvallis, Oregon, USA. Biological Sciences, University of Pittsburgh, Pittsburgh, USA. Pennsylvania, 27 Facultad de Ciencias Biológicas, Universidad Andres Bello, Santiago, . Biotechnology and Faculty of Agronomy, University of Chile, Santiago, Chile. Florida, USA. Renewable Resources, Institute for Advanced Learning and Research, Danville, Virginia, USA. Science Science and Engineering, Georgia Tech, Atlanta, Georgia, USA. should Correspondence be addressed to K.M.F. ([email protected]). 37 Department Department of Biological Sciences, University of North Texas, Denton, Texas, USA. Istituto Istituto Agrario San Michele all’Adige (IASMA), Research and Innovation Centre, Foundation Edmund Mach, San Michele all’Adige, Trento, Italy. School of Biological Sciences, University of Auckland, Auckland, New Zealand. Institut Institut National de la Recherche Agronomique (INRA)-Unité de Recherche des Espèces Fruitières (UREF), Villenave d’Ornon, .

udc, .. Dvs CC Mlihae pyoeeis giig rud n one on ground gaining phylogenetics: Malpighiales C.C. Davis, & K.J. Wurdack, Duarte, J.M. X.Y. Zhu, Wang,H. f h ms rclirn cae i te nisem re f life. of tree Angiosperm (2009). 1551–1570 the in clades recalcitrant most the of levels. Vitis Populus, rosids. in relationships USA Sci. Acad. Natl. Proc.

14 32 Roche Roche Diagnostics, Roche Applied Science, Indiana, Indianapolis, USA. Biotechnology of Biotechnology Natural Products, Technical University München, Germany. BMC Evol. Biol. Evol. BMC et al. et t al. et 23 10 et al. The The Graduate Program for Plant Molecular and Cellular Biology, University of Florida, Gainesville, Florida, USA. Joint Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Atlanta, Georgia, USA. and Rosid radiation and the rapid rise of Angiosperm-dominated forests. Angiosperm-dominated of rise rapid the and radiation Rosid 18 Mitochondrial Department Department of Computer Science, Virginia Tech, Blacksburg, Virginia USA. Identification of shared single copy nuclear genes in Oryza 20 Department Department of Virginia Horticulture, Polytechnic Institute and State University, Blacksburg, Virginia, USA.

BMC Evol. Biol. Evol. BMC 10 7 n ter hlgntc tlt ars vros taxonomic various across utility phylogenetic their and Center Center for and Bioinformatics Biology,Computational University of Maryland, College Park, Maryland, USA. , 61 (2010). 61 ,

106 34 matR Norwegian Norwegian Institute for Agricultural and Research, Environmental Genetics and Biotechnology, Kvithamar, Stjordal, Norway. , 3853–3858 (2009). 3853–3858 , 5 Center Center for Genome Research and Biocomputing (CGRB), Oregon State University, Corvallis, Oregon, USA. eune hl t rsle ep phylogenetic deep resolve to help sequences

7 , 217 (2007). 217 , 9 31 Virginia Bioinformatics Virginia Institute, Bioinformatics Virginia Polytechnic Institute and State University, Blacksburg, m J Bot. J. Am. South South African National Institute, Bioinformatics University of the Western Cape, Bellville, South Arabidopsis, 12 United Department of Agriculture (USDA), Agricultural Research Service (ARS), 36 25 26

29 Centre Centre de Recerca en Agrigenòmica (CSIC-IRTA-UAB), Cabrils, Barcelona, Spain. 96 Millennium Millennium Nucleus in Plant Cell and Biotechnology Centro de Vegetal,Biotecnología Department Department of Plant Sciences, Weizmann Institute of Science, Rehovot, Israel. Department Department of Genetics, University of Georgia, Athens, Georgia, USA. 15 ,

13 17 Department Department of Biological Sciences, Virginia Tech, Blacksburg, Virginia, USA. (USDA), (USDA), ARS, National Clonal Germplasm Repository, Corvallis, Oregon, Waksman Institute of Microbiology, Rutgers, The State University of New Jersey, 2 44. 43. 42. 41. East East Malling Research, Kent, UK. 33

Department Department of Natural Resources and Sciences, Environmental University of oa, . Wlim, .. Kn, . Crol SB Gnm-cl approaches Genome-scale S.B. Carroll, & N. King, B.L., Williams, A., Rokas, Cyto-nuclear B. Khadari, & S. Santoni, C., Grout, F., Kjellberg, J.P., Renoult, Endress, P.K. & Matthews, M.L. First steps towards a floral structural characterization Qui, Y.-L. o eovn icnrec i mlclr phylogenies. molecular in (2003). incongruence resolving to associations. pollinator of phylogeny the in discordance subclades. rosid major the of genes. VOLUME 43 | NUMBER 2 | FEBRUARY 2011 FEBRUARY | 2 NUMBER | 43 VOLUME 22 19 J. Syst. Evol. Syst. J. Horticultural Sciences Horticultural Department, University of Florida, Gainesville, Department Department of Biotechnology, University of the Western Cape, et al. Angiosperm phylogeny inferred from sequences of four mitochondrial 4 Department Department of and Plant Pathology, Oregon State

48 BMC Evol. Biol. Evol. BMC , 391–425 (2010). 391–425 , Plant Syst. Evol. Syst. Plant 3 Ficus The The New Zealand Institute for Plant and section

9 24 , 248 (2009). 248 , Millennium Nucleus in Plant Cell 11 Department Department of Biological 21 Galoglychia

260 Institute Institute for Sustainable and 8 The The Center for Genomics and 38 , 223–251 (2006). 223–251 , School School of Computational 6

Chromatin Chromatin Inc., and host shifts in plant- in shifts host and Nature Nature Ge Nature 28 Department Department of 30

425 Department Department 798–804 , 35 n Institut Institut etics

© 2011 Nature America, Inc. All rights reserved. Bowtie contigs. of set initial an in puter, resulting com on a single processes multiple to run for parameters except parameters, default with processors) 32 and memory Gb large-memory (128 computer single a multiprocessor on run then was pipeline CA The them. exclude to were modified in input ranges files clear NUCmer, using found and fragment were mates, detecting after reads, in remaining sequences Linker format. file removed. pairs redundant with contigs, assembly to mapped uniquely be could end each where of pairs number the represent columns right the reads, For or calls adaptor SOLiD base sequence. ambiguous either containing reads removing after remained what represent columns right the in values Illumina reads, For pairs. mate and reads duplicate removing and bp) (64 CA for threshold length minimum the below those removed having reads, trimming 454 reads, values in the rightmost three columns represent what remained after assembly primary engine. Input data the are in summarized as 5.3 version (CA) Assembler Celera using accomplished was assembly. and sequencing Genome ONLINE doi:10.1038/ng.740 fruit from specifically isolated RNA from 454, using sequenced were tation anno gene for used Transcriptreads SRA026313.1. accession under Archive Supersplat and URLs) (see tocols. Reads were filtered and mapped to the enriched cDNA using the SMART method described treatments, previously as pharmacological and regulators growth light, stress, including (accession PI551572) organs subjected to a wide array of treatment conditions, Transcriptome analysis. the in presented are and 0.25 for segment overlaps. Additional details and analyzed species specific overlaps each sequence to total for 0.5 of cutoff enough a were used parameters similar Other other. were they if grouped were homologs where 0.5, of a cutoff bit cutoff of score of 0.05 confidence and group 50, in-paralog overlap set of rules adopted from Inparanoid. The parameters used were the defaults of remaining pair of Overlapping clusters were orthologs. potential resolved by a other. each to than Additionally, orthologs based on sequence in-paralogs were clustered together out-group with each an to higher scored B A sequence sequence or either if eliminated were pairs sequence A-B The orthologs. of loss selective of cases detect to used were species out-group from Sequences detected. were hits best mutual with pairs Sequence B). and A example, (for pairs species given two any from files sequence using run was search BLAST (reciprocal) all-versus-all the First, applied. was in-paralogs for adding rithm match, an after which algo with a clusters were pairwise best two-way seeded Ortholog events. duplication by arise that genes paralogous and genes ogous Gene homology analysis. the in presented are processes these of details The available. is sequence transcriptome when projects annotation genome to application for designed was tool software GeneMark-ES+ The genome. GeneMark-ES prediction. Gene described ously visualization. Chromosome in given is contig was Velvetsubstituted in the the contig. A in summary of the final run assembly statistics the of length the discrepancy, run-length homopolymer a indicated alignment the but contig, CA a within region a to unambiguously mapped Velvet contig a where addition, In contigs. two the between close gap the to used was sequence contig Velvet corresponding the gap, scaffold a flanking sequences contig to unambiguously mapped Velvet contig a Where NUCmer using scaffolds existing to mapped were Velvet contigs resulting 0.7.31 version Velvet with done assem was reads separate A Illumina/Solexa the scaffolds. of of bly set a create to run was step scaffolding the The paired-end SOLiD reads were then mapped to these contigs using using contigs these to mapped then were reads SOLiD paired-end The to format CA.frg file .sff native the from input were converted 454 data All 4 5 to create additional mate pairs that were added to CA inputs, and and inputs, CA to added were that pairs mate additional create to

Supplementary Table 2 Supplementary METHODS 4 9 4 to generate a set of of set a generate to 8 . Details of this method are available from T.M. from Davis. available are method this of Details . We used a new version of the self-training algorithm algorithm self-training the of version new a used We Supplementary Note Supplementary 5 Total RNA was isolated from mature 1 . Normalized pools were converted to full-length full-length to converted were pools Normalized . The Inparanoid algorithm 5 3 . Data are available in the NCBI Sequence Read Read Sequence NCBI the in available are Data . The FISH methodology The was FISH as performed methodology previ . ab ab initio The assembly of the strawberry genome genome strawberry of the assembly The 5 2 . and sequenced using Illumina pro F. vesca gene predictions for the the for predictions gene Supplementary Note Supplementary Supplementary TableSupplementary 1 1 genome using HashMatch 50 was used to find orthol F. vesca Hawaii-4 . F. vesca 4 6 . The The . . . For 4 7 ------.

such as pathogenesis-related (PR) proteins and biosynthesis and/or response response and/or biosynthesis and proteins (PR) pathogenesis-related as such genes. resistance disease of Identification the in presented are details Additional analyzed were sequences Peptide Interproscan through genes. 25,051 to assignments ontology annotation. ontology Gene the in presented are Details hits. unique quantifying The self-matches. contig whole excluding MUMmer by matches measured first ( level nucleotide the at of the in the duplications of large Analysis Note Supplementary the in presented are test the of parameters additional and used genomes on Details genome. anchor the in regions matched between overlaps on based the the to compared alignment. genome Multiple the in as described were determined categories ontology gene Overrepresented SRA026350.1. accession under Archive Read Sequence NCBI the in available are Data initials). root crown and roots tips, stages from buds (developmental initial to and overripe) roots root (including 46. 45. at available are flow work detailed a and analyses tree from files output ranges, subsequence with topology’ ‘Fabidae accepted widely the topology’, against of poplar, ‘poplar-Brassicales the our placement to evaluate test unbiased mately The approxi rate variation. and GTR gamma-distributed using RAxML with analyzed and characters 68,526 of matrix a to concatenated were alignments Gblocks with filtered were regions aligned poorly and (MUSCLE) realigned were families These genomes. two most at from members missing ( of paralogs (ref. 2.3 RAxML with MUSCLE aligned tein sequences were trimmed to eliminate non-conserved terminal regions and the using genomes GeneTrees database genes. angiosperm ten from protein-coding retrieved were 154 genes on single-copy based phylogeny Angiosperm than less but (1,403) than more is which genes, factor transcription 1,616 revealed approach 2.0.11 version ClustalW using performed were align ments sequence Protein factors. transcription strawberry putative exclude) (ref. 4.4 version InterproScan by detected by parsing by BLASTx tabular outputs using a PERL script. Protein factors family motifs, transcription strawberry BLASTx putative was classify and approach (RBH) identify Hit to Best used Reciprocal A repository. Databases Factor from the Peking were tion University factors downloaded Plant Transcription identification. factor Transcription presented. are identity >50% and the from family.the Rosaceae from Sequences this search were used to interrogate fied using a keyword search in the NCBI database, with priority given to results were acid and acid identi nitric salicylic acid, jasmonic with associated genes Sources for the protein or nucleotide datasets, individual sequence

Zerbino, D.R. & Birney, E. Velvet: algorithms for Birney,Velvet:algorithms & E. D.R. Zerbino, Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient de Bruijn graphs. Bruijn de (2009). alignment of short DNA sequences to the human genome. human the to sequences DNA short of alignment F. vesca F.vesca

second used BOWTIE to match Illumina reads against contigs and for for and contigs against reads Illumina match to BOWTIE used second F. vesca 5 8 sn a hehl ct f o 1 × 10 × 10 of off cut threshold a using genome as the first member of the pair. Results were then merged merged were then of pair. the member Results first as the genome genome using BLASTx. Hits with an an with Hits BLASTx. using genome Supplementary Note Supplementary genome was made against itself using MUMmer using itself against made was genome 64). Screening for phylogenetic coherence and the presence presence the and coherence phylogenetic for Screening 64). F.vesca Genome Res. Genome 6 http://staff.vbi. 35 1 Arabidopsis and subsequently available genomes. Homologous genomes. available pro and subsequently . 5 , as implemented in the program CONSEL program the in implemented , as 62,63 4 , followed by SignalP by followed , nuc sequence using MUMmer3, option promer, using using promer, option MUMmer3, using sequence , , and for were trees inferred each gene using family A functional annotation pipeline provided gene gene provided pipeline annotation functional A mer). Two approaches were implemented. The The implemented. were Twoapproaches mer).

Genomes from other species were pairwise pairwise were species other from Genomes 18 (1,856) ( (1,856) ) resulted in 154 orthologous gene families families gene orthologous 154 in ) resulted , 821–829 (2008). 821–829 , vt.edu/allan/StrawberryPhylogenomic Fragaria vesca Fragaria Protein sequences for plant transcrip plant for sequences Protein 37 Supplementary Note Supplementary . Supplementary Table 17 Supplementary 59), were used to confirm (but not not (but confirm to used were 59), Putative disease resistance genes genes resistance disease Putative 5 5 de novo de Supplementary Note Supplementary , Predotar , e −20 value <10 value Supplementary Note Supplementary genome. genome. . RBHs were extracted extracted were RBHs . short read assembly using assembly read short 60 Nature Ge Nature Genome Biol. Genome 5 . This conservative conservative This . 6 and TMHMM and . 7 , , 4 A comparison A comparison 7 a version 3.22 3.22 version score > 150 150 > score 6 6 6

, was used used , was identifiers 5 ). . Filtered Filtered .

Putative Putative . n 10 etics , R25 , Vitis s 5 . . 7 - - - - - .

© 2011 Nature America, Inc. All rights reserved. 56. 55. 54. 53. 52. 51. 50. 49. 48. 47. Nature Ge Nature

Small, I., Peeters, N., Legeai, F. & Lurin, C. Predotar: A tool for rapidly screening rapidly for tool A Predotar: C. Lurin, F.& Legeai, N., Peeters, I., Small, the in proteinsLocating H. Nielsen, & G. Heijne, von S., Brunak, O.,Emanuelsson, Hunter,S. Supersplat- T.C. Mockler, & W.K. Wong, H.D., Priest, R., Shen, Jr., D.W. Bryant, Zhu, Y.Y., Machleder, E.M., Chenchik, A., Li, R. & Siebert, P.D. Reverse transcriptase K.M. Folta, G. Ostlund, Gene M. Borodovsky, & Y.O. Chernoff, V., Ter-Hovhannisyan, A., Lomsadze, Liu, B. S. Kurtz, proteomes for N-terminal targeting sequences. targeting N-terminal for proteomes tools. relatedTargetP, andusing SignalPcell Res. alignment. RNA-seq spliced Biotechniques construction. library cDNA full-length for approach SMART a switching: template strawberry. analysis. Res. algorithm. self-training by genomes eukaryotic novel in identification and (2006). hybridization situ in Biol.

37 33 5 , R12 (2004). R12 , et al. , 6494–6506 (2005). 6494–6506 , , D211–D215 (2009). D211–D215 , Nucleic Acids Res. Acids Nucleic et al. et n et al. et Plant Genome Plant Molecular cytogenetic analysis of four etics et al. et t al. et

Versatile and open software for comparing large genomes. large comparing for software open and Versatile 30 InterPro: the integrative protein signature database. signature protein integrative the InterPro: InParanoid 7: new algorithms and tools for eukaryotic orthology eukaryotic for tools and algorithms new 7: InParanoid , 892–897 (2001). 892–897 , .rncit conig rm ies tsus f cultivated a of tissues diverse from accounting A.transcript

3 , 90–105 (2010). 90–105 ,

dapi Bioinformatics 38 , D196–D203 (2010). D196–D203 , banding.

26 n. . ln Sci. Plant J. Int. Nat. Protoc. Nat. , 1500–1505 (2010). 1500–1505 , Larix species by bicolor fluorescence

2 4 , 953–971 (2007). 953–971 , , 1581–1590 (2004). 1581–1590 ,

167 Nucleic Acids Nucleic Nucleic Acids Nucleic 367–372 , Genome

66. 65. 64. 63. 62. 61. 60. 59. 58. 57.

Shimodaira, H. & Hasegawa, M. CONSEL: for assessing the confidence of confidence the assessing for CONSEL: M. Hasegawa, & H. Shimodaira, Talavera, G. & Castresana, J. Improvement of phylogenies after removing divergent removing after phylogenies of Improvement J. Castresana, & Talavera,G. maximum for program fast a RAxML-III: H. Meier, & T. Ludwig, A., Stamatakis, high and accuracy high with alignment sequence multiple MUSCLE: R.C. Edgar, time reduced with method alignment sequence multiple a MUSCLE: R.C. Edgar, prokaryotes. for resource phylogenomics a GeneTrees:A.W. Dickerman, & Y.Tian, the for platform M.A. integration Larkin, InterProScan-an R. Apweiler, & E.M. Zdobnov, Predicting E.L.L. Sonnhammer, Altschul, S.F., & Gish, W., Miller, W., Myers, G. E.W. & Lipman, D.J. Basic local alignment Heijne, von B., Larsson, A., Krogh, phylogenetic tree selection. tree phylogenetic (2007). 564–577 alignments. sequence protein from blocks aligned ambiguously and (2005). trees. phylogenetic large of inference likelihood-based throughput. complexity. space and Res. Acids Nucleic (2007). 2947–2948 signature-recognition methods in InterPro. InterPro. in methods signature-recognition tool. search to application model: Markov hidden a genomes. complete with topology protein transmembrane Nucleic Acids Res. Acids Nucleic J. Mol. Biol. Mol. J. t al. et

35 J. Mol. Biol. Mol. J. lsa W n Cutl vrin 2.0. version X Clustal and W Clustal , D328–D331 (2007). D328–D331 , BMC Bioinformatics BMC

215 Bioinformatics , 403–410 (1990). 403–410 ,

32

305 , 1792–1797 (2004). 1792–1797 , , 567–580 (2001). 567–580 ,

17 5 Bioinformatics , 1–19 (2004). 1–19 , , 1246–1247 (2001). 1246–1247 , Bioinformatics

17 doi:10.1038/ng.740 , 847–848 (2001). 847–848 , Bioinformatics Syst. Biol. Syst.

21 , 456–463 ,

56 23 , ,