<<

Straub et 01. BMC Genomics 201 1, 12:21 1 http://www.biomedcentral.comI1471-21641 12/21 1 �Genomics

RESEARCH ARTICLE Open Access

Building a model: developing genomic resources for common milkweed ( syriaca) with low coverage genome sequencing

2 3 Shannon CK Straub 1*, Mark Fishbein , Tatyana Livshultz , Zachary Foster \ Matthew Parks \ Kevin Weitemier 1, Richard C Cronn4 and Aaron Liston1

Abstract

Background: Milkweeds (Asclepias L.) have been extensively investigated in diverse areas of evolutionary biology and ecology; however, there are few genetic resources available to facilitate and compliment these studies. This study explored how low coverage genome sequencing of the common milkweed (Asclepias syriaca L.) could be useful in characterizing the genome of a without prior genomic information and for development of genomic resources as a step toward further developing A. syriaca as a model in ecology and evolution.

Results: A O.5x genome of A. syriaca was produced using IIlumina sequencing. A virtually complete chloroplast genome of 158,598 bp was assembled, revealing few repeats and loss of three genes: accD, clpP, and ycfl. A nearly complete rONA cistron (185-5.85-265; 7,541 bp) and 55 rONA (120 bp) sequence were obtained. Assessment of polymorphism revealed that the rONA cistron and 55 rONA had 0.3% and 26.7% polymorphic sites, respectively. A partial mitochondrial genome sequence (130,764 bp), with identical gene content to tobacco, was also assembled. An initial characterization of repeat content indicated that Ty1/copia-like retroelements are the most common repeat type in the milkweed genome. At least one A. syriaca microread hit 88% of Catharanthus roseus () unigenes (median coverage of 0.29x) and 66% of single copy orthologs (C0511) in (median coverage of 0.14x). From this partial characterization of the A. syriaca genome, markers for population genetics (microsatellites) and phylogenetics (low-copy nuclear genes) studies were developed.

Conclusions: The results highlight the promise of next generation sequencing for development of genomic resources for any organism. Low coverage genome sequencing allows characterization of the high copy fraction of the genome and exploration of the low copy fraction of the genome, which facilitate the development of molecular tools for further study of a target species and its relatives. This study represents a first step in the development of a community resource for further study of plant-insect co-evolution, anti-herbivore defense, floral developmental genetics, reproductive biology, chemical evolution, population genetics, and comparative genomics using milkweeds, and A. syriaca in particular, as ecological and evolutionary models.

Background Repeat types can be characterized to assess the presence The advent of next generation sequencing technology and prevalence of retrotransposons, DNA transposons, has presented the opportunity to obtain genome-scale and other simple and low-complexity repeats [e.g., data for any organism [1-3). Detailed information about [4,7,8)). Nearly complete sequences of organellar gen­ the characteristics of the genome of a non-model organ­ omes are also readily obtained [e.g., [4,9-14)). If genomic ism, especially the high-copy fraction, can be obtained resources are available for a close relative, a portion of even with very low overall coverage of a genome [4-6). the gene space of the nuclear genome might also be sur­ veyed [12). In contrast to the above studies, next genera­ * Correspondence: [email protected] tion sequencing in non-model organisms has thus far 1 Department of Botany and Plant Pathology, Oregon State University, 2082 primarily been used for characterization of transcrip­ Cordley Hall, Corvallis, Oregon 9733 , USA 1 Full list of author information is available at the end of the article tomes [3) and as a tool for marker development in

© 2011 Straub et al: licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons o BioMeci Central Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Straub et 01. BMC Genomics 2011, 12:211 Page 2 of 22 http://www.biomedcentral.com!1471-2164!12!211

population and conservation genetics [e.g., [15,16]), phy­ Illinois and frozen at -20'C. Samples of approximately logenetics [10,17]' and other areas of evolutionary biol­ 3x3 mm green leaf tissue were chopped in 150 III Partec ogy [e.g., [18]]. CysStain UV Precise P Nuclei extraction buffer (made As the most intensively studied member of a 1.0% w/v PVP-40; Partec North America, Inc., Swedes­ that has been investigated extensively in diverse areas of boro, NY, USA) on ice in the presence of (-8x8 mm) B-

evolutionary biology and ecology [e.g., [19-24]], the 73 maize (lC = 2.725 pg) leaf tissue as internal refer­ common milkweed (Asclepias syriaca L., Apocynaceae) ence. Upon filtering and addition of 600 III CysStain UV is an attractive non-model target for the development of Precise P Staining Buffer, samples were run on a Partec genomic resources. Interest in the genus has been P A flow cytometer until at least 1000 counts were focused on complex and specialized floral structures, recorded for both the sample and reference peaks. The particularly as they relate to insect pollination [21], and estimated DNA content values for the five individuals on the remarkable chemical ecology exemplified by car­ were averaged and the genome size calculated using a denolide-sequestering specialist herbivores, such as the standard conversion factor [30]. (Danaus plexippus), that utilize Ascle­ pias defense compounds to defend themselves against lIIumina library preparation and sequencing their own predators [25]. In the context of floral biology Leaf tissue was sampled from a single individual from a and plant-insect coevolution, Asclepias has the potential natural population of A. syriaca in Ogle County, Illinois to add substantially to existing ecological and evolution­ [Fishbein 4885 (OKLA)]. Total genomic DNA was ary model systems (e.g., Arabidopsis, Mimulus, Aquile­ extracted using the Wizard® Genomic DNA Purification gia), although genomic resources for Asclepias are kit (Promega, Madison, WI, USA) following homogeni­ currently far more limiting than for these model sys­ zation of 100 mg of silica-dried tissue flash-frozen in tems. Genomic resources have been developed for only liquid N2 and ground with mortar and pestle. A BioRup­ two other species in the same order () as tor Sonicator (Diagenode Inc., Denville, NY, USA) was Asclepias: coffee (Coffea; http://www.coffeegenome.org) used to shear 2.5 Ilg of the extracted DNA using 15 one and Madagascar periwinkle (Catharanthus roseus: Apoc­ min cycles (30 sec on/30 sec off, setting 'high', 4'C). ynaceae), for which there are expressed sequence tag Subsequent library preparation followed the standard (EST) collections [26-28]. multi-step Illumina protocol [31]. Ligated, size-selected The aims of this study were to demonstrate how a fragments (-280 - 320 bp) were amplified through 18 small amount of Illumina short read sequence data cycles of polymerase chain reaction (PCR), using Phu­ could be useful for whole genome characterization of a sion High-Fidelity PCR Master Mix (New England Bio­ plant species with no prior genomic information and to labs, Ipswich, MA, USA) and standard Illumina primers. develop genomic resources for milkweeds. To these Resulting product concentrations were determined using ends, the genome of Asclepias syriaca was characterized a Nanodrop 1000 (Thermo Fisher Scientific, Wilmington, using O.5x sequence coverage obtained from a single DE, USA). lane of 40 bp Illumina reads. Reference-guided assembly Enriched, cleaned product was diluted to 5 pM and was used to assemble and characterize the high copy submitted for sequencing on an Illumina GAIl sequen­ fraction of the A. syriaca genome, including the com­ cer at the Center for Genome Research and Biocomput­ plete chloroplast and partial mitochondrial genomes and ing (CGRB) at Oregon State University (Corvallis, OR, a nearly complete nuclear ribosomal DNA cistron and USA). Denaturation, cluster generation, and subsequent 5S rDNA sequence. De novo assemblies were used to sequencing followed the manufacturer's recommenda­ characterize the repeat structure of the genome and to tions. Sequencing was completed over 40 cycles on a develop nuclear markers consisting of low-copy nuclear single lane of the Illumina GAIl. Image analysis, base­ loci for phylogenetics and micro satellite loci for popula­ calling and error estimation were performed using the tion genetics. This study highlights the extent to which Illumina GA Pipeline version 1.5. Images were collected a non-model plant genome can be characterized with from 120 tiles, averaging over 107,000 clusters/tile, and low coverage genome sequencing and represents the resulted in over 420 Mbp of sequence data after quality first step in developing the resources needed to develop filtering. Asclepias as a new genetic model system in ecological and evolutionary studies [29]. Reference-guided assembly Reference-guided assembly of microreads was facilitated Methods using a pipeline of five open source scripts called "align­ Genome size estimation reads." The pipeline incorporates: YASRA [32], a short Leaves were collected from five A. syriaca individuals read assembler; NUCMER, an aligner from the MUM­ from an experimental agricultural population in western mer 3.0 suite [33,34]; delta-filter, a utility program from Straub et ai. BMC Genomics 2011. 12:211 Page 3 of 22 http://www.biomedcentraLcom/1471-2164i12/211

the MUMmer 3.0 suite that removes excessive/erro­ these assemblies and the original consensus sequences neous alignments; "sumqual," a python script created to were resolved as either intraspecific variation, assembly integrate the NUCMER alignment information with the or alignment errors in the 40 bp read data set, or unre­ YASRA assembly information; and "qualtofa," another solved assembly errors. Informatically unresolved errors python script that reformats and filters the output of (e.g., AT -rich repeats) were corrected by obtaining San­ sumqual based on user inputs. A consensus sequence of ger sequences spanning those regions. Gene sequences the aligned contigs is produced, which can be masked that were expected to be divergent from the reference based on base cal! proportion and coverage depth; single (e.g., yefl) and putative pseudogenes were also checked nucleotide polymorphisms (SNPs) can be masked simi­ by Sanger sequencing. PCR reactions contained �10-20 larly using separate user specifications. Scripts are avail­ ng template DNA and had a finalco ncentration of Ix able for download at http://milkweedgenome.org. Read Phusion:& Flash High-Fidelity PCR Master Mix (Finn­ depth was determined calculating median base zymes) and 0.2 11M of primer. PCR cycling conditions sequencing depth from the alignreads output. were 98'C for 30 s followed by 25 cycles of 98'C for 10 Chloroplast Genome s, primer set specific annealing temperature for 30 s, 72' The reference-guided assembly of the chloroplast gen­ C for 30 s, and a final extension of 72'C for 5 min. See ome was completed in alignreads using the chloroplast Table 1 for primer sequences and annealing tempera­ genome sequence of oleander (Nerium oleander: Apocy­ tures. Cycling extension times were increased to 60 5 naceae; M. Moore and D. Soltis, unpublished data; see and final extension times to 10 min for aceD and clpP. [17J fo r coding sequences) with only one copy of the Cycling extension times were increased to 90 5 and final inverted repeat included as the reference and no mask­ extension times to 10 min for ycfl-ndhF and rps15-yefl. ing. A second alignment of the YASRA contigs was The success of PCR reactions was confirmed using agar­ completed using Mulan [35], and the resulting consen­ ose gel electrophoresis. The chloroplast and mitochon­ sus sequence was compared to the alignreads output drial paralogs of aceD were co-amplified, bands excised and tested as an alternate reference input for alignreads. from a 1.2% agarose gel, and purified using a QIAquick All of the consensus sequences were aligned to each Gel Extraction Kit (Qiagen, Inc.). Due to low yield, the other and the oleander reference using MAFFT 6.240 mitochondrial product was re-amplified using 1 III of [36]. A preliminary annotation of the A. syriaca chloro­ extract as PCR template prior to sequencing. For addi­ plast genome was completed using DOGMA [37]. The tional chloroplast aceD sequencing, the mitochondrial annotations of protein coding regions, ribosomal DNA product was virtually eliminated from the PCR pool by (rDNA), and sequences corresponding to transfer RNAs raising the anneaiing temperature to 61 'c. Sequences (tRNAs) were improved by fixing annotation and assem­ were produced using standard dye termination techni­ bly mistakes (e.g., no start or stop codon, internal stop ques by High-Throughput Sequencing Solutions at the codons, frameshifts, endpoints of tRNA sequences) University of -Washington http://\wrw.htseq.org or the through examination of the actual assembly of the short Oregon State University CGRB. The quality of the fin­ reads at that location in Tablet verso 1.10.09.20 [38] ished A. syriaca chloroplast genome sequence was and/or comparisons to ortholog sequences from other assessed by using it as the reference sequence in align­ asterids obtained from NCBI GenBank (e.g., Nerium, reads to re-assemble the 40 bp reads and any discrepan­ Coffea, Solanum). Various versions of the chloroplast cies were explored and corrected. genome sequence were used as a reference in alignreads The final annotation of the A. syriaea chloroplast gen­ to further refine the sequence based on new assemblies. ome was accomplished using BioEdit 7.0.5.3 [40] and Discrepancies between the assembly consensus visualized using GenomeVx [41]. The repeat content of sequences in introns and intergenic spacers were the chloroplast genome was determined using the NCBI explored at the microread-level in Tablet to identify the BLAST tool blast2seq. Only repeat motifs of greater correct sequence. An sequence [Gen­ than 30 bp and 85% sequence identity were character­ Bank:GQ248251.1] served as a reference in alignreads to ized. Sequence identities were calculated in BioEdit on correctly assemble the trnH- psbA region. sequence alignments produced in MAFFT with the In order to produce the highest quality A. syriaca gapped positions removed. Ka/Ks values fo r pseudo­ chloroplast genome sequence and annotation, the con­ genes were calculated using the Bergen Center for Com­ sensus sequences were also aligned to chloroplast gen­ putational Science Ka/Ks calculation tool http://services. ome sequences for other individuals of A. syriaea cbu.uib.no/tools/kaks. produced in alignreads and Velvet 1.0.12 [39] based on Mitochondriai Genome either 80 bp single end or paired end ltlumina The mitochondrial genome of A. syriaca was assembled sequences (S. Straub and A. Liston, unpublished data). using the alignreads pipeline and the 430,597 bp mito­ Discrepancies between the consensus sequences fr om chondrial genome sequence of Nicotiana tabaeum Straub et al. BMC Genomies 2011, 12:211 Page 4 of 22 http://www.biomedeentral.com/1471-2164/12/211

Table 1 Primers used for Sanger sequence finishing of the Asclepias syriaca chloroplast genome

Region spanned Orientation Primer Sequence (5'-3') Annealing Temp. ("C) ndhC-trnV FORWARD GCTCGATTCGAATTGTCAAGTCATCC 57

REVERSE AGGCCCTCCGATAGGATACTCAAA trnQ-psbK FORWARD AAACCCGiTGCCTTACCACTTG 57

REVERSE CACGGGTATGACTGGCATAACATC rbeL-aeeD FORWARD ATCCGTGAGGCTAGCAAATGGAGT 59

REVERSE AGGCCCTAGTCCACCAATATGGAA aceD FORWARD TAATAAACAATGCTAGCGATTGGG 57,61

REVERSE TACAATTAGATCTTAGGTCGCCAA

FORWARD GAACCTACACGCAAGCTAGTG

FORWARD TAAATGGCATTCCCATAGCAGT

FORWARD TGATATCGATGCGCAAGCTAG

REVERSE CCTATAAATGTAACTAGCACTGCA

REVERSE GGTAGCGTATTCAATCAAACGG

REVERSE CCAAGTATCCGGATCGATCGA c1pP FORWARD AATTGTATACATAATGGGCTGGT 57

REVERSE TATTTATTCATTGCCCTGTTCGT

FORWARD AACTACTATGATGGCTCCGTTG

FORWARD CATAAGACCCATAATTTGATTTGAG

FORWARD GTTTATGCTGTACTCCGGGTA

REVERSE GGTCAATTTTATTGTAAAGCCGTAT

REVERSE TATGCAATTTATAAAACCAGATGTCC

REVERSE GTGAATTGAAAAAGTAGAACGTCAT rpsS -rp174 FORWARD CTCAGCAATAGTGTCTCTGCCCAT 57

REVERSE AGAGCTGTAATTGTGCGTACCCGT ycf7-ndhF FORWARD TCCATCCATATCCCAATTCCATTCA 55

REVERSE GGGCATTAGAGGATTGGCTAAAGT

FORWARD ACGCCTCTGCATCTAGTATTG

REVERSE GGATCTTTTGTCGAGAGCTTC ndhG-ndhl FORWARD ATGCGGGCGGGTTGATATATGGTT 59

REVERSE ATTGCTTTGGGTCGGTTACCAACGTC rps 75-ycf7 FORWARD GGCCTAATTGATACCTCAGCGGTT 58

REVERSE TTATCAACGATGGAACCGCCCACT

FORWARD TATGTCTCGGTCTAGGAGAGG

REVERSE TGTACAAGTTGATTCAGGCGA

For each region, additional primers appearing after the two original peR primers are internal sequencing primers.

[GenBank:BA000042.1] [42] as a reference. Mitofy [43] Ribosomal DNA was used to characterize the complement of protein The nuclear rDNA cistron, including 18S, 5.8S, and 26S, coding, rRNA, and tRNA genes in the resulting contigs. was assembled in alignreads using a 669 bp partial The annotation of the mitochondrial genome of N. sequence from A. syriaca [GenBank:AM396892] span­ tabacum was updated using Mitofy before comparison ning the internal transcribed spacers (ITS) as the refer­ of the gene content of the two genomes. BLAT [44] ence. Following the initial assembly, the assembly was under default parameters was used to estimate the per­ repeated using the reverse complement of the reference centage of each A. syriaca mitochondrial gene sequence as the new reference to correct for an incon­ assembled by calculating the proportion of the total sistency in YASRA that caused the assembly result to be length of the N. tabacum mitochondrial genes with hits strand specific. A consensus sequence of the contigs in the A. syriaca contigs. The presence or absence of obtained through the first two analyses was then used as the coxl group I intron in Asclepias was determined by the reference for a final assembly. A 5S rDNA sequence using the Hoya sikkimensis (Apocynaceae) coxl for A. syriaca was assembled using a 330 bp sequence sequence [GenBank:AJ247588.1] as a BLAT reference. including the 5S locus from Solanum iopetalum Straub et al. BMC Genomics 2011, 12:211 Page 5 of 22 http://www.biomedcentral.com/1471-2164/121211

[GenBank:AJ226045) as a reference in both the forward generated through reference guided assembly. The de and reverse directions. novo assembly was repeated to produce the set of con­ To assess intragenomic polymorphism, the complete tigs used for the non-rDNA repetitive element categori­ genomic read pool was filtered for reads matching the zation and microsatellite marker development. The consensus sequences of the assembled rDNA loci using organellar genome and rDNA BLAT analyses were also BLAT and custom scripts [45). Reads were discarded if used to estimate the percentage of overall reads corre­ their average Phred score was less than 20, and the sponding to the high copy fraction of the genome. remaining reads filtered so that any base in a read with Repetitive Element Characterization Non-ribosomal a Phred score less than 20 was converted to 'N' using nuclear repetitive elements were characterized by sub­ the FASTX-Toolkit [46). Modified reads were assembled mitting the de novo contigs to the MAKER [50) web against the reference sequence using BWA [47), and at annotation service http://www.yandell-Iab.org/software/ each position the number of reads differing from the mwas.html and categorizing the RepeatMasker [51] out­ consensus was tallied and a proportion calculated (script put from the MAKER run. The percentage of unique available from KW). In order to estimate the error intro­ read hits to these contigs was determined using BLAT, duced by the sequencing method, the same procedure with minimum sequence identity set to 70%, to find hits was applied to the PhiX174 control lane of the sequen­ to the contigs of putative DNA transposon or retrotran­ cing run. sposon origin. The overall percentage of reads corre­ Positions with 2% or more of reads differing from the sponding to known repetitive elements was calculated consensus were considered to represent true poly­ using BLAT (minimum sequence identity = 70%) to morphism. This cutoff is comparable to error rates pre­ determine the number of reads that hit the Viridiplantae viously found using a similar quality-filtering scheme repeat sequences downloaded from the Repbase Update [48), but may be somewhat conservative as it is much release 15.07 database [52]. higher than the error seen in the PhiX lane. See addi­ Microsatellite Marker Development De novo contigs of tional file 1: Polymorphism detected among Illumina at least 150 bp, to allow for sufficient flanking sequence reads for the PhiX sequencing control. All positions of for primer design, were scanned for micro satellites using the PhiX genome exhibited <0.7% differing reads except BatchPrimer3 [53). Primers were designed flanking per­ position 3132, which had >17% of reads differing from fect di- and trinucleotide repeats of four or more repeat the consensus. While the reason for this anomaly is units and perfect tetra-, penta-, and hexanucleotide uncertain, the overall average error rate in the PhiX lane repeats of three or more repeat units such that the PCR is <0.05%. amplicon sizes ranged between 100 and 450 bp and pri­ De novo assembly mers met the default criteria for SSRs primers (Table 2; Data Quality Control Perl scripts were used to remove also see additional file 2: Primer sets designed using microreads containing N's and those reads correspond­ BatchPrimer3 for 184 nuclear microsatellite loci in ing to Illumina adapter sequences from the chastity-fil­ Asclepias syriaca not tested fo r amplification success). tered short read pool to create a cleaned read pool [45). Contigs containing at least one micro satellite locus for The remaining reads were trimmed of bases of low which primers could be designed were submitted to the quality or unscorable quality (quality score uB" in Illu­ PLAN web service [54), with the 'DNA to DNA: blastn mina fastq files) and all bases subsequent to these bases against NCB I NT 20081206 database, top 10 hits with e to produce a cleaned and trimmed read pool [45]. < = 0.1 cut-off option selected, in order to check if any De novo assembly of the cleaned read pool was per­ chloroplast or mitochondrial sequences had not been formed using VelvetOptimiser 2.1.7 [49] and Velvet eliminated from the read pool used to generate the de 1.0.12 with the minimum contig length set to 100 bp. novo contigs. To produce the best set of primers for Hash lengths between 19 and 31 were screened to find candidate nuclear micro satellite locus development, the optimal hash length for maximizing the total bases those primer sets originating from contigs with a chlor­ included in contigs. The expected coverage and the opti­ oplast or mitochondrial BLAST [55] hit with an E-value mum coverage cutoffwere estimated by VelvetOptimiser of 1e-5 or lower were removed from the candidate pool. to be 2 and 0.3777216 respectively. This analysis was Primers were tested for amplification success for the repeated for the cleaned and trimmed read pool. Use of 25 top candidate loci, which were selected for having the cleaned read pool resulted in more total bases in the highest number of repeat units among all candi­ contigs; therefore this pool was used for all downstream dates. PCR amplification was conducted on genomic de novo analyses. In order to further characterize the A. DNA of the same A. syriaca accession used for genome syriaca nuclear genome, BLAT was used to remove sequencing in 50 fil reactions. Each reaction consisted of reads corresponding to nuclear rDNA and the chloro­ 1 fil DNA extract template, 1.25 U GoTaq® Flexi DNA plast and mitochondrial genomes using the sequences Polymerase and Ix Green Flexi Reaction Buffer Straub et al. BMC Genomics 2011, 12:211 Page 6 of 22 http://www.biomedcentral.com/1471-2164/12/211

Table 2 Primers, peR reaction conditions, and characteristics of microsatellite loci tested for amplification success in Asclepias syriaca

Primer Set Name Orientation Primer Sequence (5'-3') Annealing Temp. (0C) Repeat Motif Product Size (bp) Asl04467 FORWARD ACTTTTTGTTCTGTTCCGAGT 52.9 TA6 100 REVERSE CGGGGCTAGTATAGAAGAAGA

As 106994 FORWARD AGGAAGCCAATTCAAGTAAAG 48.9 GAs 129 REVERSE GAAAATCCTTGCATGAAACT As12122 FORWARD TTGGTTGTTGGAACTTACACT 50.4 ATTC, 115 REVERSE AAGCACAAGAATCAGAACAAT

As125636 FORWARD GAACTCCTATGATCTTTCATTT 50.8 TC4.. .TA,* 136 REVERSE AGACATCATGAACCCTCTTTA As12641 FORWARD CTCTTCATCTGGTTTCTCTTG 49.0 TA, 130 REVERSE TGCATACAAATTTAGGTGATTC As13110 FORWARD AACCA TAAATGCAGGTTCTTT 50.9 TA4 101 REVERSE GCATATCTAGCTCGCATTTAG

As16135 FORWARD TTAGGTTGTGAGAAGGGA TTT 50.3 CCT4 100 REVERSE GCTTTTATTAATGTGGGAGGA

As16369 FORWARD AACGCTTATCCCTCACTTTAC 51.5 TAs 100 REVERSE CCGAACTCTATATGAGTCCAAT As196582 FORWARD GCATTTATTGCATTGTCAAA 50.5 TGAs 113 REVERSE GCCGTTGTTCCTCTATAATTT As2003a FORWARD CCCTCGTTATTGAGAAAATAGA 50.5 ATAs 105 REVERSE GGCTCTTACCAAAGAAAAGAT

As20679a FORWARD GTCATCGGAAGCTCTTTTC 52.7 CTT4 104 REVERSE AGTCAAAAGTCACACTCATCTG

As2338 FORWARD CATCTTGGTTAGCTCGTAAAC 51.7 TCs 104 REVERSE CCTTACCTAGGATGAAAGGAG As23491a FORWARD ATTTCCTCCATTTGTTCTTTC 50.0 TCs 101 REVERSE GAAGTTTGTCTGAAATGGTTG As2366 FORWARD TTGGTGTTGTAAAGTGTTCCT 54.5 TAs 101 REVERSE ACCAGCTCGTCAGTCAAG

As23942 FORWARD GTGATTTCCATTTTACCATACC 51.7 AGs 100 REVERSE ATCCTTAGCAAATAGGAGTCG

As23961 FORWARD ACTCTTGTTA TTTTGGGGAAT 49.3 TGs 108 REVERSE CAAAAGCTTTACACGATCAAT

As282930 FORWARD AACTTAGCTTTGGATGATGTTT 50.1 TAA, 102 REVERSE CAATTGGGCTTGT AT A TCAGA As28460 FORWARD CACAAGATGAACATCAAAACC 49.6 ATs 100 REVERSE TTGAGTATATTTGTTCCCCAAT

As292191 FORWARD GGAGGGTCTAATGTCTGATGT 51.3 TGs 125 REVERSE ATTCACACATCCCATACCAT As43163 FORWARD TATTCCAAGAAATTGGTCAAC 49.7 TTA, 110 REVERSE CATGGAAGAAGAACAAAACAG

As5768 FORWARD AAGGGAAATGATCCTT AGGTA 51.0 AG6 108 REVERSE TTCGAAAGAGTTCCACATCTA

As71725 FORWARD TTTACGACTCATGTACTCTTCC 50.8 AG4 .. .TGS* 133 REVERSE GTAATACCGGAAAATGACCTC

As77693 FORWARD GTCAGTCAACTCAACAATGCT 51.0 TC6 102 REVERSE TCCCAAAACTCAGATCTAACA As81784 FORWARD AATGTAAAGGGAGATGTTTATCA 52.0 TAT4 129 REVERSE AGCATCAAGAACTTGAACAGA As99212 FORWARD TCTATGGTCAACAATTTCTGC 53.4 ATs 153 REVERSE TTTATAGGGGTAGCGTAGGTG

*Repeat motifs are separated by 27 bp for primer set As71725 and 46 bp for primer set As125636. Straub et 01. BMC Genomics 20", 12:211 Page 7 of 22 http://www.biomedcentral.com/1471-2164!12/211

(Promega, Madison, WI, USA), 2.5 mM MgClz, 200 11M annuum), coffee (Coffea canephora)] and Arabidopsis dNTPs, and 0.8 mM amplification primers. PCR was thaliana [COS II markers; [57]J. The average length of carried out in a ClOOOTM thermal cycler (BioRad, Her­ sequences included in each alignment was used as the cules, CA, USA) under the following conditions: initial gene length in subsequent calculations, which were denaturing at 95'C for 2 min, 35 cycles of 95'C for 30 completed as described above for the Catharanthus uni­ sec-variable annealing temperature for 30 sec-72'C for 1 gene set. min, followed by a final extension at noc for 5 min. The COSH pool of candidate genes was most appro­ See Table 2 for annealing temperatures. Amplicons of priate for the development of markers intended for phy­ the predicted size were confirmed using agarose gel logenetics because single-copy genes are preferred in electrophoresis. order to avoid ortholog/paralog conflation. MAFFT 6.240 was used to align the Asclepias micro reads with Nuclear genome characterization and development of the COSH alignments using the localpair option and low copy nuclear markers 1,000 cycles of iterative refinement. Primers were In order to provide an initial characterization of nuclear designed in exons flanking the putative locations of one gene content and coverage, all 19,899 available ESTs for or more introns based on the location of introns in A. Catha ranthus roseus were downloaded from GenBank thaliana indicated by the Sol Genomics Network's on 29 November 2010. See additional file 3: List of GI intron finder application http://solgenomics.net/tools/ numbers for 19,899 Catharanthus roseus ESTs down­ intron_detection!find_introns.pl for genes with O.5x or loaded from GenBank and assembled into unigenes. The greater coverage (Table 3). If possible, primers were sequences were subjected to simple cleaning in Seq­ designed with GC clamps and were chosen from sec­ Clean (Dana Farber Cancer Institute; available at http:// tions of the read identical to or highly similar to the cof­ sourceforge.net!projects/seqclean). Cleaned and trimmed fee sequence without mismatches near the 3' end of the sequences were assembled into unigenes using iAssem­ primer sequence. bIer v1.2 (available at http://bioinfo.btLcornelLedultool! The primers designed using the A. syriaca microreads iAssembler). Unigenes with BLAT hits corresponding to were tested for amplification success and their utility in the rDNA or chloroplast or mitochondrial genomes of phylogenetic studies of Asclepias and across Apocyna­ A. syriaca were excluded from this set. Simple repeats, ceae. Vouchers for all DNA samples used are listed in low complexity sequences, and long terminal repeats Table 4. DNA was extracted as described above for A. were masked in the remaining unigenes using the syriaca. PCR amplification was conducted in 50 III reac­ RepeatMasker web service [[51]; available at http://www. tions on genomic DNA of the same A. syriaca accession repeatmasker.org!cgi -bin!WEBRepeatMasker] . BLAT used for genome sequencing, 16 accessions of 13 other with a tile size of 7 and minimum sequence identity of species of American Asclepias, including four accessions 80% was used to find read hits to the remaining puta­ of two morphotypes of A. pringlei, one species of Gom­ tively nuclear masked unigenes. The number of unique phocarpus belonging to the sister clade to American micro read hits per unigene was used to calculate the Asclepias, and one sample each of successive sister hits per kb of sequence based on the length of gene. groups to the clade including Asclepias and Gomphocar­ Coverage for these genes was then estimated using the pus (Caiotropis and ). Each reaction consisted formula [(hits per kb of sequence x length of the micro­ of 1 fJ.l template DNA extract, 1.25 U GoTaq:E: Flexi DNA read)/lOOOJ. Coverage is overestimated when reads over­ Polymerase and 1 x Green Flexi Reaction Buffer (Pro­ lap. IMP 9.0.0 (SAS Institute, Inc.) was used to calculate mega, Madison, WI, USA), 2.5 mM MgCh, 200 fJ.M coverage distributions. A Monte Carlo resampler imple­ dNTPs, and 0.8 mM amplification primers. PCR was car­ mented in PopTools [56] was used to calculate 95% ried out in a C1000TM thermal cycler (BioRad, Hercules, bootstrap confidence intervals (1,000 iterations) for CA, USA) under the following conditions: initial denatur­ median hits per kb and coverage values. ing at 94"C for 2 min, 35 cycles of 94'C for 30 sec-vari­ 'With low genome coverage, few if any single-copy able annealing temperature for 30 sec-72'C for 1 min, nuclear genes were expected to have been assembled in followed by a final extension at 72'C for 5 min. See Table de novo contigs. In order to extract useful information 3 for primer sequences and annealing temperatures. for nuclear marker development from the unassembled, For the Apocynaceae-wide analyses, species from four uncleaned pool of reads and further explore the A. syr­ tribes (Marsdenieae, Baisseeae, Apocyneae, A!yxieae) at iaca gene space, BLAT with a tile size of 7 and mini­ increasing phylogenetic distance from Asclepias [58] mum sequence identity of 80% was used to find reads were selected for primer testing. Pairs of species from that hit alignments for 1,086 single-copy orthologs from three tribes were used: cordata and Marsdenia other asterids [e.g., tomato (Solanum lycopersicum), glabra (Marsdenieae), Baissea multiflora and Oncinotis potato (Solanum tuberosum), pepper (Capsicum tenuiloba (Baisseeae), Epigynum auritum and E. Straub et 01. BMC Genomics 2011, 12:211 Page 8 of 22 http://www.biomedcentral.com/1471-2164/12/211

Table 3 Primers designed from Asclepias syriaca microreads hitting COSII alignments

Primer Set Name Orientation Primer Sequence (5'-3') Annealing Temp. (OC) Atlg24360a FORWARD GCAAGATCATCAAAGGAGGCA 59.0 REVERSE GATTTAGATCAATAACCTCCTGCCA

Atlg24360b FORWARD GGAGGTTATTGATCTAAATCTCAC 56.3 REVERSE AACAAGGCCAACAACTGATG

Atlg24360c FORWARD CAATTACAGTGCTGCAAAAGCTG 60.0 REVERSE CATATCAGATGCAATGAATCCGGG Atlg44760a FORWARD TACTCATGTTACTAATAAGGGTGA REVERSE CTTTGCTTTGCTTCCTCACTC Atlg55880a FORWARD ATGTACACAAAAGAGGAGGCTG 54.0 REVERSE GATGCCTCATCCCACTATCAC

Atlg77370a FORWARD CCAAAGCGCCATCTACTCCA 60.0 REVERSE GGCCAACCAAATCCAGAAGAAT

At2g03l20a FORWARD GAAGGTAGATCCAAATTTGAATGTC 58.0 REVERSE GCAAGTCAACACAGCATTAAC At2g06530a FORWARD GTGAAGGTAATGGCCAAAGATC 58.0 REVERSE GTAACACCTTTCATTGCTTCTCC At2g21960a FORWARD ATTGGGGTTTCTTTGTTCCATAC 53.0 REVERSE GCAGATTGGCCAACAACTTTG At2g2l960b FORWARD GTTGTTGGCCAATCTGCAGA 59.0 REVERSE TTTGTTGCTGATCTTTGCTAAGC At2g30200a FORWARD ACAAGCTGTTGGAATGGGTG 55.0 REVERSE ATCTCAACTGCAGCTAAGCTG At2g30200b FORWARD ATCTATGTCACCAGCTTAGCTG 58.3 REVERSE TTTGAATGATTTTGCCTTTGCTTC At2g30200c FORWARD CAAAGGCAAAATCATTCAAAGC 56.3 REVERSE TGCATCAACATTTGATATTACTGG At2g34620a FORWARD TTCAGAAAGGTCATCAACAAATG 57.8 REVERSE GCTGAAAGTGAACAAACCTGG At2g34620b FORWARD TGCCAACATTT CTTTACAA TTCAAG 52.0 REVERSE none At2g47390a FORWARD GTCTTGGTACAAAACACGCAAG 59.0 REVERSE TCAGTTTTAGATTCTTTAGAGGTAAG At2g47390b FORWARD CTAATGAGTTTGCTGGCATTGG REVERSE CACCATGTCCTTTGAGTGCG At4gl3430a FORWARD GAGCAAAACATAAAGTACTTCTA 48.0 REVERSE CATTTCACCATCCATAACAA At4g13430b FORWARD GTGCCTCCAACATTGAGATTTG 59.0 REVERSE CATTCTCTAGCCAAAGCACGA

At4g13430c FORWARD GTGCTTTGGCTAGAGAATGCA 60.0 REVERSE CAAGACAAGCACCACAACTAGG At4g26750a FORWARD GCCTGA TGACCATTT ACATCTTG REVERSE TACAGGTCTCCTCCCTTCTTTC

At5g13420a FORWARD TGGTATGACAACCTGTGCCG 55.0 REVERSE GTATCATCAGCAAGTCTTGGTGAA

At5g13420b FORWARD CTCCATGCGTCCCTTCAATC 605 REVERSE TGGCACCTTTCTTCACCAGAG At5g23540a FORWARD GTGTTGAAGCGGTTGATCATG REVERSE TCCATCTGTCCATTTTTTCTTATGA At5g23540b FORWARD none 52.0 Straub et al. BMC Genomics 2011, 12:211 Page 9 of 22 http://www.biomedcentral.com/1471-2164/12/211

Table 3 Primers designed from Asclepias syriaca microreads hitting COS II alignments (Continued)

REVERSE AAGGCATCAATTACCACTTTCC At5g49970a FORWARD GTAGCTGCTCGTCATTTATATC 55.8 REVERSE GAGAATCCAAACATTGCATCA At5g49970b FORWARD TTGATGCAATGTTTGGATTCTC 57.8 REVERSE GGCACACAATTTTGGAGCAG

Multiple primer sets were designed for loci where multiple read hits allowed potential amplification of different introns. Primer set names correspond to the orthologs of these genes in Arabidopsis. Listed annealing temperatures were used in amplifications of Asclepias samples. An annealing temperature of 55'( was used for all loci for amplifications of other Apocynaceae samples.

cochinchinense (Apocyneae); Oncinotis glabrata was sub­ thermocycler with an initial melting step of 94°C for 3 stituted for O. tenuiloba for six of the primer pairs. In min, followed by 35 cycles of 94°C for 1 min, 55°C for 1 addition, the even more distantly related Alyxia oblon­ min, noc for 1 min, with a final extension of noc for 7 gata and, for six of the primer pairs, At. grandis (Alyx­ min. PCR products were examined for amplification ieae) were tested. DNA was extracted using a modified success and product sizes estimated by agarose gel elec­ DNeasy method [58]. PCR reactions were performed in trophoresis for both the Asclepias-specific and family­ 30 III reaction volume with Apex™ Taq DNA Polymer­ wide surveys. ase Master Mix (Genesee Scientific) diluted to lx with final concentrations of 1.5 mM MgC12 and 0.5 mM fo r­ Results & Discussion ward and reverse primers (Table 3). One III of unquanti­ Genome size estimation fied DNA extract was added to each reaction. PCR was The average 2C value for the five A. syriaca individuals performed in an Eppendorf Mastercycler gradient was 1.68 pg (SD 0.07), which corresponds to a haploid

Table 4 Herbarium voucher specimens for DNA samples included in the C0511 marker screen

Species Voucher Provenance

Alyxia grandis PJ Forst. D. J. Middleton 703 [A] Australia

Alyxia oblongata Domin. D. J. Middleton 702 [A] Australia

Asclepias colifomico Greene ssp. colifomico Lynch 10,779 [LSUS] USA. California. San Bernardino Co.

Asclepias connivens Baldwin ex Elliott Lynch 12,350 [LSUS] USA. Florida

Asclepias coulteri A. Gray Fishbein 5172 [OKLA] Mexico. Quen§rtaro. Mpio. Landa

Asclepias eryptoeeras S. Watson ssp. davisii Fishbein 5723 [OKLA] USA. Oregon. Crook Co.

Asclepias faseicularis Decne. Fishbein 5886 [OKLA] USA. California. Siskiyou Co.

Asclepias glaueeseens Kunth Lynch 14,142 [LSUS] Mexico. Oaxaca

Asclepias oenotheroides Schltdl. & Cham. Lynch 13,339 [LSUS] USA. Texas

Asclepias otarioides E. Fourn. Fishbein 5857 [OKLA] Mexico. Distrito Federal. Del. Tlalpan

Asclepias pringlei (Greenm.) Woodson Fishbein 5845 [OKLA] Mexico. Zacatecas. Mpio. Tlaltenango

Asclepias pringlei Ventura 2807 [ARIZ] Mexico. Distrito Federal. Del. Milpa Alta

Asclepias aff. pring lei Fishbein 5119 [OKLA] Mexico. MichoaGln. Mpio. Coalcoman

Asclepias aff. pring lei Fishbein 5175 [OKLA] Mexico. Queretaro. Mpio. Pinal de Amoles

Asclepias sanjuanensis KD. Heil, J.M. Porter & S.L. Welsh Ellison S.n. [OKLA] USA. New Mexico. San Juan Co.

Asclepias stenophyl/a A. Gray Fishbein 5394 [OKLA] USA. Colorado. Boulder Co.

Asclepias subulata Decne. Fishbein 3147 [WS] Baja California Sur. Mpio. Comondu

Asclepias viridis Walter Lynch 13,072 [LSUS] USA. Texas

Baissea multiflora A. DC. F. Billiet S3853 [BR] Cultivated. Belgium. Nat. Bot. Gard. Belgium

Calotropis proeera (Aiton) WT. Aiton Fishbein 5472 [OKLA] Cultivated

Epigynum aurirum (CX. Schneid.) Tsiang & PT. Li D. J. Middleton et al. 1457 [A] Thailand

Epigynum eoehinehinensis (Pierre) D.J. Middleton D. J. Middleton 209 [A] Thailand

Gomphoearpus coneel/arus (Burm. f.) Bruyns Drewe 534 [I<] South Africa

Marsdenia glabra Constantin D. J. Middleton et al. 1123 [A] Thailand

Oneinotis glabrata Stapf ex Hiern C. c. H. Jongkind et al. 1315 [US] Ghana

Oneinotis tenuiloba Stapf 1. Abbott 7707 [Z] South

Pergularia daemia (Forssk) Chiov. Fishbein 5445 [OKLA] Namibia

Telosma cordata (Burm. f.) Merr. T. Livshultz 01-33 [BH] Cultivated. USA. Cornell University Straub et al. BMC Genomics 2011, 12:211 Page 10 of 22 http://www.biomedcentral.com/1471-2164/12/211

nuclear genome size of approximately 820 Mbp. Esti­ of reference guided assembly algorithms to reconstruct mated 2C genome sizes in Apocynaceae range from 0.74 these differences. Only 38 bp of sequence from the ori­ pg in Voaeanga grandifolia to 5.00 pg in tetraploid Cer­ ginal reference guided assembly were determined to be opegia woodii [59), and correspond to haploid nuclear incorrect, producing substitution errors in the genome genome sizes of approximately 360 - 2,445 Mbp. sequence. The sum total of these changes resulted in Among the ca. 3,000 species of [60), the alteration of approximately 5% of the total sequence which include C. woodii, other 2C values for diploids length, indicating that 95% of the chloroplast genome of are all greater than 4.30 pg [59]. The genome of Ascle­ A. syriaea was assembled at O.5x average genome cover­ pias is on the small side for the family based on current age with reads of only 40 bp and an oleander reference knowledge, but a comparison is difficult to make due to with 85% sequence identity (considering only one copy the limited sampling of genome size diversity in of the inverted repeat). Although the original assembly Apocynaceae. was very good overall, comparison of the final assembly with the finished reference sequence highlighted the iliumina sequencing and quality filtering limitations of using 40 bp vs. 80 bp reads for sequence Of the 10,512,738 reads obtained, a total of 9,651,427 assembly. Even with a correct reference sequence, some short reads passed the first step of quality filtering yield­ regions were still not assembled properly (e.g., aceD, ing 386 Mbp of sequence. The raw reads were sub­ yefl) and some assembly mistakes from the original mitted to the NCBI Sequence Read Archive assembly were recreated (e.g., an erroneous 5 bp dele­ (SRP005621). Trimming reads of nucleotides with a "B" tion in the middle of rbeL causing a frameshift). quality score and subsequent bases resulted in retention In terms of overall size and IR size, A. syriaea has a of 99.06% of the quality-filtered data. Reads yielding no typical asterid chloroplast genome [61]. The gene con­ data due to all unscorable bases accounted for 0.20% of tent of the A. syriaea chloroplast genome is similar to this read pool. Of the reads obtained, approximately those of oleander and coffee (Coffea arabiea) [62], 11.8%, 3.4%, and 1.8% originated from the chloroplast with llO unique genes (76 protein coding, 30 tRNA, 4 genome, mitochondrial genome, and rDNA repeat rRNA). Seventeen of these are fully duplicated in the arrays respectively. Based on the estimated genome IR for a total of 127 putatively functional genes. The sizes, approximately 0.5x coverage, on average, of the most notable difference between the milkweed genome whole A. syriaea genome (including organellar genomes) and the oleander and coffee chloroplast genomes is and 0.4x coverage of the nuclear genome were obtained that clpP, aceD, and yefl are likely pseudogenes in from one lane of Illumina reads. Asclepias. Another difference in the Asclepias genome relative to the other two genomes is shifting of the Characterization of the organellar genomes of Asclepias boundary of the IR in A. syriaea such that the trun­ syriaca cated version of rps19 observed in oleander and coffee The Chloroplast Genome is not present in IRa of milkweed. Assessment of smal­ The chloroplast genome of A. syriaea is 158,598 bp ler repeats showed that the A. syriaea chloroplast gen­ [GenBank:JF433943)' excluding two small, unresolved ome contains few repeats (Table 5). In comparison, the regions that were not able to be assembled or Sanger coffee chloroplast genome has nearly three times as sequenced due to polynucleotide stretches (rpsB-rp114 many repeats of 30 or more basepairs with 90% or intergenic spacer and ljIyefl), and has an inverted repeat greater sequence identity [62]. (IR) of 25,401 bp (Figure 1). The initial assembly pro­ The A. syriaea IjIclpP has several fe atures that indicate duced using the alignreads pipeline and the oleander that it is likely a pseudo gene. The nucleotide and pro­ reference contained 51 contigs with a median read tein sequences are divergent from those of oleander depth of 246x and N50 of 4,683 bp. The longest contig (81.3% exon nucleotide and 67.0% protein sequence was 14,884 bp. The final assembly using the finished A. identity) and coffee (80.8% exon nucleotide and 67.5% syriaea sequence as a reference contained 22 contigs, protein sequence identity). In comparison, oleander and had an N50 of 9,030 bp, and longest contig of 28,186 coffee have 97.2% exon nucleotide and 98.9% protein bp. The use of chloroplast contigs from 80 bp reads sequence identity. There is also a 28 bp insertion near from other A. syriaea individuals (S. Straub and A. Lis­ the end of exon 3 of A. syriaea (validated by Sanger ton, unpublished data) in combination with Sanger sequencing) that causes a frame shift and the introduc­ sequencing of select regions in the 0.5x genome indivi­ tion of a stop codon. Sequence comparisons showed dual and these other individuals, resulted in the addition that clpP in oleander and coffee are likely under nega­ of 6,622 bp and deletion of 1,402 bp of sequence. That tive selection (Ka/Ks = 0.07 and 0.03 respectively); how­ large insertions and deletions were fo und relative to the ever, the Ka/Ks for Asclepias is 1.15 indicating that this original assembly was expected due to the limited power locus is likely evolving neutrally. While the clpP introns Stra ub et 01. BMC Genomics 201 1, 12:21 1 Page 11 of 22 http://www.biomedcentral.com/1 471-21 641 1 2121 1

Lse

petD Asclepias syriaca 158,598 bp

Lse - 89,307 bp

sse - 18,489 bp

IR - 25,401 bp

sse

• Photosystem protein Rubisco subunij

D ATP synthase • NADH dehydrogenase • Ribosomal protein subunij • Ribosomal RNA • RNA Polymerase • Unknown function Transfer RNA

D Other D Intron

Figure 1 Map of the chloroplast genome of Asclepias syriaco. The thick black lines indicate the locations of the inverted repeats (IR). The thin black lines indicate the locations of the large single copy (LSC) and small single copy (SSC) regions. Transcription is clockwise fo r genes on the outside of the circle and counterclockwise fo r genes on the inside of the circle. Asterisks denote the locations of unresolved sequence due to polynucleotide stretches. Straub et 01. BMC Genom/cs 2011, 12:211 Page 12 of 22 http://www.biomedcentral.com/1471-2164/12/211

Table 5 Repeats in the Asclepias syriaca chloroplast genome and has a highly divergent sequence compared genome to oleander and coffee. The sequence of this gene was Repeat Location(s) Repeat Type Repeat Length (bp) Identity also checked and improved using de novo contigs and

trnS-GCU:trnS-GGA Inverted 32 100% Sanger sequencing to eliminate the possibility of assem­ IfIOCCO Direct 63 89% bly errors as a cause of sequence divergence. Although IfIOCCO Direct 202 85% yefl is one of the most quickly evolving chloroplast

psbN-psbH spacer Direct 66 99% genes in seed [e.g., [66,73,74]], the full length

rpI23·trnl-CAU spacer Direct 57 97% copy of yefl in Asclepias is so divergent that it may be trnN-GUU-lf/Ycf7 spacer Direct 299 87% non-functional. The sequence identity between yefl in oleander and coffee is 87.5%, whereas the sequence Repeats of at least 30 bp and at least 85% sequence identity to each copy are reported. identities between each of those species and Asclepias are only 79.4% and 72.2% respectively. In addition to have been lost multiple times in angiosperms [e.g., decreased sequence identity, there were 62 and 61 indels [63-65,67,68]]' the coding region has been lost much inferred from alignments of the Asclepias yefl sequence less often. Other examples of loss of the coding region with the oleander and coffee sequences respectively, are only known fr om Seaevola (Goodeniaceae), Passi­ whereas there were only 16 indels inferred from an flora (Passifloraceae), Traehelium (), Ger­ alignment of oleander and coffee. Among angiosperms anium and Monsonia (Geraniaceae) [67-69]. yefl has also been lost multiple times [e.g., The aeeD gene of A. syriaea is also likely a pseudo­ [65,67-69,71]]. gene. This gene is 2,584 bp in A. syriaea compared to All three of the putative pseudo genes detected in the 1,497 bp in oleander. There is a repeat-rich (Table 5), A. syriaea plastome, IfIclpP, lfIaeeD, and lfIyefl have func­ approximately 1 kb insertion in the middle of the gene. tions that are known to be essential for plant develop­ Although the insertion does not interrupt the reading ment and survival [75-77]. Additional studies to evaluate frame, the remainder of the sequence that is alignable the functionality of these putative pseudogenes will be with oleander is highly divergent. This gene can be needed to confirm their designation as such, in combi­ highly variable, even within a genus [66], but due to the nation with further analysis of the nuclear genome to extent of sequence divergence, and lengths of the inser­ locate functional copies of these indispensible genes or tion and repeat motifs, this gene is likely non-functional determine which other nuclear genes might encode a in Asclepias. However, in pepper, a smaller, repeat-con­ functional replacement as in, for example, the case of taining insertion of 144 bp in aeeD does not cause a fra­ aeeD in grasses [e.g., [78,79]]. meshift nor prevent transcription of the gene [70]. The Mitochondrial Genome There are numerous other examples of loss of A total of 130,764 bp of the A. syriaea mitochondrial aeeD from the plastome among angiosperms [e.g., genome was assembled in 115 contigs with a median [64,65,67-69,71]]. read depth of 32x and N50 of 1,666 bp. See additional The assembly of lfIaeeD proved especially challenging file 4: Asclepias syriaea mitochondrial genome contigs for Asclepias due to sequence divergence and the large (also accessible at http://milkweedgenome.org). The insertion relative to the oleander reference. De novo longest contig was 6,948 bp. The extent to which the assembly of 80 bp reads from another A. syriaea indivi­ portion of the genome assembled fo r A. syriaea repre­ dual and Sanger sequencing confirmed the correct sents the whole mitochondrial genome is unknown due sequence. Further inspection indicated that the assembly to the extensive variation observed for genome size, errors in this region arose due to the presence of an gene content, and organization among angiosperms aeeD pseudo gene in the mitochondrial genome of A. [43,80,81]. Divergence from the organization of the syriaea, also confirmed by Sanger sequencing, which Nieotiana reference likely prevented a more complete was more similar in sequence and length to the oleander and contiguous assembly. Complete (rrnS) or nearly chloroplast aeeD. A similar problem involving aeeD and complete (rrn26, rrn18) ribosomal DNA sequences were a mitochondrial genome sequence was encountered in identified among the contigs. Twenty-eight putative the assembly of the pepper plastome [70]. Problems of tRNA sequences representing 25 different tRNAs were this sort are likely to arise for assembly of chloroplast also identified. Absence of a complete complement of genomes, especially with short reads, for regions of the tRNAs was not unexpected because the number of genome that have been duplicated in the mitochondrial tRNAs in plant mitochondrial genomes is variable due or nuclear genomes [72]. to the replacement of native mitochondrial tRNAs with A third possible pseudogene in the A. syriaea chloro­ versions of chloroplast origin or import of nuclear plast genome is lfIyefl. This gene contains one of the encoded tRNAs of chloroplast or nuclear origin [80,81]. two polynucleotide repeats of unresolved length in the For example, among the tRNAs in the A. syriaea Straub et 01. BMC Genomics 201 1, 12:21 1 Page 13 of 22 http://www.biomedcentral.com/1 471-2164/1 2/2 1 1

mitochondrial genome, both a mitochondrial version and a chloroplast version of trnE-TTC were detected. The protein coding gene content observed for A. syr­ iaca was compared to that of the mitochondrial genome of Nicotiana tabacum because it is the only other asterid . Si mple repeats with a sequenced mitochondrial genome. The assembled . Low-complexity repeats part of the genome contained the same 37 protein coding • DNATransposons genes of known function as the N tabacum mitochon­ • Retrotransposons: LINEs • Retrotransposons: LTR-Ty1/co�a drial genome with an average coverage of 94.8%. Putative • RetrotransJX)SOOS: LTR - Ty3/gypsy pseudogenes of rp s2, rp s7, and rps 14 were identified in • Retrotransposoos: LTR - other the A. syriaca mitochondrial genome contigs. Both the mitochondrial genome of N tabacum and A. syriaca appear to lack functional copies of these three genes, in addition to lacking rp l6, rp s8, and rp sll, the genes lost in Figure 2 Repeat types detected among de novo contigs of the all angiosperms, but present in Marchantia [42] . Evi­ Asclepias syriaca nuclear genome. The numbers indicate dence of If/rps14 is present in tobacco, but there is no individual hits in the contig sequences. remaining evidence of the other two genes. Due to the pseudogenization or absence of these genes in the N ta bacum reference used for the A. syriaca assembly, a coverage to be assembled, so it is not surprising that the more complete characterization of the mitochondrial majority of the repeats were Ty l /copia - like and Ty3/ genome based on additional data will be required to gypsy -like retrotransposons. In plant species where ret­ determine if these genes are truly non-functional in A. rotransposons have been characterized, it appears to be syriaca or if their absence is an assembly artifact. lineage specific whether Ty l/copia - like or Ty3/gypsy-like Extensive horizontal gene transfer has been hypothe­ retroelements are more abundant in the genome, likely sized for the group I intron present in the coxl gene of because retrotransposons can proliferate or be lost from angiosperm mitochondrial genomes, with six horizontal genomes over relatively short evolutionary time scales transfer events hypothesized in Apocynaceae alone [82]. [85,87] . Among other asterids, Ty3/gyp sy-like retrotran­ Alternatively, it has been argued that the distribution of sposons are more prevalent in the sunflower (Helianthus the coxl intron in plants more likely involved a single annuus), tomato, and potato genomes, while in carrot horizontal transfer followed by intron loss in many (Daucus carota), as in A. syriaca, Ty l /copia-like retro­ lineages [83] . All but one of the Apocynaceae genera transposons are more numerous [88-90]. sampled have contained the coxl intron [82,83] ' and it The numbers of retroelements reported here are cer­ was also found to be present in the assembled coxl tain to be different than the actual numbers of retroele­ gene of A. syriaca. This result adds to the evidence sup­ ments in the sampled A. syriaca genome because the de porting the intron loss hypothesis over horizontal gene novo contigs were relatively short compared to the transfer in Apocynaceae. length of retrotransposons, possibly resulting in different regions of the same retroelement being counted multi­ Characterization of the nuclear genome of Asclepias ple times. Additionally, six contigs had more than one syriaca good repeat type match, slightly inflating the numbers The Velvet de novo assembly using a hash length of 21 discovered. Only approximately 2% of the putatively resulted in 1,399,320 bp of sequence in 6,997 contigs, nuclear reads had hits to the repeat-containing contigs. 79% of which were 200 bp or less. See additional file 5: The estimate of raw reads with hits to the plant repeat Asclepias syriaca nuclear genome de novo assembly con­ database did not give a better estimate of the total tigs (also accessible at http://milkweedgenome.org). The repeat content of the A. syriaca genome than the hits to longest contig was 5,027 bp and the N50 for the assem­ assembled repeats with only 0.24% of reads with one or bly was 196. more hits. These results indicate that at 0.4x genome Repetitive Element Characterization coverage, only an initial characterization of the most A total of 564 repetitive elements were detected in the common repeats present in a plant genome can be A. syriaca de novo contigs (Figure 2). The majority of made, while an estimation of the overall repeat content these repeats were Tyl /copia-like and Ty3/gypsy-like of the genome requires deeper coverage. LTR retrotransposons (Figure 2), which are abundant in Microsatellite repeats were also detected among the plants and common across eukaryotes [84-86] . Only the 2,624 de novo contigs of at least 150 bp, with a total of most highly represented repeats in the genome would 39 5 microsatellite sequences found in 333 of these con­ have high enough coverage at 0.4x total nuclear genome tigs (Table 6). Top ten BLAST hits with an E-value of Straub et 01. BMC Genomics 201 1, 12:21 1 Page 14 of 22 http://www.biomedcentral.com/1 471-2164/1 2/21 1

Table 6 Microsatellite loci detected in the O.Sx coverage genome of Asclepias syriaca Repeat Type Number Found Number of Loci with Primers Number Tested/Successful for peR Amplification di- 271 206 17/1 7 tri- 51 30 717 tet- 57 43 1/1 pent- 12 9 OINA hex- 4 2 OINA

Total 395 290 25/25

Amplification success of the 25 markers with the highest numbers of repeat units is indicated (see Ta ble 2).

1e-s or lower were recovered for 53 and three contigs expected transmission ratios (in the case of known pedi­ for mitochondrial and chloroplast sequences respec­ grees) or Hardy-Weinberg expectations (in the case of tively. The higher number of possible mitochondrial wild populations). If the markers show evidence of fixed reads remaining in the read pool following removal of heterozygosity or dominant-only variation, this would reads putatively belonging to the organellar genomes constitute evidence that our microsatellite 'contigs' are reflects our more limited knowledge of the mitochon­ the product of micro reads representing two or more drial genome in plants in general and incomplete assem­ paralogs. These loci are currently being examined in bly of the A. syriaca mitochondrial genome. An greater detail to determine their locus specificity. alternative explanation for the presence of these small The source of the rDNA repeats identified by Repeat­ segments of DNA of mitochondrial or chloroplast origin Masker in the de novo contigs (Figure 2) that remained in putatively nuclear contigs is that they correspond to after the removal of A. syriaca rDNA microreads was nuclear nomads [72] . also explored. BLAST searches (blastn) of the NCBI To be conservative in selecting 25 candidates for test­ nucleotide collection revealed that their source was likely ing, loci from the 56 contigs with putatively mitochon­ plant fungal pathogens or endophytes. For example, the drial or chloroplast sequence were removed from the top BLAST hits for two of theses contigs were to an pool of candidates, leaving 209 primer sets for putatively Alternaria alternata gene for large subunit (LSU) rRNA -lOS nuclear microsatellite loci. See additional file 2: Primer (Ie ) [GenBank:AB566324.1] and a Davidiella tassiana sets designed using BatchPrimer3 for 184 nuclear micro­ partial LSU rRNA gene sequence (le-92) [GenBank: satellite loci in Asclepias syriaca not tested for amplifi­ FN868880.1]. Approximately 0.02% of the total reads cation success. Of the 25 primer sets tested, all could be categorized as fungal rDNA when the sequences successfully amplified in A. syriaca (Table 6). It is for the top GenBank hits were used as a BLAT database. important to note that with the low overall genome cov­ Detection of fungal contaminants of a plant genomic erage, at least some of these microsatellite-containing DNA preparation is unsurprising due to the ubiquitous contigs are likely to have originated from multiple, association of these two types of organisms. nearly-identical paralogous loci. As is the case with all Ribosomal DNA microsatellites, the ultimate proof in the utility of these A 385 bp contig containing a 120 bp 5S rDNA sequence markers for identifying intraspecific variation (e.g., link­ [GenBank:JF312047] and an rDNA cistron of 7,541 bp age mapping or population genetics) will come from [GenBank:JF312046] were assembled for A. syriaca (fig­ obeying a Mendelian locus/allele model that meets ure 3) . Repeat structure in the external transcribed

IGS

�_-----,A,--__ ___ �ssem b l ed NTS -- .�..... 1IIiiiiIiIi .... 658 bp f unknown

Figure 3 The rDNA cistron of Asclepias syrioca. Blue boxes represent genes. Thick black lines represent additional transcribed sequence. The gray and dashed lines represent non-transcribed and unassembled sequence respectively. Only partial sequences of the non-transcribed spacer (NTS) and external transcribed spacer (ETS) were able to be assembled due to repeats, so the length of the intergenic spacer (IGS) remains unknown. Straub et 01. BMC Genomics 201 1, 12:21 1 Page 15 of 22 http://www.biomedcentral.com/1 4 71-21 64/1 2/21 1

spacer (ETS) and intergenic spacer (IGS) regions of the sequence was not observed in the A. syriaca sequence rDNA cistron prevented further extension of that contig upstream of 185, so the entire ETS region was not and produced an error in the assembly where reads cor­ assembled due to internal repeated sequence. However, responding to similar repeats were incorrectly piled up, another ETS motif conserved in plants and always highlighting one of the primary difficulties of genome located approximately 420 - 450 bp upstream of the assembly using short reads [32,91]. Consequently, the start of 185 [TGAGTGGTGG; [97] ] was located 423 bp first 280 bp of the contig sequence were removed prior upstream of 185 in the sequenced portion of ETS. Due to downstream analyses. The median read depth for the to the length variation in the ETS regions in other aster­ final assemblies were 406x for 55 and 738x for the 185- ids (Solanum � 1000 bp; Nicotiana >3000 bp in N. taba­ 5.85-265 cistron. Using the median read depth as an cum; Olea �2000 bp), it is difficult to estimate the estimate of the number of rDNA copies sequenced and completeness of our assembly for this region and the the nuclear genome coverage of approximately O.4x, cistron as a whole, which is between 9 and 12 kb in rough approximations of the number of 55 rDNA Nicotiana [96,98,99]. repeats and of rDNA cistron copies are 1,015 and 1,845 For intragenomic rDNA polymorphism characteriza­

respectively. These estimates are in line with values tion, quality filtering by average quality (q = 20) observed for other plants [92,93] . removed zero reads from the 55 pool, and 148 reads Fully characterized rDNA cistrons in plants have at from the 185-5.85-265 cistron pool (0.1%). Masking the least one TATA box upstream of 185, which demarcates remaining nucleotides with qualities below 20 affected the boundary between the ETS region and the repeat­ 1.7% and 6.2% of nucleotides in the two pools, respec­ rich non-transcribed spacer [94,95]. However, repeat tively. A high percentage of positions in the 55 locus units may also appear between the transcription initia­ were polymorphic (Figure 4). Most of these positions tion site (TIS) and 185 [96] . The conserved plant TIS had less than 8% of reads differing from the consensus,

0.35 ITS-1ITS-2 Q) en co 0.009 0 ..c 0.30 � -§ 'co 18S 5.8S 26S E Q) 0.25 0.002 o 0.004 :S E .g 0> 0.20 c � 5S -0 � 0.15 0.267 en co ..c o § 0.10 :e o c.. e a... 0.05

0.00

r' T' '-rT" " ,rrTO,rr' T' .-rT' ''''rTTO,rr,T,o-rT" ,-r, T" ,�,rT, T",rr, T",-r, "-,rT" ,o-"",rr,,,-r, T" ,-r"" ,rr,,,,, � o 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 0 200 Sequence position

Figure 4 Polymorphism among rDNA cistron (1 85-5.85-265) and 55 ribosomal DNA copies in Asclepias syriaca. Non,zero proportions below the blue line are likely due to sequencing error. Proportions above the blue line likely represent polymorphism within or among rDNA arrays. The proportion of polymorphic sites for each rDNA region is given beneath the label. Straub et 01. BMC Genomics 2011. 12:211 Page 16 of 22 http://W\lv\:\f,blomedcentrai.comJi 471-2164/12/211

though some ranged as high as 14%. Conversely, only 19 I a positions in the 18S-S.8S-26S cistron were polymorphic i (Figure 4). Polymorphisms were fo und in both coding and spacer regions, although none were found in either b. the S.8S or ITS2 regions. \X/fiile spacer regions con­ tained a slightly higher proportion of polymorphic posi­ tions than coding regions, positions with the greatest proportion of differing reads were found within coding regions, in particuiar in the 26S region. 1.SX Levels of intra-genomic polymorphism among copies of the rDNA loci generally agree with previously mea­ '" sured and theoretically expected levels. The high propor­ OJ � o previous observations for this locus; however, the levels u 1X '0 measured here are somewhat higher than those reported .c Q. in other plant and animal species [e.g., [100-102]J. Intra­

The distribution of polymorphic positions matches Figure 5 Coverage of Catharanthus unigenes and COSii theoretical expectations, with a higher proportion of markers by Asclepias syriaca microreads. Coverage of positions fo und within spacer rather than coding Catharonrhlj5 nuclear u0iger:es by sy ricca rpicroreads for genes tess than 5X coverage. Fifty-six of :he 8,077 u�igenes regions. The very high proportion of differing reads (O\l2(2ge greater Than 01 equal 1:8 5X were exciuded frem "(roe found at two positions in the 26S region may be caused ccver2c;e 17.5X. b by their location within expa:lsion segments of this Covelrage of COSli single-copy ruciear genes by A. syr ioca region, expansion segments 9 and 8, respectively [lOSJ. ::liuoreads. Tr·e upper box plot \vhlsk2i represents !.5 The lower levels of polymorphism observed for the 185- 5.85-26S cistron relative to the 5S rDNA loci are con­ cordant with the molecular evolutionary processes asso­ ciated with each rDNA type. While strong concerted evolution and s:rong purifying selection act to homoge­ members of multigene families or other duplicated nize the 185-S.8S-26S cistron copies within and among genes, increasing the chances of non-orthologous BLAT repeat arrays, weak concerted evolution and birth-and­ hits in conserved domains. For example, a unigene with

death processes for 55 rDNA allow accumulation of 6.79x coverage had multiple BLAST hits (E = 0) to polymorphism [100,106]. eudicot H� -ATPases (P3A-type). Within the P-type Protein Coding Nudear Genes ATPase gene super family, the P3Ktype subfamily has From 19,899 Catharanthus ESTs, 9,266 unigenes were 10 and 11 m.embers in rice and Arabidopsis respectively created. Of these, 72 corresponded to chloroplast, mito­ [107]. Based on the overall genome coverage and the chondrial or rDNA genes and were removed from the unigene coverage, the A. syriaca genome likely contains data set. Among the remaining repeat masked unigenes, a similar number of P3A-type ATPases. 8,077 had between one and 138 unique micro read hits. Another explanation for the high coverage of outlier The median number of hits per kilobase of sequence loci was indicated by the masked unigene sequences was 7.17 (9S% CI 7.03 - 7.29). The median coverage of themselves. Many of the unigenes with the highest cov­ the unigenes was O.29x (95% CI O.28x - O.29x; Figure erage still contained sequences that could produce mul­ Sa). The 618 genes with coverage levels above 0.984x tiple non-homologous hits (e.g., internal polynucleotide were outliers. Many of these genes likely correspond to stretches), which in turn produced stacks of reads. Read Straub et al. BMC Genomics 2011, 12:211 Page 17 of 22 http://www .biomedcentral.com/1471-2164/12/211

overlap due to this phenomenon inflated the coverage explorations of the nuclear gene content of the A. syr­ values, so many of the very high coverage values are iaca genome demonstrate that even at O.4x genome likely artifacts of the unigene creation and masking pro­ coverage, much of the protein coding portion of the cess. The presence of PCR duplicate reads in the data nuclear genome can be detected. set could also have caused read overlap and overestima­ Even with low overall coverage of the 721 COSII tion of coverage. Most of these reads should have been genes with A. syriaca micro read hits, the alignments excluded by counting only unique BLAT hits, but any were still useful for marker development for Asclepias sequencing errors in the duplicates or length differences and Apocynaceae molecular phylogenetics. The 28 genes due to differing sequence quality, especially at the ends that had 0.5x or greater coverage were evaluated for of the reads, could produce nearly identical hits. their suitability for this purpose. Microreads flanked Of the 1,086 COSH alignments 721 had between one putative intron locations in 15 genes of this set, allowing and 23 unique A. syriaca microread hits with a median primer design for amplification of one or more introns gene coverage of 0.14x (95% CI 0.13x - 0.15x; Figure per gene (Table 3). 5b) and median hits per kilobase of sequence of 3.40 Primer testing within Asclepias and across Apocyna­ (95% CI 3.13 - 3.64). The 35 genes with greater than ceae was largely successful with single products amplify­ 0.473x coverage were outliers. There is a possibility that ing in most tribes of Apocynaceae (Table 7). some of these genes are single copy in other asterids, Amplification of large introns and evidence of length but duplicated in Asclepias, leading to the higher cover­ variation among groups are promising for utility of age levels. More likely, as in the unigene analysis, these markers at different scales dependent upon overall sequence fe atures, including microsatellites that were sequence divergence. These results highlight how even a not masked in the COSH alignments, and PCR dupli­ small amount of genomic data can facilitate and expe­ cates led to increased estimates of coverage. Of the five dite the arduous process of nuclear marker development genes with greater than Ix coverage, the only BLAT across a range of taxonomic scales from within a genus hits for two were to microsatellites and a third had mul­ to across a family for groups of species not closely tiple microsatellite and a single other hit. related to a species with extensive genomic resources. The median coverage values for the unigene and COSH data sets would be expected to be close to the Conclusions estimate of O.4x coverage for the nuclear genome. How­ The often-employed strategy of transcriptome sequen­ ever, the observed values deviated from the expected cing for exploration of the genomes of non-model coverage. One explanation for this is that the overall organisms can produce a wealth of information for com­ coverage of the A. syriaca genome was slightly overesti­ parative and functional genomics studies due to high mated due to the presence of PCR duplicates in the coverage of the gene space and provide the necessary read pool. It is also possible that the genome size of the information for marker development, including single sequenced individual was larger than the genome sizes nucleotide polymorphisms [3]. As an alternative, low estimated from the experimental population, again caus­ coverage whole genome sequencing provides a more ing an overestimate of genome coverage. Divergence complete survey of the entire genome by providing sub­ from the reference sequences would also cause an stantial information about the organellar genomes and underestimate of coverage for A. syriaca orthologs. The the repeat content of the nuclear genome, while still lower amount of divergence from Catharanthus, a providing the necessary resources for marker develop­ member of the same family, is reflected in the higher ment. In plants with no prior genomic information, estimated coverage for the unigenes than the COSII acquisition of even a small amount of genomic data alignments, which contained sequences of more dis­ allows ready characterization of the high copy fraction tantly related asterids. The coverage value for the uni­ of the genome and exploration of the low copy fraction genes may also be higher due to the reasons discussed of the genome. A complete, or nearly complete, chloro­ above. Another feature of both reference data sets that plast genome and rDNA cistron are readily obtained. A would cause an underestimate of coverage was the partial to nearly complete mitochondrial genome can be absence of introns. The micro read sequences from obtained, depending on the complexity of the particular genomic DNA would also represent these features of mitochondrial genome. Initial characterizations of the the genes, but would not produce BLAT hits for these repeat and protein coding portions of the nuclear gen­ genes. Reads spanning intron-exon boundaries would ome can be made in terms of content and then used in also have been excluded in many cases due to the mis­ the development of molecular tools, including low-copy match of the intron sequence with the following exon in nuclear genes and microsatellites, for the further study the unigenes and COSII alignments. Even with these of the target organism and its relatives. issues confounding coverage estimates, these two -:7 V> Table 7 Success of PCR amplification and intron length variation for COSII nuclear loci in Asclepias and across Apocynaceae ::! :;' -c QI '-'-c Number of Individuals PCR Approximate Product Size (bp)# successful/PCR attempted ' 4 s Arabidopsis tair Annotation Primer Set Introns Coding As M2 B3 Ap Al As M B Ap AI I� 0-:-­-. <:0 Ortholog spanned* Sequence o s: (bp) � n g-� Atl g24360 3-oxoacyl-(acyl-carrier protein) Atlg24360a 360, 44 1 ,487 234 15/20 112 1/2 212 1/1 1400-1 500 1500 1000 300, 1000 1000 tD ::s :::l 0 reductase, ch loroplast :;' 3 Atlg24360b 582, 61 5 117 14/1 6 212 2/2 2/2 1/1 500-700 500 500 500 700, � cY o N 1500 3 0 � ::: At lg24360c 750 117 16/1 6 212 2/2 2/2 1/1 200 200 200, 1500 200 200, 700 -I> - ,,­ � IV �� At l g44760 universal stress protein (USP) fa mily Atl g44760a 369, 491 334 1/16 1/2 1/2 2/2 1/1 1300 1200 1200 1100 700 2: � protein :::.

Atl g55880 pyridoxal-5'-phosphate-dependent At l g55880a 1043 394 13/1 6 2/2 2/2 2/2 1/1 1200 900 700 - 800 700 300, 700 �

At l Atlg77370a 164, 238 172 16/20 212 2/2 0/2 Oil 1400-1 500 1100 1100 - 1 200 n/a n/a

At2g03 120 homologous to Signal Peptide At2g03 120a 156 210 19/1 9 212 2/2 2/2 1/1 300-400 300 300 300 300 Ppntirl",p, (SPP)

At2g06530 VPS2.1 (vesicle-mediated transport) At2g06530a 276 145 20/20 212 2/2 2/2 1/1 400-650 400 700 - 600 - 400, (600, 1000) (800,1 000) 1200

At2g21960 unknown protein At2g21 960a 503, 592, 722 366 7/1 6 2/2 2/2 212 1/1 1500 1500 1700 1600 1100 1700

At2g21 960b 81 3 84 12/1 6 212 2/2 2/2 1/1 200-300 200 150 150 150

At2g30200 [acyl-carrier-protein] S­ At2g30200a 377, 444 185 12/1 6 212 2/2 2/2 1/1 1400 1000 1200 900 1200, malonyltransferase »1500

At2g30200b 587, 636, 774 392 13120 2/2 2/2 2/2 1/1 700-1 500 1 200 1200 1200 1200

855 160 16/1 6 012 2/2 2/2 1/1 300 n/a 300 300 300

At2g34620 mitochondrial transcription At2g34620a 596 254 19/1 9 212 2/2 2/2 2/2 300 300 300 300 - 300 termination factor-related 1000

At2g34620FI 596 531 15/1 6 212 2/2 2/2 0/2 600 600 600 600 n/a

At2g47390 serin e-type endopeptidase At2g47390a 1718 401 20/20 212 2/2 2/2 0/2 500-600 500 500 500 n/a

At2g47390b 21 77, 2243, 506 1/16 0/2 0/2 0/2 0/2 1500 n/a n/a n/a n/a 2304, 2454, 2592

At4g 13430 methylthioalkylmalate isomerase At4g 13430a 510, 579, 696 275 14/1 6 212 2/2 2/2 1/1 700-1000 900, 1200 1200 1200 1200

At4g 13430b 771, 849, 945, 391 16/20 012 2/2 0/2 1/1 1300-1 500 n/a >1500 n/a 1500 " 995, 1050 QI to tD At4g 13430c 1182, 1230 278 15/1 6 212 2/2 2/2 1/1 (500, 800)- 800 900 800 500, 900, 1000 >1500 1200 00 Q, N N :J" Vl Table 7 Success of PCR ampl ification and intron length variation for COSII nuclear loci in Asclepias and across Apocynaceae (Continued) ;:+ " -0 OJ ,c At4g267S0 hydroxyproline-rich glycoprotein At4g267S0a 30S, 381 241 1/16 2/2 2/2 2/2 Oi l lS00 >lS00 lS00 1000 n/a �O" fa mily protein � � AtSg1 3420 putative transaldolase At5g13420a 348, 398 29S 12/1 6 212 212 2/2 Oi l 1300 1100 1000 700 n/a o£ s:: ;;; AtSg1 3420b 699, 888 338 20/20 212 2/2 2/2 1/1 800-900 1200 1200 SOO, 1200 lS00 3 " �G1 436 0/ 16 2/2 2/2 1/2 Oil n/a »lS00 300, 700, SOO, 700 n/a () '" AtSg23S40 265 proteasome regulatory subunit AtSg23S40a 308, 390, 567 ID :;, »1500 :::l 0 " ':3 AtSg23S40aF 308, 390 231 7/1 6 2/2 2/2 1/2 1/1 1400 »lS00 700 - 900, 700 1200 � O· o N AtSg23540bR »lS00 3 0 � :::: At5g49970 bifunctional pyridoxine AtSg49970a S07 193 20/20 212 2/2 2/2 2/2 300 300 300 300 300 -I> - " ... (pyridoxamine) S'-phosphate .. � oxidase (PPOX) � � AtSg49970b 613, 780 226 16/1 6 212 2/2 2/2 1/2 SOO-900 800 800 700 1000 � � � AtSg49970aFAtSg49970bR S07, 61 3, 780 398 lS/1 6 n/a n/a n/a n/a 1 000-1 SOO n/a n/a n/a n/a ---N � *Intron numbers represent the starting position of the intron in Arabidopsis. ' Asclepiadinae Samples include 4 genera: Asclepias (17), Gomphocarpus, Ca/otrapis, and Pergu/aria. 2 Marsdenieae 3 Baisseeae 4 Apocyneae ' Alyxieae #Two numbers separated by a comma indicate multiple peR products. A range of numbers indicates size variation among individuals.

" OJ 10 ID

\0 Q, N N Straub et al. BMC Genomics 2011, 12:2 '1 Page 20 of 22 http://w\jvw.biomedcentral.com/1471�2164/12/211

The data obtained through this study are the first step Statio'" USDA Fores: Sen.... ice 3200 jefferson 9733l in the development of a community resource for further :j5/�..

study of plant-herbivore and plant-pollinator interac­ Authors' contributions tions, floral deveiopmental genetics, chemical evolution, SCKS particioated in stUGY design, cc)',--d JeLed the maJOrity' of the analyses population genetics, and comparative genomics using desigr�led The ccilected t�,e chiorop!ast Sanger data and drafted mcnusGipt. ,'ljl F conceived of the stJdy conducted milkweeds as ecological and evolutionary models. The miuosatellite and b'/v-copy nuclear gene primer teSTing II' Asciepias. TL information obtained from this study is already serving pai�icipated in the desigi"'l of lhe 10\ }-copy nJciear gene analyses and as a resource for development of markers to aid in the ccndL:ctej 10\,\:-c09Y i' lJc!ear gene in A,oocynaceae. ZF scrio!eo and tested lhe data analysis pioeiine. MP prepa(ed the IliJmina study of phylogenetics, popuiation genetics, phylogeo­ library tested the data pipei:c', e. I0l/ cordJcted the rDN/\ graphy and other areas fo r Asclepias and Apocynaceae. pOlymorphism analy'ses_ P,CCconc eived of perfo(rned some These results highlight the promise of genomic statistcai analyses, ane; collected the ,;Lconc eived of the study, cocrdi::ated �he data co:iectjoi�" and study conducted tr,e resources for developing tools in any organism [2,108J COSII nuclear sene anaiysis sor::e of the al,aiyses and show that even labs with limited resources can fea­ sibly obtain O.5x or greater coverage of the genorne of Received: 26 February 201 1 Accepted: 4 May 201 1 their species of interest and a wealth of information. By Published: 4 May 201 1 incorporating 80-120 bp read lengths, paired-end sequences, and ever-improving sequencing technology, References this level of coverage and assembly success from a single :;'ckas ,A,boo: Harnessing genomics for evoiutionaty i;)sights. 2009, 24(4):1 92-200. lane of Illumina sequencing is attainable for all but the i\1E: Sequencing breakthroughs for genomic eco�ogy and very largest of plant genomes. evolutionalY biology. feo! 7008, 8(l}:3-i /' �" Ga,"ildo J: App!ications of next generation sequencing in mOl€cuiar ecology of :'fon-model organisms. Additionai material Rasmussen \joo� IVIA,F: What can you do with O.l x genome cove,age? A case study based on a genome survey of the scuttle fly fVfegasefia seaforis (Phoridae). Genorrics 2009, l O:382. Additional file 1: Polymorphism detected among Illumina reads for Lee ), .:)/ base the IIiJ,'D ina ?hiX seqLJenc!ng control tuberculotus) genome using pyrosequencing technology. l;\/e:?d 2009 Additional me 2: Primer sets designed using BatchPrimer3 for 184 57(5):463-469. nuclea� miciOsatellite loci 'n Ascfepias syriaca not tested for 6 <, 'Je PaD!- E. Ho , �okhsar JS, amplification success. tabie of r1 u ::::::ear rniuosatei!iTe- IOCJ5 primer Se�5 JJ, ,�v'1eyers Be GenoIT'icand predicted prOGJCt size and repeat rr otf jrJorrT'atior. smaH RNA sequencing of /\;'iiscanthus x giganteus shows the utl1ity of sorghum as a reference genome sequence for Andropogoneae grasses.

ESTs downloaded from GenBank and assembled into unigenes List 2elO, 11(2):R,i2 \eumann J, i\jlatSIJ'T':otO !. Roux �, J:J'ezel .,: Repetitive of GenBank ,l umbers fe r ESTs used tnis study. part of the oana:1a (Musa acurrinotoj genome investigated oy �ow- Additional file 4: Asclepias syriaca mitochondriai genome contigs depth 454 sequencing. B/viC Biel 201 0; 1 C:204 FA.5T,A fi le ::::or,ta;ning seqJe,lCeS of l[>,e assembled A5ciepia� ." \j eu"na:!n p, Na'ra'::i:o'/a R:epetitive DNA in the pea (Pisum mitochondriai contig5. 50tivum L.) genome: comprehensive characterization :..;sing 454 Additional file 5: Asclepias syriaca nuclear genome de novo sequencing and comparison to soybean and /vledicago truncatuia. B/V1C assembiy conti·gs. F/·5TA. cOi" tai�,ing seqJer,ces of :he ([u(iter 20C7. 8:427. ';Jencme COrjtigs from Velvet 'Iovo Aso'eoias syrioca genorre K,�/1" q,oSe':l�'la; BUI: Next-generation sequencing of the Trichinella murrelff mltochondrfal genome aiio\;vs comprehe:1sive comparison of its divergence from the prfncipal age:!t of human �richineHosis, Trichinelfa Genet Evo/ 2Cl l. l1(�j:l 23.

'0 tl G"be:':: MT, Bin'3c:en �, �o S, Cc:"n pos ::;', RatYI To msho i_I ca Fonseca p, She, ;(UZiletsova T al: Analysis of complete mitochondrial Acknowiedgements genorr:esfro m extinct and extant rhinoceroses reveals lack of The thar,k Michael 1�J1oore and Doug!as Soltis fo r p�oviding trle phylogenet1c resolution, 3iv1CEVOl 2009, 9(1 ):95 .. unpJblished o/eonda chloropiast ger'orT'e sequence, ',/V; n:rliOp Jempev\/o-f Kane Oste,·t k KL, Geeta 3,a�

15. Castoe TA, Poole AW, Gu WJ , de Koning APJ, Daza JM, Smith EN, 40. Hall TA: BioEdit: a user-friendly biological sequence alignment editor and

Pollock DD: Rapid identification of thousands of copperhead snake analysis program for Windows 95/98/NT. Nucl Acids Symp Ser 1999, (Agkistrodon contortrix) microsatellite loci from modest amounts of 454 41 :95-98. shotgun genome sequence. Mol Ecol Resour 2010, 10(2):341 -347. 41. Conant GC, Wolfe KH: GenomeVx: simple web-based creation of editable 16. Abdelkrim J, Robertson BC, Stanton JAL, Gemmell NJ: Fast, cost-effective circular chromosome maps. Bioinformatics 2008, 24(6):861-862. development of species-specific microsatellite markers by genomic 42. Sugiyama Y, Watase Y, Nagase M. Makita N, Yagura S, Hirai A, Sugiura M: sequencing. Biotechniques 2009, 46(3):1 85-1 92. The complete nucleotide sequence and multipartite organization of the 17. Moore MJ, Soltis PS, Bell CD, Burleigh JG, Soltis DE: Phylogenetic analysis tobacco mitochondrial genome: comparative analysis of mitochondrial of 83 plastid genes fu rther resolves the early diversification of . genomes in higher plants. Mol Genet Genomics 2005, 272(6):603-61 5. Proc Natl Acad Sci USA 201 0, 107(10):4623-4628. 43. Alverson AJ, Wei XX, Rice DW, Stern DB, Barry K, Palmer JD: Insights into 18. Bai X, Zhang W, Orantes L, Jun T-H, Mittapa lli 0, Mian MAR, Michel AP: the evolution of mitochondrial genome size from complete sequences Combining next-generation sequencing strategies for rapid molecular of Citrullus lanatus and Cucurbita pepo (Cucurbitaceae). Mol Bioi Evol resource development from an invasive aphid species, Aphis glycines. 201 0, 27(6):1436-1 448 PLoS ONE 20 1 0, 5(6):el 1370 .. 44. Kent WJ: BLAT - The BLAST-like alignment tool. Genome Res 2002, 19. Rasmann S, Agrawal AA, Cook SC, Erwin AC: Cardenolides, induced 12(4):656-664. responses, and interactions between above- and belowground 45. Knaus B: Short read toolbox.[hup:lIbria nknaus.comJ. herbivores of milkweed (Asclepias spp.). Ecology 2009, 90(9):2393-2404. 46. Gordon A: FASTX-Toolkit.[hup:llhannonlab.cshl.edu/fastx_toolkitlindex. 20. Broyles SB: Hybrid bridges to gene flow: A case study in milkweeds htmlJ. (Asclepias). Evolution 2002, 56(10): 1943-1953. 47. Li H, Durbin R: Fast and accurate short read alignment with Burrows­ 21. Wyatt R, Broyles S8: Ecology and evolution of reproduction in milkweeds. Wheeler transform. Bioinformatics 2009, 25(14):1 754-1 760. Annu Rev Ecol Syst 1994, 25:423-441 . 48. Nguyen P, Ma J, Pei D, Obert C, Cheng C, Geiger T: Identification of errors 22. Fishbein M, Venable DL: Evolution of design: Theory and introduced during high throughput sequencing of the T cell receptor data. Evolution 1996, 50(6):2165-2177. repertoire. BMC Genomics 201 1, 12(1 ):1 06.. 23. Agrawal AA, Fishbein M: Phylogenetic escalation and decline of plant 49. Gladman S, Seeman T: VelvetOptimiser version 2.1 .7.[hup:llbioinformatics. defense strategies. Proc Natl Acad SCi USA 2008, 105(29):1 0057-10060. net.au/software.velvetoptimiser.shtmIJ. 24. Agrawal AA, Fishbein M: Plant defe nse syndromes. Ecology 2006, 87(7): 50. Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Alvarado AS, S132-S1 49. Ya ndell M: MAKER: An easy-to-use annotation pipeline designed for 25. Malcolm SB: Cardenolide-mediated interactions between plants and emerging model organism genomes. Genome Res 2008, 18(1 ):188-1 96. herbivores. In Herbivores: their interactions with secondary plant metabolites 51. Smit A, Hubley R, Green P: RepeatMasker Open-3.0.[hup:llwww. Volume I: The chemical participants .. 2 edition. Edited by: Rosenthal GA, repeatmasker.orgJ. Berenbaum MR. San Diego: Academic Press; 1991 :25 1 -296. 52. Jurka J, Ka pitonov W, Pavlicek A, Klonowski P, Kohany 0, Walichiewicz J: 26. Murata J, Bienzle D, Brandle JE, Sen sen CW, De Luca V: Expressed Repbase update, a database of eukaryotic repetitive elements. Cyrogenet sequence tags from Madagascar periwinkle (Catharanthus roseus). FEBS Genome Res 2005, 110(1-4):462-467. Lett 2006, 580(1 8):4501-4507. 53. You FM, Huo NX, Gu YQ, Luo MC, Ma YQ, Hane D, Lazo GR, Dvorak J, 27. Murata J, Roepke J, Gordon H, De Luca V: The leaf epidermome of Anderson ODe BatchPrimer3: A high throughput web application for PCR Catharanthus roseus reveals its biochemical specialization. Plant Cell 2008, and sequencing primer design. BMC Bioinformatics 2008, 9:253. 20(3):524-542. 54. He J, Dai XB, Zhao XC: PLAN: a web platform for automating high­ 28. Shukla AK, Shasany AK, Gupta MM, Khanuja SPS: Transcriptome analysis in throughput BLAST searches and for managing and mining results. BMC Catharanthus roseus leaves and roots for comparative terpenoid indole Bioinformatics 2007, 8:53. alkaloid profiles. J Exp Bot 2006, 57(14):3921-3932. 55. Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W, 29. Abzhanov A, Extavour CG, Groover A, Hodges SA, Hoekstra HE, Kramer EM, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein Monteiro A: Are we there yet? Tracking the development of new model database search programs. Nucleic Acids Res 1997, 25(1 7):3389-3402. systems. Trends in Genet 2008, 24(7):353-360. 56. Hood GM: PopTools version 3.2.3.[http://www.poptools.orgJ. 30. Dolezel J, Bartos J, Vog lmayr H, Greilhuber J: Nuclear DNA content and 57. Wu F, Mueller LA, Crouzillat D, Petiard V, Ta nksley SD: Combining genome size of trout and human. Cy tom Part A 2003, 51 A(2):127-1 28. bioinformatics and phylogenetics to identify large sets of single-copy 31. Solexa, Inc: Protocol for Whole Genome Sequencing using Solexa orthologous genes (COSII) for comparative, evolutionary and systematic Technology. BioTechniques Protocol Guide 2007 New York: BioTechniques; studies: a test case in the euasterid plant clade. Genetics 2006, 2006, 291. 174(3):1407-1420. 32. Ratan A: Assembly algorithms for next-generation sequence data. PhD 58. Livshultz 1. Middleton DJ, Endress ME, Williams JK: Phylogeny of thesis The Pennsylvania State University, Computer Science and and the APSA clade (Apocynaceae s.I.). Ann Mo Bot Gard Engineering; 2009. 2007, 94(2):324-359. 33. Deicher AL, Phillippy A, Carlton J, Salzberg SL: Fast algorithms for large­ 59. Angiosperm DNA C -values Database (Release 6.0, Oct. 2005). [hup:l1 scale genome alignment and comparison. Nucleic Acids Res 2002, www.kew.org/cvaluesJ. 30(1 1 ):2478-2483. 60. Meve U: Species numbers and progress in Asclepiad . Kew Bull 34. Kurtz S, Phillippy A, Deicher AL, Smoot M, Shumway M, Antonescu C, 2002, 57(2):459-464. Salzberg SL: Versatile and open software for comparing large genomes. 61 . Ravi V, Khurana JP, Tyagi AK, Khurana P: An update on chloroplast Genome Bioi 2004, 5(2):R1 2 .. genomes. Plant Syst Evol 2008, 271 (1-2):1 01-122. 35. Ovcharenko I, Loots GG, Giardine BM, Hou MM, Ma J, Hardison RC, 62. Samson N, Bausher MG, Lee SB, Jansen RK, Daniell H: The complete Stubbs L, Miller W: Mulan: Multiple-sequence local alignment and nucleotide sequence of the coffee (Coffea arabica L.) chloroplast visualization for studying function and evolution. Genome Res 2005, genome: organization and implications for biotechnology and 15(1):184-1 94. phylogenetic relationships amongst angiosperms. Plant Biotechnol J 2007, 36. Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in 5(2):339-353. accuracy of multiple sequence alignment. Nucleic Acids Res 2005, 63. Jansen RK, Wojciechowski MF, Sanniyasi E, Lee S8, Daniell H: Complete 33(2):511-518. plastid genome sequence of the chickpea (Cicer arietinum) and the 37. Wyman SK, Jansen RK, Boore JL: Automatic annotation of organellar phylogenetic distribution of rps 12 and c1pP intron losses among genomes with DOGMA. Bioinformatics 2004, 20(1 7):3252-3255. legu mes (Leguminosae). Mol Phylogenet Evol 2008, 48(3):1 204-1217. 38. Milne I, Bayer M, Cardle L, Shaw P, Stephen G, Wright F, Marshall D: Tablet­ 64. Lee HL, Jansen RK, Chumley TW, Kim KJ: Gene relocations within next generation sequence assembly visualization. Bioinformatics 2010, chloroplast genomes of Jasminum and Menodora () are due to 26(3):401 -402. multiple, overlapping inversions. Mol Bioi Evol 2007, 24(5):1 161-1 180. 39. Zerbino D, Birney E: Velvet: algorithms for de novo short read assembly 65. Guisinger MM, Chumley TW, Kuehl N, Boore JL, Jansen RK: Implications of using de Bruijn graphs. Genome Res 2008, 18:821-829. the plastid genome sequence of Typha (Typhaceae, Poales) for Straub et 01. BMC Genomics 201 1, 12:21 1 Page 22 of 22 http://www.biomedcentral.com/1 471-21 64/1 2/21 1

understanding genome evolution in Poaceae. J Mol Evol 201 0, 87. Vitte C, Panaud 0: LTR retrotransposons and genome 70(2):149-166 size: emergence of the increase/decrease model. Cyrogenet Genome Res 66. Greiner S, Wa ng X, Rauwolf U, Silber MV, Mayer K, Meurer J, Haberer G, 2005, 110(1 -4) 91-1 07 Herrmann RG: The complete nucleotide sequences of the five genetically 88. Cavagnaro P, Chung S-M, Szklarczyk M, Grzebelus D, Senalik D, Atkins A, distinct plastid genomes of Oenothera, subsection Oenathera: I. Simon P: Characterization of a deep-coverage carrot (Daucus carota L.) Sequence evaluation and plastome evolution. Nucleic Acids Res 2008, SAC library and initial analysis of SAC -end sequences. Mol Genet 36(7) 2366-2378 Genomics 2009, 281 (3):273-288 67. Guisinger MM, Kuehl JV, Boore JL, Jansen RK Extreme reconfiguration of 89. Cavallini A, Natali L, Zuccolo A, Giordani T, Jurman I, Ferrillo V, plastid genomes in the angiosperm fa mily Geraniaceae: Vitacolonna N, Sarri V, Cattonaro F, Ceccarelli M, et 01: Analysis of Rearrangements, repeats, and codon usage. Mol Bioi Evol 201 1, transposons and repeat composition of the sunflower (Helianthus 28(1):583-600 annuus L.) genome. Theor Appl Genet 201 0, 120(3):491-508. 68. Jansen RK, Cai Z, Raubeson LA, Daniell H, dePamphilis CW, Leebens-Mack J, 90. Datema E, Mueller L, Buels R, Giovannoni J, Visser R, Stiekema W, van MOiler KF, Guisinger-Bellian M, Haberle RC, Hansen AK, et 01: Analysis of 81 Ham R: Comparative BAC end sequence analysis of tomato and potato genes from 64 plastid genomes resolves relationships in angiosperms reveals overrepresentation of specific gene fa milies in potato. BMC Plant and identifies genome-scale evolutionary patterns. Proc Natl Acod SCi USA BioI 2008, 8(1 )34 2007, 104(49) 19369-1 9374 91 . Whiteford N, Haslam N, Weber G, Prugel-Bennett A, Essex J, Roach P, 69. Haberle RC, Fourcade HM, Boore JL. Jansen RK: Extensive rearrangements Bradley M, Neylon C An analysis of the feaSibility of short read in the chloroplast genome of Tra chelium caeruleum are associated with sequencing. Nucl Acids Res 2005, 33(1 9) e 171.. repeats and tRNA genes. J Mol Evol 2008, 66(4)350-361 . 92. Rogers SO, Bendich AJ: Ribosomal-RNA genes in plants - Variability in 70. Jo Y, Park J, Kim J, Song W, Hur CoG, Lee Y-H, Kang S-C Complete copy number and in the intergenic spacer. Plant Mol BioI 1987, sequencing and comparative analyses of the pepper (Capsicum annuum 9(5):509-520. L.) plastome revealed high frequency of tandem repeats and large 93. Sastri DC, Hilu K, Appels R, Lagudah ES, Playford J, Baum BR: An overview insertion/deletions on pepper plastome. Plant Cell Rep 201 0, 1-13. of evolution in plant 5S DNA. Plant Syst Evo/ 1992, 183(3):1 69-1 81. 71. Cai ZQ, Guisinger M, Ki m HG, Ruck E, Blazier JC, McMurtry V, Kuehl JV, 94. Hillis DM, Dixon MT: Ribosomal DNA: Molecular evolution and Boore J, Jansen RK: Extensive reorganization of the plastid genome of phylogenetic inference. Q Rev BioI 1991, 66(4):41 1-453. Trifolium subterraneum (Fabaceae) is associated with numerous repeated 95. Vo lkov R, Kostishin S, Ehrendorfer E, Schweizer D: Molecular organization sequences and novel DNA insertions. J Mol Evol 2008, 67(6):696-704. and evolution of the external tra nscribed rDNA spacer region in two n Arthofer W, Schuler S, Steiner FM, Schlick-Steiner BC Chloroplast DNA­ diploid relatives of Nicotiana tabacum (Solanaceae). Plant Sys t Evol 1996, based studies in molecular ecology may be compromised by nuclear­ 201 (1 -4) 1 1 7-1 29. encoded plastid sequence. Mol Ecol 201 0, 19(1 8):3853-3856. 96. Borisjuk NV, Davidjuk YM, Kostishin SS, Miroshnichenco GP, Velasco R, 73. Parks M, Cronn R, Liston A: Increasing phylogenetic resolution at low Hemleben V, Vo l kov RA: Structural analysis of rDNA in the genus taxonomic levels using massively parallel sequencing of chloroplast Nicotiana. Plant Mol BioI 1997, 35(5):655-660. genomes. BMC Biology 2009, 784. 97. Bena G, Jubier M-F, Olivieri I, Lejeune B: Ribosomal external and internal 74. Neubig K, Whitten W, Carlsward B, Blanco M, Endara L, Wi lliams N, transcribed spacers: Combined use in the phylogenetic analysis of

Moore M: Phylogenetic utility of yef1 in orchids: a plastid gene more Medicago (Leguminosae). J Mol Evol 1998, 46(3)299-306. ' variable than matK. Plant Sys t Evol 2009, 277(1)75-84. 98. Volkov RA, Komarova NY, Panchuk II, Hemleben V: Molecular evolution of 75. Kuroda H, Maliga P: The plastid clpP1 protease gene is essential for plant rDNA external transcribed spacer and phylogeny of sect. Petota (genus development. Nature 2003, 425(6953):86-89. Solanum). Mol Phylogenet Evol 2003, 29(2) 187-202. 76. Drescher A, Ruf S, Calsa T, Carrer H, Bock R: The two largest chloroplast 99. Maggini F, Gelati MT, Spolverini M, Frediani M: The intergenic spacer genome-encoded open reading frames of higher plants are essential region of the rDNA in Olea europaea L. Tree Genet Genom 2008, genes. Plant J 2000, 22(2) 97-1 04. 4(2):293-298 77. Kode V, Mudd EA, lamtham S, Day A: The tobacco plastid accD gene is 100. Cronn RC, Zhao XP, Paterson AH, Wendel JF: Polymorphism and concerted essential and is required for leaf development. Plant J 2005, evolution in a tandemly repeated gene fa mily: 55 ribosomal DNA in 44(2)237-244 diploid and allopolyploid cottons. J Mol Evol 1996, 42(6) 685-705 78. Konishi T, Shinohara K, Yamada K, Sasaki Y: Acetyl-CoA carboxylase in 101. Fujiwara M, Inafu ku J, Ta keda A, Watanabe A, FUjiwara A, Kohno S, higher plants: Most plants other than Gramineae have both the Kubota S: Molecular organization of 55 rDNA in bitterlings (Cyprinidae). prokaryotic and the eukaryotic forms of this enzyme. Plant Cell Physiol Genetica 2009, 135(3):355-365. 1996, 37(2) 117-122. 102. Saini A, Jawali N: Molecular evolution of 55 rDNA region in Vig na 79. Gornicki P, Faris J, King I, Podkowinski J, Gill B, Haselkorn R: Plastid­ subgenus Ceratotropis and its phylogenetic implications. Plant Sys t Evol localized acetyl-CoA carboxylase of bread wheat is encoded by a single 2009, 280(3-4) 187-206. gene on each of the three ancestral chromosome sets. Proc Notl Acod Sci 103. Stage DE, EiGkbush TH: Sequence variation within the rRNA gene loci of USA 1997, 94(25) 141 79-14184. 12 Drosophila species. Genome Res 2007, 17(1 2):000. 80. Kubo T, Mikami T: Organization and variation of angiosperm 104. Ganley ARD, Kobayashi T: Highly efficient concerted evolution in the mitochondrial genome. Physiol Plant 2007, 129(1 ):6-13. ribosomal DNA repeats: Tota l rDNA repeat variation revealed by whole­ 81. Knoop V: The mitochondrial DNA of land plants: peculiarities in genome shotgun sequence data. Genome Res 2007, 17(2):1 84-1 91. phylogenetic perspective. Curr Genet 2004, 46(3):1 23-139. 105. Kolosha VO, Fodor I: Nucleotide sequence of Citrus limon 265 ribosomal­ 82. Sanchez-Puerta MV, Cho Y, Mower JP, Alverson AJ, Palmer JD: Frequent, RNA gene and secondary structure model of its RNA. Plant Mol BioI 1990, phylogenetically local horizontal transfer of the coxl group I intron in 14(2):1 47-161 flowering plant mitochondria. Mol BioI Evol 2008, 25(8):1 762-1777. 106. Nei M, Rooney AP: Concerted and birth-and-death evolution of 83. Cusimano N, Zhang L-S, Renner SS: Reevaluation of the cox1 group I multigene fa milies'. Annu Rev Genet 2005, 39(1):121-152. intron in Araceae and angiosperms indicates a history dominated by 107. Baxter I, Tchieu J, Sussman MR, Boutry M, Palmgren MG, Gribskov M, loss rather than horizontal transfer. Mol BioI Evol 2008, 25(2):265-276. Harper JF, Axelsen KB: Genomic comparison of P-Type ATPase ion pumps 84. Bennetzen JL: The contributions of retroelements to plant genome in Arabidopsis and Rice. Plant Physiol 2003, 132(2):61 8-628. organization, function and evolution. Trends Microbial 1996, 4(9)347-353. 108. Thomson R(, Wang IJ, Johnson JR: Genome-enabled development of 85. Kumar A, BennelZen JL: Plant retrotransposons. Annu Rev Genet 1999, DNA markers for ecology, evolution and conservation. Mol Ecol 20 10, 33:479-532. 19(1 1)2184-2195. 86. Vitte C, BennelZen JL: Analysis of retrotransposon structural diversity uncovers properties and propensities in angiosperm genome evolution. doi:1 0.1 1 86/1 471-2164-12-21 1 Proc Natl Acod Sci 2006, 103(47):1 7638-1 7643 Cite this article as: Straub er 01: Building a model: developing genomic resources for common milkweed (Asclepias syriaca) with low coverage genome sequencing. BMC Genomics 201 1 12:2 1 1.