New Softwares for Automated Microsatellite Marker Development

Published online February 21, 2006. Nucleic Acids Research, 2006, Vol. 34, No. 4, e31. doi:10.1093/nar/gnj030

Wellington Martins, Daniel de Sousa¹, Karina Proite², Patrícia Guimarães², Marcio Moretzsohn² and David Bertioli³

Department of Computer Science, Catholic University of Goiás, Brazil, ¹Department of Computer Science, Catholic University of Rio de Janeiro, Brazil, ²Embrapa Genetic Resources and Biotechnology, Brasília, Brazil and ³Genomic Sciences and Biotechnology, Catholic University of Brasília, Brazil

Received November 23, 2005; Revised and Accepted January 31, 2006

*To whom correspondence should be addressed. Tel: +55 62 32271381; Fax: +55 62 32271073; Email: [email protected]
Correspondence may also be addressed to David Bertioli. Tel: +55 61 34484735; Fax: +55 61 33403658; Email: [email protected]

© The Author 2006. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected]

ABSTRACT

Microsatellites are repeated small sequence motifs that are highly polymorphic and abundant in the genomes of eukaryotes. Often they are the molecular markers of choice. To aid the development of microsatellite markers we have developed a module that integrates a program for the detection of microsatellites (TROLL) with the sequence assembly and analysis software, the Staden Package. The module has easily adjustable parameters for microsatellite lengths and base pair quality control. Starting with large datasets of unassembled sequence data in the form of chromatograms and/or text data, it enables the creation of a compact database consisting of the processed and assembled microsatellite-containing sequences. For the final phase of primer design, we developed a program that accepts the multi-sequence 'experiment file' format as input and produces a list of primer pairs for amplification of microsatellite markers. The program can take into account the quality values of consensus bases, improving the success rate of primer pairs in PCR. The software is freely available and simple to install in both Windows and Unix-based operating systems. Here we demonstrate the software by developing primer pairs for 427 new candidate markers for peanut.

INTRODUCTION

A molecular marker is an identifiable DNA region whose inheritance can be followed along generations. Molecular markers have become a powerful tool in many areas of biology. Microsatellites, also known as simple sequence repeats (SSR) or short tandem repeats (STR), are repeated sequence motifs whose unit of repetition is between 1 and 6 bp. They are highly abundant in the genomes of eukaryotes, polymorphic, usually co-dominant and transferable between different mapping populations. Microsatellite markers can also be used in automated genotyping techniques. Thus, they have become one of the most useful molecular markers for a large number of organisms.

Researchers working on the development of microsatellite markers need an efficient way to go from what are usually hundreds or thousands of trace and/or text sequence files to the identification of new potential markers. However, the software currently available for data-mining microsatellites is either restricted to small datasets, not available for Windows, or not targeted towards marker development. In addition, these programs do not offer base quality control and, with one exception (RepeatMasker when used with Staden), they are not integrated into sequence analysis software, and as such require already assembled data as input (1–3; http://www.maizemap.org/bioinformatics/SSRFINDER/README.ssrfinder and http://www.repeatmasker.org/).

We have developed a Staden module that integrates the program TROLL (4) into the Staden Package. TROLL is an open source program based on the Aho-Corasick algorithm for the detection of perfect microsatellites. The Staden Package is a free, open source software tool used for sequence assembly and analysis (5) (http://staden.sourceforge.net/). The Staden Package, working with the module and TROLL, enables efficient processing of a large number of sequences in chromatogram and/or text form to produce a small database containing only microsatellite-containing sequences that have been processed, assembled into contigs and not previously used for marker development. For the primer design we have developed a program that uses Primer3 (6) to design primers flanking the SSRs located by the module. It is available as a stand-alone program and through a Web service. This program accepts multiple sequences in the 'experiment file' format of Staden as input and produces a list of primer pairs for amplification of microsatellite markers. The program can take into account the quality values of consensus bases, improving the success rate of primer pairs in PCR. The Staden Package, TROLL and the module are freely available for both Linux and Windows platforms.

MATERIALS AND METHODS

General details of the software

The TROLL module was developed in the Tcl/Tk scripting language, following the Staden Package module development specifications. It works through Pregap4, reading the Staden Experiment Files one by one. These files contain, among other information, the DNA sequences, the corresponding quality values and any vector contamination masking information. All sequences are then concatenated, with a special character (#) used as a separator. TROLL then locates the SSRs. These are filtered, disregarding those in low quality or vector contaminated regions, and the result is written back to the Experiment Files. A file containing the names of the experiment files with microsatellites is generated (pregap.SSR.passed), and this can be entered into the Gap4 interface of Staden to create a database consisting entirely of microsatellite-containing sequences. The tags added to the experiment files may be used by Gap4 to aid in assembly by masking and to visualize the microsatellites. Additionally, the tags are used by the primer design program to define the target region around which Primer3 designs primers (see below).

The following module configuration options are available:

(i) Tag type: used to control the Tag type used to mark microsatellite repeats.
(ii) Motif length: used to set the motif lengths (mono, di, tri, tetra, penta and/or hexa) that will be used in the search for SSRs.
(iii) Repeat min: used to set the minimum number of motif repetitions for SSRs of each motif length.
(iv) Motif file: the location of the file containing the motif list required by the TROLL program.
(v) Create SSR.passed file?: for the creation of a file with the names of all sequences containing SSRs. This file is entered into Gap4 to create a database of microsatellite-containing sequences.
(vi) Create no_SSR_passed file?: used to create a file with the names of all sequences that do not contain SSRs.
(vii) Filter quality: used to set the base quality threshold which will be used when analyzing SSR regions.

The program for primer design is responsible for parsing the experiment format file exported from Gap4 and calling [...] outputs a tab-separated file (primers.passed) that includes the contig name, primer names, microsatellite sequence(s), primer sequences, annealing temperatures and the quality of the contig sequences with the primers annealed. In addition, the program provides three other output files: marker_seqs.fasta, containing the sequences for which primer design was successful; not_marker_seqs.fasta, with the sequences for which primer design was not successful; and primers.failed, which also contains the sequences for which primer design was not successful, but in the experiment file format of Staden. The primers.failed file can be used for another round of analysis with the primer design program under less stringent primer design conditions.

Downloading and using the programs

Download and install:

(i) The Staden Package from http://staden.sourceforge.net/.
(ii) The TROLL module (TROLL.p4m) from http://finder.sourceforge.net/ (follow the link to Resources). Troll_p4m is distributed as a zip file containing the module itself, troll binaries (Linux and Windows), a DLL Windows library and installation instructions.
(iii) The primer design program from http://finder.sourceforge.net/ (follow the link to Resources). The primer design program is available as a Web service, requiring no program installation. Alternatively, it can be downloaded as a Perl script and installed locally. For the Perl script to work you will need to install Primer3, which is available from http://frodo.wi.mit.edu/primer3/primer3_code.html.
(iv) The step-by-step protocol for use of the module from http://finder.sourceforge.net/ (follow the link to Resources). The protocol is written for novice users of the Staden Package. However, to use the TROLL module to its full potential, the user needs to be familiar with the Staden Package. The course documentation that is downloaded with the software is highly recommended (for Windows users in C:\Program files\Staden package\course\).

Examples of marker development from three datasets

TC-repeat enriched genomic library. A peanut (Arachis hypogaea) genomic DNA library enriched for TC repeats was made using genomic DNA cut with the restriction enzyme Sau3AI, oligonucleotides and magnetic beads according to a standard protocol (7). Two hundred and eighty genomic clones were sequenced using the forward primer, and plasmids that contained microsatellites were identified using the module.
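The identification step performed by the module (concatenating reads with a '#' separator, locating perfect SSRs by motif length and minimum repeat count, then discarding those in low quality regions) can be illustrated with a minimal Python sketch. This is not the TROLL implementation: TROLL uses the Aho-Corasick algorithm with a motif file, whereas the sketch below substitutes a simple greedy scan. The function names, the repeat thresholds in DEFAULT_MIN_REPEATS, the quality cut-off of 20 and the rule "every base in the repeat must meet the threshold" are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, NamedTuple, Sequence, Tuple

SEPARATOR = "#"  # separator character mentioned in the text

# Hypothetical thresholds: minimum repeat count per motif length (bp).
DEFAULT_MIN_REPEATS: Dict[int, int] = {1: 12, 2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


class SSR(NamedTuple):
    read: str     # name of the read containing the repeat
    start: int    # 0-based start position within that read
    motif: str    # repeat unit, e.g. "TC"
    repeats: int  # number of tandem copies


def is_primitive(motif: str) -> bool:
    """True unless the motif is itself a repeat of a shorter unit (e.g. TCTC)."""
    return motif not in (motif + motif)[1:-1]


def scan_perfect_repeats(seq: str, min_repeats: Dict[int, int]) -> List[Tuple[int, str, int]]:
    """Greedy left-to-right scan for non-overlapping perfect tandem repeats."""
    hits, i = [], 0
    while i < len(seq):
        found = None
        for k in sorted(min_repeats):  # try the shortest motif first
            motif = seq[i:i + k]
            # Motifs must be pure ACGT, so a hit can never cross a '#'.
            if len(motif) < k or not set(motif) <= set("ACGT") or not is_primitive(motif):
                continue
            copies = 1
            while seq.startswith(motif, i + copies * k):
                copies += 1
            if copies >= min_repeats[k]:
                found = (i, motif, copies)
                break
        if found:
            hits.append(found)
            i += len(found[1]) * found[2]  # jump past the repeat
        else:
            i += 1
    return hits


def find_ssrs(reads: Sequence[Tuple[str, str, Sequence[int]]],
              min_repeats: Dict[int, int] = DEFAULT_MIN_REPEATS,
              min_quality: int = 20) -> List[SSR]:
    """reads: (name, sequence, per-base quality values) for each read."""
    # Concatenate all reads with '#' and remember where each read starts.
    offsets, pos = [], 0
    for _, seq, _ in reads:
        offsets.append(pos)
        pos += len(seq) + len(SEPARATOR)
    joined = SEPARATOR.join(seq.upper() for _, seq, _ in reads)

    result = []
    for start, motif, copies in scan_perfect_repeats(joined, min_repeats):
        idx = bisect_right(offsets, start) - 1  # which read the hit belongs to
        local = start - offsets[idx]
        quals = reads[idx][2][local:local + len(motif) * copies]
        # Assumed filter: keep the SSR only if every base meets the threshold.
        if min(quals) >= min_quality:
            result.append(SSR(reads[idx][0], local, motif, copies))
    return result
```

Calling find_ssrs with a list of (name, sequence, qualities) tuples returns one SSR record per repeat that passes the filter. Raising min_quality or the values in the min_repeats dictionary plays the same role as the "Filter quality" and "Repeat min" options described above.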
