Freeware for Sequence Analyses
Michael Kube

1656 genomes in June 2011
http://www.genomesonline.org/
(Kube0711)

General need for tools and databases!

General information on bioinformatic tools & databases
http://nar.oxfordjournals.org/ (Database issue, January 2011)
http://www.biomedcentral.com/bmcbioinformatics/
http://www.ploscompbiol.org/
http://bioinformatics.oxfordjournals.org/

Some useful links providing information on bioinformatics software
Mac
http://www.apple.com/science/profiles/osxporting/ and the SEQanswers overview
Windows
http://molbiol-tools.ca/molecular_biology_freeware.htm
http://www.bioinformatics.org/softwaremap/?form_cat=2
Linux and other operating systems
http://seqanswers.com/wiki/Software
The percentage of 64-bit Windows applications will increase within the next years!

The Input - Read, Reads, Next Generation Sequencing
Scale and objective make a difference in the strategy in bacterial genomics
Small projects: 1-20 Sanger reads
Medium projects: 100-200’000 Sanger reads
High-throughput projects: 10’000-100’000’000 reads provided by different platforms in different data formats
Most of the large projects cannot be handled on Windows or Mac systems due to the demands of the data pipelines on software, RAM and CPUs!
Linux is needed as operating system, together with adapted software packages!

Some categories of bioinformatics software
* Handling of sequences, or from raw sequence to contig
-> simple sequence viewer
-> assembler
-> sequence editing and finishing
* Public databases and bioinformatics tools
-> data storage
-> data search
-> data analysis
* Annotation tools and integrated solutions
-> structural RNAs
-> protein content
* Phylogenetic software
* Toolboxes

PCR templates
Tools for PCR primer design
Overview of the primer design software
-> in general, e.g.
- http://molbiol-tools.ca/PCR.htm
- http://www.science.co.il/biomedical/primer-tools.asp
-> specific/unique
- http://www.ncbi.nlm.nih.gov/tools/primer-blast/
And something nice from NEB to calculate the optimal annealing temperature, taking the composition of the PCR mix into account
- http://www.neb.com/nebecomm/tech_reference/TmCalc/Default.asp

Inspect sequences
Sequence viewers and simple processing tools
Some software intended for this purpose:
DNA Chromatogram Explorer Lite (www.dnabaser.com)
FinchTV (www.geospiza.com)
CLC Sequence Viewer (www.clcbio.com)
etc.
Common Windows software not intended for this purpose but usable:
MEGA5 (www.megasoftware.net)
BioEdit (www.mbio.ncsu.edu/bioedit/bioedit.html)
Common scientific platform including features for data processing and editing:
GAP4/5 of the Staden package (sourceforge.net/projects/staden)
Unprocessed data formats can be used as input. Data format conversion/export and editing functions are included.
Software is limited to low-scale/small projects, except GAP, which allows
-> sequence assembly
-> the handling of medium-size projects and HT projects in combination with other software

DNA Chromatogram Explorer Lite
http://www.dnabaser.com/download/chromatogram-explorer/index.html
FinchTV
http://www.geospiza.com/Products/finchtv.shtml
CLC Sequence Viewer
-> allows the import of large data sets
http://www.clcbio.com/

Processing & Assembly
Sequence assembly - some terms
Sequence data
Read: one DNA sequence derived from one trace chromatogram
Trace: short for one trace chromatogram after gel or capillary electrophoresis
Assembly
Singlet: a sequence that represents a single object (-> stays alone)
Contig: consists of several objects (sequences) which are connected in an alignment, but each sequence present in a database is also a contig
(old definition: “A contig is a set of gel readings that are related to one another by overlap of their sequences. All gel readings belong to one and only one contig, and each contig contains at least one gel reading. The gel readings in a contig can be summed to form a contiguous consensus sequence and the length of this sequence is the length of the contig.”, by James Bonfield)
Debris: sequences too short (after trimming), megahubs, singlets (during the assembly), low-coverage alignments (threshold)
Consensus sequence: calculated from a multiple sequence alignment, taking into account the quality value for each base.

Some data formats/file extensions
.ab1
Raw DNA data taken from a scientific instrument and output from Applied Biosystems' Sequencing Analysis Software; encodes an electropherogram, DNA base sequence and associated tags (e.g. instrument, key, value etc.)
.scf
SCF format files are used to store data from DNA sequencing instruments. Each file contains the data for a single reading: trace sample points, called sequence, positions of the bases relative to the trace sample points, and numerical estimates of the accuracy of each base (the same data are stored within the database format .exp).
.fasta (.fas .fa)
>gi|304309652:19179-20701 Gamma proteobacterium HdN1, complete genome
AGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAACAGGC
CTTCGGGTGCTGACGAGCGGCGGACGGGTGAGGATAGCGCAGGAATCTGCCTTGTAGTGGGGGATAGTCC
.fastq
@3926
ATAAGTTGTCATCACGCCTCTCCTTTCTGGTATGTGATCTGAGTATTGGCGATATTTTAAATATGTTTATGGAATTATCTGGATAATATGGCACCTAAACG
+3926
ceY\addcddeeedeaaeaeeed^eeceeeeeeeeddccdeacdde\eeee\ede\\ccb^dZ`d`dbcc\`T_Y__`c^ca\ccddd`\a``T_b`LTa`
Internal extensions or data formats
.cons (only internal), .staden (software-specific data format), .aux (“old” storage format of GAP4)

Examples: Assembly software
Phrap*: Sanger-derived reads, considering read pairs (requires high RAM and CPU resources)
MIRA*: Sanger and NGS reads, considering read pairs (requires high RAM and CPU resources), still one of the most powerful de novo assemblers
CLC Assembly engine (commercial solution, free trial): Sanger and NGS reads, considering read pairs (low requirements on RAM and CPU, fast)
Newbler: Sanger and NGS reads, considering read pairs
Velvet: NGS reads, considering read pairs (low requirements on RAM and CPU)
And several others (a new one each month): SSAKE, SHARCGS, VCAKE, Celera Assembler, Euler, ABySS, AllPaths, and SOAPdenovo
References
Performance comparison: see Zhang et al., 2011; PMID: 21423806
Assembly algorithms: see Miller et al., 2010; PMID: 20211242
*MIRA (Mimicking Intelligent Read Assembly; Chevreux, B., Wetter, T. and Suhai, S. (1999): Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56.)
*Gordon, David. "Viewing and Editing Assembled Sequences Using Consed", in Current Protocols in Bioinformatics, A. D. Baxevanis and D. B. Davison, eds, New York: John Wiley & Co., 2004, 11.2.1-11.2.43.
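
The FASTA and FASTQ record layouts shown in the format list above are simple enough to read without special software. The following is a minimal sketch, not part of the original slides: the function names `read_fasta` and `read_fastq` are hypothetical, FASTQ records are assumed to occupy exactly four lines each, and the quality offset of 64 is an assumption suggested by the characters in the slide's FASTQ example (many current instruments use an offset of 33 instead).

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip empty lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fastq(handle: TextIO, offset: int = 64) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (name, sequence, qualities) from four-line FASTQ records.

    `offset` is the ASCII offset of the quality encoding (assumed 64 here).
    Reading stops at end of input or at the first empty line.
    """
    while True:
        name = handle.readline().rstrip("\n")
        if not name:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not name.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near " + repr(name))
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in " + repr(name))
        # Each quality character encodes one per-base score.
        yield name[1:], seq, [ord(c) - offset for c in qual]


if __name__ == "__main__":
    # Made-up demo records, not taken from the slides.
    demo_fasta = io.StringIO(">read1 demo\nACGT\nTTGA\n>read2\nGG\n")
    for header, seq in read_fasta(demo_fasta):
        print(header, len(seq))

    demo_fastq = io.StringIO("@r1\nACGT\n+r1\nhhTe\n")
    for name, seq, quals in read_fastq(demo_fastq):
        print(name, seq, quals)
```

Both functions are generators, so a large read file is processed one record at a time instead of being loaded into memory as a whole.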

Some assembly storage formats
Becoming the default for a lot of assembly software: .sam (and .bam)
SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.
-> SAM is a tab-delimited text format (easy to understand, to parse, to generate and to check, but slow to parse)
-> stores all the alignment information generated by various alignment programs
-> easily generated by alignment programs or converted from existing alignment formats
-> compact in file size
-> allows most operations on the alignment to work on a stream without loading the whole alignment into memory
-> allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus
(Li et al., 2009; PMID: 19505943)
BAM ⇒ the binary equivalent of SAM, used for intensive data processing.

Editing - larger assemblies will not result in one contig
Copy control fosmid libraries (37-45 kb inserts)
High copy plasmid libraries (1.5 and 2.5 kb inserts)
Pyrosequencing reads
Consed overview

Platforms for editing of the assembled sequence (Freeware)
GAP4/5 of the Staden package (http://staden.sourceforge.net/)
Advantages:
Easy to install and to handle
Disadvantages:
Large-scale projects need external assemblers
Read-pair information is not visualized
No autofinisher function
Software environment is limited on Windows
GAP5 development is in progress
Bonfield JK, Whitwham A. Gap5--editing the billion fragment sequence assembly. Bioinformatics. 2010 Jul 15;26(14): 1699-703. PMID: 20513662

Platforms for editing of the assembled sequence (Freeware)
Consed (www.phrap.org/consed/consed.html)
Advantages:
best software available
Disadvantages:
installation is difficult
needs external assemblers
uses ACE format as input, resulting in several limits and pitfalls
no Windows version
Gordon, D., C. Abajian, and P. Green. 1998. Consed: A Graphical Tool for Sequence Finishing. Genome Research. 8:195-202
Gordon, D., C. Desmarais, and P. Green.
2001. Automated Finishing with Autofinish. Genome Research. 11(4): 614-625.

Commercial solutions
Keep in mind! So far, commercial solutions show reduced performance and offer a more limited editor by comparison! This situation will hopefully change within the next years.
Some examples running on Windows:
⇒ DNAstar
⇒ Sequencher
⇒ CLC workbench
etc.

Data analysis
Some important resources for sequence analysis
NCBI software http://www.ncbi.nlm.nih.gov/guide/sequence-analysis/
EBI tools http://www.ebi.ac.uk/Tools/
EMBOSS http://emboss.sourceforge.net/
Center for Biological Sequence