TopHat Documentation

Description: A fast splice junction mapper for RNA-seq reads.
Author: Cole Trapnell et al., University of Maryland Center for Bioinformatics and Computational Biology, [email protected]
TopHat Version: 2.0.8b
Contact: Marc-Danie Nazaire, [email protected]

Summary

TopHat is a fast splice junction mapper. TopHat uses Bowtie to map RNA-seq reads to a reference genome, then analyzes the mapping results to identify splice junctions between exons. The software is optimized for reads 75 bp or longer. TopHat was created at the University of Maryland Center for Bioinformatics and Computational Biology. This document is adapted from the TopHat documentation for release 2.0.8b.

Usage

TopHat takes RNA-seq read files in FASTA/FASTQ format as input and maps those reads against a reference genome, using Bowtie as its alignment engine. Bowtie (as of version 2.1.0) does not allow alignments between a read and the genome that contain large gaps, so TopHat takes the reads that Bowtie cannot align and breaks them into smaller pieces called segments. These segments will often align to the genome when they are processed independently. When several of a read's segments align to the genome far apart from each other (that is, between 100 bp and several hundred kilobases), TopHat infers that the read spans a splice junction and estimates where that junction's splice sites are.

TopHat was designed to work with reads produced by the Illumina Genome Analyzer, although users have been successful in using TopHat with reads from other technologies.

The TopHat 2.0.8b GenePattern module does not support:
• short (fewer than a few nucleotides) insertions and deletions in the reported alignments
• mixing paired- and single-end reads together
• colorspace reads

GenePattern installs the required packages for TopHat automatically: SAMtools and Bowtie.
GenePattern installs these packages separately from any existing versions, so they will not interfere with other instances of these tools. For more information on using RNA-seq modules in GenePattern, see the RNA-seq Analysis page.

IMPORTANT NOTES:
• The first time you run a job with a given prebuilt index, the job may fail. Please re-run your job. It should work on the second run. If you encounter a recurring problem with your jobs failing in TopHat, contact [email protected].
• TopHat is memory intensive and takes several hours to run.

References

Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14:R36.
• http://genomebiology.com/2013/14/4/R36/abstract

Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols. 2012;7:562–578.
• http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105–1111.
• http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp120

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
• http://genomebiology.com/2009/10/3/R25

Links

TopHat: http://tophat.cbcb.umd.edu/
TopHat documentation: http://tophat.cbcb.umd.edu/manual.html

Parameters

Name: Description

prebuilt bowtie index: Use the drop-down list to specify one of a number of prebuilt indexed genome references. If this list does not include the genome you require, an indexed genome can be generated using the Bowtie.indexer GenePattern module. Either a prebuilt OR a custom Bowtie index must be specified.

custom bowtie index: A ZIP archive containing Bowtie genome reference index files. Either a prebuilt OR a custom Bowtie index (i.e., Bowtie version 2.0 or greater) must be specified.

reads pair 1 (required): Unpaired reads file or first mate for paired reads. This can be a file in FASTA or FASTQ format (bz2 and gz compressed files are supported), a ZIP archive containing FASTA or FASTQ files, or a directory that is accessible to the GenePattern server containing FASTA or FASTQ files. For more information on FASTA/FASTQ formats, see the Input Files section.

reads pair 2 (optional): Second mate for paired reads. This can be a file in FASTA or FASTQ format, or a ZIP archive containing FASTA or FASTQ files.

mate inner dist (optional): The expected mean inner distance between mate pairs. For example, for paired-end runs with fragments selected at 300 bp, where each end is 50 bp, you should set this to 200 (300 − 2 × 50). Default: 50

mate std dev (optional): The standard deviation for the distribution of inner distances between mate pairs. This does not have to be specified for paired-end reads.

library type (optional): If you want to run TopHat on strand-specific reads, you must specify a library type. Options include:
• Standard Illumina (fr-unstranded)
• dUTP, NSR, NNSR (fr-firststrand)
• Ligation, Standard SOLiD (fr-secondstrand)

Bowtie preset options (optional): A combination of pre-packaged options for Bowtie based on speed and sensitivity/accuracy.

GTF file (optional): A GTF (v. 2.2) or GFF3 file containing a list of gene model annotations (that is, exon annotations that provide a virtual transcriptome). If this file is provided, TopHat will extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the reference genome.
The reads that did map onto the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final TopHat output.

transcriptome only (optional): Whether to align the reads to the virtual transcriptome (requires a file supplied in the GTF file parameter) and report only those mappings as genomic mappings. Default: no

max transcriptome hits (optional): The maximum number of mappings allowed for a read when it is aligned to the virtual transcriptome (requires a file supplied in the GTF file parameter). Any reads found with more than this number of mappings will be discarded.

prefilter multihits (optional): When mapping reads on the virtual transcriptome (requires a file supplied in the GTF file parameter), some repetitive or low complexity reads that would be discarded in the context of the genome may appear to align to the transcript sequences and thus may end up reported as mapped to those genes only. This option directs TopHat to first align the reads to the whole genome, then exclude such multi-mapped reads.

raw junctions file (optional): A file containing raw junctions. Junctions are specified one per line in a tab-delimited format.

find novel junctions (optional): If you select no, then the module will only look for junctions indicated in the GTF file supplied in the GTF file parameter. (This parameter is ignored when no GTF file is specified.) Default: yes

min anchor length (optional): The anchor length. TopHat will report junctions spanned by reads with at least this many bases on each side of the junction. Note that individual spliced alignments may span a junction with fewer than this many bases on one side. However, every junction involved in spliced alignments is supported by at least one read with this many bases on each side. This value must be at least 3. Default: 8

max splice mismatches (optional): The maximum number of mismatches that may appear in the "anchor" region of a spliced alignment. Default: 0

min intron length (optional): The minimum intron length. TopHat will ignore donor/acceptor pairs closer than this many bases apart.

max intron length (optional): The maximum intron length. When searching for junctions, TopHat will ignore donor/acceptor pairs farther than this many bases apart, except when such a pair is supported by a split segment alignment of a long read. Default: 500000

max insertion length (optional): The maximum insertion length TopHat will allow. Default: 3

max deletion length (optional): The maximum deletion length TopHat will allow. Default: 3

quality value scale (optional): Whether to use the Solexa or Solexa v. 1.3 (Phred 64) quality value scale. (For more information on these quality scores, see this review article.)

quality value files 1 (optional): A ZIP file containing separate quality value files for single-end reads or the first pair of paired-end reads.

quality value files 2 (optional): A ZIP file containing separate quality value files for the second pair of paired-end reads.

integer quals (optional): Quality values are space-delimited integer values. Default: no

max multihits (optional): Specifies the number of times a read can be aligned to the reference genome. If a read is aligned more than this number of times, then TopHat will choose the alignments based on their alignment scores, reporting the alignments with the best alignment scores. If there are more than this number of alignments with the same score for a read, TopHat will randomly report only this many alignments. Default: 20

read mismatches (optional): Specifies the number of mismatches allowed in the initial read mapping in each read alignment. Final read alignments with more than this many mismatches are discarded. Default: 2

coverage search (optional): Enables or disables the coverage-based search for junctions. Use when coverage search is disabled by default (such as for reads ≥75 bp), for maximum sensitivity.
The coverage search looks for distinct regions of piled-up reads in the initial mapping (coverage islands). Neighboring islands are often spliced together in the transcriptome, so TopHat looks for ways to join these with an intron.
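
The idea behind the coverage search can be illustrated with a small Python sketch. This is not TopHat's implementation: the function names (find_islands, candidate_introns), the depth threshold, and the toy intron length bounds are all assumptions made for illustration. The sketch only pairs directly adjacent islands, whereas a real search would also have to locate splice sites and confirm the junction with read alignments.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in genome coordinates


def find_islands(coverage: List[int], min_depth: int = 1) -> List[Interval]:
    """Return contiguous regions whose per-base coverage is >= min_depth.

    min_depth is a hypothetical threshold chosen for this example.
    """
    islands: List[Interval] = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= min_depth and start is None:
            start = pos  # an island opens here
        elif depth < min_depth and start is not None:
            islands.append((start, pos))  # the island closes before pos
            start = None
    if start is not None:
        islands.append((start, len(coverage)))
    return islands


def candidate_introns(
    islands: List[Interval], min_intron: int, max_intron: int
) -> List[Interval]:
    """Propose the gap between neighboring islands as a candidate intron
    when its length lies within [min_intron, max_intron]."""
    candidates: List[Interval] = []
    for (_, left_end), (right_start, _) in zip(islands, islands[1:]):
        gap = right_start - left_end
        if min_intron <= gap <= max_intron:
            candidates.append((left_end, right_start))
    return candidates


# Toy per-base coverage; the intron bounds below are illustrative values,
# not the defaults of the min/max intron length parameters.
coverage = [0, 3, 4, 4, 0, 0, 0, 0, 0, 2, 5, 1, 0, 6, 6]
islands = find_islands(coverage)
introns = candidate_introns(islands, min_intron=3, max_intron=20)
print(islands)  # [(1, 4), (9, 12), (13, 15)]
print(introns)  # [(4, 9)]
```

In this toy input the one-base gap between the second and third islands falls below the illustrative minimum intron length, so only the gap from position 4 to 9 is proposed as a candidate intron.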
