TopHat Documentation

Description: A fast splice junction mapper for RNA-seq reads.
Author: et al., University of Maryland Center for Bioinformatics and Computational Biology, [email protected]
TopHat Version: 2.0.8b
Contact: Marc-Danie Nazaire, [email protected]

Summary

TopHat is a fast splice junction mapper. TopHat uses Bowtie to map RNA-seq reads to a genome, then analyzes the mapping results to identify splice junctions between exons. The software is optimized for reads 75 bp or longer. TopHat was created at the University of Maryland Center for Bioinformatics and Computational Biology. This document is adapted from the TopHat documentation for release 2.0.8b.

Usage

TopHat takes RNA-seq reads files in FASTA/FASTQ format as input and maps those reads against a reference genome, using Bowtie as its alignment engine. Bowtie (as of version 2.1.0) does not allow alignments between a read and the genome that contain large gaps, so TopHat takes the reads that Bowtie cannot align and breaks them into smaller pieces called segments. These segments will often align to the genome when they are processed independently. When several of a read's segments align to the genome far apart from each other (that is, between 100 bp and several hundred kilobases), TopHat infers that the read spans a splice junction and estimates where that junction's splice sites are.

TopHat was designed to work with reads produced by the Illumina Genome Analyzer, although users have been successful in using TopHat with reads from other technologies.

The TopHat 2.0.8b GenePattern module does not support:
• short (fewer than a few nucleotides) insertions and deletions in the reported alignments
• mixing paired- and single-end reads together
• colorspace reads

GenePattern automatically installs the packages that TopHat requires: SAMtools and Bowtie. GenePattern installs these packages separately from any existing versions, so they will not interfere with other instances of these tools.

For more information on using RNA-seq modules in GenePattern, see the RNA-seq Analysis page.

IMPORTANT NOTES:
• The first time you run a job with a given prebuilt index, the job may fail. Please re-run your job. It should work on the second run. If you encounter a recurring problem with your jobs failing in TopHat, contact [email protected].
• TopHat is memory intensive and takes several hours to run.
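The segment idea described in the Usage section can be illustrated with a short Python sketch. This is not TopHat's implementation: the function names, the segment length of 25, and the 500,000 bp upper bound are assumptions chosen for illustration, and the real software treats leftover bases and alignment details more carefully.

```python
def split_into_segments(read, segment_length=25):
    """Split a read into consecutive, non-overlapping segments.

    segment_length=25 is an assumed value for this illustration.
    The final segment may be shorter than segment_length.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [read[i:i + segment_length]
            for i in range(0, len(read), segment_length)]


def gap_suggests_junction(left_end, right_start, min_gap=100, max_gap=500_000):
    """Return True if two neighboring segments align far enough apart
    that the read may span a splice junction.

    left_end: genomic position just past the end of the left segment.
    right_start: genomic position where the right segment begins.
    min_gap and max_gap are hypothetical bounds for this sketch.
    """
    gap = right_start - left_end
    return min_gap <= gap <= max_gap


if __name__ == "__main__":
    segments = split_into_segments("ACGT" * 15)  # a 60 bp read
    print([len(s) for s in segments])
    print(gap_suggests_junction(1000, 6000))
```

In this sketch, a 60 bp read yields two full segments and one shorter remainder, and two segments that align 5,000 bp apart would be flagged as a possible junction.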

References

Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.
• http://genomebiology.com/2013/14/4/R36/abstract

Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols. 2012;7:562-578.
• http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105-11.
• http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp120

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
• http://genomebiology.com/2009/10/3/R25

Links

TopHat: http://tophat.cbcb.umd.edu/
TopHat documentation: http://tophat.cbcb.umd.edu/manual.html

Parameters

Name Description

prebuilt bowtie index: Use the drop-down list to select one of the prebuilt indexed genome references. If this list does not include the genome you require, an indexed genome can be generated using the Bowtie.indexer GenePattern module. Either a prebuilt OR a custom Bowtie index must be specified.

custom bowtie index: A ZIP archive containing Bowtie genome reference index files. The index must be built with Bowtie version 2.0 or greater. Either a prebuilt OR a custom Bowtie index must be specified.

reads pair 1 (required): Unpaired reads file or first mate for paired reads. This can be a file in FASTA or FASTQ format (bz2 and gz compressed files are supported), a ZIP archive containing FASTA or FASTQ files, or a directory that is accessible to the GenePattern server containing FASTA or FASTQ files. For more information on FASTA/FASTQ formats, see the Input Files section.

reads pair 2 (optional): Second mate for paired reads. This can be a file in FASTA or FASTQ format, or a ZIP archive containing FASTA or FASTQ files.

mate inner dist (optional): The expected mean inner distance between mate pairs. For example, for paired-end runs with fragments selected at 300 bp, where each end is 50 bp, you should set this to be 200. Default: 50

mate std dev (optional): The standard deviation for the distribution of inner distances between mate pairs. This does not have to be specified for paired end reads.

library type (optional): If you want to run TopHat on strand-specific reads, you must specify a library type. Options include:
• Standard Illumina (fr-unstranded)
• dUTP, NSR, NNSR (fr-firststrand)
• Ligation, Standard SOLiD (fr-secondstrand)
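The mate inner dist example above (300 bp fragments, 50 bp ends, inner distance 200) follows from subtracting both read lengths from the fragment length. A minimal Python sketch of that arithmetic, with a hypothetical function name:

```python
def mate_inner_distance(fragment_length, read_length):
    """Expected inner distance between mates of a paired-end fragment.

    The inner distance is the part of the fragment not covered by
    either read, assuming both mates have the same length.
    A negative result would mean the two mates overlap.
    """
    return fragment_length - 2 * read_length


if __name__ == "__main__":
    # The example from the parameter description: 300 bp fragments, 50 bp ends.
    print(mate_inner_distance(300, 50))  # prints 200
```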

Bowtie preset options (optional): A combination of pre-packaged options for Bowtie based on speed and sensitivity/accuracy.

GTF file (optional): A GTF (v. 2.2) or GFF3 file containing a list of gene model annotations (that is, exon annotations that provide a virtual transcriptome). If this file is provided, TopHat will extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the reference genome. The reads that did map onto the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final TopHat output.

transcriptome only (optional): Whether to align the reads to the virtual transcriptome (requires a file supplied in the GTF file parameter) and report only those mappings as genomic mappings. Default: no

max transcriptome hits (optional): The maximum number of mappings allowed for a read when it is aligned to the virtual transcriptome (requires a file supplied in the GTF file parameter). Any reads found with more than this number of mappings will be discarded.

prefilter multihits (optional): When mapping reads on the virtual transcriptome (requires a file supplied in the GTF file parameter), some repetitive or low complexity reads that would be discarded in the context of the genome may appear to align to the transcript sequences and thus may end up reported as mapped to those genes only. This option directs TopHat to first align the reads to the whole genome, then exclude such multi-mapped reads.

raw junctions file (optional): A file containing raw junctions. Junctions are specified one per line in a tab-delimited format.

find novel junctions (optional): If you select no, then the module will only look for junctions indicated in the GTF file supplied in the GTF file parameter. (This parameter is ignored when no GTF file is specified.) Default: yes

min anchor length (optional): The anchor length. TopHat will report junctions spanned by reads with at least this many bases on each side of the junction. Note that individual spliced alignments may span a junction with fewer than this many bases on one side. However, every junction involved in spliced alignments is supported by at least one read with this many bases on each side. This value must be at least 3. Default: 8

max splice mismatches (optional): The maximum number of mismatches that may appear in the "anchor" region of a spliced alignment. Default: 0

min intron length (optional): The minimum intron length. TopHat will ignore donor/acceptor pairs closer than this many bases apart.

max intron length (optional): The maximum intron length. When searching for junctions, TopHat will ignore donor/acceptor pairs farther than this many bases apart, except when such a pair is supported by a split segment alignment of a long read. Default: 500000

max insertion length (optional): The maximum insertion length TopHat will allow. Default: 3

max deletion length (optional): The maximum deletion length TopHat will allow. Default: 3

quality value scale (optional): Whether to use the Solexa or Solexa v. 1.3 (Phred 64) quality value scale. (For more information on these quality scores, see this review article.)

quality value files 1 (optional): A ZIP file containing separate quality value files for single end reads or the first pair of paired end reads.

quality value files 2 (optional): A ZIP file containing separate quality value files for the second pair of paired end reads.

integer quals (optional): Whether quality values are space-delimited integer values. Default: no

max multihits (optional): Specifies the number of times a read can be aligned to the reference genome. If a read is aligned more than this number of times, then TopHat will choose the alignments based on their alignment scores, reporting the alignments with the best alignment scores. If there are more than this number of alignments with the same score for a read, TopHat will randomly report only this many alignments. Default: 20
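The max multihits selection rule can be sketched in Python as follows. This is an illustration of the rule as described above, not TopHat's code: the function name, the (id, score) tuple layout, and the convention that a higher score is better are all assumptions.

```python
import random


def select_alignments(alignments, max_multihits=20, rng=None):
    """Keep at most max_multihits alignments for one read.

    alignments: list of (alignment_id, score) tuples; a higher score is
    assumed to be better. Best-scoring alignments are kept first, and
    ties that do not all fit are sampled at random.
    """
    rng = rng or random.Random()
    if len(alignments) <= max_multihits:
        return list(alignments)

    # Group alignments by score so ties can be handled together.
    by_score = {}
    for alignment in alignments:
        by_score.setdefault(alignment[1], []).append(alignment)

    selected = []
    for score in sorted(by_score, reverse=True):
        room = max_multihits - len(selected)
        if room <= 0:
            break
        group = by_score[score]
        if len(group) <= room:
            selected.extend(group)
        else:
            selected.extend(rng.sample(group, room))
    return selected


if __name__ == "__main__":
    hits = [("a", 10), ("b", 10), ("c", 5), ("d", 5), ("e", 5)]
    print(select_alignments(hits, max_multihits=3, rng=random.Random(0)))
```

With the sample input, the two alignments scoring 10 are always kept, and one of the three alignments scoring 5 is chosen at random to fill the remaining slot.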

read mismatches (optional): Specifies the number of mismatches allowed in the initial read mapping in each read alignment. Final read alignments with more than this many mismatches are discarded. Default: 2

coverage search (optional): Enables or disables the coverage-based search for junctions. Enable this when coverage search is disabled by default (such as for reads ≥75 bp) and you want maximum sensitivity. The coverage search looks for distinct regions of piled-up reads in the initial mapping (coverage islands). Neighboring islands are often spliced together in the transcriptome, so TopHat looks for ways to join these with an intron. Default: no

microexon search (optional): Attempts to find alignments incident to microexons. Very small exons, or microexons, occur in large numbers in some eukaryotic genomes and are very often the site of alternative splicing. Microexons range in length up to 25 bp. Works only for reads ≥50 bp. Default: no

output prefix (required): The prefix to use for the output file names.

Input Files

1. RNA-seq reads files in FASTA/FASTQ format (can be gzip or bzip2 compressed)
For more information on the FASTA format, see the NIH description here: http://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml. For more information on the FASTQ format, see the specification here: http://nar.oxfordjournals.org/content/early/2009/12/16/nar.gkp1137.full.

2. Custom Bowtie index (optional, if the prebuilt indexes do not include the genome you need)
This file is a genome reference index. You must create this file using Bowtie (version 2.0 or higher); you can use the Bowtie.indexer GenePattern module for this.

3. GTF file (optional)
A GTF/GFF file containing exon annotations, to provide a virtual transcriptome. The values in the first column of this file (the column that indicates the chromosome or contig on which the feature is located) MUST MATCH the sequence names in the reference sequence in the Bowtie index you are using with TopHat. Note that this match is case sensitive. For more information on GTF format, see the specification: http://mblab.wustl.edu/GTF22.html. For more information on GFF format, see the specification: http://www.sequenceontology.org/gff3.shtml.

4. Raw junctions file (optional)
Junctions are specified one per line, in a tab-delimited format, like so:
<chrom> <left> <right> <+/->
<left> and <right> are zero-based coordinates that specify the last character of the left sequence to be spliced to the first character of the right sequence, inclusive: that is, the last and first positions of the exons that flank the junction site. You can take the junctions.bed file from TopHat and convert it to this format.

5. Quality values file (optional)
A QUAL file is a file where each line contains numerical quality values for each nucleotide of a sequence named in a related FASTA file. For more information on this format, see this page.
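Converting a junctions.bed file to the raw junctions format can be sketched in Python as below. This is a hedged illustration, not an official converter: the function names are hypothetical, and it assumes that each junction is a standard 12-column BED line with exactly two blocks and that the raw junctions columns are chromosome, zero-based left position, zero-based right position, and strand.

```python
def bed_line_to_junction(line):
    """Convert one two-block BED line into a raw junction line.

    Assumed output columns (tab-delimited): chrom, left, right, strand,
    where left is the last base of the left exon block and right is
    the first base of the right exon block (both zero-based).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected a 12-column BED line")
    chrom = fields[0]
    chrom_start = int(fields[1])
    strand = fields[5]
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]
    if len(block_sizes) != 2 or len(block_starts) != 2:
        raise ValueError("expected exactly two blocks per junction")
    left = chrom_start + block_sizes[0] - 1   # last base of the left block
    right = chrom_start + block_starts[1]     # first base of the right block
    return "\t".join([chrom, str(left), str(right), strand])


def convert_bed(lines):
    """Yield raw junction lines, skipping blank lines and track headers."""
    for line in lines:
        if not line.strip() or line.startswith("track"):
            continue
        yield bed_line_to_junction(line)


if __name__ == "__main__":
    example = "chr1\t100\t300\tJUNC00000001\t5\t+\t100\t300\t255,0,0\t2\t20,30\t0,170"
    for junction in convert_bed([example]):
        print(junction)
```

For the example line, the left block covers positions 100 to 119 and the right block starts at position 270, so the sketch emits chr1, 119, 270, and + as the four columns.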

Output Files

1. .accepted_hits.bam
A list of read alignments in BAM format. This file can be used as input for Cufflinks. BAM is the binary equivalent of SAM, a compact short read alignment format. For more information on the SAM/BAM formats, see the specification at: http://samtools.sourceforge.net.

2. .junctions.bed
A BED file of junctions reported by TopHat (for more information on the BED format, see: http://genome.ucsc.edu/FAQ/FAQformat.html). Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction.

3. .insertions.bed
UCSC BED tracks of insertions reported by TopHat. In insertions.bed, chromLeft refers to the last genomic base before the insertion.

4. .deletions.bed
UCSC BED tracks of deletions reported by TopHat. In deletions.bed, chromLeft refers to the first genomic base of the deletion.

5. .unmapped.bam
A list of reads left unaligned, in BAM format.

6. .prep_reads.info
Provides the number of reads processed and the minimum and maximum read lengths in the dataset.

Platform Dependencies

Module type: RNA-seq
CPU type: any
OS: Macintosh, Linux
Language: C++, Perl 5.10+, Python 2.6

GenePattern Module Version Notes

Version: 6
Date: 5/10/13
Description: TopHat module v.6 contains updates to TopHat version 2.0.8b. See the TopHat documentation (http://tophat.cbcb.umd.edu/) for more information about this version.
