Supplementary Material Hands-On Tutorial

RNA-Seq: revelation of the messengers

Marcel C. Van Verk†,1, Richard Hickman†,1, Corné M.J. Pieterse1,2, Saskia C.M. Van Wees1

† Equal contributors

Corresponding author: Van Wees, S.C.M. ([email protected]). Questions: Van Verk, M.C. ([email protected]) or Hickman, R. ([email protected]).

1 Plant-Microbe Interactions, Department of Biology, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
2 Centre for BioSystems Genomics, PO Box 98, 6700 AB Wageningen, The Netherlands

INDEX

1. RNA-Seq tools websites
2. Installation notes
   2.1. CASAVA 1.8.2
   2.2. Samtools
   2.3. MiSO
3. Quality Control: FastQC
4. Creating indexes
   4.1. Genome Index for Bowtie / TopHat alignment
   4.2. Transcriptome Index for Bowtie alignment
   4.3. Genome Index for BWA alignment
   4.4. Transcriptome Index for BWA alignment
5. Read Alignment
   5.1. Preliminaries
   5.2. Genome read alignment with Bowtie
   5.3. Transcriptome read alignment with Bowtie
   5.4. Genome read alignment with BWA
   5.5. Transcriptome read alignment with BWA
   5.6. Genome read alignment with TopHat
6. Summarization
   6.1. Summarizing counts from Bowtie transcriptome alignment
   6.2. Summarizing counts from BWA transcriptome alignment
   6.3. Summarizing counts from TopHat genome alignment using HTSeq
   6.4. Bowtie/BWA genome count summarization
7. Differential Expression
   7.1. Differential expression of summarized TopHat count data using DESeq
   7.2. Differential expression of summarized Bowtie/BWA genome-aligned count data using DESeq
8. Isoform Analysis
   8.1. Isoform analysis using MiSO
   8.2. Plotting MiSO data using Sashimi
   8.3. Differential expression analysis with Cufflinks
9. Visualization using IGV
10. Interoperability
   10.1. Convert SAM to BAM
   10.2. Convert BAM to SAM
   10.3. Sort a BAM file
   10.4. Index a BAM file
   10.5. Sort a SAM file
   10.6. Conversion of GTF to GFF
11. References

IMPORTANT NOTE: The sample data set provided for use with this tutorial is a small subset of a real RNA-Seq experiment and only provides read data for a few select genes. Therefore, the features selected in the analysis of this data set will differ from those that would be selected in a complete, realistic RNA-Seq experiment. The analysis steps provided here, however, are representative of a complete RNA-Seq analysis, and no steps need to be altered when applying this analysis to a ‘real’ RNA-Seq data set instead of this sample data.

1. RNA-Seq tools websites

The table below provides a selection of the most frequently used tools for RNA-Seq analysis. Included are the hyperlinks to the main website, the download location for the program and link to the program manual.

FastQC [S1]
  Main page: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  Program:   http://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc
  Video:     http://www.youtube.com/watch?v=bz93ReOv87Y
  Manual:    http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/

Bowtie [S2]
  Main page: http://bowtie-bio.sourceforge.net/index.shtml
  Program:   http://sourceforge.net/projects/bowtie-bio/files/bowtie/ or main page → latest releases
  Manual:    http://bowtie-bio.sourceforge.net/manual.shtml

BWA [S3]
  Main page: http://bio-bwa.sourceforge.net/
  Program:   http://sourceforge.net/projects/bio-bwa/files/ or main page → SF download page
  Manual:    http://bio-bwa.sourceforge.net/bwa.shtml

TopHat [S5,S6]
  Main page: http://tophat.cbcb.umd.edu/
  Program:   http://tophat.cbcb.umd.edu/downloads/ or main page → releases
  Manual:    http://tophat.cbcb.umd.edu/manual.html

DESeq [S7]
  Main page: http://bioconductor.org/packages/release/bioc/html/DESeq.html
  Program:   http://cran.us.r-project.org/
  R-package: source("http://bioconductor.org/biocLite.R"); biocLite("DESeq")
  Manual:    http://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf

HTSeq [S8]
  Main page: http://www-huber.embl.de/users/anders/HTSeq/
  Program:   http://www-huber.embl.de/users/anders/HTSeq/doc/install.html#install
  Manual:    http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

MiSO [S9]
  Main page: http://genes.mit.edu/burgelab/miso/
  Program:   http://genes.mit.edu/burgelab/miso/software.html
  Manual:    http://genes.mit.edu/burgelab/miso/docs/
             http://genes.mit.edu/burgelab/miso/docs/sashimi.html

Cufflinks [S5,S6]
  Main page: http://cufflinks.cbcb.umd.edu/tutorial.html
  Program:   http://cufflinks.cbcb.umd.edu/downloads/ or main page → releases
  Manual:    http://cufflinks.cbcb.umd.edu/manual.html

Samtools [S10]
  Main page: http://samtools.sourceforge.net/
  Program:   http://sourceforge.net/projects/samtools/files/samtools/
  Manual:    http://samtools.sourceforge.net/samtools.shtml

IGV [S11]
  Main page: http://www.broadinstitute.org/igv/
  Program:   http://www.broadinstitute.org/software/igv/download
  Manual:    http://www.broadinstitute.org/software/igv/UserGuide

For a full overview of next-generation sequencing software tools for each step described here, and of more advanced analysis packages, see http://seqanswers.com/wiki/Software/list


2. Installation notes

The manuals of most tools specify in detail how to install the software. Still, we have encountered some issues with installing some of the tools on our Mac OS X Lion system. An overview of the steps we took to solve the problems with some specific tools is shown below.

2.1 CASAVA v1.8.2

CASAVA is a pipeline supplied by Illumina to convert the Basecalls into (demultiplexed) FastQ files. If the sequencing is outsourced, this step is usually already performed by the sequencing provider.

On Mac OS X:

- Download the latest version of cmake at: http://www.cmake.org/cmake/resources/software.html
- Install cmake using the downloaded .dmg package.
- At the command line type:
$ which cmake
This gives the exact installation location of cmake. Use this location as the value of --with-cmake in the next step:
$ /CASAVA_v1.8.2/src/configure --prefix=/Data/CASAVA_v1.8.2 --with-cmake=/usr/bin/cmake
The value of --prefix (here /Data/CASAVA_v1.8.2) is the directory where CASAVA will be installed on the system.

If, on other systems, one of the following libraries is missing during installation, install it using the following steps:

- Gzip
$ sudo apt-get install build-essential

- Bzip
Go to bzip.org/downloads, download the installation package, and build it:
$ sudo make
$ sudo make install

- Zlib
$ sudo apt-get install zlib1g-dev

- Ncurses
$ sudo apt-get install libncurses5-dev

- Libxml
$ sudo apt-get install libxml2-dev

- Simpleperl
$ sudo apt-get install libxml-simple-perl

2.2 SAMTOOLS

The installation of SAMTOOLS is very well explained by its manual. However, pay attention to the following point if a 64 bit system is used.

- Edit the Makefile and uncomment -m64 on the line containing CFLAGS. This ensures the program will compile as a 64 bit executable.
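The same edit can be scripted. The sketch below is only an illustration: it assumes the flag appears as "#-m64" on a line that starts with CFLAGS (check your own copy of the Makefile first), and the function name enable_m64 is our own.

```shell
# Uncomment -m64 on the CFLAGS line of a Makefile read from stdin.
# Assumption: the flag is written as "#-m64" on a line starting with CFLAGS.
enable_m64() {
  sed '/^CFLAGS/ s/#-m64/-m64/'
}

# Example on a single made-up line (not the real Makefile):
printf 'CFLAGS= -g -Wall -O2 #-m64 #-arch ppc\n' | enable_m64
```

To apply it to a real file, one could write the result to a new file and inspect it before replacing the original, e.g. enable_m64 < Makefile > Makefile.m64.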


2.3 MiSO

The installation of MISO is very well described by its manual. However, MISO depends on several other Python packages. Below is an overview of the installation procedure for MISO and its dependencies.

- Within the downloaded MISO directory type:
$ sudo python setup.py install
- To check which dependent modules are missing run:
$ python module_availability

Installation of modules:

Scipy:
- For unix systems, download Scipy from http://sourceforge.net/projects/scipy/files/ and install it.
- For Mac OS X 64 bit systems, download the installation script named install_superpack.sh from http://www.scipy.org/Download, using the link ‘SciPy Superpack for Python 2.7 (64-bit) installation script’.
- Install Scipy by typing:
$ bash install_superpack.sh

Simplejson: $ sudo pip install simplejson

Pysam: $ sudo pip install pysam

Matplotlib:
- On unix systems type:
$ sudo pip install matplotlib
- On Mac OS X systems type:
$ sudo pip install git+https://github.com/matplotlib/matplotlib.git#egg=matplotlib-dev


3. Quality Control: FastQC

FastQC performs a quality control check on the output Fastq files and indicates potential issues with a dataset. The manual provided by FastQC is very extensive and gives examples of the warnings generated by the program.

NB: the example data provided with this hands-on guide only contains information on a couple of genes (i.e. enough information for this guide) and does not reflect a normal output Fastq file. Therefore FastQC will report many warnings for this data set.

4. Creating Indexes

All short read aligners require the creation of an indexed version of the reference genome/transcriptome. The index allows the alignment tool to match reads to the reference sequence quickly and efficiently.

4.1 Genome Index for Bowtie/TopHat alignment

Download sequence data

Prior to building the index, the reference genome should be retrieved. It can be downloaded from Ensembl Plants using either a web browser or a shell window:

Via Web browser:
- Point the web browser to http://plants.ensembl.org/
- Select ‘downloads’ at the top of the page.
- Select ‘download data via FTP’.
- Look up the species of interest (in this case ‘Arabidopsis thaliana’) and click on FASTA (DNA).
- Download the file containing ‘dna.toplevel’ in the name (not the RM version).

Via the shell window:
$ ftp ftp://ftp.ensemblgenomes.org/pub/plants/release-16/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa.gz
- Unpack the genome sequence.
$ gunzip Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa.gz
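After unpacking, a quick look at the sequence names and lengths can confirm that the download is complete and that the names match those used later in the GTF file. The helper below is a minimal sketch; the function name fasta_lengths is our own and not part of any of the tools above.

```shell
# Print "name<TAB>length" for every sequence in a Fasta file.
# The name is the first word after ">" on each header line.
fasta_lengths() {
  awk '
    /^>/ { if (name != "") print name "\t" len   # flush the previous record
           name = substr($1, 2); len = 0; next }
         { len += length($0) }                   # add up sequence lines
    END  { if (name != "") print name "\t" len }
  ' "$1"
}
```

For example: fasta_lengths Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa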

Creating the Bowtie index of a set of DNA sequences can be performed using the bowtie-build command. The index is represented by a set of files with the extension .ebwt. For now we will build the indexes in a new directory called ‘bowtie_index’.

$ mkdir bowtie_index
$ bowtie-build Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa bowtie_index/ath.10.16

bowtie-build                                      Tool for this particular task
Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa    Genome sequence
bowtie_index                                      Output directory
ath.10.16                                         Basename of index files
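Before starting an alignment it can be worth checking that the index build finished. The sketch below assumes a standard (small) Bowtie 1 index, which to our knowledge consists of six files ending in .1.ebwt, .2.ebwt, .3.ebwt, .4.ebwt, .rev.1.ebwt and .rev.2.ebwt; the function name check_bowtie_index is our own.

```shell
# Return 0 if all six Bowtie 1 index files exist and are non-empty
# for the given basename, otherwise report what is missing.
check_bowtie_index() {
  missing=0
  for part in 1 2 3 4 rev.1 rev.2; do
    if [ ! -s "$1.$part.ebwt" ]; then
      echo "missing or empty: $1.$part.ebwt" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example: check_bowtie_index bowtie_index/ath.10.16 && echo "index looks complete"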


4.2 Transcriptome Index for Bowtie alignment

As alignment against the transcriptome is much faster (~10x) compared to genome alignments, it is recommended to use this if only gene expression is of main interest and not the analysis of alternatively-spliced isoforms.

NB: It is important to realize that the Fasta file containing the cDNA sequences has one entry for each isoform of a gene. If this file were used directly to build the index, reads mapping to a gene with multiple splice variants/isoforms would be reported with multiple hits and flagged as multireads. To prevent this, we provide a script that retains only the ‘.1’ isoform (the most pronounced isoform) of each gene model, resulting in only one entry per gene id. When performing a transcriptome alignment for any species of interest, make sure to use a Fasta file with only one entry/isoform per gene id (representing the dominant/longest isoform).

To build the index for a reference transcriptome, first download the set of all annotated transcript sequences:

Via Web browser:
- Point the web browser to http://plants.ensembl.org/
- Select ‘downloads’ at the top of the page.
- Select ‘download data via FTP’.
- Look up the species of interest (in this case ‘Arabidopsis thaliana’) and click on FASTA (cDNA).
- Download the file containing ‘cdna.all’ in the name (not the abinitio version!).

Via shell window:
$ ftp ftp://ftp.ensemblgenomes.org/pub/plants/release-16/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.16.cdna.all.fa.gz
- Unpack the cDNA sequence Fasta file using
$ gunzip Arabidopsis_thaliana.TAIR10.16.cdna.all.fa.gz

As described in the note above, we first extract only the ‘.1’ splice variant/isoform per gene id from this Fasta file using a custom script. This script works for the file indicated below but is not guaranteed to work with other files.

$ perl extract_one_model.pl Arabidopsis_thaliana.TAIR10.16.cdna.all.fa > At_trans.fa

perl                                           Command for this particular task
extract_one_model.pl                           Custom perl script provided as supplemental data
Arabidopsis_thaliana.TAIR10.16.cdna.all.fa     Reference transcriptome (fasta format)
At_trans.fa                                    Output file name
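The supplied extract_one_model.pl is not reproduced here. Purely to illustrate the filtering idea, the sketch below keeps the records whose header id ends in ‘.1’. It assumes that each header line starts with ">" followed directly by the transcript id as the first word; the function name keep_dot1_isoforms is our own, and this is a simplified stand-in, not a replacement for the supplied script.

```shell
# Keep only Fasta records whose id (first word of the header) ends in ".1".
# On each header line the keep flag is updated; sequence lines inherit it.
keep_dot1_isoforms() {
  awk '/^>/ { keep = ($1 ~ /\.1$/) } keep' "$1"
}
```

A quick way to compare inputs and outputs is to count header lines before and after filtering, e.g. with grep -c '^>' on both files.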

Here we use the same ‘bowtie_index’ directory as for the genome index created above. If this is already created skip the mkdir command.

$ mkdir bowtie_index
$ bowtie-build At_trans.fa bowtie_index/ath.trans

bowtie-build    Tool for this particular task
At_trans.fa     Reference transcriptome
bowtie_index    Output directory
ath.trans       Basename of index files


4.3 Genome Index for BWA alignment

Generate the “BWA index” from a set of DNA sequences using the bwa index command. The index file will be created in the same directory as the original sequence. Obtain the reference genome sequence as described above in “Genome Index for Bowtie/TopHat alignment”.

$ bwa index Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa

bwa                                               Tool for this particular task
index                                             Command to create the index
Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa    Genome sequence (fasta format)

4.4 Transcriptome Index for BWA alignment

To create the BWA transcriptome index, use the filtered transcript fasta file ‘At_trans.fa’ described above.

$ bwa index At_trans.fa

bwa            Tool for this particular task
index          Command to create the index
At_trans.fa    Transcriptome sequence (fasta format)


5. Read Alignment

5.1 Preliminaries

Create directories to store the results of the different alignment programs (where the abbreviations tr and ge correspond to transcriptome and genome respectively):

$ mkdir bowtie
$ mkdir bowtie/tr
$ mkdir bowtie/ge
$ mkdir bwa
$ mkdir bwa/tr
$ mkdir bwa/ge
$ mkdir tophat
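As a shorter alternative, the same directory layout can be created in one call. This is only a convenience and produces the same directories as the commands above.

```shell
# -p creates parent directories as needed and does not complain
# if a directory already exists, so the command can be re-run safely.
mkdir -p bowtie/tr bowtie/ge bwa/tr bwa/ge tophat
```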

The Example-data file (.tar extension) can be extracted easily on any Unix based system with tar:

$ tar xvf example-data.tar

On Windows, the freeware tool 7-Zip (www.7-zip.org) can be used for extraction.

5.2 Genome read alignment with Bowtie

Bowtie can be used to align reads against a reference genome, but as Bowtie is not a split-read aligner, reads that (partly) span an intron will not be aligned, causing an underestimation of gene expression. However, alignments using Bowtie are well suited if the reference genome is poorly annotated and splice variants are unknown.

Align reads to the Arabidopsis genome using Bowtie for each of the 4 different sample Fastq files:

$ bowtie -p 6 -v 3 --best --strata -a bowtie_index/ath.10.16 -S sample-data/sample1a.fastq bowtie/ge/1a.sam
$ bowtie -p 6 -v 3 --best --strata -a bowtie_index/ath.10.16 -S sample-data/sample1b.fastq bowtie/ge/1b.sam
$ bowtie -p 6 -v 3 --best --strata -a bowtie_index/ath.10.16 -S sample-data/sample2a.fastq bowtie/ge/2a.sam
$ bowtie -p 6 -v 3 --best --strata -a bowtie_index/ath.10.16 -S sample-data/sample2b.fastq bowtie/ge/2b.sam

bowtie           Tool for this particular task
-p 6             Number of processors (threads) used for alignment. Adjust this to the number of threads you can run (in most cases number of CPUs multiplied by 2).
-v 3             Allow 3 mismatches along the entire read
--best           Report the best alignment, e.g., a 2 mismatch high quality alignment will be preferred over a 1 mismatch low quality alignment.
--strata         Only used in combination with --best. Report only the alignments of the best ‘stratum’, e.g. if multiple alignments are reportable, only report alignments belonging to the highest quality/stratum group.
-a               Report all alignments found (and not only the top X hits).
bowtie_index     Directory containing the bowtie index files
ath.10.16        Base name of the genome bowtie indexes (without the .ebwt)
-S               Output results to SAM format
sample-data      Directory containing the sample Fastq files
sampleX.fastq    Input files containing the sequencing reads to be aligned
X.sam            Output SAM file name
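Because the four commands differ only in the sample name, they can be generated with a loop. The sketch below only prints the commands, so that they can be checked before anything is run; the function name bowtie_genome_commands is our own.

```shell
# Print one Bowtie genome-alignment command per sample.
# Nothing is executed here; the commands are only written to stdout.
bowtie_genome_commands() {
  for s in 1a 1b 2a 2b; do
    echo "bowtie -p 6 -v 3 --best --strata -a bowtie_index/ath.10.16 -S sample-data/sample$s.fastq bowtie/ge/$s.sam"
  done
}

bowtie_genome_commands
```

Once the printed commands look right, they can be executed by piping them into a shell, e.g. bowtie_genome_commands | sh.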

5.3 Transcriptome read alignment with Bowtie

- Align reads to the Arabidopsis transcriptome using Bowtie for each of the 4 different sample Fastq files. For a description of the parameters used, please refer to the table above.
$ bowtie -p 6 -v 3 --best --strata -a bowtie_index/ath.trans -S sample-data/sample1a.fastq bowtie/tr/1a.sam
$ bowtie -p 6 -v 3 --best --strata -a bowtie_index/ath.trans -S sample-data/sample1b.fastq bowtie/tr/1b.sam
$ bowtie -p 6 -v 3 --best --strata -a bowtie_index/ath.trans -S sample-data/sample2a.fastq bowtie/tr/2a.sam
$ bowtie -p 6 -v 3 --best --strata -a bowtie_index/ath.trans -S sample-data/sample2b.fastq bowtie/tr/2b.sam

5.4 Genome read alignment with BWA

Read alignment by BWA is performed in two steps: First, reads are aligned against the genome; secondly the results are summarized into a SAM file. If BWA encounters a read that can align to multiple locations in the genome with the same degree of accuracy (multireads), the read will be randomly assigned to one of these locations. For each reported alignment in the SAM file, an additional tag at the end of each SAM line describes the other locations that are also reported. For more information we refer to the manual of BWA.

NB: As BWA is not a split-read aligner, all reads that (partly) span an intron will not be aligned and may result in an underestimation of gene expression (see 5.2).

- Begin the alignment by BWA:
$ bwa aln -t 6 Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa sample-data/sample1a.fastq > bwa/ge/1a.fastq.temp
$ bwa aln -t 6 Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa sample-data/sample1b.fastq > bwa/ge/1b.fastq.temp
$ bwa aln -t 6 Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa sample-data/sample2a.fastq > bwa/ge/2a.fastq.temp
$ bwa aln -t 6 Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa sample-data/sample2b.fastq > bwa/ge/2b.fastq.temp

Now the results can be summarized into a SAM file:

$ bwa samse -n 20 Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa bwa/ge/1a.fastq.temp sample-data/sample1a.fastq > bwa/ge/1a.sam
$ bwa samse -n 20 Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa bwa/ge/1b.fastq.temp sample-data/sample1b.fastq > bwa/ge/1b.sam
$ bwa samse -n 20 Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa bwa/ge/2a.fastq.temp sample-data/sample2a.fastq > bwa/ge/2a.sam
$ bwa samse -n 20 Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa bwa/ge/2b.fastq.temp sample-data/sample2b.fastq > bwa/ge/2b.sam

bwa                                               Tool for this particular task
aln                                               The parameter instructing BWA to perform an alignment
samse                                             The parameter instructing BWA to summarize reads
-t 6                                              Use 6 processors (threads) for alignment. Adjust this to the number of threads you can run (in most cases number of CPUs multiplied by 2).
-n 20                                             Report the top N number of hits
Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa    The BWA genome index sequences
sample-data                                       Directory containing the (sample) Fastq files
sampleX.fastq                                     Input Fastq file containing the sequencing reads
X.fastq.temp                                      Output temporary alignment file name
X.sam                                             Output SAM file name
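To get an impression of how many reads were reported with additional locations, the SAM file can be scanned for the extra tag mentioned above. The sketch below assumes that this tag is written as XA:Z:..., which is our reading of the BWA manual for samse -n; please verify this against your BWA version. The function name count_multi_hit_reads is our own.

```shell
# Count alignment lines that carry an XA:Z: tag (alternative hit locations).
# Assumption: BWA samse -n writes alternative hits in an XA:Z: optional field.
count_multi_hit_reads() {
  awk -F '\t' '
    /^@/ { next }                      # skip SAM header lines
    { for (i = 12; i <= NF; i++)       # optional fields start at column 12
        if ($i ~ /^XA:Z:/) { n++; break } }
    END { print n + 0 }
  ' "$1"
}
```

For example: count_multi_hit_reads bwa/ge/1a.sam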

5.5 Transcriptome read alignment with BWA

For a description of the parameters used, please refer to the table above.

- Begin the transcriptome alignment by BWA:
$ bwa aln -t 6 At_trans.fa sample-data/sample1a.fastq > bwa/tr/1a.fastq.temp
$ bwa aln -t 6 At_trans.fa sample-data/sample1b.fastq > bwa/tr/1b.fastq.temp
$ bwa aln -t 6 At_trans.fa sample-data/sample2a.fastq > bwa/tr/2a.fastq.temp
$ bwa aln -t 6 At_trans.fa sample-data/sample2b.fastq > bwa/tr/2b.fastq.temp

- Summarize results into a SAM file:
$ bwa samse -n 20 At_trans.fa bwa/tr/1a.fastq.temp sample-data/sample1a.fastq > bwa/tr/1a.sam
$ bwa samse -n 20 At_trans.fa bwa/tr/1b.fastq.temp sample-data/sample1b.fastq > bwa/tr/1b.sam
$ bwa samse -n 20 At_trans.fa bwa/tr/2a.fastq.temp sample-data/sample2a.fastq > bwa/tr/2a.sam
$ bwa samse -n 20 At_trans.fa bwa/tr/2b.fastq.temp sample-data/sample2b.fastq > bwa/tr/2b.sam

5.6 Genome read alignment with TopHat

To align the data with TopHat, a GTF file (containing all annotated gene models) is required. This file can be downloaded from the Ensembl website:

Via Web browser:
- Point the web browser to http://plants.ensembl.org/
- Click on ‘downloads’ at the top of the page.
- Select ‘download data via FTP’.
- Look up the species of interest (in this case ‘Arabidopsis thaliana’) and click on GTF.
- Download the GTF file.

Via shell window:
$ ftp ftp://ftp.ensemblgenomes.org/pub/plants/release-16/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.16.gtf.gz
- Unpack the GTF file using
$ gunzip Arabidopsis_thaliana.TAIR10.16.gtf.gz

Sometimes, GTF annotation contains a 'gene_name' value that contains the delimiter ';', which can cause problems when running downstream applications. Replace the ";" in any gene name field by a ‘.’ using:

$ perl -ne 's/(gene_name\s\"[\d|\w|-]+)(;)([\d|\w|-]+\")/$1.$3/; print' Arabidopsis_thaliana.TAIR10.16.gtf > corrected.gtf
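To see whether a GTF file is affected at all, and to confirm that the correction worked, the number of lines with a ';' inside the gene_name value can be counted. This is a minimal sketch; the function name count_semicolon_gene_names is our own.

```shell
# Count GTF lines whose gene_name value contains a ";" between the quotes.
count_semicolon_gene_names() {
  awk '/gene_name "[^"]*;[^"]*"/ { n++ } END { print n + 0 }' "$1"
}
```

Running it on the original and on the corrected file (for example count_semicolon_gene_names corrected.gtf) gives a before/after comparison; a gene name with more than one ';' would need the substitution to be repeated.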

Now the reads can be splice-aligned against the genome using TopHat. It is important to consider the parameters that specify the characteristics of the introns considered by TopHat. By default, TopHat will search for splice junctions a minimum of 70 and a maximum of 500,000 bases apart. These parameters are far too liberal when processing reads from Arabidopsis, which on average has much smaller introns. Here, we will set the maximum intron size to 2000, which is the 99th percentile of intron sizes [S12].

$ tophat -p 6 --min-intron-length 40 --max-intron-length 2000 --bowtie1 --no-novel-juncs --read-mismatches 2 -G corrected.gtf --transcriptome-index transcriptome_data/ath10.16 --output-dir tophat/1a bowtie_index/ath.10.16 sample-data/sample1a.fastq
$ tophat -p 6 --min-intron-length 40 --max-intron-length 2000 --bowtie1 --no-novel-juncs --read-mismatches 2 -G corrected.gtf --transcriptome-index transcriptome_data/ath10.16 --output-dir tophat/1b bowtie_index/ath.10.16 sample-data/sample1b.fastq
$ tophat -p 6 --min-intron-length 40 --max-intron-length 2000 --bowtie1 --no-novel-juncs --read-mismatches 2 -G corrected.gtf --transcriptome-index transcriptome_data/ath10.16 --output-dir tophat/2a bowtie_index/ath.10.16 sample-data/sample2a.fastq
$ tophat -p 6 --min-intron-length 40 --max-intron-length 2000 --bowtie1 --no-novel-juncs --read-mismatches 2 -G corrected.gtf --transcriptome-index transcriptome_data/ath10.16 --output-dir tophat/2b bowtie_index/ath.10.16 sample-data/sample2b.fastq

tophat                         Tool for this particular task
-p 6                           Use 6 processors (threads) for alignment. Adjust this to the number of threads you can run (in most cases number of CPUs multiplied by 2)
--min-intron-length 40         The minimum intron length set to 40, adjust this according to the intron characteristics for the species of interest
--max-intron-length 2000       The maximum intron length set to 2000, adjust this according to the intron characteristics for the species of interest
--bowtie1                      Use Bowtie version 1 for alignment
--no-novel-juncs               Do not try to find novel junctions between exons, only use the annotated junctions from the GTF file
--read-mismatches 2            Allow 2 mismatches along the entire read
-G                             With this option, TopHat will try and align reads to the set of transcripts described in the GTF file supplied
corrected.gtf                  GTF file name
--transcriptome-index          Specifies location of reference sequence index
transcriptome_data/ath10.16    Output directory for the TopHat index
--output-dir                   Option to specify the output directory
tophat/X                       Output directory name
bowtie_index                   Directory containing the Bowtie index files
ath.10.16                      Name of the genome Bowtie indexes (without the .ebwt)
sample-data                    Directory containing the (sample) Fastq files
sampleX.fastq                  Input Fastq file containing the sequencing reads
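When adjusting the intron length settings for another species, annotated intron sizes can be estimated from the GTF file itself. The sketch below is one possible approach and not how the value cited above was obtained. It assumes that the exon lines of each transcript are listed consecutively and in order (ascending or descending coordinates) and carry a transcript_id attribute; the function names intron_lengths and percentile are our own.

```shell
# Print one intron length per line, derived from consecutive exon
# records that belong to the same transcript_id.
intron_lengths() {
  awk -F '\t' '
    $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
      tid = substr($9, RSTART + 15, RLENGTH - 16)   # text between the quotes
      s = $4 + 0; e = $5 + 0
      if (tid == prev_tid) {
        if (s > prev_e)      print s - prev_e - 1   # exons in ascending order
        else if (e < prev_s) print prev_s - e - 1   # exons in descending order
      }
      prev_tid = tid; prev_s = s; prev_e = e
    }
  ' "$1"
}

# Nearest-rank percentile (default: 99) of the numbers read from stdin.
percentile() {
  sort -n | awk -v p="${1:-99}" '
    { v[NR] = $1 }
    END { if (NR) { i = int((NR * p + 99) / 100); print v[i] } }
  '
}
```

For example, intron_lengths corrected.gtf | percentile 99 prints an estimate that can guide the choice of --max-intron-length, and percentile 1 can be used in the same way for --min-intron-length.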


6. Summarization

6.1 Summarizing counts from Bowtie transcriptome alignment

During the transcriptome alignment, reads were aligned to a pre-defined, unique set of transcript sequences. The summarization step therefore merely requires summing the reads that map to each distinct transcript. First, obtain the list of unique transcript identifiers to which the reads were aligned:

$ perl -ne ' if (/^\@SQ/) { @F = split(/\t|:/, $_); print $F[2]."\n" } ' bowtie/tr/1a.sam > agi_list.txt

Next, count the number of reads mapping to each transcript:

$ perl -ne ' chomp($_); print $_."\t".`grep -c "\t$_" bowtie/tr/1a.sam ` ' agi_list.txt > bowtie/tr/1a.counts
$ perl -ne ' chomp($_); print $_."\t".`grep -c "\t$_" bowtie/tr/1b.sam ` ' agi_list.txt > bowtie/tr/1b.counts
$ perl -ne ' chomp($_); print $_."\t".`grep -c "\t$_" bowtie/tr/2a.sam ` ' agi_list.txt > bowtie/tr/2a.counts
$ perl -ne ' chomp($_); print $_."\t".`grep -c "\t$_" bowtie/tr/2b.sam ` ' agi_list.txt > bowtie/tr/2b.counts
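The commands above run grep once per transcript, which becomes slow for a complete transcriptome. As an alternative sketch (not the method used above), the counts can be collected in a single pass over the SAM file by tallying the reference name in column 3. It assumes that transcript ids are declared in the @SQ header lines as SN:<id>; the function name count_per_transcript is our own.

```shell
# Print "transcript<TAB>count" for every transcript listed in the @SQ
# header lines, counting alignment lines by their reference name (column 3).
count_per_transcript() {
  awk -F '\t' '
    /^@SQ/ {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SN:/) { id = substr($i, 4); order[++n] = id; count[id] = 0 }
      next
    }
    /^@/ { next }                        # other header lines
    ($3 in count) { count[$3]++ }        # unaligned reads ("*") are ignored
    END { for (i = 1; i <= n; i++) print order[i] "\t" count[order[i]] }
  ' "$1"
}
```

For example, count_per_transcript bowtie/tr/1a.sam > bowtie/tr/1a.counts. Transcripts without reads are reported with a count of 0, in the order of the SAM header.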

6.2 Summarizing counts from BWA transcriptome alignment

This is performed using the same method described above, using the same agi_list.txt only replacing the bowtie/tr/ directory with bwa/tr/.

6.3 Summarizing counts from TopHat genome alignment using HTSeq

As reads are aligned to larger reference sequences (in this case, chromosomes) it is required to summarize reads over regions of the genome that allow quantification of individual genes. This can be done by using htseq-count to summarize reads over exons.

HTSeq requires alignment files to be in SAM format. As TopHat outputs alignments in BAM format, the BAM file first needs to be converted to SAM using Samtools as described in the Interoperability section (10.2).

$ samtools view -h -o tophat/1a/accepted_hits.sam tophat/1a/accepted_hits.bam
$ samtools view -h -o tophat/1b/accepted_hits.sam tophat/1b/accepted_hits.bam
$ samtools view -h -o tophat/2a/accepted_hits.sam tophat/2a/accepted_hits.bam
$ samtools view -h -o tophat/2b/accepted_hits.sam tophat/2b/accepted_hits.bam

There are several features to which reads can be assigned. Here, we count reads mapping to the genomic intervals defined by the gene_id field. In the GTF file, this specifies the region that starts at the 5’ end of the first exon and ends at the 3’ end of the last exon.

$ htseq-count --stranded no -i gene_id tophat/1a/accepted_hits.sam corrected.gtf > tophat/1a/accepted_hits.gene.counts
$ htseq-count --stranded no -i gene_id tophat/1b/accepted_hits.sam corrected.gtf > tophat/1b/accepted_hits.gene.counts
$ htseq-count --stranded no -i gene_id tophat/2a/accepted_hits.sam corrected.gtf > tophat/2a/accepted_hits.gene.counts
$ htseq-count --stranded no -i gene_id tophat/2b/accepted_hits.sam corrected.gtf > tophat/2b/accepted_hits.gene.counts

htseq-count                  Tool for this particular task
--stranded no                Read does not have to be aligned to the same strand as the gene model.
-i gene_id                   Attribute from the GTF file to consider as one feature, i.e., gene_id
tophat/X/                    Directory containing the alignment files
accepted_hits.sam            Input SAM file
corrected.gtf                GTF file name
accepted_hits.gene.counts    Output gene counts file name

More information regarding this tool and its relevant parameters can be found in the HTSeq-count manual at http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

6.4 Bowtie/BWA genome count summarization

This is performed using the last step described above in ‘Summarizing counts from TopHat genome alignment using HTSeq’, but replacing the SAM alignment file with the one from the 'bowtie' or 'bwa' directories, e.g., to compute counts for Bowtie alignment sample 1a:

$ htseq-count --stranded no -i gene_id bowtie/ge/1a.sam corrected.gtf > bowtie/ge/1a.counts


7. Differential Expression

7.1 Differential expression of summarized TopHat count data using DESeq

First, the counts for each gene in each sample need to be assembled into a table in which the first column contains the gene names (row names):

$ paste <(awk '{print $1"\t"$2}' tophat/1a/accepted_hits.gene.counts) <(awk '{print $2}' tophat/1b/accepted_hits.gene.counts) <(awk '{print $2}' tophat/2a/accepted_hits.gene.counts) <(awk '{print $2}' tophat/2b/accepted_hits.gene.counts) > htseqCountTable.txt

- Add column names to identify samples:

$ perl -0777 -i -ne 'print " \tsample1a\tsample1b\tsample2a\tsample2b\n$_"' htseqCountTable.txt
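Note that paste simply joins the files line by line, so the table is only correct if every count file lists the same genes in the same order. The check below is a minimal sketch of how this could be verified; the function name same_gene_order is our own.

```shell
# Return 0 if the first column (gene ids) of every given file is
# identical to that of the first file, otherwise name the offender.
same_gene_order() {
  ref_ids=$(cut -f1 "$1")
  shift
  for f in "$@"; do
    if [ "$(cut -f1 "$f")" != "$ref_ids" ]; then
      echo "gene order differs: $f" >&2
      return 1
    fi
  done
}
```

For example: same_gene_order tophat/1a/accepted_hits.gene.counts tophat/1b/accepted_hits.gene.counts tophat/2a/accepted_hits.gene.counts tophat/2b/accepted_hits.gene.counts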

The DESeq package runs within the R environment. Use the interactive R shell:

$ R

R version 2.15.2 (2012-10-26) -- "Trick or Treat" Copyright (C) 2012 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.

- Use the read.delim function to read the count table into the R environment:

> countTable <- read.delim("htseqCountTable.txt", row.names = 1)

- Remove HTSeq-count special counters (last 5 rows of the count table):

> countTable <- countTable[ -( (dim(countTable)[1] - 4):dim(countTable)[1] ), ]

- Create a conditions factor variable, which indicates the treatment type for each sample:

> conditions <- factor( c( "mock", "mock", "treatment", "treatment" ) )

- Load the DESeq package into the R environment:

> library( "DESeq" )

- Create a CountDataSet data structure, which stores all of the sample attributes that the DESeq package requires for computing differential expression:


> cds <- newCountDataSet( countTable, conditions )

Normalization is a critical step prior to testing for differential expression. DESeq contains the function, estimateSizeFactors, which uses the median count ratio method to compute scaling factors for each sample:

> cds <- estimateSizeFactors( cds )

- Next, the variance for each gene needs to be estimated using the estimateDispersions function:

> cds <- estimateDispersions( cds )

- Test for differential expression between "mock" and "treatment" samples using the nbinomTest function:

> res <- nbinomTest( cds, "mock", "treatment" )

- Export results to a table, which can be easily viewed as a spreadsheet:

> write.csv( res, file="topTable.csv" )

- Identify the gene IDs that correspond to differentially expressed genes (in this case we classify genes with an adjusted p-value < 0.05 as differentially expressed):

> counts( cds, normalized = TRUE)[ which( res$padj < 0.05 ), ]

- Close R using the following command:

> q()
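Once topTable.csv has been written, the same cutoff can also be applied outside R. The sketch below assumes that the file was produced by write.csv as above and contains columns named id and padj, which is what we would expect from the nbinomTest result table; the columns are looked up by name rather than by position. The function name significant_genes is our own.

```shell
# Print the ids of genes whose adjusted p-value is below a cutoff
# (default 0.05). Columns "id" and "padj" are located via the header.
significant_genes() {
  awk -F ',' -v cutoff="${2:-0.05}" '
    { gsub(/"/, "") }                    # drop the quotes added by write.csv
    NR == 1 {
      for (i = 1; i <= NF; i++) { if ($i == "padj") p = i; if ($i == "id") g = i }
      next
    }
    p && g && $p != "NA" && $p + 0 < cutoff + 0 { print $g }
  ' "$1"
}
```

For example, significant_genes topTable.csv lists the gene ids at the default cutoff, and significant_genes topTable.csv 0.01 applies a stricter one. Genes with an adjusted p-value of NA are skipped.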

7.2 Differential expression of summarized Bowtie/BWA genome or transcriptome count data using DESeq

Summarized count data originating from the other alignment programs described earlier in this guide can be used with DESeq by following the steps described above. NB: the HTSeq-count special counters are not present in the count files associated with transcriptome alignments. Therefore step 2 in the R environment,

> countTable <- countTable[ -( (dim(countTable)[1] - 4):dim(countTable)[1] ), ]

should be omitted.
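
An alternative to removing rows by position is to remove the special counters by name, so that the same step is harmless when they are absent. The sketch below shows the idea in Python rather than R and is only an illustration, not part of the tutorial's workflow. It assumes the five HTSeq-count counter names no_feature, ambiguous, too_low_aQual, not_aligned and alignment_not_unique, optionally prefixed with two underscores as written by newer HTSeq versions. The counts in the toy table are invented.

```python
# Names of the HTSeq-count special counters (assumed; newer HTSeq
# versions write them with a leading "__").
SPECIAL_COUNTERS = {
    "no_feature",
    "ambiguous",
    "too_low_aQual",
    "not_aligned",
    "alignment_not_unique",
}


def drop_special_counters(rows):
    """Keep only rows whose ID is not a special counter.

    rows is a list of (feature_id, counts) tuples.
    """
    return [
        (feature_id, counts)
        for feature_id, counts in rows
        if feature_id.lstrip("_") not in SPECIAL_COUNTERS
    ]


# Hypothetical count table with invented numbers.
toy_table = [
    ("AT3G01970", [12, 9, 250, 231]),
    ("AT5G56270", [40, 38, 41, 45]),
    ("no_feature", [3, 2, 5, 4]),
    ("ambiguous", [0, 1, 0, 0]),
    ("too_low_aQual", [0, 0, 0, 0]),
    ("not_aligned", [7, 5, 9, 6]),
    ("alignment_not_unique", [2, 2, 3, 1]),
]

gene_rows = drop_special_counters(toy_table)
gene_ids = [feature_id for feature_id, _ in gene_rows]
print(gene_ids)
```

Applied to a table without special counters, the function returns the table unchanged, so the step does not need to be omitted for transcriptome alignments.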

8. Isoform Analysis

8.1 Isoform analysis using MiSO

MiSO can only compare two samples at a time. Therefore, we have added samples 3a and 3b to the sample-data solely for the MiSO analysis. As input, MiSO uses splice-aligned reads from a spliced aligner such as TopHat.

- Alignment of samples 3a and 3b using TopHat:

$ tophat -p 6 --min-intron-length 40 --max-intron-length 2000 --bowtie1 --no-novel-juncs --read-mismatches 2 -G corrected.gtf --output-dir tophat_sample3a bowtie_index/ath.10.16 sample-data/sample3a.fastq
$ tophat -p 6 --min-intron-length 40 --max-intron-length 2000 --bowtie1 --no-novel-juncs --read-mismatches 2 -G corrected.gtf --output-dir tophat_sample3b bowtie_index/ath.10.16 sample-data/sample3b.fastq

- Before MiSO can use the BAM files that are generated by TopHat, they first need to be indexed:

$ samtools index tophat_sample3a/accepted_hits.bam
$ samtools index tophat_sample3b/accepted_hits.bam

Like TopHat, MiSO uses an annotation of the different splice variants / isoforms. However, MiSO does not use a GTF file but a GFF file, which additionally needs to be indexed.

- Conversion of the GTF file into a GFF file:

$ perl gtf_to_gff.pl corrected.gtf > corrected.gff

- Now MiSO can index this GFF file:

$ mkdir miso_index
$ python /MISO/misopy/index_gff.py --index corrected.gff miso_index/

python             Tool for this particular task
/MISO/misopy/      Directory of the MiSO script
index_gff.py       Script used for indexing the GFF file
--index            Argument to index
corrected.gff      The GFF file that requires indexing
miso_index/        The output directory

MiSO can now calculate the expression estimates for each isoform of each gene. The two lines below are the commands that are normally used to calculate these. However, as they calculate the estimates for every individual gene and we have only added one gene to the sample 3a and 3b data, we can speed up the process by using the replacement commands for that single gene.

$ python /MISO/misopy/run_events_analysis.py --compute-genes-psi miso_index/ tophat_sample3a/accepted_hits.bam --output-dir miso_sample3a/ --read-len 50
$ python /MISO/misopy/run_events_analysis.py --compute-genes-psi miso_index/ tophat_sample3b/accepted_hits.bam --output-dir miso_sample3b/ --read-len 50

python                     Tool for this particular task
/MISO/misopy/              Directory of the MiSO script
run_events_analysis.py     Script used for expression estimation
--compute-genes-psi        Argument to initiate expression estimation
miso_index/                The MiSO indexed GFF directory
tophat_sampleX             Input sample directory
accepted_hits.bam          Input BAM file
--output-dir               Argument for the output directory
miso_sampleX/              Output directory
--read-len                 Argument for the read length
50                         The read length

- Replacement commands for gene At1g69310 only:

$ python /MISO/misopy/run_miso.py --compute-gene-psi AT1G69310 miso_index/chr1/AT1G69310.pickle tophat_sample3a/accepted_hits.bam miso_sample3a/ --read-len 50 --overhang-len 1
$ python /MISO/misopy/run_miso.py --compute-gene-psi AT1G69310 miso_index/chr1/AT1G69310.pickle tophat_sample3b/accepted_hits.bam miso_sample3b/ --read-len 50 --overhang-len 1

MiSO can now summarize all the expression estimate results:

$ python /MISO/misopy/run_miso.py --summarize-samples miso_sample3a/ miso_sample3a/
$ python /MISO/misopy/run_miso.py --summarize-samples miso_sample3b/ miso_sample3b/

python                  Tool for this particular task
/MISO/misopy/           Directory of the MiSO script
run_miso.py             Script used for this analysis
--summarize-samples     Argument to initiate summarization
miso_sampleX/           Specified twice (as the input and the output directory)

To compare the two samples for differential splicing use the following command:

$ python /MISO/misopy/run_miso.py --compare-samples miso_sample3a/ miso_sample3b/ miso_3a_vs_3b/

python                Tool for this particular task
/MISO/misopy/         Directory of the MiSO script
run_miso.py           Script used for this analysis
--compare-samples     Argument to initiate sample comparison
miso_sample3a/        MiSO analysis directory of sample 1
miso_sample3b/        MiSO analysis directory of sample 2
miso_3a_vs_3b/        Comparison output directory

To get a quick overview of interesting alternative splice events, MiSO can filter the comparison of the two samples so that only events meeting all specified criteria are kept. For a detailed overview we refer to the MiSO manual or paper. NB: in our samples different splice variants / isoforms are present for the gene At1g69310, but they are not differentially expressed between samples 3a and 3b. Therefore the command below will not report this gene as a hit.

$ python /MISO/misopy/filter_events.py --filter miso_3a_vs_3b/miso_sample3a_vs_miso_sample3b/bayes-factors/miso_sample3a_vs_miso_sample3b.miso_bf --num-total 1000 --num-inc 2 --num-exc 2 --num-sum-inc-exc 20 --delta-psi 0.30 --bayes-factor 20 --apply-both --output-dir miso_3a_vs_3b/filtered/
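
To make the idea of "meeting all specified criteria" concrete, the Python sketch below applies two of the thresholds from the command above (--delta-psi 0.30 and --bayes-factor 20) to a few invented events. It is a simplified illustration and not MiSO's implementation: the read-count criteria are left out, the dictionary keys and event names are hypothetical, and the use of "greater than or equal to" is an assumption.

```python
def passes_filter(event, min_delta_psi=0.30, min_bayes_factor=20):
    """Return True when an event meets both thresholds.

    event is a dict with the hypothetical keys 'delta_psi' (difference
    in isoform fraction between the two samples) and 'bayes_factor'.
    """
    return (
        abs(event["delta_psi"]) >= min_delta_psi
        and event["bayes_factor"] >= min_bayes_factor
    )


# Invented comparison results, for illustration only.
events = [
    {"name": "hypothetical_event_1", "delta_psi": 0.45, "bayes_factor": 150.0},
    {"name": "hypothetical_event_2", "delta_psi": 0.02, "bayes_factor": 0.1},
    {"name": "hypothetical_event_3", "delta_psi": -0.35, "bayes_factor": 5.0},
]

hits = [event["name"] for event in events if passes_filter(event)]
print(hits)
```

An event whose isoform fractions barely differ between the two samples, like the second invented event, fails the delta-psi criterion regardless of its read coverage, which is why a gene without differential isoform expression is not reported as a hit.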

8.2 Plotting MiSO data using Sashimi

To make Sashimi plots of the MiSO data we need the total number of aligned reads within the BAM files. These numbers are needed to scale the axes of the plot correctly. To obtain the aligned read counts we use Samtools.

$ samtools view -c -F 4 tophat_sample3a/accepted_hits.bam
$ samtools view -c -F 4 tophat_sample3b/accepted_hits.bam

samtools              Tool for this particular task
view                  Argument to obtain the read count, in combination with the arguments below
-c                    Argument to count the alignments instead of printing them
-F                    Argument to exclude lines that contain the flag given next
4                     Flag representing unmapped reads
tophat_sample3x       Input sample directory
accepted_hits.bam     Input BAM file to count
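
The effect of "-c -F 4" can be mimicked on SAM text to see what is being counted. The Python sketch below is an illustration only and does not replace Samtools: it assumes tab-separated SAM lines with the flag in the second column, and the read names and alignments are invented. A record is counted when bit 4 of its flag is not set. Note that alignment records are counted, so a read with several reported alignments contributes more than once.

```python
UNMAPPED = 0x4  # SAM flag bit that marks an unmapped read


def count_mapped(sam_lines):
    """Count alignment records whose flag does not contain bit 4."""
    mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not an alignment record
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if not flag & UNMAPPED:
            mapped += 1
    return mapped


# Invented SAM content for illustration.
toy_sam = [
    "@HD\tVN:1.0\tSO:coordinate",
    "read1\t0\t1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\t1\t250\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read2\t256\t3\t900\t255\t4M\t*\t0\t0\tACGT\tIIII",
]

mapped_records = count_mapped(toy_sam)
print(mapped_records)  # prints 3
```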

These numbers are then used to specify the read counts within the settings file of the Sashimi plot. We have already prepared this settings file; it is supplied with the sample-data and is called “sashimi.txt”. We advise everybody to open this file and see which variables can be specified. The essential variables are: the location and name of the BAM files, the locations of the MiSO expression estimates, and the read counts. To visualize the gene and its splice variants / isoforms use the following command.

$ python /MISO/misopy/sashimi_plot/plot.py --plot-event "AT1G69310" miso_index/ sashimi.txt --output-dir sashimi_plot/

python                          Tool for this particular task
/MISO/misopy/sashimi_plot/      Directory of the sashimi_plot script
plot.py                         Script used for this analysis
--plot-event                    Argument to plot a certain event
"AT1G69310"                     Gene id to plot
miso_index/                     The MiSO indexed GFF directory
sashimi.txt                     The settings file for the Sashimi plot
--output-dir                    Argument for the output directory
sashimi_plot/                   Output directory

On the right side of the plot, the expression estimate (as a percentage) for one of the splice variants / isoforms is depicted, with the 95% confidence interval within brackets. The given percentage is very high, indicating that mainly one isoform is expressed while the other isoform is barely expressed. Furthermore, the confidence intervals of the two samples almost completely overlap, indicating that there is no differential isoform expression between them.

8.3 Differential expression analysis with Cufflinks

To perform differential expression analysis on the spliced alignments generated by TopHat, only one command is required.

$ cuffdiff -o diff_out -p 2 -b Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa -u corrected.gtf -L mock,treated ./tophat/1a/accepted_hits.bam,./tophat/1b/accepted_hits.bam ./tophat/2a/accepted_hits.bam,./tophat/2b/accepted_hits.bam

cuffdiff                     Tool for this particular task
-o                           Attribute to specify the output directory
diff_out                     Name of the output directory
-p 2                         Use 2 processors (threads) for the analysis
-b                           Attribute to enable a bias detection algorithm that uses the genome sequence to improve the accuracy of transcript abundance estimates
Arabidopsis….toplevel.fa     Genome FASTA sequence used by the -b attribute
-u                           Attribute to enable the correction for multireads
corrected.gtf                GTF file name
-L                           Attribute to specify the label prefix in the output
mock,treated                 Label prefix in output (comma separated)
./tophat….bam,./…hits.bam    BAM sample files of category 1 (mock, comma separated)
./tophat….bam,./…hits.bam    BAM sample files of category 2 (treated, comma separated)

For a more rigorous protocol that discusses how to use many of the features offered by Cufflinks, please see Trapnell et al. (2012) [S6].

9. Visualization with IGV

It is often desirable to visually inspect RNA-Seq datasets and observe how reads align to genome regions of interest. The Integrative Genomics Viewer (IGV) is one tool that meets these requirements and is compatible with all operating systems. IGV accepts both SAM and BAM files as input. The BAM format is a compressed version of the SAM file format, so the data can be accessed much more quickly. Before we can load the aligned BAM files we need to index them.

- Use the Interoperability guide below to index the BAM files for the TopHat aligned samples 1a, 1b, 2a and 2b
- Start IGV
- Select ‘File’
- Select ‘Load from file’
- Navigate to the directory that contains the BAM file for TopHat aligned sample 1a (tophat/1a/)
- Open file ‘accepted_hits.bam’
- Repeat the previous steps for samples 1b, 2a and 2b
- Within the search box (left of the button labelled ‘go’), type one of the following gene-ids: At3g01970 or At5g56270. These are the two genes we have provided reads for in this example data set. Within samples 3a and 3b is the gene At1g69310.
- Below is a screenshot of the IGV view of samples 1a, 1b, 2a and 2b with gene At3g01970.
- Far fewer reads align in samples 1a and 1b (two top-most windows) than in samples 2a and 2b (two bottom-most windows). This observation indicates that gene At3g01970 is differentially expressed between conditions. This is confirmed by the differential expression analysis earlier in this guide.

Screenshot of the IGV browser, with samples 1a, 1b, 2a and 2b loaded, and displaying gene At3g01970.

10. Interoperability

Some of the most commonly required file format conversions are described below:

10.1 Convert SAM to BAM

The first step in converting a SAM file to a BAM file is creating an indexed reference sequence file. This is done using the FASTA sequence file to which the reads within the SAM file were aligned.

$ samtools faidx sequencefile.fa

samtools            Tool for this particular task
faidx               Argument to index the reference
sequencefile.fa     The FASTA sequence file that was used for alignment of the reads within the SAM file

After this step the SAM file can be converted to a BAM file:

$ samtools view -bt referencefai samfile > bamfile

samtools         Tool for this particular task
view -bt         Argument to convert SAM to BAM
referencefai     The indexed sequence file from the step above
samfile          The source SAM file
bamfile          The destination BAM file

10.2 Convert BAM to SAM

$ samtools view -h -o samfile bamfile

samtools        Tool for this particular task
view -h -o      Argument to convert BAM to SAM
samfile         The destination SAM file
bamfile         The source BAM file

10.3 Sort a BAM file

$ samtools sort bamfile sortedbamfile

samtools          Tool for this particular task
sort              Argument to sort
bamfile           The source BAM file
sortedbamfile     The destination sorted BAM file

10.4 Index a BAM file

$ samtools index sortedbamfile

samtools          Tool for this particular task
index             Argument to index a BAM file
sortedbamfile     A sorted BAM file for indexing

10.5 Sort a SAM file

Many tools that use SAM files as input require a sorted SAM file. There are multiple ways to sort a SAM file. Using the commands listed above it is possible to first convert an unsorted SAM file into BAM format, then sort the BAM file, and finally convert the sorted BAM file into a sorted SAM file.

If a SAM file does not contain any unaligned sequences (indicated by an * in the chromosome column), then sorting can be done with one simple command:

$ sort -k 3,3 -k 4,4n samfile > sortedsamfile

sort                  Command for this particular task
-k 3,3 -k 4,4n        Arguments to sort by column 3 (chromosome) and then numerically by column 4 (position)
samfile               The source SAM file
sortedsamfile         The destination sorted SAM file

If the file does contain unaligned sequences the above command will result in a file where the unaligned sequences are at the top of the file, which is not accepted by most tools. To work around this, we provide a perl script that first takes the headers, then sorts the aligned sequences and finally, adds the unaligned sequences to create a fully compatible sorted SAM file.

$ perl sortsam.pl samfile sortedsamfile

perl              Command for this particular task
sortsam.pl        Custom perl script provided as suppl. data
samfile           The source SAM file
sortedsamfile     The destination sorted SAM file
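
The three stages described above (headers first, then the sorted aligned sequences, then the unaligned sequences) can be sketched in a few lines of Python. This is an illustration of the idea and not the sortsam.pl script itself, whose internals are not shown in this guide. It assumes tab-separated SAM lines held in memory, treats a * in column 3 as unaligned, and sorts chromosome names as text. The helper sam_line and all reads are invented for the example.

```python
def sort_sam_lines(lines):
    """Return SAM lines as: headers, sorted aligned records, unaligned records."""
    headers = [line for line in lines if line.startswith("@")]
    records = [line for line in lines if not line.startswith("@")]
    aligned = [line for line in records if line.split("\t")[2] != "*"]
    unaligned = [line for line in records if line.split("\t")[2] == "*"]
    # Sort by chromosome name (column 3), then numerically by position (column 4).
    aligned.sort(key=lambda line: (line.split("\t")[2], int(line.split("\t")[3])))
    return headers + aligned + unaligned


def sam_line(name, flag, chrom, pos):
    """Build a minimal 11-column SAM record (hypothetical helper for the example)."""
    cigar = "*" if chrom == "*" else "4M"
    return "\t".join([name, str(flag), chrom, str(pos), "255", cigar,
                      "*", "0", "0", "ACGT", "IIII"])


# Invented, unsorted SAM content.
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:1\tLN:1000",
    sam_line("r1", 4, "*", 0),
    sam_line("r2", 0, "2", 150),
    sam_line("r3", 0, "1", 900),
    sam_line("r4", 16, "1", 20),
]

sorted_sam = sort_sam_lines(toy_sam)
order = [line.split("\t")[0] for line in sorted_sam]
print(order)
```

For real data files, which are usually too large to hold in memory comfortably, the BAM route described at the start of this section remains the more practical choice.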

10.6 Conversion of GTF to GFF

Many tools that require an annotation file as input require it to be in GFF format as opposed to GTF. Many of the GTF files that can be downloaded contain extra “;” characters within the description. Therefore, before converting to GFF, first remove the additional “;” as described in the Genome read alignment with TopHat section of this hands-on guide. After this, the conversion can be done with a perl script that was posted by Genec on Seqanswers.com. We have modified this script into two versions, which differ in how the chromosomes are annotated. If the original genome sequence indicates chromosomes using “Chr1”, use the gtf_to_gff_chr.pl script; if they are indicated only by a number such as “1”, use the gtf_to_gff.pl script. For this tutorial use gtf_to_gff.pl, as in section 8.1.

$ perl gtf_to_gff.pl file.gtf > file.gff

or

$ perl gtf_to_gff_chr.pl file.gtf > file.gff

perl              Command for this particular task
gtf_to_gff.pl     Custom perl script provided as suppl. data
file.gtf          The source GTF file
file.gff          The destination GFF file
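
Why an extra “;” is a problem becomes clear when the attribute column of a GTF line is split into its key/value pairs: the pairs themselves are separated by “;”, so a “;” inside a quoted description produces a spurious extra field. The Python sketch below illustrates one way to remove such characters. It is not the procedure from the TopHat section and not part of the provided perl scripts; the attribute string and its "note" value are invented for the example.

```python
def strip_inner_semicolons(attributes):
    """Remove ';' characters that occur inside double-quoted values."""
    cleaned = []
    in_quotes = False
    for char in attributes:
        if char == '"':
            in_quotes = not in_quotes  # entering or leaving a quoted value
        if char == ";" and in_quotes:
            continue  # drop a semicolon that is part of a description
        cleaned.append(char)
    return "".join(cleaned)


def split_pairs(attributes):
    """Split an attribute string on ';' and drop empty pieces."""
    return [piece.strip() for piece in attributes.split(";") if piece.strip()]


# Invented attribute column with an extra ';' inside the quoted note.
raw = 'gene_id "AT1G69310"; note "example; description";'

broken_pairs = split_pairs(raw)                        # 3 pieces, one spurious
clean_pairs = split_pairs(strip_inner_semicolons(raw))  # 2 pieces, as intended
print(clean_pairs)
```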

11. References

S1. Andrews, S. et al. FastQC: A quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
S2. Langmead, B. et al. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25
S3. Li, H. et al. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 25, 1754-1760
S4. Grant, G. et al. (2011) Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM). Bioinformatics 27, 2518-2528
S5. Trapnell, C. et al. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28, 511-515
S6. Trapnell, C. et al. (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Prot 7, 562-578
S7. Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biol 11, R106
S8. Anders, S. HTSeq: Analysing high-throughput sequencing data with Python. http://www-huber.embl.de/users/anders/HTSeq/
S9. Katz, Y. et al. (2010) Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Meth 7, 1009-1015
S10. Li, H. et al. (2009) The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics 25, 2078-2079
S11. Robinson, J.T. et al. (2011) Integrative Genomics Viewer. Nat Biotech 29, 24-26
S12. Gulledge, A.A. et al. (2012) Mining Arabidopsis thaliana RNA-seq data with Integrated Genome Browser reveals stress-induced alternative splicing of the putative splicing regulator SR45a. American Journal of Botany 99, 219-231
