Introduction to Differential Gene Expression Analysis Using RNA-Seq
Total Page:16
File Type:pdf, Size:1020Kb
Introduction to differential gene expression analysis using RNA-seq Written by Friederike D¨undar, Luce Skrabanek, Paul Zumbo September 2015 updated November 14, 2019 1 Introduction to RNA-seq 5 1.1 RNA extraction . .5 1.1.1 Quality control of RNA preparation (RIN) . .6 1.2 Library preparation methods . .6 1.3 Sequencing (Illumina) . 10 1.4 Experimental Design . 11 1.4.1 Capturing enough variability . 12 1.4.2 Avoiding bias . 13 2 Raw Data (Sequencing Reads) 15 2.1 Download from ENA . 15 2.2 Storing sequencing reads: FASTQ format . 16 2.3 Quality control of raw sequencing data . 20 3 Mapping reads with and without alignment 23 3.1 Reference genomes and annotation . 26 3.1.1 File formats for defining genomic regions . 26 3.2 Aligning reads using STAR ...................................... 30 3.3 Storing aligned reads: SAM/BAM file format . 31 3.3.1 The SAM file header section . 32 3.3.2 The SAM file alignment section . 34 3.3.3 Manipulating SAM/BAM files . 37 3.4 Quality control of aligned reads . 39 3.4.1 Basic alignment assessments . 39 3.4.2 Bias identification . 44 3.4.3 Quality control with QoRTs . 47 3.4.4 Summarizing the results of different QC tools with MultiQC ............... 47 4 Read Quantification 49 4.1 Gene-based read counting . 49 4.2 Isoform counting methods (transcript-level quantification) . 50 5 Normalizing and Transforming Read Counts 53 5.1 Normalization for sequencing depth differences . 53 5.1.1 DESeq2's specialized data set object . 54 5.2 Transformation of sequencing-depth-normalized read counts . 56 5.2.1 Log2 transformation of read counts . 56 5.2.2 Visually exploring normalized read counts . 57 5.2.3 Transformation of read counts including variance shrinkage . 58 5.3 Exploring global read count patterns . 59 5.3.1 Pairwise correlation . 59 5.3.2 Hierarchical clustering . 59 5.3.3 Principal Components Analysis (PCA) . 61 6 Differential Gene Expression Analysis (DGE) 63 © 2015-2019 Applied Bioinformatics Core j Weill Cornell Medical College Page 1 of 97 6.1 Estimating the difference between read counts for a given gene . 64 6.2 Testing the null hypothesis . 65 6.3 Running DGE analysis tools . 65 6.3.1 DESeq2 workflow . 65 6.3.2 Exploratory plots following DGE analysis . 66 6.3.3 Exercise suggestions . 69 6.3.4 edgeR . 69 6.3.5 limma-voom .......................................... 71 6.4 Judging DGE results . 73 6.5 Example downstream analyses . 75 7 Appendix 78 7.1 Improved alignment . 78 7.2 Additional tables . 79 7.3 Installing bioinformatics tools on a UNIX server . 89 Page 2 of 97 © 2015-2019 Applied Bioinformatics Core j Weill Cornell Medical College List of Tables List of Tables 1 Sequencing depth recommendations. 12 2 Examples for technical and biological replicates. 12 3 Illumina's different base call quality score schemes. 19 4 The fields of the alignment section of SAM files. 34 5 The FLAG field of SAM files. 35 6 Comparison of classification and clustering techniques. 60 7 Programs for DGE. 63 8 Biases and artifacts of Illumina sequencing data. 79 9 FASTQC test modules. 80 10 Optional entries in the header section of SAM files. 82 11 Overview of RSeQC scripts. 83 12 Overview of QoRTs QC functions. 85 13 Normalizing read counts between different conditions. 87 14 Normalizing read counts within the same sample. 88 List of Figures 1 RNA integrity assessment (RIN). .6 2 Overview of general RNA library preparation steps. .7 3 Size selection steps during common RNA library preparations. .9 4 The different steps of sequencing by synthesis. 10 5 Sequence data repositories. 15 6 Schema of an Illumina flowcell. 18 7 Phred score ranges. 19 8 Typical bioinformatics workflow of differential gene expression analysis. 20 9 Scheme of raw read QC. 21 10 Classic sequence alignment example. 23 11 RNA-seq read alignment. 24 12 Alignment-free sequence comparison. 25 13 Kallisto's strategy. 25 14 Schematic representation of a SAM file. 32 15 CIGAR strings of aligned reads. 37 16 Graphical summary of STAR's log files for 96 samples. 41 17 Different modes of counting read-transcript overlaps. 49 18 Schema of RSEM-based transcrip quantification. 51 19 Effects of different read count normalization methods. 53 20 Comparison of the read distribution plots for untransformed and log2-transformed values. 57 21 Comparison of log2- and rlog-transformed read counts. 58 22 Dendrogram of rlog-transformed read counts. 61 23 PCA on raw counts and rlog-transformed read counts. 62 24 Regression models and DGE. 64 25 Histogram and MA plot after DGE analysis. 66 26 Heatmaps of log-transformed read counts. 67 27 Read counts for two genes in two conditions. 69 28 Schematic overview of different DE analysis approaches. 71 29 Example plots to judge DGE analysis results. 74 30 Example results for ORA and GSEA. 77 © 2015-2019 Applied Bioinformatics Core j Weill Cornell Medical College Page 3 of 97 Technical Prerequisites Command-line interface The first steps of the analyses { that are the most computationally demanding { will be performed directly on our servers. The interaction with our servers is completely text-based, i.e., there will be no graphical user interface. We will instead be communicating entirely via the command line using the UNIX shell scripting language bash. You can find a good introduction into the shell basics at http://linuxcommand.org/lc3_learning_the_shell.php (for our course, chapters 2, 3, 5, 7, and 8 are probably most relevant). To start using the command line, Mac users should use the App called Terminal. Windows users need to install putty, a Terminal emulator (http://www.chiark.greenend.org.uk/~sgtatham/putty/download. html). You probably want the bits under the A Windows installer for everything except PuTTYtel heading. Putty will allow you to establish a connection with a UNIX server and interact with it. Programs that we will be using via the command line: FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ featureCounts http://bioinf.wehi.edu.au/subread-package/) MultiQC http://multiqc.info/docs/ QoRTs https://hartleys.github.io/QoRTs/ RSeQC http://rseqc.sourceforge.net/ samtools http://www.htslib.org/ STAR https://github.com/alexdobin/STAR UCSC tools.