Transcriptome Sequencing (RNA-Seq & single-cell Seq)
TIGP EAMCB, 11/02/2020
Albert Yu, IMB Bioinformatics Core

General Outline
1. From RNA to sequencing data
2. Experimental and practical considerations
3. Bulk transcriptomic analysis methods and tools
   a. Differential gene expression
   b. Transcriptome Assembly
4. Single-cell RNA-Seq (scRNA-Seq)
Quiz !!!!

Transcriptome Sequencing (Bulk)
• Differential Gene Expression (DGE)
  • Quantitative evaluation
  • Comparison of transcript levels, usually between different groups
  • The vast majority of RNA-Seq is done for DGE
• Transcriptome Assembly
  • Build a new or improved profile of transcribed regions ("gene models") of the genome
  • Can then be used for DGE

From RNA -> sequence data
Sequencing
• Short-read: Illumina HiSeq
• Long-read: Nanopore

Quality control of RNA preparation (RIN)
RNA integrity assessment is based on the ratio of ( ) rRNA, estimated from:

Which Sequencer should I use? (Short-reads)
NovaSeq 6000

Sequencing Technology Workflow - Illumina
Library preparation
1. Fragment RNA
2. Reverse transcription
3. Ligate adaptors
4. Select by size
Cluster generation
1. Attach DNA to flow cell
2. Bridge amplification
3. Generate clusters
4. Anneal Seq primer
Sequencing
1. Extend one base
2. Read signal
3. De-block
4. Repeat 1-3
5. Generate base calls

Sequencing Technology Workflow - Nanopore

Comparison between Illumina and Oxford Nanopore
                   Illumina (HiSeq 4000)   Oxford Nanopore (MinION)
Read length        Up to 150 bp            Up to 2 Mb
Number of reads    ~300 million            1–10 million
Processing time    1–3.5 days              ~6 hours
Error rate         < 1%                    5–15%
Cost per run       ~$3000                  $500–$900
Instrument price   $900K                   $1K
Advantage          Highly accurate         Long sequencing reads; portable device

2. Experimental and practical considerations

Removal of rRNA
Types of RNA:
• Ribosomal (rRNA) (80-90%)
  • Responsible for protein synthesis
• Messenger (mRNA) (3-7%)
  • Translated into protein in the ribosome
  • Have poly-A tails in eukaryotes
• Transfer (tRNA) (10-15%)
  • Bring specific amino acids for protein synthesis
• Micro (miRNA)
  • Short non-coding RNA for expression regulation
• Others (lncRNA, shRNA, siRNA, snoRNA, etc.)
Removal/Enrichment Methods:
• Ribosomal depletion
• Size selection
• Poly-A selection (eukaryotes only)

Increase the number of biological replicates
1. Increasing the number of biological replicates consistently and significantly increases the power to detect DE genes, regardless of sequencing depth.
The more, the better! Use at least 3 replicates.
Bioinformatics. 2014 Feb 1;30(3):301-4.
Front Plant Sci. 2018 Feb 14;9:108.

How many reads should be enough?
Experiments                                                                     Reads
For a quick snapshot of highly expressed genes                                  5–25 million reads per sample
For a more global view of gene expression or alternative splicing (isoforms)    30–60 million reads per sample
For an in-depth view of the transcriptome or novel transcriptome assembly       100–200 million reads per sample
For miRNA-Seq or small RNA analysis                                             1–5 million reads per sample

How long should the reads be?
Analysis                         Read length
Gene expression profiling        1 x 75–100 bp
Transcript expression profiling  2 x 75–100 bp
Transcriptome analysis           2 x 100–150 bp
Small RNA analysis               1 x 50 bp

Beware confounding factors (batch effects)
• The ideal experimental design compares two groups that differ in only one factor.
• A batch effect can occur when subsets of the replicates are handled separately at any stage of the process; the handling group becomes, in effect, another factor.
• If you can't process all the samples at once, avoid processing all samples from a single group together.

Beware systematic biases (randomization)
• Avoid systematic biases in the arrangement of replicates.
• Don't arrange replicate sample sets in the same order.

Barcoding for multiple samples

3. Bulk transcriptomic analysis methods and tools
a. Differential gene expression

Data analysis pipeline (Illumina - DEG)
(Figure: pipeline diagram, steps (1) to (6))

Bioinformatics is all about file formats
• Raw data (FASTQ/FAST5)
• Alignment (SAM/BAM)
• Reference genomes (FASTA)
• Annotation files (GTF/GFF)
• Result files (TXT)

Raw sequencing data
FASTA
>unique_sequence_ID My sequence is pretty cool
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA
FASTQ
(Figure: FASTQ record, quality values marked Q)

Annotation file formats
• GFF3 – General feature format
• GTF – Gene transfer format
• These files can be downloaded from public databases, e.g. NCBI, Ensembl

Data quality control – always check your data first
FastQC
(Figure panels: Base Composition, Normal vs. Abnormal; Phred Scores)

Reads cleaning - cutadapt; Trimmomatic; fastp
1. Detect and remove sequencing adapters present in the FASTQ files
2. Filter / trim reads according to quality and length

    fastp --in1 ${input}.R1.fastq.gz \
          --in2 ${input}.R2.fastq.gz \
          --out1 ${input}.R1.cleaned_filtered.fastq.gz \
          --out2 ${input}.R2.cleaned_filtered.fastq.gz \
          --unpaired1 ${input}.R1_singles.fastq.gz \
          --unpaired2 ${input}.R2_singles.fastq.gz \
          -l 100 \
          -w 1 \
          -j fastp.json \
          -h fastp.html \
          -t 8

Generating alignments - STAR
1. Determine where in the reference genome each sequencing read originated
2. The reference includes the complete genome and transcriptome sequence
3. Input is typically FASTQ reads and a reference genome in FASTA format
4. Output is a Sequence Alignment/Map (SAM/BAM) format file

    STAR --runThreadN $num_process \
         --genomeDir $ref_genome_Index \
         --readFilesIn $input.fastq \
         --sjdbGTFfile $ref_annotation.gtf \
         --outFileNamePrefix $output_file_name \
         --limitGenomeGenerateRAM 32000000000 \
         --outSAMtype BAM SortedByCoordinate \
         --outTmpDir $output_temp_dir

Alignment file format – SAM / BAM

SAM CIGAR

Visualize alignments - IGV
IGV is the visualization tool used for this snapshot

Quantification of gene/transcript abundance
• Conventional methods: RSEM; featureCounts
• Novel methods: Sailfish; Salmon; Kallisto (based on pseudo-alignments)

Statistical analysis
1. Normalization of all gene counts across all samples
2. Clustering of samples based on all gene expression profiles
3. Identification of differentially expressed genes (DEGs)
4. Functional annotation
5. Gene Set Enrichment Analysis (GSEA)

Normalization of expression data
(Figure: expression data before and after normalization)

Normalization of expression data
CPM (counts per million)
  Description: counts scaled by total number of reads
  Accounted factors: sequencing depth
  Recommendations for use: NOT for within sample comparisons or DE analysis
TPM (transcripts per kilobase million)
  Description: counts per length of transcript (kb) per million reads mapped
  Accounted factors: sequencing depth and gene length
  Recommendations for use: NOT for DE analysis
RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped)
  Description: similar to TPM
  Accounted factors: sequencing depth and gene length
  Recommendations for use: NOT for between sample comparisons or DE analysis
DESeq2's median of ratios [1]
  Description: counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene
  Accounted factors: sequencing depth and RNA composition
  Recommendations for use: gene count comparisons between samples and for DE analysis; NOT for within sample comparisons
edgeR's trimmed mean of M values (TMM) [2]
  Description: uses a weighted trimmed mean of the log expression ratios between samples
  Accounted factors: sequencing depth and RNA composition
  Recommendations for use: gene count comparisons between samples and for DE analysis

Dimension reduction analysis:
1. Multidimensional scaling plot (MDS) - distance
2. Principal component analysis (PCA) - correlation

Heatmap
• Hierarchical clustering based on expression profiles
• Correlations between replicates

Differentially expressed genes (DEG)
1. Pairwise comparison between two groups
2. Statistical packages: edgeR; DESeq2; limma
edgeR
• Complex experimental designs using a generalized linear model
• Information sharing among genes (Bayesian gene-wise dispersion estimation)
• Input: raw gene/transcript read counts

Functional annotations / pathway analysis
1. Gene Ontology (GO) enrichment
2. Pathway (KEGG) enrichment
3. DAVID
4. STRING
5. Enrichr
6. WGCNA
7. …

3. Bulk transcriptomic analysis methods and tools
b. Transcriptome Assembly

Transcriptome Assembly – Reference based
• Used when the reference genome is available but the transcriptome is unknown
• StringTie and Scripture are well-regarded transcriptome assemblers

Transcriptome Assembly
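
Worked example: CPM, TPM and median-of-ratios normalization (sketch)
The normalization methods listed earlier (CPM, TPM, DESeq2's median of ratios) can be made concrete with a few lines of arithmetic. The following is a minimal sketch under stated assumptions, not the implementation used by DESeq2 or any other package: the gene names, counts and transcript lengths are hypothetical toy values, and the function names cpm, tpm and size_factors are illustrative only.

```python
import math
from statistics import median


def cpm(counts):
    """Counts per million: scale each gene count by the sample's total reads."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: divide by length (kb) first, then scale to 1e6."""
    rpk = {gene: c / (lengths_bp[gene] / 1000) for gene, c in counts.items()}
    scale = sum(rpk.values()) / 1e6
    return {gene: value / scale for gene, value in rpk.items()}


def size_factors(samples):
    """Simplified median-of-ratios size factors, one per sample.

    samples is a list of {gene: count} dicts. Genes with a zero count in
    any sample are skipped because their geometric mean would be zero.
    """
    genes = [g for g in samples[0] if all(s[g] > 0 for s in samples)]
    # Per-gene geometric mean across samples, computed on the log scale.
    geo_mean = {
        g: math.exp(sum(math.log(s[g]) for s in samples) / len(samples))
        for g in genes
    }
    # Each sample's size factor is the median of its count / geometric-mean ratios.
    return [median(s[g] / geo_mean[g] for g in genes) for s in samples]


# Hypothetical toy data: three genes, transcript lengths in bp.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

cpm_values = cpm(counts)
tpm_values = tpm(counts, lengths_bp)

# A second hypothetical sample sequenced exactly twice as deeply.
deeper = {gene: 2 * c for gene, c in counts.items()}
factors = size_factors([counts, deeper])

for gene in counts:
    print(f"{gene}\tCPM={cpm_values[gene]:.1f}\tTPM={tpm_values[gene]:.1f}")
print("size factors:", [round(f, 3) for f in factors])
```

In this toy data geneA and geneB have different raw counts but the same TPM, because geneB's transcript is three times longer; that length correction is why TPM suits within-sample comparisons while CPM does not. The two size factors differ by a factor of two, matching the twofold difference in sequencing depth between the two samples.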