Transcriptome Sequencing (RNA-Seq & Single-Cell Seq)


TIGP EAMCB 11/02/2020

Albert Yu
IMB Bioinformatics Core

General Outline

1. From RNA to sequencing data
2. Experimental and practical considerations
3. Bulk transcriptomic analysis methods and tools
   a. Differential gene expression
   b. Transcriptome Assembly
4. Single-cell RNA-Seq (scRNA-Seq)

Quiz !!!!

Transcriptome Sequencing (Bulk)

• Differential Gene Expression (DGE)
  • Quantitative evaluation
  • Comparison of transcript levels, usually between different groups
  • Vast majority of RNA-Seq is for DGE
• Transcriptome Assembly
  • Build new or improved profile of transcribed regions (“gene models”) of the genome
  • Can then be used for DGE

From RNA to sequence data

Sequencing

Short-read: Illumina HiSeq
Long-read: Nanopore

Quality control of RNA preparation (RIN)

RNA integrity assessment is based on the ratio of ( ) rRNA, estimated from:

Which sequencer should I use? (Short reads)

NovaSeq 6000

Sequencing Technology Workflow - Illumina

Library preparation
1. Fragment RNA
2. Reverse transcription
3. Ligate adaptors
4. Select by size

Cluster generation
1. Attach DNA to flow cell
2. Bridge amplification
3. Generate clusters
4. Anneal sequencing primer

Sequencing
1. Extend one base
2. Read signal
3. De-block
4. Repeat 1-3
5. Generate base calls

Sequencing Technology Workflow - Nanopore

Comparison between Illumina and Oxford Nanopore

                   Illumina (HiSeq 4000)   Oxford Nanopore (MinION)
Read length        Up to 150 bp            Up to 2 Mb
Number of reads    ~300 million            1–10 million
Processing time    1–3.5 days              ~6 hours
Error rate         < 1%                    5–15%
Cost per run       ~$3000                  $500–$900
Instrument price   $900K                   $1K
Advantage          Highly accurate         Long sequencing reads; portable device


Removal of rRNA

Type of RNA:
• Ribosomal (rRNA) (80-90%)
  • Responsible for protein synthesis
• Messenger (mRNA) (3-7%)
  • Translated into protein in the ribosome
  • Have poly-A tails in eukaryotes
• Transfer (tRNA) (10-15%)
  • Bring specific amino acids for protein synthesis
• Micro (miRNA)
  • Short non-coding RNA for expression regulation
• Others (lncRNA, shRNA, siRNA, snoRNA, etc.)

Removal/Enrichment Methods:
• Ribosomal depletion
• Size selection
• Poly-A selection (eukaryotes only)

Increasing the number of biological replicates

1. Increasing the number of biological replicates consistently and significantly increases the power to detect DE genes, regardless of sequencing depth.

The more, the better! Use at least 3 replicates.

Bioinformatics. 2014 Feb 1;30(3):301-4.
Front Plant Sci. 2018 Feb 14;9:108.

How many reads should be enough?

Experiments                                                                     Reads
For a quick snapshot of highly expressed genes                                  5–25 million reads per sample
For a more global view of gene expression or alternative splicing (isoforms)    30–60 million reads per sample
For in-depth view of the transcriptome or novel transcriptome assembly          100–200 million reads per sample
For miRNA-Seq or small RNA analysis                                             1–5 million reads per sample

How long should the reads be?

Analysis                          Read length
Gene expression profiling         1 x 75–100 bp
Transcript expression profiling   2 x 75–100 bp
Transcriptome analysis            2 x 100–150 bp
Small RNA analysis                1 x 50 bp

Beware confounding factors (batch effects)

• Ideally, the experimental design compares two groups that differ in only one factor.
• Batch effects can occur when subsets of the replicates are handled separately at any stage of the process; the handling group becomes, in effect, another factor.
• If you can’t process all the samples at once, avoid processing all the samples from a single group together.

Beware systematic biases (randomization)

• Avoid systematic biases in the arrangement of replicates.
• Don’t arrange replicate sample sets in the same order.
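One way to act on both points is to shuffle the processing order and spread each group's replicates across batches. The following is a minimal Python sketch under stated assumptions: the sample names, the fixed seed, and the helper names `randomized_order` and `assign_batches` are hypothetical and are not part of the original slides.

```python
import random

def randomized_order(samples, seed=0):
    """Return the samples in a reproducible random processing order."""
    rng = random.Random(seed)   # fixed seed so the layout can be regenerated
    order = list(samples)       # copy, so the caller's list is left untouched
    rng.shuffle(order)
    return order

def assign_batches(samples_by_group, n_batches, seed=0):
    """Spread each group's replicates across batches.

    Replicates are shuffled within their group and then dealt out
    round-robin, so no batch ends up holding a single group only
    (as long as there are enough replicates per group).
    """
    rng = random.Random(seed)
    batches = [[] for _ in range(n_batches)]
    slot = 0
    for samples in samples_by_group.values():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        for sample in shuffled:
            batches[slot % n_batches].append(sample)
            slot += 1
    return batches

# Hypothetical design: two groups with three replicates each, three batches.
design = {"control": ["C1", "C2", "C3"], "treated": ["T1", "T2", "T3"]}
batches = assign_batches(design, n_batches=3, seed=1)
```

With this layout each batch contains one control and one treated sample, so the handling batch is not confounded with the experimental factor.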

Barcoding for multiple samples


Data analysis pipeline (Illumina - DEG)

Figure: pipeline diagram with steps (1) to (6)

Bioinformatics is all about file formats

• Raw data (FASTQ/FAST5)

• Alignment (SAM/BAM)

• Reference genomes (FASTA)

• Annotation files (GTF/GFF)

• Result files (TXT)

Raw sequencing data

FASTA

>unique_sequence_ID My sequence is pretty cool
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA

FASTQ
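A FASTQ record stores four lines per read: an `@` header, the sequence, a `+` separator, and one quality character per base. The sketch below reads such records in Python. It assumes the common unwrapped four-line layout; the function name `read_fastq` and the example read are illustrative only.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a FASTQ text handle.

    Assumes four lines per record (sequence lines are not wrapped).
    """
    while True:
        header = handle.readline()
        if not header:
            return                      # end of file
        header = header.rstrip("\n")
        if not header:
            continue                    # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual

# Made-up record for illustration.
example = io.StringIO("@read_1\nATTCATTA\n+\nIIIIHHG#\n")
records = list(read_fastq(example))
```

For gzip-compressed files, the same function can be given a handle from `gzip.open(path, "rt")`.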

Annotation file formats

• GFF3 – General feature format

• GTF – Gene transfer format

• These files can be downloaded from public databases, e.g. NCBI, Ensembl

Data quality control – always check your data first

FastQC

Base composition: normal vs. abnormal

Phred Scores
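A Phred score Q encodes the estimated base-calling error probability P as Q = -10 * log10(P), so Q20 means a 1% error rate and Q30 means 0.1%. The sketch below decodes FASTQ quality characters, assuming the Phred+33 ASCII offset used by current Illumina output; the function names are illustrative.

```python
def phred_to_error_prob(q):
    """Convert a Phred score to the estimated error probability."""
    return 10 ** (-q / 10)

def decode_quality(qual, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in qual]

def mean_error_prob(qual, offset=33):
    """Average per-base error probability of one read."""
    scores = decode_quality(qual, offset)
    return sum(phred_to_error_prob(q) for q in scores) / len(scores)

scores = decode_quality("I#")   # 'I' is ASCII 73, '#' is ASCII 35
```

Averaging the error probabilities, rather than the Phred scores themselves, keeps a few very poor bases from being hidden by many good ones.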

Reads cleaning - cutadapt; Trimmomatic; fastp

1. Detect and remove sequencing adapters present in the FastQ files

2. Filter / trim reads according to quality and length

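# Paired-end cleaning with fastp.
#   -l 100   discard reads shorter than 100 bp after trimming
#   -w 1     number of worker threads
#   -j / -h  paths of the JSON and HTML reports
#   -t 8     trim 8 bases from the tail (3' end) of read 1
fastp --in1 $input.R1.fastq.gz \
      --in2 $input.R2.fastq.gz \
      --out1 $input.R1.cleaned_filtered.fastq.gz \
      --out2 $input.R2.cleaned_filtered.fastq.gz \
      --unpaired1 $input.R1_singles.fastq.gz \
      --unpaired2 $input.R2_singles.fastq.gz \
      -l 100 \
      -w 1 \
      -j fastp.json \
      -h fastp.html \
      -t 8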
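The idea behind quality and length filtering can be illustrated in a few lines of Python. This is a deliberately simplified sketch, not fastp's actual algorithm: it trims low-quality bases from the 3' end only and then drops reads that have become too short. The thresholds and function names are assumptions chosen for the example.

```python
def trim_tail(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def clean_reads(records, min_q=20, min_len=100):
    """Yield trimmed (id, seq, qual) records that are at least min_len long."""
    for read_id, seq, qual in records:
        seq, qual = trim_tail(seq, qual, min_q)
        if len(seq) >= min_len:
            yield read_id, seq, qual

# Toy reads: 'I' is Q40, '#' is Q2 (Phred+33).
toy = [("r1", "ACGT", "II##"), ("r2", "ACGT", "IIII")]
kept = list(clean_reads(toy, min_q=20, min_len=3))
```

Here `r1` is trimmed to two bases and discarded, while `r2` passes unchanged.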

Generating alignments - STAR

1. Determine where in the reference genome each sequencing read originated

2. The reference includes the complete genome and transcriptome sequences

3. Input is typically FASTQ reads and a reference genome in FASTA format

4. Output is a Sequence Alignment/Map (SAM/BAM) file

STAR --runThreadN $num_process \
     --genomeDir $ref_genome_Index \
     --readFilesIn $input.fastq \
     --sjdbGTFfile $ref_annotation.gtf \
     --outFileNamePrefix $output_file_name \
     --limitGenomeGenerateRAM 32000000000 \
     --outSAMtype BAM SortedByCoordinate \
     --outTmpDir $output_temp_dir

Alignment file format – SAM / BAM

SAM CIGAR
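A CIGAR string describes how a read aligns to the reference as a series of length/operation pairs, for example `45M1000N50M` for a read spliced across a 1000 bp intron. The sketch below parses a CIGAR string and computes how many reference and read bases it covers, following the operation table in the SAM specification. The function names and the example string are illustrative.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")     # operations that advance along the reference
QUERY_CONSUMING = set("MIS=X")   # operations that advance along the read

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":             # '*' means no alignment information
        return []
    ops = CIGAR_RE.findall(cigar)
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in ops]

def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)

def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)

# 5 soft-clipped bases, 45 matched, a 1000 bp skipped region, 50 matched.
ops = parse_cigar("5S45M1000N50M")
```

The `N` operation is what lets a 100 bp RNA-Seq read span more than 1 kb of genome.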

Visualize alignments - IGV

IGV is the visualization tool used for this snapshot

Quantification of gene/transcript abundance

• Conventional methods: RSEM; featureCounts
• Novel methods: Sailfish; Salmon; Kallisto (based on pseudo-alignments)

Statistical analysis

1. Normalization of all gene counts across all samples
2. Clustering of samples based on all gene expression profiles
3. Identification of differentially expressed genes (DEGs)
4. Functional annotation
5. Gene Set Enrichment Analysis (GSEA)

Normalization of expression data

Before normalization / After normalization

CPM (counts per million)
  Description: counts scaled by total number of reads
  Accounted factors: sequencing depth
  Recommendations for use: NOT for within sample comparisons or DE analysis

TPM (transcripts per kilobase million)
  Description: counts per length of transcript (kb) per million reads mapped
  Accounted factors: sequencing depth and gene length
  Recommendations for use: NOT for DE analysis

RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped)
  Description: similar to TPM
  Accounted factors: sequencing depth and gene length
  Recommendations for use: NOT for between sample comparisons or DE analysis

DESeq2’s median of ratios [1]
  Description: counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene
  Accounted factors: sequencing depth and RNA composition
  Recommendations for use: gene count comparisons between samples and for DE analysis; NOT for within sample comparisons

edgeR’s trimmed mean of M values (TMM) [2]
  Description: uses a weighted trimmed mean of the log expression ratios between samples
  Accounted factors: sequencing depth, RNA composition, and gene length
  Recommendations for use: gene count comparisons between and within samples and for DE analysis
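Three of the methods above are simple enough to write out directly. The sketch below computes CPM, TPM, and median-of-ratios size factors in plain Python. It follows the descriptions in the table rather than the DESeq2 source code, so treat it as an illustration of the arithmetic; the function names and toy counts are assumptions.

```python
import math
from statistics import median

def cpm(counts):
    """Counts per million: scale one sample's counts by its library size."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

def size_factors(count_matrix):
    """Median-of-ratios size factors, one per sample.

    count_matrix[g][s] is the raw count of gene g in sample s.
    Genes with a zero count in any sample are skipped, because their
    geometric mean is zero.
    """
    n_samples = len(count_matrix[0])
    usable = [row for row in count_matrix if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for s in range(n_samples):
        log_ratios = [math.log(row[s]) - lg for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy data: sample 2 was sequenced exactly twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10]]
factors = size_factors(toy_counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in toy_counts]
```

After dividing by the size factors, each gene has the same normalized value in both samples, which is the expected outcome when depth is the only difference.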

Dimension reduction analysis:

1. Multidimensional scaling plot (MDS) - distance
2. Principal component analysis (PCA) - correlation

Heatmap

Hierarchical clustering based on expression profiles
Correlations between replicates

Differentially expressed genes (DEG)

1. Pairwise comparison between two groups
2. Statistical package: edgeR; DESeq2; limma

edgeR
• Complex experimental designs using generalized linear models
• Information sharing among genes (Bayesian gene-wise dispersion estimation)
• Input: raw gene/transcript read counts

Functional annotations / pathway analysis

1. Gene Ontology (GO) enrichment
2. Pathway (KEGG) enrichment
3. DAVID
4. STRING
5. Enrichr
6. WGCNA
7. …
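GO and pathway enrichment are commonly framed as an over-representation test: given a list of DEGs, are more of them annotated to a term than expected by chance? The sketch below uses a one-sided hypergeometric test for this. It is a generic illustration, not the exact statistic of any tool listed above (each uses its own variant and multiple-testing correction), and the function name and numbers are assumptions.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_in_term, n_deg, n_overlap):
    """P(overlap >= n_overlap) when n_deg genes are drawn at random.

    n_universe: number of genes tested (the background)
    n_in_term:  background genes annotated to the term
    n_deg:      number of differentially expressed genes
    n_overlap:  DEGs annotated to the term
    """
    upper = min(n_in_term, n_deg)
    total = comb(n_universe, n_deg)
    tail = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_deg - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Made-up numbers: 20,000 background genes, 200 in the term,
# 500 DEGs, 15 of which are annotated to the term (5 expected by chance).
p_value = hypergeom_enrichment_p(20000, 200, 500, 15)
```

In practice the test is repeated for every term, so the resulting p-values need a multiple-testing adjustment before interpretation.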


Transcriptome Assembly – Reference based

• Used when the reference genome is available but the transcriptome is unknown
• StringTie and Scripture are well-regarded transcriptome assemblers


Transcriptome Assembly – De novo

• Used when very little information is available for the genome
• Trinity, SPAdes, and Trans-ABySS are well-regarded transcriptome assemblers
• Algorithm: De Bruijn graph construction
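In a De Bruijn graph, every k-mer in the reads becomes an edge from its first k-1 bases to its last k-1 bases, and contigs are read off as paths through the graph. The sketch below builds such a graph and follows one unambiguous path. It is a toy illustration under strong assumptions (error-free reads, no reverse complements, no coverage information) and does not reflect how Trinity or the other assemblers are implemented; the function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # edge: prefix -> suffix
    return dict(graph)

def walk_unambiguous(graph, start):
    """Extend a contig from start while each node has one distinct successor."""
    contig = start
    node = start
    seen = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break                   # dead end or branch point
        node = successors.pop()
        if node in seen:
            break                   # stop on cycles
        seen.add(node)
        contig += node[-1]          # each step adds one new base
    return contig

# Two overlapping toy reads that share the 3-mer GGC.
graph = de_bruijn_graph(["ATGGC", "GGCGT"], k=3)
contig = walk_unambiguous(graph, "AT")
```

The two reads are merged into the single contig `ATGGCGT` because their shared k-mer joins them in the graph.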


10x Genomics - Single-cell sequencing workflow

Illumina Sequencing

Formation of GEMs (Gel Bead in emulsion)

Inside GEMs

1. Reverse transcription occurs
2. Pooled for cDNA amplification
3. Library construction

~3.6 million barcodes to separately index each cell’s transcriptome.
UMI: unique molecular identifier

Single-cell sequencing data analysis - 1

UMI - each read
Barcode - each cell
Feature - each gene

Y: the number of UMI counts mapped to each barcode
X: the number of barcodes below that value

A steep drop-off is indicative of good separation between the cell-associated barcodes and the barcodes associated with empty partitions.
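The barcode rank curve can be computed from per-read (barcode, UMI, gene) assignments. The sketch below counts distinct UMIs per barcode, sorts barcodes from highest to lowest count, and reports where the count falls most sharply. This is a crude stand-in for cell calling, not the algorithm used by Cell Ranger; the input tuple layout and function names are assumptions.

```python
from collections import defaultdict

def umi_counts_per_barcode(records):
    """Count distinct (gene, UMI) pairs for each cell barcode.

    records: iterable of (barcode, umi, gene) tuples, one per aligned read.
    Reads that share barcode, UMI, and gene are counted once.
    """
    molecules = defaultdict(set)
    for barcode, umi, gene in records:
        molecules[barcode].add((gene, umi))
    return {bc: len(m) for bc, m in molecules.items()}

def barcode_rank_curve(counts):
    """Return (rank, umi_count) points, highest count first."""
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))

def largest_drop_rank(curve):
    """Rank after which the UMI count falls by the largest ratio."""
    best_rank, best_ratio = None, 0.0
    for (rank, high), (_, low) in zip(curve, curve[1:]):
        ratio = high / max(low, 1)
        if ratio > best_ratio:
            best_rank, best_ratio = rank, ratio
    return best_rank

# Made-up totals: two cell-like barcodes and two near-empty partitions.
curve = barcode_rank_curve({"AAAC": 5000, "CCGT": 4000, "GGTA": 10, "TTAG": 5})
knee = largest_drop_rank(curve)
```

In this toy example the steep drop-off sits after rank 2, separating the two cell-associated barcodes from the background.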

Single-cell sequencing data analysis - 2

1. Transcript counts
2. Data normalization
3. Metadata
4. Data visualization & clustering - dimension reduction
   • t-SNE (t-distributed stochastic neighbor embedding)
   • UMAP (Uniform Manifold Approximation and Projection)

Seurat (https://satijalab.org/seurat/): clustering & feature anchors to match datasets
Monocle 3 (https://cole-trapnell-lab.github.io/monocle3/): trajectory mapping

Nonlinear dimension reduction methods

Figure panels: PCA, t-SNE, UMAP

• PCA failed to reconstruct the data sets.
• PCA, as a linear algorithm, assigns equal weights to all pairwise distances.
• Non-linear manifold learners, such as t-SNE and UMAP, prioritize distances between neighbors. This strategy allows them to figure out the intrinsic 2D dimensionality of the data.

t-SNE

https://www.youtube.com/watch?v=NEaUSP4YerM

Comprehensive dissection and clustering

Trajectory analysis

Pseudo-time

Conclusion

1. Pros:
   1. Single cell level expression, not bulk expression
   2. Generate high dimensional data
   3. Provide spatial information
2. Cons:
   1. Lower throughput
   2. More expensive

Thanks for your attention!!

Quiz

https://tinyurl.com/yxbkmfu6

https://docs.google.com/forms/d/e/1FAIpQLSdUCYhxCkvfrZxo4GaGRi37f4W6jbQK_o9uRwdQB-dC9BtKKA/viewform?usp=sf_link

If you are interested in Bioinformatics, feel free to contact us or stop by our Lab (N419)
