
Gene Expression: Transcription & RNA Seq

Ling Tao, Baylor College of Medicine, 2018-09-25

Transcription

Slides are adapted from Dr. Kevin G. McCracken’s lecture slides unless otherwise specified.

Outline

1. Transcription

• Process

• RNA alternative splicing

• Bioinformatics: promoter sequence and TF prediction

2. RNA seq

• RNA seq process

• Experimental design

• Bioinformatics: RNA analysis resource and online database

Different types of RNA:

1. mRNA Messenger RNA, encodes the sequence of a polypeptide.

2. tRNA Transfer RNA, transports amino acids to ribosomes during translation.

3. rRNA Ribosomal RNA, forms complexes called ribosomes with proteins, the structure on which mRNA is translated.

4. snRNA Small nuclear RNA, forms complexes with proteins used in eukaryotic RNA processing (e.g., exon splicing and intron removal).

5. miRNA/siRNA Micro RNA/small interfering RNA, short ~22 nt RNA sequences that bind to the 3’ UTR of target mRNAs and result in gene silencing.

6. lncRNA Long non-coding RNA, transcribed RNA more than 200 nucleotides long that does not encode proteins (or lacks an open reading frame of more than 100 amino acids).

Three Steps of Transcription:

1. Initiation

2. Elongation

3. Termination

• Occur in both prokaryotes and eukaryotes.

• Elongation is conserved in prokaryotes and eukaryotes.

• Initiation and termination proceed differently.

Prokaryotes possess only one type of RNA polymerase, which transcribes mRNAs, tRNAs, and rRNAs.

Transcription is more complicated in eukaryotes

Eukaryotes possess three RNA polymerases:

1. RNA polymerase I, transcribes three major rRNAs: 28S, 18S, 5.8S

2. RNA polymerase II, transcribes mRNAs and some snRNAs

3. RNA polymerase III, transcribes tRNAs, 5S rRNA, and some snRNAs

*S values of rRNAs refer to molecular size, as determined by sucrose gradient centrifugation. RNAs with larger S values are larger/have a greater density.

Eukaryotes - Transcription of protein-coding genes by RNA polymerase II

• RNA polymerase II transcribes a precursor mRNA (pre-mRNA).

• We can divide eukaryotic promoters into two regions:

1. Core promoter elements. The best characterized are:

• A short sequence called Inr (Initiator)

• TATA box = TATAAAA, located at about position -30

*AT-rich DNA is easier to denature than GC-rich DNA

2. Promoter-proximal elements (located upstream, ~-50 to -200 bp): “CAAT box” = CAAT and “GC box” = GGGCGG

• Different combinations occur near different genes.

• Transcription regulatory proteins (activators and also repressors) are required.
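The promoter elements listed above can be located with a simple string search. The following is a minimal sketch, assuming exact matches to the consensus sequences given on the slide (real promoter elements vary around these consensus sequences). The function name, the toy sequence, and the TSS index are hypothetical and chosen only for illustration.

```python
import re

# Consensus motifs as given on the slide; real promoters deviate from these.
MOTIFS = {
    "TATA box": "TATAAAA",
    "CAAT box": "CAAT",
    "GC box": "GGGCGG",
}


def relative_position(index, tss_index):
    """Convert a 0-based index to a TSS-relative position (+1 = TSS, no 0)."""
    offset = index - tss_index
    return offset if offset < 0 else offset + 1


def find_promoter_elements(sequence, tss_index):
    """Return (motif name, position relative to TSS) for each exact match.

    sequence  : DNA string, 5'->3', coding strand, containing the TSS
    tss_index : 0-based index of the +1 nucleotide within `sequence`
    """
    seq = sequence.upper()
    hits = []
    for name, motif in MOTIFS.items():
        # The lookahead reports overlapping matches as well.
        for match in re.finditer(f"(?={motif})", seq):
            hits.append((name, relative_position(match.start(), tss_index)))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical toy promoter: C filler with one copy of each motif.
TSS_INDEX = 100
TOY_UPSTREAM = (
    "C" * 10 + "GGGCGG" + "C" * 9 + "CAAT" + "C" * 41
    + "TATAAAA" + "C" * 23 + "A" + "C" * 10
)

print(find_promoter_elements(TOY_UPSTREAM, TSS_INDEX))
# [('GC box', -90), ('CAAT box', -75), ('TATA box', -30)]
```

Exact matching is the simplest possible choice here; it will miss degenerate sites and will report many chance matches for a motif as short as CAAT.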

• General transcription factors (GTFs) are required by RNA polymerases.

• GTFs are proteins, assembled on the core promoter.

• Each GTF works with only one kind of RNA polymerase, and all three RNA polymerases require GTFs.

• Transcription initiation by eukaryotic RNA polymerase II involves TFIIA, TFIIB, TFIID, TFIIE, TFIIF, and TFIIH.

• Binding of GTFs and RNA polymerase occurs in a set order in protein-coding genes.

• The complete complex (RNA polymerase + GTFs) is called a pre-initiation complex (PIC).

Order of binding is: IID + IIA + IIB + RNA polymerase II + IIF + IIE + IIH

Transcription regulatory proteins = Activators

• High-level transcription is induced by binding of activator factors to DNA sequences called enhancers.

• Enhancers are usually located upstream of the gene they control and modulate transcription from a distance.

• Can be several kb from the gene

• Silencer DNA elements and repressor TFs also exist

Elongation:

RNA polymerase adds NTPs complementary to the template strand to the growing RNA chain and continues along the template.
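The base-pairing rule behind elongation can be sketched in a few lines of Python. This is an illustration only: the function name and the example template are hypothetical, and the sketch ignores everything about the real enzyme except complementarity.

```python
# Pairing between a DNA template base and the incoming ribonucleotide.
PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5):
    """Build the RNA (written 5'->3') from a DNA template read 3'->5'."""
    rna = []
    for base in template_3to5.upper():
        rna.append(PAIRING[base])  # add the complementary NTP to the chain
    return "".join(rna)


print(transcribe("TACGGCATT"))  # AUGCCGUAA
```

The template is read 3' to 5' while the RNA grows 5' to 3', which is why no reversal is needed when the template is supplied in the 3'->5' orientation.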

Termination:

When RNA polymerase transcribes past a poly(A) site, the newly synthesized RNA is cleaved by:

• CPSF (cleavage and polyadenylation specificity factor) protein

• CstF (cleavage stimulation factor) protein

• CF I and CF II (cleavage factor proteins)

Production of the pre-mRNA

Three main parts:

1. 5’ untranslated region (5’ UTR) or leader sequence

2. Coding sequence, specifies amino acids to be translated

3. 3’ untranslated region (3’ UTR) or trailer sequence, which may contain information that signals the stability of the particular mRNA

Production of mature mRNA in eukaryotes:

Alternative Splicing Events

FIGURE 1. Alternative splicing events. Types of AS events, previously described and commented in this work, are based on a comparison between the constitutive splicing and the alternative splicing events for a certain gene. Color boxes represent exons while black lines represent introns from a gene. Splicing sites are depicted by connections with dashed lines.

Iñiguez and Hernández, Front. Genet., 2017

How to find promoter sequence?

1. Go to the Ensembl website: http://www.ensembl.org/index.html

2. Choose an organism, such as human.

3. Search for your gene, such as BRCA2.

4. Click the right hit on the search result page; it will bring you to the gene summary page (for example, the page for the BRCA2 gene).

5. On the left, under "Gene Summary", click "Sequence". The sequence of the gene, including the 5' flanking region, exons, introns, and 3' flanking region, will be displayed.

6. The exons are highlighted with a yellow background and red text; the sequence in front of the first exon is the promoter sequence.

7. By default, 600 bp of 5'-flanking sequence (promoter) is displayed. To get more, click "Configure this page" in the lower left column; a popup window opens where you can enter the size of the 5' flanking sequence (upstream). For example, enter "1000" and then save the configuration.

8. Sometimes there are discrepancies between the Ensembl and UCSC annotations regarding the TSS. To make sure the first exon given by Ensembl is right, copy the promoter sequence.

9. Go to UCSC BLAT, choose the right genome (e.g., human), and paste the sequence there. On the result page, click the "browser" link of the first hit; this will bring you to the genome browser page, where the query sequence is aligned with the UCSC genome sequence. Zoom out a bit and you will be able to determine whether the promoter sequence matches the UCSC annotation. If it matches, the sequence is very likely the right one.

10. In the UCSC genome browser, you can turn on the CpG island feature (under the “Regulation” menu). If there is a CpG island in the promoter sequence, the sequence is highly likely a true promoter. In the above example (BRCA2), a CpG island is displayed in the proximal promoter.

11. Beware that some genes have alternative promoters. Finding those sequences requires extensive bioinformatics and experimental analysis.

http://www.protocol-online.org/
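Step 10 treats a CpG island as supporting evidence for a promoter. As a rough offline check on a sequence copied from Ensembl, the sketch below computes GC content and the observed/expected CpG ratio. The thresholds (length of at least 200 bp, GC content above 50%, observed/expected above 0.6) are commonly quoted criteria and are an assumption here, not something stated on the slide; the function names are hypothetical. The check is applied to the whole sequence rather than to a sliding window, so it is cruder than a genome browser track.

```python
def cpg_stats(sequence):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = sequence.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_fraction = (c_count + g_count) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    if c_count and g_count:
        obs_exp = (cpg_count * n) / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(sequence, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed thresholds to the whole sequence (no sliding window)."""
    gc_fraction, obs_exp = cpg_stats(sequence)
    return (
        len(sequence) >= min_len
        and gc_fraction > min_gc
        and obs_exp > min_obs_exp
    )


# Hypothetical toy inputs, not real promoter sequences.
print(looks_like_cpg_island("CG" * 150))             # True
print(looks_like_cpg_island("AT" * 150))             # False
print(looks_like_cpg_island("C" * 150 + "G" * 150))  # False: GC-rich, CpG-poor
```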

TF prediction

TRANSFAC (TRANScription FACtor database) is a manually curated database of eukaryotic transcription factors, their genomic binding sites and DNA binding profiles. The contents of the database can be used to predict potential transcription factor binding sites. Not free for the professional version!

JASPAR is a collection of transcription factor DNA-binding preferences, modeled as matrices. JASPAR is the only database with this scope where the data can be used with no restrictions (open-source).
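To make "binding preferences modeled as matrices" concrete, the sketch below converts a count matrix into log-odds scores and slides it along a sequence. The matrix is a made-up four-position example, not a real JASPAR profile, and the pseudocount and uniform background are assumptions chosen for simplicity.

```python
import math

# Hypothetical count matrix (NOT a real JASPAR profile).
# Keys are bases; list entries are counts at each motif position.
TOY_COUNTS = {
    "A": [8, 0, 0, 9],
    "C": [1, 0, 10, 0],
    "G": [0, 10, 0, 0],
    "T": [1, 0, 0, 1],
}


def to_log_odds(counts, pseudocount=0.5, background=0.25):
    """Turn per-position counts into log2(probability / background) scores."""
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[base][i] for base in "ACGT") + 4 * pseudocount
        column = {}
        for base in "ACGT":
            probability = (counts[base][i] + pseudocount) / total
            column[base] = math.log2(probability / background)
        pwm.append(column)
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence; return (start index, score) pairs."""
    seq = sequence.upper()
    width = len(pwm)
    return [
        (start, sum(pwm[j][seq[start + j]] for j in range(width)))
        for start in range(len(seq) - width + 1)
    ]


def best_hit(sequence, pwm):
    """Return the highest-scoring (start index, score) window."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])


pwm = to_log_odds(TOY_COUNTS)
start, score = best_hit("TTTAGCATTT", pwm)
print(start)  # 3, where the window AGCA matches the toy consensus
```

In practice a score threshold is needed to decide which windows count as predicted sites; choosing it is outside the scope of this sketch.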

TFM-Explorer (Transcription Factor Matrix Explorer) is a program for analyzing regulatory regions in eukaryotic genomes. It takes a set of coregulated gene sequences, and searches for locally overrepresented transcription factor binding sites.

RNA Sequencing

Resource: StatQuest and UC Davis Bioinformatics Core

RNA sequencing

RNA-Seq (RNA sequencing), also called whole transcriptome shotgun sequencing (WTSS), uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment.

Wang et al, Nature Genet., 2009

RPKM, FPKM and TPM

RPKM (Reads Per Kilobase Million)

FPKM (Fragments Per Kilobase Million)

• Normalize read counts for sequencing depth (total number of mapped reads) and then for gene length

TPM (Transcripts Per Million)

• Normalize read counts for gene length first and then for sequencing depth (total number of length-normalized reads)

RPKM: Step 1 Normalize for read depth

Gene Name      Rep 1 Counts   Rep 2 Counts   Rep 3 Counts
A (2 kb)       10             12             30
B (4 kb)       20             25             60
C (1 kb)       5              8              15
D (10 kb)      0              0              1

Total reads:   35             45             106
Scaled reads:  3.5            4.5            10.6

(Scaled reads = total reads / 10 in this toy example; with real data the scaling factor is one million.)

Normalized by scaled total reads:

Gene Name      Rep 1 RPM      Rep 2 RPM      Rep 3 RPM
A (2 kb)       2.86           2.67           2.83
B (4 kb)       5.71           5.56           5.66
C (1 kb)       1.43           1.78           1.42
D (10 kb)      0              0              0.09

RPKM: Step 2 Normalize for gene length

Normalized by scaled total reads:

Gene Name      Rep 1 RPM      Rep 2 RPM      Rep 3 RPM
A (2 kb)       2.86           2.67           2.83
B (4 kb)       5.71           5.56           5.66
C (1 kb)       1.43           1.78           1.42
D (10 kb)      0              0              0.09

Normalized by gene length:

Gene Name      Rep 1 RPKM     Rep 2 RPKM     Rep 3 RPKM
A (2 kb)       1.43           1.33           1.42
B (4 kb)       1.43           1.39           1.42
C (1 kb)       1.43           1.78           1.42
D (10 kb)      0              0              0.009
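The two RPKM steps above can be reproduced with a short Python sketch. The counts and gene lengths are the toy values from the tables; the `per` argument is an assumption added here so that the slide numbers (which scale by 10 rather than by one million) can be matched.

```python
# Toy counts per gene across three replicates, as in the tables above.
COUNTS = {"A": [10, 12, 30], "B": [20, 25, 60], "C": [5, 8, 15], "D": [0, 0, 1]}
LENGTH_KB = {"A": 2, "B": 4, "C": 1, "D": 10}


def rpkm(counts, length_kb, per=1_000_000):
    """Normalize for sequencing depth first, then for gene length."""
    n_rep = len(next(iter(counts.values())))
    # Step 1: per-replicate total reads, scaled by `per`.
    scaled_totals = [
        sum(counts[gene][r] for gene in counts) / per for r in range(n_rep)
    ]
    result = {}
    for gene, row in counts.items():
        rpm = [row[r] / scaled_totals[r] for r in range(n_rep)]
        # Step 2: divide by gene length in kilobases.
        result[gene] = [value / length_kb[gene] for value in rpm]
    return result


values = rpkm(COUNTS, LENGTH_KB, per=10)  # per=10 matches the toy example
for gene, row in values.items():
    print(gene, [round(v, 3) for v in row])
```

Summing each replicate column of the result gives a different total per replicate, which is the property that the later RPKM versus TPM comparison highlights.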

TPM (Transcripts Per Million)

• Normalize read counts for gene length first and then for sequencing depth (total number of length-normalized reads)
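The order of operations for TPM (length first, then depth) can be sketched as follows. The counts and gene lengths are the toy values used throughout this example; the `per` argument is an assumption added here so that the toy scaling by 10 can be reproduced, whereas real TPM uses one million.

```python
# Toy counts per gene across three replicates.
COUNTS = {"A": [10, 12, 30], "B": [20, 25, 60], "C": [5, 8, 15], "D": [0, 0, 1]}
LENGTH_KB = {"A": 2, "B": 4, "C": 1, "D": 10}


def tpm(counts, length_kb, per=1_000_000):
    """Normalize for gene length first, then for sequencing depth."""
    n_rep = len(next(iter(counts.values())))
    # Step 1: reads per kilobase (RPK) for every gene and replicate.
    rpk = {
        gene: [row[r] / length_kb[gene] for r in range(n_rep)]
        for gene, row in counts.items()
    }
    # Step 2: divide by the per-replicate RPK total, scaled by `per`.
    scaled_totals = [sum(rpk[gene][r] for gene in rpk) / per for r in range(n_rep)]
    return {
        gene: [rpk[gene][r] / scaled_totals[r] for r in range(n_rep)]
        for gene in rpk
    }


values = tpm(COUNTS, LENGTH_KB, per=10)  # per=10 matches the toy example
for r in range(3):
    print(round(sum(values[gene][r] for gene in values), 6))  # 10.0 each time
```

Because the depth normalization happens last, every replicate sums to the same total (`per`), so a gene's value can be read as its share of the library.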

Step 1 Normalize for gene length

Gene Name      Rep 1 Counts   Rep 2 Counts   Rep 3 Counts
A (2 kb)       10             12             30
B (4 kb)       20             25             60
C (1 kb)       5              8              15
D (10 kb)      0              0              1

Normalized by gene length:

Gene Name      Rep 1 RPK      Rep 2 RPK      Rep 3 RPK
A (2 kb)       5              6              15
B (4 kb)       5              6.25           15
C (1 kb)       5              8              15
D (10 kb)      0              0              0.1

TPM: Step 2 Normalize for sequencing depth

Normalized by gene length:

Gene Name      Rep 1 RPK      Rep 2 RPK      Rep 3 RPK
A (2 kb)       5              6              15
B (4 kb)       5              6.25           15
C (1 kb)       5              8              15
D (10 kb)      0              0              0.1

Total RPK:     15             20.25          45.1
Scaled RPK:    1.5            2.025          4.51

Normalized by sequencing depth (total scaled RPK):

Gene Name      Rep 1 TPM      Rep 2 TPM      Rep 3 TPM
A (2 kb)       3.33           2.96           3.326
B (4 kb)       3.33           3.09           3.326
C (1 kb)       3.33           3.95           3.326
D (10 kb)      0              0              0.02

RPKM

Gene Name      Rep 1 RPKM     Rep 2 RPKM     Rep 3 RPKM
A (2 kb)       1.43           1.33           1.42
B (4 kb)       1.43           1.39           1.42
C (1 kb)       1.43           1.78           1.42
D (10 kb)      0              0              0.009
Total:         4.29           4.5            4.25

TPM

Gene Name      Rep 1 TPM      Rep 2 TPM      Rep 3 TPM
A (2 kb)       3.33           2.96           3.326
B (4 kb)       3.33           3.09           3.326
C (1 kb)       3.33           3.95           3.326
D (10 kb)      0              0              0.02
Total:         10             10             10

RNA Seq Experimental Design

1. Identify the objective

• Qualitative data includes identifying expressed transcripts, and identifying exon/intron boundaries, transcriptional start sites (TSS), and poly-A sites. Here, we will refer to this type of information as "annotation".

• Quantitative data includes measuring differences in expression, alternative splicing, alternative TSS, and alternative polyadenylation between two or more treatments or groups.

2. Sequencing depth

• Depth is determined by the goals of the experiment and the type of sample

• 10-20 million reads are good for differential gene expression

• 100-200M paired-end reads are needed to reliably detect low-copy transcripts/isoforms from a typical mammalian tissue

• Even at 1 billion reads, unique transcripts can still be found

3. Platform and length options

BCM Genomic and RNA Profiling Core

4. Multiplexing

Multiplex sequencing allows large numbers of libraries to be pooled and sequenced simultaneously during a single run on a high-throughput instrument.

Illumina
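How many libraries can share a lane follows from simple arithmetic on lane yield and target depth. The sketch below is illustrative only: the lane yield of 300 million reads is a hypothetical figure (actual yields depend on the instrument and run type), and the function names are made up for this example. The 20 million read target comes from the differential expression guideline under sequencing depth.

```python
import math


def libraries_per_lane(lane_yield_million, target_depth_million):
    """Whole number of libraries that fit in one lane at the target depth."""
    return int(lane_yield_million // target_depth_million)


def lanes_needed(n_libraries, lane_yield_million, target_depth_million):
    """Lanes required so every library reaches the target depth."""
    per_lane = libraries_per_lane(lane_yield_million, target_depth_million)
    if per_lane == 0:
        raise ValueError("target depth exceeds the yield of a single lane")
    return math.ceil(n_libraries / per_lane)


LANE_YIELD_MILLION = 300  # hypothetical lane yield, in millions of reads
TARGET_DEPTH_MILLION = 20  # differential expression guideline

print(libraries_per_lane(LANE_YIELD_MILLION, TARGET_DEPTH_MILLION))  # 15
print(lanes_needed(24, LANE_YIELD_MILLION, TARGET_DEPTH_MILLION))    # 2
```

The calculation assumes reads are split evenly across pooled libraries, which real pools only approximate.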

5. Technical vs biological replicates

Technical replicates

• Account for variation in preparation

• Useful to even out lane effects

• Allow data processing even if one lane fails

• Can increase the cost of the experiment

Biological replicates

• Allow measurement of variation between individuals/samples

• More replicates increase statistical power

• Some studies support increasing replicates over deeper sequencing (as long as you can detect genes of interest)

Scotty - Power Analysis for RNA Seq Experiments

• When fed user-supplied pilot data, Scotty outputs a range of sample size/coverage configurations that satisfy the specified power and cost constraints.

• Two-group design

Input

Control columns in pilot data: 3
Test columns in pilot data: 3
Cost per replicate, control: $500
Cost per replicate, test: $500
Cost per million reads: $20
Alignment Rate: 50%
Maximum cost of experiment: $20000
Percentage of genes detected: 50
At p value cutoff: 0.05
For the following true fold change: 2
Maximum percentage of genes with low-powered (biased) measurements: 50

Summary of Findings

Scotty has tested 90 possible experimental designs. The following experiments meet your criteria:

Least expensive: 4 replicates sequenced to a depth of 10 million reads aligned to genes per replicate.

Most powerful: 10 replicates sequenced to a depth of 10 million reads aligned to genes per replicate.
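To see how the cost inputs relate to the budget cap, the sketch below applies a simple cost model to the two reported designs. The model is an assumption made for illustration and is not taken from Scotty: each replicate costs its preparation fee plus the raw reads needed to reach the aligned depth at the stated alignment rate. The function name and parameters are hypothetical.

```python
def experiment_cost(
    replicates_per_group,
    aligned_reads_million,
    cost_per_replicate=500.0,
    cost_per_million_reads=20.0,
    alignment_rate=0.5,
    groups=2,
):
    """Total cost under an assumed model (not Scotty's own calculation)."""
    # Raw reads that must be sequenced to obtain the aligned depth.
    raw_reads_million = aligned_reads_million / alignment_rate
    per_sample = cost_per_replicate + raw_reads_million * cost_per_million_reads
    return groups * replicates_per_group * per_sample


MAX_COST = 20000  # "Maximum cost of experiment" from the input

for replicates in (4, 10):
    cost = experiment_cost(replicates, aligned_reads_million=10)
    print(replicates, cost, cost <= MAX_COST)
# 4 7200.0 True
# 10 18000.0 True
```

Under this assumed model both reported designs fit within the $20000 cap; the statistical power side of the analysis is not reproduced here.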

Busby et al., Bioinformatics, 2013

Differential Gene Expression Analysis Workflow

EMBL-EBI online training course for RNA-Seq data analysis

RNA-seq 2G

Zhang et al., BioRxiv preprint, 2017

R2: Genomics Analysis and Visualization Platform

Example: Tumor Neuroblastoma Gene - TARGET - 161 - fpkm - ensh37e59gc

http://r2.amc.nl

MORPHEUS

View your dataset as a heat map, then explore the interactive tools in Morpheus. Cluster, create new annotations, search, filter, sort, display charts, and more.

LGEA Web Portal

i.e. human transcriptome array (HTA)

Ian Korf, Nature Methods, 2013