DNALC Live: RNA-Seq with DNA Subway Part II

Jason Williams

Cold Spring Harbor Laboratory, DNA Learning Center

[email protected] @JasonWilliamsNY

DNALC Live

This is an experiment, give us feedback on what you would like to see!

dnalc.cshl.edu/dnalc-live

DNALC Website and Social Media

dnalc.cshl.edu

youtube.com/DNALearningCenter

facebook.com/cshldnalc

@dnalc

@dna_learning_center

Who is this course for?

• Audience(s):
  • Undergraduate biology 200 level and up
  • (advanced AP Bio/graduate)

• Format: 3 sessions (1 per week); ~ 45 minutes each

• Exercises: Follow along with our online tool DNA Subway

• Learning resources: Slides and resource sheets available

Course Learning Goals

• Understand the rationale of an RNA-Seq experiment and its design

• Understand how we obtain DNA sequence and assess its quality

• Use DNA Subway (FastQC/FastX) to QC sequence data

• Use DNA Subway (Kallisto) to (pseudo)align reads

• Use DNA Subway (Sleuth) to explore RNA-Seq results

Lab Setup

• We will be using DNA Subway
  – You can get a free account at cyverse.org (required)

RNA-Seq with DNA Subway Part II (Review sequence quality/reference alignment)

Steps for today’s session

• Review RNA-Seq and our example data set

• Learn about reference data sources

• Learn about Kallisto pseudoalignment for RNA-Seq

Review of RNA-Seq

What is RNA-Seq? - measuring the transcriptome

• RNA-Seq allows us to measure the transcriptome: an account of all transcription occurring in a cell/tissue

• We use the abundance of an RNA transcript as a proxy for the activity of some cellular process (e.g. protein synthesis, regulatory activity)

• We analyze these data to compare samples (e.g. cancerous vs. non-cancerous)

What is RNA-Seq? What can expression tell you?

• CYP1A/1B – Cytochrome P450 family, involved in drug metabolism, including processing toxins

Photo Credit: Effects of Tobacco Smoke on Gene Expression and Cellular Pathways in a Cellular Model of Oral Leukoplakia. Zeynep H. Gümüş, Baoheng Du, Ashutosh Kacker, Jay O. Boyle, Jennifer M. Bocker, Piali Mukherjee, Kotha Subbaramaiah, Andrew J. Dannenberg and Harel Weinstein. Cancer Prev Res July 1 2008 (1) (2) 100-111; DOI: 10.1158/1940-6207.CAPR-08-0007

Key Concept: Variation vs. Difference

Spot the difference – biological variation

Photo Credit: https://www.quora.com/What-are-Overlapping-Bell-Curves-and-how-do-they-affect-Quora-questions-and-answers

Introduction to our data set

RNA-Seq of hNPC – Zika Virus

Zika Virus

Photo credit: https://www.sigmaaldrich.com/life-science/stem-cell-biology/neural-stem-cell-biology.html https://en.wikipedia.org/wiki/Zika_virus#/media/File:Zika-chain-colored.png

Sequence data from NCBI

Generation of sequence data

Illumina sequencing

Photo credit: https://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

Lab: Cleaning sequence with FastX

Working on DNA Subway Green Line

Key Concept: Sequence quality

Phred scores…

Phred Score   Error (bases miscalled)   Accuracy
10            1 in 10                   90%
20            1 in 100                  99%
30            1 in 1,000                99.9%
40            1 in 10,000               99.99%
50            1 in 100,000              99.999%

FastQ format

• Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description

• Line 2 is the raw sequence letters

• Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again

• Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as there are letters in the sequence
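The four-line record layout and the Phred scale can be tied together in a short Python sketch. It assumes the Phred+33 ASCII encoding (a common convention, but not stated on the slide), and the record shown is made up for illustration. The score-to-error relationship used is P = 10^(-Q/10), so Q = 30 corresponds to 1 miscall in 1,000.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def parse_fastq_record(lines, offset=33):
    """Parse one 4-line FASTQ record into id, sequence, and Phred scores.

    offset=33 assumes Phred+33 encoding (an assumption, not from the slides).
    """
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("Line 1 must start with '@'")
    if not plus.startswith("+"):
        raise ValueError("Line 3 must start with '+'")
    if len(seq) != len(qual):
        raise ValueError("Sequence and quality lines differ in length")
    # Each quality character encodes one score: ASCII code minus the offset.
    scores = [ord(ch) - offset for ch in qual]
    return {"id": header[1:].split()[0], "sequence": seq, "scores": scores}


# Hypothetical record, for illustration only.
example = ["@read1 example read", "GATTACA", "+", "II5?+I#"]
record = parse_fastq_record(example)

print(record["scores"])  # [40, 40, 20, 30, 10, 40, 2]
for base, q in zip(record["sequence"], record["scores"]):
    print(base, q, f"{phred_to_error_prob(q):.4f}")
```

The last base (score 2) has an error probability above 0.6, which is the kind of low-quality tail that trimming tools such as the FastX Toolkit are meant to remove.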

Photo and text credit: https://en.wikipedia.org/wiki/FASTQ_format

FastX Toolkit

Jobs on DNA Subway

Key Concept: Read alignment

Counting reads

Intuition: The more reads we observe from a given “gene”, the more “active” that gene is

There are many roads to RNA-Seq

Photo credit: Conesa et al. Genome Biology (2016) 17:13 DOI 10.1186/s13059-016-0881-8

RNA-Seq with Kallisto

Kallisto (pseudo)aligns reads to a reference transcriptome

1. An index is built of the reference transcriptome

2. Sequence reads are (pseudo)aligned to transcripts

Reference transcriptome

A collection of “all” the transcripts in an organism
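Reference transcriptomes are commonly distributed as FASTA files, with one '>' header line per transcript followed by its sequence. The file format is an assumption here, since the slides do not name it. The sketch below reads such a file into a dictionary of transcript name to sequence; the two transcripts `tx1` and `tx2` are invented toy data, not real Ensembl records.

```python
import io


def read_fasta(handle):
    """Read FASTA text from a file-like object into {transcript_name: sequence}."""
    transcripts = {}
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                transcripts[name] = "".join(chunks)
            name = line[1:].split()[0]  # keep the identifier, drop the description
            chunks = []
        else:
            if name is None:
                raise ValueError("Sequence data found before any '>' header")
            chunks.append(line.upper())
    if name is not None:
        transcripts[name] = "".join(chunks)
    return transcripts


# Hypothetical two-transcript "transcriptome", for illustration only.
toy_fasta = io.StringIO(
    ">tx1 hypothetical transcript\n"
    "ATGGCG\n"
    "TACGTT\n"
    ">tx2 hypothetical transcript\n"
    "ATGGCGTTTGCA\n"
)
transcripts = read_fasta(toy_fasta)
print(transcripts)
```

A real transcriptome holds many thousands of such records, which is why an index is built once before any reads are matched.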

Ensembl tour: https://useast.ensembl.org/Homo_sapiens/Info/Index

Problem: A transcriptome (like a genome) contains thousands of transcripts. How will we match sequence reads with transcripts?

Kallisto – thinking about alignments

Photo credit: https://www.pexels.com/photo/jigsaw-puzzle-1586950/

Kallisto – thinking about alignments

Photo credit: https://www.pexels.com/photo/jigsaw-puzzle-1586950/ https://galaxyproject.github.io/training-material/topics/proteomics/tutorials/proteogenomics-dbcreation/tutorial.html

Kallisto – building an index

Photo credit: https://www.pexels.com/photo/jigsaw-puzzle-1586950/ https://www.cloudberries.co.uk/puzzle-tips-tricks/how-to-complete-a-jigsaw-puzzle-quickly/

Kallisto – thinking about alignments

Photo credit: https://jbrowse.org/docs/alignments.html

Kallisto – transcriptome De Bruijn graph

Photo credit: https://homolog.us/Tutorials/book3/p7.3.html

Kallisto – Building an Index

“Bridges of Königsberg problem”

Photo credit: https://www.nature.com/articles/nbt.2023

Kallisto – Building a De Bruijn graph
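As a rough illustration of the idea, a de Bruijn graph can be built by sliding a window of length k along each sequence. The sketch below uses one common convention, in which nodes are (k-1)-mers and each k-mer contributes an edge from its prefix to its suffix. This is a teaching simplification and not Kallisto's actual data structure; the two sequences and the value of k are made up.

```python
from collections import defaultdict


def de_bruijn_graph(sequences, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Each k-mer links its (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


# Two hypothetical transcripts that share a prefix and then diverge.
graph = de_bruijn_graph(["ATGGCGT", "ATGGCTT"], k=4)
for node, successors in graph.items():
    print(node, "->", successors)
```

In this toy graph the shared prefix collapses into a single path, and the node `GGC` has two successors, marking the point where the two transcripts diverge.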

Photo credit: https://homolog.us/Tutorials/book4/p2.1.html

Kallisto – Building a De Bruijn graph

Photo credit: https://homolog.us/Tutorials/book4/p2.1.html

Problem: How can we make search efficient by minimizing the search space?

Kallisto – Pseudoalignment
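The core intuition of pseudoalignment can be sketched without computing any base-by-base alignment: look up each k-mer of a read in an index and intersect the sets of transcripts that contain those k-mers. The code below is a simplified illustration under stated assumptions, not Kallisto's implementation: it uses a plain dictionary instead of a transcriptome de Bruijn graph, ignores reverse complements, and skips k-mers missing from the index. Transcript names, reads, and k are invented.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return dict(index)


def pseudoalign(read, index, k):
    """Return the set of transcripts compatible with all indexed k-mers of a read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            continue  # k-mer absent from the index; skipped in this toy version
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            break  # no transcript is consistent with the k-mers seen so far
    return compatible or set()


K = 5  # toy value; real analyses use much longer k-mers
toy_transcripts = {"tx1": "ATGGCGTACGTT", "tx2": "ATGGCGTTTGCA"}
index = build_kmer_index(toy_transcripts, K)

toy_reads = {"r1": "ATGGCGT", "r2": "GCGTACG", "r3": "GGCGTTT", "r4": "CCCCCCC"}
results = {name: pseudoalign(seq, index, K) for name, seq in toy_reads.items()}
for name, compatible in results.items():
    print(name, sorted(compatible))
```

Read `r1` comes from the region both transcripts share, so it stays compatible with both; `r2` and `r3` each narrow down to a single transcript; `r4` matches nothing. The search space shrinks because each lookup is a dictionary access and each intersection can only reduce the candidate set.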

Photo credit: https://www.nature.com/articles/nbt.3519

Kallisto – Pseudoalignment

Photo credit: http://mcb112.org/w02/w02-lecture.html

Kallisto – Pseudoalignment

Photo credit: https://www.nature.com/articles/nbt.3519

Analogy: jumbled words

I cnduo't bvleiee taht I culod aulaclty uesdtannrd waht I was rdnaieg. Unisg the icndeblire pweor of the hmuan mnid, aocdcrnig to rseecrah at Cmabrigde Uinervtisy, it dseno't mttaer in waht oderr the lterets in a wrod are, the olny irpoamtnt tihng is taht the frsit and lsat ltteer be in the rhgit pclae. The rset can be a taotl mses and you can sitll raed it whoutit a pboerlm. Tihs is bucseae the huamn mnid deos not raed ervey ltteer by istlef, but the wrod as a wlohe. Aaznmig, huh? Yaeh and I awlyas tghhuot slelinpg was ipmorantt! See if yuor fdreins can raed tihs too.

Credit: https://www.ecenglish.com/learnenglish/lessons/can-you-read

Analogy: GeoGuessr

Lab: Pseudoalignment with Kallisto

Lab: Kallisto

See DNA Subway Guide (Green Line) on learning.cyverse.org

Next time: Differential expression (abundance)