Workshop on Computational Biology Lecture Notes

Prof. Dr. Martin Peifer

Contents

1 DNA sequencing analysis
  1.1 Introduction
  1.2 Massively parallel sequencing
  1.3 Alignment: Smith-Waterman algorithm
  1.4 Runtime analysis
  1.5 Short read alignment
  1.6 The SAM/BAM file format
  1.7 Post-processing of aligned data
  1.8 Detection of somatic point mutations
  1.9 Detection of genomic rearrangements
  1.10 Copy number analysis
  1.11 Tumor evolution
2 RNA sequencing analysis
  2.1 Introduction
  2.2 Alignment of RNAseq data with STAR
  2.3 Pseudo-alignment (Kallisto)
  2.4 Quantification of transcripts

1 DNA sequencing analysis

1.1 Introduction

- Cancer is a disease of the genome → genomes of cancer cells diverge strongly from non-cancer cells of the same individual
- Main types of genome alterations:
  1. Point mutations (substitutions, insertions, deletions)
  2. Copy number changes
  3. Genomic rearrangements/fusion genes
- Some of these genome alterations can be used for targeted therapies, e.g., EGFR mutations in lung cancer, ERBB2 amplifications in breast cancer, and EML4-ALK fusions in lung cancer.

All these alterations can be measured at once with massively parallel or next generation sequencing.

1.2 Massively parallel sequencing

Massively parallel sequencing (or next generation sequencing, NGS) allows sequencing entire genomes in a short amount of time. Most common is the Illumina platform. Properties of the Illumina platform:

- Short sequencing reads ∼ 150 bp
- Paired end is possible (sequencing a fragment from both sides)
- A sequencing run creates more than $10^9$ reads
- Enough sequencing reads to cover the genome multiple times (e.g., 30X →
∼ 70 GB raw data)
- Exon enrichment or amplicon-based sequencing possible

Read structure (fastq file):

Row 1: @ followed by read name
Row 2: the sequence of the read
Row 3: + followed by extra information (I have never seen anything here)
Row 4: letter-encoded (ASCII code − 33) phred-score; let $p$ be the probability that the detected base is a sequencing error → $\text{phred} = -10 \cdot \log_{10}(p)$

→ The position where the read matches the reference genome has to be determined → Massive text search → Alignment problem

1.3 Alignment: Smith-Waterman algorithm

In 1981, the algorithm was proposed by Temple F. Smith and Michael S. Waterman [1]. The basic idea is to compute a scoring matrix for each possible alignment between the reference (e.g., the reference genome) and the query sequence. The alignment is then reconstructed from a path of highest scores.

Let $R = a_1 a_2 \cdots a_n$ be the reference sequence, $Q = b_1 b_2 \cdots b_m$ the query sequence and $n > m$. For genomic alignments $a_j, b_i \in \{A, C, G, T\}$ for each $j = 1, \dots, n$ and $i = 1, \dots, m$.

Example: R = GGTTGACTA and Q = GTTAC.

Let $\mu$ be the score for a match and $-\mu$ for a mismatch. A gap is penalized by $w$. For example, let us set: $\mu = 3$ and $w = 2$. Then, the Smith-Waterman algorithm is as follows:

1. Initialize the $(m+1) \times (n+1)$ scoring matrix $H$ (rows: query, columns: reference) by setting the first row and column to zero.

   Example:

            G   G   T   T   G   A   C   T   A
        0   0   0   0   0   0   0   0   0   0
    G   0
    T   0
    T   0
    A   0
    C   0

2. Compute $H$ recursively for $i = 1, \dots, m$ and $j = 1, \dots, n$:

   $$H_{ij} = \max \left\{ \begin{array}{ll} H_{i-1,j-1} + s(a_j, b_i) & \text{(match or mismatch)} \\ H_{i-1,j} - w & \text{(insertion)} \\ H_{i,j-1} - w & \text{(deletion)} \\ 0 & \end{array} \right.$$

   where

   $$s(a_j, b_i) = \left\{ \begin{array}{ll} \mu & \text{if } a_j = b_i \\ -\mu & \text{if } a_j \neq b_i \end{array} \right.$$

   Example:

            G   G   T   T   G   A   C   T   A
        0   0   0   0   0   0   0   0   0   0
    G   0   3   3   1   0   3   1   0   0   0
    T   0   1   1   6   4   2   0   0   3   1
    T   0   0   0   4   9   7   5   3   3   1
    A   0   0   0   2   7   6  10   8   6   6
    C   0   0   0   0   5   4   8  13  11   9

3. Trace back the alignment by starting from the largest score and choosing the path of highest scores to the left, diagonally up, and up until zero is reached.
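The three steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming the simple scoring scheme of the example (match +3, mismatch −3, linear gap penalty 2); the function name `smith_waterman` and its parameters are chosen here for illustration and do not come from the lecture. During the traceback, the sketch follows the move that actually produced each score, which is one common way to make the path unambiguous when neighbouring cells hold equal values.

```python
def smith_waterman(reference, query, match=3, gap=2):
    """Local alignment with score +match / -match and a linear gap penalty."""
    n, m = len(reference), len(query)

    # Step 1: (m+1) x (n+1) matrix, first row and column are zero.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)

    # Step 2: fill the matrix (rows = query, columns = reference).
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if query[i - 1] == reference[j - 1] else -match
            H[i][j] = max(
                H[i - 1][j - 1] + s,  # match or mismatch
                H[i - 1][j] - gap,    # insertion (query base without reference base)
                H[i][j - 1] - gap,    # deletion (reference base without query base)
                0,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Step 3: trace back from the largest score until a zero is reached.
    i, j = best_pos
    ref_aln, qry_aln = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if query[i - 1] == reference[j - 1] else -match
        if H[i][j] == H[i - 1][j - 1] + s:      # diagonal move
            ref_aln.append(reference[j - 1])
            qry_aln.append(query[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i][j - 1] - gap:      # move left: gap in the query
            ref_aln.append(reference[j - 1])
            qry_aln.append("-")
            j -= 1
        else:                                   # move up: gap in the reference
            ref_aln.append("-")
            qry_aln.append(query[i - 1])
            i -= 1

    return best, "".join(reversed(ref_aln)), "".join(reversed(qry_aln)), H


score, ref_aln, qry_aln, H = smith_waterman("GGTTGACTA", "GTTAC")
print(score)    # best local score
print(ref_aln)  # aligned part of the reference
print(qry_aln)  # aligned query, "-" marks a gap
```

For the example sequences, the filled matrix `H` can be compared row by row with the scoring matrix shown in step 2.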
Example:

         G   G   T   T   G   A   C   T   A
     0   0   0   0   0   0   0   0   0   0
 G   0   3   3   1   0   3   1   0   0   0
 T   0   1   1   6   4   2   0   0   3   1
 T   0   0   0   4   9   7   5   3   3   1
 A   0   0   0   2   7   6  10   8   6   6
 C   0   0   0   0   5   4   8  13  11   9

The traceback starts at the largest score, 13, and runs through the scores 13 → 10 → 7 → 9 → 6 → 3 until zero is reached. Thus, the alignment is:

 reference  G G T T G A C T A
 query      - G T T - A C - -

Note: This is the easiest scoring scheme for the Smith-Waterman algorithm. More common is to include a higher gap opening penalty to obtain more clustered deletions or insertions.

1.4 Runtime analysis

Runtime analysis measures the complexity of an algorithm:

- Often a procedure depends on n data points.
- The Landau symbol $O(\cdots)$ is used to express the complexity.
- An algorithm has a linear runtime $O(n)$ if a task has to be repeated n times. Assembling n bikes is, e.g., $O(n)$.
- Only the leading order is important, factors are ignored, e.g., $O(2n^3 + 5n^2 + n) = O(n^3)$.

Example: There are n locations on a map, the task is to compute all pairwise distances between these points. Since there are $\frac{n(n-1)}{2}$ pairs, the runtime would be $O(n^2)$.

Smith-Waterman algorithm:

1) Computing the scoring matrix H takes $m \cdot n$ operations (n genome size and m size of the query sequence).
2) The traceback procedure takes m steps.

→ The runtime of the Smith-Waterman algorithm is $O(m \cdot n)$.

Note: An optimized version of the Smith-Waterman algorithm requires only $O(n)$ operations [2] → with about $10^9$ reads, this is still computationally not feasible for NGS data.

1.5 Short read alignment

- The Smith-Waterman algorithm is too slow for aligning NGS data.
- The fastest short read aligners use the Burrows-Wheeler transformation (BWT) of the reference genome:

  [Figure adapted from Li and Durbin, Bioinformatics (2009) [3].]

- The BWT was originally designed for data compression.
- It allows fast exact and inexact search.
- The runtime of exact match is $O(n^{0.628} \cdot m)$ [4], where n is the genome size and m the read size.
- To check if there exists an exact match takes only $O(m)$.
- Runtime benefit comes from the repetitive structure of the genome.
- Common BWT-based aligners are: Bowtie [5] and BWA [3].
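To illustrate why checking for an exact match needs only one step per read base, here is a toy sketch of BWT-based backward search. It is written under simplifying assumptions: the suffix array is built by naively sorting all suffixes and the occurrence table is stored in full, which only works for tiny sequences; real aligners such as BWA use compressed index structures. The names `bwt_index` and `count_exact` are hypothetical helpers introduced here, not part of any aligner.

```python
def bwt_index(text):
    """Build the BWT of `text` and the tables needed for backward search."""
    text += "$"  # sentinel; sorts before A, C, G, T
    # Naive suffix array: sort all suffix start positions (toy-sized input only).
    sa = sorted(range(len(text)), key=lambda k: text[k:])
    # BWT: the character preceding each sorted suffix.
    bwt = "".join(text[k - 1] for k in sa)
    # C[c]: number of characters in the text that are smaller than c.
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    # occ[c][k]: number of occurrences of c in bwt[:k].
    occ = {c: [0] * (len(bwt) + 1) for c in C}
    for k, ch in enumerate(bwt):
        for c in C:
            occ[c][k + 1] = occ[c][k] + (ch == c)
    return sa, bwt, C, occ


def count_exact(pattern, index):
    """Return (number of exact matches, sorted 0-based match positions)."""
    sa, bwt, C, occ = index
    lo, hi = 0, len(bwt)         # half-open interval of suffix-array rows
    for c in reversed(pattern):  # one update per pattern character
        if c not in C:
            return 0, []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:             # interval is empty: no exact match
            return 0, []
    return hi - lo, sorted(sa[k] for k in range(lo, hi))


index = bwt_index("GGTTGACTA")
print(index[1])                   # the BWT string of the reference
print(count_exact("GTT", index))  # number of hits and their positions
print(count_exact("AAA", index))  # a pattern that does not occur
```

The loop in `count_exact` touches each pattern character once, independent of the reference length, which is the $O(m)$ existence check mentioned in the list above; the cost of building the index is paid once per reference genome.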
1.6 The SAM/BAM file format

The standard format for aligned NGS data is the SAM/BAM format. SAM stands for Sequence Alignment/Map format and is a human-readable text file. The compressed binary version (only machine readable) is the BAM format. A comprehensive format description can be found under: https://samtools.github.io/hts-specs/SAMv1.pdf. There is a header section (further description skipped) and an alignment section.

Example line of the alignment section:

  A00685:13:HHCTLDSXX:1:2574:30364:2127 163 chr1 18999851 60 150M = 19000107 389
  GGCCAATCCCACACGGTCAGCTCCTTAAGTGAACACATTTTTGTCCACTCCC...
  FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF...
  BX:Z:CATCAAGGTTAGCTCA NM:i:1 MD:Z:1C148 AS:i:148 XS:i:0

The entries are tab-separated; their values are:

1. Read name.
2. Flag of the alignment:
   - Is a 16 bit integer number ranging from 0 to $2^{16} - 1 = 65{,}535$.
   - Each digit of the binary representation of the number is a yes (=1)/no (=0) assignment to a certain category. In our example the binary representation of 163 is 10100011 → $1 \cdot 2^0 + 1 \cdot 2^1 + 0 \cdot 2^2 + 0 \cdot 2^3 + 0 \cdot 2^4 + 1 \cdot 2^5 + 0 \cdot 2^6 + 1 \cdot 2^7 = 163$.
   - The categories are given in the following table:

     bit            example  description
     2^0  = 1       1        read having multiple segments in sequencing (e.g., paired end)
     2^1  = 2       1        each segment properly aligned according to the aligner
     2^2  = 4       0        segment unmapped
     2^3  = 8       0        next segment in the read unmapped
     2^4  = 16      0        sequence being reverse complemented
     2^5  = 32      1        sequence of the next segment being reverse complemented
     2^6  = 64      0        the first segment in the read
     2^7  = 128     1        the last segment in the read
     2^8  = 256     0        secondary alignment
     2^9  = 512     0        not passing filters
     2^10 = 1024    0        PCR or optical duplicate
     2^11 = 2048    0        supplementary alignment (ambiguous alignment)

   - Higher bits not shown are not specified.
3. Chromosome of the alignment.
4. Position of the alignment.
5. Mapping quality. May depend on the aligner but is typically the phred-score of the probability that the position is wrong.
   255 means that the mapping quality is not available.
6. CIGAR string to determine the structure of the alignment (match/mismatch, insertion, deletion, skipping). Common CIGAR operations are:

     operation  description
     M          alignment match (can be a sequence match or mismatch)
     I          insertion to the reference
     D          deletion from the reference
     N          skipped region from the reference
     S          soft clipping (clipped sequences present in sequence; beginning or end)

   Example: a 30 base read, where the first 5 bases are clipped, 15 are matching, 5 inserted, and 5 matching has the CIGAR string: 5S15M5I5M.
7. Chromosome of the other sequence in the pair, where = means that the chromosomes of the pair are identical.
8. Mapping position of the other sequence in the pair.
9. Insert size of the sequenced fragment: distance between left- and rightmost position if chromosomes of the pair are identical.
10. Sequence of the read in the orientation of the reference genome.
11. Quality scores of the read.
12. The remaining fields are for additional information (optional), e.g., the single cell barcode: BX:Z:CATCAAGGTTAGCTCA.

1.7 Post-processing of aligned data

Post-processing of alignment data consists of two steps: 1) masking of duplicates and 2) overlap removal.

1) Masking of duplicates:

Duplicates occur during a PCR-amplification step of the sequencing library.
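Before working with flags such as the duplicate bit, it helps to decode the fields of an alignment record programmatically. The following sketch revisits the record from Section 1.6 and decodes its flag, CIGAR string, and quality characters. It is a minimal illustration: the helper names (`decode_flag`, `parse_cigar`, `read_length`, `phred_quality`, `error_probability`) are hypothetical, and only the five CIGAR operations listed in the table above are handled.

```python
import re

# Flag bits and descriptions as listed in the table of Section 1.6.
FLAG_BITS = [
    (1, "read having multiple segments in sequencing"),
    (2, "each segment properly aligned according to the aligner"),
    (4, "segment unmapped"),
    (8, "next segment in the read unmapped"),
    (16, "sequence being reverse complemented"),
    (32, "sequence of the next segment being reverse complemented"),
    (64, "the first segment in the read"),
    (128, "the last segment in the read"),
    (256, "secondary alignment"),
    (512, "not passing filters"),
    (1024, "PCR or optical duplicate"),
    (2048, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the (bit, description) pairs that are set in the flag."""
    return [(bit, text) for bit, text in FLAG_BITS if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs (M, I, D, N, S only)."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNS])", cigar)]


def read_length(cigar):
    """Number of read bases: M, I and S consume bases of the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS")


def phred_quality(char):
    """Phred score of one quality character (ASCII code minus 33)."""
    return ord(char) - 33


def error_probability(char):
    """Invert phred = -10 * log10(p) to obtain the error probability p."""
    return 10 ** (-phred_quality(char) / 10)


for bit, text in decode_flag(163):
    print(bit, text)
print(parse_cigar("5S15M5I5M"), read_length("5S15M5I5M"))
print(phred_quality("F"), error_probability("F"))
```

Whether a read is marked as a duplicate can then be checked with `flag & 1024`, the bit named "PCR or optical duplicate" in the table.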
