NGS Raw Data Analysis

NGS raw data analysis Introduc3on to Next-Generaon Sequencing (NGS)-data Analysis Laura Riva & Maa Pelizzola Outline 1. Descrip7on of sequencing plaorms 2. Alignments, QC, file formats, data processing, and tools of general use 3. Prac7cal lesson Illumina Sequencing Single base extension with incorporaon of fluorescently labeled nucleodes Library preparaon Automated cluster genera3on DNA fragmenta3on and adapter ligaon Aachment to the flow cell Cluster generaon by bridge amplifica3on of DNA fragments Sequencing Illumina Output Quality Scores Sequence Files Comparison of some exisng plaGorms 454 Ti RocheT Illumina HiSeqTM ABI 5500 (SOLiD) 2000 Amplificaon Emulsion PCR Bridge PCR Emulsion PCR Sequencing Pyrosequencing Reversible Ligaon-based reac7on terminators sequencing Paired ends/sep Yes/3kb Yes/200 bp Yes/3 kb Read length 400 bp 100 bp 75 bp Advantages Short run 7mes. The most popular Good base call Longer reads plaorm accuracy. Good improve mapping in mulplexing repe7ve regions. capability Ability to detect large structural variaons Disadvantages High reagent cost. Higher error rates in repeat sequences The Illumina HiSeq! Stacey Gabriel Libraries, lanes, and flowcells Flowcell Lanes Illumina Each reaction produces a unique Each NGS machine processes a library of DNA fragments for single flowcell containing several sequencing. independent lanes during a single sequencing run Applications of Next-Generation Sequencing Basic data analysis concepts: • Raw data analysis= image processing and base calling • Understanding Fastq • Understanding Quality scores • Quality control and read cleaning • Alignment to the reference genome • Understanding SAM/BAM formats • Samtools • Bedtools Understanding FASTQ file format FASTQ format is a text-based format for storing a biological sequence and its corresponding quality scores. It has become the standard for storing the output of high throughput sequencing instruments. EXAMPLE: 1. begins with a '@' character and is followed by a sequence iden7fier 2. the raw sequence leeers. 3. begins with a '+' character and is oponally followed by the same sequence iden7fier 4. encodes the quality values for the sequence in and must contain the same number of symbols as leeers in the sequence. Understanding Quality scores Fastq contains quality informaon. Phred quality scores Q are defined as a property which is logarithmically related to the base-calling error probabili#es P( Where P is the probability that the corresponding base call is incorrect. Phred quality scores Phred quality scores Q are represented with a single bit in ASCII format. ASCII stands for American Standard Code for Informaon Interchange. ASCII code is the numerical representaon of a character such as 'a' or '@' or an ac7on of some sort. The first 32 symbols in ASCII are control characters, so we start at 33. If your ASCII character is ‘B’ and they are in Sanger format, then 66-33=33, so 33= (-10log10p) -3.3=log10p 10-3.3=p, so p= 0.0005 or 0.05% chance of an incorrect base. @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@ The fourth line contains the base quali7es where BQ + 33 = ASCII value shown in the base quality string Sequences ID @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG means: Quality control and read cleaning • Before alignment there is some7mes the need to preprocess/ manipulate the FASTA/FASTQ files to produce beeer mapping results. It is important to do quality control checks to understand whether your data has any problems of which you should be aware before doing any further analysis • FastQC: quality control checks on raw sequence data coming from high throughput sequencing pipelines (hep:// www.bioinformacs.bbsrc.ac.uk/projects/fastqc/). • The FASTX-Toolkit tools perform some of these preprocessing tasks (hep://hannonlab.cshl.edu/fastx_toolkit/). Two of many useful tools are: – FASTQ Quality Filter à Filters sequences based on quality – FASTQ Quality Trimmer à Trims (cuts) sequences based on quality FastQC: quality controls of sequencing files I 90th percen7le of the data Very good quality median Reasonable quality mean PER BASE SEQUENCE QUALITY Poor quality 10th percen7le of the data ACCTGGGATCAAACATTCAGGACATATAGCACAATAGGAC A very good quality sequence FastQC: quality controls of sequencing files II DUPLICATION LEVELS Which por7on of the sequences has that characteris7c? Computed for the first 200’000 reads gives an overall impression for the duplicate levels in the whole file How many 7mes is the sequence repeated? A very good quality sequence General steps of data analysis: – Quality control and read cleaning – Alignment to the genome There are many aligners and many short reads aligners: Alignment strategy: • Mapping: quickly iden7fy candidates of hits on the reference genome • Alignment and report: score the alignment • Important features: – Some souware use the base quality score to evaluate alignment, others do not – For all the aligners there is a trade off between performance and accuracy – Gapped or ungapped alignment – Important parameters: • Maximum of mismatches • Repor7ng unique hits or mul7ple hits BWA •It uses Burrows-Wheeler indexing algorithm to speed up alignment 7me • Fast and moderate memory usage • Work for different sequencing plaorms, for SE and PE • Gapped alignment for both SE and PE reads • Effec7ve pairing to achieve high alignment accuracy; subop7mal hits considered in pairing. • Non-unique read is placed randomly with a mapping quality 0. • Reports ambiguous hits References: 1.Li, H. and Durbin, R., Fast and accurate short read alignment with Burrows- Wheeler transform. Bioinforma#cs 25 (14), 1754 (2009). 2.hep://bio-bwa.sourceforge.net/bwa.shtml Understanding SAM/BAM file format The Sequence Alignment/Map (SAM) format: • Format for the storage of sequence alignments and their mapping coordinates • Supports different sequencing plaorms • Flexible in style, compact in size, computaonally efficient to access BAM is the binary version of the SAM format Samtools is a set of tools for manipulang and controlling SAM/BAM files SAM file format BAM headers: an optional (but SAM file format essential) part of a BAM file @HD VN:1.0 GO:none SO:coordinate Required: Standard header @SQ SN:chrM LN:16571 @SQ SN:chr1 LN:247249719 Essential: read groups. @SQ SN:chr2 LN:242951149 Essential: contigs of Carries platform (PL), library [cut for clarity] aligned reference (LB), and sample (SM) @SQ SN:chr9 LN:140273252 sequence. Should be @SQ SN:chr10 LN:135374737 information. Each read is in karotypic order. @SQ SN:chr11 LN:134452384 associated with a read group [cut for clarity] @SQ SN:chr22 LN:49691432 @SQ SN:chrX LN:154913754 @SQ SN:chrY LN:57772954 @RG ID:20FUK.1 PL:illumina PU:20FUKAAXX100202.1 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.2 PL:illumina PU:20FUKAAXX100202.2 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.3 PL:illumina PU:20FUKAAXX100202.3 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.4 PL:illumina PU:20FUKAAXX100202.4 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.5 PL:illumina PU:20FUKAAXX100202.5 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.6 PL:illumina PU:20FUKAAXX100202.6 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.7 PL:illumina PU:20FUKAAXX100202.7 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.8 PL:illumina PU:20FUKAAXX100202.8 LB:Solexa-18484 SM:NA12878 CN:BI @PG ID:BWA VN:0.5.7 CL:tk @PG ID:GATK TableRecalibration VN:1.0.2864 Useful: Data processing tools applied to the reads 20FUKAAXX100202:1:1:12730:189900 163 chrM 1 60 101M = 282 381 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTA…[more bases] ?BA@A>BBBBACBBAC@BBCBBCBC@BC@CAC@:BBCBBCACAACBABCBCCAB…[more quals] RG:Z:20FUK.1 NM:i:1 SM:i:37 AM:i:37 MD:Z:72G28 MQ:i:60 PG:Z:BWA UQ:i:33 SAM file format Mapping quality scores They quan7fy the probability that a read is misplaced (P probability that the alignment is wrong). The error probability scaled in the Phred is the mapping quality (Q). The calculaon of mapping quali7es considers all the following factors: • The repeat structure of the reference. Reads falling in repe77ve regions have very low mapping quality. • The base quality of the read. Low quality means the observed read sequence is possibly wrong, and wrong sequence may lead to a wrong alignment. • The sensi7vity of the alignment algorithm. The true hit is more likely to be missed by an algorithm with low sensi7vity, which also causes mapping errors. • Paired end or not. Reads mapped in pairs are more likely to be correct. CIGAR string •M: match/mismatch •I: inser7on •D: dele7on •S: soclip •H: hardclip •P: padding •N: skip SAMtools • Library and souware package that manipulate BAM/SAM files • SAMàBAM conversion (samtools view) • Samtools view –f INT file.bam Only output alignments with all bits in INT present in the FLAG field. • Samtools view –F INT file.bam Skip alignments with bits present in INT [0] • Creates sorted and indexed BAM files from SAM files (samtools sort /samtools index) • Removing PCR duplicates (samtools rmdup) • Merging alignments (samtools merge) • Visualizaon of alignments from BAM files • SNP calling and short indel detecon References: 1. hep://samtools.sourceforge.net/ 2.hp:// samtools.sourceforge.net/samtools.shtml BEDtools The BEDTools u7li7es allow one to address common genomics tasks such as finding feature overlaps and compu7ng coverage. The u7li7es are largely based on four widely-used file formats: BED, GFF/GTF, VCF and SAM/BAM. BED format • slopBed Adjusts each BED entry by a requested number of base pairs. • shuffleBed Randomly permutes the locaons of a BED file among a genome. • intersectBed (BAM) Returns overlaps between two BED/GFF/VCF files. • genomeCoverageBed (BAM) Creates a "per base" report of genome coverage. • subtractBed Removes the por7on of an interval that is overlapped by another feature. • mergeBed Merges overlapping features into a single feature. References: 1 Quinlan AR and Hall IM, 2010.

NGS Raw Data Analysis

Gene Discovery and Annotation Using LCM-454 Transcriptome Sequencing Scott J

File Formats Exercises

EMBL-EBI Powerpoint Presentation

An Open-Sourced Bioinformatic Pipeline for the Processing of Next-Generation Sequencing Derived Nucleotide Reads

Sequence Alignment/Map Format Specification

Multi-Scale Analysis and Clustering of Co-Expression Networks

Introduction to Bioinformatics (Elective) – SBB1609

Large Scale Genomic Rearrangements in Selected

L3: Short Read Alignment to a Reference Genome

Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd

Sequence Data

Efficient Synchronization of Paired-End Fastq Files