<<

NGS raw data analysis

Introducon to Next-Generaon (NGS)-data Analysis

Laura Riva & Maa Pelizzola Outline

1. Descripon of sequencing plaorms

2. Alignments, QC, file formats, data processing, and tools of general use

3. Praccal lesson

Illumina Sequencing Single base extension with incorporaon of fluorescently labeled nucleodes Library preparaon Automated cluster generaon

DNA fragmentaon and adapter ligaon Aachment to the flow cell Cluster generaon by bridge amplificaon of DNA fragments Sequencing Illumina Output

Quality Scores Sequence Files Comparison of some exisng plaorms

454 Ti RocheT Illumina HiSeqTM ABI 5500 (SOLiD) 2000 Amplificaon Emulsion PCR Bridge PCR Emulsion PCR Sequencing Pyrosequencing Reversible Ligaon-based reacon terminators sequencing Paired ends/sep Yes/3kb Yes/200 bp Yes/3 kb Read length 400 bp 100 bp 75 bp Advantages Short run mes. The most popular Good base call Longer reads plaorm accuracy. Good improve mapping in mulplexing repeve regions. capability Ability to detect large structural variaons Disadvantages High reagent cost. Higher error rates in repeat sequences The Illumina HiSeq!

Stacey Gabriel Libraries, lanes, and flowcells

Flowcell

Lanes

Illumina

Each reaction produces a unique Each NGS machine processes a library of DNA fragments for single flowcell containing several sequencing. independent lanes during a single sequencing run Applications of Next-Generation Sequencing Basic data analysis concepts: • Raw data analysis= image processing and base calling • Understanding Fastq • Understanding Quality scores • Quality control and read cleaning • Alignment to the reference genome • Understanding SAM/BAM formats • Samtools • Bedtools Understanding FASTQ file format

FASTQ format is a text-based format for storing a biological sequence and its corresponding quality scores. It has become the standard for storing the output of high throughput sequencing instruments.

EXAMPLE:

1. begins with a '@' character and is followed by a sequence idenfier 2. the raw sequence leers. 3. begins with a '+' character and is oponally followed by the same sequence idenfier 4. encodes the quality values for the sequence in and must contain the same number of symbols as leers in the sequence. Understanding Quality scores

Fastq contains quality informaon. Phred quality scores Q are defined as a property which is logarithmically related to the base-calling error probabilies P Where P is the probability that the corresponding base call is incorrect.

Phred quality scores Phred quality scores Q are represented with a single bit in ASCII format. ASCII stands for American Standard Code for Informaon Interchange. ASCII code is the numerical representaon of a character such as 'a' or '@' or an acon of some sort. The first 32 symbols in ASCII are control characters, so we start at 33.

If your ASCII character is ‘B’ and they are in Sanger format, then 66-33=33, so 33= (-10log10p) -3.3=log10p 10-3.3=p, so p= 0.0005 or 0.05% chance of an incorrect base.

@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + BBBBCCCC?

The fourth line contains the base qualies where BQ + 33 = ASCII value shown in the base quality string

Sequences ID @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG means: Quality control and read cleaning

• Before alignment there is somemes the need to preprocess/ manipulate the FASTA/FASTQ files to produce beer mapping results. It is important to do quality control checks to understand whether your data has any problems of which you should be aware before doing any further analysis

• FastQC: quality control checks on raw sequence data coming from high throughput sequencing pipelines (hp:// www.bioinformacs.bbsrc.ac.uk/projects/fastqc/).

• The FASTX-Toolkit tools perform some of these preprocessing tasks (hp://hannonlab.cshl.edu/fastx_toolkit/). Two of many useful tools are: – FASTQ Quality Filter à Filters sequences based on quality – FASTQ Quality Trimmer à Trims (cuts) sequences based on quality

FastQC: quality controls of sequencing files I

90th percenle of the data

Very good quality

median Reasonable quality

mean PER BASE SEQUENCE QUALITY

Poor quality

10th percenle of the data ACCTGGGATCAAACATTCAGGACATATAGCACAATAGGAC

A very good quality sequence FastQC: quality controls of sequencing files II

DUPLICATION LEVELS Which poron of the sequences has that characterisc?

Computed for the first 200’000 reads gives an overall impression for the duplicate levels in the whole file

How many mes is the sequence repeated?

A very good quality sequence General steps of data analysis: – Quality control and read cleaning – Alignment to the genome

There are many aligners and many short reads aligners: Alignment strategy:

• Mapping: quickly idenfy candidates of hits on the reference genome

• Alignment and report: score the alignment

• Important features: – Some soware use the base quality score to evaluate alignment, others do not – For all the aligners there is a trade off between performance and accuracy – Gapped or ungapped alignment – Important parameters: • Maximum of mismatches • Reporng unique hits or mulple hits BWA

•It uses Burrows-Wheeler indexing algorithm to speed up alignment me

• Fast and moderate memory usage

• Work for different sequencing plaorms, for SE and PE

• Gapped alignment for both SE and PE reads

• Effecve pairing to achieve high alignment accuracy; subopmal hits considered in pairing.

• Non-unique read is placed randomly with a mapping quality 0.

• Reports ambiguous hits

References: 1.Li, H. and Durbin, R., Fast and accurate short read alignment with Burrows- Wheeler transform. Bioinformacs 25 (14), 1754 (2009). 2.hp://bio-bwa.sourceforge.net/bwa.shtml Understanding SAM/BAM file format

The /Map (SAM) format:

• Format for the storage of sequence alignments and their mapping coordinates

• Supports different sequencing plaorms

• Flexible in style, compact in size, computaonally efficient to access

BAM is the binary version of the SAM format

Samtools is a set of tools for manipulang and controlling SAM/BAM files SAM file format BAM headers: an optional (but SAM file format essential) part of a BAM file

@HD VN:1.0 GO:none SO:coordinate Required: Standard header @SQ SN:chrM LN:16571 @SQ SN:chr1 LN:247249719 Essential: read groups. @SQ SN:chr2 LN:242951149 Essential: contigs of Carries platform (PL), library [cut for clarity] aligned reference (LB), and sample (SM) @SQ SN:chr9 LN:140273252 sequence. Should be @SQ SN:chr10 LN:135374737 information. Each read is in karotypic order. @SQ SN:chr11 LN:134452384 associated with a read group [cut for clarity] @SQ SN:chr22 LN:49691432 @SQ SN:chrX LN:154913754 @SQ SN:chrY LN:57772954 @RG ID:20FUK.1 PL:illumina PU:20FUKAAXX100202.1 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.2 PL:illumina PU:20FUKAAXX100202.2 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.3 PL:illumina PU:20FUKAAXX100202.3 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.4 PL:illumina PU:20FUKAAXX100202.4 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.5 PL:illumina PU:20FUKAAXX100202.5 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.6 PL:illumina PU:20FUKAAXX100202.6 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.7 PL:illumina PU:20FUKAAXX100202.7 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.8 PL:illumina PU:20FUKAAXX100202.8 LB:Solexa-18484 SM:NA12878 CN:BI @PG ID:BWA VN:0.5.7 CL:tk @PG ID:GATK TableRecalibration VN:1.0.2864 Useful: Data processing tools applied to the reads 20FUKAAXX100202:1:1:12730:189900 163 chrM 1 60 101M = 282 381 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTA…[more bases] ?BA@A>BBBBACBBAC@BBCBBCBC@BC@CAC@:BBCBBCACAACBABCBCCAB…[more quals] RG:Z:20FUK.1 NM:i:1 SM:i:37 AM:i:37 MD:Z:72G28 MQ:i:60 PG:Z:BWA UQ:i:33 SAM file format

Mapping quality scores

They quanfy the probability that a read is misplaced (P probability that the alignment is wrong). The error probability scaled in the Phred is the mapping quality (Q).

The calculaon of mapping qualies considers all the following factors: • The repeat structure of the reference. Reads falling in repeve regions have very low mapping quality. • The base quality of the read. Low quality means the observed read sequence is possibly wrong, and wrong sequence may lead to a wrong alignment. • The sensivity of the alignment algorithm. The true hit is more likely to be missed by an algorithm with low sensivity, which also causes mapping errors. • Paired end or not. Reads mapped in pairs are more likely to be correct. CIGAR string •M: match/mismatch •I: inseron •D: deleon •S: soclip •H: hardclip •P: padding •N: skip SAMtools

• Library and soware package that manipulate BAM/SAM files • SAMàBAM conversion ( view)

• Samtools view –f INT file.bam Only output alignments with all bits in INT present in the FLAG field.

• Samtools view –F INT file.bam Skip alignments with bits present in INT [0]

• Creates sorted and indexed BAM files from SAM files (samtools sort /samtools index) • Removing PCR duplicates (samtools rmdup) • Merging alignments (samtools merge) • Visualizaon of alignments from BAM files • SNP calling and short indel detecon

References: 1. hp://samtools.sourceforge.net/ 2.hp:// samtools.sourceforge.net/samtools.shtml BEDtools The BEDTools ulies allow one to address common genomics tasks such as finding feature overlaps and compung coverage. The ulies are largely based on four widely-used file formats: BED, GFF/GTF, VCF and SAM/BAM.

BED format

• slopBed Adjusts each BED entry by a requested number of base pairs. • shuffleBed Randomly permutes the locaons of a BED file among a genome. • intersectBed (BAM) Returns overlaps between two BED/GFF/VCF files. • genomeCoverageBed (BAM) Creates a "per base" report of genome coverage. • subtractBed Removes the poron of an interval that is overlapped by another feature. • mergeBed Merges overlapping features into a single feature.

References: 1 Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of ulies for comparing genomic features. Bioinformacs. 26, 6, pp. 841–842. 2 hp://code.google.com/p/bedtools/downloads/detail?name=BEDTools-User-Manual.v4.pdf BamTools

Freely available at hp://github.org/pezmaster31/bamtools Picard

Picard comprises Java-based command-line ulies that manipulate SAM files, and a Java API (SAM-JDK) for creang new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported

Freely available at hp://picard.sourceforge.net GATK

Freely available at hp://www.broadinstute.org/gatk/ ReadsReads Visualizaon: IGV and fragments

Clean C/T Non-reference bases are colored; heterozygote reference bases are grey Depth of coverage IGV screenshot

First and second read from the same fragment

Details about specific read Individual reads aligned to the genome

Reference genome

Reference: hp://www.broadinstute.org/igv/ Recitaon We are going to use a fastq file that is present in GEO (hp://www.ncbi.nlm.nih.gov/geo/, record GSE24326)

It is an Hsf1 ChIP-seq experiment used to idenfy the genome-wide sites of HSF1 binding in human adipocytes derived from mesenchymal stem cells.

HSF1 (Heat shock factor protein 1) is a heat-shock transcripon factor. Its target genes include major inducible heat shock proteins such as Hsp72. The transcripon of heat-shock genes is rapidly induced aer temperature stress.

The known Hsf1 mof is:

Quality control

Quality control fastx_quality_stats -Q 33 -i hsf1.fastq -o hsf1_stats.txt

The output TEXT file will have the following fields (one row per column): • column = column number (1 to 36 for a 36-cycles read solexa file) • count = number of bases found in this column. • min = Lowest quality score value found in this column. • max = Highest quality score value found in this column. • sum = Sum of quality score values for this column. • mean = Mean quality score value for this column. • Q1 = 1st quarle quality score. • med = Median quality score. • Q3 = 3rd quarle quality score. • IQR = Inter-Quarle range (Q3-Q1). • lW = 'Le-Whisker' value (for boxplong). • rW = 'Right-Whisker' value (for boxplong). • A_Count = Count of 'A' nucleodes found in this column. • C_Count = Count of 'C' nucleodes found in this column. • G_Count = Count of 'G' nucleodes found in this column. • T_Count = Count of 'T' nucleodes found in this column. • N_Count = Count of 'N' nucleodes found in this column. • max-count = max. number of bases (in all cycles) fastq_quality_boxplot_graph.sh -i hsf1_stats.txt -o hsf1.png

fastx_arfacts_filter -Q 33 -v -i hsf1.fastq -o hsf1_arfact.fastq It removes sequencing arfacts

Input: 6313567 reads. Output: 6312015 reads. discarded 1552 (0%) arfact reads. fastq_quality_trimmer -Q 33 -t 20 -l 28 -v -i hsf1_arfact.fastq –o hsf1_arfact_trim.fastq It trims sequences based on quality

Minimum Quality Threshold: 20 Minimum Length: 28 Input: 6312015 reads. Output: 5929465 reads. discarded 382550 (6%) too-short reads. fastq_quality_filter -Q 33 -v -q 20 -p 80 –I hsf1_arfact_trim.fastq -o hsf1_arfact_trim_qual.fastq It filters sequences based on quality

Quality cut-off: 20 Minimum percentage: 80 Input: 5929465 reads. Output: 5655141 reads. discarded 274324 (4%) low-quality reads.

• fastx_quality_stats -Q 33 –i hsf1_arfact_trim_qual.fastq -o hsf1_arfact_trim_qual_stats.txt • fastq_quality_boxplot_graph.sh -i hsf1_arfact_trim_qual_stats.txt -o hsf1_arfact_trim_qual.png Quality control with fastQC /home/lriva/FastQC/fastqc --contaminants /home/lriva/FastQC/ Contaminants/contaminant_list.txt hsf1_arfact_trim_qual.fastq

BWA

§ A fast program able to align short query sequences (< 200 bp; error rate < 3% ) to a target reference genome

§ Work flow:

üindex [-options] (it index the database)

üaln [-options] < input query.fastq> (it computes the SA coordinates for the input reads)

üsamse [-options] (it converts SA coordinates into chromosomal coordinates and generates alignment in SAM formats for single reads)

Alignment of the reads

• bwa aln -t 4 -f hsf1.sai /db/bwa/0.6.2/hg19/ hg19 hsf1_arfact_trim_qual.fastq • bwa samse /db/bwa/0.6.2/hg19/hg19 hsf1.sai hsf1_arfact_trim_qual.fastq | samtools view –ut /db/bwa/0.6.2/hg19/hg19 - | samtools sort - hsf1 • samtools index hsf1.bam

Visualizing alignments, converng from BAM binary to SAM text format samtools view -h -o hsf1_stdchr.nodup. hsf1_stdchr.nodup.bam less hsf1_stdchr.nodup.sam

Let’s start

Go to hp://bioserver.iit.ieo.eu/~lriva/ and download corsoPoplimiChIPseq.txt ssh [email protected] ssh [email protected] cd public_html

You can check all your files at hp://bioserver.iit.ieo.eu/~user/ Filtering alignments counng and visualizing unmapped reads • samtools view -f 4 -c hsf1.bam • samtools view -f 4 hsf1.bam | less

HWI-EAS413:4:1:3:1708 4 * 0 0 * * 0 0 GGTGCTGGCCAACATGACCTTGCACGGA BCACCCCBCCCBCCBBCCCBB<'

Filtering alignments counng and visualizing mapped reads • samtools view -F 4 -c hsf1.bam • samtools view -F 4 hsf1.bam | less

HWI-EAS413:4:100:367:1780 16 chr1 9999 25 35M * 0 0 GATCTCCCTAACCCTAACCCTAACCCTAACCCTAA ?B;BABBAAC?AC?AC?CBCBCBCCCBCBBCCBCB XT:A:U NM:i:2 XN:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i: 0 XG:i:0 MD:Z:3A0A30 select only mapped reads with mapping quality>=0 samtools view -F 4 -q 1 -bo hsf1_stdchr.bam hsf1.bam chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX

Prefix trie of string ‘GOOGOL’ root L 3,3 0,6 G O O 1,2 ^ 4,6 O 5,5 O G G 6,6 1,2 4,4 1,1 1,2 O G O O ^ 2,2 6,6 4,4 1,2 ^ 4,4 G O O 2,2 2,2 6,6 6,6 G ^ G 2,2 2,2 2,2 ^ ^ 2,2 ^ = start of the string 2,2 Burrows-Wheeler transform (BWT)

SA interval of ‘go’ googol + $

B= BWT(X) (BWT string of X)

Suffix array =

X = ‘googol$’ (reference string) * * W = ‘go’ (query string) g o o g o l SA interval=[1,2] 0 1 2 3 4 5 Suffix array interval: formal definion

[ , ]= SA interval of W If W is an empty string then [ , ]= [1, n-1]

Example: X = ‘googol$’ W = ‘go’ (‘go’)= 1 SA interval of ‘go’ = [1,2] (‘go’)= 2 Exact match: Ferragina-Manzini algorithm

X = reference string; W = query string C(a) = number of symbols < a in X[0,n-2] O(a, i) = number of occurrences of a in B[0,i] where B= BWT (X) If W is a substring of X:

true only if and only if aW is a substring of X Exact match: backward search if W ← ‘ ’: 0 6 $googol L ←1 L=1 R ← length(X)-1 R=6 1 3 gol$goo 2 0 googol$ j←length(W)-1 → 3 5 l$googo while (L ≤ R and j ≥ 0) do: 1) a=‘o’ W=‘’ 4 2 ogol$go aW←W[ j:end ] L=C(‘o’)+1= 3+1=4 R=C(‘o’)+O(‘o’,R )= 3+3=6 5 4 ol$goog if length(aW)<2: 4≤6 True 6 1 oogol$g L ← C(a)+1 R← C(a)+O(a,R) else: 2) a=‘g’→ W=‘o’ L← C(a)+O(a,L-1)+1 L=C(‘g’)+O(‘g’,L-1)+1=0+0+1=1 R← C(a)+O(a,R) R=C(‘g’)+O(‘g’,R) = 0+2=2 j←j-1 1≤2 True [ , ] = [1,n-1] if W = ‘’ if L ≤ R: X= ‘googol$’ return [L,R] 3) a=‘o’→ W=‘go’ bwt= ‘lo$oogg’ else return “No match” L=C(‘o’)+O(‘o’,L-1)+1=3+0+1=4 R=C(‘o’)+O(‘o’,R) = 3+1=4 W=‘ogo’ 4≤4 True Suffix array= (6, 3, 0, 5, 2, 4, 1) SA interval =[4,4] Exact match: backward search (connued)

X= A T T T G A T C T T T T G A T C 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

* *

W= GATC

SA interval =[6,7] 0 16 $ATTTGATCTTTTGATC 1 13 ATC$ATTTGATCTTTTG BWT string= CGG$TTTTAATTTTTAC 2 5 ATCTTTTGATC$ATTTG 3 0 ATTTGATCTTTTGATC$ SA= (16, 13, 5, 0, 15, 7, 12, 4, 14, 6, 11, 3, 10, 2, 9, 1, 8) 4 15 C$ATTTGATCTTTTGAT 5 7 CTTTTGATC$ATTTGAT 6 12 GATC$ATTTGATCTTTT 7 4 GATCTTTTGATC$ATTT 8 14 TC$ATTTGATCTTTTGA 9 6 TCTTTTGATC$ATTTGA 10 11 TGATC$ATTTGATCTTT 11 3 TGATCTTTTGATC$ATT 12 10 TTGATC$ATTTGATCTT 13 2 TTGATCTTTTGATC$AT 14 9 TTTGATC$ATTTGATCT 15 1 TTTGATCTTTTGATC$A 16 8 TTTTGATC$ATTTGATC Inexact match algorithm Basics of NGS technology • DNA is fragmented • Adaptors ligated to fragments • Several possible protocols yield array of PCR colonies. (AMPLIFICATION) • Enzymac extension with fluorescently tagged nucleodes. (SEQUENCING REACTION) • Cyclic readout by imaging the array. • Align strings to the reference genome.

Consider the following paper that describes sequencing technologies in details: Michael Metzker ‘Sequencing technologies-the next generaon’ Nat Rev Genet. 2010 Jan;11(1):31-46. Mapping Algorithm (2 mismatches)

GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG

GATGCATTG CTATGCCTC CCAGTCCGC AACTTCACG seeds

GATGCATTG CTATGCCTC Exact match CCAGTCCGC Genome AACTTCACG ......

Indexed table of exactly matching seeds

Approximate search around the exactly matching seeds Mapping Algorithm (2 mismatches)

• Partition reads into 4 seeds {A,B,C,D} – At least 2 seed must map with no mismatches • Scan genome to identify locations where the seeds match exactly – 6 possible combinations of the seeds to search • {AB, CD, AC, BD, AD, BC} – 6 scans to find all candidates • Do approximate matching around the exactly-matching seeds. – Determine all targets for the reads – Ins/del can be incorporated

• The reads are indexed and hashed before scanning genome

• Bit operations are used to accelerate mapping – Each nt encoded into 2-bits