Quality Control and Mapping

Next-Generation Sequencing: Quality Control and Mapping
BaRC Hot Topics – January 2015
Bioinformatics and Research Computing
Whitehead Institute
http://barc.wi.mit.edu/hot_topics/

Outline
• Quality control
• Preprocessing
• Read mapping
  – Non-spliced alignment
  – Spliced alignment
• Post-process the mapped read files
  – Remove unmapped reads, sort, index, etc.
  – Mapping statistics

Illumina data format
• FASTQ format (http://en.wikipedia.org/wiki/FASTQ_format). A read ID ending in /1 or /2 marks the first or second read of a paired-end fragment.
    @ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1    @ + sequence identifier
    GTAGAACTGGTACGGACAAGGGGAATCTGACTGTAG          sequence
    +ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1    + and any description
    hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh          sequence quality values
• Quality encoding depends on the Illumina version:
    Input qualities    Illumina versions
    --solexa-quals     <= 1.2
    --phred64          1.3-1.7
    --phred33          >= 1.8

Check read quality with fastqc
(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
1. Run fastqc to check read quality:
     bsub fastqc sample.fastq
2. Open the output file "fastqc_report.html".

Fastqc report
We have to know the quality encoding to use the appropriate parameter in the mapping step.

FastQC: per base sequence quality
[Figure: per-base quality box plot. Background bands mark very good quality calls, reasonable quality, and poor quality. Red: median; blue: mean; yellow: 25%, 75%; whiskers: 10%, 90%.]

Preprocessing tools
• Fastx Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
  – FASTQ/A Trimmer: shortens reads in FASTQ or FASTA files (removing barcodes or noise).
  – FASTQ Quality Filter: filters sequences based on quality.
  – FASTQ Quality Trimmer: trims (cuts) sequences based on quality.
  – FASTQ Masker: masks nucleotides with 'N' (or another character) based on quality.
  (For a complete list, go to the link above.)
• cutadapt to remove adapters (https://code.google.com/p/cutadapt/)

What preprocessing do we need?
• Flagged Kmer Content: about 100% of the first six bases are the same sequence -> Use "FASTQ Trimmer"
• Bad quality -> Use "FASTQ Quality Filter" and/or "FASTQ Quality Trimmer"
• Overrepresented sequences -> If the overrepresented sequence is an adapter, use "cutadapt"

    Sequence                                  Count    Percentage          Possible Source
    TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGGCA  7360116  82.88507591015895   RNA PCR Primer, Index 3 (100% over 40bp)
    GCGAGTGCGGTAGAGGGTAGTGGAATTCTCGGGTGCCAAG  541189   6.094535921273932   No Hit
    TCGAATTGCCTTTGGGACTGCGAGGCTTTGAGGACGGAAG  291330   3.2807783416601866  No Hit
    CCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGG  210051   2.365464495397192   RNA PCR Primer, Index 3 (100% over 38bp)

Examples of preprocessing I (hands-on exercise)
• Remove reads with lower quality
    bsub fastq_quality_filter -v -q 20 -p 75 -i sample.fastq -o sample_filtered.fastq
  -i: input file
  -o: output file
  -v: report number of sequences
  -q 20: the quality value required
  -p 75: the percentage of bases that have to have that quality value
• Trim the reads
    # Delete the first 6 nt from the 5' end
    bsub fastx_trimmer -v -f 7 -l 36 -i sample.fastq -o sample_trimmed.fastq
  -f: first base to keep
  -l: last base to keep
  -i: input file
  -o: output file
  -v: report number of sequences

Examples of preprocessing II (hands-on exercise)
• Remove adapter/linker with cutadapt
    # usage
    bsub "cutadapt -m 20 -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG sample2.fastq | fastx_artifacts_filter > sample2_trimFilt.fastq"
  -a: sequence of an adapter that was ligated to the 3' end
  -b: sequence of an adapter that was ligated to the 5' or 3' end
  -g: sequence of an adapter that was ligated to the 5' end
  -o: output file name

Recommendation for preprocessing
• Treat all the samples the same way.
• Watch out for preprocessing that may result in very different read lengths in the different samples, as that can affect mapping.
• If you have paired-end reads, make sure you still have both reads of the pair after the processing is done.
• Run fastqc on the processed samples to see if the problem has been removed.

Local genomic files needed for mapping
• tak: /nfs/genomes/
  – Human, mouse, zebrafish, C. elegans, fly, yeast, etc.
  – Different genome builds
    • mm9: mouse_gp_jul_07
    • mm10: mouse_mm10_dec_11
  – human_gp_feb_09 vs human_gp_feb_09_no_random?
    • human_gp_feb_09 includes *_random.fa, *hap*.fa, etc.
  – Subdirectories:
    • bowtie
      – Bowtie 1: *.ebwt
      – Bowtie 2: *.bt2
    • fasta
    • fasta_whole_genome: all sequences in one file
    • gtf: gene models from RefSeq, Ensembl, etc.

Mapping I: Non-spliced alignment software
• Used for mapping DNA fragments, e.g. ChIP-seq, SNP calling
• Bowtie: bowtie 1 vs bowtie 2
  – For reads >50 bp, Bowtie 2 is generally faster, more sensitive, and uses less memory than Bowtie 1.
  – Bowtie 2 supports gapped alignment, which makes it better for SNP calling. Bowtie 1 only finds ungapped alignments.
  – Bowtie 2 supports a "local" alignment mode, in addition to the "end-to-end" alignment mode supported by Bowtie 1.
• BWA: refer to the BaRC SOP for detailed information

Mapping reads with bowtie2
• Mapping single reads:
    bowtie2 [options]* -x <bt2-index> -U <r> [-S <output.sam>]
    bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -U DNA.fastq -S DNA.sam
• Mapping paired-end reads:
    bowtie2 [options]* -x <bt2-index> -1 <m1> -2 <m2> [-S <output.sam>]
    bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -1 Reads1.fastq -2 Reads2.fastq -S DNA.sam

Some important parameters in bowtie2
• Reporting
    (default)      look for multiple alignments, report best, with MAPQ
    -k <int>       report up to <int> alignments per read; MAPQ not meaningful
    -a/--all       report all alignments; very slow, MAPQ not meaningful
• Alignment mode
    --end-to-end   entire read must align; no clipping (on)
    --local        local alignment; ends might be soft clipped (off)
• -L <int>  length of seed substrings; must be >3 and <32 (default=22)
• -N <int>  max # mismatches in seed alignment; can be 0 or 1 (default=0)
• Input qualities
  Illumina versions: --solexa-quals for <= 1.2; --phred64 for 1.3-1.7; --phred33 (default) for >= 1.8

Mapping II: Spliced alignment software
• Used if mapping RNA fragments
• TopHat2 (uses bowtie2)
• STAR: maps >60 times faster than TopHat2, tends to align more reads to pseudogenes. See BaRC SOPs.

Spliced alignment with tophat2
TopHat2 uses bowtie2 to map the reads.
    # single-end reads
    bsub tophat --solexa1.3-quals --segment-length 20 --no-novel-juncs -G /nfs/genomes/mouse_mm10_dec_11_no_random/gtf/mm10_no_random.refseq.gtf /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 sample_good_trimmed.fastq
    # paired-end reads: add the additional fastq file to the end of the above command.

    Input qualities          Refer to the bowtie2 mapping slide
    --segment-length         Shortest length of a spliced read that can map to one side of the junction. default: 25
    --no-novel-juncs         Only look at reads across junctions in the supplied GFF file
    -G <GTF file>            Map reads to the virtual transcriptome (from the gtf file) first.
    -N                       max. number of mismatches in a read, default is 2
    -o/--output-dir          default = tophat_out
    --library-type           (fr-unstranded, fr-firststrand, fr-secondstrand)
    -I/--max-intron-length   default: 500000

Optimize mapping across introns
• TopHat default parameters are designed for mammalian RNA-seq data.
• Reduce "maximum intron length" for non-mammalian organisms (-I: default is 500,000)

    Species       Max_intron_length
    yeast         2,484
    arabidopsis   11,603
    C. elegans    100,913
    fly           141,628

Hands on: Mapping
• bowtie2
    bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -U DNA.fastq -S DNA.sam
• tophat
    bsub tophat --solexa1.3-quals --segment-length 20 -G /nfs/genomes/mouse_mm10_dec_11_no_random/gtf/mm10_no_random.refseq.gtf /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 sample_good_trimmed.fastq
  Note: the tophat output file will be tophat_out/accepted_hits.bam

Mapped reads file formats: SAM/BAM
• SAM: Sequence Alignment/Map format.
  It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. Each alignment line has 11 mandatory fields for essential alignment information.
• BAM: binary format. It is much smaller than SAM.
• BAM is needed for viewing in a genome browser. It has to be sorted and indexed.
• To save space you should convert mapped files to .bam format and delete the .sam file.

SAM tools: set of tools for manipulating mapped read files

    TOOL                DESCRIPTION
    samtools view       conversion between SAM and BAM files
    samtools flagstat   simple statistics on the mapped reads
    samtools sort       sort alignment file
    samtools index      index alignment
    samtools rmdup      remove PCR duplicates
    samtools            displays all the tools available

Hands on: Convert .sam to .bam format, sort and index
    bsub /nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl DNA.sam
1. Convert .sam to .bam
2. Sort the bam file
3. Index the bam file, creating a .bai file
Delete the .sam file

How to get the number of reads mapped
• Bowtie2 prints the number of reads mapped to STDERR, so you will see it in the email that you received.
• TopHat makes a summary file in the tophat output directory:
    head tophat_out/align_summary.txt
• Tools:
  – bam_stat.py -i accepted_hits.bam
  – samtools flagstat mapped_unmapped.bam
• See BaRC SOPs: http://barcwiki.wi.mit.edu/wiki/SOPs/miningSAMBAM

What to look for when few reads mapped?
• Reads are not perfectly paired *
  – Usually occurs after the QC step. Removing low-quality reads or adapters creates an uneven distribution of reads between the two files.
      bsub "/nfs/BaRC_Public/BaRC_code/Perl/cmpfastq/cmpfastq.pl s_8_1_filtered.fastq s_8_2_filtered.fastq"
• Reads may have adapter sequences
  – BLAST the top overrepresented sequences in the FastQC output
  – Refer to the preprocessing steps
• Mapping parameters are too stringent *
  – Increase the number of mismatches
  – Adjust the insert size of paired-end reads?
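To illustrate the "reads are not perfectly paired" check, here is a minimal Python sketch of the idea: collect the read IDs from each FASTQ file, strip the trailing /1 or /2, and report IDs that have lost their mate. This is not the BaRC comparison script referenced in the list above; the function names read_ids and compare_pairs are hypothetical, and the sketch assumes four-line FASTQ records with IDs in the style shown on the Illumina data format slide.

```python
def read_ids(fastq_lines):
    """Return the set of read IDs in a FASTQ stream, without the /1 or /2 suffix.

    Assumes each record is exactly four lines (ID, sequence, '+', qualities).
    """
    ids = set()
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:
            continue  # only the first line of each record holds the ID
        fields = line.strip().lstrip("@").split()
        if not fields:
            continue  # skip blank trailing lines
        name = fields[0]
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        ids.add(name)
    return ids


def compare_pairs(fastq1_lines, fastq2_lines):
    """Return (IDs present in both files, IDs only in file 1, IDs only in file 2)."""
    ids1 = read_ids(fastq1_lines)
    ids2 = read_ids(fastq2_lines)
    return ids1 & ids2, ids1 - ids2, ids2 - ids1
```

Open file handles can be passed directly, for example compare_pairs(open("s_8_1_filtered.fastq"), open("s_8_2_filtered.fastq")). Non-empty second or third sets mean the two files are no longer properly paired. The sketch holds all IDs in memory, so it is meant for illustration rather than for very large files.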
* Refer to BaRC SOP for more information

Summary
• Quality control
  – fastqc
• Clean up reads:
  – Fastx Toolkit: fastq_quality_filter, fastx_trimmer
  – cutadapt
• Map reads:
  – Bowtie2
  – TopHat2
• Understand the mapped files, and check mapping quality:
  – samtools
  – RSeQC: bam_stat.py

BaRC Standard operating procedures
http://barcwiki.wi.mit.edu/wiki/SOPs

References
Fastqc: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Fastx Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/
cutadapt: https://code.google.com/p/cutadapt
Bowtie: Langmead B, Trapnell C, Pop M, Salzberg SL.
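As a rough illustration of what the mapping statistics step measures, the sketch below counts mapped reads from SAM text using the FLAG field (the second of the 11 mandatory columns). Bit 0x4 marks an unmapped read, and bits 0x100 and 0x800 mark secondary and supplementary alignments, which are skipped here so that each read is counted once. The function name count_mapped is hypothetical, and this is not how samtools flagstat or bam_stat.py are implemented, so their numbers may differ; for paired-end data each mate is counted separately.

```python
UNMAPPED = 0x4          # SAM FLAG bit: read is unmapped
SECONDARY = 0x100       # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM FLAG bit: supplementary alignment


def count_mapped(sam_lines):
    """Return (total reads, mapped reads) from an iterable of SAM text lines."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and the optional header section
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read only once, via its primary record
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    return total, mapped
```

For a text SAM file this could be called as count_mapped(open("DNA.sam")); dividing the second value by the first gives the fraction of reads mapped.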
