
Next-Generation Sequencing: Quality Control and Mapping

BaRC Hot Topics – January 2015
Bioinformatics and Research Computing
Whitehead Institute

http://barc.wi.mit.edu/hot_topics/

Outline

• Quality control
• Preprocessing
• Read mapping
  – Non-spliced alignment
  – Spliced alignment
• Post process the mapped read files
  – Remove unmapped reads, sort, index etc
  – Mapping statistics

Illumina data format

• Fastq format: http://en.wikipedia.org/wiki/FASTQ_format

@ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1    @seq identifier (/1 or /2 for paired-end reads)
GTAGAACTGGTACGGACAAGGGGAATCTGACTGTAG          sequence
+ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1    +any description
hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh          sequence quality values

Input qualities    Illumina versions
--solexa-quals     <= 1.2
--phred64          1.3-1.7
--phred33          >= 1.8

Check read quality with fastqc (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

1. Run fastqc to check read quality
   bsub fastqc sample.fastq
2. Open output file: "fastqc_report.html"

FastQC report

We have to know the quality encoding to use the appropriate parameter in the mapping step.
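As a rough cross-check of what the FastQC report says, the encoding can also be guessed from the range of quality characters in the file. The sketch below is only a heuristic under stated assumptions: the function names are hypothetical, the FASTQ file is assumed to use exactly four lines per record, and a file that contains only high-quality bases can be ambiguous, so the FastQC report remains the reference.

```python
def guess_quality_encoding(quality_strings):
    """Guess the quality option to pass to the mapper from observed quality characters."""
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters given")
    lowest = min(codes)
    if lowest < 59:
        # Characters below ';' (ASCII 59) only occur with an offset of 33.
        return "--phred33"
    if lowest < 64:
        # ASCII 59-63 encode Solexa scores -5..-1, which Phred+64 cannot produce.
        return "--solexa-quals"
    # Everything is >= '@' (ASCII 64): consistent with Phred+64, but still a guess,
    # because a very clean Phred+33 or Solexa file could look the same.
    return "--phred64"


def read_quality_lines(fastq_path, max_records=10000):
    """Yield the quality line (4th line) of the first max_records FASTQ records."""
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            if line_number >= 4 * max_records:
                break
            if line_number % 4 == 3:
                yield line.rstrip("\n")


# Quality string of the example record on the Illumina data format slide.
print(guess_quality_encoding(["hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh"]))
```

For a real file, the call would be guess_quality_encoding(read_quality_lines("sample.fastq")).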

FastQC: per base sequence quality

[Figure: per base sequence quality box plot. Background bands mark very good quality calls, reasonable quality, and poor quality. Red: median; blue: mean; yellow: 25%, 75%; whiskers: 10%, 90%]

Preprocessing tools

• Fastx Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
  – FASTQ/A Trimmer: Shortening reads in a FASTQ or FASTA file (removing barcodes or noise).
  – FASTQ Quality Filter: Filters sequences based on quality
  – FASTQ Quality Trimmer: Trims (cuts) sequences based on quality
  – FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality
  (for a complete list go to the link above)
• cutadapt to remove adapters (https://code.google.com/p/cutadapt/)

What preprocessing do we need?

• Bad quality -> Use "FASTQ Quality Filter" and/or "FASTQ Quality Trimmer"
• Flagged Kmer Content: About 100% of the first six bases are the same sequence -> Use "FASTQ Trimmer"
• Overrepresented sequences -> If the overrepresented sequence is an adapter use "cutadapt"

Sequence                                    Count     Percentage            Possible Source
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGGCA    7360116   82.88507591015895     RNA PCR Primer, Index 3 (100% over 40bp)
GCGAGTGCGGTAGAGGGTAGTGGAATTCTCGGGTGCCAAG    541189    6.094535921273932     No Hit
TCGAATTGCCTTTGGGACTGCGAGGCTTTGAGGACGGAAG    291330    3.2807783416601866    No Hit
CCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGG    210051    2.365464495397192     RNA PCR Primer, Index 3 (100% over 38bp)

Examples of preprocessing I
hands on exercise

Remove reads with lower quality

bsub fastq_quality_filter -v -q 20 -p 75 -i sample.fastq -o sample_filtered.fastq

-i: input file
-o: output file
-v: report number of sequences
-q 20: the quality value required
-p 75: the percentage of bases that have to have that quality value

Trim the reads

# Delete the first 6nt from 5'
bsub fastx_trimmer -v -f 7 -l 36 -i sample.fastq -o sample_trimmed.fastq

-f: First base to keep
-l: Last base to keep
-i: input file
-o: output file
-v: report number of sequences
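To make the meaning of the two parameter sets concrete, here is a minimal Python sketch of the rules they describe. It is not the Fastx Toolkit code: the function names are hypothetical, the quality offset of 64 is an assumption that matches the example record shown earlier, and the real tool may treat boundary cases differently.

```python
def passes_quality_filter(quality, min_quality=20, min_percent=75, offset=64):
    """True if at least min_percent of the bases have quality >= min_quality (-q 20 -p 75)."""
    scores = [ord(ch) - offset for ch in quality]
    if not scores:
        return False
    good = sum(1 for score in scores if score >= min_quality)
    # Integer comparison avoids floating point rounding at the threshold.
    return good * 100 >= min_percent * len(scores)


def trim(sequence, first=7, last=36):
    """Keep bases first..last, 1-based and inclusive, as with -f 7 -l 36."""
    return sequence[first - 1:last]


read = "GTAGAACTGGTACGGACAAGGGGAATCTGACTGTAG"
qual = "hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh"
print(passes_quality_filter(qual))
print(trim(read), trim(qual))
```

The same slice has to be applied to the sequence and to its quality string so that the two stay the same length.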

Examples of preprocessing II
hands on exercise

• Remove adapter/Linker with cutadapt

# usage
bsub "cutadapt -m 20 -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG sample2.fastq | fastx_artifacts_filter > sample2_trimFilt.fastq"

-a: Sequence of an adapter that was ligated to the 3' end
-b: Sequence of an adapter that was ligated to the 5' or 3' end
-g: Sequence of an adapter that was ligated to the 5' end
-o: output file name

Recommendation for preprocessing

• Treat all the samples the same way.
• Watch out for preprocessing that may result in very different read lengths in the different samples, as that can affect mapping.
• If you have paired-end reads, make sure you still have both reads of the pair after the processing is done.
• Run fastqc on the processed samples to see if the problem has been removed.
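The paired-end check can be sketched as a comparison of read names between the two files. This is an illustration only and not the BaRC cmpfastq.pl script: the function names are hypothetical, mates are assumed to be marked with a trailing /1 or /2 as in the Fastq example, and only the presence of each mate is checked, not the order of the reads.

```python
def read_id(header):
    """Strip the leading '@' and a trailing /1 or /2 mate suffix from a FASTQ header."""
    name = header.strip().split()[0].lstrip("@")
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def unpaired_ids(headers_1, headers_2):
    """Return (ids only in file 1, ids only in file 2), each as a sorted list."""
    ids_1 = {read_id(header) for header in headers_1}
    ids_2 = {read_id(header) for header in headers_2}
    return sorted(ids_1 - ids_2), sorted(ids_2 - ids_1)


# Hypothetical headers: read_a lost its second mate during filtering.
only_1, only_2 = unpaired_ids(
    ["@read_a/1", "@read_b/1"],
    ["@read_b/2", "@read_c/2"],
)
print(only_1, only_2)
```

Both lists should be empty before paired-end mapping; any read named in them has lost its mate.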

Local genomic files needed for mapping

tak: /nfs//

– Human, mouse, zebrafish, C. elegans, fly, yeast, etc.
– Different builds
  • mm9: mouse_gp_jul_07
  • mm10: mouse_mm10_dec_11
– human_gp_feb_09 vs human_gp_feb_09_no_random?
  • human_gp_feb_09 includes *_random.fa, *hap*.fa, etc.
– Sub directories:
  • bowtie
    – Bowtie1: *.ebwt
    – Bowtie2: *.bt2
  • fasta:
  • fasta_whole_genome: all sequences in one file
  • gtf: gene models from Refseq, Ensembl, etc.

Mapping I: Non-spliced alignment software

§ Used for mapping DNA fragments, e.g. ChIP-seq, SNP calling
§ Bowtie:
  § bowtie 1 vs bowtie 2
    § For reads >50 bp, Bowtie 2 is generally faster, more sensitive, and uses less memory than Bowtie 1.
    § Bowtie 2 supports gapped alignment, which makes it better for SNP calling. Bowtie 1 only finds ungapped alignments.
    § Bowtie 2 supports a "local" alignment mode, in addition to the "end-to-end" alignment mode supported by Bowtie 1.
§ BWA:
  § refer to the BaRC SOP for detailed information

Mapping reads with bowtie2

• Mapping single reads:
  bowtie2 [options]* -x <bt2-idx> -U <r> [-S <output.sam>]
  bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -U DNA.fastq -S DNA.

• Mapping paired-end reads:
  bowtie2 [options]* -x <bt2-idx> -1 <m1> -2 <m2> [-S <output.sam>]
  bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -1 Reads1.fastq -2 Reads2.fastq -S DNA.sam

Some important parameters in bowtie2

• Reporting
  (default)      look for multiple alignments, report best, with MAPQ
  OR
  -k <int>       report up to <int> alns per read; MAPQ not meaningful
  OR
  -a/--all       report all alignments; very slow, MAPQ not meaningful
• Alignment mode
  --end-to-end   entire read must align; no clipping (on)
  OR
  --local        local alignment; ends might be soft clipped (off)
• -L <int>       length of seed substrings; must be >3 and <32 (default=22)
• -N <int>       max # mismatches in seed alignment; can be 0 or 1 (default=0)

Input qualities        Illumina versions
--solexa-quals         <= 1.2
--phred64              1.3-1.7
--phred33 (default)    >= 1.8

Mapping II: Spliced alignment software

§ Used if mapping RNA fragments
§ Tophat2 (uses bowtie2)
§ STAR: maps >60 times faster than Tophat2, tends to align more reads to pseudogenes. See BaRC SOPs

Spliced alignment with tophat2

Tophat2 uses bowtie2 to map the reads

# single-end reads
bsub tophat --solexa1.3-quals --segment-length 20 --no-novel-juncs -G /nfs/genomes/mouse_mm10_dec_11_no_random/gtf/mm10_no_random.refseq.gtf /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 sample_good_trimmed.fastq

# paired-end reads: Add the additional fastq file to the end of the above command.

Input qualities           Refer to bowtie2 mapping slide
--segment-length          Shortest length of a spliced read that can map to one side of the junction. default: 25
--no-novel-juncs          Only look at reads across junctions in the supplied GFF file
-G                        Map reads to virtual transcriptome (from gtf file) first.
-N                        max. number of mismatches in a read, default is 2
-o/--output-dir           default = tophat_out
--library-type            (fr-unstranded, fr-firststrand, fr-secondstrand)
-I/--max-intron-length    default: 500000

Optimize mapping across

• Tophat default parameters are designed for mammalian RNA-seq data.
• Reduce "maximum intron length" for non-mammalian organisms
  -I: default is 500,000

Species        Max_intron_length
yeast          2,484
arabidopsis    11,603
C. elegans     100,913
fly            141,628

Hands on Mapping

• bowtie2
  bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -U DNA.fastq -S DNA.sam

• tophat
  bsub tophat --solexa1.3-quals --segment-length 20 -G /nfs/genomes/mouse_mm10_dec_11_no_random/gtf/mm10_no_random.refseq.gtf /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 sample_good_trimmed.fastq

Note: tophat output file will be: tophat_out/accepted_hits.bam

Mapped reads file formats: SAM/BAM

• SAM: Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. Each alignment line has 11 mandatory fields for essential alignment information.
• BAM: binary format. It is much smaller than SAM.
• BAM is needed for viewing in a genome browser. It has to be sorted and indexed.
• To save space you should convert mapped files to .bam format, and delete the .sam file.
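To show what the 11 mandatory fields look like in practice, the sketch below splits one alignment line on TABs and names the fields as in the SAM specification. The record type, the function name and the example line are hypothetical and only serve the illustration; real work should go through samtools.

```python
from collections import namedtuple

# The 11 mandatory SAM fields, in order, plus any optional TAG:TYPE:VALUE fields.
SamRecord = namedtuple(
    "SamRecord",
    ["qname", "flag", "rname", "pos", "mapq", "cigar",
     "rnext", "pnext", "tlen", "seq", "qual", "tags"],
)


def parse_sam_line(line):
    """Parse one alignment line (not a header line) of a SAM file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 TAB-separated fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10], tags=fields[11:],
    )


# Made-up alignment line for illustration.
example = "read1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\thhhh\tNM:i:0"
record = parse_sam_line(example)
print(record.rname, record.pos, record.cigar)
```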


SAMtools: Set of tools for manipulating mapped read files

TOOL                 DESCRIPTION
samtools view        conversion between SAM and BAM files
samtools flagstat    simple statistics on the mapped reads
samtools sort        sort alignment file
samtools index       index alignment
samtools rmdup       remove PCR duplicates
samtools             displays all the tools available

Hands on

Convert .sam to .bam format, sort and index.

bsub /nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl DNA.sam

1. Convert .sam to .bam
2. Sort bam file
3. Index bam file, creating a .bai file

Delete the .sam file

How to get the number of reads mapped

• Bowtie2 prints to STDERR the number of reads mapped, so you will see it in the email that you receive.
• Tophat makes a summary file in the tophat output directory.
  head tophat_out/align_summary.txt
• Tools:
  – bam_stat.py -i accepted_hits.bam
  – samtools flagstat mapped_unmapped.bam
• See BaRC SOPs http://barcwiki.wi.mit.edu/wiki/SOPs/miningSAMBAM
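The counts reported by these tools come from the FLAG field of each alignment record. As a rough sketch of the idea, and not a reimplementation of samtools flagstat or bam_stat.py, the hypothetical function below counts mapped and unmapped records in SAM text. It assumes one record per read (or per mate) after skipping secondary and supplementary alignments, so its numbers can differ from the totals those tools print.

```python
UNMAPPED = 0x4        # read itself is unmapped
SECONDARY = 0x100     # secondary alignment of a multi-mapping read
SUPPLEMENTARY = 0x800  # supplementary (chimeric) alignment


def count_mapped(sam_lines):
    """Return (mapped, unmapped) counts over primary alignment records."""
    mapped = unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # extra alignments of a read that was already counted
        if flag & UNMAPPED:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Made-up records: one mapped read, one unmapped read, one secondary alignment.
lines = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t1\t42\t4M\t*\t0\t0\tACGT\thhhh",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\thhhh",
    "r1\t256\tchr2\t5\t1\t4M\t*\t0\t0\tACGT\thhhh",
]
print(count_mapped(lines))
```

On a real file the lines could come from open("DNA.sam"); for paired-end data each mate is counted as its own record.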

What to look for when few reads mapped?

• Reads are not perfectly paired *
  – Usually occurs after the QC step. Removing low quality reads or adapters creates an uneven distribution of reads between the two files
    bsub "/nfs/BaRC_Public/BaRC_code/Perl/cmpfastq/cmpfastq.pl s_8_1_filtered.fastq s_8_2_filtered.fastq"
• Reads may have adapter sequences
  – Blast top overrepresented sequences in FastQC output
  – Refer to the preprocessing steps
• Mapping parameters are too stringent *
  – Increase number of mismatches
  – Adjust the insert size of paired-end reads?

* Refer to BaRC SOP for more information

Summary

• Quality control
  – fastqc
• Clean up reads:
  – fastx toolkit: fastq_quality_filter, fastx_trimmer
  – cutadapt
• Map reads:
  – Bowtie2
  – Tophat2
• Understand the mapped files, and check mapping quality:
  – Samtools
  – RSeQC: bam_stat.py

BaRC Standard operating procedures

http://barcwiki.wi.mit.edu/wiki/SOPs

References

Fastqc: http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Fastx Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/

cutadapt: https://code.google.com/p/cutadapt

Bowtie: Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10:R25

TopHat: Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology 2013, 14:R36

Engström et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods 2013, 10:1185–1191
