Alignment and Data Format

High-throughput Sequencing and Translational Genomics
Elena Piñeiro-Yáñez, CNIO Bioinformatics Unit

Alignments

What is an alignment?

ACGTCTTGACTGG-TTAAAATAC
AC-TCTTGACTGGATTAACATAC

Sequence alignment is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

Elements

ACGTTTTGCAGTAAATGCGGACTGA-T
ACGTTGTGCAGTAAATGCGGA--GACT

• match
• mismatch
• gap (insertion/deletion)

Alignment seeks to reduce gaps and mismatches and to maximize matches. When the alignment is built, each of these components has an associated penalty value. For gaps, there is one penalty for opening the gap and another for extending it.

1 gap:
ACGTTTTGCAGTAAATGCGGACTGAT
ACGTTGTGCAGTAAATGCGGA-GACT

1 extended gap:
ACGTTTTGCAGTAAATGCGGACTGAT
ACGTTGTGCAGTAAATGCGGA--GACT

2 gaps:
ACGTTTTGCAGTAAATGCGGACTGA-T
ACGTTGTGCAGTAAATGCGGA--GACT

Types

1. Based on the number of sequences:
   • Pairwise alignment: 2 sequences
   • Multiple alignment: > 2 sequences
2. Based on the region to align:
   • Local: sequence sub-region (Smith and Waterman, BLAST). The alignment is done only in the most similar regions.
   • Global: complete sequence (Needleman-Wunsch). The alignment covers both sequences completely. Used to align sequences that start and end in the same region (homologous genes of similar species).

Objectives

Comparing sequences through alignment makes it possible to:
1. Determine the degree of homology
2. Identify functional domains
3. Compare the gene with its product
4. Find homologous positions
5. Identify differences

Differential Expression and Variant Detection in Next Generation Sequencing (NGS)

NGS: https://www.youtube.com/watch?v=fCd6B5HRaZ8

One of the earliest and most important steps in NGS analysis is mapping the reads to the original reference. This is read alignment.

Classical vs NGS alignment

Quantity
   Classical: a few sequences (n < 30) aligned to each other
   NGS: billions of reads aligned to a very large reference genome (n = 10^6 - 10^8)
Length
   Classical: long sequences (a whole gene, including introns, or a whole protein)
   NGS: short reads (l = 25-1000 bp)
Similarity
   Classical: sequences that are not very similar
   NGS: highly similar sequences
Quality
   Classical: high quality sequences coming from Sanger capillary sequencing
   NGS: lower quality sequences
Examples
   Classical: ClustalW, T-Coffee
   NGS: BWA, Bowtie

Short Read Aligners: Challenges

• Billions of reads have to be aligned to a very large reference genome, so short read aligners (SRA) must be "extraordinarily efficient algorithms" in terms of:
   • Speed
   • Memory use
• Because the reads are short, a read may align in multiple positions, so SRA have to:
   • Either report multiple positions
   • Or pick one of them heuristically
• Different NGS technologies have different error profiles to take into account:
   • 454: insertions or deletions in homopolymer runs
   • Illumina: increasing likelihood of sequence errors towards the end of the read
• Specific problems: splicing junctions in RNAseq

Timeline

[Figure: timeline of mappers, from https://www.ebi.ac.uk/~nf/hts_mappers/] DNA mappers are plotted in blue, RNA mappers in red, miRNA mappers in green, and bisulfite mappers in purple. Gray dotted lines connect related mappers (extensions or new major versions). The timeline only includes mappers with peer-reviewed publications, and the date corresponds to the earliest date of publication.
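The multi-mapping challenge listed under Short Read Aligners can be illustrated with a small sketch. It is not how BWA or Bowtie work internally: it only uses exact string search on a made-up toy reference to show that a shorter read is more likely to match in several places, and that a tool then has to either report all positions or pick one. The function name `find_all` and both sequences are hypothetical.

```python
import random


def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based start position where the read matches exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue one base further so overlapping hits are also found.
        start = reference.find(read, start + 1)
    return positions


# Toy reference with a repeated motif (hypothetical sequence).
reference = "ACGTACGTTTGACCAGTACGTTTGAGG"

short_read = "ACGT"        # very short: expected to be ambiguous
long_read = "ACGTTTGACC"   # longer: more likely to be unique

short_hits = find_all(reference, short_read)
long_hits = find_all(reference, long_read)

# Strategy 1: report all candidate positions.
print("short read hits:", short_hits)
print("long read hits:", long_hits)

# Strategy 2: pick one position (here at random, with a fixed seed
# so the choice is reproducible).
rng = random.Random(0)
chosen = rng.choice(short_hits)
print("position picked for the short read:", chosen)
```

Real aligners search an index rather than scanning the reference, and they tolerate mismatches and gaps, so this only conveys the ambiguity, not the algorithm.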
Elements to consider in the alignment

• Read type (DNA, RNA, ...)
• Read length:
   • Extremely short sequences (miRNA)
   • Increasing length of the reads (higher probability of mismatches and gaps)
• Paired-end or single-end
• Computational requirements (number of processors, memory)
• Base quality (taken into account or not)
• Sequencing errors (can be platform dependent)
• Number of mismatches (limit on the allowed differences)

Read type

In RNAseq the problem is splicing. RNAseq aligners have to allow long gaps in the alignment for reads that span splice junctions (Nature Biotechnology 28, 421–423 (2010)).

Paired-end reads

https://www.illumina.com/science/technology/next-generation-sequencing/paired-end-vs-single-read-sequencing.html

Reference Genome

Human Reference Genome

Release name   Date of release   Equivalent UCSC version   Base pairs
GRCh38         Dec 2013          hg38                      3,609,003,417
GRCh37         Feb 2009          hg19                      3,326,743,047

Reference genome and reads: the problem is the dimension of the data.

Indexing

Indexing organizes the information so that it is easier and faster to search.

Spaced seeds vs Burrows-Wheeler

Spaced seeds
• Slower
• More mismatches allowed
• Indel detection
• Unspliced: MAQ, GSNAP
• Spliced: GMAP

Burrows-Wheeler Transform (BWT)
• Faster
• Few mismatches allowed
• Limited indel detection
• Unspliced: BWA, Bowtie
• Spliced: TopHat

Because read quality, depth, and coverage have increased, BWT aligners are more common (Nat Biotechnol. 2009 May; 27(5): 455–457).

Errors and biases

• Errors in the reference sequence
• Sequencing errors:
   • Increase mismatches
   • Higher at the end of the reads
• Different regions in the DNA sequence cause alignment biases:
   • Repetitive regions:
      • Similar regions in different locations
   • Place of sequencing errors
   • Place of real mutations and structural variants
• Difficulties in the alignment of insertions/deletions (gaps)

Solutions: post-alignment quality control, mapping quality scores, local realignment of indels

Data formats

• Sequencing reads: FASTQ/FAST5/HDF5
• Reference genome: FASTA
• Alignments: SAM/BAM/CRAM
• Reference intervals: BED
• Transcriptome: GTF/GFF
• Variants (variant calling): VCF
• Counts (RNAseq): TSV/CSV

Reference Genome - FASTA format

• Typical extensions: .fasta, .fas, .fa, .fna, .fsa
• Each sequence is composed of at least two consecutive lines:
   • ">" followed by the sequence name and an optional description (space separated)
   • Line(s) with the whole sequence

Multiple sequences can be stored in the same file (multifasta).

Reference Genome - FASTA format - Nucleotide codes (IUPAC)

Code       Base
A          Adenine
C          Cytosine
G          Guanine
T (or U)   Thymine (or Uracil)
R          A or G
Y          C or T
S          G or C
W          A or T
K          G or T
M          A or C
B          C or G or T
D          A or G or T
H          A or C or T
V          A or C or G
N          any base
. or -     gap

Reads - FASTQ

• Typical extensions: .fq, .fastq
• Each read is composed of 4 lines:
   • "@" followed by the read name and an optional description (space separated)
   • Sequence
   • "+" (optionally followed by the read name again)
   • Base quality scores

Reads - FASTQ - Quality Score

• Phred quality scores QPhred are defined as a property that is logarithmically related to the base-calling error probability p:
   QPhred = -10 log10(p)
• The score is written as the character whose ASCII code is QPhred + 33. For example, p = 0.001 gives QPhred = 30, which is written as "?" (ASCII code 63).
• The higher the QPhred, the lower the probability that the base call is erroneous

Reads - FASTQ - ASCII code

[Figure: ASCII code table]

Reads - FASTQ - Single-end/Paired-end

A single sample can have 1 or 2 files:
• Single-end sequencing -> 1 file (name ".fastq")
• Paired-end sequencing -> 2 files (names "_R1.fastq" and "_R2.fastq")

Alignment - SAM/BAM

• SAM is the human-readable text format (.sam extension)
• BAM is the binary, machine-efficient format (.bam extension)
• Both contain exactly the same information and are interconvertible (samtools)

File specifications: https://samtools.github.io/hts-specs/SAMv1.pdf

Alignment - SAM/BAM - Header

[Figure: SAM header example]

Alignment - SAM/BAM - Alignments

Fields 7 to 9 describe the next read (mate) of the same fragment, so they are informative for paired-end data:
7. Reference sequence name of the alignment of the next read (RNEXT)
8. Position of the alignment of the next read (PNEXT)
9. Number of bases covered by the reads from the same fragment (TLEN). A plus/minus sign means the current read is the leftmost/rightmost read

SAM FLAGS

[Figure: SAM flags table]

Alignment - SAM/BAM - CIGAR

• Concise Idiosyncratic Gapped Alignment Report
• It is a compressed representation of an alignment
• Format: a CIGAR string is made up of <integer><op> pairs
• Here, "op" is an operation specified as a single character, usually an upper-case letter (the operations are listed in the SAM specification)

Alignment - CRAM

• Typical extension: .cram
• CRAM files are alignment files like BAM files
• They represent a compressed version of the alignment.
  This compression is driven by the reference that the sequence data is aligned to
• The file format was designed by the EBI to reduce the disk footprint of alignment data as data volumes keep increasing
• Full compatibility with BAM
• Effortless transition from BAM files to CRAM
• It is not yet widely used, but it will probably become the alignment format in the next few years
• Cramtools

Intervals - BED

• Typical extension: .bed
• The first three BED fields are required, the rest are optional:
   1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2)
   2. chromStart - The starting position of the feature in the chromosome. 0-based (the first base in a chromosome is numbered 0)
   3. chromEnd - The ending position of the feature in the chromosome or scaffold
• Additionally, 9 optional fields:
   4. name - Defines the name of the BED line
   5. score - A number between 0 and 1000, or "."
   6. strand - "+" (forward) or "-" (reverse)
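The BED coordinate convention is a frequent source of off-by-one errors, so a minimal parsing sketch may help. It relies on one fact not stated above but defined in the UCSC BED format: chromEnd is not included in the feature, so intervals are half-open and the feature length is chromEnd - chromStart. The class `BedRecord`, the function `parse_bed_line`, and the sample line are hypothetical names and data chosen for illustration; only the first six fields are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BedRecord:
    chrom: str
    start: int                    # chromStart: 0-based, included
    end: int                      # chromEnd: 0-based, not included
    name: Optional[str] = None
    score: Optional[str] = None   # kept as text because "." is allowed
    strand: Optional[str] = None

    def length(self) -> int:
        """Number of bases covered by the feature."""
        return self.end - self.start

    def to_one_based(self) -> tuple[str, int, int]:
        """Same interval as 1-based, fully inclusive coordinates (as in SAM/VCF)."""
        return (self.chrom, self.start + 1, self.end)


def parse_bed_line(line: str) -> BedRecord:
    """Parse the three required fields and up to three optional ones."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, chromStart, chromEnd")
    optional = fields[3:6]  # name, score, strand, when present
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), *optional)


# Hypothetical feature covering bases 100 to 200 in 1-based terms.
record = parse_bed_line("chr3\t99\t200\tfeatureA\t500\t+")
print(record.length())
print(record.to_one_based())
```

Keeping the raw BED values in the record and converting only on request makes it explicit at each call site which coordinate system is in use.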
