High-throughput Sequencing and Translational Genomics

Alignment and Data Format

Elena Piñeiro-Yáñez ([email protected])

CNIO BIOINFORMATICS UNIT

Alignments

What is an alignment?

ACGTCTTGACTGG-TTAAAATAC
AC-TCTTGACTGGATTAACATAC

Sequence alignment is a way of arranging DNA, RNA, or protein sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

Elements

ACGTTTTGCAGTAAATGCGGACTGA-T
ACGTTGTGCAGTAAATGCGGA--GACT

Labels: mismatch, match, gap (insertion/deletion)

Alignment seeks to minimize gaps and mismatches and to maximize matches. When the alignment is built, each of these components is assigned a score or penalty value. Gaps carry two penalties: one for opening the gap and another for extending it.
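To make the scoring idea concrete, here is a minimal Python sketch that scores an already-aligned pair of sequences. The function name and the numeric values (match +1, mismatch -1, gap opening -5, gap extension -1) are illustrative assumptions, not the defaults of any particular aligner. In this sketch the first position of a gap is charged the opening penalty and each further position the extension penalty.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two gapped sequences of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap = None  # row that carried a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row is cheaper than opening a new one.
            score += gap_extend if row == prev_gap else gap_open
            prev_gap = row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score


# The example alignment from the slide: one mismatch, one extended gap, one single gap.
print(score_alignment("ACGTTTTGCAGTAAATGCGGACTGA-T",
                      "ACGTTGTGCAGTAAATGCGGA--GACT"))
```

With these assumed values, one gap of length two costs less than two separate gaps of length one, which is the reason for splitting the penalty into opening and extension.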

1 gap:
ACGTTTTGCAGTAAATGCGGACTGAT
ACGTTGTGCAGTAAATGCGGA-GACT

1 extended gap:
ACGTTTTGCAGTAAATGCGGACTGAT
ACGTTGTGCAGTAAATGCGGA--GACT

2 gaps:
ACGTTTTGCAGTAAATGCGGACTGA-T
ACGTTGTGCAGTAAATGCGGA--GACT

Types

1. Based on the number of sequences:

• Pairwise alignment: 2 sequences

• Multiple alignment: > 2 sequences

2. Based on the region to align:

• Local: a sub-region of the sequences (Smith-Waterman, BLAST). Only the most similar regions are aligned.

• Global: the complete sequences (Needleman-Wunsch). The alignment covers both sequences completely. It is suited to sequences that start and end in the same region (e.g. homologous genes of similar species).

Objectives


Comparing sequences by alignment makes it possible to:

1. Determine the degree of homology
2. Identify functional domains
3. Compare a gene with its product
4. Find homologous positions
5. Identify differences
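As an illustration of the global (Needleman-Wunsch) alignment mentioned under Types, and of how an alignment exposes differences between two sequences, the following is a small dynamic-programming sketch. For brevity it assumes a single linear gap penalty rather than separate opening and extension penalties, and the score values and function names are illustrative choices.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(top, bottom):
    """Percentage of alignment columns holding the same residue in both rows."""
    same = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    return 100.0 * same / len(top)


best, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(best, row_a, row_b, percent_identity(row_a, row_b))
```

A local (Smith-Waterman) variant would differ mainly in never letting a cell drop below zero and in starting the traceback from the highest-scoring cell instead of the corner.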

Differential Expression and Variant Detection in Next Generation Sequencing (NGS)

NGS

https://www.youtube.com/watch?v=fCd6B5HRaZ8

One of the earliest and most important steps in NGS analysis is mapping the reads to the reference genome. This step is called read alignment.

Classical vs NGS alignment

• Quantity
  Classical: a few sequences (n < 30), aligned to each other
  NGS: billions of reads aligned to a very large reference genome (n = 10^6 - 10^8)

• Length
  Classical: long sequences (a whole gene, including introns, or a whole protein)
  NGS: short reads (l = 25-1000 bp)

• Similarity
  Classical: sequences that are not very similar
  NGS: highly similar sequences

• Quality
  Classical: high-quality sequences from Sanger capillary sequencing
  NGS: lower-quality sequences

• Examples
  Classical: ClustalW, T-Coffee
  NGS: BWA, Bowtie

Short Read Aligners

Challenges

• Because billions of reads must be aligned to a very large reference genome -> short read aligners (SRA) must be "extraordinarily efficient algorithms" in terms of:
  • Speed
  • Memory use
• Because the reads are short, a read may align at multiple positions -> SRA have to:
  • Either report multiple positions
  • Or pick one of them heuristically
• Different NGS technologies have different error profiles that must be taken into account:
  • 454: insertions or deletions in homopolymer runs
  • Illumina: increasing likelihood of sequencing errors towards the end of the read
• Specific problems: splice junctions in RNAseq

Timeline

From https://www.ebi.ac.uk/~nf/hts_mappers/

DNA mappers are plotted in blue, RNA mappers in red, miRNA mappers in green and bisulfite mappers in purple.

Gray dotted lines connect related mappers (extensions or new major versions).

The timeline only includes mappers with peer-reviewed publications, and the date corresponds to the earliest date of publication.

Elements to consider in the alignment

• Read type (DNA, RNA, ...)

Read type

In RNAseq the problem is splicing.

RNAseq aligners have to allow long gaps in the alignment for those reads that span splice junctions

Nature Biotechnology 28, 421–423 (2010)

Elements to consider in the alignment

• Read type (DNA, RNA, ...)
• Read length:
  • Extremely short sequences (miRNA)
  • Increasing read length (higher probability of mismatches and gaps)
• Paired-end or single-end

Paired-end reads

https://www.illumina.com/science/technology/next-generation-sequencing/paired-end-vs-single-read-sequencing.html

Elements to consider in the alignment

• Read type (DNA, RNA, ...)
• Read length:
  • Extremely short sequences (miRNA)
  • Increasing read length (higher probability of mismatches and gaps)
• Paired-end or not
• Computational requirements (number of processors, memory)
• Base quality (taken into account or not)
• Sequencing errors (can be platform dependent)
• Number of mismatches (limit on the allowed differences)

Reference Genome

Human Reference Genome

Release name   Date of release   Equivalent UCSC version   Base pairs
GRCh38         Dec 2013          hg38                      3,609,003,417
GRCh37         Feb 2009          hg19                      3,326,743,047

[Figure: REFERENCE GENOME and Reads. Problem with dimension of data]

Indexing

Indexing organizes the information so that it is easier and faster to search.

Spaced seeds vs Burrows-Wheeler

Spaced seeds
• Slower
• More mismatches allowed
• Indel detection
• Unspliced: MAQ, GSNAP
• Spliced: GMAP

Burrows-Wheeler Transform (BWT)
• Faster
• Few mismatches allowed
• Limited indel detection
• Unspliced: BWA, Bowtie
• Spliced: TopHat

As read quality, sequencing depth and coverage have increased, BWT aligners have become the more common choice.
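To give a feel for why a Burrows-Wheeler index makes searching fast, here is a toy Python sketch that builds the BWT of a short reference and counts exact occurrences of a read by backward search. It is only an illustration under simplifying assumptions: the rotations are sorted naively, the occurrence table is stored in full, and no mismatches or indels are handled. Real aligners such as BWA or Bowtie use much more compact structures. All function names are illustrative.

```python
def bwt_index(text):
    """Return (bwt, first, occ) for text, using '$' as the end marker."""
    text += "$"  # sentinel that sorts before A, C, G and T
    starts = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    # The BWT is the last character of each sorted rotation.
    bwt = "".join(text[i - 1] for i in starts)

    counts = {}
    for c in text:
        counts[c] = counts.get(c, 0) + 1
    # first[c] = number of characters in the text that sort before c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][k] = number of occurrences of c in bwt[:k]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for k, ch in enumerate(bwt):
        for c in occ:
            occ[c][k + 1] = occ[c][k] + (ch == c)
    return bwt, first, occ


def count_exact(pattern, index):
    """Count exact occurrences of pattern, reading it from right to left."""
    bwt, first, occ = index
    lo, hi = 0, len(bwt)  # current range of sorted rotations
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = bwt_index("ACGTACGTAC")
print(index[0], count_exact("ACG", index))
```

Each character of the read narrows the range with two table lookups, so the search time depends on the read length rather than on the size of the reference.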

Nat Biotechnol. 2009 May; 27(5): 455–457

Errors and biases

• Errors in the reference sequence
• Sequencing errors:
  • Increase the number of mismatches
  • More frequent at the end of the reads
• Different regions of the DNA sequence cause alignment biases:
  • Repetitive regions:
    • Similar regions in different locations
    • Place of sequencing errors
    • Place of real mutations and structural variants
• Difficulties in the alignment of insertions/deletions (gaps)

Solutions: post-alignment quality control, mapping quality scores, local realignment of indels

Data formats

• Sequencing -> Reads: FASTQ/FAST5/HDF5
• Reference Genome: FASTA
• Alignment -> Alignments: SAM/BAM/CRAM
• Intervals: BED
• Reference Transcriptome: GTF/GFF
• Variant Calling -> Variants: VCF
• RNAseq -> Counts: TSV/CSV

Reference Genome – FASTA format

• Typical extensions: .fasta, .fas, .fa, .fna, .fsa
• Each sequence consists of at least two consecutive lines:
  • ">" followed by the sequence name and an optional description (space separated)
  • Line(s) with the whole sequence
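The record layout described above can be read with a few lines of Python. This is a minimal sketch rather than a full parser: the function name is illustrative, blank lines are skipped, and sequence lines are simply concatenated, so a file holding several records is handled too.

```python
import io


def read_fasta(handle):
    """Yield (name, description, sequence) for each record in a FASTA stream."""
    name, desc, chunks = None, "", []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, desc, "".join(chunks)
            # Name and optional description are separated by whitespace.
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            desc = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, desc, "".join(chunks)


example = io.StringIO(">seq1 toy example\nACGT\nACGT\n>seq2\nNNNN\n")
for record in read_fasta(example):
    print(record)
```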

A single file can contain multiple sequences (multi-FASTA).

Reference Genome – FASTA format

Nucleotide codes (IUPAC)

IUPAC nucleotide code   Base
A                       Adenine
C                       Cytosine
G                       Guanine
T (or U)                Thymine (or Uracil)
R                       A or G
Y                       C or T
S                       G or C
W                       A or T
K                       G or T
M                       A or C
B                       C or G or T
D                       A or G or T
H                       A or C or T
V                       A or C or G
N                       any base
. or -                  gap

Reads - FASTQ

• Typical extensions: .fq, .fastq
• Each read consists of 4 lines:
  • "@" followed by the read name and an optional description (space separated)
  • Sequence
  • "+" (optionally followed by the read name again)
  • Base quality scores

Reads – FASTQ – Quality Score
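As a rough sketch of how the four-line FASTQ record and its quality string can be handled, the Python below reads records and converts each quality character into a Phred score, assuming the common encoding in which the character's ASCII code is the score plus 33. The function names are illustrative, and the sketch treats a blank line as the end of the input.

```python
import io


def read_fastq(handle, offset=33):
    """Yield (name, sequence, phred_scores) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of input (a blank line also stops this sketch)
        seq = handle.readline().strip()
        plus = handle.readline().strip()
        qual = handle.readline().strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        parts = header[1:].split()
        name = parts[0] if parts else ""
        # Each quality character encodes one Phred score.
        yield name, seq, [ord(ch) - offset for ch in qual]


def error_probability(q):
    """Base-calling error probability for a Phred score q."""
    return 10 ** (-q / 10)


example = io.StringIO("@read1 first read\nACGT\n+\nI5!+\n")
for name, seq, scores in read_fastq(example):
    print(name, seq, scores, [error_probability(q) for q in scores])
```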

• Phred quality scores (Q_Phred) are logarithmically related to the base-calling error probability p: Q_Phred = -10 log10(p). For example, Q_Phred = 20 corresponds to p = 0.01
• The score is written as the character whose ASCII code is Q_Phred + 33
• The higher the Q_Phred, the lower the probability that the base call is erroneous

Reads – FASTQ – ASCII code

Reads – FASTQ – Single-end/Paired-end

A single sample can have 1 or 2 files:
• Single-end sequencing -> 1 file (name ending in ".fastq")
• Paired-end sequencing -> 2 files (names ending in "_R1.fastq" and "_R2.fastq")

Alignment - SAM/BAM

• SAM is the human-readable text format (.sam extension)
• BAM is the binary, machine-efficient format (.bam extension)
• Both contain exactly the same information and are interconvertible (samtools)

File specifications: https://samtools.github.io/hts-specs/SAMv1.pdf

Alignment - SAM/BAM - Header

Alignment - SAM/BAM - Alignments
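As a hedged illustration of what one alignment line in a SAM file holds, the sketch below splits a line into the eleven mandatory fields defined in the specification linked above and decodes the FLAG field into its individual bits. The field names and bit values follow the SAM specification. The short labels given to each bit, the function names and the example line are made up for this example.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ["FLAG", "POS", "MAPQ", "PNEXT", "TLEN"]

# Bit values from the SAM specification; the labels are informal shorthand.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse", 0x20: "mate_reverse", 0x40: "first_in_pair",
    0x80: "second_in_pair", 0x100: "secondary", 0x200: "qc_fail",
    0x400: "duplicate", 0x800: "supplementary",
}


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    record["TAGS"] = parts[11:]  # optional TAG:TYPE:VALUE fields
    return record


def decode_flag(flag):
    """List the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


line = "read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], decode_flag(record["FLAG"]))
```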

If paired-end (for single-end reads these fields are set to "*", 0 and 0):
7. Reference sequence name of the alignment of the next read (the mate)
8. Position of the alignment of the next read (the mate)
9. Number of bases covered by the reads from the same fragment. A plus/minus sign means the current read is the leftmost/rightmost read

SAM FLAGS

Alignment - SAM/BAM - CIGAR
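A CIGAR string can be unpacked with a short regular expression. The sketch below is a minimal illustration with made-up function names. It uses the operation letters defined in the SAM specification (M, I, D, N, S, H, P, =, X) and the specification's rules for which operations consume reference bases and which consume read bases.

```python
import re

CIGAR_PAIR = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs; '*' means no CIGAR."""
    if cigar == "*":
        return []
    pairs = CIGAR_PAIR.findall(cigar)
    # Reject strings containing anything other than valid pairs.
    if "".join(n + op for n, op in pairs) != cigar:
        raise ValueError("invalid CIGAR string: " + cigar)
    return [(int(n), op) for n, op in pairs]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


for cigar in ["3M1I3M1D5M", "10M500N15M"]:
    print(cigar, parse_cigar(cigar), reference_span(cigar), query_length(cigar))
```

The second example string shows the RNAseq case: a 25-base read spanning a splice junction covers a much longer stretch of the reference because of the N (skipped region) operation.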

• Concise Idiosyncratic Gapped Alignment Report
• It is a compressed representation of an alignment
• Format: a CIGAR string is made up of <integer><op> pairs
• Here, "op" is an operation specified as a single character, usually an upper-case letter (see table)

Alignment - CRAM

• Typical extension: .cram
• CRAM files are alignment files, like BAM files
• They represent a compressed version of the alignment. This compression is driven by the reference the sequence data is aligned to
• The file format was designed by the EBI to reduce the disk footprint of alignment data in these days of ever-increasing data volumes
• Full compatibility with BAM
• Effortless transition from BAM files to CRAM
• It is not yet widely used, but it will probably be the alignment format of the next few years
• Cramtools

Intervals - BED

• Typical extension: .bed

• The first three are required BED fields, the rest are optional

1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2)
2. chromStart - The starting position of the feature in the chromosome. 0-based (the first base in a chromosome is numbered 0)
3. chromEnd - The ending position of the feature in the chromosome or scaffold
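Because chromStart is 0-based, BED coordinates are a common source of off-by-one errors. The sketch below parses the three required fields and converts an interval to the 1-based, inclusive notation used by many other tools. It assumes tab-separated lines and the usual BED convention that the base at chromEnd is not part of the feature. The function names are illustrative.

```python
def parse_bed_line(line):
    """Return (chrom, start, end, optional_fields) from one BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least 3 fields")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError("invalid BED interval")
    return chrom, start, end, fields[3:]


def feature_length(start, end):
    """Length of a 0-based, end-exclusive interval."""
    return end - start


def to_one_based(chrom, start, end):
    """Format the interval as a 1-based, inclusive region string."""
    return f"{chrom}:{start + 1}-{end}"


chrom, start, end, extra = parse_bed_line("chr3\t0\t100\tfeature1\t500\t+\n")
print(feature_length(start, end), to_one_based(chrom, start, end), extra)
```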

• Additionally, 9 optional fields:

4. name - Defines the name of the BED line
5. score - "." or a number between 0 and 1000
6. strand - "+" (forward) or "-" (reverse)
7. thickStart
8. thickEnd
9. itemRgb - e.g. 255,0,0
10. blockCount
11. blockSizes
12. blockStarts

Variants - VCF
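As a rough illustration of how the QUAL and FILTER columns of a VCF record are typically used, the sketch below parses the eight fixed, tab-separated columns of a data line and checks whether the call passed the caller's filters. The column names follow the VCF specification. The function names and the example records are invented for this sketch, and header lines starting with "#" are assumed to have been skipped already.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse the 8 fixed columns of one VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    # "." marks a missing value.
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    return record


def passed_filters(record):
    """True when the FILTER column reports that the call passed."""
    return record["FILTER"] == "PASS"


lines = [
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30",
    "chr1\t20000\t.\tC\tT\t.\tLowQual\t.",
]
for line in lines:
    rec = parse_vcf_line(line)
    print(rec["CHROM"], rec["POS"], rec["QUAL"], passed_filters(rec))
```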

• Typical extension: .vcf (.bcf is the binary counterpart)
• Not all records in a VCF are true calls; the FILTER column marks those that passed the caller's filters
• QUAL is the score assigned to a given call. The higher the QUAL, the more reliable the call. It is on a log scale (Phred-scaled)

Reference Transcriptome - GTF/GFF

• Typical extensions: .gtf, .gff
• Gene Transfer Format (GTF) / General Feature Format (GFF)
• Annotation file for features
• One line per feature, with 9 tab-separated columns
• Used for reference transcriptomes (RNAseq) or to upload features to genome browsers
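A GTF/GFF line can be split into its nine columns as sketched below. The column names are the standard ones (seqname, source, feature, start, end, score, strand, frame, attribute). The function names and the example line are hypothetical, and the attribute column is kept as raw text because its syntax differs between GTF and GFF versions.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def parse_gtf_line(line):
    """Parse one tab-separated GTF/GFF feature line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF/GFF line must have exactly 9 columns")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def feature_length(record):
    """Length of a feature whose start and end are 1-based and inclusive."""
    return record["end"] - record["start"] + 1


line = 'chr1\texample\texon\t101\t200\t.\t+\t0\tgene_id "geneA";'
record = parse_gtf_line(line)
print(record["feature"], record["strand"], feature_length(record))
```

Note the contrast with BED: here both coordinates are 1-based and the end position belongs to the feature, so the length is end - start + 1.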

Columns: seqname, source, feature, start (1-based), end (1-based), score, strand, frame, attribute

frame:
• '0': the first base of the feature is the first base of a codon
• '1': the second base of the feature is the first base of a codon
• '2': the third base of the feature is the first base of a codon

Counts - CSV/TSV

• CSV: comma-separated text
• TSV: tab-separated text

Exercise

We are going to use the Bowtie aligner, following this guided tutorial created by Héctor Tejero: http://rpubs.com/htejero/bowtiePractice

Link to download binary files for the latest version of Bowtie: https://sourceforge.net/projects/bowtie-bio/files/bowtie/1.2.2/

bowtie-1.2.2-linux-x86_64.zip

Link to download latest distribution of samtools: https://sourceforge.net/projects/samtools/files/samtools/1.8/

samtools-1.8.tar.bz2

tar xvjf samtools-1.8.tar.bz2
cd samtools-1.8
./configure
make

Credits for much of the class material

Héctor Tejero: [email protected]
Javier Perales-Patón: [email protected]

CNIO BIOINFORMATICS UNIT