
Sequence data formats
A short guide on sequencing data formats

Data formats

Sequence and Quality

• Base calls
• Quality of base calls

Base calls: A T G T A G A C G
Quality values: 29 28 33 18 26 31 18 34 32 39

• A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).

Q20 → probability of error = 1%
Q30 → probability of error = 0.1%
Q10 → one in every 10 bases is called incorrectly

Plain old FASTA

Fasta sequence
>identifier description
atcgtaggctttcggctata
gctaatgtagctatattgtc

Fasta qual
>identifier description
21 23 25 27 28 29 28 28 33 31 31 34 45 43 41 42 41 39 38 40
29 28 28 33 31 31 34 41 39 45 43 41 42 38 40 21 23 25 27 28
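A sequence file and its .qual companion only make sense together, so it can help to check that both describe the same number of positions. The following is a minimal sketch, not a reference parser: the helper name `parse_fasta_like` is hypothetical, and the two texts are the example records from above.

```python
def parse_fasta_like(text):
    """Parse FASTA-style text into {identifier: list of body lines}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'; the rest is description.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return records


fasta_text = """>identifier description
atcgtaggctttcggctata
gctaatgtagctatattgtc
"""

qual_text = """>identifier description
21 23 25 27 28 29 28 28 33 31 31 34 45 43 41 42 41 39 38 40
29 28 28 33 31 31 34 41 39 45 43 41 42 38 40 21 23 25 27 28
"""

# Sequence lines are concatenated; quality lines are split into integers.
sequences = {
    name: "".join(lines) for name, lines in parse_fasta_like(fasta_text).items()
}
qualities = {
    name: [int(q) for line in lines for q in line.split()]
    for name, lines in parse_fasta_like(qual_text).items()
}

for name, seq in sequences.items():
    quals = qualities[name]
    if len(seq) != len(quals):
        raise ValueError(f"{name}: {len(seq)} bases but {len(quals)} quality values")
    print(name, len(seq), "bases, minimum quality", min(quals))
```

For real data a library such as Biopython is usually the safer choice; this sketch only illustrates how the two files line up.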

A few notes in advance

Numbers can be represented by characters through ASCII codes
http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters

Number  Character
33      !
34      "
35      #
36      $
37      %
......
64      @
65      A
66      B
......
121     y
122     z

FastQ

• One name, multiple formats
• Stores sequence and quality per base in the same file

@SEQ_ID
GATTTGGGGTTCAAAGCAGTAT
CGATCAAATAGTAAATCCATTT
GTTCAACTCACAGTTT
+SEQ_ID
!''*((((***+))%%%++)(%
%%%).1***-+*''))**55CCF>
>>>>>CCCCCCC65

The same record, with sequence and quality each on a single line:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

And with sequence and quality wrapped over several lines:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTAT
CGATCAAATAGTAAATCCATTT
GTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%
%%%).1***-+*''))**55CCF>
>>>>>CCCCCCC65

• Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA line).
• Line 2 is the raw sequence letters.
• Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.

• Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
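The four-line layout described above can be sketched in a few lines of Python. This is an illustration only: it assumes an unwrapped record (exactly four lines) and the Sanger offset of 33, and the function names `read_fastq_record` and `decode_quality` are hypothetical.

```python
def read_fastq_record(lines):
    """Split four FASTQ lines into (identifier line, sequence, quality string)."""
    header, sequence, plus, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not plus.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("line 4 must have as many symbols as line 2 has letters")
    return header[1:], sequence, quality


def decode_quality(quality, offset=33):
    """Turn each quality symbol into its numeric score (ASCII code minus offset)."""
    return [ord(symbol) - offset for symbol in quality]


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]

name, seq, qual = read_fastq_record(record)
scores = decode_quality(qual)
print(name, len(seq), scores[:5])  # SEQ_ID 60 [0, 6, 6, 9, 7]
```

Wrapped records need more care, because a quality line may itself start with '@' or '+'; a robust reader counts quality symbols until it has as many as there are bases.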

Illumina output formats

.seq.txt .prb.txt

Illumina FASTQ (ASCII – 64 is Illumina score)

Qseq (ASCII – 64 is Phred score)

Illumina single line SCARF

FastQ Quality

• A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
• Two different equations have been in use. The first is the Sanger variant to assess the reliability of a base call, otherwise known as the Phred quality score:
  Q_sanger = -10 log10(p)
• In the old days Solexa (now Illumina) used a different mapping, encoding the odds p/(1-p) instead of the probability p:
  Q_solexa = -10 log10(p / (1 - p))
• The two mappings differ at low quality values
• From Q20 upward there is practically no difference
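To see where the Phred and Solexa mappings diverge, both can be evaluated for a few error probabilities. The sketch below uses the standard definitions Q = -10 log10(p) for Phred and Q = -10 log10(p / (1 - p)) for Solexa; the function names are hypothetical.

```python
import math


def phred_quality(p):
    """Sanger/Phred score for an error probability p."""
    return -10 * math.log10(p)


def solexa_quality(p):
    """Old Solexa score, based on the odds p / (1 - p)."""
    return -10 * math.log10(p / (1 - p))


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def error_probability(q_phred):
    """Invert the Phred mapping: Q20 gives 0.01, Q30 gives 0.001."""
    return 10 ** (-q_phred / 10)


for p in (0.5, 0.1, 0.01, 0.001):
    print(f"p={p}: Phred={phred_quality(p):.2f}  Solexa={solexa_quality(p):.2f}")
```

At p = 0.5 the Phred score is about 3 while the Solexa score is 0; at p = 0.01 the two agree to within 0.05, which is why the distinction only matters for low-quality bases.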

Different mappings

Platform               Quality score   ASCII codes
Sanger                 0 to 93         33 to 126
Solexa/Illumina 1.0    -5 to 62        59 to 126
Illumina 1.3 - 1.7     0 to 62         64 to 126
Illumina 1.8           0 to 93         33 to 126 (Sanger encoding)
PacBio                 0 to 93         33 to 126
Ion Torrent            0 to 93         33 to 126

Illumina 1.5 - 1.7: scores 0 and 1 are no longer used; 2 marks the end of the HQ read (but may occur in the middle of a read as well).
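Because the ASCII ranges in the table overlap only partly, the lowest symbol in a file often gives away the encoding. The following is a heuristic sketch, not a definitive detector: the function name `guess_quality_encoding` and the returned labels are hypothetical, and a file containing only high-quality Sanger-encoded reads (all symbols at ASCII 64 or above) would be misclassified.

```python
def guess_quality_encoding(quality_strings):
    """Guess the quality encoding from the lowest ASCII code that occurs."""
    codes = [ord(symbol) for quality in quality_strings for symbol in quality]
    if not codes:
        raise ValueError("no quality symbols given")
    lowest, highest = min(codes), max(codes)
    if lowest < 33 or highest > 126:
        raise ValueError("symbols outside the printable range 33-126")
    if lowest < 59:
        # Only the Sanger-style encodings use codes below 59.
        return "Sanger / Illumina 1.8 (offset 33)"
    if lowest < 64:
        # Codes 59-63 correspond to the negative Solexa scores -5 to -1.
        return "Solexa / Illumina 1.0 (offset 64)"
    return "probably Illumina 1.3-1.7 (offset 64)"


example = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
print(guess_quality_encoding([example]))
```

In practice one would feed in the quality lines of many reads, since a single read may not span the full range of symbols.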

There is no standard file extension for a FASTQ file, but .fq and .fastq are commonly used.

Standard flowgram format (SFF)

The 454 equivalent of the ABI chromatogram files. An SFF file contains:
• the flowgram
• the called sequence
• the qualities
• the recommended quality and adaptor clipping

• SFF files are binary; several tools can extract the sequences
• Output: fasta + fasta.qual, or fastq
• Sanger quality encoding

SAM (BAM) format

• Text format for storing sequence and quality data in a series of tab-delimited ASCII columns
• Stores alignment information against a given reference
• SAM is the human-readable version of BAM (which is compressed and indexed for fast parsing)
• SAM and BAM can be converted into each other with SAMtools
• Can be converted to FastQ or even Fasta
• Common output format of workflows
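The tab-delimited layout makes a SAM alignment line easy to take apart. The sketch below names the 11 mandatory columns as they appear in the SAM specification (these names are not given in the slides) and turns one alignment back into a FASTQ record. The alignment line itself is made up for illustration, and the function names are hypothetical; for real conversions SAMtools is the tool to use.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_sam_line(line):
    """Split one alignment line into a dict of the 11 mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("an alignment line needs at least 11 columns")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = columns[11:]  # optional fields, left unparsed here
    return record


def sam_to_fastq(record):
    """Rebuild the original read; reverse-strand alignments are flipped back."""
    seq, qual = record["SEQ"], record["QUAL"]
    if record["FLAG"] & 0x10:  # flag bit 0x10: read aligned to the reverse strand
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{record['QNAME']}\n{seq}\n+\n{qual}\n"


# Made-up alignment line: read1, reverse strand, 8 matching bases on chr1.
line = "read1\t16\tchr1\t100\t60\t8M\t*\t0\t0\tAAACCGGT\tIIIIHHHG"
record = parse_sam_line(line)
print(sam_to_fastq(record), end="")
```

Header lines (starting with '@') and records whose SEQ or QUAL is '*' are not handled by this sketch.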

Information on SAM/BAM
http://samtools.github.io/
https://github.com/samtools/hts-specs
http://genome.sph.umich.edu/wiki/SAM

PacBio - SMRT Cell

• A PacBio SMRT Cell run is packaged as a .tgz file (gzipped tar archive)
• Big (4-14 GB / SMRT Cell)
• Check the required folder structure, otherwise the file cannot be loaded into the SMRT Portal database
• Contains several .h5 (HDF5 format) files
• Contains metadata in XML format
• Should be read as one by SMRT Portal
• SMRT Portal can convert the files to standard FASTQ format

Documentation:
https://github.com/PacificBiosciences/SMRT-Analysis/wiki
https://s3.amazonaws.com/files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.

Tools for converting formats

• FASTX-Toolkit http://hannonlab.cshl.edu/fastx_toolkit/

• BioPython http://www.biopython.org

• SFF extract, now seq_crumbs http://bioinf.comav.upv.es/seq_crumbs/
• Other: PrinSeq, ea-utils, bedtools, sambamba

MD5 Checksums

• Large files can be damaged during file transfers
• MD5 checksums help to detect corrupt files
• If possible, ask for MD5 checksums

Example (on Linux):

md5sum -b largefile.fastq.gz > largefile.fastq.gz.md5

Contents of largefile.fastq.gz.md5:
a3672a3d4185acc49c7fa4460f1167ab *largefile.fastq.gz

md5sum -c largefile.fastq.gz.md5
largefile.fastq.gz: OK
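Where md5sum is not available, the same check can be sketched with Python's standard hashlib module. The function names below are hypothetical, and the sketch assumes the .md5 file uses the "checksum *filename" layout shown above, with file names resolved relative to the current directory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 checksum of a file without loading it into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_md5_file(md5_path):
    """Verify every 'checksum *filename' line; return {filename: True/False}."""
    results = {}
    with open(md5_path) as handle:
        for line in handle:
            if not line.strip():
                continue
            expected, filename = line.rstrip("\n").split(None, 1)
            filename = filename.lstrip("*")  # '*' marks binary mode in md5sum output
            results[filename] = md5_of_file(filename) == expected.lower()
    return results
```

Calling `check_md5_file("largefile.fastq.gz.md5")` would then report whether the file still matches the checksum supplied by the sequence provider.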

Windows tool: WinMD5 (http://winmd5.com/)

Pfff → Alternative fingerprinting
http://biit.cs.ut.ee/pfff/

Compression formats

Files come in various compression formats. All of them can be read under Linux; Windows might need extra software to extract the files.

The most common file formats are .gz and .bz2
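For the two most common formats, decompressing first is often unnecessary: Python's standard gzip and bz2 modules can read the files directly. The sketch below picks the reader from the file extension; the function names are hypothetical, and the record count assumes unwrapped four-line FASTQ records.

```python
import bz2
import gzip


def open_maybe_compressed(path):
    """Open a plain, .gz or .bz2 file for reading as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def count_fastq_records(path):
    """Count records, assuming exactly four lines per record."""
    with open_maybe_compressed(path) as handle:
        return sum(1 for _ in handle) // 4
```

Reading compressed files this way saves disk space, at the cost of some extra CPU time on every pass over the data.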

Extension   Command line to decompress
.gz         gunzip, gzip -d
.bz2        bunzip2, bzip2 -d
.zip        unzip
.rar        unrar x
.7z         7za e

Conclusions

• Be aware of the different file formats for sequence and quality data
• Data integrity → ask for MD5 checksums from your sequence provider
• If possible, convert older files to standard Sanger-encoded quality values
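As an illustration of the last point, a quality string from Illumina 1.3 to 1.7 (offset 64) can be shifted to the Sanger offset of 33. This is a minimal sketch with a hypothetical function name; it deliberately rejects Solexa-style negative scores, because those also need the odds-to-probability conversion and not just a change of offset.

```python
def illumina64_to_sanger(quality):
    """Re-encode an offset-64 Phred quality string with the Sanger offset 33."""
    converted = []
    for symbol in quality:
        score = ord(symbol) - 64
        if score < 0:
            raise ValueError(
                f"symbol {symbol!r} gives a negative score; "
                "this looks like Solexa or Sanger encoding, not Illumina 1.3-1.7"
            )
        converted.append(chr(score + 33))
    return "".join(converted)


print(illumina64_to_sanger("@BCh"))  # !#$I
```

Established converters (for example those in the tools listed earlier) are preferable for whole files; the sketch only shows what the conversion does to each symbol.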