Sequence data formats
A short guide on sequencing data formats
Data formats
Sequence and Quality
• Base calls
• Quality of base calls
A  T  G  T  A  G  C  A  C  G
29 28 33 18 26 31 18 34 32 39
• A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
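The mapping between Q and p can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard Phred definition Q = -10 * log10(p); the function names are made up for this example and do not come from any library. It converts the example quality values shown above into error probabilities.

```python
import math

def phred_to_error_prob(q):
    """Error probability p for a Phred quality value Q (assumes Q = -10 * log10(p))."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality value for an error probability p, rounded to an integer."""
    return round(-10 * math.log10(p))

# Base calls and quality values from the example above.
bases = "ATGTAGCACG"
quals = [29, 28, 33, 18, 26, 31, 18, 34, 32, 39]

for base, q in zip(bases, quals):
    print(f"{base}  Q{q}  p = {phred_to_error_prob(q):.5f}")
```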
Q10: probability of error = 10% (one error in every 10 bases)
Q20: probability of error = 1%
Q30: probability of error = 0.1%
Plain old FASTA
Fasta sequence
>identifier description
atcgtaggctttcggctata
gctaatgtagctatattgtc
Fasta qual
>identifier description
21 23 25 27 28 29 28 28 33 31 31 34 45 43 41 42 41 39 38 40
29 28 28 33 31 31 34 41 39 45 43 41 42 38 40 21 23 25 27 28
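Reading such a FASTA / qual pair can be sketched as follows. This is a minimal, hedged example: the function names `read_fasta` and `read_qual` are hypothetical, the identifier is assumed to be the first word after ">", and no error handling is included.

```python
def read_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # identifier = first word after '>'
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

def read_qual(text):
    """Return {identifier: [quality, ...]} from FASTA-style quality text."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].extend(int(token) for token in line.split())
    return records

FASTA_TEXT = """>identifier description
atcgtaggctttcggctata
gctaatgtagctatattgtc
"""
QUAL_TEXT = """>identifier description
21 23 25 27 28 29 28 28 33 31 31 34 45 43 41 42 41 39 38 40
29 28 28 33 31 31 34 41 39 45 43 41 42 38 40 21 23 25 27 28
"""

seqs = read_fasta(FASTA_TEXT)
quals = read_qual(QUAL_TEXT)
# Every base should have exactly one quality value.
print(len(seqs["identifier"]), len(quals["identifier"]))
```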
A few notes in advance
Number code
Numbers can be represented by letters through ASCII codes
http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters
33 !
34 "
35 #
36 $
37 %
......
64 @
65 A
66 B
......
121 y
122 z
FastQ
• One name, multiple formats
• Stores sequence and quality per base in the same file
A record on four lines:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
The same record with sequence and quality wrapped over several lines:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTAT
CGATCAAATAGTAAATCCATTT
GTTCAACTCACAGTTT
+SEQ_ID
!''*((((***+))%%%++)(%
%%%).1***-+*''))**55CCF>
>>>>>CCCCCCC65
• Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
• Line 2 is the raw sequence letters.
• Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
• Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
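The four-line layout and the ASCII-coded qualities can be illustrated with a small Python sketch. It assumes unwrapped records (exactly four lines each) and Sanger encoding (ASCII code minus 33); `parse_fastq` and `decode_quality` are illustrative names, not functions from an existing library.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality_string) from four-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual

def decode_quality(qual, offset=33):
    """Convert a quality string to integer scores (assumes ASCII code minus offset)."""
    return [ord(char) - offset for char in qual]

record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]

for name, seq, qual in parse_fastq(record):
    scores = decode_quality(qual)
    print(name, len(seq), scores[:5])
```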
Illumina output formats
.seq.txt .prb.txt
Illumina FASTQ (ASCII – 64 is Illumina score)
Qseq (ASCII – 64 is Phred score)
Illumina single line format (SCARF)
FastQ Quality
• A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
• Two different equations have been in use. The first is the standard Sanger variant to assess the reliability of a base call, otherwise known as the Phred quality score:
Q_sanger = -10 log10(p)
• In the old days Solexa (now Illumina) used a different mapping, encoding the odds p/(1-p) instead of the probability p:
Q_solexa = -10 log10(p / (1 - p))
• The two mappings differ at low quality values
• From about Q20 upward the differences are negligible
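How far the two mappings diverge can be checked numerically. The sketch below assumes the Phred definition Q = -10 * log10(p) and the Solexa odds-based definition Q = -10 * log10(p / (1 - p)); the function names are illustrative only.

```python
import math

def phred_from_p(p):
    """Sanger/Phred score for an error probability p."""
    return -10 * math.log10(p)

def solexa_from_p(p):
    """Solexa score for an error probability p (based on the odds p / (1 - p))."""
    return -10 * math.log10(p / (1 - p))

def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)

# Compare both scores over a range of error probabilities.
for p in (0.5, 0.2, 0.1, 0.01, 0.001):
    print(f"p = {p:<6} Phred = {phred_from_p(p):6.2f}  Solexa = {solexa_from_p(p):6.2f}")
```

At p = 0.5 the Solexa score is 0 while the Phred score is about 3; at p = 0.01 the two agree to within a few hundredths.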
Different mappings
Platform               Phred score   ASCII codes
Sanger                 0 to 93       33 to 126
Solexa/Illumina 1.0    -5 to 62      59 to 126
Illumina 1.3 – 1.7     0 to 62       64 to 126
Illumina 1.5 – 1.7     0 and 1 no longer used; 2 marks the end of the HQ read (but may occur in the middle of a read as well)
Illumina 1.8           0 to 93       33 to 126 (Sanger encoding)
PacBio                 0 to 93       33 to 126
Ion Torrent            0 to 93       33 to 126
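Re-encoding qualities between the offsets in this table comes down to shifting ASCII codes. The sketch below is an illustration under stated assumptions: `guess_offset` is a simple heuristic (codes below 59 can only come from the offset-33 encoding; anything else is assumed to be offset 64), and `illumina13_to_sanger` handles Phred+64 input only, not the older Solexa odds-based scores. Both names are hypothetical.

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the lowest character code seen.

    Heuristic only: a file containing nothing but high qualities is ambiguous.
    """
    lowest = min(ord(char) for qual in quality_strings for char in qual)
    return 33 if lowest < 59 else 64

def illumina13_to_sanger(qual):
    """Re-encode a Phred+64 quality string (Illumina 1.3 to 1.7) as Phred+33."""
    converted = []
    for char in qual:
        score = ord(char) - 64
        if score < 0:
            raise ValueError(f"character {char!r} is not valid in Phred+64 encoding")
        converted.append(chr(score + 33))
    return "".join(converted)

old_quality = "hhhhgfB"  # made-up Phred+64 example string
print(guess_offset([old_quality]))        # offset guessed from the lowest code
print(illumina13_to_sanger(old_quality))  # same scores, Sanger encoding
```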
There is no standard file extension for a FASTQ file, but .fq and .fastq are commonly used.
Standard flowgram format (SFF)
The 454 equivalent of the ABI chromatogram files. An SFF file contains:
• the flowgram,
• the called sequence,
• the qualities,
• the recommended quality and adaptor clipping.
• SFF files are binary. There are several tools to extract the sequences:
• as fasta + fasta.qual, or as fastq
• with Sanger quality encoding
SAM (BAM) format
• Text format for storing sequence and quality data in a series of tab-delimited ASCII columns
• Stores alignment information against a given reference
• SAM is the human-readable version of BAM (which is compressed and indexed for fast parsing)
• The two can be converted into each other with SAMtools
• Can be converted to FastQ or even Fasta
• Common output format of workflows
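The tab-delimited layout makes a SAM alignment line easy to take apart. The following sketch assumes the eleven mandatory columns of the SAM specification and a made-up alignment line; `parse_sam_line` and `sam_to_fastq` are illustrative names, and real conversions are better left to SAMtools. Reads flagged as reverse-strand (FLAG bit 0x10) are stored reverse-complemented, so the sketch restores the original orientation before writing FASTQ.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")
INT_FIELDS = ("flag", "pos", "mapq", "pnext", "tlen")
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def parse_sam_line(line):
    """Split one SAM alignment line into a dict of the 11 mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    record["tags"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    return record

def sam_to_fastq(record):
    """Return the read as a four-line FASTQ record in its original orientation."""
    seq, qual = record["seq"], record["qual"]
    if record["flag"] & 0x10:  # reverse strand: undo the reverse complement
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{record['qname']}\n{seq}\n+\n{qual}\n"

# Hypothetical alignment line, invented for this example.
line = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGG\tIIH#\tNM:i:0"
record = parse_sam_line(line)
print(sam_to_fastq(record), end="")
```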
Information on SAM/BAM
http://samtools.github.io/
https://github.com/samtools/hts-specs
http://genome.sph.umich.edu/wiki/SAM
PacBio - SMRT Cell
• A PacBio SMRT Cell run is packaged as a .tgz file (gzipped tar format)
• Big (4-14 GB / SMRT Cell)
• Check the required folder structure, otherwise the file cannot be loaded into the SMRT Portal database
• Contains several .h5 (HDF5 format) files
• Contains metadata in XML format
• Should be read as one package by SMRT Portal
• SMRT Portal can convert the files to standard FASTQ format
documentation:
https://github.com/PacificBiosciences/SMRT-Analysis/wiki
https://s3.amazonaws.com/files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.pdf
Tools for converting formats
• FASTX-Toolkit http://hannonlab.cshl.edu/fastx_toolkit/
• Biopython http://www.biopython.org
• SFF extract or now seq_crumbs http://bioinf.comav.upv.es/seq_crumbs/
• Other: PrinSeq, ea-utils, bedtools, sambamba
MD5 Checksums
• Large files can be damaged during file transfers
• MD5 checksums may help to detect corrupt files
• If possible, ask for MD5 checksums
Example (on Linux):
$ md5sum -b largefile.fastq.gz
a3672a3d4185acc49c7fa4460f1167ab *largefile.fastq.gz
$ md5sum -b largefile.fastq.gz > largefile.fastq.gz.md5
$ md5sum -c largefile.fastq.gz.md5
largefile.fastq.gz: OK
Windows tool: WinMD5 (http://winmd5.com/)
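Where neither tool is at hand, the same check can be sketched in Python with the standard `hashlib` module, which works on both Linux and Windows. The helper names `md5_of_file` and `verify_md5` are illustrative; the file is read in chunks so that large files do not have to fit in memory.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_md5(path, expected_hex):
    """True if the file's MD5 digest matches the checksum from the provider."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A call such as `verify_md5("largefile.fastq.gz", "a3672a3d4185acc49c7fa4460f1167ab")` would then correspond to the `md5sum -c` check.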
Pfff: alternative fingerprinting
http://biit.cs.ut.ee/pfff/
Compression formats
Files come in various compression formats. All of them can be read under Linux; Windows might need extra software to extract the files.
The most common compression formats are .gz and .bz2.
Compression   Extension   Command line to unzip
GZIP          .gz         gunzip
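Compressed files do not always have to be unpacked first. As a hedged sketch, Python's standard `gzip` and `bz2` modules can read them directly; the helper below picks the reader from the file's leading magic bytes rather than its extension. The names `open_maybe_compressed` and `count_fastq_records` are illustrative, and the record count assumes unwrapped four-line FASTQ records.

```python
import bz2
import gzip

def open_maybe_compressed(path):
    """Open a plain, gzip- or bzip2-compressed text file for reading."""
    with open(path, "rb") as handle:
        magic = handle.read(3)
    if magic[:2] == b"\x1f\x8b":  # gzip magic bytes
        return gzip.open(path, "rt")
    if magic == b"BZh":           # bzip2 magic bytes
        return bz2.open(path, "rt")
    return open(path, "r")

def count_fastq_records(path):
    """Count records in a FASTQ file (assumes four lines per record)."""
    with open_maybe_compressed(path) as handle:
        return sum(1 for _ in handle) // 4
```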
• Be aware of the different file formats for sequence and quality data
• Data integrity: ask for MD5 checksums from your sequence provider
• If possible, convert older files to standard Sanger-encoded quality values