Sequence Providers Techniques and Data Formats

Sequence data formats
A short guide on sequencing data formats

Data formats: sequence and quality
• Base calls
• Quality of base calls

    A  T  G  T  A  G  C  A  C  G
    29 28 33 18 26 31 18 34 32 39

• A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
• Q10: probability of error = 10% (one in every 10 bases)
• Q20: probability of error = 1%
• Q30: probability of error = 0.1%

Plain old FASTA
FASTA sequence:

    >identifier description
    atcgtaggctttcggctata
    gctaatgtagctatattgtc

FASTA qual:

    >identifier description
    21 23 25 27 28 29 28 28 33 31 31 34 45 43 41 42 41 39 38 40
    29 28 28 33 31 31 34 41 39 45 43 41 42 38 40 21 23 25 27 28

A few notes in advance: number code
Numbers can be represented by characters through their ASCII codes (http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters):

    33   !
    34   "
    35   #
    36   $
    37   %
    ...  ...
    64   @
    65   A
    66   B
    ...  ...
    121  y
    122  z

FastQ
• One name, multiple formats
• Stores sequence and quality per base in the same file

Multi-line variant (sequence and quality wrapped over several lines, identifier repeated after the '+'):

    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTAT
    CGATCAAATAGTAAATCCATTT
    GTTCAACTCACAGTTT
    +SEQ_ID
    !''*((((***+))%%%++)(%
    %%%).1***-+*''))**55CCF>
    >>>>>CCCCCCC65

Four-line variant:

    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

• Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
• Line 2 is the raw sequence letters.
• Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
• Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
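The four-line layout and the ASCII number code described above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for the tools listed later: it assumes the four-line variant (no wrapped sequences) and Sanger encoding (ASCII code minus 33), and the function names parse_fastq, phred33_scores and error_probability are made up for this example.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if quality is None:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, quality


def phred33_scores(quality_string):
    """Decode a Sanger-encoded quality string: score = ASCII code - 33."""
    return [ord(char) - 33 for char in quality_string]


def error_probability(score):
    """Probability that a base call with Phred score Q is wrong: 10^(-Q/10)."""
    return 10 ** (-score / 10)


# The example record from the slides, in four-line form.
example = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]

for identifier, sequence, quality in parse_fastq(example):
    scores = phred33_scores(quality)
    print(identifier, len(sequence), min(scores), max(scores))
```

For real files, passing an open text file handle to parse_fastq works the same way as passing the list above. Multi-line FASTQ needs a more careful parser, because '@' and '+' are also valid quality characters.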
Illumina output formats
• .seq.txt and .prb.txt
• Illumina FASTQ (ASCII code - 64 is the Illumina score)
• Qseq (ASCII code - 64 is the Phred score)
• Illumina single line format (SCARF)

FastQ quality
• A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
• Two different equations have been in use. The first is the standard Sanger variant to assess the reliability of a base call, otherwise known as the Phred quality score:

    $Q_{\text{sanger}} = -10 \log_{10} p$

• In the old days Solexa (now Illumina) used a different mapping, encoding the odds p/(1-p) instead of the probability p:

    $Q_{\text{solexa}} = -10 \log_{10} \frac{p}{1-p}$

• The two mappings differ at low quality values.
• From about Q20 upwards the difference is negligible.

Different mappings

    Platform              Quality score range    ASCII codes
    Sanger                0 to 93                33 to 126
    Solexa/Illumina 1.0   -5 to 62               59 to 126
    Illumina 1.3 - 1.7    0 to 62                64 to 126
    Illumina 1.8          0 to 93                33 to 126 (Sanger encoding)
    PacBio                0 to 93                33 to 126
    Ion Torrent           0 to 93                33 to 126

• Illumina 1.5 - 1.7: scores 0 and 1 are no longer used; score 2 marks the end of the high-quality part of a read (but may occur in the middle of a read as well).
• There is no standard file extension for a FASTQ file, but .fq and .fastq are commonly used.

Standard flowgram format (SFF)
The 454 equivalent of the ABI chromatogram files. An SFF file contains:
• the flowgram,
• the called sequence,
• the qualities,
• recommended quality and adaptor clipping.
• SFF files are binary.
• Several tools can extract the sequences, either as FASTA + FASTA qual or as FASTQ, with Sanger quality encoding.

SAM (BAM) format
• Text format for storing sequence and quality data in a series of tab-delimited ASCII columns
• Stores alignment information against a given reference
• SAM is the human-readable version of BAM (which is compressed and indexed for fast parsing)
• SAM and BAM can be converted into each other with SAMtools
• Can be converted to FastQ or even FASTA
• Common output format of workflows

Information on SAM/BAM:
    http://samtools.github.io/
    https://github.com/samtools/hts-specs
    http://genome.sph.umich.edu/wiki/SAM

PacBio - SMRT Cell
• A PacBio SMRT Cell run is packaged as a .tgz file (gzipped tar format)
• Big (4-14 GB per SMRT Cell)
• Check the required folder structure, otherwise the file cannot be loaded into the SMRT Portal database
• Contains several .h5 (HDF5 format) files
• Contains metadata in XML format
• Should be read as one package by SMRT Portal
• SMRT Portal can convert the files to standard FASTQ

Documentation: https://github.com/PacificBiosciences/SMRT-Analysis/wiki
Format: https://s3.amazonaws.com/files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.pdf

Tools for converting formats
• FASTX-Toolkit http://hannonlab.cshl.edu/fastx_toolkit/
• BioPython http://www.biopython.org
• SFF extract, now seq_crumbs http://bioinf.comav.upv.es/seq_crumbs/
• Other: PrinSeq, ea-utils, bedtools, sambamba

MD5 checksums
• Large files can be damaged during file transfers
• MD5 checksums may help to detect corrupt files
• If possible, ask for MD5 checksums

Example (on Linux):

    md5sum -b largefile.fastq.gz > largefile.fastq.gz.md5

The .md5 file then contains the checksum and the file name:

    a3672a3d4185acc49c7fa4460f1167ab *largefile.fastq.gz

To verify the file against the stored checksum:

    md5sum -c largefile.fastq.gz.md5
    largefile.fastq.gz: OK

Windows tool: WinMD5 (http://winmd5.com/)
Alternative fingerprinting: Pfff (http://biit.cs.ut.ee/pfff/)

Compression formats
Files come in various compression formats.
All of them can be read under Linux; Windows might need extra software to extract the files. The most common file formats are .gz and .bz2.

    Compression   Extension   Command line to unzip
    GZIP          .gz         gunzip <file>  or  gzip -d <file>
    BZIP2         .bz2        bunzip2 <file>  or  bzip2 -d <file>
    zip           .zip        unzip <file>
    rar           .rar        unrar x <file>
    7z            .7z         7za e <file>

Conclusions
• Be aware of the different file formats for sequence and quality data
• Data integrity: ask for MD5 checksums from your sequence provider
• If possible, convert older files to standard Sanger-encoded quality values
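The last two recommendations can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions: the function names md5_of_stream and illumina64_to_sanger are made up for this example, and the conversion assumes Illumina 1.3 to 1.7 quality strings (ASCII code minus 64 is a Phred score). Solexa/Illumina 1.0 scores are odds-based and would need a different conversion.

```python
import hashlib
import io


def md5_of_stream(handle, chunk_size=1 << 20):
    """Return the MD5 hex digest of a binary file-like object.

    Reading in chunks keeps memory use low for large sequence files.
    """
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def illumina64_to_sanger(quality_string):
    """Re-encode a Phred+64 quality string as Sanger (Phred+33).

    Assumes Illumina 1.3 to 1.7 input, where ASCII code - 64 is the Phred score.
    """
    converted = []
    for char in quality_string:
        score = ord(char) - 64
        if score < 0 or score > 62:
            raise ValueError("not a Phred+64 quality character: " + repr(char))
        converted.append(chr(score + 33))
    return "".join(converted)


# Checksum of an in-memory stream; for a real file, open it in binary mode,
# e.g. open("largefile.fastq.gz", "rb"), and pass the handle instead.
print(md5_of_stream(io.BytesIO(b"hello")))

# 'h' is Phred 40 in the +64 encoding and becomes 'I' in the Sanger encoding.
print(illumina64_to_sanger("hhB@"))
```

The digest returned by md5_of_stream can be compared with the first column of the .md5 file supplied by the sequence provider. Characters outside the Phred+64 range raise an error, which is a useful hint that the file was already Sanger encoded or uses the older Solexa scores.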
