Sequence Providers Techniques and Data Formats

Sequence data formats
A short guide on sequencing data formats

Sequence and Quality

• Base calls
• Quality of base calls

    A  T  G  T  A  G  C  A  C  G
    29 28 33 18 26 31 18 34 32 39

• A quality value Q is an integer mapping of p, the probability that the corresponding base call is incorrect.
• Q10: probability of error = 10% (one in every 10 bases)
• Q20: probability of error = 1%
• Q30: probability of error = 0.1%

Plain old FASTA

Fasta sequence:

    >identifier description
    atcgtaggctttcggctata
    gctaatgtagctatattgtc

Fasta qual:

    >identifier description
    21 23 25 27 28 29 28 28 33 31 31 34 45 43 41 42 41 39 38 40
    29 28 28 33 31 31 34 41 39 45 43 41 42 38 40 21 23 25 27 28

A few notes in advance: number code

Numbers can be represented by letters through ASCII codes (http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters):

    Code   Character
    33     !
    34     "
    35     #
    36     $
    37     %
    ...    ...
    64     @
    65     A
    66     B
    ...    ...
    121    y
    122    z

FastQ

• One name, multiple formats
• Stores sequence and quality per base in the same file

A record may wrap the sequence and the quality string over several lines:

    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTAT
    CGATCAAATAGTAAATCCATTT
    GTTCAACTCACAGTTT
    +SEQ_ID
    !''*((((***+))%%%++)(%
    %%%).1***-+*''))**55CC
    F>>>>>>CCCCCCC65

or keep each of them on a single line:

    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

In the four-line form:

• Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
• Line 2 is the raw sequence letters.
• Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
• Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as there are letters in the sequence.
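The four-line record and the ASCII number code described above can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: it assumes the single-line (unwrapped) form of a record, assumes Sanger encoding (ASCII code minus 33), and assumes the standard Phred relation Q = -10 log10(p), which is consistent with Q20 = 1% and Q30 = 0.1%. The function names are hypothetical.

```python
def phred_to_error_probability(q):
    """Return the error probability p for a quality Q, assuming Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Turn a quality string into integer scores (offset 33 is assumed: Sanger encoding)."""
    return [ord(character) - offset for character in quality_string]


def parse_fastq_record(lines):
    """Parse one unwrapped four-line FASTQ record into (identifier, sequence, scores)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("line 4 must have one symbol per base in line 2")
    return header[1:], sequence, decode_quality(quality)


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
identifier, sequence, scores = parse_fastq_record(record)
print(identifier, len(sequence), scores[:5])
```

Wrapped records would need the sequence and quality lines to be joined first; the length check on line 4 is what makes that joining unambiguous.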
Illumina output formats

• .seq.txt
• .prb.txt
• Illumina FASTQ (ASCII code minus 64 is the Illumina score)
• Qseq (ASCII code minus 64 is the Phred score)
• Illumina single line format (SCARF)

FastQ Quality

• A quality value Q is an integer mapping of p, the probability that the corresponding base call is incorrect.
• Two different equations have been in use. The first is the standard Sanger variant to assess the reliability of a base call, otherwise known as the Phred quality score: Q_sanger = -10 log10(p)
• In the old days Solexa (now Illumina) used a different mapping, encoding the odds p/(1-p) instead of the probability p: Q_solexa = -10 log10(p/(1-p))
• The two mappings differ at low quality values
• From about Q20 upwards there is practically no difference

Different mappings

    Platform               Quality score   ASCII codes
    Sanger                 0 to 93         33 to 126
    Solexa/Illumina 1.0    -5 to 62        59 to 126
    Illumina 1.3 to 1.7    0 to 62         64 to 126
    Illumina 1.8           0 to 93         33 to 126 (Sanger encoding)
    PacBio                 0 to 93         33 to 126
    Ion Torrent            0 to 93         33 to 126

Note for Illumina 1.5 to 1.7: scores 0 and 1 are no longer used, and 2 marks the end of the high-quality part of a read (but may occur in the middle of a read as well).

There is no standard file extension for a FASTQ file, but .fq and .fastq are commonly used.

Standard flowgram format (SFF)

The 454 equivalent of the ABI chromatogram files. An SFF file contains:

• the flowgram,
• the called sequence,
• the qualities,
• recommended quality and adaptor clipping.

SFF files are binary.
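The Sanger and Solexa quality mappings discussed in this section can be compared numerically. The sketch below assumes the usual definitions Q_sanger = -10 log10(p) and Q_solexa = -10 log10(p/(1-p)), from which the conversion formulas follow by eliminating p. It also shows how an offset-64 quality string might be re-encoded with offset 33. All function names are hypothetical.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa score (odds based) to a Phred score (probability based)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_solexa(q_phred):
    """Inverse conversion; undefined for q_phred == 0 (p = 1 gives infinite odds)."""
    return 10 * math.log10(10 ** (q_phred / 10) - 1)


def offset64_to_offset33(quality_string):
    """Re-encode a Phred+64 quality string (assumed Illumina 1.3 to 1.7) as Phred+33."""
    scores = [ord(character) - 64 for character in quality_string]
    if min(scores, default=0) < 0:
        raise ValueError("negative score: input is probably not Phred+64")
    return "".join(chr(score + 33) for score in scores)


# The two scales diverge at the low end and converge as quality increases.
for q_solexa in (-5, 0, 10, 20, 30):
    print(q_solexa, round(solexa_to_phred(q_solexa), 2))
```

The negative-score check is only a crude guard; it cannot tell a true Solexa file (scores down to -5) from a mislabelled Sanger file, so the encoding should still be confirmed with the sequence provider.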
Several tools can extract the sequences from SFF files:

• as fasta + fasta.qual, or as fastq
• with Sanger quality encoding

SAM (BAM) format

• SAM is a text format for storing sequence and quality data in a series of tab-delimited ASCII columns
• Stores alignment information against a given reference
• SAM is the human-readable version of BAM (which is compressed and indexed for fast parsing)
• The two can be converted into each other with SAMtools
• Can be converted to FastQ or even Fasta
• Common output format of workflows

Information on SAM/BAM:
http://samtools.github.io/
https://github.com/samtools/hts-specs
http://genome.sph.umich.edu/wiki/SAM

PacBio - SMRT Cell

• A PacBio SMRT Cell run is packaged as a .tgz file (gzipped tar format)
• Big (4-14 GB per SMRT Cell)
• Check the required folder structure, otherwise the file cannot be loaded into the SMRT Portal database
• Contains several .h5 (HDF5 format) files
• Contains metadata in XML format
• Should be read as one package by SMRT Portal
• SMRT Portal can convert the files to standard FASTQ

Documentation: https://github.com/PacificBiosciences/SMRT-Analysis/wiki
Format: https://s3.amazonaws.com/files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.pdf

Tools for converting formats

• FASTX-Toolkit http://hannonlab.cshl.edu/fastx_toolkit/
• BioPython http://www.biopython.org
• SFF extract, now seq_crumbs http://bioinf.comav.upv.es/seq_crumbs/
• Other: PrinSeq, ea-utils, bedtools, sambamba

MD5 Checksums

• Large files can be damaged during file transfers
• MD5 checksums may help to detect corrupt files
• If possible, ask for MD5 checksums

Example (on Linux):

    md5sum -b largefile.fastq.gz > largefile.fastq.gz.md5

The resulting largefile.fastq.gz.md5 contains:

    a3672a3d4185acc49c7fa4460f1167ab *largefile.fastq.gz

To verify the file after a transfer:

    md5sum -c largefile.fastq.gz.md5
    largefile.fastq.gz: OK

Windows tool: WinMD5 (http://winmd5.com/)
Alternative fingerprinting: Pfff (http://biit.cs.ut.ee/pfff/)

Compression formats

Files come in various compression formats.
All of them can be read under Linux; Windows might need extra software to extract the files. The most common file formats are .gz and .bz2.

    Compression   Extension   Command line to decompress
    GZIP          .gz         gunzip <file>   or   gzip -d <file>
    BZIP2         .bz2        bunzip2 <file>  or   bzip2 -d <file>
    zip           .zip        unzip <file>
    rar           .rar        unrar x <file>
    7z            .7z         7za e <file>

Conclusions

• Be aware of the different file formats for sequence and quality data
• Data integrity: ask for MD5 checksums from your sequence provider
• If possible, convert older files to standard Sanger-encoded quality values
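Where the command-line tools are not available (for example on Windows), the integrity check and the decompression step can be approximated with the Python standard library. The sketch below is one possible approach under stated assumptions: the helper names are hypothetical, the compression format is guessed from the file extension only, and the checksum is compared against the hexadecimal value supplied by the sequence provider.

```python
import bz2
import gzip
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Return True if the file's MD5 equals the checksum given by the provider."""
    return md5_of_file(path) == expected_hex.strip().lower()


def open_sequence_file(path):
    """Open a plain, .gz or .bz2 file as text, choosing the reader by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")
```

A typical use would be to call matches_checksum on the downloaded archive first, and only then iterate over the lines returned by open_sequence_file, so that a corrupt transfer is caught before any analysis starts.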