Quality Control and Mapping


Next-Generation Sequencing: Quality Control and Mapping

BaRC Hot Topics – January 2015
Bioinformatics and Research Computing
Whitehead Institute
http://barc.wi.mit.edu/hot_topics/

Outline
• Quality control
• Preprocessing
• Read mapping
  – Non-spliced alignment
  – Spliced alignment
• Post-process the mapped read files
  – Remove unmapped reads, sort, index, etc.
  – Mapping statistics

Illumina data format
• FASTQ format (/1 or /2 for paired-end reads): http://en.wikipedia.org/wiki/FASTQ_format

    @ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1    @seq identifier
    GTAGAACTGGTACGGACAAGGGGAATCTGACTGTAG          seq
    +ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1    +any description
    hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh          seq quality values

    Input qualities     Illumina versions
    --solexa-quals      <= 1.2
    --phred64           1.3-1.7
    --phred33           >= 1.8

Check read quality with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
1. Run fastqc to check read quality
    bsub fastqc sample.fastq
2. Open the output file "fastqc_report.html"

FastQC report
We have to know the quality encoding to use the appropriate parameter in the mapping step.

FastQC: per base sequence quality
[Figure: per-base quality box plot. Background bands mark very good quality calls, reasonable quality, and poor quality. Red: median; blue: mean; yellow: 25%, 75%; whiskers: 10%, 90%.]

Preprocessing tools
• FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
  – FASTQ/A Trimmer: shortens reads in a FASTQ or FASTA file (removing barcodes or noise).
  – FASTQ Quality Filter: filters sequences based on quality
  – FASTQ Quality Trimmer: trims (cuts) sequences based on quality
  – FASTQ Masker: masks nucleotides with 'N' (or another character) based on quality
  (For a complete list, go to the link above)
• cutadapt to remove adapters (https://code.google.com/p/cutadapt/)

What preprocessing do we need?
• Flagged Kmer Content: about 100% of the first six bases are the same sequence -> use "FASTQ Trimmer"
• Bad quality -> use "FASTQ Quality Filter" and/or "FASTQ Quality Trimmer"

    Sequence                                   Count     Percentage          Possible Source
    TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGGCA   7360116   82.88507591015895   RNA PCR Primer, Index 3 (100% over 40bp)
    GCGAGTGCGGTAGAGGGTAGTGGAATTCTCGGGTGCCAAG   541189    6.094535921273932   No Hit
    TCGAATTGCCTTTGGGACTGCGAGGCTTTGAGGACGGAAG   291330    3.2807783416601866  No Hit
    CCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGG   210051    2.365464495397192   RNA PCR Primer, Index 3 (100% over 38bp)

• Overrepresented sequences -> if the overrepresented sequence is an adapter, use "cutadapt"

Examples of preprocessing I (hands-on exercise)

Remove reads with lower quality
    bsub fastq_quality_filter -v -q 20 -p 75 -i sample.fastq -o sample_filtered.fastq

    -i: input file
    -o: output file
    -v: report number of sequences
    -q 20: the quality value required
    -p 75: the percentage of bases that have to have that quality value

Trim the reads
    # Delete the first 6 nt from the 5' end
    bsub fastx_trimmer -v -f 7 -l 36 -i sample.fastq -o sample_trimmed.fastq

    -f: first base to keep
    -l: last base to keep
    -i: input file
    -o: output file
    -v: report number of sequences

Examples of preprocessing II (hands-on exercise)
• Remove adapter/linker with cutadapt
    # usage
    bsub "cutadapt -m 20 -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG sample2.fastq | fastx_artifacts_filter > sample2_trimFilt.fastq"

    -a: sequence of an adapter that was ligated to the 3' end
    -b: sequence of an adapter that was ligated to the 5' or 3' end
    -g: sequence of an adapter that was ligated to the 5' end
    -o: output file name

Recommendation for preprocessing
• Treat all the samples the same way.
• Watch out for preprocessing that may result in very different read lengths in the different samples, as that can affect mapping.
• If you have paired-end reads, make sure you still have both reads of the pair after the processing is done.
• Run FastQC on the processed samples to see if the problem has been removed.

Local genomic files needed for mapping
tak: /nfs/genomes/
– Human, mouse, zebrafish, C. elegans, fly, yeast, etc.
– Different genome builds
  • mm9: mouse_gp_jul_07
  • mm10: mouse_mm10_dec_11
– human_gp_feb_09 vs human_gp_feb_09_no_random?
  • human_gp_feb_09 includes *_random.fa, *hap*.fa, etc.
– Subdirectories:
  • bowtie
    – Bowtie 1: *.ebwt
    – Bowtie 2: *.bt2
  • fasta
  • fasta_whole_genome: all sequences in one file
  • gtf: gene models from RefSeq, Ensembl, etc.

Mapping I
Non-spliced alignment software
§ Used for mapping DNA fragments, e.g. ChIP-seq, SNP calling
§ Bowtie:
  § Bowtie 1 vs Bowtie 2
    § For reads >50 bp, Bowtie 2 is generally faster, more sensitive, and uses less memory than Bowtie 1.
    § Bowtie 2 supports gapped alignment, which makes it better for SNP calling. Bowtie 1 only finds ungapped alignments.
    § Bowtie 2 supports a "local" alignment mode, in addition to the "end-to-end" alignment mode supported by Bowtie 1.
§ BWA:
  § Refer to the BaRC SOP for detailed information

Mapping reads with bowtie2
• Mapping single reads:
    bowtie2 [options]* -x <bt2-index> -U <r> [-S <output.sam>]
    bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -U DNA.fastq -S DNA.sam
• Mapping paired-end reads:
    bowtie2 [options]* -x <bt2-index> -1 <m1> -2 <m2> [-S <output.sam>]
    bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -1 Reads1.fastq -2 Reads2.fastq -S DNA.sam

Some important parameters in bowtie2
• Reporting
    (default)       look for multiple alignments, report best, with MAPQ
    -k <int>        report up to <int> alignments per read; MAPQ not meaningful
    -a/--all        report all alignments; very slow, MAPQ not meaningful
• Alignment mode
    --end-to-end    entire read must align; no clipping (on)
    --local         local alignment; ends might be soft clipped (off)
• -L <int>  length of seed substrings; must be >3 and <32 (default=22)
• -N <int>  max # mismatches in seed alignment; can be 0 or 1 (default=0)

Input qualities
Illumina versions
    --solexa-quals          <= 1.2
    --phred64               1.3-1.7
    --phred33 (default)     >= 1.8

Mapping II
Spliced alignment software
§ Used if mapping RNA fragments
§ TopHat2 (uses bowtie2)
§ STAR: maps >60 times faster than TopHat2, tends to align more reads to pseudogenes. See BaRC SOPs

Spliced alignment with TopHat2
TopHat2 uses bowtie2 to map the reads

    # single-end reads
    bsub tophat --solexa1.3-quals --segment-length 20 --no-novel-juncs \
        -G /nfs/genomes/mouse_mm10_dec_11_no_random/gtf/mm10_no_random.refseq.gtf \
        /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 sample_good_trimmed.fastq
    # paired-end reads: add the additional fastq file to the end of the above command.

    Input qualities            Refer to the bowtie2 mapping slide
    --segment-length           Shortest length of a spliced read that can map to one side of the junction. Default: 25
    --no-novel-juncs           Only look at reads across junctions in the supplied GFF file
    -G <GTF file>              Map reads to the virtual transcriptome (from the GTF file) first
    -N                         Max. number of mismatches in a read; default is 2
    -o/--output-dir            Default: tophat_out
    --library-type             fr-unstranded, fr-firststrand, fr-secondstrand
    -I/--max-intron-length     Default: 500000

Optimize mapping across introns
• TopHat default parameters are designed for mammalian RNA-seq data.
• Reduce the "maximum intron length" for non-mammalian organisms
  -I: default is 500,000

    Species        Max_intron_length
    yeast          2,484
    arabidopsis    11,603
    C. elegans     100,913
    fly            141,628

Hands-on mapping
• bowtie2
    bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -U DNA.fastq -S DNA.sam
• tophat
    bsub tophat --solexa1.3-quals --segment-length 20 \
        -G /nfs/genomes/mouse_mm10_dec_11_no_random/gtf/mm10_no_random.refseq.gtf \
        /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 sample_good_trimmed.fastq
  Note: the tophat output file will be tophat_out/accepted_hits.bam

Mapped reads file formats: SAM/BAM
• SAM: Sequence Alignment/Map format.
  It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. Each alignment line has 11 mandatory fields for essential alignment information.
• BAM: binary format. It is much smaller than SAM.
• BAM is needed for viewing in a genome browser. It has to be sorted and indexed.
• To save space, you should convert mapped files to .bam format and delete the .sam file.

SAM tools: set of tools for manipulating mapped read files

    TOOL                 DESCRIPTION
    samtools view        conversion between SAM and BAM files
    samtools flagstat    simple statistics on the mapped reads
    samtools sort        sort alignment file
    samtools index       index alignment
    samtools rmdup       remove PCR duplicates
    samtools             displays all the tools available

Hands on
Convert .sam to .bam format, sort and index.
    bsub /nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl DNA.sam
1. Convert .sam to .bam
2. Sort the bam file
3. Index the bam file, creating a .bai file
Delete the .sam file

How to get the number of reads mapped
• Bowtie2 prints the number of reads mapped to STDERR, so you will see it in the email that you received.
• TopHat makes a summary file in the tophat output directory.
    head tophat_out/align_summary.txt
• Tools:
  – bam_stat.py -i accepted_hits.bam
  – samtools flagstat mapped_unmapped.bam
• See BaRC SOPs: http://barcwiki.wi.mit.edu/wiki/SOPs/miningSAMBAM

What to look for when few reads mapped?
• Reads are not perfectly paired *
  – Usually occurs after the QC step. Removing low-quality reads or adapters creates an uneven distribution of reads between the two files
    bsub "/nfs/BaRC_Public/BaRC_code/Perl/cmpfastq/cmpfastq.pl s_8_1_filtered.fastq s_8_2_filtered.fastq"
• Reads may have adapter sequences
  – Blast the top overrepresented sequences in the FastQC output
  – Refer to the preprocessing steps
• Mapping parameters are too stringent *
  – Increase the number of mismatches
  – Adjust the insert size of paired-end reads?
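The pairing problem in the first bullet of this slide can be illustrated with a short Python sketch. This is not the BaRC cmpfastq.pl script: the function names (`read_id`, `fastq_ids`, `pairing_report`) are hypothetical, and the code assumes well-formed four-line FASTQ records whose mates are marked with a trailing /1 or /2, as in the FASTQ example earlier in the deck.

```python
def read_id(header):
    """Return the read name from a FASTQ header line, without the
    leading '@' and without a trailing /1 or /2 mate suffix."""
    name = header.strip()[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def fastq_ids(lines):
    """Collect read names from FASTQ lines.
    Assumption: exactly 4 lines per record, so headers sit at 0, 4, 8, ..."""
    return {read_id(line) for i, line in enumerate(lines) if i % 4 == 0}


def pairing_report(r1_lines, r2_lines):
    """Count reads present in both files and reads that lost their mate."""
    ids1, ids2 = fastq_ids(r1_lines), fastq_ids(r2_lines)
    return {
        "paired": len(ids1 & ids2),
        "only_r1": len(ids1 - ids2),
        "only_r2": len(ids2 - ids1),
    }


if __name__ == "__main__":
    # Tiny in-memory example: read b lost its mate, read c has no first read.
    r1 = ["@a/1", "ACGT", "+", "IIII", "@b/1", "ACGT", "+", "IIII"]
    r2 = ["@a/2", "ACGT", "+", "IIII", "@c/2", "ACGT", "+", "IIII"]
    print(pairing_report(r1, r2))
```

For real data, the two line lists could come from something like `open("s_8_1_filtered.fastq").read().splitlines()`. Non-zero `only_r1` or `only_r2` counts suggest the files need to be re-paired before paired-end mapping.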
* Refer to the BaRC SOP for more information

Summary
• Quality control
  – FastQC
• Clean up reads:
  – FASTX-Toolkit: fastq_quality_filter, fastx_trimmer
  – cutadapt
• Map reads:
  – Bowtie2
  – TopHat2
• Understand the mapped files, and check mapping quality:
  – Samtools
  – RSeQC: bam_stat.py

BaRC Standard operating procedures
http://barcwiki.wi.mit.edu/wiki/SOPs

References
FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
FASTX-Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/
cutadapt: https://code.google.com/p/cutadapt
Bowtie: Langmead B, Trapnell C, Pop M, Salzberg SL.
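Returning to the quality-encoding tables (--phred33 vs --phred64) and the "-q 20 -p 75" filter from the preprocessing examples, the following Python sketch shows how quality characters map to scores. It is only an illustration under stated assumptions: the function names are hypothetical, the thresholds 59 and 74 are a common heuristic rather than anything the deck or FastQC specifies, and `passes_filter` mirrors the slide's description of the filter, not the FASTX-Toolkit source.

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from non-empty quality strings.
    Heuristic assumption: characters below ';' (59) only occur with
    offset 33, and characters above 'J' (74) only occur with offset 64."""
    codes = [ord(c) for q in quality_strings for c in q]
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None  # ambiguous: every character is valid in both encodings


def decode(quality, offset):
    """Convert a quality string to a list of integer quality scores."""
    return [ord(c) - offset for c in quality]


def passes_filter(quality, offset, min_q=20, min_percent=75):
    """True if at least min_percent of the bases have quality >= min_q."""
    scores = decode(quality, offset)
    good = sum(1 for s in scores if s >= min_q)
    return 100.0 * good / len(scores) >= min_percent


if __name__ == "__main__":
    # Quality line from the FASTQ example in the deck.
    qual = "hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh"
    offset = guess_offset([qual])
    print(offset, decode(qual, offset)[:5], passes_filter(qual, offset))
```

On the deck's example read, every character is at or above 'd', which is why the guess falls on the 64 offset and matches the --phred64 option used in the bowtie2 commands.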