NGS Raw Data Analysis

Total Page:16

File Type:pdf, Size:1020Kb

NGS Raw Data Analysis NGS raw data analysis Introduc3on to Next-Generaon Sequencing (NGS)-data Analysis Laura Riva & Maa Pelizzola Outline 1. Descrip7on of sequencing plaorms 2. Alignments, QC, file formats, data processing, and tools of general use 3. Prac7cal lesson Illumina Sequencing Single base extension with incorporaon of fluorescently labeled nucleodes Library preparaon Automated cluster genera3on DNA fragmenta3on and adapter ligaon Aachment to the flow cell Cluster generaon by bridge amplifica3on of DNA fragments Sequencing Illumina Output Quality Scores Sequence Files Comparison of some exisng plaGorms 454 Ti RocheT Illumina HiSeqTM ABI 5500 (SOLiD) 2000 Amplificaon Emulsion PCR Bridge PCR Emulsion PCR Sequencing Pyrosequencing Reversible Ligaon-based reac7on terminators sequencing Paired ends/sep Yes/3kb Yes/200 bp Yes/3 kb Read length 400 bp 100 bp 75 bp Advantages Short run 7mes. The most popular Good base call Longer reads plaorm accuracy. Good improve mapping in mulplexing repe7ve regions. capability Ability to detect large structural variaons Disadvantages High reagent cost. Higher error rates in repeat sequences The Illumina HiSeq! Stacey Gabriel Libraries, lanes, and flowcells Flowcell Lanes Illumina Each reaction produces a unique Each NGS machine processes a library of DNA fragments for single flowcell containing several sequencing. independent lanes during a single sequencing run Applications of Next-Generation Sequencing Basic data analysis concepts: • Raw data analysis= image processing and base calling • Understanding Fastq • Understanding Quality scores • Quality control and read cleaning • Alignment to the reference genome • Understanding SAM/BAM formats • Samtools • Bedtools Understanding FASTQ file format FASTQ format is a text-based format for storing a biological sequence and its corresponding quality scores. It has become the standard for storing the output of high throughput sequencing instruments. EXAMPLE: 1. begins with a '@' character and is followed by a sequence iden7fier 2. the raw sequence leeers. 3. begins with a '+' character and is oponally followed by the same sequence iden7fier 4. encodes the quality values for the sequence in and must contain the same number of symbols as leeers in the sequence. Understanding Quality scores Fastq contains quality informaon. Phred quality scores Q are defined as a property which is logarithmically related to the base-calling error probabili#es P( Where P is the probability that the corresponding base call is incorrect. Phred quality scores Phred quality scores Q are represented with a single bit in ASCII format. ASCII stands for American Standard Code for Informaon Interchange. ASCII code is the numerical representaon of a character such as 'a' or '@' or an ac7on of some sort. The first 32 symbols in ASCII are control characters, so we start at 33. If your ASCII character is ‘B’ and they are in Sanger format, then 66-33=33, so 33= (-10log10p) -3.3=log10p 10-3.3=p, so p= 0.0005 or 0.05% chance of an incorrect base. @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@ The fourth line contains the base quali7es where BQ + 33 = ASCII value shown in the base quality string Sequences ID @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG means: Quality control and read cleaning • Before alignment there is some7mes the need to preprocess/ manipulate the FASTA/FASTQ files to produce beeer mapping results. It is important to do quality control checks to understand whether your data has any problems of which you should be aware before doing any further analysis • FastQC: quality control checks on raw sequence data coming from high throughput sequencing pipelines (hep:// www.bioinformacs.bbsrc.ac.uk/projects/fastqc/). • The FASTX-Toolkit tools perform some of these preprocessing tasks (hep://hannonlab.cshl.edu/fastx_toolkit/). Two of many useful tools are: – FASTQ Quality Filter à Filters sequences based on quality – FASTQ Quality Trimmer à Trims (cuts) sequences based on quality FastQC: quality controls of sequencing files I 90th percen7le of the data Very good quality median Reasonable quality mean PER BASE SEQUENCE QUALITY Poor quality 10th percen7le of the data ACCTGGGATCAAACATTCAGGACATATAGCACAATAGGAC A very good quality sequence FastQC: quality controls of sequencing files II DUPLICATION LEVELS Which por7on of the sequences has that characteris7c? Computed for the first 200’000 reads gives an overall impression for the duplicate levels in the whole file How many 7mes is the sequence repeated? A very good quality sequence General steps of data analysis: – Quality control and read cleaning – Alignment to the genome There are many aligners and many short reads aligners: Alignment strategy: • Mapping: quickly iden7fy candidates of hits on the reference genome • Alignment and report: score the alignment • Important features: – Some souware use the base quality score to evaluate alignment, others do not – For all the aligners there is a trade off between performance and accuracy – Gapped or ungapped alignment – Important parameters: • Maximum of mismatches • Repor7ng unique hits or mul7ple hits BWA •It uses Burrows-Wheeler indexing algorithm to speed up alignment 7me • Fast and moderate memory usage • Work for different sequencing plaorms, for SE and PE • Gapped alignment for both SE and PE reads • Effec7ve pairing to achieve high alignment accuracy; subop7mal hits considered in pairing. • Non-unique read is placed randomly with a mapping quality 0. • Reports ambiguous hits References: 1.Li, H. and Durbin, R., Fast and accurate short read alignment with Burrows- Wheeler transform. Bioinforma#cs 25 (14), 1754 (2009). 2.hep://bio-bwa.sourceforge.net/bwa.shtml Understanding SAM/BAM file format The Sequence Alignment/Map (SAM) format: • Format for the storage of sequence alignments and their mapping coordinates • Supports different sequencing plaorms • Flexible in style, compact in size, computaonally efficient to access BAM is the binary version of the SAM format Samtools is a set of tools for manipulang and controlling SAM/BAM files SAM file format BAM headers: an optional (but SAM file format essential) part of a BAM file @HD VN:1.0 GO:none SO:coordinate Required: Standard header @SQ SN:chrM LN:16571 @SQ SN:chr1 LN:247249719 Essential: read groups. @SQ SN:chr2 LN:242951149 Essential: contigs of Carries platform (PL), library [cut for clarity] aligned reference (LB), and sample (SM) @SQ SN:chr9 LN:140273252 sequence. Should be @SQ SN:chr10 LN:135374737 information. Each read is in karotypic order. @SQ SN:chr11 LN:134452384 associated with a read group [cut for clarity] @SQ SN:chr22 LN:49691432 @SQ SN:chrX LN:154913754 @SQ SN:chrY LN:57772954 @RG ID:20FUK.1 PL:illumina PU:20FUKAAXX100202.1 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.2 PL:illumina PU:20FUKAAXX100202.2 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.3 PL:illumina PU:20FUKAAXX100202.3 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.4 PL:illumina PU:20FUKAAXX100202.4 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.5 PL:illumina PU:20FUKAAXX100202.5 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.6 PL:illumina PU:20FUKAAXX100202.6 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.7 PL:illumina PU:20FUKAAXX100202.7 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.8 PL:illumina PU:20FUKAAXX100202.8 LB:Solexa-18484 SM:NA12878 CN:BI @PG ID:BWA VN:0.5.7 CL:tk @PG ID:GATK TableRecalibration VN:1.0.2864 Useful: Data processing tools applied to the reads 20FUKAAXX100202:1:1:12730:189900 163 chrM 1 60 101M = 282 381 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTA…[more bases] ?BA@A>BBBBACBBAC@BBCBBCBC@BC@CAC@:BBCBBCACAACBABCBCCAB…[more quals] RG:Z:20FUK.1 NM:i:1 SM:i:37 AM:i:37 MD:Z:72G28 MQ:i:60 PG:Z:BWA UQ:i:33 SAM file format Mapping quality scores They quan7fy the probability that a read is misplaced (P probability that the alignment is wrong). The error probability scaled in the Phred is the mapping quality (Q). The calculaon of mapping quali7es considers all the following factors: • The repeat structure of the reference. Reads falling in repe77ve regions have very low mapping quality. • The base quality of the read. Low quality means the observed read sequence is possibly wrong, and wrong sequence may lead to a wrong alignment. • The sensi7vity of the alignment algorithm. The true hit is more likely to be missed by an algorithm with low sensi7vity, which also causes mapping errors. • Paired end or not. Reads mapped in pairs are more likely to be correct. CIGAR string •M: match/mismatch •I: inser7on •D: dele7on •S: soclip •H: hardclip •P: padding •N: skip SAMtools • Library and souware package that manipulate BAM/SAM files • SAMàBAM conversion (samtools view) • Samtools view –f INT file.bam Only output alignments with all bits in INT present in the FLAG field. • Samtools view –F INT file.bam Skip alignments with bits present in INT [0] • Creates sorted and indexed BAM files from SAM files (samtools sort /samtools index) • Removing PCR duplicates (samtools rmdup) • Merging alignments (samtools merge) • Visualizaon of alignments from BAM files • SNP calling and short indel detecon References: 1. hep://samtools.sourceforge.net/ 2.hp:// samtools.sourceforge.net/samtools.shtml BEDtools The BEDTools u7li7es allow one to address common genomics tasks such as finding feature overlaps and compu7ng coverage. The u7li7es are largely based on four widely-used file formats: BED, GFF/GTF, VCF and SAM/BAM. BED format • slopBed Adjusts each BED entry by a requested number of base pairs. • shuffleBed Randomly permutes the locaons of a BED file among a genome. • intersectBed (BAM) Returns overlaps between two BED/GFF/VCF files. • genomeCoverageBed (BAM) Creates a "per base" report of genome coverage. • subtractBed Removes the por7on of an interval that is overlapped by another feature. • mergeBed Merges overlapping features into a single feature. References: 1 Quinlan AR and Hall IM, 2010.
Recommended publications
  • Gene Discovery and Annotation Using LCM-454 Transcriptome Sequencing Scott J
    Downloaded from genome.cshlp.org on September 23, 2021 - Published by Cold Spring Harbor Laboratory Press Methods Gene discovery and annotation using LCM-454 transcriptome sequencing Scott J. Emrich,1,2,6 W. Brad Barbazuk,3,6 Li Li,4 and Patrick S. Schnable1,4,5,7 1Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, Iowa 50010, USA; 2Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa 50010, USA; 3Donald Danforth Plant Science Center, St. Louis, Missouri 63132, USA; 4Interdepartmental Plant Physiology Graduate Major and Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, Iowa 50010, USA; 5Department of Agronomy and Center for Plant Genomics, Iowa State University, Ames, Iowa 50010, USA 454 DNA sequencing technology achieves significant throughput relative to traditional approaches. More than 261,000 ESTs were generated by 454 Life Sciences from cDNA isolated using laser capture microdissection (LCM) from the developmentally important shoot apical meristem (SAM) of maize (Zea mays L.). This single sequencing run annotated >25,000 maize genomic sequences and also captured ∼400 expressed transcripts for which homologous sequences have not yet been identified in other species. Approximately 70% of the ESTs generated in this study had not been captured during a previous EST project conducted using a cDNA library constructed from hand-dissected apex tissue that is highly enriched for SAMs. In addition, at least 30% of the 454-ESTs do not align to any of the ∼648,000 extant maize ESTs using conservative alignment criteria. These results indicate that the combination of LCM and the deep sequencing possible with 454 technology enriches for SAM transcripts not present in current EST collections.
    [Show full text]
  • File Formats Exercises
    Introduction to high-throughput sequencing file formats – advanced exercises Daniel Vodák (Bioinformatics Core Facility, The Norwegian Radium Hospital) ( [email protected] ) All example files are located in directory “/data/file_formats/data”. You can set your current directory to that path (“cd /data/file_formats/data”) Exercises: a) FASTA format File “Uniprot_Mammalia.fasta” contains a collection of mammalian protein sequences obtained from the UniProt database ( http://www.uniprot.org/uniprot ). 1) Count the number of lines in the file. 2) Count the number of header lines (i.e. the number of sequences) in the file. 3) Count the number of header lines with “Homo” in the organism/species name. 4) Count the number of header lines with “Homo sapiens” in the organism/species name. 5) Display the header lines which have “Homo” in the organism/species names, but only such that do not have “Homo sapiens” there. b) FASTQ format File NKTCL_31T_1M_R1.fq contains a reduced collection of tumor read sequences obtained from The Sequence Read Archive ( http://www.ncbi.nlm.nih.gov/sra ). 1) Count the number of lines in the file. 2) Count the number of lines beginning with the “at” symbol (“@”). 3) How many reads are there in the file? c) SAM format File NKTCL_31T_1M.sam contains alignments of pair-end reads stored in files NKTCL_31T_1M_R1.fq and NKTCL_31T_1M_R2.fq. 1) Which program was used for the alignment? 2) How many header lines are there in the file? 3) How many non-header lines are there in the file? 4) What do the non-header lines represent? 5) Reads from how many template sequences have been used in the alignment process? c) BED format File NKTCL_31T_1M.bed holds information about alignment locations stored in file NKTCL_31T_1M.sam.
    [Show full text]
  • EMBL-EBI Powerpoint Presentation
    Processing data from high-throughput sequencing experiments Simon Anders Use-cases for HTS · de-novo sequencing and assembly of small genomes · transcriptome analysis (RNA-Seq, sRNA-Seq, ...) • identifying transcripted regions • expression profiling · Resequencing to find genetic polymorphisms: • SNPs, micro-indels • CNVs · ChIP-Seq, nucleosome positions, etc. · DNA methylation studies (after bisulfite treatment) · environmental sampling (metagenomics) · reading bar codes Use cases for HTS: Bioinformatics challenges Established procedures may not be suitable. New algorithms are required for · assembly · alignment · statistical tests (counting statistics) · visualization · segmentation · ... Where does Bioconductor come in? Several steps: · Processing of the images and determining of the read sequencest • typically done by core facility with software from the manufacturer of the sequencing machine · Aligning the reads to a reference genome (or assembling the reads into a new genome) • Done with community-developed stand-alone tools. · Downstream statistical analyis. • Write your own scripts with the help of Bioconductor infrastructure. Solexa standard workflow SolexaPipeline · "Firecrest": Identifying clusters ⇨ typically 15..20 mio good clusters per lane · "Bustard": Base calling ⇨ sequence for each cluster, with Phred-like scores · "Eland": Aligning to reference Firecrest output Large tab-separated text files with one row per identified cluster, specifying · lane index and tile index · x and y coordinates of cluster on tile · for each
    [Show full text]
  • An Open-Sourced Bioinformatic Pipeline for the Processing of Next-Generation Sequencing Derived Nucleotide Reads
    bioRxiv preprint doi: https://doi.org/10.1101/2020.04.20.050369; this version posted May 28, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. An open-sourced bioinformatic pipeline for the processing of Next-Generation Sequencing derived nucleotide reads: Identification and authentication of ancient metagenomic DNA Thomas C. Collin1, *, Konstantina Drosou2, 3, Jeremiah Daniel O’Riordan4, Tengiz Meshveliani5, Ron Pinhasi6, and Robin N. M. Feeney1 1School of Medicine, University College Dublin, Ireland 2Division of Cell Matrix Biology Regenerative Medicine, University of Manchester, United Kingdom 3Manchester Institute of Biotechnology, School of Earth and Environmental Sciences, University of Manchester, United Kingdom [email protected] 5Institute of Paleobiology and Paleoanthropology, National Museum of Georgia, Tbilisi, Georgia 6Department of Evolutionary Anthropology, University of Vienna, Austria *Corresponding Author Abstract The emerging field of ancient metagenomics adds to these Bioinformatic pipelines optimised for the processing and as- processing complexities with the need for additional steps sessment of metagenomic ancient DNA (aDNA) are needed in the separation and authentication of ancient sequences from modern sequences. Currently, there are few pipelines for studies that do not make use of high yielding DNA cap- available for the analysis of ancient metagenomic DNA ture techniques. These bioinformatic pipelines are tradition- 1 4 ally optimised for broad aDNA purposes, are contingent on (aDNA) ≠ The limited number of bioinformatic pipelines selection biases and are associated with high costs.
    [Show full text]
  • Sequence Alignment/Map Format Specification
    Sequence Alignment/Map Format Specification The SAM/BAM Format Specification Working Group 3 Jun 2021 The master version of this document can be found at https://github.com/samtools/hts-specs. This printing is version 53752fa from that repository, last modified on the date shown above. 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must be prior to the alignments. Header lines start with `@', while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and variable number of optional fields for flexible or aligner specific information. This specification is for version 1.6 of the SAM and BAM formats. Each SAM and BAMfilemay optionally specify the version being used via the @HD VN tag. For full version history see Appendix B. Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII 1 in using the POSIX / C locale. Regular expressions listed use the POSIX / IEEE Std 1003.1 extended syntax. 1.1 An example Suppose we have the following alignment with bases in lowercase clipped from the alignment. Read r001/1 and r001/2 constitute a read pair; r003 is a chimeric read; r004 represents a split alignment. Coor 12345678901234 5678901234567890123456789012345 ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT +r001/1 TTAGATAAAGGATA*CTG +r002 aaaAGATAA*GGATA +r003 gcctaAGCTAA +r004 ATAGCT..............TCAGC -r003 ttagctTAGGC -r001/2 CAGCGGCAT The corresponding SAM format is:2 1Charset ANSI X3.4-1968 as defined in RFC1345.
    [Show full text]
  • Multi-Scale Analysis and Clustering of Co-Expression Networks
    Multi-scale analysis and clustering of co-expression networks Nuno R. Nen´e∗1 1Department of Genetics, University of Cambridge, Cambridge, UK September 14, 2018 Abstract 1 Introduction The increasing capacity of high-throughput genomic With the advent of technologies allowing the collec- technologies for generating time-course data has tion of time-series in cell biology, the rich structure stimulated a rich debate on the most appropriate of the paths that cells take in expression space be- methods to highlight crucial aspects of data struc- came amenable to processing. Several methodologies ture. In this work, we address the problem of sparse have been crucial to carefully organize the wealth of co-expression network representation of several time- information generated by experiments [1], including course stress responses in Saccharomyces cerevisiae. network-based approaches which constitute an excel- We quantify the information preserved from the origi- lent and flexible option for systems-level understand- nal datasets under a graph-theoretical framework and ing [1,2,3]. The use of graphs for expression anal- evaluate how cross-stress features can be identified. ysis is inherently an attractive proposition for rea- This is performed both from a node and a network sons related to sparsity, which simplifies the cumber- community organization point of view. Cluster anal- some analysis of large datasets, in addition to being ysis, here viewed as a problem of network partition- mathematically convenient (see examples in Fig.1). ing, is achieved under state-of-the-art algorithms re- The properties of co-expression networks might re- lying on the properties of stochastic processes on the veal a myriad of aspects pertaining to the impact of constructed graphs.
    [Show full text]
  • Introduction to Bioinformatics (Elective) – SBB1609
    SCHOOL OF BIO AND CHEMICAL ENGINEERING DEPARTMENT OF BIOTECHNOLOGY Unit 1 – Introduction to Bioinformatics (Elective) – SBB1609 1 I HISTORY OF BIOINFORMATICS Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biologicaldata. As an interdisciplinary field of science, bioinformatics combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques. Bioinformatics derives knowledge from computer analysis of biological data. These can consist of the information stored in the genetic code, but also experimental results from various sources, patient statistics, and scientific literature. Research in bioinformatics includes method development for storage, retrieval, and analysis of the data. Bioinformatics is a rapidly developing branch of biology and is highly interdisciplinary, using techniques and concepts from informatics, statistics, mathematics, chemistry, biochemistry, physics, and linguistics. It has many practical applications in different areas of biology and medicine. Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems. "Classical" bioinformatics: "The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information.” The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as: "Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline.
    [Show full text]
  • Large Scale Genomic Rearrangements in Selected
    Pucker et al. BMC Genomics (2021) 22:599 https://doi.org/10.1186/s12864-021-07877-8 RESEARCH ARTICLE Open Access Large scale genomic rearrangements in selected Arabidopsis thaliana T-DNA lines are caused by T-DNA insertion mutagenesis Boas Pucker1,2† , Nils Kleinbölting3† and Bernd Weisshaar1* Abstract Background: Experimental proof of gene function assignments in plants is based on mutant analyses. T-DNA insertion lines provided an invaluable resource of mutants and enabled systematic reverse genetics-based investigation of the functions of Arabidopsis thaliana genes during the last decades. Results: We sequenced the genomes of 14 A. thaliana GABI-Kat T-DNA insertion lines, which eluded flanking sequence tag-based attempts to characterize their insertion loci, with Oxford Nanopore Technologies (ONT) long reads. Complex T-DNA insertions were resolved and 11 previously unknown T-DNA loci identified, resulting in about 2 T-DNA insertions per line and suggesting that this number was previously underestimated. T-DNA mutagenesis caused fusions of chromosomes along with compensating translocations to keep the gene set complete throughout meiosis. Also, an inverted duplication of 800 kbp was detected. About 10 % of GABI-Kat lines might be affected by chromosomal rearrangements, some of which do not involve T-DNA. Local assembly of selected reads was shown to be a computationally effective method to resolve the structure of T-DNA insertion loci. We developed an automated workflow to support investigation of long read data from T-DNA insertion lines. All steps from DNA extraction to assembly of T-DNA loci can be completed within days. Conclusions: Long read sequencing was demonstrated to be an effective way to resolve complex T-DNA insertions and chromosome fusions.
    [Show full text]
  • L3: Short Read Alignment to a Reference Genome
    L3: Short Read Alignment to a Reference Genome Shamith Samarajiwa CRUK Autumn School in Bioinformatics Cambridge, September 2017 Where to get help! http://seqanswers.com http://www.biostars.org http://www.bioconductor.org/help/mailing-list Read the posting guide before sending email! Overview ● Understand the difference between reference genome builds ● Introduction to Illumina sequencing ● Short read aligners ○ BWA ○ Bowtie ○ STAR ○ Other aligners ● Coverage and Depth ● Mappability ● Use of decoy and sponge databases ● Alignment Quality, SAMStat, Qualimap ● Samtools and Picard tools, ● Visualization of alignment data ● A very brief look at long reads, graph genome aligners and de novo genome assembly Reference Genomes ● A haploid representation of a species genome. ● The human genome is a haploid mosaic derived from 13 volunteer donors from Buffalo, NY. USA. ● In regions where there is known large scale population variation, sets of alternate loci (178 in GRCh38) are assembled alongside the reference locus. ● The current build has around 500 gaps, whereas the first version had ~150,000 gaps. Genome Reference Consortium: https://www.ncbi.nlm.nih.gov/grc £600 30X Sequencers http://omicsmap.com Illumina sequencers Illumina Genome Analyzer Illumina sequencing technology ● Illumina sequencing is based on the Solexa technology developed by Shankar Balasubramanian and David Klenerman (1998) at the University of Cambridge. ● Multiple steps in “Sequencing by synthesis” (explained in next slide) ○ Library Preparation ○ Bridge amplification and Cluster generation ○ Sequencing using reversible terminators ○ Image acquisition and Fastq generation ○ Alignment and data analysis Illumina Flowcell Sequencing By Synthesis technology Illumina Sequencing Bridge Amplification Sequencing Cluster growth Incorporation of fluorescence, reversibly terminated tagged nt Multiplexing • Multiplexing gives the ability to sequence multiple samples at the same time.
    [Show full text]
  • Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd
    European Bioinformatics Institute EMBL-EBI Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by PickeringHutchins Ltd www.pickeringhutchins.com EMBL member states: Austria, Croatia, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Luxembourg, the Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, United Kingdom. Associate member state: Australia EMBL-EBI is a part of the European Molecular Biology Laboratory (EMBL) EMBL-EBI EMBL-EBI EMBL-EBI EMBL-European Bioinformatics Institute Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SD United Kingdom Tel. +44 (0)1223 494 444, Fax +44 (0)1223 494 468 www.ebi.ac.uk EMBL Heidelberg Meyerhofstraße 1 69117 Heidelberg Germany Tel. +49 (0)6221 3870, Fax +49 (0)6221 387 8306 www.embl.org [email protected] EMBL Grenoble 6, rue Jules Horowitz, BP181 38042 Grenoble, Cedex 9 France Tel. +33 (0)476 20 7269, Fax +33 (0)476 20 2199 EMBL Hamburg c/o DESY Notkestraße 85 22603 Hamburg Germany Tel. +49 (0)4089 902 110, Fax +49 (0)4089 902 149 EMBL Monterotondo Adriano Buzzati-Traverso Campus Via Ramarini, 32 00015 Monterotondo (Rome) Italy Tel. +39 (0)6900 91402, Fax +39 (0)6900 91406 © 2012 EMBL-European Bioinformatics Institute All texts written by EBI-EMBL Group and Team Leaders. This publication was produced by the EBI’s Outreach and Training Programme. Contents Introduction Foreword 2 Major Achievements 2011 4 Services Rolf Apweiler and Ewan Birney: Protein and nucleotide data 10 Guy Cochrane: The European Nucleotide Archive 14 Paul Flicek:
    [Show full text]
  • Sequence Data
    Sequence Data SISG 2017 Ben Hayes and Hans Daetwyler Using sequence data in genomic selection • Generation of sequence data (Illumina) • Characteristics of sequence data • Quality control of raw sequence • Alignment to reference genomes • Variant Calling Using sequence data in genomic selection and GWAS • Motivation • Genome wide association study – Straight to causative mutation – Mapping recessives • Genomic selection (all hypotheses!) – No longer have to rely on LD, causative mutation actually in data set • Higher accuracy of prediction? • Not true for genotyping-by-sequencing – Better prediction across breeds? • Assumes same QTL segregating in both breeds • No longer have to rely on SNP-QTL associations holding across breeds – Better persistence of accuracy across generations Using sequence data in genomic selection and GWAS • Motivation • Genotype by sequencing – Low cost genotypes? – Need to understand how these are produced, potential errors/challenges in dealing with this data Technology – NextGenSequence • Over the past few years, the “Next Generation” of sequencing technologies has emerged. Roche: GS-FLX Applied Biosystems: SOLiD 4 Illumina: HiSeq3000, MiSeq Illumina, accessed August 2013 Sequencing Workflow • Extract DNA • Prepare libraries – ‘Cutting-up’ DNA into short fragments • Sequencing by synthesis (HiSeq, MiSeq, NextSeq, …) • Base calling from image data -> fastq • Alignment to reference genome -> bam – Or de-novo assembly • Variant calling -> vcf ‘Raw’ sequence to genotypes FASTQ File • Unfiltered • Quality metrics
    [Show full text]
  • Efficient Synchronization of Paired-End Fastq Files
    bioRxiv preprint doi: https://doi.org/10.1101/552885; this version posted February 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Fastq-pair: efficient synchronization of paired-end fastq files. 1 2,* John A. Edwards ​ and Robert A. Edwards ​ ​ 1 Sigma​ Numerix Ltd. Coalville, Leicestershire, England 2 San​ Diego State University, 5500 Campanile Drive, San Diego, CA 92182 * Corresponding​ author Dr. Robert Edwards Department of Computer Science San Diego State University 5500 Campanile Dr San Diego, CA 92182 Email: [email protected] ​ Tel: 619 594 1672 Keywords: fastq, next generation sequencing bioRxiv preprint doi: https://doi.org/10.1101/552885; this version posted February 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Abstract Paired end DNA sequencing provides additional information about the sequence data that is used in sequence assembly, mapping, and other downstream bioinformatics analysis. Paired end reads are usually provided as two fastq-format files, with each file representing one end of the read. Many commonly used downstream tools require that the sequence reads appear in each file in the same order, and reads that do not have a pair in the corresponding file are placed in a separate file of singletons.
    [Show full text]