Comparison of DNA Sequence Assembly Algorithms Using Mixed Data Sources

A Thesis Submitted to the College of Graduate Studies and Research in Partial Fulfillment of the Requirements for the degree of Master of Science in the Department of Computer Science, University of Saskatchewan, Saskatoon

By Tejumoluwa Abegunde

© Tejumoluwa Abegunde, April 2010. All rights reserved.

Permission to Use

In presenting this thesis in partial fulfilment of the requirements for a Postgraduate degree from the University of Saskatchewan, I agree that the Libraries of this University may make it freely available for inspection. I further agree that permission for copying of this thesis in any manner, in whole or in part, for scholarly purposes may be granted by the professor or professors who supervised my thesis work or, in their absence, by the Head of the Department or the Dean of the College in which my thesis work was done. It is understood that any copying or publication or use of this thesis or parts thereof for financial gain shall not be allowed without my written permission. It is also understood that due recognition shall be given to me and to the University of Saskatchewan in any scholarly use which may be made of any material in my thesis.

Requests for permission to copy or to make other use of material in this thesis in whole or part should be addressed to:

Head of the Department of Computer Science
176 Thorvaldson Building
110 Science Place
University of Saskatchewan
Saskatoon, Saskatchewan
Canada S7N 5C9

Abstract

DNA sequence assembly is one of the fundamental areas of bioinformatics. It involves the correct formation of a genome sequence from its DNA fragments ("reads") by aligning and merging the fragments. There are different sequencing technologies: some support long DNA reads and others support shorter DNA reads.
There are sequence assembly programs specifically designed for these different types of raw sequencing data. This work explores and experiments with these different types of assembly software in order to compare their performance on the type of data for which they were designed, as well as their performance on data for which they were not designed, and on mixed data. Such results are useful for establishing good procedures and tools for sequence assembly in the current genomic environment where read data of different lengths are available. This work also investigates the effect of the presence or absence of quality information on the results produced by sequence assemblers.

Five strategies were used in this research for assembling mixed data sets, and the testing was done using a collection of real and artificial data sets for six bacterial organisms. The results show that there is a broad range in the ability of some DNA sequence assemblers to handle data from various sequencing technologies, especially data other than the kind they were designed for. For example, the long-read assemblers PHRAP and MIRA produced good results from assembling 454 data. The results also show the importance of having an effective methodology for assembling mixed data sets. It was found that combining contiguous sequences obtained from short-read assemblers with long DNA reads, and then assembling this combination using long-read assemblers, was the most appropriate approach for assembling mixed short and long reads. It was found that the results from assembling the mixed data sets were better than the results obtained from separately assembling individual data from the different sequencing technologies.

DNA sequence assemblers which do not depend on the availability of quality information were used to test the effect of the presence of quality values when assembling data.
The results show that regardless of the availability of quality information, good results were produced in most of the assemblies. In more general terms, this work shows that the approach or methodology used to assemble DNA sequences from mixed data sources strongly affects the results obtained, and that a good choice of methodology can help reduce the amount of effort spent on a DNA sequence assembly project.

Acknowledgements

I would like to formally thank:

My supervisor, Dr. Anthony Kusalik, for providing me with this opportunity, and for his hard work, patient encouragement, and guidance throughout my studies.

My committee members, Dr. Ian Mcquillan and Dr. Barry Ziola, for their guidance and support, and Dr. Andrew Sharpe for serving as my external examiner.

My fellow lab colleagues, for their friendship and support. Good luck to each of you in your future aspirations.

My parents, for their unending love and support in all my efforts and aspirations. Also to my siblings, David and Dammy, for their love and support.

My best friend, Austin Ogun, for your love and support. Also to Toyin Ake-Johnson for always ensuring I smile even when it seemed tough.

My friends that supported me during the course of my studies. I cannot mention all, but I appreciate you all.

I humbly dedicate this thesis to God for grace, strength, and guidance.

Contents

Permission to Use
Abstract
Acknowledgements
Contents
List of Tables
List of Figures
List of Abbreviations
1 Introduction
  1.1 Thesis Organization
2 Background Information
  2.1 DNA Sequencing
  2.2 Sequence Assembly
  2.3 Sequence Assemblers
    2.3.1 Long-read assemblers
    2.3.2 Short-read assemblers
  2.4 Finishing Phase
  2.5 Objectives of the research
3 Data and Methodology
  3.1 Data
    3.1.1 Real Sequencing Data
    3.1.2 Artificial Sequencing Data
    3.1.3 Summary of Data Sets
  3.2 Methodology
    3.2.1 Accuracy of results
    3.2.2 Execution time
    3.2.3 Memory usage
    3.2.4 System Dependencies
    3.2.5 Restrictions and constraints
  3.3 Effect of Quality values on Accuracy of Contigs
  3.4 Statistical Analysis
  3.5 Computer Resources
4 Results
  4.1 Assembling short reads using long-read assemblers
  4.2 Assembling long reads on short-read assemblers
  4.3 Assembling mixed data sets
    4.3.1 Assembling 454 reads merged with Sanger reads
    4.3.2 Assembling Illumina reads merged with Sanger reads
    4.3.3 Assembling Illumina reads merged with 454 reads
    4.3.4 Assembling Illumina contigs merged with 454 contigs
    4.3.5 Assembling 454 contigs merged with Sanger reads
    4.3.6 Assembling Illumina contigs merged with Sanger reads
    4.3.7 Assembling merged 454, Illumina and Sanger reads
  4.4 The effect of quality values when assembling DNA reads
    4.4.1 Statistical results for Sanger data with and without quality data
    4.4.2 Statistical results for Illumina data with and without quality data
    4.4.3 Statistical results for 454 data with and without quality data
5 Discussion
  5.1 Conclusions and Recommendations
  5.2 Related Work
  5.3 Future Work
References
A Tables of Data sets
B Tables of Results
C Graphs
D Graphs
E Statistical Results

List of Tables

3.1 Characteristics of the datasets of real Illumina data.
3.2 Characteristics of the datasets of real Sanger data.
3.3 Characteristics of the datasets of real 454 data.
3.4 Characteristics of the genome sequences for the organisms.
3.5 Characteristics of the datasets of artificial Sanger data.
3.6 Characteristics of the datasets of artificial 454 data.
3.7 Characteristics of the datasets of artificial Illumina data.
3.8 Summary of all the data sets used in this work.
3.9 An example of the SPSS output for the independent samples test.
3.10 An example of one-way ANOVA output from SPSS when comparing the means for genome coverage between five assemblers.
3.11 An example of Games-Howell post-hoc output from SPSS when comparing the means for genome coverage between five assemblers (one-way ANOVA test). The results are an extract from Table E.17 of Appendix E.
4.1 Results to show which long-read assemblers can handle short reads. An "×" indicates that the assembler was not able to successfully work with this type of data, while a check mark indicates that it could.
4.2 Results from assembling 454 data using short- and long-read assemblers. The results from the long-read assemblers are shown in bold font.
4.3 Results from assembling Illumina data using short- and long-read assemblers. The results from the long-read assemblers are in bold font.
4.4 Results to show which short-read assemblers can handle long reads. A "✓" indicates that the assembler successfully works, while an "×" indicates otherwise.
4.5 Results for running short- and long-read assemblers with Sanger data. The results from the short-read assemblers are in bold.
4.6 Results for assembling 454 reads merged with Sanger reads.
4.7 Results for assembling Illumina reads merged with Sanger reads.
4.8 Results for assembling Illumina reads merged with 454 reads.
4.9 Results for assembling Illumina contigs merged with 454 contigs using PHRAP.
4.10 Results for assembling merged 454, Illumina and Sanger reads.
4.11 Results for assembling real Sanger reads to test the effect of quality values.
4.12 Results for assembling real Illumina reads to test the effect of quality values.
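
The abstract reports that the most appropriate approach for mixed short and long reads was to combine contigs from a short-read assembler with long DNA reads and then assemble the combination with a long-read assembler. The data-preparation step of that approach might look roughly like the Python sketch below. It is an illustration only and not the procedure used in the thesis: the function names, the `contig_` and `read_` prefixes, and the use of plain FASTA without quality values are all assumptions made here, and the call to an actual long-read assembler (such as PHRAP or MIRA) is deliberately left out.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (identifier, sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first token of the header as the identifier.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def merge_for_long_read_assembly(
    contigs: Iterable[Record],
    long_reads: Iterable[Record],
    contig_prefix: str = "contig_",  # hypothetical naming scheme
    read_prefix: str = "read_",      # hypothetical naming scheme
) -> List[Record]:
    """Combine short-read contigs and long reads into one record list.

    Identifiers are prefixed so that a contig and a read sharing the
    same name cannot collide in the combined input.
    """
    merged = [(contig_prefix + name, seq) for name, seq in contigs]
    merged += [(read_prefix + name, seq) for name, seq in long_reads]
    return merged


def format_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Render records as FASTA text with sequence lines of at most `width`."""
    out: List[str] = []
    for name, seq in records:
        out.append(">" + name)
        for start in range(0, len(seq), width):
            out.append(seq[start:start + width])
    return "".join(line + "\n" for line in out)
```

The combined FASTA text returned by `format_fasta` would then be written to a file and given to the chosen long-read assembler as its input. Treating the short-read contigs as if they were additional long reads is the design idea the abstract describes; how quality values for such contigs should be handled is not stated in this section and is not addressed by the sketch.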